I've got a file with hundreds of JSON lines in it. I wrote a little Python script that lets me extract some data, but it only works for one line. I'm now wondering how I can loop through all the lines in my file when there are several. What I have so far is:

import json
from pprint import pprint

"""with open('1st_run_fixed.json') as f:"""
with open('fixed.json') as f:
    data = json.load(f)

    print "--------------------------------------------";
    """get number of characters"""
    nchar = data["frames"]["frame"]["lps"]["lp"]["ncharacter"];
    print "Got "+nchar+" characters";
    for x in range (1,int(nchar)+1):
        x = str(x);
        print data["frames"]["frame"]["lps"]["lp"]["characters"]["char"+x]["code_ascii"]+"    "+data["frames"]["frame"]["lps"]["lp"]["characters"]["char"+x]["confidence"];
    print "--------------------------------------------";

which works for data like:

{"response":{"container":{"id":"41d6efcb-24d6-490d-8880-762255519b5f","timestamp":"2018-Jul-11 19:51:06.461665"},
"id":"00000002-0000-0000-0000-000000000015"},
"frames":{"frame":{"id":"5583","timestamp":"2016-Nov-30 13:05:27","lps":{"lp":{"licenseplate":"15451BBL","text":"15451BBL","wtext":"15451BBL","confidence":"20","bkcolor":"16777215","color":"16777215","type":"0","ntip":"11","cct_country_short":"","cct_state_short":"","tips":{"tip":{"poly":{"p":{"x":"1094","y":"643"},
"p":{"x":"1099","y":"643"},
"p":{"x":"1099","y":"667"},
"p":{"x":"1094","y":"667"}},
"bkcolor":"16777215","color":"0","code":"49","code_ascii":"1","confidence":"97"},
"tip":{"poly":{"p":{"x":"1103","y":"642"},
"p":{"x":"1113","y":"642"},
"p":{"x":"1112","y":"667"},
"p":{"x":"1102","y":"667"}},
"bkcolor":"16777215","color":"0","code":"53","code_ascii":"5","confidence":"89"},
"tip":{"poly":{"p":{"x":"1112","y":"640"},
"p":{"x":"1122","y":"640"},
"p":{"x":"1122","y":"666"},
"p":{"x":"1112","y":"666"}},
"bkcolor":"16777215","color":"0","code":"52","code_ascii":"4","confidence":"97"},
"tip":{"poly":{"p":{"x":"1123","y":"640"},
"p":{"x":"1132","y":"640"},
"p":{"x":"1131","y":"665"},
"p":{"x":"1123","y":"665"}},
"bkcolor":"16777215","color":"0","code":"53","code_ascii":"5","confidence":"97"},
"tip":{"poly":{"p":{"x":"1134","y":"640"},
"p":{"x":"1139","y":"640"},
"p":{"x":"1139","y":"664"},
"p":{"x":"1133","y":"664"}},
"bkcolor":"16777215","color":"0","code":"49","code_ascii":"1","confidence":"77"},
"tip":{"poly":{"p":{"x":"1154","y":"639"},
"p":{"x":"1163","y":"639"},
"p":{"x":"1163","y":"663"},
"p":{"x":"1153","y":"663"}},
"bkcolor":"16777215","color":"0","code":"66","code_ascii":"B","confidence":"97"},
"tip":{"poly":{"p":{"x":"1164","y":"638"},
"p":{"x":"1173","y":"638"},
"p":{"x":"1173","y":"663"},
"p":{"x":"1163","y":"663"}},
"bkcolor":"16777215","color":"0","code":"66","code_ascii":"B","confidence":"94"},
"tip":{"poly":{"p":{"x":"1191","y":"637"},
"p":{"x":"1206","y":"636"},
"p":{"x":"1205","y":"660"},
"p":{"x":"1190","y":"661"}},
"bkcolor":"16777215","color":"0","code":"76","code_ascii":"L","confidence":"34"},
"tip":{"poly":{"p":{"x":"1103","y":"655"},
"p":{"x":"1111","y":"655"},
"p":{"x":"1111","y":"667"},
"p":{"x":"1103","y":"667"}},
"bkcolor":"16777215","color":"0","code":"74","code_ascii":"J","confidence":"57"},
"tip":{"poly":{"p":{"x":"1103","y":"655"},
"p":{"x":"1111","y":"655"},
"p":{"x":"1111","y":"667"},
"p":{"x":"1103","y":"667"}},
"bkcolor":"16777215","color":"0","code":"74","code_ascii":"J","confidence":"57"},
"tip":{"poly":{"p":{"x":"1176","y":"638"},
"p":{"x":"1185","y":"637"},
"p":{"x":"1184","y":"661"},
"p":{"x":"1175","y":"662"}},
"bkcolor":"16777215","color":"0","code":"52","code_ascii":"4","confidence":"7"}},
"ncharacter":"8","characters":{"char1":{"poly":{"p":{"x":"1094","y":"643"},
"p":{"x":"1099","y":"643"},
"p":{"x":"1099","y":"667"},
"p":{"x":"1094","y":"667"}},
"bkcolor":"16777215","color":"0","code":"49","code_ascii":"1","confidence":"97"},
"char2":{"poly":{"p":{"x":"1103","y":"642"},
"p":{"x":"1113","y":"642"},
"p":{"x":"1112","y":"667"},
"p":{"x":"1102","y":"667"}},
"bkcolor":"16777215","color":"0","code":"53","code_ascii":"5","confidence":"89"},
"char3":{"poly":{"p":{"x":"1112","y":"640"},
"p":{"x":"1122","y":"640"},
"p":{"x":"1122","y":"666"},
"p":{"x":"1112","y":"666"}},
"bkcolor":"16777215","color":"0","code":"52","code_ascii":"4","confidence":"97"},
"char4":{"poly":{"p":{"x":"1123","y":"640"},
"p":{"x":"1132","y":"640"},
"p":{"x":"1131","y":"665"},
"p":{"x":"1123","y":"665"}},
"bkcolor":"16777215","color":"0","code":"53","code_ascii":"5","confidence":"97"},
"char5":{"poly":{"p":{"x":"1134","y":"640"},
"p":{"x":"1139","y":"640"},
"p":{"x":"1139","y":"664"},
"p":{"x":"1133","y":"664"}},
"bkcolor":"16777215","color":"0","code":"49","code_ascii":"1","confidence":"77"},
"char6":{"poly":{"p":{"x":"1154","y":"639"},
"p":{"x":"1163","y":"639"},
"p":{"x":"1163","y":"663"},
"p":{"x":"1153","y":"663"}},
"bkcolor":"16777215","color":"0","code":"66","code_ascii":"B","confidence":"97"},
"char7":{"poly":{"p":{"x":"1164","y":"638"},
"p":{"x":"1173","y":"638"},
"p":{"x":"1173","y":"663"},
"p":{"x":"1163","y":"663"}},
"bkcolor":"16777215","color":"0","code":"66","code_ascii":"B","confidence":"94"},
"char8":{"poly":{"p":{"x":"1191","y":"637"},
"p":{"x":"1206","y":"636"},
"p":{"x":"1205","y":"660"},
"p":{"x":"1190","y":"661"}},
"bkcolor":"16777215","color":"0","code":"76","code_ascii":"L","confidence":"34"}},
"det_time_us":"1072592","poly":{"p":{"x":"1088","y":"642"},
"p":{"x":"1210","y":"634"},
"p":{"x":"1210","y":"661"},
"p":{"x":"1087","y":"669"}}}},
"det_time_us":"1720812"}}}

but I would like to also make it work for data like:

{"response":{"container":{"id":"80d996a1-c267-4fa4-b3f8-f61ff9fda198","timestamp":"2018-Jul-10 17:00:50.829709"},
"id":"00000002-0000-0000-0000-000000000002"},
"frames":{"frame":{"id":"398","timestamp":"2016-Nov-30 12:56:47.900000","lps":{"lp":{"licenseplate":"FRJ724","text":"FRJ724","wtext":"FRJ724","confidence":"67","bkcolor":"16777215","color":"16777215","type":"540122","ntip":"6","cct_country_short":"USA","cct_state_short":"NY","tips":{"tip":{"poly":{"p":{"x":"1553","y":"249"},
"p":{"x":"1559","y":"249"},
"p":{"x":"1559","y":"267"},
"p":{"x":"1553","y":"267"}},
"bkcolor":"16777215","color":"0","code":"70","code_ascii":"F","confidence":"88"},
"tip":{"poly":{"p":{"x":"1561","y":"248"},
"p":{"x":"1568","y":"248"},
"p":{"x":"1568","y":"267"},
"p":{"x":"1561","y":"267"}},
"bkcolor":"16777215","color":"0","code":"82","code_ascii":"R","confidence":"96"},
"tip":{"poly":{"p":{"x":"1569","y":"248"},
"p":{"x":"1575","y":"248"},
"p":{"x":"1576","y":"267"},
"p":{"x":"1569","y":"267"}},
"bkcolor":"16777215","color":"0","code":"74","code_ascii":"J","confidence":"96"},
"tip":{"poly":{"p":{"x":"1585","y":"248"},
"p":{"x":"1591","y":"248"},
"p":{"x":"1591","y":"267"},
"p":{"x":"1585","y":"267"}},
"bkcolor":"16777215","color":"0","code":"55","code_ascii":"7","confidence":"94"},
"tip":{"poly":{"p":{"x":"1593","y":"248"},
"p":{"x":"1600","y":"248"},
"p":{"x":"1600","y":"267"},
"p":{"x":"1593","y":"267"}},
"bkcolor":"16777215","color":"0","code":"50","code_ascii":"2","confidence":"88"},
"tip":{"poly":{"p":{"x":"1602","y":"248"},
"p":{"x":"1607","y":"248"},
"p":{"x":"1607","y":"266"},
"p":{"x":"1602","y":"266"}},
"bkcolor":"16777215","color":"0","code":"52","code_ascii":"4","confidence":"99"}},
"ncharacter":"6","characters":{"char1":{"poly":{"p":{"x":"1553","y":"249"},
"p":{"x":"1559","y":"249"},
"p":{"x":"1559","y":"267"},
"p":{"x":"1553","y":"267"}},
"bkcolor":"16777215","color":"0","code":"70","code_ascii":"F","confidence":"88"},
"char2":{"poly":{"p":{"x":"1561","y":"248"},
"p":{"x":"1568","y":"248"},
"p":{"x":"1568","y":"267"},
"p":{"x":"1561","y":"267"}},
"bkcolor":"16777215","color":"0","code":"82","code_ascii":"R","confidence":"96"},
"char3":{"poly":{"p":{"x":"1569","y":"248"},
"p":{"x":"1575","y":"248"},
"p":{"x":"1576","y":"267"},
"p":{"x":"1569","y":"267"}},
"bkcolor":"16777215","color":"0","code":"74","code_ascii":"J","confidence":"96"},
"char4":{"poly":{"p":{"x":"1585","y":"248"},
"p":{"x":"1591","y":"248"},
"p":{"x":"1591","y":"267"},
"p":{"x":"1585","y":"267"}},
"bkcolor":"16777215","color":"0","code":"55","code_ascii":"7","confidence":"94"},
"char5":{"poly":{"p":{"x":"1593","y":"248"},
"p":{"x":"1600","y":"248"},
"p":{"x":"1600","y":"267"},
"p":{"x":"1593","y":"267"}},
"bkcolor":"16777215","color":"0","code":"50","code_ascii":"2","confidence":"88"},
"char6":{"poly":{"p":{"x":"1602","y":"248"},
"p":{"x":"1607","y":"248"},
"p":{"x":"1607","y":"266"},
"p":{"x":"1602","y":"266"}},
"bkcolor":"16777215","color":"0","code":"52","code_ascii":"4","confidence":"99"}},
"det_time_us":"776874","poly":{"p":{"x":"1543","y":"237"},
"p":{"x":"1618","y":"237"},
"p":{"x":"1618","y":"274"},
"p":{"x":"1543","y":"274"}}}},
"det_time_us":"1883017"}}}
{"response":{"container":{"id":"fa75e8f8-1b44-4f2f-a09b-6fe3b801ca1b","timestamp":"2018-Jul-10 17:00:55.863641"},
"id":"00000002-0000-0000-0000-000000000002"},
"frames":{"frame":{"id":"399","timestamp":"2016-Nov-30 12:56:48","lps":{"lp":{"licenseplate":"FRJ724","text":"FRJ724","wtext":"FRJ724","confidence":"47","bkcolor":"16777215","color":"16777215","type":"540122","ntip":"6","cct_country_short":"USA","cct_state_short":"NY","tips":{"tip":{"poly":{"p":{"x":"1553","y":"248"},
"p":{"x":"1560","y":"248"},
"p":{"x":"1560","y":"266"},
"p":{"x":"1554","y":"266"}},
"bkcolor":"16777215","color":"0","code":"70","code_ascii":"F","confidence":"96"},
"tip":{"poly":{"p":{"x":"1561","y":"248"},
"p":{"x":"1568","y":"248"},
"p":{"x":"1568","y":"267"},
"p":{"x":"1561","y":"267"}},
"bkcolor":"16777215","color":"0","code":"82","code_ascii":"R","confidence":"98"},
"tip":{"poly":{"p":{"x":"1569","y":"247"},
"p":{"x":"1576","y":"247"},
"p":{"x":"1576","y":"267"},
"p":{"x":"1569","y":"267"}},
"bkcolor":"16777215","color":"0","code":"74","code_ascii":"J","confidence":"96"},
"tip":{"poly":{"p":{"x":"1586","y":"248"},
"p":{"x":"1592","y":"248"},
"p":{"x":"1592","y":"267"},
"p":{"x":"1586","y":"267"}},
"bkcolor":"16777215","color":"0","code":"55","code_ascii":"7","confidence":"95"},
"tip":{"poly":{"p":{"x":"1593","y":"248"},
"p":{"x":"1600","y":"248"},
"p":{"x":"1600","y":"267"},
"p":{"x":"1593","y":"267"}},
"bkcolor":"16777215","color":"0","code":"50","code_ascii":"2","confidence":"86"},
"tip":{"poly":{"p":{"x":"1601","y":"249"},
"p":{"x":"1608","y":"249"},
"p":{"x":"1608","y":"265"},
"p":{"x":"1601","y":"265"}},
"bkcolor":"16777215","color":"0","code":"52","code_ascii":"4","confidence":"63"}},
"ncharacter":"6","characters":{"char7":{"poly":{"p":{"x":"1553","y":"248"},
"p":{"x":"1560","y":"248"},
"p":{"x":"1560","y":"266"},
"p":{"x":"1554","y":"266"}},
"bkcolor":"16777215","color":"0","code":"70","code_ascii":"F","confidence":"96"},
"char8":{"poly":{"p":{"x":"1561","y":"248"},
"p":{"x":"1568","y":"248"},
"p":{"x":"1568","y":"267"},
"p":{"x":"1561","y":"267"}},
"bkcolor":"16777215","color":"0","code":"82","code_ascii":"R","confidence":"98"},
"char9":{"poly":{"p":{"x":"1569","y":"247"},
"p":{"x":"1576","y":"247"},
"p":{"x":"1576","y":"267"},
"p":{"x":"1569","y":"267"}},
"bkcolor":"16777215","color":"0","code":"74","code_ascii":"J","confidence":"96"},
"char10":{"poly":{"p":{"x":"1586","y":"248"},
"p":{"x":"1592","y":"248"},
"p":{"x":"1592","y":"267"},
"p":{"x":"1586","y":"267"}},
"bkcolor":"16777215","color":"0","code":"55","code_ascii":"7","confidence":"95"},
"char11":{"poly":{"p":{"x":"1593","y":"248"},
"p":{"x":"1600","y":"248"},
"p":{"x":"1600","y":"267"},
"p":{"x":"1593","y":"267"}},
"bkcolor":"16777215","color":"0","code":"50","code_ascii":"2","confidence":"86"},
"char12":{"poly":{"p":{"x":"1601","y":"249"},
"p":{"x":"1608","y":"249"},
"p":{"x":"1608","y":"265"},
"p":{"x":"1601","y":"265"}},
"bkcolor":"16777215","color":"0","code":"52","code_ascii":"4","confidence":"63"}},
"det_time_us":"600136","poly":{"p":{"x":"1543","y":"238"},
"p":{"x":"1618","y":"239"},
"p":{"x":"1619","y":"274"},
"p":{"x":"1543","y":"273"}}}},
"det_time_us":"1495308"}}}
{"response":{"container":{"id":"5c9c773c-a72a-488f-bc49-148dcd6cfa0a","timestamp":"2018-Jul-10 17:01:01.756522"},
"id":"00000002-0000-0000-0000-000000000002"},
"frames":{"frame":{"id":"400","timestamp":"2016-Nov-30 12:56:48.100000","lps":{"lp":{"licenseplate":"FRJ724","text":"FRJ724","wtext":"FRJ724","confidence":"47","bkcolor":"16777215","color":"16777215","type":"540122","ntip":"6","cct_country_short":"USA","cct_state_short":"NY","tips":{"tip":{"poly":{"p":{"x":"1553","y":"248"},
"p":{"x":"1560","y":"248"},
"p":{"x":"1560","y":"266"},
"p":{"x":"1554","y":"266"}},
"bkcolor":"16777215","color":"0","code":"70","code_ascii":"F","confidence":"96"},
"tip":{"poly":{"p":{"x":"1561","y":"248"},
"p":{"x":"1568","y":"248"},
"p":{"x":"1568","y":"267"},
"p":{"x":"1561","y":"267"}},
"bkcolor":"16777215","color":"0","code":"82","code_ascii":"R","confidence":"98"},
"tip":{"poly":{"p":{"x":"1569","y":"247"},
"p":{"x":"1576","y":"247"},
"p":{"x":"1576","y":"267"},
"p":{"x":"1569","y":"267"}},
"bkcolor":"16777215","color":"0","code":"74","code_ascii":"J","confidence":"96"},
"tip":{"poly":{"p":{"x":"1586","y":"248"},
"p":{"x":"1592","y":"248"},
"p":{"x":"1592","y":"267"},
"p":{"x":"1586","y":"267"}},
"bkcolor":"16777215","color":"0","code":"55","code_ascii":"7","confidence":"95"},
"tip":{"poly":{"p":{"x":"1593","y":"248"},
"p":{"x":"1600","y":"248"},
"p":{"x":"1600","y":"267"},
"p":{"x":"1593","y":"267"}},
"bkcolor":"16777215","color":"0","code":"50","code_ascii":"2","confidence":"86"},
"tip":{"poly":{"p":{"x":"1601","y":"249"},
"p":{"x":"1608","y":"249"},
"p":{"x":"1608","y":"265"},
"p":{"x":"1601","y":"265"}},
"bkcolor":"16777215","color":"0","code":"52","code_ascii":"4","confidence":"63"}},
"ncharacter":"6","characters":{"char13":{"poly":{"p":{"x":"1553","y":"248"},
"p":{"x":"1560","y":"248"},
"p":{"x":"1560","y":"266"},
"p":{"x":"1554","y":"266"}},
"bkcolor":"16777215","color":"0","code":"70","code_ascii":"F","confidence":"96"},
"char14":{"poly":{"p":{"x":"1561","y":"248"},
"p":{"x":"1568","y":"248"},
"p":{"x":"1568","y":"267"},
"p":{"x":"1561","y":"267"}},
"bkcolor":"16777215","color":"0","code":"82","code_ascii":"R","confidence":"98"},
"char15":{"poly":{"p":{"x":"1569","y":"247"},
"p":{"x":"1576","y":"247"},
"p":{"x":"1576","y":"267"},
"p":{"x":"1569","y":"267"}},
"bkcolor":"16777215","color":"0","code":"74","code_ascii":"J","confidence":"96"},
"char16":{"poly":{"p":{"x":"1586","y":"248"},
"p":{"x":"1592","y":"248"},
"p":{"x":"1592","y":"267"},
"p":{"x":"1586","y":"267"}},
"bkcolor":"16777215","color":"0","code":"55","code_ascii":"7","confidence":"95"},
"char17":{"poly":{"p":{"x":"1593","y":"248"},
"p":{"x":"1600","y":"248"},
"p":{"x":"1600","y":"267"},
"p":{"x":"1593","y":"267"}},
"bkcolor":"16777215","color":"0","code":"50","code_ascii":"2","confidence":"86"},
"char18":{"poly":{"p":{"x":"1601","y":"249"},
"p":{"x":"1608","y":"249"},
"p":{"x":"1608","y":"265"},
"p":{"x":"1601","y":"265"}},
"bkcolor":"16777215","color":"0","code":"52","code_ascii":"4","confidence":"63"}},
"det_time_us":"457492","poly":{"p":{"x":"1543","y":"238"},
"p":{"x":"1618","y":"239"},
"p":{"x":"1619","y":"274"},
"p":{"x":"1543","y":"273"}}}},
"det_time_us":"1311946"}}}

How can I get this done?

My script currently returns:

Traceback (most recent call last):
  File "read.py", line 8, in <module>
    data = json.load(f)
  File "/usr/lib/python2.7/json/__init__.py", line 291, in load
    **kw)
  File "/usr/lib/python2.7/json/__init__.py", line 339, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python2.7/json/decoder.py", line 367, in decode
    raise ValueError(errmsg("Extra data", s, end, len(s)))
ValueError: Extra data: line 68 column 1 - line 202 column 1 (char 3182 - 9548)

shell returned 1

That's what I get when I run it on the large file.

1 Answer

"I've got a file with hundreds of JSON lines in it."

No you don't, and that's the problem.


Hundreds of JSON texts in one file do not make a valid JSON file. A valid JSON file contains exactly one text, which is why json.load raises "ValueError: Extra data" as soon as it finds something after the first one.
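To see the failure in isolation, here is a minimal sketch. It is written for Python 3 (the question uses Python 2.7) and uses a made-up two-object string rather than the file from the question:

```python
import json

# Two complete JSON texts back to back: hypothetical sample data,
# standing in for the multi-object file from the question.
two_texts = '{"a": 1}\n{"b": 2}\n'

try:
    json.loads(two_texts)
    message = "parsed without error"
except ValueError as exc:  # json.JSONDecodeError is a subclass of ValueError
    message = str(exc)

print(message)  # the message begins with "Extra data"
```

The parser reads the first object fine; the error only complains about what follows it.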


Hundreds of JSON texts that each fit on exactly one line, with newlines in between them, make a valid file in other formats like JSONlines or NDJSON. It's still not a valid JSON file, so you can't use json.load, but you could use a JSONlines or NDJSON library, or just parse it like this:

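import json

with open('fixed.json') as f:
    for line in f:
        if not line.strip():
            continue  # skip blank lines instead of failing on them
        data = json.loads(line)  # each line is one complete JSON text
        # do stuff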

For writing a JSONlines file, again, you can use a JSONlines library, or you can just make sure that each JSON text has no embedded newlines (which is what you get as long as you leave ensure_ascii and indent at their defaults) and write out json.dumps(data) + "\n" for each value.
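A minimal sketch of that write-then-read round trip, for Python 3. The records are made up for illustration, and an in-memory io.StringIO stands in for a real file:

```python
import io
import json

# Hypothetical records; the second has a newline inside a string value.
records = [
    {"plate": "FRJ724", "confidence": 67},
    {"note": "line one\nline two"},
]

buf = io.StringIO()  # stands in for a file opened with open(path, "w")
for record in records:
    # With default parameters, dumps() emits a single line; the newline
    # inside the string value is written as an escape sequence.
    buf.write(json.dumps(record) + "\n")

# Reading it back: one JSON text per line.
lines = buf.getvalue().splitlines()
round_tripped = [json.loads(line) for line in lines]
print(len(lines), round_tripped == records)
```

Even the record whose string contains a newline ends up on a single line, because the newline is escaped inside the JSON string.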


But hundreds of JSON texts that each take up multiple lines aren't a valid file in any format.

This is actually explained in the json module docs:

Note: Unlike pickle and marshal, JSON is not a framed protocol, so trying to serialize multiple objects with repeated calls to dump() using the same fp will result in an invalid JSON file.

What "not a framed protocol" means is basically that the format would be ambiguous. For example, if you did a json.dump(2, f), and then a json.dump(3, f), what you'd get in your file is 23. Which is the same thing you get from json.dump(23, f).
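The ambiguity is easy to reproduce. This is a small Python 3 sketch that uses in-memory buffers in place of a real file:

```python
import io
import json

# Two separate dumps into the same stream...
buf_two_dumps = io.StringIO()
json.dump(2, buf_two_dumps)
json.dump(3, buf_two_dumps)

# ...versus a single dump of a different value.
buf_one_dump = io.StringIO()
json.dump(23, buf_one_dump)

# Both buffers hold exactly the same characters, so a reader cannot
# tell "2 followed by 3" apart from "23".
print(buf_two_dumps.getvalue() == buf_one_dump.getvalue())
```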


If you can fix your file to be something valid, like JSONlines, that's the easy solution.


If you can't…

Well, pre-standardization, there was a concept of "JSON document", which meant basically a JSON text that's either an Array or Object. And a stream of JSON documents is not ambiguous.

Since this isn't a standard format, you're probably not going to find a parser for it, so you'll have to write one yourself.

One way you can do that is by using the raw_decode method in the json module. This will try to decode a JSON text, possibly with extra stuff after it, and also return the index to that extra stuff. Which, in your case, is the next JSON document.
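As a quick illustration of what raw_decode hands back, here is a sketch on a made-up two-document string (Python 3):

```python
import json

decoder = json.JSONDecoder()
text = '{"id": 1}\n{"id": 2}'  # hypothetical two-document stream

# raw_decode returns the decoded value plus the index just past it.
first, end = decoder.raw_decode(text)

# Everything from `end` onward is still unparsed text.
rest = text[end:]
print(first, end, repr(rest))
```

Note that raw_decode does not skip leading whitespace, so the leftover text has to be stripped (or the index advanced past the whitespace) before the next call.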

Since hundreds of objects of that size isn't too big, it's probably simpler to just read the whole file into memory and then parse it, so we don't have to worry about buffering:

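import json

with open('fixed.json') as f:
    # raw_decode does not skip leading whitespace, so strip it up front
    contents = f.read().lstrip()
decoder = json.JSONDecoder()
while contents:
    # data is the decoded value; idx is where the leftover text starts
    data, idx = decoder.raw_decode(contents)
    do_stuff(data)  # placeholder for your own processing
    contents = contents[idx:].lstrip()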

Remember that this will only work if your file is a stream of JSON documents, that is, the top-level values are always Array or Object. Also, if you're editing these files by hand, keep in mind that, unlike JSONlines, where a parser can skip one bad text and continue with the rest, there's no way to recover from an error here, because you have no idea where the next document starts.
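Putting this together for data shaped like the question's, here is one possible sketch in Python 3. The helper name iter_json_documents and the trimmed sample string are made up for illustration; the sketch passes a start index to raw_decode instead of re-slicing the string, and it iterates over the values of "characters" rather than building "char1", "char2", ... keys, since the sample data in the question does not always number them from 1:

```python
import json


def iter_json_documents(text):
    """Yield each top-level JSON value from a string of concatenated documents."""
    decoder = json.JSONDecoder()
    pos = 0
    end = len(text)
    while True:
        # raw_decode does not skip leading whitespace, so step over it first.
        while pos < end and text[pos].isspace():
            pos += 1
        if pos >= end:
            return
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


# Hypothetical, heavily trimmed sample in the same shape as the question's data.
sample = '''
{"frames": {"frame": {"lps": {"lp": {"ncharacter": "2",
 "characters": {"char1": {"code_ascii": "F", "confidence": "88"},
                "char2": {"code_ascii": "R", "confidence": "96"}}}}}}}
{"frames": {"frame": {"lps": {"lp": {"ncharacter": "1",
 "characters": {"char7": {"code_ascii": "J", "confidence": "96"}}}}}}}
'''

plates = []
for doc in iter_json_documents(sample):
    chars = doc["frames"]["frame"]["lps"]["lp"]["characters"]
    # Collect (character, confidence) pairs in the order they appear.
    plates.append([(c["code_ascii"], c["confidence"]) for c in chars.values()])

print(plates)
```

With a real file you would pass the result of f.read() instead of the sample string. This still holds the whole file in memory, which the answer above already argues is fine for a few hundred objects.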