-
-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Bug: JSONDecodeError
when trying to load JSON with an array at the top-level
#1347
Comments
ah interesting, all the files I tested with had an object at the top level instead of a list I can add handling for lists pretty easily, there are so many different JSON formats to support haha |
JSONDecodeError
when trying to load JSON with an array at the top-level
Instead of trying to figure out what is going on when the first line of the JSON file is garbage, it would be easier to try not skipping it first, and then try again after skipping it if that fails. try:
links = json.load(json_file)
except json.decoder.JSONDecodeError:
# sometimes the first line is a comment or other junk, so try without
json_file.seek(0)
first_line = json_file.readline()
#print(' > Trying JSON parser without first line: "', first_line.strip(), '"', sep= '')
links = json.load(json_file)
# we may fail again, which means we really don't know what to do But maybe even this isn't necessary? It looks like the original "skip the first line" logic came about because ArchiveBox would add the filename to the file as the first line when putting in the sources directory, but that doesn't seem to happen any more (which seems like a much, much better way to go). |
Rather than by assuming the JSON file we are parsing has junk at the beginning (which maybe only used to happen?), try parsing it as-is first, and then fall back to trying again after skipping the first line Fixes ArchiveBox#1347
Yes but I want to keep the workaround logic as a fallback because users still have the old "filename as first line" style imports in their I do agree we should move it to a |
I don't see how the existing workaround ever worked for anything because it chops off everything before the first |
Ah sorry, it was long enough ago that I don't remember what it was for exactly... maybe it was to handle an extra newline at the start, or maybe I thought I was handling a JSON object at the top level instead of JSONL? Either way I'm down to change it, this parser is broken enough that it's not useful in its current state anyway. |
Handling JSONL wouldn’t be hard to add as a another fallback. We could try JSON, then JSONL, and then try them both again without the first line to handle old source lists that had that extra line added. |
This is done, thanks again @jimwins for all your great work here! Will be out in the next release, or pull |
With ArchiveBox version 0.6.2, I used to import URLs stored in JSON files with content looking like the following:
archivebox add --parser json < ./links.json
Everything worked well.
With version 0.7.2, however, I have a
JSONDecodeError
exception during the import:The error is caused by the following line, introduced by commit aaca74f:
ArchiveBox/archivebox/parsers/generic_json.py
Line 22 in 3ad3250
The text was updated successfully, but these errors were encountered: