ARC variants with different interpretations of version-block length #82
There seems to be some variation in how the length field in the version block is calculated between different ARC files. jwarc's ARC support was tested against files generated by Heritrix and some other tools from the Internet Archive.

The example.arc file you linked to has a length value of "75" (0x4b) in the version block. This would exclude the two newlines at the end of it:

Whereas an ARC file in our collection sourced from the Internet Archive includes just the first newline as part of the length "76" (0x4c):

The ARC file format reference itself seems to introduce two more possible variations! It defines the length for the version block as:
and the grammar for version-block defines it as ending with two newlines:
But reading carefully we see that
So a strict reading of the grammar implies there should in fact be three newlines between the text "Archive-length" and the URL of the first doc, and the first two of them should count towards the length, as they're part of the version block. If we look at the example in that same document, though, it uses a length of "76" (0x4c), has only two newlines, and counts both of them:
Have you seen this error with real-world ARC files containing real data as well, or just with the example files from the warcio unit tests? I'm also curious what such files look like if they have more than one document in them, and whether they also have extra linefeeds between documents or if it's just the version-block length that differs. For reference, there's an example Heritrix ARC file here which jwarc can successfully read: https://archive.org/download/ExampleArcAndWarcFiles/IAH-20080430204825-00000-blackbook.arc.gz
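To make the length variants discussed above concrete, here is a minimal Java sketch that reports how many trailing newlines a declared version-block length appears to include. The class name, method name, and `SAMPLE_BODY` text are assumptions made for illustration; the sample body is not copied from any of the files in this thread, it is simply a typical-looking version block body that is 75 bytes long without its trailing newlines, which lines up with the values quoted above.

```java
import java.nio.charset.StandardCharsets;

public class VersionBlockLength {
    // Hypothetical sample body (an assumption, not taken from the files in this thread):
    // the two version-block lines without the trailing newlines. It is 75 bytes long.
    static final String SAMPLE_BODY =
            "1 0 InternetArchive\n"
            + "URL IP-address Archive-date Content-type Archive-length";

    /**
     * Returns how many trailing newlines the declared length must be counting,
     * given the body text with all trailing newlines stripped.
     */
    static int countedNewlines(String bodyWithoutTrailingNewlines, int declaredLength) {
        int base = bodyWithoutTrailingNewlines.getBytes(StandardCharsets.US_ASCII).length;
        return declaredLength - base;
    }

    public static void main(String[] args) {
        // 75: no newline counted, 76: first newline counted, 77: both counted
        for (int declared : new int[] {75, 76, 77}) {
            System.out.println(declared + " -> "
                    + countedNewlines(SAMPLE_BODY, declared) + " trailing newline(s) counted");
        }
    }
}
```

A reader that wants to tolerate all of these variants cannot trust the declared length alone to find the first record, which is why the number of newlines actually present in the file matters as well.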
Greetings - tballison has sourced many of his test files from my agency, the National Archives and Records Administration. The ARC files were actually created by the Internet Archive with Heritrix back in 2004. I can confirm that the files we received from the Internet Archive appear to have three newlines before the first record (0A 0A 0A). And the record length takes us from the end of the header to the beginning of the first record:
As for the remaining records, a single newline is the most common separator:
What @gleporeNARA said. LOL. Thank you so much @ato for looking into this! Let me know if I can help in any way.
Bugs fixed
* Improved compatibility with ARC variants (version-block length off by one, v2 version-block, spurious linefeeds) #82
* WarcParser: Context in parse error messages was incorrectly using the parser (file) position instead of buffer position
Fix released as v0.28.6. It should sync to Maven Central in an hour or so. I've updated jwarc to accept 0 to 3 newlines between the end of the previous record's body and the URL of the next record. This should make it compatible with all the variants discussed above, and it seems to work with the warcio example.arc:
I've also made it understand the "v2" version-block headers and fixed the parsing exception message so the "<-- HERE -->" should show the right context now.
Wow. Thank you. I'll upgrade in Tika and see what I find on my local set of files.
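The idea of accepting 0 to 3 newlines before the next record's URL can be illustrated with a small Java sketch. This is not jwarc's actual implementation (its parser is built differently); the class name, the `MAX_LINEFEEDS` constant, and the use of a `ByteBuffer` are assumptions chosen to keep the example self-contained.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class LenientArcSeparator {
    // Upper bound taken from the comment above: tolerate at most three linefeeds.
    static final int MAX_LINEFEEDS = 3;

    /**
     * Advances the buffer past up to MAX_LINEFEEDS '\n' bytes and returns
     * how many were skipped. The buffer is left positioned at the first
     * byte that is not part of the tolerated separator.
     */
    static int skipLinefeeds(ByteBuffer buf) {
        int skipped = 0;
        while (skipped < MAX_LINEFEEDS
                && buf.hasRemaining()
                && buf.get(buf.position()) == '\n') { // peek without consuming
            buf.get(); // consume the linefeed
            skipped++;
        }
        return skipped;
    }

    public static void main(String[] args) {
        // Hypothetical input: three linefeeds followed by the start of a record header.
        byte[] data = "\n\n\nhttp://example.org/ ...".getBytes(StandardCharsets.US_ASCII);
        ByteBuffer buf = ByteBuffer.wrap(data);
        int skipped = skipLinefeeds(buf);
        // The next byte is the first character of the URL.
        System.out.println(skipped + " linefeed(s) skipped, next byte: " + (char) buf.get());
    }
}
```

Capping the count keeps the reader lenient about the separator variants seen in this thread without silently swallowing an arbitrary run of blank lines, which would more likely indicate a corrupt file.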
It looks from the unit tests that jwarc should read ARC files. When I try to read ARC test files from warcio, I'm getting an exception.
Is this user error in how I'm calling jwarc or are ARC files not supported?
Test files:
https://github.com/webrecorder/warcio/blob/master/test/data/example.arc
https://github.com/webrecorder/warcio/blob/master/test/data/example.arc.gz
My code:
"warcinfo" is printed once on the console, then there's an exception:
Exception (is the same for both files):