Support explicitly declared variable length offsets #267

Merged 1 commit on Jan 18, 2025


rien333 (Contributor) commented Dec 10, 2024

A <ByteSequence> without a Reference=... attribute was previously assumed to implicitly indicate a variable length offset; however, variable length offsets may also be declared explicitly using Reference="Variable". The only file format that currently uses this explicit form is WACZ (fmt/1840):

<!-- from https://cdn.nationalarchives.gov.uk/documents/container-signature-20240715.xml --> 
<ContainerSignature Id="80000" ContainerType="ZIP">
    <Description>Web Archive Collection Zipped</Description>
    <Files>
        <File>
            <Path>datapackage.json</Path>
            <BinarySignatures>
                <InternalSignatureCollection>
                    <InternalSignature ID="80000">
                        <ByteSequence Reference="Variable">
                            <SubSequence Position="1">
                                <Sequence>'wacz_version'</Sequence>
                            </SubSequence>
                        </ByteSequence>
              ...

With this commit, siegfried can differentiate WACZ files from plain ZIP files, provided that you are also using a container signature file released in 2024 or later.
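To make the rule concrete, here is a minimal sketch in Go of the decision described above. It is not siegfried's actual code: `byteSequence`, `parseReference`, and `isVariableOffset` are hypothetical names, and the sketch only assumes that a missing `Reference` attribute and `Reference="Variable"` should both be treated as a variable length offset, while any other value (such as `BOFoffset`) is not.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// byteSequence is a hypothetical struct that captures only the
// Reference attribute of a <ByteSequence> element; child elements
// such as <SubSequence> are ignored by the decoder.
type byteSequence struct {
	Reference string `xml:"Reference,attr"`
}

// parseReference returns the value of the Reference attribute,
// or an empty string if the attribute is absent.
func parseReference(fragment string) (string, error) {
	var bs byteSequence
	if err := xml.Unmarshal([]byte(fragment), &bs); err != nil {
		return "", err
	}
	return bs.Reference, nil
}

// isVariableOffset (hypothetical helper) treats both the implicit form
// (no Reference attribute) and the explicit form (Reference="Variable")
// as a variable length offset.
func isVariableOffset(ref string) bool {
	return ref == "" || ref == "Variable"
}

func main() {
	fragments := []string{
		`<ByteSequence><SubSequence Position="1"/></ByteSequence>`,
		`<ByteSequence Reference="Variable"><SubSequence Position="1"/></ByteSequence>`,
		`<ByteSequence Reference="BOFoffset"><SubSequence Position="1"/></ByteSequence>`,
	}
	for _, f := range fragments {
		ref, err := parseReference(f)
		if err != nil {
			fmt.Println("parse error:", err)
			continue
		}
		fmt.Printf("Reference=%q variable=%t\n", ref, isVariableOffset(ref))
	}
}
```

Before this change, only the first fragment would have been handled as a variable length offset; the second is the explicit form used by the WACZ signature.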

I can take a look at adding support for WACZ files in siegfried later as well, since both WACZ and siegfried are important for some projects I'm working on!

richardlehane (Owner) commented Dec 10, 2024

Thanks for this PR, Rijnder. I'll check and merge shortly. [edit] WACZ support would also be very welcome! Siegfried's warc/arc decoding is all in this repo: https://github.com/richardlehane/webarchive

richardlehane merged commit 063951c into richardlehane:main on Jan 18, 2025