Batch content export #382

Closed
tokee opened this issue Jun 28, 2023 · 1 comment · Fixed by #390

tokee commented Jun 28, 2023

At the Royal Danish Library there have been multiple requests for en masse export of raw archive content, e.g. unmodified HTML, images or PDFs. The current exporter only supports WARC for this, and some researchers find WARC files cumbersome to work with.

SolrWayback should have an export option for a more common container format, where 64-bit zip is the obvious candidate, as "all" platforms support it out of the box.
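To make the container idea concrete, here is a minimal sketch of writing raw payloads to a zip with ZIP64 extensions enabled. It uses Python's standard `zipfile` module purely for illustration; it is not SolrWayback code, and `export_to_zip` and the entry names are hypothetical.

```python
import io
import zipfile


def export_to_zip(entries, dest):
    """Write (name, payload) pairs to a zip archive.

    `entries` is an iterable of (str, bytes) tuples and `dest` is a path or
    a writable binary file object. Both are illustrative assumptions.
    """
    # allowZip64=True permits archives and members larger than 4 GiB.
    with zipfile.ZipFile(dest, "w", compression=zipfile.ZIP_DEFLATED,
                         allowZip64=True) as archive:
        for name, payload in entries:
            archive.writestr(name, payload)


# Usage: export a single HTML payload to an in-memory archive.
buffer = io.BytesIO()
export_to_zip([("20230628120000/index.html", b"<html></html>")], buffer)
```

The ZIP64 extensions only take effect when sizes exceed the classic zip limits, so small exports stay readable by older tools.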

The big question is how to handle naming for non-WARC export. Two options come to mind:

  1. Best effort à la timestamp/Filename_cleaned_of_non-ASCII_spaces_and_similar.ext
  2. timestamp_hash.ext with a metadata.txt which contains timestamp, hash, WARC-file, WARC-offset, URL
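The two naming options could be sketched roughly as follows. This is only an illustration under stated assumptions: the function names, the `index` fallback, the choice of SHA-1 and the tab-separated metadata layout are all hypothetical and not specified in this issue.

```python
import hashlib
import re
from posixpath import basename
from urllib.parse import urlsplit


def best_effort_name(timestamp: str, url: str) -> str:
    """Option 1: timestamp folder plus a cleaned version of the URL's filename."""
    # Fall back to "index" (an assumption) when the URL path has no filename.
    filename = basename(urlsplit(url).path) or "index"
    # Replace each run of characters outside a conservative ASCII set.
    cleaned = re.sub(r"[^A-Za-z0-9._-]+", "_", filename)
    return f"{timestamp}/{cleaned}"


def hashed_name(timestamp: str, payload: bytes, ext: str) -> str:
    """Option 2: timestamp and content hash, with the original extension."""
    digest = hashlib.sha1(payload).hexdigest()
    return f"{timestamp}_{digest}.{ext}"


def metadata_line(timestamp: str, payload: bytes, warc_file: str,
                  warc_offset: int, url: str) -> str:
    """One metadata.txt row: timestamp, hash, WARC-file, WARC-offset, URL."""
    digest = hashlib.sha1(payload).hexdigest()
    return "\t".join([timestamp, digest, warc_file, str(warc_offset), url])
```

Option 1 keeps names recognisable but can collide when two URLs clean to the same filename within one timestamp; option 2 avoids that at the cost of needing metadata.txt to map entries back to their URLs.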

tokee commented Jul 1, 2023

This looks like a duplicate of #245.
