Zip export. #390

Merged
thomasegense merged 39 commits into master from raw_zip_exports
Jul 25, 2023
Conversation

VictorHarbo
Collaborator

Extraction of content from WARC files by mimetype

Closes #382 and #245

@VictorHarbo VictorHarbo requested a review from thomasegense July 24, 2023 06:55
@VictorHarbo VictorHarbo self-assigned this Jul 24, 2023
@VictorHarbo VictorHarbo linked an issue Jul 24, 2023 that may be closed by this pull request

zos.close();
output.close();
log.info("Streamed {} warc entries with the contentType: '{}'.", streamedDocs, normalizedContentType);
Contributor

This log line is useful. Maybe add that this was a zip export and that it completed successfully.
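A minimal sketch of what such a completion message could look like. The class `ZipExportSketch`, the methods `streamToZip` and `completionMessage`, and the in-memory `Map` of entries are hypothetical stand-ins, not the code in this PR. The sketch also uses try-with-resources instead of the explicit `close()` calls above, and builds the message with `String.format` only to stay self-contained; the PR's own logger call would take the same text.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

// Hypothetical sketch: none of these names come from the PR.
final class ZipExportSketch {

    /** Writes each (filename, bytes) pair as one zip entry and returns the number written. */
    static long streamToZip(Map<String, byte[]> entries, OutputStream output) {
        long streamedDocs = 0;
        // try-with-resources closes the zip stream (and the wrapped output) even on failure.
        try (ZipOutputStream zos = new ZipOutputStream(output)) {
            for (Map.Entry<String, byte[]> entry : entries.entrySet()) {
                zos.putNextEntry(new ZipEntry(entry.getKey()));
                zos.write(entry.getValue());
                zos.closeEntry();
                streamedDocs++;
            }
        } catch (IOException e) {
            // Rethrown unchecked only to keep the sketch easy to call.
            throw new UncheckedIOException(e);
        }
        return streamedDocs;
    }

    /** Builds a message that names the operation and states that it finished. */
    static String completionMessage(long streamedDocs, String contentType) {
        return String.format(
                "Zip export completed successfully. Streamed %d warc entries with the contentType: '%s'.",
                streamedDocs, contentType);
    }
}
```

With the project's logger this would presumably become a single `log.info` call with the same wording and `{}` placeholders.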

* @param warcMetadata contains the timestamp, id, originalUrl and file extension, which is used to create the filename.
* @return a string in the format timestamp_id_originalUrlStrippedForNonASCIIChars.extension.
*/
private String createFilename(String contentType, WarcMetadataFromSolr warcMetadata) {
Contributor

For the file name, implement:

  1. Only keep the alphanumeric parts of URLs. Still use '_' as the separator.

  2. The total filename length must not exceed 255 characters, since this is the limit in Windows.

  3. Collapse multiple consecutive '_' into one.
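The three rules above could be sketched roughly as follows. This is an illustration under assumptions, not the method in this PR: the class `ExportFilenames` and its plain `String` parameters are hypothetical (the real `createFilename` takes a `WarcMetadataFromSolr`), it assumes the timestamp and id are already safe for filenames, and it treats "255 characters" as 255 Java `char`s with an extension much shorter than that.

```java
// Hypothetical sketch: names and signature are illustrative only.
final class ExportFilenames {

    // Upper bound for the whole filename, per the Windows limit mentioned in the review.
    static final int MAX_FILENAME_LENGTH = 255;

    /** Replaces every run of non-alphanumeric characters with a single '_'. */
    static String sanitizeUrl(String url) {
        return url.replaceAll("[^A-Za-z0-9]+", "_")  // rules 1 and 3 for the URL part
                  .replaceAll("^_+|_+$", "");        // drop leading/trailing separators
    }

    /** Builds timestamp_id_sanitizedUrl.extension, truncated to MAX_FILENAME_LENGTH. */
    static String createFilename(String timestamp, String id, String url, String extension) {
        String suffix = "." + extension;
        String base = (timestamp + "_" + id + "_" + sanitizeUrl(url))
                .replaceAll("_+", "_");              // rule 3 across the joined parts
        int room = Math.max(0, MAX_FILENAME_LENGTH - suffix.length());
        if (base.length() > room) {
            base = base.substring(0, room);          // rule 2: cut the base, keep the extension
        }
        base = base.replaceAll("_+$", "");           // avoid names ending in "_.ext"
        return base + suffix;
    }
}
```

Truncating the base rather than the whole string keeps the extension intact, so exported files still open with the right application. Two long URLs that share a prefix could still collide after truncation; the id in the name is what keeps them apart in this sketch.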

* @param contentType string to normalize.
* @return normalized contentType string.
*/
private String normalizeContentType(String contentType) {
Contributor

Not used anymore

@thomasegense thomasegense merged commit 34037a7 into master Jul 25, 2023
@thomasegense thomasegense deleted the raw_zip_exports branch July 25, 2023 10:36
Development

Successfully merging this pull request may close these issues.

Batch content export
Add ZIP as export option