Zip export. #390
zos.close();
output.close();
log.info("Streamed {} warc entries with the contentType: '{}'.", streamedDocs, normalizedContentType);
This log message is useful. Maybe add that this was a 'zip import' and that it completed successfully.
* @param warcMetadata contains the timestamp, id, originalUrl and file extension, which is used to create the filename.
* @return a string in the format timestamp_id_originalUrlStrippedForNonASCIIChars.extension.
*/
private String createFilename(String contentType, WarcMetadataFromSolr warcMetadata) {
For the filename, implement:
- Keep only the alphanumeric parts of the URL. Still use '_' as the separator.
- The total filename length must not exceed 255 characters, since this is the limit on Windows.
- Collapse multiple consecutive '_' into one.
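The three requirements above could be sketched roughly as follows. This is only an illustration, not the PR's implementation: the class `FilenameSketch` and the plain `String` parameters are hypothetical stand-ins for the fields of `WarcMetadataFromSolr`, and it assumes the timestamp, id and extension are already short and filesystem-safe, so that only the URL part needs cleaning and truncating.

```java
// Hypothetical helper; names and parameters are assumptions, not the PR's code.
final class FilenameSketch {
    // Maximum filename length requested in the review (the Windows limit).
    private static final int MAX_FILENAME_LENGTH = 255;

    static String createFilename(String timestamp, String id, String originalUrl, String extension) {
        // Replace every run of non-alphanumeric characters with a single '_'.
        // This keeps only the alphanumeric parts and collapses repeated
        // separators in one pass.
        String urlPart = originalUrl.replaceAll("[^A-Za-z0-9]+", "_");
        // Drop leading and trailing '_' so they do not double up with the separators.
        urlPart = urlPart.replaceAll("^_+|_+$", "");

        String prefix = timestamp + "_" + id + "_";
        String suffix = "." + extension;

        // Truncate only the URL part so that timestamp, id and extension survive.
        // Assumes prefix and suffix together are well below the limit.
        int room = MAX_FILENAME_LENGTH - prefix.length() - suffix.length();
        if (urlPart.length() > room) {
            urlPart = urlPart.substring(0, Math.max(room, 0));
        }
        return prefix + urlPart + suffix;
    }
}
```

For example, `FilenameSketch.createFilename("20200101120000", "abc123", "http://example.com/a//b?x=1", "html")` would give `20200101120000_abc123_http_example_com_a_b_x_1.html`. Truncating the URL part rather than the whole string keeps the file extension intact, which matters for a mimetype-based export.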
* @param contentType string to normalize.
* @return normalized contentType string.
*/
private String normalizeContentType(String contentType) {
This method is not used anymore.
Extraction of content from WARC files by mimetype
Closes #382 and #245