improve title extractor #924

Merged
2 commits merged into ArchiveBox:dev on May 10, 2022

Conversation

prnake
Contributor

@prnake prnake commented Feb 8, 2022

Summary

This PR tries to get the title from offline HTML first, such as the SingleFile output, following the same idea as the readability extractor. The goal is to handle anti-scraping pages like https://xz.aliyun.com/t/10870, whose title cannot be retrieved with wget or curl.
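The local-first lookup described above could be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the function names, the `fetch_remote` callback, and the candidate file names (`singlefile.html`, `dom.html`) are assumptions made for the example.

```python
from html.parser import HTMLParser
from pathlib import Path
from typing import Callable, Optional, Sequence


class TitleParser(HTMLParser):
    """Collect the text inside the first <title> element."""

    def __init__(self) -> None:
        super().__init__()
        self.in_title = False
        self.done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self.done:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self.in_title:
            self.in_title = False
            self.done = True  # ignore any later <title> elements

    def handle_data(self, data):
        if self.in_title:
            self.parts.append(data)


def title_from_html(html: str) -> Optional[str]:
    """Return the whitespace-normalized <title> text, or None if absent or empty."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    text = " ".join("".join(parser.parts).split())
    return text or None


def title_from_snapshot(
    snapshot_dir: Path,
    fetch_remote: Optional[Callable[[], str]] = None,
    # Assumed output file names; adjust to the real snapshot layout.
    candidates: Sequence[str] = ("singlefile.html", "dom.html"),
) -> Optional[str]:
    """Prefer already-archived HTML; fall back to a network fetch only if needed."""
    for name in candidates:
        path = snapshot_dir / name
        if path.is_file():
            html = path.read_text(encoding="utf-8", errors="replace")
            title = title_from_html(html)
            if title:
                return title
    if fetch_remote is not None:
        # Hypothetical callback that downloads the page and returns its HTML.
        return title_from_html(fetch_remote())
    return None
```

Keeping the remote fetch as a fallback means a URL can still get a title when the offline HTML is missing, for example because an earlier extractor failed.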

Changes these areas

  • Bugfixes
  • Feature behavior
  • Command line interface
  • Configuration options
  • Internal architecture
  • Snapshot data layout on disk

@lgtm-com

lgtm-com bot commented Feb 8, 2022

This pull request introduces 1 alert when merging de8e22e into bf432d4 - view on LGTM.com

new alerts:

  • 1 for Unused import

@pirate
Member

pirate commented Mar 16, 2022

This is a good idea, but I'm mildly concerned that running the title extractor so late in the process means any failures during archiving will leave many URLs without titles. Have you tested this with a big import of several hundred URLs?

Thanks for this work! Excited to merge it.

@pirate pirate merged commit 950b5cb into ArchiveBox:dev May 10, 2022