improve title extractor #924

prnake · 2022-02-08T15:23:34Z

Summary

This PR tries to get the title from offline HTML that has already been archived (such as the singlefile output) first, using the same idea as the readability extractor. This handles anti-scraping pages like https://xz.aliyun.com/t/10870, where the title cannot be retrieved with wget or curl.
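The idea can be illustrated with a minimal sketch. This is not the code from this PR: the names `TitleParser`, `title_from_file`, `extract_title`, and the `fetch_title` callback are hypothetical, and the `singlefile.html` file name is an assumption about where the offline copy lives. It reads the `<title>` from a saved HTML file if one exists and only falls back to a network fetch otherwise.

```python
from html.parser import HTMLParser
from pathlib import Path
from typing import Callable, List, Optional, Sequence


class TitleParser(HTMLParser):
    """Collects the text of the first <title> element in a document."""

    def __init__(self) -> None:
        super().__init__()
        self._in_title = False
        self._done = False
        self._chunks: List[str] = []

    def handle_starttag(self, tag, attrs):
        # Only the first <title> counts; later ones are ignored.
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self._chunks.append(data)

    @property
    def title(self) -> Optional[str]:
        # Collapse runs of whitespace; return None for an empty title.
        text = " ".join("".join(self._chunks).split())
        return text or None


def title_from_file(path: Path) -> Optional[str]:
    """Return the <title> of a local HTML file, or None if unavailable."""
    if not path.is_file():
        return None
    parser = TitleParser()
    parser.feed(path.read_text(encoding="utf-8", errors="replace"))
    parser.close()
    return parser.title


def extract_title(
    snapshot_dir: Path,
    fetch_title: Callable[[], Optional[str]],
    candidates: Sequence[str] = ("singlefile.html",),  # assumed file name
) -> Optional[str]:
    """Prefer offline HTML outputs; fall back to fetching over the network."""
    for name in candidates:
        title = title_from_file(snapshot_dir / name)
        if title:
            return title
    return fetch_title()
```

With this shape, a page that blocks wget or curl still gets a title as long as a browser-based extractor has already saved a copy, and pages without an offline copy behave as before.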

Changes these areas

lgtm-com · 2022-02-08T16:52:59Z

This pull request introduces 1 alert when merging de8e22e into bf432d4 - view on LGTM.com

new alerts:

1 for Unused import

pirate · 2022-03-16T20:18:30Z

This is a good idea, but I'm mildly concerned that putting the title so late in the process means any failures during archiving will leave many URLs without titles. Have you tested this with a big import of several hundred URLs?

Thanks for this work! Excited to merge it.

improve title extractor

de8e22e

remove unused import

011bd10

pirate merged commit 950b5cb into ArchiveBox:dev May 10, 2022
