search index extract html text #1244
overhacked commented Oct 12, 2023
- Extract text from singlefile.html when indexing
- Make extracting text for indexing optional
singlefile.html contains a lot of large strings in the form of `data:` URLs, which can be unnecessarily stored in full-text indices. Also, large chunks of JavaScript shouldn't be indexed, either, as they pollute search results for searches about JS functions, etc. This commit takes a blanket approach of parsing singlefile.html as it is read and only outputting text and selected textual attributes (like `alt`) for indexing.
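As a rough illustration of the approach described above (not the code in this PR), a minimal sketch using Python's standard `html.parser` might look like the following. The names `TextExtractor`, `extract_text`, `SKIP_TAGS`, and `TEXT_ATTRS` are hypothetical, and the choice of `title` alongside `alt` is an assumption; the description only names `alt`.

```python
from html.parser import HTMLParser

# Assumption: elements whose contents should never reach the index.
SKIP_TAGS = {"script", "style"}
# Assumption: attributes treated as text; the PR description only names `alt`.
TEXT_ATTRS = {"alt", "title"}


class TextExtractor(HTMLParser):
    """Collects text nodes and selected attributes, ignoring script/style."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self._skip_depth = 0  # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
            return
        if self._skip_depth:
            return
        # Only whitelisted attributes are kept, so `src="data:..."` is dropped.
        for name, value in attrs:
            if name in TEXT_ATTRS and value:
                self.chunks.append(value)

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def extract_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

For example, `extract_text('<p>Hello</p><script>var x = 1;</script><img src="data:image/png;base64,AAAA" alt="A cat">')` keeps the paragraph text and the `alt` value while dropping the script body and the `data:` URL.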
Add a configuration option to enable/disable HTML text extraction for indexing
Add space after any close tag to ensure that tokens that would be rendered separate in HTML get extracted as separate tokens in text. Example: `<p>First</p><p>Second</p>` --> `First Second` NOT `FirstSecond`
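The effect of the space-after-close-tag rule can be sketched as follows. This is a simplified stand-in for illustration, assuming Python's standard `html.parser`; `SpacedTextExtractor`, `html_to_tokens`, and the `space_after_close` flag are hypothetical names, not identifiers from this PR.

```python
from html.parser import HTMLParser


class SpacedTextExtractor(HTMLParser):
    """Emits text nodes, optionally followed by a space after each close tag."""

    def __init__(self, space_after_close=True):
        super().__init__(convert_charrefs=True)
        self.space_after_close = space_after_close
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

    def handle_endtag(self, tag):
        # Without this separator, adjacent block elements fuse into one token.
        if self.space_after_close:
            self.parts.append(" ")


def html_to_tokens(html, space_after_close=True):
    parser = SpacedTextExtractor(space_after_close)
    parser.feed(html)
    parser.close()
    # str.split() with no arguments also collapses runs of whitespace.
    return "".join(parser.parts).split()


print(html_to_tokens("<p>First</p><p>Second</p>"))                           # ['First', 'Second']
print(html_to_tokens("<p>First</p><p>Second</p>", space_after_close=False))  # ['FirstSecond']
```

One trade-off of a blanket rule like this is that inline markup inside a word, such as `<b>Fi</b>rst`, is also split into two tokens.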
We already use readability to extract text from singlefile.html (https://github.com/ArchiveBox/ArchiveBox/blob/dev/archivebox/extractors/readability.py#L57C23-L57C23). Is a separate parser really needed?
Indexing prefers readability over singlefile, and ignores mercury altogether:
Even if we add mercury into the precedence list (IMO a good idea), both readability and mercury frequently fail to extract content, because their goal is to extract only the main, human-readable text content from a page. Both tools prefer failing with empty output over generating bad-looking content. The goal for full-text search is different. You'd rather get an accurate but ugly stream of text tokens than empty output. So if readability or mercury produce high-quality, high-relevance content that can be indexed, great! But if they fail with empty output, it would be better to extract all the possibly meaningful text from
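The fallback argued for in this comment could be sketched like this. It is only an illustration of the precedence idea; `pick_index_text`, the `outputs` mapping, and the source names used as keys are hypothetical and do not reflect ArchiveBox's actual indexing code.

```python
def pick_index_text(outputs, precedence=("readability", "mercury", "htmltotext")):
    """Return (source, text) for the first source with non-empty output.

    `outputs` maps a source name to its extracted text (or None if it failed).
    The names in `precedence` are assumptions made for this sketch.
    """
    for name in precedence:
        text = (outputs.get(name) or "").strip()
        if text:
            return name, text
    return None, ""


# readability and mercury produced nothing, so the raw text stream is indexed.
source, text = pick_index_text(
    {"readability": "", "mercury": None, "htmltotext": "First Second"}
)
```

Placing the plain text stream last means high-quality article text is still preferred whenever it exists, and the ugly but complete output is used only when the other sources come back empty.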
Ok, I see the appeal, however I think if this is going to serve as a mercury/readability alternative, it should be implemented as a real extractor instead of a util. I'm sorry for the reimplementation burden but I think if this is a true post-processing step during archiving, then it's doing the job an extractor would do, and should be formatted like a sibling of all the other extractors. I foresee situations where a pure Python text extraction option would be appealing to people, and having this outputted text directly as an extractor output like any other is nice. We can then re-use the existing code paths to ingest extractor output into sonic. We could call it something like There is also documentation on how to implement an extractor here: https://github.com/ArchiveBox/ArchiveBox#contributing-a-new-extractor
I see what you're saying... same approach but make it better? I'll add some light formatting (linebreaks, etc.) to the output to make it marginally human-readable, rather than the current "glue everything together with spaces" approach. Incidentally, I've got a prototype ActivityPub extractor (with optional Mastodon API for authentication) in my work-in-progress branches. It downloads post-attached media more efficiently, and it will archive links included in posts. It's still a little rough, and it's been on the back burner for a while. I'll see about shaping it up to let you have a look.
Saves HTML text nodes and selected element attributes in `htmltotext.txt` for each Snapshot. Primarily intended to be used for search indexing.
885d3dd to 310b4d1
looks good! shall I merge it?