search index extract html text #1244

Merged
merged 5 commits into from
Nov 9, 2023

Conversation

overhacked
Contributor

  • Extract text from singlefile.html when indexing
  • Make extracting text for indexing optional

singlefile.html contains a lot of large strings in the form of `data:`
URLs, which can be unnecessarily stored in full-text indices. Also,
large chunks of JavaScript shouldn't be indexed, either, as they pollute
search results for searches about JS functions, etc.

This commit takes a blanket approach of parsing singlefile.html as it is
read and only outputting text and selected textual attributes (like
`alt`) for indexing.

Add a configuration option to enable/disable HTML text extraction
for indexing.

Add a space after any close tag to ensure that
tokens that would be rendered separately in HTML
get extracted as separate tokens in text.

Example:

`<p>First</p><p>Second</p>` --> `First Second`
NOT `FirstSecond`
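
The behavior described above can be sketched with the standard library's `html.parser`. This is a minimal illustration, not the code in this PR: the class name `TextExtractor`, the helper `extract_text`, and the sets `SKIP_TAGS` and `TEXT_ATTRS` are assumptions made for the example.

```python
from html.parser import HTMLParser

# Assumption: which elements to drop and which attributes to keep.
SKIP_TAGS = {"script", "style"}
TEXT_ATTRS = {"alt", "title"}


class TextExtractor(HTMLParser):
    """Collect text nodes and selected attributes, skipping script/style."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
            return
        if self._skip_depth:
            return
        # Keep only textual attributes; `src="data:..."` is never emitted.
        for name, value in attrs:
            if name in TEXT_ATTRS and value:
                self.parts.append(value + " ")

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            if self._skip_depth:
                self._skip_depth -= 1
            return
        if not self._skip_depth:
            # A space after every close tag keeps adjacent elements apart.
            self.parts.append(" ")

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def extract_text(html):
    """Return whitespace-normalized text extracted from an HTML string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split())


print(extract_text("<p>First</p><p>Second</p>"))  # First Second
```

Collapsing whitespace at the end means the extra spaces added after close tags never accumulate in the output.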
@overhacked overhacked marked this pull request as ready for review October 16, 2023 18:33
@pirate
Member

pirate commented Oct 17, 2023

We already use readability to extract text from singlefile.html (https://github.com/ArchiveBox/ArchiveBox/blob/dev/archivebox/extractors/readability.py#L57C23-L57C23), is a separate parser really needed?

@overhacked
Contributor Author

Indexing prefers readability over singlefile, and ignores mercury altogether:

ARCHIVE_METHODS_INDEXING_PRECEDENCE = [('readability', 1), ('singlefile', 2), ('dom', 3), ('wget', 4)]

Even if we add mercury into the precedence list (IMO a good idea), both readability and mercury frequently fail to extract content, because their goal is to extract only the main, human-readable text content from a page. Both tools prefer failing with empty output over generating bad-looking content.

The goal for full-text search is different. You'd rather get an accurate but ugly stream of text tokens than empty output. So--if readability or mercury produce high-quality, high-relevance content that can be indexed--great! But if they fail with empty output, it would be better to extract all the possibly meaningful text from singlefile.html for indexing rather than feed the entire singlefile output, including base64 binary data, to the search indexer.
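
To illustrate the fallback I have in mind, here is a rough sketch of walking the precedence list and taking the first method that produced non-empty text. It is not ArchiveBox's actual indexing code: the function `pick_index_text` and the `outputs` dict (method name mapped to extracted text) are hypothetical names for this example; only the precedence list is copied from above.

```python
ARCHIVE_METHODS_INDEXING_PRECEDENCE = [
    ("readability", 1),
    ("singlefile", 2),
    ("dom", 3),
    ("wget", 4),
]


def pick_index_text(outputs, precedence=ARCHIVE_METHODS_INDEXING_PRECEDENCE):
    """Return (method, text) for the highest-precedence non-empty output.

    `outputs` is a hypothetical dict of method name -> extracted text.
    """
    for method, _rank in sorted(precedence, key=lambda pair: pair[1]):
        text = outputs.get(method) or ""
        if text.strip():
            return method, text
    # Nothing usable was produced by any method.
    return None, ""


# readability failed with empty output, so the singlefile text is used.
method, text = pick_index_text({"readability": "", "singlefile": "First Second"})
```

With this shape, the text taken from singlefile would be the extracted text rather than the raw HTML.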

@pirate
Member

pirate commented Oct 19, 2023

Ok, I see the appeal, however I think if this is going to serve as a mercury/readability alternative, it should be implemented as a real extractor instead of a util.

I'm sorry for the reimplementation burden but I think if this is a true post-processing step during archiving, then it's doing the job an extractor would do, and should be formatted like a sibling of all the other extractors.

I foresee situations where a pure python text extraction option would be appealing to people, and having this outputted text directly as an extractor output like any other is nice. We can then re-use the existing code paths to ingest extractor output into sonic.

We could call it something like extractors/htmltotext.py? Open to other naming ideas.

There is also documentation on how to implement an extractor here: https://github.com/ArchiveBox/ArchiveBox#contributing-a-new-extractor

@overhacked
Contributor Author

I see what you're saying... same approach but make it better? I'll add some light formatting (linebreaks, etc.) to the output to make it marginally human-readable, rather than the current "glue everything together with spaces" approach.

Incidentally, I've got a prototype ActivityPub extractor (with optional Mastodon API for authentication) in my work-in-progress branches. It downloads post-attached media more efficiently, and it will archive links included in posts. It's still a little rough, and it's been on the back burner for a while. I'll see about shaping it up to let you have a look.

Saves HTML text nodes and selected element attributes in
`htmltotext.txt` for each Snapshot. Primarily intended to be used
for search indexing.
@overhacked overhacked force-pushed the search_index_extract_html_text branch from 885d3dd to 310b4d1 Compare October 24, 2023 01:44
Member

@pirate pirate left a comment

looks good! shall I merge it?

@pirate pirate merged commit f573950 into ArchiveBox:dev Nov 9, 2023