Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix HTML title parsing bugs. #1242

Merged
merged 1 commit into from
Oct 10, 2023

Conversation

benmuth
Copy link
Contributor

@benmuth benmuth commented Oct 9, 2023

Summary

This slightly modifies the HTML_TITLE_REGEX and fixes two parsing errors. The first occurred when title tags were empty (e.g. <title></title>) which was parsed as </title. The second occurred when titles were a single character (e.g. <title>A</title>) which was not matched by the regex, and so the title would fall back to link.base_url.

With this change, now when tags are empty, the title falls back to link.base_url, and single character titles are parsed correctly.

I tested the new regex with the edge cases I could think of, and found some malformed HTML will still lead to undesired behavior. A more robust regex could probably be used in the future.

Related issues

#1222

Changes these areas

  • Bugfixes
  • Feature behavior
  • Command line interface
  • Configuration options
  • Internal architecture
  • Snapshot data layout on disk

This slightly modifies the HTML_TITLE_REGEX to fix two parsing errors.
The first occurred when title tags were empty (e.g. "<title></title>")
which was parsed as "</title". The second occurred when titles were a
single character (e.g. "<title>A</title>") which was not matched by the
regex, and so would fall back to link.base_url.

Now when tags are empty, it falls back to link.base_url, and single
character titles are parsed correctly.

The way the regex works now is still a bit wonky for some edge cases.
I couldn't find any cases of incorrect behavior, but it still might be
worth reworking more completely for robustness.
@benmuth benmuth marked this pull request as ready for review October 9, 2023 23:16
@pirate
Copy link
Member

pirate commented Oct 10, 2023

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants