Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use feedparser for RSS parsing #1362

Merged
merged 5 commits into from
Mar 14, 2024
Merged

Use feedparser for RSS parsing #1362

merged 5 commits into from
Mar 14, 2024

Conversation

jimwins
Copy link
Contributor

@jimwins jimwins commented Feb 25, 2024

Summary

The feedparser packages has 20 years of history and is very good at parsing RSS and Atom, so use that instead of ad-hoc regex and XML parsing.

The medium_rss and shaarli_rss parsers weren't touched because they are probably unnecessary and should be removed. (The special parse for pinboard is just needed because of how tags work.)

Related issues

Fixes #1171
Fixes #870 (probably, would need to test against a Wallabag Atom file to be certain)1
Fixes #135
Fixes #123
Fixes #106

Changes these areas

  • Bugfixes
  • Feature behavior
  • Command line interface
  • Configuration options
  • Internal architecture
  • Snapshot data layout on disk

Footnotes

  1. I tried divining the format of Wallabag's XML/Atom export by reading the code but there's a lot of abstractions and it would probably be faster to just install Wallabag and produce a sample, or faster yet for someone to provide a sample.

The feedparser packages has 20 years of history and is very good at parsing
RSS and Atom, so use that instead of ad-hoc regex and XML parsing.

The medium_rss and shaarli_rss parsers weren't touched because they are
probably unnecessary. (The special parse for pinboard is just needing because
of how tags work.)

Doesn't include tests because I haven't figured out how to run them in the
docker development setup.

Fixes ArchiveBox#1171
@@ -89,7 +89,7 @@ def main(args: Optional[List[str]]=None, stdin: Optional[IO]=None, pwd: Optional
parser.add_argument(
"--parser",
type=str,
help="Parser used to read inputted URLs.",
help="Parser used for input on stdin (not valid with URLs).",
Copy link
Member

@pirate pirate Feb 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To clarify, parser still applies to URLs passed as CLI args in some cases. The behavior is subtle and maybe a bit too "magic", but the intent is to work for cases like this:

archivebox add --depth=0 --parser=rss 'https://dropbox.tech/feed'

Without needing the user to do this:

curl -s 'https://dropbox.tech/feed' | archivebox add --depth=0 --parser=rss
# or
archivebox add --depth=1 --parser=rss 'https://dropbox.tech/feed'

See this this case in particular here:
--depth=0 + --parser=rss + passed URL(s) or local file path(s) via stdin/args
#1363 (comment)

Honestly I'm still conflicted about the level of magic needed to get my original intent here working, as it introduces a ton of edge cases / unexpected results around --depth=0 and --depth=1 interacting with --parser=.

I could be convinced that we should remove all of my "guess what the user was trying to do" magic entirely and just better document which CLI args/stdin/etc. need to be used for each case. For example we could force --parser to only apply to stdin, and make the archivebox.main.add() function signature clearer by only accepting either a blob (stdin to parse) or url_list (CLI args never parsed). This would however break existing programs that depend on the current behavior, which I hate to do.

Maybe you can think of a better solution? Or maybe we could come up with an additional optional flag to control parsing behavior that could de-magic it somewhat? Very open to suggestions here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I think I understand better now how it was supposed to work. The intersection of parser bugs I was hitting along with the most-links behavior of auto was making me more confused than I should have been. I’ll dig into it more soon, maybe tomorrow.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I got rid of the commit related to this and merged this pull request with the head of the dev branch so it stands on its own now.

jimwins added 3 commits March 1, 2024 11:25
The feedparser packages has 20 years of history and is very good at parsing
RSS and Atom, so use that instead of ad-hoc regex and XML parsing.

The medium_rss and shaarli_rss parsers weren't touched because they are
probably unnecessary. (The special parse for pinboard is just needing because
of how tags work.)

Doesn't include tests because I haven't figured out how to run them in the
docker development setup.

Fixes ArchiveBox#1171
@pirate pirate merged commit 099f7d0 into ArchiveBox:dev Mar 14, 2024
1 of 2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants