Use feedparser for RSS parsing #1362

jimwins · 2024-02-25T21:18:34Z

Summary

The feedparser packages has 20 years of history and is very good at parsing RSS and Atom, so use that instead of ad-hoc regex and XML parsing.

The medium_rss and shaarli_rss parsers weren't touched because they are probably unnecessary and should be removed. (The special parse for pinboard is just needed because of how tags work.)

Related issues

Fixes #1171
Fixes #870 (probably, would need to test against a Wallabag Atom file to be certain)¹
Fixes #135
Fixes #123
Fixes #106

Changes these areas

I tried divining the format of Wallabag's XML/Atom export by reading the code but there's a lot of abstractions and it would probably be faster to just install Wallabag and produce a sample, or faster yet for someone to provide a sample. ↩

The feedparser packages has 20 years of history and is very good at parsing RSS and Atom, so use that instead of ad-hoc regex and XML parsing. The medium_rss and shaarli_rss parsers weren't touched because they are probably unnecessary. (The special parse for pinboard is just needing because of how tags work.) Doesn't include tests because I haven't figured out how to run them in the docker development setup. Fixes ArchiveBox#1171

pirate · 2024-02-29T02:47:14Z

archivebox/cli/archivebox_add.py

@@ -89,7 +89,7 @@ def main(args: Optional[List[str]]=None, stdin: Optional[IO]=None, pwd: Optional
    parser.add_argument(
        "--parser",
        type=str,
-        help="Parser used to read inputted URLs.",
+        help="Parser used for input on stdin (not valid with URLs).",


To clarify, parser still applies to URLs passed as CLI args in some cases. The behavior is subtle and maybe a bit too "magic", but the intent is to work for cases like this:

archivebox add --depth=0 --parser=rss 'https://dropbox.tech/feed'

Without needing the user to do this:

curl -s 'https://dropbox.tech/feed' | archivebox add --depth=0 --parser=rss # or archivebox add --depth=1 --parser=rss 'https://dropbox.tech/feed'

See this this case in particular here:
--depth=0 + --parser=rss + passed URL(s) or local file path(s) via stdin/args
#1363 (comment)

Honestly I'm still conflicted about the level of magic needed to get my original intent here working, as it introduces a ton of edge cases / unexpected results around --depth=0 and --depth=1 interacting with --parser=.

I could be convinced that we should remove all of my "guess what the user was trying to do" magic entirely and just better document which CLI args/stdin/etc. need to be used for each case. For example we could force --parser to only apply to stdin, and make the archivebox.main.add() function signature clearer by only accepting either a blob (stdin to parse) or url_list (CLI args never parsed). This would however break existing programs that depend on the current behavior, which I hate to do.

Maybe you can think of a better solution? Or maybe we could come up with an additional optional flag to control parsing behavior that could de-magic it somewhat? Very open to suggestions here.

Okay, I think I understand better now how it was supposed to work. The intersection of parser bugs I was hitting along with the most-links behavior of auto was making me more confused than I should have been. I’ll dig into it more soon, maybe tomorrow.

Okay, I got rid of the commit related to this and merged this pull request with the head of the dev branch so it stands on its own now.

The feedparser packages has 20 years of history and is very good at parsing RSS and Atom, so use that instead of ad-hoc regex and XML parsing. The medium_rss and shaarli_rss parsers weren't touched because they are probably unnecessary. (The special parse for pinboard is just needing because of how tags work.) Doesn't include tests because I haven't figured out how to run them in the docker development setup. Fixes ArchiveBox#1171

pirate reviewed Feb 29, 2024

View reviewed changes

Add tests for generic_rss and pinboard_rss parsers

1f828d9

jimwins force-pushed the issue-1171 branch from 1bc5c13 to 1f828d9 Compare March 1, 2024 19:23

jimwins added 3 commits March 1, 2024 11:25

Add tests for generic_rss and pinboard_rss parsers

e7119ad

Merge with latest dev

0f402df

pirate merged commit 099f7d0 into ArchiveBox:dev Mar 14, 2024
1 of 2 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use feedparser for RSS parsing #1362

Use feedparser for RSS parsing #1362

jimwins commented Feb 25, 2024 •

edited

Loading

pirate Feb 29, 2024 •

edited

Loading

jimwins Feb 29, 2024

jimwins Mar 1, 2024

Use feedparser for RSS parsing #1362

Use feedparser for RSS parsing #1362

Conversation

jimwins commented Feb 25, 2024 • edited Loading

Summary

Related issues

Changes these areas

Footnotes

pirate Feb 29, 2024 • edited Loading

Choose a reason for hiding this comment

jimwins Feb 29, 2024

Choose a reason for hiding this comment

jimwins Mar 1, 2024

Choose a reason for hiding this comment

jimwins commented Feb 25, 2024 •

edited

Loading

pirate Feb 29, 2024 •

edited

Loading