-
-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use feedparser for RSS parsing #1362
Conversation
The feedparser packages has 20 years of history and is very good at parsing RSS and Atom, so use that instead of ad-hoc regex and XML parsing. The medium_rss and shaarli_rss parsers weren't touched because they are probably unnecessary. (The special parse for pinboard is just needing because of how tags work.) Doesn't include tests because I haven't figured out how to run them in the docker development setup. Fixes ArchiveBox#1171
archivebox/cli/archivebox_add.py
Outdated
@@ -89,7 +89,7 @@ def main(args: Optional[List[str]]=None, stdin: Optional[IO]=None, pwd: Optional | |||
parser.add_argument( | |||
"--parser", | |||
type=str, | |||
help="Parser used to read inputted URLs.", | |||
help="Parser used for input on stdin (not valid with URLs).", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To clarify, parser still applies to URLs passed as CLI args in some cases. The behavior is subtle and maybe a bit too "magic", but the intent is to work for cases like this:
archivebox add --depth=0 --parser=rss 'https://dropbox.tech/feed'
Without needing the user to do this:
curl -s 'https://dropbox.tech/feed' | archivebox add --depth=0 --parser=rss
# or
archivebox add --depth=1 --parser=rss 'https://dropbox.tech/feed'
See this this case in particular here:
--depth=0 + --parser=rss + passed URL(s) or local file path(s) via stdin/args
#1363 (comment)
Honestly I'm still conflicted about the level of magic needed to get my original intent here working, as it introduces a ton of edge cases / unexpected results around --depth=0
and --depth=1
interacting with --parser=
.
I could be convinced that we should remove all of my "guess what the user was trying to do" magic entirely and just better document which CLI args/stdin/etc. need to be used for each case. For example we could force --parser
to only apply to stdin, and make the archivebox.main.add()
function signature clearer by only accepting either a blob
(stdin to parse) or url_list
(CLI args never parsed). This would however break existing programs that depend on the current behavior, which I hate to do.
Maybe you can think of a better solution? Or maybe we could come up with an additional optional flag to control parsing behavior that could de-magic it somewhat? Very open to suggestions here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, I think I understand better now how it was supposed to work. The intersection of parser bugs I was hitting along with the most-links behavior of auto was making me more confused than I should have been. I’ll dig into it more soon, maybe tomorrow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, I got rid of the commit related to this and merged this pull request with the head of the dev branch so it stands on its own now.
The feedparser packages has 20 years of history and is very good at parsing RSS and Atom, so use that instead of ad-hoc regex and XML parsing. The medium_rss and shaarli_rss parsers weren't touched because they are probably unnecessary. (The special parse for pinboard is just needing because of how tags work.) Doesn't include tests because I haven't figured out how to run them in the docker development setup. Fixes ArchiveBox#1171
Summary
The feedparser packages has 20 years of history and is very good at parsing RSS and Atom, so use that instead of ad-hoc regex and XML parsing.
The medium_rss and shaarli_rss parsers weren't touched because they are probably unnecessary and should be removed. (The special parse for pinboard is just needed because of how tags work.)
Related issues
Fixes #1171
Fixes #870 (probably, would need to test against a Wallabag Atom file to be certain)1
Fixes #135
Fixes #123
Fixes #106
Changes these areas
Footnotes
I tried divining the format of Wallabag's XML/Atom export by reading the code but there's a lot of abstractions and it would probably be faster to just install Wallabag and produce a sample, or faster yet for someone to provide a sample. ↩