Feature Request: Parse Atom RSS feeds #1171
Comments
Please do this. I'm surprised this project isn't using a proper RSS parser. I just spent the entire day writing regex to pick out the random RSS and W3 links that ArchiveBox keeps pulling out of my RSS feeds somehow. |
That's because the RSS parsers fail, hand over to the next parser, and eventually the
Here's the script I use as a workaround to parse feeds into JSON first:
#!/usr/bin/env python3
import feedparser
import sys
import json

# Parse the feed given as the first argument (a URL or a local file path)
dom = feedparser.parse(sys.argv[1])
links = []
for entry in dom.entries:
    # Join all tag terms into a single comma-separated string
    tags = ",".join(map(lambda tag: tag.term, entry.get('tags', [])))
    link = {
        'url': entry.link,
        'title': entry.title,
        'tags': tags,
        'description': entry.summary,
        # 'created': entry.published,
    }
    links.append(link)
print(json.dumps(links))
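For feeds where some entries lack a link, title, or summary, the attribute access in the script above can raise an error. Below is a minimal sketch of a more defensive conversion step. It assumes only that each entry behaves like a dict (as feedparser entries do) and reuses the field names from the script; the helper names `entry_to_link` and `entries_to_json` and the sample data are hypothetical, chosen for illustration.

```python
import json


def entry_to_link(entry):
    """Convert one dict-like feed entry into the link dict used in the script above."""
    # Tags are dict-like objects carrying a 'term' key; skip empty or missing terms.
    tags = ",".join(
        tag.get("term") for tag in entry.get("tags", []) if tag.get("term")
    )
    return {
        "url": entry.get("link", ""),
        "title": entry.get("title", ""),
        "tags": tags,
        "description": entry.get("summary", ""),
    }


def entries_to_json(entries):
    """Serialize entries to JSON, dropping any entry that has no URL."""
    links = [entry_to_link(entry) for entry in entries]
    return json.dumps([link for link in links if link["url"]])


# Hypothetical sample data standing in for feedparser's `dom.entries`.
sample_entries = [
    {
        "link": "https://example.com/a",
        "title": "A",
        "tags": [{"term": "x"}, {"term": "y"}],
        "summary": "first",
    },
    {"link": "https://example.com/b", "title": "B"},
    {"title": "entry without a link"},
]

print(entries_to_json(sample_entries))
```

With real feeds, `entries_to_json(feedparser.parse(sys.argv[1]).entries)` would replace the sample data.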
|
Wonder if there's a way to use a script like this in the scheduler... I guess not officially, would be easier to just fix the parsers if that's the way... but maybe I can modify the crontab directly to use the script. Let's see |
Wow feedparser is incredible, takes anything I throw at it. Could be an easy drop-in @pirate? |
Modified the crontab manually, and it works. I put the
The format you suggested above with the The Dockerfile file:
after which we edit the Also in this block, set the env variable |
Sorry for causing you so much extra overhead / debugging time to have to resort to this workaround @melyux, but thanks for documenting your process here for others! All my dev focus is currently on a refactor I have in progress to add Huey support to ArchiveBox, which has left a few of these relatively big issues to languish. I appreciate everyone's patience while I give some much-needed attention to the internal architecture! |
The feedparser package has 20 years of history and is very good at parsing RSS and Atom, so use that instead of ad-hoc regex and XML parsing. The medium_rss and shaarli_rss parsers weren't touched because they are probably unnecessary. (The special parser for pinboard is only needed because of how tags work.) Doesn't include tests because I haven't figured out how to run them in the docker development setup. Fixes ArchiveBox#1171
This should work now that we switched to |
@pirate I have several feeds I'd like to parse using this feature, but none of them are working. I'm using the Samples are attached (.txt added to bypass github mimetype detector) The sources are: |
What is the problem that your feature request solves
Generic Atom-based RSS feeds (see https://datatracker.ietf.org/doc/html/rfc4287) cannot be parsed by the current parsers.
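To illustrate how Atom differs from RSS 2.0: an Atom entry carries its URL in the `href` attribute of a namespaced `<link>` element rather than as element text, and its tags in `<category term="...">` elements. The sketch below extracts links from a small Atom document using only the standard library's `xml.etree.ElementTree`. It is not how ArchiveBox or feedparser do it; the sample feed and the names `parse_atom` and `atom_entry_url` are assumptions made for illustration.

```python
import json
import xml.etree.ElementTree as ET

# Namespace defined by RFC 4287 for Atom elements.
ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}

# Hypothetical sample feed, made up for this example.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example Feed</title>
  <entry>
    <title>First post</title>
    <link rel="enclosure" href="http://example.org/audio.mp3"/>
    <link rel="alternate" href="http://example.org/first"/>
    <category term="news"/>
    <category term="robots"/>
    <summary>Some text.</summary>
  </entry>
  <entry>
    <title>Second post</title>
    <link href="http://example.org/second"/>
  </entry>
</feed>"""


def atom_entry_url(entry):
    """Return the entry's 'alternate' link; a missing rel means 'alternate'."""
    for link in entry.findall("atom:link", ATOM_NS):
        if link.get("rel", "alternate") == "alternate" and link.get("href"):
            return link.get("href")
    return ""


def parse_atom(xml_text):
    """Parse Atom XML text into a list of link dicts (url, title, tags, description)."""
    root = ET.fromstring(xml_text)
    links = []
    for entry in root.findall("atom:entry", ATOM_NS):
        terms = [c.get("term", "") for c in entry.findall("atom:category", ATOM_NS)]
        links.append({
            "url": atom_entry_url(entry),
            "title": entry.findtext("atom:title", default="", namespaces=ATOM_NS),
            "tags": ",".join(t for t in terms if t),
            "description": entry.findtext("atom:summary", default="", namespaces=ATOM_NS),
        })
    return links


print(json.dumps(parse_atom(SAMPLE_FEED), indent=2))
```

A dedicated feed library handles many more cases (relative URLs, malformed XML, other feed formats), so this is only meant to show the shape of the data.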
Describe the ideal specific solution you'd want, and whether it fits into any broader scope of changes
Ideally, a library like feedparser would be used, both as a more robust solution than the current hand-rolled regex parsers, and something that already supports a wide range of feed formats.
What hacks or alternative solutions have you tried to solve the problem?
I wrote a script to use that library myself, and turn the feed info into JSON that I can pipe into archivebox add --parser json.
How badly do you want this new feature?