Shaarli RSS parsing falls back to full-text and imports unneeded URLs from metadata fields #135
The same is true e.g. for wallabag feeds that contain links in the full-text RSS feed with (maybe) broken HTML.
These are being parsed as (example number 4 from above)...
leading to invalid links / 404s.
Thanks for reporting this. I think the fixes will be simple:
The intended behavior is to take only the actual page to archive from each RSS entry. I'm favoring usability over completeness here: I don't want archives filled with garbage URLs on every import, and if a user wants to add those URLs manually, they can force full-text parsing by passing the URLs individually. It might be worth just doing this ticket first, as it will solve both these problems: #123
Thanks for looking into it. I agree, switching to a different / more robust parser should solve all this. Not sure how you would exclude the full text / the links therein; I guess you would need to limit the parsing scope.
Full-text parsing is only ever used as a fallback if all the other parsing methods fail, so once the RSS parser is working again it should automatically ignore those other links (the RSS parser knows to take only the main ones).
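The fallback order described here can be sketched roughly like this. The function names and regexes are illustrative, not ArchiveBox's actual API: structured parsers run first, and full-text URL scraping only kicks in when all of them fail.

```python
import re

def parse_rss(text):
    # A structured parser returns only each entry's main <link>.
    links = re.findall(r'<link>(.*?)</link>', text)
    if not links:
        raise ValueError('no RSS links found')
    return links

def parse_plain_text(text):
    # Last-resort fallback: grab every URL-looking string.
    return re.findall(r'https?://[^\s<>"]+', text)

def parse_links(text):
    # Try structured parsers first; fall back to full-text scraping
    # only if every structured parser fails.
    for parser in (parse_rss, parse_plain_text):
        try:
            return parser.__name__, parser(text)
        except Exception:
            continue
    return None, []
```

With this ordering, a well-formed feed never reaches the plain-text scraper, so schema and template URLs in the raw XML are never imported.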
I fixed the regex in c37941e, give the latest master commit a try.
Thanks - it should work according to the code change (I reviewed it). I will give it a try.
Ok RSS parser is fixed. There's now a dedicated RSS parser for the Shaarli export format. Give it a try with the latest version of master. |
Hey there,
The result is this XML file in
"Garbage" links like
Very strange, you can see what the output now says. I just ran it with exactly the XML you provided above, and it parsed it correctly...
I suspect there's some difference between our setups that's causing this, if you have a moment, do you mind uncommenting this line, and running it again to see why the Shaarli parser fails:
Thanks for helping debug this.
Question - we're talking about the
Ah sorry, I forgot to push it to master! It was just on my local branch. Try pulling master and uncommenting that line now.
Took me a while, sorry. Here we go. I uncommented the line, found at line 75 in parse.py.
It looks like there's an extra colon in the Shaarli timestamp that is not being parsed correctly. Full output:
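The "extra colon" is characteristic of ISO-8601 offsets like +01:00, which strptime's %z directive rejected before Python 3.7. A minimal sketch of the failure mode and a common workaround (not ArchiveBox's actual fix, and the date value is a made-up example):

```python
from datetime import datetime

def parse_shaarli_date(s):
    # Shaarli emits ISO-8601 dates with a colon in the UTC offset,
    # e.g. "2019-03-08T12:00:00+01:00". strptime's %z only accepts
    # "+0100" on Python < 3.7, so strict parsing raises ValueError.
    try:
        return datetime.strptime(s, '%Y-%m-%dT%H:%M:%S%z')
    except ValueError:
        # Workaround: drop the colon in the offset ("+01:00" -> "+0100").
        return datetime.strptime(s[:-3] + s[-2:], '%Y-%m-%dT%H:%M:%S%z')
```

On Python 3.7+ the first branch succeeds directly; on older interpreters the fallback normalizes the offset before parsing.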
The same is true for
Input XML (snippet):
At a conference right now and have a busy week ahead, so apologies if I don't get around to fixing this for a bit.
No rush at all (at least for me). Thanks for doing all this btw. Let me know if you need any additional input.
A redacted copy of your export file would help.
Sorry, I don't have that anymore. But it was basically just the Shaarli demo with one link added to it.
@mawmawmawm I think I fixed it (in eff0100), pull the latest master and give it a shot. Comment if it's still broken and I'll reopen the issue.
Sorry, it's still happening for me after a full rebuild... It's still importing Shaarli as plain text.
I just ran the latest master on the sample Shaarli export you provided in #135 (comment) and it worked as expected (imported 4 links and parsed as Shaarli RSS format). If the latest master is still failing for you, post your export here, as I need it to be able to debug the parsing.
I tried an RSS import from wallabag using latest master 4a7f1d5:

docker-compose run archivebox /bin/archive https://app.wallabag.it/jeanregisser/Hb3M3PHiPfZYvra/all.xml
[*] [2019-03-07 11:38:16] Downloading https://app.wallabag.it/jeanregisser/Hb3M3PHiPfZYvra/all.xml
    > /data/sources/app.wallabag.it-1551958696.txt
[*] [2019-03-07 11:38:16] Parsing new links from output/sources/app.wallabag.it-1551958696.txt...
    > Adding 1 new links to index (parsed import as RSS)
[...] app.wallabag.it-1551958696.txt

Let me know if you need more info.
Sorry for the delay, just fixed this @jeanregisser in 58c9b47. Pull the latest master and give it a try; comment back here if it doesn't work and I'll reopen the ticket. The issue was that wallabag adds a bunch of newlines between the RSS items, which broke my crappy parsing code. @mawmawmawm there have been lots of parser fixes since my last comment here, can you also give the latest master a shot and report back?
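The newline problem described here is easy to reproduce with a toy example. This is a hypothetical sketch of the failure mode, not ArchiveBox's actual code: line-anchored matching silently finds nothing once extra blank lines separate the tags, while a whitespace-tolerant match keeps working.

```python
import re

def split_items_fragile(xml):
    # Line-based matching breaks when extra newlines separate the tags:
    # it only finds items whose markup sits on a single line.
    return re.findall(r'^<item>.*</item>$', xml, re.MULTILINE)

def split_items_robust(xml):
    # Matching on the tags themselves, with DOTALL so "." spans
    # newlines, tolerates any amount of whitespace between items.
    return re.findall(r'<item>(.*?)</item>', xml, re.DOTALL)
```

Feeds that pretty-print or pad their XML (as wallabag does) defeat the fragile version while the robust version still finds every item.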
Sorry for the late reply - tried it 3 days ago and it was working fine except for the
This error still persists for me. I have Shaarli v0.10.4 (latest) and ArchiveBox master from git. Shaarli produces, for example, the following XML (original, but domain redacted):
ArchiveBox correctly imports the gcemetery.co link once, but it also imports the shaarli.example.com link once.
Hey @pirate, I just discovered ArchiveBox - that's awesome and a great extension to Shaarli! Unfortunately the problem persists. I'm running the dev branch of Shaarli. Also, besides the Shaarli links, I would not expect w3.org and purl.org to appear.
w3.org and purl.org are expected in full-text parsing mode (which it's falling back to due to a bug) because they are linked to in the RSS even though the links aren't visible. They won't be archived multiple times, so I recommend leaving them for now and ignoring those entries. I've re-opened the issue to track fixing it; PRs to fix it are welcome.
Is this issue still relevant?
Yes, it hasn't been fixed yet. PRs are welcome; I haven't gotten to it myself as I don't use Shaarli.
I'm also getting the unnecessary links (like
Yes: because it falls back to URL parsing in plain-text mode, it archives every single string that looks like a URL. Using a proper RSS parser library to fix the RSS parser bugs should result in not importing those w3 Atom schema reference URLs.
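To illustrate the point about a proper parser: with a real XML parser, a namespace URL like http://www.w3.org/2005/Atom only ever appears in xmlns declarations and is never mistaken for a link. A minimal sketch using the standard library (the sample feed and function name are invented for illustration):

```python
import xml.etree.ElementTree as ET

# ElementTree qualifies tag names with the namespace URI in braces.
ATOM = '{http://www.w3.org/2005/Atom}'

def atom_entry_links(xml_text):
    # Return only each entry's <link href="...">; <id> values,
    # namespace declarations, and schema URLs are ignored entirely.
    root = ET.fromstring(xml_text)
    return [
        link.get('href')
        for entry in root.iter(ATOM + 'entry')
        for link in entry.findall(ATOM + 'link')
    ]
```

A dedicated feed library (e.g. feedparser) would handle malformed feeds more gracefully, but even this stdlib approach never emits w3.org or purl.org URLs.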
It looks like Shaarli feeds are not being parsed correctly and markup is being included in the link structure (much like ticket #134 for Pocket). Also, it looks like Shaarli detail and tag pages are being parsed as source links, making the import much slower and leading to clutter in the archive.
You can use the public Shaarli demo to reproduce this.
There's a demo (user: demo / password: demo) running at https://demo.shaarli.org/.
The Atom feed then looks e.g. like this (with just one link; this is what's being parsed as the input file):
Note that ArchiveBox wants to include 8 links from this, most likely because 8 instances of http:// / https:// were found (that's just my speculation). However, the expected behaviour should be that only the source link is parsed/added, not the Shaarli detail pages like https://demo.shaarli.org/?cEV4vw which contain nothing but the actual link to the source (again). IMO that doesn't make sense. It's even "worse" if a link has tags, because every tag will then lead to a new link being crawled.

Command used: docker-compose exec archivebox /bin/archive https://demo.shaarli.org/?do=atom
(note the </id> at the end of the links)
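The over-inclusion described above can be demonstrated with blind URL scraping on a small Atom entry. The snippet below is a hypothetical fragment modeled on the Shaarli demo feed, not the actual export: the <id> is a detail page and the <category> scheme is a tag page, and a bare URL regex picks up all of them.

```python
import re

# Hypothetical Atom entry: the <id> is a Shaarli detail page and the
# <category> scheme is a tag page; only the <link href> is the real target.
entry = '''
<entry>
  <id>https://demo.shaarli.org/?cEV4vw</id>
  <link href="https://example.com/article"/>
  <category term="news" scheme="https://demo.shaarli.org/?searchtags=news"/>
</entry>
'''

# Blind URL scraping finds every URL-like string: the detail page,
# the real article, and the tag page.
urls = re.findall(r'https?://[^\s<>"]+', entry)

# Only the <link href> should actually be archived.
wanted = re.findall(r'<link href="([^"]+)"', entry)
```

With several tags per entry, each entry multiplies into many scraped URLs, which matches the "8 links from one entry" behaviour reported above.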