Support for more path segments in the `finding news items` section of "HTML + XPath" #5071

SailReal · 2023-02-05T00:10:10Z

First of all, thank you for this awesome piece of software!

It would be amazing if more than one path segment could be supported in the XPath finding news items section. Something like //foo works like a charm and I'm using it a lot, but if you need to select a path such as //foo/bar/bas, it doesn't work. The same applies when providing the full path, like /foo/bar/bas or /html/body/foo/bar/bas.

The reason is that a website sometimes has multiple sections, and I do not want all bas items in the RSS feed. So I cannot simply use something like //bas: only the items under //foo/bar/bas should be selected, not those under //foo/baz/bas.

Example data:
/html/body/foo/bar/bas/1
/html/body/foo/bar/bas/2
/html/body/foo/bar/bas/...
/html/body/foo/baz/bas/1
/html/body/foo/baz/bas/2
/html/body/foo/baz/bas/...

Now as mentioned, I want only //foo/bar/bas in the output.

Currently, I think it is only possible to provide one path segment including a child in the finding news items section, right? I mean, something like //foo works but //foo/bar/bas does not, right?

To provide a real-world scenario: I want to add an RSS feed for https://www.ebay-kleinanzeigen.de/s-giessen/kettcar-daytona/k0l4710r50 where I only want the results above the "Alternative Anzeigen in der Umgebung" (alternative ads nearby) section. The solution would be //div[@class="position-relative"]/ul/li/article, because the result needs to be relative to the position-relative div; otherwise I get all articles in the result.

Alkarex · 2023-02-05T14:18:01Z

Maintainer

Hello,
With our current XPath feature, you can already use precise deep paths, combine expressions, exclude elements, etc., so there is no need to add anything there. Your //foo/bar/bas is supposed to work.

To test such things, I suggest you make a clean HTML document that you control.
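
A clean test document of this kind can be very small. The sketch below is only an illustration of the expression logic: it uses Python's standard `xml.etree.ElementTree`, which supports a subset of XPath and is not the parser FreshRSS uses. The element names come from the example data in the question; the text content is made up.

```python
import xml.etree.ElementTree as ET

# Hypothetical well-formed test page mirroring the foo/bar/bas example data.
PAGE = """
<html>
  <body>
    <foo>
      <bar><bas>wanted 1</bas><bas>wanted 2</bas></bar>
      <baz><bas>unwanted 1</bas><bas>unwanted 2</bas></baz>
    </foo>
  </body>
</html>
"""

root = ET.fromstring(PAGE)

# ElementTree paths are relative to the context element,
# so ".//" plays the role of a leading "//".
all_bas = [e.text for e in root.findall(".//bas")]
only_bar_bas = [e.text for e in root.findall(".//foo/bar/bas")]

print(all_bas)       # every bas element, from both sections
print(only_bar_bas)  # only the bas elements under foo/bar
```

If the multi-segment expression behaves as expected on a document like this but not on the real page, the problem is more likely in the page's markup than in the expression.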

First problem: If you only test on the example scenario you have given, there is no <div class="position-relative">, so anything based on that will obviously not work.

$ curl -sL 'https://www.ebay-kleinanzeigen.de/s-giessen/kettcar-daytona/k0l4710r50' | grep -i 'position-relative'

I believe this is due to a cookie portal, so you need to include custom cookies in the FreshRSS advanced settings for this feed. But based on what you write, you seem to have already gotten past this problem.
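
The same check as the curl and grep command above can be done offline on a saved copy of the response. This is a rough sketch using Python's standard `html.parser`; the two HTML snippets are invented stand-ins for a cookie-portal response and for the real listing page, and `count_class` is a hypothetical helper name.

```python
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Counts start tags whose class attribute contains a given class name."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # The class attribute may be missing or valueless, hence the "or".
        classes = (dict(attrs).get("class") or "").split()
        if self.class_name in classes:
            self.count += 1


def count_class(html_text, class_name):
    parser = ClassCounter(class_name)
    parser.feed(html_text)
    return parser.count


# Invented snippets standing in for the two kinds of response.
consent_page = '<html><body><div id="consent">Please accept cookies</div></body></html>'
real_page = '<html><body><div class="position-relative other"><ul></ul></div></body></html>'

print(count_class(consent_page, "position-relative"))
print(count_class(real_page, "position-relative"))
```

A count of zero on the fetched page suggests that the document the feed reader receives is not the one seen in the browser.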

Second problem: The HTML of that page is severely broken, with more or less random opening and closing tags: https://validator.w3.org/nu/?doc=https%3A%2F%2Fwww.ebay-kleinanzeigen.de%2Fs-giessen%2Fkettcar-daytona%2Fk0l4710r50
In particular, there are many unclosed article, div, ul, and li elements.

Which means that our HTML parser might not necessarily produce a DOM identical to what you can see in your browser.

So you need to go with some safer expressions (try to stick to elements that are properly closed), or use an extension that can clean the source prior to processing. A quick, imperfect example:

//h2[contains(.,'Alternative Anzeigen in der Umgebung')]/preceding::h2[@class="text-module-begin"]

P.S. Tested with an offline copy of the page, as I did not bother fixing the cookie thing.

mgnsk · 2023-02-06T16:25:46Z

I think I have a similar problem. I'm trying to use XPath to scrape a literal XML document.
For example:

<array>
	<object></object>
	<object>
		<array>
			<object></object>
		</array>
	</object>
</array>

The problem is that //array/object matches 3 objects, whereas I really want /array/object, which matches only the 2 objects in the top-level array.
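
The difference between the two expressions can be reproduced on the sample above. This is a minimal sketch with Python's standard `xml.etree.ElementTree`, which only supports relative paths, so the sample is wrapped in a hypothetical `<doc>` element that stands in for the XPath document node. It illustrates the XPath semantics only, not how FreshRSS parses the feed.

```python
import xml.etree.ElementTree as ET

SAMPLE = """
<array>
	<object></object>
	<object>
		<array>
			<object></object>
		</array>
	</object>
</array>
"""

# Hypothetical <doc> wrapper: it gives the paths a parent to start from,
# playing the role of the XPath document node.
doc = ET.fromstring("<doc>" + SAMPLE + "</doc>")

anywhere = doc.findall(".//array/object")  # counterpart of //array/object
top_level = doc.findall("./array/object")  # counterpart of /array/object

print(len(anywhere), len(top_level))
```

The descendant form also picks up the object inside the nested array, while the path anchored at the top only sees the two direct children.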

3 replies

Alkarex Feb 6, 2023
Maintainer

@mgnsk and /array/object did not work for you?

mgnsk Feb 6, 2023

> @mgnsk and /array/object did not work for you?

As far as I remember, it didn't. The actual XML was a bit more complex and I eventually gave up. I'll verify and test it again in a couple of hours.

Alkarex Feb 6, 2023
Maintainer

#5075

mgnsk · 2023-02-06T19:41:09Z

A simplified XML case: https://gist.githubusercontent.com/mgnsk/39ffd7f8e0a24038373a53f76c317d93/raw/0c772312121ce089bc4c14082ae54e8882298d74/freshrss_test.xml

With //array/object as the item expression and ./string[@name="title"] as the title expression, I get only a single entry, with the title Meta title.
With /array/object as the item expression and ./string[@name="title"] as the title expression, I get an empty feed.

Now the same XML but the inner array has a name attribute which is similar to the actual XML I dealt with: https://gist.githubusercontent.com/mgnsk/0aab124cfd3516cb8ccddb8d39d5c5a7/raw/e9fe30a5cdedcc736027353e08b4be994f91beb2/freshrss_test.xml

With //array/object as the item expression and ./string[@name="title"] as the title expression, I get 4 entries.
With /array/object as the item expression and ./string[@name="title"] as the title expression, I get an empty feed.

I was finally able to solve it with the following:

With //array[count(@*)=0]/object as the item expression and ./string[@name="title"] as the title expression, I get the correct 2 entries, Item 1 and Item 2.
But I am still wondering why just /array/object doesn't work.
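
For readers who want to see what the attribute filter does, here is a rough sketch in Python. The standard `xml.etree.ElementTree` module has no `count()` function, so the `count(@*)=0` predicate is emulated by checking for an empty attribute dictionary. The sample document is an assumption: its structure follows the simplified XML discussed in this thread, and the `name="related"` value on the inner arrays is invented.

```python
import xml.etree.ElementTree as ET

# Assumed sample: inner arrays carry an invented name="related" attribute.
SAMPLE = """
<array>
  <object>
    <string name="title">Item 2</string>
    <object name="meta">
      <array name="related">
        <object><string name="title">Meta title</string></object>
      </array>
    </object>
  </object>
  <object>
    <string name="title">Item 1</string>
    <object name="meta">
      <array name="related">
        <object><string name="title">Meta title</string></object>
      </array>
    </object>
  </object>
</array>
"""

root = ET.fromstring(SAMPLE)


def titles(items):
    # Title expression evaluated relative to each item.
    return [item.findtext('./string[@name="title"]') for item in items]


# root.iter("array") walks every array in the document, including the root.
all_arrays = list(root.iter("array"))

# Counterpart of //array/object: object children of any array.
unfiltered = [obj for arr in all_arrays for obj in arr.findall("object")]

# Counterpart of //array[count(@*)=0]/object: only arrays without attributes.
filtered = [obj for arr in all_arrays if not arr.attrib
            for obj in arr.findall("object")]

print(len(unfiltered), titles(filtered))
```

The unfiltered selection yields 4 items and the filtered one yields 2, matching the counts reported above. The order of results in this emulation is not meant to match a real XPath engine.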

2 replies

Alkarex Feb 6, 2023
Maintainer

Right, in your case, this is a pure XML document, which the HTML+XPath mode is not intended for. We could add such an XML+XPath mode, though. Because it is parsed as HTML, some extra tags are added around it, ending up internally with something like:

<?xml version="1.0" standalone="yes"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<?xml version="1.0" encoding="UTF-8" ??>
<html>
<body>
<array>
        <object>
                <string name="title">Item 2</string>
                <object name="meta">
                        <array>
                                <object>
                                        <string name="title">Meta title</string>
                                </object>
                        </array>
                </object>
        </object>
        <object>
                <string name="title">Item 1</string>
                <object name="meta">
                        <array>
                                <object>
                                        <string name="title">Meta title</string>
                                </object>
                        </array>
                </object>
        </object>
</array>
</body>
</html>

I do not think it is the same problem as the parent's.
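
The effect of the extra html and body wrapper on absolute paths can be sketched as follows. This is an assumption-laden illustration: `select_absolute` is a hypothetical helper written for this example (Python's `xml.etree.ElementTree` cannot start a path at the document node), the document is a shortened version of the wrapped structure, and the longer /html/body/array/object path was not tested in this thread; it simply follows from that structure.

```python
import xml.etree.ElementTree as ET

# Shortened version of the wrapped document: the original XML ends up
# inside <html><body>.
WRAPPED = """
<html>
<body>
<array>
  <object><string name="title">Item 2</string></object>
  <object><string name="title">Item 1</string></object>
</array>
</body>
</html>
"""

root = ET.fromstring(WRAPPED)


def select_absolute(root, path):
    """Hypothetical helper: evaluate a simple absolute path such as /a/b/c.

    The first segment is compared with the root tag, and the remaining
    segments are resolved as child steps below the root element.
    """
    first, _, rest = path.lstrip("/").partition("/")
    if root.tag != first:
        return []
    return root.findall("./" + rest) if rest else [root]


print(len(select_absolute(root, "/array/object")))
print(len(select_absolute(root, "/html/body/array/object")))
```

Because the root element is html rather than array, the short absolute path selects nothing, while the path that includes the wrapper elements reaches the two top-level objects.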

Alkarex Feb 6, 2023
Maintainer

#5075
