Looking for advice on HTML + XPath + JSON scraping new website #7352

dahlbergc · 2025-02-22T22:06:50Z

dahlbergc
Feb 22, 2025

I use FreshRSS XPath scraping to create feeds of upcoming events from the websites of local venues in my city. This worked amazingly well until recently, when one of the venues changed its website's event calendar from static HTML to data rendered by JavaScript.

https://www.theshell.org/performances/rady-shell-calendar/

Unfortunately, the entire script field is too long to post here, but the event data I'm trying to grab is buried within <script type="text/javascript">. All events are nested under n.performances in the following format.

		n.performances=[
			{
				performanceId:9443,
				productionSeasonId:9442,
				title:"Vänskä Conducts Sibelius and Beethoven",
				imageUrl:"/media/3wphxuck/feb28_300.jpg",
				location:"Jacobs Music Center",
				badgeId:"0",
				moreDetailsUrl:"/performances/vaenskae-sibelius-and-beethoven/",
				performanceDate:"2025-02-28T11:00:00-08:00",
				performanceType:"Jacobs Masterworks",
				performanceTypeBGColor:"#ffd532",
				performanceTypeFGColor:"#000",
				seasonDescription:"",
				buyButtonLink:null
			},

I'm trying to grab title, imageUrl, location, moreDetailsUrl, and performanceDate fields out of each performance listed in the script.

I tried configuring the feed source using HTML + XPath + JSON and pointing it at //script[@type="text/javascript"], but I'm unsure how to configure the remaining fields to pull the event data I need. Nothing I've tried works, and the feed is not pulling any events.

The FreshRSS log output isn't giving me much information besides telling me the parsing failed:

192.168.30.10 - admin [22/Feb/2025:13:57:30 -0800] "GET /i/?c=javascript&a=nbUnreadsPerFeed HTTP/1.1" 304 - "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/133.0.0.0 Safari/537.36 Edg/133.0.0.0"
192.168.30.10 - admin [22/Feb/2025:13:58:52 -0800] "GET /i/?c=javascript&a=nbUnreadsPerFeed HTTP/1.1" 304 - "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/133.0.0.0 Safari/537.36 Edg/133.0.0.0"
FreshRSS[112709]: FreshRSS GET html https://www.theshell.org/performances/rady-shell-calendar/
FreshRSS[112709]: [admin] [Sat, 22 Feb 2025 13:58:54 -0800] [warning] --- HTML+XPath+JSON parsing failed for [https://www.theshell.org/performances/rady-shell-calendar/]
192.168.30.10 - admin [22/Feb/2025:13:58:53 -0800] "GET /i/?c=feed&a=actualize&id=288 HTTP/1.1" 302 - "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/133.0.0.0 Safari/537.36 Edg/133.0.0.0"
192.168.30.10 - admin [22/Feb/2025:13:58:54 -0800] "GET /i/?get=f_288&rid=67ba489d80b38 HTTP/1.1" 200 28986 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/133.0.0.0 Safari/537.36 Edg/133.0.0.0"

Is it possible for FreshRSS to scrape the event data from this site calendar, and if so, could someone help point me in the right direction?

Answered by Alkarex

Feb 24, 2025

@dahlbergc See #7369. Tests welcome. This is because the pages you are trying to process contain multiple JSON fragments, which was not supported so far, but I have just implemented it.


Alkarex · 2025-02-23T14:50:49Z

Alkarex
Feb 23, 2025
Maintainer

Hello,
As you write, the relevant JSON is buried in a larger JavaScript section, so this is not a situation that is readily supported.
At the moment, two options would be to make a little FreshRSS extension, or implement a script in https://github.com/RSS-Bridge/rss-bridge.
We could consider adding a new mode, HTML+Regex+JSON, to support such use cases.
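To make the regex-then-JSON idea concrete, here is a minimal Python sketch that runs outside FreshRSS. It is not FreshRSS or RSS-Bridge code. The `SCRIPT` sample is a shortened stand-in for the inline script quoted in the question, the closing `];` is an assumption about how the array ends on the real page, and `extract_performances` and `quote_keys` are hypothetical helper names. Because the data is a JavaScript object literal with unquoted keys rather than strict JSON, the keys are quoted before parsing.

```python
import json
import re

# Shortened stand-in for the page's inline script (assumed to end with "];").
SCRIPT = '''
n.performances=[
    {
        performanceId:9443,
        title:"Vänskä Conducts Sibelius and Beethoven",
        imageUrl:"/media/3wphxuck/feb28_300.jpg",
        location:"Jacobs Music Center",
        moreDetailsUrl:"/performances/vaenskae-sibelius-and-beethoven/",
        performanceDate:"2025-02-28T11:00:00-08:00",
        buyButtonLink:null
    }
];
'''

# Matches either a whole double-quoted string (left untouched) or a bare
# identifier that is directly followed by a colon (an unquoted key).
TOKEN = re.compile(r'"(?:\\.|[^"\\])*"|([A-Za-z_$][\w$]*)(?=\s*:)')


def quote_keys(literal: str) -> str:
    """Wrap unquoted object keys in double quotes, skipping string values."""
    return TOKEN.sub(
        lambda m: f'"{m.group(1)}"' if m.group(1) else m.group(0),
        literal,
    )


def extract_performances(script: str) -> list:
    """Pull the array assigned to n.performances and parse it as JSON."""
    match = re.search(r'n\.performances\s*=\s*(\[.*?\])\s*;', script, re.DOTALL)
    if match is None:
        return []
    return json.loads(quote_keys(match.group(1)))


for event in extract_performances(SCRIPT):
    print(event["performanceDate"], event["title"], event["location"])
```

This sketch only handles double-quoted strings and would fail on trailing commas, single-quoted strings, or other JavaScript syntax that JSON does not allow, so a real page may need additional clean-up.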

11 replies

dahlbergc Feb 23, 2025
Author

Thank you for taking the time to respond to my question, and also for all the time and effort you put into building this amazing application. Up to this point I had only used straight XPath scraping, so I wasn't sure if existing tools would work to parse the JavaScript code on the site.

Unfortunately, I don't possess the know-how to implement regex functionality in FreshRSS's scraping tools or to write a new tool within RSS-Bridge. I'll keep my eyes open for any future updates to FreshRSS that will allow me to build feeds off this kind of data.

dahlbergc Feb 24, 2025
Author

@Alkarex,

I found another site that contains most of the upcoming event info I'm looking for from the same venue. This site appears to publish JSON for each individual event under <script type="application/ld+json">, although all attempts to scrape it using HTML + JSON or HTML + XPath return no results. Any thoughts on what could be blocking it here?

https://san-diego.events/venue/the-rady-shell-at-jacobs-park-events/

I'm scratching my head because simple XPath scraping should work on //li[@class="date-row"] or //ul[@class="dates-list"]/li. Any pointers you can provide would be much appreciated.

I only ask again because being able to successfully build feeds from https://san-diego.events would unlock all of the upcoming event data I need for a couple of missing venues, and I'd like to build feeds for as many local spots as I can. With the newly released functionality for sorting by article date, my Local Events category will be even more useful.

Alkarex Feb 24, 2025
Maintainer

@dahlbergc See #7369. Tests welcome. This is because the pages you are trying to process contain multiple JSON fragments, which was not supported so far, but I have just implemented it.
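For readers unfamiliar with the "multiple JSON fragments" situation, the following Python sketch shows the general idea of collecting every <script type="application/ld+json"> block on a page and parsing each one separately. It is only an illustration using the standard library, not how #7369 is implemented in FreshRSS. The `PAGE` markup, the event names, and the dates are made-up placeholders, and `JsonLdCollector` and `collect_events` are hypothetical names.

```python
import json
from html.parser import HTMLParser

# Made-up page: one JSON-LD fragment per list item, as described above.
PAGE = '''
<ul class="dates-list">
  <li class="date-row">
    <script type="application/ld+json">{"@type": "Event", "name": "Example Concert A", "startDate": "2025-03-01T19:30:00-08:00"}</script>
  </li>
  <li class="date-row">
    <script type="application/ld+json">{"@type": "Event", "name": "Example Concert B", "startDate": "2025-03-08T19:30:00-08:00"}</script>
  </li>
</ul>
'''


class JsonLdCollector(HTMLParser):
    """Collects and parses the content of every JSON-LD script element."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.fragments = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.fragments.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # Skip a malformed fragment instead of failing the page.


def collect_events(html: str) -> list:
    """Return one parsed object per JSON-LD fragment found in the HTML."""
    parser = JsonLdCollector()
    parser.feed(html)
    parser.close()
    return parser.fragments


for event in collect_events(PAGE):
    print(event["startDate"], event["name"])
```

Parsing each fragment on its own, then treating the results as a single list, is what lets one page yield several feed items even though no single script element holds all of the events.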

Answer selected by dahlbergc

dahlbergc Feb 24, 2025
Author

@Alkarex You are a legend!! Thank you so much!

I will be able to test this out on my end later tonight. Do I just need to pull the freshrss:edge container to get the update? What version should I see reflected under Configuration > About > Version?

Alkarex Feb 24, 2025
Maintainer

Thanks for the kind words :-)
#7369 is not yet merged in edge (needs some more testing). To test this branch, you can modify your Docker Compose like so:

  freshrss:
    image: freshrss/freshrss:7369
    build:
      context: https://github.com/Alkarex/FreshRSS.git#support-json-multiple-fragments
      dockerfile: Docker/Dockerfile
    ...

dahlbergc Feb 25, 2025
Author

@Alkarex I spun up a new container and I can confirm that I'm able to build functional feeds off of the JSON fragments with this new version. Works like a charm!

I noticed in #7369 you mentioned this PR could cause breaking changes in existing XPath feeds. This new version didn't have any negative impact on my existing HTML + XPath feeds from what I could tell. After clearing articles and cache I was able to successfully reload all articles without issue from my most complex HTML + XPath 'upcoming event' feeds which use a couple operators to avoid pulling unwanted data from the site.

The only issue I ran into with this version was auth_openidc missing from the build, but I was able to roll back my auth config to web form for this test.

Please let me know if there's anything else I can test.

One last suggestion... I understand that sorting articles by article date vs. published date is a new feature that was recently added to the application. This could be very handy if users had the ability to configure the article sorting preference at the category level. Categories like Local News I would want organized by published date, but Local Events or Local Sports I would want to see listed by article date. I'm not sure if that's on the roadmap or not. I'm happy to open a feature request if needed.

Alkarex Feb 25, 2025
Maintainer

> this PR could cause breaking changes in existing XPath feeds

Only for the HTML+XPath+JSON mode.

> The only issue I ran into with this version was auth_openidc missing from the build

Ah, then use dockerfile: Docker/Dockerfile instead of the Alpine version.
