add html download action in the registry #886

ai-naymul · 2025-02-27T08:09:11Z

If users want to download the HTML of the current page, they can save it to their preferred location.
@MagMueller @gregpr07 @maticzav

Signed-off-by: Naymul Islam <[email protected]>

CLAassistant · 2025-02-27T08:09:17Z

Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

pirate · 2025-03-22T08:36:40Z

This seems useful. It also seems worth adding SingleFile, MHTML, WARC, HAR, etc. support in the future.

MagMueller · 2025-03-23T02:15:07Z

The problem is that HTML files are quite large; therefore, the LLM's context is too small, so the agent cannot handle them. What would you like to do with the HTML? If you just need to save it, we could do it similarly to saving a PDF. However, I think with the current context of the LLM (without RAG), this is too large and not yet useful.

pirate · 2025-03-23T02:39:09Z

Yes, the intent is not to pass the HTML file to an LLM. This action is just so the LLM can download HTML as an opaque output file if the user asks for it, similar to PDF.

e.g. task='follow all the links on example.com and download the raw HTML for each'

MagMueller · 2025-03-23T03:41:39Z

@ai-naymul
We should not pass the entire HTML to the LLM. This seems to me like a custom function for your use-case. If you need it, you can just register it.

ai-naymul · 2025-03-23T09:15:03Z

> @ai-naymul We should not pass the entire HTML to the LLM. This seems to me like a custom function for your use-case. If you need it, you can just register it.

@MagMueller @pirate We are not passing the entire HTML to the LLM. The flow works as follows: go to a site and, once it has loaded, save the HTML from the browser's page source.

For example: I was trying to aggregate a site with many paginated pages. As you mentioned, LLM context windows are limited. Gemini has a million-token context window, but with enough paginated pages the agent eventually exceeds the context limit, as I experienced. Extracting information with an agent at that scale is also time-consuming. That is where extracting data from the HTML is efficient, and I am now using it for that task. I thought this would be a good addition for tasks like aggregating with raw HTML, similar to #1095.
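
The offline extraction described above could look roughly like the following. This is only a minimal sketch, assuming the page HTML has already been saved and read back as a string. `LinkCollector` and `extract_links` are hypothetical names used for illustration, not part of this PR, and the sketch uses only Python's standard `html.parser` module.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper: collects href values from <a> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html: str) -> list[str]:
    """Return every link target found in the saved HTML, in document order."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links
```

Because this runs on the saved file, no LLM context is spent on the HTML itself, and pagination links can be followed with ordinary code.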

pirate · 2025-03-23T10:18:53Z

@MagMueller pointed out to me that it actually is passing it to the LLM currently because every ActionResult(extracted_content=html_content, ...) is passed during the next action in order to summarize the previous action. Even though it's not persisted in memory, it still can use up a ton of context out of the next action.

Better way to do it would be like how PDFs are handled, as a file saved directly to the current working directory, and return the path in an ActionResult #1095

However even then I think this is best moved to the examples/use-cases subdirectory so people can use it when they need it. There are many other better scraping tools out there if you really just need to download raw html.
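
A minimal sketch of the save-to-file approach described above, assuming the HTML string has already been read from the page. The `ActionResult` dataclass here is a hypothetical stand-in so the snippet runs on its own, and `save_html` and its filename scheme are illustrative assumptions rather than the code in this PR.

```python
import re
from dataclasses import dataclass
from pathlib import Path
from typing import Optional
from urllib.parse import urlparse


@dataclass
class ActionResult:
    """Hypothetical stand-in for the library's ActionResult."""
    extracted_content: Optional[str] = None
    include_in_memory: bool = False


def save_html(html: str, url: str, directory: str = ".") -> ActionResult:
    """Write the HTML to disk and return only a short message with the path."""
    # Assumed naming scheme: sanitized host name plus ".html".
    host = urlparse(url).netloc or "page"
    filename = re.sub(r"[^A-Za-z0-9.-]", "_", host) + ".html"
    path = Path(directory) / filename
    path.write_text(html, encoding="utf-8")
    # Only the path reaches the next step, never the HTML body itself.
    return ActionResult(extracted_content=f"Saved HTML of {url} to {path}")
```

The point of the design is that `extracted_content` stays a few dozen characters long regardless of page size, so the next action's context is not consumed by the HTML.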

ai-naymul · 2025-03-23T20:20:14Z

> @MagMueller pointed out to me that it actually is passing it to the LLM currently because every ActionResult(extracted_content=html_content, ...) is passed during the next action in order to summarize the previous action. Even though it's not persisted in memory, it still can use up a ton of context out of the next action.
>
> Better way to do it would be like how PDFs are handled, as a file saved directly to the current working directory, and return the path in an ActionResult #1095
>
> However even then I think this is best moved to the examples/use-cases subdirectory so people can use it when they need it. There are many other better scraping tools out there if you really just need to download raw html.

Thanks for pointing it out. Patched it here: #1117

add html download action

3f68113

Signed-off-by: Naymul Islam <[email protected]>

pirate approved these changes Mar 22, 2025

pirate mentioned this pull request Mar 23, 2025

Feature to save webpage as PDF to given path #1095

Merged

MagMueller closed this Mar 23, 2025

ai-naymul mentioned this pull request Mar 23, 2025

add the html download feature #1117

Merged
