
add html download action in the registry #886

Closed
wants to merge 1 commit

Conversation

@ai-naymul (Contributor) commented Feb 27, 2025

Signed-off-by: Naymul Islam <[email protected]>
@CLAassistant commented Feb 27, 2025

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.

@pirate (Member) commented Mar 22, 2025

This seems useful. Seems worth adding SingleFile, MHTML, WARC, HAR, etc. support in the future too.

@MagMueller (Collaborator) commented

The problem is that HTML files are quite large relative to the LLM's context window, so the agent cannot handle them. What would you like to do with the HTML? If you just need to save it, we could do it similarly to saving a PDF. However, I think with the current context size of LLMs (without RAG), this is too large and not yet useful.

@pirate (Member) commented Mar 23, 2025

Yes, the intent is not to pass the HTML file to an LLM; this action is just so the LLM can download HTML as an opaque output file if the user asks for it, similar to PDF.

e.g. task='follow all the links on example.com and download the raw HTML for each'

@MagMueller (Collaborator) commented

@ai-naymul
We should not pass the entire HTML to the LLM. This seems to me like a custom function for your use-case. If you need it, you can just register it.

@MagMueller MagMueller closed this Mar 23, 2025
@ai-naymul (Contributor, author) commented

@ai-naymul We should not pass the entire HTML to the LLM. This seems to me like a custom function for your use-case. If you need it, you can just register it.

@MagMueller @pirate We are not passing the entire HTML to the LLM. The flow works like this: the agent goes to a site, and once the site has loaded, it just saves the HTML from the browser source.

For example, I was trying to aggregate a site with many paginated pages. As you mentioned, LLM context windows are limited; even though Gemini has a million-token context window, with enough pages you eventually exceed the limit, as I did, and extracting information through the agent is also time-consuming at that volume. That is where extracting data from the raw HTML is efficient, and I am now using this action for exactly that task. I thought it would be a good addition for aggregating with raw HTML, similar to #1095.

@pirate (Member) commented Mar 23, 2025

@MagMueller pointed out to me that it actually is passing it to the LLM currently because every ActionResult(extracted_content=html_content, ...) is passed during the next action in order to summarize the previous action. Even though it's not persisted in memory, it still can use up a ton of context out of the next action.

Better way to do it would be like how PDFs are handled, as a file saved directly to the current working directory, and return the path in an ActionResult #1095

However even then I think this is best moved to the examples/use-cases subdirectory so people can use it when they need it. There are many other better scraping tools out there if you really just need to download raw html.
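The "save the file to disk, return only the path" pattern described here can be sketched with the standard library alone. Note this is an illustrative stand-in: `save_html_action` and its dict return value are hypothetical placeholders for a registered browser-use action and its `ActionResult`, not the project's actual API, and the real action would get the HTML from the browser rather than take it as an argument.

```python
from pathlib import Path


def save_html_action(html_content: str, out_dir: str = ".") -> dict:
    """Write the page HTML to a file and return only a short summary.

    The large HTML payload never enters the result that gets fed back
    to the LLM for the next step; only the file path does.
    """
    out_path = Path(out_dir) / "page.html"
    out_path.write_text(html_content, encoding="utf-8")
    # Return a small summary instead of the raw HTML: this is what keeps
    # the next action's context usage tiny, mirroring how PDFs are handled.
    return {"extracted_content": f"Saved HTML to {out_path}"}
```

The key design point is in the return value: an `extracted_content` holding the raw HTML would be re-sent to the model on the next action, while a one-line path costs a handful of tokens regardless of page size.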

@ai-naymul (Contributor, author) commented

@MagMueller pointed out to me that it actually is passing it to the LLM currently because every ActionResult(extracted_content=html_content, ...) is passed during the next action in order to summarize the previous action. Even though it's not persisted in memory, it still can use up a ton of context out of the next action.

Better way to do it would be like how PDFs are handled, as a file saved directly to the current working directory, and return the path in an ActionResult #1095

However even then I think this is best moved to the examples/use-cases subdirectory so people can use it when they need it. There are many other better scraping tools out there if you really just need to download raw html.

Thanks for pointing that out. Patched it here: #1117
