add html download action in the registry #886
Signed-off-by: Naymul Islam <[email protected]>
|
This seems useful. It also seems worth adding SingleFile, MHTML, WARC, HAR, etc. support in the future. |
The problem is that HTML files are quite large; therefore, the LLM's context is too small, so the agent cannot handle them. What would you like to do with the HTML? If you just need to save it, we could do it similarly to saving a PDF. However, I think with the current context of the LLM (without RAG), this is too large and not yet useful. |
Yes, the intent is not to pass the HTML file to an LLM. This action just lets the LLM download the HTML as an opaque output file if the user asks for it, similar to PDF. e.g. |
@ai-naymul |
@MagMueller @pirate We are not passing the entire HTML to the LLM. The flow works as follows: go to a site and, once the page has loaded, save the HTML from the browser source. For example, I was trying to aggregate a site with many paginated pages. As you mentioned, LLM context windows are limited. Gemini has a million-token context window, but with enough pagination it will eventually exceed the limit, as I experienced. Extracting large amounts of information through the agent is also time-consuming. That is where extracting data from the raw HTML is efficient, and I am now using it for that task. I thought this would be a good addition for tasks like aggregating with raw HTML, similar to #1095 |
@MagMueller pointed out to me that it actually is passing it to the LLM currently because every [...]. A better way to do it would be like how PDFs are handled: save the file directly to the current working directory and return the path in an [...]. However, even then I think this is best moved to the [...] |
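A rough sketch of the "save to disk, return only the path" idea described above. This is not the code from this PR or the project's actual action API: `save_html_to_file` and `describe_saved_html` are hypothetical names, and the sketch assumes the caller has already obtained the page source as a string (for instance from Playwright's `page.content()`).

```python
import re
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional
from urllib.parse import urlparse


def save_html_to_file(html: str, url: str, directory: Optional[Path] = None) -> Path:
    """Write page HTML to disk and return the file path (hypothetical helper)."""
    # Default to the current working directory, as suggested for PDFs.
    directory = directory or Path.cwd()
    # Build a filesystem-safe name from the host, e.g. "example.com" -> "example-com".
    host = urlparse(url).netloc or "page"
    slug = re.sub(r"[^a-zA-Z0-9]+", "-", host).strip("-").lower() or "page"
    # A UTC timestamp keeps repeated saves of the same site from overwriting each other
    # (at one-second resolution).
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    path = directory / f"{slug}-{stamp}.html"
    path.write_text(html, encoding="utf-8")
    return path


def describe_saved_html(path: Path) -> str:
    """Build the short message handed back to the agent: path and size only."""
    # The HTML itself is deliberately left out so it never enters the LLM context.
    return f"Saved page HTML to {path} ({path.stat().st_size} bytes)"
```

The point of the split is that only the string from `describe_saved_html` would go back to the model, so the size of the page no longer matters for the context window.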
Thanks for pointing it out. Patched it in #1117 |
@MagMueller @gregpr07 @maticzav