Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use COOKIES_FILE to fetch page titles #1364

Merged
merged 5 commits into from
Mar 14, 2024

Conversation

benmuth
Copy link
Contributor

@benmuth benmuth commented Feb 27, 2024

Summary

This PR lets the title extractor use the COOKIES_FILE, if available. This helps avoid extracting the titles of captcha or login pages.

Related issues

#761

Changes these areas

  • Bugfixes
  • Feature behavior
  • Command line interface
  • Configuration options
  • Internal architecture
  • Snapshot data layout on disk

@pirate pirate added touches: configuration why: functionality Intended to improve ArchiveBox functionality or features status: wip Work is in-progress / has already been partially completed type: enhancement expected: next release labels Feb 29, 2024
@pirate
Copy link
Member

pirate commented Feb 29, 2024

Down to merge this if you've tested it! Just change it to Ready for Review when you're ready :)

@benmuth
Copy link
Contributor Author

benmuth commented Mar 2, 2024

I've been testing it and I'm actually having trouble verifying that this works. I thought it was working when I opened this PR, but it seems I was wrong. For my test case, the title isn't correct if the title extractor is the only enabled extractor. It is correct when wget is enabled, so the cookie file is being read correctly by ArchiveBox, but my fix here doesn't seem to be working. Not sure where it's failing yet.

@benmuth
Copy link
Contributor Author

benmuth commented Mar 5, 2024

After throwing some stuff at the wall, I got to a solution that works for me. For some reason, the cookie file only seems to work with requests when I set each cookie one by one, like

        for cookie in cookie_jar:
            session.cookies.set(cookie.name, cookie.value, domain=cookie.domain, path=cookie.path)

It didn't work for me when I tried session.cookies = cookie_jar or session.cookies.update(cookie_jar), and from the docs I can't tell why. If anyone could tell me what I'm missing, I'd appreciate it.

@benmuth benmuth marked this pull request as ready for review March 5, 2024 08:10
@pirate pirate merged commit 48f4b12 into ArchiveBox:dev Mar 14, 2024
1 of 2 checks passed
@benmuth benmuth deleted the title-cookies-file branch March 15, 2024 23:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
expected: next release status: wip Work is in-progress / has already been partially completed touches: configuration why: functionality Intended to improve ArchiveBox functionality or features
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants