COOKIES_FILE isn't used when fetching page titles, leading to saving captcha-page titles like "Before you continue to YouTube..." #761

Open
dansbandit opened this issue Jun 5, 2021 · 5 comments
Labels
good first ticket help wanted size: easy status: backlog Work is planned someday but is not the highest priority at the moment why: functionality Intended to improve ArchiveBox functionality or features

@dansbandit

dansbandit commented Jun 5, 2021

Describe the bug

The saved title becomes 'Before you continue to YouTube' instead of the video title, because YouTube redirects to a cookie consent form. This could be solved if a cookie file could be passed to the curl command that is run:

["curl", "--silent", "--location", "--compressed", "--max-time", "60", "--user-agent", "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.61 Safari/537.36 ArchiveBox/0.6.2 (+https://github.com/ArchiveBox/ArchiveBox/) curl/curl 7.76.0 (amd64-portbld-freebsd12.2)", "https://www.youtube.com/watch?v=aP8sRCun63M"]
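A minimal sketch of the suggestion above, assuming the cookies live in a Netscape-format file. curl's `--cookie` option reads cookies from a file when its argument contains no `=`. The helper name `build_curl_cmd` and its parameters are hypothetical and not part of ArchiveBox.

```python
from typing import List, Optional


def build_curl_cmd(url: str, user_agent: str, timeout: int = 60,
                   cookies_file: Optional[str] = None) -> List[str]:
    """Build a curl argument list like the one above (hypothetical helper)."""
    cmd = [
        "curl", "--silent", "--location", "--compressed",
        "--max-time", str(timeout),
        "--user-agent", user_agent,
    ]
    if cookies_file:
        # curl treats a --cookie argument without "=" as a file to read cookies from
        cmd += ["--cookie", cookies_file]
    cmd.append(url)
    return cmd
```

Called with `cookies_file="/data/cookies.txt"`, this yields the same command as above with `--cookie /data/cookies.txt` inserted before the URL.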

Steps to reproduce

  1. archivebox add https://www.youtube.com/watch?v=aP8sRCun63M
  2. Title becomes 'Before you continue to YouTube' when it should be 'ArchiveBox'

Screenshots or log output

N/A

ArchiveBox version

ArchiveBox v0.6.2
Cpython FreeBSD FreeBSD-12.2-RELEASE-p6-amd64-64bit-ELF amd64
IN_DOCKER=False DEBUG=False IS_TTY=True TZ=UTC SEARCH_BACKEND_ENGINE=ripgrep

[i] Dependency versions:
 √  ARCHIVEBOX_BINARY     v0.6.2          valid     ./.local/bin/archivebox                                                     
 √  PYTHON_BINARY         v3.7.10         valid     /usr/local/bin/python3.7                                                    
 √  DJANGO_BINARY         v3.1.12         valid     ./.local/lib/python3.7/site-packages/django/bin/django-admin.py             
 √  CURL_BINARY           v7.76.0         valid     /usr/local/bin/curl                                                         
 √  WGET_BINARY           v1.21           valid     /usr/local/bin/wget                                                         
 √  NODE_BINARY           v14.16.1        valid     /usr/local/bin/node                                                         
 √  SINGLEFILE_BINARY     v0.3.13         valid     ./node_modules/single-file/cli/single-file                                  
 √  READABILITY_BINARY    v0.1.0          valid     ./node_modules/readability-extractor/readability-extractor                  
 √  MERCURY_BINARY        v1.0.0          valid     ./node_modules/@postlight/mercury-parser/cli.js                             
 √  GIT_BINARY            v2.31.1         valid     /usr/local/bin/git                                                          
 √  YOUTUBEDL_BINARY      v2021.05.16     valid     /home/archivebox/.local/bin/youtube-dl                                      
 √  CHROME_BINARY         v90.0.4430.212  valid     /usr/local/bin/chrome                                                       
 √  RIPGREP_BINARY        v12.1.1         valid     /usr/local/bin/rg                                                           

[i] Source-code locations:
 √  PACKAGE_DIR           23 files        valid     ./.local/lib/python3.7/site-packages/archivebox                             
 √  TEMPLATES_DIR         3 files         valid     ./.local/lib/python3.7/site-packages/archivebox/templates                   
 -  CUSTOM_TEMPLATES_DIR  -               disabled                                                                              

[i] Secrets locations:
 √  CHROME_USER_DATA_DIR  1 files         valid     ./~/.config/chromium                                                        
 -  COOKIES_FILE          -               disabled                                                                              

[i] Data locations:
@Glutamat42

ArchiveBox provides a way to add a cookies file. I use the Docker image, and there I added the following environment variable for that: COOKIES_FILE=/data/cookies.txt

I think it is only used for media and wget.

@dansbandit
Author

> ArchiveBox provides a way to add a cookies file. I use the Docker image, and there I added the following environment variable for that: COOKIES_FILE=/data/cookies.txt
>
> I think it is only used for media and wget.

Yes, I've tried that environment variable, and it seems that it doesn't affect the title.

@pirate
Member

pirate commented Jun 18, 2021

Unfortunately, the cookies file is not applied when fetching the title, so there's no easy workaround right now until we push a fix to use the cookies in download_url() (see archivebox/extractors/title.py).
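One possible shape for such a fix, sketched with the standard library only. The real `download_url()` signature and HTTP client in ArchiveBox may differ; `load_cookie_jar` and `fetch_html` are hypothetical names, and the cookies file is assumed to be in Netscape `cookies.txt` format.

```python
import http.cookiejar
import urllib.request
from typing import Optional


def load_cookie_jar(cookies_file: Optional[str]) -> http.cookiejar.CookieJar:
    """Return a jar populated from a Netscape-format cookies.txt, or an empty jar."""
    if not cookies_file:
        return http.cookiejar.CookieJar()
    jar = http.cookiejar.MozillaCookieJar(cookies_file)
    # ignore_discard=True keeps session cookies (those without an expiry)
    jar.load(ignore_discard=True)
    return jar


def fetch_html(url: str, cookies_file: Optional[str] = None, timeout: int = 60) -> str:
    """Download a page, sending any matching cookies from cookies_file."""
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(load_cookie_jar(cookies_file))
    )
    with opener.open(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

The cookie jar only sends cookies whose domain and path match the requested URL, so a single file can cover several sites.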

You'll have to edit the titles manually in the Admin to fix them, or try to stay under the rate limits that YouTube uses so that you're not throttled and served captcha pages. You can always click Pull Title in the Admin UI to force a re-fetch of the title.
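To find which snapshots need that manual fix, saved titles could be checked against the known consent-page title. This is a rough sketch, not ArchiveBox code: the helper names are hypothetical, and the prefix list is an assumption seeded only with the title reported in this issue.

```python
from html.parser import HTMLParser
from typing import Optional

# Only the title reported in this issue; extend as needed (assumption).
CONSENT_TITLE_PREFIXES = ("Before you continue to YouTube",)


class _TitleParser(HTMLParser):
    """Collects the text of the first <title> element."""

    def __init__(self) -> None:
        super().__init__()
        self._in_title = False
        self.title: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag == "title" and self.title is None:
            self._in_title = True
            self.title = ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_title(html: str) -> Optional[str]:
    """Return the stripped <title> text, or None if the page has no title."""
    parser = _TitleParser()
    parser.feed(html)
    return parser.title.strip() if parser.title is not None else None


def looks_like_consent_page(title: Optional[str]) -> bool:
    """True if the title matches a known consent-page prefix."""
    return bool(title) and title.startswith(CONSENT_TITLE_PREFIXES)
```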

@pirate pirate added size: easy good first ticket help wanted is: enhancement status: backlog Work is planned someday but is not the highest priority at the moment why: functionality Intended to improve ArchiveBox functionality or features labels Jun 18, 2021
@pirate pirate changed the title Bug: Title becomes "Before you continue to YouTube" when archiving youtube videos COOKIES_FILE isn't used when fetching page titles, leading to saving captcha-page titles like "Before you continue to YouTube..." Jun 18, 2021
@pirate pirate added this to the v0.7.0 milestone Jun 18, 2021
@dansbandit
Author

If I recall correctly, the cookie consent form affects all European users regardless of rate limits.

In the meantime I will try to get the titles another way.

@JoshMock

JoshMock commented Feb 5, 2023

Would this still be a good first ticket? Looking to start making some contributions to the project, but want to get familiar with the codebase first.
