Bug: Fails to parse list of URLs txt file #968

Closed
rossvor opened this issue Apr 20, 2022 · 6 comments

rossvor (Contributor) commented Apr 20, 2022

Describe the bug

I can't seem to get archivebox to add any URLs from a simple txt file containing a newline-separated list of URLs.
Based on the error message, it fails to parse the file. I may be doing something wrong.

Steps to reproduce

  1. Create a txt file with some URLs, e.g.:
https://www.example.com/
https://example.com/
  2. Run archivebox add /tmp/urls.txt (both steps are combined into a single shell sketch below)
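
For convenience, here are the two steps above combined into one shell sketch (the /tmp/urls.txt path is just the example location used throughout this issue):

cat > /tmp/urls.txt << 'EOF'
https://www.example.com/
https://example.com/
EOF
archivebox add /tmp/urls.txt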

Screenshots or log output

Here's the output I get:

ross@xx> archivebox add /tmp/urls.txt
[i] [2022-04-20 16:05:12] ArchiveBox v0.6.2: archivebox add /tmp/urls.txt
    > /tmp/archivebox

[!] Warning: Missing 3 recommended dependencies
    ! SINGLEFILE_BINARY: single-file (unable to detect version)
      Hint: To install all packages automatically run: archivebox setup
            or to disable it and silence this warning: archivebox config --set SAVE_SINGLEFILE=False
            
    ! READABILITY_BINARY: readability-extractor (unable to detect version)
      Hint: To install all packages automatically run: archivebox setup
            or to disable it and silence this warning: archivebox config --set SAVE_READABILITY=False
            
    ! MERCURY_BINARY: mercury-parser (unable to detect version)
      Hint: To install all packages automatically run: archivebox setup
            or to disable it and silence this warning: archivebox config --set SAVE_MERCURY=False
            

[+] [2022-04-20 16:05:13] Adding 1 links to index (crawl depth=0)...
    > Saved verbatim input to sources/1650470713-import.txt
    0.0% (0/240sec) [X] Error while loading link! [1650470713.151664] /tmp/urls.txt "None"
    > Parsed 0 URLs from input (Failed to parse)
    > Found 0 new URLs not already in index

[*] [2022-04-20 16:05:13] Writing 0 links to main index...
    √ ./index.sqlite3

ArchiveBox version

ArchiveBox v0.6.2
Cpython Linux Linux-5.17.1-arch1-1-x86_64-with-glibc2.35 x86_64
IN_DOCKER=False DEBUG=False IS_TTY=True TZ=UTC SEARCH_BACKEND_ENGINE=ripgrep

[i] Dependency versions:
 √  ARCHIVEBOX_BINARY     v0.6.2          valid     /home/ross/.local/bin/archivebox                                            
 √  PYTHON_BINARY         v3.10.4         valid     /usr/bin/python3.10                                                         
 √  DJANGO_BINARY         v3.1.14         valid     /home/ross/.local/lib/python3.10/site-packages/django/bin/django-admin.py   
 √  CURL_BINARY           v7.82.0         valid     /usr/bin/curl                                                               
 √  WGET_BINARY           v1.21.3         valid     /usr/bin/wget                                                               
 √  NODE_BINARY           v17.9.0         valid     /usr/bin/node                                                               
 X  SINGLEFILE_BINARY     ?               invalid   single-file                                                                 
 X  READABILITY_BINARY    ?               invalid   readability-extractor                                                       
 X  MERCURY_BINARY        ?               invalid   mercury-parser                                                              
 √  GIT_BINARY            v2.35.2         valid     /usr/bin/git                                                                
 √  YOUTUBEDL_BINARY      v2021.12.17     valid     /home/ross/.local/bin/youtube-dl                                            
 √  CHROME_BINARY         v100.0.4896.88  valid     /usr/bin/chromium                                                           
 √  RIPGREP_BINARY        v13.0.0         valid     /usr/bin/rg                                                                 

[i] Source-code locations:
 √  PACKAGE_DIR           23 files        valid     /home/ross/.local/lib/python3.10/site-packages/archivebox                   
 √  TEMPLATES_DIR         3 files         valid     /home/ross/.local/lib/python3.10/site-packages/archivebox/templates         
 -  CUSTOM_TEMPLATES_DIR  -               disabled                                                                              

[i] Secrets locations:
 -  CHROME_USER_DATA_DIR  -               disabled                                                                              
 -  COOKIES_FILE          -               disabled                                                                              

[i] Data locations:
 √  OUTPUT_DIR            5 files         valid     /tmp/archivebox                                                             
 √  SOURCES_DIR           3 files         valid     ./sources                                                                   
 √  LOGS_DIR              1 files         valid     ./logs                                                                      
 √  ARCHIVE_DIR           0 files         valid     ./archive                                                                   
 √  CONFIG_FILE           81.0 Bytes      valid     ./ArchiveBox.conf                                                           
 √  SQL_INDEX             204.0 KB        valid     ./index.sqlite3                                                             

[!] Warning: Missing 3 recommended dependencies
    ! SINGLEFILE_BINARY: single-file (unable to detect version)
      Hint: To install all packages automatically run: archivebox setup
            or to disable it and silence this warning: archivebox config --set SAVE_SINGLEFILE=False
            
    ! READABILITY_BINARY: readability-extractor (unable to detect version)
      Hint: To install all packages automatically run: archivebox setup
            or to disable it and silence this warning: archivebox config --set SAVE_READABILITY=False
            
    ! MERCURY_BINARY: mercury-parser (unable to detect version)
      Hint: To install all packages automatically run: archivebox setup
            or to disable it and silence this warning: archivebox config --set SAVE_MERCURY=False
  
pirate (Member) commented Apr 20, 2022

Are you sure your URLs have schemes at the front? They have to be fully qualified URLs (e.g. google.com/example doesn't work but https://google.com/example does).

Can you post a redacted snippet of your actual urls.txt file?

You can also try to force a specific parser with archivebox add --parse=generic_txt /tmp/urls.txt or archivebox add --parse=url_list /tmp/urls.txt. Also check the contents of sources/1650470713-import.txt to see how ArchiveBox is interpreting the file.
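
For example, assuming the collection lives in /tmp/archivebox as shown in the log above (the sources filename is taken from that log and will differ per run), you can inspect what was actually saved with:

cat /tmp/archivebox/sources/1650470713-import.txt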

rossvor (Contributor, Author) commented Apr 20, 2022

Are you sure your URLs have schemes at the front? They have to be fully qualified URLs (e.g. google.com/example doesn't work but https://google.com/example does).

Yep, can confirm that file has fully qualified URLs.

Can you post a redacted snippet of your actual urls.txt file?

Sure.

https://www.example.com/
https://example.com/
https://github.com/ArchiveBox/ArchiveBox/
https://news.ycombinator.com/item?id=31083515
https://www.imdb.com/list/ls020840037/

I've tried setting the parser explicitly as you suggested; none of them picked up the URLs, with slightly varying errors.

archivebox add --parse=generic_txt /tmp/urls.txt
Result:
archivebox add: error: argument --parser: invalid choice: 'generic_txt'

archivebox add --parse=url_list /tmp/urls.txt
Result:
[X] No links found using URL List parser

Also check the contents of sources/1650470713-import.txt to see how ArchiveBox is interpreting the file.

The contents of sources/1650479354-import.txt (with all of the above parser variations) is just the file path itself, so I guess it tries to interpret the path as a URL rather than reading the file.
Contents of sources/1650479354-import.txt:
/tmp/urls.txt

I can confirm that using input redirection does work fine, so this works:
archivebox add < /tmp/urls.txt
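
Piping the file in should behave the same as the redirection above, since both just feed the URLs to archivebox add via stdin (untested here, but equivalent in principle):

cat /tmp/urls.txt | archivebox add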

pirate (Member) commented Apr 20, 2022

Try with --depth=1 and passing the file path as the first argument.

rossvor (Contributor, Author) commented Apr 20, 2022

Doesn't seem to change the error:

ross@xx> archivebox add --depth=1 /tmp/urls.txt
[i] [2022-04-20 19:00:23] ArchiveBox v0.6.2: archivebox add --depth=1 /tmp/urls.txt
    > /media/shared-ext/archivebox

[+] [2022-04-20 19:00:24] Adding 1 links to index (crawl depth=1)...
    > Saved verbatim input to sources/1650481224-import.txt
    0.0% (0/240sec) [X] Error while loading link! [1650481224.660288] /tmp/urls.txt "None"
    > Parsed 0 URLs from input (Failed to parse)
    > Found 0 new URLs not already in index

[*] [2022-04-20 19:00:24] Writing 0 links to main index...
    √ ./index.sqlite3   

rossvor (Contributor, Author) commented Apr 20, 2022

I've also tried this on a fresh Docker-image-based installation, and it fails similarly:

sudo docker run -v $PWD:/data -v /tmp/ff:/ff -it archivebox/archivebox add /ff/urls.txt
[i] [2022-04-20 21:32:03] ArchiveBox v0.6.2: archivebox add /ff/urls.txt
    > /data

[+] [2022-04-20 21:32:03] Adding 1 links to index (crawl depth=0)...
    > Saved verbatim input to sources/1650490323-import.txt
    0.0% (0/240sec) [X] Error while loading link! [1650490324.056402] /ff/urls.txt "None"
    > Parsed 0 URLs from input (Failed to parse)                                                                               
    > Found 0 new URLs not already in index

[*] [2022-04-20 21:32:04] Writing 0 links to main index...
    √ ./index.sqlite3      

/tmp/ff/urls.txt being the same simple file:

https://www.example.com/
https://example.com/
https://github.com/ArchiveBox/ArchiveBox/
https://news.ycombinator.com/item?id=31083515
https://www.imdb.com/list/ls020840037/
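
For what it's worth, the stdin workaround from above should also be applicable to the Docker invocation, roughly like this (a sketch only, not verified here; note -i without -t so the piped stdin actually reaches the container):

cat /tmp/ff/urls.txt | sudo docker run -i -v $PWD:/data archivebox/archivebox add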

pirate (Member) commented Apr 21, 2022

Ah, sorry, I forgot that I removed loading directly from a file path in a previous version because it conflicted with the new --depth=1 implementation!

I'll reopen and merge your original PR #967. For future reference, stdin redirection is indeed necessary, or passing --depth=1 /path/to/file.txt also works.
