Bug: Fails to parse list of URLs txt file #968

Closed
rossvor opened this issue Apr 20, 2022 · 6 comments

rossvor (Contributor) commented Apr 20, 2022

Describe the bug

I can't seem to get archivebox to add any URLs from a simple txt file containing a newline-separated list of URLs.
Based on the error message, it fails to parse the file. I may be doing something wrong.

Steps to reproduce

  1. Create a txt file with some URLs, e.g.:
https://www.example.com/
https://example.com/
  2. Run archivebox add /tmp/urls.txt (both steps are combined into a single shell sketch below)
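
For convenience, here are the two steps above combined into one shell sketch (the /tmp/urls.txt path is just the example location used throughout this issue):

cat > /tmp/urls.txt << 'EOF'
https://www.example.com/
https://example.com/
EOF
archivebox add /tmp/urls.txt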

Screenshots or log output

Here's the output I get:

ross@xx> archivebox add /tmp/urls.txt
[i] [2022-04-20 16:05:12] ArchiveBox v0.6.2: archivebox add /tmp/urls.txt
    > /tmp/archivebox

[!] Warning: Missing 3 recommended dependencies
    ! SINGLEFILE_BINARY: single-file (unable to detect version)
      Hint: To install all packages automatically run: archivebox setup
            or to disable it and silence this warning: archivebox config --set SAVE_SINGLEFILE=False
            
    ! READABILITY_BINARY: readability-extractor (unable to detect version)
      Hint: To install all packages automatically run: archivebox setup
            or to disable it and silence this warning: archivebox config --set SAVE_READABILITY=False
            
    ! MERCURY_BINARY: mercury-parser (unable to detect version)
      Hint: To install all packages automatically run: archivebox setup
            or to disable it and silence this warning: archivebox config --set SAVE_MERCURY=False
            

[+] [2022-04-20 16:05:13] Adding 1 links to index (crawl depth=0)...
    > Saved verbatim input to sources/1650470713-import.txt
    0.0% (0/240sec) [X] Error while loading link! [1650470713.151664] /tmp/urls.txt "None"
    > Parsed 0 URLs from input (Failed to parse)
    > Found 0 new URLs not already in index

[*] [2022-04-20 16:05:13] Writing 0 links to main index...
    √ ./index.sqlite3

ArchiveBox version

ArchiveBox v0.6.2
Cpython Linux Linux-5.17.1-arch1-1-x86_64-with-glibc2.35 x86_64
IN_DOCKER=False DEBUG=False IS_TTY=True TZ=UTC SEARCH_BACKEND_ENGINE=ripgrep

[i] Dependency versions:
 √  ARCHIVEBOX_BINARY     v0.6.2          valid     /home/ross/.local/bin/archivebox                                            
 √  PYTHON_BINARY         v3.10.4         valid     /usr/bin/python3.10                                                         
 √  DJANGO_BINARY         v3.1.14         valid     /home/ross/.local/lib/python3.10/site-packages/django/bin/django-admin.py   
 √  CURL_BINARY           v7.82.0         valid     /usr/bin/curl                                                               
 √  WGET_BINARY           v1.21.3         valid     /usr/bin/wget                                                               
 √  NODE_BINARY           v17.9.0         valid     /usr/bin/node                                                               
 X  SINGLEFILE_BINARY     ?               invalid   single-file                                                                 
 X  READABILITY_BINARY    ?               invalid   readability-extractor                                                       
 X  MERCURY_BINARY        ?               invalid   mercury-parser                                                              
 √  GIT_BINARY            v2.35.2         valid     /usr/bin/git                                                                
 √  YOUTUBEDL_BINARY      v2021.12.17     valid     /home/ross/.local/bin/youtube-dl                                            
 √  CHROME_BINARY         v100.0.4896.88  valid     /usr/bin/chromium                                                           
 √  RIPGREP_BINARY        v13.0.0         valid     /usr/bin/rg                                                                 

[i] Source-code locations:
 √  PACKAGE_DIR           23 files        valid     /home/ross/.local/lib/python3.10/site-packages/archivebox                   
 √  TEMPLATES_DIR         3 files         valid     /home/ross/.local/lib/python3.10/site-packages/archivebox/templates         
 -  CUSTOM_TEMPLATES_DIR  -               disabled                                                                              

[i] Secrets locations:
 -  CHROME_USER_DATA_DIR  -               disabled                                                                              
 -  COOKIES_FILE          -               disabled                                                                              

[i] Data locations:
 √  OUTPUT_DIR            5 files         valid     /tmp/archivebox                                                             
 √  SOURCES_DIR           3 files         valid     ./sources                                                                   
 √  LOGS_DIR              1 files         valid     ./logs                                                                      
 √  ARCHIVE_DIR           0 files         valid     ./archive                                                                   
 √  CONFIG_FILE           81.0 Bytes      valid     ./ArchiveBox.conf                                                           
 √  SQL_INDEX             204.0 KB        valid     ./index.sqlite3                                                             

[!] Warning: Missing 3 recommended dependencies
    ! SINGLEFILE_BINARY: single-file (unable to detect version)
      Hint: To install all packages automatically run: archivebox setup
            or to disable it and silence this warning: archivebox config --set SAVE_SINGLEFILE=False
            
    ! READABILITY_BINARY: readability-extractor (unable to detect version)
      Hint: To install all packages automatically run: archivebox setup
            or to disable it and silence this warning: archivebox config --set SAVE_READABILITY=False
            
    ! MERCURY_BINARY: mercury-parser (unable to detect version)
      Hint: To install all packages automatically run: archivebox setup
            or to disable it and silence this warning: archivebox config --set SAVE_MERCURY=False
  
pirate (Member) commented Apr 20, 2022

Are you sure your URLs have schemes at the front? They have to be fully qualified URLs (e.g. google.com/example doesn't work but https://google.com/example does).

Can you post a redacted snippet of your actual urls.txt file?

You can also try to force a specific parser with archivebox add --parse=generic_txt /tmp/urls.txt or archivebox add --parse=url_list /tmp/urls.txt. Also check the contents of sources/1650470713-import.txt to see how ArchiveBox is interpreting the file.
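
For example, assuming the collection lives in /tmp/archivebox as shown in the log above (the sources filename is taken from that log and will differ per run), you can inspect what was actually saved with:

cat /tmp/archivebox/sources/1650470713-import.txt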

rossvor (Contributor, Author) commented Apr 20, 2022

Are you sure your URLs have schemes at the front? They have to be fully qualified URLs (e.g. google.com/example doesn't work but https://google.com/example does).

Yep, can confirm that file has fully qualified URLs.

Can you post a redacted snippet of your actual urls.txt file?

Sure.

https://www.example.com/
https://example.com/
https://github.com/ArchiveBox/ArchiveBox/
https://news.ycombinator.com/item?id=31083515
https://www.imdb.com/list/ls020840037/

I've tried setting the parser explicitly as you suggested; none of them picked up the URLs, with slightly varying errors.

archivebox add --parse=generic_txt /tmp/urls.txt
Result:
archivebox add: error: argument --parser: invalid choice: 'generic_txt'

archivebox add --parse=url_list /tmp/urls.txt
Result:
[X] No links found using URL List parser

Also check the contents of sources/1650470713-import.txt to see how ArchiveBox is interpreting the file.

The contents of sources/1650479354-import.txt (with all of the above parser variations) is just the file path itself, so I guess it tries to interpret the path as a URL rather than reading the file.
Contents of sources/1650479354-import.txt:
/tmp/urls.txt

I can confirm that using input redirection does work fine, so this works:
archivebox add < /tmp/urls.txt
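
Piping the file in should behave the same as the redirection above, since both just feed the URLs to archivebox add via stdin (untested here, but equivalent in principle):

cat /tmp/urls.txt | archivebox add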

pirate (Member) commented Apr 20, 2022

Try with --depth=1 and passing the file path as the first argument.

rossvor (Contributor, Author) commented Apr 20, 2022

Doesn't seem to change the error:

ross@xx> archivebox add --depth=1 /tmp/urls.txt
[i] [2022-04-20 19:00:23] ArchiveBox v0.6.2: archivebox add --depth=1 /tmp/urls.txt
    > /media/shared-ext/archivebox

[+] [2022-04-20 19:00:24] Adding 1 links to index (crawl depth=1)...
    > Saved verbatim input to sources/1650481224-import.txt
    0.0% (0/240sec) [X] Error while loading link! [1650481224.660288] /tmp/urls.txt "None"
    > Parsed 0 URLs from input (Failed to parse)
    > Found 0 new URLs not already in index

[*] [2022-04-20 19:00:24] Writing 0 links to main index...
    √ ./index.sqlite3   

rossvor (Contributor, Author) commented Apr 20, 2022

I've also tried this on a fresh Docker-image-based installation, and it fails similarly:

sudo docker run -v $PWD:/data -v /tmp/ff:/ff -it archivebox/archivebox add /ff/urls.txt
[i] [2022-04-20 21:32:03] ArchiveBox v0.6.2: archivebox add /ff/urls.txt
    > /data

[+] [2022-04-20 21:32:03] Adding 1 links to index (crawl depth=0)...
    > Saved verbatim input to sources/1650490323-import.txt
    0.0% (0/240sec) [X] Error while loading link! [1650490324.056402] /ff/urls.txt "None"
    > Parsed 0 URLs from input (Failed to parse)                                                                               
    > Found 0 new URLs not already in index

[*] [2022-04-20 21:32:04] Writing 0 links to main index...
    √ ./index.sqlite3      

/tmp/ff/urls.txt being the same simple file:

https://www.example.com/
https://example.com/
https://github.com/ArchiveBox/ArchiveBox/
https://news.ycombinator.com/item?id=31083515
https://www.imdb.com/list/ls020840037/
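
For what it's worth, the stdin workaround from above should also be applicable to the Docker invocation, roughly like this (a sketch only, not verified here; note -i without -t so the piped stdin actually reaches the container):

cat /tmp/ff/urls.txt | sudo docker run -i -v $PWD:/data archivebox/archivebox add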

pirate (Member) commented Apr 21, 2022

Ah, sorry, I forgot that I removed loading directly from a file path in a previous version because it conflicted with the new --depth=1 implementation!

I'll reopen and merge your original PR #967. For future reference, stdin redirection is indeed necessary, or passing --depth=1 /path/to/file.txt also works.
