Bug: All timers/progress bars timeout when archiving from REST API #1646

Open
1 of 4 tasks
benmuth opened this issue Feb 1, 2025 · 1 comment

benmuth commented Feb 1, 2025

Provide a screenshot and describe the bug

When I archive any URL with the extension (redesign branch, which uses the REST API), every timer/progress bar runs for its full timeout: parsers take 240 seconds, most extractors take 60 seconds, and the media extractor presumably takes an hour (its bar counts up to 3600 seconds; I didn't wait around to find out). This only happens through the REST API, not when running archivebox add through the CLI.

Image

I kind of figured out what's going on, but not why. In archivebox/logging_util.py, TimedProgress.end() calls terminate() on the progress_bar process. For some reason (busy writing to stdout?), the process ignores the resulting SIGTERM, so the join() call that follows blocks until the progress_bar function finishes on its own.
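To illustrate the blocking behaviour described above, here is a minimal standalone sketch. It is not ArchiveBox code: the progress_bar stand-in, the ready event, and time_terminate_and_join are hypothetical names, and the child is made to ignore SIGTERM with SIG_IGN purely as an assumption, since the real reason the signal is lost is not known.

```python
import multiprocessing
import signal
import time

# "fork" keeps the sketch simple; it is only available on POSIX systems.
ctx = multiprocessing.get_context("fork")


def progress_bar(seconds, ready):
    """Hypothetical stand-in for the real progress_bar loop."""
    # Assumption: SIGTERM never takes effect in the child. SIG_IGN only
    # simulates that; it is not a claim about what ArchiveBox does.
    signal.signal(signal.SIGTERM, signal.SIG_IGN)
    ready.set()  # tell the parent that the signal disposition is in place
    end = time.monotonic() + seconds
    while time.monotonic() < end:
        time.sleep(0.05)


def time_terminate_and_join(seconds=3.0):
    """Return (how long terminate() + join() took, child exit code)."""
    ready = ctx.Event()
    proc = ctx.Process(target=progress_bar, args=(seconds, ready))
    proc.start()
    ready.wait()
    started = time.monotonic()
    proc.terminate()  # sends SIGTERM on POSIX, which the child ignores
    proc.join()       # so this blocks until the loop finishes by itself
    return time.monotonic() - started, proc.exitcode
```

Calling time_terminate_and_join(3.0) should take roughly the full three seconds, which matches the symptom of every bar running to its maximum.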

Adding an explicit signal handler to the beginning of the progress_bar function seems to fix the problem:

    import signal
    import sys

    def handle_term(signum, frame):
        # Exit right away when the parent's terminate() delivers SIGTERM.
        sys.exit(0)

    signal.signal(signal.SIGTERM, handle_term)

That works, but I'm not sure if there's a better solution. Should I open a PR with this fix against dev?
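One possible alternative, sketched here only as an assumption about what could be done on the parent side, is to bound the join() and escalate to kill() if the child is still alive. stop_process, stubborn_worker, and the grace parameter are hypothetical names for illustration; this is not how TimedProgress.end() is written.

```python
import multiprocessing
import signal
import time

# "fork" keeps the sketch simple; it is only available on POSIX systems.
ctx = multiprocessing.get_context("fork")


def stop_process(proc, grace=0.5):
    """Terminate proc; escalate to SIGKILL if it is still alive after `grace` seconds."""
    proc.terminate()           # polite request (SIGTERM on POSIX)
    proc.join(timeout=grace)   # wait a bounded time instead of forever
    if proc.is_alive():
        proc.kill()            # SIGKILL cannot be caught or ignored
        proc.join()
    return proc.exitcode


def stubborn_worker(seconds, ready):
    """Hypothetical child that ignores SIGTERM, simulating the reported bug."""
    signal.signal(signal.SIGTERM, signal.SIG_IGN)
    ready.set()
    time.sleep(seconds)
```

The trade-off is that SIGKILL gives the child no chance to clean up after itself, so the in-child signal handler may still be the gentler fix; the two approaches can also be combined.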

Steps to reproduce

1. Run:
       docker run -it -p 8000:8000 \
           -v $PWD/data:/data \
           -v $PWD/archivebox:/app/archivebox \
           archivebox server 0.0.0.0:8000 --debug
2. Load the redesign folder on the redesign branch of the archivebox-browser-extension as an extension.
3. Try to archive a webpage by clicking on the extension in the Chrome menu bar.

Logs or errors

╭─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ [2025-02-01 11:35:03] ArchiveBox v0.8.5rc53: archivebox server 0.0.0.0:8000                                                                                 │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
[+] Starting ArchiveBox webserver...
    > Starting ArchiveBox webserver on http://0.0.0.0:8000
    > Log in to ArchiveBox Admin UI on http://0.0.0.0:8000/admin
    > Writing ArchiveBox error log to ./logs/errors.log
Performing system checks...

System check identified no issues (0 silenced).
February 01, 2025 - 11:35:04
Django version 5.1.2, using settings 'core.settings'
Starting ASGI/Daphne version 4.1.2 development server at http://0.0.0.0:8000/
Quit the server with CONTROL-C.
[2025-02-01 11:35:04] INFO     daphne.server HTTP/2 support not enabled (install the http2 and tls Twisted extras)                                server.py:120
                      INFO     daphne.server Configuring endpoint tcp:port=8000:interface=0.0.0.0                                                 server.py:129
                      INFO     daphne.server Listening on TCP address 0.0.0.0:8000                                                                server.py:160
[+] [2025-02-01 11:35:14] Adding 1 links to index (crawl depth=0)...
    > Saved verbatim input to sources/1738409714-import.txt
███████████████████████████████████████████████████                                                                                         5.5% (13/240sec)[2025-02-01 11:35:30] INFO     django.channels.server [mHTTP GET /health/ 200 [0.02, 127.0.0.1:58462][0m                                       runserver.py:168
███████████████████████████████████████████████████████████████████████████████████                                                         15.5% (37/240sec)[2025-02-01 11:36:00] INFO     django.channels.server [mHTTP GET /health/ 200 [0.00, 127.0.0.1:47328][0m                                       runserver.py:168
█████████████████████████████████████████████████████████████████████████████████████████████████                                           25.3% (61/240sec)[2025-02-01 11:36:31] INFO     django.channels.server [mHTTP GET /health/ 200 [0.01, 127.0.0.1:41106][0m                                       runserver.py:168
███████████████████████████████████████████████████████████████████████████████████████████████████████████                                 35.1% (84/240sec)[2025-02-01 11:37:01] INFO     django.channels.server [mHTTP GET /health/ 200 [0.01, 127.0.0.1:55162][0m                                       runserver.py:168
███████████████████████████████████████████████████████████████████████████████████████████████████████████████████                         45.0% (108/240sec)[2025-02-01 11:37:31] INFO     django.channels.server [mHTTP GET /health/ 200 [0.00, 127.0.0.1:58490][0m                                       runserver.py:168
█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████                   54.8% (132/240sec)[2025-02-01 11:38:01] INFO     django.channels.server [mHTTP GET /health/ 200 [0.01, 127.0.0.1:46180][0m                                       runserver.py:168
██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████              64.6% (155/240sec)[2025-02-01 11:38:31] INFO     django.channels.server [mHTTP GET /health/ 200 [0.01, 127.0.0.1:58234][0m                                       runserver.py:168
██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████          74.5% (179/240sec)[2025-02-01 11:39:01] INFO     django.channels.server [mHTTP GET /health/ 200 [0.01, 127.0.0.1:60778][0m                                       runserver.py:168
██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████      84.4% (203/240sec)[2025-02-01 11:39:31] INFO     django.channels.server [mHTTP GET /health/ 200 [0.01, 127.0.0.1:33708][0m                                       runserver.py:168
█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████   94.3% (226/240sec)[2025-02-01 11:40:01] INFO     django.channels.server [mHTTP GET /health/ 200 [0.00, 127.0.0.1:50100][0m                                       runserver.py:168
████████████████████████████████████████████                                                                                                4.2% (10/240sec)[2025-02-01 11:40:31] INFO     django.channels.server [mHTTP GET /health/ 200 [0.01, 127.0.0.1:41456][0m                                       runserver.py:168
████████████████████████████████████████████████████████████████████████████████                                                            14.2% (34/240sec)[2025-02-01 11:41:01] INFO     django.channels.server [mHTTP GET /health/ 200 [0.01, 127.0.0.1:56638][0m                                       runserver.py:168
███████████████████████████████████████████████████████████████████████████████████                                                         15.6% (38/240sec)[2025-02-01 11:41:06] WARNING  daphne.server Application instance <Task pending name='Task-1' coro=<ASGIStaticFilesHandler.__call__() running at  server.py:278
                               /usr/local/lib/python3.11/site-packages/django/contrib/staticfiles/handlers.py:101> wait_for=<Task cancelling
                               name='Task-4' coro=<ASGIHandler.handle.<locals>.process_request() running at
                               /usr/local/lib/python3.11/site-packages/django/core/handlers/asgi.py:185> wait_for=<Future pending
                               cb=[_chain_future.<locals>._call_check_cancel() at /usr/local/lib/python3.11/asyncio/futures.py:387,
                               Task.task_wakeup()]> cb=[Task.task_wakeup()]>> for connection <WebRequest at 0xffff948a6810 method=POST
                               uri=/api/v1/cli/add clientproto=HTTP/1.1> took too long to shut down and was killed.
████████████████████████████████████████████████████████████████████████████████████                                                        16.0% (38/240sec)[2025-02-01 11:41:07] WARNING  daphne.server Application instance <Task cancelling name='Task-1' coro=<ASGIStaticFilesHandler.__call__() running  server.py:278
                               at /usr/local/lib/python3.11/site-packages/django/contrib/staticfiles/handlers.py:101> wait_for=<_GatheringFuture
                               pending cb=[Task.task_wakeup()]>> for connection <WebRequest at 0xffff948a6810 method=POST uri=/api/v1/cli/add
                               clientproto=HTTP/1.1> took too long to shut down and was killed.
    > Parsed 1 URLs from input (URL List)
    > Found 1 new URLs not already in index

[*] [2025-02-01 11:45:23] Writing 1 links to main index...
    √ ./index.sqlite3

[*] [2025-02-01 11:47:54] Archiving 1/26 URLs from added set...

[▶] [2025-02-01 11:47:54] Starting archiving of 1 snapshots in index...

[+] [2025-02-01 11:47:54] "nullprogram.com/blog/2017/09/01"
    https://nullprogram.com/blog/2017/09/01/
    > ./archive/1738409714.199639
      > favicon
      > headers
      > wget
      > title
      > readability
      > htmltotext
      > media
      █████████████████████                                                                                                                 2.1% (76/3600sec)

ArchiveBox Version

0.8.5rc53
ArchiveBox v0.8.5rc53 COMMIT_HASH=3e1cdcf BUILD_TIME=2025-01-31 18:02:40 1738346560
IN_DOCKER=True IN_QEMU=False ARCH=aarch64 OS=Linux PLATFORM=Linux-6.10.14-linuxkit-aarch64-with-glibc2.36 PYTHON=Cpython
EUID=911:0 UID=911:0 PUID=911:0 FS_UID=911:0 FS_PERMS=644 FS_ATOMIC=True FS_REMOTE=True
DEBUG=False IS_TTY=True SUDO=False ID=2cba7b94:9b67ccd9 SEARCH_BACKEND=ripgrep LDAP=False

 Binary Dependencies:
 √  python                3.11.11      sys_pip    /usr/local/bin/python3.11
 √  django                5.1.2        sys_pip    /usr/local/lib/python3.11/site-packages/django/__init__.py
 √  sqlite                2.6.0        sys_pip    /usr/local/lib/python3.11/site-packages/django/db/backends/sqlite3/base.py
 √  pip                   24.2.0       sys_pip    /usr/local/bin/pip
 √  pipx                  1.1.0        sys_pip    /usr/bin/pipx
 √  node                  22.13.1      apt        /usr/bin/node
 √  npm                   11.1.0       apt        /usr/bin/npm
 √  npx                   11.1.0       apt        /usr/bin/npx
 √  playwright            1.49.1       sys_pip    /usr/local/bin/playwright
 √  puppeteer             23.11.1      lib_npm    ~/.npm/bin/puppeteer
 √  ldap                  3.4.4        sys_pip    /usr/local/lib/python3.11/site-packages/ldap/__init__.py
 √  rg                    13.0.0       apt        /usr/bin/rg
 √  sonic                 1.4.9        env        /usr/local/bin/sonic
 √  chrome                131.0.6778   env        /usr/bin/chromium-browser
 √  curl                  8.11.1       apt        /usr/bin/curl
 √  git                   2.39.5       apt        /usr/bin/git
 √  postlight-parser      2.2.3        sys_npm    ~/.npm/bin/postlight-parser
 √  readability-extractor 0.0.11       lib_npm    ~/.npm/bin/readability-extractor
 √  single-file           1.1.54       lib_npm    ~/.npm/bin/single-file
 √  wget                  1.21.3       apt        /usr/bin/wget
 √  yt-dlp                2024.10.7    sys_pip    /usr/local/bin/yt-dlp
 √  ffmpeg                5.1.6        env        /usr/bin/ffmpeg

 Package Managers:
 √  env         /usr/bin/which                                       UID=911  PATH=~/.npm/bin:/usr/local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin…
 √  apt         /usr/bin/apt-get                                     UID=0    PATH=/usr/bin:/bin
 -  brew        not available                                        UID=911  PATH=
 √  sys_pip     /usr/local/bin/pip                                   UID=911  PATH=~/.local/bin:/usr/local/bin:/usr/bin
 -  venv_pip    not available                                        UID=911  PATH=/tmp/NotInsideAVenv/lib/bin
 -  lib_pip     not available                                        UID=911  PATH=./lib/aarch64-linux-docker/pip/venv/bin
 √  sys_npm     /usr/bin/npm                                         UID=911  PATH=~/.npm/bin
 -  lib_npm     /usr/bin/npm                                         UID=911  PATH=./lib/aarch64-linux-docker/npm/node_modules/.bin:./node_modules/.bin:~/.npm…
 √  playwright  /usr/local/bin/playwright                            UID=0    PATH=./lib/aarch64-linux-docker/bin:~/.npm/bin:/usr/local/bin:/usr/local/sbin:/u…
 √  puppeteer   /usr/bin/npx                                         UID=911  PATH=./lib/aarch64-linux-docker/bin

 Code locations:
 √  PACKAGE_DIR           41 files        valid     /app/archivebox
 √  TEMPLATES_DIR         3 files         valid     /app/archivebox/templates
 -  CUSTOM_TEMPLATES_DIR  missing         unused    ./user_templates
 -  USER_PLUGINS_DIR      missing         unused    ./user_plugins
 √  LIB_DIR               0 files         valid     /usr/share/archivebox/lib

 Data locations:
 √  DATA_DIR              14 files @      valid     /data
 √  CONFIG_FILE           368.0 Bytes     valid     ./ArchiveBox.conf
 √  SQL_INDEX             736.0 KB        valid     ./index.sqlite3
 √  QUEUE_DATABASE        92.0 KB         valid     ./queue.sqlite3
 √  ARCHIVE_DIR           25 files        valid     ./archive
 √  SOURCES_DIR           101 files       valid     ./sources
 √  PERSONAS_DIR          1 files         valid     ./personas
 √  LOGS_DIR              1 files         valid     ./logs
 √  TMP_DIR               0 files         valid     /tmp/archivebox

How did you install the version of ArchiveBox you are using?

Other

What operating system are you running on?

macOS (including Docker on macOS)

What type of drive are you using to store your ArchiveBox data?

  • some of data/ is on a local SSD or NVMe drive
  • some of data/ is on a spinning hard drive or external USB drive
  • some of data/ is on a network mount (e.g. NFS/SMB/Ceph/GlusterFS/etc.)
  • some of data/ is on a FUSE mount (e.g. SSHFS/RClone/S3/B2/Google Drive/Dropbox/etc.)

Docker Compose Configuration

docker run -it -p 8000:8000 \
    -v $PWD/data:/data \
    -v $PWD/archivebox:/app/archivebox \
    archivebox server 0.0.0.0:8000 --debug

ArchiveBox Configuration

# Converted from INI to TOML format: https://toml.io/en/

[SERVER_CONFIG]
SECRET_KEY = "n************************************************y"

pirate commented Feb 1, 2025

Sure I'll take the PR :)

thanks
