Releases: ArchiveBox/ArchiveBox

v0.7.3: Updates for Docker container's SingleFile, YT-DLP, Chrome, and other dependencies only

15 Dec 10:18

This is just an update for the archivebox/archivebox:latest Docker container's internal dependencies.

  • python, node
  • chrome -> v131 ⭐️
  • curl
  • wget
  • yt-dlp -> 2024.12.x ⭐️
  • single-file -> v1.1.54 ⭐️ (this update should help fix many archiving issues reported in v0.7.2)
  • readability
  • ripgrep
  • sonic -> archivebox/sonic:latest
  • git

There is no change to any of the Python code, so there is no need to pip update to this version if you are not using Docker.
If you are not using Docker, run archivebox version and make sure your manually-installed dependencies are up to date.
Always make sure to back up your archive.
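
One way to check manually installed dependencies against the versions listed above is to compare version strings numerically. The following is a minimal Python sketch, not part of ArchiveBox: version_tuple and is_at_least are hypothetical helper names, and the "installed" strings in the usage lines are made-up examples rather than real command output.

```python
import re
from typing import Tuple


def version_tuple(version: str) -> Tuple[int, ...]:
    """Extract the leading numeric components of a version string.

    'v1.1.54' -> (1, 1, 54), '2024.12.x' -> (2024, 12), 'unknown' -> ()
    """
    match = re.match(r"v?(\d+(?:\.\d+)*)", version.strip())
    if not match:
        return ()
    return tuple(int(part) for part in match.group(1).split("."))


def is_at_least(installed: str, minimum: str) -> bool:
    """True if `installed` is at or above `minimum`, compared only to the
    precision that `minimum` specifies (so '131.0.6778.85' satisfies 'v131')."""
    have, want = version_tuple(installed), version_tuple(minimum)
    return bool(have) and have[:len(want)] >= want


# Made-up installed versions checked against the versions named in this release
print(is_at_least("131.0.6778.85", "v131"))      # True
print(is_at_least("2024.11.18", "2024.12.x"))    # False
print(is_at_least("1.1.54", "v1.1.54"))          # True
```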


Tip

👾 All new development work is happening over in the v0.8.x branch ➡️


Full Changelog: v0.7.2...v0.7.3

v0.8.5-rc: Prettier + faster CLI for InstalledBinaries, Machines/NetworkInterfaces health now audit logged

03 Oct 11:37

Warning

This is a BETA pre-release that improves upon the previous v0.8.4-rc ALPHA pre-release. The next stable release will be v0.9.0. The v0.8.x-rc series of releases are for collecting feedback while we make big architectural improvements to support a new public plugin marketplace + ecosystem (powered by pluggy + huey + pydantic). We want brave early adopters to help us test it! (if that's not you, wait for v0.9!)

⬇️ BETA Instructions: 1. back up your collection 2. install the :dev branch with docker/pip
  1. 🗜️ Always make a full backup before installing new BETA releases!
    Remember, this is an unstable sneak-preview in the middle of a rewrite, so it MAY DAMAGE DATA.
    gzip -k ./data/index.sqlite3         # do this at least 🙏
    zip -r data.bak.zip data             # OR even better: back up the entire data dir
  2. 📦 Then get the latest nightly build from Docker Hub or Pip:
    docker pull archivebox/archivebox:0.8.5rc51
    # OR
    pip install 'git+https://github.com/ArchiveBox/[email protected]'
  3. ↗️ Then run archivebox init to upgrade your collection:
    Migrating existing data from v0.7.x can take several hours on slower HDDs (up to ~1min/1000 URLs).
    archivebox install         # make sure all package and runtime dependencies are installed & available
    archivebox init            # run data migrations (slow, theoretically safe to Ctrl+C and resume, but try not to)
    archivebox version         # check that everything updated properly and dependencies are installed
    archivebox status          # see a health report on the collection index & snapshot directories
  4. 💬 Let us know if you find bugs or have suggestions by opening a new issue! In particular we want to hear:
  • was the upgrade/migration process smooth?
  • can you find any areas of the UI/CLI that are slow?
  • how do you like the new plugin system? (see archivebox/plugins_extractor/*) Would you contribute a new plugin?
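
For anyone scripting the backup step from the instructions above, here is a rough Python equivalent of the gzip -k and zip -r commands. It is only a sketch under assumptions: backup_collection is a hypothetical helper name, and it assumes the collection lives in a directory containing index.sqlite3, as in the shell commands.

```python
import gzip
import shutil
from pathlib import Path
from typing import List


def backup_collection(data_dir, include_full_dir: bool = True) -> List[Path]:
    """Back up a collection before upgrading; returns the files that were written."""
    data_dir = Path(data_dir).resolve()
    written: List[Path] = []

    # Like `gzip -k ./data/index.sqlite3`: write a compressed copy, keep the original.
    index = data_dir / "index.sqlite3"
    index_gz = index.with_name(index.name + ".gz")
    with open(index, "rb") as src, gzip.open(index_gz, "wb") as dst:
        shutil.copyfileobj(src, dst)
    written.append(index_gz)

    if include_full_dir:
        # Like `zip -r data.bak.zip data`: the zip is written next to the data
        # dir, not inside it, so the archive never tries to include itself.
        base = data_dir.parent / (data_dir.name + ".bak")
        archive = shutil.make_archive(
            str(base), "zip",
            root_dir=str(data_dir.parent),
            base_dir=data_dir.name,
        )
        written.append(Path(archive))

    return written
```

Calling backup_collection("./data") from the directory that contains the collection would produce ./data/index.sqlite3.gz and ./data.bak.zip. Stop any running ArchiveBox process first so the SQLite file is not copied mid-write.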

Highlights

[Screenshots: archivebox shell and archivebox help]

What's Changed

  • 📦 Deprecated apt and brew install methods in favor of pip + new archivebox install cmd
  • 🌈 Much improved archivebox help, archivebox version, and archivebox shell CLI interfaces
  • ⚡️ Massive speedups to binary detection and loading at startup time
  • ✍️ New Machine, NetworkInterface, and InstalledBinary models keep an audit log of host environment changes and health
  • Many other bugfixes, speedups, and internal architecture improvements
  • Move novnc web-ui to 8081 by @agowa in #1522
  • Add OpenContainer Image Format Annotations as Labels to Docker Image by @mpgirro in #1525

Full Changelog: v0.8.4-rc...v0.8.5-rc

v0.8.4-rc: New background worker system w/ huey+supervisord, plugin dependency auto-installing w/ Ansible/Pyinfra

11 Sep 23:27
ee1b881

Warning

This is an ALPHA pre-release that improves upon the previous v0.8.3-rc ALPHA pre-release. The next stable release will be v0.9.0. The v0.8.x-rc series of releases are for collecting feedback while we make big architectural improvements to support a new public plugin marketplace + ecosystem (powered by pluggy + huey + pydantic). We want brave early adopters to help us test it! (if that's not you, wait for v0.9!)


Highlights

  • 🪵 moved to proper event-driven task system huey + django-huey-monitor
  • 🦸‍♂️ integrated supervisord to manage bg workers
  • 📦 integrated ansible/pyinfra (an ansible alternative) to install subdependency packages at runtime
  • ⚡️ continued switching from runserver to proper Channels + Daphne ASGI
  • 🧩 lots more plugins!
[Screenshot: archivebox add]

Full Changelog: v0.8.3-rc...v0.8.4-rc

v0.8.3-rc: New UI Buttons, adding/updating is now non-blocking, Daphne ASGI, Rich CLI logs, Byte Range support, ABIDs, and more...

06 Sep 10:23

Warning

This is an ALPHA pre-release that improves upon the previous v0.8.2-rc ALPHA pre-release. The next stable release will be v0.9.0. The v0.8.x-rc series of releases are for collecting feedback while we make big architectural improvements to support a new public plugin marketplace + ecosystem. We want brave early adopters to help us test it! (if that's not you, wait for v0.9!)


Highlights

[Screenshot: Get Title, Get Missing, Archive again buttons]

  • New Admin action button text should make it clearer what the buttons do
  • Adding new URLs / clicking action buttons now runs the task in a BG thread instead of running synchronously (and often timing out)
  • Added ability to click "View on site" from any object in admin to go directly to viewing the content
  • Switched archivebox server from using runserver to a proper daphne ASGI server
  • Added HTTP byte range request support (allows you to seek to the middle of a big .mp4 without downloading the whole thing)
  • Added ability to regenerate ABIDs on objects that have gone out of sync
  • New plugin system architecture is coming along, standard API for hooks now available in plugantic/base_hook.py
  • improved CLI logging output using rich for pretty colors and nicer tracebacks
  • improved HTTP request logging to filter out noisy 404/304/200 lines
  • renamed .created -> .created_at, .modified -> .modified_at, .added -> .bookmarked_at, .updated -> .downloaded_at
  • allow accessing admin change pages, API records, and archive contents by both ABID and ID (UUID)
  • add ruff linting and lots of type hint improvements with pydantic
  • improve auth and CSRF security for the new REST API (cookies no longer work for API auth, a token is appended to URLs instead)
  • bump default USER_AGENT settings to chrome v128, bump yt-dlp, singlefile, etc. versions
  • lots of other small fixes, speedups, and improvements!
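
To illustrate what HTTP byte range support involves, here is a minimal Python sketch that resolves a single-range Range header into byte offsets and builds the matching Content-Range value. It is a generic illustration of the mechanism, not ArchiveBox's server code; parse_byte_range and content_range are hypothetical names, and multi-range requests are deliberately not handled.

```python
from typing import Optional, Tuple


def parse_byte_range(header: str, file_size: int) -> Optional[Tuple[int, int]]:
    """Resolve a single-range `bytes=...` header to inclusive (start, end) offsets.

    Returns None when the header is malformed or cannot be satisfied.
    """
    unit, _, spec = header.partition("=")
    if unit.strip().lower() != "bytes" or "," in spec:
        return None  # only single byte ranges are handled in this sketch
    first, sep, last = spec.strip().partition("-")
    if not sep:
        return None
    try:
        if first == "":
            # Suffix form "bytes=-N": the final N bytes of the file
            length = int(last)
            if length <= 0:
                return None
            start, end = max(file_size - length, 0), file_size - 1
        else:
            start = int(first)
            end = int(last) if last else file_size - 1
    except ValueError:
        return None
    end = min(end, file_size - 1)  # clamp ranges that run past the end of the file
    if start < 0 or start > end:
        return None
    return start, end


def content_range(start: int, end: int, file_size: int) -> str:
    """Value of the Content-Range header sent with a 206 Partial Content response."""
    return f"bytes {start}-{end}/{file_size}"


# Seeking into the second half of a 1000-byte file
print(parse_byte_range("bytes=500-", 1000))   # (500, 999)
print(content_range(500, 999, 1000))          # bytes 500-999/1000
```

A server would answer a satisfiable range with status 206 and only the requested slice, which is what lets a video player jump to the middle of a large file.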

[Screenshot: API Identifiers]


Full Changelog: v0.8.2-rc...v0.8.3-rc

v0.8.2-rc: New Snapshot UI ✨, Admin UI speedups, more REST API endpoints, Django 5.1, and bugfixes

21 Aug 03:27

Warning

This was a BETA pre-release that improved upon the previous v0.8.0-rc ALPHA pre-release. This one brings us closer to a final v0.8 release and contains several core architectural improvements around how we key things with unique IDs, as well as a ✨ new Snapshot Detail UI ✨.

Changelog: v0.8.0-rc...v0.8.2-rc

v0.8.0-rc: New REST API ✨, Django 5.0, S3/B2/SMB/NFS remote storage support, VNC viewer, and more

27 Mar 00:03

WIP ALPHA pre-release for the upcoming ArchiveBox v0.8 release.

Caution

This was an ALPHA pre-release. We were promoting it a little earlier than usual because it contains ✨ lots of big new features ✨ and we want brave early adopters to help us test it!

  • New ArchiveBox REST API
  • ArchiveBox Admin Webhooks UI
  • ArchiveBox Configuration Admin UI
  • S3/B2/SMB/NFS/GDrive Remote Storage Setup

Highlights

  • add gitea and other domains to default GIT_DOMAINS list to run git archiving on
  • check /, /data, and /data/archive in Docker and warn if running low on disk space
  • Add COOKIES_FILE support for singlefile extractor by @naoph in #1372
  • Use COOKIES_FILE to fetch page titles by @benmuth in #1364
  • Fallback to not chown'ing ./data/archive dir if it's a network mount that prevents ownership changes by @gnattu in #1312
  • Show the upgrade notification only in specific views by @benmuth in #1314
  • ability to populate is_staff and is_superuser flags at LDAP authentication by @vladimirdulov in #1335
  • Make it a little easier to run specific tests by @jimwins in #1371
  • disable chrome automatic self-updating when running headless
  • Add ability to populate is_staff and is_superuser flags during LDAP first auth
  • allow more restrictive NFS permission coercion on ./data/archive
  • bump yt-dlp, singlefile, wget, curl, and chrome versions
  • fix RESOLUTION being ignored when using Chrome headless in Docker
  • fix sorting by Size / Files in the Admin Snapshots list page UI
  • fix spinner icon showing on some Snapshots instead of favicon when only a few extractors are enabled
  • fix yt-dlp sometimes failing to archive media due to filenames being too long or containing special characters
  • fix wget extractor not finding output when :80 or :443 port is present in the original URL
  • fix /var/spool/cron/crontabs permissions when mounting it via Docker
  • fix /browsers chown on Docker armv7 entrypoint failing
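
The wget fix above involves URLs that spell out a default port (:80 or :443). One generic way to make such URLs compare equal to their port-less form is to strip the port when it matches the scheme's default. This is a sketch of that idea only; it is not claimed to be how the extractor was actually fixed, and strip_default_port is a hypothetical name.

```python
from urllib.parse import urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def strip_default_port(url: str) -> str:
    """Drop an explicit :80 (http) or :443 (https) port; leave other URLs untouched."""
    parts = urlsplit(url)
    if parts.port is None or DEFAULT_PORTS.get(parts.scheme) != parts.port:
        return url
    host = parts.hostname or ""   # note: hostname is returned lowercased
    if ":" in host:
        host = f"[{host}]"        # IPv6 literals need their brackets restored
    if parts.username:
        auth = parts.username + (f":{parts.password}" if parts.password else "")
        host = f"{auth}@{host}"
    return urlunsplit((parts.scheme, host, parts.path, parts.query, parts.fragment))


print(strip_default_port("https://example.com:443/page?x=1#top"))  # https://example.com/page?x=1#top
print(strip_default_port("http://example.com:8080/"))              # http://example.com:8080/
```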

COMING SOON: new sci-dl scientific paper downloader being worked on by @benmuth

Full Changelog: v0.7.2...v0.8.0-rc

v0.7.2: Make scheduled imports taggable, fix admin buttons, readability, Docker permissions

04 Jan 19:25
315c9f3
[Screenshot: web version]

Get this release via pip, docker, brew, or dpkg (apt & brew releases are delayed).

# Get it with Pip on any OS (`amd64`, `arm64`, `arm/v7`)
pip install --upgrade 'archivebox==0.7.2'
# Get it with Docker on any OS (`amd64`, `arm64`, `arm/v7`)
docker pull archivebox/archivebox:0.7.2
# Get it with brew on macOS (`amd64`, `arm64`)
brew tap archivebox/archivebox
brew install archivebox
pip install --upgrade 'archivebox==0.7.2'
# Get it with apt on Ubuntu/Debian based systems (`any`)
wget 'https://github.com/ArchiveBox/debian-archivebox/raw/main/archivebox-0.7.1.deb'
apt install ./archivebox-0.7.1.deb
# OR
dpkg -i ./archivebox-0.7.1.deb

# then run pip install after
pip install --upgrade 'archivebox==0.7.2'

Note: this is not packaged using "proper" debian techniques like 0.6.2 was, instead it's just a wrapper for executing pip install archivebox w/ a few extras. This is because ArchiveBox relies on some binary and dynamic dependencies (node, chrome, playwright, ffmpeg, yt-dlp, etc.) which aren't allowed in Debian packages.

(Launchpad apt ppa & brew updates coming eventually, packaging all the vendored binaries that archivebox depends on has gotten harder lately)


[Screenshot: CLI version]
# Then run this to upgrade an existing collection data dir to 0.7.2
cd ~/path/to/data/dir
archivebox init

What's Changed

  • add --tag=tag1,tag2,tag3 support to archivebox schedule command
  • allow PGID=0 root-group ownership of data dir (but PUID=0 is still not allowed)
  • improve error messages, hints, and logging about permissions issues in Docker
  • notify users when a new ArchiveBox version is available on GitHub (thanks @benmuth!)
  • bump dependency versions (yt-dlp, chrome, readability, node, python)
  • warn when Docker / or /data volume mounts don't have any space available
  • limit compatible Python versions to >= 3.8 and <= 3.11
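
The disk-space warning in the list above can be approximated with the standard library. This is a minimal sketch, not ArchiveBox's implementation: low_space_warnings is a hypothetical name and the 1 GiB default threshold is an arbitrary illustrative value.

```python
import shutil
from typing import Iterable, List


def low_space_warnings(paths: Iterable[str], min_free_bytes: int = 1024 ** 3) -> List[str]:
    """Return one warning string per path whose filesystem has less free space
    than `min_free_bytes`. Paths that do not exist on this host are skipped."""
    warnings: List[str] = []
    for path in paths:
        try:
            free = shutil.disk_usage(path).free
        except OSError:
            continue  # e.g. /data is not mounted outside the container
        if free < min_free_bytes:
            warnings.append(f"{path}: only {free / 1024 ** 2:.0f} MiB free")
    return warnings


# The mount points named in these release notes
for message in low_space_warnings(["/", "/data", "/data/archive"]):
    print("WARNING:", message)
```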

Bug Fixes

  • fix action buttons in Snapshot admin page not showing up correctly
  • tag links immediately in first stage of archivebox add instead of at the end (so that imports that are paused or interrupted still get tagged correctly)
  • fix config variables in CHROME_USER_AGENT format string not getting interpolated properly
  • switch readability to prefer Chrome DOM dumps for article text instead of singlefile (because singlefile output is often huge and crashes readability/times out)
  • make Docker image smaller by removing unneeded docs files
  • better current version detection and remove annoying +editable string and also add BUILD_TIME
  • fix /browsers/* does not exist warning on startup
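
The CHROME_USER_AGENT fix above concerns config values that reference other config variables through a format string. A rough sketch of that kind of interpolation follows, assuming plain str.format-style {NAME} placeholders. The function name, keys, and template are hypothetical and are not ArchiveBox's real defaults.

```python
from typing import Any, Dict


def interpolate_config(config: Dict[str, Any]) -> Dict[str, Any]:
    """Fill {NAME} placeholders in string values from the other config values.

    Single pass only: a placeholder that expands to another placeholder is
    not resolved a second time in this sketch.
    """
    resolved = dict(config)
    for key, value in config.items():
        if isinstance(value, str):
            resolved[key] = value.format(**config)
    return resolved


# Hypothetical config: these keys and values are illustrative only
example = {
    "VERSION": "0.7.2",
    "EXAMPLE_USER_AGENT": "ExampleBrowser/1.0 ExampleArchiver/{VERSION}",
}
print(interpolate_config(example)["EXAMPLE_USER_AGENT"])
# ExampleBrowser/1.0 ExampleArchiver/0.7.2
```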

v0.7.1: Minor new features, bugfixes, and new dependency versions

04 May 05:53

Get this release via pip, docker, brew, or dpkg (apt ppa update delayed).

# Get it with Pip on any OS (`amd64`, `arm64`, `arm/v7`)
pip install --upgrade 'archivebox==0.7.1'
# Get it with Docker on any OS (`amd64`, `arm64`, `arm/v7`)
docker pull archivebox/archivebox:0.7.1
# Get it with brew on macOS (`amd64`, `arm64`)
brew tap archivebox/archivebox
brew install archivebox
# Get it with apt on Ubuntu/Debian based systems (`any`)
wget 'https://github.com/ArchiveBox/debian-archivebox/raw/main/archivebox-0.7.1.deb'
apt install ./archivebox-0.7.1.deb
# OR
dpkg -i ./archivebox-0.7.1.deb

Note: this is not packaged using "proper" debian techniques like 0.6.2 was, instead it's just a wrapper for executing pip install archivebox w/ a few extras. This is because ArchiveBox relies on some binary and dynamic dependencies (node, chrome, playwright, ffmpeg, yt-dlp, etc.) which aren't allowed in Debian packages.

(Launchpad apt ppa update coming eventually, packaging for apt has gotten harder lately)


# Then run this to upgrade an existing collection data dir to 0.7.1
cd ~/path/to/data/dir
archivebox init

What's Changed

Lots of bugfixes, speedups, and small convenience features.

Full Changelog: v0.6.2...v0.7.1

v0.6.2: >10x performance gain, new Admin UI & CLI features, and more

10 Apr 12:24

New features

  • new ArchiveResult log in the admin web UI, with full editing ability of individual extractor outputs + list of outputs under each Snapshot admin entry
  • ability to save multiple snapshots of the same URL over time using new Re-snapshot button
  • add init --quick and server --quick-init options to quickly update the db version without doing a full re-init (for users with large archive collections this will make version upgrades a lot faster / less painful)
  • add new archivebox setup command and archivebox init --setup flag to aid in automatically installing dependencies and creating a superuser during initial setup
  • new SNAPSHOTS_PER_PAGE=40 and MEDIA_MAX_SIZE=750m config options
  • allow hotlinking directly to specific extractor output on the snapshot detail page using URL #hash e.g. /archive/<timestamp>/index.html#git
  • add ability to view the snapshot matching a given URL by visiting /archive/https://example.com/some/url -> redirects to -> /archive/<timestamp>/index.html (also works without scheme /archive/example.com)
  • #660 add ability to tag URLs while adding them via the web UI and via the CLI using archivebox add --tag=tag1,tag2,tag3 ...
  • #659 add back ability to override visual styling with custom HTML and CSS using new config option CUSTOM_TEMPLATES_DIR
  • ability to add and remove multiple tags at once from the snapshot admin using autocompleting dropdown
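
As an illustration of the comma-separated --tag=tag1,tag2,tag3 format mentioned above, here is a small argparse sketch that splits such a value into a list. It is not the ArchiveBox CLI itself: the parser is a stand-in, parse_tags is a hypothetical name, and trimming whitespace and dropping duplicates are assumptions about reasonable behavior.

```python
import argparse
from typing import List


def parse_tags(value: str) -> List[str]:
    """Split 'tag1,tag2,tag3' into a list, ignoring blanks, extra spaces, and repeats."""
    tags: List[str] = []
    for raw in value.split(","):
        tag = raw.strip()
        if tag and tag not in tags:
            tags.append(tag)
    return tags


# Stand-in parser for demonstration purposes only
parser = argparse.ArgumentParser(prog="add-sketch")
parser.add_argument("--tag", type=parse_tags, default=[])
parser.add_argument("urls", nargs="*")

args = parser.parse_args(["--tag=news,tech, news", "https://example.com"])
print(args.tag)   # ['news', 'tech']
print(args.urls)  # ['https://example.com']
```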

Enhancements

  • lots of performance improvements! (in testing with 100k entries, the main index was brought down from 10-14 second load times to ~110ms once cache warms up)
  • full text search now works on the public snapshot list
  • dates and times are now localized to your browser's timezone instead of showing in UTC
  • integrity and correctness improvements to readability, mercury, warc, and other extractors
  • video subtitles and description are now added to the full-text search index as well (including youtube's autogenerated transcripts in all languages)
  • log all errors with full tracebacks to new data/logs/errors.log file (so users no longer have to run in --debug mode to see error details)
  • better archivebox schedule logging and changed logfile location to ./logs/schedule.log
  • better docker-compose setup experience with sonic config example in docker-compose.yml
  • add Django Debug Toolbar + djdt_flamegraph for developers to profile UI performance
  • add --overwrite flag support to archivebox schedule, archived urls get added similarly to add --overwrite
  • #644 remove bootstrap and jquery network requests to CDNs by inlining them instead
  • #647 allow filtering by ArchiveResult status in the Snapshot admin UI to select only links that have been archived or not archived
  • #550 kill all orphan child processes after each extractor finishes to prevent dangling chromium/node subprocesses and memory leaks
  • 3276434 add new SEARCH_BACKEND_TIMEOUT config option to tune amount of time search backend can take before it gives up
  • more diagnostic info added to the Snapshot admin view including most recent status code, content type, detected server, etc
  • make the order of the table columns, layout, and spacing the same on the public view and private view (also remove DataTable, we're not using it)
  • better snapshot grid page (faster load times, nicer CSS for tags and cards, more actions supported and metadata shown)
  • added Cache-Control headers to dramatically speed up load times by caching favicons, screenshots, etc. in browsers/upstreams
  • new project releases page https://releases.archivebox.io and demo url https://demo.archivebox.io
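
To show what the Cache-Control change above means in practice, here is a minimal sketch that picks response headers for static assets such as favicons and screenshots. The suffix list and the one-hour lifetime are assumptions chosen for illustration, and cache_headers is a hypothetical name; the values ArchiveBox actually sends are not stated in these notes.

```python
from pathlib import PurePosixPath
from typing import Dict

# Hypothetical policy: these file types and the default lifetime are illustrative
CACHEABLE_SUFFIXES = {".ico", ".png", ".jpg", ".jpeg", ".css", ".js"}


def cache_headers(path: str, max_age: int = 3600) -> Dict[str, str]:
    """Return caching headers for a request path; empty dict means 'add nothing'."""
    suffix = PurePosixPath(path).suffix.lower()
    if suffix in CACHEABLE_SUFFIXES:
        # "public" lets shared caches (upstream proxies) store the response too
        return {"Cache-Control": f"public, max-age={max_age}"}
    return {}


print(cache_headers("/archive/1610000000/favicon.ico"))
# {'Cache-Control': 'public, max-age=3600'}
```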

Bugfixes

  • #673 fix searching by URL substring in Snapshot admin list
  • #658 fix Snapshot admin action buttons not working in Safari and some other browsers
  • #678 fix AssertionError when archivebox would attempt to archive with CHROME_BINARY=None when Chrome was not found on the host system
  • #654 fix some issues with sonic attempting to index massive text blobs or binary blobs on some pages and hanging
  • #674 fix UTF-8 encoding problems with file reading/writing on Windows (supporting a Python pkg on Windows is unreasonably painful y'all)
  • #433 fix deleted items sometimes reappearing on next import/update
  • #473 fix issue preventing use of archivebox python API inside raw REPL (not using archivebox shell)
  • fix stdin/stdout/stderr handling for some edge cases in Docker/Docker-Compose
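
The Windows encoding fix (#674) touches a common Python pitfall: open() without an encoding argument uses the platform's locale encoding, which on Windows is often a legacy code page rather than UTF-8. A minimal sketch of the usual remedy, passing the encoding explicitly on both write and read; the helper names and sample text are illustrative, and this is not a copy of the actual patch.

```python
import os
import tempfile


def write_text_utf8(path: str, text: str) -> None:
    # Explicit encoding so the bytes on disk are the same on every OS
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)


def read_text_utf8(path: str) -> str:
    with open(path, "r", encoding="utf-8") as f:
        return f.read()


# Round-trip a title containing non-ASCII characters
title = "Café 日本語 ✓"
path = os.path.join(tempfile.mkdtemp(), "title.txt")
write_text_utf8(path, title)
print(read_text_utf8(path) == title)  # True
```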


v0.5.6: Bugfixes and packaging improvements

09 Feb 14:25
9766ea2
  • add ARMv7 and ARMv8 CPU support for apt / deb distribution on Launchpad PPA
  • fix nodesource apt repo not supported on i386 b90afc8
  • fix handling of skipped ArchiveResult entries with null output 0aea5ed
  • catch exception on import of old index.json into ArchiveResult 171bbeb
  • move debsign to release not build 66fb5b2
  • skip tests during debian build a32eac3
  • fix emptystrings in cmd_version causing exception a49884a
  • automate deb dist better and bump version 0e6ac39
  • fix assertion 6705354
  • change wording of db not found error 683a087