Warn on broken steps, use yt-dlp to avoid youtube-dl errors, and don't crash on bad UTF-8 #1026

Merged: 13 commits, Jan 10, 2023
3 changes: 3 additions & 0 deletions .gitignore
Expand Up @@ -24,3 +24,6 @@ data1/
data2/
data3/
output/

# vim
*.sw?
2 changes: 1 addition & 1 deletion Dockerfile
@@ -1,5 +1,5 @@
# This is the Dockerfile for ArchiveBox, it bundles the following dependencies:
# python3, ArchiveBox, curl, wget, git, chromium, youtube-dl, single-file
# python3, ArchiveBox, curl, wget, git, chromium, youtube-dl, yt-dlp, single-file
# Usage:
# git submodule update --init --recursive
# git pull --recurse-submodules
Expand Down
6 changes: 3 additions & 3 deletions README.md
Expand Up @@ -87,7 +87,7 @@ ls ./archive/*/index.json # or browse directly via the filesyste
- [**Free & open source**](https://github.com/ArchiveBox/ArchiveBox/blob/master/LICENSE), doesn't require signing up online, stores all data locally
- [**Powerful, intuitive command line interface**](https://github.com/ArchiveBox/ArchiveBox/wiki/Usage#CLI-Usage) with [modular optional dependencies](#dependencies)
- [**Comprehensive documentation**](https://github.com/ArchiveBox/ArchiveBox/wiki), [active development](https://github.com/ArchiveBox/ArchiveBox/wiki/Roadmap), and [rich community](https://github.com/ArchiveBox/ArchiveBox/wiki/Web-Archiving-Community)
- [**Extracts a wide variety of content out-of-the-box**](https://github.com/ArchiveBox/ArchiveBox/issues/51): [media (youtube-dl), articles (readability), code (git), etc.](#output-formats)
- [**Extracts a wide variety of content out-of-the-box**](https://github.com/ArchiveBox/ArchiveBox/issues/51): [media (youtube-dl or yt-dlp), articles (readability), code (git), etc.](#output-formats)
- [**Supports scheduled/realtime importing**](https://github.com/ArchiveBox/ArchiveBox/wiki/Scheduled-Archiving) from [many types of sources](#input-formats)
- [**Uses standard, durable, long-term formats**](#saves-lots-of-useful-stuff-for-each-imported-link) like HTML, JSON, PDF, PNG, and WARC
- [**Usable as a oneshot CLI**](https://github.com/ArchiveBox/ArchiveBox/wiki/Usage#CLI-Usage), [**self-hosted web UI**](https://github.com/ArchiveBox/ArchiveBox/wiki/Usage#UI-Usage), [Python API](https://docs.archivebox.io/en/latest/modules.html) (BETA), [REST API](https://github.com/ArchiveBox/ArchiveBox/issues/496) (ALPHA), or [desktop app](https://github.com/ArchiveBox/electron-archivebox) (ALPHA)
Expand Down Expand Up @@ -469,7 +469,7 @@ Inside each Snapshot folder, ArchiveBox save these different types of extractor
- **DOM Dump:** `output.html` DOM Dump of the HTML after rendering using headless chrome
- **Article Text:** `article.html/json` Article text extraction using Readability & Mercury
- **Archive.org Permalink:** `archive.org.txt` A link to the saved site on archive.org
- **Audio & Video:** `media/` all audio/video files + playlists, including subtitles & metadata with youtube-dl
- **Audio & Video:** `media/` all audio/video files + playlists, including subtitles & metadata with youtube-dl (or yt-dlp)
- **Source Code:** `git/` clone of any repository found on GitHub, Bitbucket, or GitLab links
- _More coming soon! See the [Roadmap](https://github.com/ArchiveBox/ArchiveBox/wiki/Roadmap)..._

Expand Down Expand Up @@ -529,7 +529,7 @@ To achieve high fidelity archives in as many situations as possible, ArchiveBox
- `node` & `npm` (for readability, mercury, and singlefile)
- `wget` (for plain HTML, static files, and WARC saving)
- `curl` (for fetching headers, favicon, and posting to Archive.org)
- `youtube-dl` (for audio, video, and subtitles)
- `youtube-dl` or `yt-dlp` (for audio, video, and subtitles)
- `git` (for cloning git repos)
- and more as we grow...

Expand Down
14 changes: 11 additions & 3 deletions archivebox/config.py
Expand Up @@ -144,12 +144,19 @@
'--no-call-home',
'--write-sub',
'--all-subs',
'--write-auto-sub',
# There are too many of these and youtube
# throttles you with HTTP error 429
#'--write-auto-subs',
'--convert-subs=srt',
'--yes-playlist',
'--continue',
'--ignore-errors',
# This flag doesn't exist in youtube-dl
# only in yt-dlp
'--no-abort-on-error',
# --ignore-errors must come AFTER
# --no-abort-on-error
# https://github.com/yt-dlp/yt-dlp/issues/4914
Review comment (Contributor Author): The order of these flags is, confusingly, quite important. See yt-dlp/yt-dlp#4914.

'--ignore-errors',
'--geo-bypass',
'--add-metadata',
'--max-filesize={}'.format(c['MEDIA_MAX_SIZE']),
Expand Down Expand Up @@ -203,7 +210,8 @@
'SINGLEFILE_BINARY': {'type': str, 'default': lambda c: bin_path('single-file')},
'READABILITY_BINARY': {'type': str, 'default': lambda c: bin_path('readability-extractor')},
'MERCURY_BINARY': {'type': str, 'default': lambda c: bin_path('mercury-parser')},
'YOUTUBEDL_BINARY': {'type': str, 'default': 'youtube-dl'},
#'YOUTUBEDL_BINARY': {'type': str, 'default': 'youtube-dl'},
'YOUTUBEDL_BINARY': {'type': str, 'default': 'yt-dlp'},
'NODE_BINARY': {'type': str, 'default': 'node'},
'RIPGREP_BINARY': {'type': str, 'default': 'rg'},
'CHROME_BINARY': {'type': str, 'default': None},
Expand Down
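The flag ordering described in the config.py hunk above can be sketched as a small helper. This is only an illustration, not ArchiveBox's actual code: `build_media_args`, its parameters, and the `"750m"` default are hypothetical, and it assumes (as the diff comments state) that `--no-abort-on-error` is accepted only by yt-dlp and that `--ignore-errors` must come after it.

```python
from pathlib import Path
from typing import List


def build_media_args(binary: str, max_size: str = "750m") -> List[str]:
    """Hypothetical helper: build the error-handling flags for a media downloader."""
    args = ["--yes-playlist", "--continue"]
    # Per the diff comments, --no-abort-on-error exists only in yt-dlp,
    # so add it only when the binary name looks like yt-dlp.
    if Path(binary).name.startswith("yt-dlp"):
        args.append("--no-abort-on-error")
    # --ignore-errors is appended after the yt-dlp-only flag so that the
    # ordering requirement noted in the review comment holds.
    args.append("--ignore-errors")
    args.append("--max-filesize={}".format(max_size))
    return args
```

Keying the optional flag off the binary name would let a user who sets `YOUTUBEDL_BINARY` back to `youtube-dl` avoid passing a flag that tool does not recognize. The PR itself does not do this; it always passes the flag.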
7 changes: 5 additions & 2 deletions archivebox/extractors/__init__.py
@@ -1,6 +1,7 @@
__package__ = 'archivebox.extractors'

import os
import sys
from pathlib import Path

from typing import Optional, List, Iterable, Union
Expand Down Expand Up @@ -137,14 +138,16 @@ def archive_link(link: Link, overwrite: bool=False, methods: Optional[Iterable[s
link.url,
)) from e
"""
# Instead, use the kludgy workaround from
# Instead, use the kludgy workaround from
# https://github.com/ArchiveBox/ArchiveBox/issues/984#issuecomment-1150541627
with open(ERROR_LOG, "a", encoding='utf-8') as f:
command = ' '.join(sys.argv)
ts = datetime.now(timezone.utc).strftime('%Y-%m-%d__%H:%M:%S')
f.write(("\n" + 'Exception in archive_methods.save_{}(Link(url={}))'.format(
f.write(("\n" + 'Exception in archive_methods.save_{}(Link(url={})) command={}; ts={}'.format(
method_name,
link.url,
command,
ts
) + "\n"))
#f.write(f"\n> {command}; ts={ts} version={config['VERSION']} docker={config['IN_DOCKER']} is_tty={config['IS_TTY']}\n")
Review comment (Contributor Author): Should there be a break here to exit the loop?


Expand Down
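The logging change in the hunk above (recording the invoking command and a UTC timestamp alongside the failing extractor) can be sketched in isolation. This is a minimal sketch under assumptions: `log_extractor_error` and `log_path` are hypothetical names standing in for the inline code and `ERROR_LOG`, while the message format and timestamp pattern are copied from the diff.

```python
import sys
from datetime import datetime, timezone
from pathlib import Path


def log_extractor_error(log_path: Path, method_name: str, url: str) -> str:
    """Hypothetical helper: append one line describing a failed extractor."""
    # Record how the process was invoked and when the failure happened (UTC).
    command = " ".join(sys.argv)
    ts = datetime.now(timezone.utc).strftime("%Y-%m-%d__%H:%M:%S")
    line = "Exception in archive_methods.save_{}(Link(url={})) command={}; ts={}".format(
        method_name, url, command, ts
    )
    # Append mode keeps earlier entries, so repeated failures accumulate.
    with open(log_path, "a", encoding="utf-8") as f:
        f.write("\n" + line + "\n")
    return line
```

Because the exception is logged rather than re-raised, one broken extractor step produces a warning entry instead of aborting the whole run, which appears to be the intent behind the PR title.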
16 changes: 13 additions & 3 deletions archivebox/extractors/media.py
Expand Up @@ -33,7 +33,7 @@ def should_save_media(link: Link, out_dir: Optional[Path]=None, overwrite: Optio

@enforce_types
def save_media(link: Link, out_dir: Optional[Path]=None, timeout: int=MEDIA_TIMEOUT) -> ArchiveResult:
"""Download playlists or individual video, audio, and subtitles using youtube-dl"""
"""Download playlists or individual video, audio, and subtitles using youtube-dl or yt-dlp"""

out_dir = out_dir or Path(link.link_dir)
output: ArchiveOutput = 'media'
Expand Down Expand Up @@ -61,7 +61,7 @@ def save_media(link: Link, out_dir: Optional[Path]=None, timeout: int=MEDIA_TIME
pass
else:
hints = (
'Got youtube-dl response code: {}.'.format(result.returncode),
'Got youtube-dl (or yt-dlp) response code: {}.'.format(result.returncode),
*result.stderr.decode().split('\n'),
)
raise ArchiveError('Failed to save media', hints)
Expand All @@ -72,8 +72,18 @@ def save_media(link: Link, out_dir: Optional[Path]=None, timeout: int=MEDIA_TIME
timer.end()

# add video description and subtitles to full-text index
# Let's try a few different
index_texts = [
text_file.read_text(encoding='utf-8').strip()
# errors:
# * 'strict' to raise a ValueError exception if there is an
# encoding error. The default value of None has the same effect.
# * 'ignore' ignores errors. Note that ignoring encoding errors
# can lead to data loss.
# * 'xmlcharrefreplace' is only supported when writing to a
# file. Characters not supported by the encoding are replaced with
# the appropriate XML character reference &#nnn;.
# There are a few more options described in https://docs.python.org/3/library/functions.html#open
text_file.read_text(encoding='utf-8', errors='xmlcharrefreplace').strip()
for text_file in (
*output_path.glob('*.description'),
*output_path.glob('*.srt'),
Expand Down
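The `errors` options listed in the comment above can be tried on a file containing invalid UTF-8. The sketch below is an illustration, not the PR's code: `read_index_texts` is a hypothetical helper, and it deliberately swaps in `errors='replace'` rather than the `'xmlcharrefreplace'` handler used in the diff, because the quoted Python documentation describes `'xmlcharrefreplace'` as supported only when writing. Whether that handler behaves as intended on read is worth testing separately.

```python
from pathlib import Path
from typing import List


def read_index_texts(paths: List[Path]) -> List[str]:
    """Hypothetical helper: read text files without crashing on invalid UTF-8."""
    texts = []
    for path in paths:
        # errors='replace' substitutes U+FFFD for each undecodable byte
        # sequence instead of raising UnicodeDecodeError.
        texts.append(path.read_text(encoding="utf-8", errors="replace").strip())
    return texts
```

With `'replace'`, a damaged subtitle or description file still reaches the full-text index, with the bad bytes visibly marked, whereas `'ignore'` would drop them silently.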
8 changes: 4 additions & 4 deletions bin/setup.sh
Expand Up @@ -91,9 +91,9 @@ echo " This is a helper script which installs the ArchiveBox dependencies on
echo " You may be prompted for a sudo password in order to install the following:"
echo ""
echo " - archivebox"
echo " - python3, pip, nodejs, npm (languages used by ArchiveBox, and its extractor modules)"
echo " - curl, wget, git, youtube-dl (used for extracting title, favicon, git, media, and more)"
echo " - chromium (skips this if any Chrome/Chromium version is already installed)"
echo " - python3, pip, nodejs, npm (languages used by ArchiveBox, and its extractor modules)"
echo " - curl, wget, git, youtube-dl, yt-dlp (used for extracting title, favicon, git, media, and more)"
echo " - chromium (skips this if any Chrome/Chromium version is already installed)"
echo ""
echo " If you'd rather install these manually as-needed, you can find detailed documentation here:"
echo " https://github.com/ArchiveBox/ArchiveBox/wiki/Install"
Expand All @@ -115,7 +115,7 @@ if which apt-get > /dev/null; then
fi
echo
echo "[+] Installing ArchiveBox system dependencies using apt..."
sudo apt-get install -y git python3 python3-pip python3-distutils wget curl youtube-dl ffmpeg git nodejs npm ripgrep
sudo apt-get install -y git python3 python3-pip python3-distutils wget curl youtube-dl yt-dlp ffmpeg git nodejs npm ripgrep
sudo apt-get install -y libgtk2.0-0 libgtk-3-0 libnotify-dev libgconf-2-4 libnss3 libxss1 libasound2 libxtst6 xauth xvfb libgbm-dev || sudo apt-get install -y chromium || sudo apt-get install -y chromium-browser || true
sudo apt-get install -y archivebox
sudo apt-get --only-upgrade install -y archivebox
Expand Down
2 changes: 1 addition & 1 deletion etc/ArchiveBox.conf.default
Expand Up @@ -55,7 +55,7 @@
# CURL_BINARY = curl
# GIT_BINARY = git
# WGET_BINARY = wget
# YOUTUBEDL_BINARY = youtube-dl
# YOUTUBEDL_BINARY = yt-dlp
# CHROME_BINARY = chromium

# CHROME_USER_DATA_DIR="~/.config/google-chrome/Default"
Expand Down
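Since the default binary changes from `youtube-dl` to `yt-dlp`, one way to stay compatible with systems that have only one of the two installed is to probe for both. This is a sketch of that idea only; the PR does not implement a fallback, and `find_media_binary` is a hypothetical name.

```python
import shutil
from typing import Optional, Sequence


def find_media_binary(candidates: Sequence[str] = ("yt-dlp", "youtube-dl")) -> Optional[str]:
    """Hypothetical helper: return the path of the first candidate found on PATH."""
    for name in candidates:
        # shutil.which returns the executable's full path, or None if absent.
        path = shutil.which(name)
        if path:
            return path
    return None
```

A caller could pass the result as `YOUTUBEDL_BINARY` and skip the media extractor when it is `None`.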
1 change: 1 addition & 0 deletions setup.py
Expand Up @@ -42,6 +42,7 @@
"django-extensions>=3.0.3",
"dateparser>=1.0.0",
"youtube-dl>=2021.04.17",
"yt-dlp>=2021.4.11",
Review comment (Contributor Author): I arbitrarily picked this version number to be the first datestamp ahead of the youtube-dl version.

Review comment (Member): You can bump this to the latest version whenever you touch these files (as long as it works). There is no need to match older versions of chrome/youtube-dl/ffmpeg/git/wget; only Django and a couple of other edge cases can't be bumped.

"python-crontab>=2.5.1",
"croniter>=0.3.34",
"w3lib>=1.22.0",
Expand Down
2 changes: 1 addition & 1 deletion stdeb.cfg
Expand Up @@ -5,7 +5,7 @@ Package3: archivebox
Suite: focal
Suite3: focal
Build-Depends: debhelper, dh-python, python3-all, python3-pip, python3-setuptools, python3-wheel, python3-stdeb
Depends3: nodejs, wget, curl, git, ffmpeg, youtube-dl, python3-all, python3-pip, python3-setuptools, python3-croniter, python3-crontab, python3-dateparser, python3-django, python3-django-extensions, python3-django-jsonfield, python3-mypy-extensions, python3-requests, python3-w3lib, ripgrep
Depends3: nodejs, wget, curl, git, ffmpeg, youtube-dl, yt-dlp, python3-all, python3-pip, python3-setuptools, python3-croniter, python3-crontab, python3-dateparser, python3-django, python3-django-extensions, python3-django-jsonfield, python3-mypy-extensions, python3-requests, python3-w3lib, ripgrep
X-Python3-Version: >= 3.7
XS-Python-Version: >= 3.7
Setup-Env-Vars: DEB_BUILD_OPTIONS=nocheck