Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Warn on broken steps, use yt-dlp to avoid youtube-dl errors, and don't crash on bad UTF-8 #1026

Merged
merged 13 commits into from
Jan 10, 2023

Conversation

turian
Copy link
Contributor

@turian turian commented Sep 12, 2022

Summary

  • If some step is broken, we issue a warning instead of crashing.
  • We use yt-dlp instead of youtube-dl by default.
  • On bad unicode, we don't crash.

Quickest workaround for many people, until this is merged

Add this to ArchiveBox.conf:

YOUTUBEDL_BINARY=/usr/bin/yt-dlp

If that doesn't work, you can use my Docker turian/archivebox:kludge-984-UTF8-bug which includes this patch, instead of archivebox/archivebox for now.

@jgoerzen I think you also said this bug was a showstopper for you

Try it out

This finally works with this patch:

archivebox add 'https://www.ashra.com/news.php?m=A'

Related issues

Should close these issues:

Changes these areas

  • Bugfixes
  • Feature behavior
  • Command line interface
  • Configuration options
  • Internal architecture
  • Snapshot data layout on disk

youtube-dl -> yt-dlp

So I began by changing the archive_link from a crashing exception to a warning. This gave me better diagnostics than the exceptions you see in the issues above. I observed the following:

       Extractor failed:
             Failed to save media
        Run to see full output:
            cd /home/archivebox/archivebox-turian/ArchiveBox/pipdata/archive/1663011964.432896;
            youtube-dl --write-description --write-info-json --write-annotations --write-thumbnail --no-call-home --write-sub --all-subs --write-auto-sub --convert-subs=srt --yes-playlist --continue --ignore-errors --no-abort-on-error --geo-bypass --add-metadata --max-filesize=750m https://www.ashra.com/news.php?m=A

Inspecting that, I saw:

$ youtube-dl --write-description --write-info-json --write-annotations --write-thumbnail --no-call-home --write-sub --all-subs --write-auto-sub --convert-subs=srt --yes-playlist --continue --ignore-errors --no-abort-on-error --geo-bypass --add-metadata --max-filesize=750m https://www.ashra.com/news.php?m=A
Usage: youtube-dl [OPTIONS] URL [URL...]

youtube-dl: error: no such option: --no-abort-on-error

Oups. youtube-dl doesn't have --no-abort-on-error. I think it only has --abort-on-error but I'm not entirely sure what the default behavior is.

I switched youtube-dl to yt-dlp as the default, which is a more actively maintained fork that pulls upstream from youtube-dl constantly. It also has option --no-abort-on-error so now the media download works.

It operates essentionally identically (and FASTER) to youtube-dl.

The above youtube-dl options are present, with the following caveats:

  • We aren't doing this, but "When --embed-subs and --write-subs are used together, the subtitles are written to disk and also embedded in the media file. You can use just --embed-subs to embed the subs and automatically delete the separate file."
  • We aren't doing this, but "--add-metadata attaches the infojson to mkv files in addition to writing the metadata when used with --write-info-json. Use --no-embed-info-json or --compat-options no-attach-info-json to revert this"
  • --write-annotations: "No supported site has annotations now"

I haven't pushed yt-dlp default changes to the submodules yet:
deb_dist, brew_dist, docker, docs, etc/ArchiveBox.conf.default, pip_dist
but I'd like to. I would even go so far as to deprecate youtube-dl, but perhaps that's too radical for some people.

UnicodeDecodeError fixes

A common complaint. In media.py, we used to have:

text_file.read_text(encoding='utf-8').strip()

This is strict and errors cause a crash. I have changed the behavior to xmlcharrefreplace ("Characters not supported by the encoding are replaced with the appropriate XML character reference &#nnn;.") which seemed the best to me, but you can read about other options in my comment there or in more detail at Python's open function documentation.

My preferred workaround, because sometimes things say they are utf-8 but are actually a different encoding, would be this:

  1. Try utf-8 in strict mode. If clean, then good.
  2. If error, use chardet to guess the encoding. (It's a little slow, so it's better as a fallback option.) Attempt to decode to Unicode using the guessed encoding in strict mode. If clean, then good.
  3. Worst case: Either use utf-8 or the highest confidence chardet guessed encoding, and decode it using xmlcharrefreplace for errors.

This would be a separate PR, if you approve of me including chardet as a pip dependency here and in all submodules (same as with yt-dlp).

Postscript

I ran flake8 and caught all the flakes in the code I introduced, as far as I know. Since I am a new committer, CI/CD won't run for me here.

command,
ts
) + "\n"))
#f.write(f"\n> {command}; ts={ts} version={config['VERSION']} docker={config['IN_DOCKER']} is_tty={config['IS_TTY']}\n")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should there be a break here to exit the loop?

@notevenaperson
Copy link
Contributor

+1 for yt-dlp

@@ -42,6 +42,7 @@
"django-extensions>=3.0.3",
"dateparser>=1.0.0",
"youtube-dl>=2021.04.17",
"yt-dlp>=2021.4.11",
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I arbitrarily picked this version number to be the first datestamp ahead of the youtube-dl version.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can bump this to the latest version whenever you touch these files (as long as it works), no need to match older versions of chrome/youtubledl/ffmpeg/git/wget, only django and a couple other edge cases are the ones that cant be bumped

@turian
Copy link
Contributor Author

turian commented Sep 14, 2022

A few more notes:

  • yt-dlp is so much faster than youtube-dl that you get throttled if you download all the autogenerated subtitles. So I simply disabled them, and only generate "real" subtitles and captions.

'--no-abort-on-error',
# --ignore-errors must come AFTER
# --no-abort-on-error
# https://github.com/yt-dlp/yt-dlp/issues/4914
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The order of these flags is---confusingly---quite important.

yt-dlp/yt-dlp#4914

@WakeReality
Copy link

I'm using docker-compose to run ArchiveBox. How do I get yt-dlp inside the docker container?

@pirate
Copy link
Member

pirate commented Jan 10, 2023

I thought I made Yt-dlp the default already in the 0.6.3 latest builds but maybe I messed it up.

@pirate pirate merged commit 8a96563 into ArchiveBox:dev Jan 10, 2023
@turian turian deleted the feature/kludge-984-UTF8-bug branch January 19, 2023 18:58
@turian
Copy link
Contributor Author

turian commented Apr 2, 2023

boom!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants