Warn on broken steps, use yt-dlp to avoid youtube-dl errors, and don't crash on bad UTF-8 #1026
command, | ||
ts | ||
) + "\n")) | ||
#f.write(f"\n> {command}; ts={ts} version={config['VERSION']} docker={config['IN_DOCKER']} is_tty={config['IS_TTY']}\n") |
Should there be a break here to exit the loop?
+1 for yt-dlp
@@ -42,6 +42,7 @@ | |||
"django-extensions>=3.0.3", | |||
"dateparser>=1.0.0", | |||
"youtube-dl>=2021.04.17", | |||
"yt-dlp>=2021.4.11", |
I arbitrarily picked this version number to be the first datestamp ahead of the youtube-dl version.
You can bump this to the latest version whenever you touch these files (as long as it works). There's no need to match older versions of chrome/youtube-dl/ffmpeg/git/wget; only django and a couple of other edge cases can't be bumped.
A few more notes:
'--no-abort-on-error', | ||
# --ignore-errors must come AFTER | ||
# --no-abort-on-error | ||
# https://github.com/yt-dlp/yt-dlp/issues/4914 |
The order of these flags is, confusingly, quite important.
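To make the ordering constraint from the diff comment concrete, here is a minimal sketch of a helper that moves --ignore-errors behind --no-abort-on-error when both are present. The two flag names come from the diff above; the helper name and the argument lists are hypothetical and are not part of ArchiveBox.

```python
from typing import List, Sequence

# Both flags appear in the diff under review; the helper itself is hypothetical.
FIRST = "--no-abort-on-error"
SECOND = "--ignore-errors"


def order_error_flags(args: Sequence[str]) -> List[str]:
    """Return a copy of args in which SECOND does not precede FIRST."""
    out = list(args)
    if FIRST in out and SECOND in out and out.index(SECOND) < out.index(FIRST):
        # Move --ignore-errors to directly after --no-abort-on-error.
        out.remove(SECOND)
        out.insert(out.index(FIRST) + 1, SECOND)
    return out


if __name__ == "__main__":
    print(order_error_flags([SECOND, "--write-description", FIRST]))
```

Lists that are already in the right order, or that contain only one of the two flags, are returned unchanged.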
…ture/kludge-984-UTF8-bug
I'm using docker-compose to run ArchiveBox. How do I get yt-dlp inside the docker container?
I thought I made Yt-dlp the default already in the 0.6.3
boom!
Summary
Quickest workaround for many people, until this is merged
Add this to ArchiveBox.conf:
If that doesn't work, you can use my Docker turian/archivebox:kludge-984-UTF8-bug, which includes this patch, instead of archivebox/archivebox for now.
@jgoerzen I think you also said this bug was a showstopper for you.
Try it out
This finally works with this patch:
archivebox add 'https://www.ashra.com/news.php?m=A'
Related issues
Should close these issues:
youtube-dl errors:
Bug: Docker install not able to save YouTube videos (media failure) #991
Question: YouTube videos not completing media archive method #998
Changes these areas
youtube-dl -> yt-dlp
So I began by changing the archive_link from a crashing exception to a warning. This gave me better diagnostics than the exceptions you see in the issues above. I observed the following:
Inspecting that, I saw:
Oops. youtube-dl doesn't have --no-abort-on-error. I think it only has --abort-on-error, but I'm not entirely sure what the default behavior is.
I switched the default from youtube-dl to yt-dlp, a more actively maintained fork that pulls upstream from youtube-dl constantly. It also has the option --no-abort-on-error, so now the media download works.
It operates essentially identically to youtube-dl (and FASTER).
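The first change described in this section, turning a crashing exception into a warning, can be sketched roughly as follows. This is not the actual archive_link code: the function name, the step names, and the "succeeded"/"failed" status strings are all assumptions made for illustration.

```python
import sys
from typing import Callable, Dict


def run_steps(steps: Dict[str, Callable[[], str]]) -> Dict[str, str]:
    """Run each step; warn about a failure and carry on instead of raising."""
    results: Dict[str, str] = {}
    for name, step in steps.items():
        try:
            results[name] = step()
        except Exception as err:
            # Hypothetical warning format: name the broken step and the error,
            # then continue with the remaining steps.
            print(f"[!] Warning: step {name!r} failed: "
                  f"{type(err).__name__}: {err}", file=sys.stderr)
            results[name] = "failed"
    return results


if __name__ == "__main__":
    def broken_media_step() -> str:
        raise RuntimeError("downloader exited with an error")

    print(run_steps({"title": lambda: "succeeded", "media": broken_media_step}))
```

The point of the design is that one broken step no longer hides the result of the others, and the warning carries the diagnostic that the crash used to swallow.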
The above youtube-dl options are present, with the following caveats:
--write-annotations: "No supported site has annotations now"
I haven't pushed the yt-dlp default changes to the submodules yet: deb_dist, brew_dist, docker, docs, etc/ArchiveBox.conf.default, pip_dist, but I'd like to. I would even go so far as to deprecate youtube-dl, but perhaps that's too radical for some people.
UnicodeDecodeError fixes
A common complaint. In media.py, we used to have:
This is strict, and errors cause a crash. I have changed the behavior to xmlcharrefreplace ("Characters not supported by the encoding are replaced with the appropriate XML character reference &#nnn;."), which seemed the best to me, but you can read about other options in my comment there or in more detail at Python's open function documentation.
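For reference, here is a small sketch of how Python's error handlers behave on bad input. It does not reproduce the media.py code; the sample bytes and function names are made up. One caveat worth checking against the patch: Python's open() documentation describes xmlcharrefreplace as supported only when writing, so the sketch uses it on the encoding side and uses replace on the decoding side.

```python
def decode_strict(data: bytes) -> str:
    """Default behavior: errors='strict' raises UnicodeDecodeError on bad bytes."""
    return data.decode("utf-8")


def decode_lenient(data: bytes) -> str:
    """Substitute U+FFFD for undecodable bytes instead of raising."""
    return data.decode("utf-8", errors="replace")


def encode_ascii_xml(text: str) -> bytes:
    """Encode to ASCII, writing unsupported characters as &#nnn; references."""
    return text.encode("ascii", errors="xmlcharrefreplace")


if __name__ == "__main__":
    bad = b"caf\xe9 au lait"  # Latin-1 bytes, not valid UTF-8
    try:
        decode_strict(bad)
    except UnicodeDecodeError as err:
        print("strict decoding failed:", err.reason)
    print(decode_lenient(bad))
    print(encode_ascii_xml("caf\u00e9"))
```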
My preferred workaround, because sometimes things say they are utf-8 but are actually a different encoding, would be this:
chardet guessed encoding, and decode it using xmlcharrefreplace for errors.
This would be a separate PR, if you approve of me including chardet as a pip dependency here and in all submodules (same as with yt-dlp).
Postscript
I ran flake8 and caught all the flakes in the code I introduced, as far as I know. Since I am a new committer, CI/CD won't run for me here.
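As a rough illustration of the chardet fallback proposed in the UnicodeDecodeError section, a sketch could look like the following. The function names are hypothetical, chardet is treated as an optional third-party import, and the sketch swaps in the replace handler for decoding because Python documents xmlcharrefreplace for the encoding direction.

```python
from typing import Callable, Optional


def guess_with_chardet(data: bytes) -> Optional[str]:
    """Ask chardet for an encoding; return None if chardet is not installed."""
    try:
        import chardet  # third-party, optional in this sketch
    except ImportError:
        return None
    return chardet.detect(data).get("encoding")


def decode_with_fallback(
    data: bytes,
    guess: Callable[[bytes], Optional[str]] = guess_with_chardet,
) -> str:
    """Try strict UTF-8 first, then a guessed encoding, then lenient UTF-8."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        pass
    encoding = guess(data)
    if encoding:
        try:
            return data.decode(encoding, errors="replace")
        except LookupError:
            pass  # the guessed name is not a codec Python knows
    return data.decode("utf-8", errors="replace")


if __name__ == "__main__":
    # A stand-in guesser keeps the demo independent of chardet.
    print(decode_with_fallback(b"caf\xe9", guess=lambda _data: "latin-1"))
```

Passing the guesser as a parameter keeps the fallback order testable without installing chardet.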