Fix bin_version: set LANG=C when calling executables to avoid parsing localized output #936

Merged 1 commit into ArchiveBox:dev on Mar 5, 2022


pellaeon
Contributor

Summary

bin_version calls an external executable with --version and parses its output. The resulting version string is used to build the user agent string in the title extractor.

Executables may print their output in a localized language, but a user agent string must be encodable as latin-1, so a localized version string results in this error:

        Extractor failed:
            UnicodeEncodeError 'latin-1' codec can't encode characters in position 201-202: ordinal not in range(256)

This PR fixes the problem by running the executables with the environment variable LANG=C, so that their output is not localized.
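A minimal sketch of the idea, not the code merged in this PR: the function name `bin_version_sketch`, the timeout, and the choice to copy the parent environment before overriding LANG are all assumptions made for illustration.

```python
import os
import subprocess
import sys
from typing import Optional


def bin_version_sketch(binary: str) -> Optional[str]:
    """Return the first line of `<binary> --version`, run with LANG=C.

    Hypothetical helper for illustration; not ArchiveBox's actual bin_version.
    """
    # Copy the current environment and override only LANG (an assumption:
    # the real fix may pass a different environment).
    env = {**os.environ, "LANG": "C"}
    try:
        result = subprocess.run(
            [binary, "--version"],
            stdout=subprocess.PIPE,
            stderr=subprocess.DEVNULL,
            env=env,
            timeout=10,
        )
    except (OSError, subprocess.TimeoutExpired):
        # Binary missing, not executable, or hung.
        return None
    lines = result.stdout.decode(errors="replace").strip().splitlines()
    return lines[0] if lines else None


if __name__ == "__main__":
    print(bin_version_sketch(sys.executable))
```

One caveat with this sketch: if the parent environment exports LC_ALL, that variable takes precedence over LANG, so a caller that copies the environment may also need to clear it.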

Example of running wget --version when LANG="zh_TW.UTF-8":

GNU Wget 1.21,於 linux-gnu 上編譯。

-cares +digest -gpgme +https +ipv6 +iri +large-file -metalink +nls 
+ntlm +opie +psl +ssl/openssl 

Wgetrc: 
    /etc/wgetrc (系統)
語系: 
    /usr/share/locale 
編譯: 
    gcc -DHAVE_CONFIG_H -DSYSTEM_WGETRC="/etc/wgetrc" 
    -DLOCALEDIR="/usr/share/locale" -I. -I../../src -I../lib 
    -I../../lib -Wdate-time -D_FORTIFY_SOURCE=2 -DHAVE_LIBSSL -DNDEBUG 
    -g -O2 -ffile-prefix-map=/build/wget-OM48Vs/wget-1.21=. 
    -fstack-protector-strong -Wformat -Werror=format-security 
    -DNO_SSLv2 -D_FILE_OFFSET_BITS=64 -g -Wall 
連結: 
    gcc -DHAVE_LIBSSL -DNDEBUG -g -O2 
    -ffile-prefix-map=/build/wget-OM48Vs/wget-1.21=. 
    -fstack-protector-strong -Wformat -Werror=format-security 
    -DNO_SSLv2 -D_FILE_OFFSET_BITS=64 -g -Wall -Wl,-Bsymbolic-functions 
    -Wl,-z,relro -Wl,-z,now -lpcre2-8 -luuid -lidn2 -lssl -lcrypto -lz 
    -lpsl ftp-opie.o openssl.o http-ntlm.o ../lib/libgnu.a 

版權所有 (C) 2015 自由軟體基金會
GPLv3+ 授權:GNU GPL 第三版或更新版本
<http://www.gnu.org/licenses/gpl.html>。
此為自由軟體:您能自由修改與重散布它。
在法律允許的範圍內沒有任何擔保。

最初由 Hrvoje Niksic <[email protected]> 編寫。
請將漏洞報告和問題寄到 <[email protected]>。
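The first line of this output is enough to reproduce the encoding failure in isolation. The sketch below is illustrative only: `is_latin1_safe` is a hypothetical helper, and the English string is a stand-in for unlocalized output rather than a captured one.

```python
def is_latin1_safe(value: str) -> bool:
    """Return True if `value` can be encoded as latin-1 (hypothetical helper)."""
    try:
        value.encode("latin-1")
    except UnicodeEncodeError:
        return False
    return True


# First line of the localized `wget --version` output quoted above.
localized = "GNU Wget 1.21,於 linux-gnu 上編譯。"
# Stand-in for an unlocalized first line (an assumption, not captured output).
unlocalized = "GNU Wget 1.21 built on linux-gnu."

if __name__ == "__main__":
    print(is_latin1_safe(localized))
    print(is_latin1_safe(unlocalized))
```

A check like this could also serve as a guard before a version string is placed in a header, although the PR itself addresses the cause by controlling the locale instead.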

Related issues

Changes these areas

  • Bugfixes
  • Feature behavior
  • Command line interface
  • Configuration options
  • Internal architecture
  • Snapshot data layout on disk

@pirate
Member

pirate commented Mar 5, 2022

woah nice debugging, this must have taken you a while to track down. thanks for the fix! A+ PR

@pirate pirate merged commit feafe9a into ArchiveBox:dev Mar 5, 2022