-
-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix the URL_REGEX used in generic_html parsers #1396
Conversation
Signed-off-by: longzai <[email protected]>
This is intentional 🙂 I want it to match both because both are syntactically valid URLs regardless of whether they exist in the world or not. |
@pirate In browsers,it will not be parsed as a url,just treated as a string and jump to a search engine to search the string. I guess you want to extract the redirect url so use the findall.In that case my URL_REGEX can get the correct result: and after some test,I get a more precise version:
|
You're right, sorry! I just read through your original comment more thoroughly and I agree with your findings about my broken regex clauses and missing I'll test your fix and will also cross-check it against some other big libraries to make sure our URL regex is more robust:
|
Thank you for your reply. |
After further testing I've settled on this regex below. It allows Changes:
URL_REGEX = re.compile(
r'(?=(' +
r'http[s]?://' + # start matching from allowed schemes
r'(?:[a-zA-Z]|[0-9]' + # followed by allowed alphanum characters
r'|[-_$@.&+!*\(\),]' + # or allowed symbols (keep hyphen first to match literal hyphen)
r'|[^\u0000-\u007F])+' + # or allowed unicode bytes
r'[^\]\[<>"\'\s]+' + # stop parsing at these symbols
r'))',
re.IGNORECASE | re.UNICODE,
)
def parens_are_matched(string: str, open_char='(', close_char=')'):
"""check that all parentheses in a string are balanced and nested properly"""
count = 0
for c in string:
if c == open_char:
count += 1
elif c == close_char:
count -= 1
if count < 0:
return False
return count == 0
def fix_url_from_markdown(url_str: str) -> str:
"""
cleanup a regex-parsed url that may contain dangling trailing parens from markdown link syntax
helpful to fix URLs parsed from markdown e.g.
input: https://wikipedia.org/en/some_article_(Disambiguation).html?abc=def).somemoretext
result: https://wikipedia.org/en/some_article_(Disambiguation).html?abc=def
"""
trimmed_url = url_str
# cut off one trailing character at a time
# until parens are balanced e.g. /a(b)c).x(y)z -> /a(b)c
while not parens_are_matched(trimmed_url):
trimmed_url = trimmed_url[:-1]
# make sure trimmed url is still valid
if re.findall(URL_REGEX, trimmed_url):
return trimmed_url
return url_str
def find_all_urls(urls_str: str):
for url in re.findall(URL_REGEX, urls_str):
yield fix_url_from_markdown(url)
assert fix_url_from_markdown('/a(b)c).x(y)z') == '/a(b)c'
assert fix_url_from_markdown('https://wikipedia.org/en/some_article_(Disambiguation).html?abc=def).link(with)_trailingtext') == 'https://wikipedia.org/en/some_article_(Disambiguation).html?abc=def'
URL_REGEX_TESTS = [
('https://example.com', ['https://example.com']),
('http://abc-file234example.com/abc?def=abc&23423=sdfsdf#abc=234&234=a234', ['http://abc-file234example.com/abc?def=abc&23423=sdfsdf#abc=234&234=a234']),
('https://twitter.com/share?url=https://akaao.success-corp.co.jp&text=ア@サ!ト&hashtags=ア%オ,元+ア.ア-オ_イ*シ$ロ abc', ['https://twitter.com/share?url=https://akaao.success-corp.co.jp&text=ア@サ!ト&hashtags=ア%オ,元+ア.ア-オ_イ*シ$ロ', 'https://akaao.success-corp.co.jp&text=ア@サ!ト&hashtags=ア%オ,元+ア.ア-オ_イ*シ$ロ']),
('<a href="https://twitter.com/share#url=https://akaao.success-corp.co.jp&text=ア@サ!ト?hashtags=ア%オ,元+ア&abc=.ア-オ_イ*シ$ロ"> abc', ['https://twitter.com/share#url=https://akaao.success-corp.co.jp&text=ア@サ!ト?hashtags=ア%オ,元+ア&abc=.ア-オ_イ*シ$ロ', 'https://akaao.success-corp.co.jp&text=ア@サ!ト?hashtags=ア%オ,元+ア&abc=.ア-オ_イ*シ$ロ']),
('///a', []),
('http://', []),
('http://../', ['http://../']),
('http://-error-.invalid/', ['http://-error-.invalid/']),
('https://a(b)c+1#2?3&4/', ['https://a(b)c+1#2?3&4/']),
('http://उदाहरण.परीक्षा', ['http://उदाहरण.परीक्षा']),
('http://例子.测试', ['http://例子.测试']),
('http://➡.ws/䨹 htps://abc.1243?234', ['http://➡.ws/䨹']),
('http://⌘.ws">https://exa+mple.com//:abc ', ['http://⌘.ws', 'https://exa+mple.com//:abc']),
('http://مثال.إختبار/abc?def=ت&ب=abc#abc=234', ['http://مثال.إختبار/abc?def=ت&ب=abc#abc=234']),
('http://-.~_!$&()*+,;=:%40:80%2f::::::@example.c\'om', ['http://-.~_!$&()*+,;=:%40:80%2f::::::@example.c']),
('http://us:[email protected]:42/http://ex.co:19/a?_d=4#-a=2.3', ['http://us:[email protected]:42/http://ex.co:19/a?_d=4#-a=2.3', 'http://ex.co:19/a?_d=4#-a=2.3']),
('http://code.google.com/events/#&product=browser', ['http://code.google.com/events/#&product=browser']),
('http://foo.bar?q=Spaces should be encoded', ['http://foo.bar?q=Spaces']),
('http://foo.com/blah_(wikipedia)#c(i)t[e]-1', ['http://foo.com/blah_(wikipedia)#c(i)t']),
('http://foo.com/(something)?after=parens', ['http://foo.com/(something)?after=parens']),
('http://foo.com/unicode_(✪)_in_parens) abc', ['http://foo.com/unicode_(✪)_in_parens']),
('http://foo.bar/?q=Test%20URL-encoded%20stuff', ['http://foo.bar/?q=Test%20URL-encoded%20stuff']),
('[xyz](http://a.b/?q=(Test)%20U)RL-encoded%20stuff', ['http://a.b/?q=(Test)%20U']),
('[xyz](http://a.b/?q=(Test)%20U)-ab https://abc+123', ['http://a.b/?q=(Test)%20U', 'https://abc+123']),
('[xyz](http://a.b/?q=(Test)%20U) https://a(b)c+12)3', ['http://a.b/?q=(Test)%20U', 'https://a(b)c+12']),
('[xyz](http://a.b/?q=(Test)a\nabchttps://a(b)c+12)3', ['http://a.b/?q=(Test)a', 'https://a(b)c+12']),
('http://foo.bar/?q=Test%20URL-encoded%20stuff', ['http://foo.bar/?q=Test%20URL-encoded%20stuff']),
]
for urls_str, expected_url_matches in URL_REGEX_TESTS:
url_matches = list(find_all_urls(urls_str))
if url_matches != expected_url_matches:
print('FAILED URL_REGEX CHECK', urls_str)
print('matched: ', url_matches)
print('expected:', expected_url_matches)
print() In general I care a lot more about not missing any valid URLs, even if that means ingesting some invalid URLs sometimes, e.g. ( |
Fix the URL_REGEX in utils.py
I crawled a url:

url="https://twitter.com/share?url=https://akaao.success-corp.co.jp&text=アカイイト&アオイシロ 公式サイト&hashtags=アカアオ,元祖百合,アカイイト,アオイシロ"
and the re.findall(URL_REGEX, url) in the generic_html.py will return
['https://twitter.com/share?url=https://akaao.success-corp.co.jp&text=アカイイト&アオイシロ', 'https://akaao.success-corp.co.jp&text=アカイイト&アオイシロ'],
the latter is a wrong url which will raise error both in real browser and requests.
And In fact, the origin URL_REGEX:
it's line 3,4,5 will just catch only one char as there is no '+' follow them.
i change it to more standard one:
and it works for my example