fix the URL_REGEX used in generic_html parsers #1396

tqobqbq · 2024-04-07T21:26:56Z

Fix the URL_REGEX in utils.py

I crawled a url:
url="https://twitter.com/share?url=https://akaao.success-corp.co.jp&text=アカイイト＆アオイシロ公式サイト&hashtags=アカアオ,元祖百合,アカイイト,アオイシロ"
and the re.findall(URL_REGEX, url) in the generic_html.py will return
['https://twitter.com/share?url=https://akaao.success-corp.co.jp&text=アカイイト＆アオイシロ', 'https://akaao.success-corp.co.jp&text=アカイイト＆アオイシロ'],
the latter is a wrong url which will raise error both in real browser and requests.

And In fact, the origin URL_REGEX:

URL_REGEX = re.compile(
    r'(?=('
    r'http[s]?://'
    r'(?:[a-zA-Z]|[0-9]'                      # 3
    r'|[-_$@.&+!*\(\),]'                     # 4
    r'|(?:%[0-9a-fA-F][0-9a-fA-F]))'  # 5
    r'[^\]\[\(\)<>"\'\s]+' 
    r'))',
    re.IGNORECASE,
)

it's line 3,4,5 will just catch only one char as there is no '+' follow them.
i change it to more standard one:

URL_REGEX = re.compile(
    r'(?=('
    r'https?://'                        #match schemes http and https,but can't match ftp
    r'(?:[A-Za-z0-9-]+\.)+[A-Za-z0-9-]+'#match domain
    r'(?::\d+)?'                        #match port,mabey not occur
    r'(?:/[^\\#\f\n\r\t\v]*)?'          #match path and query,maybe not occur
##    r'(?:#[^\]\[\(\)<>"\'\s]*){0,1}'  #match fragment,but don't need it actually 
    r'))',
##    re.IGNORECASE,                    #don't need to consider case problem
)

and it works for my example

Signed-off-by: longzai <[email protected]>

pirate · 2024-04-08T02:25:57Z

This is intentional 🙂 I want it to match both because both are syntactically valid URLs regardless of whether they exist in the world or not.

tqobqbq · 2024-04-08T11:31:31Z

@pirate
I undestand your intention.
My point is that the latter url "https://akaao.success-corp.co.jp&text=アカイイト＆アオイシロ" is syntactically invalid both in browsers(tested in Chrome and Edge) and python(requests and urllib module).

In browsers,it will not be parsed as a url,just treated as a string and jump to a search engine to search the string.
In requests.get,it will raise requests.exceptions.InvalidURL error

That's because '&' before '?' is illeagal

I guess you want to extract the redirect url so use the findall.In that case my URL_REGEX can get the correct result:
['https://twitter.com/share?url=https://akaao.success-corp.co.jp&text=アカイイト＆アオイシロ公式サイト&hashtags=アカアオ,元祖百合,アカイイト,アオイシロ', 'https://akaao.success-corp.co.jp']
which may be you want

and after some test,I get a more precise version:

URL_REGEX = re.compile(
    r'(?=('
    r'https?://'                        
    r'(?:[A-Za-z0-9-]+\.)*[A-Za-z0-9-]+'
    r'[^\\#\f\n\r\t\v?&]*'      
    r'(?:\?[^\\#\f\n\r\t\v]*)*'  
    r'))',
)

pirate · 2024-04-10T12:32:32Z

You're right, sorry! I just read through your original comment more thoroughly and I agree with your findings about my broken regex clauses and missing +'s.

I'll test your fix and will also cross-check it against some other big libraries to make sure our URL regex is more robust:

tqobqbq · 2024-04-11T08:03:15Z

Thank you for your reply.
I find a problem in first commit and make some modifications.You can just test the second commit.

pirate · 2024-04-24T01:57:48Z

After further testing I've settled on this regex below. It allows https://example.com&somearg=before+slash but it fixes many of the other edge cases.

Changes:

allows ( and ) in URLs now (e.g. https://wikipedia.org/some_(article)_example.html
allows unicode special characters in domains and paths (e.g. http://⌘.ws)
fixes URLs parsed from markdown links (which end in )) by looking for unbalanced parens and cutting off trailing chars until they're balanced

URL_REGEX = re.compile(
    r'(?=('                           +
    r'http[s]?://'                    +  # start matching from allowed schemes
    r'(?:[a-zA-Z]|[0-9]'              +  # followed by allowed alphanum characters
    r'|[-_$@.&+!*\(\),]'              +  #   or allowed symbols (keep hyphen first to match literal hyphen)
    r'|[^\u0000-\u007F])+'            +  #   or allowed unicode bytes
    r'[^\]\[<>"\'\s]+'                +  # stop parsing at these symbols
    r'))',
    re.IGNORECASE | re.UNICODE,
)

def parens_are_matched(string: str, open_char='(', close_char=')'):
    """check that all parentheses in a string are balanced and nested properly"""
    count = 0
    for c in string:
        if c == open_char:
            count += 1
        elif c == close_char:
            count -= 1
        if count < 0:
            return False
    return count == 0

def fix_url_from_markdown(url_str: str) -> str:
    """
    cleanup a regex-parsed url that may contain dangling trailing parens from markdown link syntax
    helpful to fix URLs parsed from markdown e.g.
      input:  https://wikipedia.org/en/some_article_(Disambiguation).html?abc=def).somemoretext
      result: https://wikipedia.org/en/some_article_(Disambiguation).html?abc=def
    """
    trimmed_url = url_str

    # cut off one trailing character at a time
    # until parens are balanced e.g. /a(b)c).x(y)z -> /a(b)c
    while not parens_are_matched(trimmed_url):
        trimmed_url = trimmed_url[:-1]
    
    # make sure trimmed url is still valid
    if re.findall(URL_REGEX, trimmed_url):
        return trimmed_url
    
    return url_str

def find_all_urls(urls_str: str):
    for url in re.findall(URL_REGEX, urls_str):
        yield fix_url_from_markdown(url)

assert fix_url_from_markdown('/a(b)c).x(y)z') == '/a(b)c'
assert fix_url_from_markdown('https://wikipedia.org/en/some_article_(Disambiguation).html?abc=def).link(with)_trailingtext') == 'https://wikipedia.org/en/some_article_(Disambiguation).html?abc=def'

URL_REGEX_TESTS = [
    ('https://example.com', ['https://example.com']),
    ('http://abc-file234example.com/abc?def=abc&23423=sdfsdf#abc=234&234=a234', ['http://abc-file234example.com/abc?def=abc&23423=sdfsdf#abc=234&234=a234']),

    ('https://twitter.com/share?url=https://akaao.success-corp.co.jp&text=ア@サ!ト&hashtags=ア%オ,元+ア.ア-オ_イ*シ$ロ abc', ['https://twitter.com/share?url=https://akaao.success-corp.co.jp&text=ア@サ!ト&hashtags=ア%オ,元+ア.ア-オ_イ*シ$ロ', 'https://akaao.success-corp.co.jp&text=ア@サ!ト&hashtags=ア%オ,元+ア.ア-オ_イ*シ$ロ']),
    ('<a href="https://twitter.com/share#url=https://akaao.success-corp.co.jp&text=ア@サ!ト?hashtags=ア%オ,元+ア&abc=.ア-オ_イ*シ$ロ"> abc', ['https://twitter.com/share#url=https://akaao.success-corp.co.jp&text=ア@サ!ト?hashtags=ア%オ,元+ア&abc=.ア-オ_イ*シ$ロ', 'https://akaao.success-corp.co.jp&text=ア@サ!ト?hashtags=ア%オ,元+ア&abc=.ア-オ_イ*シ$ロ']),

    ('///a',                                                []),
    ('http://',                                             []),
    ('http://../',                                          ['http://../']),
    ('http://-error-.invalid/',                             ['http://-error-.invalid/']),
    ('https://a(b)c+1#2?3&4/',                              ['https://a(b)c+1#2?3&4/']),
    ('http://उदाहरण.परीक्षा',                                   ['http://उदाहरण.परीक्षा']),
    ('http://例子.测试',                                     ['http://例子.测试']),
    ('http://➡.ws/䨹 htps://abc.1243?234',                  ['http://➡.ws/䨹']),
    ('http://⌘.ws">https://exa+mple.com//:abc ',            ['http://⌘.ws', 'https://exa+mple.com//:abc']),
    ('http://مثال.إختبار/abc?def=ت&ب=abc#abc=234',          ['http://مثال.إختبار/abc?def=ت&ب=abc#abc=234']),
    ('http://-.~_!$&()*+,;=:%40:80%2f::::::@example.c\'om', ['http://-.~_!$&()*+,;=:%40:80%2f::::::@example.c']),
    
    ('http://us:[email protected]:42/http://ex.co:19/a?_d=4#-a=2.3', ['http://us:[email protected]:42/http://ex.co:19/a?_d=4#-a=2.3', 'http://ex.co:19/a?_d=4#-a=2.3']),
    ('http://code.google.com/events/#&product=browser',     ['http://code.google.com/events/#&product=browser']),
    ('http://foo.bar?q=Spaces should be encoded',           ['http://foo.bar?q=Spaces']),
    ('http://foo.com/blah_(wikipedia)#c(i)t[e]-1',          ['http://foo.com/blah_(wikipedia)#c(i)t']),
    ('http://foo.com/(something)?after=parens',             ['http://foo.com/(something)?after=parens']),
    ('http://foo.com/unicode_(✪)_in_parens) abc',           ['http://foo.com/unicode_(✪)_in_parens']),
    ('http://foo.bar/?q=Test%20URL-encoded%20stuff',        ['http://foo.bar/?q=Test%20URL-encoded%20stuff']),

    ('[xyz](http://a.b/?q=(Test)%20U)RL-encoded%20stuff',   ['http://a.b/?q=(Test)%20U']),
    ('[xyz](http://a.b/?q=(Test)%20U)-ab https://abc+123',  ['http://a.b/?q=(Test)%20U', 'https://abc+123']),
    ('[xyz](http://a.b/?q=(Test)%20U) https://a(b)c+12)3',  ['http://a.b/?q=(Test)%20U', 'https://a(b)c+12']),
    ('[xyz](http://a.b/?q=(Test)a\nabchttps://a(b)c+12)3',  ['http://a.b/?q=(Test)a', 'https://a(b)c+12']),
    ('http://foo.bar/?q=Test%20URL-encoded%20stuff',        ['http://foo.bar/?q=Test%20URL-encoded%20stuff']),
]
for urls_str, expected_url_matches in URL_REGEX_TESTS:
    url_matches = list(find_all_urls(urls_str))
    if url_matches != expected_url_matches:
        print('FAILED URL_REGEX CHECK', urls_str)
        print('matched: ', url_matches)
        print('expected:', expected_url_matches)
        print()

In general I care a lot more about not missing any valid URLs, even if that means ingesting some invalid URLs sometimes, e.g. (http://../ or https://a(b)c+1#2?3&4/).

fix the URL_REGEX used in generic_html parsers

4ae765e

Signed-off-by: longzai <[email protected]>

pirate closed this Apr 8, 2024

pirate reopened this Apr 10, 2024

fix URL_REGEX 2

e4dc270

Update util.py

c6f8a33

pirate changed the base branch from main to dev April 24, 2024 02:53

Merge branch 'dev' into fix-URL_REGEX

17f40f3

pirate merged commit 665a2e5 into ArchiveBox:dev Apr 24, 2024
1 check passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix the URL_REGEX used in generic_html parsers #1396

fix the URL_REGEX used in generic_html parsers #1396

tqobqbq commented Apr 7, 2024 •

edited

Loading

pirate commented Apr 8, 2024

tqobqbq commented Apr 8, 2024 •

edited

Loading

pirate commented Apr 10, 2024 •

edited

Loading

tqobqbq commented Apr 11, 2024

pirate commented Apr 24, 2024 •

edited

Loading

fix the URL_REGEX used in generic_html parsers #1396

fix the URL_REGEX used in generic_html parsers #1396

Conversation

tqobqbq commented Apr 7, 2024 • edited Loading

pirate commented Apr 8, 2024

tqobqbq commented Apr 8, 2024 • edited Loading

pirate commented Apr 10, 2024 • edited Loading

tqobqbq commented Apr 11, 2024

pirate commented Apr 24, 2024 • edited Loading

tqobqbq commented Apr 7, 2024 •

edited

Loading

tqobqbq commented Apr 8, 2024 •

edited

Loading

pirate commented Apr 10, 2024 •

edited

Loading

pirate commented Apr 24, 2024 •

edited

Loading