Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix the URL_REGEX used in generic_html parsers #1396

Merged
merged 4 commits into from
Apr 24, 2024

Conversation

tqobqbq
Copy link
Contributor

@tqobqbq tqobqbq commented Apr 7, 2024

Fix the URL_REGEX in utils.py

I crawled a url:
url="https://twitter.com/share?url=https://akaao.success-corp.co.jp&text=アカイイト&アオイシロ 公式サイト&hashtags=アカアオ,元祖百合,アカイイト,アオイシロ"
and the re.findall(URL_REGEX, url) in the generic_html.py will return
['https://twitter.com/share?url=https://akaao.success-corp.co.jp&text=アカイイト&アオイシロ', 'https://akaao.success-corp.co.jp&text=アカイイト&アオイシロ'],
the latter is a wrong url which will raise error both in real browser and requests.
image

And In fact, the origin URL_REGEX:

URL_REGEX = re.compile(
    r'(?=('
    r'http[s]?://'
    r'(?:[a-zA-Z]|[0-9]'                      # 3
    r'|[-_$@.&+!*\(\),]'                     # 4
    r'|(?:%[0-9a-fA-F][0-9a-fA-F]))'  # 5
    r'[^\]\[\(\)<>"\'\s]+' 
    r'))',
    re.IGNORECASE,
)

it's line 3,4,5 will just catch only one char as there is no '+' follow them.
i change it to more standard one:

URL_REGEX = re.compile(
    r'(?=('
    r'https?://'                        #match schemes http and https,but can't match ftp
    r'(?:[A-Za-z0-9-]+\.)+[A-Za-z0-9-]+'#match domain
    r'(?::\d+)?'                        #match port,mabey not occur
    r'(?:/[^\\#\f\n\r\t\v]*)?'          #match path and query,maybe not occur
##    r'(?:#[^\]\[\(\)<>"\'\s]*){0,1}'  #match fragment,but don't need it actually 
    r'))',
##    re.IGNORECASE,                    #don't need to consider case problem
)

and it works for my example

@pirate
Copy link
Member

pirate commented Apr 8, 2024

This is intentional 🙂 I want it to match both because both are syntactically valid URLs regardless of whether they exist in the world or not.

@pirate pirate closed this Apr 8, 2024
@tqobqbq
Copy link
Contributor Author

tqobqbq commented Apr 8, 2024

@pirate
I undestand your intention.
My point is that the latter url "https://akaao.success-corp.co.jp&text=アカイイト&アオイシロ" is syntactically invalid both in browsers(tested in Chrome and Edge) and python(requests and urllib module).

In browsers,it will not be parsed as a url,just treated as a string and jump to a search engine to search the string.
In requests.get,it will raise requests.exceptions.InvalidURL error
7
8
That's because '&' before '?' is illeagal

I guess you want to extract the redirect url so use the findall.In that case my URL_REGEX can get the correct result:
['https://twitter.com/share?url=https://akaao.success-corp.co.jp&text=アカイイト&アオイシロ 公式サイト&hashtags=アカアオ,元祖百合,アカイイト,アオイシロ', 'https://akaao.success-corp.co.jp']
which may be you want

and after some test,I get a more precise version:

URL_REGEX = re.compile(
    r'(?=('
    r'https?://'                        
    r'(?:[A-Za-z0-9-]+\.)*[A-Za-z0-9-]+'
    r'[^\\#\f\n\r\t\v?&]*'      
    r'(?:\?[^\\#\f\n\r\t\v]*)*'  
    r'))',
)

@pirate pirate reopened this Apr 10, 2024
@pirate
Copy link
Member

pirate commented Apr 10, 2024

You're right, sorry! I just read through your original comment more thoroughly and I agree with your findings about my broken regex clauses and missing +'s.

I'll test your fix and will also cross-check it against some other big libraries to make sure our URL regex is more robust:

@tqobqbq
Copy link
Contributor Author

tqobqbq commented Apr 11, 2024

Thank you for your reply.
I find a problem in first commit and make some modifications.You can just test the second commit.

@pirate
Copy link
Member

pirate commented Apr 24, 2024

After further testing I've settled on this regex below. It allows https://example.com&somearg=before+slash but it fixes many of the other edge cases.

Changes:

  • allows ( and ) in URLs now (e.g. https://wikipedia.org/some_(article)_example.html
  • allows unicode special characters in domains and paths (e.g. http://⌘.ws)
  • fixes URLs parsed from markdown links (which end in )) by looking for unbalanced parens and cutting off trailing chars until they're balanced
URL_REGEX = re.compile(
    r'(?=('                           +
    r'http[s]?://'                    +  # start matching from allowed schemes
    r'(?:[a-zA-Z]|[0-9]'              +  # followed by allowed alphanum characters
    r'|[-_$@.&+!*\(\),]'              +  #   or allowed symbols (keep hyphen first to match literal hyphen)
    r'|[^\u0000-\u007F])+'            +  #   or allowed unicode bytes
    r'[^\]\[<>"\'\s]+'                +  # stop parsing at these symbols
    r'))',
    re.IGNORECASE | re.UNICODE,
)

def parens_are_matched(string: str, open_char='(', close_char=')'):
    """check that all parentheses in a string are balanced and nested properly"""
    count = 0
    for c in string:
        if c == open_char:
            count += 1
        elif c == close_char:
            count -= 1
        if count < 0:
            return False
    return count == 0

def fix_url_from_markdown(url_str: str) -> str:
    """
    cleanup a regex-parsed url that may contain dangling trailing parens from markdown link syntax
    helpful to fix URLs parsed from markdown e.g.
      input:  https://wikipedia.org/en/some_article_(Disambiguation).html?abc=def).somemoretext
      result: https://wikipedia.org/en/some_article_(Disambiguation).html?abc=def
    """
    trimmed_url = url_str

    # cut off one trailing character at a time
    # until parens are balanced e.g. /a(b)c).x(y)z -> /a(b)c
    while not parens_are_matched(trimmed_url):
        trimmed_url = trimmed_url[:-1]
    
    # make sure trimmed url is still valid
    if re.findall(URL_REGEX, trimmed_url):
        return trimmed_url
    
    return url_str

def find_all_urls(urls_str: str):
    for url in re.findall(URL_REGEX, urls_str):
        yield fix_url_from_markdown(url)

assert fix_url_from_markdown('/a(b)c).x(y)z') == '/a(b)c'
assert fix_url_from_markdown('https://wikipedia.org/en/some_article_(Disambiguation).html?abc=def).link(with)_trailingtext') == 'https://wikipedia.org/en/some_article_(Disambiguation).html?abc=def'

URL_REGEX_TESTS = [
    ('https://example.com', ['https://example.com']),
    ('http://abc-file234example.com/abc?def=abc&23423=sdfsdf#abc=234&234=a234', ['http://abc-file234example.com/abc?def=abc&23423=sdfsdf#abc=234&234=a234']),

    ('https://twitter.com/share?url=https://akaao.success-corp.co.jp&text=ア@サ!ト&hashtags=ア%オ,元+ア.ア-オ_イ*シ$ロ abc', ['https://twitter.com/share?url=https://akaao.success-corp.co.jp&text=ア@サ!ト&hashtags=ア%オ,元+ア.ア-オ_イ*シ$ロ', 'https://akaao.success-corp.co.jp&text=ア@サ!ト&hashtags=ア%オ,元+ア.ア-オ_イ*シ$ロ']),
    ('<a href="https://twitter.com/share#url=https://akaao.success-corp.co.jp&text=ア@サ!ト?hashtags=ア%オ,元+ア&abc=.ア-オ_イ*シ$ロ"> abc', ['https://twitter.com/share#url=https://akaao.success-corp.co.jp&text=ア@サ!ト?hashtags=ア%オ,元+ア&abc=.ア-オ_イ*シ$ロ', 'https://akaao.success-corp.co.jp&text=ア@サ!ト?hashtags=ア%オ,元+ア&abc=.ア-オ_イ*シ$ロ']),

    ('///a',                                                []),
    ('http://',                                             []),
    ('http://../',                                          ['http://../']),
    ('http://-error-.invalid/',                             ['http://-error-.invalid/']),
    ('https://a(b)c+1#2?3&4/',                              ['https://a(b)c+1#2?3&4/']),
    ('http://उदाहरण.परीक्षा',                                   ['http://उदाहरण.परीक्षा']),
    ('http://例子.测试',                                     ['http://例子.测试']),
    ('http://➡.ws/䨹 htps://abc.1243?234',                  ['http://➡.ws/䨹']),
    ('http://⌘.ws">https://exa+mple.com//:abc ',            ['http://⌘.ws', 'https://exa+mple.com//:abc']),
    ('http://مثال.إختبار/abc?def=ت&ب=abc#abc=234',          ['http://مثال.إختبار/abc?def=ت&ب=abc#abc=234']),
    ('http://-.~_!$&()*+,;=:%40:80%2f::::::@example.c\'om', ['http://-.~_!$&()*+,;=:%40:80%2f::::::@example.c']),
    
    ('http://us:[email protected]:42/http://ex.co:19/a?_d=4#-a=2.3', ['http://us:[email protected]:42/http://ex.co:19/a?_d=4#-a=2.3', 'http://ex.co:19/a?_d=4#-a=2.3']),
    ('http://code.google.com/events/#&product=browser',     ['http://code.google.com/events/#&product=browser']),
    ('http://foo.bar?q=Spaces should be encoded',           ['http://foo.bar?q=Spaces']),
    ('http://foo.com/blah_(wikipedia)#c(i)t[e]-1',          ['http://foo.com/blah_(wikipedia)#c(i)t']),
    ('http://foo.com/(something)?after=parens',             ['http://foo.com/(something)?after=parens']),
    ('http://foo.com/unicode_(✪)_in_parens) abc',           ['http://foo.com/unicode_(✪)_in_parens']),
    ('http://foo.bar/?q=Test%20URL-encoded%20stuff',        ['http://foo.bar/?q=Test%20URL-encoded%20stuff']),

    ('[xyz](http://a.b/?q=(Test)%20U)RL-encoded%20stuff',   ['http://a.b/?q=(Test)%20U']),
    ('[xyz](http://a.b/?q=(Test)%20U)-ab https://abc+123',  ['http://a.b/?q=(Test)%20U', 'https://abc+123']),
    ('[xyz](http://a.b/?q=(Test)%20U) https://a(b)c+12)3',  ['http://a.b/?q=(Test)%20U', 'https://a(b)c+12']),
    ('[xyz](http://a.b/?q=(Test)a\nabchttps://a(b)c+12)3',  ['http://a.b/?q=(Test)a', 'https://a(b)c+12']),
    ('http://foo.bar/?q=Test%20URL-encoded%20stuff',        ['http://foo.bar/?q=Test%20URL-encoded%20stuff']),
]
for urls_str, expected_url_matches in URL_REGEX_TESTS:
    url_matches = list(find_all_urls(urls_str))
    if url_matches != expected_url_matches:
        print('FAILED URL_REGEX CHECK', urls_str)
        print('matched: ', url_matches)
        print('expected:', expected_url_matches)
        print()

In general I care a lot more about not missing any valid URLs, even if that means ingesting some invalid URLs sometimes, e.g. (http://../ or https://a(b)c+1#2?3&4/).

@pirate pirate changed the base branch from main to dev April 24, 2024 02:53
@pirate pirate merged commit 665a2e5 into ArchiveBox:dev Apr 24, 2024
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants