Bad regexp in Chinese search #2544

@jwilk

Hi @enhao!

sphinx/search/zh.py contains this:

    latin1_letters = re.compile(r'\w+(?u)[\u0000-\u00ff]')

But the \u sequence is supported by the re module only since Python 3.3. In previous versions this regexp is equivalent to:

    r'\w+(?u)[0-u]'

which is definitely not what you wanted.
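To make the difference concrete, here is a minimal sketch comparing the two character classes. It assumes the pre-3.3 behaviour is exactly the `[0-u]` range described above, and it leaves out the inline `(?u)` flag, because str patterns are already Unicode-aware in Python 3 and recent Python 3 releases reject a global flag that is not at the start of the pattern. The names `legacy` and `intended` are just illustrative.

```python
import re

# What older versions effectively compiled, per the description above:
# "\u" is read as a plain "u", so the class collapses to the range "0".."u".
legacy = re.compile(r'\w+[0-u]')

# What Python 3.3+ compiles: the full U+0000..U+00FF range.
intended = re.compile(r'\w+[\u0000-\u00ff]')

# Columns: input, matched by the legacy class, matched by the intended class.
for text in ('ab:', 'abz', 'ab\u00e9'):
    print(repr(text),
          legacy.fullmatch(text) is not None,
          intended.fullmatch(text) is not None)
```

The legacy form accepts a trailing `:` (it lies between `0` and `u`) but rejects a trailing `z` or `é`, while the intended class accepts all three.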

But even in Python 3.3+, the regexp looks dubious to me. The variable name is latin1_letters, but the regexp matches a sequence of alphanumeric characters (not necessarily Latin-1), followed by a single Latin-1 character (not necessarily a letter).
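A quick way to see the mismatch, again without the inline `(?u)` flag (not needed for str patterns in Python 3). The second pattern, `latin1_letter_run`, is only a hypothetical alternative based on a guess about the intent, namely matching runs of Latin-1 letters. It is not what Sphinx does and may not be what the search code actually needs.

```python
import re

# The pattern from sphinx/search/zh.py, minus the inline (?u) flag.
current = re.compile(r'\w+[\u0000-\u00ff]')

# Hypothetical alternative (an assumption about the intent, not Sphinx code):
# runs of letters taken from the Latin-1 range only. U+00D7 (multiplication
# sign) and U+00F7 (division sign) are deliberately excluded.
latin1_letter_run = re.compile(
    r'[A-Za-z\u00aa\u00b5\u00ba\u00c0-\u00d6\u00d8-\u00f6\u00f8-\u00ff]+'
)

# CJK characters count as \w, and "!" is a Latin-1 character but not a letter,
# so the current pattern matches the whole string.
print(current.fullmatch('中文!') is not None)

# The alternative picks out only the Latin-1 letter run.
print(latin1_letter_run.findall('中文 café 123!'))
```

The first call prints `True` even though the input contains no Latin-1 letters at all; the second finds only `café`.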
