bug: Failure to parse files with UTF8 byte-order mark #386

@john-hen

Griffe fails to parse Python files that begin with a UTF8 byte-order mark (a.k.a. BOM, code point U+FEFF).

Minimal reproducer with an otherwise empty Python module:

from griffe import GriffeLoader
from pathlib import Path

loader = GriffeLoader(search_paths=[Path('.')])
file = Path('empty_except_bom.py')
file.write_text('', encoding='utf-8-sig')
module = loader.load(file.stem)

Raises:

SyntaxError: invalid non-printable character U+FEFF
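
For context, here is a small sketch of what the reproducer actually puts on disk. It uses only the standard library and does not touch Griffe. The variable names are illustrative. The point is that the `utf-8-sig` codec writes the three BOM bytes, and that decoding those bytes as plain `utf-8` leaves U+FEFF in the resulting string, which is the character the `SyntaxError` above complains about.

```python
import codecs

# Encoding an empty string with 'utf-8-sig' yields only the byte-order mark.
raw = "".encode("utf-8-sig")
assert raw == codecs.BOM_UTF8 == b"\xef\xbb\xbf"

# Decoded as plain UTF-8, the BOM survives as the code point U+FEFF.
as_utf8 = raw.decode("utf-8")

# Decoded as 'utf-8-sig', the BOM is consumed and nothing is left.
as_utf8_sig = raw.decode("utf-8-sig")

print(repr(as_utf8), repr(as_utf8_sig))
```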

Full traceback:

Could not load package Package(name='empty_except_bom', path=WindowsPath('C:/home/projects/MPh/docs/Griffe_bug_UTF8_BOM/empty_except_bom.py'), stubs=None)
Traceback (most recent call last):
  File "C:\scratch\repos\other\Griffe\src\_griffe\loader.py", line 531, in _load_module
    return self._load_module_path(module_name, module_path, submodules=submodules, parent=parent)
           ~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\scratch\repos\other\Griffe\src\_griffe\loader.py", line 555, in _load_module_path
    module = self._visit_module(module_name, module_path, parent)
  File "C:\scratch\repos\other\Griffe\src\_griffe\loader.py", line 634, in _visit_module
    module = visit(
        module_name,
    ...<7 lines>...
        modules_collection=self.modules_collection,
    )
  File "C:\scratch\repos\other\Griffe\src\_griffe\agents\visitor.py", line 113, in visit
    ).get_module()
      ~~~~~~~~~~^^
  File "C:\scratch\repos\other\Griffe\src\_griffe\agents\visitor.py", line 204, in get_module
    top_node = compile(self.code, mode="exec", filename=str(self.filepath), flags=ast.PyCF_ONLY_AST, optimize=1)
  File "C:\home\projects\MPh\docs\Griffe_bug_UTF8_BOM\empty_except_bom.py", line 1
    
    ^
SyntaxError: invalid non-printable character U+FEFF

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:\scratch\repos\other\Griffe\src\_griffe\loader.py", line 179, in load
    top_module = self._load_package(package, submodules=submodules)
  File "C:\scratch\repos\other\Griffe\src\_griffe\loader.py", line 508, in _load_package
    top_module = self._load_module(package.name, package.path, submodules=submodules)
  File "C:\scratch\repos\other\Griffe\src\_griffe\loader.py", line 533, in _load_module
    raise LoadingError(f"Syntax error: {error}") from error
_griffe.exceptions.LoadingError: Syntax error: invalid non-printable character U+FEFF (empty_except_bom.py, line 1)
Traceback (most recent call last):
  File "C:\scratch\repos\other\Griffe\src\_griffe\loader.py", line 531, in _load_module
    return self._load_module_path(module_name, module_path, submodules=submodules, parent=parent)
           ~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\scratch\repos\other\Griffe\src\_griffe\loader.py", line 555, in _load_module_path
    module = self._visit_module(module_name, module_path, parent)
  File "C:\scratch\repos\other\Griffe\src\_griffe\loader.py", line 634, in _visit_module
    module = visit(
        module_name,
    ...<7 lines>...
        modules_collection=self.modules_collection,
    )
  File "C:\scratch\repos\other\Griffe\src\_griffe\agents\visitor.py", line 113, in visit
    ).get_module()
      ~~~~~~~~~~^^
  File "C:\scratch\repos\other\Griffe\src\_griffe\agents\visitor.py", line 204, in get_module
    top_node = compile(self.code, mode="exec", filename=str(self.filepath), flags=ast.PyCF_ONLY_AST, optimize=1)
  File "C:\home\projects\MPh\docs\Griffe_bug_UTF8_BOM\empty_except_bom.py", line 1
    
    ^
SyntaxError: invalid non-printable character U+FEFF

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:\home\projects\MPh\docs\Griffe_bug_UTF8_BOM\demo_bug.py", line 20, in <module>
    module = loader.load(file.stem)
  File "C:\scratch\repos\other\Griffe\src\_griffe\loader.py", line 179, in load
    top_module = self._load_package(package, submodules=submodules)
  File "C:\scratch\repos\other\Griffe\src\_griffe\loader.py", line 508, in _load_package
    top_module = self._load_module(package.name, package.path, submodules=submodules)
  File "C:\scratch\repos\other\Griffe\src\_griffe\loader.py", line 533, in _load_module
    raise LoadingError(f"Syntax error: {error}") from error
_griffe.exceptions.LoadingError: Syntax error: invalid non-printable character U+FEFF (empty_except_bom.py, line 1)

Environment information:

❯ griffe --debug-info
- __System__: Windows-11-10.0.22621-SP0
- __Python__: cpython 3.13.4 (C:\scratch\venvs\Griffe\Scripts\python.exe)
- __Environment variables__:
  - `PYTHONPATH`: `C:\home\tools;C:\polybox\work\tools`
- __Installed packages__:
  - `griffe` v1.7.4.dev1172+g441b3b7

UTF-8 BOMs aren't widely used, but the Python interpreter supports them. I, for one, use them routinely, as they prevent editing mishaps on Windows, where some editors default to ANSI encoding if the file contains no non-ASCII characters yet and a Unicode character is added later. (The situation has improved in recent years, though, and most Windows editors/IDEs now default to UTF-8 too, just like on other platforms.)
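
One possible way to tolerate the BOM is to read sources with the `utf-8-sig` codec before handing them to `compile()`. This is only a sketch under assumptions, not necessarily what Griffe does or what the upcoming PR will do. The helper names `read_source` and `parse_source` are hypothetical and not part of Griffe's API.

```python
import ast
from pathlib import Path


def read_source(path: Path) -> str:
    """Read a Python source file, dropping a leading UTF-8 BOM if present.

    Hypothetical helper: 'utf-8-sig' decodes BOM-less UTF-8 unchanged
    and strips the BOM when the file starts with one.
    """
    return path.read_text(encoding="utf-8-sig")


def parse_source(path: Path) -> ast.Module:
    """Parse the file into an AST, mirroring the compile() call in the traceback."""
    code = read_source(path)
    return compile(code, filename=str(path), mode="exec", flags=ast.PyCF_ONLY_AST)
```

With these helpers, `parse_source(Path('empty_except_bom.py'))` should return an empty `ast.Module` for the file created by the reproducer, since the string passed to `compile()` no longer starts with U+FEFF.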

PR to follow. Related issue: #99.
