Skip to content

Implement check-xml as builtin hook#894

Merged
j178 merged 2 commits intoj178:masterfrom
lmmx:check-xml
Oct 17, 2025
Merged

Implement check-xml as builtin hook#894
j178 merged 2 commits intoj178:masterfrom
lmmx:check-xml

Conversation

@lmmx
Copy link
Copy Markdown
Collaborator

@lmmx lmmx commented Oct 15, 2025

Description

To go with the existing format hooks, this is a hook to fast path the check XML hook. Follows #880 where it was noted as one of the hooks used by Airflow.

I tried a few different libraries: Python's XML checker uses a SAX parser (xml.sax.handler in the original), from the standard library (here)

  • Initially I tried xmltree, it failed a couple of different test cases
  • roxmltree failed an XML test case with a doctype
  • quick-xml failed to reject an XML which had multiple roots

Given that none of them matched the behaviour of the Python SAX parser, I then just modified the quick-xml result to reject that, and also to reject empty files (which the Python one rejects but this didn't).

I used this script to test in a ~/tmp subdir

Click to show repro script
#!/bin/bash

# Clean up any existing test directory
rm -rf ~/tmp/xml-full-test
mkdir -p ~/tmp/xml-full-test
cd ~/tmp/xml-full-test

# Initialize git repo
git init
git config user.email "[email protected]"
git config user.name "Test User"

# Create .pre-commit-config.yaml
cat > .pre-commit-config.yaml << 'EOF'
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v5.0.0
    hooks:
      - id: check-xml
EOF

# Create valid XML
cat > valid.xml << 'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<root>
    <element>value</element>
</root>
EOF

# Create invalid XML - unclosed tag
cat > invalid_unclosed.xml << 'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<root>
    <element>value
</root>
EOF

# Create invalid XML - mismatched tags
cat > invalid_mismatched.xml << 'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<root>
    <element>value</different>
</root>
EOF

# Create invalid XML - multiple root elements
cat > multiple_roots.xml << 'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<element>value</element>
<another>value</another>
EOF

# Create empty XML
touch empty.xml

# Create XML with attributes
cat > with_attributes.xml << 'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<root xmlns="http://example.com">
    <element id="1" type="test">value</element>
</root>
EOF

# Create XML with CDATA
cat > with_cdata.xml << 'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<root>
    <element><![CDATA[Some <special> characters & symbols]]></element>
</root>
EOF

# Create XML with comments
cat > with_comments.xml << 'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<root>
    <!-- This is a comment -->
    <element>value</element>
</root>
EOF

# Create XML with DOCTYPE
cat > with_doctype.xml << 'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE root SYSTEM "root.dtd">
<root>
    <element>value</element>
</root>
EOF

# Stage all files
git add .

# Run pre-commit
echo "========================================="
echo "Running pre-commit check-xml..."
echo "========================================="
pre-commit run --all-files --verbose
EXIT_CODE=$?

echo ""
echo "========================================="
echo "Exit code: $EXIT_CODE"
echo "========================================="

echo

# Run prek
echo "========================================="
echo "Running prek check-xml..."
echo "========================================="
prek run --all-files --verbose
EXIT_CODE=$?

echo ""
echo "========================================="
echo "Exit code: $EXIT_CODE"
echo "========================================="

I then ran it with this branch of prek and regular pre-commit and compared their results to ensure we maintain the same outcomes

louis 🌟 ~/tmp $ bash test_precommit_xml_2.sh 2>/dev/null
Running pre-commit check-xml...
check xml................................................................Failed
- hook id: check-xml
- duration: 0.05s
- exit code: 1

empty.xml: Failed to xml parse (empty.xml:1:0: no element found)
invalid_mismatched.xml: Failed to xml parse (invalid_mismatched.xml:3:20: mismatched tag)
invalid_unclosed.xml: Failed to xml parse (invalid_unclosed.xml:4:2: mismatched tag)
multiple_roots.xml: Failed to xml parse (multiple_roots.xml:3:0: junk after document element)
Running prek check-xml...
check xml................................................................Failed
- hook id: check-xml
- duration: 0.00s
- exit code: 1
  invalid_mismatched.xml: Failed to xml parse (ill-formed document: expected `</element>`, but `</different>` was found)
  empty.xml: Failed to xml parse (no element found)
  invalid_unclosed.xml: Failed to xml parse (ill-formed document: expected `</element>`, but `</root>` was found)
  multiple_roots.xml: Failed to xml parse (junk after document element)
  • The same files are rejected

  • The run takes 0.00s instead of 0.05s

  • New quick-xml dependency increases compilation time from 21.8s to 22.3s (+2%)

  • Unit and snapshot tests added and compared against check-xml behaviour

Demo

With the Apache Airflow repo, pre-commit takes 0.1s

louis 🌟 ~/tmp/airflow $ pre-commit run -a --verbose
Check XML files with xmllint.............................................Passed
- hook id: check-xml
- duration: 0.1s

While prek takes 0.01s (10x faster) 🎉

louis 🌟 ~/tmp/airflow $ prek run -a --verbose
Check XML files with xmllint.............................................Passed
- hook id: check-xml
- duration: 0.01s

@codecov
Copy link
Copy Markdown

codecov Bot commented Oct 15, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 90.14%. Comparing base (054e1fb) to head (dc5a320).
⚠️ Report is 1 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #894      +/-   ##
==========================================
+ Coverage   89.98%   90.14%   +0.15%     
==========================================
  Files          63       64       +1     
  Lines       11756    11939     +183     
==========================================
+ Hits        10579    10762     +183     
  Misses       1177     1177              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@j178
Copy link
Copy Markdown
Owner

j178 commented Oct 15, 2025

Thank you!

I’m not really keen on pulling in a big dependency just for one hook, unless it’s something super commonly used (like check-toml). Like other hook ports, could you check how much the binary size changes?

@lmmx
Copy link
Copy Markdown
Collaborator Author

lmmx commented Oct 15, 2025

Sure let me take a look...

cargo build --release
du -h prek
  • master branch: 11MB (10552B)
  • this branch: 11MB (10572B)

So +0.18% bigger binary, not much. It is based on things we already have in the compiled tree I think: see its Cargo.toml

arbitrary = { version = "1", features = ["derive"], optional = true }
document-features = { version = "0.2", optional = true }
encoding_rs = { version = "0.8", optional = true }
serde = { version = ">=1.0.139", optional = true }
tokio = { version = "1.10", optional = true, default-features = false, features = ["io-util"] }
memchr = "2.1"

@j178
Copy link
Copy Markdown
Owner

j178 commented Oct 15, 2025

That's great!

@lmmx lmmx mentioned this pull request Oct 15, 2025
2 tasks
@lmmx lmmx force-pushed the check-xml branch 2 times, most recently from 55deba8 to c00932e Compare October 15, 2025 17:06
@lmmx
Copy link
Copy Markdown
Collaborator Author

lmmx commented Oct 15, 2025

Merge conflicts resolved

# Conflicts:
#	tests/builtin_hooks.rs
@j178 j178 changed the title feat(check-xml): implement builtin hook Implement check-xml as builtin hook Oct 17, 2025
@j178 j178 added the enhancement New feature or request label Oct 17, 2025
@j178 j178 merged commit bbdac78 into j178:master Oct 17, 2025
18 checks passed
@j178
Copy link
Copy Markdown
Owner

j178 commented Oct 17, 2025

Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants