Attempts to introduce escaped unicode decoding #1854

Merged: zricethezav merged 1 commit into master from unicode-decoder on May 14, 2025

@zricethezav
Collaborator

Description:

Attempts to introduce escaped Unicode decoding. This supports two kinds of escaped Unicode: standard notation, and common escape sequences (\, \\).
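To make the idea concrete, here is a minimal Python sketch of this kind of decoding. It is not the code from this PR. The description does not spell out the exact formats, so the sketch assumes "standard notation" means `U+XXXX` and that the escape sequences look like `\uXXXX` and `\\uXXXX` with four hex digits. The function and pattern names are hypothetical.

```python
import re

# Assumed formats (not confirmed by the PR text): "U+0041" code point
# notation, and "\u0041" / "\\u0041" escapes with exactly 4 hex digits.
CODE_POINT = re.compile(r"U\+([0-9A-Fa-f]{4})")
ESCAPED = re.compile(r"\\{1,2}u([0-9A-Fa-f]{4})")


def decode_escaped_unicode(text):
    """Replace each recognized escape with the character it names."""

    def to_char(match):
        # Group 1 holds the hex digits of the code point.
        return chr(int(match.group(1), 16))

    # Handle backslash escapes first, then U+XXXX notation.
    text = ESCAPED.sub(to_char, text)
    return CODE_POINT.sub(to_char, text)
```

A scanner would presumably run its rules against the decoded text as well as the raw text, so a secret written as escaped code points can still match a pattern.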

Checklist:

  • Does your PR pass tests?
  • Have you written new tests for your changes?
  • Have you linted your code locally prior to submission?

@zricethezav
Collaborator Author

@bplaxco would love to see if this significantly slows down your benchmarks

@bplaxco
Contributor

bplaxco commented May 14, 2025

Note: I wouldn't call this "good testing", I just kinda ran a few things when I had a minute yesterday evening but didn't really get to sit down and do it carefully. But I figured I'd share what I got regardless ^_^

Did some basic hyperfine tests (ignore errors, 3 warmups, 10 runs):

baseline == master branch
unicode-decoder == this branch rebased on master

I pulled my own gitleaks config just because I wanted to use a consistent set of patterns between these two and the gitleaks command I had installed in my package manager.

Benchmark 1: ./baseline --config gitleaks.toml git --max-decode-depth 8 gitleaks.git
  Time (mean ± σ):      5.275 s ±  0.052 s    [User: 24.035 s, System: 1.360 s]
  Range (min … max):    5.197 s …  5.336 s    10 runs

  Warning: Ignoring non-zero exit code.

Benchmark 2: ./unicode-decoder --config gitleaks.toml git --max-decode-depth 8 gitleaks.git
  Time (mean ± σ):      5.305 s ±  0.174 s    [User: 24.855 s, System: 1.348 s]
  Range (min … max):    5.168 s …  5.756 s    10 runs

  Warning: Ignoring non-zero exit code.

Summary
  ./baseline --config gitleaks.toml git --max-decode-depth 8 gitleaks.git ran
    1.01 ± 0.03 times faster than ./unicode-decoder --config gitleaks.toml git --max-decode-depth 8 gitleaks.git

Benchmark 1: ./baseline --config gitleaks.toml dir --max-decode-depth 8 gitleaks.git
  Time (mean ± σ):      66.4 ms ±   4.2 ms    [User: 71.2 ms, System: 23.1 ms]
  Range (min … max):    60.2 ms …  76.3 ms    38 runs

Benchmark 2: ./unicode-decoder --config gitleaks.toml dir --max-decode-depth 8 gitleaks.git
  Time (mean ± σ):      59.5 ms ±   2.1 ms    [User: 62.5 ms, System: 21.9 ms]
  Range (min … max):    55.4 ms …  64.2 ms    47 runs

Summary
  ./unicode-decoder --config gitleaks.toml dir --max-decode-depth 8 gitleaks.git ran
    1.12 ± 0.08 times faster than ./baseline --config gitleaks.toml dir --max-decode-depth 8 gitleaks.git

(note: dir is probably so long because I expect it's going down into the .git dir for the repo and scanning those large files serially)
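The "times faster" lines in the summaries can be reproduced from the reported means and standard deviations. The sketch below assumes the ± figure comes from first-order error propagation for a quotient of two independent measurements. That assumption reproduces the numbers above, but it is not taken from the hyperfine source. The function name is hypothetical.

```python
import math


def relative_speed(mean_fast, sd_fast, mean_slow, sd_slow):
    """Return (ratio, uncertainty) for mean_slow / mean_fast.

    The uncertainty uses first-order error propagation for a quotient,
    treating the two measurements as independent.
    """
    ratio = mean_slow / mean_fast
    rel_err = math.sqrt((sd_fast / mean_fast) ** 2 + (sd_slow / mean_slow) ** 2)
    return ratio, ratio * rel_err


# git mode: baseline 5.275 s ± 0.052 s, unicode-decoder 5.305 s ± 0.174 s
git_ratio, git_err = relative_speed(5.275, 0.052, 5.305, 0.174)
# dir mode: unicode-decoder 59.5 ms ± 2.1 ms, baseline 66.4 ms ± 4.2 ms
dir_ratio, dir_err = relative_speed(59.5, 2.1, 66.4, 4.2)

print(f"git: {git_ratio:.2f} ± {git_err:.2f}")  # git: 1.01 ± 0.03
print(f"dir: {dir_ratio:.2f} ± {dir_err:.2f}")  # dir: 1.12 ± 0.08
```

Read this way, the git result (1.01 ± 0.03) is within noise, so these runs show no measurable slowdown from the new decoder.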

I did one diagnostics run of it against the kubernetes repo:

(Note: I had CPU and memory diagnostics running at the same time which probably isn't the best idea for clean results).

image

Thoughts:

Looks like the kubernetes repo ends up being a great benchmarking repo: there's lots of b64, percent-encoded, and unicode-escaped data in there.

It'd probably be good to do something like this and pick through it, preferably on an idle system:

git clone --mirror git@github.com:kubernetes/kubernetes.git

hyperfine \
  --export-json unicode-perf.json -w 3 -i \
  './baseline --config gitleaks.toml git --max-decode-depth 8 kubernetes.git' \
  './unicode-decoder --config gitleaks.toml git --max-decode-depth 8 kubernetes.git' \
  './baseline --config gitleaks.toml git kubernetes.git' \
  './unicode-decoder --config gitleaks.toml git kubernetes.git'

# Assuming unicode-decoder is rebased on master
./baseline --diagnostics-dir=baseline-k8s --diagnostics=cpu --config gitleaks.toml git --max-decode-depth 8 kubernetes.git
./unicode-decoder --diagnostics-dir=unicode-k8s --diagnostics=cpu --config gitleaks.toml git --max-decode-depth 8 kubernetes.git
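For picking through the exported `unicode-perf.json`, a small Python sketch like the one below could rank the four commands. It relies on hyperfine's JSON export having a top-level `results` list whose entries carry `command` and `mean` (in seconds). The helper names are hypothetical.

```python
import json


def load_results(path):
    """Read the "results" list from a hyperfine --export-json file."""
    with open(path) as fh:
        return json.load(fh)["results"]


def compare_to_fastest(results):
    """Return (command, mean_seconds, ratio_to_fastest) rows, fastest first."""
    ranked = sorted(results, key=lambda r: r["mean"])
    fastest = ranked[0]["mean"]
    return [(r["command"], r["mean"], r["mean"] / fastest) for r in ranked]


def print_report(results):
    """Print one line per command: mean time, slowdown factor, command."""
    for command, mean, ratio in compare_to_fastest(results):
        print(f"{mean:8.3f} s  {ratio:5.2f}x  {command}")
```

Calling `print_report(load_results("unicode-perf.json"))` after the hyperfine run would list the baseline and unicode-decoder commands side by side, with and without `--max-decode-depth 8`.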

@zricethezav zricethezav merged commit 0589ae0 into master May 14, 2025
5 checks passed
@zricethezav zricethezav deleted the unicode-decoder branch May 14, 2025 15:14
alayne222 pushed a commit to alayne222/gitleaks that referenced this pull request May 28, 2025