Add experimental allowlist optimizations#1731
Conversation
40cba3e to
9ce24de
Compare
|
Also + @ahrav |
f051219 to
b73c7e6
Compare
| a.commitMap = uniqueCommits | ||
| } | ||
|
|
||
| if len(a.Paths) > 0 { |
There was a problem hiding this comment.
@rgmz I was thinking about doing something similar in my upstream patterns when they're "compiled". It'd be neat to have it baked in too.
So this does give a good performance boost?
There was a problem hiding this comment.
So this does give a good performance boost?
The BenchmarkPathAllowed and BenchmarkPathNotAllowed benchmarks look to be 12-18x faster. We'd need to do some real-world testing to see if that carries over in practice.
Do you know of any large repositories with an excessive number of ignore paths? Node modules, Python deps, media files, etc.
There was a problem hiding this comment.
I don't off the top of my head, but it probably wouldn't be to hard to make one through automation.
You could generate a list of ignore paths that you'd like to make and then do something like:
target_number_of_commits = 1000
for target_number_of_commits:
# the files to ignore
allowlist_paths_for_commit = paths.choose(random_int(len(allow_list_example_paths) - 1)
for path in allowlist_paths_for_commit:
prefix = random_path(max_depth=30)
create_file_with_parent_dirs(join_path(prefix, path), filesize=random_int(1024))
# other random files
for i in range(pow(len(allowlist_paths_for_commit), 2)):
prefix = random_path(max_depth=30)
create_file_with_parent_dirs(join_path(prefix, path), filesize=random_int(1024))
commit()
There was a problem hiding this comment.
One potential candidate is mattermost/mattermost-data-warehouse which takes ~45s-1m to scan. (It isn't consistent, and my laptop is generally quite slow.)
Yeah anything that we can do to keep state/caches on the detector would be handy for my use case but I can always tweak things. I load the config once and then have a process that continually scans using gitleaks as a lib. I do create a fresh detector each scan though. |
b73c7e6 to
2c94106
Compare
This is really nifty! Based on my limited understanding of Gitleaks, this seems like a solid use of both the cache and Aho-Corasick. One small consideration—aside from the build-time drawback—is the additional memory overhead required to store the structure. However, from my testing of Aho-Corasick (specifically the library used here), the tradeoff is well worth it. For my own understanding—apologies if I’m missing something about this codebase or the scanning approach—how do these optimizations affect hot paths? Does a typical scan run through all the optimized sections, or are there specific hotspots? Also, out of curiosity, do you have a CPU or trace profile that guided these optimizations? P.S: ❤️ the benchmarks |
I suspect it runs through most, however, I don't have any data to support that. It would be interesting to gather. Some users maintain their own config, others (likely most) use the default config. Moreover, the same config can have wildly different performance depending on what it's scanning. Every fragment is evaluated against the global
I do not. Are there any resources you recommend to get started with that? |
Thanks for the quick explanation. 🙇 Based on that, I’d expect these changes to have a noticeable impact. The Go Blog is a great starting point, as is this guide. |
06d47ca to
83708bd
Compare
83708bd to
7427792
Compare
|
@zricethezav I've moved the new logic behind a feature-flag, in case we want to test this "in the field" and compare results. |
bd3f576 to
a087f16
Compare
a087f16 to
25f2d70
Compare
|
On a large (15GB) repository, this reduced the scanning time from ~46m to ~32m. Before After |
25f2d70 to
13b4f91
Compare
rgmz
left a comment
There was a problem hiding this comment.
@zricethezav This isn't perfect but it's ready for review.
| } | ||
|
|
||
| func (a *Allowlist) Validate() error { | ||
| flagOnce.Do(func() { |
There was a problem hiding this comment.
Relying on an 'optional' call to Validate to set internal state is bad. A better idea is probably to make a NewAllowlist(...opts) function, but I didn't want to add noise before this change got reviewed.
13b4f91 to
8051dd6
Compare
75b058f to
264f2d8
Compare
264f2d8 to
227a4a0
Compare
85683f9 to
ff80727
Compare
|
this'll be up next after #1872 |
ff80727 to
9005a8b
Compare
9005a8b to
ee86073
Compare
|
I think this is good to go. @rgmz wdyt about giving this 2-3 weeks to bake as experimental then making this optimization the default behavior? |
That sounds good to me! What do you think about @bplaxco's suggestion to use a field on Config instead of a global feature flag? |
Sure, that sounds reasonable 👍🏻 |
0568322 to
7ae6fbf
Compare
This comment was marked as resolved.
This comment was marked as resolved.
7ae6fbf to
3caec4d
Compare
|
"Crisp and clean, no caffeine" 👌👌 |
Description:
These are experimental changes to allowlist to improve efficiency.
Ideas include the following -- not all of them are implemented yet, or obviously better:
CommitsRegexesinto a single patternPathsinto a single patternStopwordsinstead of a sliceCommits
Paths
Regexes
cc @zricethezav @bplaxco
Potential Drawbacks
Off the top of my head, potential drawbacks could be:
Allowliststruct, could mess with how some people use Gitleaks as a libraryChecklist: