Skip to content

[IDE-5719] cache SDS secret scanners#825

Merged
alonam merged 4 commits intomainfrom
alonam/IDE-5719_secret_scanning_optimizations
Mar 9, 2026
Merged

[IDE-5719] cache SDS secret scanners#825
alonam merged 4 commits intomainfrom
alonam/IDE-5719_secret_scanning_optimizations

Conversation

@alonam
Copy link
Copy Markdown
Contributor

@alonam alonam commented Mar 1, 2026

What problem are you trying to solve?

The /scan-secrets endpoint rebuilds the SDS Scanner from scratch on every request. This involves deserializing rules from JSON, compiling regex patterns, building automata, and configuring validators - all expensive operations. In contrast, the /analyze endpoint already caches compiled rules via RuleCache with DashMap.

In the IDE use-case, the same set of rules is sent with every file scan request (rules come from the Datadog API and rarely change), meaning the scanner is unnecessarily rebuilt hundreds of times with identical rules.

What is your solution?

Introduced a SecretScannerCache - a single-entry, RwLock-based cache that stores the most recently built Scanner along with its parsed SecretRules.

  • Cache hit (fast path): acquires a read lock only, returns Arc-cloned scanner + rules. Multiple concurrent scans proceed without blocking each other.
  • Cache miss (slow path): deserializes rules and builds the scanner outside any lock, then briefly acquires a write lock to store the result.
  • Cache key: u64 hash of the serialized JSON rules. Compared on every request to detect rule changes.

On cache hit, three expensive operations are skipped entirely:

  1. Rule deserialization (serde_json::Value -> SecretRule)
  2. SDS rule config conversion (convert_to_sds_ruleconfig with regex compilation)
  3. Scanner building (Scanner::builder().build() with automaton construction)

Alternatives considered

  • DashMap multi-entry cache (like RuleCache for /analyze): considered but unnecessary since the IDE client sends the same rule set every time. A single-entry cache is simpler and has the same hit rate.
  • Opt-in via CLI flag (like --use-rules-cache for /analyze): considered but there's no downside to always caching - memory is minimal (one entry) and the cache self-invalidates on rule changes. So it's always enabled.
  • Caching individual rules instead of the whole scanner: not feasible because the SDS Scanner is built atomically from all rules at once - you can't incrementally add/remove rules from a built scanner.

What the reviewer should know

  • dd_sds::Scanner is confirmed Send + Sync (the dd-sds benchmarks use Arc<Scanner> across 32 threads). Both scan() and validate_matches() take &self.
  • use_debug is deliberately excluded from the cache key - it only controls eprintln logging in convert_to_sds_ruleconfig / try_to_secondary_validator, not the scanner's behavior or output.
  • Scanner is re-exported from the secrets crate (pub use dd_sds::Scanner) so the bins crate can reference the type without adding a direct dd_sds dependency.
  • process_secret_scan_request accepts Option<&SecretScannerCache> so the non-cache path is preserved for testing.

@alonam alonam requested a review from a team as a code owner March 1, 2026 18:53
@datadog-official
Copy link
Copy Markdown

datadog-official Bot commented Mar 1, 2026

🎯 Code Coverage (details)
Patch Coverage: 77.36%
Overall Coverage: 84.74% (+0.01%)

This comment will be updated automatically if new data arrives.
🔗 Commit SHA: b0468e8 | Docs | Datadog PR Page | Was this helpful? React with 👍/👎 or give us feedback!

@alonam alonam changed the title [IDE-5719] secret scanner cache [IDE-5719] cache SDS scanners Mar 1, 2026
@alonam alonam changed the title [IDE-5719] cache SDS scanners [IDE-5719] cache SDS secret scanners Mar 1, 2026
/// any lock to avoid blocking concurrent readers.
pub fn get_or_build(
&self,
raw_rules: &[serde_json::Value],
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of this, let's change what we need to to store these as serde_json::RawValue and only deserialize them into a serde_json::Value when needed. Since it'll essentially always be a cache hit, by hashing the raw bytes, we can avoid two unnecessary allocations (RawValue -> Value and then the roundtrip Value -> String)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternatively, not sure if it's easier (or possible with Rocket's API), but may we could just change the type to avoid deserializing the rules here?

So instead, deserialize as Vec<u8> representing the JSON string and then hash those?

Copy link
Copy Markdown
Contributor Author

@alonam alonam Mar 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

secret rules are now stored as Box<serde_json::value::RawValue> instead of Value, wdyt?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, thank you!

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great, thanks for the review and feedback!

@alonam alonam requested a review from jasonforal March 8, 2026 10:28
@alonam
Copy link
Copy Markdown
Contributor Author

alonam commented Mar 9, 2026

Tested successfully with a release build locally, results show improved performance in secret scanning.
Scanning the shopist repository now takes 5 seconds less.

@alonam alonam merged commit 7de9d65 into main Mar 9, 2026
91 checks passed
@alonam alonam deleted the alonam/IDE-5719_secret_scanning_optimizations branch March 9, 2026 14:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants