
feat: proximity-based extraction for minified files (rtk extract) #1567

@furkankoykiran


Problem

rtk grep is line-oriented. On minified bundles (single-line JS, compressed JSON, log files without newlines) a single match line can be 100K+ characters. The current behavior truncates to max_line_len (default 80) and destroys the surrounding semantic context that an LLM needs to reason about the match.

Concretely, running rtk grep 'fetch' bundle.min.js on a 1 MB minified file either:

  • truncates the matching line to 80 chars (losing the URL, payload, options object), or
  • returns the entire 1 MB line if --max-len is raised, blowing the token budget.

Neither outcome is useful, and this is exactly the kind of token-waste RTK is meant to prevent.

Proposal

Add a new subcommand rtk extract that finds regex hits and emits a configurable character window around each match — independent of line boundaries — with optional secondary-keyword filtering to drop irrelevant windows.

rtk extract <PATTERN> [PATH] [-w/--window N] [-b/--before N] [-a/--after N]
                              [-r/--require KEYWORD]... [-i/--ignore-case]
                              [-m/--max N] [--no-dedupe]
Flag                        Default            Purpose
-w, --window                100                Symmetric window (chars).
-b, --before / -a, --after  from --window      Asymmetric overrides.
-r, --require               none, repeatable   Window must contain ALL of these substrings.
-i, --ignore-case           false              Case-insensitive primary regex and --require.
-m, --max                   100                Token-budget cap on emitted windows.
--no-dedupe                 false              Disable collapsing of identical windows ((xN)).
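The flag semantics in the table can be sketched in a few lines of Python. This is only an illustration of one plausible reading: the helper names `resolve_window` and `collapse` are hypothetical, and applying the --max cap after deduplication (rather than before) is an assumption the proposal does not settle.

```python
from collections import Counter
from typing import Optional


def resolve_window(window: int = 100,
                   before: Optional[int] = None,
                   after: Optional[int] = None) -> tuple[int, int]:
    """Return (before, after); each side falls back to the symmetric window."""
    return (window if before is None else before,
            window if after is None else after)


def collapse(windows: list[str], max_windows: int = 100,
             dedupe: bool = True) -> list[str]:
    """Collapse identical windows into one entry tagged (xN), then cap the list."""
    if not dedupe:
        return windows[:max_windows]
    counts = Counter(windows)  # a Counter keeps keys in first-seen order
    out = [w if n == 1 else f"{w} (x{n})" for w, n in counts.items()]
    # Assumption: the cap is applied after deduplication.
    return out[:max_windows]
```

With these helpers, `resolve_window(80, before=20)` gives a 20-character lead and an 80-character tail, which matches the "from --window" default in the table.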

Example

$ rtk extract 'fetch\([^)]{0,300}\)' bundle.min.js -w 80 -r '/api/v1/login'
bundle.min.js (2 matches)
  @184523: ...auth.Ka.send(«fetch(\"/api/v1/login\",{method:\"PUT\",body:JSON.stringify(t)»});return r...
  @201117: ...refresh=()=>«fetch(\"/api/v1/login\",{method:\"POST\",body:e})»;...
1 file, 2 windows shown (of 14 matches, 12 filtered by --require)
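For reference, a line in the format shown above could be produced roughly as follows. This is a sketch, not the RTK implementation: the function name `render` is hypothetical, and it assumes the `@` offset is a character offset of the match start and that `...` is printed only where the window was actually cut.

```python
def render(text: str, start: int, end: int, before: int, after: int) -> str:
    """Format one window: offset, clipped context, and the match inside « »."""
    lo = max(0, start - before)        # clamp at the start of the input
    hi = min(len(text), end + after)   # clamp at the end of the input
    head = "..." if lo > 0 else ""
    tail = "..." if hi < len(text) else ""
    return (f"  @{start}: {head}{text[lo:start]}"
            f"«{text[start:end]}»{text[end:hi]}{tail}")


sample = 'x=1;fetch("/a");y=2'
print(render(sample, 4, 15, before=2, after=2))
```

Clamping both ends means a match near the start or end of the file still yields a valid (shorter) window instead of an out-of-range slice.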

Why a new command instead of grep flags

Keeping grep line-oriented is valuable for the 99% case of normal source files. Folding --window and --require into grep would either change its default semantics (breaking) or add flags that only make sense in a different mental model. A separate command keeps both clean.

Expected savings

On a typical minified bundle, a 1 MB match line collapses to a window of roughly 300 characters, a reduction of about 99.97%. Even multi-match cases capped at --max=100 stay in the tens of kilobytes at most, versus multi-MB raw output. Both are comfortably above RTK's 60% bar.
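The arithmetic behind these estimates is easy to check. The sketch below assumes 1 MB means 1,000,000 characters and that every window is about 300 characters; both figures are illustrative, and `reduction` is a hypothetical helper name.

```python
def reduction(raw_chars: int, emitted_chars: int) -> float:
    """Fraction of the raw output that is saved."""
    return 1.0 - emitted_chars / raw_chars


RAW = 1_000_000   # assumed size of the single minified line
WINDOW = 300      # assumed size of one emitted window

single = reduction(RAW, WINDOW)        # one window
capped = reduction(RAW, 100 * WINDOW)  # 100 windows, the default --max

print(f"single window: {single:.2%}")
print(f"100 windows:   {capped:.2%}")
```

Even the capped worst case under these assumptions stays well above a 60% savings threshold.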

Prior art

I have a working Python proof-of-concept that uses the same model (re.finditer + character slicing + secondary substring filter) and it has been the most useful pattern for analyzing reverse-engineered minified bundles. Porting it natively into RTK keeps the workflow inside the existing rtk gain tracking pipeline.
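The model described here (re.finditer, character slicing, secondary substring filter) fits in a short function. The following is a minimal sketch written for this issue, not the actual proof-of-concept: the name `extract` and its parameters are hypothetical, and lowercasing both sides for a case-insensitive --require check is an assumption.

```python
import re
from typing import Iterator


def extract(text: str, pattern: str, before: int = 100, after: int = 100,
            require: tuple[str, ...] = (), ignore_case: bool = False
            ) -> Iterator[tuple[int, str]]:
    """Yield (offset, window) for each regex hit whose window passes the filter."""
    regex = re.compile(pattern, re.IGNORECASE if ignore_case else 0)
    needles = [k.lower() if ignore_case else k for k in require]
    for m in regex.finditer(text):
        # Slice by character position, ignoring line boundaries entirely.
        lo = max(0, m.start() - before)
        hi = min(len(text), m.end() + after)
        window = text[lo:hi]
        haystack = window.lower() if ignore_case else window
        # Keep the window only if ALL required substrings occur in it.
        if all(k in haystack for k in needles):
            yield m.start(), window
```

Because the filter runs on the window rather than on the match, a required keyword can sit just outside the regex hit and still keep the window.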

Out of scope (deliberately deferred)

  • Streaming for files > 5 MB (the initial version reads the whole file; anything over 5 MB fails loudly).
  • Folding --window / --require back into rtk grep.
  • Binary-file detection beyond the read-as-utf8 short-circuit.
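The size limit and the read-as-utf8 short-circuit mentioned above might look like this in outline. It is a sketch under assumptions: `load_text` and `MAX_BYTES` are hypothetical names, "5 MB" is read as 5 × 1024 × 1024 bytes, and returning None for undecodable input is one possible way to skip binary files.

```python
from pathlib import Path
from typing import Optional

MAX_BYTES = 5 * 1024 * 1024  # assumed reading of the 5 MB limit


def load_text(path: str) -> Optional[str]:
    """Return the file as text, None if it is not valid UTF-8.

    Raises ValueError for files over the size limit (the "fail loud" case).
    """
    p = Path(path)
    size = p.stat().st_size
    if size > MAX_BYTES:
        raise ValueError(f"{path}: {size} bytes exceeds the {MAX_BYTES}-byte limit")
    try:
        return p.read_bytes().decode("utf-8")
    except UnicodeDecodeError:
        return None  # treated as binary and skipped
```

Checking the size before reading avoids loading an oversized file into memory just to reject it.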

I have an implementation ready and will open a PR shortly.

Labels: effort-large (several days, new module), enhancement (new feature or request)
