eval: harden OpenSRE LLM judge JSON extraction #989
VaibhavUpreti merged 16 commits into Tracer-Cloud:main from
Conversation
Greptile Summary

This PR rewrites `extract_judge_json_from_response` to harden how judge JSON is extracted from LLM output.

Confidence Score: 4/5. Safe to merge; all previously-flagged P1 issues are resolved and only a P2 style finding remains. No P0 or P1 issues found. The one P2 (a bare `except` clause) is stylistic. No files require special attention, though the bracket-containment logic in the fallback path deserves a careful read.

Important Files Changed
Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["Input: text.strip"] --> B{Markdown fences found?}
    B -- Yes --> C[Iterate fences in reverse]
    C --> D{Parse fence as JSON}
    D -- dict --> E[Return dict]
    D -- list --> F[continue to next fence]
    D -- error --> F
    F --> G{More fences?}
    G -- Yes --> C
    G -- No --> H
    B -- No --> H[json.loads full text]
    H -- dict --> E
    H -- list --> I["Raise: must be object"]
    H -- error --> J[Scan brace and bracket positions]
    J --> K{arr span contains obj span?}
    K -- Yes --> L[json.loads arr candidate]
    L -- list --> I
    L -- error --> M["Inner loop: each bracket before obj"]
    M -- valid array found --> I
    M -- none found --> N{has_obj?}
    K -- No --> N
    N -- No --> O["Raise: no JSON object"]
    N -- Yes --> P[json.loads obj span]
    P --> E
```
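To make the flow concrete, here is a minimal Python sketch of the algorithm the chart describes. The function name, error messages, and details are illustrative, not the PR's exact code, and the inner per-bracket loop is omitted for brevity:

````python
import json
import re
from typing import Any


def extract_judge_json(text: str) -> dict[str, Any]:
    """Last valid fenced object wins; then full-text parse; then a
    brace/bracket salvage fallback with top-level-array rejection."""
    text = text.strip()
    fences = re.findall(r"```(?:json)?\s*([\s\S]*?)\s*```", text)
    for block in reversed(fences):  # prefer the last fenced block
        try:
            parsed = json.loads(block)
        except json.JSONDecodeError:
            continue
        if isinstance(parsed, dict):
            return parsed
    try:
        whole = json.loads(text)
    except json.JSONDecodeError:
        whole = None  # fall through to the brace-substring salvage
    if isinstance(whole, dict):
        return whole
    if isinstance(whole, list):
        raise ValueError("Judge response JSON must be an object")
    obj_start, obj_end = text.find("{"), text.rfind("}")
    arr_start, arr_end = text.find("["), text.rfind("]")
    has_obj = obj_start != -1 and obj_end > obj_start
    has_arr = arr_start != -1 and arr_end > arr_start
    # Reject only when the bracket span actually parses as an array.
    if has_arr and has_obj and arr_start < obj_start and arr_end > obj_end:
        try:
            candidate = json.loads(text[arr_start : arr_end + 1])
        except json.JSONDecodeError:
            candidate = None  # brackets were not a real array
        if isinstance(candidate, list):
            raise ValueError("Judge response JSON must be an object")
    if not has_obj:
        raise ValueError("No JSON object found in judge response")
    salvaged = json.loads(text[obj_start : obj_end + 1])
    if not isinstance(salvaged, dict):
        raise ValueError("Judge response JSON must be an object")
    return salvaged
````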
Reviews (4). Last reviewed commit: "eval: fix judge JSON parsing (array fenc..."
Hey @cerencamkiran, thanks for tightening up the judge JSON parsing and adding targeted tests 👍 I see one behavior gap to fix before we merge: the array guard checks the raw text, so a top-level array preceded by leading whitespace slips past it.

Also, a small test alignment is needed.

Optionally, adding a test for the leading-whitespace case would lock the fix down. Once those tweaks are in and tests are green, this should be good to go. The fenced array test you added already matches what Greptile flagged, which is great.
Great catch @rrajan94.
Hey @rrajan94! I updated the guard to handle leading whitespace and added the test. All passing 👍
nice work @cerencamkiran. fenced-array test is the kind of thing that would silently corrupt benchmark numbers if missed. one thing, not blocking: the fence regex went from `(\{.*\})` to `([\s\S]*?)`, which is broader than just a fix. the old one required `{...}` inside the fence; the new one captures any content. so a prose-only fence with a JSON object outside used to parse fine, but now the fence eats the prose and the function raises. visible failure beats silent mis-parse, so probably the right call, just flagging. lgtm
yashksaini-coder left a comment:
Solid hardening; the tests are clear, and Greptile's two earlier P1s look fully addressed at HEAD. A couple of edge cases I'd love your take on inline.

Locally on this branch, `make lint`, `make format-check`, `mypy` on the modified file, and the new judge test file (8/8 passed in 1s) all pass. The caller in `app/nodes/evaluate_opensre/node.py:37` already wraps the call in `except Exception`, so any new `ValueError`s degrade gracefully into eval errors instead of crashing the pipeline.
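For context, the caller-side behavior being described amounts to this pattern (a paraphrase, not the actual `node.py` code; `evaluate` and the error dict shape are illustrative):

```python
def evaluate(raw_response: str) -> dict:
    # Paraphrase of the caller-side guard: any parse failure becomes
    # an eval error instead of crashing the evaluation pipeline.
    try:
        return extract_judge_json_from_response(raw_response)
    except Exception as exc:
        return {"error": f"judge parse failed: {exc}"}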
````diff
 def extract_judge_json_from_response(text: str) -> dict[str, Any]:
     text = text.strip()
-    fence = re.search(r"```(?:json)?\s*(\{.*\})\s*```", text, re.DOTALL)
+    fence = re.search(r"```(?:json)?\s*([\s\S]*?)\s*```", text, re.DOTALL)
````
The new fence regex is non-greedy (`[\s\S]*?`), so it captures the first fenced block. If the judge ever produces something like:
Example shape:
```json
{"old": 1}
```
Actual answer:
```json
{"overall_pass": true, "score_0_100": 90}
```
the function silently returns `{"old": 1}` (verified locally). The old greedy regex would have spanned both fences and failed `json.loads` noisily, which the caller logs as an eval error. That's exactly the silent mis-scoring failure mode this PR was written to prevent, just in a different shape.

Worst case it's a niche input, but if the judge prompt ever ends up with a "for example..." preamble, it'd silently degrade scores. Two options would close the gap:
- Use `re.findall` and pick the last fenced block, or the largest one that parses as a dict.
- After grabbing the first fence, parse it and if the result isn't a dict, fall through to the next fence.

Either way, would you mind adding a `test_extract_judge_json_picks_correct_fence_with_multiple` case to lock down the chosen behavior? (A rough sketch follows.)
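A rough sketch of what that test could look like; the import path is a guess, while the function name and expected behavior come from this thread:

````python
from app.eval.judge import extract_judge_json_from_response  # hypothetical path


def test_extract_judge_json_picks_correct_fence_with_multiple():
    # Example fence first, real answer last: the parser should return
    # the last fenced block that is a valid JSON object.
    response = (
        "Example shape:\n"
        '```json\n{"old": 1}\n```\n'
        "Actual answer:\n"
        '```json\n{"overall_pass": true, "score_0_100": 90}\n```'
    )
    assert extract_judge_json_from_response(response) == {
        "overall_pass": True,
        "score_0_100": 90,
    }
````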
```diff
-        text = fence.group(1)
+        text = fence.group(1).strip()

     if text.lstrip().startswith("["):
```
Tiny one. The `lstrip().startswith("[")` guard rejects any input whose first non-whitespace char is `[`, which catches actual JSON arrays correctly but also has a false positive for prose that starts with a square bracket, e.g. an LLM that prefixes its response with a log-style label:

```
[INFO] result: {"overall_pass": true}
[Note] {"overall_pass": true, ...}
```

Both used to parse successfully via the `find("{")` / `rfind("}")` salvage path, and now raise (verified locally). The caller catches it as `Exception`, so it just degrades the eval into an error, not silent, which is fine. But if you'd rather not lose the salvage path, a stricter guard would be to actually parse the input first and then check `isinstance(raw, list)`. Up to you.
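A minimal sketch of that stricter guard, assuming it runs before the salvage path (illustrative, not the PR's code):

```python
import json


def _reject_top_level_array(text: str) -> None:
    """Raise only when the whole input actually parses as a JSON array,
    instead of guessing from the first non-whitespace character."""
    try:
        raw = json.loads(text)
    except json.JSONDecodeError:
        return  # not valid JSON as a whole; let the salvage path run
    if isinstance(raw, list):
        raise ValueError("Judge response JSON must be an object")
```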
Thanks @yashksaini-coder. The multiple-fence case is now handled by scanning all fenced blocks and selecting the last valid JSON object, with a regression test added. All tests and checks are passing.
A prose-wrapped JSON array still slips through, because the brace-substring fallback (`find("{")` / `rfind("}")`) grabs the first object inside the array. Flow at HEAD: no fences → full-text `json.loads` fails on the surrounding prose → the salvage span parses as the inner object and is returned as if the judge had emitted an object.

Can you add a regression test for this and tighten the fallback so arrays are rejected even when there's prose around them? Matching the existing "JSON must be an object" error keeps it consistent with the other array tests.
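As an illustration of the failure mode (the input string here is made up, not from the thread):

```python
import json

# Prose-wrapped array: full-text json.loads fails on the prose, so the
# find("{") / rfind("}") salvage grabs the lone inner object instead.
response = 'Result below: [{"overall_pass": true, "score_0_100": 90}]'
salvaged = response[response.find("{") : response.rfind("}") + 1]
print(json.loads(salvaged))  # {'overall_pass': True, 'score_0_100': 90}
# The judge actually returned an array, yet the salvage path silently
# "succeeds" with its only element.
```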
Thanks @rrajan94, added a regression test for the prose-wrapped array case and tightened the fallback so arrays are rejected before the brace-substring salvage runs.

Validation: all tests and checks passing.
Thanks for fixing the prose-wrapped array case @cerencamkiran, but there's a related shape that still misbehaves. Because the check parses from the first `[`, a bracketed span that isn't the actual top-level array can throw the rejection off while the brace salvage still extracts an inner object. Can you add this as a regression test too? It may be worth checking the fallback against a few bracket/prose variants so we don't keep finding the same partial-parse issue one shape at a time.
@rrajan94 added a regression test for that case as well.

Validation: all tests and checks passing.
Thanks @cerencamkiran, the bracket/prose variants look good now. I found one false positive from the new scan though: a valid judge object now raises when it merely contains a nested array, because the loop tries each bracket span it finds and a nested array can satisfy the rejection check. Can you add this as a regression test too? The array rejection should catch top-level/salvaged arrays, but not arrays nested inside the valid judge object.
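For concreteness, a shape along these lines (made up here, but matching the later test name `test_extract_judge_json_allows_log_prefix_with_nested_array`) must keep parsing:

```python
# Log prefix plus a nested array inside the judge object: a valid
# response that the top-level array rejection must not trip on.
response = '[INFO] {"overall_pass": true, "failed_checks": ["latency"]}'
```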
Thanks @rrajan94. I've updated the logic to distinguish top-level arrays from nested ones and added the regression test.

Validation: all tests and checks passing.
yashksaini-coder left a comment:
Approving. Both of my earlier concerns are fully addressed at HEAD b981c1f0:
- Multi-fence first-match — the `reversed(fences)` + per-fence parse-and-check pattern picks the last valid block, exactly the right shape for judge prompts where the model puts examples first and the answer last. Locked down by `test_extract_judge_json_picks_last_valid_fence_with_multiple_fences`.
- `[INFO] result: {...}` false positive — the new top-level-vs-nested array detection (the `arr_start < obj_start and arr_end > obj_end` check at lines 87-89) is a clean way to distinguish a top-level array from a nested one. Covered by `test_extract_judge_json_allows_log_prefix_with_nested_array`.
I ran a standalone verifier against all 12 test cases (extracted the function source so I didn't need a venv) and got 12/12 passing. CI is green. The one open item is the CodeQL "empty except clause" info on line 68 — it's a known false positive on the `except: raw = None` fallthrough, and your workflow already passes. Adding a one-line `# fall through to brace-substring fallback` comment above the `except` would silence the bot if you want it gone, but it's not blocking.
Nice job working through the edge cases that came up after my review — the regression-test-per-bug-report discipline is exactly what I'd want from this kind of parsing code.
@greptileai review |
```python
obj_start = text.find("{")
obj_end = text.rfind("}")
arr_start = text.find("[")
arr_end = text.rfind("]")

has_obj = obj_start != -1 and obj_end != -1 and obj_end > obj_start
has_arr = arr_start != -1 and arr_end != -1 and arr_end > arr_start

# If an array span exists and fully contains the object span,
# the top-level value is an array — reject it.
if has_arr and has_obj and arr_start < obj_start and arr_end > obj_end:
    msg = "Judge response JSON must be an object"
    raise ValueError(msg)
```
Bracket scan may falsely reject valid dicts in fallback path
The containment check at line 87 scans for the textually first `[` and last `]` in the entire string, not necessarily a matching JSON array pair. Any input that has an unrelated `[…]` token before the `{` and another `]` token after the `}` will trip the guard even though no actual JSON array wraps the object.

Concrete failing input that does not involve a real array:

```
[INFO] here is the result: {"overall_pass": true, "score_0_100": 90, …} [end]
```

- `arr_start = 0` (the `[` in `[INFO]`)
- `arr_end = len - 5` (the `]` in `[end]`)
- `obj_start` / `obj_end` both fall inside that span → containment fires → `ValueError` raised

A correct dict-containing response would be silently rejected and the evaluation would abort instead of scoring. A safer approach:
```python
if has_arr and has_obj and arr_start < obj_start and arr_end > obj_end:
    try:
        arr_candidate = json.loads(text[arr_start : arr_end + 1])
        if isinstance(arr_candidate, list):
            msg = "Judge response JSON must be an object"
            raise ValueError(msg)
    except json.JSONDecodeError:
        pass  # brackets were not a real array, continue
```
@cerencamkiran please resolve this; it looks like the fallback uses the first `[` and last `]` in the whole text, so unrelated bracketed text before and after the object is treated like an array wrapping it.
@greptileai review

@greptileai review
@rrajan94 thanks for the review. All previously flagged P1 issues are now addressed. The bracket handling has been updated to only reject when the span actually parses as a JSON array, so cases like `[INFO] here is the result: {...} [end]` no longer raise. Added regression tests for these scenarios as well. All tests are passing (15/15), and all checks are green. I'll leave the remaining P2 style note as non-blocking for now.
VaibhavUpreti left a comment:
Awesome @cerencamkiran, thanks for improving this!
Merging this to main, tagging @w3joe for visibility
💼 Interviewer: describe a time you shipped something impactful.
@cerencamkiran: *points at this PR*
Interviewer: you're hired. 🤝

👋 Join us on Discord - OpenSRE: hang out, contribute, or hunt for features and issues. Everyone's welcome.
Summary
This PR improves the robustness of the OpenSRE evaluation pipeline by hardening how LLM judge responses are parsed and validated.
The current evaluation flow relies on extracting a JSON object from LLM outputs. However, real LLM responses are often not strictly well-formed and may include additional text, markdown fences, or malformed structures. This can lead to unstable or incorrect benchmark results.
Problem
The existing `extract_judge_json_from_response` implementation made brittle assumptions about the shape of the judge's output. As a result, evaluation can silently succeed with incorrect parsing or fail unpredictably depending on LLM formatting.
Changes
Hardened JSON extraction logic:

- Extract only a full JSON object (`{...}`) to avoid partial extraction
- Reject top-level JSON arrays (`[...]`), including fenced cases

Added dedicated test coverage for judge parsing.
Why this matters
OpenSRE evaluation depends on LLM judge outputs for scoring (`overall_pass`, `score_0_100`, etc.). This change improves parsing robustness and benchmark stability, turning silent mis-parses into visible eval errors.
Result:
8 passed
Validation