eval: harden OpenSRE LLM judge JSON extraction #989
VaibhavUpreti merged 16 commits into Tracer-Cloud:main from
Conversation
Greptile Summary

This PR rewrites `extract_judge_json_from_response` to harden how judge JSON is extracted from LLM output.

Confidence Score: 4/5. Safe to merge; all previously-flagged P1 issues are resolved and only a P2 style finding remains. No P0 or P1 issues found. The one P2 (a bare `except` clause) is stylistic. No files require special attention, though the bracket-containment logic in the fallback path deserves a careful read.

Important Files Changed
Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["Input: text.strip"] --> B{Markdown fences found?}
    B -- Yes --> C[Iterate fences in reverse]
    C --> D{Parse fence as JSON}
    D -- dict --> E[Return dict]
    D -- list --> F[continue to next fence]
    D -- error --> F
    F --> G{More fences?}
    G -- Yes --> C
    G -- No --> H
    B -- No --> H[json.loads full text]
    H -- dict --> E
    H -- list --> I["Raise: must be object"]
    H -- error --> J[Scan brace and bracket positions]
    J --> K{arr span contains obj span?}
    K -- Yes --> L[json.loads arr candidate]
    L -- list --> I
    L -- error --> M["Inner loop: each bracket before obj"]
    M -- valid array found --> I
    M -- none found --> N{has_obj?}
    K -- No --> N
    N -- No --> O["Raise: no JSON object"]
    N -- Yes --> P[json.loads obj span]
    P --> E
```
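To make the flow concrete, here is a minimal Python sketch of the algorithm the chart describes. The function name, error messages, and details are illustrative, not the PR's exact code, and the inner per-bracket loop is omitted for brevity:

````python
import json
import re
from typing import Any


def extract_judge_json(text: str) -> dict[str, Any]:
    """Last valid fenced object wins; then full-text parse; then a
    brace/bracket salvage fallback with top-level-array rejection."""
    text = text.strip()
    fences = re.findall(r"```(?:json)?\s*([\s\S]*?)\s*```", text)
    for block in reversed(fences):  # prefer the last fenced block
        try:
            parsed = json.loads(block)
        except json.JSONDecodeError:
            continue
        if isinstance(parsed, dict):
            return parsed
    try:
        whole = json.loads(text)
    except json.JSONDecodeError:
        whole = None  # fall through to the brace-substring salvage
    if isinstance(whole, dict):
        return whole
    if isinstance(whole, list):
        raise ValueError("Judge response JSON must be an object")
    obj_start, obj_end = text.find("{"), text.rfind("}")
    arr_start, arr_end = text.find("["), text.rfind("]")
    has_obj = obj_start != -1 and obj_end > obj_start
    has_arr = arr_start != -1 and arr_end > arr_start
    # Reject only when the bracket span actually parses as an array.
    if has_arr and has_obj and arr_start < obj_start and arr_end > obj_end:
        try:
            candidate = json.loads(text[arr_start : arr_end + 1])
        except json.JSONDecodeError:
            candidate = None  # brackets were not a real array
        if isinstance(candidate, list):
            raise ValueError("Judge response JSON must be an object")
    if not has_obj:
        raise ValueError("No JSON object found in judge response")
    salvaged = json.loads(text[obj_start : obj_end + 1])
    if not isinstance(salvaged, dict):
        raise ValueError("Judge response JSON must be an object")
    return salvaged
````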
Reviews (4). Last reviewed commit: "eval: fix judge JSON parsing (array fenc..."
Hey @cerencamkiran, thanks for tightening up the judge JSON parsing and adding targeted tests 👍 I see one behavior gap to fix before we merge: the array guard checks the raw text, so a top-level array preceded by leading whitespace slips past it.

Also, a small test alignment is needed.

Optionally, adding a test for the leading-whitespace case would lock the fix down. Once those tweaks are in and tests are green, this should be good to go. The fenced array test you added already matches what Greptile flagged, which is great.
Great catch @rrajan94.
Hey @rrajan94! I updated the guard to handle leading whitespace and added the test. All passing 👍
nice work @cerencamkiran. fenced-array test is the kind of thing that would silently corrupt benchmark numbers if missed. one thing, not blocking: the fence regex went from `(\{.*\})` to `([\s\S]*?)`, which is broader than just a fix. the old one required `{...}` inside the fence; the new one captures any content. so a prose-only fence with a JSON object outside used to parse fine, but now the fence eats the prose and the function raises. visible failure beats silent mis-parse, so probably the right call, just flagging. lgtm
yashksaini-coder left a comment:
Solid hardening; the tests are clear, and Greptile's two earlier P1s look fully addressed at HEAD. A couple of edge cases I'd love your take on inline.

Locally on this branch, `make lint`, `make format-check`, `mypy` on the modified file, and the new judge test file (8/8 passed in 1s) all pass. The caller in `app/nodes/evaluate_opensre/node.py:37` already wraps the call in `except Exception`, so any new `ValueError`s degrade gracefully into eval errors instead of crashing the pipeline.
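For context, the caller-side behavior being described amounts to this pattern (a paraphrase, not the actual `node.py` code; `evaluate` and the error dict shape are illustrative):

```python
def evaluate(raw_response: str) -> dict:
    # Paraphrase of the caller-side guard: any parse failure becomes
    # an eval error instead of crashing the evaluation pipeline.
    try:
        return extract_judge_json_from_response(raw_response)
    except Exception as exc:
        return {"error": f"judge parse failed: {exc}"}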
````diff
 def extract_judge_json_from_response(text: str) -> dict[str, Any]:
     text = text.strip()
-    fence = re.search(r"```(?:json)?\s*(\{.*\})\s*```", text, re.DOTALL)
+    fence = re.search(r"```(?:json)?\s*([\s\S]*?)\s*```", text, re.DOTALL)
````
The new fence regex is non-greedy (`[\s\S]*?`), so it captures the first fenced block. If the judge ever produces something like:
Example shape:
```json
{"old": 1}
```
Actual answer:
```json
{"overall_pass": true, "score_0_100": 90}
```
the function silently returns `{"old": 1}` (verified locally). The old greedy regex would have spanned both fences and failed `json.loads` noisily, which the caller logs as an eval error. That's exactly the silent mis-scoring failure mode this PR was written to prevent, just in a different shape.

Worst case it's a niche input, but if the judge prompt ever ends up with a "for example..." preamble, it'd silently degrade scores. Two options would close the gap:
- Use `re.findall` and pick the last fenced block, or the largest one that parses as a dict.
- After grabbing the first fence, parse it and if the result isn't a dict, fall through to the next fence.

Either way, would you mind adding a `test_extract_judge_json_picks_correct_fence_with_multiple` case to lock down the chosen behavior? (A rough sketch follows.)
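A rough sketch of what that test could look like; the import path is a guess, while the function name and expected behavior come from this thread:

````python
from app.eval.judge import extract_judge_json_from_response  # hypothetical path


def test_extract_judge_json_picks_correct_fence_with_multiple():
    # Example fence first, real answer last: the parser should return
    # the last fenced block that is a valid JSON object.
    response = (
        "Example shape:\n"
        '```json\n{"old": 1}\n```\n'
        "Actual answer:\n"
        '```json\n{"overall_pass": true, "score_0_100": 90}\n```'
    )
    assert extract_judge_json_from_response(response) == {
        "overall_pass": True,
        "score_0_100": 90,
    }
````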
```diff
-        text = fence.group(1)
+        text = fence.group(1).strip()

     if text.lstrip().startswith("["):
```
Tiny one. The `lstrip().startswith("[")` guard rejects any input whose first non-whitespace char is `[`, which catches actual JSON arrays correctly but also has a false positive for prose that starts with a square bracket, e.g. an LLM that prefixes its response with a log-style label:

```
[INFO] result: {"overall_pass": true}
[Note] {"overall_pass": true, ...}
```

Both used to parse successfully via the `find("{")` / `rfind("}")` salvage path, and now raise (verified locally). The caller catches it as `Exception`, so it just degrades the eval into an error, not silent, which is fine. But if you'd rather not lose the salvage path, a stricter guard would be to actually parse the input first and then check `isinstance(raw, list)`. Up to you.
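A minimal sketch of that stricter guard, assuming it runs before the salvage path (illustrative, not the PR's code):

```python
import json


def _reject_top_level_array(text: str) -> None:
    """Raise only when the whole input actually parses as a JSON array,
    instead of guessing from the first non-whitespace character."""
    try:
        raw = json.loads(text)
    except json.JSONDecodeError:
        return  # not valid JSON as a whole; let the salvage path run
    if isinstance(raw, list):
        raise ValueError("Judge response JSON must be an object")
```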
Thanks @yashksaini-coder. The multiple-fence case is now handled by scanning all fenced blocks and selecting the last valid JSON object, with a regression test added. All tests and checks are passing.
A prose-wrapped JSON array still slips through, because the brace-substring fallback (`find("{")` / `rfind("}")`) grabs the first object inside the array. Flow at HEAD: no fences → full-text `json.loads` fails on the surrounding prose → the salvage span parses as the inner object and is returned as if the judge had emitted an object.

Can you add a regression test for this and tighten the fallback so arrays are rejected even when there's prose around them? Matching the existing "JSON must be an object" error keeps it consistent with the other array tests.
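As an illustration of the failure mode (the input string here is made up, not from the thread):

```python
import json

# Prose-wrapped array: full-text json.loads fails on the prose, so the
# find("{") / rfind("}") salvage grabs the lone inner object instead.
response = 'Result below: [{"overall_pass": true, "score_0_100": 90}]'
salvaged = response[response.find("{") : response.rfind("}") + 1]
print(json.loads(salvaged))  # {'overall_pass': True, 'score_0_100': 90}
# The judge actually returned an array, yet the salvage path silently
# "succeeds" with its only element.
```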
Thanks @rrajan94, added a regression test for the prose-wrapped array case and tightened the fallback so arrays are rejected before the brace-substring salvage runs.

Validation: all tests and checks passing.
Thanks for fixing the prose-wrapped array case @cerencamkiran, but there's a related shape that still misbehaves. Because the check parses from the first `[`, a bracketed span that isn't the actual top-level array can throw the rejection off while the brace salvage still extracts an inner object. Can you add this as a regression test too? It may be worth checking the fallback against a few bracket/prose variants so we don't keep finding the same partial-parse issue one shape at a time.
@rrajan94 added a regression test for that case as well.

Validation: all tests and checks passing.
Thanks @cerencamkiran, the bracket/prose variants look good now. I found one false positive from the new scan though: a valid judge object now raises when it merely contains a nested array, because the loop tries each bracket span it finds and a nested array can satisfy the rejection check. Can you add this as a regression test too? The array rejection should catch top-level/salvaged arrays, but not arrays nested inside the valid judge object.
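For concreteness, a shape along these lines (made up here, but matching the later test name `test_extract_judge_json_allows_log_prefix_with_nested_array`) must keep parsing:

```python
# Log prefix plus a nested array inside the judge object: a valid
# response that the top-level array rejection must not trip on.
response = '[INFO] {"overall_pass": true, "failed_checks": ["latency"]}'
```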
Thanks @rrajan94. I've updated the logic to distinguish top-level arrays from nested ones and added the regression test.

Validation: all tests and checks passing.
yashksaini-coder left a comment:
Approving. Both of my earlier concerns are fully addressed at HEAD b981c1f0:
- Multi-fence first-match — the `reversed(fences)` + per-fence parse-and-check pattern picks the last valid block, exactly the right shape for judge prompts where the model puts examples first and the answer last. Locked down by `test_extract_judge_json_picks_last_valid_fence_with_multiple_fences`.
- `[INFO] result: {...}` false positive — the new top-level-vs-nested array detection (the `arr_start < obj_start and arr_end > obj_end` check at lines 87-89) is a clean way to distinguish a top-level array from a nested one. Covered by `test_extract_judge_json_allows_log_prefix_with_nested_array`.
I ran a standalone verifier against all 12 test cases (extracted the function source so I didn't need a venv) and got 12/12 passing. CI is green. The one open item is the CodeQL "empty except clause" info on line 68 — it's a known false positive on the `except: raw = None` fallthrough, and your workflow already passes. Adding a one-line `# fall through to brace-substring fallback` comment above the `except` would silence the bot if you want it gone, but it's not blocking.
Nice job working through the edge cases that came up after my review — the regression-test-per-bug-report discipline is exactly what I'd want from this kind of parsing code.
@greptileai review |
```python
obj_start = text.find("{")
obj_end = text.rfind("}")
arr_start = text.find("[")
arr_end = text.rfind("]")

has_obj = obj_start != -1 and obj_end != -1 and obj_end > obj_start
has_arr = arr_start != -1 and arr_end != -1 and arr_end > arr_start

# If an array span exists and fully contains the object span,
# the top-level value is an array — reject it.
if has_arr and has_obj and arr_start < obj_start and arr_end > obj_end:
    msg = "Judge response JSON must be an object"
    raise ValueError(msg)
```
Bracket scan may falsely reject valid dicts in fallback path
The containment check at line 87 scans for the textually first `[` and last `]` in the entire string, not necessarily a matching JSON array pair. Any input that has an unrelated `[…]` token before the `{` and another `]` token after the `}` will trip the guard even though no actual JSON array wraps the object.

Concrete failing input that does not involve a real array:

```
[INFO] here is the result: {"overall_pass": true, "score_0_100": 90, …} [end]
```

- `arr_start = 0` (the `[` in `[INFO]`)
- `arr_end = len - 5` (the `]` in `[end]`)
- `obj_start` / `obj_end` both fall inside that span → containment fires → `ValueError` raised

A correct dict-containing response would be silently rejected and the evaluation would abort instead of scoring. A safer approach:
```python
if has_arr and has_obj and arr_start < obj_start and arr_end > obj_end:
    try:
        arr_candidate = json.loads(text[arr_start : arr_end + 1])
        if isinstance(arr_candidate, list):
            msg = "Judge response JSON must be an object"
            raise ValueError(msg)
    except json.JSONDecodeError:
        pass  # brackets were not a real array, continue
```
@cerencamkiran please resolve this; it looks like the fallback uses the first `[` and last `]` in the whole text, so unrelated bracketed text before and after the object is treated like an array wrapping it.
@greptileai review

@greptileai review
@rrajan94 thanks for the review. All previously flagged P1 issues are now addressed. The bracket handling has been updated to only reject when the span actually parses as a JSON array, so cases like `[INFO] here is the result: {...} [end]` no longer raise. Added regression tests for these scenarios as well. All tests are passing (15/15), and all checks are green. I'll leave the remaining P2 style note as non-blocking for now.
VaibhavUpreti left a comment:
Awesome @cerencamkiran, thanks for improving this!
Merging this to main, tagging @w3joe for visibility
💼 Interviewer: describe a time you shipped something impactful.
@cerencamkiran: *points at this PR*
Interviewer: you're hired. 🤝

👋 Join us on Discord - OpenSRE: hang out, contribute, or hunt for features and issues. Everyone's welcome.
Summary
This PR improves the robustness of the OpenSRE evaluation pipeline by hardening how LLM judge responses are parsed and validated.
The current evaluation flow relies on extracting a JSON object from LLM outputs. However, real LLM responses are often not strictly well-formed and may include additional text, markdown fences, or malformed structures. This can lead to unstable or incorrect benchmark results.
Problem
The existing `extract_judge_json_from_response` implementation made brittle assumptions about the shape of the judge's output. As a result, evaluation can silently succeed with incorrect parsing or fail unpredictably depending on LLM formatting.
Changes
Hardened JSON extraction logic:

- Extract only a full JSON object (`{...}`) to avoid partial extraction
- Reject top-level JSON arrays (`[...]`), including fenced cases

Added dedicated test coverage for judge parsing.
Why this matters
OpenSRE evaluation depends on LLM judge outputs for scoring (`overall_pass`, `score_0_100`, etc.). This change improves parsing robustness and benchmark stability, turning silent mis-parses into visible eval errors.
Result:
8 passed
Validation