eval: harden OpenSRE LLM judge JSON extraction#989

Merged
VaibhavUpreti merged 16 commits into Tracer-Cloud:main from cerencamkiran:eval-llm-judge-tests
May 3, 2026

Conversation

@cerencamkiran (Contributor) commented Apr 27, 2026

Summary

This PR improves the robustness of the OpenSRE evaluation pipeline by hardening how LLM judge responses are parsed and validated.

The current evaluation flow relies on extracting a JSON object from LLM outputs. However, real LLM responses are often not strictly well-formed and may include additional text, markdown fences, or malformed structures. This can lead to unstable or incorrect benchmark results.

Problem

The existing extract_judge_json_from_response implementation:

  • assumes a clean JSON object in the response
  • does not explicitly reject JSON arrays
  • can incorrectly extract partial JSON from mixed outputs
  • is not tested against real-world LLM response patterns

As a result, evaluation can silently succeed with incorrect parsing or fail unpredictably depending on LLM formatting.

Changes

  • Hardened JSON extraction logic:

    • captures full fenced content (not just {...}) to avoid partial extraction
    • explicitly rejects JSON arrays ([...]), including fenced cases
    • improves detection of valid JSON object boundaries
    • ensures only dictionary outputs are accepted
  • Added dedicated test coverage for judge parsing:

    • clean JSON responses
    • markdown fenced JSON (```json)
    • mixed prose + JSON outputs
    • missing JSON cases
    • JSON arrays (invalid)
    • malformed JSON
    • fenced JSON arrays (invalid, now explicitly rejected)
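Put together, the hardened flow described above can be sketched roughly as follows. This is an illustrative reconstruction of the behavior the PR describes, not the merged source; the internals of the real `extract_judge_json_from_response` may differ.

```python
import json
import re
from typing import Any

FENCE_RE = re.compile(r"```(?:json)?\s*([\s\S]*?)\s*```")


def extract_judge_json_from_response(text: str) -> dict[str, Any]:
    """Extract a single JSON object from an LLM judge response."""
    text = text.strip()

    # 1. Fenced blocks: last valid dict wins (examples tend to precede answers).
    for block in reversed(FENCE_RE.findall(text)):
        try:
            parsed = json.loads(block)
        except json.JSONDecodeError:
            continue
        if isinstance(parsed, dict):
            return parsed  # fenced arrays/scalars are skipped, not returned

    # 2. The whole response may already be valid JSON.
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        parsed = None
    if isinstance(parsed, dict):
        return parsed
    if isinstance(parsed, list):
        raise ValueError("Judge response JSON must be an object")

    # 3. Salvage a {...} span, but reject it when a real JSON array wraps it.
    obj_start, obj_end = text.find("{"), text.rfind("}")
    if obj_start == -1 or obj_end <= obj_start:
        raise ValueError("No JSON object found in judge response")
    arr_start, arr_end = text.find("["), text.rfind("]")
    if -1 < arr_start < obj_start and arr_end > obj_end:
        # Try each '[' before the object: '[INFO]'-style prose fails to
        # parse and is ignored; a real wrapping array triggers rejection.
        for i in range(arr_start, obj_start):
            if text[i] != "[":
                continue
            try:
                if isinstance(json.loads(text[i : arr_end + 1]), list):
                    raise ValueError("Judge response JSON must be an object")
            except json.JSONDecodeError:
                continue

    parsed = json.loads(text[obj_start : obj_end + 1])
    if not isinstance(parsed, dict):
        raise ValueError("Judge response JSON must be an object")
    return parsed
```

Per the review thread, the evaluation node wraps the call in a broad `except`, so any `ValueError` raised here degrades into an eval error rather than crashing the pipeline.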

Why this matters

OpenSRE evaluation depends on LLM judge outputs for scoring (overall_pass, score_0_100, etc.).

This change improves:

  • evaluation stability
  • reproducibility of benchmark results
  • robustness to real LLM output variability
  • prevention of silent mis-scoring due to partial JSON extraction

Result:

8 passed

Validation

pytest tests/integrations/opensre/test_llm_eval_judge.py

@greptile-apps (Bot) commented Apr 27, 2026

Greptile Summary

This PR rewrites extract_judge_json_from_response to handle real-world LLM output variability: it now iterates all markdown fences in reverse (returning the last valid dict), falls through to a direct json.loads, and finally uses a bracket-containment heuristic to reject prose-wrapped JSON arrays without blocking valid dicts surrounded by unrelated bracket tokens. All three P1 issues raised in previous review threads — fenced-array bypass, premature raise on saw_array_fence, and false-positive bracket rejection — are resolved in this revision, and 13 test cases validate the new paths.

Confidence Score: 4/5

Safe to merge; all previously-flagged P1 issues are resolved and only a P2 style finding remains.

No P0 or P1 issues found. The one P2 (bare json.loads that can leak JSONDecodeError) is minor and pre-existing. Logic was thoroughly traced through all 13 test cases and edge cases — no correctness defects remain.

No files require special attention, though the bracket-containment logic in llm_eval_judge.py lines 87–109 is dense and would benefit from inline comments if maintained further.
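The P2 noted above (a bare `json.loads` that can leak `JSONDecodeError`) could be closed with a small wrapper; a sketch, with a hypothetical helper name and error message:

```python
import json
from typing import Any


def safe_loads(text: str) -> Any:
    """Parse JSON but surface a uniform ValueError to callers."""
    try:
        return json.loads(text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"Judge response is not valid JSON: {exc}") from exc
```

Since `json.JSONDecodeError` already subclasses `ValueError`, the wrapper mainly normalizes the message the caller logs.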

Important Files Changed

Filename Overview
app/integrations/opensre/llm_eval_judge.py Rewrote extract_judge_json_from_response with multi-fence support, reversed iteration for last-fence-wins, and a bracket-containment check for prose-wrapped arrays; logic is correct for all tested inputs but the final json.loads at line 115 is unwrapped and can leak JSONDecodeError.
tests/integrations/opensre/test_llm_eval_judge.py New test file with 13 cases covering clean JSON, fenced JSON, mixed-prose inputs, arrays (plain and fenced), invalid JSON, multiple-fence ordering, and bracketed-prose edge cases — comprehensive coverage for the new extraction logic.

Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Input: text.strip] --> B{Markdown fences found?}
    B -- Yes --> C[Iterate fences in reverse]
    C --> D{Parse fence as JSON}
    D -- dict --> E[Return dict]
    D -- list --> F[continue to next fence]
    D -- error --> F
    F --> G{More fences?}
    G -- Yes --> C
    G -- No --> H
    B -- No --> H[json.loads full text]
    H -- dict --> E
    H -- list --> I[Raise: must be object]
    H -- error --> J[Scan brace and bracket positions]
    J --> K{arr span contains obj span?}
    K -- Yes --> L[json.loads arr candidate]
    L -- list --> I
    L -- error --> M[Inner loop: each bracket before obj]
    M -- valid array found --> I
    M -- none found --> N{has_obj?}
    K -- No --> N
    N -- No --> O[Raise: no JSON object]
    N -- Yes --> P[json.loads obj span]
    P --> E
```


@rrajan94 (Collaborator)
Hey @cerencamkiran, thanks for tightening up the judge JSON parsing and adding targeted tests 👍

I see one behavior gap to fix before we merge:

  • Right now the startswith("[") guard only catches arrays when the response literally begins with [. If there is any leading whitespace before the array, we can still end up slicing out a {...} from inside an array and treating it as valid. To fully meet the “reject JSON arrays” goal, can you change that to use text.lstrip().startswith("[") (and then parse from the stripped text)?

Also, a small test alignment:

  • In test_extract_judge_json_raises_for_json_array you can simplify the match to just "JSON must be an object" once the guard is in place, since that will be the consistent error.

Optionally, adding a test for " \n [{\"overall_pass\": true}]" would lock in the whitespace-tolerant behavior.

Once those tweaks are in and tests are green, this should be good to go. The fenced array test you added already matches what Greptile flagged, which is great.
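The suggested guard is small; as a sketch (function name hypothetical, isolating just the check):

```python
def starts_with_array(text: str) -> bool:
    # lstrip() makes ' \n [{"overall_pass": true}]' behave the same
    # as a response that literally begins with '['.
    return text.lstrip().startswith("[")
```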

@cerencamkiran (Contributor, Author)

Great catch @rrajan94.
I'll update the guard to be whitespace-tolerant and add the extra test case. Thanks!

@cerencamkiran (Contributor, Author)

Hey @rrajan94! I updated the guard to handle leading whitespace and added the test. All passing 👍

@WatchTree-19 (Contributor)

nice work @cerencamkiran. fenced-array test is the kind of thing that would silently corrupt benchmark numbers if missed.

one thing, not blocking: the fence regex went from (\{.*\}) to ([\s\S]*?), which is broader than just a fix. the old one required {...} inside the fence; the new one captures any content. so a prose-only fence with a JSON object outside used to parse fine, but now the fence eats the prose and the function raises. visible failure beats silent mis-parse, so probably the right call. just flagging.

lgtm

@yashksaini-coder (Collaborator) left a comment

Solid hardening, the tests are clear and greptile's two earlier P1s look fully addressed at HEAD. A couple of edge cases I'd love your take on inline.

Locally on this branch: make lint, make format-check, mypy on the modified file, and the new judge test file (8/8 passed in 1s) all pass. Caller in app/nodes/evaluate_opensre/node.py:37 already wraps the call in except Exception, so any new ValueErrors degrade gracefully into eval errors instead of crashing the pipeline.

def extract_judge_json_from_response(text: str) -> dict[str, Any]:
    text = text.strip()
-   fence = re.search(r"```(?:json)?\s*(\{.*\})\s*```", text, re.DOTALL)
+   fence = re.search(r"```(?:json)?\s*([\s\S]*?)\s*```", text, re.DOTALL)

The new fence regex is now non-greedy ([\s\S]*?) so it captures the first fenced block. If the judge ever produces something like:

Example shape:
```json
{"old": 1}
```

Actual answer:
```json
{"overall_pass": true, "score_0_100": 90}
```

the function silently returns {"old": 1} (verified locally). The old greedy regex would have spanned both fences and failed json.loads noisily, which the caller logs as an eval error. That's exactly the silent-mis-scoring failure mode this PR was written to prevent, just in a different shape.

Worst case it's a niche input, but if the judge prompt ever ends up with a "for example..." preamble it'd silently degrade scores. Two options that would close the gap:

  1. Use re.findall and pick the last fenced block, or the largest one that parses as a dict.
  2. After grabbing the first fence, parse it and if the result isn't a dict, fall through to the next fence.

Either way, would you mind adding a test_extract_judge_json_picks_correct_fence_with_multiple case to lock down the chosen behavior?
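Option 1 above could look like this sketch (helper name hypothetical):

```python
import json
import re
from typing import Any, Optional


def last_valid_fenced_dict(text: str) -> Optional[dict[str, Any]]:
    """Return the last fenced block that parses as a JSON object, if any."""
    blocks = re.findall(r"```(?:json)?\s*([\s\S]*?)\s*```", text)
    for block in reversed(blocks):
        try:
            parsed = json.loads(block)
        except json.JSONDecodeError:
            continue
        if isinstance(parsed, dict):  # arrays and scalars fall through
            return parsed
    return None
```

Scanning in reverse encodes the assumption that example fences come before the real answer.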

-   text = fence.group(1)
+   text = fence.group(1).strip()

if text.lstrip().startswith("["):

Tiny one. The lstrip().startswith("[") guard rejects any input whose first non-whitespace char is [, which catches actual JSON arrays correctly but also has a false positive for prose that starts with a square bracket, e.g. an LLM that prefixes its response with a log-style label:

  • [INFO] result: {"overall_pass": true}
  • [Note] {"overall_pass": true, ...}

Both used to parse successfully via the find("{") / rfind("}") salvage path, and now raise (verified locally). The caller catches it as Exception so it just degrades the eval into an error, not silent, which is fine. But if you'd rather not lose the salvage path, a stricter guard would be to actually parse the input first and then isinstance(raw, list). Up to you.
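The parse-first variant suggested here, as a sketch (helper name hypothetical):

```python
import json


def is_top_level_array(text: str) -> bool:
    """True only when the response actually parses as a JSON array."""
    try:
        return isinstance(json.loads(text.lstrip()), list)
    except json.JSONDecodeError:
        # Prose like '[INFO] ...' fails to parse, so it is not rejected here
        # and the {...} salvage path stays available.
        return False
```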

@cerencamkiran (Contributor, Author)

Thanks @yashksaini-coder.

The multiple-fence case is now handled by scanning all fenced blocks and selecting the last valid JSON object, with a regression test added.

For the [INFO] ... case, I now parse first and only reject if the result is actually a list, so the salvage path is preserved.

All tests and checks are passing.

@rrajan94 (Collaborator) commented May 1, 2026

A prose-wrapped JSON array still slips through because the brace-substring fallback (text.find("{") / text.rfind("}")) slices out the first object inside the array.

repro:

extract_judge_json_from_response('Here is the result: [{"overall_pass": true}] Thanks')
# returns {'overall_pass': True} instead of raising

flow at HEAD: no fences → json.loads of the whole text fails → isinstance(raw, list) guard is skipped → find("{")/rfind("}") salvage path returns the inner object.

can you add a regression test for this and tighten the fallback so arrays are rejected even when there's prose around them? matching the existing "JSON must be an object" error keeps it consistent with the other array tests.
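One way to tighten the fallback along these lines, as an illustrative helper (not the merged code):

```python
import json


def reject_prose_wrapped_array(text: str) -> None:
    """Raise when a real JSON array spans the {...} region salvage would slice out."""
    obj_start, obj_end = text.find("{"), text.rfind("}")
    arr_start, arr_end = text.find("["), text.rfind("]")
    if -1 < arr_start < obj_start and arr_end > obj_end:
        try:
            if isinstance(json.loads(text[arr_start : arr_end + 1]), list):
                raise ValueError("Judge response JSON must be an object")
        except json.JSONDecodeError:
            pass  # the brackets were prose, not an actual array
```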

@cerencamkiran (Contributor, Author)

Thanks @rrajan94 added a regression test for the prose-wrapped array case and tightened the fallback so arrays are rejected before the {...} salvage path when the array appears before any JSON object.

Validation:
pytest tests/integrations/opensre/test_llm_eval_judge.py -v
10 passed

@rrajan94 (Collaborator) commented May 1, 2026

Thanks for fixing the prose-wrapped array case @cerencamkiran
I think the new bracket-based fix still has one related edge case.

Example: [INFO] result: [{"overall_pass": true}]

Because the check parses from the first [ to the last ], the substring starts with [INFO], fails JSON parsing, and then falls through to the {...} fallback. That still returns {"overall_pass": true} instead of rejecting the array.

Can you add this as a regression test too? It may be worth checking the fallback against a few bracket/prose variants so we don’t keep finding the same partial-parse issue one shape at a time.
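Probing each '[' candidate before the {...} fallback could be sketched as follows (helper name hypothetical):

```python
import json


def wrapping_array_found(text: str) -> bool:
    """Try every '[' before the first '{' as a potential array start."""
    obj_start = text.find("{")
    arr_end = text.rfind("]")
    if obj_start == -1 or arr_end < obj_start:
        return False
    for i, ch in enumerate(text[:obj_start]):
        if ch != "[":
            continue
        try:
            if isinstance(json.loads(text[i : arr_end + 1]), list):
                return True  # a real array wraps the object
        except json.JSONDecodeError:
            continue  # '[INFO]'-style prose; try the next candidate
    return False
```

Scanning only brackets that open before the object also keeps arrays nested inside a valid dict from tripping the check.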

@cerencamkiran (Contributor, Author)

@rrajan94 added a regression test for the [INFO] result: [...] case as well and updated the array detection to try parsing from each [ candidate before the {...} fallback.

Validation:
pytest tests/integrations/opensre/test_llm_eval_judge.py -v
11 passed

@rrajan94 (Collaborator) commented May 2, 2026

Thanks @cerencamkiran, the [INFO] result: [...] array case is covered now.

I found one false positive from the new scan though.

This valid object now raises:

[INFO] result: {"overall_pass": true, "score_0_100": 90, "rubric_items": [], "summary": "ok"}

The loop tries each [ before the {...} fallback, so it eventually parses the nested rubric_items: [] array and rejects the whole response before the full object can be salvaged.

Can you add this as a regression test too? The array rejection should catch top-level/salvaged arrays, but not arrays nested inside the valid judge object.

@cerencamkiran (Contributor, Author)

Thanks @rrajan94.

I’ve updated the logic to distinguish top-level arrays from nested ones and added the regression test.

Validation:
pytest tests/integrations/opensre/test_llm_eval_judge.py -v
12 passed

@yashksaini-coder (Collaborator) left a comment

Approving. Both of my earlier concerns are fully addressed at HEAD b981c1f0:

  • Multi-fence first-match — the reversed(fences) + per-fence parse-and-check pattern picks the last valid block, exactly the right shape for judge prompts where the model puts examples first and the answer last. Locked down by test_extract_judge_json_picks_last_valid_fence_with_multiple_fences.
  • [INFO] result: {...} false positive — the new top-level-vs-nested array detection (the arr_start < obj_start AND arr_end > obj_end check at lines 87-89) is a clean way to distinguish a top-level array from a nested one. Covered by test_extract_judge_json_allows_log_prefix_with_nested_array.

I ran a standalone verifier against all 12 test cases (extracted the function source so I didn't need a venv) and got 12/12 passing. CI is green. The one open item is the CodeQL "empty except clause" info on line 68 — it's a known false-positive on except: raw = None fallthrough, and your workflow already passes. Adding a one-line # fall through to brace-substring fallback comment above the except would silence the bot if you want it gone, but it's not blocking.

Nice job working through the edge cases that came up after my review — the regression-test-per-bug-report discipline is exactly what I'd want from this kind of parsing code.

@VaibhavUpreti (Member)

@greptileai review

Comment on lines +77 to +89
obj_start = text.find("{")
obj_end = text.rfind("}")
arr_start = text.find("[")
arr_end = text.rfind("]")

has_obj = obj_start != -1 and obj_end != -1 and obj_end > obj_start
has_arr = arr_start != -1 and arr_end != -1 and arr_end > arr_start

# If an array span exists and fully contains the object span,
# the top-level value is an array — reject it.
if has_arr and has_obj and arr_start < obj_start and arr_end > obj_end:
msg = "Judge response JSON must be an object"
raise ValueError(msg)

P1 Bracket scan may falsely reject valid dicts in fallback path

The containment check at line 87 scans for the textually first [ and last ] in the entire string, not necessarily a matching JSON array pair. Any input that has an unrelated […] token before the { and another ] token after the } will trip the guard even though no actual JSON array wraps the object.

Concrete failing input that does not involve a real array:

[INFO] here is the result: {"overall_pass": true, "score_0_100": 90, …} [end]
  • arr_start = 0 (the [ in [INFO])
  • arr_end = len – 5 (the ] in [end])
  • obj_start / obj_end both fall inside that span → containment fires → ValueError raised

A correct dict-containing response would be silently rejected and the evaluation would abort instead of scoring. A safer approach:

if has_arr and has_obj and arr_start < obj_start and arr_end > obj_end:
    try:
        arr_candidate = json.loads(text[arr_start : arr_end + 1])
        if isinstance(arr_candidate, list):
            msg = "Judge response JSON must be an object"
            raise ValueError(msg)
    except json.JSONDecodeError:
        pass  # brackets were not a real array, continue

(Collaborator)

@cerencamkiran please resolve this, looks like the fallback uses the first [ and last ] in the whole text, so unrelated bracketed text before and after the object is treated like an array wrapping it.

@cerencamkiran (Contributor, Author)

@greptileai review

@cerencamkiran (Contributor, Author)

@greptileai review

@cerencamkiran (Contributor, Author)

@rrajan94 thanks for the review.

All previously flagged P1 issues are now addressed. The bracket handling has been updated to only reject when the span actually parses as a JSON array, so cases like [INFO] ... {dict} ... [end] no longer fail.

Added regression tests for these scenarios as well.

All tests are passing (15/15), and all checks are green.

I’ll leave the remaining P2 style note as non-blocking for now.

@VaibhavUpreti (Member) left a comment

Awesome @cerencamkiran , thanks for improving this!

Merging this to main, tagging @w3joe for visibility

@VaibhavUpreti VaibhavUpreti merged commit 02568d7 into Tracer-Cloud:main May 3, 2026
10 checks passed
@github-actions (Bot) commented May 3, 2026

💼 Interviewer: describe a time you shipped something impactful.

@cerencamkiran: points at this PR

Interviewer: you're hired. 🤝


👋 Join us on Discord - OpenSRE : hang out, contribute, or hunt for features and issues. Everyone's welcome.
