
Require real behavior proof for external PRs #77622

Merged: steipete merged 3 commits into main from codex/real-behavior-proof-gate on May 5, 2026

@pashpashpash
Contributor

This adds the external-contributor proof gate we discussed. External PRs now have to fill a Real behavior proof section with the real setup, exact after-patch steps, evidence, observed result, and untested scope. The check deliberately treats tests, mocks, lint, typechecks, snapshots, and CI as supplemental, while accepting screenshots, recordings, terminal captures, console output, redacted runtime logs, linked artifacts, and copied live output.
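
As an illustration of the completeness part of that check, here is a minimal sketch, not the script from this PR. The field labels in `REQUIRED_FIELDS` and the `missingProofFields` helper are hypothetical names chosen to mirror the five items listed above; the real template wording may differ, and this sketch only reads single-line `Label: value` entries.

```javascript
// Hypothetical field labels mirroring the five items the description lists.
// The actual PR template may word or structure these differently.
const REQUIRED_FIELDS = [
  "Real setup",
  "After-patch steps",
  "Evidence",
  "Observed result",
  "Untested scope",
];

// Return the required fields that are absent or left empty in the section text.
// Only single-line "Label: value" entries are recognized in this sketch.
function missingProofFields(sectionText) {
  const lines = sectionText.split(/\r?\n/).map((line) => line.trim());
  return REQUIRED_FIELDS.filter((field) => {
    const prefix = `${field.toLowerCase()}:`;
    const line = lines.find((l) => l.toLowerCase().startsWith(prefix));
    if (line === undefined) return true; // field not present at all
    return line.slice(prefix.length).trim().length === 0; // present but empty
  });
}
```

A section that fills all five entries yields an empty array; anything else yields the list of fields a contributor still has to complete.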

Barnacle now labels missing or mock-only proof clearly and keeps that path passive until a maintainer uses the existing manual close trigger. The new required workflow check can be added to the main ruleset after this lands; the override remains the proof: override label instead of a ruleset exception.
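
The labeling and override behavior described above could be sketched roughly as follows. This is an assumption-heavy illustration rather than Barnacle's code: `decideGate` and its input shape are hypothetical, and the mapping of `needs-real-behavior-proof` and `mock-only-proof` to the missing and mock-only cases is inferred from the names mentioned in this thread. Only the `proof: override` label is confirmed as the override mechanism.

```javascript
// Hypothetical decision helper; names and input shape are assumptions.
// classification is one of "real", "mock-only", or "missing".
function decideGate({ isExternal, labels, classification }) {
  // The gate targets external PRs only.
  if (!isExternal) return { pass: true, addLabel: null, close: false };

  // A maintainer-applied override label passes the check
  // without needing a ruleset exception.
  if (labels.includes("proof: override")) {
    return { pass: true, addLabel: null, close: false };
  }

  if (classification === "real") {
    return { pass: true, addLabel: null, close: false };
  }

  // Assumed label mapping: label the PR, fail the check, and never close.
  // Closing stays a separate, manual maintainer action.
  const addLabel =
    classification === "mock-only" ? "mock-only-proof" : "needs-real-behavior-proof";
  return { pass: false, addLabel, close: false };
}
```

Note that `close` is always `false` in this sketch, which reflects the passive path: the automation only labels, and a maintainer decides whether to close.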

openclaw-barnacle bot added the labels scripts (Repository scripts), size: L, and maintainer (Maintainer-authored PR) on May 5, 2026
@clawsweeper
Contributor

clawsweeper Bot commented May 5, 2026

Codex review: needs changes before merge.

Summary
The PR adds a real-behavior-proof PR-template section, a required pull_request_target check, Barnacle proof labels/actions, contributor/changelog docs, and tests for external PR proof enforcement.

Reproducibility: yes. Source inspection of the PR head gives a high-confidence path: an evidence label such as "Terminal transcript:" or "console output" is enough to set hasRealEvidence, so proof that contains only pnpm test or CI output passes instead of being classified as mock-only.

Next step before merge
A narrow repair can fix the proof classifier and add the missing regression test; maintainer approval is still needed for the broader policy gate.

Security
Cleared: The new workflow is security-sensitive, but it checks out trusted base code with read-only permissions and no persisted checkout credentials.

Review findings

  • [P2] Reject labeled test transcripts before passing proof — scripts/github/real-behavior-proof-policy.mjs:60
Review details

Best possible solution:

Land a maintainer-approved gate after tightening the classifier so evidence-type labels do not count without actual non-test live evidence, while preserving the trusted-base workflow design.

Do we have a high-confidence way to reproduce the issue?

Yes. Source inspection of the PR head gives a high-confidence path: an evidence label such as "Terminal transcript:" or "console output" is enough to set hasRealEvidence, so proof that contains only pnpm test or CI output passes instead of being classified as mock-only.

Is this the best way to solve the issue?

No. The workflow shape is reasonable, but the classifier should reject mock/test/lint/typecheck/CI-only content before accepting evidence labels, or require actual live evidence payload beyond those labels.

Full review comments:

  • [P2] Reject labeled test transcripts before passing proof — scripts/github/real-behavior-proof-policy.mjs:60
    Because realEvidenceRegex accepts labels like terminal transcript, console output, and stdout, and mock-only detection is disabled once hasRealEvidence is true, an external PR can provide Terminal transcript: followed only by pnpm test or CI output and still pass the required gate. That contradicts the new template and docs, which say tests and CI are supplemental only.
    Confidence: 0.9
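
    To make the bypass concrete, here is a minimal sketch under stated assumptions. The regexes and both functions are illustrative reconstructions from the description in this finding, not the contents of real-behavior-proof-policy.mjs. `naiveClassify` mimics the reported ordering, and `strictClassify` shows one possible tightening: an evidence label counts only when at least one non-supplemental line accompanies it.

```javascript
// Assumed patterns for illustration; the real regexes are not shown in this thread.
const EVIDENCE_LABEL = /^(terminal transcript|console output|stdout)\b:?/i;
const SUPPLEMENTAL =
  /\b(pnpm (test|lint|check)|vitest|typecheck|snapshots?|mocks?|ci (run|passed)|checks passed|tests? passed)\b/i;

// Reported behavior: any evidence label sets hasRealEvidence,
// and mock-only detection runs only when that flag is false.
function naiveClassify(text) {
  const hasRealEvidence = /terminal transcript|console output|stdout/i.test(text);
  if (hasRealEvidence) return "real";
  return SUPPLEMENTAL.test(text) ? "mock-only" : "missing";
}

// One possible tightening: inspect the payload line by line and require
// a labeled block to contain something other than test or CI output.
function strictClassify(text) {
  const lines = text
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter((line) => line.length > 0 && !line.startsWith("```"));

  let sawLabel = false;
  let sawSupplemental = false;
  let sawLivePayload = false;

  for (const line of lines) {
    const rest = line.replace(EVIDENCE_LABEL, "").trim();
    if (rest !== line) sawLabel = true; // the line started with an evidence label
    if (rest.length === 0) continue; // a bare label carries no payload
    if (SUPPLEMENTAL.test(rest)) sawSupplemental = true;
    else sawLivePayload = true;
  }

  if (sawLabel && sawLivePayload) return "real";
  if (sawSupplemental) return "mock-only";
  return "missing";
}
```

    With the input "Terminal transcript:" followed by "pnpm test" and "42 tests passed", the naive version returns "real" while the strict version returns "mock-only". The strict version is still permissive, since any non-supplemental line under a label counts as live payload, so a production classifier would likely need stronger signals.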

Overall correctness: patch is incorrect
Overall confidence: 0.87

Acceptance criteria:

  • pnpm test test/scripts/real-behavior-proof-policy.test.ts test/scripts/barnacle-auto-response.test.ts
  • pnpm exec oxfmt --check --threads=1 scripts/github/real-behavior-proof-policy.mjs test/scripts/real-behavior-proof-policy.test.ts test/scripts/barnacle-auto-response.test.ts
  • pnpm check:changed in Testbox before handoff because this touches scripts and workflow policy

What I checked:

  • Protected label: The live issue API shows this PR is open with labels scripts, maintainer, and size: L; maintainer is protected and prevents cleanup closure. (d094506a411b)
  • Current main lacks the gate: Current main has no Real behavior proof, real-behavior-proof, proof: override, needs-real-behavior-proof, or mock-only-proof hits in the relevant GitHub/script/test/docs surfaces. (978bc53e80cc)
  • Documented contract: The PR template says unit tests, mocks, lint, typechecks, snapshots, and CI are supplemental only while requiring after-fix evidence from a real OpenClaw setup. (.github/pull_request_template.md:40, d094506a411b)
  • Classifier bypass remains: realEvidenceRegex still treats labels like terminal transcript, console output, linked artifact, and stdout as real evidence; mock-only detection then runs only when hasRealEvidence is false. (scripts/github/real-behavior-proof-policy.mjs:60, d094506a411b)
  • Regression test misses labeled transcript bypass: The new negative test covers a fenced pnpm test transcript, but not the accepted-label case such as Terminal transcript: followed only by test/CI output. (test/scripts/real-behavior-proof-policy.test.ts:94, d094506a411b)
  • Workflow trust boundary checked: The new workflow uses pull_request_target, but checks out the base SHA with persisted credentials disabled and grants only read permissions to contents and pull requests. (.github/workflows/real-behavior-proof.yml:14, d094506a411b)

Likely related people:

  • vincentkoc: Recent GitHub path history shows repeated Barnacle automation, candidate-label, and auto-response workflow changes on the central files. (role: recent maintainer; confidence: high; commits: 1b951c565ea1, 2f6615d2ee9d, 727e0e013ecb; files: scripts/github/barnacle-auto-response.mjs, .github/workflows/auto-response.yml, test/scripts/barnacle-auto-response.test.ts)
  • Takhoffman: Recent path history shows multiple PR template and contributor-guidance updates that define the contributor-facing proof contract touched by this PR. (role: adjacent docs/template owner; confidence: high; commits: c19321ed9e02, a4e5b23dc37f, 9f0305420aad; files: .github/pull_request_template.md, CONTRIBUTING.md)
  • steipete: Recent history includes Barnacle triage helper refactoring, automation label renames, and workflow maintenance near the policy surface. (role: adjacent maintainer; confidence: medium; commits: c571debf83e6, f28bc31ecdda, c73f7d6596ad; files: scripts/github/barnacle-auto-response.mjs, .github/workflows/auto-response.yml)

Remaining risk / open question:

  • If merged as-is and made required in the ruleset, external PRs can satisfy the gate with a labeled test or CI transcript rather than real behavior proof.
  • The new Barnacle labels and manual close paths affect contributor PR handling, so the classifier needs to be conservative before maintainers rely on it.

Codex review notes: model gpt-5.5, reasoning high; reviewed against 978bc53e80cc.

pashpashpash force-pushed the codex/real-behavior-proof-gate branch from 81876f5 to d094506 on May 5, 2026 01:54
@pashpashpash
Contributor Author

@clawsweeper re-review

@clawsweeper
Contributor

clawsweeper Bot commented May 5, 2026

🦞🧹
ClawSweeper re-review requested.

I asked ClawSweeper to review this item again.
Action: item re-review queued (workflow sweep.yml, event repository_dispatch).
Result: the existing ClawSweeper review comment will be edited in place when the review finishes.

steipete force-pushed the codex/real-behavior-proof-gate branch from d094506 to c4ed838 on May 5, 2026 04:36
steipete force-pushed the codex/real-behavior-proof-gate branch from c4ed838 to 5eeb9c8 on May 5, 2026 04:40
steipete merged commit 70f34bf into main on May 5, 2026 (96 checks passed)
steipete deleted the codex/real-behavior-proof-gate branch on May 5, 2026 04:45

Labels

maintainer (Maintainer-authored PR), scripts (Repository scripts), size: L
