Require real behavior proof for external PRs (#77622)

pashpashpash · steipete · web-flow · commit 70f34bf1779c · 2026-05-05T05:45:30.000+01:00
* ci: require real behavior proof for external PRs

* fix: tighten real behavior proof heuristics

* fix: reject test-only real behavior proof labels

---------

Co-authored-by: Peter Steinberger &lt;steipete@gmail.com&gt;
diff --git a/.github/pull_request_template.md b/.github/pull_request_template.md
@@ -35,6 +35,18 @@ If this PR fixes a plugin beta-release blocker, title it `fix(<plugin-id>): beta
 - Related #
 - [ ] This PR fixes a bug or regression
 
+## Real behavior proof (required for external PRs)
+
+External contributors must show after-fix evidence from a real OpenClaw setup. Unit tests, mocks, lint, typechecks, snapshots, and CI are supplemental only. Screenshots are encouraged even for CLI, console, text, or log changes; terminal screenshots and copied live output count.
+
+- Behavior or issue addressed:
+- Real environment tested:
+- Exact steps or command run after this patch:
+- Evidence after fix (screenshot, recording, terminal capture, console output, redacted runtime log, linked artifact, or copied live output):
+- Observed result after fix:
+- What was not tested:
+- Before evidence (optional but encouraged):
+
 ## Root Cause (if applicable)
 
 For bug fixes or regressions, explain why this happened, not just what changed. Otherwise write `N/A`. If the cause is unclear, write `Unknown`.
diff --git a/.github/workflows/auto-response.yml b/.github/workflows/auto-response.yml
@@ -6,7 +6,7 @@ on:
   issue_comment:
     types: [created]
   pull_request_target: # zizmor: ignore[dangerous-triggers] maintainer-owned label automation; trusted base checkout only, no untrusted PR code execution
-    types: [opened, edited, synchronize, reopened, labeled]
+    types: [opened, edited, synchronize, reopened, labeled, unlabeled]
 
 env:
   FORCE_JAVASCRIPT_ACTIONS_TO_NODE24: "true"
diff --git a/.github/workflows/real-behavior-proof.yml b/.github/workflows/real-behavior-proof.yml
@@ -0,0 +1,29 @@
+name: Real behavior proof
+
+on:
+  pull_request_target: # zizmor: ignore[dangerous-triggers] trusted base checkout only; no untrusted PR code execution
+    types: [opened, edited, synchronize, reopened, ready_for_review, labeled, unlabeled]
+
+env:
+  FORCE_JAVASCRIPT_ACTIONS_TO_NODE24: "true"
+
+concurrency:
+  group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref || github.run_id }}
+  cancel-in-progress: true
+
+permissions: {}
+
+jobs:
+  real-behavior-proof:
+    name: Real behavior proof
+    permissions:
+      contents: read
+      pull-requests: read
+    runs-on: ubuntu-24.04
+    steps:
+      - uses: actions/checkout@v6
+        with:
+          ref: ${{ github.event.pull_request.base.sha }}
+          persist-credentials: false
+      - name: Check real behavior proof
+        run: node scripts/github/real-behavior-proof-check.mjs
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -11,6 +11,7 @@ Docs: https://docs.openclaw.ai
 ### Changes
 
 - Gateway/Windows: bind the default loopback gateway listener only to `127.0.0.1` on Windows so libuv's dual-stack `::1` behavior cannot wedge localhost HTTP requests. (#69701, fixes #69674) Thanks @SARAMALI15792.
+- Contributor PRs: require external pull requests to include after-fix real behavior proof from a real OpenClaw setup, with terminal screenshots, console output, redacted runtime logs, linked artifacts, and copied live output treated as valid evidence while unit tests, mocks, lint, typechecks, snapshots, and CI remain supplemental only.
 - Plugins/migration: emit catalog-backed install hints when `plugins.entries` or `plugins.allow` references an official external plugin that is not installed, so upgraded configs point operators to `openclaw plugins install <spec>` instead of telling them to remove valid plugin config. (#77483) Thanks @hclsys.
 - OpenAI/Codex media: advertise Codex audio transcription in runtime and manifest metadata and route active Codex chat models to the OpenAI transcription default instead of sending chat model ids to audio transcription. Thanks @vincentkoc.
 - Dependencies: refresh runtime and provider packages including Pi 0.73.0, ACPX adapters, OpenAI, Anthropic, Slack, and TypeScript native preview, while keeping the Bedrock runtime installer override pinned below the Windows ARM Node 24 npm resolver failure.
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
@@ -100,6 +100,7 @@ For coordinated change sets that genuinely need more than 20 PRs, join the **#cl
 ## Before You PR
 
 - Test locally with your OpenClaw instance
+- External PRs must include a filled **Real behavior proof** section in the PR body. Show the real setup you tested, the exact command or steps you ran after the patch, after-fix evidence, the observed result, and anything you did not test. Screenshots, recordings, terminal screenshots, console output, copied live output, linked artifacts, and redacted runtime logs all count. Unit tests, mocks, snapshots, lint, typechecks, and CI are useful but do not satisfy this requirement by themselves. Maintainers may apply `proof: override` only when the proof gate should not apply.
 - Run tests: `pnpm build && pnpm check && pnpm test`
 - For iterative local commits, `scripts/committer --fast "message" <files...>` passes `FAST_COMMIT=1` through to the pre-commit hook so it skips the repo-wide `pnpm check`. Only use it when you've already run equivalent targeted validation for the touched surface.
 - For extension/plugin changes, run the fast local lane first:
@@ -160,7 +161,7 @@ Built with Codex, Claude, or other AI tools? **Awesome - just mark it!**
 Please include in your PR:
 
 - [ ] Mark as AI-assisted in the PR title or description
-- [ ] Note the degree of testing (untested / lightly tested / fully tested)
+- [ ] Include human-run real behavior proof from your own setup. AI-generated tests, mocks, lint, typechecks, and CI output are supplemental only; they do not prove the fix works for users.
 - [ ] Include prompts or session logs if possible (super helpful!)
 - [ ] Confirm you understand what the code does
 - [ ] If you have access to Codex, run `codex review --base origin/main` locally and address the findings before asking for review
diff --git a/scripts/github/barnacle-auto-response.mjs b/scripts/github/barnacle-auto-response.mjs
@@ -1,5 +1,13 @@
 // Barnacle owns deterministic GitHub triage and auto-response behavior.
 
+import {
+  MOCK_ONLY_PROOF_LABEL,
+  NEEDS_REAL_BEHAVIOR_PROOF_LABEL,
+  PROOF_OVERRIDE_LABEL,
+  evaluateRealBehaviorProof,
+  labelsForRealBehaviorProof,
+} from "./real-behavior-proof-policy.mjs";
+
 const activePrLimit = 20;
 
 const thirdPartyExtensionMessage =
@@ -134,6 +142,18 @@ export const managedLabelSpecs = {
     color: "C5DEF5",
     description: "Candidate: PR template appears mostly untouched.",
   },
+  [NEEDS_REAL_BEHAVIOR_PROOF_LABEL]: {
+    color: "C5DEF5",
+    description: "Candidate: external PR needs after-fix proof from a real setup.",
+  },
+  [MOCK_ONLY_PROOF_LABEL]: {
+    color: "C5DEF5",
+    description: "Candidate: PR proof only shows tests, mocks, snapshots, lint, typecheck, or CI.",
+  },
+  [PROOF_OVERRIDE_LABEL]: {
+    color: "C2E0C6",
+    description: "Maintainer override for the external PR real behavior proof gate.",
+  },
   "triage: dirty-candidate": {
     color: "C5DEF5",
     description: "Candidate: broad unrelated surfaces; may need splitting or cleanup.",
@@ -154,6 +174,8 @@ export const candidateLabels = {
   docsDiscoverability: "triage: docs-discoverability",
   testOnlyNoBug: "triage: test-only-no-bug",
   refactorOnly: "triage: refactor-only",
+  needsRealBehaviorProof: NEEDS_REAL_BEHAVIOR_PROOF_LABEL,
+  mockOnlyProof: MOCK_ONLY_PROOF_LABEL,
   dirtyCandidate: "triage: dirty-candidate",
   riskyInfra: "triage: risky-infra",
   externalPluginCandidate: "triage: external-plugin-candidate",
@@ -196,10 +218,23 @@ const maintainerAuthorLabel = "maintainer";
 const privilegedAuthorAssociations = new Set(["OWNER", "MEMBER", "COLLABORATOR"]);
 const privilegedRepositoryRoles = new Set(["admin", "maintain", "write"]);
 const candidateLabelValues = Object.values(candidateLabels);
+const proofCandidateLabelValues = [NEEDS_REAL_BEHAVIOR_PROOF_LABEL, MOCK_ONLY_PROOF_LABEL];
 const noisyPrMessage =
   "Closing this PR because it looks dirty (too many unrelated or unexpected changes). This usually happens when a branch picks up unrelated commits or a merge went sideways. Please recreate the PR from a clean branch.";
 
 const candidateActionRules = [
+  {
+    label: candidateLabels.needsRealBehaviorProof,
+    close: true,
+    message:
+      "Closing this PR because it does not include real behavior proof. Please reopen or resubmit with after-fix evidence from a real OpenClaw setup; terminal screenshots, console output, redacted logs, recordings, linked artifacts, and copied live output count. Unit tests, mocks, snapshots, lint, typechecks, and CI are supplemental only.",
+  },
+  {
+    label: candidateLabels.mockOnlyProof,
+    close: true,
+    message:
+      "Closing this PR because the proof only shows tests, mocks, snapshots, lint, typechecks, or CI. Please reopen or resubmit with after-fix evidence from a real OpenClaw setup; terminal screenshots, console output, redacted logs, recordings, linked artifacts, and copied live output count.",
+  },
   {
     label: candidateLabels.dirtyCandidate,
     close: true,
@@ -438,6 +473,14 @@ export function classifyPullRequestCandidateLabels(pullRequest, files) {
     labelsToAdd.push(candidateLabels.blankTemplate);
   }
 
+  labelsToAdd.push(
+    ...labelsForRealBehaviorProof(
+      evaluateRealBehaviorProof({
+        pullRequest,
+      }),
+    ),
+  );
+
   const docsOnly = filenames.every(isMarkdownOrDocsFile);
   const docsSignal =
     /\b(add|adds|update|updates|fix|fixes|improve|cleanup|clean up|typo|readme|docs?|documentation|translation|translate)\b/i.test(
@@ -718,14 +761,18 @@ async function addMissingLabels(github, context, core, issueNumber, labels, labe
 
 async function applyPullRequestCandidateLabels(github, context, core, pullRequest, labelSet) {
   const files = await listPullRequestFiles(github, context, pullRequest);
-  await addMissingLabels(
-    github,
-    context,
-    core,
-    pullRequest.number,
-    classifyPullRequestCandidateLabels(pullRequest, files),
-    labelSet,
+  const classifiedLabels = classifyPullRequestCandidateLabels(
+    {
+      ...pullRequest,
+      labels: [...labelSet].map((name) => ({ name })),
+    },
+    files,
   );
+  const staleProofLabels = proofCandidateLabelValues.filter(
+    (label) => labelSet.has(label) && !classifiedLabels.includes(label),
+  );
+  await removeLabels(github, context, pullRequest.number, staleProofLabels, labelSet);
+  await addMissingLabels(github, context, core, pullRequest.number, classifiedLabels, labelSet);
 }
 
 function isAutomationUser(user, fallbackLogin = "") {
@@ -931,7 +978,9 @@ export async function runBarnacleAutoResponse({ github, context, core = console
   const isLabelEvent = context.payload.action === "labeled";
   const isPrCandidateEvent =
     pullRequest &&
-    ["opened", "edited", "synchronize", "reopened", "labeled"].includes(context.payload.action);
+    ["opened", "edited", "synchronize", "reopened", "labeled", "unlabeled"].includes(
+      context.payload.action,
+    );
   if (!hasTriggerLabel && !isLabelEvent && !isPrCandidateEvent) {
     return;
   }
diff --git a/scripts/github/real-behavior-proof-check.mjs b/scripts/github/real-behavior-proof-check.mjs
@@ -0,0 +1,34 @@
+#!/usr/bin/env node
+import { readFileSync } from "node:fs";
+import { evaluateRealBehaviorProof } from "./real-behavior-proof-policy.mjs";
+
+function escapeCommandValue(value) {
+  return String(value)
+    .replace(/%/g, "%25")
+    .replace(/\r/g, "%0D")
+    .replace(/\n/g, "%0A")
+    .replace(/:/g, "%3A");
+}
+
+const eventPath = process.env.GITHUB_EVENT_PATH;
+if (!eventPath) {
+  console.error("::error title=Real behavior proof failed::GITHUB_EVENT_PATH is not set.");
+  process.exit(1);
+}
+
+const event = JSON.parse(readFileSync(eventPath, "utf8"));
+const pullRequest = event.pull_request;
+if (!pullRequest) {
+  console.log("No pull_request payload found; skipping real behavior proof gate.");
+  process.exit(0);
+}
+
+const evaluation = evaluateRealBehaviorProof({ pullRequest });
+if (evaluation.passed) {
+  console.log(evaluation.reason);
+  process.exit(0);
+}
+
+const message = `${evaluation.reason} Add after-fix evidence from a real OpenClaw setup in the PR body. Screenshots, recordings, terminal screenshots, console output, redacted runtime logs, linked artifacts, or copied live output count. Unit tests, mocks, snapshots, lint, typechecks, and CI are supplemental only. A maintainer can apply proof: override when appropriate.`;
+console.error(`::error title=Real behavior proof required::${escapeCommandValue(message)}`);
+process.exit(1);
diff --git a/scripts/github/real-behavior-proof-policy.mjs b/scripts/github/real-behavior-proof-policy.mjs
diff --git a/test/scripts/barnacle-auto-response.test.ts b/test/scripts/barnacle-auto-response.test.ts
diff --git a/test/scripts/real-behavior-proof-policy.test.ts b/test/scripts/real-behavior-proof-policy.test.ts