Building a GRC Agent with OpenAI Agents SDK and Codex CLI

Ethan Troy
hacker & writer
I already wrote one version of this demo using the Claude Agent SDK. This is the OpenAI-native version: Python, openai-agents, and Codex CLI handling the bounded workspace work that would otherwise turn into a pile of shell scripts and one-off glue code.
Note

This draft reflects the OpenAI Agents SDK and Codex docs as they existed on March 19, 2026. This space is moving fast. In particular, the Agents SDK currently documents codex_tool as experimental, and OpenAI also documents an alternative pattern where Codex CLI runs as an MCP server instead of being wrapped directly as a tool.

What are we building?

The goal is not “ask a model whether AC-2 exists” and call it a day. The goal is a GRC workflow that can:

  • read an SSP and related evidence from a local workspace,
  • reason over a selected framework and baseline,
  • use deterministic tools for framework facts instead of re-deriving them every run,
  • delegate narrow analysis tasks to specialist agents,
  • use Codex CLI for bounded workspace inspection and file-oriented work, and
  • return typed findings plus POA&M-ready remediation entries.

This is a core MVP, not parity with my larger Claude demo. I’m intentionally keeping the first OpenAI version tighter:

  • one primary path: NIST 800-53 with FedRAMP Moderate,
  • one typed assessment result,
  • deterministic baseline/control helpers,
  • specialist analysis via Agent.as_tool(),
  • Codex for local workspace actions,
  • traces and auditability from the SDK itself.

The companion repo for this draft lives at:

Warning

This is still a proof-of-concept pattern, not a production-ready federal assessment system. Model output should be reviewed by qualified humans. If you let an agent touch a real evidence repository, threat model it like any other privileged automation, because that is exactly what it is.

Why this stack?

I still think the best GRC agent architecture mixes deterministic and non-deterministic behavior. The OpenAI stack maps pretty cleanly to that split.

| Need | Best fit |
| --- | --- |
| Framework facts, baseline membership, POA&M formatting | @function_tool |
| Specialist analysis that should stay subordinate to one orchestrator | Agent.as_tool() |
| Local workspace inspection, shell commands, file edits, MCP calls | codex_tool(...) |
| Final typed answer for downstream systems | root agent with output_type=... |
| Audit trail and debugging | new_items, traces, and run metadata |

That last point is one of the reasons I would not center this first version on handoffs.

The OpenAI docs support both handoffs and “agents as tools.” Handoffs are useful when you want a specialist to take over the conversation. In this GRC demo I want the root assessor to stay in charge of the final answer so I can force a single typed assessment object out of the run. The specialist agents should help, not own the output.

That gives me a cleaner split:

  • the root agent owns the assessment,
  • specialist agents act like bounded analysts,
  • Codex does the ugly workspace work,
  • deterministic tools provide compliance ground truth.

First working run

If I were starting this from scratch in a clean repo, the setup would look like this:

python -m venv .venv
source .venv/bin/activate
pip install --upgrade openai-agents pydantic

# pick one
npm install -g @openai/codex
# or: brew install --cask codex

export OPENAI_API_KEY=sk-...

python -m grc_agent assess \
  --framework "NIST 800-53" \
  --baseline "FedRAMP Moderate" \
  examples/sample-ssp.md

Codex CLI can authenticate with OPENAI_API_KEY, although the current Agents SDK docs also note dedicated Codex auth options like CODEX_API_KEY and explicit codex subprocess config.

The output target for v1 is not prose. It is structured assessment data that a human can read and a system can also consume.

Start with the typed result

The first thing I would lock down is the schema, because that determines everything else.

from pydantic import BaseModel, Field


class Finding(BaseModel):
    control_id: str
    title: str
    status: str = Field(description="satisfied, partial, or not_satisfied")
    severity: str = Field(description="low, moderate, or high")
    rationale: str
    evidence: list[str] = Field(default_factory=list)


class PoamEntry(BaseModel):
    control_id: str
    weakness_description: str
    remediation_plan: str
    scheduled_completion_date: str


class AssessmentResult(BaseModel):
    framework: str
    baseline: str
    summary: str
    findings: list[Finding]
    poam: list[PoamEntry] = Field(default_factory=list)
    evidence_reviewed: list[str] = Field(default_factory=list)

OpenAI’s Agents SDK lets the final agent declare an output_type, and result.final_output comes back as that typed object when the run completes. That is exactly what I want for assessment output: one root schema, one root owner, no guessing about the final shape.
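One payoff of a typed output is that a downstream system can re-validate the serialized JSON against the exact same schema. Here is a minimal sketch of that consuming side, reusing trimmed copies of the models above so the snippet stands alone (the sample finding data is invented for illustration):

```python
from pydantic import BaseModel, Field


class Finding(BaseModel):
    control_id: str
    title: str
    status: str = Field(description="satisfied, partial, or not_satisfied")
    severity: str = Field(description="low, moderate, or high")
    rationale: str
    evidence: list[str] = Field(default_factory=list)


class AssessmentResult(BaseModel):
    framework: str
    baseline: str
    summary: str
    findings: list[Finding]


# What a consuming system does with the serialized final_output:
raw = {
    "framework": "NIST 800-53",
    "baseline": "FedRAMP Moderate",
    "summary": "One partial finding against AC-2.",
    "findings": [
        {
            "control_id": "AC-2",
            "title": "Account Management",
            "status": "partial",
            "severity": "moderate",
            "rationale": "The SSP does not describe account deprovisioning.",
            "evidence": ["examples/evidence/iam-policy.md"],
        }
    ],
}
result = AssessmentResult.model_validate(raw)
print(result.findings[0].status)  # partial
```

If validation fails on the consuming side, that is a schema drift signal, not a formatting quirk to paper over.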

Deterministic GRC tools first

This is the part too many agent demos skip. I do not want Codex or the model deciding from scratch what a FedRAMP Moderate baseline contains every run. That logic belongs in code and data.

import json
from pathlib import Path

from agents import function_tool


@function_tool
def get_baseline_controls(baseline: str) -> list[str]:
    """Return control identifiers for the selected baseline."""
    data = json.loads(Path("data/fedramp-baselines.json").read_text())
    return data[baseline]["controls"]


@function_tool
def get_control_metadata(control_id: str) -> dict:
    """Return deterministic control metadata for a single control."""
    data = json.loads(Path("data/nist-800-53-r5.json").read_text())
    return data[control_id]


@function_tool
def format_poam_entry(
    control_id: str,
    weakness_description: str,
    remediation_plan: str,
    scheduled_completion_date: str,
) -> PoamEntry:
    """Create a POA&M-ready entry."""
    return PoamEntry(
        control_id=control_id,
        weakness_description=weakness_description,
        remediation_plan=remediation_plan,
        scheduled_completion_date=scheduled_completion_date,
    )

This is where the agent gets boring in the right way. Baseline membership, control metadata, shared-responsibility hints, and POA&M formatting are all things I want to be deterministic. Let the model do reasoning over messy evidence, not reimplement compliance data loading in its head.
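If you want to exercise these tools without the real datasets, a tiny fixture matching the assumed file shapes is enough. The contents below are illustrative stand-ins, not actual FedRAMP or NIST data:

```python
import json
from pathlib import Path

# Illustrative fixtures matching the shapes get_baseline_controls and
# get_control_metadata expect; the real baseline/control data is far larger.
Path("data").mkdir(exist_ok=True)

baselines = {"FedRAMP Moderate": {"controls": ["AC-2", "AC-3", "AU-2"]}}
Path("data/fedramp-baselines.json").write_text(json.dumps(baselines, indent=2))

controls = {
    "AC-2": {"title": "Account Management", "family": "Access Control"},
    "AC-3": {"title": "Access Enforcement", "family": "Access Control"},
    "AU-2": {"title": "Event Logging", "family": "Audit and Accountability"},
}
Path("data/nist-800-53-r5.json").write_text(json.dumps(controls, indent=2))

loaded = json.loads(Path("data/fedramp-baselines.json").read_text())
print(loaded["FedRAMP Moderate"]["controls"])  # ['AC-2', 'AC-3', 'AU-2']
```

Swapping the fixtures for real baseline data later changes nothing about the tool code, which is exactly the point of keeping this layer deterministic.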

Where Codex CLI fits

Now the interesting part.

The current OpenAI Agents SDK docs expose an experimental codex_tool that wraps the Codex CLI. This is the integration path I want for this demo because it keeps Codex inside a bounded execution surface instead of turning it into the root orchestration engine.

from pathlib import Path

from agents.extensions.experimental.codex import ThreadOptions, TurnOptions, codex_tool


workspace_tool = codex_tool(
    sandbox_mode="workspace-write",
    working_directory=str(Path.cwd()),
    default_thread_options=ThreadOptions(
        model="gpt-5.4",
        model_reasoning_effort="low",
        approval_policy="never",
        network_access_enabled=False,
        web_search_mode="disabled",
    ),
    default_turn_options=TurnOptions(
        idle_timeout_seconds=60,
    ),
    persist_session=True,
)

Those defaults are deliberate:

  • workspace-write keeps Codex inside the repo boundary,
  • network_access_enabled=False stops it from wandering onto the internet,
  • web_search_mode="disabled" avoids blending local evidence review with live web calls,
  • approval_policy="never" is only tolerable here because the sandbox is tight and the network is off.

In other words, this is the part of the agent that gets to do the messy local work:

  • inspect SSP sections,
  • search evidence folders,
  • normalize local markdown or JSON files,
  • run bounded shell commands like rg,
  • prepare snippets the reasoning model can actually use.

What I do not want Codex doing is inventing the baseline, the schema, or the compliance policy. That part stays deterministic.

Note

OpenAI also documents another pattern where Codex CLI runs as an MCP server and the Agents SDK talks to it over MCP. That is a good option when you want Codex to look like a reusable server capability. For this first GRC version, I want the simpler local wrapper.

Specialists as tools, not handoffs
#

The OpenAI docs explicitly support turning agents into tools with Agent.as_tool(). That is the pattern I want for policy review and evidence review.

from pydantic import BaseModel, Field
from agents import Agent


class ReviewRequest(BaseModel):
    control_id: str
    question: str
    evidence_paths: list[str] = Field(default_factory=list)


policy_reviewer = Agent(
    name="Policy Reviewer",
    instructions=(
        "Review SSP and policy text for the requested control. "
        "Extract implementation statements, missing details, and contradictions."
    ),
    model="gpt-5",
)


evidence_reviewer = Agent(
    name="Evidence Reviewer",
    instructions=(
        "Review local evidence and configuration artifacts for the requested control. "
        "Use Codex when you need to inspect files, search the workspace, or summarize findings."
    ),
    model="gpt-5",
    tools=[workspace_tool],
)


policy_review_tool = policy_reviewer.as_tool(
    tool_name="review_policy_language",
    tool_description="Review SSP/policy text for a specific control.",
    parameters=ReviewRequest,
)


evidence_review_tool = evidence_reviewer.as_tool(
    tool_name="review_local_evidence",
    tool_description="Review local evidence artifacts for a specific control.",
    parameters=ReviewRequest,
)

This is the critical design decision in the OpenAI version.

If I used handoffs, the last agent in the chain could become the owner of the final output. That is fine for chat routing. It is not what I want for a typed GRC assessment. I want one assessor agent to synthesize everything into one AssessmentResult.

So the root agent gets tools, not peers.

Wire up the root assessor

Once the schema and tools exist, the root agent gets pretty straightforward.

from agents import Agent, Runner


assessor_agent = Agent(
    name="GRC Assessor",
    instructions=(
        "Assess the provided SSP and local evidence against the selected framework and baseline. "
        "Use deterministic tools for framework facts. "
        "Use specialist tools when policy language or local evidence needs deeper review. "
        "Return only a complete AssessmentResult."
    ),
    model="gpt-5",
    output_type=AssessmentResult,
    tools=[
        get_baseline_controls,
        get_control_metadata,
        format_poam_entry,
        policy_review_tool,
        evidence_review_tool,
    ],
)


async def main() -> None:
    prompt = (
        "Assess examples/sample-ssp.md against NIST 800-53 / FedRAMP Moderate. "
        "Use local evidence under examples/evidence/. "
        "Return structured findings and POA&M entries where remediation is needed."
    )
    result = await Runner.run(assessor_agent, prompt, max_turns=20)
    print(result.final_output.model_dump_json(indent=2))


if __name__ == "__main__":
    import asyncio

    asyncio.run(main())

That is the center of gravity for the OpenAI version:

  • root agent owns the assessment,
  • function tools provide stable compliance facts,
  • nested specialist agents help with reasoning,
  • Codex is a bounded local operator inside one specialist path.

Guardrails, traces, and memory

OpenAI’s SDK has a few details here that matter more than they appear to at first glance.

Guardrails are not a magic security blanket
#

The docs split guardrails into input, output, and tool guardrails. The important nuance is that tool guardrails apply to custom function_tools, not to every execution surface in the system. They do not replace Codex sandboxing and approvals.

That means:

  • use agent guardrails for scope control,
  • use function tool guardrails for deterministic tool validation,
  • use Codex sandboxing and approval policy for Codex execution risk.

If you blur those together, you will think you secured something you did not actually secure.
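To ground the scope-control bullet above: the useful core of an input guardrail is often a plain deterministic check. The sketch below is only that core; in the SDK you would wrap it with the input-guardrail decorator the docs describe and set the tripwire from the returned boolean. The marker list and function name are hypothetical:

```python
# Deterministic core of a hypothetical scope guardrail. The Agents SDK wiring
# (decorator, tripwire, guardrail output) is deliberately left out; this is
# just the decision logic that wiring would call.
SUPPORTED_MARKERS = ("nist 800-53", "fedramp")


def out_of_scope(user_input: str) -> bool:
    """Trip when a request names none of the supported frameworks or baselines."""
    text = user_input.lower()
    return not any(marker in text for marker in SUPPORTED_MARKERS)


print(out_of_scope("Assess the SSP against NIST 800-53"))  # False
print(out_of_scope("Write me a poem about firewalls"))     # True
```

Keeping the check deterministic also makes it trivially unit-testable, which is not true of a guardrail that is itself a model call.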

result.final_output is not the whole story

The results docs make a useful distinction:

  • final_output is the final typed assessment,
  • new_items is the rich audit trail with tool calls, outputs, and nested agent metadata,
  • raw_responses is there when you need provider-level debugging,
  • last_agent tells you who actually finished the run.

For GRC work, new_items matters a lot. If someone asks “why did the agent call this control partial?” I want to inspect the reasoning trail, the tool outputs, and the exact evidence path that fed the decision.
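To make that concrete, here is a sketch of flattening a run's item trail into reviewer-facing audit lines. The TrailItem shape below is a stand-in for the SDK's typed run items, with invented attribute names, so the rendering logic can be shown and tested without a live run:

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class TrailItem:
    # Stand-in for an SDK run item: "tool_call", "tool_output", or "message".
    kind: str
    agent: str
    payload: dict[str, Any] = field(default_factory=dict)


def audit_lines(items: list[TrailItem]) -> list[str]:
    """Render an ordered, human-readable trail of what happened in a run."""
    lines = []
    for i, item in enumerate(items, start=1):
        if item.kind == "tool_call":
            lines.append(f"{i}. {item.agent} called {item.payload.get('name')}")
        elif item.kind == "tool_output":
            lines.append(f"{i}. tool returned: {item.payload.get('output')!r}")
        else:
            lines.append(f"{i}. {item.agent} said: {item.payload.get('text')}")
    return lines


trail = [
    TrailItem("tool_call", "GRC Assessor", {"name": "get_baseline_controls"}),
    TrailItem("tool_output", "GRC Assessor", {"output": ["AC-2", "AC-3"]}),
    TrailItem("message", "GRC Assessor", {"text": "AC-2 is partial."}),
]
for line in audit_lines(trail):
    print(line)
```

The real version would walk result.new_items and branch on the SDK's actual item types, but the shape of the audit artifact, an ordered list of who did what with which output, stays the same.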

Sessions are optional, but pick one memory strategy

For the first version I would keep the assessment command mostly stateless. But once you add follow-up questions, the SDK gives you two broad choices:

  • client-managed history via SQLiteSession(...),
  • OpenAI-managed continuation via previous_response_id or conversation_id.

The docs explicitly warn against layering multiple memory strategies together unless you mean to reconcile duplicated context. For a local GRC CLI, I would probably start with SQLiteSession("fedramp-demo") and keep the conversation state local.

Traces are one of the strongest parts of this stack

The tracing docs are honestly one of the more compelling parts of the OpenAI SDK for this use case. The trace captures generations, tool calls, handoffs, guardrails, and custom workflow activity. For an agent that touches compliance evidence, that matters.

You are not just trying to get an answer. You are trying to understand:

  • what the model saw,
  • which tool it called,
  • what came back,
  • where it escalated to Codex,
  • and how the final assessment was assembled.

That is a much better place to be than “trust me bro, the LLM looked at it.”

Security and limitations

If I shipped the first version of this demo today, I would be very explicit about what it is and is not doing.

What it is:

  • a local evidence review workflow,
  • a typed assessment generator,
  • a bounded Codex execution pattern,
  • a good way to separate deterministic compliance data from agent reasoning.

What it is not:

  • a production assessor,
  • a replacement for human review,
  • a complete federal workflow,
  • a full parity port of my Claude demo.

There are also a few concrete limitations worth calling out:

  • codex_tool is still documented as experimental.
  • The approval story is different depending on whether you use function_tool, Agent.as_tool(), local MCP servers, or Codex execution.
  • Tool guardrails do not magically cover every runtime surface.
  • This v1 scope does not include OSCAL conversion, broader framework coverage, or an interactive follow-up mode yet.

If I needed higher assurance immediately, I would tighten this further:

  • keep network disabled,
  • keep Codex sandboxed,
  • move any dangerous actions to explicit approval flows,
  • and add stricter agent/tool separation before touching real customer evidence.

Why this version is useful anyway

The reason I still like this architecture is simple: it keeps the messy parts messy and the stable parts stable.

The stable parts:

  • framework data,
  • baseline logic,
  • POA&M formatting,
  • output schema.

The messy parts:

  • SSP prose,
  • contradictory evidence,
  • file layouts nobody standardized,
  • figuring out what a screenshot or markdown bundle is actually trying to say.

That is where an agent helps. That is also where Codex helps. The trick is not confusing that help with authority.

What’s next

If I keep iterating on this OpenAI version, the next additions would probably be:

  1. SQLiteSession support for follow-up review conversations.
  2. Wider framework coverage beyond the first FedRAMP/NIST path.
  3. OSCAL export and mapping generation.
  4. A stronger approval model for any action that changes workspace content.
  5. A cleaner comparison against the Codex-as-MCP-server pattern.

For now, the main thing I wanted to prove is narrower than that: OpenAI’s current agent stack is already capable of a serious GRC workflow if you keep the compliance facts deterministic, keep the final output typed, and use Codex as a bounded operator instead of turning it loose as the whole system.
