Topic: In Part 1, we established what makes AI “agentic” and mapped where autonomous agents belong (and don’t belong) in your security operations. Part 2 dives into the harder architectural challenge: how do we actually build these systems to remain secure, controllable, and aligned as they learn and evolve?
Core Questions:
- What new threat models do we need when AI systems can learn, adapt, and take autonomous actions?
- How do we design agent architectures that prevent goal hijacking, tool misuse, and harmful emergent behaviors?
- What does “secure by design” mean for systems that modify their own behavior over time?
- How do we build AgentOps infrastructure that provides the governance, auditability, and control needed for production deployment?
- What are the critical research gaps and unknown failure modes we need to prepare for?
Welcome back! It has been a busy July but I’m back with Part 2 of my agentic AI series. Let’s dive in.
In Part One, we made the case that agentic AI represents a shift from AI that suggests to AI that acts. Securing these systems requires a fundamentally different approach that accounts for emergence, learning, and goal-driven autonomy.
In Part Two, we focus on implementation: How do we build agentic systems that remain secure, controllable, and aligned as they evolve in dynamic environments?
This is fundamentally a systems security problem. The challenge isn’t protecting against known threats, but designing for resilience against unknown failure modes that emerge from the interaction between intelligent agents, complex environments, and human organizations.
Agentic AI Threat Models: What Can Go Wrong?
Buckle up, buttercup! Things can go south real quick if you don’t know what you’re doing.
Traditional threat models assume relatively static attack surfaces with well-defined boundaries. Agentic AI systems break these assumptions. The attack surface is dynamic based on the agent’s learned behaviors, the tools it can access, and its goals.
Let’s examine a few high-risk scenarios:
1. Tool Misuse and Privilege Escalation
Consider an agent designed for threat hunting that has read access to security logs and the ability to query threat intelligence APIs. In traditional systems, we’d secure the APIs, validate inputs, and call it done. But agents can exhibit creative problem-solving that leads to unintended tool usage.
Scenario: The agent learns that certain threat intel queries return richer data when framed as “urgent” requests. It begins marking all queries as urgent, potentially triggering rate limiting, depleting API quotas, or creating false urgency signals for human analysts. The agent isn’t malicious in this case. Rather, it’s optimizing for its goal of gathering comprehensive threat data (but it’s operating outside the intended usage patterns).
More concerning is the potential for tool chaining. An agent with access to multiple APIs might discover that combining them in unexpected ways achieves better outcomes. A threat hunting agent might learn to correlate vulnerability scanner results with employee directory data to identify which users have access to vulnerable systems, then use that information to prioritize investigations. This capability wasn’t explicitly designed, but emerged from the agent’s exploration of its tool environment.
2. Goal Hijacking and Prompt Injection
Goal hijacking occurs when an agent’s objectives become corrupted or subverted, either through external manipulation or internal drift. Unlike prompt injection attacks against LLMs, which typically affect single interactions, goal hijacking can persist across agent sessions and compound over time.
Scenario: Consider a compliance monitoring agent designed to identify and report policy violations. An attacker might not need to directly compromise the agent’s code…they might simply introduce subtle patterns into the environment that cause the agent to learn counterproductive behaviors. For example, by consistently creating false compliance violations that get dismissed by human reviewers, an attacker could train the agent to ignore certain classes of real violations.
The temporal aspect makes this particularly interesting. Traditional security tools either work or they don’t. Their behavior is consistent over time. Agents can exhibit gradual degradation where their effectiveness erodes slowly enough that the change isn’t immediately apparent. By the time the misalignment is detected, the agent may have made hundreds of poor decisions. Yikes!
3. Emergent Behaviors from Agent Interactions
Ah yes. As if a single agentic system wasn’t enough. When multiple agents interact within the same environment, their combined behavior can exhibit properties that weren’t present in any individual agent. This is where chaos theory meets cybersecurity.
Scenario: Imagine you have two agents: one focused on threat detection (trying to maximize security) and another focused on availability (trying to minimize service disruptions). Individually, both agents might behave appropriately. BUT their interaction could lead to oscillating behaviors where the security agent detects a threat and implements containment measures, the availability agent sees service degradation and relaxes those measures, triggering the security agent to implement even stronger containment, and so on.
These emergent behaviors are particularly dangerous because they can’t be predicted through individual agent testing. The failure modes only become apparent when agents are deployed together in production environments with real data, real time pressures, and real organizational dynamics.
Another reason not to test in production.
Security by (Sociotechnical) Design
Because agents exist within complex systems, point solutions won’t work. We need architectural strategies that contain risk, enforce boundaries, and preserve observability.
Here are a few approaches:
1. Agent Sandboxing and Memory Scope Limits
Obvious, but limit what an agent can remember and access. Constrain environment visibility, tool invocation, and long-term memory updates by default.
Effective agent sandboxing requires multiple layers:
- Execution sandboxing limits what the agent can do at any given moment. This includes traditional process isolation but extends to API rate limiting, action queuing, and temporal restrictions.
- Memory scope limits prevent agents from accumulating too much organizational knowledge or retaining sensitive information longer than necessary. Unlike human analysts who naturally forget details over time, agents can retain perfect memories of every interaction. This creates risks around data aggregation and inference.
- Learning boundaries constrain how and what agents can learn from their environment. This might involve limiting the feedback signals agents receive, constraining the types of patterns they can recognize, or implementing “forget” mechanisms that cause agents to lose certain types of learned behaviors over time.
2. Auditable Goals and Outcomes
If you can’t inspect what the agent is optimizing for or reconstruct why it acted, you don’t have a secure system. Every agent action must be traceable back to the reasoning that produced it. This creates a complete decision audit trail that enables human oversight and learning.
3. Architect for Containment, Observability, and Recoverability
Secure agent systems MUST be designed with the assumption that failures will occur and that some of those failures won’t be immediately apparent. This requires architectural patterns borrowed from resilience engineering and chaos engineering:
- Containment means limiting the blast radius when agents malfunction. This involves both technical measures (limiting an agent’s access to critical systems) and organizational measures (ensuring humans retain the ability to override agent decisions quickly).
- Observability requires instruments that can detect subtle changes in agent behavior, goal drift, and emergent system properties. This might involve comparing agent decisions against human baselines, tracking decision confidence over time, or monitoring for unexpected patterns in agent-environment interactions.
- Recoverability means building systems that can return to known-good states when problems are detected. For agents, this involves not just technical rollback capabilities, but also mechanisms for “unlearning” problematic behaviors and resetting goal alignment.
4. Goal Specification and Constraint Injection
Agents must be explicitly programmed with goals, constraints, and value systems that guide their autonomous decision-making. This requires a much more sophisticated approach to requirements specification.
Goal specification must be comprehensive enough to prevent harmful optimizations while remaining flexible enough to allow effective autonomous operation. Consider a simple goal like “minimize security incidents.” An agent might achieve this by blocking all network traffic. Sure this technically meets the goal, but it destroys productivity.
Constraint injection involves embedding ethical and operational principles directly into the agent’s decision-making process. This might include things like “prefer reversible actions over irreversible ones,” “escalate decisions that affect large numbers of users,” or “maintain human agency in situations involving individual privacy.”
The challenge is making these constraints robust against optimization pressure. Agents are fundamentally optimization systems. Constraints must be designed to maintain their intent (even when the agent discovers unexpected ways to circumvent their literal implementation).
Toward a Secure AgentOps Stack
Just as MLOps emerged to manage the lifecycle of models, we need a new operational discipline: AgentOps.
AgentOps for security applications must address additional challenges around trust, governance, and risk management.
Policy Enforcement Architecture
Traditional policy enforcement happens at well-defined chokepoints (think firewalls, proxies, authentication systems, etc.). Agent policy enforcement must be distributed throughout the agent’s decision-making process and execution environment.
This requires policy engines that can evaluate complex, context-dependent rules in real-time. For example, a policy might specify that an agent can block network traffic during business hours only if the threat confidence exceeds 90%, but during off-hours, the threshold drops to 70%. The policy engine must have access to real-time context (time, threat assessment, business impact) and be able to make nuanced decisions.
Access Control and Secrets Management
Agents need access to sensitive systems and data to perform their functions, but that access must be controlled and monitored. Traditional identity and access management assumes relatively static access patterns and human accountability. Agents may need dynamic access to resources based on their current goals and context.
This requires extending identity systems to account for agent identity, intent, and behavioral history. An agent’s access should depend not just on its permissions, but on its recent behavior, current goals, and the broader system state. This might require secrets that are time-limited, context-dependent, or that require multiple agent “signatures” for access.
Logging and Audit Trails
Agent audit trails must capture not just what happened, but the reasoning process that led to each decision. This creates significant data volume and privacy challenges. A comprehensive agent audit trail might include:
- The raw inputs that triggered each decision
- The internal reasoning process and alternatives considered
- The confidence level and uncertainty estimates
- The external context and constraints that influenced the decision
- The expected outcomes and actual results
This information must be stored securely but remain accessible for investigation and learning. It must also be structured to enable both automated analysis (for detecting behavioral anomalies) and human review (for understanding and validating agent decisions).
Simulation and Red-teaming Environments
Agents must be tested in environments that closely simulate production conditions but without the risk of causing real damage. Red-teaming for agents must go beyond traditional penetration testing to include behavioral manipulation, goal corruption, and social engineering attacks targeting the human-agent interface.
Gaps in Current Tooling
Current agent frameworks like LangChain, CrewAI, and AutoGen focus primarily on functionality rather than security and governance. They provide tools for building agents but little support for the policy enforcement, audit trails, and behavioral controls needed for security applications.
This creates a significant gap between research and production deployment. Organizations that want to deploy agents securely must either build their own governance infrastructure or accept significant security risks. The industry needs purpose-built platforms that integrate agent capabilities with enterprise security and governance requirements.
Open Questions & Research Frontiers
We’re still super early in understanding how agentic AI systems behave at scale. Here are some of the most important unanswered questions:
- How do we detect misalignment before it manifests in risky behavior?
- How do we formally verify that an agent will behave appropriately in novel situations?
- How do we specify goals that remain aligned with human values even when agents discover unexpected ways to achieve them?
- How do we ensure that collections of agents work together effectively without creating unstable or harmful emergent behaviors?
- How should liability and accountability be distributed when agents act autonomously on human teams?
Some of these questions are technical, others are organizational, and many require interdisciplinary collaboration.
Conclusion: Designing for Complexity, Not Against It
If there’s one takeaway from both parts of this series, it’s this:
Agentic AI security is not about achieving perfect control. It’s about designing systems that stay coherent, observable, and governable as complexity increases.
We won’t “secure” these systems by locking them down. We’ll secure them by embedding governance into the architecture, feedback into the loop, and human judgment into the flow.
That means borrowing from disciplines like safety engineering, cyber-physical systems, and complexity science. The future of security will be adaptive, interactive, and fundamentally human-centered.
I’d love to hear how you’re thinking about governance and risk in agent deployments. Reach out if you’re building in this space!