Adversarial Logic

When AI Finds the Shortcut: Reward Hacking from 1994 to 2025

In February 2025, Palisade Research set up hundreds of chess matches between seven large language models and Stockfish, a top-tier open-source chess engine [1]. The models had general computer access, the same kind of shell environment increasingly standard for AI agents in production. The task was simple: play chess as

The AI Agent Supply Chain Is Vulnerable. You Probably Are Too.

On September 8, 2025, a phishing email impersonating npm support hit the inbox of Josh Junon, maintainer of chalk, debug, and other foundational JavaScript packages. Within hours, attackers had published trojanized versions of 18 packages with a combined 2.6 billion weekly downloads [1]. The malware, dubbed Shai-Hulud, harvested credentials,

One-Pixel Attacks: Why Computer Vision Security Is Broken

State-of-the-art image classifiers can identify thousands of objects with near-human accuracy. They power self-driving cars, medical diagnostics, and security systems. But a 2019 paper by Su et al. showed something unsettling: you can make these systems misclassify an image by changing a single pixel. Not photoshopping the whole thing.

7 Prompt Injection Defenses That Actually Work (and 3 That Don't)

Most companies are defending against prompt injection completely wrong. They're either doing nothing—hoping OpenAI or Anthropic will magically fix the problem—or they're implementing security theater that wouldn't stop a determined 12-year-old with a ChatGPT account. Here's the uncomfortable reality: if

GPT-OSS Safeguard: What It Actually Does (And Common Mistakes to Avoid)

GPT-OSS Safeguard isn't just "Llama Guard but from OpenAI." It's a policy-following reasoning model - you write the safety rules, it interprets them at inference time. That flexibility is powerful for custom policies, but deploy it wrong and you'll be out of compute fast.

Llama Guard in a retro theme stopping hackers from abusing an AI system

Llama Guard: What It Actually Does (And Doesn't Do)

Llama Guard isn't a firewall. It's not antivirus for your prompts. And if you're treating it like either, you're probably leaving gaps in your AI security.

The One LLM Security Setting Everyone Gets Wrong

Bing Chat. ChatGPT plugins. Hundreds of production apps. Same vulnerability: no separation between system instructions and user input. If you're concatenating prompts, you're vulnerable.

Is Your RAG System Leaking Data? 5 Minute Security Check

Most RAG systems have at least one critical security flaw — they can be exploited to leak confidential data. Run these 5 checks before your next deployment.

Image of a prompt injection advertisement with a happy hacker in the background

3 Prompt Injection Attacks You Can Test Right Now

Wanna learn how to hack an AI? Now is your chance! I'm going to show you three prompt injection attacks that work on ChatGPT, Claude, and most other LLMs. You can test these yourself in the next five minutes. No coding required. Also...you didn't 'hear' this from me...

Hacker Ahab taking down the Docker Whale

Kata Containers: When Docker's Isolation Isn't Enough

Kata Containers runs each container inside its own lightweight VM, giving you Docker's speed with VM-level security isolation—perfect for untrusted code, multi-tenant systems, and when namespace isolation just isn't enough.

Prompt Injection: The Unfixable Vulnerability Breaking AI Systems

Prompt injection is the #1 security threat facing AI systems today and there's no clear path to fixing it. This vulnerability exploits a fundamental limitation: LLMs can't distinguish between trusted instructions and malicious user input. Understanding prompt injection isn't optional—it's critical.

The Model Context Protocol is Brilliant (And Dangerously Insecure)

If you've been paying attention to the AI space lately, you've probably heard about the Model Context Protocol, or MCP. Released by Anthropic in November 2024, it's being hailed as a game-changer for AI integrations—and honestly, it kind of is. It's


Where deep learning meets deep defense
