An OpenEnv reinforcement learning environment where agents learn to find security vulnerabilities in code using LLM-powered analysis.
Static analysis tools exist, but they don't "think" - they just match patterns. What if an AI agent could:
- Actually understand code semantics
- Learn which vulnerability types are most likely in a given context
- Improve its strategy over time through experience
That's exactly what this project achieves.
Traditional RL environments use simulated states: the agent learns from numbers, not real content. This project takes a different approach: send real code to an LLM, let it analyze the snippet, extract the identified vulnerability type, then use RL to learn which LLM-reasoning strategies work best.
This isn't just "LLM inference as a service"; it's a genuine RL agent that learns from experience.
Three approaches were considered:
1. **LLM-only (no RL):** Just ask the LLM to find bugs. Works, but there is no learning and no improvement over time.
2. **RL-only (no LLM):** Use keyword matching or regex. Simple, but can't handle complex code analysis.
3. **LLM + RL (chosen):** The LLM provides semantic understanding; RL learns which strategies work. Best of both worlds.
| Category | Examples |
|---|---|
| Injection | SQL Injection, Command Injection |
| Scripting | XSS (Stored, Reflected, DOM) |
| Auth | Auth Bypass, IDOR, Broken Access Control |
| Web | CSRF, Path Traversal |
| Logic | Race Conditions, Timing Attacks, Business Logic |
- Python
- JavaScript
- Java
1. Sample random code snippet
2. Send to Groq LLM: "What vulnerability is here?"
3. LLM responds: "SQL injection - f-string with user input"
4. Agent takes action: identify_vulnerability("sql_injection")
5. Environment checks: Is this correct?
6. RL updates: Q("sql_injection", "easy") += learning_rate × reward
7. Repeat
The agent learns that certain vulnerability types are more valuable to pursue in certain contexts.
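The update in step 6 can be sketched as a tabular, bandit-style rule keyed on (vulnerability type, difficulty). The class and method names below are illustrative, not the project's actual API:

```python
import random
from collections import defaultdict

class SimpleQLearner:
    """Minimal sketch of a (vuln_type, difficulty) value table.

    Hypothetical structure; the real training script may differ.
    """

    def __init__(self, lr=0.1, epsilon=0.2):
        self.q = defaultdict(float)   # Q[(vuln_type, difficulty)]
        self.lr = lr                  # learning rate
        self.epsilon = epsilon        # exploration rate

    def choose(self, difficulty, vuln_types):
        # Epsilon-greedy over vulnerability types for this difficulty.
        if random.random() < self.epsilon:
            return random.choice(vuln_types)
        return max(vuln_types, key=lambda v: self.q[(v, difficulty)])

    def update(self, vuln_type, difficulty, reward):
        # Matches step 6: Q("sql_injection", "easy") += learning_rate * reward
        self.q[(vuln_type, difficulty)] += self.lr * reward

agent = SimpleQLearner()
agent.update("sql_injection", "easy", reward=0.4)
```

With repeated episodes, values for vulnerability types that the LLM identifies correctly in a given difficulty tier grow, so the greedy policy starts prioritizing them.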
Unlike traditional RL, where the state might be raw pixels, the state here consists of:
- Code sent to LLM (semantic understanding)
- Task difficulty
- Previous analysis history
This hybrid approach lets the LLM do what it does best (understand code) while RL does what it does best (optimize strategy).
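As a sketch, the observation handed to the agent might bundle those three pieces like this (field names are illustrative assumptions, not the project's `models.py`):

```python
from dataclasses import dataclass, field

@dataclass
class SecurityObservation:
    # Hypothetical structure for the hybrid state described above.
    code: str                # snippet sent to the LLM (semantic understanding)
    difficulty: str          # task difficulty: "easy" | "medium" | "hard"
    history: list = field(default_factory=list)  # previous analysis history

obs = SecurityObservation(
    code='query = f"SELECT * FROM users WHERE id = {user_id}"',
    difficulty="easy",
)
obs.history.append("llm_flagged: sql_injection")
```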
- +0.4 for correct vulnerability identification (strong signal)
- +0.1 for exploratory actions (encourage exploration)
- +0.2 for patch suggestions (secondary behavior)
The comparatively high reward for correct identification steers the agent toward good strategies quickly.
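A minimal reward function implementing this schedule might look like the following; the action strings are assumptions for illustration, not the environment's real API:

```python
def compute_reward(action: str, correct: bool = False) -> float:
    """Sketch of the reward schedule; action names are hypothetical."""
    if action == "identify_vulnerability":
        return 0.4 if correct else 0.0   # strong signal for a correct call
    if action == "explore":
        return 0.1                       # small bonus to encourage exploration
    if action == "suggest_patch":
        return 0.2                       # secondary behavior
    return 0.0
```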
```bash
# Install dependencies
cd my_env
uv sync

# Run baseline (no server needed)
python run_baseline.py

# Or start the environment server
python -m server.app

# Train the LLM-RL agent
python train_llm_rl.py --episodes 100
```

| Script | Description |
|---|---|
| `run_baseline.py` | Quick demo: run the agent without a server |
| `train_llm_rl.py` | LLM + Q-Learning hybrid (recommended) |
| `train_dqn.py` | Deep Q-Network with PyTorch |
| `train_dqn_embeddings.py` | DQN + code embeddings |
| `train_ppo.py` | Proximal Policy Optimization |
| `train_pg.py` | REINFORCE policy gradient |
| `baseline.py` | LLM-only (no RL) for comparison |
```
my_env/
├── train_llm_rl.py              # Main training script
├── train_dqn.py                 # DQN baseline
├── train_ppo.py                 # PPO baseline
├── train_pg.py                  # REINFORCE baseline
├── baseline.py                  # LLM-only baseline
├── analyze_logs.py              # Training analysis
├── models.py                    # Data types
├── server/
│   ├── app.py                   # FastAPI server
│   └── security_environment.py  # OpenEnv environment
└── logs/                        # Training metrics
```
| Difficulty | Vulnerabilities | Notes |
|---|---|---|
| Easy | SQL Injection, XSS | Basic but real |
| Medium | CSRF, IDOR, Path Traversal | Requires understanding context |
| Hard | Auth Bypass, Timing, Race Conditions | Subtle and complex |
Security vulnerabilities vary wildly in complexity. Easy vulns like SQL injection have obvious patterns. Hard ones like race conditions require deep analysis. Supporting all three demonstrates the agent can learn context-appropriate strategies.
- Python 3.10+
- Groq API key (free tier works)
```bash
echo "GROQ_API_KEY=your_key" > my_env/.env
```

| Approach | Learns? | Real Code Understanding? |
|---|---|---|
| Regex/Static Analysis | No | No |
| LLM Alone | No | Yes |
| Traditional RL | Yes | No |
| This Project | Yes | Yes |
This is a genuine RL agent: it learns from experience and improves over time. The LLM isn't just an API call; it's integrated into the learning loop.