An OpenEnv reinforcement learning environment where agents learn to find security vulnerabilities in code using LLM-powered analysis.
Static analysis tools exist, but they don't "think" - they just match patterns. What if an AI agent could:
- Actually understand code semantics
- Learn which vulnerability types are most likely in a given context
- Improve its strategy over time through experience
That's exactly what this project achieves.
Traditional RL environments use simulated states: the agent learns from numbers, not real content. This project takes a different approach: send real code to an LLM, let it analyze the snippet, extract the identified vulnerability type, then use RL to learn which LLM-reasoning strategies work best.
This isn't just "LLM inference as a service"; it's a genuine RL agent that learns from experience.
Three approaches were considered:
1. **LLM-only (no RL):** Just ask the LLM to find bugs. Works, but there is no learning and no improvement over time.
2. **RL-only (no LLM):** Use keyword matching or regex. Simple, but can't handle complex code analysis.
3. **LLM + RL (chosen):** The LLM provides semantic understanding; RL learns which strategies work. Best of both worlds.
| Category | Examples |
|---|---|
| Injection | SQL Injection, Command Injection |
| Scripting | XSS (Stored, Reflected, DOM) |
| Auth | Auth Bypass, IDOR, Broken Access Control |
| Web | CSRF, Path Traversal |
| Logic | Race Conditions, Timing Attacks, Business Logic |
- Python
- JavaScript
- Java
1. Sample random code snippet
2. Send to Groq LLM: "What vulnerability is here?"
3. LLM responds: "SQL injection - f-string with user input"
4. Agent takes action: identify_vulnerability("sql_injection")
5. Environment checks: Is this correct?
6. RL updates: Q("sql_injection", "easy") += learning_rate × reward
7. Repeat
The agent learns that certain vulnerability types are more valuable to pursue in certain contexts.
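The update in step 6 can be sketched as a tabular, bandit-style rule keyed on (vulnerability type, difficulty). The class and method names below are illustrative, not the project's actual API:

```python
import random
from collections import defaultdict

class SimpleQLearner:
    """Minimal sketch of a (vuln_type, difficulty) value table.

    Hypothetical structure; the real training script may differ.
    """

    def __init__(self, lr=0.1, epsilon=0.2):
        self.q = defaultdict(float)   # Q[(vuln_type, difficulty)]
        self.lr = lr                  # learning rate
        self.epsilon = epsilon        # exploration rate

    def choose(self, difficulty, vuln_types):
        # Epsilon-greedy over vulnerability types for this difficulty.
        if random.random() < self.epsilon:
            return random.choice(vuln_types)
        return max(vuln_types, key=lambda v: self.q[(v, difficulty)])

    def update(self, vuln_type, difficulty, reward):
        # Matches step 6: Q("sql_injection", "easy") += learning_rate * reward
        self.q[(vuln_type, difficulty)] += self.lr * reward

agent = SimpleQLearner()
agent.update("sql_injection", "easy", reward=0.4)
```

With repeated episodes, values for vulnerability types that the LLM identifies correctly in a given difficulty tier grow, so the greedy policy starts prioritizing them.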
Unlike traditional RL, where the state might be raw pixels, the state here consists of:
- Code sent to LLM (semantic understanding)
- Task difficulty
- Previous analysis history
This hybrid approach lets the LLM do what it does best (understand code) while RL does what it does best (optimize strategy).
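As a sketch, the observation handed to the agent might bundle those three pieces like this (field names are illustrative assumptions, not the project's `models.py`):

```python
from dataclasses import dataclass, field

@dataclass
class SecurityObservation:
    # Hypothetical structure for the hybrid state described above.
    code: str                # snippet sent to the LLM (semantic understanding)
    difficulty: str          # task difficulty: "easy" | "medium" | "hard"
    history: list = field(default_factory=list)  # previous analysis history

obs = SecurityObservation(
    code='query = f"SELECT * FROM users WHERE id = {user_id}"',
    difficulty="easy",
)
obs.history.append("llm_flagged: sql_injection")
```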
- +0.4 for correct vulnerability identification (strong signal)
- +0.1 for exploratory actions (encourage exploration)
- +0.2 for patch suggestions (secondary behavior)
The comparatively high reward for correct identification steers the agent toward good strategies quickly.
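A minimal reward function implementing this schedule might look like the following; the action strings are assumptions for illustration, not the environment's real API:

```python
def compute_reward(action: str, correct: bool = False) -> float:
    """Sketch of the reward schedule; action names are hypothetical."""
    if action == "identify_vulnerability":
        return 0.4 if correct else 0.0   # strong signal for a correct call
    if action == "explore":
        return 0.1                       # small bonus to encourage exploration
    if action == "suggest_patch":
        return 0.2                       # secondary behavior
    return 0.0
```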
```bash
# Install dependencies
cd my_env
uv sync

# Run baseline (no server needed)
python run_baseline.py

# Or start the environment server
python -m server.app

# Train the LLM-RL agent
python train_llm_rl.py --episodes 100
```

| Script | Description |
|---|---|
| `run_baseline.py` | Quick demo: run the agent without a server |
| `train_llm_rl.py` | LLM + Q-Learning hybrid (recommended) |
| `train_dqn.py` | Deep Q-Network with PyTorch |
| `train_dqn_embeddings.py` | DQN + code embeddings |
| `train_ppo.py` | Proximal Policy Optimization |
| `train_pg.py` | REINFORCE policy gradient |
| `baseline.py` | LLM-only (no RL) for comparison |
```
my_env/
├── train_llm_rl.py              # Main training script
├── train_dqn.py                 # DQN baseline
├── train_ppo.py                 # PPO baseline
├── train_pg.py                  # REINFORCE baseline
├── baseline.py                  # LLM-only baseline
├── analyze_logs.py              # Training analysis
├── models.py                    # Data types
├── server/
│   ├── app.py                   # FastAPI server
│   └── security_environment.py  # OpenEnv environment
└── logs/                        # Training metrics
```
| Difficulty | Vulnerabilities | Notes |
|---|---|---|
| Easy | SQL Injection, XSS | Basic but real |
| Medium | CSRF, IDOR, Path Traversal | Requires understanding context |
| Hard | Auth Bypass, Timing, Race Conditions | Subtle and complex |
Security vulnerabilities vary wildly in complexity. Easy vulns like SQL injection have obvious patterns. Hard ones like race conditions require deep analysis. Supporting all three demonstrates the agent can learn context-appropriate strategies.
- Python 3.10+
- Groq API key (free tier works)
```bash
echo "GROQ_API_KEY=your_key" > my_env/.env
```

| Approach | Learns? | Real Code Understanding? |
|---|---|---|
| Regex/Static Analysis | No | No |
| LLM Alone | No | Yes |
| Traditional RL | Yes | No |
| This Project | Yes | Yes |
This is a genuine RL agent: it learns from experience and improves over time. The LLM isn't just an API call; it's integrated into the learning loop.