
From CVE Entries to Verifiable Exploits:
An Automated Multi-Agent Framework for Reproducing CVEs

Saad Ullah (Boston University, saadu@[Link]), Praneeth Balasubramanian (UC Santa Barbara, praneeth@[Link]), Wenbo Guo (UC Santa Barbara, henrygwb@[Link]), Amanda Burnett (Arizona State University, aburne22@[Link]), Hammond Pearce (UNSW Sydney, [Link]@), Christopher Kruegel (UC Santa Barbara, chris@[Link]), Giovanni Vigna (UC Santa Barbara, vigna@[Link]), Gianluca Stringhini (Boston University, gian@[Link])

arXiv:2509.01835v1 [[Link]] 1 Sep 2025
Abstract

High-quality datasets of real-world vulnerabilities and their corresponding verifiable exploits are crucial resources in software security research. Yet such resources remain scarce, as their creation demands intensive manual effort and deep security expertise. In this paper, we present CVE-GENIE, an automated, large language model (LLM)-based multi-agent framework designed to reproduce real-world vulnerabilities, provided in Common Vulnerabilities and Exposures (CVE) format, to enable the creation of high-quality vulnerability datasets. Given a CVE entry as input, CVE-GENIE gathers the relevant resources of the CVE, automatically reconstructs the vulnerable environment, and (re)produces a verifiable exploit. Our systematic evaluation highlights the efficiency and robustness of CVE-GENIE's design: it successfully reproduces approximately 51% (428 of 841) of the CVEs published in 2024-2025, complete with their verifiable exploits, at an average cost of $2.77 per CVE. Our pipeline offers a robust method to generate reproducible CVE benchmarks, valuable for diverse applications such as fuzzer evaluation, vulnerability patching, and assessing AI's security capabilities.

Figure 1: CVE-GENIE overview (input: a CVE ID; output: the vulnerable project setup, an exploit for the vulnerability, and a verifier for the exploit).

1 Introduction

Tens of thousands of software vulnerabilities are discovered every year, with more than 49,000 vulnerabilities tracked by the National Vulnerability Database (NVD) [42] in 2024 alone. Many of these vulnerabilities are cataloged using the CVE format, which provides a standardized identification system to help organizations address security flaws in both software and hardware. Still, existing vulnerability detection tools, such as those listed by OWASP [44], struggle to keep up with the increasing volume of code released each year. As a result, many vulnerabilities remain undiscovered in codebases for years (e.g., up to 2 years in Chromium and 7 years in OpenSSL [3]), posing significant security risks.

A key requirement for developing effective automated vulnerability detection techniques is the availability of high-quality datasets with detailed examples and ground truth. Studies have shown that low-quality data can cause machine learning (ML) based vulnerability detection techniques to learn spurious correlations [4, 49]. Although public repositories, such as the NVD, include hundreds of thousands of CVEs, they often lack critical details, such as the vulnerable code, setup instructions for the vulnerable environment, working exploits, and validation steps, which are all required for the effective replication of a vulnerability and for building suitable ground-truth datasets for research [38]. This is because the main purpose of vulnerability advisories is to alert system administrators about software that needs to be updated, and concrete exploit code is often kept confidential for ethical reasons. Unfortunately, generating this missing information from CVEs alone requires both substantial manual effort and extensive security expertise. For instance, Mu et al. [38] spent over 3,600 hours of manual effort with 43 security experts to reproduce publicly reported CVEs, but managed to reproduce only 368 memory-related vulnerabilities in Linux.
To address these challenges, the security community has adopted several approaches: manual annotation [21, 65], inserting bugs into real-world code [36, 41], and heuristic-based mining [7, 10, 13, 15, 40]. While these methods have advanced the field, they come with limitations, such as limited support for programming languages and Common Weakness Enumeration (CWE) categories, mislabeled vulnerabilities [50], and data becoming stale. For instance, the DARPA Cyber Grand Challenge dataset from 2016 [11] quickly became outdated due to its small size and low-complexity bugs [20]. Automated data collection, on the other hand, helps address some of these limitations but often tends to be limited in scope. For instance, ARVO [34] automatically curates reproducible builds with triggering inputs from OSS-Fuzz reports, but its scope is limited to memory corruption bugs in C/C++. This highlights a clear need for a dataset curation method that enables fully reproducible vulnerability instances across a wide range of programming languages and vulnerability types.

Recently, LLMs have demonstrated remarkable capabilities in solving complex software engineering (SWE) and security tasks, ranging from automated bug detection [16, 39] and repair [24, 46, 57] to writing fuzzer test harnesses [12]. Moreover, the shift to agentic workflows [8, 32, 62, 63], in which LLM-powered agents are provided access to real-world tools [1], has further pushed the capabilities of general-purpose LLMs and led to significant performance gains on extremely difficult real-world SWE benchmarks, such as SWE-bench [25]. This raises the question of whether LLM agents could also be useful in automatically setting up and reproducing existing CVEs, allowing us to build reliable, large, and realistic benchmarks for the security community.

In this paper, we present CVE-GENIE, a fully automated, LLM-driven multi-agent framework for end-to-end CVE reproduction, which embodies what we define as the key properties of an ideal CVE reproduction system, captured by the acronym EAGER. Specifically, an ideal system: Generates a working Exploit/proof-of-concept (PoC); Builds Assessors/verifiers for the exploit; Generalizes across CWEs, languages, and projects; Is End-to-end automated; and Rebuilds vulnerable environments. This paper makes the following contributions:
1. We develop CVE-GENIE, a four-stage, end-to-end automated pipeline (Processor, Builder, Exploiter, and CTF Verifier) that extracts CVE data (i.e., source code, security advisories, patches, etc.), rebuilds the environment, generates exploits and verifiers to assess the exploits, and validates the final results.
2. We systematically select the optimal configuration of LLMs for CVE-GENIE, for strong performance over diverse CVEs encompassing a wide range of vulnerabilities, programming languages, and projects, including scenarios with incomplete CVE information.
3. We ran CVE-GENIE on 841 CVEs (published between June 2024 and May 2025), successfully reproducing 428 CVEs across 267 projects, 141 CWEs, and 22 programming languages, highlighting its generalizability.
4. We systematically evaluate CVE-GENIE's design and demonstrate that each component is essential for optimal performance: even advanced LLMs like o3 fail to reproduce even one CVE without our agentic framework.
5. We open-source our framework, source code, and datasets of reproduced CVEs, as well as logs of all agent conversations, providing an ongoing valuable contribution to the community for downstream research tasks (e.g., tools for automated vulnerability detection).

Table 1: Overview of vulnerability benchmark datasets. Real: synthetic ●, real-world ❍, mixed ◗. CVE-GENIE results reflect CVEs from Jun 2024-May 2025.

Name | Source | Real | # Projects | # CWEs | # Lang | Automated | Reproducible | PoC | Verifier
SARD [41] | Manual | ● | ~120 | ~150 | 6 | ✗ | ✓ | ✓ | ✗
SVEN [21] | Manual | ❍ | n/a | 9 | 3 | ✗ | ✗ | ✗ | ✗
Devign [65] | Manual | ❍ | 4 | n/a | 2 | ✗ | ✗ | ✗ | ✗
SecLLMHolmes [54] | Manual | ◗ | 4 | 8 | 3 | ✗ | ✗ | ✗ | ✗
VulChecker [36] | Manual | ● | n/a | 5 | 2 | ✓ | ✗ | ✗ | ✗
Draper [52] | SA tool | ◗ | n/a | 9 | 2 | ✓ | ✗ | ✗ | ✗
D2A [64] | SA tool | ❍ | 6 | 7 | 2 | ✓ | ✗ | ✗ | ✗
VUDENC [56] | Sec. issue | ❍ | 812 | 7 | 1 | ✓ | ✗ | ✗ | ✗
DiverseVul [10] | Sec. issue | ❍ | 797 | 150 | 12 | ✓ | ✗ | ✗ | ✗
PrimeVul [13] | Patch diff | ❍ | 755 | 140 | 2 | ✓ | ✗ | ✗ | ✗
BigVul [15] | Patch diff | ❍ | 348 | 91 | 2 | ✓ | ✗ | ✗ | ✗
CrossVul [40] | Patch diff | ❍ | 1,675 | 168 | 40 | ✓ | ✗ | ✗ | ✗
CVEfixes [7] | Patch diff | ❍ | 1,754 | 180 | 30 | ✓ | ✗ | ✗ | ✗
ARVO [34] | OSS-Fuzz | ❍ | 273 | 4 | 2 | ✓ | ✓ | ✓ | ✗
CVE-Bench [66] | Public CVEs | ❍ | 36 | 8 | 6 | ✗ | ✓ | ✓ | ✓
Mu et al. [38] | Public CVEs | ❍ | 1 | 4 | 1 | ✗ | ✓ | ✓ | ✓
CVE-GENIE | Public CVEs | ❍ | 267 | 141 | 22 | ✓ | ✓ | ✓ | ✓

2 Background and Related Work

Existing Benchmarks and Limitations. To evaluate various vulnerability detection techniques (rule-based methods [44, 58], ML approaches [29-31, 36], and LLM-based systems [17-19, 47]), the community needs standardized benchmarks. Prior work has explored four main approaches for addressing these challenges. (1) Manual efforts involve analyzing [21] or reproducing [66] vulnerabilities in real-world code, or crafting synthetic examples [54], but the resulting datasets are small, lack ecological validity, and often suffer labeling errors due to human mistakes (e.g., 50% of the data in Devign [65] have wrong labels [50]). (2) Automated labeling using static analysis (SA) tools, such as bandit [6] or infer [23], reduces manual effort but introduces many false positives [48, 64]. (3) Mining real-world data, such as GitHub issues or CVE patch commits, assumes that all removed or modified code is vulnerable; however, this assumption often leads to mislabeled samples, as Risse et al. [50] highlighted for DiverseVul [10] and BigVul [15]. (4) Relying on proven sources, such as OSS-Fuzz, offers verified exploits and sanitizer feedback; however, these are restricted to certain bug types, primarily memory corruption in C/C++, and do not generalize across languages or vulnerability classes [34].

Besides data quality, existing datasets also lack the artifacts necessary for dynamic evaluation beyond static labels [4, 49, 54], including dynamic execution environments and reproducible exploits, which are difficult to obtain [38]. Table 1 provides an overview of the current state of vulnerability benchmarks.
Call for EAGER-Style CVE Reproduction. A CVE ID is a unique identifier for a publicly disclosed security vulnerability in real-world software. Each CVE ID corresponds to a CVE entry that includes key details for that CVE, such as a vulnerability description, source code, affected versions, security advisories, CWE classifications, and patches, enabling security professionals to assess and mitigate risks. Creating a dataset from the standard CVE database is a promising way towards addressing the aforementioned challenges (see Appendix B). Therefore, we introduce EAGER, a set of crucial criteria for ideal CVE reproduction:
• Exploit Generation – (re)creation of an exploit or PoC that reliably triggers the vulnerability in the CVE.
• Assessment – inclusion of a "verifier" or "sanitizer" capable of assessing whether the generated exploit or PoC successfully triggers the vulnerability.
• Generalization – reproduction of CVEs across diverse CWEs, programming languages, and software projects.
• End-to-end Automation – execution of all stages of CVE reproduction in a fully automated manner.
• Rebuild project – reconstruction of the original vulnerable environment to facilitate exploit/PoC execution.

Table 2: Comparing potential vulnerability reproduction methods and their EAGER attributes.

Method | Exploit | Assess | General | E2E | Rebuild
Manual [32, 66] | ✓ | ✓ | ✗ | ✗ | ✓
Fuzzers | ✓ | ✗ | ✗ | ✗ | ✗
Metasploit [35] | ✓ | ✗ | ✗ | ✗ | ✗
ARVO [34] | ✓ | ✗ | ✗ | ✓ | ✓
CVE-GENIE | ✓ | ✓ | ✓ | ✓ | ✓

However, none of the existing benchmarks satisfy all the criteria (Table 2). For example, manual effort for producing EAGER datasets does not scale and is not generalizable [38, 66]. ARVO [34] focuses solely on memory-related bugs from OSS-Fuzz C/C++ projects, limiting its scope. In contrast, CVE-GENIE is the only framework satisfying all criteria.

Large Language Models and Agents. LLMs are transformer-based neural network models with billions of parameters. They have shown remarkable performance in various application domains, such as question answering [43], mathematical reasoning [2], and code generation [5, 9]. Recent work also shows their utility in security tasks, such as bug detection and repair [24, 46, 57] and writing fuzzer test harnesses [12]. LLM agents are systems that combine LLMs with tools. They can finish complex tasks via a sequence of actions, including planning, generation, and tool executions [22, 26, 27, 45, 53]. An agentic system can have multiple agents, each with a sub-task, such as reasoning, critiquing, generating, perceiving, remembering, or teaching, as well as its own workflow and tool set. In security, researchers have started to construct LLM agents to find and repair vulnerabilities. There are multi-agent patching systems, where different sub-agents are responsible for fault localization, patch generation, and validation (e.g., PatchPilot [28] and PatchAgent [61]). Through the recent DARPA AIxCC competition, researchers constructed agentic systems with end-to-end capabilities for vulnerability detection and patching [18]. These recent successes provide the argument for using LLM agents in vulnerability analysis, software development, and external tool interaction, key capabilities for building an EAGER-style automated end-to-end CVE reproduction framework.

3 CVE-GENIE

Like human analysts [38], CVE-GENIE utilizes four core modules to process and reproduce CVEs: (1) the Processor (Section 3.1), which retrieves source code and constructs a structured knowledge base; (2) the Builder (Section 3.2), which reconstructs the vulnerable environment; (3) the Exploiter (Section 3.3), which reproduces the exploit; and (4) the CTF Verifier (Section 3.4), which generates a verifier for the exploit. We also design and integrate a set of tools that enable the LLM agents in these modules to perform their respective tasks (see Appendix A). To illustrate the framework in practice, we also present a full end-to-end reproduction of CVE-2024-4340 by CVE-GENIE (Figure 2).

To bridge the gap identified in Section 2, namely the lack of any system capable of true EAGER-style CVE reproduction, CVE-GENIE's design is guided by the following principles.
1. Modular Task Decomposition: Since LLMs struggle with long, complex contexts [54] and tasks [25], CVE-GENIE decomposes reproduction into focused modules handled by specialized agents with systematically engineered prompts (all prompts will be provided in the open-source artifact).
2. Robustness to Incomplete Data: Given that missing CVE fields can reduce reproduction success by up to 44% [38], agents in CVE-GENIE can fall back on patch analysis and direct source-code inspection when advisories or PoCs are unavailable.
3. Reliability through Self-Critique: To counter LLMs' reasoning limits [54], each module uses paired developer and critic agents in a ReAct-style loop [60], enabling iterative refinement and internal verification (a minimal illustrative sketch of such a loop is shown at the end of this overview).

Together, these principles ensure CVE-GENIE can deliver the first practical, automated, and generalizable framework for EAGER-style CVE reproduction. The following subsections detail the workings of each CVE-GENIE component, using CVE-2024-4340 as a running example to illustrate the end-to-end reproduction shown in Figure 2.
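For illustration, the paired developer/critic pattern can be sketched as follows. This is a minimal, self-contained sketch, not CVE-GENIE's actual code: the callables stand in for LLM-backed agents, and the single-round feedback budget mirrors the configuration described later in Section 4.

"""Illustrative sketch of the paired developer/critic (ReAct-style) loop used by each
module: a developer agent acts and produces logs, a critic agent reviews the logs and
either accepts the result or returns feedback for a bounded number of retries."""

from typing import Callable, Tuple

def develop_with_critique(
    develop: Callable[[str], Tuple[str, str]],         # task -> (artifact, logs)
    critique: Callable[[str, str], Tuple[bool, str]],   # (task, logs) -> (accepted, feedback)
    task: str,
    max_feedback_rounds: int = 1,
) -> Tuple[str, bool]:
    artifact, accepted, feedback = "", False, ""
    for _ in range(max_feedback_rounds + 1):
        prompt = task if not feedback else f"{task}\nCritic feedback: {feedback}"
        artifact, logs = develop(prompt)
        accepted, feedback = critique(task, logs)
        if accepted:
            break
    return artifact, accepted

if __name__ == "__main__":
    # Toy stand-ins for the LLM-backed agents, only to make the sketch runnable.
    develop = lambda t: ("pip install sqlparse==0.4.4", "installed sqlparse 0.4.4")
    critique = lambda t, logs: ("0.4.4" in logs, "install the pinned vulnerable version")
    print(develop_with_critique(develop, critique, "set up sqlparse v0.4.4"))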
3.1 Processor

This module comprises two sub-modules: the Data Processor, which locates the vulnerable project's source code and gathers the relevant raw resources, and the LLM-based Knowledge Builder, which transforms these resources into a structured knowledge base. Below, we provide the detailed functionalities of these sub-modules.

Figure 2: CVE-GENIE architecture and an end-to-end example of the reproduction workflow for CVE-2024-4340, i.e., a denial of service due to a RecursionError in sqlparse < v0.5.0.
3.1.1 Data Processor (1)

Goal/Rationale. For the given CVE ID, this sub-module identifies and extracts the vulnerable version of the software along with all relevant CVE details, as follows:

Source Code Extraction. The Data Processor begins by locating the source code for the vulnerable project linked to the CVE, an essential requirement for CVE-GENIE to reproduce the vulnerability. It searches the GitHub URLs referenced in the "cvelist" (an automated pilot program for CVE submission through GitHub) to identify the project repository. For closed-source software, the Data Processor gives the user the option to supply the repository URL. Once the source code is located, it identifies the affected software configurations: specific versions, platforms, or settings vulnerable to the CVE. For instance, CVE-2024-4340 affects all versions of "sqlparse" prior to v0.5.0. Based on this, the latest affected (vulnerable) version (v0.4.4) is retrieved using the GitHub API, and its source code is downloaded (as shown in Figure 2).

Vulnerability Information Extraction. After the vulnerable version of the project source code is downloaded, the Data Processor extracts the following four key pieces of information from "cvelist", if available (see Figure 2):
1. CVE description: A high-level summary of the vulnerability in the target project.
2. CWE data: The CWE information, a categorization system for hardware and software weaknesses that associates the given CVE with a specific type of vulnerability, e.g., CWE-674 (Uncontrolled Recursion).
3. Patch commits: The git-diff of the code changes in the target source code that fix the vulnerability. These commits help LLMs localize the issue, understand its root cause [13, 15, 21], and generate effective exploits.
4. Security advisories and PoC: We filter URLs referenced in the CVE entry by keywords (e.g., "security", "advisory", "advisories", "bounty", "bounties"), and scrape their contents.

Output. The vulnerable version of a project's source code alongside its directory structure and the CVE raw data.
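For illustration, the reference filtering and version selection described above can be sketched as follows. The record layout, helper names, and example URLs are simplified assumptions for illustration, not the exact cvelist schema or CVE-GENIE's implementation.

"""Minimal sketch of the Data Processor's advisory-URL filtering and version selection."""

ADVISORY_KEYWORDS = ("security", "advisory", "advisories", "bounty", "bounties")

def advisory_urls(reference_urls):
    """Keep only reference URLs that look like security advisories or bounty reports."""
    return [u for u in reference_urls if any(k in u.lower() for k in ADVISORY_KEYWORDS)]

def latest_affected_version(affected_versions):
    """Pick the newest version still affected by the CVE (e.g., '0.4.4' for sqlparse < 0.5.0)."""
    def key(v):  # naive numeric sort; a real implementation would use proper version parsing
        return tuple(int(p) for p in v.split(".") if p.isdigit())
    return max(affected_versions, key=key)

if __name__ == "__main__":
    refs = [
        "https://github.com/andialbrecht/sqlparse",                      # project repository
        "https://github.com/andialbrecht/sqlparse/security/advisories/GHSA-xxxx-xxxx-xxxx",  # placeholder advisory URL
    ]
    print(advisory_urls(refs))                                   # keeps only the advisory URL
    print(latest_affected_version(["0.4.2", "0.4.3", "0.4.4"]))  # -> 0.4.4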
3.1.2 Knowledge Builder (2)

Goal/Rationale. This agent is instructed to analyze, distill, and organize the raw information extracted by the Data Processor into a structured knowledge base, while retaining the essential details required to reproduce the given CVE. This includes extracting PoC details or step-by-step instructions from security advisories, which are invaluable for accurately recreating the vulnerability. For instance, CVE-2024-4340 has a PoC provided in its security advisory, and the patch commit helps in understanding the root cause by highlighting where the vulnerability was mitigated. As shown in Figure 2, the Knowledge Builder included both of these details in the knowledge base. In case an advisory does not include an exploit, the LLM agent is tasked with generating an overview of what an exploit might entail. Additionally, capturing the patch details helps to localize the root cause of the vulnerability [7, 13, 21] and aids in crafting an effective exploit. This knowledge base acts as the long-term memory for the agents, helping them to effectively reproduce CVEs. By including only essential information, the knowledge base avoids overloading the LLM's context window, ensuring efficient reproduction.

Output. A structured CVE knowledge base providing details of the CVE, including its associated CWEs, affected software configurations, root causes (if provided in the security advisory), and essential information from security advisories and patch commits necessary for exploitation.

3.2 Builder

Once the knowledge base is populated and the vulnerable version of the project's source code is downloaded, the next objective is to build the project in a way that allows the exploit to be executed. To achieve this, our pipeline employs two developer LLM agents: the Pre-Requisite Developer Agent, which analyzes and plans the environment setup, and the Setup Developer Agent, which executes this plan to configure the vulnerable environment; as well as one critic, the Setup Critic Agent, which analyzes the logs of the Setup Developer and evaluates the setup of the project, allowing the system to correct its mistakes. Below, we discuss the functionality of these agents.

3.2.1 Pre-Requisite Developer Agent (3)

Tools. For this step, we only provide the LLM agent with access to read-only tools, i.e., execute_ls_command and get_file_by_name, which the agent can use to traverse the code base and read files, because we do not want the agent to execute any commands or write to files in this phase.

Goal/Rationale. We introduce this agent as an initial exploration and planning step before actually setting up the vulnerable project. This is done for three key reasons.
(1) Our preliminary experiments reveal that many projects contain inaccurate and incomplete information in their README files (e.g., the vulnerable version v1.2.7 of lunary for CVE-2024-5129 includes incorrect paths to .env files that are crucial for the setup). Without the Pre-Requisite agent, these details were consistently overlooked by LLMs, causing setup failures. Identifying and correcting such issues early reduces unsuccessful attempts, conserves the LLM's context window, and improves the likelihood of a successful environment setup.
(2) During environment setup, the agent's job is to explore and analyze the source code to identify critical components and to build the project by executing a series of commands. For large projects, having to do both tasks with one agent can exceed the context limit and reduce efficiency (as demonstrated in the ablation study in Section 4.5). To address this, we delegate the exploration of the code base to a dedicated agent, the Pre-Requisite Developer, allowing the setup process to focus more on building the project.
(3) The Pre-Requisite Developer also defines the "expected state," i.e., when the project is fully set up and ready for exploitation. This state guides the Setup Developer Agent in verifying the correct configuration of the vulnerable environment. For example, in CVE-2024-4340 (Figure 2), the Pre-Requisite Developer explored the root directory and key files such as the README and [Link] to understand the project context, then specified the expected state as having sqlparse version v0.4.4 installed for the exploit to work correctly.

Output. (1) A detailed overview of the project, (2) important files to pay attention to during the setup, (3) required services and their configurations, and (4) the expected state of the project upon successful setup to facilitate verification.

3.2.2 Setup Developer Agent (4)

Tools. For this step, we allow access to the following tools: get_file, write_to_file, execute_ls_command, execute_linux_command, and set_environment_variable.

Goal/Rationale. The Setup Developer begins in the directory containing the vulnerable version's source code and focuses on configuring the project based on the instructions from the Pre-Requisite Developer, exploring the codebase when necessary. For CVEs involving third-party libraries, our initial experiments showed that setup is greatly simplified by using package managers like pip or npm to install the specific vulnerable versions instead of building from source. For example, in the case of CVE-2024-4340 affecting sqlparse, the Pre-Requisite Developer instructed the installation of version 0.4.4, and the Setup Developer executed pip install sqlparse==0.4.4 directly (command 4a, Figure 2), resulting in a smooth setup. If the package manager fails, the Setup Developer falls back to building from source. Typically, the Pre-Requisite Developer also recommends running a basic PoC to confirm readiness before handing off to the next agent (e.g., commands 4e and 4f in Figure 2); a minimal sketch of this install-then-verify strategy is shown after this subsection.

Output. This agent returns a final decision on whether the setup is successful or not. If successful, the output also includes instructions on how another agent can access the running project.
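The package-manager-first setup strategy can be sketched as follows, using CVE-2024-4340 as the example. The function names and the exact verification command are illustrative assumptions, not CVE-GENIE's actual tool implementations.

"""Illustrative sketch: install the pinned vulnerable version via the package manager,
verify the expected state, and fall back to a source build if the install fails."""

import subprocess
import sys

def pip_install(package: str, version: str) -> bool:
    """Try to install the exact vulnerable version (e.g., sqlparse==0.4.4)."""
    proc = subprocess.run(
        [sys.executable, "-m", "pip", "install", f"{package}=={version}"],
        capture_output=True, text=True,
    )
    return proc.returncode == 0

def expected_state_reached(package: str, version: str) -> bool:
    """Check the 'expected state' defined by the Pre-Requisite Developer:
    the pinned vulnerable version must actually be importable."""
    proc = subprocess.run(
        [sys.executable, "-c", f"import {package}; print({package}.__version__)"],
        capture_output=True, text=True,
    )
    return proc.returncode == 0 and proc.stdout.strip() == version

if __name__ == "__main__":
    if pip_install("sqlparse", "0.4.4") and expected_state_reached("sqlparse", "0.4.4"):
        print("setup successful: sqlparse 0.4.4 installed")
    else:
        print("package-manager setup failed; fall back to building from source")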
3.2.3 Setup Critic Agent (5)

Goal/Rationale. The Setup Critic evaluates whether the project setup performed by the Setup Developer is correct and complete, ensuring that the environment is properly configured for the vulnerability described in the CVE knowledge base to be exploited. Additionally, this agent plays a critical role in detecting deceptive or shortcut behaviors by the LLM, such as fabricating a mock project instead of performing a genuine setup, or deliberately injecting vulnerabilities into the codebase to simplify exploitation. These undesirable patterns are discussed in detail in Section 4.1.1, which informs the instruction design for the Setup Critic. Based on this, the critic is equipped to identify both functional and security-related flaws in the setup process.

Output. A comprehensive analysis of the setup logs, a binary decision on whether the setup is valid and complete, and actionable feedback for correcting issues or improving the setup if necessary.

3.3 Exploiter

After successfully configuring the vulnerable system, the next step in our pipeline is to generate an exploit for the given vulnerability. CVE-GENIE accomplishes this using two LLM agents: the Exploit Developer, which develops the PoC for an exploit in the vulnerable environment, and the Exploit Critic, which analyzes the logs of the Exploit Developer and evaluates the exploit for the given vulnerability. Below, we discuss the detailed functionality of these two agents.

3.3.1 Exploit Developer Agent (6)

Tools. This agent uses the same tools as the Setup Developer.

Goal/Rationale. The Exploit Developer is responsible for crafting and demonstrating exploits for the vulnerabilities listed in the CVE knowledge base on the pre-configured vulnerable systems. If a PoC is provided, the agent replicates and verifies it by executing the script with the appropriate triggering input before delivering the final, working PoC. For example, in CVE-2024-4340 (Figure 2), the PoC was available, so the agent validated the setup (6b), demonstrated the exploit (6c) by triggering a RecursionError in sqlparse/sql.py (see Figure 3b), and submitted the verified PoC script (see Figure 3a). If no PoC or exploitation steps are available, the Exploit Developer iteratively analyzes the codebase using the available tools to develop a functional exploit. Upon success, it produces a Python PoC script that accepts the crashing input via the command line, includes comments on the expected input format, and provides an example input. This ensures that the CTF Verifier can reliably validate the exploit without repeating the analysis.

Output. The agent returns its final decision on whether it considers the exploit successful or not. If successful, it provides an overview of the exploit and a Python PoC script.

"""
PoC - CVE-2024-4340 (sqlparse < 0.5.0)

EXAMPLE THAT CRASHES v0.4.4
python3 exploit.py 10000
"""
import sys
import sqlparse

def main() -> None:
    if len(sys.argv) != 2:
        print("Usage: python3 exploit.py <depth>")
        sys.exit(1)
    arg = sys.argv[1]
    try:
        depth = int(arg)
        payload = "[" * depth + "]" * depth
    except ValueError:
        payload = arg
    sqlparse.parse(payload)

if __name__ == "__main__":
    main()

(a) Exploit script for CVE-2024-4340. It constructs a deeply nested list payload and submits it to sqlparse.parse(), triggering uncontrolled recursion. The included comment documents a crashing input (depth 10,000) that reliably triggers the vulnerability.

Traceback (most recent call last):
  File "exploit.py", line 20, in <module>
    sqlparse.parse(payload)
  ....
  File "sqlparse/sql.py", line 214, in flatten
    yield from token.flatten()
  [Previous line repeated 983 more times]
RecursionError: maximum recursion depth exceeded

(b) Execution traceback produced by the exploit, showing a RecursionError after excessive recursive calls within sqlparse.

Figure 3: PoC for CVE-2024-4340 in sqlparse v0.4.4, showing the exploit script and the resulting crash.

3.3.2 Exploit Critic Agent (7)

Similar to the role of the Setup Critic, the Exploit Critic is tasked with evaluating the behavior of the Exploit Developer by analyzing its execution logs. Its primary goal is to determine whether the generated exploit is both valid and of high quality. This includes verifying that the exploit could plausibly succeed in a real-world setting; that it does not rely on manipulating or altering the original setup in illegitimate ways; and that it avoids fabricated shortcuts, oversimplified logic, or artificial assumptions that would not hold in practice.

To perform this assessment, the Exploit Critic leverages a predefined set of behavioral patterns and common failure cases, as detailed in Section 4.1.2. By comparing the exploit's behavior against these known issues, the critic can detect actions that are unfaithful to the original intent or technically incorrect. When flaws are found, the Exploit Critic provides structured, actionable feedback designed to help improve the robustness and realism of the exploit in subsequent iterations.
3.4 CTF Verifier

After generating a working exploit for a given CVE, the final step is to create a verifier script that confirms whether the exploit was successful. This approach is inspired by security Capture-The-Flag (CTF) challenges, where exploiting a vulnerability reveals a hidden "flag", i.e., a string of random characters or numbers. The CTF Verifier uses three sub-modules to develop and validate such verifiers. The Verifier Developer agent module generates a candidate verifier, the Flag Checker module ensures that the exploit correctly retrieves a flag, and the Verifier Critic agent assesses the quality and effectiveness of the verifier for the given exploit.

3.4.1 Verifier Developer Agent (8)

Tools. This agent has the same read-only tools as the Pre-Requisite Developer agent, as in this step we do not want the model to execute anything; all validations are performed by the Flag Checker and the Verifier Critic.

Goal/Rationale. Following the CTF methodology, we prompt the Verifier Developer to create a general CTF-style Python verifier script for a given PoC script: if the verifier script runs the PoC and it successfully triggers the vulnerability, it returns the flag (3xploit66full). We prompt the Verifier Developer to format the verifier script in three structured steps:
1. Pre-Setup: Prepares the necessary inputs for the exploit, sets up the environment, and prepares a flag to return if the exploit correctly triggers the vulnerability.
2. Exploit Execution: Runs the provided PoC script to trigger the vulnerability. We make sure that the agent cannot modify the PoC script, to avoid contamination.
3. Post-Setup: Verifies the success of the exploit and, if successful, returns the flag.
This structured format ensures consistent and reliable validation across a broad spectrum of vulnerability and exploit types.

Output. A Python script that acts as a custom verifier for the exploit/PoC of the given CVE.

import os, subprocess, sys
FLAG = "3xploit66full"

def pre_setup():
    # Configuring PYTHONPATH
    pass

def run_exploit():
    cmd = [sys.executable, "exploit.py", "10000"]
    return subprocess.run(cmd, timeout=20, capture_output=True, text=True)

def post_exploit(proc):
    output = (proc.stdout or "") + (proc.stderr or "")
    return proc.returncode != 0 and "RecursionError" in output

if __name__ == "__main__":
    pre_setup()
    try: result = run_exploit()
    except subprocess.TimeoutExpired: sys.exit(1)
    if post_exploit(result):
        print(FLAG)

(a) Attempt 1 – Weak verifier script: pre_setup() lacks a check for the vulnerable sqlparse version. The exploit is run with a 20 s timeout, risking missed RecursionErrors if they occur after the limit. The verification relies solely on detecting the string RecursionError in the output, which can be trivially spoofed.

import os, sys, importlib, traceback
import sqlparse
FLAG = "3xploit66full"

def pre_setup():
    assert sqlparse.__version__ < "0.5.0"

def run_exploit_inprocess():
    sys.argv = ["exploit.py", "10000"]
    try:
        importlib.import_module("exploit").main()
        return "no_exception", ""
    except RecursionError: return "recursion_error", traceback.format_exc()
    except: return "other_exception", traceback.format_exc()

def exploit_succeeded(status, tb):
    return status == "recursion_error" and "sqlparse/sql.py" in tb

if __name__ == "__main__":
    pre_setup()
    status, tb = run_exploit_inprocess()
    if exploit_succeeded(status, tb):
        print(FLAG)

(b) Attempt 2, with the Verifier Critic's feedback – Improved verifier: validates the project version (sqlparse < 0.5.0), runs the exploit in-process to avoid timeout issues, and explicitly checks for a genuine RecursionError with a traceback originating from sqlparse/sql.py, preventing spoofed or misleading outputs.

Figure 4: Verifier scripts for the exploit of Figure 3 for CVE-2024-4340, corresponding to the run in Figure 2, illustrating the progression from a weak to a robust verifier based on the Verifier Critic's feedback.

3.4.2 Flag Checker (9)

Once the verifier script is generated for the given exploit, before passing it to the critic agent, the Flag Checker executes the verifier script to ensure that it runs with the exploit script without any errors and produces the expected output, i.e., the correct flag. For instance, Figure 2 shows the correct flag 3xploit66full being released when executing the verifier scripts generated for CVE-2024-4340. If a script fails due to a runtime error, or the flag is missing from the output, the system assumes that the verifier is incorrect. Feedback is then sent to the Verifier Developer to revise the script. This process is repeated iteratively, with a maximum of five attempts allowed to achieve functional correctness of the verifier script.
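The Flag Checker's retry loop can be sketched as follows. The subprocess invocation, timeout value, and feedback plumbing are illustrative assumptions rather than CVE-GENIE's actual implementation.

"""Illustrative sketch of the Flag Checker: run the candidate verifier, accept it only
if it prints the expected flag, and otherwise request a revised script (up to 5 tries)."""

import subprocess
import sys

FLAG = "3xploit66full"
MAX_ATTEMPTS = 5

def flag_check(verifier_path: str) -> bool:
    """Return True if the verifier runs cleanly and releases the expected flag."""
    try:
        proc = subprocess.run([sys.executable, verifier_path],
                              capture_output=True, text=True, timeout=300)
    except subprocess.TimeoutExpired:
        return False
    return proc.returncode == 0 and FLAG in proc.stdout

def check_with_retries(request_revision, verifier_path: str) -> bool:
    """request_revision stands in for the feedback sent back to the Verifier Developer."""
    for attempt in range(1, MAX_ATTEMPTS + 1):
        if flag_check(verifier_path):
            return True
        verifier_path = request_revision(f"attempt {attempt}: flag missing or runtime error")
    return False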
3.4.3 Verifier Critic Agent (10)

Once the verifier and exploit scripts pass functional validation via the Flag Checker, they are further examined by the Verifier Critic, which performs a similar evaluative role to the other critic agents, focusing specifically on the response produced by the Verifier Developer. It examines whether the verification process was properly carried out, without assumptions, omissions, or unreliable heuristics. Using the behavioral issues identified in Section 4.1.3, the agent flags incomplete or misleading verification strategies and provides feedback on how the verification could be made more robust or accurate. For instance, Figure 4 shows a detailed example of verifiers generated for CVE-2024-4340 with weak and bypassable verification criteria, which were corrected using the Verifier Critic's feedback.

3.4.4 Store Reproduced CVE (11)

If the given CVE passes the Verifier Critic check, CVE-GENIE marks it as reproduced and stores the VM snapshot of the vulnerable environment as well as the exploit and verifier scripts. Moreover, CVE-GENIE also stores metadata such as the LLM cost, the time spent, and all agent conversations.

4 Experimental Methodology

We begin by identifying the most suitable LLMs for integration into CVE-GENIE by evaluating ten leading models (Table 3) across CVE-GENIE's three core modules, Builder, Exploiter, and CTF Verifier, using a diverse set of 60 post-knowledge-cutoff CVEs (Section 4.1). Based on this evaluation, we select the optimal LLM configuration and conduct a four-part evaluation of CVE-GENIE: (1) a baseline study assessing consistency and efficiency across various vulnerabilities, languages, and projects (Section 4.2); (2) robustness testing under incomplete CVE information (Section 4.3); (3) a large-scale evaluation on 841 CVEs (Section 4.4); and (4) an ablation study to assess the impact of key components of CVE-GENIE on its scalability and effectiveness (Section 4.5). These evaluations are guided by the following research questions:
• RQ1: How does the performance of CVE-GENIE vary with the types of vulnerabilities, languages, and projects?
• RQ2: Can CVE-GENIE effectively reproduce CVEs with limited information, and which CVE report components are most crucial for reproduction?
• RQ3: Can CVE-GENIE reproduce a broad range of CVEs across diverse CWEs, languages, and projects?
• RQ4: Is the complex architecture of CVE-GENIE necessary, or can standalone LLMs achieve comparable results?

Table 3: Studied LLMs.

Model | # Params | Max Input | Max Output | Reasoning | Open Src | Cutoff
o3 | n/a | 200k | 100k | ✓ | ✗ | 05/2024
o4-mini | n/a | 200k | 100k | ✓ | ✗ | 05/2024
Claude 3.7 Sonnet | n/a | 200k | 64k | ✓ | ✗ | 11/2024
Claude 3.5 Sonnet | 175B | 200k | 8k | ✗ | ✗ | 04/2024
Gemini 2.5 Pro | n/a | 1M | 65k | ✓ | ✗ | 01/2025
Gemini 2.5 Flash | n/a | 1M | 65k | ✓ | ✗ | 01/2025
Llama 4 Maverick | 400B | 1M | n/a | ✗ | ✓ | 08/2024
Qwen 3 | 235B | 128k | 64k | ✓ | ✓ | 06/2024
DeepSeek V3 | 671B | 64k | 8k | ✗ | ✓ | 07/2024
DeepSeek R1 | 671B | 128k | 8k | ✓ | ✓ | 07/2024

4.1 Selecting Optimal LLMs for CVE-GENIE

Dataset (DS). To evaluate the capabilities of LLMs in reproducing vulnerabilities using CVE-GENIE, we construct a small-scale, diverse dataset (DS) consisting of 60 CVEs. These CVEs were published after the latest knowledge-cutoff date of the LLMs under study (January 2025), as listed in Table 3. To ensure that DS is both representative and diverse, we select 40 CVEs from the top 25 most dangerous CWE [37] categories, and the remaining 20 CVEs are randomly sampled from the other CWE types. The resulting dataset, DS, comprises 60 CVEs drawn from 44 different CWE categories, spanning 18 programming languages, and involving 56 distinct software projects.

Methodology. CVE-GENIE is composed of four modules. The Processor's LLM-based component, the Knowledge Builder, only performs a trivial summarization task, so we assign a lightweight LLM (o4-mini) to it. In contrast, the remaining modules, i.e., the Builder, Exploiter, and CTF Verifier, require strong security-vulnerability reasoning and SWE capabilities. Thus, for each of these modules, we systematically select the best-performing LLM for the "developer" and "critic" tasks using the following 5-step procedure:
1. Candidate Selection: For each module, we identify subsets of LLMs from Table 3 as potential developers and critics, based on prior evidence of their capabilities, as well as time and cost constraints.
2. Developer Evaluation: Each candidate developer LLM is run for the given module, and its outputs are manually scored by the authors. The highest-scoring LLM is chosen as the developer.
3. Critic Prompt Design: We analyze common mistakes made by developers and use these insights to design a task-specific critic prompt for the module, enabling the critic to effectively detect those mistakes.
4. Critic Evaluation: Each candidate critic LLM is evaluated on recall or true positive rate (TPR) and specificity or true negative rate (TNR). When TPR is equal, we prefer the model with the higher TNR, and vice versa. The critic with the best trade-off is selected.
5. Feedback Loop Assessment: We rerun the chosen best developer-critic pair for the given module, and evaluate the correctness of the critic's feedback and the developer's capability to correctly address the critic's feedback.

4.1.1 Builder

Candidate Selection. Setting up a project from scratch is the most time-consuming and complex part of CVE-GENIE, as security advisories or CVEs do not provide instructions for setting up the vulnerable project. To address this, we use two developer agents: one for exploring the project and another for setting it up (see Section 3.2). This process requires extensive tool calls and command executions, making it both slow and costly with reasoning-heavy models like o3 or claude-3.7-sonnet, which can cost $10-20 per CVE. Therefore, we select fast, cost-effective LLMs with sufficient context capacity, o4-mini, claude-3.5-sonnet, gemini-2.5-flash, llama-4-maverick, deepseek-v3, and qwen3, as developer-agent candidates. For critic agents, we opt for stronger reasoning models such as o3, o4-mini, claude-3.7-sonnet, gemini-2.5-pro, and deepseek-r1, as the critic's task only requires one call, and they therefore incur minimal overhead.

Developer Evaluation. We ran CVE-GENIE up to the Setup Developer stage 4 on the CVEs from DS and manually evaluated each developer LLM using the following procedure. (1) For each CVE, we first identified the official installation documentation for the software's vulnerable version and, based on it, established both the expected final build state and the sequence of commands required for installation. (2) We then reviewed the logs of the runs where the LLM reported a successful setup, comparing the executed commands against the documented installation process to assess alignment and completeness. (3) Finally, we validated the resulting environment: for libraries or binaries, we confirmed the vulnerable version by invoking the corresponding -version (or equivalent) command (e.g., for CVE-2025-1215 we verified that vim v9.1.1096 was installed), while for server-based software we performed a health check to ensure the service was running and accessible at the expected port (a minimal sketch of such a check is shown at the end of this subsection).

Figure 5: Builder optimal LLM evaluation: (a) Developer Agent, (b) Critic Agent.

As shown in Figure 5a, o4-mini emerged as the best developer LLM for the Builder, successfully setting up 32 out of 60 CVE projects while making an average of 20 tool calls per CVE. In contrast, claude-3.5-sonnet mostly gave up on the given setup task, and models like gemini-2.5-flash and the open-source LLMs were unreliable: Qwen 3 often described setup steps without executing them via tool calls, while gemini-2.5-flash issued tool calls with frequent syntax errors, eventually leading to a failed setup of the given project.

Critic Prompt Design. From analyzing the developer LLM setups, we observed three recurring mistakes: (1) when the setup fails, the LLM builds a simplified mock-up substitute project instead; (2) for pip/npm packages, it installs the latest version instead of the specified vulnerable one; and (3) when a server is required, it assumes commands like npm run dev succeed without verifying server health. We designed the critic prompt to detect these errors while reviewing the Setup Developer's logs.

Critic Evaluation. As shown in Figure 5b, the o3 model achieves the highest TPR and TNR when evaluating the Setup Developer's execution logs. Most of its false negatives involved setups requiring external hardware, which the Exploit Developer cannot exploit anyway. Some valid setups were also mistakenly rejected due to limited post-setup checks. Given its strong balance of recall and precision, we choose o3 as the critic model for the Builder.

Feedback Assessment. We evaluated CVE-GENIE up to the Setup Developer stage 4, using the best-performing developer (o4-mini) and critic (o3) models. Minor issues flagged by the critic (e.g., limited verification) were typically resolved by the developer in one iteration. However, fundamental issues (e.g., a mock-up version of the project) persisted even after five iterations. Therefore, we limit feedback to a single iteration to efficiently address minor issues without incurring excessive overhead.
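As a concrete illustration of the server health check mentioned in the Developer Evaluation above, a minimal sketch follows. The host/port defaults and the use of a plain TCP connect are assumptions for illustration, not the exact checks performed in our evaluation.

"""Illustrative sketch of a post-setup health check for server-based software:
the service must be reachable at the expected port before the setup is accepted."""

import socket

def service_is_up(host: str = "127.0.0.1", port: int = 8080, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    print("healthy" if service_is_up() else "service not reachable")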
4.1.2 Exploiter

Candidate Selection. Generating a working exploit from a given CVE in a pre-configured vulnerable environment is a crucial task for reproducing CVEs. However, the Exploiter requires less code-base exploration than the Builder, as advisories often provide PoCs, exploitation steps, or vulnerability details. Thus, we add expensive reasoning models like o3, claude-3.7-sonnet, and gemini-2.5-pro to the candidate list of developers. To mitigate the instability seen in open-source models (Section 4.1.1), we employ only closed-source LLMs (Table 3) as developers, while retaining the same critic LLMs as in the Builder.

Developer Evaluation. For the 32 successful setups from Section 4.1.1, we ran CVE-GENIE up to the Exploit Developer stage 6 and manually evaluated each developer LLM as follows: (1) we checked whether the LLM executed a PoC or command sequence that demonstrated the exploit, (2) we verified that the vulnerability was clearly triggered (e.g., uncontrolled recursion in CWE-674 resulting in a maximum recursion error), (3) we assessed fidelity to the CVE description, (4) we ensured no mock-up environments or fake exploits were used, and (5) we confirmed the exploit script followed the expected format (see Section 3.3.1).

Figure 6: Exploiter optimal LLM evaluation: (a) Developer Agent, (b) Critic Agent.

Among all models, o3 consistently performed best, successfully reproducing the highest number of exploits (13 out of 27) and demonstrating them end-to-end within the actual vulnerable environment. In contrast, o4-mini often produced incorrect or mock-up exploits and struggled with runtime errors. While the claude and gemini models showed strong capabilities in demonstrating exploits, they lacked versatility across different vulnerability types. Hence, we select o3 as the most effective Exploit Developer.

Critic Prompt Design. We examined the failed exploit attempts and categorized the recurring error patterns. These error types directly mirrored our manual verification criteria (e.g., failure to demonstrate the exploit, unclear triggering of the vulnerability, deviation from the CVE description, reliance on dummy/fake exploits, or incorrect script formatting). We then incorporated these categories into the critic agent's prompt, allowing it to automatically detect the same issues in the Exploit Developer's logs.

Critic Evaluation. As shown in Figure 6b, o4-mini achieved the best balance of recall and specificity. In contrast, o3 was overly strict, rejecting correct exploits and focusing too much on code formatting, while gemini-2.5-pro frequently overlooked incomplete exploit verification. Claude-3.7-sonnet was too lenient, accepting nearly all exploits. Based on this evaluation, we selected o4-mini as the most effective critic model for the Exploit Critic.

Feedback Assessment. We evaluated feedback effectiveness by running CVE-GENIE on the 32 correct setups with the top developer (o3) and critic (o4-mini). This yielded valid exploits for 16 CVEs, of which 3 were fixed based on the critic's feedback. As in Section 4.1.1, the developer mainly resolved minor issues (e.g., incomplete verification or missing log triggers), while more complex flaws (e.g., incorrect exploit verification) persisted. To balance usefulness and cost, we therefore restrict feedback to a single iteration, focusing on quick corrections of minor errors.

4.1.3 Verifier

Candidate Selection. Like exploit generation, creating a verifier also demands a deep understanding of security vulnerabilities to ensure the exploit is successfully triggered. Therefore, we use the same developer and critic candidates as in the Exploiter.

Developer Evaluation. For the 16 valid exploits from Section 4.1.2, we ran CVE-GENIE up to the Flag Checker stage 9 and manually evaluated each developer LLM on three criteria: (1) whether the verifier script followed the required format (see Section 3.4.1), (2) whether it executed the provided exploit script without modification, and (3) whether its verification logic was precise and reliable (see Figure 4 for a detailed example).

Figure 7: Verifier optimal LLM evaluation: (a) Developer Agent, (b) Critic Agent.

As shown in Figure 7a, the o3 model generated 13 valid verifiers out of 14, outperforming all others. While o4-mini achieved comparable numbers, its verifiers more often relied on weak logic. Other models struggled with functional correctness and error recovery. We therefore selected o3 for the Verifier Developer due to its consistent one-shot success in producing correct verifiers.

Critic Prompt Design. We analyzed failures in verifier generation and found that the recurring error types matched our verification criteria (e.g., incomplete checks, incorrect validation logic, or misuse of the target environment). We used these categories to design the critic agent's prompt, enabling it to automatically flag such errors when evaluating verifiers produced by the Verifier Developer.
Critic Evaluation. As shown in Figure 7b, only the o3 model successfully identified critical but subtle issues in the verifier scripts, such as weak or flawed verification logic (Figure 4). Other models provided weak or no critique, making o3 the most effective critic for the CTF Verifier.

Feedback Assessment. We evaluated CVE-GENIE on the 16 correct exploits from the Exploiter, successfully generating accurate verifiers for 15 of them, of which 4 verifiers were improved through critic feedback. In our experiment, the maximum number of feedback iterations required to resolve either functional or critique-based issues was 4; therefore, we allow up to 5 retries for both the Flag Checker and the Verifier Critic.

4.1.4 Time and Cost Constraints

In our evaluation of CVE-GENIE, we found that successfully reproduced CVEs typically required about $2 and 18 minutes on average, with the most resource-intensive case costing $6 and taking 45 minutes. Failed reproductions, on the other hand, averaged up to $4 and 35 minutes. To balance flexibility with cost control, we set a per-CVE budget cap of $5 and a maximum runtime of 45 minutes for all experiments.

4.2 Diversity-Aware Baseline Assessment

Dataset. For this study we use the same dataset DS of 60 diverse CVEs (see Section 4.1).

Methodology. Due to the non-deterministic nature of LLMs, a single run may fail due to incorrect reasoning, while later attempts might succeed [33]. To account for this, we run CVE-GENIE multiple times on the CVEs in DS, using its most optimal configuration (Section 4.1). After each iteration, we store the successfully reproduced CVEs and rerun on the remaining ones. This iterative approach helps maximize CVE reproduction and reveals CVE-GENIE's convergence over seven runs.

Figure 8: CVE-GENIE performance over multiple runs on DS (cumulative Builder, Exploiter, and Verifier successes across runs 1-7).

Results. We use statistical methods as well as a manual analysis of 25 randomly selected CVE reproduction runs, and we categorize our results and analysis as follows.

Variance Over Multiple Runs. The variability in CVE-GENIE's performance primarily stems from the Builder phase, which is more open-ended than the targeted exploit generation guided by the CVE context. Project setup involves repository exploration (Pre-Requisite Developer) and command execution (Setup Developer), where environment-specific behaviors often cause failures. For example, one run failed because the Setup Developer modified the PATH variable, while a clean retry succeeded, highlighting the value of reattempting in a fresh environment. Since the CTF Verifier depends on the Exploiter, which in turn relies on the Builder, this initial project setup is a critical bottleneck, as a previously failed build, if later successful, can enable the downstream modules to reproduce a new CVE (as illustrated in Figure 8). Over multiple runs, the number of successful project, exploit, and verifier builds showed consistent gains, with convergence at run seven, indicating that most recoverable failures can be resolved within a few retries.

Vulnerability and Project Context. Across 60 CVE reproduction attempts, we observed a 63.3% (38/60) overall success rate, with PoC availability in the advisory significantly helping reproducibility (60% with PoC). Web-centric ecosystems (TypeScript/JavaScript/PHP) showed markedly higher success than compiled/system-heavy stacks (C/C++/Go/C#/Kotlin/Java), where build and environment issues dominated the failures. Classic input-validation and injection classes (e.g., SQLi, XSS, path traversal, prototype pollution) reproduced reliably when endpoints were reachable, whereas access-control vulnerabilities underperformed, likely due to the need for a realistic authentication/authorization context. The principal blockers were build failures (36%) and timeouts (40%), indicating that improvements in environment provisioning and adaptive budgeting are as critical as exploit logic for scaling automated CVE reproduction.

RQ1: CVE-GENIE's success first depends on whether it can build the vulnerable project, after which performance varies with vulnerability type and language, e.g., web-based and input-validation bugs are easier to reproduce, while compiled/system-heavy stacks and access-control-related CVEs are harder.

4.3 Robustness to Loss of CVE Context

Dataset (DR). To evaluate CVE-GENIE's robustness to incomplete CVE data, we perform an ablation study using 15 CVEs from DS, each with complete information (i.e., CVE description, PoC, patch commit, and security advisory) and reproduced in the first three baseline iterations (Section 4.2). We call this dataset DR. We intentionally select these "easier" CVEs because they serve as a controlled probe for ablation: if performance degrades even on CVEs that CVE-GENIE can normally handle reliably, it indicates that the missing information is truly critical. As shown earlier, the availability of the PoC is significant. To simulate other incomplete-context scenarios, we create three variants of DR by systematically removing: (a) the security advisory, (b) the patch commit, and (c) all except the CVE description.
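For illustration, the three DR variants can be derived from a complete CVE record as sketched below; the field names are assumptions for illustration, not CVE-GENIE's exact knowledge-base schema.

"""Illustrative sketch of constructing the three ablation variants of DR."""

def make_variant(cve_record: dict, drop_fields: set) -> dict:
    """Return a copy of the record with the given fields removed."""
    return {k: v for k, v in cve_record.items() if k not in drop_fields}

if __name__ == "__main__":
    record = {"description": "...", "poc": "...", "patch_commit": "...", "advisory": "..."}
    no_advisory = make_variant(record, {"advisory"})                               # variant (a)
    no_patch = make_variant(record, {"patch_commit"})                              # variant (b)
    description_only = make_variant(record, {"advisory", "patch_commit", "poc"})   # variant (c)
    print(sorted(description_only))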
normally handle reliably, it indicates that the missing information is truly critical. As shown earlier, the availability of the PoC is significant. To simulate other incomplete context scenarios, we create three variants of DR by systematically removing: (a) the security advisory, (b) the patch commit, and (c) all except the CVE description.

Methodology. For each variant of DR, we run CVE-GENIE three times per CVE.
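Concretely, each variant can be derived from a fully populated CVE record by dropping fields, as in the sketch below. The record is modeled as a plain dictionary and the field names are illustrative, not CVE-GENIE's internal schema.

    def make_variants(record):
        """Build the three ablation variants of a fully populated CVE record."""
        full = dict(record)  # description, PoC, patch commit, security advisory
        no_advisory = {k: v for k, v in full.items() if k != "security_advisory"}
        no_patch = {k: v for k, v in full.items() if k != "patch_commit"}
        description_only = {"cve_id": full["cve_id"], "description": full["description"]}
        return {"no_advisory": no_advisory,
                "no_patch": no_patch,
                "description_only": description_only}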
Results. Removing the patch commit lowers performance to 10/15 (67%), even when the PoC is available, as it significantly impairs the model's ability to localize the vulnerability. For example, in CVE-2025-31481, an authorization flaw in the PHP-based API Platform Core, the system lacked direction, explored irrelevant parts of the codebase, and eventually exhausted its cost budget. Removing the security advisory also results in a drop to 11/15 (73%). Most failures here are due to environment setup and verification issues rather than exploit synthesis errors, as security advisories often contain contextual details that guide environment construction and verification strategy. On the other hand, even with only the CVE description, CVE-GENIE successfully reproduces 9/15 (60%), particularly for vulnerabilities with recognizable patterns (e.g., SSRF, CSRF). This highlights the model's baseline robustness under minimal context.

RQ2: CVE-GENIE can reproduce CVEs with limited information. Code-based PoCs are most helpful, but in their absence, patch commits offer valuable localization cues. However, reproduction success decreases with minimal context.

4.4 Large Scale Evaluation
Dataset (DL). For our large-scale evaluation, we curate a dataset, DL, comprising all CVEs published between June 2024 and May 2025 that include publicly available source code. Since these CVEs postdate the knowledge cutoff of the LLMs (o3 and o4-mini) used in CVE-GENIE, they ensure zero prior exposure. DL includes 841 CVEs across 186 CWEs, 29 programming languages, and 440 open-source projects. Our statistical analysis shows approximately 55% of CVEs in DL fall under the top 25 most dangerous CWEs [37], and around 49% lack a PoC in their advisory.
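A sketch of this curation step is shown below, assuming the CVE records have already been exported from the NVD as a list of JSON objects. The field names and the source-code heuristic are illustrative, not the actual NVD schema or our exact filter.

    from datetime import date

    def curate_dl(records):
        """Keep CVEs published June 2024 - May 2025 that reference a public source repository."""
        selected = []
        for r in records:
            published = date.fromisoformat(r["published"][:10])
            in_window = date(2024, 6, 1) <= published <= date(2025, 5, 31)
            has_source = any(host in ref
                             for ref in r.get("references", [])
                             for host in ("github.com", "gitlab.com"))
            if in_window and has_source:
                selected.append(r)
        return selected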
Methodology. We use the same methodology as Section 4.2, and run CVE-GENIE on DL three times, iteratively.

Results. CVE-GENIE reproduced 428 CVEs out of 841, spanning 22 programming languages, 141 vulnerability types, and 267 projects (see Table 4). To better understand outcomes, we manually analyzed 50 successful and 50 failed cases, and categorize the main reasons for successes and failures.

[Figure 9: Breakdown of CVE-GENIE's run on DL.]

Project       # Repro. CVEs   # Top-25 CWEs   # Other CWEs   LoC      Language(s)
lunary-ai     35              13              22             ~70k     TypeScript
WeGIA         17              15              2              ~1M      PHP
cpython       10              2               8              ~1.8M    Py, C
vite          6               6               0              ~107k    TypeScript
yeswiki       6               6               0              ~234k    PHP
symfony       5               4               1              ~1.6M    PHP
llama_index   5               4               1              ~1.1M    Py
vim           5               3               2              ~1.2M    Vim Script, C
directus      5               2               3              ~398k    TypeScript
rails         4               2               2              ~450k    Ruby
phpipam       4               2               2              ~225k    PHP
assimp        4               4               0              ~562k    C++
siyuan        3               1               2              ~265k    Go, TypeScript
cvat-ai       3               3               0              ~320k    Py, TypeScript
MobSF         3               1               2              ~132k    JS, Py
mlflow        3               1               2              ~870k    Py

Table 4: Top 15 projects identified in the large-scale evaluation of CVE-GENIE, showing the number of reproduced CVEs, their classification into CWE Top 25 vs. other CWEs, lines of code (LoC), and primary languages.

Successes. Web vulnerabilities (XSS, CSRF, SQLi, path traversal) were most reproducible, especially in interpreted languages (Python, JavaScript, PHP, Ruby, TypeScript), as shown in Table 4. Projects with clear setup instructions and code-based PoCs with crashing input had the highest success rates, as runnable PoCs directly guided exploit construction.

Failures. Memory-safety bugs (C/C++), concurrency issues, and UI-dependent flaws were hardest to verify. Many failures were due to brittle builds, poor project documentation, or resource-heavy environments; however, 24% failed from time/cost overruns. PoCs expressed only in natural language contributed little, often leaving too much ambiguity for reliable reproduction. Compiled toolchains (C/C++, Rust, Go) were prone to the most setup problems.

RQ3: At scale, CVE-GENIE reproduced 428/841 CVEs, spanning 267 projects, 22 programming languages, and 141 CWE types, demonstrating broad applicability across ecosystems (see Table 4).
Setting                        Reproduced   Effect of removal
Ideal case (full pipeline)     15 / 15      -
No Knowledge Builder           9 / 15       ↑ 30% exploit failures
No Pre-Requisite Developer     13 / 15      ↑ 27% max tool-call limit hits
No Feedback Loops              5 / 15       ↓ 67% reproduction rate
No Critics                     8 / 15       ↑ 47% false reproductions
Single Monolithic Agent        0 / 15       -

Table 5: Ablation study of CVE-GENIE's design.

4.5 Importance of CVE-GENIE's Design

Dataset. For this ablation study, we use the version of the DR dataset which contains all essential information fields for the 15 CVEs. As established earlier in Section 4.3, omitting these critical fields can affect accurate CVE reproduction. Therefore, in this ablation, we ensure all CVEs retain their complete critical information. This allows us to isolate the effect of changes to the design of CVE-GENIE: if performance degrades when a specific change is made, we can attribute that drop solely to that change, since the CVE inputs remain otherwise fully informative.

Methodology. As the Data Processor, Setup Developer, Exploit Developer, and Verifier Developer are the four indispensable components of CVE-GENIE for end-to-end CVE reproduction, we use the following five ablation settings for this evaluation of CVE-GENIE's design (a configuration sketch is shown after the list):
1. No Knowledge Builder: Bypass the Knowledge Builder and feed raw data directly to all agents.
2. No Pre-Requisite Developer: Eliminate the Pre-Requisite Developer, allowing the Setup Developer to independently plan and execute the entire environment setup from scratch.
3. No Feedback Loops: Enforce single-shot execution without iterative refinement, by removing all feedback loops.
4. No Critics: Remove the Verifier Critic and treat a CVE as reproduced if the final scripts pass the Flag Checker.
5. Single Monolithic Agent: Combine all modular agents into a single agent with access to the same tools, thereby evaluating the performance of a standalone LLM without CVE-GENIE's agentic design and structured guidance.
Results. In our ablation study (Table 5), we find that removing the Knowledge Builder drops reproduction to 9/15, as agents struggled with unstructured advisories and lost context, causing 30% more exploit failures. Eliminating the Pre-Requisite Developer had a mild impact (13/15), but overloaded the Setup Developer, which led to 27% more tool-call limits being hit, particularly on complex repositories such as CVE-2025-32389 in Nameless, a website software for Minecraft servers. Feedback loops proved most critical: without them, performance collapsed to 5/15 (-67%). On the other hand, dropping critic agents cut results to 8/15 and raised false reproductions by 47%. Finally, collapsing into a single agent yielded 0/15 reproductions; even advanced standalone LLMs (o3) failed, underscoring the necessity of CVE-GENIE's modular, feedback-driven, critic-guided design.

RQ4: Standalone LLMs are currently unable to reproduce CVEs end-to-end. CVE-GENIE's modular design remains the most effective approach for enabling LLMs to perform this task (Table 5).

5 Limitations and Future Work

Vulnerabilities that Involve UI Interactions. Currently, CVE-GENIE only supports command-line interface (CLI) interactions. However, many web-based vulnerabilities require interaction with a graphical UI, making them difficult to trigger via the CLI. Future work should explore integrating CVE-GENIE with UI environments such as web browsers.

Multimodal CVE Knowledge Curation. Due to the lack of image and video processing support, CVE-GENIE struggles with CVEs where the PoC is provided as an image or video. Future work can address this limitation by integrating CVE-GENIE with multimodal LLMs. Additionally, leveraging external CVE resources through LLM-based deep search tools, beyond standard vulnerability databases, may also help improve reproduction success rates [38].

Over Critique. As shown in Figures 5b, 6b, and 7b, all critics exhibit a lower TPR compared to their TNR. While this helps reduce false positives, it also results in missing some valid reproductions, particularly in cases where critics requested detailed setup, exploit, or verifier information that the developer could not sufficiently provide.

Cost. Running CVE-GENIE costs approximately $2.77 per CVE. Open-source models do not yet perform well on this task, and their large-scale deployment remains a challenge. To support further research, we release the logs from all runs as a novel dataset to help the community fine-tune and improve local models' performance for CVE reproduction.

6 Conclusion

In this paper we presented CVE-GENIE, an automated framework designed to reproduce real-world vulnerabilities at scale. This has considerable benefit for downstream security research (see Appendix B), which suffers from a shortage of diverse, high-quality data suitable for research and development of new tooling for automated vulnerability discovery and repair. CVE-GENIE successfully reproduced around 51% (428) of 841 CVEs, at an average cost of $2.77 per CVE, across a vast variety of projects, CWEs, and programming languages. To the best of the authors' knowledge, no other framework is capable of such comprehensive results at this scale and efficiency.

Ethical Considerations

The goal of our work is to support defenders and solution developers by accelerating vulnerability triage and tool development for vulnerability detection and patching. CVE-GENIE does not create or discover new vulnerabilities; it is a framework for recreating already-reported CVEs, providing up-to-date and reproducible CVE environments as benchmark test cases. These benchmarks can then be used to rigorously test and enhance vulnerability detection and patching systems, thereby understanding their performance and ideally accelerating their response capabilities. Following the beneficence criterion of the Menlo Report [14], we are confident that our system provides more benefits than potential harms to the security community and to society.

Open Science

We will open source CVE-GENIE's source code, the logs for all experiments in Section 4 (including all agent conversations), and our dataset of 428 reproduced CVEs.
A CVE-GENIE Specifications

CVE-GENIE is built on LangChain, which supports easy design and integration of tools for LLMs. To effectively reproduce CVEs, CVE-GENIE requires access to a project's source code. However, due to LLMs' limited context windows, large project files cannot be processed all at once. To address this, we developed primitives enabling directory browsing using a "project directory tree" and command execution (a sketch of how such primitives can be registered as tools follows the list):
1. get_file: The LLM agent specifies the file path and the number of lines it wants to read from a given offset. To prevent context overflow and improve efficiency, it reads a maximum of 300 lines at a time. If more content is needed, the agent can scroll up or down to access it.
2. write_to_file: The LLM provides the path and content for a file, which we then write to the specified file. This facilitates modifications and updates within the project.
3. execute_ls_command: The LLM specifies a directory, and we execute the ls command within it, returning the output. This command is distinct from execute_linux_command due to its major role in project directory exploration and its frequent use.
4. execute_linux_command: The LLM issues Linux commands to be executed from the project's root directory. Foreground commands have a 300-second timeout, with standard output and error logged to separate files. After completion, we return up to the last 100 lines of the log and the log file path, which the LLM can read using get_file for further analysis. Background execution is supported for non-blocking tasks (e.g., starting a web server). For these, we show output after 5 seconds, provide the log file path, and indicate whether the process is still running.
5. set_environment_variable: This tool allows for setting environment variables for subsequent execute_linux_command calls. We observed that LLMs frequently issue calls to export environment variables, which are not retained in successive runs as each command is executed in a separate shell. To address this, we enable the LLM to set these variables, which are then passed to commands. There is also an option to clear all set variables.
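Below is a simplified sketch of how two of these primitives could be registered as LangChain tools. It omits the scrolling, logging, background execution, and tool-call limits described above and is not CVE-GENIE's actual implementation.

    import subprocess
    from langchain_core.tools import tool

    @tool
    def get_file(path: str, offset: int = 0, num_lines: int = 300) -> str:
        """Read up to 300 lines of a project file starting at the given line offset."""
        with open(path, "r", errors="replace") as f:
            lines = f.readlines()
        return "".join(lines[offset:offset + min(num_lines, 300)])

    @tool
    def execute_ls_command(directory: str) -> str:
        """List the contents of a directory inside the project."""
        result = subprocess.run(["ls", "-la", directory],
                                capture_output=True, text=True, timeout=30)
        return result.stdout + result.stderr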


Restrictions. Each tool-calling agent is limited to 60 tool calls. This limit was found to be optimal in preliminary experiments for maximizing CVE reproduction across all LLMs.

Output Formatting. For all agents requiring output in a specific format, we add a gpt-4o-mini-based format corrector to the output parser. If parsing fails, the raw output is passed to gpt-4o-mini for formatting. If formatting fails after three attempts, an error is returned.
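A sketch of this parse-then-repair loop is shown below; the parser and the reformatting model call are illustrative stand-ins for the actual output parser and gpt-4o-mini invocation.

    def parse_with_correction(raw_output, parser, reformat_with_llm, max_attempts=3):
        """Try the structured parser; on failure, ask the formatting model to repair and retry."""
        for _ in range(max_attempts):
            try:
                return parser.parse(raw_output)
            except Exception:
                raw_output = reformat_with_llm(
                    "Reformat the following output to match the required schema:\n" + raw_output
                )
        raise RuntimeError("Formatting failed after three attempts")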
B Applications of CVE-GENIE

As shown in Table 6, reproducible CVEs can be used for a plethora of security analysis tasks, including vulnerability detection, triage, exploitation, and patching.

Vulnerability Detection. Most existing vulnerability detection datasets are built using patch commits from CVEs (see Table 1). They typically label the pre-patch version of a function as vulnerable and the post-patch version as benign [7, 15, 40]. However, extracting isolated functions often strips away critical context [51], and functions labeled for a single vulnerability may contain others, resulting in noisy labels. Using reproducible CVEs with working PoCs can address these issues. By executing PoCs during function code extraction, we can ensure that vulnerable execution traces are included, preserving necessary context and confirming the presence of vulnerabilities. Additionally, this approach supports both static and dynamic analysis, unlike most existing datasets (e.g., SVEN [21], BigVul [15], PrimeVul [13]), which are limited to static methods.
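For instance, an execution-grounded labeling step could look like the following sketch. The tracing and extraction helpers are hypothetical; the point is that only functions on the PoC's execution trace receive a vulnerable label.

    def label_with_poc(repro, trace_exploit, extract_functions):
        """Label only functions that actually appear on the vulnerable execution trace."""
        trace = trace_exploit(repro.environment, repro.exploit)   # execute the verified PoC with tracing
        samples = []
        for func in extract_functions(repro.project):
            if func.qualified_name in trace.functions:
                samples.append({"code": func.source, "label": "vulnerable", "cve": repro.cve_id})
        return samples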
Category              Application                   Benefits brought by CVE-GENIE
Vul. detection        ML-based detection            Training and testing data
                      Dynamic fuzzing               Benchmarks and initial fuzzing seeds
                      Static analysis               Benchmarks and static rules
Vul. triage           Root cause analysis           Enrich inputs and enable more tools
                      Exploit generation            Ingredients for exploit chaining
Vul. patching         Patch generation              Benchmarks and testing environments
                      Patch verification            Inputs and environments
Other applications    LLM secure code generation    Benchmarks and testing environment
                      Penetration testing           Construct pen. test tasks and environment
                      Attack detection              Benchmarks and detection rules

Table 6: Tasks that benefit from CVE-GENIE.

Vulnerability Triage and Patching. Reproducible CVEs are essential for effective vulnerability triage. They help eliminate false positives and provide an executable environment with known vulnerable inputs, which aids in accurately analyzing and reproducing bugs. This capability enables advanced techniques like reverse execution and backward taint analysis for root cause identification. CVE-GENIE supports this process by offering exploits for individual vulnerabilities, which can be combined into complex exploit chains for studying multi-stage attacks. It also serves as a benchmark for evaluating patching methods, providing rich contextual data such as crash stack traces and outputs. Additionally, its PoCs assist in verifying and refining patches.

LLM Insecure Code Generation. Reproducible CVEs are valuable for evaluating the security of code generated by LLMs. Recent studies have shown that LLMs can produce insecure code, particularly in security-critical contexts [55, 59]. Various benchmarks evaluate the secure code generation capability of LLMs: CyberSecEval [55] prompts an LLM to generate code snippets and uses rule-based detectors to identify insecure code; however, rule-based detection often leads to false positives. To improve on this, SecCodePLT [59] leverages CWE data and manually crafts dynamic test cases to evaluate secure code generation. Reproducible CVEs offer a scalable alternative: because these CVEs are confirmed to be vulnerable, they can be used to assess LLMs' ability to generate secure code. By prompting the LLM to implement the functionality of the vulnerable functions, and using the associated PoCs to test the output, we can determine whether the generated code is vulnerable.
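A sketch of that evaluation loop follows; the helpers for describing the functionality, swapping in the generated implementation, and re-running the exploit are illustrative.

    def evaluate_llm_secure_codegen(repro, llm, replace_function, run_exploit):
        """Ask an LLM to re-implement the vulnerable function, then re-test with the CVE's PoC."""
        spec = repro.vulnerable_function.docstring or repro.vulnerable_function.summary
        candidate = llm.generate("Implement the following functionality securely:\n" + spec)
        rebuilt_env = replace_function(repro.environment, repro.vulnerable_function, candidate)
        exploit_succeeded = run_exploit(rebuilt_env, repro.exploit)   # reuse the verified PoC as the test
        return "insecure" if exploit_succeeded else "secure"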
Penetration Testing and Attack Detection. CVE-GENIE offers a valuable resource for both penetration testing and cybersecurity training. It provides ready-to-use exploitation scenarios along with verified solutions, making it ideal for educating future penetration testers and developing automated testing tools. For defenders, these exploits can be analyzed to identify attack patterns, which can enhance detection mechanisms such as intrusion detection systems (IDS) or malware classifiers.

References

[1] Ibrahim Abdelaziz, Kinjal Basu, Mayank Agarwal, Sadhana Kumaravel, Matthew Stallone, Rameswar Panda, Yara Rizk, G P Shrivatsa Bhargav, Maxwell Crouse, Chulaka Gunasekara, Shajith Ikbal, Sachindra Joshi, Hima Karanam, Vineet Kumar, Asim Munawar, Sumit Neelam, Dinesh Raghu, Udit Sharma, Adriana Meza Soria, Dheeraj Sreedhar, Praveen Venkateswaran, Merve Unuvar, David Daniel Cox, Salim Roukos, Luis A. Lastras, and Pavan Kapanipathi. Granite-function calling model: Introducing function calling abilities via multi-task learning of granular tasks. In Franck Dernoncourt, Daniel Preoţiuc-Pietro, and Anastasia Shimorina, editors, Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track, pages 1131–1139, Miami, Florida, US, November 2024. Association for Computational Linguistics.

[2] Janice Ahn, Rishu Verma, Renze Lou, Di Liu, Rui Zhang, and Wenpeng Yin. Large language models for mathematical reasoning: Progresses and challenges, 2024.

[3] Nikolaos Alexopoulos, Manuel Brack, Jan Philipp Wagner, Tim Grube, and Max Mühlhäuser. How long do vulnerabilities live in the code? A large-scale empirical measurement study on FOSS vulnerability lifetimes. In 31st USENIX Security Symposium (USENIX Security 22), pages 359–376, Boston, MA, August 2022. USENIX Association.

[4] Daniel Arp, Erwin Quiring, Feargus Pendlebury, Alexander Warnecke, Fabio Pierazzi, Christian Wressnegger, Lorenzo Cavallaro, and Konrad Rieck. Dos and don'ts of machine learning in computer security. In 31st USENIX Security Symposium, 2022.

[5] Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, and Charles Sutton. Program synthesis with large language models. arXiv:2108.07732 [cs], August 2021.

[6] Bandit. Bandit documentation.

[7] Guru Bhandari, Amara Naseer, and Leon Moonen. CVEfixes: Automated collection of vulnerabilities and their fixes from open-source software. In Proceedings of the 17th International Conference on Predictive Models and Data Analytics in Software Engineering, PROMISE 2021, pages 30–39, New York, NY, USA, 2021. Association for Computing Machinery.

[8] Dong Chen, Shaoxin Lin, Muhan Zeng, Daoguang Zan, Jian-Gang Wang, Anton Cheshkov, Jun Sun, Hao Yu, Guoliang Dong, Artem Aliev, Jie Wang, Xiao Cheng, Guangtai Liang, Yuchi Ma, Pan Bian, Tao Xie, and Qianxiang Wang. CodeR: Issue resolving with multi-agent and task graphs, 2024.

[9] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter, Philippe Tillet, Felipe Petroski Such, Dave Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William Hebgen Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain, William Saunders, Christopher Hesse, Andrew N. Carr, Jan Leike, Josh Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, and Wojciech Zaremba. Evaluating large language models trained on code, July 2021. arXiv:2107.03374 [cs].

[10] Yizheng Chen, Zhoujie Ding, Lamya Alowain, Xinyun Chen, and David Wagner. DiverseVul: A new vulnerable source code dataset for deep learning based vulnerability detection. In Proceedings of the 26th International Symposium on Research in Attacks, Intrusions and Defenses, RAID '23, pages 654–668, New York, NY, USA, 2023. Association for Computing Machinery.

[11] DARPA. The Cyber Grand Challenge, 2016.

[12] Yinlin Deng, Chunqiu Steven Xia, Haoran Peng, Chenyuan Yang, and Lingming Zhang. Large language models are zero-shot fuzzers: Fuzzing deep-learning libraries via large language models. In Proceedings of the 32nd ACM SIGSOFT International Symposium on Software Testing and Analysis, ISSTA 2023, pages 423–435, New York, NY, USA, July 2023. Association for Computing Machinery.

[13] Yangruibo Ding, Yanjun Fu, Omniyyah Ibrahim, Chawin Sitawarin, Xinyun Chen, Basel Alomair, David Wagner, Baishakhi Ray, and Yizheng Chen. Vulnerability detection with code language models: How far are we?, July 2024. arXiv:2403.18624 [cs].

[14] D. Dittrich and E. Kenneally. The Menlo Report: Ethical Principles Guiding Information and Communication Technology Research. Technical report, U.S. Department of Homeland Security, August 2012.

[15] Jiahao Fan, Yi Li, Shaohua Wang, and Tien N. Nguyen. A C/C++ code vulnerability dataset with code changes and CVE summaries. In Proceedings of the 17th International Conference on Mining Software Repositories, MSR '20, pages 508–512, New York, NY, USA, 2020. Association for Computing Machinery.

[16] Michael Fu, Chakkrit Kla Tantithamthavorn, Van Nguyen, and Trung Le. ChatGPT for vulnerability detection, classification, and repair: How far are we? In 2023 30th Asia-Pacific Software Engineering Conference (APSEC), pages 632–636, 2023.

[17] Daya Guo, Shuai Lu, Nan Duan, Yanlin Wang, Ming Zhou, and Jian Yin. UniXcoder: Unified cross-modal pre-training for code representation. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics, May 2022.

[18] Wenbo Guo, Yujin Potter, Tianneng Shi, Zhun Wang, Andy Zhang, and Dawn Song. Frontier AI's impact on the cybersecurity landscape. arXiv preprint arXiv:2504.05408, 2025.

[19] Hazim Hanif and Sergio Maffeis. VulBERTa: Simplified source code pre-training for vulnerability detection. In International Joint Conference on Neural Networks (IJCNN), 2022.

[20] Ahmad Hazimeh, Adrian Herrera, and Mathias Payer. Magma: A ground-truth fuzzing benchmark. Proceedings of the ACM on Measurement and Analysis of Computing Systems, 2020.

[21] Jingxuan He and Martin Vechev. Large language models for code: Security hardening and adversarial testing. In Proceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security, CCS '23, pages 1865–1879, New York, NY, USA, 2023. Association for Computing Machinery.

[22] Junda He, Christoph Treude, and David Lo. LLM-based multi-agent systems for software engineering: Literature review, vision and the road ahead. ACM Trans. Softw. Eng. Methodol., January 2025.

[23] Infer. A tool to detect bugs in Java and C/C++/Objective-C code before it ships.

[24] Nan Jiang, Kevin Liu, Thibaud Lutellier, and Lin Tan. Impact of code language models on automated program repair. In IEEE/ACM 45th International Conference on Software Engineering (ICSE), pages 1430–1442. IEEE Computer Society, May 2023.

[25] Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R. Narasimhan. SWE-bench: Can language models resolve real-world GitHub issues? In ICLR, 2024.

[26] Haolin Jin, Linghan Huang, Haipeng Cai, Jun Yan, Bo Li, and Huaming Chen. From LLMs to LLM-based agents for software engineering: A survey of current, challenges and future, August 2024. arXiv:2408.02479 [cs].

[27] Guohao Li, Hasan Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem. CAMEL: Communicative agents for "mind" exploration of large language model society. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors, Advances in Neural Information Processing Systems, volume 36, pages 51991–52008. Curran Associates, Inc., 2023.

[28] Hongwei Li, Yuheng Tang, Shiqi Wang, and Wenbo Guo. PatchPilot: A cost-efficient software engineering agent with early attempts on formal verification. arXiv preprint arXiv:2502.02747, 2025.

[29] Zhen Li, Deqing Zou, Shouhuai Xu, Hai Jin, Yawei Zhu, and Zhaoxuan Chen. SySeVR: A framework for using deep learning to detect software vulnerabilities. IEEE Transactions on Dependable and Secure Computing, 2022.

[30] Zhen Li, Deqing Zou, Shouhuai Xu, Xinyu Ou, Hai Jin, Sujuan Wang, Zhijun Deng, and Yuyi Zhong. VulDeePecker: A deep learning-based system for vulnerability detection. Proceedings of the Network and Distributed System Security Symposium, 2018.

[31] Guanjun Lin, Jun Zhang, Wei Luo, Lei Pan, and Yang Xiang. Poster: Vulnerability discovery with function representation learning from unlabeled projects. In Proceedings of the ACM SIGSAC Conference on Computer and Communications Security, 2017.

[32] Yingwei Ma, Qingping Yang, Rongyu Cao, Binhua Li, Fei Huang, and Yongbin Li. How to understand whole software repository?, 2024.

[33] Rohin Manvi, Anikait Singh, and Stefano Ermon. Adaptive inference-time compute: LLMs can predict if they can do better, even mid-generation, 2024.

[34] Xiang Mei, Pulkit Singh Singaria, Jordi Del Castillo, Haoran Xi, Abdelouahab Benchikh, Tiffany Bao, Ruoyu Wang, Yan Shoshitaishvili, Adam Doupé, Hammond Pearce, and Brendan Dolan-Gavitt. ARVO: Atlas of reproducible vulnerabilities for open source software, 2024.

[35] Metasploit. RAPID7 Metasploit.

[36] Yisroel Mirsky, George Macon, Michael Brown, Carter Yagemann, Matthew Pruett, Evan Downing, Sukarno Mertoguno, and Wenke Lee. VulChecker: Graph-based vulnerability localization in source code. In 32nd USENIX Security Symposium, 2023.

[37] MITRE. 2024 CWE Top 25 Most Dangerous Software Weaknesses.

[38] Dongliang Mu, Alejandro Cuevas, Limin Yang, Hang Hu, Xinyu Xing, Bing Mao, and Gang Wang. Understanding the reproducibility of crowd-reported security vulnerabilities. In 27th USENIX Security Symposium (USENIX Security 18), pages 919–936, Baltimore, MD, August 2018. USENIX Association.

[39] Anh The Nguyen, Triet Huynh Minh Le, and M. Ali Babar. Automated code-centric software vulnerability assessment: How far are we? An empirical study in C/C++. In Proceedings of the 18th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement, ESEM '24, pages 72–83, 2024.

[40] Georgios Nikitopoulos, Konstantina Dritsa, Panos Louridas, and Dimitris Mitropoulos. CrossVul: A cross-language vulnerability dataset with commit data. In Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/FSE 2021, pages 1565–1569, New York, NY, USA, 2021. Association for Computing Machinery.

[41] NIST. SARD.

[42] NVD. NVD - Home.

[43] OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, and Shyamal Anadkat. GPT-4 technical report, 2024.

[44] OWASP. Source Code Analysis Tools.

[45] Joon Sung Park, Joseph O'Brien, Carrie Jun Cai, Meredith Ringel Morris, Percy Liang, and Michael S. Bernstein. Generative agents: Interactive simulacra of human behavior. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, UIST '23, pages 1–22, New York, NY, USA, October 2023. Association for Computing Machinery.

[46] Hammond Pearce, Benjamin Tan, Baleegh Ahmad, Ramesh Karri, and Brendan Dolan-Gavitt. Examining zero-shot vulnerability repair with large language models. In 2023 IEEE Symposium on Security and Privacy (SP), pages 2339–2356, May 2023.

[47] Long Phan, Hieu Tran, Daniel Le, Hieu Nguyen, James Annibal, Alec Peltekian, and Yanfang Ye. CoTexT: Multi-task learning with code-text transformer. In Proceedings of the 1st Workshop on Natural Language Processing for Programming (NLP4Prog), 2021.

[48] Akond Rahman, Chris Parnin, and Laurie Williams. The seven sins: Security smells in infrastructure as code scripts. In IEEE/ACM 41st International Conference on Software Engineering (ICSE), 2019.

[49] Niklas Risse and Marcel Böhme. Limits of machine learning for automatic vulnerability detection. arXiv e-prints, arXiv:2306.17193, June 2023.

[50] Niklas Risse and Marcel Böhme. Top score on the wrong exam: On benchmarking in machine learning for vulnerability detection, 2024.

[51] Niklas Risse and Marcel Böhme. Top score on the wrong exam: On benchmarking in machine learning for vulnerability detection, 2024.

[52] Rebecca Russell, Louis Kim, Lei Hamilton, Tomo Lazovich, Jacob Harer, Onur Ozdemir, Paul Ellingwood, and Marc McConley. Automated vulnerability detection in source code using deep representation learning. In 17th IEEE International Conference on Machine Learning and Applications (ICMLA), 2018.

[53] Yashar Talebirad and Amirhossein Nadiri. Multi-agent collaboration: Harnessing the power of intelligent LLM agents, June 2023. arXiv:2306.03314 [cs].

[54] Saad Ullah, Mingji Han, Saurabh Pujar, Hammond Pearce, Ayse Coskun, and Gianluca Stringhini. LLMs cannot reliably identify and reason about security vulnerabilities (yet?): A comprehensive evaluation, framework, and benchmarks. In 2024 IEEE Symposium on Security and Privacy (SP), pages 862–880, Los Alamitos, CA, USA, May 2024. IEEE Computer Society.

[55] Shengye Wan, Cyrus Nikolaidis, Daniel Song, David Molnar, James Crnkovich, Jayson Grace, Manish Bhatt, Sahana Chennabasappa, Spencer Whitman, Stephanie Ding, et al. CyberSecEval 3: Advancing the evaluation of cybersecurity risks and capabilities in large language models. arXiv preprint arXiv:2408.01605, 2024.

[56] Laura Wartschinski, Yannic Noller, Thomas Vogel, Timo Kehrer, and Lars Grunske. VUDENC: Vulnerability detection with deep learning on a natural codebase for Python. Inf. Softw. Technol., 144(C), April 2022.

[57] Chunqiu Steven Xia, Yuxiang Wei, and Lingming Zhang. Automated program repair in the era of large pre-trained language models. In 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE), pages 1482–1494, Melbourne, Australia, May 2023. IEEE.

[58] Fabian Yamaguchi, Nico Golde, Daniel Arp, and Konrad Rieck. Modeling and discovering vulnerabilities with code property graphs. In IEEE Symposium on Security and Privacy, 2014.

[59] Yu Yang, Yuzhou Nie, Zhun Wang, Yuheng Tang, Wenbo Guo, Bo Li, and Dawn Song. SecCodePLT: A unified platform for evaluating the security of code GenAI. arXiv preprint arXiv:2410.11096, 2024.

[60] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629, 2022.

[61] Zheng Yu, Ziyi Guo, Yuhang Wu, Jiahao Yu, Meng Xu, Dongliang Mu, Yan Chen, and Xinyu Xing. PatchAgent: A practical program repair agent mimicking human expertise.

[62] Daoguang Zan, Zhirong Huang, Ailun Yu, Shaoxin Lin, Yifan Shi, Wei Liu, Dong Chen, Zongshuai Qi, Hao Yu, Lei Yu, Dezhi Ran, Muhan Zeng, Bo Shen, Pan Bian, Guangtai Liang, Bei Guan, Pengjie Huang, Tao Xie, Yongji Wang, and Qianxiang Wang. SWE-bench-Java: A GitHub issue resolving benchmark for Java, 2024.

[63] Yuntong Zhang, Jiawei Wang, Dominic Berzin, Martin Mirchev, Dongge Liu, Abhishek Arya, Oliver Chang, and Abhik Roychoudhury. Fixing security vulnerabilities with AI in OSS-Fuzz, 2024.

[64] Yunhui Zheng, Saurabh Pujar, Burn Lewis, Luca Buratti, Edward Epstein, Bo Yang, Jim Laredo, Alessandro Morari, and Zhong Su. D2A: A dataset built for AI-based vulnerability detection methods using differential analysis. In 2021 IEEE/ACM 43rd International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP), 2021.

[65] Yaqin Zhou, Shangqing Liu, Jingkai Siow, Xiaoning Du, and Yang Liu. Devign: Effective vulnerability identification by learning comprehensive program semantics via graph neural networks. In Advances in Neural Information Processing Systems, 2019.

[66] Yuxuan Zhu, Antony Kellermann, Dylan Bowman, Philip Li, Akul Gupta, Adarsh Danda, Richard Fang, Conner Jensen, Eric Ihli, Jason Benn, Jet Geronimo, Avi Dhir, Sudhit Rao, Kaicheng Yu, Twm Stone, and Daniel Kang. CVE-Bench: A benchmark for AI agents' ability to exploit real-world web application vulnerabilities, 2025.
