Proposed Agentic Hybrid CTI Analysis System
The proposed system is a hybrid multi-agent architecture combining Large Language Models
(LLMs) with graph-based reasoning and retrieval components. Raw CTI reports (blogs, PDFs)
are ingested and pre-processed (text extraction, tokenization, normalization) into a common
format. A central orchestrator agent – implemented as an LLM – coordinates specialized
sub-agents in a pipeline:
● Extraction Agent: A fine-tuned or prompt-engineered LLM scans the text to extract entities and concepts for all STIX 2.1 object types (Indicators, Threat Actors, Campaigns, etc.) and TTPs (techniques/sub-techniques). It generates initial STIX triples (e.g. Actor–uses–Attack Pattern) and short annotations. These objects are created with STIX schema compliance in mind. In practice, we run separate LLM prompts for STIX Domain Objects (SDOs), Cyber-observable Objects (SCOs), and Relationship Objects (SROs), then merge them into a STIX bundle (a sketch of this merge step follows the pipeline overview below).
● Graph Construction & Reasoning Agent: Extracted entities are loaded into a
Knowledge Graph where nodes are STIX objects and edges are their relationships. A
graph neural network (GNN, e.g. GraphSAGE) propagates context and infers hidden
links. This helps link related TTPs or actors even if only implicitly mentioned.
● Validation / RAG Agent: To reduce hallucinations and ensure factual grounding, a
retrieval-augmented generation (RAG) step checks and enriches the LLM output. The
agent performs semantic search over a vector database of indexed CTI reports and
threat feeds, retrieving relevant text snippets that support each extraction. It may also
call external APIs (e.g. MITRE ATT&CK) to verify IOC details or technique names. By
“grounding” queries in authoritative data, the system avoids fabrications. Each candidate
fact is iteratively cross-checked by a chain-of-thought prompt; if inconsistencies arise,
the agent refines or rejects the fact.
● Output Generation Agent: The orchestrator then compiles the validated findings into
final outputs: fully-formed STIX 2.1 JSON bundles (complete with all relevant object
properties) and corresponding visualizations. All reasoning steps and retrieved evidence
are logged to ensure explainability. Agent orchestration frameworks like LangGraph can manage this pipeline, ensuring modularity and traceability (see the orchestration sketch below).
Overall, data flows from raw CTI → LLM-based extraction → KG enrichment → RAG validation
→ final STIX output. The use of RAG (retrieval from a CTI corpus) and graph queries is built into
the toolset, while multi-agent orchestration organizes the steps.
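To make the bundle-assembly step concrete, below is a minimal sketch using the stix2 Python library. The object values are illustrative stand-ins for Extraction Agent output, not results from a real report; the library enforces required STIX 2.1 properties at construction time, which gives us schema validation for free.

```python
# Minimal sketch of the SDO/SCO/SRO merge step using the stix2 library.
# The object values below are illustrative stand-ins for LLM output.
from stix2 import AttackPattern, Bundle, IPv4Address, Relationship, ThreatActor

# SDOs from the domain-object prompt
actor = ThreatActor(name="ExampleActor", threat_actor_types=["crime-syndicate"])
pattern = AttackPattern(name="Spearphishing Attachment")

# SCO from the observable prompt
ip = IPv4Address(value="203.0.113.7")

# SRO linking the two SDOs (Actor -- uses --> Attack Pattern)
uses = Relationship(relationship_type="uses", source_ref=actor.id, target_ref=pattern.id)

# Merge into one STIX 2.1 bundle; malformed objects raise at construction,
# so anything that reaches the Bundle is schema-compliant.
bundle = Bundle(objects=[actor, pattern, ip, uses])
print(bundle.serialize(pretty=True))
```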
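The orchestration layer can be sketched as a LangGraph state machine, assuming a simple shared-state pipeline; the node bodies below are placeholders for the actual agents, and the conditional edge implements the refinement loop described above.

```python
# Skeleton of the extract -> enrich -> validate pipeline in LangGraph.
# Node bodies are placeholders for the LLM, graph, and RAG agents.
from typing import TypedDict
from langgraph.graph import END, StateGraph

class CTIState(TypedDict):
    report_text: str
    stix_objects: list
    validated: bool

def extract(state: CTIState) -> dict:
    # Extraction Agent: LLM prompting would happen here.
    return {"stix_objects": [{"type": "threat-actor", "name": "stub"}]}

def enrich(state: CTIState) -> dict:
    # Graph Agent: KG loading and GNN inference would happen here.
    return {}

def validate(state: CTIState) -> dict:
    # RAG Agent: retrieval-backed fact checking would happen here.
    return {"validated": True}

builder = StateGraph(CTIState)
builder.add_node("extract", extract)
builder.add_node("enrich", enrich)
builder.add_node("validate", validate)
builder.set_entry_point("extract")
builder.add_edge("extract", "enrich")
builder.add_edge("enrich", "validate")
# Loop back for refinement when validation fails; otherwise finish.
builder.add_conditional_edges("validate", lambda s: END if s["validated"] else "extract")

graph = builder.compile()
result = graph.invoke({"report_text": "raw CTI...", "stix_objects": [], "validated": False})
```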
Addressing Existing Limitations
● Complete STIX 2.1 Coverage: Our extraction agent is explicitly trained and prompted to
recognize all 18 STIX Domain Objects (SDOs) and 2 Relationship Objects, not just the
commonly used few. Our system’s prompts reference the full STIX ontology. We can fine-tune on annotated CTI corpora that label each STIX type, or use rules to post-process LLM
output into any missing object types. This ensures niche objects (e.g. Course-of-Action,
Infrastructure, Grouping) are captured, yielding full STIX compliance. The STIX bundle
assembly step then validates each object using the STIX 2.1 schema libraries.
● Enhanced TTP Recognition and Linking: We employ a multi-label classification
strategy for TTPs that mirrors the MITRE ATT&CK hierarchy. For instance, a
DistilBERT-based system covering 560 ATT&CK classes (techniques and
sub-techniques) achieved 0.933 F1 on fine-grained TTP extraction. Similarly, our LLM is
guided (via prompts or fine-tuning) to tag TTPs at the sub-technique level. The
constructed graph then encodes TTP–tactic relationships, and GNN link prediction
identifies chains of procedures. This multi-hop reasoning uncovers, for example, how reconnaissance techniques connect to later exploitation steps, strengthening our TTP linkage beyond what a single-pass LLM could do (a multi-label tagging sketch follows this list).
● Reduced Hallucinations: Hallucination risk is mitigated through our RAG and
multi-agent checks. In practice, the RAG agent fetches relevant report snippets before
generation, so the model cites real sentences. Furthermore, multiple agents
cross-validate. For example, one agent’s extraction is verified by another agent or by
querying provenance logs, akin to the dual-evaluation paradigm where an auxiliary LLM
cross-checks outputs. ProvSEEK, a similar agentic system, enforces that “every LLM-generated claim is tied to verifiable ground truth.” In our design, any fact not confirmed by retrieval or external databases triggers a refinement loop, keeping unverified claims out of the final bundle. Empirically, such RAG+validation workflows can cut hallucination rates dramatically, as each output is effectively “vetted” by multiple sources.
● Novel Entity & Pattern Detection: To flag new IoCs or TTP variants, we leverage
semantic embeddings. Extracted entities are compared via cosine similarity to an
embedding index of known threats. If the similarity score is below a threshold (e.g. <0.8),
the entity is marked as novel. The system then automatically queries live sources (e.g. VirusTotal for IPs/domains, malware sandboxes for hashes, or the ATT&CK TAXII API) to
seek context. This lets the system adapt to emerging threats. New threat patterns can
also emerge via link prediction in the KG or by agentic exploratory prompts. The architecture thus supports zero-shot recognition: even unseen threats trigger either evidence-gathering or human-in-the-loop alerts, rather than silent failure (a novelty-scoring sketch follows this list).
● Reduced Dependency on Static KBs: Instead of a fixed internal knowledge base, the
system dynamically retrieves up-to-date CTI and threat data. Agents can call online APIs
(MITRE ATT&CK, AlienVault OTX, etc.) and use real-time threat feeds as additional
knowledge sources. This retrieval-augmented design means the system stays current:
when a new vulnerability or campaign appears online, the next RAG query will
incorporate it automatically.
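As a sketch of the multi-label TTP tagging strategy, the snippet below attaches a 560-way sigmoid head to a DistilBERT encoder. The base checkpoint is not fine-tuned for this task, so outputs are meaningless until trained on CTI sentences labeled with technique IDs; it only illustrates the classification pattern.

```python
# Multi-label ATT&CK tagging sketch: one sigmoid per technique class, since
# a single sentence can evidence several techniques at once.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

NUM_TTPS = 560  # techniques + sub-techniques, matching the cited setup
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=NUM_TTPS,
    problem_type="multi_label_classification",  # BCE loss during fine-tuning
)

text = "The actor used PowerShell to download a second-stage payload."
inputs = tokenizer(text, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits
probs = torch.sigmoid(logits)[0]
predicted = (probs > 0.5).nonzero(as_tuple=True)[0].tolist()
print(predicted)  # indices map to ATT&CK IDs via a label vocabulary
```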
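The novelty check itself can be prototyped in a few lines. The embedding model and the known-threat descriptions below are assumptions for illustration; the 0.8 threshold is the design value mentioned above and would be tuned on held-out data.

```python
# Embedding-based novelty flagging: compare a candidate entity against an
# index of known-threat descriptions by cosine similarity.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # model choice is illustrative
known = [
    "Emotet delivered via malicious macro documents",
    "Cobalt Strike beacon over HTTPS",
]
known_vecs = model.encode(known, normalize_embeddings=True)

candidate = "Previously unseen loader abusing a signed driver"
cand_vec = model.encode([candidate], normalize_embeddings=True)[0]

# With unit-normalized embeddings, the dot product is cosine similarity.
sims = known_vecs @ cand_vec
NOVELTY_THRESHOLD = 0.8
if sims.max() < NOVELTY_THRESHOLD:
    print("novel entity -> trigger enrichment (VirusTotal, sandbox, ATT&CK)")
else:
    print(f"matches known threat: {known[int(sims.argmax())]}")
```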
Agentic AI Multi-Agent Approach
Our framework treats each component as an autonomous agent empowered by an LLM. In AI
terms, an agent is “a system that can use an LLM to reason through a problem, create a plan to solve it, and execute the plan with tools.” We adopt a multi-agent setup where each agent has a
role (e.g. “Extraction Specialist”, “Graph Reasoner”, “Validation Analyst”). They communicate via
the orchestrator.
● Role Specialization: Each agent has a distinct persona: for example, one prompt might
be “You are a Threat Analyst focusing on extracting Indicators of Compromise,” while
another is “You are a Forensics Expert mapping attack chains.” This follows ProvSEEK’s strategy of role-based prompts to align context. Role specialization minimizes
drift: each agent applies a professional “lens” to its subtask, improving consistency and
interpretability.
● Tool Integration: Agents access external tools to perform actions. For example, the
Graph Agent can run Cypher queries against a Neo4j KG, the RAG Agent can call a
FAISS vector store for nearest-neighbor search, and any agent can invoke REST APIs.
A question about a technique might trigger an ATT&CK API query; an IOC might trigger
a VirusTotal lookup. These tools extend the LLM’s capabilities. Frameworks like
LangChain (which LangGraph builds upon) make it easy to integrate such tools into
prompts. For instance, the agent uses a “Plan Generator” tool to decompose tasks and a “Data Retriever” tool to fetch database results (see the tool-binding sketch after this list).
● Orchestration Frameworks: We will leverage open-source agentic frameworks for
coordination. LangGraph (built on LangChain) models workflows as directed graphs of
components, ideal for visualizing dependencies. Other options include HuggingFace’s
SmolAgents for lightweight tasks. We can also use tools like LangSmith/LangFuse to
monitor agent interactions and logs.
● Knowledge Graph Interaction: The Graph Agent interacts with the KG using queries
and embedding lookups. For example, it might ask “find all attack patterns linked to this
threat actor” and use the graph query result in subsequent reasoning. The agent can
also perform GNN inference in batches (using libraries like PyTorch Geometric) to
update node embeddings and predict edges (a link-prediction sketch follows this list). Each graph traversal or GNN output is logged as an explanation: e.g. “Threat Actor A → uses → Technique T1059 (Scripting) [confidence 0.92]” provides a traceable link.
● Iterative Reasoning Loop: Agents operate in an LLM-based reasoning loop (often
called ReAct). The orchestrator issues a high-level query (e.g. “Extract TTPs from
Report X”), the Extraction Agent responds, the Validation Agent critiques or refines, and
so on. Chain-of-Thought (CoT) prompting within each agent decomposes complex tasks.
If one agent’s result is uncertain, the orchestrator may route it to another agent (or the
same agent with a refinement instruction) for additional evidence. This dynamic planning
mirrors human analysis.
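To illustrate tool integration, the sketch below exposes a Cypher query as a LangChain tool. The Neo4j connection details and the KG schema names (ThreatActor, USES, AttackPattern) are assumptions about how the STIX graph is materialized.

```python
# Sketch: a Cypher query wrapped as an agent tool. Schema labels and
# connection details are assumptions about the deployed KG.
from langchain_core.tools import tool
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

@tool
def attack_patterns_for_actor(actor_name: str) -> list[str]:
    """Return attack patterns linked to a threat actor in the STIX KG."""
    query = (
        "MATCH (a:ThreatActor {name: $name})-[:USES]->(p:AttackPattern) "
        "RETURN p.name AS name"
    )
    with driver.session() as session:
        return [record["name"] for record in session.run(query, name=actor_name)]

# The orchestrator would bind such tools to an LLM, e.g.:
#   llm_with_tools = llm.bind_tools([attack_patterns_for_actor])
```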
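And a minimal PyTorch Geometric sketch of the link-prediction step; random tensors stand in for STIX node features and KG edges, and the encoder is untrained, so the score is only meaningful after training on the real graph.

```python
# GraphSAGE encoder + dot-product decoder for KG link prediction.
import torch
from torch_geometric.nn import SAGEConv

class SAGEEncoder(torch.nn.Module):
    def __init__(self, in_dim: int, hid_dim: int):
        super().__init__()
        self.conv1 = SAGEConv(in_dim, hid_dim)
        self.conv2 = SAGEConv(hid_dim, hid_dim)

    def forward(self, x, edge_index):
        h = self.conv1(x, edge_index).relu()
        return self.conv2(h, edge_index)

num_nodes, in_dim = 100, 32
x = torch.randn(num_nodes, in_dim)                 # stand-in node features
edge_index = torch.randint(0, num_nodes, (2, 400)) # stand-in KG edges

encoder = SAGEEncoder(in_dim, 64)
z = encoder(x, edge_index)

# Score a candidate (actor, technique) edge with a dot-product decoder.
src, dst = 3, 42
score = torch.sigmoid((z[src] * z[dst]).sum())
print(f"predicted link probability: {score.item():.2f}")
```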
Alternative Approaches and Comparison
We considered several architectures:
● LLM + Static Graph: A baseline is to use an LLM to extract entities and a static
knowledge graph for relations. This can yield a structured KG, but it lacks dynamic
validation. Without RAG or multi-agent checks, the LLM may hallucinate and the graph
cannot update with new data. For example, GraphRAG systems combine a KG with LLMs, but our agentic design adds autonomy and refreshability.
● Transformer + Knowledge Graph: Here a fine-tuned transformer (e.g. a CTI-focused
BERT) identifies entities/TTPs and updates a graph. This achieves high precision on
known terms, but struggles with novel phrasing and cannot reason beyond its training
data. It also usually assumes a fixed schema. By contrast, our agents can query outside
data and re-run analysis on demand, giving higher recall on unseen variants.
● Dual-LLM: One could use two LLMs: one for extraction, another as a “critic” to validate
(a mini RAG approach). This reduces simple errors, but still operates purely on model
knowledge. Our multi-agent method generalizes this by allowing many specialized
“critics” and tool calls, not just a single validator.
● Fully Agentic: Our agentic architecture adds tool use and real-time feedback. It is more complex to orchestrate but yields the best coverage and adaptability. In practice, systems like ProvSEEK that use an agentic approach reportedly outperform retrieval-based methods in precision, recall, and threat-detection accuracy. We expect similar
gains: by actively querying evidence at each step, our solution maintains higher factuality
and broader coverage than any single-model pipeline.
Model and Tool Recommendations
We propose using a mix of open-source and state-of-the-art models:
● LLMs: For initial extraction and orchestration, capable LLMs include OpenAI’s GPT-4 or
newer (e.g. GPT-4o/5 if available) for their strong reasoning. However, cost and privacy
may favor open models: LLaMA-3 (via Meta) or Falcon can be fine-tuned on CTI texts.
Anthropic’s Claude (if accessible) also offers strong reasoning ability. Domain-adapted models like WizardLM or specialized security LLMs can boost performance. For initial tests, even Llama 2 (7B to 70B) or Mistral 7B, which have shown good CTI extraction ability, would suffice.
● Graph Models: For graph embeddings and link prediction, we recommend Graph
Neural Network libraries (e.g. PyTorch Geometric or DGL) with models like GraphSAGE
or GCN. These handle sparse CTI graphs effectively.
● Transformers for NLP: In addition to the LLMs, transformer encoders fine-tuned on
cybersecurity corpora can help. For example, SecBERT or CyberBERT
(RoBERTa/BERT models trained on security data) can improve the entity/TTP tagging
step. SciBERT might also help with technical jargon in CTI reports. We can also leverage models like DeBERTa or RoBERTa for NER tasks on IOCs and TTP names (an NER sketch appears after this list).
● Agent Frameworks: For orchestration, LangGraph’s graph-based workflows align with our needs. HuggingFace’s SmolAgents can serve lightweight tasks. We will
evaluate these frameworks for ease of integration and monitoring (e.g. using LangSmith
for logs).
● Retrieval and Vector Stores: We will use vector databases such as FAISS or Milvus to store embeddings of CTI documents, enabling fast RAG retrieval when validating IOCs (a retrieval sketch appears after this list). We will also use the stix2 Python library to construct and validate bundles, and common HTTP libraries (requests, httpx) to call CTI APIs.
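As a sketch of the NER step, the snippet below runs a general-purpose token-classification checkpoint (dslim/bert-base-NER); it stands in for a security-tuned model such as a SecBERT fine-tuned on IOC/TTP annotations, which would replace it in practice.

```python
# Transformer NER sketch; the checkpoint is a general-purpose stand-in
# for a security-domain token classifier.
from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="dslim/bert-base-NER",
    aggregation_strategy="simple",  # merge word pieces into entity spans
)
text = "FIN7 deployed Carbanak through phishing emails targeting PoS systems."
for entity in ner(text):
    print(entity["entity_group"], entity["word"], round(float(entity["score"]), 2))
```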
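Finally, a minimal sketch of the RAG retrieval path: embed CTI snippets, index them with FAISS, and fetch the best-supported evidence for a validation query. The embedding model and snippets are illustrative.

```python
# FAISS retrieval sketch for the validation step. Inner product on
# unit-normalized vectors is equivalent to cosine similarity.
import faiss
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # model choice is illustrative
snippets = [
    "APT29 used spearphishing links in the 2020 campaign.",
    "The loader establishes persistence via a scheduled task.",
]
vecs = model.encode(snippets, normalize_embeddings=True)

index = faiss.IndexFlatIP(vecs.shape[1])
index.add(vecs)

query = model.encode(["How does the malware persist?"], normalize_embeddings=True)
scores, ids = index.search(query, 1)  # top-1 supporting snippet
print(snippets[ids[0][0]], float(scores[0][0]))
```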