Developers Guide GraphRAG

The document is a comprehensive guide on GraphRAG, a system that enhances traditional retrieval-augmented generation (RAG) by integrating knowledge graphs to provide context-aware responses. It addresses the limitations of conventional RAG, which struggles with complex queries that require understanding relationships and context across documents. The guide outlines the structure, logic, and implementation of GraphRAG, emphasizing the importance of combining structured and unstructured data for improved AI responses.

EBOOK

THE DEVELOPER’S GUIDE TO

GraphRAG
Alison Cossette
Zach Blumenfeld
Damaso Sanoja
The Developer’s Guide to GraphRAG

Table of Contents

PART I: The Problem With Current RAG

PART II: What Makes It GraphRAG – Structure, Logic, and Meaning
  What Is RAG?
  What Is GraphRAG?
    1. Context-Aware Responses
    2. Traceability and Explainability
    3. Access to Structured and Unstructured Data
  How GraphRAG Works
  Prepare a Knowledge Graph for GraphRAG
  Ground With Unstructured and Structured Data

PART III: Constructing the Graph
  Create a Neo4j Database
  Ingest Unstructured Data
    Key Features of Neo4j GraphRAG Package
    Neo4j Connection
    Initialize the LLM and Embeddings
    Define Node Labels and Relationship Types
    Initialize and Run the Pipeline
    Process the PDF Document
    Create the Vector Index
  Ingest Structured Data
    Getting Started With Data Importer
    Import Structured Data
    Mapping Your Data to Graph Structures
    Mapping Relationships

PART IV: Implementing GraphRAG Retrieval Patterns
  Import Libraries
  Load Environment Variables and Initialize Neo4j Driver
  Initialize the LLM and Embedder
  The Basic Retriever Pattern
  The Graph-Enhanced Vector Search Pattern
  VectorCypher Retriever in Practice
  VectorCypher Retrieval: A Working Example
  Text2CypherRetriever
  Community Summary Pattern
  Concluding Thoughts and Next Steps

Appendix: Technical Resources in Workflow Order


PART I: The Problem With Current RAG

Why chunk-based RAG hits a ceiling, and why developers need more context to answer well

You’ve built a retrieval-augmented generation (RAG) system. You embedded the docs, connected the vector store, wrapped a prompt around the output, and deployed it. For a minute, it felt like you cracked the code. The model was grounded in your own data, giving answers that sounded smarter than base GPT.

Then reality hit.

The system works, but only under the most forgiving conditions. The moment you ask a question that spans documents, relies on implicit context, or touches anything complex or structured, the cracks start to show. Answers get vague. Sometimes they’re just plain wrong. Or worse, the system confidently quotes the right chunk but misses the point entirely.

Your RAG system isn’t broken. It’s just blind.

RAG retrieves semantically similar text, but it doesn’t know how the pieces fit together.

It has no map of your domain. No memory of what matters. It’s like hiring a new developer and giving them a stack of index cards with code snippets from your repo. They can parrot back functions, maybe even modify them, but they don’t understand the architecture. They don’t know the “why,” only the “what.”

That’s the ceiling of traditional RAG. And that’s what this book is here to fix.

Here’s the core issue: RAG retrieves based on similarity, not understanding.

You give it a query, it vectorizes that query, and fetches the top-k similar chunks. That’s fine if the answer you need lives entirely within isolated chunks. But most real-world questions don’t work that way.

Let’s say a user asks about a contract clause, but the meaning depends on a sales addendum from three weeks earlier. Or maybe they ask a support question that only makes sense in the context of their infrastructure and license tier. The information is there, but it’s scattered across multiple documents, formats, and timelines. Chunk-based retrieval can’t bridge that gap.

Traditional RAG doesn’t have shared context across documents. That’s because it doesn’t track relationships. It doesn’t know which concepts are upstream, downstream, dependent, or mutually exclusive. It doesn’t distinguish between definitions, instructions, timelines, policies, or decision logic.

The bottom line: Traditional RAG treats all chunks as equal, flat, unstructured blobs of text.

Even more problematic is that the system has no mental model for your business. It cannot understand what a “customer” is in your world. Or how a support ticket relates to a contract. Or what a system diagram implies about downstream integrations. The mental model that represents the structure behind your content is absent in RAG.

Without it, RAG can’t reason. It can only retrieve, and that isn’t enough.

You already know what your RAG system should be able to do. It’s the kind of reasoning your team does every day without thinking. Consider this: If a customer reaches out to your support team, the employee will listen to the customer’s concern, look up their account and tech stack, check previous service requests, etc. When answering the customer’s question, the employee brings context. They may answer differently if the person is a new customer vs. a long-term customer.

You want your RAG application to do what humans do naturally: use context to inform its answer. As examples, you might want the RAG system to:

• Answer a support question and understand the user’s tech stack, contract level, and product version.
• Explain a contract term and know what the sales path looked like, who signed off, and which systems were impacted.
• Interpret a customer review and place it in context with purchase history, usage data, and net promoter score (NPS).

These shouldn’t feel like advanced use cases; they’re basic context. They’re what you, as a human developer, bring into every decision without even realizing it. And that’s the problem: Your RAG system has none of that. Sure, it has some document metadata available, but no user metadata, no business logic, no connected data, just isolated chunks in a vector store. But RAG can’t use what it can’t see. So until you give it structure, until you teach it relationships, timelines, ownership, and dependencies, it will keep retrieving the right words for the wrong reasons.

This isn’t a whitepaper. It’s a build-it-yourself playbook. We’re going to walk you through:

• Ingesting documents and turning them into a knowledge graph
• Structuring real-world context from messy PDFs, CSVs, and APIs
• Building retrievers that combine vector search and graph traversal
• Using text-to-query generation to run dynamic Cypher queries (a query language for graphs) and pull precise information and calculations from your data

And we’re going to do it with code. No fluff. Just the stack, the logic, and the patterns that actually work.

If you’ve built RAG, and you know it’s not enough, then this is the guide to take you further.

PART II: What Makes It GraphRAG – Structure, Logic, and Meaning

To understand GraphRAG, let’s explore its foundational components, RAG and knowledge graphs, and why they work so well together.

What Is RAG?

Let’s start with the well-known problems of large language models (LLMs), which power chatbots such as ChatGPT, Gemini, and Claude. When a user’s prompt goes directly to the LLM, it generates a response based on its training data. Due to the probabilistic nature of response generation, LLMs often produce responses that lack accuracy and nuance and don’t draw on knowledge specific to your business. In addition, the LLM in question may have limited explainability, which limits its adoption in enterprise settings.

RAG addresses these challenges by intercepting a user’s prompt, querying external data, usually a vector store, and passing relevant documents back to the LLM. Adding retrieval to the LLM enables the application to answer questions with knowledge from a specific dataset. This simple technique suddenly makes it possible to build applications for a variety of use cases. As examples:

• Knowledge assistants can tap into company-specific information for accurate, contextual responses.
• Recommendation systems can incorporate real-time data for more personalized suggestions.
• Search APIs can deliver more nuanced and context-aware results.

RAG consists of three key components:

• An LLM that serves as the generator
• A knowledge base or database that stores the information to be retrieved
• A retrieval mechanism to find relevant information from the knowledge base, based on the input query

Figure 1. Querying a knowledge graph with an LLM
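The three RAG components listed above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the knowledge base entries are made up, shared-word counting stands in for real embedding similarity, and `llm` is any callable you supply (no specific model API is implied).

```python
import re

# Hypothetical knowledge base: in a real system these would be document chunks.
KNOWLEDGE_BASE = [
    "JavaCo coffee makers ship with a two-year warranty.",
    "Support tickets are answered within one business day.",
    "The Professional license tier includes priority support.",
]


def tokenize(text):
    """Lowercase the text and split it into a set of simple word tokens."""
    return set(re.findall(r"[a-z0-9'-]+", text.lower()))


def retrieve(query, k=2):
    """Retrieval mechanism: rank chunks by shared words with the query.

    Shared-word counting is a toy stand-in for embedding similarity.
    """
    query_tokens = tokenize(query)
    ranked = sorted(
        KNOWLEDGE_BASE,
        key=lambda chunk: len(query_tokens & tokenize(chunk)),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query, k=2):
    """Wrap the retrieved chunks and the user question into one prompt."""
    context = "\n".join(f"- {chunk}" for chunk in retrieve(query, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"


def answer(query, llm, k=2):
    """Generator step: `llm` is any callable that maps a prompt to text."""
    return llm(build_prompt(query, k))
```

Calling `answer("What warranty do JavaCo coffee makers have?", llm=my_model)` would pass the warranty chunk to whatever callable `my_model` is. Note what the sketch cannot do: it has no way to connect the warranty chunk to a specific customer or purchase, which is the gap the rest of the guide addresses.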
The quality of a RAG response depends heavily on the database type the information is retrieved from. If you use a vector store (as in traditional RAG), the process goes like this: The user query is turned into a vector, which is then used to retrieve semantically similar text chunks from a vector database. While retrieval based on semantic similarity can work across multiple documents, it often falls short when questions require understanding implicit context or relationships that span those documents. Traditional RAG treats each chunk in isolation, as it lacks a holistic view of the domain.

Retrieval based on semantic similarity can only get you so far. And this is where GraphRAG comes in. GraphRAG gives the LLM a mental model of your domain so that it can answer questions by drawing on the correct context.

What Is GraphRAG?

In GraphRAG, the knowledge base used for retrieval is a knowledge graph. A knowledge graph organizes facts as connected entities and relationships, which helps the system understand how pieces of information relate to each other.

The knowledge graph becomes a mental map of your domain, providing the LLM with information about dependencies, sequences, hierarchies, and meaning. This makes GraphRAG especially effective at answering complex, multi-step questions that require reasoning across multiple sources.

Imagine that a customer calls to request support regarding a recent purchase. Customer Service uses an internal chatbot to troubleshoot the request. A traditional system built on vector-only RAG would retrieve a product name from the customer support ticket:

Service Ticket | Service Ticket Text                         | Embedding
234381         | My new JavaCo coffee maker isn’t working.   | [.234, .789, .123……]

But that’s all the RAG system would surface.

A GraphRAG system, on the other hand, would show not only this service ticket text but also the customer’s purchase history, known issues with that product version, related documentation, and prior support conversations.

Figure 2. Order issue flow

A knowledge graph holds all related information together across both structured and unstructured data. A RAG system built on a knowledge graph, or GraphRAG, excels at generating context-aware responses.

The main reasons to implement a GraphRAG solution include:

1. Context-Aware Responses
Unlike traditional RAG, which retrieves isolated chunks of text based on similarity, GraphRAG retrieves facts in context. Since the knowledge graph explicitly encodes relationships, GraphRAG returns relevant information, as well as related information. This structured retrieval ensures that application outputs are comprehensive, reducing hallucinations and leading to more accurate, reliable outputs and improving real-world applicability.

2. Traceability and Explainability
LLMs and even standard RAG approaches operate as black boxes, making it difficult to know why and how a certain answer was generated. GraphRAG increases transparency by structuring retrieval paths through the knowledge graph. The knowledge graph will show the sources and relationships that contributed to a response. This makes it easier to audit results, build trust, and meet compliance needs.
3. Access to Structured and Unstructured Data
GraphRAG overcomes a key limitation of vector-only RAG by integrating both structured and unstructured data. It integrates information like whole databases, ontologies, documents, and real-time streams into a single knowledge graph. Richer data means superior AI responses.

How GraphRAG Works

GraphRAG works by using a knowledge graph to retrieve and connect relevant information. It starts with a search (vector, full-text, spatial, or others) to find entry points in the graph, then follows related nodes and relationships to gather more context. The system considers the user’s task and filters and ranks the results before passing them to the generation phase.

Think of GraphRAG as a RAG architecture built on a knowledge graph. Using a knowledge graph affects the way you design the entire solution. There are two main steps to creating a GraphRAG application:

1. Preparing a knowledge graph for GraphRAG
   • Documents and unstructured text ingestion
   • Structured data source import
2. Implementing GraphRAG retrieval patterns

Figure 3. Implementing GraphRAG retrieval patterns flow

The rest of this book walks you through these two critical steps.

Prepare a Knowledge Graph for GraphRAG

Effective retrieval in GraphRAG starts with a well-structured knowledge graph. The data needs to be structured to model the business domain as it relates to the documents. That means having a clear data model that defines both the content you’re working with and how it is connected.

There are two aspects to consider when you’re modeling a knowledge graph for AI workflows:

1. The relationships between documents, or how your content is organized and related:
   • How chunks connect to source documents
   • How sections of a book or catalog are structured
   • How content is grouped or nested
2. Business entities and logic:
   • The core entities (e.g., Customers, Products, Companies)
   • How these entities relate to each other
   • The structure and relationships that already exist in your current databases, schemas, or business logic

These two layers, the document structure and the business domain, work together to give GraphRAG its power. GraphRAG is retrieving documents in the context of your business. Consider a customer review in the context of their purchase history or a user’s question in the context of their technical stack.

The first step is to determine where you can access that business domain and how to connect it to your documents. It might be well defined in your structured data (databases, business hierarchies, etc.) or it may be hidden inside your unstructured content (e.g., contract terms, product features). A knowledge graph brings it all together, connecting the dots so your LLM retrieves not just semantically similar text but also relevant facts.
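The flow described under How GraphRAG Works (search for entry points, follow relationships, then rank) can be sketched on a toy in-memory graph. Everything here is an assumption made for illustration: the node ids, labels, relationship names, and the keyword match that stands in for vector or full-text search are hypothetical, and ranking by hop distance is just one simple choice. A real implementation would run against Neo4j.

```python
# Toy knowledge graph built around the JavaCo service ticket example.
# All ids, labels, and relationship types are hypothetical.
NODES = {
    "ticket:234381": {"label": "ServiceTicket",
                      "text": "My new JavaCo coffee maker isn't working."},
    "customer:42": {"label": "Customer", "text": "Customer since 2021"},
    "product:javaco-v2": {"label": "Product",
                          "text": "JavaCo coffee maker, version 2"},
    "issue:7": {"label": "KnownIssue",
                "text": "Version 2 heating element fails after a firmware update"},
    "doc:manual": {"label": "Document", "text": "JavaCo troubleshooting manual"},
}
EDGES = [
    ("customer:42", "SUBMITTED", "ticket:234381"),
    ("ticket:234381", "ABOUT", "product:javaco-v2"),
    ("product:javaco-v2", "HAS_KNOWN_ISSUE", "issue:7"),
    ("product:javaco-v2", "DOCUMENTED_IN", "doc:manual"),
]


def find_entry_points(query):
    """Step 1: find entry nodes. Keyword overlap stands in for vector search."""
    words = set(query.lower().split())
    return [
        node_id for node_id, node in NODES.items()
        if node["label"] == "ServiceTicket"
        and words & set(node["text"].lower().split())
    ]


def expand(entry, max_hops=2):
    """Step 2: breadth-first traversal that records the path to each node."""
    paths = {entry: []}
    frontier = [entry]
    for _ in range(max_hops):
        next_frontier = []
        for node in frontier:
            for src, rel, dst in EDGES:
                if node in (src, dst):
                    other = dst if node == src else src
                    if other not in paths:
                        paths[other] = paths[node] + [(src, rel, dst)]
                        next_frontier.append(other)
        frontier = next_frontier
    return paths


def graph_retrieve(query, max_hops=2):
    """Step 3: rank results so that nodes closer to the entry point come first."""
    results = []
    for entry in find_entry_points(query):
        for node_id, path in expand(entry, max_hops).items():
            results.append(
                {"node": node_id, "text": NODES[node_id]["text"], "path": path}
            )
    return sorted(results, key=lambda result: len(result["path"]))
```

Because each result carries the relationships that led to it, the context handed to the LLM is traceable back through the graph, which is the explainability benefit described earlier.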
Ground With Unstructured and Structured Data

If you’ve worked with RAG systems, you’re already familiar with vector databases and unstructured content: PDFs, contracts, reports. But the most important context for your data rarely lives in a single format. In fact, most of the time you’ll want to use more than just unstructured data. Structured data like CRM exports, product catalogs, and relational databases often contains crucial grounding information for the answers your users need.

To build systems that retrieve the right answer at the right time, you need to connect two worlds: unstructured and structured. That’s where knowledge graphs come in. By linking unstructured chunks to structured business entities and relationships, you create a semantic network that makes retrieval smarter, safer, and more transparent. So, where do you start? With your documents or your structured schema?

Technically, you can begin from either side. But in practice, most teams start with unstructured data because that’s where the buried context usually lives. Think financial disclosures, legal contracts, emails, and support tickets. These contain implicit business logic, risk factors, and decision-making signals that don’t show up in structured rows and columns.

But here’s the catch: Structure isn’t binary. It’s a continuum.

Figure 4. Structured and unstructured data continuum

At one end, you’ve got relational databases and clean CSV files, where entities and relationships are explicitly defined. At the other end, you’ve got raw text: meaning buried in natural language. In between? A complex middle: XML files, JSON logs, form submissions, and mixed-format documents with both tables and prose.

As you think about your own dataset, ask yourself these questions: Where does the context for your application actually live? And where on the structure continuum does it fall? These questions matter because they will help you determine the tools you should use to build the knowledge graph. For this guide, you’ll use:

• Neo4j Data Importer (Neo4j Aura Platform) for structured data
• Knowledge Graph Builder Pipeline (Neo4j GraphRAG Python Package) for extracting implicit relationships from natural language

If you find that your dataset has more complex data structures, you can consider adding tools to your workflow. This is an ever-evolving field, and many are working on building tools for these scenarios. A few to consider:

Tool | Description | Resource
[Link] | Extracts structured data (tables, lists, key-value pairs) from unstructured documents like PDFs, HTML, and email | Neo4j Integration Guide
Boundary’s Annotation Modeling Language (BAML) | Declarative language for extracting structured data from unstructured sources, demonstrated with Neo4j | BAML to Neo4j Tutorial by Jason Koo
pdfplumber | Parses tables and text from PDF files, ideal for extracting structured data from documents | GitHub Repository
LangChain | Framework for developing applications powered by language models, with support for Neo4j integration | Neo4j Integration
For this exercise, you’ll start with unstructured financial documents. Using an LLM-powered pipeline to extract entities like Company and Risk Factor, you’ll look for relationships such as FACES_RISK to build a knowledge graph in Neo4j. This process mirrors what many teams face: extracting meaning from dense reports, contracts, or disclosures.

You’ll then use Neo4j’s Data Importer to load structured datasets (the kind of CSVs or database connectors most companies already have), further enriching the graph with known entities and relationships.

Finally, you’ll test retrieval strategies, from vector search to graph-enhanced queries to dynamic Cypher generation with Text2Cypher. The same process can be applied to your own PDFs, internal databases, and business domain to build a semantic layer over enterprise knowledge, making it accessible to GenAI systems with precision, transparency, and context.

PART III: Constructing the Graph

Create a Neo4j Database

Begin by choosing a Neo4j database solution that fits your needs. Options include a free instance of AuraDB or a free trial of AuraDB Professional. Neo4j is also available on all the major cloud partner marketplaces. When you navigate to [Link][Link] and log in, you’ll see the following screen, inviting you to create your first instance.

Figure 5. Create your first instance screen

You then have three instance types to choose from:

• AuraDB Free, a small database (2 GB) that will always be free, though it will be deleted after 30 days of no activity.
• AuraDB Professional offers up to 128 GB of memory and a free 14-day trial.
• AuraDB Business Critical is the most robust and offers up to 512 GB of memory and pay-as-you-go billing.

Figure 6. New instance tiers

If you’re just getting started, you’ll do well with AuraDB Free or the AuraDB Professional trial.

Tip: Download your AuraDB credentials (URI, username, password) immediately after creating the instance. They will not be available for download later. Store them securely, as you’ll need them to connect your application to Neo4j.
Be sure to download the credentials when you set up the database because they won’t be available later on.

Figure 7. Credential download and continue screen

Ingest Unstructured Data

As you begin to build your knowledge graph, you can use the Neo4j GraphRAG Python library. This package offers specialized functionalities that streamline and enhance the process of building a knowledge graph from unstructured data, such as PDFs. Capabilities include document chunking, embedding generation, and knowledge graph construction.

    pip install neo4j-graphrag

Figure 8. Document flow

Key Features of Neo4j GraphRAG Package

• Knowledge Graph Construction Pipeline: Automates the extraction of entities and relationships from unstructured text and structures them into a Neo4j graph.
• Vector Indexing and Retrieval: Facilitates the creation of vector indices for efficient semantic search within the graph.
• Integration with LLMs: Seamlessly integrates with LLMs for tasks like entity extraction and relation identification.
• Document Chunking and Storage: The package uses the SimpleKGPipeline class to automate chunking and storage. This class handles the parsing of documents, the chunking of text, and storage of chunks as nodes in Neo4j.

    from neo4j import GraphDatabase
    from neo4j_graphrag.[Link].kg_builder import SimpleKGPipeline
    from neo4j_graphrag.llm import OpenAILLM
    from neo4j_graphrag.embeddings import OpenAIEmbeddings
    from neo4j_graphrag.[Link] import ERExtractionTemplate
    from dotenv import load_dotenv
    import os, time, asyncio, glob, csv

• neo4j: Official Python driver for interacting with a Neo4j database.
• GraphDatabase: Connects to Neo4j to interact with the graph database.
• SimpleKGPipeline: Automates chunking, entity recognition, and storage in Neo4j.
• OpenAILLM: Integrates GPT-4 for text-based processing and knowledge extraction.
• OpenAIEmbeddings: Handles vector embeddings to enable semantic search in Neo4j.
• ERExtractionTemplate: Supplies prompt templates for entity-relation extraction.

The LLM does the thinking by extracting meaningful concepts from text. The embedder turns the text into vectors, which lets your system perform semantic search later.

Neo4j Connection

You’ll use GraphDatabase from the Neo4j Python driver to connect to Neo4j Graph Database.

    driver = [Link](NEO4J_URI,
                    auth=(NEO4J_USER, NEO4J_PASSWORD))
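As a hedged illustration of the connection step, the sketch below reads the three settings from the environment and opens a driver. It assumes the elided call in the snippet above is the standard `GraphDatabase.driver` constructor of the official Neo4j Python driver, and that the variable names NEO4J_URI, NEO4J_USER, and NEO4J_PASSWORD match your setup; the helper names `load_neo4j_settings` and `connect` are hypothetical.

```python
import os

# Setting names follow the guide's snippet; adjust them if your credentials
# file uses different keys (this is an assumption, not a requirement).
REQUIRED_SETTINGS = ("NEO4J_URI", "NEO4J_USER", "NEO4J_PASSWORD")


def load_neo4j_settings(env=os.environ):
    """Collect the connection settings and fail early if any are missing."""
    missing = [name for name in REQUIRED_SETTINGS if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing Neo4j settings: {', '.join(missing)}")
    return {name: env[name] for name in REQUIRED_SETTINGS}


def connect(settings):
    """Open a driver and confirm that the database is reachable."""
    # Imported here so the settings check can run without the driver installed.
    from neo4j import GraphDatabase

    driver = GraphDatabase.driver(
        settings["NEO4J_URI"],
        auth=(settings["NEO4J_USER"], settings["NEO4J_PASSWORD"]),
    )
    # verify_connectivity() raises if the URI or credentials are wrong,
    # which surfaces configuration mistakes before the pipeline runs.
    driver.verify_connectivity()
    return driver
```

A typical call would be `driver = connect(load_neo4j_settings())` after loading your credentials file into the environment, for example with `load_dotenv()`.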
The Developer’s Guide to GraphRAG

Note that the required credentials can be found in the Defining your nodes and relationships in two lists is
.txt file you downloaded when you created the instance. a key moment in the knowledge graph construction
process. This is when you determine the data model.
These lists control what the SimpleKGBuilder will
look for in the text and how it will organize that
information in your graph. To understand how you
might want to construct these lists, let’s take a look
at some general ideas.

Entities = Nouns
Figure 9. Credentials from .txt file
What are the real-world concepts you’re trying
• NEO4J_URI: The database URL (e.g., to capture?
“neo4j+s://[Link].neo4j.
Company, Executive, RiskFactor, Product —
io”)
whatever matters to your domain.
• auth=(NEO4J_USER, NEO4J_PASSWORD):
Credentials to authenticate Relationships = Verbs or Connectors
How do those concepts relate?
Initialize the LLM and Embeddings

llm = OpenAILLM(model_name=”gpt-4o”, api_key=openai_api_key)


Perhaps a Company  FACES_RISK  RiskFactor,
dimensions = 1536 or Company  ISSUED_STOCK  StockType.
embedder = OpenAIEmbeddings(api_key=openai_api_key)
If you aren’t sure which entities and relationships
to include in your first project, ask yourself: What
• llm: Uses GPT-4o to extract entities, information would help my chunk provide a better
relationships, and summarize text. answer? Alternatively, what information connects
• embedder: Generates vector embeddings various chunks? Ultimately, you want to think
to enable semantic search and contextual through the application’s use case and start with
retrieval. the entities and relationships that will move the
needle the most on your project. This step isn’t just
Define Node Labels and Relationship Types
configuration; it’s your chance to define the mental
entities = [
model of your data.
{“label”: “Executive”, “properties”: [{“name”: “name”,
“type”: “STRING”}]},
Initialize and Run the Pipeline
{“label”: “Product”, “properties”: [{“name”: “name”,
“type”: “STRING”}]},
pipeline = SimpleKGPipeline(
{“label”: “FinancialMetric”, “properties”: [{“name”:
“name”, “type”: “STRING”}]}, driver=driver,
{“label”: “RiskFactor”, “properties”: [{“name”: “name”, llm=llm,
“type”: “STRING”}]},
embedder=embedder,
{“label”: “StockType”, “properties”: [{“name”: “name”, entities=entities,
“type”: “STRING”}]},
relations=relations,
{“label”: “Transaction”, “properties”: [{“name”:
“name”, “type”: “STRING”}]}, enforce_schema=”STRICT”)
{“label”: “TimePeriod”, “properties”: [{“name”: “name”,
“type”: “STRING”}]},
{“label”: “Company”, “properties”: [{“name”: “name”, The SimpleKGPipeline sets up a structured
“type”: “STRING”}]}
pipeline for extracting and storing knowledge from
]
relations = [
unstructured text into a graph database. It starts
{“label”: “HAS_METRIC”, “source”: “Company”, “target”:
with the driver, which is the Neo4j connection
“FinancialMetric”}, used to write data into the graph. The llm parameter
{“label”: “FACES_RISK”, “source”: “Company”, “target”:
“RiskFactor”}, specifies the language model that will interpret
{“label”: “ISSUED_STOCK”, “source”: “Company”, and extract meaningful entities and relationships
“target”: “StockType”},
from the input text. The embedder is the embedding
{“label”: “MENTIONS”, “source”: “Company”, “target”:
“Product”}
] 11
The Developer’s Guide to GraphRAG

model used to vectorize text, which supports of the chunk text. This is how your retriever finds the
similarity-based retrieval alongside structured relevant chunk in your application: by comparing the
querying. embedding of the query and the embeddings in your
data store.
The entities and relations define the schema:
what kinds of objects (like Customers, Contracts,
Products) and relationships (like HAS_CONTRACT,
CONTAINS, REFERENCES) the pipeline should look
for. Finally, enforce_schema=True ensures that
only the entity and relationship types that have been
explicitly defined in those lists are allowed into the
graph. This prevents schema drift and keeps the
resulting knowledge graph clean and reliable.

Process the PDF Document


Running the pipeline involves I/O-heavy operations:

• Calling the LLM to extract structured meaning from text
• Generating embeddings via an external API
• Writing data into Neo4j

All of these are network-bound and would block the main thread in a normal synchronous setup. That’s why the pipeline is designed to be asynchronous, so these operations can run concurrently and efficiently.

To execute it, you need to use Python’s async / await syntax. The await keyword tells Python: “Pause this function while we wait on an external operation, but don’t freeze the whole program.”

import asyncio

async def run_pipeline_on_file(file_path, pipeline):
    await pipeline.run_async(pdf_path=file_path)

If you’re calling this inside another async function, it will work by itself. If you’re in a regular script or notebook, you’ll need to run it inside an event loop. If you’re unfamiliar with it, don’t worry: you can treat await pipeline.run_async() like a normal function call, as long as it’s inside an async context.

for pdf_file in pdf_files:
    asyncio.run(run_pipeline_on_file(pdf_file, pipeline))

As you can see in the image below, the document and chunk nodes have been created and written to the database. Note that there is now a property on the node called embedding, which represents the vector.

Figure 10. Node details

Create the Vector Index
A vector index is a type of database index that enables fast similarity search over high-dimensional vectors, such as embeddings from models like OpenAI’s. Unlike traditional indexes that look for exact matches, vector indexes retrieve items most similar to a query vector using metrics like cosine similarity or Euclidean distance.

In the context of Neo4j and RAG, here’s what you need to know:

1. Each node (e.g., a Chunk) stores an embedding, a numeric representation of its semantic content.
2. The vector index organizes these embeddings so that, given a new query embedding, Neo4j Graph Database can quickly retrieve the most similar nodes.
3. This capability is essential for semantic search, question answering, and other AI-powered applications where meaning and context matter more than exact keywords.
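To make the similarity metric concrete, here is a minimal sketch of cosine similarity and a brute-force top-k search in plain Python. The function names and the toy three-dimensional vectors are illustrative assumptions, not part of Neo4j or the GraphRAG library; a vector index exists precisely so the database does not have to scan every embedding like this.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)

def top_k_similar(query_vec, chunk_embeddings, k=2):
    """Score every chunk against the query and keep the k best (full scan)."""
    scored = [
        (chunk_id, cosine_similarity(query_vec, emb))
        for chunk_id, emb in chunk_embeddings.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy embeddings (hypothetical); real embeddings have many more dimensions.
chunks = {
    "chunk-a": [1.0, 0.0, 0.0],
    "chunk-b": [0.9, 0.1, 0.0],
    "chunk-c": [0.0, 1.0, 0.0],
}
ranked = top_k_similar([1.0, 0.0, 0.0], chunks, k=2)
print(ranked)
```

The two chunks pointing in nearly the same direction as the query rank first, while the orthogonal one is dropped.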


By using a vector index, Neo4j enables scalable, real-time retrieval of relevant knowledge from large and complex graphs.

from neo4j_graphrag.indexes import create_vector_index

create_vector_index(
    driver,
    name="chunkEmbeddings",
    label="Chunk",
    embedding_property="embedding",
    dimensions=1536,
    similarity_fn="cosine",
)

Ingest Structured Data

Getting Started With Data Importer
Neo4j Data Importer provides a streamlined process for bringing structured data into your graph database. Here’s how to use this powerful tool. The Neo4j Aura console includes a dedicated Data Importer feature that allows you to transform tabular data into graph structures without writing code. This tool works well for quickly populating your knowledge graph with data from existing datasets.

Import Structured Data

To import your data:

1. Navigate to Import > Data Importer in the Neo4j Aura console.

Figure 11. Neo4j Aura Data Importer

2. Create a new graph model.

Figure 12. New graph model screen

3. A graph data model has been provided for your convenience. Note: Due to pathway differences between operating systems, please choose either the Mac or the Windows data model.

Figure 13. Selecting model starting point screen

4. Once you’ve loaded the provided data model, click Browse and navigate to the data folder in your repository, selecting both the Asset_Manager_Holdings.csv and Company_Filings.csv files.

Figure 14. Browse to .csv files screen


5. Once the files are connected, you’ll see that the data model has check marks for each entity and relationship. Click Run Import in the upper right-hand corner.

Figure 15. Run import screen

Mapping Your Data to Graph Structures

To get you started, we’ve given you a full, completed data model for this exercise. When working with your own data, you’ll create these data model maps yourself.

If you’d like to work with your own dataset, here’s how to get started. The Neo4j Aura console provides a unified experience where you can manage your database instances, connect to diverse data sources, import structured data, model graphs visually, query your data with Cypher, explore your graph, and more.

When navigating to Import > New Data Sources, you’re presented with many possible connectors. For our case, there are two CSVs in this dataset: Asset_Manager_Holdings.csv and Company_Filings.csv.

Figure 16. New data source connectors screen

Once you’ve uploaded these CSV files, you’ll be given a choice as to how to proceed. Click Define Manually to begin building your data model.

First, you’ll see a blank node, and on the right-hand side, you’ll see the parameters for that node, including Label, Table, and Properties.

Figure 17. Node parameters options screen

Label refers to the type of node. Table points to the data source the information comes from (the tables you uploaded will appear on the left). Properties refer to the values you want associated with that node. Let’s start with Company_Filings.csv.

Company Node
Label: Company
Table: Company_Filings.csv
Properties: name, ticker
ID(key): name

You’ll also need to identify the unique ID property for that node, akin to the primary key, which in this case is the name of the company. This is done by clicking the key icon next to the property name.

Figure 18. Company node screen
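A key property only works as an identifier if each value consistently refers to one entity. As a minimal sketch, the check below flags blank keys and keys associated with more than one value of another column before you upload a file. The helper name and the sample rows are hypothetical and are not taken from the actual Company_Filings.csv.

```python
import csv
import io
from collections import defaultdict

def check_key(csv_text, key_column, prop_column):
    """Return (line numbers with a blank key, keys with conflicting property values)."""
    values_by_key = defaultdict(set)
    blank_rows = []
    reader = csv.DictReader(io.StringIO(csv_text))
    # The header is line 1 of the file, so data rows start at line 2.
    for line_no, row in enumerate(reader, start=2):
        key = (row[key_column] or "").strip()
        if not key:
            blank_rows.append(line_no)
            continue
        values_by_key[key].add(row[prop_column])
    conflicts = {k: sorted(v) for k, v in values_by_key.items() if len(v) > 1}
    return blank_rows, conflicts

# Hypothetical sample rows with one misspelled ticker and one missing name.
sample = """name,ticker
APPLE INC,AAPL
AMAZON,AMZN
APPLE INC,APPL
,MSFT
"""
blank_rows, conflicts = check_key(sample, "name", "ticker")
print(blank_rows, conflicts)
```

Repeated key values are not a problem in themselves (a company can appear on many rows); the check only reports rows that would produce a missing or ambiguous node.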


Document Node
Label: Document
Table: Company_Filings.csv
Properties: path (this must match exactly; read below)
ID(key): path

CRITICAL STEP: Rename Your Path Column to path
The kg_builder has already created Document nodes using a path property. To correctly link companies to their documents, your imported data must use the exact same property name: path.

Figure 19. Path property screen

⚠ If you skip this renaming step, the relationship will NOT connect and your graph will be incomplete.

The CSV includes two columns with OS-specific paths:

• path_Windows for Windows users
• path_Mac_ix for macOS/Linux users

Choose the appropriate column based on your operating system and rename it to path during import:

1. Pick the column for your system.
2. Rename that column to exactly: path (lowercase, no quotes).

Even though Document nodes already exist, we’re now creating relationships between each Company and its corresponding Document. This connection bridges structured (Company) and unstructured (Document) data, enabling advanced retrieval and reasoning across your graph.

Asset Manager Node
Label: AssetManager
Table: Asset_Manager_Holdings.csv
Properties: managerName
ID(key): managerName

Figure 20. AssetManager node screen

Mapping Relationships

Relationships are created with the following criteria:

• Relationship Label: Describes the type of connection between the entities. It is common practice in knowledge graphs for relationship labels to be in ALL_CAPS with no spaces.
• Table: Contains identifiers for each node type. It is the way we connect the two nodes.
• Node ID Mapping: Maps the columns in the relevant table to the IDs of the pertinent nodes.
• Properties: Adds information to a relationship or entity.


Next, let’s create connections between and among these entities. In our domain, the Asset Managers own stock in various companies. Here’s a sample from Asset_Manager_Holdings.csv:

managerName             companyName     ticker  shares
ALLIANCEBERNSTEIN L.P.  AMAZON          AMZN    50065439
ALLIANCEBERNSTEIN L.P.  APPLE INC       AAPL    28143032
ALLIANCEBERNSTEIN L.P.  INTEL CORP      INTC    5735993
ALLIANCEBERNSTEIN L.P.  MCDONALDS CORP  MCD     1201960
ALLIANCEBERNSTEIN L.P.  MICROSOFT CORP  MSFT    46541943

In a knowledge graph, we want to map the domain knowledge of structured data, which in this case is the Asset Managers’ ownership of stock in a given company. If entities are nouns, then relationships are verbs. So let’s create the relationship OWNS that goes from Asset Manager to Company.

1. Click on the AssetManager node. You’ll see a blue outline of the node:

Figure 21. AssetManager blue outline

2. Hover over the outline until it turns gray:

Figure 22. AssetManager gray outline

3. Drag the outline of the AssetManager node to cover the Company node. When you release, you’ll see a new relationship arrow between them:

Figure 23. Drag and release for new relationship

Clicking on this arrow allows you to edit the parameters of the relationship.

OWNS Relationship
Relationship Type: OWNS
Table: Asset_Manager_Holdings.csv
Node ID Mapping
From:
    Node: AssetManager
    ID: managerName
    ID column: managerName
To:
    Node: Company
    ID: name
    ID column: companyName
Properties: shares

Figure 24. OWNS relationship
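One way to picture the OWNS mapping is as a lookup of the two nodes by their ID columns followed by a MERGE of the relationship, once per CSV row. The sketch below is an assumption about the equivalent logic, not the Data Importer’s actual implementation: the Cypher statement and the rows_for_owns helper are hypothetical, and the sample rows are copied from the table above.

```python
import csv
import io

# Hypothetical Cypher equivalent of the Node ID Mapping:
# "From" matches AssetManager.managerName, "To" matches Company.name.
OWNS_CYPHER = """
UNWIND $rows AS row
MATCH (m:AssetManager {managerName: row.managerName})
MATCH (c:Company {name: row.companyName})
MERGE (m)-[o:OWNS]->(c)
SET o.shares = row.shares
"""

def rows_for_owns(csv_text):
    """Turn CSV rows into parameter dicts, converting shares to an integer."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        rows.append({
            "managerName": row["managerName"],
            "companyName": row["companyName"],
            "shares": int(row["shares"]),
        })
    return rows

sample = """managerName,companyName,ticker,shares
ALLIANCEBERNSTEIN L.P.,AMAZON,AMZN,50065439
ALLIANCEBERNSTEIN L.P.,APPLE INC,AAPL,28143032
"""
rows = rows_for_owns(sample)
print(rows)

# With a live connection, the statement could be run as, for example:
# driver.execute_query(OWNS_CYPHER, rows=rows)
```

Note that the ID column in the file (companyName) and the key property on the node (name) can differ; the mapping is what ties them together.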


The property shares represents the number of shares of the Company owned by the Asset Manager, and for this book it is an optional inclusion. Additional columns such as value or sharevalue are optional as well. When working with your own data, consider whether a property will have value for your use case. Will you be asking to rank based on shares owned? Does the total value of the holding have relevance to your application? Additional information on data modeling can be found at GraphAcademy.

FILED Relationship
Note that the relationship between Company and Document is the linchpin that connects the structured and the unstructured data in this GraphRAG application.

Relationship Type: FILED
Table: Company_Filings.csv
Node ID Mapping
From:
    Node: Company
    ID: name
    ID column: companyName
To:
    Node: Document
    ID: path
    ID column: path_Windows or path_Mac_ix

Figure 25. FILED relationship

As you see in the diagram above, each entity and relationship will have a green check mark when it has been properly mapped. Now you’re ready to run the import. Click the blue Run import button in the upper right corner of the screen.

Figure 26. Run import button

Now that your unstructured and structured data is loaded, you can use the Explore and Query functions to refine your graph structure and data to accurately represent your business domain. Use Explore to visualize and navigate your graph with Neo4j Bloom and Query to investigate the graph.

For a detailed walkthrough of graph data modeling, see The Developer’s Guide: How to Build a Knowledge Graph.

PART IV: Implementing GraphRAG Retrieval Patterns

GraphRAG retrieval patterns are practical mechanisms that define how the LLM in your GraphRAG solution accesses the context and connections in your knowledge graph.

Let’s examine some of the most common GraphRAG patterns and how to use them.

Import Libraries

import os

from neo4j import GraphDatabase
from neo4j_graphrag.llm import OpenAILLM
from neo4j_graphrag.embeddings import OpenAIEmbeddings
from neo4j_graphrag.retrievers import (
    VectorRetriever,
    VectorCypherRetriever,
    Text2CypherRetriever,
)
from neo4j_graphrag.generation import GraphRAG
from neo4j_graphrag.schema import get_schema
from dotenv import load_dotenv


This notebook imports the core libraries required for building and querying RAG pipelines with Neo4j and GraphRAG:

• neo4j.GraphDatabase: The official Python driver for connecting to and querying a Neo4j database.
• neo4j_graphrag.llm.OpenAILLM: Integrates OpenAI language models for generating and processing natural language queries.
• neo4j_graphrag.embeddings.OpenAIEmbeddings: Provides access to OpenAI’s embedding models for generating vector representations of text.
• neo4j_graphrag.retrievers: Different retriever classes for semantic and hybrid search over graph data using vector similarity and Cypher queries:
    • VectorRetriever
    • VectorCypherRetriever
    • Text2CypherRetriever
• neo4j_graphrag.generation.GraphRAG: The main class for orchestrating RAG workflows over a Neo4j knowledge graph.
• neo4j_graphrag.schema.get_schema: Utility to introspect and retrieve the schema of your Neo4j database.
• dotenv.load_dotenv: Loads environment variables (such as credentials and API keys) from an .env file for secure configuration.

These imports enable advanced semantic search, retrieval, and GenAI capabilities directly on your Neo4j knowledge graph.

Load Environment Variables and Initialize Neo4j Driver

load_dotenv()

NEO4J_URI = os.getenv("NEO4J_URI")
NEO4J_USER = os.getenv("NEO4J_USERNAME")
NEO4J_PASSWORD = os.getenv("NEO4J_PASSWORD")
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")

driver = GraphDatabase.driver(NEO4J_URI, auth=(NEO4J_USER, NEO4J_PASSWORD))

Here, you load sensitive configuration values (such as database credentials and API keys) from environment variables, ensuring that secrets aren’t hardcoded in your notebook. The steps include:

• load_dotenv(): Loads environment variables from an .env file into your Python environment.
• os.getenv(): Fetches the Neo4j connection URI, username, and password, as well as your OpenAI API key.
• GraphDatabase.driver(): Initializes the Neo4j database driver with the provided credentials, allowing your notebook to connect and interact with your Neo4j instance securely.

TIP: Make sure your .env file contains the correct values for NEO4J_URI, NEO4J_USERNAME, NEO4J_PASSWORD, and OPENAI_API_KEY before running this code. This approach keeps your credentials secure and makes your codebase easier to share and maintain.

Initialize the LLM and Embedder
Just as you selected a specific LLM and embedding model when processing your PDFs, you should do the same when generating embeddings for your text data. It’s important to keep track of the language model and embedding tools that you use during this process.

For the retrievers to work correctly, the embedding model used during retrieval must match the one used to generate the dataset’s embeddings. This ensures accurate and meaningful search results.

llm = OpenAILLM(model_name="gpt-4o", api_key=OPENAI_API_KEY)
embedder = OpenAIEmbeddings(api_key=OPENAI_API_KEY)

The Basic Retriever Pattern
The basic retriever uses vector embeddings to find nodes that are semantically similar based on content. This retriever is useful only for handling specific information requests about topics contained in just one or a few chunks. It’s a starting point for more complex graph-based retrievals, and it’s easy to implement if you’re familiar with RAG but new to GraphRAG.


There are two components in the process:

• Chunks as nodes: The pattern uses the already chunked data to create a graph, where each chunk becomes a node in the graph.
• Retrieval: When a query is performed, the basic retriever pattern searches through these chunk nodes to find the most relevant information.

Let’s look at how you would implement this pattern using the SEC dataset.

You can now execute vector similarity searches to retrieve a company’s current challenges based on certain text in their filing. The retriever compares a query vector generated from the search prompt (i.e., the numeric representation of the question) against the indexed text embeddings of the chunks. Vector similarity searches work well for simple queries with a narrow focus, such as: “What are the risks around cryptocurrency?”

from neo4j_graphrag.retrievers import VectorRetriever

# Initialize the retriever.
# index_name must match an existing vector index in your database
# (the index created earlier in this guide is named chunkEmbeddings).
vector_retriever = VectorRetriever(
    driver,
    index_name="text_embeddings",
    embedder=embedder,
    return_properties=["text"],
)

query = "What are the main risks around cryptocurrency?"
result = vector_retriever.search(query_text=query, top_k=10)

Be sure to review your retrieval results before generating any text output. This step helps you confirm that your retriever is functioning as intended and returning relevant data from your knowledge graph. For example, in the query above, a sample of the retrieved content is displayed for inspection:

import pandas as pd

result_table = pd.DataFrame(
    [(item.metadata["score"], item.content[10:80], item.metadata["id"]) for item in result.items],
    columns=["Score", "Content", "ID"],
)

Score     Content                                                                      ID
0.913177  cryptocurrency assets could be treated as a general unsecured claim ag..     6064a2f775a8:1724
0.908264  agency offerings could subject us to additional regulations, licensing r…    6064a2f775a8:1723
0.903259  cyberextortion, distributed denial-of-service attacks, ransomware, spe…      6064a2f775a8:1718
0.898422  While we maintain insurance policies intended to help offset the\nfina…      6064a2f775a8:1720
0.896942  financing, and branded credit card products; branded debit card and\ns…      6064a2f775a8:1731
0.89476   our customers unimpaired and unconstrained access to our online servic…      6064a2f775a8:1731
0.894135  changes in cryptocurrencies, government cryptocurrency policies and ,,,      6064a2f775a8:1251
0.893723  ct our reputation and revenue. Actual or perceived vulnerabilities may…      6064a2f775a8:2508
0.893539  the past and could in the future have a material adverse effect on our…      6064a2f775a8:1254
0.893402  may continue to result\nin, disruption of and volatility in global fin…      6064a2f775a8:1260

You should review the results, or at least check the number of returned items. If there’s an error in your retriever and you proceed directly to natural language generation, your application may produce a generic LLM response that isn’t grounded in your data. This validation step ensures your outputs remain accurate and reflect the content of your underlying dataset.

To get the natural language output, use the following code:

rag = GraphRAG(llm=llm, retriever=vector_retriever)
print(rag.search(query_text=query).answer)
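The review step described above can be turned into a small guard that runs before generation. This is a minimal sketch under stated assumptions: Item is a hypothetical stand-in with the same content and metadata["score"] shape used in the inspection code above, and check_retrieval and its thresholds are illustrative, not part of the GraphRAG library.

```python
from dataclasses import dataclass, field

@dataclass
class Item:
    """Hypothetical stand-in for one retrieved item (content plus metadata)."""
    content: str
    metadata: dict = field(default_factory=dict)

def check_retrieval(items, min_items=1, min_score=0.0):
    """Raise if retrieval returned too few items or any item scores too low."""
    if len(items) < min_items:
        raise ValueError(f"expected at least {min_items} items, got {len(items)}")
    low = [i for i in items if i.metadata.get("score", 0.0) < min_score]
    if low:
        raise ValueError(f"{len(low)} item(s) scored below {min_score}")
    return len(items)

# Toy items (hypothetical) standing in for result.items.
items = [
    Item("cryptocurrency assets could be treated as ...", {"score": 0.91}),
    Item("agency offerings could subject us to ...", {"score": 0.90}),
]
count = check_retrieval(items, min_items=1, min_score=0.5)
print(count)

try:
    check_retrieval([], min_items=1)
    empty_rejected = False
except ValueError:
    empty_rejected = True
```

Failing loudly on an empty or low-scoring result is the design choice here: it stops the pipeline before the LLM can produce an answer that is not grounded in your data.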


The basic retriever will cause the LLM to generate a result like this:

The main risks around cryptocurrency, as highlighted in the context, include:

1. Regulatory Risks: The regulatory status of certain cryptocurrencies is unclear, which could subject businesses to additional licensing and regulatory obligations. If cryptocurrencies are deemed securities, this might necessitate securities broker-dealer registration under federal securities laws. Non-compliance could lead to regulatory actions, fines, and other consequences.
2. Custodial Risks: Cryptocurrency assets held through a third-party custodian are susceptible to various risks, such as inappropriate access, theft, or destruction. Inadequate insurance coverage by custodians and their potential inability to maintain effective controls can expose customers to losses. In the event of a custodian’s bankruptcy, the treatment of custodial holdings in proceedings remains uncertain, which could delay or prevent the return of assets.
3. Third-Party Partner Risks: Dependence on third-party custodians and financial institutions means exposure to operational disruptions, inability to safeguard holdings, and financial defaults by these partners, which could harm business operations and customer trust.

These risks underscore the need for robust regulatory compliance, secure custodial arrangements, and the management of third-party relationships to mitigate potential negative impacts on businesses offering cryptocurrency products.

While the vector search provided useful information about cryptocurrency risks, it did not answer deeper, more actionable questions, such as:

• Which specific companies are exposed to these risks?
• What other risks may be occurring concurrently?
• Which asset managers are associated with the affected companies? (e.g., multi-hop relationships from risk to company to asset manager)

In other words, the approach demonstrated here retrieves relevant text fragments. However, it doesn’t use the graph’s structure to connect the risks to companies or asset managers, nor does it show related or concurrent risks. There’s no traversal or multi-hop reasoning, so you miss out on the rich, contextual insights that a knowledge graph can provide.

To answer these more complex, relationship-driven questions, you need to combine vector search with graph-powered Cypher queries that can traverse and analyze connections between entities. This is where graph-enhanced retrieval patterns come in.

The Graph-Enhanced Vector Search Pattern
The basic retriever pattern typically relies on text-based embeddings, capturing only the semantic meaning of content. While this method is effective in identifying similar chunks, it leaves the LLM in the dark as to how those items interact in the real world.

The Graph-Enhanced Vector Search Pattern, also known as augmented vector search, overcomes this limitation by drawing on the graph structure (i.e., using not just what items are but also how they connect). By embedding node positions and relationships within a graph, this approach generates contextually relevant nodes, integrating both:

• Unstructured data: Product descriptions, customer reviews, and other text content via semantic similarity
• Structured data: Purchase patterns, category relationships, and transaction records via explicit instructions


The VectorCypherRetriever uses the full graph capabilities of Neo4j by combining vector-based similarity searches with graph traversal techniques. The retriever completes the following actions:

1. Processes a query embedding to perform a similarity search against a specified vector index.
2. Retrieves relevant node variables.
3. Executes a Cypher query to traverse the graph based on these nodes.

To set up this particular query, you need to tell the graph where and how to traverse from the semantic nodes. In this example, the query is:

“What are the risk factors for companies discussing cryptocurrency in their filings?”

The following code creates a retriever to answer this query:

company_risk_list_query = """
WITH node
MATCH (node)-[:FROM_DOCUMENT]-(d:Document)-[:FILED]-(c:Company)-[:FACES_RISK]-(rf:RiskFactor)
RETURN c.name AS company, node.text AS context, collect(DISTINCT rf.name) AS risks
"""

Let’s start by looking at the parts of the graph that help to answer this query. We start by identifying the Chunk that is semantically similar to the cryptocurrency query. Then we need to traverse the graph to identify the Document the Chunk comes from, the Company that FILED the Document, and collect the other RiskFactors for that Company. Once this traversal is worked out, it’s expressed in Cypher and set as the retrieval query.

Figure 27. VectorCypherRetriever example 1

Next, let’s add this new retrieval query to the VectorCypherRetriever parameters:

vector_cypher_retriever = VectorCypherRetriever(
    driver=driver,
    index_name="chunkEmbeddings",
    embedder=embedder,
    retrieval_query=company_risk_list_query,
)

VectorCypherRetriever parameters:

• driver: The Neo4j database connection
• index_name: The name of the vector index (here, chunkEmbeddings) used for semantic search
• embedder: The embedding model used to generate/query vector representations
• retrieval_query: The Cypher query (defined above) that tells Neo4j how to traverse the graph from the semantically matched nodes

This setup enables you to start with a semantic search (e.g., for “cryptocurrency risk”) and automatically traverse your knowledge graph to reveal which companies are involved and what other risks they face. The resulting responses are both semantically relevant and graph-aware.

VectorCypher Retriever in Practice
The power of the Graph-Enhanced Vector Search Pattern lies in its flexibility. While the example above focuses on linking companies to risk factors in financial filings, the approach can be applied to any domain or vertical by customizing the graph schema and Cypher queries.

How might this look for other industries?

• Healthcare: Retrieve patient records, diagnoses, and treatment plans by combining semantic search of clinical notes with graph traversal across relationships like doctor-patient, medication-prescribed, or symptom-diagnosis.
• Ecommerce: Connect customer reviews or product descriptions (unstructured text) to


purchase behavior, category hierarchies, or supplier relationships (a structured graph), enabling recommendations and/or supply chain insights.
• Law: Link case law or legal opinions to statutes, precedents, and involved parties, surfacing not just relevant text but also the legal context and network of citations.
• Cybersecurity: Combine threat intelligence reports (text) with the graph relationships between vulnerabilities, affected assets, and mitigation strategies to provide a holistic view of your security posture.
• Education: Map student essays or discussion posts to learning objectives, course materials, and assessment outcomes for personalized education analytics.

Let’s summarize the major tasks from this example so you can apply it to your domain:

• Model Your Domain: Define the node types, relationships, and key properties relevant to your vertical (e.g., Patient, Diagnosis, Product, Supplier, Case, Asset, etc.).
• Index the Right Data: Create vector indexes on the appropriate text or document nodes for semantic retrieval.
• Craft Domain-Specific Cypher Queries: Write Cypher queries that traverse from the retrieved nodes to related entities and/or relationships that matter in your context.
• Integrate With VectorCypherRetriever: Use the VectorCypherRetriever with your custom query to combine semantic and structural search.

The result: You can ask complex, context-aware questions about entities in your own industry. The GraphRAG retriever will surface relevant information that connects context across structured and unstructured data to drive real-world understanding.

With this in mind, let’s look at another VectorCypherRetriever example.

VectorCypher Retrieval: A Working Example

Which Asset Managers are most affected by reseller concerns?

Let’s again start with the Chunks semantically similar to “reseller concerns,” and then traverse through the Document to the Company through OWNS to identify the AssetManagers relevant to the query. We’ll also include the property shares from the relationship OWNS and order by largest holdings.

Figure 28. VectorCypherRetriever example 2

chunk_to_asset_manager_query = """
WITH node
MATCH (node)-[:FROM_DOCUMENT]-(doc:Document)-[:FILED]-(company:Company)-[owns:OWNS]-(manager:AssetManager)
RETURN DISTINCT company.name AS company, manager.managerName AS AssetManager, owns.shares AS shares
ORDER BY shares DESC
"""

Next, add this new retrieval query to the VectorCypherRetriever parameters:

vector_cypher_retriever = VectorCypherRetriever(
    driver=driver,
    index_name="chunkEmbeddings",
    embedder=embedder,
    retrieval_query=chunk_to_asset_manager_query,
)

VectorCypherRetriever parameters:

• driver: The Neo4j database connection
• index_name: The name of the vector index (here, chunkEmbeddings) used for semantic search


• embedder: The embedding model used to generate/query vector representations
• retrieval_query: The Cypher query (defined above) that tells Neo4j how to traverse the graph from the semantically matched nodes

result = vector_cypher_retriever.search(query_text=query, top_k=10)
for item in result.items:
    print(item.content[:100])

Let’s look at the results:

<Record company='APPLE INC' AssetManager='BlackRock Inc.' shares=1031407553>
<Record company='APPLE INC' AssetManager='Berkshire Hathaway Inc' shares=915560382>
<Record company='AMAZON' AssetManager='BlackRock Inc.' shares=613380364>
<Record company='APPLE INC' AssetManager='STATE STREET CORP' shares=569291690>
<Record company='MICROSOFT CORP' AssetManager='BlackRock Inc.' shares=533634606>
<Record company='AMAZON' AssetManager='STATE STREET CORP' shares=332449318>
<Record company='AMAZON' AssetManager='FMR LLC' shares=302101441>
<Record company='APPLE INC' AssetManager='FMR LLC' shares=298321726>
<Record company='APPLE INC' AssetManager='GEODE CAPITAL MANAGEMENT, LLC' shares=296103070>

Since these results look as expected, we proceed to the natural language output:

rag = GraphRAG(llm=llm, retriever=vector_cypher_retriever)
print(rag.search(query_text=query).answer)

The Asset Managers most affected by cryptocurrency concerns are:

1. BlackRock Inc.
2. FMR LLC
3. STATE STREET CORP
4. GEODE CAPITAL MANAGEMENT, LLC
5. MORGAN STANLEY
6. NORTHERN TRUST CORP
7. BANK OF AMERICA CORP /DE/
8. Bank of New York Mellon Corp
9. ALLIANCEBERNSTEIN L.P.
10. AMUNDI
11. WELLINGTON MANAGEMENT GROUP LLP
12. Capital World Investors
13. AMERIPRISE FINANCIAL INC
14. WELLS FARGO & COMPANY/MN

This is where GraphRAG really shines. You may be wondering how to construct the retrieval query that traverses the graph. In this example, you can see that the retrieval_query is a string of Cypher code, the language of graph querying. Now let’s look at one last retriever pattern found in the Neo4j library: the Text2CypherRetriever.

Text2CypherRetriever
You can use Text2CypherRetriever to generate Cypher queries from natural language questions. Instead of manually crafting each Cypher statement, the retriever uses an LLM to translate your plain-English queries into Cypher based on its understanding of your Neo4j schema.


The process begins with a natural language question, such as:

“What are the names of companies owned by BlackRock Inc.?”

The retriever then uses the schema, described as a string outlining the main node types and relationships in your graph (for example, companies, risk factors, and asset managers), to guide the LLM in generating an appropriate Cypher query. While you could pass a hard-coded schema to the retriever, it’s best practice to access the schema as it currently exists in your instance. Here’s a sample of the full schema:

schema = get_schema(driver)

Node properties:
Document {id: STRING, path: STRING, createdAt: STRING}
Chunk {id: STRING, index: INTEGER, text: STRING, embedding: LIST}
Company {id: STRING, name: STRING, chunk_index: INTEGER, ticker: STRING}
Product {id: STRING, name: STRING, chunk_index: INTEGER}
. . .
Relationship properties:
OWNS {position_status: STRING, Value: FLOAT, shares: INTEGER, share_value: FLOAT}
The relationships:
....
(:Executive)-[:FROM_CHUNK]->(:Chunk)
(:StockType)-[:FROM_CHUNK]->(:Chunk)
(:AssetManager)-[:OWNS]->(:Company)

Now that you’ve retrieved the schema, you have everything you need to set up the Text2CypherRetriever.

query = "What are the names of the companies owned by BlackRock Inc.?"
text2cypher_retriever = Text2CypherRetriever(
    driver=driver,
    llm=llm,
    neo4j_schema=schema,
)

cypher_query = text2cypher_retriever.get_search_results(query)
cypher_query.metadata["cypher"]

MATCH (a:AssetManager {managerName: 'BlackRock Inc.'})-[:OWNS]->(c:Company)
RETURN c.name AS company_name

This approach has several advantages. It removes the need to write Cypher by hand for each query, making graph data accessible even to those without technical expertise. It’s ideal for rapid prototyping, exploratory analysis, and building natural language interfaces to your knowledge graph, enabling a broader range of users to interact with complex graph data.

Now you can pass that Cypher query directly to the driver to get the results:

result = driver.execute_query(cypher_query.metadata["cypher"])
for record in result.records:
    print(record)

<Record companyName='APPLE INC'>
<Record companyName='MICROSOFT CORP'>
<Record companyName='INTEL CORP'>
<Record companyName='AMAZON'>
<Record companyName='PG&E CORP'>
<Record companyName='NVIDIA CORPORATION'>
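Because the generated Cypher is passed straight to the driver, you may want a quick check that it only reads data before executing it. The sketch below is a keyword heuristic and an assumption of this example, not a feature of the GraphRAG library or the Neo4j driver; it can misjudge unusual queries (for instance, procedure calls that write), so treat it as a first line of defense rather than a guarantee.

```python
import re

# Clauses that modify data or schema; matched as whole words, case-insensitively.
WRITE_CLAUSES = re.compile(
    r"\b(CREATE|MERGE|DELETE|DETACH|SET|REMOVE|DROP|FOREACH|LOAD\s+CSV)\b",
    re.IGNORECASE,
)

# Quoted string literals, so keywords inside values are not counted.
SINGLE_QUOTED = r"'(?:\\.|[^'\\])*'"
DOUBLE_QUOTED = r'"(?:\\.|[^"\\])*"'
STRING_LITERAL = re.compile(SINGLE_QUOTED + "|" + DOUBLE_QUOTED)

def is_read_only(cypher):
    """Heuristic: True if no write clause appears outside string literals."""
    without_strings = STRING_LITERAL.sub("''", cypher)
    return WRITE_CLAUSES.search(without_strings) is None

generated = (
    "MATCH (a:AssetManager {managerName: 'BlackRock Inc.'})-[:OWNS]->(c:Company) "
    "RETURN c.name AS company_name"
)
print(is_read_only(generated))
print(is_read_only("MATCH (c:Company) DETACH DELETE c"))
```

In practice you would call is_read_only on cypher_query.metadata["cypher"] and skip execution (or ask for human review) when it returns False.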


While the Text2Cypher functionality in the Neo4j trade-offs will help you integrate Text2Cypher
GraphRAG library offers a powerful way to translate effectively while ensuring that it is used in
natural language queries into Cypher, there are scenarios where its strengths outweigh its
important considerations to keep in mind when using it. potential drawbacks.

First, because Text2Cypher relies on an LLM to generate queries dynamically, the same input may not always yield identical results. The model’s responses can vary depending on context, training data, and even minor changes in phrasing. While the flexibility of Text2Cypher allows for more natural interactions, it can also introduce inconsistencies when precise, repeatable queries are required.

Additionally, query optimization remains an important factor. While LLMs are capable of generating complex Cypher queries, they may not always produce the most efficient ones. Without human intervention or performance tuning, these queries might not be optimized for speed or resource consumption, which could potentially slow application performance.

Finally, high-stakes applications (such as those requiring strict reproducibility, financial computations, or regulatory compliance) may require standardized, manually crafted Cypher queries instead. In such cases, relying entirely on an AI-generated query could introduce risks, especially if the generated query structure does not fully align with business logic or data constraints.

Despite these limitations, Text2Cypher is a valuable tool for making Neo4j more accessible, particularly for applications where flexibility, adaptability, and user-driven query formulation are more important than absolute precision. Understanding these

Check out the Text2Cypher Crowdsourcing App to explore Text2Cypher applications and contribute to development projects.

Community Summary Pattern
You may have heard the term GraphRAG and thought of the pattern popularized by Microsoft, where the text is used to summarize community or other knowledge (e.g., forum posts). This type of retriever is often called the Community Summary Pattern.

While a Microsoft-style GraphRAG emphasizes summarization and community Q&A, Neo4j’s approach focuses on domain-specific schema control and composable query generation. This focus expands GraphRAG from summarization into structured reasoning, decision tracing, and dynamic compliance use cases.

Concluding Thoughts and Next Steps
Integrating a knowledge graph with RAG gives GenAI systems structured context and relationships, improving the relevance and quality of generated results.

This guide has equipped you with the foundational skills needed to implement GraphRAG. You learned how to use Neo4j Aura, Neo4j’s cloud-based graph database service, and Data Importer to prepare a knowledge graph for GraphRAG, and how to use the GraphRAG Python library to create a knowledge graph from unstructured data. You also learned how to implement foundational GraphRAG retrieval patterns, including the basic retriever, graph-enhanced vector search, and Text2Cypher.
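As a practical takeaway on the Text2Cypher trade-off discussed earlier (flexible generated queries versus repeatable, manually crafted ones), the two approaches can share one retrieval path. The following is a minimal sketch under stated assumptions, not part of the Neo4j GraphRAG library: `CypherTemplate`, `VETTED_TEMPLATES`, and `QueryRouter` are hypothetical names, the `Customer`/`Position` labels are illustrative, and `generate_cypher` stands in for whatever Text2Cypher call your application uses. Questions that match a reviewed template get a parameterized, hand-written query. Everything else falls back to generation, with the generated query cached so that the same question yields the same Cypher on later calls.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class CypherTemplate:
    """A manually written, reviewed Cypher query with named parameters."""
    pattern: re.Pattern
    cypher: str


# Hypothetical vetted query; the labels and properties are illustrative only.
VETTED_TEMPLATES: List[CypherTemplate] = [
    CypherTemplate(
        pattern=re.compile(
            r"total exposure for (?P<customer>[\w\s]+)\??$", re.IGNORECASE
        ),
        cypher=(
            "MATCH (c:Customer {name: $customer})-[:HOLDS]->(p:Position) "
            "RETURN sum(p.value) AS total_exposure"
        ),
    ),
]


def normalize(question: str) -> str:
    """Lowercase and collapse whitespace so trivial rephrasings share a cache key."""
    return " ".join(question.lower().split())


class QueryRouter:
    """Prefer vetted templates; otherwise use (and cache) a generated query."""

    def __init__(self, generate_cypher: Callable[[str], str]) -> None:
        # generate_cypher is an assumption: any callable that turns a
        # natural-language question into a Cypher string (e.g., an LLM call).
        self._generate = generate_cypher
        self._cache: Dict[str, str] = {}

    def route(self, question: str) -> Tuple[str, Dict[str, str], str]:
        """Return (cypher, params, source), where source is
        'vetted', 'generated', or 'cached'."""
        for template in VETTED_TEMPLATES:
            match = template.pattern.search(question.strip())
            if match:
                params = {k: v.strip() for k, v in match.groupdict().items()}
                return template.cypher, params, "vetted"

        key = normalize(question)
        if key in self._cache:
            return self._cache[key], {}, "cached"
        self._cache[key] = self._generate(question)
        return self._cache[key], {}, "generated"
```

Passing user input as query parameters (`$customer`) rather than splicing it into the query text keeps the vetted query fixed and reviewable. The cache only addresses repeatability; a generated query may still deserve profiling and human review before it is relied on in production.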


Like other AI technologies, GraphRAG is rapidly evolving. A few trends to watch:
• More advanced, dynamic Cypher queries and sophisticated retrieval patterns that use graph algorithms and machine learning techniques are pushing the boundaries of what’s possible in information retrieval and generation.
• Deeper integration with other AI technologies, such as knowledge graph embeddings and graph neural networks, promises to enhance the semantic understanding and reasoning capabilities of GraphRAG systems.
• Integrating GraphRAG with agentic systems and other multi-tool, multi-step RAG chains can result in more autonomous and intelligent systems capable of handling complex, multifaceted tasks with greater efficiency and accuracy.
• Incorporating semantic layers in GraphRAG systems can provide even more nuanced understanding and context awareness in information retrieval and generation tasks.

Build on what you learned in this guide:
• The Neo4j for GenAI use case page offers guides, tutorials, and best practices about GraphRAG implementation.
• The GraphRAG site contains explanations of GraphRAG principles and step-by-step guides for various implementation scenarios.
• Neo4j GraphAcademy offers free, hands-on online courses.

Explore GenAI With Neo4j
Neo4j uncovers hidden relationships and patterns
across billions of data connections deeply, easily,
and quickly, making graph databases an ideal choice
for building your first GraphRAG application.



Appendix
Technical Resources in Workflow Order

| Stage | Resource | Why It’s Useful |
|---|---|---|
| 1. Data Modeling | Designing a Graph Data Model for GenAI (Neo4j Blog) | Helps you define entity-relationship schemas (ontology) that power GraphRAG context. |
| 2. Data Modeling | Neo4j Data Modeling Guide | Foundation for understanding how to structure both unstructured and structured data into a graph. |
| 3. Environment Setup | Neo4j Aura Free Tier | Spin up a secure cloud instance instantly; perfect for prototyping. |
| 4. Data Ingestion (Structured) | Neo4j Data Importer Tool | Visual UI for mapping CSVs and relational data to graph nodes and relationships. |
| 5. Data Ingestion (Unstructured) | Neo4j GraphRAG Python Library | Convert PDFs and text to a knowledge graph using LLM-powered entity and relationship extraction. |
| 6. Data Ingestion (Unstructured) | KGBuilder Tutorial – SEC Filings Example | Walkthrough for turning dense financial disclosures into structured graph nodes and edges. |
| 7. Embeddings + Vector Indexing | Neo4j Vector Indexing Docs | Build and manage vector embeddings inside Neo4j for hybrid retrieval. |
| 8. Retrieval: Basic + Vector | Neo4j GraphRAG Basic Retriever Pattern | First step: combine chunked content and embedding for basic semantic retrieval. |
| 9. Retrieval: Graph-Enhanced | Graph-Enhanced Vector Search with Neo4j | Augment vector search with traversal logic to improve contextual accuracy. |
| 10. Text2Cypher Automation | Text2Cypher Documentation & Examples | Translate user queries into Cypher automatically using LLMs; ideal for dynamic GraphRAG. |
| 11. Agentic & Multi-Step Use | GraphRAG + NeoConverse + Agents | Build multi-tool agents that query graphs autonomously across task chains. |
| 12. Semantic Enhancement | Topic Extraction for Semantic RAG | Use LLMs to extract topics and themes into your graph to add interpretability. |
| 13. Deployment + Ops | Neo4j Deployment Best Practices | Tips for scaling and monitoring GraphRAG in production environments. |
