Thinking with Knowledge Graphs: Enhancing LLM Reasoning Through Structured Data

Xue Wu                              Kostas Tsioutsiouliklis
Yahoo Research                      Facts.ai
Mountain View, California, USA      Saratoga, California, USA
[email protected]                    [email protected]

arXiv:2412.10654v1 [cs.CL] 14 Dec 2024

ABSTRACT
Large Language Models (LLMs) have demonstrated remarkable capabilities in natural language understanding and generation. However, they often struggle with complex reasoning tasks and are prone to hallucination. Recent research has shown promising results in leveraging knowledge graphs (KGs) to enhance LLM performance. KGs provide a structured representation of entities and their relationships, offering a rich source of information that can enhance the reasoning capabilities of LLMs. For this work, we have developed different techniques that tightly integrate KG structures and semantics into LLM representations. Our results show that we are able to significantly improve the performance of LLMs in complex reasoning scenarios and ground the reasoning process with KGs. We are the first to represent KGs with a programming language and to fine-tune pretrained LLMs with KGs. This integration facilitates more accurate and interpretable reasoning processes, paving the way for more advanced reasoning capabilities of LLMs.

1 INTRODUCTION
Large Language Models (LLMs) have achieved state-of-the-art performance in many Natural Language Processing (NLP) tasks ([7], [11]). They have been successfully applied in a wide range of applications, from question answering to summarization and machine translation. However, due to limitations in the training data and training process, they suffer from hallucination [18], where generated text is nonfactual, nonsensical, or incoherent. This issue becomes particularly prevalent in tasks that require intricate or complex reasoning. To address hallucination, researchers have used methods such as prompting ([31], [39], [30], [14], [25], [35], [34]), retrieval-augmented generation (RAG [20]), and fine-tuning. These approaches often leverage external sources of information, which can come from the internet, third-party applications, databases, or knowledge graphs.

Knowledge Graphs (KGs) are structured representations of real-world entities and the relationships among them, offering a rich source of factual information ([21], [4]). By grounding the reasoning processes of LLMs with KGs, we can enhance the factual accuracy of the generated text and reduce hallucinations. Researchers have explored various approaches to integrating KGs with LLMs, each with its own advantages and limitations. One approach involves using Graph Neural Networks (GNNs) to encode KGs into embeddings that capture the structural and semantic information of the graphs ([10], [36]). These embeddings then serve as soft prompts to LLMs, guiding the generation process with knowledge from the KGs. However, this method requires very careful design and tuning, as it needs to align the representations learned by GNNs with the token-based processing of LLMs. Moreover, since such integration is specific per graph, it may require significant re-engineering for different tasks or graph types.

Another method uses semantic parsing to convert natural language queries into structured query languages like SPARQL ([32], [6], [26], [8]). In this approach, the LLM generates a SPARQL query based on the input prompt, which is then executed against the KG to retrieve relevant information. This method effectively uses the KG as an external knowledge base and treats the KG and the LLM as separate components. As a result, the reasoning process is not fully integrated into the LLM, potentially limiting the model's ability to perform complex reasoning during text generation.

Alternatively, researchers encode the entities and relationships of KGs as natural language text ([22], [12]). They either incorporate them into the LLM's input context or fine-tune the LLM with these text representations. This approach leverages the LLM's natural language understanding ability to reason over the text-encoded knowledge. However, representing structured data as unstructured text poses challenges. Capturing the nuances of entities and relationships in natural language requires careful design to avoid ambiguities and ensure that the structural information is preserved, which can be non-trivial for complex graphs.

While there are different ways to leverage KGs to enhance LLMs' performance, there are two aspects to using KGs for improving LLMs. One is grounding the LLMs with trustworthy information, and the other is providing them with examples of relations from which they can generalize. It is this second part that motivated our work. In this work, we propose an approach that represents knowledge graphs with programming language code. Programming languages are inherently structured and are designed to represent complex data and relationships efficiently. This allows for an accurate encoding of entity relationships that preserves the internal structure of the graph. More importantly, programming code is part of the pre-training data for many LLMs, meaning that the models are already equipped to parse and understand programming syntax and semantics. This reduces the need for additional specialized training to interpret the KG representations. By leveraging the structured representation of KGs in programming languages, LLMs can perform more sophisticated reasoning over the data. We investigate different methods of integrating entity relationships into LLMs. Our experimental results demonstrate that programming language (Python) representations of KGs outperform traditional natural language representations and structured JSON representations in complex reasoning tasks, leading to more accurate and reliable outputs.

The main contributions of this paper are:

• We introduce a novel representation of knowledge graphs with a programming language. It facilitates the seamless integration of structured knowledge into the language modeling process.
• By tightly integrating knowledge graphs into LLMs, our approach improves the reasoning accuracy of LLMs on complex tasks and effectively grounds the reasoning process, reducing the chance of hallucinations.

2 RELATED WORK
There have been several attempts to apply LLMs to graph reasoning tasks. Wang et al. [28], Guo et al. [15], and Ye et al. [37] employ the Graph2Text strategy of converting graph data into textual descriptions. However, these textual descriptions can result in very large contexts, and algorithms such as shortest path computations and bipartite graph matching require calculations across the entire context, making the task highly challenging. Chai et al. [9] have introduced GraphLLM, which combines three steps, namely node understanding, graph structure understanding, and graph-enhanced prefix tuning. Zhu et al. [40] and Wang et al. [29] proposed different methods for instruction fine-tuning LLMs to improve the performance on common graph tasks.

While the above works address general graph problems, other research has focused specifically on combining KGs with LLMs. One such approach is to use the LLM as an encoder to transform text-based entity nodes and relations, and then fuse the LLM- and GNN-derived representations. Applications of this approach range from product recommendation (Choudhary et al. [10]) to biomedical question answering (Yasunaga et al. [36]). Luo & Pan [22] have proposed a Reasoning on Graphs (RoG) method that comprises two modules: a planning module and a retrieval-reasoning module. The planning module mines the KG and generates faithful relation paths for answering the question. The retrieval-reasoning module combines a breadth-first search and a probabilistic optimization over all paths. Dernbach et al. [12] developed a neighborhood partitioning and encoding scheme to accommodate real-world graph properties. Their encoding scheme transforms graph relations into alternate text representations, which in turn are used to fine-tune the LLM. Edge et al. [13] have built an end-to-end system, called GraphRAG, that starts with a set of source documents and iteratively applies an LLM to extract the entities in each document. Next, entities are connected via the extracted relationships, and a knowledge graph is created. The knowledge graph is split into communities, and each community is summarized. These summaries are subsequently used by the LLM via RAG to help answer questions submitted to the system.

Nie et al. [24] provide a code-style in-context learning (ICL) method for knowledge base question answering (KBQA). They design seven meta-functions written in Python that cover the atomic operations used for querying databases. By using few-shot learning, they improve LLMs' ability to query knowledge bases effectively.

There is also ongoing research ([23], [38], [2]) that studies the impact of mixing programming code into pre-training or instruction fine-tuning datasets on the performance of LLMs. Even though the programming code used in this research is generic, the results show promising improvements on various tasks, including reasoning tasks. There is another line of work ([19], [39]) that studies how to improve LLM Chain-of-Thought (CoT) reasoning ability through fine-tuning. However, creating good datasets for fine-tuning CoT is labor intensive, and the reasoning steps are described in natural language text, which can introduce ambiguity.

Yang et al. [33] studied how LLMs do multi-hop reasoning and found that LLMs perform latent multi-hop reasoning for certain relation types, but often struggle with later hops. Biran et al. [3] observed that later hops are resolved in the model's deeper layers, where the LLM may no longer encode the knowledge necessary for answering the question. They propose a back-patching approach, essentially feeding the hidden representations from later layers back into earlier ones.

In this work, we continue the prior research on multi-hop queries, showing that we can significantly improve the performance of LLMs by integrating KG structures and semantics into LLM representations. Similarly to [24], we also experiment with code-style representations. However, in our case, we represent the entity relations (not the atomic operations) in Python, and our goal is to improve the LLMs' ability to answer questions directly, not their ability to query the knowledge graph.

The rest of the paper is structured as follows. In Section 3 we describe the methodologies we followed to prompt or fine-tune the LLM. In Section 4 we describe the experimental design and results. Finally, we conclude the paper in Section 5.

3 METHODOLOGY
Our work focuses on studying the entity relationship representation of KGs for grounded LLM reasoning.

3.1 Knowledge Graph Definition
Let 𝐺 = {𝐸, 𝑅, 𝑇} denote a knowledge graph, where 𝐸 is the set of entities, 𝑅 is the set of relationships, and 𝑇 ⊆ 𝐸 × 𝑅 × 𝐸 is the set of triplets that form the edges of the knowledge graph. A triplet (𝑒𝑖, 𝑟𝑖, 𝑒𝑖+1) ∈ 𝑇 if there is a directed edge from entity 𝑒𝑖 (𝑒𝑖 ∈ 𝐸) to entity 𝑒𝑖+1 (𝑒𝑖+1 ∈ 𝐸) through relationship 𝑟𝑖 (𝑟𝑖 ∈ 𝑅). A triplet also corresponds to a complete one-hop reasoning process. A two-hop compositional reasoning process can be represented as ((𝑒𝑖, 𝑟𝑖, 𝑒𝑖+1), (𝑒𝑖+1, 𝑟𝑖+1, 𝑒𝑖+2)), where (𝑒𝑖, 𝑟𝑖, 𝑒𝑖+1) ∈ 𝑇 and (𝑒𝑖+1, 𝑟𝑖+1, 𝑒𝑖+2) ∈ 𝑇. Since 𝑒𝑖+1 appears in both triplets, it is the bridge entity. Similarly, a three-hop compositional reasoning process can be represented as ((𝑒𝑖, 𝑟𝑖, 𝑒𝑖+1), (𝑒𝑖+1, 𝑟𝑖+1, 𝑒𝑖+2), (𝑒𝑖+2, 𝑟𝑖+2, 𝑒𝑖+3)), where (𝑒𝑖, 𝑟𝑖, 𝑒𝑖+1) ∈ 𝑇, (𝑒𝑖+1, 𝑟𝑖+1, 𝑒𝑖+2) ∈ 𝑇, and (𝑒𝑖+2, 𝑟𝑖+2, 𝑒𝑖+3) ∈ 𝑇.

3.2 Knowledge Graph Representation for LLM Reasoning
To improve LLM multi-hop reasoning with knowledge graphs, we represent knowledge graphs in ways that are more compatible with LLM prompting and fine-tuning. When given a complex reasoning prompt, LLMs can detect entities and relationships, then implicitly infer the key entities and facts by following logical reasoning steps grounded by knowledge graphs. For instance, given the prompt: "Who is the spouse of the composer of 'It Goes Like It Goes'?", LLMs can follow the reasoning steps: "The composer of 'It Goes Like It Goes' is David Shire" and "The spouse of David Shire is Didi
Conn", and infer that the correct answer is "Didi Conn". Different representations of knowledge graphs affect how effectively LLMs can perform such logical reasoning.

The most natural way is to use natural language to describe the triplets in knowledge graphs. Figure 1 shows the natural language representation for two-hop reasoning over the triplets ((𝑒1, 𝑟1, 𝑒2), (𝑒2, 𝑟2, 𝑒3)), where 𝑒2 is the bridge entity. For instance, the triplets (('It Goes Like It Goes', 'composer', 'David Shire'), ('David Shire', 'spouse', 'Didi Conn')) can be represented as ("The composer of 'It Goes Like It Goes' is David Shire", "The spouse of David Shire is Didi Conn"), and the two-hop reasoning can be represented as "The spouse of the composer of 'It Goes Like It Goes' is Didi Conn". As LLMs understand natural language very well, the natural language representation is the most straightforward way to either prompt or fine-tune LLMs with knowledge graphs.

Figure 1: Natural Language Representation of KG with Static Relationships

JSON (JavaScript Object Notation) is a lightweight data interchange format [5]. It is designed to store data in universal data structures, such as dictionaries and lists, that are supported by most programming languages. JSON is a pure data-only format and can be used to store structured data from knowledge graphs. Figure 2 shows the JSON representation of the knowledge graph triplets (𝑒1, 𝑟1, 𝑒2) and (𝑒2, 𝑟2, 𝑒3). The entities 𝑒1 and 𝑒2 are the keys, and the relationship/entity pairs 𝑟1:𝑒2 and 𝑟2:𝑒3 are the values. However, since JSON is designed to store data only, it is difficult to represent a multi-hop inference process in the JSON format. Alternatively, a comment or description field (with "comment" as the key and a natural language description of the multi-hop reasoning as the value) can be added to the JSON representation, but this is not a recommended practice in general.

Figure 2: JSON Representation of KG with Static Relationships

Programming language code is another major data source for LLM pre-training and fine-tuning. Knowledge graphs can be represented using various data structures supported by major programming languages such as Python. The triplets can be represented either as a static dictionary or added dynamically to predefined data structures as part of the code. Figure 3 shows the Python representation of knowledge graph triplets with a static dictionary data structure, and the two-hop inference process based on the stored triplets. As shown in Figure 3, relationships and entities are stored in the dictionary "relationships", with the relationships 𝑟1 and 𝑟2 as the keys and entity pairs as the values. The inference process is simply the process of retrieving values from the dictionary by key. Figure 4 shows the Python representation of knowledge graph triplets with a predefined Python class "KnowledgeBase" and an iterative multi-hop inference function "infer" that supports inference over an arbitrary number of hops. As shown in Figure 4, the main data structure of "KnowledgeBase" is the dictionary "self.facts", and entities and relationships are added to the dictionary with the function "add_fact". The "infer" function accepts any number of relationships through the parameter "*relations" and performs the corresponding multi-hop reasoning. We designed the dynamic, self-defined data structure based Python representation because it can be easily generalized to support multi-hop reasoning over subgraphs of KGs.

Figure 3: Python Representation of KG with Static Relationships

Figure 4: Python Representation of KG with Dynamic Relationships

The example mentioned in the previous paragraphs can be easily represented as dictionaries in Python. Figure 5 shows four different representations for the same example. Using programming languages, knowledge graphs can be represented in a more controlled and unambiguous way; corner cases can be easily checked. However, the representation of the same triplet can be more verbose.

Figure 5: Examples of Different Representation of KG
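In the spirit of Figure 5, the running example can be written out in the text, JSON, and static-Python forms side by side; these snippets are illustrative reconstructions, not the figure's exact contents:

```python
import json

# Natural language representation: one sentence per triplet.
text_repr = [
    "The composer of 'It Goes Like It Goes' is David Shire.",
    "The spouse of David Shire is Didi Conn.",
]

# JSON representation: entities as keys, relation/entity pairs as values.
json_repr = json.dumps({
    "It Goes Like It Goes": {"composer": "David Shire"},
    "David Shire": {"spouse": "Didi Conn"},
}, indent=2)

# Static Python representation: the same facts as a dictionary, plus the
# two-hop lookup that the data-only JSON format cannot itself express.
kg = json.loads(json_repr)
bridge = kg["It Goes Like It Goes"]["composer"]
answer = kg[bridge]["spouse"]
```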

                              Train     Test    Intersection
Number of Hops                    2        2               2
Dataset Size                 10,262   10,255               0
Bridge Entities (𝑒2)            324      880               0
Relations (𝑟1, 𝑟2)              191      207             167
No. of rows with (𝑟1, 𝑟2)    10,262   10,255          10,130

Table 1: Dataset 1 Train and Test Data Selection

                              Train     Test    Intersection
Number of Hops                    2        2               2
Dataset Size                 10,733    5,236               0
Bridge Entities (𝑒2)          4,170    3,650             782
Relations (𝑟1, 𝑟2)               80      110              66
No. of rows with (𝑟1, 𝑟2)    10,733    5,236           1,907

Table 2: Dataset 2 Train and Test Data Selection

4 EXPERIMENTS
We designed experiments to study how different representations of entity relationships in KGs affect the reasoning performance of LLMs across two different datasets.

4.1 Datasets
4.1.1 Dataset 1. For the first dataset, we use the same dataset used by the "Hopping Too Late" paper ([3]). The dataset includes two-hop relationships extracted from the publicly available knowledge base Wikidata ([27]). For the experiments, we split the dataset into eight equal-sized partitions based on the bridge entity 𝑒2 in a round-robin fashion, so that any unique 𝑒2 exists in only one partition. We partition the dataset in this way so that the LLMs learn only the relationships and the logical reasoning process rather than memorizing the entities. To avoid overrepresentation of the most popular 𝑒2 in the training or testing dataset, we choose partition 2 as our training dataset and partition 4 as the testing dataset. The details of the dataset are listed in Table 1. For this dataset, there is no overlap of bridge entities between the training and testing datasets. The overlap of relationship pairs (𝑟1, 𝑟2) is about 99%.

4.1.2 Dataset 2. For the second dataset, we use the dataset created in paper ([16]). This dataset also includes two-hop relationships extracted from the publicly available knowledge base Wikidata ([27]). Although both Dataset 1 and Dataset 2 are derived from the same knowledge base, the extracted entities and relationships for Dataset 2 are different from those in Dataset 1. For the training dataset, we select only compositional relationships from this dataset and limit the number of instances per relationship pair (𝑟1, 𝑟2) to no more than 500 to avoid overrepresentation of particular relationships in the training data. We use the development dataset with compositional relationships for testing purposes. The details of the dataset are listed in Table 2. We choose compositional relationships so that the types of relationships are consistent with those in Dataset 1 and it is easier to compare reasoning performance across datasets. We didn't further restrict the entities and relationships based on the overlap between the training and testing datasets, to respect the design of the dataset by paper ([16]).

4.1.3 Dataset 3. This dataset is an extension of Dataset 1. We extended the two-hop relationships ((𝑒1, 𝑟1, 𝑒2), (𝑒2, 𝑟2, 𝑒3)) by adding a third hop (𝑒3, 𝑟3, 𝑒4), resulting in three-hop relationships ((𝑒1, 𝑟1, 𝑒2), (𝑒2, 𝑟2, 𝑒3), (𝑒3, 𝑟3, 𝑒4)), while keeping the entities 𝑒3 as a subset of the entities 𝑒3 in Dataset 1. This dataset was created to test whether models fine-tuned for two-hop reasoning can generalize to improve three-hop reasoning performance as well. The details of the dataset are listed in Table 3. The overlap of bridge entities between the training and testing datasets is minimal for this dataset. There is a high percentage of relationship overlap between the training and testing datasets.

The final training data for fine-tuning the LLMs include both one-hop prompts and responses based on the first hop from the datasets, and two-hop prompts and responses based on the two-hop information from the datasets.

4.2 Large Language Models
To evaluate the performance of LLM reasoning, we chose the latest released open-source models by Meta: Llama-3.1-8B-Instruct and Llama-3.1-70B-Instruct ([1]). The Llama-3.1 model family demonstrates stronger reasoning abilities compared to other open-source
LLMs. For our experiments, we fine-tuned the smaller model (Llama-3.1-8B-Instruct) with different entity relationship representations of multi-hop reasoning. During fine-tuning, we optimize the language modeling objective, which predicts the probability distribution of each token based on its previous tokens. We adopt LoRA ([17]) to perform parameter-efficient fine-tuning. During inference, we used greedy decoding for reproducibility.

                              Dataset 1 Train   Test   Intersection with Dataset 1 Train
Number of Hops                       2             3          0
Dataset Size                    10,262         1,007          0
Bridge Entities (𝑒2)               324           261         28
Train 𝑒2 vs Test 𝑒3                324           233         15
Relations (𝑟1, 𝑟2)                 191            81         70
No. of rows with (𝑟1, 𝑟2)       10,262         1,007        973
Relations (𝑟2, 𝑟3)                191*            97         92
Note *: this number is for (𝑟1, 𝑟2) in Dataset 1 Train.

Table 3: Dataset 3 Basic Statistics

4.3 Experiment Environment
We use the large language model checkpoints from Hugging Face Transformers¹ in all the experiments. Experiments were conducted using one Nvidia A100 GPU with 40GB of memory. When fine-tuning the LLMs, we ran the training process for one epoch in all the experiments. We didn't fine-tune the larger model (Llama-3.1-70B-Instruct) because of computation resource constraints.

4.4 Experiment Design
To test how well each representation of KGs helps with LLM reasoning, we designed the following three types of experiments.

4.4.1 LLM Reasoning without Context. To test the multi-hop reasoning ability of LLMs, we compared model performance under three conditions:
(1) Zero-shot prompting of the LLMs to predict 𝑒3. Given (𝑒1, 𝑟1, 𝑟2), we prompt the LLMs to predict 𝑒3. This serves as our baseline experiment.
(2) One-shot prompting of the LLMs with a reasoning example. We conducted three experiments, one for each representation format of KGs.
(3) Zero-shot prompting of fine-tuned LLMs. We conducted three experiments, one for each representation format of KGs.

4.4.2 LLM Reasoning with Context. Since Retrieval-Augmented Generation (RAG) is one of the most popular applications of LLMs, we designed experiments to compare the performance of LLMs when given input context. This set of experiments is designed to test how well LLMs perform in potential RAG applications.

4.4.3 Model Generalization to Longer Reasoning Paths. We fine-tuned the LLMs using only one-hop and two-hop triplets. For this set of experiments, we tested how well the fine-tuned models can generalize to reasoning over longer paths.

4.5 Prompt Design
The design of the prompts for all the experiments is presented in Table 4. Examples are shown for two-hop reasoning prompts. Because of the paper length limitation, the natural language explanation, JSON data structure, and Python code snippets for one-shot prompting are shown in Appendix A. Three-hop reasoning prompts follow the same logic and use the same prompt format.

4.6 Metrics
The main metric for measuring the performance of LLM reasoning is the accuracy of multi-hop reasoning conditioned on the correctness of each individual hop, denoted as 𝑟 = 𝑝(ℎ | ℎ1, ℎ2, ..., ℎ𝑛), where ℎ is the correctness of the multi-hop reasoning and ℎ𝑖 is the correctness of the 𝑖-th hop of reasoning. We use this metric instead of the overall accuracy of the results because the final reasoning output is affected by multiple factors, including whether the LLM has knowledge of each individual entity and whether it can infer each hop correctly. Our main metric measures the latent multi-hop reasoning performance. We also provide the overall accuracy of the results as a reference for potential applications.

4.7 Experimental Results
We compare the different approaches to grounding LLM reasoning with different knowledge graph representations by answering the following research questions.

4.7.1 RQ1: Do different representations of entity relationships affect LLM multi-hop reasoning? Can we fine-tune the LLM with proper entity relationship representations to improve its reasoning capabilities? Since LLMs have already demonstrated abilities in latent multi-hop reasoning ([33]), we designed experiments to compare how different approaches and different representations of multi-hop entity relationships can further improve LLMs' multi-hop reasoning ability.

Impact of Entity Relationship Representations for LLM Prompting. Table 5 and Table 6 present the performance of Llama-3.1-8B-Instruct with different representations across the datasets. It is obvious from the one-shot prompting results that Python representations of entity relationships outperform both the plain natural language text representation and the JSON representation, and that both the Python and the natural language representations for one-shot prompting perform better than zero-shot prompting. The two-hop reasoning performance of the one-shot LLM with the dynamic Python representation is approximately 78% higher than that of zero-shot prompting of the LLM on Dataset 1 (Table 1). Similarly, the performance of the static Python representation for one-shot prompting of the LLM is about 60% higher than that of zero-shot prompting for Dataset 2 (Table 2). However, the performance of one-shot prompting with the JSON representation is worse than zero-shot prompting of the LLM. We hypothesize that the structured JSON representation is not native to the LLM reasoning process.

Impact of Entity Relationship Representation for LLM Fine-Tuning. Table 5 and Table 6 also show the performance of the fine-tuned Llama-3.1-8B-Instruct model with different representations on different datasets. In this case, the LLMs fine-tuned with Python representations perform better than the LLMs fine-tuned with either

1 https://huggingface.co/docs/transformers/index
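The conditional-accuracy metric of Section 4.6 can be sketched in a few lines of Python; the records below are made-up illustrations, not data from our experiments:

```python
# r = p(h | h1, ..., hn): accuracy of the final multi-hop answer, computed
# only over examples where every individual hop was answered correctly.
def conditional_accuracy(records):
    """records: list of (hops_correct: list of bool, final_correct: bool)."""
    eligible = [final for hops, final in records if all(hops)]
    return sum(eligible) / len(eligible) if eligible else 0.0

# Made-up example: four two-hop questions.
records = [
    ([True, True], True),    # both hops and the final answer correct
    ([True, True], False),   # both hops correct, final answer wrong
    ([True, False], False),  # a hop failed: excluded from the metric
    ([True, True], True),
]
r = conditional_accuracy(records)  # 2 correct out of 3 eligible rows
```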

Dataset 1
  Multihop Question:    𝑟2 of 𝑟1 of 𝑒1 is _
  Zero-Shot Prompt:     Given the incomplete statement: 𝑟2 of 𝑟1 of 𝑒1 is _ , provide answer and generate explanation for completing the statement
  One-Shot Prompt
    (text):             {"Answer": "𝑒3", "Explanation": "𝑟1 of 𝑒1 is 𝑒2. 𝑟2 of 𝑒2 is 𝑒3. 𝑟2 of 𝑟1 of 𝑒1 is 𝑒3"}
                        Given the incomplete statement: {statement} _ , provide answer and generate explanation for completing the statement
    (JSON):             {"Answer": "𝑒3", "JSON structure": "{JSON data}"}
                        Given the incomplete statement: {statement} _ , provide answer and generate JSON structure for completing the statement
    (Python):           {"Answer": "𝑒3", "Python code snippet": "{Python code}"}
                        Given the incomplete statement: {statement} _ , provide answer and generate python code for completing the statement
  Prompt with Context:  Given context: {context} and the uncompleted statement: {statement} _ , provide answer and generate explanation for completing the statement

Dataset 2
  Multihop Question:    What is 𝑟2 of 𝑟1 of 𝑒1 ?
  Zero-Shot Prompt:     Given the question: What is 𝑟2 of 𝑟1 of 𝑒1 ? generate explanation and provide answer to the question
  One-Shot Prompt
    (text):             {"Answer": "𝑒3", "Explanation": "𝑟1 of 𝑒1 is 𝑒2. 𝑟2 of 𝑒2 is 𝑒3. 𝑟2 of 𝑟1 of 𝑒1 is 𝑒3"}
                        Given the question: {question} ? generate explanation and provide answer to the question
    (JSON):             {"Answer": "𝑒3", "JSON structure": "{JSON data}"}
                        Given the question: {question} ? generate JSON structure and provide answer to the question
    (Python):           {"Answer": "𝑒3", "Python code snippet": "{Python code}"}
                        Given the question: {question} ? generate python code and provide answer to the question
  Prompt with Context:  Given context: {context} and the question: {question} generate explanation and provide answer to the question

Table 4: Prompt Design
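As an illustration, the Dataset 1 zero-shot template from Table 4 can be instantiated with the running example; the helper function below is ours, not part of the paper's code:

```python
# Fill the Dataset 1 zero-shot template with a concrete (e1, r1, r2) query.
def zero_shot_prompt(e1, r1, r2):
    return (
        f"Given the incomplete statement: {r2} of {r1} of {e1} is _ , "
        "provide answer and generate explanation for completing the statement"
    )

prompt = zero_shot_prompt("'It Goes Like It Goes'", "composer", "spouse")
```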

JSON representation or plain natural language text representations. The performance of the LLM fine-tuned with the JSON representation is only slightly worse than that of the LLM fine-tuned with the natural language representation. For comparison, we also provide the performance numbers for zero-shot prompting of Llama-3.1-70B-Instruct. All LLMs that were fine-tuned with any entity relationship representation outperform the larger model for latent multi-hop reasoning.

4.7.2 RQ2: Can fine-tuned LLMs generalize their reasoning ability to more hops than seen in the training data? Since we fine-tuned LLMs using only one-hop and two-hop reasoning data, it is important to study whether such a fine-tuning process improves the LLMs' reasoning performance over more hops. We created a three-hop dataset (as shown in Table 3) to measure the performance of the fine-tuned LLMs. As shown in Table 7, the three-hop reasoning performance of all fine-tuned LLMs has improved across all entity relationship representations compared with the baseline LLM without fine-tuning. Furthermore, LLMs that were fine-tuned with the Python representation outperform those fine-tuned with either the plain natural language representation or the JSON representation. As the relationships (𝑟2, 𝑟3) have significant overlap with (𝑟1, 𝑟2) in the training data (as shown in Table 3), the relative performance of the fine-tuned models is consistent with what is shown in Table 5.

4.7.3 RQ3: How much can LLM in-context learning help or benefit from multi-hop reasoning? Retrieval-Augmented Generation (RAG) is one of the main applications of LLMs. However, even when supplied with correctly retrieved information, LLMs do not always generate the correct answer, particularly when multi-hop reasoning is required. To address this issue, we designed experiments to study the performance of fine-tuned LLMs when given an input context. In these experiments, we performed one-shot prompting of the LLMs, using the prompts listed in Table 4. The results are shown in Table 8. For Dataset 1, since no context is provided, we use the 1st-hop and 2nd-hop inferences as context and measure whether the model can infer the correct answers. For Dataset 2, we use the given context for the questions. Again, the LLMs that were fine-tuned with different entity relationship representations outperform the baseline model without fine-tuning. Notably, the LLM fine-tuned with the dynamic Python representation greatly outperforms the baseline model; it even surpasses the performance of the much larger baseline model (Llama-3.1-70B-Instruct).

4.8 Discussion
We designed a series of experiments to study how to seamlessly integrate entity relationships into LLMs to improve their multi-hop reasoning ability, and the impact of different entity relationship representations on the performance of LLMs. Our experimental results show that it is possible to integrate entity relationships into LLMs and ground the LLM multi-hop reasoning process with knowledge graphs. While all forms of the proposed entity relationship representations help improve LLM reasoning performance, they affect LLM performance differently. The natural language representation is straightforward and is the major form of pre-training data for LLMs. The JSON representation is best suited for storing structured data; however, it can be difficult for an LLM to directly integrate pure structured data. The Python representations store both the structured data and the inference process, providing a more controlled and unambiguous way of guiding LLMs through the reasoning process. This helps LLMs achieve better reasoning performance in all the
Thinking with Knowledge Graphs: Enhancing LLM Reasoning Through Structured Data

Model | KG Representation for fine-tuning | KG Representation for prompt | Prompt | Accuracy | # 1st & 2nd hop correct, final incorrect | # 1st & 2nd hop correct, final correct | Accuracy given 1st & 2nd hop correct
Llama-3.1-8B | NA | NA | Zero Shot | 16.3% | 989 | 604 | 38.2%
Llama-3.1-70B | NA | NA | Zero Shot | 33.9% | 615 | 1,979 | 76.3%
Llama-3.1-8B | NA | Natural Language | One Shot | 16.9% | 965 | 776 | 44.6%
Llama-3.1-8B | NA | JSON | One Shot | 7.78% | 995 | 352 | 26.1%
Llama-3.1-8B | NA | Python Static | One Shot | 17.5% | 408 | 609 | 59.9%
Llama-3.1-8B | NA | Python Dynamic | One Shot | 19.1% | 341 | 772 | 67.9%
Llama-3.1-8B tuned | Natural Language | NA | Zero Shot | 25.6% | 145 | 1,005 | 87.4%
Llama-3.1-8B tuned | JSON | NA | Zero Shot | 20.8% | 82 | 569 | 87.4%
Llama-3.1-8B tuned | Python Static | NA | Zero Shot | 26.2% | 102 | 1,093 | 91.5%
Llama-3.1-8B tuned | Python Dynamic | NA | Zero Shot | 26.5% | 119 | 882 | 88.1%

Table 5: LLM Two Hop Reasoning for Dataset 1

Model | KG Representation for fine-tuning | KG Representation for prompt | Prompt | Accuracy | # 1st & 2nd hop correct, final incorrect | # 1st & 2nd hop correct, final correct | Accuracy given 1st & 2nd hop correct
Llama-3.1-8B | NA | NA | Zero Shot | 1.85% | 51 | 14 | 21.5%
Llama-3.1-70B | NA | NA | Zero Shot | 10.7% | 164 | 195 | 54.3%
Llama-3.1-8B | NA | Natural Language | One Shot | 3.99% | 82 | 25 | 23.4%
Llama-3.1-8B | NA | JSON | One Shot | 1.80% | 19 | 4 | 17.4%
Llama-3.1-8B | NA | Python Static | One Shot | 4.98% | 50 | 29 | 36.7%
Llama-3.1-8B | NA | Python Dynamic | One Shot | 4.12% | 35 | 12 | 25.5%
Llama-3.1-8B tuned | Natural Language | NA | Zero Shot | 11.0% | 97 | 164 | 62.8%
Llama-3.1-8B tuned | JSON | NA | Zero Shot | 10.0% | 61 | 157 | 72.0%
Llama-3.1-8B tuned | Python Static | NA | Zero Shot | 12.1% | 44 | 191 | 81.3%
Llama-3.1-8B tuned | Python Dynamic | NA | Zero Shot | 12.3% | 38 | 203 | 84.2%

Table 6: LLM Two Hop Reasoning for Dataset 2

Model | KG Representation for tuning | Accuracy | % Correct given 1st & 2nd hop correct | % Correct given 2nd & 3rd hop correct | % Correct given all three hops correct
Llama-3.1-8B | NA | 20.4% | 35.2% | 37.1% | 50.0%
Llama-3.1-70B | NA | 42.6% | 64.6% | 61.8% | 80.4%
Llama-3.1-8B tuned | Natural Language | 31.0% | 44.0% | 54.1% | 65.0%
Llama-3.1-8B tuned | JSON | 23.3% | 42.5% | 46.1% | 63.9%
Llama-3.1-8B tuned | Python Static | 30.9% | 47.7% | 59.3% | 70.6%
Llama-3.1-8B tuned | Python Dynamic | 38.0% | 51.6% | 57.4% | 67.0%

Table 7: LLM Three Hop Reasoning for Dataset 3 with Zero Shot Prompting

Model | KG Representation for tuning | Dataset 1 Accuracy (given 1st & 2nd hop as context) | Dataset 2 Accuracy (given context)
Llama-3.1-8B | NA | 88.6% | 10.9%
Llama-3.1-70B | NA | 96.1% | 41.9%
Llama-3.1-8B tuned | Natural Language | 87.9% | 44.7%
Llama-3.1-8B tuned | JSON | 92.3% | 46.6%
Llama-3.1-8B tuned | Python Static | 87.2% | 52.8%
Llama-3.1-8B tuned | Python Dynamic | 96.4% | 59.2%

Table 8: LLM Two Hop Reasoning with Input Context

cases we studied. In some cases, the fine-tuned small LLMs perform better than much larger LLMs, even though the deep neural networks of the larger LLMs can help the model make connections

among multiple hops and infer the correct answers. Because we guide and fine-tune the models with emphasis on multi-hop relationships rather than the facts of individual entities, the fine-tuned models can easily generalize to do multi-hop reasoning for completely different entities, even in cases where the number of hops is greater than what is in the training data. Our proposed fine-tuning approaches also help improve the results of in-context learning.

As synthetic data becomes increasingly important for LLM pre-training and fine-tuning, generating proper representations of structured data (especially knowledge graphs) and incorporating this type of data in LLM pre-training and fine-tuning can greatly improve LLM reasoning abilities and reduce hallucinations. As demonstrated by our experimental results, the representation of entity relationships is very important for LLM performance. Programming languages provide the flexibility to represent various entity relationships with native data structures. They can guide LLM reasoning in a more controlled and principled way and ground LLM inference with a knowledge base.

In our experiments, we only studied the simplest forms of reasoning: two-hop and three-hop reasoning with compositional relationships. The reasoning process for real-life applications can be much more sophisticated. As shown in Figure 4, the Python representation of a knowledge graph with dynamic relationships provides the flexibility to define any relationship, even extending to a subgraph. Entities and relationships can be defined as classes themselves, allowing us to add attributes to the entities and relationships. Correspondingly, the "infer" function can be redefined to include code that checks these attributes. In future work, we will study the graph representation of entity relationships and its impact on more complex reasoning cases.

5 CONCLUDING REMARKS

We proposed different representations of entity relationships in knowledge graphs to improve the multi-hop reasoning capabilities of LLMs. We conducted a series of experiments to study how different representations of entity relationships affect LLM reasoning ability. We showed that introducing programming language representations of the entity relationships helps improve LLM multi-hop reasoning ability and reduce hallucination.

The programming language representation of the entity relationships provides a controlled and unambiguous way to carry out LLM multi-hop reasoning. By leveraging the native data structures inherent in programming languages, we can effectively model complex entity relationships, while iterative inference functions guide the logical reasoning process. This approach not only enhances reasoning accuracy but also facilitates generalization to more sophisticated reasoning use cases. However, accurately measuring the performance of LLM reasoning beyond two-hop and three-hop compositional relationships can be challenging, as the reasoning process becomes increasingly complex. As part of future work, we would like to study the programming language representations of more sophisticated relationships in order to solve more complex reasoning tasks. We will experiment at both the pre-training and fine-tuning stages of LLMs to evaluate the performance impact. We hope that our work will inspire other researchers to further push the frontier of LLM research.

REFERENCES
[1] Meta AI. 2024. The Llama 3 Herd of Models. arXiv:2407.21783 [cs.AI] https://arxiv.org/abs/2407.21783
[2] Viraat Aryabumi, Yixuan Su, Raymond Ma, Adrien Morisot, Ivan Zhang, Acyr Locatelli, Marzieh Fadaee, Ahmet Üstün, and Sara Hooker. 2024. To Code, or Not To Code? Exploring Impact of Code in Pre-training. arXiv:2408.10914 [cs.CL] https://arxiv.org/abs/2408.10914
[3] Eden Biran, Daniela Gottesman, Sohee Yang, Mor Geva, and Amir Globerson. 2024. Hopping Too Late: Exploring the Limitations of Large Language Models on Multi-Hop Queries. arXiv:2406.12775 [cs.CL] https://arxiv.org/abs/2406.12775
[4] Kurt D. Bollacker, Colin Evans, Praveen K. Paritosh, Tim Sturge, and Jamie Taylor. 2008. Freebase: a collaboratively created graph database for structuring human knowledge. In SIGMOD Conference. https://api.semanticscholar.org/CorpusID:207167677
[5] Tim Bray. 2014. The JavaScript Object Notation (JSON) Data Interchange Format.
[6] Felix Brei, Johannes Frey, and Lars-Peter Meyer. 2024. Leveraging small language models for Text2SPARQL tasks to improve the resilience of AI assistance. arXiv:2405.17076 [cs.AI] https://arxiv.org/abs/2405.17076
[7] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language Models are Few-Shot Learners. arXiv:2005.14165 [cs.CL] https://arxiv.org/abs/2005.14165
[8] Diego Bustamante and Hideaki Takeda. 2024. SPARQL Generation with Entity Pre-trained GPT for KG Question Answering. arXiv:2402.00969 [cs.CL] https://arxiv.org/abs/2402.00969
[9] Ziwei Chai, Tianjie Zhang, Liang Wu, Kaiqiao Han, Xiaohai Hu, Xuanwen Huang, and Yang Yang. 2023. GraphLLM: Boosting Graph Reasoning Ability of Large Language Model. arXiv:2310.05845 [cs.CL] https://arxiv.org/abs/2310.05845
[10] Nurendra Choudhary, Nikhil Rao, Karthik Subbian, and Chandan Reddy. 2022. Graph-based multilingual language model: Leveraging product relations for search relevance. In KDD 2022. https://www.amazon.science/publications/graph-based-multilingual-language-model-leveraging-product-relations-for-search-relevance
[11] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Ben Hutchinson, Reiner Pope, James Bradbury, Jacob Austin, Michael Isard, Guy Gur-Ari, Pengcheng Yin, Toju Duke, Anselm Levskaya, Sanjay Ghemawat, Sunipa Dev, Henryk Michalewski, Xavier Garcia, Vedant Misra, Kevin Robinson, Liam Fedus, Denny Zhou, Daphne Ippolito, David Luan, Hyeontaek Lim, Barret Zoph, Alexander Spiridonov, Ryan Sepassi, David Dohan, Shivani Agrawal, Mark Omernick, Andrew M. Dai, Thanumalayan Sankaranarayana Pillai, Marie Pellat, Aitor Lewkowycz, Erica Moreira, Rewon Child, Oleksandr Polozov, Katherine Lee, Zongwei Zhou, Xuezhi Wang, Brennan Saeta, Mark Diaz, Orhan Firat, Michele Catasta, Jason Wei, Kathy Meier-Hellstern, Douglas Eck, Jeff Dean, Slav Petrov, and Noah Fiedel. 2022. PaLM: Scaling Language Modeling with Pathways. arXiv:2204.02311 [cs.CL] https://arxiv.org/abs/2204.02311
[12] Stefan Dernbach, Khushbu Agarwal, Alejandro Zuniga, Michael Henry, and Sutanay Choudhury. 2024. GLaM: Fine-Tuning Large Language Models for Domain Knowledge Graph Alignment via Neighborhood Partitioning and Generative Subgraph Encoding. arXiv:2402.06764 [cs.AI] https://arxiv.org/abs/2402.06764
[13] Darren Edge, Ha Trinh, Newman Cheng, Joshua Bradley, Alex Chao, Apurva Mody, Steven Truitt, and Jonathan Larson. 2024. From Local to Global: A Graph RAG Approach to Query-Focused Summarization. arXiv:2404.16130 [cs.CL] https://arxiv.org/abs/2404.16130
[14] Yao Fu, Hao Peng, Ashish Sabharwal, Peter Clark, and Tushar Khot. 2023. Complexity-Based Prompting for Multi-step Reasoning. In The Eleventh International Conference on Learning Representations. https://openreview.net/forum?id=yf1icZHC-l9
[15] Jiayan Guo, Lun Du, Hengyu Liu, Mengyu Zhou, Xinyi He, and Shi Han. 2023. GPT4Graph: Can Large Language Models Understand Graph Structured Data? An Empirical Evaluation and Benchmarking. arXiv:2305.15066 [cs.AI] https://arxiv.org/abs/2305.15066
[16] Xanh Ho, Anh-Khoa Duong Nguyen, Saku Sugawara, and Akiko Aizawa. 2020. Constructing A Multi-hop QA Dataset for Comprehensive Evaluation of Reasoning Steps. In Proceedings of the 28th International Conference on Computational Linguistics. International Committee on Computational Linguistics, Barcelona, Spain (Online), 6609–6625. https://www.aclweb.org/anthology/2020.coling-main.580
[17] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2022. LoRA: Low-Rank Adaptation of Large Language Models. In International Conference on Learning Representations. https://openreview.net/forum?id=nZeVKeeFYf9

[18] Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Andrea Madotto, and Pascale Fung. 2023. Survey of Hallucination in Natural Language Generation. Comput. Surveys 55, 12 (March 2023), 1–38. https://doi.org/10.1145/3571730
[19] Seungone Kim, Se June Joo, Doyoung Kim, Joel Jang, Seonghyeon Ye, Jamin Shin, and Minjoon Seo. 2023. The CoT Collection: Improving Zero-shot and Few-shot Learning of Language Models via Chain-of-Thought Fine-Tuning. arXiv:2305.14045 [cs.CL] https://arxiv.org/abs/2305.14045
[20] Angeliki Lazaridou, Elena Gribovskaya, Wojciech Stokowiec, and Nikolai Grigorev. 2022. Internet-augmented language models through few-shot prompting for open-domain question answering. arXiv:2203.05115 [cs.CL] https://arxiv.org/abs/2203.05115
[21] Jens Lehmann, Robert Isele, Max Jakob, Anja Jentzsch, Dimitris Kontokostas, Pablo Mendes, Sebastian Hellmann, Mohamed Morsey, Patrick Van Kleef, Sören Auer, and Christian Bizer. 2014. DBpedia - A Large-scale, Multilingual Knowledge Base Extracted from Wikipedia. Semantic Web Journal 6 (01 2014). https://doi.org/10.3233/SW-140134
[22] Linhao Luo, Yuan-Fang Li, Gholamreza Haffari, and Shirui Pan. 2024. Reasoning on Graphs: Faithful and Interpretable Large Language Model Reasoning. arXiv:2310.01061 [cs.CL] https://arxiv.org/abs/2310.01061
[23] Yingwei Ma, Yue Liu, Yue Yu, Yuanliang Zhang, Yu Jiang, Changjian Wang, and Shanshan Li. 2023. At Which Training Stage Does Code Data Help LLMs Reasoning? arXiv:2309.16298 [cs.CL] https://arxiv.org/abs/2309.16298
[24] Zhijie Nie, Richong Zhang, Zhongyuan Wang, and Xudong Liu. 2024. Code-Style In-Context Learning for Knowledge-Based Question Answering. arXiv:2309.04695 [cs.CL] https://arxiv.org/abs/2309.04695
[25] Ofir Press, Muru Zhang, Sewon Min, Ludwig Schmidt, Noah Smith, and Mike Lewis. 2023. Measuring and Narrowing the Compositionality Gap in Language Models. In Findings of the Association for Computational Linguistics: EMNLP 2023, Houda Bouamor, Juan Pino, and Kalika Bali (Eds.). Association for Computational Linguistics, Singapore, 5687–5711. https://doi.org/10.18653/v1/2023.findings-emnlp.378
[26] Julio C. Rangel, Tarcisio Mendes de Farias, Ana Claudia Sima, and Norio Kobayashi. 2024. SPARQL Generation: an analysis on fine-tuning OpenLLaMA for Question Answering over a Life Science Knowledge Graph. arXiv:2402.04627 [cs.AI] https://arxiv.org/abs/2402.04627
[27] Denny Vrandečić and Markus Krötzsch. 2014. Wikidata: A Free Collaborative Knowledgebase. Commun. ACM 57, 10 (Sept. 2014), 78–85. https://doi.org/10.1145/2629489
[28] Heng Wang, Shangbin Feng, Tianxing He, Zhaoxuan Tan, Xiaochuang Han, and Yulia Tsvetkov. 2024. Can Language Models Solve Graph Problems in Natural Language? arXiv:2305.10037 [cs.CL] https://arxiv.org/abs/2305.10037
[29] Jianing Wang, Junda Wu, Yupeng Hou, Yao Liu, Ming Gao, and Julian McAuley. 2024. InstructGraph: Boosting Large Language Models via Graph-centric Instruction Tuning and Preference Alignment. arXiv:2402.08785 [cs.CL] https://arxiv.org/abs/2402.08785
[30] Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. 2023. Self-Consistency Improves Chain of Thought Reasoning in Language Models. In The Eleventh International Conference on Learning Representations. https://openreview.net/forum?id=1PL1NIMMrw
[31] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, brian ichter, Fei Xia, Ed Chi, Quoc V Le, and Denny Zhou. 2022. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. In Advances in Neural Information Processing Systems, S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh (Eds.), Vol. 35. Curran Associates, Inc., 24824–24837. https://proceedings.neurips.cc/paper_files/paper/2022/file/9d5609613524ecf4f15af0f7b31abca4-Paper-Conference.pdf
[32] Silei Xu, Shicheng Liu, Theo Culhane, Elizaveta Pertseva, Meng-Hsi Wu, Sina Semnani, and Monica Lam. 2023. Fine-tuned LLMs Know More, Hallucinate Less with Few-Shot Sequence-to-Sequence Semantic Parsing over Wikidata. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Houda Bouamor, Juan Pino, and Kalika Bali (Eds.). Association for Computational Linguistics, Singapore, 5778–5791. https://doi.org/10.18653/v1/2023.emnlp-main.353
[33] Sohee Yang, Elena Gribovskaya, Nora Kassner, Mor Geva, and Sebastian Riedel. 2024. Do Large Language Models Latently Perform Multi-Hop Reasoning? arXiv:2402.16837 [cs.CL] https://arxiv.org/abs/2402.16837
[34] Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, and Karthik R Narasimhan. 2023. Tree of Thoughts: Deliberate Problem Solving with Large Language Models. In Thirty-seventh Conference on Neural Information Processing Systems. https://openreview.net/forum?id=5Xc1ecxO1h
[35] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. 2023. ReAct: Synergizing Reasoning and Acting in Language Models. In International Conference on Learning Representations (ICLR).
[36] Michihiro Yasunaga, Antoine Bosselut, Hongyu Ren, Xikun Zhang, Christopher D Manning, Percy Liang, and Jure Leskovec. 2022. Deep Bidirectional Language-Knowledge Graph Pretraining. arXiv:2210.09338 [cs.CL] https://arxiv.org/abs/2210.09338
[37] Ruosong Ye, Caiqi Zhang, Runhui Wang, Shuyuan Xu, and Yongfeng Zhang. 2024. Language is All a Graph Needs. arXiv:2308.07134 [cs.CL] https://arxiv.org/abs/2308.07134
[38] Xinlu Zhang, Zhiyu Zoey Chen, Xi Ye, Xianjun Yang, Lichang Chen, William Yang Wang, and Linda Ruth Petzold. 2024. Unveiling the Impact of Coding Data Instruction Fine-Tuning on Large Language Models Reasoning. arXiv:2405.20535 [cs.AI] https://arxiv.org/abs/2405.20535
[39] Xuan Zhang, Chao Du, Tianyu Pang, Qian Liu, Wei Gao, and Min Lin. 2024. Chain of Preference Optimization: Improving Chain-of-Thought Reasoning in LLMs. arXiv:2406.09136 [cs.CL] https://arxiv.org/abs/2406.09136
[40] Kerui Zhu, Bo-Wei Huang, Bowen Jin, Yizhu Jiao, Ming Zhong, Kevin Chang, Shou-De Lin, and Jiawei Han. 2024. Investigating Instruction Tuning Large Language Models on Graphs. arXiv:2408.05457 [cs.CL] https://arxiv.org/abs/2408.05457

A ONE SHOT EXAMPLE FOR DIFFERENT PROMPT FORMATS

A.1 Explanation for Natural Language Prompt

The composer of It Goes Like It Goes is David Shire. The spouse of David Shire is Didi Conn. The spouse of the composer of It Goes Like It Goes is _

A.2 JSON structure Prompt

{
    "composer": {
        "It Goes Like It Goes": "David Shire"
    },
    "spouse": {
        "David Shire": "Didi Conn"
    }
}

A.3 Python Code Snippet for Prompt V1

# Step 1. Define relationships with explicit types
relationships = {
    'composer': {
        'It Goes Like It Goes': 'David Shire'  # It Goes Like It Goes is related to David Shire via relationship composer
    },
    'spouse': {
        'David Shire': 'Didi Conn'  # David Shire is related to Didi Conn via relationship spouse
    }
}

# Define entities and relationships
e1 = 'It Goes Like It Goes'
r1 = 'composer'
r2 = 'spouse'

# Step 2. (r1, e1) -> e2
e2 = relationships[r1][e1]

# Step 3. (r2, e2) -> e3
e3 = relationships[r2][e2]

# Output the result
print(f"The {r2} of the {r1} of {e1} is {e3}")

# When you run the code, it will output:
# The spouse of the composer of It Goes Like It Goes is Didi Conn

A.4 Python Code Snippet for Prompt V2

# Step 1. Define relationships with a knowledge base
class KnowledgeBase:
    def __init__(self):
        # Initialize an empty dictionary to store facts.
        # Each key is a tuple (entity1, relation), and the value is the
        # entity2 related to entity1 through relation.
        self.facts = {}

    def add_fact(self, entity1, relation, entity2):
        # Add a fact to the knowledge base.
        # :param entity1: The starting entity.
        # :param relation: The relation from entity1 to entity2.
        # :param entity2: The related entity reached via the relation.
        self.facts[(entity1, relation)] = entity2

    def infer(self, entity, *relations):
        # Infer the resulting entity by traversing the relations
        # starting from the given entity.
        # :param entity: The starting entity.
        # :param relations: A chain of relations to traverse.
        # :return: The resulting entity after applying the relations,
        #          or None if no such path exists.
        current_entity = entity
        for relation in relations:
            key = (current_entity, relation)
            if key in self.facts:
                current_entity = self.facts[key]
            else:
                # If the path does not exist, return None.
                return None
        return current_entity

# Example usage:
# Create a knowledge base instance.
kb = KnowledgeBase()

# Step 2. Define entities and relationships
e1 = 'It Goes Like It Goes'
r1 = 'composer'
e2 = 'David Shire'
r2 = 'spouse'
e3 = 'Didi Conn'

# Add entities and relationships to the knowledge base.
kb.add_fact(e1, r1, e2)
kb.add_fact(e2, r2, e3)

# Step 3. Perform inference.
result1 = kb.infer(e1, r1)      # (It Goes Like It Goes, composer) -> David Shire
result2 = kb.infer(e2, r2)      # (David Shire, spouse) -> Didi Conn
result3 = kb.infer(e1, r1, r2)  # (It Goes Like It Goes, composer) -> David Shire, (David Shire, spouse) -> Didi Conn

# Output the result
print(result1)  # Output: David Shire
print(result2)  # Output: Didi Conn
print(result3)  # Output: Didi Conn
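Section 4.8 notes that entities and relationships could themselves be defined as classes with attributes, with "infer" redefined to check them. A minimal sketch of that future-work direction follows; the class names, attribute names, and the `require` parameter are all illustrative assumptions, not part of the paper's prompts:

```python
class Entity:
    """An entity carrying arbitrary attributes (hypothetical extension)."""
    def __init__(self, name, **attrs):
        self.name = name
        self.attrs = attrs

class AttributeKnowledgeBase:
    """Stores (entity name, relation) -> Entity facts."""
    def __init__(self):
        self.facts = {}

    def add_fact(self, entity1, relation, entity2):
        self.facts[(entity1.name, relation)] = entity2

    def infer(self, entity, *relations, require=None):
        # Traverse the relation chain; optionally require every
        # intermediate entity to satisfy an attribute predicate.
        current = entity
        for relation in relations:
            key = (current.name, relation)
            if key not in self.facts:
                return None
            current = self.facts[key]
            if require is not None and not require(current):
                return None
        return current

# Illustrative usage with hypothetical attributes:
song = Entity('It Goes Like It Goes', type='song')
shire = Entity('David Shire', type='person', occupation='composer')
conn = Entity('Didi Conn', type='person', occupation='actress')

kb = AttributeKnowledgeBase()
kb.add_fact(song, 'composer', shire)
kb.add_fact(shire, 'spouse', conn)

result = kb.infer(song, 'composer', 'spouse',
                  require=lambda e: e.attrs.get('type') == 'person')
print(result.name)  # Didi Conn
```

The `require` predicate is one possible way for the traversal to enforce attribute constraints at each hop, corresponding to the "code that checks these attributes" mentioned in the discussion.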
