Thinking With Knowledge Graphs
• We introduce a novel representation of knowledge graphs with programming language. It facilitates the seamless integration of structured knowledge into the language modeling process.
• By tightly integrating knowledge graphs into LLMs, our approach improves the reasoning accuracy of LLMs on complex tasks and effectively grounds the reasoning process, reducing the chance of hallucinations.

2 RELATED WORK
There have been several attempts to apply LLMs to graph reasoning tasks. Wang et al. [28], Guo et al. [15], and Ye et al. [37] employ the Graph2Text strategy of converting graph data into textual descriptions. However, these textual descriptions can result in very large contexts, and algorithms such as shortest path computation and bipartite graph matching require calculations across the entire context, making the task highly challenging. Chai et al. [9] have introduced GraphLLM, which combines three steps, namely node understanding, graph structure understanding, and graph-enhanced prefix tuning. Zhu et al. [40] and Wang et al. [29] proposed different methods for instruction fine-tuning LLMs to improve the performance of common graph tasks.

While the above works address general graph problems, other research has focused specifically on combining KGs with LLMs. One such approach is to use the LLM as an encoder to transform text-based entity nodes and relations, and then fuse the LLM- and GNN-derived representations. Applications of this approach range from product recommendation (Choudhary et al. [10]) to biomedical question answering (Yasunaga et al. [36]). Luo & Pan [22] have proposed a Reasoning on Graphs (RoG) method that comprises two modules: a planning module and a retrieval-reasoning module. The planning module mines the KG and generates faithful relation paths for answering the question. The retrieval-reasoning module combines a breadth-first search and a probabilistic optimization over all paths. Dernbach et al. [12] developed a neighborhood partitioning and encoding scheme to accommodate real-world graph properties. Their encoding scheme transforms graph relations into alternate text representations, which in turn are used to fine-tune the LLM. Edge et al. [13] have built an end-to-end system, called GraphRAG, that starts with a set of source documents and iteratively applies an LLM to extract the entities in each document. Next, the entities are connected via the extracted relationships, and a knowledge graph is created. The knowledge graph is split into communities, and each community is summarized. These summaries are subsequently used by the LLM via RAG to help answer questions submitted to the system.

Nie et al. [24] provide a code-style in-context learning (ICL) method for knowledge base question answering (KBQA). They design seven meta-functions written in Python that cover the atomic operations used for querying databases. By using few-shot learning, they improve LLMs' ability to query knowledge bases effectively.

There is also ongoing research ([23], [38], [2]) that studies the impact of mixing programming code into pre-training or instruction fine-tuning datasets on the performance of LLMs. Even though the programming code used in this research is generic, the results show promising improvements on various tasks, including reasoning tasks. There is another line of work ([19], [39]) that studies how to improve LLM Chain of Thought (CoT) reasoning ability through fine-tuning. However, creating good datasets for fine-tuning CoT is labor intensive, and the reasoning steps are described in natural language text, which can introduce ambiguity.

Yang et al. [33] studied how LLMs perform multi-hop reasoning and found that LLMs perform latent multi-hop reasoning for certain relation types, but often struggle with later hops. Biran et al. [3] observed that later hops are resolved in the model's deeper layers, by which point the LLM may no longer encode the knowledge necessary for answering the question. They propose a back-patching approach, essentially feeding the hidden representations from later layers back into earlier ones.

In this work, we continue this prior research on multi-hop queries, showing that we can significantly improve the performance of LLMs by integrating KG structures and semantics into LLM representations. Similarly to [24], we also experiment with code-style representations. However, in our case, we represent the entity relations (not the atomic operations) in Python, and our goal is to improve the LLMs' ability to answer questions directly, not their ability to query the knowledge graph.

The rest of the paper is structured as follows. In Section 3 we describe the methodologies we followed to prompt or fine-tune the LLM. In Section 4 we describe the experimental design and results. Finally, we conclude the paper in Section 5.

3 METHODOLOGY
Our work focuses on studying the representation of entity relationships in KGs for grounded LLM reasoning.

3.1 Knowledge Graph Definition
Let $G = \{E, R, T\}$ denote a knowledge graph, where $E$ is the set of entities, $R$ is the set of relationships, and $T \subseteq E \times R \times E$ is the set of triplets that form the edges of the knowledge graph. A triplet $(e_i, r_i, e_{i+1}) \in T$ if there is a directed edge between entity $e_i \in E$ and entity $e_{i+1} \in E$ through relationship $r_i \in R$. A triplet also corresponds to a complete one-hop reasoning process. A two-hop compositional reasoning process can be represented as $((e_i, r_i, e_{i+1}), (e_{i+1}, r_{i+1}, e_{i+2}))$, where $(e_i, r_i, e_{i+1}) \in T$ and $(e_{i+1}, r_{i+1}, e_{i+2}) \in T$. Since $e_{i+1}$ appears in both triplets, it is the bridge entity. Similarly, a three-hop compositional reasoning process can be represented as $((e_i, r_i, e_{i+1}), (e_{i+1}, r_{i+1}, e_{i+2}), (e_{i+2}, r_{i+2}, e_{i+3}))$, where $(e_i, r_i, e_{i+1}) \in T$, $(e_{i+1}, r_{i+1}, e_{i+2}) \in T$, and $(e_{i+2}, r_{i+2}, e_{i+3}) \in T$.
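As a concrete instance of this notation, the example used throughout the paper forms the two-hop chain

$((\text{It Goes Like It Goes}, \mathit{composer}, \text{David Shire}),\ (\text{David Shire}, \mathit{spouse}, \text{Didi Conn}))$

with David Shire as the bridge entity $e_2$.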
3.2 Knowledge Graph Representation for LLM Reasoning
To improve LLM multi-hop reasoning with knowledge graphs, we represent knowledge graphs in ways that are more compatible with LLM prompting and fine-tuning. When given a complex reasoning prompt, LLMs can detect entities and relationships, then implicitly infer the key entities and facts by following logical reasoning steps grounded by knowledge graphs. For instance, given the prompt “Who is the spouse of the composer of ‘It Goes Like It Goes’?”, LLMs can follow the reasoning steps “The composer of ‘It Goes Like It Goes’ is David Shire” and “The spouse of David Shire is Didi Conn”, and infer that the correct answer is “Didi Conn”. Different representations of knowledge graphs affect how effectively LLMs can perform such logical reasoning.

The most natural way is to use natural language to describe the triplets in knowledge graphs. Figure 1 shows the natural language representation for two-hop reasoning over the triplets $((e_1, r_1, e_2), (e_2, r_2, e_3))$, where $e_2$ is the bridge entity. For instance, the triplets ((‘It Goes Like It Goes’, ‘composer’, ‘David Shire’), (‘David Shire’, ‘spouse’, ‘Didi Conn’)) can be represented as (“The composer of ‘It Goes Like It Goes’ is David Shire”, “The spouse of David Shire is Didi Conn”), and the two-hop reasoning can be represented as “The spouse of the composer of ‘It Goes Like It Goes’ is Didi Conn”. As LLMs understand natural language very well, the natural language representation is the most straightforward way to either prompt or fine-tune LLMs with knowledge graphs.

Figure 1: Natural Language Representation of KG with Static Relationships

JSON (JavaScript Object Notation) is a lightweight data interchange format [5]. It is designed to store data in universal data structures, such as dictionaries and lists, that are supported by most programming languages. JSON is a pure, data-only format and can be used to store structured data from knowledge graphs. Figure 2 shows the JSON representation of the knowledge graph triplets $(e_1, r_1, e_2)$ and $(e_2, r_2, e_3)$. The entities $e_1$ and $e_2$ are the keys, and the relationship/entity pairs $r_1 : e_2$ and $r_2 : e_3$ are the values. However, since JSON is designed to store data only, it is difficult to represent a multi-hop inference process in the JSON format. Alternatively, a comment or description field (with “comment” as the key and a natural language description of the multi-hop reasoning as the value) can be added to the JSON representation, but this is not a recommended practice in general.

Figure 2: JSON Representation of KG with Static Relationships
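For illustration, a minimal sketch of this comment-field variant, following the structure of the JSON snippet in Appendix A.2 (the wording of the comment is illustrative, not taken from the paper's figures):

{
    "composer": {
        "It Goes Like It Goes": "David Shire"
    },
    "spouse": {
        "David Shire": "Didi Conn"
    },
    "comment": "The spouse of the composer of It Goes Like It Goes is Didi Conn"
}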
Programming language code is another major data source for LLM pre-training and fine-tuning. Knowledge graphs can be represented using the various data structures supported by major programming languages such as Python. The triplets can be represented either as a static dictionary or added dynamically to predefined data structures as part of the code. Figure 3 shows the Python representation of knowledge graph triplets with a static dictionary data structure, together with the two-hop inference process based on the stored triplets. As shown in Figure 3, relationships and entities are stored in the dictionary “relationships”, with the relationships $r_1$ and $r_2$ as the keys and entity pairs as the values. The inference process is simply the retrieval of values by key from the dictionary. Figure 4 shows the Python representation of knowledge graph triplets with a predefined Python class “KnowledgeBase” and an iterative multi-hop inference function “infer” that supports inference over an arbitrary number of hops. As shown in Figure 4, the main data structure of “KnowledgeBase” is the dictionary “self.facts”, and entities and relationships are added to the dictionary with the function “add_fact”. The “infer” function accepts any number of relationships through the parameter “*relations” and performs the corresponding multi-hop reasoning. We designed the dynamic, self-defined-data-structure-based Python representation because it can be easily generalized to support multi-hop reasoning over subgraphs of KGs.

Figure 3: Python Representation of KG with Static Relationships

Figure 4: Python Representation of KG with Dynamic Relationships
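A minimal sketch of this dynamic representation, consistent with the description of “KnowledgeBase” above and with the prompt snippet in Appendix A.4:

class KnowledgeBase:
    def __init__(self):
        # Facts are keyed by (entity, relation) and map to the target entity.
        self.facts = {}

    def add_fact(self, entity, relation, target):
        self.facts[(entity, relation)] = target

    def infer(self, entity, *relations):
        # Follow the chain of relations one hop at a time.
        current_entity = entity
        for relation in relations:
            key = (current_entity, relation)
            if key not in self.facts:
                return None  # The requested path does not exist.
            current_entity = self.facts[key]
        return current_entity

kb = KnowledgeBase()
kb.add_fact('It Goes Like It Goes', 'composer', 'David Shire')
kb.add_fact('David Shire', 'spouse', 'Didi Conn')
print(kb.infer('It Goes Like It Goes', 'composer', 'spouse'))  # -> Didi Conn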
The examples mentioned in the previous paragraphs can be easily represented as dictionaries in Python. Figure 5 shows four different representations of the same example. Using programming languages, knowledge graphs can be represented in a more controlled and unambiguous way, and corner cases can be easily checked. However, the representation of the same triplet can be more verbose.
4 EXPERIMENTS
We designed experiments to study how different representations of entity relationships in KGs affect the reasoning performance of LLMs across two different datasets.

4.1 Datasets
4.1.1 Dataset 1. For the first dataset, we use the same dataset used by the “Hopping Too Late” paper ([3]). The dataset includes two-hop relationships extracted from the publicly available knowledge base Wikidata ([27]). For the experiments, we split the dataset into eight equal-sized partitions based on the bridge entity $e_2$ in a round-robin fashion, so that any unique $e_2$ exists in only one partition. We partition the dataset in this way so that the LLMs learn only the relationships and the logical reasoning process rather than memorizing the entities. To avoid overrepresentation of the most popular $e_2$ in the training or testing dataset, we choose partition 2 as our training dataset and partition 4 as the testing dataset. The details of the dataset are listed in Table 1. For this dataset, there is no overlap of bridge entities between the training and testing datasets. The overlap of relationship pairs $(r_1, r_2)$ is about 99%.

Table 1: Dataset 1 Train and Test Data Selection
4.1.2 Dataset 2. For the second dataset, we use the dataset created in paper ([16]). This dataset also includes two-hop relationships extracted from the publicly available knowledge base Wikidata ([27]). Although both Dataset 1 and Dataset 2 are derived from the same knowledge base, the extracted entities and relationships for Dataset 2 are different from those in Dataset 1. For the training dataset, we select only compositional relationships from this dataset and limit the number of instances per relationship pair $(r_1, r_2)$ to no more than 500, to avoid overrepresentation of particular relationships in the training data. We use the development dataset with compositional relationships for testing purposes. The details of the dataset are listed in Table 2. We choose compositional relationships so that the type of relationships is consistent with those in Dataset 1 and it is easier to compare reasoning performance across datasets. We did not further restrict the entities and relationships based on the overlap between the training and testing datasets, in order to respect the design of the dataset by paper ([16]).

Table 2: Dataset 2 Train and Test Data Selection

4.1.3 Dataset 3. This dataset is an extension of Dataset 1. We extended the two-hop relationships $((e_1, r_1, e_2), (e_2, r_2, e_3))$ by adding a third hop $(e_3, r_3, e_4)$, resulting in three-hop relationships $((e_1, r_1, e_2), (e_2, r_2, e_3), (e_3, r_3, e_4))$, while keeping the $e_3$ entities a subset of the $e_3$ entities in Dataset 1. This dataset was created to test whether models fine-tuned for two-hop reasoning can generalize to improve three-hop reasoning performance as well. The details of the dataset are listed in Table 3. The overlap of bridge entities between the training and testing datasets is minimal for this dataset. There is a high percentage of relationship overlap between the training and testing datasets.

The final training data for fine-tuning the LLMs include both one-hop prompts and responses, based on the first hop from the datasets, and two-hop prompts and responses, based on the two-hop information from the datasets.
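As an illustration of how such fine-tuning pairs can be derived from a two-hop chain (the prompt phrasing and field names are ours, not the paper's exact templates):

def make_training_examples(chain):
    # chain is a two-hop composition ((e1, r1, e2), (e2, r2, e3)).
    (e1, r1, e2), (_, r2, e3) = chain
    one_hop = {"prompt": f"Who is the {r1} of {e1}?", "response": e2}
    two_hop = {"prompt": f"Who is the {r2} of the {r1} of {e1}?", "response": e3}
    return [one_hop, two_hop]

examples = make_training_examples(
    (("It Goes Like It Goes", "composer", "David Shire"),
     ("David Shire", "spouse", "Didi Conn")))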
4.2 Large Language Models
To evaluate the performance of LLM reasoning, we chose the latest released open-source models by Meta: Llama-3.1-8B-Instruct and Llama-3.1-70B-Instruct ([1]). The Llama-3.1 model family demonstrates stronger reasoning abilities compared to other open-source models.

Models fine-tuned with the Python representations outperform those fine-tuned with the JSON representation or plain natural language text representations. The performance of the LLM fine-tuned with the JSON representation is only slightly worse than that of the LLM fine-tuned with the natural language representation. For comparison, we also provide the performance numbers for zero-shot prompting of Llama-3.1-70B-Instruct. All LLMs that were fine-tuned with any entity relationship representation outperform the larger model on latent multi-hop reasoning.

4.7.2 RQ2: Can fine-tuned LLMs generalize their reasoning ability to more hops than are covered in the training data? Since we fine-tuned the LLMs using only one-hop and two-hop reasoning data, it is important to study whether such a fine-tuning process improves the LLMs' reasoning performance on more hops. We created a three-hop dataset (as shown in Table 3) to measure the performance of the fine-tuned LLMs. As shown in Table 7, the three-hop reasoning performance of all fine-tuned LLMs improved across all entity relationship representations compared with the baseline LLM without fine-tuning. Furthermore, LLMs that were fine-tuned with the Python representation outperform those fine-tuned with either the plain natural language representation or the JSON representation. As the relationship pair $(r_2, r_3)$ has significant overlap with $(r_1, r_2)$ in the training data (as shown in Table 3), the relative performance of the fine-tuned models is consistent with what is shown in Table 5.

4.7.3 RQ3: How much can LLM in-context learning help or benefit from multi-hop reasoning? Retrieval Augmented Generation (RAG) is one of the main applications of LLMs. However, even when supplied with correctly retrieved information, LLMs do not always generate the correct answer, particularly when multi-hop reasoning is required. To address this issue, we designed experiments to study the performance of fine-tuned LLMs when given an input context.

In these experiments, we performed one-shot prompting of the LLMs, using the prompts listed in Table 4. The results are shown in Table 8. For Dataset 1, since no context is provided, we use the 1st-hop and 2nd-hop inferences as context and measure whether the model can infer the correct answers. For Dataset 2, we use the given context for the questions. Again, the LLMs that were fine-tuned with the different entity relationship representations outperform the baseline model without fine-tuning. Notably, the LLM fine-tuned with the dynamic Python representation greatly outperforms the baseline model; it even surpasses the performance of the much larger baseline model (Llama-3.1-70B-Instruct).
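As an illustration of how such an input context can be assembled for Dataset 1 from a two-hop chain (the exact prompt templates are in Table 4; this phrasing is ours):

def build_context(chain):
    # Use the 1st-hop and 2nd-hop inferences as the input context.
    (e1, r1, e2), (_, r2, e3) = chain
    return f"The {r1} of {e1} is {e2}. The {r2} of {e2} is {e3}."

context = build_context(
    (("It Goes Like It Goes", "composer", "David Shire"),
     ("David Shire", "spouse", "Didi Conn")))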
Model | KG Representation for fine-tuning | KG Representation for prompt | Prompt | Accuracy | 1st hop correct, 2nd hop correct, final incorrect | 1st hop correct, 2nd hop correct, final correct | 1st hop correct, 2nd hop correct, final accuracy
Llama-3.1-8B | NA | NA | Zero Shot | 16.3% | 989 | 604 | 38.2%
Llama-3.1-70B | NA | NA | Zero Shot | 33.9% | 615 | 1,979 | 76.3%
Llama-3.1-8B | NA | Natural Language | One Shot | 16.9% | 965 | 776 | 44.6%
Llama-3.1-8B | NA | JSON | One Shot | 7.78% | 995 | 352 | 26.1%
Llama-3.1-8B | NA | Python Static | One Shot | 17.5% | 408 | 609 | 59.9%
Llama-3.1-8B | NA | Python Dynamic | One Shot | 19.1% | 341 | 772 | 67.9%
Llama-3.1-8B tuned | Natural Language | NA | Zero Shot | 25.6% | 145 | 1,005 | 87.4%
Llama-3.1-8B tuned | JSON | NA | Zero Shot | 20.8% | 82 | 569 | 87.4%
Llama-3.1-8B tuned | Python Static | NA | Zero Shot | 26.2% | 102 | 1,093 | 91.5%
Llama-3.1-8B tuned | Python Dynamic | NA | Zero Shot | 26.5% | 119 | 882 | 88.1%

Model | KG Representation for fine-tuning | KG Representation for prompt | Prompt | Accuracy | 1st hop correct, 2nd hop correct, final incorrect | 1st hop correct, 2nd hop correct, final correct | 1st hop correct, 2nd hop correct, final accuracy
Llama-3.1-8B | NA | NA | Zero Shot | 1.85% | 51 | 14 | 21.5%
Llama-3.1-70B | NA | NA | Zero Shot | 10.7% | 164 | 195 | 54.3%
Llama-3.1-8B | NA | Natural Language | One Shot | 3.99% | 82 | 25 | 23.4%
Llama-3.1-8B | NA | JSON | One Shot | 1.80% | 19 | 4 | 17.4%
Llama-3.1-8B | NA | Python Static | One Shot | 4.98% | 50 | 29 | 36.7%
Llama-3.1-8B | NA | Python Dynamic | One Shot | 4.12% | 35 | 12 | 25.5%
Llama-3.1-8B tuned | Natural Language | NA | Zero Shot | 11.0% | 97 | 164 | 62.8%
Llama-3.1-8B tuned | JSON | NA | Zero Shot | 10.0% | 61 | 157 | 72.0%
Llama-3.1-8B tuned | Python Static | NA | Zero Shot | 12.1% | 44 | 191 | 81.3%
Llama-3.1-8B tuned | Python Dynamic | NA | Zero Shot | 12.3% | 38 | 203 | 84.2%

Table 7: LLM Three Hop Reasoning for Dataset 3 with Zero Shot Prompting
4.8 Discussion
We designed a series of experiments to study how to seamlessly integrate entity relationships into LLMs to improve their multi-hop reasoning ability, and the impact of different entity relationship representations on the performance of LLMs. Our experimental results show that it is possible to integrate entity relationships into LLMs and ground the LLM multi-hop reasoning process with knowledge graphs. While all forms of the proposed entity relationship representations help improve LLM reasoning performance, they affect LLM performance differently. The natural language representation is straightforward and is the major form of pre-training data for LLMs. The JSON representation is best suited for storing structured data; however, it can be difficult for an LLM to directly integrate pure structured data. The Python representations store both the structured data and the inference process, providing a more controlled and unambiguous way of guiding LLMs through the reasoning process. This helps LLMs achieve better reasoning performance in all the cases we studied. In some cases, the fine-tuned small LLMs perform better than much larger LLMs, even though the deep neural networks of the larger LLMs can help the model make connections among multiple hops and infer the correct answers. Because we guide and fine-tune the models with an emphasis on multi-hop relationships rather than the facts of individual entities, the fine-tuned models can easily generalize to multi-hop reasoning over completely different entities, even in cases where the number of hops is greater than what is in the training data. Our proposed fine-tuning approaches also help improve the results of in-context learning.

As synthetic data becomes increasingly important for LLM pre-training and fine-tuning, generating proper representations of structured data (especially knowledge graphs) and incorporating this type of data in LLM pre-training and fine-tuning can greatly improve LLM reasoning abilities and reduce hallucinations. As demonstrated by our experimental results, the representation of entity relationships is very important for LLM performance. Programming languages provide the flexibility to represent various entity relationships with native data structures. They can guide LLM reasoning in a more controlled and principled way and ground LLM inference with a knowledge base.

In our experiments, we only studied the simplest forms of reasoning: two-hop and three-hop reasoning with compositional relationships. The reasoning process for real-life applications can be much more sophisticated. As shown in Figure 4, the Python representation of a knowledge graph with dynamic relationships provides the flexibility to define any relationships, even extending to a subgraph. Entities and relationships can be defined as classes themselves, allowing us to add attributes to the entities and relationships. Correspondingly, the “infer” function can be redefined to include code that checks these attributes. In future work, we will study the graph representation of entity relationships and its impact on more complex reasoning cases.
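One possible shape of such an attribute-aware extension is sketched below; the class and parameter names are illustrative assumptions, not a design prescribed by the paper:

class Entity:
    def __init__(self, name, **attributes):
        self.name = name
        self.attributes = attributes  # e.g. type="person"

class AttributedKnowledgeBase:
    def __init__(self):
        self.facts = {}

    def add_fact(self, head, relation, tail):
        # head and tail are Entity instances; relation is a string.
        self.facts[(head.name, relation)] = tail

    def infer(self, entity, *relations, required_type=None):
        current = entity
        for relation in relations:
            key = (current.name, relation)
            if key not in self.facts:
                return None
            current = self.facts[key]
        # Check an attribute of the inferred entity, as suggested above.
        if required_type is not None and current.attributes.get("type") != required_type:
            return None
        return current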
5 CONCLUDING REMARKS
We proposed different representations of entity relationships in knowledge graphs to improve the multi-hop reasoning capabilities of LLMs. We conducted a series of experiments to study how different representations of entity relationships affect LLM reasoning ability. We showed that introducing programming language representations of the entity relationships helps improve LLM multi-hop reasoning ability and reduce hallucination.

The programming language representation of the entity relationships provides a controlled and unambiguous way for LLM multi-hop reasoning. By leveraging the native data structures inherent in programming languages, we can effectively model complex entity relationships, while iterative inference functions guide the logical reasoning process. This approach not only enhances reasoning accuracy but also facilitates generalization to more sophisticated reasoning use cases. However, accurately measuring the performance of LLM reasoning beyond two-hop and three-hop compositional relationships can be challenging, as the reasoning process becomes increasingly complex. As part of future work, we would like to study programming language representations of more sophisticated relationships in order to solve more complex reasoning tasks. We will experiment at both the pre-training and fine-tuning stages of LLMs to evaluate the performance impact. We hope that our work will inspire other researchers to further push the frontier of LLM research.

REFERENCES
[1] Meta AI. 2024. The Llama 3 Herd of Models. arXiv:2407.21783 [cs.AI] https://arxiv.org/abs/2407.21783
[2] Viraat Aryabumi, Yixuan Su, Raymond Ma, Adrien Morisot, Ivan Zhang, Acyr Locatelli, Marzieh Fadaee, Ahmet Üstün, and Sara Hooker. 2024. To Code, or Not To Code? Exploring Impact of Code in Pre-training. arXiv:2408.10914 [cs.CL] https://arxiv.org/abs/2408.10914
[3] Eden Biran, Daniela Gottesman, Sohee Yang, Mor Geva, and Amir Globerson. 2024. Hopping Too Late: Exploring the Limitations of Large Language Models on Multi-Hop Queries. arXiv:2406.12775 [cs.CL] https://arxiv.org/abs/2406.12775
[4] Kurt D. Bollacker, Colin Evans, Praveen K. Paritosh, Tim Sturge, and Jamie Taylor. 2008. Freebase: a collaboratively created graph database for structuring human knowledge. In SIGMOD Conference. https://api.semanticscholar.org/CorpusID:207167677
[5] Tim Bray. 2014. The JavaScript Object Notation (JSON) Data Interchange Format.
[6] Felix Brei, Johannes Frey, and Lars-Peter Meyer. 2024. Leveraging small language models for Text2SPARQL tasks to improve the resilience of AI assistance. arXiv:2405.17076 [cs.AI] https://arxiv.org/abs/2405.17076
[7] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language Models are Few-Shot Learners. arXiv:2005.14165 [cs.CL] https://arxiv.org/abs/2005.14165
[8] Diego Bustamante and Hideaki Takeda. 2024. SPARQL Generation with Entity Pre-trained GPT for KG Question Answering. arXiv:2402.00969 [cs.CL] https://arxiv.org/abs/2402.00969
[9] Ziwei Chai, Tianjie Zhang, Liang Wu, Kaiqiao Han, Xiaohai Hu, Xuanwen Huang, and Yang Yang. 2023. GraphLLM: Boosting Graph Reasoning Ability of Large Language Model. arXiv:2310.05845 [cs.CL] https://arxiv.org/abs/2310.05845
[10] Nurendra Choudhary, Nikhil Rao, Karthik Subbian, and Chandan Reddy. 2022. Graph-based multilingual language model: Leveraging product relations for search relevance. In KDD 2022. https://www.amazon.science/publications/graph-based-multilingual-language-model-leveraging-product-relations-for-search-relevance
[11] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Ben Hutchinson, Reiner Pope, James Bradbury, Jacob Austin, Michael Isard, Guy Gur-Ari, Pengcheng Yin, Toju Duke, Anselm Levskaya, Sanjay Ghemawat, Sunipa Dev, Henryk Michalewski, Xavier Garcia, Vedant Misra, Kevin Robinson, Liam Fedus, Denny Zhou, Daphne Ippolito, David Luan, Hyeontaek Lim, Barret Zoph, Alexander Spiridonov, Ryan Sepassi, David Dohan, Shivani Agrawal, Mark Omernick, Andrew M. Dai, Thanumalayan Sankaranarayana Pillai, Marie Pellat, Aitor Lewkowycz, Erica Moreira, Rewon Child, Oleksandr Polozov, Katherine Lee, Zongwei Zhou, Xuezhi Wang, Brennan Saeta, Mark Diaz, Orhan Firat, Michele Catasta, Jason Wei, Kathy Meier-Hellstern, Douglas Eck, Jeff Dean, Slav Petrov, and Noah Fiedel. 2022. PaLM: Scaling Language Modeling with Pathways. arXiv:2204.02311 [cs.CL] https://arxiv.org/abs/2204.02311
[12] Stefan Dernbach, Khushbu Agarwal, Alejandro Zuniga, Michael Henry, and Sutanay Choudhury. 2024. GLaM: Fine-Tuning Large Language Models for Domain Knowledge Graph Alignment via Neighborhood Partitioning and Generative Subgraph Encoding. arXiv:2402.06764 [cs.AI] https://arxiv.org/abs/2402.06764
[13] Darren Edge, Ha Trinh, Newman Cheng, Joshua Bradley, Alex Chao, Apurva Mody, Steven Truitt, and Jonathan Larson. 2024. From Local to Global: A Graph RAG Approach to Query-Focused Summarization. arXiv:2404.16130 [cs.CL] https://arxiv.org/abs/2404.16130
[14] Yao Fu, Hao Peng, Ashish Sabharwal, Peter Clark, and Tushar Khot. 2023. Complexity-Based Prompting for Multi-step Reasoning. In The Eleventh International Conference on Learning Representations. https://openreview.net/forum?id=yf1icZHC-l9
[15] Jiayan Guo, Lun Du, Hengyu Liu, Mengyu Zhou, Xinyi He, and Shi Han. 2023. GPT4Graph: Can Large Language Models Understand Graph Structured Data? An Empirical Evaluation and Benchmarking. arXiv:2305.15066 [cs.AI] https://arxiv.org/abs/2305.15066
[16] Xanh Ho, Anh-Khoa Duong Nguyen, Saku Sugawara, and Akiko Aizawa. 2020. Constructing A Multi-hop QA Dataset for Comprehensive Evaluation of Reasoning Steps. In Proceedings of the 28th International Conference on Computational Linguistics. International Committee on Computational Linguistics, Barcelona, Spain (Online), 6609–6625. https://www.aclweb.org/anthology/2020.coling-main.580
[17] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2022. LoRA: Low-Rank Adaptation of Large Language Models. In International Conference on Learning Representations. https://openreview.net/forum?id=nZeVKeeFYf9
[18] Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Andrea Madotto, and Pascale Fung. 2023. Survey of Hallucination in Natural Language Generation. Comput. Surveys 55, 12 (March 2023), 1–38. https://doi.org/10.1145/3571730
[19] Seungone Kim, Se June Joo, Doyoung Kim, Joel Jang, Seonghyeon Ye, Jamin Shin, and Minjoon Seo. 2023. The CoT Collection: Improving Zero-shot and Few-shot Learning of Language Models via Chain-of-Thought Fine-Tuning. arXiv:2305.14045 [cs.CL] https://arxiv.org/abs/2305.14045
[20] Angeliki Lazaridou, Elena Gribovskaya, Wojciech Stokowiec, and Nikolai Grigorev. 2022. Internet-augmented language models through few-shot prompting for open-domain question answering. arXiv:2203.05115 [cs.CL] https://arxiv.org/abs/2203.05115
[21] Jens Lehmann, Robert Isele, Max Jakob, Anja Jentzsch, Dimitris Kontokostas, Pablo Mendes, Sebastian Hellmann, Mohamed Morsey, Patrick Van Kleef, Sören Auer, and Christian Bizer. 2014. DBpedia - A Large-scale, Multilingual Knowledge Base Extracted from Wikipedia. Semantic Web Journal 6 (01 2014). https://doi.org/10.3233/SW-140134
[22] Linhao Luo, Yuan-Fang Li, Gholamreza Haffari, and Shirui Pan. 2024. Reasoning on Graphs: Faithful and Interpretable Large Language Model Reasoning. arXiv:2310.01061 [cs.CL] https://arxiv.org/abs/2310.01061
[23] Yingwei Ma, Yue Liu, Yue Yu, Yuanliang Zhang, Yu Jiang, Changjian Wang, and Shanshan Li. 2023. At Which Training Stage Does Code Data Help LLMs Reasoning? arXiv:2309.16298 [cs.CL] https://arxiv.org/abs/2309.16298
[24] Zhijie Nie, Richong Zhang, Zhongyuan Wang, and Xudong Liu. 2024. Code-Style In-Context Learning for Knowledge-Based Question Answering. arXiv:2309.04695 [cs.CL] https://arxiv.org/abs/2309.04695
[25] Ofir Press, Muru Zhang, Sewon Min, Ludwig Schmidt, Noah Smith, and Mike Lewis. 2023. Measuring and Narrowing the Compositionality Gap in Language Models. In Findings of the Association for Computational Linguistics: EMNLP 2023, Houda Bouamor, Juan Pino, and Kalika Bali (Eds.). Association for Computational Linguistics, Singapore, 5687–5711. https://doi.org/10.18653/v1/2023.findings-emnlp.378
[26] Julio C. Rangel, Tarcisio Mendes de Farias, Ana Claudia Sima, and Norio Kobayashi. 2024. SPARQL Generation: an analysis on fine-tuning OpenLLaMA for Question Answering over a Life Science Knowledge Graph. arXiv:2402.04627 [cs.AI] https://arxiv.org/abs/2402.04627
[27] Denny Vrandečić and Markus Krötzsch. 2014. Wikidata: A Free Collaborative Knowledgebase. Commun. ACM 57, 10 (Sept. 2014), 78–85. https://doi.org/10.1145/2629489
[28] Heng Wang, Shangbin Feng, Tianxing He, Zhaoxuan Tan, Xiaochuang Han, and Yulia Tsvetkov. 2024. Can Language Models Solve Graph Problems in Natural Language? arXiv:2305.10037 [cs.CL] https://arxiv.org/abs/2305.10037
[29] Jianing Wang, Junda Wu, Yupeng Hou, Yao Liu, Ming Gao, and Julian McAuley. 2024. InstructGraph: Boosting Large Language Models via Graph-centric Instruction Tuning and Preference Alignment. arXiv:2402.08785 [cs.CL] https://arxiv.org/abs/2402.08785
[30] Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. 2023. Self-Consistency Improves Chain of Thought Reasoning in Language Models. In The Eleventh International Conference on Learning Representations. https://openreview.net/forum?id=1PL1NIMMrw
[31] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc V Le, and Denny Zhou. 2022. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. In Advances in Neural Information Processing Systems, S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh (Eds.), Vol. 35. Curran Associates, Inc., 24824–24837. https://proceedings.neurips.cc/paper_files/paper/2022/file/9d5609613524ecf4f15af0f7b31abca4-Paper-Conference.pdf
[32] Silei Xu, Shicheng Liu, Theo Culhane, Elizaveta Pertseva, Meng-Hsi Wu, Sina Semnani, and Monica Lam. 2023. Fine-tuned LLMs Know More, Hallucinate Less with Few-Shot Sequence-to-Sequence Semantic Parsing over Wikidata. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Houda Bouamor, Juan Pino, and Kalika Bali (Eds.). Association for Computational Linguistics, Singapore, 5778–5791. https://doi.org/10.18653/v1/2023.emnlp-main.353
[33] Sohee Yang, Elena Gribovskaya, Nora Kassner, Mor Geva, and Sebastian Riedel. 2024. Do Large Language Models Latently Perform Multi-Hop Reasoning? arXiv:2402.16837 [cs.CL] https://arxiv.org/abs/2402.16837
[34] Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, and Karthik R Narasimhan. 2023. Tree of Thoughts: Deliberate Problem Solving with Large Language Models. In Thirty-seventh Conference on Neural Information Processing Systems. https://openreview.net/forum?id=5Xc1ecxO1h
[35] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. 2023. ReAct: Synergizing Reasoning and Acting in Language Models. In International Conference on Learning Representations (ICLR).
[36] Michihiro Yasunaga, Antoine Bosselut, Hongyu Ren, Xikun Zhang, Christopher D Manning, Percy Liang, and Jure Leskovec. 2022. Deep Bidirectional Language-Knowledge Graph Pretraining. arXiv:2210.09338 [cs.CL] https://arxiv.org/abs/2210.09338
[37] Ruosong Ye, Caiqi Zhang, Runhui Wang, Shuyuan Xu, and Yongfeng Zhang. 2024. Language is All a Graph Needs. arXiv:2308.07134 [cs.CL] https://arxiv.org/abs/2308.07134
[38] Xinlu Zhang, Zhiyu Zoey Chen, Xi Ye, Xianjun Yang, Lichang Chen, William Yang Wang, and Linda Ruth Petzold. 2024. Unveiling the Impact of Coding Data Instruction Fine-Tuning on Large Language Models Reasoning. arXiv:2405.20535 [cs.AI] https://arxiv.org/abs/2405.20535
[39] Xuan Zhang, Chao Du, Tianyu Pang, Qian Liu, Wei Gao, and Min Lin. 2024. Chain of Preference Optimization: Improving Chain-of-Thought Reasoning in LLMs. arXiv:2406.09136 [cs.CL] https://arxiv.org/abs/2406.09136
[40] Kerui Zhu, Bo-Wei Huang, Bowen Jin, Yizhu Jiao, Ming Zhong, Kevin Chang, Shou-De Lin, and Jiawei Han. 2024. Investigating Instruction Tuning Large Language Models on Graphs. arXiv:2408.05457 [cs.CL] https://arxiv.org/abs/2408.05457

A ONE SHOT EXAMPLE FOR DIFFERENT PROMPT FORMATS

A.1 Explanation for Natural Language Prompt
The composer of It Goes Like It Goes is David Shire. The spouse of David Shire is Didi Conn. The spouse of the composer of It Goes Like It Goes is _

A.2 JSON Structure Prompt
{
    "composer": {
        "It Goes Like It Goes": "David Shire"
    },
    "spouse": {
        "David Shire": "Didi Conn"
    }
}

A.3 Python Code Snippet for Prompt V1
# Step 1. Define relationships with explicit types
relationships = {
    'composer': {
        # It Goes Like It Goes is related to David Shire via relationship composer
        'It Goes Like It Goes': 'David Shire'
    },
    'spouse': {
        # David Shire is related to Didi Conn via relationship spouse
        'David Shire': 'Didi Conn'
    }
}

# Define entities and relationships
e1 = 'It Goes Like It Goes'
r1 = 'composer'
r2 = 'spouse'

# Step 2. (r1, e1) -> e2
e2 = relationships[r1][e1]

# Step 3. (r2, e2) -> e3
e3 = relationships[r2][e2]

# Output the result
print(f"The {r2} of the {r1} of {e1} is {e3}")

# When you run the code, it will output:
# The spouse of the composer of It Goes Like It Goes is Didi Conn
A.4 Python Code Snippet for Prompt V2
# A dynamic knowledge base, as described in Section 3.2: facts are
# stored in a dictionary keyed by (entity, relation).
class KnowledgeBase:
    def __init__(self):
        self.facts = {}

    def add_fact(self, entity, relation, target):
        self.facts[(entity, relation)] = target

    def infer(self, entity, *relations):
        current_entity = entity
        for relation in relations:
            key = (current_entity, relation)
            if key in self.facts:
                current_entity = self.facts[key]
            else:
                # If the path does not exist, return None.
                return None
        return current_entity

# Example usage:
# Create a knowledge base instance.
kb = KnowledgeBase()
kb.add_fact('It Goes Like It Goes', 'composer', 'David Shire')
kb.add_fact('David Shire', 'spouse', 'Didi Conn')

e1 = 'It Goes Like It Goes'
r1 = 'composer'
r2 = 'spouse'
result3 = kb.infer(e1, r1, r2)  # Should return Didi Conn:
# (It Goes Like It Goes, composer) -> David Shire,
# (David Shire, spouse) -> Didi Conn