Knowledge Graph Large Language Model
dongshu2024@[Link]
Abstract
The task of multi-hop link prediction within knowledge graphs (KGs) stands as
a challenge in the field of knowledge graph analysis, as it requires the model
to reason through and understand all intermediate connections before making a
prediction. In this paper, we introduce the Knowledge Graph Large Language
Model (KG-LLM), a novel framework that leverages large language models (LLMs)
for knowledge graph tasks. We first convert structured knowledge graph data into
natural language and then use these natural language prompts to fine-tune LLMs
to enhance multi-hop link prediction in KGs. By converting the KG to natural
language prompts, our framework is designed to learn the latent representations of
entities and their interrelations. To demonstrate the efficacy of the KG-LLM framework, we
fine-tune three leading LLMs within it: Flan-T5, Llama2, and Gemma. Further, we explore the framework’s potential to provide LLMs with
zero-shot capabilities for handling previously unseen prompts. Experimental results
show that KG-LLM significantly improves the models’ generalization capabilities,
leading to more accurate predictions in unfamiliar scenarios. Our code is available
at [Link]
1 Introduction
In the domain of data representation and organization, knowledge graphs (KGs) have emerged as
a structured and effective methodology, attracting substantial interest in recent years. Although
two-node link prediction in KGs has yielded promising results, multi-hop link prediction remains
a difficult task. Multi-hop link prediction is crucial in practice because we are often more
interested in the relationship between two distant entities than in their direct connections. This
requires models to reason through intermediate entities and their relationships. A further challenge is
debugging KG model predictions, particularly in the context of discriminative prediction,
where the model’s lack of explanatory reasoning steps obscures the origins of errors and
diminishes accuracy and performance. Consequently, developing models capable of generatively and
precisely predicting multi-hop links in KGs is a critical challenge.
Historically, approaches to solving tasks related to KGs can trace their origins from embedding-based
methods to more recent advancements with LLMs [28]. Initially, embedding-based methods played a
crucial role, utilizing techniques to represent both entities and relations in a KG as low-dimensional
vectors to address the link prediction task by preserving the structural and semantic integrity of
the graph [2, 30, 5, 13]. As the field progressed, the integration of LLMs began to offer new
paradigms, leveraging large amounts of data and advanced architectures to further enhance prediction
capabilities and semantic understanding in KGs [1, 38, 39, 37, 22]. This transition shows a significant [...]

∗ Equal Contribution.

Our main contributions are summarized as follows:
• By converting knowledge graphs into CoT prompts, our framework allows LLMs to better
understand and learn the latent representations of entities and their relationships within the
knowledge graph.
• Our analysis of real-world datasets confirms that our framework improves generative multi-
hop link prediction in KGs, underscoring the benefits of incorporating CoT and instruction
fine-tuning during training.
• Our findings also indicate that our framework substantially improves the generalizability of
LLMs in responding to unseen prompts.
2 Related Work
Recently, researchers have used Graph Neural Network (GNN) models to solve various graph-related
tasks, significantly advancing the field. Among different GNN models, Graph Attention Networks
(GATs) have gained attention for their ability to weigh the importance of neighboring nodes, with
models like wsGAT [6] demonstrating effectiveness in link prediction tasks. Additionally, Graph
Convolutional Network (GCN)-based models have shown promising results; ConGLR [12] leverages
context graphs and logical reasoning for improved inductive relation prediction, while ConvRot [11]
integrates relational rotation and convolutional techniques to enhance link prediction performance in
KG-LLM (ablation) Knowledge Prompt
Training Question: Node_1 has relation_1 with node_2, and node_2 has relation_2 with node_3. Is node_1 connected with node_3?
Training Answer: The answer is yes.
Testing Question: Node_6540 has relation_9 with node_765, and node_765 has relation_4 with node_2148. Is node_6540 connected with node_2148?

KG-LLM Knowledge Prompt
Training Question:
### Instruction:
Below is the detail of a knowledge graph path. Is node_1 connected with node_3? Answer the question by reasoning step-by-step. Choose from the given options:
1. Yes
2. No
### Input:
Node_1 has relation_1 with node_2, and node_2 has relation_2 with node_3.
Training Answer:
### Response:
Node_1 has relation_1 with node_2 means Jack bought Shampoo. Node_2 has relation_2 with node_3 means Shampoo is related with Hair Conditioner. So Jack will also buy Hair Conditioner. The answer is yes.
Testing Question:
### Instruction: [...]
### Input: Node_6540 has relation_9 with node_765, and node_765 has relation_4 with node_2148.

Figure 2: An Example of Prompt Used in the Multi-hop Link Prediction Training Process:
Models processed through the ablation framework are trained using the ablation knowledge
prompt (top), whereas models processed via the KG-LLM framework are trained on the KG-LLM
knowledge prompt (bottom).
knowledge graphs (KGs). While the aforementioned approaches have achieved significant success,
multi-hop link prediction remains an unsolved challenge.
Beyond GNN models, the recent development of large language models (LLMs), such as BERT [4],
GPT [18], Llama [24], Gemini [23], and Flan-T5 [31], has also addressed various KG tasks, including
link prediction. The text-to-text training approach makes LLMs particularly suitable for our generative
multi-hop link prediction task. Recent and concurrent work, such as GraphEdit [8], MuseGraph
[21], and InstructGraph [27], has shown that natural language is effective for representing structural
data to LLMs. Moreover, training on large-scale data makes it possible for LLMs to generalize to
unseen tasks or prompts that were not part of their training data [32].
Another advantage of LLM-based generative modeling is Chain-of-Thought (CoT) reasoning
[34], which provides the flexibility to modify the instruction, options, and exemplars to allow
structured generation and prediction. The CoT reasoning process can be naturally
integrated with KGs by translating a reasoning path on a KG into natural language. This flexibility
allows us to easily test the model’s ability to follow instructions and make decisions based on the
provided information. Similarly, In-Context Learning (ICL) [3] helps LLMs learn from demonstrative
examples in the prompt to generate correct answers for the given question. This can also be naturally
integrated with KGs. As a result, CoT and ICL enable flexible KG reasoning through natural language.
3 Methodology
Let KG = (E, R, L) denote a knowledge graph, where E is the set of entities, R is the set of
relationships, and L ⊆ E × R × E is the set of triples that are edges in the KG. Each triple
(ei , r, ei+1 ) ∈ L denotes that there exists a directed edge from entity ei to entity ei+1 via the
relationship r [29].
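To make this notation concrete, here is a minimal Python sketch of the KG = (E, R, L) formulation; the entity and relation names are illustrative, not drawn from any dataset.

```python
from typing import NamedTuple

class Triple(NamedTuple):
    """A directed edge (e_i, r, e_{i+1}) in L ⊆ E × R × E."""
    head: str
    relation: str
    tail: str

# A toy KG: E and R are implied by the set of observed triples L.
L_triples = [
    Triple("node_1", "relation_1", "node_2"),
    Triple("node_2", "relation_2", "node_3"),
]
E = {t.head for t in L_triples} | {t.tail for t in L_triples}  # entities
R = {t.relation for t in L_triples}                            # relations
```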
Multi-hop Link Prediction (Ablation)
PROMPT
### Input:
Node [node id1] has relation [relation id] with node [node id2]. Node [node id2] has relation [relation id] with node [node id3]. [...]
Is node [node id1] connected with node [last node id]?
Expected Output
### Response:
[Yes / No]

Multi-hop Link Prediction (KG-LLM)
PROMPT
### Instruction:
Below is the detail of a knowledge graph path. Is node [node id1] connected with node [last node id]? Answer the question by reasoning step-by-step. Choose from the given options: 1. Yes 2. No
### Input:
Node [node id1] has relation [relation id] with node [node id2]. Node [node id2] has relation [relation id] with node [node id3]. [...]
Expected Output
### Response:
Node [node id1] has relation [relation id] with node [node id2] means [node text1] [relation text] [node text2]. [...] So [node text1] [relation text] [last node text]. The answer is yes.

Multi-hop Relation Prediction (Ablation)
PROMPT
### Input:
Node [node id1] has relation [relation id] with node [node id2]. Node [node id2] has relation [relation id] with node [node id3]. [...]
What is the relation between [node id1] and [last node id]?
Expected Output
### Response:
[relation id]

Multi-hop Relation Prediction (KG-LLM)
PROMPT
### Instruction:
Below is the detail of a knowledge graph path. What is the relation between [node id1] and [last node id]? Answer the question by reasoning step-by-step. Choose from the given options: 1. [relation text1] 2. [relation text2] [...]
### Input:
Node [node id1] has relation [relation id] with node [node id2]. Node [node id2] has relation [relation id] with node [node id3]. [...]
Expected Output
### Response:
Node [node id1] has relation [relation id] with node [node id2] means [node text1] [relation text] [node text2]. [...] So [node text1] [relation text] [last node text]. The answer is [relation text].

Figure 3: Overview of our knowledge prompts in the ablation and KG-LLM frameworks:
the ablation framework’s knowledge prompts are the first and third panels; the KG-LLM framework’s
knowledge prompts are the second and fourth panels.
The knowledge prompt is a specialized prompt designed for KGs that converts a given sequence of
observed triples P_obs into natural language. By leveraging the knowledge prompt during training,
the model can more effectively understand the underlying relationships and patterns present
within KGs, thus improving overall performance on multi-hop prediction tasks. In Figure 3, we
define the two types of knowledge prompts, the KG-LLM knowledge prompt and the KG-LLM (ablation)
knowledge prompt, for both multi-hop link prediction and multi-hop relation prediction.

The two types of prompts demonstrate distinct approaches to enhancing model performance on
multi-hop prediction tasks. The KG-LLM knowledge prompt adopts a structured format that includes
instructions and inputs. It converts node and relation IDs into dataset-specific text and breaks
complex inputs down into manageable, concise steps. The KG-LLM instruction is framed as a
classification task: by listing all possible options in the instruction, LLMs can follow it and
generate a response based on the given choices. This structured approach provides a clearer
view of the KG and improves prediction accuracy. In the ablation knowledge prompt, by contrast,
we remove the instruction, the textualized IDs, and the CoT reasoning process from the expected
response. To better illustrate our knowledge prompts, we provide an example for the multi-hop
link prediction task in Figure 2.
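As a rough illustration of how such a prompt could be assembled, the following sketch renders a path of (head, relation, tail) triples in the KG-LLM format of Figure 2; the helper function and its exact string layout are our own approximation, not the paper's released code.

```python
def kg_llm_link_prompt(path):
    """Render an observed path of (head, relation, tail) triples as a
    KG-LLM multi-hop link prediction prompt (approximating Figure 2)."""
    facts = " ".join(
        f"Node {h} has relation {r} with node {t}." for h, r, t in path
    )
    first, last = path[0][0], path[-1][2]
    return (
        "### Instruction:\n"
        f"Below is the detail of a knowledge graph path. Is {first} "
        f"connected with {last}? Answer the question by reasoning "
        "step-by-step. Choose from the given options: 1. Yes 2. No\n"
        f"### Input:\n{facts}\n### Response:\n"
    )

print(kg_llm_link_prompt([("node_1", "relation_1", "node_2"),
                          ("node_2", "relation_2", "node_3")]))
```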
In addition, we adopt one-shot in-context learning (ICL), tailored to Flan-T5-Large, our smallest
model. For models of this scale, the difference in accuracy between one-shot and few-shot ICL is
minimal [3]. To maintain consistency across our experimental framework, we apply the same
one-shot ICL methodology to all LLMs; this uniform approach ensures that our comparative analysis
of the models’ performances is conducted under equivalent learning conditions. We list all ICL
examples in Appendix A.1.
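A one-shot ICL test input is then simply the exemplar concatenated in front of the query. A sketch follows, where the exemplar text is abbreviated rather than copied from Appendix A.1.

```python
# Abbreviated two-hop exemplar; the actual text is listed in Appendix A.1.
ONE_SHOT_EXAMPLE = (
    "### Instruction: ... Is node_1 connected with node_3? ...\n"
    "### Input: Node_1 has relation_1 with node_2, and node_2 has "
    "relation_2 with node_3.\n"
    "### Response: ... The answer is yes.\n\n"
)

def with_one_shot_icl(test_prompt: str) -> str:
    """Prepend the single in-context exemplar to a test-time prompt."""
    return ONE_SHOT_EXAMPLE + test_prompt
```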
Our complete KG-LLM framework is illustrated in Figure 1. Initially, the KG is taken as input.
Each node is iteratively assigned as the root, and depth-first search (DFS) is used to extract all
possible paths. Duplicate paths are then removed, retaining only those with node counts ranging
from 2 to 6. This range is based on the “six degrees of separation” theory [7], which states that
any two individuals are, on average, connected through a chain of no more than six intermediaries.
The node counts correspond to the number of hops: a single hop spans two nodes, a two-hop
path involves three nodes, and so on. These paths are labeled as either positive (there is a connection
between the first and last node) or negative (there is no connection) instances. Because negative
instances outnumbered positive instances, we randomly downsampled the negatives to obtain a
balanced dataset. Finally, these paths are converted into KG-LLM and KG-LLM (ablation)
knowledge prompts. During the fine-tuning phase, three distinct LLMs are utilized:
Flan-T5-Large, Llama2-7B, and Gemma-7B. We add all node IDs and relation IDs as special
tokens to the vocabulary of these LLMs. Different fine-tuning techniques are applied to each model
within our framework: a global fine-tuning strategy is employed for Flan-T5, while for Llama2 and
Gemma a 4-bit quantized LoRA (Low-Rank Adaptation) modification [10] is applied. During
training, we use the cross-entropy loss L, which measures the discrepancy between the model’s
predicted token distribution and the tokens of the expected output sequence. In the following
equation, n is the length of the expected output sequence, x is the input instruction, and y_i is the
i-th token of the expected output sequence.
L = −∑_{i=1}^{n} log P(y_i | x, y_1, y_2, ..., y_{i−1})        (1)
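A sketch of the data-construction pipeline described above (DFS path enumeration, the 2-to-6-node filter, and negative downsampling) is given below; we read "connection" as the presence of a direct edge between the first and last node, which is one plausible interpretation rather than a detail the paper spells out.

```python
import random
from collections import defaultdict

def extract_paths(triples, min_nodes=2, max_nodes=6):
    """Enumerate simple directed paths with 2-6 nodes (1-5 hops),
    rooting a DFS at every node; duplicates collapse in the set."""
    out_edges = defaultdict(list)
    for h, r, t in triples:
        out_edges[h].append((r, t))

    paths = set()

    def dfs(node, visited, edges):
        if min_nodes <= len(visited):
            paths.add(tuple(edges))
        if len(visited) == max_nodes:
            return
        for r, t in out_edges[node]:
            if t not in visited:  # keep paths simple (no revisits)
                dfs(t, visited | {t}, edges + [(node, r, t)])

    for root in list(out_edges):
        dfs(root, {root}, [])
    return paths

def label_and_balance(paths, triples, seed=0):
    """Label a path positive if a direct edge links its endpoints,
    then downsample negatives to match the number of positives."""
    direct = {(h, t) for h, _, t in triples}
    pos = [p for p in paths if (p[0][0], p[-1][2]) in direct]
    neg = [p for p in paths if (p[0][0], p[-1][2]) not in direct]
    return pos, random.Random(seed).sample(neg, min(len(neg), len(pos)))
```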
To evaluate our KG-LLM framework, we train each model twice. As illustrated in Figure 2, the
initial training session uses KG-LLM (ablation) knowledge prompt inputs to establish a baseline.
Subsequently, we apply instruction fine-tuning to the original models using KG-LLM knowledge
prompt inputs.
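For the decoder-only models, the fine-tuning setup might look like the following, assuming the Hugging Face transformers and peft libraries; the LoRA hyperparameters shown are illustrative placeholders, not values reported in the paper.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_name = "meta-llama/Llama-2-7b-hf"  # Gemma-7B is handled analogously
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Register every node ID and relation ID as a token in the vocabulary
# (counts here match WN18RR in Table 1).
num_nodes, num_rels = 40_943, 11
tokenizer.add_tokens([f"node_{i}" for i in range(num_nodes)] +
                     [f"relation_{i}" for i in range(num_rels)])

# 4-bit quantized base model with LoRA adapters (Hu et al. [10]).
model = AutoModelForCausalLM.from_pretrained(
    model_name, quantization_config=BitsAndBytesConfig(load_in_4bit=True)
)
model.resize_token_embeddings(len(tokenizer))
model = prepare_model_for_kbit_training(model)
model = get_peft_model(
    model, LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM")
)
# Passing the expected output sequence as labels applies the token-level
# cross-entropy loss of Eq. (1) automatically during training.
```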
After the training phases, we subject each model to two sets of inference tasks, each comprising two
sub-tests: non-In-Context Learning (non-ICL) and In-Context Learning (ICL). The primary set of
inference tasks centers on multi-hop link prediction, while the secondary set probes the models’
generalization ability in multi-hop relation prediction, particularly on previously unseen prompts.
Through pre- and post-ICL evaluation within each task set, we assess the impact of ICL integration
across both the KG-LLM (ablation) and KG-LLM frameworks.
4 Experiments
In this section, we conduct experiments to evaluate the effectiveness of the proposed KG-LLM
framework and to answer the following key research questions:
• Q1: Which framework demonstrates superior efficacy in multi-hop link prediction tasks in the
absence of ICL?
• Q2: Does incorporating ICL enhance model performance on multi-hop link prediction task?
• Q3: Is the KG-LLM framework capable of equipping models with the ability to navigate unseen
prompts during multi-hop relation prediction inferences?
• Q4: Can the application of ICL bolster the models’ generalization ability in multi-hop relation
prediction tasks?
Table 1: Basic statistics of the experimental datasets.

Dataset     #Entities   #Triples    #Relations
WN18RR      40,943      86,835      11
NELL-995    75,492      149,678     200
FB15k-237   14,541      310,116     237
YAGO3-10    123,182     1,179,040   37
Datasets. We conduct experiments on four real-world datasets, WN18RR, NELL-995, FB15k-237,
and YAGO3-10, as constructed by the OpenKE library [9]. All datasets are commonly used for
evaluating knowledge graph models in the field of knowledge representation learning. Statistics of
the datasets are shown in Table 1.
Task splits. In the preprocessing stage of each dataset, we randomly selected 80% of the nodes
to construct the training set of KG. Following the steps in section 3.3, we constructed the training
knowledge prompts. For validation, we randomly split off 20% of the positive and negative instances
from training knowledge prompts. The same procedure was applied to the remaining 20% of the
nodes to create the test set.
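A sketch of this split procedure, assuming a simple random shuffle over node IDs and over the generated prompts:

```python
import random

def split_nodes(entities, train_frac=0.8, seed=0):
    """Assign 80% of nodes to the training KG and 20% to the test KG."""
    nodes = sorted(entities)
    random.Random(seed).shuffle(nodes)
    cut = int(train_frac * len(nodes))
    return set(nodes[:cut]), set(nodes[cut:])

def split_validation(prompts, val_frac=0.2, seed=0):
    """Hold out 20% of the training knowledge prompts for validation."""
    prompts = list(prompts)
    random.Random(seed).shuffle(prompts)
    cut = int(val_frac * len(prompts))
    return prompts[cut:], prompts[:cut]  # train, validation
```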
Baselines. We compare against the following methods:
• TransE [2] is a traditional distance-based model that represents relationships as translations
in the embedding space.
• Analogy [14] can effectively capture knowledge graph structures to improve link prediction.
• ComplEx [25] uses complex-valued embeddings to represent both entities and relations,
capturing asymmetric relationships.
• DistMult [36] represents relations as diagonal matrices for simplicity and efficiency.
• RESCAL [17] uses a tensor factorization method that captures rich interactions between
entities and relations.
• wsGAT [6] is a graph attention network that handles weighted and signed links for link
prediction tasks.
• ConGLR [12] leverages context-aware graph representation learning to enhance link predic-
tion.
• ConvRot [11] integrates convolutional networks and rotational embeddings to perform a
variety of knowledge graph tasks.
Implementation Details. We trained each model for 5 epochs on an A40 GPU; despite limited
resources, the models still showed promising results. As mentioned in section 3.3, we set the
maximum complexity to five hops. We also monitor input token length to optimize processing
efficiency, noting that Flan-T5, with its 512-token capacity, has the smallest input budget.
Consequently, we tailored our experiments to ensure that the maximum length of input data did
not exceed 512 tokens.
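The 512-token cap could be enforced with a filter along these lines (a sketch, using the public Flan-T5-Large tokenizer):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-large")

def within_budget(prompt: str, max_tokens: int = 512) -> bool:
    """Keep only inputs that fit Flan-T5's 512-token encoder limit,
    the smallest capacity among the three fine-tuned models."""
    return len(tokenizer(prompt).input_ids) <= max_tokens
```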
Metrics for Multi-hop Link Prediction. In evaluating the performance of models in multi-hop
link prediction tasks, we utilized the Area Under the ROC Curve (AUC) metric [15] and the F1 score
[20]. AUC measures the area under the Receiver Operating Characteristic (ROC) curve, which plots
the true positive rate against the false positive rate at varying classification thresholds. Because
the test set contains equal numbers of positive and negative instances, the decision threshold
corresponds to a 50% true positive rate and 50% false positive rate. A higher AUC value indicates a
better ability of the model to differentiate between positive and negative examples. Similarly, the
F1 score, ranging from 0 to 1, measures the balance between precision and recall, where higher
values represent better performance. In the performance tables presented below, the best result is
shown in bold and the second-best is underlined.
Metrics for Multi-hop Relation Prediction. We use accuracy as the performance metric for the
multi-hop relation prediction task, which provides an overall measure of the model’s correctness,
calculated as the percentage of test cases where the true relation is predicted correctly.
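With the models' yes/no (or relation) outputs parsed into labels, the three metrics reduce to standard scikit-learn calls; a sketch:

```python
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

def link_prediction_metrics(y_true, y_pred):
    """AUC and F1 for multi-hop link prediction; y_true and y_pred are
    binary labels parsed from the gold answers and model outputs."""
    return {
        # With hard 0/1 predictions, AUC reduces to balanced accuracy.
        "AUC": roc_auc_score(y_true, y_pred),
        "F1": f1_score(y_true, y_pred),
    }

def relation_prediction_metric(true_rels, pred_rels):
    """Accuracy for multi-hop relation prediction."""
    return {"Accuracy": accuracy_score(true_rels, pred_rels)}
```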
This section analyzes the traditional approaches, ablation framework, and KG-LLM framework in
the context of non-In-Context Learning (non-ICL) Link Prediction, as shown in Table 2. Traditional
approaches are shown in the top section of the table, the ablation framework is in the middle section,
and the KG-LLM framework is in the bottom section.
Answer to Q1: Our analysis reveals that among the traditional approaches, the GNN models,
especially ConvRot, exhibited relatively good performance, even surpassing the ablation models on
the WN18RR dataset. This can be attributed to their ability to effectively capture the structural
information in graph data. However, the results show that, for every model, the KG-LLM framework
surpasses both the traditional approaches and the ablation framework across all datasets. This
improvement can be attributed to the KG-LLM framework’s knowledge prompts, which enable
LLMs to exploit the network of relationships between entities within the KG. Furthermore, these
LLMs already possess basic commonsense knowledge from pre-training; when all nodes and
relations are converted to text, this inherent commonsense enhances their understanding of the
relations and nodes, thereby improving link prediction accuracy.
Instruction fine-tuning (IFT) also contributed to this improvement
by forcing models to focus on the limited options. The evidence presented here underscores the
efficacy of our KG-LLM framework, enriched with CoT and IFT, indicating its potential to advance
the domain of multi-hop link prediction tasks in real-world applications.

Figure 4: F1 and AUC scores on the WN18RR and NELL-995 datasets as multi-hop complexity
increases from one to five, for (a) wsGAT, (b) ConGLR, (c) ConvRot, (d) Flan-T5 (Ablation),
(e) Llama 2 (Ablation), (f) Gemma (Ablation), (g) Flan-T5 (KG-LLM), (h) Llama 2 (KG-LLM),
and (i) Gemma (KG-LLM).
We also evaluate the GNN, ablation, and KG-LLM framework models at each level of hop
complexity on the WN18RR and NELL-995 datasets. As shown in Figure 4, the performance of
the GNN and ablation models declines significantly as hop complexity increases. Upon closer
examination, as hop complexity grows these models respond with ‘No’ to most questions, resulting
in an F1 score close to 0 and an AUC score around 0.5. This degradation reflects the increased
complexity of multi-hop link prediction: unlike the straightforward task of predicting a direct link
between two nodes, models must reason over all intermediate nodes before reaching a conclusion,
which adds significant complexity and reduces their effectiveness. In contrast, the KG-LLM
framework models effectively address this challenge, maintaining fair performance even at five
hops, with the exception of the Flan-T5 model.
In this section, we evaluate the influence of In-Context Learning (ICL) on models subjected to both
ablation and KG-LLM frameworks, excluding the traditional approach as it lacks ICL capability. We
experiment using the same LLMs and testing inputs as in the previous section. The key distinction
is that an ICL example is added at the beginning of each original testing input. The ICL example,
shown in Appendix A.1 and derived from the training dataset, is restricted to a complexity of two
hops. This constraint avoids providing additional knowledge through the ICL example while still
furnishing a contextually relevant demonstration.
Table 3 reveals a notable enhancement in the performance of models under the ablation framework,
with the Llama 2 and Gemma models achieving F1 and AUC scores exceeding 80% on the WN18RR
and NELL-995 datasets. Remarkably, the adoption of ICL within the KG-LLM framework resulted
in a significant performance uplift: the Gemma model achieved a 98% F1 score on WN18RR, while
Llama 2 recorded a 96% F1 score on NELL-995.
An interesting observation is that ICL yields unstable improvements for the Flan-T5 model: on
some datasets, under both the ablation and KG-LLM frameworks, performance slightly declined
after ICL was introduced. This could be attributed to the increased length and complexity of the
testing prompts. While an ICL example generally aids model understanding, in certain cases it may
act as noise and hurt Flan-T5’s performance.
Answer to Q2: The experimental results indicate that the deployment of ICL does not uniformly
improve performance across all models. However, for the Llama 2 and Gemma models, the integration
of ICL consistently facilitates performance improvements.
Figure 5: Multi-hop Relation Prediction Performance Comparison: The left graph shows model
performance under the ablation framework, while the right graph shows model performance under
the KGLLM framework. Blue bars represent testing without ICL, and red bars represent testing with
ICL.
In this analysis, we explore the models’ ability to perform unseen multi-hop relation prediction tasks
on the WN18RR and NELL-995 datasets, excluding the traditional approach as it lacks generalization
ability. We use the same testing dataset as in the multi-hop link prediction task to ensure comparability
and fairness. As mentioned in section 3.2, the difference lies in the instruction and prompt question
presented to the model.
Our findings are presented in Figure 5. Both frameworks showed limited performance on this task
without ICL, although the KG-LLM framework exhibited marginally superior performance. Upon
reviewing the predictions, we observed that the models continue to answer ‘yes’ or ‘no’ to most
questions, as in the multi-hop link prediction task, and output random responses for some questions.
Answer to Q3: The findings suggest that the KG-LLM framework marginally enhances the models’
generalization abilities. However, it would be premature to assert that our framework equips models
with the ability to navigate unseen prompts. This could be attributed to the complexity and difficulty
of the new instructions and options. With options no longer limited to a binary yes or no answer,
the model may struggle to comprehend the updated instruction and effectively utilize the provided
options.
We further explore the impact of incorporating ICL on the multi-hop relation prediction task; the
ICL example is shown in Appendix A.1. The red bars (with ICL) in Figure 5 reveal a significant
improvement in the generalization abilities of the models under both the ablation and KG-LLM
frameworks, in contrast to the blue bars (without ICL). In particular, the Llama 2 and Gemma
models under the KG-LLM framework with ICL achieved an accuracy exceeding 70% on the
WN18RR dataset.
Answer to Q4: The integration of ICL has improved the models’ ability to excel in unseen tasks. The
KG-LLM framework, in particular, exhibits the ability to learn and utilize the contextual example
provided by ICL.
References
[1] Oshin Agarwal, Heming Ge, Siamak Shakeri, and Rami Al-Rfou. Knowledge graph based
synthetic corpus generation for knowledge-enhanced language model pre-training. arXiv
preprint arXiv:2010.12688, 2020.
[2] Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko.
Translating embeddings for modeling multi-relational data. Advances in neural information
processing systems, 26, 2013.
[3] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal,
Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are
few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
[4] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of
deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805,
2018.
[5] Miao Fan, Qiang Zhou, Emily Chang, and Fang Zheng. Transition-based knowledge graph em-
bedding with relational mapping properties. In Proceedings of the 28th Pacific Asia conference
on language, information and computing, pages 328–337, 2014.
[6] Marco Grassia and Giuseppe Mangioni. wsgat: weighted and signed graph attention networks
for link prediction. In Complex Networks & Their Applications X: Volume 1, Proceedings of
the Tenth International Conference on Complex Networks and Their Applications COMPLEX
NETWORKS 2021 10, pages 369–375. Springer, 2022.
[7] John Guare. Six degrees of separation. In The Contemporary Monologue: Men, pages 89–93.
Routledge, 2016.
[8] Zirui Guo, Lianghao Xia, Yanhua Yu, Yuling Wang, Zixuan Yang, Wei Wei, Liang Pang, Tat-
Seng Chua, and Chao Huang. Graphedit: Large language models for graph structure learning.
arXiv preprint arXiv:2402.15183, 2024.
[9] Xu Han, Shulin Cao, Xin Lv, Yankai Lin, Zhiyuan Liu, Maosong Sun, and Juanzi Li. Openke:
An open toolkit for knowledge embedding. In Proceedings of the 2018 conference on empirical
methods in natural language processing: system demonstrations, pages 139–144, 2018.
[10] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang,
Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. arXiv
preprint arXiv:2106.09685, 2021.
[11] Thanh Le, Nam Le, and Bac Le. Knowledge graph embedding by relational rotation and
complex convolution for link prediction. Expert Systems with Applications, 214:119122, 2023.
[12] Qika Lin, Jun Liu, Fangzhi Xu, Yudai Pan, Yifan Zhu, Lingling Zhang, and Tianzhe Zhao. Incor-
porating context graph with logical reasoning for inductive relation prediction. In Proceedings
of the 45th international ACM SIGIR conference on research and development in information
retrieval, pages 893–903, 2022.
[13] Yankai Lin, Zhiyuan Liu, Maosong Sun, Yang Liu, and Xuan Zhu. Learning entity and relation
embeddings for knowledge graph completion. In Proceedings of the AAAI conference on
artificial intelligence, 2015.
[14] Hanxiao Liu, Yuexin Wu, and Yiming Yang. Analogical inference for multi-relational embed-
dings. In International conference on machine learning, pages 2168–2178. PMLR, 2017.
[15] Jorge M Lobo, Alberto Jiménez-Valverde, and Raimundo Real. Auc: a misleading measure
of the performance of predictive distribution models. Global ecology and Biogeography,
17(2):145–151, 2008.
[16] Deepak Nathani, Jatin Chauhan, Charu Sharma, and Manohar Kaul. Learning attention-based
embeddings for relation prediction in knowledge graphs. arXiv preprint arXiv:1906.01195,
2019.
[17] Maximilian Nickel, Lorenzo Rosasco, and Tomaso Poggio. Holographic embeddings of
knowledge graphs. In Proceedings of the AAAI conference on artificial intelligence, 2016.
[18] Baolin Peng, Chunyuan Li, Pengcheng He, Michel Galley, and Jianfeng Gao. Instruction tuning
with gpt-4. arXiv preprint arXiv:2304.03277, 2023.
[19] Varun Ranganathan and Denilson Barbosa. Hoplop: multi-hop link prediction over knowledge
graph embeddings. World Wide Web, 25(2):1037–1065, 2022.
[20] Ananya B Sai, Akash Kumar Mohankumar, and Mitesh M Khapra. A survey of evaluation
metrics used for nlg systems. ACM Computing Surveys (CSUR), 55(2):1–39, 2022.
[21] Yanchao Tan, Hang Lv, Xinyi Huang, Jiawei Zhang, Shiping Wang, and Carl Yang. Musegraph:
Graph-oriented instruction tuning of large language models for generic graph mining. arXiv
preprint arXiv:2403.04780, 2024.
[22] Xiaobin Tang, Jing Zhang, Bo Chen, Yang Yang, Hong Chen, and Cuiping Li. Bert-int: a
bert-based interaction model for knowledge graph alignment. interactions, 100:e1, 2020.
[23] Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu,
Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. Gemini: a family of highly
capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
[24] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei,
Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open
foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
[25] Théo Trouillon, Johannes Welbl, Sebastian Riedel, Éric Gaussier, and Guillaume Bouchard.
Complex embeddings for simple link prediction. In International conference on machine
learning, pages 2071–2080. PMLR, 2016.
[26] Guojia Wan and Bo Du. Gaussianpath: A bayesian multi-hop reasoning framework for knowl-
edge graph reasoning. In Proceedings of the AAAI conference on artificial intelligence, pages
4393–4401, 2021.
[27] Jianing Wang, Junda Wu, Yupeng Hou, Yao Liu, Ming Gao, and Julian McAuley. Instructgraph:
Boosting large language models via graph-centric instruction tuning and preference alignment.
arXiv preprint arXiv:2402.08785, 2024.
[28] Meihong Wang, Linling Qiu, and Xiaoli Wang. A survey on knowledge graph embeddings for
link prediction. Symmetry, 13(3):485, 2021.
[29] Quan Wang, Zhendong Mao, Bin Wang, and Li Guo. Knowledge graph embedding: A survey
of approaches and applications. IEEE Transactions on Knowledge and Data Engineering,
29(12):2724–2743, 2017.
[30] Zhen Wang, Jianwen Zhang, Jianlin Feng, and Zheng Chen. Knowledge graph embedding by
translating on hyperplanes. In Proceedings of the AAAI conference on artificial intelligence,
2014.
[31] Jason Wei, Maarten Bosma, Vincent Y Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan
Du, Andrew M Dai, and Quoc V Le. Finetuned language models are zero-shot learners. arXiv
preprint arXiv:2109.01652, 2021.
[32] Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani
Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, et al. Emergent abilities of large
language models. arXiv preprint arXiv:2206.07682, 2022.
[33] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed Chi, Quoc Le, and Denny
Zhou. Chain of thought prompting elicits reasoning in large language models. arXiv preprint
arXiv:2201.11903, 2022.
[34] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed H Chi, Quoc V Le,
Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. In
Advances in Neural Information Processing Systems, 2022.
[35] Sang Michael Xie, Aditi Raghunathan, Percy Liang, and Tengyu Ma. An explanation of
in-context learning as implicit bayesian inference. arXiv preprint arXiv:2111.02080, 2021.
[36] Bishan Yang, Wen-tau Yih, Xiaodong He, Jianfeng Gao, and Li Deng. Embedding entities and
relations for learning and inference in knowledge bases. arXiv preprint arXiv:1412.6575, 2014.
[37] Liang Yao, Chengsheng Mao, and Yuan Luo. Kg-bert: Bert for knowledge graph completion.
arXiv preprint arXiv:1909.03193, 2019.
[38] Jason Youn and Ilias Tagkopoulos. Kglm: Integrating knowledge graph structure in language
models for link prediction. arXiv preprint arXiv:2211.02744, 2022.
[39] Donghan Yu, Chenguang Zhu, Yiming Yang, and Michael Zeng. Jaket: Joint pre-training of
knowledge graph and language understanding. In Proceedings of the AAAI Conference on
Artificial Intelligence, pages 11630–11638, 2022.
A Appendix
A.1 In-Context Learning Examples