Enhancing Knowledge Graph Construction Using Large Language Models
Abstract—The growing trend of Large Language Model (LLM) development has attracted significant attention, with models for various applications emerging consistently. However, the combined application of Large Language Models with semantic technologies for reasoning and inference is still a challenging task. This paper analyzes how current advances in foundational LLMs, such as ChatGPT, compare with specialized pretrained models, such as REBEL, for joint entity and relation extraction. To evaluate this approach, we conducted several experiments using sustainability-related text as our use case. We created pipelines for the automatic creation of Knowledge Graphs from raw texts, and our findings indicate that using advanced LLMs can improve the accuracy of creating these graphs from unstructured text. Furthermore, we explored the potential of automatic ontology creation using foundational LLMs, which resulted in even more relevant and accurate knowledge graphs.

Index Terms—ChatGPT, REBEL, LLMs, Relation extraction, NLP, Sustainability

I. INTRODUCTION

Technological advancements, together with the availability of Big Data, have led to a surge in the development of Large Language Models (LLMs) [1]. This trend has paved the way for a cascade of new models being released on a regular basis, each outperforming its predecessors. These models have started a revolution in the field with their capability to process massive amounts of unstructured text data and to achieve state-of-the-art results on multiple Natural Language Processing (NLP) tasks.

However, one aspect that has not yet taken the spotlight is the combined application of these models with semantic technologies to enable reasoning and inference. This paper attempts to fill this gap by making a connection between the Deep Learning (DL) space and the semantic space through the use of NLP for creating Knowledge Graphs [2].

Knowledge Graphs are structured representations of information that capture the relationships between entities in a particular domain. They are used extensively in various applications, such as search engines, recommendation systems, and question-answering systems.

On a related note, there is a significant amount of raw text available on the Web which contains valuable information. Nevertheless, this information is unusable if it cannot be extracted from the texts and applied for intelligent reasoning. This fact has motivated us to use some of the state-of-the-art models in an attempt to extract information from text data on the Web.

Yet, creating Knowledge Graphs from raw text data is a complex task that requires advanced NLP techniques such as Named Entity Recognition [3], Relation Extraction [4], and Semantic Parsing [5]. Large language models such as GPT-3 [6], T5 [7], and BERT [8] have shown remarkable performance in these tasks, and their use has resulted in significant improvements in the quality and accuracy of knowledge graphs.

To evaluate our approach to connecting both fields, we chose to analyze the specific use case of sustainability. Sustainability is a topic of great importance for our future, and a lot of emphasis has been placed on identifying ways to create more sustainable practices in organizations. Sustainability has become the norm for organizations in developed countries, mainly due to the rising awareness of their consumers and employees. However, this situation is not reflected to the same extent in developing and underdeveloped countries. Although the perception of sustainability has improved, progress toward sustainable development has been slower, indicating the need for more concrete guidance [9]. Moreover, theoretical research has attempted to link strategic management and sustainable development in corporations in order to encourage the integration of sustainability issues into corporate activities and strategies [10]. Even though research has set a basis for developing standards and policies in favor of sustainability, a more empirical approach is needed for defining policies and for analyzing an organization's sustainability level with respect to the defined policies.

In this study, the goal is to make a connection between LLMs and semantic reasoning to automatically generate a Knowledge Graph on the topic of sustainability and populate it with concrete instances using news articles available on the Web. For this purpose, we create multiple experiments in which we utilize popular NLP models, namely Relation Extraction By End-to-end Language generation (REBEL) [11] and ChatGPT [12]. We show that although REBEL is specifically trained for relation extraction, ChatGPT, a conversational agent built on a generative model, can streamline the process of automatically creating accurate Knowledge Graphs from unstructured text when provided with detailed instructions.
The rest of the paper is structured as follows: Section II presents a brief literature overview, Section III describes the methods and experimental setup, Section IV outlines the results of the information extraction process, Section V states the propositions for future work, and finally Section VI gives the conclusion of the work done in this paper.

II. LITERATURE REVIEW

A. Algorithms
Our study focuses on the task of information extraction from news and reports available on the Web. For this purpose, we compare the capabilities of NLP models to generate a useful Knowledge Base on the topic.

A Knowledge Base represents information stored in a structured format, ready to be used for analysis or inference. Often, Knowledge Bases are stored in the form of a graph and are then called Knowledge Graphs.

In order to create such a Knowledge Base, we need to extract information from the raw texts in a triplet format. An example of a triplet would be <Person, Location, City>. A triplet follows the structure Entity -> Relation -> Entity, where the first entity is referred to as the subject, the relation is the predicate, and the second entity represents the object. In order to achieve this structured information extraction, we need to identify entities in the raw texts, as well as the relations connecting these entities.
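To make this structure concrete, the sketch below shows one minimal way to represent a triplet in code; the class and field names are illustrative and are not part of any of the models discussed here.

```python
from typing import NamedTuple

class Triplet(NamedTuple):
    """A single <subject, predicate, object> fact extracted from raw text."""
    subject: str    # the first entity
    predicate: str  # the relation connecting the two entities
    object: str     # the second entity

# The <Person, Location, City> example from above as a concrete instance:
fact = Triplet(subject="Person", predicate="Location", object="City")
print(fact.subject, "->", fact.predicate, "->", fact.object)
```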
In the past, this process was implemented by leveraging multi-step pipelines, where one step performed Named-Entity Recognition (NER) [3] and another performed Relation Classification (RC) [13]. However, these multi-step pipelines often prove to have unsatisfactory performance due to the propagation of errors between the steps. To tackle this problem, end-to-end approaches have been implemented, referred to as Relation Extraction (RE) [4] methods.

One of the models utilized in this study is REBEL (Relation Extraction By End-to-end Language generation) [11], an auto-regressive seq2seq model based on BART [14] that performs end-to-end relation extraction for more than 200 different relation types. The model achieves 74 micro-F1 and 51 macro-F1 scores, and it was created for the purpose of joint entity-relation extraction.
REBEL is a generative seq2seq model that attempts to "translate" the raw text into a triplet format. The REBEL model outputs additional tokens, which are used during its training to identify a triplet. These tokens include <triplet>, which represents the beginning of a triplet; <subj>, which represents the end of the subject and the start of the predicate; and <obj>, which represents the end of the predicate and the start of the object. The authors of the REBEL paper provide a parsing function for extracting the triplets from the model's output.
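For illustration, the sketch below is a simplified re-implementation of such a parsing function based on the token layout described above; the official function shipped with REBEL handles more edge cases.

```python
import re

def parse_rebel_output(decoded: str):
    """Parse REBEL's linearized output into (subject, predicate, object)
    triplets, assuming the layout: <triplet> subject <subj> predicate <obj> object."""
    triplets = []
    # Every <triplet> marker opens a new triplet.
    for chunk in decoded.split("<triplet>")[1:]:
        # <subj> ends the subject; <obj> ends the predicate.
        match = re.match(r"(.*?)<subj>(.*?)<obj>(.*)", chunk, re.DOTALL)
        if match:
            subject, predicate, obj = (part.strip() for part in match.groups())
            if subject and predicate and obj:
                triplets.append((subject, predicate, obj))
    return triplets

# A made-up decoded string for demonstration:
print(parse_rebel_output("<triplet> Tesla <subj> industry <obj> renewable energy"))
# [('Tesla', 'industry', 'renewable energy')]
```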
The second approach we took was to use ChatGPT [12] as a conversational agent and to compare its performance on the task of entity-relation extraction and the creation of a common Knowledge Base. The agent consists of three steps involving separate models: a supervised fine-tuning (SFT) model based on GPT-3 [6], a reward model, and a reinforcement learning model.

ChatGPT was trained using Reinforcement Learning from Human Feedback (RLHF) [15], employing methods similar to InstructGPT with minor variations in data collection. An initial model was trained through supervised fine-tuning, with human AI trainers engaging in conversations while assuming both the user and the AI assistant roles. To aid in formulating responses, trainers were given access to model-generated suggestions. The newly created dialogue dataset was then combined with the InstructGPT dataset, which was transformed into a dialogue format. To establish a reward model for reinforcement learning, comparison data needed to be gathered, consisting of two or more model responses ranked by quality. This data was collected by taking conversations between AI trainers and the chatbot, randomly selecting a model-generated message, sampling multiple alternative completions, and having AI trainers rank them. The reward models enabled fine-tuning of ChatGPT using Proximal Policy Optimization [16], and several iterations of this procedure were executed.

B. Use Case: Sustainability

The 2022 Global Sustainability Study reported that 71% of 11,500 surveyed consumers around the world are making changes to the way they live and the products they buy in an effort to live more sustainably [17]. This shows that corporations need to change their operations to become more sustainable not only for the sake of the environment but also to stay competitive.

With the vast amount of unstructured data available on the Web, it is crucial to develop methods that can automatically identify sustainability-related information in news, reports, papers, and other forms of documents. One such study identifies this opportunity and attempts to create a method for directly extracting non-financial information generated by various media to provide objective ESG information [18]. The authors trained an ESG classifier and recorded a 4-class classification accuracy of 86.66% on texts they manually labeled. On a related note, researchers have taken a step further to extract useful ESG information from texts. In [19], the authors trained a joint entity and relation extraction model on a private dataset consisting of ESG and CSR reports annotated internally at Crédit Agricole. They were able to identify entities such as coal activities and environmental or social issues. In [20], the authors presented an approach for knowledge graph generation based on ESG-related news and official company documents.

III. METHODS

This section describes the methods used in this research, including the data collection process and the entity-relation extraction algorithms used to analyze the gathered data.
A. Data Collection Process

In order to conduct the experimental comparison of the two approaches for entity-relation extraction, news data was gathered from the Web on the topic of sustainability. For this purpose, the News API [21] system was used. News API is an HTTP REST API for searching and retrieving live articles from all over the Web. It provides the ability to search through articles posted on the Web by specifying the following options: keyword or phrase, date of publication, source domain name, and language.
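As a sketch, such a query can be issued with a few lines of Python; the endpoint and parameter names follow the public News API documentation, while the API key and query values are placeholders.

```python
import requests

NEWS_API_KEY = "YOUR_API_KEY"  # issued by newsapi.org

# Search the "everything" endpoint for English-language articles
# on sustainability within a given publication window.
response = requests.get(
    "https://newsapi.org/v2/everything",
    params={
        "q": "sustainability",  # keyword or phrase
        "from": "2023-02-15",   # earliest publication date
        "to": "2023-03-19",     # latest publication date
        "language": "en",
        "apiKey": NEWS_API_KEY,
    },
)
articles = response.json().get("articles", [])
print(f"Retrieved {len(articles)} articles")
```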
Using News API, 94 news articles published between 2023-02-15 and 2023-03-19 on the topic of sustainability were collected. The collected texts varied in length, ranging from 50 to over 4,200 words. Given the limit on the number of tokens that can be passed as input to a language model, additional pre-processing steps were needed to account for texts consisting of a large number of words.
B. Relation-Extraction Methods

Relation extraction is a fundamental task in NLP that aims to identify the semantic relationships between entities in a sentence or document. The task is challenging because it requires understanding the context in which the entities appear and the types of relationships that exist between them.

In this subsection, we describe how we utilize REBEL and ChatGPT for the task of relation extraction.
1) REBEL: Our first approach was to use REBEL in an attempt to extract relations from unstructured news articles. For REBEL to be able to use the provided texts, they need to be tokenized with the corresponding tokenizer function. Tokenization is the process of separating raw text into smaller units called tokens; tokens can refer to words, characters, or sub-words. The model has a limit of 512 tokens, which means that longer articles need to be pre-processed before being sent to the model for triplet extraction.
To address this limitation, we tokenize the raw text and divide the tokens into 256-token batches. These batches are processed separately by the REBEL model, and the results are subsequently merged to extract relations from longer texts. Metadata is also added to the extracted relations, referencing the token batch from which each relation was derived. With this approach, some relations may not be extracted accurately because a batch of tokens might begin or end in the middle of a sentence. However, the number of cases where this happens is insignificant; thus, we leave their handling for future work.
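A minimal sketch of this chunking step is given below, using the publicly available Babelscape/rebel-large checkpoint from the Hugging Face Hub; the helper name and generation parameters are our own assumptions rather than the exact configuration used.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Babelscape/rebel-large")
model = AutoModelForSeq2SeqLM.from_pretrained("Babelscape/rebel-large")

def run_rebel_on_long_text(text: str, batch_size: int = 256):
    """Tokenize an article, split the tokens into fixed-size batches,
    run REBEL on each batch, and keep the batch index as metadata."""
    token_ids = tokenizer(text, return_tensors="pt").input_ids[0]
    results = []
    for start in range(0, len(token_ids), batch_size):
        batch = token_ids[start : start + batch_size].unsqueeze(0)
        generated = model.generate(batch, max_length=256, num_beams=3)
        decoded = tokenizer.batch_decode(generated, skip_special_tokens=False)[0]
        results.append({"batch_index": start // batch_size, "raw_output": decoded})
    return results
```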
Once the entity-relation extraction process is finished, the extracted information is stored in a triplet structure. To further normalize the extracted entities, we perform Entity Linking [22]. Entity Linking refers to the identification and association of entity mentions in raw text with their corresponding entities in a Knowledge Base. The process of Entity Linking is not part of the REBEL model; it is an additional post-processing step used to refine the extracted relations. In this study, we utilize DBpedia as our Knowledge Base and consider two entities identical if they share the same DBpedia URL. This approach will not work for entities that are not present in DBpedia.
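One way to implement this post-processing step, sketched below, is to query the public DBpedia Spotlight annotation service and use the returned resource URI as the canonical identifier for each entity; the surrounding logic is our simplification.

```python
from typing import Optional

import requests

def link_entity(mention: str) -> Optional[str]:
    """Return the DBpedia URI for an entity mention, or None when
    DBpedia Spotlight finds no matching resource."""
    response = requests.get(
        "https://api.dbpedia-spotlight.org/en/annotate",
        params={"text": mention},
        headers={"Accept": "application/json"},
    )
    resources = response.json().get("Resources", [])
    # Keep the first (highest-ranked) candidate, if any.
    return resources[0]["@URI"] if resources else None

# Mentions resolving to the same URI are merged into a single entity.
print(link_entity("Sustainability"))
```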
2) ChatGPT: The second approach taken in this paper uses OpenAI's ChatGPT [12]. We created two experiments using ChatGPT.

The first experiment prompts ChatGPT to extract relations from the collected news articles. After extracting the relations, we follow the same steps as with the REBEL model in order to create a comprehensive Knowledge Base.

The second experiment focuses on creating a prompt that would directly generate the entire Knowledge Base and write an ontology describing the concepts identified in the texts. This approach has the goal of reducing the number of manual steps that need to be performed in order to obtain the final Knowledge Graph.

For both experiments, we set the 'temperature' parameter to 0 in order to obtain more deterministic outputs, since OpenAI models are non-deterministic by nature.
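A sketch of such a call with the OpenAI Python SDK available at the time of writing (the v0.x ChatCompletion interface) is shown below; the prompt wording is illustrative rather than the exact prompt used in the experiments.

```python
import openai

openai.api_key = "YOUR_API_KEY"

def extract_relations(article_text: str) -> str:
    """Ask ChatGPT to extract sustainability-related relations as triplets."""
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        temperature=0,  # reduce randomness for near-deterministic output
        messages=[{
            "role": "user",
            "content": (
                "Extract relations connected to sustainability from the "
                "following text. Return them as a list of "
                "<subject, predicate, object> triplets.\n\n" + article_text
            ),
        }],
    )
    return response["choices"][0]["message"]["content"]
```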
Experiment 1. For the first experiment, we prompt ChatGPT to extract relations connected to sustainability. ChatGPT was able to successfully extract entities, connect them with relations, and return the results in a triplet format. After the relations had been extracted, the same post-processing step of Entity Linking was applied to the results from ChatGPT.

Although ChatGPT was able to extract entities from the articles and link them with relations, it was not successful at abstracting concepts. The entities and relations identified often represented whole phrases instead of concepts.

To overcome this obstacle, we prompted ChatGPT to map the identified entities and relations to a suitable OWL ontology [23]. However, ChatGPT failed to identify relevant sustainability concepts or define their instances. The identified classes, such as Company, Customer, MarketingEcosystem, Resource, CustomerExperience, Convenience, and DigitalMarketing, had some potential relevance to sustainability, but ChatGPT did not identify any instances of these classes.
Experiment 2. In the second experiment, we refined the prompt to ask ChatGPT to explicitly generate an OWL ontology on sustainability, including concepts such as organizations, actions, practices, policies, and related terms. We also allowed ChatGPT to create additional classes and properties if necessary. We explicitly requested the results to be returned in RDF Turtle format.

Providing additional information to ChatGPT resulted in the creation of an improved Knowledge Base. ChatGPT was able to define concepts such as organizations, actions, practices, and policies, as well as identify suitable relations to connect them. Moreover, it was able to create instances of the defined classes and properties and link them together. This shows that adding more specific instructions to the prompts for ChatGPT can produce drastically different results.
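Since the model returns the ontology as plain text, the Turtle output can be loaded into an RDF graph to verify that it is well-formed and to query it further. The sketch below uses rdflib; the embedded snippet is a hypothetical fragment in the style of the generated ontologies, not a verbatim excerpt.

```python
from rdflib import Graph

# Hypothetical fragment in the style of the ontology ChatGPT generated.
turtle_data = """
@prefix ex:  <http://example.org/sustainability#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .

ex:Organization a owl:Class .
ex:Practice     a owl:Class .
ex:implementsPractice a owl:ObjectProperty .

ex:AcmeCorp  a ex:Organization ;
             ex:implementsPractice ex:Recycling .
ex:Recycling a ex:Practice .
"""

graph = Graph()
graph.parse(data=turtle_data, format="turtle")  # raises if the Turtle is malformed
print(f"Parsed {len(graph)} RDF triples")
```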
IV. RESULTS

This section presents the results from the experiments described in Section III. A comparison of the Knowledge Bases created by both methods is given, and the characteristics of the generated Knowledge Bases are outlined. Table I presents the Knowledge Bases from the REBEL model and the first experiment with ChatGPT, respectively. The table shows the number of entities, relations, and triplets extracted from the raw texts on sustainability.
TABLE I
KNOWLEDGE BASE STRUCTURE COMPARISON