
Using Knowledge Graphs to Explain Entity Co-occurrence in Twitter

Yiwei Wang∗ (Hong Kong University of Science and Technology, Hong Kong, China; wangyw_seu@[Link])
Mark James Carman (Monash University, Caulfield, VIC, Australia)
Yuan-Fang Li (Monash University, Clayton, VIC, Australia)

ABSTRACT
Modern Knowledge Graphs such as DBPedia contain significant
information regarding Named Entities and the logical relationships
which exist between them. Twitter on the other hand, contains
important information on the popularity and frequency with which
these entities are mentioned and discussed in combination with one
another. In this paper we investigate whether these two sources of
information can be used to complement and explain one another. In
particular, we would like to know whether the logical relationships
(a.k.a. semantic paths) which exist between pairs of known entities
can help to explain the frequency with which those entities co-occur
with one another in Twitter. To do this we train a ranking function
over semantic paths between pairs of entities. The aim of the ranker
is to identify the path that most likely explains why a particular pair
of entities have appeared together in a particular tweet. We train
the ranking model using a number of lexical, graph-embedding and
popularity-based features over semantic paths containing a single
intermediate entity, and demonstrate the efficacy of the model for
determining why pairs of entities occur together in tweets.

KEYWORDS
Microblog; Information Retrieval; Importance Ranking; Machine Learning; DBPedia; Knowledge Graphs; Twitter

1 INTRODUCTION
On-line social networks have become an inalienable part of many people's lives, allowing them to communicate effectively with friends and colleagues. Currently about 500 million tweets are posted on Twitter per day¹. This mountain of data provides useful information about the popularity of various named entities (people, places, products, etc.) which are also described in knowledge graphs such as DBPedia [2]. Many tweets contain more than one named entity, and knowledge graphs can provide semantic relations (paths) between the entities to explain their co-occurrence. For example, Fig. 1 shows a tweet referring to two entities, Adidas and Nike, Inc., and the semantic relations containing one intermediate entity between them present in DBPedia, as shown by RelFinder².

Figure 1: A tweet referring to the entities Nike and Adidas (above), and semantic relations linking them (below).

For any given pair of named entities, many possible paths may link them in the knowledge graph. The issue we investigate in this paper is how best to rank these relations, as represented by the intermediate entities that lie along the path between the entities, for the purpose of explaining their co-occurrence. To the best of our knowledge, this is the first research work aimed at ranking the semantic relations between popular entities in Twitter. Our method can be described as follows:

• We propose an approach for automatically labelling semantic paths for building a large training corpus of labelled semantic paths, alleviating the need for manual labelling and allowing us to scale to larger training quantities: a dataset of approx. 10 million tweets with four hundred thousand pairs of co-occurring entities.
• We propose several features for predicting the importance of different paths based on lexical information, knowledge graph embeddings and on-line popularity information.
• We cast the problem as a rank learning problem and train a RankSVM model to rank paths.
• Our preliminary evaluation using a human-labelled dataset in terms of NDCG@k shows promising results. Our analysis also identified the features most important to the ranking algorithm.

∗ Work was conducted while on placement at Monash University, and was supported partly by the China Scholarship Council.
¹ [Link]
² [Link]

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@[Link].
CIKM'17, November 6–10, 2017, Singapore, Singapore
© 2017 Copyright held by the owner/author(s). Publication rights licensed to Association for Computing Machinery.
ACM ISBN 978-1-4503-4918-5/17/11 . . . $15.00
[Link]

2 METHOD
We now describe the methods used for data collection, labelling and feature extraction.
Figure 2: The data processing pipeline of our system (tweet collection → entity-pair extraction → intermediate-entity lookup → auto-/human-labelling via Twitter search, e.g. {#Ebay #Google #California} → feature collection from graph embeddings, Wikipedia pageviews and string analysis → Rank SVM → NDCG@k, list-wise and pair-wise accuracy).
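For concreteness, the pipeline's two bookend stages (entity-pair extraction and intermediate-entity ranking) can be sketched in Python. This is a toy, in-memory sketch; every function name and the dummy scoring function are illustrative rather than the authors' code:

```python
from itertools import combinations

def cooccurring_pairs(tweets):
    """Collect the unordered entity pairs that co-occur within a tweet."""
    pairs = set()
    for entities in tweets:
        # each tweet is a list of recognised entity names
        for a, b in combinations(sorted(set(entities)), 2):
            pairs.add((a, b))
    return pairs

def rank_intermediates(candidates, score):
    """Rank candidate intermediate entities by a scoring function
    (a stand-in for the trained Rank SVM model described later)."""
    return sorted(candidates, key=score, reverse=True)

tweets = [["Ebay", "Google"], ["Ebay", "Google", "California"]]
pairs = cooccurring_pairs(tweets)  # contains ("Ebay", "Google")
# toy candidates for one pair, scored here by a dummy function
ranking = rank_intermediates(["California", "Internet"], score=len)
```

In the full system the dummy `score` is replaced by the learned ranking model, and pairs are accumulated over the whole tweet stream.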

2.1 Data Collection
Fig. 2 shows the data processing pipeline for the system. Twitter's streaming API is used to collect tweets, from which named entities that are present in DBPedia are extracted using DBPedia Spotlight [5], with confidence set to 0.8 and support set to 20. Twitter's streaming API requires query terms, so we used the top 400 words from a list of the 500 most frequent words on Twitter³ as queries to ensure maximum coverage of tweets (more than 91% of tweets contain at least one word from this list). Tweets containing at least two entities are stored for further analysis.

For each pair of entities that co-occur in a tweet, queries are executed against a DBPedia SPARQL endpoint to collect the set of intermediate entities that appear in semantic relations between them (e.g., the entities Clothing, Sports equipment and Fashion accessory that connect Nike, Inc. and Adidas in Fig. 1).

We limited SPARQL queries to only return paths containing exactly one intermediate entity linking a pair of co-occurring entities. Semantic relations directly connecting the observed entities through a single predicate (i.e., without an intermediate entity) were very rare⁴ and thus insufficient for explaining the co-occurrence in most cases. Longer relations containing two or more intermediate entities were so prolific (due to DBpedia's high branching factor) that they were infrequently meaningful, and were ignored in this work. Pairs of entities with no paths between them in DBPedia are removed from further analysis. Future work will investigate the ranking of paths of varied length.

Over a two-month period more than 10 million tweets were collected, from which 689,336 occurrences of pairs of entities were extracted. These occurrences cover 90,981 unique entities and 383,234 unique pairs of entities. Fig. 3 shows a histogram of the number of paths between pairs of entities (y-axis) that contain a certain number of intermediate entities (x-axis). As can be seen, though most pairs do not have more than 100 intermediate entities, a huge number of intermediate entities do exist between some pairs, which demonstrates the challenging nature of the proposed research.

Figure 3: Number of intermediate entities between pairs of co-occurring entities in tweets.

In addition, we also collect graph embeddings of these entities from RDF2Vec [7] (with the dimension of vectors set to 200), as well as pageview data of the Wikipedia pages corresponding to these entities, to use as features; we discuss these in a following subsection.

³ [Link]
⁴ The DBPedia graph contains nearly 4 million entities but has an average degree of only 7.0, see: [Link]

2.2 Automatic labelling of training data
Effective training of a rank learning model requires a large quantity of labelled training data consisting of queries (observed pairs of entities) for which each document (semantic relation or intermediate entity linking them) has been assigned a relevance label (or rank). In this preliminary work we treat the intermediate entity as a sufficient representation of the semantic relation between the observed entities. Future work will investigate the ranking of relations involving different predicates but the same intermediate entity/entities.
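The single-intermediate-entity lookup described in Section 2.1 can be sketched as follows. The paper does not give the exact query, so the graph patterns, the FILTER clause and the simplified edge directions below are assumptions for illustration only:

```python
# Sketch of the one-intermediate-entity path lookup against the public
# DBPedia SPARQL endpoint. Graph patterns and FILTER are assumptions,
# not the authors' exact query.

def intermediate_entity_query(left_uri, right_uri):
    """Build a SPARQL query for paths <left> --p1--> ?mid --p2--> <right>
    containing exactly one intermediate entity."""
    return f"""
    SELECT DISTINCT ?mid WHERE {{
        <{left_uri}> ?p1 ?mid .
        ?mid ?p2 <{right_uri}> .
        FILTER (?mid != <{left_uri}> && ?mid != <{right_uri}>)
    }}
    """

q = intermediate_entity_query(
    "http://dbpedia.org/resource/Nike,_Inc.",
    "http://dbpedia.org/resource/Adidas",
)
# q can then be posted to a SPARQL endpoint such as
# http://dbpedia.org/sparql with any SPARQL client (e.g. SPARQLWrapper);
# it is not executed here.
```

A production version would also consider paths traversed in both directions and deduplicate the returned intermediate entities across predicate combinations.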
Human labelling is both tedious and time-consuming, and hence infeasible for generating the quantity of training data required to train a rank learning model. Thus we instead developed a method of automatically labelling data that uses co-occurrence information between the intermediate entity and the observed entity pair to estimate the relevance of the intermediate entity. Specifically, given a pair of co-occurring entities (el, er) and an intermediate entity ei, we use Twitter's search functionality to find the 20 most recent tweets that contain all three of the entities in the form of hashtags (e.g., #Adidas for the entity Adidas). We append the hashtag prefix # to each entity's name since, if the name is used as a hashtag, it is more likely to be an actual reference to the entity (consider a tweet containing the word "apple" versus the hashtag #apple). For some pairs of entities (especially those with longer names), queries using the three hashtags return zero results for almost all intermediate entities. If this is the case, we remove the hashtag prefix # from each of the entity names and repeat the search without it. We do this whenever the set of counts across the intermediate entities has variance less than 5.⁵

Finally, the average frequency for each intermediate entity (calculated as the inverse of the average time interval between tweets containing all three entities) is used as its relevance label (rank), since it approximates the popularity of the intermediate entity in the context of the pair of entities (i.e., P(ei | el, er)).

⁵ The value of the threshold was set empirically by inspecting the results.

2.3 Features
We have designed a set of seven features, organised into three broad categories: lexical, graph-embeddings-based, and popularity-based features.

Lexical Features. We propose three features that relate an entity to its name.
F1 We conjecture that the number of words in an intermediate entity's name may influence its importance, since long names can indicate that the entity denotes a more specific topic.
F2 Similar to the number of words, the number of digits in an entity's name could correlate with its specificity.
F3 Whether the entity is a category is chosen as the third feature. Wikipedia uses Category pages to classify entities; these classes are not mutually exclusive but are concrete.

Graph embeddings features. DBpedia's network structure reflects relationships between different entities in the world. RDF2Vec [7] is a recently proposed model that translates entities in an RDF-based knowledge graph (in this case DBPedia) to distributed vector embeddings. Represented as vectors, different entities can be compared using the cosine similarity between the corresponding vectors. Let e⃗ be the vector representation of entity e in RDF2Vec. Given an intermediate entity ei between a pair of entities el and er, we propose two features.
F4 Sum of the cosine similarities: cos(e⃗i, e⃗l) + cos(e⃗i, e⃗r). Intuitively, this feature measures how similar the intermediate entity is to both el and er.
F5 Absolute difference between the cosine similarities: |cos(e⃗i, e⃗l) − cos(e⃗i, e⃗r)|. Intuitively, this feature measures the difference in similarity of ei with el and er respectively.

On-line Popularity Features. We also propose two features to represent the popularity of entities.
F6 The log of the number of pageviews of the Wikipedia page corresponding to an entity, during March 2017.
F7 The average time interval between consecutive co-occurrences of the pairs of entities (ei, el) and (ei, er).

2.4 Training
After performing min-max normalisation of feature values, we train an SVMrank [4] model with a linear kernel function to rank intermediate entities. The 3,200 most popular pairs of entities (those co-occurring in the largest number of tweets) are chosen for training our ranker. These pairs of entities have in total more than 25,000 relations among them extracted from DBPedia. The dataset is randomly split 4:1 for training and testing, and this is repeated 600 times.

As described in the previous subsection, we proposed three groups of features, and these groups are increasingly expensive to calculate. The last group, on-line popularity features, is also time-sensitive. To understand the effect of the three different groups of features, we trained ranking models with (1) lexical features only, (2) lexical and graph embeddings features, and (3) all features. The accuracy values of pair-wise comparison and list-wise ranking of these models are shown in Fig. 4. As can be seen, the more features are included, the higher the accuracy values are.

Figure 4: Accuracy ratio of pair-wise comparison and list-wise ranking.

3 EVALUATION
We perform an evaluation using human-labelled data to assess the effectiveness of our technique, using the performance metric Normalised Discounted Cumulative Gain (NDCG@k).
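NDCG@k can be computed from graded relevance judgements as follows. This is the common linear-gain, logarithmic-discount formulation; the paper does not state which DCG variant it uses, so this exact form is an assumption:

```python
import math

# NDCG@k for scoring a ranked list of intermediate entities against
# graded relevance judgements (e.g. aggregated human scores on a 0-50
# scale). Linear gain with a log2 position discount.

def dcg_at_k(relevances, k):
    """DCG@k: each relevance is discounted by log2(position + 1),
    with positions counted from 1."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    """NDCG@k: DCG of the system ranking over DCG of the ideal ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# e.g. the system ranked the irrelevant entity first:
score = ndcg_at_k([0, 3], k=2)  # ≈ 0.631
```

`relevances` here is the list of judged relevance grades in the order the system ranked the entities, so a perfect ranking (grades already in descending order) scores exactly 1.0.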
We extracted, from the pairs of entities co-occurring in Twitter, those which were linked by between 8 and 10 unique intermediate entities in DBPedia⁶, and randomly selected 50 pairs for use in the evaluation. Five participants were shown all of the intermediate entities for each of the 50 pairs, and asked to rate each intermediate entity on a scale of 0 to 10 for how "important" it was given the pair. The scores for each intermediate entity were then aggregated (summed) and used as graded relevance judgements on a scale from 0 to 50 to compute NDCG for each query pair. The NDCG@k values (for k from 1 to 6) on the manually labelled test dataset are shown in Fig. 5. For comparison, a random ranking is shown as well.

Figure 5: NDCG@k versus k (using ground truth labels).

In the human evaluation, similar to what is observed in Fig. 4, we can observe that performance (as measured by NDCG@k) improves with the addition of features. Moreover, models trained on all three groups of features outperform the random ranking baseline. The model trained on all features is the most accurate, significantly outperforming the other models, with NDCG values of at least 0.8 for all k values. This suggests that all seven features contribute positively to ranking performance.

To understand the effects of different features on the result, we perform an ablation study to compare the drop in NDCG@k when each of the 7 features is deleted, for k = 1 and k = 5. Table 1 shows the results. Note that with the full set of features, NDCG@1 = 0.812 and NDCG@5 = 0.934.

Table 1: Ablation study: features ordered from most important to least.

Feature                              | Δ NDCG@1    | Δ NDCG@5
F7: Entity popularity in Twitter     | 0.341 (42%) | 0.162 (19%)
F6: Num. of Wikipedia pageviews      | 0.036 (4%)  | 0.022 (3%)
F1: Num. of words in entity name     | 0.025 (2%)  | 0.014 (2%)
F5: Difference of entity similarity  | 0.024 (2%)  | 0.013 (2%)
F4: Sum of entity similarity         | 0.018 (1%)  | 0.008 (2%)
F2: Num. of digits in entity name    | 0.014 (1%)  | 0.004 (1%)
F3: Whether entity is category       | 0.013 (1%)  | 0.003 (1%)

The results show that F7, the Twitter popularity of the intermediate entity, is the most important: the drop in NDCG@k when it is removed is significantly larger than for any of the other features.

⁶ Note that this means 8 to 10 unique paths, each with a different intermediate entity, not paths of length 8 to 10.

4 RELATED WORK
There has been ongoing research into ranking relationships and explaining relatedness between entities on knowledge graphs [1, 3, 6, 8], and most of this existing work only makes use of structured data encoded in knowledge graphs. In contrast, our work is concerned with the problem of ranking semantic relations between pairs of entities co-occurring in the same tweet. Moreover, we are the first to incorporate novel features based on RDF graph embeddings [7].

5 CONCLUSION
Microblogging services such as Twitter have become an indispensable part of modern life. Many tweets contain multiple entities, and their co-occurrence could be explained by one or more semantic relations between these entities. In this paper, we address the problem of ranking such relations to explain the co-occurrence of pairs of entities in tweets using a rank learning approach.

We propose a novel framework to automatically obtain labelled data for training, along with several features for predicting the importance of different relations based on lexical information, knowledge graph structure and on-line popularity information. Our preliminary evaluation shows promising ranking accuracy as measured by NDCG@k.

Our future work includes identifying additional features and rank learning algorithms to further improve ranking performance. We also plan to generalise our work to (1) support longer, arbitrary-length relations (paths) and (2) incorporate information about edges (predicates), not only intermediate entities.

REFERENCES
[1] Kemafor Anyanwu, Angela Maduko, and Amit Sheth. 2005. SemRank: Ranking Complex Relationship Search Results on the Semantic Web. In Proceedings of the 14th International Conference on World Wide Web (WWW '05). ACM, New York, NY, USA, 117–127.
[2] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, and Z. Ives. 2008. DBpedia: A Nucleus for a Web of Open Data. In Proceedings of the 6th International Semantic Web Conference and 2nd Asian Semantic Web Conference (ISWC+ASWC 2007). The Semantic Web, 722–735.
[3] Gong Cheng, Daxin Liu, and Yuzhong Qu. 2016. Efficient Algorithms for Association Finding and Frequent Association Pattern Mining. In The Semantic Web – ISWC 2016 – 15th International Semantic Web Conference, Kobe, Japan, October 17–21, 2016, Proceedings, Part I (Lecture Notes in Computer Science), Vol. 9981. 119–134.
[4] Thorsten Joachims. 2006. Training Linear SVMs in Linear Time. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 217–226.
[5] Pablo N. Mendes, Max Jakob, Andrés García-Silva, and Christian Bizer. 2011. DBpedia Spotlight: Shedding Light on the Web of Documents. In I-SEMANTICS (ACM International Conference Proceeding Series). ACM, 1–8.
[6] Giuseppe Pirrò. 2015. Explaining and Suggesting Relatedness in Knowledge Graphs. In International Semantic Web Conference. Springer, 622–639.
[7] Petar Ristoski and Heiko Paulheim. 2016. RDF2Vec: RDF Graph Embeddings for Data Mining. In International Semantic Web Conference. Springer, 498–514.
[8] Stephan Seufert, Klaus Berberich, Srikanta J. Bedathur, Sarath Kumar Kondreddi, Patrick Ernst, and Gerhard Weikum. 2016. ESPRESSO: Explaining Relationships Between Entity Sets. In Proceedings of the 25th ACM International Conference on Information and Knowledge Management (CIKM '16). ACM, New York, NY, USA, 1311–1320.
