Articles by Sebastian Krause

The article demonstrates how generic parsers in a minimally supervised information extraction framework can be adapted to a given task and domain for relation extraction (RE). For the experiments, two parsers that deliver n-best readings are included: (1) a generic deep-linguistic parser (PET) with a largely hand-crafted head-driven phrase structure grammar for English (ERG); (2) a generic statistical parser (Stanford Parser) trained on the Penn Treebank. It will be shown how the estimated confidence of RE rules learned from the n-best parses can be exploited for parse reranking for both parsers. The acquired reranking model improves the performance of RE in both training and test phases with the new first parses. The obtained significant boost of recall does not come from an overall gain in parsing performance but from an application-driven selection of parses that are best suited for the RE task. Since the readings best suited for the successful extraction of rules and instances are often not the readings favoured by a regular parser evaluation, generic parsing accuracy actually decreases. The novel method for task-specific parse reranking does not require any annotated data beyond the semantic seed, which is needed anyway for the RE task.
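The core reranking idea described in the abstract can be sketched as follows: each of a parser's n-best readings is rescored by the confidence of the RE rules it triggers. This is a minimal illustrative sketch under assumed data structures; the function name, rule identifiers, and scoring scheme are hypothetical, not the paper's actual implementation.

```python
def rerank_parses(nbest_parses, rule_confidences):
    """Reorder n-best parses so that the reading whose matched RE rules
    carry the highest total confidence comes first; ties preserve the
    generic parser's original ranking (Python's sort is stable).

    nbest_parses: list of (parse_id, matched_rule_ids) in parser order.
    rule_confidences: dict mapping rule_id -> confidence in [0, 1].
    """
    def score(entry):
        _, matched = entry
        return sum(rule_confidences.get(r, 0.0) for r in matched)
    return sorted(nbest_parses, key=score, reverse=True)

nbest = [("parse-1", ["rule-a"]),            # the parser's favourite reading
         ("parse-2", ["rule-a", "rule-b"]),  # reading better suited for RE
         ("parse-3", [])]
confidences = {"rule-a": 0.4, "rule-b": 0.5}
reranked = rerank_parses(nbest, confidences)
print(reranked[0][0])  # parse-2 now ranks first
```

This mirrors the abstract's observation: the reading preferred for RE ("parse-2") need not be the reading a generic parser evaluation would favour ("parse-1").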

Recent years have seen significant growth and increased usage of large-scale knowledge resources in both academic research and industry. We can distinguish two main types of knowledge resources: those that store factual information about entities in the form of semantic relations (e.g., Freebase), namely so-called knowledge graphs, and those that represent general linguistic knowledge (e.g., WordNet or UWN). In this article, we present a third type of knowledge resource which completes the picture by connecting the first two types. Instances of this resource are graphs of semantically-associated relations (sar-graphs), whose purpose is to link semantic relations from factual knowledge graphs with their linguistic representations in human language. We present a general method for constructing sar-graphs using a language- and relation-independent, distantly supervised approach which, apart from generic language processing tools, relies solely on the availability of a lexical semantic resource, providing sense information for words, as well as a knowledge base containing seed relation instances. Using these seeds, our method extracts, validates and merges relation-specific linguistic patterns from text to create sar-graphs. To cope with the noisily labeled data arising in a distantly supervised setting, we propose several automatic pattern confidence estimation strategies, and also show how manual supervision can be used to improve the quality of sar-graph instances. We demonstrate the applicability of our method by constructing sar-graphs for 25 semantic relations, of which we make a subset publicly available at http://sargraph.dfki.de.
We believe sar-graphs will prove to be useful linguistic resources for a wide variety of natural language processing tasks, and in particular for information extraction and knowledge base population. We illustrate their usefulness with experiments in relation extraction and in computer assisted language learning.
Papers by Sebastian Krause

Coreference resolution for event mentions enables extraction systems to process document-level information. Current systems in this area base their decisions on rich semantic features from various knowledge bases, thus restricting them to domains where such external sources are available. We propose a model for this task which does not rely on such features but instead utilizes sentential features coming from convolutional neural networks. Two such networks first process coreference candidates and their respective context, thereby generating latent-feature representations which are tuned towards event aspects relevant for a linking decision. These representations are augmented with lexical-level and pairwise features, and serve as input to a trainable similarity function producing a coreference score. Our model achieves state-of-the-art performance on two datasets, one of which is publicly available. An error analysis points out directions for further research.
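The final scoring step described above, a trainable similarity over latent representations plus pairwise features, can be sketched with a single logistic layer. This is a toy stand-in: the vectors below substitute for real CNN outputs, and all names, dimensions, and weights are illustrative assumptions.

```python
import math

def coreference_score(latent_a, latent_b, pairwise, weights, bias=0.0):
    """Single logistic layer over the concatenation of the two latent
    event-mention representations and the pairwise features, producing
    a coreference score in (0, 1)."""
    features = latent_a + latent_b + pairwise  # list concatenation
    z = sum(w * f for w, f in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z))

# Toy vectors standing in for the CNN-derived representations.
a = [0.9, 0.1]
b = [0.8, 0.2]
pair = [1.0]  # e.g. a hypothetical same-trigger-lemma indicator
w = [0.5, -0.5, 0.5, -0.5, 1.0]
score = coreference_score(a, b, pair, w)
print(round(score, 3))
```

In a real system the weights (and the CNNs feeding the latent vectors) would be learned jointly from linking decisions rather than set by hand.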

International Conference on Language Resources and Evaluation (LREC), 2016
Recent research shows the importance of linking linguistic knowledge resources for the creation of large-scale linguistic data. We describe our approach for combining two English resources, FrameNet and sar-graphs, and illustrate the benefits of the linked data in a relation extraction setting. While FrameNet consists of schematic representations of situations, linked to lexemes and their valency patterns, sar-graphs are knowledge resources that connect semantic relations from factual knowledge graphs to the linguistic phrases used to express instances of these relations. We analyze the conceptual similarities and differences of both resources and propose to link sar-graphs and FrameNet on the levels of relations/frames as well as phrases. The former alignment involves a manual ontology mapping step, which allows us to extend sar-graphs with new phrase patterns from FrameNet. The phrase-level linking, on the other hand, is fully automatic. We investigate the quality of the automatically constructed links and identify two main classes of errors.
Workshop on Natural Language Processing Techniques for Educational Applications at the Annual Meeting of the Association for Computational Linguistics (NLP-TEA @ ACL), 2015
We propose a strategy for the semi-automatic generation of learning material for reading-comprehension tests, guided by semantic relations embedded in expository texts. Our approach combines methods from the areas of information extraction and paraphrasing in order to present a language teacher with a set of candidate multiple-choice questions and answers that can be used for verifying a language learner's reading capabilities. We implemented a web-based prototype showing the feasibility of our approach and carried out a pilot user evaluation that resulted in encouraging feedback but also pointed out aspects of the strategy and prototype implementation which need improvement.

Workshop on Linked Data in Linguistics: Resources and Applications, co-located with the Annual Meeting of the Association for Computational Linguistics (LDL @ ACL), 2015
We present sar-graphs, a knowledge resource that links semantic relations from factual knowledge graphs to the linguistic patterns with which a language can express instances of these relations. Sar-graphs expand upon existing lexico-semantic resources by modeling syntactic and semantic information at the level of relations, and are hence useful for tasks such as knowledge base population and relation extraction. We present a language-independent method to automatically construct sar-graph instances that is based on distantly supervised relation extraction. We link sar-graphs at the lexical level to BabelNet, WordNet and UBY, and present our ongoing work on pattern- and relation-level linking to FrameNet. An initial dataset of English sar-graphs for 25 relations is made publicly available, together with a Java-based API.
Annual Meeting of the Association for Computational Linguistics (ACL), System Demonstrations, 2015
Patterns extracted from dependency parses of sentences are a major source of knowledge for most state-of-the-art relation extraction systems, but can be of low quality in distantly supervised settings. We present a linguistic annotation tool that allows human experts to analyze and categorize automatically learned patterns, and to identify common error classes. The annotations can be used to create datasets that enable machine learning approaches to pattern quality estimation. We also present an experimental pattern error analysis for three semantic relations, where we find that between 24% and 61% of the learned dependency patterns are defective due to preprocessing or parsing errors, or due to violations of the distant supervision assumption.
Conference of the North American Chapter of the ACL – Human Language Technologies (NAACL HLT), 2015
This paper describes IDEST, a new method for learning paraphrases of event patterns. It is based on a new neural network architecture that relies only on the weak supervision signal coming from news articles published on the same day that mention the same real-world entities. It can generalize across extractions from different dates to produce a robust paraphrase model for event patterns that also captures meaningful representations for rare patterns. We compare it with two state-of-the-art systems and show that it attains comparable quality when trained on a small dataset. Its generalization capabilities also allow it to leverage much more data, leading to substantial quality improvements.
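The weak supervision signal mentioned above can be sketched as a simple grouping step: extractions from same-day news that mention the same entities are bucketed together, and patterns sharing a bucket serve as likely paraphrase pairs. The field names and sample data are illustrative assumptions, not IDEST's actual interface.

```python
from collections import defaultdict

def group_paraphrase_candidates(extractions):
    """Bucket event-pattern extractions by (publication date, entity set);
    patterns sharing a bucket are treated as paraphrase candidates."""
    buckets = defaultdict(list)
    for ex in extractions:
        key = (ex["date"], frozenset(ex["entities"]))
        buckets[key].append(ex["pattern"])
    # Only buckets with at least two patterns yield a supervision signal.
    return {k: v for k, v in buckets.items() if len(v) > 1}

news = [
    {"date": "2014-07-17", "entities": ["X", "Y"], "pattern": "X acquires Y"},
    {"date": "2014-07-17", "entities": ["X", "Y"], "pattern": "X buys Y"},
    {"date": "2014-07-18", "entities": ["X", "Z"], "pattern": "X sues Z"},
]
groups = group_paraphrase_candidates(news)
print(groups)  # one group: "X acquires Y" / "X buys Y" on 2014-07-17
```

In the paper this signal feeds a neural architecture that embeds patterns; the grouping itself requires no manual annotation, which is the point of the weak supervision.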
International Conference on Agents and Artificial Intelligence (ICAART), 2015
A new method is proposed and evaluated that improves distantly supervised learning of pattern rules for n-ary relation extraction. The new method employs knowledge from a large lexical semantic repository to guide the discovery of patterns in parsed relation mentions. It extends the induced rules to semantically relevant material outside the minimal subtree containing the shortest paths connecting the relation entities, and also discards rules without any explicit semantic content. It significantly raises both recall and precision, with roughly a 20% F-measure boost in comparison to the baseline system, which does not consider the lexical semantic information.
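The "discards rules without any explicit semantic content" step can be illustrated with a minimal filter: a candidate rule survives only if at least one of its content words is known to the lexical semantic repository. The repository contents, rule format, and function name below are toy assumptions for illustration.

```python
# Toy stand-in for a relation-relevant slice of a lexical semantic repository.
SEMANTIC_REPOSITORY = {"marry", "wedding", "spouse", "divorce"}

def filter_rules(rules):
    """Keep only rules with explicit semantic content, i.e. at least one
    content word found in the lexical semantic repository."""
    kept = []
    for rule in rules:
        if any(word in SEMANTIC_REPOSITORY for word in rule["content_words"]):
            kept.append(rule)
    return kept

rules = [
    {"id": 1, "content_words": ["marry", "in"]},  # semantically grounded
    {"id": 2, "content_words": ["say", "that"]},  # no relation-specific content
]
print([r["id"] for r in filter_rules(rules)])  # [1]
```

Rule 2 is the kind of semantically empty pattern that distant supervision tends to produce from coincidental co-occurrences; discarding it is what drives the precision gain the abstract reports.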

International Conference on Language Resources and Evaluation (LREC), 2014
In this paper, we present a novel combination of two types of language resources dedicated to the detection of relevant relations such as events or facts across sentence boundaries (relation extraction, RE). The first resource is the sar-graph, which aggregates, for each target relation, tens of thousands of linguistic patterns of semantically associated relations that signal instances of the target relation (Uszkoreit and Xu, 2013). These have been learned from the Web by intra-sentence pattern extraction (Krause et al., 2012) and, after semantic filtering and enrichment, have been automatically combined into a single graph. The other resource is cockrACE, a specially annotated corpus for the training and evaluation of cross-sentence RE. By employing our annotation tool Recon, annotators mark selected entities and relations (including events), coreference relations among these entities and events, and also terms that are semantically related to the relevant relations and events. This paper describes how the two resources are created and how they complement each other.

International Conference on Language Resources and Evaluation (LREC), 2014
This paper presents a new resource for the training and evaluation of relation extraction experiments. The corpus consists of annotations of mentions of three semantic relations (marriage, parent–child, siblings), selected from the domain of biographic facts about persons and their social relationships. It contains more than one hundred news articles from the tabloid press. In the current corpus, we only consider relation mentions occurring within individual sentences. We provide multi-level annotations which specify the marked facts from the relation, argument, and entity levels down to the token level, thus allowing for detailed analysis of linguistic phenomena and their interactions. Recon, a generic markup tool developed at the DFKI LT lab, was utilised for the annotation task. The corpus was annotated by two human experts, with additional conflict resolution conducted by a third expert. As the evaluation shows, the annotation is of high quality, demonstrated by the inter-annotator agreement both at the sentence level and at the relation-mention level. The corpus is already in active use in our research for evaluating the relation extraction performance of our automatically learned extraction patterns.

International Semantic Web Conference (ISWC), 2013
Web-scale relation extraction is a means for building and extending large repositories of formalized knowledge. This type of automated knowledge building requires a decent level of precision, which is hard to achieve with automatically acquired rule sets learned from unlabeled data by means of distant or minimal supervision. This paper shows how the precision of relation extraction can be considerably improved by employing a wide-coverage, general-purpose lexical semantic network, i.e., BabelNet, for effective semantic rule filtering. We apply Word Sense Disambiguation to the content words of the automatically extracted rules. As a result, a set of relation-specific relevant concepts is obtained, and each of these concepts is then used to represent the structured semantics of the corresponding relation. The resulting relation-specific subgraphs of BabelNet are used as semantic filters for estimating the adequacy of the extracted rules. For the seven semantic relations tested here, the semantic filter consistently yields higher precision at any relative recall value in the high-recall range.

International Semantic Web Conference (ISWC), 2012
We present a large-scale relation extraction (RE) system which learns grammar-based RE rules from the Web by utilizing large numbers of relation instances as seeds. Our goal is to obtain rule sets large enough to cover the actual range of linguistic variation, thus tackling the long-tail problem of real-world applications. A variant of distant supervision learns several relations in parallel, enabling a new method of rule filtering. The system detects both binary and n-ary relations. We target 39 relations from Freebase, for which 3M sentences extracted from 20M web pages serve as the basis for learning an average of 40K distinctive rules per relation. Employing an efficient dependency parser, the average run time for each relation is only 19 hours. We compare these rules with ones learned from local corpora of different sizes and demonstrate that the Web is indeed needed for good coverage of linguistic variation.

International Conference on Parsing Technologies (IWPT), 2011
The paper demonstrates how the generic parser of a minimally supervised information extraction framework can be adapted to a given task and domain for relation extraction (RE). For the experiments, a generic deep-linguistic parser was employed that works with a largely hand-crafted head-driven phrase structure grammar (HPSG) for English. The output of this parser is a list of n-best parses selected and ranked by a MaxEnt parse-ranking component, which had been trained on a more or less generic HPSG treebank. It will be shown how the estimated confidence of RE rules learned from the n-best parses can be exploited for parse reranking. The acquired reranking model improves the performance of RE in both training and test phases with the new first parses. The obtained significant boost of recall does not come from an overall gain in parsing performance but from an application-driven selection of parses that are best suited for the RE task. Since the readings best suited for successful rule extraction and instance extraction are often not the readings favored by a regular parser evaluation, generic parsing accuracy actually decreases. The novel method for task-specific parse reranking does not require any annotated data beyond the semantic seed, which is needed anyway for the RE task.

International Conference on Computational Linguistics (COLING), Posters Volume, 2010
This paper presents a new approach to improving relation extraction based on minimally supervised learning. By adding some limited closed-world knowledge for the confidence estimation of learned rules to the usual seed data, the precision of relation extraction can be considerably improved. Starting from an existing baseline system, we demonstrate that utilizing limited closed-world knowledge can effectively eliminate "dangerous" or plainly wrong rules during the bootstrapping process. The new method improves the reliability of the confidence estimation and the precision of the extracted instances. Although recall suffers to a certain degree depending on the domain and the selected settings, the overall performance measured by F-score considerably improves. Finally, we validate the adaptability of the best ranking method to a new domain and obtain promising results.
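One way to picture the role of closed-world knowledge in confidence estimation: a rule whose extractions hit a known-false instance is eliminated outright, while the rest are scored by how many of their extractions the seed confirms. This scoring formula and the sample instances are assumptions for illustration, not the paper's actual estimator.

```python
def rule_confidence(extracted, seed_positives, closed_world_negatives):
    """Score a rule by the fraction of its extractions confirmed by the
    seed; any hit in the closed-world negative set eliminates the rule."""
    if any(inst in closed_world_negatives for inst in extracted):
        return 0.0  # "dangerous" rule: contradicts known facts
    if not extracted:
        return 0.0
    confirmed = sum(1 for inst in extracted if inst in seed_positives)
    return confirmed / len(extracted)

# Hypothetical spouse-relation instances.
seeds = {("Einstein", "Mileva Maric")}
negatives = {("Einstein", "Niels Bohr")}  # known non-spouse pair
good_rule = [("Einstein", "Mileva Maric")]
bad_rule = [("Einstein", "Niels Bohr"), ("Einstein", "Mileva Maric")]
print(rule_confidence(good_rule, seeds, negatives))  # 1.0
print(rule_confidence(bad_rule, seeds, negatives))   # 0.0
```

Without the negative set, the bad rule would score 0.5 and survive bootstrapping; the closed-world check is what removes it, which is the precision gain the abstract describes.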