Papers by Katharina Probst

The transfer-based approach to machine translation (MT) captures structural transfers between the source language and the target language, with the goal of producing grammatical translations. The major drawback of the approach is the development bottleneck, requiring many human-years of rule development. On the other hand, data-driven approaches such as example-based and statistical MT achieve fast system development by deriving mostly non-structural translation information from bilingual corpora. This thesis aims at striking a balance between both approaches by inferring transfer rules automatically from bilingual text, aiming specifically at scenarios where bilingual data is in sparse supply. The rules are learned using a variety of information, such as parses that are available for one of the languages, and morphological information that is available for both languages. They are learned in three stages, first producing an initial hypothesis, then capturing the syntactic structure, and finally adding appropriate unification constraints. The learned rules are used in a run-time translation system, a statistical transfer system which is a combination of a transfer engine and a statistical decoder. We demonstrate the effectiveness of the learned rules on Hebrew→English and Hindi→English translation tasks.
International Conference on Enterprise Information Systems, 2009
In many practical applications, multiple interrelated tasks must be accomplished sequentially through user interaction with retrieval, classification and recommendation systems. The ordering of the tasks may have a significant impact on the overall utility (or performance) of the systems; hence optimal ordering of tasks is desirable. However, manual specification of optimal ordering is often difficult when task dependencies are complex,
This paper compares a manually written MT grammar and a grammar learned automatically from an English-Spanish elicitation corpus with the ultimate purpose of automatically refining the translation rules. The experiment described here shows that the kind of automatic refinement operations required to correct a translation not only varies depending on the type of error, but also on the type of grammar. This paper describes the two types of grammars and gives a detailed error analysis of their output, indicating what kinds of refinements are required in each case.

Lecture Notes in Computer Science, 2007
We describe an approach to extract attribute-value pairs from product descriptions in order to augment product databases by representing each product as a set of attribute-value pairs. Such a representation is useful for a variety of tasks where treating a product as a set of attribute-value pairs is more useful than as an atomic entity. We formulate the extraction task as a classification problem and use Naïve Bayes combined with a multi-view semi-supervised algorithm (co-EM). The extraction system requires very little initial user supervision: using unlabeled data, we automatically extract an initial seed list that serves as training data for the semi-supervised classification algorithm. The extracted attributes and values are then linked to form pairs using dependency information and co-location scores. We present promising results on product descriptions in two categories of sporting goods products. The extracted attribute-value pairs can be useful in a variety of applications, including product recommendations, product comparisons, and demand forecasting. In this paper, we describe one practical application of the extracted attribute-value pairs: a prototype of an Assortment Comparison Tool that allows retailers to compare their product assortments to those of their competitors. As the comparison is based on attributes and values, we can draw meaningful conclusions at a very fine-grained level. We present the details and research issues of such a tool, as well as the current state of our prototype.
Lecture Notes in Computer Science, 2009
This paper introduces ‘guessing anonymity,’ a definition of privacy for noise perturbation methods. This definition captures the difficulty of linking identity to a sanitized record using publicly available information. Importantly, this definition leads to analytical expressions that bound data privacy as a function of the noise perturbation parameters. Using these bounds, we can formulate optimization problems to describe the
Lecture Notes in Computer Science, 2004
Lecture Notes in Computer Science, 2007
Research in multi-view active learning has typically focused on algorithms for selecting the next example to label. This is often at the cost of lengthy wait-times for the user between each query iteration. We deal with a real-world information extraction task, extracting attribute-...
Proceedings of the 8th ACM/IEEE-CS joint conference on Digital libraries - JCDL '08, 2008
We describe a Knowledge Management System that shifts the focus from the traditional document-centric to a user-centric view. It takes into account users' query and download behavior, opinions, reputations, and social connections.

Proceedings of the 40th Annual Meeting on Association for Computational Linguistics - ACL '02, 2001
We describe an approach to improve the bilingual cooccurrence dictionary that is used for word alignment, and evaluate the improved dictionary using a version of the Competitive Linking algorithm. We demonstrate a problem faced by the Competitive Linking algorithm and present an approach to ameliorate it. In particular, we rebuild the bilingual dictionary by clustering similar words in a language and assigning them a higher cooccurrence score with a given word in the other language than each single word would have otherwise. Experimental results show a significant improvement in precision and recall for word alignment when the improved dictionary is used. A source language word s and a target language word t are said to cooccur if s occurs in a source language sentence and t occurs in the corresponding target language sentence. Cooccurrence scores are then counts for all word pairs s and t, where s is in the source language vocabulary and t ...
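The cooccurrence counts defined in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `cooccurrence_counts`, the whitespace tokenization, the choice to count each word pair at most once per sentence pair, and the toy German-English data are all assumptions made for illustration. The clustering step and the Competitive Linking algorithm are not shown.

```python
from collections import Counter
from itertools import product


def cooccurrence_counts(sentence_pairs):
    """Count how often each (source word, target word) pair cooccurs.

    Assumption: a pair is counted once per aligned sentence pair,
    no matter how many times either word repeats in its sentence.
    """
    counts = Counter()
    for source_sentence, target_sentence in sentence_pairs:
        # Hypothetical tokenization: split on whitespace.
        source_words = set(source_sentence.split())
        target_words = set(target_sentence.split())
        # Every source word cooccurs with every target word
        # of the corresponding sentence.
        counts.update(product(source_words, target_words))
    return counts


# Toy parallel corpus (invented for this example).
pairs = [("das haus", "the house"), ("das buch", "the book")]
counts = cooccurrence_counts(pairs)
```

Here `counts[("das", "the")]` is 2 because the pair appears in both sentence pairs, while `counts[("haus", "book")]` is 0 because the two words never share a sentence pair.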
Proceedings of The IEEE, 2004
We describe the rapid development of a preliminary Hebrew-to-English Machine Translation system under a transfer-based framework specifically designed for rapid MT prototyping for languages with limited linguistic resources. The task is particularly challenging due to two main reasons: the high lexical and morphological ambiguity of Hebrew and the dearth of available resources for the language. Existing, publicly available
We describe a Machine Translation (MT) approach that is specifically designed to enable rapid development of MT for languages with limited amounts of online resources. Our approach assumes the availability of a small number of bilingual speakers of the two languages, but these need not be linguistic experts. The bilingual speakers create a comparatively small corpus of word-aligned phrases
Machine Translation, 2003
Lecture Notes in Computer Science, 2002
ACM SIGKDD Explorations Newsletter, 2006
... For the case of textual data in general, and product descriptions in particular, obtaining ... We exploit this observation to automatically extract high-quality seeds by defining a modified ... supervised learning to improve the performance of Naïve Bayes by exploiting large amounts of ...

The rule-based approach to machine translation (MT) captures structural mappings between the source language and the target language, with the goal of producing grammatical translations. The major drawback of the approach is the development bottleneck, requiring many human-years of rule development. On the other hand, data-driven approaches such as example-based and statistical MT achieve fast and robust system development by deriving mostly non-structural translation information from bilingual corpora. This thesis aims at striking a balance between both approaches by inferring transfer rules automatically from bilingual text, aiming specifically at scenarios where bilingual data is in sparse supply. The rules are learned using a variety of information, such as parses and part-of-speech tags, that is available for one of the languages. They are learned in three stages, first producing an initial hypothesis, then capturing the syntactic structure, and finally adding appropriate unification constraints. The learned rules are used in a run-time translation system, a statistical transfer system which is a combination of a transfer engine and a statistical decoder. We demonstrate the algorithms on a Hebrew→English translation task.
Speech Communication, 2002
In the past, educators relied on classroom observation to determine the relevance of various pedagogical techniques. Automated language learning now allows us to examine pedagogical questions in a much more rigorous manner. We can use a computer-assisted language learning (CALL) system as a base, tracing all user responses and controlling the information given out. We have thus used the Fluency system [Proceedings of Speech Technology in Language and Learning, 1998, p. 77] to answer the question of what voice a language learner should imitate when working on pronunciation. In this article, we will examine whether there should be a choice of model speakers and what characteristics of a model's voice may be important to match when there is a choice. © 2002 Elsevier Science B.V. All rights reserved.
Machine Translation, 2000

ACTIVE, a three-year EU integrating project which began in March 2008, is using semantic technology to address three particular requirements of knowledge workers: the need to share information easily and effectively; the need to give priority to information which is relevant ...