Katharina Probst

Co-authors

6

Carnegie Mellon University

Jaime Carbonell

Roberto Aranovich

Soulideth Phimmasone

Jakob Uszkoreit

Google

Harvard University

The Pennsylvania State University

Papers by Katharina Probst

Graph Structure Learning for Task Ordering

International Conference on Enterprise Information Systems, 2009

In many practical applications, multiple interrelated tasks must be accomplished sequentially through user interaction with retrieval, classification and recommendation systems. The ordering of the tasks may have a significant impact on the overall utility (or performance) of the systems; hence optimal ordering of tasks is desirable. However, manual specification of optimal ordering is often difficult when task dependencies are complex,

Error Analysis of Two Types of Grammar

This paper compares a manually written MT grammar and a grammar learned automatically from an English-Spanish elicitation corpus with the ultimate purpose of automatically refining the translation rules. The experiment described here shows that the kind of automatic refinement operations required to correct a translation varies not only with the type of error, but also with the type of grammar. This paper describes the two types of grammars and gives a detailed error analysis of their output, indicating what kinds of refinements are required in each case.

Extracting and Using Attribute-Value Pairs from Product Descriptions on the Web

Lecture Notes in Computer Science, 2007

We describe an approach to extract attribute-value pairs from product descriptions in order to augment product databases by representing each product as a set of attribute-value pairs. Such a representation is useful for a variety of tasks where treating a product as a set of attribute-value pairs is more useful than as an atomic entity. We formulate the extraction task as a classification problem and use Naïve Bayes combined with a multi-view semi-supervised algorithm (co-EM). The extraction system requires very little initial user supervision: using unlabeled data, we automatically extract an initial seed list that serves as training data for the semi-supervised classification algorithm. The extracted attributes and values are then linked to form pairs using dependency information and co-location scores. We present promising results on product descriptions in two categories of sporting goods products. The extracted attribute-value pairs can be useful in a variety of applications, including product recommendations, product comparisons, and demand forecasting. In this paper, we describe one practical application of the extracted attribute-value pairs: a prototype of an Assortment Comparison Tool that allows retailers to compare their product assortments to those of their competitors. As the comparison is based on attributes and values, we can draw meaningful conclusions at a very fine-grained level. We present the details and research issues of such a tool, as well as the current state of our prototype.

Maximizing Privacy under Data Distortion Constraints in Noise Perturbation Methods

Lecture Notes in Computer Science, 2009

This paper introduces ‘guessing anonymity,’ a definition of privacy for noise perturbation methods. This definition captures the difficulty of linking identity to a sanitized record using publicly available information. Importantly, this definition leads to analytical expressions that bound data privacy as a function of the noise perturbation parameters. Using these bounds, we can formulate optimization problems to describe the

Error Analysis of Two Types of Grammar for the Purpose of Automatic Rule Refinement

Lecture Notes in Computer Science, 2004

Towards ‘Interactive’ Active Learning in Multi-view Feature Sets for Information Extraction

Lecture Notes in Computer Science, 2007

Research in multi-view active learning has typically focused on algorithms for selecti...

Considering users and their opinions in knowledge management systems

Proceedings of the 8th ACM/IEEE-CS joint conference on Digital libraries - JCDL '08, 2008

We describe a Knowledge Management System that shifts the focus from the traditional document-cen...

Using similarity scoring to improve the bilingual dictionary for word alignment

Proceedings of the 40th Annual Meeting on Association for Computational Linguistics - ACL '02, 2001

We describe an approach to improve the bilingual cooccurrence dictionary that is used for word alignment, and evaluate the improved dictionary using a version of the Competitive Linking algorithm. We demonstrate a problem faced by the Competitive Linking algorithm and present an approach to ameliorate it. In particular, we rebuild the bilingual dictionary by clustering similar words in a language and assigning them a higher cooccurrence score with a given word in the other language than each single word would have otherwise. Experimental results show a significant improvement in precision and recall for word alignment when the improved dictionary is used. A source language word $w_S$ and a target language word $w_T$ are said to cooccur if $w_S$ occurs in a source language sentence and $w_T$ occurs in the corresponding target language sentence. Cooccurrence scores are then counts for all word pairs $w_S^i$ and $w_T^j$, where $w_S^i$ is in the source language vocabulary and $w_T^j$ ...
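The cooccurrence definition in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `cooccurrence_counts`, whitespace tokenization, the toy corpus, and the choice to count each word pair at most once per sentence pair are all assumptions made for illustration.

```python
from collections import Counter
from itertools import product


def cooccurrence_counts(sentence_pairs):
    """Count how often each (source word, target word) pair cooccurs.

    Assumption: a pair is counted at most once per aligned sentence pair,
    and sentences are tokenized by splitting on whitespace.
    """
    counts = Counter()
    for source_sentence, target_sentence in sentence_pairs:
        source_words = set(source_sentence.split())
        target_words = set(target_sentence.split())
        # Every source word cooccurs with every word of the aligned target sentence.
        for pair in product(source_words, target_words):
            counts[pair] += 1
    return counts


# Hypothetical two-sentence parallel corpus.
corpus = [("la casa", "the house"), ("la mesa", "the table")]
counts = cooccurrence_counts(corpus)
```

In this toy corpus the pair ("la", "the") appears in both sentence pairs, so it receives a higher count than ("casa", "house"), which is the kind of raw signal a cooccurrence dictionary is built from.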

Rapid Prototyping of a Transfer-based Hebrew-to-English Machine Translation System

Proceedings of The IEEE, 2004

We describe the rapid development of a preliminary Hebrew-to-English Machine Translation system under a transfer-based framework specifically designed for rapid MT prototyping for languages with limited linguistic resources. The task is particularly challenging for two main reasons: the high lexical and morphological ambiguity of Hebrew and the dearth of available resources for the language. Existing, publicly available

A Trainable Transfer-based Machine Translation Approach for Languages with Limited Resources

We describe a Machine Translation (MT) approach that is specifically designed to enable rapid development of MT for languages with limited amounts of online resources. Our approach assumes the availability of a small number of bilingual speakers of the two languages, but these need not be linguistic experts. The bilingual speakers create a comparatively small corpus of word-aligned phrases

MT for resource-poor languages using elicitation-based learning of syntactic transfer rules

by Katharina Probst and Lori Levin

Machine Translation, 2003

Automatic Rule Learning for Resource-Limited MT

by Katharina Probst and Lori Levin

Lecture Notes in Computer Science, 2002

Text mining for product attribute extraction

by Katharina Probst, Marko Krema, and Andrew Fano

ACM SIGKDD Explorations Newsletter, 2006

... For the case of textual data in general, and product descriptions in particular, obtaining ...

Automatically Induced Syntactic Transfer Rules for Machine Translation under a Very Limited Data Scenario

by Katharina Probst and Lori Levin

The rule-based approach to machine translation (MT) captures structural mappings between the source language and the target language, with the goal of producing grammatical translations. The major drawback of the approach is the development bottleneck, requiring many human-years of rule development. On the other hand, data-driven approaches such as example-based and statistical MT achieve fast and robust system development by deriving mostly non-structural translation information from bilingual corpora. This thesis aims at striking a balance between both approaches by inferring transfer rules automatically from bilingual text, aiming specifically at scenarios where bilingual data is in sparse supply. The rules are learned using a variety of information, such as parses and part-of-speech tags, that is available for one of the languages. They are learned in three stages, first producing an initial hypothesis, then capturing the syntactic structure, and finally adding appropriate unification constraints. The learned rules are used in a run-time translation system, a statistical transfer system which is a combination of a transfer engine and a statistical decoder. We demonstrate the algorithms in a Hebrew→English translation task.

Enhancing foreign language tutors – In search of the golden speaker

Speech Communication, 2002

In the past, educators relied on classroom observation to determine the relevance of various pedagogical techniques. Automated language learning now allows us to examine pedagogical questions in a much more rigorous manner. We can use a computer-assisted language learning (CALL) system as a base, tracing all user responses and controlling the information given out. We have thus used the Fluency system [Proceedings of Speech Technology in Language and Learning, 1998, p. 77] to answer the question of what voice a language learner should imitate when working on pronunciation. In this article, we will examine whether there should be a choice of model speakers and what characteristics of a model's voice may be important to match when there is a choice. © 2002 Elsevier Science B.V. All rights reserved.

MT for Minority Languages Using Elicitation-Based Learning of Syntactic Transfer Rules

by Katharina Probst and Lori Levin

Machine Translation, 2000

Learning transfer rules for machine translation with limited data

by Katharina Probst and Lori Levin

The transfer-based approach to machine translation (MT) captures structural transfers between the source language and the target language, with the goal of producing grammatical translations. The major drawback of the approach is the development bottleneck, requiring many human-years of rule development. On the other hand, data-driven approaches such as example-based and statistical MT achieve fast system development by deriving mostly non-structural translation information from bilingual corpora. This thesis aims at striking a balance between both approaches by inferring transfer rules automatically from bilingual text, aiming specifically at scenarios where bilingual data is in sparse supply. The rules are learned using a variety of information, such as parses that are available for one of the languages, and morphological information that is available for both languages. They are learned in three stages, first producing an initial hypothesis, then capturing the syntactic structure, and finally adding appropriate unification constraints. The learned rules are used in a run-time translation system, a statistical transfer system which is a combination of a transfer engine and a statistical decoder. We demonstrate the effectiveness of the learned rules on Hebrew→English and Hindi→English translation tasks.

ACTIVE: Enabling the Knowledge-Powered Enterprise Semantic Technology for Knowledge Worker Productivity

ACTIVE, a three-year EU integrating project which began in March 2008, is using semanti...
