Papers by Sofus Macskassy

This paper introduces the concept of an Information Valet ("iValet"), an approach for combining multi-platform access to heterogeneous information with adaptation to user preferences. The goal of an iValet is to support access to a range of information sources from a range of wired and wireless client devices with some degree of uniformity. An iValet should be aware of its user and user devices, learning from a user's past interactions where and how to send new incoming information. Our metaphor is that of a valet that sits between a user's client devices and the information services that the user may want to access. This paper describes a prototype iValet that interacts with its user through both the Palm VII "Web-clippings" service and the RIM 950 two-way email-capable pager, providing access to email, Web pages, and personal files. We discuss broader issues raised by our work concerning the general design of an iValet, and present initial results concerning the ability of machine learning techniques to successfully inject adaptability into our iValet system.

We address the problem of comparing the performance of classifiers. In this paper we study techniques for generating and evaluating confidence bands on ROC curves. Historically this has been done using one-dimensional confidence intervals, freezing one variable: the false-positive rate, or the threshold on the classification scoring function. We adapt two prior methods and introduce a new radial sweep method to generate confidence bands. We show, through empirical studies, that the bands are too tight, and introduce a general optimization methodology for creating bands that better fit the data, as well as methods for evaluating confidence bands. We show empirically that the optimized confidence bands fit much better and that, using our new evaluation method, it is possible to gauge the relative fit of different confidence bands.
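
As context for the banding methods this abstract compares, here is a minimal sketch of one common baseline: pointwise (not simultaneous) confidence bands formed by vertical averaging over bootstrap resamples at fixed false-positive rates. This is an illustrative technique, not the paper's radial sweep or optimization method; the grid, stratified resampling scheme, and confidence level are assumptions.

```python
# Sketch: pointwise ROC confidence bands via bootstrap vertical averaging.
# Illustrative baseline only; not the paper's radial sweep method.
import numpy as np
from sklearn.metrics import roc_curve

def bootstrap_roc_band(y_true, y_score, n_boot=1000, alpha=0.05, seed=0):
    """y_true: 0/1 numpy array; y_score: classifier scores (numpy array)."""
    rng = np.random.default_rng(seed)
    fpr_grid = np.linspace(0.0, 1.0, 101)
    pos = np.flatnonzero(y_true == 1)
    neg = np.flatnonzero(y_true == 0)
    tprs = np.empty((n_boot, fpr_grid.size))
    for b in range(n_boot):
        # Resample positives and negatives separately so both classes survive.
        idx = np.concatenate([rng.choice(pos, size=pos.size),
                              rng.choice(neg, size=neg.size)])
        fpr, tpr, _ = roc_curve(y_true[idx], y_score[idx])
        tprs[b] = np.interp(fpr_grid, fpr, tpr)   # TPR at each fixed FPR
    lower = np.quantile(tprs, alpha / 2, axis=0)
    upper = np.quantile(tprs, 1 - alpha / 2, axis=0)
    return fpr_grid, lower, upper
```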

This paper is about classifying entities that are interlinked with entities for which the class is known. After surveying prior work, we present NetKit, a modular toolkit for classification in networked data, and a case study of its application to networked data used in prior machine learning research. NetKit is based on a node-centric framework in which classifiers comprise a local classifier, a relational classifier, and a collective inference procedure. Various existing node-centric relational learning algorithms can be instantiated with appropriate choices for these components, and new combinations of components realize new algorithms. The case study focuses on univariate network classification, for which the only information used is the structure of class linkage in the network (i.e., only links and some class labels). To our knowledge, no previous work has systematically evaluated the power of class linkage alone for classification in machine learning benchmark data sets. The results demonstrate that very simple network-classification models perform quite well, well enough that they should be used regularly as baseline classifiers for studies of learning with networked data. The simplest method (which performs remarkably well) highlights the close correspondence between several existing methods introduced for different purposes, namely Gaussian-field classifiers, Hopfield networks, and relational-neighbor classifiers. The case study also shows that two sets of techniques are preferable in different situations, depending on whether few or many labels are known initially. We also demonstrate that link selection plays an important role similar to traditional feature selection.

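To make the framework concrete, below is a minimal sketch of a weighted-vote relational-neighbor classifier with a simple relaxation-labeling loop, one natural instantiation of the relational-classifier and collective-inference components the abstract describes. The graph representation, clamping of known labels, and iteration count are illustrative assumptions, not NetKit's actual API.

```python
# Sketch: weighted-vote relational-neighbor classification with relaxation.
from collections import defaultdict

def relational_neighbor(edges, labels, classes, n_iter=50):
    """edges: {node: [(neighbor, weight), ...]} over all nodes;
    labels: {node: class} for the known nodes."""
    # Known nodes start (and stay) at their label; unknowns start uniform.
    belief = {}
    for node in edges:
        if node in labels:
            belief[node] = {c: float(c == labels[node]) for c in classes}
        else:
            belief[node] = {c: 1.0 / len(classes) for c in classes}
    for _ in range(n_iter):
        new_belief = {}
        for node in edges:
            if node in labels:                    # known labels stay clamped
                continue
            score = defaultdict(float)
            for nbr, w in edges[node]:
                for c in classes:
                    score[c] += w * belief[nbr][c]   # weighted neighbor vote
            total = sum(score.values()) or 1.0
            new_belief[node] = {c: score[c] / total for c in classes}
        belief.update(new_belief)
    return belief

# Tiny example: two labeled neighbors pull the unlabeled node toward "a".
g = {"x": [("y", 1.0), ("z", 1.0)], "y": [("x", 1.0)], "z": [("x", 1.0)]}
print(relational_neighbor(g, {"y": "a", "z": "a"}, classes=["a", "b"])["x"])
```
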
Conversational agents are the stepping stones for the next generation of interaction between users and computers. Not only is this a more natural way for people to interact, but it would also increase efficiency and speed in working with computers. In this paper, I will present some major parts of what a conversational agent needs in order to work, and show what has been done in some areas as well as what still needs to be done. We are still quite far from building a conversational agent for general use, but I hope to show in this paper that we already possess the knowledge to build constrained agents that work in narrow domains and can interact naturally with people in those domains. For the more general agent, however, I will point out some areas that need further work and where I think this work needs to be headed.

Consider a supervised learning problem in which examples contain both numerical- and text-valued features. One common approach to this problem would be to treat the presence or absence of a word as a Boolean feature, which when combined with the other numerical features enables the application of a range of traditional feature-vector-based learning methods. This paper presents an alternative approach, in which numerical features are converted into "bag of words" features, enabling instead the use of a range of existing text-classification methods. Our approach creates a set of bins for each feature into which its observed values can fall. Two tokens are defined for each bin endpoint, representing which side of the endpoint a feature value lies on. A numerical feature is then assigned the bag of tokens appropriate for its value. Not only does this approach make it possible to apply text-classification methods to problems involving both numerical and text-valued features; even problems that contain solely numerical features can be converted using this representation so that text-classification methods can be applied. We therefore evaluate our approach both on a range of real-world datasets taken from the UCI Repository that involve solely numerical features and on additional datasets that contain both numerical- and text-valued features. Our results show that the performance of the text-classification methods using the binning representation often meets or exceeds that of traditional supervised learning methods (C4.5, k-NN, NBC, and Ripper), even on existing numerical-feature-only datasets from the UCI Repository, suggesting that text-classification methods, coupled with binning, can serve as a credible learning approach for traditional supervised learning problems.

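A minimal sketch of the binning idea described above: each feature gets a set of bin endpoints, and a value is encoded as one token per endpoint indicating which side of that endpoint it falls on. The quantile-based endpoint choice and the token naming scheme here are illustrative assumptions, not necessarily the paper's.

```python
# Sketch: converting a numerical feature value into bag-of-words tokens.
import numpy as np

def make_endpoints(values, n_bins=5):
    # Quantiles of the training values serve as bin endpoints (an assumption;
    # the paper may choose endpoints differently).
    qs = np.linspace(0, 1, n_bins + 1)[1:-1]
    return np.quantile(values, qs)

def value_to_tokens(feature_name, value, endpoints):
    tokens = []
    for i, e in enumerate(endpoints):
        side = "le" if value <= e else "gt"       # which side of endpoint i
        tokens.append(f"{feature_name}_ep{i}_{side}")
    return tokens

# Example: encode the numerical feature "age" for one record.
train_ages = np.array([23, 35, 41, 52, 60, 28, 47, 33])
eps = make_endpoints(train_ages)
print(value_to_tokens("age", 45, eps))  # tokens usable by a text classifier
```
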
Recently two new multiplicative weight-updating algorithms, "Shifting Winnow" and "Tracking the Best Expert," were published. These algorithms were designed in such a way as to be able to adapt to a changing concept with a minimal amount of loss, either in error rate or by ...
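
For context, here is a minimal sketch of the classic Winnow multiplicative update that these shifting and tracking variants build on; the learning rate and threshold are common illustrative choices, not the published algorithms themselves.

```python
# Sketch: the classic Winnow multiplicative weight update, for context.
# The shifting/tracking variants named above add mechanisms on top of this.
def winnow_update(w, x, y, theta=None, alpha=2.0):
    """w: weights; x: 0/1 features; y: true label in {0, 1}."""
    if theta is None:
        theta = len(w) / 2.0                  # a common threshold choice
    y_hat = int(sum(wi * xi for wi, xi in zip(w, x)) >= theta)
    if y_hat != y:                            # mistake-driven update
        for i, xi in enumerate(x):
            if xi:                            # only active features move
                w[i] *= alpha if y == 1 else 1.0 / alpha
    return w

# Example: a false negative promotes the active feature's weight.
w = winnow_update([1.0, 1.0, 1.0, 1.0], x=[1, 0, 0, 0], y=1)
print(w)   # [2.0, 1.0, 1.0, 1.0]
```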

We describe a guilt-by-association system that can be used to rank networked entities by their suspiciousness. We demonstrate the algorithm on a suite of data sets generated by a terrorist-world simulator developed to support a DoD program. Each data set consists of thousands of entities and some known links between them. The system ranks truly malicious entities highly, even if only relatively few are known to be malicious ex ante. When used as a tool for identifying promising data-gathering opportunities, the system focuses on gathering more information about the most suspicious entities and thereby increases the density of linkage in appropriate parts of the network. We assess performance under conditions of noisy prior knowledge of maliciousness. Although the levels of performance reported here would not support direct action on all data sets, the results do recommend the consideration of network-scoring techniques as a new source of evidence for decision making. For example, the system can operate on networks far larger and more complex than could be processed by a human analyst. This is a follow-up study to a prior paper; although there is a considerable amount of overlap, here we focus on more data sets and improve the evaluation by identifying entities whose high scores are simply an artifact of the data acquisition process.

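A minimal sketch of the guilt-by-association intuition: propagate suspicion outward from known-bad seeds and rank nodes by the resulting score. The seed clamping, damping factor, and iteration count are illustrative assumptions, not the system's actual scoring model.

```python
# Sketch: ranking nodes by suspicion diffused from known-bad seeds.
def suspicion_scores(edges, known_bad, n_iter=30, damping=0.5):
    """edges: {node: [neighbors]}; known_bad: set of known-malicious nodes."""
    score = {n: (1.0 if n in known_bad else 0.0) for n in edges}
    for _ in range(n_iter):
        new = {}
        for node, nbrs in edges.items():
            if node in known_bad:
                new[node] = 1.0               # seeds stay clamped
            elif nbrs:
                nbr_avg = sum(score[n] for n in nbrs) / len(nbrs)
                new[node] = damping * nbr_avg + (1 - damping) * score[node]
            else:
                new[node] = score[node]       # isolated nodes keep their score
        score = new
    return sorted(score.items(), key=lambda kv: -kv[1])  # most suspicious first

# Tiny example: "c" links to a known-bad node and outranks "d", which does not.
g = {"bad": ["c"], "c": ["bad", "d"], "d": ["c"]}
print(suspicion_scores(g, known_bad={"bad"}))
```
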
This paper describes NetKit-SRL, or NetKit for short, a toolkit for learning from and classifying networked data.

This paper presents EmailValet, a system that learns users' email-reading preferences on email-capable wireless platforms, specifically on two-way pagers with small "qwerty" keyboards and an 8-line, 30-character display. In use by the authors for about three months, it has gathered data on email-reading preferences over more than 8,900 email messages received by the authors during this period. The paper presents results comparing the ability of different learning methods to form models that can predict whether a given message should be forwarded to the user's wireless device. Our results show that the best-performing method, over a range of established learning methods developed in the information retrieval and machine learning communities, achieved a precision/recall break-even point of over 53% for one user who had received over 5,000 messages. We also find that, in general, all methods achieve better performance than the baseline of simply forwarding all messages to the wireless device, and that many methods find procedures that, although they forward only a small fraction of the messages that a user would want, achieve 100% precision on the messages they do forward.

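A short sketch of how the break-even point reported above is typically computed: rank messages by classifier score and cut the ranking at k equal to the number of truly forward-worthy messages, at which point precision equals recall. This is the standard information-retrieval definition; tie handling is an illustrative assumption.

```python
# Sketch: precision/recall break-even point of a ranked forwarding policy.
def break_even_point(y_true, scores):
    """y_true: 1 = user wanted the message forwarded; scores: model scores."""
    ranked = sorted(zip(scores, y_true), key=lambda p: -p[0])
    n_pos = sum(y_true)
    tp = sum(label for _, label in ranked[:n_pos])   # hits in the top n_pos
    return tp / n_pos          # at cutoff k = n_pos, precision == recall

print(break_even_point([1, 0, 1, 0, 1], [0.9, 0.8, 0.7, 0.2, 0.6]))  # 0.666...
```
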
With the proliferation of the World Wide Web, it has become very important to provide advanced tools for maintaining the referential integrity of information resources. The growing tendency toward building increasingly complex Web sites makes it necessary to maintain not only physical files, but also logical resources, or views, which are composed of references to other resources and presentation programs. Our solution to this problem is to design an infrastructure of resource maintenance agents. It includes the Data Agent, which keeps track of files and supports third-party requests to notify them of changes that occur to these files. Another component of the infrastructure is the Repository Agent, which supports change notification requests for logical resources. A prototype implementation of the infrastructure is currently available and is discussed in this paper.

Consider a supervised learning problem in which examples contain both numerical- and text-valued features. To use traditional feature-vector-based learning methods, one could treat the presence or absence of a word as a Boolean feature and use these binary-valued features together with the numerical features. However, the use of a text-classification system on such data is a bit more problematic: in the most straightforward approach, each number would be considered a distinct token and treated as a word. This paper presents an alternative approach to using text-classification methods for supervised learning problems with numerical-valued features, in which the numerical features are converted into bag-of-words features, thereby making them directly usable by text-classification methods. We show that even on purely numerical-valued data, text classification on the derived text-like representation outperforms the more naive numbers-as-tokens representation and, more importantly, is competitive with mature numerical classification methods such as C4.5, Ripper, and SVM. We further show that on mixed-mode data, adding numerical features using our approach can improve performance over not adding those features.

We have encountered several practical issues in performing data mining on a database that has been normalized using entity resolution. We describe here four specific lessons learned in such mining, and the meta-level lesson learned through dealing with these issues. The four specific lessons deal with handling correlated values, getting canonical records, getting authoritative records, and ensuring that relations are properly stored. Perhaps the most important lesson learned is that one ought to know what kind of data mining is to be done on the data before designing the schema of the normalized database, so that data specific to the mining is derivable from the database.

Twitter and other microblogs have rapidly become a significant means by which people communicate with the world and each other in near real time. There have been a large number of studies surrounding these social media, focusing on areas such as information spread, various centrality measures, topic detection, and more. However, one area which has received little attention is trying to better understand what information is being spread and why. One recent line of work has looked at the problem of modeling retweeting behaviors. This work advocated mapping tweets into a conceptual space, such as Wikipedia categories, and reasoning about diffusion behaviors in that space. The work, however, did not show that this was in fact needed, and the question is whether one can get equally good reasoning by staying at the token or word level. This paper looks at this particular question: whether one can in fact improve reasoning by mapping into a more abstract space, or whether there is a place for token-level modeling. We show that token-level models do have their place when reasoning about whether a tweet is likely interesting based on the tweet's words, but that the conceptual space is better when reasoning about homophily, i.e., similarities between users. Ideally one would like a hybrid model, and we show that while the hybrid model is not always optimal, it does yield good performance. We here repeat part of an earlier retweet study on over 768K tweets and show that profiles using a combination of word-based and concept-based features work better than either of the simpler representations.

Twitter, a micro-blogging service, provides users with a framework for writing brief, often-noisy postings about their lives. These posts are called "Tweets." In this paper we present early results on discovering Twitter users' topics of interest by examining the entities they mention in their Tweets. Our approach leverages a knowledge base to disambiguate and categorize the entities in the Tweets. We then develop a "topic profile," which characterizes users' topics of interest, by discerning which categories appear frequently and cover the entities. We demonstrate that even in this early work we are able to successfully discover the main topics of interest for the users in our study.
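
A minimal sketch of the topic-profile idea: map each mentioned entity to knowledge-base categories, then rank categories by mention frequency and by how many distinct entities they cover. The entity-to-category table below is a hypothetical stand-in for an actual knowledge-base lookup, and the ranking rule is an illustrative assumption.

```python
# Sketch: building a "topic profile" from entities mentioned in tweets.
from collections import Counter, defaultdict

ENTITY_CATEGORIES = {          # hypothetical knowledge-base fragment
    "Lakers": ["Basketball", "Sports"],
    "LeBron James": ["Basketball", "Sports"],
    "iPhone": ["Technology"],
}

def topic_profile(tweet_entities):
    """tweet_entities: list of entity lists, one per tweet."""
    freq = Counter()
    coverage = defaultdict(set)
    for entities in tweet_entities:
        for ent in entities:
            for cat in ENTITY_CATEGORIES.get(ent, []):
                freq[cat] += 1
                coverage[cat].add(ent)
    # Rank by mention frequency, breaking ties by distinct-entity coverage.
    return sorted(freq, key=lambda c: (freq[c], len(coverage[c])), reverse=True)

print(topic_profile([["Lakers", "LeBron James"], ["iPhone"], ["Lakers"]]))
```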

In this paper we study techniques for generating and evaluating confidence bands on ROC curves. ROC curve evaluation is rapidly becoming a commonly used evaluation metric in machine learning, although evaluating ROC curves has thus far been limited to studying the area under the curve (AUC) or generating one-dimensional confidence intervals by freezing one variable: the false-positive rate, or the threshold on the classification scoring function. Researchers in the medical field have long been using ROC curves and have many well-studied methods for analyzing such curves, including generating confidence intervals as well as simultaneous confidence bands. In this paper we introduce these techniques to the machine learning community and show their empirical fitness on the Covertype data set, a standard machine learning benchmark from the UCI repository. We show that some of these methods work remarkably well, others are too loose, and that existing machine learning methods for generating one-dimensional confidence intervals do not translate well to generating simultaneous bands; their bands are too tight.

In many applications, large volumes of time-sensitive textual information require triage: rapid, approximate prioritization for subsequent action. In this paper, we explore the use of prospective indications of the importance of a time-sensitive document for the purpose of producing better document filtering or ranking. By prospective, we mean importance that could be assessed by actions that occur in the future. For example, a news story may be assessed (retrospectively) as being important, based on events that occurred after the story appeared, such as a stock price plummeting or the issuance of many follow-up stories. If a system could anticipate (prospectively) such occurrences, it could provide a timely indication of importance. Clearly, perfect prescience is impossible. However, sometimes there is sufficient correlation between the content of an information item and the events that occur subsequently. We describe a process for creating and evaluating approximate information-triage procedures that are based on prospective indications. Unlike many information-retrieval applications for which document labeling is a laborious, manual process, for many prospective criteria it is possible to build very large, labeled training corpora automatically. Such corpora can be used to train text-classification procedures that will predict the (prospective) importance of each document. This paper illustrates the process with two case studies, demonstrating the ability to predict whether the stock price of one or more companies mentioned in a news story will move significantly following the appearance of that story. We conclude by discussing how the comprehensibility of the learned classifiers can be critical to success.
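
A minimal sketch of the automatic labeling step described above: label a story as prospectively important if the mentioned company's stock moves by more than a threshold within a window after the story appears. The threshold, window, and price-series layout are illustrative assumptions.

```python
# Sketch: auto-labeling a news story by the subsequent stock-price move.
def label_story(story_time, prices, window_hours=24, threshold=0.05):
    """prices: list of (timestamp_hours, price), sorted by time."""
    before = [p for t, p in prices if t <= story_time]
    after = [p for t, p in prices if story_time < t <= story_time + window_hours]
    if not before or not after:
        return None                               # can't label this story
    base = before[-1]                             # last price before the story
    move = max(abs(p - base) / base for p in after)
    return int(move >= threshold)                 # 1 = prospectively "important"

# Example with a hypothetical price series of (hour, price) pairs:
series = [(0, 100.0), (5, 101.0), (12, 108.0), (30, 109.0)]
print(label_story(story_time=4, prices=series))   # 1: an 8% move within 24h
```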

One common framework for data integration in practice is federated search. Here an agent queries disjoint sources simultaneously, and then clusters the returned records in the absence of unique keys. However, formulating the correct queries to the sources can be challenging because of possible variations in query values. For instance, some sources may record a first name as "John" while other sources use the name "Jonathan" for the same person. If the underlying sources do not support sophisticated matching, then a single query for "John" will miss many records from the "Jonathan" sources. This paper presents an approach to formulating queries for federated search that leverages automatically discovered transformations, such as synonyms and abbreviations, to create the set of possible queries for the given sources. Our preliminary results demonstrate that, indeed, transformations mined from a subset of sources apply to a new, distinct source, thereby allowing query expansions based on the discovered transformations.
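
A minimal sketch of query expansion with discovered transformations: substitute each query term with its known variants and issue one query per combination. The transformation table here is a hypothetical stand-in for transformations mined automatically from source data.

```python
# Sketch: expanding a federated-search query using mined transformations.
from itertools import product

TRANSFORMS = {                    # hypothetical mined transformations
    "john": {"john", "jonathan", "jon"},
    "st": {"st", "street", "saint"},
}

def expand_query(terms):
    """Generate all query variants by substituting each term's alternatives."""
    options = [sorted(TRANSFORMS.get(t.lower(), {t.lower()})) for t in terms]
    return [" ".join(combo) for combo in product(*options)]

print(expand_query(["John", "Smith"]))
# ['john smith', 'jon smith', 'jonathan smith']  -> one query per variant
```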

This paper presents our investigation into graph mining methods to help users understand large graphs. Our approach is a two-step process: first calculate subgraph labels, and then calculate distribution statistics on these labels. Our approach is flexible in that it can identify a range of patterns from very abstract to very specific (e.g., isomorphisms). The statistics that we calculate can be used to find rare and common patterns, patterns that are (dis)similar to the distribution of induced subgraphs of the same size, patterns that are (dis)similar to each other, as well as the variance of graph patterns given a specific set of input node types. We also investigate a method for understanding structural characteristics by analyzing clusters that are created by "collapsing" overlapping instances of user-specified patterns. We evaluated our approach on two publicly available networks: the Texas CS website from WebKB and the Internet Movie Database.
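
A minimal sketch of the two-step process: (1) assign each small induced subgraph a label, here the sorted multiset of node types plus the induced edge count, a deliberately coarse labeling chosen for illustration; (2) compute the empirical distribution over those labels to surface rare and common patterns. The labeling scheme and exhaustive size-3 enumeration are assumptions, not the paper's actual method.

```python
# Sketch: label small induced subgraphs, then compute label statistics.
from collections import Counter
from itertools import combinations

def subgraph_label(nodes, node_type, edges):
    types = sorted(node_type[n] for n in nodes)
    n_edges = sum(1 for a, b in combinations(nodes, 2)
                  if (a, b) in edges or (b, a) in edges)
    return (tuple(types), n_edges)

def pattern_stats(node_type, edges, size=3):
    counts = Counter(
        subgraph_label(group, node_type, edges)
        for group in combinations(node_type, size)   # all size-3 node sets
    )
    total = sum(counts.values())
    return {label: c / total for label, c in counts.items()}  # label distribution

# Tiny hypothetical typed graph.
types = {"a": "page", "b": "page", "c": "person", "d": "person"}
edge_set = {("a", "b"), ("b", "c"), ("c", "d")}
print(pattern_stats(types, edge_set))
```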

With the increase in information on the World Wide Web, it has become difficult to find desired information quickly without using multiple queries or a topic-specific search engine. One way to help in the search is to group together HTML pages that appear in some way to be related. In order to better understand this task, we performed an initial study of human clustering of web pages, in the hope that it would provide some insight into the difficulty of automating this task. Our results show that subjects did not cluster identically; in fact, on average, any two subjects had little similarity in their web-page clusters. We also found that subjects generally created rather small clusters, and those with access only to URLs created fewer clusters than those with access to the full text of each web page. Generally, the overlap of documents between clusters for any given subject increased when given the full text, as did the percentage of documents clustered. When analyzing individual subjects, we found that each had different behavior across queries, in terms of overlap, cluster size, and number of clusters. These results provide a sobering note on any quest for a single clearly correct clustering method for web pages. A slightly condensed version of this paper was published in .
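
To make the inter-subject comparison concrete, here is a minimal sketch of one common way to quantify how similarly two subjects clustered the same pages: the Jaccard index over co-clustered page pairs. The abstract does not state which measure the study used, so this metric is an illustrative choice.

```python
# Sketch: similarity of two clusterings via Jaccard over co-clustered pairs.
from itertools import combinations

def co_clustered_pairs(clustering):
    """clustering: {cluster_name: set_of_pages}."""
    pairs = set()
    for pages in clustering.values():
        pairs.update(frozenset(p) for p in combinations(sorted(pages), 2))
    return pairs

def clustering_similarity(c1, c2):
    p1, p2 = co_clustered_pairs(c1), co_clustered_pairs(c2)
    if not p1 and not p2:
        return 1.0                        # two trivial clusterings agree
    return len(p1 & p2) / len(p1 | p2)    # Jaccard over page pairs

subject_a = {"sports": {"u1", "u2", "u3"}, "news": {"u4", "u5"}}
subject_b = {"c1": {"u1", "u2"}, "c2": {"u3", "u4", "u5"}}
print(clustering_similarity(subject_a, subject_b))   # 0.333...
```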