Papers by Sofus Macskássy
arXiv (Cornell University), Jan 30, 2014
We tackle the problem of inferring node labels in a partially labeled graph where each node in the graph has multiple label types and each label type has a large number of possible labels. Our primary example, and the focus of this paper, is the joint inference of label types such as hometown, current city, and employers, for users connected by a social network. Standard label propagation fails to consider the properties of the label types and the interactions between them. Our proposed method, called EDGEEXPLAIN, explicitly models these, while still enabling scalable inference under a distributed message-passing architecture. On a billion-node subset of the Facebook social network, EDGEEXPLAIN significantly outperforms label propagation for several label types, with lifts of up to 120% for recall@1 and 60% for recall@3.
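
For readers unfamiliar with the baseline, below is a minimal sketch of standard label propagation, the method EDGEEXPLAIN is compared against: unlabeled nodes repeatedly take the majority label of their neighbors. The graph, seed labels, and iteration cap are illustrative, not from the paper.

```python
# Minimal sketch of standard label propagation (the baseline EDGEEXPLAIN is
# compared against). The toy graph and labels are illustrative.
from collections import Counter, defaultdict

def propagate_labels(edges, seed_labels, iterations=10):
    """Unlabeled nodes repeatedly take the majority label of their
    neighbors; seed labels stay fixed throughout."""
    neighbors = defaultdict(set)
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)

    labels = dict(seed_labels)
    for _ in range(iterations):
        for node in list(neighbors):
            if node in seed_labels:
                continue  # observed labels are never overwritten
            votes = Counter(labels[n] for n in neighbors[node] if n in labels)
            if votes:
                labels[node] = votes.most_common(1)[0][0]
    return labels

edges = [("ann", "bob"), ("bob", "cat"), ("cat", "dan"), ("dan", "eve")]
seeds = {"ann": "Palo Alto", "eve": "Seattle"}
print(propagate_labels(edges, seeds))
```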

Monitoring Entities in an Uncertain World: Entity Resolution and Referential Integrity
Proceedings of the AAAI Conference on Artificial Intelligence
This paper describes a system to help intelligence analysts track and analyze information being published in multiple sources, particularly open sources on the Web. The system integrates technology for Web harvesting, natural language extraction, and network analytics, and allows analysts to view and explore the results via a Web application. One of the difficult problems we address is the entity resolution problem, which occurs when there are multiple, differing ways to refer to the same entity. The problem is particularly complex when noisy data is being aggregated over time, there is no clean master list of entities, and the entities under investigation are intentionally being deceptive. Our system must not only perform entity resolution with noisy data, but must also gracefully recover when entity resolution mistakes are subsequently corrected. We present a case study in arms trafficking that illustrates the issues, and describe how they are addressed.
We analyze a Relational Neighbor (RN) classifier, a simple relational predictive model that predicts only based on class labels of related neighbors, using no learning and no inherent attributes. We show that it performs surprisingly well by comparing it to more complex models such as Probabilistic Relational Models and Relational Probability Trees on three data sets from published work. We argue that a simple model such as this should be used as a baseline to assess the performance of relational learners.
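
A minimal sketch of the Relational Neighbor idea, assuming uniform edge weights: score each class by the fraction of a node's labeled neighbors carrying that label, with no learning and no node attributes. The toy graph below is illustrative.

```python
# Sketch of a (weighted-vote) Relational Neighbor prediction: class scores
# come only from the labels of a node's labeled neighbors.
from collections import defaultdict

def rn_scores(node, neighbors, labels):
    """Class scores for `node` from the labels of its labeled neighbors."""
    labeled = [n for n in neighbors[node] if n in labels]
    scores = defaultdict(float)
    for n in labeled:
        scores[labels[n]] += 1.0 / len(labeled)  # uniform edge weights
    return dict(scores)

neighbors = {"x": ["a", "b", "c"]}
labels = {"a": "pos", "b": "pos", "c": "neg"}
print(rn_scores("x", neighbors, labels))  # {'pos': 0.666..., 'neg': 0.333...}
```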

Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval, 2001
In many applications, large volumes of time-sensitive textual information require triage: rapid, approximate prioritization for subsequent action. In this paper, we explore the use of prospective indications of the importance of a time-sensitive document, for the purpose of producing better document filtering or ranking. By prospective, we mean importance that could be assessed by actions that occur in the future. For example, a news story may be assessed (retrospectively) as being important, based on events that occurred after the story appeared, such as a stock price plummeting or the issuance of many follow-up stories. If a system could anticipate (prospectively) such occurrences, it could provide a timely indication of importance. Clearly, perfect prescience is impossible. However, sometimes there is sufficient correlation between the content of an information item and the events that occur subsequently. We describe a process for creating and evaluating approximate information-triage procedures that are based on prospective indications. Unlike many information-retrieval applications for which document labeling is a laborious, manual process, for many prospective criteria it is possible to build very large, labeled training corpora automatically. Such corpora can be used to train text-classification procedures that will predict the (prospective) importance of each document. This paper illustrates the process with two case studies, demonstrating the ability to predict whether a news story will be followed by many very similar news stories, and also whether the stock price of one or more companies associated with a news story will move significantly following the appearance of that story. We conclude by discussing how the comprehensibility of the learned classifiers can be critical to success.
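
A toy sketch of the pipeline the abstract describes, assuming scikit-learn is available: label documents automatically from events that happen after publication (here, a hypothetical follow-up-story count and threshold), then train a text classifier to predict that label from content alone.

```python
# Sketch of prospective labeling + text classification. The documents,
# follow-up counts, and threshold are hypothetical.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

docs = [
    "acme corp misses earnings, shares tumble",
    "local spring fair opens this weekend",
    "bigco recalls flagship product nationwide",
    "community garden wins neighborhood award",
]
followups = [14, 0, 9, 1]            # hypothetical counts of follow-up stories
labels = [n >= 5 for n in followups]  # prospective label, assigned automatically

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(docs, labels)
print(model.predict(["megacorp stock plunges after surprise report"]))
```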

Proceedings of the 22nd international conference on Machine learning - ICML '05, 2005
This paper is about constructing confidence bands around ROC curves. We first introduce to the machine learning community three band-generating methods from the medical field, and evaluate how well they perform. Such confidence bands represent the region where the "true" ROC curve is expected to reside, with the designated confidence level. To assess the containment of the bands, we begin with a synthetic world where we know the true ROC curve: specifically, one where the class-conditional model scores are normally distributed. The only method that attains reasonable containment out-of-the-box produces non-parametric, "fixed-width" bands (FWBs). Next we move to a context more appropriate for machine learning evaluations: bands that, with a certain confidence level, will bound the performance of the model on future data. We introduce a correction to account for the larger uncertainty, and the widened FWBs continue to have reasonable containment. Finally, we assess the bands on 10 relatively large benchmark data sets. We conclude by recommending these FWBs, noting that being non-parametric they are especially attractive for machine learning studies, where the score distributions (1) clearly are not normal, and (2) even for the same data set vary substantially from learning method to learning method.
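
A loose illustration of the fixed-width idea, not the paper's exact procedure: widen a band of uniform (here, vertical) half-width around the empirical ROC curve until it contains the desired fraction of bootstrap ROC curves. The synthetic scores follow the normal class-conditional setup the abstract mentions.

```python
# Simplified fixed-width ROC band: find the smallest uniform half-width whose
# band around the empirical ROC curve contains 95% of bootstrap curves.
import numpy as np

rng = np.random.default_rng(0)

def roc_on_grid(scores, y, grid):
    """TPR of the empirical ROC curve at fixed FPR grid points."""
    pos, neg = scores[y == 1], scores[y == 0]
    thresh = np.quantile(neg, 1 - grid)        # thresholds hitting each FPR
    return np.array([(pos >= t).mean() for t in thresh])

# synthetic world: normally distributed class-conditional scores
y = np.concatenate([np.ones(300), np.zeros(300)]).astype(int)
scores = np.concatenate([rng.normal(1, 1, 300), rng.normal(0, 1, 300)])
grid = np.linspace(0.01, 0.99, 50)
base = roc_on_grid(scores, y, grid)

boot = []
for _ in range(200):                           # bootstrap the evaluation set
    idx = rng.integers(0, len(y), len(y))
    boot.append(roc_on_grid(scores[idx], y[idx], grid))
boot = np.array(boot)

dev = np.abs(boot - base).max(axis=1)          # worst deviation per curve
w = np.quantile(dev, 0.95)                     # 95% containment
print(f"fixed half-width: {w:.3f}")
```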

Receiver Operating Characteristic Analysis (ROC Analysis) is related in a direct and natural way to cost/benefit analysis of diagnostic decision making. Widely used in medicine for many decades, it has been introduced relatively recently in machine learning. In this context, ROC analysis provides tools to select possibly optimal models and to discard suboptimal ones independently from (and prior to specifying) the cost context or the class distribution. Furthermore, the Area Under the ROC Curve (AUC) has been shown to be a better evaluation measure than accuracy in contexts with variable misclassification costs and/or imbalanced datasets. AUC is also the standard measure when using classifiers to rank examples, and hence is used in applications where ranking is crucial, such as campaign design, model combination, collaboration strategies, and co-learning. Nevertheless, there are many open questions and some limitations that hamper a broader use and applicability of ROC analysis. Its use in data mining and machine learning is still below its full potential. An important limitation of ROC analysis, despite some recent progress, is its possible but difficult extension to more than two classes. This workshop follows up a first workshop (ROCAI-2004) held within ECAI-2004 and a second workshop (ROCML-2005) held within ICML-2005. This third workshop is intended to investigate the hot topics identified during the two previous workshops (e.g., multiclass extension, statistical analysis, alternative approaches), on the one hand, and to encourage cross-fertilisation with ROC practitioners in medicine, on the other hand, thanks to an invited medical expert. We would like to thank everyone who contributed to make this workshop possible. First of all, we thank all the authors who submitted papers to ROCML-2006. Each of these was reviewed by two or more members of the Program Committee, who finally accepted nine papers (eight research papers and one research note). In this regard, we are grateful to the Program Committee and the additional reviewers for their excellent job. We wish to express our gratitude to our invited speaker, Dr. Darrin C. Edwards of the Department of Radiology, University of Chicago, who presented the state of the art of ROC analysis in radiology. Moreover, his research group provided a three-class medical dataset to support exchanges between medical experts and participants. Finally, we express our gratitude to the ICML-2006 organization for the facilities provided.
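
The preface's point about AUC can be made concrete: AUC equals the probability that a randomly chosen positive example is scored above a randomly chosen negative one (with ties counting half). A direct, illustrative computation:

```python
# AUC as the pairwise ranking probability (Mann-Whitney formulation).
def auc(scores, labels):
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(auc([0.9, 0.8, 0.4, 0.3], [1, 0, 1, 0]))  # 0.75
```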

Wikis allow collaborators to collect information about entities. In turn, such entity information can be used for AI tasks, such as information extraction. However, these collaborators are almost exclusively human users. Allowing arbitrary software agents to act as collaborators can greatly enrich a wiki, since agents can contribute structured data to complement the human-contributed, unstructured data. For instance, agents can import huge volumes of structured data about entities, enriching the pages, and agents can update wiki pages to reflect real-time information changes (e.g., win-loss records in sports). This paper describes an approach that allows both arbitrary software agents and human users to collaborate. In particular, we address three key problems: agents updating the correct wiki pages, policies for agent updates, and sharing the schema across collaborators. Using our approach, we describe creating entity-focused wikis, which include the ability to create dynamic categories of entities based on their wiki pages. These categories dynamically update their membership based upon real-world changes.

This paper presents our investigation into graph mining methods to help users understand large graphs. Our approach is a two-step process: first calculate subgraph labels, then calculate distribution statistics on these labels. Our approach is flexible in that it can identify a range of patterns from very abstract to very specific (e.g., isomorphisms). The statistics that we calculate can be used to find rare and common patterns, patterns that are (dis)similar to the distribution of induced subgraphs of the same size, patterns that are (dis)similar to each other, as well as the variance of graph patterns given a specific set of input node types. We also investigate a method to understand structural characteristics by analyzing clusters that are created by "collapsing" overlapping instances of user-specified patterns. We evaluated our approach on two publicly available networks: the Texas CS website from WebKB and the Internet Movie Database.
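
A sketch of the two-step recipe with one concrete (assumed) choice of subgraph label, a Weisfeiler-Lehman graph hash via networkx; the paper's own labeling scheme may differ, and the example graph is illustrative.

```python
# Step 1: label each small connected induced subgraph (here via a WL hash);
# step 2: look at the distribution of those labels.
from collections import Counter
from itertools import combinations
import networkx as nx

def subgraph_label_distribution(G, k=3):
    counts = Counter()
    for nodes in combinations(G.nodes, k):
        sub = G.subgraph(nodes)
        if nx.is_connected(sub):
            counts[nx.weisfeiler_lehman_graph_hash(sub)] += 1
    return counts

G = nx.karate_club_graph()
for label, n in subgraph_label_distribution(G, k=3).most_common():
    print(label[:8], n)  # e.g., triangles vs. open 2-paths
```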

Real-world data is virtually never noise-free. Current methods for handling noise do so either by removing noisy instances or by trying to clean noisy attributes. Neither of these deals directly with the issue of noise, and in fact removing a noisy instance is not a viable option in many real systems. In this paper, we consider the problem of noise in the context of record linkage, a frequent problem in text mining. We present a new method for dealing with data sources that have noisy attributes which reflect the pedigree of that source. Our method, which assumes that training data is clean and that noise is only present in the test set, is an extension of decision trees that directly handles noise at classification time by changing how it walks through the tree at the various nodes, similar to how current trees handle missing values. We test the efficacy of our method on the IMDb movie database, where we classify whether pairs of records refer to the same person. Our results clearly show that we dramatically improve performance by handling pedigree directly at classification time.
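
A sketch of the classification-time idea under stated assumptions: when an attribute tested at a node is flagged as noisy for the instance's source, walk both branches and weight their predictions by the fraction of training data that went each way, analogous to how C4.5 handles missing values. The tree structure and data are hypothetical, not the paper's.

```python
# Noise-aware decision-tree walk: fractionally split on untrusted attributes.
class Node:
    def __init__(self, attr=None, left=None, right=None,
                 p_left=0.5, label=None):
        self.attr, self.left, self.right = attr, left, right
        self.p_left = p_left   # fraction of training data that went left
        self.label = label     # P(class=1), set on leaves only

def classify(node, x, noisy_attrs):
    """P(same entity) for instance x, hedging over noisy attributes."""
    if node.label is not None:
        return node.label
    if node.attr in noisy_attrs:   # don't trust this attribute: take both paths
        return (node.p_left * classify(node.left, x, noisy_attrs) +
                (1 - node.p_left) * classify(node.right, x, noisy_attrs))
    branch = node.left if x[node.attr] else node.right
    return classify(branch, x, noisy_attrs)

# hypothetical same-person tree: test name match, then birth-year match
tree = Node("name_match", p_left=0.3,
            left=Node("year_match", p_left=0.8,
                      left=Node(label=0.95), right=Node(label=0.40)),
            right=Node(label=0.05))
x = {"name_match": True, "year_match": True}
print(classify(tree, x, noisy_attrs={"year_match"}))  # blends both branches: 0.84
```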

Intelligent Information Filtering is the process of receiving or monitoring large amounts of dynamically generated information and extracting the subset of information that would be of interest to a user based on some specified information need. Historically, this need has been based on user profiles that are directly evaluable: the information can be immediately classified as interesting or not. In this thesis I introduce a new type of user interestingness criterion which is prospective: the criterion defines the interestingness of an information item based on events that happen subsequent to the information item appearing. Hence, the interestingness cannot be directly evaluated. A new technique is described which takes such a criterion and operationalizes it, using machine learning to generate a predictive model that can directly evaluate a piece of information. I show that this technique works statistically significantly better than … Arunava Banerjee helped by implementing the entropy-based split-point algorithm as well as the C4.5 split-point extraction utility, which were used in the second part of this dissertation. Chapters 3-5 were published in SIGIR-2001 [MHP+01]. Portions of Chapters 6-8 were published as a conference paper at IJCAI-2001 [MHBD01], an extended version of which will appear in an upcoming publication of the Artificial Intelligence Journal [MHBD]. Initial results in the wireless email domain were published at IJCAI-1999 [MDH99].

This paper introduces Information Valets ("iValets"), a general framework for intelligent access to information. Our goal is to support access to a range of information sources from a range of client devices with some degree of uniformity. Further, the information server is aware of its user and the user's devices, learning from the user's past interactions where and how to send new incoming information, using whatever information is available for the given task. Our metaphor is that of a valet that sits between a user's client devices and the information services that the user may want to access. The paper presents the general structure of an iValet, cataloging the main design decisions that must go into the design of any such system. To demonstrate this, the paper instantiates the abstract iValet model with a fielded prototype system, the EmailValet, which learns users' email-reading preferences on wireless platforms.

This paper presents EmailValet, a system that learns users' email-reading preferences on email-capable wireless platforms, specifically two-way pagers with small "qwerty" keyboards and an 8-line, 30-character display. In use by the authors for about three months, it has gathered data on email-reading preferences over more than 8900 email messages received by the authors during this period. The paper presents results comparing the ability of different learning methods to form models that can predict whether a given message should be forwarded to the user's wireless device. Our results show that the best of a range of established learning methods developed in the information retrieval and machine learning communities was able to achieve a break-even point of over 53% for one user who had received over 5000 messages. We also find that, in general, all methods achieve better performance than the baseline of simply forwarding all messages to the wireless device, and that many methods find procedures that, although they forward only a small fraction of the messages that a user would want, achieve 100% precision on those messages they do choose to forward. We also present more detailed analyses of the various methods, including how different ways of encoding the email for the learner affect learning performance. We conclude the paper with a discussion of future work and final remarks.
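
For reference, the headline number is a precision/recall break-even point. A minimal sketch of computing it from ranked scores (data illustrative):

```python
# Break-even point: precision at the ranking depth where precision == recall,
# which is exactly depth k = number of true positives.
def break_even(scores, labels):
    ranked = sorted(zip(scores, labels), reverse=True)
    k = sum(labels)                          # at depth k, precision == recall
    return sum(y for _, y in ranked[:k]) / k

print(break_even([0.9, 0.7, 0.6, 0.2], [1, 0, 1, 0]))  # 0.5
```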
This paper describes NetKit-SRL, or NetKit for short, a toolkit for learning from and classifying networked data. The toolkit is open-source and publicly available. It is modular and built for ease of plug-and-play, such that it is easy to add new modules and have them interact with other existing modules. Currently available NetKit modules are focused on "batch" within-network learning and classification: given a partially labeled network, where all nodes and edges are already known to exist, estimate the class-membership probability of the unlabeled nodes in the network. NetKit has been used in various network domains such as websites, citation graphs, movies, and social networks.
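
The sketch below is not NetKit's actual API (the toolkit has its own codebase and configuration); it only illustrates the batch within-network task it targets, using relaxation labeling with a weighted-vote relational neighbor model on a toy graph.

```python
# Batch within-network classification: iteratively average neighbors'
# class-membership estimates, holding seed labels fixed.
from collections import defaultdict

def within_network_classify(edges, seeds, classes, iters=20):
    nbrs = defaultdict(set)
    for u, v in edges:
        nbrs[u].add(v)
        nbrs[v].add(u)
    # unlabeled nodes start at a uniform distribution; seeds are clamped
    p = {n: {c: 1 / len(classes) for c in classes} for n in nbrs}
    for n, c in seeds.items():
        p[n] = {k: float(k == c) for k in classes}
    for _ in range(iters):
        for n in nbrs:
            if n in seeds:
                continue
            p[n] = {c: sum(p[m][c] for m in nbrs[n]) / len(nbrs[n])
                    for c in classes}
    return p

edges = [("a", "b"), ("b", "c"), ("c", "d")]
print(within_network_classify(edges, {"a": "x", "d": "y"}, ["x", "y"]))
```
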
Knowledge Discovery and Data Mining, Jul 24, 2010

SIGKDD explorations, Mar 31, 2011
The Eighth Workshop on Mining and Learning with Graphs (MLG) was held at KDD 2010 in Washington, DC. It brought together a variety of researchers interested in analyzing data that is best represented as a graph. Examples include the WWW, social networks, biological networks, communication networks, and many others. The importance of being able to effectively mine and learn from such data is growing, as more and more structured and semi-structured data is becoming available. This is a problem across widely different fields such as economics, statistics, social science, physics, and computer science, and is studied within a variety of sub-disciplines of machine learning and data mining, including graph mining, graphical models, kernel theory, statistical relational learning, etc. The objective of this workshop was to bring together practitioners from these various fields and areas to foster a rich discussion of which problems we work on, how we frame them in the context of graphs, which tools and algorithms we apply, and our general findings and lessons learned. This year's workshop was very successful, with well over 100 attendees and excellent keynote speakers and papers. This is a rapidly growing area and we believe that this community is only in its infancy. We hope that the readers will join us next year for MLG 2011!

Companion of the The Web Conference 2018 on The Web Conference 2018 - WWW '18
We consider the problem of inferring node labels in a partially labeled graph where each node in the graph has multiple label types and each label type has a large number of possible labels. Our primary example, and the focus of this paper, is the joint inference of label types such as hometown, current city, and employers for people connected by a social network; by predicting these user profile fields, the network can provide a better experience to its users. Existing approaches such as Label Propagation (Zhu et al., 2003) fail to consider interactions between the label types. Our proposed method, called EdgeExplain, explicitly models these interactions, while still allowing scalable inference under a distributed message-passing architecture. On a large subset of the Facebook social network, collected in a previous study (Chakrabarti et al., 2014), EdgeExplain outperforms label propagation for several label types, with lifts of up to 120% for recall@1 and 60% for recall@3.

Proceedings of the International AAAI Conference on Web and Social Media
Twitter and other microblogs have rapidly become a significant means by which people communicate with the world and each other in near real time. There have been a large number of studies of these social media, focusing on areas such as information spread, various centrality measures, topic detection, and more. However, one area which has not received much attention is understanding what information is being spread and why it is being spread. This work seeks a better understanding of what makes people spread information in tweets or microblogs through the use of retweeting. Several retweet behavior models are presented and evaluated on a Twitter data set consisting of over 768,000 tweets gathered by monitoring over 30,000 users for a period of one month. We evaluate the proposed models against each user and show how people use different retweet behavior models. For example, we find that although users in the majority of cases do not retweet information …