Papers by Kostyantyn Shchekotykhin

Web Mining Systems exploit the redundancy of data published on the Web to automatically extract information from existing web documents. The first step in the Information Extraction process is thus to locate as many web pages as possible that contain relevant information within a limited period of time, a task which is commonly accomplished by applying focused crawling techniques. The performance of such a crawler can be measured by its "recall", i.e. the percentage of documents found and identified as relevant compared to the total number of existing documents. A higher recall value implies that more redundant data is available, which in turn leads to better results in the subsequent fact extraction phase. In this paper, we propose xCrawl, a new focused crawling method which outperforms state-of-the-art approaches with respect to the recall values achievable within a given period of time. This method is based on a new combination of ideas and techniques used to identify and exploit the navigational structures of websites, such as hierarchies, lists or maps. In addition, automatic query generation is applied to rapidly collect web sources containing target documents. The proposed crawling technique was inspired by the requirements of a Web Mining System developed to extract product and service descriptions and was evaluated in different application scenarios. Comparisons with existing focused crawling techniques reveal that the new crawling method leads to a significant increase in recall whilst maintaining precision.

Knowledge and Information Systems, 2010
Web mining systems exploit the redundancy of data published on the Web to automatically extract information from existing Web documents. The first step in the Information Extraction process is thus to locate as many Web pages as possible that contain relevant information within a limited period of time, a task which is commonly accomplished by applying focused crawling techniques. The performance of such a crawler can be measured by its "recall", i.e., the percentage of documents found and identified as relevant compared to the total number of existing documents. A higher recall value implies that more redundant data are available, which in turn leads to better results in the subsequent fact extraction phase of the Web mining process. In this paper, we propose xCrawl, a new focused crawling method which outperforms state-of-the-art approaches with respect to the recall values achievable within a given period of time. This method is based on a new combination of ideas and techniques used to identify and exploit the navigational structures of Web sites, such as hierarchies, lists, or maps. In addition, automatic query generation is applied to rapidly collect Web sources containing target documents. The proposed crawling technique was inspired by the requirements of a Web mining system developed to extract product and service descriptions given in tabular form and was evaluated in different application scenarios. Comparisons with existing focused crawling techniques reveal that the new crawling method leads to a significant increase in recall while maintaining precision.
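As a rough illustration of the recall and precision measures used above, the following Python sketch compares the set of documents a crawler retrieved with a reference set of existing relevant documents; the document identifiers and numbers are purely hypothetical.

```python
def recall_precision(retrieved, relevant):
    """Recall: share of all existing relevant documents the crawler found.
    Precision: share of retrieved documents that are actually relevant."""
    found = retrieved & relevant
    recall = len(found) / len(relevant) if relevant else 0.0
    precision = len(found) / len(retrieved) if retrieved else 0.0
    return recall, precision

# Hypothetical crawl: 6 of 10 existing relevant pages found within the time
# budget, plus one irrelevant page.
retrieved = {"d1", "d2", "d3", "d4", "d5", "d6", "d7"}
relevant = {"d1", "d2", "d3", "d4", "d5", "d6", "r7", "r8", "r9", "r10"}
print(recall_precision(retrieved, relevant))  # (0.6, 0.857...)
```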
Journal of Web Semantics, 2009
The process of populating an ontology-based system with high-quality and up-to-date instance information can be both time-consuming and prone to error. In many domains, however, one possible solution to this problem is to automate the instantiation process for a given ontology by searching (mining) the web for the required instance information.

The process of manually instantiating an ontology with high-quality and up-to-date instance information is both time-consuming and prone to error. Automatic ontology instantiation from Web sources is one possible solution to this problem and aims at the computer-supported population of an ontology through the exploitation of (redundant) information available on the Web. In this paper we present ALLRIGHT, a comprehensive ontology instantiation system. In particular, the techniques implemented in ALLRIGHT are designed for application scenarios in which the desired instance information is given in the form of tables and for which existing Information Extraction (IE) approaches based on statistical or natural language processing methods are not directly applicable. Within ALLRIGHT, we have therefore developed new techniques for dealing with tabular instance data and combined these techniques with existing methods. The system supports all necessary steps for ontology instantiation, i.e. web crawling, name extraction, document clustering as well as fact extraction and validation. ALLRIGHT has been successfully evaluated in the popular domains of digital cameras and notebooks, achieving an accuracy of about eighty percent for the extracted facts given only a very limited amount of seed knowledge.
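The fact validation step mentioned above relies on the redundancy of Web data. The sketch below shows one way such redundancy-based validation could look; the voting rule, the support threshold and the attribute values are assumptions for illustration, not the actual ALLRIGHT procedure.

```python
from collections import Counter

def validate_by_redundancy(extracted_values, min_support=2):
    """For each attribute, keep the value reported most often across redundant
    sources, provided it is supported by at least `min_support` documents.
    A hypothetical voting rule used for illustration only."""
    validated = {}
    for attribute, values in extracted_values.items():
        value, support = Counter(values).most_common(1)[0]
        if support >= min_support:
            validated[attribute] = value
    return validated

# Values for one (hypothetical) camera instance, extracted from several pages.
observations = {
    "resolution": ["10.1 MP", "10.1 MP", "10 MP"],
    "weight": ["165 g"],  # reported by a single source only, hence not validated
}
print(validate_by_redundancy(observations))  # {'resolution': '10.1 MP'}
```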
Debugging is an important prerequisite for the widespread application of ontologies, especially in areas that rely upon everyday users to create and maintain knowledge bases, such as the Semantic Web. Most recent approaches use diagnosis methods to identify sources of inconsistency. However, in most debugging cases these methods return many alternative diagnoses, thus placing the burden of fault localization on the user. This paper demonstrates how the target diagnosis can be identified by performing a sequence of observations, that is, by querying an oracle about entailments of the target ontology. We exploit probabilities of typical user errors to formulate information-theoretic concepts for query selection. Our evaluation showed that the suggested method reduces the number of required observations compared to myopic strategies.
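One simple way to interpret such a sequence of oracle answers is to discard the diagnoses that contradict an answer and renormalize the fault probabilities of the remaining candidates. The sketch below illustrates this idea with hypothetical diagnoses and prior probabilities; it is a simplification of the query-based method described above.

```python
def update_diagnosis_probabilities(diagnoses, answer):
    """Discard diagnoses that contradict the oracle's answer about an entailment
    and renormalize the probabilities of the remaining candidates.
    `diagnoses` maps a diagnosis id to (prior_probability, predicts_entailment).
    A simplified illustration; the full method also handles queries whose
    outcome is not determined by every diagnosis."""
    consistent = {d: p for d, (p, predicts) in diagnoses.items() if predicts == answer}
    total = sum(consistent.values())
    return {d: p / total for d, p in consistent.items()}

# Hypothetical candidates: prior fault probability and whether each predicts
# that the queried axiom is entailed by the intended (target) ontology.
candidates = {"D1": (0.5, True), "D2": (0.3, False), "D3": (0.2, True)}
print(update_diagnosis_probabilities(candidates, answer=True))
# {'D1': 0.714..., 'D3': 0.285...}
```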

Applied Intelligence, 2009
Customers interacting with online selling platforms require the assistance of sales support systems in the product and service selection process. Knowledge-based recommenders are specific sales support systems which involve online customers in dialogs with the goal of supporting preference-forming processes. These systems have been successfully deployed in commercial environments supporting the recommendation of, e.g., financial services, e-tourism services, or consumer goods. However, the development of user interface descriptions and knowledge bases underlying knowledge-based recommenders is often an error-prone and frustrating task. In this paper we focus on the first aspect and present an approach which supports knowledge engineers in the identification of faults in user interface descriptions. These descriptions are the input for a model-based diagnosis algorithm which automatically identifies faulty elements and indicates those elements to the knowledge engineer. In addition, we present results of an empirical study which demonstrates the applicability of our approach.

One of the common approaches to extracting high-quality knowledge from Web sources is to exploit the redundancy of the published information. A Web Mining System therefore not only has to search for relevant Web pages but also has to determine whether two pages describe the same entity, in order to extract as much knowledge as possible about it. It has been shown that statistical clustering techniques are in general a suitable means to achieve this task by grouping documents that are supposed to contain similar information. However, when data is given in tabular form, which is for instance a typical way of describing items in online shops, existing document clustering algorithms show limited performance, as documents containing tabular descriptions typically share a largely common set of tokens even though they describe different entities. In this paper we therefore propose a new document clustering approach that exploits hyperlinks and document metadata to extract candidates for entity names. These candidate names are subsequently used to cluster the documents and to further improve the names themselves, which are finally used to determine whether two documents describe the same entity. A detailed evaluation of our approach in two popular example domains showed its high accuracy in terms of precision and recall (F-measure > 0.9).
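A minimal sketch of the underlying idea is to group documents whose extracted candidate entity names normalize to the same key; the normalization rule, document identifiers and product names below are illustrative assumptions, not the published algorithm.

```python
import re
from collections import defaultdict

def normalize(name):
    """Crude normalization of a candidate entity name (lower-case, strip
    punctuation and surrounding whitespace); the real system refines the
    candidate names iteratively."""
    return re.sub(r"[^a-z0-9 ]", "", name.lower()).strip()

def cluster_by_candidate_names(documents):
    """Group documents whose candidate names normalize to the same key.
    `documents` maps a document id to the candidate name taken from hyperlink
    anchor text or page metadata (hypothetical inputs)."""
    clusters = defaultdict(list)
    for doc_id, candidate_name in documents.items():
        clusters[normalize(candidate_name)].append(doc_id)
    return dict(clusters)

docs = {
    "shopA/p1": "Canon EOS 400D",
    "shopB/item7": "canon eos 400d!",
    "shopC/p9": "Nikon D80",
}
print(cluster_by_candidate_names(docs))
# {'canon eos 400d': ['shopA/p1', 'shopB/item7'], 'nikon d80': ['shopC/p9']}
```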

Argumentation based constraint acquisition
Efficient acquisition of constraint networks is a key factor for the applicability of constraint problem solving methods. Current techniques learn constraint networks from sets of training examples, where each example is classified as either a solution or a non-solution of a target network. However, in addition to this classification, an expert can usually provide arguments as to why examples should be rejected or accepted. Generally speaking, domain specialists have partial knowledge about the theory to be acquired which can be exploited for knowledge acquisition. Based on this observation, we discuss the various types of arguments an expert can formulate and develop a knowledge acquisition algorithm for processing these argument types, which gives the expert the possibility to provide arguments in addition to the learning examples. The result of this approach is a significant reduction in the number of examples which must be provided to the learner in order to learn the target constraint network.
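The following toy sketch illustrates, under strong simplifying assumptions, how an argument naming the culprit variables of a rejected example can prune the space of candidate constraints far more than the bare classification alone; the constraint bias, the relations and the argument encoding are hypothetical and not the paper's formalism.

```python
from itertools import combinations

# Hypothetical candidate bias: binary constraints over three variables.
VARS = ["x", "y", "z"]
RELATIONS = {
    "eq": lambda a, b: a == b,
    "neq": lambda a, b: a != b,
    "lt": lambda a, b: a < b,
}
candidates = {(u, v, r) for u, v in combinations(VARS, 2) for r in RELATIONS}

def violated(candidate, example):
    u, v, r = candidate
    return not RELATIONS[r](example[u], example[v])

def process_positive(candidates, example):
    """A solution of the target network cannot violate any target constraint,
    so every candidate it violates is removed from the bias."""
    return {c for c in candidates if not violated(c, example)}

def process_argument(candidates, example, culprit_scope):
    """Expert argument: `example` is a non-solution *because of* the variables
    in `culprit_scope`. Only candidates over that scope which the example
    violates remain as possible explanations for the rejection (a simplified
    reading of the argument types discussed above)."""
    scope = set(culprit_scope)
    return {c for c in candidates if {c[0], c[1]} == scope and violated(c, example)}

positive = {"x": 1, "y": 2, "z": 3}
negative = {"x": 2, "y": 2, "z": 1}
bias = process_positive(candidates, positive)
explanations = process_argument(bias, negative, culprit_scope=("x", "y"))
print(len(candidates), len(bias), len(explanations))  # 9 6 2
```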

Nowadays, many users rely on web search engines to find and gather information, and they face an ever-increasing number of heterogeneous web page information sources. The issue of correlating, integrating and presenting related information to users therefore becomes important. When a user queries a search engine such as Yahoo or Google for specific information, the results contain not only pages providing the desired information but also pages that merely mention it. Extracting information from web pages thus becomes very important, and the massive and growing amount of diverse web page sources available on the Internet, together with the variety of page layouts, makes information extraction from the Web a challenging problem. This paper proposes an approach for extracting information from web tables based on standard classifications. The proposed approach consists of four main phases, namely: (i) pre-processing, (ii) extraction, (iii) classification, and (iv) simplification. The proposed approach is evaluated by conducting experiments on a number of web pages from the Nokia products domain, since to the best of our knowledge this is the only product domain that has complete and complex standard classifiers.
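A toy sketch of the four phases on a single, hypothetical product table follows; the regular expressions, attribute names and standard classification are illustrative only and not the paper's implementation.

```python
import re

# A toy HTML fragment from a hypothetical product specification page.
HTML = "<tr><td> Display </td><td>2.4&nbsp;inch</td></tr><tr><td>Weight</td><td>111 g</td></tr>"

def preprocess(html):
    """Phase (i): normalize the markup (resolve entities, collapse whitespace)."""
    return re.sub(r"\s+", " ", html.replace("&nbsp;", " "))

def extract(html):
    """Phase (ii): pull attribute/value pairs out of the table rows."""
    rows = re.findall(r"<tr><td>(.*?)</td><td>(.*?)</td></tr>", html)
    return [(a.strip(), v.strip()) for a, v in rows]

def classify(pairs, standard_attributes):
    """Phase (iii): map extracted attribute labels onto a standard classification."""
    return {standard_attributes[a.lower()]: v for a, v in pairs if a.lower() in standard_attributes}

def simplify(record):
    """Phase (iv): reduce values to a canonical form (here: drop the unit)."""
    return {k: v.split()[0] for k, v in record.items()}

STANDARD = {"display": "display_size", "weight": "weight"}
print(simplify(classify(extract(preprocess(HTML)), STANDARD)))
# {'display_size': '2.4', 'weight': '111'}
```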
The effective debugging of ontologies is an important prerequisite for their successful application and impact on the Semantic Web. The heart of this debugging process is the diagnosis of faulty knowledge bases. In this paper we define general concepts for the diagnosis of ontologies. Based on these concepts, we provide correct and complete algorithms for the computation of minimal diagnoses of knowledge bases. These concepts and algorithms are broadly applicable since they are independent of a particular variant of an underlying logic (with monotonic semantics) and independent of a particular reasoning system. The practical feasibility of our method is shown by extensive test evaluations.
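Minimal diagnoses can be characterized as minimal hitting sets of the conflict sets of a faulty knowledge base. The brute-force sketch below illustrates this characterization on a hypothetical example with the conflicts already given; the paper's algorithms are considerably more refined.

```python
from itertools import chain, combinations

def minimal_diagnoses(conflicts):
    """Enumerate minimal hitting sets of the given conflict sets: a diagnosis
    must contain at least one axiom of every conflict, and no proper subset may
    do so as well. Brute force over subsets of the involved axioms, so only
    suitable for small examples."""
    axioms = sorted(set(chain.from_iterable(conflicts)))
    hitting_sets = []
    for size in range(len(axioms) + 1):
        for subset in combinations(axioms, size):
            s = set(subset)
            if all(s & c for c in conflicts) and not any(h <= s for h in hitting_sets):
                hitting_sets.append(s)
    return hitting_sets

# Hypothetical conflict sets over axioms ax1..ax4 of a faulty ontology.
conflicts = [{"ax1", "ax2"}, {"ax2", "ax3"}, {"ax1", "ax4"}]
print(minimal_diagnoses(conflicts))
# [{'ax1', 'ax2'}, {'ax1', 'ax3'}, {'ax2', 'ax4'}]
```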
NameIt: Extraction of product names

The complexity of product assortments offered by e-Commerce platforms requires intelligent sales assistance systems that ease the retrieval of solutions fitting the wishes and needs of a customer. Knowledge-based recommender applications meet these requirements by allowing the calculation of personalized solutions based on an explicit representation of product, marketing and sales knowledge stored in an underlying recommender knowledge base. Unfortunately, in many cases faulty models of recommender user interfaces are defined by knowledge engineers, and no automated support for debugging such process designs is available. This paper presents an approach to the automated debugging of faulty process designs of knowledge-based recommenders which increases the productivity of user interface development and maintenance. The approach has been implemented for a knowledge-based recommender environment within the scope of the Koba4MS project.
Diagnosis discrimination for ontology debugging

Ontology debugging is an important stage of the ontology life-cycle and supports a knowledge engineer during the ontology development and maintenance processes. Model-based diagnosis is the basis of many recently suggested ontology debugging methods. The main difference between the proposed approaches is the method of computing the required conflict sets, i.e., sets of axioms such that at least one axiom of each set must be changed (or removed) to make the ontology coherent. Conflict set computation is, however, the most time-consuming part of the debugging process. Consequently, the choice of an efficient conflict set computation method is crucial for ensuring the practical applicability of an ontology debugging approach.
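As a sketch of the simplest possible conflict set computation, the following deletion-based scheme shrinks an inconsistent axiom set to a minimal conflict by repeatedly calling a reasoner; the reasoner is mocked here, and divide-and-conquer methods such as QuickXPlain achieve the same result with far fewer consistency checks.

```python
def minimal_conflict(axioms, is_consistent):
    """Shrink an inconsistent set of axioms to a minimal conflict by trying to
    drop one axiom at a time. `is_consistent` stands for a call to a reasoner
    and is assumed here for illustration."""
    assert not is_consistent(axioms)
    conflict = list(axioms)
    for axiom in list(conflict):
        remaining = [a for a in conflict if a != axiom]
        if not is_consistent(remaining):   # still inconsistent without `axiom`
            conflict = remaining           # so `axiom` is not part of the conflict
    return set(conflict)

def toy_reasoner(axioms):
    # Mock reasoner: the set is "inconsistent" iff it contains both ax2 and ax3.
    return not {"ax2", "ax3"} <= set(axioms)

print(minimal_conflict(["ax1", "ax2", "ax3", "ax4"], toy_reasoner))  # {'ax2', 'ax3'}
```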

Effective debugging of ontologies is an important prerequisite for their broad application, especially in areas that rely on everyday users to create and maintain knowledge bases, such as the Semantic Web. In such systems ontologies capture formalized vocabularies of terms shared by their users. However, in many cases users have different local views of the domain, i.e. of the context in which a given term is used. Inappropriate usage of terms, together with the natural complications of formulating and understanding logical descriptions, may result in faulty ontologies. Recent ontology debugging approaches use diagnosis methods to identify the causes of the faults. In most debugging scenarios these methods return many alternative diagnoses, thus placing the burden of fault localization on the user. This paper demonstrates how the target diagnosis can be identified by performing a sequence of observations, that is, by querying an oracle about entailments of the target ontology. To identify the best query we propose two query selection strategies: a simple "split-in-half" strategy and an entropy-based strategy. The latter allows knowledge about typical user errors to be exploited to minimize the number of queries. Our evaluation showed that the entropy-based method significantly reduces the number of required queries compared to the "split-in-half" approach. We experimented with different probability distributions of user errors and different qualities of the a priori probabilities. Our measurements demonstrated the superiority of entropy-based query selection even in cases where all fault probabilities are equal, i.e. where no information about typical user errors is available.
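The sketch below contrasts the two strategies on hypothetical diagnoses. Each candidate query partitions the current diagnoses into those predicting a "yes" answer and those predicting "no": split-in-half only balances the sizes of the two groups, while the entropy-based score uses the diagnoses' fault probabilities (shown here in a simplified form of the information-gain criterion described above).

```python
import math

def split_in_half_score(query_partition):
    """Lower is better: prefer queries that split the diagnoses into two groups
    of (nearly) equal size, ignoring probabilities."""
    yes, no = query_partition
    return abs(len(yes) - len(no))

def entropy_score(query_partition, prob):
    """Higher is better: entropy of the predicted answer, computed from the
    fault probabilities of the diagnoses (a simplified criterion)."""
    yes, _no = query_partition
    p_yes = sum(prob[d] for d in yes)
    entropy = 0.0
    for p in (p_yes, 1.0 - p_yes):
        if p > 0:
            entropy -= p * math.log2(p)
    return entropy

# Hypothetical diagnoses with fault probabilities and two candidate queries,
# each given as (diagnoses predicting "yes", diagnoses predicting "no").
prob = {"D1": 0.6, "D2": 0.25, "D3": 0.1, "D4": 0.05}
q1 = ({"D1", "D2"}, {"D3", "D4"})   # balanced in size, unbalanced in probability
q2 = ({"D1"}, {"D2", "D3", "D4"})   # unbalanced in size, balanced in probability

print(split_in_half_score(q1), split_in_half_score(q2))        # 0 2  -> split-in-half picks q1
print(round(entropy_score(q1, prob), 3), round(entropy_score(q2, prob), 3))
# 0.61 0.971 -> the entropy-based strategy picks q2
```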