Mining the Web: Searching and Integration
Rahul Sharma1, Poonam Kataria2
1 Department of Computer Science and Engineering, Swami Paramanand Engineering College, Lalru, Punjab, India
jonty9@[Link]
2 Department of Information Technology, Swami Paramanand Engineering College, Lalru, Punjab, India
pooni.6711@[Link]

Abstract: As web technology grows, the world is shrinking into a connected whole, and there is a pressing need for the World Wide Web to provide mechanisms through which individuals can easily retrieve data, reports and knowledge. One way of meeting this need is to develop web-based information integration systems or agents, which take a user's query or request (e.g., monitor a site) and access the relevant sources or services to efficiently support it. In this paper we review web mining and its categories, as well as various techniques for extracting and integrating information from the Web. Web mining is categorised by the sources of data available: data is first extracted, and the resulting information is then integrated to give a unified view of the knowledge gained from the Web.
Keywords: Web Scraping, Information Extraction, Information Retrieval, Information Integration
I. INTRODUCTION
The explosive growth and popularity of the World Wide Web has resulted in a huge number of information sources on the Internet and promises unprecedented information-gathering capabilities even to lay users. The Web is also unique in that it contributes greatly to the creation of an ever-increasing global information database; its importance cannot be over-emphasized. It holds a huge and still-growing amount of information on almost every topic, and it changes continuously. There is no single editorial control, so quality varies significantly, there is much duplication, and data formats vary widely [1]. In recent years the growth of the World Wide Web has exceeded all expectations. Today there are several billion HTML documents, pictures and other multimedia files available via the Internet, and the number is still rising. Given the impressive variety of the Web, however, retrieving interesting content has become a very difficult task [7].

The goal of web search and mining is to define the next-generation Web by leveraging data mining, machine learning, knowledge discovery, and media analysis techniques for information analysis, organization, retrieval, and visualization. The Web will be an organic combination of the traditional Web (W), social networks (S) and mobile/sensor networks (M). Efforts must therefore focus on developing novel systems that enable innovative Web services in this area. The main research directions are:

i) Exploring new search paradigms: In the past decade, search engines have been developed with the goal of better organizing Web information. Search has a long document-centric tradition, where searching for information is equivalent to searching for documents.

ii) Interactive knowledge mining and crowdsourcing: New paradigms must be explored to enable web-scale entity search and knowledge mining, extracting and integrating web information about various types of real-world entities. These entities are ranked in terms of their relevance and popularity in answering user queries. An interactive knowledge mining platform lets users effectively interact with, and contribute to, automated entity extraction and disambiguation systems.

iii) Machine learning for web search: Web search can be viewed as an intelligent system built from huge amounts of content data and behaviour data using machine learning techniques. All the major tasks in web search, including crawling, indexing, query understanding, document understanding, query-document matching, ranking, and search result presentation, need to make intelligent decisions, and the most effective approach to these tasks is data-driven machine learning. The aim is to develop fundamental and advanced machine learning technologies that improve every aspect of a web search system.

iv) Managing data from the physical world: By accumulating and aggregating physical-world information from multiple users and multiple mobile devices over a long period, collective social intelligence can be derived. Technologies exist to manage physical-world information and build intelligence from it; data from people, services and sensors can be linked together under a unified knowledge model, and the resulting intelligence provided as a service in the cloud.

v) Multimedia search and visual information mining: The focus here is on pattern analysis and extraction for content understanding and data mining of multimedia data, addressing research problems in search-based image annotation, large-scale visual indexing and recognition, sketch-based image search, and object recognition with 3D structures.
II. AN OVERVIEW OF WEB MINING
Web mining, or web scraping, is the application of data mining techniques to discover patterns from the Web [8]. It is also the process of extracting structured information from unstructured or semi-structured web data sources. It builds not only on existing data and text mining techniques but also adds many new tasks and algorithms. Broadly, we classify web mining into three categories based on the sources of data, i.e., web structure mining, web content mining, and web usage mining.

Web structure mining: the process of using graph theory to analyse the node and connection structure of a website.
Web content mining: the mining, extraction and integration of useful data, information and knowledge from web page contents.
Web usage mining: the process of extracting useful information from server logs.

Figure 1: Web Mining Overview

III. INFORMATION EXTRACTION
Text mining is data mining that uses text documents as data. Most mining tasks use Information Retrieval methods to pre-process text documents. Information Retrieval (IR) is the study of finding information that matches a user's information needs, which are expressed as a query. Technically, IR studies the acquisition, organization, storage, retrieval and distribution of information [2].

Figure 2: IR Architecture

A. Information Retrieval
An IR model governs how a document and a query are represented and how the relevance of a document to a user query is defined. The main models are:
i) Boolean model
ii) Vector space model
iii) Statistical language model

1) Boolean Model: Each document or query is treated as a bag of words or terms. Given a collection of documents D, let V = {t1, t2, ..., t|V|} be the set of distinctive words/terms in the collection; V is called the vocabulary. A weight wij > 0 is associated with each term ti of a document dj in D; for a term that does not appear in document dj, wij = 0, so dj = (w1j, w2j, ..., w|V|j). Query terms are combined logically using the Boolean operators AND, OR, and NOT, e.g., ((data AND mining) AND (NOT text)). Retrieval: given a Boolean query, the system retrieves every document that makes the query logically true; this is called exact match. A minimal sketch of Boolean retrieval is given below.
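As an illustration, here is a minimal Python sketch of exact-match Boolean retrieval over a toy collection; the documents and the query are invented examples, not from the paper.

```python
# Boolean (exact-match) retrieval over a toy document collection.
# Illustrative sketch only: documents and query are made-up examples.

docs = {
    1: "data mining and knowledge discovery",
    2: "text mining of web documents",
    3: "data mining of web server logs",
}

# Each document reduced to a set of terms (a "bag of words" as presence/absence).
index = {doc_id: set(text.split()) for doc_id, text in docs.items()}

def matches(terms, query):
    """Evaluate ((data AND mining) AND (NOT text)) style queries.

    `query` is a nested tuple: ('AND', a, b), ('OR', a, b), ('NOT', a), or a term.
    """
    op = query[0] if isinstance(query, tuple) else None
    if op == 'AND':
        return matches(terms, query[1]) and matches(terms, query[2])
    if op == 'OR':
        return matches(terms, query[1]) or matches(terms, query[2])
    if op == 'NOT':
        return not matches(terms, query[1])
    return query in terms  # a plain term

query = ('AND', ('AND', 'data', 'mining'), ('NOT', 'text'))
result = [doc_id for doc_id, terms in index.items() if matches(terms, query)]
print(result)  # -> [1, 3]: every document that makes the query logically true
```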
2) Vector Space Model: Documents are again treated as bags of words or terms, and each document is represented as a vector. However, the term weights are no longer 0 or 1; each weight is computed using some variation of the TF or TF-IDF scheme.

Term frequency (TF): the weight of a term ti in document dj is the number of times ti appears in dj, denoted fij. Normalization may also be applied, e.g., tfij = fij / max{f1j, f2j, ..., f|V|j}.

TF-IDF weighting: the best-known weighting scheme combines TF with the inverse document frequency (IDF). With N the total number of documents and dfi the number of documents in which ti appears, idfi = log(N / dfi), and wij = tfij x idfi.

i) Retrieval in the vector space model: a query q is represented in the same way as a document (or slightly differently). The relevance of a document dj to q is estimated by comparing the similarity of the two vectors, most commonly the cosine of the angle between them: cosine(dj, q) = (dj . q) / (||dj|| ||q||). A minimal sketch of TF-IDF weighting and cosine ranking follows.
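The following Python sketch computes TF-IDF weights (normalized TF times log-IDF, as above) and ranks documents by cosine similarity; the toy documents and query are assumptions for illustration.

```python
# TF-IDF weighting and cosine-similarity ranking over a toy collection.
# Illustrative sketch only: the documents and query are made-up examples.
import math
from collections import Counter

docs = [
    "web mining discovers patterns from the web",
    "data mining of text documents",
    "integration of web information sources",
]
query = "web mining"

def tf_idf_vector(text, df, n_docs):
    """Weight each term by normalized TF times IDF: w = (f / max_f) * log(N / df)."""
    counts = Counter(text.split())
    max_f = max(counts.values())
    return {t: (f / max_f) * math.log(n_docs / df[t])
            for t, f in counts.items() if t in df}

def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Document frequency of each term across the collection.
df = Counter(t for d in docs for t in set(d.split()))
vectors = [tf_idf_vector(d, df, len(docs)) for d in docs]
q_vec = tf_idf_vector(query, df, len(docs))

# Rank documents by decreasing cosine similarity to the query.
for i in sorted(range(len(docs)), key=lambda i: -cosine(vectors[i], q_vec)):
    print(round(cosine(vectors[i], q_vec), 3), docs[i])
```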
The cosine measure is also commonly used in text clustering.

3) Statistical Language Models: Statistical language models (or simply language models) are based on probability and have foundations in statistical theory. The basic idea of this approach to retrieval is simple: first estimate a language model for each document, then rank documents by the likelihood of the query given each document's language model.

B. Text Pre-Processing
Pre-processing is essential for analysing multivariate datasets before data mining. It includes:
i) stopword removal,
ii) stemming, and
iii) frequency counts and computation of TF-IDF term weights.

1) Stopword Removal: Many of the most frequently used words in English are useless in IR and text mining; these are called stop words, e.g., "the", "of", "and", "to". For a given application, an additional domain-specific stopword list may be constructed. We remove stopwords to reduce the size of the index (or data file) and to improve efficiency and effectiveness, because stopwords are not useful for searching or text mining and may confuse the retrieval system.

2) Stemming: Stemming refers to techniques for finding the root (stem) of a word. It improves the effectiveness of IR and text mining by matching similar words, and mainly improves recall. It also reduces the index size by conflating words with the same root, which may shrink the index by as much as 40-50%.

3) Frequency Counts and TF-IDF: We count the number of times each word occurs in a document and use these occurrence frequencies to indicate the relative importance of the word in the document: if a word appears often in a document, the document likely deals with subjects related to that word. We also count the number of documents in the collection that contain each word. A minimal pre-processing sketch is given below.
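As a sketch of these steps, the snippet below removes stopwords, stems with NLTK's Porter stemmer (one common choice, not mandated by the paper), and counts term frequencies; the stopword list is a small illustrative sample.

```python
# Stopword removal, stemming, and frequency counting for a toy document.
# The stopword list here is a tiny illustrative sample, not an exhaustive one.
from collections import Counter
from nltk.stem import PorterStemmer

STOPWORDS = {"the", "of", "and", "to", "a", "in", "is", "it"}
stemmer = PorterStemmer()

def preprocess(text):
    """Lowercase, drop stopwords, and reduce each remaining word to its stem."""
    tokens = [w for w in text.lower().split() if w not in STOPWORDS]
    return [stemmer.stem(w) for w in tokens]

doc = "Mining the Web is the mining and integration of useful information"
terms = preprocess(doc)
freq = Counter(terms)   # raw term frequencies f_ij for this document
print(freq)             # e.g. Counter({'mine': 2, 'web': 1, ...})
```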
C. Web Page Pre-Processing
1) Identifying different text fields: In HTML there are different text fields, e.g., title, metadata, and body. Identifying them allows the retrieval system to treat terms in different fields differently. For example, in search engines, terms that appear in the title field of a page are regarded as more important than terms in other fields and are assigned higher weights, because the title is usually a concise description of the page. In the body text, emphasized terms (e.g., under header tags <h1>, <h2>, ..., or the bold tag <b>) are also given higher weights.

2) Identifying anchor text: Anchor text associated with a hyperlink is treated specially in search engines, because it often gives a more accurate description of the information contained in the page its link points to.

D. Web Search
A search engine starts by crawling pages on the Web. The crawled pages are then parsed, indexed, and stored; at query time, the index is used for efficient retrieval. The main operations are:
i) Parsing: a parser processes the input HTML page and produces a stream of tokens or terms to be indexed. The parser can be constructed using lexical analyzer and parser generators such as Flex and YACC.
ii) Indexing: a full index is built from all the text on each page, including anchor text (a piece of anchor text is indexed both for the page that contains it and for the page its link points to).
iii) Searching and ranking: given a user query, searching involves pre-processing the query terms, finding pages that contain all (or most of) the query terms via the inverted index, and ranking those pages before returning them to the user. A minimal inverted-index sketch follows.
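The snippet below sketches indexing and conjunctive search, assuming toy pages, a whitespace tokenizer and a crude occurrence-count ranking; real engines add positional postings and far richer ranking signals.

```python
# Building an inverted index and answering a conjunctive (AND) query with it.
# Illustrative sketch: the pages, tokenizer and ranking are simplified assumptions.
from collections import defaultdict

pages = {
    "p1": "web mining and web search",
    "p2": "information integration on the web",
    "p3": "search and ranking of pages",
}

# Inverted index: term -> set of page ids containing that term.
inverted = defaultdict(set)
for page_id, text in pages.items():
    for term in text.split():
        inverted[term].add(page_id)

def search(query):
    """Return pages containing ALL query terms, ranked by query-term count."""
    postings = [inverted.get(t, set()) for t in query.split()]
    hits = set.intersection(*postings) if postings else set()
    # Toy ranking: total occurrences of the query terms in the page.
    return sorted(hits, key=lambda p: -sum(pages[p].split().count(t)
                                           for t in query.split()))

print(search("web search"))  # -> ['p1']: the only page containing both terms
```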
IV. INFORMATION INTEGRATION
Information integration started with database integration, which has been studied in the database community since the early 1980s. The fundamental problem is schema matching, which takes two (or more) database schemas and produces a mapping between the elements (or attributes) of the schemas that correspond semantically to each other. The main objective is to merge the schemas into a single global schema.

A. Integrating Two Schemas
Consider two schemas, S1 and S2, representing two customer relations, Cust and Customer:

S1: Cust          S2: Customer
    CNo               CustID
    CompName          Company
    FirstName         Contact
    LastName          Phone

The mapping is represented with a similarity relation, ≅, over the power sets of S1 and S2, where each pair in ≅ represents one element of the mapping, e.g.:
    Cust.CNo ≅ Customer.CustID
    Cust.CompName ≅ Customer.Company
    {Cust.FirstName, Cust.LastName} ≅ Customer.Contact

B. Types of Matching
i) Schema-level matching: relies on information such as names, descriptions, data types, relationship types (e.g., part-of, is-a), and constraints. Match cardinality:
    1:1 match: one element of one schema matches one element of the other schema.
    1:m match: one element of one schema matches m elements of the other schema.
    m:n match: m elements of one schema match n elements of the other schema.
Linguistic approaches derive match candidates from the names, comments or descriptions of schema elements:
    Name matching: equality of names; synonyms; equality of hypernyms (A is a hypernym of B if B is a kind of A); common sub-strings; cosine similarity; user-provided name matches (usually a domain-dependent match dictionary).

ii) Domain- and instance-level only matching: some instance data (data records), and possibly the domain of each attribute, are used. This case is quite common on the Web: in many applications some data instances or attribute domains are available, and their value characteristics are used in matching. There are two kinds of domains:
    Simple domain: each value in the domain has only a single component (the value cannot be decomposed).
    Composite domain: each value in the domain contains more than one component.
Matching of simple domains: a simple domain can be of any type. If data type information is not available (often the case on the Web), the instance values can be used to infer types: words may be treated as strings, and phone numbers recognized by a regular expression pattern. Data type patterns (as regular expressions) can be learnt automatically or defined manually to identify types such as integer, real, string, month, weekday, date, time, zip code, and phone number. A small sketch of this kind of type inference follows.
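The sketch below infers a simple domain's type from its instance values with regular expressions; the pattern set and sample values are illustrative assumptions, and real systems learn or hand-craft many more patterns.

```python
# Inferring a simple domain's type from instance values via regex patterns.
# The pattern set below is a small illustrative sample.
import re

TYPE_PATTERNS = [
    ("integer", re.compile(r"^\d+$")),
    ("real",    re.compile(r"^\d+\.\d+$")),
    ("zip",     re.compile(r"^\d{5}(-\d{4})?$")),
    ("phone",   re.compile(r"^\(?\d{3}\)?[-\s]?\d{3}-\d{4}$")),
    ("date",    re.compile(r"^\d{1,2}/\d{1,2}/\d{2,4}$")),
]

def infer_type(values):
    """Return the first type whose pattern matches every instance value,
    falling back to 'string'."""
    for type_name, pattern in TYPE_PATTERNS:
        if all(pattern.match(v) for v in values):
            return type_name
    return "string"

print(infer_type(["(312) 555-0148", "773-555-0199"]))  # -> 'phone'
print(infer_type(["60604", "60607-1234"]))             # -> 'zip'
print(infer_type(["John", "Mary"]))                    # -> 'string'
```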
Handling composite domains: a composite domain is usually indicated by values containing delimiters such as punctuation marks (e.g., "-", "/", "_") and white space. These delimiters can be used both to detect a composite domain and to split a composite value into simple values, after which the matching methods for simple domains can be applied.

iii) Integrated matching of schema, domain and instance data: both schema information and instance data (possibly with domain information) are available and are used together.

C. Pre-Processing for Integration
i) Tokenization: break an item into atomic words using a dictionary, e.g., break "fromCity" into "from" and "city", or "first-name" into "first" and "name".
ii) Expansion: expand abbreviations and acronyms to their full words, e.g., from "dept" to "departure".
iii) Stopword removal and stemming.
iv) Standardization of words: irregular words are standardized to a single form, e.g., from "colour" to "color".
A minimal sketch of these steps is given below.
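The following sketch normalizes schema labels before matching (stemming omitted for brevity); the abbreviation and spelling dictionaries are tiny invented samples, not real resources.

```python
# Normalizing schema labels before matching: tokenization, abbreviation
# expansion, stopword removal, and spelling standardization.
# The dictionaries here are tiny illustrative samples.
import re

ABBREVIATIONS = {"dept": "departure", "arr": "arrival"}   # assumed examples
SPELLING = {"colour": "color"}                            # standard forms
STOPWORDS = {"of", "the", "a"}

def tokenize(label):
    """Split on delimiters and camelCase: 'fromCity' -> ['from', 'city']."""
    label = re.sub(r"([a-z])([A-Z])", r"\1 \2", label)    # split camelCase
    return [t for t in re.split(r"[-_/\s]+", label.lower()) if t]

def normalize(label):
    tokens = tokenize(label)
    tokens = [ABBREVIATIONS.get(t, t) for t in tokens]    # expansion
    tokens = [t for t in tokens if t not in STOPWORDS]    # stopword removal
    return [SPELLING.get(t, t) for t in tokens]           # standardization

print(normalize("fromCity"))      # -> ['from', 'city']
print(normalize("dept-date"))     # -> ['departure', 'date']
print(normalize("Colour_of_Car")) # -> ['color', 'car']
```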
D. Web Information Integration
There are many integration tasks on the Web: integrating Web query interfaces (search forms), integrating ontologies (taxonomies), and integrating extracted data. We introduce only query interface integration, as it has been studied extensively. Many web sites provide forms (called query interfaces) for querying their underlying databases, often called the deep Web as opposed to the surface Web that can be browsed directly. Typical applications are meta-search and meta-query.

E. Integration of Web Query Interfaces
The Web consists of the surface Web and the deep Web. The surface Web can be browsed using any Web browser, while the deep Web consists of databases that can only be accessed through parameterized query interfaces. With the rapid expansion of the Web, there are now a huge number of deep web data sources; in almost any domain one can find a large number of them, hosted by e-commerce sites. Each such source usually has a keyword-based search engine or a query interface that allows the user to fill in some information in order to retrieve the needed data.

1) Schema Model of a Query Interface: In each domain there is a set of concepts C = {c1, c2, ..., cn} that represents the essential information of the domain. These concepts are used in query interfaces to let the user restrict the search to specific instances or objects of the domain. A particular query interface uses a subset S of the concepts in C. A concept ci in S may be represented in the interface by a set of attributes (or fields) fi1, fi2, ..., fik. Each attribute is labeled with a word or phrase, called the label of the attribute, which is visible to the user. Each attribute may also have a set of possible values that the user can use in a search: its domain. All the attributes of a query interface, together with their labels, are called the schema of the query interface.

2) A Clustering-Based Approach: Given a large set of schemas from query interfaces in the same application domain, this technique uses a data mining method, clustering, to find attribute matches across all the interfaces. Three types of information are employed: attribute labels, attribute names, and value domains.

3) A Correlation-Based Approach: This approach is based on co-occurrences of schema attributes and the following observations:
i) In an interface, some attributes may be grouped together to form a bigger concept; for example, "first name" and "last name" compose the name of a person. This is called the grouping relationship, denoted by a set, e.g., {first name, last name}. Attributes in such a group often co-occur in schemas, i.e., they are positively correlated.
ii) An attribute group rarely co-occurs in schemas with its synonym attribute groups; for example, "first name" and "last name" rarely co-occur with "name" in the same query interface. Thus {first name, last name} and {name} are negatively correlated.

V. CONCLUSIONS
We have surveyed the principal techniques and models for information retrieval, as well as various approaches to information integration. The three categories of web mining were discussed briefly, with the main emphasis on the extraction and integration processes for web mining.

ACKNOWLEDGEMENT
We thank the anonymous referees for their careful reading of the paper and their valuable comments, which significantly improved its quality.
REFERENCES
[1] S. Chakrabarti, "Mining the Web: Discovering Knowledge from Hypertext Data", Morgan Kaufmann, 2002.
[2] B. Liu, "Web Data Mining: Exploring Hyperlinks, Contents and Usage Data", Springer, 2007.
[3] S. Ansari, R. Kohavi, L. Mason, and Z. Zheng, "Integrating e-commerce and data mining: Architecture and challenges", in N. Cercone, T. Y. Lin, and X. Wu, eds., Proceedings of the 2001 IEEE International Conference on Data Mining (ICDM 2001), IEEE Computer Society, 2001.
[4] S. Chakrabarti, "Mining the Web: Discovering Knowledge from Hypertext Data", Morgan Kaufmann Publishers, 2002.
[5] O. Nasraoui, O. Zaiane, M. Spiliopoulou, B. Mobasher, P. Yu, and B. Masand, eds., "Advances in Web Mining and Web Usage Analysis", revised papers from the 7th Workshop on Knowledge Discovery on the Web, Springer Lecture Notes in Artificial Intelligence, LNAI 4198, 2006.
[6] B. Mobasher, O. Nasraoui, B. Liu, and B. Masand, eds., "Web Mining and Web Usage Analysis", revised papers from the 6th Workshop on Knowledge Discovery on the Web, Springer Lecture Notes in Artificial Intelligence, 2006.
[7] Google, [Link]
[8] [Link]