Classification of Web page content is essential to many tasks in Web information retrieval such as maintaining Web directories and focused crawling. The uncontrolled nature of Web content presents additional challenges to Web page classification as compared to traditional text classification, but the interconnected nature of hypertext also provides features that can assist the process. As we review work in Web page classification, we note the importance of these Web-specific features and algorithms, describe state-of-the-art practices, and track the underlying assumptions behind the use of information from neighboring pages.
International Journal of Data Mining, Modelling and Management, 2013
The boom in the use of the Web and its exponential growth are now well known. The amount of textual data available on the Web is estimated to be on the order of one terabyte, in addition to images, audio and video. This imposes additional challenges on Web directories, which help users search the Web by classifying selected Web documents into subject categories. Manual classification of web pages by human experts also suffers from the exponential increase in the number of Web documents. Instead of using the entire web page to classify it, this article emphasizes the need for automatic web page classification using a minimum number of its features. A method for generating such an optimum number of features for web pages is also proposed. Machine learning classifiers are modeled using these optimum features. Experiments with these machine learning classifiers on benchmark data sets have shown promising improvement in classification accuracy.
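The abstract does not give the feature-selection method itself, but a common way to pick a minimal set of discriminative features is chi-square scoring of term/category counts. The sketch below is a minimal illustration under that assumption; the counts and terms are made up for demonstration.

```python
def chi_square(n11, n10, n01, n00):
    """Chi-square score for one (term, category) pair from a 2x2
    contingency table: n11 = docs in the category containing the term,
    n10 = docs outside the category containing the term, and n01 / n00
    likewise for docs that lack the term."""
    n = n11 + n10 + n01 + n00
    num = n * (n11 * n00 - n10 * n01) ** 2
    den = (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
    return num / den if den else 0.0

def select_features(term_stats, k):
    """Keep the k terms with the highest chi-square score.
    term_stats maps term -> (n11, n10, n01, n00)."""
    ranked = sorted(term_stats,
                    key=lambda t: chi_square(*term_stats[t]),
                    reverse=True)
    return ranked[:k]

# toy counts: 'football' is strongly associated with the category,
# while 'the' appears uniformly everywhere
stats = {
    "football": (40, 5, 10, 45),
    "the":      (45, 44, 5, 6),
}
print(select_features(stats, 1))  # ['football']
```

Ranking terms this way lets a classifier work with a small, informative vocabulary instead of the full page text.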
In recent years the Internet has seen massive growth of data stored in various forms. Innovative and effective technologies are needed to help a multidisciplinary crew find and use this valuable information and knowledge. The data is not static; it is dynamically increasing and varying. In order to utilize Web information better, people pursue the latest technology that can effectively organize and use online information. Classification of Web page content is important to many tasks in Web information retrieval, such as maintaining Web directories and focused crawling. The uncontrolled nature of Web content presents additional challenges to Web page classification as compared to traditional text classification, but the interconnected nature of hypertext also provides features that can assist the process. This paper reviews Web page classification and indicates the importance of these Web-specific features and algorithms.
EURASIA-ICT 2002 Proceedings of the Workshop, 2002
Web page classification is significantly different from traditional text classification because of the presence of some additional information, provided by the HTML structure and by the presence of hyperlinks. In this paper we analyze these peculiarities and try to exploit them for representing web pages in order to improve categorization accuracy. We conduct various experiments on a corpus of 8000 documents belonging to 10 Yahoo! categories, using Kernel Perceptron and Naive Bayes classifiers. Our experiments show the usefulness of dimensionality reduction and of a new, structure-oriented weighting technique. We also introduce a new method for representing linked pages using local information that makes hypertext categorization feasible for real-time applications. Finally, we observe that the combination of the usual representation of web pages using local words with a hypertextual one can improve classification performance.
International Journal of Computer Applications, 2018
Classification of Web pages is one of the challenging and important tasks, as the number of web pages on the Internet increases day by day. There are many ways of classifying web pages based on different approaches and features. This paper explains some of the approaches and algorithms used for the classification of web pages. In Web page classification, web pages are allocated to predetermined categories, mainly according to their content. Web page classification is an important technique for web mining, because classifying the web pages of an interesting class is the initial step of data mining. The agenda of this paper is first to introduce the concepts related to web mining and then to provide a comprehensive review of different classification techniques.
With the continuous growth of the World Wide Web, the need arises for indexing and classifying Web documents for fast retrieval of the relevant information accessible through it. Traditionally, classification has been accomplished manually. A recent study revealed that there existed about 29.7 billion pages on the Web in February 2007, which means that manual classification would be infeasible and reflects the need for automated techniques to accomplish this task. Though Web documents should follow the basic definitions of the Hypertext Markup Language, they are known to be unstructured or semi-structured, which imposes new challenges on Web classification, especially in the area of feature selection. The objective of this paper is to investigate Web document classification approaches and to compare recent techniques that have proved promising in the literature within this field. Traditionally, automatic classification is performed by extracting information for representing a web document from th...
IEEE International Conference on Data Mining, 2002
Automatic classification of web pages is an effective way to deal with the difficulty of retrieving information from the Internet. Although many automatic classification algorithms and systems have been proposed, most of them ignore the conflict between the fixed number of categories and the growing number of web pages entering the system. They also require searching
2013
Classification of web pages greatly helps in making search engines more efficient by providing relevant results to users' queries. In most of the prevailing algorithms in the literature, the classification/categorization depends solely on features extracted from the text content of the web pages. But since most web pages nowadays are predominantly filled with images and contain little text information, which may even be false and erroneous, classifying those web pages using the information present in them alone often leads to misclassification. To solve this problem, this paper proposes an algorithm for automatically categorizing web pages with little text content, based on features extracted both from the URLs present in a web page and from its own text content. Experiments on the benchmark data set "WebKB" using K-NN, SVM and Naive Bayes machine learning algorithms show the effectiveness of the proposed appr...
Proceedings of the 13th international World Wide Web …, 2004
Uniform resource locators (URLs), which mark the address of a resource on the World Wide Web, are often human-readable and can hint at the category of the resource. This paper explores the use of URLs for web page categorization via a two-phase pipeline of word segmentation/expansion and classification. We quantify its performance against document-based methods, which require the retrieval of the source document.
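The first phase the abstract describes is breaking a URL into word tokens before classifying it. A minimal sketch of that idea, using only simple delimiter-based splitting (the paper's actual segmentation/expansion is more sophisticated), and with made-up keyword lists standing in for a trained classifier:

```python
import re
from urllib.parse import urlparse

def url_tokens(url):
    """Split a URL into lowercase word tokens: the hostname, path and
    query are broken on any non-letter characters."""
    parsed = urlparse(url)
    raw = " ".join([parsed.netloc, parsed.path, parsed.query])
    return [t for t in re.split(r"[^a-z]+", raw.lower()) if len(t) > 1]

# hypothetical per-category keyword sets, standing in for a classifier
# learned from labelled URLs
CATEGORY_HINTS = {
    "sports": {"sport", "sports", "football", "scores"},
    "news":   {"news", "politics", "world"},
}

def categorize(url):
    """Assign the category whose hint set overlaps the URL's tokens most."""
    tokens = set(url_tokens(url))
    best = max(CATEGORY_HINTS, key=lambda c: len(tokens & CATEGORY_HINTS[c]))
    return best if tokens & CATEGORY_HINTS[best] else None

print(url_tokens("http://www.example.com/sports/football-scores?page=2"))
print(categorize("http://www.example.com/sports/football-scores?page=2"))  # sports
```

The appeal of the approach, as the abstract notes, is that no source document needs to be fetched: the address alone carries the signal.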
2014 47th Hawaii International Conference on System Sciences, 2014
Computer Engineering and Intelligent Systems, 2014
The World Wide Web (WWW) is growing at an uncontrollable rate; hundreds of thousands of web sites appear every day, with the added challenge of keeping web directories up to date. Further, the uncontrolled nature of the web presents difficulties for web page classification. The proposed system uses a neural network technique for automatically classifying online web pages according to their domain. The system provides the ability to classify web pages online, which makes it sensitive to any change that happens to a website.
The web is growing very fast; it has a very large amount of information of different types. This necessitates ways to arrange and organize this vast amount of data. One of these ways is automatic Web page classification, which is used in many other applications. In this paper, a comparison between various page structural elements used in the Web page classification task is presented. Classification rules and word-weighting algorithms have been used as the significance criteria for page structural elements in web page classification. The obtained results showed that the page title proved its significance, giving better accuracy over all categories, with accuracy ranging between 84.69% and 93.85% (average 90.52%). Finally, the results also proved that the word-weighting algorithm improved accuracy over the classification rules algorithm for all other classification criteria (title, body text, header and URL). Pages: 164-171
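Word-weighting by structural element can be sketched as follows: terms are counted with a multiplier depending on where in the page they appear, so that a title term contributes more than a body term. The weight values below are illustrative assumptions, not the ones used in the paper (which only reports that the title was the strongest signal):

```python
from collections import Counter

# illustrative weights; actual values would come from experiments
FIELD_WEIGHTS = {"title": 4.0, "header": 2.0, "url": 2.0, "body": 1.0}

def weighted_term_vector(fields):
    """Build a term -> weight vector, adding each occurrence of a term
    multiplied by the weight of the structural element it appears in.
    fields maps element name -> list of tokens from that element."""
    vec = Counter()
    for field, tokens in fields.items():
        w = FIELD_WEIGHTS.get(field, 1.0)
        for tok in tokens:
            vec[tok] += w
    return vec

page = {
    "title": ["python", "tutorial"],
    "body": ["python", "code", "examples", "code"],
}
v = weighted_term_vector(page)
print(v["python"])  # 4.0 (title) + 1.0 (body) = 5.0
```

A classifier trained on such vectors effectively treats a title occurrence as several body occurrences, which is one way the title's significance can translate into higher accuracy.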
Web page classification is one of the common problems of today's Internet. In this paper, an automatic Web page classification system is introduced. The proposed system tries to increase the accuracy of web page classification by combining the well-known Naïve Bayesian algorithm, Support Vector Machine and K-Nearest Neighbor. The experimental results show that classifying web pages with the hybrid of the Naïve Bayesian classifier, Support Vector Machine and K-Nearest Neighbor performs better than using Naïve Bayesian, K-Nearest Neighbor or Support Vector Machine alone, reducing the false positive rate and achieving the highest accuracy. The experimental results, applied on 10,000 web pages (30% for the training process and 70% for the testing process), showed high efficiency, with a false positive rate of (on average) 0%, a true positive rate of (on average) 1, an F-measure of (on average) 1 and an overall accuracy rate of (on average) 99.98%.
IOSR Journal of Computer Engineering, 2012
The World Wide Web is growing at an uncontrollable rate. Hundreds of thousands of web sites appear every day, with the added challenge of keeping the web directories up to date. Further, the uncontrolled nature of the web presents difficulties for web page classification. As the number of Internet users grows, so does the need for classifying web pages with greater precision in order to present users with web pages of their desired class. However, web page classification has been accomplished mostly by using textual categorization methods. Herein, we propose a novel approach for web page classification that uses the HTML information present in a web page for its classification. There are many ways of achieving classification of web pages into various domains. This paper proposes an entirely new dimension toward web page classification using Artificial Neural Networks (ANN).
Advances in Artificial Intelligence, 2011
Virtual integration systems require a crawler to navigate through web sites automatically, looking for relevant information. This process is online, so whilst the system is looking for the required information, the user is waiting for a response. Therefore, downloading a minimum number of irrelevant pages is mandatory to improve the crawler efficiency. Most crawlers need to download a page to determine its relevance, which results in a high number of irrelevant pages downloaded. In this paper, we propose a classifier that helps crawlers to efficiently navigate through web sites. This classifier is able to determine if a web page is relevant by analysing exclusively its URL, minimising the number of irrelevant pages downloaded, improving crawling efficiency and reducing used bandwidth, making it suitable for virtual integration systems.
2015
Nowadays, when a keyword is provided, a search engine can return a large number of web pages, which makes it difficult for people to find the right information. Web page classification is a technology that can help us to make a relevant and quick selection of information that we are looking for. Moreover, web page classification is important for companies that provide marketing and analytics platforms, because it can help them to build
1999
Assistance in retrieving documents on the World Wide Web is provided either by search engines, through keyword-based queries, or by catalogues, which organize documents into hierarchical collections. Maintaining catalogues manually is becoming increasingly difficult, due to the sheer amount of material on the Web; it is thus becoming necessary to resort to techniques for the automatic classification of documents. Automatic classification is traditionally performed by extracting the information for representing a document ("indexing") from the document itself. The paper describes the novel technique of categorization by context, which instead extracts useful information for classifying a document from the context where a URL referring to it appears. We present the results of experimenting with Theseus, a classifier that exploits this technique.
The Web is the largest collection of electronically accessible documents, making it the richest source of information in the world. The problem with the Web is that this information is not structured and organized well enough to be easily retrieved. Web page classification is used for managing and extracting relevant information from Web content, in order to effectively use the knowledge available on the Web. In this dissertation a rule-based system is used to construct the classifier that solves online Web classification by assigning each scanned HTML document to its class. The aim of this thesis is to design and implement an HTML document classification system that is able to classify HTML documents according to their class (category). The proposed system is designed to solve and improve the web page classification problem using a Rule-Based Classifier that checks the HTML content of each entered URL address for occurrences of the system's rules. The proposed system enhances other web page classification systems by making the system work online.
Knowledge-Based Systems, 2014
Unsupervised URL-Based Web Page Classification refers to the problem of clustering the URLs in a web site so that each cluster includes a set of pages that can be classified using a unique class. The existing proposals to perform URL-based classification suffer from a number of drawbacks: they are supervised, which requires the user to provide labelled training data and makes them difficult to scale; they are language or domain dependent, since they require the user to provide dictionaries of words; or they require extensive crawling, which is time and resource consuming. In this article, we propose a new statistical technique to mine URL patterns that are able to classify Web pages. Our proposal is unsupervised, language and domain independent, and does not require extensive crawling. We have evaluated our proposal on 45 real-world web sites, and the results confirm that it can achieve a mean precision of 98% and a mean recall of 91%, and that its performance is comparable to that of a supervised classification technique, while it does not require labelling large sets of sample pages. Furthermore, we propose a novel application that helps to extract the underlying model from non-semantic-web sites.
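The core intuition behind URL pattern mining is that pages of the same class tend to share a URL template. A crude stand-in for the paper's statistical technique is to collapse each URL to a template (here, by replacing digit runs with a wildcard) and cluster URLs sharing a template; the rule and the example URLs are illustrative assumptions:

```python
import re
from collections import defaultdict

def url_pattern(url):
    """Reduce a URL to a template by replacing runs of digits with '[N]'.
    This is a deliberately simple proxy for mined URL patterns."""
    return re.sub(r"\d+", "[N]", url)

def cluster_by_pattern(urls):
    """Group URLs that share the same template; each resulting cluster
    would be a candidate class of pages."""
    clusters = defaultdict(list)
    for u in urls:
        clusters[url_pattern(u)].append(u)
    return dict(clusters)

urls = [
    "http://example.com/post/101",
    "http://example.com/post/202",
    "http://example.com/about",
]
clusters = cluster_by_pattern(urls)
print(sorted(clusters))
# ['http://example.com/about', 'http://example.com/post/[N]']
```

No page content is fetched at any point, which is what makes such URL-only clustering cheap compared with crawling-based classification.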
Lecture Notes in Computer Science, 2001
This paper describes automatic Web page classification using machine learning methods. Recently, the importance of portal site services, including search engine functionality on the World Wide Web, has been increasing. In particular, portal sites such as Yahoo!, whose service hierarchically classifies Web pages into many categories, are becoming popular. However, the classification of Web pages into each category relies exclusively on manpower, which costs much time and care. To alleviate this problem, we propose techniques to generate attributes using co-occurrence analysis and to classify Web pages automatically based on machine learning. We apply these techniques to Web pages on Yahoo! JAPAN and construct decision trees that determine the appropriate category for each Web page. The performance of the proposed method is evaluated in terms of error rate, recall, and precision. The experimental evaluation demonstrates that this method provides high accuracy in classifying Web pages into the top-level categories on Yahoo! JAPAN.