In recent years the Internet has seen massive growth in the data stored in various forms. Innovative and effective technologies are needed to help find and use this valuable information and knowledge across multiple disciplines. The data is never static; it is dynamically increasing and varying. To make better use of Web information, people pursue the latest technology that can effectively organize and use online information. Classification of Web page content is important to many tasks in Web information retrieval, such as maintaining Web directories and focused crawling. The uncontrolled nature of Web content presents additional challenges to Web page classification compared to traditional text classification, but the interconnected nature of hypertext also provides features that can assist the process. This paper reviews Web page classification and indicates the importance of these Web-specific features and algorithms.
Classification of Web page content is essential to many tasks in Web information retrieval such as maintaining Web directories and focused crawling. The uncontrolled nature of Web content presents additional challenges to Web page classification as compared to traditional text classification, but the interconnected nature of hypertext also provides features that can assist the process. As we review work in Web page classification, we note the importance of these Web-specific features and algorithms, describe state-of-the-art practices, and track the underlying assumptions behind the use of information from neighboring pages.
International Journal of Computer Applications, 2018
Classification of Web pages is one of the challenging and important tasks, as the number of web pages on the Internet increases day by day. There are many ways of classifying web pages based on different approaches and features. This paper explains some of the approaches and algorithms used for the classification of web pages. In Web page classification, web pages are assigned to predetermined categories, mainly according to their content. Web page classification is an important technique for web mining, because classifying the web pages of an interesting class is the initial step of data mining. The agenda of this paper is first to introduce the concepts related to web mining and then to provide a comprehensive review of different classification techniques.
With the continuous growth of the World Wide Web, the need arises for indexing and classifying Web documents for fast retrieval of the relevant information accessible through it. Traditionally, classification has been accomplished manually. A recent study revealed that there existed about 29.7 billion pages on the Web in February 2007, which means that manual classification would be infeasible and reflects the need for automated techniques to accomplish this task. Though Web documents should follow the basic definitions of the Hypertext Markup Language, they are known to be unstructured or semi-structured, which imposes new challenges on Web classification, especially in the area of feature selection. The objective of this paper is to investigate Web document classification approaches and to compare recent techniques that have proved promising in the literature within this field. Traditionally, automatic classification is performed by extracting information for representing a web document from th...
International Journal of Data Mining, Modelling and Management, 2013
The boom in the use of the Web and its exponential growth are now well known. The amount of textual data available on the Web is estimated to be on the order of one terabyte, in addition to images, audio and video. This has imposed additional challenges on the Web directories that help the user search the Web by classifying selected Web documents into subject categories. Manual classification of web pages by human experts also suffers from the exponential increase in the number of Web documents. Instead of using the entire web page to classify it, this article emphasizes the need for automatic web page classification using a minimum number of features. A method for generating such an optimum number of features for web pages is also proposed. Machine learning classifiers are modeled using these optimum features. Experiments on benchmark data sets with these machine learning classifiers have shown promising improvement in classification accuracy.
2013
Classification of web pages greatly helps in making search engines more efficient by providing relevant results to users' queries. In most of the prevailing algorithms in the literature, the classification/categorization depends solely on features extracted from the text content of the web pages. But as most web pages nowadays are predominantly filled with images and contain little text information, which may even be false and erroneous, classifying those web pages with that information alone often leads to misclassification. To solve this problem, this paper proposes an algorithm for automatically categorizing web pages with little text content, based on features extracted from both the URLs present in the web page and its own text content. Experiments on the benchmark data set "WebKB" using K-NN, SVM and Naive Bayes machine learning algorithms show the effectiveness of the proposed appr...
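The core idea above — padding a sparse page's text with tokens mined from its URLs — can be sketched in a few lines. This is a hedged illustration, not the paper's actual method: the tokenizer, the Jaccard similarity, and the 1-NN/k-NN vote are assumptions chosen for brevity, and all names (`url_tokens`, `page_features`, `knn_classify`) are hypothetical.

```python
import re

def url_tokens(s):
    # Split a URL (or any text) into lowercase alphanumeric tokens.
    return set(t for t in re.split(r"[^a-z0-9]+", s.lower()) if t)

def page_features(url, text):
    # Merge tokens drawn from the URL with tokens from the page text,
    # so pages with little text still get discriminative features.
    return url_tokens(url) | url_tokens(text)

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def knn_classify(page, training, k=3):
    # training: list of (feature_set, label); vote among the k most similar.
    ranked = sorted(training, key=lambda ex: jaccard(page, ex[0]), reverse=True)
    votes = {}
    for feats, label in ranked[:k]:
        votes[label] = votes.get(label, 0) + 1
    return max(votes, key=votes.get)
```

For example, a course page with almost no body text can still match training pages whose URLs share path words like "course".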
EURASIA-ICT 2002 Proceedings of the Workshop, 2002
Web page classification is significantly different from traditional text classification because of the presence of some additional information, provided by the HTML structure and by the presence of hyperlinks. In this paper we analyze these peculiarities and try to exploit them for representing web pages in order to improve categorization accuracy. We conduct various experiments on a corpus of 8000 documents belonging to 10 Yahoo! categories, using Kernel Perceptron and Naive Bayes classifiers. Our experiments show the usefulness of dimensionality reduction and of a new, structure-oriented weighting technique. We also introduce a new method for representing linked pages using local information that makes hypertext categorization feasible for real-time applications. Finally, we observe that the combination of the usual representation of web pages using local words with a hypertextual one can improve classification performance.
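The "structure-oriented weighting" idea above can be sketched as a toy term-weighting function in which terms from prominent HTML elements count more than body text. The tag weights and function names below are illustrative assumptions, not the values or scheme used in the paper.

```python
# Illustrative tag weights: title and headings matter more than body text.
TAG_WEIGHTS = {"title": 4.0, "h1": 3.0, "a": 2.0, "body": 1.0}

def weighted_term_freq(tagged_text):
    # tagged_text: list of (tag, text) pairs extracted from a parsed page.
    # Each occurrence of a term contributes the weight of its enclosing tag.
    weights = {}
    for tag, text in tagged_text:
        w = TAG_WEIGHTS.get(tag, 1.0)
        for term in text.lower().split():
            weights[term] = weights.get(term, 0.0) + w
    return weights
```

A term appearing once in the title and once in the body would thus outweigh a term appearing twice in the body, which is the intuition behind tag-aware representations.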
Computer Engineering and Intelligent Systems, 2014
The World Wide Web (WWW) is growing at an uncontrollable rate: hundreds of thousands of web sites appear every day, with the added challenge of keeping web directories up to date. Further, the uncontrolled nature of the web presents difficulties for web page classification. The proposed system uses a neural network technique to automatically classify online web pages according to their domain. The system provides the ability to classify web pages online, which makes it sensitive to any change that happens to a website.
The Web is the largest collection of electronically accessible documents, which makes it the richest source of information in the world. The problem with the Web is that this information is not well structured and organized, so it cannot be easily retrieved. Web page classification is used to manage and extract relevant information from Web content and to make effective use of the knowledge available on the Web. In this dissertation a rule-based system is used to construct a classifier that solves online Web classification by assigning each scanned HTML document to its class. The aim of this thesis is to design and implement an HTML document classification system that is able to classify HTML documents according to their class (category). The proposed system is designed to solve and improve the web page classification problem using a Rule-Based Classifier that checks the HTML content of each entered URL address for occurrences of the system's rules. The proposed system enhances other web page classification systems by making the system work online.
Web page classification is one of the common problems of today's Internet. In this paper, an automatic Web page classification system is introduced. The proposed system tries to increase the accuracy of web page classification by combining the well-known Naïve Bayesian algorithm, Support Vector Machine and K-Nearest Neighbor. The experimental results show that the hybrid of the Naïve Bayesian classifier, Support Vector Machine and K-Nearest Neighbor performs better, in terms of reducing the false positive rate and achieving the highest accuracy, than using Naïve Bayesian alone (often favored as the fastest classifier), K-Nearest Neighbor alone, or Support Vector Machine alone. The experimental results, applied on 10,000 web pages (30% for the training process and 70% for the testing process), showed high efficiency, with a false positive rate (on average) of 0%, a true positive rate (on average) of 1%, an F-measure (on average) of 1% and an overall accuracy rate (on average) of 99.98%.
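The combination scheme the abstract describes can be sketched as a simple majority vote over the three base classifiers' predictions. The abstract does not specify the actual combination rule, so the function below — including the fallback to the Naïve Bayes prediction on a three-way tie — is an assumption for illustration only.

```python
def hybrid_vote(predictions):
    # predictions: label from each base classifier,
    # e.g. {"nb": ..., "svm": ..., "knn": ...}.
    counts = {}
    for label in predictions.values():
        counts[label] = counts.get(label, 0) + 1
    best = max(counts, key=counts.get)
    # On a three-way tie (every classifier disagrees), fall back to NB.
    if counts[best] == 1:
        return predictions["nb"]
    return best
```

Any two agreeing classifiers thus outvote the third, which is one simple way such an ensemble can reduce individual classifiers' false positives.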
IEEE International Conference on Data Mining, 2002
Automatic classification of web pages is an effective way to deal with the difficulty of retrieving information from the Internet. Although many automatic classification algorithms and systems have been proposed, most of them ignore the conflict between the fixed number of categories and the growing number of web pages going into the system. They also require searching
IOSR Journal of Computer Engineering, 2012
The World Wide Web is growing at an uncontrollable rate. Hundreds of thousands of web sites appear every day, with the added challenge of keeping the web directories up-to-date. Further, the uncontrolled nature of web presents difficulties for Web page classification. As the number of Internet users is growing, so is the need for classification of web pages with greater precision in order to present the users with web pages of their desired class. However, web page classification has been accomplished mostly by using textual categorization methods. Herein, we propose a novel approach for web page classification that uses the HTML information present in a web page for its classification. There are many ways of achieving classification of web pages into various domains. This paper proposes an entirely new dimension towards web page classification using Artificial Neural Networks (ANN).
The web is growing very fast and holds a very large amount of information of different types. This necessitates ways to arrange and organize this vast amount of data. One of these ways is automatic Web page classification, which is used in many other applications. In this paper, a comparison between the various page structural elements used in the Web page classification task is presented. Classification rules and word-weighting algorithms have been used to delimit the significance criteria of page structural elements in web page classification. The obtained results showed that the page title proved its significance, giving better accuracy over all categories, where it ranged between 84.69% and 93.85% with an average of 90.52%. Finally, the results also proved that the word-weighting algorithm improved accuracy over the classification rules algorithm for all other classification criteria (title, body text, header and URL). Paper pages: 164-171.
Lecture Notes in Computer Science, 2001
This paper describes automatic Web-page classification using machine learning methods. Recently, the importance of portal site services, including the search engine function on the World Wide Web, has been increasing. In particular, portal sites such as Yahoo!, which hierarchically classify Web pages into many categories, are becoming popular. However, the classification of Web pages into each category relies exclusively on manpower, which costs much time and care. To alleviate this problem, we propose techniques to generate attributes using co-occurrence analysis and to classify Web pages automatically based on machine learning. We apply these techniques to Web pages on Yahoo! JAPAN and construct decision trees that determine the appropriate category for each Web page. The performance of the proposed method is evaluated in terms of error rate, recall, and precision. The experimental evaluation demonstrates that this method provides high accuracy in the classification of Web pages into top-level categories on Yahoo! JAPAN.
International Journal of Automation and Computing, 2012
The number of Internet users and the number of web pages being added to the WWW increase dramatically every day. It is therefore necessary to automatically and efficiently classify web pages into web directories. This helps search engines provide users with relevant and quick retrieval results. As web pages are represented by thousands of features, feature selection helps web page classifiers resolve this large-scale dimensionality problem. This paper proposes a new feature selection method using Ward's minimum variance measure. This measure is first used to identify clusters of redundant features in a web page. In each cluster, the best representative features are retained and the others are eliminated. Removing such redundant features helps minimize resource utilization during classification. The proposed method of feature selection is compared with other common feature selection methods. Experiments done on a benchmark data set, namely WebKB, show that the proposed method performs better than most of the other feature selection methods in terms of reducing the number of features and the classifier modeling time.
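The clustering step described above can be sketched with a minimal agglomerative procedure using Ward's merge cost (the increase in within-cluster variance when two clusters merge). This is a simplified illustration under assumed inputs — each feature is represented by a small profile vector (e.g. per-class frequencies) — and is not the paper's actual implementation; `cluster_features` and `ward_cost` are hypothetical names.

```python
def ward_cost(ca, cb):
    # Ward's criterion: increase in within-cluster variance if ca and cb merge.
    na, nb = len(ca["members"]), len(cb["members"])
    d2 = sum((x - y) ** 2 for x, y in zip(ca["centroid"], cb["centroid"]))
    return na * nb / (na + nb) * d2

def cluster_features(vectors, n_clusters):
    # vectors: {feature_name: profile vector}; greedily merge the pair of
    # clusters with the smallest Ward cost until n_clusters remain.
    clusters = [{"members": [f], "centroid": list(v)} for f, v in vectors.items()]
    while len(clusters) > n_clusters:
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: ward_cost(clusters[ij[0]], clusters[ij[1]]))
        a, b = clusters[i], clusters.pop(j)
        na, nb = len(a["members"]), len(b["members"])
        a["centroid"] = [(na * x + nb * y) / (na + nb)
                         for x, y in zip(a["centroid"], b["centroid"])]
        a["members"] += b["members"]
    return [c["members"] for c in clusters]
```

Features landing in the same cluster are treated as redundant, and only one representative per cluster would be retained.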
Expert Systems with Applications, 2009
To meet the growing qualitative and quantitative demands for information from the WWW, efficient automatic Web page classifiers are urgently needed. However, a classifier applied to the WWW faces a huge-scale dimensionality problem, since it must handle millions of Web pages, tens of thousands of features, and hundreds of categories. When it comes to practical implementation, reducing the dimensionality is a critically important challenge. In this paper, we propose a fuzzy ranking analysis paradigm together with a novel relevance measure, the discriminating power measure (DPM), to effectively reduce the input dimensionality from tens of thousands to a few hundred with zero rejection rate and a small decrease in accuracy. A two-level promotion method based on fuzzy ranking analysis is proposed to improve the behavior of each relevance measure and combine those measures to produce a better evaluation of features. Additionally, the DPM has low computation cost and emphasizes both positive and negative discriminating features. It also emphasizes classification in parallel order rather than in serial order. In our experimental results, the fuzzy ranking analysis is useful for validating the uncertain behavior of each relevance measure. Moreover, the DPM reduces input dimensionality from 10,427 to 200 with zero rejection rate and with less than a 5% decline (from 84.5% to 80.4%) in test accuracy. Furthermore, regarding the impact of the proposed DPM on classification accuracy, experimental results on the China Time and Reuters-21578 datasets have demonstrated that the DPM provides a major benefit in promoting the document classification accuracy rate. The results also show that the DPM can indeed reduce both redundant and noisy features to build a better classifier.
Information Technology, e-Business and Applications, 2004
Applying text classification techniques to web documents imposes many potential problems due to the huge volume and unstructured nature of these documents. In this research, an Association Rules Classifier (ARC) is proposed as a novel classification framework that captures different hypertext feature sources, namely text, anchor, title and metadata, and uses them to build a comprehensive knowledge base composed of association rules expressing the feature dependencies. The ARC's performance is compared with three other well-known full-text classifiers: Bernoulli Bayes, Multinomial Bayes and KNN. The ARC has shown an accuracy improvement reaching 65% for large-vocabulary datasets. For small-vocabulary datasets, the ARC's performance was similar to the best classifier among the three. When compared to other web classifiers that exploit anchor, title and metadata, the ARC enhanced accuracy by about 22%.
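The classification step of an association-rule classifier like the one described can be sketched as follows: each mined rule maps a set of features (drawn from text, anchor, title or metadata) to a class with a confidence, and a page is labeled by the highest-confidence rule whose antecedent it satisfies. This is a generic sketch of rule-based classification, not the ARC's actual matching strategy; `arc_classify` and the tie-breaking are assumptions.

```python
def arc_classify(features, rules, default="unknown"):
    # features: set of features extracted from the page.
    # rules: list of (antecedent_feature_set, class_label, confidence).
    # Fire the highest-confidence rule whose antecedent the page satisfies.
    applicable = [(conf, label) for ant, label, conf in rules if ant <= features]
    if not applicable:
        return default
    return max(applicable)[1]
```

Richer schemes aggregate the confidences of all matching rules per class rather than firing a single rule; the single-rule version keeps the sketch short.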
Traditional information retrieval methods use keywords occurring in web pages to determine their class, but often retrieve unrelated web pages. The W3 Consortium has stated that HTML doesn't provide a good description of the semantic structure of web page contents, because of its limited semi-structured data, case sensitivity, predefined tags and so on. To overcome these drawbacks, Web developers started to build web pages with newer technologies such as XML and Flash. This opens the way for new research methods. In this article we propose a new approach based on URL and semantic analysis for classifying XML and other types of web pages.
Journal of Engineering Science and Military Technologies
The data on the web is generally stored in structured, semi-structured and unstructured formats; surveys indicate that most of an organization's information is stored in unstructured textual form. So, categorizing this huge number of unstructured web text documents has become one of the most important tasks when dealing with the web. Categorization (classification) of web text documents aims at assigning one or more class labels (categories) to unlabeled documents; the assignment process depends mainly on the contents of the document itself, with the help of one or more machine learning techniques. Different learning algorithms have been applied to the content of text documents for the classification process. The experiments in this paper use a subset of the Reuters-21578 dataset to highlight the weaknesses and limitations of traditional techniques for feature generation and dimensionality reduction, showing the classification accuracy and F-measure obtained when applying different classification algorithms.
full potential, automatic classification of web pages into web directories has become more significant. These web directories help search engines provide users with relevant and quick retrieval results. In this paper a novel approach to web page classification is implemented by combining the k-nearest neighbor classifier (kNN) and an association rule mining algorithm. The web pages are preprocessed and discretized before inducing the classifier. The proposed method for web page classification uses a) a feature weighting scheme based on association rules and b) a distance-weighted voting scheme. This distance-weighted voting scheme enables the model to work for any value of k, whether odd or even. Experiments done on a benchmark data set, namely WebKB, have shown that the web page classification accuracy of the proposed method is significantly better than that of many existing web page classification methods.
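The distance-weighted voting scheme mentioned above can be sketched in a few lines: each of the k nearest neighbors votes with weight proportional to the inverse of its distance, so a close 2-2 split under even k is still resolved. This is a generic sketch of the technique, not the paper's implementation; the function names and the 1/d weighting (with a small epsilon to avoid division by zero) are assumptions.

```python
def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def weighted_knn(query, training, k, dist):
    # training: list of (vector, label). Each of the k nearest neighbors
    # votes with weight 1 / distance, so even values of k work too.
    neighbours = sorted(training, key=lambda ex: dist(query, ex[0]))[:k]
    votes = {}
    for vec, label in neighbours:
        d = dist(query, vec)
        votes[label] = votes.get(label, 0.0) + 1.0 / (d + 1e-9)
    return max(votes, key=votes.get)
```

With k=4 and two neighbors per class, plain majority voting would tie, but the nearer pair dominates under distance weighting.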
2003
The Internet makes it possible to share and manipulate a vast quantity of information efficiently and effectively, but the rapid and chaotic growth experienced by the Net has generated a poorly organized environment that hinders the sharing and mining of useful data. The need for meaningful web-page classification techniques is therefore becoming an urgent issue. This paper describes a novel approach to web-page classification based on a fuzzy representation of web pages. A doublet representation associates a weight with each of the most representative words of the web document, so as to characterize its relevance in the document. This weight is derived by taking advantage of the characteristics of the HTML language. A fuzzy-rule-based classifier is then generated from a supervised learning process that uses a genetic algorithm to search for the minimum fuzzy-rule set that best covers the training examples. The proposed system has been demonstrated on two significantly different classes of web pages.