2004, Proceedings of the 13th international World Wide Web …
Uniform resource locators (URLs), which mark the address of a resource on the World Wide Web, are often human-readable and can hint at the category of the resource. This paper explores the use of URLs for web page categorization via a two-phase pipeline of word segmentation/expansion and classification. We quantify its performance against document-based methods, which require the retrieval of the source document.
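To make the idea of the first pipeline phase concrete, here is a minimal Python sketch of URL word segmentation, the step that splits concatenated URL tokens into readable words before they are handed to a classifier. The vocabulary, the greedy longest-match rule and the example URL are illustrative assumptions, not the paper's actual resources or algorithm.

```python
# Sketch of phase 1 of a URL categorization pipeline: tokenize the URL and
# split concatenated tokens into known words. VOCAB is a toy placeholder.
import re
from urllib.parse import urlparse

VOCAB = {"sports", "news", "world", "cup", "final", "score"}

def segment(token, vocab=VOCAB):
    """Greedy longest-match segmentation of a concatenated token into vocabulary words."""
    words, i = [], 0
    while i < len(token):
        for j in range(len(token), i, -1):
            if token[i:j] in vocab:
                words.append(token[i:j])
                i = j
                break
        else:
            i += 1  # skip characters that start no known word
    return words

def url_words(url):
    """Split a URL into raw tokens and expand the concatenated ones."""
    p = urlparse(url)
    raw = re.split(r"[\W_\d]+", (p.netloc + p.path).lower())
    out = []
    for tok in raw:
        out.extend([tok] if tok in VOCAB else segment(tok))
    return out

print(url_words("http://example.com/sportsnews/worldcupfinal.html"))
# ['sports', 'news', 'world', 'cup', 'final']
```

The resulting word list would then feed whatever classifier the second phase uses.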
Knowledge-Based Systems, 2014
Unsupervised URL-based web page classification refers to the problem of clustering the URLs in a web site so that each cluster includes a set of pages that can be classified under a single class. The existing proposals for URL-based classification suffer from a number of drawbacks: they are supervised, which requires the user to provide labelled training data and makes them difficult to scale; they are language- or domain-dependent, since they require the user to provide dictionaries of words; or they require extensive crawling, which is time- and resource-consuming. In this article, we propose a new statistical technique to mine URL patterns that are able to classify web pages. Our proposal is unsupervised, language- and domain-independent, and does not require extensive crawling. We have evaluated our proposal on 45 real-world web sites, and the results confirm that it achieves a mean precision of 98% and a mean recall of 91%, and that its performance is comparable to that of a supervised classification technique, while it does not require labelling large sets of sample pages. Furthermore, we propose a novel application that helps to extract the underlying model from non-semantic-web sites.
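For intuition, the sketch below groups a site's URLs by a structural pattern in which rarely repeated path segments (typically identifiers) are abstracted to wildcards. This is only an illustrative approximation of unsupervised URL pattern mining under assumptions of my own; it is not the statistical technique the abstract refers to.

```python
# Illustrative clustering of URLs by path pattern: frequent segments are kept
# literally, rare ones become wildcards. min_freq is an arbitrary assumption.
from collections import Counter, defaultdict
from urllib.parse import urlparse

def path_segments(url):
    return urlparse(url).path.strip("/").split("/")

def url_pattern(url, frequent_segments):
    return "/".join(p if p in frequent_segments else "*" for p in path_segments(url))

def cluster_urls(urls, min_freq=2):
    counts = Counter(p for u in urls for p in path_segments(u))
    frequent = {p for p, c in counts.items() if c >= min_freq}
    clusters = defaultdict(list)
    for u in urls:
        clusters[url_pattern(u, frequent)].append(u)
    return dict(clusters)

urls = [
    "http://shop.example/product/1234",
    "http://shop.example/product/5678",
    "http://shop.example/category/books",
    "http://shop.example/category/music",
]
for pattern, members in cluster_urls(urls).items():
    print(pattern, len(members))   # product/* 2, category/* 2
```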
Proceedings of the 14th ACM international …, 2005
We demonstrate the usefulness of the uniform resource locator (URL) alone in performing web page classification. This approach is orders of magnitude faster than typical web page classification, as the pages themselves do not have to be fetched and analyzed. Our approach segments the URL into meaningful chunks and adds component, sequential and orthographic features to model salient patterns. The resulting binary features are used in supervised maximum entropy modeling. We analyze our approach's effectiveness in binary, multi-class and hierarchical classification. Our results show that, in certain scenarios, URL-based methods approach and sometimes exceed the performance of full-text and link-based methods. We also use these features to predict the prestige of a web page (as modeled by PageRank), and show that it can be predicted with an average error of less than one point (on a ten-point scale) in a topical set of web pages.
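The following sketch shows how URL tokens can be turned into component, sequential and orthographic binary features and fed to a maximum-entropy (logistic regression) classifier. It assumes scikit-learn, and the tiny training set and feature templates are placeholders of mine, not the authors' implementation.

```python
# Hedged sketch: URL -> binary features -> maximum entropy (logistic regression).
import re
from urllib.parse import urlparse
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def url_features(url):
    parts = urlparse(url)
    tokens = [t for t in re.split(r"[\W_]+", (parts.netloc + parts.path).lower()) if t]
    feats = {}
    for i, tok in enumerate(tokens):
        feats[f"tok={tok}"] = 1.0                        # component features
        if i + 1 < len(tokens):
            feats[f"bigram={tok}_{tokens[i+1]}"] = 1.0   # sequential features
        if any(ch.isdigit() for ch in tok):
            feats["has_digit"] = 1.0                     # orthographic feature
    feats[f"tld={parts.netloc.rsplit('.', 1)[-1]}"] = 1.0
    return feats

train_urls = ["http://news.example.com/politics/story123",
              "http://news.example.com/sports/match-report",
              "http://shop.example.com/cart/checkout",
              "http://shop.example.com/product/shoes"]
train_labels = ["news", "news", "shopping", "shopping"]

vec = DictVectorizer()
X = vec.fit_transform([url_features(u) for u in train_urls])
clf = LogisticRegression(max_iter=1000).fit(X, train_labels)
print(clf.predict(vec.transform([url_features("http://shop.example.com/product/hat")])))
```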
There are some situations these days in which it is important to have an efficient and reliable classification of a web page from the information contained in the Uniform Resource Locator (URL) only, without the need to visit the page itself. For example, a social media website may need to quickly identify status updates linking to malicious websites in order to block them. The URL is very concise, and may be composed of concatenated words, so classification with only this information is a very challenging task. Methods proposed for this task, for example the all-grams approach, which extracts all possible sub-strings as features, provide reasonable accuracy but do not scale well to large datasets. We have recently proposed a new method for URL-based web page classification based on an n-gram language model, which provides competitive accuracy and scalability to larger datasets. Our method allows for the classification of new URLs with unseen sub-sequences. In this paper we extend our presentation and include additional results to validate the proposed approach. We explain the parameters associated with the n-gram language model and test their impact on the models produced. Our results show that our method is not only competitive in terms of accuracy with the best known methods but also scales well to larger datasets.
This paper is concerned with the classification of web pages using their Uniform Resource Locators (URLs) only. There are a number of contexts these days in which it is important to have an efficient and reliable classification of a web page from the URL, without the need to visit the page itself. For example, emails or messages sent in social media may contain URLs and require automatic classification. The URL is very concise, and may be composed of concatenated words, so classification with only this information is a very challenging task. Much of the current research on URL-based classification has achieved reasonable accuracy, but the current methods do not scale well with large datasets. In this paper, we propose a new solution based on the use of an n-gram language model. Our solution shows good classification performance and is scalable to larger datasets. It also allows us to tackle the problem of classifying new URLs with unseen sub-sequences.
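As a rough illustration of the n-gram language-model idea described in the two abstracts above, the sketch below trains one smoothed character n-gram model per class and assigns a URL to the class whose model gives it the highest log-probability. The toy data, add-one smoothing and n=3 are simplifying assumptions for illustration, not the authors' exact model.

```python
# One character n-gram LM per class; classify a URL by maximum log-likelihood.
import math
from collections import defaultdict

class CharNgramLM:
    def __init__(self, n=3):
        self.n, self.counts, self.context = n, defaultdict(int), defaultdict(int)

    def train(self, urls):
        for u in urls:
            s = "^" * (self.n - 1) + u.lower() + "$"
            for i in range(len(s) - self.n + 1):
                self.counts[s[i:i + self.n]] += 1
                self.context[s[i:i + self.n - 1]] += 1

    def logprob(self, url, vocab_size=128):
        s = "^" * (self.n - 1) + url.lower() + "$"
        lp = 0.0
        for i in range(len(s) - self.n + 1):
            num = self.counts[s[i:i + self.n]] + 1             # add-one smoothing
            den = self.context[s[i:i + self.n - 1]] + vocab_size
            lp += math.log(num / den)
        return lp

def classify(url, models):
    return max(models, key=lambda c: models[c].logprob(url))

models = {"news": CharNgramLM(), "shopping": CharNgramLM()}
models["news"].train(["news.example.com/politics/story", "news.example.com/sports/report"])
models["shopping"].train(["shop.example.com/cart/checkout", "shop.example.com/product/shoes"])
print(classify("shop.example.com/product/hat", models))   # expected: shopping
```

Because smoothing assigns a probability to unseen n-grams, new URLs with previously unseen sub-sequences can still be scored.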
1999
Assistance in retrieving documents on the World Wide Web is provided either by search engines, through keyword-based queries, or by catalogues, which organize documents into hierarchical collections. Maintaining catalogues manually is becoming increasingly difficult, due to the sheer amount of material on the Web; it is thus becoming necessary to resort to techniques for the automatic classification of documents. Automatic classification is traditionally performed by extracting the information for representing a document ("indexing") from the document itself. The paper describes the novel technique of categorization by context, which instead extracts useful information for classifying a document from the context where a URL referring to it appears. We present the results of experimenting with Theseus, a classifier that exploits this technique.
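To make "categorization by context" concrete, here is a small sketch that represents a linked document by the anchor text and the words surrounding the link in the referring page, rather than by the document's own content. The regexes and the window size are simplifying assumptions of mine, not the Theseus implementation.

```python
# Collect, for each <a href="..."> link, the anchor text plus a few words of
# surrounding context; these words can then index/classify the linked page.
import re

def link_contexts(html, window=5):
    contexts = {}
    plain = re.sub(r"<(?!/?a\b)[^>]+>", " ", html)   # drop tags other than <a>
    for m in re.finditer(r'<a[^>]+href="([^"]+)"[^>]*>(.*?)</a>', plain, re.I | re.S):
        url, anchor = m.group(1), m.group(2)
        before = re.findall(r"\w+", plain[:m.start()])[-window:]
        after = re.findall(r"\w+", plain[m.end():])[:window]
        contexts[url] = before + re.findall(r"\w+", anchor) + after
    return contexts

page = '<p>Read our review of the <a href="http://example.org/cam">new camera</a> lineup.</p>'
print(link_contexts(page))
# {'http://example.org/cam': ['Read', 'our', 'review', 'of', 'the', 'new', 'camera', 'lineup']}
```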
2013
Classification of web pages greatly helps in making search engines more efficient by providing relevant results to users' queries. In most of the prevailing algorithms in the literature, the classification depends solely on features extracted from the text content of the web pages. However, since most web pages nowadays are predominantly filled with images and contain little text, which may even be false or erroneous, classifying such pages using their own content alone often leads to misclassification. To solve this problem, this paper proposes an algorithm for automatically categorizing web pages with little text content based on features extracted from the URLs present in the page as well as from its own text content. Experiments on the benchmark dataset WebKB using k-NN, SVM and Naive Bayes machine learning algorithms show the effectiveness of the proposed approach.
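The sketch below illustrates the general idea of augmenting a sparse-text page with tokens mined from the URLs it contains before classification. The scikit-learn Naive Bayes pipeline (one of the three learners the abstract mentions) and the toy pages are illustrative assumptions, not the paper's actual feature set.

```python
# Combine a page's own words with tokens from the URLs it contains, then
# classify with a simple bag-of-words Naive Bayes model.
import re
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

def page_representation(text, urls):
    url_tokens = [t for u in urls for t in re.split(r"[\W_]+", u.lower()) if len(t) > 2]
    return text + " " + " ".join(url_tokens)

docs = [page_representation("gallery", ["http://travel.example/photos/beach-holiday"]),
        page_representation("charts", ["http://finance.example/stocks/quarterly-report"])]
labels = ["travel", "finance"]

vec = CountVectorizer()
clf = MultinomialNB().fit(vec.fit_transform(docs), labels)
test = page_representation("", ["http://travel.example/photos/mountain-trip"])
print(clf.predict(vec.transform([test])))   # expected: ['travel']
```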
Classification of Web page content is essential to many tasks in Web information retrieval such as maintaining Web directories and focused crawling. The uncontrolled nature of Web content presents additional challenges to Web page classification as compared to traditional text classification, but the interconnected nature of hypertext also provides features that can assist the process. As we review work in Web page classification, we note the importance of these Web-specific features and algorithms, describe state-of-the-art practices, and track the underlying assumptions behind the use of information from neighboring pages.
2004
At present, information systems combining crawling and information extraction (IE) technologies are attracting considerable research and industrial interest. In this paper, we present an algorithm that exploits techniques for unsupervised IE pattern acquisition in order to facilitate the identification of web pages containing information relevant to the IE task.
Transactions of The Japanese Society for Artificial Intelligence, 2010
Directory services are popular among people searching for information of interest on the Web. These services provide hierarchical categories for finding pages relevant to a user, and pages on the Web are categorized into one of the categories by hand. Many existing studies classify a web page using the text in the page itself. Recently, some studies have used text not only from the target page to be categorized, but also from the pages that link to it. The text in those linking pages must be narrowed down, because it includes many parts that are not related to the target page. However, these studies always use a single extraction method for all pages and do not adapt it, even though web pages differ greatly in their formats. We have developed an extraction method for anchor-related text, and we use the text parts extracted by our method for classifying web pages. The results of the experiments showed that our extraction method...
2008
Text Categorization of Commercial Web Pages. In this paper we describe a new on-line document categorization strategy that can be integrated within Web applications. A salient aspect is the use of neural learning in both representation and classification tasks.
2003
ABSTRACT Uniform resource locators (URLs), which mark the address of a resource on the World Wide Web, are often human-readable and can indicate metadata about a resource. This paper explores the mining of URLs to yield categoric metadata about web resources via a three-phase pipeline of word segmentation, abbreviation expansion and classification.
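As an illustration of the abbreviation-expansion phase mentioned here, the sketch below maps URL tokens that are not dictionary words onto words they could plausibly abbreviate, using prefix and subsequence matching. The dictionary and matching rules are hypothetical examples, not the paper's actual resources.

```python
# Toy abbreviation expansion for URL tokens: prefix match ("univ" -> "university")
# or letter-subsequence match ("dept" -> "department"). DICTIONARY is a placeholder.
DICTIONARY = ["university", "department", "computer", "science", "index"]

def is_subsequence(short, long):
    it = iter(long)
    return all(ch in it for ch in short)

def expand(token, dictionary=DICTIONARY):
    token = token.lower()
    if token in dictionary or len(token) < 3:
        return token
    for word in dictionary:
        if word.startswith(token) or is_subsequence(token, word):
            return word
    return token

print([expand(t) for t in ["univ", "dept", "cs", "index"]])
# ['university', 'department', 'cs', 'index']
```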
IEEE International Conference on Data Mining, 2002
Automatic classification of web pages is an effective way to deal with the difficulty of retrieving information from the Internet. Although there are many automatic classification algorithms and systems that have been proposed, most of them ignore the conflict between the fixed number of categories and the growing number of web pages going into the system. They also require searching
International Journal of Data Mining, Modelling and Management, 2013
The boom in the use of the Web and its exponential growth are now well known. The amount of textual data available on the Web is estimated to be on the order of one terabyte, in addition to images, audio and video. This has imposed additional challenges on Web directories, which help users search the Web by classifying selected Web documents into subject categories. Manual classification of web pages by human experts also suffers from the exponential increase in the amount of Web documents. Instead of using the entire web page to classify it, this article emphasizes the need for automatic web page classification using a minimum number of features, and proposes a method for generating such an optimal set of features. Machine learning classifiers are modelled using these optimal features. Experiments with these classifiers on benchmark datasets have shown promising improvement in classification accuracy.
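The snippet below sketches the general pattern of selecting a small, fixed number of discriminative term features before training a classifier instead of using the full page text. Chi-square selection, k=100 and the scikit-learn pipeline are assumptions used purely for illustration, not the article's specific feature-generation method; `train_pages` and `train_labels` are placeholder names.

```python
# Select only the k most class-correlated terms before classification.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

pipeline = make_pipeline(
    TfidfVectorizer(),
    SelectKBest(chi2, k=100),   # keep a small, fixed number of features
    LinearSVC(),
)
# Usage (with your own labelled pages):
# pipeline.fit(train_pages, train_labels)
# predictions = pipeline.predict(test_pages)
```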
Computer Engineering and Intelligent Systems, 2014
The World Wide Web (WWW) is growing at an uncontrollable rate, with hundreds of thousands of web sites appearing every day, which adds the challenge of keeping web directories up to date. Furthermore, the uncontrolled nature of the Web presents difficulties for web page classification. The proposed system uses a neural network technique to automatically classify web pages online according to their domain. Because classification is performed online, the system remains responsive to any change that happens to the website.
The World Wide Web provides a large amount of information in many thematic areas. Because of its dynamic nature, the information the WWW offers grows rapidly on a daily basis. Consequently, categorization of Web content is very important. The problem of Web page classification is well known in the information retrieval and machine learning communities, and it raises important questions. Although studies in subject-based classification are highly developed, we note the lack of use of semantic information as a categorization criterion. In our study, we attempt a categorization of Web pages into predefined thematic categories based on the processing of the anchor texts of Web pages. To this end, we exploit the linguistic information that semantic networks provide for English and Greek. We also note the importance of categorization features evaluated by word sense disambiguation methods.
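As a small illustration of using a semantic network to enrich anchor-text features, the sketch below maps anchor words to coarse WordNet lexicographer categories via NLTK. It assumes NLTK and its WordNet data are installed, covers only English (not the Greek resources the abstract mentions), and skips word sense disambiguation by simply taking the first noun sense.

```python
# Map anchor-text words to coarse semantic categories using WordNet (via NLTK).
# Requires: pip install nltk  and  nltk.download('wordnet') beforehand.
from nltk.corpus import wordnet as wn

def semantic_features(anchor_words):
    """Lexicographer category of each word's first noun sense, as a feature set."""
    feats = set()
    for word in anchor_words:
        for syn in wn.synsets(word, pos=wn.NOUN)[:1]:
            feats.add(syn.lexname())   # e.g. 'noun.artifact', 'noun.communication'
    return feats

print(semantic_features(["camera", "recipe", "stadium"]))
```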
Journal of Systems and Software, 2016
Web page classification refers to the problem of automatically assigning a web page to one or more classes after analysing its features. Automated web page classifiers have many applications, and many researchers have proposed techniques and tools to perform web page classification. Unfortunately, the existing tools have a number of drawbacks that make them unappealing for real-world scenarios, namely: they require extensive prior crawling, they are supervised, they need to download a page before classifying it, or they are site-, language-, or domain-dependent. In this article, we propose CALA, a tool for URL-based web page classification. The strongest features of our tool are that it does not require extensive prior crawling to achieve good classification results, it is unsupervised, and it is based exclusively on URL features, which means that pages can be classified without downloading them; furthermore, it is site-, language-, and domain-independent, which makes it generally applicable. We have validated our tool with 22 real-world web sites from multiple domains and languages, and our conclusion is that CALA is very effective and efficient in practice.
Traditional information retrieval methods use the keywords occurring in a web page to determine its class, but they often retrieve unrelated web pages. The W3C has noted that HTML does not provide a good description of the semantic structure of web page content, because of its limited semi-structured data, case sensitivity, predefined tags, and so on. To overcome these drawbacks, web developers have started to build pages with newer technologies such as XML and Flash, which opens the way for new research methods. In this article we propose a new approach based on URL and semantic analysis for classifying XML and other types of web pages.
Advances in Artificial Intelligence, 2011
Virtual integration systems require a crawler to navigate through web sites automatically, looking for relevant information. This process is online, so whilst the system is looking for the required information, the user is waiting for a response. Therefore, downloading a minimum number of irrelevant pages is mandatory to improve the crawler efficiency. Most crawlers need to download a page to determine its relevance, which results in a high number of irrelevant pages downloaded. In this paper, we propose a classifier that helps crawlers to efficiently navigate through web sites. This classifier is able to determine if a web page is relevant by analysing exclusively its URL, minimising the number of irrelevant pages downloaded, improving crawling efficiency and reducing used bandwidth, making it suitable for virtual integration systems.
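To illustrate how such a URL-only classifier plugs into a crawler, the sketch below downloads a page only when the classifier has already accepted its URL, so irrelevant pages are never fetched. The `is_relevant`, `fetch` and `extract_links` callables are hypothetical interfaces assumed for illustration, not the paper's actual components.

```python
# Breadth-first crawl in which link URLs are filtered by a URL-only relevance
# classifier before they are ever downloaded.
from collections import deque

def crawl(seed_urls, is_relevant, fetch, extract_links, limit=100):
    seen, queue, fetched = set(seed_urls), deque(seed_urls), []
    while queue and len(fetched) < limit:
        url = queue.popleft()
        html = fetch(url)                    # only accepted URLs are downloaded
        fetched.append(url)
        for link in extract_links(html, base=url):
            if link not in seen and is_relevant(link):   # decision from the URL alone
                seen.add(link)
                queue.append(link)
    return fetched
```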