Web page classification without the web page

Min-Yen Kan

2004, Proceedings of the 13th international World Wide Web …

Abstract

Uniform resource locators (URLs), which mark the address of a resource on the World Wide Web, are often human-readable and can hint at the category of the resource. This paper explores the use of URLs for web page categorization via a two-phase pipeline of word segmentation/expansion and classification. We quantify its performance against document-based methods, which require the retrieval of the source document.

Key takeaways

Consider the URL fragment "cs", which might correspond to "computer science" in the titles of a majority of the training pages in which it appears.
As these methods affect only a small percentage of the URLs (unlike the change from the baseline to the refined URL parsing), their power to enhance performance is similarly limited.
Given only the URL of a web page, can we identify its topic?
In this paper, we report our findings on web page topic classification from the URL alone, using a large collection of 1.5 million categorized web pages from the Open Directory Project [2].
This approach splits the URL into the same tokens as the method above.
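
The segmentation and expansion phase described above can be illustrated with a minimal sketch. It assumes a simple baseline that splits the URL at non-alphanumeric characters and a small hand-written expansion table; the names `tokenize_url`, `expand_tokens`, and `EXPANSIONS` are hypothetical and not taken from the paper, where expansions would instead be learned from training data such as page titles.

```python
import re
from urllib.parse import urlsplit

# Hypothetical expansion table, hard-coded for illustration only.
# In the setting described above, a mapping such as "cs" -> "computer science"
# would be derived from the titles of training pages.
EXPANSIONS = {"cs": ["computer", "science"]}


def tokenize_url(url):
    """Split a URL's host and path into lowercase alphanumeric tokens.

    The scheme, query, and fragment are ignored in this sketch.
    """
    parts = urlsplit(url)
    text = (parts.netloc + parts.path).lower()
    # Any run of non-alphanumeric characters acts as a token boundary.
    return [tok for tok in re.split(r"[^a-z0-9]+", text) if tok]


def expand_tokens(tokens, table=EXPANSIONS):
    """Replace each token that has a known expansion with its expanded words."""
    expanded = []
    for tok in tokens:
        expanded.extend(table.get(tok, [tok]))
    return expanded


if __name__ == "__main__":
    tokens = tokenize_url("http://www.cs.example.edu/~user/index.html")
    print(tokens)
    print(expand_tokens(tokens))
```

The resulting token list could then be passed as features to any standard text classifier for the second phase; the choice of classifier is left open here.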
