Web page classification without the web page

2004, Proceedings of the 13th international World Wide Web …

Abstract

Uniform resource locators (URLs), which mark the address of a resource on the World Wide Web, are often human-readable and can hint at the category of the resource. This paper explores the use of URLs for web page categorization via a two-phase pipeline of word segmentation/expansion and classification. We quantify its performance against document-based methods, which require the retrieval of the source document.

Key takeaways

  • Consider the URL fragment "cs", which might correspond to "computer science" in the titles of a majority of the training pages in which it appears.
  • As these methods only affect a small percentage of the URLs (unlike the change from the baseline to the refined URL parsing), their power to enhance performance is similarly limited.
  • Given only the URL of a web page, can we identify its topic?
  • In this paper, we report our findings on web page topic classification from the URL alone, using a large collection of 1.5 million categorized web pages from the Open Directory Project [2].
  • This approach splits the URL into the same tokens as the method above.
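
The segmentation and expansion phase mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: it assumes a baseline that splits the URL at every non-alphanumeric character, and the names tokenize_url, expand_tokens and the EXPANSIONS table are hypothetical. The "cs" entry is hard-coded here only for illustration, whereas the takeaways suggest such expansions would be derived from training-page titles.

```python
import re
from urllib.parse import urlparse

# Hypothetical abbreviation table (assumption): one hand-written entry that
# mirrors the "cs" -> "computer science" example from the text.
EXPANSIONS = {"cs": ["computer", "science"]}


def tokenize_url(url):
    """Split the host and path of a URL into lowercase alphanumeric tokens."""
    parsed = urlparse(url)
    text = f"{parsed.netloc} {parsed.path}".lower()
    # Any run of non-alphanumeric characters acts as a token boundary.
    return [tok for tok in re.split(r"[^a-z0-9]+", text) if tok]


def expand_tokens(tokens, table=EXPANSIONS):
    """Replace known abbreviations with their expansions; keep other tokens."""
    expanded = []
    for tok in tokens:
        expanded.extend(table.get(tok, [tok]))
    return expanded


# example.edu is a placeholder host, not a URL from the paper's data set.
tokens = tokenize_url("http://www.cs.example.edu/courses/index.html")
features = expand_tokens(tokens)
```

The resulting token list could then serve as bag-of-words features for whatever classifier is used in the second phase; the choice of classifier is left open here.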