Web Mining
Course Code: Year: IV Semester: VI
Prerequisites: Data Mining, Machine Learning, Database Systems
Course Description
This course introduces students to web mining: the application of data mining techniques to discover
patterns in data from the World Wide Web. Students will learn to extract and analyze web content,
structure, and usage data using machine learning and natural language processing techniques.
Course Objectives
1. To understand the fundamental concepts, scope, and types of web mining.
2. To analyze and process web content using text mining and natural language processing.
3. To explore the structure of the web through graph-based techniques and link analysis.
4. To model user behavior through web usage data and apply it in building intelligent systems.
Course Outcomes
Upon successful completion of this course, students will be able to:
CO1: Distinguish between content, structure, and usage-based web mining techniques.
CO2: Apply text and semantic analysis techniques for mining web content.
CO3: Analyze and rank websites using structural and link analysis algorithms.
CO4: Develop models for predicting user behavior and generating recommendations from web usage data.
Syllabus
Unit I: Foundations of Web Mining and Web Data (10 hours)
Introduction to Data Mining and Web Mining; Web Mining Taxonomy: Web Content Mining, Web Structure
Mining, and Web Usage Mining; Web Data Types: Structured, Semi-structured (HTML, XML), Unstructured
(text, images); Web Crawling: Architecture, Politeness Policies, Robots.txt; Indexing and Search
Engines: Basic Concepts and Architecture.
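Illustrative sketch for Unit I (an example, not prescribed course material): a polite crawler typically consults robots.txt before fetching a page. The snippet below uses Python's standard urllib.robotparser on an in-memory robots.txt. The bot name "ExampleBot", the example.com URLs, and the rules themselves are assumptions made up for the illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download it from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_parser(robots_text):
    """Parse robots.txt text into a RobotFileParser without any network access."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# "ExampleBot" is an assumed user-agent name; it falls under the "*" rules.
allowed = parser.can_fetch("ExampleBot", "https://example.com/index.html")
blocked = parser.can_fetch("ExampleBot", "https://example.com/private/data.html")
delay = parser.crawl_delay("ExampleBot")  # seconds to wait between requests, if declared

print(allowed, blocked, delay)
```

A crawler built along these lines would skip any URL for which can_fetch returns False and sleep for the declared delay between requests to the same host.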
Unit II: Web Content Mining and Text Analytics (12 hours)
Text Mining Pipeline: Tokenization, Stop-word Removal, Stemming and Lemmatization; Information
Retrieval: Vector Space Model, TF-IDF, Cosine Similarity; Document Classification and Clustering;
Advanced Text Analytics: Named Entity Recognition (NER), Topic Modeling (LDA), Sentiment Analysis
and Opinion Mining.
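Illustrative sketch for Unit II (an example, not prescribed course material): the vector space model can be tried out in plain Python. The snippet tokenizes three toy documents, drops a tiny stop-word list, weights terms with one common TF-IDF variant (relative term frequency times log(N/df)), and compares documents by cosine similarity. The documents, the stop-word list, and the exact weighting formula are assumptions chosen for the illustration; libraries such as Scikit-learn use slightly different smoothing.

```python
import math
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"the", "a", "of", "and", "is", "on", "for"}


def tokenize(text):
    """Lowercase, split on whitespace, and remove stop words."""
    return [tok for tok in text.lower().split() if tok not in STOP_WORDS]


def tfidf_vectors(docs):
    """Return one sparse {term: weight} dict per document."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        length = len(tokens) or 1  # guard against empty documents
        vectors.append(
            {term: (count / length) * math.log(n_docs / df[term])
             for term, count in tf.items()}
        )
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


docs = [
    "web mining discovers patterns on the web",
    "link analysis ranks web pages",
    "recipes for baking bread",
]
vecs = tfidf_vectors(docs)
sim_related = cosine(vecs[0], vecs[1])    # the two documents share the term "web"
sim_unrelated = cosine(vecs[0], vecs[2])  # no shared terms
print(sim_related, sim_unrelated)
```

The first two documents share a term and so get a positive similarity, while the third shares none with the first and scores zero.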
Unit III: Web Structure Mining and Link Analysis (10 hours)
Web Graph Modeling: Nodes, Edges, Hyperlink Structure; Link Analysis Algorithms: PageRank, HITS;
Community Detection: Identifying Web Communities, Authority and Hub Nodes; Social Network Analysis
Basics: Degree, Centrality, Clustering Coefficient.
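Illustrative sketch for Unit III (an example, not prescribed course material): PageRank can be approximated by power iteration over an adjacency list. The four-page graph, the damping factor of 0.85, and the fixed iteration count are assumptions for the illustration. Rank held by pages with no out-links is spread evenly over all pages, which is one common convention.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Power-iteration PageRank.

    graph maps each page to the list of pages it links to; every link
    target must also appear as a key.
    """
    nodes = list(graph)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by dangling pages (no out-links) is shared by all pages.
        dangling = sum(rank[node] for node in nodes if not graph[node])
        new_rank = {
            node: (1.0 - damping) / n + damping * dangling / n
            for node in nodes
        }
        for node in nodes:
            out_links = graph[node]
            for target in out_links:
                new_rank[target] += damping * rank[node] / len(out_links)
        rank = new_rank
    return rank


# Hypothetical toy web graph: page C receives links from A, B, and D.
graph = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}
ranks = pagerank(graph)
top_page = max(ranks, key=ranks.get)
print(top_page, ranks)
```

In this toy graph the page with the most in-links ends up with the highest score, and the scores sum to 1 because each iteration redistributes the same total rank.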
Unit IV: Web Usage Mining and Personalization (12 hours)
Web Log Files: Formats, Parsing, Data Cleaning; User Identification and Sessionization; Pattern
Discovery: Sequential Pattern Mining, Association Rules; User Profiling and Personalization;
Introduction to Recommendation Systems: Collaborative Filtering, Content-based Filtering.
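Illustrative sketch for Unit IV (an example, not prescribed course material): sessionization is often done with a time-gap heuristic, where a new session starts when a user has been inactive for longer than a timeout. The 30-minute timeout, the record layout (user_id, timestamp, url), and the toy log entries are assumptions for the illustration. Real logs would first need parsing and cleaning.

```python
from datetime import datetime, timedelta


def sessionize(records, timeout=timedelta(minutes=30)):
    """Group page requests into per-user sessions using an inactivity timeout.

    records: iterable of (user_id, timestamp, url) tuples.
    Returns {user_id: [[url, ...], ...]} with sessions in time order.
    """
    sessions = {}
    last_seen = {}
    # Sort by user, then by time, so each user's requests are processed in order.
    for user, ts, url in sorted(records, key=lambda r: (r[0], r[1])):
        if user not in last_seen or ts - last_seen[user] > timeout:
            sessions.setdefault(user, []).append([])  # start a new session
        sessions[user][-1].append(url)
        last_seen[user] = ts
    return sessions


# Hypothetical, already-parsed log records.
log = [
    ("u1", datetime(2024, 1, 1, 10, 0), "/home"),
    ("u2", datetime(2024, 1, 1, 10, 2), "/home"),
    ("u1", datetime(2024, 1, 1, 10, 5), "/products"),
    ("u1", datetime(2024, 1, 1, 11, 0), "/home"),  # 55-minute gap: new session
]
sessions = sessionize(log)
print(sessions)
```

The resulting per-user page sequences are the kind of input that sequential pattern mining and association rule discovery would then operate on.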
Textbooks:
1. Bing Liu, Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data, Springer.
2. Soumen Chakrabarti, Mining the Web: Discovering Knowledge from Hypertext Data, Morgan Kaufmann.
Reference Books:
1. Matthew Russell, Mining the Social Web, O'Reilly Media.
2. Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze, Introduction to Information
Retrieval, Cambridge University Press.
Software & Tools:
Python Libraries: NLTK, Scikit-learn, BeautifulSoup, NetworkX.
Other Tools: WEKA, Elasticsearch and Kibana.