Data Mining Notes
UNIT I:
Introduction: Data Mining – Kinds of Data and Patterns to be
Mined – Technologies Used – Kinds of Applications Targeted –
Major Issues – Data Objects and Attribute Types – Basic Statistical
Descriptions of Data – Data Preprocessing: Data Cleaning – Data
Integration – Data Reduction – Data Transformation
UNIT II:
Association Rules Mining: Introduction – Frequent Itemset
Mining Methods: Apriori Algorithm – Generating Association Rules
from Frequent Itemsets – Improving the Efficiency of Apriori – A
Pattern-Growth Approach for Mining Frequent Itemsets – Pattern
Evaluation Methods.
UNIT III:
Classification: Introduction – Basic Concepts – Logistic Regression
– Decision Tree Induction – Bayesian Classification – Rule-Based
Classification – Model Evaluation and Selection.
UNIT IV:
Cluster Analysis: Introduction – Requirements for Cluster Analysis
– Partitioning Methods: The K-Means Method – Hierarchical
Methods: Agglomerative Method – Density-Based Methods:
DBSCAN – Evaluation of Clustering: Determining the Number of
Clusters – Measuring Clustering Quality.
UNIT V:
Outlier Detection: Outliers and Outlier Analysis – Outlier
Detection Methods – Data Visualization: Pixel-Oriented
Visualization – Geometric Projection Visualization Techniques –
Icon-Based Visualization – Hierarchical Visualization – Visualizing
Complex Data and Relations
UNIT I:
Introduction: Data Mining – Kinds of Data and Patterns to be
Mined – Technologies Used – Kinds of Applications Targeted –
Major Issues – Data Objects and Attribute Types – Basic Statistical
Descriptions of Data – Data Preprocessing: Data Cleaning – Data
Integration – Data Reduction – Data Transformation
What is Data?
Data is the raw form of information, a collection of facts, figures,
symbols or observations that represent details about events,
objects or phenomena. By itself, data may appear meaningless, but
when organized, processed and interpreted, it transforms into
valuable insights that support decision-making, problem-solving
and innovation.
Importance of Data
Decision-making and insights: Organizations use data to
make better decisions. Raw data becomes useful when
transformed into insights with the help of analytics.
AI/ML and Innovation: Data is the fuel for artificial
intelligence and machine learning. More and higher-quality
data means better training, more accurate predictions.
Digital transformation: The rise of Big Data has enabled new
capabilities, from real-time analytics to personalized
services.
Kinds of Data to be Mined
1. Transactional Data:
o Definition: This type of data is generated from
transactions, such as purchases, sales, and
exchanges. Each record usually represents a single
event.
o Example: A retail store's point of sale data, where each
transaction includes details like item purchased, time
of purchase, quantity, price, and customer ID.
o Mining Focus: Frequent itemsets, association rules
(e.g., "if a customer buys item A, they are likely to buy
item B"); a small sketch after this list shows how such
a rule is scored.
2. Relational Data:
o Definition: Data stored in relational databases,
typically in tables with rows and columns. The tables
are related to each other via keys (primary or foreign).
o Example: A customer database with multiple tables
(customers, orders, payments) linked by customer ID
or order ID.
o Mining Focus: Classification, clustering, association
rules, and sometimes sequence mining.
3. Sequence Data:
o Definition: Data that represents events or items in a
sequential order. It’s particularly useful when time or
ordering is important.
o Example: Web browsing history, user activity logs, or
DNA sequences.
o Mining Focus: Sequence patterns, temporal patterns,
and time-series analysis.
4. Spatial Data:
o Definition: Data related to the geographical location of
objects, often including coordinates (latitude,
longitude) and related spatial attributes.
o Example: Mapping data, GPS tracking, satellite
imagery, or store locations.
o Mining Focus: Spatial clustering, nearest-neighbor
searches, and spatial outlier detection.
5. Text Data:
o Definition: Unstructured or semi-structured textual
data that can be mined for insights.
o Example: Social media posts, reviews, emails, articles,
and web pages.
o Mining Focus: Text mining, sentiment analysis, topic
modeling, and entity extraction.
6. Time-Series Data:
o Definition: Data that is indexed or organized based on
time. It's common in fields like finance, economics,
and operations.
o Example: Stock prices over time, sensor data, weather
data.
o Mining Focus: Trend analysis, anomaly detection,
forecasting, and pattern recognition.
7. Multimedia Data:
o Definition: Data in the form of images, audio, or video.
This data requires specialized techniques for analysis.
o Example: Images from a camera, audio clips, video
files, or multimedia logs.
o Mining Focus: Image recognition, video pattern
detection, audio feature extraction.
8. Web Data:
o Definition: Data generated from web-based activities,
including user interactions, clicks, and social media
engagement.
o Example: Website usage logs, clickstream data, or
social media feeds.
o Mining Focus: Web mining, recommendation systems,
user behavior prediction.
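To make the transactional mining focus concrete, here is a minimal sketch of how a rule such as "if a customer buys item A, they are likely to buy item B" is scored with support and confidence. The tiny transaction list is made up for illustration; Unit II covers these measures in detail.

    # A minimal sketch scoring the rule "bread -> butter" over made-up
    # transactions. Support and confidence are defined formally in Unit II.
    transactions = [
        {"bread", "butter", "milk"},
        {"bread", "butter"},
        {"bread", "jam"},
        {"milk", "jam"},
    ]

    both = sum(1 for t in transactions if {"bread", "butter"} <= t)
    bread = sum(1 for t in transactions if "bread" in t)

    support = both / len(transactions)  # fraction of transactions with both items
    confidence = both / bread           # of bread buyers, fraction also buying butter
    print(support, confidence)          # 0.5 0.666...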
Patterns to be Mined
There are several types of patterns that can be mined from data
depending on the mining goals. Here are the main types:
1. Association Patterns:
o Definition: These patterns reveal items or events that
frequently occur together in the data.
o Example: Customers who buy bread often also buy
butter.
o Mining Technique: Frequent itemset mining and
association rule algorithms such as Apriori.
2. Classification Patterns:
o Definition: These patterns distinguish data classes
and are used to predict the class label of new data.
o Example: Classifying loan applicants as low or high
risk.
o Mining Technique: Decision trees, Bayesian
classification, rule-based classifiers.
3. Clustering Patterns:
o Definition: These patterns group similar objects
together without predefined class labels.
o Example: Segmenting customers into groups with
similar buying behavior.
o Mining Technique: Partitioning, hierarchical, and
density-based clustering methods.
4. Sequential Patterns:
o Definition: These patterns capture regularities in the
order in which events or items occur.
o Example: In web usage data, users might visit certain
pages in a particular sequence (e.g., Homepage →
Product Page → Checkout).
o Mining Technique: Sequence mining algorithms like
GSP (Generalized Sequential Patterns) or SPADE
(Sequential Pattern Discovery using Equivalence
classes).
5. Outlier Patterns:
o Definition: These patterns detect data points that
deviate significantly from the norm or expected
behavior. Outliers can represent anomalies, fraud, or
errors.
o Example: Detecting fraudulent transactions in a
bank’s transaction data, or spotting rare diseases in
medical data.
o Mining Technique: Statistical tests, k-nearest
neighbors (KNN), isolation forests, or density-based
methods; a small sketch after this list illustrates a
simple statistical test.
6. Trend Patterns:
o Definition: These patterns identify trends over time,
highlighting consistent changes in data that follow a
specific direction (e.g., upward or downward).
o Example: A company’s sales data may show a positive
trend over several quarters.
o Mining Technique: Time-series analysis, regression
analysis.
7. Regression Patterns:
o Definition: Regression patterns predict a continuous
output variable based on one or more input variables.
o Example: Predicting the price of a house based on its
features such as size, location, and age.
o Mining Technique: Linear regression, polynomial
regression, decision trees.
8. Spatial Patterns:
o Definition: These patterns focus on the relationships
between spatial objects and are used to find spatial
distributions, proximity relations, or spatial trends.
o Example: Analyzing crime hotspots in a city or
identifying areas with high concentrations of stores.
o Mining Technique: Spatial data mining, clustering
techniques (e.g., DBSCAN for spatial data).
9. Anomaly Detection:
o Definition: Similar to outlier detection but broader,
this includes identifying data points that do not
conform to expected behavior or patterns across time.
o Example: Identifying unusual network traffic
indicating potential cyber security threats.
o Mining Technique: Isolation forests, clustering-based
methods, statistical analysis.
10. Summarization Patterns:
o Definition: These patterns provide concise and high-
level summaries of data to help with understanding
overall trends or distributions.
o Example: Summarizing a large dataset to find the
average sales in each region or summarizing customer
demographics in a retail store.
o Mining Technique: Aggregation, averaging, and
dimensionality reduction techniques like Principal
Component Analysis (PCA).
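As referenced in the outlier patterns entry above, here is a minimal sketch of a simple statistical outlier test: flag values whose z-score (distance from the mean in standard deviations) exceeds a threshold. The data and the threshold of 2 are illustrative only.

    # A minimal z-score outlier test on made-up values.
    import statistics

    values = [10, 11, 9, 10, 12, 11, 10, 95]  # 95 looks anomalous
    mu = statistics.mean(values)
    sigma = statistics.stdev(values)

    # Flag values more than two standard deviations from the mean.
    outliers = [v for v in values if abs(v - mu) / sigma > 2]
    print(outliers)  # [95]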
Technologies Used
Data mining draws on techniques from many other fields, such as
statistics, machine learning, pattern recognition, database and
data warehouse systems, information retrieval, and visualization.
As a highly application-driven domain, its interdisciplinary nature
is very significant, and research and development in these fields
prove quite useful in implementing data mining applications.
APPLICATIONS:
Customer segmentation
Sales forecasting
Market basket analysis
Customer relationship management (CRM)
Telecommunications
Education
Quality control
Predictive maintenance
Supply chain optimization
E-commerce and Web Applications
Recommendation systems
Personalized marketing
Clickstream analysis
Scientific Research
Major Issues in Data Mining
1. Data-Related Issues
1. Data Quality
o Real-world data may be incomplete, noisy, or
inconsistent.
o Example: Missing values, duplicate records, incorrect
entries.
o Solution: Data cleaning and preprocessing are
essential.
2. Data Integration
o Combining data from multiple sources (databases,
sensors, web logs) can cause schema conflicts or
redundancy.
o Example: Different databases may use different
formats for dates or IDs.
3. Data Selection and Transformation
o Selecting the relevant subset of data and
transforming it into a suitable format for mining is
challenging.
o Example: Choosing appropriate features for model
training.
4. Data Volume and Scalability
o The massive size of modern datasets (Big Data)
requires efficient algorithms and high computational
power.
2. Mining Methodology Issues
1. Algorithm Efficiency
o Mining algorithms must handle large, high-
dimensional datasets quickly and accurately.
o Balancing speed and accuracy is difficult.
2. Overfitting and Model Generalization
o Models may perform well on training data but poorly
on unseen data.
o Requires proper validation and tuning.
3. Pattern Evaluation
o Determining which discovered patterns are useful,
valid, and interesting is a key issue.
4. Heterogeneous and Complex Data
o Data can be in various forms: text, images, audio,
video, graphs, spatial or temporal data.
o Mining such diverse data types requires specialized
techniques.
3. Privacy, Security, and Ethical Issues
1. Data Privacy
o Mining personal or sensitive data (like medical or
financial records) can raise privacy concerns.
o Requires secure data handling and anonymization.
2. Data Security
o Protecting data from unauthorized access or breaches
during mining operations is crucial.
3. Ethical Use of Data
o Ensuring data mining results are not misused for
discrimination or manipulation.
4. Social and Legal Issues
1. Ownership of Data
o It is often unclear who owns the data and who has
the right to mine it.
2. Legal Restrictions
o Laws like GDPR (General Data Protection Regulation)
set limits on data collection and usage.
3. User Consent
o Data mining should respect the principle of informed
consent — users should know how their data is used.
5. Performance Issues
1. Scalability
o Mining systems should efficiently process terabytes or
petabytes of data.
2. Real-Time Processing
o Some applications (like fraud detection or IoT
analytics) require real-time or near-real-time mining.
3. Visualization and Interpretation
o Presenting complex results in a simple,
understandable format is often difficult.
Data Objects
Data sets are made up of data objects. A data object represents
an entity; for example, in a sales database the objects may be
customers, store items, and sales. Data objects are typically
described by attributes. If the objects are stored in a database,
they are data tuples: the rows correspond to data objects and the
columns to attributes. An attribute is a data field representing a
characteristic or feature of a data object.
Attribute types:
Nominal Attributes
A nominal attribute has values that are symbols or names of
things; each value represents some category, code, or state. The
values have no meaningful order. For example:
Attributes        Possible Values
hair_color        black, brown, blond, red, gray, white
marital_status    single, married, divorced, widowed
The central tendency of a nominal attribute can be represented
by its mode, but its mean and median are not defined.
Ordinal Attributes:
An ordinal attribute is an attribute with possible values that have
a meaningful order or ranking among them, but the magnitude
between successive values is not known.
Example: customer satisfaction ratings collected in a survey:
0: very dissatisfied
1: somewhat dissatisfied
2: neutral
3: satisfied
4: very satisfied.
The central tendency of an ordinal attribute can be represented
by its mode and its median (middle value in an ordered
sequence), but the mean cannot be defined.
Interval-Scaled Attributes
Interval-scaled attributes are measured on a scale of equal-size
units; their values have order and can be positive, zero, or
negative.
A temperature attribute is an interval attribute.
We can quantify the difference between values. For example,
a temperature of 20°C is five degrees higher than a
temperature of 15°C.
Calendar dates are another example of an interval attribute.
Temperatures in Celsius do not have a true zero point; 0°C
does not indicate "no temperature."
Calendar dates do not have a true zero point either; year 0
does not indicate the beginning of time.
Although we can compute the difference between temperature
values, we cannot talk of one temperature value as being a
multiple of another.
Without a true zero, we cannot say, for instance, that 10°C is
twice as warm as 5°C. That is, we cannot speak of the values in
terms of ratios.
The central tendency of an interval attribute can be represented
by its mode, its median (middle value in an ordered sequence),
and its mean.
Ratio Attributes
A ratio-scaled attribute is a numeric attribute with an inherent
zero point, so a value can be expressed as a multiple (or ratio) of
another value. Examples include counts, weight, height, and
monetary quantities.
The central tendency of a ratio attribute can be represented by
its mode, its median (middle value in an ordered sequence), and
its mean
Properties of Attribute Values
Distinctness: =, !=
Order: < >
Addition: + -
Multiplication: * /
Nominal attributes support only distinctness; ordinal attributes
add order; interval-scaled attributes add addition and
subtraction; ratio attributes support all four properties.
Basic Statistical Descriptions of Data
• Measures of central tendency include mean, median, mode and
midrange.
1. Mean:
• The mean of a data set is the average of all the data values. The
sample mean x̄ is the point estimator of the population mean μ.
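In symbols, for n observations $x_1, \dots, x_n$ the sample mean is

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

For the eight observations used in the median example below, x̄ = 175/8 = 21.875.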
2. Median:
• The median of a data set is the value in the middle when the
data items are arranged in ascending order. Whenever a data set
has extreme values, the median is the preferred measure of
central location.
Example: 8 observations = 26 18 29 12 14 27 30 19
Numbers in ascending order = 12, 14, 18, 19, 26, 27, 29, 30
With an even number of observations, the median is the average
of the two middle values: (19 + 26)/2 = 22.5.
3. Mode:
• The mode of a data set is the value that occurs with the greatest
frequency. The greatest frequency can occur at two or more
different values. If the data have exactly two modes, the data are
bimodal; if the data have more than two modes, the data are
multimodal.
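The measures above can be checked with Python's standard statistics module; this is a minimal sketch, with the bimodal list made up for illustration.

    # Verifying the central-tendency measures with the standard library.
    import statistics

    observations = [26, 18, 29, 12, 14, 27, 30, 19]  # data from the median example
    print(statistics.mean(observations))    # 21.875
    print(statistics.median(observations))  # 22.5 (average of 19 and 26)

    # multimode returns every value tied for the greatest frequency,
    # so it also shows whether data are bimodal or multimodal.
    print(statistics.multimode([2, 3, 3, 5, 5, 7]))  # [3, 5] -> bimodal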
Trimmed Mean:
• The trimmed mean is the mean obtained after cutting off values
at the high and low extremes.
• For example, we can sort the values and remove the top and
bottom 2% before computing the mean. We should avoid
trimming too large a portion (such as 20%) at both ends, as this
can result in the loss of valuable information.
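Here is a minimal sketch of a trimmed mean, assuming the same fraction is cut from each end after sorting; the data and the 10% cut are illustrative.

    # Trimmed mean: drop a fraction of values from each end, then average.
    def trimmed_mean(values, proportion_to_cut=0.02):
        ordered = sorted(values)
        k = int(len(ordered) * proportion_to_cut)  # values to drop per end
        kept = ordered[k:len(ordered) - k] if k > 0 else ordered
        return sum(kept) / len(kept)

    data = [1, 2, 2, 3, 3, 3, 4, 4, 5, 100]  # 100 is an extreme value
    print(trimmed_mean(data, 0.10))  # drops 1 and 100; mean of the rest = 3.25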
Data Preprocessing
Data preprocessing converts raw data into a clean, consistent
format suitable for mining. Its major tasks are data cleaning,
data integration, data transformation, and data reduction.
1. Data Cleaning: It is the process of identifying and correcting
errors or inconsistencies in the dataset. It involves handling
missing values, removing duplicates, and correcting incorrect or
outlier data to ensure the dataset is accurate and reliable. Clean
data is essential for effective analysis, as it improves the quality
of results and enhances the performance of data models (a
minimal sketch follows the list below).
Missing Values: These occur when data is absent from a
dataset. You can either ignore the rows with missing data or
fill the gaps manually, with the attribute mean, or with the
most probable value. This keeps the dataset accurate and
complete for analysis.
Noisy Data: It refers to irrelevant or incorrect data that is
difficult for machines to interpret, often caused by errors in
data collection or entry. It can be handled in several ways:
o Binning Method: The sorted data is partitioned into
equal-size bins, and each bin is smoothed by replacing
its values with the bin mean or the bin boundary
values.
o Regression: Data can be smoothed by fitting it to a
regression function, either linear or multiple, to
predict values.
o Clustering: This method groups similar data points
together; values that fall outside the clusters can be
treated as outliers and removed. These techniques
help remove noise and improve data quality.
Removing Duplicates: It involves identifying and eliminating
repeated data entries to ensure accuracy and consistency in
the dataset. This process prevents errors and ensures reliable
analysis by keeping only unique records.
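As noted above, here is a minimal data-cleaning sketch with pandas; the customer table, its column names, and the binning data are illustrative assumptions, not from the text.

    # Minimal data cleaning: missing values, duplicates, and binning.
    import pandas as pd

    df = pd.DataFrame({
        "customer_id": [1, 2, 2, 3, 4],
        "age": [25.0, None, None, 47.0, 31.0],  # missing values
        "city": ["Chennai", "Delhi", "Delhi", "Mumbai", "Pune"],
    })

    # 1. Handle missing values: fill gaps with the attribute mean.
    df["age"] = df["age"].fillna(df["age"].mean())

    # 2. Remove duplicates: keep only unique records.
    df = df.drop_duplicates()

    # 3. Smooth noisy data by binning: equal-frequency bins, smoothed
    #    by replacing each value with its bin mean.
    values = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
    bins = pd.qcut(values, q=3)  # three equal-frequency bins
    smoothed = values.groupby(bins, observed=True).transform("mean")
    print(smoothed.tolist())  # bin means: 9.0, 22.75, 29.25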
2. Data Integration: It involves merging data from various
sources into a single, unified dataset. It can be challenging due
to differences in data formats, structures, and meanings.
Techniques like record linkage and data fusion help in
combining data efficiently, ensuring consistency and accuracy.
Record Linkage is the process of identifying and matching
records from different datasets that refer to the same entity,
even if they are represented differently. It helps in combining
data from various sources by finding corresponding records
based on common identifiers or attributes.
Data Fusion involves combining data from multiple sources
to create a more comprehensive and accurate dataset. It
integrates information that may be inconsistent or incomplete
from different sources, ensuring a unified and richer dataset
for analysis.
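A minimal sketch of record linkage with pandas: matching records from two sources that refer to the same entity via a shared key. The tables and column names are made up for illustration.

    # Linking customer records to their orders on a common identifier.
    import pandas as pd

    customers = pd.DataFrame({
        "customer_id": [101, 102, 103],
        "name": ["Asha", "Ravi", "Meena"],
    })
    orders = pd.DataFrame({
        "order_id": [1, 2, 3],
        "customer_id": [102, 101, 102],
        "amount": [250.0, 99.5, 40.0],
    })

    # Records with the same customer_id are treated as the same entity.
    linked = customers.merge(orders, on="customer_id", how="inner")
    print(linked)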
3. Data Transformation: It involves converting data into a
format suitable for analysis. Common techniques include
normalization, which scales data to a common range;
standardization, which adjusts data to have zero mean and unit
variance; and discretization, which converts continuous data
into discrete categories. These techniques help prepare the data
for more accurate analysis (a minimal sketch follows the list
below).
Data Normalization: The process of scaling data to a
common range to ensure consistency across variables.
Discretization: Converting continuous data into discrete
categories for easier analysis.
Data Aggregation: Combining multiple data points into a
summary form, such as averages or totals, to simplify
analysis.
Concept Hierarchy Generation: Organizing data into a
hierarchy of concepts to provide a higher-level view for better
understanding and analysis.
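Here is a minimal sketch of these transformations using scikit-learn; the feature values are made up for illustration.

    # Normalization, standardization, and discretization with scikit-learn.
    import numpy as np
    from sklearn.preprocessing import MinMaxScaler, StandardScaler, KBinsDiscretizer

    X = np.array([[2.0], [4.0], [6.0], [100.0]])

    # Normalization: rescale values to the common range [0, 1].
    print(MinMaxScaler().fit_transform(X).ravel())

    # Standardization: zero mean and unit variance.
    print(StandardScaler().fit_transform(X).ravel())

    # Discretization: map continuous values to discrete bins.
    disc = KBinsDiscretizer(n_bins=2, encode="ordinal", strategy="uniform")
    print(disc.fit_transform(X).ravel())  # bin index per value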
4. Data Reduction: It reduces the dataset's size while
maintaining key information. This can be done through feature
selection, which chooses the most relevant features, and feature
extraction, which transforms the data into a lower-dimensional
space while preserving important details. It uses various
reduction techniques, such as the following (a minimal sketch
follows the list):
Dimensionality Reduction (e.g., Principal Component
Analysis): A technique that reduces the number of variables
in a dataset while retaining its essential information.
Numerosity Reduction: Reducing the number of data points
by methods like sampling to simplify the dataset without
losing critical patterns.
Data Compression: Reducing the size of data by encoding it
in a more compact form, making it easier to store and
process.
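Here is a minimal sketch of two reduction techniques, PCA for dimensionality reduction and random sampling for numerosity reduction; the data is synthetic.

    # Dimensionality and numerosity reduction on synthetic data.
    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(42)
    X = rng.normal(size=(1000, 10))  # 1000 objects, 10 attributes

    # Dimensionality reduction: keep the 3 components with most variance.
    X_reduced = PCA(n_components=3).fit_transform(X)
    print(X_reduced.shape)  # (1000, 3)

    # Numerosity reduction: a simple random sample of the objects.
    idx = rng.choice(len(X), size=100, replace=False)
    print(X[idx].shape)  # (100, 10)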
Uses of Data Preprocessing
Data preprocessing is utilized across various fields to ensure
that raw data is transformed into a usable format for analysis
and decision-making. Here are some key areas where data
preprocessing is applied:
1. Data Warehousing: In data warehousing, preprocessing is
essential for cleaning, integrating, and structuring data before it
is stored in a centralized repository. This ensures the data is
consistent and reliable for future queries and reporting.
2. Data Mining: Data preprocessing in data mining involves
cleaning and transforming raw data to make it suitable for
analysis. This step is crucial for identifying patterns and
extracting insights from large datasets.
3. Machine Learning: In machine learning, preprocessing
prepares raw data for model training. This includes handling
missing values, normalizing features, encoding categorical
variables, and splitting datasets into training and testing sets to
improve model performance and accuracy.
4. Data Science: Data preprocessing is a fundamental step in
data science projects, ensuring that the data used for analysis
or building predictive models is clean, structured, and relevant.
It enhances the overall quality of insights derived from the data.
5. Web Mining: In web mining, preprocessing helps analyze web
usage logs to extract meaningful user behavior patterns. This
can inform marketing strategies and improve user experience
through personalized recommendations.
6. Business Intelligence (BI): Preprocessing supports BI by
organizing and cleaning data to create dashboards and reports
that provide actionable insights for decision-makers.
7. Deep Learning: As with machine learning, deep learning
applications require preprocessing to normalize or enhance
features of the input data, optimizing model training.
Advantages of Data Preprocessing
Improved Data Quality: Ensures data is clean, consistent,
and reliable for analysis.
Better Model Performance: Reduces noise and irrelevant
data, leading to more accurate predictions and insights.
Efficient Data Analysis: Streamlines data for faster and
easier processing.
Enhanced Decision-Making: Provides clear and well-
organized data for better business decisions.
Disadvantages of Data Preprocessing
Time-Consuming: Requires significant time and effort to
clean, transform, and organize data.
Resource-Intensive: Demands computational power and
skilled personnel for complex preprocessing tasks.
Potential Data Loss: Incorrect handling may result in losing
valuable information.