Data Mining Notes
UNIT I:
Introduction: Data Mining – Kinds of Data and Patterns to be Mined – Technologies Used – Kinds of Applications Targeted – Major Issues – Data Objects and Attribute Types – Basic Statistical Descriptions of Data. Data Preprocessing: Data Cleaning – Data Integration – Data Reduction – Data Transformation.
UNIT II:
Association Rule Mining: Introduction – Frequent Itemset Mining Methods: Apriori Algorithm – Generating Association Rules from Frequent Itemsets – Improving the Efficiency of Apriori – A Pattern-Growth Approach for Mining Frequent Itemsets – Pattern Evaluation Methods.
UNIT III:
Classification: Introduction – Basic Concepts – Logistic Regression – Decision Tree Induction – Bayesian Classification – Rule-Based Classification – Model Evaluation and Selection.
UNIT IV:
Cluster Analysis: Introduction – Requirements for Cluster Analysis – Partitioning Methods: The K-Means Method – Hierarchical Methods: Agglomerative Method – Density-Based Methods: DBSCAN – Evaluation of Clustering: Determining the Number of Clusters – Measuring Clustering Quality.
UNIT V:
Outlier Detection: Outliers and Outlier Analysis – Outlier Detection Methods. Data Visualization: Pixel-Oriented Visualization – Geometric Projection Visualization Techniques – Icon-Based Visualization – Hierarchical Visualization – Visualizing Complex Data and Relations.
UNIT I:
Introduction: Data Mining – Kinds of Data and Patterns to be Mined – Technologies Used – Kinds of Applications Targeted – Major Issues – Data Objects and Attribute Types – Basic Statistical Descriptions of Data. Data Preprocessing: Data Cleaning – Data Integration – Data Reduction – Data Transformation.
What is Data?
Data is the raw form of information: a collection of facts, figures, symbols, or observations that represent details about events, objects, or phenomena. By itself, data may appear meaningless, but when organized, processed, and interpreted, it transforms into valuable insights that support decision-making, problem-solving, and innovation.
Importance of Data
Decision-making and insights: Organizations use data to
make better decisions. Raw data becomes useful when
transformed into insights with the help of analytics.
AI/ML and Innovation: Data is the fuel for artificial intelligence and machine learning; more and higher-quality data means better training and more accurate predictions.
Digital transformation: The rise of Big Data has enabled new capabilities, from real-time analytics to personalized services.
Kinds of Data to be Mined
1. Transactional Data:
o Definition: This type of data is generated from
transactions, such as purchases, sales, and
exchanges. Each record usually represents a single
event.
o Example: A retail store's point of sale data, where
each transaction includes details like item purchased,
time of purchase, quantity, price, and customer ID.
o Mining Focus: Frequent itemsets, association rules
(e.g., "if a customer buys item A, they are likely to
buy item B").
2. Relational Data:
o Definition: Data stored in relational databases,
typically in tables with rows and columns. The tables
are related to each other via keys (primary or
foreign).
o Example: A customer database with multiple tables
(customers, orders, payments) linked by customer ID
or order ID.
o Mining Focus: Classification, clustering, association
rules, and sometimes sequence mining.
3. Sequence Data:
o Definition: Data that represents events or items in a
sequential order. It’s particularly useful when time or
ordering is important.
o Example: Web browsing history, user activity logs,
or DNA sequences.
o Mining Focus: Sequence patterns, temporal
patterns, and time-series analysis.
4. Spatial Data:
o Definition: Data related to the geographical location
of objects, often including coordinates (latitude,
longitude) and related spatial attributes.
o Example: Mapping data, GPS tracking, satellite
imagery, or store locations.
o Mining Focus: Spatial clustering, nearest-neighbor
searches, and spatial outlier detection.
5. Text Data:
o Definition: Unstructured or semi-structured textual
data that can be mined for insights.
o Example: Social media posts, reviews, emails,
articles, and web pages.
o Mining Focus: Text mining, sentiment analysis,
topic modeling, and entity extraction.
6. Time-Series Data:
o Definition: Data that is indexed or organized based
on time. It's common in fields like finance,
economics, and operations.
o Example: Stock prices over time, sensor data,
weather data.
o Mining Focus: Trend analysis, anomaly detection,
forecasting, and pattern recognition.
7. Multimedia Data:
o Definition: Data in the form of images, audio, or
video. This data requires specialized techniques for
analysis.
o Example: Images from a camera, audio clips, video
files, or multimedia logs.
o Mining Focus: Image recognition, video pattern
detection, audio feature extraction.
8. Web Data:
o Definition: Data generated from web-based
activities, including user interactions, clicks, and
social media engagement.
o Example: Website usage logs, clickstream data, or
social media feeds.
o Mining Focus: Web mining, recommendation
systems, user behavior prediction.
Patterns to be Mined
4. Sequential Patterns:
o Definition: These patterns identify sequences of
events or items that commonly occur together in a
certain order over time.
o Example: In web usage data, users might visit
certain pages in a particular sequence (e.g.,
Homepage → Product Page → Checkout).
o Mining Technique: Sequence mining algorithms like
GSP (Generalized Sequential Patterns) or SPADE
(Sequential Pattern Discovery using Equivalence
classes).
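GSP and SPADE are full algorithms; as a much simpler illustration of the underlying idea, the sketch below counts how often ordered, consecutive page pairs occur across a few hypothetical browsing sessions:

```python
from collections import Counter

# Hypothetical clickstream sessions (ordered page visits).
sessions = [
    ["Homepage", "ProductPage", "Checkout"],
    ["Homepage", "Search", "ProductPage", "Checkout"],
    ["Homepage", "ProductPage"],
]

# Count how many sessions contain each ordered pair of consecutive pages.
pair_support = Counter()
for s in sessions:
    seen = set(zip(s, s[1:]))   # unique consecutive pairs within this session
    pair_support.update(seen)

for (a, b), count in pair_support.most_common(3):
    print(f"{a} -> {b} appears in {count}/{len(sessions)} sessions")
```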
5. Outlier Patterns:
o Definition: These patterns detect data points that
deviate significantly from the norm or expected
behavior. Outliers can represent anomalies, fraud, or
errors.
o Example: Detecting fraudulent transactions in a
bank’s transaction data, or spotting rare diseases in
medical data.
o Mining Technique: Statistical tests, k-nearest
neighbors (KNN), isolation forests, or density-based
methods.
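For instance, an isolation forest flags points that are easy to separate from the rest of the data. A minimal scikit-learn sketch on synthetic "transaction amounts" (the values and contamination rate are assumptions chosen for illustration):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
# Synthetic transaction amounts: mostly typical values plus a few extreme ones.
normal = rng.normal(loc=100, scale=15, size=(200, 1))
outliers = np.array([[500.0], [650.0], [-80.0]])
X = np.vstack([normal, outliers])

# contamination is the assumed fraction of outliers in the data.
clf = IsolationForest(contamination=0.02, random_state=0).fit(X)
labels = clf.predict(X)          # +1 for inliers, -1 for outliers
print("Flagged values:", X[labels == -1].ravel())
```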
6. Trend Patterns:
o Definition: These patterns identify trends over time,
highlighting consistent changes in data that follow a
specific direction (e.g., upward or downward).
o Example: A company’s sales data may show a
positive trend over several quarters.
o Mining Technique: Time-series analysis, regression
analysis.
7. Regression Patterns:
o Definition: Regression patterns predict a continuous
output variable based on one or more input variables.
o Example: Predicting the price of a house based on
its features such as size, location, and age.
o Mining Technique: Linear regression, polynomial
regression, decision trees.
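A minimal linear-regression sketch with scikit-learn, using made-up house features and prices purely for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical house data: [size in sq. ft, age in years] -> price.
X = np.array([[1000, 20], [1500, 15], [2000, 10], [2500, 5], [3000, 1]])
y = np.array([150_000, 210_000, 275_000, 340_000, 410_000])

model = LinearRegression().fit(X, y)
print("Coefficients:", model.coef_)                  # effect of each feature
print("Predicted price:", model.predict([[1800, 12]])[0])
```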
8. Spatial Patterns:
o Definition: These patterns focus on the relationships
between spatial objects and are used to find spatial
distributions, proximity relations, or spatial trends.
o Example: Analyzing crime hotspots in a city or
identifying areas with high concentrations of stores.
o Mining Technique: Spatial data mining, clustering
techniques (e.g., DBSCAN for spatial data).
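A short DBSCAN sketch on made-up 2-D coordinates: points in dense regions form clusters ("hotspots"), and isolated points are labeled -1 as spatial noise (the eps and min_samples values are illustrative choices):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical 2-D coordinates (e.g., locations of reported incidents).
points = np.array([
    [1.0, 1.1], [1.2, 0.9], [0.9, 1.0],   # dense cluster ("hotspot")
    [5.0, 5.1], [5.2, 4.9], [4.9, 5.0],   # second hotspot
    [9.0, 0.5],                            # isolated point -> noise
])

# eps: neighborhood radius; min_samples: points needed to form a dense region.
db = DBSCAN(eps=0.5, min_samples=3).fit(points)
print(db.labels_)   # cluster ids; -1 marks spatial outliers (noise)
```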
9. Anomaly Detection:
o Definition: Similar to outlier detection but broader,
this includes identifying data points that do not
conform to expected behavior or patterns across
time.
o Example: Identifying unusual network traffic
indicating potential cyber security threats.
o Mining Technique: Isolation forests, clustering-
based methods, statistical analysis.
10. Summarization Patterns:
o Definition: These patterns provide concise and high-
level summaries of data to help with understanding
overall trends or distributions.
o Example: Summarizing a large dataset to find the
average sales in each region or summarizing
customer demographics in a retail store.
o Mining Technique: Aggregation, averaging, and
dimensionality reduction techniques like Principal
Component Analysis (PCA).
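As a sketch of PCA-based summarization, the snippet below compresses five correlated synthetic attributes into two principal components and reports how much of the variance they retain (the data is randomly generated for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Hypothetical customer data: 100 customers x 5 correlated attributes.
base = rng.normal(size=(100, 2))
X = np.hstack([base, base @ rng.normal(size=(2, 3))])  # 5 columns, rank ~2

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print("Explained variance ratio:", pca.explained_variance_ratio_)
print("Reduced shape:", X_reduced.shape)   # (100, 2)
```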
Technologies Used
Data mining incorporates techniques from many other fields, such as machine learning, statistics, information retrieval, data warehousing, pattern recognition, algorithms, and high-performance computing. Because data mining is a highly application-driven domain, it is strongly interdisciplinary, and research and development in these contributing fields directly advance data mining and its applications.
APPLICATIONS:
Customer segmentation
Sales forecasting
Market basket analysis
Customer relationship management (CRM)
Telecommunications
Education
Manufacturing and Industry
Quality control
Predictive maintenance
Supply chain optimization
Recommendation systems
Personalized marketing
Clickstream analysis
Scientific Research
Major Issues in Data Mining
1. Data-Related Issues
1. Data Quality
o Real-world data may be incomplete, noisy, or
inconsistent.
o Example: Missing values, duplicate records, incorrect
entries.
o Solution: Data cleaning and preprocessing are
essential.
2. Data Integration
o Combining data from multiple sources (databases,
sensors, web logs) can cause schema conflicts or
redundancy.
o Example: Different databases may use different
formats for dates or IDs.
3. Data Selection and Transformation
o Selecting the relevant subset of data and
transforming it into a suitable format for mining is
challenging.
o Example: Choosing appropriate features for model
training.
4. Data Volume and Scalability
o The massive size of modern datasets (Big Data)
requires efficient algorithms and high computational
power.
2. Algorithm and Methodology-Related Issues
1. Algorithm Efficiency
o Mining algorithms must handle large, high-
dimensional datasets quickly and accurately.
o Balancing speed and accuracy is difficult.
2. Overfitting and Model Generalization
o Models may perform well on training data but poorly on unseen data.
o Requires proper validation and tuning (see the cross-validation sketch after this list).
3. Pattern Evaluation
o Determining which discovered patterns are useful,
valid, and interesting is a key issue.
4. Heterogeneous and Complex Data
o Data can be in various forms: text, images, audio,
video, graphs, spatial or temporal data.
o Mining such diverse data types requires specialized
techniques.
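A brief illustration of guarding against overfitting with cross-validation, using scikit-learn's built-in iris data (the depth values are arbitrary examples):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# A deep, unconstrained tree can memorize the training data (overfitting);
# limiting depth and checking 5-fold cross-validation scores guards against it.
for depth in (None, 3):
    clf = DecisionTreeClassifier(max_depth=depth, random_state=0)
    scores = cross_val_score(clf, X, y, cv=5)
    print(f"max_depth={depth}: mean CV accuracy = {scores.mean():.3f}")
```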
3. Privacy, Security, and Ethical Issues
1. Data Privacy
o Mining personal or sensitive data (like medical or
financial records) can raise privacy concerns.
o Requires secure data handling and anonymization.
2. Data Security
o Protecting data from unauthorized access or
breaches during mining operations is crucial.
3. Ethical Use of Data
o Ensuring data mining results are not misused for
discrimination or manipulation.
4. Legal and Social Issues
1. Ownership of Data
o It is often unclear who owns the data and who has
the right to mine it.
2. Legal Restrictions
o Laws like GDPR (General Data Protection Regulation)
set limits on data collection and usage.
3. User Consent
o Data mining should respect the principle of
informed consent — users should know how their
data is used.
5. Performance and Usability Issues
1. Scalability
o Mining systems should efficiently process terabytes
or petabytes of data.
2. Real-Time Processing
o Some applications (like fraud detection or IoT
analytics) require real-time or near-real-time
mining.
3. Visualization and Interpretation
o Presenting complex results in a simple,
understandable format is often difficult.
Data Objects
Data sets are made up of data objects. A data object represents an entity: for example, in a Student database, an object may be one student record, described by attributes such as name, roll number, and marks. Attributes describe these data objects.
In a relational database, we can think of data objects as
the rows of the database and columns as the attributes.
Data objects can also be referred to as examples, samples,
instances, data points, or objects. Next, we clearly define
what an attribute is and the different types of attributes.
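A tiny pandas illustration of this view, with a hypothetical Student table in which each row is a data object and each column an attribute:

```python
import pandas as pd

# A hypothetical Student table: each row is one data object (a student),
# each column is an attribute describing the objects.
students = pd.DataFrame({
    "roll_number": [101, 102, 103],
    "name": ["Asha", "Ravi", "Meena"],
    "marks": [88, 76, 91],
})
print(students)
print("One data object (row):")
print(students.iloc[0])
```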
Attribute Types:
Nominal Attributes
The values of nominal attributes are names of things or categories; they do not have any meaningful order.
Example: hair color (black, brown, blond, gray) or occupation (teacher, doctor, farmer).
Binary Attributes
A binary attribute is a nominal attribute with only two categories or states: 0 or 1.
Example: Test results for a COVID patient: Positive (1) and Negative (0).
By convention, we code the most important outcome, which is usually the rarest one, by 1 (e.g., COVID positive) and the other by 0 (e.g., COVID negative).
Ordinal Attributes:
The values of ordinal attributes have a meaningful order or ranking among them, but the magnitude between successive values is not known.
Example: customer satisfaction survey responses:
0: very dissatisfied
1: somewhat dissatisfied
2: neutral
3: satisfied
4: very satisfied
The central tendency of an ordinal attribute can be represented
by its mode and its median (middle value in an ordered
sequence), but the mean cannot be defined.
Interval-Scaled Attributes
Interval-scaled attributes are measured on a scale of equal-sized units. Their values have order and can be positive, zero, or negative, so we can compare and subtract them, but there is no true zero-point, so we cannot say one value is a multiple of another. Example: temperature in Celsius, calendar dates.
Ratio Attributes
A ratio attribute is a numeric attribute with an inherent zero-point. If a measurement is ratio-scaled, we can speak of a value as being a multiple (or ratio) of another value.
Examples:
- A count attribute, such as years of experience for an employee object.
- Attributes that measure weight, height, and latitude and longitude coordinates.
- With an amount attribute we can say "you are 100 times richer with $100 than with $1."
The central tendency of a ratio attribute can be represented by its mode, its median (middle value in an ordered sequence), and its mean.
Properties of Attribute Values
Distinctness: =, !=
Order: <, >
Addition: +, -
Multiplication: *, /
Nominal attributes support only distinctness; ordinal attributes add order; interval-scaled attributes add addition and subtraction; ratio attributes support all four.
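A small pandas sketch of an ordinal attribute: an ordered categorical supports the order property (comparisons, min/max) even though its values are labels rather than numbers (the survey responses are made up for illustration):

```python
import pandas as pd

# Satisfaction is ordinal: its values have a meaningful order, so < and >
# comparisons make sense, but differences between levels do not.
levels = ["very dissatisfied", "somewhat dissatisfied", "neutral",
          "satisfied", "very satisfied"]
responses = pd.Categorical(
    ["neutral", "satisfied", "very dissatisfied", "satisfied"],
    categories=levels, ordered=True)

s = pd.Series(responses)
print(s.min(), "<", s.max())   # order is defined for an ordered categorical
print("Mode:", s.mode()[0])    # central tendency via the mode
```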
Basic Statistical Descriptions of Data
• Basic statistical descriptions can be used to identify
properties of the data and highlight which data values should
be treated as noise or outliers.
1. Mean:
• The mean of a data set is the average of all the data values. The sample mean $\bar{x}$ is the point estimator of the population mean $\mu$:
$\bar{x} = \dfrac{\sum_{i=1}^{n} x_i}{n} = \dfrac{\text{sum of the values of the } n \text{ observations}}{\text{number of observations in the sample}}$
2. Median:
• The median of a data set is the value in the middle when the data items are arranged in ascending order. Whenever a data set has extreme values, the median is the preferred measure of central location.
Example: 8 observations: 26, 18, 29, 12, 14, 27, 30, 19
In ascending order: 12, 14, 18, 19, 26, 27, 29, 30
With an even number of observations, the median is the average of the two middle values: Median = (19 + 26) / 2 = 22.5
3. Mode:
• The mode of a data set is the value that occurs with the greatest frequency. The greatest frequency can occur at two or more different values. If the data have exactly two modes, the data are bimodal; if the data have more than two modes, the data are multimodal.
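These three measures on the observations above, plus a made-up bimodal list for the mode, using Python's statistics module:

```python
import statistics

data = [26, 18, 29, 12, 14, 27, 30, 19]     # the 8 observations from the text
print("Mean:", statistics.mean(data))        # 21.875
print("Median:", statistics.median(data))    # 22.5 (average of 19 and 26)

grades = [70, 85, 85, 90, 90, 60]            # a made-up bimodal example
print("Modes:", statistics.multimode(grades))  # [85, 90]
```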
• Weighted mean: Sometimes each value in a set is associated with a weight; the weights reflect the significance, importance, or occurrence frequency attached to their respective values.
• Trimmed mean: a mean obtained after chopping off values at the high and low extremes. For example, we can sort the values and remove the top and bottom 2% before computing the mean. We should avoid trimming too large a portion (such as 20%) at both ends, as this can result in the loss of valuable information.
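Both ideas in a few lines of Python (the values and weights are made up; a 20% trim is used here only because the sample is tiny, whereas the text recommends a small fraction such as 2% on real data):

```python
import numpy as np
from scipy import stats

values = np.array([50, 60, 70, 80, 1000])   # hypothetical data with one extreme
weights = np.array([1, 2, 4, 2, 1])         # e.g., occurrence frequencies

print("Weighted mean:", np.average(values, weights=weights))
# Trimmed mean: drop the lowest and highest 20% before averaging.
print("Trimmed mean:", stats.trim_mean(values, proportiontocut=0.2))
```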
Steps in Data Preprocessing
Some key steps in data preprocessing are Data Cleaning, Data
Integration, Data Transformation, and Data Reduction.
1. Data Cleaning: It involves detecting and correcting problems in raw data, such as handling missing values, smoothing noisy data, and removing duplicates.
Removing Duplicates: It involves identifying and
eliminating repeated data entries to ensure accuracy and
consistency in the dataset. This process prevents errors and
ensures reliable analysis by keeping only unique records.
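A small pandas sketch of both cleaning steps on a made-up table: dropping a duplicate row and imputing missing values:

```python
import pandas as pd
import numpy as np

# Hypothetical raw records with a duplicate row and missing values.
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 3],
    "city": ["Chennai", "Delhi", "Delhi", None],
    "amount": [250.0, 400.0, 400.0, np.nan],
})

clean = (df.drop_duplicates()                          # remove repeated entries
           .fillna({"city": "Unknown",
                    "amount": df["amount"].mean()}))   # impute missing values
print(clean)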
2. Data Integration: It involves merging data from various
sources into a single, unified dataset. It can be challenging due
to differences in data formats, structures, and meanings.
Techniques like record linkage and data fusion help in
combining data efficiently, ensuring consistency and accuracy.
Record Linkage is the process of identifying and matching
records from different datasets that refer to the same
entity, even if they are represented differently. It helps in
combining data from various sources by finding
corresponding records based on common identifiers or
attributes.
Data Fusion involves combining data from multiple
sources to create a more comprehensive and accurate
dataset. It integrates information that may be inconsistent
or incomplete from different sources, ensuring a unified and
richer dataset for analysis.
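A minimal record-linkage-style sketch with pandas: two hypothetical sources are matched on a shared customer_id key and merged into one table:

```python
import pandas as pd

# Two hypothetical sources that share customer_id as the linking key.
customers = pd.DataFrame({"customer_id": [1, 2, 3],
                          "name": ["Asha", "Ravi", "Meena"]})
orders = pd.DataFrame({"order_id": [10, 11, 12],
                       "customer_id": [1, 1, 3],
                       "total": [250.0, 90.0, 410.0]})

# A simple record-linkage step: match rows from both sources on the key.
unified = customers.merge(orders, on="customer_id", how="left")
print(unified)
```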
3. Data Transformation: It involves converting data into a
format suitable for analysis. Common techniques include
normalization, which scales data to a common range;
standardization, which adjusts data to have zero mean and
unit variance; and discretization, which converts continuous
data into discrete categories. These techniques help prepare
the data for more accurate analysis.
Data Normalization: The process of scaling data to a
common range to ensure consistency across variables.
Discretization: Converting continuous data into discrete
categories for easier analysis.
Data Aggregation: Combining multiple data points into a
summary form, such as averages or totals, to simplify
analysis.
Concept Hierarchy Generation: Organizing data into a
hierarchy of concepts to provide a higher-level view for
better understanding and analysis.
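The main transformation techniques in a short sketch (the age values and bin boundaries are illustrative): min-max normalization, standardization, and discretization into labeled bins:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

ages = np.array([[18], [25], [32], [47], [61]])   # hypothetical values

print(MinMaxScaler().fit_transform(ages).ravel())    # normalize to [0, 1]
print(StandardScaler().fit_transform(ages).ravel())  # zero mean, unit variance

# Discretization: convert the continuous ages into labeled bins.
bins = pd.cut(ages.ravel(), bins=[0, 30, 50, 120],
              labels=["young", "middle", "senior"])
print(list(bins))
```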
4. Data Reduction: It reduces the dataset's size while
maintaining key information. This can be done through feature
selection, which chooses the most relevant features, and
feature extraction, which transforms the data into a lower-
dimensional space while preserving important details. It uses
various reduction techniques such as,
Dimensionality Reduction (e.g., Principal Component
Analysis): A technique that reduces the number of
variables in a dataset while retaining its essential
information.
Numerosity Reduction: Reducing the number of data
points by methods like sampling to simplify the dataset
without losing critical patterns.
Data Compression: Reducing the size of data by encoding
it in a more compact form, making it easier to store and
process.
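A one-step numerosity-reduction sketch: random sampling with pandas keeps only 1% of the (synthetic) rows while approximately preserving the overall distribution:

```python
import pandas as pd
import numpy as np

rng = np.random.default_rng(7)
# Hypothetical large dataset of 10,000 sensor readings.
df = pd.DataFrame({"reading": rng.normal(50, 5, size=10_000)})

# Numerosity reduction: keep a 1% random sample as a stand-in for the whole.
sample = df.sample(frac=0.01, random_state=7)
print(len(sample), "rows;",
      "sample mean =", round(sample["reading"].mean(), 2),
      "vs full mean =", round(df["reading"].mean(), 2))
```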
Uses of Data Preprocessing
Data preprocessing is utilized across various fields to ensure
that raw data is transformed into a usable format for analysis
and decision-making. Here are some key areas where data
preprocessing is applied:
1. Data Warehousing: In data warehousing, preprocessing is
essential for cleaning, integrating, and structuring data before
it is stored in a centralized repository. This ensures the data is
consistent and reliable for future queries and reporting.
2. Data Mining: Data preprocessing in data mining involves
cleaning and transforming raw data to make it suitable for
analysis. This step is crucial for identifying patterns and
extracting insights from large datasets.
3. Machine Learning: In machine learning, preprocessing
prepares raw data for model training. This includes handling
missing values, normalizing features, encoding categorical
variables, and splitting datasets into training and testing sets
to improve model performance and accuracy.
4. Data Science: Data preprocessing is a fundamental step in
data science projects, ensuring that the data used for analysis
or building predictive models is clean, structured, and
relevant. It enhances the overall quality of insights derived
from the data.
5. Web Mining: In web mining, preprocessing helps analyze
web usage logs to extract meaningful user behavior patterns.
This can inform marketing strategies and improve user
experience through personalized recommendations.
6. Business Intelligence (BI): Preprocessing supports BI by
organizing and cleaning data to create dashboards and reports
that provide actionable insights for decision-makers.
7. Deep Learning: Similar to machine learning, deep learning applications require preprocessing to normalize or enhance features of the input data, optimizing the model training process.
Advantages of Data Preprocessing
Improved Data Quality: Ensures data is clean, consistent,
and reliable for analysis.
Better Model Performance: Reduces noise and irrelevant
data, leading to more accurate predictions and insights.
Efficient Data Analysis: Streamlines data for faster and
easier processing.
Enhanced Decision-Making: Provides clear and well-
organized data for better business decisions.
Disadvantages of Data Preprocessing
Time-Consuming: Requires significant time and effort to
clean, transform, and organize data.
Resource-Intensive: Demands computational power and
skilled personnel for complex preprocessing tasks.
Potential Data Loss: Incorrect handling may result in
losing valuable information.