Data Mining Notes
UNIT I:
Introduction: Data Mining – Kinds of Data and Patterns to be
Mined – Technologies Used – Kinds of Applications Targeted –
Major Issues – Data Objects and Attribute Types – Basic Statistical
Descriptions of Data – Data Preprocessing: Data Cleaning – Data
Integration – Data Reduction – Data Transformation
UNIT II:
Association Rules Mining: Introduction – Frequent Itemset
Mining Methods: Apriori Algorithm – Generating Association Rules
from Frequent Itemsets – Improving the Efficiency of Apriori – A
Pattern-Growth Approach for Mining Frequent Itemsets – Pattern
Evaluation Methods.
UNIT III:
Classification: Introduction – Basic Concepts – Logistic Regression
– Decision Tree Induction – Bayesian Classification – Rule-Based
Classification – Model Evaluation and Selection.
UNIT IV:
Cluster Analysis: Introduction – Requirements for Cluster Analysis
– Partitioning Methods: The K-Means Method – Hierarchical
Methods: Agglomerative Method – Density-Based Methods:
DBSCAN – Evaluation of Clustering: Determining the Number of
Clusters – Measuring Clustering Quality.
UNIT V:
Outlier Detection: Outliers and Outlier Analysis – Outlier
Detection Methods – Data Visualization: Pixel-Oriented
Visualization – Geometric Projection Visualization Techniques –
Icon-Based Visualization – Hierarchical Visualization – Visualizing
Complex Data and Relations
UNIT I:
Introduction: Data Mining – Kinds of Data and Patterns to be
Mined – Technologies Used – Kinds of Applications Targeted –
Major Issues – Data Objects and Attribute Types – Basic Statistical
Descriptions of Data – Data Preprocessing: Data Cleaning – Data
Integration – Data Reduction – Data Transformation
What is Data?
Data is the raw form of information, a collection of facts, figures,
symbols or observations that represent details about events,
objects or phenomena. By itself, data may appear meaningless, but
when organized, processed and interpreted, it transforms into
valuable insights that support decision-making, problem-solving
and innovation.
Importance of Data
Decision-making and insights: Organizations use data to
make better decisions. Raw data becomes useful when
transformed into insights with the help of analytics.
AI/ML and Innovation: Data is the fuel for artificial
intelligence and machine learning. More and higher-quality
data means better training, more accurate predictions.
Digital transformation: The rise of Big Data has enabled new
capabilities, from real-time analytics to personalized
services.
Kinds of Data to be Mined
1. Transactional Data:
o Definition: This type of data is generated from
transactions, such as purchases, sales, and
exchanges. Each record usually represents a single
event.
o Example: A retail store's point of sale data, where each
transaction includes details like item purchased, time
of purchase, quantity, price, and customer ID.
o Mining Focus: Frequent itemsets, association rules
(e.g., "if a customer buys item A, they are likely to buy
item B"); a small sketch after this list shows how such
a rule is scored.
2. Relational Data:
o Definition: Data stored in relational databases,
typically in tables with rows and columns. The tables
are related to each other via keys (primary or foreign).
o Example: A customer database with multiple tables
(customers, orders, payments) linked by customer ID
or order ID.
o Mining Focus: Classification, clustering, association
rules, and sometimes sequence mining.
3. Sequence Data:
o Definition: Data that represents events or items in a
sequential order. It’s particularly useful when time or
ordering is important.
o Example: Web browsing history, user activity logs, or
DNA sequences.
o Mining Focus: Sequence patterns, temporal patterns,
and time-series analysis.
4. Spatial Data:
o Definition: Data related to the geographical location of
objects, often including coordinates (latitude,
longitude) and related spatial attributes.
o Example: Mapping data, GPS tracking, satellite
imagery, or store locations.
o Mining Focus: Spatial clustering, nearest-neighbor
searches, and spatial outlier detection.
5. Text Data:
o Definition: Unstructured or semi-structured textual
data that can be mined for insights.
o Example: Social media posts, reviews, emails, articles,
and web pages.
o Mining Focus: Text mining, sentiment analysis, topic
modeling, and entity extraction.
6. Time-Series Data:
o Definition: Data that is indexed or organized based on
time. It's common in fields like finance, economics,
and operations.
o Example: Stock prices over time, sensor data, weather
data.
o Mining Focus: Trend analysis, anomaly detection,
forecasting, and pattern recognition.
7. Multimedia Data:
o Definition: Data in the form of images, audio, or video.
This data requires specialized techniques for analysis.
o Example: Images from a camera, audio clips, video
files, or multimedia logs.
o Mining Focus: Image recognition, video pattern
detection, audio feature extraction.
8. Web Data:
o Definition: Data generated from web-based activities,
including user interactions, clicks, and social media
engagement.
o Example: Website usage logs, clickstream data, or
social media feeds.
o Mining Focus: Web mining, recommendation systems,
user behavior prediction.
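To make the transactional mining focus concrete, here is a minimal sketch of how a rule such as "if a customer buys item A, they are likely to buy item B" is scored with support and confidence. The tiny transaction list is made up for illustration; Unit II covers these measures in detail.

    # A minimal sketch scoring the rule "bread -> butter" over made-up
    # transactions. Support and confidence are defined formally in Unit II.
    transactions = [
        {"bread", "butter", "milk"},
        {"bread", "butter"},
        {"bread", "jam"},
        {"milk", "jam"},
    ]

    both = sum(1 for t in transactions if {"bread", "butter"} <= t)
    bread = sum(1 for t in transactions if "bread" in t)

    support = both / len(transactions)  # fraction of transactions with both items
    confidence = both / bread           # of bread buyers, fraction also buying butter
    print(support, confidence)          # 0.5 0.666...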
Patterns to be Mined
There are several types of patterns that can be mined from data
depending on the mining goals. Here are the main types:
1. Association Patterns:
o Definition: These patterns reveal items or events that
frequently occur together in the data.
o Example: Customers who buy bread often also buy
butter.
o Mining Technique: Frequent itemset mining and
association rule algorithms such as Apriori.
2. Classification Patterns:
o Definition: These patterns distinguish data classes
and are used to predict the class label of new data.
o Example: Classifying loan applicants as low or high
risk.
o Mining Technique: Decision trees, Bayesian
classification, rule-based classifiers.
3. Clustering Patterns:
o Definition: These patterns group similar objects
together without predefined class labels.
o Example: Segmenting customers into groups with
similar buying behavior.
o Mining Technique: Partitioning, hierarchical, and
density-based clustering methods.
4. Sequential Patterns:
o Definition: These patterns capture regularities in the
order in which events or items occur.
o Example: In web usage data, users might visit certain
pages in a particular sequence (e.g., Homepage →
Product Page → Checkout).
o Mining Technique: Sequence mining algorithms like
GSP (Generalized Sequential Patterns) or SPADE
(Sequential Pattern Discovery using Equivalence
classes).
5. Outlier Patterns:
o Definition: These patterns detect data points that
deviate significantly from the norm or expected
behavior. Outliers can represent anomalies, fraud, or
errors.
o Example: Detecting fraudulent transactions in a
bank’s transaction data, or spotting rare diseases in
medical data.
o Mining Technique: Statistical tests, k-nearest
neighbors (KNN), isolation forests, or density-based
methods; a small sketch after this list illustrates a
simple statistical test.
6. Trend Patterns:
o Definition: These patterns identify trends over time,
highlighting consistent changes in data that follow a
specific direction (e.g., upward or downward).
o Example: A company’s sales data may show a positive
trend over several quarters.
o Mining Technique: Time-series analysis, regression
analysis.
7. Regression Patterns:
o Definition: Regression patterns predict a continuous
output variable based on one or more input variables.
o Example: Predicting the price of a house based on its
features such as size, location, and age.
o Mining Technique: Linear regression, polynomial
regression, decision trees.
8. Spatial Patterns:
o Definition: These patterns focus on the relationships
between spatial objects and are used to find spatial
distributions, proximity relations, or spatial trends.
o Example: Analyzing crime hotspots in a city or
identifying areas with high concentrations of stores.
o Mining Technique: Spatial data mining, clustering
techniques (e.g., DBSCAN for spatial data).
9. Anomaly Detection:
o Definition: Similar to outlier detection but broader,
this includes identifying data points that do not
conform to expected behavior or patterns across time.
o Example: Identifying unusual network traffic
indicating potential cyber security threats.
o Mining Technique: Isolation forests, clustering-based
methods, statistical analysis.
10. Summarization Patterns:
o Definition: These patterns provide concise and high-
level summaries of data to help with understanding
overall trends or distributions.
o Example: Summarizing a large dataset to find the
average sales in each region or summarizing customer
demographics in a retail store.
o Mining Technique: Aggregation, averaging, and
dimensionality reduction techniques like Principal
Component Analysis (PCA).
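As referenced in the outlier patterns entry above, here is a minimal sketch of a simple statistical outlier test: flag values whose z-score (distance from the mean in standard deviations) exceeds a threshold. The data and the threshold of 2 are illustrative only.

    # A minimal z-score outlier test on made-up values.
    import statistics

    values = [10, 11, 9, 10, 12, 11, 10, 95]  # 95 looks anomalous
    mu = statistics.mean(values)
    sigma = statistics.stdev(values)

    # Flag values more than two standard deviations from the mean.
    outliers = [v for v in values if abs(v - mu) / sigma > 2]
    print(outliers)  # [95]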
Technologies Used
Data mining draws on techniques from many other fields, such as
statistics, machine learning, pattern recognition, database and
data warehouse systems, information retrieval, and visualization.
As a highly application-driven domain, its interdisciplinary nature
is very significant, and research and development in these fields
prove quite useful in implementing data mining applications.
APPLICATIONS:
Customer segmentation
Sales forecasting
Market basket analysis
Customer relationship management (CRM)
Telecommunications
Education
Quality control
Predictive maintenance
Supply chain optimization
E-commerce and Web Applications
Recommendation systems
Personalized marketing
Clickstream analysis
Scientific Research
Major Issues in Data Mining
1. Data-Related Issues
1. Data Quality
o Real-world data may be incomplete, noisy, or
inconsistent.
o Example: Missing values, duplicate records, incorrect
entries.
o Solution: Data cleaning and preprocessing are
essential.
2. Data Integration
o Combining data from multiple sources (databases,
sensors, web logs) can cause schema conflicts or
redundancy.
o Example: Different databases may use different
formats for dates or IDs.
3. Data Selection and Transformation
o Selecting the relevant subset of data and
transforming it into a suitable format for mining is
challenging.
o Example: Choosing appropriate features for model
training.
4. Data Volume and Scalability
o The massive size of modern datasets (Big Data)
requires efficient algorithms and high computational
power.
2. Mining Methodology Issues
1. Algorithm Efficiency
o Mining algorithms must handle large, high-
dimensional datasets quickly and accurately.
o Balancing speed and accuracy is difficult.
2. Overfitting and Model Generalization
o Models may perform well on training data but poorly
on unseen data.
o Requires proper validation and tuning.
3. Pattern Evaluation
o Determining which discovered patterns are useful,
valid, and interesting is a key issue.
4. Heterogeneous and Complex Data
o Data can be in various forms: text, images, audio,
video, graphs, spatial or temporal data.
o Mining such diverse data types requires specialized
techniques.
3. Privacy, Security, and Ethical Issues
1. Data Privacy
o Mining personal or sensitive data (like medical or
financial records) can raise privacy concerns.
o Requires secure data handling and anonymization.
2. Data Security
o Protecting data from unauthorized access or breaches
during mining operations is crucial.
3. Ethical Use of Data
o Ensuring data mining results are not misused for
discrimination or manipulation.
4. Social and Legal Issues
1. Ownership of Data
o It is often unclear who owns the data and who has
the right to mine it.
2. Legal Restrictions
o Laws like GDPR (General Data Protection Regulation)
set limits on data collection and usage.
3. User Consent
o Data mining should respect the principle of informed
consent — users should know how their data is used.
5. Performance Issues
1. Scalability
o Mining systems should efficiently process terabytes or
petabytes of data.
2. Real-Time Processing
o Some applications (like fraud detection or IoT
analytics) require real-time or near-real-time mining.
3. Visualization and Interpretation
o Presenting complex results in a simple,
understandable format is often difficult.
Data Objects
Data sets are made up of data objects. A data object represents
an entity; for example, in a sales database the objects may be
customers, store items, and sales. Data objects are typically
described by attributes. If the objects are stored in a database,
they are data tuples: the rows correspond to data objects and the
columns to attributes. An attribute is a data field representing a
characteristic or feature of a data object.
Attribute types:
Nominal Attributes
A nominal attribute has values that are symbols or names of
things; each value represents some category, code, or state. The
values have no meaningful order. For example:
Attributes        Possible Values
hair_color        black, brown, blond, red, gray, white
marital_status    single, married, divorced, widowed
The central tendency of a nominal attribute can be represented
by its mode, but its mean and median are not defined.
Ordinal Attributes:
An ordinal attribute is an attribute with possible values that have
a meaningful order or ranking among them, but the magnitude
between successive values is not known.
Example: customer satisfaction ratings collected in a survey:
0: very dissatisfied
1: somewhat dissatisfied
2: neutral
3: satisfied
4: very satisfied.
The central tendency of an ordinal attribute can be represented
by its mode and its median (middle value in an ordered
sequence), but the mean cannot be defined.
Interval-Scaled Attributes
Interval-scaled attributes are measured on a scale of equal-size
units; their values have order and can be positive, zero, or
negative.
A temperature attribute is an interval attribute.
We can quantify the difference between values. For example,
a temperature of 20°C is five degrees higher than a
temperature of 15°C.
Calendar dates are another example of an interval attribute.
Temperatures in Celsius do not have a true zero point; 0°C
does not indicate "no temperature."
Calendar dates do not have a true zero point either; year 0
does not indicate the beginning of time.
Although we can compute the difference between temperature
values, we cannot talk of one temperature value as being a
multiple of another.
Without a true zero, we cannot say, for instance, that 10°C is
twice as warm as 5°C. That is, we cannot speak of the values in
terms of ratios.
The central tendency of an interval attribute can be represented
by its mode, its median (middle value in an ordered sequence),
and its mean.
Ratio Attributes
A ratio-scaled attribute is a numeric attribute with an inherent
zero point, so a value can be expressed as a multiple (or ratio) of
another value. Examples include counts, weight, height, and
monetary quantities.
The central tendency of a ratio attribute can be represented by
its mode, its median (middle value in an ordered sequence), and
its mean
Properties of Attribute Values
Distinctness: =, !=
Order: < >
Addition: + -
Multiplication: * /
Nominal attributes support only distinctness; ordinal attributes
add order; interval-scaled attributes add addition and
subtraction; ratio attributes support all four properties.
Basic Statistical Descriptions of Data
• Measures of central tendency include mean, median, mode and
midrange.
1. Mean:
• The mean of a data set is the average of all the data values. The
sample mean x̄ is the point estimator of the population mean μ.
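In symbols, for n observations $x_1, \dots, x_n$ the sample mean is

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

For the eight observations used in the median example below, x̄ = 175/8 = 21.875.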
2. Median:
• The median of a data set is the value in the middle when the
data items are arranged in ascending order. Whenever a data set
has extreme values, the median is the preferred measure of
central location.
Example: 8 observations = 26 18 29 12 14 27 30 19
Numbers in ascending order = 12, 14, 18, 19, 26, 27, 29, 30
With an even number of observations, the median is the average
of the two middle values: (19 + 26)/2 = 22.5.
3. Mode:
• The mode of a data set is the value that occurs with the greatest
frequency. The greatest frequency can occur at two or more
different values. If the data have exactly two modes, the data are
bimodal; if the data have more than two modes, the data are
multimodal.
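The measures above can be checked with Python's standard statistics module; this is a minimal sketch, with the bimodal list made up for illustration.

    # Verifying the central-tendency measures with the standard library.
    import statistics

    observations = [26, 18, 29, 12, 14, 27, 30, 19]  # data from the median example
    print(statistics.mean(observations))    # 21.875
    print(statistics.median(observations))  # 22.5 (average of 19 and 26)

    # multimode returns every value tied for the greatest frequency,
    # so it also shows whether data are bimodal or multimodal.
    print(statistics.multimode([2, 3, 3, 5, 5, 7]))  # [3, 5] -> bimodal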
Trimmed Mean:
• The trimmed mean is the mean obtained after cutting off values
at the high and low extremes.
• For example, we can sort the values and remove the top and
bottom 2% before computing the mean. We should avoid
trimming too large a portion (such as 20%) at both ends, as this
can result in the loss of valuable information.
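Here is a minimal sketch of a trimmed mean, assuming the same fraction is cut from each end after sorting; the data and the 10% cut are illustrative.

    # Trimmed mean: drop a fraction of values from each end, then average.
    def trimmed_mean(values, proportion_to_cut=0.02):
        ordered = sorted(values)
        k = int(len(ordered) * proportion_to_cut)  # values to drop per end
        kept = ordered[k:len(ordered) - k] if k > 0 else ordered
        return sum(kept) / len(kept)

    data = [1, 2, 2, 3, 3, 3, 4, 4, 5, 100]  # 100 is an extreme value
    print(trimmed_mean(data, 0.10))  # drops 1 and 100; mean of the rest = 3.25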
Data Preprocessing
Data preprocessing converts raw data into a clean, consistent
format suitable for mining. Its major tasks are data cleaning,
data integration, data transformation, and data reduction.
1. Data Cleaning: It is the process of identifying and correcting
errors or inconsistencies in the dataset. It involves handling
missing values, removing duplicates, and correcting incorrect or
outlier data to ensure the dataset is accurate and reliable. Clean
data is essential for effective analysis, as it improves the quality
of results and enhances the performance of data models (a
minimal sketch follows the list below).
Missing Values: These occur when data is absent from a
dataset. You can either ignore the rows with missing data or
fill the gaps manually, with the attribute mean, or with the
most probable value. This keeps the dataset accurate and
complete for analysis.
Noisy Data: It refers to irrelevant or incorrect data that is
difficult for machines to interpret, often caused by errors in
data collection or entry. It can be handled in several ways:
o Binning Method: The sorted data is partitioned into
equal-size bins, and each bin is smoothed by replacing
its values with the bin mean or the bin boundary
values.
o Regression: Data can be smoothed by fitting it to a
regression function, either linear or multiple, to
predict values.
o Clustering: This method groups similar data points
together; values that fall outside the clusters can be
treated as outliers and removed. These techniques
help remove noise and improve data quality.
Removing Duplicates: It involves identifying and eliminating
repeated data entries to ensure accuracy and consistency in
the dataset. This process prevents errors and ensures reliable
analysis by keeping only unique records.
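As noted above, here is a minimal data-cleaning sketch with pandas; the customer table, its column names, and the binning data are illustrative assumptions, not from the text.

    # Minimal data cleaning: missing values, duplicates, and binning.
    import pandas as pd

    df = pd.DataFrame({
        "customer_id": [1, 2, 2, 3, 4],
        "age": [25.0, None, None, 47.0, 31.0],  # missing values
        "city": ["Chennai", "Delhi", "Delhi", "Mumbai", "Pune"],
    })

    # 1. Handle missing values: fill gaps with the attribute mean.
    df["age"] = df["age"].fillna(df["age"].mean())

    # 2. Remove duplicates: keep only unique records.
    df = df.drop_duplicates()

    # 3. Smooth noisy data by binning: equal-frequency bins, smoothed
    #    by replacing each value with its bin mean.
    values = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
    bins = pd.qcut(values, q=3)  # three equal-frequency bins
    smoothed = values.groupby(bins, observed=True).transform("mean")
    print(smoothed.tolist())  # bin means: 9.0, 22.75, 29.25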
2. Data Integration: It involves merging data from various
sources into a single, unified dataset. It can be challenging due
to differences in data formats, structures, and meanings.
Techniques like record linkage and data fusion help in
combining data efficiently, ensuring consistency and accuracy.
Record Linkage is the process of identifying and matching
records from different datasets that refer to the same entity,
even if they are represented differently. It helps in combining
data from various sources by finding corresponding records
based on common identifiers or attributes.
Data Fusion involves combining data from multiple sources
to create a more comprehensive and accurate dataset. It
integrates information that may be inconsistent or incomplete
from different sources, ensuring a unified and richer dataset
for analysis.
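A minimal sketch of record linkage with pandas: matching records from two sources that refer to the same entity via a shared key. The tables and column names are made up for illustration.

    # Linking customer records to their orders on a common identifier.
    import pandas as pd

    customers = pd.DataFrame({
        "customer_id": [101, 102, 103],
        "name": ["Asha", "Ravi", "Meena"],
    })
    orders = pd.DataFrame({
        "order_id": [1, 2, 3],
        "customer_id": [102, 101, 102],
        "amount": [250.0, 99.5, 40.0],
    })

    # Records with the same customer_id are treated as the same entity.
    linked = customers.merge(orders, on="customer_id", how="inner")
    print(linked)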
3. Data Transformation: It involves converting data into a
format suitable for analysis. Common techniques include
normalization, which scales data to a common range;
standardization, which adjusts data to have zero mean and unit
variance; and discretization, which converts continuous data
into discrete categories. These techniques help prepare the data
for more accurate analysis (a minimal sketch follows the list
below).
Data Normalization: The process of scaling data to a
common range to ensure consistency across variables.
Discretization: Converting continuous data into discrete
categories for easier analysis.
Data Aggregation: Combining multiple data points into a
summary form, such as averages or totals, to simplify
analysis.
Concept Hierarchy Generation: Organizing data into a
hierarchy of concepts to provide a higher-level view for better
understanding and analysis.
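Here is a minimal sketch of these transformations using scikit-learn; the feature values are made up for illustration.

    # Normalization, standardization, and discretization with scikit-learn.
    import numpy as np
    from sklearn.preprocessing import MinMaxScaler, StandardScaler, KBinsDiscretizer

    X = np.array([[2.0], [4.0], [6.0], [100.0]])

    # Normalization: rescale values to the common range [0, 1].
    print(MinMaxScaler().fit_transform(X).ravel())

    # Standardization: zero mean and unit variance.
    print(StandardScaler().fit_transform(X).ravel())

    # Discretization: map continuous values to discrete bins.
    disc = KBinsDiscretizer(n_bins=2, encode="ordinal", strategy="uniform")
    print(disc.fit_transform(X).ravel())  # bin index per value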
4. Data Reduction: It reduces the dataset's size while
maintaining key information. This can be done through feature
selection, which chooses the most relevant features, and feature
extraction, which transforms the data into a lower-dimensional
space while preserving important details. It uses various
reduction techniques, such as the following (a minimal sketch
follows the list):
Dimensionality Reduction (e.g., Principal Component
Analysis): A technique that reduces the number of variables
in a dataset while retaining its essential information.
Numerosity Reduction: Reducing the number of data points
by methods like sampling to simplify the dataset without
losing critical patterns.
Data Compression: Reducing the size of data by encoding it
in a more compact form, making it easier to store and
process.
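Here is a minimal sketch of two reduction techniques, PCA for dimensionality reduction and random sampling for numerosity reduction; the data is synthetic.

    # Dimensionality and numerosity reduction on synthetic data.
    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(42)
    X = rng.normal(size=(1000, 10))  # 1000 objects, 10 attributes

    # Dimensionality reduction: keep the 3 components with most variance.
    X_reduced = PCA(n_components=3).fit_transform(X)
    print(X_reduced.shape)  # (1000, 3)

    # Numerosity reduction: a simple random sample of the objects.
    idx = rng.choice(len(X), size=100, replace=False)
    print(X[idx].shape)  # (100, 10)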
Uses of Data Preprocessing
Data preprocessing is utilized across various fields to ensure
that raw data is transformed into a usable format for analysis
and decision-making. Here are some key areas where data
preprocessing is applied:
1. Data Warehousing: In data warehousing, preprocessing is
essential for cleaning, integrating, and structuring data before it
is stored in a centralized repository. This ensures the data is
consistent and reliable for future queries and reporting.
2. Data Mining: Data preprocessing in data mining involves
cleaning and transforming raw data to make it suitable for
analysis. This step is crucial for identifying patterns and
extracting insights from large datasets.
3. Machine Learning: In machine learning, preprocessing
prepares raw data for model training. This includes handling
missing values, normalizing features, encoding categorical
variables, and splitting datasets into training and testing sets to
improve model performance and accuracy.
4. Data Science: Data preprocessing is a fundamental step in
data science projects, ensuring that the data used for analysis
or building predictive models is clean, structured, and relevant.
It enhances the overall quality of insights derived from the data.
5. Web Mining: In web mining, preprocessing helps analyze web
usage logs to extract meaningful user behavior patterns. This
can inform marketing strategies and improve user experience
through personalized recommendations.
6. Business Intelligence (BI): Preprocessing supports BI by
organizing and cleaning data to create dashboards and reports
that provide actionable insights for decision-makers.
7. Deep Learning: As with machine learning, deep learning
applications require preprocessing to normalize or enhance
features of the input data, optimizing model training.
Advantages of Data Preprocessing
Improved Data Quality: Ensures data is clean, consistent,
and reliable for analysis.
Better Model Performance: Reduces noise and irrelevant
data, leading to more accurate predictions and insights.
Efficient Data Analysis: Streamlines data for faster and
easier processing.
Enhanced Decision-Making: Provides clear and well-
organized data for better business decisions.
Disadvantages of Data Preprocessing
Time-Consuming: Requires significant time and effort to
clean, transform, and organize data.
Resource-Intensive: Demands computational power and
skilled personnel for complex preprocessing tasks.
Potential Data Loss: Incorrect handling may result in losing
valuable information.