Data Mining Notes
UNIT I:
Introduction: Data Mining – Kinds of Data and Patterns to be Mined – Technologies Used – Kinds of Applications Targeted – Major Issues – Data Objects and Attribute Types – Basic Statistical Descriptions of Data. Data Preprocessing: Data Cleaning – Data Integration – Data Reduction – Data Transformation.
UNIT II:
Association Rule Mining: Introduction – Frequent Itemset Mining Methods: Apriori Algorithm – Generating Association Rules from Frequent Itemsets – Improving the Efficiency of Apriori – A Pattern-Growth Approach for Mining Frequent Itemsets – Pattern Evaluation Methods.
UNIT III:
Classification: Introduction – Basic Concepts – Logistic Regression – Decision Tree Induction – Bayesian Classification – Rule-Based Classification – Model Evaluation and Selection.
UNIT IV:
Cluster Analysis: Introduction – Requirements for Cluster Analysis – Partitioning Methods: The K-Means Method – Hierarchical Methods: Agglomerative Method – Density-Based Methods: DBSCAN – Evaluation of Clustering: Determining the Number of Clusters – Measuring Clustering Quality.
UNIT V:
Outlier Detection: Outliers and Outlier Analysis – Outlier Detection Methods. Data Visualization: Pixel-Oriented Visualization – Geometric Projection Visualization Techniques – Icon-Based Visualization – Hierarchical Visualization – Visualizing Complex Data and Relations.
UNIT I:
Introduction: Data Mining – Kinds of Data and Patterns to be Mined – Technologies Used – Kinds of Applications Targeted – Major Issues – Data Objects and Attribute Types – Basic Statistical Descriptions of Data. Data Preprocessing: Data Cleaning – Data Integration – Data Reduction – Data Transformation.
What is Data?
Data is the raw form of information: a collection of facts, figures, symbols, or observations that represent details about events, objects, or phenomena. By itself, data may appear meaningless, but when organized, processed, and interpreted, it transforms into valuable insights that support decision-making, problem-solving, and innovation.
Importance of Data
Decision-making and insights: Organizations use data to
make better decisions. Raw data becomes useful when
transformed into insights with the help of analytics.
AI/ML and Innovation: Data is the fuel for artificial intelligence and machine learning; more and higher-quality data means better training and more accurate predictions.
Digital transformation: The rise of Big Data has enabled new capabilities, from real-time analytics to personalized services.
Kinds of Data to be Mined
1. Transactional Data:
o Definition: This type of data is generated from
transactions, such as purchases, sales, and
exchanges. Each record usually represents a single
event.
o Example: A retail store's point of sale data, where
each transaction includes details like item purchased,
time of purchase, quantity, price, and customer ID.
o Mining Focus: Frequent itemsets, association rules
(e.g., "if a customer buys item A, they are likely to
buy item B").
2. Relational Data:
o Definition: Data stored in relational databases,
typically in tables with rows and columns. The tables
are related to each other via keys (primary or
foreign).
o Example: A customer database with multiple tables
(customers, orders, payments) linked by customer ID
or order ID.
o Mining Focus: Classification, clustering, association
rules, and sometimes sequence mining.
3. Sequence Data:
o Definition: Data that represents events or items in a
sequential order. It’s particularly useful when time or
ordering is important.
o Example: Web browsing history, user activity logs,
or DNA sequences.
o Mining Focus: Sequence patterns, temporal
patterns, and time-series analysis.
4. Spatial Data:
o Definition: Data related to the geographical location
of objects, often including coordinates (latitude,
longitude) and related spatial attributes.
o Example: Mapping data, GPS tracking, satellite
imagery, or store locations.
o Mining Focus: Spatial clustering, nearest-neighbor
searches, and spatial outlier detection.
5. Text Data:
o Definition: Unstructured or semi-structured textual
data that can be mined for insights.
o Example: Social media posts, reviews, emails,
articles, and web pages.
o Mining Focus: Text mining, sentiment analysis,
topic modeling, and entity extraction.
6. Time-Series Data:
o Definition: Data that is indexed or organized based
on time. It's common in fields like finance,
economics, and operations.
o Example: Stock prices over time, sensor data,
weather data.
o Mining Focus: Trend analysis, anomaly detection,
forecasting, and pattern recognition.
7. Multimedia Data:
o Definition: Data in the form of images, audio, or
video. This data requires specialized techniques for
analysis.
o Example: Images from a camera, audio clips, video
files, or multimedia logs.
o Mining Focus: Image recognition, video pattern
detection, audio feature extraction.
8. Web Data:
o Definition: Data generated from web-based
activities, including user interactions, clicks, and
social media engagement.
o Example: Website usage logs, clickstream data, or
social media feeds.
o Mining Focus: Web mining, recommendation
systems, user behavior prediction.
Patterns to be Mined
4. Sequential Patterns:
o Definition: These patterns identify sequences of
events or items that commonly occur together in a
certain order over time.
o Example: In web usage data, users might visit
certain pages in a particular sequence (e.g.,
Homepage → Product Page → Checkout).
o Mining Technique: Sequence mining algorithms like
GSP (Generalized Sequential Patterns) or SPADE
(Sequential Pattern Discovery using Equivalence
classes).
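GSP and SPADE are full algorithms; as a much simpler illustration of the underlying idea, the sketch below counts how often ordered, consecutive page pairs occur across a few hypothetical browsing sessions:

```python
from collections import Counter

# Hypothetical clickstream sessions (ordered page visits).
sessions = [
    ["Homepage", "ProductPage", "Checkout"],
    ["Homepage", "Search", "ProductPage", "Checkout"],
    ["Homepage", "ProductPage"],
]

# Count how many sessions contain each ordered pair of consecutive pages.
pair_support = Counter()
for s in sessions:
    seen = set(zip(s, s[1:]))   # unique consecutive pairs within this session
    pair_support.update(seen)

for (a, b), count in pair_support.most_common(3):
    print(f"{a} -> {b} appears in {count}/{len(sessions)} sessions")
```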
5. Outlier Patterns:
o Definition: These patterns detect data points that
deviate significantly from the norm or expected
behavior. Outliers can represent anomalies, fraud, or
errors.
o Example: Detecting fraudulent transactions in a
bank’s transaction data, or spotting rare diseases in
medical data.
o Mining Technique: Statistical tests, k-nearest
neighbors (KNN), isolation forests, or density-based
methods.
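For instance, an isolation forest flags points that are easy to separate from the rest of the data. A minimal scikit-learn sketch on synthetic "transaction amounts" (the values and contamination rate are assumptions chosen for illustration):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
# Synthetic transaction amounts: mostly typical values plus a few extreme ones.
normal = rng.normal(loc=100, scale=15, size=(200, 1))
outliers = np.array([[500.0], [650.0], [-80.0]])
X = np.vstack([normal, outliers])

# contamination is the assumed fraction of outliers in the data.
clf = IsolationForest(contamination=0.02, random_state=0).fit(X)
labels = clf.predict(X)          # +1 for inliers, -1 for outliers
print("Flagged values:", X[labels == -1].ravel())
```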
6. Trend Patterns:
o Definition: These patterns identify trends over time,
highlighting consistent changes in data that follow a
specific direction (e.g., upward or downward).
o Example: A company’s sales data may show a
positive trend over several quarters.
o Mining Technique: Time-series analysis, regression
analysis.
7. Regression Patterns:
o Definition: Regression patterns predict a continuous
output variable based on one or more input variables.
o Example: Predicting the price of a house based on
its features such as size, location, and age.
o Mining Technique: Linear regression, polynomial
regression, decision trees.
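A minimal linear-regression sketch with scikit-learn, using made-up house features and prices purely for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical house data: [size in sq. ft, age in years] -> price.
X = np.array([[1000, 20], [1500, 15], [2000, 10], [2500, 5], [3000, 1]])
y = np.array([150_000, 210_000, 275_000, 340_000, 410_000])

model = LinearRegression().fit(X, y)
print("Coefficients:", model.coef_)                  # effect of each feature
print("Predicted price:", model.predict([[1800, 12]])[0])
```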
8. Spatial Patterns:
o Definition: These patterns focus on the relationships
between spatial objects and are used to find spatial
distributions, proximity relations, or spatial trends.
o Example: Analyzing crime hotspots in a city or
identifying areas with high concentrations of stores.
o Mining Technique: Spatial data mining, clustering
techniques (e.g., DBSCAN for spatial data).
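A short DBSCAN sketch on made-up 2-D coordinates: points in dense regions form clusters ("hotspots"), and isolated points are labeled -1 as spatial noise (the eps and min_samples values are illustrative choices):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical 2-D coordinates (e.g., locations of reported incidents).
points = np.array([
    [1.0, 1.1], [1.2, 0.9], [0.9, 1.0],   # dense cluster ("hotspot")
    [5.0, 5.1], [5.2, 4.9], [4.9, 5.0],   # second hotspot
    [9.0, 0.5],                            # isolated point -> noise
])

# eps: neighborhood radius; min_samples: points needed to form a dense region.
db = DBSCAN(eps=0.5, min_samples=3).fit(points)
print(db.labels_)   # cluster ids; -1 marks spatial outliers (noise)
```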
9. Anomaly Detection:
o Definition: Similar to outlier detection but broader,
this includes identifying data points that do not
conform to expected behavior or patterns across
time.
o Example: Identifying unusual network traffic
indicating potential cyber security threats.
o Mining Technique: Isolation forests, clustering-
based methods, statistical analysis.
10. Summarization Patterns:
o Definition: These patterns provide concise and high-
level summaries of data to help with understanding
overall trends or distributions.
o Example: Summarizing a large dataset to find the
average sales in each region or summarizing
customer demographics in a retail store.
o Mining Technique: Aggregation, averaging, and
dimensionality reduction techniques like Principal
Component Analysis (PCA).
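As a sketch of PCA-based summarization, the snippet below compresses five correlated synthetic attributes into two principal components and reports how much of the variance they retain (the data is randomly generated for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Hypothetical customer data: 100 customers x 5 correlated attributes.
base = rng.normal(size=(100, 2))
X = np.hstack([base, base @ rng.normal(size=(2, 3))])  # 5 columns, rank ~2

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print("Explained variance ratio:", pca.explained_variance_ratio_)
print("Reduced shape:", X_reduced.shape)   # (100, 2)
```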
Technologies Used
Data mining incorporates techniques from many other fields, such as machine learning, statistics, information retrieval, data warehousing, pattern recognition, algorithms, and high-performance computing. Because data mining is a highly application-driven domain, it is strongly interdisciplinary, and research and development in these contributing fields directly advance data mining and its applications.
APPLICATIONS:
Customer segmentation
Sales forecasting
Market basket analysis
Customer relationship management (CRM)
Telecommunications
Education
Manufacturing and Industry
Quality control
Predictive maintenance
Supply chain optimization
Recommendation systems
Personalized marketing
Clickstream analysis
Scientific Research
Major Issues in Data Mining
1. Data-Related Issues
1. Data Quality
o Real-world data may be incomplete, noisy, or
inconsistent.
o Example: Missing values, duplicate records, incorrect
entries.
o Solution: Data cleaning and preprocessing are
essential.
2. Data Integration
o Combining data from multiple sources (databases,
sensors, web logs) can cause schema conflicts or
redundancy.
o Example: Different databases may use different
formats for dates or IDs.
3. Data Selection and Transformation
o Selecting the relevant subset of data and
transforming it into a suitable format for mining is
challenging.
o Example: Choosing appropriate features for model
training.
4. Data Volume and Scalability
o The massive size of modern datasets (Big Data)
requires efficient algorithms and high computational
power.
2. Algorithm and Methodology-Related Issues
1. Algorithm Efficiency
o Mining algorithms must handle large, high-
dimensional datasets quickly and accurately.
o Balancing speed and accuracy is difficult.
2. Overfitting and Model Generalization
o Models may perform well on training data but poorly on unseen data.
o Requires proper validation and tuning (see the cross-validation sketch after this list).
3. Pattern Evaluation
o Determining which discovered patterns are useful,
valid, and interesting is a key issue.
4. Heterogeneous and Complex Data
o Data can be in various forms: text, images, audio,
video, graphs, spatial or temporal data.
o Mining such diverse data types requires specialized
techniques.
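A brief illustration of guarding against overfitting with cross-validation, using scikit-learn's built-in iris data (the depth values are arbitrary examples):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# A deep, unconstrained tree can memorize the training data (overfitting);
# limiting depth and checking 5-fold cross-validation scores guards against it.
for depth in (None, 3):
    clf = DecisionTreeClassifier(max_depth=depth, random_state=0)
    scores = cross_val_score(clf, X, y, cv=5)
    print(f"max_depth={depth}: mean CV accuracy = {scores.mean():.3f}")
```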
3. Privacy, Security, and Ethical Issues
1. Data Privacy
o Mining personal or sensitive data (like medical or
financial records) can raise privacy concerns.
o Requires secure data handling and anonymization.
2. Data Security
o Protecting data from unauthorized access or
breaches during mining operations is crucial.
3. Ethical Use of Data
o Ensuring data mining results are not misused for
discrimination or manipulation.
4. Legal and Social Issues
1. Ownership of Data
o It is often unclear who owns the data and who has
the right to mine it.
2. Legal Restrictions
o Laws like GDPR (General Data Protection Regulation)
set limits on data collection and usage.
3. User Consent
o Data mining should respect the principle of
informed consent — users should know how their
data is used.
5. Performance and Usability Issues
1. Scalability
o Mining systems should efficiently process terabytes
or petabytes of data.
2. Real-Time Processing
o Some applications (like fraud detection or IoT
analytics) require real-time or near-real-time
mining.
3. Visualization and Interpretation
o Presenting complex results in a simple,
understandable format is often difficult.
Data Objects
Data sets are made up of data objects. A data object represents an entity: for example, in a Student database, an object may be one student record, described by attributes such as name, roll number, and marks. Attributes describe these data objects.
In a relational database, we can think of data objects as
the rows of the database and columns as the attributes.
Data objects can also be referred to as examples, samples,
instances, data points, or objects. Next, we clearly define
what an attribute is and the different types of attributes.
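A tiny pandas illustration of this view, with a hypothetical Student table in which each row is a data object and each column an attribute:

```python
import pandas as pd

# A hypothetical Student table: each row is one data object (a student),
# each column is an attribute describing the objects.
students = pd.DataFrame({
    "roll_number": [101, 102, 103],
    "name": ["Asha", "Ravi", "Meena"],
    "marks": [88, 76, 91],
})
print(students)
print("One data object (row):")
print(students.iloc[0])
```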
Attribute Types:
Nominal Attributes
The values of nominal attributes are names of things or categories; they do not have any meaningful order.
Example: hair color (black, brown, blond, gray) or occupation (teacher, doctor, farmer).
Binary Attributes
A binary attribute is a nominal attribute with only two categories or states: 0 or 1.
Example: Test results for a COVID patient: Positive (1) and Negative (0).
By convention, we code the most important outcome, which is usually the rarest one, by 1 (e.g., COVID positive) and the other by 0 (e.g., COVID negative).
Ordinal Attributes:
The values of ordinal attributes have a meaningful order or ranking among them, but the magnitude between successive values is not known.
Example: customer satisfaction survey responses:
0: very dissatisfied
1: somewhat dissatisfied
2: neutral
3: satisfied
4: very satisfied
The central tendency of an ordinal attribute can be represented
by its mode and its median (middle value in an ordered
sequence), but the mean cannot be defined.
Interval-Scaled Attributes
Interval-scaled attributes are measured on a scale of equal-sized units. Their values have order and can be positive, zero, or negative, so we can compare and subtract them, but there is no true zero-point, so we cannot say one value is a multiple of another. Example: temperature in Celsius, calendar dates.
Ratio Attributes
A ratio attribute is a numeric attribute with an inherent zero-point. If a measurement is ratio-scaled, we can speak of a value as being a multiple (or ratio) of another value.
Examples:
- A count attribute, such as years of experience for an employee object.
- Attributes that measure weight, height, and latitude and longitude coordinates.
- With an amount attribute we can say "you are 100 times richer with $100 than with $1."
The central tendency of a ratio attribute can be represented by its mode, its median (middle value in an ordered sequence), and its mean.
Properties of Attribute Values
Distinctness: =, !=
Order: <, >
Addition: +, -
Multiplication: *, /
Nominal attributes support only distinctness; ordinal attributes add order; interval-scaled attributes add addition and subtraction; ratio attributes support all four.
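A small pandas sketch of an ordinal attribute: an ordered categorical supports the order property (comparisons, min/max) even though its values are labels rather than numbers (the survey responses are made up for illustration):

```python
import pandas as pd

# Satisfaction is ordinal: its values have a meaningful order, so < and >
# comparisons make sense, but differences between levels do not.
levels = ["very dissatisfied", "somewhat dissatisfied", "neutral",
          "satisfied", "very satisfied"]
responses = pd.Categorical(
    ["neutral", "satisfied", "very dissatisfied", "satisfied"],
    categories=levels, ordered=True)

s = pd.Series(responses)
print(s.min(), "<", s.max())   # order is defined for an ordered categorical
print("Mode:", s.mode()[0])    # central tendency via the mode
```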
Basic Statistical Descriptions of Data
• Basic statistical descriptions can be used to identify
properties of the data and highlight which data values should
be treated as noise or outliers.
1. Mean:
• The mean of a data set is the average of all the data values. The sample mean $\bar{x}$ is the point estimator of the population mean $\mu$:
$\bar{x} = \dfrac{\sum_{i=1}^{n} x_i}{n} = \dfrac{\text{sum of the values of the } n \text{ observations}}{\text{number of observations in the sample}}$
2. Median:
• The median of a data set is the value in the middle when the data items are arranged in ascending order. Whenever a data set has extreme values, the median is the preferred measure of central location.
Example: 8 observations: 26, 18, 29, 12, 14, 27, 30, 19
In ascending order: 12, 14, 18, 19, 26, 27, 29, 30
With an even number of observations, the median is the average of the two middle values: Median = (19 + 26) / 2 = 22.5
3. Mode:
• The mode of a data set is the value that occurs with the greatest frequency. The greatest frequency can occur at two or more different values. If the data have exactly two modes, the data are bimodal; if the data have more than two modes, the data are multimodal.
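These three measures on the observations above, plus a made-up bimodal list for the mode, using Python's statistics module:

```python
import statistics

data = [26, 18, 29, 12, 14, 27, 30, 19]     # the 8 observations from the text
print("Mean:", statistics.mean(data))        # 21.875
print("Median:", statistics.median(data))    # 22.5 (average of 19 and 26)

grades = [70, 85, 85, 90, 90, 60]            # a made-up bimodal example
print("Modes:", statistics.multimode(grades))  # [85, 90]
```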
• Weighted mean: Sometimes each value in a set is associated with a weight; the weights reflect the significance, importance, or occurrence frequency attached to their respective values.
• Trimmed mean: a mean obtained after chopping off values at the high and low extremes. For example, we can sort the values and remove the top and bottom 2% before computing the mean. We should avoid trimming too large a portion (such as 20%) at both ends, as this can result in the loss of valuable information.
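Both ideas in a few lines of Python (the values and weights are made up; a 20% trim is used here only because the sample is tiny, whereas the text recommends a small fraction such as 2% on real data):

```python
import numpy as np
from scipy import stats

values = np.array([50, 60, 70, 80, 1000])   # hypothetical data with one extreme
weights = np.array([1, 2, 4, 2, 1])         # e.g., occurrence frequencies

print("Weighted mean:", np.average(values, weights=weights))
# Trimmed mean: drop the lowest and highest 20% before averaging.
print("Trimmed mean:", stats.trim_mean(values, proportiontocut=0.2))
```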
Steps in Data Preprocessing
Some key steps in data preprocessing are Data Cleaning, Data
Integration, Data Transformation, and Data Reduction.
1. Data Cleaning: It involves detecting and correcting problems in raw data, such as handling missing values, smoothing noisy data, and removing duplicates.
Removing Duplicates: It involves identifying and
eliminating repeated data entries to ensure accuracy and
consistency in the dataset. This process prevents errors and
ensures reliable analysis by keeping only unique records.
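A small pandas sketch of both cleaning steps on a made-up table: dropping a duplicate row and imputing missing values:

```python
import pandas as pd
import numpy as np

# Hypothetical raw records with a duplicate row and missing values.
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 3],
    "city": ["Chennai", "Delhi", "Delhi", None],
    "amount": [250.0, 400.0, 400.0, np.nan],
})

clean = (df.drop_duplicates()                          # remove repeated entries
           .fillna({"city": "Unknown",
                    "amount": df["amount"].mean()}))   # impute missing values
print(clean)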
2. Data Integration: It involves merging data from various
sources into a single, unified dataset. It can be challenging due
to differences in data formats, structures, and meanings.
Techniques like record linkage and data fusion help in
combining data efficiently, ensuring consistency and accuracy.
Record Linkage is the process of identifying and matching
records from different datasets that refer to the same
entity, even if they are represented differently. It helps in
combining data from various sources by finding
corresponding records based on common identifiers or
attributes.
Data Fusion involves combining data from multiple
sources to create a more comprehensive and accurate
dataset. It integrates information that may be inconsistent
or incomplete from different sources, ensuring a unified and
richer dataset for analysis.
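A minimal record-linkage-style sketch with pandas: two hypothetical sources are matched on a shared customer_id key and merged into one table:

```python
import pandas as pd

# Two hypothetical sources that share customer_id as the linking key.
customers = pd.DataFrame({"customer_id": [1, 2, 3],
                          "name": ["Asha", "Ravi", "Meena"]})
orders = pd.DataFrame({"order_id": [10, 11, 12],
                       "customer_id": [1, 1, 3],
                       "total": [250.0, 90.0, 410.0]})

# A simple record-linkage step: match rows from both sources on the key.
unified = customers.merge(orders, on="customer_id", how="left")
print(unified)
```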
3. Data Transformation: It involves converting data into a
format suitable for analysis. Common techniques include
normalization, which scales data to a common range;
standardization, which adjusts data to have zero mean and
unit variance; and discretization, which converts continuous
data into discrete categories. These techniques help prepare
the data for more accurate analysis.
Data Normalization: The process of scaling data to a
common range to ensure consistency across variables.
Discretization: Converting continuous data into discrete
categories for easier analysis.
Data Aggregation: Combining multiple data points into a
summary form, such as averages or totals, to simplify
analysis.
Concept Hierarchy Generation: Organizing data into a
hierarchy of concepts to provide a higher-level view for
better understanding and analysis.
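The main transformation techniques in a short sketch (the age values and bin boundaries are illustrative): min-max normalization, standardization, and discretization into labeled bins:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

ages = np.array([[18], [25], [32], [47], [61]])   # hypothetical values

print(MinMaxScaler().fit_transform(ages).ravel())    # normalize to [0, 1]
print(StandardScaler().fit_transform(ages).ravel())  # zero mean, unit variance

# Discretization: convert the continuous ages into labeled bins.
bins = pd.cut(ages.ravel(), bins=[0, 30, 50, 120],
              labels=["young", "middle", "senior"])
print(list(bins))
```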
4. Data Reduction: It reduces the dataset's size while
maintaining key information. This can be done through feature
selection, which chooses the most relevant features, and
feature extraction, which transforms the data into a lower-
dimensional space while preserving important details. It uses
various reduction techniques such as,
Dimensionality Reduction (e.g., Principal Component
Analysis): A technique that reduces the number of
variables in a dataset while retaining its essential
information.
Numerosity Reduction: Reducing the number of data
points by methods like sampling to simplify the dataset
without losing critical patterns.
Data Compression: Reducing the size of data by encoding
it in a more compact form, making it easier to store and
process.
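A one-step numerosity-reduction sketch: random sampling with pandas keeps only 1% of the (synthetic) rows while approximately preserving the overall distribution:

```python
import pandas as pd
import numpy as np

rng = np.random.default_rng(7)
# Hypothetical large dataset of 10,000 sensor readings.
df = pd.DataFrame({"reading": rng.normal(50, 5, size=10_000)})

# Numerosity reduction: keep a 1% random sample as a stand-in for the whole.
sample = df.sample(frac=0.01, random_state=7)
print(len(sample), "rows;",
      "sample mean =", round(sample["reading"].mean(), 2),
      "vs full mean =", round(df["reading"].mean(), 2))
```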
Uses of Data Preprocessing
Data preprocessing is utilized across various fields to ensure
that raw data is transformed into a usable format for analysis
and decision-making. Here are some key areas where data
preprocessing is applied:
1. Data Warehousing: In data warehousing, preprocessing is
essential for cleaning, integrating, and structuring data before
it is stored in a centralized repository. This ensures the data is
consistent and reliable for future queries and reporting.
2. Data Mining: Data preprocessing in data mining involves
cleaning and transforming raw data to make it suitable for
analysis. This step is crucial for identifying patterns and
extracting insights from large datasets.
3. Machine Learning: In machine learning, preprocessing
prepares raw data for model training. This includes handling
missing values, normalizing features, encoding categorical
variables, and splitting datasets into training and testing sets
to improve model performance and accuracy.
4. Data Science: Data preprocessing is a fundamental step in
data science projects, ensuring that the data used for analysis
or building predictive models is clean, structured, and
relevant. It enhances the overall quality of insights derived
from the data.
5. Web Mining: In web mining, preprocessing helps analyze
web usage logs to extract meaningful user behavior patterns.
This can inform marketing strategies and improve user
experience through personalized recommendations.
6. Business Intelligence (BI): Preprocessing supports BI by
organizing and cleaning data to create dashboards and reports
that provide actionable insights for decision-makers.
7. Deep Learning: Similar to machine learning, deep learning applications require preprocessing to normalize or enhance features of the input data, optimizing the model training process.
Advantages of Data Preprocessing
Improved Data Quality: Ensures data is clean, consistent,
and reliable for analysis.
Better Model Performance: Reduces noise and irrelevant
data, leading to more accurate predictions and insights.
Efficient Data Analysis: Streamlines data for faster and
easier processing.
Enhanced Decision-Making: Provides clear and well-
organized data for better business decisions.
Disadvantages of Data Preprocessing
Time-Consuming: Requires significant time and effort to
clean, transform, and organize data.
Resource-Intensive: Demands computational power and
skilled personnel for complex preprocessing tasks.
Potential Data Loss: Incorrect handling may result in
losing valuable information.