
UNIT I

Data mining (Knowledge Discovery from Data).

Data mining is a process of extracting and discovering patterns in large data sets.

Very large collections of data, often millions or even hundreds of millions of individual records, are now being compiled into centralized data warehouses (a data warehouse is a database that stores summarized data in one place so that applications can access and share it easily) and reorganized by topic, allowing analysts to apply powerful statistical and machine learning methods to examine the data more comprehensively.

Data mining is the art and science of using algorithms more powerful than traditional query tools such as SQL to extract more useful information.

Data mining is concerned with the analysis of data and the use of software techniques for finding patterns and regularities in sets of data. It is the computer that is responsible for finding the patterns, by identifying the underlying rules and features in the data.

Data mining is concerned with discovering knowledge. It is about uncovering relationship or patterns
hidden in data that can be used to predict behaviors, outcomes, or provide some other useful function.

In data mining, the goal is to extract meaningful and actionable information from large datasets. Trivial
patterns or insights are ones that are simple and obvious, like finding that most customers purchase a
particular item with another item (e.g., bread and butter). Non-trivial patterns, on the other hand, might
involve identifying complex associations between items that aren't immediately apparent, such as the
correlation between the weather, local events, and sales of a specific product.

Data mining is widely used in diverse areas. There are a number of commercial data mining systems available today, and yet there are many challenges in this field.
Difference between a database management system (DBMS) and data mining:

DBMS: Create, store, maintain, and modify data in a database. | Data mining: Extract interesting, previously unknown information from raw data.
DBMS: Supports a query language. | Data mining: Performs automatic searching of data.
DBMS: Can work alone, without data mining. | Data mining: May not work without a DBMS.
DBMS: Basic elements are the query language and the data store. | Data mining: Basic tasks are classification, regression, clustering, and association.
Kinds of Data in Data Mining :

In data mining, various types of data can be analyzed to extract meaningful insights and patterns. The types
of data can be broadly categorized as follows:

Structured Data: Structured data is highly organized and follows a specific format, typically stored in
relational databases or spreadsheets. It consists of rows and columns, where each column represents a
specific attribute or feature, and each row represents an individual data instance. Examples of structured data
include sales transactions, customer information, inventory records, and financial data.

Example: Online Retail Sales Data


Structured data in data mining refers to data that is organized in a specific format, typically stored in tables
with rows and columns.
Suppose you're working for an e-commerce company, and you have access to a database containing
structured data about online retail sales. The database might have a table with the following columns:

Order ID: Unique identifier for each order.


Customer ID: Unique identifier for each customer.
Product ID: Unique identifier for each product.
Product Name: Name of the product.
Category: Category to which the product belongs (e.g., electronics, clothing, books).
Quantity: Number of units of the product ordered.
Unit Price: Price of each unit of the product.
Total Price: Total price of the order.
Order Date: Date when the order was placed.
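For illustration only (the records and field names below are invented, not taken from a real database), such a structured table could be represented directly in code; every record carries the same fixed set of fields, which is what makes the data "structured" and easy to query and mine:

# Illustrative only: a tiny, made-up sample of the online retail sales table.
orders = [
    {"order_id": 1001, "customer_id": "C01", "product_id": "P10",
     "product_name": "USB Cable", "category": "electronics",
     "quantity": 2, "unit_price": 5.0, "order_date": "2023-01-15"},
    {"order_id": 1002, "customer_id": "C02", "product_id": "P22",
     "product_name": "Mystery Novel", "category": "books",
     "quantity": 1, "unit_price": 12.5, "order_date": "2023-01-16"},
]

# The Total Price column can be derived from Quantity and Unit Price for each row.
for row in orders:
    row["total_price"] = row["quantity"] * row["unit_price"]
    print(row["order_id"], row["total_price"])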

Unstructured Data: Unstructured data lacks a predefined structure and format. It includes text, images,
audio, video, social media posts, emails, and more. Analyzing unstructured data requires specialized
techniques such as natural language processing (NLP), image recognition, and sentiment analysis.

Example: Social Media Sentiment Analysis

Imagine you're working for a brand that wants to monitor its reputation and customer sentiment in real-time
on social media platforms. The data you're dealing with is unstructured text data from social media posts
(tweets, Facebook posts, Instagram comments, etc.). The unstructured data consists of users' opinions,
emotions, and comments about the brand.

Semi-structured Data: Semi-structured data falls between structured and unstructured data. It has some
level of organization but does not necessarily fit neatly into tables or rows. Examples include XML files,
JSON data, and certain types of documents with tags or annotations.

Example: Online News Articles and Comments

Imagine you're working for a news aggregation platform that collects and analyzes news articles and user
comments in real-time. The data you're dealing with is semi-structured, as it comes from various sources
and includes both structured elements (like titles and timestamps) and unstructured text (like article content
and user comments).
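A minimal sketch (the record and its fields are invented for illustration) of how such a semi-structured news item, mixing structured fields with free text, might look in JSON and be parsed:

import json

# Invented example of a semi-structured record: the title and timestamp are
# structured fields, while the body and comments are unstructured free text.
raw = """
{
  "title": "Local Team Wins Championship",
  "published": "2023-06-01T10:30:00Z",
  "body": "In a thrilling final, the local team came from behind to win...",
  "comments": [
    {"user": "fan01", "text": "What a game!"},
    {"user": "skeptic", "text": "The referee made some odd calls."}
  ]
}
"""

article = json.loads(raw)
print(article["title"], "-", len(article["comments"]), "comments")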
Temporal or Time Series Data: Temporal data represents values recorded at different points in time. It's
common in applications such as financial analysis, weather forecasting, stock market analysis, and IoT
sensor data. Time series data can reveal trends, patterns, and seasonal variations.
Example: Energy Consumption Forecasting

Imagine you're working for a utility company that generates and distributes electricity. Your task is to
develop a system that predicts energy consumption in real-time based on historical data and external factors.
The data you're dealing with is time series data, where each data point is associated with a specific timestamp.

Categorical Data: Categorical data represents variables that can take on specific categories or labels, but
there's no inherent order between them. Examples include gender (male/female), color (red/blue/green), and
product categories.
Example: Customer Segmentation for E-commerce

Imagine you work for an e-commerce company and you want to segment your customer base for targeted
marketing campaigns. The data you're working with includes various categorical attributes that describe
customer behavior and preferences.

Numerical Data: Numerical data consists of numerical values that can be further categorized into discrete
or continuous data. Discrete data has distinct and separate values (e.g., number of children), while continuous
data can take any value within a range (e.g., temperature, weight).
Example: Stock Price Prediction

Imagine you're working for a financial firm that specializes in stock trading, and you want to predict the
future prices of certain stocks to make informed investment decisions. The data you're working with includes
numerical attributes related to stock prices and trading volumes.

Binary Data: Binary data consists of only two possible values, often represented as 0 and 1. This type of
data is common in applications like classification tasks, where you're trying to categorize data into one of
two classes.

Imagine you're working for an email service provider, and your task is to develop a system that detects
whether incoming emails are spam or legitimate (ham) in real-time. The data you're working with is binary
data, where each email is represented by a set of binary features indicating the presence or absence of certain
keywords or patterns.

Textual Data: Textual data includes any form of written or typed text, such as articles, reviews, emails, and
social media posts. Analyzing textual data involves techniques like sentiment analysis, topic modeling, and
named entity recognition.
Example: Customer Sentiment Analysis in Social Media

Imagine you're working for a consumer electronics company, and you want to monitor how customers are
discussing your products on social media platforms. The data you're dealing with is textual data from social
media posts, comments, and reviews.

Spatial Data: Spatial data is associated with geographic locations or coordinates. Examples include GPS
data, maps, and satellite images. Spatial data mining involves identifying spatial patterns and relationships.
Example: Traffic Congestion Analysis and Routing

Imagine you work for a transportation company that provides real-time navigation services to drivers. You
want to analyze traffic congestion patterns and provide optimized routes to drivers in congested areas. The
data you're dealing with is spatial data, including information about road networks, traffic flow, and
geographic locations.

Network Data: Network data represents relationships between entities in a network or graph. This can
include social networks, communication networks, and transportation networks. Analyzing network data
involves understanding connections and centrality measures.
Example: Social Network Analysis

Imagine you're working for a social media platform, and your goal is to analyze the interactions and
relationships between users to uncover patterns and trends. The data you're dealing with is network data,
where users are nodes, and their connections (follows, likes, comments) form edges in the network.

Multi-dimensional Data: Multi-dimensional data involves multiple attributes or features that are not
necessarily ordered. It's often visualized using techniques like scatter plots and parallel coordinate plots.
Example: Retail Store Sales Analysis

Imagine you're working for a retail chain with multiple stores, and you want to analyze sales data to make
informed decisions about inventory management, marketing strategies, and store performance. The data
you're dealing with is multi-dimensional, as it includes various attributes that contribute to the sales
performance of each store.

Meta Data: Meta data provides information about other data. For example, it might include data source,
data format, creation date, and data owner. Meta data is crucial for data management and understanding the
context of the main data.
Example: Media Content Recommendation

Imagine you're working for a streaming platform that provides movies and TV shows to users. Your goal is
to recommend content to users based on their preferences and viewing history. The data you're working with
includes both the media content and associated metadata.

In data mining, different types of data require specific techniques and tools for analysis. Depending on the
nature of the data, appropriate preprocessing, feature extraction, and modeling methods need to be chosen
to uncover valuable insights.

ISSUES IN DATA MINING

Data mining is not an easy task, as the algorithms used can get very complex and data is not always
available at one place. It needs to be integrated from various heterogeneous data sources. These factors
also create some issues. The major issues are –

Mining Methodology and User Interaction.


Performance Issues.
Diverse Data Types Issues.
1. Mining Methodology and User Interaction :

It refers to the following kinds of issues –

1. Mining different kinds of knowledge in Databases − Different users may be interested in


different kinds of knowledge.
2. Interactive mining of knowledge at multiple levels of abstraction − The data mining
process needs to be interactive because it allows users to focus the search for patterns.
3. Incorporation of background knowledge − Background knowledge may be used to
express the discovered patterns not only in concise terms but at multiple levels of abstraction.
4. Data mining query languages and ad hoc data mining − Data Mining Query language
that allows the user to describe ad hoc mining tasks, should be integrated with a data warehouse
query language.
5. Presentation and visualization of data mining results − Once the patterns are discovered, they need to be expressed in high-level languages and visual representations.
6. Handling noisy or incomplete data − Data cleaning methods are required to handle the noise and incomplete objects while mining the data regularities.
7. Pattern evaluation − The patterns discovered should be interesting; patterns that merely represent common knowledge or lack novelty are not useful.

2. Performance Issues :
Efficiency and scalability of data mining algorithms − In order to effectively extract the information from the huge amounts of data in databases, data mining algorithms must be efficient and scalable.
Parallel, distributed, and incremental mining algorithms − The factors such as huge size of
databases, wide distribution of data, and complexity of data mining methods motivate the
development of parallel and distributed data mining algorithms.
3. Diverse Data Types Issues :
Handling of relational and complex types of data − The database may contain complex data objects, multimedia data objects, spatial data, temporal data, etc. It is not possible for one system to mine all these kinds of data.
Mining information from heterogeneous databases and global information systems − The data is available at different data sources on a LAN or WAN. These data sources may be structured, semi-structured, or unstructured.

Challenges in Data Mining


Some of the Data Mining challenges are :
1. Security and Social Challenges
2. Noisy and Incomplete Data
3. Distributed Data
4. Complex Data
5. Performance
6. Scalability and Efficiency of the Algorithms
7. Improvement of Mining Algorithms
8. Incorporation of Background Knowledge
9. Data Visualization
10. Data Privacy and Security
11. User Interface
12. Mining dependent on Level of Abstraction
13. Integration of Background Knowledge
14. Mining Methodology Challenges.

Knowledge Discovery in Databases (KDD)

Knowledge Discovery in Databases (KDD) is the process of automatic discovery of


previously unknown patterns, rules, and other regularities implicitly present in large
volumes of data.

Data mining helps us to extract useful information from large databases. It’s a step within the
KDD process.

It is the process of finding useful information and knowledge in data.

Data mining, on the other hand, is the use of algorithms to extract patterns or models within the KDD process.

KDD Process

There are six steps in the KDD process, as shown below:

The KDD process can be divided into three parts. The first part is data preprocessing, covering steps 1-3. The second part is data mining, where the data mining algorithms are applied. The last part is evaluation and presentation. The following figure illustrates the overall KDD process in more detail:
DATA PRE-PROCESSING

It describes any type of processing performed on raw data to prepare it for another processing
procedure. Data preprocessing transforms the data into a format that will be more easily and
effectively processed for the purpose of the user.

Data preprocessing is the most crucial step, as operational data is normally not captured and prepared with data mining in mind. Mostly, the data is captured from several inconsistent, poorly documented operational systems. Thus, data preprocessing requires substantial effort in purifying and organizing the data. This step ensures that the data selected for mining is of good quality.

Data preprocessing is used in database-driven applications such as customer relationship management and in rule-based applications (like neural networks).

For example, in neural networks, a number of different tools and methods are used for preprocessing, including:
i) Sampling, which selects a representative subset from a large population of data;
ii) Transformation, which manipulates raw data to produce a single input;
iii) Denoising, which removes noise from data, and normalization, which organizes data for more efficient access; and
iv) Feature extraction, which pulls out specified data that is significant in some particular context.

Need for Data Pre-processing


Data preprocessing is needed to ensure data quality. Data quality can be assessed by the following factors:
Accuracy: To check whether the data entered is correct or not.
Completeness: To check whether the data is available or not recorded.
Consistency: To check whether the same data kept in different places matches.
Timeliness: The data should be updated correctly.
Believability: The data should be trustable.
Interpretability: The understandability of the data.
STEPS OF DATA PREPROCESSING:

1. Data Cleaning
Data cleaning is a process to clean dirty data. Real-world data is rarely clean: it may be incorrect for a large number of reasons, such as hardware errors or failures, network errors, or human error. So it is compulsory to clean the data before mining.

Data cleaning is the process of removing incorrect data, incomplete data, and inaccurate data from the
datasets, and it also replaces the missing values. Here are some techniques for data cleaning:

1. Handling Missing Values


Standard values like “Not Available” or “NA” can be used to replace the missing values.
Missing values can also be filled manually, but it is not recommended when that dataset is big.
The attribute's mean value can be used to replace the missing value when the data is normally distributed; in the case of a non-normal distribution, the median value of the attribute can be used instead (a small code sketch of these strategies follows this list).
While using regression or decision tree algorithms, the missing value can be replaced by the most
probable value.
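A minimal sketch of the mean and median imputation strategies above, using made-up attribute values (the numbers are purely illustrative):

from statistics import mean, median

# Made-up attribute values, with missing entries represented as None.
ages = [25, 30, None, 22, None, 28, 95]

observed = [v for v in ages if v is not None]

# Mean imputation: reasonable when the attribute is roughly normally distributed.
mean_filled = [v if v is not None else mean(observed) for v in ages]

# Median imputation: more robust when the distribution is skewed (e.g. by the value 95).
median_filled = [v if v is not None else median(observed) for v in ages]

print(mean_filled)
print(median_filled)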
2. Handling Noisy Data

Noisy data generally means data containing random errors or unnecessary data points. Handling noisy data is one of the most important steps, as it leads to the optimization of the model we are using. Here are some of the methods to handle noisy data.
I. Binning: This method is used to smooth or handle noisy data. First, the data is sorted, and then the sorted values are partitioned and stored in the form of bins or buckets. There are three methods for smoothing data in the bin.
 Smoothing by bin mean method: In this method, the values in the bin are replaced by the mean
value of the bin;
 Smoothing by bin median: In this method, the values in the bin are replaced by the median value;
 Smoothing by bin boundary: In this method, the minimum and maximum values of the bin are taken as the bin boundaries, and each value in the bin is replaced by the closest boundary value.
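A minimal sketch of the three smoothing methods, using a small made-up price list partitioned into equal-depth bins of three values each:

from statistics import mean, median

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]   # made-up values, already sorted
bin_size = 3
bins = [prices[i:i + bin_size] for i in range(0, len(prices), bin_size)]

for b in bins:
    by_mean = [round(mean(b), 2)] * len(b)      # smoothing by bin mean
    by_median = [median(b)] * len(b)            # smoothing by bin median
    lo, hi = min(b), max(b)                     # smoothing by bin boundaries
    by_boundary = [lo if v - lo <= hi - v else hi for v in b]
    print(b, by_mean, by_median, by_boundary)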

II. Regression − Data can be smoothed by fitting it to a function, as in regression; this helps to handle data when unnecessary data points are present. For analysis purposes, regression also helps to decide which variables are suitable for our analysis.
It is the measure of the average relationship between two or more variables in terms of the original units
of data. They are categories of two types:

 Linear regression involves finding the “best” line to fit two attributes (or variables) so that one attribute can be used to forecast the other.
 Multiple linear regression is an extension of linear regression, where more than two attributes are involved and the data are fit to a multidimensional surface.
III. Clustering − Clustering helps in identifying outliers. Similar values are organized into clusters, and those values which fall outside the clusters are known as outliers.

Data Integration

Data integration is the phase of combining data from several disparate sources.
Data integration is a data preprocessing technique that involves merging data from numerous heterogeneous data sources into coherent (understandable/meaningful) data, to retain and support a consolidated perspective of the information.
It combines data from various sources into a coherent data store, as in data warehousing. These sources can involve multiple databases, data cubes, or flat files.
A data cube is a multidimensional representation of data that allows for efficient querying and analysis of data along multiple dimensions. Each dimension represents a different attribute or characteristic; for example, in sales data, dimensions could include time, product, and location. It is particularly useful for analyzing large datasets with complex relationships, and data cubes are often used in Online Analytical Processing (OLAP) and data warehousing scenarios.
A flat file is a simple, plain-text file format that stores tabular data. Each row in the file represents a record, and the columns represent attributes or fields, similar to a spreadsheet: each column has a name, and rows hold data values corresponding to those columns.
Approaches for efficient data integration:
1. Entity identification problem
2. Redundancy and correlation analysis
3. Tuple duplication
4. Data value conflict detection and resolution
1. Entity identification problem
Schema integration and object matching are very important issues in data integration.
Schema integration (mismatch in attribute names): It involves merging and reconciling the differences between various data sources with different structures, formats, and semantics. This process ensures that data from different sources can be effectively and accurately used for analysis and decision-making.
Example: E-commerce Sales Analysis
Imagine a large e-commerce company that operates across multiple regions and platforms. They collect data
from various sources, such as online stores, mobile apps, and physical retail locations. Each source has its
own database schema and format for recording sales data. The company wants to perform data mining to
gain insights into their sales patterns, customer behavior, and product trends across all channels.
Data Sources:
Online Store Database: Contains information about online purchases, including customer IDs, product
IDs, purchase dates, prices, and payment methods.
Mobile App Database: Stores data about purchases made through the mobile app, with similar attributes
but possibly different naming conventions or additional fields.
Retail Store Database: Records sales data from physical retail locations, including store IDs, product IDs,
transaction timestamps, and payment details.
Schema Differences: Each data source has a different database schema and might use different attribute
names or structures for similar pieces of information.
The same concept (e.g., "customer ID" or "product ID") might be represented differently across sources.
Object matching (mismatch in the structure of the data):
Object matching in data integration within the context of data mining refers to the process of identifying
and linking records from different data sources that correspond to the same real-world entities. These
records can exist in various formats, databases, or datasets and might contain variations due to data entry
errors, inconsistencies, or different representation standards.
The primary goal of object matching is to recognize duplicate or matching records and consolidate them
into a single, unified representation. This process is essential for achieving accurate and meaningful results
in data mining and analysis, as well as for maintaining data quality and integrity in databases and data
warehouses.
Ex: discount issues, currency type.
2. Redundancy and correlation analysis
Redundancy – an attribute may be redundant if it can be “derived” from another attribute or set of attributes.
Redundancy refers to the presence of duplicated or highly similar information within a dataset. In data
mining, redundancy analysis aims to identify and eliminate redundant attributes or records in a table.
Redundant attributes can lead to increased storage requirements, computational overhead, and potentially
misleading results during analysis.
For example, in a sales dataset, if you have two attributes that convey essentially the same information, like
"Total Sales" and "Net Sales," you might want to perform redundancy analysis to determine which attribute
to keep and which to remove.
Eg) DOB, Age
Correlation analysis – given two attributes, such analysis can measure how strongly one attribute implies the other, based on the available data.
Correlation analysis focuses on understanding the statistical relationships between different attributes
(columns) within a dataset. Correlation measures how changes in one attribute relate to changes in another.
Correlation coefficients range from -1 to 1, where -1 indicates a perfect negative correlation, 1 indicates a perfect positive correlation, and 0 indicates no linear correlation.
Correlation analysis helps in identifying patterns and dependencies between attributes. For instance, in a
sales dataset, you might be interested in understanding the correlation between "Total Sales" and
"Advertising Spend" to determine whether increased advertising leads to higher sales.
Relation between Redundancy and Correlation Analysis:
While redundancy analysis aims to identify and remove redundant attributes or records, correlation analysis
helps in understanding the relationships between attributes, whether they are redundant or not. It's important
to note that attributes can be correlated without being redundant, and attributes can be redundant without
being highly correlated.
Two datasets: Housing Price Data and Square Footage Data. You want to calculate the Pearson correlation
coefficient between the selling prices of houses and their square footage.
Let's assume you have two variables:
X: Selling prices of houses in different neighborhoods.
Y: Size of houses in terms of square footage.
To calculate the Pearson correlation coefficient, you would follow these steps:
Selling Prices (X): [200000, 250000, 180000, 300000, 220000]
Square Footage (Y): [1500, 1800, 1200, 2100, 1600]
Step 1: Calculate the means of X and Y:
Mean(X) = (200000 + 250000 + 180000 + 300000 + 220000) / 5 = 230000
Mean(Y) = (1500 + 1800 + 1200 + 2100 + 1600) / 5 = 1640
Step 2: Calculate the differences from the mean for each data point:
Differences from Mean(X): [-30000, 20000, -50000, 70000, -10000]
Differences from Mean(Y): [-140, 160, -440, 460, -40]
Step 3: Multiply the corresponding differences for each data point:
Product of Differences: [4200000, 3200000, 22000000, 32200000, 400000]
Step 4: Sum up the products from step 3:
Sum of Products = 4200000 + 3200000 + 22000000 + 32200000 + 400000 = 62000000
Step 5: Calculate the sum of squared differences for X and Y:
Sum of Squared Differences(X) = (-30000)^2 + 20000^2 + (-50000)^2 + 70000^2 + (-10000)^2 = 8800000000
Sum of Squared Differences(Y) = (-140)^2 + 160^2 + (-440)^2 + 460^2 + (-40)^2 = 452000
Step 6: Calculate the Pearson correlation coefficient (r):
r = Sum of Products / (sqrt(Sum of Squared Differences(X)) * sqrt(Sum of Squared Differences(Y)))

r = 62000000 / (sqrt(8800000000) * sqrt(452000)) ≈ 0.98


The calculated Pearson correlation coefficient (r) is approximately 0.98. This value indicates a strong
positive linear correlation between selling prices and square footage of houses in this sample dataset.
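The same calculation can be checked with a few lines of Python, using only the sample values above:

import math

x = [200000, 250000, 180000, 300000, 220000]   # selling prices
y = [1500, 1800, 1200, 2100, 1600]             # square footage

mx, my = sum(x) / len(x), sum(y) / len(y)
sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
sxx = sum((a - mx) ** 2 for a in x)
syy = sum((b - my) ** 2 for b in y)

r = sxy / math.sqrt(sxx * syy)
print(round(r, 3))   # approximately 0.983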
3. Tuple Duplication:
Tuple duplication, also known as record duplication or instance duplication, refers to the existence of
multiple identical or nearly identical records within a dataset. This can occur due to various reasons such as
data entry errors, system glitches, or merging of different datasets. Dealing with tuple duplication is an
important aspect of data integration in data mining, as these duplicate records can lead to biased analysis
and incorrect conclusions.
Here's an example of tuple duplication in a sample customer table structure:

In this example, you can see that there are duplicate records for CustomerID 101, which means that the same
customer's information appears multiple times in the dataset.
The use of denormalized tables(often done to improve performance by avoiding joins) is another source of
data redundancy. Inconsistencies often arise between various duplicates, due to inaccurate data entry or
updating some but not all data occurrences.

4. Data value conflict detection and resolution:


Data value conflict detection and resolution refers to the process of identifying and addressing discrepancies
or differences in attribute values that come from various sources. These differences can arise due to a range
of factors, including differing representation formats, measurement units, scaling methods, encoding
schemes, and even varying levels of abstraction. Resolving these conflicts is essential to ensure the accuracy
and consistency of data used for analysis, reporting, and decision-making. Attribute values from different
sources may differ. This may be due to differences in representation, scaling, or encoding.
Ex: weight (metric or British imperial units)
School curriculum (grading systems)
Attributes may also differ on the abstraction level, where an attribute in one system is recorded at, say, a
lower abstraction level than the “same” attribute in another.
EX: monthly total sales in a single store
vs. monthly total sales from all stores in that region.

Examples of Different Types of Conflicts:


Measurement Units Conflict:
In the context of weight, different sources might provide measurements in metric units (kilograms) and
imperial units (pounds). Resolving this conflict involves converting all measurements to a common unit.
Grading Systems Conflict:
School curriculum data might come from different regions or educational systems with varying grading
scales. Converting these grades to a standardized scale (e.g., GPA) allows for proper comparison and
analysis.
Abstraction Level Conflict:
Monthly total sales from an individual store and monthly total sales from all stores in a region represent the
same concept but at different levels of abstraction. Resolving this involves aggregating individual store sales
to the regional level for consistency.

Steps in Conflict Detection and Resolution:


Conflict Detection:
Compare attribute values from different sources to identify inconsistencies or conflicts. These conflicts can
include differences in units, scales, formats, and abstraction levels.
Conflict Resolution Strategies:
Conversion: Convert values to a common unit, scale, or format. For instance, convert all weights to a
specific unit (e.g., kilograms).
Standardization: Transform data to a standardized format, scale, or encoding to ensure uniformity.
Aggregation: Combine data at the appropriate level of aggregation to align different abstraction levels (e.g.,
summing individual store sales to get regional sales).
Source Priority: Choose values from more reliable or authoritative sources based on data quality metrics.
Manual Review: For complex conflicts, human intervention and domain expertise may be necessary to
decide the best resolution.
The data mining system is integrated with a database or data warehouse system so that it can carry out its tasks in an effective manner. A data mining system operates in an environment that requires it to communicate with other data systems, such as a database system. The possible integration schemes are as follows −
No coupling − No coupling means that a data mining system will not use any function of a database or data warehouse system. It may retrieve data from a particular source (such as a file system), process the data using some data mining algorithms, and then store the mining results in another file.
Such a system, though simple, suffers from several limitations. A database system offers a great deal of flexibility and efficiency in storing, organizing, accessing, and processing data. Without using a database/data warehouse system, a data mining system may spend a substantial amount of time finding, collecting, cleaning, and transforming data.
Loose coupling – In this scheme, the data mining system uses some services of a database or data warehouse system. The data is fetched from a data repository handled by these systems, data mining approaches are used to process the data, and then the results are saved either in a file or in a designated area of a database or data warehouse. Loose coupling is better than no coupling because it can fetch any portion of the data stored in databases by using query processing or other system facilities.
Semitight coupling – In this scheme, efficient execution of a few essential data mining primitives is provided by the database/data warehouse system. These primitives can include sorting, indexing, aggregation, histogram analysis, multi-way join, and precomputation of some important statistical measures, such as sum, count, max, min, and standard deviation.
Tight coupling − Tight coupling means that the data mining system is smoothly integrated into the database/data warehouse system. The data mining subsystem is treated as one functional component of the information system.
Data mining queries and functions are developed and established on mining query analysis, data structures,
indexing schemes, and query processing methods of database/data warehouse systems. It is hugely desirable
because it supports the effective implementation of data mining functions, high system performance, and
an integrated data processing environment.
Data reduction

Data reduction techniques are applied to obtain a reduced representation of the data set that is much smaller in volume, yet closely maintains the integrity of the original data.
Data reduction reduces the volume of the original data and represents it in a much smaller form. It aims to increase storage efficiency and reduce data storage and analysis costs.

Data reduction techniques ensure the integrity of data while reducing the data. The time required for data
reduction should not overshadow the time saved by the data mining on the reduced data set.

When you collect data from different data warehouses for analysis, it results in a huge amount of data. It is
difficult for a data analyst to deal with this large volume of data.

It is even difficult to run the complex queries on the huge amount of data as it takes a long time and
sometimes it even becomes impossible to track the desired data.

This is why reducing data becomes important. Data reduction technique reduces the volume of data yet
maintains the integrity of the data.

Data reduction does not affect the result obtained from data mining that means the result obtained from data
mining before data reduction and after data reduction is the same (or almost the same).

The only difference occurs in the efficiency of data mining. Data reduction increases the efficiency of data
mining. In the following section, we will discuss the techniques of data reduction.

Data Reduction Techniques

Techniques of data reduction include dimensionality reduction, numerosity reduction, and data compression.
 Dimensionality Reduction

Dimensionality reduction eliminates attributes from the data set under consideration, thereby reducing the volume of the original data. It eliminates redundant attributes and attributes that are only weakly relevant to the mining task.

METHODS:

 Stepwise forward selection − The process starts with a null set of attributes as the reduced set. The
best of the original attributes is determined and added to the reduced set. At every subsequent iteration
or step, the best of the remaining original attributes is inserted into the set.

Forward stepwise selection (or forward selection) is a variable selection method which:
1. Begins with a model that contains no variables (called the Null Model)
2. Then starts adding the most significant variables one after the other
3. Until a pre-specified stopping rule is reached or until all the variables under consideration are
included in the model
(Figure omitted: an example of forward selection with 5 variables. A minimal code sketch of both forward selection and backward elimination is given after the backward-elimination steps below.)

 Stepwise backward elimination − The procedure starts with the full set of attributes. At each step,
it removes the worst attribute remaining in the set.

Backward stepwise selection (or backward elimination) is a variable selection method which:
1. Begins with a model that contains all variables under consideration (called the Full Model)
2. Then starts removing the least significant variables one after the other
3. Until a pre-specified stopping rule is reached or until no variable is left in the model.
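A minimal sketch of both procedures on synthetic data, assuming a residual-sum-of-squares score from an ordinary least-squares fit and an arbitrary improvement threshold as the stopping rule (real implementations usually use significance tests or cross-validation instead):

import numpy as np

# Synthetic data: 5 candidate attributes, 40 samples; y depends mainly on x0 and x2.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5))
y = 3 * X[:, 0] - 2 * X[:, 2] + rng.normal(scale=0.1, size=40)

def sse(cols):
    """Residual sum of squares of a least-squares fit using the given columns."""
    if not cols:
        return float(np.sum((y - y.mean()) ** 2))
    A = np.column_stack([X[:, c] for c in cols] + [np.ones(len(y))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return float(np.sum((y - A @ coef) ** 2))

THRESHOLD = 1.0   # arbitrary stopping rule on the change in error

# Stepwise forward selection: start empty, greedily add the attribute that
# reduces the error the most, until no attribute improves the fit enough.
selected, remaining = [], list(range(5))
while remaining:
    best = min(remaining, key=lambda c: sse(selected + [c]))
    if sse(selected) - sse(selected + [best]) < THRESHOLD:
        break
    selected.append(best)
    remaining.remove(best)
print("forward selection kept:", selected)

# Stepwise backward elimination: start with all attributes, greedily drop the
# attribute whose removal hurts the fit the least, while the loss stays small.
kept = list(range(5))
while kept:
    worst = min(kept, key=lambda c: sse([k for k in kept if k != c]))
    if sse([k for k in kept if k != worst]) - sse(kept) > THRESHOLD:
        break
    kept.remove(worst)
print("backward elimination kept:", kept)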

Decision tree induction

Decision tree induction constructs a flowchart-like structure where each internal (non-leaf) node denotes
a test on an attribute, each branch corresponds to an outcome of the test, and each external (leaf) node
denotes a class prediction. At each node, the algorithm chooses the “best” attribute to partition the data into
individual classes.
All the attributes that do not appear in the tree are assumed to be irrelevant, while the attributes that do appear in the tree form the reduced data set.
A decision tree is a structure that includes a root node, branches, and leaf nodes. Each internal node denotes
a test on an attribute, each branch denotes the outcome of a test, and each leaf node holds a class label. The
topmost node in the tree is the root node.
The following decision tree is for the concept buy computer that indicates whether a customer at a company
is likely to buy a computer or not. Each internal node represents a test on an attribute. Each leaf node
represents a class.
The benefits of having a decision tree are as follows −
 It does not require any domain knowledge.
 It is easy to comprehend.
 The learning and classification steps of a decision tree are simple and fast.

 Numerosity reduction

Another data reduction methodology in data mining is numerosity reduction, in which the volume of the data is reduced by representing it in a smaller, alternative form. There are two types of this technique: parametric and non-parametric.
1. Parametric Reduction
The parametric numerosity reduction technique assumes that the data fits a model. Hence, it estimates the model parameters and stores only these estimated parameters, not the original or actual data. The rest of the data is discarded (possible outliers may also be stored).

For parametric methods, a model is used to estimate the data, so that only the data parameters need to be
stored, instead of the actual data, for example, Log-linear models, regression.
This enables us to store a model of the data instead of the whole data. Examples include regression models (linear regression and multiple regression) and log-linear models.
Regression:
Regression refers to a data mining technique that is used to predict the numeric values in a given data set.
For example, regression might be used to predict the product or service cost or other variables. It is also used
in various industries for business and marketing behavior, trend analysis, and financial forecast.

Regression involves the technique of fitting a straight line or a curve to numerous data points. It is done in such a way that the distance between the data points and the curve comes out to be the lowest.

Regression refers to a type of supervised machine learning technique that is used to predict any continuous-
valued attribute. Linear Regression analysis is used for studying or summarizing the relationship between
variables that are linearly related. The regression is also of two kinds:

FOR REFERENCE:
Simple Linear regression
Suppose you are collecting data on the relationship between the number of hours studied and the exam scores achieved by a group of students (the Hours Studied and Exam Score values are listed below).

you want to use simple linear regression to understand the relationship between the number of hours
studied and the resulting exam score.

Exam Score = β0 + β1 * Hours Studied + ε

Where:
Exam Score is the dependent variable (what we're trying to predict).
Hours Studied is the independent variable (the input feature).
β0 is the intercept term.
β1 is the coefficient for the Hours Studied variable, representing the change in exam
score for a one-unit change in hours studied.
ε is the error term, representing the variability in exam scores that the model doesn't
explain.
By fitting this model to the data, you can estimate the coefficients β0 and β1. These
coefficients can give you insights into how much an additional hour of studying is
associated with changes in the exam score.

In this case, the goal of data mining with simple linear regression is to find the best-
fitting line that minimizes the difference between the predicted exam scores (based on
the model) and the actual exam scores in the dataset. This line can then be used for
prediction and to understand the relationship between the variables.

Keep in mind that simple linear regression assumes a linear relationship between the
variables and that the error term ε is normally distributed with a mean of 0.

HOURS STUDIED: [2, 3, 4, 5, 6]
EXAM SCORE: [65, 75, 85, 90, 95]

We can use the following formulas to calculate β0 and β1:

Calculate the mean of Hours Studied (x̄) and Exam Score (ȳ):
x̄ = (2 + 3 + 4 + 5 + 6) / 5 = 4
ȳ = (65 + 75 + 85 + 90 + 95) / 5 = 82

Calculate the covariance of Hours Studied and Exam Score:


Cov(x, y) = Σ((xi - x̄) * (yi - ȳ)) / (n - 1)
Cov(x, y) = ((2 - 4) * (65 - 82) + (3 - 4) * (75 - 82) + (4 - 4) * (85 - 82) + (5 - 4) * (90 - 82) + (6 - 4) * (95 - 82)) / (5 - 1)
Cov(x, y) = (34 + 7 + 0 + 8 + 26) / 4 = 18.75

Calculate the variance of Hours Studied (Var(x)):


Var(x) = Σ((xi - x̄)²) / (n - 1)
Var(x) = ((2 - 4)² + (3 - 4)² + (4 - 4)² + (5 - 4)² + (6 - 4)²) / (5 - 1)
Var(x) = (4 + 1 + 0 + 1 + 4) / 4 = 2.5
Calculate the slope (β1):
β1 = Cov(x, y) / Var(x)
β1 = 18.75 / 2.5 = 7.5

Calculate the intercept (β0):

β0 = ȳ - β1 * x̄
β0 = 82 - (7.5 * 4) = 82 - 30 = 52

So, the linear regression equation for predicting Exam Score based on the number of
hours studied is:

Exam Score = 52 + 7.5 * Hours Studied

Now, you can use this equation to predict the Exam Score for different values of Hours
Studied. For example, if a student studies for 7 hours:

Exam Score = 52 + 7.5 * 7 = 52 + 52.5 = 104.5

If a student studies for 8 hours:

Exam Score = 52 + 7.5 * 8 = 52 + 60 = 112

And so on for other values of Hours Studied. (Predicted values above 100 are a reminder that extrapolating the fitted line beyond the observed range of hours should be interpreted with caution.)
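The hand calculation above can be checked with a short Python script:

hours = [2, 3, 4, 5, 6]
scores = [65, 75, 85, 90, 95]

mx = sum(hours) / len(hours)
my = sum(scores) / len(scores)

# Slope and intercept of the least-squares line.
b1 = sum((x - mx) * (y - my) for x, y in zip(hours, scores)) / sum((x - mx) ** 2 for x in hours)
b0 = my - b1 * mx

print(b0, b1)          # 52.0 and 7.5
print(b0 + b1 * 7)     # predicted score for 7 hours of study: 104.5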

Multiple Linear regression.


Multiple linear regression is used when you have multiple independent variables that you believe might
collectively influence a dependent variable. In your example, let's consider that in addition to the number of
hours studied, you also have data on the number of hours slept the night before the exam for each student.
Here's how you might apply multiple linear regression to the dataset:
In multiple linear regression, the model can be represented as:

Exam Score = β0 + β1 * Hours Studied + β2 * Hours Slept + ε

Where:

Exam Score is still the dependent variable (what we're trying to predict).
Hours Studied and Hours Slept are the independent variables (input features).
β0 is the intercept term.
β1 is the coefficient for the Hours Studied variable, representing the change in exam
score for a one-unit change in hours studied while holding Hours Slept constant.
β2 is the coefficient for the Hours Slept variable, representing the change in exam score
for a one-unit change in hours slept while holding Hours Studied constant.
ε is the error term.
By fitting this model to the data, you can estimate the coefficients β0, β1, and β2. These
coefficients can help you understand how both hours studied and hours slept collectively
contribute to the exam score.

In data mining, the goal of multiple linear regression is to find the best-fitting plane in
the multi-dimensional space that minimizes the difference between the predicted exam
scores (based on the model) and the actual exam scores in the dataset. This plane can
then be used for prediction and to understand the relationships between the variables.
We need to calculate the values of β0 (the intercept), β1 (the coefficient for Hours
Studied), and β2 (the coefficient for Hours Slept) using the provided data:

HOURS STUDIED: [2, 3, 4, 5, 6]


HOURS SLEPT: [6, 7, 6, 8, 7]
EXAM SCORE: [65, 75, 85, 90, 95]

Calculate the means of Hours Studied (x̄1), Hours Slept (x̄2), and Exam Score (ȳ):
x̄1 = (2 + 3 + 4 + 5 + 6) / 5 = 4
x̄2 = (6 + 7 + 6 + 8 + 7) / 5 = 6.8
ȳ = (65 + 75 + 85 + 90 + 95) / 5 = 82

Calculate the covariances (sample formulas, dividing by n - 1):


Cov(x1, x1) = Σ((xi1 - x̄1) * (xi1 - x̄1)) / (n - 1)
Cov(x2, x2) = Σ((xi2 - x̄2) * (xi2 - x̄2)) / (n - 1)
Cov(x1, x2) = Σ((xi1 - x̄1) * (xi2 - x̄2)) / (n - 1)
Cov(x1, y) = Σ((xi1 - x̄1) * (yi - ȳ)) / (n - 1)
Cov(x2, y) = Σ((xi2 - x̄2) * (yi - ȳ)) / (n - 1)
Using the values from the data, these covariances come out to:

Cov(x1, x1) = Var(x1) = 2.5
Cov(x2, x2) = Var(x2) = 0.7
Cov(x1, x2) = 0.75
Cov(x1, y) = 18.75
Cov(x2, y) = 5.5

Calculate the coefficients (β0, β1, β2). Because the two predictors are themselves correlated, the simple ratios Cov(x, y) / Var(x) used in simple linear regression are no longer enough; the normal equations for two predictors give:

β1 = (Var(x2) * Cov(x1, y) - Cov(x1, x2) * Cov(x2, y)) / (Var(x1) * Var(x2) - Cov(x1, x2)^2)
β2 = (Var(x1) * Cov(x2, y) - Cov(x1, x2) * Cov(x1, y)) / (Var(x1) * Var(x2) - Cov(x1, x2)^2)
β0 = ȳ - β1 * x̄1 - β2 * x̄2

Substitute the values:

Denominator = 2.5 * 0.7 - 0.75^2 = 1.75 - 0.5625 = 1.1875
β1 = (0.7 * 18.75 - 0.75 * 5.5) / 1.1875 = 9 / 1.1875 ≈ 7.579
β2 = (2.5 * 5.5 - 0.75 * 18.75) / 1.1875 = -0.3125 / 1.1875 ≈ -0.263
β0 = 82 - 7.579 * 4 - (-0.263) * 6.8 ≈ 53.47

So, the multiple linear regression equation for predicting Exam Score based on the
number of hours studied and the number of hours slept is:

Exam Score ≈ 53.47 + 7.58 * Hours Studied - 0.26 * Hours Slept

You can now use this equation to predict the Exam Score for different combinations of
Hours Studied and Hours Slept.
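These coefficients can be checked with numpy's least-squares solver:

import numpy as np

# Checking the hand calculation above with an ordinary least-squares fit.
hours_studied = np.array([2, 3, 4, 5, 6], dtype=float)
hours_slept = np.array([6, 7, 6, 8, 7], dtype=float)
scores = np.array([65, 75, 85, 90, 95], dtype=float)

# Design matrix with an intercept column followed by the two predictors.
A = np.column_stack([np.ones(5), hours_studied, hours_slept])
(b0, b1, b2), *_ = np.linalg.lstsq(A, scores, rcond=None)

print(round(b0, 2), round(b1, 2), round(b2, 2))   # approx 53.47, 7.58, -0.26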

Log-Linear:
Log-linear model discovers the relation between two or more discrete attributes in the database. Suppose,
we have a set of tuples presented in n-dimensional space. Then the log-linear model is used to study the
probability of each tuple in a multidimensional space.
a simple example of how a log-linear model could be applied to analyze the relationship between the
variables:

Suppose we're looking at a dataset that records whether customers made a purchase ("Purchase: Yes" or
"Purchase: No") based on their age group ("Age: Young", "Age: Middle-aged", "Age: Senior") and the
type of product ("Product: A", "Product: B", "Product: C"). The table might look like this:
we want to investigate if there's any relationship between age groups, product choice,
and purchase likelihood. We can build a log-linear model to analyze this relationship.
Here's a simplified example:
log(E(Yij)) = β0 + β1 * Age_i + β2 * Product_j + β3 * Age_i * Product_j

Where:

Yij represents the observed frequency in cell (i, j) of the table.


E(Yij) represents the expected frequency in cell (i, j) based on the log-linear model.
Age_i is a categorical variable representing the ith age group.
Product_j is a categorical variable representing the jth product type.
β0, β1, β2, and β3 are coefficients to be estimated.
By fitting this model to the observed frequencies, you can estimate the coefficients and
assess whether there's a statistically significant relationship between age groups, product
choice, and purchase likelihood. This allows you to understand whether certain age
groups or product choices are associated with a higher or lower likelihood of making a
purchase.
Regression and log-linear methods can be used for sparse data and skewed data.

2. Non-parametric Reduction

On the other hand, the non-parametric methods do not assume that the data fits a model. Unlike the parametric methods, these methods may not achieve as high a degree of data reduction, but they produce a homogeneous and systematic reduced representation regardless of the size of the data. Non-parametric methods for storing a reduced representation of the data include histograms, clustering, sampling, and data cube aggregation.

The types of Non-Parametric data reduction methodology are:


i)Histogram

A histogram is a graph that represents a frequency distribution, which describes how often a value appears in the data. A histogram uses the binning method to represent the data distribution of an attribute: it partitions the values into disjoint subsets, called bins or buckets. Consider the AllElectronics data set below, which contains prices for regularly sold items.
1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20,
20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30.
The diagram below shows an equal-width histogram of the price distribution.

A histogram is capable of representing dense, sparse, uniform, or skewed data. Instead of only one attribute, a histogram can be constructed for multiple attributes; it can effectively represent up to five attributes.
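A minimal sketch that computes an equal-width histogram for the price list above, assuming a bucket width of 10:

from collections import Counter

prices = [1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15,
          15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20, 20, 20, 20, 20,
          20, 20, 21, 21, 21, 21, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30]

width = 10   # equal-width buckets: 1-10, 11-20, 21-30
counts = Counter((p - 1) // width for p in prices)
for b in sorted(counts):
    lo, hi = b * width + 1, (b + 1) * width
    print(f"{lo}-{hi}: {counts[b]} items")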

ii) Clustering
Clustering techniques group similar objects from the data in such a way that the objects in a cluster are similar to each other but dissimilar to objects in other clusters.
How similar the objects inside a cluster are can be measured using a distance function; the more similar the objects in a cluster, the closer they appear within the cluster.
The quality of a cluster depends on the diameter of the cluster, i.e. the maximum distance between any two objects in the cluster.
The original data is replaced by the cluster representation. This technique is more effective if the data can be grouped into distinct clusters.

iii)Sampling
One of the methods used for data reduction is sampling, as it can reduce a large data set to a much smaller data sample. Below we discuss the different methods by which we can sample a large data set D containing N tuples:

Simple random sample without replacement (SRSWOR) of size s: In this method, s tuples (s < N) are drawn from the N tuples in the data set D. The probability of drawing any tuple from the data set D is 1/N, which means all tuples have an equal probability of being sampled.

In an SRSWOR, each item in the population has an equal chance of being selected, and once an item is
selected, it is not put back into the population. Let's assume you have a population of 100 individuals, and
you want to draw a sample of size 5 using SRSWOR.
Population (100 individuals): [1, 2, 3, ..., 99, 100]

The sample after five selections might be:

Sample = [42, 17, 88, 5, 67]

This sample represents a simple random sample without replacement of size 5 from the population of 100
individuals. Each individual in the sample was selected with an equal probability, and once selected, they
were not put back into the population for subsequent selections.
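A minimal sketch of SRSWOR using Python's standard library (the drawn values will vary from run to run):

import random

population = list(range(1, 101))    # individuals numbered 1..100

# Simple random sample WITHOUT replacement: each individual can appear at most once.
sample = random.sample(population, 5)
print(sample)                       # e.g. [42, 17, 88, 5, 67]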

Simple random sample with replacement (SRSWR) of size s: It is similar to the SRSWOR but the tuple
is drawn from data set D, is recorded and then replaced back into the data set D so that it can be drawn again.

An example of a simple random sample with replacement (SRSWR) of size 5 from a population. In an
SRSWR, each item in the population has an equal chance of being selected, and after each selection, the
item is put back into the population, allowing it to be selected again.

Let's assume you have a population of 100 balls, each numbered from 1 to 100, and you want to draw an
SRSWR of size 5:

Population (100 balls): [1, 2, 3, ..., 99, 100]

Select a random number between 1 and 100. Let's say the first random number is 42. Add ball #42 to the
sample.

Sample: [42]

Select another random number between 1 and 100. Let's say the second random number is 17. Add ball #17
to the sample.

Sample: [42, 17]

Select a third random number between 1 and 100. Let's say the third random number is 88. Add ball #88 to
the sample.

Sample: [42, 17, 88]

Select a fourth random number between 1 and 100. Let's say the fourth random number is 5. Add ball #5 to
the sample.

Sample: [42, 17, 88, 5]

Select a fifth random number between 1 and 100. Let's say the fifth random number is 67. Add ball #67 to
the sample.

Sample: [42, 17, 88, 5, 67]


Each ball was selected independently, and because it is a sample with replacement, the same ball could be selected more than once.
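A minimal sketch of SRSWR using the standard library (again, the drawn values will vary from run to run):

import random

population = list(range(1, 101))    # balls numbered 1..100

# Simple random sample WITH replacement: the same ball may be drawn more than once.
sample = random.choices(population, k=5)
print(sample)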

Cluster sample: The tuples in data set D are clustered into M mutually disjoint subsets. From these
clusters, a simple random sample of size s could be generated where s<M. The data reduction can be
applied by implementing SRSWOR on these clusters.

Stratified sample: The large data set D is partitioned into mutually disjoint sets called ‘strata’. Now a
simple random sample is taken from each stratum to get stratified data. This method is effective for skewed
data.

Data Cube Aggregation

Data Cube Aggregation represents the data in cube format by performing aggregation on data. In this data
reduction technique, each cell of the cube is a placeholder holding the aggregated data point. This cube stores
the data in an aggregated form in a multidimensional space. The resultant value is lower in terms of volume
i.e. takes up less space without losing any information.

EXAMPLE 1:

Consider you have the data of All Electronics sales per quarter for the year 2008 to the year 2010. In case
you want to get the annual sale per year then you just have to aggregate the sales per quarter for each year.
In this way, aggregation provides you with the required data which is much smaller in size and thereby we
achieve data reduction even without losing any data.

Data cube aggregation is a multidimensional aggregation that eases multidimensional analysis. For example, the data cube in the image below represents the annual sale of each item for each branch. The data cube holds precomputed and summarized data, which gives data mining fast access to it.

EXAMPLE 2:

Below we have the quarterly sales data of a company across four item types (entertainment, keyboard, mobile, and locks) for one of its locations, Delhi. The data is in 2-dimensional (2D) format and gives us information about sales on a quarterly basis.
Viewing 2-dimensional data in the form of a table is indeed helpful. Let’s say we increase one
more dimension and add more locations in our data – Gurgaon, Mumbai, along with the
already available location, Delhi, as below:

Now, rather than viewing this 3-dimensional data in a tabular structure, representing it in a cube format increases the readability of the data: each side of the cube represents one dimension - Time, Location, and Item Type.
Another usability of the data cube aggregation is when we want to aggregate the data values.
In the example below, we have quarterly sales data for different years from 2008 to 2010.

However, to make an analysis of any sort, we typically work with annual sales. So, we can visually depict the yearly sales data by summing the sales amounts across the other dimensions, as below:
Data cube aggregation is useful for storing and showing summarized data in a cube form.

Here, data reduction is achieved by aggregating data across different levels of the cube. It is
also seen as a multidimensional aggregation that enhances aggregation operations.
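A minimal sketch of this kind of aggregation, rolling made-up quarterly records up to annual totals per (year, location, item) cell (all numbers are invented):

from collections import defaultdict

# Made-up quarterly sales records: (year, quarter, location, item_type, amount).
sales = [
    (2008, "Q1", "Delhi", "mobile", 120), (2008, "Q2", "Delhi", "mobile", 150),
    (2008, "Q3", "Delhi", "mobile", 110), (2008, "Q4", "Delhi", "mobile", 170),
    (2009, "Q1", "Delhi", "mobile", 140), (2009, "Q2", "Delhi", "mobile", 160),
    (2008, "Q1", "Mumbai", "locks", 80), (2008, "Q2", "Mumbai", "locks", 95),
]

# Aggregate away the quarter dimension to get annual totals per (year, location, item).
annual = defaultdict(int)
for year, quarter, location, item, amount in sales:
    annual[(year, location, item)] += amount

for cell, total in sorted(annual.items()):
    print(cell, total)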

Data Compression : Data compression is the process of encoding, restructuring or


otherwise modifying data in order to reduce its size.

Data compression in data mining, as the name suggests, simply compresses the data. This technique encapsulates the data or information in a condensed form by eliminating duplicate or unneeded information. It changes the representation of the data so that it takes up less space, often in a binary-encoded form.

There are two types of data compression:

1. Lossless Compression: When the compressed data can be restored or reconstructed


back to its original form without the loss of any information, then it is referred to as
lossless compression.
2. Lossy Compression: When the compressed data cannot be restored or reconstructed back into its original form, then it is referred to as lossy compression.
Data compression technique varies based on the type of data.

1. String data: In string compression, the data is modified only in a limited manner without complete expansion; the compression is mostly lossless, as the data can be retrieved back in its original form. Therefore it is lossless data compression. There are extensive theories and well-tuned algorithms used for string compression.

2. Audio or video data: Unlike string data, audio or video data cannot be recreated exactly in its original shape, hence this is lossy data compression. At times, it may be possible to reconstruct small bits or pieces of the signal data, but you cannot restore it to its whole form.

3. Time-sequential data: Time-sequential data is not audio data; it typically consists of short data fragments that vary slowly with time, and this property is exploited for data compression.

Two widely used techniques for this kind of (lossy) data compression are:

Wavelet Transformation

Principal Component Analysis (PCA)

1. Wavelet Transformation
The wavelet transform converts data (for example, the pixels of an image) into wavelet coefficients, which are then used for wavelet-based compression and coding.

The wavelet transform is a form of lossy data compression.

Let's say we have a data vector Y. Applying the wavelet transform to Y produces a different numerical data vector Y', where Y and Y' have the same length. You may wonder how transforming Y into Y' helps reduce the data: Y' can be trimmed or truncated, because many of its coefficients are small enough to discard, whereas the original vector Y cannot be compressed in this way.


It is called a 'wavelet transform' because the information is represented as wave-like basis functions, much as a frequency is depicted graphically as a signal. The wavelet transform also works well on data cubes and on sparse or skewed data, and it is mostly applied to image compression and signal processing.
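As a small illustration of the idea, a single-level discrete wavelet transform can be computed with the PyWavelets library; the signal values below are made up, and truncating the small detail coefficients is what makes the reduction lossy:

import numpy as np
import pywt  # PyWavelets

# Hypothetical data vector Y (illustrative values only)
Y = np.array([4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0])

# Single-level Haar transform: Y' = (approximation, detail), same total length as Y
approx, detail = pywt.dwt(Y, "haar")

# Lossy reduction: zero out the small detail coefficients before storing
detail[np.abs(detail) < 1.5] = 0.0

# Reconstruction is close to Y but no longer exact once details are dropped
Y_reconstructed = pywt.idwt(approx, detail, "haar")
print(Y_reconstructed)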
2. Principal Component Analysis (PCA)
PCA reduces data size by projecting the data onto 'k' orthogonal unit vectors, each pointing in a direction perpendicular to the others. These vectors are referred to as the principal components.

Principal component analysis, a technique for data reduction in data mining, combines the original variables into components that capture the maximum information (variance) present in the data and discards the components that carry little information.
Now, let's say that out of the total n variables, k components are identified and retained. These components now represent the data and are used for further analysis.

In short, PCA reduces multidimensional data to lower-dimensional data. It does this by eliminating variables that contain the same information as other variables and combining the relevant variables into components. Principal component analysis is also useful for sparse and skewed data.
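A minimal sketch of PCA-based reduction with scikit-learn, assuming a made-up numeric dataset X; here k = 2 principal components are kept out of the original four variables:

import numpy as np
from sklearn.decomposition import PCA

# Hypothetical data: 6 records described by 4 correlated numeric variables
rng = np.random.default_rng(0)
base = rng.normal(size=(6, 2))
X = np.hstack([base, base * 2 + rng.normal(scale=0.1, size=(6, 2))])

# Keep k = 2 orthogonal principal components
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                # (6, 2): the lower-dimensional representation
print(pca.explained_variance_ratio_)  # share of the information each component retains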

DATA TRANSFORMATION

Data transformation is the process of converting data from one format, such as a database file, XML
document or Excel spreadsheet, into another.

Transformations typically involve converting a raw data source into a cleansed, validated and ready-to-use
format. Data transformation is crucial to data management processes that include data integration, data
migration, data warehousing and data preparation.

Data transformation is often carried out as part of an extract/transform/load (ETL) process. The extraction phase involves identifying and pulling data from the various source systems that create data and moving it to a single repository. Next, the raw data is cleansed if needed. It is then transformed into a target format that can be fed into operational systems or into a data warehouse, a data lake or another repository for use in business intelligence and analytics applications. The transformation may involve converting data types, removing duplicate data and enriching the source data.

Raw data is difficult to trace or understand, so it needs to be preprocessed before any information is retrieved from it. Data transformation converts the raw data into a suitable format that eases data mining and the retrieval of strategic information. It is often combined with data cleaning and data reduction techniques to bring the data into the appropriate form.

There are several data transformation techniques that can help structure and clean up the data before analysis
or storage in a data warehouse. The data are transformed or consolidated so that the resulting mining process
may be more efficient, and the patterns found may be easier to understand.

Data transformation strategies include:


1. Smoothing
2. Attribute construction
3. Aggregation
4. Discretization
5. Concept hierarchy generation
6. Normalization

Data Smoothing
Data smoothing is a process used to remove noise from a dataset using algorithms. It highlights the important features of the data and helps in predicting patterns. After data has been collected, it can be processed to eliminate or reduce variance and other forms of noise.
The idea behind data smoothing is that it identifies simple trends that help predict different patterns. This helps analysts or traders who need to look at a lot of data, which is often difficult to digest, to find patterns they would not otherwise see.
Noise can be removed from the data using techniques such as binning, regression, and clustering.
Binning: This method splits the sorted data into a number of bins and smooths the values in each bin using their neighbouring values (for example, replacing them with the bin mean or the bin boundaries).
Regression: This method fits a function relating two attributes so that one attribute can be used to predict the other.
Clustering: This method groups similar data values into clusters. Values that lie outside every cluster are known as outliers.
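A short sketch of smoothing by bin means, assuming a small illustrative list of sorted values split into equal-frequency bins:

import numpy as np

# Hypothetical sorted, noisy values
prices = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34], dtype=float)

# Split into 4 equal-frequency bins and replace each value by its bin mean
bins = np.array_split(prices, 4)
smoothed = np.concatenate([np.full(len(b), b.mean()) for b in bins])
print(smoothed)

Replacing each value with its bin mean flattens small fluctuations while keeping the overall trend of the data.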

Attribute Construction: New attributes are constructed from the given set of attributes and added to it.
In the attribute construction method, new attributes are derived from the existing attributes to build a data set that eases mining. The new attributes are created and applied to assist the mining process; this simplifies the original data and makes mining more efficient.
For example, suppose we have a data set of measurements of different plots, i.e., the height and width of each plot. We can construct a new attribute 'area' from the attributes 'height' and 'width'. This also helps in understanding the relationships among the attributes in a data set.
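A minimal sketch of attribute construction with pandas, using hypothetical height and width measurements for each plot:

import pandas as pd

# Hypothetical plot measurements
plots = pd.DataFrame({"height": [10, 12, 8], "width": [5, 6, 4]})

# Construct the new attribute 'area' from the existing attributes
plots["area"] = plots["height"] * plots["width"]
print(plots)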

Aggregation:

Summary and aggregation functions can be applied to the data to construct a data cube for data analysis (data reduction).

Data aggregation is the method of storing and presenting data in a summary format. The data may be gathered from multiple sources and integrated into a single summary for analysis. This is a crucial step, since the accuracy of the insights obtained depends heavily on the quantity and quality of the data used.

Gathering accurate data of high quality and a large enough quantity is necessary to produce relevant results.
The collection of data is useful for everything from decisions concerning financing or business strategy of
the product, pricing, operations, and marketing strategies.

For example, we have a data set of sales reports of an enterprise that has quarterly sales of each year. We can
aggregate the data to get the enterprise's annual sales report.

Discretization
Numeric values are replaced by interval labels or conceptual labels. The labels, in turn, can be recursively
organized into higher-level concepts, resulting in a concept hierarchy for the numeric attribute.
This is the process of converting continuous data into a set of intervals: continuous attribute values are substituted by interval labels, which makes the data easier to study and analyze. If a data mining task has to handle a continuous attribute, replacing its values with discrete interval labels improves the efficiency of the task.

This method is also called a data reduction mechanism as it transforms a large dataset into a set of categorical
data. Discretization also uses decision tree-based algorithms to produce short, compact, and accurate results
when using discrete values.

Data discretization can be classified into two types: supervised discretization, where the class information is used, and unsupervised discretization, where it is not. Either type can proceed top-down (a splitting strategy) or bottom-up (a merging strategy).

For example, the values for the age attribute can be replaced by the interval labels such as (0-10, 11-20…)
or (kid, youth, adult, senior).
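A brief sketch of this kind of discretization with pandas, assuming a hypothetical list of ages and the labels mentioned above (the bin boundaries are illustrative):

import pandas as pd

# Hypothetical ages
ages = pd.Series([3, 15, 27, 45, 68, 81])

# Replace the continuous values with interval labels ...
age_intervals = pd.cut(ages, bins=[0, 10, 20, 60, 100])
# ... or with conceptual labels
age_concepts = pd.cut(ages, bins=[0, 12, 25, 60, 100],
                      labels=["kid", "youth", "adult", "senior"])

print(age_intervals.tolist())
print(age_concepts.tolist())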

Concept Hierarchy Generation For Nominal Data

Attributes of lower level concepts can be generalized to higher-level concepts.


Example 1: street A, street B and street C can be generalized to a town or a city.
Example 2: lower-level groups are generalized into higher-level concepts.
Imagine you have a dataset containing information about products, including their categories and
subcategories. We'll create a concept hierarchy for the "Product Category" attribute.

Step 1: Data Understanding and Exploration

Suppose you have a dataset with the following product categories:


Electronics
Clothing
Books
Electronics > Phones
Electronics > Laptops
Clothing > Men's Clothing
Clothing > Women's Clothing
Books > Fiction
Books > Non-Fiction

Step 2: Hierarchy Levels

In this case, we can identify three hierarchy levels:


Top Level (General): Electronics, Clothing, Books
Middle Level: Electronics > Phones, Electronics > Laptops, Clothing > Men's Clothing, Clothing > Women's
Clothing, Books > Fiction, Books > Non-Fiction
Bottom Level (Specific): Phones, Laptops, Men's Clothing, Women's Clothing, Fiction, Non-Fiction.

Step 3: Hierarchy Generation

We'll use a manual top-down approach to generate the hierarchy.


Top Level:
Electronics
Clothing
Books

Middle Level:

Electronics
Phones
Laptops
Clothing
Men's Clothing
Women's Clothing
Books
Fiction
Non-Fiction

Bottom Level:

Phones
Laptops
Men's Clothing
Women's Clothing
Fiction
Non-Fiction

Step 4: Attribute Generalization and Specialization

Here, we're generalizing categories into higher-level groups and specializing into subcategories.
Step 5: Granularity
We've defined three levels of granularity in our hierarchy.

Step 6: Hierarchical Representation

You can represent this hierarchy as a tree structure or a directed acyclic graph. Here's a simplified textual
representation:
- Electronics
- Phones
- Laptops
- Clothing
- Men's Clothing
- Women's Clothing
- Books
- Fiction
- Non-Fiction

Step 7: Data Transformation

Replace the original categorical values with the corresponding hierarchy levels in your dataset. For instance,
if a product was originally labeled as "Laptops," it would now be represented as "Electronics > Laptops."
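
A tiny sketch of this replacement step, using a hypothetical dictionary that maps each bottom-level category to its path in the hierarchy:

# Hypothetical mapping from bottom-level categories to their hierarchy paths
hierarchy = {
    "Phones": "Electronics > Phones",
    "Laptops": "Electronics > Laptops",
    "Men's Clothing": "Clothing > Men's Clothing",
    "Women's Clothing": "Clothing > Women's Clothing",
    "Fiction": "Books > Fiction",
    "Non-Fiction": "Books > Non-Fiction",
}

products = ["Laptops", "Fiction", "Phones"]     # hypothetical product labels
generalised = [hierarchy[p] for p in products]  # replace with hierarchy levels
print(generalised)  # ['Electronics > Laptops', 'Books > Fiction', 'Electronics > Phones']
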
Step 8: Benefits

With this concept hierarchy in place, you can now perform various data mining tasks more effectively. For
example, you can analyze sales patterns at different levels of the hierarchy (e.g., total sales for "Electronics,"
sales of "Phones" vs. "Laptops," etc.). The hierarchy also helps to maintain data consistency and enables
efficient querying.

Normalization:
Normalizing the data refers to scaling the data values to a much smaller range such as [-1, 1] or [0.0, 1.0].
There are different methods to normalize the data, as discussed below.
Consider a numeric attribute A with n observed values V1, V2, V3, ..., Vn.

The measurement unit used can affect the data analysis. To help avoid dependence on the choice of measurement units, the data should be normalized or standardized.
Normalizing the data attempts to give all attributes an equal weight.
Methods for data normalization:
1. Min max normalization
2. z-score normalization
3. normalization by decimal scaling

Z-Score Normalization

Z-score normalization (also called zero-mean normalization) rescales the data so that values measured on different scales become easy to compare. Each value v of attribute A is transformed using the mean and standard deviation of A:

v' = (v - mean(A)) / std(A)

The rescaled attribute has a mean of 0 and a standard deviation of 1.

How do we calculate the z-scores of the following data?


marks
8
10
15
20
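A small sketch that computes the z-scores for these marks, applying v' = (v - mean(A)) / std(A) to every value (the population standard deviation is used here):

import numpy as np

marks = np.array([8, 10, 15, 20], dtype=float)

mean = marks.mean()   # 13.25
std = marks.std()     # population standard deviation, roughly 4.66
z_scores = (marks - mean) / std
print(np.round(z_scores, 2))  # approximately [-1.13, -0.70, 0.38, 1.45]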
Min Max normalization

Min-max normalization scales the data to a fixed range, typically between 0 and 1, which makes the values easier to compare and understand.
For example, comparing 200 and 1000 is less intuitive than comparing 0.2 and 1 on a common 0-to-1 scale.
Min-Max normalization formula: v' = ((v - Min) / (Max - Min)) * (newMax - newMin) + newMin
Applied to the marks below:
marks
8
10
15
20
Min: the minimum value of the given attribute (here Min = 8)
Max: the maximum value of the given attribute (here Max = 20)
V: the value being normalized (here V1 = 8, V2 = 10, V3 = 15, V4 = 20)
newMax = 1, newMin = 0
marks    marks after min-max normalization
8        0
10       0.17
15       0.58
20       1
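The same table can be reproduced in a few lines, applying the min-max formula with newMin = 0 and newMax = 1:

import numpy as np

marks = np.array([8, 10, 15, 20], dtype=float)

new_min, new_max = 0.0, 1.0
scaled = (marks - marks.min()) / (marks.max() - marks.min())
scaled = scaled * (new_max - new_min) + new_min
print(np.round(scaled, 2))  # [0.   0.17 0.58 1.  ]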

Normalization with Decimal scaling


Decimal scaling is a data normalization technique in which we move the decimal point of the attribute's values. How far the decimal point moves depends on the maximum absolute value among all values of the attribute.
Decimal Scaling Formula
A value vi of attribute A can be normalized as follows:
vi' = vi / 10^j, where j is the smallest integer such that max(|vi'|) < 1.
Example 1: For an attribute CGPA whose maximum value is 3, we divide by 10. Why 10? We count the digits in the maximum value and use a 1 followed by that many zeros as the divisor; 3 has only one digit, so the divisor is 10.

Example 2: For the attribute "salary bonus" the maximum value is 400, so we divide by 1000. Why 1000? 400 contains three digits, so the divisor is a 1 followed by three zeros, i.e. 1000.

Salary bonus   Formula        Normalized value after decimal scaling
400            400 / 1000     0.4
310            310 / 1000     0.31
Example 3:

Salary    Formula             Normalized value after decimal scaling
40,000    40,000 / 100,000    0.4
31,000    31,000 / 100,000    0.31
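A minimal sketch of decimal scaling for the two examples above; following the digit-counting rule described earlier, the divisor is a 1 followed by as many zeros as there are digits in the largest absolute value:

import numpy as np

def decimal_scale(values):
    values = np.asarray(values, dtype=float)
    # j = number of digits in the largest absolute value,
    # so dividing by 10**j pushes every value below 1
    j = len(str(int(np.max(np.abs(values)))))
    return values / (10 ** j), j

scaled, j = decimal_scale([400, 310])
print(j, scaled)   # 3 [0.4  0.31]

scaled, j = decimal_scale([40000, 31000])
print(j, scaled)   # 5 [0.4  0.31]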

Data Discretization and Concept Hierarchy Generation.


Data discretization refers to converting a huge number of data values into a smaller number of values so that evaluating and managing the data becomes easier. In other words, data discretization converts the attribute values of continuous data into a finite set of intervals with minimum loss of information. There are two forms of data discretization: supervised discretization, in which the class information is used, and unsupervised discretization, in which it is not. Either form can proceed by a top-down splitting strategy or a bottom-up merging strategy.

Now, we can understand this concept with the help of an example. Suppose we have an attribute Age with the given values:

Age: 1, 5, 9, 4, 7, 11, 14, 17, 13, 18, 19, 31, 33, 36, 42, 44, 46, 70, 74, 78, 77


Table before and after discretization:

Age values                After discretization
1, 5, 4, 9, 7             Child
11, 14, 17, 13, 18, 19    Young
31, 33, 36, 42, 44, 46    Mature
70, 74, 77, 78            Old

Another example is web analytics, where we gather statistics about website visitors. For example, all visitors who reach the site from an IP address in India are shown under the country level.

Discretization replaces the raw values of a numeric attribute with interval labels or conceptual labels. It divides the range of a continuous attribute into intervals. This is useful because some classification algorithms accept only categorical attributes, and it reduces data size: the number of values for a given continuous attribute is reduced by dividing the range of the attribute into intervals, and the interval labels then replace the actual data values.

Three types of attributes:


Nominal — values from an unordered set.
Ordinal — values from an ordered set.
Continuous — values are real numbers.


Concept Hierarchy Generation : Attributes are converted from lower level to higher level in
hierarchy.
For Example- The attribute “city” can be converted to “country”.

The term hierarchy refers to an organizational structure or mapping in which items are arranged at different levels. In other words, a concept hierarchy is a sequence of mappings from a set of low-level (specific) concepts to higher-level (more general) concepts, i.e., the mapping goes from low-level concepts to high-level concepts. For example, in computer science there are many hierarchical systems: a document placed in a folder at a specific position in the Windows file tree is a familiar example of a hierarchical tree model. There are two types of hierarchy mapping: top-down mapping and bottom-up mapping.

Let's understand a concept hierarchy for the dimension location with an example: a particular city can be mapped to the country it belongs to. For example, New Delhi can be mapped to India, and India can be mapped to Asia.

Top-down mapping

Top-down mapping starts at the top with general concepts and works down to the specialized concepts at the bottom.

Bottom-up mapping

Bottom-up mapping starts at the bottom with specialized concepts and works up to the generalized concepts at the top.

DATA BINARIZATION
Data discretization converts the attribute values of continuous data into a finite set of intervals with minimum data loss. In contrast, data binarization transforms continuous and discrete attributes into binary attributes.
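A small sketch of binarization with a simple threshold (scikit-learn's preprocessing.Binarizer works the same way); the purchase amounts and the cut-off of 100 are made up:

import numpy as np

amounts = np.array([35, 120, 80, 250, 99])  # hypothetical continuous attribute

# Binarize: 1 if the value exceeds the threshold, otherwise 0
threshold = 100
binary_amounts = (amounts > threshold).astype(int)
print(binary_amounts)  # [0 1 0 1 0]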

FOR REFERENCE:
preprocessing step (data cleaning, data integration, data reduction, and data transformation)
with a deep explanation and the obtained table structure at each step.

Sample Dataset: Customer Purchase History

First, let's start with the initial datasets:


1. Data Cleaning:

In this step, we handle duplicate records and missing values.


There are no duplicate records in the given datasets.
Assuming there are no missing values.

2. Data Integration:

In this step, we merge the customer information and purchase history datasets using the common
"Customer ID" field.
3. Data Reduction:
In this step, we perform dimensionality reduction, but for simplicity, we'll skip this step in our
example.

4. Data Transformation:

In this step, we'll perform data transformation including normalization, one-hot encoding, and
creating a new feature "Total Purchases."

Normalize the "Age" and "Price" columns using Min-Max normalization.


Encode categorical variable "Gender" using one-hot encoding.
Create a new feature "Total Purchases" indicating the number of purchases made by each
customer.

In this transformed dataset, we've applied data transformation techniques to normalize


numerical columns, encode categorical variables, and create a new feature. The resulting table
is now in a format that is suitable for analysis or modeling, with features that have been
processed and structured for better representation and insight extraction.
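
A rough end-to-end sketch of the integration and transformation steps above, assuming two small hypothetical tables, customers and purchases, joined on Customer ID (column names and values are invented for illustration):

import pandas as pd

# Hypothetical source tables
customers = pd.DataFrame({
    "Customer ID": [1, 2, 3],
    "Age": [25, 40, 33],
    "Gender": ["F", "M", "F"],
})
purchases = pd.DataFrame({
    "Customer ID": [1, 1, 2, 3, 3, 3],
    "Price": [20.0, 35.0, 15.0, 50.0, 10.0, 25.0],
})

# Data integration: merge on the common "Customer ID" field
data = purchases.merge(customers, on="Customer ID")

# Data transformation:
# 1) Min-Max normalize the numeric columns "Age" and "Price"
for col in ["Age", "Price"]:
    data[col] = (data[col] - data[col].min()) / (data[col].max() - data[col].min())

# 2) One-hot encode the categorical variable "Gender"
data = pd.get_dummies(data, columns=["Gender"])

# 3) New feature: number of purchases made by each customer
data["Total Purchases"] = data.groupby("Customer ID")["Customer ID"].transform("count")

print(data)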
