DMW Unit 1

Data is a collection of discrete or continuous values that convey information, describing
the quantity, quality, fact, statistics, other basic units of meaning, or simply sequences of symbols that
may be further interpreted formally. A datum is an individual value in a collection of data. Data are
usually organized into structures such as tables that provide additional context and meaning, and may
themselves be used as data in larger structures. Data may be used as variables in a computational process.
[1][2]
Data may represent abstract ideas or concrete measurements. [3] Data are commonly used in scientific
research, economics, and virtually every other form of human organizational activity. Examples of data
sets include price indices (such as the consumer price index), unemployment rates, literacy rates,
and census data. In this context, data represent the raw facts and figures from which useful information
can be extracted.
Data, information, knowledge, and wisdom are closely related concepts, but each has its role concerning
the other, and each term has its meaning. According to a common view, data is collected and analyzed;
data only becomes information suitable for making decisions once it has been analyzed in some fashion.
[8]
One can say that the extent to which a set of data is informative to someone depends on the extent to
which it is unexpected by that person. The amount of information contained in a data stream may be
characterized by its Shannon entropy.
Knowledge is the awareness of its environment that some entity possesses, whereas data merely
communicates that knowledge. For example, the entry in a database specifying the height of Mount
Everest is a datum that communicates a precisely-measured value. This measurement may be included in
a book along with other data on Mount Everest to describe the mountain in a manner useful for those who
wish to decide on the best method to climb it. Awareness of the characteristics represented by this data is
knowledge.
Data are often assumed to be the least abstract concept, information the next least, and knowledge the
most abstract.[9] In this view, data becomes information by interpretation; e.g., the height of Mount
Everest is generally considered "data", a book on Mount Everest's geological characteristics may be
considered "information", and a climber's guidebook containing practical information on the best way to
reach Mount Everest's peak may be considered "knowledge". "Information" bears a diversity of meanings
that range from everyday usage to technical use. This view, however, has also been argued to reverse how
data emerges from information, and information from knowledge.[10] Generally speaking, the concept of
information is closely related to notions of constraint, communication, control, data, form, instruction,
knowledge, meaning, mental stimulus, pattern, perception, and representation. Beynon-Davies uses the
concept of a sign to differentiate between data and information; data is a series of symbols, while
information occurs when the symbols are used to refer to something.
What are Data Attributes?
 Data attributes refer to the specific characteristics or properties that describe individual data
objects within a dataset.
 These attributes provide meaningful information about the objects and are used to analyze,
classify, or manipulate the data.
 Understanding and analyzing data attributes is fundamental in various fields such
as statistics, machine learning, and data analysis, as they form the basis for deriving insights and
making informed decisions from the data.
 Within predictive models, attributes serve as the predictors influencing an outcome. In descriptive
models, attributes constitute the pieces of information under examination for inherent patterns or
correlations.
The set of attributes used to describe a given object is known as an attribute vector or
feature vector.
Examples of data attributes include numerical values (e.g., age, height), categorical labels (e.g., color,
type), textual descriptions (e.g., name, description), or any other measurable or qualitative aspect of the
data objects.
Types of attributes:
The initial phase of data preprocessing involves categorizing attributes into different types, which
serves as a foundation for subsequent data processing steps. Attributes can be broadly classified into two
main types:
1. Qualitative (Nominal (N), Ordinal (O), Binary(B)).
2. Quantitative (Numeric, Discrete, Continuous)

Qualitative Attributes:
1. Nominal Attributes :
Nominal attributes, as related to names, refer to categorical data where the values represent different
categories or labels without any inherent order or ranking. These attributes are often used to represent
names or labels associated with objects, entities, or concepts.
Example: hair colour (black, brown, blond), marital status, or occupation, where the values are labels with no inherent order.
2. Binary Attributes: Binary attributes are a type of qualitative attribute where the data can take on only
two distinct values or states. These attributes are often used to represent yes/no, presence/absence, or
true/false conditions within a dataset. They are particularly useful for representing categorical data where
there are only two possible outcomes. For instance, in a medical study, a binary attribute could represent
whether a patient is affected or unaffected by a particular condition.
 Symmetric: In a symmetric attribute, both values or states are considered equally important or
interchangeable. For example, in the attribute “Gender” with values “Male” and “Female,”
neither value holds precedence over the other, and they are considered equally significant for
analysis purposes.

 Asymmetric: An asymmetric attribute indicates that the two values or states are not equally
important or interchangeable. For instance, in the attribute “Result” with values “Pass” and
“Fail,” the states are not of equal importance; passing may hold greater significance than failing
in certain contexts, such as academic grading or certification exams.

3. Ordinal Attributes : Ordinal attributes are a type of qualitative attribute where the values possess a
meaningful order or ranking, but the magnitude between values is not precisely quantified. In other
words, while the order of values indicates their relative importance or precedence, the numerical
difference between them is not standardized or known.
Example: drink size (small, medium, large) or exam grades (A, B, C), where the order is meaningful but the gap between adjacent values is not fixed.
Quantitative Attributes:
1. Numeric: A numeric attribute is quantitative because it is a measurable quantity, represented by integer
or real values. Numeric attributes are of two types: interval-scaled and ratio-scaled.
 An interval-scaled attribute has values whose differences are interpretable, but the attribute does
not have a true reference point, or what we can call a zero point. Data on an interval scale can be
added and subtracted but cannot be multiplied or divided. Consider temperature in degrees
Centigrade: if one day's temperature is twice that of another day in degrees, we cannot say that
one day is twice as hot as the other.
 A ratio-scaled attribute is a numeric attribute with a fixed zero point. If a measurement is ratio-
scaled, we can speak of a value as being a multiple (or ratio) of another value. The values are
ordered, we can compute the difference between values, and the mean, median, mode,
interquartile range, and five-number summary can be given.

2. Discrete: Discrete data refer to information that can take on specific, separate values rather than a
continuous range. These values are distinct and separate from one another, and they can be either
numerical or categorical in nature.

Example: the number of students in a class or the number of cars in a parking lot.

3. Continuous: Continuous data, unlike discrete data, can take on an infinite number of possible values
within a given range. It is characterized by being able to assume any value within a specified interval,
often including fractional or decimal values.
Example: height, weight, or temperature measurements, which can take values such as 172.5 cm.
Data Preprocessing in Data Mining
Data preprocessing is an important step in the data mining process. It refers to the cleaning, transforming,
and integrating of data in order to make it ready for analysis. The goal of data preprocessing is to improve
the quality of the data and to make it more suitable for the specific data mining task.
Some common steps in data preprocessing include:
Data Cleaning: This involves identifying and correcting errors or inconsistencies in the data, such as
missing values, outliers, and duplicates. Various techniques can be used for data cleaning, such as
imputation, removal, and transformation.
Data Integration: This involves combining data from multiple sources to create a unified dataset. Data
integration can be challenging as it requires handling data with different formats, structures, and
semantics. Techniques such as record linkage and data fusion can be used for data integration.
Data Transformation: This involves converting the data into a suitable format for analysis. Common
techniques used in data transformation include normalization, standardization, and discretization.
Normalization is used to scale the data to a common range, while standardization is used to transform the
data to have zero mean and unit variance. Discretization is used to convert continuous data into discrete
categories.
Data Reduction: This involves reducing the size of the dataset while preserving the important
information. Data reduction can be achieved through techniques such as feature selection and feature
extraction. Feature selection involves selecting a subset of relevant features from the dataset, while
feature extraction involves transforming the data into a lower-dimensional space while preserving the
important information.
Data Discretization: This involves dividing continuous data into discrete categories or intervals.
Discretization is often used in data mining and machine learning algorithms that require categorical data.
Discretization can be achieved through techniques such as equal width binning, equal frequency binning,
and clustering.
Data Normalization: This involves scaling the data to a common range, such as between 0 and 1 or -1
and 1. Normalization is often used to handle data with different units and scales. Common normalization
techniques include min-max normalization, z-score normalization, and decimal scaling.
Data preprocessing plays a crucial role in ensuring the quality of data and the accuracy of the analysis
results. The specific steps involved in data preprocessing may vary depending on the nature of the data
and the analysis goals.
By performing these steps, the data mining process becomes more efficient and the results become more
accurate.
Preprocessing in Data Mining:
Data preprocessing is a data mining technique which is used to transform the raw data into a useful and
efficient format.
Steps Involved in Data Preprocessing:
1. Data Cleaning:
Raw data can have many irrelevant and missing parts. Data cleaning is done to handle this; it
involves handling of missing data, noisy data, etc.

 (a). Missing Data:


This situation arises when some values are missing from the data. It can be handled in various ways.
Some of them are:
1. Ignore the tuples:
This approach is suitable only when the dataset we have is quite large and multiple values
are missing within a tuple.

2. Fill the Missing values:


There are various ways to do this task. You can choose to fill the missing values
manually, with the attribute mean, or with the most probable value (a minimal sketch follows below).
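A minimal sketch of both strategies in Python, assuming a small hypothetical pandas DataFrame with a few missing entries:

```python
import numpy as np
import pandas as pd

# Hypothetical dataset with missing values (NaN)
df = pd.DataFrame({"age": [25, 30, np.nan, 40],
                   "salary": [50000, 60000, 55000, np.nan]})

# Option 1: ignore (drop) the tuples that contain missing values
dropped = df.dropna()

# Option 2: fill the missing values with the attribute mean
filled = df.fillna(df.mean(numeric_only=True))

print(dropped)
print(filled)
```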

 (b). Noisy Data:


Noisy data is meaningless data that cannot be interpreted by machines. It can be generated due to
faulty data collection, data entry errors, etc. It can be handled in the following ways:
1. Binning Method:
This method works on sorted data in order to smooth it. The whole data is divided into
segments of equal size, and then various methods are applied to complete the task.
Each segment is handled separately: one can replace all data in a segment by its mean,
or boundary values can be used to complete the task (see the sketch after this list).

2. Regression:
Here data can be made smooth by fitting it to a regression function. The regression used
may be linear (having one independent variable) or multiple (having multiple
independent variables).

3. Clustering:
This approach groups similar data into clusters. Outliers may go undetected, or they
will fall outside the clusters.
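A minimal NumPy sketch of the binning method referenced above, using a small hypothetical sorted attribute and equal-size bins; smoothing by bin means and by bin boundaries are both shown:

```python
import numpy as np

# Hypothetical sorted attribute values to be smoothed
data = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

bin_size = 4                       # equal-size bins of 4 values each
bins = data.reshape(-1, bin_size)  # one row per bin

# Smoothing by bin means: every value in a bin is replaced by the bin mean
smoothed_by_means = np.repeat(bins.mean(axis=1), bin_size)

# Smoothing by bin boundaries: each value is replaced by the closer bin boundary
low, high = bins[:, [0]], bins[:, [-1]]
smoothed_by_boundaries = np.where(bins - low < high - bins, low, high)

print(smoothed_by_means)
print(smoothed_by_boundaries)
```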
2. Data Integration in Data Mining
 Data integration in data mining refers to the process of combining data from multiple
sources into a single, unified view. This can involve cleaning and transforming the data, as
well as resolving any inconsistencies or conflicts that may exist between the different
sources. The goal of data integration is to make the data more useful and meaningful for the
purposes of analysis and decision making. Techniques used in data integration include data
warehousing, ETL (extract, transform, load) processes, and data federation.
Data Integration is a data preprocessing technique that combines data from multiple heterogeneous
data sources into a coherent data store and provides a unified view of the data. These sources may
include multiple data cubes, databases, or flat files.
The data integration approaches are formally defined as a triple <G, S, M> where
G stands for the global schema,
S stands for the heterogeneous source schemas, and
M stands for the mapping between queries over the source and global schemas.
What is data integration?
Data integration is the process of combining data from multiple sources into a cohesive and
consistent view. This process involves identifying and accessing the different data sources, mapping
the data to a common format, and reconciling any inconsistencies or discrepancies between the
sources. The goal of data integration is to make it easier to access and analyze data that is spread
across multiple systems or platforms, in order to gain a more complete and accurate understanding
of the data.
Data integration can be challenging due to the variety of data formats, structures, and semantics
used by different data sources. Different data sources may use different data types, naming
conventions, and schemas, making it difficult to combine the data into a single view. Data
integration typically involves a combination of manual and automated processes, including data
profiling, data mapping, data transformation, and data reconciliation.
Data integration is used in a wide range of applications, such as business intelligence, data
warehousing, master data management, and analytics. Data integration can be critical to the
success of these applications, as it enables organizations to access and analyze data that is spread
across different systems, departments, and lines of business, in order to make better decisions,
improve operational efficiency, and gain a competitive advantage.
There are two major approaches for data integration: the “tight coupling approach”
and the “loose coupling approach”.
Tight Coupling:
This approach involves creating a centralized repository or data warehouse to store the integrated
data. The data is extracted from various sources, transformed and loaded into a data warehouse.
Data is integrated in a tightly coupled manner, meaning that the data is integrated at a high level,
such as at the level of the entire dataset or schema. This approach is also known as data
warehousing, and it enables data consistency and integrity, but it can be inflexible and difficult to
change or update.
 Here, a data warehouse is treated as an information retrieval component.
 In this coupling, data is combined from different sources into a single physical location
through the process of ETL – Extraction, Transformation, and Loading.
Loose Coupling:
This approach involves integrating data at the lowest level, such as at the level of individual data
elements or records. Data is integrated in a loosely coupled manner, meaning that the data is
integrated at a low level, and it allows data to be integrated without having to create a central
repository or data warehouse. This approach is also known as data federation, and it enables data
flexibility and easy updates, but it can be difficult to maintain consistency and integrity across
multiple data sources.
 Here, an interface is provided that takes the query from the user, transforms it in a way the
source database can understand, and then sends the query directly to the source databases
to obtain the result.
 And the data only remains in the actual source databases.
Issues in Data Integration:
There are several issues that can arise when integrating data from multiple sources, including:
1. Data Quality: Inconsistencies and errors in the data can make it difficult to combine and
analyze.
2. Data Semantics: Different sources may use different terms or definitions for the same data,
making it difficult to combine and understand the data.
3. Data Heterogeneity: Different sources may use different data formats, structures, or
schemas, making it difficult to combine and analyze the data.
4. Data Privacy and Security: Protecting sensitive information and maintaining security can
be difficult when integrating data from multiple sources.
5. Scalability: Integrating large amounts of data from multiple sources can be computationally
expensive and time-consuming.
6. Data Governance: Managing and maintaining the integration of data from multiple sources
can be difficult, especially when it comes to ensuring data accuracy, consistency, and
timeliness.
7. Performance: Integrating data from multiple sources can also affect the performance of the
system.
8. Integration with existing systems: Integrating new data sources with existing systems can be
a complex task, requiring significant effort and resources.
9. Complexity: The complexity of integrating data from multiple sources can be high,
requiring specialized skills and knowledge.

Correlation Analysis
Correlation analysis is a statistical technique for determining the strength of a link between two
variables. It is used to detect patterns and trends in data and to forecast future occurrences.
 Consider a problem with several factors that must be weighed to reach an optimal conclusion.
 Correlation explains how these variables are dependent on each other.
 Correlation quantifies how strong the relationship between two variables is. A higher value
of the correlation coefficient implies a stronger association.
 The sign of the correlation coefficient indicates the direction of the relationship between
variables. It can be either positive, negative, or zero.
What is Correlation?
The Pearson correlation coefficient is the most often used metric of correlation. It expresses the
linear relationship between two variables in numerical terms. The Pearson correlation coefficient,
written as “r,” is as follows:
r = Σ(xᵢ − x̄)(yᵢ − ȳ) / √[ Σ(xᵢ − x̄)² · Σ(yᵢ − ȳ)² ]
where,
 r : correlation coefficient
 xᵢ : the i-th value of dataset X
 x̄ : the mean of dataset X
 yᵢ : the i-th value of dataset Y
 ȳ : the mean of dataset Y
The correlation coefficient, denoted by “r”, ranges between -1 and 1.
r = -1 indicates a perfect negative correlation.
r = 0 indicates no linear correlation between the variables.
r = 1 indicates a perfect positive correlation.
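A short sketch of the formula above in Python with NumPy, using a small hypothetical pair of variables; NumPy's built-in correlation matrix is used as a cross-check:

```python
import numpy as np

# Hypothetical paired observations (e.g., height in cm and weight in kg)
x = np.array([150, 160, 165, 170, 180], dtype=float)
y = np.array([50, 56, 61, 65, 72], dtype=float)

# r = sum((xi - x_mean)(yi - y_mean)) / sqrt(sum((xi - x_mean)^2) * sum((yi - y_mean)^2))
dx, dy = x - x.mean(), y - y.mean()
r = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())
print(round(r, 4))                 # close to +1: strong positive linear relationship

print(np.corrcoef(x, y)[0, 1])     # equivalent result from NumPy's correlation matrix
```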
Types of Correlation
There are three types of correlation:

1. Positive Correlation: Positive correlation indicates that two variables have a direct
relationship. As one variable increases, the other variable also increases. For example, there
is a positive correlation between height and weight. As people get taller, they also tend to
weigh more.
2. Negative Correlation: Negative correlation indicates that two variables have an inverse
relationship. As one variable increases, the other variable decreases. For example, there is a
negative correlation between price and demand. As the price of a product increases, the
demand for that product decreases.
3. Zero Correlation: Zero correlation indicates that there is no relationship between two
variables. The changes in one variable do not affect the other variable. For example, there is
zero correlation between shoe size and intelligence.
A positive correlation indicates that the two variables move in the same direction, while a negative
correlation indicates that the two variables move in opposite directions.
The strength of the correlation is measured by a correlation coefficient, which can range from -1 to
1. A correlation coefficient of 0 indicates no correlation, while a correlation coefficient of 1 or -1
indicates a perfect correlation.
Correlation Coefficients
The different types of correlation coefficients used to measure the relation between two variables
are:

 Pearson Correlation Coefficient – linear relation; interval/ratio data; normal distribution
 Spearman Rank Correlation Coefficient – non-linear (monotonic) relation; ordinal data; any distribution
 Kendall Tau Coefficient – non-linear relation; ordinal data; any distribution
 Phi Coefficient – non-linear relation; nominal vs. nominal data (dichotomous, i.e. 2 categories); any distribution
 Cramer's V – non-linear relation; two nominal variables; any distribution

How to Conduct Correlation Analysis


To conduct a correlation analysis, you will need to follow these steps:
1. Identify variables: Identify the two variables that we want to correlate. The variables should
be quantitative, meaning that they can be represented by numbers.
2. Collect data: Collect data on the two variables. We can collect data from a variety of
sources, such as surveys, experiments, or existing records.
3. Choose the appropriate correlation coefficient: The Pearson correlation coefficient is the
most commonly used, but other correlation coefficients may be more appropriate for certain
types of data.
4. Calculate the correlation coefficient: We can use a statistical software package to calculate
the correlation coefficient, or we can apply the formula directly.
5. Interpret the correlation coefficient: The correlation coefficient can be interpreted as a
measure of the strength and direction of the linear relationship between the two variables.
Applications of Correlation Analysis
Correlation Analysis is an important tool that helps in better decision-making, enhances predictions,
and enables better optimization across different fields. Predictions and decisions depend on the
relations between different variables, and correlation analysis quantifies these relations.
The various fields in which it can be used are:
 Economics and Finance: Helps in analyzing economic trends by understanding the
relations between supply and demand.
 Business Analytics: Helps in making better decisions for the company and provides
valuable insights.
 Market Research and Promotions: Helps in creating better marketing strategies by
analyzing the relation between recent market trends and customer behavior.
 Medical Research: Correlation can be employed in healthcare to better understand
the relations between different symptoms of diseases and to understand genetic diseases
better.
 Weather Forecasts: Analyzing the correlation between different variables so as to predict
the weather.
 Better Customer Service: Helps to better understand customers and significantly
increases the quality of customer service.
 Environmental Analysis: Helps create better environmental policies by understanding
various environmental factors.
Advantages of Correlation Analysis
 Correlation analysis helps us understand how two variables affect each other or are related
to each other.
 They are simple and very easy to interpret.
 Aids the decision-making process in business, healthcare, marketing, etc.
 Helps in feature selection in machine learning.
 Gives a measure of the relation between two variables.
Disadvantages of Correlation Analysis
 Correlation does not imply causation, which means a variable may not be the cause for the
other variable even though they are correlated.
 If outliers are not dealt with well they may cause errors.
 It works well only on bivariate relations and may not produce accurate results for
multivariate relations.
 Complex relations can not be analyzed accurately.

3. Data Transformation:
This step is taken in order to transform the data into forms appropriate for the mining
process. It involves the following ways:
1. Normalization:
It is done in order to scale the data values to a specified range, such as -1.0 to 1.0 or 0.0 to 1.0.

2. Attribute Selection:
In this strategy, new attributes are constructed from the given set of attributes to help the mining
process.

3. Discretization:
This is done to replace the raw values of a numeric attribute by interval levels or conceptual levels.

4. Concept Hierarchy Generation:


Here attributes are converted from a lower level to a higher level in the hierarchy. For example, the
attribute “city” can be converted to “country” (a minimal sketch follows).
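A tiny illustrative sketch of concept hierarchy generation, using a hypothetical city-to-country mapping:

```python
# Hypothetical concept hierarchy: city -> country (lower level -> higher level)
city_to_country = {"Pune": "India", "Mumbai": "India", "Paris": "France"}

records = [{"customer": "A", "city": "Pune"},
           {"customer": "B", "city": "Paris"}]

# Generalize the low-level attribute "city" to the higher-level attribute "country"
for rec in records:
    rec["country"] = city_to_country.get(rec["city"], "Unknown")

print(records)
```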

4. Data Reduction:
Data reduction is a crucial step in the data mining process that involves reducing the size of the dataset
while preserving the important information. This is done to improve the efficiency of data analysis and to
avoid overfitting of the model. Some common steps involved in data reduction are:
Feature Selection: This involves selecting a subset of relevant features from the dataset. Feature
selection is often performed to remove irrelevant or redundant features from the dataset. It can be done
using various techniques such as correlation analysis, mutual information, and principal component
analysis (PCA).
Feature Extraction: This involves transforming the data into a lower-dimensional space while
preserving the important information. Feature extraction is often used when the original features are high-
dimensional and complex. It can be done using techniques such as PCA, linear discriminant analysis
(LDA), and non-negative matrix factorization (NMF).
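A minimal sketch of PCA-style feature extraction using only NumPy, centring a small hypothetical dataset and projecting it onto the top two principal components obtained via SVD:

```python
import numpy as np

# Hypothetical 6 samples with 3 correlated features
X = np.array([[2.5, 2.4, 1.2],
              [0.5, 0.7, 0.3],
              [2.2, 2.9, 1.4],
              [1.9, 2.2, 1.1],
              [3.1, 3.0, 1.5],
              [2.3, 2.7, 1.3]])

# Centre the data, then project onto the top-2 principal components via SVD
X_centred = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centred, full_matrices=False)
X_reduced = X_centred @ Vt[:2].T   # lower-dimensional representation (6 x 2)

print(X_reduced.shape)
```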
Sampling: This involves selecting a subset of data points from the dataset. Sampling is often used to
reduce the size of the dataset while preserving the important information. It can be done using techniques
such as random sampling, stratified sampling, and systematic sampling.
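A small sketch of two of these sampling schemes (simple random and systematic) on a hypothetical dataset, using NumPy's random generator:

```python
import numpy as np

rng = np.random.default_rng(seed=0)
data = np.arange(1000)   # hypothetical dataset of 1000 records

# Simple random sample without replacement
srs = rng.choice(data, size=100, replace=False)

# Systematic sample: every k-th record after a random start
k = len(data) // 100
start = rng.integers(k)
systematic = data[start::k]

print(len(srs), len(systematic))
```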
Clustering: This involves grouping similar data points together into clusters. Clustering is often used to
reduce the size of the dataset by replacing similar data points with a representative centroid. It can be
done using techniques such as k-means, hierarchical clustering, and density-based clustering.
Compression: This involves compressing the dataset while preserving the important information.
Compression is often used to reduce the size of the dataset for storage and transmission purposes. It can
be done using techniques such as wavelet compression, JPEG compression, and gzip compression.
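A minimal example of lossless compression with Python's standard-library gzip module on some hypothetical repetitive data; the original is recovered exactly on decompression:

```python
import gzip

raw = ("sensor_reading,42.0\n" * 1000).encode()   # hypothetical repetitive data

compressed = gzip.compress(raw)        # lossless compression
restored = gzip.decompress(compressed) # exact original recovered

print(len(raw), len(compressed), restored == raw)
```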

Data Normalization in Data Mining


Data normalization is a technique used in data mining to transform the values of a dataset into a common
scale. This is important because many machine learning algorithms are sensitive to the scale of the input
features and can produce better results when the data is normalized.
There are several different normalization techniques that can be used in data mining, including:
1. Min-Max normalization: This technique scales the values of a feature to a range between 0 and
1. This is done by subtracting the minimum value of the feature from each value, and then
dividing by the range of the feature.
2. Z-score normalization: This technique scales the values of a feature to have a mean of 0 and a
standard deviation of 1. This is done by subtracting the mean of the feature from each value, and
then dividing by the standard deviation.
3. Decimal Scaling: This technique scales the values of a feature by dividing the values of a feature
by a power of 10.
4. Logarithmic transformation: This technique applies a logarithmic transformation to the values
of a feature. This can be useful for data with a wide range of values, as it can help to reduce the
impact of outliers.
5. Root transformation: This technique applies a square root transformation to the values of a
feature. This can be useful for data with a wide range of values, as it can help to reduce the
impact of outliers.
6. It’s important to note that normalization should be applied only to the input features, not the
target variable, and that different normalization techniques may work better for different types of
data and models.
In conclusion, normalization is an important step in data mining, as it can help to improve the
performance of machine learning algorithms by scaling the input features to a common scale. This can
help to reduce the impact of outliers and improve the accuracy of the model.
Normalization is used to scale the data of an attribute so that it falls in a smaller range, such as -1.0 to 1.0
or 0.0 to 1.0. It is generally useful for classification algorithms.
Need of Normalization –
Normalization is generally required when we are dealing with attributes on different scales; otherwise, it
may dilute the effectiveness of an equally important attribute (on a smaller scale) because another
attribute has values on a larger scale. In simple words, when multiple attributes are present but their
values are on different scales, this may lead to poor data models while performing data mining
operations. So they are normalized to bring all the attributes onto the same scale.
Methods of Data Normalization –
 Decimal Scaling
 Min-Max Normalization
 z-Score Normalization(zero-mean Normalization)
Decimal Scaling Method For Normalization –
It normalizes by moving the decimal point of the data values. Each value vi of the data is normalized to vi' by dividing it by a power of 10:

vi' = vi / 10^j

where j is the smallest integer such that max(|vi'|) < 1.

Example –
Let the input data be: -10, 201, 301, -401, 501, 601, 701
Step 1: The maximum absolute value in the given data is 701.
Step 2: Divide each value by 1000 (i.e., j = 3).
Result: The normalized data is: -0.01, 0.201, 0.301, -0.401, 0.501, 0.601, 0.701
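The same decimal scaling example, sketched in Python with NumPy:

```python
import numpy as np

data = np.array([-10, 201, 301, -401, 501, 601, 701], dtype=float)

# j is the smallest integer such that max(|v'|) < 1; for max(|v|) = 701, j = 3
j = int(np.floor(np.log10(np.abs(data).max()))) + 1
normalized = data / (10 ** j)

print(j, normalized)
```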
Min-Max Normalization –
In this technique of data normalization, a linear transformation is performed on the original data. The minimum and maximum values of the attribute are fetched, and each value is replaced according to the following formula:

v' = ((v − min(A)) / (max(A) − min(A))) · (new_max(A) − new_min(A)) + new_min(A)

where A is the attribute, min(A) and max(A) are the minimum and maximum values of A respectively, v is the old value of an entry, v' is its new value, and new_min(A), new_max(A) are the boundary values of the required range.
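A short NumPy sketch of min-max normalization to the range [0, 1], using hypothetical attribute values:

```python
import numpy as np

data = np.array([20, 35, 50, 65, 80], dtype=float)   # hypothetical attribute values
new_min, new_max = 0.0, 1.0                          # target range

v_prime = (data - data.min()) / (data.max() - data.min()) * (new_max - new_min) + new_min
print(v_prime)   # [0.   0.25 0.5  0.75 1.  ]
```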
Z-score normalization –
In this technique, values are normalized based on the mean and standard deviation of the attribute A. The formula used is:

v' = (v − Ā) / σA

where v and v' are the old and new values of an entry respectively, and Ā and σA are the mean and standard deviation of A respectively.
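A short NumPy sketch of z-score normalization on the same hypothetical values:

```python
import numpy as np

data = np.array([20, 35, 50, 65, 80], dtype=float)

mean_a = data.mean()    # mean of A
sigma_a = data.std()    # standard deviation of A (population form)
v_prime = (data - mean_a) / sigma_a

print(v_prime)          # normalized values with mean 0 and unit variance
```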
Advantages and Disadvantages:
Data normalization in data mining can have a number of advantages and disadvantages.
Advantages:
1. Improved performance of machine learning algorithms: Normalization can help to improve the
performance of machine learning algorithms by scaling the input features to a common scale.
This can help to reduce the impact of outliers and improve the accuracy of the model.
2. Better handling of outliers: Normalization can help to reduce the impact of outliers by scaling the
data to a common scale, which can make the outliers less influential.
3. Improved interpretability of results: Normalization can make it easier to interpret the results of a
machine learning model, as the inputs will be on a common scale.
4. Better generalization: Normalization can help to improve the generalization of a model, by
reducing the impact of outliers and by making the model less sensitive to the scale of the inputs.
Disadvantages:
1. Loss of information: Normalization can result in a loss of information if the original scale of the
input features is important.
2. Impact on outliers: Normalization can make it harder to detect outliers as they will be scaled
along with the rest of the data.
3. Impact on interpretability: Normalization can make it harder to interpret the results of a machine
learning model, as the inputs will be on a common scale, which may not align with the original
scale of the data.
4. Additional computational costs: Normalization can add additional computational costs to the data
mining process, as it requires additional processing time to scale the data.
5. In conclusion, data normalization can have both advantages and disadvantages. It can improve the
performance of machine learning algorithms and make it easier to interpret the results. However,
it can also result in a loss of information and make it harder to detect outliers. It’s important to
weigh the pros and cons of data normalization and carefully assess the risks and benefits before
implementing it.
Data Reduction
Data reduction is a technique used in data mining to reduce the size of a dataset while still preserving the
most important information. This can be beneficial in situations where the dataset is too large to be
processed efficiently, or where the dataset contains a large amount of irrelevant or redundant information.
There are several different data reduction techniques that can be used in data mining, including:
1. Data Cube Aggregation: Data cube aggregation involves moving the data from a detailed level
to fewer dimensions. The resulting data set is smaller in volume, without loss of information
necessary for the analysis task.
Data cube aggregation is a multidimensional aggregation that uses aggregation at various levels
of a data cube to represent the original data set, thus achieving data reduction. The data cube is
also a much more efficient way of storing data and supports faster aggregation operations.
2. Data Sampling: This technique involves selecting a subset of the data to work with, rather than
using the entire dataset. This can be useful for reducing the size of a dataset while still preserving
the overall trends and patterns in the data.
There are four types of sampling-based data reduction methods:
i. Simple random sample without replacement (SRSWOR) of size s
ii. Simple random sample with replacement (SRSWR) of size s
iii. Cluster sample
iv. Stratified sample

3. Dimensionality Reduction: This technique involves reducing the number of features in the
dataset, either by removing features that are not relevant or by combining multiple features into a
single feature.
4. Data Compression: This technique involves using techniques such as lossy or lossless
compression to reduce the size of a dataset.
5. Data Discretization: This technique involves converting continuous data into discrete data by
partitioning the range of possible values into intervals or bins.
6. Feature Selection: This technique involves selecting a subset of features from the dataset that are
most relevant to the task at hand.
7. It’s important to note that data reduction can have a trade-off between the accuracy and the size of
the data. The more data is reduced, the less accurate the model will be and the less generalizable
it will be.
In conclusion, data reduction is an important step in data mining, as it can help to improve the efficiency
and performance of machine learning algorithms by reducing the size of the dataset. However, it is
important to be aware of the trade-off between the size and accuracy of the data, and carefully assess the
risks and benefits before implementing it.
Methods of data reduction:
These are explained below.
1. Data Cube Aggregation:
This technique is used to aggregate data into a simpler form. For example, imagine that the information you
gathered for your analysis for the years 2012 to 2014 includes the revenue of your company
every three months. If you are interested in annual sales rather than quarterly figures, the data can be
summarized so that the resulting data reports the total sales per year instead of per
quarter.
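A small pandas sketch of this aggregation, rolling hypothetical quarterly revenue up to annual totals:

```python
import pandas as pd

# Hypothetical quarterly revenue for 2012-2014
sales = pd.DataFrame({
    "year":    [2012, 2012, 2012, 2012, 2013, 2013, 2013, 2013, 2014, 2014, 2014, 2014],
    "quarter": [1, 2, 3, 4] * 3,
    "revenue": [224, 408, 350, 586, 300, 420, 380, 600, 310, 450, 400, 620],
})

# Aggregate the quarterly data up to annual totals (one row per year)
annual = sales.groupby("year", as_index=False)["revenue"].sum()
print(annual)
```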
2. Dimension reduction:
Whenever we come across data that is weakly important, we keep only the attributes required for our
analysis. This reduces data size as it eliminates outdated or redundant features.
 Step-wise Forward Selection –
The selection begins with an empty set of attributes; at each step we add the best of the remaining
original attributes to the set based on their relevance, commonly judged with a statistical significance
test (p-value). A simplified code sketch follows the worked examples below.
Suppose there are the following attributes in the data set in which few attributes are redundant.
Initial attribute Set: {X1, X2, X3, X4, X5, X6}
Initial reduced attribute set: { }

Step-1: {X1}
Step-2: {X1, X2}
Step-3: {X1, X2, X5}

Final reduced attribute set: {X1, X2, X5}


 Step-wise Backward Selection –
This selection starts with the complete set of attributes in the original data and at each step
eliminates the worst remaining attribute from the set.
Suppose there are the following attributes in the data set in which few attributes are redundant.
Initial attribute Set: {X1, X2, X3, X4, X5, X6}
Initial reduced attribute set: {X1, X2, X3, X4, X5, X6 }

Step-1: {X1, X2, X3, X4, X5}


Step-2: {X1, X2, X3, X5}
Step-3: {X1, X2, X5}

Final reduced attribute set: {X1, X2, X5}


 Combination of Forward and Backward Selection –
It allows us to select the best attributes and remove the worst ones at each step, saving time and
making the process faster.
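A simplified sketch of step-wise forward selection on synthetic data; the attribute names and the relevance measure (absolute correlation with a hypothetical target, recomputed greedily) are illustrative stand-ins for a proper statistical test:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dataset: 6 attributes; the target depends mainly on X1, X2, X5
X = rng.normal(size=(200, 6))
y = 3 * X[:, 0] + 2 * X[:, 1] + 1.5 * X[:, 4] + rng.normal(scale=0.1, size=200)

remaining = list(range(6))
selected = []

# Greedy forward selection: at each step add the attribute whose absolute
# correlation with the target is highest (3 steps kept for illustration).
for _ in range(3):
    scores = {j: abs(np.corrcoef(X[:, j], y)[0, 1]) for j in remaining}
    best = max(scores, key=scores.get)
    selected.append(best)
    remaining.remove(best)

print("Reduced attribute set:", [f"X{j + 1}" for j in sorted(selected)])
```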
3. Data Compression:
The data compression technique reduces the size of the files using different encoding mechanisms
(Huffman Encoding & run-length Encoding). We can divide it into two types based on their compression
techniques.
 Lossless Compression –
Encoding techniques (Run Length Encoding) allow a simple and minimal data size reduction.
Lossless data compression uses algorithms to restore the precise original data from the
compressed data.
 Lossy Compression –
Methods such as the Discrete Wavelet transform technique, PCA (principal component analysis)
are examples of this compression. For e.g., the JPEG image format is a lossy compression, but we
can find the meaning equivalent to the original image. In lossy-data compression, the
decompressed data may differ from the original data but are useful enough to retrieve information
from them.
4. Numerosity Reduction:
In this reduction technique, the actual data is replaced with a mathematical model or a smaller
representation of the data. For parametric methods it is important to store only the model parameters;
non-parametric methods include clustering, histograms, and sampling.
5. Discretization & Concept Hierarchy Operation:
Data discretization techniques are used to divide continuous attributes into data with
intervals. Many constant values of the attributes are replaced by labels of small intervals, so that
mining results are shown in a concise and easily understandable way.
 Top-down discretization –
If you first consider one or a couple of points (so-called breakpoints or split points) to divide the
whole range of the attribute and then repeat this method on the resulting intervals, the process is
known as top-down discretization, also known as splitting.
 Bottom-up discretization –
If you first consider all the constant values as split points and then discard some of them by
merging neighborhood values into intervals, the process is called bottom-up discretization, also
known as merging.
Concept Hierarchies:
It reduces the data size by collecting and then replacing low-level concepts (such as the value 43 for age) with
higher-level concepts (categorical values such as middle-aged or senior).
For numeric data following techniques can be followed:
 Binning –
Binning is the process of changing numerical variables into categorical counterparts. The number
of categorical counterparts depends on the number of bins specified by the user.
 Histogram analysis –
Like the process of binning, histogram analysis partitions the values of an attribute X into
disjoint ranges called buckets. There are several partitioning rules (see the sketch after this list):
1. Equal-frequency partitioning: partition the values so that each bucket holds roughly the same
number of occurrences from the data set.
2. Equal-width partitioning: partition the values into buckets of a fixed width based on the number
of bins, e.g. buckets covering 0-20, 20-40, and so on.
Clustering: Grouping similar data together.
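A compact NumPy sketch contrasting equal-width and equal-frequency partitioning on a small hypothetical set of values, as referenced above:

```python
import numpy as np

values = np.array([5, 7, 8, 12, 15, 18, 21, 22, 30, 41], dtype=float)
n_bins = 3

# Equal-width partitioning: bins of identical width over the value range
width_edges = np.linspace(values.min(), values.max(), n_bins + 1)
equal_width = np.digitize(values, width_edges[1:-1])

# Equal-frequency partitioning: each bin holds (roughly) the same number of values
freq_edges = np.quantile(values, np.linspace(0, 1, n_bins + 1))
equal_freq = np.digitize(values, freq_edges[1:-1])

print(equal_width)   # bin index (0..2) per value, equal-width
print(equal_freq)    # bin index (0..2) per value, equal-frequency
```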

Advantages and Disadvantages of Data Reduction in Data Mining:


Data reduction in data mining can have a number of advantages and disadvantages.
Advantages:
1. Improved efficiency: Data reduction can help to improve the efficiency of machine learning
algorithms by reducing the size of the dataset. This can make it faster and more practical to work
with large datasets.
2. Improved performance: Data reduction can help to improve the performance of machine learning
algorithms by removing irrelevant or redundant information from the dataset. This can help to
make the model more accurate and robust.
3. Reduced storage costs: Data reduction can help to reduce the storage costs associated with large
datasets by reducing the size of the data.
4. Improved interpretability: Data reduction can help to improve the interpretability of the results by
removing irrelevant or redundant information from the dataset.
Disadvantages:
1. Loss of information: Data reduction can result in a loss of information, if important data is
removed during the reduction process.
2. Impact on accuracy: Data reduction can impact the accuracy of a model, as reducing the size of
the dataset can also remove important information that is needed for accurate predictions.
3. Impact on interpretability: Data reduction can make it harder to interpret the results, as removing
irrelevant or redundant information can also remove context that is needed to understand the
results.
4. Additional computational costs: Data reduction can add additional computational costs to the data
mining process, as it requires additional processing time to reduce the data.
5. In conclusion, data reduction can have both advantages and disadvantages. It can improve the
efficiency and performance of machine learning algorithms by reducing the size of the dataset.
However, it can also result in a loss of information, and make it harder to interpret the results. It’s
important to weigh the pros and cons of data reduction and carefully assess the risks and benefits
before implementing it.
