DMW Unit1
Data are the quantity, quality, facts, statistics, or other basic units of meaning, or simply sequences of symbols that
may be further interpreted formally. A datum is an individual value in a collection of data. Data are
usually organized into structures such as tables that provide additional context and meaning, and may
themselves be used as data in larger structures. Data may be used as variables in a computational process.
[1][2]
Data may represent abstract ideas or concrete measurements. [3] Data are commonly used in scientific
research, economics, and virtually every other form of human organizational activity. Examples of data
sets include price indices (such as the consumer price index), unemployment rates, literacy rates,
and census data. In this context, data represent the raw facts and figures from which useful information
can be extracted.
Data, information, knowledge, and wisdom are closely related concepts, but each has its role concerning
the other, and each term has its meaning. According to a common view, data is collected and analyzed;
data only becomes information suitable for making decisions once it has been analyzed in some fashion.
[8]
One can say that the extent to which a set of data is informative to someone depends on the extent to
which it is unexpected by that person. The amount of information contained in a data stream may be
characterized by its Shannon entropy.
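As a small illustration of this idea, here is a minimal Python sketch (the function name and sample strings are our own, not from the source) that estimates the Shannon entropy, in bits, of a symbol stream:

```python
import math
from collections import Counter

def shannon_entropy(stream):
    """Shannon entropy (in bits) of the empirical distribution of a sequence."""
    counts = Counter(stream)
    n = len(stream)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# A predictable stream carries little information; a uniform one carries more
print(shannon_entropy("aaaaaaab"))   # low entropy, highly predictable
print(shannon_entropy("abcdefgh"))   # high entropy, every symbol unexpected
```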
Knowledge is the awareness of its environment that some entity possesses, whereas data merely
communicates that knowledge. For example, the entry in a database specifying the height of Mount
Everest is a datum that communicates a precisely measured value. This measurement may be included in
a book along with other data on Mount Everest to describe the mountain in a manner useful for those who
wish to decide on the best method to climb it. Awareness of the characteristics represented by this data is
knowledge.
Data are often assumed to be the least abstract concept, information the next least, and knowledge the
most abstract.[9] In this view, data becomes information by interpretation; e.g., the height of Mount
Everest is generally considered "data", a book on Mount Everest's geological characteristics may be
considered "information", and a climber's guidebook containing practical information on the best way to
reach Mount Everest's peak may be considered "knowledge". "Information" bears a diversity of meanings
that range from everyday usage to technical use. This view, however, has also been argued to reverse how
data emerges from information, and information from knowledge.[10] Generally speaking, the concept of
information is closely related to notions of constraint, communication, control, data, form, instruction,
knowledge, meaning, mental stimulus, pattern, perception, and representation. Beynon-Davies uses the
concept of a sign to differentiate between data and information; data is a series of symbols, while
information occurs when the symbols are used to refer to something.
What are Data Attributes?
Data attributes refer to the specific characteristics or properties that describe individual data
objects within a dataset.
These attributes provide meaningful information about the objects and are used to analyze,
classify, or manipulate the data.
Understanding and analyzing data attributes is fundamental in various fields such
as statistics, machine learning, and data analysis, as they form the basis for deriving insights and
making informed decisions from the data.
Within predictive models, attributes serve as the predictors influencing an outcome. In descriptive
models, attributes constitute the pieces of information under examination for inherent patterns or
correlations.
A set of attributes used to describe a given object is known as an attribute vector or feature
vector.
Examples of data attributes include numerical values (e.g., age, height), categorical labels (e.g., color,
type), textual descriptions (e.g., name, description), or any other measurable or qualitative aspect of the
data objects.
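To make this concrete, a tiny hypothetical data object and its feature vector might look like the following Python sketch (the attribute names and values are invented for illustration):

```python
# One data object described by a mix of attribute types (hypothetical values)
patient = {
    "age": 42,             # numeric (ratio-scaled)
    "height_cm": 171.5,    # numeric (continuous)
    "blood_group": "O+",   # nominal (categorical label)
    "smoker": False,       # binary
    "pain_level": "mild",  # ordinal (mild < moderate < severe)
}

# The attribute (feature) vector is simply the ordered collection of these values
feature_vector = list(patient.values())
print(feature_vector)
```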
Types of attributes:
The initial phase of data preprocessing involves categorizing attributes into different types, which
serves as a foundation for subsequent data processing steps. Attributes can be broadly classified into two
main types:
1. Qualitative (Nominal (N), Ordinal (O), Binary(B)).
2. Quantitative (Numeric, Discrete, Continuous)
Qualitative Attributes:
1. Nominal Attributes :
Nominal attributes, as related to names, refer to categorical data where the values represent different
categories or labels without any inherent order or ranking. These attributes are often used to represent
names or labels associated with objects, entities, or concepts.
Example: hair colour (black, brown, blond), marital status, or occupation; each value is a label with no meaningful order.
2. Binary Attributes: Binary attributes are a type of qualitative attribute where the data can take on only
two distinct values or states. These attributes are often used to represent yes/no, presence/absence, or
true/false conditions within a dataset. They are particularly useful for representing categorical data where
there are only two possible outcomes. For instance, in a medical study, a binary attribute could represent
whether a patient is affected or unaffected by a particular condition.
Symmetric: In a symmetric attribute, both values or states are considered equally important or
interchangeable. For example, in the attribute “Gender” with values “Male” and “Female,”
neither value holds precedence over the other, and they are considered equally significant for
analysis purposes.
Asymmetric: An asymmetric attribute indicates that the two values or states are not equally
important or interchangeable. For instance, in the attribute “Result” with values “Pass” and
“Fail,” the states are not of equal importance; passing may hold greater significance than failing
in certain contexts, such as academic grading or certification exams.
3. Ordinal Attributes : Ordinal attributes are a type of qualitative attribute where the values possess a
meaningful order or ranking, but the magnitude between values is not precisely quantified. In other
words, while the order of values indicates their relative importance or precedence, the numerical
difference between them is not standardized or known.
Example: customer satisfaction rated as poor < fair < good < excellent, or T-shirt sizes small < medium < large; the order is meaningful but the gap between values is not fixed.
Quantitative Attributes:
1. Numeric: A numeric attribute is quantitative because it is a measurable quantity, represented in integer
or real values. Numerical attributes are of 2 types: interval, and ratio-scaled.
An interval-scaled attribute has values whose differences are interpretable, but the attribute has
no true reference point (no true zero). Values on an interval scale can be added and subtracted,
but cannot meaningfully be multiplied or divided. Consider temperature in degrees Centigrade: if
one day's temperature is numerically twice that of another day, we cannot say that the first day is
twice as hot.
A ratio-scaled attribute is a numeric attribute with a fixed zero point. If a measurement is ratio-
scaled, we can speak of a value as being a multiple (or ratio) of another value. The values are
ordered, and we can compute the difference between values as well as the mean, median, mode,
quantile range, and five-number summary (a short sketch computing these summaries appears after
the continuous-attribute example below).
2. Discrete : Discrete data refer to information that can take on specific, separate values rather than a
continuous range. These values are often distinct and separate from one another, and they can be either
numerical or categorical in nature.
Example: the number of students in a class, the number of cars owned by a household, or ZIP codes; the values are separate and countable, with nothing in between.
3. Continuous: Continuous data, unlike discrete data, can take on an infinite number of possible values
within a given range. It is characterized by being able to assume any value within a specified interval,
often including fractional or decimal values.
Example: height, weight, or temperature measured to arbitrary precision; any value within a range, including fractional or decimal values, is possible.
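As referenced in the ratio-scaled discussion above, here is a short sketch (assuming numpy/pandas; the sample values are made up) computing the mean, median, mode, quantile range, and five-number summary of a quantitative attribute:

```python
import pandas as pd

# Hypothetical ratio-scaled attribute, e.g. weights in kg
values = pd.Series([58, 61, 64, 64, 70, 73, 75, 80, 82, 95])

print("mean:", values.mean())
print("median:", values.median())
print("mode:", values.mode().tolist())

# Five-number summary: min, Q1, median, Q3, max
q1, q3 = values.quantile(0.25), values.quantile(0.75)
print("five-number summary:", [values.min(), q1, values.median(), q3, values.max()])
print("quantile (interquartile) range:", q3 - q1)
```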
Data Preprocessing in Data Mining
Data preprocessing is an important step in the data mining process. It refers to the cleaning, transforming,
and integrating of data in order to make it ready for analysis. The goal of data preprocessing is to improve
the quality of the data and to make it more suitable for the specific data mining task.
Some common steps in data preprocessing include:
Data Cleaning: This involves identifying and correcting errors or inconsistencies in the data, such as
missing values, outliers, and duplicates. Various techniques can be used for data cleaning, such as
imputation, removal, and transformation.
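A minimal cleaning sketch, assuming pandas and made-up column names, showing duplicate removal, treatment of an implausible value, and imputation:

```python
import numpy as np
import pandas as pd

# Hypothetical raw data with missing values, a duplicate row and an outlier
df = pd.DataFrame({
    "age":    [25, np.nan, 47, 47, 230],       # 230 is an implausible outlier
    "income": [32000, 41000, np.nan, np.nan, 52000],
})

df = df.drop_duplicates()                       # remove duplicate records
df["age"] = df["age"].where(df["age"] < 120)    # treat impossible ages as missing
df["age"] = df["age"].fillna(df["age"].median())          # imputation (median)
df["income"] = df["income"].fillna(df["income"].mean())   # imputation (mean)
print(df)
```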
Data Integration: This involves combining data from multiple sources to create a unified dataset. Data
integration can be challenging as it requires handling data with different formats, structures, and
semantics. Techniques such as record linkage and data fusion can be used for data integration.
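As an illustration, a small pandas sketch (the source names and schemas are hypothetical) that reconciles naming conventions and joins two sources into a unified view:

```python
import pandas as pd

# Two hypothetical sources describing the same customers with different schemas
crm   = pd.DataFrame({"cust_id": [1, 2, 3], "name": ["Asha", "Ben", "Chen"]})
sales = pd.DataFrame({"customer": [2, 3, 4], "total_spend": [250.0, 90.5, 310.0]})

# Record linkage here is a simple key join; real integration may need fuzzy matching
sales = sales.rename(columns={"customer": "cust_id"})   # reconcile naming conventions
unified = crm.merge(sales, on="cust_id", how="outer")   # unified view of both sources
print(unified)
```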
Data Transformation: This involves converting the data into a suitable format for analysis. Common
techniques used in data transformation include normalization, standardization, and discretization.
Normalization is used to scale the data to a common range, while standardization is used to transform the
data to have zero mean and unit variance. Discretization is used to convert continuous data into discrete
categories.
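A brief sketch of these three transformations, assuming scikit-learn is available (the sample data is made up):

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer, MinMaxScaler, StandardScaler

x = np.array([[2.0], [4.0], [6.0], [8.0], [100.0]])

print(MinMaxScaler().fit_transform(x).ravel())    # normalization to the range [0, 1]
print(StandardScaler().fit_transform(x).ravel())  # standardization: zero mean, unit variance
print(KBinsDiscretizer(n_bins=3, encode="ordinal",
                       strategy="uniform").fit_transform(x).ravel())  # discretization into bins
```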
Data Reduction: This involves reducing the size of the dataset while preserving the important
information. Data reduction can be achieved through techniques such as feature selection and feature
extraction. Feature selection involves selecting a subset of relevant features from the dataset, while
feature extraction involves transforming the data into a lower-dimensional space while preserving the
important information.
Data Discretization: This involves dividing continuous data into discrete categories or intervals.
Discretization is often used in data mining and machine learning algorithms that require categorical data.
Discretization can be achieved through techniques such as equal width binning, equal frequency binning,
and clustering.
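For example, equal-width and equal-frequency binning can be sketched with pandas as follows (the values are invented):

```python
import pandas as pd

ages = pd.Series([18, 21, 22, 25, 29, 34, 41, 47, 55, 62])

# Equal-width binning: intervals of equal length
print(pd.cut(ages, bins=3, labels=["young", "middle", "senior"]).tolist())

# Equal-frequency binning: roughly the same number of records per bin
print(pd.qcut(ages, q=3, labels=["low", "mid", "high"]).tolist())
```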
Data Normalization: This involves scaling the data to a common range, such as between 0 and 1 or -1
and 1. Normalization is often used to handle data with different units and scales. Common normalization
techniques include min-max normalization, z-score normalization, and decimal scaling.
Data preprocessing plays a crucial role in ensuring the quality of data and the accuracy of the analysis
results. The specific steps involved in data preprocessing may vary depending on the nature of the data
and the analysis goals.
By performing these steps, the data mining process becomes more efficient and the results become more
accurate.
Preprocessing in Data Mining:
Data preprocessing is a data mining technique that is used to transform raw data into a useful and
efficient format.
Steps Involved in Data Preprocessing:
1. Data Cleaning:
The data can have many irrelevant and missing parts. To handle this, data cleaning is done. It
involves handling missing data, noisy data, etc. Noisy data can be smoothed using approaches such
as regression and clustering:
(a) Regression:
Data can be made smooth by fitting it to a regression function. The regression used
may be linear (having one independent variable) or multiple (having several
independent variables); a small sketch of regression-based smoothing follows this list.
(b) Clustering:
This approach groups similar data into clusters. Values that fall outside the clusters
can be treated as outliers, although some outliers may go undetected.
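As mentioned in the data cleaning step above, noisy values can be smoothed by fitting a regression function. A small numpy sketch (the data is synthetic) of linear-regression smoothing:

```python
import numpy as np

# Hypothetical noisy measurements of y against x
x = np.arange(10, dtype=float)
rng = np.random.default_rng(0)
y = 3.0 * x + 5.0 + rng.normal(scale=2.0, size=x.size)   # noisy observations

# Fit a linear regression (one independent variable) and replace the noisy values
slope, intercept = np.polyfit(x, y, deg=1)
y_smooth = slope * x + intercept     # smoothed values lie on the fitted line
print(np.round(y_smooth, 2))
```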
2 Data Integration in Data Mining
Data integration in data mining refers to the process of combining data from multiple
sources into a single, unified view. This can involve cleaning and transforming the data, as
well as resolving any inconsistencies or conflicts that may exist between the different
sources. The goal of data integration is to make the data more useful and meaningful for the
purposes of analysis and decision making. Techniques used in data integration include data
warehousing, ETL (extract, transform, load) processes, and data federation.
Data Integration is a data preprocessing technique that combines data from multiple heterogeneous
data sources into a coherent data store and provides a unified view of the data. These sources may
include multiple data cubes, databases, or flat files.
The data integration approach is formally defined as a triple <G, S, M> where
G stands for the global schema,
S stands for the heterogeneous source schemas, and
M stands for the mapping between queries over the source and global schemas.
What is data integration :
Data integration is the process of combining data from multiple sources into a cohesive and
consistent view. This process involves identifying and accessing the different data sources, mapping
the data to a common format, and reconciling any inconsistencies or discrepancies between the
sources. The goal of data integration is to make it easier to access and analyze data that is spread
across multiple systems or platforms, in order to gain a more complete and accurate understanding
of the data.
Data integration can be challenging due to the variety of data formats, structures, and semantics
used by different data sources. Different data sources may use different data types, naming
conventions, and schemas, making it difficult to combine the data into a single view. Data
integration typically involves a combination of manual and automated processes, including data
profiling, data mapping, data transformation, and data reconciliation.
Data integration is used in a wide range of applications, such as business intelligence, data
warehousing, master data management, and analytics. Data integration can be critical to the
success of these applications, as it enables organizations to access and analyze data that is spread
across different systems, departments, and lines of business, in order to make better decisions,
improve operational efficiency, and gain a competitive advantage.
There are mainly 2 major approaches for data integration – one is the “tight coupling approach”
and another is the “loose coupling approach”.
Tight Coupling:
This approach involves creating a centralized repository or data warehouse to store the integrated
data. The data is extracted from various sources, transformed and loaded into a data warehouse.
Data is integrated in a tightly coupled manner, meaning that the data is integrated at a high level,
such as at the level of the entire dataset or schema. This approach is also known as data
warehousing, and it enables data consistency and integrity, but it can be inflexible and difficult to
change or update.
Here, a data warehouse is treated as an information retrieval component.
In this coupling, data is combined from different sources into a single physical location
through the process of ETL – Extraction, Transformation, and Loading.
Loose Coupling:
This approach involves integrating data at the lowest level, such as at the level of individual data
elements or records. Data is integrated in a loosely coupled manner, meaning that the data is
integrated at a low level, and it allows data to be integrated without having to create a central
repository or data warehouse. This approach is also known as data federation, and it enables data
flexibility and easy updates, but it can be difficult to maintain consistency and integrity across
multiple data sources.
Here, an interface is provided that takes the query from the user, transforms it in a way the
source database can understand, and then sends the query directly to the source databases
to obtain the result.
The data remains only in the actual source databases.
Issues in Data Integration:
There are several issues that can arise when integrating data from multiple sources, including:
1. Data Quality: Inconsistencies and errors in the data can make it difficult to combine and
analyze.
2. Data Semantics: Different sources may use different terms or definitions for the same data,
making it difficult to combine and understand the data.
3. Data Heterogeneity: Different sources may use different data formats, structures, or
schemas, making it difficult to combine and analyze the data.
4. Data Privacy and Security: Protecting sensitive information and maintaining security can
be difficult when integrating data from multiple sources.
5. Scalability: Integrating large amounts of data from multiple sources can be computationally
expensive and time-consuming.
6. Data Governance: Managing and maintaining the integration of data from multiple sources
can be difficult, especially when it comes to ensuring data accuracy, consistency, and
timeliness.
7. Performance: Integrating data from multiple sources can also affect the performance of the
system.
8. Integration with existing systems: Integrating new data sources with existing systems can be
a complex task, requiring significant effort and resources.
9. Complexity: The complexity of integrating data from multiple sources can be high,
requiring specialized skills and knowledge.
Correlation Analysis
Correlation analysis is a statistical technique for determining the strength of a link between two
variables. It is used to detect patterns and trends in data and to forecast future occurrences.
Consider a problem with several factors to be considered for making optimal conclusions;
correlation explains how these variables depend on each other.
Correlation quantifies how strong the relationship between two variables is. A higher value
of the correlation coefficient implies a stronger association.
The sign of the correlation coefficient indicates the direction of the relationship between
variables. It can be either positive, negative, or zero.
What is Correlation?
The Pearson correlation coefficient is the most often used metric of correlation. It expresses the
linear relationship between two variables in numerical terms. The Pearson correlation coefficient,
written as “r,” is as follows:
r = Σ(xᵢ − x̄)(yᵢ − ȳ) / √( Σ(xᵢ − x̄)² · Σ(yᵢ − ȳ)² )
where,
r : correlation coefficient
xᵢ : i-th value of the first dataset X
x̄ : mean of the first dataset X
yᵢ : i-th value of the second dataset Y
ȳ : mean of the second dataset Y
The correlation coefficient, denoted by “r”, ranges between -1 and 1.
r = -1 indicates a perfect negative correlation.
r = 0 indicates no linear correlation between the variables.
r = 1 indicates a perfect positive correlation.
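A short numpy sketch (the sample values are made up) that applies the formula above and cross-checks it against numpy's built-in correlation:

```python
import numpy as np

# Hypothetical paired observations
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Direct use of the formula above
r_manual = np.sum((x - x.mean()) * (y - y.mean())) / np.sqrt(
    np.sum((x - x.mean()) ** 2) * np.sum((y - y.mean()) ** 2))

# Same result from numpy's correlation matrix
r_numpy = np.corrcoef(x, y)[0, 1]
print(round(r_manual, 4), round(r_numpy, 4))   # close to +1: strong positive correlation
```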
Types of Correlation
There are three types of correlation:
1. Positive Correlation: Positive correlation indicates that two variables have a direct
relationship. As one variable increases, the other variable also increases. For example, there
is a positive correlation between height and weight. As people get taller, they also tend to
weigh more.
2. Negative Correlation: Negative correlation indicates that two variables have an inverse
relationship. As one variable increases, the other variable decreases. For example, there is a
negative correlation between price and demand. As the price of a product increases, the
demand for that product decreases.
3. Zero Correlation: Zero correlation indicates that there is no relationship between two
variables. The changes in one variable do not affect the other variable. For example, there is
zero correlation between shoe size and intelligence.
A positive correlation indicates that the two variables move in the same direction, while a negative
correlation indicates that the two variables move in opposite directions.
The strength of the correlation is measured by a correlation coefficient, which can range from -1 to
1. A correlation coefficient of 0 indicates no correlation, while a correlation coefficient of 1 or -1
indicates a perfect correlation.
Correlation Coefficients
The different types of correlation coefficients used to measure the relation between two variables
are:
Correlation Coefficient | Relation   | Levels of Measurement | Type of Data Distribution
Kendall Tau Coefficient | Non-linear | Ordinal               | Any distribution
Cramer's V              | Non-linear | Two nominal variables | Any distribution
3 Data Transformation:
This step is taken in order to transform the data into forms appropriate for the mining process.
It involves the following ways:
1. Normalization:
It is done in order to scale the data values in a specified range (-1.0 to 1.0 or 0.0 to 1.0)
2. Attribute Selection:
In this strategy, new attributes are constructed from the given set of attributes to help the mining
process.
3. Discretization:
This is done to replace the raw values of a numeric attribute with interval levels or conceptual levels.
4 Data Reduction:
Data reduction is a crucial step in the data mining process that involves reducing the size of the dataset
while preserving the important information. This is done to improve the efficiency of data analysis and to
avoid overfitting of the model. Some common steps involved in data reduction are:
Feature Selection: This involves selecting a subset of relevant features from the dataset. Feature
selection is often performed to remove irrelevant or redundant features from the dataset. It can be done
using various techniques such as correlation analysis, mutual information, and principal component
analysis (PCA).
Feature Extraction: This involves transforming the data into a lower-dimensional space while
preserving the important information. Feature extraction is often used when the original features are high-
dimensional and complex. It can be done using techniques such as PCA, linear discriminant analysis
(LDA), and non-negative matrix factorization (NMF).
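A brief sketch of feature extraction with PCA, assuming scikit-learn (the synthetic data is our own):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Hypothetical high-dimensional data: 100 objects, 10 largely redundant features
base = rng.normal(size=(100, 3))
X = np.hstack([base, base @ rng.normal(size=(3, 7))])   # 10 columns built from 3 factors

pca = PCA(n_components=3)             # keep a lower-dimensional representation
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)                             # (100, 3)
print(pca.explained_variance_ratio_.round(3))      # most variance retained by 3 components
```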
Sampling: This involves selecting a subset of data points from the dataset. Sampling is often used to
reduce the size of the dataset while preserving the important information. It can be done using techniques
such as random sampling, stratified sampling, and systematic sampling.
Clustering: This involves grouping similar data points together into clusters. Clustering is often used to
reduce the size of the dataset by replacing similar data points with a representative centroid. It can be
done using techniques such as k-means, hierarchical clustering, and density-based clustering.
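A small sketch of clustering-based reduction with k-means, assuming scikit-learn (the data is synthetic): 1000 original points are replaced by 10 representative centroids.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 2))             # hypothetical dataset of 1000 points

kmeans = KMeans(n_clusters=10, n_init=10, random_state=0).fit(X)
centroids = kmeans.cluster_centers_        # 10 representatives stand in for 1000 points
print(X.shape, "->", centroids.shape)
```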
Compression: This involves compressing the dataset while preserving the important information.
Compression is often used to reduce the size of the dataset for storage and transmission purposes. It can
be done using techniques such as wavelet compression, JPEG compression, and gzip compression.
Decimal Scaling Normalization –
This method normalizes by moving the decimal point of the values. Each value vi of the attribute is
normalized to vi' by using the formula below, where j is the smallest integer
such that max(|vi'|) < 1:
vi' = vi / 10^j
Example –
Let the input data be: -10, 201, 301, -401, 501, 601, 701.
Step 1: The maximum absolute value in the given data is 701.
Step 2: Divide the given data by 1000 (i.e. j = 3).
Result: The normalized data is: -0.01, 0.201, 0.301, -0.401, 0.501, 0.601, 0.701
Min-Max Normalization –
In this technique of data normalization, a linear transformation is performed on the original data. The minimum
and maximum values of the attribute are fetched and each value is replaced according to the following formula:
v' = ((v - Min(A)) / (Max(A) - Min(A))) * (new_max(A) - new_min(A)) + new_min(A)
where A is the attribute, Min(A) and Max(A) are the minimum and maximum values of A
respectively, v is the old value of an entry in the data, v' is its new value, and
new_min(A), new_max(A) are the boundary values of the required range.
Z-score normalization –
In this technique, values are normalized based on the mean and standard deviation of the attribute A.
The formula is:
v' = (v - mean(A)) / std(A)
where mean(A) is the mean and std(A) the standard deviation of A.
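Putting the three normalization techniques above together on the earlier sample data, here is a plain numpy sketch (the computation of j assumes the maximum absolute value is not an exact power of ten):

```python
import numpy as np

v = np.array([-10, 201, 301, -401, 501, 601, 701], dtype=float)

# Decimal scaling: divide by 10^j, j smallest integer with max(|v'|) < 1 (j = 3 here)
j = int(np.ceil(np.log10(np.abs(v).max())))
print(v / 10 ** j)

# Min-max normalization to the new range [0, 1]
print((v - v.min()) / (v.max() - v.min()))

# Z-score normalization: zero mean, unit standard deviation
print((v - v.mean()) / v.std())
```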
3. Dimensionality Reduction: This technique involves reducing the number of features in the
dataset, either by removing features that are not relevant or by combining multiple features into a
single feature.
4. Data Compression: This technique involves using techniques such as lossy or lossless
compression to reduce the size of a dataset.
5. Data Discretization: This technique involves converting continuous data into discrete data by
partitioning the range of possible values into intervals or bins.
6. Feature Selection: This technique involves selecting a subset of features from the dataset that are
most relevant to the task at hand.
It is important to note that data reduction involves a trade-off between accuracy and the size of
the data: the more the data is reduced, the less accurate and the less generalizable the resulting
model may be.
In conclusion, data reduction is an important step in data mining, as it can help to improve the efficiency
and performance of machine learning algorithms by reducing the size of the dataset. However, it is
important to be aware of the trade-off between the size and accuracy of the data, and carefully assess the
risks and benefits before implementing it.
Methods of data reduction:
These are explained as follows.
1. Data Cube Aggregation:
This technique is used to aggregate data into a simpler form. For example, imagine that the information
gathered for an analysis of the years 2012 to 2014 includes the revenue of your company
every three months. If the analysis is concerned with annual sales rather than quarterly figures, we can
summarize the data in such a way that the resulting data reports the total sales per year instead of per
quarter, which greatly reduces the volume of data.
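A small pandas sketch of this aggregation (the quarterly figures are invented):

```python
import pandas as pd

# Hypothetical quarterly sales for 2012 to 2014
quarterly = pd.DataFrame({
    "year":    [2012] * 4 + [2013] * 4 + [2014] * 4,
    "quarter": ["Q1", "Q2", "Q3", "Q4"] * 3,
    "sales":   [224, 408, 350, 586, 231, 417, 389, 612, 254, 431, 402, 645],
})

# Aggregate away the quarter dimension: one total per year instead of four rows
annual = quarterly.groupby("year", as_index=False)["sales"].sum()
print(annual)
```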
2. Dimension reduction:
Whenever we come across data that is only weakly relevant, we keep just the attributes required for our
analysis. Dimension reduction decreases data size by eliminating outdated or redundant features.
Step-wise Forward Selection –
The selection begins with an empty set of attributes; at each step, the best of the remaining original
attributes is added to the set, judged by a relevance measure such as its statistical significance
(p-value).
Suppose the data set contains the following attributes, of which a few are redundant.
Initial attribute Set: {X1, X2, X3, X4, X5, X6}
Initial reduced attribute set: { }
Step-1: {X1}
Step-2: {X1, X2}
Step-3: {X1, X2, X5}
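A brief sketch of step-wise forward selection, assuming scikit-learn's SequentialFeatureSelector with a linear model (the synthetic dataset and the number of features to keep are our own choices; note that this selector scores candidate attributes by cross-validated model performance rather than a p-value):

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Hypothetical data: 6 attributes (X1..X6), only 3 of them informative
X, y = make_regression(n_samples=200, n_features=6, n_informative=3, random_state=0)

# Start from an empty attribute set and greedily add the best attribute at each step
selector = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=3, direction="forward")
selector.fit(X, y)

# Names of the attributes kept, analogous to the reduced set {X1, X2, X5} above
print([f"X{i + 1}" for i in selector.get_support(indices=True)])
```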