Unit II: Getting to know your data

Data objects and attribute types, Basic statistical descriptions of data, Data visualization, Measuring data similarity and dissimilarity
1. Data Objects and Attribute types:
Data sets are made up of data objects. A data object represents an entity— in a sales database,
the objects may be customers, store items, and sales; in a medical database, the objects may
be patients; in a university database, the objects may be students, professors, and courses.
Data objects are described by attributes. Data objects can also be referred to as samples,
examples, instances, data points, or objects. If the data objects are stored in a database, they
are data tuples. That is, the rows of a database correspond to the data objects, and the
columns correspond to the attributes.
What Is an Attribute?
An attribute is a data field, representing a characteristic or feature of a data object. Attributes
describing a customer object can include, for example, customer ID, name, and address.
Observed values for a given attribute are known as observations. A set of attributes used to
describe a given object is called an attribute vector (or feature vector). The distribution of
data involving one attribute (or variable) is called univariate. A bivariate distribution involves
two attributes, and so on.
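To make this concrete, here is a minimal Python sketch (with hypothetical attribute names and values, chosen only to mirror the customer example above) of a data object represented as an attribute vector, and of a small data set as a list of such vectors:

```python
# Minimal sketch (hypothetical values): one data object described by an
# attribute vector, and a tiny data set where rows correspond to objects
# and columns (keys) correspond to attributes.

# One customer object as an attribute vector (feature vector).
customer = {
    "customer_ID": "C1001",    # nominal (identifier-like)
    "name": "A. Sharma",       # nominal
    "address": "Pune, India",  # nominal
    "age": 34,                 # numeric (ratio-scaled)
    "is_smoker": 0,            # binary (0 = absent, 1 = present)
}

# A small data set: a list of attribute vectors, i.e., rows of a table.
customers = [
    customer,
    {"customer_ID": "C1002", "name": "B. Rao", "address": "Mumbai, India",
     "age": 41, "is_smoker": 1},
]

for row in customers:
    print(row["customer_ID"], row["age"])
```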
Types of Attributes:
 Nominal Attributes
 Binary Attributes
 Ordinal Attributes
 Numeric Attributes
 Discrete versus Continuous Attributes
1. Nominal attributes
Nominal means “relating to names.” The values of a nominal attribute are symbols or
names of things. Each value represents some kind of category, code, or state, and so
nominal attributes are also referred to as categorical. The values do not have any
meaningful order.
Examples:
 Hair color (blonde, gray, brown, black, etc.)
 Relationship status (married, cohabiting, single, etc.)
 Preferred mode of public transportation (bus, train, tram, etc.)
 Blood type (O negative, O positive, A negative, and so on)
Because nominal attribute values do not have any meaningful order about them and are
not quantitative, it makes no sense to find the mean (average) value or median (middle)
value for such an attribute, given a set of objects. One thing that is of interest, however, is
the attribute’s most commonly occurring value. This value, known as the mode, is one of
the measures of central tendency.
2. Binary attributes:
A binary attribute is a nominal attribute with only two categories or states: 0 or 1, where
0 means that the attribute is absent, and 1 means that it is present. Binary attributes are
referred to as Boolean if the two states correspond to true and false.
For example, suppose a patient undergoes a medical test that has two possible
outcomes. The attribute medical test is binary, where a value of 1 means the test
result for the patient is positive, while 0 means the result is negative.
A binary attribute is symmetric if both of its states are equally valuable and carry the
same weight. So, there is no preference on which outcome should be coded as 0 or 1. One
such example could be the attribute gender having the states male and female.
A binary attribute is asymmetric if the outcomes of the states are not equally important,
such as the positive and negative outcomes of a medical test for some severe disease. By
convention, we code the most important outcome, which is usually the rarest one, by 1
(positive) and the other by 0 (negative).
3. Ordinal Attribute:
An ordinal attribute is an attribute with possible values that have a meaningful order or
ranking among them, but quantitative measure between successive values is not known.
For example, suppose that drink size corresponds to the size of drinks available at a fast-
food restaurant. This ordinal attribute has three possible values: small, medium, and
large. The values have a meaningful sequence (which corresponds to increasing drink
size); however, we cannot tell from the values how much bigger, say, a large is than a
medium.
4. Numeric attributes
A numeric attribute is quantitative, and it is presented as a measure of quantity,
represented in integer or real values. Numeric attributes can be interval-scaled or ratio-
scaled. Interval-scaled attributes are measured on a scale of equal-size units. This type of
data represents quantitative data with equal intervals between consecutive values. Interval
data has no absolute zero point, and therefore, ratios cannot be computed. Examples of
interval data include temperature, IQ scores, and time. Interval data is used in data mining
for clustering and prediction tasks. Because interval-scaled attributes are numeric, we can
compute their mean value, in addition to the median and mode measures of central
tendency. A ratio-scaled attribute is a numeric attribute with an inherent zero-point. This
type of data is similar to interval data, but with an absolute zero point. In ratio data, it is
possible to compute ratios of two values, and this makes it possible to make meaningful
comparisons. Examples of ratio data include height, weight, and income. Ratio data is
used in data mining for prediction and association rule mining tasks.
5. Discrete versus Continuous Attributes:
Discrete Attribute: An attribute is discrete if it has a finite or countably infinite set of
values, which may or may not be represented as integers. The attributes hair colour,
smoker, medical test, and drink size each have a finite number of values, and so are discrete.
Continuous Attribute: An attribute is continuous if it is not discrete. Continuous attributes
are typically represented as real (floating-point) values, for example height, weight, or temperature.

2. Basic statistical descriptions of data


For data preprocessing to be successful, it is essential to have an overall picture of your
data. Basic statistical descriptions can be used to identify properties of the data and
highlight which data values should be treated as noise or outliers. This section discusses
three areas of basic statistical descriptions, those are:
 Measuring the Central Tendency: Mean, Median, and Mode
 Measuring the Dispersion of Data: Range, Quartiles, Variance, Standard Deviation,
and Interquartile Range (IQR)
 Graphic Displays of Basic Statistical Descriptions of Data

1. Measuring the Central Tendency: Mean, Median, and Mode

a. Mean

The most common and effective numeric measure of the “centre” of a set of data is the
(arithmetic) mean. Let x1, x2, ……, xN be a set of N values or observations, such as for
some numeric attribute X, like salary.
The mean of this set of values is
x̄ = (x1 + x2 + ⋯ + xN) / N.
Example: Mean. Suppose we have the following values for salary (in thousands of
dollars), shown in increasing order: 30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, and 110.
Using the above equation, we have
x̄ = (30 + 36 + 47 + 50 + 52 + 52 + 56 + 60 + 63 + 70 + 70 + 110) / 12 = 696 / 12 = 58.
Thus, the mean salary is $58,000.
Sometimes, each value xi in a set may be associated with a weight wi, for i = 1, ..., N. The
weights reflect the significance, importance, or occurrence frequency attached to their
respective values. In this case, we can compute
x̄ = (w1x1 + w2x2 + ⋯ + wNxN) / (w1 + w2 + ⋯ + wN).
This is called the weighted arithmetic mean or the weighted average.

b. Median
Another measure of the centre of data is the median. Suppose that a given data set of
N distinct values is sorted in numerical order.

 If N is odd, the median is the middle value of the ordered set;


 If N is even, the median is the average of the middle two values.
Example: Median. Suppose we have the following values for salary (in thousands of
dollars), shown in increasing order: 30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110.
There is an even number of observations (i.e., 12); therefore, the median is not
unique. It can be any value within the two middlemost values of 52 and 56 (that is,
within the sixth and seventh values in the list). By convention, we assign the average
of the two middlemost values as the median.
that is,
median = (52 + 56) / 2 = 54.
Thus, the median is $54,000.


In probability and statistics, the median generally applies to numeric data; however, we
may extend the concept to ordinal data.
Suppose that a given data set of N values for an attribute X is sorted in increasing order.
 If N is odd, then the median is the middle value of the ordered set.
 If N is even, then the median may not be unique.
In this case, the median is the two middlemost values and any value in between.

c. Mode
Another measure of central tendency is the mode. The mode for a set of data is the
value that occurs most frequently in the set.
It is possible for the greatest frequency to correspond to several different values,
which results in more than one mode.
 Data sets with one, two, or three modes are called unimodal, bimodal, and trimodal, respectively.
 At the other extreme, if each data value occurs only once, then there is no mode.
Example: Mode. Suppose we have the following values for salary (in thousands of
dollars), shown in increasing order: 30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110.
The above data set is bimodal, i.e., the two modes are $52,000 and $70,000.
d. Midrange
The midrange can also be used to assess the central tendency of a numeric data set. It
is the average of the largest and smallest values in the set.
Example: Midrange. Suppose we have the following values for salary (in thousands of
dollars), shown in increasing order: 30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110.
The midrange of the data is
midrange = (30 + 110) / 2 = 70.
Thus, the midrange is $70,000.
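The following minimal Python sketch (standard library only; statistics.multimode needs Python 3.8+) reproduces the central tendency measures above on the same salary data:

```python
# Minimal sketch: central tendency of the salary data used above
# (values in thousands of dollars).
from statistics import mean, median, multimode

salary = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

print(mean(salary))        # 58  -> mean salary is $58,000
print(median(salary))      # 54  -> median salary is $54,000
print(multimode(salary))   # [52, 70] -> the data is bimodal
print((min(salary) + max(salary)) / 2)  # 70 -> midrange is $70,000
```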

Central Tendency Measures for different attributes:

Central Tendency Measures for Numerical Attributes: Mean, Median, Mode

Central Tendency Measures for Categorical Attributes:

 Central Tendency Measures for Nominal Attributes: Mode


 Central Tendency Measures for Ordinal Attributes: Mode, Median

2. Measuring the Dispersion of Data: Range, Quartiles, Variance, Standard Deviation,


and Interquartile

a. Range

The range is the simplest measure of dispersion, or variability, of data. It is
computed by subtracting the lowest value from the highest value. A wide range
indicates high variability, and a small range indicates low variability in the
distribution. To calculate the range, arrange all the values in ascending order,
then subtract the lowest value from the highest value.

Range = Highest_value – Lowest_value

Let x1, x2, ..., xN be a set of observations for some numeric attribute, X. The
range of the set is the difference between the largest (max()) and smallest
(min()) values.

b. Quantiles:

Suppose that the data for attribute X are sorted in increasing numeric order.
Imagine that we can pick certain data points so as to split the data distribution
into equal-size consecutive sets. These data points are called quantiles.
Quantiles are points taken at regular intervals of a data distribution, dividing it
into essentially equal-size consecutive sets. The kth q-quantile for a given
data distribution is the value x such that at most k/q of the data values are less
than x and at most (q − k)/q of the data values are more than x, where k is an
integer such that 0 < k < q. There are q − 1 q-quantiles. For example, the
4-quantiles are the three quartiles, and the 100-quantiles are the 99 percentiles.
c. Interquartile range (IQR):

The quartiles give an indication of a distribution’s centre, spread, and shape.


The first quartile, denoted by Q1, is the 25th percentile. It cuts off the lowest
25% of the data. The third quartile, denoted by Q3, is the 75th percentile—it
cuts off the lowest 75% (or highest 25%) of the data. The second quartile is
the 50th percentile. As the median, it gives the centre of the data distribution.
The distance between the first and third quartiles is a simple measure of spread
that gives the range covered by the middle half of the data. This distance is
called the interquartile range (IQR): IQR = Q3 − Q1.

d. Variance:

The variance is a measure of variability that utilizes all the data. It is based on
the difference between the value of each observation (xi) and the mean (x̄ for
a sample, μ for a population).

The variance is the average of the squared differences between each data value and the
mean. For N observations x1, x2, ..., xN with mean x̄,
σ² = (1/N) [ (x1 − x̄)² + (x2 − x̄)² + ⋯ + (xN − x̄)² ].

e. Standard Deviation:

• The standard deviation of a data set is the positive square root of the
variance. It is measured in the same units as the data, making it more easily
interpreted than the variance.

• The standard deviation is computed as the square root of the variance:
σ = √(σ²).

3. Graphic Displays of Basic Statistical Descriptions: Quantile Plots, Quantile–Quantile
Plots, Histograms, And Scatter Plots.

Such graphs are helpful for the visual inspection of data, which is useful for data
preprocessing. The first three of these show univariate distributions (i.e., data for one
attribute), while scatter plots show bivariate distributions (i.e., involving two
attributes).

a. Quantile Plot:

A quantile plot is a simple and effective way to have a first look at a univariate data
distribution. First, it displays all of the data for the given attribute (allowing the user
to assess both the overall behavior and unusual occurrences). Second, it plots quantile
information. Note that the 0.25 quantile corresponds to quartile Q1, the 0.50 quantile is
the median, and the 0.75 quantile is Q3. Let xi, for i = 1, ..., N, be the data sorted in
increasing order, so that x1 is the smallest observation and xN is the largest. Each
observation xi is paired with a percentage fi, which indicates that approximately
fi × 100% of the data are below the value xi, where
fi = (i − 0.5) / N.

On a quantile plot, xi is graphed against fi. This allows us to compare different distributions
based on their quantiles.
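A minimal matplotlib sketch of this construction (assuming matplotlib is installed; the salary data is reused purely for illustration):

```python
# Minimal sketch: quantile plot. Each sorted value x_i is plotted against
# f_i = (i - 0.5) / N, as described above.
import matplotlib.pyplot as plt

salary = sorted([30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110])
N = len(salary)
f = [(i - 0.5) / N for i in range(1, N + 1)]

plt.plot(f, salary, marker="o")
plt.xlabel("f-value (fraction of data below)")
plt.ylabel("salary (in thousands of dollars)")
plt.title("Quantile plot")
plt.show()
```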

b. Quantile–Quantile Plot:

A Q-Q (quantile-quantile) plot graphically compares two probability distributions by
plotting their quantiles against each other. If the two distributions being compared are
identical, the points on the Q-Q plot lie exactly on the straight line y = x. Suppose
that we have two sets of observations for the attribute or variable unit price,
taken from two different branch locations. Let x1, ..., xN be the data from the first
branch, and y1, ..., yM be the data from the second, where each data set is sorted
in increasing order. If M = N (i.e., the number of points in each set is the
same), then we simply plot yi against xi, where yi and xi are both the (i − 0.5)/N
quantiles of their respective data sets. If M < N (i.e., the second branch has
fewer observations than the first), there can be only M points on the q-q plot.
Here, yi is the (i − 0.5)/M quantile of the y data, which is plotted against the
(i − 0.5)/M quantile of the x data.
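A minimal sketch of the M = N case described above (the two unit-price samples are made-up values used only for illustration):

```python
# Minimal sketch: Q-Q plot for two equal-size samples (hypothetical unit
# prices from two branches). The i-th point pairs the (i - 0.5)/N quantiles.
import matplotlib.pyplot as plt

branch1 = sorted([42, 44, 47, 50, 55, 58, 62, 66, 70, 75])
branch2 = sorted([40, 45, 48, 52, 56, 60, 63, 68, 72, 80])

plt.scatter(branch1, branch2)
# Reference line y = x: points on this line indicate identical distributions.
lo, hi = min(branch1 + branch2), max(branch1 + branch2)
plt.plot([lo, hi], [lo, hi], linestyle="--")
plt.xlabel("branch 1 unit price quantiles")
plt.ylabel("branch 2 unit price quantiles")
plt.title("Q-Q plot")
plt.show()
```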

c. Histograms:

Histograms (or frequency histograms) are at least a century old and are widely
used. “Histos” means pole or mast, and “gram” means chart, so a histogram is
a chart of poles. Plotting histograms is a graphical method for summarizing
the distribution of a given attribute, X. If X is nominal, such as automobile
model or item type, then a pole or vertical bar is drawn for each known value
of X. The height of the bar indicates the frequency (i.e., count) of that X value.
The resulting graph is more commonly known as a bar chart. If X is numeric,
the term histogram is preferred. The range of values for X is partitioned into
disjoint consecutive subranges. The subranges, referred to as buckets or bins,
are disjoint subsets of the data distribution for X. The range of a bucket is
known as the width. Typically, the buckets are of equal width. For example, a
price attribute with a value range of $1 to $200 (rounded up to the nearest
dollar) can be partitioned into subranges 1 to 20, 21 to 40, 41 to 60, and so on.
For each subrange, a bar is drawn with a height that represents the total count
of items observed within the subrange. Although histograms are widely used,
they may not be as effective as the quantile plot, q-q plot, and boxplot methods
in comparing groups of univariate observations.
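A minimal sketch of a frequency histogram for a numeric attribute (the price values below are randomly generated solely to illustrate the equal-width bucketing described above):

```python
# Minimal sketch: histogram of a hypothetical price attribute, partitioned
# into equal-width buckets (bins) of width 20 over the range $1-$200.
import random
import matplotlib.pyplot as plt

random.seed(0)
prices = [random.randint(1, 200) for _ in range(500)]  # made-up price data

plt.hist(prices, bins=range(1, 202, 20), edgecolor="black")
plt.xlabel("price ($)")
plt.ylabel("count of items")
plt.title("Histogram of price")
plt.show()
```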

d. Scatter Plots and Data Correlation:

The most useful graph for displaying the relationship between two quantitative
variables is a scatterplot. A scatterplot shows the relationship between two
quantitative variables measured for the same individuals. The values of one
variable appear on the horizontal axis, and the values of the other variable
appear on the vertical axis. Each individual in the data appears as a point on
the graph. The scatter plot is a useful method for providing a first look at
bivariate data to see clusters of points and outliers, or to explore the possibility
of correlation relationships. Two attributes, X and Y, are correlated if one
attribute implies the other. Correlations can be positive, negative, or null
(uncorrelated).
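A minimal sketch of a scatter plot for two numeric attributes, with a Pearson correlation coefficient printed as a numeric check (the data is synthetic, assuming numpy and matplotlib are available):

```python
# Minimal sketch: scatter plot of two hypothetical numeric attributes X and Y,
# plus the Pearson correlation coefficient (positive, negative, or near zero).
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.uniform(0, 100, size=200)           # attribute X (synthetic)
y = 0.8 * x + rng.normal(0, 10, size=200)   # attribute Y, positively correlated

plt.scatter(x, y, s=10)
plt.xlabel("attribute X")
plt.ylabel("attribute Y")
plt.title("Scatter plot")
plt.show()

print(np.corrcoef(x, y)[0, 1])  # close to +1 => strong positive correlation
```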

3. Data visualization

Data visualization is the representation of data points and information in graphical
form, to make it easy and quick for users to understand. A data visualization is good if it
has a clear meaning and purpose and is easy to interpret without requiring extra context.
Data visualization tools provide an accessible way to see and understand trends, outliers,
and patterns in data by using visual elements such as charts, graphs, and maps.
Characteristics of an Effective Graphical Visualization:

 It shows or visualizes data very clearly in an understandable manner.


 It encourages viewers to compare different pieces of data.
 It closely integrates statistical and verbal descriptions of the data set.
 It grabs our interest, focuses our mind, and keeps our eyes on the message, as the human
brain tends to focus on visual data more than on written data.
 It also helps in identifying areas that need more attention and improvement.
 Using graphical representation, a story can be told more efficiently; it also takes less
time to understand a picture than to understand textual data.

Categories of Data Visualization: Data visualization is critical to market research,
where both numerical and categorical data can be visualized; this increases the impact
of insights and helps reduce the risk of analysis paralysis. Data visualization is
therefore divided into the following categories:
Figure - Categories of Data Visualization

1. Numerical Data: Numerical data is also known as quantitative data. It is any data that
represents an amount, such as a person's height, weight, or age. Numerical data
visualization is the easiest way to visualize data; it is generally used to help others
digest large data sets and raw numbers in a form that is easier to interpret and act on.
Numerical data is divided into two categories:
 Continuous Data - Data that can take any value within a range (Example: Height
measurements).
 Discrete Data - Data that is not "continuous" and takes only countable values (Example:
Number of cars or children a household has).
The visualization techniques used to represent numerical data are charts and numerical
values. Examples are pie charts, bar charts, averages, scorecards, etc.

2. Categorical Data: Categorical data is also known as qualitative data. It is any data that
represents groups. It consists of categorical variables used to represent characteristics
such as a person's ranking or gender. Categorical data visualization is all about depicting
key themes, establishing connections, and lending context. Categorical data is classified
into three categories:
 Binary Data - Classification is based on two opposing states (Example: Agrees or
disagrees).
 Nominal Data - Classification is based on attributes (Example: Male or
female).
 Ordinal Data - Classification is based on the ordering of information (Example:
Timelines or processes).
The visualization techniques used to represent categorical data are graphics, diagrams,
and flowcharts. Examples are word clouds, sentiment mapping, Venn diagrams, etc.
4. Measuring Data Similarity and Dissimilarity
In data mining, similarity measures quantify how alike data objects are, yielding a higher
score for more similar objects, while dissimilarity measures indicate how different they
are, with a lower score signifying greater similarity. These proximity measures are used
in techniques like clustering and anomaly detection, with common methods including
Euclidean distance for numerical data and Cosine similarity for comparing the orientation
of data points.

Similarity: A numerical score indicating how closely related two data objects are.

Scale: Typically ranges from 0 (no similarity) to 1 (high similarity), although the upper
limit can vary by algorithm.

Purpose: Used in clustering to group similar data points into the same clusters.

Dissimilarity: A numerical score representing how different two data objects are.

Scale: Lower values indicate greater similarity (more alike), with a dissimilarity of 0
meaning the objects are identical.

Purpose: Helps identify objects that are unique or stand out from the rest of the data.

Common Measures

For Numerical Data:

1. Euclidean Distance: The straight-line distance between two points in a multi-dimensional space.
2. Minkowski Distance: A generalization of the Euclidean distance, with an order parameter p (p = 2 gives the Euclidean distance and p = 1 the Manhattan distance).
3. Cosine Similarity: Measures the cosine of the angle between two vectors, often used
when data has high dimensionality, such as in text mining.

For Binary or Nominal Data:

1. Simple Matching Coefficient (Similarity): Assigns 1 if the states of two attributes are
the same and 0 if they are different.
2. Binary Dissimilarity: Assigns 0 if the attributes are the same and 1 if they are
different.
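A minimal Python sketch of these measures, using the usual textbook formulas (Euclidean and Minkowski distance, cosine similarity, and simple matching on nominal values); the sample objects are made up for illustration:

```python
# Minimal sketch: common proximity measures on two hypothetical data objects.
import math

x = [3.0, 4.0, 5.0]
y = [1.0, 1.0, 2.0]

# Euclidean distance: straight-line distance (Minkowski with p = 2).
euclidean = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Minkowski distance: generalization with order parameter p (p = 1 is Manhattan).
def minkowski(u, v, p):
    return sum(abs(a - b) ** p for a, b in zip(u, v)) ** (1 / p)

# Cosine similarity: cosine of the angle between the two vectors.
dot = sum(a * b for a, b in zip(x, y))
cosine = dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

# Simple matching coefficient on nominal/binary attributes:
# fraction of attributes whose states are the same.
p_obj = ["red", "yes", "single", 0]
q_obj = ["red", "no", "single", 0]
matches = sum(1 for a, b in zip(p_obj, q_obj) if a == b)
simple_matching_similarity = matches / len(p_obj)
simple_matching_dissimilarity = 1 - simple_matching_similarity  # similarity = 1 - dissimilarity

print(euclidean)                      # about 4.69
print(minkowski(x, y, 1))             # 8.0 (Manhattan distance)
print(cosine)                         # about 0.98
print(simple_matching_similarity)     # 0.75
print(simple_matching_dissimilarity)  # 0.25
```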

Relationship between Similarity and Dissimilarity

 In some cases, especially with nominal data, similarity and dissimilarity measures can be
derived from each other.

 For example, similarity can be calculated as 1 − dissimilarity (or vice versa), depending on the specific context and scale of the measure.
