Paper 2 Scheme 2
Skewness
The measures of the direction and degree of symmetry of a dataset are called measures of skewness. A dataset may contain some extremely high values or some extremely low values. If the dataset has extreme high values, it is said to be skewed to the right; if it has extreme low values, it is said to be skewed to the left. If the tail is longer on the right-hand side and the hump is on the left-hand side, the distribution is called positively skewed; if the tail is longer on the left-hand side and the hump is on the right-hand side, it is called negatively skewed. A dataset may also be evenly distributed. The implication is that if the data is skewed, there is a greater chance of outliers in the dataset, which affects the mean and the median and, in turn, may affect the performance of the data mining algorithm. A perfect symmetry means the skewness is zero. In a negatively skewed distribution, the median is greater than the mean; in a positively skewed distribution, the mean is greater than the median.
Generally, for a negatively skewed distribution, the median is more than the mean. The relationship between the skew and the relative sizes of the mean and median can be summarised by a convenient numerical skew index known as Pearson's second skewness coefficient.
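For reference, a standard form of Pearson's second skewness coefficient, assuming s denotes the sample standard deviation, is:

\[ \text{skew}_{P2} = \frac{3(\bar{x} - \text{median})}{s} \]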
Kurtosis
Kurtosis also indicates the peakedness of the data. If the data has a high peak, it indicates higher kurtosis, and vice versa. Kurtosis is the measure of whether the data is heavy-tailed or light-tailed relative to the normal distribution. It can be observed that the normal distribution has a bell-shaped curve with no long tails. Low kurtosis tends to indicate light tails, implying that there is no outlier data. Let x_1, x_2, ..., x_N be a set of 'N' values or observations. Then, kurtosis is measured using the formula given below:
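A standard form of the sample kurtosis, consistent with the description above (assuming \bar{x} denotes the sample mean), is:

\[ \text{Kurtosis} = \frac{\tfrac{1}{N}\sum_{i=1}^{N}(x_i - \bar{x})^4}{\left(\tfrac{1}{N}\sum_{i=1}^{N}(x_i - \bar{x})^2\right)^2} \]

(Some texts subtract 3 from this value so that the normal distribution has zero excess kurtosis.)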
In a normal probability plot, ideally the points fall along the reference line (the 45-degree line) if the data follows a normal distribution. If the deviation is larger, there is greater evidence that the dataset follows a different distribution, that is, one other than the normal distribution. In such a case, a careful analysis of the statistical investigations should be carried out before interpretation. Skewness, kurtosis, mean absolute deviation and the coefficient of variation all help in assessing univariate data.
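The next quantity considered is the covariance between two attributes X and Y. A standard definition, consistent with the symbols described below, is:

\[ \text{COV}(X, Y) = \frac{1}{N}\sum_{i=1}^{N}\big(x_i - E(X)\big)\big(y_i - E(Y)\big) \]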
Here, x_i and y_i are data values from X and Y, E(X) and E(Y) are the mean values of x_i and y_i, and N is the number of given data points. Also, COV(X, Y) is the same as COV(Y, X). In the example considered, the covariance between X and Y is 12. Covariance can be normalized to a value between -1 and +1 by dividing it by the product of the standard deviations of X and Y; the result is called the Pearson correlation coefficient. Sometimes, N - 1 can also be used instead of N; in that case, the covariance of the example is 60/4 = 15.
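As a minimal sketch of these calculations, the following Python snippet computes the covariance (with both the N and N - 1 conventions) and the Pearson correlation for two hypothetical attribute vectors; the data values are illustrative only, not those used in the example above.

```python
import numpy as np

# Hypothetical attribute values (illustrative only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.0, 3.0, 2.0, 5.0, 7.0])

n = len(x)
dx, dy = x - x.mean(), y - y.mean()

cov_n = np.sum(dx * dy) / n          # covariance dividing by N
cov_n1 = np.sum(dx * dy) / (n - 1)   # covariance dividing by N - 1

# Dividing by the product of the standard deviations gives the
# Pearson correlation coefficient, a value between -1 and +1.
r = np.sum(dx * dy) / (np.sqrt(np.sum(dx**2)) * np.sqrt(np.sum(dy**2)))

print(cov_n, cov_n1, r)
```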
Correlation
The Pearson correlation coefficient is the most common test for determining
any association between two phenomena. It measures the strength and
direction of a linear relationship between the x and y variables.
The correlation indicates the relationship between dimensions using its sign. The sign is more important than the actual value.
1. If the value is positive, it indicates that the dimensions increase together.
2. If the value is negative, it indicates that while one dimension increases, the other dimension decreases.
3. If the value is zero, it indicates that the dimensions are uncorrelated, i.e., there is no linear relationship between them.
If two dimensions are highly correlated, then it is better to remove one of them, as it is a redundant dimension.
If the given attributes are X = (x_1, x_2, ..., x_N) and Y = (y_1, y_2, ..., y_N), then the Pearson correlation coefficient, denoted as r, is given as:
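A standard form of the coefficient, consistent with the covariance definition above, is:

\[ r = \frac{\text{COV}(X, Y)}{\sigma_X \, \sigma_Y} = \frac{\sum_{i=1}^{N}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{N}(x_i - \bar{x})^2}\;\sqrt{\sum_{i=1}^{N}(y_i - \bar{y})^2}} \]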
The mean of multivariate data is a mean vector; for example, the mean of the three attributes considered above is given as (2, 7.5, 1.33). The variance of multivariate data becomes the covariance matrix. The mean vector is also called the centroid, and the variance is called the dispersion matrix. This is discussed in the next section. Multivariate data has three or more variables. The aims of multivariate analysis are much broader; they include regression analysis, factor analysis and multivariate analysis of variance, which are explained in the subsequent chapters of this book.
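As a small illustrative sketch (the three-attribute data below is hypothetical, not the dataset referred to above), NumPy can be used to compute the mean vector (centroid) and the covariance (dispersion) matrix of multivariate data:

```python
import numpy as np

# Hypothetical multivariate data: 4 observations of 3 attributes
data = np.array([
    [1.0, 5.0, 1.0],
    [2.0, 7.0, 1.5],
    [3.0, 8.0, 1.0],
    [2.0, 10.0, 2.0],
])

mean_vector = data.mean(axis=0)          # centroid of the data
cov_matrix = np.cov(data, rowvar=False)  # covariance (dispersion) matrix

print("Mean vector:", mean_vector)
print("Covariance matrix:\n", cov_matrix)
```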
Heat map
A heat map is a graphical representation of a 2D matrix. It takes a matrix as input and colours it. The darker colours indicate very large values and the lighter colours indicate smaller values. The advantage of this method is that humans perceive colours well, so by colour shading, larger values can be perceived easily. For example, in vehicle traffic data, heavy-traffic regions can be differentiated from low-traffic regions through a heat map. In Figure 2.13, patient data highlighting weight and health status is plotted: the X-axis shows weights, the Y-axis shows patient counts, and the dark-coloured regions highlight patients' weights vs patient counts by health status.
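A minimal sketch of drawing a heat map from a 2D matrix with Matplotlib (the matrix values here are random and purely illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt

# Illustrative 2D matrix; darker cells correspond to larger values
# when a sequential colour map is used.
matrix = np.random.rand(10, 10)

plt.imshow(matrix, cmap="Greys")  # sequential colour map: light -> dark
plt.colorbar(label="value")
plt.title("Heat map of a 2D matrix")
plt.show()
```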
Pairplot
A pair plot or scatter matrix is a data visualization technique for multivariate data. A scatter matrix consists of several pair-wise scatter plots of the variables of the multivariate data, and all the results are presented in a matrix format. By visual examination of the chart, one can easily find relationships among the variables, such as correlation between the variables. A random matrix of three columns is chosen and the relationships among the columns are plotted as a pair plot (or scatter matrix), as shown below in Figure 2.14.
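A minimal sketch of producing such a pair plot with Seaborn, using a random three-column matrix as in the description above:

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Random matrix with three columns, as described in the text
data = pd.DataFrame(np.random.randn(100, 3), columns=["col1", "col2", "col3"])

sns.pairplot(data)  # pair-wise scatter plots arranged as a matrix
plt.show()
```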
2a]
Construct how the 6 Vs are helpful to characterise big data and also the representation of data storage in big data.
3 Elements of Big Data
Data whose volume is small and which can be stored and processed by a small-scale computer is called 'small data'. Such data are collected from several sources, and integrated and processed by a small-scale computer. Big data, on the other hand, is data whose volume is much larger than that of 'small data' and is characterized as follows:
1. Volume - Since there is a reduction in the cost of storage devices, there has been tremendous growth of data. Small traditional data is measured in terms of gigabytes (GB) and terabytes (TB), but Big Data is measured in terms of petabytes (PB) and exabytes (EB). One exabyte is 1 million terabytes.
2. Velocity - The fast arrival speed of data and its increase in data volume is noted as velocity. The availability of IoT devices and Internet power ensures that the data arrives at a faster rate. Velocity helps to understand the relative growth of big data and its accessibility by users, systems and applications.
3. Variety - The variety of Big Data includes:
• Form - There are many forms of data. Data types range from text, graph, audio and video to maps. There can be composite data too, where one media can have many other sources of data; for example, a video can have an audio song.
• Function - These are data from various sources like human conversations, transaction records, and old archive data.
• Source of data - This is the third aspect of variety. There are many sources of data. Broadly, the data source can be classified as open/public data, social media data and multimodal data. These are discussed in Section 2.3.1 of this chapter.
Some of the other forms of Vs that are often quoted in the literature as characteristics of Big data are:
4. Veracity of data - Veracity of data deals with aspects like conformity to the facts, truthfulness, believability, and confidence in data. There may be many sources of error such as technical errors, typographical errors, and human errors. So, veracity is one of the most important aspects of data.
5. Validity - Validity is the accuracy of the data for taking decisions or for any other goals that are needed by the given problem.
6. Value - Value is the characteristic of big data that indicates the value of the information that is extracted from the data and its influence on the decisions that are taken based on it.
Thus, these 6 Vs are helpful to characterize big data. The data quality of numeric attributes is determined by factors like precision, bias, and accuracy. Precision is defined as the closeness of repeated measurements; often, the standard deviation is used to measure precision. Bias is a systematic error due to erroneous assumptions of the algorithms or procedures. Accuracy refers to the closeness of measurements to the true value of the quantity. Normally, the number of significant digits used to store and manipulate a value indicates the accuracy of the measurement.
Flat Files These are the simplest and most commonly available data sources. They are also the cheapest way of organizing data. Flat files are files where data is stored in plain ASCII or EBCDIC format. Minor changes of data in flat files affect the results of the data mining algorithms. Hence, flat files are suitable only for storing small datasets and are not desirable if the dataset becomes larger. Some of the popular spreadsheet formats are listed below:
• CSV files - CSV stands for comma-separated value files where the values are
separated by commas. These are used by spreadsheet and database
applications. The first row may have attributes and the rest of the rows
represent the data.
• TSV files - TSV stands for Tab separated values files where values are
separated by Tab.
Both CSV and TSV files are generic in nature and can be shared. There are
many tools like Google Sheets and Microsoft Excel to process these files.
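As a minimal sketch, pandas can read both formats in Python (the file names below are hypothetical):

```python
import pandas as pd

# CSV: values separated by commas; the first row holds the attribute names
csv_data = pd.read_csv("students.csv")

# TSV: values separated by tabs
tsv_data = pd.read_csv("students.tsv", sep="\t")

print(csv_data.head())
print(tsv_data.head())
```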
Database System It normally consists of database files and a database
management system
(DBMS). Database files contain original data and metadata. DBMS aims to
manage data and improve operator performance by including various tools like
database administrator, query processing, and transaction manager. A
relational database consists of sets of tables. The tables have rows and
columns. The columns represent the attributes and rows represent tuples. A
tuple corresponds to either an object or a relationship between objects. A user
can access and manipulate the data in the database using SQL.
Different types of databases are listed below:
1. A transactional database is a collection of transactional records. Each record is a transaction. A transaction may have a time stamp, an identifier and a set of items, which may have links to other tables. Normally, transactional databases are created for performing associational analysis that indicates the correlation among the items.
2. Time-series databases store time-related information such as log files, where data is associated with a time stamp. This data represents sequences of values or events obtained over a period (for example, hourly, weekly or yearly) or over repeated time spans. Observing the sales of a product continuously may yield time-series data.
3. Spatial databases contain spatial information in a raster or vector format. Raster formats are either bitmaps or pixel maps. For example, images can be stored as raster data. On the other hand, the vector format can be used to store maps, as maps use basic geometric primitives like points, lines, polygons and so forth.
World Wide Web (WWW) It provides a diverse, worldwide online information source. The objective of data mining algorithms is to mine interesting patterns of information present in the WWW.
XML (eXtensible Markup Language) It is a data format that is both human and machine interpretable and can be used to represent data that needs to be shared across platforms.
Data Stream It is dynamic data, which flows in and out of the observing environment. Typical characteristics of a data stream are huge volume of data, dynamic nature, fixed-order movement, and real-time constraints.
RSS (Really Simple Syndication) It is a format for sharing instant feeds across services.
JSON (JavaScript Object Notation) It is another useful data interchange format that is often used for many machine learning algorithms.
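A minimal sketch of this interchange format using Python's standard json module (the record contents are illustrative):

```python
import json

# A small record serialized to JSON and parsed back
record = {"student_id": 1, "marks": [45, 60, 60, 80, 85]}

text = json.dumps(record)    # Python object -> JSON string
parsed = json.loads(text)    # JSON string -> Python object

print(text)
print(parsed["marks"])
```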
2b]
Organise the following as the importance of probability and statistics in machine learning:
a) probability distribution
b) continuous probability distribution
c) discrete distribution
Probability Distributions
A probability distribution of a variable, say X, summarizes the probability associated with X's events. A distribution is a parameterized mathematical function. In other words, a distribution is a function that describes the relationship between the observations in a sample space. Consider a set of data. The data is said to follow a distribution if it obeys a mathematical function that characterizes that distribution. The function can be used to calculate the probability of individual observations.
Probability distributions are of two types:
1. Discrete probability distribution
2. Continuous probability distribution
The relationship between the events of a continuous random variable and their probabilities is called a continuous probability distribution. It is summarized as a Probability Density Function (PDF). The PDF gives the relative likelihood of observing an instance, and the plot of the PDF shows the shape of the distribution. The Cumulative Distribution Function (CDF) computes the probability of an observation being less than or equal to a given value. Both the PDF and CDF are continuous functions. The discrete equivalent of the PDF in a discrete distribution is called the Probability Mass Function (PMF).
For a continuous variable, the probability of an exact outcome cannot be read off directly; it is computed as the area under the PDF curve over a small interval around the specific outcome. The probability of an observation being less than or equal to a value is given by the CDF.
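A minimal sketch illustrating the PDF and CDF of a continuous distribution using SciPy (the standard normal distribution is used purely as an example):

```python
from scipy.stats import norm

x = 0.5
print(norm.pdf(x))   # density at x
print(norm.cdf(x))   # probability of an observation <= x

# Probability of falling in a small interval around x, taken as the
# difference of CDF values at the interval's end points.
eps = 0.01
print(norm.cdf(x + eps) - norm.cdf(x - eps))
```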
Let us discuss some of the distributions that are encountered in machine
learning.
Continuous Probability Distributions Normal, Rectangular, and Exponential
distributions
fall under this category.
1. Normal Distribution - Most statistical tests expect data to follow a normal distribution. To check this, normality tests are used. A normality test can be done with a Q-Q plot, in which the CDF of one random variable is compared against the CDF of the normal distribution: the quantiles of one distribution are plotted against those of the other. If the two are the same, the plot closely follows the straight line from bottom-left to top-right.
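A minimal sketch of such a normality check with SciPy and Matplotlib (the sample data is randomly generated for illustration):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Illustrative sample; replace with the data to be tested
sample = np.random.normal(loc=0.0, scale=1.0, size=200)

# probplot draws sample quantiles against normal quantiles;
# points close to the straight line suggest normality.
stats.probplot(sample, dist="norm", plot=plt)
plt.show()
```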
2. Rectangular Distribution - This is also known as the uniform distribution. It has equal probabilities for all values in the range (a, b). The uniform distribution is given as follows:
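A standard form of the uniform probability density function over (a, b) is:

\[ f(x) = \begin{cases} \dfrac{1}{b - a} & a \le x \le b \\ 0 & \text{otherwise} \end{cases} \]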
3. Exponential Distribution - Here, x is a random variable and λ is called the rate parameter. The mean and standard deviation of the exponential distribution are both given as β, where β = 1/λ.
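A standard form of the exponential probability density function, consistent with the rate parameter λ described above, is:

\[ f(x) = \lambda e^{-\lambda x}, \quad x \ge 0 \]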
Discrete Distributions Binomial, Poisson, and Bernoulli distributions fall under this category.
1. Binomial Distribution - The binomial distribution is another distribution that is often encountered in machine learning. Each trial has only two outcomes: success or failure; such a trial is also called a Bernoulli trial. The objective of this distribution is to find the probability of getting k successes out of n trials. The probability of getting k successes out of n trials is given as:
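A standard form of this probability, with p denoting the probability of success in a single trial (an assumption, since p is not defined in the surrounding text), is:

\[ P(X = k) = \binom{n}{k} p^{k} (1 - p)^{\,n - k} \]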
2. Poisson Distribution - Here, x is the number of times the event occurs and λ is the mean number of times an event occurs. In the example of emails, the mean λ is the population mean number of emails received, and the standard deviation is √λ.
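For reference, a standard form of the Poisson probability mass function with mean λ is:

\[ P(X = x) = \frac{e^{-\lambda}\,\lambda^{x}}{x!}, \quad x = 0, 1, 2, \dots \]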
The computation of the above formula is unstable, and hence the problem is restated as the maximization of the log conditional probability given θ. This is given as:

This window function can be replaced by any other function too. If a Gaussian function is used, then it is called a Gaussian density function.
KNN Estimation The KNN estimation is another non-parametric density estimation method. Here, the initial parameter k is determined and, based on that, the k nearest neighbours are determined. The probability density function estimate is the average of the values that are returned by the neighbours.
4a] Construct the different types of graphs that are used in univariate data analysis.
Data Visualization
To understand data, graph visualization is a must. Data visualization helps in understanding data and in presenting information and data to customers. Some of the graphs that are used in univariate data analysis are bar charts, histograms, frequency polygons and pie charts. The advantages of graphs are the presentation, summarization, description and exploration of data, and making comparisons of data. Let us consider some forms of graphs now:
Bar Chart A Bar chart (or Bar graph) is used to display the frequency
distribution for variables.
Bar charts are used to illustrate discrete data. The charts can also help to
explain the counts of
nominal data. It also helps in comparing the frequency of different groups.
The bar chart for students' marks (45, 60, 60, 80, 85) with Student ID = (1, 2,
3, 4, 5) is shown
below in Figure 2.3.
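As a minimal sketch, this Matplotlib snippet draws the bar chart of the marks listed above:

```python
import matplotlib.pyplot as plt

student_ids = [1, 2, 3, 4, 5]
marks = [45, 60, 60, 80, 85]

plt.bar(student_ids, marks)
plt.xlabel("Student ID")
plt.ylabel("Marks")
plt.title("Bar chart of students' marks")
plt.show()
```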
Pie Chart These are equally helpful in illustrating univariate data. The percentage frequency distribution of students' marks (22, 22, 40, 40, 70, 70, 70, 85, 90, 90) is shown below in Figure 2.4. It can be observed that the number of students with 22 marks is 2 and the total number of students is 10. So, 2/10 × 100 = 20% of the pie of 100% is allotted for marks 22 in Figure 2.4.
Histogram It plays an important role in data mining for showing frequency distributions. The histogram for students' marks (45, 60, 60, 80, 85) in the group ranges 0-25, 26-50, 51-75 and 76-100 is given below in Figure 2.5. One can visually inspect from Figure 2.5 that the number of students in the range 76-100 is 2. A histogram conveys useful information like the nature of the data and its mode. The mode indicates the peak of the dataset. In other words, histograms can be used as charts to show frequency, the skewness present in the data, and the shape of the distribution.
Dot Plots These are similar to bar charts but are less cluttered, as they illustrate the bars with single points only. The dot plot of English marks for five students with IDs (1, 2, 3, 4, 5) and marks (45, 60, 60, 80, 85) is given in Figure 2.6. The advantage is that, by visual inspection, one can find out who got more marks.
Central Tendency
One cannot remember all the data. Therefore, a condensation or summary of the data is necessary. This makes data analysis easy and simple. One such summary is called central tendency. Thus, central tendency can explain the characteristics of data and further helps in comparison. Mass data have a tendency to concentrate at certain values, normally in the central location. This is called the measure of central tendency (or averages). These represent the first-order measures. Popular measures are the mean, median and mode.
1. Mean - The arithmetic mean is obtained by summing all the values and dividing by the number of values. For example, the mean of the three numbers 10, 20 and 30 is (10 + 20 + 30)/3 = 20.
• Weighted mean - Unlike the arithmetic mean, which gives equal weightage to all items, the weighted mean gives different importance to different items, as item importance varies. Hence, different weightages can be given to items. In the case of a frequency distribution, the mid values of the ranges are taken for computation. This is illustrated in the computation that follows. In the weighted mean, the mean is computed by adding the products of the proportions and the group means. It is mostly used when the sample sizes are unequal.
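A standard form of the weighted mean, with w_i denoting the weight (or proportion) of item x_i, is:

\[ \bar{x}_w = \frac{\sum_{i=1}^{N} w_i x_i}{\sum_{i=1}^{N} w_i} \]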
• Geometric mean - Let x_1, x_2, ..., x_N be a set of 'N' values or observations. The geometric mean is the Nth root of the product of the N items. The formula for computing the geometric mean is given as follows:
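In standard notation, this is:

\[ \text{GM} = \sqrt[N]{x_1 \times x_2 \times \cdots \times x_N} = \left(\prod_{i=1}^{N} x_i\right)^{1/N} \]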
2. Median - The middle value in the distribution is called the median. If the total number of items in the distribution is odd, then the middle value is the median. If the number of items is even, then the average of the two items in the centre is the median. It can be observed that the median is the value at which the data is divided into two equal halves, with half of the values lower than the median and half higher than the median. A median class is the class in which the (N/2)th item is present.
In the continuous (grouped) case, the median is given by the formula:

\[ \text{Median} = L + \frac{\frac{N}{2} - cf}{f} \times i \]

The median class is the class in which the (N/2)th item is present. Here, i is the class interval of the median class, L is the lower limit of the median class, f is the frequency of the median class, and cf is the cumulative frequency of all classes preceding the median class.
3. Mode - The mode is the value that occurs most frequently in the dataset. In other words, the value that has the highest frequency is called the mode. The mode is meaningful mainly for discrete data and is not applicable for continuous data when there are no repeated values. The procedure for finding the mode is to calculate the frequencies of all the values in the data; the mode is the value (or values) with the highest frequency. Normally, a dataset is classified as unimodal, bimodal or trimodal when it has 1, 2 or 3 modes, respectively.