
1a] Construct the indication of symmetry/asymmetry and peak location of the dataset using skewness and kurtosis.

Skewness
The measures of direction and degree of symmetry are called measures of third order. Ideally, skewness should be zero, as in an ideal normal distribution. More often, the given dataset does not have perfect symmetry (consider the following Figure 2.8).

The dataset may have either very high values or extremely low values. If the dataset has far more high values, it is said to be skewed to the right; if it has far more low values, it is said to be skewed to the left. If the tail is longer on the right-hand side and the hump is on the left-hand side, it is called positive skew; if the tail is longer on the left-hand side, it is called negative skew. The implication is that if the data is skewed, there is a greater chance of outliers in the dataset. This affects the mean and the median and hence may affect the performance of the data mining algorithm. Perfect symmetry means the skewness is zero. In positive skew, the mean is greater than the median; generally, for a negatively skewed distribution, the median is more than the mean. The relationship between skew and the relative size of the mean and median can be summarized by a convenient numerical skew index known as Pearson's second skewness coefficient:

skewness_{Pearson} = \frac{3(\text{mean} - \text{median})}{\sigma}

Also, the following measure is more commonly used to measure skewness. Let x_1, x_2, \ldots, x_N be a set of N values or observations; then the skewness is given as:

skewness = \frac{\sum_{i=1}^{N} (x_i - \mu)^3}{N\,\sigma^3}

Here, \mu is the population mean and \sigma is the population standard deviation of the univariate data. Sometimes, for bias correction, N - 1 is used instead of N.

Kurtosis

Kurtosis indicates the peakedness of the data. If the data has a high peak, it indicates higher kurtosis, and vice versa. Kurtosis is the measure of whether the data is heavy-tailed or light-tailed relative to the normal distribution. It can be observed that the normal distribution has a bell-shaped curve with no long tails. Low kurtosis tends to correspond to light tails, the implication being that there is no outlier data. Let x_1, x_2, \ldots, x_N be a set of N values or observations. Then, kurtosis is measured using the formula given below:

kurtosis = \frac{\sum_{i=1}^{N} (x_i - \mu)^4}{(N-1)\,\sigma^4}

It can be observed that N - 1 is used instead of N in Eq. (2.14) for bias correction. Here, \mu and \sigma are the mean and standard deviation of the univariate data, respectively. Some of the other useful measures for finding the shape of the univariate dataset are mean absolute deviation (MAD) and coefficient of variation (CV).

Mean Absolute Deviation (MAD)

MAD is another dispersion measure and is robust to outliers. Normally, an outlier point is detected by computing its deviation from the median and dividing that by MAD. Here, the absolute deviation between the data and the mean is taken. Thus, the absolute deviation of an observation is given as:

|x_i - \mu|

The sum of the absolute deviations is given as:

\sum_{i=1}^{N} |x_i - \mu|

Therefore, the mean absolute deviation is given as:

MAD = \frac{1}{N} \sum_{i=1}^{N} |x_i - \mu|

Coefficient of Variation (CV)

The coefficient of variation is used to compare datasets with different units. CV is the ratio of the standard deviation to the mean, CV = \sigma / \mu, and %CV = (\sigma / \mu) \times 100 is the percentage coefficient of variation.
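As a quick illustration, the sketch below (not from the text) computes skewness, kurtosis, MAD and %CV for a small assumed sample using NumPy and SciPy; it follows the population formulas given above.

import numpy as np
from scipy import stats

x = np.array([45, 60, 60, 80, 85], dtype=float)   # assumed sample data

mu, sigma = x.mean(), x.std()                      # population mean and standard deviation
skew = np.mean(((x - mu) / sigma) ** 3)            # third-order measure of symmetry
kurt = np.sum((x - mu) ** 4) / ((len(x) - 1) * sigma ** 4)  # kurtosis with N-1 correction
mad = np.mean(np.abs(x - mu))                      # mean absolute deviation
cv = sigma / mu                                    # coefficient of variation

print(skew, stats.skew(x))                         # compare with SciPy's implementation
print(kurt)
print(mad, cv, cv * 100)                           # MAD, CV and %CV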

Special Univariate Plots

The ideal way to check the shape of the dataset is a stem and leaf plot. A stem and leaf plot is a display that helps us to know the shape and distribution of the data. In this method, each value is split into a 'stem' and a 'leaf'. The last digit is usually the leaf, and the digits to the left of the leaf mostly form the stem. For example, the mark 45 is divided into stem 4 and leaf 5 in Figure 2.9. The stem and leaf plot for the English subject marks, say, (45, 60, 60, 80, 85), is given in Figure 2.9.

Figure 2.9: Stem and Leaf Plot for English Marks


It can be seen from Figure 2.9 that the first column is the stem and the second column is the leaf. For the given English marks, the two students with 60 marks are shown in the stem and leaf plot as stem 6 with two leaves of 0. As discussed earlier, the ideal shape of a dataset is a bell-shaped curve, which corresponds to normality. Most statistical tests are designed only for normally distributed data. A Q-Q plot can be used to assess the shape of the dataset. The Q-Q plot is a 2D scatter plot of univariate data against theoretical normal distribution data, or of two datasets - the quantiles of the first dataset against the quantiles of the second. The normal Q-Q plot for marks x = [13 1 2 3 4 8 9] is given below in Figure 2.10.

Ideally, the points fall along the reference line (at 45 degrees) if the data follows a normal distribution. If the deviation is larger, there is greater evidence that the dataset follows some distribution other than the normal distribution. In such a case, careful statistical investigation should be carried out before interpretation. Thus, skewness, kurtosis, mean absolute deviation and coefficient of variation help in assessing univariate data.
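A hedged sketch of how such a normal Q-Q plot can be produced in Python is given below; it assumes SciPy and Matplotlib are available and uses the marks from Figure 2.10.

import matplotlib.pyplot as plt
from scipy import stats

marks = [13, 1, 2, 3, 4, 8, 9]                 # data quoted for Figure 2.10
stats.probplot(marks, dist="norm", plot=plt)   # sample quantiles vs. theoretical normal quantiles
plt.title("Normal Q-Q plot of marks")
plt.show()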

1b] Build bivariate statistics and multivariate statistics with an example for each of them.
Bivariate Statistics
Covariance and correlation are examples of bivariate statistics. Covariance is a measure of the joint variability of two random variables, say X and Y. Generally, random variables are represented in capital letters. It is written as covariance(X, Y) or COV(X, Y) and is used to measure the variance between two dimensions. The formula for finding the covariance of X and Y is:

COV(X, Y) = \frac{1}{N} \sum_{i=1}^{N} \big(x_i - E(X)\big)\big(y_i - E(Y)\big)

Here, x_i and y_i are data values from X and Y, E(X) and E(Y) are the mean values of x_i and y_i, and N is the number of given data points. Also, COV(X, Y) is the same as COV(Y, X). For the worked example in the text (where the sum of the products of deviations is 60 and N = 5), the covariance between X and Y is 60/5 = 12. Covariance can be normalized to a value between -1 and +1 by dividing it by the product of the standard deviations of the two variables; the result is called the Pearson correlation coefficient. Sometimes N - 1 is used instead of N; in that case, the covariance is 60/4 = 15.
Correlation
The Pearson correlation coefficient is the most common test for determining an association between two phenomena. It measures the strength and direction of the linear relationship between the x and y variables.
The correlation indicates the relationship between dimensions using its sign. The sign is more important than the actual value.
1. If the value is positive, it indicates that the dimensions increase together.
2. If the value is negative, it indicates that while one dimension increases, the other dimension decreases.
3. If the value is zero, it indicates that the two dimensions have no linear relationship. If two dimensions are highly correlated, it is better to remove one of them, as it is a redundant dimension.
If the given attributes are X = (x_1, x_2, \ldots, x_N) and Y = (y_1, y_2, \ldots, y_N), then the Pearson correlation coefficient, denoted as r, is given as:

r = \frac{COV(X, Y)}{\sigma_X \, \sigma_Y} = \frac{\frac{1}{N}\sum_{i=1}^{N}(x_i - \bar{x})(y_i - \bar{y})}{\sigma_X \, \sigma_Y}

where \sigma_X and \sigma_Y are the standard deviations of X and Y.
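The sketch below computes covariance (with N and with N - 1) and the Pearson correlation coefficient using NumPy; the vectors X and Y are assumed sample data, chosen so that the covariance works out to 12 as in the example above.

import numpy as np

X = np.array([1, 2, 3, 4, 5], dtype=float)
Y = np.array([1, 4, 9, 16, 25], dtype=float)

cov_n = np.mean((X - X.mean()) * (Y - Y.mean()))   # divide by N       -> 12.0
cov_n1 = np.cov(X, Y, ddof=1)[0, 1]                # divide by N - 1   -> 15.0
r = cov_n / (X.std() * Y.std())                    # Pearson correlation coefficient

print(cov_n, cov_n1, r)
print(np.corrcoef(X, Y)[0, 1])                     # should match r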


MULTIVARIATE STATISTICS
In machine learning, almost all datasets are multivariate. Multivariate data is the analysis of more than two observable variables, and often thousands of measurements need to be conducted for one or more subjects. Multivariate data is like bivariate data but may have more than two dependent variables. Some of the multivariate analysis techniques are regression analysis, principal component analysis, and path analysis.

The mean of multivariate data is a mean vector; for the three attributes of the example, it is given as (2, 7.5, 1.33). The variance of multivariate data becomes the covariance matrix. The mean vector is called the centroid and the variance is called the dispersion matrix; this is discussed in the next section. Multivariate data has three or more variables, and the aims of multivariate analysis are much broader: regression analysis, factor analysis and multivariate analysis of variance are explained in the subsequent chapters of this book.
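A minimal sketch follows, using illustrative values chosen so that the mean vector works out to (2, 7.5, 1.33) as quoted above; the text's actual table is not reproduced here.

import numpy as np

data = np.array([[1.0, 6.0, 1.0],
                 [2.0, 7.0, 1.0],
                 [3.0, 9.5, 2.0]])        # rows = subjects, columns = attributes (assumed values)

mean_vector = data.mean(axis=0)           # centroid of the multivariate data -> [2.0, 7.5, 1.33]
cov_matrix = np.cov(data, rowvar=False)   # dispersion (covariance) matrix

print(mean_vector)
print(cov_matrix)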

Heat map
A heat map is a graphical representation of a 2D matrix. It takes a matrix as input and colours it. Darker colours indicate very large values and lighter colours indicate smaller values. The advantage of this method is that humans perceive colours well, so by colour shading, larger values can be perceived easily. For example, in vehicle traffic data, heavy traffic regions can be differentiated from low traffic regions through a heat map. In Figure 2.13, patient data highlighting weight and health status is plotted: the X-axis is weight and the Y-axis is patient count, and the dark colour regions highlight patients' weights versus patient counts by health status.
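A minimal sketch of a heat map with Matplotlib is given below; it uses an assumed random matrix rather than the patient data of Figure 2.13.

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
matrix = rng.random((10, 10))            # any 2D matrix can be used

plt.imshow(matrix, cmap="Greys")         # larger values appear darker, as described above
plt.colorbar(label="value")
plt.title("Heat map of a 2D matrix")
plt.show()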

Pair plot

A pair plot, or scatter matrix, is a data visualization technique for multivariate data. A scatter matrix consists of several pair-wise scatter plots of the variables of the multivariate data, with all the results presented in a matrix format. By visual examination of the chart, one can easily find relationships among the variables, such as correlation between the variables. A random matrix of three columns is chosen and the relationships of the columns are plotted as a pair plot (or scatter matrix), as shown below in Figure 2.14.
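The following sketch builds such a scatter matrix with pandas for an assumed random three-column matrix; seaborn's pairplot could be used in the same way.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
df = pd.DataFrame(rng.random((50, 3)), columns=["A", "B", "C"])  # random 3-column matrix

pd.plotting.scatter_matrix(df, figsize=(6, 6), diagonal="hist")  # pair-wise scatter plots
plt.show()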
2a] Construct how the 6 Vs are helpful to characterise big data, and also the representation of data storage in big data.
Elements of Big Data
Data whose volume is small and which can be stored and processed by a small-scale computer is called 'small data'. These data are collected from several sources, and integrated and processed by a small-scale computer. Big data, on the other hand, is data whose volume is much larger than that of 'small data' and is characterized as follows:
1. Volume - Since there has been a reduction in the cost of storage devices, there has been tremendous growth of data. Small traditional data is measured in terms of gigabytes (GB) and terabytes (TB), but big data is measured in terms of petabytes (PB) and exabytes (EB). One exabyte is 1 million terabytes.
2. Velocity - The fast arrival speed of data and its increase in data volume are noted as velocity. The availability of IoT devices and Internet power ensures that data arrives at a faster rate. Velocity helps to understand the relative growth of big data and its accessibility by users, systems and applications.
3. Variety - The variety of Big Data includes:
• Form - There are many forms of data. Data types range from text, graph,
audio, video,
to maps. There can be composite data too, where one media can have many
other
sources of data, for example, a video can have an audio song.
• Function - These are data from various sources like human conversations,
transaction
records, and old archive data.
• Source of data - This is the third aspect of variety. There are many sources of
data.
Broadly, the data source can be classified as open/public data, social media
data and
multimodal data. These are discussed in Section 2.3.1 of this chapter.
Some of the other forms of Vs that are often quoted in the literature as characteristics of big data are:
4. Veracity of data - Veracity of data deals with aspects like conformity to the facts, truthfulness, believability, and confidence in data. There may be many sources of error, such as technical errors, typographical errors, and human errors. So, veracity is one of the most important aspects of data.
5. Validity - Validity is the accuracy of the data for taking decisions or for any
other goals that
are needed by the given problem.
6. Value - Value is the characteristic of big data that indicates the value of the
information
that is extracted from the data and its influence on the decisions that are taken
based on it.
Thus, these 6 Vs help to characterize big data. The data quality of numeric attributes is determined by factors like precision, bias, and accuracy. Precision is defined as the closeness of repeated measurements; often, the standard deviation is used to measure precision. Bias is a systematic error that results from erroneous assumptions of the algorithms or procedures. Accuracy refers to the closeness of measurements to the true value of the quantity. Normally, the number of significant digits used to store and manipulate a value indicates the accuracy of the measurement.
Flat Files These are the simplest and most commonly available data source. They are also the cheapest way of organizing data. Flat files are files where data is stored in plain ASCII or EBCDIC format. Minor changes of data in flat files affect the results of data mining algorithms. Hence, flat files are suitable only for storing small datasets and are not desirable when the dataset becomes larger. Some of the popular spreadsheet formats are listed below:
• CSV files - CSV stands for comma-separated value files where the values are
separated by commas. These are used by spreadsheet and database
applications. The first row may have attributes and the rest of the rows
represent the data.
• TSV files - TSV stands for Tab separated values files where values are
separated by Tab.
Both CSV and TSV files are generic in nature and can be shared. There are
many tools like Google Sheets and Microsoft Excel to process these files.
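As an illustration, the sketch below reads such files with pandas; the file names marks.csv and marks.tsv are assumptions for the example.

import pandas as pd

df_csv = pd.read_csv("marks.csv")              # comma-separated values, first row = attributes
df_tsv = pd.read_csv("marks.tsv", sep="\t")    # tab-separated values

print(df_csv.head())                           # first few rows of each dataset
print(df_tsv.head())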
Database System It normally consists of database files and a database
management system
(DBMS). Database files contain original data and metadata. DBMS aims to
manage data and improve operator performance by including various tools like
database administrator, query processing, and transaction manager. A
relational database consists of sets of tables. The tables have rows and
columns. The columns represent the attributes and rows represent tuples. A
tuple corresponds to either an object or a relationship between objects. A user
can access and manipulate the data in the database using SQL.
Different types of databases are listed below:
1. A transactional database is a collection of transactional records. Each record
is a
transaction. A transaction may have a time stamp, identifier and a set of items,
which may
have links to other tables. Normally, transactional databases are created for
performing
associational analysis that indicates the correlation among the items.
2. Time-series database stores time related information like log files where
data is associated with a time stamp. This data represents the sequences of
data, which represent values or events obtained over a period (for example,
hourly, weekly or yearly) or repeated time span. Observing sales of product
continuously may yield a time-series data.
3. Spatial databases contain spatial information in a raster or vector format. Raster formats are either bitmaps or pixel maps; for example, images can be stored as raster data. On the other hand, the vector format can be used to store maps, as maps use basic geometric primitives like points, lines, polygons and so forth.
World Wide Web (WWW) It provides a diverse, worldwide online information source. The objective of data mining algorithms is to mine interesting patterns of information present in the WWW.
XML (eXtensible Markup Language) It is a data format that is both human and machine interpretable and can be used to represent data that needs to be shared across platforms.
Data Stream It is dynamic data, which flows in and out of the observing environment. Typical characteristics of a data stream are huge volume of data, dynamic nature, fixed-order movement, and real-time constraints.
RSS (Really Simple Syndication) It is a format for sharing instant feeds across services.
JSON (JavaScript Object Notation) It is another useful data interchange format that is often used by many machine learning algorithms.

2b] Organise the following as the importance of probability and statistics in machine learning:
a) Probability distribution
b) Continuous probability distribution
c) Discrete distribution
Probability Distributions
A probability distribution of a variable, say X, summarizes the probability
associated with X's
events. Distribution is a parameterized mathematical function. In other words,
distribution is a
function that describes the relationship between the observations in a sample
space.
Consider a set of data. The data is said to follow a distribution if it obeys a
mathematical
function that characterizes that distribution. The function can be used to
calculate the probability
of individual observations.
Probability distributions are of two types:
1. Discrete probability distribution
2. Continuous probability distribution
The relationship between the events of a continuous random variable and their probabilities is called a continuous probability distribution. It is summarized by a Probability Density Function (PDF). The PDF is used to calculate the probability of observing an instance in an interval, and the plot of the PDF shows the shape of the distribution. The Cumulative Distribution Function (CDF) computes the probability of an observation being less than or equal to a given value.
Both PDF and CDF are continuous functions. The discrete equivalent of the PDF in a discrete distribution is called the Probability Mass Function (PMF).
For a continuous random variable, the probability of a specific outcome cannot be obtained directly; it is computed as the area under the PDF over a small interval around that outcome, and the cumulative area up to a value is what the CDF gives.
Let us discuss some of the distributions that are encountered in machine
learning.
Continuous Probability Distributions Normal, Rectangular, and Exponential
distributions
fall under this category.

1. Normal Distribution - The normal distribution is a continuous probability distribution. It is also known as the Gaussian distribution or bell-shaped curve distribution, and is the most common distribution function. The shape of this distribution is a typical bell-shaped curve. In a normal distribution, data tends to be around a central value with no bias to the left or right. The heights of students, the blood pressure of a population, and the marks scored in a class can be approximated using a normal distribution.
The PDF of the normal distribution is given as:

f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}

It is characterized by two parameters - mean and variance. Mostly, one uses the normal distribution curve with a mean of 0 and an SD of 1. In a normal distribution, the mean, median and mode are the same. The distribution extends from -\infty to +\infty. The standard deviation indicates how spread out the data is.
One important concept associated with the normal distribution is the z-score. It can be computed as:

z = \frac{x - \mu}{\sigma}

When \mu is 0 and \sigma is 1, the z-score is the same as x. This is useful to normalize the data.

Most statistical tests expect the data to follow a normal distribution. To check this, normality tests are used. A normality test can be done with a Q-Q plot, where the quantiles of the data are plotted against the quantiles of the theoretical normal distribution (that is, the CDF of the random variable is compared with the CDF of the normal distribution). If they are the same, then the plot closely follows a straight line from bottom-left to top-right.
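The sketch below, assuming NumPy and SciPy, evaluates the standard normal PDF and CDF and computes z-scores for an assumed sample of marks.

import numpy as np
from scipy import stats

x = np.array([45, 60, 60, 80, 85], dtype=float)   # assumed sample
mu, sigma = x.mean(), x.std()

z = (x - mu) / sigma                               # z-scores normalize the data
print(z)

print(stats.norm.pdf(0, loc=0, scale=1))           # PDF of the standard normal at 0
print(stats.norm.cdf(1.96))                        # P(X <= 1.96) for the standard normal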
2. Rectangular Distribution - This is also known as the uniform distribution. It has equal probabilities for all values in the range [a, b]. The uniform distribution is given as follows:

f(x) = \frac{1}{b-a} \text{ for } a \le x \le b, \text{ and } 0 \text{ otherwise}

3. Exponential Distribution - This is a continuous probability distribution that is used to describe the time between events in a Poisson process. The exponential distribution is a special case of the Gamma distribution with its shape parameter fixed at 1. This distribution is helpful in modelling the time until an event occurs.
The PDF is given as follows:

f(x) = \lambda e^{-\lambda x}, \quad x \ge 0

Here, x is a random variable and \lambda is called the rate parameter. The mean and standard deviation of the exponential distribution are both given as \beta, where \beta = 1/\lambda.
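A small sketch with scipy.stats for the two continuous distributions above follows; note that SciPy parameterizes the exponential by scale = 1/lambda.

from scipy import stats

a, b = 0.0, 10.0
uniform = stats.uniform(loc=a, scale=b - a)       # rectangular distribution on [a, b]
print(uniform.pdf(5.0))                           # equals 1/(b - a) = 0.1

lam = 2.0
expon = stats.expon(scale=1.0 / lam)              # exponential with rate lambda
print(expon.pdf(0.0), expon.mean(), expon.std())  # pdf(0) = lambda, mean = std = 1/lambda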
Discrete Distributions Binomial, Poisson, and Bernoulli distributions fall under this category.
1. Binomial Distribution - The binomial distribution is another distribution that is often encountered in machine learning. Each trial has only two outcomes: success or failure. Such a trial is also called a Bernoulli trial.
The objective of this distribution is to find the probability of getting exactly k successes out of n trials. The number of ways of getting k successes out of n trials is given as:

\binom{n}{k} = \frac{n!}{k!\,(n-k)!}

If p is the probability of success, the probability of failure is (1 - p), and the probability of a particular sequence of k successes in n trials is given as:

p^k (1-p)^{n-k}

Combining both, one gets the PMF of the binomial distribution as:

P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}

Here, p is the probability of success on each trial, k is the number of successes, and n is the total number of trials. The mean of the binomial distribution is given as:

\mu = np

and the variance is given as:

\sigma^2 = np(1-p)

Hence, the standard deviation is given as:

\sigma = \sqrt{np(1-p)}
2. Poisson Distribution - This is another important distribution that is quite useful. Given an interval of time, this distribution is used to model the probability of a given number of events k occurring in that interval, assuming the events occur at a known mean rate \lambda and independently of previous events. Some examples of the Poisson distribution are the number of emails received, the number of customers visiting a shop, and the number of phone calls received by an office.
The PMF of the Poisson distribution is given as follows:

P(X = x) = \frac{\lambda^x e^{-\lambda}}{x!}

Here, x is the number of times the event occurs and \lambda is the mean number of times the event occurs. The mean of the distribution is \lambda (for example, the mean number of emails received) and the standard deviation is \sqrt{\lambda}.

3. Bernoulli Distribution - This distribution models an experiment whose outcome is binary. The outcome is positive with probability p and negative with probability 1 - p. The PMF of this distribution is given as:

P(X = x) = p^x (1-p)^{1-x}, \quad x \in \{0, 1\}

The mean is p and the variance is p(1 - p).
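The sketch below evaluates the three discrete distributions with scipy.stats for assumed parameter values.

from scipy import stats

# Binomial: probability of k = 3 successes in n = 10 trials with p = 0.5
print(stats.binom.pmf(3, n=10, p=0.5))
print(stats.binom.mean(n=10, p=0.5), stats.binom.var(n=10, p=0.5))  # np and np(1-p)

# Poisson: probability of x = 2 events when the mean rate is lambda = 4
print(stats.poisson.pmf(2, mu=4))
print(stats.poisson.std(mu=4))          # sqrt(lambda)

# Bernoulli: single binary trial with success probability p = 0.3
print(stats.bernoulli.pmf(1, p=0.3))    # P(X = 1) = p
print(stats.bernoulli.var(p=0.3))       # p(1 - p)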


3a] Identify the problems that are associated with machine learning and the process of machine learning with a neat diagram.
1.5 CHALLENGES OF MACHINE LEARNING
What are the challenges of machine learning? Let us discuss them now.
Problems that can be Dealt with by Machine Learning
Computers are better than humans at performing tasks like computation. For example, while calculating the square root of large numbers, an average human may blink, but a computer can display the result in seconds. Computers can play games like chess and Go, and even beat professional players of those games.
However, humans are better than computers in many aspects like recognition.
But, deep
learning systems challenge human beings in this aspect as well. Machines can
recognize human
faces in a second. Still, there are tasks where humans are better, as machine learning systems still require quality data for model construction. The quality of a learning system depends on the quality of data. This is a challenge. Some of the challenges are listed below:
1. Problems - Machine learning can deal with 'well-posed' problems, where the specifications are complete and available. Computers cannot solve 'ill-posed' problems. Consider one simple example (shown in Table 1.3): can a model for this data be multiplication, that is, y = x_1 \times x_2? It is true, but it is equally true that the data may fit y = x_1 \div x_2 or y = x_1^{x_2}. So, there are three functions that fit the same data, and the problem is ill-posed.
2. Huge data - This is a primary requirement of machine learning. Availability of
a quality
data is a challenge. A quality data means it should be large and should not
have data
problems such as missing data or incorrect data.
3. High computation power - With the availability of Big Data, the
computational resource
requirement has also increased. Systems with Graphics Processing Unit (GPU)
or even Tensor
Processing Unit (TPU) are required to execute machine learning algorithms.
Also, machine
learning tasks have become complex and hence time complexity has
increased, and that
can be solved only with high computing power.
4. Complexity of the algorithms - The selection of algorithms, describing the
algorithms,
application of algorithms to solve machine learning task, and comparison of
algorithms
have become necessary for machine learning or data scientists now.
Algorithms have
become a big topic of discussion and it is a challenge for machine learning
professionals to
design, select, and evaluate optimal algorithms.
5. Bias/Variance - Bias is the error due to erroneous assumptions in the model, while variance is the error due to the model's sensitivity to the training data. This leads to a problem called the bias/variance tradeoff. A model that fits the training data correctly but fails for test data, and in general lacks generalization, is said to be overfitted. The reverse problem, where the model fails even on the training data, is called underfitting. Overfitting and underfitting are great challenges for machine learning algorithms.

1.6 MACHINE LEARNING PROCESS


The emerging process model for the data mining solutions for business
organizations is CRISP-DM.
Since machine learning is like data mining, except for the aim, this process can
be used for machine
learning. CRISP-DM stands for Cross Industry Standard Process - Data Mining.
This process
involves six steps. The steps are listed below in Figure 1.11.
1. Understanding the business - This step involves understanding the
objectives and
requirements of the business organization. Generally, a single data mining
algorithm is
enough for giving the solution. This step also involves the formulation of the
problem
statement for the data mining process.
2. Understanding the data - It involves steps like data collection, study of the characteristics of the data, formulation of hypotheses, and matching of patterns to the selected hypotheses.
3. Preparation of data - This step involves producing the final dataset by
cleaning the raw
data and preparation of data for the data mining process. The missing values
may cause
problems during both training and testing phases. Missing data forces
classifiers to produce
inaccurate results. This is a perennial problem for the classification models.
Hence, suitable
strategies should be adopted to handle the missing data.
4. Modelling - This step plays a role in the application of data mining algorithm
for the data
to obtain a model or pattern.
5. Evaluate - This step involves the evaluation of the data mining results using
statistical
analysis and visualization methods. The performance of the classifier is
determined by
evaluating the accuracy of the classifier. The process of classification is a fuzzy
issue.
For example, classification of emails requires extensive domain knowledge and
requires
domain experts. Hence, performance of the classifier is very crucial.
6. Deployment - This step involves the deployment of results of the data mining
algorithm
to improve the existing process or for a new situation.

3b] Build a short note on two types of density estimation methods.
Density Estimation
Let there be a set of observed values x_1, x_2, \ldots, x_n from a larger set of data whose distribution is not known. Density estimation is the problem of estimating the density function from the observed data. The estimated density function, denoted as p(x), can then be evaluated directly for any unknown data point, say x_t, as p(x_t). If this value is less than a threshold \epsilon, then x_t is categorized as an anomaly (outlier); otherwise, it is treated as normal data.
There are two types of density estimation methods, namely parametric density estimation and non-parametric density estimation.
Parametric Density Estimation This assumes that the data comes from a known probabilistic distribution, which can be estimated as p(x | \Theta), where \Theta is the parameter. The maximum likelihood function is a parametric estimation method.
Maximum Likelihood Estimation For a sample of observations, one can estimate the probability distribution; this is called density estimation. Maximum Likelihood Estimation (MLE) is a probabilistic framework that can be used for density estimation. It involves formulating a function called the likelihood function, which is the conditional probability of observing the observed samples given the distribution function and its parameters. For example, if the observations are X = (x_1, x_2, \ldots, x_n), then density estimation is the problem of choosing a PDF with suitable parameters to describe the data. MLE treats this as a search or optimization problem in which the joint probability of X and its parameter \theta should be maximized. This is expressed as p(X; \theta), where X = (x_1, x_2, \ldots, x_n).
The likelihood of observing the data is given as a function L(X; \theta). The objective of MLE is to maximize this function, \max_\theta L(X; \theta).

The joint probability of this problem can be restated as:

L(X; \theta) = \prod_{i=1}^{n} p(x_i; \theta)

The computation of the above formula is numerically unstable, and hence the problem is restated as the maximum of the log conditional probability given \theta:

\max_\theta \sum_{i=1}^{n} \log p(x_i; \theta)

Instead of maximizing, one can minimize this function as:

\min_\theta \; -\sum_{i=1}^{n} \log p(x_i; \theta)

as, often, minimization is preferred over maximization. This is called the negative log-likelihood function.


The relevance of MLE for machine learning is that MLE can solve the problem of predictive modelling. The regression problem is treated in Chapter 5 using the principle of the least-squares approach; the same problem can be viewed from an MLE perspective too. If one assumes that the regression problem can be framed as predicting an output y given an input x, then for p(y | x) the MLE framework can be applied as:

\max_h \sum_{i=1}^{n} \log p(y_i \mid x_i; h)

Here, h is the linear regression model. If a Gaussian distribution is assumed, as most of the data follow a Gaussian distribution, then MLE can be stated as:

\max_\beta \sum_{i=1}^{n} \log \mathcal{N}\big(y_i;\; h(x_i, \beta),\; \sigma^2\big)

Here, \beta is the regression coefficient and x_i is the given sample. One can maximize this function or minimize the negative log-likelihood function to provide a solution for the linear regression problem. Eq. (2.37) yields the same answer as the least-squares approach.
Stochastic gradient descent (SGD) is an algorithm that is normally used to
maximize or
minimize any function. This algorithm can be used to minimize the above
function to get the
results effectively. Hence, many statistical packages use this approach for
solving linear regression
problem. Theoretically, any model can be used instead of linear regression in
the above equation.
Logistic regression is another model that can be used in MLE framework.
Logistic regression is
discussed in Chapter 5 of this book.
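As a hedged illustration of the MLE idea (not the text's worked example), the sketch below estimates the mean and standard deviation of assumed Gaussian data by minimizing the negative log-likelihood with SciPy and compares the result with the closed-form MLE.

import numpy as np
from scipy import stats, optimize

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=200)    # assumed observations

def neg_log_likelihood(params):
    mu, sigma = params
    if sigma <= 0:
        return np.inf                              # keep sigma positive
    return -np.sum(stats.norm.logpdf(data, loc=mu, scale=sigma))

result = optimize.minimize(neg_log_likelihood, x0=[0.0, 1.0], method="Nelder-Mead")
print(result.x)                 # MLE estimates of (mu, sigma)
print(data.mean(), data.std())  # closed-form MLE for comparison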
Gaussian Mixture Model and Expectation-Maximization (EM) Algorithm In machine learning, a mixture model is a model-based statistical method for clustering data. The data is assumed to be generated by a distribution model with its parameter \Theta. There may be many distributions involved, and that is why it is called a mixture model. Since Gaussians are normally assumed for the data, this mixture model is categorized as a Gaussian Mixture Model (GMM).
The EM algorithm is one algorithm that is commonly used for estimating the MLE in the presence of latent or missing variables. What is a latent variable? Let us assume that the dataset includes the weights of boys and girls. Considering that the boys' weights would be slightly more than the girls' weights, one can assume that the larger weights are generated by one Gaussian distribution with one set of parameters, while the girls' weights are generated by another Gaussian with another set of parameters. There is an influence of gender in the data, but it is not directly present or observable; such variables are called latent variables. The EM algorithm is effective for estimating the PDF in the presence of latent variables.
Generally, there can be many unspecified distributions with different set of
parameters. The
EM algorithm has two stages:
1. Expectation (E) Stage - In this stage, the expected PDF and its parameters
are estimated
for each latent variable.
2. Maximization (M) stage - In this, the parameters are optimized using the MLE
function.
This process is iterative, and the iteration is continued till all the latent
variables are fitted
by probability distributions effectively along with the parameters.
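A minimal sketch using scikit-learn's GaussianMixture, which fits a GMM by the EM algorithm, is given below; the two-group weight data is assumed for illustration.

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
girls = rng.normal(loc=55, scale=5, size=100)      # latent group 1 (assumed)
boys = rng.normal(loc=70, scale=6, size=100)       # latent group 2 (assumed)
weights = np.concatenate([girls, boys]).reshape(-1, 1)

gmm = GaussianMixture(n_components=2, random_state=0).fit(weights)
print(gmm.means_.ravel())          # estimated means of the two Gaussians
print(gmm.covariances_.ravel())    # estimated variances
print(gmm.predict(weights[:5]))    # most likely latent component per sample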
Non-parametric Density Estimation A non-parametric estimate can be generative or discriminative. The Parzen window is a generative estimation method that finds p(x | \Theta) as a conditional density, while discriminative methods directly compute p(\Theta | x) as a posterior probability. The Parzen window and the k-Nearest Neighbour (KNN) rule are examples of non-parametric density estimation. Let us discuss them now.
Parzen Window Let there be n samples, X = (x_1, x_2, \ldots, x_n). The samples are drawn independently from the same distribution, that is, they are independent and identically distributed (i.i.d.). Let R be a region that covers k samples out of the total n samples. Then, the probability is given as:

p \approx \frac{k}{n}

and the density estimate is given as:

p(x) \approx \frac{k}{nV}

where V is the volume of the region R. If R is a hypercube centred at x and h is the edge length of the hypercube, the volume V is h^2 for a 2D square and h^3 for a 3D cube.
The Parzen window function is given as follows:

\varphi(u) = 1 \text{ if } |u_j| \le \tfrac{1}{2} \text{ for every dimension } j, \text{ and } 0 \text{ otherwise}

The window indicates whether a sample is inside the region or not. The Parzen probability density function estimate using Eq. (2.40) is given as:

p(x) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{h^d}\, \varphi\!\left(\frac{x - x_i}{h}\right)

This window function can be replaced by any other function too. If a Gaussian function is used, then it is called a Gaussian density estimate.
KNN Estimation The KNN estimation is another non-parametric density estimation method. Here, the parameter k is fixed first, and based on it the k nearest neighbours of the query point are determined. The probability density function estimate at the point is then obtained from these k neighbours: the region is grown just large enough to enclose them, and p(x) is estimated as k/(nV), as above.
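The sketch below illustrates both ideas for one-dimensional data: a Gaussian-kernel (Parzen-style) density estimate via scikit-learn and a simple hand-written k-NN estimate; the sample and the choices of bandwidth and k are assumptions.

import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
x = rng.normal(size=200).reshape(-1, 1)            # assumed univariate sample

kde = KernelDensity(kernel="gaussian", bandwidth=0.5).fit(x)
print(np.exp(kde.score_samples([[0.0]])))          # kernel density estimate at 0

def knn_density(point, data, k=10):
    # p(x) ~ k / (n * V), where V is the length of the interval reaching the k-th neighbour
    dists = np.sort(np.abs(data.ravel() - point))
    volume = 2 * dists[k - 1]                      # 1D "volume" = interval length
    return k / (len(data) * volume)

print(knn_density(0.0, x))                         # k-NN density estimate at 0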
4a] Construct the different types of graphs that are used in univariate data analysis.
Data Visualization
To understand data, graph visualization is a must. Data visualization helps us to understand data and to present information and data to customers. Some of the graphs that are used in univariate data analysis are bar charts, histograms, frequency polygons and pie charts. The advantages of graphs are the presentation, summarization, description and exploration of data, and making comparisons of data easy. Let us consider some forms of graphs now:
Bar Chart A Bar chart (or Bar graph) is used to display the frequency
distribution for variables.
Bar charts are used to illustrate discrete data. The charts can also help to
explain the counts of
nominal data. It also helps in comparing the frequency of different groups.
The bar chart for students' marks (45, 60, 60, 80, 85) with Student ID = (1, 2,
3, 4, 5) is shown
below in Figure 2.3.

Pie Chart Pie charts are equally helpful in illustrating univariate data. The percentage frequency distribution of students' marks (22, 22, 40, 40, 70, 70, 70, 85, 90, 90) is shown in Figure 2.4. It can be observed that the number of students with 22 marks is 2 out of a total of 10 students, so 2/10 × 100 = 20% of the pie is allotted for marks 22 in Figure 2.4.
Histogram It plays an important role in data mining for showing frequency distributions. The histogram for students' marks (45, 60, 60, 80, 85) in the group ranges 0-25, 26-50, 51-75 and 76-100 is given below in Figure 2.5. One can visually inspect from Figure 2.5 that the number of students in the range 76-100 is 2.

Histogram conveys useful information like nature of data and its mode. Mode
indicates the
peak of dataset. In other words, histograms can be used as charts to show
frequency, skewness
present in the data, and shape.
Dot Plots These are similar to bar charts but are less cluttered, as they illustrate the bars with only single points. The dot plot of English marks for five students with IDs (1, 2, 3, 4, 5) and marks (45, 60, 60, 80, 85) is given in Figure 2.6. The advantage is that, by visual inspection, one can find out who got more marks.
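A minimal sketch that draws the bar chart, pie chart and histogram described above with Matplotlib, using the marks data quoted in the figures, is given below.

import matplotlib.pyplot as plt

ids = [1, 2, 3, 4, 5]
marks = [45, 60, 60, 80, 85]

fig, axes = plt.subplots(1, 3, figsize=(12, 4))
axes[0].bar(ids, marks)                                   # bar chart: one bar per student
axes[0].set_title("Bar chart of marks")

pie_marks = [22, 22, 40, 40, 70, 70, 70, 85, 90, 90]
values, counts = zip(*[(m, pie_marks.count(m)) for m in sorted(set(pie_marks))])
axes[1].pie(counts, labels=values, autopct="%1.0f%%")     # percentage frequency of each mark
axes[1].set_title("Pie chart of marks")

axes[2].hist(marks, bins=[0, 25, 50, 75, 100])            # ranges 0-25, 26-50, 51-75, 76-100
axes[2].set_title("Histogram of marks")
plt.show()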

Central Tendency
One cannot remember all the data. Therefore, a condensation or summary of
the data is necessary.
This makes the data analysis easy and simple. One such summary is called
central tendency. Thus,
central tendency can explain the characteristics of data and that further helps
in comparison. Mass
data have tendency to concentrate at certain values, normally in the central
location. It is called
measure of central tendency (or averages). This represents the first order of
measures. Popular
measures are mean, median and mode.

1. Mean - The arithmetic average (or mean) is a measure of central tendency that represents the 'centre' of the dataset. It is the commonest measure used in our daily conversation, such as average income or average traffic. It can be found by adding all the data values and dividing the sum by the number of observations. Mathematically, the average of all the values in the sample (population) is denoted as \bar{x}. Let x_1, x_2, \ldots, x_N be a set of N values or observations; then the arithmetic mean is given as:

\bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i

For example, the mean of the three numbers 10, 20, and 30 is (10 + 20 + 30)/3 = 20.
• Weighted mean - Unlike the arithmetic mean, which weights all items equally, the weighted mean gives different importance to different items, as item importance varies. Hence, different weightages w_i can be given to items:

\bar{x}_w = \frac{\sum_{i=1}^{N} w_i x_i}{\sum_{i=1}^{N} w_i}

In the case of a frequency distribution, the mid values of the ranges are taken for computation. When groups are combined, the weighted mean is computed by adding the products of the group proportions and group means; it is mostly used when the sample sizes are unequal.
• Geometric mean - Let x_1, x_2, \ldots, x_N be a set of N values or observations. The geometric mean is the Nth root of the product of the N items. The formula for computing the geometric mean is given as follows:

GM = \left(\prod_{i=1}^{N} x_i\right)^{1/N}

Here, N is the number of items and x_i are the values. For example, if the values are 6 and 8, the geometric mean is \sqrt{6 \times 8} = \sqrt{48} \approx 6.93. For larger datasets, computing the product directly is difficult; hence, the geometric mean is usually calculated as:

GM = \exp\left(\frac{1}{N} \sum_{i=1}^{N} \ln x_i\right)

The problem with the mean is its extreme sensitivity to noise: even small changes in the input affect the mean drastically. Hence, often the extreme values (for example, the top 2%) are chopped off before the mean is calculated for a larger dataset.

2. Median - The middle value in the distribution is called the median. If the total number of items in the distribution is odd, then the middle value is the median. If the number of items is even, then the average of the two items in the centre is the median. It can be observed that the median is the value that divides the data into two equal halves, with half of the values being lower than the median and half higher. A median class is the class in which the (N/2)th item is present.
In the continuous case, the median is given by the formula:

Median = L + \frac{\frac{N}{2} - cf}{f} \times i

Here, L is the lower limit of the median class, f is the frequency of the median class, cf is the cumulative frequency of all classes preceding the median class, and i is the class interval of the median class.
3. Mode - The mode is the value that occurs most frequently in the dataset. In other words, the value that has the highest frequency is called the mode. The mode is normally used for discrete data, as repeated values may not occur in continuous data.
The procedure for finding the mode is to calculate the frequencies for all the
values in the
data, and mode is the value (or values) with the highest frequency. Normally,
the dataset is
classified as unimodal, bimodal and trimodal with modes 1, 2 and 3,
respectively.
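The sketch below computes these measures, plus the geometric mean, with Python's statistics module and SciPy for the small samples used above.

import statistics
from scipy import stats

marks = [45, 60, 60, 80, 85]
print(statistics.mean(marks))        # arithmetic mean
print(statistics.median(marks))      # middle value
print(statistics.mode(marks))        # most frequent value (60)

print(stats.gmean([6, 8]))           # geometric mean = sqrt(48) ≈ 6.93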

4b] Build a program to implement Principal Component Analysis (PCA) for reducing the dimensionality of the Iris dataset from 4 features to 2.
PCA
# Load the Iris dataset, reduce it from 4 features to 2 principal components, and visualize the result.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

iris = load_iris()
print("Type of iris.data:", type(iris.data))
print("Shape of iris.data:", iris.data.shape)
print("First 5 rows of iris.data:")
print(iris.data[:5])
print("Iris target (species labels):", iris.target)
print("Iris feature names:", iris.feature_names)

# Reduce the 4-dimensional data to 2 principal components
X = iris.data
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
print("Reduced data shape:", X_pca.shape)
print("First 5 rows of reduced data:")
print(X_pca[:5])

# Scatter plot of the two principal components, coloured by species
plt.figure(figsize=(8, 6))
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=iris.target, cmap='viridis')
plt.xlabel("Principal component 1")
plt.ylabel("Principal component 2")
plt.title("PCA of Iris dataset")
plt.colorbar(label="Iris species")
plt.show()
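Optionally, one can also inspect how much of the total variance the two components retain via scikit-learn's explained_variance_ratio_ attribute (assuming the fitted pca object from the program above is still in scope):

# Proportion of the total variance retained by the two principal components
print("Explained variance ratio:", pca.explained_variance_ratio_)
print("Total variance retained:", pca.explained_variance_ratio_.sum())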
