DATA SCIENCE
PART-A
SHORT QUESTIONS WITH SOLUTIONS
Q1. Write in short about data science.
Answer : Model Paper, Q1
Data science is the combination of several tools, algorithms and machine learning principles whose
aim is to discover the hidden patterns in raw data.
For example, consider the role of statisticians.
[Figure: roles of the analyst and the data scientist (labels: business administration, exploratory data analysis, analyst, data scientist)]
In the above figure, the role of the data analyst is to illustrate the processing history of the data. The
data scientist is responsible for exploratory analysis to discover insights and to identify the reasons
behind events and anticipate future events by using advanced machine learning algorithms. This is done by
observing the data from various angles. Hence, data science is used to make decisions and predictions through
predictive causal analytics, prescriptive analytics and machine learning.
Q2. What is linear algebra?
Answer : Model Paper, Q2
Linear algebra is a branch of mathematics concerned with linear equations and linear
functions along with their representations through matrices and vector spaces. It is universal to almost all the
areas of mathematics such as geometry and functional analysis. These concepts are basic requirements for
working with linear algebra. Linear algebra is used in data science in the form of scalars, vectors, matrices and tensors.
1. Scalar : It is a single number.
Example : 1
2. Vector : It is an array of numbers.
Example : $\begin{bmatrix} 1 \\ 2 \end{bmatrix}$
3. Matrix : It is a 2-D array.
4. Tensor : It is an n-dimensional array with n > 2.
Example : [two 2 × 2 matrices stacked into a single array]
Various operations that can be performed on them are transposition, vector and matrix multiplication,
identity and inverse matrices, etc.
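The four kinds of objects and the operations listed above can be tried directly in R. The following is a minimal sketch; the numeric values and the variable names (`s`, `v`, `M`, `T3`) are illustrative assumptions, not taken from the text.

```r
# Scalar: a single number (in R, a numeric vector of length 1)
s <- 1

# Vector: an array of numbers
v <- c(1, 2)

# Matrix: a 2-D array (R fills it column by column by default)
M <- matrix(c(1, 2, 3, 4), nrow = 2)

# Tensor: an n-dimensional array with n > 2 (here 2 x 2 x 2)
T3 <- array(1:8, dim = c(2, 2, 2))

Mt   <- t(M)       # transposition
Mv   <- M %*% v    # matrix-vector multiplication
I2   <- diag(2)    # 2 x 2 identity matrix
Minv <- solve(M)   # inverse matrix (M must be non-singular)
```

Multiplying `M` by `Minv` gives the identity matrix back, which is a quick way to check an inverse.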
Q3. Define linear equation.
Answer : Model Paper
Linear Equation
An equation in the variable x that can be written in the following form is called a linear equation.
$y = mx + b$
$Ax + By = C$
Here, m, b, A, B and C are real numbers. A and B must not both be zero. The graph of any linear equation
would be a straight line.
Let the linear equation be $Ax = b$
Where,
A is an m × n matrix of coefficients for m equations and n unknowns.
x is an n × 1 vector of unknowns $x_1, x_2, \dots, x_n$.
b is an m × 1 vector of constants, i.e., the right hand sides of the equations.
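To make the shapes of A, x and b concrete, here is a small sketch in R. The system 2x1 + x2 = 5, x1 + 3x2 = 10 is a hypothetical example chosen for illustration.

```r
# Hypothetical system:  2*x1 + 1*x2 = 5
#                       1*x1 + 3*x2 = 10
A <- matrix(c(2, 1,
              1, 3), nrow = 2, byrow = TRUE)  # m x n coefficient matrix
x <- c(1, 3)                                  # candidate values of the n unknowns
b <- A %*% x                                  # m x 1 right hand sides

dim(A)   # m rows (equations) and n columns (unknowns)
dim(b)   # m rows, 1 column
```

Here `x = (1, 3)` satisfies both equations, so `A %*% x` reproduces the right hand sides 5 and 10.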
Q4. Define distance.
Answer :
Distance
distance( ) is a function that calculates various dissimilarity or distance metrics. It is written so that new
metrics can be added. It is not an efficient choice for large matrices.
Syntax
distance(x, method = "euclidean", sprange = NULL, spweight = NULL, icov)
The function dist( ) can also be used to compute and return the distance matrix, using a specified distance measure
to compute the distances between the rows of a data matrix.
Q5. Explain in brief about hyper planes.
Answer : Model Paper
Hyper Planes
A hyper plane is a geometric entity whose dimension is one less than that of its ambient
space. For example, for a 3D space the hyper plane is a 2D plane, for a 2D space the hyper plane is a 1D line, and so
on. A hyper plane can be defined by the below equation,
$X^T n + b = 0$
The above equation can be expanded for n dimensions,
$x_1 n_1 + x_2 n_2 + x_3 n_3 + \dots + x_n n_n + b = 0$
For 2 dimensions the equation is,
$x_1 n_1 + x_2 n_2 + b = 0$
Consider the hyper plane of the below form,
$X^T n = 0$
i.e., if the plane goes through the origin, the hyper plane also becomes a subspace.
The function hyperplane computes a (k − 1) dimensional hyper plane that passes through k given
points in k-dimensional space.
The general format of this function is hyperplane(X).
Q6. What is half space?
Answer :
Half Space
The half spaces of $\mathbb{R}^n$ are the sets associated with a pair $(s, r) \in \mathbb{R}^n \times \mathbb{R}$, $s \neq 0$, and they can be
defined by,
$\{x \in \mathbb{R}^n : \langle s, x \rangle \le r\}$ for a closed half space
$\{x \in \mathbb{R}^n : \langle s, x \rangle < r\}$ for an open half space
If the space is two dimensional then the half space is known as a half plane. The half space that is in
a one dimensional space is known as a ray.
It can be defined by a linear inequality that is derived from the linear equation which in turn specifies the
defining hyper plane. A linear inequality that is strict will specify an open half-space,
$a_1 x_1 + a_2 x_2 + \dots + a_n x_n > b$
Q7. Write short notes on eigen values.
Answer :
Eigen values are the numerical values that can determine the number of features to be retained. The
concept of eigen values is applicable to square matrices. It is considered a very important topic; for
example, principal component analysis is based on it. Each eigen value has a corresponding
eigen vector. The principal component analysis of a system of variables is performed by computing the
eigen values of the dispersion matrix or correlation matrix of the variables. Each principal component is
the linear combination of the variables given by the items of the corresponding eigen vector.
The eigen values define the proportion of variance accounted for by every eigen vector that is derived
from the transformation of the original set of variables to orthogonal variables. This leads to a decrease in the number of
variables needed to account for the majority of the total variance among the original variables. If each original
variable contributes in the direction of the eigen vectors then the important variables can be summarized in a smaller
number of vectors.
Q8. Define eigen vector.
Answer : Model Paper
Eigen Vector
In linear algebra, if T is a linear transformation from a vector space V over a field F into itself and v
is a non-zero vector in V, then v is called an eigen vector of T if T(v) is a scalar multiple of v, i.e., $T(v) = \lambda v$,
where λ is a scalar in F known as the eigen value associated with the eigen vector v. Eigen vectors are associated
with linear models, which are used in engineering as a first approximation. These vectors are not rotated
by the transformation matrix. The concept is applied in machine learning algorithms that are mostly useful in
handling large data sets, and it is considered the backbone of such algorithms.
For example, multiply a 2-dimensional vector with a 2 × 2 matrix,
$\begin{bmatrix} 1 & 2 \\ 0 & 3 \end{bmatrix} \begin{bmatrix} 1 \\ 2 \end{bmatrix} = \begin{bmatrix} 5 \\ 6 \end{bmatrix}$
This particular operation on a vector is called a linear transformation. The column matrix here represents a
vector whose input and output directions are not the same. The vectors whose direction does not change
after applying the linear transformation with the matrix are called eigen vectors.
This concept is applicable only to square matrices.
PART-B
ESSAY QUESTIONS WITH SOLUTIONS
1.1 INTRODUCTION TO DATA SCIENCE
Q9. Give a brief introduction on data science. Illustrate various phases of it.
Answer :
Data science is the combination of several tools, algorithms and machine learning principles whose
aim is to discover the hidden patterns in raw data. For example, consider the role of statisticians.
[Figure: roles of the analyst and the data scientist]
In the above figure, the role of the data analyst is to illustrate the processing history of the data. The
data scientist is responsible for exploratory analysis to discover insights and to identify the reasons
behind events and anticipate future events by using advanced machine learning algorithms. This is done by observing the
data from various angles. Hence, data science is used to make decisions and predictions through predictive
causal analytics, prescriptive analytics and machine learning.
Phases of Data Science
Various phases in the data science life cycle are as follows,
[Figure: the phases of the data science life cycle]
Phase 1 : Discovery
The specifications, requirements, priorities and required budget must be
understood before starting any project. In addition to this, the business problem must be framed and initial
hypotheses must be formulated to test.
Phase 2 : Data Preparation
In this phase, an analytical sandbox is required to perform analytics for the complete time period of the project. The data
must be explored, preprocessed and conditioned before modeling. Then ETLT (Extract, Transform,
Load and Transform) must be performed to get the data into the sandbox. The statistical analysis flows as
follows,
Preparation of analytics sandbox
↓
Performing ETLT
↓
Data conditioning
↓
Survey and visualize
Here, R can be used for cleaning the data, transforming it and then visualizing it. With this, the outliers can
be determined and relationships between the variables can be established. After this, the exploratory analytics
is performed on it.
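The cleaning, transforming and outlier-detection steps mentioned above might look as follows in base R. This is only a sketch: the data frame `raw` and its values are made up for illustration, and `boxplot.stats()` is one of several possible ways to flag outliers.

```r
# Hypothetical raw data with one missing value and one extreme reading
raw <- data.frame(id    = 1:8,
                  value = c(10, 12, 11, NA, 13, 12, 95, 11))

# Cleaning: drop the rows whose value is missing
clean <- raw[!is.na(raw$value), ]

# Transforming: standardise the variable (mean 0, standard deviation 1)
clean$scaled <- as.numeric(scale(clean$value))

# Outliers: boxplot.stats() returns the points lying beyond the whiskers
outliers <- boxplot.stats(clean$value)$out

# Visualising (run in an interactive session):
# boxplot(clean$value, main = "value")
```

With these made-up numbers the reading 95 is the only value flagged as an outlier.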
Phase 3 : Model Planning
In this phase, the methods to draw the relationships between the variables are determined. With this, a
base for the algorithms is set for implementing them in the next phase. EDA (Exploratory Data Analytics) is applied
using various statistical formulas and visualization tools.
Some of the model planning tools are as follows,
1. SQL Analysis Services : It is used to perform in-database analytics through common data mining
functions and basic predictive models.
2. R : It contains a complete set of modeling capabilities and provides a good environment for building
interpretive models.
3. SAS/ACCESS : It is used for accessing data and for creating repeatable and reusable model flows.
Phase 4 : Model Building
In this phase, the data sets are built for training and testing purposes.
In addition to this, various learning techniques such as association, classification and clustering are
analyzed for building the model. Various commonly used tools for model building are as follows,
(i) SAS Enterprise Miner
(ii) WEKA
(iii) SPSS Modeler
(iv) Matlab
(v) Alpine Miner
(vi) Statistica
Phase 5 : Operationalize
In this phase, the final reports, briefings, code and technical documents are delivered.
Phase 6 : Communicate Results
In this phase, the outcome is compared with the goal set in the first phase. The key findings are identified
and communicated to the stakeholders to determine whether the project results are a success or a failure based on
the criteria developed in phase 1.
1.2 LINEAR ALGEBRA FOR DATA SCIENCE
Q10. Explain about linear algebra and its role in data science.
Answer : Model Paper, Q14(a)
Linear algebra is a branch of mathematics concerned with linear equations and linear
functions along with their representations through matrices and vector spaces. It is universal to almost all the areas
of mathematics such as geometry and functional analysis. These concepts are basic requirements for
working with linear algebra. Linear algebra is used in data science in the form of scalars, vectors, matrices and tensors.
1. Scalar : It is a single number.
Example : 1
2. Vector : It is an array of numbers.
Example : $\begin{bmatrix} 1 \\ 2 \end{bmatrix}$
3. Matrix : It is a 2-D array.
4. Tensor : It is an n-dimensional array with n > 2.
Example : [two 2 × 2 matrices stacked into a single array]
Various operations that can be performed on them are transposition, vector and matrix multiplication,
identity and inverse matrices, etc.
Linear Algebraic Operations on Vectors and Matrices: Linear algebra is a branch of mathematics that
deals with vector spaces and their mappings. R programming supports linear algebraic operations like
addition and multiplication on vectors and matrices.
Multiplication operation on Vectors: The product between two vectors can be calculated in the following
manner,
[R console screenshot: product of two vectors]
In order to compute the dot product (inner product) between the two vectors, a predefined function called
crossprod( ) is used.
Syntax
crossprod( )
Example
[R console screenshot: crossprod( ) applied to two vectors]
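As an illustration of the vector products described above, the following sketch uses two made-up vectors `u` and `v`. Note that `*` multiplies element by element, while `crossprod()` returns the inner product as a 1 × 1 matrix.

```r
u <- c(1, 2, 3)
v <- c(4, 5, 6)

elementwise <- u * v            # element-by-element product: 4 10 18
inner       <- crossprod(u, v)  # t(u) %*% v, returned as a 1 x 1 matrix
inner_value <- sum(u * v)       # the same inner product as a plain number

as.numeric(inner)               # 32
```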
Multiplication operation on Matrices: In R programming, matrices are multiplied with the %*% operator.
Apart from this, the various algebra functions that are available in R programming are as follows.
+ t( ): This function computes the transpose of a matrix.
+ qr( ): This function finds the QR decomposition.
+ chol( ): This function computes the Cholesky decomposition.
+ det( ): This function calculates the determinant of a given matrix.
+ eigen( ): This function computes the eigen values and eigen vectors.
+ diag( ): This function extracts the diagonal of a square matrix.
+ solve( ): This function solves a system of linear equations.
+ sweep( ): This function applies an operation, such as subtracting a summary statistic, across the rows or columns of an array.
Among these functions diag( ), solve( ) and sweep( ) functions are the most important predefined
functions. The functionality of these functions is as follows,
1. diag( ): This function extracts the diagonal of a square matrix. It takes two types of arguments (either a
matrix or a vector). If a matrix is taken as argument then the resultant output will be a vector, whereas if
a vector is taken as argument then the resultant output will be a diagonal matrix.
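Both behaviours of diag( ) can be seen in a short sketch. The matrix and vector below are arbitrary illustrative values.

```r
M <- matrix(1:9, nrow = 3)   # 3 x 3 matrix, filled column by column
d <- diag(M)                 # matrix argument -> vector of diagonal entries: 1 5 9

D <- diag(c(2, 4))           # vector argument -> 2 x 2 matrix with 2 and 4 on the diagonal
```

A related case worth knowing is that a single number `n` given to `diag()` produces the n × n identity matrix rather than a 1 × 1 matrix.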
2. solve( ): This function solves a system of linear equations and also calculates the inverse of a matrix.
Example
Consider a system of linear equations written in matrix form as Ax = b.
[R console screenshot: the equations, their matrix representation and the solve( ) call]
Called as solve(A, b), the function returns the solution x of the system. Called with the matrix alone,
solve(A) returns the inverse of A, and the solution can then be obtained as the product of this inverse and b.
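The two uses of solve( ) can be compared side by side. The system below (2x1 + x2 = 5, x1 + 3x2 = 10) is a hypothetical example, not the one from the original screenshot.

```r
A <- matrix(c(2, 1,
              1, 3), nrow = 2, byrow = TRUE)
b <- c(5, 10)

x     <- solve(A, b)   # solves A x = b directly
A_inv <- solve(A)      # one argument: returns the inverse of A
x_inv <- A_inv %*% b   # same solution obtained through the inverse

x                      # 1 3
```

Solving directly with `solve(A, b)` is generally preferred over forming the inverse first, since it needs less work and is numerically more stable.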
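3. sweep( ): This function is used for performing an operation on the numerical values of a matrix row-wise
or column-wise, for example subtracting the column means from every column.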
Example
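A minimal sketch of sweep( ) on an illustrative 2 × 3 matrix is given here. The second argument selects the margin (1 for rows, 2 for columns), the third is the statistic to sweep out, and the default operation is subtraction.

```r
M <- matrix(c(1, 2, 3,
              4, 5, 6), nrow = 2, byrow = TRUE)

# Subtract each column's mean from that column (centering)
centered <- sweep(M, 2, colMeans(M))

# Divide each row by its row sum, so that every row adds up to 1
proportions <- sweep(M, 1, rowSums(M), "/")
```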
1.3 LINEAR EQUATIONS
Q11. Discuss about linear equations.
Answer : Model Paper
Linear Equation
An equation in the variable x that can be written in the following form is called a linear equation.
$y = mx + b$
$Ax + By = C$
Here, m, b, A, B and C are real numbers. A and B must not both be zero. The graph of any linear equation
would be a straight line.
Linear equations can be solved by following the below steps,
1. Initially, all the fractions in the equation must be cleared by multiplying both sides of the equation
by the LCD (Lowest Common Denominator) of the fractions.
2. Every side of the equation must be simplified completely by using the distributive property in order to remove the
parentheses and to combine like terms.
3. Now isolate the variable terms on one side of the equation and the numbers on the other side of the equation
through the addition property of equality.
4. Generate an equation in which the coefficient of the variable is 1 by using the multiplication property of equality.
5. Finally check the answer in the original equation.
Let the linear equation be Ax = b
Where,
+ A is an m × n matrix of coefficients for m equations and n unknowns.
+ x is an n × 1 vector of unknowns $x_1, x_2, \dots, x_n$.
+ b is an m × 1 vector of constants, i.e., the right hand sides of the equations.
Conditions for the solutions are as follows,
(i) The equations are consistent if r(A | b) = r(A)
+ The solution is unique if r(A | b) = r(A) = n
+ The solution is underdetermined if r(A | b) = r(A) < n
(ii) The equations are inconsistent if r(A | b) > r(A)
To demonstrate the ranks, use
c(R(A), R(cbind(A, b))) and to test for consistency, use all.equal(R(A), R(cbind(A, b))).
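The rank conditions above can also be checked with base R alone. In the sketch below, `rank_of()` is a hypothetical helper built on `qr()` that stands in for the `R()` function used in the text, which is not part of base R and appears to come from an add-on package. The second system (x1 + x2 = 1, x1 + x2 = 2) is a made-up inconsistent example.

```r
# Hypothetical helper: matrix rank via the QR decomposition
rank_of <- function(M) qr(M)$rank

# Consistent system with a unique solution
A <- matrix(c(1, 2, -1, 2), 2, 2)
b <- c(2, 1)
r_A  <- rank_of(A)
r_Ab <- rank_of(cbind(A, b))
consistent      <- (r_A == r_Ab)                 # r(A | b) = r(A)
unique_solution <- consistent && r_A == ncol(A)  # ... = n

# Inconsistent system: the two equations contradict each other
A2 <- matrix(c(1, 1, 1, 1), 2, 2)
b2 <- c(1, 2)
inconsistent <- rank_of(cbind(A2, b2)) > rank_of(A2)   # r(A | b) > r(A)
```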
Equations in Two Unknowns
Every equation in two unknowns corresponds to a line in 2D space. If all the lines intersect
at one point then the equations have a unique solution. The functions showEqn( ), R( ), plotEqn( ) and
Solve( ) used below are provided by the matlib package.
Two Consistent Equations:
A <- matrix(c(1, 2, -1, 2), 2, 2)
b <- c(2, 1)
showEqn(A, b)
## 1*x1 - 1*x2 = 2
## 2*x1 + 2*x2 = 1
c(R(A), R(cbind(A, b)))   # show ranks
## [1] 2 2
all.equal(R(A), R(cbind(A, b)))   # consistent?
## [1] TRUE
Plot the Equations
Equations can be plotted as shown below.
plotEqn(A, b)
## x[1] - 1*x[2] = 2
## 2*x[1] + 2*x[2] = 1
The solution can be more comprehensibly determined by the Solve( ) function.
Solve(A, b, fractions = TRUE)
## x1 = 5/4
## x2 = -3/4
In a similar way, three consistent equations, three inconsistent equations and equations in three unknowns
can also be handled.
1.4 DISTANCE
Q12. Write in detail about distance.
Answer : Model Paper4, Q14(b)
Distance
distance( ) is a function that calculates various dissimilarity or distance metrics. It is written so that new metrics
can be added. It is not an efficient choice for large matrices but
purely a choice for understandability and extensibility.
Syntax
distance(x, method = "euclidean", sprange = NULL, spweight = NULL, icov)
The function dist( ) can also be used to compute and return the distance matrix, using a specified distance measure for
computing the distances between the rows of a data matrix.
Syntax
dist(x, method = "euclidean", diag = FALSE, upper = FALSE, p = 2)
Arguments
x : It represents a numeric matrix, data frame or 'dist' object with rows as samples and columns as
variables. The distance will be computed for every pair of rows.
method : It selects one of the various dissimilarity metrics such as euclidean,
bray-curtis, manhattan, mahalanobis, jaccard, difference, sorensen, gower, modgower10 and
modgower2.
sprange : The gower dissimilarities allow division by the species range. If its value is NULL
then no range is used; if its value is a vector then it is used to standardize
the dissimilarities.
diag : It is a logical value that represents whether the diagonal of the distance matrix is to be printed by print.dist.
spweight : Weighting is allowed by the euclidean, manhattan and gower dissimilarities. If its value is
NULL then no weighting is used; if its value is "absence" then w = 0 if both species are absent
and 1 otherwise, so that joint absences are deleted.
upper : It is a logical value that represents whether the upper triangle of the distance matrix is to be printed
by print.dist.
icov : It is an optional covariance matrix that is used if method = "mahalanobis". If it is provided directly,
the distance can be calculated for a subset of the full dataset.
p : It indicates the power of the Minkowski distance.
Value
It returns a lower-triangular distance matrix as an object of class "dist". This object has the following
attributes,
Size : It is an integer that indicates the number of observations in the dataset.
Labels : It is an optional value that consists of the labels, if any, of the observations of the dataset.
Diag, Upper : These are logical values related to the arguments diag and upper that depict how the object should
be displayed.
call : It is an optional value that holds the call used to create the object.
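The base R function dist( ) described above can be tried on a tiny data matrix. The three points below are made-up values chosen so that the distances are easy to verify by hand. The distance( ) function with the sprange, spweight and icov arguments is not part of base R, so it is not used in this sketch.

```r
# Three observations (rows) in two variables (columns)
x <- rbind(a = c(0, 0),
           b = c(3, 4),
           c = c(6, 8))

d_euc <- dist(x)                                # Euclidean by default
d_man <- dist(x, method = "manhattan")
d_min <- dist(x, method = "minkowski", p = 3)   # p is the Minkowski power

as.matrix(d_euc)        # full symmetric matrix form
attr(d_euc, "Size")     # number of observations
attr(d_euc, "Labels")   # row labels carried over from x
```

The Euclidean distances are 5 (a to b), 10 (a to c) and 5 (b to c), while the Manhattan distances are 7, 14 and 7.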
1.5 HYPER PLANES, HALF SPACES
Q13. Explain about hyper planes.
Answer :
Hyper Planes
A hyper plane is a geometric entity whose dimension is one less than that of its ambient
space. For example, for a 3D space the hyper plane is a 2D plane, for a 2D space the hyper plane is a 1D line, and so
on. A hyper plane can be defined by the below equation,
$X^T n + b = 0$
The above equation can be expanded for n dimensions,
$x_1 n_1 + x_2 n_2 + x_3 n_3 + \dots + x_n n_n + b = 0$
For 2 dimensions the equation is,
$x_1 n_1 + x_2 n_2 + b = 0$
Consider the hyper plane of the below form,
$X^T n = 0$
i.e., if the plane goes through the origin, the hyper plane also becomes a subspace.
The function hyperplane computes a (k − 1) dimensional hyper plane that passes through k given
points in k-dimensional space.
The general format of this function is hyperplane(X),
where X indicates a numeric k × k matrix with the k data points as rows.
A (k − 1) dimensional hyper plane in $\mathbb{R}^k$ contains the points x that satisfy
$d \cdot x + c = 0$
where d is a k-vector and c is a scalar.
The function returns the (k + 1)-vector (d, c).
It is normalized in such a way that the length of d is equal to (k − 1)! times the (k − 1) dimensional
volume of the simplex formed by the points on the plane. If the value of k is 3 then the simplex is a triangle.
Therefore the function can also be used to compute volumes of simplices. The direction of d relative to the origin
depends on the order of the data points within the matrix X. If the points do not define a (k − 1) dimensional
hyper plane then a vector of zeros is returned.
Example
X <- rbind(c(4, 5), c(8, 2))
hyperplane(X)
X <- rbind(c(8, 2), c(4, 5))
hyperplane(X)
X <- diag(rep(1, 3))
hyperplane(X)
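For readers who do not have the hyperplane( ) function available, the idea can be sketched in base R. The helper `hyperplane_sketch()` below is a hypothetical stand-in written for illustration: it finds (d, c) with d·x + c = 0 for every data point by taking the null space of the matrix [X 1] from a singular value decomposition. Unlike the function described above, it scales d to unit length and does not handle degenerate point sets.

```r
# Hypothetical helper: hyper plane through the k rows of a k x k matrix X
hyperplane_sketch <- function(X) {
  M <- cbind(X, 1)                  # each row is (x, 1), so M %*% (d, c) = d.x + c
  v <- svd(M, nv = ncol(M))$v       # all k + 1 right singular vectors
  h <- v[, ncol(M)]                 # last one spans the null space when rank(M) = k
  h / sqrt(sum(h[-length(h)]^2))    # rescale so that d has length 1
}

X2 <- rbind(c(4, 5), c(8, 2))       # two points in 2D define a line
h2 <- hyperplane_sketch(X2)
cbind(X2, 1) %*% h2                 # approximately zero for both points

X3 <- diag(rep(1, 3))               # three unit points in 3D
h3 <- hyperplane_sketch(X3)         # proportional to (1, 1, 1, -1)
```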
Q14. Discuss in detail about half spaces.
Answer : Model Paper, 11(a)
Half Space
The half spaces of $\mathbb{R}^n$ are the sets associated with a pair $(s, r) \in \mathbb{R}^n \times \mathbb{R}$, $s \neq 0$, and they can be
defined by,
$\{x \in \mathbb{R}^n : \langle s, x \rangle \le r\}$ for a closed half space
$\{x \in \mathbb{R}^n : \langle s, x \rangle < r\}$ for an open half space
If the space is two dimensional then the half space is known as a half plane. The half space that is in
a one dimensional space is known as a ray.
It can be defined by a linear inequality that is derived from the linear equation which
specifies the defining hyper plane.
A linear inequality that is strict will specify an open half-space,
$a_1 x_1 + a_2 x_2 + \dots + a_n x_n > b$
A linear inequality that is not strict specifies a closed half-space,
$a_1 x_1 + a_2 x_2 + \dots + a_n x_n \ge b$
Consider the below given two dimensional space.
[Figure: a line dividing the two dimensional plane into a +ve half and a -ve half of the plane]
An equation in two dimensions is a line, which is a hyper plane. So the equation of the line
can be written as,
$X^T n + b = 0 \;\; \forall X \in \text{line}$
In two dimensions the line can be written as,
$x_1 n_1 + x_2 n_2 + b = 0$
This line can be extended on both sides. If this is done, the two dimensional space is divided into
two spaces.
One space is at one side of the line, i.e., at the right side, and the other space is at the other side of the line,
i.e., at the left side. These two spaces are called half spaces. For example, suppose there are points in one half space
and points in the other half space. Is there any characteristic that can separate them? A solution for this would be
to perform a certain computation for all the points in one half space and obtain some result, repeat the same
procedure on the other side, and use the results to make the decisions. These types of situations are mostly
observed in classification problems. Consider a binary classification problem, where the aim is to know in which half space
a point lies. Now consider three points $X_1$, $X_2$ and $X_3$ from the above figure and distinguish their
positions. In the equation $X^T n + b = 0$, n is said to be the normal to the line.
In the above figure, n is considered as the normal in the equation $X^T n + b = 0$. If this equation is multiplied
by −1 then the normal is defined in the opposite direction of n; otherwise the normal is defined on the side
of n. To know where the points $X_1$, $X_2$ and $X_3$ lie, the expression $X^T n + b$ must be evaluated for each of them,
$X_1^T n + b$
$X_2^T n + b$
$X_3^T n + b$
For the point $X_1$, it is clear that the point lies on the line, so $X_1^T n + b$ evaluates to 0. Now consider
the expression $X_2^T n + b$. In the above figure, take the two points $X_1$ and $X_2$. Here $Y'$ is a vector from $X_1$ to $X_2$. From
vector addition, it can be written as,
$X_2 = X_1 + Y'$
This must be substituted in the expression,
$X_2^T n + b = (X_1 + Y')^T n + b$
$= X_1^T n + b + Y'^T n = Y'^T n$
If the point lies in the I or IV quadrant with respect to the normal, then cos θ is positive. In
the dot product $a \cdot b = |a||b| \cos\theta$, θ is the angle between the two vectors. For any point on this side, the angle is
between 0° to 90° or
270° to 360°, so $Y'^T n$ evaluates to a positive value since cos θ is positive.
$X_2^T n + b = X_1^T n + b + Y'^T n > 0$
If the points are at the opposite side, the angle is between 90° to 180° or 180° to 270°.
The cos θ for angles between 90° and 270° would be a negative value. Therefore for any point
on this side of the line, i.e., in the other half space, the computation $X^T n + b$ would be less than 0.
$X_3^T n + b < 0$
Example
Consider a 2D geometry with $n = \begin{bmatrix} 1 \\ 3 \end{bmatrix}$ and b = 4.
$X^T n + b = 0$
$\begin{bmatrix} x_1 & x_2 \end{bmatrix} \begin{bmatrix} 1 \\ 3 \end{bmatrix} + 4 = 0$
$x_1 + 3x_2 + 4 = 0$
Consider three points, (−1, −1), (1, −1) and (1, −2). Substitute these points in the above equation.
1. (−1, −1)
$x_1 + 3x_2 + 4 = -1 - 3 + 4 = 0$
The point (−1, −1) is said to be on the line.
2. (1, −1)
$x_1 + 3x_2 + 4 = 1 - 3 + 4 = 2 > 0$
The point (1, −1) is said to be in the positive half space.
3. (1, −2)
$x_1 + 3x_2 + 4 = 1 - 6 + 4 = -1 < 0$
The point (1, −2) is said to be in the negative half space.
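The same classification can be reproduced in R. The sketch below uses the normal n = (1, 3), the offset b = 4 and the three points from the example; the variable names are chosen here for illustration.

```r
n <- c(1, 3)                      # normal vector of the line
b <- 4                            # offset
pts <- rbind(c(-1, -1),
             c( 1, -1),
             c( 1, -2))           # one point per row

score <- as.vector(pts %*% n + b) # evaluates x1 + 3*x2 + 4 for every point

side <- ifelse(score > 0, "positive half space",
        ifelse(score < 0, "negative half space", "on the line"))

data.frame(x1 = pts[, 1], x2 = pts[, 2], score = score, side = side)
```

This is the computation a linear binary classifier performs: the sign of the score decides the class of a point.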
1.6 EIGEN VALUES, EIGEN VECTORS
Q15. Write about eigen values.
Answer : Model Paper-III, Q14(b)
Eigen Values
Eigen values are the numerical values that can determine the number of features to be retained. The
concept of eigen values is applicable to square matrices. It is considered a very important topic; for
example, principal component analysis is based on it. Each eigen value has a corresponding
eigen vector. The principal component analysis of a system of variables is performed by computing the
eigen values of the dispersion matrix or correlation matrix of the variables. Each principal component is considered
to be the linear combination of the variables given by the items of the corresponding eigen vector.
The eigen values define the proportion of variance accounted for by every eigen vector that is derived
from the transformation of the original set of variables to orthogonal variables. This leads to a decrease in the number
of variables needed to account for the majority of the total variance among the original variables. If each
original variable contributes in the direction of the eigen vectors then the important variables can be summarized
in a smaller number of vectors.
Consider the below mathematical formula,
$Ax = \lambda x$
Here, the constant λ (positive) represents the amount of stretch or shrinkage that the vector x goes through
in the x direction.
The vectors x are called eigen vectors and their corresponding λ are called eigen values.
For a matrix, the eigen values and eigen vectors can be computed as follows,
1. The eigen values can be computed as follows,
$Ax = \lambda x$, where A is n × n and x is n × 1
$Ax - \lambda I x = 0$
$(A - \lambda I)x = 0$
Therefore the eigen values can be determined by using the below equation,
$|A - \lambda I| = 0$
2. By substituting the eigen values in the original equation, the solution for the eigen vector x can be computed.
Example
Consider the below matrix,
$A = \begin{bmatrix} 8 & 7 \\ 2 & 3 \end{bmatrix}$
$\begin{bmatrix} 8 & 7 \\ 2 & 3 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \lambda \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}$
$|A - \lambda I| = \begin{vmatrix} 8 - \lambda & 7 \\ 2 & 3 - \lambda \end{vmatrix} = 0$
$(8 - \lambda)(3 - \lambda) - 14 = 0$
$\lambda^2 - 11\lambda + 10 = 0$
$\lambda = (10, 1)$
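R code is as follows,
> A <- matrix(c(8, 7, 2, 3), 2, 2, byrow = T)
Therefore, there are two eigen values, λ = 10 and λ = 1.
To compute the eigen vectors, consider the below process.
If λ = 1,
$\begin{bmatrix} 8 & 7 \\ 2 & 3 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}$
$8x_1 + 7x_2 = x_1$
$2x_1 + 3x_2 = x_2$
Therefore the eigen vector corresponding to λ = 1 satisfies,
$x_1 + x_2 = 0$, for example $x = \begin{bmatrix} 1 \\ -1 \end{bmatrix}$
If λ = 10,
$\begin{bmatrix} 8 & 7 \\ 2 & 3 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 10x_1 \\ 10x_2 \end{bmatrix}$
$8x_1 + 7x_2 = 10x_1$
$2x_1 + 3x_2 = 10x_2$
Both equations reduce to $2x_1 = 7x_2$, for example $x = \begin{bmatrix} 7 \\ 2 \end{bmatrix}$
> A <- matrix(c(8, 7, 2, 3), 2, 2, byrow = T)
> ev <- eigen(A)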
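The hand computation above can be cross-checked in R. This sketch verifies that every pair returned by eigen( ) satisfies Ax = λx. Note that eigen( ) scales each eigen vector to unit length, so the columns are multiples of the hand-derived vectors rather than identical to them.

```r
A  <- matrix(c(8, 7, 2, 3), 2, 2, byrow = TRUE)
ev <- eigen(A)

lambda <- ev$values      # eigen values in decreasing order
V      <- ev$vectors     # one unit-length eigen vector per column

# A %*% V should equal V with each column scaled by its eigen value
residual <- A %*% V - V %*% diag(lambda)

V[1, 1] / V[2, 1]        # ratio x1 / x2 for lambda = 10, i.e. 7 / 2
V[1, 2] / V[2, 2]        # ratio x1 / x2 for lambda = 1, i.e. -1
```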
Relationship between Eigen Values and Eigen Vectors
The eigen values can be complex numbers even for real matrices. If the eigen values become complex
then the eigen vectors also become complex.
If the matrix is symmetric, i.e., if it satisfies
$A = A^T$
then it has the following properties,
(i) If the matrix is symmetric then the eigen values will always be real.
(ii) The eigen vectors of a symmetric matrix are also real.
(iii) For a symmetric matrix with eigen values $\lambda_1, \lambda_2, \dots, \lambda_n$, there are n linearly independent
eigen vectors $v_1, v_2, \dots, v_n$.
Q16. What is an eigen vector? Explain.
Answer : Model Paper
Eigen Vector
In linear algebra, if T is a linear transformation from a vector space V over a field F into itself and v
is a non-zero vector in V, then v is called an eigen vector of T if T(v) is a scalar multiple of v, i.e., $T(v) = \lambda v$,
where λ is a scalar in F known as the eigen value associated with the eigen vector v. Eigen vectors are associated
with linear models, which are used in engineering as a first approximation. These vectors are not rotated
by the transformation matrix. The concept is applied in machine learning algorithms that are mostly useful in
handling large data sets, and it is considered the backbone of such algorithms.
For example, multiply a 2-dimensional vector with a 2 × 2 matrix,
$\begin{bmatrix} 1 & 2 \\ 0 & 3 \end{bmatrix} \begin{bmatrix} 1 \\ 2 \end{bmatrix} = \begin{bmatrix} 5 \\ 6 \end{bmatrix}$
This particular operation on a vector is called a linear transformation. The column matrix here represents a
vector whose input and output directions are not the same. The vectors whose direction does not change
after applying the linear transformation with the matrix are called eigen vectors. This concept is applicable only to
square matrices.
Finding Eigen Vector of a Matrix
Consider a matrix M and an eigen vector 'e' corresponding to the matrix.
The direction of 'e' remains unchanged when it is multiplied with the matrix; only its magnitude changes.
Consider the below equation,
Me = ce
(M − C)e = 0
In the term (M − C), C indicates an identity matrix of order equal to that of 'M' multiplied by a scalar
'c'. There are two unknowns, 'e' and 'c', and one equation. This equation is trivially satisfied by making the vector
'e' a zero vector, but an eigen vector must be non-zero. Then there is only a single choice, that (M − C) is a singular matrix. A singular matrix has the property
that its determinant is equal to 0. This property can be used to find the value of 'c'.
Det(M − C) = 0
This produces an equation in 'c' whose order depends on the order of the matrix M. This equation needs to be
solved. If the solutions are 'c1', 'c2' and so on, then place 'c1' in the equation and find the vector 'e1'
corresponding to 'c1'. The vector 'e1' is an eigen vector of M. This procedure must be repeated with 'c2',
'c3' and so on.
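The procedure above can be followed step by step for a 2 × 2 matrix, where Det(M − C) = 0 is the quadratic c² − trace(M)·c + det(M) = 0. The matrix `M` below is a hypothetical example, and the closed-form eigen vector (M[1,2], c − M[1,1]) is a shortcut that only applies to 2 × 2 matrices with a non-zero entry M[1,2].

```r
M <- matrix(c(2, 1,
              1, 2), 2, 2, byrow = TRUE)   # hypothetical 2 x 2 matrix

# Det(M - c I) = c^2 - trace(M) * c + det(M); polyroot() takes the
# coefficients in increasing order of power
coefs <- c(det(M), -sum(diag(M)), 1)
cs    <- sort(Re(polyroot(coefs)))         # solutions c1, c2

# (M - c I) must be singular for every solution
dets <- sapply(cs, function(c_val) det(M - c_val * diag(2)))

# For each solution, find a vector e with (M - c I) e = 0
eig_vecs <- sapply(cs, function(c_val) c(M[1, 2], c_val - M[1, 1]))

# Check M e = c e for the first solution
M %*% eig_vecs[, 1] - cs[1] * eig_vecs[, 1]
```

For larger matrices the polynomial is rarely solved by hand; a routine such as eigen( ) is used instead.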
Example
[R console output: a 3 × 3 matrix M is built with matrix(..., byrow = T), x <- eigen(M) is called, and x$values (147.737876, 5.317459, -2.055095) and x$vectors are printed]
Q17. Illustrate the usage of eigen vectors in data science.
Answer :
The concept of eigen vectors is applied in a machine learning algorithm, i.e., principal component analysis (PCA).
If there is data with a huge set of features, it has high dimensionality. There might be redundant features in the data.
These features reduce the efficiency and increase the disk space used. PCA crops such features.
The eigen vectors help in defining these features.
Consider the PCA algorithm for 'n' dimensional data that is to be reduced to 'k' dimensions. The steps
to perform this are as follows.
Step 1
Initially the data is mean normalized and feature scaled.
Step 2
The covariance matrix of the data set is computed.
To reduce the number of features (dimensions), some of the features must be dropped. But this leads to
loss of information. So, the loss of information needs to be minimized and the maximum variance maintained. For
this, the directions of maximum variance must be determined. This is done in the next step.
Step 3
In this step the eigen vectors of the covariance matrix are determined. Since there is data in 'n' dimensions,
'n' eigen vectors corresponding to 'n' eigen values are determined.
Step 4
Select the 'k' eigen vectors corresponding to the 'k' largest eigen values and then build a matrix in which every
eigen vector represents a column. This matrix is called U.
In order to reduce a data point 'a' in the data set to 'k' dimensions, the transpose of the matrix U must
be determined and then multiplied with the vector 'a'. Then the desired vector in 'k' dimensions is obtained.
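The four steps can be sketched in base R as follows. The data are simulated purely for illustration: `x3` is deliberately constructed as a near copy of `x1`, so that it plays the role of a redundant feature, and k = 2 is an arbitrary choice.

```r
set.seed(1)
X <- cbind(x1 = rnorm(50), x2 = rnorm(50))
X <- cbind(X, x3 = X[, "x1"] + 0.1 * rnorm(50))   # redundant feature

Z <- scale(X)            # Step 1: mean normalisation and feature scaling
C <- cov(Z)              # Step 2: covariance matrix
e <- eigen(C)            # Step 3: n eigen values and n eigen vectors

k <- 2
U <- e$vectors[, 1:k]    # Step 4: eigen vectors of the k largest eigen values

a   <- Z[1, ]            # one data point in n = 3 dimensions
a_k <- t(U) %*% a        # the same point reduced to k = 2 dimensions

reduced   <- Z %*% U     # all data points reduced at once
explained <- sum(e$values[1:k]) / sum(e$values)   # share of variance retained
```

Because `x3` carries almost no information beyond `x1`, two components retain nearly all of the variance here.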
STATISTICAL MODELING
PART-A
SHORT QUESTIONS WITH SOLUTIONS
Q1. Define statistical modeling.
Answer : Model Paper, Q3
Statistical Modeling
Statistical modeling can be defined as the formalization of the relationships between variables in
the form of equations. It is essentially about identifying the variables and explaining how they are related
with each other. The relationship can be in the form of mathematical equations. A variable can be an
attribute such as the height, weight or age of a person.
The variables might not be related exactly but can be stochastically related. Statistical modeling consists of
analyzing data and applying it to various circumstances.
Statistical modeling introduces and illuminates the statistical reasoning that is used in
modern research throughout the natural sciences as well as medicine, commerce, social sciences, government etc. It
also focuses on the usage of models to untangle and quantify the variation in observed data.
Q2. What is a random variable?
Answer : Model Paper, Q3
Random Variables
A random variable is a variable that takes a particular value, i.e., a numerical value, with definite probability.
It is obtained from the result of a random experiment. The random variables are denoted by capital letters and
the corresponding values are denoted by small letters.
Example
If a fair die is rolled and 'X' denotes the number obtained then 'X' is called a random variable.
Thus 'X' can take any one of the particular values 1, 2, 3, 4, 5 or 6, each with probability 1/6. These
values are tabulated as follows.
x        : 1    2    3    4    5    6
P(X = x) : 1/6  1/6  1/6  1/6  1/6  1/6
All the possible outcomes of a random experiment together are called the "Sample Space".
The sum of all the probabilities of the sample space is always 1.
Random variables are of two types, they are,
(i) Discrete random variable.
(ii) Continuous random variable.
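The die example can be written down and simulated in R. This is a minimal sketch; the seed and the number of simulated rolls (6000) are arbitrary choices.

```r
x <- 1:6              # values the random variable X can take
p <- rep(1 / 6, 6)    # probability of each value for a fair die

total_prob <- sum(p)      # probabilities over the sample space add up to 1
mean_x     <- sum(x * p)  # expected value of X

# Simulate the random experiment and compare relative frequencies with 1/6
set.seed(42)
rolls <- sample(x, size = 6000, replace = TRUE, prob = p)
freq  <- table(rolls) / length(rolls)
```

The relative frequencies in `freq` settle close to 1/6 as the number of rolls grows.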
Q3. Write in short about hypothesis testing.
Answer : Model Paper
Hypothesis Testing
A statistical hypothesis can be defined as an assumption with respect to a population parameter. It may
or may not be true. Hypothesis testing is a set of formal procedures that is used by statisticians for accepting or rejecting a
statistical hypothesis. In fact it is a process of validating the hypothesis that is made by the researcher. To validate
the hypothesis exactly, the complete population would have to be considered.
In practice, this process makes use of random samples from the population. The selection or rejection of the
hypothesis depends on the result of testing over the sample data.
Q4. State the types of errors that occur in hypothesis testing.
Answer : Model Paper
Types of Errors
There are two types of errors that can occur in hypothesis testing.
1. Type I Error
It occurs when the null hypothesis is rejected while it is true. The probability of this error can
be determined through the significance level at which the hypothesis is tested. The significance level is
denoted by the symbol α (alpha).
2. Type II Error
Type II error can be defined as the acceptance of a false null hypothesis H0. The probability of a type II error
when the hypothesis testing is performed is represented by the symbol β (beta). The term called power of test
defines the probability of not committing a type II error, i.e., 1 − β.
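The two error rates can be estimated by simulation. The following is a minimal sketch, assuming a one-sample t-test on normally distributed data; the sample size, the number of repetitions and the true mean of 0.5 under the alternative are made-up values chosen for illustration.

```r
set.seed(42)
alpha <- 0.05   # significance level
n_sim <- 2000   # number of simulated experiments (assumed)
n <- 30         # sample size per experiment (assumed)

# Case 1: H0 (mean = 0) is true, so every rejection is a type I error
p_null <- replicate(n_sim, t.test(rnorm(n, mean = 0), mu = 0)$p.value)
type1_rate <- mean(p_null < alpha)

# Case 2: H0 is false (true mean 0.5), so every non-rejection is a type II error
p_alt <- replicate(n_sim, t.test(rnorm(n, mean = 0.5), mu = 0)$p.value)
type2_rate <- mean(p_alt >= alpha)
power <- 1 - type2_rate

print(c(type1 = type1_rate, type2 = type2_rate, power = power))
```

The estimated type I error rate should lie close to alpha, while the type II error rate depends on the sample size and on how far the true mean is from the hypothesized one.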
Q5. Define p-value.
Answer : Model Paper-I, Q4
p-value
The p-value can be defined as the probability of obtaining a result that is equal to or more extreme than the one
observed from the data when the null hypothesis is true.
Hypothesis testing makes use of the p-value to weigh the strength of the evidence in the
data of the population. The p-value can be computed for the given data through a statistical test. It is
compared with a predetermined value, i.e., alpha; usually the value of alpha will be 0.05. If it is less than alpha
then the null hypothesis is rejected, and if it is more than or equal to alpha then the null hypothesis is not rejected.
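As an illustration of the comparison with alpha, the p-value of a two-sided one-sample t-test can be computed by hand and checked against t.test. The data below are simulated and the hypothesized mean mu0 = 10 is an assumption made only for this sketch.

```r
set.seed(7)
x <- rnorm(25, mean = 10.8, sd = 2)   # simulated sample (assumed data)
mu0 <- 10                             # value of the mean under H0 (assumed)
alpha <- 0.05

# Test statistic and two-sided p-value computed by hand
t_stat <- (mean(x) - mu0) / (sd(x) / sqrt(length(x)))
p_manual <- 2 * pt(-abs(t_stat), df = length(x) - 1)

# The same p-value from the built-in test
p_builtin <- t.test(x, mu = mu0)$p.value

decision <- if (p_manual < alpha) "reject H0" else "do not reject H0"
print(c(p_manual = p_manual, p_builtin = p_builtin))
print(decision)
```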
PART-B
ESSAY QUESTIONS WITH SOLUTIONS
2.1 STATISTICAL MODELING
Q6. Discuss about statistical modeling.
Answer : Model Papers, Q12(a)
Statistical Modeling
Statistical modeling can be defined as the formalization of relationships between the variables in
the form of equations. It is actually about finding out the variables and explaining how they are related
with each other. The relationship can be in the form of mathematical equations. And the variable can be an
attribute such as height, weight or age of a person.
The variables might not be related exactly but can be stochastically related. Statistical modeling
consists of analyzing data and applying it on various circumstances.
Example
The attributes such as height and age are probabilistically distributed among humans. They are
stochastically related, i.e., if a person is of age 35 then this influences the chance of this person being 4 feet
tall, and if a person is of age 15 then this influences the chance of this person being 6 feet tall.
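Model 1
$\text{height}_i = b_0 + b_1 \, \text{age}_i + \varepsilon_i$
Where,
$b_0$ is the intercept,
$b_1$ is the parameter that age is multiplied by to generate a prediction of height,
$\varepsilon_i$ is the error term and
$i$ is the subject.
Model 2
$\text{height}_i = b_0 + b_1 \, \text{age}_i + b_2 \, \text{sex}_i + \varepsilon_i$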
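Model 1 and Model 2 can be fitted in R with lm. The sketch below uses simulated data: the data frame name children, the age range, the height unit (cm) and the coefficients used to generate the data are all assumptions made for illustration, not values from the text.

```r
set.seed(123)
n <- 200
age <- runif(n, min = 5, max = 18)                    # ages in years (assumed range)
sex <- factor(sample(c("F", "M"), n, replace = TRUE))
# Heights in cm generated from made-up coefficients plus a random error term
height <- 80 + 5.5 * age + 4 * (sex == "M") + rnorm(n, sd = 6)
children <- data.frame(height, age, sex)

# Model 1: height_i = b0 + b1 * age_i + error_i
model1 <- lm(height ~ age, data = children)

# Model 2: height_i = b0 + b1 * age_i + b2 * sex_i + error_i
model2 <- lm(height ~ age + sex, data = children)

print(coef(model1))
print(coef(model2))
```

The fitted coefficients are estimates of b0, b1 and b2; the residuals of each fit play the role of the error term.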
Statistical modeling gives the introduction and illuminates the statistical reasoning that is used in
modern research throughout the natural sciences as well as medicine, e-commerce, social sciences, government etc. It
also focuses on the usage of models to untangle and quantify the variation in observed data.
A template for a statistical model would be a linear regression model with independent and homoscedastic
errors.
$y_i = \sum_{j=0}^{p} \beta_j x_{ij} + e_i, \quad i = 1, \dots, n$
Where,
$e_i$ are NID$(0, \sigma^2)$.
In matrix terms this can be written as
$y = X\beta + e$
Where,
$y$ is the response vector, $X$ is the model matrix or design matrix with columns $x_0, x_1, \dots, x_p$, the determining
variables.
More frequently $x_0$ would be a column of ones defining an
intercept term.
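The matrix form can be checked numerically. This is a minimal sketch under assumed values: the predictors x1 and x2, the true coefficients and the error standard deviation are made up, and the least-squares solution of the normal equations is used only to show that it agrees with lm.

```r
set.seed(1)
n <- 50
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- 2 + 1.5 * x1 - 0.7 * x2 + rnorm(n, sd = 0.5)   # made-up coefficients

# Design (model) matrix X: the first column x_0 is a column of ones
X <- model.matrix(~ x1 + x2)
print(head(X, 3))

# Least-squares estimate of beta from the normal equations (X'X) beta = X'y
beta_hat <- solve(t(X) %*% X, t(X) %*% y)

# The same estimate from lm
fit <- lm(y ~ x1 + x2)
print(cbind(normal_equations = as.numeric(beta_hat), lm = unname(coef(fit))))
```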
Types of Statistical Modeling
Statistical modelling is choosing the minimal adequate model from a potentially large set of
models by using stepwise model simplification.
Various types of statistical models are illustrated as follows.
Model : Interpretation
Saturated model : One parameter for each data point.
    Fit : Perfect
    Degrees of freedom : None
    Explanatory power of model : None
Maximal model : It consists of all (p) factors, interactions and covariates of any interest. Many
    of the model's terms can be insignificant.
    Degrees of freedom : n − p − 1
    Explanatory power of model : Depends
simplified model with ...
> rgamma(20, 10)
[R console output: 20 random values from the gamma distribution; the printed values are not recoverable from the scan]
5. rgeom
It returns n random numbers from the geometric distribution.
rgeom(n, prob)
Here n indicates the number of observations and prob indicates the probability of success in each
trial.
Example
> set.seed(2)
> rgeom(5, 1/6)
6. rlnorm
It generates random values from the lognormal distribution.
rlnorm(n, meanlog = 0, sdlog = 1)
Here n indicates the number of observations, meanlog indicates the mean of the logs and sdlog indicates the
standard deviation of the logs. The multivariate form, which takes a mean vector of the logs and a
variance/covariance matrix of the logs (varlog), is rlnorm.rplus(n, meanlog, varlog) from the compositions package.
Example
[R console screenshot for the rlnorm example; the call and the printed values are not recoverable from the scan]
7. rlogis
This function gives information about the logistic distribution. It generates random deviates.
rlogis(n, location = 0, scale = 1)
Here n indicates the number of observations; location and scale have 0 and 1 as default values.
Example
> var(rlogis(1000, 0, scale = 5))
8. rmvbin
It creates correlated multivariate binary random variables by thresholding a normal distribution. It is
provided by the bindata package.
rmvbin(n, margprob, bincorr = diag(length(margprob)))
Here n indicates the number of realizations of the variables that are to be simulated, margprob indicates the
vector of marginal probabilities (one per variable) and bincorr is a matrix of binary correlations.
Example
> rmvbin(10, margprob = c(0.3, 0.9))
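The thresholding idea behind rmvbin can be sketched in base R without the bindata package. This is not the package's implementation: the latent correlation rho = 0.5 and the sample size are assumptions, and the correlation of the resulting binary variables is generally different from the correlation of the latent normal variables.

```r
set.seed(10)
n <- 5000
margprob <- c(0.3, 0.9)   # target marginal probabilities of a 1
rho <- 0.5                # correlation of the latent normal variables (assumed)
sigma <- matrix(c(1, rho, rho, 1), nrow = 2)

# Correlated standard normal pairs: rows of z have covariance matrix sigma
z <- matrix(rnorm(n * 2), nrow = n, ncol = 2) %*% chol(sigma)

# Threshold each column at the normal quantile of its marginal probability,
# so that column j equals 1 with probability margprob[j]
b <- sweep(z, 2, qnorm(margprob), "<") * 1

print(colMeans(b))           # close to margprob
print(cor(b[, 1], b[, 2]))   # positive, since the latent variables are correlated
```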
9. rpois
It generates values from the Poisson distribution and returns the results.
rpois(n, lambda)
Here, n indicates the number of observations and lambda indicates the estimated rate of events for the distribution.
10. runif
It generates random numbers from the uniform distribution.
runif(n, min, max)
Here n indicates the number of observations; min and max are by default 0 and 1 respectively.
Example
[R console screenshot for the runif example; the printed values are not recoverable from the scan]
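The base R generators listed above can be compared with their theoretical properties. The sketch below is only illustrative: the sample size and the parameter values (prob = 1/6, scale = 5, lambda = 3) are assumptions, and the printed sample statistics will vary with the seed.

```r
set.seed(2)
n <- 10000

g  <- rgeom(n, prob = 1/6)                 # failures before the first success
ln <- rlnorm(n, meanlog = 0, sdlog = 1)    # log(ln) is standard normal
lg <- rlogis(n, location = 0, scale = 5)   # variance is scale^2 * pi^2 / 3
po <- rpois(n, lambda = 3)                 # mean and variance are both lambda
u  <- runif(n, min = 0, max = 1)           # values between min and max

check <- data.frame(
  quantity = c("mean of rgeom", "mean of log(rlnorm)", "variance of rlogis",
               "mean of rpois", "mean of runif"),
  sample   = c(mean(g), mean(log(ln)), var(lg), mean(po), mean(u)),
  theory   = c((1 - 1/6) / (1/6), 0, 5^2 * pi^2 / 3, 3, 0.5)
)
print(check)
```

Calling set.seed before the generators makes the random numbers reproducible from run to run.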
There are even other types of hypotheses, they are as follows,
Simple Hypothesis
Simple hypothesis is a statistical hypothesis which completely specifies an exact parameter. The null
hypothesis is always a simple hypothesis stated as an equality specifying an exact value of the parameter.
Example
1. $H_0 : \mu = \mu_0$
2. $H_0 : \mu_1 - \mu_2 = \delta$
Composite Hypothesis
Composite hypothesis is stated in terms of several possible values, i.e., by an inequality. The alternative
hypothesis is a composite hypothesis involving statements expressed as inequalities such as <, > or ≠.
Example
1. $H_1 : \mu > \mu_0$
2. $H_1 : \mu < \mu_0$
3. $H_1 : \mu \neq \mu_0$
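The three composite alternatives correspond to the alternative argument of t.test. The sketch below uses a simulated sample and a hypothesized mean mu0 = 50; both are assumptions made for illustration.

```r
set.seed(5)
x <- rnorm(40, mean = 52, sd = 8)   # simulated sample (assumed data)
mu0 <- 50                           # value specified by the simple null hypothesis

# H1: mu > mu0
p_greater <- t.test(x, mu = mu0, alternative = "greater")$p.value
# H1: mu < mu0
p_less <- t.test(x, mu = mu0, alternative = "less")$p.value
# H1: mu != mu0
p_two_sided <- t.test(x, mu = mu0, alternative = "two.sided")$p.value

print(c(greater = p_greater, less = p_less, two_sided = p_two_sided))
```

For the t-test the two one-sided p-values add up to 1, and the two-sided p-value is twice the smaller of them.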
Example of Hypothesis Testing
Consider an example, to check whether a coin is fair and balanced. According to the null hypothesis,
half of the flips would be heads and half would be tails. And according to the alternative hypothesis, the number
of heads and tails may be different.
$H_0 : P = 0.5$
$H_1 : P \neq 0.5$
Flipping of the coin for 50 times might result in 40 heads and 10 tails. Based on this result, the null hypothesis
must be rejected and it is concluded, according to the evidence, that the coin was probably not fair and balanced.
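The coin example can be tested in R with an exact binomial test. This is one possible way to carry out the test; the text does not say which test was used, and the 0.05 significance level is an assumption.

```r
heads <- 40
flips <- 50

# Exact two-sided binomial test of H0: P = 0.5 against H1: P != 0.5
result <- binom.test(heads, flips, p = 0.5, alternative = "two.sided")
p_value <- result$p.value

# Under P = 0.5 the distribution is symmetric, so the two-sided p-value
# is twice the probability of observing 40 or more heads
p_manual <- 2 * pbinom(heads - 1, flips, 0.5, lower.tail = FALSE)

reject_h0 <- p_value < 0.05
print(result$estimate)   # observed proportion of heads
print(p_value)
print(reject_h0)
```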