Unit - 3 PDA

The document outlines the principles of data analytics with a focus on regression analysis, including concepts such as least squares estimation, model building, and various regression techniques like linear and logistic regression. It discusses the importance of regression in predicting relationships between dependent and independent variables, as well as the assumptions and properties of estimators. Additionally, it highlights the model building lifecycle and tools used in data analytics, emphasizing the significance of variable selection and model validation.


PRINCIPLES OF DATA ANALYTICS - III YEAR II SEM

3. REGRESSION
Syllabus:
Regression:

Concepts, BLUE property assumptions, Least Square Estimation, Variable Rationalization, Model building.

Logistic Regression:

Model Theory, Model fit statistics, Model Construction, Analytics applications to various business domains.

TEXTBOOKS:
1. Student's Handbook for Associate Analytics – II, III.
2. Data Mining Concepts and Techniques, Han, Kamber, 3rd Edition, Morgan Kaufmann Publishers.

REFERENCE BOOKS:
1. Introduction to Data Mining, Tan, Steinbach and Kumar, Addison Wesley, 2006.
2. Data Mining Analysis and Concepts, M. Zaki and W. Meira

Mr.A.SHEKAR Page1
REGRESSION

Regression analysis is a statistical process for estimating the relationships between the dependent
variables (criterion variables) and one or more independent variables (predictors).

Regression analysis explains changes in the criterion in relation to changes in selected predictors.

It estimates the conditional expectation of the criterion given the predictors, i.e. the average value of the
dependent variable when the independent variables are varied.

Three major uses for regression analysis are determining the strength of predictors, forecasting an effect,
and trend forecasting.

This technique is used for forecasting, time-series modelling and finding the causal-effect relationship between
variables. For example, the relationship between rash driving and the number of road accidents by a driver
is best studied through regression.

Regression analysis is an important tool for modelling and analysing data.

Ex: you want to estimate the growth in sales of a company based on current economic conditions. You have
recent company data which indicates that the growth in sales is around two and a half times the growth
in the economy. Using this insight, we can predict future sales of the company based on current and past
information.

There are multiple benefits of using regression analysis. They are as follows:

1. It indicates the significant relationships between the dependent variable and independent variables.
2. It indicates the strength of impact of multiple independent variables on a dependent variable.

Regression analysis also allows us to compare the effects of variables measured on different scales, such as
the effect of price changes and the number of promotional activities. These benefits help market
researchers / data analysts / data scientists to evaluate and select the best set of variables to be used
for building predictive models.

There are various kinds of regression techniques available to make predictions. These techniques are
mostly driven by three metrics (number of independent variables, type of dependent variables and shape of
the regression line). We'll discuss them in detail in the following sections.

By using the least squares method, we are able to construct a best-fitting straight line to the scatter
diagram points and then formulate a regression equation of the form y = a + bx, or equivalently
y = ȳ + b(x − x̄).

Types of Regression –

 Linear regression
 Logistic regression
 Polynomial regression
 Stepwise regression
 Ridge regression
 Lasso regression
 Elastic Net regression

Linear regression is used for predictive analysis. Linear regression is a linear approach for modelling the
relationship between the criterion (or scalar response) and the multiple predictors or explanatory
variables. Linear regression focuses on the conditional probability distribution of the response given
the values of the predictors. For linear regression, there is a danger of overfitting. The formula for linear
regression is: Y' = bX + A.

Logistic regression is used when the dependent variable is dichotomous. Logistic regression estimates
the parameters of a logistic model and is a form of binomial regression. Logistic regression is used to deal
with data that has two possible criterions and models the relationship between the criterions and the predictors.

Polynomial regression is used for curvilinear data. Polynomial regression is fit with the method of least
squares. The goal of regression analysis is to model the expected value of a dependent variable y with
regard to the independent variable x.

Stepwise regression is used for fitting regression models with predictive models. It is carried out
automatically. With each step, a variable is added to or subtracted from the set of explanatory variables. The
approaches for stepwise regression are forward selection, backward elimination, and bidirectional
elimination.

Ridge regression is a technique for analysing multiple regression data. When multicollinearity
occurs, least squares estimates are unbiased but their variances are large, so they may be far from the true
value. By adding a degree of bias to the regression estimates, ridge regression reduces the standard errors.

Lasso regression is a regression analysis method that performs both variable selection and
regularization. Lasso regression uses soft thresholding. Lasso regression selects only a subset of the
provided covariates for use in the final model.

Elastic Net regression is a regularized regression method that linearly combines the penalties of the
lasso and ridge methods. Elastic Net regression is used for support vector machines, metric learning,and
portfolio optimization.

BLUE PROPERTY ASSUMPTIONS

A property which is less strict than efficiency is the so-called best linear unbiased estimator (BLUE)
property, which also uses the variance of the estimators. A vector of estimators is BLUE if it is the
minimum variance linear unbiased estimator.

To show this property, we use the Gauss-Markov Theorem.

OLS estimators are BLUE (i.e. they are linear, unbiased and have the least variance among the class of all
linear and unbiased estimators).

In statistics, ordinary least squares (OLS) is a type of linear least squares method for estimating the
unknown parameters in a linear regression model. Under these conditions, the method of OLS provides
minimum-variance mean-unbiased estimation when the errors have finite variances.

WHAT IS AN ESTIMATOR?

In statistics, an estimator is a rule for calculating an estimate of a given quantity based on observed data.

TWO TYPES OF ESTIMATORS

Point Estimators: A point estimate of a population parameter is a single value of a statistic.

Interval Estimators: An interval estimate is defined by two numbers, between which a population parameter is
said to lie.

PROPERTIES OF BLUE

• B - BEST

• L - LINEAR

• U - UNBIASED

• E - ESTIMATOR

An estimator is BLUE if the following hold:

1. It is linear (regression model)

2. It is unbiased

3. It is an efficient estimator (the unbiased estimator with least variance)

LINEARITY

• An estimator is said to be a linear estimator of (β) if it is a linear function of the sample observations.

• The sample mean is a linear estimator because it is a linear function of the X values.

UNBIASEDNESS

• A desirable property of a distribution of estimates is that its mean equals the true mean of the variables
being estimated.

• Formally, an estimator is an unbiased estimator if its sampling distribution has an expected value equal
to the true value of the population parameter.

MINIMUM VARIANCE

• Just as we want the mean of the sampling distribution to be centred around the true population value, so
too it is desirable for the sampling distribution to be as narrow (or precise) as possible.

LEAST SQUARE ESTIMATORS

The least squares method is a statistical procedure to find the best fit for a set of data points by
minimizing the sum of the offsets or residuals of the points from the plotted curve.

Least squares regression is used to predict the behavior of dependent variables. OR

The least square method is the process of finding the best-fitting curve or line of best fit for a set of data
points by reducing the sum of the squares of the offsets (residual part) of the points from the curve. The
method of curve fitting is an approach to regression analysis.

Least Square Method Definition

The least-squares method is a crucial statistical method that is practised to find a regression line or a
best-fit line for a given pattern.

This method is described by an equation with specific parameters. The method of least squares is
generously used in evaluation and regression.

In regression analysis, this method is said to be a standard approach for the approximation of sets of
equations having more equations than the number of unknowns.

The method of least squares defines the solution as the minimization of the sum of squares of the
deviations or errors in the result of each equation. The formula for the sum of squares of errors
helps to find the variation in the observed data.

The least-squares method is often applied in data fitting. The best-fit result is assumed to reduce the sum of
squared errors or residuals, which are the differences between the observed or experimental value and the
corresponding fitted value given by the model.

There are two basic categories of least-squares problems:

 Ordinary or linear least squares
 Nonlinear least squares

These depend upon the linearity or nonlinearity of the residuals.

The linear problems are often seen in regression analysis in statistics.

On the other hand, the non-linear problems are generally solved by an iterative method of refinement, in
which the model is approximated by a linear one at each iteration.

How do you calculate least squares?

Let us assume that the given data points are (x_1, y_1), (x_2, y_2), …, (x_n, y_n), in which all x's are
independent variables, while all y's are dependent ones. Also, suppose that f(x) is the fitting curve and d
represents the error or deviation from each given point.

The least-squares principle states that the curve that best fits the data is the one for which the sum of
squares of all the deviations from the given values is minimum.
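For a straight line, this minimization has the standard closed-form solution b = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)² and a = ȳ − b·x̄. A small sketch with hypothetical data points:

```python
import numpy as np

# Hypothetical data points (x_i, y_i)
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 4.2, 5.9, 8.1])

x_bar, y_bar = x.mean(), y.mean()

# Least-squares estimates: minimize the sum of squared deviations
b = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
a = y_bar - b * x_bar

# Residuals are the deviations of observed points from the fitted line
residuals = y - (a + b * x)
print(f"a = {a:.3f}, b = {b:.3f}, SSE = {np.sum(residuals**2):.4f}")
```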

What is the principle of least squares?

The least squares principle states that the most probable values of a system of unknown quantities, upon
which observations have been made, are obtained by making the sum of the squares of the errors a minimum.

What does least squares mean?

The least square method is the process of obtaining the best-fitting curve or line of best fit for the given
data set by reducing the sum of the squares of the offsets (residual part) of the points from the curve.

What is least squares curve fitting?

The least-squares method is a generally used method of curve fitting for a given data set. It is the most
prevalent method used to determine the trend line for given time-series data.

LINEAR REGRESSION

It's a common technique to determine how one variable of interest is affected by another. It is
used for three main purposes:

 For describing the linear dependence of one variable on the other.
 For prediction of values of one variable from the other, which has more data.
 Correction of the linear dependence of one variable on the other.

A line is fitted through the group of plotted data.

Y = α + βX + ε

α = intercept coefficient

β = slope coefficient

ε = residuals

Cluster Analysis is the process of forming groups of related variables for the purpose of drawing
important conclusions based on the similarities within the group.

 The greater the similarity within a group and the greater the difference between the groups, the more
distinct is the clustering.

VARIABLE SELECTION

Computational techniques for variable selection: in order to select a subset model, several techniques based
on computational procedures and algorithms are available.
They are essentially based on two ideas
– select all possible explanatory variables, or select the explanatory variables stepwise.
1. Use all possible explanatory variables
 This methodology is based on the following steps:
 Fit a model with one explanatory variable.
 Fit a model with two explanatory variables.
 Fit a model with three explanatory variables, and so on.

 Choose a suitable criterion for model selection and evaluate each of the fitted regression equations with
the selection criterion.
 The total number of models to be fitted rises sharply with an increase in k (the number of candidate variables).
 So such models can be evaluated using a model selection criterion with the help of an efficient
computation algorithm on computers.

2. Stepwise regression techniques

This methodology is based on choosing the explanatory variables in the subset model in steps, which can
be either adding one variable at a time or deleting one variable at a time.
Based on this, there are three procedures:
- forward selection,
- backward elimination, and
- stepwise regression.

These procedures are basically computer-intensive procedures and are executed using software.
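A minimal sketch of forward selection, the first of the three procedures above, using plain NumPy on a synthetic data set in which only two of four candidate variables truly matter. The helper names and the selection criterion (residual sum of squares) are illustrative choices, not prescribed by the text:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y truly depends on columns 0 and 2 only
X = rng.normal(size=(100, 4))
y = 3 * X[:, 0] - 2 * X[:, 2] + rng.normal(scale=0.1, size=100)

def sse(cols):
    """Residual sum of squares of an OLS fit on the given columns."""
    A = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return np.sum((y - A @ coef) ** 2)

def forward_select(n_keep):
    """Greedy forward selection: at each step add the variable
    that most reduces the residual sum of squares."""
    selected, remaining = [], list(range(X.shape[1]))
    while len(selected) < n_keep:
        best = min(remaining, key=lambda j: sse(selected + [j]))
        selected.append(best)
        remaining.remove(best)
    return selected

print(forward_select(2))  # should recover the two informative columns
```

Backward elimination works the same way in reverse: start from all variables and repeatedly drop the one whose removal increases the criterion least.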

MODEL BUILDING

In this phase the data science team needs to develop data sets for training, testing, and production purposes.
These data sets enable the data scientist to develop the analytical method and train it, while holding aside
some of the data for testing the model.

The team develops datasets for testing, training, and production purposes. In addition, in this phase, the
team builds and executes models based on work done in the model planning phase. The team also
considers whether its existing tools will suffice for running the models, or if it will need a more robust
environment for executing models and workflows (example – fast hardware and parallel processing).

Free or open-source tools:
R and PL/R, Octave, WEKA, Python

Common Tools for the Model Building Phase:
R and PL/R:
They were described earlier in the model planning phase; PL/R is a procedural language for
PostgreSQL with R. Using this approach means that R commands can be executed in the database.

Octave:
A free software programming language for computational modeling that has some of the functionality of
Matlab. Because it is freely available, Octave is used in major universities when teaching machine learning.

WEKA:
A free data mining software package with an analytic workbench. The functions created in WEKA
can be executed within Java code.

Python:
A programming language that provides toolkits for machine learning and analysis, such as scikit-learn,
NumPy, SciPy, pandas, and related data visualization using matplotlib.

SQL:
SQL in-database implementations, such as MADlib, provide an alternative to in-memory desktop
analytical tools.

MADlib:
Provides an open-source machine learning library of algorithms that can be executed in the database,
for PostgreSQL or Greenplum.

Lifecycle of Model Building –

 Define success
 Explore data
 Condition data
 Select variables
 Balance data
 Build models
 Validate
 Deploy
 Maintain

 Data exploration is used to get the gist of the data and to develop a first-step assessment of its
quality, quantity, and characteristics.
 Visualization techniques can also be applied. However, this can be a difficult task in high-dimensional
spaces with many input variables.
 In the conditioning of data, we group the functional data to which the modeling techniques are applied,
and rescaling is then done; in some cases rescaling is an issue if variables are coupled.
 Variable selection is very important to develop a quality model.
 This process is implicitly model-dependent since it is used to configure which combination of
variables should be used in ongoing model development.
 Data balancing is to partition the data into appropriate subsets for training, test, and validation.
Model building is to focus on the desired algorithms.
 The most famous technique is symbolic regression; other techniques can also be preferred.
 Model validation is important to develop a feeling of trust prior to the model's usage.
 The definition of a good model includes robustness and well-defined accuracy.
 An accurate but untrusted model is potentially financially and physically dangerous, so a trusted
metric is very important for symbolic regression and stacked analytic networks.
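The data-balancing step above can be sketched as a simple random partition of row indices into training, test, and validation subsets. The 60/20/20 split and the helper name `partition` are illustrative choices:

```python
import numpy as np

def partition(n_rows, train=0.6, test=0.2, seed=42):
    """Shuffle row indices and split them into train / test / validation."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_rows)
    n_train = int(n_rows * train)
    n_test = int(n_rows * test)
    return (idx[:n_train],
            idx[n_train:n_train + n_test],
            idx[n_train + n_test:])          # remainder is validation

tr, te, va = partition(100)
print(len(tr), len(te), len(va))  # 60 20 20
```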

LOGISTIC REGRESSION

Logistic regression is the appropriate regression analysis to conduct when the dependent variable is
dichotomous (binary).

Logistic regression is used to describe data and to explain the relationship between one dependent binary
variable and one or more nominal, ordinal, interval or ratio-level independent variables.

Logistic regression is a classification model in which the response variable is categorical. It is an
algorithm that comes from statistics and is used for supervised classification problems.

Logistic regression analysis is used to examine the association of (categorical or continuous)
independent variable(s) with one dichotomous dependent variable. This is in contrast to linear regression
analysis, in which the dependent variable is a continuous variable.

Logistic regression is one of the basic and popular algorithms to solve a classification problem. It is
named 'Logistic Regression' because its underlying technique is quite the same as Linear Regression. The
term "Logistic" is taken from the Logit function that is used in this method of classification.

Basic assumptions that must be met for logistic regression include independence of errors, linearity in
the logit for continuous variables, absence of multicollinearity, and lack of strongly influential outliers.

Types of Logistic Regression:
 Binary Logistic Regression
 Multinomial Logistic Regression
 Ordinal Logistic Regression

For the model to be a cent percent accurate one, we need to calculate and find out a few parameters of the
algorithm in order to check how accurate our Binary Logistic Regression model is.
The key parameters we calculate and check depend on the topic called the CONFUSION MATRIX. The
confusion matrix is a type of table used to define the characteristics of classification problems.

                       Predicted Negative (0)   Predicted Positive (1)
Actual Negative (0)    True Negative            False Positive
Actual Positive (1)    False Negative           True Positive

The below are a few expressions calculated in order to find how accurate the prediction of the model is:
 Accuracy
 Recall
 Precision
 F1 score
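Given the four counts in the confusion matrix, the expressions listed above can be computed directly. The counts below are made up for illustration:

```python
def classification_metrics(tn, fp, fn, tp):
    """Accuracy, recall, precision and F1 score from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    recall = tp / (tp + fn)          # true-positive rate
    precision = tp / (tp + fp)       # positive predictive value
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, recall, precision, f1

# Hypothetical counts: 50 TN, 10 FP, 5 FN, 35 TP
acc, rec, prec, f1 = classification_metrics(tn=50, fp=10, fn=5, tp=35)
print(f"accuracy={acc:.2f} recall={rec:.2f} precision={prec:.2f} F1={f1:.2f}")
```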

The goal of logistic regression is to correctly predict the category of outcome for individual cases using
the most parsimonious model. To accomplish this goal, a model is created that includes all predictor
variables that are useful in predicting the response variable.

Model Theory in Logistic Regression

Logistic regression is basically a supervised classification algorithm. In a classification problem, the
target variable (or output), y, can take only discrete values for a given set of features (or inputs), X.

Logistic regression becomes a classification technique only when a decision threshold is brought into
the picture. The setting of the threshold value is a very important aspect of logistic regression and is
dependent on the classification problem itself.

The decision for the value of the threshold is majorly affected by the values of precision and recall.

1. Low Precision / High Recall: In applications where we want to reduce the number of false negatives
without necessarily reducing the number of false positives, we choose a decision value which has a low
value of precision or a high value of recall.

2. High Precision / Low Recall: In applications where we want to reduce the number of false positives
without necessarily reducing the number of false negatives, we choose a decision value which has a high
value of precision or a low value of recall.
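The trade-off described above can be demonstrated by sweeping the decision threshold over a handful of hypothetical predicted probabilities (both arrays below are made up for illustration):

```python
import numpy as np

# Hypothetical model outputs (probabilities) and true labels
probs  = np.array([0.1, 0.3, 0.4, 0.55, 0.6, 0.8, 0.9])
labels = np.array([0,   0,   1,   0,    1,   1,   1  ])

def precision_recall(threshold):
    """Classify with the given threshold and return (precision, recall)."""
    pred = (probs >= threshold).astype(int)
    tp = np.sum((pred == 1) & (labels == 1))
    fp = np.sum((pred == 1) & (labels == 0))
    fn = np.sum((pred == 0) & (labels == 1))
    return tp / (tp + fp), tp / (tp + fn)

# A low threshold gives high recall; raising it trades recall for precision
for t in (0.3, 0.5, 0.7):
    p, r = precision_recall(t)
    print(f"threshold={t}: precision={p:.2f} recall={r:.2f}")
```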

MODEL FIT STATISTICS

As in linear regression, goodness of fit in logistic regression attempts to get at how well a model fits
the data.

It is usually applied after a "final model" has been selected.

Often in selecting a model no single "final model" is selected; rather, a series of models are fit, each
contributing towards the final inferences and conclusions. In that case, one may wish to see how
well more than one model fits, although it is common to just check the fit of one model.

This is not necessarily bad practice, because if there is a series of "good" models being fit, often the fit
from each will be similar.
It is not clear how to judge the fit of a model that we know is in fact wrong.

Much of the goodness of fit literature is based on hypothesis testing of the following type:
H0 : model is exactly correct
HA : model is not exactly correct

This type of testing provides no useful information. If the null hypothesis is rejected, then we
have learned nothing, because we already knew that it is impossible for any model to be
"exactly correct".

On the other hand, if we do not reject the model, it is almost surely because of a lack of
statistical power, and as the sample size grows larger, we will eventually surely reject H0.

These tests can be seen not only as not useful, but as harmful if non-rejection of a null
hypothesis is misinterpreted as proof that the model "fits well", which of course can be far
from the truth.

Goodness of Fit Measures for Logistic Regression:

The following measures of fit are available, sometimes divided into "global" and "local" measures:
• Chi-square goodness of fit tests and deviance
• Hosmer-Lemeshow tests
• Classification tables
• ROC curves
• Logistic regression R2
• Model validation via an outside data set or by splitting a data set
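As a small illustration of two of these measures, the deviance (−2 × log-likelihood) and one common "logistic regression R²" (McFadden's) can be computed directly from a model's fitted probabilities. The labels and probabilities below are hypothetical:

```python
import numpy as np

y = np.array([1, 0, 1, 1, 0, 0, 1, 0])
p = np.array([0.9, 0.2, 0.7, 0.8, 0.3, 0.1, 0.6, 0.4])  # fitted probabilities

# Log-likelihood of the fitted model and of the null (intercept-only) model
ll_model = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
p_null = y.mean()
ll_null = np.sum(y * np.log(p_null) + (1 - y) * np.log(1 - p_null))

deviance = -2 * ll_model              # smaller deviance = better fit
mcfadden_r2 = 1 - ll_model / ll_null  # 0 = no better than null, 1 = perfect

print(f"deviance={deviance:.3f}  McFadden R^2={mcfadden_r2:.3f}")
```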

MODEL CONSTRUCTION

The process of model-building allows you to select the "best" variable to add to your current regression
model.

The logistic regression model is one of the most widely used models to investigate the independent effect
of a variable on binomial outcomes in the medical literature.

However, the model building strategy is not explicitly stated in many studies, compromising the
reliability and reproducibility of the results. There are a variety of model building strategies
reported in the literature, such as purposeful selection of variables, stepwise selection and
best subsets.

However, no one strategy has been proven to be superior to the others, and model building is
"part science, part statistical methods, and part experience and common sense".

However, the principle of model building is to select as few variables as possible, while the model
(parsimonious model) still reflects the true outcomes of the data.

The aims of modelling are:

• To investigate whether an association exists between the variables of interest

• To measure the strength (as well as direction) of association between the variables

• To study the form of the relationship (if any). Modelling depends on the type of outcome variable:

• For continuous outcome variables, relationships are examined by linear or non-linear regression models

• For categorical outcome variables, logistic regression is usually used to examine possible relationships

Logistic regression is a regression with an outcome variable that is categorical (e.g. success/failure) and
explanatory variables that can be a mix of continuous and categorical variables.

• It addresses the same research questions that multiple regression does

• It predicts which of the two possible events (in case of a binary outcome) is going to happen,
given the information on explanatory variables (e.g. LR can be used to analyse factors that determine
whether an individual is likely to have a certain type of rehab program)

ASSUMPTIONS OF LOGISTIC REGRESSION

• Ratio of cases to variables – discrete variables should have enough responses in every given category.
If there are many cells with no response:

• parameter estimates and standard errors are likely to be unstable

• maximum likelihood estimation (MLE) of parameters could be impossible to obtain

• Linearity in the logit – the regression equation should have a linear relationship with the logit
form of the outcome

• Absence of multicollinearity; no outliers; and independence of errors

TYPES OF LR:

Dichotomous outcome (yes/no; pain/no pain) – Binary logistic regression

Polychotomous outcome (choice of rehabilitation: home/clinic/nursing home/hospital) –
Multinomial logistic regression

Ordered outcome (physical activity: none, low, moderate, high) – Ordinal logistic regression

Analytics applications to various Business Domains

Business analytics (BA) is the practice of iterative, methodical exploration of an organization's
data, with an emphasis on statistical analysis. Business analytics is used by companies
committed to data-driven decision-making.

Types of analytics

 Decision Analytics
 Descriptive Analytics
 Prescriptive Analytics

While the two components of business analytics (business intelligence and advanced analytics) are
sometimes used interchangeably, there are some key differences between these two business
analytics techniques.

Business analytics applications
Business analytics tools come in several different varieties:

 Data visualization tools
 Business intelligence reporting software
 Self-service analytics platforms
 Statistical analysis tools
 Big data platforms

It’sAssociationwithBusinessDomain

 FinancialServiceAnalytics
 MarketingAnalytics
 PricingAnalytics
 RetailsalesAnalytics
 RiskandCreditanalytics
 SupplyChainAnalytics
 Transportationanalytics
 CyberAnalytics
 Enterpriseoptimization

Business Analytics is critical for remaining competitive and achieving success. When you get
BA best practices in place and get buy-in from all stakeholders, your organization will benefit
from data-driven decision making.
Finance
BA is of utmost importance to the finance sector. Data scientists are in high demand in investment
banking, portfolio management, financial planning, budgeting, forecasting, etc.

Marketing
Studying buying patterns of consumer behaviour and analysing trends help in identifying the target
audience, employing advertising techniques that can appeal to the consumers, forecasting supply
requirements, etc.

HRProfessionals
HR professionals can make use of data to find information about educational background of
high performing candidates, employee attrition rate, number of years of service of employees,
age, gender, etc. This information can play a pivotal role in the selection procedure of a
candidate.

CRM
Business Analytics helps one analyse the key performance indicators, which further helps in
decision making and make strategies to boost the relationship with the consumers. The
demographics, and data about other socio-economic factors, purchasing patterns, lifestyle, etc.,
are of prime importance to the CRM department.

Manufacturing
Business Analytics can help you in supply chain management, inventory management, measuring
performance against targets, risk mitigation plans, improving efficiency on the basis of product data, etc.

Credit Card Companies
Credit card transactions of a customer can determine many factors: financial health, lifestyle,
preferences of purchases, behavioural trends, etc.

Linear Regression vs Logistic Regression:

Linear Regression and Logistic Regression are the two famous Machine Learning algorithms which come
under the supervised learning technique.

Since both algorithms are supervised in nature, these algorithms use a labeled dataset to make predictions.

But the main difference between them is how they are used.

Linear Regression is used for solving regression problems whereas Logistic Regression is used for solving
classification problems.

The description of both algorithms is given below, along with a difference table.

Linear Regression:

Linear Regression is one of the simplest Machine Learning algorithms that comes under the supervised
learning technique and is used for solving regression problems.

It is used for predicting the continuous dependent variable with the help of independent variables.

The goal of linear regression is to find the best-fit line that can accurately predict the output for the
continuous dependent variable.

If a single independent variable is used for prediction then it is called Simple Linear Regression, and if
there are two or more independent variables then such regression is called Multiple Linear Regression.

By finding the best-fit line, the algorithm establishes the relationship between the dependent variable and
the independent variable. And the relationship should be of a linear nature.

The output of linear regression should only be continuous values such as price, age, salary, etc. The
relationship between the dependent variable and independent variable can be shown in the below image:

In the above image the dependent variable is on the Y-axis (salary) and the independent variable is on the
X-axis (experience). The regression line can be written as:

y = a0 + a1x + ε

Where a0 and a1 are the coefficients and ε is the error term.

Logistic Regression:

Logistic regression is one of the most popular Machine Learning algorithms that comes under the
supervised learning technique.

It can be used for classification as well as for regression problems, but it is mainly used for
classification problems.

Logistic regression is used to predict the categorical dependent variable with the help of independent
variables.

The output of a logistic regression problem can only be between 0 and 1.

Logistic regression can be used where probabilities between two classes are required, such as whether it
will rain today or not: either 0 or 1, true or false, etc.

Logistic regression is based on the concept of Maximum Likelihood estimation. According to this
estimation, the observed data should be most probable.

In logistic regression, we pass the weighted sum of inputs through an activation function that can map
values between 0 and 1. Such an activation function is known as the sigmoid function and the curve
obtained is called the sigmoid curve or S-curve. Consider the below image:

The equation for logistic regression is:

P(y = 1) = 1 / (1 + e^-(a0 + a1x))
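A minimal sketch of the sigmoid function and how it converts the weighted sum a0 + a1x into a probability; the coefficient values below are hypothetical:

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) activation: maps any real value into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# The weighted sum of inputs is passed through the sigmoid;
# the result is interpreted as the probability of class 1.
a0, a1 = -4.0, 2.0        # hypothetical coefficients
for x in (0.0, 2.0, 4.0):
    p = sigmoid(a0 + a1 * x)
    print(f"x={x}: P(y=1) = {p:.3f}")  # rises along the S-curve
```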

Difference between Linear Regression and Logistic Regression:

Linear Regression                                   | Logistic Regression
----------------------------------------------------|----------------------------------------------------
Used to predict the continuous dependent variable   | Used to predict the categorical dependent variable
using a given set of independent variables.         | using a given set of independent variables.
Used for solving regression problems.               | Used for solving classification problems.
We predict the value of continuous variables.       | We predict the values of categorical variables.
We find the best-fit line, by which we can easily   | We find the S-curve, by which we can classify the
predict the output.                                 | samples.
The least squares estimation method is used for     | The maximum likelihood estimation method is used
estimation of accuracy.                             | for estimation of accuracy.
The output must be a continuous value, such as      | The output must be a categorical value such as
price, age, etc.                                    | 0 or 1, Yes or No, etc.
The relationship between the dependent variable     | A linear relationship between the dependent and
and the independent variable must be linear.        | independent variables is not required.
There may be collinearity between the independent   | There should not be collinearity between the
variables.                                          | independent variables.

Important Questions

1. What is Regression? Explain in detail.
2. What are the major uses of regression analysis?
3. Types of Regression?
4. What is BLUE? Properties of BLUE?
5. How to obtain the best-fit line?
6. Define OLS, MLE.
7. Explain Logistic Regression in detail.
8. Explain model construction in Logistic Regression.
9. Compare Linear and Logistic Regression.
10. What do you mean by the Least squares method?

