Unit - 3 PDA

The document outlines the principles of data analytics with a focus on regression analysis, including concepts such as least squares estimation, model building, and various regression techniques like linear and logistic regression. It discusses the importance of regression in predicting relationships between dependent and independent variables, as well as the assumptions and properties of estimators. Additionally, it highlights the model building lifecycle and tools used in data analytics, emphasizing the significance of variable selection and model validation.


PRINCIPLES OF DATA ANALYTICS - III YEAR II SEM

3. REGRESSION
Syllabus:
Regression:

Concepts, BLUE property assumptions, Least Square Estimation, Variable Rationalization, Model building.

Logistic Regression:

Model Theory, Model fit statistics, Model Construction, Analytics applications to various business domains.

TEXTBOOKS:
1. Student's Handbook for Associate Analytics – II, III.
2. Data Mining Concepts and Techniques, Han, Kamber, 3rd Edition, Morgan Kaufmann Publishers.

REFERENCE BOOKS:
1. Introduction to Data Mining, Tan, Steinbach and Kumar, Addison Wesley, 2006.
2. Data Mining Analysis and Concepts, M. Zaki and W. Meira

Mr.A.SHEKAR Page1
REGRESSION

Regression analysis is a statistical process for estimating the relationships between the dependent
variables (criterion variables) and one or more independent variables (predictors).

Regression analysis explains changes in the criterion in relation to changes in selected predictors.

It estimates the conditional expectation of the criterion given the predictors, i.e. the average value of the
dependent variable when the independent variables are varied.

Three major uses for regression analysis are determining the strength of predictors, forecasting an effect,
and trend forecasting.

This technique is used for forecasting, time-series modelling and finding the causal-effect relationship between
variables. For example, the relationship between rash driving and the number of road accidents by a driver
is best studied through regression.

Regression analysis is an important tool for modelling and analysing data.

Ex: you want to estimate the growth in sales of a company based on current economic conditions. You have
recent company data which indicates that the growth in sales is around two and a half times the growth
in the economy. Using this insight, we can predict future sales of the company based on current and past
information.

There are multiple benefits of using regression analysis. They are as follows:

1. It indicates the significant relationships between the dependent variable and independent variables.
2. It indicates the strength of impact of multiple independent variables on a dependent variable.

Regression analysis also allows us to compare the effects of variables measured on different scales, such as
the effect of price changes and the number of promotional activities. These benefits help market
researchers / data analysts / data scientists to evaluate and select the best set of variables to be used
for building predictive models.

There are various kinds of regression techniques available to make predictions. These techniques are
mostly driven by three metrics (number of independent variables, type of dependent variables and shape of
the regression line). We'll discuss them in detail in the following sections.

By using the least squares method, we are able to construct a best-fitting straight line to the scatter
diagram points and then formulate a regression equation of the form y = a + bx, or equivalently
y = ȳ + b(x − x̄).

Types of Regression –

 Linear regression
 Logistic regression
 Polynomial regression
 Stepwise regression
 Ridge regression
 Lasso regression
 Elastic Net regression

Linear regression is used for predictive analysis. Linear regression is a linear approach for modelling the
relationship between the criterion (or scalar response) and the multiple predictors or explanatory
variables. Linear regression focuses on the conditional probability distribution of the response given
the values of the predictors. For linear regression, there is a danger of overfitting. The formula for linear
regression is: Y' = bX + A.

Logistic regression is used when the dependent variable is dichotomous. Logistic regression estimates
the parameters of a logistic model and is a form of binomial regression. Logistic regression is used to deal
with data that has two possible criterions and models the relationship between the criterions and the predictors.

Polynomial regression is used for curvilinear data. Polynomial regression is fit with the method of least
squares. The goal of regression analysis is to model the expected value of a dependent variable y with
regard to the independent variable x.

Stepwise regression is used for fitting regression models with predictive models. It is carried out
automatically. With each step, a variable is added to or subtracted from the set of explanatory variables. The
approaches for stepwise regression are forward selection, backward elimination, and bidirectional
elimination.

Ridge regression is a technique for analysing multiple regression data. When multicollinearity
occurs, least squares estimates are unbiased but their variances are large, so they may be far from the true
value. By adding a degree of bias to the regression estimates, ridge regression reduces the standard errors.

Lasso regression is a regression analysis method that performs both variable selection and
regularization. Lasso regression uses soft thresholding. Lasso regression selects only a subset of the
provided covariates for use in the final model.

Elastic Net regression is a regularized regression method that linearly combines the penalties of the
lasso and ridge methods. Elastic Net regression is used for support vector machines, metric learning,and
portfolio optimization.

BLUE PROPERTY ASSUMPTIONS

A property which is less strict than efficiency is the so-called best linear unbiased estimator (BLUE)
property, which also uses the variance of the estimators. A vector of estimators is BLUE if it is the
minimum variance linear unbiased estimator.

To show this property, we use the Gauss-Markov Theorem.

OLS estimators are BLUE (i.e. they are linear, unbiased and have the least variance among the class of all
linear and unbiased estimators).

In statistics, ordinary least squares (OLS) is a type of linear least squares method for estimating the
unknown parameters in a linear regression model. Under these conditions, the method of OLS provides
minimum-variance mean-unbiased estimation when the errors have finite variances.

WHAT IS AN ESTIMATOR?

In statistics, an estimator is a rule for calculating an estimate of a given quantity based on observed data.

TWO TYPES OF ESTIMATORS

Point Estimators: A point estimate of a population parameter is a single value of a statistic.

Interval Estimators: An interval estimate is defined by two numbers, between which a population parameter is
said to lie.

PROPERTIES OF BLUE

• B - BEST

• L - LINEAR

• U - UNBIASED

• E - ESTIMATOR

An estimator is BLUE if the following hold:

1. It is linear (regression model)

2. It is unbiased

3. It is an efficient estimator (the unbiased estimator with least variance)

LINEARITY

• An estimator is said to be a linear estimator of (β) if it is a linear function of the sample observations.

• The sample mean is a linear estimator because it is a linear function of the X values.

UNBIASEDNESS

• A desirable property of a distribution of estimates is that its mean equals the true mean of the variables
being estimated.

• Formally, an estimator is an unbiased estimator if its sampling distribution has an expected value equal
to the true value of the population parameter.

MINIMUM VARIANCE

• Just as we want the mean of the sampling distribution to be centred around the true population value, so
too it is desirable for the sampling distribution to be as narrow (or precise) as possible.

LEAST SQUARE ESTIMATORS

The least squares method is a statistical procedure to find the best fit for a set of data points by
minimizing the sum of the offsets or residuals of the points from the plotted curve.

Least squares regression is used to predict the behavior of dependent variables. OR

The least square method is the process of finding the best-fitting curve or line of best fit for a set of data
points by reducing the sum of the squares of the offsets (residual part) of the points from the curve. The
method of curve fitting is an approach to regression analysis.

Least Square Method Definition

The least-squares method is a crucial statistical method that is practised to find a regression line or a
best-fit line for a given pattern.

This method is described by an equation with specific parameters. The method of least squares is
generously used in evaluation and regression.

In regression analysis, this method is said to be a standard approach for the approximation of sets of
equations having more equations than the number of unknowns.

The method of least squares defines the solution as the minimization of the sum of squares of the
deviations or errors in the result of each equation. The formula for the sum of squares of errors
helps to find the variation in the observed data.

The least-squares method is often applied in data fitting. The best-fit result is assumed to reduce the sum of
squared errors or residuals, which are the differences between the observed or experimental value and the
corresponding fitted value given by the model.

There are two basic categories of least-squares problems:

 Ordinary or linear least squares
 Nonlinear least squares

These depend upon the linearity or nonlinearity of the residuals.

The linear problems are often seen in regression analysis in statistics.

On the other hand, the non-linear problems are generally solved by an iterative method of refinement, in
which the model is approximated by a linear one at each iteration.

How do you calculate least squares?

Let us assume that the given data points are (x_1, y_1), (x_2, y_2), …, (x_n, y_n), in which all x's are
independent variables, while all y's are dependent ones. Also, suppose that f(x) is the fitting curve and d
represents the error or deviation from each given point.

The least-squares principle states that the curve that best fits the data is the one for which the sum of
squares of all the deviations from the given values is minimum.
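For a straight line, this minimization has the standard closed-form solution b = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)² and a = ȳ − b·x̄. A small sketch with hypothetical data points:

```python
import numpy as np

# Hypothetical data points (x_i, y_i)
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 4.2, 5.9, 8.1])

x_bar, y_bar = x.mean(), y.mean()

# Least-squares estimates: minimize the sum of squared deviations
b = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
a = y_bar - b * x_bar

# Residuals are the deviations of observed points from the fitted line
residuals = y - (a + b * x)
print(f"a = {a:.3f}, b = {b:.3f}, SSE = {np.sum(residuals**2):.4f}")
```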

What is the principle of least squares?

The least squares principle states that the most probable values of a system of unknown quantities, upon
which observations have been made, are obtained by making the sum of the squares of the errors a minimum.

What does least squares mean?

The least square method is the process of obtaining the best-fitting curve or line of best fit for the given
data set by reducing the sum of the squares of the offsets (residual part) of the points from the curve.

What is least squares curve fitting?

The least-squares method is a generally used method of curve fitting for a given data set. It is the most
prevalent method used to determine the trend line for given time-series data.

LINEAR REGRESSION

It's a common technique to determine how one variable of interest is affected by another. It is
used for three main purposes:

 For describing the linear dependence of one variable on the other.
 For prediction of values of one variable from the other, which has more data.
 Correction of the linear dependence of one variable on the other.

A line is fitted through the group of plotted data.

Y = α + βX + ε

α = intercept coefficient

β = slope coefficient

ε = residuals

Cluster Analysis is the process of forming groups of related variables for the purpose of drawing
important conclusions based on the similarities within the group.

 The greater the similarity within a group and the greater the difference between the groups, the more
distinct is the clustering.

VARIABLE SELECTION

Computational techniques for variable selection: in order to select a subset model, several techniques based
on computational procedures and algorithms are available.
They are essentially based on two ideas
– select all possible explanatory variables, or select the explanatory variables stepwise.
1. Use all possible explanatory variables
 This methodology is based on the following steps:
 Fit a model with one explanatory variable.
 Fit a model with two explanatory variables.
 Fit a model with three explanatory variables, and so on.

 Choose a suitable criterion for model selection and evaluate each of the fitted regression equations with
the selection criterion.
 The total number of models to be fitted rises sharply with an increase in k (the number of candidate variables).
 So such models can be evaluated using a model selection criterion with the help of an efficient
computation algorithm on computers.

2. Stepwise regression techniques

This methodology is based on choosing the explanatory variables in the subset model in steps, which can
be either adding one variable at a time or deleting one variable at a time.
Based on this, there are three procedures:
- forward selection,
- backward elimination, and
- stepwise regression.

These procedures are basically computer-intensive procedures and are executed using software.
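A minimal sketch of forward selection, the first of the three procedures above, using plain NumPy on a synthetic data set in which only two of four candidate variables truly matter. The helper names and the selection criterion (residual sum of squares) are illustrative choices, not prescribed by the text:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y truly depends on columns 0 and 2 only
X = rng.normal(size=(100, 4))
y = 3 * X[:, 0] - 2 * X[:, 2] + rng.normal(scale=0.1, size=100)

def sse(cols):
    """Residual sum of squares of an OLS fit on the given columns."""
    A = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return np.sum((y - A @ coef) ** 2)

def forward_select(n_keep):
    """Greedy forward selection: at each step add the variable
    that most reduces the residual sum of squares."""
    selected, remaining = [], list(range(X.shape[1]))
    while len(selected) < n_keep:
        best = min(remaining, key=lambda j: sse(selected + [j]))
        selected.append(best)
        remaining.remove(best)
    return selected

print(forward_select(2))  # should recover the two informative columns
```

Backward elimination works the same way in reverse: start from all variables and repeatedly drop the one whose removal increases the criterion least.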

MODEL BUILDING

In this phase the data science team needs to develop data sets for training, testing, and production purposes.
These data sets enable the data scientist to develop the analytical method and train it, while holding aside
some of the data for testing the model.

The team develops datasets for testing, training, and production purposes. In addition, in this phase, the
team builds and executes models based on work done in the model planning phase. The team also
considers whether its existing tools will suffice for running the models, or if it will need a more robust
environment for executing models and workflows (example – fast hardware and parallel processing).

Free or open-source tools:
R and PL/R, Octave, WEKA, Python

Common Tools for the Model Building Phase:
R and PL/R:
They were described earlier in the model planning phase; PL/R is a procedural language for
PostgreSQL with R. Using this approach means that R commands can be executed in the database.

Octave:
A free software programming language for computational modeling that has some of the functionality of
Matlab. Because it is freely available, Octave is used in major universities when teaching machine learning.

WEKA:
A free data mining software package with an analytic workbench. The functions created in WEKA
can be executed within Java code.

Python:
A programming language that provides toolkits for machine learning and analysis, such as scikit-learn,
NumPy, SciPy, pandas, and related data visualization using matplotlib.

SQL:
SQL in-database implementations, such as MADlib, provide an alternative to in-memory desktop
analytical tools.

MADlib:
Provides an open-source machine learning library of algorithms that can be executed in the database,
for PostgreSQL or Greenplum.

Lifecycle of Model Building –

 Define success
 Explore data
 Condition data
 Select variables
 Balance data
 Build models
 Validate
 Deploy
 Maintain

 Data exploration is used to get the gist of the data and to develop a first-step assessment of its
quality, quantity, and characteristics.
 Visualization techniques can also be applied. However, this can be a difficult task in high-dimensional
spaces with many input variables.
 In the conditioning of data, we group the functional data to which the modeling techniques are applied,
and rescaling is then done; in some cases rescaling is an issue if variables are coupled.
 Variable selection is very important to develop a quality model.
 This process is implicitly model-dependent since it is used to configure which combination of
variables should be used in ongoing model development.
 Data balancing is to partition the data into appropriate subsets for training, test, and validation.
Model building is to focus on the desired algorithms.
 The most famous technique is symbolic regression; other techniques can also be preferred.
 Model validation is important to develop a feeling of trust prior to the model's usage.
 The definition of a good model includes robustness and well-defined accuracy.
 An accurate but untrusted model is potentially financially and physically dangerous, so a trusted
metric is very important for symbolic regression and stacked analytic networks.
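The data-balancing step above can be sketched as a simple random partition of row indices into training, test, and validation subsets. The 60/20/20 split and the helper name `partition` are illustrative choices:

```python
import numpy as np

def partition(n_rows, train=0.6, test=0.2, seed=42):
    """Shuffle row indices and split them into train / test / validation."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_rows)
    n_train = int(n_rows * train)
    n_test = int(n_rows * test)
    return (idx[:n_train],
            idx[n_train:n_train + n_test],
            idx[n_train + n_test:])          # remainder is validation

tr, te, va = partition(100)
print(len(tr), len(te), len(va))  # 60 20 20
```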

LOGISTIC REGRESSION

Logistic regression is the appropriate regression analysis to conduct when the dependent variable is
dichotomous (binary).

Logistic regression is used to describe data and to explain the relationship between one dependent binary
variable and one or more nominal, ordinal, interval or ratio-level independent variables.

Logistic regression is a classification model in which the response variable is categorical. It is an
algorithm that comes from statistics and is used for supervised classification problems.

Logistic regression analysis is used to examine the association of (categorical or continuous)
independent variable(s) with one dichotomous dependent variable. This is in contrast to linear regression
analysis, in which the dependent variable is a continuous variable.

Logistic regression is one of the basic and popular algorithms to solve a classification problem. It is
named 'Logistic Regression' because its underlying technique is quite the same as Linear Regression. The
term "Logistic" is taken from the Logit function that is used in this method of classification.

Basic assumptions that must be met for logistic regression include independence of errors, linearity in
the logit for continuous variables, absence of multicollinearity, and lack of strongly influential outliers.

Types of Logistic Regression:
 Binary Logistic Regression
 Multinomial Logistic Regression
 Ordinal Logistic Regression

For the model to be a cent percent accurate one, we need to calculate and find out a few parameters of the
algorithm in order to check how accurate our Binary Logistic Regression model is.
The key parameters we calculate and check depend on the topic called the CONFUSION MATRIX. The
confusion matrix is a type of table used to define the characteristics of classification problems.

                       Predicted Negative (0)   Predicted Positive (1)
Actual Negative (0)    True Negative            False Positive
Actual Positive (1)    False Negative           True Positive

The below are a few expressions calculated in order to find how accurate the prediction of the model is:
 Accuracy
 Recall
 Precision
 F1 score
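Given the four counts in the confusion matrix, the expressions listed above can be computed directly. The counts below are made up for illustration:

```python
def classification_metrics(tn, fp, fn, tp):
    """Accuracy, recall, precision and F1 score from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    recall = tp / (tp + fn)          # true-positive rate
    precision = tp / (tp + fp)       # positive predictive value
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, recall, precision, f1

# Hypothetical counts: 50 TN, 10 FP, 5 FN, 35 TP
acc, rec, prec, f1 = classification_metrics(tn=50, fp=10, fn=5, tp=35)
print(f"accuracy={acc:.2f} recall={rec:.2f} precision={prec:.2f} F1={f1:.2f}")
```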

The goal of logistic regression is to correctly predict the category of outcome for individual cases using
the most parsimonious model. To accomplish this goal, a model is created that includes all predictor
variables that are useful in predicting the response variable.

Model Theory in Logistic Regression

Logistic regression is basically a supervised classification algorithm. In a classification problem, the
target variable (or output), y, can take only discrete values for a given set of features (or inputs), X.

Logistic regression becomes a classification technique only when a decision threshold is brought into
the picture. The setting of the threshold value is a very important aspect of logistic regression and is
dependent on the classification problem itself.

The decision for the value of the threshold is majorly affected by the values of precision and recall.

1. Low Precision / High Recall: In applications where we want to reduce the number of false negatives
without necessarily reducing the number of false positives, we choose a decision value which has a low
value of precision or a high value of recall.

2. High Precision / Low Recall: In applications where we want to reduce the number of false positives
without necessarily reducing the number of false negatives, we choose a decision value which has a high
value of precision or a low value of recall.
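The trade-off described above can be demonstrated by sweeping the decision threshold over a handful of hypothetical predicted probabilities (both arrays below are made up for illustration):

```python
import numpy as np

# Hypothetical model outputs (probabilities) and true labels
probs  = np.array([0.1, 0.3, 0.4, 0.55, 0.6, 0.8, 0.9])
labels = np.array([0,   0,   1,   0,    1,   1,   1  ])

def precision_recall(threshold):
    """Classify with the given threshold and return (precision, recall)."""
    pred = (probs >= threshold).astype(int)
    tp = np.sum((pred == 1) & (labels == 1))
    fp = np.sum((pred == 1) & (labels == 0))
    fn = np.sum((pred == 0) & (labels == 1))
    return tp / (tp + fp), tp / (tp + fn)

# A low threshold gives high recall; raising it trades recall for precision
for t in (0.3, 0.5, 0.7):
    p, r = precision_recall(t)
    print(f"threshold={t}: precision={p:.2f} recall={r:.2f}")
```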

MODEL FIT STATISTICS

As in linear regression, goodness of fit in logistic regression attempts to get at how well a model fits
the data.

It is usually applied after a "final model" has been selected.

Often in selecting a model no single "final model" is selected; rather, a series of models are fit, each
contributing towards the final inferences and conclusions. In that case, one may wish to see how
well more than one model fits, although it is common to just check the fit of one model.

This is not necessarily bad practice, because if there is a series of "good" models being fit, often the fit
from each will be similar.
It is not clear how to judge the fit of a model that we know is in fact wrong.

Much of the goodness of fit literature is based on hypothesis testing of the following type:
H0 : model is exactly correct
HA : model is not exactly correct

This type of testing provides no useful information. If the null hypothesis is rejected, then we
have learned nothing, because we already knew that it is impossible for any model to be
"exactly correct".

On the other hand, if we do not reject the model, it is almost surely because of a lack of
statistical power, and as the sample size grows larger, we will eventually surely reject H0.

These tests can be seen not only as not useful, but as harmful if non-rejection of a null
hypothesis is misinterpreted as proof that the model "fits well", which of course can be far
from the truth.

Goodness of Fit Measures for Logistic Regression:

The following measures of fit are available, sometimes divided into "global" and "local" measures:
• Chi-square goodness of fit tests and deviance
• Hosmer-Lemeshow tests
• Classification tables
• ROC curves
• Logistic regression R2
• Model validation via an outside data set or by splitting a data set
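As a small illustration of two of these measures, the deviance (−2 × log-likelihood) and one common "logistic regression R²" (McFadden's) can be computed directly from a model's fitted probabilities. The labels and probabilities below are hypothetical:

```python
import numpy as np

y = np.array([1, 0, 1, 1, 0, 0, 1, 0])
p = np.array([0.9, 0.2, 0.7, 0.8, 0.3, 0.1, 0.6, 0.4])  # fitted probabilities

# Log-likelihood of the fitted model and of the null (intercept-only) model
ll_model = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
p_null = y.mean()
ll_null = np.sum(y * np.log(p_null) + (1 - y) * np.log(1 - p_null))

deviance = -2 * ll_model              # smaller deviance = better fit
mcfadden_r2 = 1 - ll_model / ll_null  # 0 = no better than null, 1 = perfect

print(f"deviance={deviance:.3f}  McFadden R^2={mcfadden_r2:.3f}")
```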

MODEL CONSTRUCTION

The process of model-building allows you to select the "best" variable to add to your current regression
model.

The logistic regression model is one of the most widely used models to investigate the independent effect
of a variable on binomial outcomes in the medical literature.

However, the model building strategy is not explicitly stated in many studies, compromising the
reliability and reproducibility of the results. There are a variety of model building strategies
reported in the literature, such as purposeful selection of variables, stepwise selection and
best subsets.

However, no one strategy has been proven to be superior to the others, and model building is
"part science, part statistical methods, and part experience and common sense".

However, the principle of model building is to select as few variables as possible, while the model
(parsimonious model) still reflects the true outcomes of the data.

The aims of modelling are:

• To investigate whether an association exists between the variables of interest

• To measure the strength (as well as direction) of association between the variables

• To study the form of the relationship (if any). Modelling depends on the type of outcome variable:

• For continuous outcome variables, relationships are examined by linear or non-linear regression models

• For categorical outcome variables, logistic regression is usually used to examine possible relationships

Logistic regression is a regression with an outcome variable that is categorical (e.g. success/failure) and
explanatory variables that can be a mix of continuous and categorical variables.

• It addresses the same research questions that multiple regression does

• It predicts which of the two possible events (in case of a binary outcome) is going to happen,
given the information on explanatory variables (e.g. LR can be used to analyse factors that determine
whether an individual is likely to have a certain type of rehab program)

ASSUMPTIONS OF LOGISTIC REGRESSION

• Ratio of cases to variables – discrete variables should have enough responses in every given category.
If there are many cells with no response:

• parameter estimates and standard errors are likely to be unstable

• maximum likelihood estimation (MLE) of parameters could be impossible to obtain

• Linearity in the logit – the regression equation should have a linear relationship with the logit
form of the outcome

• Absence of multicollinearity; no outliers; and independence of errors

TYPES OF LR:

Dichotomous outcome (yes/no; pain/no pain) – Binary logistic regression

Polychotomous outcome (choice of rehabilitation: home/clinic/nursing home/hospital) –
Multinomial logistic regression

Ordered outcome (physical activity: none, low, moderate, high) – Ordinal logistic regression

Analytics applications to various Business Domains

Business analytics (BA) is the practice of iterative, methodical exploration of an organization's
data, with an emphasis on statistical analysis. Business analytics is used by companies
committed to data-driven decision-making.

Types of analytics

 Decision Analytics
 Descriptive Analytics
 Prescriptive Analytics

While the two components of business analytics (business intelligence and advanced analytics) are
sometimes used interchangeably, there are some key differences between these two business
analytics techniques.

Business analytics applications
Business analytics tools come in several different varieties:

 Data visualization tools
 Business intelligence reporting software
 Self-service analytics platforms
 Statistical analysis tools
 Big data platforms

It’sAssociationwithBusinessDomain

 FinancialServiceAnalytics
 MarketingAnalytics
 PricingAnalytics
 RetailsalesAnalytics
 RiskandCreditanalytics
 SupplyChainAnalytics
 Transportationanalytics
 CyberAnalytics
 Enterpriseoptimization

Business Analytics is critical for remaining competitive and achieving success. When you get
BA best practices in place and get buy-in from all stakeholders, your organization will benefit
from data-driven decision making.
Finance
BA is of utmost importance to the finance sector. Data scientists are in high demand in investment
banking, portfolio management, financial planning, budgeting, forecasting, etc.

Marketing
Studying buying patterns of consumer behaviour and analysing trends help in identifying the target
audience, employing advertising techniques that can appeal to the consumers, forecasting supply
requirements, etc.

HRProfessionals
HR professionals can make use of data to find information about educational background of
high performing candidates, employee attrition rate, number of years of service of employees,
age, gender, etc. This information can play a pivotal role in the selection procedure of a
candidate.

CRM
Business Analytics helps one analyse the key performance indicators, which further helps in
decision making and make strategies to boost the relationship with the consumers. The
demographics, and data about other socio-economic factors, purchasing patterns, lifestyle, etc.,
are of prime importance to the CRM department.

Manufacturing
Business Analytics can help you in supply chain management, inventory management, measuring
performance against targets, risk mitigation plans, improving efficiency on the basis of product data, etc.

Credit Card Companies
Credit card transactions of a customer can determine many factors: financial health, lifestyle,
preferences of purchases, behavioural trends, etc.

Linear Regression vs Logistic Regression:

Linear Regression and Logistic Regression are the two famous Machine Learning algorithms which come
under the supervised learning technique.

Since both algorithms are supervised in nature, these algorithms use a labeled dataset to make predictions.

But the main difference between them is how they are used.

Linear Regression is used for solving regression problems whereas Logistic Regression is used for solving
classification problems.

The description of both algorithms is given below, along with a difference table.

Linear Regression:

Linear Regression is one of the simplest Machine Learning algorithms that comes under the supervised
learning technique and is used for solving regression problems.

It is used for predicting the continuous dependent variable with the help of independent variables.

The goal of linear regression is to find the best-fit line that can accurately predict the output for the
continuous dependent variable.

If a single independent variable is used for prediction then it is called Simple Linear Regression, and if
there are two or more independent variables then such regression is called Multiple Linear Regression.

By finding the best-fit line, the algorithm establishes the relationship between the dependent variable and
the independent variable. And the relationship should be of a linear nature.

The output of linear regression should only be continuous values such as price, age, salary, etc. The
relationship between the dependent variable and independent variable can be shown in the below image:

In the above image the dependent variable is on the Y-axis (salary) and the independent variable is on the
X-axis (experience). The regression line can be written as:

y = a0 + a1x + ε

Where a0 and a1 are the coefficients and ε is the error term.

Logistic Regression:

Logistic regression is one of the most popular Machine Learning algorithms that comes under the
supervised learning technique.

It can be used for classification as well as for regression problems, but it is mainly used for
classification problems.

Logistic regression is used to predict the categorical dependent variable with the help of independent
variables.

The output of a logistic regression problem can only be between 0 and 1.

Logistic regression can be used where probabilities between two classes are required, such as whether it
will rain today or not: either 0 or 1, true or false, etc.

Logistic regression is based on the concept of Maximum Likelihood estimation. According to this
estimation, the observed data should be most probable.

In logistic regression, we pass the weighted sum of inputs through an activation function that can map
values between 0 and 1. Such an activation function is known as the sigmoid function and the curve
obtained is called the sigmoid curve or S-curve. Consider the below image:

The equation for logistic regression is:

P(y = 1) = 1 / (1 + e^-(a0 + a1x))
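A minimal sketch of the sigmoid function and how it converts the weighted sum a0 + a1x into a probability; the coefficient values below are hypothetical:

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) activation: maps any real value into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# The weighted sum of inputs is passed through the sigmoid;
# the result is interpreted as the probability of class 1.
a0, a1 = -4.0, 2.0        # hypothetical coefficients
for x in (0.0, 2.0, 4.0):
    p = sigmoid(a0 + a1 * x)
    print(f"x={x}: P(y=1) = {p:.3f}")  # rises along the S-curve
```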

Difference between Linear Regression and Logistic Regression:

Linear Regression                                   | Logistic Regression
----------------------------------------------------|----------------------------------------------------
Used to predict the continuous dependent variable   | Used to predict the categorical dependent variable
using a given set of independent variables.         | using a given set of independent variables.
Used for solving regression problems.               | Used for solving classification problems.
We predict the value of continuous variables.       | We predict the values of categorical variables.
We find the best-fit line, by which we can easily   | We find the S-curve, by which we can classify the
predict the output.                                 | samples.
The least squares estimation method is used for     | The maximum likelihood estimation method is used
estimation of accuracy.                             | for estimation of accuracy.
The output must be a continuous value, such as      | The output must be a categorical value such as
price, age, etc.                                    | 0 or 1, Yes or No, etc.
The relationship between the dependent variable     | A linear relationship between the dependent and
and the independent variable must be linear.        | independent variables is not required.
There may be collinearity between the independent   | There should not be collinearity between the
variables.                                          | independent variables.

Important Questions

1. What is Regression? Explain in detail.
2. What are the major uses of regression analysis?
3. Types of Regression?
4. What is BLUE? Properties of BLUE?
5. How to obtain the best-fit line?
6. Define OLS, MLE.
7. Explain Logistic Regression in detail.
8. Explain model construction in Logistic Regression.
9. Compare Linear and Logistic Regression.
10. What do you mean by the Least squares method?

