FDA Unit 5 Notes
The least-squares method works by minimizing the sum of the squares of the residuals, the vertical distances of the data points from the fitted curve or line, so that the trend of the outcomes can be determined quantitatively.
The least-squares method is a statistical method used to find the line of best fit
of the form of an equation such as y = mx + b to the given data.
The curve of the equation is called the regression line. Our main objective in
this method is to reduce the sum of the squares of errors as much as possible.
This is the reason this method is called the least-squares method.
Even though the least-squares method is considered the best method to find the line
of best fit, it has a few limitations. They are:
This method exhibits only the relationship between the two variables. All other
causes and effects are not taken into consideration.
This method is unreliable when data is not evenly distributed.
This method is very sensitive to outliers. In fact, this can skew the results of the
least-squares analysis.
Least Square Graph
The straight line shows the potential relationship between the independent variable
and the dependent variable. The ultimate goal of this method is to reduce this
difference between the observed response and the response predicted by the
regression line. Less residual means that the model fits better. The data points need
to be minimized by the method of reducing residuals of each point from the line.
3 types of regression
IMPLEMENTATION
Least-square method is the curve that best fits a set of observations with a
minimum sum of squared residuals or errors. Let us assume that the given points of
data are (x1, y1), (x2, y2), (x3, y3), …, (xn, yn) in which all x’s are independent
variables, while all y’s are dependent ones. This method is used to find a linear line
of the form y = mx + b, where y and x are variables, m is the slope, and b is the y-
intercept. The formulas to calculate the slope m and the intercept b are:
m = (n∑xy - ∑x∑y)/(n∑x² - (∑x)²)
b = (∑y - m∑x)/n
Following are the steps to calculate the least square using the above formulas.
Step 1: Draw a table with 4 columns where the first two columns are for x and
y points.
Step 2: In the next two columns, find xy and x².
Step 3: Find ∑x, ∑y, ∑xy, and ∑x².
Step 4: Find the value of slope m using the above formula.
Step 5: Calculate the value of b using the above formula.
Step 6: Substitute the value of m and b in the equation y = mx + b
simple functions that demonstrate linear least squares:
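A minimal sketch of such functions using numpy (these helpers are written here for illustration; they are modeled on, but not taken from, the thinkstats2 functions used later in these notes):

import numpy as np

def least_squares(xs, ys):
    # returns (intercept b, slope m) of the least-squares line y = m*x + b
    xs = np.asarray(xs, dtype=float)
    ys = np.asarray(ys, dtype=float)
    n = len(xs)
    # m = (n*sum(xy) - sum(x)*sum(y)) / (n*sum(x^2) - (sum(x))^2)
    m = (n * np.sum(xs * ys) - xs.sum() * ys.sum()) / (n * np.sum(xs ** 2) - xs.sum() ** 2)
    # b = (sum(y) - m*sum(x)) / n
    b = (ys.sum() - m * xs.sum()) / n
    return b, m

def fit_line(xs, b, m):
    # returns the fitted y values for the given xs
    return b + m * np.asarray(xs, dtype=float)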
Example
Consider the set of points: (1, 1), (-2,-1), and (3, 2). Plot these points and the least-
squares regression line in the same graph.
Here n = 3, ∑x = 2, ∑y = 2, ∑xy = 9, and ∑x² = 14, so
m = (n∑xy - ∑x∑y)/(n∑x² - (∑x)²)
m = (27 - 4)/(42 - 4)
m = 23/38
b = (∑y - m∑x)/n
b = [2 - (23/38)×2]/3
b = [2 -(23/19)]/3
b = 15/(3×19)
b = 5/19
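As a quick check, numpy.polyfit (which performs a least-squares fit and returns the slope and intercept) gives the same values:

import numpy as np

x = np.array([1, -2, 3])
y = np.array([1, -1, 2])

slope, intercept = np.polyfit(x, y, 1)   # degree-1 polynomial fit: slope first, then intercept
print(slope, 23 / 38)                    # both approximately 0.6053
print(intercept, 5 / 19)                 # both approximately 0.2632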
GOODNESS OF FIT
H0: the model M0 fits the data
vs.
H1: the model M0 does not fit (or, some other model MA fits)
Commonly used goodness-of-fit tests include:
1. Chi-square.
2. Kolmogorov-Smirnov.
3. Anderson-Darling.
4. Shapiro-Wilk.
The following measures are used to validate the simple linear regression models.
1. Co-efficient of determination
2. Hypothesis test for the regression coefficient
3. ANOVA test
4. Residual Analysis to validate the regression model
5. Outlier Analysis.
Example
Here's a HypothesisTest for the model that predicts a baby's birth weight from the mother's age:
import numpy as np
import thinkstats2

class SlopeTest(thinkstats2.HypothesisTest):
    def TestStatistic(self, data):
        # the test statistic is the estimated slope of the least-squares fit
        ages, weights = data
        _, slope = thinkstats2.LeastSquares(ages, weights)
        return slope

    def MakeModel(self):
        # under the null hypothesis, birth weight does not depend on age
        _, weights = self.data
        self.ybar = weights.mean()
        self.res = weights - self.ybar

    def RunModel(self):
        # simulate the null hypothesis by permuting the residuals
        ages, _ = self.data
        weights = self.ybar + np.random.permutation(self.res)
        return ages, weights
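A sketch of running the test (assuming ages and weights are arrays of mothers' ages and birth weights; PValue is the simulation-based p-value method provided by thinkstats2.HypothesisTest):

ht = SlopeTest((ages, weights))
pvalue = ht.PValue(iters=1000)   # fraction of simulated slopes at least as large as the observed slope
print(pvalue)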
2. WEIGHTED RESAMPLING
Sampling Vs Resampling
Bootstrapping
Bootstrapping is a statistical method for estimating the sampling
distribution of an estimator by sampling with replacement from the original sample,
most often with the purpose of deriving robust estimates of standard
errors and confidence intervals of a population parameter like
a mean, median, proportion, odds ratio, correlation
coefficient or regression coefficient. It has been called the plug-in principle, as it plugs the corresponding sample statistic in for the unknown population quantity.
For example, when estimating the population mean, this method uses
the sample mean; to estimate the population median, it uses the sample median; to
estimate the population regression line, it uses the sample regression line.
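A minimal sketch of bootstrapping the standard error and a confidence interval of a sample mean with numpy (the sample values are made up for illustration):

import numpy as np

rng = np.random.default_rng(0)
sample = np.array([2.1, 3.5, 2.9, 4.2, 3.8, 2.7, 3.1, 4.0])   # hypothetical observed sample

# draw 1000 bootstrap samples (sampling with replacement) and record their means
boot_means = np.array([rng.choice(sample, size=len(sample), replace=True).mean()
                       for _ in range(1000)])

std_error = boot_means.std()                   # estimated standard error of the sample mean
ci = np.percentile(boot_means, [2.5, 97.5])    # approximate 95% confidence interval
print(std_error, ci)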
Cross validation
Cross-validation is a statistical method for validating a predictive model.
Subsets of the data are held out for use as validating sets; a model is fit to the
remaining data (a training set) and used to predict for the validation set. Averaging
the quality of the predictions across the validation sets yields an overall measure of
prediction accuracy. Cross-validation is employed repeatedly in building decision
trees.
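A minimal sketch of k-fold cross-validation for a simple linear model, using only numpy (the data is synthetic and the fold count is arbitrary):

import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 100)
y = 2.0 * x + 1.0 + rng.normal(0, 1, 100)       # synthetic data: y is roughly 2x + 1

k = 5
folds = np.array_split(rng.permutation(len(x)), k)

errors = []
for i in range(k):
    test_idx = folds[i]
    train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
    m, b = np.polyfit(x[train_idx], y[train_idx], 1)       # fit on the training folds
    pred = m * x[test_idx] + b                              # predict on the held-out fold
    errors.append(np.mean((y[test_idx] - pred) ** 2))       # mean squared prediction error

print(np.mean(errors))    # overall cross-validated prediction error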
Important Terms
Statsmodels is a Python module that provides classes and functions for the estimation of many different statistical models.
For multiple regression we use StatsModels, a Python package that provides several forms of regression and other analyses.
There are four available classes of linear regression models in statsmodels that we can use.
The classes are as follows:
1. Ordinary Least Squares (OLS)
2. Weighted Least Squares (WLS)
3. Generalized Least Squares (GLS)
4. Generalized Least Squares with autoregressive errors (GLSAR)
Example
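A minimal sketch of ordinary least squares with the statsmodels formula interface (the DataFrame and column names here are placeholders, not a specific dataset):

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
df = pd.DataFrame({'x': rng.uniform(0, 10, 50)})
df['y'] = 3.0 * df['x'] + 5.0 + rng.normal(0, 1, 50)    # synthetic response

model = smf.ols('y ~ x', data=df)      # ordinary least squares
results = model.fit()
print(results.params)                  # estimated intercept and slope
print(results.rsquared)                # coefficient of determination
print(results.summary())               # full regression report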
4. MULTIPLE REGRESSION
Multiple regression extends simple linear regression to two or more independent variables, fitting a model of the form
y = m1x1 + m2x2 + m3x3 + m4x4 + … + b
For your model to be reliable and valid, there are some essential requirements, such as a linear relationship between each predictor and the response and little multicollinearity among the predictors.
Example :
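A sketch with two predictors, again using the statsmodels formula interface (synthetic data, placeholder column names):

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
df = pd.DataFrame({'x1': rng.uniform(0, 10, 100),
                   'x2': rng.uniform(0, 5, 100)})
df['y'] = 2.0 * df['x1'] - 1.5 * df['x2'] + 4.0 + rng.normal(0, 1, 100)

results = smf.ols('y ~ x1 + x2', data=df).fit()
print(results.params)      # intercept plus one coefficient per predictor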
2. Prediction or Forecasting.
Nonlinear relationships
5. LOGISTIC REGRESSION
The Logistic Regression equation can be obtained from the Linear Regression equation. The mathematical steps to get the Logistic Regression equation are given below:
o We know the equation of a straight line can be written as:
y = b0 + b1x1 + b2x2 + … + bnxn
o In Logistic Regression, y can be between 0 and 1 only, so we divide the above equation by (1 - y):
y/(1 - y); 0 for y = 0, and infinity for y = 1
o But we need a range between -infinity and +infinity, so we take the logarithm of the expression, which gives the Logistic Regression equation:
log[y/(1 - y)] = b0 + b1x1 + b2x2 + … + bnxn
On the basis of the categories, Logistic Regression can be classified into three types:
o Binomial: the target variable can have only two possible types, for example 0 or 1, pass or fail.
o Multinomial: the target variable can have three or more possible unordered types, for example "cat", "dog", or "sheep".
o Ordinal: the target variable can have three or more possible ordered types, for example "low", "medium", or "high".
6. ESTIMATING PARAMETERS
Unlike linear regression, logistic regression does not have a closed form solution,
so it is solved by guessing an initial solution and improving it iteratively. The usual
goal is to find the maximum-likelihood estimate (MLE), which is the set of
parameters that maximizes the likelihood of the data. For example, suppose we
have the following data:
The goal of logistic regression is to find the parameters that maximize the likelihood of the observed data.
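A sketch of estimating logistic regression parameters by maximum likelihood with statsmodels (synthetic data; the iterative optimization is handled internally by the library):

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)
df = pd.DataFrame({'x': rng.uniform(-3, 3, 200)})
p = 1 / (1 + np.exp(-(0.5 + 1.2 * df['x'])))    # true probabilities from a known logistic model
df['y'] = rng.binomial(1, p)                    # 0/1 outcomes

results = smf.logit('y ~ x', data=df).fit()     # iteratively maximizes the log-likelihood
print(results.params)                           # estimated b0 and b1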
7. TIME SERIES ANALYSIS
To perform time series analysis, we follow a sequence of steps: collect and clean the data, visualize it against time, check for stationarity, build a model, and use it for forecasting.
TSA is the backbone for prediction and forecasting analysis, specific to time-based
problem statements.
With the help of “Time Series,” we can prepare numerous time-based analyses and
results.
A time series is usually described in terms of four components:
Trend: A long-term movement with no fixed interval; the overall direction of the data across the continuous timeline. The trend can be positive, negative, or null.
Seasonality: Regular shifts that repeat at a fixed interval across the continuous timeline; the pattern often looks like a bell curve or a saw tooth.
Cyclical: Movements with no fixed interval, with uncertainty in their timing and pattern.
Irregularity: Unexpected situations, events, or spikes over a short time span.
Time series analysis has the following limitations, which we have to keep in mind during data analysis:
Like most other models, TSA does not support missing values.
The relationship between the data points is assumed to be linear.
Data transformations are mandatory, which makes the analysis somewhat expensive.
The models mostly work on univariate data.
8. MOVING AVERAGES
In the method of moving averages, successive arithmetic averages are computed from overlapping groups of successive values of a time series. Each group includes all the observations in a given time interval, termed the period of the moving average. Moving averages can be used for data preparation, feature engineering, and forecasting.
The trend and seasonal variations can be used to help make predictions
about the future – and as such can be very useful when budgeting and forecasting.
Calculating moving averages
One method of establishing the underlying trend (smoothing out peaks and
troughs) in a set of data is using the moving averages technique. Other methods,
such as regression analysis can also be used to estimate the trend. Regression
analysis is dealt with in a separate article.
A moving average is a series of averages, calculated from historic data.
Moving averages can be calculated for any number of time periods, for example a
three-month moving average, a seven-day moving average, or a four-quarter
moving average. The basic calculations are the same.
The following simplified example will take us through the calculation
process.
Monthly sales revenue data (in $'000) were collected for a company for 20X2.
From this data, we will calculate a three-month moving average, since the data shows a basic cycle that follows a three-month pattern.
Add together the first three sets of data, for this example it would be January,
February and March. This gives a total of (125+145+186) = 456. Put this total in
the middle of the data you are adding, so in this case across from February. Then
calculate the average of this total, by dividing this figure by 3 (the figure you
divide by will be the same as the number of time periods you have added in your
total column). Our three-month moving average is therefore (456 ÷ 3) = 152.
The average needs to be calculated for each three-month period. To do this you
move your average calculation down one month, so the next calculation will
involve February, March and April. The total for these three months would be
(145+186+131) = 462 and the average would be (462 ÷ 3) = 154.
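A sketch of the same calculation with pandas (only the four months quoted above are included; a centred three-period rolling mean reproduces the 152 and 154 figures):

import pandas as pd

sales = pd.Series([125, 145, 186, 131],
                  index=['Jan', 'Feb', 'Mar', 'Apr'])    # sales revenue in $'000

trend = sales.rolling(window=3, center=True).mean()      # three-month moving average, centred
print(trend)    # Feb -> 152.0, Mar -> 154.0; the end months have no complete three-month window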
The three-month moving average represents the trend. From our example we
can see a clear trend in that each moving average is $2,000 higher than the
preceding month moving average. This suggests that the sales revenue for the
company is, on average, growing at a rate of $2,000 per month.
This trend can now be used to predict future underlying sales values.
Once a trend has been established, any seasonal variation can be calculated.
The seasonal variation can be assumed to be the difference between the actual sales
and the trend (three-month moving average) value.
A negative variation means that the actual figure in that period is less than the
trend and a positive figure means that the actual is more than the trend.
9. MISSING VALUES
Real-world data often contains many missing values. There might be different reasons why each value is missing.
There might be loss or corruption of data, or there might be specific reasons as well.
The missing data will decrease the predictive power of your model.
If you apply algorithms with missing data, then there will be bias in the
estimation of parameters.
You cannot be confident about your results if you don’t handle missing data.
Check for missing values
When you have a dataset, the first step is to check which columns have
missing data and how many.
The isnull() function is used for this. When you call the sum() function along with isnull(), the output is the total count of missing values in each column.
missing_values = train.isnull().sum()   # train is the DataFrame being checked
print(missing_values)
Dropping rows with missing values
It is a simple method, where we drop all the rows that have any missing
values belonging to a particular column. As easy as this is, it comes with a huge
disadvantage. You might end up losing a huge chunk of your data. This will reduce
the size of your dataset and make your model predictions biased. You should use
this only when the number of missing values is very small.
We can drop rows that contain missing values using the dropna() function.
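A short self-contained sketch (the small DataFrame here is made up for illustration):

import pandas as pd
import numpy as np

train = pd.DataFrame({'age': [25, np.nan, 31], 'weight': [7.1, 6.8, np.nan]})   # toy data

print(train.isnull().sum())        # number of missing values per column
train_clean = train.dropna()       # drop every row that has any missing value
print(train_clean)                 # only the first, complete row survives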
10. SERIAL CORRELATION
Serial correlation occurs in a time series when a variable and a lagged
version of itself (for instance a variable at times T and at T-1) are observed to be
correlated with one another over periods of time. Repeating patterns often show
serial correlation when the level of a variable affects its future level. In finance,
this correlation is used by technical analysts to determine how well the past price
of a security predicts the future price.
Serial correlation is similar to the statistical concepts of autocorrelation or lagged
correlation.
We can shift the time series by an interval called a lag, and then compute the correlation of the shifted series with the original:
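A minimal sketch of that computation with pandas (the series here is synthetic and only for illustration):

import pandas as pd
import numpy as np

rng = np.random.default_rng(5)
series = pd.Series(np.cumsum(rng.normal(size=200)))   # synthetic series with strong serial correlation

# shift the series by one lag and correlate the result with the original
lag = 1
shifted = series.shift(lag)
print(series.corr(shifted))        # pairs each value with the value one step earlier

# pandas also provides this computation directly
print(series.autocorr(lag=1))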
AUTOCORRELATION
Autocorrelation refers to the degree of correlation of the same variables between
two successive time intervals. It measures how the lagged version of the value of a
variable is related to the original version of it in a time series. Autocorrelation, as a
statistical concept, is also known as serial correlation.
We calculate the correlation between two versions of the same time series, Xt and Xt-k.
Given time-series measurements Y1, Y2, …, YN at times X1, X2, …, XN, the lag-k autocorrelation function is defined as:
rk = ∑(Yi − Ȳ)(Yi+k − Ȳ) / ∑(Yi − Ȳ)²
where Ȳ is the mean of the series, the sum in the numerator runs over i = 1 to N − k, and the sum in the denominator runs over i = 1 to N.
Usage:
An autocorrelation test is used to detect randomness in the time-series. In
many statistical processes, our assumption is that the data generated is random.
For checking randomness, we need to check for the autocorrelation of lag 1.
To determine whether there is a relationship between past and future values of the time series, we compute the correlation at different lags.
The Durbin-Watson test is commonly used to test for autocorrelation in the residuals of a regression. Its test statistic is:
d = ∑(et − et−1)² / ∑et²
where et is the residual (error) at time t from the Ordinary Least Squares (OLS) fit, the numerator sums over t = 2 to T, and the denominator sums over t = 1 to T.
The null hypothesis and alternate hypothesis for the Durbin-Watson test are:
H0: there is no first-order autocorrelation in the residuals.
H1: first-order autocorrelation is present in the residuals.
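A sketch using the durbin_watson function from statsmodels on the residuals of an OLS fit (synthetic data; values near 2 indicate little autocorrelation):

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(6)
x = rng.uniform(0, 10, 100)
y = 1.5 * x + rng.normal(0, 1, 100)

X = sm.add_constant(x)                 # add an intercept column
results = sm.OLS(y, X).fit()
print(durbin_watson(results.resid))    # close to 2 when residuals show no first-order autocorrelation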
The autocorrelation function is a function that maps from lag to the serial correlation with the given lag. "Autocorrelation" is another name for serial correlation, used more often when the lag is not 1.
acf computes serial correlations with lags from 0 through nlags. The unbiased flag tells acf to correct the estimates for the sample size. The result is an array of correlations.
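A sketch of calling acf from statsmodels (note that in recent statsmodels versions the sample-size correction flag is named adjusted rather than unbiased):

import numpy as np
from statsmodels.tsa.stattools import acf

rng = np.random.default_rng(7)
series = np.cumsum(rng.normal(size=300))      # synthetic series with strong autocorrelation

correlations = acf(series, nlags=10, adjusted=True, fft=False)
print(correlations)    # correlations[0] is 1.0 (lag 0); correlations[1] is the lag-1 serial correlation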
11. SURVIVAL ANALYSIS
Introduction
Survival analysis is a powerful tool for estimating survival probabilities over time. This section introduces its significance and applications, covering the fundamentals of the Kaplan-Meier estimator, its role in handling censored data, and how it is used to create survival curves.
1. Right Censored
Right censoring appears in many problems. It happens when we are not certain what happened to subjects after a certain point in time: the true event time t is greater than the censoring time c (c < t). This happens when some people cannot be followed for the entire study because they died, were lost to follow-up, or withdrew from the study.
2. Left Censored
Left censoring is the opposite: we are not certain what happened to subjects before some point in time. It occurs when the true event time t is less than the censoring time c (t < c).
3. Interval Censored
Interval censoring is a combination of left and right censoring, where the event is known to have occurred between two time points.
Survival Function S(t): A probability function of the time t under study. It gives the probability that the subject survives beyond time t, that is, the probability that the random variable T exceeds the specified time t: S(t) = P(T > t).
The Kaplan-Meier estimator is used to estimate the survival function for lifetime data. It is a non-parametric statistical technique, also known as the product-limit estimator. The idea is to estimate the time until an event of interest occurs, such as death in a major medical trial, failure of a machine, or any other significant event.
Typical questions it can help answer include:
How much time will it take for a COVID-19 vaccine to cure a patient?
How much time is required to recover after a medical diagnosis?
How many patients survive lung cancer over a given period of time?
To compute the Kaplan-Meier estimate, we first need the survival function S(t), the probability of surviving beyond time t. The estimator is:
S(t) = ∏ over ti ≤ t of (1 − di/ni)
where di is the number of death events at time ti, and ni is the number of subjects at risk of death just prior to time ti.
In real-life cases, we do not have an idea of the true survival rate function. So in
Kaplan Meier Estimator we estimate and approximate the true survival function
from the study data. There are three assumptions of Kaplan-Meier survival analysis:
Survival probabilities are the same for participants who joined the study late and for those who joined early; survival prospects are assumed not to change over the course of the study.
The event of interest occurs at the time specified.
Censoring does not depend on the outcome; the Kaplan-Meier method assumes the reason a participant is censored is unrelated to the outcome of interest.
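A minimal sketch using the lifelines package (assuming it is installed; the durations and event flags below are made-up values purely for illustration):

from lifelines import KaplanMeierFitter

# hypothetical follow-up times (in months) and event flags (1 = event observed, 0 = censored)
durations = [5, 6, 6, 2, 4, 4, 3, 8, 10, 12]
events    = [1, 0, 1, 1, 1, 0, 1, 0, 1, 1]

kmf = KaplanMeierFitter()
kmf.fit(durations, event_observed=events)

print(kmf.survival_function_)      # estimated S(t) at each observed time
print(kmf.median_survival_time_)   # time at which the estimated survival probability drops to 0.5
# kmf.plot_survival_function()     # draws the survival curve (requires matplotlib)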