FDA Unit 5 - Notes

Fundamentals of Data Science and Analysis (Anna University)


UNIT V PREDICTIVE ANALYTICS 09


Linear least squares – implementation – goodness of fit – testing a linear model –
weighted resampling. Regression using StatsModels – multiple regression –
nonlinear relationships – logistic regression – estimating parameters – Time series
analysis – moving averages – missing values – serial correlation – autocorrelation.
Introduction to survival analysis.

1. LINEAR LEAST SQUARES

 The least squares method is the process of finding a regression line, or best-fitted
line, for any data set that is described by an equation.

 The method works by minimizing the sum of the squares of the residuals (the
offsets of the points from the curve or line), so that the trend of the outcomes
is found quantitatively.

 The least-squares method is a statistical method used to find the line of best fit
of the form of an equation such as y = mx + b to the given data.


 The curve of the equation is called the regression line. Our main objective in
this method is to reduce the sum of the squares of errors as much as possible.
This is the reason this method is called the least-squares method.


Limitations for Least Square Method

Even though the least-squares method is considered the best method to find the line
of best fit, it has a few limitations. They are:

 This method exhibits only the relationship between the two variables. All other
causes and effects are not taken into consideration.
 This method is unreliable when data is not evenly distributed.
 This method is very sensitive to outliers. In fact, this can skew the results of the
least-squares analysis.
Least Square Graph

The straight line shows the potential relationship between the independent variable
and the dependent variable. The ultimate goal of this method is to reduce this
difference between the observed response and the response predicted by the


regression line. Less residual means that the model fits better; the method works by
reducing the residual of each data point from the line.

3 types of regression

1. Vertical distance – Direct Regression

2. Horizontal distance – Reverse Regression

3. Perpendicular distance – Major Axis Regression


IMPLEMENTATION

The least-squares method finds the line that best fits a set of observations with a
minimum sum of squared residuals or errors. Let us assume that the given points of
data are (x1, y1), (x2, y2), (x3, y3), …, (xn, yn), in which all x's are independent
variables, while all y's are dependent ones. This method is used to find a line
of the form y = mx + b, where y and x are variables, m is the slope, and b is the y-
intercept. The formula to calculate slope m and the value of b is given by:

m = (n∑xy − ∑x∑y) / (n∑x² − (∑x)²)

b = (∑y - m∑x)/n

Here, n is the number of data points.

Following are the steps to calculate the least square using the above formulas.

 Step 1: Draw a table with 4 columns where the first two columns are for x and
y points.
 Step 2: In the next two columns, find xy and x².
 Step 3: Find ∑x, ∑y, ∑xy, and ∑x².
 Step 4: Find the value of slope m using the above formula.
 Step 5: Calculate the value of b using the above formula.
 Step 6: Substitute the values of m and b in the equation y = mx + b.

The following simple function demonstrates linear least squares:

def LeastSquares(xs, ys):
    # Mean and variance of the explanatory variable.
    meanx, varx = MeanVar(xs)
    meany = Mean(ys)
    # The slope is the covariance of xs and ys divided by the variance of xs.
    slope = Cov(xs, ys, meanx, meany) / varx
    # The fitted line passes through the point (meanx, meany).
    inter = meany - slope * meanx
    return inter, slope
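As a cross-check, NumPy can fit the same line directly with polyfit, which returns the coefficients from the highest degree down (a small self-contained sketch, using the three points from the worked example that follows):

import numpy as np

# Fit a degree-1 polynomial (a straight line) to the example points.
slope, inter = np.polyfit([1, -2, 3], [1, -1, 2], deg=1)
print(slope, inter)   # ≈ 0.6053 (= 23/38) and ≈ 0.2632 (= 5/19)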

Example

Consider the set of points: (1, 1), (-2,-1), and (3, 2). Plot these points and the least-
squares regression line in the same graph.

Solution: There are three points, so the value of n is 3. Here ∑x = 1 − 2 + 3 = 2,
∑y = 1 − 1 + 2 = 2, ∑xy = 1 + 2 + 6 = 9, and ∑x² = 1 + 4 + 9 = 14.

Now, find the value of m, using the formula.

m = (n∑xy − ∑x∑y) / (n∑x² − (∑x)²)

m = [(3×9) − (2×2)] / [(3×14) − (2)²]

m = (27 - 4)/(42 - 4)

m = 23/38

Now, find the value of b using the formula,


b = (∑y - m∑x)/n

b = [2 - (23/38)×2]/3

b = [2 -(23/19)]/3

b = 15/(3×19)

b = 5/19

So, the required equation of least squares is y = mx + b = (23/38)x + 5/19.

GOODNESS OF FIT

A goodness-of-fit test, in general, measures how well the observed data correspond
to the fitted (assumed) model. The goodness-of-fit test compares the observed
values to the expected (fitted or predicted) values.

A goodness-of-fit statistic tests the following hypothesis:


H0: the model M0 fits

vs.

H1: the model M0 does not fit (or, some other model MA fits)

Goodness of fit tests commonly used in statistics are:

1. Chi-square.
2. Kolmogorov-Smirnov.
3. Anderson-Darling.
4. Shapiro-Wilk.

Testing a Linear Model


The following measures are used to validate the simple linear regression models.

1. Coefficient of determination
2. Hypothesis test for the regression coefficient
3. ANOVA test
4. Residual Analysis to validate the regression model
5. Outlier Analysis.

Example

Here's a HypothesisTest for the model: predict the birth weight of a baby from the
mother's age. The test statistic is the slope of the fitted line; the null hypothesis
is simulated by permuting the residuals.

class SlopeTest(thinkstats2.HypothesisTest):

    def TestStatistic(self, data):
        # The observed slope of the least-squares fit.
        ages, weights = data
        _, slope = thinkstats2.LeastSquares(ages, weights)
        return slope

    def MakeModel(self):
        # Under the null hypothesis, weights are just noise around their mean.
        _, weights = self.data
        self.ybar = weights.mean()
        self.res = weights - self.ybar

    def RunModel(self):
        # Shuffle the residuals to simulate data under the null hypothesis.
        ages, _ = self.data
        weights = self.ybar + np.random.permutation(self.res)
        return ages, weights


2. WEIGHTED RESAMPLING

Resampling is a series of techniques used in statistics to gather more


information about a sample. This can include retaking a sample or estimating its
accuracy. With these additional techniques, resampling often improves the overall
accuracy and estimates any uncertainty within a population.

Sampling Vs Resampling

Sampling is the process of selecting certain groups within a population to
gather data. Resampling involves drawing further samples from within that group:
this can mean re-testing the same sample, or reselecting samples that can provide
more information about a population.

Resampling methods are:

1. Permutation tests (also re-randomization tests)


2. Bootstrapping
3. Cross validation
Permutation tests
Permutation tests rely on resampling the original data assuming the null
hypothesis. Based on the resampled data it can be concluded how likely the
original data is to occur under the null hypothesis.

Bootstrapping
Bootstrapping is a statistical method for estimating the sampling
distribution of an estimator by sampling with replacement from the original sample,
most often with the purpose of deriving robust estimates of standard
errors and confidence intervals of a population parameter like
a mean, median, proportion, odds ratio, correlation
coefficient or regression coefficient. It has been called the plug-in principle, as it is


the method of estimation of functionals of a population distribution by evaluating


the same functionals at the empirical distribution based on a sample.

For example, when estimating the population mean, this method uses
the sample mean; to estimate the population median, it uses the sample median; to
estimate the population regression line, it uses the sample regression line.

Cross validation
Cross-validation is a statistical method for validating a predictive model.
Subsets of the data are held out for use as validating sets; a model is fit to the
remaining data (a training set) and used to predict for the validation set. Averaging
the quality of the predictions across the validation sets yields an overall measure of
prediction accuracy. Cross-validation is employed repeatedly in building decision
trees.

In weighted sampling, each element is given a weight, where the probability of an


element being selected is based on its weight. As an example, if you survey
100,000 people in a country of 300 million, each respondent represents 3,000
people.
If you oversample one group by a factor of 2, each person in the oversampled
group would have a lower weight, about 1,500. To correct for oversampling, we
can use weighted resampling, drawing rows with probability proportional to their
sampling weights. As an example, we can estimate mean birth weight with and
without sampling weights, as sketched below.
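A minimal sketch of weighted resampling, assuming a DataFrame df with a sampling-weight column (the column names finalwgt and totalwgt_lb are assumptions here, following the birth-weight example used earlier):

import numpy as np

def ResampleRowsWeighted(df, column='finalwgt'):
    # Draw len(df) rows with replacement, with probability
    # proportional to each row's sampling weight.
    weights = df[column] / df[column].sum()
    indices = np.random.choice(df.index, size=len(df), replace=True, p=weights)
    return df.loc[indices]

# Hypothetical usage: compare the plain mean with the weighted-resample mean.
# unweighted = df['totalwgt_lb'].mean()
# weighted = ResampleRowsWeighted(df)['totalwgt_lb'].mean()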



3.REGRESSION USING STATSMODELS

Statsmodels is a Python module that provides classes and functions for the
estimation of many different statistical models.

For multiple regression use StatsModels, a Python package that provides several
forms of regression and other analyses.

There are four available classes of linear regression model in statsmodels:
1. Ordinary Least Squares (OLS)
2. Weighted Least Squares (WLS)
3. Generalized Least Squares (GLS)
4. Generalized Least Squares with autoregressive errors (GLSAR)


Example
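A minimal sketch of a simple regression fit with StatsModels (the DataFrame live and its columns agepreg and totalwgt_lb are assumptions, following the birth-weight example used earlier):

import statsmodels.formula.api as smf

# Fit birth weight as a linear function of mother's age.
model = smf.ols('totalwgt_lb ~ agepreg', data=live)
results = model.fit()
print(results.summary())   # slope, intercept, R-squared, p-values, etc.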

4.MULTIPLE REGRESSION

Multiple regression is an extension of simple linear regression. Multiple
linear regression is a statistical technique that models more complex relationships
between two or more independent variables and one dependent variable. It is used
when there are two or more x variables.

Assumptions of Multiple Linear Regression


In multiple linear regression, the dependent variable is the outcome or result
you're trying to predict. The independent variables are the things that explain your
dependent variable. You can use them to build a model that accurately predicts
your dependent variable from the independent variables.

For your model to be reliable and valid, there are some essential requirements:

 The independent and dependent variables are linearly related.


 There is no strong correlation between the independent variables.

 Residuals have a constant variance.

 Observations should be independent of one another.

 It is important that all variables follow multivariate normality.

y = m1x1 + m2x2 + m3x3 + m4x4 + … + b
Example :
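A sketch of a multiple regression with two explanatory variables (the column names are again assumptions; C() marks a variable as categorical):

import statsmodels.formula.api as smf

# Add a second explanatory variable to the birth-weight model.
# Assumed columns: totalwgt_lb, agepreg, and babysex (a category code).
results = smf.ols('totalwgt_lb ~ agepreg + C(babysex)', data=live).fit()
print(results.params)   # one coefficient per variable, plus the intercept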

The Difference Between Linear and Multiple Regression


When predicting a complex process's outcome, it is best to use multiple
linear regression instead of simple linear regression.

A simple linear regression can accurately capture the relationship between


two variables in simple relationships. On the other hand, multiple linear regression
can capture more complex interactions that require more thought.


A multiple regression model uses more than one independent variable. It


does not suffer from the same limitations as the simple regression equation, and it
is thus able to fit curved and non-linear relationships. The following are the uses of
multiple linear regression.

1. Planning and Control.

2. Prediction or Forecasting.

Estimating relationships between variables can be exciting and useful. As


with all other regression models, the multiple regression model assesses
relationships among variables in terms of their ability to predict the value of the
dependent variable.


Nonlinear relationships
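A standard way to capture a nonlinear relationship while staying within the linear regression framework is to add a transformed copy of an explanatory variable, such as its square. A sketch under the same assumed column names:

import statsmodels.formula.api as smf

# Add a quadratic term to capture curvature in the age effect.
live['agepreg2'] = live['agepreg'] ** 2
results = smf.ols('totalwgt_lb ~ agepreg + agepreg2', data=live).fit()
print(results.params)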


5.LOGISTIC REGRESSION

o Logistic regression is one of the most popular Machine Learning algorithms,


which comes under the Supervised Learning technique. It is used for
predicting the categorical dependent variable using a given set of
independent variables.
o Logistic regression predicts the output of a categorical dependent variable.
Therefore the outcome must be a categorical or discrete value. It can be
either Yes or No, 0 or 1, True or False, etc., but instead of giving the exact
value as 0 or 1, it gives the probabilistic values which lie between 0 and 1.
o Logistic regression is much like linear regression except in how it is used.
Linear regression is used for solving regression problems, whereas logistic
regression is used for solving classification problems.
o In Logistic regression, instead of fitting a regression line, we fit an "S"
shaped logistic function, which predicts two maximum values (0 or 1).
o The curve from the logistic function indicates the likelihood of something
such as whether the cells are cancerous or not, a mouse is obese or not based
on its weight, etc.
o Logistic Regression is a significant machine learning algorithm because it
has the ability to provide probabilities and classify new data using
continuous and discrete datasets.
o Logistic Regression can be used to classify the observations using different
types of data and can easily determine the most effective variables used for
the classification.


Logistic Function (Sigmoid Function):

o The sigmoid function is a mathematical function used to map the predicted


values to probabilities.
o It maps any real value into another value within a range of 0 and 1.
o The value of the logistic regression must be between 0 and 1, which cannot
go beyond this limit, so it forms a curve like the "S" form. The S-form curve
is called the Sigmoid function or the logistic function.
o In logistic regression, we use the concept of a threshold value, which
defines the boundary between predicting 0 or 1: values above the threshold
tend to 1, and values below the threshold tend to 0.

Assumptions for Logistic Regression:

o The dependent variable must be categorical in nature.


o The independent variables should not exhibit multicollinearity.

Logistic Regression Equation:

The Logistic regression equation can be obtained from the Linear Regression
equation. The mathematical steps to get Logistic Regression equations are given
below:

o We know the equation of the straight line can be written as:

y = b0 + b1x1 + b2x2 + … + bnxn

o In Logistic Regression y can be between 0 and 1 only, so for this let's divide
the above equation by (1 − y):

y / (1 − y); 0 for y = 0, and infinity for y = 1

o But we need a range between −infinity and +infinity; taking the logarithm of
the equation, it becomes:

log[y / (1 − y)] = b0 + b1x1 + b2x2 + … + bnxn

The above equation is the final equation for Logistic Regression.
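The sigmoid that maps the linear combination back to a probability can be written in a few lines (a self-contained sketch):

import numpy as np

def sigmoid(z):
    # Map any real value z into the interval (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0))    # 0.5: the usual decision threshold
print(sigmoid(4))    # ≈ 0.982: large positive values tend toward 1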

Type of Logistic Regression:

On the basis of the categories, Logistic Regression can be classified into three
types:

o Binomial: In binomial Logistic regression, there can be only two possible


types of the dependent variables, such as 0 or 1, Pass or Fail, etc.
o Multinomial: In multinomial Logistic regression, there can be 3 or more
possible unordered types of the dependent variable, such as "cat", "dogs", or
"sheep"
o Ordinal: In ordinal Logistic regression, there can be 3 or more possible
ordered types of dependent variables, such as "low", "Medium", or "High".

6.ESTIMATING PARAMETERS
Unlike linear regression, logistic regression does not have a closed-form solution,
so it is solved by guessing an initial solution and improving it iteratively. The usual
goal is to find the maximum-likelihood estimate (MLE), which is the set of
parameters that maximizes the likelihood of the data.


The goal of logistic regression is to find the parameters that maximize this likelihood.
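In practice the iterative search is handled by the library. A sketch with StatsModels (the DataFrame df, the binary outcome column boy, and the predictor agepreg are assumptions):

import statsmodels.formula.api as smf

# Fit a logistic model by maximum likelihood; the solver iterates
# from an initial guess until the likelihood stops improving.
model = smf.logit('boy ~ agepreg', data=df)
results = model.fit()
print(results.params)   # maximum-likelihood parameter estimates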

7.TIME SERIES ANALYSIS

Time series analysis is a specific way of analyzing a sequence of data points


collected over an interval of time. In time series analysis, analysts record data
points at consistent intervals over a set period of time rather than just recording the
data points intermittently or randomly.
Characteristics of Time Series Analysis:
 Homogeneous data
 Values that vary with time
 Data recorded over a reasonable period
 Gaps between time values must be equal.
Objective of Time series Analysis:
 To evaluate past performance in respect of a particular period.
 To make future forecasts in respect of the particular variable.
 To chart short-term and long-term strategies of the business in respect of a
particular variable.

How to Analyze Time Series?

To perform the time series analysis, we have to follow the following steps:


 Collecting the data and cleaning it


 Preparing Visualization with respect to time vs key feature
 Observing the stationarity of the series
 Developing charts to understand its nature.
 Extracting insights from prediction

Significance of Time Series:

TSA is the backbone for prediction and forecasting analysis, specific to time-based
problem statements.

 Analyzing the historical dataset and its patterns


 Understanding and matching the current situation with patterns derived from the
previous stage.
 Understanding the factor or factors influencing certain variable(s) in different
periods.

With the help of “Time Series,” we can prepare numerous time-based analyses and
results.

 Forecasting: Predicting any value for the future.


 Segmentation: Grouping similar items together.
 Classification: Classifying a set of items into given classes.
 Descriptive analysis: Analysis of a given dataset to find out what is there in it.
 Intervention analysis: Effect of changing a given variable on the outcome.

Components of Time Series Analysis:

Let’s look at the various components of Time Series Analysis:


 Trend: The long-term movement of the series; there is no fixed interval, and any
divergence within the given dataset is over a continuous timeline. The trend can be
negative, positive, or null.
 Seasonality: Regular, fixed-interval shifts within the dataset over a continuous
timeline; the pattern may look like a bell curve or a saw tooth.
 Cyclical: Movements with no fixed interval and uncertainty in their pattern.
 Irregularity: Unexpected situations, events, or scenarios and spikes in a short time
span.

What Are the Limitations of Time Series Analysis?

Time series has the below-mentioned limitations; we have to take care of those
during our data analysis.

 Like most other models, TSA does not support missing values.
 The data points must be linear in their relationship.
 Data transformations are mandatory, so the analysis can be a little expensive.
 Models mostly work on univariate data.

A time series is a sequence of measurements from a system that varies in time.

The following code reads such data into a pandas DataFrame.
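A minimal sketch of reading time-series data into pandas (the file name sales.csv and its columns date and sales are assumptions):

import pandas as pd

# Read a CSV with a date column, parsing it as a DatetimeIndex.
df = pd.read_csv('sales.csv', parse_dates=['date'], index_col='date')
print(df.head())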


8.MOVING AVERAGES
In the method of moving average, successive arithmetic averages are
computed from overlapping groups of successive values of a time series.
Each group includes all the observations in a given time interval, termed the
period of the moving average. Moving averages can be used for data preparation,
feature engineering, and forecasting.
The trend and seasonal variations can be used to help make predictions
about the future, and as such can be very useful when budgeting and forecasting.
Calculating moving averages
One method of establishing the underlying trend (smoothing out peaks and
troughs) in a set of data is using the moving averages technique. Other methods,
such as regression analysis can also be used to estimate the trend. Regression
analysis is dealt with in a separate article.
A moving average is a series of averages, calculated from historic data.
Moving averages can be calculated for any number of time periods, for example a
three-month moving average, a seven-day moving average, or a four-quarter
moving average. The basic calculations are the same.
The following simplified example will take us through the calculation
process.
Monthly sales revenue data were collected for a company for 20X2:


From this data, we will calculate a three-month moving average, as we can see a
basic cycle that follows a three-monthly pattern.

Calculate the three-month moving average.

Add together the first three sets of data, for this example it would be January,
February and March. This gives a total of (125+145+186) = 456. Put this total in
the middle of the data you are adding, so in this case across from February. Then
calculate the average of this total, by dividing this figure by 3 (the figure you
divide by will be the same as the number of time periods you have added in your
total column). Our three-month moving average is therefore (456 ÷ 3) = 152.


The average needs to be calculated for each three-month period. To do this you
move your average calculation down one month, so the next calculation will
involve February, March and April. The total for these three months would be
(145+186+131) = 462 and the average would be (462 ÷ 3) = 154.
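In pandas, the same three-month moving average can be computed directly (a sketch using the four months of figures quoted above, in $000):

import pandas as pd

sales = pd.Series([125, 145, 186, 131],
                  index=['Jan', 'Feb', 'Mar', 'Apr'])

# center=True places each three-month average against the middle
# month, matching the manual calculation above.
trend = sales.rolling(window=3, center=True).mean()
print(trend)   # Feb: 152.0, Mar: 154.0 (Jan and Apr have no full window)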

Calculate the trend

The three-month moving average represents the trend. From our example we
can see a clear trend in that each moving average is $2,000 higher than the
preceding month moving average. This suggests that the sales revenue for the
company is, on average, growing at a rate of $2,000 per month.

This trend can now be used to predict future underlying sales values.

Calculate the seasonal variation

Once a trend has been established, any seasonal variation can be calculated.
The seasonal variation can be assumed to be the difference between the actual sales
and the trend (three-month moving average) value.


A negative variation means that the actual figure in that period is less than the
trend and a positive figure means that the actual is more than the trend.

9.MISSING VALUES
Data in the real world often has missing values. There might be a
different reason why each value is missing: loss or corruption of data, or other
specific reasons.
Missing data decreases the predictive power of your model.
If you apply algorithms to data with missing values, there will be bias in the
estimation of parameters.
You cannot be confident about your results if you don't handle missing data.
Check for missing values
When you have a dataset, the first step is to check which columns have
missing data and how many. The isnull() function is used for this; when you call
the sum() function along with isnull(), the output is the total count of missing
data in each column.

missing_values = train.isnull().sum()
print(missing_values)
Dropping rows with missing values
It is a simple method, where we drop all the rows that have any missing
values in a particular column. As easy as this is, it comes with a huge
disadvantage: you might end up losing a huge chunk of your data. This will reduce
the size of your dataset and make your model predictions biased. You should use
it only when the number of missing values is very small.

We can drop rows using the dropna() function:
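A sketch:

# Drop every row that contains at least one missing value.
df_dropped = df.dropna()
print(df_dropped.shape)   # compare with df.shape to see how many rows were lost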


Handle missing values in Time series data


The datasets where information is collected along with timestamps in an orderly
fashion are denoted as time-series data.
1. Forward-fill missing values
The last valid value from a previous row is carried forward to fill the missing
value. 'ffill' stands for 'forward fill'. It is very easy to implement: you just have
to pass the "method" parameter as "ffill" in the fillna() function.
forward_filled = df.fillna(method='ffill')
print(forward_filled)

2. Backward-fill missing values

Here, we use the value of the next row to fill the missing value. 'bfill' stands
for 'backward fill'. Here, you need to pass 'bfill' as the method parameter.
backward_filled = df.fillna(method='bfill')
print(backward_filled)

10.SERIAL CORRELATION
Serial correlation occurs in a time series when a variable and a lagged
version of itself (for instance a variable at times T and at T-1) are observed to be
correlated with one another over periods of time. Repeating patterns often show
serial correlation when the level of a variable affects its future level. In finance,
this correlation is used by technical analysts to determine how well the past price
of a security predicts the future price.
Serial correlation is similar to the statistical concepts of autocorrelation or lagged
correlation.


KEY TAKEAWAYS

 Serial correlation is the relationship between a given variable and a lagged


version of itself over various time intervals.
 It measures the relationship between a variable's current value given its past
values.
 A variable that is serially correlated indicates that it may not be random.
 Technical analysts validate the profitable patterns of a security or group of
securities and determine the risk associated with investment opportunities.

we can shift the time series by an interval called a lag, and then compute the
correlation of the shifted series with the original:
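A sketch of that computation for a pandas Series (lag 1 by default):

import numpy as np

def SerialCorr(series, lag=1):
    # Correlate the series with itself shifted by `lag` steps.
    xs = series[lag:]
    ys = series.shift(lag)[lag:]
    return np.corrcoef(xs, ys)[0, 1]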

AUTOCORRELATION
Autocorrelation refers to the degree of correlation of the same variables between
two successive time intervals. It measures how the lagged version of the value of a
variable is related to the original version of it in a time series. Autocorrelation, as a
statistical concept, is also known as serial correlation.
we calculate the correlation between two different versions Xt and Xt-k of the
same time series.
Given time-series measurements Y1, Y2, …, YN at times X1, X2, …, XN, the lag-k
autocorrelation function is defined as:

rk = Σ(i=1 to N−k) (Yi − Ȳ)(Yi+k − Ȳ) / Σ(i=1 to N) (Yi − Ȳ)²


An autocorrelation of +1 represents a perfectly positive correlation and −1
represents a perfectly negative correlation.

Usage:
 An autocorrelation test is used to detect randomness in the time-series. In

many statistical processes, our assumption is that the data generated is random.
For checking randomness, we need to check for the autocorrelation of lag 1.
 To determine whether there is a relation between past and future values of a
time series, we compute the correlation between values at different lags.

Testing For Autocorrelation


Durbin-Watson Test:

The Durbin-Watson test is used to measure the amount of autocorrelation in
residuals from a regression analysis. It checks for first-order autocorrelation.

Assumptions for the Durbin-Watson Test:


 The errors are normally distributed and the mean is 0.
 The errors are stationary.
The test statistic is calculated with the following formula:

d = Σ(t=2 to T) (et − et−1)² / Σ(t=1 to T) et²

where et is the residual of error from the Ordinary Least Squares (OLS) method.

The null hypothesis and alternate hypothesis for the Durbin-Watson Test are

 H0: No first-order autocorrelation.


 H1: There is some first-order correlation.


The Durbin-Watson statistic takes values between 0 and 4. Below is the table
containing values and their interpretations:

 2: No autocorrelation. Generally, we take 1.5 to 2.5 as no correlation.
 0 to <2: Positive autocorrelation; the closer it is to 0, the stronger the signs of
positive autocorrelation.
 >2 to 4: Negative autocorrelation; the closer it is to 4, the stronger the signs of
negative autocorrelation.
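StatsModels provides the statistic directly (a sketch, assuming results is a fitted OLS model as in the earlier examples):

from statsmodels.stats.stattools import durbin_watson

dw = durbin_watson(results.resid)   # residuals from the fitted model
print(dw)   # a value near 2 suggests no first-order autocorrelation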

The autocorrelation function is a function that maps from lag to the serial
correlation with the given lag. "Autocorrelation" is another name for serial
correlation, used more often when the lag is not 1.

acf computes serial correlations with lags from 0 through nlags. The unbiased
flag tells acf to correct the estimates for the sample size. The result is an array of
correlations.
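A sketch (in recent StatsModels versions the unbiased flag is spelled adjusted; series is an assumed pandas Series):

from statsmodels.tsa.stattools import acf

# Serial correlations at lags 0 through 12, corrected for sample size.
correlations = acf(series, nlags=12, adjusted=True)
print(correlations[1])   # the lag-1 serial correlation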

11.Introduction to Survival analysis

Introduction

Survival analysis is a statistical method essential for analyzing time-to-event


data, widely employed in medical research, economics, and various scientific
disciplines. At the core of survival analysis is the Kaplan-Meier estimator, a


powerful tool for estimating survival probabilities over time. This article provides
a concise introduction to survival analysis, unraveling its significance and
applications. We delve into the fundamentals of the Kaplan-Meier estimator,
exploring its role in handling censored data and creating survival curves. Whether
you’re new to survival analysis or seeking a refresher, this guide navigates through
the key concepts, making this statistical approach accessible and comprehensible.

What is Survival Analysis?

Survival analysis explores the time until an event occurs, answering


questions about failure rates, disease progression, or recovery duration. It’s a
crucial statistical field, involving terms like event, time, censoring, and various
methods like Kaplan-Meier curves, Cox regression models, and log-rank tests for
group comparisons. This branch delves into modeling time-to-event data, offering
insights into diverse scenarios, from medical diagnoses to mechanical system
failures. Understanding survival analysis requires defining a specific time frame
and employing various statistical tools to analyze and interpret data effectively.

Censoring/ Censored Observation

A subject in a survival study is described as censored if the defined event of the
study is not observed for that subject during the observation period. A censored
subject might also never have the event after the end of the observation. The
subject is called censored in the sense that nothing was observed for the subject
after the time of censoring.

Censored observations are of 3 types:


1. Right Censored

Right censoring is used in many problems. It happens when we are not certain
what happened to people after a certain point in time.

It occurs when the true event time t is greater than the censoring time c (c < t).
This happens if some people cannot be followed for the entire time because they
died, were lost to follow-up, or withdrew from the study.

2. Left Censored

Left censoring is when we are not certain what happened to people before some
point in time. It is the opposite of right censoring, occurring when the true event
time t is less than the censoring time c (c > t).

3. Interval Censored

Interval censoring is when we know something has happened in an interval (not


before starting time and not after ending time of the study) but we do not know
exactly when in the interval it happened.

Interval censoring is a combination of left and right censoring, where the time is
known to have occurred between two time points.

Survival Function S(t): This is a probability function that depends on the time of
the study: S(t) = P(T > t), the probability that the subject survives beyond time t.
The survivor function gives the probability that the random variable T exceeds the
specified time t.


Kaplan Meier Estimator

The Kaplan-Meier estimator is used to estimate the survival function for lifetime
data. It is a non-parametric statistical technique, also known as the product-limit
estimator. The idea is to estimate the survival time up to a significant event, such
as a major outcome in a medical trial, a time of death, or the failure of a machine.

There are lots of examples like

1. Failure of machine parts after several hours of operation.

2. How much time it will take for a COVID-19 vaccine to cure a patient.

3. How much time is required to be cured after a medical diagnosis.

4. To estimate how many employees will leave the company in a specific
period of time.
5. How many patients will be cured of lung cancer.

To estimate the Kaplan-Meier survival curve, we estimate the survival function
S(t) as a running product over the observed event times:

S(t) = ∏(ti ≤ t) (1 − di/ni)

where di is the number of death events at time ti, and ni is the number of
subjects at risk of death just prior to time ti.
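A sketch using the lifelines package (the durations and event flags here are hypothetical data):

from lifelines import KaplanMeierFitter

# Time until the event for each subject, and whether the event was
# observed (1) or the subject was censored (0).
durations = [5, 6, 6, 2, 4, 4]
observed  = [1, 0, 1, 1, 1, 0]

kmf = KaplanMeierFitter()
kmf.fit(durations, event_observed=observed)
print(kmf.survival_function_)   # estimated S(t) at each event time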

Assumptions of Kaplan Meier Survival

In real-life cases, we do not have an idea of the true survival rate function. So in
Kaplan Meier Estimator we estimate and approximate the true survival function
from the study data. There are 3 assumptions of Kaplan Meier Survival


 Survival probabilities are the same for samples that joined the study late as for
those that joined early; survival prospects are not assumed to change over the
recruitment period.
 Occurrence of events happens at the specified time.
 Censoring does not depend on the outcome: the Kaplan-Meier method assumes
censoring is unrelated to the outcome of interest.

To interpret a survival plot: the y-axis shows the probability that a subject has not
yet experienced the event of interest, and the x-axis shows the time survived. Each
drop in the survival function (approximated by the Kaplan-Meier estimator) is
caused by the event of interest happening for at least one observation.

The plot is often accompanied by confidence intervals, to describe the uncertainty
about the point estimates: wider confidence intervals indicate higher uncertainty,
which happens when few participants remain at risk, as observations both die and
are censored over time.
