R-squared and adjusted R-squared are statistical measures that help
determine how well a regression model fits a set of data. R-squared measures the proportion of
variation in the dependent variable explained by the model, while adjusted R-squared is a
modified version of R-squared that adjusts that value for the number of predictors in the model.
While R-squared measures the proportion of variance in the dependent variable explained by the
independent variables, it always increases (or stays the same) when more predictors are added.
Adjusted R-squared adjusts for the number of predictors and decreases if the additional variables
do not improve the model.
Because adjusted R-squared penalizes additional predictors, its value can be negative, indicating
that the fitted variables explain less variation than would be expected from random predictors.
An R-squared above 0.7 is generally seen as showing a high level of correlation, whereas
a value below 0.4 indicates a low correlation.
R-squared
Measures the proportion of variance in the dependent variable explained by the independent
variables
Increases or remains the same when new predictors are added to the model
Values range from 0 to 1
Adjusted R-squared
Adjusts the R-squared value to account for the number of predictors and the sample size
Penalizes the inclusion of irrelevant predictors
Can decrease if a new predictor does not improve the model
Helps determine the goodness of fit
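For reference, the usual formulas relating the two measures, where n is the sample size and k the
number of predictors:
R² = 1 − SS_res / SS_tot
Adjusted R² = 1 − (1 − R²) × (n − 1) / (n − k − 1)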
When to use
Investors use R-squared and adjusted R-squared to measure the correlation between a portfolio
or mutual fund and a stock index
Pizza owners can use adjusted R-squared to see if additional input variables contribute to their
model.
Unadjusted R-squared
The unadjusted R-squared is a measure of model fit between 0 and 1 that does not account for the
number of variables included in a model. It is available only for linear models.
Hypothesis testing in econometrics is a statistical method used to analyze economic
data and make decisions about population parameters. It involves comparing a null hypothesis to
an alternative hypothesis to determine which is more likely to be true.
Steps in hypothesis testing:
State the hypothesis: Formulate a hypothesis based on research and evidence
Specify the significance level: Choose a critical threshold for rejecting the null
hypothesis
Collect data: Gather data to calculate the test statistic
Calculate the test statistic: Use the data to calculate the test statistic
Determine the p-value: Compare the p-value to the significance level
Make a decision: Reject or fail to reject the null hypothesis
Interpret the results: Draw a conclusion based on the statistical evidence
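A minimal sketch of these steps in Python, using scipy's one-sample t-test; the sample values and
the 5% significance level are illustrative assumptions:

import numpy as np
from scipy import stats

# Steps 1-2: H0: population mean = 5, H1: mean ≠ 5, at the 5% significance level
alpha = 0.05

# Step 3: collect data (made-up sample for illustration)
sample = np.array([5.1, 4.8, 5.6, 5.3, 4.9, 5.4, 5.2, 5.0])

# Steps 4-5: compute the test statistic and the two-sided p-value
t_stat, p_value = stats.ttest_1samp(sample, popmean=5)

# Steps 6-7: decide and interpret
if p_value < alpha:
    print(f"t = {t_stat:.3f}, p = {p_value:.3f}: reject the null hypothesis")
else:
    print(f"t = {t_stat:.3f}, p = {p_value:.3f}: fail to reject the null hypothesis")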
What is hypothesis testing used for?
Analyzing economic theories and relationships
Estimating the relationship between two statistical variables
Testing that individual coefficients take a specific value
Types of hypothesis tests
There are three types of hypothesis tests: right-tailed, left-tailed, and two-tailed:
When the null and alternative hypotheses are stated, the null hypothesis is a neutral statement
against which the alternative hypothesis is tested, while the alternative hypothesis is a claim
with a particular direction. If the null hypothesis claims that p = 0.5, the alternative hypothesis
is an opposing statement and can be stated as p > 0.5, p < 0.5, or p ≠ 0.5. In each of these
alternative hypotheses, the inequality symbol indicates the direction of the hypothesis, and that
direction determines which type of test to use for the given population parameter.
When the alternative hypothesis claims p > 0.5 (note the 'greater than' symbol), the critical
region falls on the right side of the probability distribution curve. In this case, the right-tailed
hypothesis test is used.
When the alternative hypothesis claims p < 0.5 (note the 'less than' symbol), the critical region
falls on the left side of the probability distribution curve. In this case, the left-tailed
hypothesis test is used.
In the case of the alternative hypothesis p ≠ 0.5, a definite direction cannot be decided, and
therefore the critical region falls in both tails of the probability distribution curve. In this
case, the two-tailed test should be used.
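A small sketch showing how the choice of tail changes the p-value for the same test statistic
(the z value of 1.8 is an illustrative assumption):

from scipy.stats import norm

z = 1.8  # illustrative test statistic

p_right = 1 - norm.cdf(z)            # right-tailed test: H1 claims p > 0.5
p_left = norm.cdf(z)                 # left-tailed test:  H1 claims p < 0.5
p_two = 2 * (1 - norm.cdf(abs(z)))   # two-tailed test:   H1 claims p ≠ 0.5

print(p_right, p_left, p_two)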
Distributions
In econometrics, distributions describe how data points are spread across a range of values.
They are used to identify patterns, trends, and anomalies. This information is important for
making predictions and inferences, and for econometric analyses like hypothesis testing,
policy evaluation, and predictive modeling.
Types of distributions
Normal distribution :(Z Distribution)
Also known as the Gaussian distribution, this distribution is symmetric around the
mean, and appears as a bell curve.
Poisson distribution
This discrete probability distribution describes the probability of an event happening a
certain number of times within a given time or space.
Binomial distribution
This distribution is represented by 𝐵(𝑛,𝑝), where 𝑛 is the number of trials and 𝑝 is the
probability of success in a single trial.
Chi-squared distribution
This distribution is often used when testing hypotheses in regression models.
Exponential distribution
This continuous distribution is used to measure the expected time for an event to occur.
Student t-distribution (T-Distribution)
This distribution is used when the sample size is small or when not much is known
about the population.
The t-distribution is a probability distribution that is used in econometrics to calculate
probabilities and model financial returns. It is used in t-tests to determine if there is a
significant difference between sample and population means.
Explanation
The t-distribution has fatter tails than a normal distribution, which accounts for the
greater uncertainty in smaller samples.
The t-distribution's shape depends on the degrees of freedom (df), which is related to
the sample size.
As the d𝑓 increases, the t-distribution curve becomes taller and thinner, and more
similar to the standard normal distribution (𝑍-distribution). When the sample size is
around 30 or more, the t-test and 𝑍-test give very similar results.
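A quick sketch comparing two-sided 95% critical values of the t-distribution with the standard
normal as the degrees of freedom grow, illustrating the convergence described above:

from scipy.stats import t, norm

for df in (5, 10, 30, 100):
    print(df, round(t.ppf(0.975, df), 3))   # t critical value shrinks as df grows
print("normal", round(norm.ppf(0.975), 3))  # ≈ 1.96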
F-Distribution
The F-distribution is a statistical distribution used to test hypotheses in econometrics. It
can be used to compare variances, evaluate portfolio risks, and compare stock returns.
What is the F-distribution?
The F-distribution is a ratio of two estimates of variance.
It has two degrees of freedom, one for the numerator and one for the denominator.
The F-distribution is asymmetric, with a minimum value of 0 and no maximum value.
The F-distribution is less spread out as the degrees of freedom increase.
How is the F-distribution used in econometrics?
The F-distribution is used to establish a framework for hypothesis testing in financial
analysis.
It can be used to compare stock returns and evaluate portfolio risks.
There are two sets of degrees of freedom: one for the numerator and one for the
denominator. For example, if F follows an F-distribution with four degrees of freedom
for the numerator and ten for the denominator, then F ~ F(4, 10).
Example
A researcher wants to determine if different amounts of exercise impact weight
loss. The researcher sets up two groups, one that exercises for 30 minutes a day and the
other that exercises for 60 minutes a day. The researcher can use the F-distribution to
analyze the data for both groups.
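A sketch of how this comparison could be run in Python with scipy's one-way ANOVA F-test; the
weight-loss figures for the two groups are made up for illustration:

import numpy as np
from scipy.stats import f_oneway

# Illustrative made-up weight loss (kg) for the two exercise groups
group_30min = np.array([1.2, 0.8, 1.5, 1.0, 1.3, 0.9])
group_60min = np.array([2.1, 1.7, 2.5, 1.9, 2.3, 1.6])

# Under the null of equal group means the test statistic follows an F-distribution
F_stat, p_value = f_oneway(group_30min, group_60min)
print(F_stat, p_value)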
Ordinary Least Squares regression (OLS)
OLS is a common technique for estimating the coefficients of linear regression equations,
which describe the relationship between one or more independent quantitative variables and a
dependent variable (simple or multiple linear regression); fit is often evaluated using
R-squared.
OLS minimizes the sum of squared errors, where each error is the difference between an
observed and a predicted value.
It creates a single regression equation to represent the relationship between the
variables.
Simple linear regression:
Y = a + bX + u
where a is the intercept, b the slope coefficient, and u the error term.
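A minimal OLS sketch with statsmodels on simulated data; the intercept 2 and slope 0.5 are
assumed values for illustration:

import numpy as np
import statsmodels.api as sm

# Simulate data from Y = a + bX + u with a = 2 and b = 0.5
rng = np.random.default_rng(0)
X = rng.normal(size=100)
u = rng.normal(scale=0.5, size=100)
Y = 2 + 0.5 * X + u

# OLS minimizes the sum of squared differences between observed and fitted values
results = sm.OLS(Y, sm.add_constant(X)).fit()
print(results.params)                          # estimated a and b
print(results.rsquared, results.rsquared_adj)  # fit measures discussed above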
The assumptions of ordinary least squares (OLS) regression
are:
Linearity: The relationship between the dependent and independent variables must
be linear.
Independence of errors: The residuals, or error terms, should be uncorrelated
with each other.
Homoscedasticity: The variance of the error terms should be constant across all
levels of the independent variables.
Normality of errors: The error terms should be normally distributed.
No multicollinearity: The independent variables should not be highly correlated
with each other.
Exogeneity: The regressor variables should not be correlated with the error term.
Explanation
Non-normality of errors
If the error terms are not normally distributed, the standard errors of the OLS estimates
will not be reliable.
Heteroscedasticity
If the variance of the error terms is not constant, this is called heteroscedasticity.
Endogenous regressors
If the regressor variables are correlated with the error term, they are called endogenous.
This can cause the OLS estimator to be biased.
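As one example of checking these assumptions in practice, a sketch of the Breusch-Pagan test for
heteroscedasticity from statsmodels, applied to a simulated OLS fit:

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

# Simulated data and an OLS fit (illustrative)
rng = np.random.default_rng(1)
X = sm.add_constant(rng.normal(size=200))
y = X @ np.array([1.0, 0.5]) + rng.normal(size=200)
results = sm.OLS(y, X).fit()

# H0: the error variance is constant (homoscedasticity)
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(results.resid, X)
print(lm_stat, lm_pvalue)  # a low p-value suggests heteroscedasticity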
Correlation
Correlation is a statistical measure that expresses the extent to which two variables are
linearly related
Types of correlation
Positive correlation: When both variables increase or decrease in the same
direction
Negative correlation: When one variable increases as the other decreases, or vice
versa
No correlation: When there is no linear relationship between the variables
Correlation coefficient
The correlation coefficient is a statistical measure of how much one variable changes in
relation to another
The correlation coefficient is represented by the letter r
The value of the correlation coefficient ranges from -1 to +1
A value of +1 indicates a perfect positive correlation, while -1 indicates a perfect
negative correlation
A value of 0 indicates no linear correlation
Correlation refers to the statistical relationship between two entities. It measures the
extent to which two variables are linearly related. For example, the height and weight of
a person are related: taller people tend to be heavier than shorter people.
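A short sketch computing the correlation coefficient r for made-up height and weight data:

import numpy as np
from scipy.stats import pearsonr

# Illustrative made-up heights (cm) and weights (kg)
height = np.array([160, 165, 170, 175, 180, 185])
weight = np.array([55, 60, 66, 70, 78, 84])

r, p_value = pearsonr(height, weight)
print(r)  # a value close to +1 indicates a strong positive linear relationship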
Autocorrelation
Autocorrelation, also known as serial correlation, is a statistical method that measures
how similar a variable is to itself over time. It's a key tool in econometrics for
analyzing time series data.
Autocorrelation refers to the degree of correlation of the same variables between two
successive time intervals.
For example, the temperatures on different days in a month are autocorrelated. Similar
to correlation, autocorrelation can be either positive or negative.
How it works
Autocorrelation measures the relationship between a variable's current value and its
past values.
It's a mathematical representation of the similarity between a time series and a delayed
version of itself.
Autocorrelation can be positive or negative. A perfect positive correlation is
represented by an autocorrelation of +1, and a perfect negative correlation is
represented by an autocorrelation of -1.
Why it's important
Autocorrelation can help identify repeating patterns in data.
It can help identify fundamental features of a time series, such as stationarity,
seasonality, and trends.
It can help identify when data is not random, which may indicate a need for time series
analysis or regression analysis.
Where it's used
Autocorrelation is used in econometrics, signal processing, and demand prediction.
It's also used in technical analysis in the capital markets.
Serial correlation is a common feature of time-series data in econometrics. It occurs
when the errors in a regression model are correlated with each other, or when the
residuals are not independent.
Example
Stock prices
Stock prices tend to move up and down together over time, which is an example of serial
correlation. This means that if stock prices are high today, they are likely to be high
tomorrow.
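A sketch of measuring autocorrelation in practice with statsmodels' acf function, applied to a
simulated positively autocorrelated (AR(1)) series; the persistence parameter 0.8 is an
illustrative assumption:

import numpy as np
from statsmodels.tsa.stattools import acf

# Simulate a persistent series: each value depends on the previous one
rng = np.random.default_rng(2)
y = np.zeros(300)
for t in range(1, 300):
    y[t] = 0.8 * y[t - 1] + rng.normal()

# Correlation of the series with lagged versions of itself, lags 0-5
print(acf(y, nlags=5))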
General-to-Specific Model
In econometrics, general-to-specific (Gets) modeling is a methodology that starts with a
general model and then reduces it to a more specific model.
How it works
Start with a general model that includes all the variables that are thought to be
important
Reduce the model by successively removing variables until it's parsimonious, or
simple
Test the model's assumptions against the data
Why it's important
Gets modeling can help simplify complex phenomena
It can help ensure that the model is statistically adequate
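A rough sketch of the reduction step under simple assumptions: start from an OLS model with all
candidate regressors and repeatedly drop the least significant one while its p-value exceeds 10%
(the simulated data and the 10% threshold are illustrative choices, not a formal Gets algorithm):

import numpy as np
import statsmodels.api as sm

# Simulated data in which only x1 and x2 actually matter
rng = np.random.default_rng(3)
X = rng.normal(size=(200, 4))
y = 1 + 2 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(size=200)

names = ["x1", "x2", "x3", "x4"]
data = sm.add_constant(X)  # start with the general model

while names:
    results = sm.OLS(y, data).fit()
    pvals = results.pvalues[1:]        # skip the constant
    if pvals.max() <= 0.10:
        break                          # model is as parsimonious as the rule allows
    worst = int(np.argmax(pvals))
    data = np.delete(data, worst + 1, axis=1)
    names.pop(worst)

print("retained regressors:", names)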
Stationarity
A common assumption in many time series techniques is that the data are stationary. A
stationary process has the property that the mean, variance and autocorrelation structure do not
change over time.
What is a stationarity test in econometrics?
A stationarity test checks whether a time series has the stationarity property, i.e., whether
statistical properties such as the mean, variance, and higher moments do not change with time.
Why is stationarity important?
Stationarity makes time series data easier to analyze, model, and forecast.
Stationary time series are predictable and suitable for certain econometric models.
How to check for stationarity?
The Levin-Lin-Chu (LLC) test can be used to determine if a time series is stationary.
The LLC test has a null hypothesis that the data has a unit root, which means it's not
stationary.
A low p-value from the LLC test indicates that the data is likely stationary.
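The LLC test is designed for panel data; for a single time series, the Augmented Dickey-Fuller
(ADF) test follows the same logic (null hypothesis of a unit root, i.e. non-stationarity). A
minimal sketch with statsmodels, using a simulated stationary series:

import numpy as np
from statsmodels.tsa.stattools import adfuller

# Simulated stationary AR(1) series (illustrative)
rng = np.random.default_rng(4)
y = np.zeros(250)
for t in range(1, 250):
    y[t] = 0.5 * y[t - 1] + rng.normal()

adf_stat, p_value, *rest = adfuller(y)
print(adf_stat, p_value)  # a low p-value suggests rejecting the unit-root null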
Stationarity in data means that the statistical properties of a time series do not change over
time. This assumption is important because it allows for simpler analysis, modeling, and
forecasting.
Explanation
Statistical properties: These include the mean, variance, and covariance of the
data.
Stationary time series: A time series where these statistical properties remain
constant over time.
Non-stationary time series: A time series where these statistical properties
change over time.
Stationarity assumption: The assumption that data is stationary is a common
assumption in many time series techniques.
Autoregressive Distributed Lag (ARDL)
ARDL is a model used in econometrics to analyze the relationship between time series data. It's a
single-equation framework that can be used to estimate long-term coefficients.
How it works
The model's current value of the dependent variable is dependent on its own past
values and the current and past values of other explanatory variables
The variables can be stationary, nonstationary, or a combination of the two
The ARDL cointegration technique can be used to determine the long-term
relationship between variables with different orders of integration
Benefits
The ARDL method can produce consistent estimates of long-term coefficients
The ARDL cointegration technique can be used to obtain realistic estimates of a model
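A sketch of fitting an ARDL(1, 1) model, assuming a recent statsmodels version that includes the
ARDL class; the simulated series and their coefficients are illustrative:

import numpy as np
import pandas as pd
from statsmodels.tsa.ardl import ARDL

# Simulate y depending on its own past and on current and past values of x
rng = np.random.default_rng(5)
n = 200
x = np.cumsum(rng.normal(size=n))   # a nonstationary explanatory variable
y = np.zeros(n)
for t in range(1, n):
    y[t] = 0.6 * y[t - 1] + 0.3 * x[t] + 0.1 * x[t - 1] + rng.normal()

data = pd.DataFrame({"y": y, "x": x})

# One lag of y, plus current and one lagged value of x
model = ARDL(data["y"], lags=1, exog=data[["x"]], order=1)
print(model.fit().params)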
Dummy Variables
In econometrics, a dummy variable is a numeric variable that takes on a value of either
0 or 1 to represent a qualitative variable. Dummy variables are used in regression
analysis to include categorical variables in models.
Why are dummy variables used?
They allow for more sophisticated modeling of data
They can help control for confounding factors
They can improve the validity of results
What are dummy variables used for?
Representing qualitative variables like race, marital status, political party, age group,
and region of residence
Representing seasonal effects
Representing the occurrence of wars or major strikes in time series analysis
How are dummy variables used?
Dummy variables can be used as explanatory variables or as the dependent variable
Multiple dummy variables can be created to represent each level of a categorical
variable
Only one dummy variable takes on a value of 1 for each observation
A dummy variable (binary variable) D is a variable that takes on the value 0 or 1. Note that the
labelling is not unique; a dummy variable could be labelled in two ways, e.g. for the variable
gender: D = 1 if male, D = 0 if female; or D = 1 if female, D = 0 if male.
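A short sketch of creating dummy variables with pandas; the category labels are illustrative:

import pandas as pd

# Illustrative categorical data
df = pd.DataFrame({"gender": ["male", "female", "female", "male"],
                   "region": ["north", "south", "east", "north"]})

# drop_first=True keeps one fewer dummy than category levels,
# which avoids perfect multicollinearity (the dummy variable trap)
print(pd.get_dummies(df, columns=["gender", "region"], drop_first=True))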