
JIMMA UNIVERSITY

COLLEGE OF BUSINESS AND ECONOMICS

DEPARTMENT OF ECONOMICS

CONTINUING AND DISTANCE EDUCATION LEARNING MODULE


FOR:

STATISTICAL SOFTWARE APPLICATION IN ECONOMICS

Course Code: Econ 4171

WRITER: OUSNI SOLOMON (M.Sc.)

EDITOR: HAILE TESFAYE (M.Sc.)

JULY, 2023

JIMMA, ETHIOPIA
Contents
Chapter 1: Introduction to Software ......................................................................................... 1
1.1 Econometrics and Statistical Software............................................................................. 1
1.2 What is Stata? .................................................................................................................. 4
1.3 Why use Stata? ................................................................................................................. 5
1.4 OVERVIEW OF DATA FILES ....................................................................................... 5
1.4.1 Types of data and data sources .................................................................................. 5
1.4.2 Discrete variables Vs Continuous variables.............................................................. 7
1.4.3 The Stata user interface ............................................................................................. 9
1.4.4 Basic File Management ........................................................................................... 12
1.4.5 Basic syntax and mathematical operators ............................................................... 13
Chapter 2: Application to Data Management .......................................................................... 14
2.1 Basic data commands ..................................................................................................... 14
2.3 Examining data sets ....................................................................................................... 15
 clear ........................................................................................................................... 15
 edit ............................................................................................................................. 15
 browse ....................................................................................................................... 16
 describe...................................................................................................................... 16
2.3 Data Manipulation ..................................................................................................... 16
2.3.1 Logical operators......................................................................................................... 16
2.3.1.1 Logical operator if............................................................................................... 16
2.3.1.2 Logical operators: “and,” “or” ........................................................................ 18
2.3.2 Generating Variables ................................................................................... 20
2.3.3 Modifying Variables: .............................................................................................. 24
2.4 Data Documentation .................................................................................................. 25
2.4.1 Data Documentation and Metadata ......................................................................... 25
2.4.2 The importance of documenting data ....................................................................... 27
2.4.3 Variable labels ........................................................................................................ 28
2.5 Data Storage and Organization ................................................................................. 31
2.5.1 Data Storage Formats:........................................................................................ 31
2.5.2 File Extensions:.................................................................................................. 31
2.6 Data Security and Ethics ................................................................................................ 34
Review questions on Stata application to data management in economics: ........................ 36
Chapter 3: Application to Univariate Analysis ........................................................................ 37
3.1 Summary Statistics: ..................................................................................................... 37
3.1.1 Count: .................................................................................................................... 37
3.1.2 Mean: .................................................................................................................... 37
3.1.3 Standard deviation: ........................................................................................... 38
3.1.4 Minimum: ............................................................................................................. 38
3.1.5 Maximum: ............................................................................................................. 38
3.1.6 Quartiles: ............................................................................................................. 38
3.1.7 Percentiles: ........................................................................................................... 38
3.1.8 Skewness: ........................................................................................................ 38
3.1.9 Kurtosis: ........................................................................................................... 38
3.2 Histogram or Frequency Distribution: ...................................................................... 39
3.3 Box Plots: ...................................................................................................................... 41
3.4 Bar plots or Pie charts ................................................................................................. 42
Review questions on Stata application to univariate analysis in economics: ...................... 43
Chapter 4: Application to Bivariate Analysis .......................................................................... 45
4.2 Scatter Plot: .................................................................................................................... 45
4.3 Correlation Coefficient:............................................................................................. 45
4.3 Regression Analysis: ...................................................................................................... 46
4.4 Line Plot or Grouped Bar Chart: .................................................................................... 47
4.5 Cross-tabulation: ............................................................................................................ 48
4.6 Independent Samples t-test or Mann-Whitney U test: ................................................... 49
4.7 Chi-Square Test: ........................................................................................................ 50
Review questions on Stata application to bivariate analysis in economics: ........................ 51
Chapter 5: Application to Cross Sectional Econometrics .................................................. 52
5.1 Heteroscedasticity test: ............................................................................................ 52
5.2 Normality test: ....................................................................................................... 53
5.3 correlation .................................................................................................................. 54
5.4 Multicollinearity test: ............................................................................................... 55
5.5 Specification tests: ....................................................................................................... 57
5.6 Goodness-of-fit measures: ....................................................................................... 58
Review questions on Stata application to cross-sectional econometrics: ............................ 59
CHAPTER SIX: APPLICATION TO TIME SERIES ECONOMETRICS AND PANEL
DATA ...................................................................................................................................... 60
6.1 Application to Time Series Econometrics ............................................................. 60
6.1.1 Importing and Managing Data:........................................................................... 60
Exploratory Data Analysis: .............................................................................................. 61
6.1.2 Time Series Analysis: Unit root tests .............................................................. 62
6.1.3 White's Test for Heteroscedasticity: .............................................................. 68
6.1.4 Granger Causality Test ........................................................................................ 69
6.1.5 Autoregressive integrated moving average (ARIMA) models ..................... 71
6.2 Panel Data Analysis: ................................................................................................... 73
6.2.1 Specifying the panel and time variables using the xtset command: ............. 73
6.2.2 Estimating fixed effects or random effects ......................................................... 74
6.2.3 Hausman tests ......................................................................................................... 75
Review questions on Stata application to time series econometrics and panel data analysis:
.............................................................................................................................................. 77
Chapter 7: Application to Nonlinear Models ........................................................................... 78
7.1 Overview of non-linear models and their applications in economics ............................ 78
Differences between linear and non-linear models .......................................................... 78
7.2 Binary Choice Models ................................................................................................... 80
7.2.1 Probit model: specification, estimation, and interpretation .................................... 80
7.2.2 Logit model: specification, estimation, and interpretation ................................ 82
7.2.3 Comparing probit and logit models ........................................................................ 84
7.3 Stata Application ............................................................................................................ 85
Marginal Effects ......................................................................................................... 87
Review questions on Stata application to Logit and Probit models: ................................... 89
Assignment .............................................................................................................................. 90
Chapter 1: Introduction to Software
1.1 Econometrics and Statistical Software

Economists very often work with statistical software that is used to build economic models
and conduct econometric analyses. Learning to work with and analyze data is thus an
essential skill for young economists. To be competitive as an economist in the job market,
demonstrable skills and experience using some of the popular analysis and forecasting
software environments are a must.

In modern economics research and analysis, software applications play a crucial role in data
management, statistical analysis, and econometric modeling. These applications provide
powerful tools for economists to handle large datasets, conduct complex analyses, and
generate meaningful insights. In this lecture, we will explore some of the prominent software
applications used in economics and their key features.

Stata:

Stata is a widely used statistical software package that offers a comprehensive suite of tools
for data analysis, econometrics, and visualization. It provides an extensive range of
commands and functions for managing datasets, conducting statistical tests, estimating
regression models, and producing high-quality graphs. Stata's user-friendly interface and rich
documentation make it particularly popular among economists.

R:

R is a programming language and environment specifically designed for statistical computing
and graphics. It is open-source and widely adopted in economics research due to its
versatility and extensive library of packages. R enables economists to perform a wide range
of statistical analyses, data manipulation, and visualization tasks. With its programming
capabilities, R offers flexibility and allows for customizing analyses to meet specific research
needs.

Python:

Python is a general-purpose programming language that has gained significant popularity in
the economics community. It provides a wide range of libraries, such as NumPy, Pandas, and
Matplotlib, that offer robust data analysis, manipulation, and visualization capabilities.
Python's simplicity and readability make it an attractive choice for economists, especially for
tasks involving data cleaning, preprocessing, and exploratory analysis.

MATLAB:

MATLAB is a powerful software package widely used in economics and various scientific
disciplines. It provides a convenient environment for numerical computing, algorithm
development, and data visualization. MATLAB offers extensive toolboxes for econometrics,
optimization, and simulations. Economists often use MATLAB for advanced econometric
modeling, algorithmic analysis, and simulation exercises.

EViews:

EViews is a specialized software package designed for econometric analysis and time series
modeling. It offers an intuitive interface, making it accessible to economists with varying
levels of technical expertise. EViews supports various econometric techniques, including unit
root tests, panel data analysis, cointegration, and forecasting. It is particularly useful for
analyzing economic and financial time series data.

Excel:

Excel is a widely known spreadsheet program used for data management, basic statistical
analysis, and modeling. While it may lack some advanced statistical capabilities compared to
dedicated statistical software, Excel is still extensively used in economics due to its
familiarity and ease of use. It allows economists to organize data, perform basic calculations,
generate charts, and conduct simple regressions.

GIS (Geographic Information Systems) Software:

GIS software, such as ArcGIS and QGIS, is essential for economists studying spatial
economics and regional analysis. These tools enable economists to analyze and visualize
spatial data, create maps, and perform spatial econometric modeling. GIS software allows for
the integration of economic data with geographic information, facilitating the examination of
spatial relationships and patterns.

In conclusion, software applications play a vital role in modern economics research and
analysis. Each software package offers unique features and capabilities that cater to different
research needs and preferences. Whether it's Stata for comprehensive statistical analysis, R
and Python for programming flexibility, MATLAB for advanced modeling, or specialized
software like EViews for econometric analysis, economists have a variety of powerful tools
at their disposal to conduct rigorous and insightful research.

Note: It's important to choose the software application that best suits your research
requirements and to familiarize yourself with its functionalities and capabilities. Additionally,
it's always advisable to consult relevant documentation and resources provided by the
software developers to maximize the utilization of these tools in economics research. That's
quite the list of names, and it can sound intimidating. To help you understand the software
tools landscape, in this module we will shed some light on the most popular software
packages for economists, and offer some details about how you can learn more about them.

The choice of software application depends on several factors, including the specific research
task at hand, the complexity of the analysis, the researcher's familiarity with the software, and
the availability of resources and support. Here are some considerations for selecting a
software application for different scenarios:

Stata: Stata is often preferred when conducting statistical analysis and econometric
modeling. It provides a comprehensive set of built-in commands specifically designed for
economists. Stata's user-friendly interface and extensive documentation make it suitable for
both beginners and advanced users. It is well-suited for tasks such as data management,
regression analysis, and generating publication-quality graphs. Stata also has a large user
community and support network.

EViews: EViews is specifically designed for time series analysis and econometrics. It offers
an intuitive interface and built-in econometric techniques, making it ideal for economists
working with time series data. EViews simplifies tasks such as unit root tests, cointegration
analysis, and forecasting. It is commonly used in applied econometrics, financial analysis,
and economic forecasting.

This course will equip students with sufficient knowledge of Stata so that they can
handle and analyze different types of data, and of EViews for time series analysis. The
emphasis of the course is on the practical issues relating to data analysis and modeling rather
than econometric theory. The overriding objective of the course is to ensure that the
students are competent and confident in the econometric analysis of data.

The course encompasses a number of key areas in empirical analysis using various datasets.
The students will be shown how to analyze the data and how to estimate reliable econometric
models using STATA. Throughout the course, students will be shown how to avoid the
numerous pitfalls that inexperienced researchers often fall into.

Objectives

The objective of this module is to improve the ability of our students to use Stata to generate
descriptive statistics and tables from some example datasets, as well as to carry out
preliminary linear and non-linear regression analysis using appropriate data, and to use EViews
for time series data analysis. In particular, the course aims to train the participants in the
following methods:

 basic file management such as opening, modifying, and saving files


 advanced file management techniques such as merging, appending, and
aggregating files
 documenting data files with variable labels and value labels
 generating new variables using various functions and operations
 creating tables to describe the distribution of continuous and discrete variables
 creating tables to describe the relationships between two or more variables
 using regression analysis to study the impact of various variables on a dependent
variable
 testing hypotheses using statistical methods

1.2 What is Stata?

 A computer program that can be used for data analysis, data management, and

graphics

 It has a wide application and can be used for household surveys, macroeconomic data,

“big data” (data derived from mass data-collecting activities), etc.

 What applications do you foresee using Stata in your own work?

1.3 Why use Stata?

Choosing a statistical software package in a company or research institution is often a

strategic decision. The decision entails the investment of time and money, and you should

think about the future development and compatibility of the software. Often, it is also

influential what type of software your peer group is using, since this is usually the main

source for getting support and exchanging experience.

Statistical software can either be used by command line or by point-and-click menus, or both.

The command line usage has the invaluable advantage that all steps of the analysis, and thus

all results, are easily replicable. In contrast, menu usage might make it very difficult to

replicate results, especially in larger projects. However, it might be more difficult in the

beginning to learn a new command structure, especially for those users who have never

worked with programming languages. Nonetheless, initial ease of use should be weighed

against long-term payoffs before choosing the software.

 Over Excel
Excel is easier to use and good for quick graphing, but not as robust in terms of
statistical analysis; also, in Excel many things have to be done manually (it is hard to
apply broad rules). Stata also allows you to keep track of your work.

 Over SPSS
While Stata's capabilities are seen more at the advanced end, it is easier to get support for
Stata, and it is more widely used in academia.

 Over R
While R is free and accessible to the public, Stata is easier to learn and,
again, the community of users is wider…for now

1.4 OVERVIEW OF DATA FILES


1.4.1 Types of data and data sources
Data used in econometric analysis could be:
 Micro or macro, flow or stock, or quantitative or qualitative
 Primary data vs. secondary data (method of collection)

 Observational/non-experimental, experimental, and survey data

Classifications of data types
 Time series data
 Cross-section data
 Pooled data
 Panel/longitudinal data
Records

 Observation units such as individuals, households, farm plots, regions, districts,


villages, etc.
 They are often considered to form the ‘rows’ of a data file
 The dataset A below has five records, each record being a household
Dataset A
HHID REGION FAMSIZE DISTMKT
221 1 5 1.5
373 1 5 0.4
457 1 4 0.6
459 3 2 5.1
175 4 8 1.2

Variables

 Are the characteristics, location, or dimensions of each record


 They are considered the “columns” of the data file
 In our data above (A), there are four variables: the household
identification number (HHID), the region where the household lives
(REGION), the size of the household (FAMSIZE), and the distance
from the house to the nearest market (DISTMKT)
Level

 The level of a dataset describes what each record represents


 In our previous data, each record is a different household, so it is a household-
level data set.
 However, in dataset B below, each record is a farm plot, hence it is a plot-level
dataset

Dataset B
REGION  DISTRICT  HHID  PLOT  IRRIG  AREA
1       4         1     1     1      1.5
1       4         1     2     0      1.0
1       5         3     1     1      0.5
3       26        2     1     0      0.4
3       26        2     2     1      1.0
4       45        1     1     1      1.2

Key variables

 Variables that are needed to identify a record in the data


 In our first dataset, the variable HHID is enough to uniquely
identify the record so HHID is the only key variable
 In dataset B, the key variables are REGION, DISTRICT, HHID,
and PLOT because all four variables are needed to uniquely identify
the record (why?)
1.4.2 Discrete variables Vs Continuous variables

Discrete variables (categorical)


 Variables that have only a limited number of different values
 Examples include region, sex, income category, type of roof, and education level
 Yes/no variables such as whether a household uses improved seeds are also discrete
variables
 Such are also called binary or dummy variables

Continuous variables
 Variables whose values are not limited
 Examples include income, farm size, consumption, coffee
production, and distance to the road
 Unlike discrete variables, continuous variables are usually expressed
in some units such as birr, hectares, kilograms or kilometers and
may take fractional values
Variable labels
 Longer names associated with each variable to explain them in tables and graphs
 For example, the variable label for FAMSIZE might be “Household
family size/ number of family members in a household” and the

label for DISTMKT could be “Distance from market in km”
 Whenever possible, variable labels should include the unit (e.g. km)
Value labels
 Longer names attached to each value of a variable (categorical/discrete)
 For example, if the variable REGION has 4 values, each value is
associated with a name. For instance, REGION=1 “Tigray”, REGION=3
“Amhara”, REGION=4 “Oromia” and REGION=5 “Somali”

Default display at program start

 Type sysuse auto


 Stata comes with example datasets that are used for examples
 Type sysuse dir to see other example datasets

1.4.3 The Stata user interface

The latest Stata version is Stata 18, but we will be using Stata 14 for this
course; a snapshot of its interface is shown below.

[Screenshot of the Stata 14 interface, with the Results, Variables, Review, Command, and Properties windows labeled]

Windows
 Give you all the key information about the data file you are using,
recent commands, and theresults of those commands
 Results window: To view outputs and recent commands
 Command window: To enter commands
 Variables window: To see the list of variables in the active dataset opened
 Review window: to see all previously entered commands
 Properties window: to see variable and dataset properties of
a selected variable or groups of variables.
 Other windows such as the browser, editor, and do-file editor do not open
automatically but through some specific commands or the pull-down
menu

 To open any window or to reveal a hidden window, select the window from the
Window menu
 Text appearing in the Results window is colored differently
 Blue: Commands or error messages that can be clicked on for more
information
 Red: Error messages
Menus

The most important menus are:


 Open (use): opens a new data file
 Save/save as: saves current data in memory to a specified location
 Do: executes a do-file
 Print: Prints the contents in the results window
The toolbar

The toolbar contains buttons that provide quick access to Stata's more
commonly used features. Some of the common buttons are:

Open: opens a Stata dataset. Click on the button to open a dataset with the
Open dialog.

Save: saves the Stata dataset currently in memory to disk.

Log: begins a new log or closes, suspends, or resumes the current log.

Data Editor (Edit): opens the Data Editor or brings the Data
Editor to the front of the other Stata windows.

Data Editor (Browse): opens the Data Editor in browse mode.

Variables Manager: opens the Variables Manager.

Clear –more– Condition: tells Stata to continue when it
has paused in the middle of long output.

Break: stops the current task in Stata.


Almost all commands in the menu have corresponding buttons (toolbar commands).
However, it is not recommended to learn Stata using the menu/button commands, since the use
of the command line gives the user much better control and allows for a much faster and
more exact working process.

Browser

 Offers traditional view of datasets

[Screenshot: the Browser (Data Editor) window showing the auto dataset]

Browser Window

 How many cars are listed there?


 What is the most expensive car that is listed?
 How many variables are listed?
Variables Tab in the Browser Window

 Can you read the label for “foreign?”


 Can you hide everything except for make and price?
From the main command window

 How can you call up the browser window?


 browse

1.4.4 Basic File Management

 dir – “directory,” shows all the files that are in the folder
 Can you find which folder it is currently in?
 pwd – “present working directory”

 Create a folder on Windows where you want all these training files to be placed
 cd – “change directory,” changes the folder where you are working from
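
A minimal sketch tying these three commands together (the folder path is only an example; replace it with your own training folder):

* show the folder Stata is currently working in
pwd
* change to the training folder you created
cd "C:\Users\user\Desktop\Statafiles"
* list the files stored in that folder
dir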

1.4.5 Basic syntax and mathematical operators

o disp = display
 What happens when you type disp “Hello”
 What happens when you type disp “Hello” “world”
 What happens when you type disp hello?
 Use “ ” when you are describing string characters (text)
o Otherwise, Stata will think you are talking about variables
o Mathematical operators include: + - * / ^ ( )
 What happens when you display 4
 What happens when you display 4 + 7
 How would you display (21-12)*3
 How would you display (36+12)−42
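
A few of these can be checked directly in the Command window; the comments note what Stata prints:

* disp "Hello" prints Hello, whereas disp Hello returns an error
* because Stata looks for a variable named Hello
disp "Hello"
* arithmetic expressions are evaluated: this prints 11
disp 4 + 7
* and this prints 27
disp (21-12)*3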

Chapter 2: Application to Data Management
2.1 Basic data commands

 describe - describes aspects of the data


 How would you describe only one variable, like “weight?”

 list - lists all the data


 How would you list one variable like “make?”
 How would you list two variables like “make” and “price?”
 Remember the distinction between list on its own (all variables) and list followed by variable names

 summarize – summarizes variables, provided they are numeric


 What is the average price of the cars listed?
 How much is the most expensive car?
 What happens if you want a summary of “make?”

 tabulate – counts and tabulates data, also works with non-numeric data
 Now what happens if you want a tabulate of make?
 How many of these cars are foreign and domestic?
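
One way to answer these questions with the auto example dataset is sketched below; each command can be typed in the Command window:

* load the example dataset
sysuse auto, clear
* describe a single variable
describe weight
* list one or two variables
list make
list make price
* average and maximum price
summarize price
* counts of foreign and domestic cars
tabulate foreign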

2.2 Opening data files from other formats

Stata can import data in a variety of formats:

 ASCII data formats (such as CSV formats)


 Spreadsheet formats (including various Excel formats)
 SQL database, Access, SPSS, etc
We will see how to import data from Excel. There are many different ways of doing so as
listed below.

1. One way to import data from Excel is to copy the relevant cells in
Excel and paste them into a new Data Editor window in Stata.

2. The second method is to use the ‘insheet’ command after saving
the Excel file in “.csv” format.

The syntax for insheet command is:

insheet [using] "C:\Users\user\Desktop\Statafiles\filename.csv"

NB: The original Excel (.xls) file has to be converted into “.csv”
format before proceeding to the insheet command.

3. The third method is to use the ‘import excel’ command. The
syntax for the import excel command is:

import excel [using] filename [, import_excel_options]

To treat the first row of Excel data as variable names, use the option “firstrow”
with the import excel command.

To import a subset of variables from an Excel file, use the following syntax

import excel extvarlist using filename [,import_excel_options]

To save/export data in memory to an Excel file, use the following syntax


export excel [using] filename [if] [in] [, export_excel_options]

To save/export a subset of variables in memory to an Excel file, use the syntax


export excel [varlist] using filename [if] [in] [, export_excel_options]
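
As a hedged illustration (the file and variable names below are hypothetical), a typical import followed by an export of a subset of variables might look like this:

* read an Excel sheet, treating the first row as variable names
import excel using "household_survey.xlsx", firstrow clear
* export only two variables to a new Excel file
export excel hhid famsize using "famsize_only.xlsx", firstrow(variables) replace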

2.3 Examining data sets

 clear
The clear command removes the data, variables, and labels currently in memory
to get ready to use a new data file. You can clear memory using the clear
command or by adding the clear option to the use command (for example,
use filename, clear). This command does not delete any data saved to the
hard drive.

 edit
This command opens a window called the Data Editor that allows us
to view all observations in memory. You can change the data using the Data
Editor window, but this is not recommended because you will have no
record of the changes you make to the data. It is better to correct errors in
the data using a do-file program that can be saved (we will see do-file
programs later).

 browse
This window is exactly like the Data Editor window except that you can't change the data.

 describe
This command provides a brief description of the data file. You can use
“des” or “d” and Stata will understand. The output includes:

 the number of variables


 the number of observations (records)
 the size of the file
 the list of variables and their characteristics
It also provides the following information on each variable in the data file:

 the variable name


 the storage type: byte is used for binary variables, int is used for
integers, and float is used for continuous variables that may have
decimals. To see the limits on each storage type: help datatypes
 the display type indicates how it will appear in the output.
 the value label is the name of a set of labels for different values
 the variable label is a name for the variable that is used in output.

2.3 Data Manipulation


Here we introduce the basic commands for manipulating your data. The most important
logical operators in Stata are outlined below. The most frequently used are & (and),
| (or) and ! (not). These are essential for manipulating the data correctly. We can
illustrate some of these using the “display” command. Notice that strings require
double quotes: di Welcome does not work but di "Welcome" does. You can also access
system values and program results with this command, for example today's date with
di "`c(current_date)'". Note again that di `c(current_date)' (without the double quotes)
returns an error message.

2.3.1 Logical operators

2.3.1.1 Logical operator if


 if – it is a logical operator that has many uses in Stata
 How would you get a list of all cars less than $12,000?
 Logical Operators:

 Less than: <
 Greater than:>
 Less than or equal to: <=
 Greater than or equal to: >=
 Equals: ==
 Does not equal: !=
Exercises

Using the data: sysuse auto

 List only the makes of cars whose price is less than $5,000
 What is the average price of cars whose mpg is 18?
 How many cars are there?
 You can also use count to get this information
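
One possible set of answers, sketched with the commands introduced above:

* makes of cars cheaper than $5,000
list make if price < 5000
* average price of cars whose mpg is exactly 18
summarize price if mpg == 18
* number of such cars (the summarize output reports it, or use count)
count if mpg == 18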

2.3.1.2 Logical operators: “and,” “or”
 &
 |

Exercise

 If we want the name of the car whose weight is between 1000 and 2000
pounds…
 list make if weight > 1000 & weight < 2000
 What if we also wanted weight listed with their name?
 If we want a list of cars and their mileage per gallon (mpg) whose mpg is less
than 20 or over 30…
 list make mpg if mpg < 20 | mpg > 30
 Using the count function, how many cars is this?

Practice questions

 Use gnp96.dta, a dataset showing GNP of an unknown country over time

 sysuse gnp96.dta, clear

1. Using any method, how many observations are there?

2. What are the names of the two variables?

3. What is the meaning of the second variable? (Name of the label)

4. What is the average figure of the GNP over the various observations?
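
A short sketch of how these can be answered (describe and summarize with no variable list report on everything in memory):

* load the example dataset named above
sysuse gnp96, clear
* number of observations, variable names, and variable labels
describe
* means of all variables, including the GNP series
summarize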

2.3.2 Generating Variables

You will regularly have to generate variables to aid econometric analysis. For example, you
may want to create dummy variables or run log-log regressions. To create a variable named
loggedprice equal to the natural log of price, the command is gen loggedprice = ln(price).
Similarly, to generate a variable equal to twice the square root of the price, use the command
gen twice_root_price = 2*sqrt(price). Note that variable names cannot contain spaces.

The egen (“extended gen”) command works just like gen but with extra options. For example,
egen avg_price = mean(price). With egen we can also break commands down into
percentiles very easily. For example, to create a variable equal to the 99th percentile of
price, enter egen high_price = pctile(price), p(99). Changing the 99 to 50 in that command
would produce a variable equal to the median price. “egen” is often used to obtain a
breakdown of a particular statistic by another variable.

if Commands

We can create dummy variables with if commands. Typically two steps
are needed. First we create a variable set equal to zeros: gen expensive = 0.
Now we replace it: replace expensive = 1 if price > avg_price.
See the tutorial on regression analysis for more details on dummy variables.
Similarly we can control for outliers using if commands.
For example, if you want to eliminate the most expensive 5% of
observations, the following would work:
egen top_fivepercent_prices = pctile(price), p(95)
drop if price > top_fivepercent_prices
We remove the variables that we no longer need in this analysis with the “drop” command:
drop loggedprice twice_root_price avg_price high_price

Summarizing with tab and tabstat

One of the first things you will want to do with your data is to summarize its main
features. Crosstabs are a useful pre-regression tool, and are also useful for presenting the
main points of your data succinctly. The two most important commands for this are “tab”
and “tabstat”. “Tab” tells you how many times each answer is given in response to a
particular variable. This is only suitable for variables with relatively few entries, such
as categorical data. If you try to use “tab” on a variable which has hundreds of different
entries you will get an error message. Typing tab expensive will show how many
entries there are for each category. As with all commands, it can also be accessed
through the menus via: STATISTICS, SUMMARIES, TABLES. It is also easy to
obtain crosstabs which give a breakdown of one variable by another. For example, typing
tab headroom expensive will show how many cars in each headroom category are above and
below the average price. It is often useful to know the percentages as well as the actual
numbers. We need to add an option to our “tab” command. tab headroom expensive, col will
give us the percentage within each column. Typing tab headroom expensive, row will give us
the percentage within each row.

The second command which is useful here is “tabstat”. This is used for continuous
variables, and its main use is to provide mean values. For example tabstat price will
give the average price in the dataset. Using the command options we can also access
other useful statistics such as the median tabstat price, stats(med), or the variance
tabstat price, stats(var). For a full list of the available statistics, type help
tabstat. As before we can obtain these statistics according to different levels of a
second variable. For example, tabstat price, by(foreign) gives the average price separately for domestic and foreign cars.
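
Putting these pieces together, a short sketch with the auto data (expensive is the dummy created in the previous subsection):

* recreate the dummy for above-average price
sysuse auto, clear
egen avg_price = mean(price)
gen expensive = 0
replace expensive = 1 if price > avg_price
* one-way and two-way tabulations
tab expensive
tab headroom expensive, row
* mean and median price by group
tabstat price, by(foreign) stats(mean med)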

2.3.3 Modifying Variables:
1. Replace command:

 Syntax: replace variable = new_value if condition


 This command replaces values in a variable based on a specified condition.
 Example: replace price = 0 if price < 0
2. Rename command:

 Syntax: rename old_variable new_variable


 This command renames a variable.
 Example using sysuse auto: rename rep78 repairrec

Conclusion: Effective data storage and organization are essential for efficient data
management and analysis in Stata. This lecture note covered various data storage formats, file
extensions, and commands for data organization. Understanding and utilizing these
commands will enable you to work with datasets seamlessly and perform complex data
manipulation tasks in Stata.

2.4 Data Documentation

2.4.1 Data Documentation and Metadata


Data documentation and metadata are essential components of data management and data
governance processes. They provide information about the structure, content, and context of
data, enabling users to understand and utilize the data effectively. Here's a breakdown of each
concept:

Data Documentation: Data documentation refers to the process of capturing and recording
information about data sets. It involves describing the data's characteristics, including its
origin, purpose, structure, relationships, and any transformations or processing applied to it.

The primary goal of data documentation is to provide comprehensive details that facilitate
data understanding, interpretation, and reuse.

Data documentation may include:

Data source: The origin of the data, such as databases, files, or external systems.

Data structure: The organization and format of the data, including tables, fields, data types,
and constraints.

Data dictionary: A detailed description of each data element, including its name, definition,
allowed values, and relationships with other elements.

Data transformations: Any modifications, aggregations, or calculations performed on the


data.

Data quality information: Metrics, standards, or rules that assess the data's accuracy,
completeness, consistency, and timeliness.

Data usage and access: Information about who can access the data, permissions, and
restrictions.

Data lineage: The history of data, documenting its origins, modifications, and relationships
across different stages or systems.

Metadata: Metadata refers to the data about data. It provides structured information about
various aspects of data, serving as a descriptive layer that helps users discover, understand,
and manage data resources. Metadata can be categorized into three main types:

Descriptive metadata: It describes the characteristics of a dataset, such as its title, abstract,
keywords, author, creation date, and subject area. Descriptive metadata helps users identify
and locate relevant data resources.

Structural metadata: It describes the organization and relationships between different


components of a dataset. For example, in a relational database, structural metadata specifies
tables, columns, keys, and indexes, facilitating data integration and analysis.

Administrative metadata: It includes information about the technical and operational


aspects of data, such as file formats, access permissions, security controls, storage locations,

and data retention policies. Administrative metadata helps manage data throughout its
lifecycle.

Metadata can be stored in dedicated repositories or embedded within data files. Common
metadata standards and frameworks include Dublin Core, Data Documentation Initiative
(DDI), and ISO 19115 for geospatial data.

Effective data documentation and metadata management enhance data discovery,


interoperability, and data governance. They support data integration, analysis, and decision-
making processes while ensuring data quality, consistency, and compliance with
organizational standards and regulations.

2.4.2 The importance of documenting data


Documenting data is of utmost importance for several reasons:

Data Understanding: Documentation helps users understand the data, its structure, and its
meaning. It provides crucial information about the variables, their definitions, and the
relationships between them. Proper documentation allows users to interpret the data
accurately, make informed decisions, and avoid misinterpretation or errors in analysis.

Data Quality Assurance: Documentation plays a vital role in ensuring data quality. It allows
data users to assess the reliability, completeness, and accuracy of the data. By documenting
data sources, collection methods, and any data transformations or cleaning processes applied,
it becomes easier to identify potential biases, errors, or inconsistencies in the data. This helps
maintain data integrity and enhances the trustworthiness of the analysis and results.

Reproducibility and Replication: Documenting data is essential for reproducibility and


replication of research or analysis. Clear documentation enables others to understand and
replicate the data processing and analysis steps, ensuring transparency and validity. It
facilitates peer review, verification of results, and the advancement of knowledge by allowing
others to build upon existing research.

Collaboration and Knowledge Sharing: Documenting data enables effective collaboration


among team members. It provides a common understanding of the data, allowing for
seamless sharing, communication, and coordination. With proper documentation, team
members can work together efficiently, reducing misunderstandings and improving

productivity. Documented data also promotes knowledge sharing within an organization or
research community, enabling others to benefit from the data and insights generated.

Data Governance and Compliance: Documentation is crucial for data governance and
compliance with regulatory requirements. It helps establish data ownership, access
permissions, and data usage policies. Documenting data lineage, data privacy measures, and
security protocols ensures adherence to legal and ethical standards, protecting sensitive
information and maintaining data confidentiality.

Long-term Preservation and Data Legacy: Proper documentation ensures the long-term
preservation and usability of data. As data can have long lifespans, documented information
about the data's context, structure, and metadata becomes invaluable over time. It allows
future users or researchers to understand and make meaningful use of the data, even if the
original creators are no longer available or the data is archived.

In summary, documenting data enhances data understanding, quality, reproducibility,


collaboration, and compliance. It supports effective data management, facilitates knowledge
sharing, and ensures the long-term usability and value of the data. Documenting data is a best
practice that should be followed in any data-related project or organization.

2.4.3 Variable labels


Variable labels in Stata provide descriptive information about the variables in your dataset.
They allow you to assign meaningful labels to variables, making it easier to understand their
purpose and interpretation. This lecture note will guide you through using the "label variable"
command in Stata to assign variable labels.

1. Syntax of the "label variable" Command: The basic syntax for the "label variable"
command is as follows:

label variable varname "label"

 "varname" refers to the name of the variable you want to label.

 "label" represents the descriptive label you want to assign to the variable.

2. Assigning Variable Labels: To assign a label to a variable, follow these steps:

Step 1: Open your dataset in Stata using the "use" command:

use "your_dataset.dta"

Step 2: Use the "label variable" command to assign a label to a variable:

label variable varname "label"

 Replace "varname" with the name of the variable you want to label.

 Replace "label" with the descriptive label you want to assign to the variable.

Example: Suppose you have a variable named "price" that represents the price of cars. You can
assign a variable label using the following command:

label variable price "Price of cars (USD)"

This assigns the label "Price of cars (USD)" to the variable "price".

Step 3 Viewing Variable Labels: To view the variable labels in your dataset, use the "describe"
command, or open the Data Editor with "browse":

describe
or
browse

In the describe output (and in the Variables window), you will see the variable labels displayed alongside the variable names.
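
A compact sketch using the auto data:

* load example data and attach a descriptive label
sysuse auto, clear
label variable price "Price of cars (USD)"
* the label now appears next to the variable name
describe price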

Benefits of Variable Labels:

Improved Data Understanding: Variable labels provide meaningful descriptions that help
you understand the purpose and content of variables, especially when working with large or
complex datasets.

Documentation and Reproducibility: Variable labels document the variables, making it


easier for others to understand and replicate your analysis.

Readability of Output: When generating tables or reports, variable labels make the output
more readable and comprehensible, especially when variable names are cryptic or
abbreviated.

Using variable labels in Stata enhances data understanding, documentation, and the
reproducibility of your analysis. The "label variable" command allows you to assign
descriptive labels to variables, making your dataset more informative and facilitating
effective communication of your research findings.

2.5 Data Storage and Organization

Efficient data storage and organization are crucial for data management and analysis in Stata.
In this lecture note, we will discuss various data storage formats, file extensions, and
commands that facilitate data organization in Stata.

2.5.1 Data Storage Formats:


1. Stata data file (.dta):

 Stata's native data file format, used to store datasets.

 Example command to save a dataset as a Stata data file: save filename.dta

2. Comma-Separated Values file (.csv):

 A widely used plain text file format where data values are separated by commas.

 Example command to import a CSV file into Stata: import delimited filename.csv

3. Excel file (.xls, .xlsx):

 A popular spreadsheet file format.

 Example command to import an Excel file into Stata: import excel filename.xls

4. Other formats:

 Stata can import and export data in various other formats, such as SAS, SPSS, and
R.
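
For instance, a round trip between memory and disk with Stata's native format could look as follows (the file name is just an example):

* save the data currently in memory and reload it later
save "mydata.dta", replace
use "mydata.dta", clear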

2.5.2 File Extensions:


.dta:

Stata data file extension.

This is the native file format of Stata used to store datasets.

Example: dataset.dta

Example: filename.dta

do:

Stata do-file extension.

A do-file contains Stata commands that can be executed together.

Example: filename.do

Once you have written your Stata commands, save the do-file by clicking on "File" in the
menu bar and selecting "Save" or "Save As." Choose a name and location for the do-file and
give it the .do extension.

Example: analysis.do

To run the commands in the do-file, you have a few options:

Click on the green "Play" button in the toolbar of the do-file editor.

Use the keyboard shortcut Ctrl+D (Windows) or Command+D (Mac). Alternatively, you can
run the do-file from the Command window by typing do filename.do and pressing
Enter.

Your do-file is now created and ready to be executed. It allows you to execute a series of
Stata commands in a sequential and reproducible manner. Remember to save your do-file
regularly as you make changes to your analysis or data management procedures.
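
For illustration, a very small do-file named analysis.do might contain nothing more than the following (the saved file name is just an example):

* analysis.do - load the example data, summarize it, and save a copy
sysuse auto, clear
summarize price mpg weight
save "auto_copy.dta", replace

It can then be run from the Command window with: do analysis.do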

log:

Stata log file extension.

A log file records the Stata session, including commands and output.

Example: filename.log

To create a log file in Stata, follow these steps:

Open Stata and ensure that the Log window is visible. You can show the Log window by
clicking on "Window" in the menu bar and selecting "Log."

To start recording the Stata session in the log file, you have a few options:

Click on the "Log" button in the toolbar of the Log window.

Use the keyboard shortcut Ctrl+L (Windows) or Command+L (Mac).

Alternatively, you can open the Command window and type log using filename.log and press
Enter. Replace "filename" with the desired name for your log file.

Once the log recording is activated, all subsequent commands and output in Stata will be
recorded in the log file.

Execute the Stata commands you want to include in the log file. You can run commands in
the Command window, do-files, or interactively using the user interface.

After executing the desired commands, you can stop recording the session by:

Clicking on the "Log" button again in the toolbar of the Log window (if it is highlighted).

Using the keyboard shortcut Ctrl+L (Windows) or Command+L (Mac) (if the log recording is
active).

Typing log close in the Command window and pressing Enter.

The log file is saved with the .log extension. By default, it is saved in the current working
directory. You can specify a different path by using the log using command when starting the
log, as mentioned in step 2.

Example: log using "C:\Documents\log_file.log"

To view the contents of the log file, you can:

Click on the log file name shown in the Results window (it appears as a clickable link when the log is opened or closed).

Use the view command to open the log file in the Stata Viewer.

Example: view "C:\Documents\log_file.log"

Creating a log file in Stata helps to document and reproduce your analysis by recording the
commands, results, and any error messages encountered during the session. It is good practice
to use log files to keep track of your work and share it with others for collaboration or
replication.
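
A minimal sketch of a logged session (the log file name is only an example):

* start recording, run a few commands, then close the log
log using "session1.log", replace
sysuse auto, clear
summarize price
log close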

2.6 Data Security and Ethics

Data security and ethics are critical considerations when working with sensitive and
confidential data. As economics students, it is essential to handle data responsibly and adhere
to ethical guidelines. In this lecture note, we will discuss key principles of data security and
ethics, specifically focusing on their application in Stata.

I. Data Security:

Protecting Data:

 Ensure that data files are stored in secure locations with limited access.
 Use strong passwords and encryption to protect sensitive files.
 Regularly back up data to prevent loss or corruption.

Data Sharing:

 When sharing data, consider using anonymized or de-identified datasets.


 Remove or mask personally identifiable information (PII) to protect individuals'
privacy.
 Use secure channels for data transmission, such as encrypted file transfers.

Access Control:

 Grant access to data only to authorized individuals who have a legitimate need.
 Utilize user authentication and access control mechanisms in Stata.
 Implement role-based access control to restrict privileges based on user roles.

Data Destruction:

 Properly delete or destroy data when it is no longer needed.


 Use secure methods, such as data shredding, to ensure data cannot be recovered.

II. Ethical Considerations:


Informed Consent:

 Obtain informed consent from individuals whose data you are using for research.
 Clearly explain the purpose, procedures, and potential risks of the study.
Privacy Protection:

 Anonymize or de-identify data to protect the privacy of individuals.


 Handle confidential information with utmost care and use it only for authorized
purposes.
Responsible Data Usage:

 Use data only for the specified research purposes and avoid unauthorized disclosure.
 Follow ethical guidelines set by institutions, professional bodies, and funding agencies.

Reporting and Attribution:

 Accurately report and attribute the data sources used in your research.
 Give credit to the original data providers and cite relevant references.
Data security and ethics are essential considerations for economics students working with
data in Stata. This lecture note highlighted key principles for ensuring data security,
protecting privacy, and adhering to ethical guidelines. By following these principles and best
practices, you can handle data responsibly, maintain confidentiality, and contribute to sound
and ethical research practices in the field of economics.

Review questions on Stata application to data management in economics:

1. How can you import data from an external file (e.g., CSV or Excel) into Stata for data
management and analysis?
2. What are the different types of variables in Stata, and how do they affect data
management tasks?
3. How can you identify missing values in Stata datasets and handle them appropriately
for data management purposes?
4. What are the main commands used for recoding variables in Stata, and how can they
be applied to perform data management tasks in economics?
5. How can you generate new variables based on existing ones using Stata, and provide
an example where this would be beneficial in economic data management?
6. Explain the concept of data sorting in Stata and discuss its importance in data
management for economic analysis.
7. How can you generate summary statistics and descriptive analysis in Stata for a
specific variable or across multiple variables, and why is this useful in economic data
management?

Chapter 3: Application to Univariate Analysis
3.1 Summary Statistics:

To generate summary statistics in Stata, you can use the summarize command. This
command provides descriptive statistics for variables in your dataset. Here's an example of
how to use it:

1. Open Stata and load your dataset using the use command, or load an example dataset with sysuse. For example:

sysuse auto
 To generate summary statistics for a single variable, use the summarize command
followed by the variable name. For example, if you want to summarize the variable
"price":
summarize price
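
Since the screenshot of the output may not reproduce here, the result of summarize price for the auto data looks roughly as follows (the figures are the ones discussed below):

    Variable |   Obs        Mean    Std. Dev.      Min       Max
-------------+---------------------------------------------------
       price |    74    6165.257    2949.496      3291     15906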

As seen in the output above, summary statistics in Stata provide various univariate
analyses for a given variable. The default summary statistics include:

3.1.1 Count:
The number of non-missing observations for the variable.

There are 74 total observations

3.1.2 Mean:
The average value of the variable.

The mean value for the variable price is 6165.257

3.1.3 Standard deviation:
A measure of the dispersion or variability of the variable. The standard deviation of the
variable price is found to be 2949.496
3.1.4 Minimum:
The smallest value observed for the variable. The minimum price in the data is 3291
3.1.5 Maximum:
The largest value observed for the variable. The maximum price in the data is 15906
In addition to these default statistics, Stata's summarize command also provides other optional
summary statistics:
3.1.6 Quartiles:
The values that divide the data into four equal parts (25th percentile, median or 50th
percentile, and 75th percentile).

3.1.7 Percentiles:
The values that divide the data into specific percentage points (e.g., 10th, 90th percentile).

3.1.8 Skewness:
A measure of the asymmetry of the distribution of the variable.

3.1.9 Kurtosis:
A measure of the heaviness of the tails of the distribution of the variable.

By default, Stata calculates the count, mean, standard deviation, minimum, and maximum. To
include additional statistics such as quartiles, percentiles, skewness, and kurtosis, you can use
the detail option with the summarize command.
Here's an example of summarizing the variable "price" and including quartiles,
percentiles, skewness, and kurtosis:

summarize price, detail

This will provide you with a comprehensive set of univariate summary statistics for the
variable "price" in your dataset.
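If you need a compact table of several statistics for more than one variable at once, Stata's tabstat command is a convenient alternative. A minimal sketch, using variables from the auto dataset (the particular statistics chosen are only illustrative):

* summary table for several variables at once
sysuse auto, clear
tabstat price mpg weight, statistics(n mean sd min max p50)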

3.2 Histogram or Frequency Distribution:

Univariate histograms or frequency distributions are specifically focused on analyzing a
single variable. Here are some specific uses of univariate histograms or frequency
distributions:

1. Data Distribution Analysis: Univariate histograms provide a visual representation of
the distribution of a single variable. They help in understanding the shape of the data
distribution, such as whether it is symmetric, skewed to the left or right, or bimodal.
2. Central Tendency and Spread: Histograms can be used to determine the measures of
central tendency, such as the mean, median, and mode of the variable. Additionally,
the spread or variability of the data can be assessed by examining the width of the
bins and the dispersion of the bars in the histogram.
3. Outlier Detection: Univariate histograms can reveal outliers, which are observations
that significantly differ from the majority of the data points. Outliers are often
depicted as tall bars or separate bins that are located far away from the main
distribution.

4. Data Cleaning and Preprocessing: Histograms help in identifying any irregularities or
anomalies in the data, such as missing values or data entry errors. By visualizing the
frequency distribution, you can spot gaps or unexpected patterns that may require
further investigation or data cleaning.
5. Understanding Data Characteristics: Univariate histograms provide insights into the
characteristics of the data, such as the range of values, minimum and maximum
values, and the presence of data clusters or peaks. This information can be valuable in
understanding the nature of the variable and its potential impact on the analysis.
6. Assessing Data Skewness and Kurtosis: Skewness refers to the asymmetry of the data
distribution, while kurtosis measures the degree of peakedness or flatness of the
distribution. Histograms can help identify whether the data is normally distributed or
exhibits significant skewness or kurtosis.
7. Data Comparison: Univariate histograms are useful for comparing different datasets
or subsets of data. By overlaying multiple histograms, you can visually compare the
distributions and identify similarities or differences in the variables being analyzed.

Create a histogram or frequency distribution to visualize the distribution of the variable using
the histogram or graph bar command. (use: sysuse auto)
histogram price
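The histogram command also accepts options controlling the number of bins, the y-axis scale, and a normal-curve overlay. A minimal sketch (the option values are illustrative):

* histogram with 10 bins, frequencies on the y-axis, and a normal overlay
histogram price, bin(10) frequency normal title("Distribution of price")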

In summary, univariate histograms or frequency distributions are valuable tools for analyzing
and understanding the distribution, central tendency, variability, and anomalies of a single

variable. They enable researchers, analysts, and data scientists to gain insights and make data-
driven decisions based on the characteristics of the variable under examination.

3.3 Box Plots:

Box plots, also known as box-and-whisker plots, are commonly used in univariate analysis to
summarize the distribution of a single variable. They provide a visual representation of the
central tendency, spread, and skewness of the data. Here are the uses of box plots in
univariate analysis:

1. Central Tendency: Box plots provide information about the central tendency of the
data, including the median (middle value) and the position of the box. The horizontal
line within the box represents the median, which gives an indication of the typical or
central value of the variable.
2. Spread or Variability: The width of the box in a box plot represents the spread or
variability of the data. A wider box indicates a larger spread, while a narrower box
indicates a smaller spread. By comparing the widths of different box plots, you can
assess the relative variability between variables or groups.
3. Skewness and Symmetry: Box plots can reveal the skewness or asymmetry of the data
distribution. If the whiskers (lines extending from the box) are noticeably imbalanced
or unequal in length, it suggests skewness in the data. Additionally, the position of the
median within the box can provide insights into the symmetry or lack thereof in the
distribution.
4. Outlier Detection: Box plots help in identifying outliers, which are observations that
significantly differ from the majority of the data points. Outliers are represented as
individual points beyond the whiskers of the plot. Their presence can indicate
potential data anomalies or extreme values that may require further investigation.

Generate box plots to visualize the distribution of a variable using the graph box command.
(use: sysuse auto)

graph box price
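To compare the distribution of price across groups, you can add the over() option; a minimal sketch using the foreign variable from the auto dataset:

* separate boxes for domestic and foreign cars
graph box price, over(foreign)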

The resulting graph lets us visualize the variable and spot the outliers. In summary,
box plots are a powerful tool for summarizing and visualizing the distribution, central
tendency, spread, skewness, and outliers in a single variable. They allow for easy
comparison, detection of anomalies, and identification of important characteristics of the
data, making them valuable in univariate analysis.

3.4 Bar plots or Pie charts

In Stata, you can create bar plots and pie charts to visualize categorical variables in univariate
analysis. Here's an overview of how to create these plots in Stata and their uses in univariate
analysis:

1. Bar Plots:
o Creating Bar Plots: In Stata, you can use the graph bar command to create
bar plots.
o Bar plots are useful for visualizing the distribution of categorical variables.
They allow you to compare the frequencies or proportions of different
categories and identify patterns or trends within the data. Bar plots are
especially effective when you want to compare the frequencies across different
groups or conditions.
2. Pie Charts:

o Creating Pie Charts: To create a pie chart in Stata, you can use the graph pie
command. Specify the categorical variable, and Stata will generate a pie chart
displaying the relative proportions of each category.
o Pie charts are useful for illustrating the composition or relative distribution of
categorical variables. They provide a visual representation of the proportions
or percentages of each category within the whole. Pie charts can help you
quickly identify the dominant categories and understand the overall
distribution of the variable.
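A minimal sketch of each, using the categorical variable foreign from the auto dataset (the variable choice is only illustrative):

* bar chart of category counts
sysuse auto, clear
graph bar (count), over(foreign)

* pie chart of the same categorical variable
graph pie, over(foreign)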

Both bar plots and pie charts have their own advantages and considerations:

 Bar plots are typically preferred when you have a larger number of categories or when
you want to compare the frequencies across multiple groups. They are useful for
displaying absolute frequencies or proportions using different bar lengths.
 Pie charts are more suitable when you have a smaller number of categories and want
to emphasize the relative proportions or percentages of each category. However, it's
important to note that pie charts can be more challenging to interpret accurately,
especially when there are many small categories or when comparing similar-sized
categories.

When using bar plots or pie charts in univariate analysis, it's essential to ensure the
visualizations are clear, informative, and accurately represent the data. Remember to provide
appropriate labels, legends, and titles to enhance the interpretability of the plots. Additionally,
always consider the specific characteristics of your data and research question to determine
the most appropriate type of visualization.

Review questions on Stata application to univariate analysis in economics:

1. What are the main graphical techniques available in Stata for visualizing the distribution
of a single variable, and how can they be used for univariate analysis in economics?
2. How can you calculate measures of central tendency (e.g., mean, median) and
dispersion (e.g., standard deviation, range) for a variable using Stata, and why are these
measures important in univariate analysis in economics?
3. Explain the concept of hypothesis testing in the context of univariate analysis using
Stata, and discuss how you can perform t-tests or chi-square tests to assess the

significance of relationships between variables in economic analysis.
4. How can you generate frequency tables and histograms for categorical variables using
Stata, and why are these techniques useful for univariate analysis in economics?
5. Discuss the concept of skewness and kurtosis in Stata, and explain how these measures
can help assess the shape and distribution of a variable in univariate analysis in
economics.
6. Explain how you can conduct outlier detection and management in Stata for a single
variable, and discuss why it is important to identify and handle outliers in univariate
analysis in economics.
7. How can you generate descriptive statistics tables in Stata that summarize multiple
variables simultaneously, and why is this beneficial for univariate analysis in
economics?

Chapter 4: Application to Bivariate Analysis
4.1 Scatter Plot:

Scatter Plot: The scatter command creates a scatter plot to visualize the relationship
between two variables. It allows you to assess the nature and strength of the relationship
between the variables. Scatter plots are beneficial for identifying patterns, trends, clusters, or
outliers in the data. They provide a visual representation of the data points' distribution and
help in understanding the relationship's direction (positive or negative).

Create a scatter plot to visualize the relationship between the two variables using the scatter
command. (Using sysuse auto, let's look at the relationship between trunk and weight.)

scatter trunk weight

4.2 Correlation Coefficient:

Correlation Coefficient: The correlate command calculates the correlation coefficient
between two variables. It quantifies the strength and direction of the linear relationship
between the variables. The correlation coefficient is beneficial for determining the degree of
association between variables. It ranges from -1 to +1, with values closer to -1 or +1
indicating a stronger relationship and values close to 0 indicating a weaker or no relationship.

o Calculate the correlation coefficient between the two variables to measure the
strength and direction of the linear relationship using the correlate command.
o (Using sysuse auto, let's look at the relationship between trunk and weight.)

corr trunk weight

As the output shows, the variables trunk and weight have a positive correlation, and the
correlation coefficient is 0.6722.

4.3 Regression Analysis:

The regress command conducts a regression analysis to examine the relationship between
the dependent variable and one or more independent variables. It estimates the regression
coefficients, provides information on the statistical significance of the predictors, and helps in
understanding how changes in the independent variables are associated with changes in the
dependent variable. Regression analysis is valuable for predicting outcomes, identifying
important predictors, and assessing the overall fit of the model.

o Conduct a regression analysis to examine the relationship between the dependent
variable and the independent variable(s) using the regress command.
o (Using sysuse auto, let's look at the relationship between price and trunk.)

regress price trunk

As the regression output shows, we obtain the coefficients, standard errors, t-values,
p-values, confidence intervals, the R-squared, and other important information.

4.4 Line Plot or Grouped Bar Chart:

The twoway command creates line plots, while the graph bar command generates grouped
bar charts. These visualizations are useful when one variable is categorical. Line plots display
trends or changes over categories, while grouped bar charts provide a visual comparison of
values across categories. They are beneficial for understanding how a variable varies across
different groups or conditions and identifying any differences or patterns.

If one variable is categorical, you can create a line plot or a grouped bar chart to visualize the
relationship between the variables based on different categories using the twoway or graph
bar command.

o Example command for a line plot, where y and x are placeholders for your own
variables (with sysuse auto you might, for instance, relate price to a categorical
variable such as foreign):

twoway (line y x)
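As a concrete illustration with the auto dataset, where price is continuous and foreign is categorical, a grouped bar chart of mean price by origin could be drawn as follows (a sketch, not the only possible specification):

* mean price by car origin as a grouped bar chart
sysuse auto, clear
graph bar (mean) price, over(foreign)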

4.5 Cross-tabulation:

Cross-tabulation: The tabulate or tab2 command performs a cross-tabulation or
contingency table analysis to explore the relationship between two categorical variables. It
displays the frequency distribution of the variables in a tabular format and helps in
understanding the association between the categories. Cross-tabulation is valuable for
identifying associations, dependencies, or patterns between categorical variables.

Perform a cross-tabulation or contingency table analysis to explore the relationship between
two categorical variables using the tabulate or tab2 command.
o Example command: (using: sysuse auto)
tabulate var1 var2
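For instance, with the auto dataset you could cross-tabulate car origin against the repair record (both categorical):

* cross-tabulation of two categorical variables
sysuse auto, clear
tabulate foreign rep78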

4.6 Independent Samples t-test or Mann-Whitney U test:

Independent Samples t-test or Mann-Whitney U test: The ttest or ranksum command
conducts independent samples t-tests or Mann-Whitney U tests to compare means or medians
of a continuous variable between two groups when one variable is categorical. These tests
help in assessing whether there are statistically significant differences between the groups in
terms of the continuous variable. They are useful for comparing group means or medians and
determining if the differences are likely due to chance or not.

If one variable is categorical and the other is continuous, you can conduct an independent
samples t-test or a Mann-Whitney U test to compare the means or medians of the continuous
variable between the two groups using the ttest or ranksum command.

o Example command for the t-test (using sysuse auto): ttest price, by(foreign)

 If calculated t-value > critical t-value (t > critical t), reject the null hypothesis.
 If calculated t-value ≤ critical t-value (t ≤ critical t), fail to reject the null hypothesis.
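If the distributional assumptions behind the t-test are doubtful, the nonparametric Mann-Whitney U (rank-sum) test can be run on the same grouping; a minimal sketch:

* Mann-Whitney (Wilcoxon rank-sum) test of price by car origin
ranksum price, by(foreign)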

4.7 Chi-Square Test:

Chi-Square Test: The tabulate (or tab2) command performs a chi-square test to
examine the association between two categorical variables. It assesses whether there is a
statistically significant relationship between the variables. The chi-square test is beneficial for
testing the independence or association between categorical variables and helps in
understanding if the observed frequencies significantly deviate from the expected
frequencies.

If both variables are categorical, you can perform a chi-square test to examine the association
between the variables using the tabulate (or tab2) command.

To perform a chi-square test in Stata, you can use the tabulate command. Here's the syntax:

tabulate var1 var2, chi2

 Example command (using sysuse auto): tab foreign highpri, chi2

 If calculated chi-square test statistic > critical chi-square value (chi-square statistic >
critical chi-square), reject the null hypothesis.
 If calculated chi-square test statistic ≤ critical chi-square value (chi-square statistic ≤
critical chi-square), fail to reject the null hypothesis.

Review questions on Stata application to bivariate analysis in economics:

1. How can you calculate and interpret correlation coefficients using Stata, and why is
correlation analysis important in bivariate analysis in economics?
2. Explain the concept of scatter plots in Stata, and discuss how they can be used to
visualize the relationship between two variables in bivariate analysis in economics.
3. Discuss the process of conducting regression analysis using Stata, and explain how you
can interpret the coefficients and assess the significance of the relationship between two
variables in economic analysis.
4. Explain the concept of covariance and its interpretation in Stata, and discuss its relevance
in analyzing the relationship between two variables in bivariate analysis in economics.
5. How can you generate a scatter plot matrix in Stata to visualize the relationships between
multiple variables simultaneously, and why is this useful in bivariate analysis in
economics?

Chapter 5: Application to Cross Sectional Econometrics

In cross-sectional econometrics analysis, several tests can be conducted to assess the validity
of the underlying assumptions and to evaluate the results of the regression model. Here are
some commonly used tests in cross-sectional econometrics analysis, along with the
corresponding Stata commands:

5.1 Heteroscedasticity test:

o Command: hettest
o Purpose: Assess whether there is heteroscedasticity (unequal variances) in the
regression model.
o Example: after fitting a regression (with sysuse auto loaded), type estat hettest; a full sketch follows below.
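A minimal sketch of the full sequence with the auto dataset (the regression specification itself is only illustrative):

* fit a regression, then test for heteroscedasticity
sysuse auto, clear
regress price mpg weight
estat hettest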

After running the test, the output will provide you with
several test statistics and p-values. Here's a decision rule for interpreting the results of a
heteroscedasticity test in Stata:

Look at the overall test of heteroscedasticity: Stata's hettest (estat hettest) command reports
the Breusch-Pagan/Cook-Weisberg test, while White's test is available separately through
estat imtest, white. Examine the p-values associated with these tests.

o If the p-value is less than your chosen significance level (e.g., 0.05), you can
conclude that there is evidence of heteroscedasticity in the data. In this case, you
should consider addressing the issue of heteroscedasticity in your analysis.
o If the p-value is greater than your chosen significance level, you fail to reject the
null hypothesis of homoscedasticity. This suggests that there is no strong evidence
of heteroscedasticity in the data, and you can assume homoscedasticity in your
analysis.

Remember that the decision rule for the heteroscedasticity test is based on the chosen
significance level (e.g., 0.05). You can adjust the significance level according to your specific
requirements and research context.

Additionally, it is important to note that the presence of heteroscedasticity may have
implications for the validity of your regression results. If heteroscedasticity is detected, you
may need to consider using robust standard errors or alternative estimation techniques that
account for heteroscedasticity to obtain reliable parameter estimates and correct standard
errors in your analysis.

5.2 Normality test:

In Stata, there are various commands that can be used to perform a normality test on a
variable, such as swilk or sktest. The decision rule for interpreting the results of a
normality test depends on the specific command used and the corresponding test statistic and
p-value. Here's a general decision rule for interpreting the results of a normality test in Stata:

1. Look at the test statistic: The specific command used in Stata will provide a test
statistic related to the normality of the variable. This could be the Shapiro-Wilk
statistic (swilk) or the skewness and kurtosis test statistic (sktest).
2. Examine the p-value: The output of the normality test in Stata will also include a p-
value associated with the test statistic. This p-value indicates the strength of evidence
against the null hypothesis of normality.
o If the p-value is less than your chosen significance level (e.g., 0.05), you can
conclude that there is evidence against the null hypothesis of normality. In this
case, you have sufficient evidence to reject the assumption of normality for the
variable.

o If the p-value is greater than your chosen significance level, you fail to reject
the null hypothesis of normality. This suggests that there is no strong evidence
against the assumption of normality for the variable.
o Command: swilk or sktest
o Purpose: Test the normality assumption of the error term in the regression
model.
o Example: swilk residuals
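Because the residuals must first exist as a variable in the dataset, a minimal sketch of the full sequence (using the auto dataset and an illustrative regression) is:

* fit a regression, save the residuals, then test them for normality
sysuse auto, clear
regress price mpg weight
predict residuals, residuals
swilk residuals
sktest residuals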

It is important to note that normality tests are generally sensitive to sample size. In larger
sample sizes, even minor deviations from normality may lead to rejection of the null
hypothesis. Therefore, when interpreting the results, consider the sample size and the
practical significance of any deviations from normality.

5.3 correlation

o This command calculates the correlation coefficients between variables in
your dataset. Here's an example of how to use it:
o Command: corr
o Stata will display the correlation matrix, which includes the correlation
coefficients between all pairs of variables. The coefficients range from -1 to 1,
where -1 indicates a perfect negative correlation, 0 indicates no correlation,
and 1 indicates a perfect positive correlation.

o Purpose: Assess the strength and direction of the linear association between
pairs of variables; high pairwise correlations among regressors can also signal
potential multicollinearity.

The acceptable correlation coefficient for linearity depends on the field of study or the
specific context of analysis. However, in general, a correlation coefficient value between -1
and 1 represents the strength and direction of the linear relationship between two variables. A
positive correlation coefficient indicates a positive linear relationship, while a negative
correlation coefficient indicates a negative linear relationship. Moreover, the magnitude of
the correlation coefficient indicates the strength of the relationship, where values close to -1
or 1 indicate a stronger linear relationship, and values close to 0 suggest a weaker linear
relationship.

5.4 Multicollinearity test:

To perform a multicollinearity test and interpret the results using Stata, you can follow these
steps and commands:
1. First, load your dataset into Stata using the `use` command. For example:

use "your_dataset.dta"
2. Next, run a regression model with the variables you want to test for multicollinearity. Use
the `regress` command. For example, if you have three independent variables named x1, x2,
and x3, and the dependent variable is y, the command would be:
regress y x1 x2 x3
3. After running the regression, you can check the multicollinearity by using the `vif`
command, which stands for variance inflation factor. The VIF measures how much the
variance of the estimated regression coefficient is increased due to multicollinearity. For
example:
vif
4. Stata will produce the VIF values for each independent variable. Typically, a VIF value
above 5 or 10 indicates multicollinearity. Generally, the higher the VIF, the higher the
multicollinearity. You might also want to consider other diagnostics, such as condition
indices or tolerance.
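Putting the steps above together with the auto dataset (the particular regressors are only illustrative):

* regression followed by variance inflation factors
sysuse auto, clear
regress price mpg weight length
vif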

5. If you find that multicollinearity exists, you can take appropriate actions to address it.
Some possible solutions include:

 Removing one or more variables that are highly correlated with others.
 Transforming the variables (e.g., using logarithmic transformation) to reduce the
correlation.
 Including interaction terms between the correlated variables to account for their joint
effect.

Remember, multicollinearity is a statistical issue that affects the interpretation and stability of
regression models. It is important to address it appropriately in order to obtain reliable results.

5.5 Specification tests:

In cross-sectional analysis using Stata, specification tests and decision rules are crucial for
assessing the adequacy of statistical models and making informed decisions. Below, let us see
a brief overview of some commonly used specification tests and decision rules in cross-
sectional analysis using Stata along with the relevant commands.

Testing functional form:

- Ramsey's RESET test: regress y x1 x2; estat ovtest

This command tests for omitted non-linear terms in the regression model.

- Breusch-Pagan/Cook-Weisberg test for heteroscedasticity: regress y x1 x2; estat hettest
This command tests for the presence of heteroscedasticity in the model.
Adjusted R-squared: regress y x1 x2 (the adjusted R-squared is reported directly in the regress output); information criteria such as AIC and BIC can be obtained with estat ic

Look for a higher adjusted R-squared value, which indicates a better fit of the model while
discounting for excessive complexity.

Robust standard errors: regress y x1 x2, vce(robust)

Use robust standard errors if homoscedasticity is violated or if there are concerns about
model assumptions.

Assessing significance: regress y x1 x2; test x1 x2

Perform a joint significance test on the independent variables to determine if they collectively
have a significant impact on the dependent variable.

5.6 Goodness-of-fit measures:

In Stata, you can assess the goodness-of-fit measures and implement decision rules for cross-
sectional analysis using various commands. Here are a few commonly used commands and
their purposes:

1. "regress" command: This command is used to estimate linear regression models. It
provides relevant goodness-of-fit measures such as R-squared and adjusted R-squared, which
indicate the proportion of variance explained by the model.

2. "logit" or "probit" commands: These commands are used to estimate binary response
models. They provide pseudo R-squared measures (e.g., McFadden's R-squared or Cox and
Snell R-squared) to assess the model's goodness-of-fit.

3. "xtreg" command: This command is used for panel data analysis where observations are
clustered over time or across groups. It provides different goodness-of-fit measures such as
within-group R-squared or variance components.

4. "heckman" command: This command is used for estimating selection models to correct for
sample selection bias. It provides likelihood ratio tests and other measures to assess the
goodness of fit.

o Purpose: Evaluate the overall goodness-of-fit of the regression model.

o Example: reg y x1 x2 x3 ... xn

From the regression output we can see the adjusted R-squared of 0.5297, which is the
(adjusted) proportion of variance explained by the model.

These are just a few examples of tests that can be performed in cross-sectional econometrics
analysis. The specific tests you should conduct may vary depending on the nature of your
research question, the assumptions of your model, and the specific econometric techniques
employed. It's important to consult econometric textbooks, research articles, or the Stata
documentation for more comprehensive information on conducting tests in cross-sectional
econometrics.

Review questions on Stata application to cross-sectional econometrics:

1. How can you estimate and interpret a simple linear regression model using Stata, and
what are the key assumptions underlying this model in cross-sectional econometrics?
2. Explain the concept of heteroscedasticity in the context of cross-sectional econometrics,
and discuss how you can detect and address heteroscedasticity using Stata.
3. How can you estimate and interpret a multiple regression model with multiple
independent variables using Stata, and what additional insights does this model provide
in cross-sectional econometrics?
4. Discuss the concept of multicollinearity in cross-sectional econometrics, and explain
how you can detect and mitigate multicollinearity issues using Stata.

CHAPTER SIX: APPLICATION TO TIME SERIES ECONOMETRICS
AND PANEL DATA
6.1 Application to Time Series Econometrics
6.1.1 Importing and Managing Data:

 Use the use command to load your dataset into Stata. You can get this data through
this link. (https://www.stata-press.com/data/r13/tsmain.html)

 Use the tsset command to specify the time variable. In the example dataset, qtr is set
as the time variable by this command.

Exploratory Data Analysis:

 Utilize commands like summarize, tabulate, and histogram to gain an
understanding of the data's distribution, summary statistics, and variable relationships.

6.1.2 Time Series Analysis: Unit root tests

To test for a unit root in a time series using Stata, you can use the Augmented Dickey-Fuller
(ADF) test or the Phillips-Perron (PP) test. Here's how you can perform these tests using
Stata commands:

1. Augmented Dickey-Fuller (ADF) test:

o Load your time series data into Stata.
o Open the Command window and type the following command:

dfuller varname

 Replace "varname" with the name of your variable.

 Press Enter to execute the command.
 Stata will provide the test results, including the test statistic, p-value, and critical
values at different significance levels.

 The null hypothesis for the ADF test is the presence of a unit root. If the p-value is
less than your chosen significance level (e.g., 0.05), you can reject the null hypothesis
and conclude that the series is stationary (no unit root).

 As the output shows, both investment and income are not stationary, because the
p-values for both are greater than the chosen significance level (0.05).

 If your time series data is not stationary, you can take several steps to make it
stationary before applying time series models or analyses. Here are some common
approaches to dealing with non-stationarity in time series data:

 Transformation: Apply mathematical transformations to stabilize the
variance of the series. For example, taking the natural logarithm (ln()) or
applying the Box-Cox transformation (boxcox) in Stata can help address
heteroscedasticity or nonlinear patterns in the data.
 Log Transformation: To take the natural logarithm (log base e) of a variable
named "varname," you can use the generate command with the ln()
function:
generate log_var = ln(varname)

We have now transformed our variables to natural logarithms; let's test again for
stationarity.

As the p-values show, the variables are still not stationary, so we proceed to the next
method below.
 Differencing: Take first differences: Compute the difference between
consecutive observations in the time series using the generate command in
Stata. This transformation can help remove trends or seasonality and make the
data stationary. Use the arima command with the differenced variable to
estimate models on the differenced data.
 First Difference: As mentioned earlier, you can use the generate command
to create a new variable representing the first difference. For example, to
compute the first difference of a variable named "varname," you can use the
following command:
generate diff_var = varname - L.varname
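Equivalently, once the data have been tsset, the time-series difference operator D. creates the same series in one step; a minimal sketch (varname is a placeholder for your own variable):

* first difference using the D. operator, then re-run the unit root test
generate d_varname = D.varname
dfuller d_varname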

After taking the first difference of the variables, we can now test for stationarity again.

 The null hypothesis for the ADF test is the presence of a unit root. If the p-value is
less than your chosen significance level (e.g., 0.05), you can reject the null hypothesis
and conclude that the series is stationary (no unit root). By this rule, our p-values are
now 0.000, so we can conclude that both the investment and income series are
stationary.

2. Phillips-Perron (PP) test:

 Load your time series data into Stata.

 Open the Command window and type the following command:

 pperron varname
 Replace "varname" with the name of your variable.
 Press Enter to execute the command.
 Stata will provide the test results, including the test statistic, p-value, and critical
values at different significance levels.

 The null hypothesis for the PP test is also the presence of a unit root. If the p-value
is less than your chosen significance level, you can reject the null hypothesis and
conclude that the series is stationary.
In the case above, the p-values for all three variables are greater than the chosen significance
level of 0.05, so at this stage we conclude that the variables are not stationary. By repeating
what we did for the Dickey-Fuller test (log transformation and first differencing), we can
make the variables stationary.

 Applying the same decision rule to the log-transformed and differenced series, the
p-values are now 0.000, so we reject the null hypothesis of a unit root and conclude
that both the investment and income series are stationary.

It's important to note that when applying these tests, it's recommended to consider the
appropriate lag order for your time series. Stata's default lag selection method is based on the
Akaike Information Criterion (AIC), but you can specify a different lag order if desired.

Additionally, there are other unit root tests available in Stata, such as the Kwiatkowski-
Phillips-Schmidt-Shin (KPSS) test and the Elliott-Rothenberg-Stock (ERS) DF-GLS test. The
KPSS test is implemented in the user-written kpss command (installable with ssc install kpss),
and the ERS test in Stata's built-in dfgls command. Refer to Stata's documentation
for more information on these tests and their usage.

6.1.3 White's Test for Heteroscedasticity:

White's test for heteroscedasticity is a commonly used test to detect the presence of
heteroscedasticity in regression models. It is based on the residuals of the regression model
and helps assess whether the assumption of constant variance holds. In Stata, White's test is
run after a regression with estat imtest, white, while the closely related Breusch-Pagan/
Cook-Weisberg test is run with hettest (estat hettest). Here's how you can use it:

 Load your dataset into Stata using the use command.

 Estimate your regression model using any of Stata's regression commands, such as
regress, xtreg, or areg, depending on the type of data and model you are working
with.
 If you also want to inspect the residuals directly, you can generate the fitted values and
residuals after the regression (here inv is the dependent variable of the fitted model):
predict xb
gen residual = inv - xb
 Open the Command window and type the following command to perform White's
test for heteroscedasticity:
estat imtest, white
 For the Breusch-Pagan/Cook-Weisberg version of the test, type hettest (or estat
hettest) instead.
 Press Enter to execute the command. Stata will perform the test and provide the
results, including the test statistic, p-value, and the null hypothesis that assumes
homoscedasticity.

The interpretation of White's test results depends on the p-value obtained. If the p-value is
less than your chosen significance level (e.g., 0.05), you can reject the null hypothesis of
homoscedasticity and conclude that heteroscedasticity is present in the regression model.
Conversely, if the p-value is greater than your significance level, you fail to reject the null
hypothesis, indicating no evidence of heteroscedasticity. In the output above, the p-value is
0.4583, which is greater than 0.05, so we fail to reject the null hypothesis and conclude that
there is no evidence of heteroscedasticity.

It's important to note that White's test is asymptotic and assumes the absence of serial
correlation in the residuals. If your data violates the assumption of no serial correlation, you
may need to consider robust standard errors or other methods to address heteroscedasticity.

6.1.4 Granger Causality Test

To perform a Granger causality test in Stata, you can use the granger command. This test
helps assess the causal relationship between two time series variables. Here's how you can
conduct a Granger causality test:

1. Load your time series data into Stata using the use command.
2. Open the Command window and type the following command to perform the Granger
causality test:

granger dependent_var independent_var

Replace "dependent_var" with the name of the variable assumed to be the dependent variable
and "independent_var" with the name of the variable assumed to be the independent variable.

For example, if you want to test the Granger causality between variables "y" and "x", you can
use the following command:

granger y x

Press Enter to execute the command. Stata will perform the Granger causality test and
provide the test results, including the F-statistic, p-value, and degrees of freedom.

The interpretation of the Granger causality test results depends on the obtained p-value. If the
p-value is less than your chosen significance level (e.g., 0.05), you can reject the null
hypothesis and conclude that there is evidence of Granger causality between the variables.
Conversely, if the p-value is greater than the significance level, you fail to reject the null
hypothesis, indicating no evidence of Granger causality.

It's important to note that the Granger causality test relies on the assumption that the time
series variables are stationary and have no other omitted variables affecting both the
dependent and independent variables. Violation of these assumptions may lead to unreliable
or misleading test results.

Stata's granger command provides additional options to specify the lag order, select different
tests (such as the Wald test or likelihood ratio test), and conduct robust inference. Consult
Stata's documentation for more details on these options and their usage.

Note that granger is not a built-in Stata command; if a user-written version is available, it
must be installed separately. Alternatively, you can use the built-in vargranger command
after estimating a vector autoregression with var, as sketched below.
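A minimal sketch, assuming the series inv and inc from the dataset linked earlier have been tsset and are made stationary by differencing (the lag length is illustrative):

* difference the series, fit a VAR, then run the Granger causality test
generate d_inv = D.inv
generate d_inc = D.inc
var d_inv d_inc, lags(1/2)
vargranger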

6.1.5 Autoregressive integrated moving average (ARIMA) models

To estimate autoregressive integrated moving average (ARIMA) models using the arima
command in Stata, follow these steps:

1. Load your time series data into Stata using the use command.
2. Open the Command window and type the following command to estimate an ARIMA
model:

arima dependent_var, arima(p, d, q)

Replace "dependent_var" with the name of the dependent variable you want to model.
Replace "p" with the desired order of the autoregressive (AR) component, "d" with the
number of differences needed to achieve stationarity (order of differencing), and "q" with the
desired order of the moving average (MA) component.

For example, if you want to estimate an ARIMA(1,1,1) model for a variable named "inv" in
the previous data which you can get from the link provided, you can use the following
command:

arima inv, arima(1, 1, 1)

Press Enter to execute the command. Stata will estimate the specified ARIMA model using
maximum likelihood estimation (MLE) and provide the results, including coefficient
estimates, standard errors, t-values, and p-values.

It's important to note that before estimating an ARIMA model, you should ensure that your
data is stationary or can be made stationary through differencing. If your data is non-
stationary, you can apply differencing using the generate command to create a differenced
variable, as discussed in a previous response.

Stata's arima command provides additional options to customize the model estimation, such
as including exogenous variables, specifying seasonal components, and selecting alternative
estimation methods. Consult Stata's documentation for more details on these options and how
to use them.

After estimating the ARIMA model, you can analyze the model diagnostics, evaluate the
significance of the coefficients, and make predictions or forecast future values using the
estimated model.
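For instance, a minimal sketch of obtaining predictions of the original (undifferenced) series after estimation (the model order and variable name follow the example above, and inv_hat is just an illustrative name for the new variable):

* fit the ARIMA(1,1,1) model, then predict the level of inv
arima inv, arima(1, 1, 1)
predict inv_hat, y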

6.2 Panel Data Analysis:

In this part we are going to see how you can use Stata commands to specify panel and time
variables, estimate fixed effects or random effects panel data models, and conduct Hausman
tests for model specification:

6.2.1 Specifying the panel and time variables using the xtset command:
For the below example you can get the data from the below link
http://www.stata-press.com/data/r9/union.dta
 Load your panel data into Stata using the use command.
 Open the Command window and type the following command to specify the panel
and time variables:
xtset panel_var time_var

Replace "panel_var" with the name of the variable that identifies each panel unit (ID code),
and "time_var" with the name of the variable representing the time period (year).
Press Enter to execute the command. Stata will set the panel and time variables for your
dataset, allowing for panel data analysis.

6.2.2 Estimating fixed effects or random effects

Estimate fixed effects or random effects panel data models using the xtreg command:

 After setting the panel and time variables, you can estimate panel data models using
the xtreg command.
 Specify the dependent variable and independent variables in the regression model, and
include the fe option for fixed effects or the re option for random effects.
 Example command for fixed effects estimation:

xtreg dependent_var independent_vars, fe

estimates store fe

 Example command for random effects estimation:

xtreg dependent_var independent_vars, re

estimates store re

 Press Enter to execute the command. Stata will estimate the panel data model using fixed
effects or random effects estimation, depending on the option specified.

6.2.3 Hausman tests
Conduct Hausman tests for model specification using the hausman command.

hausman fe re
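A complete minimal sketch using the union dataset linked above, assuming it contains the panel identifier idcode, the time variable year, and the variables union, age, and grade (the regression specification is only illustrative):

* fixed effects vs. random effects, then the Hausman specification test
use http://www.stata-press.com/data/r9/union.dta, clear
xtset idcode year
xtreg union age grade, fe
estimates store fe
xtreg union age grade, re
estimates store re
hausman fe re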

When conducting a Hausman test in Stata, the test result can guide you in deciding between
fixed effects (FE) and random effects (RE) model specifications for panel data analysis.
Here's how to interpret the Hausman test result:

1. Null Hypothesis (H0): The random effects (RE) model is appropriate (no significant
difference between FE and RE coefficients).
2. Alternative Hypothesis (HA): The fixed effects (FE) model is preferred (significant
difference between FE and RE coefficients).

After running the Hausman test in Stata, you will obtain a test statistic and its associated p-
value. Here's how to interpret the results:

 If the p-value is less than your chosen significance level (e.g., 0.05), you reject the
null hypothesis (H0) in favor of the alternative hypothesis (HA). This suggests that
the fixed effects (FE) model is preferred over the random effects (RE) model. In other
words, there is evidence of a significant difference between the coefficients estimated
using the FE and RE models, indicating potential endogeneity or omitted variable
bias.
 If the p-value is greater than your chosen significance level, you fail to reject the null
hypothesis (H0). This implies that there is no significant difference between the
coefficients estimated using the FE and RE models. In this case, you can consider
using the random effects (RE) model as it assumes a more efficient estimation due to
the assumption of no endogeneity.

It's important to note that the decision between FE and RE models should not solely rely on
the Hausman test result. Consider other factors such as theoretical considerations, model fit
statistics (e.g., R-squared, likelihood ratio), and the nature of the data and research question.
The Hausman test is just one tool to aid in model selection and specification decision-
making.

Remember to interpret the Hausman test result in the context of your specific analysis and
consult relevant literature or seek expert advice when making modeling decisions.

Review questions on Stata application to time series econometrics and panel data
analysis:

1. How can you import and manage time series data in Stata, and what are the key
commands for handling time series objects?
2. Explain the concept of stationarity in time series analysis, and discuss the importance of
testing and ensuring stationarity using Stata.
3. Discuss the process of estimating and interpreting autoregressive integrated moving
average (ARIMA) models using Stata, and explain how you can select appropriate model
orders using diagnostic tests.
4. Explain the concept of unit roots and cointegration in time series econometrics, and
discuss how you can perform unit root tests and estimate cointegrated models using Stata.
5. How can you conduct Granger causality tests using Stata to examine the causal
relationship between two or more time series variables, and why is this analysis important
in time series econometrics?

Chapter 7: Application to Nonlinear Models
7.1 Overview of non-linear models and their applications in economics

Non-linear regression models are used to capture complex relationships between variables
that cannot be adequately described by linear models. These models can be used to estimate
the parameters of functions that are non-linear in the variables, such as exponential,
logarithmic, polynomial, or power functions. Non-linear regression models find applications
in various areas of economics, such as demand estimation, production function analysis, and
growth models.

Non-linear models provide economists with more accurate and flexible tools to analyze
complex economic phenomena. They allow for better capturing of non-linear relationships,
heterogeneity, dynamics, and interactions between economic variables, leading to improved
policy analysis, forecasting, and understanding of economic behavior.

Differences between linear and non-linear models

Linear and non-linear models are two distinct types of mathematical models used to describe
relationships between variables. Here are some key differences between linear and non-linear
models:

Linearity:

The fundamental difference lies in the linearity assumption. Linear models assume a linear
relationship between the predictor variables and the response variable. This means that the
relationship can be expressed as a straight line. Non-linear models, on the other hand, allow
for more complex relationships, where the relationship between the variables cannot be
adequately described by a straight line.

Functional Form:

Linear models follow a simple functional form, typically described by a linear equation,
such as Y = β0 + β1X1 + β2X2 + ... + βnXn. Non-linear models, on the other hand, involve
more complex functional forms that can include exponential, logarithmic, polynomial, or
other non-linear functions.

Parameter Estimation:

In linear models, the parameters (β) can be estimated using ordinary least squares (OLS)
regression, which involves minimizing the sum of squared residuals. Non-linear models often
require more advanced estimation techniques, such as maximum likelihood estimation
(MLE), nonlinear least squares, or numerical optimization methods.

Interpretability:

Linear models have the advantage of interpretability. The estimated coefficients in a linear
model represent the change in the response variable associated with a one-unit change in the
corresponding predictor, holding other variables constant. Non-linear models, especially
those with complex functional forms, may have less straightforward interpretations, making it
more challenging to interpret the effects of predictor variables.

Flexibility:

Non-linear models offer greater flexibility in capturing complex relationships between
variables. They can better handle situations where the relationship is not linear or exhibits
non-constant effects. Non-linear models can capture curves, thresholds, interactions, and
other intricate patterns that linear models cannot accommodate.

Assumptions:

Linear models have specific assumptions, such as linearity, homoscedasticity (constant
variance), and independence of errors. Non-linear models may have additional assumptions
depending on the specific functional form or estimation technique used. Violations of these
assumptions can impact the accuracy and reliability of the model results.

Model Complexity:

Non-linear models, by their nature, tend to be more complex than linear models. They
involve more intricate mathematical formulations and estimation procedures. As a result,
non-linear models may require larger sample sizes and more computational resources for
estimation and interpretation.

Applications:

Linear models are widely used when the relationship between variables can be reasonably
assumed to be linear, such as in simple regression analysis or linear regression with
interactions. Non-linear models are employed in various fields, such as economics, finance,
biology, and engineering, to capture more complex relationships, dynamics, and patterns.

Understanding the differences between linear and non-linear models helps researchers choose
the appropriate model type based on the nature of the data, the research question, and the
underlying relationship they aim to capture.

7.2 Binary Choice Models

7.2.1 Probit model: specification, estimation, and interpretation

The Probit model is a type of generalized linear model used to analyze binary dependent
variables. It is particularly suited for situations where the response variable takes on one of
two mutually exclusive outcomes, such as "success" or "failure," "yes" or "no," or "1" or "0".
The Probit model specifies and estimates the probability of the binary outcome as a function
of explanatory variables. Here's an overview of the Probit model's specification, estimation,
and interpretation:

1. Model Specification: The Probit model assumes that the binary dependent variable
follows a standard normal distribution. The model relates the probability of the binary
outcome to a linear combination of explanatory variables using the cumulative
distribution function of the standard normal distribution. The Probit model can be
specified as follows:

P(Y = 1 | X) = Φ(Xβ)

Where:

 P(Y = 1 | X) is the probability of the binary outcome (Y = 1) given the values of the
explanatory variables (X).
 Φ(.) represents the cumulative distribution function of the standard normal
distribution.
 X is a matrix of explanatory variables.

 β is a vector of coefficients to be estimated.

2. Estimation: The parameters of the Probit model are typically estimated using maximum
likelihood estimation (MLE). The MLE procedure involves finding the values of β that
maximize the likelihood function, which is a measure of how likely the observed
outcomes are given the model and its parameters. Stata provides the "probit" command
to estimate Probit models.
3. Interpretation of Coefficients: The estimated coefficients in the Probit model represent
the effect of each explanatory variable on the probability of the binary outcome. The
interpretation of the coefficients depends on the chosen scale of measurement for the
explanatory variables. In general, a positive coefficient indicates that an increase in the
corresponding explanatory variable leads to a higher probability of the binary outcome
(Y = 1), while a negative coefficient suggests a lower probability.
4. Marginal Effects: In addition to coefficient interpretation, it is common to calculate
marginal effects to measure the impact of explanatory variables on the probability of the
binary outcome. Marginal effects represent the change in the predicted probability of the
binary outcome associated with a one-unit change in the explanatory variable while
holding other variables constant. Stata provides the "margins" command to estimate
marginal effects in Probit models.
5. Goodness-of-Fit: The goodness-of-fit of the Probit model can be assessed using various
criteria, such as the likelihood ratio test, Akaike Information Criterion (AIC), Bayesian
Information Criterion (BIC), or pseudo R-squared measures. These measures provide
information on how well the model fits the observed data and can help in comparing
different model specifications.
6. Assumptions: The Probit model assumes that the error term follows a standard normal
distribution. It also assumes that the observations are independent of each other.
Violations of these assumptions, such as heteroscedasticity or correlation among the
observations, may lead to biased estimates or inefficient inference.

The Probit model is widely used in economics, social sciences, and other fields to analyze
binary outcomes. By estimating the probabilities of binary outcomes based on explanatory
variables, it provides insights into the factors influencing the likelihood of a particular
outcome occurring.

7.2.2 Logit model: specification, estimation, and interpretation

The Logit model is a type of generalized linear model used to analyze binary dependent
variables. It is particularly suited for situations where the response variable takes on one of
two mutually exclusive outcomes, such as "success" or "failure," "yes" or "no," or "1" or "0".
The Logit model specifies and estimates the probability of the binary outcome as a function
of explanatory variables. Here's an overview of the Logit model's specification, estimation,
and interpretation:

Model Specification:

The Logit model assumes that the binary dependent variable follows a logistic distribution.
The model relates the probability of the binary outcome to a linear combination of
explanatory variables using the logistic transformation. The Logit model can be specified as
follows:

P(Y = 1 | X) = 1 / (1 + exp(-Xβ))

Where:

P(Y = 1 | X) is the probability of the binary outcome (Y = 1) given the values of the
explanatory variables (X).

exp(.) represents the exponential function.

X is a matrix of explanatory variables.

β is a vector of coefficients to be estimated.

Estimation:

The parameters of the Logit model are typically estimated using maximum likelihood
estimation (MLE). The MLE procedure involves finding the values of β that maximize the
likelihood function, which is a measure of how likely the observed outcomes are given the
model and its parameters. Stata provides the "logit" command to estimate Logit models.

Interpretation of Coefficients:

The estimated coefficients in the Logit model represent the effect of each explanatory
variable on the odds of the binary outcome. The odds ratio can be calculated as exp(β),
indicating the multiplicative change in the odds of the outcome associated with a one-unit
change in the corresponding explanatory variable. A coefficient greater than 0 indicates that
an increase in the explanatory variable leads to higher odds of the binary outcome (Y = 1),
while a coefficient less than 0 suggests lower odds.

Marginal Effects:

In addition to coefficient interpretation, it is common to calculate marginal effects to
measure the impact of explanatory variables on the probability of the binary outcome.
Marginal effects represent the change in the predicted probability of the binary outcome
associated with a one-unit change in the explanatory variable while holding other variables
constant. Stata provides the "margins" command to estimate marginal effects in Logit
models.

Goodness-of-Fit:

The goodness-of-fit of the Logit model can be assessed using various criteria, such as the
likelihood ratio test, Akaike Information Criterion (AIC), Bayesian Information Criterion
(BIC), or pseudo R-squared measures (e.g., McFadden's R-squared or Cox and Snell R-
squared). These measures provide information on how well the model fits the observed data
and can help in comparing different model specifications.

Assumptions:

The Logit model assumes that the error term follows a logistic distribution. It also assumes
that the observations are independent of each other. Violations of these assumptions, such as
heteroscedasticity or correlation among the observations, may lead to biased estimates or
inefficient inference.

The Logit model is widely used in economics, social sciences, and other fields to analyze
binary outcomes. By estimating the probabilities and odds ratios of binary outcomes based on
explanatory variables, it provides insights into the factors influencing the likelihood of a
particular outcome occurring.

7.2.3 Comparing probit and logit models

Probit and Logit models are both types of generalized linear models used for analyzing binary
dependent variables. While they share similarities, there are also notable differences between
the two. Here's a comparison between probit and logit models:

Functional Form:

The primary difference between Probit and Logit models lies in their functional forms. The
Probit model assumes a standard normal distribution for the error term, while the Logit model
assumes a logistic distribution. Consequently, the cumulative distribution functions used in
the models differ. Probit uses the cumulative distribution function of the standard normal
distribution, while Logit uses the logistic transformation.

Interpretation:

The interpretation of coefficients differs between Probit and Logit models. In the Probit
model, the coefficients represent the change in the z-score of the latent index (measured in
standard deviations of the normally distributed error term) associated with a one-unit change
in the explanatory variable. In contrast, Logit coefficients represent the change in the
log-odds of the binary outcome associated with a one-unit change in the explanatory variable,
and exp(β) gives the corresponding odds ratio.

Symmetry:

Probit and Logit link functions are both symmetric, S-shaped (sigmoidal) curves centred at a
probability of 0.5, so neither model is inherently asymmetric. The practical difference lies
in the tails: the logistic distribution has heavier tails than the standard normal, so the
Logit model allows predicted probabilities to approach 0 and 1 more slowly as the explanatory
variables take extreme values.

Mathematical Properties:

Mathematically, the logistic cumulative distribution function used by the Logit model has a
closed-form expression, which makes the log-likelihood and the odds-ratio interpretation
particularly convenient. The standard normal cumulative distribution function used by the
Probit model has no closed form and must be evaluated numerically, but its normality
assumption links naturally to other normal-based models (for example, multivariate probit and
sample-selection models), which is one reason Probit is often preferred in those settings.

Comparative Fit:

In terms of model fit, Probit and Logit models often yield similar results. However, the
specific distributional assumptions may cause slight differences in the estimated coefficients
and standard errors between the two models. Generally, if the underlying distributional
assumption is met, the choice between Probit and Logit is typically less consequential.
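
One simple way to compare the fit of the two models in Stata (a sketch with hypothetical variables) is to store both sets of estimates and compare their information criteria:

* fit both models and compare AIC/BIC
quietly probit employed age educ
estimates store probit_m
quietly logit employed age educ
estimates store logit_m
estimates stats probit_m logit_m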

Ease of Interpretation:

Logit models are often considered more straightforward to interpret because the odds ratio
can be directly interpreted as the change in odds associated with a one-unit change in the
explanatory variable. Probit coefficients, on the other hand, are more challenging to interpret
directly in terms of odds.

When choosing between Probit and Logit models, researchers should consider the specific
distributional assumptions, interpretability requirements, and the underlying context of the
analysis. In practice, the choice between the two models often comes down to personal
preference or convention within the field of study.

7.3 Stata Application


As discussed earlier, two of the most popular alternatives are the probit and logit
estimators. Both are maximum likelihood estimators that rest on slightly different
distributional assumptions but should produce roughly the same results. Using a hypothetical
wage dataset, we run a regression with each estimator: probit highwage exper male school and
logit highwage exper male school. We then verify that the predicted probabilities lie within
the 0-1 range using predict xbhat2 followed by tabstat xbhat2, by(school). We can also graph
these with graph bar xbhat2, over(school). Figure 5 compares the predicted probabilities from
each model; in this case logit and probit give very similar results.
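
Collected as a do-file sketch (using the variable names of this hypothetical example; the xbhat1 variable for the probit predictions and the summarize line are additions used here to make the range check explicit):

* probit and logit regressions of highwage on experience, gender and schooling
probit highwage exper male school
predict xbhat1             // predicted probabilities from the probit model
logit highwage exper male school
predict xbhat2             // predicted probabilities from the logit model

* check that the predicted probabilities lie within the 0-1 range
summarize xbhat1 xbhat2
tabstat xbhat2, by(school)

* mean predicted probability of a high wage by years of schooling
graph bar xbhat2, over(school)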

Logit and Probit Output

Iteration 0:   log likelihood = -1968.4829
Iteration 1:   log likelihood = -1794.7776
Iteration 2:   log likelihood = -1792.4022
Iteration 3:   log likelihood = -1792.4006

Probit regression                                 Number of obs   =       3296
                                                  LR chi2(3)      =     352.16
                                                  Prob > chi2     =     0.0000
Log likelihood = -1792.4006                       Pseudo R2       =     0.0895

------------------------------------------------------------------------------
    highwage |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       exper |   .0320678   .0113016     2.84   0.005      .009917    .0542186
        male |   .5130279   .0496018    10.34   0.000     .4158102    .6102456
      school |   .2565379   .0161262    15.91   0.000     .2249311    .2881447
       _cons |  -4.138153   .2306573   -17.94   0.000    -4.590233   -3.686073
------------------------------------------------------------------------------

Iteration 0:   log likelihood = -1968.4829
Iteration 1:   log likelihood = -1796.4414
Iteration 2:   log likelihood = -1791.3207
Iteration 3:   log likelihood = -1791.3022

Logistic regression                               Number of obs   =       3296
                                                  LR chi2(3)      =     354.36
                                                  Prob > chi2     =     0.0000
Log likelihood = -1791.3022                       Pseudo R2       =     0.0900

------------------------------------------------------------------------------
    highwage |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       exper |   .0530249   .0194082     2.73   0.006     .0149855    .0910644
        male |   .8678092   .0847772    10.24   0.000     .7016489     1.03397
      school |   .4386614   .0283132    15.49   0.000     .3831685    .4941544
       _cons |  -7.028232   .4058646   -17.32   0.000    -7.823712   -6.232752
------------------------------------------------------------------------------

Problem Solved With Probit and Logit

Figure 5: Comparison of predicted probabilities by years of schooling
[Graph omitted: the predicted probability of a high wage (vertical axis, 0 to 1) is plotted
against years of schooling (horizontal axis, 0 to 15), showing the OLS/LPM, probit, and logit
predicted probabilities.]

Marginal Effects
It is important to recognise that these coefficients are not the same as the output
generated by an OLS regression, because they refer to the unobserved latent variable.
They are not marginal effects; that is, they do not tell us the average effect of a
change in the X variable on Y (dy/dx), as OLS coefficients do because OLS is a linear
estimator. From probit or logit coefficients we can interpret only the direction of the
average effect and its significance, not its magnitude. Marginal effects are rarely
reported for these models in other disciplines; economists, however, are usually
interested in the magnitude of an effect, not just its statistical significance.
We can calculate the marginal effects using the mfx compute command. Whereas
the coefficients from logit and probit will differ due to scaling, the marginal effects
should be almost identical. Outreg2 also works for exporting marginal effects, and we
will use it to compare different ways of calculating them: run logit highwage exper male
school, then mfx compute, then outreg2 using test, mfx excel append ctitle(mfx logit);
and run probit highwage exper male school, then mfx compute, then
outreg2 using test, mfx excel replace ctitle(mfx probit).
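
A sketch of this workflow is given below (mfx is the older Stata syntax; in current versions margins, dydx(*) is the recommended equivalent; the probit run is placed first here because its replace option creates the output file that the logit run then appends to):

* probit marginal effects, exported with outreg2 (user-written: ssc install outreg2)
probit highwage exper male school
mfx compute
outreg2 using test, mfx excel replace ctitle(mfx probit)

* logit marginal effects, appended to the same file
logit highwage exper male school
mfx compute
outreg2 using test, mfx excel append ctitle(mfx logit)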
We see that this is indeed the case. Not only that, but these marginal effects are also
almost identical to the OLS results. For example, in the probit model an additional year of
schooling increases the probability of earning a high wage by about 8 percentage points, and
likewise in the logit model. Because the marginal effects differ depending on the values of
the x variables, there are a number of different ways of calculating them.

Marginal Effects Output


Marginal effects after probit

      y  = Pr(highwage) (predict)
         =  .26504098

variable |      dy/dx    Std. Err.     z    P>|z|  [    95% C.I.   ]      X
---------+--------------------------------------------------------------------
(*) dy/dx is for discrete change of dummy variable from 0 to 1

Marginal effects after logit

      y  = Pr(highwage) (predict)
         =  .26010828

variable |      dy/dx    Std. Err.     z    P>|z|  [    95% C.I.   ]      X
---------+--------------------------------------------------------------------
   exper |   .0102047      .00373    2.74   0.006    .002892   .017517   8.04187
   male* |   .1645799      .01554   10.59   0.000    .134131   .195029   .523968
  school |   .0844213      .00525   16.09   0.000    .074139   .094704   11.6302

(*) dy/dx is for discrete change of dummy variable from 0 to 1

By default, Stata calculates these marginal effects at the mean of the independent
variables; however, it is also possible to evaluate them at other values. For example,
suppose you suspect that the effect of experience and schooling on wages differs for men
and women. You could then evaluate the marginal effects for women with mfx compute,
at(male=0) followed by outreg2 using test, mfx excel append ctitle(mfx probit women),
and for men with mfx compute, at(male=1) followed by outreg2 using test, mfx excel
append ctitle(mfx probit men). To reiterate, because OLS is a linear estimator the
estimated marginal effect is the same at every set of X values. You can think about
this in terms of the slope of the regression line in the bivariate case: OLS is a
straight line, so the slope of the line is the same at every point, unlike with logit
and probit.
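
A sketch of these subgroup calculations, with the current margins equivalent shown for comparison:

probit highwage exper male school

* marginal effects for women and for men (older mfx syntax, as above)
mfx compute, at(male=0)
outreg2 using test, mfx excel append ctitle(mfx probit women)
mfx compute, at(male=1)
outreg2 using test, mfx excel append ctitle(mfx probit men)

* current equivalent: evaluate marginal effects at male = 0 and male = 1
margins, dydx(exper school) at(male=(0 1))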

The marginal effects of experience and schooling appear larger for men than for
women. There are other ways of estimating marginal effects. An alternative is to calculate
the marginal effect for every observation and then take the average, which gives the
average partial effect. For this we use the user-written command margeff (install it with
ssc install margeff). We run it with the replace option because we wish to export the
results: margeff, replace. This time we use outreg2 without the mfx option: outreg2 using
test, excel append ctitle(amfx probit). We can see that in this case both approaches give
similar results. mfx2 is another user-written command that produces marginal effects. If
you are using a probit model, you can also obtain the marginal effects directly (without
the coefficients) with the command dprobit highwage exper male school.
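
A sketch of the average-partial-effect route described above (margeff is a user-written command; in recent Stata versions margins, dydx(*) reports average marginal effects directly):

* user-written route
ssc install margeff
probit highwage exper male school
margeff, replace
outreg2 using test, excel append ctitle(amfx probit)

* built-in route in recent Stata versions
probit highwage exper male school
margins, dydx(*)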

Review questions on Stata application to Logit and Probit models:

1. How can you estimate a Logit model in Stata for analyzing a binary outcome, and
what are the key Stata commands or procedures involved in estimating and
interpreting the model?
2. Explain the concept of odds ratios in Logit models, and discuss how you can interpret
the coefficients and odds ratios in Stata for understanding the impact of explanatory
variables on the probability of the binary outcome.
3. Discuss the process of assessing the goodness-of-fit of a Logit model in Stata, and
explain the interpretation of measures such as the likelihood ratio test, AIC, BIC, and
pseudo R-squared measures in evaluating model performance.
4. How can you estimate a Probit model in Stata for analyzing a binary outcome, and
what are the main Stata commands or procedures used for estimation and
interpretation?
5. Explain the difference between Logit and Probit models in terms of the assumed
distribution of the error term, and discuss how this affects the estimation and
interpretation of coefficients in Stata.
6. Describe how you can calculate and interpret marginal effects in Logit and Probit
models using Stata, and discuss their significance in understanding the impact of
explanatory variables on the probability of the binary outcome.

Assignment

Title: Analyzing the Impact of Fiscal Policy on Economic Growth

Objective: The objective of this assignment is to analyze the impact of fiscal policy on
economic growth using statistical software. You will use Stata to import and manage data,
perform regression analysis, and interpret the results.

Dataset: You are expected to find a macroeconomic dataset that contains information on fiscal
policy variables and economic growth for various countries over a certain time period. The
dataset must include the following variables:

 Country: Name of the country


 Year: Year of observation
 GDP_growth: Annual economic growth rate (dependent variable)
 Government_spending: Government spending as a percentage of GDP
 Tax_revenue: Tax revenue as a percentage of GDP
 Debt_to_GDP: Public debt as a percentage of GDP

Tasks:

1. Import the Dataset: a) Open Stata and import the dataset "fiscal_growth_data.dta". b)
Inspect the dataset using the "describe" command to check the variables, data types,
and any missing values.
2. Data Management: a) Generate a summary statistics table for the variables of interest
(GDP_growth, Government_spending, Tax_revenue, Debt_to_GDP) using the
"summarize" command. b) Identify and handle any missing values in the dataset
appropriately.
3. Descriptive Analysis: a) Create line graphs to visualize the trends in GDP_growth,
Government_spending, Tax_revenue, and Debt_to_GDP over time using the "line" or
"twoway line" command. b) Calculate and interpret the correlation coefficients
between GDP_growth and the fiscal policy variables using the "correlate" command.
4. Regression Analysis: a) Estimate a multiple linear regression model to analyze the
impact of government spending, tax revenue, and public debt on economic growth.
Interpret the coefficients and evaluate their statistical significance using the "regress"
command. b) Assess the overall goodness-of-fit of the regression model using
relevant measures such as R-squared, adjusted R-squared, and F-statistic.
5. Policy Implications: a) Interpret the coefficients of the regression model to analyze
the impact of government spending, tax revenue, and public debt on economic
growth. b) Discuss the potential policy implications of the findings, considering the
trade-offs between fiscal policy variables and economic growth.
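
The tasks above can be collected into a skeleton do-file. The sketch below assumes the file and variable names listed earlier, treats Country as a string variable, and uses one country purely as an illustration; adapt it to the dataset you actually find:

* Task 1: import and inspect the data
use "fiscal_growth_data.dta", clear
describe

* Task 2: summary statistics and a check for missing values
summarize GDP_growth Government_spending Tax_revenue Debt_to_GDP
misstable summarize

* Task 3: trends over time (shown for one country) and correlations
twoway line GDP_growth Year if Country == "Ethiopia", sort
correlate GDP_growth Government_spending Tax_revenue Debt_to_GDP

* Task 4: multiple linear regression of growth on fiscal policy variables
regress GDP_growth Government_spending Tax_revenue Debt_to_GDP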

Submission: Submit your completed Stata do-file (.do) and a written report summarizing your
findings and interpretations.

Note: Ensure to include appropriate Stata commands, outputs, and interpretations in your do-
file and report.
