Statistical Software Application in Economics
DEPARTMENT OF ECONOMICS
JULY, 2023
JIMMA, ETHIOPIA
Contents
Chapter 1: Introduction to Software
1.1 Econometrics and Statistical Software
1.2 What is Stata?
1.3 Why use Stata?
1.4 Overview of Data Files
1.4.1 Types of data and data sources
1.4.2 Discrete variables vs. continuous variables
1.4.3 The Stata user interface
1.4.4 Basic File Management
1.4.5 Basic syntax and mathematical operators
Chapter 2: Application to Data Management
2.1 Basic data commands
2.2 Examining data sets
clear
edit
browse
describe
2.3 Data Manipulation
2.3.1 Logical operators
2.3.1.1 Logical operator if
2.3.1.2 Logical operators: "and," "or"
2.3.2 Generating Variables
2.3.3 Modifying Variables
2.4 Data Documentation
2.4.1 Data Documentation and Metadata
2.4.2 The importance of documenting data
2.4.3 Variable labels
2.5 Data Storage and Organization
2.5.1 Data Storage Formats
2.5.2 File Extensions
2.6 Data Security and Ethics
Review questions on Stata application to data management in economics
Chapter 3: Application to Univariate Analysis
3.1 Summary Statistics
3.1.1 Count
3.1.2 Mean
3.1.3 Standard deviation
3.1.4 Minimum
3.1.5 Maximum
3.1.6 Quartiles
3.1.7 Percentiles
3.1.8 Skewness
3.1.9 Kurtosis
3.2 Histogram or Frequency Distribution
3.3 Box Plots
3.4 Bar plots or Pie charts
Review questions on Stata application to univariate analysis in economics
Chapter 4: Application to Bivariate Analysis
4.1 Scatter Plot
4.2 Correlation Coefficient
4.3 Regression Analysis
4.4 Line Plot or Grouped Bar Chart
4.5 Cross-tabulation
4.6 Independent Samples t-test or Mann-Whitney U test
4.7 Chi-Square Test
Review questions on Stata application to bivariate analysis in economics
Chapter 5: Application to Cross Sectional Econometrics
5.1 Heteroscedasticity test
5.2 Normality test
5.3 Correlation
5.4 Multicollinearity test
5.5 Specification tests
5.6 Goodness-of-fit measures
Review questions on Stata application to cross-sectional econometrics
Chapter 6: Application to Time Series Econometrics and Panel Data
6.1 Application to Time Series Econometrics
6.1.1 Importing and Managing Data
Exploratory Data Analysis
6.1.2 Time Series Analysis: Unit root tests
6.1.3 White's Test for Heteroscedasticity
6.1.4 Granger Causality Test
6.1.5 Autoregressive integrated moving average (ARIMA) models
6.2 Panel Data Analysis
6.2.1 Specifying the panel and time variables using the xtset command
6.2.2 Estimating fixed effects or random effects
6.2.3 Hausman tests
Review questions on Stata application to time series econometrics and panel data analysis
Chapter 7: Application to Nonlinear Models
7.1 Overview of non-linear models and their applications in economics
Differences between linear and non-linear models
7.2 Binary Choice Models
7.2.1 Probit model: specification, estimation, and interpretation
7.2.2 Logit model: specification, estimation, and interpretation
7.2.3 Comparing probit and logit models
7.3 Stata Application
Marginal Effects
Review questions on Stata application to Logit and Probit models
Assignment
Chapter 1: Introduction to Software
1.1 Econometrics and Statistical Software
Economists very often work with statistical software that is used to build economic models
and conduct econometric analyses. Learning to work with and analyze data is thus an
essential skill for young economists. To be competitive as an economist in the job market,
demonstrable skills and experience using some of the popular analysis and forecasting
software environments are a must.
In modern economics research and analysis, software applications play a crucial role in data
management, statistical analysis, and econometric modeling. These applications provide
powerful tools for economists to handle large datasets, conduct complex analyses, and
generate meaningful insights. In this lecture, we will explore some of the prominent software
applications used in economics and their key features.
Stata:
Stata is a widely used statistical software package that offers a comprehensive suite of tools
for data analysis, econometrics, and visualization. It provides an extensive range of
commands and functions for managing datasets, conducting statistical tests, estimating
regression models, and producing high-quality graphs. Stata's user-friendly interface and rich
documentation make it particularly popular among economists.
R:
R is a free, open-source programming language and environment for statistical computing and graphics. It offers thousands of user-contributed packages for econometrics, data manipulation, and visualization, which gives economists a great deal of programming flexibility.
Python:
Python is a general-purpose, open-source programming language with a rich ecosystem of libraries for data analysis and econometrics (such as pandas and statsmodels). Its flexibility makes it popular for data processing, large-scale analysis, and reproducible research in economics.
MATLAB:
MATLAB is a powerful software package widely used in economics and various scientific
disciplines. It provides a convenient environment for numerical computing, algorithm
development, and data visualization. MATLAB offers extensive toolboxes for econometrics,
optimization, and simulations. Economists often use MATLAB for advanced econometric
modeling, algorithmic analysis, and simulation exercises.
EViews:
EViews is a specialized software package designed for econometric analysis and time series
modeling. It offers an intuitive interface, making it accessible to economists with varying
levels of technical expertise. EViews supports various econometric techniques, including unit
root tests, panel data analysis, cointegration, and forecasting. It is particularly useful for
analyzing economic and financial time series data.
Excel:
Excel is a widely known spreadsheet program used for data management, basic statistical
analysis, and modeling. While it may lack some advanced statistical capabilities compared to
dedicated statistical software, Excel is still extensively used in economics due to its
familiarity and ease of use. It allows economists to organize data, perform basic calculations,
generate charts, and conduct simple regressions.
GIS software:
GIS software, such as ArcGIS and QGIS, is essential for economists studying spatial
economics and regional analysis. These tools enable economists to analyze and visualize
spatial data, create maps, and perform spatial econometric modeling. GIS software allows for
the integration of economic data with geographic information, facilitating the examination of
spatial relationships and patterns.
In conclusion, software applications play a vital role in modern economics research and
analysis. Each software package offers unique features and capabilities that cater to different
research needs and preferences. Whether it's Stata for comprehensive statistical analysis, R
and Python for programming flexibility, MATLAB for advanced modeling, or specialized
software like EViews for econometric analysis, economists have a variety of powerful tools
at their disposal to conduct rigorous and insightful research.
Note: It's important to choose the software application that best suits your research
requirements and to familiarize yourself with its functionalities and capabilities. Additionally,
it's always advisable to consult relevant documentation and resources provided by the
software developers to maximize the utilization of these tools in economics research. That's
quite the list of names, and it can sound intimidating. To help you understand the software
tools landscape, in this module we will shed some light on the most popular software
packages for economists, and offer some details about how you can learn more about them.
The choice of software application depends on several factors, including the specific research
task at hand, the complexity of the analysis, the researcher's familiarity with the software, and
the availability of resources and support. Here are some considerations for selecting a
software application for different scenarios:
Stata: Stata is often preferred when conducting statistical analysis and econometric
modeling. It provides a comprehensive set of built-in commands specifically designed for
economists. Stata's user-friendly interface and extensive documentation make it suitable for
both beginners and advanced users. It is well-suited for tasks such as data management,
regression analysis, and generating publication-quality graphs. Stata also has a large user
community and support network.
EViews: EViews is specifically designed for time series analysis and econometrics. It offers
an intuitive interface and built-in econometric techniques, making it ideal for economists
working with time series data. EViews simplifies tasks such as unit root tests, cointegration
analysis, and forecasting. It is commonly used in applied econometrics, financial analysis,
and economic forecasting.
This course will equip students with sufficient knowledge of Stata to handle and analyze
different types of data, and of EViews for time series analysis. The
emphasis of the course is on the practical issues relating to data analysis and modeling rather
than econometric theory. The overriding objective of the course will be to ensure that the
students are competent and confident in econometric analysis of data.
The course encompasses a number of key areas in empirical analysis using various datasets.
The students will be shown how to analyze the data and how to estimate reliable econometric
models using STATA. Throughout the course, students will be shown how to avoid the
numerous pitfalls that inexperienced researchers often fall into.
Objectives
The objective of this module is to improve the ability of our students to use Stata to generate
descriptive statistics and tables from some example datasets, as well as carry out
preliminary linear and non-linear regression analysis using appropriate data, and EViews
for time series data analysis. In particular, the course aims to train participants in the
methods covered in the following chapters.
1.2 What is Stata?
Stata is a computer program that can be used for data analysis, data management, and
graphics. It has a wide application and can be used for household surveys, macroeconomic
data, and many other types of data.
1.3 Why use Stata?
Choosing a statistical software package is a strategic decision. The decision entails the
investment of time and money, and you should think about the future development and
compatibility of the software. Often, the type of software your peer group uses is also
influential, since this is usually the main source of help and support.
Statistical software can be used by command line, by point-and-click menus, or both.
Command line usage has the invaluable advantage that all steps of the analysis, and thus
all results, are easily replicable. In contrast, menu usage might make it very difficult to
replicate results, especially in larger projects. However, it might be more difficult at the
beginning to learn a new command structure, especially for users who have never worked
with a programming language. Nonetheless, initial ease of use should be weighed against
the long-term benefits of replicability.
Over Excel
Excel is easier to use and good for quick graphing, but not as robust in terms of
statistical analysis; in Excel many things have to be done manually (it is hard to
apply broad rules). Stata also allows you to keep track of your work.
Over SPSS
While Stata's capabilities are seen more at the advanced end, it is easier to get support for
Stata, and it is more widely used in academia.
Over R
While R is free and accessible to the public, Stata is easier to learn and,
again, the community of users is wider… for now.
1.4 Overview of Data Files
1.4.1 Types of data and data sources
Data sources: observational/non-experimental data, experimental data, and survey data.
Classifications of data types:
Time series data
Cross-section data
Pooled data
Panel/longitudinal data
A data file is organized into records (observations) and variables.
1.4.2 Discrete variables vs. continuous variables
Dataset B
REGION  DISTRICT  HHID  PLOT  IRRIG  AREA
1       4         1     1     1      1.5
1       4         1     2     0      1.0
1       5         3     1     1      0.5
3       26        2     1     0      0.4
3       26        2     2     1      1.0
4       45        1     1     1      1.2
Key variables: variables that uniquely identify each record (here REGION, DISTRICT, HHID, and PLOT).
Continuous variables
Variables whose values are not limited to a fixed set of categories.
Examples include income, farm size, consumption, coffee production, and distance to the road.
Unlike discrete variables, continuous variables are usually expressed in some unit such as
birr, hectares, kilograms, or kilometers, and may take fractional values.
Variable labels
Longer names associated with each variable that explain it in tables and graphs.
For example, the variable label for FAMSIZE might be "Household family size / number of
family members in the household" and the label for DISTMKT could be "Distance from
market in km".
Whenever possible, variable labels should include the unit (e.g., km).
Value labels
Longer names attached to each value of a categorical/discrete variable.
For example, if the variable REGION has four values, each value is associated with a name.
For instance, REGION=1 "Tigray", REGION=3 "Amhara", REGION=4 "Oromia" and REGION=5 "Somali".
1.4.3 The Stata user interface
The latest Stata version is Stata 18, but we will be using Stata 14 for this
course; a snapshot of its interface is shown below.
[Screenshot: the Stata 14 interface, showing the Results, Command, Variables, Review, and Properties windows.]
Windows
The main windows give you all the key information about the data file you are using,
recent commands, and the results of those commands.
Results window: To view outputs and recent commands
Command window: To enter commands
Variables window: To see the list of variables in the active dataset opened
Review window: to see all previously entered commands
Properties window: to see variable and dataset properties of
a selected variable or groups of variables.
Other windows, such as the Browser, the Data Editor, and the Do-file Editor, do not open
automatically but are opened through specific commands or the pull-down menus.
To open any window or to reveal a hidden window, select the window from the
Window menu.
Text appearing in the Results window is color-coded:
Blue: clickable links (e.g., commands or error codes) that can be clicked on for more
information
Red: error messages
Menus
The toolbar contains buttons that provide quick access to Stata's more
commonly used features. Some of the common buttons are:
Open: opens a Stata dataset. Click on the button to open a dataset with the
Open dialog.
Log: begins a new log or closes, suspends, or resumes the current log.
Data Editor (Edit): opens the Data Editor or brings the Data
Editor to the front of the other Stata windows.
Data Editor (Browse): opens the Data Editor in browse (read-only) mode, so you can
view the data without changing it.
1.4.4 Basic File Management
dir – "directory," shows all the files that are in the current folder
Can you find which folder Stata is currently in?
pwd – "present working directory," displays the folder Stata is currently working in
Create a folder on Windows where you want all these training files to be placed
cd – “change directory,” changes the folder where you are working from
1.4.5 Basic syntax and mathematical operators
o disp = display
What happens when you type disp “Hello”
What happens when you type disp “Hello” “world”
What happens when you type disp hello?
Use “ ” when you are describing string characters (text)
o Otherwise, Stata will think you are talking about variables
o Mathematical operators include: + - * / ^ ( )
What happens when you display 4
What happens when you display 4 + 7
How would you display (21-12)*3
How would you display (36+12)−42
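A short sketch of these commands in practice (the comments show what Stata would print; they are illustrative answers to the questions above):
disp "Hello"             // prints: Hello
disp "Hello" "world"     // prints: Helloworld (the two strings are joined)
disp hello               // error: Stata looks for a variable named hello
disp 4                   // prints: 4
disp 4 + 7               // prints: 11
disp (21-12)*3           // prints: 27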
Chapter 2: Application to Data Management
2.1 Basic data commands
tabulate – counts and tabulates data, also works with non-numeric data
Now what happens if you want a tabulate of make?
How many of these cars are foreign and domestic?
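A quick way to answer these questions, using the auto dataset that ships with Stata:
sysuse auto, clear
tabulate foreign     // shows 52 domestic and 22 foreign cars
tabulate make        // one row per make, so every car appears once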
1. One way to import data from Excel is to copy the relevant cells in
Excel and paste them into a new Data Editor window in Stata.
NB: The original Excel (.xls) file has to be converted into ".csv"
format before it can be read with the insheet command.
To treat the first row of the Excel data as variable names, use the option "firstrow"
with the import excel command.
To import a subset of variables from an Excel file, you can restrict the cell range, as in the sketch below.
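A hedged sketch of these import options (the file name, sheet, and cell range are placeholders):
import excel using "mydata.xlsx", sheet("Sheet1") firstrow clear
import excel using "mydata.xlsx", cellrange(A1:C100) firstrow clear   // only columns A-C, rows 1-100
insheet using "mydata.csv", clear                                     // the CSV route mentioned above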
2.2 Examining data sets
clear
The clear command removes the data, variables, and labels currently in memory so that a
new data file can be loaded. You can clear memory with the clear command or by adding
the clear option to the use command (see the use command). This command does not
delete any data saved on the hard drive.
edit
This command opens the Data Editor window, which allows us to view all observations in
memory. You can change the data in the Data Editor window, but this is not recommended
because you will have no record of the changes you make to the data. It is better to correct
errors in the data using a do-file program that can be saved (we will see do-file programs
later).
browse
This window is exactly like the Data Editor window except that you can't change the data.
describe
This command provides a brief description of the data file. You can use
"des" or "d" and Stata will understand. The output includes the number of observations,
the number of variables, and each variable's name, storage type, display format, and
variable label.
2.3 Data Manipulation
2.3.1 Logical operators
2.3.1.1 Logical operator if
Less than: <
Greater than: >
Less than or equal to: <=
Greater than or equal to: >=
Equals: ==
Does not equal: != (or ~=)
Exercises
List only the makes of cars whose price is less than $5,000
What is the average price of cars whose mpg is 18?
How many cars are there?
You can also use count to get this information
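One way to answer these exercises with the auto dataset (a sketch; variable names follow sysuse auto):
list make if price < 5000        // makes of cars priced under $5,000
summarize price if mpg == 18     // average price of cars whose mpg is 18
count                            // number of cars in the dataset
count if mpg == 18               // counts can also be restricted with if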
2.3.1.2 Logical operators: "and," "or"
&: logical "and"
|: logical "or"
Exercise
If we want the name of the car whose weight is between 1000 and 2000
pounds…
list make if weight > 1000 & weight < 2000
What if we also wanted weight listed with their name?
If we want a list of cars and their mileage per gallon (mpg) whose mpg is less
than 20 or over 30…
list make if mpg < 20 | mpg > 30
Using the count function, how many cars is this?
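Possible answers, continuing the same sketch:
list make weight if weight > 1000 & weight < 2000   // names listed together with weights
list make mpg if mpg < 20 | mpg > 30                // makes and mpg outside the 20-30 range
count if mpg < 20 | mpg > 30                        // how many cars this is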
Practice questions
4. What is the average value of GNP over the various observations?
2.3.2 Generating Variables
You will regularly have to generate variables to aid econometric analysis. For example, you
may want to create dummy variables or run log-log regressions. To create a variable named
loggedprice equal to the natural log of price, the command is gen loggedprice = ln(price).
Similarly, to generate a variable equal to twice the square root of the price, use the command
gen twice_root_price = 2*sqrt(price). Note that variable names cannot contain spaces.
The egen ("extended generate") command works just like gen but with extra options. For example,
egen avg_price = mean(price). With egen we can also break commands down into
percentiles very easily. For example, to create a variable equal to the 99th percentile of
price, enter egen high_price = pctile(price), p(99). Changing the 99 to 50 in that command
would produce a variable equal to the median price. egen is often used to obtain a
breakdown of a particular statistic by another variable.
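A brief sketch pulling these commands together; expensive is a hypothetical dummy for cars priced above the sample mean (it is referred to again below):
sysuse auto, clear
gen loggedprice = ln(price)
egen avg_price = mean(price)
gen expensive = (price > avg_price)   // dummy: 1 if above the mean price, 0 otherwise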
if Commands
The if qualifier restricts a command to the observations that satisfy a condition; it can be
combined with commands such as list, summarize, gen, and drop, as in the example below.
For example if you want to eliminate the most expensive 5% of
observations, the following would work:
egen top_fivepercent_prices = pctile(price), p(95)
drop if price > top_fivepercent_prices
We remove these variables from the data with the drop command, as we do not need them
in this analysis: drop loggedprice twice_root_price avg_price high_price expensive.
One of the first things you will want to do with your data is to summarize its main
features. Crosstabs are a useful pre-regression tool, and are also useful for presenting the
main points of your data succinctly. The two most important commands for this are “tab”
and “tabstat”. “Tab” tells you how many times each answer is given in response to a
particular variable. This is only suitable for variables with relatively few entries, such
as categorical data. If you try to use “tab” on a variable which has hundreds of different
entries you will get an error message. Typing tab expensive will show how many
entries there are for each category. As with all commands, it can also be accessed
through the menus via: STATISTICS, SUMMARIES, TABLES. It is also easy to
obtain crosstabs which give a breakdown of one variable by another. For example, typing
tab headroom expensive will show, for each headroom category, how many cars are above
and below the average price. It is often useful to know the percentages as well as the actual
numbers. We need to add an option to our tab command. Typing tab headroom expensive, col
will give us the percentage within each column, and tab headroom expensive, row will give
us the percentage within each row.
The second command which is useful here is “tabstat”. This is used for continuous
variables, and its main use is to provide mean values. For example tabstat price will
give the average price in the dataset. Using the command options we can also access
other useful statistics such as the median tabstat price, stats(med), or the variance
tabstat price, stats(var). For a full list of the available statistics, type help
tabstat. As before, we can obtain these statistics according to different levels of a
second variable: tabstat price, by(groupvar) gives the average price for each group, for
example tabstat price, by(foreign) for domestic versus foreign cars.
2.3.3 Modifying Variables:
1. Replace command: The replace command changes the values of an existing variable,
usually together with an if condition; unlike gen, it overwrites values rather than creating
a new variable.
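A hedged sketch of replace in use (avg_price is the variable created in the earlier sketch; the grouping is illustrative):
gen pricegroup = 1
replace pricegroup = 2 if price > avg_price   // overwrite the value for cars above the mean price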
Conclusion: Effective data storage and organization are essential for efficient data
management and analysis in Stata. This lecture note covered various data storage formats, file
extensions, and commands for data organization. Understanding and utilizing these
commands will enable you to work with datasets seamlessly and perform complex data
manipulation tasks in Stata.
2.4 Data Documentation
2.4.1 Data Documentation and Metadata
Data Documentation: Data documentation refers to the process of capturing and recording
information about data sets. It involves describing the data's characteristics, including its
origin, purpose, structure, relationships, and any transformations or processing applied to it.
The primary goal of data documentation is to provide comprehensive details that facilitate
data understanding, interpretation, and reuse.
Typical components of data documentation include:
Data source: The origin of the data, such as databases, files, or external systems.
Data structure: The organization and format of the data, including tables, fields, data types,
and constraints.
Data dictionary: A detailed description of each data element, including its name, definition,
allowed values, and relationships with other elements.
Data quality information: Metrics, standards, or rules that assess the data's accuracy,
completeness, consistency, and timeliness.
Data usage and access: Information about who can access the data, permissions, and
restrictions.
Data lineage: The history of data, documenting its origins, modifications, and relationships
across different stages or systems.
Metadata: Metadata refers to the data about data. It provides structured information about
various aspects of data, serving as a descriptive layer that helps users discover, understand,
and manage data resources. Metadata can be categorized into three main types:
Descriptive metadata: It describes the characteristics of a dataset, such as its title, abstract,
keywords, author, creation date, and subject area. Descriptive metadata helps users identify
and locate relevant data resources.
Structural metadata: It describes how the components of a dataset are organized and related,
for example how files, tables, and variables fit together.
Administrative metadata: It covers information needed to manage the data, such as access
rights, file formats, preservation details, and data retention policies. Administrative metadata
helps manage data throughout its lifecycle.
Metadata can be stored in dedicated repositories or embedded within data files. Common
metadata standards and frameworks include Dublin Core, Data Documentation Initiative
(DDI), and ISO 19115 for geospatial data.
2.4.2 The importance of documenting data
Data Understanding: Documentation helps users understand the data, its structure, and its
meaning. It provides crucial information about the variables, their definitions, and the
relationships between them. Proper documentation allows users to interpret the data
accurately, make informed decisions, and avoid misinterpretation or errors in analysis.
Data Quality Assurance: Documentation plays a vital role in ensuring data quality. It allows
data users to assess the reliability, completeness, and accuracy of the data. By documenting
data sources, collection methods, and any data transformations or cleaning processes applied,
it becomes easier to identify potential biases, errors, or inconsistencies in the data. This helps
maintain data integrity and enhances the trustworthiness of the analysis and results.
Well-documented data saves time and improves productivity. Documented data also promotes
knowledge sharing within an organization or research community, enabling others to benefit
from the data and insights generated.
Data Governance and Compliance: Documentation is crucial for data governance and
compliance with regulatory requirements. It helps establish data ownership, access
permissions, and data usage policies. Documenting data lineage, data privacy measures, and
security protocols ensures adherence to legal and ethical standards, protecting sensitive
information and maintaining data confidentiality.
Long-term Preservation and Data Legacy: Proper documentation ensures the long-term
preservation and usability of data. As data can have long lifespans, documented information
about the data's context, structure, and metadata becomes invaluable over time. It allows
future users or researchers to understand and make meaningful use of the data, even if the
original creators are no longer available or the data is archived.
2.4.3 Variable labels
1. Syntax of the "label variable" Command: The basic syntax for the "label variable"
command is as follows:
label variable varname "label"
Here "varname" is the variable to be labeled and "label" represents the descriptive label you
want to assign to the variable.
Step 1: Open your dataset in Stata using the "use" command:
use "your_dataset.dta"
Step 2: Assign a label with the "label variable" command:
label variable varname "label"
Replace "varname" with the name of the variable you want to label.
Replace "label" with the descriptive label you want to assign to the variable.
Example: Suppose you have a variable named "price" that represents the price of cars. You can
assign a variable label using the following command:
label variable price "Price of cars (USD)"
This assigns the label "Price of cars (USD)" to the variable "price".
Step 3: Viewing Variable Labels: To view the variable labels in your dataset, use the
"describe" command:
describe
In the output, you will see the variable labels displayed alongside the variable names;
the labels also appear in the Variables window.
Benefits of Variable Labels:
Improved Data Understanding: Variable labels provide meaningful descriptions that help
you understand the purpose and content of variables, especially when working with large or
complex datasets.
Readability of Output: When generating tables or reports, variable labels make the output
more readable and comprehensible, especially when variable names are cryptic or
abbreviated.
Using variable labels in Stata enhances data understanding, documentation, and the
reproducibility of your analysis. The "label variable" command allows you to assign
descriptive labels to variables, making your dataset more informative and facilitating
effective communication of your research findings.
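Value labels (introduced in Chapter 1) are attached with a related pair of commands; a sketch, assuming a numeric variable region coded 1, 3, 4, and 5 as in the earlier example:
label define regionlbl 1 "Tigray" 3 "Amhara" 4 "Oromia" 5 "Somali"
label values region regionlbl
describe region   // shows the variable label and the attached value label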
2.5 Data Storage and Organization
Efficient data storage and organization are crucial for data management and analysis in Stata.
In this lecture note, we will discuss various data storage formats, file extensions, and
commands that facilitate data organization in Stata.
2.5.1 Data Storage Formats:
1. Stata format (.dta):
Stata's native dataset format, which preserves variable labels, value labels, and other metadata.
Example command to save a dataset: save filename
2. Comma-separated values (.csv):
A widely used plain text file format where data values are separated by commas.
Example command to import a CSV file into Stata: insheet using filename.csv (or, in newer
versions, import delimited filename.csv)
3. Excel (.xls/.xlsx):
Example command to import an Excel file into Stata: import excel filename.xls
4. Other formats:
Stata can import and export data in various other formats, such as SAS, SPSS, and R.
2.5.2 File Extensions:
dta: Stata dataset files. Example: dataset.dta
do: Do-files containing a sequence of Stata commands. Example: filename.do
To create a do-file, open the Do-file Editor (type doedit in the Command window or choose
it from the Window menu) and enter your commands.
Once you have written your Stata commands, save the do-file by clicking on "File" in the
menu bar and selecting "Save" or "Save As." Choose a name and location for the do-file and
give it the .do extension.
Example: analysis.do
Click on the green "Play" button in the toolbar of the do-file editor.
Use the keyboard shortcut Ctrl+D (Windows) or Command+D (Mac). Alternatively, you can
open the do-file in the Command window and run it by typing do filename.do and pressing
Enter.
Your do-file is now created and ready to be executed. It allows you to execute a series of
Stata commands in a sequential and reproducible manner. Remember to save your do-file
regularly as you make changes to your analysis or data management procedures.
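A minimal example of what such a do-file might contain (analysis.do is the hypothetical name used above; the commands are illustrative):
* analysis.do: load the example data, summarize it, and keep a log of the session
clear all
log using "analysis.log", replace
sysuse auto
summarize price mpg weight
log close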
log:
A log file records the Stata session, including commands and output.
Example: filename.log
Open Stata and ensure that the Log window is visible. You can show the Log window by
clicking on "Window" in the menu bar and selecting "Log."
To start recording the Stata session in the log file, you have a few options:
Click on the "Log" button in the toolbar and choose a file name and location. Alternatively,
you can open the Command window and type log using filename.log and press Enter.
Replace "filename" with the desired name for your log file.
Once the log recording is activated, all subsequent commands and output in Stata will be
recorded in the log file.
Execute the Stata commands you want to include in the log file. You can run commands in
the Command window, do-files, or interactively using the user interface.
After executing the desired commands, you can stop recording the session by:
Typing log close in the Command window.
Clicking on the "Log" button again in the toolbar of the Log window (if it is highlighted).
Using the keyboard shortcut Ctrl+L (Windows) or Command+L (Mac) (if the log recording is
active).
The log file is saved with the .log extension. By default, it is saved in the current working
directory. You can specify a different path by using the log using command when starting the
log, as mentioned in step 2.
To review a saved log later, open it with the view command (for example, view filename.log),
which displays the file in the Stata Viewer.
Creating a log file in Stata helps to document and reproduce your analysis by recording the
commands, results, and any error messages encountered during the session. It is good practice
to use log files to keep track of your work and share it with others for collaboration or
replication.
2.6 Data Security and Ethics
Data security and ethics are critical considerations when working with sensitive and
confidential data. As economics students, it is essential to handle data responsibly and adhere
to ethical guidelines. In this lecture note, we will discuss key principles of data security and
ethics, specifically focusing on their application in Stata.
I. Data Security:
Protecting Data:
Ensure that data files are stored in secure locations with limited access.
Use strong passwords and encryption to protect sensitive files.
Regularly back up data to prevent loss or corruption.
Data Sharing:
Share data only under the applicable data use agreements, and remove or anonymize
identifying information before sharing.
Access Control:
Grant access to data only to authorized individuals who have a legitimate need.
Utilize user authentication and access control mechanisms in Stata.
Implement role-based access control to restrict privileges based on user roles.
Data Destruction:
Securely delete or destroy data files that are no longer needed, in line with institutional policies.
II. Data Ethics:
Informed Consent:
Obtain informed consent from individuals whose data you are using for research.
Clearly explain the purpose, procedures, and potential risks of the study.
Privacy Protection:
Use data only for the specified research purposes and avoid unauthorized disclosure.
Follow ethical guidelines set by institutions, professional bodies, and funding agencies.
Reporting and Attribution:
Accurately report and attribute the data sources used in your research.
Give credit to the original data providers and cite relevant references.
Data security and ethics are essential considerations for economics students working with
data in Stata. This lecture note highlighted key principles for ensuring data security,
protecting privacy, and adhering to ethical guidelines. By following these principles and best
practices, you can handle data responsibly, maintain confidentiality, and contribute to sound
and ethical research practices in the field of economics.
Review questions on Stata application to data management in economics:
1. How can you import data from an external file (e.g., CSV or Excel) into Stata for data
management and analysis?
2. What are the different types of variables in Stata, and how do they affect data
management tasks?
3. How can you identify missing values in Stata datasets and handle them appropriately
for data management purposes?
4. What are the main commands used for recoding variables in Stata, and how can they
be applied to perform data management tasks in economics?
5. How can you generate new variables based on existing ones using Stata, and provide
an example where this would be beneficial in economic data management?
6. Explain the concept of data sorting in Stata and discuss its importance in data
management for economic analysis.
7. How can you generate summary statistics and descriptive analysis in Stata for a
specific variable or across multiple variables, and why is this useful in economic data
management?
Chapter 3: Application to Univariate Analysis
3.1 Summary Statistics:
To generate summary statistics in Stata, you can use the summarize command. This
command provides descriptive statistics for variables in your dataset. Here's an example of
how to use it:
1. Open Stata and load your dataset using the use command, or load a built-in example
dataset with sysuse. For example:
sysuse auto
2. To generate summary statistics for a single variable, use the summarize command
followed by the variable name. For example, to summarize the variable "price":
summarize price
As seen in the output, summary statistics in Stata provide various univariate
analyses for a given variable. The default summary statistics include:
3.1.1 Count:
The number of non-missing observations for the variable.
3.1.2 Mean:
The average value of the variable.
3.1.3 Standard deviation:
A measure of the dispersion or variability of the variable. The standard deviation of the
variable price is found to be 2949.496
3.1.4 Minimum:
The smallest value observed for the variable. The minimum price in the data is 3291
3.1.5 Maximum:
The largest value observed for the variable. The maximum price in the data is 15906
In addition to these default statistics, Stata's summarize command also provides other optional
summary statistics:
3.1.6 Quartiles:
The values that divide the data into four equal parts (25th percentile, median or 50th
percentile, and 75th percentile).
3.1.7 Percentiles:
The values that divide the data into specific percentage points (e.g., 10th, 90th percentile).
3.1.8 Skewness:
A measure of the asymmetry of the distribution of the variable.
3.1.9 Kurtosis:
A measure of the heaviness of the tails of the distribution of the variable.
By default, Stata calculates the count, mean, standard deviation, minimum, and maximum. To
include additional statistics such as quartiles, percentiles, skewness, and kurtosis, you can use
the detail option with the summarize command.
Here's an example of summarizing the variable "price" and including quartiles,
percentiles, skewness, and kurtosis:
summarize price, detail
This will provide you with a comprehensive set of univariate summary statistics for the
variable "price" in your dataset.
3.2 Histogram or Frequency Distribution:
Histograms and frequency distributions serve several purposes in univariate analysis:
1. Visualizing the Distribution: Histograms show the overall shape of a variable's
distribution, making it easy to see whether it is symmetric, skewed, unimodal, or multimodal.
2. Assessing Central Tendency: The location of the tallest bars indicates where most of
the observations are concentrated.
3. Assessing Variability: The spread of the bars gives a quick impression of the dispersion
of the variable.
4. Data Cleaning and Preprocessing: Histograms help in identifying any irregularities or
anomalies in the data, such as missing values or data entry errors. By visualizing the
frequency distribution, you can spot gaps or unexpected patterns that may require
further investigation or data cleaning.
5. Understanding Data Characteristics: Univariate histograms provide insights into the
characteristics of the data, such as the range of values, minimum and maximum
values, and the presence of data clusters or peaks. This information can be valuable in
understanding the nature of the variable and its potential impact on the analysis.
6. Assessing Data Skewness and Kurtosis: Skewness refers to the asymmetry of the data
distribution, while kurtosis measures the degree of peakedness or flatness of the
distribution. Histograms can help identify whether the data is normally distributed or
exhibits significant skewness or kurtosis.
7. Data Comparison: Univariate histograms are useful for comparing different datasets
or subsets of data. By overlaying multiple histograms, you can visually compare the
distributions and identify similarities or differences in the variables being analyzed.
Create a histogram or frequency distribution to visualize the distribution of the variable using
the histogram or graph bar command. (use: sysuse auto)
histogram price
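A few common options, shown as a sketch (the bin count and the normal overlay are illustrative choices):
histogram price, bin(10) normal                         // 10 bins with a normal density overlaid
histogram price, frequency title("Distribution of car prices")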
In summary, univariate histograms or frequency distributions are valuable tools for analyzing
and understanding the distribution, central tendency, variability, and anomalies of a single
variable. They enable researchers, analysts, and data scientists to gain insights and make data-
driven decisions based on the characteristics of the variable under examination.
3.3 Box Plots:
Box plots, also known as box-and-whisker plots, are commonly used in univariate analysis to
summarize the distribution of a single variable. They provide a visual representation of the
central tendency, spread, and skewness of the data. Here are the uses of box plots in
univariate analysis:
1. Central Tendency: Box plots provide information about the central tendency of the
data, including the median (middle value) and the position of the box. The horizontal
line within the box represents the median, which gives an indication of the typical or
central value of the variable.
2. Spread or Variability: The width of the box in a box plot represents the spread or
variability of the data. A wider box indicates a larger spread, while a narrower box
indicates a smaller spread. By comparing the widths of different box plots, you can
assess the relative variability between variables or groups.
3. Skewness and Symmetry: Box plots can reveal the skewness or asymmetry of the data
distribution. If the whiskers (lines extending from the box) are noticeably imbalanced
or unequal in length, it suggests skewness in the data. Additionally, the position of the
median within the box can provide insights into the symmetry or lack thereof in the
distribution.
4. Outlier Detection: Box plots help in identifying outliers, which are observations that
significantly differ from the majority of the data points. Outliers are represented as
individual points beyond the whiskers of the plot. Their presence can indicate
potential data anomalies or extreme values that may require further investigation.
Generate box plots to visualize the distribution of a variable using the graph box command.
(use: sysuse auto)
graph box price
The resulting plot visualizes the variable and makes the outliers easy to spot. In summary,
box plots are a powerful tool for summarizing and visualizing the distribution, central
tendency, spread, skewness, and outliers in a single variable. They allow for easy
comparison, detection of anomalies, and identification of important characteristics of the
data, making them valuable in univariate analysis.
3.4 Bar plots or Pie charts
In Stata, you can create bar plots and pie charts to visualize categorical variables in univariate
analysis. Here's an overview of how to create these plots in Stata and their uses in univariate
analysis:
1. Bar Plots:
o Creating Bar Plots: In Stata, you can use the graph bar command to create
bar plots (see the sketch after this list).
o Bar plots are useful for visualizing the distribution of categorical variables.
They allow you to compare the frequencies or proportions of different
categories and identify patterns or trends within the data. Bar plots are
especially effective when you want to compare the frequencies across different
groups or conditions.
2. Pie Charts:
o Creating Pie Charts: To create a pie chart in Stata, you can use the graph pie
command. Specify the categorical variable, and Stata will generate a pie chart
displaying the relative proportions of each category.
o Pie charts are useful for illustrating the composition or relative distribution of
categorical variables. They provide a visual representation of the proportions
or percentages of each category within the whole. Pie charts can help you
quickly identify the dominant categories and understand the overall
distribution of the variable.
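A minimal sketch with the auto data (foreign is the categorical variable; sysuse auto is assumed to be loaded):
graph bar (count), over(foreign)   // bar chart of the number of cars by origin
graph pie, over(foreign)           // pie chart of the share of domestic versus foreign cars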
Both bar plots and pie charts have their own advantages and considerations:
Bar plots are typically preferred when you have a larger number of categories or when
you want to compare the frequencies across multiple groups. They are useful for
displaying absolute frequencies or proportions using different bar lengths.
Pie charts are more suitable when you have a smaller number of categories and want
to emphasize the relative proportions or percentages of each category. However, it's
important to note that pie charts can be more challenging to interpret accurately,
especially when there are many small categories or when comparing similar-sized
categories.
When using bar plots or pie charts in univariate analysis, it's essential to ensure the
visualizations are clear, informative, and accurately represent the data. Remember to provide
appropriate labels, legends, and titles to enhance the interpretability of the plots. Additionally,
always consider the specific characteristics of your data and research question to determine
the most appropriate type of visualization.
1. What are the main graphical techniques available in Stata for visualizing the distribution
of a single variable, and how can they be used for univariate analysis in economics?
2. How can you calculate measures of central tendency (e.g., mean, median) and
dispersion (e.g., standard deviation, range) for a variable using Stata, and why are these
measures important in univariate analysis in economics?
3. Explain the concept of hypothesis testing in the context of univariate analysis using
Stata, and discuss how you can perform t-tests or chi-square tests to assess the
significance of relationships between variables in economic analysis.
4. How can you generate frequency tables and histograms for categorical variables using
Stata, and why are these techniques useful for univariate analysis in economics?
5. Discuss the concept of skewness and kurtosis in Stata, and explain how these measures
can help assess the shape and distribution of a variable in univariate analysis in
economics.
6. Explain how you can conduct outlier detection and management in Stata for a single
variable, and discuss why it is important to identify and handle outliers in univariate
analysis in economics.
7. How can you generate descriptive statistics tables in Stata that summarize multiple
variables simultaneously, and why is this beneficial for univariate analysis in
economics?
Chapter 4: Application to Bivariate Analysis
4.1 Scatter Plot:
Scatter Plot: The scatter command creates a scatter plot to visualize the relationship
between two variables. It allows you to assess the nature and strength of the relationship
between the variables. Scatter plots are beneficial for identifying patterns, trends, clusters, or
outliers in the data. They provide a visual representation of the data points' distribution and
help in understanding the relationship's direction (positive or negative).
Create a scatter plot to visualize the relationship between the two variables using the scatter
command. (Using sysuse auto, let's see the relationship between trunk and weight.)
scatter trunk weight
4.2 Correlation Coefficient:
o Calculate the correlation coefficient between the two variables to measure the
strength and direction of the linear relationship using the correlate command.
o (Using sysuse auto, let's again look at trunk and weight.)
correlate trunk weight
As the output shows, the variables trunk and weight have a positive correlation, and the
correlation coefficient is 0.6722.
4.3 Regression Analysis:
The regress command conducts a regression analysis to examine the relationship between
the dependent variable and one or more independent variables. It estimates the regression
coefficients, provides information on the statistical significance of the predictors, and helps in
understanding how changes in the independent variables are associated with changes in the
dependent variable. Regression analysis is valuable for predicting outcomes, identifying
important predictors, and assessing the overall fit of the model.
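As a hedged illustration with the auto data (the regressors are assumptions, since the original example is not shown):
sysuse auto, clear
regress price mpg weight foreign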
The regression output reports the coefficients, standard errors, t-values, p-values,
confidence intervals, the R-squared, and other important information.
4.4 Line Plot or Grouped Bar Chart:
The twoway command creates line plots, while the graph bar command generates grouped
bar charts. These visualizations are useful when one variable is categorical. Line plots display
trends or changes over categories, while grouped bar charts provide a visual comparison of
values across categories. They are beneficial for understanding how a variable varies across
different groups or conditions and identifying any differences or patterns.
If one variable is categorical, you can create a line plot or a grouped bar chart to visualize the
relationship between the variables based on different categories using the twoway or graph
bar command.
o Example command for a line plot (using sysuse auto; y and x stand for the two
variables being plotted, for instance a price measure over a grouping variable):
twoway (line y x)
4.5 Cross-tabulation:
If both variables are categorical, a cross-tabulation (two-way frequency table) produced with
the tabulate command shows how the categories of one variable are distributed across the
categories of the other.
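A sketch using two categorical variables from the auto data (rep78 is the repair record; row adds row percentages):
tabulate rep78 foreign, row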
4.6 Independent Samples t-test or Mann-Whitney U test:
If one variable is categorical and the other is continuous, you can conduct an independent
samples t-test or a Mann-Whitney U test to compare the means or medians of the continuous
variable between the two groups using the ttest or ranksum command.
If calculated t-value > critical t-value (t > critical t), reject the null hypothesis.
If calculated t-value ≤ critical t-value (t ≤ critical t), fail to reject the null hypothesis.
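A sketch of both tests for a continuous variable compared across a binary group (variable names follow the auto data):
ttest price, by(foreign)     // independent-samples t-test of mean price by car origin
ranksum price, by(foreign)   // Mann-Whitney U (Wilcoxon rank-sum) test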
4.7 Chi-Square Test:
If both variables are categorical, you can perform a chi-square test to examine the association
between the variables using the tabulate or crosstab command.
To perform a chi-square test in Stata, you can use the tabulate command with the chi2
option. Here's the syntax:
tabulate var1 var2, chi2
If calculated chi-square test statistic > critical chi-square value (chi-square statistic >
critical chi-square), reject the null hypothesis.
If calculated chi-square test statistic ≤ critical chi-square value (chi-square statistic ≤
critical chi-square), fail to reject the null hypothesis.
Review questions on Stata application to bivariate analysis in economics:
1. How can you calculate and interpret correlation coefficients using Stata, and why is
correlation analysis important in bivariate analysis in economics?
2. Explain the concept of scatter plots in Stata, and discuss how they can be used to
visualize the relationship between two variables in bivariate analysis in economics.
3. Discuss the process of conducting regression analysis using Stata, and explain how you
can interpret the coefficients and assess the significance of the relationship between two
variables in economic analysis.
4. Explain the concept of covariance and its interpretation in Stata, and discuss its relevance
in analyzing the relationship between two variables in bivariate analysis in economics.
5. How can you generate a scatter plot matrix in Stata to visualize the relationships between
multiple variables simultaneously, and why is this useful in bivariate analysis in
economics?
Chapter 5: Application to Cross Sectional Econometrics
In cross-sectional econometrics analysis, several tests can be conducted to assess the validity
of the underlying assumptions and to evaluate the results of the regression model. Here are
some commonly used tests in cross-sectional econometrics analysis, along with the
corresponding Stata commands:
5.1 Heteroscedasticity test:
o Command: hettest (or estat hettest)
o Purpose: Assess whether there is heteroscedasticity (unequal variances) in the
regression model.
o Example: after fitting a regression (for instance, regress price mpg weight with
sysuse auto), type estat hettest
After running the hettest command, the output will provide you with
several test statistics and p-values. Here's a decision rule for interpreting the results of a
heteroscedasticity test in Stata:
Look at the overall test of heteroscedasticity: The hettest command in Stata reports the
Breusch-Pagan / Cook-Weisberg test; White's test can be obtained with estat imtest, white.
Examine the p-values associated with these tests.
o If the p-value is less than your chosen significance level (e.g., 0.05), you can
conclude that there is evidence of heteroscedasticity in the data. In this case, you
should consider addressing the issue of heteroscedasticity in your analysis.
o If the p-value is greater than your chosen significance level, you fail to reject the
null hypothesis of homoscedasticity. This suggests that there is no strong evidence
of heteroscedasticity in the data, and you can assume homoscedasticity in your
analysis.
Remember that the decision rule for the heteroscedasticity test is based on the chosen
significance level (e.g., 0.05). You can adjust the significance level according to your specific
requirements and research context.
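A compact sketch of the workflow (the regressors are illustrative):
sysuse auto, clear
regress price mpg weight
estat hettest         // Breusch-Pagan / Cook-Weisberg test
estat imtest, white   // White's test for heteroscedasticity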
5.2 Normality test:
In Stata, there are various commands that can be used to perform a normality test on a
variable, such as swilk or sktest. The decision rule for interpreting the results of a
normality test depends on the specific command used and the corresponding test statistic and
p-value. Here's a general decision rule for interpreting the results of a normality test in Stata:
1. Look at the test statistic: The specific command used in Stata will provide a test
statistic related to the normality of the variable. This could be the Shapiro-Wilk
statistic (swilk) or the skewness and kurtosis test statistic (sktest).
2. Examine the p-value: The output of the normality test in Stata will also include a p-
value associated with the test statistic. This p-value indicates the strength of evidence
against the null hypothesis of normality.
o If the p-value is less than your chosen significance level (e.g., 0.05), you can
conclude that there is evidence against the null hypothesis of normality. In this
case, you have sufficient evidence to reject the assumption of normality for the
variable.
o If the p-value is greater than your chosen significance level, you fail to reject
the null hypothesis of normality. This suggests that there is no strong evidence
against the assumption of normality for the variable.
o Command: swilk or sktest
o Purpose: Test the normality assumption of the error term in the regression model.
o Example: generate the residuals with predict and then run swilk on them, as sketched below.
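A minimal sketch, again assuming an illustrative regression on the auto data:
sysuse auto, clear
regress price mpg weight
predict resid_hat, residuals     // store the residuals
swilk resid_hat                  // Shapiro-Wilk test
sktest resid_hat                 // skewness/kurtosis test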
It is important to note that normality tests are generally sensitive to sample size. In larger
sample sizes, even minor deviations from normality may lead to rejection of the null
hypothesis. Therefore, when interpreting the results, consider the sample size and the
practical significance of any deviations from normality.
5.3 Correlation
o Command: correlate or pwcorr
o Purpose: Measure the strength and direction of the linear relationship between two variables. (The related issue of serial correlation, i.e. autocorrelation in the error term, is tested separately and is mainly a time-series concern.)
The acceptable correlation coefficient for linearity depends on the field of study or the
specific context of analysis. However, in general, a correlation coefficient value between -1
and 1 represents the strength and direction of the linear relationship between two variables. A
positive correlation coefficient indicates a positive linear relationship, while a negative
correlation coefficient indicates a negative linear relationship. Moreover, the magnitude of
the correlation coefficient indicates the strength of the relationship, where values close to -1
or 1 indicate a stronger linear relationship, and values close to 0 suggest a weaker linear
relationship.
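A minimal sketch of computing pairwise correlations (the auto data and variables are illustrative):
sysuse auto, clear
pwcorr price mpg weight, sig     // correlation coefficients with significance levels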
5.4 Multicollinearity test
To perform a multicollinearity test and interpret the results using Stata, you can follow these steps and commands:
1. First, load your dataset into Stata using the `use` command. For example:
use "your_dataset.dta"
2. Next, run a regression model with the variables you want to test for multicollinearity. Use
the `regress` command. For example, if you have three independent variables named x1, x2,
and x3, and the dependent variable is y, the command would be:
regress y x1 x2 x3
3. After running the regression, you can check the multicollinearity by using the `vif`
command, which stands for variance inflation factor. The VIF measures how much the
variance of the estimated regression coefficient is increased due to multicollinearity. For
example:
vif
4. Stata will produce the VIF values for each independent variable. Typically, a VIF value
above 5 or 10 indicates multicollinearity. Generally, the higher the VIF, the higher the
multicollinearity. You might also want to consider other diagnostics, such as condition
indices or tolerance.
5. If you find that multicollinearity exists, you can take appropriate actions to address it.
Some possible solutions include:
Removing one or more variables that are highly correlated with others.
Transforming the variables (e.g., using logarithmic transformation) to reduce the
correlation.
Including interaction terms between the correlated variables to account for their joint
effect.
Remember, multicollinearity is a statistical issue that affects the interpretation and stability of
regression models. It is important to address it appropriately in order to obtain reliable results.
5.5 Specification tests:
In cross-sectional analysis using Stata, specification tests and decision rules are crucial for
assessing the adequacy of statistical models and making informed decisions. Below, let us see
a brief overview of some commonly used specification tests and decision rules in cross-
sectional analysis using Stata along with the relevant commands.
Look for a higher adjusted R-squared value, which indicates a better fit of the model while penalizing excessive complexity.
Use robust standard errors if the homoscedasticity assumption is violated or if there are other concerns about the model's error structure.
Perform a joint significance test on the independent variables to determine whether they collectively have a significant impact on the dependent variable (see the sketch below).
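A minimal sketch of robust standard errors and a joint significance test, with illustrative variables from the auto data:
sysuse auto, clear
regress price mpg weight foreign, vce(robust)   // heteroscedasticity-robust standard errors
test mpg weight foreign                         // joint F test that all slope coefficients are zero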
In Stata, you can assess the goodness-of-fit measures and implement decision rules for cross-sectional analysis using various commands. Here are a few commonly used commands and their purposes:
1. "regress" command: This command estimates linear regression models by ordinary least squares and reports R-squared and adjusted R-squared, which summarize how much of the variation in the dependent variable the model explains.
2. "logit" or "probit" commands: These commands are used to estimate binary response models. They provide pseudo R-squared measures (e.g., McFadden's R-squared or Cox and Snell R-squared) to assess the model's goodness-of-fit.
3. "xtreg" command: This command is used for panel data analysis where observations are
clustered over time or across groups. It provides different goodness-of-fit measures such as
within-group R-squared or variance components.
4. "heckman" command: This command is used for estimating selection models to correct for
sample selection bias. It provides likelihood ratio tests and other measures to assess the
goodness of fit.
From the regression output above, the adjusted R-squared is 0.5297, meaning that about 53 percent of the variation in the dependent variable is explained by the model after adjusting for the number of regressors.
These are just a few examples of tests that can be performed in cross-sectional econometrics
analysis. The specific tests you should conduct may vary depending on the nature of your
research question, the assumptions of your model, and the specific econometric techniques
employed. It's important to consult econometric textbooks, research articles, or the Stata
documentation for more comprehensive information on conducting tests in cross-sectional
econometrics.
1. How can you estimate and interpret a simple linear regression model using Stata, and
what are the key assumptions underlying this model in cross-sectional econometrics?
2. Explain the concept of heteroscedasticity in the context of cross-sectional econometrics,
and discuss how you can detect and address heteroscedasticity using Stata.
3. How can you estimate and interpret a multiple regression model with multiple
independent variables using Stata, and what additional insights does this model provide
in cross-sectional econometrics?
4. Discuss the concept of multicollinearity in cross-sectional econometrics, and explain
how you can detect and mitigate multicollinearity issues using Stata.
Chapter 6: Application to Time Series Econometrics and Panel Data
6.1 Application to Time Series Econometrics
6.1.1 Importing and Managing Data:
Use the use command to load your dataset into Stata. The example data for this section can be downloaded from https://www.stata-press.com/data/r13/tsmain.html.
Use the tsset command to declare the time variable. As shown in the picture below, qtr is set as the time variable by this command.
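A minimal sketch of these two steps, assuming a quarterly dataset whose time variable is named qtr as in the screenshots:
use "your_timeseries.dta", clear     // or load the example data from the link above
tsset qtr                            // declare qtr as the time variable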
Exploratory Data Analysis:
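Before formal testing it is useful to summarize and plot the series. A minimal sketch (inv and inc stand for the investment and income series used in the examples below; substitute the names in your data):
summarize inv inc     // descriptive statistics
tsline inv inc        // time-series plot of both variables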
6.1.2 Time Series Analysis: Unit root tests
To test for a unit root in a time series using Stata, you can use the Augmented Dickey-Fuller
(ADF) test or the Phillips-Perron (PP) test. Here's how you can perform these tests using
Stata commands:
dfuller varname
The null hypothesis for the ADF test is the presence of a unit root. If the p-value is
less than your chosen significance level (e.g., 0.05), you can reject the null hypothesis
and conclude that the series is stationary (no unit root).
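A minimal sketch for the example data (inv and inc again denote the investment and income series):
dfuller inv
dfuller inc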
As you can see from the picture above, both investment and income are non-stationary, because the p-values for both are greater than the chosen significance level (0.05).
If your time series data is not stationary, you can take several steps to make it
stationary before applying time series models or analyses. Here are some common
approaches to dealing with non-stationarity in time series data:
As shown in the picture above, we first transform the variables to natural logarithms; now let's test again for stationarity.
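A minimal sketch of the log transformation and re-test (variable names are illustrative):
generate ln_inv = ln(inv)
generate ln_inc = ln(inc)
dfuller ln_inv
dfuller ln_inc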
As the p-values show, the variables are still not stationary; therefore, we proceed to the next method below.
Differencing: Take first differences: Compute the difference between
consecutive observations in the time series using the generate command in
Stata. This transformation can help remove trends or seasonality and make the
data stationary. Use the arima command with the differenced variable to
estimate models on the differenced data.
First Difference: As mentioned earlier, you can use the generate command
to create a new variable representing the first difference. For example, to
compute the first difference of a variable named "varname," you can use the
following command:
generate diff_var = varname - L.varname
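Equivalently, once the data have been tsset you can use Stata's difference operator; a minimal sketch with illustrative names:
generate d_inv = D.inv     // first difference, the same as inv - L.inv
generate d_inc = D.inc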
After getting the first difference of the variables now we can test for stationarity.
The null hypothesis for the ADF test is the presence of a unit root. If the p-value is less than your chosen significance level (e.g., 0.05), you can reject the null hypothesis and conclude that the series is stationary (no unit root). With this rule, and p-values of 0.000 after differencing, we can conclude that both the investment and income series are now stationary.
pperron varname
Replace "varname" with the name of your variable.
Press Enter to execute the command.
Stata will provide the test results, including the test statistic, p-value, and critical
values at different significance levels.
The null hypothesis for the PP test is also the presence of a unit root. If the p-value is less than your chosen significance level, you can reject the null hypothesis and conclude that the series is stationary.
In the case above, the p-values for all three variables are greater than the chosen significance level of 0.05, so at this stage we conclude that the variables are not stationary. By applying the same steps used with the Dickey-Fuller test (log transformation and first differencing), we can make the variables stationary.
After those transformations the PP test, like the ADF test, returns p-values of 0.000, so we reject the null hypothesis of a unit root and conclude that both the investment and income series are stationary.
It is important to consider an appropriate lag order when applying these tests. By default dfuller includes no augmentation lags (lags(0)); you can specify a different lag order with the lags() option, guided by information criteria such as the Akaike Information Criterion (AIC).
Additionally, other unit root tests are available, such as the Kwiatkowski-Phillips-Schmidt-Shin (KPSS) test and the Elliott-Rothenberg-Stock (ERS/DF-GLS) test. The KPSS test is provided by the user-written kpss command (install it with ssc install kpss), and the DF-GLS test is provided by the official dfgls command. Refer to Stata's documentation for more information on these tests and their usage.
6.1.3 Testing for heteroscedasticity
White's test for heteroscedasticity is a commonly used test to detect heteroscedasticity in regression models. It is based on the residuals of the regression model and helps assess whether the assumption of constant variance holds. After a regression, Stata's estat imtest, white command performs White's test, while hettest (estat hettest) performs the Breusch-Pagan / Cook-Weisberg test. Here's how you can use them:
1. Run the regression, then type estat hettest (or estat imtest, white).
2. Press Enter to execute the command. Stata will carry out the test and report the test statistic and p-value for the null hypothesis of homoscedasticity.
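A minimal sketch (the regression itself is illustrative; use the model from your own analysis):
regress d_inv d_inc
estat hettest           // Breusch-Pagan / Cook-Weisberg test
estat imtest, white     // White's general test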
The interpretation of the test results depends on the p-value obtained. If the p-value is less than your chosen significance level (e.g., 0.05), you can reject the null hypothesis of homoscedasticity and conclude that heteroscedasticity is present in the regression model. Conversely, if the p-value is greater than your significance level, you fail to reject the null hypothesis, indicating no evidence of heteroscedasticity. In the picture above the p-value is 0.4583, so we fail to reject the null hypothesis and conclude that there is no evidence of a heteroscedasticity problem.
It's important to note that White's test is asymptotic and assumes the absence of serial
correlation in the residuals. If your data violates the assumption of no serial correlation, you
may need to consider robust standard errors or other methods to address heteroscedasticity.
6.1.4 Granger causality test
A Granger causality test helps assess whether past values of one time series help predict another. In Stata this is usually carried out as a postestimation test after fitting a vector autoregression (VAR). Here's how you can conduct a Granger causality test:
1. Load your time series data into Stata using the use command and declare the time variable with tsset.
2. Estimate a VAR containing the variables of interest with the var command, choosing a lag length (for example, guided by varsoc).
3. Type vargranger and press Enter. Stata will perform pairwise Granger causality (Wald) tests and report the test statistics, p-values, and degrees of freedom.
For example, to test Granger causality between variables "y" and "x", you can use the commands sketched below.
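A minimal sketch, assuming tsset data and an illustrative lag length:
var y x, lags(1/4)     // fit a VAR in y and x with four lags
vargranger             // Wald tests of Granger causality for each equation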
The interpretation of the Granger causality test results depends on the obtained p-value. If the
p-value is less than your chosen significance level (e.g., 0.05), you can reject the null
hypothesis and conclude that there is evidence of Granger causality between the variables.
Conversely, if the p-value is greater than the significance level, you fail to reject the null
hypothesis, indicating no evidence of Granger causality.
It's important to note that the Granger causality test relies on the assumption that the time
series variables are stationary and have no other omitted variables affecting both the
dependent and independent variables. Violation of these assumptions may lead to unreliable
or misleading test results.
The lag order is chosen when the VAR is estimated (the lags() option of var, often guided by varsoc and the information criteria it reports). Consult Stata's documentation for more details on vargranger and related postestimation tests.
6.1.5 Autoregressive integrated moving average (ARIMA) models
To estimate autoregressive integrated moving average (ARIMA) models using the arima
command in Stata, follow these steps:
1. Load your time series data into Stata using the use command.
2. Open the Command window and type the following command to estimate an ARIMA
model:
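* general form (a sketch; substitute your variable name and the AR, differencing, and MA orders)
arima dependent_var, arima(p, d, q)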
Replace "dependent_var" with the name of the dependent variable you want to model.
Replace "p" with the desired order of the autoregressive (AR) component, "d" with the
number of differences needed to achieve stationarity (order of differencing), and "q" with the
desired order of the moving average (MA) component.
For example, to estimate an ARIMA(1,1,1) model for the variable "inv" from the example data linked earlier, you can use the following command:
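* a sketch of the ARIMA(1,1,1) specification described above
arima inv, arima(1,1,1)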
Press Enter to execute the command. Stata will estimate the specified ARIMA model using
maximum likelihood estimation (MLE) and provide the results, including coefficient
estimates, standard errors, t-values, and p-values.
It's important to note that before estimating an ARIMA model, you should ensure that your data are stationary or can be made stationary through differencing. If your data are non-stationary, you can apply differencing using the generate command (or let the d term of the arima() option handle it), as discussed earlier.
Stata's arima command provides additional options to customize the model estimation, such
as including exogenous variables, specifying seasonal components, and selecting alternative
estimation methods. Consult Stata's documentation for more details on these options and how
to use them.
After estimating the ARIMA model, you can analyze the model diagnostics, evaluate the
significance of the coefficients, and make predictions or forecast future values using the
estimated model.
6.2 Panel Data Analysis:
In this part we are going to see how you can use Stata commands to specify panel and time
variables, estimate fixed effects or random effects panel data models, and conduct Hausman
tests for model specification:
6.2.1 Specifying the panel and time variables using the xtset command:
For the example below you can download the data from the following link:
http://www.stata-press.com/data/r9/union.dta
Load your panel data into Stata using the use command.
Open the Command window and type the following command to specify the panel
and time variables:
xtset panel_var time_var
Replace "panel_var" with the name of the variable that identifies each panel unit (ID code),
and "time_var" with the name of the variable representing the time period(year).
Press Enter to execute the command. Stata will set the panel and time variables for your
dataset, allowing for panel data analysis.
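A minimal sketch using the union data linked above, where idcode identifies the panel unit and year is the time variable:
use http://www.stata-press.com/data/r9/union.dta, clear
xtset idcode year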
6.2.2 Estimating fixed effects or random effects
Estimate fixed effects or random effects panel data models using the xtreg command:
After setting the panel and time variables, you can estimate panel data models using
the xtreg command.
Specify the dependent variable and independent variables in the regression model, and
include the fe option for fixed effects or the re option for random effects.
Example: estimate the model with the fe option and store the results (estimates store fe), then re-estimate with the re option and store those results (estimates store re), as sketched below.
Press Enter to execute the command. Stata will estimate the panel data model using fixed
effects or random effects estimation, depending on the option specified.
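A minimal sketch using the union data (the regressors are illustrative; substitute the variables from your own model):
xtreg union age grade south, fe     // fixed effects estimation
estimates store fe
xtreg union age grade south, re     // random effects estimation
estimates store re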
6.2.3 Hausman tests
Conduct Hausman tests for model specification using the hausman command, comparing the stored fixed effects and random effects results:
hausman fe re
When conducting a Hausman test in Stata, the test result can guide you in deciding between
fixed effects (FE) and random effects (RE) model specifications for panel data analysis.
Here's how to interpret the Hausman test result:
1. Null Hypothesis (H0): The random effects (RE) model is appropriate (no significant
difference between FE and RE coefficients).
2. Alternative Hypothesis (HA): The fixed effects (FE) model is preferred (significant
difference between FE and RE coefficients).
After running the Hausman test in Stata, you will obtain a test statistic and its associated p-
value. Here's how to interpret the results:
If the p-value is less than your chosen significance level (e.g., 0.05), you reject the
null hypothesis (H0) in favor of the alternative hypothesis (HA). This suggests that
the fixed effects (FE) model is preferred over the random effects (RE) model. In other
words, there is evidence of a significant difference between the coefficients estimated
using the FE and RE models, indicating potential endogeneity or omitted variable
bias.
If the p-value is greater than your chosen significance level, you fail to reject the null hypothesis (H0). This implies that there is no significant difference between the coefficients estimated using the FE and RE models. In this case, you can consider using the random effects (RE) model, since it is more efficient when its key assumption holds, namely that the unit-specific effects are uncorrelated with the regressors.
It's important to note that the decision between FE and RE models should not solely rely on
the Hausman test result. Consider other factors such as theoretical considerations, model fit
statistics (e.g., R-squared, likelihood ratio), and the nature of the data and research question.
The Hausman test is just one tool to aid in model selection and specification decision-
making.
Remember to interpret the Hausman test result in the context of your specific analysis and
consult relevant literature or seek expert advice when making modeling decisions.
Review questions on Stata application to time series econometrics and panel data
analysis:
1. How can you import and manage time series data in Stata, and what are the key
commands for handling time series objects?
2. Explain the concept of stationarity in time series analysis, and discuss the importance of
testing and ensuring stationarity using Stata.
3. Discuss the process of estimating and interpreting autoregressive integrated moving
average (ARIMA) models using Stata, and explain how you can select appropriate model
orders using diagnostic tests.
4. Explain the concept of unit roots and cointegration in time series econometrics, and
discuss how you can perform unit root tests and estimate cointegrated models using Stata.
5. How can you conduct Granger causality tests using Stata to examine the causal
relationship between two or more time series variables, and why is this analysis important
in time series econometrics?
Chapter 7: Application to Nonlinear Models
7.1 Overview of non-linear models and their applications in economics
Non-linear regression models are used to capture complex relationships between variables
that cannot be adequately described by linear models. These models can be used to estimate
the parameters of functions that are non-linear in the variables, such as exponential,
logarithmic, polynomial, or power functions. Non-linear regression models find applications
in various areas of economics, such as demand estimation, production function analysis, and
growth models.
Non-linear models provide economists with more accurate and flexible tools to analyze
complex economic phenomena. They allow for better capturing of non-linear relationships,
heterogeneity, dynamics, and interactions between economic variables, leading to improved
policy analysis, forecasting, and understanding of economic behavior.
Linear and non-linear models are two distinct types of mathematical models used to describe
relationships between variables. Here are some key differences between linear and non-linear
models:
Linearity:
The fundamental difference lies in the linearity assumption. Linear models assume a linear
relationship between the predictor variables and the response variable. This means that the
relationship can be expressed as a straight line. Non-linear models, on the other hand, allow
for more complex relationships, where the relationship between the variables cannot be
adequately described by a straight line.
Functional Form:
Linear models follow a simple functional form, typically described by a linear equation,
such as Y = β0 + β1X1 + β2X2 + ... + βnXn. Non-linear models, on the other hand, involve
more complex functional forms that can include exponential, logarithmic, polynomial, or
other non-linear functions.
Parameter Estimation:
In linear models, the parameters (β) can be estimated using ordinary least squares (OLS)
regression, which involves minimizing the sum of squared residuals. Non-linear models often
require more advanced estimation techniques, such as maximum likelihood estimation
(MLE), nonlinear least squares, or numerical optimization methods.
Interpretability:
Linear models have the advantage of interpretability. The estimated coefficients in a linear
model represent the change in the response variable associated with a one-unit change in the
corresponding predictor, holding other variables constant. Non-linear models, especially
those with complex functional forms, may have less straightforward interpretations, making it
more challenging to interpret the effects of predictor variables.
Flexibility:
Non-linear models are more flexible: they can capture curvature, thresholds, saturation, and interaction patterns that a straight line cannot. This flexibility comes at the cost of more difficult estimation and a greater risk of overfitting.
Assumptions:
Linear models rely on the classical assumptions of linearity in parameters and, for inference, well-behaved errors; non-linear models replace these with their own functional-form and distributional assumptions, which should be checked in the same spirit.
Model Complexity:
Non-linear models, by their nature, tend to be more complex than linear models. They
involve more intricate mathematical formulations and estimation procedures. As a result,
non-linear models may require larger sample sizes and more computational resources for
estimation and interpretation.
Applications:
Linear models are widely used when the relationship between variables can be reasonably
assumed to be linear, such as in simple regression analysis or linear regression with
interactions. Non-linear models are employed in various fields, such as economics, finance,
biology, and engineering, to capture more complex relationships, dynamics, and patterns.
Understanding the differences between linear and non-linear models helps researchers choose
the appropriate model type based on the nature of the data, the research question, and the
underlying relationship they aim to capture.
7.2.1 Probit model: specification, estimation, and interpretation
The Probit model is a type of generalized linear model used to analyze binary dependent
variables. It is particularly suited for situations where the response variable takes on one of
two mutually exclusive outcomes, such as "success" or "failure," "yes" or "no," or "1" or "0".
The Probit model specifies and estimates the probability of the binary outcome as a function
of explanatory variables. Here's an overview of the Probit model's specification, estimation,
and interpretation:
1. Model Specification: The Probit model assumes that the binary dependent variable
follows a standard normal distribution. The model relates the probability of the binary
outcome to a linear combination of explanatory variables using the cumulative
distribution function of the standard normal distribution. The Probit model can be
specified as follows:
P(Y = 1 | X) = Φ(Xβ)
Where:
P(Y = 1 | X) is the probability of the binary outcome (Y = 1) given the values of the
explanatory variables (X).
Φ(.) represents the cumulative distribution function of the standard normal
distribution.
X is a matrix of explanatory variables.
β is a vector of coefficients to be estimated.
2. Estimation: The parameters of the Probit model are typically estimated using maximum
likelihood estimation (MLE). The MLE procedure involves finding the values of β that
maximize the likelihood function, which is a measure of how likely the observed
outcomes are given the model and its parameters. Stata provides the "probit" command
to estimate Probit models.
3. Interpretation of Coefficients: The estimated coefficients in the Probit model represent
the effect of each explanatory variable on the probability of the binary outcome. The
interpretation of the coefficients depends on the chosen scale of measurement for the
explanatory variables. In general, a positive coefficient indicates that an increase in the
corresponding explanatory variable leads to a higher probability of the binary outcome
(Y = 1), while a negative coefficient suggests a lower probability.
4. Marginal Effects: In addition to coefficient interpretation, it is common to calculate
marginal effects to measure the impact of explanatory variables on the probability of the
binary outcome. Marginal effects represent the change in the predicted probability of the
binary outcome associated with a one-unit change in the explanatory variable while
holding other variables constant. Stata provides the "margins" command to estimate
marginal effects in Probit models.
5. Goodness-of-Fit: The goodness-of-fit of the Probit model can be assessed using various
criteria, such as the likelihood ratio test, Akaike Information Criterion (AIC), Bayesian
Information Criterion (BIC), or pseudo R-squared measures. These measures provide
information on how well the model fits the observed data and can help in comparing
different model specifications.
6. Assumptions: The Probit model assumes that the error term follows a standard normal
distribution. It also assumes that the observations are independent of each other.
Violations of these assumptions, such as heteroscedasticity or correlation among the
observations, may lead to biased estimates or inefficient inference.
The Probit model is widely used in economics, social sciences, and other fields to analyze
binary outcomes. By estimating the probabilities of binary outcomes based on explanatory
variables, it provides insights into the factors influencing the likelihood of a particular
outcome occurring.
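A minimal sketch of estimating a Probit model and its average marginal effects (the auto dataset and variables are purely illustrative):
sysuse auto, clear
probit foreign price mpg     // binary outcome: foreign (1 = foreign car)
margins, dydx(*)             // average marginal effects on Pr(foreign = 1)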
7.2.2 Logit model: specification, estimation, and interpretation
The Logit model is a type of generalized linear model used to analyze binary dependent
variables. It is particularly suited for situations where the response variable takes on one of
two mutually exclusive outcomes, such as "success" or "failure," "yes" or "no," or "1" or "0".
The Logit model specifies and estimates the probability of the binary outcome as a function
of explanatory variables. Here's an overview of the Logit model's specification, estimation,
and interpretation:
Model Specification:
The Logit model assumes that the binary dependent variable follows a logistic distribution.
The model relates the probability of the binary outcome to a linear combination of
explanatory variables using the logistic transformation. The Logit model can be specified as
follows:
P(Y = 1 | X) = 1 / (1 + exp(-Xβ))
Where:
P(Y = 1 | X) is the probability of the binary outcome (Y = 1) given the values of the
explanatory variables (X).
Estimation:
The parameters of the Logit model are typically estimated using maximum likelihood
estimation (MLE). The MLE procedure involves finding the values of β that maximize the
likelihood function, which is a measure of how likely the observed outcomes are given the
model and its parameters. Stata provides the "logit" command to estimate Logit models.
Interpretation of Coefficients:
The estimated coefficients in the Logit model represent the effect of each explanatory variable on the log-odds of the binary outcome. The odds ratio can be calculated as exp(β),
indicating the multiplicative change in the odds of the outcome associated with a one-unit
change in the corresponding explanatory variable. A coefficient greater than 0 indicates that
an increase in the explanatory variable leads to higher odds of the binary outcome (Y = 1),
while a coefficient less than 0 suggests lower odds.
Marginal Effects:
As with the Probit model, marginal effects can be computed (for example with Stata's margins command) to express the change in the predicted probability of the binary outcome associated with a one-unit change in an explanatory variable, holding the other variables constant.
Goodness-of-Fit:
The goodness-of-fit of the Logit model can be assessed using various criteria, such as the
likelihood ratio test, Akaike Information Criterion (AIC), Bayesian Information Criterion
(BIC), or pseudo R-squared measures (e.g., McFadden's R-squared or Cox and Snell R-
squared). These measures provide information on how well the model fits the observed data
and can help in comparing different model specifications.
Assumptions:
The Logit model assumes that the error term follows a logistic distribution. It also assumes
that the observations are independent of each other. Violations of these assumptions, such as
heteroscedasticity or correlation among the observations, may lead to biased estimates or
inefficient inference.
The Logit model is widely used in economics, social sciences, and other fields to analyze
binary outcomes. By estimating the probabilities and odds ratios of binary outcomes based on
explanatory variables, it provides insights into the factors influencing the likelihood of a
particular outcome occurring.
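A minimal sketch of the corresponding Logit estimation (again with illustrative data and variables); the logistic command fits the same model but reports odds ratios directly:
sysuse auto, clear
logit foreign price mpg          // coefficients on the log-odds scale
logistic foreign price mpg       // the same model, reported as odds ratios
margins, dydx(*)                 // average marginal effects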
Probit and Logit models are both types of generalized linear models used for analyzing binary
dependent variables. While they share similarities, there are also notable differences between
the two. Here's a comparison between probit and logit models:
Functional Form:
The primary difference between Probit and Logit models lies in their functional forms. The
Probit model assumes a standard normal distribution for the error term, while the Logit model
assumes a logistic distribution. Consequently, the cumulative distribution functions used in
the models differ. Probit uses the cumulative distribution function of the standard normal
distribution, while Logit uses the logistic transformation.
Interpretation:
The interpretation of coefficients differs between Probit and Logit models. In the Probit model, a coefficient gives the change in the z-score of the latent index (measured in standard deviations of the standard normal error) associated with a one-unit change in the explanatory variable. In contrast, a Logit coefficient gives the change in the log-odds of the binary outcome, so exp(β) gives the multiplicative change in the odds, associated with a one-unit change in the explanatory variable.
Symmetry:
Both the Probit and Logit link functions are symmetric, S-shaped (sigmoidal) curves centered at a probability of 0.5. The practical difference between them lies in the tails: the logistic distribution places somewhat more probability mass in the tails than the standard normal, so the two models can give slightly different predicted probabilities for observations far from the center.
Mathematical Properties:
Mathematically, the Logit model's logistic function has a simple closed form, which makes probabilities and odds easy to manipulate algebraically. The Probit model is based on the standard normal cumulative distribution function, which has no closed form but fits naturally with other normal-theory models (for example, multivariate and sample-selection extensions).
Comparative Fit:
In terms of model fit, Probit and Logit models often yield similar results. However, the
specific distributional assumptions may cause slight differences in the estimated coefficients
and standard errors between the two models. Generally, if the underlying distributional
assumption is met, the choice between Probit and Logit is typically less consequential.
Ease of Interpretation:
Logit models are often considered more straightforward to interpret because the odds ratio
can be directly interpreted as the change in odds associated with a one-unit change in the
explanatory variable. Probit coefficients, on the other hand, are more challenging to interpret
directly in terms of odds.
When choosing between Probit and Logit models, researchers should consider the specific
distributional assumptions, interpretability requirements, and the underlying context of the
analysis. In practice, the choice between the two models often comes down to personal
preference or convention within the field of study.
Logit and Probit Output
[Figure: Problem solved with probit and logit — comparison of predicted probabilities of a high wage by years of schooling (x-axis: years of schooling, 0-15; y-axis: Pr(high wage), 0-1).]
Marginal Effects
It is important to recognise that these coefficients are not the same as the output generated by an OLS regression, because they refer to the unobserved latent variable. They are not marginal effects; that is, they do not tell us the average effect of a change in an X variable on Y (dy/dx), as OLS coefficients do for a linear estimator. All we can interpret from probit or logit coefficients is the direction and significance of the average effect, not its magnitude. Marginal effects are rarely reported in these models in other disciplines; economists, however, are usually interested in the magnitude of an effect, not just its statistical significance.
We can calculate the marginal effects using the mfx compute command. Whereas the coefficients from logit and probit differ because of scaling, the marginal effects should be almost identical. outreg2 also works for exporting marginal effects, and we will use it to compare different ways of calculating them:
logit highwage exper male school
mfx compute
outreg2 using test, mfx excel append ctitle(mfx logit)
probit highwage exper male school
mfx compute
outreg2 using test, mfx excel replace ctitle(mfx probit)
We see that this is indeed the case. Not only that, but these are also almost identical to the OLS results. For example, in the probit model an additional year of schooling increases the probability of earning a high wage by about 8 percentage points, and likewise for the logit model. Because the marginal effects depend on the values of the x variables, there are a number of different ways of calculating them.
The header of the mfx output reports the predicted probability evaluated at the means of the regressors, for example:
y = Pr(highwage) (predict) = .26504098
y = Pr(highwage) (predict) = .26010828
By default, Stata calculates these marginal effects at the mean of the independent variables, but it is also possible to evaluate them at other values. For example, suppose you suspect that the effect of experience and schooling on wages differs for men and women. Then you could evaluate the marginal effects for women:
mfx compute, at(male=0)
outreg2 using test, mfx excel append ctitle(mfx probit women)
and for men:
mfx compute, at(male=1)
outreg2 using test, mfx excel append ctitle(mfx probit men)
To reiterate, because OLS is a linear estimator the estimated marginal effect is the same at every set of X values. You can think about this in terms of the slope of the regression line in the bivariate case: OLS is a straight line, so the slope is the same at every point, unlike with logit and probit.
The marginal effects of experience and schooling appear larger for men than for women. There are other ways of estimating marginal effects. One alternative is to calculate the marginal effect at every observation and then take the average, giving the average partial effect. For this we use the user-written command margeff:
ssc install margeff
margeff, replace
outreg2 using test, excel append ctitle(amfx probit)
(We run margeff with the replace option because we wish to export the results, and use outreg2 without the mfx option this time.) We can see that in this case both approaches give similar results. mfx2 is another user-written command that produces marginal effects. If you are using a probit model, you can obtain the marginal effects directly (without the coefficients) with the command dprobit highwage exper male school.
1. How can you estimate a Logit model in Stata for analyzing a binary outcome, and
what are the key Stata commands or procedures involved in estimating and
interpreting the model?
2. Explain the concept of odds ratios in Logit models, and discuss how you can interpret
the coefficients and odds ratios in Stata for understanding the impact of explanatory
variables on the probability of the binary outcome.
3. Discuss the process of assessing the goodness-of-fit of a Logit model in Stata, and
explain the interpretation of measures such as the likelihood ratio test, AIC, BIC, and
pseudo R-squared measures in evaluating model performance.
4. How can you estimate a Probit model in Stata for analyzing a binary outcome, and
what are the main Stata commands or procedures used for estimation and
interpretation?
5. Explain the difference between Logit and Probit models in terms of the assumed
distribution of the error term, and discuss how this affects the estimation and
interpretation of coefficients in Stata.
6. Describe how you can calculate and interpret marginal effects in Logit and Probit
models using Stata, and discuss their significance in understanding the impact of
explanatory variables on the probability of the binary outcome.
Assignment
Objective: The objective of this assignment is to analyze the impact of fiscal policy on
economic growth using software application in economics. You will use Stata to import and
manage data, perform regression analysis, and interpret the results.
Dataset: You are expected to find a macroeconomic dataset that contains information on fiscal policy variables and economic growth for various countries over a certain time period. The dataset must include at least the following variables: GDP_growth, Government_spending, Tax_revenue, and Debt_to_GDP, together with country and year identifiers.
Tasks:
1. Import the Dataset: a) Open Stata and import the dataset "fiscal_growth_data.dta". b)
Inspect the dataset using the "describe" command to check the variables, data types,
and any missing values.
2. Data Management: a) Generate a summary statistics table for the variables of interest
(GDP_growth, Government_spending, Tax_revenue, Debt_to_GDP) using the
"summarize" command. b) Identify and handle any missing values in the dataset
appropriately.
3. Descriptive Analysis: a) Create line graphs to visualize the trends in GDP_growth,
Government_spending, Tax_revenue, and Debt_to_GDP over time using the "line" or
"twoway line" command. b) Calculate and interpret the correlation coefficients
between GDP_growth and the fiscal policy variables using the "correlate" command.
4. Regression Analysis: a) Estimate a multiple linear regression model to analyze the
impact of government spending, tax revenue, and public debt on economic growth.
Interpret the coefficients and evaluate their statistical significance using the "regress"
command. b) Assess the overall goodness-of-fit of the regression model using
relevant measures such as R-squared, adjusted R-squared, and F-statistic.
5. Policy Implications: a) Interpret the coefficients of the regression model to analyze
the impact of government spending, tax revenue, and public debt on economic
growth. b) Discuss the potential policy implications of the findings, considering the
trade-offs between fiscal policy variables and economic growth.
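A skeletal do-file sketch of the workflow (the file and variable names follow the assignment; the exact model specification is up to you):
use "fiscal_growth_data.dta", clear
describe
summarize GDP_growth Government_spending Tax_revenue Debt_to_GDP
correlate GDP_growth Government_spending Tax_revenue Debt_to_GDP
regress GDP_growth Government_spending Tax_revenue Debt_to_GDP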
Submission: Submit your completed Stata do-file (.do) and a written report summarizing your
findings and interpretations.
Note: Ensure to include appropriate Stata commands, outputs, and interpretations in your do-
file and report.