TOPIC 2: DATA ANALYSIS USING STATA
Before starting to work with Stata, make sure you have the data that you want to work with,
preferably in an Excel spreadsheet.
For example, the Stata folder has an Excel file named: Data on GPA, TUCE, PSI and
GRADE. This data set consists of 4 variables and 32 observations.
To start Stata, open the Stata folder provided, then double-click the application.
This will open the Stata interface, and you will notice that Stata has four windows, arranged as
follows:
    Review window        Results window
    Variables window     Command window
To start the process of data analysis, first open a log: click File > Log > Begin. Stata will then
ask you to provide a name for the log file, say ANALYSIS 1.
Now minimize the Stata application, then open the Excel file containing the data on gpa, tuce,
psi and grade. Copy the data from Excel (you can close the Excel file after copying), then
maximize the Stata application. In the Command window, type: edit then press Enter. This will
open the Data Editor, which is Stata's spreadsheet. Paste your data here (into the cell highlighted
in blue, the top-left cell of the spreadsheet). You may now close the Data Editor. Notice that in
the Results window, the result is "6 variables and 32 observations have been pasted into the data
editor". Also, when you check the Review window, you will see a history of all the commands
that you have run, which is useful for replication purposes. Finally, the Variables
window displays the variables that you are working with. Having pasted the data into the Data
Editor, you are now ready to begin the process of data analysis.
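The same steps can also be typed as commands instead of using the menus and copy and paste. The lines below are only a sketch under stated assumptions: the log name follows the example above, but the Excel file name gpa_data.xlsx is hypothetical, and the firstrow option assumes that the first row of the sheet holds the variable names. The import excel command requires Stata 12 or later.

```stata
* Open a log file (the command equivalent of File > Log > Begin).
log using "ANALYSIS 1", replace

* Read the spreadsheet directly instead of copying and pasting.
* "gpa_data.xlsx" is a hypothetical file name; replace it with your own.
import excel using "gpa_data.xlsx", firstrow clear

* Check what was loaded, then close the log.
describe
log close
```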
However, the makers of Stata have also installed some example data sets into Stata, to aid in
teaching and training. Therefore, instead of using our data on gpa, tuce, psi and grade, we will
use one of the data sets that the makers of Stata have already installed. To remove the data we
have just entered from memory, type clear in the Command window, then press Enter. Whenever
you type a command in the Command window, you have to press Enter in order to
execute that command.
One of the best-known example data sets installed with Stata is the 1978 Automobile
Data, which records various automobiles as of 1978 and their characteristics. To load the
1978 automobile data, type sysuse auto in the Command window, then press Enter. (The command
use auto works only if the file auto.dta is in the current working directory, whereas sysuse
finds the copy that ships with Stata.) Now, check your
Variables window. You will see that the variables are: make, price, mpg, rep78, headroom, trunk,
weight, length, turn, displacement, gear_ratio and foreign. Thus, we have 12 variables.
To view the data, type browse in the Command window then press Enter. You will be able to see
12 variables and 74 observations.
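As a quick check that the data loaded correctly, the commands below count the variables and observations in memory. This is a minimal sketch, assuming the auto data set that ships with Stata; c(k) and _N are Stata's built-in values for the number of variables and the number of observations.

```stata
* Load the 1978 Automobile Data shipped with Stata.
sysuse auto, clear

* Number of variables and number of observations in memory.
display c(k)
display _N

* Show the first five cars in the Results window (an alternative to browse).
list make price mpg in 1/5
```

The two display commands should agree with the 12 variables and 74 observations seen in the data browser.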
To describe the data, type describe in the Command window then press Enter. You will be able to
see a description of all your variables in the Results window (the dark screen). make is the make
and model of the car, price is the price of the car, mpg is mileage in miles per gallon, rep78 is
the repair record as of 1978, headroom is headroom in inches, trunk is trunk space in cubic feet,
weight is weight in pounds, length is length in inches, turn is the turn circle in feet,
displacement is displacement in cubic inches, gear_ratio is the gear ratio and, finally, foreign
is a dummy or indicator variable for car type, defined as 1 if the car is foreign and 0 if the
car is domestic.
From the output, we notice that the variables differ in storage type: some are string variables
(str), others are integer variables (int or byte), while others are float variables. A string
variable holds text rather than numbers. Thus make is str18, which means that it can hold names
of up to 18 characters. price, mpg, rep78, trunk, weight, length, turn and displacement
are int, thus they are integers. headroom and gear_ratio are float variables, which means that
their values can have decimal places. foreign is a byte, which is a storage type for small
integers; it suits a dummy or indicator variable because foreign only takes the values 0 and 1.
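The storage types can also be listed group by group. The sketch below assumes the auto data set and uses the ds command with its has(type ...) option, which lists the variables of a given storage type.

```stata
sysuse auto, clear

* The storage type is shown in the second column of the describe output.
describe make headroom foreign

* List the variables of each storage type in turn.
ds, has(type string)
ds, has(type int)
ds, has(type float)
ds, has(type byte)
```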
To get summary statistics for the data, type summarize in the Command window, then press Enter.
The summary statistics show the number of observations, the mean, the standard deviation, and the
minimum and maximum values. You can copy these statistics from Stata and paste them
into your Word document for interpretation of the results (you could review what you learnt in
statistics or econometrics). The results are as follows:
Table 1: Summary Statistics for 1978 Automobile Data
    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
       price |         74    6165.257     2949.496      3291      15906
         mpg |         74     21.2973     5.785503        12         41
       rep78 |         69    3.405797     .9899323         1          5
    headroom |         74    2.993243     .8459948       1.5          5
-------------+---------------------------------------------------------
       trunk |         74    13.75676     4.277404         5         23
      weight |         74    3019.459     777.1936      1760       4840
      length |         74    187.9324     22.26634       142        233
        turn |         74    39.64865     4.399354        31         51
displacement |         74    197.2973     91.83722        79        425
-------------+---------------------------------------------------------
  gear_ratio |         74    3.014865     .4562871      2.19       3.89
     foreign |         74    .2972973     .4601885         0          1
Source: Author
If you want more detailed summary statistics (percentiles, variance, skewness and kurtosis), type:
summarize, detail in the Command window, then press Enter. If you want detailed summary
statistics for only one variable, say price, then type: summarize price, detail.
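After summarize, Stata keeps the results in memory so that they can be reused in later calculations. The sketch below assumes the auto data set; r(mean), r(p50) and r(sd) are the names under which summarize stores the mean, the median and the standard deviation.

```stata
sysuse auto, clear

* Detailed summary statistics for one variable.
summarize price, detail

* List everything that summarize stored.
return list

* Reuse individual stored results.
display r(mean)
display r(p50)
display r(sd)
```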
Stata also allows the user to generate new variables from the data set provided. Thus, we can
create a product, square, square root, logarithm, reciprocal, and so on:
- To create the product of mpg and weight, the command is: generate
productmpgweight = mpg * weight then press Enter.
- To create the square of a variable (say mpg), the command is: generate squarempg =
mpg * mpg then press Enter.
- To create the square root of a variable (say price), the command is: generate sqrootprice
= price^0.5 then press Enter.
- To create the natural logarithm of a variable (say headroom), the command is: generate
logheadroom = ln(headroom) then press Enter.
- To create the reciprocal of a variable (say mpg), the command is: generate
reciprocalmpg = 1/mpg then press Enter.
- In order to see your new variables, type browse in the Command window, then press Enter.
Notice that the spreadsheet now contains the new variables and that they also appear in the
Variables window.
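The generate commands above can be run together as follows. This is a sketch assuming the auto data set; mpg^2 and sqrt(price) are used here as equivalent ways of writing mpg * mpg and price^0.5.

```stata
sysuse auto, clear

* Product, square, square root, natural logarithm and reciprocal.
generate productmpgweight = mpg * weight
generate squarempg        = mpg^2
generate sqrootprice      = sqrt(price)
generate logheadroom      = ln(headroom)
generate reciprocalmpg    = 1/mpg

* Inspect the new variables for the first five cars, then summarize them.
list make mpg squarempg reciprocalmpg in 1/5
summarize productmpgweight squarempg sqrootprice logheadroom reciprocalmpg
```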
Graphics can also be done using Stata. These include scatter plots, line graphs, bar graphs, pie
charts, and so on.
- To create a scatter plot of price against mpg, the command is: scatter price mpg then
press Enter.
Figure 1: Scatter plot between price and mpg
Source: Author
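- To create a line graph of price against mpg, the command is: line price mpg, sort then press
Enter. The sort option orders the points by mpg so that the line runs from left to right
instead of zigzagging.
- To create a bar graph, the command is: graph bar price mpg then press Enter. By default the
bars show the mean of price and the mean of mpg.
- To create a pie chart, the command is: graph pie price mpg then press Enter. The slices show
the totals of price and of mpg.
- Repeat the above procedure, now using several variables rather than only two.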
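Bar graphs and pie charts are often more informative when they compare groups of cars. The sketch below assumes the auto data set; the graph names given in name() are arbitrary labels chosen for this illustration so that each graph is kept in memory rather than overwritten by the next one.

```stata
sysuse auto, clear

* Scatter plot and sorted line graph of price against mpg.
scatter price mpg, name(g_scatter, replace)
line price mpg, sort name(g_line, replace)

* Mean price for domestic and foreign cars.
graph bar (mean) price, over(foreign) name(g_bar, replace)

* Share of cars in each repair-record category.
graph pie, over(rep78) name(g_pie, replace)
```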
With Stata, you can also perform correlation and regression analysis. For example, to correlate
price and mpg, type correlate price mpg in the Command window then press Enter. We notice that
the correlation coefficient between price and mpg is -0.4686. There is a moderate negative
correlation between price and mpg. Also, try: correlate price mpg rep78 weight length foreign
then press Enter. This command uses only 69 observations, because rep78 is missing for 5 cars
and correlate drops every car with a missing value on any listed variable.
What can you say about the correlation coefficients given?
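The effect of the missing values of rep78 can be checked directly. The sketch below assumes the auto data set; pwcorr is Stata's pairwise correlation command, shown here as an alternative that uses every available pair of values, and r(rho) is the correlation stored by correlate when two variables are listed.

```stata
sysuse auto, clear

* How many cars have no repair record?
count if missing(rep78)

* Correlations on the cars with complete data on all six variables.
correlate price mpg rep78 weight length foreign

* Pairwise correlations, with the number of observations behind each one.
pwcorr price mpg rep78 weight length foreign, obs

* Correlation between price and mpg on all 74 cars, then the stored value.
correlate price mpg
display r(rho)
```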
Stata also performs regression analysis, which estimates the effect of independent variables on
the dependent variable. The command is regress, followed by the dependent
variable, then by the list of independent variables. For example, type the command
regress price mpg rep78 weight length foreign then press Enter. The output of the two correlate
commands and of the regression is shown below.
. correlate price mpg
(obs=74)

             |    price      mpg
-------------+------------------
       price |   1.0000
         mpg |  -0.4686   1.0000

. correlate price mpg rep78 weight length foreign
(obs=69)

             |    price      mpg    rep78   weight   length  foreign
-------------+------------------------------------------------------
       price |   1.0000
         mpg |  -0.4559   1.0000
       rep78 |   0.0066   0.4023   1.0000
      weight |   0.5478  -0.8055  -0.4003   1.0000
      length |   0.4425  -0.8037  -0.3606   0.9478   1.0000
     foreign |  -0.0174   0.4538   0.5922  -0.6460  -0.6110   1.0000

. regress price mpg rep78 weight length foreign

      Source |       SS       df       MS              Number of obs =      69
-------------+------------------------------           F(  5,    63) =   15.90
       Model |   321789308     5  64357861.7           Prob > F      =  0.0000
    Residual |   255007650    63  4047740.48           R-squared     =  0.5579
-------------+------------------------------           Adj R-squared =  0.5228
       Total |   576796959    68  8482308.22           Root MSE      =  2011.9

------------------------------------------------------------------------------
       price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         mpg |  -26.01325   75.48927    -0.34   0.732    -176.8665      124.84
       rep78 |   244.4242    318.787     0.77   0.446    -392.6208    881.4691
      weight |   6.006738    1.03725     5.79   0.000      3.93396    8.079516
      length |  -102.2199   34.74826    -2.94   0.005    -171.6587   -32.78102
     foreign |   3303.213   813.5921     4.06   0.000     1677.379    4929.047
       _cons |   5896.438   5390.534     1.09   0.278    -4875.684    16668.56

Covariance matrix of coefficients of regress model

        e(V) |         mpg       rep78      weight      length     foreign       _cons
-------------+------------------------------------------------------------------------
         mpg |   5698.6301
       rep78 |  -6545.3892   101625.14
      weight |   19.667013   .94772928   1.0758867
      length |   630.02684  -1456.3491  -28.839384   1207.4416
     foreign |    16171.29  -133572.57   209.84955   2564.5577   661932.17
       _cons |  -282211.05   105230.53   1682.2409  -149140.82  -1209971.1    29057853

Having obtained the regression results, we may also wish to obtain the variance-covariance
matrix of the estimated coefficients, which is the matrix shown above. To get it, type: vce in
the Command window then press Enter (in recent versions of Stata the documented command is
estat vce). The variance-covariance matrix derives its name from the fact that the elements
along the main diagonal are the VARIANCES of the coefficient estimates, whereas the elements
off the main diagonal are the COVARIANCES between pairs of coefficient estimates.
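The regression and its variance-covariance matrix can be reproduced with the commands below. This is a sketch assuming the auto data set; e(V) is the name under which regress stores the matrix, and V is simply a name chosen here for a copy of it.

```stata
sysuse auto, clear

* Estimate the regression of price on the five explanatory variables.
regress price mpg rep78 weight length foreign

* Display the variance-covariance matrix of the coefficients.
estat vce

* Save a copy of the matrix for further calculations and list it.
matrix V = e(V)
matrix list V

* The first diagonal element is the variance of the mpg coefficient.
display V[1,1]
```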
The first or top part of the regression output is called the ANOVA table. The ANOVA table
shows SOURCE (model, residual and total); SUM OF SQUARES, SS; DEGREES OF
FREEDOM, df; and MEAN SUM OF SQUARES, MS.
The lower table provides the regression coefficients, the standard errors, the t statistics, the
probability values (p-values) and the confidence intervals.
The sum of squares for the model is 321,789,308. This is also known as the explained sum of
squares (ESS). The sum of squares for the residual is 255,007,650, otherwise known as the
residual sum of squares (RSS). The total sum of squares (TSS) is 576,796,959. Notice that
321,789,308 + 255,007,650 = 576,796,958, which equals the reported TSS of 576,796,959 up to
rounding. Hence, ESS + RSS = TSS.
The degrees of freedom for the model are 5. The formula for this is k - 1, where k is the number
of parameters being estimated (the five slope coefficients plus the constant). Hence, k - 1 =
6 - 1 = 5. The degrees of freedom for the residual are 63. The formula for this is n - k, where n
is the number of observations and k is defined as before. Hence, n - k = 69 - 6 = 63. The total
degrees of freedom are 68. The formula for this is n - 1. Hence, n - 1 = 69 - 1 = 68.
Alternatively, 5 + 63 = 68.
Mean square is defined as the ratio of sum of squares to degrees of freedom. That is: MS =
SS/df. The mean square for the model is therefore 321,789,308/5 = 64,357,861.7; the mean
square for the residual is 255,007,650/63 = 4,047,740.48.
The model is estimated on a total of 69 observations. The F statistic, F(5, 63) = 15.90, is the
ratio of the mean square for the model to the mean square for the residual: 64,357,861.7 /
4,047,740.48 = 15.90. It tests the null hypothesis that all five slope coefficients are jointly
zero. The probability value for this test is reported as Prob > F = 0.0000. This means that the
model as a whole is statistically significant at the 1 percent level. The lower the Prob value,
the stronger the evidence against the null hypothesis.
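The entries of the ANOVA table can be recomputed from the results that regress stores. The sketch below assumes the auto data set; e(mss), e(rss), e(df_m), e(df_r), e(N) and e(F) are Stata's names for the model and residual sums of squares, their degrees of freedom, the number of observations and the F statistic, and Ftail() returns the upper-tail probability of the F distribution.

```stata
sysuse auto, clear
quietly regress price mpg rep78 weight length foreign

* Sums of squares: ESS, RSS and their total, TSS.
display %15.0f e(mss)
display %15.0f e(rss)
display %15.0f e(mss) + e(rss)

* Degrees of freedom: k - 1, n - k and n - 1.
display e(df_m)
display e(df_r)
display e(N) - 1

* Mean squares and the F statistic as their ratio.
display %15.2f e(mss)/e(df_m)
display %15.2f e(rss)/e(df_r)
display %9.2f (e(mss)/e(df_m)) / (e(rss)/e(df_r))
display %9.2f e(F)

* Prob > F.
display %9.4f Ftail(e(df_m), e(df_r), e(F))
```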
The goodness of fit (R squared) of the model is reported as 0.5579. R squared is the ratio of
the explained sum of squares (ESS) to the total sum of squares (TSS). Thus, R squared = ESS/TSS =
321,789,308/576,796,959 = 0.5579. Thus mpg, rep78, weight, length and foreign together explain
or account for 55.79 percent of the variation in price.
Adjusted R squared is reported as 0.5228, which means that mpg, rep78, weight, length and
foreign explain or account for 52.28 percent of the variation in price
when degrees of freedom are taken into account.
The formula for adjusted R squared is: Adj R squared = 1 - (1 - R squared)*[(n - 1) / (n - k)].
Thus, Adj R squared = 1 - (1 - 0.5579)*[(69 - 1) / (69 - 6)] = 0.5228.
Root MSE is the root mean square error. It is the square root of the mean square of the
residual. Thus, Root MSE = √4,047,740.48 = 2011.9.
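These three measures of fit can be verified against the stored results. The sketch below assumes the auto data set; e(r2), e(r2_a) and e(rmse) are Stata's names for R squared, adjusted R squared and Root MSE, and e(df_r) equals n - k.

```stata
sysuse auto, clear
quietly regress price mpg rep78 weight length foreign

* R squared = ESS / TSS, compared with the stored value.
display %9.4f e(mss) / (e(mss) + e(rss))
display %9.4f e(r2)

* Adjusted R squared = 1 - (1 - R squared)*(n - 1)/(n - k).
display %9.4f 1 - (1 - e(r2)) * (e(N) - 1) / e(df_r)
display %9.4f e(r2_a)

* Root MSE = square root of the residual mean square.
display %9.1f sqrt(e(rss) / e(df_r))
display %9.1f e(rmse)
```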
A coefficient measures how a unit change in an explanatory variable affects the
dependent variable, holding all other explanatory variables constant. For example, the
coefficient of mpg is -26.01325. This means that if the mpg of a car increases by one unit, the
price of the car is predicted to decrease by 26.01 units, holding the other explanatory variables
constant. The rest of the coefficients are interpreted in a similar way.
The second column provides the standard errors (Std. Err.) for each coefficient in the regression
model. A standard error is the square root of a variance. The variances are obtained from the
main diagonal of the variance-covariance matrix. For example, the variance of the mpg coefficient
is 5698.6301, and its square root is 75.489, the reported standard error. Check that the square
roots of the other values on the main diagonal give the standard errors that have been reported
for the remaining variables.
The third column provides the t statistics. The t value is the ratio of the coefficient to its
standard error. That is, t = coefficient / std. err. For example, the t value for mpg =
-26.01325 / 75.48927 = -0.34, and so on for the remaining t values.
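The standard errors and t values can be checked as follows. This is a sketch assuming the auto data set; _b[] and _se[] are Stata's way of referring to a coefficient and its standard error after estimation, and V is a name chosen here for a copy of the variance-covariance matrix.

```stata
sysuse auto, clear
quietly regress price mpg rep78 weight length foreign
matrix V = e(V)

* Standard error of mpg: square root of the first diagonal element of V.
display sqrt(V[1,1])
display _se[mpg]

* t value of mpg: coefficient divided by standard error.
display %9.2f _b[mpg] / _se[mpg]

* Standard error and t value for every coefficient, including the constant.
foreach v in mpg rep78 weight length foreign _cons {
    display "`v'" _col(12) %12.6g _se[`v'] "  " %8.2f _b[`v'] / _se[`v']
}
```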
The next column is the probability value (P > |t|). The probability values help in determining
the significance of the coefficients. For example, if p < 0.01, the coefficient is
significant at the 1 percent level of significance; if p < 0.05, it is significant at the
5 percent level; if p < 0.10, it is significant at the 10 percent level. If p > 0.10, the
coefficient is not significant at conventional levels. In this regression, weight, length and
foreign are significant at the 1 percent level, whereas mpg, rep78 and the constant are not
significant.
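The probability value and the 95 percent confidence interval for a coefficient come from the t distribution with n - k degrees of freedom. The sketch below assumes the auto data set and works through mpg; ttail() and invttail() are Stata's upper-tail probability and inverse functions for the t distribution, and t_mpg is a scalar name chosen here for illustration.

```stata
sysuse auto, clear
quietly regress price mpg rep78 weight length foreign

* t value for mpg.
scalar t_mpg = _b[mpg] / _se[mpg]

* Two-sided probability value, P > |t|.
display %6.3f 2 * ttail(e(df_r), abs(t_mpg))

* Lower and upper limits of the 95 percent confidence interval.
display _b[mpg] - invttail(e(df_r), 0.025) * _se[mpg]
display _b[mpg] + invttail(e(df_r), 0.025) * _se[mpg]

* The test command reports the same probability value for mpg.
test mpg
```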
Stata Resources
The following resources are useful for performing data analysis using Stata:
(i) Getting Started with Stata for Windows (GSW)
(ii) Stata User's Guide (U)
(iii) Stata Base Reference Manual (R)
(iv) Stata Data Management Reference Manual (D)
(v) Stata Programming Reference Manual (P)
(vi) Stata Time-Series Reference Manual (TS)
(vii) Stata Quick Reference and Index (I)
(viii) Stata website: www.stata.com
(ix) Stata demonstration videos on YouTube.