Stata Notebook

The document provides an overview of using Stata for data analysis, including general syntax for commands, troubleshooting with the 'help' command, and exploring datasets with commands like 'describe' and 'codebook'. It also covers creating and labeling variables, generating new variables, and editing data, along with practical examples for visualizing data through bar charts and histograms. Additionally, it explains how to categorize continuous data and includes questions and answers related to the use of bar charts and histograms.

Uploaded by

Emmanuel

General syntax

There are a number of ways to ‘get to know’ your dataset. As you proceed
through this section, note that the general syntax used for commands in
Stata is as follows:

command {space} variable_name(s) {space} [if expression], {space} options

Note that after the command and the variable name, there is a comma
before any options are listed. Some commands can be abbreviated as well.
Conditional statements involving the word “if” come before the comma.
Keep this general syntax in mind as you work with the commands below.
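
As a small illustration of this pattern, here is a minimal sketch. It assumes the ‘auto’ example dataset that ships with Stata as a stand-in, because the dataset used in this practical is not distributed with this text; the variables price, mpg and foreign belong to that example dataset.

```stata
* Load Stata's built-in example dataset (a stand-in, not the practical's data)
sysuse auto, clear

* command + variable name(s)
summarize price mpg

* command + variable name(s) + if condition (the condition comes before the comma)
summarize price if foreign == 1

* command + variable name(s) + if condition + comma + options
summarize price if foreign == 1, detail

* the same command with the command and option names abbreviated
su price if foreign == 1, d
```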

Troubleshooting – how to find help

The ‘help’ command in Stata can be very useful to understand more about
the commands in Stata. Simply type help and the command you want more
information on and you will open a help window. For example, if you want
more information on the tabulate command, you can type:

help tabulate

You can also search for a command. For example, if you wanted to look for
general help for histograms, you can type:

search histogram

Exploring your data

The command ‘describe’ will give you the variable names and
their labels. You can look at the whole dataset or specific variables. Try:

describe

describe bmi_grp4

The command ‘codebook’ provides a little more information about the
variables in the dataset, such as the minimum and maximum values and
information on missing data. Stata's own missing value for numeric variables
is . but a dataset may also record missing values with codes such as 99 or 0,
so you need to be clear about how missing values are coded before exploring
your dataset. Try:

codebook bmi_grp4

You can also get a feel for the dataset by using the ‘list’ and ‘tabulate’
commands. Try looking at the variables for ‘currsmoker’ and ‘frailty’:

tabulate currsmoker

tab currsmoker, missing

tab currsmoker frailty, row

tab currsmoker frailty, col

You can use the ‘if’ condition to view specific observations. Try browsing
the data for current smokers aged 60-70 years:

browse if currsmoker==1 & age_grp==1

Now try to tabulate frailty level among the current smokers aged 60-70 using
the if condition:

tab frailty if currsmoker==1 & age_grp==1

Notice that in Stata, if you want to specify that a variable is “equal to” some
value, then you need to type two equals signs, like this: ==. A single = is
used only to assign a value, for example in the generate command.

Labelling variables
If you look at the ‘Variables’ window, you will notice that some of the
variables do not have labels, or the given name of the variable is not very
clear. The data might be easier to work with if you have a short description
(or label) of the variables. You can label variables using the label command.

General syntax:

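label variable variable_name "label"

When adding a label to a variable, the command is ‘label variable’; simply
typing label on its own is incorrect. Type the quotation marks around the
label as plain straight double quotes, because curly quotes pasted from a
word processor will cause an error. For example:

label variable prior_cvd "Prior CVD"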

If you look at the ‘Variables’ window, you should see that there now is a label
next to ‘prior_cvd’. Now look at this variable using either
the codebook or tab command:

codebook prior_cvd … (or tab prior_cvd)

Now consider ‘currsmoker’. It is a binary variable that takes the value
either 0 or 1. On their own these numbers are not very meaningful; in fact,
0 = no and 1 = yes. Therefore, we need to label the numeric values within the
variable as ‘yes’ or ‘no’. Labelling the values of a variable is a two-step
process. First, you must define the value label and then assign the label to
the variable. The general syntax is presented below:

label define label_name 0 "label1" 1 "label2" … [, add modify replace]

label values variable_name(s) label_name

For the currsmoker variable, first we have to use the ‘label define’ command
to create a value label called ‘smoke_lab’, which maps 0 to “no” and 1 to
“yes”. For example:

label define smoke_lab 0 "no" 1 "yes"

Next, we need to apply the new label (smoke_lab) to the currsmoker
variable. For example:

label values currsmoker smoke_lab

Now look at currsmoker (either tab or codebook) and you should see that the
values 0 and 1 are now labelled as “no” and “yes”. You can also look at the
labels of a variable by typing:

label list label_name

For example, type: label list smoke_lab
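
The whole labelling workflow can be tried on a tiny made-up dataset. The sketch below is only an illustration: the variables id and smoker and the label name yesno_lab are hypothetical and are not part of the practical's dataset.

```stata
* Build a four-row toy dataset (hypothetical values)
clear
input id smoker
1 0
2 1
3 1
4 .
end

* Step 0: describe the variable itself
label variable smoker "Current smoker"

* Step 1: define the value label; Step 2: attach it to the variable
label define yesno_lab 0 "no" 1 "yes"
label values smoker yesno_lab

* Inspect the result
label list yesno_lab
tab smoker, missing

* Adding a further category to an existing label requires the add option
label define yesno_lab 9 "unknown", add
label list yesno_lab
```

Without the add (or modify) option, the second label define would stop with an error because yesno_lab is already defined.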

Creating new variables – generate

The generate command allows you to create new variables. The general
syntax is:

generate new_variable = expression [if expression]

Try the following:

generate var1 = 1

generate var2 = 5

browse

You can also copy existing variables or generate new variables based on
existing data. For example:

generate age4 = age_grp

You can also use mathematical operators and functions such as + (add), -
(subtract), * (multiply), / (divide), ^ (to the power of), sqrt() (square
root), ln() (natural log) and exp() (exponential). For example:

generate var3 = var1+var2

gen var4 = var3*var2

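These operations can be used to derive new variables. For example,
you could compute the total number of prior diseases a participant has been
diagnosed with. Note that the sum will be missing for any participant who has
a missing value in any of the three variables, which is why the tabulation
includes the m (missing) option: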

gen prior_disease = prior_cvd + prior_t2dm + prior_cancer

tab prior_disease, m

There is also an extension to the generate command, called egen, which can
be useful (see help egen for more information).
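
To see how generate and egen differ when some values are missing, here is a minimal sketch on made-up data. The variables d1, d2 and d3 are hypothetical stand-ins for three 0/1 disease indicators; rowtotal() and rowmiss() are standard egen functions.

```stata
* Toy data: participant 3 has a missing value for d2
clear
input id d1 d2 d3
1 0 1 0
2 1 1 1
3 1 . 0
end

* With +, any missing component makes the whole sum missing
gen total_plus = d1 + d2 + d3

* egen rowtotal() treats missing values as zero
egen total_row = rowtotal(d1 d2 d3)

* egen rowmiss() counts how many of the variables are missing in each row
egen n_missing = rowmiss(d1 d2 d3)

list

* Checks for participant 3
assert missing(total_plus) if id == 3
assert total_row == 1 if id == 3
assert n_missing == 1 if id == 3
```

Which behaviour is appropriate depends on the analysis: treating a missing diagnosis as "no disease" is an assumption that should be made deliberately, not by accident.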

Replace and recode


You can edit variables using ‘replace’ and ‘recode’ commands. PLEASE BE
AWARE THERE ARE MULTIPLE (CORRECT) WAYS TO RECODE AND GENERATE
NEW VARIABLES. You will see some examples in this practical, and more
examples in future sessions. Choosing the method to replace or recode
variables is generally a matter of personal preference, and it will rarely
matter which method you use when recoding variables as several different
commands will produce the same result.

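Try using the ‘bmi’ variable to create a binary variable, which
indicates those who have obesity and those who do not. But never recode
the original variable (in case you change your mind)! Duplicate the variable
first, and then recode the copy. For example:

gen bmi2=bmi

recode bmi2 (min/29.999=0) (30/max=1)

The first rule ends at 29.999 rather than 29 so that non-integer values such
as 29.5 are not left unrecoded; this assumes that bmi is recorded to no more
than two decimal places.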

Now compare ‘bmi’ and ‘bmi2’:

browse bmi bmi2

Here is another way to create a binary BMI variable, this time starting from
the grouped variable bmi_grp4:

gen bmi_bin = 1 if bmi_grp4<=2

replace bmi_bin = 0 if bmi_grp4>2

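It is good practice to cross-tabulate your binary and categorical variables to
check your coding:

tab bmi_bin bmi_grp4, miss

Stata treats missing values as larger than any number, so a condition such as
bmi_grp4>2 is also true for observations where bmi_grp4 is missing. Notice
where the missing values went in the table above, and watch for the same
behaviour in the following code. To leave such observations missing, add
& !missing(varname) to the if condition.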

gen ldl2=ldlc

replace ldl2=0 if ldl2<=4

replace ldl2=1 if ldl2>4

tab ldl2

label define ldl 0 "Under 4" 1 "Over 4"

label values ldl2 ldl


tab ldl2, m
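
The effect of missing values on a "greater than" condition can be checked on a small made-up dataset. This is a minimal sketch: the four ldlc values are invented for illustration, and ldl_naive and ldl_safe are hypothetical variable names.

```stata
* Toy data: participant 4 has a missing LDL value
clear
input id ldlc
1 2.5
2 4
3 5.1
4 .
end

* Naive version: missing is treated as larger than 4, so it is coded 1
gen ldl_naive = 0 if ldlc <= 4
replace ldl_naive = 1 if ldlc > 4

* Missing-safe version: exclude missing values explicitly
gen ldl_safe = 0 if ldlc <= 4
replace ldl_safe = 1 if ldlc > 4 & !missing(ldlc)

list

* Participant 4 is wrongly coded 1 in the naive version and stays missing in the safe one
assert ldl_naive == 1 if id == 4
assert missing(ldl_safe) if id == 4
```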

Note, as with many commands in Stata, there are alternative correct ways to generate our new
variable. One way would be to use the ‘recode’ command.

If you use the ‘recode’ command as above, the recoded information is stored in the variable being
recoded, i.e. the original information stored in this variable is overwritten. This is the reason why
we created a copy of the variable first. We can skip this step, and make our code more efficient,
by using the “, gen()” option of the “recode” command, which leaves the original variable
untouched and stores the result in a new variable. The new variable must not already exist, so a
new name (ldl2_a) is used here:

recode ldlc (min/4=0) (4.01/max=1), gen(ldl2_a)

The code can be made even more efficient by including the value labels in the
“recode” command:

recode ldlc (min/4=0 "under 4") (4.01/max=1 "over 4"), gen(ldl2_b)

Dropping variables

You can drop variables from the dataset if you no longer want to use them.
But once you do this, you cannot undo it, so be careful when using this
command. A variable can be dropped from the dataset by typing the
following command:

drop var1

Editing data – making changes in data editor

Stata has an editing browser, where you can see the data in your dataset
and make changes to the dataset. To access the edit window, you can either
type edit or you can open it from the drop down menu (Data>Data Editor>
Data Editor (Edit)). You can then click on the appropriate cell in the Edit
Window and change the values of the dataset. HOWEVER… when data
cleaning it is strongly recommended that you save the relevant commands in
a .do-file and then run that for each session. This ensures your original
dataset is kept intact in case you make a mistake while editing (or need
to remember something you edited a long time ago), and it gives you a
permanent record of the data cleaning process. “Data cleaning” is the
process whereby you get all the variables you received in your raw dataset
ready to be used in your analysis.
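
A cleaning do-file might be laid out roughly as follows. This is only a sketch under stated assumptions: the file names clean_data.do and auto_clean.dta are hypothetical, and Stata's built-in ‘auto’ dataset stands in for a raw dataset.

```stata
* clean_data.do (hypothetical file name)
* Run the whole file with: do clean_data.do

* 1. Load the raw data (stand-in: Stata's built-in example dataset)
sysuse auto, clear

* 2. Every change is a command in this file, never a manual edit
gen price_k = price / 1000
label variable price_k "Price (thousands of dollars)"

* 3. Check the result
summarize price price_k

* 4. Save under a new name so the raw file is never overwritten
save "auto_clean.dta", replace
```

Because the raw file is only ever read, re-running the do-file from the top always reproduces the same cleaned dataset.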

Practical on Bar charts --STATA

Bar charts are a useful way of comparing groups by a particular
characteristic. We can tell Stata what summary statistic we wish to include in
the bar chart, for example, the frequency within each category of a variable,
or the mean of one variable within each level of another categorical variable.
For categorical variables, it can be useful to look at frequencies within each
level. To do this, we use the ‘graph bar’ command and include ‘(count),’
followed by ‘over([variable name])’.

The following code will present a bar chart comparing the frequencies within
each age group category:

graph bar (count), over(age_grp)

To look at percentages within each category:

graph bar (percent), over(age_grp)

Explore the above command with some variables within your dataset. We
can also look at summary statistics of a continuous variable within each level
of another, categorical variable. For example, the following code will produce
a bar chart that presents the mean of vitamin D serum levels within each
age group category:

graph bar vitd, over(age_grp)

To present the median of vitamin D by age group, you simply include
(median) after the command ‘graph bar’:

graph bar (median) vitd, over(age_grp)

It is also possible to present the bar chart with multiple categorical variables.
The following code will produce a bar chart presenting the mean vitamin D
by age groups and history of cardiovascular disease.

graph bar vitd, over(age_grp) over(prior_cvd)

To add a title to the y-axis, we can use the following code:

graph bar (mean) vitd, over(age_grp) ytitle(Mean vitamin D concentration)

To remove the y-axis value labels or change their size, you can use the following code:

graph bar (mean) vitd, over(age_grp) ytitle(Mean vitamin D concentration) ylabel(, nolabels)

graph bar (mean) vitd, over(age_grp) ytitle(Mean vitamin D concentration) ylabel(, labels labsize(small))

If comparing multiple variables on one chart, it can be useful to change the
colour of bars.

To do this, add the option ‘bar(1, fcolor([insert colour]))’:

graph bar (mean) vitd, bar(1, fcolor(black)) ytitle(Mean vitamin D concentration)

Try producing a number of different bar charts and play around with
changing different features.
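
If you do not have the practical's dataset to hand, the same options can be tried on Stata's built-in ‘auto’ dataset. The sketch below is an illustration under that assumption; mpg, foreign and rep78 are variables of that example dataset, and the titles and colour are arbitrary choices.

```stata
* Stand-in data: Stata's built-in example dataset
sysuse auto, clear

* Frequency of cars in each repair-record category
graph bar (count), over(rep78)

* Mean mileage by origin within repair record, with y-axis title, chart title and bar colour
graph bar (mean) mpg, over(foreign) over(rep78) ytitle("Mean mileage (mpg)") title("Mileage by origin and repair record") bar(1, fcolor(navy))
```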

Histograms

When you want to look at the distribution of a variable, rather than
comparing characteristics, you can use a histogram. A histogram can be
produced for a continuous or categorical variable, as long as it is
measured on an interval scale.

Type ‘histogram [variable name]’.

histogram sbp

If the variable is not continuous, add the ‘discrete’ option after a comma:

histogram bmi, discrete

A histogram is often used to check whether a variable is normally
distributed. To add a normal distribution curve to the histogram, use the
following code:

histogram bmi, discrete normal

To adjust the number of bins, include the option ‘bin([number of bins])’:

histogram sbp, bin(20)

histogram sbp, bin(10)
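
The histogram options can also be tried on Stata's built-in ‘auto’ dataset. This is a minimal sketch under that assumption; mpg (continuous) and rep78 (a 1 to 5 score) are variables of the example dataset, not of the practical's data.

```stata
* Stand-in data: Stata's built-in example dataset
sysuse auto, clear

* Continuous variable: compare a coarse and a fine binning, with a normal curve overlaid
histogram mpg, bin(5) normal
histogram mpg, bin(15) normal

* Discrete variable: one bar per distinct value, shown as percentages
histogram rep78, discrete percent
```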

You can also add a title and labels to the x-axis:

histogram bmi, discrete normal title("Body Mass Index")

histogram bmi_grp4, discrete normal title("Body Mass Index") xlabel(1 "Underweight" 2 "Normal weight" 3 "Overweight" 4 "Obese")

It is also possible to show the percentage or frequency on a histogram. To do
this, add the relevant option at the end of the histogram command:

histogram bmi, discrete percent

histogram bmi, discrete frequency

Grouping continuous data

There are different ways you can group continuous data to create a
categorical variable. To do this, always create a new variable, so that you
are not altering the original.

1. ‘xtile’

If you want to create a new variable of quantile groups, the ‘xtile’ command
is useful. For example, if you wish to produce deciles of systolic blood
pressure:

xtile sbp10=sbp, nquantiles(10)

Or quartiles of systolic blood pressure:

xtile sbp4=sbp, nquantiles(4)

2. ‘cut’

If you want to create a variable with specific categories, you can use the
‘egen’ command with the ‘cut()’ function. The code below is an example of
creating a new categorical systolic blood pressure variable. The new variable
categories are <90 = low sbp; 90-<120 = normal sbp; 120-<130 = elevated sbp;
≥130 = high.

egen sbp_cat=cut(sbp), at(0, 90, 120, 130, 231)

Note that the maximum systolic blood pressure recorded in this population is
230; the final cut-off of 231 therefore includes all values below 231. Each
category is coded with its lower limit (0, 90, 120 or 130).

3. ‘recode’

The recode() function works in a similar way to cut() above, except that each
category is coded with its upper limit and the upper limit itself belongs to
the category (so a value of exactly 90 falls in the lowest group). A new
variable name is needed here because sbp_cat already exists:

gen sbp_cat2=recode(sbp, 90, 120, 130, 231)

4. ‘autocode’

The autocode() function creates evenly spaced categories of a continuous
variable:

gen [new var name]=autocode([original var name], [number of categories], [minimum], [maximum])

To create a new systolic blood pressure categorical variable, with 4 evenly
spaced categories between 0 and 230:

gen sbp_cat3=autocode(sbp, 4, 0, 230)

You can use the tab and tabstat commands to check that your new categorical
variables include the correct categories. Use the ‘label’ command to label the
variable and the categories in your new variable.
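
The grouping functions differ in how they treat values that sit on a boundary, which is easiest to see on a few made-up values. The sketch below is an illustration only: the six sbp values are invented, and cat_cut, cat_recode and cat_auto are hypothetical variable names.

```stata
* Toy data chosen to sit on or near the category boundaries
clear
input sbp
85
90
119
120
130
230
end

* cut(): each group is coded with its lower limit, and the lower limit is included
egen cat_cut = cut(sbp), at(0, 90, 120, 130, 231)

* recode(): each group is coded with its upper limit, and the upper limit is included
gen cat_recode = recode(sbp, 90, 120, 130, 231)

* autocode(): 4 equal-width groups between 0 and 230, coded with their upper limits
gen cat_auto = autocode(sbp, 4, 0, 230)

list

* The same value can land in differently coded groups
assert cat_cut == 0 & cat_recode == 90 if sbp == 85
assert cat_cut == 90 & cat_recode == 120 if sbp == 119
assert cat_cut == 130 & cat_recode == 231 if sbp == 230
assert cat_auto == 115 if sbp == 85
```

Tabulating each new variable against the original values in this way is a quick check that the boundaries fall where you intended.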

Questions A1.3b: Which type of variable can you plot with a bar chart? When should you use a
histogram? Plot a histogram of total cholesterol and describe the distribution. Can you change the
number of bins used to plot the histogram? What is the effect of changing the number of bins? Split
total cholesterol into groups and make a bar chart of the number of participants in each cholesterol
group. Can you give this graph a title? Can you label the y axis and change the colour of the bars in the
chart?

Answers

Answer A1.3b.i: A bar chart can be used to compare the frequency and percentage of participants
within each level of a categorical variable. Bar charts can also be used to look at summary statistics of
continuous variables, but only within levels of categorical variables. Histograms should be used to look at
the distribution of data.

Answer A1.3b.ii: histogram chol (normally distributed)

Answer A1.3b.iii: With too few bins it becomes difficult to identify the distribution of
the data:

histogram chol

histogram chol, bin(3)

Answer A1.3b.iv: Value labels can only be attached to integer values, so the codes 5, 7.5 and 11
produced by recode() are first recoded to 1, 2 and 3 before the labels are defined:

tab chol

gen chol_cat=recode(chol, 0, 5, 7.5, 11)

recode chol_cat (5=1) (7.5=2) (11=3)

label var chol_cat "Categories of cholesterol"

label define chol_cat 1 "Normal" 2 "High" 3 "Very high"

label values chol_cat chol_cat

tab chol_cat

graph bar (count), over(chol_cat) bar(1, fcolor(black)) ytitle(Frequency)
