Basic principles of statistical programming
Reproducible Research Fundamentals
September 2023
Development Impact (DIME)
The World Bank
• During the training, find all materials in our shared Dropbox: https://bit.ly/rrf-materials
• Permanent link to final materials: https://osf.io/pszy3/
Overview
Session overview
Before we dig deep into specific aspects of research reproducibility, this session
will introduce you to 7 general principles that you should always keep in mind:
1. Your code is an output
2. Know your data
3. Track your changes
4. Write code that others can read
5. Think critically about data work
6. Ask for help
7. Keep improving your skills
Principle 1: Your code is an output
Your code is an output
Do not treat code as a means to an end.
In fact, your code is as much an end in itself as the paper or report you are writing!
This is absolutely fundamental to making
research transparent, reproducible and credible.
During this course, we will focus on using Stata and
R. Why are we not using Excel?
The main reason why we code
• In Excel you make changes directly to the data and save new versions of
the dataset.
• In Stata and R you make changes to the instructions on how to get from the
original data to the final analysis and save new versions of the instructions.
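A minimal sketch of this idea in R, assuming a small made-up dataset (the names hh_id and income and the -99 survey code are illustrative, not taken from the course materials). The original data is never edited. Every change is an instruction in the script.

```r
# Made-up original data. In a real project this would be read from a file
# that is never overwritten.
original <- data.frame(
  hh_id  = c(1, 2, 3, 4),
  income = c(1200, -99, 800, 1500)   # -99 is a hypothetical "refused" code
)

# The "recipe": each change is a written instruction, not a manual edit
clean <- original
clean$income[clean$income %in% -99] <- NA   # recode the survey code to missing
clean$log_income <- log(clean$income)       # construct an analysis variable

# The analysis is computed from the cleaned copy
mean_income <- mean(clean$income, na.rm = TRUE)
```

Running the script again from the untouched original always yields the same clean data and the same result, which is what saving edited copies of a spreadsheet cannot guarantee.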
Create recipes and not meals
We are rarely trained in writing recipes
• What we are trained to do in school:
• Cook a delicious meal!
• Econometrics/statistics assignments tend to only grade the final results, not
necessarily how we got to them
• What is expected from us in the workplace:
• Delicious meals (correct results) are just as important in the workplace
• But as a research assistant, your task is to write the recipe (code scripts, folder
structures, data documentation, etc.) that creates those delicious meals
Principle 2: Know your data
Know your data
• To write a good recipe you need to know your ingredients very well
• The ingredients for a data work recipe are contained in the datasets
• Let’s discuss a framework to understand and communicate how your data is
structured
Exploring a new dataset
What is the first thing you want to look for whenever you open a new dataset for the first time?
1. Unit of observation
2. Uniquely and fully identifying ID variable
ID variables
• ID variables are crucial to understanding and handling data
• Make sure all your datasets have an ID variable
• If the dataset that you have received does not have one, then creating it is
your first task
• The session on data cleaning will discuss the desired properties of an ID variable in more detail
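As a rough sketch of what "uniquely and fully identifying" means in practice, the R snippet below checks a made-up household dataset (the names hh_id and village are hypothetical). It only illustrates the two properties. It is not the check used in the data cleaning session.

```r
# Hypothetical household data with a problematic ID variable
households <- data.frame(
  hh_id   = c("A01", "A02", "A02", NA),
  village = c("North", "North", "South", "South"),
  stringsAsFactors = FALSE
)

# Fully identifying: no observation has a missing ID
n_missing <- sum(is.na(households$hh_id))

# Uniquely identifying: no ID value appears more than once
n_duplicated <- sum(duplicated(households$hh_id))

# The ID is only usable if both counts are zero
is_valid_id <- n_missing == 0 && n_duplicated == 0
```

Here one ID is missing and one is repeated, so is_valid_id is FALSE. In Stata, the built-in isid command performs a comparable check.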
Understand project data
• It is easy to remember information about one or two datasets while you are
working with them
• However, in your role as a research assistant, you will need to keep track of
multiple datasets, explain to other team members how they are organized,
and hand them to other researchers
• To communicate our understanding of datasets, we use data maps. We will
learn about this tool in the next session
Principle 3: Track your changes
Track your changes
• Your code will constantly change, but with a version control tool like Git/GitHub you can access all previous versions of your code.
• If your original data and all versions of your code are backed up, then all versions of your outputs are effectively backed up too, because you can recreate them.
• Being able to reproduce all past outputs is central to credibility and transparency.
How can you track changes?
• Using file naming conventions (such as adding dates and initials as suffixes) is better than no version control, but it gets unwieldy very quickly
• Syncing software (such as OneDrive and Dropbox) allows teams to revert to old versions of a document, but not to track specific changes
• Git is currently the best version control system out there, as it lets you track specific changes and revert to old versions easily
Recommended practices for version control
• DIME projects are required to use Git for version control of code
• Any file can be version-controlled through Git, but Git is only well suited to code and outputs in plain text formats such as .csv, .do, .R, .tex
• The World Bank does not allow us to store data on GitHub, but you can track changes to the data by saving metadata, such as codebooks, in plain text format
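One possible way to produce such a plain text codebook is sketched below in base R. The dataset, the choice of metadata columns, and the output location are all assumptions made for illustration. A real project would decide which metadata is safe and useful to track.

```r
# Made-up dataset. The data itself stays out of the repository.
survey <- data.frame(
  hh_id  = c(1, 2, 3),
  income = c(1200, NA, 800),
  region = c("North", "South", "South"),
  stringsAsFactors = FALSE
)

# One row of metadata per variable: name, type, number of missing values
codebook <- data.frame(
  variable  = names(survey),
  type      = vapply(survey, function(x) class(x)[1], character(1)),
  n_missing = vapply(survey, function(x) sum(is.na(x)), integer(1)),
  row.names = NULL,
  stringsAsFactors = FALSE
)

# Save as plain text so a version control tool can show line-by-line changes
codebook_path <- file.path(tempdir(), "codebook.csv")   # hypothetical location
write.csv(codebook, codebook_path, row.names = FALSE)
```

If a later version of the data gains a variable or a new missing value, the codebook changes by one line, and that change is visible in the version history even though the data is not stored there.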
Principle 4: Write code that others can read
How to write good recipes
A recipe only has any value if someone
else can follow it
How do you write code that is useful to
others?
Is this slide easy to read?
White Space. Stata does not distinguish between one empty space and many empty spaces,
or one line break or many line breaks. It makes a big difference to the human eye and we
would never share a Word document, an Excel sheet or a PowerPoint presentation without
thinking about white space - although we call it formatting.
White Space
• Stata does not distinguish between one empty space and many empty
spaces, or one line break or many line breaks
• It makes a big difference to the human eye and we would never share a Word
document, an Excel sheet or a PowerPoint presentation without thinking
about white space – although we call it formatting
Vertical spacing
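A minimal sketch of vertical spacing in R, using made-up data and hypothetical variable names. Both versions run identically. Only the second shows at a glance where each step starts and ends.

```r
# Hard to scan: every step runs together
scores <- c(52, 67, NA, 81, 74)
scores_clean <- scores[!is.na(scores)]
scores_mean <- mean(scores_clean)
scores_centered <- scores_clean - scores_mean

# Easier to scan: blank lines and short comments group related steps

# Step 1: load the (made-up) data
scores <- c(52, 67, NA, 81, 74)

# Step 2: drop missing values
scores_clean <- scores[!is.na(scores)]

# Step 3: center the scores around their mean
scores_mean     <- mean(scores_clean)
scores_centered <- scores_clean - scores_mean
```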
Horizontal spacing
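A minimal sketch of horizontal spacing in R, again with made-up data and hypothetical names. The two versions build the same data frame. The second uses spaces around operators, one argument per line, and aligned equals signs.

```r
# Without horizontal spacing
d<-data.frame(id=1:3,wage=c(10,12.5,9),hours=c(40,35,20))
d$earn<-d$wage*d$hours

# With horizontal spacing
d <- data.frame(
  id    = 1:3,
  wage  = c(10, 12.5, 9),
  hours = c(40, 35, 20)
)
d$earn <- d$wage * d$hours
```

In R, spacing occasionally changes meaning as well as readability: x<-5 assigns 5 to x, while x < -5 compares x with -5. Consistent spacing makes the intent unambiguous.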
Style Guides
Style guides are common in most programming languages. Following a style guide
will make your code much more readable, and it will reduce the risk of errors.
• Stata: See Appendix A in the DIME Analytics Data Handbook -
https://worldbank.github.io/dime-data-handbook/coding.html
• R: https://style.tidyverse.org
Code linters
Linters are tools that flag style errors and possible bugs in software.
• Stata: Install the Stata linter (proudly developed by DIME Analytics!) from SSC with: ssc install stata_linter. See the package documentation for more information.
• R: Use the package lintr, available on CRAN. See the package documentation for more information.
Don’t repeat yourself
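One common way to avoid repetition is to write a rule once as a function and apply it to every variable that needs it. The sketch below assumes made-up data in which -99 is a hypothetical code for a missing answer. The function name recode_missing is invented for this example.

```r
# Made-up data. Variable names and the -99 code are hypothetical.
df <- data.frame(
  income      = c(100, 250, -99, 400),
  expenditure = c(80, -99, 120, 300),
  savings     = c(-99, 20, 30, 100)
)

# Repetitive version (one copy-pasted line per variable):
# df$income[df$income %in% -99]           <- NA
# df$expenditure[df$expenditure %in% -99] <- NA
# df$savings[df$savings %in% -99]         <- NA

# Write the rule once ...
recode_missing <- function(x, code = -99) {
  x[x %in% code] <- NA   # replace the survey code with a true missing value
  x
}

# ... then apply it to every variable that needs it
money_vars <- c("income", "expenditure", "savings")
df[money_vars] <- lapply(df[money_vars], recode_missing)
```

If the missing code later changes, only one line has to be edited, and a typo cannot creep into one of several copies.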
Principle 5: Think critically about the data work
Critical thinking about data work
Do I believe this number?
• What does my data look like?
• What can go wrong in my code?
• How will missing values be treated in this command?
• What would happen if more observations were added to the dataset?
• What would happen if some observations were removed from the dataset?
• We will cover this in the lecture Best practices for reproducible outputs
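The question about missing values can be made concrete with a small R sketch. The scores are made up, and the 0 to 100 range is an assumption about this hypothetical variable.

```r
# Made-up test scores. NA marks a missing response.
score <- c(70, 85, NA, 90)

mean_default <- mean(score)                 # NA: one missing value propagates
mean_nonmiss <- mean(score, na.rm = TRUE)   # mean of the three observed values

# A comparison involving NA is NA, not FALSE, so counts need care
passed   <- score >= 80
n_passed <- sum(passed, na.rm = TRUE)

# A check that fails loudly if an assumption about the data stops holding,
# for example after new observations are added
stopifnot(all(score >= 0 & score <= 100, na.rm = TRUE))
```

Stata treats this differently: numeric missing values count as larger than any number, so a condition such as score > 80 also selects observations with a missing score. In either language, the way to know is to check how each command handles missing values.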
Principle 6: Ask for help
Help file usage and coding knowledge
The Dunning-Kruger effect
Help files
• In Stata, type help followed by the command name, for example: help regress
• In R, type ? followed by the function name, for example: ?mean
• Get in the habit of using the help file as often as possible!
• Even with familiar commands, there is always more to learn
• Help files are not the only place to learn
• Follow blogs and Twitter accounts that discuss best practices
• Follow the tag for your programming language on
https://stackoverflow.com/
• In Stata, there is a reference manual, which you access by clicking [R] command name in the help file, where the developers at StataCorp discuss coding practices, common mistakes, alternative approaches, etc.
Asking for help
The quality of the help you will get
depends on how well you ask your question
This is always the case, no matter who you ask: DIME Analytics, Stack Overflow, a
friend from grad school, etc.
How to ask for help
• You will never get a good answer if you only say “my code is not working”
• Good etiquette for a code question is to include at least:
• The error message or a description of the unexpected behavior
• The software or language you are using, and the part of your code that breaks
• A description of what you have tested so far and what you have learned
• The more of the following you include, the better the answer:
• Your version of the software and your operating system (macOS/Windows)
• Evidence that this part of the code actually causes the error, and is not merely where the code crashes
• A minimal reproducible example
More details and advice on this topic are available at https://git.io/JtQTb and http://tinyurl.com/stack-hints
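A minimal reproducible example might look like the R sketch below: a tiny made-up dataset created inside the example, only the line that misbehaves, what has been tried, and the software version. The problem and the variable name are hypothetical.

```r
# Question: "mean() returns NA for my income variable. Why?"

# 1. The smallest data that shows the problem, created inside the example
df <- data.frame(
  income = c("1200", "800", "1500"),   # numbers stored as text
  stringsAsFactors = FALSE
)

# 2. The one line that misbehaves. mean() warns that the argument is not
#    numeric; suppressWarnings() is only used to keep this sketch quiet.
result <- suppressWarnings(mean(df$income))

# 3. What has been tested so far
income_class <- class(df$income)              # shows the variable's type
fixed        <- mean(as.numeric(df$income))   # works after converting to numeric

# 4. The software version, so that others can match the setup
r_version <- R.version.string
```

Anyone can paste this into a fresh R session and see the same behavior, without access to the project data.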
Principle 7: Keep improving your skills
When your code works you are only half done.
- Ancient proverb
Re-write your own code
• Can this code be made simpler?
• Can I generalize this code so I can use it in other projects?
• Read your own code as a recipe. Would you be able to follow the instructions
if you were a new person joining the team?
Read other people’s code
• Look for code on GitHub
• https://github.com/vikjam/mostly-harmless-replication - all examples in
the book coded in Stata, R, Python and Julia
• Read our book https://worldbank.github.io/dime-data-handbook/
• Google code, but before using it, ask yourself critical questions about the code
you found
• Why did this person code this way?
• Does this apply to my context?
Have someone else read your own code
• Swap code with someone and discuss differences in coding style. Think of
each other’s code as recipes: can you follow the instructions?
• Have you ever asked someone to help you proofread your Word document?
Ask people to proofread your code
• In DIME, we hold structured peer code review sessions every quarter
Wrapping up
Wrapping up
1. Your code is an output
2. Know your data
3. Track your changes
4. Write code that others can read
5. Think critically about data work
6. Ask for help
7. Keep improving your skills
We will see these principles in practice during the rest of this training.
Thank you! Gracias!