MED101x
Introduction to Applied Biostatistics
Data Entry for a statistical analysis
Data from Hell
Data from Heaven
Data from Hell
1. Delete the first row with the title of the project.
2. Delete the 2 rows under the variable name.
3. Delete the 2 rows between the groups.
4. Delete the row of means at the bottom.
5. Give each patient a unique, sequential case number (ID). Place this ID
number in the first column on the left.
Data on the way to Heaven
6. Give each variable a valid name: a short, easy-to-remember word. Each
variable name must be unique; duplication is not allowed. In some software,
variable names are not case sensitive, so names such as TumorSize,
Tumorsize, and tumorsize are all considered identical.
Do not use symbols such as #, @, $, %, &.
Do not start with a number.
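
The naming rules in step 6 can be checked mechanically before the data are analysed. The following is a minimal sketch in Python, not a rule set taken from any particular statistical package: the function name `check_variable_names` and the assumed pattern (a letter first, then letters, digits, or underscores) are illustrative assumptions.

```python
import re

# Assumed rule: a letter first, then letters, digits, or underscores only.
VALID_NAME = re.compile(r"[A-Za-z][A-Za-z0-9_]*")

def check_variable_names(names):
    """Return a list of (name, problem) pairs for names that break the rules."""
    problems = []
    seen = {}  # lower-cased name -> first spelling encountered
    for name in names:
        if VALID_NAME.fullmatch(name) is None:
            problems.append((name, "invalid characters or starts with a number"))
        key = name.lower()  # compare case-insensitively, as some software does
        if key in seen:
            problems.append((name, "duplicate of " + seen[key]))
        else:
            seen[key] = name
    return problems

# Hypothetical header row used only to exercise the check.
issues = check_variable_names(["ID", "TumorSize", "tumorsize", "2ndVisit", "Cost$"])
for name, problem in issues:
    print(name, "->", problem)
```

Comparing lower-cased names is the conservative choice: a header that passes this check stays unique even in software that ignores case.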
7. Encode categorical variables. Convert letters and words to numbers.
8. Avoid mixing symbols with data. Convert them to numbers.
9. Each variable should be in its own column.

Combined in one column (avoid):
Animal
Control1
Control2
Experiment1
Experiment2

Separate columns (preferred):
Animal   Group
1        0
2        0
3        1
4        1

* Do not combine variables in one column.
* It is recommended to use 0/1 for 2 groups, with 0 as the reference group.
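
Splitting a combined column like the one in step 9 can be done programmatically. The sketch below assumes every label is the word Control or Experiment followed by digits, and that Control is the reference group coded 0; the function name `group_code` is a hypothetical helper, not part of the course material.

```python
import re

def group_code(label):
    """Return 0 for a Control label and 1 for an Experiment label.

    Assumes labels look like 'Control2' or 'Experiment1'.
    """
    match = re.fullmatch(r"(Control|Experiment)(\d+)", label)
    if match is None:
        raise ValueError("unexpected label: " + label)
    return 0 if match.group(1) == "Control" else 1

labels = ["Control1", "Control2", "Experiment1", "Experiment2"]

# Animal becomes a new sequential ID; Group comes from the text part of the label.
table = [{"Animal": i, "Group": group_code(label)}
         for i, label in enumerate(labels, start=1)]

for row in table:
    print(row["Animal"], row["Group"])
```

Raising an error on an unexpected label, rather than guessing a group, keeps a typing mistake from silently becoming a data point.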
10. Do not include graphs or summary statistics in the spreadsheet.
11. Each patient should be entered on a single line or row. Do not copy
a patient's information to another row to perform subgroup analysis.
12. However, when data are collected repeatedly on a patient, it is
recommended to put each patient-day observation on a single line to ease
data management. R has functions to convert from the longitudinal
format to the horizontal format. When the number of repeats is small (2 or 3),
the horizontal format may be preferred for simplicity.
Longitudinal data entry
ID   Date       SYSBP
1    1/2/2005   130
1    1/3/2005   120
1    1/4/2005   120
2    3/1/2005   110
2    3/2/2005   140

Horizontal data entry
ID   SYSBP1   SYSBP2   SYSBP3
1    130      120      120
2    110      140
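
The conversion from longitudinal to horizontal format mentioned in step 12 is sketched here in Python rather than R, using the blood-pressure values from the example. It assumes that readings are already sorted by date within each patient and that the ID, date, and reading columns pair up row by row; `long_to_wide` is a hypothetical helper name.

```python
from collections import defaultdict

# Longitudinal format: one row per patient-day.
long_rows = [
    {"ID": 1, "Date": "1/2/2005", "SYSBP": 130},
    {"ID": 1, "Date": "1/3/2005", "SYSBP": 120},
    {"ID": 1, "Date": "1/4/2005", "SYSBP": 120},
    {"ID": 2, "Date": "3/1/2005", "SYSBP": 110},
    {"ID": 2, "Date": "3/2/2005", "SYSBP": 140},
]

def long_to_wide(rows, n_repeats):
    """Collect each patient's readings, in row order, into SYSBP1..SYSBPn."""
    readings = defaultdict(list)
    for row in rows:
        readings[row["ID"]].append(row["SYSBP"])
    wide = []
    for patient_id, values in readings.items():
        record = {"ID": patient_id}
        for k in range(n_repeats):
            # A repeat that was never measured stays blank (None).
            record["SYSBP%d" % (k + 1)] = values[k] if k < len(values) else None
        wide.append(record)
    return wide

wide_rows = long_to_wide(long_rows, n_repeats=3)
for record in wide_rows:
    print(record)
```

Note that the dates are dropped in the horizontal layout; if they matter for the analysis, the longitudinal format keeps them without extra columns.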
13. For yes/no questions, enter 0 for no and 1 for yes. Do not leave
blanks for no. Do not enter ?, *, or NA for missing data, because
this indicates to the statistical program that the variable is a string
variable. Leave the cell blank for a missing value (unless you need to specify
the type of missing data). String variables cannot be used for any arithmetic
computation.
Consistent coding (preferred):
Complication
0
1
1
0
0

Inconsistent coding (avoid):
Complication
no
y
Yes
N
Don't know
NO
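
Free-text yes/no answers like those in the inconsistent column can be recoded to 0/1 in one pass. This is a minimal sketch under stated assumptions: the accepted spellings are limited to y/yes and n/no in any letter case, and anything else (including "Don't know") is treated as missing and left blank. The function name `code_yes_no` is hypothetical.

```python
YES = {"y", "yes"}
NO = {"n", "no"}

def code_yes_no(raw):
    """Return 1 for yes, 0 for no, and None (a blank cell) for anything else."""
    text = raw.strip().lower()
    if text in YES:
        return 1
    if text in NO:
        return 0
    return None  # unknown or missing: leave blank rather than typing a symbol

raw_values = ["no", "y", "Yes", "N", "Don't know", "NO"]
coded = [code_yes_no(value) for value in raw_values]
print(coded)
```

If the type of missing data has to be distinguished, as step 13 allows, the final `return None` is the place to assign separate numeric codes instead.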
14. Put ordinal variables into one column if they are mutually exclusive.
Separate columns (avoid):
PainMild   PainMid   PainSev
1          0         0
0          1         0
0          0         1

One column (preferred):
Pain
1
2
3
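
Collapsing the three indicator columns into a single ordinal column, as step 14 recommends, can be sketched as follows. The coding 1 = mild, 2 = mid, 3 = severe follows the example; the function name `collapse_pain` and the decision to raise an error when a row is not mutually exclusive are assumptions.

```python
# Column order defines the ordinal code: PainMild -> 1, PainMid -> 2, PainSev -> 3.
INDICATORS = ["PainMild", "PainMid", "PainSev"]

def collapse_pain(row):
    """Return the single Pain code for a row of 0/1 indicator columns."""
    flagged = [code for code, name in enumerate(INDICATORS, start=1)
               if row[name] == 1]
    if len(flagged) != 1:
        # Categories must be mutually exclusive; anything else needs review.
        raise ValueError("expected exactly one pain level, got %d" % len(flagged))
    return flagged[0]

rows = [
    {"PainMild": 1, "PainMid": 0, "PainSev": 0},
    {"PainMild": 0, "PainMid": 1, "PainSev": 0},
    {"PainMild": 0, "PainMid": 0, "PainSev": 1},
]
pain = [collapse_pain(row) for row in rows]
print(pain)
```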
15. Entering dates in Excel
In Excel, go to Format, Cells, select Date under Category, and
choose a Type for the format you like.
16. Merging data files (data can be entered in multiple files as long as
the same ID is used)
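
Why the shared ID matters becomes clear when two files are joined. The sketch below merges two lists of records on the ID column, assuming each ID appears at most once per file; the helper `merge_by_id`, the file contents, and the Age values are hypothetical and serve only as an illustration.

```python
def merge_by_id(left_rows, right_rows, key="ID"):
    """Join two lists of records on a shared ID (one row per ID in each file)."""
    right_index = {}
    for row in right_rows:
        if row[key] in right_index:
            raise ValueError("duplicate ID in second file: %r" % row[key])
        right_index[row[key]] = row
    merged = []
    for row in left_rows:
        combined = dict(row)
        # An ID missing from the second file contributes no extra columns.
        combined.update(right_index.get(row[key], {}))
        merged.append(combined)
    return merged

# Hypothetical contents of two files that share the ID column.
demographics = [{"ID": 1, "Age": 54}, {"ID": 2, "Age": 61}]
blood_pressure = [{"ID": 2, "SYSBP": 110}, {"ID": 1, "SYSBP": 130}]

merged = merge_by_id(demographics, blood_pressure)
print(merged)
```

Because rows are matched by ID and not by position, the two files do not need to be sorted the same way.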
17. Data confidentiality
Data need to be stored in a secure, locked place and backed up
daily or once a week. When you send your data to a biostatistician
for further statistical analysis, delete patient names, social security
numbers, medical record numbers, and actual dates (birth date, admission
date, etc.).
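
Removing identifying fields before sharing can be sketched as a simple column filter. The column names in `IDENTIFYING_COLUMNS` and the sample record are hypothetical and must be adapted to the actual spreadsheet; this sketch only drops columns and is not a complete de-identification procedure.

```python
# Hypothetical column names; adjust to match the actual spreadsheet.
IDENTIFYING_COLUMNS = {"Name", "SSN", "MRN", "BirthDate", "AdmissionDate"}

def deidentify(rows):
    """Return copies of the rows without the identifying columns."""
    return [{name: value for name, value in row.items()
             if name not in IDENTIFYING_COLUMNS}
            for row in rows]

# Made-up record for illustration only.
records = [{"ID": 1, "Name": "Jane Doe", "MRN": "000123", "Age": 54, "SYSBP": 130}]

shared = deidentify(records)
print(shared)
```

The original records are left untouched, so the study ID in the shared copy can still be linked back to the secured master file.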
18. Data dictionary
Create a data dictionary to keep a list of variable names with an
explanation of what they are.

Variable   Description
WH         Patient's weight in pounds at study entry
HT         Patient's height in inches at study entry
Age        Patient's age at study enrollment
Gender     Patient's gender: 0 for female, 1 for male
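
A data dictionary kept in machine-readable form can also be used to spot undocumented variables. In the sketch below the dictionary entries come from the example above, while the dataset header and the helper name `undocumented` are hypothetical.

```python
DATA_DICTIONARY = {
    "WH": "Patient's weight in pounds at study entry",
    "HT": "Patient's height in inches at study entry",
    "Age": "Patient's age at study enrollment",
    "Gender": "Patient's gender: 0 for female, 1 for male",
}

def undocumented(columns, dictionary):
    """List dataset columns that have no entry in the data dictionary."""
    return [name for name in columns if name not in dictionary]

# Hypothetical dataset header; ID and SYSBP are not documented above.
header = ["ID", "WH", "HT", "Age", "Gender", "SYSBP"]

missing = undocumented(header, DATA_DICTIONARY)
print(missing)
```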