DIRTY DATA?
CLEAN IT USING SAS
AN INTRODUCTION TO DATA CLEANING PRINCIPLES
CYP-C Research Champion Webinar
August 11, 2017
Giancarlo Di Giuseppe, MPH
Pediatric Oncology Group of Ontario
Healthcare innovation | Survivor care | Family assistance
Population data | Policy development | Education | Research
Outline
• SAS overview and procedures – revisited
• Fundamental principles to build a clean dataset
• Inclusion / exclusion criteria
• Visualizing data distributions
• Outliers
• Invalid or inconsistent character variables
• Dealing with missing data
• Creating data checkpoints
Healthcare innovation | Survivor care | Family assistance
Population data | Policy development | Education | Research
SAS Overview - Revisited
• For our purposes only two major things you can do in SAS
– DATA step - Manipulate the data in some way
• Reading in Data
• Creating and Redefining Variables
• Sub-Setting Data
• Working with Dates
• Working with Formats
– PROCedure step
• Analyze the data
• Produce frequency tables
• Estimate a regression model
Healthcare innovation | Survivor care | Family assistance
Population data | Policy development | Education | Research
SAS Procedures – Revisited
• SAS Procedures
– PROC FREQ
– PROC PRINT
– PROC MEANS
– PROC UNIVARIATE
– PROC SORT
– PROC CONTENTS
Healthcare innovation | Survivor care | Family assistance
Population data | Policy development | Education | Research
PRINCIPLES FOR CLEANLINESS
Healthcare innovation | Survivor care | Family assistance
Population data | Policy development | Education | Research
Understanding Your Dirty Data Source
• No database is initially ever
“clean”
• Databases are not constructed
with our own specific research
questions in mind
• Researchers must be familiar with
the purpose, how variables are
captured and defined, and the http://3rdsectorlabs.com/wp-content/uploads/2014/06/TSL-data-
structure of the database shower.png
Healthcare innovation | Survivor care | Family assistance
Population data | Policy development | Education | Research
Having an Analysis Plan
• Having clean data requires a sound analysis plan
– Envision what the analysis dataset will look like with all
variables and formats before performing data cleaning
• Determine what your study population denominator
is before you begin cleaning
– Is it patient population? Is it number of total diagnoses
(therefore, multiple dx’s per patient is possible)? Or is it
person-time? Etc.
– Based on the research question!
Healthcare innovation | Survivor care | Family assistance
Population data | Policy development | Education | Research
Data Manipulation and Data Cleaning:
A Simultaneous Process
• Data manipulation and data cleaning are not
mutually exclusive, rather they go hand-in-hand!
• Both can (and should) be
performed within a single
DATA step
http://i.telegraph.co.uk/multimedia/archive/03219/handshake1_3219777k.jpg
• Ensures efficient and easy to follow SAS
programming
Healthcare innovation | Survivor care | Family assistance
Population data | Policy development | Education | Research
SUB-SETTING YOUR DATA
Healthcare innovation | Survivor care | Family assistance
Population data | Policy development | Education | Research
Receiving Your Data Cut
• Typically data is requested with slightly more
information than needed
– Allows for wiggle room if hypothesis change slightly
• No data cut is ever perfect
– Data still needs to be cleaned
• Initial data cuts are never ready to be analyzed,
they must first be cleaned
Healthcare innovation | Survivor care | Family assistance
Population data | Policy development | Education | Research
Cleaning Using Inclusion & Exclusion Criteria
PROC SORT DATA = T7 OUT=T7_SORT; BY CYPCID DX_DATE; RUN;
DATA T8; SET T7_SORT; BY CYPCID; First cancers
/* INTERESTED IN PRIMARY DIAGNOSIS ONLY */
IF FIRST.CYPCID;
IF ORDINAL_PRIMARY IN (1);
/* AGE INCLUSION CRITERIA – 0 TO 14 */
IF 0 <= DX_AGE < 15;
Children aged 0 to 14
IF 0 <= DX_AGE < 1 THEN DX_AGE_GR=1;
ELSE IF DX_AGE < 7 THEN DX_AGE_GR=2;
ELSE IF DX_AGE < 11 THEN DX_AGE_GR=3; Note: Data cleaning
ELSE DX_AGE_GR=4; and data manipulation
LABEL DX_AGE_GR = "AGE AT FIRST DIAGNOSIS - GROUPED";
FORMAT DX_AGE_GR DX_AGE_GR.; done simultaneously!
/* SELECTS THOSE WITH A DIAGNOSIS BETWEEN 2002 & 2012 */
IF 2002 <= YEAR(DX_DATE) <= 2012; Diagnosed between 2002-
DX1_YEAR = YEAR(DX_DATE);
2012
/* LEUKEMIA CASES */
IF ICCC_MAIN = 1010 OR ICDO_M_CODE IN (9826, 9835, 9836,
9837);
Only concerned with
leukemia cases
RUN; *N=2,492; Keep logs of sample size in your DATA steps!!
Healthcare innovation | Survivor care | Family assistance
Population data | Policy development | Education | Research
DATA DISTRIBUTION
Healthcare innovation | Survivor care | Family assistance
Population data | Policy development | Education | Research
Recall From Last Session
• PROC FREQ produces frequency outputs
– Can be used for numeric or character variables
– Useful for counts and proportions
• PROC MEANS and UNIVARIATE produce outputs
describing the data distribution for numeric variables
– Checkpoint for data distributions and normality
• PROC FREQ and PROC MEANS/UNIVARIATE are
used in the first step of data cleaning to understand
the data
Healthcare innovation | Survivor care | Family assistance
Population data | Policy development | Education | Research
Duplicate Entries
PROC FREQ DATA=T8 ORDER=FREQ;
TABLE CYPCID /MISSING;
RUN;
Healthcare innovation | Survivor care | Family assistance
Population data | Policy development | Education | Research
Distribution of Continuous Data
ODS GRAPHICS ON;
PROC UNIVARIATE DATA = T8 NORMAL;
ID CYPCID;
VAR WBC_COUNT;
HISTOGRAM WBC_COUNT / NORMAL;
RUN;
Healthcare innovation | Survivor care | Family assistance
Population data | Policy development | Education | Research
Distribution of Continuous Data II
Healthcare innovation | Survivor care | Family assistance
Population data | Policy development | Education | Research
Distribution of Continuous Data III
Healthcare innovation | Survivor care | Family assistance
Population data | Policy development | Education | Research
Normality of Continuous Data
Healthcare innovation | Survivor care | Family assistance
Population data | Policy development | Education | Research
OUTLIERS
Healthcare innovation | Survivor care | Family assistance
Population data | Policy development | Education | Research
Dealing With Outliers
• If there are many outliers, these will introduce
bias in your study
• Many options to handle these skewed data:
– Median + IQR instead of mean
– Use a logical range of values and assign any
outlier the upper bound of the range
– Categorize your data based on the distribution or
clinically meaningful ranges
• Whichever approach used should be justified!
Healthcare innovation | Survivor care | Family assistance
Population data | Policy development | Education | Research
Dealing With Outliers II
DATA T8; SET T8;
/* UPPER LIMIT TO OUTLIERS */
IF WBC_COUNT >= 500 THEN WBC_COUNT_CLEAN = 500;
ELSE WBC_COUNT_CLEAN = WBC_COUNT;
/* CREATING CLINICAL CATEGORIES */
IF WBC_COUNT ^= . THEN DO;
IF WBC_COUNT < 50 THEN WBC_GROUP = 1;
ELSE IF WBC_COUNT < 100 THEN WBC_GROUP = 2;
ELSE IF WBC_COUNT < 200 THEN WBC_GROUP = 3;
DO loop
ELSE IF WBC_COUNT < 300 THEN WBC_GROUP = 4;
ELSE IF WBC_COUNT < 400 THEN WBC_GROUP = 5;
ELSE WBC_GROUP = 6; END;
RUN;
Healthcare innovation | Survivor care | Family assistance
Population data | Policy development | Education | Research
Dealing With Outliers III
/* MEAN VS MEDIAN + IQR */
PROC MEANS DATA=T8 MEAN MIN MAX Q1 MEDIAN Q3;
VAR WBC_COUNT_CLEAN;
RUN;
/* DATA CATEGORIZATION */
PROC FREQ DATA=T8;
TABLES WBC_GROUP /MISSING;
RUN;
Healthcare innovation | Survivor care | Family assistance
Population data | Policy development | Education | Research
Dealing With Outliers III
Healthcare innovation | Survivor care | Family assistance
Population data | Policy development | Education | Research
Dealing With Outliers III
/* MEAN VS MEDIAN + IQR */
PROC MEANS DATA=T8 MEAN MIN MAX Q1 MEDIAN Q3;
VAR WBC_COUNT_CLEAN;
RUN;
/* DATA CATEGORIZATION */
PROC FREQ DATA=T8;
TABLES WBC_GROUP /MISSING;
RUN;
Healthcare innovation | Survivor care | Family assistance
Population data | Policy development | Education | Research
CLEANING CHARACTER VARIABLES
Healthcare innovation | Survivor care | Family assistance
Population data | Policy development | Education | Research
CApITOLIzATioN Matters!
PROC FREQ DATA=T8;
TABLES PROTOCOL_NAME;
RUN;
Healthcare innovation | Survivor care | Family assistance
Population data | Policy development | Education | Research
CAPITOLIZATION Matters! Use UPCASE
DATA T8; SET T8;
PROTOCOL_NAME = UPCASE(PROTOCOL_NAME);
RUN;
PROC FREQ DATA=T8; TABLES PROTOCOL_NAME; RUN;
Healthcare innovation | Survivor care | Family assistance
Population data | Policy development | Education | Research
FINDing, Cleaning, and Manipulating
DATA T9; SET T8;
PROTOCOL_NAME = UPCASE(PROTOCOL_NAME);
IF FIND(PROTOCOL_NAME,"ALL PROTOCOL C") THEN DO;
PROTOCOL_NAME = "ALL PROTOCOL C";
DO loop
ALL_RISK = "HIGH RISK";
END;
RUN;
PROC FREQ DATA=T9; TABLES PROTOCOL_NAME; RUN;
Healthcare innovation | Survivor care | Family assistance
Population data | Policy development | Education | Research
Use Caution When Searching Text
• When performing character search functions in SAS,
be wary of the phrase being used
• Can lead to errors in data cleaning
• Searched term should be unique enough to prevent
unwanted matches
• If “ALL PROTOCOL B” was searched using FIND(),
then the BFM-90 protocol would have been
misclassified as Protocol B
Healthcare innovation | Survivor care | Family assistance
Population data | Policy development | Education | Research
MISSING DATA
Healthcare innovation | Survivor care | Family assistance
Population data | Policy development | Education | Research
Recall: Viewing Missing Data
PROC FREQ DATA = T8;
TABLES STAGE_CODE /MISSING;
RUN;
Healthcare innovation | Survivor care | Family assistance
Population data | Policy development | Education | Research
Understanding Your Missing Data
PROC FREQ DATA = T8; • Staging not done for the
WHERE DX1_GRP = 2; leukemia’s which represent a high
TABLES STAGE_CODE /MISSING;
RUN; % of childhood cancers
• Staging important for lymphomas
• Know your data!
Healthcare innovation | Survivor care | Family assistance
Population data | Policy development | Education | Research
DATA CHECKPOINTS
Healthcare innovation | Survivor care | Family assistance
Population data | Policy development | Education | Research
Date Checkpoints I
DATA FLAGS; SET T8 (KEEP=PATIENT_ID DOB DX_DATE1 DOD);
IF DOD < DX_DATE1 AND DOD ^=. THEN DEATH_FLAG = 1;
ELSE DEATH_FLAG=0;
IF DX_DATE1 < DOB THEN DX_FLAG = 1;
ELSE DX_FLAG = 0;
RUN;
PROC FREQ DATA=FLAGS; TABLES DEATH_FLAG DX_FLAG; RUN;
Healthcare innovation | Survivor care | Family assistance
Population data | Policy development | Education | Research
Date Checkpoints II
PROC PRINT DATA=T8 NOOBS;
WHERE DOD < DX_DATE1 AND DOD ^=. ;
VAR PATIENT_ID DX_DATE1 DOD;
RUN;
Healthcare innovation | Survivor care | Family assistance
Population data | Policy development | Education | Research
Treatment Checkpoints
DATA TX_FLAGS;
MERGE T8 (IN=MASTER) CHEMO (IN=A) SURG (IN=B)
BMT (IN=C) RAD (IN=D);
BY CYPCID;
IF A THEN CHEMO = 1; ELSE CHEMO = 0;
IF B THEN SURGERY = 1; ELSE SURGERY = 0;
IF C THEN BMT = 1; ELSE BMT = 0; Treatment flags
IF D THEN RAD = 1; ELSE RAD = 0;
NUM_TX_MODALITIES = SUM(CHEMO,SURGERY,BMT,RAD);
IF FIRST.CYPCID;
IF MASTER THEN OUTPUT;
RUN;
REMEMBER: All datasets involved in a merge must be sorted by the common identifier (ie.CYPCID)
Healthcare innovation | Survivor care | Family assistance
Population data | Policy development | Education | Research
Treatment Checkpoints II
PROC FREQ DATA=TX_FLAGS;
TABLES DX1_GRP * (CHEMO SURGERY BMT RAD);
TABLES DX1_GRP * NUM_TX_MODALITIES;
RUN;
Healthcare innovation | Survivor care | Family assistance
Population data | Policy development | Education | Research
Treatment Checkpoints II
PROC FREQ DATA=TX_FLAGS;
TABLES DX1_GRP * (CHEMO SURGERY BMT RAD);
TABLES DX1_GRP * NUM_TX_MODALITIES;
RUN;
Healthcare innovation | Survivor care | Family assistance
Population data | Policy development | Education | Research
Topics Covered
• Key principles to build a clean dataset
• Using Inclusion / exclusion criteria
• Visualizing data distributions
• Handling data outliers
• Cleaning character variables
• Dealing with missing data
• Creating data checkpoints
Healthcare innovation | Survivor care | Family assistance
Population data | Policy development | Education | Research
THANK YOU!
Healthcare innovation | Survivor care | Family assistance
Population data | Policy development | Education | Research