Clinical Data Getting Started Overview

Guidebook for using OAI dataset (downloaded from OAI website)

Getting Started with OAI Data: Overview of Structure, Use, and Conventions

Participant demographics
Publication guidelines for OAI public datasets
Versioning of releases
SAS datasets available
Categories and subcategories
Variable guides
Dates of measurements in OAI and calculating time between measurements
Changes and corrections to clinical data and images
Release comments
Combining two or more datasets with one record per participant
Unreleased variables
Variable naming conventions
Variable formats
SAS special missing value codes
ASCII files for non-SAS users
Downloading OAI datasets
Annotated forms

Participant demographics
The following three tables show sex, subcohort, age and race distributions for enrolled
OAI participants.

Table 1. Number of enrollees by sex and subcohort


                                   Male         Female      Total
                                   N      %      N      %       N
Progression subcohort            597  42.9%    793  57.1%    1390
Incidence subcohort             1348  41.0%   1936  59.0%    3284
Non-exposed control subcohort     47  38.5%     75  61.5%     122
Total                           1992  41.5%   2804  58.5%    4796

[Link] 9/9/2015 1
Table 2. Number of enrollees by age, sex and subcohort
Age Groups (years, at time of enrollment)
                   45-49         50-59         60-69         70-79       Total
                    N      %      N      %      N      %      N      %       N
Progression subcohort
  Male             70  11.7%    220  36.8%    158  26.5%    149  25.0%     597
  Female           86  10.8%    242  30.5%    286  36.1%    179  22.6%     793
  Total           156  11.2%    462  33.3%    444  31.9%    328  23.6%    1390
Incidence subcohort
  Male            175  13.0%    503  37.3%    325  24.1%    345  25.6%    1348
  Female          180   9.3%    643  33.2%    669  34.6%    444  22.9%    1936
  Total           355  10.8%   1146  34.9%    994  30.3%    789  24.0%    3284
Non-exposed control subcohort
  Male             14  29.8%     16  34.1%     12  25.5%      5  10.6%      47
  Female           24  32.0%     38  50.7%     13  17.3%                    75
  Total            38  31.1%     54  44.3%     25  20.5%      5   4.1%     122
Overall
  Male            259  13.0%    739  37.1%    495  24.8%    499  25.1%    1992
  Female          290  10.4%    923  32.9%    968  34.5%    623  22.2%    2804
  Total           549  11.4%   1662  34.7%   1463  30.5%   1122  23.4%    4796

Table 3. Number of enrollees by race, sex and subcohort


                   White       Non-white     Total
                    N      %      N      %       N
Progression subcohort
  Male            475  79.6%    122  20.4%     597
  Female          499  63.0%    293  37.0%     792
  Total           974  70.1%    415  29.9%    1389*
Incidence subcohort
  Male           1150  85.5%    195  14.5%    1345
  Female         1553  80.3%    382  19.7%    1935
  Total          2703  82.4%    577  17.6%    3280**
Non-exposed control subcohort
  Male             41  87.2%      6  12.8%      47
  Female           72  96.0%      3   4.0%      75
  Total           113  92.6%      9   7.4%     122
Overall
  Male           1666  83.8%    323  16.2%    1989
  Female         2124  75.8%    678  24.2%    2802
  Total          3790  79.1%   1001  20.9%    4791
* Race data missing for 1 progression subcohort participant
** Race data missing for 4 incidence subcohort participants

Publication guidelines for OAI public datasets
The OAI requires appropriate use of the OAI public datasets by the scientific community. The
objectives of the guidelines are to ensure appropriate citation and acknowledgment of OAI as
the source of the data used in publications, promote collaboration, assist investigators in
avoiding unintended duplication of effort, and foster high quality and creative publications
based on valid use and interpretation of the data.

Versioning of releases
Each release of clinical data in OAI is given a unique version number. The first position
(before the first decimal) identifies the clinic visit cycle (baseline=0, 12-month follow-up
=1, 18-month interim=2, 24-month follow-up=3, etc.). The second position (between the
decimals) indicates the group of participants included in the release (1=first half of cohort
enrolled *, 2=entire enrolled cohort). The third position (after the second decimal)
indicates the sequential number of the clinical data release for that visit cycle, beginning
with 1. Thus the first release of the baseline clinical data for the first half of the OAI
cohort is numbered 0.1.1 (for further discussion of the stepwise release of OAI data, see
[Link]). The variable VERSION, included in each dataset,
serves to track which release the user has downloaded.

Similarly, each image release was given a unique version code consisting of a number, a
letter, and a number. The first position (before the decimal) identifies the clinic visit
cycle, as above. The letter in the second position (between the decimals) identifies the
group of participants selected to have their images included in the release (A=first group,
B=second group, and so on). The third position (after the second decimal) indicates the
sequential number of the release for a particular image. For example, one screening knee
x-ray released originally as 0.A.1 was later discovered to belong to another participant.
That image has now been replaced with an image correctly assigned to that participant.
This replacement image has the version code 0.A.2 and the old image has been removed
from the image set. The images that are not replaced will stay at version 0.A.1. Users
who request Image Group A after the release of 0.A.2 will receive the most up-to-date
images for Image Group A, all but one of which will be 0.A.1 images (those that have not
been replaced), and the 0.A.2 image (that which replaces the old image). Similarly, if
images initially unavailable become available (e.g., full limb radiographs from Image
Group B), these will also be given the new version number (in this case, 1.B.2). Again,
the original images will remain at 1.B.1. New requests for Image Group B will include
0.B.1, 1.B.1, and 1.B.2 images.

* Because the first half of the cohort (defined as enrolled before 4/30/05) is sometimes released before all
data cleaning is complete, a participant enrolled in the first half of the study and thus eligible for release
may be held back because their data are not clean. When the entire cohort is released, data cleaning will be
complete and all participants will be included. In addition, any participants whose images were released in
the first image set (V00IMAGESA=3) are also included in the “first half,” even if they were enrolled a little
later. And, finally, five participants enrolled before 4/30/05 whose enrollment status was confirmed very
late are considered to be in the second half for data release.

Each release of image assessment data in OAI is given a unique version number. The first
position (before the first decimal) identifies the clinic visit cycle, as above. The second
position (after the first decimal) indicates the sequential number of the image assessment
data release for that visit cycle, beginning with 1. Thus, the first release of the baseline
image assessment data is numbered 0.1.

Each release of Enrollees, Outcomes and Measurement Inventory data in OAI is given a
unique version number. The number indicates the sequential number of the release for the
dataset, beginning with 1.
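
For analysts who track releases programmatically, the version schemes described above can be split into their positions. The following Python sketch is illustrative only. It assumes the version code is available as text in the dotted form used in the examples (such as 0.1.1, 0.A.2 or 0.1); the names ReleaseVersion and parse_version are hypothetical and are not part of any OAI tool. The single-number versions used for the Enrollees, Outcomes and Measurement Inventory datasets are not handled.

```python
import re
from typing import NamedTuple, Optional


class ReleaseVersion(NamedTuple):
    kind: str             # "clinical", "image", or "image_assessment"
    visit_cycle: int      # 0 = baseline, 1 = 12-month follow-up, 2 = 18-month interim, ...
    group: Optional[str]  # "1" or "2" (clinical), "A", "B", ... (image), None otherwise
    sequence: int         # sequential release number, beginning with 1


# Hypothetical patterns, written from the examples in the text above.
_PATTERNS = [
    ("clinical", re.compile(r"^(\d+)\.([12])\.(\d+)$")),
    ("image", re.compile(r"^(\d+)\.([A-Z])\.(\d+)$")),
    ("image_assessment", re.compile(r"^(\d+)\.(\d+)$")),
]


def parse_version(code: str) -> ReleaseVersion:
    """Split a dotted version code into visit cycle, group and sequence."""
    text = code.strip()
    for kind, pattern in _PATTERNS:
        match = pattern.match(text)
        if match is None:
            continue
        if kind == "image_assessment":
            # Two positions only: visit cycle and sequence number.
            return ReleaseVersion(kind, int(match.group(1)), None, int(match.group(2)))
        return ReleaseVersion(kind, int(match.group(1)), match.group(2), int(match.group(3)))
    raise ValueError(f"unrecognized version code: {code!r}")
```

With these assumptions, parse_version("0.1.1") describes the first clinical release of baseline data for the first half of the cohort, and parse_version("0.A.2") describes the second release of a baseline image in Image Group A.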

The clinical data from the same visit cycle at which the released images were acquired
are always available from the latest version (highest numbers in the second and third
positions) of the clinical data set for that visit cycle. (Images for a particular participant
are not released until their images have been cleaned and their clinical data have been
released.) Similarly, the image assessment data from the same visit cycle as the clinical
data, or from the visit cycle at which the released images were acquired, are available as
the latest version of image assessment data for that visit cycle.

IMPORTANT NOTE: Initially, data for the first half of the cohort were usually released
early (before that specific visit cycle was complete); these data are suitable only for
preliminary analyses and planning. Because various subgroups of participants were
not enrolled at a steady pace throughout the enrollment phase of the study (e.g., the
majority of the progression cohort was enrolled during the first half of the study,
non-exposed controls almost entirely near the end, and black participants
predominantly late in the enrollment phase) the user will only get a representative
sample of the cohort by analyzing data for the entire cohort. If using previously
released data for the first half of the cohort, users are encouraged to re-run their
analyses on data for the entire cohort.

SAS datasets available


Clinical data from each visit are grouped into several SAS datasets based on the type of
data they contain. Most of the clinical data are also included in a single dataset that serves
as an alternative to the multiple, smaller datasets. Datasets of the same type from
different visits have different 2-digit suffixes. For example, Xray00 contains the x-ray
meta-data from the baseline visit, while Xray01 contains the x-ray meta-data from the 12-
month follow-up visit.

Each dataset has one record per participant, except for the MIF, Xray, MRI, and image
assessment datasets, which may have multiple records per participant. Every dataset is
sorted by the key variable ID. All variables other than ID, SIDE (image-assessment data
only), READPRJ (image-assessment data only) and VERSION are contained in only a
single dataset.

Previously, only the datasets that grouped data based on type were available. As of
releases 2.2.2, 3.2.1 and 4.2.1, new datasets named AllClinicalXX are available for each
visit. The AllClinicalXX datasets combine the data from all the clinical datasets with one
row per participant for that visit. Data from the MIF, MRI and Xray datasets are not
included in the AllClinicalXX datasets, because those datasets have multiple
rows per participant. Data from Enrollees are not included because the Enrollees dataset is
updated for each data release.

The clinical data are grouped into datasets as follows:

• Enrollees – a visit-independent dataset, released with each clinical data release
starting with 1.1.1. This dataset contains grouping information that will not
change from year to year, such as race (P02RACE), Hispanic/Latino ethnicity
(P02HISP), sex (P02SEX), cohort assignment (V00COHORT), site
(V00SITE), and imaging groups to which the participant belongs
(V00IMAGESA, V00IMAGESB, etc.). At the time of release 0.1.1 some of
these variables were released as part of SubjectChar00. As of release 0.2.1, all
of these variables can be found only in Enrollees.
• AllClinicalXX – combines all the data from AccelerometryXX,
BiomarkersXX, JointSxXX, MedHistXX, NutritionXX, PhysExamXX, and
SubjectCharXX.
• AccelerometryXX – contains accelerometry data from Dr. Dorothy Dunlop’s
ancillary study AS05-10.
• BiomarkersXX – contains biospecimen collection information, biospecimen
assay results, and clinical readings of the screening knee x-rays.
• JointSxXX – contains self-reported data about joint symptoms, disability, and
function.
• MedHistXX – contains self-reported general health history other than specific
prescription medication inventory data, but includes indicator variables for
various classes of medications used.
• MIFXX – contains one record per prescription medication ingredient used by
the participant within the 30 days prior to the visit.
• NutritionXX – contains nutrition variables collected using the modified Block
Food Frequency questionnaire (baseline only) and other interview questions.
• MRIXX – contains meta-data related to the MRI imaging, with one record per
expected series for each participant belonging to a released image group.
• PhysExamXX – contains physical measurements of participants performed in
the clinic, such as anthropometry, knee examination, physical performance
measures, etc.
• SubjectCharXX – contains demographic data for all enrolled participants
collected at that visit.
• XrayXX – contains meta-data related to the radiograph imaging, with one
record per expected radiograph image for each participant belonging to a
released image group.

where XX is:
00 for the baseline visit
01 for the first annual (12-month) follow-up visit
02 for the 18-month interim visit
03 for the second annual (24-month) follow-up visit
04 for the 30-month interim visit
05 for the third annual (36-month) follow-up visit
06 for the fourth annual (48-month) follow-up visit
07 for the fifth annual (60-month) follow-up telephone contact
08 for the sixth annual (72-month) follow-up visit
09 for the seventh annual (84-month) follow-up telephone contact
10 for the eighth annual (96-month) follow-up visit
11 for the ninth annual (108-month) follow-up telephone contact
99 for outcomes
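
When many files are handled in a script, the suffix list above can be kept as a lookup table. The Python sketch below is one possible way to do this; the function names dataset_name and describe_dataset are hypothetical. The table only translates the suffix and does not imply that every dataset type exists for every visit.

```python
# Two-digit dataset suffixes, transcribed from the list above.
VISIT_SUFFIXES = {
    "00": "baseline visit",
    "01": "12-month follow-up visit",
    "02": "18-month interim visit",
    "03": "24-month follow-up visit",
    "04": "30-month interim visit",
    "05": "36-month follow-up visit",
    "06": "48-month follow-up visit",
    "07": "60-month follow-up telephone contact",
    "08": "72-month follow-up visit",
    "09": "84-month follow-up telephone contact",
    "10": "96-month follow-up visit",
    "11": "108-month follow-up telephone contact",
    "99": "outcomes",
}


def dataset_name(base: str, suffix: str) -> str:
    """Build a dataset name such as PhysExam03 from its type and visit suffix."""
    if suffix not in VISIT_SUFFIXES:
        raise KeyError(f"unknown visit suffix: {suffix!r}")
    return f"{base}{suffix}"


def describe_dataset(name: str) -> str:
    """Translate the trailing two digits of a dataset name into a visit description."""
    base, suffix = name[:-2], name[-2:]
    if suffix not in VISIT_SUFFIXES:
        raise KeyError(f"no visit suffix found in: {name!r}")
    return f"{base} data, {VISIT_SUFFIXES[suffix]}"
```

For example, dataset_name("PhysExam", "03") gives PhysExam03, the physical examination data from the 24-month follow-up visit.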

For the analyst who wants the universe of data, the AllClinicalXX datasets provide an
alternative to having to download multiple, separate datasets. AllClinicalXX should be
downloaded for the visit of interest, along with MIFXX, MRIXX and XrayXX. Since the
AllClinicalXX dataset contains most of the clinical data, it is a very large dataset. If only
a small subset of variables is of interest, the smaller dataset the variables reside in can be
downloaded for the visit of interest. For example, if the analyst is interested in only the
knee examination variables, the PhysExamXX dataset for the visit of interest should be
downloaded. Although the knee examination variables for that visit are also in
AllClinicalXX, the PhysExamXX dataset is much smaller in size and, therefore, faster to
download and easier to work with. In addition, the analyst should always download the
most recent version of the Enrollees dataset when working with OAI data.

More detailed information about each of these datasets can be found in the description
field of the Datasets table. These dataset descriptions apply to all visits and all releases.
A full list and description of all variables contained in each dataset is in the
documentation file along with each dataset. All variables are also listed in the
variable guides (see below).

The reasons for splitting the data into separate, smaller functional datasets are:
1. Smaller datasets are more manageable.
2. Updates/fixes/changes can be isolated to one smaller dataset instead of having to
re-release all variables each time. If the change is in a one-row-per-participant
dataset, the corresponding AllClinicalXX dataset would also be re-released.
− As biomarker assays, imaging measurements, and other secondary processing
of collected data become available, it is easier to release these data in a timely
manner if the entire universe of data does not need to be re-released.
− Errors can be corrected and the data re-published more readily.
3. Some variables (such as medications, x-ray image meta-data, MRI image meta-
data, and image assessments) would have to be distributed as separate datasets in
any model, since they contain more than one record per participant, while all
other datasets contain one record per participant.

An attempt was made to make these groupings of variables into separate datasets
functional. That is, variables in each dataset tend to be related in such a way that if an
analyst is interested in one variable in the dataset, then they are more likely to be
interested in other variables in that same dataset. For example, variables related to
physical activity are grouped together in the Subject Characteristics dataset; variables
related to a participant’s health and medical history are grouped together in the Medical
History dataset; and so on.

Of course, there are disadvantages to separating the data in this way. Analysts who
download several smaller datasets instead of an AllClinicalXX dataset will have to merge
datasets, but this is ameliorated by having datasets sorted by the same key variable (ID),
by having only one record per participant in all datasets except MIF, Xray, MRI, and the
image assessment datasets and by having unique variable names for all variables released
in OAI (no need to worry about over-writing data when merging).

Image assessment data are grouped into datasets by vendor and visit. Please see the
“Overview and Description of Central Image Assessments” for more information about
how image assessment data are organized.

Outcomes data are grouped into one dataset that is not visit specific.

Measurement Inventory data are also grouped into one dataset that is not visit specific.

OA Biomarkers Consortium FNIH Project data are grouped by vendor and visit. Please
see the “Overview and Description of OA Biomarkers Consortium FNIH Project”.

Categories and subcategories


As well as grouping variables into datasets, the variables are categorized in various ways
to help analysts quickly home in on the group of variables of greatest interest to them. A
variable can belong to multiple categories because it may be useful in different types of
analysis. To further pinpoint the most useful group of variables in a broader category of
variables of interest, the categories are further broken down into subcategories. Again, a
variable may be in more than one subcategory within the same category, and the same
subcategory may appear in more than one category grouping.

This organization of the variables is intentionally independent of the grouping into SAS
datasets, but in many cases, particular categories are entirely contained in a single SAS
dataset. In other cases, for example medications, a category may be split across datasets.

Some image assessment, Measurement Inventory, and OA Biomarkers Consortium FNIH
Project variables have not been assigned to categories and subcategories.

Variable guides
Every variable is listed in two variable guides, VG_Variable.pdf (sorted alphabetically
by variable name) and VG_Form.pdf (organized by data collection form). See the
Variable Guide tutorial (VG_Tutorial.pdf).

The guides list all released variables (with the exception of some image assessment,
Measurement Inventory, and OA Biomarkers Consortium FNIH Project variables) sorted
in two different ways: alphabetically by variable name, and by the form on which each
variable was collected. Although image assessment data are not collected on data
collection forms, these data are included in the variable guides under a descriptive heading.

The variable guides also give frequencies and formats (for categorical variables),
univariate statistics (for continuous variables), the SAS dataset name, and the variable
label. (The variable label is the same as the SAS label as seen in the datasets.) When
working with the datasets, analysts are encouraged to always output and view SAS
variable labels in their entirety to ensure important information about the variables is not
lost. The maximum SAS label length used in OAI is 160 characters.

The Variable Guides also indicate whether a release comment is available for each
variable (see below).

Dates of measurements in OAI and calculating time between measurements


Ideally, the clinic visit measurements, biospecimen collection, and imaging should all
occur on the same day for each participant contact, and each annual contact should occur
365 days after the previous one. However, this is operationally impossible. Please keep
the following information in mind when analyzing time-sensitive data:

1) SAS dates. All dates in the OAI datasets are SAS dates, and are represented as the
number of days since January 1, 1960. Thus, even if SAS is not the analytic software
being used, the number of days between any two dates can easily be calculated as the
difference between the two SAS dates. For example the number of days between the
Enrollment Visit and the 12-month follow-up visit can be calculated as V01FVDATE-
V00EVDATE. Variables for some frequently used intervals have been programmed. For
example, V02VISDYS is the number of days between the Enrollment Visit and the 18-
month interim visit. V01MRSURDY is the number of days between the 12-month MRI
and the most recent reported surgery.

2) Visit windows. The clinics were asked to schedule participants for annual follow-up
visits within + 45 days of the anniversary of their Enrollment Visit date. However, if this
was not possible, participants could be scheduled up until 135 days after the anniversary
of their Enrollment Visit. After this point, the clinics were instructed to stop attempting to
schedule the missing contact and concentrate on scheduling the next contact. Sometimes
this guideline was not strictly followed, and in very rare cases, the participant was
scheduled within the visit window for the following visit. If that was the only contact
they had during that following visit window, the visit was reassigned. For example, one
participant had an “18-month” interim visit during the window for their 24-month follow-
up visit, and they did not have their 24-month follow-up visit (a 24-month Missed
Follow-Up Contact form was entered into the data system). The 18-month interim visit
was reassigned to 24-months, and the Missed Follow-up Contact form was deleted. In
these rare cases, a participant may be missing much of the data for the visit they were
reassigned to.

3) Imaging. The clinics were asked to schedule imaging associated with a visit within +7
days of the clinic visit, although no shows and re-scheduled appointments frequently
resulted in gaps of more than 7 days. Full limb x-rays were very difficult to schedule at
some clinics, so there was a long delay between the clinic visit and imaging. When an
analysis is time-sensitive (e.g., a comparison of bone marrow edema and pain), it is
important to calculate the time lag between the two events (imaging and questionnaire)
by subtracting the relevant dates.

4) Specimen collection. If a participant was not fasting at the time of the clinic visit, had
a difficult blood draw and either no blood or insufficient blood (or urine) was obtained,
they were asked to return to the clinic for a repeat specimen collection. It is important to
use the correct specimen collection date (VxxPDATE1 or 2 or VxxUCDATE1 or 2, for
blood and urine, respectively) for any time-dependent analyses. For example, if blood
was obtained only on the second attempt at baseline and blood was obtained only on the
first attempt at the 12-month follow-up visit, then the number of days between the two
blood samples would be V01PDATE1-V00PDATE2.
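
Outside SAS, the date arithmetic described in item 1 can be reproduced by treating each date as a count of days from January 1, 1960. The Python sketch below is a minimal illustration. The helper names and the example dates are invented for this purpose, and it assumes that missing dates have been read in as None.

```python
from datetime import date, timedelta
from typing import Optional

SAS_EPOCH = date(1960, 1, 1)  # SAS date value 0


def sas_to_date(sas_days: int) -> date:
    """Convert a SAS date value (days since January 1, 1960) to a calendar date."""
    return SAS_EPOCH + timedelta(days=sas_days)


def date_to_sas(day: date) -> int:
    """Convert a calendar date to a SAS date value."""
    return (day - SAS_EPOCH).days


def days_between(later: Optional[int], earlier: Optional[int]) -> Optional[int]:
    """Difference of two SAS date values; None if either date is missing."""
    if later is None or earlier is None:
        return None
    return later - earlier


# Invented example record: enrollment on 2005-03-01, 12-month visit on 2006-03-15.
record = {
    "V00EVDATE": date_to_sas(date(2005, 3, 1)),
    "V01FVDATE": date_to_sas(date(2006, 3, 15)),
}
interval = days_between(record["V01FVDATE"], record["V00EVDATE"])  # 379
```

The same subtraction applies to the specimen example in item 4: the interval is the 12-month first-attempt date minus the baseline second-attempt date, with the analyst choosing the attempt that actually produced the specimen.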

Changes and corrections to data and images


Major changes made to the data, including data fixes, that have occurred between
releases are noted in the Data Changes and Corrections document. The document lists
variables that have been unreleased or modified, including when dataset assignments for
variables have changed. Further details about the change are noted in the release
comment for the pertinent variable(s). Changes and/or fixes applied to calculated
variables are also noted in the summary or revision section in the SAS code header. SAS
code can be found in release comments, which are located in the documentation and in
Search/Browse (see Release comments section below).

In addition, please see the Changes to Released Images document for information
on corrections, additions and deletions that have been made to OAI images for any
given image set.

Release comments
A variable has a release comment when it is calculated or derived from collected
variables, when there are caveats regarding its usage, when changes and/or fixes have been
made to the variable, the data or the meta-data associated with the variable, or when there
is any additional information that may make the variable easier to use in analysis. The
variable guides indicate whether there is a release comment available for each variable. If
the variable was calculated or derived from collected variables, the SAS code used to
create the variable is included in the release comments.

Combining two or more datasets with one record per participant


Any two or more SAS datasets with one record per participant can be easily merged as
follows:
DATA COMBINED;
MERGE SUBJECTCHAR00 JOINTSX00 PHYSEXAM00;
BY ID;
RUN;

This code will merge the subject characteristics variables, the joint symptom/function
variables, and the physical measurements by ID, to create the dataset COMBINED with
one record per participant and all variables from the three datasets. The datasets are
already sorted by ID, and all variables other than ID are unique to their dataset, so no
other processing is necessary before the merge.
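
For analysts working outside SAS, the same one-record-per-participant merge can be sketched in plain Python. This is an illustration under stated assumptions: each dataset has been read into a list of dictionaries keyed by variable name, ID is present in every record, and the variable names SC_VAR, JS_VAR and PE_VAR and their values are placeholders rather than OAI variables. Like the SAS MERGE, the sketch keeps every ID found in any input. Unlike SAS, a variable that is absent for an ID is simply left out of that record rather than set to a missing value.

```python
def merge_by_id(*datasets, key="ID"):
    """Combine one-record-per-participant datasets on a shared key variable."""
    merged = {}
    for rows in datasets:
        seen = set()
        for row in rows:
            if row[key] in seen:
                # This sketch is only meant for one-record-per-participant data.
                raise ValueError(f"more than one record for {key}={row[key]!r}")
            seen.add(row[key])
            # Variable names are unique across datasets, so nothing is overwritten.
            merged.setdefault(row[key], {}).update(row)
    # Return the records sorted by the key, as the SAS datasets are.
    return [merged[k] for k in sorted(merged)]


# Placeholder data standing in for SUBJECTCHAR00, JOINTSX00 and PHYSEXAM00.
subjectchar = [{"ID": 1, "SC_VAR": 10}, {"ID": 2, "SC_VAR": 20}]
jointsx = [{"ID": 1, "JS_VAR": 3}, {"ID": 2, "JS_VAR": 4}]
physexam = [{"ID": 2, "PE_VAR": 71.5}]

combined = merge_by_id(subjectchar, jointsx, physexam)
```

Raising an error on a repeated ID is a design choice in this sketch. It guards against accidentally passing in a multiple-records-per-participant dataset such as MIF, Xray or MRI.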

There may be times when an analyst wishes to keep multiple records per participant from
one of the three datasets that have more than one record per ID, while merging with other
datasets. For example, to add ENROLLEES to the MRI data, one could use the following
code:
DATA COMBINED;
MERGE ENROLLEES MRI00;
BY ID;
RUN;

The resulting dataset COMBINED will contain as many records for each participant as
were originally in MRI00. The variables from ENROLLEES will be copied onto each
record for that ID. Thus, any statistics calculated from ENROLLEES variables in
COMBINED will be misleading. Generally, it is safer to subset the multiple-records-per-
participant dataset to get just the records of interest before merging with the one-record-
per-participant dataset. For example, if the analyst wished to look at the distribution of
the subcohort assignment for all participants with an available enrollment visit pelvis
x-ray, with a view to choosing a random sample from each cohort, the following DATA step
would work well:
DATA TEMP;
MERGE XRAY00 (IN=INA WHERE=(V00EXAMTP='AP Pelvis' AND
V00XRCOMP=1))
ENROLLEES (IN=INB KEEP=ID V00COHORT);
BY ID;
IF INA AND INB;
RUN;
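
The subset-then-merge pattern can also be sketched in Python for readers not using SAS. The sketch assumes the Xray and Enrollees data have been read into lists of dictionaries, and that V00EXAMTP and V00XRCOMP hold the same values used in the SAS WHERE clause; in an ASCII export the values may be coded differently. The function names and the sample values are hypothetical.

```python
from collections import Counter


def pelvis_xrays_with_cohort(xray_rows, enrollee_rows):
    """Keep matching x-ray records and attach V00COHORT from Enrollees."""
    cohort_by_id = {row["ID"]: row["V00COHORT"] for row in enrollee_rows}
    kept = []
    for row in xray_rows:
        # Same condition as the SAS WHERE clause above.
        wanted = row.get("V00EXAMTP") == "AP Pelvis" and row.get("V00XRCOMP") == 1
        # Same effect as IF INA AND INB: the ID must be in both datasets.
        if wanted and row["ID"] in cohort_by_id:
            kept.append({**row, "V00COHORT": cohort_by_id[row["ID"]]})
    return sorted(kept, key=lambda row: row["ID"])


def cohort_distribution(rows):
    """Count participants per cohort, counting each ID only once."""
    cohort_of = {row["ID"]: row["V00COHORT"] for row in rows}
    return Counter(cohort_of.values())
```

Counting each ID once in cohort_distribution avoids the problem described above, where Enrollees values copied onto several image records would otherwise be counted several times.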

Merging two datasets where both have multiple records per participant: This tends to
have unexpected results and is not advised except when the analyst is very familiar with
merging datasets in SAS.

Please see the “Overview and Description of Central Image Assessments” for more
information about how to merge image assessment data with other OAI data.

Unreleased variables
Not all variables collected are released. In some cases, collected variables are replaced by
calculated variables that combine data from more than one instrument. For example, the MRI
pre-screener and MRI safety screener questions are asked twice if the participant gets an
MRI, but only once if the pre-screener determines that they are not eligible; the
calculated variable gives the most up-to-date information from the two screeners. In
other cases, calculated variables that bring information down from the parent variable to
the follow-up child variable replace the child variable. Another reason for not releasing a
variable is that a question is asked repeatedly (e.g., history of knee replacement), and
only one place is considered the gold standard.

In addition, variables that might unmask the identity of a participant are not released. In
some cases, a variable is potentially identifying when the dataset for the entire cohort is
not complete (first release of each visit), and may be released when data from the entire
cohort are released. Therefore, a potentially identifying variable may have a differing
release status depending on the visit. For instance, it may be unreleased for a visit where
data are available for the first half of the cohort only, and released for a visit where data
are available for the entire cohort. Sometimes the variable will simply stay unreleased,
even when the data from the entire cohort are released, if there is no reasonable way to
combine categories and the data are considered not likely to be used in analysis.

Some calculated variables were created after several visits’ data had been released. These
variables are available for the most recent visit released, but not for previously released
visits. Some variables are simply bookkeeping variables used to ensure that the examiner
follows the correct visit flow, and these are also not released. Finally, some variables are
collected as part of an Ancillary Study and will not be released until the Principal
Investigator has had a chance to review and analyze them. Unreleased variables are shown
on the annotated forms in grey strikethrough text (see Annotated Forms, below).

Sometimes a variable will remain available, but with certain unusual values suppressed. For
categorical variables, the number of participants with a particular value of the variable
may be so low that participants could potentially be identified. Sometimes
two response categories are combined to resolve this problem (e.g., P02RACE combines
all non-white categories other than Black, or African-American, and Asian). For
continuous variables, potentially identifying data are set to a special missing value, and
the decision rule for what is considered potentially identifying is given in the release
comment (see section below on SAS special missing value codes).

Variable naming conventions


OAI variable names consist of a prefix and a root. The prefix consists of a letter and two
numbers. A “P” indicates a contact (Initial Eligibility Interview or Screening Visit) that
occurred prior to the Enrollment Visit, and a “V” indicates a regular in-clinic visit for the
clinical data, starting with the Enrollment Visit. Visits are numbered starting with
enrollment at time 00, 12-month follow-up at time 01, etc., and this designation serves as
the numeric part of the prefix. Pre-enrollment contacts are numbered negatively with
increasing length of time before the Enrollment Visit, with the Screening Visit numbered
as -01, and the initial eligibility interview numbered as -02. These numbers, without the
minus sign, also serve as the numeric part of the prefix. Thus, P02 is used for variables
collected on the Initial Eligibility Interview; P01 is used for variables collected at the
Screening Visit, and V00 is used for variables collected at the Enrollment Visit.

The prefix + root is unique within a visit year, while the root always represents the same
information in different visit years. Thus, participant weight in kilograms at baseline
(measured at the Screening Visit) and at the 12-month follow-up are called P01WEIGHT
and V01WEIGHT, respectively. An attempt was made to make the root user friendly and
intuitively meaningful, but length constraints and the necessity for the roots to be unique
make that not always possible.
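
As a minimal sketch of how the prefix + root convention can be used, the SAS step below computes the change in weight between the Screening Visit and the 12-month follow-up. The dataset names (weights, weight_change), the participant identifier ID, the derived variable WTCHG, and all data values are assumptions made for illustration; only P01WEIGHT and V01WEIGHT come from the text above.

```sas
/* Hypothetical one-record-per-participant data; all values are invented. */
data weights;
  input ID P01WEIGHT V01WEIGHT;
  datalines;
9000001 80.5 79.0
9000002 65.2 66.1
;
run;

data weight_change;
  set weights;
  /* Same root (WEIGHT) under two prefixes:            */
  /* P01 = Screening Visit, V01 = 12-month follow-up.  */
  WTCHG = V01WEIGHT - P01WEIGHT;  /* WTCHG is an illustrative name, not an OAI variable */
run;

proc print data=weight_change;
run;
```

Because the prefix leads every variable name, a SAS name prefix list such as V01: (for example, in a KEEP statement) selects all variables from a single visit.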

V99 is used for the Outcomes data variables, which are not visit specific. For a complete
list of variable prefixes, see [Link].

Variable formats
Most categorical variables have a format assigned that gives a description of each value
used for that variable. In the SAS datasets, these formats are already attached and will
display on any SAS-generated output, as long as the Formats catalog has been
downloaded and properly defined (see below). Format assignments are listed in the
contents document (SAS proc contents) zipped with the documentation that accompanies
each dataset and are also shown in the frequencies contained in the variable guides.

To access these formats in SAS, please make sure you have downloaded the latest
Formats catalog (formats.sas7bcat) along with the datasets. Include a line in your SAS
program/SAS session that defines the location of this formats library:

libname LIBRARY '[pathname]';

For instance, if you have the formats.sas7bcat file on your C: drive in a folder called OAI
Formats then your libname statement would be:

libname LIBRARY 'c:\OAI Formats';

This will work for both batch and interactive sessions of SAS.

If you prefer not to use the formats, you can type the following line in your SAS program
to bypass the format errors:

options nofmterr;

This will cause SAS to issue notes, rather than errors, when the formats are not
found, and to display the unformatted values.

If you are not using SAS, the formats will not be automatically attached to each variable.
However, you can use the Variable Guides to see a description of all variable values.

SAS special missing value codes


SAS supports special missing values (a period followed by a letter, such as .A or .R),
which distinguish missing values that arise from different causes. This allows some
flexibility in handling missing values. For example, an analyst may want to treat refused
performance tests as though the participant was unable to perform them. Importantly,
laboratory assay results that fall outside the linear range of the assay, and so cannot be
determined accurately, are assigned .B (below the range) or .H (above the range). The
analyst can then use these data, either by replacing them with a standard value (e.g.,
replace .B with half the lower limit of the assay's linear range) or by dividing the
data into quantiles, with .B values included in the lowest quantile and .H values included
in the highest quantile.
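
The substitution approach described above might be sketched as follows. This is an illustration only: the dataset names, the variables ASSAYVAL, LOWLIMIT, and ASSAYIMP, the limit of 2.0, and all data values are hypothetical and do not refer to any actual OAI assay.

```sas
/* Invented assay values; ASSAYVAL is a hypothetical variable name. */
data assay;
  ID = 1; ASSAYVAL = 5.3; output;
  ID = 2; ASSAYVAL = .B;  output;  /* below the linear range */
  ID = 3; ASSAYVAL = .H;  output;  /* above the linear range */
run;

data assay_imputed;
  set assay;
  LOWLIMIT = 2.0;  /* hypothetical lower limit of the linear range */
  /* Substitute half the lower limit for values flagged .B. */
  if ASSAYVAL = .B then ASSAYIMP = LOWLIMIT / 2;
  /* All other values, including .H, are carried over unchanged. */
  else ASSAYIMP = ASSAYVAL;
run;
```

Writing the result to a new variable keeps the original .B and .H codes available, so the quantile approach can still be applied to the same data.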

The baseline questionnaire and exam data for the first half of the cohort (version 0.1.1)
had the values 88 assigned to Don't Know, 77 assigned to Refused, and 99 assigned to
Don’t Do for categorical variables and some calculated variables. This was changed
starting with the 12-month questionnaire and exam data (version 1.1.1) so that .D is
assigned to Don't Know/Unknown/Uncertain, .R is assigned to Refused, and .X is
assigned to Don’t Do for all categorical and calculated variables (see below). The
baseline data were updated with these latest conventions in release version 0.2.1.

The following missing values have been assigned to OAI data:

. = ".: Missing Form/Incomplete Workbook" – assigned when the entire data collection
form is missing.
.A= ".A: Not Expected" – assigned when the variable is either expected to be missing
based on the skip pattern of the questionnaire or is one of a set of “check all that
apply” variables, where any particular response is not necessarily expected for
any particular participant. When .A is assigned to the joint replacement
adjudication/confirmation status Outcomes variables, it indicates “no
replacement reported in this joint.”
.B= ".B: Low/Below Range" – assigned when an exact value cannot be determined, but
the value is known to be below the detection limit or linear range of an assay.
.C= ".C: Cannot Do/Attempted: unable to complete" – assigned when a continuous
variable is missing but is linked to a categorical variable whose value is Cannot
do/Attempted: unable to complete.
.D= ".D: Don’t Know/Unknown/Uncertain" – assigned when a continuous variable is
missing but is paired with a categorical variable whose value is Don’t Know or
when a categorical or calculated variable has a possible value of Don’t Know,
Unknown, or Uncertain.
.E= ".E: Non-Exposed Control" – assigned when a measurement was not done because it
was not required for Non-Exposed Control participants, and the participant is in
that cohort.
.F= ".F: Not done, phone contact" – assigned to certain key measurement variables when
the entire measurement was not done because the participant would consent
only to a telephone contact (follow-up visits only).
.G= ".G: Unreleased high value" – assigned to high data values that are potentially
identifying. See the release comment in the documentation for the decision rule
used to set data to .G for a particular variable.
.H= ".H: High/Above range" – assigned when an exact value cannot be determined, but the
value is known to be above the linear range of an assay.
.I= ".I: Inadequate data" – assigned when a calculated value or score cannot be
determined because of missing data.
.K= ".K: Cannot do/not attempted, unable" – assigned when a continuous variable is
missing but is paired with a categorical variable whose value is Cannot do/Not
attempted, unable. Also assigned to calculated variables when the calculation is
not attempted because the choice of input variables for particular individuals is
best left to the discretion of the analyst.
.L= ".L: Permanently Lost" – assigned when a measurement was done but the data were
lost and cannot be retrieved.
.M= ".M: Missing" – assigned when the data collection form is received and a value is
expected but is missing. In the kMRI_BLKSBML_BICLxx datasets, .M
indicates that bone marrow lesions have merged.
.N= ".N: Not Required/Not edited" – assigned when a variable was not edited if missing.
.O= ".O: Not done, other reason" – assigned to certain key measurement variables when
the entire measurement was not done for some other reason (participant became
ill and left the visit early, etc.).
.P= ".P: Prosthetic" – assigned to indicate that the value is missing because the participant
has had a knee or hip replacement.
.R= ".R: Refused" – assigned when a continuous variable is missing but is paired with a
categorical variable whose value is Refused or when a categorical or calculated
variable has a possible value of Refused. Also assigned to certain key
measurement variables when the participant refused the entire measurement.
.S= ".S: Unreleased low value" – assigned to low data values that are potentially
identifying. See the release comment in the documentation for a particular
variable. In the kMRI_BLKSBML_BICLxx datasets, .S indicates that a bone
marrow lesion has split.
.T= ".T: Technical Problems" – assigned, usually to an electronic or laboratory
measurement that had to be thrown out due to a technical problem with the
measurement.
.U= ".U: Unable to examine" – assigned when the measurement or examination could not
be completed for miscellaneous reasons (e.g., participant had to leave).
.V= ".V: Missed visit" – assigned to certain follow-up data when a Missed Follow-up
Contact form was entered, indicating a missed visit.
.W= ".W: Impossible value" – assigned when the value originally recorded is impossible
or nearly impossible (e.g., fasts longer than 24 hours for biospecimen
collection). Assignment of .W to outlandish values will be ongoing, so the lack
of assignment does not mean that a value has been examined and considered
possible.
.X= ".X: Don’t do" – assigned when a categorical variable has a Don’t Do option.

In SAS, once special missing values are in use, a missing value is no longer necessarily
equal to ‘.’ in logical expressions. To test whether a value is missing for any reason, the
code should be written: <= .Z or alternately: le .Z To test whether a value is not
missing, the code should be written: > .Z or alternately: gt .Z

.Z is the greatest missing value available in SAS.
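
A short sketch of these comparisons, using invented data (the dataset names and the variables X, ISMISS, HASVAL, ISMISS2, and ONLYDOT are illustrative and not part of the OAI datasets). It also shows the built-in SAS MISSING function, which is not mentioned in the text above but gives the same result as the <= .Z test for numeric variables.

```sas
/* Invented values covering a real number, ordinary missing,
   and two special missing values. */
data demo;
  ID = 1; X = 12; output;
  ID = 2; X = .;  output;
  ID = 3; X = .R; output;
  ID = 4; X = .Z; output;
run;

data flagged;
  set demo;
  ISMISS  = (X <= .Z);   /* 1 for ., .R, .Z, and any other missing value */
  HASVAL  = (X > .Z);    /* 1 only when X holds an actual number */
  ISMISS2 = missing(X);  /* built-in function; agrees with ISMISS */
  ONLYDOT = (X = .);     /* 1 only for ordinary missing, not for .R or .Z */
run;

proc print data=flagged;
run;
```

The ONLYDOT column shows why a test against ‘.’ alone is not sufficient once special missing values are present.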

ASCII files for non-SAS users


To aid investigators who do not have access to SAS, we have included a pipe-delimited
ASCII file and a SAS transport dataset in the zipped dataset file (see below). Each ASCII
file or transport dataset corresponds to one of the SAS datasets. The ASCII files can be
examined directly in a database application. Some of these ASCII files are very large, so
they may have too many rows or columns to import completely into a spreadsheet
program such as Microsoft Excel, but should be easily handled by most database
applications, such as Access or dBase. It is important to note also that the SAS special
missing value codes (see above) cannot be represented meaningfully in either the
transport files or the ASCII files, so any missing value is represented simply as a blank.
Analysts who need to take advantage of the nuances of such special missing values as .G,
.H, .B, or .S should use SAS.

We strongly recommend that you use SAS V9 or higher to access OAI data. If you
have an earlier version of SAS, you must use the SAS transport file (.xpt); see below.

There are two files associated with each dataset. One, a documentation file, contains the
following PDF files: [Link] (proc contents output), [Link] (SAS frequencies and
univariates), [Link] (release comments about the variables included in the
dataset, including the SAS code, if calculated from collected variables), a dataset
description document, and other supporting documents. The SAS dataset zip file
contains a SAS V9 dataset (.sas7bdat), ASCII file, SAS transport file (.xpt), and
instructions for use of both the SAS transport dataset and the ASCII file. The SAS
datasets in the zip files were created using the V9 engine. As of releases 2.2.2 and 3.2.1,
the transport files are compatible with SAS V7 (previously, they were compatible with
SAS V6). This allows the variable names and labels to remain identical to those in the
documentation.

Annotated forms
Annotated forms include variable names and value labels for each released clinical
variable on the data collection forms. Please read the Annotation Form Conventions
document for a description of the annotation conventions.
