Clinical Data Getting Started Overview
Conventions
Participant demographics
The following three tables show sex, subcohort, age and race distributions for enrolled
OAI participants.
[Link] 9/9/2015 1
Table 2. Number of enrollees by age, sex and subcohort

Age groups (years, at time of enrollment)
               45-49         50-59         60-69         70-79       Total
                N      %      N      %      N      %      N      %       N
Progression subcohort
  Male         70  11.7%    220  36.8%    158  26.5%    149  25.0%     597
  Female       86  10.8%    242  30.5%    286  36.1%    179  22.6%     793
  Total       156  11.2%    462  33.3%    444  31.9%    328  23.6%    1390
Incidence subcohort
  Male        175  13.0%    503  37.3%    325  24.1%    345  25.6%    1348
  Female      180   9.3%    643  33.2%    669  34.6%    444  22.9%    1936
  Total       355  10.8%   1146  34.9%    994  30.3%    789  24.0%    3284
Non-exposed control subcohort
  Male         14  29.8%     16  34.1%     12  25.5%      5  10.6%      47
  Female       24  32.0%     38  50.7%     13  17.3%                    75
  Total        38  31.1%     54  44.3%     25  20.5%      5   4.1%     122
Overall
  Male        259  13.0%    739  37.1%    495  24.8%    499  25.1%    1992
  Female      290  10.4%    923  32.9%    968  34.5%    623  22.2%    2804
  Total       549  11.4%   1662  34.7%   1463  30.5%   1122  23.4%    4796
Publication guidelines for OAI public datasets
The OAI requires appropriate use of the OAI public datasets by the scientific community. The
objectives of the guidelines are to ensure appropriate citation and acknowledgment of OAI as
the source of the data used in publications, promote collaboration, assist investigators in
avoiding unintended duplication of effort, and foster high quality and creative publications
based on valid use and interpretation of the data.
Versioning of releases
Each release of clinical data in OAI is given a unique version number. The first position
(before the first decimal) identifies the clinic visit cycle (baseline=0, 12-month
follow-up=1, 18-month interim=2, 24-month follow-up=3, etc.). The second position (between the
decimals) indicates the group of participants included in the release (1=first half of cohort
enrolled *, 2=entire enrolled cohort). The third position (after the second decimal)
indicates the sequential number of the clinical data release for that visit cycle, beginning
with 1. Thus the first release of the baseline clinical data for the first half of the OAI
cohort is numbered 0.1.1 (for further discussion of the stepwise release of OAI data, see
[Link]). The variable VERSION, included in each dataset,
serves to track which release the user has downloaded.
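For analysts who check release versions programmatically, the three-position scheme described above can be decoded mechanically. The following Python sketch is illustrative only. It assumes the VERSION value is available as a dotted string such as "0.1.1", and the names ClinicalVersion, parse_clinical_version and covers_entire_cohort are hypothetical helpers, not part of any OAI tooling.

```python
from typing import NamedTuple


class ClinicalVersion(NamedTuple):
    visit_cycle: int   # 0 = baseline, 1 = 12-month follow-up, 2 = 18-month interim, ...
    cohort_group: int  # 1 = first half of cohort enrolled, 2 = entire enrolled cohort
    sequence: int      # sequential release number for that visit cycle, starting at 1


def parse_clinical_version(version: str) -> ClinicalVersion:
    """Split a clinical data version such as '0.1.1' into its three positions."""
    parts = version.strip().split(".")
    if len(parts) != 3:
        raise ValueError(f"expected three positions, got {version!r}")
    visit_cycle, cohort_group, sequence = (int(p) for p in parts)
    if cohort_group not in (1, 2):
        raise ValueError(f"unknown participant group in {version!r}")
    if sequence < 1:
        raise ValueError(f"release sequence must start at 1 in {version!r}")
    return ClinicalVersion(visit_cycle, cohort_group, sequence)


def covers_entire_cohort(version: str) -> bool:
    """True when the second position is 2 (entire enrolled cohort)."""
    return parse_clinical_version(version).cohort_group == 2
```

A check such as covers_entire_cohort("0.2.1") is one possible way to confirm that a downloaded file is an entire-cohort release rather than a first-half release.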
Similarly, each image release was given a unique version code consisting of a number, a
letter, and a number. The first position (before the decimal) identifies the clinic visit
cycle, as above. The letter in the second position (between the decimals) identifies the
group of participants selected to have their images included in the release (A=first group,
B=second group, and so on). The third position (after the second decimal) indicates the
sequential number of the release for a particular image. For example, one screening knee
x-ray released originally as 0.A.1 was later discovered to belong to another participant.
That image has now been replaced with an image correctly assigned to that participant.
This replacement image has the version code 0.A.2 and the old image has been removed
from the image set. The images that are not replaced will stay at version 0.A.1. Users
who request Image Group A after the release of 0.A.2 will receive the most up-to-date
images for Image Group A, all but one of which will be 0.A.1 images (those that have not
been replaced), and the 0.A.2 image (that which replaces the old image). Similarly, if
images initially unavailable become available (e.g., full limb radiographs from Image
Group B), these will also be given the new version number (in this case, 1.B.2). Again,
the original images will remain at 1.B.1. New requests for Image Group B will include
0.B.1, 1.B.1, and 1.B.2 images.
* Because the first half of the cohort (defined as enrolled before 4/30/05) is sometimes released before all
data cleaning is complete, a participant enrolled in the first half of the study and thus eligible for release
may be held back because their data are not clean. When the entire cohort is released, data cleaning will be
complete and all participants will be included. In addition, any participants whose images were released in
the first image set (V00IMAGESA=3) are also included in the “first half,” even if they were enrolled a little
later. Finally, five participants enrolled before 4/30/05 whose enrollment status was confirmed very
late are considered to be in the second half for data release.
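The number-letter-number image version codes described above can be decoded in the same spirit. This is a minimal Python sketch under the assumption that codes arrive as strings such as "0.A.2". ImageVersion, parse_image_version and codes_for_group are hypothetical names introduced for illustration.

```python
import re
from typing import List, NamedTuple


class ImageVersion(NamedTuple):
    visit_cycle: int  # same visit-cycle numbering as the clinical data
    image_group: str  # 'A' = first group, 'B' = second group, and so on
    sequence: int     # sequential release number for a particular image


_IMAGE_VERSION = re.compile(r"^(\d+)\.([A-Z])\.(\d+)$")


def parse_image_version(code: str) -> ImageVersion:
    """Split an image version code such as '0.A.2' into its three positions."""
    match = _IMAGE_VERSION.match(code.strip().upper())
    if match is None:
        raise ValueError(f"not a number.letter.number code: {code!r}")
    cycle, group, sequence = match.groups()
    return ImageVersion(int(cycle), group, int(sequence))


def codes_for_group(codes: List[str], group: str) -> List[str]:
    """Keep only the codes that belong to one image group, in input order."""
    wanted = group.upper()
    return [c for c in codes if parse_image_version(c).image_group == wanted]
```

For example, filtering the codes 0.A.1, 0.A.2, 0.B.1, 1.B.1 and 1.B.2 for group B leaves the three Image Group B codes named in the text.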
Each release of image assessment data in OAI is given a unique version number. The first
position (before the first decimal) identifies the clinic visit cycle, as above. The second
position (after the first decimal) indicates the sequential number of the image assessment
data release for that visit cycle, beginning with 1. Thus, the first release of the baseline
image assessment data is numbered 0.1.
Each release of Enrollees, Outcomes and Measurement Inventory data in OAI is given a
unique version number. The number indicates the sequential number of the release for the
dataset, beginning with 1.
The clinical data from the same visit cycle at which the released images were acquired
are always available from the latest version (highest numbers in the second and third
positions) of the clinical data set for that visit cycle. (Images for a particular participant
are not released until their images have been cleaned and their clinical data have been
released.) Similarly, the image assessment data for a given visit cycle, whether matched to
the clinical data or to the released images from that cycle, are available in the latest
version of the image assessment data for that visit cycle.
IMPORTANT NOTE: Initially, the first half of the cohort was usually released early
(before that specific visit cycle was complete); these data are suitable only for
preliminary analyses and planning. Because various subgroups of participants were
not enrolled at a steady pace throughout the enrollment phase of the study (e.g., the
majority of the progression cohort was enrolled during the first half of the study,
non-exposed controls almost entirely near the end, and black participants
predominantly late in the enrollment phase) the user will only get a representative
sample of the cohort by analyzing data for the entire cohort. If using previously
released data for the first half of the cohort, users are encouraged to re-run their
analyses on data for the entire cohort.
Each dataset has one record per participant, except for the MIF, Xray, MRI, and image
assessment datasets, which may have multiple records per participant. Every dataset is
sorted by the key variable ID. All variables other than ID, SIDE (image-assessment data
only), READPRJ (image-assessment data only) and VERSION are contained in only a
single dataset.
Previously, only the datasets that grouped data based on type were available. As of
releases 2.2.2, 3.2.1 and 4.2.1, new datasets named AllClinicalXX are available for each
visit. The AllClinicalXX datasets combine the data from all the clinical datasets with one
row per participant for that visit. Data from the MIF, MRI and Xray datasets are not
included in the AllClinicalXX datasets, as these data come from datasets that have multiple
rows per participant. Data from Enrollees are not included as the Enrollees dataset is
updated for each data release.
• SubjectCharXX – contains demographic data for all enrolled participants
collected at that visit.
• XrayXX – contains meta-data related to the radiograph imaging, with one
record per expected radiograph image for each participant belonging to a
released image group.
where XX is:
00 for the baseline visit
01 for the first annual (12-month) follow-up visit
02 for the 18-month interim visit
03 for the second annual (24-month) follow-up visit
04 for the 30-month interim visit
05 for the third annual (36-month) follow-up visit
06 for the fourth annual (48-month) follow-up visit
07 for the fifth annual (60-month) follow-up telephone contact
08 for the sixth annual (72-month) follow-up visit
09 for the seventh annual (84-month) follow-up telephone contact
10 for the eighth annual (96-month) follow-up visit
11 for the ninth annual (108-month) follow-up telephone contact
99 for outcomes
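The suffix convention above lends itself to a small lookup table. The Python sketch below is illustrative: the suffix descriptions are taken from the list above, while VISIT_BY_SUFFIX, dataset_name and split_dataset_name are hypothetical names, and the sketch assumes the suffix is always the last two characters of a visit-specific dataset name.

```python
from typing import Tuple

# Suffix descriptions copied from the list above.
VISIT_BY_SUFFIX = {
    "00": "baseline visit",
    "01": "first annual (12-month) follow-up visit",
    "02": "18-month interim visit",
    "03": "second annual (24-month) follow-up visit",
    "04": "30-month interim visit",
    "05": "third annual (36-month) follow-up visit",
    "06": "fourth annual (48-month) follow-up visit",
    "07": "fifth annual (60-month) follow-up telephone contact",
    "08": "sixth annual (72-month) follow-up visit",
    "09": "seventh annual (84-month) follow-up telephone contact",
    "10": "eighth annual (96-month) follow-up visit",
    "11": "ninth annual (108-month) follow-up telephone contact",
    "99": "outcomes",
}


def dataset_name(stem: str, suffix: str) -> str:
    """Build a visit-specific dataset name such as 'AllClinical03'."""
    if suffix not in VISIT_BY_SUFFIX:
        raise ValueError(f"unknown visit suffix: {suffix!r}")
    return f"{stem}{suffix}"


def split_dataset_name(name: str) -> Tuple[str, str]:
    """Return (stem, visit description) for a name such as 'PhysExam01'."""
    stem, suffix = name[:-2], name[-2:]
    if suffix not in VISIT_BY_SUFFIX:
        raise ValueError(f"no recognized visit suffix in {name!r}")
    return stem, VISIT_BY_SUFFIX[suffix]
```

Datasets that are not visit specific, such as Enrollees, carry no suffix and would be rejected by split_dataset_name in this sketch.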
For the analyst who wants the universe of data, the AllClinicalXX datasets provide an
alternative to having to download multiple, separate datasets. AllClinicalXX should be
downloaded for the visit of interest, along with MIFXX, MRIXX and XrayXX. Since the
AllClinicalXX dataset contains most of the clinical data, it is a very large dataset. If only
a small subset of variables is of interest, the smaller dataset the variables reside in can be
downloaded for the visit of interest. For example, if the analyst is interested in only the
knee examination variables, the PhysExamXX dataset for the visit of interest should be
downloaded. Although the knee examination variables for that visit are also in
AllClinicalXX, the PhysExamXX dataset is much smaller in size and, therefore, faster to
download and easier to work with. In addition, the analyst should always download the
most recent version of the Enrollees dataset when working with OAI data.
More detailed information about each of these datasets can be found in the description
field of the Datasets table. These dataset descriptions apply to all visits and all releases.
A full list and description of all variables contained in each dataset is in the
documentation file along with each dataset. All variables are also listed in the
variable guides (see below).
The reasons for splitting the data into separate, smaller functional datasets are:
1. Smaller datasets are more manageable.
− As biomarker assays, imaging measurements, and other secondary processing
of collected data become available, it is easier to release these data in a timely
manner if the entire universe of data does not need to be re-released.
− Errors can be corrected and the data re-published more readily.
3. Some variables (such as medications, x-ray image meta-data, MRI image meta-
data, and image assessments) would have to be distributed as separate datasets in
any model, since they contain more than one record per participant, while all
other datasets contain one record per participant.
An attempt was made to make these groupings of variables into separate datasets
functional. That is, variables in each dataset tend to be related in such a way that if an
analyst is interested in one variable in the dataset, then they are more likely to be
interested in other variables in that same dataset. For example, variables related to
physical activity are grouped together in the Subject Characteristics dataset; variables
related to a participant’s health and medical history are grouped together in the Medical
History dataset; and so on.
Of course, there are disadvantages to separating the data in this way. Analysts who
download several smaller datasets instead of an AllClinicalXX dataset will have to merge
datasets, but this is ameliorated by having datasets sorted by the same key variable (ID),
by having only one record per participant in all datasets except MIF, Xray, MRI, and the
image assessment datasets, and by having unique variable names for all variables released
in OAI (no need to worry about over-writing data when merging).
Image assessment data are grouped into datasets by vendor and visit. Please see the
“Overview and Description of Central Image Assessments” for more information about
how image assessment data are organized.
Outcomes data are grouped into one dataset that is not visit specific.
Measurement Inventory data are also grouped into one dataset that is not visit specific.
OA Biomarkers Consortium FNIH Project data are grouped by vendor and visit. Please
see the “Overview and Description of OA Biomarkers Consortium FNIH Project”.
analysis. To further pinpoint the most useful group of variables in a broader category of
variables of interest, the categories are further broken down into subcategories. Again, a
variable may be in more than one subcategory within the same category, and the same
subcategory may appear in more than one category grouping.
This organization of the variables is intentionally independent of the grouping into SAS
datasets, but in many cases, particular categories are entirely contained in a single SAS
dataset. In other cases, for example medications, a category may be split across datasets.
Some image assessment, Measurement Inventory, and OA Biomarkers Consortium FNIH
Project variables have not been assigned to categories and subcategories.
Variable guides
Every variable is listed in two variable guides, VG_Variable.pdf (sorted alphabetically
by variable name) and VG_Form.pdf (organized by data collection form). See the
Variable Guide tutorial (VG_Tutorial.pdf).
The guides list all released variables (with the exception of some image assessment,
Measurement Inventory, and OA Biomarkers Consortium FNIH Project variables) sorted
in two different ways: alphabetically, and by the form on which each was collected. Although
image assessment data are not collected on data collection forms, these data are included
in the variable guides under a descriptive heading.
The variable guides also give frequencies and formats (for categorical variables),
univariate statistics (for continuous variables), the SAS dataset name, and the variable
label. (The variable label is the same as the SAS label as seen in the datasets.) When
working with the datasets, analysts are encouraged to always output and view SAS
variable labels in their entirety to ensure important information about the variables is not
lost. The maximum SAS label length used in OAI is 160 characters.
The Variable Guides also indicate whether a release comment is available for each
variable (see below).
365 days after the previous one. However, this is operationally impossible. Please keep
the following information in mind when analyzing time-sensitive data:
1) SAS dates. All dates in the OAI datasets are SAS dates, and are represented as the
number of days since January 1, 1960. Thus, even if SAS is not the analytic software
being used, the number of days between any two dates can easily be calculated as the
difference between the two SAS dates. For example, the number of days between the
Enrollment Visit and the 12-month follow-up visit can be calculated as
V01FVDATE-V00EVDATE. Variables for some frequently used intervals have been programmed. For
example, V02VISDYS is the number of days between the Enrollment Visit and the 18-
month interim visit. V01MRSURDY is the number of days between the 12-month MRI
and the most recent reported surgery.
2) Visit windows. The clinics were asked to schedule participants for annual follow-up
visits within + 45 days of the anniversary of their Enrollment Visit date. However, if this
was not possible, participants could be scheduled up until 135 days after the anniversary
of their Enrollment Visit. After this point, the clinics were instructed to stop attempting to
schedule the missing contact and concentrate on scheduling the next contact. Sometimes
this guideline was not strictly followed, and in very rare cases, the participant was
scheduled within the visit window for the following visit. If that was the only contact
they had during that following visit window, the visit was reassigned. For example, one
participant had an “18-month” interim visit during the window for their 24-month follow-
up visit, and they did not have their 24-month follow-up visit (a 24-month Missed
Follow-Up Contact form was entered into the data system). The 18-month interim visit
was reassigned to 24-months, and the Missed Follow-up Contact form was deleted. In
these rare cases, a participant may be missing much of the data for the visit they were
reassigned to.
3) Imaging. The clinics were asked to schedule imaging associated with a visit within +7
days of the clinic visit, although no shows and re-scheduled appointments frequently
resulted in gaps of more than 7 days. Full limb x-rays were very difficult to schedule at
some clinics, so there was a long delay between the clinic visit and imaging. When an
analysis is time-sensitive (e.g., a comparison of bone marrow edema and pain), it is
important to calculate the time lag between the two events (imaging and questionnaire)
by subtracting the relevant dates.
4) Specimen collection. If a participant was not fasting at the time of the clinic visit, or had
a difficult blood draw and either no blood or insufficient blood (or urine) was obtained,
they were asked to return to the clinic for a repeat specimen collection. It is important to
use the correct specimen collection date (VxxPDATE1 or 2 or VxxUCDATE1 or 2, for
blood and urine, respectively) for any time-dependent analyses. For example, if blood
was obtained only on the second attempt at baseline and blood was obtained only on the
first attempt at the 12-month follow-up visit, then the number of days between the two
blood samples would be V01PDATE1-V00PDATE2.
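For analysts working outside SAS, the date arithmetic described in the points above can be sketched in Python. The only fact relied on is that a SAS date counts days since January 1, 1960. The helper names (sas_to_date, date_to_sas, days_between) and the numeric values in the example record are hypothetical and are not real OAI data; the sketch also assumes that missing dates have already been converted to None.

```python
from datetime import date, timedelta
from typing import Optional

SAS_EPOCH = date(1960, 1, 1)  # SAS day 0


def sas_to_date(sas_days: float) -> date:
    """Convert a SAS date (days since January 1, 1960) to a calendar date."""
    return SAS_EPOCH + timedelta(days=int(sas_days))


def date_to_sas(day: date) -> int:
    """Convert a calendar date back to a SAS date."""
    return (day - SAS_EPOCH).days


def days_between(later: Optional[float], earlier: Optional[float]) -> Optional[int]:
    """Difference in days between two SAS dates; None if either is missing."""
    if later is None or earlier is None:
        return None
    return int(later) - int(earlier)


# Hypothetical values for one participant (not real OAI data).
record = {"V00PDATE1": None, "V00PDATE2": 16500, "V01PDATE1": 16870}

# Blood was obtained only on the second attempt at baseline and on the first
# attempt at the 12-month visit, so the interval is V01PDATE1 - V00PDATE2.
blood_interval = days_between(record["V01PDATE1"], record["V00PDATE2"])
```

Returning None when either date is missing keeps an absent specimen date from silently producing a misleading interval.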
variables that have been unreleased or modified, including when dataset assignments for
variables have changed. Further details about the change are noted in the release
comment for the pertinent variable(s). Changes and/or fixes applied to calculated
variables are also noted in the summary or revision section in the SAS code header. SAS
code can be found in release comments, which are located in the documentation and in
Search/Browse (see Release comments section below).
In addition, please see the Changes to Released Images document for information
on corrections, additions and deletions that have been made to OAI images for any
given image set.
Release comments
Variables may or may not have release comments, depending on whether they are
calculated, or derived, from collected variables; or they have caveats regarding their
usage; or they have notes about changes and/or fixes made to the variable, the data or the
meta-data associated with the variable; or there is any additional information that may
make use of the variable in analysis easier. The variable guides indicate whether there is a
release comment available for each variable. If the variable was calculated or derived
from collected variables, the SAS code used to create the variable is included in the
release comments.
This code will merge the subject characteristics variables, the joint symptom/function
variables, and the physical measurements by ID, to create the dataset COMBINED with
one record per participant and all variables from the three datasets. The datasets are
already sorted by ID, and all variables other than ID are unique to their dataset, so no
other processing is necessary before the merge.
There may be times when an analyst wishes to keep multiple records per participant from
one of the three datasets that have more than one record per ID, while merging with other
datasets. For example, to add ENROLLEES to the MRI data, one could use the following
code:
DATA COMBINED;
MERGE ENROLLEES MRI00;
BY ID;
RUN;
The resulting dataset COMBINED will contain as many records for each participant as
were originally in MRI00. The variables from ENROLLEES will be copied onto each
record for that ID. Thus, any statistics calculated from ENROLLEES variables in
COMBINED will be misleading. Generally, it is safer to subset the multiple-records-per-
participant dataset to get just the records of interest before merging with the one-record-
per-participant dataset. For example, if the analyst wished to look at the distribution of
the subcohort assignment for all participants with an available enrollment visit pelvis x-
ray, with a view to choosing a random sample from each cohort, the following datastep
would work well:
DATA TEMP;
MERGE XRAY00 (IN=INA WHERE=(V00EXAMTP='AP Pelvis' AND V00XRCOMP=1))
ENROLLEES (IN=INB KEEP=ID V00COHORT);
BY ID;
IF INA AND INB;
RUN;
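The same subset-then-merge pattern can be sketched outside SAS. The Python example below is a rough equivalent under stated assumptions: records are held as lists of dictionaries keyed by variable name, inner_merge_by_id is a hypothetical helper that mimics IF INA AND INB, and all record values (including the V00COHORT labels and the second exam type) are invented for illustration.

```python
from collections import Counter
from typing import Dict, List


def inner_merge_by_id(many_rows: List[Dict], one_rows: List[Dict], key: str = "ID") -> List[Dict]:
    """Keep each row of many_rows whose ID also appears in one_rows,
    copying the one-record-per-participant variables onto it."""
    lookup = {row[key]: row for row in one_rows}
    merged = []
    for row in many_rows:
        match = lookup.get(row[key])
        if match is not None:
            merged.append({**match, **row})
    return merged


# Hypothetical records (not real OAI data).
xray00 = [
    {"ID": 1, "V00EXAMTP": "AP Pelvis", "V00XRCOMP": 1},
    {"ID": 1, "V00EXAMTP": "Other exam (hypothetical)", "V00XRCOMP": 1},
    {"ID": 2, "V00EXAMTP": "AP Pelvis", "V00XRCOMP": 0},
    {"ID": 3, "V00EXAMTP": "AP Pelvis", "V00XRCOMP": 1},
]
enrollees = [
    {"ID": 1, "V00COHORT": "Progression"},
    {"ID": 2, "V00COHORT": "Incidence"},
    {"ID": 3, "V00COHORT": "Incidence"},
]

# Subset the multiple-records-per-participant data first, then merge.
pelvis = [r for r in xray00 if r["V00EXAMTP"] == "AP Pelvis" and r["V00XRCOMP"] == 1]
temp = inner_merge_by_id(pelvis, enrollees)
cohort_counts = Counter(r["V00COHORT"] for r in temp)
```

Filtering before the merge keeps the Enrollees variables from being copied onto records that are not of interest, which is the concern raised above.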
Merging two datasets where both have multiple records per participant: This tends to
have unexpected results and is not advised except when the analyst is very familiar with
merging datasets in SAS.
Please see the “Overview and Description of Central Image Assessments” for more
information about how to merge image assessment data with other OAI data.
Unreleased variables
Not all variables collected are released. In some cases, collected variables are replaced by
calculated variables that combine data from more than one instrument (e.g., MRI pre-
screener and MRI safety screener questions that are asked twice if the participant gets an
MRI, but only once if the pre-screener determines that they are not eligible. The
calculated variable gives the most up-to-date information from the two screeners). In
other cases, calculated variables that bring information down from the parent variable to
the follow-up child variable replace the child variable. Another reason for not releasing a
variable is that a question is asked repeatedly (e.g., history of knee replacement), and
only one place is considered the gold standard. In addition, variables that might unmask
the identity of a participant are not released. In some cases, a variable is potentially
identifying when the dataset for the entire cohort is not complete (first release of each
visit), and may be released when data from the entire cohort are released. Therefore, a
potentially-identifying variable may have a differing release status depending on the visit.
For instance, it may be unreleased for a visit where data are available for the first half of
the cohort only, and released for a visit where data are available for the entire cohort.
Sometimes the variable will simply stay unreleased, even when the data from the entire
cohort are released, if there is no reasonable way to combine categories, and the data are
considered not likely to be used in analysis. There are some calculated variables that were
created after several visits’ data had been released. These variables are available for the
most recent visit released, but not for previously-released visits. Some variables are
simply bookkeeping
variables used to ensure that the examiner follows the correct visit flow, and these are
also not released. And, finally, some variables are collected as part of an Ancillary Study
and will not be released until the Principal Investigator has had a chance to review and
analyze them. Unreleased variables are shown on the annotated forms in grey strikethrough
text (see Annotated Forms, below).
Sometimes a variable will remain available, but with certain unusual values suppressed. For
categorical variables, the number of participants with a particular value of the variable
may be so low that potential identification of participants may be possible. Sometimes
two response categories are combined to resolve this problem (e.g., P02RACE combines
all non-white categories other than Black, or African-American, and Asian). For
continuous variables, potentially identifying data are set to a special missing value, and
the decision rule for what is considered potentially identifying is given in the release
comment (see section below on SAS special missing value codes).
The prefix + root is unique within a visit year, while the root always represents the same
information in different visit years. Thus, participant weight in kilograms at baseline
(measured at the Screening Visit) and at the 12-month follow-up are called P01WEIGHT
and V01WEIGHT, respectively. An attempt was made to make the root user friendly and
intuitively meaningful, but length constraints and the necessity for the roots to be unique
make that not always possible.
V99 is used for the Outcomes data variables, which are not visit specific. For a complete
list of variable prefixes, see [Link].
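The prefix + root convention can be used to match the same measurement across visits. The Python sketch below rests on an assumption drawn only from the examples in this document (P01WEIGHT, V01WEIGHT, P02RACE, V99): that a prefix is one letter followed by two digits. The complete prefix list may include other forms, and split_variable_name and same_root are hypothetical helper names.

```python
import re
from typing import Tuple

# Assumed pattern: one letter plus two digits, then the root.
_PREFIXED_NAME = re.compile(r"^([A-Z]\d{2})([A-Z0-9_]+)$")


def split_variable_name(name: str) -> Tuple[str, str]:
    """Split a name such as 'V01WEIGHT' into (prefix, root)."""
    match = _PREFIXED_NAME.match(name.strip().upper())
    if match is None:
        raise ValueError(f"no visit prefix recognized in {name!r}")
    return match.group(1), match.group(2)


def same_root(first: str, second: str) -> bool:
    """True when two variables share a root, i.e. hold the same information
    collected at different visits."""
    return split_variable_name(first)[1] == split_variable_name(second)[1]
```

Variables without a visit prefix, such as ID or VERSION, do not fit the assumed pattern and raise an error in this sketch.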
Variable formats
Most categorical variables have a format assigned that gives a description of each value
used for that variable. In the SAS datasets, these formats are already attached and will
display on any SAS-generated output, as long as the Formats catalog has been
downloaded and properly defined (see below). Format assignments are listed in the
contents document (SAS proc contents) zipped with the documentation that accompanies
each dataset and are also shown in the frequencies contained in the variable guides.
To access these formats in SAS, please make sure you have downloaded the latest
Formats catalog (formats.sas7bcat) along with the datasets. Include a line in your SAS
program/SAS session that defines the location of this formats library:
libname library 'directory containing formats.sas7bcat';
For instance, if you have the formats.sas7bcat file on your C: drive in a folder called OAI
Formats, then your libname statement would be:
libname library 'C:\OAI Formats';
This will work for both batch and interactive sessions of SAS.
If you prefer not to use the formats, you can type the following line in your SAS program
to bypass the format errors:
options nofmterr;
This will cause SAS to generate warnings, rather than errors, when the formats are not
found.
If you are not using SAS, the formats will not be automatically attached to each variable.
However, you can use the Variable Guides to see a description of all variable values.
The baseline questionnaire and exam data for the first half of the cohort (version 0.1.1)
had the values 88 assigned to Don't Know, 77 assigned to Refused, and 99 assigned to
Don’t Do for categorical variables and some calculated variables. This was changed
starting with the 12-month questionnaire and exam data (version 1.1.1) so that .D is
assigned to Don't Know/Unknown/Uncertain, .R is assigned to Refused, and .X is
assigned to Don’t Do for all categorical and calculated variables (see below). The
baseline data were updated with these latest conventions in release version 0.2.1.
. = ".: Missing Form/Incomplete Workbook" – assigned when the entire data collection
form is missing.
.A= ".A: Not Expected" – assigned when the variable is either expected to be missing
based on the skip pattern of the questionnaire or is one of a set of “check all that
apply” variables, where any particular response is not necessarily expected for
any particular participant. When .A is assigned to the joint replacement
adjudication/confirmation status Outcomes variables, it indicates “no
replacement reported in this joint.”
.B= ".B: Low/Below Range" – assigned when an exact value cannot be determined, but
the value is known to be below the detection limit or linear range of an assay.
.C= ".C: Cannot Do/Attempted: unable to complete" – assigned when a continuous
variable is missing but is linked to a categorical variable whose value is Cannot
do/Attempted: unable to complete.
.D= ".D: Don’t Know/Unknown/Uncertain" – assigned when a continuous variable is
missing but is paired with a categorical variable whose value is Don’t Know or
when a categorical or calculated variable has a possible value of Don’t Know,
Unknown, or Uncertain.
.E= ".E: Non-Exposed Control" – assigned when a measurement was not done because it
was not required for Non-Exposed Control participants, and the participant is in
that cohort.
.F= ".F: Not done, phone contact" – assigned to certain key measurement variables when
the entire measurement was not done because the participant would consent
only to a telephone contact (follow-up visits only).
.G= ".G: Unreleased high value" – assigned to high data values that are potentially
identifying. See the release comment in the documentation for the decision rule
used to set data to .G for a particular variable.
.H= ".H: High/Above range" – assigned when an exact value cannot be determined, but the
value is known to be above the linear range of an assay.
.I= ".I: Inadequate data" – assigned when a calculated value or score cannot be
determined because of missing data.
.K= ".K: Cannot do/not attempted, unable" – assigned when a continuous variable is
missing but is paired with a categorical variable whose value is Cannot do/Not
attempted, unable. Also assigned to calculated variables when the calculation is
not attempted because the choice of input variables for particular individuals is
best left to the discretion of the analyst.
.L= ".L: Permanently Lost" – assigned when a measurement was done but the data were
lost and cannot be retrieved.
.M= ".M: Missing" – assigned when the data collection form is received and a value is
expected but is missing. In the kMRI_BLKSBML_BICLxx datasets, .M
indicates that bone marrow lesions have merged.
.N= ".N: Not Required/Not edited" – assigned when a variable was not edited if missing.
.O= ".O: Not done, other reason" – assigned to certain key measurement variables when
the entire measurement was not done for some other reason (participant became
ill and left the visit early, etc.).
.P= ".P: Prosthetic" – assigned to indicate that the value is missing because the participant
has had a knee or hip replacement.
.R= ".R: Refused" – assigned when a continuous variable is missing but is paired with a
categorical variable whose value is Refused or when a categorical or calculated
variable has a possible value of Refused. Also assigned to certain key
measurement variables when the participant refused the entire measurement.
.S= ".S: Unreleased low value" – assigned to low data values that are potentially
identifying. See the release comment in the documentation for a particular
variable. In the kMRI_BLKSBML_BICLxx datasets, .S indicates that a bone
marrow lesion has split.
.T= ".T: Technical Problems" – assigned, usually to an electronic or laboratory
measurement that had to be thrown out due to a technical problem with the
measurement.
.U= ".U: Unable to examine" – assigned when the measurement or examination could not
be completed for miscellaneous reasons (e.g., participant had to leave).
.V= ".V: Missed visit" – assigned to certain follow-up data when a Missed Follow-up
Contact form was entered, indicating a missed visit.
.W= ".W: Impossible value" – assigned when the value originally recorded is impossible
or nearly impossible (e.g., fasts longer than 24 hours for biospecimen
collection). Assignment of .W to outlandish values will be ongoing, so the lack
of assignment does not mean that a value has been examined and considered
possible.
.X= ".X: Don’t do" – assigned when a categorical variable has a Don’t Do option.
In SAS, when using special missing values in logical expressions, a missing value is no
longer equal only to '.'. Because SAS sorts the special missing values .A through .Z after
'.' and before all numeric values, the code to select any missing value should be written
<= .Z (or alternately, le .Z). To select values that are not missing, the code should be
written > .Z (or alternately, gt .Z).
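Analysts who are not using SAS need an equivalent way to tell the special missing codes apart from real values. The Python sketch below is hypothetical: it assumes that values reach the program as text tokens in which a special missing value appears as a period optionally followed by one letter (for example ".A"). How the codes are actually represented in the ASCII or transport files is not stated here and should be checked against the instructions supplied with each dataset. The labels are copied from the list above; MISSING_LABELS, is_missing and parse_value are invented names.

```python
import re
from typing import Optional, Tuple

# Labels copied from the list of special missing value codes above.
MISSING_LABELS = {
    ".": "Missing Form/Incomplete Workbook",
    ".A": "Not Expected",
    ".B": "Low/Below Range",
    ".C": "Cannot Do/Attempted: unable to complete",
    ".D": "Don't Know/Unknown/Uncertain",
    ".E": "Non-Exposed Control",
    ".F": "Not done, phone contact",
    ".G": "Unreleased high value",
    ".H": "High/Above range",
    ".I": "Inadequate data",
    ".K": "Cannot do/not attempted, unable",
    ".L": "Permanently Lost",
    ".M": "Missing",
    ".N": "Not Required/Not edited",
    ".O": "Not done, other reason",
    ".P": "Prosthetic",
    ".R": "Refused",
    ".S": "Unreleased low value",
    ".T": "Technical Problems",
    ".U": "Unable to examine",
    ".V": "Missed visit",
    ".W": "Impossible value",
    ".X": "Don't do",
}

# Assumed token shape: a period alone, or a period plus one letter.
_MISSING_TOKEN = re.compile(r"^\.[A-Z]?$")


def is_missing(token: str) -> bool:
    """True for '.' or '.A' through '.Z'; a number such as '.5' is not missing."""
    return bool(_MISSING_TOKEN.match(token.strip().upper()))


def parse_value(token: str) -> Tuple[Optional[float], Optional[str]]:
    """Return (numeric value, None) or (None, missing code) for one token."""
    text = token.strip().upper()
    if is_missing(text):
        return None, text
    return float(text), None
```

Keeping the missing code alongside the value preserves the reason a value is absent, which is the information the SAS comparison <= .Z would otherwise collapse.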
We strongly recommend that you use SAS V9 or higher to access OAI data. If you
have an earlier version of SAS, you must use the SAS transport file (.xpt), see below.
There are two files associated with each dataset. One, a documentation file, contains the
following PDF files: [Link] (proc contents output), [Link] (SAS frequencies and
univariates), [Link] (release comments about the variables included in the
dataset, including the SAS code, if calculated from collected variables), a dataset
description document, and other supporting documents. The SAS dataset zip file
contains a SAS V9 dataset (.sas7bdat), ASCII file, SAS transport file (.xpt), and
instructions for use of both the SAS transport dataset and the ASCII file. The SAS
datasets in the zip files were created using the V9 engine. As of release 2.2.2 and 3.2.1,
the transport files will allow you to use SAS V7 (previously, they were compatible with
SAS V6). This allows the variable names and labels to remain identical to those in the
documentation.
Annotated forms
Annotated forms include variable names and value labels for each released clinical
variable on the data collection forms. Please read the Annotation Form Conventions
document for a description of the annotation conventions.