Module 2
Preparing Data for Analysis
Isabelle Bichindaritz, SUNY Oswego 1
Outline
Introduction to module
Locating and downloading datasets
Datasets and files
Data sources
Data preprocessing
Importance of data preprocessing
Data preprocessing tasks
Missing values
Replacing missing values
Normalizing and discretizing data
Data normalization
Discretization
Data reduction
Feature selection
Data sampling
Introduction to R language
Principles of R
Working with R
Data preprocessing with R
Learning Objectives
1. Locate and download files for data analysis
involving genes and medicine.
2. Open files and preprocess data using R
language.
3. Write R scripts to replace missing values,
normalize data, discretize data, and sample
data.
Locating and Downloading Datasets
In general, we are interested in any element capable of decreasing or
increasing the uncertainty of a system.
Information = change in the state of "uncertainty" in a system
Locating and Downloading Datasets
A piece of information is some knowledge,
some data processed so that it modifies the
state of "uncertainty" about a system.
Data → Information
Locating and Downloading Datasets
Second viewpoint:
The different dimensions of "information"
Locating and Downloading Datasets
First dimension: technical
Technical dimension = a stored or transmitted symbol: "the data"
Datasets and Files
Transforming Data into Information
Variables, Datasets, and Databases
Datasets and Files
Data are collected in datasets that can be stored in:
Variables: containers used by programming languages to
store data (Ex: a cell in an Excel spreadsheet).
Not persistent: turning off the computer loses the data.
Files: sets of data stored on digital media (Ex: an
Excel spreadsheet).
Persistent: preserved when the computer is turned off.
Databases: shared collections of logically related data (and a
description of these data) recorded or stored on digital media,
designed to meet the information needs of an organization.
Persistent.
Datasets and Files
Data are stored in files, organized in variables.
(Figure from Connolly and Begg, 2014)
Datasets and Files
Data in a file is generally organized like in a spreadsheet.
FILE "CLIENT" (clientNo = unique identifier, ID)
clientNo  fName   lName    address   telNo
1         Lisa    Smith    mountain  1439
2         John    Mack     city      5634
3         Mary    Lewis    river     9045
4         Mark    Trump    plain     2710
5         Leslie  Clinton  village   3592
Datasets and Files
Data in a file is generally organized in columns and rows.
A column corresponds to a variable – also called attribute,
feature, field, or simply column.
A column has a name.
A row stores the set of values for each variable for a particular
sample or instance.
A sample corresponds to an individual, person, patient, object,
or other entity.
See example on previous slide.
Datasets and Files
Files can have many formats.
One of the most useful formats is the delimited text file:
It can be imported into Excel or database management software.
The suffix can be .txt or .csv.
Values are separated by a character called a delimiter,
such as a comma, a tab, or a semicolon.
The first row may or may not contain the list of variable names,
separated by the delimiter.
Each additional row contains the values of the
variables for a particular sample.
Datasets and Files
[Link] file example (with delimiter ", ")
clientNo, fName, lName, address, telNo
1, Lisa, Smith, mountain, 1439
2, John, Mack, city, 5634
3, Mary, Lewis, river, 9045
4, Mark, Trump, plain, 2710
5, Leslie, Clinton, village, 3592
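A minimal sketch of how such a delimited file could be read in R. It assumes the rows above are comma-delimited; the text is passed inline through the `text` argument so the example runs without a file on disk, and the names `csv_text` and `clients` are illustrative.

```r
# Inline copy of the delimited example; in practice this would be a .csv or .txt file.
csv_text <- "clientNo, fName, lName, address, telNo
1, Lisa, Smith, mountain, 1439
2, John, Mack, city, 5634
3, Mary, Lewis, river, 9045
4, Mark, Trump, plain, 2710
5, Leslie, Clinton, village, 3592"

# header = TRUE: the first row holds the variable names.
# sep = ",": the delimiter; strip.white = TRUE drops the blank after each comma.
clients <- read.csv(text = csv_text, header = TRUE, sep = ",",
                    strip.white = TRUE, stringsAsFactors = FALSE)

str(clients)        # one column per variable, one row per sample
clients$lName[3]    # value of the variable lName for the third sample
```

For a file on disk, the `text` argument would be replaced by the file path as first argument, and `sep` by the delimiter actually used (for example "\t" for a tab).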
Data Sources
Where do we find biomedical Big Data sources?
Data in existing repositories (proprietary):
Electronic Medical Records (EMRs).
Clinical studies.
Open data sources (public).
List in resources
The Cancer Genome Atlas (TCGA)
Alzheimer’s Disease Neuroimaging Initiative (ADNI)
Health and Retirement Study (HRS)
UK Biobank
Millennium Cohort Study
CALIBER (EHR and admin data)
UCI Machine Learning Repository
…
Data Sources
Heterogeneous types of Big Data to analyze separately or
together:
Numeric, nominal.
Text.
Image.
Video.
Sound.
Social media.
Web.
Time series.
Signal.
…
Importance of Data Preprocessing
Data as found in public datasets and other types of
datasets is imperfect, ‘dirty’.
Data quality is essential to get good analytics results -
Garbage In, Garbage Out (GIGO).
The format of the data often requires changes: for
example, the analytical method used may require
nominal data while your data is numeric.
Importance of Data Preprocessing
Types of data characteristics to fix:
Missing values
Noisy data
Incorrect data type
Incomplete data
Importance of Data Preprocessing
The goal is to improve the quality of data to ensure that
the measurements provided are as
Accurate
Precise
Complete
Interpretable
Correct
as possible.
Data Preprocessing Tasks
Data preprocessing involves several tasks
Data cleaning
Dealing with missing values
Dealing with erroneous data and outliers
Data transformation
Changing data types (discretization)
Changing range of data values (normalization)
Adding variables
Data reduction
Feature selection
Sampling
Data Preprocessing Tasks
(Figure from Han and Kamber, 2014)
Missing Values
Missing values can appear in your dataset as
A blank
A ‘.’
A ‘n/a’
A ‘?’
There are several strategies to deal with them, for
example applying some kind of filter.
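As a sketch of how these markers could be recognized when a file is loaded in R, the `na.strings` argument of `read.csv()` lists the strings to treat as missing. The small dataset below (`raw`, `patients`) is invented for illustration only.

```r
# Hypothetical data using each of the markers listed above.
raw <- "id,age,weight
1,34,70
2,,65
3,.,80
4,n/a,?
5,51,77"

# Every listed string is converted to NA while reading.
patients <- read.csv(text = raw, na.strings = c("", ".", "n/a", "?"))

colSums(is.na(patients))    # number of missing values per variable

# A simple filter: keep only the rows without any missing value.
complete_rows <- patients[complete.cases(patients), ]
```

Once the markers are mapped to `NA`, the columns keep their numeric type, which would not be the case if "n/a" or "?" were read as text.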
Replacing Missing Values
Strategies for replacing missing values
Delete the entire row (depends on how many rows you have)
Replace by a fixed value (‘unknown’)
Replace values by a statistic associated with a particular
column or a particular group – mean, median, mode
Replace values based on nearest neighbors
Replace values based on likelihood.
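The first three strategies can be sketched in base R as follows. The data frame `df` and its column names are made up for the example; neighbor-based and likelihood-based replacement usually rely on add-on packages and are not shown.

```r
# Hypothetical data: one numeric and one nominal variable with missing values.
df <- data.frame(
  group = c("a", "a", "a", "b", "b", "b"),
  value = c(10, NA, 14, 20, 26, NA),
  label = c("x", NA, "y", "y", "y", NA),
  stringsAsFactors = FALSE
)

# 1. Delete every row that contains a missing value.
dropped <- na.omit(df)

# 2. Replace by a fixed value.
fixed <- df
fixed$label[is.na(fixed$label)] <- "unknown"

# 3a. Replace by a statistic of the whole column (here the mean).
by_mean <- df
by_mean$value[is.na(by_mean$value)] <- mean(df$value, na.rm = TRUE)

# 3b. Replace by a statistic of the group the row belongs to (here the median).
by_group <- df
by_group$value <- ave(df$value, df$group,
                      FUN = function(v) ifelse(is.na(v), median(v, na.rm = TRUE), v))

# 3c. For nominal data, the mode (most frequent value) plays the same role.
mode_label <- names(which.max(table(df$label)))
```

Deleting rows is the simplest choice but shrinks the dataset, which is why it depends on how many rows are available.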
Replacing Missing Values
Handling missing values
  Data deletion
    Delete entire row
    Delete entire column
  Data imputation
    Impute a constant value
    Impute with mean …
    Impute from neighbors
    Impute based on a model
    Impute randomly
Normalizing and Discretizing Data
In order to handle noise in the data, the data can be
transformed globally to:
Reduce the grain of the data (discretize) from fine-grained to
coarse-grained, for example from numeric to nominal.
Change the scale or range of the data (normalize).
It might also be necessary to discretize in order to apply certain
data analytics methods, because some prediction
methods require a nominal target attribute.
Data Normalization
Normalization consists of changing the scale of the data.
Some data analytics methods do not behave well on data
of mixed scales (Ex: age and income have
widely different ranges).
For example, it is common to scale all data into the
range [-1, 1] or [0, 1].
Data Normalization
Generally, data are scaled into a smaller range.
Methods include:
Min-max normalization
Z-score normalization
Decimal scaling
Data Normalization
Min-max normalization transforms data from range [m,
M] into range [m’, M’] using the formula
val’ = (val - m) / (M - m) * (M’ - m’) + m’
Example: normalizing age values in [0, 150] into [0, 1]
age = 50 → 0.33 (intuitively)
check: val’ = (50 - 0) / (150 - 0) * (1 - 0) + 0
= 50 / 150 = 1/3 ≈ 0.33
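The formula translates almost literally into R. This is a sketch rather than a library function: `min_max` and its argument names (`new_m`, `new_M` standing for m’ and M’) are chosen here for illustration.

```r
# Min-max normalization from [m, M] into [new_m, new_M].
# By default m and M are taken from the data itself.
min_max <- function(val, m = min(val), M = max(val), new_m = 0, new_M = 1) {
  (val - m) / (M - m) * (new_M - new_m) + new_m
}

# The slide's example: age 50 in [0, 150] mapped into [0, 1].
min_max(50, m = 0, M = 150)

# A whole vector, first into [0, 1], then into [-1, 1].
ages <- c(0, 30, 50, 75, 150)
min_max(ages)
min_max(ages, new_m = -1, new_M = 1)
```

Passing `m` and `M` explicitly, as in the first call, matters when the known range of a variable is wider than the values observed in the sample.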
Data Normalization
Z-score normalization
val’ = (val - mean) / std
Ex: normalizing age values in [0, 150]
where the mean age in the population is 36.8 and the standard
deviation is 12
age = 50 → val’ = (50 - 36.8) / 12 = 1.1
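A short R sketch of the same computation. The function name `z_score` and the sample vector `ages` are illustrative; base R's `scale()` performs the same centering and scaling on a vector or on the columns of a matrix.

```r
# Z-score normalization; mean and standard deviation default to those of the data.
z_score <- function(val, mean_val = mean(val), std = sd(val)) {
  (val - mean_val) / std
}

# The slide's example, with the population mean and standard deviation given.
z_score(50, mean_val = 36.8, std = 12)

# On a sample, the result has mean 0 and standard deviation 1.
ages <- c(22, 31, 36, 40, 55)
z <- z_score(ages)
mean(z)
sd(z)
```

Note the parentheses around the subtraction: without them, the division would be applied to the mean only.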
Data Normalization
The normal (distribution) curve
From μ–σ to μ+σ: contains about 68% of the measurements
(μ: mean, σ: standard deviation)
From μ–2σ to μ+2σ: contains about 95% of it
From μ–3σ to μ+3σ: contains about 99.7% of it
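These percentages can be checked in R with `pnorm()`, the cumulative distribution function of the normal distribution. The helper name `within_k_sd` is illustrative.

```r
# Share of a normal distribution lying within k standard deviations of the mean.
within_k_sd <- function(k) pnorm(k) - pnorm(-k)

# Percentages for k = 1, 2, 3, rounded to one decimal.
pct <- round(100 * within_k_sd(1:3), 1)
pct    # 68.3 95.4 99.7
```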
Data Normalization
Decimal scaling
val’ = val / 10^n
where n is the smallest integer such that the largest |val’| is less than 1.
This formula transforms the values into the interval [-1, 1] if there are negative
values, and into [0, 1] otherwise.
Ex: normalizing age values in [0, 150]
we want the highest age to be less than 1 after scaling, therefore divide by 1,000 = 10^3
age = 50 → val’ = 50 / 10^3 = 0.05
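A possible way to find n and apply decimal scaling in R is sketched below. The helper `decimal_exponent` is a name chosen for this example, not a built-in function.

```r
# Smallest n such that every |val| / 10^n is strictly less than 1.
decimal_exponent <- function(val) {
  n <- 0
  while (max(abs(val)) / 10^n >= 1) n <- n + 1
  n
}

# The slide's example: ages in [0, 150].
ages <- c(0, 50, 150)
n <- decimal_exponent(ages)       # 3, so we divide by 1,000
scaled_ages <- ages / 10^n        # 0.00 0.05 0.15

# With negative values, the result falls in [-1, 1].
temps <- c(-40, 7, 99)
scaled_temps <- temps / 10^decimal_exponent(temps)
```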
Data Normalization
Comparison between the methods
All three methods are linear rescalings, so none of them changes the shape of
the data distribution; they differ in what they preserve and in the range they
produce.
Decimal scaling only moves the decimal point, so it also keeps the zero and the
ratios between values. It acts similarly to image resizing in photo editing
software (shrink / magnify).
Z-score normalization is the most used because the result has mean 0 and
standard deviation 1, which is advantageous with certain statistical methods.
It does not make the data normally distributed, and its output range is not bounded.
Min-max normalization can accommodate any new range we want, not
only [0, 1] and [-1, 1] like decimal scaling.
Discretization
Discretization transforms data from numeric into
nominal data type.
Effects of discretization:
Smooths data.
Reduces noise.
Reduces data size.
Enables specific methods using nominal data.
Discretization
Discretization methods
Manual methods:
Distribution analysis.
Automatic methods:
Binning.
Equal-width binning
Equal-depth binning
Regression analysis.
Cluster analysis.
Natural partitioning.
Discretization
Equal-width binning
Given a range of values [min, max], we divide it into intervals of
approximately the same width; either we set the width arbitrarily
to w, or we set the desired number of bins to n, in which case
w is calculated as:
w = (max - min) / n
Ex: if the range is [0, 100] and we want 4 bins, each bin will
have a width of
(100 - 0) / 4 = 25
For integer values, the bins will be: [0, 24], [25, 49], [50, 74], [75, 100].
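Equal-width binning can be sketched in R with `cut()`, which assigns each value to an interval defined by a vector of break points. The helper `equal_width_breaks` and the sample `values` are illustrative.

```r
# Break points for n bins of equal width w over [min_val, max_val].
equal_width_breaks <- function(min_val, max_val, n) {
  w <- (max_val - min_val) / n
  min_val + w * (0:n)
}

breaks <- equal_width_breaks(0, 100, 4)    # 0 25 50 75 100

values <- c(0, 12, 24, 25, 49, 50, 74, 75, 100)

# right = FALSE makes the bins closed on the left, so 25 falls in the second bin;
# include.lowest = TRUE closes the last bin on the right so that 100 is kept.
bins <- cut(values, breaks = breaks, right = FALSE, include.lowest = TRUE)

table(bins)    # number of values per bin
```

The result is a factor, that is, a nominal variable, which is what discretization is meant to produce.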
Discretization
Equal-depth binning
Given a range of values [min, max], we place approximately the
same number of instances in each bin. We divide the total number
of samples nb by the desired number of samples in each bin (depth)
d, so the number of bins n is calculated as:
n = nb / d
Ex: if the range is [0, 100] with 100 samples of different integer values (for
example, 99 is missing) and we want 20 samples in each bin, the number
of bins will be:
100 / 20 = 5
the bins will be: [0, 19], [20, 39], [40, 59], [60, 79], [80, 100].
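The example above can be reproduced in R by ranking the values and cutting the ranks into groups of d. This is one simple way to do it, assuming nb is a multiple of d; the variable names are illustrative.

```r
# The 100 distinct integer values of the example: 0 to 100 with 99 missing.
values <- setdiff(0:100, 99)

d <- 20                        # desired depth (samples per bin)
n <- length(values) / d        # number of bins: 100 / 20 = 5

# Rank the values, then put ranks 1..20 in bin 1, 21..40 in bin 2, and so on.
bin <- ceiling(rank(values, ties.method = "first") / d)

table(bin)                     # every bin holds exactly d values

# Smallest and largest value of each bin (one column per bin).
bin_ranges <- sapply(split(values, bin), range)
bin_ranges
```

Because 99 is absent, the last bin has to stretch to [80, 100] to collect its 20 values, which is why its width differs from the others.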
Discretization
Advantages of each method:
Equal-width binning is simpler, but it is very sensitive to
outliers in the data.
Equal-depth binning scales well because it follows the distribution of the
data, but the bin boundaries may be more difficult to interpret.
Smoothing of data can be accomplished by replacing the
values in a bin by a statistic such as the average (numeric data),
median (numeric data), or mode (categorical data).
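Smoothing by bin statistics can be sketched with `ave()`, which replaces each value by a statistic of its group. The nine sorted values and the depth of 3 are invented for illustration.

```r
# Hypothetical sorted values, split into equal-depth bins of 3.
values <- c(4, 8, 15, 21, 21, 24, 25, 28, 34)
depth <- 3
bin <- ceiling(seq_along(values) / depth)    # 1 1 1 2 2 2 3 3 3

# Replace every value by the mean, then by the median, of its bin.
smoothed_mean <- ave(values, bin, FUN = mean)
smoothed_median <- ave(values, bin, FUN = median)

smoothed_mean      # 9 9 9 22 22 22 29 29 29
smoothed_median    # 8 8 8 21 21 21 28 28 28
```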
Data Reduction
Data reduction can take several forms:
Feature selection.
Sampling.
Data compression.
Data aggregation.
etc.
Feature Selection
Feature selection is a form of dimensionality reduction.
A feature is also called a variable (or a column).
It is very important in biomedical data due to an often large
number of features available – the curse of dimensionality
(Ex: number of gene expressions).
It will be studied in a future module.
Data Sampling
Data sampling refers to creating a subset or sample of
the complete dataset.
The sample needs to be representative.
Main methods:
Simple random sampling with replacement.
Simple random sampling without replacement.
Stratified sampling.
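The three methods can be sketched in base R with `sample()`. The data frame `patients`, the sample size of 10, and the sampling fraction of 25% are assumptions made for the example.

```r
set.seed(42)    # fixes the random draws so the example is repeatable

# Hypothetical dataset: 20 patients, 12 in stratum "F" and 8 in stratum "M".
patients <- data.frame(id = 1:20, sex = rep(c("F", "M"), times = c(12, 8)))

# Simple random sampling with replacement: a row may be drawn more than once.
with_repl <- patients[sample(nrow(patients), size = 10, replace = TRUE), ]

# Simple random sampling without replacement: every row is drawn at most once.
without_repl <- patients[sample(nrow(patients), size = 10, replace = FALSE), ]

# Stratified sampling: draw the same fraction from each stratum,
# so the sample keeps the proportions of the complete dataset.
frac <- 0.25
strata <- split(patients, patients$sex)
stratified <- do.call(rbind, lapply(strata, function(s) {
  s[sample(nrow(s), size = round(frac * nrow(s))), , drop = FALSE]
}))

table(stratified$sex)    # 3 from "F" and 2 from "M"
```

Stratifying is one way to keep the sample representative when some groups are much smaller than others.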
Data Sampling
(Figure from Han and Kamber, 2014; panel label: Raw Data)
Introduction to R Language
R is an open source programming environment for computation and graphics,
used for statistical analysis and data science applications.
R comprises a set of functions for statistical analysis and graphics, a
programming language, a run-time interpreter, a debugger,
numerous add-on packages, and script files.
Packages provide added functionality and make the language
extensible, since any researcher can contribute
a package to R.
In terms of programming language, R's syntax is close to that of S, while its
underlying semantics are derived from Scheme.
Developed originally by Ross Ihaka and Robert Gentleman at the University of
Auckland in New Zealand, it is now maintained by the "R Core Team"
([Link]).
Principles of R
Installation. Installers can be downloaded from a mirror
listed on [Link]. Installation
is automatic by double-clicking on the installer. Installation
requires about 50 MB of disk space.
Configuration. The user should select a working directory, which
will contain all files input or output with R, and a
memory size. This can be done from the desktop shortcut to R, through
its properties:
Change the working directory under Start in (Windows). This
is the directory from which R will read, or into which R will
write, by default.
Change the memory size by adding at the end of Target the
number of GB wanted: --max-mem-size=3G. The memory
limit varies depending on the available memory and the
operating system.
Some useful tools: RGui, RStudio, Notepad++.
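The working directory can also be inspected and changed from within R, using `getwd()` and `setwd()`. In this sketch the folder name "module2" and the file name "example.csv" are hypothetical, and a temporary directory is used so nothing is written to a real project folder.

```r
old_dir <- getwd()                             # current working directory

# Hypothetical working directory, created under R's temporary folder.
work_dir <- file.path(tempdir(), "module2")
dir.create(work_dir, showWarnings = FALSE)
setwd(work_dir)

# Files named without a path are now written to, and read from, work_dir.
write.csv(data.frame(x = 1:3), "example.csv", row.names = FALSE)
reread <- read.csv("example.csv")

setwd(old_dir)                                 # restore the previous directory
```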
Principles of R
Running. Double-clicking on the desktop shortcut or selecting R from the Start
menu opens the R window.
Documentation. Important documents for getting started with R are the
following:
An FAQ for R for Windows is available from [Link]
[Link]/bin/windows/base/[Link].
An FAQ for R is available from [Link]
The user guide of R is entitled “Using R for Data Analysis and Graphics” and is
available from: [Link]
Other documents are available from the documentation section of
[Link]
Online documentation is available from R itself through help(name) or ?name.
Principles of R
Important commands. Some important commands
include:
ls() to list the objects in memory.
rm(list = ls()) to empty the memory.
rm(object) to remove an object from memory.
q() to quit.
summary(object) to display summary characteristics of an
object.
class(object) to display the class (type) of an object.
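A short session using these commands might look like the sketch below. The objects `x` and `ages` are created only for the example.

```r
# Create two objects so there is something to inspect.
x <- c(2, 4, 6)
ages <- data.frame(age = c(34, 50, 61))

ls()                      # names of the objects currently in memory

cls_x <- class(x)         # "numeric"
cls_ages <- class(ages)   # "data.frame"

summary(x)                # minimum, quartiles, mean, and maximum
summary(ages)             # the same statistics, column by column

rm(x)                     # removes x only
"x" %in% ls()             # FALSE: x is gone

# rm(list = ls()) would remove every object in the workspace;
# it is left commented out so the example does not wipe a live session.
# q() would quit R, so it is not run here either.
```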
Principles of R
Packages are libraries of functions to use in addition to the
standard functions. They need to be loaded specifically. There are
two types of packages:
Standard packages, which can be loaded from the Packages
menu, choosing Load package in the graphical user interface (GUI).
Packages to install from a local zip file, which can be installed from
the Packages menu, choosing Install package(s) from local zip
files…, which proposes to load a zipped package from the working
directory.
Packages can also be installed with install.packages().
Once packages are installed, they can be loaded with library().
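A common pattern is to install a package only if it is not yet available and then load it. In this sketch the package `stats`, which ships with R, stands in for any package name so that the example runs without a download.

```r
pkg <- "stats"    # placeholder: replace with the package you need

# requireNamespace() returns FALSE when the package is not installed.
if (!requireNamespace(pkg, quietly = TRUE)) {
  install.packages(pkg)    # needs an internet connection and a CRAN mirror
}

# character.only = TRUE tells library() that pkg holds the name as a string.
library(pkg, character.only = TRUE)

"package:stats" %in% search()    # TRUE once the package is attached
```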
Working with R
Some distributions combine several tools
for bioinformatics, such as:
Bioconductor ([Link]) for bioinformatics
packages.
Anaconda ([Link]) for data
science, which includes Python, R, and Scala with their most popular
packages (including Bioconductor).
Working with R
Anaconda includes the Jupyter notebook in which R can
be run.
In this course, Jupyter notebook is provided from a link in
the menu so that no local R installation is required.
Data Preprocessing with R
Watch the video
Start Jupyter notebook from the provided link