Dr Athanasios Tsanas (‘Thanasis’)
Associate Prof. in Data Science
Usher Institute, Medical School
University of Edinburgh
[Link]. Maths, Edinburgh
[Link]. + [Link]. [Link]. Signal
Post-doc: Eng. Usher Institute
Engineering Processing
Lecturer: SBS Medical School
2001-2007 2007-2008 2008 - 2019 2017 - present
Tidal Generation: Rolls-Royce EMEC 500kW © A. Tsanas, 2020
Lecture notes
Papers/reports listed under ‘Reading material’ for each
lecture: these are part of the exam material
“An introduction to statistical learning”
by G. James, D. Witten, T. Hastie, R.
Tibshirani. It is freely available as a pdf
from the authors’ website: [Link]
[Link]/~gareth/ISL/ISLR%20First%20
[Link].
10-credit course (total hours = 100)
Coursework: 50% + Exam: 50%
Lectures
• Thanasis Tsanas
• 10 hours of lectures
R Labs
• Sjoerd Beentjes
• 10 hours R labs
Characterising raw data (data mining)
Probability distributions
Statistical associations
Statistical mapping (learning models)
Model validation and assessment
R programming (labs with Stuart)
Understand setting and complications related
to using and analysing biomedical data
Understand the differences between first-principle
and data-driven models
Data mining and feature extraction
Statistical learning models & validation
High-dimensional data implications
Write clear, modular R code
Day 1 • Introduction and overview; reminder of basic concepts
Day 2 • Data collection and sampling
Day 3 • Data mining: signal/image processing and information extraction
Day 4 • Data visualization: density estimation, statistical descriptors
Day 5 • Exploratory analysis: hypothesis testing and quantifying relationships
Day 6 • Feature selection and feature transformation
Day 7 • Statistical machine learning and model validation
Day 8 • Statistical machine learning and model validation
Day 9 • Practical examples: bringing things together
Day 10 • Revision and exam preparation
Data visualization (density estimation, scatter plots)
→ Exploratory analysis (statistical associations)
→ Feature selection or transformation (e.g. PCA)
→ Statistical mapping (regression/classification)
Day 1 part 2
Talk in the language of the
clinicians
• Understand what they need and the
terms the experts in the domain use
Understand the physiology
• Domain dependent
• You will have to read biology/physiology
books and articles
Monitor Parkinson’s disease using voice
• Before looking into the data, talk with the domain experts
• Understand the underlying physiology
Understanding heart physiology and circulatory system
First-principle models (differential equations)
• Mechanistic insight ☺
• Difficult to match data
Data-driven models (statistics)
• Less interpretable
• Better predictions ☺
Day 1 part 3
• Usefulness
– Often occurs, e.g. heights, IQ, returns, errors
– Can be used to approximate other distributions
– Central Limit Theorem - distribution of averages
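• Structure: $X \sim \mathcal{N}(\mu, \sigma^2)$
– Continuous, bell shape
– Two parameters, $\mu$ (mean) and $\sigma$ (standard deviation)
– Analytic formula: $p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$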
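As a quick check of the analytic formula for the normal density, the sketch below writes it out by hand and compares it with R's built-in `dnorm`. The helper name `normal_pdf` and the grid of x values are illustrative assumptions, not part of the lecture material.

```r
# Normal density written out from the analytic formula.
# normal_pdf is a hypothetical helper name, not a base R function.
normal_pdf <- function(x, mu = 0, sigma = 1) {
  1 / sqrt(2 * pi * sigma^2) * exp(-(x - mu)^2 / (2 * sigma^2))
}

# Illustrative grid of points at which to evaluate the density
x <- seq(-3, 3, by = 0.5)

manual  <- normal_pdf(x, mu = 0, sigma = 1)
builtin <- dnorm(x, mean = 0, sd = 1)

# The two should agree up to floating-point error
print(all.equal(manual, builtin))
```

Note that `dnorm` takes the standard deviation (`sd`), not the variance, as its second parameter.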
Standard normal: $Z \sim \mathcal{N}(0, 1)$
Mean μ = 0
Standard deviation σ = 1
Tabulation
- a necessity
- given for Z ≥ 0
- tables not all same (area in tail, area from mean)
Examples
P(Z >1.96) = 0.025
P(-1.96 < Z < 1.96) = 0.95, so P(-2 < Z < 2) ≈ 0.95
P(-1<Z<1) = 0.68
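In R the tabulated values can be reproduced with `pnorm`, which returns the area to the left of a point by default. The sketch below shows the two table conventions mentioned above (area in the tail, area from the mean) and the central areas from the examples; the variable names are illustrative.

```r
# Area in the upper tail: P(Z > 1.96)
tail_area <- pnorm(1.96, lower.tail = FALSE)

# Area from the mean up to 1.96 (the other common table convention)
area_from_mean <- pnorm(1.96) - 0.5

# Central areas: P(-z < Z < z)
central_196 <- pnorm(1.96) - pnorm(-1.96)
central_2   <- pnorm(2) - pnorm(-2)
central_1   <- pnorm(1) - pnorm(-1)

print(round(c(tail = tail_area, from_mean = area_from_mean,
              central_196 = central_196, central_2 = central_2,
              central_1 = central_1), 4))
```

The central area for ±2 comes out slightly above 0.95, which is why 1.96 rather than 2 is the exact 95% cut-off.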
Each value is area beyond
point which is Z standard
deviations from mean.
Z-value of X is number of
standard deviations of X
from mean:
$Z = \frac{X - \mu}{\sigma}$
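A minimal sketch of the Z-value computation in R. The helper name `z_score` and the numbers (mean 100, standard deviation 15) are made up for illustration only.

```r
# Z-value: number of standard deviations that x lies from the mean.
# z_score is a hypothetical helper name; the data below are made up.
z_score <- function(x, mu, sigma) {
  (x - mu) / sigma
}

x <- c(85, 100, 130)
z <- z_score(x, mu = 100, sigma = 15)
print(z)
```

Because R arithmetic is vectorised, the same function works for a single value or for a whole vector of observations.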
The time required for a certain drug to have an effect is normally
distributed with a mean of 30 minutes and a standard deviation of 9 minutes.
What is the probability that the drug takes more than 42
minutes to have an effect on a random patient?
P(time > 42) = P(Z > (42-30)/9) = P(Z > 1.33) = 0.0918
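The drug example above can be checked in R without a Z-table. This is a sketch using `pnorm`; the variable names are illustrative. The table answer rounds Z to two decimals, so it differs slightly from the unrounded computation.

```r
mu    <- 30  # mean time to effect (minutes)
sigma <- 9   # standard deviation (minutes)

# Z-value for 42 minutes
z <- (42 - mu) / sigma

# Table-style answer: round Z to two decimals first, as with a printed table
p_table <- pnorm(round(z, 2), lower.tail = FALSE)

# Direct answer: let pnorm handle the mean and standard deviation
p_exact <- pnorm(42, mean = mu, sd = sigma, lower.tail = FALSE)

print(round(c(z = z, p_table = p_table, p_exact = p_exact), 4))
```

Both routes give a probability of roughly 9%; the small gap between them comes only from rounding Z to 1.33.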
68% of area lies within 1σ of μ.
95% of area lies within 2σ of μ.
99.7% of area lies within 3σ of μ.
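The 68/95/99.7 rule holds for any μ and σ, not only for the standard normal. The sketch below computes the three areas for an arbitrary choice of parameters; the helper name `area_within` and the values μ = 30, σ = 9 are assumptions for illustration.

```r
# Area of a normal distribution lying within k standard deviations of the mean.
# area_within is a hypothetical helper name.
area_within <- function(k, mu = 0, sigma = 1) {
  pnorm(mu + k * sigma, mean = mu, sd = sigma) -
    pnorm(mu - k * sigma, mean = mu, sd = sigma)
}

# Illustrative parameters: the result does not depend on them
areas <- sapply(1:3, area_within, mu = 30, sigma = 9)
print(round(100 * areas, 1))
```

Changing `mu` and `sigma` leaves the percentages unchanged, because the rule depends only on the number of standard deviations from the mean.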
Remember: 99.7% of area lies within 3σ of μ
You could go back to the Z-table and look for the
corresponding value with a very low probability, but
it is not that detailed
In practice, with about 1,000,000 points the max and min
values lie about 5σ from μ (detailed analytical formulas
exist for this computation)
Draw random data in R to verify concept
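The simulation suggested above can be sketched as follows. The seed and variable names are arbitrary choices; the exact extremes will vary from run to run, but with about a million draws they should land a little below 5σ from the mean.

```r
set.seed(1)  # arbitrary seed, only for reproducibility

n <- 1e6
x <- rnorm(n)  # draws from the standard normal, so sigma = 1 and mu = 0

# Observed extremes, in units of sigma
print(range(x))
largest_deviation <- max(abs(x))
print(largest_deviation)

# Rough analytic guide: the point exceeded with probability 1/n in one tail
one_in_n_point <- qnorm(1 - 1 / n)
print(one_in_n_point)
```

Increasing `n` pushes the extremes outwards only slowly, which is why values far beyond 5σ are practically never seen in samples of this size.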
Day 1 part 4
High level programming language
Large supporting community
Free of charge
Widely used in academia + industry
Install R
Install RStudio
R packages
Convention to attract your attention!
Set values to variables: a = 5; b = 3
Simple arithmetic: c = a + b
Write something and provide comments: the
“#” character
“<-” is more typically used for assignment than “=”
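A minimal runnable version of these basics. The variable name `total` is an illustrative choice, used here instead of `c` to avoid confusion with R's built-in `c()` function.

```r
# Everything after "#" on a line is a comment and is ignored by R

a <- 5   # assignment with the arrow, the more common R style
b = 3    # "=" also assigns at the top level

total <- a + b   # simple arithmetic
print(total)
```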
Repeat a process a number of times (for-loop)
for (variable in sequence) {smth happens}
Example: repeat something 100 times:
for (i in 1:100) {smth happens as a function of i,
e.g. some sort of checking entries in a vector}
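A small runnable sketch of a for-loop that checks the entries of a vector, here counting how many are negative. The vector `values` is made-up data for illustration.

```r
# Made-up data to loop over
values <- c(3, -1, 4, -1, 5, -9, 2, 6)

n_negative <- 0
for (i in seq_along(values)) {
  # Check entry i and update the counter if it is negative
  if (values[i] < 0) {
    n_negative <- n_negative + 1
  }
}
print(n_negative)

# The same check without an explicit loop, using vectorised comparison
n_negative_vectorised <- sum(values < 0)
```

`seq_along(values)` is used rather than `1:length(values)` because it behaves sensibly when the vector is empty. For simple checks like this one, the vectorised form is usually preferred in R.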
Check if condition applies
if(condition){smth happens}
if(condition) {smth happens} else {smth happens}
Example: check if the variable a is null
if (is.null(a)) {smth happens}
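A runnable sketch of the null check with an else branch, using base R's `is.null`. The helper name `describe` and the messages are illustrative.

```r
# describe is a hypothetical helper that reports whether its input is NULL
describe <- function(a) {
  if (is.null(a)) {
    "a is NULL"
  } else {
    paste("a has", length(a), "element(s)")
  }
}

msg_null  <- describe(NULL)
msg_value <- describe(c(5, 3))
print(c(msg_null, msg_value))
```

The value of the branch that runs becomes the value of the `if`/`else` expression, so the function needs no explicit `return`.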
Check varying conditions and switch accordingly
switch(conditioned_value,
condition1_happens = output1,
condition2_happens = output2)
If the conditioned value matches none of the conditions,
switch returns NULL
Example: switch amongst multiple countries
a <- switch(country[i], "United-States" = 1,
"Ecuador" = 2) # you can populate this with more countries
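A runnable sketch of the country example. The vector `country` and the helper `country_code` are made up for illustration. A final unnamed argument to `switch` acts as a default, which avoids the NULL returned for unmatched values.

```r
# Made-up input: the last entry has no matching case
country <- c("United-States", "Ecuador", "Greece")

# country_code is a hypothetical helper; NA is the default for unmatched values
country_code <- function(x) {
  switch(x,
         "United-States" = 1,
         "Ecuador" = 2,
         NA)
}

codes <- sapply(country, country_code, USE.NAMES = FALSE)
print(codes)

# Without a default, an unmatched value gives NULL
no_match <- switch("Greece", "United-States" = 1, "Ecuador" = 2)
print(is.null(no_match))
```

`switch` works on a single value at a time, which is why the slide indexes `country[i]` inside a loop and this sketch uses `sapply`.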
No specific text
Refresh your understanding of the normal
distribution
Refresh your understanding of probabilities
Use my tutorial document to download and
experiment with R on your computers if you want:
[Link]