Data Science Course Overview

Biological Data Science

Dr Athanasios Tsanas (‘Thanasis’)

Associate Prof. in Data Science


Usher Institute, Medical School
University of Edinburgh
2001-2007: [Link]. + [Link]. Engineering
2007-2008: [Link]. Signal Processing
2008-2019: [Link]. Maths; Post-doc: Eng.; Lecturer: SBS
2017-present: Edinburgh, Usher Institute, Medical School


Tidal Generation: Rolls-Royce EMEC 500kW
© A. Tsanas, 2020
• Lecture notes
• Papers/reports listed under the ‘Reading material’ of each lecture: these are part of the exam material
• “An Introduction to Statistical Learning” by G. James, D. Witten, T. Hastie and R. Tibshirani. It is freely available as a PDF from the authors’ website: [Link] [Link]/~gareth/ISL/ISLR%20First%20 [Link].
• 10-credit course (total hours = 100)
• Coursework: 50% + Exam: 50%

Lectures
• Thanasis Tsanas
• 10 hours of lectures

R Labs
• Sjoerd Beentjes
• 10 hours of R labs
• Characterising raw data (data mining)
• Probability distributions
• Statistical associations
• Statistical mapping (learning models)
• Model validation and assessment
• R programming (labs with Stuart)
• Understand the setting and complications of using and analysing biomedical data
• Understand the differences between first-principle and data-driven models
• Data mining and feature extraction
• Statistical learning models & validation
• High-dimensional data implications
• Write clean, modular R code
Day 1 • Introduction and overview; reminder of basic concepts
Day 2 • Data collection and sampling

Day 3 • Data mining: signal/image processing and information extraction

Day 4 • Data visualization: density estimation, statistical descriptors

Day 5 • Exploratory analysis: hypothesis testing and quantifying relationships

Day 6 • Feature selection and feature transformation

Day 7 • Statistical machine learning and model validation

Day 8 • Statistical machine learning and model validation

Day 9 • Practical examples: bringing things together

Day 10 • Revision and exam preparation


Data visualization (density estimation, scatter plots)
→ Exploratory analysis (statistical associations)
→ Feature selection or transformation (e.g. PCA)
→ Statistical mapping (regression/classification)
Day 1 part 2
Talk in the language of the clinicians
• Understand what they need and the terms the domain experts use

Understand the physiology
• Domain dependent
• You will have to read biology/physiology books and articles
Monitor Parkinson’s disease using voice
• Before looking into the data, talk with the domain experts
• Understand the underlying physiology
Understanding heart physiology and circulatory system
First-principle models (differential equations)
• Mechanistic insight ☺
• Difficult to match data ☹

Data-driven models (statistics)
• Less interpretable ☹
• Better predictions ☺
Day 1 part 3
• Usefulness
– Often occurs, e.g. heights, IQ, returns, errors
– Can be used to approximate other distributions
– Central Limit Theorem: distribution of averages

• Structure: $X \sim \mathcal{N}(\mu, \sigma^2)$
– Continuous, bell shape
– Two parameters, $\mu$ and $\sigma$
– Analytic formula: $p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$
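As a quick check of the analytic formula for the normal density, the sketch below writes it out by hand and compares it with R's built-in `dnorm`. The function name `normal_pdf` and the grid of evaluation points are illustrative choices, not part of the lecture material.

```r
# Normal density written out directly from the analytic formula.
# normal_pdf is a hypothetical helper name used only for this sketch.
normal_pdf <- function(x, mu = 0, sigma = 1) {
  1 / sqrt(2 * pi * sigma^2) * exp(-(x - mu)^2 / (2 * sigma^2))
}

x <- seq(-3, 3, by = 0.5)                 # illustrative grid of points
manual  <- normal_pdf(x, mu = 0, sigma = 1)
builtin <- dnorm(x, mean = 0, sd = 1)     # R's built-in normal density

# The two should agree up to floating-point error
print(all.equal(manual, builtin))
```

In practice `dnorm` is the function to use; the hand-written version is only there to make the formula concrete.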
• Mean μ = 0
• Standard deviation σ = 1
• Notation: $Z \sim \mathcal{N}(0, 1)$

• Tabulation
- a necessity ☹
- given for Z ≥ 0
- tables are not all the same (area in tail, area from mean)

Examples
P(Z > 1.96) = 0.025
P(-1.96 < Z < 1.96) = 0.95 ⇒ P(-2 < Z < 2) ≈ 0.95
P(-1 < Z < 1) = 0.68
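The tabulated values can also be obtained in R, which avoids worrying about which area a particular printed table reports. The sketch below uses `pnorm` (cumulative probability) and `qnorm` (its inverse); the variable names are illustrative.

```r
# Upper-tail area P(Z > 1.96), about 0.025
p_upper <- pnorm(1.96, lower.tail = FALSE)

# Central area P(-1.96 < Z < 1.96), about 0.95
p_central <- pnorm(1.96) - pnorm(-1.96)

# Tables are usually given for Z >= 0 only; negative values follow by symmetry:
# P(Z < -z) equals P(Z > z)
p_lower_neg <- pnorm(-1.96)

# Inverse lookup: the z value that leaves 2.5% in the upper tail, about 1.96
z_975 <- qnorm(0.975)

print(round(c(p_upper, p_central, p_lower_neg, z_975), 4))
```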
• Each value in the table is the area beyond the point that is Z standard deviations from the mean.
• The Z-value of X is the number of standard deviations X lies from the mean:

$$Z = \frac{X - \mu}{\sigma}$$
• The time required for a certain drug to have an effect is normally distributed with a mean of 30 minutes and a standard deviation of 9 minutes. What is the probability that the drug takes more than 42 minutes to have an effect on a random patient?

[Figure: normal curve with 30 (the mean) and 42 marked on the time axis]

• P(time ≥ 42) = P(Z ≥ (42 − 30)/9) = P(Z ≥ 1.33) = 0.0918
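The drug example can be reproduced in R. The sketch below first standardises to a Z-value and rounds it to two decimals, as one would when using a printed table, and then computes the probability without rounding. Variable names are illustrative.

```r
mu    <- 30   # mean time to effect (minutes)
sigma <- 9    # standard deviation (minutes)
x     <- 42   # threshold of interest (minutes)

# Z-value of x: number of standard deviations from the mean
z <- (x - mu) / sigma

# Table-style answer: round z to 1.33 first, giving about 0.0918
p_table <- pnorm(round(z, 2), lower.tail = FALSE)

# Unrounded answer straight from the original scale, about 0.0912
p_exact <- pnorm(x, mean = mu, sd = sigma, lower.tail = FALSE)

print(round(c(z = z, p_table = p_table, p_exact = p_exact), 4))
```

The small gap between the two answers comes only from rounding Z to two decimals for the table lookup.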
• 68% of the area lies within 1σ of μ.
• 95% of the area lies within 2σ of μ.
• 99.7% of the area lies within 3σ of μ.

[Figure: normal curve with ±1σ, ±2σ and ±3σ marked]
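The 68 / 95 / 99.7 figures are rounded; a short sketch of how they can be checked in R is given below. The second calculation uses μ = 30 and σ = 9 purely as an illustration that the areas do not depend on the particular mean and standard deviation.

```r
k <- 1:3                                   # number of standard deviations

# Area within k standard deviations for the standard normal
within_k <- pnorm(k) - pnorm(-k)
names(within_k) <- paste0(k, " sd")
print(round(within_k, 4))

# Same areas for an arbitrary normal distribution (illustrative mu and sigma)
mu <- 30
sigma <- 9
within_k_general <- pnorm(mu + k * sigma, mean = mu, sd = sigma) -
  pnorm(mu - k * sigma, mean = mu, sd = sigma)
print(all.equal(unname(within_k), within_k_general))
```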
• Remember: 99.7% of the area lies within 3σ of μ
• You could go back to the Z-table and look for the value corresponding to a very low probability, but the table is not that detailed
• Practically, with about 1,000,000 points the max and min values lie about 5σ either side of μ (there are detailed analytical formulas for these computations)
• Draw random data in R to verify the concept
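One possible way to draw random data and look at the extremes is sketched below. The seed is arbitrary and only makes the run reproducible; the exact minimum and maximum will differ from seed to seed.

```r
set.seed(1)            # arbitrary seed, for reproducibility only
n <- 1e6               # about one million points
z <- rnorm(n)          # standard normal draws (mu = 0, sigma = 1)

# Smallest and largest values observed, in units of sigma
extremes <- range(z)
print(extremes)

# Rough theoretical check: expected number of the n points beyond +/- 5 sigma
expected_beyond_5 <- n * 2 * pnorm(-5)
print(expected_beyond_5)
```

The expected count beyond ±5σ is below one for a million draws, which is consistent with the extremes sitting close to, and usually just inside, 5σ.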
Day 1 part 4
• High-level programming language
• Large supporting community
• Free of charge
• Widely used in academia + industry
• Install R
• Install RStudio
• R packages

Convention to attract your attention!
• Assign values to variables: a = 5; b = 3
• Simple arithmetic: c = a + b
• Add comments to your code with the “#” character

“<-” is more typically used than “=” for assignment
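A minimal sketch of these basics is shown below. The result is stored in `c_sum` rather than `c`; that renaming is a choice made for this sketch, to avoid shadowing R's built-in `c()` function.

```r
# Everything after "#" on a line is a comment and is ignored by R
a <- 5            # assignment with "<-", the more common R style
b = 3             # "=" also works for assignment at the top level

c_sum <- a + b    # simple arithmetic
print(c_sum)
```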
• Repeat a process a set number of times (for-loop)

• for (variable in sequence) {smth happens}

• Example: repeat something 100 times:

for (i in 1:100) {smth happens as a function of i, e.g. some sort of checking entries in a vector}
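As a concrete instance of checking the entries of a vector in a loop, the sketch below counts the negative values. The vector `x` is made-up illustration data.

```r
x <- c(3, -1, 7, 0, -4)     # illustrative vector, not from the lecture

n_negative <- 0
for (i in seq_along(x)) {   # i runs over 1, 2, ..., length(x)
  if (x[i] < 0) {           # check the i-th entry
    n_negative <- n_negative + 1
  }
}
print(n_negative)

# The same count without an explicit loop (vectorised)
n_negative_vec <- sum(x < 0)
```

`seq_along(x)` is used instead of `1:length(x)` because it behaves sensibly when the vector is empty.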
• Check if a condition applies

• if (condition) {smth happens}

• if (condition) {smth happens} else {smth else happens}

• Example: check if the variable a is null

if (is.null(a))
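A small sketch of an if/else test for NULL follows. The function name `describe_value` and the returned messages are hypothetical, chosen only to make both branches visible.

```r
# describe_value is a hypothetical helper used only for this sketch
describe_value <- function(a) {
  if (is.null(a)) {
    "a is NULL"           # branch taken when a holds nothing at all
  } else {
    "a has a value"       # branch taken otherwise
  }
}

print(describe_value(NULL))
print(describe_value(5))
```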
• Check varying conditions and switch accordingly

• switch(conditioned value,
condition1_happens = output1,
condition2_happens = output2)
• If the conditioned value matches none of the listed cases (and no default is given), switch returns ‘NULL’

• Example: switch amongst multiple countries

a <- switch(country[i], "United-States" = 1, "Ecuador" = 2) # you can populate this with more countries
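The sketch below shows `switch` with and without a default. The country vector and the helper `country_code` are illustrative assumptions; in `switch`, a final unnamed argument acts as the default when nothing else matches.

```r
# Without a default, an unmatched value gives NULL
no_match <- switch("France", "United-States" = 1, "Ecuador" = 2)
print(is.null(no_match))

# country_code is a hypothetical helper; the last, unnamed argument is the default
country_code <- function(country) {
  switch(country,
         "United-States" = 1,
         "Ecuador" = 2,
         NA_real_)
}

country <- c("Ecuador", "United-States", "France")   # illustrative data
codes <- vapply(country, country_code, numeric(1))   # apply to each entry
print(codes)
```

Returning `NA` for unlisted countries keeps the result a plain numeric vector, which is usually easier to work with than a mix of numbers and NULLs.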
• No specific text
• Refresh your understanding of the normal distribution
• Refresh your understanding of probabilities
• Use my tutorial document to download and experiment with R on your computers if you want: [Link]