Lecture #1: Introduction to CS109A
aka STAT121A, AC209A, CSCIE-109A
CS109A Introduction to Data Science
Pavlos Protopapas, Kevin Rader and Chris Tanner
Lecture Outline
• Why data science? Why take CS109A?
• What is data science?
• What is this class, and what is it not?
• The data science process
• Example
CS109A, PROTOPAPAS, RADER, TANNER 2
Why?
Jobs!
Why?
Why do I love data science?
Why are you here?
A little bit of history
History
• Thousands of years ago: science was purely empirical. People counted
stars or crops and used the data to create machines to describe the
phenomena.
• A few hundred years ago: theoretical approaches, which try to derive
equations to describe general phenomena.
• About a hundred years ago: computational approaches.
• And then... data science.
What is data science?
What?
The Data Science Process
1. Ask an interesting question
• What is the scientific goal?
• What would you do if you had all of the data?
• What do you want to predict or estimate?
2. Get the Data
• How were the data sampled?
• Which data are relevant?
• Are there privacy issues?
3. Explore the Data
• Plot the data.
• Are there anomalies or egregious issues?
• Are there patterns?
4. Model the Data
• Build a model.
• Fit the model.
• Validate the model.
5. Communicate/Visualize the Results
• What did we learn?
• Do the results make sense?
• Can we effectively tell a story?
What?
The material of the course will integrate the five key facets of an
investigation using data:
1. data collection; data wrangling, cleaning, and sampling to get
a suitable data set
2. data management; accessing data quickly and reliably
3. exploratory data analysis; generating hypotheses and building
intuition
4. prediction or statistical learning
5. communication; summarizing results through visualization,
stories, and interpretable summaries.
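A minimal sketch of facet 4 (prediction), following the build, fit, validate pattern. The data are synthetic, and the straight-line model, the hold-out rule, and all variable names are assumptions chosen for illustration; this is not the course's own code.

```python
import numpy as np

# Synthetic data (an assumption): a noisy straight line.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 40)
y = 3.0 * x + 2.0 + rng.normal(scale=1.0, size=x.size)

# Hold out every fourth point as a validation set.
val_mask = np.arange(x.size) % 4 == 0
x_train, y_train = x[~val_mask], y[~val_mask]
x_val, y_val = x[val_mask], y[val_mask]

# Build and fit: a degree-1 polynomial fitted by least squares.
slope, intercept = np.polyfit(x_train, y_train, deg=1)

# Validate: compare the error on held-out data with a naive baseline
# that always predicts the training mean.
val_pred = slope * x_val + intercept
val_mse = float(np.mean((y_val - val_pred) ** 2))
baseline_mse = float(np.mean((y_val - y_train.mean()) ** 2))

print(f"validation MSE: {val_mse:.2f}, baseline MSE: {baseline_mse:.2f}")
```

Validating on points the model never saw is what separates "the model fits" from "the model predicts".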
What?
Week 1:
Getting ready with Python, Jupyter notebooks, environments and
NumPy.
Week 2:
Basic statistics, visualization, pandas and data scraping
Week 3 and 4:
Regression and sklearn using transportation data:
• kNN regression
• Linear and Polynomial Regression
• Multiple Regression
• Model Selection
• Regularization
Week 5:
Exploratory Data Analysis, matplotlib and seaborn:
• Basic concepts of EDA
• Basic concepts of Visualization and Communications
Week 6-7:
Classification and data imputation on health data:
• Logistic Regression (linear and polynomial)
• Multiple Logistic Regression
• Missing data and kNN classification
Week 8:
EthiCS
PCA and high dimensionality
Week 9 and 10:
Decision trees and ensemble methods:
• Simple Decision Trees for classification and Regression
• Bagging
• Random Forest
• Boosting
• Stacking
Week 10-12:
Neural Networks:
• Perceptron, Back Propagation and SGD
• MLP and design choices
• Advanced MLP, regularization, dropout, batch normalization
• Neural Network solvers
Week 12:
More visualization and model interpretation
Week 13:
Experimental Design:
• A/B testing
• Causal inference
• Randomization testing
• Adaptive and multi-armed bandit designs
CS109B
A. Neural Networks:
• CNNs
• RNNs
• Generative models
B. Unsupervised Clustering
C. Piecewise Linear Regression
D. Bayesian Modeling
CS109C – Advanced Practical Data Science
A. Production data science: from notebooks to the cloud
B. Big models, transfer learning and architecture learning
C. Visualization tools for interpreting models
D. Sequential data, seq2seq with attention, transformers, NLP and
time series modeling
Who?
Pavlos Protopapas
Scientific Director of the Institute for
Applied Computational Science (IACS)
Teaches CS109(a/b/c) and the data science
capstone course.
Research in astrostatistics: machine
learning, statistical learning, big data for
astronomical problems. He is excited
about the new telescopes coming online in
the next few years. He has absolutely no
hobbies or interests except teaching CS109
and eating.
Who? Instructor
Kevin Rader
Senior preceptor in Statistics.
Teaches CS 109A & Stat 139 this fall
and Stat 102 and Stat 98 in the
spring.
Research interests include complex
survey analysis and causal inference.
Hobbies include the outdoors, sports
(especially the aquatic variety), and
of course, farming.
Who? Instructor
Chris Tanner
Lecturer at IACS, teaching CS109A and
AC297R (capstone) now, and CS109B
in the Spring. Research interests are
within Natural Language Processing
and Deep Learning. Hobbies include
hiking and camping,
designing/sewing hiking bags, and
photography.
Who? Lab instructors
Eleni Kaxiras
Eleni is the Assistant Director for Data
Science and Computation at SEAS.
She has been this course’s Head TF
for the last 3 years and is now a
lab instructor. She is currently a
doctoral student, interested
in the application of deep learning
to analyzing biological signals. She
owns olive trees on the island of
Crete.
Who? Head TFs
Chris Gumb
Chris is currently working towards a graduate degree in Data Science
from Harvard Extension School with a particular focus on NLP. His other
interests and hobbies include music theory & jazz improvisation, and
film history.
Sol Girouard
Sol has been a head TF for 109B. She is a Quant, Math-Econ and Data
Scientist who channels her applied interdisciplinary background in the
intersection of financial markets and technology. Tae kwon full contact
second degree black belt.
Who? Teaching Fellows
Advanced Section (the 209 part):
Cedric Flamant
Section leaders:
Marios Mattheakis
Robbert Struyven
Abhimanyu (Abhi) Vasishth
Who? Teaching Fellows
Rashmi Banthia
Evan Mackay
Brandon Walker
Rachel Moon
Nicholas Stern
Pat Sukhum
Zheyu Wu
Yun Bin (Matteo) Zhang
Marcus Heijer
Nathan Hollenberg
Maddy Nakada
Tim Pugh
Alex Yu
Javier Machin
Lectures, Labs, Advanced Sections, Sections and Office Hours
Lectures will cover the material you need to complete the
homework, and to survive the rest of your life in CS109A. Attending
lectures is required: there are quizzes during and at the end of each
lecture (50% of them are dropped).
We will use a mix of notes and examples via notebooks.
1. Lecture notes and associated notebooks will be posted on GitHub
before lecture.
2. Lectures will be videotaped (and live streamed for DCE students) and
posted on the web page within approximately 24 hours.
Mondays and Wednesdays 1:30-2:45 pm @ Northwest Building B103.
Labs are meant to help you better understand the lecture materials
via examples.
Labs will be videotaped (and live streamed for DCE students) and
posted on Canvas within approximately 24 hours.
Thursdays 4:30-5:45 pm @Pierce 301.
Lectures and labs are supplemented by 1.5-hour sections led by teaching
fellows. There are two types of sections:
• Standard Sections will be a mix of review of material and practice
problems similar to the homework.
Friday 10:30-11:45 am at 1 Story St. Room 306 and Monday 4:30-5:45 pm in
Science Center 110
• Advanced Sections (A-Sections) will cover advanced topics like the
mathematical underpinnings of the methods seen in lectures and
labs.
Wednesday 4-5:15 pm at 1 Story St. Room 306
Advanced Section topics:
1. Linear Algebra and Hypothesis Testing: The Short Versions
2. Methods of regularization and their justifications
3. Generalized Linear Models
4. Mathematics of PCA
5. Decision trees and ensemble methods
6. Stochastic Gradient Descent
NOTE 1: The material covered in the Advanced Sections is required for all AC 209A
students. Most homework assignments will include one extra question for AC 209A
students, based on the A-Section material.
NOTE 2: No additional quizzes for A-section.
NOTE 3: A-sections and Friday’s regular section will be live streamed to everyone.
Homework(s)
There will be 8 homework assignments (not including Homework 0):
• Homework 0 (due Sept 11)
• Homework 1: Web scraping, Beautiful Soup
• Homework 2: Regression with kNN and linear regression
• Homework 3: Multiple regression, polynomial regression and model selection
• Homework 4*: Logistic regression and more
• Homework 5: PCA and ethics
• Homework 6: Random Forest, Boosting and Neural Networks
• Homework 7*: Neural Networks
• Homework 8: Experimental Design
* Must be completed individually.
You are encouraged but not required to submit in pairs, except for homework 4
and homework 7, which must be completed individually.
We will use the Groups function in Canvas for this; details will be
announced later.
All homework is due at 11:59pm on Wednesday, and new homework is released on
Wednesday at 3:00pm.
Final Project
There will be a final group project (2-4 students) due during the
exam period.
• We will provide 7 pre-defined projects that you can use for
your final project.
• In some very special cases you can use your own (public) data
set and your own project definition (subject to approval by the
instructors).
Help
The process to get help is:
1. Post the question on Ed; hopefully your peers will answer. We monitor the
posts and will respond within 8 hours of the posting time.
2. Go to Office Hours. This is the best way to get help.
3. For private matters send an email to the Helpline: [email protected].
The Helpline is monitored by all the instructors and TFs.
4. For personal matters send an email to Pavlos, Kevin and Chris.
Sundays will be slow days, so please be patient!
Grades
• Homework 0: 1%
• Paired Homework (six): 39%
• Individual Homework (two): 17%
• Quizzes: 10%
• Project: 30%
• Participation: 3%
• Total: 100%
We do not have predefined cuts for grades. We look for breaks
in the cumulative distribution.
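As a rough illustration of how the weights above combine, here is a small sketch of the arithmetic. The component names and the example scores are hypothetical, and the code only computes the weighted total; it does not model how letter grades are cut.

```python
# Weights (in percent) taken from the list above; the dictionary keys are made up.
weights = {
    "homework_0": 1,
    "paired_homework": 39,
    "individual_homework": 17,
    "quizzes": 10,
    "project": 30,
    "participation": 3,
}

# The weights should account for the whole grade.
assert sum(weights.values()) == 100


def course_score(component_scores):
    """Weighted total, given each component's score on a 0-100 scale."""
    return sum(weights[name] * component_scores[name] for name in weights) / 100


# Hypothetical student: 80 on every component.
example_scores = {name: 80 for name in weights}
print(course_score(example_scores))
```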
The Data Science Process
The Data Science Process is similar to the scientific process -
one of observation, model building, analysis and conclusion:
• Ask questions
• Data Collection
• Data Exploration
• Data Modeling
• Data Analysis
• Visualization and Presentation of Results
Note: This process is by no means linear!
Analyzing Hubway Data
Introduction: Hubway is metro-Boston’s public bike share program,
with more than 1600 bikes at 160+ stations across the Greater Boston
area. Hubway is owned by four municipalities in the area.
By 2016, Hubway operated 185 stations and 1750 bicycles, with 5 million
rides since launching in 2011.
The Data: In April 2017, Hubway held a Data Visualization Challenge at
the Microsoft NERD Center in Cambridge, releasing 5 years of trip data.
The Question: What does the data tell us about the ride share program?
The Data Exploration/Question Refinement Cycle
Our original question: ‘What does the data tell us about the ride share
program?’ is a reasonable slogan to promote a hackathon. It is not good
for guiding scientific investigation.
Before we can refine the question, we have to look at the data!
Based on the data, what kind of questions can we ask?
Who? Who’s using the bikes?
Refine into specific hypotheses:
• More men or more women?
• Older or younger people?
• Subscribers or one-time users?
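A first pass at the "Who?" hypotheses is usually just counting. The sketch below uses a tiny made-up table; the column names (`user_type`, `gender`) and values are assumptions for illustration and may not match the real Hubway schema.

```python
import pandas as pd

# Toy trip records (hypothetical values, not real Hubway data).
trips = pd.DataFrame({
    "user_type": ["Subscriber", "Casual", "Subscriber", "Subscriber", "Casual"],
    "gender": ["Male", None, "Female", "Male", None],
})

# Subscribers or one-time users?
user_type_counts = trips["user_type"].value_counts()
subscriber_share = (trips["user_type"] == "Subscriber").mean()

# More men or more women? Keep missing values visible instead of dropping them.
gender_counts = trips["gender"].value_counts(dropna=False)

print(user_type_counts)
print(gender_counts)
```

Keeping the missing values in the count matters here: if gender is only recorded for some riders, any "more men or more women" answer applies only to that subgroup.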
Where? Where are bikes being checked out?
Refine into specific hypotheses:
• More in Boston than Cambridge?
• More in commercial or residential?
• More around tourist attractions?
Sometimes the data is given to you in pieces and must be merged!
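A minimal sketch of merging data that arrives in pieces, assuming one table of trips and a separate table of station details. The tables, column names and values are made up for illustration.

```python
import pandas as pd

# Piece 1: trips only know the ID of the station where they started.
trips = pd.DataFrame({
    "trip_id": [1, 2, 3, 4],
    "start_station_id": [10, 11, 10, 12],
})

# Piece 2: station details live in a separate table.
stations = pd.DataFrame({
    "station_id": [10, 11, 12],
    "municipality": ["Boston", "Cambridge", "Boston"],
})

# A left join keeps every trip, even if its station were missing from the lookup.
merged = trips.merge(
    stations,
    left_on="start_station_id",
    right_on="station_id",
    how="left",
)

# More checkouts in Boston than in Cambridge?
checkouts_by_city = merged.groupby("municipality")["trip_id"].count()
print(checkouts_by_city)
```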
When? When are the bikes being checked out?
Refine into specific hypotheses:
• More during the weekend than on the weekdays?
• More during rush hour?
• More during the summer than the fall?
Sometimes the feature you want to explore doesn’t exist in the data,
and must be engineered!
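A minimal sketch of engineering such features, assuming the raw data has only a start timestamp stored as text. The column name `start_time` and the four example rows are hypothetical.

```python
import pandas as pd

# Raw data: only a timestamp string per trip (made-up examples).
trips = pd.DataFrame({
    "start_time": [
        "2016-07-02 09:15:00",  # a Saturday
        "2016-07-04 08:05:00",  # a Monday
        "2016-07-04 17:40:00",  # a Monday
        "2016-10-09 14:30:00",  # a Sunday
    ]
})

# Parse the strings, then derive the features the "When?" hypotheses need.
trips["start_time"] = pd.to_datetime(trips["start_time"])
trips["hour"] = trips["start_time"].dt.hour                   # rush hour?
trips["is_weekend"] = trips["start_time"].dt.dayofweek >= 5   # Monday=0, Sunday=6
trips["month"] = trips["start_time"].dt.month                 # summer vs. fall?

print(trips)
```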
Why? For what reasons/activities are people
checking out bikes?
Refine into specific hypotheses:
• More bikes are used for recreation than commute?
• More bikes are used for touristic purposes?
• Bikes are used to bypass traffic?
Do we have the data to answer these questions with reasonable
certainty?
What data do we need to collect in order to answer these questions?
How? Questions that combine variables.
• How do user demographics impact how long the bikes are used,
or where they are checked out?
• How do weather or traffic conditions impact bike usage?
• How do the characteristics of the station location affect the number of bikes
being checked out?
How questions are about modeling relationships between different
variables.
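One simple way to model a relationship between two variables is a least-squares line. The sketch below relates temperature to daily checkouts; the numbers are invented for illustration, and a straight line is an assumption, not a claim about how bike usage actually behaves.

```python
import numpy as np

# Made-up observations: daily temperature (Celsius) and number of checkouts.
temperature = np.array([5.0, 10.0, 15.0, 20.0, 25.0])
checkouts = np.array([30.0, 52.0, 68.0, 91.0, 109.0])

# Fit checkouts = slope * temperature + intercept by least squares.
slope, intercept = np.polyfit(temperature, checkouts, deg=1)

# The slope estimates how many extra checkouts go with one extra degree.
predicted = slope * temperature + intercept
residuals = checkouts - predicted

print(f"slope={slope:.2f}, intercept={intercept:.2f}")
```

A fitted slope describes association in this sample only; weather and traffic questions like the ones above usually need more variables and more careful validation.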
Inspirations for Data Viz/Exploration
So how well did we do in
formulating creative hypotheses
and manipulating the data for
answers?
Check out the winners of the
Hubway Challenge:
http://hubwaydatachallenge.org