Lecture #1: Introduction to CS109A

aka STAT121A, AC209A, CSCIE-109A

CS109A Introduction to Data Science


Pavlos Protopapas, Kevin Rader and Chris Tanner

Lecture Outline

• Why data science? Why take CS109A?

• What is data science?

• What is this class, and what is it not?

• The data science process

• Example

CS109A, PROTOPAPAS, RADER, TANNER 2


Why?
Jobs!

Why do I love data science?

Why are you here?


A little bit of history

A long time ago (thousands of years), science was purely empirical:
people counted stars or crops and used the data to create machines to
describe the phenomena.

A few hundred years ago: theoretical approaches that try to derive
equations to describe general phenomena.

About a hundred years ago: computational approaches.

And then... data science.


What is data science?

What?
The Data Science Process

Ask an interesting question
• What is the scientific goal?
• What would you do if you had all of the data?
• What do you want to predict or estimate?

Get the Data
• How were the data sampled?
• Which data are relevant?
• Are there privacy issues?

Explore the Data
• Plot the data.
• Are there anomalies or egregious issues?
• Are there patterns?

Model the Data
• Build a model.
• Fit the model.
• Validate the model.

Communicate/Visualize the Results
• What did we learn?
• Do the results make sense?
• Can we effectively tell a story?


What?

The material of the course will integrate the five key facets of an
investigation using data:
1. data collection: data wrangling, cleaning, and sampling to get a
suitable data set
2. data management: accessing data quickly and reliably
3. exploratory data analysis: generating hypotheses and building
intuition
4. prediction or statistical learning
5. communication: summarizing results through visualization, stories,
and interpretable summaries

Week 1:
Getting ready with Python, Jupyter notebooks, environments and NumPy.

Week 2:
Basic statistics, visualization, pandas and data scraping

Weeks 3 and 4:
Regression and sklearn, using transportation data:
• kNN regression
• Linear and Polynomial Regression
• Multiple Regression
• Model Selection
• Regularization

Week 5:
Exploratory Data Analysis, matplotlib and seaborn:
• Basic concepts of EDA
• Basic concepts of Visualization and Communications

Weeks 6 and 7:
Classification and data imputation, using health data:
• Logistic Regression (linear and polynomial)
• Multiple Logistic Regression
• Missing data and kNN classification

Week 8:
EthiCS
PCA and high dimensionality

Weeks 9 and 10:
Decision trees and ensemble methods:
• Simple Decision Trees for Classification and Regression
• Bagging
• Random Forest
• Boosting
• Stacking

Weeks 10-12:
Neural Networks:
• Perceptron, Backpropagation and SGD
• MLP and design choices
• Advanced MLP, regularization, dropout, batch normalization
• Neural Network solvers

Week 12:
More visualization and model interpretation

Week 13:
Experimental Design:
• A/B testing
• Causal inference
• Randomization testing
• Adaptive and multi-arm bandit designs


CS109B

A. Neural Networks:
• CNNs
• RNNs
• Generative models
B. Unsupervised Clustering
C. Piecewise Linear Regression
D. Bayesian Modeling


CS109C – Advanced Practical Data Science

A. Production data science, from notebooks to the cloud
B. Big models, transfer learning and architecture learning
C. Visualization tools for interpreting models
D. Sequential data, seq2seq with attention, transformers, NLP and
time series modeling


Who?
Pavlos Protopapas
Scientific Director of the Institute for Applied Computational Science
(IACS). Teaches CS109(a/b/c) and the data science capstone course.
Research in astrostatistics: machine learning, statistical learning,
big data for astronomical problems. He is excited about the new
telescopes coming online in the next few years. He has absolutely no
hobbies or interests except teaching CS109 and eating.

Who? Instructor
Kevin Rader
Senior Preceptor in Statistics. Teaches CS 109A and Stat 139 this fall,
and Stat 102 and Stat 98 in the spring. Research interests include
complex survey analysis and causal inference. Hobbies include the
outdoors, sports (especially the aquatic variety), and of course,
farming.


Who? Instructor
Chris Tanner
Lecturer at IACS, teaching CS109A and AC297R (capstone) now, and CS109B
in the spring. Research interests are within Natural Language
Processing and Deep Learning. Hobbies include hiking and camping,
designing/sewing hiking bags, and photography.


Who? Lab instructors
Eleni Kaxiras

Eleni is the Assistant Director for Data Science and Computation at
SEAS. She has been this course’s Head TF for the last 3 years and is
now a lab instructor. She is currently a doctoral student. She is
interested in the application of deep learning to analyzing biological
signals. She owns olive trees on the island of Crete.

Who? Head TFs
Chris Gumb
Chris is currently working towards a graduate degree in Data Science
from Harvard Extension School with a particular focus on NLP. His other
interests and hobbies include music theory and jazz improvisation, and
film history.

Sol Girouard
She has been a head TF for 109B and she is a Quant, Math-Econ and Data
Scientist who channels her applied interdisciplinary background in the
intersection of financial markets and technology. Tae kwon full contact
second degree black belt.


Who? Teaching Fellows

Advanced Section (the 209 part):


Cedric Flamant

Section leaders:
Marios Mattheakis
Robbert Struyven
Abhimanyu (Abhi) Vasishth


Who? Teaching Fellows

Rashmi Banthia
Evan Mackay
Brandon Walker
Rachel Moon
Nicholas Stern
Pat Sukhum
Zheyu Wu
Yun Bin (Matteo) Zhang
Marcus Heijer
Nathan Hollenberg
Maddy Nakada
Tim Pugh
Alex Yu
Javier Machin


Lectures, Labs, Advanced Sections, Sections and Office Hours

During lecture we will cover the material you will need to complete the
homework and to survive the rest of your life in CS109A. Attending
lectures is required: there are quizzes during and at the end of each
lecture (50% of them are dropped).
We will use a mix of notes and examples via notebooks.
1. Lecture notes and associated notebooks will be posted on GitHub
before lecture.
2. Lectures will be videotaped (and live streamed for DCE students) and
posted on the web page within approximately 24 hours.

Mondays and Wednesdays 1:30-2:45 pm @ Northwest Building B103.


Labs are meant to help you better understand the lecture materials via
examples.

Labs will be videotaped (and live streamed for DCE students) and posted
on Canvas within approximately 24 hours.

Thursdays 4:30-5:45 pm @ Pierce 301.

Lectures and labs are supplemented by 1.5 hour sections led by teaching
fellows. There are two types of sections:

• Standard Sections will be a mix of review of material and practice
problems similar to the homework.
Fridays 10:30-11:45 am at 1 Story St. Room 306 and Mondays 4:30-5:45 pm
in Science Center 110

• Advanced Sections (A-Sections) will cover advanced topics like the
mathematical underpinnings of the methods seen in lectures and labs.
Wednesdays 4-5:15 pm at 1 Story St. Room 306

A-Section topics:
1. Linear Algebra and Hypothesis Testing: The Short Versions
2. Methods of regularization and their justifications
3. Generalized Linear Models
4. Mathematics of PCA
5. Decision trees and ensemble methods
6. Stochastic Gradient Descent

NOTE 1: The material covered in the Advanced Sections is required for all AC 209A
students. Most homework assignments will include one extra question for AC 209A
students, based on the A-Section materials.
NOTE 2: There are no additional quizzes for A-Sections.
NOTE 3: A-Sections and Friday’s regular section will be live streamed to everyone.

Homework(s)
There will be 8 homework assignments (not including Homework 0):

• Homework 0 (due Sept 11)
• Homework 1: Web scraping, Beautiful Soup
• Homework 2: Regression: kNN and LinReg
• Homework 3: Multi-regression, polynomial regression and model selection
• Homework 4*: Log Reg and more
• Homework 5: PCA and ethics
• Homework 6: Random Forest, Boosting and Neural Networks
• Homework 7*: Neural Networks
• Homework 8: Experimental Design

You are encouraged but not required to submit in pairs, except for
Homework 4 and Homework 7 (marked * above), which must be completed
individually.
We will be using the Groups function in Canvas to do this; details will
be announced later.
All homework is due at 11:59 pm on Wednesday, and new homework will be
released on Wednesdays at 3:00 pm.


Final Project

There will be a final group project (2-4 students) due during the exam
period.
• We will provide 7 pre-defined projects which you could use for your
final project.
• In some very special cases you can use your own (public) data set and
your own project definition (to be approved by the instructors).


Help

The process to get help is:

1. Post the question on Ed; hopefully your peers will answer. We monitor the
posts and will respond within 8 hours of the posting time.
2. Go to Office Hours; this is the best way to get help.
3. For private matters, send an email to the Helpline: [email protected].
The Helpline is monitored by all the instructors and TFs.
4. For personal matters, send an email to Pavlos, Kevin and Chris.

Sundays will be slow days, so please be patient!


Grades


• Homework 0: 1%
• Paired Homework (six): 39%
• Individual Homework (two): 17%
• Quizzes: 10%
• Project: 30%
• Participation: 3%
• Total: 100%

We do not have predefined cuts for grades. We look for breaks in the
cumulative distribution.

The Data Science Process


The Data Science Process is similar to the scientific process, one of
observation, model building, analysis and conclusion:
• Ask questions
• Data Collection
• Data Exploration
• Data Modeling
• Data Analysis
• Visualization and Presentation of Results
Note: This process is by no means linear!


Analyzing Hubway Data

Introduction: Hubway is metro-Boston’s public bike share program, with
more than 1600 bikes at 160+ stations across the Greater Boston area.
Hubway is owned by four municipalities in the area.

By 2016, Hubway operated 185 stations and 1750 bicycles, with 5 million
rides since launching in 2011.

The Data: In April 2017, Hubway held a Data Visualization Challenge at
the Microsoft NERD Center in Cambridge, releasing 5 years of trip data.

The Question: What does the data tell us about the ride share program?


The Data Exploration/Question Refinement Cycle

Our original question: ‘What does the data tell us about the ride share
program?’ is a reasonable slogan to promote a hackathon. It is not good
for guiding scientific investigation.

Before we can refine the question, we have to look at the data!

Based on the data, what kind of questions can we ask?


Who? Who’s using the bikes?

Refine into specific hypotheses:


• More men or more women?
• Older or younger people?
• Subscribers or one-time users?
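
Each of these hypotheses comes down to comparing category shares in the trip records. A minimal sketch of that comparison in plain Python follows; the records, the field names (`user_type`, `gender`) and the helper `share_by` are illustrative assumptions, not the actual Hubway schema.

```python
from collections import Counter

# Hypothetical trip records: field names and values are made up for
# illustration and are NOT taken from the real Hubway data release.
trips = [
    {"user_type": "Subscriber", "gender": "Female"},
    {"user_type": "Subscriber", "gender": "Male"},
    {"user_type": "Subscriber", "gender": "Male"},
    {"user_type": "Casual", "gender": "Male"},
]


def share_by(records, field):
    """Return the fraction of records falling into each category of `field`."""
    counts = Counter(record[field] for record in records)
    total = sum(counts.values())
    return {category: n / total for category, n in counts.items()}


user_type_share = share_by(trips, "user_type")
gender_share = share_by(trips, "gender")
print(user_type_share)
print(gender_share)
```

With real data the same idea is usually expressed as a group-by and count; the point is only that a "who" hypothesis becomes a comparison of proportions.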


Where? Where are bikes being checked out?

Refine into specific hypotheses:


• More in Boston than Cambridge?
• More in commercial or residential?
• More around tourist attractions?

Sometimes the data is given to you in pieces and must be merged!
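
As a rough illustration of such a merge, the sketch below joins a trip table to a station table so that checkouts can be counted per municipality. The two tables, their column names and the helper `left_join` are assumptions made up for this example; with pandas the same step would normally be a `DataFrame.merge` call.

```python
from collections import Counter

# Hypothetical station and trip tables (illustrative values only).
stations = [
    {"station_id": 1, "name": "Station A", "municipality": "Boston"},
    {"station_id": 2, "name": "Station B", "municipality": "Cambridge"},
]
trips = [
    {"trip_id": 101, "start_station_id": 1},
    {"trip_id": 102, "start_station_id": 2},
    {"trip_id": 103, "start_station_id": 1},
    {"trip_id": 104, "start_station_id": 9},  # no matching station record
]


def left_join(left_rows, right_rows, left_key, right_key):
    """Attach the matching right-hand row to every left-hand row.

    Left rows without a match are kept unchanged (a left join). If both
    rows share a column name, the left-hand value wins.
    """
    lookup = {row[right_key]: row for row in right_rows}
    merged = []
    for row in left_rows:
        match = lookup.get(row[left_key], {})
        merged.append({**match, **row})
    return merged


merged = left_join(trips, stations, "start_station_id", "station_id")

# Count checkouts per municipality; unmatched trips are labeled "unknown".
checkouts = Counter(row.get("municipality", "unknown") for row in merged)
print(checkouts)
```

Keeping unmatched trips rather than silently dropping them makes gaps in the station table visible during exploration.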


When? When are the bikes being checked out?

Refine into specific hypotheses:


• More during the weekend than on the weekdays?
• More during rush hour?
• More during the summer than the fall?

Sometimes the feature you want to explore doesn’t exist in the data,
and must be engineered!
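
For example, a raw checkout timestamp does not say "weekend" or "rush hour", but both can be derived from it. The sketch below assumes timestamps formatted as `YYYY-MM-DD HH:MM:SS` and uses rush-hour windows (7-10 am and 4-7 pm on weekdays) chosen purely for illustration; the function name `engineer_time_features` is hypothetical.

```python
from datetime import datetime


def engineer_time_features(start_time):
    """Derive calendar features from a 'YYYY-MM-DD HH:MM:SS' timestamp string."""
    ts = datetime.strptime(start_time, "%Y-%m-%d %H:%M:%S")
    # weekday() returns 0 for Monday, so 5 and 6 are Saturday and Sunday.
    is_weekend = ts.weekday() >= 5
    # Assumed rush-hour windows: weekdays, 7-10 am and 4-7 pm.
    in_window = 7 <= ts.hour < 10 or 16 <= ts.hour < 19
    return {
        "hour": ts.hour,
        "month": ts.month,
        "is_weekend": is_weekend,
        "is_rush_hour": (not is_weekend) and in_window,
    }


saturday_morning = engineer_time_features("2016-07-09 08:15:00")
monday_morning = engineer_time_features("2016-07-11 08:15:00")
monday_midday = engineer_time_features("2016-07-11 13:00:00")
print(saturday_morning)
print(monday_morning)
```

Once these columns exist, the weekend, rush-hour and seasonal hypotheses above reduce to counting trips per feature value.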


Why? For what reasons/activities are people checking out bikes?

Refine into specific hypotheses:

• Are more bikes used for recreation than for commuting?
• Are more bikes used for touristic purposes?
• Are bikes used to bypass traffic?

Do we have the data to answer these questions with reasonable certainty?
What data do we need to collect in order to answer these questions?

How? Questions that combine variables.

• How do user demographics affect how long the bikes are used, or where
they are checked out?
• How do weather or traffic conditions affect bike usage?
• How do the characteristics of the station location affect the number of
bikes being checked out?

"How" questions are about modeling relationships between different
variables.
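
One simple way to model such a relationship is a least-squares line between two variables. The sketch below fits daily checkouts against temperature; the numbers are invented (and exactly linear, so the result is easy to check), and `fit_line` is a hypothetical helper rather than anything from the course materials.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope is the covariance of x and y divided by the variance of x.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up data: temperature in Celsius and number of checkouts that day.
temperature_c = [5, 10, 15, 20, 25]
daily_checkouts = [120, 170, 220, 270, 320]

slope, intercept = fit_line(temperature_c, daily_checkouts)
print(f"checkouts ≈ {intercept:.1f} + {slope:.1f} * temperature")
```

Real usage data would be noisy, so a fitted line like this is a starting point for the modeling step, not a conclusion about cause.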



Inspirations for Data Viz/Exploration

So how well did we do in formulating creative hypotheses and
manipulating the data for answers?

Check out the winners of the Hubway Challenge:

http://hubwaydatachallenge.org
