0% found this document useful (0 votes)
128 views66 pages

Lecture 1 - Introduction: IEOR 242 - Applications in Data Analysis Fall 2019 - Paul Grigas

The document provides an overview of Google's search engine and the PageRank algorithm. It discusses how early search engines ranked pages based on simple word counts, and how Google developed PageRank to view the web as a network and rank pages based on the number and quality of inbound links. PageRank models a random web user clicking on links to probabilistically estimate the popularity of pages based on this network structure.

Uploaded by

Marc Horneck
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
128 views66 pages

Lecture 1 - Introduction: IEOR 242 - Applications in Data Analysis Fall 2019 - Paul Grigas

The document provides an overview of Google's search engine and the PageRank algorithm. It discusses how early search engines ranked pages based on simple word counts, and how Google developed PageRank to view the web as a network and rank pages based on the number and quality of inbound links. PageRank models a random web user clicking on links to probabilistically estimate the popularity of pages based on this network structure.

Uploaded by

Marc Horneck
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

+

Lecture 1 – Introduction
IEOR 242 – Applications in Data Analysis
Fall 2019 – Paul Grigas
IEOR 242, Fall 2019 - Lecture 1
+ 2

Today’s Agenda

n Motivation and Objectives

n Google’s Search Engine

n Course Information

n Introduction to R

n Introduction to Statistical Learning

IEOR 242, Fall 2019 - Lecture 1


+ 3

Importance of Analytics

n Tremendous growth in the amount and variety of


data, and in computational power

n Organizations are becoming much more data-


driven
n Number of Chief Data Officers in companies grew from 400
to 4000 between 2014 to 2017

n Study of 330 public North American companies:


n Companies in top third of industry in data-driven decision
making were 5% more productive and 6% more
profitable than competitors

IEOR 242, Fall 2019 - Lecture 1


+ 4

What is Data Analytics?

n Using data to build models that lead to better


decisions

n… and ultimately add value

n Analytics is both “big data” and “small data”

n Hopefully you have seen some examples before


n We will take a deep dive into methodology, applications,
and practical skills

IEOR 242, Fall 2019 - Lecture 1


+ 5

What is Data Analytics?

n Descriptive – finding patterns in data


n Summary statistics
n Visualization
n Clustering, etc.

n Predictive – predicting outcomes


n Linear Regression
n Logistic regression, linear discriminant analysis
n CART, Random Forests, Boosting, etc.

n Prescriptive – making decisions


n Data-driven optimization

IEOR 242, Fall 2019 - Lecture 1


+ 6

What is Statistical Machine


Learning?
nA synonym for data analytics?

nA vast set of tools for understanding data

n Data analytics places more emphasis on decisions


n Not a 100% fair statement

n Distinction seems to be more historical than


anything else
n Lines are increasingly blurred in recent years

IEOR 242, Fall 2019 - Lecture 1


+ 7

Analytics, ML, and Optimization –


One Perspective

(Image from Dimitris Bertsimas, Editor-in-Chief of the


INFORMS Journal on Optimization.)

IEOR 242, Fall 2019 - Lecture 1


+ 8

Key Themes

n Analytics provides a key competitive edge to both


individuals and companies

n Our perspective:
n Analytics is not a series of black boxes, but there is no need
to reinvent the bicycle
n (Rather, we will teach you how to ride the bicycle)
n Teach analytics techniques through real-world examples
and real data
n Problem → data → tool(s) → solutions → decisions

IEOR 242, Fall 2019 - Lecture 1


+ 9

Course Objectives

n Teach you how to identify opportunities for creating


value through the use of data analytics

n Teach you how to interpret, understand, and use the


results of analytical models

n Convey enough about the inner workings of analytics


methods so that you can make an informed choice
about which method to use

n Give you practical experience applying analytics


methods and communicating their results

n Inspire you to leverage analytics in your future career

IEOR 242, Fall 2019 - Lecture 1


+
Analytics and Google’s
Search Engine

IEOR 242, Fall 2019 - Lecture 1 10


+ 11

Google’s Search Engine

n “I think Google should be like a Swiss Army knife:


clean, simple, the tool you want to take
everywhere.” – Marissa Mayer

IEOR 242, Fall 2019 - Lecture 1


+ 12

Brief History of Google

n 1996 – Sergei Brin and Larry Page, then graduate


students at Stanford, started Google
n 2001 – Eric E. Schmidt, then CEO of Novell,
became CEO of Google
n Many new products launched since then

n 2004 – Google IPO: 19,605,052 shares at $85 per


share
n Alphabet’s market cap today: $828.65B
n Behind only Microsoft, Apple, and Amazon

IEOR 242, Fall 2019 - Lecture 1


+ 13

How Did Search Engines Work in


the Early 1990s?
n Simplest model:
n User searches for “analytics”
n Rank webpages according to the number of times they use
the term “analytics”
n What can go wrong?

n Somewhat improved model (?):


n Let T = 1 if “analytics” appears in the title of a given
webpage, T = 0 otherwise
n Rank webpages according to score = 50*T + # of times
“analytics” appears in the page
n Does this help?

IEOR 242, Fall 2019 - Lecture 1


+ 14

How Does Google’s Search Engine


Work?
n First determine/estimate which pages are visited
most frequently
n create a ranked list of all ~30 trillion pages – the
“PageRank”

n Response to search query is comprised of those


pages with the queried word combinations, ranked
by popularity of page

n Challenge: how to reliably estimate the popularity


of each and every web page?

IEOR 242, Fall 2019 - Lecture 1


+ 15

World Wide Web is a Network

n We need a new way of thinking

n View the World Wide Web as a network


n Data: webpages and who they link to

IEOR 242, Fall 2019 - Lecture 1


+ 16

World Wide Web is a Network

n Vertices are pages


A B
n Arcs are hyperlinks

F
E

IEOR 242, Fall 2019 - Lecture 1


+ 17

Web User Model

n Model: users go
from page to page A B
by randomly
clicking on a
hyperlink on the C
current page

n If I start at page C, I F
go to A, B, or E each E
with probability 1/3
D

IEOR 242, Fall 2019 - Lecture 1


+ 18

Web User Model

n Think of a hyperlink as a recommendation: a


hyperlink from my homepage to yours is my
endorsement of your webpage

nA page is more important if it receives many


recommendations

n But status of recommender also matters

IEOR 242, Fall 2019 - Lecture 1


+ 19

Web User Model

n An endorsement from Warren Buffett probably


does more to strengthen a job application than 20
endorsements from 20 unknown teachers and
colleagues

n However, if the interviewer knows that Warren


Buffett is very generous with praise and has written
20,000 very strong recommendations, then the
endorsement drops in importance

IEOR 242, Fall 2019 - Lecture 1


+ 20

Slightly More Sophisticated Web


User Model
n Model: users go from page to page by randomly
clicking on a hyperlink on the current page

n Slightly more sophisticated model:


n First flip a weighted coin with P(heads) = 0.85
n If heads, now randomly click on a hyperlink from the
current page
n If tails, randomly go to any other page on the web

IEOR 242, Fall 2019 - Lecture 1


+ 21

Assessing the Popularity of a


Webpage
n Imagine a user who keeps surfing the web
according to the previous model, ad infinitum
n The popularity of a given webpage is the
proportion of time she spends on that page
n (The probability of that webpage in the stationary
distribution of the Markov Chain)

n Method is “easy” to compute


n Classical approaches from Markov Chain theory
n ~30 trillion webpages creates a challenge from a
computational engineering point of view

IEOR 242, Fall 2019 - Lecture 1


+ 22

The PageRank Algorithm

n Brin and Page proposed the idea of ranking pages


using this method

n As always, in practice there are other issues to deal


with:
n Web users can spend lots of time or little time on a page
n Not all users click on all hyperlinks with equal likelihood
n Other issues

n However, this method forms the heart of Google’s


Search Engine

IEOR 242, Fall 2019 - Lecture 1


+ 23

The Edge of Simplicity

n Simple but powerful use of “basic” Markov Chain


theory

n Brilliantly implemented and led to the rise of


Google

n Would there be a search business without the


mathematics?

n Without a good search engine, how valuable would


the web be?

IEOR 242, Fall 2019 - Lecture 1


+ 24

What about Ads?

n Sponsored search
advertising is Google’s
main source of revenue

n Ads are ordered by bid *


predicted click-through
rate * quality score

n How does Google predict


the click-through rate
(CTR)?

IEOR 242, Fall 2019 - Lecture 1


+ 25

Analytics for AdWords Auctions

n Many different sources of data for this prediction


problem:
n What is the text of the ad?
n What is the demographic information of the user?
n What is the historical CTR of this ad?
n If the ad is new, what is the historical CTR of other related ads?

n We will take a deep dive into internet advertising later


in the course
n Considering both the CTR prediction problem and the
decision problems faced by Google and advertisers alike

n We will also examine methods for incorporating text


data and temporal data into predictions

IEOR 242, Fall 2019 - Lecture 1


+ 26

Bringing it All Together

n In this course, you will learn about a wide range of


data analytics techniques

n Many times, several techniques need to be


integrated together to effectively address a
business problem

n We will illustrate many applications of business


analytics so you can see how the tools might be
effectively applied

IEOR 242, Fall 2019 - Lecture 1


+
Course Information

IEOR 242, Fall 2019 - Lecture 1 27


+ 28

Course Information

n Instructor: Paul Grigas


n I am an assistant professor in the IEOR Department, joined
in Fall 2016

n GSIs: Lucas Spangher and Mengxin Wang


n PhD students in the EECS and IEOR

n Office hours:
n Start next week
n TBD and will be posted on bCourses

IEOR 242, Fall 2019 - Lecture 1


+ 29

Course Information
n Website(s): bCourses and Piazza

n Prerequisites:
n Basic fluency in statistics, probability, and ideally optimization
n Implicit in the above is also a working knowledge of notation/basic
concepts in linear algebra and calculus (will be reviewed as needed)
n Some programming experience

n Recommended Readings:
n Chapters from An Introduction to Statistical Learning with Applications in
R by James, Witten, Hastie, and Tibshirani (available online)
n Excerpts from R for Data Science: Import, Tidy, Transform, Visualize, and
Model Data by Hadley Wickham and Garrett Grolemund (available
online)

n Supplementary Readings:
n Chapters from The Analytics Edge by Bertsimas, O’Hair, and Pulleyblank
n Chapters from The Elements of Statistical Learning: Data Mining,
Inference, and Prediction by Trevor Hastie, Robert Tibshirani and Jerome
Friedman
IEOR 242, Fall 2019 - Lecture 1
+ 30

Piazza

n Piazza is the main electronic communication method for


the course
n For 99% of things, you should be using Piazza instead of
email
n Either a public or private post that only the instructor and GSIs
(Lucas, Mengxin, and myself) can see

n You are encouraged to use public posts in situations where


other students may benefit from the discussion
n Priority list:
n 1.) Consider making a public post on Piazza
n 2.) Consider making a private post on Piazza visible only to the
instructors
n 3.) In exceptional circumstances, send an email to me using “[IEOR
242 Urgent]” to start the subject line

IEOR 242, Fall 2019 - Lecture 1


+ 31

Course Information

n Grading:
n Homework Assignments: 40%
n Final Project: 35%
n Midterm Exam: 25%

n Lecture:
n We will post lecture notes ahead of class
n Participation is encouraged

IEOR 242, Fall 2019 - Lecture 1


+ 32

Course Information, cont.

n Discussion:
n Optional, but attendance is encouraged
n The main goal of discussion sections is to demo R code
n We will post notes and associated R code snippets

n Two discussion section labs:


n Thursday afternoon 2:00 pm – 3:00 pm in 9 Lewis (we are
looking into changing the room)
n Friday afternoon 12:00 pm – 1:00 pm in 9 Lewis
n Please try to attend the section you are registered for (at least
at the start of the semester) but it is OK to hop around if you
need to

n First discussion section will be next week


n Topics: Introduction to R, linear algebra notation, and a bit of
linear regression

IEOR 242, Fall 2019 - Lecture 1


+ 33

Homework Assignments

n Discussion/some collaboration is allowed, but final


product must be completed individually
n5 or 6 homework assignments
n Assignments will be due on Gradescope by
midnight
n Assignments will be a mix of conceptual exercises
and data analysis exercises in R
n You are allowed a total of 5 slip days throughout
the semester

IEOR 242, Fall 2019 - Lecture 1


+ 34

Midterm

n In class on October 24

n Purpose of the midterm is to gauge your


understanding of the material so far

n Conceptual questions similar to the homework and


also some conceptual R based questions
n No need to memorize R commands

n More details and sample exam(s) as the date


approaches

IEOR 242, Fall 2019 - Lecture 1


+ 35

Final Project

n Final project carries a lot of weight and is a crucial


part of the course

n Teams of 5

n You propose a topic

n We are available for advice and will provide


sample resources to help you find data if necessary

IEOR 242, Fall 2019 - Lecture 1


+ 36

Some Example Projects

n Predict code blue events in an ICU, improve


response times

n Predict the winning team in the video game DOTA


2, and then optimize hero selection strategy

n Predict which Kickstarter proposals will be


successful

n Predict energy demand and then optimize solar


panel placement

IEOR 242, Fall 2019 - Lecture 1


+ 37

Final Project Timeline

n October 28: submit project proposal


n We will provide e-mail feedback

n December 3 and 5: 5-minute project


presentations
n During lecture (plus maybe some extra meeting time, e.g.,
during discussion)

n December 20: four page (+ appendices) final


report due

IEOR 242, Fall 2019 - Lecture 1


+ 38

Course Content

n Methods: linear regression, logistic regression, linear


discriminant analysis, ROC analysis, cross validation,
bootstrapping, CART, random forests, boosting, text
analytics, time series analysis, clustering, regularization
and feature engineering, PCA, neural networks, data
manipulation and visualization in R, integration with
optimization, others
n Applications: predicting wine quality and prices, loan
defaults, customer retention, parole violators, click-
through rates, sentiment of tweets, sales volume,
housing prices. Analyzing social networks, customer
characteristics, recommendation systems, internet
advertising spending, and more

IEOR 242, Fall 2019 - Lecture 1


+ 39

IEOR 242 Fall 2019 vs. Spring


2020?
nI will not teach IEOR 242 in Spring 2020

n Thus, there is no guarantee that the topics, course


requirements, style etc. will be the same

n My style of teaching this course: I like to blend


methodology, theoretical underpinnings,
applications, and working with data

IEOR 242, Fall 2019 - Lecture 1


+
Introduction to R

IEOR 242, Fall 2019 - Lecture 1 40


+ 41

Why do we use R?

n Some course goals:


n Understand the complexity of data and how to deal with
data
n Understand, use, and think critically about the results of
analytics models
n Create your own analytics models

n We can reach these goals effectively by using R

IEOR 242, Fall 2019 - Lecture 1


+ 42

What is R?

nR is a software environment for data analysis,


statistical computing, and graphics
n Natural to use, complete data analyses in just a few lines
n Can create almost any analytics model imaginable

nR is a sophisticated programming language


n More R packages are consistently being developed
n Data wrangling and transformation techniques, integration with
databases and web scraping, improved ways to communicate
results, etc.
n We will explore some advanced features later in the course

nR is an ecosystem/community for Data Science


n Check out #rstats on Twitter

IEOR 242, Fall 2019 - Lecture 1


+ 43

Why use R?

n There are many choices for data analysis


software/programming languages
n Python, SAS, Stata, SPSS, Excel, MATLAB, Minitab, Julia, …
n We will at times give pointers/demos to comparable tools in
Python, but not everything that we need can be done in Python
easily (as far as we know)

n R is free (open source)


n Easy to re-run previous work and make adjustments
n Excellent graphics and visualizations
n Many companies and organizations that use R
n Check out: [Link]

IEOR 242, Fall 2019 - Lecture 1


+ 44

R vs. Python?

n “Generally, I think R and python are much more


similar than they are different. I'm not really
interested in the debates about which one you
should learn. Obviously, I think learning R is the
right choice, but you can be effective with either.
My main advice is to focus on one and get good at
it. That's a much more effective way of learning
than dabbling in both. (Of course, once you get
good in one, you can learn the other, but do it in
serial, not parallel)” – Hadley Wickham, Chief
Scientist at RStudio

IEOR 242, Fall 2019 - Lecture 1


+ 45

R for Data Science

n Excellent reference on
practical tools for how to
do data science with R

n We will cover some


aspects of this book, but
not nearly everything

n Funand enjoyable to
read and follow along
with R

IEOR 242, Fall 2019 - Lecture 1


+ 46

What is RStudio?

n RStudio is an integrated development environment


(IDE) for R

n RStudio puts everything in one place and will


enable us to be more productive and efficient

IEOR 242, Fall 2019 - Lecture 1


+ 47

Guidelines About R in this Course

n This is not a programming course


n Focus of this course is about learning data analytics tools and
applications

n We will distribute scripts that serve as coding


templates for many tasks that you will need to perform

n The project (and likely some assignments) will require


you to solve programming problems
n Use Stack Overflow, Google, etc.!
n All experienced programmers search for help online every
single day
n Last resort: ask the teaching team for help (Piazza and office
hours are best)
IEOR 242, Fall 2019 - Lecture 1
+
Introduction to Statistical
Learning

IEOR 242, Fall 2019 - Lecture 1 48


+ 49

What is Statistical Learning?

n Data analytics is a process


n From data to decisions

n Statisticallearning is a set of tools for modeling


and understanding complex datasets
n Like optimization is a set of tools for modeling and solving
complex decision problems

n Statistical
Learning is inherently a more
mathematical subject
n Data analytics uses mathematics
n Effective use of data analytics requires fluency in the
language of statistical learning

IEOR 242, Fall 2019 - Lecture 1


+ 50

Advertising Budget Planning


Example
n Imagine that you are consulting for a company that
wants to improve sales of a particular product

n 200 different markets

n For each market, we collected data on:


n Sales of the product in dollars
n Advertising budgets for three different types of media: TV,
radio, and newspaper

n The company would like to know how advertising


budget decisions impact sales

IEOR 242, Fall 2019 - Lecture 1


+ 51

Advertising Budget Planning


Example 25

25

25
20

20

20
Sales

Sales

Sales
15

15

15
10

10

10
5

5
0 50 100 200 300 0 10 20 30 40 50 0 20 40 60 80 100

TV Radio Newspaper

IEOR 242, Fall 2019 - Lecture 1


+ 52

General Statistical Learning Model

n Input variables: X = (X1 , X2 , . . . , Xp )


n Also often called features, predictors, or independent
variables

n Output variable: Y
n Also often called response or dependent variable

n Collected data in the form of n pairs:


n (xi , yi ) i = 1, . . . , n
n xi = (xi1 , xi2 , . . . , xip )T

IEOR 242, Fall 2019 - Lecture 1


+ 53

A General Statistical Learning


Model
n We presume that there is some relationship
between X and Y :

Y = f (X) + ✏
n ✏ is a random error term that is independent of X
and has mean 0

n f is a fixed but unknown function that represents


the systematic information that X provides about Y

n (Supervised) statistical learning is a set of tools for


estimating f
IEOR 242, Fall 2019 - Lecture 1
+ 54

Why estimate f ?

n Two reasons to estimate f


n Prediction
n Inference

n If we have a good estimate for f , call it fˆ , then we


can use fˆ to make a prediction for a new value of X

Ŷ = fˆ(X)
n Think of the advertising example

IEOR 242, Fall 2019 - Lecture 1


+ 55

Statistical Learning for Prediction

n True model:
Y = f (X) + ✏
n Our prediction: Ŷ = fˆ(X)

n What does the accuracy of our prediction depend


on?
n Reducible error: fˆ is not a perfect estimate of f
n Irreducible error: Var(✏) > 0

n The aim of statistical learning techniques is to


reduce the reducible error!

IEOR 242, Fall 2019 - Lecture 1


+ 56

Statistical Learning for Inference


n Inference: how does Y change when X changes?
n Which predictor variables are associated with the
response?
n What is the relationship between the response and each
associated predictor? Positive or negative?
n Is a linear equation adequate to describe the relationship
between X and Y?

n These are all essentially questions about the


behavior (e.g., slope) of f

n In this course, we will mostly be concerned with


prediction, but inference is important!

IEOR 242, Fall 2019 - Lecture 1


+ 57

How do we estimate f ?

n As always, we start with data:


n (xi , yi ) i = 1, . . . , n
n Often called the training data

n A statistical learning method is a procedure,


applied to the training data, for estimating f
n We’ll cover a lot of these methods in this course

n Broadly speaking, two classes of methods:


n Parametric methods
n Non-parametric methods

IEOR 242, Fall 2019 - Lecture 1


+ 58

Parametric Methods

n Start by assuming a particular functional form for f


n For example, assume that f is linear:
f (X) = 0 + 1 X1 + . . . + p Xp
n f is parameterized by = ( 0, 1, . . . , p)

n Now apply a method that uses the training data to


estimate
n We sometimes call this fitting the model
n Classic example: ordinary least squares, i.e., linear
regression
n We will consider more sophisticated approaches as well

IEOR 242, Fall 2019 - Lecture 1


+ 59

Parametric Methods

n Advantages of Parametric Methods:


n Simplifies the problem of estimating f to the problem of
estimating
n Potentially relatively less data needed to produce a
reliable estimate of

n Major Disadvantage of Parametric Methods:


n The true functional form of f is usually more complicated
than the model we chose
n This may be remedied by selecting a flexible model class,
but this comes at the danger of overfitting

IEOR 242, Fall 2019 - Lecture 1


+ 60

Non-parametric Methods

n Of course, non-parametric methods do not make


parametric assumptions about f

n No explicit functional form is assumed


n Allows for greater flexibility
n Runs a greater risk of overfitting if you are not careful
n Generally requires more data to produce an accurate
estimate

IEOR 242, Fall 2019 - Lecture 1


+ 61

Tradeoff Between Flexibility and


Interpretability
n Why not just always use flexible, non-parametric
methods?
n One reason is that parametric models are more
interpretable and thus better for inference
n Even if you don’t care about inference, non-parametric
methods may overfit the training data

Incom
Incom

e
e

ity

y
rit
or

Ye Ye

o
ni

ni
ars ars
Se

Se
of ofE
Ed du
uc ca
ati tio
on n

IEOR 242, Fall 2019 - Lecture 1


+ 62

Supervised vs. Unsupervised


Learning
n So far, we have been looking at supervised
learning – estimating f

n Unsupervised learning is about finding structure


or patterns in data
n There is no longer a response variable, we just observe
data x1 , x2 , …, xn
12
n Examples in this course:

8
10

Clustering
8

6
X2

X2
6

PCA

4
n
4

2
2

0 2 4 6 8 10 12 0 2 4 6

X1 X1

IEOR 242, Fall 2019 - Lecture 1


+ 63

Regression vs. Classification

n Two important classes of supervised learning problems

n Regression involves predicting a continuous response


variable
n The value of a household
n Next week: the quality of wine

n Classification involves predicting a binary yes/no


outcome
n Did the user click on the ad or not?
n Is that email spam?

n Overarching themes but different methods for each

IEOR 242, Fall 2019 - Lecture 1


+ 64

Looking Ahead

n In this course, we will explore a plethora of


statistical learning methods in the contexts of:
n Regression
n Classification
n Unsupervised learning

IEOR 242, Fall 2019 - Lecture 1


+ 65

Next Week

n We will take a deep dive into linear regression


n Simple linear regression, multiple linear regression, other
issues

n We will see how linear regression can be used to


“beat the experts” at predicting the quality of wine
n We will learn how to run and analyze linear regression
models in R

IEOR 242, Fall 2019 - Lecture 1


+ 66

n Some of the figures in this presentation are taken


from “An Introduction to Statistical Learning, with
applications in R” (Springer, 2013) with
permission from the authors: G. James, D. Witten, T.
Hastie and R. Tibshirani

IEOR 242, Fall 2019 - Lecture 1

You might also like