
Business Data Science
MIS382N

Constantine Caramanis
The University of Texas at Austin
constantine@[Link]
Business Data Science

• Prof. Constantine Caramanis


• [Link]

• Teaching Assistant: Debajit Chakraborty


• email: debajit@[Link]

• Recap of administrative issues via the syllabus


• This class and next semester’s class
What is Artificial Intelligence?

• Our traditional conception of knowledge is axiomatic:
  • Example: We prove theorems of geometry based on Euclid’s axioms.

• Even in the empirical sciences, we use experiments to try to formulate general laws of nature, and then use those laws to make predictions about how things work / the future.

• In its modern rendition, AI and machine learning place much more emphasis on the pathway directly from data to predictions.
What is Artificial Intelligence?

Two pathways to knowledge:
  Axioms → Theorems → “Knowledge” → Results
  Data → Predictions → Decisions
John Snow &
Cholera

• 1854 was a bad year in the


SoHo neighborhood, in the
heart of London.
• For yet another time in the
history of the city, Cholera
was claiming scores of
lives…

• Why? What was the cause?


John Snow &
Cholera
• Living conditions were already
challenging…

• General lack of hygiene

• Overpopulation

• Very dense living


conditions

• Pollution in air and water


John Snow & Cholera
• Cholera incidences in the SoHo neighborhood (map)
• Population without cholera (map)
John Snow &
Cholera
• Possible explanations /
theories at the time about what
was the cause of the epidemic

• Miasma in the air


John Snow &
Cholera
• Possible explanations /
theories at the time about what
was the cause of the epidemic

• Person to person transmission,


like Flu or Covid
John Snow
& Cholera
• Possible explanations
/ theories at the time
about what was the
cause of the epidemic
1. Miasma in the air
2. Transmission A → B
John Snow
& Cholera
• Possible explanations
/ theories at the time
about what was the
cause of the epidemic
1. Miasma in the air
2. Transmission A → B
3. Something else?
John Snow
& Cholera
• Possible explanations
/ theories at the time
about what was the
cause of the epidemic
1. Miasma in the air
2. Transmission A → B
3. Cluster according to
proximity to water
wells
Modern problems in ML/AI: what has changed?
• Essentially infinite computational power.
• Essentially infinite data.
• Problems that combine images/video (Computer Vision), natural language – English or other languages, or multiple languages (Natural Language Processing), and dynamic decision making.
Some of the basic or foundational problems in ML/AI

Prediction:
• Classification
• Regression
• Images/Video, Language, Combinations

Generative AI:
• Natural Language
• Images
• Video
• Multimodal/combinations
Computer Vision: Classification
Computer Vision: Regression
Natural Language Processing: Classification
Natural Language Processing: Regression
Generative AI: Computer Vision
[Link]
Generative AI: Computer Vision
[Link]
Generative AI: Computer Vision
[Link]
Generative AI: Natural Language
Overview of the landscape

Terms used:

Data science
Machine Learning
Artificial Intelligence
Big Data
Data Mining
Statistics
Overview of the landscape

Terms used: Data Science, Machine Learning, Artificial Intelligence, Big Data, Data Mining, Statistics.

@jeremyjarvis: “A data scientist is a statistician who lives in San Francisco.”

@BigDataBorat: “Data science is statistics on a Mac.”

@josh_wills: “Data Scientist (n.): Person who is better at statistics than any software engineer and better at software engineering than any statistician.”

(anonymous): “The difference between statistics and data science is about 30k per year.”

(anonymous): “It is statistics if it’s done in R. It is machine learning if it’s done in Python. It is AI if it’s done in PowerPoint.”
Overview of the landscape

Statistics: The very first line of the American Statistical Association’s definition of statistics is “Statistics is the science of learning from data...” Given that the words “data” and “science” appear in the definition, one might assume that data science is just a rebranding of statistics. Statisticians are not happy that they are not getting the research funding and salary bumps involved.

Data science: A broad and modern term. Includes much more software-engineering knowledge. Usually done in Python (as opposed to R/Stata/SAS/SPSS/Excel, etc.). Includes analytics on bigger datasets (e.g., terabytes or petabytes, using tools like Apache Spark and Hadoop MapReduce, which enable distributed processing). Includes data collection and data cleaning pipelines (data engineering / data wrangling), plus connections to database backends and web-serving front ends. To some extent it includes machine learning and AI as sub-areas.

Machine learning: The more mathematically complex part of data science, focused on modeling (as opposed to software). Includes supervised learning, predictive modeling, unsupervised learning (like clustering), and text and image understanding.

Artificial intelligence: A broad and classic term that includes machine learning as a sub-discipline: allowing computers to do things that humans do when they say they are thinking. Includes perception (image understanding, speech understanding), language translation, playing games, and statistical machine learning techniques, but also logic-based symbolic AI, reasoning, planning, and robotics.

Data mining: The applied version of machine learning. Includes more large-scale software and performance issues. Also intersects the database community.

Big data: Focused on scaling data analytics to very large datasets. The part of data-science vocabulary that will hopefully follow “the information superhighway” and “internet surfing” into obsolete historical nomenclature.
In terms of research communities:
Statistics
Research published in stats journals. Top venues: Annals of Statistics, JASA, Journal of the Royal Statistical Society.

Data science:
Not a properly defined research community.

Machine learning:
Research published in top ML conferences: NeurIPS, ICML, also more recently ICLR. Also includes KDD (more applied, data
mining).
[Link]

Artificial Intelligence:
Includes ML conferences but also AAAI and IJCAI as top venues.

Data Mining: Applied version of Machine learning. Includes more large-scale software and performance issues. Also intersects
the database community.
Research published in top Data Mining conferences: KDD, SDM.

Big Data:
Not a properly defined research community.
Engineers who can set up Hadoop/Spark clusters. Can work on data directly on disk and process at massive scale.
Supervised and Unsupervised Learning
A taxonomy for machine learning

• Supervised Learning: learning how to predict labels

• Unsupervised Learning: finding structure in data, without labels.


Supervised learning: Binary classification
• Given a table of training data containing features (x1,x2,..) and a target
variable y we want to predict.
• Example: Taste-test: Predict if a new beverage will be evaluated as having
‘Great taste’ from a focus group.

          Acidity (A)   Sweetness (S)   y = 'Great taste'?
Bev1      0.8           0.8             1
Bev2      0.3           0.25            0
Bev3      0.2           0.8             0
Bev4      0.3           0.7             0
Bev5      0.9           0.7             1
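
A minimal sketch (not from the slides, and assuming pandas is available) of this training set as a table in code; the column names are illustrative:

```python
# The taste-test training data: two features and a binary target.
import pandas as pd

train = pd.DataFrame(
    {
        "Acidity":    [0.8, 0.3, 0.2, 0.3, 0.9],
        "Sweetness":  [0.8, 0.25, 0.8, 0.7, 0.7],
        "GreatTaste": [1, 0, 0, 0, 1],   # the target variable y
    },
    index=["Bev1", "Bev2", "Bev3", "Bev4", "Bev5"],
)
print(train)
```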
Supervised learning: Binary classification
Data-driven taste-test example
• Jargon: Every row is called a training sample; together they form the training set.
• We have 2 features (Acidity and Sweetness). In practice we may have many more (Color, Carbonation level, ...). In statistics these are also called covariates or regressors.
• The target variable y is binary here (great taste or not): binary classification.
• y could have multiple levels (Poor, Mediocre, Good, Great): multi-class classification.
• Or y could be a continuous number to predict (a taste score from 1 to 100): regression.

          Acidity (A)   Sweetness (S)   y = 'Great taste'?
Bev1      0.8           0.8             1
Bev2      0.3           0.25            0
Bev3      0.2           0.8             0
Bev4      0.3           0.7             0
Bev5      0.9           0.7             1
Binary classification with a short tree: a decision stump

          Acidity (A)   Sweetness (S)   y = 'Great taste'?   Model 1 predicts
Bev1      0.8           0.8             1
Bev2      0.3           0.25            0
Bev3      0.2           0.8             0
Bev4      0.3           0.7             0
Bev5      0.9           0.7             1

Model 1 (a decision stump): if S >= 0.75, predict f(x) = 1; otherwise (o/w), predict f(x) = 0.

Binary classification with a short tree: a decision stump

Accuracy of this model on the training set is: ? / 5

          Acidity (A)   Sweetness (S)   y = 'Great taste'?   Model 1 predicts
Bev1      0.8           0.8             1                    1
Bev2      0.3           0.25            0                    0
Bev3      0.2           0.8             0                    1
Bev4      0.3           0.7             0                    0
Bev5      0.9           0.7             1                    0

Model 1: if S >= 0.75, predict f(x) = 1; otherwise, predict f(x) = 0.

Binary classification with a short tree: a decision stump

Accuracy of this model on the training set is: 3 / 5

          Acidity (A)   Sweetness (S)   y = 'Great taste'?   Model 1 predicts
Bev1      0.8           0.8             1                    1
Bev2      0.3           0.25            0                    0
Bev3      0.2           0.8             0                    1
Bev4      0.3           0.7             0                    0
Bev5      0.9           0.7             1                    0

Model 1: if S >= 0.75, predict f(x) = 1; otherwise, predict f(x) = 0.

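A minimal sketch (not from the slides) that reproduces this count: implement Model 1 as a small Python function and score it on the five training beverages.

```python
# Model 1, the decision stump: predict 1 when Sweetness >= 0.75, else 0.
def model1(acidity, sweetness):
    return 1 if sweetness >= 0.75 else 0

X = [(0.8, 0.8), (0.3, 0.25), (0.2, 0.8), (0.3, 0.7), (0.9, 0.7)]  # (A, S) per beverage
y = [1, 0, 0, 0, 1]                                                # 'Great taste'?

preds = [model1(a, s) for a, s in X]
accuracy = sum(p == t for p, t in zip(preds, y)) / len(y)
print(preds, accuracy)   # [1, 0, 1, 0, 0] -> 3/5 = 0.6
```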
Partitioning the feature space

          Acidity (A)   Sweetness (S)   y = 'Great taste'?
Bev1      0.8           0.8             1
Bev2      0.3           0.25            0
Bev3      0.2           0.8             0
Bev4      0.3           0.7             0
Bev5      0.9           0.7             1

Let's position Bev1 on this feature space (Acidity on the horizontal axis, Sweetness on the vertical axis).

Model 1: if S >= 0.75, predict f(x) = 1; otherwise, predict f(x) = 0.

Partitioning the feature space

          Acidity (A)   Sweetness (S)   y = 'Great taste'?
Bev1      0.8           0.8             1
Bev2      0.3           0.25            0
Bev3      0.2           0.8             0
Bev4      0.3           0.7             0
Bev5      0.9           0.7             1

Each binary classifier has a decision region: how it partitions the feature space. For Model 1 (if S >= 0.75 predict f(x) = 1, otherwise f(x) = 0), the decision boundary is the line S = 0.75 in the (Acidity, Sweetness) plane.
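
A minimal plotting sketch (not from the slides, assuming matplotlib is available): the five beverages in the (Acidity, Sweetness) plane together with Model 1's decision boundary at S = 0.75.

```python
# Plot the training points and Model 1's decision boundary S = 0.75.
import matplotlib.pyplot as plt

A = [0.8, 0.3, 0.2, 0.3, 0.9]
S = [0.8, 0.25, 0.8, 0.7, 0.7]
y = [1, 0, 0, 0, 1]

colors = ["red" if label == 1 else "blue" for label in y]
plt.scatter(A, S, c=colors)
plt.axhline(0.75, linestyle="--")      # boundary: predict 1 above the line, 0 below
plt.xlabel("Acidity")
plt.ylabel("Sweetness")
plt.show()
```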
Binary classification with a depth-2 decision tree

Accuracy of this model on the training set is: ?

          Acidity (A)   Sweetness (S)   y = 'Great taste'?   Model 2 predicts
Bev1      0.8           0.8             1                    ?
Bev2      0.3           0.25            0                    ?
Bev3      0.2           0.8             0                    ?
Bev4      0.3           0.7             0                    ?
Bev5      0.9           0.7             1                    ?

Model 2: This model splits first on A with threshold 0.5 and then on S with threshold 0.75.
  o/w (A < 0.5): predict f(x) = 1
  A >= 0.5: if S > 0.75 predict f(x) = 1; o/w predict f(x) = 0

Binary classification with a depth-2 decision tree

Accuracy of this model on the training set is: ?

          Acidity (A)   Sweetness (S)   y = 'Great taste'?   Model 2 predicts
Bev1      0.8           0.8             1                    1
Bev2      0.3           0.25            0                    1
Bev3      0.2           0.8             0                    1
Bev4      0.3           0.7             0                    1
Bev5      0.9           0.7             1                    0

Model 2: This model splits first on A with threshold 0.5 and then on S with threshold 0.75.
  o/w (A < 0.5): predict f(x) = 1
  A >= 0.5: if S > 0.75 predict f(x) = 1; o/w predict f(x) = 0

Could you get better training accuracy by labeling the leaves differently?
What is the highest training accuracy you can get?

Binary classification with a depth-2 decision tree

          Acidity (A)   Sweetness (S)   y = 'Great taste'?   Model 2 predicts
Bev1      0.8           0.8             1                    1
Bev2      0.3           0.25            0                    1
Bev3      0.2           0.8             0                    1
Bev4      0.3           0.7             0                    1
Bev5      0.9           0.7             1                    0

Model 2 partitions the (Acidity, Sweetness) feature space into three regions:
  leaf 1: A < 0.5 (o/w)
  leaf 2: A >= 0.5 and S <= 0.75 (o/w)
  leaf 3: A >= 0.5 and S > 0.75
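
As a quick check, a minimal sketch (not from the slides) of Model 2 as drawn above; the leaf labels leaf 1 -> 1, leaf 2 -> 0, leaf 3 -> 1 are taken from the diagram, and the keyword arguments let you experiment with relabeling the leaves.

```python
# Model 2: split on Acidity at 0.5, then on Sweetness at 0.75.
def model2(acidity, sweetness, leaf1=1, leaf2=0, leaf3=1):
    if acidity < 0.5:
        return leaf1                       # "otherwise" branch of the root
    return leaf3 if sweetness > 0.75 else leaf2

X = [(0.8, 0.8), (0.3, 0.25), (0.2, 0.8), (0.3, 0.7), (0.9, 0.7)]
y = [1, 0, 0, 0, 1]
preds = [model2(a, s) for a, s in X]
print(preds, sum(p == t for p, t in zip(preds, y)), "/ 5")
# Try different leaf1/leaf2/leaf3 values to see how high the training accuracy can go.
```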
Binary classification with a linear classifier

          Acidity (A)   Sweetness (S)   y = 'Great taste'?   Model 3 predicts
Bev1      0.8           0.8             1                    ?
Bev2      0.3           0.25            0                    ?
Bev3      0.2           0.8             0                    ?
Bev4      0.3           0.7             0                    ?
Bev5      0.9           0.7             1                    ?

Model 3: f(A, S) = 1 if A + S - 1 ≥ 0, and 0 otherwise.

Compute the predictions of this model. Draw the decision boundary in the (Acidity, Sweetness) plane.
Binary classification with a linear classifier

          Acidity (A)   Sweetness (S)   y = 'Great taste'?   Model 4 predicts
Bev1      0.8           0.8             1                    ?
Bev2      0.3           0.25            0                    ?
Bev3      0.2           0.8             0                    ?
Bev4      0.3           0.7             0                    ?
Bev5      0.9           0.7             1                    ?

Model 4: f(A, S) = 1 if A + S - 1.3 ≥ 0, and 0 otherwise.

Compute the predictions of this model. Draw the decision boundary in the (Acidity, Sweetness) plane.
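
A minimal sketch (not from the slides) of the two linear classifiers, predicting 1 when A + S - b >= 0 with b = 1 for Model 3 and b = 1.3 for Model 4; the decision boundary in each case is the line A + S = b.

```python
# Linear classifier: predict 1 when A + S - b >= 0, else 0.
def linear_classifier(acidity, sweetness, b):
    return 1 if acidity + sweetness - b >= 0 else 0

beverages = {"Bev1": (0.8, 0.8), "Bev2": (0.3, 0.25), "Bev3": (0.2, 0.8),
             "Bev4": (0.3, 0.7), "Bev5": (0.9, 0.7)}

for b in (1.0, 1.3):
    preds = {name: linear_classifier(a, s, b) for name, (a, s) in beverages.items()}
    print(f"threshold b = {b}:", preds)
```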
Predicting Diabetes

Diabetes is the 8th leading cause of death in the US. Major goals include predicting and preventing diabetes.

• First patient in our dataset → Result: is the patient diabetic?
• Second patient in our dataset
• X denotes the features (columns X1, X2, ...); y is what we want to predict (the outcome / target).
Predicting
Diabetes
• What is X?
• What is y?
Predicting
Diabetes
Eye-balling it: bottom left
should be blue (outcome = 0),
top right probably should be
red (outcome = 1)

We want an algorithm: we
want an implementable (on a
computer) procedure that
takes two numbers, X1, X2,
and outputs “0” or “1”
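
For concreteness, a minimal sketch (not from the slides) of what such a procedure looks like; the threshold 45 is illustrative and anticipates the rule on the next slide, with 1 standing for red (diabetic) and 0 for blue.

```python
# A hypothetical example of an "algorithm" in this sense: any computable function
# that maps the two numbers (X1, X2) to an output of 0 or 1.
def classify(x1, x2):
    return 1 if x2 >= 45 else 0   # 1 = red (outcome 1), 0 = blue (outcome 0)
```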
A first algorithm: Decision Trees

Algorithm:
  X2 ≥ 45: red
  X2 < 45: blue

A first algorithm: Decision Trees

Algorithm:
  X1 ≥ 150: red
  X1 < 150: blue
Decision Trees

Decision Tree with Depth = 1: test Xi ≥ α; if not (o/w), predict y = a; if so, predict y = b.

We have 4 parameters:
(1) i
(2) α
(3) a
(4) b
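
A minimal sketch (not from the slides) of this depth-1 tree as an explicit function of its four parameters; the convention that b is predicted when the split condition holds, and a otherwise, is assumed from the examples that follow (i = 2, α = 45, a = blue, b = red).

```python
# A depth-1 decision tree (a stump) as a function of its parameters (i, alpha, a, b).
def stump_predict(x, i, alpha, a, b):
    return b if x[i] >= alpha else a

# Hypothetical patient with features (X1, X2) = (120, 50); Python indexing is
# 0-based, so i=1 below means "split on X2".
print(stump_predict([120, 50], i=1, alpha=45, a="blue", b="red"))   # -> "red"
```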
Decision Trees
Decision Tree with Depth = 1: test Xi ≥ α; if not, predict y = a; if so, predict y = b.

We have four parameters:
(1) i = 2
(2) α = 45
(3) a = blue
(4) b = red
Decision Trees
Decision Tree with Depth = 1: test Xi ≥ α; if not, predict y = a; if so, predict y = b.

We have four parameters:
(1) i = 1
(2) α = 150
(3) a = blue
(4) b = red
Decision Trees

Decision Tree with Depth = 2: test Xi ≥ α1 at the root; on the "otherwise" branch test Xj ≥ α2 (leaves y = a and y = b); on the "Xi ≥ α1" branch test Xk ≥ α3 (leaves y = c and y = d).

We have 10 parameters:
(1-3): i, j, k
(4-6): α1, α2, α3
(7-10): a, b, c, d
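
A minimal sketch (not from the slides) of the depth-2 tree as a function of its ten parameters; the branch layout is assumed to mirror the description above.

```python
# A depth-2 decision tree as a function of (i, j, k, alpha1, alpha2, alpha3, a, b, c, d).
def tree2_predict(x, i, j, k, alpha1, alpha2, alpha3, a, b, c, d):
    if x[i] >= alpha1:
        return d if x[k] >= alpha3 else c   # right subtree: split on feature k
    return b if x[j] >= alpha2 else a       # left ("otherwise") subtree: split on feature j
```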


Which of the following could be a decision tree?
The “best” depth-1 decision tree

Decision Tree with Depth = 1: test Xi ≥ α; if not, predict y = a; if so, predict y = b.
Keeping a = blue and b = red, we sweep through candidate splits (i, α) on the diabetes data:

(1) i = 2, α = 18
(2) i = 2, α = 32
(3) i = 2, α = 43
(4) i = 2, α = 53
(5) i = 1, α = 59
(6) i = 1, α = 82
(7) i = 1, α = 117
(8) i = 1, α = 147
(9) i = 1, α = 180
The “best” depth 1 decision tree

The best decision tree is the one that has


the lowest loss, in this case, the one that
makes the fewest mistakes on the data!
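
A minimal sketch of this search (not code from the lecture): brute force over the feature index, the thresholds observed in the data, and the two leaf labels, keeping the stump with the fewest training mistakes. The taste-test data at the end is just an example input.

```python
# Find the depth-1 tree (i, alpha, a, b) with the fewest mistakes on the training data.
def best_stump(X, y, labels=(0, 1)):
    best, best_err = None, float("inf")
    for i in range(len(X[0])):                       # which feature to split on
        for alpha in sorted({x[i] for x in X}):      # candidate thresholds from the data
            for a in labels:
                for b in labels:
                    preds = [b if x[i] >= alpha else a for x in X]
                    err = sum(p != t for p, t in zip(preds, y))
                    if err < best_err:
                        best, best_err = (i, alpha, a, b), err
    return best, best_err

# Example on the taste-test data: features (Acidity, Sweetness), labels y.
X = [(0.8, 0.8), (0.3, 0.25), (0.2, 0.8), (0.3, 0.7), (0.9, 0.7)]
y = [1, 0, 0, 0, 1]
print(best_stump(X, y))   # a split on Acidity separates this data with 0 mistakes
```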
Supervised learning paradigm

Features X (Acidity, Sweetness, Color, Carbonation, others…)  →  Model: h  →  prediction: h(X)

Goal: design and train the model to make good predictions on data it has not seen.
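
A minimal sketch of this paradigm in code (not from the slides, and assuming scikit-learn is available): fit a model h on training features X and labels y, then predict on data the model has not seen. The max_depth setting and the new beverage are illustrative.

```python
# Fit a small decision tree on the taste-test data and predict for an unseen beverage.
from sklearn.tree import DecisionTreeClassifier

X_train = [[0.8, 0.8], [0.3, 0.25], [0.2, 0.8], [0.3, 0.7], [0.9, 0.7]]  # (Acidity, Sweetness)
y_train = [1, 0, 0, 0, 1]

h = DecisionTreeClassifier(max_depth=2)
h.fit(X_train, y_train)

X_new = [[0.6, 0.9]]          # a beverage the model has not seen
print(h.predict(X_new))
```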
Several Important Objectives
• What algorithmic tools are available, and how to use and extend them, given computational constraints.
• How to use the algorithmic tools and statistical knowledge to ask the right questions.
• How to use algorithmic tools and statistical knowledge to interpret the results.
  • Are they meaningful?
  • Can we trust them?
