
Introduction

The document outlines the DBA3803 course, taught by Tan Hong Ming, focusing on analytics applications and methodologies. It includes a tentative schedule, assessment components, and course logistics, emphasizing participation and group projects. The course aims to provide a comprehensive understanding of data science, machine learning, and their practical applications in real-world scenarios.

Uploaded by

wtan1206

Introduction to DBA3803

Tan Hong Ming


About me - https://thm.sg
• Deputy Head and Senior Lecturer in
Dept of Analytics and Operations
[email protected]
• BIZ1-08-72
Education Teaching/Research Advisor
References
• Lecture slides
• James, Gareth, et al. An Introduction to Statistical Learning (2013). https://www.statlearning.com
• Hull, John C. Machine Learning in Business: An Introduction to the World of Data Science (2020). https://www-2.rotman.utoronto.ca/~hull/MLThirdEditionFiles/index3rdEd.html
• Bertsimas, Dimitris, Allison O’Hair and Bill Pulleyblank. The Analytics Edge. Dynamic Ideas (2016). ISBN: 978-0989910897.
Course vision
• Highlight successful and innovative applications of analytics,
providing motivation to learn more in this area

• Beyond specific methods, learn an overall approach to solving real-world problems and deriving novel insights

• Analytics in Action:
• Focus on real examples as much as possible

What this course is about
• Lectures will try to focus on a case study or application
• Through that application, we will explore a particular methodology
or combination of methodologies
• We will see how the model was created, what data was used, and
what were the results and implications
• Equal focus on theory and how to build and apply models to
answer real, interesting, fun, and impactful questions
• Some math and programming are unavoidable

Tentative Schedule
Week Session
1 Introduction
2 Statistical Learning
3 Regression
4 Regression/Classification
5 Classification/Resampling Methods
6 Resampling Methods
R Reading week
7 Model Selection
8 Tree based Methods
9 Support Vector Machines
10 Deep learning
11 Hari Raya Puasa Holiday
12 Model Interpretability
13 In-class Quiz
Assessment Components Weightage
Class Participation 10%
Group Project (1) 30%
Quiz (1) 30%
Individual Assignments (2) 30%
Course logistics – General policies
• Participation worth 10% of final grade
• Won’t take attendance; you’re responsible adults
• In class, be prepared to engage actively and participate
• Main rule: don’t be a distraction
• Laptops, eating/drinking, headphones, talking, chess matches…
• All participation/effort is taken into account
• Office hours
• Emails
• Any effort that goes into the course

Course logistics – Project
• Project worth 30% of final grade
• About 5 students
• You define problem, get data, develop models, analyze results, report
findings
• Project grade based on Final report
• More details later in the semester

Course logistics – Quiz
• Worth 30% of final grade
• Held in class
• Week 13, last lesson
• Only LOA or MC accepted
• One make up quiz
• No final exam
• More details nearer to date
Course logistics – Assignments
• 2 individual homework assignments worth 15% each
• Rough timeline
• Week 6
• Week 9
• Deadline ~2 weeks
What is AI?
Standard Model
• Study and construction of Agents that do the right thing
• Control theory – controller minimizes a cost function
• Operations Research – policy maximizes a sum of rewards
• Statistics – decision rule minimizes a loss function
• Economics – decision maker maximizes utility or some measure of social
welfare

There are many labels for AI
• Statistical Learning
• Predictive Analytics
• Data Science
• Machine Learning
• Artificial Intelligence (AI)
• ANI and AGI
[Figure: nested diagram relating AI, Machine Learning, Neural Network, and Deep learning, with ChatGPT as an example]
There are many labels for AI
• Statistical Learning
• Predictive Analytics
• Data Science and Machine Learning
• Artificial Intelligence (AI)
• ANI and AGI
Different types of Analytics tasks
Descriptive Analytics: Understanding the Past
• Analyzing past sales data to understand the most popular products and identify sales trends
• Understanding the frequency and types of maintenance required for different equipment
Predictive Analytics: Foreseeing the Future
• Forecasting demand in different regions based on historical sales data and market trends
• Predicting future equipment failures or maintenance requirements based on historical data, enabling proactive maintenance scheduling

Month  Orders
Jan    20
Feb    21
Mar    ?
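One naive way to fill in the "Mar ?" entry of the Month/Orders example on this slide is to extend the straight line through the two known months. The sketch below does only that. The linear-trend assumption and the helper name `extrapolate_linear` are illustrative choices, not something the slides prescribe, and two data points are far too few for a serious forecast.

```python
def extrapolate_linear(history, steps_ahead=1):
    """Extend the straight line through the last two observations."""
    if len(history) < 2:
        raise ValueError("need at least two observations")
    slope = history[-1] - history[-2]  # change per month
    return history[-1] + slope * steps_ahead


# Orders from the slide's example table.
orders = {"Jan": 20, "Feb": 21}
march_forecast = extrapolate_linear(list(orders.values()))
print(march_forecast)  # 22
```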

Prescriptive Analytics: Informing Decision Making
• Recommending adjustments to manufacturing schedules and inventory levels based on demand predictions to optimize resource utilization
• Suggesting proactive maintenance schedules for equipment to minimize downtime and ensure compliance
[Figure: decision diagram on Weather. Rain: Netflix. Sunny: Picnic.]
Unsupervised Learning
• Data: input/features 𝑋 only, unlabeled
• Clustering

Age  Gender  BMI   Smoker  Annual Checkup  Physical Activity
56   Male    18.9  Yes     Yes             High
69   Male    39.4  Yes     No              Medium
46   Male    36.4  No      Yes             Low
32   Male    23.1  No      No              Medium
60   Female  22.4  No      Yes             High
25   Male    22.4  No      No              Medium
78   Female  25.0  No      No              Low
Algorithm: K-Means Clustering
Use case: Market Segmentation
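To make the K-Means idea concrete, here is a minimal pure-Python sketch that clusters the seven people in the table using only Age and BMI. Several things are assumptions made for illustration: the choice of two clusters, the use of just these two unscaled features, and starting the centroids at the first k rows. In practice one would usually standardize the features and use a library implementation.

```python
import math

# (Age, BMI) for the seven people in the table.
points = [(56, 18.9), (69, 39.4), (46, 36.4), (32, 23.1),
          (60, 22.4), (25, 22.4), (78, 25.0)]


def kmeans(data, k, n_iter=20):
    """Plain k-means; the first k points serve as starting centroids."""
    centroids = [tuple(p) for p in data[:k]]
    labels = [0] * len(data)
    for _ in range(n_iter):
        # Assignment step: attach each point to its nearest centroid.
        labels = [min(range(k), key=lambda j: math.dist(p, centroids[j]))
                  for p in data]
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(data, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = tuple(sum(c) / len(members)
                                     for c in zip(*members))
    return labels, centroids


labels, centroids = kmeans(points, k=2)
print(labels)     # cluster index (0 or 1) for each person
print(centroids)  # average (Age, BMI) of each cluster
```

Note that no label column is used anywhere: the grouping comes from the features alone, which is what makes this unsupervised.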
Reinforcement Learning
• Data: (𝑋, 𝑎, 𝑅(𝑎, 𝑋)): input/features 𝑋, action 𝑎, reward/cost 𝑅(𝑎, 𝑋)
• Feedback
• Learning how to play a game, driverless cars
https://www.youtube.com/watch?v=kopoLzvh5jY
https://www.youtube.com/watch?v=L_4BPjLBF4E
Supervised Learning
• Outcome measurement 𝑌 (dependent variable, response, target)
• Vector of 𝑝 predictor measurements 𝑋 (inputs, regressors,
covariates, features, independent variables)
• Regression: 𝑌 is quantitative
• price, blood pressure
• Classification: 𝑌 takes values in a finite set
• survived/died, digit 0-9, cancer class of tissue sample
• Training data (x_1, y_1), …, (x_N, y_N). These are observations (examples, instances) of these measurements
Example of Data
Age  Gender  BMI   Smoker  Annual Checkup  Physical Activity  Diabetes
56   Male    18.9  Yes     Yes             High               0
69   Male    39.4  Yes     No              Medium             1
46   Male    36.4  No      Yes             Low                1
32   Male    23.1  No      No              Medium             0
60   Female  22.4  No      Yes             High               1
25   Male    22.4  No      No              Medium             0
78   Female  25.0  No      No              Low                0
31   Female  20.1  Yes     Yes             Medium             ? (new)
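As an illustration of what "predicting the ?" means in supervised learning, the sketch below classifies the new 31-year-old using k-nearest neighbours on Age and BMI. The choice of algorithm, of k = 3, and of these two unscaled features is an assumption made for the example; the slides do not say which method to use here.

```python
import math
from collections import Counter

# (Age, BMI, Diabetes) for the seven labelled rows in the table.
training = [(56, 18.9, 0), (69, 39.4, 1), (46, 36.4, 1), (32, 23.1, 0),
            (60, 22.4, 1), (25, 22.4, 0), (78, 25.0, 0)]
new_patient = (31, 20.1)  # the row whose Diabetes value is "?"


def knn_predict(train, query, k=3):
    """Majority vote among the k training rows closest to the query."""
    nearest = sorted(train, key=lambda row: math.dist(row[:2], query))[:k]
    votes = Counter(label for _, _, label in nearest)
    return votes.most_common(1)[0][0]


prediction = knn_predict(training, new_patient)
print(prediction)  # 0
```

The three closest rows are the people aged 32, 25 and 46, with labels 0, 0 and 1, so the majority vote is 0.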
Use Case: Market Segmentation
In Practice
• What is the problem you want to solve?
• E.g., help a user find all photos that match a specific term
• Is it supervised, unsupervised, or reinforcement learning?
• It can vary: recommending songs or movies
• Supervised: inputs are customer features, output is whether they like the recommendation
• Reinforcement: recommend and observe reward
• What type of data is available?
• Labeled vs unlabeled
Objectives
On the basis of the training data we would like to:
• Accurately predict unseen test cases.
• Understand which inputs affect the outcome, and how.
• Assess the quality of our predictions and inferences.
It all begins with Data

https://www.nytimes.com/2023/12/22/technology/apple-ai-news-publishers.html

https://www.theverge.com/2024/1/4/24025409/openai-training-data-lowball-nyt-ai-copyright
https://www.deere.com/en/technology-products/precision-ag-technology/
• John Deere has embedded advanced sensors in its farming equipment,
which collect vast amounts of data from fields—such as soil quality, crop
health, moisture levels, and machine performance. This data is collected
continuously as farmers use the equipment.

Optimizing Planting Autonomous Machinery

https://www.deere.com/assets/pdfs/common/privacy-and-data/mjd-privacy-notice/mjd-privacy-notice_r2_5.25.18_en_EN.pdf
4 V’s of big data

Volume

https://arxiv.org/abs/2202.07659
Volume
Velocity
• What can happen in 1 second?
Variety

[Figure: varieties of data: relational database, text strings, sound file, geo-location data, movie file, image file]

• We will mainly focus on structured data in this course


Labels

https://medium.com/@thenextcorner/you-are-helping-google-ai-image-recognition-b24d89372b7e
New Captcha
https://www.theverge.com/2024/11/27/24307360/uber-scaled-solutions-ai-labeling-workforce
Veracity

https://www.donaldjtrump.com/landing/2020-trump-vs-dem-poll
Veracity: Case Study 1

SALE DATE EVENT DATE DAYS TO EVENT
7/10/18 24/10/18 17
23/1/18 3/2/18 11
18/9/17 28/10/17 40
4/12/17 11/12/17 7
16/3/18 19/3/18 3
21/4/18 28/4/18 7
28/7/18 28/7/18 -1
21/8/18 25/8/18 4
22/10/18 28/10/18 6
5/11/18 6/11/18 1
8/11/18 15/11/18 7
17/11/18 22/11/18 5
8/11/18 26/11/18 18
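A simple veracity check on this table is to recompute DAYS TO EVENT from the two date columns and flag rows where the stored value disagrees. The sketch below assumes the dates are written day/month/year (which the entry 24/10/18 suggests); the variable names are illustrative.

```python
from datetime import datetime

# (sale date, event date, recorded days to event), dates as day/month/year.
rows = [
    ("7/10/18", "24/10/18", 17), ("23/1/18", "3/2/18", 11),
    ("18/9/17", "28/10/17", 40), ("4/12/17", "11/12/17", 7),
    ("16/3/18", "19/3/18", 3), ("21/4/18", "28/4/18", 7),
    ("28/7/18", "28/7/18", -1), ("21/8/18", "25/8/18", 4),
    ("22/10/18", "28/10/18", 6), ("5/11/18", "6/11/18", 1),
    ("8/11/18", "15/11/18", 7), ("17/11/18", "22/11/18", 5),
    ("8/11/18", "26/11/18", 18),
]


def parse(text):
    """Parse a d/m/yy string into a datetime."""
    return datetime.strptime(text, "%d/%m/%y")


# Collect every row whose recorded gap differs from the recomputed gap.
mismatches = []
for sale, event, recorded in rows:
    recomputed = (parse(event) - parse(sale)).days
    if recomputed != recorded:
        mismatches.append((sale, event, recorded, recomputed))

print(mismatches)  # [('28/7/18', '28/7/18', -1, 0)]
```

Only the same-day sale is flagged: the recorded value is -1 although the sale and event dates are identical.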

Veracity: Case Study 2

Test Date Employer ID Fr1 (KHz) Diameter
1 8/6/15 801-510 417.25 1.97
0 8/6/15 801-510 417.9375 1.93
1 8/6/15 801-510 416.734375 2.06
0 8/6/15 801-510 416.90625 1.92
0 8/6/15 801-510 417.25 1.93
1 8/6/15 801-510 416.90625 1.92
0 8/6/15 801-510 416.90625 2.03
1 8/6/15 801-510 417.765625 1.95
1 8/6/15 801-510 417.9375 1.98
1 8/6/15 801-510 416.734375 1.99
Let’s Practice:
How to use the 4 V’s?
A framework to think about AI and data projects

Vs and Considerations

Volume
• Project Scope: Define the scale of the data involved in the project. Discuss the expected data volume and how it will impact storage, processing, and analysis requirements.
• Infrastructure Needs: Consider the infrastructure needed to handle large datasets. Discuss whether existing systems are adequate or if new solutions are required.
• Cost Implications: Address the costs associated with managing large volumes of data, including storage, processing, and maintenance.

Velocity
• Data Ingestion: Talk about the speed at which data will be collected and processed. Discuss the sources of data and how real-time data will be handled.
• Processing Capabilities: Consider the tools and technologies needed to process data at the required speed. Evaluate current capabilities and identify gaps.
• Impact on Decision-making: Discuss how the project will enable faster decision-making and what business processes will be affected.

Variety
• Data Sources and Types: Identify the different data sources and types that the project will encompass. Discuss how structured, unstructured, and semi-structured data will be integrated.
• Analysis: Discuss the analytical tools and techniques that will be used to glean insights from varied data sources.
• Data Transformation: Consider the processes needed to transform raw data into a format suitable for analysis, including data cleaning, normalization, and aggregation.

Veracity
• Data Quality: Address the quality of the data, including accuracy, completeness, and consistency. Discuss measures to ensure data quality and the impact of poor-quality data on project outcomes.
• Governance and Compliance: Discuss data governance policies and compliance requirements related to the project. Talk about how data will be managed, who will have access, and how data privacy will be ensured.
• Trustworthiness of Insights: Consider the reliability of the insights expected from the data. Discuss how veracity affects decision-making and what steps will be taken to validate and verify findings.
“Deer Lady”

What went wrong?

http://www.tylervigen.com/spurious/variable?id=19995
Why understanding causality is important
Causal analysis
• Analyst must put into context
• Not automatic (black-box)
• Randomized Controlled Trials (experiments, but these are not always possible and can be costly)
• A/B testing
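One common way to read the result of an A/B test is a two-proportion z-test on conversion rates. The sketch below is a minimal version using only the standard library. The choice of test, the function name, and the counts are all hypothetical and made up for illustration; they do not come from the slides or from any real experiment.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Pooled rate under the null hypothesis of no difference.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: 10,000 users per variant, invented for this example.
z, p_value = two_proportion_z_test(conv_a=500, n_a=10_000,
                                   conv_b=560, n_b=10_000)
print(round(z, 2), round(p_value, 3))
```

Because users are randomly assigned to A or B, a difference that survives this test can be attributed to the change itself rather than to some confounding factor, which is what separates an experiment from a purely observational correlation.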

Bing’s revenue increased by a whopping 12%, which at the time translated to over $100M annually in the US alone, without significantly hurting key user-experience metrics.
“A team... couldn’t decide between two blues, so they’re testing forty-one shades between each blue to see which one performs better. I had a recent debate over whether a border should be three, four, or five pixels wide, and was asked to prove my case. I can’t operate in an environment like that.”
Doug Bowman, Google’s visual design head, 2009 (quit)
In 2014, Google estimated that getting the shade of blue right via the kind of testing that Bowman disparaged had led to an additional $200 million per year in ad revenue.
DBA4712 Causal Analytics for Managerial Decisions
• We often use machine learning to find associations between two events. For example, the frequency of hospital visits is strongly negatively correlated with life expectancy, and the number of bedrooms is strongly positively correlated with HDB price.
• Such strong associations might lead to nonsensical decisions, like not seeing a doctor when sick or building a wall to create a new bedroom.
• Beyond the warning “correlation is not causation,” can we have a systematic way to disentangle correlation from causation?
Revision and Notation
• 𝑛: number of distinct data points or observations in our data
• 𝑝: number of variables available for predictions
[Table: data matrix with rows 1, 2, 3, 4, …, n (observations) and columns X_1, X_2, …, X_p (variables)]
Summation operator
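1. $\sum_{i=1}^{n} x_i = x_1 + x_2 + \cdots + x_n$
2. $\sum_{i=1}^{n} c = nc$
3. $\sum_{i=1}^{n} c x_i = c \sum_{i=1}^{n} x_i$
4. $\sum_{i=1}^{n} (a x_i + b y_i) = a \sum_{i=1}^{n} x_i + b \sum_{i=1}^{n} y_i$
5. $\sum_{i=1}^{n} \frac{x_i}{y_i} \neq \frac{\sum_{i=1}^{n} x_i}{\sum_{i=1}^{n} y_i}$ (note: not equal)
6. Average: $\frac{1}{n} \sum_{i=1}^{n} x_i$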
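The summation rules can be checked numerically. The sketch below uses small made-up lists `x` and `y` and constants `a`, `b`, `c` (all chosen arbitrarily for illustration) and asserts rules 2 to 4, then shows that the two sides of rule 5 really do differ.

```python
import math

x = [2.0, 4.0, 6.0]
y = [1.0, 2.0, 4.0]
a, b, c = 3.0, 5.0, 7.0
n = len(x)

# Rule 2: summing a constant n times gives n * c.
assert math.isclose(sum(c for _ in range(n)), n * c)
# Rule 3: a constant factor can be pulled out of the sum.
assert math.isclose(sum(c * xi for xi in x), c * sum(x))
# Rule 4: the sum of a linear combination splits term by term.
assert math.isclose(sum(a * xi + b * yi for xi, yi in zip(x, y)),
                    a * sum(x) + b * sum(y))
# Rule 5: a sum of ratios is generally NOT the ratio of the sums.
sum_of_ratios = sum(xi / yi for xi, yi in zip(x, y))  # 2 + 2 + 1.5
ratio_of_sums = sum(x) / sum(y)                       # 12 / 7
assert not math.isclose(sum_of_ratios, ratio_of_sums)
# Rule 6: the average is the sum divided by n.
average = sum(x) / n

print(sum_of_ratios, ratio_of_sums, average)
```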
Where the discrete r.v. 𝑋 has probability distribution:
𝒙 𝒙𝟏 𝒙𝟐 … 𝒙𝒏
𝑃(𝑋 = 𝑥) 𝑝1 𝑝2 … 𝑝𝑛
𝒙 𝟏 𝟐 𝟑 4
𝑃(𝑋 = 𝑥) 0.4 0.4 0.1 0.1
$\frac{1}{n} \sum_{i=1}^{n} x_i$ vs $\sum_{i=1}^{n} p_i x_i$
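To see the difference between the plain average and the probability-weighted sum, the sketch below evaluates both for the example distribution on this slide (values 1 to 4 with probabilities 0.4, 0.4, 0.1, 0.1). The variable names are illustrative.

```python
values = [1, 2, 3, 4]
probs = [0.4, 0.4, 0.1, 0.1]

# Plain average: treats every value as equally likely.
simple_average = sum(values) / len(values)
# Expected value: weights each value by its probability.
expected_value = sum(p * x for p, x in zip(probs, values))

print(simple_average)             # 2.5
print(round(expected_value, 10))  # 1.9
```

The two agree only when every value is equally likely, that is, when each p_i equals 1/n.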
𝒙 𝒙𝟏 𝒙𝟐 … 𝒙𝒏
𝑃(𝑋 = 𝑥) 𝑝1 𝑝2 … 𝑝𝑛

$\mathrm{Var}(X) = E[X^2] - (E[X])^2$
Visualize Sum of Squares and Variance
𝒙 𝟏 𝟐 𝟑 4
𝑃(𝑋 = 𝑥) 0.4 0.4 0.1 0.1

$E[X^2] - (E[X])^2$
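As a worked check for the example distribution on this slide, the sketch below computes the variance two ways: from the definition (probability-weighted squared deviations from the mean) and from the shortcut E[X^2] - (E[X])^2. The variable names are illustrative.

```python
values = [1, 2, 3, 4]
probs = [0.4, 0.4, 0.1, 0.1]

mean = sum(p * x for p, x in zip(probs, values))                  # E[X]
mean_of_squares = sum(p * x ** 2 for p, x in zip(probs, values))  # E[X^2]

# Definition: probability-weighted squared deviations from the mean.
var_definition = sum(p * (x - mean) ** 2 for p, x in zip(probs, values))
# Shortcut: E[X^2] - (E[X])^2.
var_shortcut = mean_of_squares - mean ** 2

print(round(var_definition, 10), round(var_shortcut, 10))  # 0.89 0.89
```

Here E[X] = 1.9 and E[X^2] = 4.5, so both routes give 4.5 - 3.61 = 0.89.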
Conditional Probability
• You work at a bank that wants to understand the probability that a customer will default on a loan based on the size of the loan they took out
• You have a small historical dataset of 10 customers, showing whether they defaulted and the amount of the loan they borrowed.

Customer ID  Loan Amount (Small/Medium/Large)  Defaulted (Yes/No)
1            Small                             No
2            Large                             Yes
3            Medium                            No
4            Medium                            Yes
5            Large                             Yes
6            Small                             No
7            Small                             No
8            Large                             No
9            Medium                            Yes
10           Large                             No
Using the 10-customer dataset above, calculate the probability that a customer with a
a) Small
b) Medium
c) Large
loan amount has defaulted.
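One way to work the exercise is to count, for each loan size, how many customers there are and how many of them defaulted, then divide. The sketch below does this for the 10-customer dataset; the variable names are illustrative.

```python
from collections import Counter

# (loan size, defaulted?) for customers 1 to 10 in the dataset.
loans = [("Small", False), ("Large", True), ("Medium", False),
         ("Medium", True), ("Large", True), ("Small", False),
         ("Small", False), ("Large", False), ("Medium", True),
         ("Large", False)]

totals = Counter(size for size, _ in loans)
defaults = Counter(size for size, defaulted in loans if defaulted)

# P(default | size) = count(default and size) / count(size)
p_default = {size: defaults[size] / totals[size] for size in totals}

for size in ("Small", "Medium", "Large"):
    print(size, round(p_default[size], 3))
# Small 0.0
# Medium 0.667
# Large 0.5
```

With only three or four customers per group these estimates are very noisy, so a value like 0 for small loans should not be read as "small loans never default".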
Install python
• https://www.anaconda.com/
• https://colab.research.google.com/
• DAO2702
