Module 1
▪ Marketing: correlation, regression and data mining help identify the specific requirements of a
targeted group of customers so that products can be marketed more efficiently
▪ Health Care: t-tests, ANOVA, survival analysis can help in identifying differences between two or
more therapists in evaluating patients, or differences in the efficacy of two cancer drugs etc.
▪ Quality Improvement: concepts from distribution theory and the process measure standard deviation
(sigma) greatly help in reducing product defects
▪ Product Warranty: estimating the average cost of product warranty claims in the first year of sale
from collected data, and using the estimate to predict the future cost incurred by the company,
along with a degree of reliability of the estimate
CASE STUDY – DEMAND SUPPLY GAP
OLA CABS SERVICE (INDIAN CASES C.3 CASE 1)
▪ Bangalore-based ridesharing company launched by ANI Technologies Pvt. Ltd. in 2010 in Mumbai
▪ Offers transportation services: superior luxury cars, Ola auto
▪ Functions in 102 cities with 450000 vehicles
▪ Customer care registers many complaints about vehicle unavailability and last-moment cancellations
for bookings on the Bangalore city – Airport route
▪ Problems:
▪ Lack of car availability during peak hours
▪ Cancellation by the drivers
both leading to loss of potential revenue
▪ The concerned team in the company wanted to identify the cause and find a possible solution to the
problem
▪ Data were collected on 25541 rides for the month of November 2019 on seven important features –
Request ID, Driver ID, Time of Request, Pick-up time, Drop-off time, Pick-up point and Status of the
request (Completed/Cancelled/Not Available)
VARIABLES AND DATA
Request ID | Driver ID | Time of Request | Pick-up time   | Drop-off time  | Pick-up point | Status of the Request
1278       | A1        | 01/11/19 9:15   | 01/11/19 9:36  | 01/11/19 10:55 | Airport       | Completed
567        | A2        | 01/11/19 13:01  | 01/11/19 13:35 | 01/11/19 15:28 | Airport       | Completed
432        | C1        | 03/11/19 17:22  | 03/11/19 17:45 | 03/11/19 20:05 | Airport       | Completed
23         | B1        | 03/11/19 14:59  | NA             | NA             | City          | Cancelled
1989       | A3        | 01/11/19 11:24  | NA             | NA             | City          | Not Available
Row: Observation; Column: Variable
Subjects/Items: 25541; Observations: 25541; Variables: 7
Sample data are mainly collected in one of two ways.
▪ Cross sectional data: records on attributes of the items are collected at the same point in time without
considering time difference
▪ Time series data: data collected over several time periods like hourly, daily, weekly, monthly, quarterly,
or annually etc.
Time series data (Air Passenger Data):
Year | Jan | Feb | Mar | Apr | May | Jun | Jul | Aug | Sep | Oct | Nov | Dec
1949 | 112 | 118 | 132 | 129 | 121 | 135 | 148 | 148 | 136 | 119 | 104 | 118
1950 | 115 | 126 | 141 | 135 | 125 | 149 | 170 | 170 | 158 | 133 | 114 | 140
1951 | 145 | 150 | 178 | 163 | 172 | 178 | 199 | 199 | 184 | 162 | 146 | 166
1952 | 171 | 180 | 193 | 181 | 183 | 218 | 230 | 242 | 209 | 191 | 172 | 194
TYPES OF VARIABLES
Request ID | Driver ID | Time of Request | Pick-up time   | Drop-off time  | Pick-up point | Status of the Request
1278       | A1        | 01/11/19 9:15   | 01/11/19 9:36  | 01/11/19 10:55 | Airport       | Completed
567        | A2        | 01/11/19 13:01  | 01/11/19 13:35 | 01/11/19 15:28 | Airport       | Completed
432        | C1        | 03/11/19 17:22  | 03/11/19 17:45 | 03/11/19 20:05 | Airport       | Completed
23         | B1        | 03/11/19 14:59  | NA             | NA             | City          | Cancelled
1989       | A3        | 01/11/19 11:24  | NA             | NA             | City          | Not Available
Variable
▪ Qualitative: allows for classification of individuals based on an attribute or characteristic,
e.g., gender, political affiliation, car brand
▪ Quantitative: numerical measures of individuals; allows for arithmetic operations,
e.g., sales, salary, expenditure, demand
Designator
Measurement Scale
▪ Qualitative
o Nominal: un-ordered/un-ranked, no ordering of arrangements; name, brand, level, class, type, category
o Ordinal: like nominal but with a specific order of arrangements
▪ Quantitative
o Interval: measured in real numbers; operations + or - are meaningful but * or / are not; no defined zero
o Ratio: measured in real numbers; the ratio of two values makes sense; properly defined zero implying
absence of quantity
LEVEL OF MEASUREMENTS – DISCUSSION
For each example below, determine the scale of measurement:
2. Total revenue generated by a business every month in the last year (in Rs.)
o Quantitative variable measured in ratio scale
3. Car brand: Tata, Maruti Suzuki, Honda, Hyundai, Toyota, Ford, Volkswagen,
Kia, Nissan, Mercedes
o Qualitative variable measured in nominal scale
o If we want to represent percentage of requests for each ‘Status of the Request’ (Cancelled,
Not Available, Completed) category, what kind of visual techniques would be useful?
o If we want to represent number of requests in each ‘Status of the Request’ category further
categorized by ‘Pick-up point’, what kind of visual techniques would be useful?
▪ Pie Chart: A circle divided into sectors. Each sector represents a category, and the area of
each sector is proportional to the frequency of that category. Pie charts suit cases where
relative percentages matter more than absolute counts.
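The two questions above (percentage of requests per status, and counts per status split by pick-up point) come down to a relative-frequency table and a cross-tabulation. The sketch below is a minimal illustration that uses only the five sample rows from the data table, not the full 25541 rides; the names `rides`, `relative_frequencies`, `sector_angles` and `cross_counts` are illustrative and not part of the case study.

```python
from collections import Counter

# (Pick-up point, Status of the Request) for the five sample rows only.
rides = [
    ("Airport", "Completed"),
    ("Airport", "Completed"),
    ("Airport", "Completed"),
    ("City", "Cancelled"),
    ("City", "Not Available"),
]


def relative_frequencies(categories):
    """Return the percentage of observations falling in each category."""
    counts = Counter(categories)
    total = sum(counts.values())
    return {category: 100 * count / total for category, count in counts.items()}


# Percentages per status: the quantities a pie chart displays.
status_pct = relative_frequencies(status for _, status in rides)

# Each pie sector's angle is proportional to its relative frequency.
sector_angles = {status: 360 * pct / 100 for status, pct in status_pct.items()}

# Counts per (pick-up point, status) pair: the bar heights of a
# side-by-side bar graph grouped by pick-up point.
cross_counts = Counter(rides)

for status, pct in status_pct.items():
    print(f"{status}: {pct:.1f}% of requests, sector angle {sector_angles[status]:.0f} degrees")
for (point, status), count in sorted(cross_counts.items()):
    print(f"{point} / {status}: {count}")
```

With only five rows the percentages say nothing about the real data; the point is the mapping from categories to sector sizes and grouped bar heights.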
▪ An overwhelmingly large number of cancellations happened for the city pick-ups – traffic jams,
requests from short-distance commuters
Bar Graph
DATA VISUALIZATION: SIDE-BY-SIDE BAR GRAPH
▪ Most cancellations occurred during the ‘Morning’ slot
Qualitative Data:
▪ Frequency (Relative Frequency) Distribution Table for categories
▪ Graphs:
o Bar Plot – absolute/relative value representation – horizontal/vertical
o Pie Chart – percentages/relative value representation
o Pareto Chart – most frequent categories
o Side-by-Side Bar Plot (comparison across groups by same attribute)
Quantitative Data:
▪ Frequency (Relative Frequency) Distribution Table for classes
▪ Graphs:
o Histogram – applicable for a large number of observations
o Stem-and-Leaf plot – raw data can be retrieved, not suitable for a large number of observations
o Dot Plot – not suitable for a large number of observations
o Line Chart – displays time series data, several variables simultaneously
NUMERICAL SUMMARY
To describe/summarize data, we apply mainly 3 kinds of measures:
▪ Measures of Central Tendency: provide a representative or aggregate number around which all
observations lie
o Mean
o Median
o Mode
▪ Measures of Dispersion: denote the expanse or spread of the observations
o Range
o Standard Deviation (SD)
o Mean Absolute Deviation (MAD)
o Variance
o Coefficient of variation (CV)
increasing order
▪ Mode: most frequent observation/class
▪ The average ‘Trip Duration’ is 62.58 mins.
MEASURES OF DISPERSION − AN ILLUSTRATION
We want to construct a measure which denotes variation or
spread of the data.
▪ Consider 5 numbers: {2, 4, 5, 6, 8}
▪ Mean = $\frac{2+4+5+6+8}{5} = \frac{25}{5} = 5$
▪ Deviation of each number from the mean:
{(2 − 5), (4 − 5), (5 − 5), (6 − 5), (8 − 5)} ≡ {−3, −1, 0, 1, 3}
These denote the distances of each observation from the
average.
▪ Average/mean deviation from the mean = $\frac{(-3)+(-1)+0+1+3}{5} = 0$
▪ Average/mean squared deviation from the mean =
$\frac{(-3)^2+(-1)^2+(0)^2+(1)^2+(3)^2}{5} = \frac{20}{5} = 4$
Is dividing by 5 ok?
▪ Square root of average/mean squared deviation from the
mean = $\sqrt{4} = 2$
▪ Average/mean absolute deviation from the mean =
$\frac{|-3|+|-1|+|0|+|1|+|3|}{5} = \frac{8}{5} = 1.6$
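The arithmetic in this illustration can be reproduced in a few lines. The sketch below follows the slide step by step on the five numbers {2, 4, 5, 6, 8}; the variable names are illustrative. It also computes the alternative raised by the question "Is dividing by 5 ok?": dividing the sum of squared deviations by n − 1 instead of n, which is the divisor used in the sample variance.

```python
import math

data = [2, 4, 5, 6, 8]
n = len(data)

mean = sum(data) / n                      # 25 / 5 = 5
deviations = [x - mean for x in data]     # -3, -1, 0, 1, 3

# Signed deviations always cancel out, so their average is 0
# and cannot serve as a measure of spread.
mean_deviation = sum(deviations) / n

# Squaring removes the sign; dividing by n gives 20 / 5 = 4.
mean_squared_deviation = sum(d ** 2 for d in deviations) / n
root_mean_squared_deviation = math.sqrt(mean_squared_deviation)  # sqrt(4) = 2

# Absolute values also remove the sign: (3 + 1 + 0 + 1 + 3) / 5 = 1.6.
mean_absolute_deviation = sum(abs(d) for d in deviations) / n

# Dividing by (n - 1) instead of n gives 20 / 4 = 5.
sample_variance = sum(d ** 2 for d in deviations) / (n - 1)

print(mean, mean_deviation, mean_squared_deviation,
      root_mean_squared_deviation, mean_absolute_deviation, sample_variance)
```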
MEASURES OF DISPERSION – II
▪ Range: maximum − minimum
▪ Standard Deviation: square root of the variance, i.e., of the
sum of the squared deviations from the mean divided by $(n − 1)$
▪ Variance: sum of the squared deviations
from the mean divided by $(n − 1)$
Values for ‘Trip Duration’: Mean (62.58), Range (68.61), Variance (342.40), SD (18.61), MAD (15.92), CV (29.73)
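As a rough sketch of these definitions, Python's standard `statistics` module uses the (n − 1) divisor in `variance` and `stdev`. The example applies them to the five numbers {2, 4, 5, 6, 8} from the earlier illustration, since the ride-level data are not reproduced here. The coefficient of variation is computed as SD divided by mean, expressed as a percentage; the slide does not spell out this formula, so treat it as an assumption.

```python
import statistics

data = [2, 4, 5, 6, 8]

data_range = max(data) - min(data)        # maximum - minimum
mean = statistics.mean(data)
variance = statistics.variance(data)      # squared deviations summed, divided by (n - 1)
sd = statistics.stdev(data)               # square root of the variance above

# Assumed definition: CV = 100 * SD / mean (unit-free measure of spread).
cv = 100 * sd / mean

print(f"range={data_range}, variance={variance}, sd={sd:.3f}, cv={cv:.2f}%")
```

Because CV is unit-free, it allows the spread of variables measured on different scales to be compared, which range, SD and variance do not.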
Interpretations:
▪ P25 is 46.81, meaning 25% of Trip Durations
were less than or equal to 46.81 mins.
o Detect outliers
Inner fences: LIF = Q1 - 1.5*IQR, UIF = Q3 + 1.5*IQR
Outer fences: LOF = Q1 - 3*IQR, UOF = Q3 + 3*IQR
(Box plot figure: LOF, UOF and an outlier marked)
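A minimal sketch of fence-based outlier detection follows. The ten values in `durations` are made up for illustration and are not taken from the ride data. Quartiles are computed with `statistics.quantiles`, whose default interpolation method may differ slightly from the one used to produce the slide's figures. The function name `fences` and the split into "beyond inner" and "beyond outer" lists are assumptions of this sketch.

```python
import statistics

# Hypothetical trip durations in minutes (illustrative only).
durations = [10, 12, 13, 14, 15, 16, 17, 18, 19, 60]


def fences(values):
    """Return the inner and outer fences built from Q1, Q3 and the IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return {
        "LIF": q1 - 1.5 * iqr,
        "UIF": q3 + 1.5 * iqr,
        "LOF": q1 - 3 * iqr,
        "UOF": q3 + 3 * iqr,
    }


f = fences(durations)

# Observations outside the inner fences are flagged as outliers;
# those outside the outer fences lie even further from the bulk of the data.
beyond_inner = [x for x in durations if x < f["LIF"] or x > f["UIF"]]
beyond_outer = [x for x in durations if x < f["LOF"] or x > f["UOF"]]

print(f)
print("outside inner fences:", beyond_inner)
print("outside outer fences:", beyond_outer)
```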
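Empirical Rule:
For any bell-shaped, nearly symmetric distribution of data, the
empirical rule says:
▪ About 68% of observations lie within $\bar{x} \pm s$ (34% on each side of $\bar{x}$)
▪ About 95% lie within $\bar{x} \pm 2s$ (a further 13.5% on each side, between $s$ and $2s$ from $\bar{x}$)
▪ About 99.7% lie within $\bar{x} \pm 3s$ (a further 2.35% on each side, between $2s$ and $3s$ from $\bar{x}$)
▪ About 0.15% lie beyond $3s$ on each side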
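The 68%, 95% and 99.7% figures are rounded areas under the normal curve. As a quick check, assuming an exactly normal distribution (the empirical rule itself only requires approximately bell-shaped data), the shares can be computed with `statistics.NormalDist`; the helper name `share_within` is illustrative.

```python
from statistics import NormalDist

# Standard normal: mean 0, standard deviation 1, so "k" below is
# the number of standard deviations from the mean.
std_normal = NormalDist(mu=0, sigma=1)


def share_within(k):
    """Percentage of a normal distribution lying within k SDs of the mean."""
    return 100 * (std_normal.cdf(k) - std_normal.cdf(-k))


shares = {k: round(share_within(k), 2) for k in (1, 2, 3)}
print(shares)  # close to 68, 95 and 99.7 for k = 1, 2, 3
```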
Z-SCORES
▪ Universally, a z-score (of salaries) of 1.68 or above
indicates near certainty of repayment of a loan of 30 L.
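For reference, a z-score expresses how many standard deviations an observation lies from the mean: z = (x − mean) / SD. The sketch below applies this standard definition to the ‘Trip Duration’ summary reported earlier (mean 62.58 mins, SD 18.61 mins). The 100-minute trip is a hypothetical value chosen for illustration, and the function name `z_score` is an assumption.

```python
def z_score(x, mean, sd):
    """Number of standard deviations by which x lies above (+) or below (-) the mean."""
    if sd <= 0:
        raise ValueError("standard deviation must be positive")
    return (x - mean) / sd


trip_mean = 62.58   # average Trip Duration from the slides (mins)
trip_sd = 18.61     # SD of Trip Duration from the slides (mins)

# Hypothetical trip of 100 minutes.
z = z_score(100, trip_mean, trip_sd)
print(f"z-score of a 100-minute trip: {z:.2f}")
```

Because z-scores are unit-free, the same threshold can be applied to variables measured on different scales.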
$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}}$$
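The correlation coefficient r can be computed directly from its definition: the sum of products of deviations, divided by the square root of the product of the two sums of squared deviations. The sketch below is a plain-Python illustration on small made-up vectors; the function name `pearson_r` and the example values are assumptions, not course data.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Numerator: sum of products of deviations from the means.
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    # Denominator terms: sums of squared deviations.
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    if s_xx == 0 or s_yy == 0:
        raise ValueError("r is undefined when a variable has no variation")
    return s_xy / math.sqrt(s_xx * s_yy)


# Illustrative values only: y rises (or falls) exactly linearly with x.
x = [1, 2, 3, 4, 5]
r_up = pearson_r(x, [2, 4, 6, 8, 10])
r_down = pearson_r(x, [10, 8, 6, 4, 2])
print(r_up, r_down)
```

A correlation matrix for several variables is this same computation applied to every pair of columns.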
MEASURE OF CORRELATION – II
▪ For the ‘Gasoline Consumption’ data, generate the correlation matrix with variables Y
(mileage, miles per gallon) and X1-X5.