MKT 3600: Marketing Research
Lecture 9: Correlation and Regression
Zhuping Liu
Announcements
• Article 7 discussion
• Assignment 5
– available on Blackboard
– due on May 17th
• Final Exam Review on May 10th
• Final Project Presentation on May 17th
• Final Exam:
– EMA MAY 19 W on Blackboard
– FMA MAY 24 M on Blackboard
What We Have Learned
Hypothesis testing
• Chi-Square Test
• Hypothesis testing about single
mean (t-test)
Hypothesis Testing: Steps
1. Formulate Hypotheses
2. Select significance level (usually
0.05)
3. Select appropriate formula and
calculate test statistic
– Compare what we observe from data
with what we expect under H0
4. Calculate degrees of freedom
5. Obtain critical value from table
6. Make decision regarding H0
Example
Question: Is the sample representative of the population in age?

Age group   Observed (N=210)   Population share
10-20        35                12%
21-30       125                18%
31-40        27                28%
41+          23                42%
Hypothesis Testing:
Steps
• Step 1: Formulate Hypotheses
H0: there is no difference between the sample distribution and the population distribution
Ha: there is a difference
• Step 2: Select significance level (0.05)
• Step 3: Select appropriate formula and
calculate test statistic
\( \chi^2 = \sum_{g=1}^{G} \frac{(Obs_g - Exp_g)^2}{Exp_g} \)
Hypothesis Testing: Step 3
\( \chi^2 = \sum_{g=1}^{G} \frac{(Obs_g - Exp_g)^2}{Exp_g} \)
G = total number of groups

Age group   Observed (N=210)   Population share   Expected
10-20        35                12%                210*12/100
21-30       125                18%                210*18/100
31-40        27                28%                210*28/100
41+          23                42%                210*42/100
Step 3
\( \chi^2 = \sum_{g=1}^{G} \frac{(Obs_g - Exp_g)^2}{Exp_g} \)
G = total number of groups

Age group   Observed (N=210)   Expected   Test statistic term
10-20        35                25.2       (35 - 25.2)^2 / 25.2 ≈ 4
21-30       125                37.8       (125 - 37.8)^2 / 37.8 ≈ 201
31-40        27                58.8       (27 - 58.8)^2 / 58.8 ≈ 17
41+          23                88.2       (23 - 88.2)^2 / 88.2 ≈ 48

\( \chi^2 = 4 + 201 + 17 + 48 = 270 \)
Hypothesis Testing:
Steps
• Step 4: Calculate degrees of freedom
df = degrees of freedom = number of groups-1
= 4-1 = 3
• Step 5: Obtain critical value from table
\( \chi^2_{Critical}(0.05, 3) = 7.81 \)
• Step 6: Make decision regarding H0
\( \chi^2 = 270 > 7.81 = \chi^2_{Critical} \)
We reject H0 that the sample is representative of the population in age.
Significance Level
df 0.1 0.05 0.025 0.01 0.005
1 2.7 3.8 5.0 6.6 7.9
2 4.6 6.0 7.4 9.2 10.6
3 6.3 7.8 9.3 11.3 12.8
4 7.8 9.5 11.1 13.3 14.9
5 9.2 11.1 12.8 15.1 16.8
6 10.6 12.6 14.4 16.8 18.5
7 12.0 14.1 16.0 18.5 20.3
8 13.4 15.5 17.5 20.1 22.0
9 14.7 16.9 19.0 21.7 23.6
10 16.0 18.3 20.5 23.2 25.2
11 17.3 19.7 21.9 24.7 26.8
12 18.5 21.0 23.3 26.2 28.3
13 19.8 22.4 24.7 27.7 29.8
14 21.1 23.7 26.1 29.1 31.3
15 22.3 25.0 27.5 30.6 32.8
16 23.5 26.3 28.8 32.0 34.3
17 24.8 27.6 30.2 33.4 35.7
18 26.0 28.9 31.5 34.8 37.2
19 27.2 30.1 32.9 36.2 38.6
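The six steps of the age example can be retraced in a few lines of Python. This is a minimal sketch using only the counts and population shares from the slides. The variable names are illustrative, and the critical value is typed in from the table (alpha = 0.05, df = 3) rather than computed.

```python
# Goodness-of-fit chi-square test for the age example (numbers from the slides).
observed = [35, 125, 27, 23]                 # sample counts per age group
population_share = [0.12, 0.18, 0.28, 0.42]  # population proportions per group
n = sum(observed)                            # 210 respondents

# Expected count per group if the sample mirrored the population (H0 true).
expected = [n * p for p in population_share]

# Test statistic: sum over groups of (Obs - Exp)^2 / Exp.
chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

df = len(observed) - 1   # number of groups - 1
critical_value = 7.81    # read from the table: alpha = 0.05, df = 3
reject_h0 = chi_square > critical_value

print(f"chi-square = {chi_square:.1f}, df = {df}, reject H0: {reject_h0}")
```

Because the terms are not rounded before summing, the statistic comes out marginally above the slide's rounded total of 270. The decision is the same.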
Testing Relationships in Cross Tabs
• Question: is there an association between income and journal choice?

                      Low Income   High Income   Total
Wall Street Journal       83          180         263
USA Today                276           41         317
Total                    359          221         580
Hypothesis Testing:
Steps
• Step 1: Formulate Hypotheses
H0: there is no association between income and journal choice
Ha: there is an association
• Step 2: Select significance level (0.05)
• Step 3: Select appropriate formula and
calculate test statistic
\( \chi^2 = \sum \frac{(Obs - Exp)^2}{Exp} \)
Expected Outcome if No Association (assuming H0 is true)
Example (Wall Street Journal, High Income): 580 * (263/580) * (221/580) ≈ 100

                      Low Income            High Income           Total
Wall Street Journal   263*359/580 ≈ 163     221*263/580 ≈ 100      263
USA Today             317*359/580 ≈ 196     221*317/580 ≈ 121      317
Total                 359                   221                    580
\( \chi^2 \)-test for Association

Observed
                      Low Income   High Income   Total
Wall Street Journal       83          180         263
USA Today                276           41         317
Total                    359          221         580

Expected if no association
                      Low Income   High Income   Total
Wall Street Journal      163          100         263
USA Today                196          121         317
Total                    359          221         580

• \( \chi^2 = \frac{(83-163)^2}{163} + \frac{(180-100)^2}{100} + \frac{(276-196)^2}{196} + \frac{(41-121)^2}{121} = 188.8 \)
Hypothesis Testing:
Steps
• Step 4: Calculate degrees of freedom
df = degrees of freedom = (r-1)*(c-1)
= (2-1)(2-1) = 1
• Step 5: Obtain critical value from table
\( \chi^2_{Critical}(0.05, df = 1) = 3.84 \)
• Step 6: Make decision regarding H0
\( \chi^2 = 188.8 > 3.84 = \chi^2_{Critical} \)
We reject H0 that there is no association
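The cross-tab test can be checked the same way. The sketch below computes each expected count as row total * column total / N and then sums the cell contributions. Variable names are illustrative, and the critical value 3.84 is taken from the table rather than computed. The expected counts are not rounded here, so the statistic comes out at about 187.8, slightly below the 188.8 obtained with rounded expected counts.

```python
# Chi-square test of association for the income x journal cross tab.
observed = [
    [83, 180],   # Wall Street Journal: low income, high income
    [276, 41],   # USA Today: low income, high income
]

row_totals = [sum(row) for row in observed]          # 263, 317
col_totals = [sum(col) for col in zip(*observed)]    # 359, 221
n = sum(row_totals)                                  # 580

# Expected count under H0 (no association): row total * column total / N.
expected = [[r * c / n for c in col_totals] for r in row_totals]

chi_square = sum(
    (observed[i][j] - expected[i][j]) ** 2 / expected[i][j]
    for i in range(len(row_totals))
    for j in range(len(col_totals))
)

df = (len(row_totals) - 1) * (len(col_totals) - 1)   # (r-1)*(c-1)
critical_value = 3.84                                # table: alpha = 0.05, df = 1
reject_h0 = chi_square > critical_value

print(f"chi-square = {chi_square:.1f}, df = {df}, reject H0: {reject_h0}")
```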
Testing Hypothesis about a
Single Mean
Question: Do people think that the quality of
food at our restaurant is above average (5)?
Data: Average rating from 100 respondents
is 6.5
• Step 1: Formulate Hypotheses
– H0: Ratingfoodquality = 5
– Ha: There is a difference.
• Two-sided
– The average rating of food quality is not 5.
Ha: ratingfoodquality ≠ 5
• One-sided
– The average rating of food quality is higher than
5.
Ha: ratingfoodquality > 5
• Step 2: Select significance level (0.05)
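Step 3: Computing t-statistic
\( H_0: \mu_x = k \)
\( H_a: \mu_x \neq k \) or \( H_a: \mu_x > k \)
t-statistic: \( t = \frac{\bar{x} - k}{s_{\bar{x}}} \)
\( s_{\bar{x}} \): standard error, \( s_{\bar{x}} = \sqrt{s^2 / n} \)
\( s^2 \): sample variance
\( \bar{x} \): sample mean
With \( \bar{x} = 6.5 \), \( s^2 = 4 \), \( n = 100 \):
\( t = \frac{6.5 - 5}{\sqrt{4/100}} = \frac{1.5}{0.2} = 7.5 \)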
t-critical
• Step 4: calculate the degrees of freedom
df = degrees of freedom = total sample size-1
= n-1 = 100-1 = 99
• Step 5: Obtain critical value from table
– For two-sided test:
Ha: ratingfoodquality ≠ 5
• t-critical = t(α/2, n-1)
• For large n, t(.025) = 1.96, so t-critical = 1.96
– For one-sided test:
Ha: ratingfoodquality > 5
• t-critical = t(α, n-1)
• For large n, t(.05) = 1.65, so t-critical = 1.65
Hypothesis Testing: Step 6
• Step 6: Make decision regarding H0
For two-sided test (Ha: ratingfoodquality ≠ 5):
- Reject null if |t| > t-critical
- Fail to reject null if |t| < t-critical
|t| = 7.5 > t-critical = 1.96, so reject H0
For one-sided test (Ha: ratingfoodquality > 5):
- Reject null if t > t-critical
- Fail to reject null if t < t-critical
t = 7.5 > t-critical = 1.65, so reject H0
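The restaurant example can be recomputed as a short script. This is a sketch under the slide's own numbers (sample mean 6.5, sample variance 4, n = 100, benchmark 5). The two critical values are the large-sample table values quoted above, not values computed for df = 99, and the variable names are illustrative.

```python
import math

# One-sample t-test for the food quality rating (numbers from the slides).
sample_mean = 6.5
sample_variance = 4.0
n = 100
k = 5.0   # hypothesized mean under H0

standard_error = math.sqrt(sample_variance / n)   # sqrt(s^2 / n)
t_stat = (sample_mean - k) / standard_error
df = n - 1

# Large-sample critical values as quoted in the lecture (alpha = 0.05).
t_critical_two_sided = 1.96
t_critical_one_sided = 1.65

reject_two_sided = abs(t_stat) > t_critical_two_sided   # Ha: mean != 5
reject_one_sided = t_stat > t_critical_one_sided        # Ha: mean > 5

print(f"t = {t_stat:.2f}, df = {df}")
print(f"reject H0 (two-sided): {reject_two_sided}, (one-sided): {reject_one_sided}")
```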
Type I and Type II errors: wasted resources & missed opportunities

                     True State of the Null Hypothesis
Decision             Ad has no effect                     Ad has an effect
Select ad            False positive (wasted resources)    Correct
Do not select ad     Correct                              Missed opportunity

                     True State of the Null Hypothesis
Decision             H0 True          H0 False
Reject H0            Type I error     Correct
Do not Reject H0     Correct          Type II error
Today
• Regression Analysis
– Regression Analysis in Excel: http://www.excel-easy.com/examples/regression.html
– Instructions on how to install the data analysis
tool are available on assignment 5
Linear Regression Model
Elements of a linear model
\( y = a + bx + \varepsilon \)
• y: dependent variable
• x: independent variable
• a: intercept
• b: slope
• \( \varepsilon \): random error
Linear regression
\( y = a + bx + \varepsilon \)
Observed:
• y: dependent variable
– variable related to various other variables, e.g., sales, preference
• x: independent variable
– variables that influence the value of the dependent variable, e.g., prices, promotions, etc.
Unobserved:
• b: regression coefficient (slope)
– The effect. Measures the change of y as x increases by one unit (holding other factors constant). Also referred to as the marginal effect
• a: intercept
– value of y when x = 0
• \( \varepsilon \): random error
– unobserved errors, e.g., measurement error, missing variables
Linear Regression Model
[Figure: the line of means E(y) = a + bx plotted against x; b = slope = (change in y) / (change in x); a = y-intercept]
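Linear Regression Model
[Figure: observed values scattered around the fitted line \( \hat{y}_i = \hat{a} + \hat{b} x_i \); each observed value is \( y_i = \hat{a} + \hat{b} x_i + \varepsilon_i \), where \( \varepsilon_i \) = random error]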
What is the “Best”
Regression Model?
• How would you draw a line through the points?
• How do you determine which line ‘fits best’?
[Figure: scatter plot of y against x, both axes running from 0 to 60]
• ‘Best fit’ means the differences between the actual y values and the predicted y values are at a minimum (least squares)
• So minimize \( SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} \varepsilon_i^2 \)
• SSE: sum of squared errors
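Least Squares Illustration
Least squares minimizes \( \sum_{i=1}^{n} \varepsilon_i^2 = \varepsilon_1^2 + \varepsilon_2^2 + \varepsilon_3^2 + \varepsilon_4^2 \)
[Figure: four observed points with vertical errors ε1, ε2, ε3, ε4 around the fitted line \( \hat{y}_i = \hat{a} + \hat{b} x_i \); for example, \( y_2 = \hat{a} + \hat{b} x_2 + \varepsilon_2 \)]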
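To make "least squares" concrete, the sketch below fits a line to five made-up points and confirms that nudging the intercept or the slope only increases the SSE. The data are hypothetical, chosen for easy arithmetic, and the function names are illustrative. The slope and intercept use the standard closed-form solution for a single x variable, which is what Excel's regression tool reports in that case.

```python
# Hypothetical data, for illustration only (not from the lecture).
x = [10, 20, 30, 40, 50]
y = [12, 25, 28, 41, 49]


def sse(a, b, xs, ys):
    """Sum of squared errors of the line y = a + b*x over the data."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(xs, ys))


def fit_line(xs, ys):
    """Closed-form least squares estimates (a_hat, b_hat) for one x variable."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(xs, ys))
    s_xx = sum((xi - x_bar) ** 2 for xi in xs)
    b_hat = s_xy / s_xx
    a_hat = y_bar - b_hat * x_bar
    return a_hat, b_hat


a_hat, b_hat = fit_line(x, y)
best_sse = sse(a_hat, b_hat, x, y)

print(f"fitted line: y = {a_hat:.2f} + {b_hat:.2f} x, SSE = {best_sse:.2f}")
# Any other line has a larger SSE, e.g. a shifted intercept or a steeper slope:
print(f"SSE with intercept + 1: {sse(a_hat + 1, b_hat, x, y):.2f}")
print(f"SSE with slope + 0.1:   {sse(a_hat, b_hat + 0.1, x, y):.2f}")
```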
Interpretation of Regression Coefficients
Impact of Advertising on Yogurt Sales:
• Slope (\( \hat{b} \))
– Yogurt sales are expected to increase by 0.1 units for each $1 increase in advertising (x)
• Intercept (\( \hat{a} \))
– Average yogurt sales are expected to be 100 units when there is no advertising (x = 0)
Assessing Fit
how well does the regression line
Assess fit fit the data points ?
R2 : amount of variance of Y explained through the regre
0 < R2 < 1
Y Y
X
X
low R2 high R2
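A small sketch of how R2 separates a good fit from a poor one, assuming the usual definition R2 = 1 - SSE/SST, where SST is the total squared deviation of Y from its mean. That formula is the standard one and is not spelled out on the slide. The data and both sets of fitted values are hypothetical.

```python
def r_squared(actual, fitted):
    """Share of the variance of Y explained by the fitted values: 1 - SSE/SST."""
    mean_y = sum(actual) / len(actual)
    sse = sum((y - f) ** 2 for y, f in zip(actual, fitted))   # unexplained
    sst = sum((y - mean_y) ** 2 for y in actual)              # total
    return 1 - sse / sst


# Hypothetical observations and two candidate sets of fitted values.
y = [12, 25, 28, 41, 49]
fitted_close = [13, 22, 31, 40, 49]   # a line that tracks the points closely
fitted_flat = [31, 31, 31, 31, 31]    # predicts the mean of y for every point

r2_high = r_squared(y, fitted_close)
r2_low = r_squared(y, fitted_flat)

print(f"close-fitting line: R2 = {r2_high:.3f}")
print(f"flat line at the mean: R2 = {r2_low:.3f}")
```

A model that always predicts the mean explains none of the variance, which is why it marks the lower end of the scale.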
Linear regression: Prediction
Once we know a and the b’s, we can predict Y for any value of X.
How?
\( \hat{Y} = \hat{a} + \hat{b}_1 X_1 + \hat{b}_2 X_2 + \dots + \hat{b}_K X_K \)
‘What if’ analyses, e.g., what will sales (Y) be when we set prices to $X?
Hypotheses Testing
\( Y_i = a + b_1 X_{i1} + b_2 X_{i2} + \dots + b_K X_{iK} + e_i \)
Is there a statistically significant effect of an independent variable \( X_k \) (say price) on the dependent variable Y (say sales)?
Null hypothesis: \( H_0: b_k = 0 \)
Use t statistic: \( t_{n-K-1} = b_k / S_{b_k} \), with n-K-1 degrees of freedom (\( S_{b_k}^2 \): variance of \( b_k \))
Compute p-value based on the t statistic
If p is small, say < 0.05, reject H0: significant effect
Application of Regression:
Tropicana Orange Juice Pricing
Data
• Weekly sales data of Tropicana orange juice in
Dominick’s stores
• Data Description:
– WEEK Week number
– SalesTrop Units Sales of Tropicana Orange Juice
(cartons)
– PriceTrop Price of Tropicana Orange Juice
– PriceMM Price of Minute Maid Orange Juice
– PriceDom Price of Dominick’s Orange Juice
– Feature Dummy variable indicating that Tropicana was
featured in weekly brochure
– Display Dummy Variable indicating that Tropicana had
an In-store display, bonus-tags
Step 1: Model
• Estimate the following model
SalesTrop = Intercept + a*PriceTrop + b*PriceMM + c*PriceDom + d*Feature + e*Display + ε
Step 2 : “What Ifs”
• If price of Tropicana were to increase
by $1 what would happen to the unit
sales of Tropicana?
• If the price of Minute Maid were to
increase by $1 what would happen to
the unit sales of Tropicana?
• If the price of store brand were to
increase by $1 what would happen to
the unit sales of Tropicana?
Step 2 : “What Ifs”
• When there is a Feature for
Tropicana, what is the impact on unit
sales of Tropicana?
• When there is a Display for
Tropicana, what is the impact on unit
sales of Tropicana?
Interpreting Regression Output
SUMMARY OUTPUT

Regression Statistics
Multiple R          0.818424409
R Square            0.669818514
Adjusted R Square   0.654810265
Standard Error      11811.44376
Observations        116

ANOVA
             df    SS            MS            F
Regression     5   31131717954   6226343591    44.63002297
Residual     110   15346122395   139510203.6
Total        115   46477840349

            Coefficients    Standard Error   t Stat         P-value
Intercept   54434.01686     10151.90119      5.36195298     4.59178E-07
PriceTrop   -21274.83138    2606.694783      -8.161611987   5.97451E-13
PriceMM     8796.880907     2830.093242      3.108336071    0.002395097
PriceDom    898.2071683     3047.525898      0.294733236    0.768753189
Feature     938.3844498     2644.852137      0.354796564    0.723421321
Display     19576.16684     3195.870786      6.125456299    1.43361E-08

Questions:
• How good is the fit?
• What is an intercept?
• What does a negative coefficient imply?
• What does the positive coefficient of “Display” mean?
• Is it significant? NO!
• What kind of variable is “Feature”?
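Several entries in the Excel output follow directly from others, which gives a useful way to read the table. The sketch below recomputes R Square, Adjusted R Square, F and the t Stats from the sums of squares, coefficients and standard errors printed above. Variable names are illustrative. The 1.96 cutoff is the large-sample two-sided value used earlier in the lecture, applied here as an approximation rather than the exact value for 110 degrees of freedom.

```python
# Values copied from the SUMMARY OUTPUT of the linear model.
n_obs = 116
k = 5   # number of independent variables
ss_regression = 31131717954
ss_residual = 15346122395
ss_total = 46477840349

r_square = ss_regression / ss_total
adj_r_square = 1 - (1 - r_square) * (n_obs - 1) / (n_obs - k - 1)
f_stat = (ss_regression / k) / (ss_residual / (n_obs - k - 1))

# name: (coefficient, standard error)
estimates = {
    "Intercept": (54434.01686, 10151.90119),
    "PriceTrop": (-21274.83138, 2606.694783),
    "PriceMM": (8796.880907, 2830.093242),
    "PriceDom": (898.2071683, 3047.525898),
    "Feature": (938.3844498, 2644.852137),
    "Display": (19576.16684, 3195.870786),
}

# t Stat = coefficient / standard error; flag |t| above the approximate cutoff.
t_stats = {name: coef / se for name, (coef, se) in estimates.items()}
significant = {name: abs(t) > 1.96 for name, t in t_stats.items()}

print(f"R2 = {r_square:.4f}, adjusted R2 = {adj_r_square:.4f}, F = {f_stat:.2f}")
for name, t in t_stats.items():
    print(f"{name:10s} t = {t:7.2f}  significant at 0.05: {significant[name]}")
```

The flags agree with the P-value column: PriceDom and Feature are not significant, while the other terms are.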
Use of Dummy Variables
• To capture the effect of categorical
variables
– Brands, In-store displays, Gender
• Dummy variable estimates indicate the impact of the category on the dependent variable
• Dummy variable has a value of 0 or 1
– 1 indicates presence of characteristic
– 0 indicates absence of characteristic
Coding Dummy Variables
• If a category can either be present or
absent, then code:
– Presence as 1
– Absence as 0
– Example: Presence of “In Store Display”
• If a category can be of two “types”:
– Code one of the categories as 1
– Code the other as 0
– Example: Male/ Female; Cash/ Credit
Example
Effect of presence of an in-store display (X) on brand sales
Dummy coding: use one or more 0/1 variables as independent variables
\( D_i = 1 \) if brand is on display; \( D_i = 0 \) if brand is not on display
\( Y_i = a + b D_i \), which equals \( a + b \) if brand is on display and \( a \) otherwise
So b is the effect of display in this example
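A quick way to see why b is "the effect of display": when the only regressor is a 0/1 dummy, the least squares intercept equals the average of Y in the 0 group and the slope equals the difference between the two group averages. The sketch below checks this on hypothetical weekly sales figures (not the Tropicana data). The names are illustrative.

```python
# Hypothetical weekly sales with a display dummy, for illustration only.
display = [0, 0, 0, 1, 1, 0, 1, 1]
sales = [100, 110, 90, 150, 170, 100, 160, 160]

n = len(sales)
d_bar = sum(display) / n
y_bar = sum(sales) / n

# Least squares estimates for Y = a + b*D (closed form for one regressor).
s_dy = sum((d - d_bar) * (y - y_bar) for d, y in zip(display, sales))
s_dd = sum((d - d_bar) ** 2 for d in display)
b_hat = s_dy / s_dd
a_hat = y_bar - b_hat * d_bar

# Group averages for comparison.
on_display = [y for d, y in zip(display, sales) if d == 1]
off_display = [y for d, y in zip(display, sales) if d == 0]
mean_on = sum(on_display) / len(on_display)
mean_off = sum(off_display) / len(off_display)

print(f"a_hat = {a_hat:.1f} (mean without display = {mean_off:.1f})")
print(f"b_hat = {b_hat:.1f} (difference in means = {mean_on - mean_off:.1f})")
```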
Non-Linear Effects
Likelihood of Purchasing Candy Bar = 1.1 + 3 * Sweetness
So should we keep adding sugar?
What if more is not better? \( Y_i = a + b_1 X_i + b_2 X_i^2 + e_i \)
[Figure: purchase likelihood of candy bar (Y) against sweetness (X), with one curve for b1 > 0, b2 < 0 and one for b1 < 0, b2 > 0]
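With a squared term, "more is better" holds only up to a point. For b1 > 0 and b2 < 0 the curve peaks at X* = -b1 / (2 * b2). The sketch below illustrates this with a = 1.1 and b1 = 3 borrowed from the linear candy bar equation and a hypothetical b2 = -0.5. The value of b2 is an assumption made purely for illustration, as are the function and variable names.

```python
# Quadratic response: Y = a + b1*X + b2*X^2.
a = 1.1     # from the linear candy bar example
b1 = 3.0    # from the linear candy bar example
b2 = -0.5   # hypothetical: negative, so extra sweetness eventually hurts


def purchase_likelihood(sweetness):
    """Predicted purchase likelihood at a given sweetness level."""
    return a + b1 * sweetness + b2 * sweetness ** 2


# The curve peaks where the slope b1 + 2*b2*X equals zero.
best_sweetness = -b1 / (2 * b2)

for level in [1, 2, best_sweetness, 4, 5]:
    print(f"sweetness {level:.1f}: likelihood {purchase_likelihood(level):.2f}")
```

With b1 < 0 and b2 > 0 the same formula gives the bottom of a U-shaped curve instead of a peak.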
The log-log sales
Response Model
• The log-log sales response model is the single
most useful tool in analyzing the competitive
structure of retail markets
log(sales in period t) = β0 + β1*log(own price in period t) + β2*log(competitor price in period t) + εt
• This model typically fits the data much better
than the linear model
• Coefficients to log(prices) may be interpreted as
price elasticities
Log-Log Sales Model
Log(Sales in period t) = a + bown* Log (Own
Price in period t)+ bcross * Log (Other Good
Price in period t)+ badvert * Log (Advertising)
+ bdisplay * Display
• Interpretation of Coefficients:
– Coefficient on ln x = % change in Y,
when x increases by 1%
Running the Model
Log(SalesTrop)= Intercept +
a*Log(PriceTrop) + b*Log(PriceMM) +
c*Log(PriceDom) + d*Feature +
e*Display
Output of the Log-Log Model
SUMMARY OUTPUT

Regression Statistics
Multiple R          0.89144
R Square            0.794666
Adjusted R Square   0.785333
Standard Error      0.348969
Observations        116

ANOVA
             df    SS         MS         F
Regression     5   51.84306   10.36861   85.1425
Residual     110   13.39575   0.12178
Total        115   65.23881

               Coefficients   Standard Error   t Stat     P-value
Intercept      11.56145       0.273605         42.25597   7.83E-70
Ln(PriceTrop)  -2.51154       0.203726         -12.328    1.85E-22
Ln(PriceMM)    0.553096       0.182128         3.036851   0.002986
Ln(PriceDom)   -0.04492       0.196747         -0.2283    0.819838
Feature        0.065482       0.078316         0.836129   0.404895
Display        0.632155       0.094394         6.697001   9.37E-10

Is it a better model compared to a simple linear model? Check R-Square.
Linear Regression Output
SUMMARY OUTPUT
How good is the fit?

Regression Statistics
Multiple R          0.8184244
R Square            0.6698185
Adjusted R Square   0.6548103
Standard Error      11811.444
Observations        116

ANOVA
             df    SS            MS          F
Regression     5   31131717954   6.226E+09   44.630023
Residual     110   15346122395   139510204
Total        115   46477840349

            Coefficients   Standard Error   t Stat      P-value
Intercept   54434.017      10151.90119      5.361953    4.592E-07
PriceTrop   -21274.831     2606.694783      -8.161612   5.975E-13
PriceMM     8796.8809      2830.093242      3.1083361   0.0023951
PriceDom    898.20717      3047.525898      0.2947332   0.7687532
Feature     938.38445      2644.852137      0.3547966   0.7234213
Display     19576.167      3195.870786      6.1254563   1.434E-08
Interpreting the Ln(Price)
Coefficients
• When the price of Tropicana
increases by 1%, what is the impact
on sales for Tropicana?
– DECREASE 2.5%
• Is this effect statistically significant?
– YES
Interpreting the Ln(Price)
Coefficients
• When the price of Minute Maid
increases by 1%, what is the impact
on sales for Tropicana?
– INCREASE .55%
• Is this effect statistically significant?
– YES
• When the price of Dominick’s
increases by 1%, what is the impact
on sales for Tropicana?
– DECREASE 0.05%
• Is this effect statistically significant?
– NO
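Reading a log-log coefficient as "% change in sales per 1% change in price" is an approximation that works well for small price changes. The sketch below uses the three Ln(Price) coefficients from the log-log output and compares that rule with the exact change implied by the model, (1 + p)^b - 1. The function name is illustrative.

```python
# Ln(Price) coefficients from the log-log output (price elasticities).
elasticities = {
    "PriceTrop": -2.51154,   # own price
    "PriceMM": 0.553096,     # Minute Maid (competitor)
    "PriceDom": -0.04492,    # Dominick's (store brand)
}


def pct_change_in_sales(elasticity, pct_price_change):
    """Exact % change in sales implied by the log-log model for a given
    % change in one price, holding everything else fixed."""
    return ((1 + pct_price_change / 100) ** elasticity - 1) * 100


for name, b in elasticities.items():
    one_pct = pct_change_in_sales(b, 1)
    ten_pct = pct_change_in_sales(b, 10)
    print(f"{name}: +1% price -> {one_pct:+.2f}% sales, "
          f"+10% price -> {ten_pct:+.2f}% sales")
```

For a 1% price change the exact figure is very close to the coefficient itself. For larger changes, such as 10%, the exact figure and "10 times the coefficient" start to drift apart.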
Prediction
• Compute the predicted sales when:
– PriceTrop = 3.99
– PriceMM= 2.85
– PriceDom= 2.99
– Feature = 0
– Display=0
• Answer (using the log-log model): predicted sales of about 5519 units
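The prediction can be reproduced from the log-log coefficients: plug the prices into the equation for Log(SalesTrop), then take the exponential to get back to units. The sketch below assumes natural logs (the output labels the variables Ln) and applies no retransformation correction when exponentiating. The function name is illustrative. With the rounded coefficients shown in the output it gives roughly 5,520, in line with the answer of 5519.

```python
import math

# Coefficients from the log-log SUMMARY OUTPUT.
coef = {
    "Intercept": 11.56145,
    "Ln(PriceTrop)": -2.51154,
    "Ln(PriceMM)": 0.553096,
    "Ln(PriceDom)": -0.04492,
    "Feature": 0.065482,
    "Display": 0.632155,
}


def predict_sales(price_trop, price_mm, price_dom, feature, display):
    """Predicted Tropicana unit sales from the log-log model."""
    log_sales = (
        coef["Intercept"]
        + coef["Ln(PriceTrop)"] * math.log(price_trop)
        + coef["Ln(PriceMM)"] * math.log(price_mm)
        + coef["Ln(PriceDom)"] * math.log(price_dom)
        + coef["Feature"] * feature    # 0/1 dummy, not logged
        + coef["Display"] * display    # 0/1 dummy, not logged
    )
    return math.exp(log_sales)         # back from log units to unit sales


predicted = predict_sales(3.99, 2.85, 2.99, feature=0, display=0)
print(f"predicted sales: {predicted:.0f} units")

# 'What if' analysis: same prices, but with an in-store display.
with_display = predict_sales(3.99, 2.85, 2.99, feature=0, display=1)
print(f"predicted sales with display: {with_display:.0f} units")
```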
Preparations
• OL:
– Work on final project presentation
• May 10th
– Final Exam Review