Data Analysis – Basics
Topics of today’s discussion
Reading Data Tables to make Conclusions
Reading Central Tendency and Dispersion
Correlation and Causality
Row-Column Normalization of data tables
Variable types and Statistical Analysis tools to build
relationships
Reading Data Tables – Situation 1
Let us assume a city has 4 modern-format stores (named
Store 1, Store 2, etc.) of a single retail player
They are of more or less similar size and have similar total
monthly Sales
However, their Sales by category differ –
for example, one Store might have higher Sales of FMCG
and another higher Sales of Staples
In such a scenario, let us look at the buyers of “instant
noodles” in these 4 stores in 2014
Reading Data Tables – Situation 1
Brands Purchased in each Store – Instant Noodles (2014)
Among buyers of Instant Noodles in each Store
                                              Store 1   Store 2   Store 3   Store 4
% Buying Maggi variants only                  70%       75%       55%       85%
% Buying Yippee or Top Ramen or others only   20%       20%       35%       10%
% Buying both                                 10%       5%        10%       5%
Reading Data Tables – Situation 1
% Contribution of Buyers of Instant Noodles Brands from each Store (2014)
                                              Store 1   Store 2   Store 3   Store 4
% Buying Maggi variants only                  18%       27%       42%       13%
% Buying Yippee or Top Ramen or others only   13%       18%       66%       4%
% Buying both                                 20%       14%       60%       6%
Reading Data Tables – Situation 1 – Assignment
Reading the two tables, what would you conclude about Instant
Noodles sales in the 4 stores?
Can you make some guesses about the differences in the
catchment profiles of these stores?
Reading Data Tables – Situation 1
SOME SIMPLE CONCLUSIONS
Store 3 has by far the largest number of buyers of
Instant Noodles as a category; Store 4 has the fewest
Among their respective buyers, Stores 1, 2 and 4 have a high share
(70%+) of “solus Maggi” buyers, especially Store 4 (85%)
Store 3 has a lower share (55%) of “solus Maggi” buyers. But, being
the largest seller of instant noodles, it contributes the most
to Maggi sales, as well as to the other brands’ sales
Reading Data Tables – Situation 1
SOME SIMPLE CONCLUSIONS
The Stores are the same size, yet Store 3 has
Higher Instant Noodles sales and
A higher % of new brand (Yippee, Smoodles, etc.) Sales
So, its catchment profile might be
- younger, with more double-income households, bachelors, etc.
- also, psychographically, more open to trying new brands
- more exposed to media… hence more aware of new brands
Similarly, the Store 4 catchment profile might be just the
opposite
Reading Data Tables – Situation 1 – Hint
(Column %s)
Brands Purchased in each Store – Instant Noodles
                                                  Store 1   Store 2   Store 3   Store 4
Base: Buyers of Instant Noodles in each Store     500       700       1500      300
% Buying Maggi variants only                      70%       75%       55%       85%
% Buying Yippee or Top Ramen or others only       20%       20%       35%       10%
% Buying both                                     10%       5%        10%       5%
Reading Data Tables – Situation 1 – Hint
(Row %s)
% Contribution of Buyers of Instant Noodles Brands from each Store
                                                  Store 1   Store 2   Store 3   Store 4
% of ALL INSTANT NOODLES BUYERS                   17%       23%       50%       10%
% Buying Maggi variants only                      18%       27%       42%       13%
% Buying Yippee or Top Ramen or others only       13%       18%       66%       4%
% Buying both                                     20%       14%       60%       6%
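The link between the two hint tables can be checked with a few lines of code. The sketch below takes the base sizes and column %s from the "Column %s" table and recomputes the row %s; the variable and function names are illustrative only and are not part of the original material.

```python
# Base sizes and column %s copied from the "Hint (Column %s)" table.
stores = ["Store 1", "Store 2", "Store 3", "Store 4"]
base = [500, 700, 1500, 300]  # buyers of instant noodles in each store
column_pct = {
    "Maggi variants only": [70, 75, 55, 85],
    "Yippee or Top Ramen or others only": [20, 20, 35, 10],
    "Both": [10, 5, 10, 5],
}


def row_percentages(col_pct, bases):
    """Turn one row of column %s into row %s (each store's share of that row)."""
    counts = [p / 100 * b for p, b in zip(col_pct, bases)]
    total = sum(counts)
    return [round(100 * c / total) for c in counts]


# Share of all instant-noodle buyers contributed by each store.
all_buyers_pct = [round(100 * b / sum(base)) for b in base]
# Share of each buyer segment contributed by each store.
row_pct = {seg: row_percentages(p, base) for seg, p in column_pct.items()}

print("All instant noodles buyers:", all_buyers_pct)
for segment, pcts in row_pct.items():
    print(segment, pcts)
```

Because Store 3 has a base of 1500 buyers, even its lower 55% "Maggi only" share converts into the largest number of Maggi-only buyers.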
Reading Data Tables – Situation 2
Car Ownership (Among higher Social Class) by Pop Strata
                                     All      Large    Urban     Urban   Rural   Rural
                                     India    Metros   1 – 50L   <1L     10K+    <10K
Target household Population (‘000)   20000    2000     6000      5000    3000    4000
Sample Size                          5000     1000     1000      1000    1000    1000
Column %
% Owning Cars                        20%      51%      28%       19%     7%      2%
% Not Owning Cars                    80%      49%      72%       81%     93%     98%
Reading Data Tables – Situation 2
Car Ownership (Among higher Social Class) by Pop Strata
                                     All      Large    Urban     Urban   Rural   Rural
                                     India    Metros   1 – 50L   <1L     10K+    <10K
Target household Population (‘000)   20000    2000     6000      5000    3000    4000
Row %
% Owning Cars                        100%     26%      43%       24%     5%      2%
From the given two tables (i.e. Column % and Row %), what do you
conclude about car ownership in India?
Reading Data Tables – Situation 2
BROADLY, THERE ARE 3 CONCLUSIONS ON CAR
OWNERSHIP:
1. Overall, 20% of households in the higher Social Class in India
own cars
2. Ownership is highest in large metros (51%) and falls
step-wise as we go down the pop strata: lower
in non-metro urban and lowest in rural
3. However, due to the large size of the Urban 1 – 50 lakh stratum,
the largest share of car-owning households (43%) comes from this pop
stratum
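The three conclusions can be reproduced from the Column % table. The sketch below assumes, as the figures suggest, that the All India 20% is a population-weighted average of the strata (the sample sizes are equal, so an unweighted average would differ); variable names are illustrative.

```python
# Figures copied from the Situation 2 tables.
strata = ["Large Metros", "Urban 1-50L", "Urban <1L", "Rural 10K+", "Rural <10K"]
households_000 = [2000, 6000, 5000, 3000, 4000]  # target households ('000)
pct_owning = [51, 28, 19, 7, 2]                  # column %: % owning cars

# Estimated car-owning households ('000) in each stratum.
owners_000 = [h * p / 100 for h, p in zip(households_000, pct_owning)]
total_owners_000 = sum(owners_000)

# Conclusion 1: overall ownership, weighted by stratum population.
overall_pct = 100 * total_owners_000 / sum(households_000)

# Conclusion 3: each stratum's contribution to all car-owning households (row %).
contribution_pct = [round(100 * o / total_owners_000) for o in owners_000]

print(f"Overall ownership: {overall_pct:.1f}%")
for name, share in zip(strata, contribution_pct):
    print(f"{name}: {share}% of car-owning households")
```

Large Metros have the highest ownership rate, but Urban 1 – 50L has three times as many target households, which is why it contributes the most owners.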
Reading Central Tendency and Dispersion
Assignment 2
In the catchment area of a store, 1000 people were asked some questions in a
survey. Target group (TG): 21 – 45 yrs, housewives or single earning members who
themselves shop for day-to-day household items.
They were asked to agree or disagree with the statement “I prefer buying day-to-day
items from modern format outlets rather than going to traditional Kirana
stores” on a five-point scale (Likert Scale).
Of the 1000 people responding to this question, the mean score obtained was
3.1 out of 5. What can you conclude from this?
Assignment 2
If you are now given the following distribution:
What would you conclude?
Can you make some hypotheses on the sub-groups of people giving this
opinion?
Would you want the data to be analyzed by some other sub-groups?
Assignment 2
We may need to look at an output like:
… to check: is the polarization of findings due to different attitudes
among different age groups and different stages of life?
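Why the mean alone is not enough can be illustrated with a small sketch. The two distributions below are hypothetical (they are NOT the survey data, which is not reproduced here) and are chosen only so that both give a mean of 3.1; the coding 5 = strongly agree is also an assumption.

```python
from math import sqrt

scores = [1, 2, 3, 4, 5]  # assumed coding: 1 = strongly disagree ... 5 = strongly agree

# Hypothetical counts of the 1000 respondents on each scale point.
polarized = [250, 150, 100, 250, 250]  # two camps: many low and many high scores
centred = [50, 150, 500, 250, 50]      # most respondents sit on the fence


def mean_and_sd(counts):
    """Mean and (population) standard deviation of a Likert distribution."""
    n = sum(counts)
    mean = sum(s * c for s, c in zip(scores, counts)) / n
    variance = sum(c * (s - mean) ** 2 for s, c in zip(scores, counts)) / n
    return mean, sqrt(variance)


mean_p, sd_p = mean_and_sd(polarized)
mean_c, sd_c = mean_and_sd(centred)

print(f"Polarized: mean = {mean_p:.2f}, sd = {sd_p:.2f}")
print(f"Centred:   mean = {mean_c:.2f}, sd = {sd_c:.2f}")
```

Both cases give the same mean, but the standard deviation is much larger in the polarized case. This is why the distribution, and a split by sub-group, should be read along with the mean.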
Correlation and Causality
Correlation between two variables
When we look at data for two or more variables, we sometimes see that
the values of two variables move in the same direction.
For example, if we record the heights and weights of a large number of
people, we would observe that many taller people also weigh
more and, similarly, many shorter people weigh less.
There can also be two variables that move mostly in opposite directions.
For example, the Power of the engine and the Mileage of a car: mostly, cars
with higher Power have lower Mileage.
We use the term ‘Correlation’ for the strength of linear
association between two variables, and a coefficient ‘r’ is used
to measure this strength
The value of ‘r’ can range between –1 and +1
Correlation between two variables
r value closer to +1: a strong positive association
r value closer to –1: a strong negative association
r value closer to 0: a weak linear association between the two variables
In Retail Sales data too, it is interesting to observe, by looking at long-term
Sales data, the Correlation of Sales of certain categories over a period of time
- Is there a high positive Correlation between Sales of Shampoos and
Conditioners?
- Is there a negative Correlation between Sales of Shower Gels and
Soaps?
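As an illustration, the coefficient ‘r’ (Pearson's correlation coefficient) can be computed as below. The monthly sales figures are hypothetical and invented purely for this sketch; the function name `pearson_r` is also illustrative.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    covariation = sum(a * b for a, b in zip(dx, dy))
    return covariation / sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))


# Hypothetical monthly sales (units) over six months; not real data.
shampoo = [100, 120, 130, 150, 170, 160]
conditioner = [40, 50, 52, 61, 70, 66]
shower_gel = [20, 25, 30, 38, 45, 50]
soap = [90, 86, 80, 75, 66, 62]

r_positive = pearson_r(shampoo, conditioner)
r_negative = pearson_r(shower_gel, soap)

print(f"Shampoo vs Conditioner: r = {r_positive:+.2f}")
print(f"Shower Gel vs Soap:     r = {r_negative:+.2f}")
```

With these made-up series the first pair gives r close to +1 and the second pair gives r close to –1. Real sales data would normally give less extreme values.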
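Correlation does not imply Causality
However, one must note: “CORRELATION DOES NOT IMPLY
CAUSALITY”
That is, a high positive Correlation between A and B does not
by itself mean that A causes B or that A leads to B
e.g. Brand Imagery vs Brand Usage
These generally show a high positive correlation… but does that mean an increase in
Brand Imagery would lead to an increase in Brand Usage?
NO! The correlation alone cannot tell us: usage may equally drive imagery
(users tend to rate their own brand higher), or a third factor may drive both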
Row-Column Normalization of data tables
Let us look at a Brand Image Questionnaire situation
• You want to understand the imagery of 5 brands of
shampoos on several attributes
• The attributes can be “Makes hair shiny”, “Cleans hair
well”, …, “Has good packaging”, etc
• You may collect this info as “Ratings” (OPTION 1) or as
simple “Association” (OPTION 2)
OPTION 1
Rate the following brands in terms of your preference on
each attribute on a 1 – 5 scale, 5 = Excellent, …, 1 = Poor
                    Dove   Tresemme   Head & S   Clear   Sunsilk
Gives shiny hair
Cleans hair well
Removes dandruff
…
Good vfm
Good packaging
OPTION 2
Which brands would you associate with each attribute as per your
preference? (TICK/CIRCLE CODE AS APPLICABLE)
                    Dove   Tresemme   Head & S   Clear   Sunsilk
Gives shiny hair    A      B          C          D       E
Cleans hair well    A      B          C          D       E
Removes dandruff    A      B          C          D       E
…                   A      B          C          D       E
Good vfm            A      B          C          D       E
Good packaging      A      B          C          D       E
Obviously…
• OPTION 1 will be more detailed and more robust to analyze
Statistically
• However, OPTION 1 will be very time-consuming to
administer
• So, a lot of the time, we go ahead with OPTION 2 when we have
a large number of attributes and/or brands to work with
A Typical output of such Image Association
                    Dove   Tresemme   Head & S   Clear   Sunsilk
Gives shiny hair    72%    32%        54%        42%     73%
Cleans hair well    68%    22%        60%        45%     75%
Removes dandruff    45%    12%        88%        82%     70%
Good vfm            55%    18%        60%        62%     82%
Good packaging      65%    24%        57%        51%     76%
• THE PROBLEM HERE IS… Large brands will always have high
associations across all attributes, and small brands will have low
associations across all attributes
SO, HOW DO WE GET THE RELATIVE STRENGTHS AND
WEAKNESSES OF SMALL BRANDS?
ROW – COLUMN NORMALIZATION
• It puts all Brands and all Attributes on a common footing by removing
the effect of overall brand size and overall attribute popularity
• Hence a fair comparison can be made of relative
Strengths and Weaknesses
(Refer to Excel File for computations)
Image Association
(Refer to Excel File for Row – Column normalization)
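The Excel computation is not reproduced here. The sketch below shows one common way to normalize for both rows and columns, and it is an assumption rather than necessarily the method used in the Excel file: for each cell, compute the association "expected" from the attribute (row) total and the brand (column) total, then look at observed minus expected. The data is the typical output table shown earlier; all names are illustrative.

```python
brands = ["Dove", "Tresemme", "Head & S", "Clear", "Sunsilk"]
attributes = ["Gives shiny hair", "Cleans hair well", "Removes dandruff",
              "Good vfm", "Good packaging"]
# % associating each brand (columns) with each attribute (rows).
observed = [
    [72, 32, 54, 42, 73],
    [68, 22, 60, 45, 75],
    [45, 12, 88, 82, 70],
    [55, 18, 60, 62, 82],
    [65, 24, 57, 51, 76],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Assumed method: expected value if a cell depended only on how big the
# brand is (column total) and how widely the attribute is endorsed (row total).
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# Positive deviation = relative strength, negative = relative weakness.
deviation = [[o - e for o, e in zip(obs_row, exp_row)]
             for obs_row, exp_row in zip(observed, expected)]

print(" " * 18 + "  ".join(f"{b:>8}" for b in brands))
for attr, dev_row in zip(attributes, deviation):
    print(f"{attr:<18}" + "  ".join(f"{d:+8.1f}" for d in dev_row))
```

Under this method, a small brand such as Tresemme shows a positive deviation on "Gives shiny hair" and a negative one on "Removes dandruff", even though all its raw percentages are low. Deviations can then be banded into symbols such as ++, +, - and --, with the cut-offs being a further choice.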
After Normalization…
Tin plate Plastic Glass Tetrapack
Makes the product reasonably priced -- ++ -
Increases longevity of product
Convenient for stocking in godown ++ + --
Looks attractive on shelves for a long time as colours do not fade away +
Destroyable and the material useable without damaging the environment -- -- + -
Popular among consumers ++ +
Convenient for transportation as it does not break / get tampered easily ++ --
Not tampered easily by insects / rats ++ -- ++ --
Gives protection against foreign odours -- + +
Convenient for stocking on shelves - ++ -
Convenient for transportation by requiring less space --
Re-useable by the customer for some other purpose -- ++
Variable types and Statistical Analysis tools to build relationships
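Variable Types and Statistical tools
The choice of tool depends on the type (categorical or metric) of the
DEPENDENT VARIABLE and of the INDEPENDENT VARIABLE
Types of tools applicable:
- Chi-Square test (categorical dependent, categorical independent)
- ANOVA (metric dependent, categorical independent)
- Multiple Regression (metric dependent, metric independent)
- Discriminant Analysis OR Logistic Regression (categorical dependent,
metric independent)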
Examples of Applications:
- Chi-Square test (Purchaser vs Non-Purchaser… Are there
differences by demographic group?)
- ANOVA (Preference levels among drinkers of different brands of vodka)
- Multiple Regression (“Ad likeability” by Uniqueness, Relevance, etc.)
- Discriminant Analysis OR Logistic Regression (Purchase /
Non-Purchase by Ad Uniqueness, Relevance, etc.)
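As an illustration of the first application, the sketch below computes a Chi-Square statistic for Purchaser vs Non-Purchaser across two demographic groups. The counts are hypothetical (invented for this sketch), and the 5% critical value of 3.841 applies to 1 degree of freedom only.

```python
# Hypothetical counts: rows = age group, columns = [purchasers, non-purchasers].
groups = ["21-30 yrs", "31-45 yrs"]
observed = [
    [120, 80],
    [90, 110],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected counts if purchase behaviour were independent of age group.
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

chi_square = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
degrees_of_freedom = (len(observed) - 1) * (len(observed[0]) - 1)

CRITICAL_VALUE_5PCT_DF1 = 3.841  # chi-square critical value, df = 1, 5% level
significant = chi_square > CRITICAL_VALUE_5PCT_DF1

print(f"Chi-square = {chi_square:.2f}, df = {degrees_of_freedom}")
print("Difference between groups is significant at 5%:", significant)
```

Here both variables are categorical, which is what makes the Chi-Square test the applicable tool. With a metric dependent variable, ANOVA or regression would apply instead.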
THANK YOU!