
Report Machine Learning 101 1 5

The document discusses the challenges of demonstrating causation through data analysis, particularly in the context of market basket analysis used by supermarkets to identify purchasing patterns. It highlights the distinction between correlation and causation, using examples like the association between beer and diapers to illustrate common misconceptions. The report also outlines various applications of association rules, including text analysis, plagiarism detection, and predicting outcomes based on identified correlations.

Uploaded by

Shahriar Azizi

DATA SCIENCE REPORT SERIES MACHINE LEARNING 101

It is rather difficult to “demonstrate” causation via data analysis; in practice, decision-makers pragmatically (and often erroneously) focus on the second half of Tufte’s rejoinder, which asserts that “there’s no smoke without fire.”

Case in point, while being a triathlete does not cause one to drive a Subaru, Subaru Canada thinks that the connection is strong enough to offer to reimburse the registration fee at an IRONMAN 70.3 competition (since at least 2018)! [6]

Market Basket Analysis

Association rules discovery is also known as market basket analysis after its original application, in which supermarkets record the contents of shopping carts (the baskets) at check-outs to determine which items are frequently purchased together.

For instance, while bread and milk might often be purchased together, that is unlikely to be of interest to supermarkets given the frequency of market baskets containing milk or bread (in the mathematical sense of “or”).

Knowing that a customer has purchased bread does provide some information regarding whether they also purchased milk, but the individual probability that each item is found, separately, in the basket is so high to begin with that this insight is unlikely to be useful.

If 70% of baskets contain milk and 90% contain bread, say, we would expect at least

90% × 70% = 63%

of all baskets to contain milk and bread, should the presence of one in the basket be totally independent of the presence of the other.

If we then observe that 72% of baskets contain both items (a 1.15-fold increase on the expected proportion, assuming there is no link), we would conclude that there was at best a weak correlation between the purchase of milk and the purchase of bread.

Sausages and hot dog buns, on the other hand, which we might suspect are not purchased as frequently as milk and bread, might still be purchased as a pair more often than one would expect given the frequency of baskets containing sausages or buns.

If 10% of baskets contain sausages, and 5% contain buns, say, we would expect that

10% × 5% = 0.5%

of all baskets would contain sausages and buns, should the presence of one in the basket be totally independent of the presence of the other.

If we then observe that 4% of baskets contain both items (an 8-fold increase on the expected proportion, assuming there is no link), we would obviously conclude that there is a strong correlation between the purchase of sausages and the purchase of hot dog buns.

It is not too difficult to see how this information could potentially be used to help supermarkets turn a profit: announcing or advertising a sale on sausages while simultaneously (and quietly) raising the price of buns could have the effect of bringing a higher number of customers into the store, increasing the sale volume for both items while keeping the combined price of the two items constant.⁴

A (possibly) apocryphal story shows the limitations of association rules: a supermarket found an association rule linking the purchase of beer and diapers and consequently moved its beer display closer to its diapers display, having confused correlation and causation.

Purchasing diapers does not cause one to purchase beer (or vice-versa); it could simply be that parents of newborns have little time to visit public houses and bars, and whatever drinking they do will be done at home. Who knows? Whatever the case, rumour has it that the experiment was neither popular nor successful.

Applications

Typical uses include:

- finding related concepts in text documents – looking for pairs (triplets, etc.) of words that represent a joint concept: {San Jose, Sharks}, {Michelle, Obama}, etc.;
- detecting plagiarism – looking for specific sentences that appear in multiple documents, or for documents that share specific sentences;
- identifying biomarkers – searching for diseases that are frequently associated with a set of biomarkers;
- making predictions and decisions based on association rules (there are pitfalls here);
- altering circumstances or environment to take advantage of these correlations (suspected causal effect);
- using connections to modify the likelihood of certain outcomes (see immediately above);
- imputing missing data;
- text autofill and autocorrect, etc.

Other uses and examples can be found in [5, 17, 48].

3.1 Causation and Correlation

Association rules can automate hypothesis discovery, but one must remain correlation-savvy (which is less prevalent among quantitative specialists than one might hope, in our experience). If attributes A and B are shown to be correlated in a dataset, there are four possibilities:

- A and B are correlated entirely by chance in this particular dataset;
- A is a relabeling of B (or vice-versa);
- A causes B (or vice-versa), or
- some combination of attributes C1, ..., Cn (which may not be available in the dataset) causes both A and B.

⁴ The marketing team is banking on the fact that customers are unlikely to shop around to get the best deal on hot dogs AND buns, which may or may not be a valid assumption.
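
The expected-versus-observed comparisons in the market basket discussion (milk and bread, sausages and buns) can be sketched in a few lines of Python. This is only an illustration: the function name `lift` and the variable names are assumptions introduced here, not something the report defines, although the ratio of the observed joint proportion to the proportion expected under independence is commonly called the lift of an association rule. The proportions are the ones quoted in the text.

```python
def lift(p_a: float, p_b: float, p_both: float) -> float:
    """Observed joint proportion divided by the proportion expected
    if the two items were purchased independently of one another."""
    expected = p_a * p_b  # e.g. 90% x 70% = 63% for bread and milk
    return p_both / expected


# Proportions taken from the examples in the text (hypothetical names).
milk_bread = lift(p_a=0.70, p_b=0.90, p_both=0.72)
sausage_buns = lift(p_a=0.10, p_b=0.05, p_both=0.04)

print(f"milk and bread:    lift = {milk_bread:.2f}")
print(f"sausages and buns: lift = {sausage_buns:.2f}")
```

A lift close to 1 suggests the items co-occur about as often as independence would predict, while a lift well above 1 (as for sausages and buns) suggests a stronger association. Neither value says anything about causation.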

[Link], [Link] (2021) 5
