DWM Unit-4
ASSOCIATION ANALYSIS
Association Analysis: Problem Definition, Frequent Itemset Generation, Rule Generation, Alternative Methods for Generating Frequent Itemsets, FP-Growth Algorithm, Evaluation of Association Patterns.
Business enterprises accumulate large quantities of data from their day-to-day operations.
For example, huge amounts of customer purchase data are collected daily at the checkout
counters of grocery stores.
Table below illustrates an example of such data, commonly known as market basket
transactions.
Each row in this table corresponds to a transaction, which contains a unique identifier
labelled TID and a set of items bought by a given customer.
Retailers are interested in analyzing the data to learn about the purchasing behaviour
of their customers.
Such valuable information can be used to support a variety of business-related
applications such as marketing promotions, inventory management and customer
relationship management.
The uncovered relationships can be represented in the form of association rules or sets
of frequent items.
For example, the following rule can be extracted from the data set shown in Table 6.1:
{Milk} -> {Bread}.
The rule suggests that a strong relationship exists between the sale of Milk and Bread because
many customers who buy Milk also buy Bread.
Retailers can use this type of rule to help them identify new opportunities for cross-selling
their products to customers.
Besides market basket data, association analysis is also applicable to other application
domains such as bioinformatics, medical diagnosis, Web mining and scientific data analysis.
There are two key issues that need to be addressed when applying association analysis to
market basket data.
1. Discovering patterns from a large transaction data set can be computationally
expensive.
2. Some of the discovered patterns are potentially spurious because they may happen
simply by chance.
6.1 Problem Definition
Binary Representation: Market basket data can be represented in a binary format as shown in
Table 6.2.
Here each row corresponds to a transaction and each column corresponds to an item. An item
can be treated as a binary variable whose value is one if the item is present in a transaction
and zero otherwise. Because the presence of an item in a transaction is often considered more
important than its absence, an item is an asymmetric binary variable.
Itemset and Support Count:
Let I = {i1, i2, ..., id} be the set of all items in a market basket data set and T = {t1, t2, ..., tN} be
the set of all transactions.
Each transaction ti contains a subset of items chosen from I.
In association analysis, a collection of zero or more items is termed an itemset. If an
itemset contains k items, it is called a k-itemset. {Bread, Diapers, Milk} is an example
of a 3-itemset.
The null (or empty) set is an itemset that does not contain any items. The transaction
width is defined as the number of items present in a transaction.
A transaction tj is said to contain an itemset X if X is a subset of tj.
For example, the second transaction shown in Table 6.2 contains the itemset {Bread,
Diapers} but not {Bread, Milk}.
An important property of an itemset is its support count, which refers to the number
of transactions that contain a particular itemset.
Mathematically, the support count, σ(X), for an itemset X can be stated as follows:
σ(X) = |{ ti | X ⊆ ti, ti ∈ T }|,
where | · | denotes the number of elements in a set. The support of a rule X -> Y is then defined as s(X -> Y) = σ(X ∪ Y) / N, and its confidence as c(X -> Y) = σ(X ∪ Y) / σ(X), where N is the total number of transactions.
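To make these definitions concrete, the following minimal Python sketch computes support counts over the five market basket transactions of Table 6.1 (reproduced here from the standard version of that example; treat the data as illustrative).

# Support count sigma(X): number of transactions containing itemset X.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer", "Eggs"},
    {"Milk", "Diapers", "Beer", "Cola"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Cola"},
]

def support_count(itemset, transactions):
    # A transaction t contains X if X is a subset of t.
    return sum(1 for t in transactions if itemset <= t)

print(support_count({"Milk", "Diapers", "Beer"}, transactions))   # 2
print(support_count({"Bread", "Diapers"}, transactions))          # 3
print(support_count({"Bread", "Diapers"}, transactions) / len(transactions))  # support s = 0.6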
A brute-force approach for mining association rules is to compute the support and confidence
for every possible rule.
This approach is prohibitively expensive because there are exponentially many rules that can
be extracted from a data set.
The total number of possible rules that can be extracted from a data set containing d items is
R = 3^d - 2^(d+1) + 1.
Even for the small data set shown in Table 6.1, this approach requires us to compute the
support and confidence for 3^6 - 2^7 + 1 = 729 - 128 + 1 = 602 rules.
More than 80% of the rules are discarded after applying minsup >= 20% and minconf
>= 50%, so most of the computation is wasted.
To avoid performing needless computations, it would be useful to prune the rules
early without having to compute their support and confidence values.
An initial step toward improving the performance of association rule mining
algorithms is to decouple the support and confidence requirements.
From Equation 6.2, the support of a rule X -> Y depends only on the support of its
corresponding itemset , X U Y.
For example, the following rules have identical support because they involve items from the
same itemset, {Beer, Diapers, Milk}:
{Beer, Diapers} -> {Milk}, {Beer, Milk} -> {Diapers},
{Diapers, Milk} -> {Beer}, {Beer} -> {Diapers, Milk},
{Milk} -> {Beer, Diapers}, {Diapers} -> {Beer, Milk}.
If the itemset is infrequent, then all six candidate rules can be pruned immediately without
our having to compute their confidence values.
Therefore, a common strategy adopted by many association rule mining algorithms is to
decompose the problem into two major subtasks:
1. Frequent Itemset Generation, whose objective is to find all the itemsets that satisfy the
minsup threshold. These itemsets are called frequent itemsets.
2. Rule Generation, whose objective is to extract all the high-confidence rules from the
frequent itemsets found in the previous step. These rules are called strong rules.
A lattice structure can be used to enumerate the list of all possible itemsets. Figure 6.1 shows
an itemset lattice for I = {a, b, c, d, e}.
In general, a data set that contains k items can potentially generate up to 2^k - 1 frequent
itemsets, excluding the null set.
A brute-force approach for finding frequent itemsets is to determine the support count for
every candidate itemset in the lattice structure by comparing each candidate against every
transaction, an operation that is illustrated in Figure 6.2. This is very expensive because it
requires many comparisons for every candidate.
Apriori principle: If an itemset is frequent, then all of its subsets must also be frequent.
Conversely, if an itemset such as {a, b} is infrequent, then all of its supersets must be
infrequent too.
As illustrated in Figure 6.4, the entire subgraph containing the supersets of {a, b} can be
pruned immediately once {a, b} is found to be infrequent.
This strategy of trimming the exponential search space based on the support measure is
known as support-based pruning.
Such a pruning strategy is made possible by a key property of the support measure, namely,
that the support for an itemset never exceeds the support for its subsets. This property is also
known as the anti-monotone property of the support measure.
Definition 6.2 (Monotonicity Property). Let I be a set of items, and J = 2^I be the power set
of I. A measure f is monotone (or upward closed) if
for all X, Y in J: (X ⊆ Y) implies f(X) <= f(Y),
which means that if X is a subset of Y, then f(X) must not exceed f(Y).
f is anti-monotone (or downward closed) if
for all X, Y in J: (X ⊆ Y) implies f(Y) <= f(X),
which means that if X is a subset of Y, then f(Y) must not exceed f(X).
6.2.2 Frequent Itemset Generation in the Apriori Algorithm
Apriori is the first association rule mining algorithm that pioneered the use of support-based
pruning to systematically control the exponential growth of candidate itemsets.
Figure 6.5 provides a high-level illustration of the frequent itemset generation part of the
Apriori algorithm for the transactions shown in Table 6.1.
We assume that the support threshold is 60%, which is equivalent to a minimum support
count equal to 3.
Initially, every item is considered as a candidate 1-itemset.
After counting their supports, the candidate itemsets {Cola} and {Eggs} are discarded
because they appear in fewer than three transactions.
In the next iteration, candidate 2-itemsets are generated using only the frequent 1-
itemsets because the Apriori principle ensures that all supersets of the infrequent 1-
itemsets must be infrequent.
Because there are only four frequent 1-itemsets, the number of candidate 2-itemsets
generated by the algorithm is C(4, 2) = 6.
Two of these six candidates, {Beer, Bread} and {Beer, Milk}, are subsequently found
to be infrequent after computing their support values.
The remaining four candidates are frequent, and thus will be used to generate
candidate 3-itemsets.
Without support-based pruning, there are C(6, 3) = 20 candidate 3-itemsets that can
be formed using the six items given in this example.
With the Apriori principle, we only need to keep candidate 3-itemsets whose subsets
are frequent.
The only candidate that has this property is {Bread, Diapers, Milk}.
The effectiveness of the Apriori pruning strategy can be shown by counting the number of
candidate itemsets generated.
A brute-force strategy that enumerates all itemsets up to size 3 would produce
C(6, 1) + C(6, 2) + C(6, 3) = 6 + 15 + 20 = 41 candidates.
With the Apriori principle, only 6 + 6 + 1 = 13 candidates are generated, which represents a
68% reduction in the number of candidate itemsets even in this simple example.
The pseudocode for the frequent itemset generation part of the Apriori algorithm is shown in
Algorithm 6.1.
Let Ck denote the set of candidate k-itemsets and Fk denote the set of frequent k-itemsets:
The algorithm initially makes a single pass over the data set to determine the support
of each item. Upon completion of this step, the set of all frequent 1-itemsets, F1, will
be known (steps 1 and 2).
Next, the algorithm will iteratively generate new candidate k-itemsets using the
frequent (k - 1)-itemsets found in the previous iteration (step 5).
To count the support of the candidates, the algorithm needs to make an additional pass
over the data set (steps 6-10). The subset function is used to determine all the
candidate itemsets in Ck that are contained in each transaction t.
After counting their supports, the algorithm eliminates all candidate itemsets whose
support counts are less than minsup (step 12).
The algorithm terminates when there are no new frequent itemsets generated, i.e.,
Fk=Ø (step 13).
The frequent itemset generation part of the Apriori algorithm has two important
characteristics.
1. It is a level-wise algorithm; i.e., it traverses the itemset lattice one level at a time,
from frequent 1-itemsets to the maximum size of frequent itemsets.
2. It employs a generate-and-test strategy for finding frequent itemsets. At each
iteration, new candidate itemsets are generated from the frequent itemsets found in the
previous iteration.
The support for each candidate is then counted and tested against the minsup threshold. The
total number of iterations needed by the algorithm is kmax+1, where kmax is the maximum
size of the frequent itemsets.
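The level-wise, generate-and-test structure described above can be summarized in a short Python sketch. This is a simplified illustration of Algorithm 6.1, not an optimized implementation: supports are counted by direct subset tests rather than with a hash tree, and candidates are formed by merging any two frequent (k-1)-itemsets whose union has k items.

from itertools import combinations

def apriori_frequent_itemsets(transactions, minsup_count):
    # Helper: support count of an itemset (a frozenset) over the transactions.
    def support(X):
        return sum(1 for t in transactions if X <= t)

    # Steps 1-2: frequent 1-itemsets.
    items = sorted({i for t in transactions for i in t})
    F = [frozenset([i]) for i in items if support(frozenset([i])) >= minsup_count]
    all_frequent = list(F)
    k = 2
    while F:
        prev = set(F)
        # Step 5 (apriori-gen): merge pairs of frequent (k-1)-itemsets whose union
        # has exactly k items, then prune candidates with an infrequent (k-1)-subset.
        candidates = set()
        for A, B in combinations(F, 2):
            union = A | B
            if len(union) == k and all(
                frozenset(s) in prev for s in combinations(union, k - 1)
            ):
                candidates.add(frozenset(union))
        # Steps 6-12: count supports and keep the candidates that meet minsup.
        F = [c for c in candidates if support(c) >= minsup_count]
        all_frequent.extend(F)
        k += 1
    return all_frequent

# The five transactions of Table 6.1 with a minimum support count of 3 yield
# four frequent 1-itemsets and four frequent 2-itemsets, as described above.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer", "Eggs"},
    {"Milk", "Diapers", "Beer", "Cola"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Cola"},
]
for itemset in apriori_frequent_itemsets(transactions, minsup_count=3):
    print(set(itemset))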
6.2.3 Candidate Generation and Pruning
The apriori-gen function shown in Step 5 of Algorithm 6.1 generates candidate itemsets by
performing the following two operations:
1. Candidate Generation. This operation generates new candidate k-itemsets based on the
frequent (k - l)-itemsets found in the previous iteration.
2. Candidate Pruning. This operation eliminates some of the candidate k-itemsets using the
support-based pruning strategy.
Example of the candidate pruning operation: consider a candidate k-itemset X = {i1, i2, ..., ik}.
The algorithm must determine whether all of its proper subsets, X - {ij} (for j = 1, 2, ..., k),
are frequent.
If one of them is infrequent, then X is immediately pruned. This approach can effectively
reduce the number of candidate itemsets considered during support counting. The complexity
of this operation is O(k) for each candidate k-itemset.
The following is a list of requirements for an effective candidate generation procedure:
1. It should avoid generating too many unnecessary candidates. A candidate itemset is
unnecessary if at least one of its subsets is infrequent.
Such a candidate is guaranteed to be infrequent according to the antimonotone
property of support.
2. It must ensure that the candidate set is complete, i.e., no frequent itemsets are left out by
the candidate generation procedure. To ensure completeness, the set of candidate itemsets
must subsume the set of all frequent itemsets.
A brute-force method that generates every possible k-itemset as a candidate satisfies these
requirements only at great cost: candidate pruning becomes extremely expensive because a
very large number of itemsets must be examined, and the amount of computation needed for
each candidate is O(k).
The apriori-gen function instead uses the F(k-1) x F(k-1) method. Let A = {a1, a2, ..., a(k-1)}
and B = {b1, b2, ..., b(k-1)} be a pair of frequent (k-1)-itemsets, with the items in each itemset
sorted in lexicographic order. A and B are merged if they satisfy the following conditions:
ai = bi for i = 1, 2, ..., k-2, and a(k-1) is different from b(k-1).
In the Figure above, the frequent itemsets {Bread, Diapers} and {Bread, Milk} are merged to
form a candidate 3-itemset {Bread, Diapers, Milk}.
The algorithm does not have to merge {Beer, Diapers} with {Diapers, Milk} because the
first item in both itemsets is different.
If {Beer, Diapers, Milk} were a viable candidate, it would have been obtained by merging
{Beer, Diapers} with {Beer, Milk} instead.
This example illustrates both the completeness of the candidate generation procedure and the
advantages of using lexicographic ordering to prevent duplicate candidates.
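The merging condition lends itself to a direct sketch. Below is a minimal Python illustration of the F(k-1) x F(k-1) merging step, assuming each frequent (k-1)-itemset is stored as a lexicographically sorted tuple (the function name merge_candidates is illustrative).

def merge_candidates(frequent_k_minus_1):
    # F(k-1) x F(k-1) candidate generation: two (k-1)-itemsets A and B, each a
    # lexicographically sorted tuple, are merged only if their first k-2 items
    # are identical and their last items differ. This prevents duplicate candidates.
    candidates = []
    F = sorted(frequent_k_minus_1)
    for i in range(len(F)):
        for j in range(i + 1, len(F)):
            A, B = F[i], F[j]
            if A[:-1] == B[:-1] and A[-1] != B[-1]:
                candidates.append(A[:-1] + tuple(sorted((A[-1], B[-1]))))
    return candidates

# {Bread, Diapers} and {Bread, Milk} share the prefix (Bread,), so they merge into
# the candidate 3-itemset (Bread, Diapers, Milk); {Beer, Diapers} and (Diapers, Milk)
# are not merged because their first items differ.
F2 = [("Beer", "Diapers"), ("Bread", "Diapers"), ("Bread", "Milk"), ("Diapers", "Milk")]
print(merge_candidates(F2))   # [('Bread', 'Diapers', 'Milk')]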
6.2.4 Support Counting
Support counting is the process of determining the frequency of occurrence for every
candidate itemset that survives the candidate pruning step of the apriori-gen function.
Support counting is implemented in steps 6 through 11 of Algorithm 6.1.
One approach for doing this is to compare each transaction against every candidate itemset
and to update the support counts of candidates contained in the transaction.
This approach is computationally expensive, especially when the numbers of transactions and
candidate itemsets are large.
An alternative approach is to enumerate the itemsets contained in each transaction and use
them to update the support counts of their respective candidate itemsets.
For example, the Level 2 prefix structure {1 2} corresponds to itemsets that begin with the
prefix (1 2) and are followed by items 3, 5, or 6.
Finally, the prefix structures at Level 3 represent the complete set of 3-itemsets contained in
‘t’.
For example, the 3-itemsets that begin with prefix {1 2} are {1,2,3}, {1,2,5}, and {1,2,6},
while those that begin with prefix {2 3} are {2,3,5} and { 2 , 3 , 6 } .
The prefix structures shown in Figure above demonstrate how itemsets contained in a
transaction can be systematically enumerated, i.e., by specifying their items one by one, from
the leftmost item to the rightmost item.
For a given transaction {1, 2, 3, 5, 6}, hashing the transaction at the root node of the hash tree proceeds as follows:
Each internal node of the tree uses the following hash function, h(p) = p mod 3, to determine
which branch of the current node should be followed next.
For example, items 1, 4, and 7 are hashed to the same branch (i.e., the leftmost branch)
because they have the same remainder after dividing the number by 3.
All candidate itemsets are stored at the leaf nodes of the hash tree. The hash tree shown in
Figure above contains 15 candidate 3-itemsets, distributed across 9 leaf nodes.
Consider a transaction, t = {1,2,3,5,6}.
To update the support counts of the candidate itemsets, the hash tree must be traversed in
such a way that all the leaf nodes containing candidate 3-itemsets belonging to t must be
visited at least once.
The 3-itemsets contained in t must begin with items 1, 2,or 3, as indicated by the Level 1
prefix structures shown in the Figure in section 6.2.4.
Therefore, at the root node of the hash tree, the items 1, 2, and 3 of the transaction are hashed
separately.
Item 1 is hashed to the left child of the root node.
Item 2 is hashed to the middle child.
Item 3 is hashed to the right child.
At the next level of the tree, the transaction is hashed on the second item listed in the Level 2
structures shown in the Figure in section 6.2.4.
For example, after hashing on item 1 at the root node, items 2, 3, and 5 of the transaction are
hashed.
Items 2 and 5 are hashed to the middle child, while item 3 is hashed to the right child, as
shown in Figure a.
This process continues until the leaf nodes of the hash tree are reached.
The candidate itemsets stored at the visited leaf nodes are compared against the transaction.
If a candidate is a subset of the transaction, its support count is incremented.
In the above example, 5 out of the 9 leaf nodes are visited and 9 out of the 15 itemsets are
compared against the transaction.
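The hash tree is an optimization of the matching step; what the support counting pass actually computes can be shown with a much simpler (and slower) Python sketch that enumerates the k-itemsets of each transaction and checks them against the candidate set. The candidate 3-itemsets below are illustrative.

from itertools import combinations

def count_supports(transactions, candidates, k):
    # Update support counts by enumerating the k-subsets of each transaction.
    # candidates: iterable of frozensets, all of size k. A hash tree would avoid
    # enumerating subsets that match no candidate; this sketch checks every
    # k-subset against the candidate set.
    counts = {c: 0 for c in candidates}
    for t in transactions:
        for subset in combinations(sorted(t), k):
            fs = frozenset(subset)
            if fs in counts:
                counts[fs] += 1
    return counts

t = {1, 2, 3, 5, 6}
candidates = [frozenset(s) for s in [(1, 2, 3), (1, 2, 5), (1, 5, 9), (2, 3, 4)]]
print(count_supports([t], candidates, 3))
# {1,2,3} and {1,2,5} are contained in t; {1,5,9} and {2,3,4} are not.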
Computational Complexity: The computational complexity of the Apriori algorithm is affected by the following factors.
Number of Transactions: Since the Apriori algorithm makes repeated passes over the data
set, its run time increases with a larger number of transactions.
Average transaction Width: For dense data sets, the average transaction width can be very
large.
This affects the complexity of the Apriori algorithm in two ways:
1. The maximum size of frequent itemsets tends to increase as the average transaction
width increases. As a result, more candidate itemsets must be examined during
candidate generation and support counting, as shown in Figure.
2. As the transaction width increases, more itemsets are contained in the transaction.
This will increase the number of hash tree traversals performed during support
counting.
Generation of frequent 1-itemsets: For each transaction, we need to update the support
count for every item present in the transaction.
Assuming that w is the average transaction width, this operation requires O(Nw) time, where
N is the total number of transactions.
Candidate generation: To generate candidate k-itemsets, pairs of frequent (k - 1)-itemsets
are merged to determine whether they have at least k - 2 items in common.
Each merging operation requires at most k - 2 equality comparisons.
In the best-case scenario, every merging step produces a viable candidate k-itemset.
In the worst-case scenario, the algorithm must merge every pair of frequent (k - 1)-itemsets
found in the previous iteration.
Therefore, the overall cost of merging frequent itemsets lies between
Σ(k=2..w) (k - 2)|Ck|  (best case)  and  Σ(k=2..w) (k - 2)|F(k-1)|^2  (worst case).
A hash tree is also constructed during candidate generation to store the candidate itemsets.
Because the maximum depth of the tree is k, the cost for populating the hash tree with
candidate itemsets is O( Σk k|Ck| ).
During candidate pruning, we need to verify that the k - 2 subsets of every candidate
k-itemset are frequent.
Since the cost of looking up a candidate in a hash tree is O(k), the candidate pruning step
requires O( Σk k(k - 2)|Ck| ) time.
6.3 Rule Generation
Each frequent k-itemset, Y, can produce up to 2^k - 2 association rules, ignoring rules that
have empty antecedents or consequents (Ø -> Y or Y -> Ø). An association rule can be
extracted by partitioning the itemset Y into two non-empty subsets, X and Y - X, such that
X -> Y - X satisfies the confidence threshold. All such rules must already meet the support
threshold because they are generated from a frequent itemset.
Example: Let X = {1,2,3} be a frequent itemset. There are six candidate association rules that
can be generated from X: {1,2} -> {3}, {1,3} -> {2}, {2,3} -> {1}, {1} -> {2,3},
{2} -> {1,3}, and {3} -> {1,2}.
As each of their support is identical to the support for X, the rules must satisfy the support
threshold.
Computing the confidence of an association rule does not require additional scans of the
transaction data set.
Consider the rule {1,2} -> {3}, which is generated from the frequent itemset X = {1,2,3}. The
confidence for this rule is σ({1,2,3}) / σ({1,2}). Because {1,2,3} is frequent, the anti-monotone
property of support ensures that {1,2} must be frequent, too.
Since the support counts for both itemsets were already found during frequent itemset
generation, there is no need to read the entire data set again.
Theorem 6.2 states that if a rule X -> Y - X does not satisfy the confidence threshold, then any
rule X' -> Y - X', where X' is a subset of X, cannot satisfy the confidence threshold either.
Hence, if any node in the lattice has low confidence, the entire subgraph spanned by that node
can be pruned immediately.
Suppose the confidence for {bcd} -> {a} is low. Then all the rules containing item a in the
consequent, including {cd} -> {ab}, {bd} -> {ac}, {bc} -> {ad}, and {d} -> {abc}, can be
discarded.
The difference between rule generation and frequent itemset generation is that, in rule
generation, we do not have to make additional passes over the data set to compute the
confidence of the candidate rules.
Here we determine the confidence of each rule by using the support counts computed during
frequent itemset generation.
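A minimal Python sketch of rule generation from a single frequent itemset follows; the support counts passed in are illustrative and would normally come from the frequent itemset generation phase (the confidence-based pruning of Theorem 6.2 is omitted for brevity).

from itertools import combinations

def rules_from_itemset(Y, support, minconf):
    # Generate rules X -> Y - X from a frequent itemset Y using only the support
    # counts gathered during frequent itemset generation, so no extra pass over
    # the transactions is needed.
    rules = []
    Y = frozenset(Y)
    for r in range(1, len(Y)):                    # non-empty proper antecedents
        for X in map(frozenset, combinations(Y, r)):
            conf = support[Y] / support[X]        # c(X -> Y - X) = sigma(Y) / sigma(X)
            if conf >= minconf:
                rules.append((set(X), set(Y - X), conf))
    return rules

# Illustrative support counts for the frequent itemset {1, 2, 3} and its subsets.
support = {frozenset(s): c for s, c in [
    ((1,), 5), ((2,), 4), ((3,), 4),
    ((1, 2), 4), ((1, 3), 3), ((2, 3), 3), ((1, 2, 3), 3)]}
for antecedent, consequent, conf in rules_from_itemset({1, 2, 3}, support, minconf=0.75):
    print(antecedent, "->", consequent, round(conf, 2))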
6.5 Alternative Methods for Generating Frequent Itemsets
The Apriori algorithm traverses the itemset lattice in a breadth-first (level-wise) manner: it
first discovers all the frequent 1-itemsets, followed by the frequent 2-itemsets, and so on,
until no new frequent itemsets are generated. The itemset lattice can also be traversed in a
depth-first manner, as shown in Figures 6.21(b) and 6.22.
The algorithm can start from, say, node a, in Figure 6.22, and count its support to determine
whether it is frequent. If so, the algorithm progressively expands the next level of nodes, i.e.,
ab, abc, and so on, until an infrequent node is reached, say, abcd.
It then backtracks to another branch, say, abce, and continues the search from there.
The depth-first approach is often used by algorithms designed to find maximal frequent
itemsets. This approach allows the frequent itemset border to be detected more quickly than
using a breadth-first approach.
Once a maximal frequent itemset is found, substantial pruning can be performed on its
subsets.
For example, if the node bcde shown in Figure 6.22 is maximal frequent, then the algorithm
does not have to visit the subtrees rooted at bd, be, c, d, and e because they will not contain
any maximal frequent itemsets.
If abc is maximal frequent, we only know that nodes such as ac and bc are not maximal
frequent (but the subtrees rooted at ac and bc may still contain maximal frequent itemsets).
The depth-first approach also allows a different kind of pruning based on the support of
itemsets.
For example, suppose the support for {a, b, c} is identical to the support for {a, b}. In this
case, every transaction that contains {a, b} also contains c, so the subtrees rooted at abd and
abe can be skipped because they are guaranteed not to contain any maximal frequent itemsets.
Representation of Transaction Data Set : There are many ways to represent a transaction
data set. The choice of representation can affect the I/O costs incurred when computing the
support of candidate itemsets.
Figure 6.23 shows two different ways of representing market basket transactions.
The representation on the left is called a horizontal data layout, which is adopted by many
association rule mining algorithms, including Apriori.
Another possibility is to store the list of transaction identifiers (TID-list) associated with each
item. Such a representation is known as the vertical data layout.
The support for each candidate itemset is obtained by intersecting the TID-lists of its subset
items.
The length of the TID-lists shrinks as we progress to larger sized itemsets.
One problem with this approach is that the initial set of TID-lists may be too large to fit into
main memory, thus requiring more sophisticated techniques to compress the TID-lists.
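A minimal Python sketch of support counting under the vertical layout, using TID-lists built from the Table 6.1 transactions (TIDs 1-5) rather than the data of Figure 6.23; the dictionary contents are illustrative.

# Vertical data layout: each item maps to the set of TIDs that contain it.
tid_lists = {
    "Bread":   {1, 2, 4, 5},
    "Milk":    {1, 3, 4, 5},
    "Diapers": {2, 3, 4, 5},
    "Beer":    {2, 3, 4},
}

def support_from_tid_lists(itemset, tid_lists):
    # Support count of an itemset = size of the intersection of its TID-lists.
    tids = None
    for item in itemset:
        tids = tid_lists[item] if tids is None else tids & tid_lists[item]
    return len(tids)

print(support_from_tid_lists({"Bread", "Diapers"}, tid_lists))          # 3
print(support_from_tid_lists({"Bread", "Diapers", "Milk"}, tid_lists))  # 2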
6.6 FP-Growth Algorithm
FP-growth finds all the frequent itemsets ending with a particular suffix by employing a
divide-and-conquer strategy to split the problem into smaller subproblems.
Example 1: Suppose we are interested in finding all frequent itemsets ending in e.
To do this, we must first check whether the itemset {e} itself is frequent.
If it is frequent, we consider the subproblem of finding frequent itemsets ending in de,
followed by ce, be, and ae.
In turn, each of these subproblems is further decomposed into smaller subproblems.
By merging the solutions obtained from the subproblems, all the frequent itemsets ending in e
can be found.
This divide-and-conquer approach is the key strategy employed by the FP-growth algorithm.
Example 2. Consider the task of finding frequent itemsets ending with e.
1. The first step is to gather all the paths containing node e. These initial paths are called
prefix paths and are shown in Figure 6.27(a).
2. From the prefix paths shown in Figure 6.27(a), the support count for e is obtained by
adding the support counts associated with node e. Assuming that the minimum support count
is 2, {e} is declared a frequent itemset because its support count is 3.
3. Since {e} is frequent, the algorithm has to solve the subproblems of finding frequent
itemsets ending in de, ce, be, and ae.
Before solving these subproblems, it must first convert the prefix paths into a conditional FP-
tree, which is structurally similar to an FP-tree, except it is used to find frequent itemsets
ending with a particular suffix.
A conditional FP-tree is obtained in the following way:
(a) First, the support counts along the prefix paths must be updated because some of the
counts include transactions that do not contain item e.
For example, the rightmost path shown in Figure 6.27(a),
null -> b:2 -> c:2 -> e:1, includes a transaction {b, c} that does not contain item e.
The counts along this prefix path must therefore be adjusted to 1 to reflect the actual number
of transactions containing {b, c, e}.
(b) The prefix paths are truncated by removing the nodes for e. These nodes can be removed
because the support counts along the prefix paths have been updated to reflect only
transactions that contain e, and the subproblems of finding frequent itemsets ending in de, ce,
be, and ae no longer need information about node e.
(c) After updating the support counts along the prefix paths, some of the items may no longer
be frequent.
For example, the node b appears only once and has a support count equal to 1, which means
that there is only one transaction that contains both b and e. Item b can be safely ignored from
subsequent analysis because all itemsets ending in be must be infrequent.
The conditional FP-tree for e is shown in Figure 6.27(b). The tree looks different than the
original prefix paths because the frequency counts have been updated and the nodes b and e
have been eliminated.
4. FP-growth uses the conditional FP-tree for e to solve the subproblems of finding frequent
itemsets ending in de, ce, and ae.
To find the frequent itemsets ending in de, the prefix paths for d are gathered from the
conditional FP-tree for e (Figure 6.27(c)).
By adding the frequency counts associated with node d, we obtain the support count for
{d,e}.
Since the support count is equal to 2, {d,e} is declared a frequent itemset.
Next, the algorithm constructs the conditional FP-tree for de using the approach described in
step 3.
After updating the support counts and removing the infrequent item c, the conditional FP-tree
for de is shown in Figure 6.27(d).
Since the conditional FP-tree contains only one item, a, whose support is equal to minsup, the
algorithm extracts the frequent itemset {a, d, e} and moves on to the next subproblem, which
is to generate frequent itemsets ending in ce.
After processing the prefix paths for c, only {c, e} is found to be frequent. The algorithm
then proceeds to solve the next subproblem and finds {a, e} to be the only frequent itemset
remaining.
This example illustrates the divide-and-conquer approach used in the FP-growth algorithm. At
each recursive step, a conditional FP-tree is constructed by updating the frequency counts
along the prefix paths and removing all infrequent items.
Because the subproblems are disjoint, FP-growth will not generate any duplicate itemsets. In
addition, the counts associated with the nodes allow the algorithm to perform support
counting while generating the common suffix itemsets.
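The recursion can be sketched in Python. This is not a full FP-growth implementation: there is no compact FP-tree or header table; each conditional pattern base is kept as a plain list of (prefix path, count) pairs, which reproduces the same suffix-based divide-and-conquer and the same support counts. The transaction list below is reconstructed from the worked example in this section and should be treated as illustrative.

from collections import defaultdict

def fp_growth_like(pattern_base, minsup_count, suffix=()):
    # pattern_base: list of (path, count) pairs, where each path is a tuple of
    # items kept in one fixed global order (alphabetical here). Initially the
    # raw transactions, each with count 1.
    item_counts = defaultdict(int)
    for path, count in pattern_base:
        for item in path:
            item_counts[item] += count

    frequent = []
    for item, count in item_counts.items():
        if count < minsup_count:
            continue                      # infrequent items are dropped
        new_suffix = (item,) + suffix     # a frequent itemset ending in `suffix`
        frequent.append((new_suffix, count))
        # Conditional pattern base for `item`: the prefix of each path that
        # contains it, carrying that path's count.
        conditional = []
        for path, c in pattern_base:
            if item in path:
                prefix = path[:path.index(item)]
                if prefix:
                    conditional.append((prefix, c))
        # Divide and conquer: find frequent itemsets ending in `new_suffix`.
        frequent.extend(fp_growth_like(conditional, minsup_count, new_suffix))
    return frequent

# Transactions of the FP-growth example in this section (items in alphabetical order).
transactions = [("a", "b"), ("b", "c", "d"), ("a", "c", "d", "e"), ("a", "d", "e"),
                ("a", "b", "c"), ("a", "b", "c", "d"), ("a",), ("a", "b", "c"),
                ("a", "b", "d"), ("b", "c", "e")]
base = [(t, 1) for t in transactions]
for itemset, count in sorted(fp_growth_like(base, minsup_count=2)):
    if itemset[-1] == "e":                # the subproblem worked through above
        print(itemset, count)
# ('a', 'd', 'e') 2   ('a', 'e') 2   ('c', 'e') 2   ('d', 'e') 2   ('e',) 3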
6.7 Evaluation of Association Patterns
Interest Factor: The tea-coffee example shows that high-confidence rules can sometimes be
misleading because the confidence measure ignores the support of the itemset appearing in
the rule consequent.
One way to address this problem is to apply a metric known as lift,
Lift = c(A -> B) / s(B),
which computes the ratio between the rule's confidence and the support of the itemset in the
rule consequent.
For binary variables, lift is equivalent to another objective measure called interest factor,
which is defined as follows:
I(A, B) = s(A, B) / ( s(A) s(B) ) = N f11 / ( f1+ f+1 ).
Interest factor compares the frequency of a pattern against a baseline frequency computed
under the statistical independence assumption. The baseline frequency for a pair of mutually
independent variables is
f11 / N = ( f1+ / N ) x ( f+1 / N ).
This equation follows from the standard approach of using simple fractions as estimates for
probabilities. The fraction f11/N is an estimate for the joint probability P(A,B), while f1+/N
and f+1/ N are the estimates for P(A) and P(B), respectively.
If A and B are statistically independent, then P(A, B) = P(A) x P(B), thus leading to the
formula shown in Equation 6.6. Using Equations 6.5 and 6.6, we can interpret the measure as
follows: I(A, B) = 1 if A and B are independent, I(A, B) > 1 if A and B are positively
correlated, and I(A, B) < 1 if A and B are negatively correlated.
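The lift and interest factor computations can be sketched directly from the counts of a 2 x 2 contingency table. The counts used below are illustrative and not taken from any table in this chapter.

def interest_factor(f11, f10, f01, f00):
    # I(A,B) = N * f11 / (f1+ * f+1); equals 1 under statistical independence.
    N = f11 + f10 + f01 + f00
    f1_plus = f11 + f10     # support count of A
    f_plus1 = f11 + f01     # support count of B
    return N * f11 / (f1_plus * f_plus1)

def lift(f11, f10, f01, f00):
    # lift(A -> B) = confidence(A -> B) / support(B); same as I(A,B) for binary data.
    N = f11 + f10 + f01 + f00
    confidence = f11 / (f11 + f10)
    support_B = (f11 + f01) / N
    return confidence / support_B

# Illustrative counts: A and B co-occur slightly more often than independence predicts.
print(interest_factor(f11=60, f10=20, f01=10, f00=10))  # about 1.07 (> 1: positive correlation)
print(lift(f11=60, f10=20, f01=10, f00=10))             # same value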
Limitations of Correlation Analysis: The drawback of using correlation can be seen from
the word association example given in Table 6.9.
Although the words p and q appear together more often than r and s, their φ-coefficients
are identical, i.e., φ(p, q) = φ(r, s) = 0.232.
This is because the φ-coefficient gives equal importance to both co-presence and co-absence
of items in a transaction.
It is therefore more suitable for analyzing symmetric binary variables.
IS Measure: IS is an alternative measure that has been proposed for handling asymmetric
binary variables. The measure is defined as follows:
IS(A, B) = sqrt( I(A, B) x s(A, B) ) = s(A, B) / sqrt( s(A) s(B) ).
Note that IS is large when both the interest factor and the support of the pattern are large.
For example, the value of IS for the word pairs {p, q} and {r, s} shown in Table 6.9 are 0.946
and 0.286, respectively.
Contrary to the results given by interest factor and the φ-coefficient, the IS measure suggests
that the association between {p, q} is stronger than that between {r, s}, which agrees with
what we expect from word associations in documents.
It is possible to show that IS is mathematically equivalent to the cosine measure for binary
variables .
Consider A and B as a pair of bit vectors, A · B = s(A, B) the dot product between the vectors,
and |A| = sqrt(s(A)) the magnitude of vector A.
Therefore:
cosine(A, B) = (A · B) / ( |A| |B| ) = s(A, B) / sqrt( s(A) s(B) ) = IS(A, B).
The IS measure can also be expressed as the geometric mean between the confidences of the
association rules extracted from a pair of binary variables:
IS(A, B) = sqrt( c(A -> B) x c(B -> A) ).
Because the geometric mean between any two numbers is always closer to the smaller
number, the IS value of an itemset {p, q} is low whenever one of its rules, p -> q or q -> p,
has low confidence.
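A small Python sketch of the IS measure and its geometric-mean form follows; the contingency counts used here (880/50/50/20 and their inversion) are chosen to reproduce the IS values of 0.946 and 0.286 quoted above for the word pairs {p, q} and {r, s}, and should be treated as a reconstruction of Table 6.9.

import math

def IS(f11, f10, f01, f00):
    # IS(A,B) = s(A,B) / sqrt(s(A) * s(B)), the cosine of the two bit vectors.
    N = f11 + f10 + f01 + f00
    s_AB = f11 / N
    s_A = (f11 + f10) / N
    s_B = (f11 + f01) / N
    return s_AB / math.sqrt(s_A * s_B)

def IS_as_geometric_mean(f11, f10, f01, f00):
    # Equivalent form: sqrt( c(A -> B) * c(B -> A) ).
    conf_AB = f11 / (f11 + f10)
    conf_BA = f11 / (f11 + f01)
    return math.sqrt(conf_AB * conf_BA)

# Word-pair counts consistent with Table 6.9: {p,q} co-occur in 880 of 1000
# documents, {r,s} in only 20.
print(round(IS(880, 50, 50, 20), 3))                     # 0.946
print(round(IS(20, 50, 50, 880), 3))                     # 0.286
print(round(IS_as_geometric_mean(880, 50, 50, 20), 3))   # 0.946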
Limitations of IS Measure: The IS value for a pair of independent itemsets, A and B, is
IS_indep(A, B) = s(A, B) / sqrt( s(A) s(B) ) = sqrt( s(A) s(B) ),
since s(A, B) = s(A) s(B) under independence. Because the value depends on s(A) and s(B),
IS shares a problem similar to the confidence measure: the value of the measure can be quite
large even for uncorrelated and negatively correlated patterns.
For example, despite the large IS value between items p and q given in Table 6.10 (0.889), it
is still less than the expected value when the items are statistically independent (IS_indep = 0.9).
These measures can be divided into two categories, symmetric and asymmetric measures.
A measure M is symmetric if M(A -> B) = M(B -> A).
For example, interest factor is a symmetric measure because its value is identical for the rules
A -> B and B -> A.
In contrast, confidence is an asymmetric measure since the confidence for A -> B and B -> A
may not be the same.
Symmetric measures are generally used for evaluating itemsets, while asymmetric measures
are more suitable for analyzing association rules.
Tables 6.11 and 6.12 provide the definitions for some of these measures in terms of the
frequency counts of a 2 x 2 contingency table.
Consistency among Objective Measures
If the measures are consistent, then we can choose any one of them as our evaluation metric.
Otherwise, it is important to understand what their differences are in order to determine
which measure is more suitable for analyzing certain types of patterns.
Suppose the symmetric and asymmetric measures are applied to rank the ten contingency
tables shown in Table 6.13.
These contingency tables are chosen to illustrate the differences among the existing
measures.
The ordering produced by these measures are shown in Tables 6.14 and 6.15, respectively
(with 1 as the most interesting and 10 as the least interesting table).
Although some of the measures appear to be consistent with each other, certain measures
produce quite different ordering results.
For example, the rankings given by the φ-coefficient agree with those provided by κ and
collective strength, but are somewhat different from the rankings produced by interest factor
and odds ratio.
Furthermore, a contingency table such as E10 is ranked lowest according to the φ-coefficient,
but highest according to interest factor.
The 0/1 bit in each column vector indicates whether a transaction (row) contains a particular
item (column).
For example, the vector A indicates that item α belongs to the first and last transactions,
whereas the vector B indicates that item b is contained only in the fifth transaction.
The vectors C and E are in fact related to the vector A: their bits have been inverted from 0's
(absence) to 1's (presence), and vice versa.
Similarly, D is related to vectors B and F by inverting their bits.
The process of flipping a bit vector is called inversion.
If a measure is invariant under the inversion operation, then its value for the vector pair (C,
D) should be identical to its value for (A, B).
The inversion property of a measure can be tested as follows.
Definition 6.6 (Inversion Property). An objective measure M is invariant under the
inversion operation if its value remains the same when exchanging the frequency counts f11
with f00 and f10 with f01.
The measures that remain invariant under this operation include the φ-coefficient,
odds ratio, κ, and collective strength.
These measures may not be suitable for analyzing asymmetric binary data.
For example, the φ-coefficient between C and D is identical to the φ-coefficient between A
and B, even though items c and d appear together more frequently than a and b.
Furthermore, the φ-coefficient between C and D is less than that between E and F even
though items e and f appear together only once.
For asymmetric binary data, measures that do not remain invariant under the inversion
operation are preferred. Some of the non-invariant measures include interest factor, IS, PS,
and the Jaccard coefficient .
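The effect of inversion can be checked numerically: the φ-coefficient computed from a pair of bit vectors is unchanged when both vectors are inverted. A small Python sketch with illustrative bit vectors (not those of the figure referenced above):

import math

def phi(f11, f10, f01, f00):
    # phi-coefficient of a 2 x 2 contingency table.
    num = f11 * f00 - f10 * f01
    den = math.sqrt((f11 + f10) * (f01 + f00) * (f11 + f01) * (f10 + f00))
    return num / den

def table(a, b):
    # Build the contingency counts (f11, f10, f01, f00) from two bit vectors.
    f11 = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    f10 = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)
    f01 = sum(1 for x, y in zip(a, b) if x == 0 and y == 1)
    f00 = sum(1 for x, y in zip(a, b) if x == 0 and y == 0)
    return f11, f10, f01, f00

A = [1, 0, 0, 0, 1, 1, 0, 1, 0, 1]
B = [1, 0, 0, 1, 1, 0, 0, 1, 0, 1]
C = [1 - x for x in A]          # inversion of A
D = [1 - x for x in B]          # inversion of B

print(round(phi(*table(A, B)), 3), round(phi(*table(C, D)), 3))  # both print 0.6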
Null Addition Property: Suppose we are interested in analyzing the relationship between a
pair of words, such as data and mining, in a set of documents. Adding a collection of
documents that contain neither word should not affect the strength of their association.
This process of adding unrelated data (in this case, documents) to a given data set is known
as the null addition operation.
Definition 6.7 (Null Addition Property). An objective measure M is invariant under the null
addition operation if it is not affected by increasing f00, while all other frequencies in the
contingency table stay the same.
For applications such as document analysis or market basket analysis, the measure is
expected to remain invariant under the null addition operation.
Otherwise, the relationship between words may disappear simply by adding enough
documents that do not contain both words.
Examples of measures that satisfy this property include the cosine (IS) and Jaccard measures,
while those that violate this property include interest factor, PS, odds ratio, and the
φ-coefficient.
Scaling Property Table 6.16 shows the contingency tables for gender and the grades
achieved by students enrolled in a particular course in 1993 and 2004.
The data in these tables showed that the number of male students has doubled since 1993,
while the number of female students has increased by a factor of 3.
The male students in 2004 are not performing any better than those in 1993 because the ratio
of male students who achieve a high grade to those who achieve a low grade is still the same,
i.e., 3:4.
Similarly, the female students in 2004 are performing no better than those in 1993. The
association between grade and gender is expected to remain unchanged despite changes in the
sampling distribution.
Measures beyond Pairs of Binary Variables: Measures defined for pairs of items can be
extended to patterns involving more than two variables by using a multidimensional (e.g.,
2 x 2 x 2) contingency table. Each entry fijk in such a table represents the number of
transactions that contain a particular combination of items a, b, and c. For example, f101 is the
number of transactions that contain a and c, but not b.
On the other hand, a marginal frequency such as f1+1 is the number of transactions that
contain a and c, irrespective of whether b is present in the transaction.
Given a k-itemset {i1, i2, ..., ik}, the condition for statistical independence can be stated as
follows:
s(i1, i2, ..., ik) = s(i1) x s(i2) x ... x s(ik).
This definition can be extended to objective measures such as interest factor and PS, which
are based on deviations from statistical independence, for more than two variables:
I = s(i1, i2, ..., ik) / ( s(i1) s(i2) ... s(ik) )  and  PS = s(i1, i2, ..., ik) - s(i1) s(i2) ... s(ik).
Another approach is to define the objective measure as the maximum, minimum, or average
value for the associations between pairs of items in a pattern.
For example, given a k-itemset X = {i1, i2, ..., ik}, we may define the
φ-coefficient for X as the average φ-coefficient between every pair of items
(ip, iq) in X.
Since the measure considers only pairwise associations, it may not capture all the underlying
relationships within a pattern.
Analysis of multidimensional contingency tables is more complicated because of the
presence of partial associations in the data.
For example, some associations may appear or disappear when conditioned upon the value of
certain variables. This problem is known as Simpson's paradox
6.7.3 Simpson's Paradox
It is important to exercise caution when interpreting the association between variables
because the observed relationship may be influenced by the presence of other confounding
factors, i.e., hidden variables that are not included in the analysis.
In some cases, the hidden variables may cause the observed relationship between a pair of
variables to disappear or reverse its direction, a phenomenon that is known as Simpson's
paradox.
Example: Consider the relationship between the sale of high-definition television (HDTV)
and exercise machine, as shown in Table 6.19.
The rule {HDTV = Yes} -> {Exercise machine = Yes} has a confidence of 99/180 = 55% and
the rule {HDTV = No} -> {Exercise machine = Yes} has a confidence of 54/120 = 45%.
Together, these rules suggest that customers who buy high-definition televisions are more
likely to buy exercise machines than those who do not buy high-definition televisions.
However, a deeper analysis reveals that the sales of these items depend on whether the customer is a college student or a working adult.
Table 6.20 summarizes the relationship between the sale of HDTVs and exercise machines
among college students and working adults.
The support counts given in the table for college students and working adults sum up to the
frequencies shown in Table 6.19.
For college students, the confidence of {HDTV = Yes} -> {Exercise machine = Yes} is
1/10 = 10%, while the confidence of {HDTV = No} -> {Exercise machine = Yes} is 4/34 = 11.8%.
For working adults, the corresponding confidences are 98/170 = 57.7% and 50/86 = 58.1%.
These rules suggest that, within each group, customers who do not buy high-definition
televisions are more likely to buy exercise machines, which contradicts the previous
conclusion drawn when data from the two customer groups are pooled together.
Even if alternative measures such as correlation, odds ratio or interest are applied, we still
find that the sale of HDTV and exercise machine is positively correlated in the combined data
but is negatively correlated in the stratified data.
The reversal in the direction of association is known as Simpson's paradox.
The paradox can be explained in the following way:
Most customers who buy HDTVs are working adults.
Working adults are also the largest group of customers who buy exercise machines.
Because nearly 85% of the customers are working adults, the observed relationship between
HDTV and exercise machine turns out to be stronger in the combined data than it would have
been if the data were stratified.
This can also be illustrated mathematically as follows. Suppose
a/b < c/d and p/q < r/s,
where a/b and p/q may represent the confidence of the rule A -> B in two different strata,
while c/d and r/s may represent the confidence of the complementary rule (the rule with
antecedent not-A) in the same two strata.
When the data are pooled together, the confidence values of the rules in the combined data
are (a + p)/(b + q) and (c + r)/(d + s), respectively. Simpson's paradox occurs when
(a + p)/(b + q) > (c + r)/(d + s),
thus leading to the wrong conclusion about the relationship between the variables.
Proper stratification is needed to avoid generating spurious patterns resulting from Simpson's
paradox.
For example, market basket data from a major supermarket chain should be stratified
according to store locations, while medical records from various patients should be stratified
according to confounding factors such as age and gender.
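The reversal can be checked numerically. The sketch below uses the pooled counts of Table 6.19; the stratified counts for college students (1, 9, 4, 30) and working adults (98, 72, 50, 36) are the values quoted above, which are consistent with the pooled table but should be treated as reconstructed rather than taken directly from this document.

def confidence(buy_both, buy_antecedent_only):
    # confidence({HDTV = x} -> {Exercise machine = Yes})
    return buy_both / (buy_both + buy_antecedent_only)

# Stratified counts: (bought exercise machine, did not), per HDTV group.
college = {"HDTV=Yes": (1, 9),   "HDTV=No": (4, 30)}
working = {"HDTV=Yes": (98, 72), "HDTV=No": (50, 36)}

# Pooled counts (Table 6.19) are the element-wise sums of the two strata.
pooled = {g: (college[g][0] + working[g][0], college[g][1] + working[g][1])
          for g in college}

for name, table in [("college", college), ("working", working), ("pooled", pooled)]:
    yes = confidence(*table["HDTV=Yes"])
    no = confidence(*table["HDTV=No"])
    print(f"{name:8s} c(HDTV=Yes -> EM=Yes) = {yes:.3f}  c(HDTV=No -> EM=Yes) = {no:.3f}")
# In each stratum the HDTV=No group has the higher confidence, yet in the pooled
# data the direction of the association reverses (0.550 vs 0.450).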