Follow the Apriori algorithm to obtain ALL the association rules with at least 50% support and 100% confidence that are derivable from the set of transactions below (example taken from M.J. Zaki's KDD lecture notes). Show the results obtained in each step of the algorithm.
Items:
A = Jane Austen
C = Agatha Christie
D = Sir Arthur Conan Doyle
T = Mark Twain
W = P.G. Wodehouse

Transaction Database:
Transaction ID  Items
1               A C T W
2               C D W
3               A C T W
4               A C D W
5               A C D T W
6               C D T
The problem of discovering all association rules can be divided into two parts:
1. Find all frequent itemsets, i.e., the itemsets whose support is at least 50% (here, at least 3 of the 6 transactions).
2. Use the frequent itemsets to generate all rules that meet the minimum confidence.
F1
Itemset  Support
{A}      4
{C}      6
{D}      4
{T}      4
{W}      5
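As a cross-check, the single-item supports can be counted directly from the transaction table. The following is a minimal Python sketch (the variable names are mine, not from the notes):

```python
# Transactions from the exercise (IDs 1-6)
transactions = [
    {"A", "C", "T", "W"},
    {"C", "D", "W"},
    {"A", "C", "T", "W"},
    {"A", "C", "D", "W"},
    {"A", "C", "D", "T", "W"},
    {"C", "D", "T"},
]

# Minimum support: 50% of 6 transactions = 3 transactions
min_support = 3

# Count how many transactions contain each single item
support = {item: sum(item in t for t in transactions)
           for item in sorted(set().union(*transactions))}

# Keep only the frequent 1-itemsets: this is F1
f1 = {item: s for item, s in support.items() if s >= min_support}
print(f1)  # → {'A': 4, 'C': 6, 'D': 4, 'T': 4, 'W': 5}
```

Every single item meets the 3-transaction threshold, so F1 contains all five items.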
According to the candidate generation process, C2 is generated from F1. After the join step, C2 is {{A C}, {A D}, {A T}, {A W}, {C D}, {C T}, {C W}, {D T}, {D W}, {T W}}. Since all 1-subsets of the itemsets in C2 are contained in F1, the prune step deletes nothing. By counting the supports of the itemsets in C2, we get F2.
F2
Itemset  Support
{A C}    4
{A T}    3
{A W}    4
{C D}    4
{C T}    4
{C W}    5
{D W}    3
{T W}    3
Similarly, C3 is generated from F2. After the join step, C3 is {{A C T}, {A C W}, {A T W}, {C D T}, {C D W}, {C T W}}. The prune step deletes {C D T}, since its subset {D T} is not in F2. Counting the supports of the remaining candidates gives F3.
F3
Itemset  Support
{A C T}  3
{A C W}  4
{A T W}  3
{C D W}  3
{C T W}  3
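The join-and-prune candidate generation can be sketched as follows. This is my own illustrative helper (the function name `join_prune` is not from the notes), shown here producing C3 from F2:

```python
from itertools import combinations

def join_prune(prev_frequent):
    """Generate C_k from F_{k-1}: join itemsets that share a (k-2)-prefix,
    then prune candidates that have an infrequent (k-1)-subset."""
    prev = sorted(sorted(s) for s in prev_frequent)
    candidates = []
    for i in range(len(prev)):
        for j in range(i + 1, len(prev)):
            # Join step: merge two itemsets that agree on all but the last item
            if prev[i][:-1] == prev[j][:-1]:
                cand = prev[i] + [prev[j][-1]]
                # Prune step: every (k-1)-subset must be frequent
                if all(sorted(sub) in prev
                       for sub in combinations(cand, len(cand) - 1)):
                    candidates.append(frozenset(cand))
    return candidates

# F2 from the solution above
f2 = [{"A", "C"}, {"A", "T"}, {"A", "W"}, {"C", "D"},
      {"C", "T"}, {"C", "W"}, {"D", "W"}, {"T", "W"}]
c3 = join_prune(f2)
print([sorted(c) for c in c3])
```

The join produces {C D T} as well, but the prune step discards it because {D T} is missing from F2, leaving the five candidates listed in the text.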
After the join step, C4 contains only one itemset, {A C T W}, and it survives the prune step since all of its 3-subsets are in F3.
F4
Itemset    Support
{A C T W}  3
C5 is empty, so the whole set of frequent itemsets is the union
of F1, F2, F3 and F4.
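The full set of frequent itemsets can be double-checked by brute force over all subsets of the item universe. This check is my own addition, not part of the notes; it is feasible here only because there are just five items:

```python
from itertools import combinations

transactions = [
    {"A", "C", "T", "W"}, {"C", "D", "W"}, {"A", "C", "T", "W"},
    {"A", "C", "D", "W"}, {"A", "C", "D", "T", "W"}, {"C", "D", "T"},
]
min_support = 3  # 50% of 6 transactions
items = sorted(set().union(*transactions))

# Test every non-empty subset of the five items against the threshold
frequent = {}
for k in range(1, len(items) + 1):
    for cand in combinations(items, k):
        s = sum(set(cand) <= t for t in transactions)
        if s >= min_support:
            frequent[frozenset(cand)] = s

print(len(frequent))  # 5 + 8 + 5 + 1 = 19 frequent itemsets
```

The 19 itemsets found this way are exactly the union of F1 (5 itemsets), F2 (8), F3 (5), and F4 (1).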
In our example, the frequent itemsets whose sizes are greater than 1 are the itemsets in F2, F3 and F4, i.e., {A C}, {A T}, {A W}, {C D}, {C T}, {C W}, {D W}, {T W}, {A C T}, {A C W}, {A T W}, {C D W}, {C T W}, {A C T W}.
Let's start with the first one. For {A C}, the two possible rules are {A} ⇒ {C} and {C} ⇒ {A}. Because
support({A C}) / support({A}) = 4 / 4 = 100%,
support({A C}) / support({C}) = 4 / 6 ≈ 66.7%,
only {A} ⇒ {C} is a desired rule.
Similarly, for {A C T}, the possible rules are {A C} ⇒ {T}, {A T} ⇒ {C}, {C T} ⇒ {A}, {A} ⇒ {C T}, {C} ⇒ {A T} and {T} ⇒ {A C}. Computing their confidences respectively,
support({A C T}) / support({A C}) = 3 / 4,
support({A C T}) / support({A T}) = 3 / 3 = 100%,
support({A C T}) / support({C T}) = 3 / 4,
support({A C T}) / support({A}) = 3 / 4,
support({A C T}) / support({C}) = 3 / 6,
support({A C T}) / support({T}) = 3 / 4,
only {A T} ⇒ {C} is a desired rule.
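Repeating this procedure for every remaining frequent itemset yields all the desired rules. The exhaustive confidence check can be sketched in Python; the brute-force loop below is my own verification aid, not how the notes present the step:

```python
from itertools import combinations

transactions = [
    {"A", "C", "T", "W"}, {"C", "D", "W"}, {"A", "C", "T", "W"},
    {"A", "C", "D", "W"}, {"A", "C", "D", "T", "W"}, {"C", "D", "T"},
]

def support(itemset):
    """Number of transactions containing every item of `itemset`."""
    return sum(set(itemset) <= t for t in transactions)

items = sorted(set().union(*transactions))
# Frequent itemsets: support >= 3 (50% of 6 transactions)
frequent = [set(c) for k in range(1, len(items) + 1)
            for c in combinations(items, k) if support(c) >= 3]

# Split each frequent itemset into antecedent X and consequent Y, and keep
# the rules whose confidence support(X ∪ Y) / support(X) equals 100%
rules = []
for fs in frequent:
    if len(fs) < 2:
        continue
    for r in range(1, len(fs)):
        for x in combinations(sorted(fs), r):
            if support(fs) == support(x):  # confidence = 1
                rules.append((set(x), fs - set(x)))

for antecedent, consequent in rules:
    print(sorted(antecedent), "=>", sorted(consequent))
print(len(rules))
```

Running this check, both rules derived above ({A} ⇒ {C} and {A T} ⇒ {C}) appear in the output, along with the 100%-confidence rules from the other frequent itemsets.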