Follow the Apriori algorithm to obtain ALL the association rules with at least 50% support and 100% confidence that are derivable from the set of transactions below (example taken from M.J. Zaki's KDD lecture notes). Show the results obtained in each step of the algorithm.
Items: A = Jane Austen, C = Agatha Christie, D = Sir Arthur Conan Doyle, T = Mark Twain, W = P.G. Wodehouse

Transaction Database:

Transaction ID | Items
1 | A C T W
2 | C D W
3 | A C T W
4 | A C D W
5 | A C D T W
6 | C D T
The problem of discovering all association rules can be divided into two parts:
1. Find all frequent itemsets, i.e., itemsets whose support is at least the minimum support (here 50%, i.e., 3 of the 6 transactions).
2. Use the frequent itemsets to generate the rules that have at least the minimum confidence (here 100%).
F1
Itemset | Support
{A} | 4
{C} | 6
{D} | 4
{T} | 4
{W} | 5
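As a quick cross-check, the counting can be sketched in Python (the names transactions, support and min_support below are ours, chosen for illustration, not part of the exercise):

transactions = [
    {"A", "C", "T", "W"},       # 1
    {"C", "D", "W"},            # 2
    {"A", "C", "T", "W"},       # 3
    {"A", "C", "D", "W"},       # 4
    {"A", "C", "D", "T", "W"},  # 5
    {"C", "D", "T"},            # 6
]
min_support = 3  # 50% of 6 transactions

def support(itemset):
    """Number of transactions containing every item of `itemset`."""
    return sum(1 for t in transactions if set(itemset) <= t)

# F1: frequent 1-itemsets
F1 = {frozenset({item}): support({item})
      for item in sorted(set().union(*transactions))
      if support({item}) >= min_support}
# -> {A}: 4, {C}: 6, {D}: 4, {T}: 4, {W}: 5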
Following the candidate generation process, we generate C2 from F1. After the join step, C2 is {{A C}, {A D}, {A T}, {A W}, {C D}, {C T}, {C W}, {D T}, {D W}, {T W}}. Since every 1-subset of each itemset in C2 is contained in F1, the prune step deletes nothing. Counting the supports of the itemsets in C2 gives F2.
F2
Itemset | Support
{A C} | 4
{A T} | 3
{A W} | 4
{C D} | 4
{C T} | 4
{C W} | 5
{D W} | 3
{T W} | 3
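Continuing the sketch from above (reusing F1, support and min_support), the join step for C2 simply pairs up the frequent items, and counting supports yields F2:

from itertools import combinations

# Join step for C2: every pair of frequent items is a candidate 2-itemset.
items_in_F1 = sorted(set().union(*F1))
C2 = [frozenset(pair) for pair in combinations(items_in_F1, 2)]

# Counting supports and keeping candidates at or above the threshold gives F2;
# {A D} and {D T} (support 2 each) are the only candidates dropped.
F2 = {c: support(c) for c in C2 if support(c) >= min_support}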
Similarly, C3 is generated from F2. After the join step, C3 is {{A C T}, {A C W}, {A T W}, {C D T}, {C D W}, {C T W}}. The prune step deletes {C D T}, since {D T} is not in F2. Counting the supports of the remaining candidates gives F3.
F3
Itemset | Support
{A C T} | 3
{A C W} | 4
{A T W} | 3
{C D W} | 3
{C T W} | 3
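The prune step can be sketched the same way (again reusing the helpers defined above; apriori_prune is our own illustrative name). It is exactly this check that removes {C D T}, because its subset {D T} is not in F2:

def apriori_prune(candidates, prev_frequent, k):
    """Keep a candidate k-itemset only if all its (k-1)-subsets are frequent."""
    return [c for c in candidates
            if all(frozenset(s) in prev_frequent for s in combinations(c, k - 1))]

# The six 3-itemsets produced by the join step:
joined_C3 = [frozenset(s) for s in ("ACT", "ACW", "ATW", "CDT", "CDW", "CTW")]
C3 = apriori_prune(joined_C3, F2, 3)   # {C D T} is removed here
F3 = {c: support(c) for c in C3 if support(c) >= min_support}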
After the join step, C4 contains only the itemset {A C T W}, and it survives the prune step.
F4
Itemset | Support
{A C T W} | 3
C5 is empty, so the whole set of frequent itemsets is the union
of F1, F2, F3 and F4.
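Putting the steps together, the whole level-wise search can be sketched as one loop (still reusing the helpers above; the join here unions any two frequent (k-1)-itemsets into a k-itemset, a simplification of the prefix-based join described in the text, but after the prune and support count it yields the same frequent itemsets):

def apriori_all_frequent():
    """Level-wise search starting from F1; stops when no candidates survive."""
    frequent = dict(F1)
    current = list(F1)
    k = 2
    while current:
        # join: union pairs of frequent (k-1)-itemsets into k-itemsets
        joined = {a | b for a in current for b in current if len(a | b) == k}
        # prune: drop candidates that have an infrequent (k-1)-subset
        candidates = apriori_prune(joined, frequent, k)
        # count: keep candidates meeting the 50% support threshold
        current = [c for c in candidates if support(c) >= min_support]
        frequent.update({c: support(c) for c in current})
        k += 1   # here C5 comes out empty and the loop ends
    return frequent

all_frequent = apriori_all_frequent()   # union of F1, F2, F3 and F4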
In our example, the frequent itemsets whose sizes are greater than 1 are the itemsets in F2, F3 and F4, i.e., {A C}, {A T}, {A W}, {C D}, {C T}, {C W}, {D W}, {T W}, {A C T}, {A C W}, {A T W}, {C D W}, {C T W}, {A C T W}.
Let's start with the first one. For {A C}, the two possible rules are {A} -> {C} and {C} -> {A}. Their confidences are

support({A C}) / support({A}) = 4 / 4 = 100%,
support({A C}) / support({C}) = 4 / 6 = 66.7%,

so {A} -> {C} is one desired rule.
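With the support helper from the earlier sketches, these two confidences are one-liners:

print(support({"A", "C"}) / support({"A"}))   # 1.0      -> {A} -> {C} is kept
print(support({"A", "C"}) / support({"C"}))   # 0.666... -> {C} -> {A} is not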
Similarly, for {A C T}, the possible rules are {A C} -> {T}, {A T} -> {C}, {C T} -> {A}, {A} -> {C T}, {C} -> {A T} and {T} -> {A C}. Computing their confidences respectively,

support({A C T}) / support({A C}) = 3 / 4,
support({A C T}) / support({A T}) = 3 / 3 = 100%,
support({A C T}) / support({C T}) = 3 / 4,
support({A C T}) / support({A}) = 3 / 4,
support({A C T}) / support({C}) = 3 / 6,
support({A C T}) / support({T}) = 3 / 4,

we find that only {A T} -> {C} is a desired rule.
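The same enumeration can be sketched in general (reusing support, combinations and all_frequent from the earlier snippets; rules_with_min_confidence is our own illustrative name): every non-empty proper subset of a frequent itemset is tried as an antecedent, and the rule is kept when its confidence reaches the 100% threshold.

def rules_with_min_confidence(frequent_itemsets, min_conf=1.0):
    """Try each non-empty proper subset as an antecedent; keep confident rules."""
    rules = []
    for itemset, sup in frequent_itemsets.items():
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, r)):
                if sup / support(antecedent) >= min_conf:
                    rules.append((sorted(antecedent), sorted(itemset - antecedent)))
    return rules

# Reproduces the rules derived by hand, e.g. ['A'] -> ['C'] and ['A', 'T'] -> ['C'].
for lhs, rhs in rules_with_min_confidence(all_frequent):
    print(lhs, "->", rhs)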