Follow the Apriori algorithm to obtain ALL the association rules with at least 50% support and 100% confidence that are derivable from the set of transactions below (example taken from M.J. Zaki's KDD lecture notes). Show the results obtained in each step of the algorithm.
Items: A = Jane Austen, C = Agatha Christie, D = Sir Arthur Conan Doyle, T = Mark Twain, W = P.G. Wodehouse

Transaction Database:

Transaction ID | Items
1 | A C T W
2 | C D W
3 | A C T W
4 | A C D W
5 | A C D T W
6 | C D T
The problem of discovering all association rules can be divided into two parts:
1. Find all frequent itemsets, i.e., itemsets whose support is at least the minimum support (here 50%, i.e., 3 of the 6 transactions).
2. Use the frequent itemsets to generate the rules that have at least the minimum confidence (here 100%).
F1
Itemset | Support
{A} | 4
{C} | 6
{D} | 4
{T} | 4
{W} | 5
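As a quick cross-check, the counting can be sketched in Python (the names transactions, support and min_support below are ours, chosen for illustration, not part of the exercise):

transactions = [
    {"A", "C", "T", "W"},       # 1
    {"C", "D", "W"},            # 2
    {"A", "C", "T", "W"},       # 3
    {"A", "C", "D", "W"},       # 4
    {"A", "C", "D", "T", "W"},  # 5
    {"C", "D", "T"},            # 6
]
min_support = 3  # 50% of 6 transactions

def support(itemset):
    """Number of transactions containing every item of `itemset`."""
    return sum(1 for t in transactions if set(itemset) <= t)

# F1: frequent 1-itemsets
F1 = {frozenset({item}): support({item})
      for item in sorted(set().union(*transactions))
      if support({item}) >= min_support}
# -> {A}: 4, {C}: 6, {D}: 4, {T}: 4, {W}: 5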
Following the candidate generation process, we generate C2 from F1. After the join step, C2 is {{A C}, {A D}, {A T}, {A W}, {C D}, {C T}, {C W}, {D T}, {D W}, {T W}}. Since every 1-subset of each itemset in C2 is contained in F1, the prune step deletes nothing. Counting the supports of the itemsets in C2 gives F2.
F2
Itemset | Support
{A C} | 4
{A T} | 3
{A W} | 4
{C D} | 4
{C T} | 4
{C W} | 5
{D W} | 3
{T W} | 3
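Continuing the sketch from above (reusing F1, support and min_support), the join step for C2 simply pairs up the frequent items, and counting supports yields F2:

from itertools import combinations

# Join step for C2: every pair of frequent items is a candidate 2-itemset.
items_in_F1 = sorted(set().union(*F1))
C2 = [frozenset(pair) for pair in combinations(items_in_F1, 2)]

# Counting supports and keeping candidates at or above the threshold gives F2;
# {A D} and {D T} (support 2 each) are the only candidates dropped.
F2 = {c: support(c) for c in C2 if support(c) >= min_support}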
Similarly, C3 is generated from F2. After the join step, C3 is {{A C T}, {A C W}, {A T W}, {C D T}, {C D W}, {C T W}}. The prune step deletes {C D T}, since {D T} is not in F2. Counting the supports of the remaining candidates gives F3.
F3
Itemset | Support
{A C T} | 3
{A C W} | 4
{A T W} | 3
{C D W} | 3
{C T W} | 3
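The prune step can be sketched the same way (again reusing the helpers defined above; apriori_prune is our own illustrative name). It is exactly this check that removes {C D T}, because its subset {D T} is not in F2:

def apriori_prune(candidates, prev_frequent, k):
    """Keep a candidate k-itemset only if all its (k-1)-subsets are frequent."""
    return [c for c in candidates
            if all(frozenset(s) in prev_frequent for s in combinations(c, k - 1))]

# The six 3-itemsets produced by the join step:
joined_C3 = [frozenset(s) for s in ("ACT", "ACW", "ATW", "CDT", "CDW", "CTW")]
C3 = apriori_prune(joined_C3, F2, 3)   # {C D T} is removed here
F3 = {c: support(c) for c in C3 if support(c) >= min_support}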
After the join step, C4 contains only the itemset {A C T W}, and it survives the prune step.
F4
Itemset | Support
{A C T W} | 3
C5 is empty, so the whole set of frequent itemsets is the union
of F1, F2, F3 and F4.
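Putting the steps together, the whole level-wise search can be sketched as one loop (still reusing the helpers above; the join here unions any two frequent (k-1)-itemsets into a k-itemset, a simplification of the prefix-based join described in the text, but after the prune and support count it yields the same frequent itemsets):

def apriori_all_frequent():
    """Level-wise search starting from F1; stops when no candidates survive."""
    frequent = dict(F1)
    current = list(F1)
    k = 2
    while current:
        # join: union pairs of frequent (k-1)-itemsets into k-itemsets
        joined = {a | b for a in current for b in current if len(a | b) == k}
        # prune: drop candidates that have an infrequent (k-1)-subset
        candidates = apriori_prune(joined, frequent, k)
        # count: keep candidates meeting the 50% support threshold
        current = [c for c in candidates if support(c) >= min_support]
        frequent.update({c: support(c) for c in current})
        k += 1   # here C5 comes out empty and the loop ends
    return frequent

all_frequent = apriori_all_frequent()   # union of F1, F2, F3 and F4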
In our example, the frequent itemsets whose sizes are greater than 1 are the itemsets in F2, F3 and F4, i.e., {A C}, {A T}, {A W}, {C D}, {C T}, {C W}, {D W}, {T W}, {A C T}, {A C W}, {A T W}, {C D W}, {C T W}, {A C T W}.
Let's start with the first one. For {A C}, the two possible rules are {A} -> {C} and {C} -> {A}. Their confidences are

support({A C}) / support({A}) = 4 / 4 = 100%,
support({A C}) / support({C}) = 4 / 6 = 66.7%,

so {A} -> {C} is one desired rule.
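With the support helper from the earlier sketches, these two confidences are one-liners:

print(support({"A", "C"}) / support({"A"}))   # 1.0      -> {A} -> {C} is kept
print(support({"A", "C"}) / support({"C"}))   # 0.666... -> {C} -> {A} is not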
Similarly, for {A C T}, the possible rules are {A C} -> {T}, {A T} -> {C}, {C T} -> {A}, {A} -> {C T}, {C} -> {A T} and {T} -> {A C}. Computing their confidences respectively,

support({A C T}) / support({A C}) = 3 / 4,
support({A C T}) / support({A T}) = 3 / 3 = 100%,
support({A C T}) / support({C T}) = 3 / 4,
support({A C T}) / support({A}) = 3 / 4,
support({A C T}) / support({C}) = 3 / 6,
support({A C T}) / support({T}) = 3 / 4,

we find that only {A T} -> {C} is a desired rule.
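The same enumeration can be sketched in general (reusing support, combinations and all_frequent from the earlier snippets; rules_with_min_confidence is our own illustrative name): every non-empty proper subset of a frequent itemset is tried as an antecedent, and the rule is kept when its confidence reaches the 100% threshold.

def rules_with_min_confidence(frequent_itemsets, min_conf=1.0):
    """Try each non-empty proper subset as an antecedent; keep confident rules."""
    rules = []
    for itemset, sup in frequent_itemsets.items():
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, r)):
                if sup / support(antecedent) >= min_conf:
                    rules.append((sorted(antecedent), sorted(itemset - antecedent)))
    return rules

# Reproduces the rules derived by hand, e.g. ['A'] -> ['C'] and ['A', 'T'] -> ['C'].
for lhs, rhs in rules_with_min_confidence(all_frequent):
    print(lhs, "->", rhs)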