cs525 - Q5

Mining Sequential Patterns

Consider the papers "Mining Sequential Patterns: Generalizations and Performance Improvements" (by Srikant, R. and Agarwal, R.) and "Discovery of frequent episodes in sequences" (by Manilla, H., Toivonen, H., and Verkamo, A.I.).
1. State what the problem(s) that these papers address is.
2. Describe the approach followed by each of the papers to tackle the problem.
3. Discuss similarities and differences between the approaches presented in the two papers.
4. Describe potential applications of the proposed approaches.

Solution by Akshay Ganga Nitesh Tom Gangadharan, Kannan Kothare, Akshay Mehrotra, Nitesh Tom Zhu
These papers address the problem of discovering sequential patterns.
Srikant et al. provide an algorithm to find sequential patterns across sequences of transactions after generalizing it to incorporate time constraints, sliding time windows and taxonomies. Given any database, a taxonomy, user specified min-gap and max-gap time constraints and a user specified sliding window size, they provide a solution to find all sequences whose support is greater than the user specified minimum support.
Manilla et al. provide a solution to a slightly different flavor of the same problem. They define a sequence S to have a starting time, finish time and an ordered sequence of events where time is associated with each event. They provide a solution to find all frequent episodes in this sequence, given a window width and a frequency threshold (min-support).
Srikant et al.
They propose an algorithm called Generalized Sequential Patterns (GSP). Given a database of data sequences GSP makes multiple passes over the data. Each pass starts with a seed set. which is the frequent sequences found in the previous pass. Using the seed set candidate sequences are generated. The candidate sequences (potentially frequent sequences) have one more item than the seed sequence. Then the support for each of these candidate sequences is calculated during the pass to determine which are actually frequent sets. These become the seed for the next pass.
The two main parts of the algorithm are :-
Candidate generation:- This is done in 2 steps.
     * Join Phase -> Candidate sequences are obtained by joining two seed set elements if the subsequence obtained by removing the first element of one is the same as the subsequence obtained by removing the last element of the other.
     * Prune Phase -> All candidates which have any subsequence whose support is less than minimum support are discarded.
Counting Candidates:- Here they determine the support for each of the candidate sequences. They use a hash-tree data structure to reduce the number of candidates that are checked for a sequence.
Manila et al.
Even their algorithm works in a similar manner. Their algorithm has two main phases (building phase and recognition phase) and iterates between these. In the building phase a collection of candidate episodes (potentially frequent episodes ) is built. Then in the recognition phase these episodes are recognized in the event sequence and their support or frequency is computed. The frequent episodes are then used to build the candidate episodes in the next iteration.
Building Phase:- They suggest using the same method as in Srikant et. al to build candidate sequences from frequent sequences obtained in the previous pass.
Recognition Phase:- They use the fact that every episode can be expressed as a composite episode containing only serial and parallel episodes. Thus they recursively decompose the episode to parallel and serial composite episodes and find the support for each of these.
     * Support for parallel episodes -> For each parallel episode they maintain a count and increment it every time the total number of events in the episode in the current window reduces from max count.
     * Support for serial episode -> They recognize and count serial episodes by building a state automaton for each such episode and incrementing the count every time we reach the last state in the automaton.
Similarities
Broadly speaking both approaches to find frequent patterns consist of two phases. One where they build candidate elements and the other where they find the support count for each of the elements. They have multiple passes through the data where they iterate through these two phases.
Dissimilarities
Srikant et al. use a hash tree structure to find the support count of each of the candidate sequences whereas Manila et al. use the idea that every elementary episode can be expressed as a composite of serial and parallel episodes. It then recursively finds the support for these composite episodes.
These approaches are used to try and find frequent sequences in ordered transactions (sequential data)
* Observing patterns of fault occurrence might help us analyze the cause and predict failures.
* The retailing industry can benefit by such patterns as it helps them in deciding on attached mailing, add-on-sales etc.
* Finding such sequences can be beneficial in other domains too like the medical, scientific and business domains.