Consider the papers "Mining Sequential Patterns: Generalizations
and Performance Improvements" (by Srikant, R. and Agarwal, R.) and "Discovery
of frequent episodes in sequences" (by Manilla, H., Toivonen, H., and Verkamo,
A.I.).
1. State what the problem(s) that these papers address is.
2. Describe the approach followed by each of the papers to
tackle the problem.
3. Discuss similarities and differences between the approaches presented
in the two papers.
4. Describe potential applications of the proposed approaches.
These papers address the problem of discovering sequential patterns.
Srikant et al. provide an algorithm to find sequential patterns across
sequences of transactions after generalizing it to incorporate time constraints,
sliding time windows and taxonomies. Given any database, a taxonomy, user
specified min-gap and max-gap time constraints and a user specified sliding
window size, they provide a solution to find all sequences whose support
is greater than the user specified minimum support.
Manilla et al. provide a solution to a slightly different flavor of
the same problem. They define a sequence S to have a starting time, finish
time and an ordered sequence of events where time is associated with each
event. They provide a solution to find all frequent episodes in this sequence,
given a window width and a frequency threshold (min-support).
Srikant et al.
They propose an algorithm called Generalized Sequential Patterns (GSP).
Given a database of data sequences GSP makes multiple passes over the data.
Each pass starts with a seed set. which is the frequent sequences found
in the previous pass. Using the seed set candidate sequences are generated.
The candidate sequences (potentially frequent sequences) have one more
item than the seed sequence. Then the support for each of these candidate
sequences is calculated during the pass to determine which are actually
frequent sets. These become the seed for the next pass.
The two main parts of the algorithm are :-
Candidate generation:- This is done in 2 steps.
* Join Phase -> Candidate sequences
are obtained by joining two seed set elements if the subsequence obtained
by removing the first element of one is the same as the subsequence obtained
by removing the last element of the other.
* Prune Phase -> All candidates which
have any subsequence whose support is less than minimum support are discarded.
Counting Candidates:- Here they determine the support for each
of the candidate sequences. They use a hash-tree data structure to reduce
the number of candidates that are checked for a sequence.
Manila et al.
Even their algorithm works in a similar manner. Their algorithm has
two main phases (building phase and recognition phase) and iterates between
these. In the building phase a collection of candidate episodes (potentially
frequent episodes ) is built. Then in the recognition phase these
episodes are recognized in the event sequence and their support or frequency
is computed. The frequent episodes are then used to build the candidate
episodes in the next iteration.
Building Phase:- They suggest using the same method as in Srikant
et. al to build candidate sequences from frequent sequences obtained in
the previous pass.
Recognition Phase:- They use the fact that every episode can
be expressed as a composite episode containing only serial and parallel
episodes. Thus they recursively decompose the episode to parallel and serial
composite episodes and find the support for each of these.
* Support for parallel episodes ->
For each parallel episode they maintain a count and increment it every
time the total number of events in the episode in the current window reduces
from max count.
* Support for serial episode -> They
recognize and count serial episodes by building a state automaton for each
such episode and incrementing the count every time we reach the last state
in the automaton.
Similarities
Broadly speaking both approaches to find frequent patterns consist
of two phases. One where they build candidate elements and the other where
they find the support count for each of the elements. They have multiple
passes through the data where they iterate through these two phases.
Dissimilarities
Srikant et al. use a hash tree structure to find the support count
of each of the candidate sequences whereas Manila et al. use the idea that
every elementary episode can be expressed as a composite of serial and
parallel episodes. It then recursively finds the support for these composite
episodes.
These approaches are used to try and find frequent sequences in ordered
transactions (sequential data)
* Observing patterns of fault occurrence might help us analyze the
cause and predict failures.
* The retailing industry can benefit by such patterns as it helps them
in deciding on attached mailing, add-on-sales etc.
* Finding such sequences can be beneficial in other domains too like
the medical, scientific and business domains.