### CS 525D KNOWLEDGE DISCOVERY AND DATA MINING   Project 2: Sequence Mining - Spring 2004

#### PROF. CAROLINA RUIZ

Due Date: March 23 at 3:00 pm

#### PROJECT DESCRIPTION

Use sequence mining techniques to find patterns in sequential data.

#### PROJECT ASSIGNMENT

This project consists of two parts, each one dealing with a different sequence mining technique.

• Part I. Sequence Similarity For this part of the assignment, start by reading in detail the paper R. Agrawal, C. Faloutsos and A. Swami. "Efficient Similarity Search in Sequence Databases Foundations of Data Organization and Algorithms". (FODO) Conference, Oct. 1993, Evanston, Illinois, Oct. 13-15, 1993. PostScript Online. Then implement a system to mine similarities in sequences as described below.

• Code: Write YOUR OWN, INDIVIDUAL program that has the following functionalily. Taking code available online will be considered cheating. You can write your code in any high level programming language (C, C++, Java, ...).

Given a collection of time series CS, a collection of time series CT to be used as templates, a non-negative integer value r, and a non-negative real value epsilon,

1. Given a time series S (in the time domain), it transforms the time series into a sequence FS (in the frequency domain) using the Discrete Fourier Transformation and truncates this resulting sequence by keeping just the first r coefficients.
2. Given two sequences, it computes the Euclidean distance between the sequences.
3. For each template T in CT, it finds all the sequences in CS that are within an epsilon distance from T. This should be done following this procedure:
• All time series in CS and all the templates in CT are transformed and truncated as described above.
• For each transformed template FT:
• For each transformed time series FS:
• Compute the Euclidean distance between FT and FS.
• If this distance is less than or equal to epsilon
• Compute the Euclidean distance between T (the template before the transformation) and S (the time series before the transformation).
• If this distance is less than or equal to epsilon then output the fact that T and S are "similar".

4. For each time series in CS, it finds all the other time series in CS that are similar to it (that is that are within an epsilon distance form it) using the procedure sketched above. This is what the paper calls "All-pairs query (or spatial join): Given N sequences, find the pairs of sequences that are within epsilon of each other").

You DO NOT need to use any sophisticated data structures to store sequences before or after they are transformed. Simple data structures are acceptable for the scope of this project.

• Data:
• You can use the dataset that you selected for your other course project, if that dataset consists of time series.
• Otherwise, you can use Nasdaq (financial) data available at http://www.marketdata.nasdaq.com/mr_section4.html Select at least 28 sequences and 5 templates, each at least 250 positions long.

• Experiments: Run an initial experiment with say r= 3 and epsilon= 10. After recording the results of this experiment (see below), run new experiments varying the values r and epsilon as you find appropriate to obtain better results. Consider eliminating the 1st Fourier coefficient (average of the sequence) to scale up the sequences and run experiments again.

• Results: For each experiment you run, your system should output a list of similar sequences in CS for each of the templates in CT; and a clustering of the sequences in CS according to their similarity (what I mean by a clustering is what the paper calls "All-pairs query (or spatial join): Given N sequences, find the pairs of sequences that are within epsilon of each other"). Analyze in detail the time savings of using the procedure outlined above instead of computing the Euclidean distance directly from the inital sequences (without tranforming them). Describe any interesting similarities found by your system.

• Part II. Frequent Subsequences This part of the project is based on the paper "Mining Sequential Patterns: Generalizations and Performance Improvements" (by Srikant, R. and Agrawal, R.). AVAILABLE ONLINE. Read the paper in detail first. Then follow the procedure by hand in full detail to compute the frequent subsequences from the following set of sequences. Use minimal support 50%. Show your work.  Sequence ID TIME ITEMS s1 10 C D s1 15 A B C s1 20 A B F s1 25 A B D F s2 15 A B F s2 20 E s3 10 A B F s4 10 D G H s4 20 B F s4 25 A G H

#### REPORT AND DUE DATE

• Written Report. Your written report is due at 3:00 pm. Please hand it in at the beginning of class. Your report should contain the following sections with the corresponding discussions:

1. Code Description: Describe the code that you wrote. Remember to acknowledge any sources of information that you used. Include a copy of your DOCUMENTED code in your report.

2. Data: Describe the dataset that you selected in terms of the attributes present in the data, the number of instances, missing values, and other relevant characteristics.

3. Experiments:
• Analysis of results of your experiments and their significance as described in the "Results" section above.

4. Summary of Results
• What was the best collection of results that you obtained? Describe.
• Discuss the strengths and the weaknesses of your project.

5. Part II of the Project: Include your detailed solutions to part II of the project.

• Oral Report. We will discuss the results from the individual projects during the class on March 23. Be ready to show your results and to discuss your project in class. PREPARE SLIDES SHOWING YOUR WORK.