### CS 525M KNOWLEDGE DISCOVERY AND DATA MINING   PROJECT 2 - Sequence Mining. Fall 2001

#### PROF. CAROLINA RUIZ

DUE DATE: This project is due on Thursday Nov. 1, 2001 at 12 noon .

#### PROJECT DESCRIPTION

Use sequence mining techniques to find patterns in sequential data.

#### PROJECT ASSIGNMENT

This project consists of two parts, each one dealing with a different sequence mining technique.

• Part I. Sequence Similarity For this part of the assignment, start by reading in detail the paper R. Agrawal, C. Faloutsos and A. Swami. "Efficient Similarity Search in Sequence Databases Foundations of Data Organization and Algorithms". (FODO) Conference, Oct. 1993, Evanston, Illinois, Oct. 13-15, 1993. PostScript Online. Then implement a system to mine similarities in sequences as described below.

• Code: Write YOUR OWN, INDIVIDUAL Java program that has the following functionalily. Taking code available online will be considered cheating.

Given a collection of time series CS, a collection of time series CT to be used as templates, a non-negative integer value r, and a non-negative real value epsilon,

1. Given a time series S (in the time domain), it transforms the time series into a sequence FS (in the frequency domain) using the Discrete Fourier Transformation and truncates this resulting sequence by keeping just the first r coefficients.
2. Given two sequences, it computes the Euclidean distance between the sequences.
3. For each template T in CT, it finds all the sequences in CS that are within an epsilon distance from T. This should be done following this procedure:
• All time series in CS and all the templates in CT are transformed and truncated as described above.
• For each transformed template FT:
• For each transformed time series FS:
• Compute the Euclidean distance between FT and FS.
• If this distance is less than or equal to epsilon
• Compute the Euclidean distance between T (the template before the transformation) and S (the time series before the transformation).
• If this distance is less than or equal to epsilon then output the fact that T and S are "similar".

4. For each time series in CS, it finds all the other time series in CS that are similar to it (that is that are within an epsilon distance form it) using the procedure sketched above.

You DO NOT need to use any sophisticated data structures to store sequences before or after they are transformed. Simple data structures are acceptable for the scope of this project.

• Data:
• You can use the dataset that you selected for your other course project, if that dataset consists of time series.
• . Otherwise, you can use Nasdaq (financial) data available at http://www.marketdata.nasdaq.com/mr_section4.html Select at least 30 sequences and 5 templates, each at least 300 positions long.

• Experiments: Run an initial experiment with say r= 3 and epsilon= 10. After recording the results of this experiment (see below), run new experiments varying the values r and epsilon as you find appropriate to obtain better results. Consider eliminating the 1st Fourier coefficient (average of the sequence) to scale up the sequences and run experiments again.

• Results: For each experiment you run, your system should output a list of similar sequences in CS for each of the templates in CT; and a clustering of the sequences in CS according to their similarity. Analyze in detail the time savings of using the procedure outlined above instead of computing the Euclidean distance directly from the inital sequences (without tranforming them). Describe any interesting similarities found by your system.

• Part II. Frequent Subsequences This part of the project is based on the paper "Mining Sequential Patterns: Generalizations and Performance Improvements" (by Srikant, R. and Agrawal, R.). Read the paper in detail first. Then follow the procedure by hand in full detail to compute the frequent subsequences from the following set of sequences. Use minimal support 50%. Show your work.  Sequence ID TIME ITEMS s1 10 C D s1 15 A B C s1 20 A B F s1 25 A B D F s2 15 A B F s2 20 E s3 10 A B F s4 10 D G H s4 20 B F s4 25 A G H

#### REPORT AND DUE DATE

• Written Report. Your written report is due at 12 noon. Please leave a hardcopy of your report in my mailbox (CS Office, FL231) by the due date/time. Please note that my mailbox is the one BELOW the label marked with my last name RUIZ. In the EXCEPTIONAL case that you cannot go to the CS office, email your report to me by noon. Your report should contain the following sections with the corresponding discussions:

1. Code Description: Describe the code that you wrote. Remember to acknowledge any sources of information that you used. Include a copy of your DOCUMENTED code in your report.

2. Data: Describe the dataset that you selected in terms of the attributes present in the data, the number of instances, missing values, and other relevant characteristics.

3. Experiments:
• Analysis of results of your experiments and their significance as described in the "Results" section above.

4. Summary of Results
• What was the best collection of results that you obtained? Describe.
• Discuss the strengths and the weaknesses of your project.

5. Part II of the Project: Include your detailed solutions to part II of the project.

• Oral Report. We will discuss the results from the individual projects during the class on November 1st. Be ready to show your results and to discuss your project in class. PREPARE OVERHEAD TRANSPARENCIES SHOWING YOUR WORK.