*** The written report for your group project should be at
most 10 pages long (including all graphs, tables, figures,
appendices, ...) and the font size should be no smaller than
11 pts. ***
Data Mining Technique(s):
Run experiments using any (combination) of the following techniques:
HierarchicalClusterer: Make sure to experiment with different "linkType"s
Anomaly Detection
LOF filter (see WPI level below)
Any data mining technique (including clustering) covered in the course
that can be used to detect anomalies in the data.
Advanced Techniques:
You should consider using advanced techniques to improve the accuracy
of your predictions. For instance, try
ensemble methods (see Section 5.6 of your textbook),
ways to deal with imbalanced classification targets
(see Section 5.7 of your textbook),
cost-sensitive classification, etc.
But, in terms of data mining techniques, this project is restricted to
the techniques listed above.
Any other creative ideas you have to boost model performance and/or
to combine different models into a more powerful one.
Dataset
We will work with the
Amazon Commerce reviews set Data Set
available at the
UCI Machine Learning Repository.
Important:
Please use the pre-processing described in the easy
challenge for all subsequent challenges as well.
Additional pre-processing may be performed as needed.
Challenges:
In each of the following challenges provide a detailed
description of the preprocessing techniques used, the motivation for
using these techniques, and any hypothesis/intuition gained about
the information represented in the dataset. Answer the questions
provided, and also provide the information described in the
PROJECT GUIDELINES.
Easy Level:
This is to be a simple exercise to practice your
preprocessing techniques and allow you to become familiar with the
dataset.
The dataset as found on the website contains problems that prevent
it from being used in Weka without "cleaning" the data. DO NOT
remove the classification attribute from the dataset as it will
be needed later.
Attempt to open the ARFF file in Weka. Examine the problems in
the data that prevent the file from being used as-is. Repair this
file so that it may be used by Weka. Describe what you did to the
dataset so this was accomplished.
Answer the following questions in your description about this
exercise:
Describe the data. How many attributes does this dataset
contain? How many instances? What do the instances represent?
What do you expect to find when you begin exploring this
dataset?
Since this dataset contains far more attributes than instances,
would it be reasonable to somehow transform the attributes into
instances (and the instances into attributes)? Briefly describe
how this could be done.
Briefly describe a domain application where this (#2 above) might be
useful.
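To make the transposition idea concrete, here is a minimal illustrative sketch in plain Python (not Weka) of swapping attributes and instances in a small hypothetical data matrix; with a real ARFF file you would also need to set the class attribute aside first, since it has no meaning in the transposed orientation.

```python
# Rows: one list per instance; columns: one value per attribute.
# The values below are hypothetical, chosen only to show the mechanics.
rows = [
    [1.0, 0.0, 2.0],  # instance 1 (values of attributes a, b, c)
    [0.5, 1.5, 0.0],  # instance 2
]

# zip(*rows) groups the i-th value of every row, turning each former
# attribute into a new row (i.e. a new instance).
transposed = [list(col) for col in zip(*rows)]
print(transposed)  # [[1.0, 0.5], [0.0, 1.5], [2.0, 0.0]]
```

Applied to this dataset, the same operation would turn the ~10,000 attribute columns into rows, giving far more instances than attributes.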
Moderate Level:
This is a bit more of a challenge (be sure to
leave yourself time for challenges 3 and 4).
Use the SimpleKMeans clustering tool in Weka to generate a
clustering of this dataset. Be sure to IGNORE the classification
attribute when performing the clustering, but DO NOT remove it
from the dataset. Use K values no greater than 50 for this
clustering. Experiment with both Euclidean distance and
Manhattan distance.
Examine the model. Describe the performance of the clustering.
Answer the following questions in your description about this
experiment:
Which distance metric and which value for K worked best
for this experiment? Why do you think this was so?
Compare and contrast the assigned cluster of the instances
with their classification values. Provide a visualization
and a description. Were the results what you expected?
Are there any limitations of the dataset that make this a
more challenging experiment (other than the obvious
limitations caused by the ARFF file problems in the "easy"
challenge)? Explain.
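For intuition about what SimpleKMeans does under each distance metric, here is a toy, from-scratch k-means sketch in plain Python (the actual experiment must of course use Weka's SimpleKMeans tool); the example points, seed, and iteration count are illustrative assumptions, not part of the assignment data.

```python
import math
import random

def dist(a, b, metric="euclidean"):
    """Euclidean or Manhattan distance between two numeric vectors."""
    if metric == "manhattan":
        return sum(abs(x - y) for x, y in zip(a, b))
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(points, k, metric="euclidean", iters=20, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initialize centers at random points
    for _ in range(iters):
        # assignment step: each point joins its nearest center's cluster
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: dist(p, centers[c], metric))
            clusters[nearest].append(p)
        # update step: each center moves to the mean of its cluster
        centers = [
            [sum(vals) / len(c) for vals in zip(*c)] if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters

# Two well-separated toy blobs; k-means with k=2 should recover them.
points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
centers, clusters = kmeans(points, k=2, metric="manhattan")
```

Note that on this dataset (hundreds of sparse word-frequency attributes) the two metrics can rank neighbors quite differently, which is part of what the experiment above asks you to observe.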
WPI Level:
This and the WPI+ level are the big challenges on which you should spend the most time.
Install the Local Outlier Factor (LOF) filter add-on package.
Open the package manager in Weka and locate
"localOutlierFactor" and install it. Use the LOF filter on your dataset. Identify the top 25 instances
that are identified as outliers. Provide a list of the class
values for these instances. Perform SimpleKMeans clustering on
the data both with and without these instances included in
the dataset. In the dataset with the instances included,
change the classification value of the 25 outliers to a new class value "Outlier"
(so that these instances can easily be identified in the clusters that you
will construct).
IGNORE the class attribute
and the LOF attribute when performing the clustering. Describe
the performance of these clusterings.
How does the performance of these two clusterings compare to
each other?
Attempt to use visualization to find these outliers. Were
you successful?
a. If not: Is there some way you can characterize these
outliers? Describe an idea for a method.
b. If so: What useful information can you glean from
the outliers?
What challenge(s) did you encounter while performing this
experiment? Give a more detailed explanation of how you
used preprocessing, visualization, postprocessing or some
other technique to overcome a specific challenge.
Use any other anomaly detection approach you wish to identify 25
outliers in this dataset. Include these outliers in your report, and explain
in detail how you found them and why they are outliers.
Compare this list of 25 outliers to the one you found above using the LOF filter.
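As a rough guide to what the LOF filter computes, the following is a brute-force, from-scratch sketch of the Local Outlier Factor score (k-distance, reachability distance, local reachability density, then the LOF ratio) in plain Python; it is illustrative only, the sample points are hypothetical, and the actual experiment should use Weka's localOutlierFactor package.

```python
import math

def lof_scores(points, k=2):
    """Brute-force LOF: scores near 1 mean inlier, larger means outlier."""
    n = len(points)
    d = [[math.dist(p, q) for q in points] for p in points]

    def knn(i):
        # indices of the k nearest neighbors of point i (excluding i)
        return sorted((j for j in range(n) if j != i), key=lambda j: d[i][j])[:k]

    def k_distance(i):
        # distance from point i to its k-th nearest neighbor
        return d[i][knn(i)[-1]]

    def reach_dist(i, j):
        # reachability distance of i from j
        return max(k_distance(j), d[i][j])

    def lrd(i):
        # local reachability density: inverse mean reachability distance
        nbrs = knn(i)
        return len(nbrs) / sum(reach_dist(i, j) for j in nbrs)

    # LOF: ratio of the neighbors' densities to the point's own density
    return [sum(lrd(j) for j in knn(i)) / (k * lrd(i)) for i in range(n)]

# Four clustered points plus one isolated point, which should score highest.
pts = [[0, 0], [0, 1], [1, 0], [1, 1], [5, 5]]
scores = lof_scores(pts, k=2)
```

In the real task you would sort the instances by their LOF attribute value and take the top 25, rather than computing scores by hand.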
WPI+ Level:
For this experiment you may use either the Pennsylvania School
Dataset from the previous projects or the Amazon dataset that
you have been using for this project.
Design another experiment that performs clustering using
hierarchical clustering.
Provide detailed descriptions
about the parameters used to develop your model and/or
preprocessing techniques used. One should be able to repeat the
experiment from your description. Provide a clear description of
the clustering, using visualization as needed to aid your
description. Make sure to experiment with different
"linkType"s.
What was your motivation for choosing this goal? Is it
useful?
What challenge(s) did you encounter while developing this
model? Give a more detailed explanation of how you used
preprocessing, visualization, postprocessing or some
other technique to overcome a specific challenge.
Describe any anomalies that appeared in your model. What
might these anomalies mean about the data?
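To illustrate how the "linkType" parameter changes the merging rule, here is a toy agglomerative clustering sketch in plain Python with single and complete linkage (Weka's HierarchicalClusterer offers these and other link types, such as average); the example points are hypothetical, and the real experiment should be run in Weka as described above.

```python
import math

def linkage_dist(c1, c2, d, link="single"):
    """Distance between two clusters of point indices under the given linkType:
    single = closest pair of points, complete = farthest pair."""
    pair = [d[i][j] for i in c1 for j in c2]
    return min(pair) if link == "single" else max(pair)

def agglomerative(points, num_clusters, link="single"):
    d = [[math.dist(p, q) for q in points] for p in points]
    clusters = [[i] for i in range(len(points))]  # start with singletons
    while len(clusters) > num_clusters:
        # merge the closest pair of clusters under the chosen linkage
        a, b = min(
            ((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
            key=lambda ij: linkage_dist(clusters[ij[0]], clusters[ij[1]], d, link),
        )
        clusters[a] += clusters[b]
        del clusters[b]
    return clusters

# Two toy blobs; both linkages should separate them at num_clusters=2.
pts = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11]]
groups = agglomerative(pts, 2, link="single")
```

On less cleanly separated data the two linkages diverge: single linkage tends to chain elongated clusters together, while complete linkage favors compact clusters, which is why experimenting with different "linkType"s is worthwhile.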