CS 525D Spring 2008

Computer Science Department

CS 525D KNOWLEDGE DISCOVERY AND DATA MINING - Spring 2008
Project 5: Clustering

PROF. CAROLINA RUIZ

DUE DATE: Thursday March 27, 2008.

Slides: Submit by email by 2:30 pm.
Written report: Hand in a hardcopy by 3:30 pm.
Oral Presentation: during class that day.

Project Description

[1000 points: 100 points for each of the 4 clustering methods on each of the 2 datasets (800 points), and 25 points for meaningful interpretation of the resulting clusters for each of the 4 clustering methods on each dataset (200 points).] See Project Guidelines for the detailed distribution of these points.

Project Instructions: Thoroughly read and follow the Project Guidelines. These guidelines contain detailed information about how to structure your project, and how to prepare your written and oral reports.
Data Mining Technique(s): We will run experiments using the following clustering methods available in Weka:
- Partitioning methods: Simple K-Means
- Hierarchical methods: COBWEB
- Density-based methods: DBSCAN
- Probabilistic-based methods: EM
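
As a rough illustration of the partitioning idea behind Simple K-Means, the sketch below alternates between assigning each instance to its nearest centroid and recomputing the centroids. It is not Weka's implementation: the function names (`squared_distance`, `k_means`), the random initialization, and the toy points are all assumptions made for this example, and it handles numeric attributes only.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two numeric vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, max_iterations=100, seed=0):
    """Minimal k-means sketch (hypothetical helper, not Weka code).

    Returns (assignments, centroids, sse), where sse is the
    within-cluster sum of squared errors.
    """
    rng = random.Random(seed)
    # Assumption: initial centroids are k distinct instances chosen at random.
    centroids = [list(p) for p in rng.sample(points, k)]
    assignments = None
    for _ in range(max_iterations):
        # Assignment step: each point goes to its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # no point changed cluster, so the procedure has converged
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    sse = sum(squared_distance(p, centroids[a]) for p, a in zip(points, assignments))
    return assignments, centroids, sse


# Toy data (made up for illustration): two well-separated groups.
toy_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
toy_assignments, toy_centroids, toy_sse = k_means(toy_points, k=2)
```

Rerunning with different values of `k` and `seed` and comparing the resulting `sse` is one simple way to see how sensitive a partitioning method is to its parameters.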
Dataset(s): In this project, we will use two datasets:
- The World Happiness Dataset with Continents information added by Paul Sader.
  Since the SWL-ranking can be derived from SWL-index, remove SWL-ranking from consideration. Also, remove the attribute country as each of its values identifies an instance uniquely.
- A dataset that you choose depending on your own interests. It may be a dataset you are working with for your research or your job. It should contain enough instances (at least 200) and several attributes (at least 10). Ideally it should contain a good mix of numeric and nominal attributes.
  I include below some links to Data Repositories containing multiple datasets to choose from:
Performance Metric(s): A major part of this project (as reflected in the grade distribution above) is to find meaningful ways of evaluating and interpreting the resulting clusters. Devise a variety of approaches to do so, including but not limited to visualization of the resulting clusters, inspection of the clusters' members to find commonalities, etc. The more creative/ingenious your approaches, the better. You might want to extend the Weka code to provide the evaluation/interpretation functionality you need.
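
One possible way to inspect clusters for commonalities is to cross-tabulate the cluster assignments against a nominal attribute (for example, a continent attribute) and summarize the table with a purity score. The sketch below assumes the cluster labels have already been exported as a plain list; the function names and the sample values are hypothetical and are not taken from the World Happiness dataset or from Weka.

```python
from collections import Counter, defaultdict


def cross_tabulate(cluster_labels, attribute_values):
    """Count how often each attribute value occurs within each cluster."""
    table = defaultdict(Counter)
    for cluster, value in zip(cluster_labels, attribute_values):
        table[cluster][value] += 1
    return {cluster: dict(counts) for cluster, counts in table.items()}


def purity(cluster_labels, attribute_values):
    """Fraction of instances that carry the majority value of their cluster."""
    if not cluster_labels:
        raise ValueError("purity is undefined for an empty clustering")
    table = cross_tabulate(cluster_labels, attribute_values)
    majority_total = sum(max(counts.values()) for counts in table.values())
    return majority_total / len(cluster_labels)


# Made-up example: six instances, two clusters, one nominal attribute.
example_clusters = [0, 0, 0, 1, 1, 1]
example_values = ["Europe", "Europe", "Asia", "Africa", "Africa", "Africa"]
example_table = cross_tabulate(example_clusters, example_values)
example_purity = purity(example_clusters, example_values)
```

A purity close to 1 suggests that the clusters line up with the chosen attribute, while a low value suggests the clustering captures some other structure. Purity tends to rise as the number of clusters grows, so it is best compared across runs with the same number of clusters.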
General Comments: Focus on experimenting with different ways of preprocessing the data, varying the parameters of the clustering algorithms, and providing your own methods for evaluating and interpreting the results.
