DUE DATE: Thursday March 27, 2008.
- Slides: Submit by email by 2:30 pm.
- Written report: Hand in a hardcopy by 3:30 pm.
- Oral Presentation: during class that day.
[1000 points: 100 points for each of the 4 clustering methods per dataset,
25 points for meaningful interpretation of the resulting clusters for
each of the 4 clustering methods per dataset.]
for the detailed distribution of these points.
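As a quick sanity check, the bracketed breakdown above does add up to the stated total. The sketch below only restates the numbers given in this handout; the variable names are illustrative.

```python
# Point breakdown as stated in the bracketed note above.
METHODS = 4                      # K-Means, COBWEB, DBSCAN, EM
DATASETS = 2                     # World Happiness plus one dataset of your choice
POINTS_PER_METHOD = 100          # experiments, per method per dataset
POINTS_PER_INTERPRETATION = 25   # interpretation, per method per dataset

experiment_points = METHODS * DATASETS * POINTS_PER_METHOD
interpretation_points = METHODS * DATASETS * POINTS_PER_INTERPRETATION
total_points = experiment_points + interpretation_points

print(experiment_points, interpretation_points, total_points)
```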
- Project Instructions:
Thoroughly read and follow the project guidelines.
These guidelines contain detailed information about how to structure your
project, and how to prepare your written and oral reports.
- Data Mining Technique(s):
We will run experiments using the following clustering methods available
in Weka:
- Partitioning methods: Simple K-Means
- Hierarchical methods: COBWEB
- Density-based methods: DBSCAN
- Probabilistic-based methods: EM
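To make the partitioning idea concrete, here is a minimal pure-Python sketch of the k-means loop (assign each instance to its nearest centroid, then recompute centroids). It is not Weka's Simple K-Means implementation, only an illustration; the function name `kmeans`, its parameters, and the toy points are assumptions made for this example.

```python
import math
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Cluster numeric points (tuples of floats) into k groups.

    Illustrative sketch only; Weka's implementation differs in details
    such as initialization and handling of nominal attributes.
    """
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # k instances as initial centroids
    assignment = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no point changed cluster, so we have converged
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(
                    sum(col) / len(members) for col in zip(*members)
                )
    return assignment, centroids


# Hypothetical toy data: two visibly separated groups.
toy_points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
              (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
labels, centers = kmeans(toy_points, k=2)
print(labels)
```

Rerunning with different values of `k` and `seed` is a cheap way to see how sensitive the partition is to those choices before running the full experiments in Weka.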
In this project, we will use two datasets:
- The World Happiness Dataset
with Continents information added by Paul Sader.
Since the SWL-ranking can be derived from SWL-index, remove SWL-ranking from
consideration. Also, remove the attribute country as each of its values
identifies an instance uniquely.
- A dataset that you choose depending on your own interests.
It may be a dataset you are working with for your research or your job.
It should contain enough instances (at least 200 instances) and
several attributes (at least 10). Ideally it should contain a good mix of
numeric and nominal attributes.
I include below some links to Data Repositories containing
multiple datasets to choose from:
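The dataset preparation described above (dropping SWL-ranking and country, and checking that your own dataset has at least 200 instances and 10 attributes) can be sketched outside Weka as follows. This is only an illustration under stated assumptions: the CSV layout, the `continent` column name, and all row values are hypothetical placeholders, and the helper names are invented for this example.

```python
import csv
import io

# Hypothetical toy rows; the real file's columns and values will differ.
TOY_CSV = """country,continent,SWL-index,SWL-ranking
A,Europe,270,1
B,Asia,250,2
C,Africa,100,3
"""


def load_without(csv_text, drop):
    """Parse CSV text and drop the named attributes from every instance."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [{k: v for k, v in row.items() if k not in drop} for row in reader]


def is_numeric(value):
    """True if the string parses as a number."""
    try:
        float(value)
        return True
    except ValueError:
        return False


def summarize(rows):
    """Count instances and classify each attribute as numeric or nominal."""
    attributes = list(rows[0].keys())
    numeric = [a for a in attributes if all(is_numeric(r[a]) for r in rows)]
    nominal = [a for a in attributes if a not in numeric]
    return {"instances": len(rows), "numeric": numeric, "nominal": nominal}


def meets_size_requirements(summary, min_instances=200, min_attributes=10):
    """Check the minimum size asked for the dataset of your choice."""
    n_attributes = len(summary["numeric"]) + len(summary["nominal"])
    return (summary["instances"] >= min_instances
            and n_attributes >= min_attributes)


rows = load_without(TOY_CSV, drop={"country", "SWL-ranking"})
summary = summarize(rows)
print(summary)
print(meets_size_requirements(summary))
```

The toy table is far too small to pass the size check, which is the point of the last line: run the same check on a candidate dataset before committing to it.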
- Performance Metric(s):
A major part of this project (as reflected in the grade distribution) is
to find meaningful ways of evaluating and
interpreting the resulting clusters.
Devise a variety of approaches to do so, including but not limited
to visualization of the resulting clusters, inspection of the
clusters' members to find commonalities, etc.
The more creative/ingenious your approaches, the better.
You might want to extend the Weka code to provide the
evaluation/interpretation functionality you need.
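Two simple starting points for evaluation and interpretation are sketched below: a within-cluster sum of squared errors (lower means tighter clusters) and a per-cluster profile of the most common value of a nominal attribute, which helps when inspecting members for commonalities. These are only examples of what such helpers might look like, not required metrics; the function names and the toy values are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def within_cluster_sse(points, labels):
    """Sum of squared distances from each point to its cluster mean."""
    groups = defaultdict(list)
    for p, label in zip(points, labels):
        groups[label].append(p)
    sse = 0.0
    for members in groups.values():
        mean = [sum(col) / len(members) for col in zip(*members)]
        sse += sum((x - m) ** 2 for p in members for x, m in zip(p, mean))
    return sse


def dominant_value(nominal_values, labels):
    """For each cluster, the most common nominal value and its share."""
    groups = defaultdict(Counter)
    for value, label in zip(nominal_values, labels):
        groups[label][value] += 1
    profile = {}
    for label, counts in groups.items():
        value, count = counts.most_common(1)[0]
        profile[label] = (value, count / sum(counts.values()))
    return profile


# Hypothetical toy clustering of five instances into two clusters.
toy_points = [(0.0,), (2.0,), (10.0,), (12.0,), (14.0,)]
toy_labels = [0, 0, 1, 1, 1]
toy_continents = ["Europe", "Europe", "Asia", "Asia", "Africa"]

sse = within_cluster_sse(toy_points, toy_labels)
profile = dominant_value(toy_continents, toy_labels)
print(sse)
print(profile)
```

A cluster whose dominant value has a share near 1.0 is easy to describe in words; a share close to the attribute's overall frequency suggests the clustering is not separating instances along that attribute.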
- General Comments
Focus on experimenting with different ways of preprocessing
the data, varying the parameters of the clustering algorithms, and
providing your own methods for evaluating and interpreting the results.
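One way to keep such experiments organized is to enumerate every combination of preprocessing choice and algorithm parameters up front, then run and record them one by one. The sketch below does only the enumeration. All option names and values (`raw`, `normalized`, `k`, `epsilon`, `min_points`, and so on) are hypothetical and are not Weka option names.

```python
from itertools import product

# Hypothetical option names and values, chosen only for illustration.
PREPROCESSING = ["raw", "normalized", "discretized"]
ALGORITHM_PARAMS = {
    "kmeans": {"k": [2, 3, 5]},
    "dbscan": {"epsilon": [0.5, 0.9], "min_points": [4, 6]},
}


def experiment_grid(preprocessing, algorithm_params):
    """Yield one configuration dict per combination of settings."""
    for algorithm, params in algorithm_params.items():
        names = list(params)
        value_combos = product(*params.values())
        for prep, values in product(preprocessing, value_combos):
            config = {"algorithm": algorithm, "preprocessing": prep}
            config.update(zip(names, values))
            yield config


grid = list(experiment_grid(PREPROCESSING, ALGORITHM_PARAMS))
print(len(grid))
for config in grid[:3]:
    print(config)
```

Writing the grid down first makes it easier to report which settings were tried and to compare results across them in the written report.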