DUE DATE: Tuesday, April 15, 2008.
- Slides: Submit by email by 2:30 pm.
- Written report: Hand in a hardcopy by 3:30 pm.
- Oral Presentation: during class that day.
[200 points: 100 points for the anomaly detection part and
100 points for the additional advanced data mining technique of your
choice. Additional points will be given to particularly
creative and/or high quality work, and/or for independently
researching related techniques not covered in class.]
Use performance metrics appropriate to the mining application that you
chose. If you are not aware of any,
propose a variety of approaches to measure how good the results
of your experiments are.
Consider using visualization of the constructed model or patterns
to evaluate your results.
The more creative/ingenious your approaches, the better.
You might want to extend the Weka code to provide the
evaluation/interpretation functionality you need.
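
As an illustration of why the choice of metric matters for the anomaly detection part, here is a minimal sketch in Python (not Weka code) that compares plain accuracy with precision, recall, and F1 computed on the anomaly class only. It assumes that ground-truth labels are available and that anomalies are coded as 1 and normal instances as 0. The function name anomaly_metrics and the toy labels are hypothetical and serve only as an example.

```python
def anomaly_metrics(true_labels, predicted_labels, anomaly=1):
    """Return accuracy plus precision, recall and F1 for the anomaly class.

    Assumption: both lists use the same coding, with `anomaly` marking
    the rare class (1 by default) and any other value meaning "normal".
    """
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        raise ValueError("label lists must not be empty")

    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == anomaly and p == anomaly)
    fp = sum(1 for t, p in pairs if t != anomaly and p == anomaly)
    fn = sum(1 for t, p in pairs if t == anomaly and p != anomaly)
    correct = sum(1 for t, p in pairs if t == p)

    # Guard against division by zero when a class is never predicted or absent.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)

    return {
        "accuracy": correct / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Toy, made-up labels: 2 true anomalies among 10 instances.
true_labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
predicted_labels = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]
metrics = anomaly_metrics(true_labels, predicted_labels)
print(metrics)
```

On these toy labels the accuracy is 0.8 while precision, recall, and F1 on the anomaly class are all 0.5, which shows how accuracy alone can overstate the quality of a detector when anomalies are rare.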
Focus on experimenting with different ways of preprocessing
the data, adapting different techniques studied in this course
to tackle the problem at hand, and investigating other existing
approaches on your own.
The more creative/ingenious your work and/or the more research
into the related literature you do, the better.
- Project Instructions:
Thoroughly read and follow the project guidelines.
These guidelines contain detailed information about how to structure your
project, and how to prepare your written and oral reports.
- Data Mining Technique(s):
We will run experiments in each of the following two areas:
- Anomaly Detection, and
- One advanced data mining application of your choice:
Web mining, text mining, sequence mining, or multimedia data mining.
In this project, we will use two datasets (they can be the same if
You can choose both datasets depending on your own interests.
They may be datasets that you are working with for your research or your job.
They should contain enough instances (at least 200 instances) and
several attributes (at least 10). Ideally they should contain a good mix of
attribute types (if appropriate).
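
The size requirements above can be checked quickly before committing to a dataset. The following is a minimal sketch in Python, assuming the dataset is a CSV file with a header row and that every column counts as an attribute. The function name check_dataset and the in-memory sample data are hypothetical; the sample is synthetic and only demonstrates the call.

```python
import csv
import io

MIN_INSTANCES = 200   # minimum number of instances stated in the assignment
MIN_ATTRIBUTES = 10   # minimum number of attributes stated in the assignment


def check_dataset(csv_file):
    """Count instances and attributes in an open CSV file with a header row.

    Assumption: the first row holds attribute names and each following
    non-empty row is one instance.
    """
    reader = csv.reader(csv_file)
    header = next(reader)
    n_instances = sum(1 for row in reader if row)
    n_attributes = len(header)
    return {
        "instances": n_instances,
        "attributes": n_attributes,
        "meets_minimums": (n_instances >= MIN_INSTANCES
                           and n_attributes >= MIN_ATTRIBUTES),
    }


# Synthetic in-memory data standing in for a real file: 10 columns, 200 rows.
header_line = ",".join(f"attr{i}" for i in range(10))
data_lines = [",".join(str(r + c) for c in range(10)) for r in range(200)]
sample = io.StringIO("\n".join([header_line] + data_lines) + "\n")

summary = check_dataset(sample)
print(summary)
```

For a real dataset, pass a file opened with open(path, newline="") instead of the in-memory sample.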
I include below some links to Data Repositories containing
multiple datasets to choose from and other specific suggestions: