CS 525 Knowledge Discovery and Data Mining

Syllabus: Spring 2008

Prof. Carolina Ruiz

WARNING: Small changes to this syllabus may be made during the semester.

COURSE DESCRIPTION:

Due to advances in technology and the availability of increasingly cheap storage devices, data in many domains have been accumulating at a high rate in recent years, leading to very large databases. This course presents current research in Knowledge Discovery in Databases (KDD), covering data integration, data mining, and the interpretation of patterns found in such databases. Topics include data warehousing and mediation techniques aimed at integrating distributed, heterogeneous data sources; data mining techniques such as decision trees, association rule mining, and statistical analysis for discovering patterns in the integrated data; and evaluation and interpretation of the mined patterns using visualization techniques. The work discussed originates in the fields of databases, artificial intelligence, information retrieval, data visualization, and statistics. Industrial and scientific applications will be presented.

Students will be expected to read assigned textbook chapters and research papers, and work on implementation/research projects that cover the different stages of the KDD process.


CLASS MEETING:

Time: Tuesdays and Thursdays 3:30-4:50 pm
Room: HL114

Students are also encouraged to attend the Knowledge Discovery in Databases and Data Mining Research Group (KDDRG) Seminar, held Fridays at 2 pm in the Beckett Conference Room (FL246).


INSTRUCTOR:

Prof. Carolina Ruiz

Office: FL 232
Phone Number: (508) 831-5640
Office Hours: Thursdays 2-3 pm, or by appointment.


TEXTBOOK:

  • Required:
    "Data Mining (Second Edition)". I.H. Witten and E. Frank. Morgan Kaufmann Publishers. 2005. ISBN: 0-12-088407-0
  • Recommended: Several other books on the subject and related subjects are recommended below. Some research papers will be handed out during the term.

PREREQUISITE:

Background in databases and artificial intelligence at the undergraduate level, or permission of the instructor. Background in statistics would be helpful but is not assumed. Proficiency in a high-level programming language (preferably Java) is required.


GRADES:

7 Projects (15% each project)   100% (+ 5% extra credit)
Class Participation   Extra points

Your final grade will reflect your own work and achievements during the course. Any type of cheating will be penalized and reported to the WPI Judicial Board in accordance with the Academic Honesty Policy.


CLASS PARTICIPATION

All students are expected to read the material assigned for each class in advance and to participate in class discussions. Also, students will take turns presenting papers and leading class discussions of assigned readings. Class participation will be taken into account when deciding students' final grades.

PROJECTS AND ASSIGNMENTS

There will be a total of seven projects related to the data mining stages and/or techniques covered in the class. Datasets for those projects will be selected from online database repositories, or other sources.

About the Weka System: For most of the projects, we will use the Weka system (http://www.cs.waikato.ac.nz/ml/weka/). Weka is an excellent data mining environment. It provides a large collection of Java-based mining algorithms, data preprocessing filters, and experimentation capabilities. Weka is open source software issued under the GNU General Public License. Weka's webpage has more information about the system, the download itself, and its documentation. You should download and use the latest stable GUI version of the system.

Students will be required to provide both a written report and an oral (in-class) presentation describing their achievements in each of these projects.

More detailed descriptions of the assignments and projects will be posted to the course webpage at the appropriate times during the semester.


CLASS MAILING LIST

The mailing list for this class is:

This mailing list reaches the professor and all the students in the class.

CLASS WEB PAGES

The webpages for this class are located at http://www.cs.wpi.edu/~cs525d/s08/
Announcements will be posted on the web pages and/or the class mailing list, so you are urged to check your email and the class web pages frequently.

ADDITIONAL SUGGESTED REFERENCES

Knowledge Discovery and Data Mining

Databases

Statistics

  • "Statistical Inference for Management and Economics". P. Billingsley, D. Croft, D. Huntsberger, C. Watson. Boston: Allyn and Bacon, Inc. 1986.
  • "Probability and Statistics". 2nd edition. M. DeGroot. Addison Wesley, 1986.
  • "Statistical Inference". G. Casella, R. Berger. Wadsworth and Brooks/Cole, 1990.

Machine Learning

  • "Machine Learning". Tom M. Mitchell. McGraw-Hill, 1997.
  • "Elements of Machine Learning". P. Langley. Morgan Kaufmann Publishers, Inc. 1996.

General AI

  • "Artificial Intelligence: A Modern Approach". S. Russell, P. Norvig. Prentice Hall, 1995. ISBN 0-13-103805-2
  • "Artificial Intelligence: Theory and Practice". T. Dean, J. Allen, Y. Aloimonos. The Benjamin/Cummings Publishing Company, Inc. 1995.

OTHER ONLINE RESOURCES:

Previous offerings of CS525d Knowledge Discovery and Data Mining

Webpages of my previous offerings of this course:

Previous offerings of CS4445

Webpages of my previous offerings of the undergraduate data mining course contain plenty of useful resources: practice exams, exams, homework, solutions of those exams/hw, etc.

Data Sets

KDD

KDD Commercial Products / Prototypes

Data Warehousing and OLAP

Statistics

Machine Learning

General AI


Worcester Polytechnic Institute (WPI)
Computer Science Department