CS 548 Knowledge Discovery and Data Mining

Syllabus Spring 2014

Prof. Carolina Ruiz

WARNING: Small changes to this syllabus may be made during the semester.


This course presents current research in Knowledge Discovery in Databases (KDD) dealing with data integration, mining, and interpretation of patterns in large collections of data. Topics include data warehousing and data preprocessing techniques; data mining techniques for classification, regression, clustering, deviation detection, and association analysis; and evaluation of patterns mined from data. Industrial and scientific applications are discussed.

Students will be expected to read assigned textbook chapters and research papers, and work on implementation/research projects that cover the different stages of the KDD process.

This course can be used to satisfy the graduate AI bin requirement.


Time: Tuesdays and Fridays 3:00-4:20 pm
Room: HL230


Prof. Carolina Ruiz

Office: FL 232
Phone Number: (508) 831-5640
Office Hours: Thursdays 1:00 - 2:00 pm. If you need to see me at a different time, email me to schedule an appointment.


Several other books on the subject and related subjects are recommended below. Some research papers will be handed out during the semester.


Background in artificial intelligence, databases, and statistics at the undergraduate level, or permission of the instructor. Proficiency in a high-level programming language (preferably Java) is required.


Projects   80%
Showcase   10%
Class Participation   10%

Your final grade will reflect your own work and achievements during the course. Any type of cheating will be penalized and reported to the WPI Judicial Board in accordance with the Academic Honesty Policy.

Note that this course follows the guidelines established by the WPI faculty in May 2010:

"A student is expected to expend at least 56 hours of total effort for each graduate credit. This means that a student in a 3-graduate credit 14-week course is expected to expend at least 12 hours of total effort per week."
Hence, expect to spend at least 9 hours per week working on this course outside the classroom.
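The weekly figures above can be checked with a short calculation. This is only a sketch: it assumes the 9-hour figure comes from subtracting the two 80-minute class meetings from the 12 weekly hours in the guideline's 3-credit, 14-week example. The syllabus does not state this derivation, and the function names are illustrative.

```python
# Workload arithmetic based on the WPI guideline quoted above.
HOURS_PER_CREDIT = 56   # minimum total effort per graduate credit (from the guideline)
CREDITS = 3             # graduate credits in the guideline's example
WEEKS = 14              # course length in the guideline's example

# Assumption: weekly class time is the two 80-minute meetings in the schedule.
CLASS_MINUTES_PER_WEEK = 2 * 80


def weekly_effort_hours(hours_per_credit, credits, weeks):
    """Total expected effort per week, in hours."""
    return hours_per_credit * credits / weeks


def outside_class_hours(total_weekly_hours, class_minutes_per_week):
    """Weekly effort left over after subtracting time spent in class."""
    return total_weekly_hours - class_minutes_per_week / 60


total = weekly_effort_hours(HOURS_PER_CREDIT, CREDITS, WEEKS)
outside = outside_class_hours(total, CLASS_MINUTES_PER_WEEK)
print(f"Total effort per week: {total:.1f} hours")
print(f"Outside the classroom: {outside:.1f} hours")
```

Under these assumptions the total is 12 hours per week, of which a little over 9 fall outside class meetings, consistent with the figure stated above.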


All students are expected to read the material assigned for each class in advance and to participate in class discussions. Also, students will take turns presenting papers and leading class discussions of assigned readings. Class participation will be taken into account when deciding students' final grades.



This course is project-intensive. Several projects related to the data mining stages and/or techniques covered in the class will be assigned. Students will work on these projects individually, not in teams. For each project, students will be required to provide both a written report and an oral (in-class) presentation describing their work. Datasets for these projects will be selected from online database repositories or other sources.

Several different data mining tools will be used in this course:

  • Matlab: Matlab is a high-level language and interactive environment for numerical computation, visualization, and programming. It provides data analysis and algorithm development functionality. You can download and access Matlab through the WPI CCC. The Statistics Toolbox will be particularly useful.

  • Weka: Weka is a machine-learning/data-mining environment. It provides a large collection of Java-based mining algorithms, data preprocessing filters, and experimentation capabilities. Weka is open source software issued under the GNU General Public License. To download Weka, get its documentation, or learn more about the system, go to the Weka webpage. You should download and use the latest Developer Version (currently weka-3-7-10) of the system.

  • RapidMiner: RapidMiner is an analytics platform that includes a multitude of methods for data integration, data transformation, data modeling, and data visualization. It provides access to data sources in various formats. Download the free Community edition (currently version 5.3) available online.

  • Your own code. You can use
    • Python (for Python tutorials, see its documentation),
    • R (for R manuals, follow its Manuals link),
    • or any other programming language to implement your own programs and scripts to complement the functionality of the systems above.

More detailed descriptions of the assignments and projects will be posted to the course webpage at the appropriate times during the semester.


Each student should search for a successful real-world application of data mining and present it in class. This successful data mining story should be about using data mining to discover novel and useful patterns that made a difference in a certain industry or field. The application domain is up to the student (e.g., finance, sports, healthcare, science, ...). The chosen story should be discussed with the professor in advance. The student will then give a 10-minute in-class presentation describing this application in as much detail as possible, focusing on its data mining aspects. Students will take turns presenting their showcases throughout the term, one student per class.


The mailing list for this class is:

This mailing list reaches the professor and all the students in the class.


The webpages for this class are located at http://www.cs.wpi.edu/~cs548/s14/
Announcements will be posted on the class web pages and/or the class mailing list, so you are urged to check your email and the class web pages frequently.






WPI Worcester Polytechnic Institute

Computer Science Department