CS4445 Data Mining and Knowledge Discovery in Databases

Syllabus: B Term 2006

Prof. Carolina Ruiz

WARNING: Small changes to this syllabus may be made during the semester.


This course provides an introduction to Knowledge Discovery in Databases (KDD) and Data Mining. KDD deals with data integration techniques and with the discovery, interpretation and visualization of patterns in large collections of data. Topics covered in this course include data warehousing and mediation techniques; data mining methods such as rule-based learning, decision trees, association rules and sequence mining; and data visualization. The work discussed originates in the fields of artificial intelligence, machine learning, statistical data analysis, data visualization, databases, and information retrieval. Several scientific and industrial applications of KDD will be studied.


CS4341 Introduction to Artificial Intelligence, MA2611 Applied Statistics I, and CS3431 Database Systems I.


Mondays, Tuesdays, Thursdays, Fridays 12:00-12:50 pm
Room: SL104
Please come to class on time and stay for the whole class period.


  • Learning fundamental principles about and generalizations of:

    • Computational techniques for data manipulation, integration, and cleaning.
      Practice with and evaluation of this objective: Projects 1, 2, 3, 4. Exams 1, 2

    • Computational techniques to discover patterns and trends in data collections.
      Practice with and evaluation of this objective: Projects 1, 2, 3, 4. Exams 1, 2

    • Computational approaches for constructing and evaluating models built upon patterns discovered from data collections.
      Practice with and evaluation of this objective: Projects 1, 2, 3, 4. Exams 1, 2

  • Learning to apply course material (to improve thinking, problem solving, and decision making) during the utilization and analysis of computer programs that discover patterns in data and that are capable of learning from data in a variety of application domains.
    Practice with and evaluation of this objective: Projects 1, 2, 3, 4. Exams 1, 2

  • Developing creative capacities for the design, implementation, and analysis of computer programs that mine patterns from large collections of data.
    Practice with and evaluation of this objective: Projects 1, 2, 3, 4. Exams 1, 2

  • Learning to analyze and experimentally evaluate algorithms and implementations of data mining techniques in multiple real-world application domains, in particular those investigated in the course projects.
    Practice with and evaluation of this objective: Projects 1, 2, 3, 4.


Prof. Carolina Ruiz

Office: FL 232
Phone Number: (508) 831-5640
Office Hours:
Mondays 1:00 - 2:00 pm,
Thursdays 2:00 - 3:00 pm
or by appointment.


  • Piotr Mardziel
    Office Hours: Fuller Labs A22
    Mondays 3:00 - 4:00 pm
    Tuesdays 4:00 - 5:00 pm
    Fridays 10:00 - 11:00 am
See Class Mailing Lists for instructions on how to reach the professor and the TA by email.


Several other books on the subject and related subjects are recommended below. Some research papers will be handed out during the term.


Exam 1 25%
Exam 2 25%
Project/Homework 1 12.5%
Project/Homework 2 12.5%
Project/Homework 3 12.5%
Project/Homework 4 12.5%
Class Participation and Pop Quizzes: Extra Points

Your final grade will reflect your own work and achievements during the course. Any type of cheating will be reported to the WPI Judicial Board and penalized in accordance with the Academic Honesty Policy.

According to the WPI Undergraduate Catalog, "Unless otherwise indicated, WPI courses usually carry credit of 1/3 unit. This level of activity suggests at least 17 hours of work per week, including class and laboratory time." Hence, you are expected to spend at least 13 hours per week working on this course outside the classroom.


This course may be taken for graduate credit by students in the BS/MS CS program. Written permission from the professor is required. In order to receive graduate credit, students who have signed up for this program need to work on projects/homework alone (that is, in "groups" of 1 student).


There will be a total of 2 exams. Each exam will cover the material presented in class since the beginning of the term. In particular, the final exam is cumulative. Exams will be in-class, 50-minute, closed-book, individual exams. Collaboration or other outside assistance on exams is not allowed. The exams are scheduled for Monday, Nov. 20, and Tuesday, Dec. 12, 2006.

Regarding makeup exams, I follow Prof. Gennert's policy: "Makeup and/or early examinations are not given except under the most dire of circumstances, and then only with corroborating documentation. Note well that neither oversleeping, forgetting to show up for an exam, nor conflicting travel arrangements are considered dire circumstances."


There will be a total of 4 project/homework assignments. Each of the projects deals with one of the data mining techniques covered in the class.
Data Mining Tool
For most of the projects, we will use the Weka system (http://www.cs.waikato.ac.nz/ml/weka/). Weka is an excellent machine-learning/data-mining environment. It provides a large collection of Java-based mining algorithms, data preprocessing filters, and experimentation capabilities. Weka is open source software issued under the GNU General Public License. For more information on the Weka system, to download it, or to get its documentation, see Weka's webpage (http://www.cs.waikato.ac.nz/ml/weka/). You should download and use the 3-4-8a GUI version of the system.
Students are expected to organize themselves into groups of exactly 2 for each of the projects, except for students taking this course for BS/MS credit who are expected to work on the projects alone. Each project will contain both an individual assignment and a group assignment. Groups need not be the same for all projects.
Submissions and Late Policy
See each project statement for details.
Project Descriptions
More detailed descriptions of the projects/homework will be posted to the course webpage at the appropriate times during the term. Although you may find similar programs/systems available online or in the references, the design and all code you use and submit, the results, and the analysis of the results in your projects/homework submissions MUST be your own original work.


Students are expected to read the material assigned for each class in advance and to participate in class discussions. Class participation will be taken into account when deciding students' final grades.


There are two mailing lists for this class (replace XXXX with 4445 below):

There is also a myWPI account for this class that will be used for project submissions only, as needed.


The web pages for this class are located at http://www.cs.wpi.edu/~cs4445/b06/
Announcements will be posted on the web pages and/or the class mailing list, and so you are urged to check your email and the class web pages frequently. 




Knowledge Discovery and Data Mining

Machine Learning

  • "Machine Learning". Tom M. Mitchell. McGraw-Hill, 1997.
  • "Elements of Machine Learning". P. Langley. Morgan Kaufmann Publishers, Inc. 1996.

General AI

  • "Artificial Intelligence: A Modern Approach". S. Russell, P. Norvig. Prentice Hall, 1995. ISBN 0-13-103805-2
  • "Artificial Intelligence: Theory and Practice". T. Dean, J. Allen, Y. Aloimonos. The Benjamin/Cummings Publishing Company, Inc. 1995.
  • "Readings in Artificial Intelligence". B. L. Webber, N. J. Nilsson, eds. Tioga Publishing Company, 1981.
  • "Artificial Intelligence". 3rd edition. Patrick H. Winston. Addison Wesley.
  • "The Elements of Artificial Intelligence Using Common Lisp". S. L. Tanimoto. Computer Science Press 1990.
  • "Artificial Intelligence". 2nd edition. E. Rich and K. Knight. McGraw Hill, 1991.
  • "Paradigms of Artificial Intelligence Programming: Case Studies in Common Lisp". P. Norvig. Morgan Kaufmann Publishers, 1992.
  • "Essentials of Artificial Intelligence". M. Ginsberg. Morgan Kaufmann Publishers, 1993.
  • "Artificial Intelligence Structures and Strategies for Complex Problem Solving". Third edition. G. F. Luger and W. A. Stubblefield. Addison-Wesley, 1998.
  • "Logical Foundations of Artificial Intelligence". M.R. Genesereth and N. Nilsson. Morgan Kaufmann, 1987.



Statistics

  • "Statistical Inference for Management and Economics". P. Billingsley, D. Croft, D. Huntsberger, C. Watson. Boston: Allyn and Bacon, Inc. 1986.
  • "Probability and Statistics". 2nd edition. M. DeGroot. Addison Wesley, 1986.
  • "Statistical Inference". G. Casella, R. Berger. Wadsworth and Brooks/Cole, 1990.


Previous offerings of CS4445

The webpages of my previous offerings of this course have plenty of useful resources: practice exams, exams, homework, solutions to those exams and homework, etc.

