
CS 525 KNOWLEDGE DISCOVERY AND DATA MINING
SYLLABUS - Spring 2004
WARNING:
Small changes to this syllabus may be made during the course of the semester.

COURSE DESCRIPTION:
Due to advances in technology and the availability of increasingly cheap
storage devices, data in different domains has been accumulating at an
impressively high rate in recent years, leading to very large databases.
This course presents current research in Knowledge Discovery in Databases
(KDD) dealing with data integration, mining, and the interpretation of
patterns in such databases. Topics include data warehousing and mediation
techniques aimed at integrating distributed, heterogeneous data sources;
data mining techniques such as rule-based learning, decision trees, association
rule mining, and statistical analysis for discovery of patterns in the
integrated data; and evaluation and interpretation of the mined patterns
using visualization techniques. The work discussed originates in the fields
of databases, artificial intelligence, information retrieval, data visualization,
and statistics. Industrial and scientific applications will be given.
This course presents data mining from a database perspective.
For an in-depth study of the machine learning techniques used in data
mining, take CS539 Machine Learning, which is scheduled to be offered
during the 2004-2005 academic year.
Students will be expected to read assigned textbook chapters and research papers,
and work on implementation/research projects that cover the different
stages of the KDD process.
PREREQUISITE:
Background in databases and artificial intelligence
at the undergraduate level, or permission of the instructor. Background
in statistics would be helpful but is not assumed.
Proficiency in a high-level programming language (preferably Java)
is required.
CLASS MEETING:
Tuesdays and Thursdays 3:00 - 4:20 pm
FL320
Students are also encouraged to attend the Knowledge Discovery in Databases
and Data Mining Research Group (KDDRG) Seminar, held Fridays at 2 pm in the
Beckett Conference Room (FL246).
PROFESSOR:
Prof. Carolina Ruiz
ruiz@cs.wpi.edu
Office: FL 232
Phone Number: (508) 831-5640
Office Hours: Tu 2-2:50 pm, Fr 3-4 pm, or by appointment.
Other speakers may occasionally be invited to lecture to the class.
READINGS:
Several other books on the subject and related subjects are
recommended below.
Several research papers will be handed out during the semester.
GRADES:
Exam | 20%
Homework | 8%
Projects | 72% (12% each project)
Participation in class discussions of assigned topics | 10% extra points
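
As a rough illustration of how the weights above combine, here is a minimal Python sketch. The weights come from the table; the function name, the 0-100 input scale, and the assumption that participation is scored from 0 to 100 and then scaled by 10% are hypothetical, since the syllabus does not say how the extra points are awarded.

```python
# Weights are taken from the GRADES table; everything else is illustrative.
WEIGHTS = {"exam": 0.20, "homework": 0.08, "projects": 0.72}
NUM_PROJECTS = 6             # six projects at 12% each
PARTICIPATION_BONUS = 0.10   # extra points on top of the regular 100%


def final_score(exam, homework, projects, participation=0.0):
    """Return the weighted course score (0-100, or up to 110 with the bonus).

    All inputs are percentages in [0, 100]; `projects` holds six scores.
    Assumption: participation is scored 0-100 and scaled by the bonus weight.
    """
    if len(projects) != NUM_PROJECTS:
        raise ValueError(f"expected {NUM_PROJECTS} project scores")
    # Equal 12% weights are equivalent to 72% of the project average.
    project_avg = sum(projects) / NUM_PROJECTS
    return (
        WEIGHTS["exam"] * exam
        + WEIGHTS["homework"] * homework
        + WEIGHTS["projects"] * project_avg
        + PARTICIPATION_BONUS * participation
    )
```

Under these assumptions, a student with 80 on the exam, 90 on the homework, 85 on every project, and 50 for participation would score about 89.4.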
Your final grade will reflect your own work and achievements during
the course. Any type of cheating will be penalized with an F grade for
the course and will be reported to the WPI Judicial Board in accordance
with the Academic Honesty Policy.
EXAM
There will be one midterm exam. This exam will cover the material presented
in class since the beginning of the semester.
HOMEWORK
There will be one homework assignment, intended as preparation for the
midterm exam. It will cover the material in chapters 1 through 5
of the textbook.
PROJECTS
There will be a total of six interrelated projects.
Each of the projects deals with one of the data mining techniques
covered in the class.
Datasets for those projects will be selected from online database
repositories or other sources.
About the Weka System:
For most of the projects, we will use the Weka system
(http://www.cs.waikato.ac.nz/ml/weka/).
Weka is an excellent machine-learning/data-mining environment.
It provides a large collection of Java-based mining algorithms,
data preprocessing filters, and experimentation capabilities.
Weka is open source software issued under the GNU General Public License.
For more information on the Weka system, to download it, and
to get its documentation, see Weka's webpage at the URL above.
You should download the latest available stable GUI version of the system.
Students will be required to provide both a written report and an oral
(in-class) presentation describing their achievements in each of these
projects.
CLASS PARTICIPATION
All students are expected to read the material assigned for each class in
advance and to participate in class discussions. Also, students will
take turns presenting papers and leading class discussions of assigned
readings.
CLASS MAILING LIST
There are two mailing lists for this class:
- Messages sent to cs525d-all AT cs.wpi.edu go to the entire class
  (students and professor).
  If you haven't received the "welcome to CS525D" email message
  by the end of the first day of classes, you should subscribe to the
  mailing list by sending the following one-line email message to
  majordomo@cs.wpi.edu:
      subscribe cs525d
- Messages sent to cs525d-staff AT cs.wpi.edu go to the professor only.
  (Please use this email address to reach the professor.)
CLASS WEB PAGES
The web pages for this class are located at
http://www.cs.wpi.edu/~cs525d/s04/
Announcements will be posted on the web pages and/or the class mailing
list, and so you are urged to check your email and the class web pages
frequently.
ADDITIONAL REFERENCES
(See also the list of selected papers in the Class Schedule.)
Knowledge Discovery and Data Mining
- "Advances in Knowledge Discovery and Data Mining". Eds.: Fayyad,
  Piatetsky-Shapiro, Smyth, and Uthurusamy. The MIT Press, 1995.
- "Data Mining. Technologies, Techniques, Tools, and Trends".
  B. Thuraisingham. CRC, 1998.
- "Data Mining. A hands-on approach for business professionals".
  R. Groth. Prentice Hall, 1998.
- "Data Preparation for Data Mining". Dorian Pyle, 3/99.
- "Data Mining". P. Adriaans & D. Zantinge.
- "Data Mining Methods for Knowledge Discovery".
  Cios, Pedrycz, & Swiniarski, 1998.
- "Data Mining Techniques for Marketing, Sales and Customer Support".
  Berry & Linoff.
- "Decision Support using Data Mining". Anand and Buchner.
- "Feature Selection for Knowledge Discovery and Data Mining".
  Liu and Motoda.
- "Feature Extraction, Construction and Selection: A Data Mining
  Perspective". Eds.: Motoda and Liu.
- "Knowledge Acquisition from Databases". Xindong Wu.
- "Mining Very Large Databases with Parallel Processing".
  Alex Freitas, Simon Lavington.
- "Predictive Data-Mining: A Practical Guide". Weiss & Indurkhya.
- "Machine Learning and Data Mining: Methods and Applications."
Michalski, Bratko, and Kubat, 1998; John Wiley & Sons.
- "Rough Sets and Data Mining: Analysis of Imprecise Data."
Eds: Lin and Cercone; Kluwer.
- "Seven Methods for Transforming Corporate Data into Business
  Intelligence". Vasant Dhar and Roger Stein; Prentice-Hall, 1997.
Machine Learning
General AI
- "Artificial Intelligence: A Modern Approach".
  S. Russell, P. Norvig. Prentice Hall, 1995. ISBN 0-13-103805-2.
- "Artificial Intelligence: Theory and Practice".
  T. Dean, J. Allen, Y. Aloimonos.
  The Benjamin/Cummings Publishing Company, Inc., 1995.
- "Readings in Artificial Intelligence".
  B. L. Webber, N. J. Nilsson, eds. Tioga Publishing Company, 1981.
- "Artificial Intelligence". 3rd edition.
  Patrick H. Winston. Addison Wesley.
- "The Elements of Artificial Intelligence Using Common Lisp".
  S. L. Tanimoto. Computer Science Press, 1990.
- "Artificial Intelligence". Second edition.
  E. Rich and K. Knight. McGraw Hill, 1991.
- "Paradigms of Artificial Intelligence Programming: Case Studies
  in Common Lisp". P. Norvig. Morgan Kaufmann Publishers, 1992.
- "Essentials of Artificial Intelligence".
  M. Ginsberg. Morgan Kaufmann Publishers, 1993.
- "Artificial Intelligence: Structures and Strategies for Complex
  Problem Solving". Third edition.
  G. F. Luger and W. A. Stubblefield. Addison-Wesley, 1998.
- "Logical Foundations of Artificial Intelligence".
  M.R. Genesereth and N. Nilsson. Morgan Kaufmann, 1987.
Databases
Statistics
- "Statistical Inference for Management and Economics".
  P. Billingsley, D. Croft, D. Huntsberger, C. Watson.
  Boston: Allyn and Bacon, Inc., 1986.
- "Probability and Statistics". 2nd edition.
  M. DeGroot. Addison Wesley, 1986.
- "Statistical Inference".
  G. Casella, R. Berger. Wadsworth and Brooks/Cole, 1990.
OTHER ONLINE RESOURCES:
Data Sets
KDD
KDD Commercial Products / Prototypes
Data Warehousing and OLAP
Machine Learning
Statistics
General AI