COURSE DESCRIPTION:
Due to advances in technology and the availability of increasingly
cheap storage devices, data in different domains have been accumulating
at an impressively high rate in recent years, leading to very large
databases. This course presents current research in Knowledge Discovery
in Databases (KDD) dealing with the data integration, mining, and
interpretation of patterns in such databases. Topics include data
warehousing and mediation techniques aimed at integrating distributed,
heterogeneous data sources; data mining techniques such as
decision trees, association rule mining, and statistical
analysis for discovery of patterns in the integrated data; and
evaluation and interpretation of the mined patterns using visualization
techniques. The work discussed originates in the fields of artificial
intelligence, databases, information retrieval, data visualization,
and statistics. Industrial and scientific applications will be given.
Students will be expected to read assigned textbook chapters and
research papers, and to work on implementation/research projects that
cover the different stages of the KDD process.
This course can be used to satisfy the graduate AI bin requirement.
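As a small illustration of the association rule mining mentioned in the
description above, the following Java sketch computes the support of an
itemset and the confidence of a rule over a handful of transactions. It is
only a sketch under stated assumptions: the class name, the method names, and
the toy transactions are hypothetical, and nothing here is taken from the
course projects or from Weka.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch only: the class name, method names, and toy transactions
// are hypothetical and are not taken from the course materials or from Weka.
class AssociationRuleSketch {

    // Support of an itemset: fraction of transactions containing every item in it.
    static double support(List<Set<String>> transactions, Set<String> itemset) {
        if (transactions.isEmpty()) {
            return 0.0;
        }
        long hits = transactions.stream().filter(t -> t.containsAll(itemset)).count();
        return (double) hits / transactions.size();
    }

    // Confidence of the rule antecedent => consequent:
    // support(antecedent union consequent) / support(antecedent).
    static double confidence(List<Set<String>> transactions,
                             Set<String> antecedent, Set<String> consequent) {
        Set<String> both = new HashSet<>(antecedent);
        both.addAll(consequent);
        double antecedentSupport = support(transactions, antecedent);
        return antecedentSupport == 0.0
            ? 0.0
            : support(transactions, both) / antecedentSupport;
    }

    // Convenience helper for building an itemset from item names.
    static Set<String> items(String... names) {
        return new HashSet<>(Arrays.asList(names));
    }

    // Made-up market-basket data, used only for illustration.
    static List<Set<String>> toyTransactions() {
        return Arrays.asList(
            items("bread", "milk"),
            items("bread", "butter"),
            items("bread", "milk", "butter"),
            items("milk"));
    }

    public static void main(String[] args) {
        List<Set<String>> data = toyTransactions();
        System.out.println("support({bread, milk}) = "
            + support(data, items("bread", "milk")));
        System.out.println("confidence(bread => milk) = "
            + confidence(data, items("bread"), items("milk")));
    }
}
```

In the toy data, two of the four transactions contain both bread and milk, and
three contain bread, so the rule "bread => milk" has support 2/4 and
confidence 2/3. A full miner such as Apriori would additionally search for all
itemsets above a minimum support threshold; this sketch only evaluates a rule
that is given.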
CLASS MEETING:
Time: Tuesdays and Thursdays 2:00-3:20 pm
Room: WB323
Students are also encouraged to attend the Knowledge Discovery in
Databases and Data Mining Research Group (KDDRG) Seminar, held Fridays
at 1 pm in the Beckett Conference Room (FL246).
INSTRUCTOR:
Prof. Carolina Ruiz
Office: FL 232
Phone Number: (508) 831-5640
Office Hours: Mondays 11-12 noon, Thursdays 1-1:50 pm, or by appointment.
TEXTBOOK:
- Required:
"Data Mining (Second Edition)".
I.H. Witten and E. Frank.
Morgan Kaufmann Publishers, 2005.
ISBN: 0-12-088407-0
- Recommended:
Several other books on this and related subjects are recommended below.
Some research papers will be handed out during the term.
PREREQUISITES:
Background in artificial intelligence, databases, and statistics
at the undergraduate level, or permission of the instructor.
Proficiency in a high-level programming language (preferably Java)
is required.
GRADES:
6 Projects (15% each)  | 90%
Class Participation    | 10%
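The weighting above can be worked through with a short Java sketch. It
assumes every component is scored on a 0-100 scale, which the syllabus does
not state; the class name, method name, and sample scores are hypothetical
and are not an official grade calculator.

```java
// Illustrative sketch only: the class and method names are hypothetical, and the
// 0-100 scoring scale is an assumption; the syllabus specifies only the weights.
class GradeSketch {
    static final int PROJECT_COUNT = 6;
    static final double PROJECT_WEIGHT = 0.15;       // each of the 6 projects
    static final double PARTICIPATION_WEIGHT = 0.10; // class participation

    // Weighted total: 15% per project plus 10% for class participation.
    static double finalScore(double[] projectScores, double participationScore) {
        if (projectScores.length != PROJECT_COUNT) {
            throw new IllegalArgumentException("expected exactly 6 project scores");
        }
        double total = PARTICIPATION_WEIGHT * participationScore;
        for (double score : projectScores) {
            total += PROJECT_WEIGHT * score;
        }
        return total;
    }

    public static void main(String[] args) {
        // Made-up scores, for illustration only.
        double[] projects = {90, 85, 100, 80, 95, 90};
        System.out.println(finalScore(projects, 100));
    }
}
```

With the made-up scores in `main`, the six projects contribute
0.15 x 540 = 81 points and participation contributes 0.10 x 100 = 10 points.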
Your final grade will reflect your own work and achievements during
the course. Any type of cheating will be penalized and reported to the
WPI Judicial Board in accordance with the Academic Honesty Policy.
All students are expected to read the material assigned for each class in
advance and to participate in class discussions. Also, students will
take turns presenting papers and leading class discussions of assigned
readings.
Class participation will be taken into account when deciding
students' final grades.
There will be a total of six projects related to the data mining
stages and/or techniques covered in class. Datasets for those projects
will be selected from online database repositories or other sources.
About the Weka System:
For most of the projects, we will use the Weka system
(http://www.cs.waikato.ac.nz/ml/weka/).
Weka is a data-mining environment that provides a large collection of
Java-based mining algorithms, data preprocessing filters, and
experimentation capabilities.
Weka is open source software issued under the GNU General Public License.
Weka's webpage offers downloads, documentation, and further information
about the system.
You should download and use the latest developer version of the system
(currently 3-7-0).
Students will be required to provide both a written report and an oral
(in-class) presentation describing their achievements in each of these
projects.
More detailed descriptions of the assignments and projects will be posted to the
course webpage at the appropriate times during the semester.
The mailing list for this class is:
This mailing list reaches the professor and all the students in the class.
The webpages for this class are located at
http://www.cs.wpi.edu/~cs525d/f09/
Announcements will be posted on the web pages and/or
the class mailing list, and so you are urged to check your email and
the class web pages frequently.
Small changes to this syllabus may be made during the course
of the semester.
Knowledge Discovery and Data Mining
- "Data Mining: Concepts and Techniques".
J. Han and M. Kamber. Morgan Kaufmann Publishers. 2001.
ISBN: 1-55860-489-8.
- "Advances in Knowledge Discovery and Data Mining". Eds.: Fayyad,
Piatetsky-Shapiro, Smyth, and Uthurusamy. The MIT Press, 1995.
- "Data Mining: Technologies, Techniques, Tools, and Trends".
B. Thuraisingham. CRC, 1998.
- "Data Mining: A Hands-On Approach for Business Professionals".
R. Groth. Prentice Hall, 1998.
- "Data Preparation for Data Mining". Dorian Pyle, March 1999.
- "Data Mining".
P. Adriaans & D. Zantinge.
- "Data Mining Methods for Knowledge Discovery".
Cios, Pedrycz, & Swiniarski, 1998.
- "Data Mining Techniques for Marketing, Sales and Customer Support".
Berry & Linoff.
- "Decision Support using Data Mining". Anand and Buchner.
- "Feature Selection for Knowledge Discovery and Data Mining".
Liu and Motoda.
- "Feature Extraction, Construction and Selection:
A Data Mining Perspective". Eds: Motoda and Liu.
- "Introduction to Data Mining".
Tan, Steinbach, & Kumar. 2006.
- "The Text Mining Handbook:
Advanced Approaches in Analyzing Unstructured Data".
Ronen Feldman, James Sanger. 2006.
- "Knowledge Acquisition from Databases". Xindong Wu.
- "Mining Very Large Databases with Parallel Processing".
Alex Freitas, Simon Lavington.
- "Predictive Data-Mining: A Practical Guide". Weiss & Indurkhya.
- "Machine Learning and Data Mining: Methods and Applications".
Michalski, Bratko, and Kubat, 1998; John Wiley & Sons.
- "Rough Sets and Data Mining: Analysis of Imprecise Data".
Eds: Lin and Cercone; Kluwer.
- "Seven Methods for Transforming Corporate Data into
Business Intelligence". Vasant Dhar and Roger Stein; Prentice-Hall,
1997.
Databases
- "A First Course in Database Systems".
J. Ullman, J. Widom.
Prentice-Hall, 1997.
- "Database Management Systems", 2nd ed.
R. Ramakrishnan. McGraw-Hill, 1999.
- "Advanced Database Systems".
C. Zaniolo, S. Ceri, C. Faloutsos, R.T. Snodgrass, V.S. Subrahmanian, R. Zicari.
Morgan Kaufmann, 1997.
- "Readings in Database Systems". 2nd Edition.
Ed. M. Stonebraker. 1994, Morgan Kaufmann.
Statistics
- "Statistical Inference for Management and Economics".
P. Billingsley, D. Croft, D. Huntsberger, C. Watson.
Boston: Allyn and Bacon, Inc. 1986.
- "Probability and Statistics". 2nd edition.
M. DeGroot. Addison Wesley, 1986.
- "Statistical Inference".
G. Casella, R. Berger.
Wadsworth and Brooks/Cole, 1990.
Machine Learning
- "Machine Learning".
Tom M. Mitchell.
McGraw-Hill, 1997.
- "Elements of Machine Learning".
P. Langley.
Morgan Kaufmann Publishers, Inc. 1996.
General AI
- "Artificial Intelligence: A Modern Approach".
S. Russell, P. Norvig.
Prentice Hall, 1995. ISBN 0-13-103805-2
- "Artificial Intelligence: Theory and Practice".
T. Dean, J. Allen, Y. Aloimonos.
The Benjamin/Cummings Publishing Company, Inc. 1995.
OTHER ONLINE RESOURCES:
Previous offerings of CS525d Knowledge Discovery and Data Mining
Webpages of my previous offerings of this course:
Previous offerings of CS4445
Webpages of my previous offerings of the undergraduate data mining course
contain plenty of useful resources: practice exams, exams, homework,
solutions of those exams/hw, etc.
Data Sets
KDD
KDD Commercial Products / Prototypes
Data Warehousing and OLAP
Statistics
Machine Learning
General AI