CS 548 Knowledge Discovery and Data Mining

Syllabus Fall 2017

WARNING: Small changes to this syllabus may be made during the semester.

COURSE DESCRIPTION:

This course presents current research in Knowledge Discovery in Databases (KDD) dealing with data integration, mining, and interpretation of patterns in large collections of data. Topics include data warehousing and data preprocessing techniques; data mining techniques for classification, regression, clustering, deviation detection, and association analysis; and evaluation of patterns mined from data. Industrial and scientific applications are discussed.

Students will be expected to read assigned textbook chapters and research papers, and work on implementation/research projects that cover the different stages of the KDD process.

In addition to being a graduate CS course, this course can be used to satisfy:

the MS and PhD AI bin requirements of the Computer Science Graduate Program;
the interdisciplinary BCB503 Biological and Biomedical Database Mining course of the Bioinformatics and Computational Biology Program; and/or
the Data Analytics and Mining core of the Data Science Program.

CLASS MEETING:

Time: Tuesdays and Thursdays 4:00-5:20 pm
Room: FL320

INSTRUCTOR:

Prof. Carolina Ruiz

Office: FL 232

Office Hours:

After class on Tuesdays and Thursdays, 5:20-6:00 pm.
Wednesdays, 1:00-2:00 pm.
If you need to see me at a different time, please email me to schedule an appointment.

TEXTBOOK:

Required Textbook
"Introduction to Data Mining". P.-N. Tan, M. Steinbach, V. Kumar. Addison-Wesley, 2005. ISBN-10: 0321321367, ISBN-13: 9780321321367. (See the book's website for slides and other resources.)
Recommended Textbook
"Data Mining: Practical Machine Learning Tools and Techniques (Third Edition)". Ian H. Witten, Eibe Frank, Mark A. Hall. Morgan Kaufmann, January 2011. ISBN 978-0-12-374856-0.

Several other books on the subject and related subjects are recommended below. Some research papers will be handed out during the semester.

PREREQUISITE:

Background in artificial intelligence, databases, and statistics at the undergraduate level, or permission of the instructor. Proficiency in a high-level programming language (preferably Java and Python) is required.

GRADES:

5 Project-Test Combinations	17.5% each
(typically 11% test, 6% project report, and 0.5% presentation)
Showcase	10%
Class Participation	2.5%
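
As an illustration of how the weights above combine, here is a minimal Python sketch. It assumes every component is scored on a 0-100 scale and that the final grade is a plain weighted sum; the function name and the scale are assumptions made for this example, and the syllabus does not specify rounding or letter-grade cutoffs.

```python
# Weights from the grading table above, as percentages of the final grade.
COMBINATION_WEIGHT = 17.5   # each of the 5 project-test combinations
NUM_COMBINATIONS = 5
SHOWCASE_WEIGHT = 10.0
PARTICIPATION_WEIGHT = 2.5

TOTAL_WEIGHT = (COMBINATION_WEIGHT * NUM_COMBINATIONS
                + SHOWCASE_WEIGHT + PARTICIPATION_WEIGHT)


def final_grade(combination_scores, showcase_score, participation_score):
    """Weighted final grade on a 0-100 scale (hypothetical helper).

    All inputs are assumed to be scores between 0 and 100.
    """
    if len(combination_scores) != NUM_COMBINATIONS:
        raise ValueError("expected one score per project-test combination")
    weighted = sum(score * COMBINATION_WEIGHT for score in combination_scores)
    weighted += showcase_score * SHOWCASE_WEIGHT
    weighted += participation_score * PARTICIPATION_WEIGHT
    return weighted / TOTAL_WEIGHT


if __name__ == "__main__":
    # Example: 80 on every combination, 90 on the showcase, full participation.
    print(final_grade([80, 80, 80, 80, 80], 90, 100))
```

The weights sum to 100%, so a student with full marks on every component gets 100.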

Your final grade will reflect your own work and achievements during the course. Any type of cheating will be penalized and reported to the WPI Judicial Board in accordance with the Academic Honesty Policy.

Note that this course follows the guidelines established by the WPI faculty in May 2010:

"A student is expected to expend at least 56 hours of total effort for each graduate credit. This means that a student in a 3-graduate credit 14-week course is expected to expend at least 12 hours of total effort per week."

Hence, please expect to spend at least 9 hours per week on this course outside the classroom.
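
The arithmetic behind these numbers can be checked with a short Python sketch. The credit count and course length come from the quoted guideline, and the class time comes from the meeting schedule above (two 80-minute meetings per week); the constant and function names are hypothetical, chosen only for this example.

```python
# Figures from the WPI guideline quoted above.
HOURS_PER_CREDIT = 56
CREDITS = 3
WEEKS = 14

# Figures from the class meeting schedule (Tu/Th, 4:00-5:20 pm).
MEETINGS_PER_WEEK = 2
MINUTES_PER_MEETING = 80


def weekly_effort_hours():
    """Total expected effort per week: 56 * 3 / 14 hours."""
    return HOURS_PER_CREDIT * CREDITS / WEEKS


def weekly_class_hours():
    """Hours spent in class each week."""
    return MEETINGS_PER_WEEK * MINUTES_PER_MEETING / 60


def weekly_outside_hours():
    """Expected effort outside the classroom each week."""
    return weekly_effort_hours() - weekly_class_hours()


if __name__ == "__main__":
    print(weekly_effort_hours())             # total hours per week
    print(round(weekly_outside_hours(), 2))  # hours outside class per week
```

Total effort is 12 hours per week; subtracting about 2 hours 40 minutes of class time leaves a little over 9 hours, which matches the "at least 9 hours" figure.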

CLASS PARTICIPATION

All students are expected to read the material assigned for each class in advance and to participate in class discussions. Also, students will take turns presenting papers and leading class discussions of assigned readings. Class participation will be taken into account when deciding students' final grades.

PROJECTS, TESTS, AND SHOWCASES

Tests

A test will be given in class on the day each project is due. Each test will cover the topics included in the corresponding project. This includes materials on these topics from lectures, book chapters, posted materials on the lecture notes website (see Quiz/Exam Topics and Sample Questions), AND project experiments and results. Tests will be individual (not group) work, closed-book, closed-notes.

Projects

This course is project-intensive. Several projects related to the data mining stages and/or techniques covered in the class will be assigned. Students will work on these projects in teams. Students will be required to provide both a written report and an oral (in-class) presentation describing their work on each of these projects. Datasets for those projects will be selected from online database repositories, or other sources.

Several different data mining tools will be used in this course, but the two main ones will be:

Weka: Weka is a machine-learning/data-mining environment. It provides a large collection of Java-based mining algorithms, data preprocessing filters, and experimentation capabilities. Weka is open source software issued under the GNU General Public License. For more information on the Weka system, to download the system and to get its documentation, go to the Weka webpage. You should download and use the latest Developer Version (currently weka-3-9-1) of the system. The Weka MOOC (consisting of several videos) may also be helpful.
Python: See Ruiz's notes on Python. For Python tutorials, see its documentation.
Python has many open source packages available specifically for Data Mining and Knowledge Management. Here is a list of the most widely used ones, along with brief descriptions:
- Scikit-learn: Simple and efficient tools for data mining and data analysis. Has algorithms implemented in the fields of Preprocessing, Classification, Regression, Clustering, Dimensionality Reduction and Model selection. It is built on the commonly used NumPy and SciPy packages. Scikit-learn is usually the default choice when it comes to Data Mining in Python.
- Pandas: Python Data Analysis Library: Slightly more advanced library than Scikit-learn. Has a very good API. Pandas introduces some useful data structures, such as "dataframes". However, Pandas doesn't provide all of the predictive modelling tools. Pandas is used when more control is needed when working directly on raw data.
- Orange: The best thing about Orange is that it has a Graphical User Interface. Has quite a comprehensive collection of algorithms for Classification, Clustering and feature selection. It also has add-ons for Bioinformatics and Text mining.
- MLPy: Machine Learning Python: MLPy is a Machine Learning package similar to Scikit-Learn. It has most of the algorithms necessary for Data mining, but is not as comprehensive as Scikit-learn. MLPy can be used for both Python 2 and 3.
Note: Python Package Index: All Python packages can be searched by name or keyword in the Python Package Index.
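
To give a flavor of what working with one of these packages looks like, here is a minimal sketch of a classification experiment in Scikit-learn. It is only an illustration, not a project template: the Iris dataset bundled with Scikit-learn, the decision tree classifier, the 70/30 train/test split, and the function name are all assumptions chosen for this example, and the code requires the scikit-learn package to be installed.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def evaluate_decision_tree(test_size=0.3, seed=0):
    """Train a decision tree on Iris and return (accuracy, test set size)."""
    # Load the feature matrix X and the class labels y.
    X, y = load_iris(return_X_y=True)

    # Hold out part of the data for testing, keeping class proportions equal.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=seed, stratify=y
    )

    # Fit the classifier on the training data only.
    model = DecisionTreeClassifier(random_state=seed)
    model.fit(X_train, y_train)

    # Evaluate on the held-out test data.
    predictions = model.predict(X_test)
    return accuracy_score(y_test, predictions), len(y_test)


if __name__ == "__main__":
    accuracy, n_test = evaluate_decision_tree()
    print(f"Accuracy on {n_test} held-out instances: {accuracy:.3f}")
```

The split, fit, predict, and evaluate steps shown here apply to most Scikit-learn classifiers, so swapping in a different algorithm usually means changing only the model line.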

More detailed descriptions of the assignments and projects will be posted to the course webpage at the appropriate times during the semester.

Showcase

Each student needs to sign up for one of the available showcase topics. The team of students assigned to a showcase topic should identify a real-world, successful application of the data mining topic. This successful data mining story should be about using the corresponding data mining technique to discover novel and useful patterns that made a difference in a certain industry or field in the past 7 years. The application domain is up to the student team (e.g., finance, sports, healthcare, science, ...). The chosen successful data mining story should be discussed with and approved by the professor at least 2 weeks in advance. Then the team should investigate the application in depth, and prepare and deliver a 10-minute in-class presentation describing this application in as much detail as possible, focusing on its data mining aspects. Teams will present their showcases throughout the semester, according to the showcase schedule.

Sample showcases from my previous offerings of this course (note that showcases this semester must be different to those from previous semesters):

CLASS DISCUSSION FORUMS AND MAILING LIST

Class Discussion Forums: The main digital venue for communication outside the classroom will be the CS548 Discussion Forums provided by Canvas. To access these discussion forums, go to Canvas, click "BCB503-CS548-F17-MASTER: KNOWLEDGE DISCOVERY AND DATA MINING" under "My Courses", and then click on "Discussions" in the left-hand sidebar.
Class Mailing List: There is also a mailing list for this class that will be used by the professor for general announcements, but not for class discussions. This mailing list reaches the professor and all the students in the class.

Please check the Canvas CS548 forums and the email sent to the class mailing list regularly throughout the semester so that you don't miss any important course information.