### CS 548 KNOWLEDGE DISCOVERY AND DATA MINING - Fall 2017 Project 4: Clustering

#### PROF. CAROLINA RUIZ

DUE DATE: Thursday November 16th, 2017.
• Slides: Submit via Canvas by 2:00 pm.
• Written report: Hand in a hardcopy by the beginning of class (by 3:59 pm).

### Project Assignment

• Read Chapter 8 of the textbook in great detail.

• Study all the materials posted on the course Lecture Notes:

• THOROUGHLY READ AND FOLLOW THE PROJECT GUIDELINES. These guidelines contain detailed information about how to structure your project, how to prepare your written summary, and how to study for the test.

*** You must use the Project 4 Template provided for your written report. Do not change the template, the page limits or the font size. *** (If you prefer not to use Word, you can copy and paste this format into a different editor, as long as you respect the stated page structure and page limit.)

• Data Mining Technique(s): We will run experiments using the following techniques in Weka and Python.

• Clustering Techniques:
• Simple K-means
• HierarchicalClusterer: Use min (= single link), max (= complete link), and at least one other inter-cluster similarity metric from among: group-average, centroid, and Ward's method.
• Density-based clustering: Use DBSCAN in Python. No need to run density-based clustering in Weka.
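On the Python side, the three families of techniques listed above could be run roughly as in the following sketch. It assumes scikit-learn is the library in use (the assignment only says "Python"), works on synthetic stand-in data rather than the project datasets, and uses placeholder values for `n_clusters`, `eps` and `min_samples` that would need to be tuned in real experiments.

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: two well-separated blobs (NOT the project data).
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.3, size=(20, 2)),
    rng.normal(loc=5.0, scale=0.3, size=(20, 2)),
])

# Standardize so that no single attribute dominates the Euclidean distance.
X_scaled = StandardScaler().fit_transform(X)

# K-means with a placeholder k; in practice, try several values of k.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_scaled)

# Agglomerative clustering with several inter-cluster similarity metrics:
# "single" = min, "complete" = max, "average" = group-average, "ward" = Ward's method.
hier_labels = {
    linkage: AgglomerativeClustering(n_clusters=2, linkage=linkage).fit_predict(X_scaled)
    for linkage in ("single", "complete", "average", "ward")
}

# DBSCAN: eps and min_samples are placeholders. The label -1 marks noise points.
dbscan_labels = DBSCAN(eps=0.5, min_samples=4).fit_predict(X_scaled)

print("k-means cluster sizes:", np.bincount(kmeans_labels))
for linkage, labels in hier_labels.items():
    print(f"{linkage} linkage cluster sizes:", np.bincount(labels))
print("DBSCAN clusters found:", len(set(dbscan_labels) - {-1}))
```

scikit-learn's `AgglomerativeClustering` does not offer centroid linkage; if that metric is chosen, SciPy's `scipy.cluster.hierarchy.linkage` with `method="centroid"` is one possible alternative.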

• Dataset:
• Students in CS548: Use the Wall Street Journal's "Where it Pays to Attend College: Salaries by college, region, and academic major" datasets available at Kaggle.

• Use each of the 3 datasets provided separately in your experiments:

1. degrees-that-pay-back:
• Your guiding question for this dataset should be to determine which undergraduate majors are clustered together.
• Use the following attribute as discrete (= nominal): Undergraduate Major.
• Use all of the remaining attributes as continuous (= numeric).

2. salaries-by-college-type:
• Your guiding questions for this dataset should be to determine: (1) which school types are clustered together, and (2) which school names are clustered together.
• Use the following attributes as discrete (= nominal): School Name and School Type. Do not include School Type in the distance calculations, but keep it as a label so that you can analyze the results of the clusterings with respect to School Type.
• Use all of the remaining attributes as continuous (= numeric).

3. salaries-by-region:
• Your guiding questions for this dataset should be to determine: (1) which regions are clustered together, and (2) which school names are clustered together.
• Use the following attributes as discrete (= nominal): School Name and Region. Do not include Region in the distance calculations, but keep it as a label so that you can analyze the results of the clusterings with respect to Region.
• Use all of the remaining attributes as continuous (= numeric).

• Another general guiding question should be: What schools are deemed similar to WPI by these clusterings?
Also, compare the results of the clusterings across the three datasets and the different experiments to the extent possible.

• Students in BCB503: Use a biological or biomedical dataset for this project. You may consider using the dataset that you collected for Project 3, or another dataset. Please discuss your choice of dataset with the professor before running experiments.
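For the CS548 datasets, the instruction to keep a label attribute out of the distance calculations while still using it to analyze the clusters might look like the sketch below. The rows are made up and the column names are assumptions for illustration only; in a real experiment the frame would come from reading the downloaded CSV file. The sketch also assumes that salary values may be stored as currency strings, and it leaves School Name out of the distance calculations as well, since it is a unique identifier (an assumption of this sketch, not a requirement stated above).

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Made-up rows with hypothetical column names (NOT the real data).
df = pd.DataFrame({
    "School Name": ["School A", "School B", "School C", "School D"],
    "School Type": ["Engineering", "Engineering", "Liberal Arts", "Liberal Arts"],
    "Starting Median Salary": ["$70,000.00", "$68,000.00", "$45,000.00", "$44,000.00"],
    "Mid-Career Median Salary": ["$120,000.00", "$118,000.00", "$80,000.00", "$82,000.00"],
})

# Label attributes are kept in df but excluded from the clustering input.
label_cols = ["School Name", "School Type"]

# Strip "$" and "," so the remaining attributes become numeric.
numeric = df.drop(columns=label_cols).apply(
    lambda col: pd.to_numeric(col.str.replace(r"[$,]", "", regex=True))
)

X = StandardScaler().fit_transform(numeric)
df["cluster"] = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Analyze the clusters with respect to the held-out label.
crosstab = pd.crosstab(df["cluster"], df["School Type"])
print(crosstab)

# Which schools share a cluster with a school of interest (e.g., WPI in the real data)?
target = "School A"  # placeholder name
target_cluster = df.loc[df["School Name"] == target, "cluster"].iloc[0]
peers = df.loc[
    (df["cluster"] == target_cluster) & (df["School Name"] != target), "School Name"
].tolist()
print(f"Schools clustered with {target}:", peers)
```

The same pattern applies to the salaries-by-region dataset with Region in place of School Type.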

• Performance Metric(s): A major part of this project is to find meaningful ways of evaluating and interpreting the resulting clusters. Devise a variety of approaches to do so, including but not limited to:
• visualization (MDS and others) of the resulting clusters (Weka provides only very basic visualizations; Python provides more advanced visualizations, as do R and Matlab);
• inspection of the clusters' members to find similarities among data instances in a cluster and dissimilarities among data instances in different clusters; and
• use of clustering-specific performance metrics described in the textbook. Include metrics like purity and normalized mutual information (NMI). For experiments in which you use evaluation metrics that analyze the resulting clusters with respect to a target attribute (e.g., School Name), don't include this attribute among the input attributes used for clustering.
The deeper your analysis, the better your project grade. You may consider extending Weka's and Python's existing code to provide the evaluation/interpretation functionality you need.
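As a starting point for the clustering-specific metrics mentioned above, here is a minimal sketch of purity and NMI. It assumes scikit-learn is available; `purity` is a helper written for this illustration (scikit-learn has no built-in purity function), and the label vectors are made up.

```python
from sklearn.metrics import normalized_mutual_info_score
from sklearn.metrics.cluster import contingency_matrix


def purity(true_labels, cluster_labels):
    """Fraction of instances that belong to the majority class of their cluster."""
    # Rows of the table are classes, columns are clusters.
    table = contingency_matrix(true_labels, cluster_labels)
    return table.max(axis=0).sum() / table.sum()


# Made-up target attribute values and cluster assignments for six instances.
true_labels = ["Engineering", "Engineering", "Engineering",
               "Liberal Arts", "Liberal Arts", "State"]
cluster_labels = [0, 0, 1, 1, 1, 2]

purity_score = purity(true_labels, cluster_labels)
nmi_score = normalized_mutual_info_score(true_labels, cluster_labels)

print(f"purity = {purity_score:.4f}")  # 5 of 6 instances match their cluster's majority class
print(f"NMI    = {nmi_score:.4f}")
```

Purity tends to increase as the number of clusters grows (it reaches 1 when every instance is its own cluster), so it is usually reported together with NMI or compared only across clusterings with the same number of clusters.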