Due Time: Canvas Submission by Tuesday, April 4 at 5:30 pm.
Test Instructions:
- Work on all the parts of this test by yourself, without help from anyone else
and without online help other than the Matlab documentation.
- Submission:
Submit a zipped file via Canvas ("Test 3" under "Assignments") by 5:30 pm.
The zipped file must contain the following files:
- a file with your answers to the test problems, including visualizations,
using the Test 3 Report Template provided.
You can use Word or another editor, but you must submit either a .docx file or a .pdf file;
- recommended but not required: one or more .m files with the Matlab code/instructions/functions that you use to solve those problems; and
- a text file with the transcript of your Matlab session:
- In Matlab:
Use the diary command, which saves your session into a text file. At the beginning of the session, type diary('filename'), where filename = <your first name_your last name_diary>.txt
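As a minimal sketch of the diary workflow described above (the name Jane_Doe_diary.txt is a hypothetical placeholder for your own file name):

```matlab
% Hypothetical file name; substitute your own first and last name.
logFile = 'Jane_Doe_diary.txt';

diary(logFile);                   % start saving Command Window text to logFile
disp('Test 3 session started');   % anything displayed from here on is logged
diary off;                        % stop logging so the file is written out
```

If the file already exists, diary appends to it, so logging can be switched off and on again during the same session without losing earlier output.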
- Dataset:
Use the Breast Cancer Wisconsin (Diagnostic) Data Set available at the
UCI Machine Learning Repository.
- Note that there are different variations of this dataset on the dataset website. You need to use the following WDBC files from the dataset's Data Folder:
- Remove the ID number attribute.
- Use the Diagnosis (M = malignant, B = benign) as the prediction target.
Replace "M" with 1 and "B" with 0 (i.e., zero).
- Problems:
- Start Matlab. Start saving your history with the diary command as
described above.
- [5 minutes] Load the dataset into Matlab.
- Remember to remove the ID number attribute
and to replace "M" with 1 and "B" with 0 (i.e., zero) in the Diagnosis attribute,
either before or after loading the dataset into Matlab.
- Perform any data preprocessing you deem necessary.
It is preferable that you do all your preprocessing in Matlab,
but if you need to, you may use Excel to preprocess the dataset before loading it into Matlab.
No other software package is allowed.
Include in your report a description of what preprocessing you do, if any,
and what functions you use for this.
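One possible way to carry out the loading and recoding steps is sketched below. It assumes a comma-separated file with no header row in which column 1 is the ID number, column 2 is the diagnosis, and the remaining columns are numeric. So that the sketch runs on its own, it first writes two made-up rows to a temporary file with the hypothetical name wdbc_demo.data. These values are placeholders, not real data.

```matlab
% Write two made-up rows in the assumed layout (ID, diagnosis, features)
% to a temporary file. File name and values are illustrative assumptions.
fname = fullfile(tempdir, 'wdbc_demo.data');
fid = fopen(fname, 'w');
fprintf(fid, '1001,M,17.5,10.2,120.0\n');
fprintf(fid, '1002,B,12.1,18.4,78.0\n');
fclose(fid);

% Read the comma-separated file, treating the first line as data.
T = readtable(fname, 'FileType', 'text', 'Delimiter', ',', ...
              'ReadVariableNames', false);

T(:, 1) = [];                       % remove the ID number attribute
y = double(strcmp(T{:, 1}, 'M'));   % Diagnosis: "M" -> 1, "B" -> 0
X = T{:, 2:end};                    % remaining numeric attributes as a matrix

% One optional preprocessing step: standardize each column
% (zscore is part of the Statistics and Machine Learning Toolbox).
Xz = zscore(X);
```

To use the real data, point fname at the downloaded WDBC data file instead of writing the demo rows. Whether to standardize is a judgment call that should be described in the report.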
- [5 points, 5 minutes] Stratified Random Sampling:
Use stratified random sampling to split the dataset into
two parts:
75% (approx. 427 data instances) for training and
the remaining (approx. 142 data instances) for testing.
Include in your report what functions/instructions you use for this.
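A stratified 75/25 split can be sketched with cvpartition from the Statistics and Machine Learning Toolbox, which stratifies by class when it is given the label vector. The labels below are a synthetic stand-in (569 rows with an assumed 212/357 class mix), not the real data.

```matlab
rng(1);                                  % fixed seed so the split is repeatable

% Stand-in labels (assumption): 1 = malignant, 0 = benign.
y = [ones(212, 1); zeros(357, 1)];

% Hold out 25% for testing; passing the labels makes the split stratified.
c = cvpartition(y, 'HoldOut', 0.25);
trainIdx = training(c);                  % logical index of training rows
testIdx  = test(c);                      % logical index of test rows

fprintf('train = %d, test = %d\n', sum(trainIdx), sum(testIdx));
fprintf('share of class 1: train = %.3f, test = %.3f\n', ...
        mean(y(trainIdx)), mean(y(testIdx)));
```

Printing the class share in each part is a quick check that the stratification worked: the two proportions should be nearly equal.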
- [50 points, 30 minutes] Classification Experiments:
- (40 points) Run experiments with each of the following classification methods.
Collect appropriate/relevant evaluation metrics including accuracy, runtime,
size of the model (in the case of decision trees), precision, recall,
and coefficient of determination (R2).
- k-Nearest Neighbors. Run at least one experiment with k = 5.
- Decision Trees.
- Artificial Neural Networks.
- Support Vector Machines. Run at least one experiment without kernels and at least
one experiment using a polynomial kernel of degree 2 (i.e., quadratic kernel).
- Include in your report the Matlab functions, code, and parameter
values that you used for each of the experiments.
For the Artificial Neural Networks, state the architecture that you used,
the number of hidden layers, and the number of hidden units on each layer.
- Re-run experiments using different parameter values, settings, and/or different data
preprocessing until you are satisfied with the results and/or you run out of
time for this part.
- (10 points) In your written report, summarize all results of your experiments in a table
so that it is easy to compare results of experiments varying parameters for the same
classification method, and results across different classification methods.
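The experiment loop and summary table might be organized along the following lines. This is only a sketch under stated assumptions: the data are synthetic stand-ins, the parameter values are illustrative, the neural network experiment is left out, and R2 is computed here as 1 - SSE/SST on the 0/1 labels, which is one of several conventions.

```matlab
rng(1);                                        % fixed seed for repeatability

% Synthetic stand-in data (assumption): 569 rows, 30 features, 0/1 labels.
y = [ones(212, 1); zeros(357, 1)];
X = randn(569, 30) + 0.5 * y;                  % shift class 1 so the classes differ
c = cvpartition(y, 'HoldOut', 0.25);
Xtr = X(training(c), :);  ytr = y(training(c));
Xte = X(test(c), :);      yte = y(test(c));

% One label and one training function per experiment.
Method = {'kNN, k = 5'; 'Decision tree'; 'SVM, linear'; 'SVM, quadratic'};
trainers = cell(4, 1);
trainers{1} = @(A, b) fitcknn(A, b, 'NumNeighbors', 5, 'Standardize', true);
trainers{2} = @(A, b) fitctree(A, b);
trainers{3} = @(A, b) fitcsvm(A, b, 'KernelFunction', 'linear', ...
                              'Standardize', true);
trainers{4} = @(A, b) fitcsvm(A, b, 'KernelFunction', 'polynomial', ...
                              'PolynomialOrder', 2, 'Standardize', true);

n = numel(Method);
Accuracy = zeros(n, 1);  Precision = zeros(n, 1);  Recall = zeros(n, 1);
R2 = zeros(n, 1);        Seconds = zeros(n, 1);    Nodes = nan(n, 1);

for i = 1:n
    tic;
    mdl  = trainers{i}(Xtr, ytr);              % train on the training set
    yhat = predict(mdl, Xte);                  % predict on the test set
    Seconds(i) = toc;                          % training plus prediction time

    C = confusionmat(yte, yhat, 'Order', [0 1]);   % rows = true, columns = predicted
    TP = C(2, 2);  FP = C(1, 2);  FN = C(2, 1);
    Accuracy(i)  = (C(1, 1) + TP) / sum(C(:));
    Precision(i) = TP / (TP + FP);
    Recall(i)    = TP / (TP + FN);
    R2(i) = 1 - sum((yte - yhat).^2) / sum((yte - mean(yte)).^2);

    if isa(mdl, 'ClassificationTree')
        Nodes(i) = mdl.NumNodes;               % model size, decision trees only
    end
end

results = table(Method, Accuracy, Precision, Recall, R2, Nodes, Seconds);
disp(results);
```

Keeping one row per experiment makes it straightforward to add further rows for other parameter values and to paste the table into the report.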
- [15 points, 15 minutes] Regression Experiments:
Use Matlab functions to construct regression trees over the dataset
using the same training and testing sets as in the classification experiments above,
but using the target attribute as numeric rather than nominal.
Run at least 5 different experiments varying parameter values, including pruning.
Show the results of all of your experiments neatly organized in a table
showing parameter values,
Sum of Square Errors (SSE),
Root Mean Square Error (RMSE),
Relative Square Error (RSE),
Coefficient of Determination (R2),
size of the tree (number of nodes and/or number of leaves), and runtime.
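One regression-tree experiment and its error measures could be computed as sketched below. The data are synthetic stand-ins and the MinLeafSize and pruning level are illustrative choices. RSE is taken here as SSE divided by the total sum of squares around the test-set mean, and R2 as 1 - RSE. Both are assumed conventions, so state the definitions you use in your report.

```matlab
rng(1);                                        % fixed seed for repeatability

% Synthetic stand-in data (assumption): 569 rows, 30 features, numeric 0/1 target.
y = [ones(212, 1); zeros(357, 1)];
X = randn(569, 30) + 0.5 * y;
c = cvpartition(y, 'HoldOut', 0.25);
Xtr = X(training(c), :);  ytr = y(training(c));
Xte = X(test(c), :);      yte = y(test(c));

tic;
tree = fitrtree(Xtr, ytr, 'MinLeafSize', 5);   % illustrative parameter value
level = min(1, max(tree.PruneList));           % prune one level if available
pruned = prune(tree, 'Level', level);
yhat = predict(pruned, Xte);
seconds = toc;

SSE  = sum((yte - yhat).^2);                   % sum of squared errors
RMSE = sqrt(SSE / numel(yte));                 % root mean squared error
SST  = sum((yte - mean(yte)).^2);              % total sum of squares
RSE  = SSE / SST;                              % relative squared error
R2   = 1 - RSE;                                % coefficient of determination

nNodes  = pruned.NumNodes;                     % all nodes, branches and leaves
nLeaves = sum(~pruned.IsBranchNode);           % leaf nodes only

fprintf('SSE = %.3f, RMSE = %.3f, RSE = %.3f, R2 = %.3f\n', SSE, RMSE, RSE, R2);
fprintf('nodes = %d, leaves = %d, runtime = %.3f s\n', nNodes, nLeaves, seconds);
```

Repeating this with other values of MinLeafSize and other pruning levels gives the separate rows of the results table.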
- [10 points, 5 minutes] Experiments with PCA Preprocessing:
Apply Principal Components Analysis to the full data set
(i.e., training and test sets together).
Show the results in your report.
What is the minimum number of components needed to capture at least 95% of the variance?
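The number of components needed for 95% of the variance can be read off the cumulative sum of the explained output of pca, as in the sketch below. The data matrix is a synthetic stand-in, and standardizing with zscore before PCA is an assumed choice that changes the answer, so report whichever option you use.

```matlab
rng(1);                                   % fixed seed for repeatability

% Synthetic stand-in data (assumption): 569 rows, 30 features.
X = randn(569, 30) * randn(30, 30);       % mix the columns so they are correlated

% explained(i) is the percentage of total variance captured by component i.
[coeff, score, ~, ~, explained] = pca(zscore(X));

cumExplained = cumsum(explained);         % cumulative percentage of variance
k95 = find(cumExplained >= 95, 1);        % first component count reaching 95%

fprintf('Components needed for at least 95%% of the variance: %d\n', k95);
```

Here coeff holds the component loadings and score holds the data projected onto the components. score(:, 1:k95) could serve as a reduced feature set.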
- [15 points, 10 minutes] Analysis and Comparison of Results:
Analyze the results of your experiments.
Describe any interesting aspects of your results,
comparing results across different experiments and different machine learning techniques used in this test.
- [10 points, 10 minutes] Visualization:
Include in your report two Matlab plots that
illustrate interesting aspects of your results.
At least one of these plots should be a visualization of a model you constructed in your
experiments. Make sure to explain your plots in your report and analyze them.
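As one hedged illustration of a model visualization, a fitted decision tree can be drawn with view, and its predictor importance estimates can be shown as a bar chart. The data below are synthetic stand-ins, and these two plots are only examples of what might count as interesting.

```matlab
rng(1);                                   % fixed seed for repeatability

% Synthetic stand-in data (assumption): 569 rows, 30 features, 0/1 labels.
y = [ones(212, 1); zeros(357, 1)];
X = randn(569, 30) + 0.5 * y;

tree = fitctree(X, y);

% Plot 1: the model itself, drawn as a graph in a separate viewer window.
view(tree, 'Mode', 'graph');

% Plot 2: how much each predictor contributes to the splits in the tree.
imp = predictorImportance(tree);
figure;
bar(imp);
xlabel('Predictor index');
ylabel('Importance estimate');
title('Decision tree predictor importance');
```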
- [3 minutes] Submit your files on Canvas using the Submission directions given above.