WPI Worcester Polytechnic Institute

Computer Science Department
------------------------------------------

CS539 Machine Learning 
Homework 3 - Spring 2017

PROF. CAROLINA RUIZ 

Due Date: Tuesday, April 04, 2017 
------------------------------------------

HW Instructions


Section A: Trees (80 points)

Dataset: For this part of the project, you will use the
Adult Dataset (use the adult.data file) available at the UCI Machine Learning Repository.
Carefully read the description provided for this dataset and familiarize yourself with the dataset as much as possible.

  1. Classification Trees:
    For classification, use the attribute ">50K, <=50K" as the target.
    1. (20 points) Use Matlab functions to construct decision trees over the dataset using 4-fold cross-validation. Briefly describe in your report the functions that you use and their parameters. Run at least 5 different experiments varying parameter values. Repeat the same experiments but now using pruning. Show the results of all of your experiments neatly organized in a table showing parameter values, classification accuracy, size of the tree (number of nodes and/or number of leaves), and runtime.
    2. (10 points) Select the pruned tree with smallest size. Use Matlab plotting functions to depict the tree. Include the plot in your report (or at least the top levels if the tree is too large). Briefly comment on any interesting aspect of this tree.
    3. (10 points) Research what the random forest technique does. Describe this technique briefly in your report, including what the inputs to this technique are, what it outputs, and how it constructs its output.
    4. (10 points) Also state which Matlab function constructs random forests. Run at least 5 different experiments varying parameter values. Show the results of your experiments neatly organized in a table showing parameter values, classification accuracy, size of the random forest, and runtime.
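
For reference, the 4-fold cross-validation procedure used in the experiments above can be sketched as follows. This is an illustrative Python sketch, not the Matlab code the assignment asks for. The function names, the `seed` parameter, and the majority-label learner are hypothetical placeholders; any real tree or forest learner would take the place of `train_majority` and `predict_majority`.

```python
import random

def k_fold_indices(n, k=4, seed=0):
    """Split indices 0..n-1 into k shuffled, nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    # Every k-th index goes to the same fold, so fold sizes differ by at most 1.
    return [idx[i::k] for i in range(k)]

def cross_validate(X, y, train_fn, predict_fn, k=4, seed=0):
    """Return the mean test accuracy over k folds (assumes len(X) >= k)."""
    folds = k_fold_indices(len(X), k, seed)
    accuracies = []
    for i, test_idx in enumerate(folds):
        # Train on every fold except fold i, then test on fold i.
        train_idx = [j for f_i, f in enumerate(folds) if f_i != i for j in f]
        model = train_fn([X[j] for j in train_idx], [y[j] for j in train_idx])
        correct = sum(predict_fn(model, X[j]) == y[j] for j in test_idx)
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / k

# Hypothetical stand-in learner: always predict the majority training label.
def train_majority(X_train, y_train):
    return max(set(y_train), key=y_train.count)

def predict_majority(model, x):
    return model
```

Each instance is used for testing exactly once and for training in the other three folds, so the reported accuracy is an average over four disjoint test sets.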

  2. Regression Trees:
    For regression, use the attribute "education-num" as the target.
    1. (20 points) Use Matlab functions to construct regression trees over the dataset using 4-fold cross-validation. Briefly describe in your report the functions that you use and their parameters. Run at least 5 different experiments varying parameter values. Repeat the same experiments but now using pruning. Show the results of all of your experiments neatly organized in a table showing parameter values, Sum of Squared Errors (SSE), Root Mean Squared Error (RMSE), Relative Squared Error (RSE), Coefficient of Determination (R^2), size of the tree (number of nodes and/or number of leaves), and runtime.
    2. (10 points) Select the pruned tree with smallest size. Use Matlab plotting functions to depict the tree. Include the plot in your report (or at least the top levels if the tree is too large). Briefly comment on any interesting aspect of this tree.
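
The four error measures requested for the regression experiments can be computed from the true and predicted target values. The Python sketch below is for illustration only and assumes the usual textbook definitions (RSE as the ratio of the model's squared error to that of always predicting the mean, and R^2 = 1 - RSE); if the course defines them differently, use the course's definitions. The function name `regression_metrics` is hypothetical.

```python
import math

def regression_metrics(y_true, y_pred):
    """Compute SSE, RMSE, RSE and R^2 under the standard definitions (an assumption)."""
    n = len(y_true)
    mean_y = sum(y_true) / n
    # Squared error of the model's predictions.
    sse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Squared error of the baseline that always predicts the mean target.
    sst = sum((t - mean_y) ** 2 for t in y_true)
    rmse = math.sqrt(sse / n)
    rse = sse / sst          # undefined if all targets are identical
    r2 = 1.0 - rse
    return {"SSE": sse, "RMSE": rmse, "RSE": rse, "R2": r2}
```

An RSE below 1 (equivalently, a positive R^2) means the tree predicts "education-num" better than the constant mean predictor does.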

  3. Homework Problems:
    These homework problems are for you to study this topic. You do NOT need to submit your solutions.
    Chapter 9 Exercises 1, 2, 4, 6, 8, 9, 10 of the textbook (pp. 235-236).

Section B: Artificial Neural Networks and Deep Learning (50 points)

Dataset: For this part of the project, you will use the OptDigit Dataset available at the UCI Machine Learning Repository.

  1. Classification using Artificial Neural Networks (ANNs):
    Use Matlab functions to construct and train ANNs over optdigits.tra and then test them over optdigits.tes.

    Topology of your Neural Net:


    Experiments:
    1. (5 points) Briefly describe in your report the Matlab functions that you use and their parameters.
    2. (5 points) Explain also how many nodes you use on the output layer, and how you use the output from the output node(s) to assign a classification label to a test instance.
    3. (35 points) Run at least 10 different experiments varying parameter values. Show the results of all of your experiments neatly organized in a table showing parameter values, number of hidden nodes in each layer, classification accuracy, and runtime.
    4. (5 points) Pick the experiment that you think produced the best result. Justify your choice in your report. Include the confusion matrix for this experiment. See what misclassifications are most common and elaborate on your observations.
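
As an illustration of the last two steps, the Python sketch below turns output-layer activations into a class label and tallies a confusion matrix. It assumes one output node per digit class with the label taken from the largest activation, which is one common design and not necessarily the one you should adopt. All function names are hypothetical, and this is not Matlab code.

```python
def predict_label(outputs):
    """Return the index of the largest output activation (one node per class assumed)."""
    return max(range(len(outputs)), key=lambda i: outputs[i])

def confusion_matrix(y_true, y_pred, n_classes=10):
    """cm[t][p] counts test instances of true class t that were predicted as class p."""
    cm = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm

def most_common_errors(cm, top=3):
    """List the largest off-diagonal cells as (count, true_class, predicted_class)."""
    errors = [
        (cm[t][p], t, p)
        for t in range(len(cm))
        for p in range(len(cm))
        if t != p and cm[t][p] > 0
    ]
    return sorted(errors, reverse=True)[:top]
```

The diagonal of the matrix holds the correct classifications; the off-diagonal cells returned by `most_common_errors` show which pairs of digits are confused most often.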

  2. Deep Learning:
    1. Read the following article: Yann LeCun, Yoshua Bengio, Geoffrey Hinton. "Deep learning". Nature 521, 436-444 (28 May 2015) doi:10.1038/nature14539.
    2. Watch one of the following videos about deep learning (if you have time try to watch both). You're not expected to understand all the details, but try to get from the videos some of the theoretical foundations of deep learning and some of its applications.
      1. "Deep Learning" by Ruslan Salakhutdinov from the collection of Deep Learning Summer School, Montreal 2015
      2. "Recent developments on Deep Learning" Geoffrey Hinton's Google Tech Talk, March 2010.
      Links to both videos (and several others) are available at deeplearning.net tutorials.

  3. Homework Problems:
    These homework problems are for you to study this topic. You do NOT need to submit your solutions.

Section C: Support Vector Machines (50 points)

Dataset: For this part of the project, you will use the
Adult Dataset (use the adult.data file) available at the UCI Machine Learning Repository.

  1. Classification using Support Vector Machines (SVMs):
    For classification, use the attribute ">50K, <=50K" as the target.
    1. (9 points) Use Matlab functions to construct a support vector machine over the dataset using 4-fold cross-validation. Briefly describe in your report the functions that you use and their parameters.
    2. (36 points) Run at least 12 different experiments varying parameter values for each of the following kernel functions (run at least 4 experiments for each one of the 3 kernel functions required):
      • polynomial (including linear, quadratic, ...)
      • radial-basis functions (Gaussian)
      • sigmoid (tanh)
      Show the results of all of your experiments neatly organized in a table showing kernel function used, parameter values, classification accuracy, and runtime.
    3. (5 points) Pick the experiment that you think produced the best result. Justify your choice in your report. Use Matlab functionality to plot a 2 or 3 dimensional depiction of data instances in each of the two classes, support vectors, and the decision boundary.
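
The three kernel families listed above can be written out directly, which helps when deciding which parameters to vary. The Python sketch below is for illustration only: the parameter names `degree`, `coef0`, and `gamma` follow a common convention and are assumptions here, so check the Matlab documentation for how each kernel and its parameters are actually specified.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def polynomial_kernel(u, v, degree=2, coef0=1.0):
    """(u.v + coef0)^degree; degree=1 with coef0=0 gives the linear kernel."""
    return (dot(u, v) + coef0) ** degree

def rbf_kernel(u, v, gamma=0.5):
    """Gaussian kernel exp(-gamma * ||u - v||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * sq_dist)

def sigmoid_kernel(u, v, gamma=0.1, coef0=0.0):
    """tanh(gamma * u.v + coef0)."""
    return math.tanh(gamma * dot(u, v) + coef0)
```

Under these definitions the natural parameters to vary are the degree and offset for the polynomial kernel, the width `gamma` for the Gaussian kernel, and the slope and offset for the sigmoid kernel.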

  2. Homework Problems:
    These homework problems are for you to study this topic. You do NOT need to submit your solutions.
    Chapter 13 Exercises 1 and 2 of the textbook (pp. 382-383).