WPI Worcester Polytechnic Institute

Computer Science Department
------------------------------------------

CS539 Machine Learning 
Homework 1 - Spring 2017

PROF. CAROLINA RUIZ 

Due Date: Thursday, February 9th, 2017 
------------------------------------------

HW Instructions


Section A: Exercises from the Textbook (75 points)


Section B: Univariate Data (175 points + bonus points)

Important: When you are asked to randomly generate data, make sure to record the random seed used for the generation so that you can reproduce your experiments later.

  1. Data Generation:
    (5 points) Randomly generate a dataset X with N=1000 consisting of one attribute normally distributed with mean=60 and standard deviation=8.
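  The generation step can be sketched as follows. This is a minimal illustration in Python/NumPy rather than Matlab (the assignment allows other languages); the seed value 42 is an arbitrary choice, recorded so the run can be reproduced.

```python
import numpy as np

# Record the random seed so the experiment can be reproduced later.
seed = 42  # arbitrary fixed choice
rng = np.random.default_rng(seed)

# Dataset X: N=1000 instances of one attribute ~ N(mean=60, sd=8)
X = rng.normal(loc=60, scale=8, size=1000)
```

  The Matlab equivalent would combine `rng(seed)` with `normrnd(60, 8, 1000, 1)`.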

  2. MLE:
    1. (10 points) Use formulas (4.8) on p. 68 to find the Maximum Likelihood Estimation (MLE) of the sample distribution parameters (mean and standard deviation) directly from the sample. Show your work in the report.
    2. (10 points) Use the Maximum Likelihood Estimation (MLE) function provided by Matlab to calculate these parameter values from X. Do these parameter values coincide with the ones you found directly from the formulas above? Explain.
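  The MLE formulas above can be checked numerically. The Python/NumPy sketch below (seed 42 is a hypothetical choice) computes the estimates directly and confirms they match a library computation; note that the MLE variance divides by N, not N-1.

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(60, 8, 1000)

# MLE for a univariate Gaussian (formulas (4.8) in the textbook):
#   m   = (1/N) * sum_t x_t
#   s^2 = (1/N) * sum_t (x_t - m)^2   <- divides by N, not N-1
m = X.mean()
s2 = np.mean((X - m) ** 2)
s = np.sqrt(s2)

# np.std with its default ddof=0 computes exactly this MLE,
# so the direct formulas and the library estimate coincide.
assert np.isclose(s, X.std(ddof=0))
print(m, s)
```

  Whether a given library routine returns the /N or the /(N-1) standard deviation is worth checking in its documentation when answering part 2.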

  3. MAP and Bayes' Estimator:
    In this part, you will look at the Maximum A Posteriori (MAP) and Bayes' estimator to estimate the parameter values of the sample X above. Assume that the collection of all these possible parameter value estimates is also distributed normally. That is, X ~ N(θ, σ^2) and θ ~ N(μ0, σ0^2). Assume that σ=8, μ0=60, σ0=3.
    1. (10 points) Calculate the MAP estimate and the Bayes' estimate of the mean value used to generate data sample X. Are the MAP estimate and the Bayes' estimate the same in this case? Why or why not?
    2. (5 points) Should the MAP estimate in this case be the same as the mean estimated by MLE? Why or why not?

  4. Classification:
    1. (5 points) Randomly generate 3 normally distributed samples, each consisting of just one attribute as follows:
      • Sample 1: number of instances: 500, mean=60 and standard deviation=8.
      • Sample 2: number of instances: 300, mean=30 and standard deviation=12.
      • Sample 3: number of instances: 200, mean=80 and standard deviation=4.
      Create a dataset X that consists of these 3 samples, where data instances in Sample i above belong to class Ci, for i=1, 2, 3.
    2. (10 points) Following the material presented in Section 4.5 of the textbook, define a precise discriminant function gi for each class Ci. Remember to apply MLE to estimate the parameters of each of the classes. Show your work.
    3. (5 points) Based on these discriminant functions, what would be the chosen class for each of the following inputs: x = 10, 30, 50, 70, 90. Show your work.
    4. (15 points) Find analytically (i.e., by hand algebraically) the "decision thresholds" (see Fig. 4.2 p. 75) for these 3 classes.
    5. (5 points) Implement each of these 3 discriminant functions gi as a new function in Matlab.
    6. (5 points) Based on these 3 functions, implement a "decision" function that receives a number x as its input and outputs i, where i is the chosen class for input x. Test your function on inputs: x = 10, 30, 50, 70, 90. Show the results in your report.
    7. (5 points) Use your decision function on inputs: x = 0, 0.5, 1, 1.5, ..., 99, 99.5, 100. Do the "decision thresholds" you calculated analytically coincide with the results of this test? Explain.
    8. (10 points) Generate a pair of plots like those in Fig. 4.2 for this particular dataset.
    9. (10 points) Use stratified random sampling to split your dataset into 2 parts: a training set (with 60% of the data instances) and a validation set (with the remaining 40% of the data instances). Test the "decision" function that you implemented on part 6 above on the validation set. Report the accuracy and the confusion matrix of your decision function, as well as the precision and the recall of your decision function for each of the three classes.
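  Parts 1, 2, 5, and 6 above can be sketched together as follows. This is an illustrative Python/NumPy version (the assignment asks for Matlab); the seed is a hypothetical choice, the class priors are estimated from the sample sizes, and the shared constant -log sqrt(2π) is dropped from the log-discriminants since it does not affect the argmax.

```python
import numpy as np

rng = np.random.default_rng(42)

# (n_instances, mean, sd) for classes C1, C2, C3
specs = [(500, 60, 8), (300, 30, 12), (200, 80, 4)]
samples = [rng.normal(mu, sd, n) for n, mu, sd in specs]

priors = [n / 1000 for n, _, _ in specs]     # P(Ci) from sample sizes
means = [s.mean() for s in samples]          # MLE of each class mean
stds = [s.std(ddof=0) for s in samples]      # MLE of each class sd

def g(i, x):
    """Log-discriminant g_i(x) = log p(x|C_i) + log P(C_i),
    with the shared -log sqrt(2*pi) constant dropped."""
    return (-np.log(stds[i])
            - (x - means[i]) ** 2 / (2 * stds[i] ** 2)
            + np.log(priors[i]))

def decide(x):
    """Return i (1-based) maximizing g_i(x)."""
    return 1 + int(np.argmax([g(i, x) for i in range(3)]))

for x in [10, 30, 50, 70, 90]:
    print(x, decide(x))
```

  Sweeping `decide` over x = 0, 0.5, ..., 100 and recording where the output changes gives the empirical decision thresholds to compare against the analytical ones from part 4.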

  5. Regression:
    1. (10 points) Create a dataset consisting of one input and one output as follows. For the input, use the dataset X you generated in part 1 above with N=1000, mean=60 and standard deviation=8. For the output, use r = f(x) + ε where f(x) = 2 sin(1.5x), and the noise ε ~ N(μ=0, σ^2=1), as in the example in Sections 4.6-4.8, pp. 77-87.
    2. (5 points) Use random sampling to split your dataset into 2 parts: a training set (with 60% of the data instances) and a validation set (with the remaining 40% of the data instances).
    3. (10 points) Create three 2-dimensional plots: one for the entire dataset X, one for the training set, and one for the validation set. In each of these plots, the x axis corresponds to the input variable x, and the y axis corresponds to the output (response) variable r.
    4. (15 points) Create 5 different regression models over the training set using the regression functionality provided by Matlab:
      gk(x | wk, ..., w0) = wk x^k + ... + w1 x + w0, for k = 0, 1, 2, 3, 4. Report the obtained coefficients in your written report.
    5. (15 points) Create two 2-dimensional plots: one containing the training set and the 5 fitting curves, and one containing the validation set and the 5 fitting curves obtained over the training set. In each of these plots, the x axis corresponds to the input variable x, and the y axis corresponds to the output (response) variable r.
    6. (10 points) Evaluate each of the 5 regression models over the validation set. Report the Sum of Square Errors (SSE), the Root Mean Square Error (RMSE), the Relative Square Error (RSE), and the Coefficient of Determination (R2) of each regression model over the validation set. If the programming language you are using reports AIC, BIC, and/or log likelihood values, include these values in your report too. Based on these error measures, which model would you pick among the five regression models? Explain.
    7. (Bonus points) See if the regression functionality in Matlab allows the use of the Akaike information criterion (AIC) and/or the Bayesian information criterion (BIC), instead of minimizing SSE, to guide the construction of the regression model. If so, repeat parts 4 and 6 above for AIC and then for BIC. Which of the three approaches produced better results? Explain.
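  Parts 1, 2, 4, and 6 above can be sketched as follows, again as an illustrative Python/NumPy alternative to Matlab's `polyfit`/`fitlm`. The seed is a hypothetical choice; the split is a plain 60/40 random partition.

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(60, 8, 1000)                        # inputs, as in part 1
r = 2 * np.sin(1.5 * x) + rng.normal(0, 1, 1000)   # r = f(x) + noise

# 60/40 random split into training and validation sets
idx = rng.permutation(1000)
tr, va = idx[:600], idx[600:]

results = {}
for k in range(5):
    w = np.polyfit(x[tr], r[tr], k)     # degree-k least-squares fit g_k
    pred = np.polyval(w, x[va])
    err = r[va] - pred
    sse = float(np.sum(err ** 2))
    rmse = float(np.sqrt(sse / len(va)))
    # RSE: SSE relative to predicting the validation-set mean
    rse = float(sse / np.sum((r[va] - np.mean(r[va])) ** 2))
    r2 = 1.0 - rse
    results[k] = (sse, rmse, rse, r2)
    print(k, w, rmse, r2)
```

  Because sin(1.5x) oscillates many times over the sampled input range, low-degree polynomials cannot track it, so expect R2 near zero for all five models here; the relative ordering of the error measures is what part 6 asks you to discuss.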

Section C: Multivariate Data (155 points + bonus points)

Important: When you are asked to randomly generate data, make sure to record the random seed used for the generation so that you can reproduce your experiments later.

  1. Multivariate Normal Distribution:
    In this part, you will work with randomly generated datasets with N=1000 data instances and d=20 dimensions (attributes). Each dataset will be generated using a multivariate normal distribution with parameters μ (1-by-d vector of means, one for each attribute) and Σ (d-by-d covariance matrix). To simplify the notation, we'll denote μ by "trueMeans" and Σ by "trueSigma".
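  Generating such a dataset can be sketched as below. The particular trueMeans and trueSigma here are hypothetical placeholders (the assignment may specify them differently); the covariance is built as A*A' plus a scaled identity so it is guaranteed symmetric positive definite.

```python
import numpy as np

rng = np.random.default_rng(42)   # hypothetical seed, recorded for reproducibility
N, d = 1000, 20

# Hypothetical parameter choices for illustration only
trueMeans = rng.uniform(0, 10, d)
A = rng.standard_normal((d, d))
trueSigma = A @ A.T + d * np.eye(d)   # symmetric positive definite d-by-d

# N-by-d dataset drawn from N(trueMeans, trueSigma)
X1 = rng.multivariate_normal(trueMeans, trueSigma, size=N)
print(X1.shape)
```

  In Matlab the same draw would be `mvnrnd(trueMeans, trueSigma, N)`.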

  2. Multivariate Classification:
    In this part, you will work with datasets that consist of 2 classes C1 and C2. These datasets will contain N=1800 data instances and d=20 attributes.

  3. Multivariate Regression:
    1. (10 points) Create a dataset consisting of d inputs and one output as follows. For the d inputs, use the multivariate dataset X1 you generated in part 1 above with N=1000, trueMeans and trueSigmaA. For the output, use r = f(x) + ε where f(x) = 3*average(x) - min(x); that is, the output is three times the average of the d input values minus the minimum input value. The noise ε ~ N(μ=0, σ^2=1).
    2. (5 points) Use random sampling to split your dataset into 2 parts: a training set (with 60% of the data instances) and a validation set (with the remaining 40% of the data instances).
    3. (10 points) Create a multivariate linear regression model over the training set using the regression functionality provided by Matlab. Report the obtained regression formula in your written report.
    4. (10 points) Evaluate the regression model over the validation set. Report the Sum of Square Errors (SSE), the Root Mean Square Error (RMSE), the Relative Square Error (RSE), and the Coefficient of Determination (R2) of the regression model over the validation set. If the programming language you are using reports AIC, BIC, and/or log likelihood values, include these values in your report too.
    5. (Bonus points) See if the regression functionality in Matlab allows the use of the Akaike information criterion (AIC) and/or the Bayesian information criterion (BIC), instead of minimizing SSE, to guide the construction of the regression model. If so, repeat part 4 above for AIC and then for BIC. Which of the three approaches produced better results? Explain.
    6. Bias and Variance:
      1. (10 points) Construct 10 new different datasets D1, ..., D10, each consisting of 100 data instances randomly generated with trueMeans and trueSigmaA. For the output, use r = f(x) + ε where f(x) = 3*average(x) - min(x) and the noise ε ~ N(μ=0, σ^2=1), as before.
      2. (10 points) Fit a multivariate linear regression formula gi to each of these datasets.
      3. (10 points) Estimate the bias and the variance using the formulas on slide 24 of Chapter 4 slides (see also Section 4.7 of the textbook). Apply the formulas for bias and variance over the x's in the dataset X1 (together with the output value) that you constructed in part 1 above (hence N=1000 and M=10).
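  The bias/variance procedure above can be sketched end to end. This Python/NumPy illustration uses hypothetical trueMeans/trueSigmaA (as in part C.1); with M=10 fitted models g_1..g_10 and average model ḡ, it computes bias^2 = (1/N) Σ_t (ḡ(x_t) - f(x_t))^2 and variance = (1/(N·M)) Σ_t Σ_i (g_i(x_t) - ḡ(x_t))^2 over the N=1000 points of X1.

```python
import numpy as np

rng = np.random.default_rng(42)
d = 20
trueMeans = rng.uniform(0, 10, d)        # hypothetical parameters
A = rng.standard_normal((d, d))
trueSigmaA = A @ A.T + d * np.eye(d)

def f(X):
    # f(x) = 3*average(x) - min(x), applied row-wise
    return 3 * X.mean(axis=1) - X.min(axis=1)

def fit_linear(X, r):
    # Least-squares fit of r ~ w0 + w.x (column of ones = intercept)
    Xb = np.column_stack([np.ones(len(X)), X])
    w, *_ = np.linalg.lstsq(Xb, r, rcond=None)
    return w

# M=10 small training sets D1..D10, 100 instances each
M = 10
models = []
for _ in range(M):
    D = rng.multivariate_normal(trueMeans, trueSigmaA, size=100)
    rD = f(D) + rng.normal(0, 1, 100)
    models.append(fit_linear(D, rD))

# Evaluate over the N=1000 points of X1
X1 = rng.multivariate_normal(trueMeans, trueSigmaA, size=1000)
X1b = np.column_stack([np.ones(len(X1)), X1])
G = np.array([X1b @ w for w in models])  # M-by-N predictions g_i(x_t)
g_bar = G.mean(axis=0)                   # average model

bias2 = float(np.mean((g_bar - f(X1)) ** 2))
variance = float(np.mean((G - g_bar) ** 2))
print(bias2, variance)
```

  Since f contains the nonlinear min(x) term, a linear model family cannot represent it exactly, so some nonzero bias is expected alongside the variance contributed by the small (100-instance) training sets.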