Section D: Nonparametric Methods (90 + 10 bonus points)
Dataset: For this part of the project, you will use the
OptDigit Dataset available at the
UCI Machine Learning Repository.
- Carefully read the description provided for this dataset and
familiarize yourself with the dataset as much as possible.
- Use the following files:
- optdigits.names
- optdigits.tra: training dataset
- optdigits.tes: test dataset
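As a starting point for reading the files, here is a minimal Python sketch. It assumes the layout described in optdigits.names: comma-separated rows of 64 integer pixel attributes followed by the class label in the last column. The function name `load_optdigits` and the two sample rows are made up for illustration; the same parsing logic carries over to Matlab.

```python
import csv
import io

def load_optdigits(lines):
    """Parse comma-separated rows into (features, labels).

    Assumption: every row is 64 integer attributes followed by the class label.
    """
    features, labels = [], []
    for row in csv.reader(lines):
        if not row:  # skip blank lines
            continue
        values = [int(v) for v in row]
        features.append(values[:-1])
        labels.append(values[-1])
    return features, labels

# Two made-up rows in the assumed layout (not real data).
sample = io.StringIO(
    ",".join(["0"] * 63 + ["5"]) + ",7\n"
    + ",".join(["1"] * 64) + ",3\n"
)
X, y = load_optdigits(sample)
```

With the real files, the same function could be called as `load_optdigits(open("optdigits.tra"))`, assuming the file sits in the working directory.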
- Univariate Density Estimation:
Randomly generate a set of N=100 data points using a uniform distribution in
the range from 0 to 50.
Construct the following 6 plots, using in each case the specified
density estimation function over the randomly generated dataset.
- (5 points)
Use a naive estimator with bin width h=1, and produce a separate plot with h=4
(see Fig. 8.2 p. 189 of the textbook).
- (5 points)
Use a Gaussian kernel estimator with bandwidth h=1, and produce a separate plot with h=4
(see Fig. 8.3 p. 190 of the textbook).
- (5 points)
Use a k-nearest neighbor estimator with k=3, and produce a separate plot with k=6
(see Fig. 8.4 p. 191 of the textbook).
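The three estimators can be sketched as follows. This is an illustration in Python rather than Matlab, under stated assumptions: the naive estimator counts points in a window of total width h centred at x (some texts use a half-width convention instead, so check the textbook's definition), the Gaussian kernel estimator averages Gaussian bumps of width h, and the k-nearest neighbor estimator uses k / (2 N d_k(x)), where d_k(x) is the distance from x to its k-th nearest sample. All function names are hypothetical.

```python
import math
import random

def naive_estimate(x, data, h):
    """Share of samples in a window of total width h centred at x, divided by h."""
    count = sum(1 for xt in data if x - h / 2 < xt <= x + h / 2)
    return count / (len(data) * h)

def gaussian_kernel_estimate(x, data, h):
    """Average of Gaussian kernels with width h centred on each sample."""
    total = sum(math.exp(-0.5 * ((x - xt) / h) ** 2) for xt in data)
    return total / (len(data) * h * math.sqrt(2 * math.pi))

def knn_estimate(x, data, k):
    """k / (2 N d_k(x)), with d_k(x) the distance to the k-th nearest sample."""
    d_k = sorted(abs(x - xt) for xt in data)[k - 1]
    return k / (2 * len(data) * d_k) if d_k > 0 else float("inf")

random.seed(0)  # fixed seed so the sample is reproducible
data = [random.uniform(0, 50) for _ in range(100)]
grid = [i * 0.1 for i in range(501)]  # evaluation points 0.0, 0.1, ..., 50.0

# One curve per requested plot (six in total).
curves = {
    "naive h=1": [naive_estimate(x, data, 1) for x in grid],
    "naive h=4": [naive_estimate(x, data, 4) for x in grid],
    "gaussian h=1": [gaussian_kernel_estimate(x, data, 1) for x in grid],
    "gaussian h=4": [gaussian_kernel_estimate(x, data, 4) for x in grid],
    "knn k=3": [knn_estimate(x, data, 3) for x in grid],
    "knn k=6": [knn_estimate(x, data, 6) for x in grid],
}
```

Each entry of `curves` can be plotted against `grid` with whatever plotting tool is at hand. Larger h or k should give visibly smoother curves.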
- Nonparametric Classification:
Use the OptDigit dataset for this part.
Use k-nearest neighbor classification functions in Matlab to
classify the data instances in the test set
optdigits.tes using optdigits.tra as the training set.
Run k-nearest neighbor classification with k = 1, 5, 9, and 11, using 3 different distance metrics:
Mahalanobis, Euclidean, and cosine.
- (20 points)
Use a table to summarize your results.
In this table, include runtime, k, distance metric, and
classification accuracy
for each experiment.
Provide a brief analysis of your results.
- (5 points)
Pick the experiment that you think produced the best result.
Justify your choice.
Include the confusion matrix for this experiment.
Identify which misclassifications are most common and
elaborate on your observations.
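The experiment loop (classify, time the run, record accuracy, build a confusion matrix) might look like the sketch below. It is written in Python on a tiny made-up two-class dataset purely to show the bookkeeping; the assignment itself asks for Matlab's built-in functions and the real data. Mahalanobis distance is left out because it needs the inverse of the sample covariance matrix, which may require extra care if any attribute is constant across the training set. All names and the toy values of k are illustrative assumptions.

```python
import math
import time
from collections import Counter

def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def cosine_distance(a, b):
    dot = sum(ai * bi for ai, bi in zip(a, b))
    na = math.sqrt(sum(ai * ai for ai in a))
    nb = math.sqrt(sum(bi * bi for bi in b))
    if na == 0 or nb == 0:
        return 1.0  # convention chosen here for all-zero vectors
    return 1.0 - dot / (na * nb)

def knn_predict(train_X, train_y, x, k, dist):
    """Majority vote among the k nearest training points (ties go to the nearer class)."""
    neighbors = sorted(zip(train_X, train_y), key=lambda p: dist(p[0], x))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

def evaluate(train_X, train_y, test_X, test_y, k, dist, n_classes):
    """Return (accuracy, runtime in seconds, confusion matrix[true][predicted])."""
    start = time.perf_counter()
    confusion = [[0] * n_classes for _ in range(n_classes)]
    for x, true_label in zip(test_X, test_y):
        confusion[true_label][knn_predict(train_X, train_y, x, k, dist)] += 1
    runtime = time.perf_counter() - start
    accuracy = sum(confusion[c][c] for c in range(n_classes)) / len(test_y)
    return accuracy, runtime, confusion

# Made-up toy data: class 0 lies along the x-axis, class 1 along the y-axis.
train_X = [[4, 0], [5, 1], [6, 1], [0, 4], [1, 5], [1, 6]]
train_y = [0, 0, 0, 1, 1, 1]
test_X = [[5, 0], [0, 5], [4, 1], [1, 4]]
test_y = [0, 1, 0, 1]

# One row per experiment: (metric, k, accuracy, runtime, confusion matrix).
results = []
for name, dist in [("euclidean", euclidean), ("cosine", cosine_distance)]:
    for k in (1, 3):  # toy values; the assignment uses k = 1, 5, 9, 11
        acc, secs, conf = evaluate(train_X, train_y, test_X, test_y, k, dist, 2)
        results.append((name, k, acc, secs, conf))
```

Rows of `results` map directly onto the requested table. In the confusion matrix, large off-diagonal entries point to the digit pairs that are confused most often.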
- Outlier Detection:
Use the optdigits.tra dataset for this part.
- (20 points)
Calculate the Local Outlier Factor (LOF) of each data instance in optdigits.tra.
Describe the code you used for this calculation.
- (5 points)
Sort the data instances in increasing order according to their LOF.
Plot a graph where the horizontal axis consists of the sorted data instances
and the vertical axis denotes their LOF values.
Is there an "elbow" in the plot that could be a good threshold to
discern between non-outliers and outliers?
Explain your answer.
- (5 bonus points)
Take the 3 data instances with the highest LOF values.
Try to plot the image (digit) corresponding to each of these
data instances, and judge whether or not they look
abnormal (i.e., like outliers).
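For reference, one way to compute LOF is sketched below in Python. It follows the reachability-distance formulation (local reachability density of a point compared with that of its neighbors), with two simplifying assumptions: each point uses exactly k neighbors even when distances tie, and the points are distinct. The textbook may define LOF in a simpler form based on k-th neighbor distances only, so check which definition you are expected to use. The function name, the toy points, and k=2 are illustrative.

```python
import math

def local_outlier_factors(points, k):
    """LOF per point, using exactly k neighbors each (ties broken by index)."""
    n = len(points)
    neighbors, k_dist = [], []
    for i in range(n):
        ranked = sorted(
            (math.dist(points[i], points[j]), j) for j in range(n) if j != i
        )
        neighbors.append(ranked[:k])      # (distance, index) of the k nearest
        k_dist.append(ranked[k - 1][0])   # distance to the k-th nearest
    # Local reachability density: inverse of the mean reachability distance.
    lrd = []
    for i in range(n):
        reach_sum = sum(max(k_dist[j], d) for d, j in neighbors[i])
        lrd.append(k / reach_sum if reach_sum > 0 else float("inf"))
    # LOF: mean density of the neighbors relative to the point's own density.
    return [sum(lrd[j] for _, j in neighbors[i]) / (k * lrd[i]) for i in range(n)]

# Made-up toy data: a tight cluster plus one distant point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (10, 10)]
lof = local_outlier_factors(points, k=2)

sorted_lof = sorted(lof)  # values for the "sorted instances vs. LOF" plot
top = sorted(range(len(lof)), key=lambda i: lof[i], reverse=True)[:3]
```

Values near 1 indicate a point whose neighborhood is about as dense as its neighbors' neighborhoods; values well above 1 suggest an outlier. The sketch computes all pairwise distances, so it scales quadratically with the number of instances.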
- Nonparametric Regression:
You may find it useful to watch
Matlab's nonparametric fitting video.
Use the OptDigit dataset for this part, but instead of using the
class attribute as discrete (or nominal), use it as continuous.
- (25 points)
Use locally weighted regression functions in Matlab that implement
techniques like loess or lowess (LOcally WEighted Scatterplot Smoothing)
on the optdigits.tra dataset.
Use cross-validation to determine a good value for k (= number of
nearest neighbors used).
Summarize the results of your experiments in a table.
- (5 bonus points) Find a good way to visualize the
smoothed regression curves constructed.
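To make the role of k concrete, here is a minimal Python sketch of lowess-style smoothing with a single predictor, plus leave-one-out cross-validation over candidate values of k. It assumes a local linear fit with tricube weights whose bandwidth is the distance to the k-th nearest neighbor. The data are made up, the function names are hypothetical, and the OptDigit data has 64 attributes, so this only illustrates the procedure rather than solving the task.

```python
import math

def lowess_predict(x0, xs, ys, k):
    """Local linear fit at x0 over the k nearest points, with tricube weights."""
    nearest = sorted(zip(xs, ys), key=lambda p: abs(p[0] - x0))[:k]
    d_max = max(abs(x - x0) for x, _ in nearest)
    plain_mean = sum(y for _, y in nearest) / len(nearest)
    if d_max == 0:
        return plain_mean
    sw = su = suu = swy = suy = 0.0
    for x, y in nearest:
        u = x - x0                                # centre the predictor at x0
        w = (1 - (abs(u) / d_max) ** 3) ** 3      # tricube weight in [0, 1]
        sw += w
        su += w * u
        suu += w * u * u
        swy += w * y
        suy += w * u * y
    if sw == 0:
        return plain_mean                         # all neighbors at distance d_max
    denom = sw * suu - su * su
    if abs(denom) < 1e-12:
        return swy / sw                           # degenerate fit: weighted mean
    return (swy * suu - su * suy) / denom         # intercept = fitted value at x0

def loo_cv_mse(xs, ys, k):
    """Leave-one-out mean squared error for a given k."""
    err = 0.0
    for i in range(len(xs)):
        rest_x = xs[:i] + xs[i + 1:]
        rest_y = ys[:i] + ys[i + 1:]
        err += (lowess_predict(xs[i], rest_x, rest_y, k) - ys[i]) ** 2
    return err / len(xs)

# Made-up data: a sine curve with a small deterministic wiggle.
xs = [i * 0.5 for i in range(20)]
ys = [math.sin(x) + 0.1 * (-1) ** i for i, x in enumerate(xs)]

candidates = [3, 5, 7, 9]
cv_scores = {k: loo_cv_mse(xs, ys, k) for k in candidates}
best_k = min(cv_scores, key=cv_scores.get)
```

The `cv_scores` dictionary is the kind of summary the results table asks for: one cross-validated error per candidate k. Evaluating `lowess_predict` on a fine grid of x values for a few choices of k gives curves that can be overlaid on the data points for the visualization.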