*** The written report for your group project should be at
most 10 pages long (including all graphs, tables, figures,
appendices, ...) and the font size should be no smaller than
11 pts. ***
Data Mining Technique(s):
Run experiments using any (combination) of the following techniques:
HierarchicalClusterer: Make sure to experiment with different "linkType"s
Anomaly Detection
LOF filter (see WPI level below)
Any data mining technique (including clustering) covered in the course
that can be used to detect anomalies in the data.
Advanced Techniques:
You should consider using advanced techniques to improve the accuracy
of your predictions. For instance, try
ensemble methods (see Section 5.6 of your textbook),
ways to deal with imbalanced classification targets
(see Section 5.7 of your textbook),
cost-sensitive classification, etc.
But, in terms of data mining techniques, this project is restricted to
the techniques listed above.
Any other creative ideas you have to boost model performance and/or
to combine different models into a more powerful one.
Dataset
We will work with the
Amazon Commerce reviews set Data Set
available at the
UCI Machine Learning Repository.
Important:
Please use the pre-processing described in the easy
challenge for all subsequent challenges as well.
Additional pre-processing may be performed as needed.
Challenges:
In each of the following challenges provide a detailed
description of the preprocessing techniques used, the motivation for
using these techniques, and any hypothesis/intuition gained about
the information represented in the dataset. Answer the questions
provided, and also provide the information described in the
PROJECT GUIDELINES.
Easy Level:
This is to be a simple exercise to practice your
preprocessing techniques and allow you to become familiar with the
dataset.
The dataset as found on the website contains problems that prevent
it from being used in Weka without "cleaning" the data. DO NOT
remove the classification attribute from the dataset as it will
be needed later.
Attempt to open the ARFF file in Weka. Examine the problems in
the data that prevent the file from being used as-is. Repair this
file so that it may be used by Weka. Describe what you did to the
dataset so this was accomplished.
Answer the following questions in your description about this
exercise:
Describe the data. How many attributes does this dataset
contain? How many instances? What do the instances represent?
What do you expect to find when you begin exploring this
dataset?
Since this dataset contains far more attributes than instances,
would it be reasonable to somehow transform the attributes into
instances (and the instances into attributes)? Briefly describe
how this could be done.
Briefly describe a domain application where this (#2 above) might be
useful.
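To make the transposition idea concrete, here is a minimal illustrative sketch in plain Python (not Weka) of swapping attributes and instances in a small hypothetical data matrix; with a real ARFF file you would also need to set the class attribute aside first, since it has no meaning in the transposed orientation.

```python
# Rows: one list per instance; columns: one value per attribute.
# The values below are hypothetical, chosen only to show the mechanics.
rows = [
    [1.0, 0.0, 2.0],  # instance 1 (values of attributes a, b, c)
    [0.5, 1.5, 0.0],  # instance 2
]

# zip(*rows) groups the i-th value of every row, turning each former
# attribute into a new row (i.e. a new instance).
transposed = [list(col) for col in zip(*rows)]
print(transposed)  # [[1.0, 0.5], [0.0, 1.5], [2.0, 0.0]]
```

Applied to this dataset, the same operation would turn the ~10,000 attribute columns into rows, giving far more instances than attributes.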
Moderate Level:
This is a bit more of a challenge (be sure to
leave yourself time for challenges 3 and 4).
Use the SimpleKMeans clustering tool in Weka to generate a
clustering of this dataset. Be sure to IGNORE the classification
attribute when performing the clustering, but DO NOT remove it
from the dataset. Use K values no greater than 50 for this
clustering. Experiment with both Euclidean distance and
Manhattan distance.
Examine the model. Describe the performance of the clustering.
Answer the following questions in your description about this
experiment:
Which distance metric and which value for K worked best
for this experiment? Why do you think this was so?
Compare and contrast the assigned cluster of the instances
with their classification values. Provide a visualization
and a description. Were the results what you expected?
Are there any limitations of the dataset that make this a
more challenging experiment (other than the obvious
limitations caused by the ARFF file problems in the "easy"
challenge)? Explain.
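For intuition about what SimpleKMeans does under each distance metric, here is a toy, from-scratch k-means sketch in plain Python (the actual experiment must of course use Weka's SimpleKMeans tool); the example points, seed, and iteration count are illustrative assumptions, not part of the assignment data.

```python
import math
import random

def dist(a, b, metric="euclidean"):
    """Euclidean or Manhattan distance between two numeric vectors."""
    if metric == "manhattan":
        return sum(abs(x - y) for x, y in zip(a, b))
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(points, k, metric="euclidean", iters=20, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initialize centers at random points
    for _ in range(iters):
        # assignment step: each point joins its nearest center's cluster
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: dist(p, centers[c], metric))
            clusters[nearest].append(p)
        # update step: each center moves to the mean of its cluster
        centers = [
            [sum(vals) / len(c) for vals in zip(*c)] if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters

# Two well-separated toy blobs; k-means with k=2 should recover them.
points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
centers, clusters = kmeans(points, k=2, metric="manhattan")
```

Note that on this dataset (hundreds of sparse word-frequency attributes) the two metrics can rank neighbors quite differently, which is part of what the experiment above asks you to observe.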
WPI Level:
This and the WPI+ level are the big challenges on which you should spend the most time.
Install the Local Outlier Factor (LOF) filter add-on package.
Open the package manager in Weka and locate
"localOutlierFactor" and install it. Use the LOF filter on your dataset. Identify the top 25 instances
that are identified as outliers. Provide a list of the class
values for these instances. Perform SimpleKMeans clustering on
the data both with and without these instances included in
the dataset. In the dataset with the instances included,
change the classification value of the 25 outliers to a new class value "Outlier"
(so that these instances can easily be identified in the clusters that you
will construct).
IGNORE the class attribute
and the LOF attribute when performing the clustering. Describe
the performance of these clusterings.
How does the performance of these two clusterings compare to
each other?
Attempt to use visualization to find these outliers. Were
you successful?
a. If not: Is there some way you can characterize these
outliers? Describe an idea for a method.
b. If so: What useful information can you glean from
the outliers?
What challenge(s) did you encounter while performing this
experiment? Give a more detailed explanation of how you
used preprocessing, visualization, postprocessing or some
other technique to overcome a specific challenge.
Use any other anomaly detection approach you wish to identify 25
outliers in this dataset. Include these outliers in your report, and explain
in detail how you found them and why they are outliers.
Compare this list of 25 outliers to the one you found above using the LOF filter.
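As a rough guide to what the LOF filter computes, the following is a brute-force, from-scratch sketch of the Local Outlier Factor score (k-distance, reachability distance, local reachability density, then the LOF ratio) in plain Python; it is illustrative only, the sample points are hypothetical, and the actual experiment should use Weka's localOutlierFactor package.

```python
import math

def lof_scores(points, k=2):
    """Brute-force LOF: scores near 1 mean inlier, larger means outlier."""
    n = len(points)
    d = [[math.dist(p, q) for q in points] for p in points]

    def knn(i):
        # indices of the k nearest neighbors of point i (excluding i)
        return sorted((j for j in range(n) if j != i), key=lambda j: d[i][j])[:k]

    def k_distance(i):
        # distance from point i to its k-th nearest neighbor
        return d[i][knn(i)[-1]]

    def reach_dist(i, j):
        # reachability distance of i from j
        return max(k_distance(j), d[i][j])

    def lrd(i):
        # local reachability density: inverse mean reachability distance
        nbrs = knn(i)
        return len(nbrs) / sum(reach_dist(i, j) for j in nbrs)

    # LOF: ratio of the neighbors' densities to the point's own density
    return [sum(lrd(j) for j in knn(i)) / (k * lrd(i)) for i in range(n)]

# Four clustered points plus one isolated point, which should score highest.
pts = [[0, 0], [0, 1], [1, 0], [1, 1], [5, 5]]
scores = lof_scores(pts, k=2)
```

In the real task you would sort the instances by their LOF attribute value and take the top 25, rather than computing scores by hand.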
WPI+ Level:
For this experiment you may use either the Pennsylvania School
Dataset from the previous projects or the Amazon dataset that
you have been using for this project.
Design another experiment that performs clustering using
hierarchical clustering.
Provide detailed descriptions
about the parameters used to develop your model and/or
preprocessing techniques used. One should be able to repeat the
experiment from your description. Provide a clear description of
the clustering, using visualization as needed to aid your
description. Make sure to experiment with different
"linkType"s.
What was your motivation for choosing this goal? Is it
useful?
What challenge(s) did you encounter while developing this
model? Give a more detailed explanation of how you used
preprocessing, visualization, postprocessing or some
other technique to overcome a specific challenge.
Describe any anomalies that appeared in your model. What
might these anomalies mean about the data?
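To illustrate how the "linkType" parameter changes the merging rule, here is a toy agglomerative clustering sketch in plain Python with single and complete linkage (Weka's HierarchicalClusterer offers these and other link types, such as average); the example points are hypothetical, and the real experiment should be run in Weka as described above.

```python
import math

def linkage_dist(c1, c2, d, link="single"):
    """Distance between two clusters of point indices under the given linkType:
    single = closest pair of points, complete = farthest pair."""
    pair = [d[i][j] for i in c1 for j in c2]
    return min(pair) if link == "single" else max(pair)

def agglomerative(points, num_clusters, link="single"):
    d = [[math.dist(p, q) for q in points] for p in points]
    clusters = [[i] for i in range(len(points))]  # start with singletons
    while len(clusters) > num_clusters:
        # merge the closest pair of clusters under the chosen linkage
        a, b = min(
            ((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
            key=lambda ij: linkage_dist(clusters[ij[0]], clusters[ij[1]], d, link),
        )
        clusters[a] += clusters[b]
        del clusters[b]
    return clusters

# Two toy blobs; both linkages should separate them at num_clusters=2.
pts = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11]]
groups = agglomerative(pts, 2, link="single")
```

On less cleanly separated data the two linkages diverge: single linkage tends to chain elongated clusters together, while complete linkage favors compact clusters, which is why experimenting with different "linkType"s is worthwhile.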