Part I. GROUP HOMEWORK ASSIGNMENT
Consider the following cars dataset.
This dataset was constructed by taking a small sample of data instances and
attributes from the
Car Evaluation Dataset
available at
the UCI Machine Learning Repository.
@relation 'cars-weka.filters.unsupervised.attribute.Remove-R3,5-weka.filters.unsupervised.instance.Resample-S1-Z2.0'
@attribute buying {vhigh,high,med,low}
@attribute maint {vhigh,high,med,low}
@attribute persons {2,4,more}
@attribute safety {low,med,high}
@attribute class {unacc,acc,good}
@data
med,vhigh,more,low,unacc
med,vhigh,2,med,unacc
vhigh,vhigh,more,med,unacc
med,high,4,low,unacc
high,med,4,high,good
low,med,2,med,unacc
low,high,2,high,unacc
low,vhigh,more,med,acc
med,vhigh,4,med,acc
med,vhigh,4,med,acc
vhigh,vhigh,4,med,unacc
med,med,more,med,acc
med,vhigh,2,med,unacc
med,med,4,low,unacc
med,vhigh,more,low,unacc
med,low,4,med,acc
high,low,2,high,unacc
high,med,4,low,unacc
med,low,4,low,unacc
high,high,4,low,unacc
low,med,4,high,good
low,low,2,high,unacc
- Classification Rules using RIPPER:
You can choose between two possibilities:
Option 1: Solve part (a) below by hand. In this case you have to answer part (b) below too.
Option 2: Write your own code to solve part (a) below. In this case you don't have to answer part (b) below. Submit your code by email (with instructions on how to run it) and it will be read for correctness and tested. Your code will be graded over 25 points (separately from the score you receive in part (a)).
-
Use Weka, Excel, your own code (in a programming language of your choice), or other application
to help you calculate the metrics used by RIPPER.
- (20 points)
Construct the first rule that RIPPER would build over this dataset.
Show your work and explain each step of the process.
- (10 points)
Describe in words the process that RIPPER would follow to prune this rule.
(No need for you to prune the rule, you just need to describe what RIPPER would
do.)
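To double-check the hand calculation in part (a), the grow step can be reproduced in plain Python. This is a sketch of RIPPER's greedy condition selection, not Weka's JRip code: RIPPER builds rules for the rarest class first (here "good") and at each step adds the condition with the highest FOIL information gain. The data literals are copied from the listing above.

```python
from math import log2

# Cars dataset from the listing above: (buying, maint, persons, safety, class)
DATA = [
    ("med","vhigh","more","low","unacc"), ("med","vhigh","2","med","unacc"),
    ("vhigh","vhigh","more","med","unacc"), ("med","high","4","low","unacc"),
    ("high","med","4","high","good"), ("low","med","2","med","unacc"),
    ("low","high","2","high","unacc"), ("low","vhigh","more","med","acc"),
    ("med","vhigh","4","med","acc"), ("med","vhigh","4","med","acc"),
    ("vhigh","vhigh","4","med","unacc"), ("med","med","more","med","acc"),
    ("med","vhigh","2","med","unacc"), ("med","med","4","low","unacc"),
    ("med","vhigh","more","low","unacc"), ("med","low","4","med","acc"),
    ("high","low","2","high","unacc"), ("high","med","4","low","unacc"),
    ("med","low","4","low","unacc"), ("high","high","4","low","unacc"),
    ("low","med","4","high","good"), ("low","low","2","high","unacc"),
]
ATTRS = ["buying", "maint", "persons", "safety"]

def foil_gain(rule, cond, target, rows):
    """FOIL information gain of extending `rule` (a list of (attr_index, value)
    conditions) with one extra condition `cond`, for positive class `target`."""
    def covered(conds):
        return [r for r in rows if all(r[i] == v for i, v in conds)]
    before, after = covered(rule), covered(rule + [cond])
    p0 = sum(r[4] == target for r in before); n0 = len(before) - p0
    p1 = sum(r[4] == target for r in after);  n1 = len(after) - p1
    if p1 == 0:                       # condition covers no positives: useless
        return float("-inf")
    return p1 * (log2(p1 / (p1 + n1)) - log2(p0 / (p0 + n0)))

# Score every single-condition candidate for the rarest class ("good")
scores = {(ATTRS[i], v): foil_gain([], (i, v), "good", DATA)
          for i in range(4) for v in {r[i] for r in DATA}}
best = max(scores, key=scores.get)
print(best)  # → ('safety', 'high')
```

On this sample, safety=high comes out on top; re-running the scoring with rule=[(3, "high")] shows that adding maint=med (or persons=4) next makes the rule cover only "good" instances.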
-
Existing real-world application of Classification Rules:
Each group should search for an existing real-world successful application of Classification Rules. This success story should be about mining these rule-based models from data to discover novel and useful patterns that have made a difference in a certain industry or field. The application domain is up to the group (e.g., finance, sports, healthcare, science, ...). The application must be current (ideally not older than 5 years).
Your homework report for this part should contain (at most 1 page):
- (10 points) Name and description of the chosen application.
- (10 points) Detailed description of how this application uses Classification Rules.
- (5 points) List of (recent) references and sources that you used to identify and investigate this application.
- Association Rules using Apriori:
You can choose between two possibilities:
Option 1: Solve part (i) below by hand. In this case you have to answer part (ii) below too.
Option 2: Write your own code to solve part (i) below. In this case you don't have to answer part (ii) below. Submit your code by email (with instructions on how to run it) and it will be read for correctness and tested. Your code will be graded over 25 points (separately from the score you receive in part (i)).
-
Use Weka, Excel, your own code (in a programming language of your choice),
or other application
to help you calculate support of the itemsets constructed below.
If you write your own correct, well-documented code to solve this part
of the homework, you will ...
- Generate all frequent itemsets over this dataset, following
the Apriori algorithm level by level. Show each step of the process.
Let min-support-count = 3 instances (that is, min-support about 13%).
- (5 points)
State what the "join" condition is (called "merge" in the Fk-1xFk-1
method in your textbook p. 341).
- (5 points)
State what the "subset" condition is (called "candidate pruning" in
the Fk-1xFk-1 method in your textbook p. 341).
- For each level,
- (10 points)
Show how the "join" condition was used to generate k-itemsets
(this level's itemsets) from frequent (k-1)-itemsets
(previous level's frequent itemsets).
- (10 points)
Show how the "subset" condition was used to eliminate candidate
itemsets from consideration before unnecessarily counting their support.
- (10 points)
Count support for all remaining itemsets in the level.
- (5 points) What's the termination condition for this process?
Explain.
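For checking your level-by-level work, here is a minimal plain-Python sketch of the loop described above (an illustration, not Weka's implementation). Candidates are generated with the Fk-1 x Fk-1 "join" (merge two frequent (k-1)-itemsets that agree on their first k-2 items), pruned with the "subset" condition, and only then counted. For the cars data, each transaction would be the set of attribute=value pairs of one instance.

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Level-wise frequent-itemset mining with Fk-1 x Fk-1 candidate generation.
    `transactions` is a list of frozensets of items."""
    def support(itemset):
        return sum(itemset <= t for t in transactions)

    items = sorted({i for t in transactions for i in t})
    # Level 1: frequent 1-itemsets
    freq = [{frozenset([i]) for i in items if support(frozenset([i])) >= min_count}]
    k = 2
    while freq[-1]:
        prev = sorted(sorted(s) for s in freq[-1])
        # "join": merge two frequent (k-1)-itemsets sharing their first k-2 items
        candidates = {frozenset(a) | frozenset(b)
                      for a, b in combinations(prev, 2)
                      if a[:-1] == b[:-1] and a[-1] < b[-1]}
        # "subset" pruning: drop candidates with an infrequent (k-1)-subset
        candidates = {c for c in candidates
                      if all(frozenset(s) in freq[-1]
                             for s in combinations(sorted(c), k - 1))}
        # count support only for the surviving candidates
        freq.append({c for c in candidates if support(c) >= min_count})
        k += 1
    return [level for level in freq if level]  # stops when a level comes up empty
```

The termination condition is visible in the loop: once no k-itemset reaches the minimum support count, no larger itemset can either, so the process stops.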
- (10 points)
What are "lift", "leverage", and "conviction"? Provide an explicit formula
for each one of them (look at the Weka code to find those formulas).
Pick one association rule from those that you generate in the next part below,
and use the values of these metrics for this association rule
to judge how interesting/useful this rule is.
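For orientation while you dig the exact formulas out of the Weka source, the standard definitions these three names usually refer to can be written as small functions over fractional supports; treat this as a sketch to verify against the Weka code, with the example supports (0.4, 0.5, 0.3) being hypothetical.

```python
def lift(sup_ab, sup_a, sup_b):
    """lift(A -> B) = P(A,B) / (P(A) * P(B)); 1.0 means A and B are independent."""
    return sup_ab / (sup_a * sup_b)

def leverage(sup_ab, sup_a, sup_b):
    """leverage(A -> B) = P(A,B) - P(A) * P(B); 0.0 means independence."""
    return sup_ab - sup_a * sup_b

def conviction(sup_ab, sup_a, sup_b):
    """conviction(A -> B) = P(A) * P(not B) / P(A, not B); grows without
    bound as the rule's confidence approaches 100%."""
    if sup_a == sup_ab:               # confidence 1: the rule is never violated
        return float("inf")
    return sup_a * (1 - sup_b) / (sup_a - sup_ab)

# Hypothetical rule with supp(A)=0.4, supp(B)=0.5, supp(A and B)=0.3
print(lift(0.3, 0.4, 0.5), leverage(0.3, 0.4, 0.5), conviction(0.3, 0.4, 0.5))
```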
- Take the first frequent itemset of the last level you constructed above.
- (5 points) Generate all rules from this frequent itemset that have
exactly 2 items in the right-hand-side of the rule.
- (10 points) For each rule, calculate the confidence and the lift
of the rule.
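A small helper can enumerate exactly the rules asked for here: every split of the frequent itemset that leaves 2 items on the right-hand side, with confidence and lift computed from a support table. The `support` map and the values used in the example are hypothetical; substitute the supports you counted above.

```python
from itertools import combinations

def rules_with_2_rhs(itemset, support):
    """All rules X -> Y from `itemset` with exactly 2 items in the RHS.
    `support` maps frozensets to their fractional support."""
    rules = []
    whole = frozenset(itemset)
    for rhs in combinations(sorted(itemset), 2):
        rhs = frozenset(rhs)
        lhs = whole - rhs
        conf = support[whole] / support[lhs]      # confidence = P(X,Y) / P(X)
        rule_lift = conf / support[rhs]           # lift = confidence / P(Y)
        rules.append((lhs, rhs, conf, rule_lift))
    return rules
```

For a 3-itemset this yields three rules, each with a single-item left-hand side.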
- (5 points)
Explain how the process of mining association rules in Weka's Apriori
is performed in terms of the following parameters: lowerBoundMinSupport,
upperBoundMinSupport, delta, metricType, minMetric, numRules.
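As a rough sketch of the behaviour being asked about (an assumption to verify against Weka's documentation, not its source): Weka's Apriori re-mines repeatedly, starting just below upperBoundMinSupport and lowering the support threshold by delta each pass, until numRules rules whose metricType value meets minMetric are found or the threshold falls below lowerBoundMinSupport. Here `mine_at` is a hypothetical stand-in for one full mining pass.

```python
def apriori_support_schedule(mine_at, lower=0.1, upper=1.0, delta=0.05, num_rules=10):
    """Hypothetical sketch of Weka Apriori's outer loop over support thresholds.
    `mine_at(s)` stands in for one Apriori run at minimum support `s`,
    returning only the rules whose chosen metric meets minMetric."""
    support = upper - delta
    while support >= lower:
        rules = mine_at(support)          # one full mining + rule-generation pass
        if len(rules) >= num_rules:       # enough good rules: stop lowering support
            return support, rules[:num_rules]
        support -= delta                  # otherwise relax the threshold by delta
    return lower, []                      # gave up at lowerBoundMinSupport
```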
-
Existing real-world application of Association Rules:
Each group should search for an existing real-world successful application of Association Rules. This success story should be about mining association rules from data to discover novel and useful patterns that have made a difference in a certain industry or field. The application domain is up to the group (e.g., finance, sports, healthcare, science, ...). The application must be current (ideally not older than 5 years).
Your homework report for this part should contain (at most 1 page):
- (10 points) Name and description of the chosen application.
- (10 points) Detailed description of how this application uses Association Rules.
- (5 points) List of (recent) references and sources that you used to identify and investigate this application.
Part II. GROUP PROJECT ASSIGNMENT
- Project Instructions:
THOROUGHLY READ AND FOLLOW THE
PROJECT GUIDELINES.
These guidelines contain detailed information about how to structure your
project, and how to prepare your written and oral reports.
Note that the project guidelines specify strict page limits for your written report.
- Data Mining Technique(s):
We will run experiments using the following techniques:
- Pre-processing Techniques:
- Feature selection, feature creation, dimensionality reduction,
noise reduction, attribute discretization, ...
- Data Mining Techniques:
- Classification Rules: JRip (implementing RIPPER)
Given that this technique is able to handle numeric attributes and
missing values directly, make sure to run
some experiments with no pre-processing
and
some experiments with pre-processing, and compare your results.
Experiment also with different parameter values to see how
they affect the rules produced
and their classification performance.
- Association Rules and Classification Association Rules (CARs): Apriori
Given that this technique does not handle numeric attributes
directly, make sure to run
experiments with different pre-processing, and compare your results.
Based on the results of an experiment, design and apply new preprocessing
to improve upon the rules obtained.
- Advanced Techniques:
- You can consider using advanced techniques to improve the accuracy
of your predictions. For instance, you can try
ensemble methods (see Section 5.6 of your textbook),
ways to deal with imbalanced classification targets
(see Section 5.7 of your textbook), etc.
- Dataset:
We will work with the same dataset used in projects 1, 2, and 3:
Census-Income (also known as "Adult") Dataset
available from the
University of California, Irvine (UCI) Machine Learning Repository.
In particular,
- Use the data in the file "adult.data", and the description of the data in the file "adult.names".
- Use the nominal attribute salary (with values >50K and <=50K in the data files) as the classification target.
- Remove the fnlwgt attribute from the dataset.
- Challenges:
In each of the following challenges provide a detailed description of the
preprocessing techniques used, the motivation for using these techniques,
and any hypothesis/intuition gained about the information represented
in the dataset. Answer the question provided as well as provide the
information described in the
PROJECT GUIDELINES.
- Easy Level:
This is a simple guided experiment, so little description
of the preprocessing techniques is needed.
Construct classification rules using Weka's JRip implementation using the default parameters. Use salary as the target attribute.
Use 10-fold cross-validation to perform an analysis of the classification accuracy.
Construct association rules using Weka's Apriori implementation using the default parameters.
Construct classification association rules using Weka's Apriori implementation using the default parameters, except for car=true. (Lower the min. confidence threshold as needed to obtain at least 10 rules.)
Examine your three sets of rules. Compare and contrast them. Answer the following questions in your description about this experiment:
- Compare the three models generated.
Are they different? If so how? If not, why not?
Are there any rules in common in any two of these models?
Are there any rules in common in all three of these models?
- Compare the accuracies of the classification rules against the goodness metrics of the association rules.
- Moderate Level:
This is a bit more of a challenge.
Use modified parameters and preprocessing techniques to generate a set of JRip classification rules that classifies salary. Provide detailed descriptions about the parameters used to develop your model and/or preprocessing techniques used. One should be able to repeat the experiment from your description.
In particular, make sure to experiment with and without rule pruning.
Examine the model. Compare and contrast this model against a ZeroR model, a OneR model, models generated in Projects 2 and 3, and models generated in the challenge above. Answer the following questions in your description about this experiment:
- Is this model a better model than the other models? If so, why? If not, why not?
- Did you find that using a dataset with no preprocessing but modified parameters produced a better model than using a preprocessed dataset with default parameters? Explain why or why not.
- What challenge(s) did you encounter while developing this model? Give a more detailed explanation of how you used preprocessing, postprocessing, or some other technique to overcome a specific challenge.
- WPI Level:
Design another experiment with a different goal other than the ones that have appeared previously in this assignment. Provide detailed descriptions about the parameters used to develop your model and/or preprocessing techniques used. One should be able to repeat the experiment from your description.
- What was your motivation for choosing this goal? Is it very useful?
- Are there any limitations of the dataset that make this a more challenging experiment? Explain.
- Describe any anomalies that appeared in your model. What might these anomalies mean about the data?