

Consider the following subset of the Mushroom dataset.
@relation sample-mushroom
@attribute cap-surface {fibrous,grooves,scaly,smooth}
@attribute bruises? {bruises,no}
@attribute gill-size {broad,narrow}
@attribute habitat {grasses,leaves,meadows,paths,urban,waste,woods}
@attribute poisonousness {edible,poisonous}
@data
scaly,bruises,broad,waste,edible
smooth,no,narrow,woods,poisonous
fibrous,no,broad,grasses,edible
scaly,bruises,broad,woods,edible
scaly,no,narrow,leaves,poisonous
scaly,bruises,broad,paths,edible
smooth,no,broad,leaves,edible
scaly,no,broad,woods,poisonous
scaly,no,narrow,woods,poisonous
smooth,no,broad,leaves,edible
fibrous,no,broad,paths,poisonous
fibrous,bruises,broad,woods,edible
smooth,bruises,narrow,grasses,poisonous
fibrous,no,broad,paths,poisonous
smooth,bruises,narrow,grasses,poisonous
scaly,no,narrow,leaves,poisonous
scaly,no,narrow,woods,poisonous
fibrous,no,broad,grasses,edible
scaly,bruises,broad,woods,edible
fibrous,no,broad,grasses,edible
Note that this dataset contains repeated instances. Your resulting classification and association rules should be affected by this fact.
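As a quick check of the remark about repeated instances, the sketch below counts how often each instance of the sample occurs. It is only an illustration in Python (Weka itself is written in Java); the names `DATA`, `rows`, and `multiplicity` are made up for this example.

```python
from collections import Counter

# The @data section of the sample, copied verbatim.
DATA = """\
scaly,bruises,broad,waste,edible
smooth,no,narrow,woods,poisonous
fibrous,no,broad,grasses,edible
scaly,bruises,broad,woods,edible
scaly,no,narrow,leaves,poisonous
scaly,bruises,broad,paths,edible
smooth,no,broad,leaves,edible
scaly,no,broad,woods,poisonous
scaly,no,narrow,woods,poisonous
smooth,no,broad,leaves,edible
fibrous,no,broad,paths,poisonous
fibrous,bruises,broad,woods,edible
smooth,bruises,narrow,grasses,poisonous
fibrous,no,broad,paths,poisonous
smooth,bruises,narrow,grasses,poisonous
scaly,no,narrow,leaves,poisonous
scaly,no,narrow,woods,poisonous
fibrous,no,broad,grasses,edible
scaly,bruises,broad,woods,edible
fibrous,no,broad,grasses,edible
"""

rows = [tuple(line.split(",")) for line in DATA.strip().splitlines()]
multiplicity = Counter(rows)  # instance -> number of occurrences

print(len(rows), "instances,", len(multiplicity), "distinct")
for row, n in multiplicity.most_common():
    if n > 1:
        print(n, "x", ",".join(row))
```

Because coverage counts (for Prism) and support counts (for Apriori) are taken over all instances, each repeated instance contributes once per occurrence.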
I suggest you use the following nominal values for the attributes rather than 0's and 1's to make the association rules easier to read:
Class (0 = crew, 1 = first, 2 = second, 3 = third)
Age (1 = adult, 0 = child)
Sex (1 = male, 0 = female)
Survived (1 = yes, 0 = no)
In particular,
Use PRISM, the covering algorithm implemented in the Weka system, to generate classification rules. Read the Weka code implementing PRISM in great detail (you need to describe the algorithm used in PRISM in your written report). Read Sections 4.1, 4.4, and 6.2 of your textbook in great detail.
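To make the covering idea concrete, here is a minimal Python sketch of a PRISM-style rule learner run on the mushroom sample above. It is not Weka's code: the function names `covers` and `prism` are invented for this example, and breaking ties on p/t by the larger p is an assumption about the usual textbook formulation. Your report must still describe what the Weka implementation actually does.

```python
ATTRS = ["cap-surface", "bruises?", "gill-size", "habitat"]
CLASS = "poisonousness"
DATA = """\
scaly,bruises,broad,waste,edible
smooth,no,narrow,woods,poisonous
fibrous,no,broad,grasses,edible
scaly,bruises,broad,woods,edible
scaly,no,narrow,leaves,poisonous
scaly,bruises,broad,paths,edible
smooth,no,broad,leaves,edible
scaly,no,broad,woods,poisonous
scaly,no,narrow,woods,poisonous
smooth,no,broad,leaves,edible
fibrous,no,broad,paths,poisonous
fibrous,bruises,broad,woods,edible
smooth,bruises,narrow,grasses,poisonous
fibrous,no,broad,paths,poisonous
smooth,bruises,narrow,grasses,poisonous
scaly,no,narrow,leaves,poisonous
scaly,no,narrow,woods,poisonous
fibrous,no,broad,grasses,edible
scaly,bruises,broad,woods,edible
fibrous,no,broad,grasses,edible
"""
instances = [dict(zip(ATTRS + [CLASS], line.split(",")))
             for line in DATA.strip().splitlines()]


def covers(conditions, inst):
    """True if the instance satisfies every attribute = value test."""
    return all(inst[a] == v for a, v in conditions)


def prism(instances, attrs, class_attr):
    """Return a list of (conditions, class) rules built by covering."""
    rules = []
    for cls in sorted({i[class_attr] for i in instances}):
        remaining = list(instances)  # each class starts from the full data
        while any(i[class_attr] == cls for i in remaining):
            conditions, covered = [], remaining
            # Grow the rule until it is perfect or no attribute is left.
            while (any(i[class_attr] != cls for i in covered)
                   and len(conditions) < len(attrs)):
                used = {a for a, _ in conditions}
                best_key, best_test = None, None
                for a in attrs:
                    if a in used:
                        continue
                    for v in sorted({i[a] for i in covered}):
                        subset = [i for i in covered if i[a] == v]
                        p = sum(i[class_attr] == cls for i in subset)
                        key = (p / len(subset), p)  # p/t, ties broken by p
                        if best_key is None or key > best_key:
                            best_key, best_test = key, (a, v)
                conditions.append(best_test)
                covered = [i for i in covered if covers([best_test], i)]
            rules.append((conditions, cls))
            # Remove the instances covered by the finished rule.
            remaining = [i for i in remaining if not covers(conditions, i)]
    return rules


rules = prism(instances, ATTRS, CLASS)
for conditions, cls in rules:
    lhs = " and ".join(f"{a} = {v}" for a, v in conditions)
    print(f"If {lhs} then {cls}")
```

On this sample the first rule learned for the class edible uses the single test habitat = waste, which covers just one instance; this is the kind of behavior to look for when you analyze the p/t measure in your experiments.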
Your individual report should contain discussions of all the parts of the individual work you do for this project. In particular, it should elaborate on the following topics:
Once you are done with the joint experiments, MODIFY the Prism code so that it uses the p*[log_2(p/t) - log_2(P/T)] measure to rank the attribute-value pairs that are candidates for inclusion in a rule. DESCRIBE in detail in your report exactly how you modified the code. INCLUDE the relevant pieces of code in your report.
Repeat your joint experiments to see the differences in the results between the p/t and the p*[log_2(p/t) - log_2(P/T)] measures. If none of your joint experiments produces different results, construct at least one dataset in which the two measures produce different results and compare them.
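The two measures can disagree even on the first test chosen. The Python sketch below assumes the usual textbook reading of the symbols: p and t are the positive and total instances covered after adding the candidate test, and P and T are the same counts before adding it. The function names are made up for this example, and returning 0 when p = 0 is an assumption. The counts come from the mushroom sample above for the class edible (P = 10, T = 20): habitat = waste covers 1 instance, which is edible, and gill-size = broad covers 13 instances, 10 of them edible.

```python
from math import log2


def accuracy(p, t):
    """The p/t measure: fraction of covered instances that are positive."""
    return p / t


def info_gain(p, t, P, T):
    """The p*[log_2(p/t) - log_2(P/T)] measure (taken as 0 when p = 0)."""
    if p == 0:
        return 0.0
    return p * (log2(p / t) - log2(P / T))


P, T = 10, 20  # edible instances and all instances before any test
candidates = {
    "habitat = waste": (1, 1),      # (p, t)
    "gill-size = broad": (10, 13),  # (p, t)
}

by_accuracy = max(candidates, key=lambda c: accuracy(*candidates[c]))
by_info_gain = max(candidates, key=lambda c: info_gain(*candidates[c], P, T))

print("p/t prefers:", by_accuracy)
print("p*[log_2(p/t) - log_2(P/T)] prefers:", by_info_gain)
```

Here p/t favors the test that is perfect but covers a single instance, while the information-based measure favors the test with much larger coverage.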
Your individual report should contain discussions of all the parts of the individual work you do for this project. In particular, it should elaborate on the following topics:
Once you are done with the joint experiments, MODIFY the Apriori code so that the user can specify a certain attribute from the input dataset, and the algorithm generates only those association rules whose right-hand sides consist solely of attribute-value pairs formed with that attribute. DESCRIBE in detail in your report exactly how you modified the code and the interface of Weka's Apriori. INCLUDE the relevant pieces of code in your report.
Your modification of the code should be (1) "complete", that is, it should generate all the required association rules; and (2) "efficient", hence producing all regular association rules and then filtering out the ones that don't satisfy the right-hand-side constraint is not considered an adequate solution. Run several experiments to make sure that your code modifications satisfy these two conditions.
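One possible shape for the right-hand-side constraint is sketched below in Python on the mushroom sample. This is not Weka's Apriori and not a required design: `frequent_itemsets` and `rules_for_attribute` are hypothetical names, and the sketch only restricts the rule-generation phase. It relies on the fact that each instance has exactly one value per attribute, so a frequent itemset holds at most one item of the chosen attribute and yields at most one candidate rule.

```python
from collections import Counter
from itertools import combinations

ATTRS = ["cap-surface", "bruises?", "gill-size", "habitat", "poisonousness"]
DATA = """\
scaly,bruises,broad,waste,edible
smooth,no,narrow,woods,poisonous
fibrous,no,broad,grasses,edible
scaly,bruises,broad,woods,edible
scaly,no,narrow,leaves,poisonous
scaly,bruises,broad,paths,edible
smooth,no,broad,leaves,edible
scaly,no,broad,woods,poisonous
scaly,no,narrow,woods,poisonous
smooth,no,broad,leaves,edible
fibrous,no,broad,paths,poisonous
fibrous,bruises,broad,woods,edible
smooth,bruises,narrow,grasses,poisonous
fibrous,no,broad,paths,poisonous
smooth,bruises,narrow,grasses,poisonous
scaly,no,narrow,leaves,poisonous
scaly,no,narrow,woods,poisonous
fibrous,no,broad,grasses,edible
scaly,bruises,broad,woods,edible
fibrous,no,broad,grasses,edible
"""
# Each transaction is a set of (attribute, value) items.
TRANSACTIONS = [frozenset(zip(ATTRS, line.split(",")))
                for line in DATA.strip().splitlines()]


def frequent_itemsets(transactions, min_support):
    """Level-wise Apriori; returns {itemset: support count}."""
    n = len(transactions)
    singles = Counter(item for t in transactions for item in t)
    level = {frozenset([i]): c for i, c in singles.items()
             if c / n >= min_support}
    frequent = dict(level)
    k = 2
    while level:
        # Join step, then prune candidates with an infrequent subset.
        candidates = {a | b for a, b in combinations(level, 2)
                      if len(a | b) == k}
        candidates = {c for c in candidates
                      if all(frozenset(s) in level
                             for s in combinations(c, k - 1))}
        level = {}
        for c in candidates:
            count = sum(1 for t in transactions if c <= t)
            if count / n >= min_support:
                level[c] = count
        frequent.update(level)
        k += 1
    return frequent


def rules_for_attribute(transactions, target, min_support, min_conf):
    """Rules whose right-hand side uses only the target attribute."""
    n = len(transactions)
    freq = frequent_itemsets(transactions, min_support)
    rules = []
    for itemset, count in freq.items():
        rhs = frozenset(i for i in itemset if i[0] == target)
        lhs = itemset - rhs
        if not rhs or not lhs:
            continue  # no target item, or nothing left for the left side
        conf = count / freq[lhs]  # lhs is frequent by downward closure
        if conf >= min_conf:
            rules.append((lhs, rhs, count / n, conf))
    return rules


for lhs, rhs, sup, conf in rules_for_attribute(
        TRANSACTIONS, "poisonousness", 0.3, 0.9):
    print(sorted(lhs), "==>", sorted(rhs), f"sup={sup:.2f} conf={conf:.2f}")
```

No rule with a different right-hand side is ever constructed here, but the frequent itemsets are still mined in full; whether the itemset phase should also be pruned is a design decision to argue for in your report.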
Please submit the following files using the myWpi digital drop box:
If you are taking this course for grad. credit, state this fact at the beginning of your report. In this case you submit only an individual report containing both the "individual" and the "group" parts, as you are working all by yourself on the projects.
FOR THE CLASSIFICATION RULES PART OF THE PROJECT
TOTAL: 200 POINTS + EXTRA POINTS DEPENDING ON EXCEPTIONAL QUALITY
(30 POINTS TOTAL: 15 points for individual and 15 for group work)
PRE-PROCESSING OF THE DATASET:
(05 points) Discretizing attributes as needed
(05 points) Dealing with missing values appropriately
(05 points) Dealing with attributes appropriately
(i.e. using nominal values instead of numeric
when appropriate, using as many of them
as possible, etc.)
(up to 5 extra credit points)
Trying to do "fancier" things with attributes
(i.e. combining two attributes highly correlated
into one, using background knowledge, etc.)
(TOTAL: 15 points for individual work)
ALGORITHMIC DESCRIPTION OF THE CODE
(05 points) Description of the algorithm underlying the Weka filters used
(15 points) Description of the algorithm underlying the construction and
pruning of classification rules in Weka's PRISM code
(up to 5 extra credit points for an outstanding job)
(providing just a structural description of the code, i.e. a list of
classes and methods, will receive 0 points)
(TOTAL: 30 points for group work)
CODE MODIFICATION:
(10 points) Description of the algorithmic modification
(20 points) Description of the modifications made to the Prism code
(up to 10 extra credit points for an outstanding job)
(120 POINTS TOTAL: 60 points for individual and 60 points for group work)
EXPERIMENTS
(TOTAL: 30 points each dataset) FOR EACH DATASET:
(06 points) ran a good number of experiments
to get familiar with the PRISM classification method and
different evaluation methods (%split, cross-validation,...)
(08 points) good description of the experiment setting and the results
(08 points) good analysis of the results of the experiments
(08 points) comparison of the results obtained with Prism and the
classifiers from previous project (ZeroR, ID3, and J4.8)
and argumentation of weaknesses and/or strengths of each of the
methods on this dataset, and argumentation of which method
should be preferred for this dataset and why.
(up to 5 extra credit points) excellent analysis of the results and
comparisons
(up to 10 extra credit points) running additional interesting experiments
selecting other classification attributes instead of the
required in this project statement ("private/public", "Survived")
(TOTAL 5 points) SLIDES - how well do they summarize concisely
the results of the project? We suggest you summarize the
setting of your experiments and their results in a tabular manner.
---------------------------------------------------------------------------------
FOR THE ASSOCIATION RULES PART OF THE PROJECT
TOTAL: 200 POINTS + EXTRA POINTS DEPENDING ON EXCEPTIONAL QUALITY
(TOTAL: 15 points) ALGORITHMIC DESCRIPTION OF THE CODE
(05 points) Description of the algorithm underlying the Weka filters used
(10 points) Description of the Apriori algorithm for the construction of
frequent itemsets and association rules.
(up to 5 extra credit points for an outstanding job)
(providing just a structural description of the code, i.e. a list of
classes and methods, will receive 0 points)
(TOTAL: 35 points for group work)
CODE MODIFICATION:
(10 points) Description of the algorithmic modification
(20 points) Description of the modifications made to the Apriori code
(up to 10 extra credit points for an outstanding job)
(20 POINTS TOTAL: 10 points for individual and 10 points for group work)
PRE-PROCESSING OF THE DATASET:
(05 points) Discretizing attributes as needed
(05 points) Dealing with missing values appropriately
(up to 5 extra credit points)
Trying to do "fancier" things with attributes
(i.e. combining two attributes highly correlated
into one, using background knowledge, etc.)
(110 POINTS TOTAL: 55 points for individual and 55 points for group work)
EXPERIMENTS
(TOTAL: 28 points each dataset) FOR EACH DATASET:
(05 points) ran a good number of experiments to get familiar with the
Apriori algorithm varying the input parameters
(05 points) good description of the experiment setting and the results
(13 points) good analysis of the results of the experiments
INCLUDING discussion of particularly interesting association
rules obtained.
(05 points) comparison of the association rules obtained by Apriori and
the classification rules obtained by Prism in project 2.
Argumentation of weaknesses and/or strengths of each of the
methods on this dataset, and argumentation of which method
should be preferred for this dataset and why.
(up to 5 extra credit points) excellent analysis of the results and
comparisons
(up to 10 extra credit points) running additional interesting experiments
(TOTAL 5 points) SLIDES - how well do they summarize concisely
the results of the project? We suggest you summarize the
setting of your experiments and their results in a tabular manner.
(up to 6 extra credit points) for excellent summary and presentation of results
in the slides.
(TOTAL 15 points) Class presentation - how concisely your oral presentation summarized
the results of the project and how focused your presentation was
on the more creative/interesting/useful of your experiments and results.
This grade is given individually to each team member.