Consider the following subset of the Mushroom dataset.
@relation sample-mushroom

@attribute cap-surface {fibrous,grooves,scaly,smooth}
@attribute bruises? {bruises,no}
@attribute gill-size {broad,narrow}
@attribute habitat {grasses,leaves,meadows,paths,urban,waste,woods}
@attribute poisonousness {edible,poisonous}

@data
scaly,bruises,broad,waste,edible
smooth,no,narrow,woods,poisonous
fibrous,no,broad,grasses,edible
scaly,bruises,broad,woods,edible
scaly,no,narrow,leaves,poisonous
scaly,bruises,broad,paths,edible
smooth,no,broad,leaves,edible
scaly,no,broad,woods,poisonous
scaly,no,narrow,woods,poisonous
smooth,no,broad,leaves,edible
fibrous,no,broad,paths,poisonous
fibrous,bruises,broad,woods,edible
smooth,bruises,narrow,grasses,poisonous
fibrous,no,broad,paths,poisonous
smooth,bruises,narrow,grasses,poisonous
scaly,no,narrow,leaves,poisonous
scaly,no,narrow,woods,poisonous
fibrous,no,broad,grasses,edible
scaly,bruises,broad,woods,edible
fibrous,no,broad,grasses,edible
Note that this dataset contains repeated instances. Your resulting classification and association rules should be affected by this fact.
I suggest you use the following nominal values for the attributes rather than 0's and 1's to make the association rules easier to read:
Class (0 = crew, 1 = first, 2 = second, 3 = third)
Age (1 = adult, 0 = child)
Sex (1 = male, 0 = female)
Survived (1 = yes, 0 = no)
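One way to apply this recoding before loading the data into Weka is sketched below. It is only an illustration: the class name `TitanicNominal`, the method `toNominal`, and the column order Class, Age, Sex, Survived are assumptions, so adjust them to the layout of your own copy of the dataset.

```java
import java.util.Arrays;
import java.util.List;

final class TitanicNominal {
    // One value table per column, indexed by the numeric code.
    // Assumed column order: Class, Age, Sex, Survived.
    static final List<List<String>> VALUES = Arrays.asList(
        Arrays.asList("crew", "first", "second", "third"), // Class: 0..3
        Arrays.asList("child", "adult"),                   // Age: 0 = child, 1 = adult
        Arrays.asList("female", "male"),                   // Sex: 0 = female, 1 = male
        Arrays.asList("no", "yes"));                       // Survived: 0 = no, 1 = yes

    /** Converts one comma-separated numeric row to its nominal form. */
    static String toNominal(String row) {
        String[] codes = row.trim().split(",");
        if (codes.length != VALUES.size()) {
            throw new IllegalArgumentException("expected " + VALUES.size() + " columns: " + row);
        }
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < codes.length; i++) {
            if (i > 0) out.append(',');
            out.append(VALUES.get(i).get(Integer.parseInt(codes[i].trim())));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(toNominal("1,1,0,1")); // first,adult,female,yes
    }
}
```

With nominal values in place, a rule reads as Sex=female ==> Survived=yes rather than as a pair of 0/1 codes.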
In particular,
Use PRISM, the covering algorithm implemented in the Weka system, to generate classification rules. Read the Weka code implementing PRISM in great detail (you need to describe the algorithm used in PRISM in your written report). Read Sections 4.1, 4.4, and 6.2 of your textbook in great detail.
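As a reading aid, the basic covering loop can be sketched as follows. This is a minimal sketch written from the general description of the covering approach, not a copy of Weka's code: the names `PrismSketch`, `rulesFor`, and `covers` are hypothetical, and the tie-breaking used here (larger p first, then the first candidate encountered) is an assumption that may differ from what Weka does.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class PrismSketch {
    // The 20 sample-mushroom instances; the class is the last column (index 4).
    static final String DATA =
        "scaly,bruises,broad,waste,edible smooth,no,narrow,woods,poisonous "
      + "fibrous,no,broad,grasses,edible scaly,bruises,broad,woods,edible "
      + "scaly,no,narrow,leaves,poisonous scaly,bruises,broad,paths,edible "
      + "smooth,no,broad,leaves,edible scaly,no,broad,woods,poisonous "
      + "scaly,no,narrow,woods,poisonous smooth,no,broad,leaves,edible "
      + "fibrous,no,broad,paths,poisonous fibrous,bruises,broad,woods,edible "
      + "smooth,bruises,narrow,grasses,poisonous fibrous,no,broad,paths,poisonous "
      + "smooth,bruises,narrow,grasses,poisonous scaly,no,narrow,leaves,poisonous "
      + "scaly,no,narrow,woods,poisonous fibrous,no,broad,grasses,edible "
      + "scaly,bruises,broad,woods,edible fibrous,no,broad,grasses,edible";

    static List<String[]> rows() {
        List<String[]> rows = new ArrayList<>();
        for (String line : DATA.trim().split(" ")) rows.add(line.split(","));
        return rows;
    }

    /** Learns rules for one class value; each rule maps attribute index -> required value. */
    static List<Map<Integer, String>> rulesFor(List<String[]> data, int classIdx, String cls) {
        int numAttrs = data.get(0).length;
        List<Map<Integer, String>> rules = new ArrayList<>();
        List<String[]> remaining = new ArrayList<>(data);
        // Keep building rules while some instance of the class is still uncovered.
        while (remaining.stream().anyMatch(row -> row[classIdx].equals(cls))) {
            Map<Integer, String> rule = new LinkedHashMap<>();
            List<String[]> covered = new ArrayList<>(remaining);
            // Add tests until the rule is pure or every attribute has been used.
            while (rule.size() < numAttrs - 1
                    && covered.stream().anyMatch(row -> !row[classIdx].equals(cls))) {
                int bestAttr = -1, bestP = 0, bestT = 1;
                String bestVal = null;
                for (int a = 0; a < numAttrs; a++) {
                    if (a == classIdx || rule.containsKey(a)) continue;
                    // counts[value] = {p, t}: class instances and all instances with that value.
                    Map<String, int[]> counts = new LinkedHashMap<>();
                    for (String[] r : covered) {
                        int[] c = counts.computeIfAbsent(r[a], k -> new int[2]);
                        if (r[classIdx].equals(cls)) c[0]++;
                        c[1]++;
                    }
                    for (Map.Entry<String, int[]> e : counts.entrySet()) {
                        int p = e.getValue()[0], t = e.getValue()[1];
                        // Compare p/t with bestP/bestT by cross-multiplication; ties go to larger p.
                        long lhs = (long) p * bestT, rhs = (long) bestP * t;
                        if (bestAttr < 0 || lhs > rhs || (lhs == rhs && p > bestP)) {
                            bestAttr = a; bestVal = e.getKey(); bestP = p; bestT = t;
                        }
                    }
                }
                rule.put(bestAttr, bestVal);
                final int attr = bestAttr;
                final String val = bestVal;
                covered.removeIf(row -> !row[attr].equals(val));
            }
            rules.add(rule);
            remaining.removeAll(covered); // drop the instances this rule covers
        }
        return rules;
    }

    static boolean covers(Map<Integer, String> rule, String[] row) {
        for (Map.Entry<Integer, String> e : rule.entrySet()) {
            if (!row[e.getKey()].equals(e.getValue())) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        for (Map<Integer, String> rule : rulesFor(rows(), 4, "edible")) {
            System.out.println(rule + " -> edible");
        }
    }
}
```

On the sample-mushroom instances, the first rule this sketch builds for edible is habitat=waste, because that test has p/t = 1/1 and no other single test is perfectly accurate on the full dataset.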
Your individual report should contain discussions of all the parts of the individual work you do for this project. In particular, it should elaborate on the following topics:
Once you are done with the joint experiments, MODIFY the Prism code so that it uses the p*[log_2(p/t) - log_2(P/T)] measure to rank the attribute-values that are candidates for inclusion in a rule. DESCRIBE in detail in your report how exactly you modified the code. INCLUDE the relevant pieces of code in your report.
Repeat your joint experiments to see the differences in the results between the p/t and the p*[log_2(p/t) - log_2(P/T)] measures. If none of your joint experiments produces different results, construct at least one dataset in which the two measures produce different results and compare them.
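To see how the two measures can rank candidates differently, the sketch below evaluates both on counts taken by hand from the sample-mushroom subset. It assumes the usual textbook reading of the symbols, which this statement does not define: p and t are the positive and total instances covered once the candidate test is added, and P and T are the same counts before it is added. The names `RuleMeasures`, `accuracy`, and `infoGain`, and the convention of returning 0 when p = 0, are illustrative choices.

```java
final class RuleMeasures {
    /** Accuracy measure p/t: fraction of covered instances that belong to the class. */
    static double accuracy(int p, int t) {
        return (double) p / t;
    }

    /** p * [log2(p/t) - log2(P/T)]; taken to be 0 when p == 0 (assumed convention). */
    static double infoGain(int p, int t, int P, int T) {
        if (p == 0) return 0.0;
        return p * (log2((double) p / t) - log2((double) P / T));
    }

    static double log2(double x) {
        return Math.log(x) / Math.log(2.0);
    }

    public static void main(String[] args) {
        // Class = edible, empty rule: P = 10 edible instances out of T = 20.
        // habitat=waste covers 1 instance, 1 of them edible.
        // gill-size=broad covers 13 instances, 10 of them edible.
        System.out.printf("habitat=waste   p/t=%.3f  gain=%.3f%n",
            accuracy(1, 1), infoGain(1, 1, 10, 20));
        System.out.printf("gill-size=broad p/t=%.3f  gain=%.3f%n",
            accuracy(10, 13), infoGain(10, 13, 10, 20));
    }
}
```

Under these assumptions p/t prefers habitat=waste (1.0 against about 0.77), while the second measure prefers gill-size=broad (about 6.21 against 1.0), because it rewards a test for the number of positive instances it covers as well as for its accuracy.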
Your individual report should contain discussions of all the parts of the individual work you do for this project. In particular, it should elaborate on the following topics:
Once you are done with the joint experiments, MODIFY the Apriori code so that the user can specify a certain attribute from the input dataset, and only association rules whose right-hand-sides consist only of attribute-value pairs formed with that attribute are generated by the algorithm. DESCRIBE in detail in your report how exactly you modified the code and the interface of Weka's Apriori. INCLUDE the relevant pieces of code in your report.
Your modification of the code should be (1) "complete", that is, it should generate all the required association rules; and (2) "efficient", hence producing all regular association rules and then filtering out the ones that don't satisfy the right-hand-side constraint is not considered an adequate solution. Run several experiments to make sure that your code modifications satisfy these two conditions.
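The idea of building only the constrained rules, rather than filtering a full rule list, can be illustrated on a toy scale. The sketch below is independent of Weka: `RhsConstrainedRules`, `rules`, `items`, and `sampleSupports` are hypothetical names, items are plain "attribute=value" strings, and the support table holds only a few itemsets counted by hand from the sample-mushroom subset. It relies on the fact that an instance has exactly one value per attribute, so a rule whose right-hand side uses only the chosen attribute has a single item on that side. It covers the rule-generation step only and does not address pruning during frequent-itemset mining.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class RhsConstrainedRules {
    static Set<String> items(String... items) {
        return new HashSet<>(Arrays.asList(items));
    }

    /**
     * For every itemset that contains an item of the target attribute, builds the single
     * rule "rest of itemset => target item" and keeps it if its confidence is high enough.
     * Itemsets without a target item never produce a rule, so nothing is built and discarded.
     */
    static List<String> rules(Map<Set<String>, Integer> support, String target, double minConf) {
        String prefix = target + "=";
        List<String> out = new ArrayList<>();
        for (Map.Entry<Set<String>, Integer> e : support.entrySet()) {
            String rhs = null;
            for (String item : e.getKey()) {
                if (item.startsWith(prefix)) rhs = item;
            }
            if (rhs == null || e.getKey().size() < 2) continue; // no target item, or empty left side
            Set<String> lhs = new TreeSet<>(e.getKey());
            lhs.remove(rhs);
            Integer lhsSupport = support.get(lhs); // present whenever the table is downward closed
            if (lhsSupport == null) continue;
            double confidence = e.getValue() / (double) lhsSupport;
            if (confidence >= minConf) out.add(String.join(", ", lhs) + " => " + rhs);
        }
        Collections.sort(out);
        return out;
    }

    /** A few support counts taken by hand from the 20 sample-mushroom instances. */
    static Map<Set<String>, Integer> sampleSupports() {
        Map<Set<String>, Integer> s = new HashMap<>();
        s.put(items("gill-size=broad"), 13);
        s.put(items("gill-size=narrow"), 7);
        s.put(items("bruises?=no"), 13);
        s.put(items("poisonousness=edible"), 10);
        s.put(items("poisonousness=poisonous"), 10);
        s.put(items("gill-size=broad", "bruises?=no"), 8);
        s.put(items("gill-size=broad", "poisonousness=edible"), 10);
        s.put(items("gill-size=broad", "poisonousness=poisonous"), 3);
        s.put(items("gill-size=narrow", "poisonousness=poisonous"), 7);
        return s;
    }

    public static void main(String[] args) {
        for (String rule : rules(sampleSupports(), "poisonousness", 0.7)) {
            System.out.println(rule);
        }
    }
}
```

With a minimum confidence of 0.7 this yields gill-size=broad => poisonousness=edible (confidence 10/13) and gill-size=narrow => poisonousness=poisonous (confidence 7/7); the itemset {gill-size=broad, bruises?=no} contributes no rule because it contains no item of the chosen attribute.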
Please submit the following files using the myWpi digital drop box:
If you are taking this course for grad. credit, state this fact at the beginning of your report. In this case you submit only an individual report containing both the "individual" and the "group" parts, as you are working all by yourself on the projects.
FOR THE CLASSIFICATION RULES PART OF THE PROJECT

TOTAL: 200 POINTS + EXTRA POINTS DEPENDING ON EXCEPTIONAL QUALITY

(30 POINTS TOTAL: 15 points for individual and 15 for group work) PRE-PROCESSING OF THE DATASET:
  (05 points) Discretizing attributes as needed
  (05 points) Dealing with missing values appropriately
  (05 points) Dealing with attributes appropriately (i.e. using nominal values instead of numeric when appropriate, using as many of them as possible, etc.)
  (up to 5 extra credit points) Trying to do "fancier" things with attributes (i.e. combining two attributes highly correlated into one, using background knowledge, etc.)

(TOTAL: 15 points for individual work) ALGORITHMIC DESCRIPTION OF THE CODE DESCRIPTION
  (05 points) Description of the algorithm underlying the Weka filters used
  (15 points) Description of the algorithm underlying the construction and pruning of classification rules in Weka's PRISM code
  (up to 5 extra credit points for an outstanding job)
  (providing just a structural description of the code, i.e. a list of classes and methods, will receive 0 points)

(TOTAL: 30 points for group work) CODE MODIFICATION:
  (10 points) Description of the algorithmic modification
  (20 points) Description of the modifications made to the Prism code
  (up to 10 extra credit points for an outstanding job)

(120 POINTS TOTAL: 60 points for individual and 60 points for group work) EXPERIMENTS
  (TOTAL: 30 points each dataset) FOR EACH DATASET:
  (06 points) ran a good number of experiments to get familiar with the PRISM classification method and different evaluation methods (%split, cross-validation,...)
  (08 points) good description of the experiment setting and the results
  (08 points) good analysis of the results of the experiments
  (08 points) comparison of the results obtained with Prism and the classifiers from previous project (ZeroR, ID3, and J4.8) and argumentation of weaknesses and/or strengths of each of the methods on this dataset, and argumentation of which method should be preferred for this dataset and why.
  (up to 5 extra credit points) excellent analysis of the results and comparisons
  (up to 10 extra credit points) running additional interesting experiments selecting other classification attributes instead of the required in this project statement ("private/public", "Survived")

(TOTAL 5 points) SLIDES - how well do they summarize concisely the results of the project? We suggest you summarize the setting of your experiments and their results in a tabular manner.

---------------------------------------------------------------------------------

FOR THE ASSOCIATION RULES PART OF THE PROJECT

TOTAL: 200 POINTS + EXTRA POINTS DEPENDING ON EXCEPTIONAL QUALITY

(TOTAL: 15 points) ALGORITHMIC DESCRIPTION OF THE CODE DESCRIPTION
  (05 points) Description of the algorithm underlying the Weka filters used
  (10 points) Description of the Apriori algorithm for the construction of frequent itemsets and association rules.
  (up to 5 extra credit points for an outstanding job)
  (providing just a structural description of the code, i.e. a list of classes and methods, will receive 0 points)

(TOTAL: 35 points for group work) CODE MODIFICATION:
  (10 points) Description of the algorithmic modification
  (20 points) Description of the modifications made to the Apriori code
  (up to 10 extra credit points for an outstanding job)

(20 POINTS TOTAL: 10 points for individual and 10 points for group work) PRE-PROCESSING OF THE DATASET:
  (05 points) Discretizing attributes as needed
  (05 points) Dealing with missing values appropriately
  (up to 5 extra credit points) Trying to do "fancier" things with attributes (i.e. combining two attributes highly correlated into one, using background knowledge, etc.)

(110 POINTS TOTAL: 55 points for individual and 55 points for group work) EXPERIMENTS
  (TOTAL: 28 points each dataset) FOR EACH DATASET:
  (05 points) ran a good number of experiments to get familiar with the Apriori algorithm varying the input parameters
  (05 points) good description of the experiment setting and the results
  (13 points) good analysis of the results of the experiments INCLUDING discussion of particularly interesting association rules obtained.
  (05 points) comparison of the association rules obtained by Apriori and the classification rules obtained by Prism in project 2. Argumentation of weaknesses and/or strengths of each of the methods on this dataset, and argumentation of which method should be preferred for this dataset and why.
  (up to 5 extra credit points) excellent analysis of the results and comparisons
  (up to 10 extra credit points) running additional interesting experiments

(TOTAL 5 points) SLIDES - how well do they summarize concisely the results of the project? We suggest you summarize the setting of your experiments and their results in a tabular manner.
  (up to 6 extra credit points) for excellent summary and presentation of results in the slides.

(TOTAL 15 points) Class presentation - how concisely your oral presentation summarized the results of the project and how focused it was on the more creative/interesting/useful of your experiments and results. This grade is given individually to each team member.