See solutions to this homework by Piotr Mardziel.
Consider the following dataset, adapted from the Car Evaluation Dataset available at the University of California Irvine (UCI) Machine Learning Data Repository.
ATTRIBUTES and POSSIBLE VALUES:
- buying-price: {vhigh, high, med, low}
- maintenance: {vhigh, high, med, low}
- persons: {2, 4, more} (assumed to be a nominal attribute)
- safety: {low, med, high}
- recommendation: {unacc, acc, good}
buying-price | maintenance | persons | safety | recommendation |
high | med | 4 | high | good |
low | med | 2 | med | unacc |
low | high | 2 | high | unacc |
low | vhigh | more | med | acc |
med | vhigh | 4 | med | acc |
vhigh | vhigh | 4 | med | unacc |
med | med | more | med | acc |
med | vhigh | more | low | unacc |
med | low | 4 | med | acc |
high | med | 4 | low | unacc |
low | med | 4 | high | good |
low | low | 2 | high | unacc |
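As a worked illustration (not part of the assignment text), PRISM's p/t measure can be checked by hand on the table above. The short Python sketch below counts, for a candidate attribute-value pair, how many rows it covers (t) and how many of those have the target class "good" (p); the function name is illustrative, not from Weka:

```python
# The 12 rows of the car dataset above, as
# (buying-price, maintenance, persons, safety, recommendation) tuples.
rows = [
    ("high", "med", "4", "high", "good"),
    ("low", "med", "2", "med", "unacc"),
    ("low", "high", "2", "high", "unacc"),
    ("low", "vhigh", "more", "med", "acc"),
    ("med", "vhigh", "4", "med", "acc"),
    ("vhigh", "vhigh", "4", "med", "unacc"),
    ("med", "med", "more", "med", "acc"),
    ("med", "vhigh", "more", "low", "unacc"),
    ("med", "low", "4", "med", "acc"),
    ("high", "med", "4", "low", "unacc"),
    ("low", "med", "4", "high", "good"),
    ("low", "low", "2", "high", "unacc"),
]

def p_over_t(attr_index, value, target="good"):
    """Return (p, t) for the condition attribute[attr_index] == value:
    t = rows covered by the condition, p = covered rows with the target class."""
    covered = [r for r in rows if r[attr_index] == value]
    t = len(covered)
    p = sum(1 for r in covered if r[4] == target)
    return p, t

# safety = high covers 4 rows, 2 of which are "good": p/t = 2/4
print(p_over_t(3, "high"))
```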
I suggest you use the following nominal values for the attributes rather than 0's and 1's to make the association rules easier to read:
- Class: 0 = crew, 1 = first, 2 = second, 3 = third
- Age: 1 = adult, 0 = child
- Sex: 1 = male, 0 = female
- Survived: 1 = yes, 0 = no

The "Survived" attribute is the class/target attribute of the Titanic Data.
In particular,
Use the PRISM covering algorithm, as implemented in the Weka system, to generate classification rules. Read the Weka code implementing PRISM in great detail (you need to describe the algorithm used in PRISM in your written report). Read in great detail Sections 4.1, 4.4, and 6.2 from your textbook.
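For orientation only, the core covering loop of PRISM (Cendrowska's algorithm, described in Section 4.4 of the textbook) can be sketched as below. This is a minimal sketch, not Weka's actual code, which remains the code you must read and describe; all function and variable names here are illustrative:

```python
# Sketch of the PRISM covering algorithm: for one target class, repeatedly
# grow a "perfect" rule by greedily adding the attribute-value test with the
# best p/t ratio, then remove the instances the rule covers.
def matches(x, conditions):
    return all(x[a] == v for a, v in conditions)

def prism(instances, attributes, target_class):
    rules, pool = [], list(instances)
    while any(x["class"] == target_class for x in pool):
        conditions, covered = [], list(pool)
        # grow one rule until it covers only the target class
        while (any(x["class"] != target_class for x in covered)
               and len(conditions) < len(attributes)):
            best = None
            used = {a for a, _ in conditions}
            for attr in attributes:
                if attr in used:
                    continue
                for val in {x[attr] for x in covered}:
                    sub = [x for x in covered if x[attr] == val]
                    p = sum(1 for x in sub if x["class"] == target_class)
                    key = (p / len(sub), p)  # rank by p/t, break ties by larger p
                    if best is None or key > best[0]:
                        best = (key, (attr, val), sub)
            _, cond, covered = best
            conditions.append(cond)
        rules.append((conditions, target_class))
        # remove the instances this rule covers and repeat
        pool = [x for x in pool if not matches(x, conditions)]
    return rules

# Toy run on a fragment of the car dataset above
header = ["buying", "maint", "persons", "safety", "class"]
data = [dict(zip(header, r)) for r in [
    ("high", "med", "4", "high", "good"),
    ("low", "high", "2", "high", "unacc"),
    ("med", "vhigh", "4", "med", "acc"),
    ("low", "med", "4", "high", "good"),
]]
rules = prism(data, header[:4], "good")
print(rules)
```

On this 4-row fragment the loop finds the single perfect rule "maint = med -> good"; on real data Weka's version also has to handle ties and instances no perfect rule can separate.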
Your individual report should contain discussions of all the parts of the individual work you do for this project. In particular, it should elaborate on the following topics:
Once you are done with the joint experiments, MODIFY the Prism code so that it uses the p*[log_2(p/t) - log_2(P/T)] measure to rank the attribute-values that are candidates for inclusion in a rule. DESCRIBE in detail in your report how exactly you modified the code. INCLUDE the relevant pieces of code in your report.
Repeat your joint experiments to see the differences in the results between the p/t and the p*[log_2(p/t) - log_2(P/T)] measures. If none of your joint experiments produces different results, construct at least one dataset in which the two measures produce different results and compare them.
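The two measures can disagree, which is what the experiments above are meant to surface. The sketch below (illustrative, not Weka code) computes both measures; here p and t are the positives and total covered after adding a candidate condition, and P and T the corresponding counts before it. The example counts are made up to show a case where p/t and the information-based measure rank two candidates differently:

```python
import math

def accuracy_measure(p, t):
    """PRISM's default ranking: fraction of covered instances that are positive."""
    return p / t

def info_gain_measure(p, t, P, T):
    """p * [log_2(p/t) - log_2(P/T)]; log_2(p/t) is undefined for p == 0,
    so such candidates are treated as worthless (score 0)."""
    if p == 0:
        return 0.0
    return p * (math.log2(p / t) - math.log2(P / T))

# Hypothetical situation: P = 10 positives among T = 40 instances.
# Candidate A covers p = 2, t = 2 (perfect but narrow);
# candidate B covers p = 8, t = 12 (impure but broad).
a_acc, b_acc = accuracy_measure(2, 2), accuracy_measure(8, 12)
a_ig = info_gain_measure(2, 2, 10, 40)
b_ig = info_gain_measure(8, 12, 10, 40)
# p/t prefers A (1.0 vs 0.667); the information measure prefers B,
# because it rewards coverage as well as purity.
print(a_acc, b_acc, a_ig, b_ig)
```

A dataset engineered around a contrast like this (one narrow pure condition versus one broad mostly-pure condition) is one way to satisfy the "construct at least one dataset in which the two measures produce different results" requirement.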
Your individual report should contain discussions of all the parts of the individual work you do for this project. In particular, it should elaborate on the following topics:
Once you are done with the joint experiments, MODIFY the Apriori code so that the user can specify a certain attribute from the input dataset, and only association rules whose right-hand sides consist only of attribute-value pairs formed with that attribute are generated by the algorithm. DESCRIBE in detail in your report how exactly you modified the code and the interface of Weka's Apriori. INCLUDE the relevant pieces of code in your report.
Your modification of the code should be (1) "complete", that is, it should generate all the required association rules; and (2) "efficient", hence producing all regular association rules and then filtering out the ones that don't satisfy the right-hand-side constraint is not considered an adequate solution. Run several experiments to make sure that your code modifications satisfy these two conditions.
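One way to meet the efficiency requirement is to push the constraint into rule generation itself: only consider frequent itemsets that contain an item of the chosen attribute, and only ever place that item in the consequent, so disallowed rules are never materialized. The sketch below shows the idea in Python (it is not Weka's Apriori; the function, the item representation as (attribute, value) pairs, and the toy support counts are all assumptions for illustration):

```python
# RHS-constrained rule generation from already-mined frequent itemsets.
# freq maps frozenset-of-items -> support count; since a nominal attribute
# takes one value per instance, a consequent built from one attribute is a
# single (attribute, value) item.
def constrained_rules(freq, rhs_attr, min_conf):
    rules = []
    for itemset, supp in freq.items():
        for item in itemset:
            if item[0] != rhs_attr:
                continue  # the consequent must use the chosen attribute
            antecedent = itemset - {item}
            if not antecedent or antecedent not in freq:
                continue
            conf = supp / freq[antecedent]
            if conf >= min_conf:
                rules.append((antecedent, item, conf))
    return rules

# Toy example with hypothetical support counts
freq = {
    frozenset([("safety", "high")]): 4,
    frozenset([("persons", "4")]): 6,
    frozenset([("safety", "high"), ("persons", "4")]): 3,
    frozenset([("safety", "high"), ("recommendation", "good")]): 2,
    frozenset([("recommendation", "good")]): 2,
}
rules = constrained_rules(freq, "recommendation", min_conf=0.5)
# only rules with "recommendation" on the right-hand side are produced
print(rules)
```

In Weka itself the analogous change would go deeper, also pruning frequent-itemset generation to itemsets that can still yield a valid consequent; the sketch only illustrates the rule-splitting side of the constraint.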
Please submit the following files using the myWpi digital drop box:
If you are taking this course for grad. credit, state this fact at the beginning of your report. In this case you submit only an individual report containing both the "individual" and the "group" parts, as you are working all by yourself on the projects.
INDIVIDUAL
(TOTAL 15 points) Class presentation - how well your oral presentation summarized concisely the results of the project, and how focused your presentation was on the more creative/interesting/useful of your experiments and results. This grade is given individually to each team member.

Classification Rules
(TOTAL: 15 points for individual work) ALGORITHMIC DESCRIPTION OF THE CODE
- (05 points) Description of the algorithm underlying the Weka filters used
- (15 points) Description of the algorithm underlying the construction and pruning of classification rules in Weka's PRISM code (up to 5 extra credit points for an outstanding job). Providing just a structural description of the code, that is, a list of classes and methods, will receive 0 points.

(15 POINTS TOTAL: 15 points for individual) PRE-PROCESSING OF THE DATASET:
- (05 points) Discretizing attributes as needed
- (05 points) Dealing with missing values appropriately
- (05 points) Dealing with attributes appropriately (i.e., using nominal values instead of numeric when appropriate, using as many of them as possible, etc.)
- (up to 5 extra credit points) Trying to do "fancier" things with attributes (i.e., combining two highly correlated attributes into one, using background knowledge, etc.)

(72 POINTS TOTAL: 72 points for individual) EXPERIMENTS

FIRST DATASET (36 points)
- (12 points) ran a good number of experiments to get familiar with the PRISM classification method and different evaluation methods (%split, cross-validation, ...)
- (08 points) good description of the experiment setting and the results
- (12 points) good analysis of the results of the experiments
- (04 points) discussion of weaknesses and/or strengths of the Prism algorithm and its application to the dataset
- (up to 5 extra credit points) excellent analysis of the results and comparisons
- (up to 10 extra credit points) running additional interesting experiments selecting classification attributes other than the ones required in this project statement ("private/public", "Survived")

SECOND DATASET (36 points)
- (12 points) ran a good number of experiments to get familiar with the PRISM classification method and different evaluation methods (%split, cross-validation, ...)
- (08 points) good description of the experiment setting and the results
- (12 points) good analysis of the results of the experiments
- (04 points) discussion of weaknesses and/or strengths of the Prism algorithm and its application to the dataset
- (up to 5 extra credit points) excellent analysis of the results and comparisons
- (up to 10 extra credit points) running additional interesting experiments selecting classification attributes other than the ones required in this project statement ("private/public", "Survived")

Association Rules
(TOTAL: 15 points) ALGORITHMIC DESCRIPTION OF THE CODE
- (05 points) Description of the algorithm underlying the Weka filters used
- (10 points) Description of the Apriori algorithm for the construction of frequent itemsets and association rules (up to 5 extra credit points for an outstanding job). Providing just a structural description of the code, that is, a list of classes and methods, will receive 0 points.

(15 POINTS TOTAL: 15 points for individual) PRE-PROCESSING OF THE DATASET:
- (05 points) Discretizing attributes as needed
- (05 points) Dealing with missing values appropriately
- (05 points) Dealing with attributes appropriately (i.e., using nominal values instead of numeric when appropriate, using as many of them as possible, etc.)
- (up to 5 extra credit points) Trying to do "fancier" things with attributes (i.e., combining two highly correlated attributes into one, using background knowledge, etc.)

(65 POINTS TOTAL: 65 points for individual) EXPERIMENTS

FIRST DATASET (33 points)
- (10 points) ran a good number of experiments to get familiar with the Apriori algorithm varying the input parameters
- (05 points) good description of the experiment setting and the results
- (13 points) good analysis of the results of the experiments, INCLUDING discussion of particularly interesting association rules obtained
- (05 points) comparison of the association rules obtained by Apriori and the classification rules obtained by Prism in project 2; argumentation of weaknesses and/or strengths of each of the methods on this dataset, and of which method should be preferred for this dataset and why
- (up to 5 extra credit points) excellent analysis of the results and comparisons
- (up to 10 extra credit points) running additional interesting experiments

SECOND DATASET (33 points)
- (10 points) ran a good number of experiments to get familiar with the Apriori algorithm varying the input parameters
- (05 points) good description of the experiment setting and the results
- (13 points) good analysis of the results of the experiments, INCLUDING discussion of particularly interesting association rules obtained
- (05 points) comparison of the association rules obtained by Apriori and the classification rules obtained by Prism in project 2; argumentation of weaknesses and/or strengths of each of the methods on this dataset, and of which method should be preferred for this dataset and why
- (up to 5 extra credit points) excellent analysis of the results and comparisons
- (up to 10 extra credit points) running additional interesting experiments

JOINT
(TOTAL 10 points) SLIDES - how well do they summarize concisely the results of the project? We suggest you summarize the setting of your experiments and their results in a tabular manner.

Classification Rules
(10 POINTS TOTAL: 10 points for group work) PRE-PROCESSING OF THE DATASET:
- (05 points) Discretizing attributes as needed
- (05 points) Dealing with missing values appropriately
- (up to 5 extra credit points) Trying to do "fancier" things with attributes (i.e., combining two highly correlated attributes into one, using background knowledge, etc.)

(48 POINTS TOTAL: 48 points for group work) EXPERIMENTS

FIRST DATASET (24 points)
- (08 points) good description of the experiment setting and the results
- (12 points) good analysis of the results of the experiments
- (04 points) discussion of weaknesses and/or strengths of the Prism algorithm and its application to the dataset
- (up to 5 extra credit points) excellent analysis of the results and comparisons
- (up to 10 extra credit points) running additional interesting experiments selecting classification attributes other than the ones required in this project statement ("private/public", "Survived")

SECOND DATASET (24 points)
- (08 points) good description of the experiment setting and the results
- (12 points) good analysis of the results of the experiments
- (04 points) discussion of weaknesses and/or strengths of the Prism algorithm and its application to the dataset
- (up to 5 extra credit points) excellent analysis of the results and comparisons
- (up to 10 extra credit points) running additional interesting experiments selecting classification attributes other than the ones required in this project statement ("private/public", "Survived")

(TOTAL: 30 points for group work) CODE MODIFICATION:
- (10 points) Description of the algorithmic modification
- (20 points) Description of the modifications made to the Prism code (up to 10 extra credit points for an outstanding job)

Association Rules
(10 POINTS TOTAL: 10 points for group work) PRE-PROCESSING OF THE DATASET:
- (05 points) Discretizing attributes as needed
- (05 points) Dealing with missing values appropriately
- (up to 5 extra credit points) Trying to do "fancier" things with attributes (i.e., combining two highly correlated attributes into one, using background knowledge, etc.)

(45 POINTS TOTAL: 45 points for group work) EXPERIMENTS

FIRST DATASET (23 points)
- (05 points) good description of the experiment setting and the results
- (13 points) good analysis of the results of the experiments, INCLUDING discussion of particularly interesting association rules obtained
- (05 points) comparison of the association rules obtained by Apriori and the classification rules obtained by Prism in project 2; argumentation of weaknesses and/or strengths of each of the methods on this dataset, and of which method should be preferred for this dataset and why
- (up to 5 extra credit points) excellent analysis of the results and comparisons
- (up to 10 extra credit points) running additional interesting experiments

SECOND DATASET (23 points)
- (05 points) good description of the experiment setting and the results
- (13 points) good analysis of the results of the experiments, INCLUDING discussion of particularly interesting association rules obtained
- (05 points) comparison of the association rules obtained by Apriori and the classification rules obtained by Prism in project 2; argumentation of weaknesses and/or strengths of each of the methods on this dataset, and of which method should be preferred for this dataset and why
- (up to 5 extra credit points) excellent analysis of the results and comparisons
- (up to 10 extra credit points) running additional interesting experiments

(TOTAL: 35 points for group work) CODE MODIFICATION:
- (10 points) Description of the algorithmic modification
- (20 points) Description of the modifications made to the Apriori code (up to 10 extra credit points for an outstanding job)