CS 4445 A Term 2004 - Homework 2 and Project 2

Computer Science Department

CS 4445 Data Mining and Knowledge Discovery in Databases - A Term 2004
Homework and Project 2: Data Pre-processing, Mining, and Evaluation of Rules

PROF. CAROLINA RUIZ

DUE DATE: Part I (the individual homework assignment) is due on Tuesday, September 14th at 5:00 pm and Parts II.1 and II.2 (the individual+group project) are due on Sunday, September 26, 2004 at 5:00 pm.

Project Description
Homework Assignment
Project Assignment
Project Submission and Due Date
Grading Criteria

HOMEWORK AND PROJECT DESCRIPTION

The purpose of this project is multi-fold:

To gain experience with the mining and evaluation of classification rules.
To gain experience with the mining of association rules.
To compare these two data mining techniques over two datasets.

Readings: Read in great detail Sections 4.1, 4.4, 4.5 and 6.2 from your textbook.

INDIVIDUAL HOMEWORK ASSIGNMENT

See solutions to the classification rules part and the association rules part of this HW by Min Song.

Consider the following subset of the Mushroom dataset.


@relation sample-mushroom

@attribute cap-surface {fibrous,grooves,scaly,smooth}
@attribute bruises? {bruises,no}
@attribute gill-size {broad,narrow}
@attribute habitat {grasses,leaves,meadows,paths,urban,waste,woods}
@attribute poisonousness {edible,poisonous}

@data

scaly,bruises,broad,waste,edible
smooth,no,narrow,woods,poisonous
fibrous,no,broad,grasses,edible
scaly,bruises,broad,woods,edible
scaly,no,narrow,leaves,poisonous
scaly,bruises,broad,paths,edible
smooth,no,broad,leaves,edible
scaly,no,broad,woods,poisonous
scaly,no,narrow,woods,poisonous
smooth,no,broad,leaves,edible
fibrous,no,broad,paths,poisonous
fibrous,bruises,broad,woods,edible
smooth,bruises,narrow,grasses,poisonous
fibrous,no,broad,paths,poisonous
smooth,bruises,narrow,grasses,poisonous
scaly,no,narrow,leaves,poisonous
scaly,no,narrow,woods,poisonous
fibrous,no,broad,grasses,edible
scaly,bruises,broad,woods,edible
fibrous,no,broad,grasses,edible

(50 points) Construct "by hand" all the perfect classification rules that the Prism algorithm would output for this dataset using the ratio p/t to rank the attribute-values that are candidates for inclusion in a rule. Your written solutions should show all your work, that is, the list of all attribute-values that were candidates during each stage of the rule construction process and which ones were selected.
(50 points) Mine association rules by hand from this dataset by faithfully following the Apriori algorithm with minimal support = 25% and minimal confidence = 90%. That is, start by generating candidate itemsets and frequent itemsets level by level and, after all frequent itemsets have been generated, produce from them all the rules with confidence greater than or equal to the minimal confidence. SHOW IN DETAIL ALL THE STEPS OF THE PROCESS.
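
As a way to check a hand-worked answer, the level-by-level process described above can be sketched in a few lines of Python. This is only an illustrative sketch, not Weka's Apriori code: the function names `apriori` and `rules` are hypothetical, each instance is treated as a set of (attribute, value) items, and support is counted over all 20 instances, repeated ones included.

```python
from itertools import combinations

ATTRIBUTES = ["cap-surface", "bruises?", "gill-size", "habitat", "poisonousness"]
ROWS = """
scaly,bruises,broad,waste,edible
smooth,no,narrow,woods,poisonous
fibrous,no,broad,grasses,edible
scaly,bruises,broad,woods,edible
scaly,no,narrow,leaves,poisonous
scaly,bruises,broad,paths,edible
smooth,no,broad,leaves,edible
scaly,no,broad,woods,poisonous
scaly,no,narrow,woods,poisonous
smooth,no,broad,leaves,edible
fibrous,no,broad,paths,poisonous
fibrous,bruises,broad,woods,edible
smooth,bruises,narrow,grasses,poisonous
fibrous,no,broad,paths,poisonous
smooth,bruises,narrow,grasses,poisonous
scaly,no,narrow,leaves,poisonous
scaly,no,narrow,woods,poisonous
fibrous,no,broad,grasses,edible
scaly,bruises,broad,woods,edible
fibrous,no,broad,grasses,edible
""".split()

# Each instance becomes a set of (attribute, value) items.
TRANSACTIONS = [frozenset(zip(ATTRIBUTES, row.split(","))) for row in ROWS]


def apriori(transactions, min_support):
    """Return {itemset: support count} for every frequent itemset."""
    min_count = min_support * len(transactions)
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_count}
    all_frequent = dict(frequent)
    k = 2
    while frequent:
        prev = set(frequent)
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {s: c for s, c in counts.items() if c >= min_count}
        all_frequent.update(frequent)
        k += 1
    return all_frequent


def rules(all_frequent, min_confidence):
    """Return (lhs, rhs, confidence) for rules meeting the confidence threshold."""
    found = []
    for itemset, count in all_frequent.items():
        for size in range(1, len(itemset)):
            for lhs in combinations(itemset, size):
                lhs = frozenset(lhs)
                confidence = count / all_frequent[lhs]
                if confidence >= min_confidence:
                    found.append((lhs, itemset - lhs, confidence))
    return found


if __name__ == "__main__":
    frequent = apriori(TRANSACTIONS, 0.25)
    for lhs, rhs, conf in rules(frequent, 0.90):
        print(sorted(lhs), "=>", sorted(rhs), f"conf={conf:.2f}")
```

The sketch prints only the final rules; the written homework still has to show the candidate and frequent itemsets at every level.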

Note that this dataset contains repeated instances. Your resulting classification and association rules should be affected by this fact.
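
To see why repeated instances matter, it may help to look at how the p/t ratio is computed: p and t are instance counts, so a duplicated instance is counted once per occurrence. The following Python sketch ranks the candidate attribute-values for a single stage of rule construction only; it is not the full Prism covering loop and not Weka's code. The function name `rank_candidates` is hypothetical, and breaking ties in favor of the larger p is an assumption of this sketch.

```python
from collections import Counter

ATTRIBUTES = ["cap-surface", "bruises?", "gill-size", "habitat", "poisonousness"]
ROWS = """
scaly,bruises,broad,waste,edible
smooth,no,narrow,woods,poisonous
fibrous,no,broad,grasses,edible
scaly,bruises,broad,woods,edible
scaly,no,narrow,leaves,poisonous
scaly,bruises,broad,paths,edible
smooth,no,broad,leaves,edible
scaly,no,broad,woods,poisonous
scaly,no,narrow,woods,poisonous
smooth,no,broad,leaves,edible
fibrous,no,broad,paths,poisonous
fibrous,bruises,broad,woods,edible
smooth,bruises,narrow,grasses,poisonous
fibrous,no,broad,paths,poisonous
smooth,bruises,narrow,grasses,poisonous
scaly,no,narrow,leaves,poisonous
scaly,no,narrow,woods,poisonous
fibrous,no,broad,grasses,edible
scaly,bruises,broad,woods,edible
fibrous,no,broad,grasses,edible
""".split()

DATA = [dict(zip(ATTRIBUTES, row.split(","))) for row in ROWS]


def rank_candidates(instances, target_class, class_attr="poisonousness", used=()):
    """Rank attribute-values by p/t for one stage of building a rule.

    t = number of instances covered by the attribute-value,
    p = number of those instances that belong to target_class.
    Attributes listed in `used` (already in the rule) are skipped.
    """
    t_counts = Counter()
    p_counts = Counter()
    for inst in instances:
        for attr, value in inst.items():
            if attr == class_attr or attr in used:
                continue
            t_counts[(attr, value)] += 1
            if inst[class_attr] == target_class:
                p_counts[(attr, value)] += 1
    ranked = [(a, v, p_counts[(a, v)], t) for (a, v), t in t_counts.items()]
    # Highest p/t first; ties broken by larger p (an assumption of this sketch).
    ranked.sort(key=lambda c: (c[2] / c[3], c[2]), reverse=True)
    return ranked


if __name__ == "__main__":
    for attr, value, p, t in rank_candidates(DATA, "edible"):
        print(f"{attr}={value}: p/t = {p}/{t}")
```

To continue a rule, one would call `rank_candidates` again on only the instances covered by the chosen attribute-value, passing the chosen attribute in `used`.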

Submission and Due Date.

Part I is due Tuesday, Sept. 14th at 5:00 pm. Bring a hardcopy of your homework to my office FL232 before the deadline. No submissions after 5:00 pm will be accepted.

PROJECT ASSIGNMENT

The following are general guidelines for the project.

Datasets:

Consider the following sets of data:

The Titanic Dataset. Look at the dataset description and the Data instances.
I suggest you use the following nominal values for the attributes rather than 0's and 1's to make the association rules easier to read:
```
Class (0 = crew, 1 = first, 2 = second, 3 = third)
Age   (1 = adult, 0 = child)
Sex   (1 = male, 0 = female)
Survived (1 = yes, 0 = no)
```
1995 Data Analysis Exposition. This dataset contains college data taken from the U.S. News & World Report's Guide to America's Best Colleges. The necessary files are:
Let's make "private/public" the classification target. Note that even though the values of this attribute are 0s and 1s, this is a nominal (not a numeric!) attribute.
The Microsoft Anonymous Web Data. This dataset is available at the UCI KDD Repository.

The first two of these datasets (1 and 2) will be used for the Classification Rules experiments, and the last two of these datasets (2 and 3) will be used for the Association Rules experiments.
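
The recoding suggested above for the Titanic data can be done with a small script before loading the data into Weka. The sketch below is only an illustration: it assumes each raw instance is available as a tuple of four integer codes in the order class, age, sex, survived (check the dataset description for the actual layout), and the function names `to_nominal` and `to_arff` as well as the sample rows are hypothetical.

```python
# Code tables taken from the suggested nominal values above.
CODES = {
    "class": {0: "crew", 1: "first", 2: "second", 3: "third"},
    "age": {1: "adult", 0: "child"},
    "sex": {1: "male", 0: "female"},
    "survived": {1: "yes", 0: "no"},
}


def to_nominal(row):
    """Translate one row of integer codes into nominal values."""
    return [CODES[name][code] for name, code in zip(CODES, row)]


def to_arff(rows, relation="titanic"):
    """Build the text of an ARFF file with nominal attributes."""
    lines = [f"@relation {relation}", ""]
    for name, mapping in CODES.items():
        values = ",".join(mapping.values())
        lines.append(f"@attribute {name} {{{values}}}")
    lines += ["", "@data"]
    lines += [",".join(to_nominal(row)) for row in rows]
    return "\n".join(lines)


if __name__ == "__main__":
    # Made-up rows, only to show the output format.
    sample = [(1, 1, 1, 1), (3, 0, 0, 0), (0, 1, 1, 0)]
    print(to_arff(sample))
```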

Experiments:

For each of the datasets, use the "Explorer" option of the Weka system to perform the following operations:

A main part of the project is the PREPROCESSING of your dataset. You should apply relevant filters to your dataset before doing the mining and/or using the results of previous mining tasks. For instance, you may decide to remove apparently irrelevant attributes, replace missing values if any, discretize attributes in a different way, etc. Your report should contain a detailed description of the preprocessing of your dataset and justifications of the steps you followed. If Weka does not provide the functionality you need to preprocess your data so as to obtain useful patterns, preprocess the data yourself by writing the necessary filters (you can incorporate them into Weka if you wish).
In particular,
- explore different ways of discretizing (if needed) continuous attributes. That is, convert numeric attributes into "nominal" ones by binning numeric values into intervals - See the weka.filter.DiscretizeFilter in Weka. Play with the filter and read the Java code implementing it.
- explore different ways of removing missing values. Missing values in arff files are represented with the character "?". See the weka.filter.ReplaceMissingValuesFilter in Weka. Play with the filter and read the Java code implementing it.
To the extent possible/necessary, modify the attribute names and the nominal value names so that the resulting classification rules are easy to read.
You may restrict your experiments to a subset of the instances in the input data IF Weka cannot handle your whole dataset (this is unlikely). But remember that the more accurate your set of rules, the better.
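
The two preprocessing ideas mentioned above, binning numeric values into intervals and replacing "?" values, can be illustrated outside Weka with a short Python sketch. Equal-width binning and mean/mode replacement are just one possible strategy each and are not claimed to match what the Weka filters do; read the filter code for that. The function names `equal_width_bins` and `replace_missing` are hypothetical.

```python
from collections import Counter

MISSING = "?"  # the ARFF marker for a missing value


def equal_width_bins(values, n_bins):
    """Replace each numeric value by the label of its equal-width interval."""
    known = [v for v in values if v != MISSING]
    lo, hi = min(known), max(known)
    width = (hi - lo) / n_bins or 1.0  # avoid a zero width if all values are equal
    labels = []
    for v in values:
        if v == MISSING:
            labels.append(MISSING)  # missing values are left untouched here
            continue
        idx = min(int((v - lo) / width), n_bins - 1)  # the maximum goes in the last bin
        left = lo + idx * width
        right = lo + (idx + 1) * width
        labels.append(f"{left:g}_to_{right:g}")
    return labels


def replace_missing(values, numeric):
    """Fill '?' with the mean (numeric) or the most frequent value (nominal)."""
    known = [v for v in values if v != MISSING]
    if numeric:
        fill = sum(known) / len(known)
    else:
        fill = Counter(known).most_common(1)[0][0]
    return [fill if v == MISSING else v for v in values]


if __name__ == "__main__":
    print(equal_width_bins([0, 5, 10, "?"], 2))
    print(replace_missing([1, "?", 3], numeric=True))
    print(replace_missing(["a", "b", "a", "?"], numeric=False))
```

Whichever strategy you pick, the report should state it and justify it, since the resulting rules depend on the bin boundaries and on the values used as replacements.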

CLASSIFICATION RULES
The purpose of this part of the project is to construct the most accurate set of classification rules possible for each of the following classification tasks: (1) Predict the "Survived" attribute of the Titanic Data. (2) Predict the "private/public" attribute of the College Data.
Use PRISM, the covering algorithm for generating classification rules that is implemented in the Weka system. Read the Weka code implementing PRISM in great detail (you need to describe the algorithm used in PRISM in your written report). Read in great detail Sections 4.1, 4.4, and 6.2 from your textbook.
- INDIVIDUAL PROJECT AND WRITTEN REPORT.
  Your individual report should contain discussions of all the parts of the individual work you do for this project. In particular, it should elaborate on the following topics:
  1. Code Description: Describe algorithmically the Weka code of the classifiers and filters that you used in the project. More precisely, explain the ALGORITHM underlying the code in terms of the input it receives, the output it produces, and the main steps it follows to produce this output. PLEASE NOTE THAT WE EXPECT A DETAILED DESCRIPTION OF THE ALGORITHMS USED, NOT A LIST OF OBJECTS AND METHODS IMPLEMENTED IN THE CODE. For the description of PRISM, detail exactly how classification rules are constructed and pruned.
  2. Experiments: For EACH EXPERIMENT YOU RAN describe:
    - Instances: What data did you use for the experiment? That is, did you use the entire dataset or just a subset of it? Why?
    - Any pre-processing done to the data. That is, did you remove any attributes? Did you discretize any continuous attribute? If so, what strategy did you use to bin the values? Did you replace missing values? If so, what strategy did you use to select a replacement of the missing values?
    - Your system parameters.
    - For the PRISM classifier,
      - Results and detailed ANALYSIS of results of the experiments you ran using different ways of testing (split ratio and N-fold cross-validation) the classifier.
      - Accuracy of the resulting models
      - Comparison of the classification accuracies of the PRISM models with those obtained with the ZeroR, ID3, and J4.8 classifiers from the previous project. Analyze this comparison in detail and argue what the strengths and weaknesses of each of the classification methods over each of the two datasets are.
        (This implies that if you didn't run enough experiments in project 1 to make a meaningful comparison, then you should run those decision tree experiments now as part of your project 2.)
  3. Summary of Results
    - For each of the datasets, what was the accuracy of the most accurate set of rules constructed in your project? Include this set in your report.
    - Strengths and weaknesses of your project.
- GROUP PROJECT AND WRITTEN REPORT. Your group report should contain discussions of all the parts of the group project. In particular, it should elaborate on the following topics:
  1. Experiments: as described in the individual part above. For the group part, start by running experiments that build upon the experience that you gained with the individual projects.
    Once you are done with the joint experiments, MODIFY the Prism code so that it uses the p*[log_2(p/t) - log_2(P/T)] measure to rank the attribute-values that are candidates for inclusion in a rule. DESCRIBE in detail in your report how exactly you modified the code. INCLUDE the relevant pieces of code in your report.
    Repeat your joint experiments to see the differences in the results between the p/t and the p*[log_2(p/t) - log_2(P/T)] measures. If none of your joint experiments produces different results, construct at least one dataset in which the two measures produce different results and compare them.
  2. Summary of Results as described in the individual part above
- GROUP ORAL REPORT. We will discuss the results from the individual projects during the class on Monday, September 27. Your oral report should summarize the different sections of your written report as described above. Each group will have about 4 minutes to explain your results and to discuss your project in class. Be prepared!
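
The two ranking measures named in the group part above, p/t and p*[log_2(p/t) - log_2(P/T)], can disagree on which attribute-value is the better candidate. The Python sketch below evaluates both on illustrative counts. It is not Weka's Prism code, the function names are hypothetical, and ranking a candidate with p = 0 last is an assumption of this sketch. Here p and t are the positive and total instances covered after adding the candidate, and P and T are the positive and total instances covered before adding it.

```python
import math


def accuracy_measure(p, t):
    """The p/t ratio: fraction of covered instances that are positive."""
    return p / t


def information_measure(p, t, P, T):
    """The p * [log2(p/t) - log2(P/T)] measure."""
    if p == 0:
        # A candidate covering no positive instance is never useful;
        # ranking it last is an assumption of this sketch.
        return float("-inf")
    return p * (math.log2(p / t) - math.log2(P / T))


if __name__ == "__main__":
    P, T = 10, 20  # illustrative counts before adding any condition
    candidates = {"A": (1, 1), "B": (10, 13)}  # name -> (p, t)
    for name, (p, t) in candidates.items():
        print(
            name,
            f"p/t = {accuracy_measure(p, t):.3f}",
            f"info = {information_measure(p, t, P, T):.3f}",
        )
```

With these counts, p/t prefers candidate A, which is exact but covers a single instance, while the information-based measure prefers candidate B, which covers many more positive instances at the cost of a few negative ones. A dataset built around such a pair of candidates is one way to obtain different results from the two measures.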
ASSOCIATION RULES
The purpose of this project is to mine the best sets of association rules possible from the College Dataset and the Microsoft Anonymous Web Dataset. Use the Apriori implementation in Weka to mine association rules from these two datasets. Read the Weka code implementing Apriori in great detail (you need to describe the algorithm used in Apriori in your written report). Read in great detail Section 4.5 from your textbook.
- INDIVIDUAL PROJECT AND WRITTEN REPORT.
  Your individual report should contain discussions of all the parts of the individual work you do for this project. In particular, it should elaborate on the following topics:
  1. Code Description: Describe algorithmically the Weka code of the classifiers and filters that you used in the project. More precisely, explain the ALGORITHM underlying the code in terms of the input it receives, the output it produces, and the main steps it follows to produce this output. PLEASE NOTE THAT WE EXPECT A DETAILED DESCRIPTION OF THE ALGORITHMS USED, NOT A LIST OF OBJECTS AND METHODS IMPLEMENTED IN THE CODE. For the description of Weka's Apriori, detail exactly how association rules are constructed.
  2. Experiments: For EACH EXPERIMENT YOU RAN describe:
    - Instances: What data did you use for the experiment? That is, did you use the entire dataset or just a subset of it? Why? Describe the dataset that you selected in terms of the attributes present in the data, the number of instances, missing values, and other relevant characteristics.
    - Any pre-processing done to the data. That is, did you remove any attributes? Did you discretize any continuous attribute? If so, what strategy did you use to bin the values? Did you replace missing values? If so, what strategy did you use to select a replacement of the missing values?
    - Your system parameters. Run multiple experiments by modifying the input parameters offered by the Weka implementation of Apriori. Those input parameters include confidence, support, minimum number of rules, and others.
    - Results and detailed ANALYSIS of results of the experiments you ran. INCLUDING discussion of particularly interesting association rules that you obtained.
    - Comparison of the classification rules obtained with PRISM above with the association rules mined here. Analyze this comparison in detail and argue what the strengths and weaknesses of each of these two methods to mine rules over each of the two datasets are. (This implies that if you didn't run enough experiments with Prism to make a meaningful comparison, then you should run those experiments now.)
  3. Summary of Results
    - Strengths and weaknesses of your project.
- GROUP PROJECT AND WRITTEN REPORT. Your group report should contain discussions of all the parts of the group project. In particular, it should elaborate on the following topics:
  1. Experiments: as described in the individual part above. For the group part, start by running experiments that build upon the experience that you gained with the individual projects.
    Once you are done with the joint experiments, MODIFY the Apriori code so that the user can specify a certain attribute from the input dataset, and only association rules whose right-hand sides consist only of attribute-value pairs formed with that attribute are generated by the algorithm. DESCRIBE in detail in your report how exactly you modified the code and the interface of Weka's Apriori. INCLUDE the relevant pieces of code in your report.
    Your modification of the code should be (1) "complete", that is, it should generate all the required association rules; and (2) "efficient", hence producing all regular association rules and then filtering out the ones that don't satisfy the right-hand-side constraint is not considered an adequate solution. Run several experiments to make sure that your code modifications satisfy these two conditions.
  2. Summary of Results as described in the individual part above
- GROUP ORAL REPORT. We will discuss the results from the individual projects during the class on September 27. Your oral report should summarize the different sections of your written report as described above. Each group will have about 4 minutes to explain your results and to discuss your project in class. Be prepared!
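
The right-hand-side restriction described in the group part above can be pictured at the rule-generation step: instead of trying every split of a frequent itemset into left- and right-hand sides, only the split that puts the chosen attribute on the right is considered. The Python sketch below shows this idea on illustrative support counts. It is one possible reading of the requirement, not Weka's Apriori code and not a complete solution; the function name `rules_for_attribute` and the counts are hypothetical, and items are assumed to be (attribute, value) pairs with at most one value per attribute in an itemset.

```python
def rules_for_attribute(frequent, target_attr, min_confidence):
    """Generate only rules of the form LHS => (target_attr = value).

    `frequent` maps each frequent itemset (a frozenset of
    (attribute, value) pairs) to its support count.
    """
    found = []
    for itemset, count in frequent.items():
        if len(itemset) < 2:
            continue  # a rule needs a non-empty left-hand side
        targets = [item for item in itemset if item[0] == target_attr]
        if len(targets) != 1:
            continue  # this itemset does not mention the target attribute
        rhs = frozenset(targets)
        lhs = itemset - rhs
        # The LHS is a subset of a frequent itemset, so its count is available.
        confidence = count / frequent[lhs]
        if confidence >= min_confidence:
            found.append((lhs, rhs, confidence))
    return found


if __name__ == "__main__":
    no = ("bruises?", "no")
    narrow = ("gill-size", "narrow")
    poisonous = ("poisonousness", "poisonous")
    # Illustrative support counts for a handful of frequent itemsets.
    frequent = {
        frozenset([no]): 13,
        frozenset([narrow]): 7,
        frozenset([poisonous]): 10,
        frozenset([no, narrow]): 5,
        frozenset([no, poisonous]): 8,
        frozenset([narrow, poisonous]): 7,
        frozenset([no, narrow, poisonous]): 5,
    }
    for lhs, rhs, conf in rules_for_attribute(frequent, "poisonousness", 0.9):
        print(sorted(lhs), "=>", sorted(rhs), f"conf={conf:.2f}")
```

Each frequent itemset is inspected once and yields at most one rule, so no rule with a different right-hand side is ever built and then discarded.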

PROJECT SUBMISSION AND DUE DATE

Part II is due Sunday, Sept. 26 at 5:00 pm. Submissions received on Sunday, Sept 26 between 5:01 pm and 7:00 pm will be penalized with 30% off the grade and submissions after Sept 26 at 7:00 pm won't be accepted.

Please submit the following files using the myWpi digital drop box:

[lastname]_proj2_report.[ext] containing your individual written reports. This file should be either a PDF file (ext=pdf), a Word file (ext=doc), or a PostScript file (ext=ps). For instance my file would be named (note the use of lower case letters only):
- ruiz_proj2_report.pdf
If you are taking this course for grad. credit, state this fact at the beginning of your report. In this case you submit only an individual report containing both the "individual" and the "group" parts, as you are working all by yourself on the projects.
[lastname1_lastname2]_proj2_report.[ext] containing your group written reports. This file should be either a PDF file (ext=pdf), a Word file (ext=doc), or a PostScript file (ext=ps). For instance my file would be named (note the use of lower case letters only):
- ruiz_smith_proj2_report.pdf if I worked with Joe Smith on this project.
[lastname1_lastname2]_proj2_slides.[ext] (or [lastname]_proj2_slides.[ext] in the case of students taking this course for graduate credit) containing your slides for your oral reports. This file should be either a PDF file (ext=pdf) or a PowerPoint file (ext=ppt). Your group will have only 4 minutes in class to discuss the entire project (both individual and group parts, and classification and association rules).

GRADING CRITERIA

FOR THE CLASSIFICATION RULES PART OF THE PROJECT 
TOTAL: 200 POINTS + EXTRA POINTS DEPENDING ON EXCEPTIONAL QUALITY

(30 POINTS TOTAL: 15 points for individual and 15 for group work) 
PRE-PROCESSING OF THE DATASET:
(05 points) Discretizing attributes as needed
(05 points) Dealing with missing values appropriately
(05 points) Dealing with attributes appropriately
           (i.e. using nominal values instead of numeric
            when appropriate, using as many of them 
            as possible, etc.) 
(up to 5 extra credit points) 
           Trying to do "fancier" things with attributes
           (i.e. combining two attributes highly correlated
            into one, using background knowledge, etc.)
    
(TOTAL: 15 points for individual work) 
ALGORITHMIC DESCRIPTION OF THE CODE
(05 points) Description of the algorithm underlying the Weka filters used
(15 points) Description of the algorithm underlying the construction and
            pruning of classification rules in Weka's PRISM code
(up to 5 extra credit points for an outstanding job) 
(providing just a structural description of the code, i.e. a list of 
classes and methods, will receive 0 points)

(TOTAL: 30 points for group work) 
CODE MODIFICATION:
(10 points) Description of the algorithmic modification
(20 points) Description of the modifications made to the Prism code 
(up to 10 extra credit points for an outstanding job) 

(120 POINTS TOTAL: 60 points for individual and 60 points for group work) 
EXPERIMENTS
(TOTAL: 30 points each dataset) FOR EACH DATASET:
       (06 points) ran a good number of experiments
                   to get familiar with the PRISM classification method and
                   different evaluation methods (%split, cross-validation,...)
       (08 points) good description of the experiment setting and the results 
       (08 points) good analysis of the results of the experiments
       (08 points) comparison of the results obtained with Prism and the
                   classifiers from previous project (ZeroR, ID3, and J4.8)
                   and argumentation of weaknesses and/or strengths of each of the
                   methods on this dataset, and argumentation of which method
                   should be preferred for this dataset and why. 
       (up to 5 extra credit points) excellent analysis of the results and 
                                     comparisons
       (up to 10 extra credit points) running additional interesting experiments
                   selecting other classification attributes instead of the 
                   required in this project statement ("private/public", "Survived")

(TOTAL 5 points) SLIDES - how well do they summarize concisely
        the results of the project? We suggest you summarize the
        setting of your experiments and their results in a tabular manner.

---------------------------------------------------------------------------------

FOR THE ASSOCIATION RULES PART OF THE PROJECT 
TOTAL: 200 POINTS + EXTRA POINTS DEPENDING ON EXCEPTIONAL QUALITY


(TOTAL: 15 points) ALGORITHMIC DESCRIPTION OF THE CODE
(05 points) Description of the algorithm underlying the Weka filters used
(10 points) Description of the Apriori algorithm for the construction of
            frequent itemsets and association rules. 
(up to 5 extra credit points for an outstanding job) 
(providing just a structural description of the code, i.e. a list of 
classes and methods, will receive 0 points)

(TOTAL: 35 points for group work) 
CODE MODIFICATION:
(10 points) Description of the algorithmic modification
(20 points) Description of the modifications made to the Apriori code 
(up to 10 extra credit points for an outstanding job) 

(20 POINTS TOTAL: 10 points for individual and 10 points for group work) 
PRE-PROCESSING OF THE DATASET:
(05 points) Discretizing attributes as needed
(05 points) Dealing with missing values appropriately
(up to 5 extra credit points) 
           Trying to do "fancier" things with attributes
           (i.e. combining two attributes highly correlated
            into one, using background knowledge, etc.)
    
(110 POINTS TOTAL: 55 points for individual and 55 points for group work) 
EXPERIMENTS
(TOTAL: 28 points each dataset) FOR EACH DATASET:
       (05 points) ran a good number of experiments to get familiar with the 
                   Apriori algorithm varying the input parameters 
       (05 points) good description of the experiment setting and the results 
       (13 points) good analysis of the results of the experiments
                   INCLUDING discussion of particularly interesting association 
                   rules obtained.
       (05 points) comparison of the association rules obtained by Apriori and 
                   the classification rules obtained by Prism in project 2.
                   Argumentation of weaknesses and/or strengths of each of the
                   methods on this dataset, and argumentation of which method
                   should be preferred for this dataset and why. 
       (up to 5 extra credit points) excellent analysis of the results and 
                                     comparisons
       (up to 10 extra credit points) running additional interesting experiments

(TOTAL 5 points) SLIDES - how well do they summarize concisely
        the results of the project? We suggest you summarize the
        setting of your experiments and their results in a tabular manner.
   (up to 6 extra credit points) for excellent summary and presentation of results 
   in the slides.


(TOTAL 15 points) Class presentation - how well your oral presentation summarized 
        concisely the results of the project and how focused your presentation was
        on the more creative/interesting/useful of your experiments and results.
        This grade is given individually to each team member.
