WPI Worcester Polytechnic Institute

Computer Science Department
------------------------------------------

CS 4445 Data Mining and Knowledge Discovery in Databases - B Term 2006 
Homework and Project 2: Data Pre-processing, Mining, and Evaluation of Rules

PROF. CAROLINA RUIZ 

DUE DATE:
Part I (the individual homework assignment) is due on Tuesday, November 14 2006 at 12:00 noon, and
Parts II.1 and II.2 (the individual+group project) are due on Friday, November 17 2006 at 12:00 noon. 

------------------------------------------


HOMEWORK AND PROJECT DESCRIPTION

The purpose of this project is multi-fold: to gain experience with data pre-processing, with mining classification and association rules, and with evaluating the resulting rules.

Readings: Read Sections 4.1, 4.4, 4.5, and 6.2 from your textbook in great detail.

INDIVIDUAL HOMEWORK ASSIGNMENT

See the solutions to this homework by Piotr Mardziel.

Consider the following dataset, adapted from the Car Evaluation Dataset available at the University of California Irvine (UCI) Machine Learning Data Repository.

ATTRIBUTES:         POSSIBLE VALUES:
buying-price        {vhigh, high, med, low}
maintenance         {vhigh, high, med, low}
persons             {2, 4, more}             % Assumed to be a nominal attribute
safety              {low, med, high}
recommendation      {unacc, acc, good}

buying-price   maintenance   persons   safety   recommendation
high           med           4         high     good
low            med           2         med      unacc
low            high          2         high     unacc
low            vhigh         more      med      acc
med            vhigh         4         med      acc
vhigh          vhigh         4         med      unacc
med            med           more      med      acc
med            vhigh         more      low      unacc
med            low           4         med      acc
high           med           4         low      unacc
low            med           4         high     good
low            low           2         high     unacc

  1. (50 points) Construct "by hand" all the perfect classification rules that the Prism algorithm would output for this dataset, using the ratio p/t to rank the attribute-value pairs that are candidates for inclusion in a rule. Your written solutions should show all your work; that is, list all the attribute-value pairs that were candidates during each stage of the rule-construction process and indicate which one was selected. (A sketch of the PRISM selection loop appears after this list.)

  2. (50 points) Mine association rules by hand from this dataset by faithfully following the Apriori algorithm with minimum support = 25% (since the dataset contains 12 instances, the minimum support count is 3 instances) and minimum confidence = 90%. That is, start by generating candidate itemsets and frequent itemsets level by level, and after all frequent itemsets have been generated, produce from them all the rules with confidence greater than or equal to the minimum confidence. SHOW IN DETAIL ALL THE STEPS OF THE PROCESS. (A sketch of the Apriori level-wise search appears after this list.)
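
To help check the hand construction in item 1, the sketch below illustrates the PRISM covering strategy with the p/t selection criterion (p = instances covered by the candidate rule that belong to the target class, t = all instances it covers; ties are broken in favor of the larger p). It is a minimal illustration written from the textbook's description of the algorithm, not Weka's actual Prism code, and representing each instance as an attribute-to-value dictionary is an assumption made only for this sketch.

    # Minimal sketch of PRISM (textbook description, not Weka's code).
    # Instances are dicts mapping attribute name -> value; the rules returned
    # are (conditions, class_value) pairs, where conditions is a dict of tests.

    def prism(instances, attributes, class_attr):
        rules = []
        for cls in {inst[class_attr] for inst in instances}:
            remaining = list(instances)                 # fresh working set per class
            # keep building rules while uncovered instances of this class remain
            while any(inst[class_attr] == cls for inst in remaining):
                covered = list(remaining)               # an empty rule covers everything
                conditions = {}
                # grow the rule until it is perfect or no attributes are left
                while len(conditions) < len(attributes):
                    best, best_ratio, best_p = None, -1.0, -1
                    for attr in attributes:
                        if attr in conditions:
                            continue
                        for val in {inst[attr] for inst in covered}:
                            matches = [i for i in covered if i[attr] == val]
                            t = len(matches)            # total instances covered by the test
                            p = sum(1 for i in matches if i[class_attr] == cls)
                            # prefer the higher p/t ratio; break ties by the higher p
                            if t and (p / t, p) > (best_ratio, best_p):
                                best, best_ratio, best_p = (attr, val), p / t, p
                    if best is None:
                        break
                    conditions[best[0]] = best[1]
                    covered = [i for i in covered if i[best[0]] == best[1]]
                    if best_ratio == 1.0:               # the rule is now perfect
                        break
                rules.append((dict(conditions), cls))
                # remove the instances covered by the new rule
                remaining = [i for i in remaining
                             if not all(i[a] == v for a, v in conditions.items())]
        return rules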
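
Item 2 can be cross-checked against the following sketch of the Apriori level-by-level search and the subsequent rule generation, using the assignment's thresholds (support count >= 3 out of 12 instances, confidence >= 90%). Representing each data row as a set of (attribute, value) items is an assumption made only for this illustration; this is not Weka's Apriori implementation.

    # Minimal sketch of Apriori: level-wise frequent-itemset generation,
    # then rule generation from the frequent itemsets.
    from itertools import combinations

    def apriori(transactions, min_count=3, min_conf=0.9):
        # transactions: list of sets of (attribute, value) items
        def support(itemset):
            return sum(1 for t in transactions if itemset <= t)

        # Level 1: frequent single items
        items = {i for t in transactions for i in t}
        frequent = [{frozenset([i]) for i in items
                     if support(frozenset([i])) >= min_count}]

        # Level k: join frequent (k-1)-itemsets, prune, then count support
        k = 2
        while frequent[-1]:
            prev = frequent[-1]
            candidates = {a | b for a in prev for b in prev if len(a | b) == k}
            # prune candidates that have an infrequent (k-1)-subset
            candidates = {c for c in candidates
                          if all(frozenset(s) in prev
                                 for s in combinations(c, k - 1))}
            frequent.append({c for c in candidates if support(c) >= min_count})
            k += 1

        # Rule generation: split every frequent itemset into LHS -> RHS and
        # keep the rules whose confidence reaches the threshold
        rules = []
        for level in frequent:
            for itemset in (s for s in level if len(s) >= 2):
                for r in range(1, len(itemset)):
                    for lhs in map(frozenset, combinations(itemset, r)):
                        rhs = itemset - lhs
                        conf = support(itemset) / support(lhs)
                        if conf >= min_conf:
                            rules.append((set(lhs), set(rhs), support(itemset), conf))
        return frequent, rules

    # Example transaction built from the first data row above:
    # {("buying-price", "high"), ("maintenance", "med"), ("persons", "4"),
    #  ("safety", "high"), ("recommendation", "good")}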

Submission and Due Date.

Part I is due Tuesday, Nov. 14th at 12:00 noon. Bring a hardcopy of your homework solutions to class.

PROJECT ASSIGNMENT

The following are general guidelines for the project.

Datasets:

Together with your project partner, choose two datasets from the following three options:

  1. The Titanic Dataset. Look at the dataset description and the data instances.

    I suggest you use the following nominal values for the attributes, rather than 0s and 1s, to make the association rules easier to read (a small recoding sketch appears after this list of dataset options):

    Class (0 = crew, 1 = first, 2 = second, 3 = third)
    Age   (1 = adult, 0 = child)
    Sex   (1 = male, 0 = female)
    Survived (1 = yes, 0 = no)
    
    The "Survived" attribute is the class/target attribute of the Titanic Data.

  2. 1995 Data Analysis Exposition. This dataset contains college data taken from the U.S. News & World Report's Guide to America's Best Colleges. The necessary files are linked from the dataset description. Let's make "private/public" the classification target. Note that even though the values of this attribute are 0s and 1s, it is a nominal (not a numeric!) attribute.

  3. A dataset of your choice. This dataset can be one available on a public, online data repository (including but not limited to the datasets used in Project 1) or from any other valid source. The dataset should contain at least 500 data instances and at least 5 different attributes (ideally some numeric and some nominal).
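
As suggested for the Titanic data above, and likewise for the 0/1 "private/public" attribute of the colleges data, recoding numeric-looking codes as readable nominal labels before mining keeps the resulting rules easy to read and ensures the attributes are treated as nominal rather than numeric. The sketch below shows one possible way to do the recoding; the file names and the assumption that the data is in CSV form with the column headers shown are hypothetical placeholders to adapt to the files you actually download.

    # Hypothetical recoding sketch: map the Titanic 0/1 codes to nominal labels.
    # File names and column headers are placeholders, not the actual files.
    import csv

    LABELS = {
        "Class":    {"0": "crew", "1": "first", "2": "second", "3": "third"},
        "Age":      {"1": "adult", "0": "child"},
        "Sex":      {"1": "male", "0": "female"},
        "Survived": {"1": "yes", "0": "no"},
    }

    with open("titanic_raw.csv", newline="") as src, \
         open("titanic_nominal.csv", "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            writer.writerow({col: LABELS.get(col, {}).get(val, val)
                             for col, val in row.items()})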

Experiments:

For each of the two datasets, use the Weka system to perform the operations described in the grading criteria below: pre-process the data, mine classification rules with Prism (evaluating them with different evaluation methods), and mine association rules with Apriori (varying its input parameters).

PROJECT SUBMISSION AND DUE DATE

Part II is due Friday, Nov. 17 at 12:00 noon. BRING A HARDCOPY OF THE INDIVIDUAL AND GROUP WRITTEN REPORTS WITH YOU TO CLASS. In addition, you must submit your report electronically as specified below. Submissions received on Friday, Nov. 17th between 12:01 pm and 12:00 midnight will be penalized with 30% off the grade; submissions received on Saturday, Nov. 18th between 12:01 am (early morning) and 8:00 am will be penalized with 60% off the grade; and submissions received after Saturday, Nov. 18th at 8:00 am won't be accepted.

Please submit the following files using the myWpi digital drop box:

  1. [lastname]_proj2_report.[ext] containing your individual written report. This file should be either a PDF file (ext=pdf), a Word file (ext=doc), or a PostScript file (ext=ps). For instance, my file would be named ruiz_proj2_report.pdf (note the use of lower-case letters only).

    If you are taking this course for graduate credit, state this fact at the beginning of your report. In that case, submit only an individual report containing both the "individual" and the "group" parts, since you are working by yourself on the projects.

  2. [lastname1_lastname2]_proj2_report.[ext] containing your group written report. This file should be either a PDF file (ext=pdf), a Word file (ext=doc), or a PostScript file (ext=ps). As with the individual report, use lower-case letters only in the file name.

  3. [lastname1_lastname2]_proj2_slides.[ext] (or [lastname]_proj2_slides.[ext] in the case of students taking this course for graduate credit) containing your slides for your oral reports. This file should be either a PDF file (ext=pdf) or a PowerPoint file (ext=ppt). Your group will have only 4 minutes in class to discuss the entire project (both individual and group parts, and classification and association rules).

GRADING CRITERIA

INDIVIDUAL

(TOTAL 15 points) Class presentation - how well your oral presentation concisely summarized the results of the project, and how focused it was on the most creative/interesting/useful of your experiments and results. This grade is given individually to each team member.

Classification Rules

(TOTAL: 15 points for individual work) ALGORITHMIC DESCRIPTION OF THE CODE
  (05 points) Description of the algorithm underlying the Weka filters used
  (15 points) Description of the algorithm underlying the construction and pruning of classification rules in Weka's PRISM code
  (up to 5 extra credit points for an outstanding job)
  Providing just a structural description of the code, that is, a list of classes and methods, will receive 0 points.

(TOTAL: 15 points for individual work) PRE-PROCESSING OF THE DATASET
  (05 points) Discretizing attributes as needed
  (05 points) Dealing with missing values appropriately
  (05 points) Dealing with attributes appropriately (i.e., using nominal values instead of numeric when appropriate, using as many of them as possible, etc.)
  (up to 5 extra credit points) Trying to do "fancier" things with attributes (i.e., combining two highly correlated attributes into one, using background knowledge, etc.)

(TOTAL: 72 points for individual work) EXPERIMENTS
  FIRST DATASET (36 points)
    (12 points) Ran a good number of experiments to get familiar with the PRISM classification method and different evaluation methods (%split, cross-validation, ...)
    (08 points) Good description of the experiment settings and the results
    (12 points) Good analysis of the results of the experiments
    (04 points) Discussion of weaknesses and/or strengths of the Prism algorithm and its application to the dataset
    (up to 5 extra credit points) Excellent analysis of the results and comparisons
    (up to 10 extra credit points) Running additional interesting experiments selecting classification attributes other than those required in this project statement ("private/public", "Survived")
  SECOND DATASET (36 points)
    (12 points) Ran a good number of experiments to get familiar with the PRISM classification method and different evaluation methods (%split, cross-validation, ...)
    (08 points) Good description of the experiment settings and the results
    (12 points) Good analysis of the results of the experiments
    (04 points) Discussion of weaknesses and/or strengths of the Prism algorithm and its application to the dataset
    (up to 5 extra credit points) Excellent analysis of the results and comparisons
    (up to 10 extra credit points) Running additional interesting experiments selecting classification attributes other than those required in this project statement ("private/public", "Survived")

Association Rules

(TOTAL: 15 points for individual work) ALGORITHMIC DESCRIPTION OF THE CODE
  (05 points) Description of the algorithm underlying the Weka filters used
  (10 points) Description of the Apriori algorithm for the construction of frequent itemsets and association rules
  (up to 5 extra credit points for an outstanding job)
  Providing just a structural description of the code, that is, a list of classes and methods, will receive 0 points.

(TOTAL: 15 points for individual work) PRE-PROCESSING OF THE DATASET
  (05 points) Discretizing attributes as needed
  (05 points) Dealing with missing values appropriately
  (05 points) Dealing with attributes appropriately (i.e., using nominal values instead of numeric when appropriate, using as many of them as possible, etc.)
  (up to 5 extra credit points) Trying to do "fancier" things with attributes (i.e., combining two highly correlated attributes into one, using background knowledge, etc.)

(TOTAL: 65 points for individual work) EXPERIMENTS
  FIRST DATASET (33 points)
    (10 points) Ran a good number of experiments to get familiar with the Apriori algorithm, varying the input parameters
    (05 points) Good description of the experiment settings and the results
    (13 points) Good analysis of the results of the experiments, INCLUDING discussion of particularly interesting association rules obtained
    (05 points) Comparison of the association rules obtained by Apriori and the classification rules obtained by Prism in Project 2; argumentation of weaknesses and/or strengths of each method on this dataset, and of which method should be preferred for this dataset and why
    (up to 5 extra credit points) Excellent analysis of the results and comparisons
    (up to 10 extra credit points) Running additional interesting experiments
  SECOND DATASET (33 points)
    (10 points) Ran a good number of experiments to get familiar with the Apriori algorithm, varying the input parameters
    (05 points) Good description of the experiment settings and the results
    (13 points) Good analysis of the results of the experiments, INCLUDING discussion of particularly interesting association rules obtained
    (05 points) Comparison of the association rules obtained by Apriori and the classification rules obtained by Prism in Project 2; argumentation of weaknesses and/or strengths of each method on this dataset, and of which method should be preferred for this dataset and why
    (up to 5 extra credit points) Excellent analysis of the results and comparisons
    (up to 10 extra credit points) Running additional interesting experiments

JOINT

(TOTAL 10 points) SLIDES - how well do they concisely summarize the results of the project? We suggest you summarize the settings of your experiments and their results in tabular form.

Classification Rules

(TOTAL: 10 points for group work) PRE-PROCESSING OF THE DATASET
  (05 points) Discretizing attributes as needed
  (05 points) Dealing with missing values appropriately
  (up to 5 extra credit points) Trying to do "fancier" things with attributes (i.e., combining two highly correlated attributes into one, using background knowledge, etc.)

(TOTAL: 48 points for group work) EXPERIMENTS
  FIRST DATASET (24 points)
    (08 points) Good description of the experiment settings and the results
    (12 points) Good analysis of the results of the experiments
    (04 points) Discussion of weaknesses and/or strengths of the Prism algorithm and its application to the dataset
    (up to 5 extra credit points) Excellent analysis of the results and comparisons
    (up to 10 extra credit points) Running additional interesting experiments selecting classification attributes other than those required in this project statement ("private/public", "Survived")
  SECOND DATASET (24 points)
    (08 points) Good description of the experiment settings and the results
    (12 points) Good analysis of the results of the experiments
    (04 points) Discussion of weaknesses and/or strengths of the Prism algorithm and its application to the dataset
    (up to 5 extra credit points) Excellent analysis of the results and comparisons
    (up to 10 extra credit points) Running additional interesting experiments selecting classification attributes other than those required in this project statement ("private/public", "Survived")

(TOTAL: 30 points for group work) CODE MODIFICATION
  (10 points) Description of the algorithmic modification
  (20 points) Description of the modifications made to the Prism code
  (up to 10 extra credit points for an outstanding job)

Association Rules

(TOTAL: 10 points for group work) PRE-PROCESSING OF THE DATASET
  (05 points) Discretizing attributes as needed
  (05 points) Dealing with missing values appropriately
  (up to 5 extra credit points) Trying to do "fancier" things with attributes (i.e., combining two highly correlated attributes into one, using background knowledge, etc.)

(TOTAL: 45 points for group work) EXPERIMENTS
  FIRST DATASET (23 points)
    (05 points) Good description of the experiment settings and the results
    (13 points) Good analysis of the results of the experiments, INCLUDING discussion of particularly interesting association rules obtained
    (05 points) Comparison of the association rules obtained by Apriori and the classification rules obtained by Prism in Project 2; argumentation of weaknesses and/or strengths of each method on this dataset, and of which method should be preferred for this dataset and why
    (up to 5 extra credit points) Excellent analysis of the results and comparisons
    (up to 10 extra credit points) Running additional interesting experiments
  SECOND DATASET (23 points)
    (05 points) Good description of the experiment settings and the results
    (13 points) Good analysis of the results of the experiments, INCLUDING discussion of particularly interesting association rules obtained
    (05 points) Comparison of the association rules obtained by Apriori and the classification rules obtained by Prism in Project 2; argumentation of weaknesses and/or strengths of each method on this dataset, and of which method should be preferred for this dataset and why
    (up to 5 extra credit points) Excellent analysis of the results and comparisons
    (up to 10 extra credit points) Running additional interesting experiments

(TOTAL: 35 points for group work) CODE MODIFICATION
  (10 points) Description of the algorithmic modification
  (20 points) Description of the modifications made to the Apriori code
  (up to 10 extra credit points for an outstanding job)