CS 4445 B Term 2006 - Homework 3 and Project 3

NOTES:

Along with the scoring criteria, lists of suggestions are provided. Positive suggestions (marked with green "+") describe aspects of a project that have a positive impact on the scoring criteria. Negative suggestions (marked with red "-") have a negative impact. Finally, suggestions marked with "!" result in extra points.

!this can earn a few extra points (0-8)

! !this can earn more extra points (0-16)

! ! !this can earn even more extra points (0-24)

! ! ! !this can earn a large amount of extra points (0-32)

! ! ! ! !this can earn a very large amount of extra points (0-40)

+this has a positive impact

+ +this has a greater positive impact

+ + +this has a significant positive impact

+ + + +this has a very significant positive impact

+ + + + +this has a huge positive impact

-this has a negative impact

- -this has a greater negative impact

- - -this has a significant negative impact

- - - -this has a very significant negative impact

- - - - -this has a huge negative impact

TOTAL: 200 points: Project 3 Report

TOTAL: 20 points: Algorithmic Description of the Code

(5 points) Description of the algorithms underlying the WEKA filters used. If you have already described some of these filters in previous projects, copying and pasting that work is a good idea. If issues were raised about the old descriptions, consider revising them.

(TOTAL: 15 points) Description of the ALGORITHMS underlying the data mining methods used in this project (see METHODS). This includes:

(10 points) Descriptions of model construction

(5 points) Descriptions of instance classification/prediction

+(very) high-level pseudo code with explanation and justification

- -raw pseudo code with no justification or reasoning behind the steps involved

- - - - -structural description of WEKA code, i.e. classes/members/methods

TOTAL: 5 points: Slides

(5 points) How well and how concisely do they summarize the results of the project? We suggest you summarize the settings of your experiments and their results in tabular form.

! excellent summary and presentation of results in the slides

+ (potential) unanswered questions presented to class

+ visual aids

+ +main ideas and observations summarized

- nothing but accuracy measures

- lack of visual aids

TOTAL: 5 points: Pre-Processing of the Datasets

(5 points) Preprocessing attributes as needed and dealing with missing values appropriately

!Trying to do "fancier" things with attributes (e.g. combining two highly correlated attributes into one, using background knowledge, etc.)

TOTAL: 15 points: Class Presentation

(15 points) How well and how concisely your oral presentation summarized the results of the project, and how focused your presentation was on the more creative/interesting/useful of your experiments and results. This grade is given individually to each team member.

TOTAL: 155 points: Experiments Goals

(30 points) Good description of the experiment setting and the results. This applies to all the experiments done for this project.

+ + + motivation for experiments described

+ specify testing method used (cross-validation / % split / etc.)

+ + overall summary of your experiments in one concise list that includes relevant parameters

- trivial experiment details are included that are not used in the discussion (full algorithm parameters, full WEKA output, etc.)

- - ambiguous or unclear setting and/or results

(30 points) Good analysis of the results of the experiments INCLUDING discussion of the evaluation statistics returned by the WEKA system (accuracy and/or errors) and discussion of particularly interesting results. This applies to all the experiments done for this project.

! ! ! !excellent analysis of the results and comparisons

! ! ! !running additional interesting experiments

- - - - results rewritten in prose used in place of discussion and analysis
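
As an illustration of the kind of error statistics such a discussion might draw on for numeric prediction, the following is a minimal sketch that computes mean absolute error, root mean squared error, and their relative variants against a predictor that always outputs the mean. The function name error_measures is hypothetical, and the baseline here is the mean of the supplied actual values, which is a simplifying assumption and may differ from how WEKA derives its baseline.

```python
import math


def error_measures(actual, predicted):
    """Return MAE, RMSE, and relative errors against a mean predictor.

    Assumption: the baseline predictor is the mean of `actual` itself.
    """
    n = len(actual)
    mean_actual = sum(actual) / n

    # Errors of the model under evaluation.
    abs_err = sum(abs(p - a) for a, p in zip(actual, predicted))
    sq_err = sum((p - a) ** 2 for a, p in zip(actual, predicted))

    # Errors of the baseline that always predicts the mean.
    base_abs = sum(abs(a - mean_actual) for a in actual)
    base_sq = sum((a - mean_actual) ** 2 for a in actual)

    return {
        "mae": abs_err / n,
        "rmse": math.sqrt(sq_err / n),
        "rae": abs_err / base_abs,            # relative absolute error
        "rrse": math.sqrt(sq_err / base_sq),  # root relative squared error
    }


measures = error_measures([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 6.0])
```

A relative error below 1 means the model beats the mean predictor on that measure; comparing MAE with RMSE hints at whether a few large errors dominate.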

TOTAL: 35 points: Method-Oriented Goals

(15 points) ran a good number of experiments to get familiar with the data mining methods in this project (see METHODS) using a variety of datasets (see DATASETS)

+explore variation in non-trivial method parameters (debug/verbose mode IS a trivial parameter)

- - -use only one of the datasets

(10 points) Investigate the pruning method used in M5P (i.e., when the option UNPRUNED=FALSE is chosen from the M5P GUI), which is explained in your textbook. Describe in detail how the pruning is done and run experiments to see how it modifies the resulting tree and how this affects the error values reported for the tree.

- - -use only one of the datasets
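
The pruning decision can be sketched as follows, assuming the textbook description of M5-style pruning: the average absolute training error of a node's linear model is inflated by the factor (n + v) / (n - v), where n is the number of training instances reaching the node and v is the number of parameters in the model, and the subtree is replaced by the node's model when that estimate is no worse than the subtree's. The function names and the weighting of branch errors by branch size are illustrative assumptions, not WEKA's actual code.

```python
def adjusted_error(avg_abs_error, n, v):
    """Pessimistic error estimate: training error times (n + v) / (n - v)."""
    if n <= v:
        # Too few instances for the number of parameters: no usable estimate.
        return float("inf")
    return avg_abs_error * (n + v) / (n - v)


def should_prune(node_error, n, v, branch_errors, branch_sizes):
    """Decide whether to collapse a subtree into the node's linear model.

    branch_errors: already-adjusted error estimates of the child subtrees.
    branch_sizes:  number of training instances going down each branch.
    """
    model_estimate = adjusted_error(node_error, n, v)
    # Assumed combination rule: average of branch errors weighted by size.
    subtree_estimate = (
        sum(e * s for e, s in zip(branch_errors, branch_sizes))
        / sum(branch_sizes)
    )
    return model_estimate <= subtree_estimate


prune_a = should_prune(2.0, 20, 5, [3.0, 5.0], [10, 10])  # subtree is worse
prune_b = should_prune(2.0, 20, 5, [1.0, 2.0], [10, 10])  # subtree is better
```

Comparing trees built with and without pruning on each dataset shows how often this test fires and what it does to the reported errors.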

(10 points) Investigate also the smoothing method used in M5PRIME (i.e. when the option useUNSMOOTHED=FALSE is chosen from the M5PRIME GUI). Describe in detail how the smoothing is done and run experiments to see how it modifies the resulting tree and/or how it affects the error values reported for the tree.

- - -use only one of the datasets
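
The smoothing step can be sketched as follows, assuming the textbook formulation p' = (n*p + k*q) / (n + k), where p is the prediction passed up from the node below, q is the value predicted by the linear model at the current node, n is the number of training instances reaching the node below, and k is a smoothing constant. The function names are hypothetical and the default k = 15 is an assumption taken from common descriptions of the method; check your textbook and the WEKA source for the values actually used.

```python
def smooth(p, q, n, k=15.0):
    """Blend the prediction from below (p) with this node's prediction (q)."""
    return (n * p + k * q) / (n + k)


def smoothed_prediction(leaf_prediction, path, k=15.0):
    """Smooth a leaf prediction along the path back to the root.

    path: list of (n_below, q) pairs ordered from the leaf's parent up to
    the root, where n_below is the number of training instances at the node
    just below and q is the prediction of the model at the current node.
    """
    p = leaf_prediction
    for n_below, q in path:
        p = smooth(p, q, n_below, k)
    return p


one_step = smooth(10.0, 20.0, 15)                                   # 15.0
two_steps = smoothed_prediction(10.0, [(15, 20.0), (45, 30.0)])     # 18.75
```

Leaves reached by few training instances are pulled more strongly toward the predictions of their ancestors, which is one effect worth looking for when comparing smoothed and unsmoothed error values.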

TOTAL: 40 points: Dataset-Oriented Goals

For each dataset (see DATASETS):

(10 points) 5 or more specific questions/conjectures about the dataset domain that you aim to answer/validate with your experiments

(10 points) Comparison of the results across all the methods used in this project (see METHODS). That is, compare the results of one method on this dataset to the results of the other methods on the same dataset.

! !comparison with methods from past projects

TOTAL: 20 points: Holistic Goals

For each dataset (see DATASETS):

(10 points) Argumentation of weaknesses and/or strengths of each of the methods on this dataset, and argumentation of which method should be preferred for this dataset and why.

CS 4445 Data Mining and Knowledge Discovery in Databases - B Term 2006
Homework and Project 3: Numeric Predictions

PROF. CAROLINA RUIZ

PROJECT DESCRIPTION

PROJECT ASSIGNMENT

Experiments:

PROJECT SUBMISSION AND DUE DATE

GRADING CRITERIA
