CS 4445 B Term 2006 - Homework 4 and Project 4

@relation world_happiness % - Life Expectancy from UN Human Development Report (2003) % - GDP per capita from figure published by the CIA (2006), figure in US$. % - Access to secondary education rating from UNESCO (2002) % - SWL (satisfaction with life) index calculated from data published % by New Economics Foundation (2006). @attribute country string @attribute continent {Americas,Africa,Asia,Europe} @attribute life-expectancy numeric @attribute GDP-per-capita numeric @attribute access-to-education-score numeric @attribute SWL-index numeric @data Switzerland, Europe, 80.5, 32.3, 99.9, 273.33 Canada, Americas, 80, 34, 102.6, 253.33 Usa, Americas, 77.4, 41.8, 94.6, 246.67 Germany, Europe, 78.7, 30.4, 99, 240 Mexico, Americas, 75.1, 10, 73.4, 230 France, Europe, 79.5, 29.9, 108.7, 220 Thailand, Asia, 70, 8.3, 79, 216.67 Brazil, Americas, 70.5, 8.4, 103.2, 210 Japan, Asia, 82, 31.5, 102.1, 206.67 India, Asia, 63.3, 3.3, 49.9, 180 Ethiopia, Africa, 47.6, 0.9, 5.2, 156.67 Russia, Asia, 65.3, 11.1, 81.9, 143.3

4-NN PREDICTION Costa_Rica, Americas, 78.2, 11.1, 50.9, 250 __________ United_Kingdom, Europe, 78.4, 30.3, 157.2, 236.67 __________ South_Africa, Africa, 48.4, 12, 90.2, 190 __________ 4-NN ERROR root mean-square error (see p. 178) __________ mean absolute error (see p. 178) __________

GRADING CRITERIA

(TOTAL: 100 points) FOR THE HOMEWORK (PART I) as stated in the Homework assignment above.

(TOTAL: 200 points) FOR THE PROJECT (PART II) as follows:

NOTES:

Along with scoring criteria, provided are lists of suggestions. Positive suggestions (marked with green "+") include aspects of a project that have positive impacts on the scoring criteria. Negative suggestions (marked with red "-") have negative impacts. Finally, suggestions marked with "!" result in extra points.

!this can earn a few extra points (0-8)

! !this can earn a more extra points (0-16)

! ! !this can earn even more extra points (0-24)

! ! ! !this can earn a large extra points (0-32)

! ! ! ! !this can earn a very large amount of extra points (0-40)

+this has a positive impact

+ +this has a greater positive impact

+ + +this has a significant positive impact

+ + + +this has very significant positive impact

+ + + + +this has huge positive impact

-this has a negative impact

- -this has a greater negative impact

- - -this has a significant negative impact

- - - -this has very significant negative impact

- - - - -this has huge negative impact

METHODS for this project are:

IBk,
k-nearest neighbors,
locally weighted regression (LWL \w linear regression), and
simple k-means.

(OPTIONAL) METHODS for this project are:

COBWEB/CLASSIT and
EM

DATASETS for this project are:

dataset1: world happiness dataset

TOTAL: 15 points: Class Presentation

(15 points) How well your oral presentation summarized concisely the results of the project and how focus your presentation was on the more creative/interesting/useful of your experiments and results. This grade is given individually to each team member.

TOTAL: 185 points: Project 4 Report

TOTAL: 20 points: Algorithmic Description of the Code

(05 points) Description of the algorithms underlying the WEKA filters used. If you have already described some used filters in previous projects then copying and pasting your work would be a good idea. If issues were raised about the old descriptions, consider revising them.

(TOTAL: 15 points) Description of the ALGORITHMS underlying the data mining methods used in this project (see METHODS). This includes:

(10 points) Descriptions of model construction

(5 points) Descriptions of instance classification/prediction

!extra credit points for description of the optional methods (see OPTIONAL METHODS)

+(very) high-level pseudo code with explanation and justification

- -raw pseudo code with no justification or reasoning behind the steps involved

- - - - -structural description of WEKA code, i.e. classes/members/methods

TOTAL: 5 points: Slides

(5 points) How well do they summarize concisely the results of the project? We suggest you summarize the setting of your experiments and their results in a tabular manner.

! excellent summary and presentation of results in the slides

+ (potential) unanswered questions presented to class

+ visual aids

+ +main ideas and observations summarized

- nothing but accuracy measures

- lack of visual aids

TOTAL: 5 points: Pre-Processing of the Datasets

(5 points) Preprocess attributes as needed and dealing with missing values appropriately

!Trying to do "fancier" things with attributes (i.e. combining two attributes highly correlated into one, using background knowledge, etc.)

TOTAL: 155 points: Experiments Goals

(30 points) Good description of the experiment setting and the results. This applies to all the experiments done for this project.

! ! ! meaningful graphical representation of the results (i.e. k-nearest neighbors of a given test data instance, in the case of instance based learning; and the set of resulting clusters; in the case of clusterings)

+ + + motivation for experiments described

+ specify testing method used (cross-validation / % split / etc.)

+ + overall summary of your experiments in one concise list that includes relevant parameters

- included are trivial experiment details that are not used in dicscussion (full algorithm parameters, full WEKA output, etc.)

- - ambiguous or unclear setting and/or results

(30 points) Good analysis of the results of the experiments INCLUDING discussion of evaluations statistics returned by the WEKA systems (accuracy and/or errors) and discussion of particularly interesting results. This applies to all the experiments done for this project.

! ! ! !excellent analysis of the results and comparisons

! ! ! !running additional interesting experiments

- - - - results rewritten in prose used in place of discussion and analysis

- -looking only at accuracy/error measures

TOTAL: 35 points: Method-Oriented Goals

(35 points) ran a good number of experiments to get familiar with the data mining methods in this project (see METHODS) using a variety of datasets (see DATASETS)

+explore variation in non-trivial method parameters (debug/verbose mode IS a trivial parameter)

- -looking only at accuracy/error measures resulting from parameter changes

TOTAL: 40 points: Dataset-Oriented Goals

(20 points) 5 or more specific questions/conjectures about the dataset domain that you aim to answer/validate with your experiments

-questions not specific to the dataset

(20 points) Comparison of the results across all the methods used in this project (see METHODS). That is, compare the results of one method on this dataset to the results of the other methods on the same dataset.

! !comparison with methods from past projects

- -comparing only accuracy/error measures

TOTAL: 20 points: Holistic Goals

(20 points) Argumentation of weaknesses and/or strengths of each of the methods on this dataset, and argumentation of which method should be preferred for this dataset and why.

TOTAL: 60 points: Extra Credit

You may want to modify the Weka code so that it outputs:

(15 points) the k-nearest neighbors of the each test instance
(15 points for each of the 3 clustering methods) the specific training instances assigned to each cluster

PROJECT SUBMISSION AND DUE DATE

CS 4445 Data Mining and Knowledge Discovery in Databases - B Term 2006
Homework and Project 4: Instance Based Learning and Clustering

PROF. CAROLINA RUIZ

HOMEWORK AND PROJECT DESCRIPTION

HOMEWORK ASSIGNMENT

PROJECT ASSIGNMENT

GRADING CRITERIA

CS 4445 Data Mining and Knowledge Discovery in Databases - B Term 2006 Homework and Project 4: Instance Based Learning and Clustering

PROF. CAROLINA RUIZ

HOMEWORK AND PROJECT DESCRIPTION

HOMEWORK ASSIGNMENT

PROJECT ASSIGNMENT

PROJECT SUBMISSION AND DUE DATE

GRADING CRITERIA

CS 4445 Data Mining and Knowledge Discovery in Databases - B Term 2006
Homework and Project 4: Instance Based Learning and Clustering