SDR = sd(sepallength over all instances) - ((k1/n)*sd(sepallength of instances with attribute value below the split point) + (k2/n)*sd(sepallength of instances with attribute value above the split point))

where sd stands for standard deviation, k1 is the number of instances with attribute value below the split point, k2 is the number of instances with attribute value above the split point, and n is the total number of instances. Note that you don't need to construct the whole tree, just the root node. SHOW IN DETAIL ALL THE STEPS OF THE PROCESS.
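As an illustration of the SDR formula, here is a minimal Python sketch that evaluates SDR for one candidate split and picks the candidate with the largest reduction. Several things are assumptions and not part of the assignment: the function names sdr and best_split, the use of the population standard deviation (statistics.pstdev), taking midpoints between consecutive distinct attribute values as candidate split points, and the made-up four-row dataset, which is not the iris data.

```python
from statistics import pstdev


def sd(values):
    # Population standard deviation (an assumption; the assignment does not
    # say whether the population or the sample form is intended).
    return pstdev(values) if values else 0.0


def sdr(rows, split_point):
    """Standard deviation reduction for one split.

    rows is a list of (attribute_value, target_value) pairs.
    """
    n = len(rows)
    targets = [t for _, t in rows]
    below = [t for a, t in rows if a < split_point]
    above = [t for a, t in rows if a >= split_point]
    k1, k2 = len(below), len(above)
    return sd(targets) - ((k1 / n) * sd(below) + (k2 / n) * sd(above))


def best_split(rows):
    # Candidate split points: midpoints between consecutive distinct values
    # (an assumed convention, so no value ever equals a split point).
    values = sorted({a for a, _ in rows})
    candidates = [(lo + hi) / 2 for lo, hi in zip(values, values[1:])]
    return max(candidates, key=lambda s: sdr(rows, s))


# Made-up toy data: (attribute value, target value).
rows = [(1.0, 4.0), (2.0, 4.0), (3.0, 6.0), (4.0, 6.0)]
print(sdr(rows, 2.5))    # 1.0
print(best_split(rows))  # 2.5
```

To choose the root node, the same computation would be repeated for every candidate attribute, and the attribute and split point with the largest SDR would be kept.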
A main part of the project is the PREPROCESSING of your dataset. You should apply relevant filters to your dataset before doing the mining and/or using the results of previous mining tasks. For instance, you may decide to remove apparently irrelevant attributes, replace missing values if any, discretize attributes in a different way, etc. Your report should contain a detailed description of the preprocessing of your dataset and justifications of the steps you followed. If Weka does not provide the functionality you need to preprocess your data in order to obtain useful patterns, preprocess the data yourself by writing the necessary filters (you can incorporate them in Weka if you wish).
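To make two of the preprocessing steps mentioned above concrete, the following Python sketch shows one simple way to replace missing numeric values and to discretize a numeric attribute. It is only an illustration under stated assumptions: missing values are represented as None and replaced by the attribute mean, and discretization uses equal-width bins. The function names are hypothetical and this is not Weka's filter code.

```python
def replace_missing_with_mean(values):
    # Assumes missing entries are None and at least one value is present.
    present = [v for v in values if v is not None]
    mean = sum(present) / len(present)
    return [mean if v is None else v for v in values]


def equal_width_bins(values, bins):
    # Map each value to a bin index in 0..bins-1 using equal-width intervals
    # between the minimum and maximum observed values.
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    if width == 0:
        return [0] * len(values)  # all values identical: a single bin
    # The maximum value would fall just past the last bin, so clamp it.
    return [min(int((v - lo) / width), bins - 1) for v in values]


cleaned = replace_missing_with_mean([1.0, None, 3.0])
print(cleaned)                                   # [1.0, 2.0, 3.0]
print(equal_width_bins([0.0, 5.0, 10.0], bins=2))  # [0, 1, 1]
```

Whichever variant you use, the report should state the choice (for example the number of bins) and justify it.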
To the extent possible/necessary, modify the attribute names and the nominal value names so that the resulting models are easy to read.
You may restrict your experiments to a subset of the instances in the input data IF Weka cannot handle your whole dataset (this is unlikely). But remember that the more accurate your models, the better.
Analyze in detail the results obtained. For classification models, analyze the accuracy of the resulting models. For numeric predictions, analyze the errors reported by Weka and explain their meaning.
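When explaining the error measures for numeric prediction, it can help to recompute them by hand on a few instances. The Python sketch below computes the mean absolute error, root mean squared error, relative absolute error, and root relative squared error from lists of actual and predicted values. The function name error_report and the example numbers are made up for illustration. As a simplifying assumption, the baseline for the two relative measures is the mean of the actual values passed in, which may differ from the baseline Weka uses.

```python
import math


def error_report(actual, predicted):
    n = len(actual)
    mean_actual = sum(actual) / n
    abs_err = [abs(p - a) for a, p in zip(actual, predicted)]
    sq_err = [(p - a) ** 2 for a, p in zip(actual, predicted)]
    # Errors of a baseline that always predicts the mean (assumed baseline).
    # If all actual values are equal these sums are zero and the relative
    # measures are undefined.
    base_abs = sum(abs(mean_actual - a) for a in actual)
    base_sq = sum((mean_actual - a) ** 2 for a in actual)
    return {
        "mean_absolute_error": sum(abs_err) / n,
        "root_mean_squared_error": math.sqrt(sum(sq_err) / n),
        "relative_absolute_error": sum(abs_err) / base_abs,
        "root_relative_squared_error": math.sqrt(sum(sq_err) / base_sq),
    }


# Made-up values: only the last prediction is off, by 2.
report = error_report([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 6.0])
for name, value in report.items():
    print(f"{name}: {value:.4f}")
```

A relative error below 1 (100%) means the model does better than always predicting the mean; a value above 1 means it does worse.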
Your report should contain discussions of all the parts described in the PROJECT ASSIGNMENT section above and in addition should elaborate on the following topics:
Provide a detailed description of the preprocessing of your data. Justify the preprocessing you apply and why the resulting data is the appropriate one for mining classification rules from it.
Please submit the following files using the turnin system by 12 noon on Wed, April 23 2003. For your turnin submission, THE NAME OF THE PROJECT IS "project4". PLEASE MAKE JUST ONE PROJECT SUBMISSION PER GROUP. Submissions received on Wed, April 23 between 12 noon and 3:00 pm will be penalized with 30% off the grade and submissions after April 23 AT 3:00 pm won't be accepted.
Turnin complains about file names that are too long. If the name of your file is too long, feel free to shorten it as necessary, but please keep the _proj3_report.pdf part intact for easy identification.
TOTAL: 100 POINTS + EXTRA POINTS DEPENDING ON EXCEPTIONAL QUALITY
---------------------------------------------------------------------------
(TOTAL: 30 points) FOR PART I OF THE PROJECT

(TOTAL: 70 points) FOR PART II OF THE PROJECT

  (TOTAL: 15 points) ALGORITHMIC DESCRIPTION OF THE CODE
    (05 points) Description of the algorithm underlying the Weka filters used
    (10 points) Description of the ALGORITHM underlying the data mining methods used in this project
    (up to 10 extra credit points for an outstanding job)
    (providing just a structural description of the code, i.e. a list of classes and methods, will receive 0 points)

  (TOTAL: 5 points) PRE-PROCESSING OF THE DATASET
    Discretizing attributes IF needed, and dealing with missing values appropriately
    (up to 5 extra credit points) Trying to do "fancier" things with attributes (i.e. combining two highly correlated attributes into one, using background knowledge, etc.)

  (TOTAL: 46 points) EXPERIMENTS
    (TOTAL: 23 points each dataset) FOR EACH DATASET:
      (05 points) ran a good number of experiments to get familiar with the data mining methods in this project
      (05 points) good description of the experiment setting and the results
      (08 points) good analysis of the results of the experiments INCLUDING discussion of evaluation statistics returned by the Weka system (accuracy and/or errors) and discussion of particularly interesting results
      (05 points) comparison of the results with those obtained using other methods in this and previous projects. Argumentation of weaknesses and/or strengths of each of the methods on this dataset, and argumentation of which method should be preferred for this dataset and why.
      (up to 10 extra credit points) excellent analysis of the results and comparisons
      (up to 10 extra credit points) running additional interesting experiments

  (TOTAL: 4 points) SLIDES - how well do they summarize concisely the results of the project? We suggest you summarize the setting of your experiments and their results in a tabular manner.
    (up to 6 extra credit points) for excellent summary and presentation of results in the slides.