Consider the dataset below. This dataset is a subset of the Auto Miles-per-gallon (MPG) dataset that is available at the University of California Irvine (UCI) Data Repository.
@relation subset-auto-mpg

@attribute mpg numeric
@attribute cylinders numeric
@attribute horsepower numeric
@attribute weight numeric
@attribute acceleration numeric
@attribute model-year numeric
@attribute car-name {chevrolet,toyota,volkswagen,ford}

@data
18, 8, 130, 3504, 12,   70, chevrolet
27, 4,  90, 2950, 17.3, 82, chevrolet
34, 4,  88, 2395, 18,   82, chevrolet
24, 4,  95, 2372, 15,   70, toyota
28, 4,  75, 2155, 16.4, 76, toyota
32, 4,  96, 2665, 13.9, 82, toyota
26, 4,  46, 1835, 20.5, 70, volkswagen
29, 4,  70, 1937, 14.2, 76, volkswagen
44, 4,  52, 2130, 24.6, 82, volkswagen
10, 8, 215, 4615, 14,   70, ford
28, 4,  79, 2625, 18.6, 82, ford
16, 8, 149, 4335, 14.5, 77, ford

For this homework, we want to predict the mpg attribute (prediction target) from the other predicting attributes: cylinders, horsepower, weight, acceleration, model-year, and car-name.
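If you want to double-check your hand computations against Weka itself, the ARFF above can be loaded programmatically. The following sketch (Java, using the Weka API; the file name subset-auto-mpg.arff is an assumption, use whatever name you saved the data under) loads the dataset and marks mpg as the class/prediction target:

    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class LoadAutoMpg {
        public static void main(String[] args) throws Exception {
            // File name is an assumption; adjust it to your own copy of the ARFF file.
            Instances data = DataSource.read("subset-auto-mpg.arff");
            // mpg is the first attribute and is the prediction target for this homework.
            data.setClassIndex(data.attribute("mpg").index());
            System.out.println(data.numInstances() + " instances, class attribute: "
                    + data.classAttribute().name());
        }
    }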
SDR = sd(CLASS over all instances)
      - [ (k1/n) * sd(CLASS of instances with attribute value below the split point)
        + (k2/n) * sd(CLASS of instances with attribute value above the split point) ]

where:
  sd stands for standard deviation,
  k1 is the number of instances with attribute value below the split point,
  k2 is the number of instances with attribute value above the split point, and
  n  is the total number of instances.
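As an illustration only (not a required part of the homework), the Java sketch below computes the SDR of one candidate split from the class values of the instances that fall below and above the split point. The method names are made up for this example, and it uses the population standard deviation; switch the divisor to (v.length - 1) if your course uses the sample version.

    // Standard deviation reduction for one candidate split, following the formula above.
    static double sdr(double[] classBelow, double[] classAbove) {
        int k1 = classBelow.length, k2 = classAbove.length, n = k1 + k2;
        double[] all = new double[n];
        System.arraycopy(classBelow, 0, all, 0, k1);
        System.arraycopy(classAbove, 0, all, k1, k2);
        return sd(all) - ((double) k1 / n) * sd(classBelow)
                       - ((double) k2 / n) * sd(classAbove);
    }

    // Population standard deviation of a list of class values.
    static double sd(double[] v) {
        double mean = 0;
        for (double x : v) mean += x;
        mean /= v.length;
        double var = 0;
        for (double x : v) var += (x - mean) * (x - mean);
        return Math.sqrt(var / v.length);
    }

The attribute/split-point pair with the largest SDR is the one chosen for the node.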
                                              MODEL TREE     REGRESSION TREE   LINEAR REGRESSION
                                              PREDICTION     PREDICTION        PREDICTION
  13,   8, 150, 4464, 12,   73, chevrolet     __________     ___________       __________
  21,   4,  72, 2401, 19.5, 73, chevrolet     __________     ___________       __________
  20,   6, 122, 2807, 13.5, 73, toyota        __________     ___________       __________
  27.5, 4,  95, 2560, 14.2, 78, toyota        __________     ___________       __________
  27,   4,  60, 1834, 19,   71, volkswagen    __________     ___________       __________
  31.5, 4,  71, 1990, 14.9, 78, volkswagen    __________     ___________       __________
  21,   4,  86, 2226, 16.5, 72, ford          __________     ___________       __________
  36.1, 4,  66, 1800, 14.4, 78, ford          __________     ___________       __________

                                              MODEL TREE     REGRESSION TREE   LINEAR REGRESSION
                                              ERROR          ERROR             ERROR
  root mean-square error (see p. 148)         __________     ___________       __________
  mean absolute error (see p. 148)            __________     ___________       __________

SHOW IN DETAIL ALL THE STEPS OF THE PROCESS.
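For checking the two error rows, recall that the root mean-square error is sqrt( sum over the test instances of (actual - predicted)^2 / n ) and the mean absolute error is sum |actual - predicted| / n, where n is the number of test instances. A small Java sketch of these two formulas, using the actual mpg values of the eight test instances above (the predicted array is a placeholder to be filled with the predictions of whichever model you are evaluating), is:

    // Actual mpg values of the eight test instances listed above.
    double[] actual = {13, 21, 20, 27.5, 27, 31.5, 21, 36.1};
    // Placeholder: fill in the predictions produced by your model tree,
    // regression tree, or linear regression.
    double[] predicted = new double[actual.length];

    double squaredError = 0, absoluteError = 0;
    for (int i = 0; i < actual.length; i++) {
        double diff = actual[i] - predicted[i];
        squaredError += diff * diff;
        absoluteError += Math.abs(diff);
    }
    double rmse = Math.sqrt(squaredError / actual.length);  // root mean-square error
    double mae  = absoluteError / actual.length;            // mean absolute error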
Use the attribute age as the prediction target. After you run experiments predicting this attribute, you may, if you wish, run additional experiments using a different prediction target of your choice.
Use the attribute price as the prediction target. After you run experiments predicting this attribute, you may, if you wish, run additional experiments using a different prediction target of your choice.
Your individual report should contain discussions of all the parts of the individual work you do for this project. In particular, it should elaborate on the following topics:
Once you are done with these joint experiments, investigate the pruning method used in M5PRIME (i.e. when the option UNPRUNED=FALSE is chosen from the M5PRIME GUI). Describe in detail how the pruning is done and run experiments to see how it modifies the resulting tree and how this affects the error values reported for the tree.
Investigate also the smoothing method used in M5PRIME (i.e. when the option useUNSMOOTHED=FALSE is chosen from the M5PRIME GUI). Describe in detail how the smoothing is done and run experiments to see how it modifies the resulting tree and/or how it affects the error values reported for the tree.
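As a starting point for these experiments, the GUI options mentioned above correspond to properties of Weka's M5P class, which can also be toggled from code. A minimal sketch (the file name is an assumption) that builds a pruned, smoothed model tree and prints it:

    import weka.classifiers.trees.M5P;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class M5PrimeOptions {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("subset-auto-mpg.arff"); // assumed file name
            data.setClassIndex(0); // mpg is the prediction target

            M5P m5 = new M5P();
            m5.setUnpruned(false);            // UNPRUNED=FALSE      -> pruning enabled
            m5.setUseUnsmoothed(false);       // useUnsmoothed=FALSE -> smoothing enabled
            m5.setBuildRegressionTree(false); // false = model tree, true = regression tree
            m5.buildClassifier(data);
            System.out.println(m5);           // inspect the resulting tree structure
        }
    }

Toggling setUnpruned(true), setUseUnsmoothed(true), or setBuildRegressionTree(true) lets you compare the pruned/unpruned, smoothed/unsmoothed, and model-tree/regression-tree variants and the error values reported for each.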
Please submit the following files using the myWpi digital drop box:
If you are taking this course for grad. credit, state this fact at the beginning of your report. In this case, submit only an individual report containing both the "individual" and the "group" parts, since you are working on the projects by yourself.
---------------------------------------------------------------------------
(TOTAL: 75 points) FOR THE HOMEWORK (PART I) as stated in the Homework assignment above.

(TOTAL: 200 points) FOR THE PROJECT (PART II) as follows:

(TOTAL: 15 points) ALGORITHMIC DESCRIPTION OF THE CODE
  (05 points) Description of the algorithms underlying the Weka filters used.
  (10 points) Description of the ALGORITHM underlying the data mining methods used in this project.
  (up to 10 extra credit points for an outstanding job)
  (Providing just a structural description of the code, i.e. a list of classes and methods, will receive 0 points.)

(TOTAL: 5 points) PRE-PROCESSING OF THE DATASET:
  Preprocessing attributes as needed and dealing with missing values appropriately.
  (up to 5 extra credit points) Trying to do "fancier" things with attributes (e.g. combining two highly correlated attributes into one, using background knowledge, etc.)

(TOTAL: 160 points: 80 points for the individual part and 80 points for the group part) EXPERIMENTS
  (TOTAL: 40 points each dataset) FOR EACH DATASET:
    (05 points) Ran a good number of experiments to get familiar with the data mining methods in this project.
    (15 points) Good description of the experiment settings and the results.
    (15 points) Good analysis of the results of the experiments, INCLUDING discussion of the evaluation statistics returned by the Weka system (accuracy and/or errors) and discussion of particularly interesting results.
    (05 points) Comparison of the results with those obtained using the other methods in this project. Argumentation of the weaknesses and/or strengths of each of the methods on this dataset, and of which method should be preferred for this dataset and why.
    (up to 10 extra credit points) Excellent analysis of the results and comparisons.
    (up to 10 extra credit points) Running additional interesting experiments.

(TOTAL: 5 points) SLIDES - How well do they summarize concisely the results of the project? We suggest you summarize the settings of your experiments and their results in a tabular manner.
  (up to 6 extra credit points) For an excellent summary and presentation of results in the slides.

(TOTAL: 15 points) CLASS PRESENTATION - How well your oral presentation summarized concisely the results of the project, and how focused it was on the more creative/interesting/useful of your experiments and results. This grade is given individually to each team member.