

Consider the dataset below. This dataset is a subset of the Auto Miles-per-gallon (MPG) dataset that is available at the University of California, Irvine (UCI) Data Repository.
@relation subset-auto-mpg
@attribute mpg numeric
@attribute cylinders numeric
@attribute horsepower numeric
@attribute weight numeric
@attribute acceleration numeric
@attribute model-year numeric
@attribute car-name {chevrolet,toyota,volkswagen,ford}
@data
18, 8, 130, 3504, 12, 70, chevrolet
27, 4, 90, 2950, 17.3, 82, chevrolet
34, 4, 88, 2395, 18, 82, chevrolet
24, 4, 95, 2372, 15, 70, toyota
28, 4, 75, 2155, 16.4, 76, toyota
32, 4, 96, 2665, 13.9, 82, toyota
26, 4, 46, 1835, 20.5, 70, volkswagen
29, 4, 70, 1937, 14.2, 76, volkswagen
44, 4, 52, 2130, 24.6, 82, volkswagen
10, 8, 215, 4615, 14, 70, ford
28, 4, 79, 2625, 18.6, 82, ford
16, 8, 149, 4335, 14.5, 77, ford
For this homework, we want to predict the mpg attribute (prediction target)
from the other predicting attributes
cylinders,
horsepower,
weight,
acceleration,
model-year, and
car-name
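As a reference point for the experiments (and not a required part of the hand computation), the subset above can be loaded into Weka programmatically and mpg designated as the class attribute. The sketch below makes one assumption not stated in the assignment: that the ARFF listing above has been saved to a file named subset-auto-mpg.arff.

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class LoadAutoMpg {
    public static void main(String[] args) throws Exception {
        // File name is an assumption; save the ARFF listing above under this name.
        DataSource source = new DataSource("subset-auto-mpg.arff");
        Instances data = source.getDataSet();
        // mpg is the first attribute; make it the prediction target (class).
        data.setClassIndex(data.attribute("mpg").index());
        System.out.println("Loaded " + data.numInstances()
                + " instances, class attribute = " + data.classAttribute().name());
    }
}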
The split points of the tree are chosen so as to maximize the standard deviation reduction (SDR):
SDR = sd(CLASS over all instances)
      - ( (k1/n)*sd(CLASS of instances with attribute value below the split point)
        + (k2/n)*sd(CLASS of instances with attribute value above the split point) )
where sd stands for standard deviation,
k1 is the number of instances with attribute value below the split point,
k2 is the number of instances with attribute value above the split point, and
n is the total number of instances (n = k1 + k2).
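To make the definition concrete, here is a small sketch (the class and method names are made up for illustration) that computes the SDR of one candidate split of the 12 training instances, namely cylinders below 6 versus above 6. It uses the population standard deviation; the assignment does not say which sd variant to use, so that choice is an assumption.

public class SDRExample {

    // Population standard deviation of an array of class (mpg) values.
    static double sd(double[] values) {
        double mean = 0.0;
        for (double v : values) mean += v;
        mean /= values.length;
        double sumSq = 0.0;
        for (double v : values) sumSq += (v - mean) * (v - mean);
        return Math.sqrt(sumSq / values.length);
    }

    // SDR of a candidate split: sd over all n instances minus the
    // size-weighted sd of the two groups created by the split.
    static double computeSDR(double[] below, double[] above) {
        int k1 = below.length, k2 = above.length, n = k1 + k2;
        double[] all = new double[n];
        System.arraycopy(below, 0, all, 0, k1);
        System.arraycopy(above, 0, all, k1, k2);
        return sd(all) - ((double) k1 / n) * sd(below)
                       - ((double) k2 / n) * sd(above);
    }

    public static void main(String[] args) {
        // mpg values of the 4-cylinder cars (below the split) and of the
        // 8-cylinder cars (above the split) from the 12 training instances.
        double[] below = {27, 34, 24, 28, 32, 26, 29, 44, 28};
        double[] above = {18, 10, 16};
        System.out.println("SDR for the split cylinders < 6: " + computeSDR(below, above));
    }
}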
                                     LINEAR REGRESSION   MODEL TREE    REGRESSION TREE
                                     PREDICTION          PREDICTION    PREDICTION
13, 8, 150, 4464, 12, 73, chevrolet __________ ___________ __________
21, 4, 72, 2401, 19.5, 73, chevrolet __________ ___________ __________
20, 6, 122, 2807, 13.5, 73, toyota __________ ___________ __________
27.5, 4, 95, 2560, 14.2, 78, toyota __________ ___________ __________
27, 4, 60, 1834, 19, 71, volkswagen __________ ___________ __________
31.5, 4, 71, 1990, 14.9, 78, volkswagen __________ ___________ __________
21, 4, 86, 2226, 16.5, 72, ford __________ ___________ __________
36.1, 4, 66, 1800, 14.4, 78, ford __________ ___________ __________
                                     LINEAR REGRESSION   MODEL TREE    REGRESSION TREE
                                     ERROR               ERROR         ERROR
root mean-square error (see p. 148) __________ ___________ __________
mean absolute error (see p. 148) __________ ___________ __________
SHOW IN DETAIL ALL THE STEPS OF THE PROCESS.
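To cross-check the hand-computed entries of the two tables above, the same three schemes can also be built and evaluated in Weka: LinearRegression for the first column, M5P (Weka's M5' implementation) with its default settings for the model tree, and M5P with buildRegressionTree turned on for the regression tree. A minimal sketch follows; the file names train.arff (the 12 labeled instances) and test.arff (the 8 instances above) are assumptions.

import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.functions.LinearRegression;
import weka.classifiers.trees.M5P;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CompareRegressors {
    public static void main(String[] args) throws Exception {
        Instances train = new DataSource("train.arff").getDataSet();  // assumed file name
        Instances test  = new DataSource("test.arff").getDataSet();   // assumed file name
        train.setClassIndex(0);   // mpg is the first attribute
        test.setClassIndex(0);

        LinearRegression lr = new LinearRegression();
        M5P modelTree = new M5P();                    // model tree (default M5P)
        M5P regressionTree = new M5P();
        regressionTree.setBuildRegressionTree(true);  // constant leaves instead of linear models

        for (Classifier c : new Classifier[] {lr, modelTree, regressionTree}) {
            c.buildClassifier(train);
            // Per-instance predictions for the first table.
            for (int i = 0; i < test.numInstances(); i++) {
                System.out.println(c.getClass().getSimpleName()
                        + " predicts " + c.classifyInstance(test.instance(i)));
            }
            // Root mean-squared error and mean absolute error for the second table.
            Evaluation eval = new Evaluation(train);
            eval.evaluateModel(c, test);
            System.out.println("RMSE = " + eval.rootMeanSquaredError()
                    + ", MAE = " + eval.meanAbsoluteError());
        }
    }
}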
Use the attribute age as the prediction target. After you run experiments predicting this attribute you may, if you wish, run additional experiments using a different prediction target of your choice.
Use the attribute price as the prediction target. After you run experiments predicting this attribute you may, if you wish, run additional experiments using a different prediction target of your choice.
Your individual report should contain discussions of all the parts of the individual work you do for this project. In particular, it should elaborate on the following topics:
Once you are done with these joint experiments, investigate the pruning method used in M5PRIME (i.e. when the option UNPRUNED=FALSE is chosen from the M5PRIME GUI). Describe in detail how the pruning is done and run experiments to see how it modifies the resulting tree and how this affects the error values reported for the tree.
Investigate also the smoothing method used in M5PRIME (i.e. when the option useUNSMOOTHED=FALSE is chosen from the M5PRIME GUI). Describe in detail how the smoothing is done and run experiments to see how it modifies the resulting tree and/or how it affects the error values reported for the tree.
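As a starting point for these pruning and smoothing experiments, both options can also be toggled outside the GUI. The sketch below assumes the same train.arff/test.arff files as in the earlier sketch and simply loops over the four combinations, printing the resulting tree and its errors each time.

import weka.classifiers.Evaluation;
import weka.classifiers.trees.M5P;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class PruneSmoothExperiments {
    public static void main(String[] args) throws Exception {
        Instances train = new DataSource("train.arff").getDataSet();  // assumed file name
        Instances test  = new DataSource("test.arff").getDataSet();   // assumed file name
        train.setClassIndex(0);
        test.setClassIndex(0);

        // Try every combination of the pruning and smoothing switches.
        for (boolean unpruned : new boolean[] {false, true}) {
            for (boolean unsmoothed : new boolean[] {false, true}) {
                M5P m5 = new M5P();
                m5.setUnpruned(unpruned);         // corresponds to the UNPRUNED option in the GUI
                m5.setUseUnsmoothed(unsmoothed);  // corresponds to the useUnsmoothed option in the GUI
                m5.buildClassifier(train);

                Evaluation eval = new Evaluation(train);
                eval.evaluateModel(m5, test);
                System.out.println("unpruned=" + unpruned + ", unsmoothed=" + unsmoothed
                        + " -> RMSE=" + eval.rootMeanSquaredError()
                        + ", MAE=" + eval.meanAbsoluteError());
                System.out.println(m5);           // toString() prints the resulting tree
            }
        }
    }
}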
Please submit the following files using the myWpi digital drop box:
If you are taking this course for grad. credit, state this fact at the beginning of your report. In this case you submit only an individual report containing both the "individual" and the "group" parts, as you are working all by yourself on the projects.
---------------------------------------------------------------------------
(TOTAL: 75 points) FOR THE HOMEWORK (PART I) as stated in the Homework assignment above.
(TOTAL: 200 points) FOR THE PROJECT (PART II) as follows:
(TOTAL: 15 points) ALGORITHMIC DESCRIPTION OF THE CODE
(05 points) Description of the algorithm underlying the Weka filters used
(10 points) Description of the ALGORITHM underlying the data mining
methods used in this project.
(up to 10 extra credit points for an outstanding job)
(providing just a structural description of the code, i.e. a list of
classes and methods, will receive 0 points)
(TOTAL: 5 points) PRE-PROCESSING OF THE DATASET:
Preprocessing attributes as needed and dealing with missing values appropriately
(up to 5 extra credit points)
Trying to do "fancier" things with attributes
(e.g. combining two highly correlated attributes
into one, using background knowledge, etc.)
(TOTAL: 160 points: 80 points for the individual part and 80 points for the group part)
EXPERIMENTS
(TOTAL: 40 points each dataset) FOR EACH DATASET:
(05 points) running a good number of experiments to get familiar with the
data mining methods in this project
(15 points) good description of the experiment setting and the results
(15 points) good analysis of the results of the experiments
INCLUDING discussion of the evaluation statistics returned by
the Weka system (accuracy and/or errors) and discussion of
particularly interesting results
(05 points) comparison of the results with those obtained using the other
methods in this project.
Argumentation of weaknesses and/or strengths of each of the
methods on this dataset, and argumentation of which method
should be preferred for this dataset and why.
(up to 10 extra credit points) excellent analysis of the results and
comparisons
(up to 10 extra credit points) running additional interesting experiments
(TOTAL 5 points) SLIDES - how well do they summarize concisely
the results of the project? We suggest you summarize the
setting of your experiments and their results in a tabular manner.
(up to 6 extra credit points) for excellent summary and presentation of results
in the slides.
(TOTAL 15 points) Class presentation - how well your oral presentation summarized
concisely the results of the project and how focused your presentation was
on the more creative/interesting/useful of your experiments and results.
This grade is given individually to each team member.