
CS4341 Introduction to Artificial Intelligence
Project 4 - D Term 2001
By Chris Shoemaker and Carolina Ruiz
DUE DATE: Saturday, April 14 at 5 pm.

The objectives of this project are to understand decision trees and the
algorithms that create them, and to implement a program that constructs a
decision tree and makes classification predictions for unseen (test) data.
Note: the term "decision tree", in this assignment, is used in the
practical sense, that is, to mean the type of decision tree that Winston
calls an "identification tree".
Construct the most accurate decision tree you can
for predicting whether a game of Connect-4 that starts from a given 8-ply
position will be won, lost, or drawn. Relevant files are available
at the /cs/cs4341/Proj4 directory of the Unix CCC machines and also
at ftp://ftp.cs.wpi.edu/pub/courses/cs4341/project4/
The files contained in those directories were derived from files taken from the
Univ. of California-Irvine Machine Learning Repository.
The directory contains the following files:
- connect-4.names:
Contains a description of the database, including information about the
attributes. One piece of information in this file is out of date: it
originally accompanied a connect-4.data file that contained all 67557
instances. That database has since been split into two parts at random:
2/3 of the records for training (now contained in the file connect-4.data), and
1/3 of the records for testing (contained in the file connect-4.test).
- connect-4.data:
Contains the training records. There are 45038 training instances.
- connect-4.test:
Contains the testing records. There are 22519 test instances.
As an illustration of the type of results obtainable with decision trees
on this data set, see Chris Shoemaker's decision tree results using
See5, a commercial decision tree tool developed by Ross Quinlan.
See5 is the Windows version of C5.0 (a descendant of C4.5).
You may accept the training and testing filenames as command-line arguments,
or you may hard-code them. Relevant information about the file format is
found in the connect-4.names file. You may use the files in their current
locations, or make copies of them, but your program must operate on files
exactly like the original files.
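For concreteness, here is a minimal loading sketch in Python. It assumes
each record is a comma-separated line of attribute values ending with the
class label (as in the UCI files); consult connect-4.names for the
authoritative format, and treat the function and variable names as
illustrative, not required.

    def load_instances(path):
        """Read comma-separated records: each line holds the attribute
        values followed by the class label (see connect-4.names)."""
        instances = []
        with open(path) as f:
            for line in f:
                fields = line.strip().split(',')
                if len(fields) > 1:  # skip blank lines
                    instances.append((fields[:-1], fields[-1]))
        return instances

    # Example usage (the file names are those given above):
    # training = load_instances('connect-4.data')
    # testing = load_instances('connect-4.test')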
The only formal output specification is that, at minimum, your program
should display:
- the number of training instances used
- the size of the tree that was constructed (how many nodes)
- the number of testing instances used
- the 3 x 3 confusion matrix (with marginal totals), of the form:

                              (actual)
                          win  lose  draw   total
      (predicted)  win     12     1     0      13
                   lose     2    13     1      16
                   draw     3     2     4       9
                   total   17    16     5      38  <- (For you, this number will be 22519)
- the overall classification accuracy (a Python sketch for computing this
and the confusion matrix appears after this list).
- a printout of the tree using the tree format
specified in class.
- a printout of the rules derived from the tree using the
rule format specified in class.
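One possible way to compute the confusion matrix and the accuracy, as a
hedged Python sketch: classify_instance is a hypothetical stand-in for
your tree's own prediction routine, and the label strings are assumptions
based on the sample matrix above.

    def evaluate(tree, test_instances, labels=('win', 'lose', 'draw')):
        """Build a confusion matrix (rows = predicted, columns = actual)
        and compute the overall classification accuracy."""
        index = {label: i for i, label in enumerate(labels)}
        matrix = [[0] * len(labels) for _ in labels]
        correct = 0
        for attrs, actual in test_instances:
            # classify_instance is hypothetical: substitute whatever
            # prediction routine your tree implementation provides.
            predicted = classify_instance(tree, attrs)
            matrix[index[predicted]][index[actual]] += 1
            if predicted == actual:
                correct += 1
        return matrix, correct / len(test_instances)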
The following are requirements for the construction and testing of your
decision tree:
- Code: You MUST write your own decision tree learning code (an
illustrative sketch appears after this list of requirements).
There are several pieces of code available online (including
a Lisp version) that you may examine for guidance, but the code
you submit MUST be your own.
Your code MUST run on the CCC Unix machines.
- Training Instances:
For the construction of your decision tree, you MUST use the instances from
the connect-4.data file.
You MAY restrict your experiments to a subset of the records in that
file if your system cannot handle the whole file. But remember that
the more accurate your decision tree is over the test data, the better.
You MAY pre-process the attributes so that you maximize the accuracy
of your decision tree. Pre-processing alternatives include:
disregarding an attribute that doesn't seem to have any predictive
capability.
- Test Instances:
Test data are available in the connect-4.test file.
To test your decision tree you MUST use all 22519 records from
that file.
The accuracy of a decision tree is measured as the percentage
of correctly classified instances. That is,

    accuracy = (number of correctly classified instances) /
               (total number of classified instances)

Therefore, if you classify 20000 instances correctly, your classification
accuracy is 20000/22519 = 88.8%.
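The assignment does not prescribe a particular induction algorithm, but
an ID3-style learner (split on the attribute with the highest information
gain, recurse until a node is homogeneous) is a natural starting point.
The following Python sketch shows one possible shape; the
tuple-and-dictionary tree representation is an assumption, not a
requirement.

    import math
    from collections import Counter

    def entropy(instances):
        """Shannon entropy (in bits) of the class-label distribution."""
        counts = Counter(label for _, label in instances)
        total = len(instances)
        return -sum((n / total) * math.log2(n / total)
                    for n in counts.values())

    def partition(instances, attr):
        """Group instances by their value for attribute index attr."""
        groups = {}
        for attrs, label in instances:
            groups.setdefault(attrs[attr], []).append((attrs, label))
        return groups

    def information_gain(instances, attr):
        """Expected entropy reduction from splitting on attr."""
        total = len(instances)
        remainder = sum(len(g) / total * entropy(g)
                        for g in partition(instances, attr).values())
        return entropy(instances) - remainder

    def build_tree(instances, attributes):
        """Grow the tree until a node is homogeneous, or take the
        majority class when no attributes remain."""
        labels = {label for _, label in instances}
        if len(labels) == 1:
            return labels.pop()  # leaf: homogeneous node
        if not attributes:
            return Counter(l for _, l in instances).most_common(1)[0][0]
        best = max(attributes,
                   key=lambda a: information_gain(instances, a))
        remaining = [a for a in attributes if a != best]
        return (best, {value: build_tree(group, remaining)
                       for value, group in partition(instances, best).items()})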
Project 4 is due on Saturday, April 14 at 5:00 pm.
Your system should follow the
CS Department Documentation Standard.
- Program and Decision Tree.
Using the turnin program, you should submit:
- the source code of your program,
- a tree.txt file containing a print-out of the
most accurate decision tree you obtained, in the
tree format described in class, and
- a rules.txt file containing a print-out of the
IF-THEN rules obtained from your most accurate decision
tree, in the rule format described in class.
- Written Report.
Submit a report.txt file containing your written report.
Your report should discuss the following issues:
- a description of your own code,
- the description of the dataset (or subset thereof) used by your program
to construct and to test your decision tree,
- the experiments you ran with the system,
including a confusion matrix for each experiment,
- the most accurate decision tree constructed by your system,
including the confusion matrix for this tree,
- any pre- or post-processing done to improve the accuracy of your tree,
- evaluation of your tree using the test data,
- strengths and weaknesses of your system.
Your report should also include a short user manual explaining how to
install, run, and use your system.
- Oral Report.
We will discuss the results from the different group projects in
class on Tuesday, April 17, 2001 .
Each group should present their results
(prepare ONE good overhead slide)
and discuss their project solution in class for up to 2 minutes.
The oral presentation will be worth 10% of the project grade.
Extend and (hopefully) improve your decision tree generating program to avoid
overfitting.
- Create a leaf node BEFORE the training data instances for that node become
homogeneous in classification.
- Instead, choose some other method(s) for determining when to make a leaf
node. Possible methods include stopping the branching and creating a leaf
node when the training data instances at that node:
- are homogeneous in classification OR there are fewer than x
instances at that node.
- are homogeneous in classification EXCEPT for up to x
instances.
- are homogeneous in classification EXCEPT for up to x percent of
the instances at that node.
- are ...{can you think of any other good ones?}
(A sketch of the first two of these stopping rules appears at the end of
this section.)
- Perform experiments with various values of your variable, x.
- Justify your choice of method(s).
- Document your results.
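For illustration, here is a hedged Python sketch of the first two
stopping rules listed above, written as a test that could replace the
homogeneity check in your tree-building code. The names should_stop,
min_instances, and max_exceptions are hypothetical; both parameters play
the role of the variable x.

    from collections import Counter

    def should_stop(instances, min_instances=10, max_exceptions=0):
        """Pre-pruning test: stop branching and create a leaf when the
        node has fewer than min_instances instances, or when it is
        homogeneous except for up to max_exceptions instances. Both
        thresholds are stand-ins for the variable x above."""
        counts = Counter(label for _, label in instances)
        majority_size = counts.most_common(1)[0][1]
        exceptions = len(instances) - majority_size
        return len(instances) < min_instances or exceptions <= max_exceptions

    # When should_stop fires, a reasonable leaf label is the majority
    # class at that node: counts.most_common(1)[0][0]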