
CS4341 Introduction to Artificial Intelligence
Project 4 - D Term 2001
By Chris Shoemaker and Carolina Ruiz
DUE DATE: Saturday, April 14 at 5 pm.

The objectives of this project are to understand decision trees and the
algorithms that create them, and to implement a program that constructs a
decision tree and makes classification predictions for unseen (test) data.
Note: the term "decision tree", in this assignment, is used in the
practical sense, that is, to mean the type of decision tree that Winston
calls an "identification tree".
Construct the most accurate decision tree you can
for predicting whether a game of Connect-4 that starts from a given 8-ply
position will be won, lost, or drawn. Relevant files are available
at the /cs/cs4341/Proj4 directory of the Unix CCC machines and also
at ftp://ftp.cs.wpi.edu/pub/courses/cs4341/project4/
The files contained in those directories were derived from files taken from the
Univ. of California-Irvine Machine Learning Repository.
The directory contains the following files:
- connect-4.names:
Contains a description of the database, including information about the
attributes. One piece of information in this file is out of date: it
originally accompanied a connect-4.data file that contained all 67557
instances. That database has since been split into two parts at random:
2/3 of the records for training (now contained in the file connect-4.data), and
1/3 of the records for testing (contained in the file connect-4.test).
- connect-4.data:
Contains the training records. There are 45038 training instances.
- connect-4.test:
Contains the testing records. There are 22519 test instances.
As an illustration of the type of results obtainable with decision trees
on this data set, see Chris Shoemaker's decision tree results using
See5, a commercial decision tree tool developed by Ross Quinlan.
See5 is the Windows version of C5.0 (a descendant of C4.5).
You may accept the training and testing filenames as command-line arguments,
or you may hard-code them. Relevant information about the file format is
found in the connect-4.names file. You may use the files in their current
locations, or make copies of them, but your program must operate on files
exactly like the original files.
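For concreteness, here is a minimal loading sketch in Python. It assumes
each record is a comma-separated line of attribute values ending with the
class label (as in the UCI files); consult connect-4.names for the
authoritative format, and treat the function and variable names as
illustrative, not required.

    def load_instances(path):
        """Read comma-separated records: each line holds the attribute
        values followed by the class label (see connect-4.names)."""
        instances = []
        with open(path) as f:
            for line in f:
                fields = line.strip().split(',')
                if len(fields) > 1:  # skip blank lines
                    instances.append((fields[:-1], fields[-1]))
        return instances

    # Example usage (the file names are those given above):
    # training = load_instances('connect-4.data')
    # testing = load_instances('connect-4.test')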
The only formal output specification is that, at minimum, your program
should display:
- the number of training instances used
- the size of the tree that was constructed (how many nodes)
- the number of testing instances used
- the 3 x 3 confusion matrix (with marginal totals), of the form:

                              (actual)
                          win  lose  draw   total
      (predicted)  win     12     1     0      13
                   lose     2    13     1      16
                   draw     3     2     4       9
                   total   17    16     5      38  <- (For you, this number will be 22519)
- the overall classification accuracy (a Python sketch for computing this
and the confusion matrix appears after this list).
- a printout of the tree using the tree format
specified in class.
- a printout of the rules derived from the tree using the
rule format specified in class.
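One possible way to compute the confusion matrix and the accuracy, as a
hedged Python sketch: classify_instance is a hypothetical stand-in for
your tree's own prediction routine, and the label strings are assumptions
based on the sample matrix above.

    def evaluate(tree, test_instances, labels=('win', 'lose', 'draw')):
        """Build a confusion matrix (rows = predicted, columns = actual)
        and compute the overall classification accuracy."""
        index = {label: i for i, label in enumerate(labels)}
        matrix = [[0] * len(labels) for _ in labels]
        correct = 0
        for attrs, actual in test_instances:
            # classify_instance is hypothetical: substitute whatever
            # prediction routine your tree implementation provides.
            predicted = classify_instance(tree, attrs)
            matrix[index[predicted]][index[actual]] += 1
            if predicted == actual:
                correct += 1
        return matrix, correct / len(test_instances)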
The following are requirements for the construction and testing of your
decision tree:
- Code: You MUST write your own decision tree learning code (an
illustrative sketch appears after this list of requirements).
There are several pieces of code available online (including
a Lisp version) that you may examine for guidance, but the code
you submit MUST be your own.
Your code MUST run on the CCC Unix machines.
- Training Instances:
For the construction of your decision tree, you MUST use the instances from
the connect-4.data file.
You MAY restrict your experiments to a subset of the records in that
file if your system cannot handle the whole file. But remember that
the more accurate your decision tree is over the test data, the better.
You MAY pre-process the attributes so that you maximize the accuracy
of your decision tree. Pre-processing alternatives include:
disregarding an attribute that doesn't seem to have any predictive
capability.
- Test Instances:
Test data are available in the connect-4.test file.
To test your decision tree you MUST use all 22519 records from
that file.
The accuracy of a decision tree is measured as the percentage
of correctly classified instances. That is,

    accuracy = (number of correctly classified instances) /
               (total number of classified instances)

Therefore, if you classify 20000 instances correctly, your classification
accuracy is 20000/22519 = 88.8%.
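The assignment does not prescribe a particular induction algorithm, but
an ID3-style learner (split on the attribute with the highest information
gain, recurse until a node is homogeneous) is a natural starting point.
The following Python sketch shows one possible shape; the
tuple-and-dictionary tree representation is an assumption, not a
requirement.

    import math
    from collections import Counter

    def entropy(instances):
        """Shannon entropy (in bits) of the class-label distribution."""
        counts = Counter(label for _, label in instances)
        total = len(instances)
        return -sum((n / total) * math.log2(n / total)
                    for n in counts.values())

    def partition(instances, attr):
        """Group instances by their value for attribute index attr."""
        groups = {}
        for attrs, label in instances:
            groups.setdefault(attrs[attr], []).append((attrs, label))
        return groups

    def information_gain(instances, attr):
        """Expected entropy reduction from splitting on attr."""
        total = len(instances)
        remainder = sum(len(g) / total * entropy(g)
                        for g in partition(instances, attr).values())
        return entropy(instances) - remainder

    def build_tree(instances, attributes):
        """Grow the tree until a node is homogeneous, or take the
        majority class when no attributes remain."""
        labels = {label for _, label in instances}
        if len(labels) == 1:
            return labels.pop()  # leaf: homogeneous node
        if not attributes:
            return Counter(l for _, l in instances).most_common(1)[0][0]
        best = max(attributes,
                   key=lambda a: information_gain(instances, a))
        remaining = [a for a in attributes if a != best]
        return (best, {value: build_tree(group, remaining)
                       for value, group in partition(instances, best).items()})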
Project 4 is due on Saturday, April 14 at 5:00 pm.
Your system should follow the
CS Department Documentation Standard.
- Program and Decision Tree.
Using the turnin program, you should submit:
- the source code of your program,
- a tree.txt file containing a print-out of the
most accurate decision tree you obtained, in the
tree format described in class, and
- a rules.txt file containing a print-out of the
IF-THEN rules obtained from your most accurate decision
tree, in the rule format described in class.
- Written Report.
Submit a report.txt file containing your written report.
Your report should discuss the following issues:
- a description of your own code,
- the description of the dataset (or subset thereof) used by your program
to construct and to test your decision tree,
- the experiments you ran with the system,
including a confusion matrix for each experiment,
- the most accurate decision tree constructed by your system,
including the confusion matrix for this tree,
- any pre- or post-processing done to improve the accuracy of your tree,
- evaluation of your tree using the test data,
- strengths and weaknesses of your system.
Your report should also include a short user manual explaining how to
install, run, and use your system.
- Oral Report.
We will discuss the results from the different group projects in
class on Tuesday, April 17, 2001 .
Each group should present their results
(prepare ONE good overhead slide)
and discuss their project solution in class for up to 2 minutes.
The oral presentation will be worth 10% of the project grade.
Extend and (hopefully) improve your decision tree generating program to avoid
overfitting.
- Create a leaf node BEFORE the training data instances for that node become
homogeneous in classification.
- Instead, choose some other method(s) for determining when to make a leaf
node. Possible methods include stopping the branching and creating a leaf
node when the training data instances at that node:
- are homogeneous in classification OR there are fewer than x
instances at that node.
- are homogeneous in classification EXCEPT for up to x
instances.
- are homogeneous in classification EXCEPT for up to x percent of
the instances at that node.
- are ...{can you think of any other good ones?}
(A sketch of the first two of these stopping rules appears at the end of
this section.)
- Perform experiments with various values of your variable, x.
- Justify your choice of method(s).
- Document your results.
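For illustration, here is a hedged Python sketch of the first two
stopping rules listed above, written as a test that could replace the
homogeneity check in your tree-building code. The names should_stop,
min_instances, and max_exceptions are hypothetical; both parameters play
the role of the variable x.

    from collections import Counter

    def should_stop(instances, min_instances=10, max_exceptions=0):
        """Pre-pruning test: stop branching and create a leaf when the
        node has fewer than min_instances instances, or when it is
        homogeneous except for up to max_exceptions instances. Both
        thresholds are stand-ins for the variable x above."""
        counts = Counter(label for _, label in instances)
        majority_size = counts.most_common(1)[0][1]
        exceptions = len(instances) - majority_size
        return len(instances) < min_instances or exceptions <= max_exceptions

    # When should_stop fires, a reasonable leaf label is the majority
    # class at that node: counts.most_common(1)[0][0]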