CS539 Machine Learning
Assignment Chapter 8 - Fall 2000
Due: Thursday, November 9, 2000 at 6:00 pm.
PROJECT DESCRIPTION
Construct the most accurate instance-based classifier you can
for predicting whether the income of a given person is >50K or <= 50K
using the census-income dataset from the US Census Bureau, which is
available at the Univ. of California Irvine Repository.
I have downloaded the dataset into the following directory:
/cs/courses/cs539/f00/Projects/Census_Income_Data
You can access the dataset from there.
The census-income dataset contains census information for 48,842
people. It has 14 attributes for each person
(age, workclass, fnlwgt, education, education-num, marital-status,
occupation, relationship, race, sex, capital-gain, capital-loss,
hours-per-week, and native-country)
and a boolean attribute class classifying the income
of the person as belonging to one of two categories: >50K or <=50K.
PROJECT ASSIGNMENT
Construct the most accurate instance-based classifier you can to predict the Salary
attribute of the Census-Income data.
The following are guidelines to construct your instance-based classifier:
- Code: You should write code that implements the k-nearest neighbor
algorithm described in Chapter 8 of the textbook.
- Training Instances:
Use the census-income dataset.
You can restrict your experiments to a subset of the dataset if
your system cannot handle the whole dataset. But remember that the
more accurate your system is, the better.
Also,
note that this dataset has missing values. It is up to you how to fill in
appropriate data for those missing values. Also, it is up to you
to decide if it's a good idea to discretize continuous attributes, and if
so, how.
- Test Instances:
Test data are also available at the UCI repository.
YOU MUST USE AT LEAST THE FIRST 1000 TEST RECORDS FROM THAT TEST
DATA IN YOUR EXPERIMENTS.
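The guidelines above can be sketched roughly as follows. This is only one possible starting point, not the required design: it assumes missing values appear as "?" in the data files, fills them with the column median (numeric) or the most frequent value (categorical), and uses a mixed distance in which numeric attributes are scaled by their range and categorical attributes contribute 0 when equal and 1 otherwise. The names fill_missing, make_distance and knn_predict are made up for this illustration.

```python
from collections import Counter
from statistics import median

MISSING = "?"  # assumption: missing values are written as "?" in the data files


def fill_missing(rows, numeric_cols):
    """Return (filled_rows, fills): MISSING entries are replaced by the column
    median for numeric columns and by the most frequent value otherwise."""
    fills = []
    for c in range(len(rows[0])):
        present = [r[c] for r in rows if r[c] != MISSING]
        if c in numeric_cols:
            fills.append(median(present))
        else:
            fills.append(Counter(present).most_common(1)[0][0])
    filled = [[fills[c] if v == MISSING else v for c, v in enumerate(r)]
              for r in rows]
    return filled, fills


def make_distance(train_rows, numeric_cols):
    """Build a Euclidean-style distance over mixed attributes.
    Numeric differences are divided by the attribute's range in the training
    data; categorical attributes add 0 if the values match and 1 if not."""
    ranges = {}
    for c in numeric_cols:
        vals = [r[c] for r in train_rows]
        ranges[c] = (max(vals) - min(vals)) or 1.0  # avoid division by zero

    def distance(a, b):
        total = 0.0
        for c in range(len(a)):
            if c in numeric_cols:
                d = (a[c] - b[c]) / ranges[c]
            else:
                d = 0.0 if a[c] == b[c] else 1.0
            total += d * d
        return total ** 0.5

    return distance


def knn_predict(train_rows, train_labels, query, k, distance):
    """Classify query by majority vote among its k nearest training rows."""
    order = sorted(range(len(train_rows)),
                   key=lambda i: distance(train_rows[i], query))
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]
```

Test records would need the same treatment before classification, for example by reusing the fill values computed from the training data. Sorting all training rows for every query is simple but slow on the full dataset, which is one reason you may want to experiment with a subset first.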
REPORT AND DUE DATE
- Written Report.
Please bring your report to my office (FL232) or to class by the due date/time.
Your report should contain the following sections that discuss the issues:
- Code Description:
Describe any adaptations of the code that you made.
- Experiments: You should run several experiments varying the value of k,
the training data, and perhaps the test data. For each experiment you ran, describe:
- Training Data: What data did you use to construct your instance-based classifier?
- Test Data: What data did you use to test your instance-based classifier?
- Any pre- or post-processing done to improve the accuracy of your classifier.
- Value of k used for the experiment.
- Accuracy of the resulting instance-based classifier.
- Summary of Results
- What was the accuracy of the most accurate instance-based classifier
you obtained?
- Discuss how this accuracy compares with that of your
most accurate decision tree, neural network, and naive Bayes classifier
from the previous assignments.
- Discuss the strengths and the weaknesses of your system.
- Oral Report.
We will discuss the results from the individual projects during the class
on November 9.
Be ready to show your results (prepare transparencies on your results)
and to discuss your project solution in class.
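For the Experiments section of the written report, the accuracy for several values of k can be collected with a small loop like the one below. This is a minimal sketch under simplifying assumptions: instances are (feature_vector, label) pairs with purely numeric features, the distance is plain squared Euclidean, and the function names knn_predict, accuracy and sweep_k are hypothetical. Your own classifier and preprocessing would take the place of knn_predict here.

```python
from collections import Counter


def knn_predict(train, query, k):
    """train is a list of (feature_vector, label) pairs with numeric features.
    Returns the majority label among the k training instances closest to query."""
    nearest = sorted(
        train,
        key=lambda ex: sum((a - b) ** 2 for a, b in zip(ex[0], query)),
    )[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def accuracy(train, test, k):
    """Fraction of test instances whose predicted label equals the true label."""
    correct = sum(knn_predict(train, x, k) == y for x, y in test)
    return correct / len(test)


def sweep_k(train, test, ks):
    """Map each candidate k to the accuracy obtained on the test instances."""
    return {k: accuracy(train, test, k) for k in ks}
```

A call such as sweep_k(train, test, [1, 3, 5, 7, 9]) returns one accuracy per value of k, which can be copied directly into a results table or a transparency.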