CS539 Machine Learning
Assignment 10 - Fall 2000
Due:
Thursday, November 30, 2000 at 6:00 pm.
PROJECT DESCRIPTION
Use a FOIL-like algorithm
to construct the best set of rules you can
for predicting whether the income of a given person is >50K or <= 50K
using the
census-income dataset
from the US Census Bureau which is
available at the
Univ. of California Irvine Repository.
I have downloaded the dataset into the following directory:
/cs/courses/cs539/f00/Projects/Census_Income_Data
You can access the dataset from there.
The census-income dataset contains census information for 48,842
people. It has 14 attributes for each person
(age,
workclass,
fnlwgt,
education,
education-num,
marital-status,
occupation,
relationship,
race,
sex,
capital-gain,
capital-loss,
hours-per-week, and
native-country)
and a boolean attribute class classifying the income
of the person as belonging to one of two categories: >50K or <=50K.
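As a rough illustration of the record layout described above, the following sketch parses one record into its 14 attributes plus the class label. It assumes each record is a single comma-separated line with the attributes in the order listed and the class label last, and that a label may carry a trailing period. Both are assumptions about the file format, so check them against the files in the course directory. The names ATTRIBUTES and parse_record are hypothetical helpers, not part of any FOIL distribution.

```python
# Attribute names in the order given in the project description.
ATTRIBUTES = [
    "age", "workclass", "fnlwgt", "education", "education-num",
    "marital-status", "occupation", "relationship", "race", "sex",
    "capital-gain", "capital-loss", "hours-per-week", "native-country",
]


def parse_record(line):
    """Split one comma-separated line into (attribute dict, class label).

    Assumption: 14 attribute values followed by the class label, with an
    optional trailing period after the label.
    """
    fields = [f.strip() for f in line.strip().rstrip(".").split(",")]
    if len(fields) != len(ATTRIBUTES) + 1:
        raise ValueError("expected %d fields, got %d"
                         % (len(ATTRIBUTES) + 1, len(fields)))
    values = dict(zip(ATTRIBUTES, fields[:-1]))
    return values, fields[-1]
```

All values are kept as strings here; converting numeric attributes such as age or hours-per-week is left to whatever preprocessing you choose.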
PROJECT ASSIGNMENT
Construct, using a FOIL-like algorithm,
the most accurate hypothesis (i.e. set of rules) you can to predict the class
(income) attribute of the census-income data.
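For orientation, a FOIL-like learner grows each rule by repeatedly adding the condition with the highest information gain. The sketch below computes that gain for attribute-value data, where p0 and n0 are the positive and negative examples covered by the rule before adding a condition, and p1 and n1 are those covered afterwards. It assumes the propositional case, in which the number of positives that remain covered is simply p1; in first-order FOIL this factor counts positive bindings instead. The function name foil_gain is hypothetical.

```python
import math


def foil_gain(p0, n0, p1, n1):
    """Information gain from specializing a rule with one more condition.

    p0, n0: positives/negatives covered before adding the condition.
    p1, n1: positives/negatives covered after adding the condition.
    Assumes p0 > 0 (the rule covers at least one positive beforehand).
    """
    if p0 <= 0:
        raise ValueError("rule must cover at least one positive example")
    if p1 == 0:
        # A condition that covers no positives cannot be useful.
        return 0.0
    before = math.log2(p0 / (p0 + n0))
    after = math.log2(p1 / (p1 + n1))
    return p1 * (after - before)
```

For example, a condition that narrows coverage from 50 positives and 50 negatives to 30 positives and 10 negatives scores about 17.5, because the covered set becomes purer while still retaining most of the positives.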
The following are guidelines for using a FOIL-like algorithm to construct your hypothesis:
- Code: I strongly encourage you to use a version of FOIL available online,
for instance the one available
at Quinlan's Webpage or a
more recent one if you find one.
However, you can implement your own code if you prefer.
- Instances:
Use the
census-income dataset.
You can restrict your experiments to a subset of the dataset if
your system cannot handle the whole dataset. But remember that the
more accurate your system is, the better.
Also,
note that this dataset has missing values. It is up to you how to fill in
appropriate data for those missing values. It is also up to you
to decide whether it is a good idea to discretize continuous attributes and, if
so, how.
YOU MUST USE AT LEAST THE FIRST 1000 TEST RECORDS FROM THE
Test data
IN YOUR EXPERIMENTS.
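The two preprocessing choices mentioned in the guidelines, filling in missing values and discretizing continuous attributes, can each be handled in many ways. The sketch below shows one simple option for each: replacing a missing value with the most frequent known value of that attribute, and mapping a number to one of several equal-width bins. It assumes records are dictionaries of strings and that missing values are marked with "?"; the marker and the function names are assumptions for illustration, and other strategies may well give better accuracy.

```python
from collections import Counter


def fill_missing_with_mode(records, attribute, missing="?"):
    """Return new records with missing values of one attribute replaced
    by that attribute's most frequent known value."""
    known = [r[attribute] for r in records if r[attribute] != missing]
    if not known:
        raise ValueError("no known values for %r" % attribute)
    mode = Counter(known).most_common(1)[0][0]
    return [
        {**r, attribute: mode} if r[attribute] == missing else r
        for r in records
    ]


def equal_width_bin(value, low, high, bins):
    """Map a number to a bin index in 0..bins-1 using equal-width bins
    over [low, high]; values outside the range are clamped."""
    if high <= low or bins < 1:
        raise ValueError("need high > low and bins >= 1")
    width = (high - low) / bins
    index = int((value - low) / width)
    return min(max(index, 0), bins - 1)
```

Equal-width binning is only one choice; bins chosen by frequency or by class entropy are common alternatives and may suit skewed attributes such as capital-gain better.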
REPORT AND DUE DATE
- Written Report.
Please bring your report to my office (FL232) or to class by the due date/time.
Your report should contain the following sections that discuss the issues:
- Code Description:
Describe the code that you used/wrote. Remember to acknowledge any sources
of information/code you used for the implementation of your system.
- Experiments: For each experiment you ran describe:
- Your system parameters.
- Instances: What data did you use for the experiments?
- Any pre- or post-processing done to improve the accuracy of your hypothesis.
- Accuracy of the resulting hypothesis.
- Summary of Results
- What was the accuracy of the most accurate hypothesis
you obtained?
- Discuss how this accuracy compares with that of your
most accurate results from the previous assignments.
- Include the most accurate hypothesis you obtained in your
report.
- Discuss the strengths and the weaknesses of your system.
- Oral Report.
We will discuss the results from the individual projects during the class
on November 30th.
Be ready to show your results (prepare transparencies on your results)
and to discuss your project solution in class.
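For the accuracy figures requested in the written report, one possible way to score a rule set on labeled test records is sketched below. It assumes a hypothetical representation in which each rule is a list of (attribute, value) conditions that must all hold, the rule set predicts >50K when any rule fires, and it predicts <=50K otherwise. The output of an actual FOIL implementation will need to be converted to whatever representation you use.

```python
def rule_fires(rule, record):
    """A rule is a conjunction: every (attribute, value) pair must match."""
    return all(record.get(attr) == val for attr, val in rule)


def predict(rules, record, positive=">50K", negative="<=50K"):
    """Predict the positive class if any rule in the set fires."""
    if any(rule_fires(rule, record) for rule in rules):
        return positive
    return negative


def accuracy(rules, labeled_records):
    """Fraction of (record, label) pairs that the rule set classifies
    correctly."""
    if not labeled_records:
        raise ValueError("no records to evaluate")
    correct = sum(
        1 for record, label in labeled_records
        if predict(rules, record) == label
    )
    return correct / len(labeled_records)
```

Reporting this number on test records that were not used to learn the rules gives a fairer estimate than accuracy measured on the training data.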