
CS539 Machine Learning. Spring 2003
Project 1 - Using the Weka System to Preprocess Datasets
DUE DATE:
This project is due on Monday, Jan. 20, 2003.
Your oral report is scheduled for class on Monday, Jan. 20.
Your written report is due on Thursday, Jan. 23, 2003 at 11 am.

PROJECT DESCRIPTION
The purpose of this project is two-fold:
- To gain familiarity with
the Weka system, its GUI, its code, and its input data format (arff).
- To gain experience "pre-processing" datasets:
cleaning, normalizing, and discretizing data attributes
and, when needed, reducing the dimensionality of the data.
PROJECT ASSIGNMENT
For this and other course projects, we will use the
Weka system
(http://www.cs.waikato.ac.nz/ml/weka/).
Weka is an excellent machine-learning/data-mining environment.
It provides a large collection of Java-based mining algorithms,
data preprocessing filters, and experimentation capabilities.
Weka is open source software issued under the GNU General Public License.
For more information on the Weka system, to download the system, and
to get its documentation, look at
Weka's webpage
(http://www.cs.waikato.ac.nz/ml/weka/).
- You should download and use the 3.2.3 GUI version of the system.
- Study the tutorial provided with the Weka system.
Note that the tutorial uses Weka's command line to illustrate
how to run the system, but you can actually use the GUI provided
with the system to execute the same commands.
- Datasets: Consider the following sets of data:
- The weather data (available in the data directory of the
Weka system as the "weather.arff" file).
- The soybean data (available in the data directory of the
Weka system as the "soybean.arff" file).
- The Insurance Company Benchmark (COIL 2000) data.
For this project, use only the ticdata2000.txt training data (1M).
- Experiments:
For each of the above datasets,
use the "Explorer" option of the Weka system to perform the
following operations:
- Translate the dataset into the arff format if needed.
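As a rough illustration of the translation step, the Python sketch below builds arff text from tab-delimited input. It is a minimal sketch, not a Weka tool: the function names to_arff and parse_delimited, the attribute names a1 and a2, and the two-row sample are all hypothetical, and the delimiter and attribute types of a real file must be checked by hand.

```python
def to_arff(relation, attributes, rows):
    """Build arff text.

    attributes: list of (name, spec) pairs, where spec is the string
    "numeric" or a list of nominal values.
    rows: list of lists of values; None is written as the missing marker "?".
    """
    lines = ["@relation " + relation, ""]
    for name, spec in attributes:
        if spec == "numeric":
            lines.append("@attribute %s numeric" % name)
        else:
            lines.append("@attribute %s {%s}" % (name, ",".join(spec)))
    lines.append("")
    lines.append("@data")
    for row in rows:
        lines.append(",".join("?" if v is None else str(v) for v in row))
    return "\n".join(lines) + "\n"


def parse_delimited(text, delimiter="\t"):
    """Split delimited text into rows of string fields, skipping blank lines."""
    return [line.split(delimiter) for line in text.splitlines() if line.strip()]


# Hypothetical two-attribute sample; real files need their own names and types.
sample = "33\t1\n37\t0\n"
arff_text = to_arff(
    "sample",
    [("a1", "numeric"), ("a2", ["0", "1"])],
    parse_delimited(sample),
)
print(arff_text)
```

The header declares one attribute per column, in column order, and the data section holds one comma-separated instance per line.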
- Open the dataset in Weka.
- Preprocess the dataset attributes using Weka's filters.
In particular,
- explore different ways of discretizing continuous attributes.
That is, convert numeric attributes into "nominal" ones
by binning numeric values into intervals. See the
weka.filters.DiscretizeFilter class in Weka.
Play with the filter and read the Java code implementing it.
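To make "binning" concrete, here is a minimal Python sketch of two common unsupervised strategies, equal-width and equal-frequency binning. It is an illustration only and is not a transcription of Weka's filter; the function names and the sample values are hypothetical.

```python
def equal_width_bins(values, n_bins):
    """Map each number to a bin index 0..n_bins-1 using intervals of equal width."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0 for _ in values]
    width = (hi - lo) / n_bins
    # The maximum value would fall just past the last interval, so clamp it.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def equal_frequency_bins(values, n_bins):
    """Map each number to a bin index so that bins hold roughly equal counts.

    Simplification: tied values may end up in different bins.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values)
    return bins


# Hypothetical numeric attribute values.
values = [64, 65, 68, 70, 72, 75, 80, 85]
print(equal_width_bins(values, 3))
print(equal_frequency_bins(values, 2))
```

Equal-width bins can be very unevenly filled when the values are skewed, while equal-frequency bins balance the counts at the cost of uneven interval widths. Comparing the two on the same attribute is one way to approach this part of the experiments.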
- explore different ways of removing missing values.
Missing values in arff files are represented with the character "?".
See the weka.filters.ReplaceMissingValuesFilter class in Weka.
Play with the filter and read the Java code implementing it.
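Two simple strategies for handling "?" entries can be sketched in Python as follows: filling each gap with the attribute's mean (numeric) or most frequent value (nominal), or dropping incomplete instances altogether. This is a minimal sketch under those assumptions; the function names are hypothetical, and the Weka source should be read to see what its filter actually does.

```python
from collections import Counter

MISSING = "?"  # the arff missing-value marker


def replace_missing(column, numeric):
    """Return a copy of one attribute column with "?" entries filled in.

    Numeric columns are converted to floats and filled with the mean of the
    known values; nominal columns are filled with the most frequent value.
    """
    present = [v for v in column if v != MISSING]
    if not present:
        return list(column)  # nothing to estimate a replacement from
    if numeric:
        numbers = [float(v) for v in present]
        fill = sum(numbers) / len(numbers)
        return [fill if v == MISSING else float(v) for v in column]
    fill = Counter(present).most_common(1)[0][0]
    return [fill if v == MISSING else v for v in column]


def drop_incomplete(rows):
    """Alternative strategy: keep only instances with no missing values."""
    return [row for row in rows if MISSING not in row]


print(replace_missing(["1", "?", "3"], numeric=True))
print(replace_missing(["sunny", "?", "sunny", "rainy"], numeric=False))
print(drop_incomplete([["sunny", "85"], ["?", "70"]]))
```

Filling keeps every instance but pulls the attribute toward its typical value; dropping keeps only observed data but can discard much of a dataset when missing values are frequent.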
- Use the "ZeroR" classifier under the "Classify" tab.
Use different ways of testing your results. That is, explore the
following alternatives offered by the Weka system:
- Testing your results over the training data.
- Splitting your input file into two parts, one for training and
one for testing.
- Using n-fold cross-validation. Play with different values for n.
Analyze the results obtained (i.e. interpret the meaning of the output
produced by Weka).
Read, to the extent possible, the Java code implementing the ZeroR classifier.
Run several experiments with your data and the system
varying the parameters so that you gain familiarity with
the system.
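For intuition about what the three testing options measure, the Python sketch below implements a majority-class predictor (the idea behind ZeroR for a nominal class) and scores it each way. It is a minimal sketch, not Weka's code: the function names and the label list are hypothetical, and unlike a full evaluation tool it neither shuffles nor stratifies the data.

```python
from collections import Counter


def zero_r(train_labels):
    """Ignore all attributes and return the most frequent training class."""
    return Counter(train_labels).most_common(1)[0][0]


def accuracy(prediction, test_labels):
    """Fraction of test labels equal to the single predicted class."""
    return sum(1 for y in test_labels if y == prediction) / len(test_labels)


def evaluate_on_training(labels):
    """Train and test on the same data."""
    return accuracy(zero_r(labels), labels)


def evaluate_split(labels, train_fraction):
    """Train on the first part and test on the rest (both must be non-empty)."""
    cut = int(len(labels) * train_fraction)
    return accuracy(zero_r(labels[:cut]), labels[cut:])


def evaluate_cross_validation(labels, n_folds):
    """n-fold cross-validation with n_folds >= 2; fold k holds every n-th item."""
    correct = 0
    for k in range(n_folds):
        test = labels[k::n_folds]
        train = [y for i, y in enumerate(labels) if i % n_folds != k]
        prediction = zero_r(train)
        correct += sum(1 for y in test if y == prediction)
    return correct / len(labels)


# Hypothetical class labels: 9 "yes" and 5 "no".
labels = ["no", "no", "yes", "yes", "yes", "no", "yes",
          "no", "yes", "yes", "yes", "yes", "yes", "no"]
print(evaluate_on_training(labels))
print(evaluate_split(labels, 0.5))
print(evaluate_cross_validation(labels, 7))
```

Because the predictor uses only the class distribution, its accuracy is a baseline: any classifier that actually uses the attributes should be compared against it under the same testing option.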
ORAL AND WRITTEN REPORTS AND DUE DATE
- Written Report.
Your written report is due at 11:00 am on Thursday, Jan. 23.
Please hand in a hardcopy of your report at the beginning of class on Thursday.
Your report should contain the following sections with the corresponding discussions:
- Data:
Describe the datasets that you used in terms of the attributes
present in the data, the number of instances, missing values, and
other relevant characteristics.
- Code Description:
Describe to the extent possible any observations you made when
looking at the Weka code implementing the filters you used and the ZeroR
function.
- Experiments:
For each experiment you ran describe:
- Instances: What data did you use for the experiments?
That is, did you use the entire dataset or just a subset of it?
- Any pre-processing done to the data. That is, did you remove
any attributes? Did you discretize any continuous attribute?
If so, what strategy did you use to bin the values?
Did you replace missing values?
If so, what strategy did you use to select a replacement of
the missing values?
- Your system parameters.
- For the ZeroR classifier,
an analysis of the results of the experiments you ran using different
ways of testing the classifier (cross-validation, etc.).
- Summary of Results
- Discuss the strengths and the weaknesses of your project.
- Oral Report.
We will discuss the results from the individual projects during the class
on Monday, Jan. 20.
Each of you will have approximately 6 minutes to present your report.
Prepare detailed SLIDES with the results of your experiments.
Your slides should be a good "preview" of your written report and should summarize
the contents of the different sections of your written report as described above.
Be ready to show your results
and to discuss your project in class.