CS 525D KNOWLEDGE DISCOVERY AND DATA MINING
Project 1: Association Rule Mining - Spring 2004
Due Date:
March 2nd at 3:00 pm
PROJECT DESCRIPTION
Use the association rule mining module of the Weka system
to mine association rules from the following datasets:
- The census-income dataset from the US Census Bureau, which is
available at the Univ. of California Irvine Repository.
The census-income dataset contains census information for 48,842
people. It has 14 attributes for each person
(age,
workclass,
fnlwgt,
education,
education-num,
marital-status,
occupation,
relationship,
race,
sex,
capital-gain,
capital-loss,
hours-per-week, and
native-country)
and a boolean attribute, class, classifying the income
of the person as belonging to one of two categories: >50K or <=50K.
- 1995 Data Analysis Exposition.
This dataset contains college data taken from the U.S. News & World Report's
Guide to America's Best Colleges. The necessary files are:
- A dataset that you choose depending on your own interests.
It may be a dataset you are working with for your research or your job.
It should contain enough instances (at least 500 instances) and
several attributes (at least 10). Ideally it should contain a good mix of
numeric and nominal attributes.
I include below some links to Data Repositories containing
multiple datasets to choose from:
PROJECT ASSIGNMENT
You must work on this project individually.
Mine, using the Weka system,
association rules from the datasets above.
Keep in mind that due to the representation of
frequent itemsets in Weka, this system may run
out of memory when mining datasets with as few as
a dozen attributes.
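To get a feel for why a dozen attributes can already be too many, the
following back-of-the-envelope sketch counts the itemsets that could exist
when an itemset holds at most one value per attribute. It only illustrates
the size of the search space in general; it does not model Weka's internal
data structures. The function name and the example value counts are made up
for illustration.

```python
from math import prod


def count_possible_itemsets(values_per_attribute):
    """Count non-empty itemsets with at most one value per attribute.

    Each attribute is either absent from an itemset or present with one of
    its values, which gives (v + 1) choices per attribute. Subtracting 1
    excludes the empty itemset.
    """
    return prod(v + 1 for v in values_per_attribute) - 1


# Hypothetical example: 12 nominal attributes with 5 values each.
print(count_possible_itemsets([5] * 12))
```

Only a fraction of these itemsets will be frequent in practice, but the
count shows why raising the minimum support or dropping attributes can make
the difference between a run that finishes and one that does not.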
Run several experiments with your data and the system
varying the parameters until you obtain a collection
of association rules that represent your data well.
The following are guidelines for your experiments:
- Code:
Use the Weka system to mine the association rules
as well as for preparing the data and presenting
the results.
Code by yourself any functionality that you need for
manipulating the data and that is not offered in the
Weka system.
- Data:
- You can restrict your experiments to a subset of the dataset if
Weka cannot handle the whole dataset. But remember that the
more representative the association rules you mine from the
data, the better.
- Use the preprocessing techniques discussed in class to select,
clean, and normalize the data.
- Define concept hierarchies over the different attributes so that
you can analyze your data at different levels of generality.
- Experiments:
After you have cleaned and selected a subset of your data (if
necessary), mine association rules using different parameter
(confidence, support, etc.) settings.
Analyze the resulting rules and repeat the experiment with
other "views" of the data given by generalizing/specializing
your data according to the concept hierarchies and/or by selecting
different portions of the data.
- Results:
Assume that you, as the user/miner, want to obtain association
rules for decision support, for understanding the data better,
and/or for increasing your company's profit. Mine rules until
you obtain a collection of rules that satisfies this objective.
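As a small illustration of the guidelines above, the sketch below
generalizes one attribute through a concept hierarchy and then measures the
support and confidence of a rule at both levels of generality. It is plain
Python and not Weka code. The hierarchy, the helper names (generalize,
support, confidence), and the five toy records are assumptions made up for
this example; they are not taken from the census-income dataset.

```python
# Hypothetical one-level concept hierarchy for the "education" attribute.
EDUCATION_HIERARCHY = {
    "Bachelors": "college",
    "Masters": "college",
    "Doctorate": "college",
    "HS-grad": "no-college",
    "11th": "no-college",
}


def generalize(records, attribute, hierarchy):
    """Return copies of the records with one attribute replaced by its parent concept."""
    return [
        {**r, attribute: hierarchy.get(r[attribute], r[attribute])}
        for r in records
    ]


def support(records, items):
    """Fraction of records that contain every attribute=value pair in items."""
    if not records:
        return 0.0
    hits = sum(all(r.get(a) == v for a, v in items.items()) for r in records)
    return hits / len(records)


def confidence(records, antecedent, consequent):
    """support(antecedent and consequent) divided by support(antecedent)."""
    base = support(records, antecedent)
    if base == 0:
        return 0.0
    return support(records, {**antecedent, **consequent}) / base


# Made-up toy records for illustration only.
records = [
    {"education": "Bachelors", "class": ">50K"},
    {"education": "Masters", "class": ">50K"},
    {"education": "Doctorate", "class": "<=50K"},
    {"education": "HS-grad", "class": "<=50K"},
    {"education": "11th", "class": "<=50K"},
]

# Rule at the original (specialized) level: education=Bachelors => class=>50K
specific_support = support(records, {"education": "Bachelors", "class": ">50K"})

# Same kind of rule after generalizing: education=college => class=>50K
general = generalize(records, "education", EDUCATION_HIERARCHY)
rule_support = support(general, {"education": "college", "class": ">50K"})
rule_confidence = confidence(general, {"education": "college"}, {"class": ">50K"})

print(specific_support, rule_support, rule_confidence)
```

On this toy data the generalized rule covers more records than the
specialized one, so a rule that falls below the minimum support at one level
of the hierarchy may pass it at a more general level. In Weka you would
apply the same idea by recoding the attribute values before mining.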
REPORT AND DUE DATE
- Written Report.
Your written report is due at 3:00 pm. Please hand it in at the beginning
of class.
Your report should contain the following sections with the corresponding discussions:
- Code Description:
Describe the code that you used/wrote. Remember to acknowledge any sources
of information/code you used.
- Data:
Describe the dataset that you selected in terms of the attributes
present in the data, the number of instances, missing values, and
other relevant characteristics.
- Experiments:
- Describe what the objective of your analysis is. Is it to understand
the data better? If so, what about the data do you want to understand?
Or is it for decision support? If so, what decisions do you need to make
based on the data? Or is it for classification/characterization/discrimination
purposes? Explain.
- For each experiment you ran, describe:
- Instances: What data did you use for the experiments?
- Any pre-processing done to improve the quality of your results.
- Your system parameters.
- Any post-processing done to improve the quality of your results.
- Analysis of results of the experiment and their significance.
- Summary of Results
- What was the best collection of association rules that
you obtained? Describe.
- Discuss the strengths and the weaknesses of your project.
- Oral Report.
We will discuss the results from the individual projects during the class
on March 2nd.
Be ready to show your results
and to discuss your project in class.
PREPARE SLIDES SHOWING YOUR WORK.