CS 525D Spring 2004

Computer Science Department

CS 525D KNOWLEDGE DISCOVERY AND DATA MINING
Project 1: Association Rule Mining - Spring 2004

PROF. CAROLINA RUIZ

Due Date: March 2nd at 3:00 pm

Project Description
Project Assignment
Report Submission and Due Date

PROJECT DESCRIPTION

Use the association rule mining module of the Weka system to mine association rules from the following datasets:

The census-income dataset from the US Census Bureau which is available at the Univ. of California Irvine Repository.
The census-income dataset contains census information for 48,842 people. It has 14 attributes for each person (age, workclass, fnlwgt, education, education-num, marital-status, occupation, relationship, race, sex, capital-gain, capital-loss, hours-per-week, and native-country) and a boolean attribute class classifying the income of the person as belonging to one of two categories: >50K or <=50K.
1995 Data Analysis Exposition. This dataset contains college data taken from the U.S. News & World Report's Guide to America's Best Colleges. The necessary files are:
A dataset that you choose depending on your own interests. It may be a dataset you are working with for your research or your job. It should contain enough instances (at least 500) and several attributes (at least 10). Ideally it should contain a good mix of numeric and nominal attributes.
I include below some links to Data Repositories containing multiple datasets to choose from:
- Univ. of California Irvine KDD Data Repository.
- Univ. of California Irvine Machine Learning Data Repository.
- Time Series Data Library
- Data Repositories
- Datasets for Data Mining
- CMU's StatLib-Datasets Archive
- Miscellaneous
- genetic dataset (TBA)

PROJECT ASSIGNMENT

You must work on this project individually. Use the Weka system to mine association rules from the datasets above. Keep in mind that, due to the way Weka represents frequent itemsets, the system may run out of memory when mining datasets with as few as a dozen attributes. Run several experiments on your data, varying the system parameters until you obtain a collection of association rules that represents your data well. The following are guidelines for your experiments:

Code: Use the Weka system to mine the association rules as well as to prepare the data and present the results. Write your own code for any data manipulation functionality that you need and that the Weka system does not offer.
Data:
- You can restrict your experiments to a subset of the dataset if Weka cannot handle the whole dataset. But remember that the more representative the association rules you mine from the data, the better.
- Use the preprocessing techniques discussed in class to select, clean, and normalize the data.
- Define concept hierarchies over the different attributes so that you can analyze your data at different levels of generality.
Experiments: After you have cleaned and selected a subset of your data (if necessary), mine association rules using different parameter settings (confidence, support, etc.). Analyze the resulting rules and repeat the experiment with other "views" of the data, obtained by generalizing/specializing your data according to the concept hierarchies and/or by selecting different portions of the data.
Results: Assume that you, as the user/miner, want to obtain association rules for decision support, for understanding the data better, and/or for increasing your company's profit. Mine rules until you obtain a collection of rules that satisfies this objective.
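
The interplay between concept hierarchies and the support and confidence parameters mentioned in the guidelines above can be illustrated with a minimal Python sketch. This is not how Weka implements rule mining; it only computes the two measures by hand for a single rule, before and after generalizing one attribute. The six instances, the `EDUCATION_HIERARCHY` mapping, and the function names are all hypothetical and chosen only for illustration.

```python
from typing import Dict, FrozenSet, List

Row = FrozenSet[str]

# Hypothetical one-level concept hierarchy for the education attribute:
# each specific value maps to a more general one.
EDUCATION_HIERARCHY: Dict[str, str] = {
    "education=Bachelors": "education=college",
    "education=Masters": "education=college",
    "education=HS-grad": "education=no-college",
    "education=11th": "education=no-college",
}

# Made-up toy instances, each a set of attribute=value items.
ROWS: List[Row] = [
    frozenset({"education=Bachelors", "class=>50K"}),
    frozenset({"education=Masters", "class=>50K"}),
    frozenset({"education=Bachelors", "class=<=50K"}),
    frozenset({"education=HS-grad", "class=<=50K"}),
    frozenset({"education=11th", "class=<=50K"}),
    frozenset({"education=HS-grad", "class=>50K"}),
]


def generalize(rows: List[Row], hierarchy: Dict[str, str]) -> List[Row]:
    """Replace every item that has a parent in the hierarchy by that parent."""
    return [frozenset(hierarchy.get(item, item) for item in row) for row in rows]


def support(rows: List[Row], itemset: Row) -> float:
    """Fraction of rows that contain every item of the itemset."""
    return sum(1 for row in rows if itemset <= row) / len(rows)


def confidence(rows: List[Row], antecedent: Row, consequent: Row) -> float:
    """support(antecedent and consequent together) / support(antecedent)."""
    base = support(rows, antecedent)
    return support(rows, antecedent | consequent) / base if base else 0.0


GENERAL_ROWS = generalize(ROWS, EDUCATION_HIERARCHY)

HIGH_INCOME: Row = frozenset({"class=>50K"})
SPECIFIC: Row = frozenset({"education=Bachelors"})
GENERAL: Row = frozenset({"education=college"})

# Same kind of rule, evaluated at two levels of generality.
print("Bachelors => >50K:",
      support(ROWS, SPECIFIC | HIGH_INCOME),
      confidence(ROWS, SPECIFIC, HIGH_INCOME))
print("college => >50K:",
      support(GENERAL_ROWS, GENERAL | HIGH_INCOME),
      confidence(GENERAL_ROWS, GENERAL, HIGH_INCOME))
```

On this toy data the specific rule has support 1/6 and confidence 1/2, while the generalized rule has support 1/3 and confidence 2/3. A rule that falls below a minimum support threshold at one level of the hierarchy may therefore pass it at a more general level, which is one reason to repeat experiments over different views of the data.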

REPORT AND DUE DATE

Written Report. Your written report is due on March 2nd at 3:00 pm. Please hand it in at the beginning of class. Your report should contain the following sections with the corresponding discussions:
1. Code Description: Describe the code that you used/wrote. Remember to acknowledge any sources of information/code you used.
2. Data: Describe the dataset that you selected in terms of the attributes present in the data, the number of instances, missing values, and other relevant characteristics.
3. Experiments:
  - Describe the objective of your analysis. Is it to understand the data better? If so, what about the data do you want to understand? Or is it for decision support? If so, what decisions do you need to make based on the data? Or is it for classification/characterization/discrimination purposes? Explain.
  - For each experiment you ran describe:
    - Instances: What data did you use for the experiments?
    - Any pre-processing done to improve the quality of your results.
    - Your system parameters.
    - Any post-processing done to improve the quality of your results.
    - Analysis of results of the experiment and their significance.
4. Summary of Results
  - What was the best collection of association rules that you obtained? Describe.
  - Discuss the strengths and the weaknesses of your project.
Oral Report. We will discuss the results from the individual projects during the class on March 2nd. Be ready to show your results and to discuss your project in class. PREPARE SLIDES SHOWING YOUR WORK.
