Written report: Hand in a hardcopy by the beginning of class (by 3:59 pm).
This is a text mining project using association rules. In this project you will gain experience with the following topics:
You'll learn about converting text (unstructured data) to a bag-of-words (structured
data). This conversion usually follows the text preprocessing steps below:
Tokenization: where words are extracted from sentences/text;
Stop word removal: where common function words like "a", "the", and "with" are removed; and
Stemming: where related words are reduced to their "stem" form (e.g.,
"swim", "swam", "swims", and "swimming" are all reduced to "swim").
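As a concrete illustration of these three steps, here is a toy, dependency-free Python sketch. In the actual project you would use NLTK's tokenizer, stop word corpus, and Porter stemmer; the tiny stop list and suffix rules below are simplified stand-ins, not real linguistic rules.

```python
# Toy sketch of tokenization, stop word removal, and stemming.
# (Illustrative only; use NLTK's word_tokenize, stopwords corpus,
# and PorterStemmer in the project itself.)
import re

STOP_WORDS = {"a", "an", "the", "with", "and", "or", "in", "of"}

def tokenize(text):
    """Tokenization: extract lowercase word tokens from raw text."""
    return re.findall(r"[a-z]+", text.lower())

def remove_stop_words(tokens):
    """Stop word removal: drop common function words."""
    return [t for t in tokens if t not in STOP_WORDS]

def stem(token):
    """Crude suffix stripping (a real stemmer handles many more rules)."""
    for suffix in ("ming", "mer", "s", "ing"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

text = "The swimmer swims with the swimming team"
stems = [stem(t) for t in remove_stop_words(tokenize(text))]
print(stems)  # ['swim', 'swim', 'swim', 'team']
```

Note how the three related word forms collapse to a single attribute, "swim", which is exactly what makes the resulting bag-of-words dataset compact enough to mine.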
To learn more about tokenization, stop word removal, stemming, and bag-of-words,
use the materials posted on the course Lecture Notes.
In particular, look at the links marked with "**".
Read sections 6.1-6.3, 6.7-6.9 of the textbook in great detail.
Study all the materials posted on the course Lecture Notes:
In particular, you should know what an association rule is;
metrics to quantify association rules (e.g., support, confidence, lift, leverage, conviction, interest factor, correlation analysis, IS measure, ...);
the Apriori principle;
the Apriori algorithm used to construct association rules,
including frequent itemset generation, candidate generation and pruning
(the join/merge condition and subset pruning), and
rule generation with confidence-based pruning.
You should be able to use these algorithms to construct association rules from data
by hand during the test.
See examples provided in the Lecture Notes linked above.
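To make the hand-computation concrete, the sketch below runs Apriori's frequent-itemset phase, with the join step, subset pruning via the Apriori principle, and support counting, on a small made-up transaction set. The items and the minimum support threshold of 3 are illustrative only, not taken from the course examples.

```python
# Sketch of Apriori frequent-itemset generation on toy transactions.
from itertools import combinations

transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]
minsup = 3  # absolute support threshold (illustrative)

def support(itemset):
    """Count transactions containing every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)

# Frequent 1-itemsets.
items = sorted({i for t in transactions for i in t})
frequent = [{frozenset([i]) for i in items if support(frozenset([i])) >= minsup}]

k = 1
while frequent[-1]:
    prev = frequent[-1]
    # Join step: merge frequent k-itemsets into (k+1)-item candidates.
    candidates = {a | b for a in prev for b in prev if len(a | b) == k + 1}
    # Subset pruning (Apriori principle): every k-subset must be frequent.
    candidates = {c for c in candidates
                  if all(frozenset(s) in prev for s in combinations(c, k))}
    # Support counting over the surviving candidates.
    frequent.append({c for c in candidates if support(c) >= minsup})
    k += 1

for level in frequent:
    for itemset in sorted(map(sorted, level)):
        print(itemset, support(frozenset(itemset)))
```

Tracing this by hand is good test practice: for example, the candidate {bread, diapers, beer} is pruned because its subset {bread, beer} has support 2 < 3, so its support never needs to be counted.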
THOROUGHLY READ AND FOLLOW THE PROJECT GUIDELINES.
These guidelines contain detailed information about how to structure your
project, how to prepare your written summary, and how to study for the test.
Each group can select its own dataset following all of the requirements below:
The dataset must be a text dataset. It can come from either an existing text corpus,
or text data that you collect yourselves from the web (e.g., Twitter).
Python provides APIs to interface with Twitter and other text corpora.
The dataset must contain at least 500 documents, with each document containing at least 100 words. Exceptions to this requirement must be approved by the professor in advance.
The dataset must be related to your own interests and you must be familiar with the
domain of the dataset. In particular, you must be able to state meaningful guiding questions
and to interpret the association rules that you will obtain from your dataset.
BCB503 students: Your dataset must be related to bioinformatics, computational biology, and/or medicine. For example, you can download abstracts and/or articles from
PubMed or any other text repository.
Data Mining Technique(s):
You will run experiments in Weka and in Python using the following techniques:
To convert text to bag-of-words (or word vector): Use either Weka's filters or Python libraries or a combination of both to convert your dataset of documents into a bag of words dataset.
In Weka: Use unsupervised filters, including any others you find appropriate for this task.
In Python: Use Python libraries like NLTK
and others to process text data
(e.g., the regular expression (re) module
if you need it).
There are lots of online tutorials and resources on using Python for text mining.
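As one illustration of the conversion itself, here is a minimal standard-library sketch that builds a word-vector dataset from three toy documents. In practice you would use Weka's filters or a Python library (e.g., scikit-learn's CountVectorizer) which add tokenization options, stop word handling, and sparse output.

```python
# Minimal bag-of-words construction from toy documents.
from collections import Counter

docs = [
    "apriori finds frequent itemsets",
    "frequent itemsets yield association rules",
    "association rules need support and confidence",
]

# The vocabulary becomes the attribute set of the word-vector dataset.
vocab = sorted({w for d in docs for w in d.split()})

# One count vector per document, aligned with the vocabulary.
vectors = [[Counter(d.split())[w] for w in vocab] for d in docs]

print(vocab)
for v in vectors:
    print(v)
```

Each row of `vectors` is one document; each column is one word attribute, which is exactly the shape Apriori expects after discretization.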
Make sure to review the output you obtain after converting text to a word vector.
Remove all attributes that look useless (e.g., punctuation symbols, misspellings, and the like).
Run experiments with and without additional data pre-processing
(for instance, feature selection and attribute discretization)
to determine what pre-processing produces
useful and meaningful association rules.
Also, experiment with changing zero counts to missing values
(e.g., in Weka, changing a "0" entry in a vector to "?")
so that association rules are formed only about word presence, not word absence, in a document.
In Weka you can achieve this with Apriori's "treatZeroAsMissing" parameter.
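The zero-to-missing idea can be sketched in plain Python, where replacing 0 with None plays the role of Weka's "?" entry; an association rule miner that ignores missing values will then never form rules about absent words.

```python
# Replace zero counts with missing values (None), mirroring the effect
# of Weka's treatZeroAsMissing on a word-count vector.
row = [0, 2, 0, 1, 3]
row_missing = [v if v != 0 else None for v in row]
print(row_missing)  # [None, 2, None, 1, 3]
```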
Association rule mining:
Use the Apriori algorithm available in Weka and in Python.
Use support, confidence, lift, leverage, and conviction. Include
in your report a definition (using a precise formula) and a description
of the meaning of each of these metrics.
Also, for extra credit you are
encouraged (but not required) to implement in Weka other association rule
metrics defined in Section 6.7 of the textbook (e.g., interest factor,
correlation analysis, IS measure, ...), and experiment with them.
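For reference, the five required metrics for a rule X -> Y can be computed directly from document counts; the sketch below does so, and the counts in the example call are made up rather than taken from any real dataset.

```python
def rule_metrics(n, n_x, n_y, n_xy):
    """Metrics for a rule X -> Y.
    n: total documents; n_x / n_y: documents containing X / Y;
    n_xy: documents containing both X and Y."""
    p_y = n_y / n
    support = n_xy / n                    # P(X and Y)
    confidence = n_xy / n_x               # P(Y | X)
    lift = confidence / p_y               # P(Y | X) / P(Y)
    leverage = support - (n_x / n) * p_y  # P(X,Y) - P(X)P(Y)
    conviction = (1 - p_y) / (1 - confidence) if confidence < 1 else float("inf")
    return support, confidence, lift, leverage, conviction

# Made-up counts: 100 documents, X in 40, Y in 50, both in 30.
print(rule_metrics(100, 40, 50, 30))
```

With these counts the rule has confidence 0.75 and lift 1.5, i.e., seeing X makes Y 1.5 times more likely than its baseline; a lift of 1, leverage of 0, or conviction of 1 would each indicate independence.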
Use visualizations of the sets of association rules obtained and
analyze those visualizations.
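For a quick look at a rule set before building proper plots, a text-only ranking like the sketch below can help; the rules and lift values here are invented examples, and with matplotlib you might instead scatter support against confidence, colored by lift.

```python
# Text "bar chart" of rules ranked by lift (rules are made-up examples).
rules = [
    ("gene -> protein", 2.4),
    ("tumor -> cancer", 3.1),
    ("cell -> tissue", 1.2),
]
for name, lift in sorted(rules, key=lambda r: -r[1]):
    bar = "#" * round(lift * 10)
    print(f"{name:20s} lift={lift:<4} {bar}")
```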
Read the association rules obtained and pick a handful of interesting
ones to describe in your report.
Search the online literature of your dataset's domain (e.g., medical)
to see whether studies have
shown associations among the attributes/words present in your association rules.
In contrast with our previous classification and regression projects,
we won't use any evaluation protocol (e.g., 10-fold cross validation)
for the association analysis of this project, as we're not using the
rules for prediction.
Focus instead on experimenting with different ways of preprocessing
the data, varying the parameters of the Apriori algorithm, and
providing your own method to evaluate the resulting collections of association rules.
Advanced Topic: Sequence Mining using Association Rules
Investigate in depth (experimentally, theoretically, or both)
how to use an association-rules-like approach for sequence mining.
For this, start by studying in detail Section 7.4 of the textbook.
Then look for additional papers or references online.
Provide in your report a summary of what you learned from your
investigation of this topic, and of your results if you ran any experiments.
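As a starting point for experiments, the core subroutine of Apriori-style sequence mining (cf. GSP in Section 7.4) is the containment test used for support counting: deciding whether a candidate sequence of itemsets occurs, in order, within a data sequence. A minimal sketch with a made-up data sequence:

```python
# Sequence containment test, the support-counting core of GSP-style
# sequence mining (data sequences here are invented examples).
def contains(data_seq, candidate):
    """True if candidate (a list of itemsets) occurs in order within
    data_seq, each candidate element inside some later data element."""
    i = 0
    for element in data_seq:
        if i < len(candidate) and set(candidate[i]) <= set(element):
            i += 1
    return i == len(candidate)

data = [{"a"}, {"b", "c"}, {"d"}, {"b"}]
print(contains(data, [{"a"}, {"b"}]))  # True: "a" occurs before "b"
print(contains(data, [{"b", "d"}]))    # False: "b" and "d" never co-occur

sequences = [data, [{"a"}, {"b"}], [{"c"}]]
support = sum(contains(s, [{"a"}, {"b"}]) for s in sequences)
print(support)  # 2 of the 3 sequences contain <{a}{b}>
```

With this test in place, the candidate-generation and pruning machinery of Apriori carries over almost unchanged, with transactions replaced by sequences.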