Written report: Hand in a hardcopy by the beginning of class (by 3:59 pm).
This is a text mining project using association rules. In this project you will gain experience with the following topics:
You'll learn about converting text (unstructured data) to a bag-of-words (structured
data). This conversion usually follows the text preprocessing steps below:
Tokenization: where words are extracted from sentences/text;
Stop word removal: where common function words like "a", "the", and "with" are removed; and
Stemming: where related words are reduced to their "stem" form (e.g.,
"swim", "swam", "swims", and "swimming" are all reduced to "swim").
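As a concrete illustration of these three steps, here is a toy, dependency-free Python sketch. In the actual project you would use NLTK's tokenizer, stop word corpus, and Porter stemmer; the tiny stop list and suffix rules below are simplified stand-ins, not real linguistic rules.

```python
# Toy sketch of tokenization, stop word removal, and stemming.
# (Illustrative only; use NLTK's word_tokenize, stopwords corpus,
# and PorterStemmer in the project itself.)
import re

STOP_WORDS = {"a", "an", "the", "with", "and", "or", "in", "of"}

def tokenize(text):
    """Tokenization: extract lowercase word tokens from raw text."""
    return re.findall(r"[a-z]+", text.lower())

def remove_stop_words(tokens):
    """Stop word removal: drop common function words."""
    return [t for t in tokens if t not in STOP_WORDS]

def stem(token):
    """Crude suffix stripping (a real stemmer handles many more rules)."""
    for suffix in ("ming", "mer", "s", "ing"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

text = "The swimmer swims with the swimming team"
stems = [stem(t) for t in remove_stop_words(tokenize(text))]
print(stems)  # ['swim', 'swim', 'swim', 'team']
```

Note how the three related word forms collapse to a single attribute, "swim", which is exactly what makes the resulting bag-of-words dataset compact enough to mine.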
To learn more about tokenization, stop word removal, stemming, and bag-of-words,
use the materials posted on the course Lecture Notes.
In particular, look at the links marked with "**".
Read sections 6.1-6.3, 6.7-6.9 of the textbook in great detail.
Study all the materials posted on the course Lecture Notes:
In particular, you should know what an association rule is;
metrics to quantify association rules (e.g., support, confidence, lift, leverage, conviction, interest factor, correlation analysis, IS measure, ...);
the Apriori principle;
the Apriori algorithm used to construct association rules,
including frequent itemset generation, candidate generation and pruning
(the join/merge condition and subset pruning), and
rule generation with confidence-based pruning.
You should be able to use these algorithms to construct association rules from data
by hand during the test.
See examples provided in the Lecture Notes linked above.
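To make the hand-computation concrete, the sketch below runs Apriori's frequent-itemset phase, with the join step, subset pruning via the Apriori principle, and support counting, on a small made-up transaction set. The items and the minimum support threshold of 3 are illustrative only, not taken from the course examples.

```python
# Sketch of Apriori frequent-itemset generation on toy transactions.
from itertools import combinations

transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]
minsup = 3  # absolute support threshold (illustrative)

def support(itemset):
    """Count transactions containing every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)

# Frequent 1-itemsets.
items = sorted({i for t in transactions for i in t})
frequent = [{frozenset([i]) for i in items if support(frozenset([i])) >= minsup}]

k = 1
while frequent[-1]:
    prev = frequent[-1]
    # Join step: merge frequent k-itemsets into (k+1)-item candidates.
    candidates = {a | b for a in prev for b in prev if len(a | b) == k + 1}
    # Subset pruning (Apriori principle): every k-subset must be frequent.
    candidates = {c for c in candidates
                  if all(frozenset(s) in prev for s in combinations(c, k))}
    # Support counting over the surviving candidates.
    frequent.append({c for c in candidates if support(c) >= minsup})
    k += 1

for level in frequent:
    for itemset in sorted(map(sorted, level)):
        print(itemset, support(frozenset(itemset)))
```

Tracing this by hand is good test practice: for example, the candidate {bread, diapers, beer} is pruned because its subset {bread, beer} has support 2 < 3, so its support never needs to be counted.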
THOROUGHLY READ AND FOLLOW THE PROJECT GUIDELINES.
These guidelines contain detailed information about how to structure your
project, how to prepare your written summary, and how to study for the test.
Each group can select its own dataset following all of the requirements below:
The dataset must be a text dataset. It can come from either an existing text corpus,
or text data that you collect yourselves from the web (e.g., Twitter).
Python provides APIs to interface with Twitter and other text corpora.
The dataset must contain at least 500 documents, with each document containing at least 100 words. Exceptions to this requirement must be approved by the professor in advance.
The dataset must be related to your own interests and you must be familiar with the
domain of the dataset. In particular, you must be able to state meaningful guiding questions
and to interpret the association rules that you will obtain from your dataset.
BCB503 students: Your dataset must be related to bioinformatics, computational biology, and/or medicine. For example, you can download abstracts and/or articles from
PubMed or any other text repository.
Data Mining Technique(s):
You will run experiments in Weka and in Python using the following techniques:
To convert text to bag-of-words (or word vector): Use either Weka's filters or Python libraries or a combination of both to convert your dataset of documents into a bag of words dataset.
In Weka: Use unsupervised filters, including any others you find appropriate for this task.
In Python: Use Python libraries like NLTK
and others to process text data
(e.g., the regular expression (re) module
if you need it).
There are lots of online tutorials and resources on using Python for text mining.
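As one illustration of the conversion itself, here is a minimal standard-library sketch that builds a word-vector dataset from three toy documents. In practice you would use Weka's filters or a Python library (e.g., scikit-learn's CountVectorizer) which add tokenization options, stop word handling, and sparse output.

```python
# Minimal bag-of-words construction from toy documents.
from collections import Counter

docs = [
    "apriori finds frequent itemsets",
    "frequent itemsets yield association rules",
    "association rules need support and confidence",
]

# The vocabulary becomes the attribute set of the word-vector dataset.
vocab = sorted({w for d in docs for w in d.split()})

# One count vector per document, aligned with the vocabulary.
vectors = [[Counter(d.split())[w] for w in vocab] for d in docs]

print(vocab)
for v in vectors:
    print(v)
```

Each row of `vectors` is one document; each column is one word attribute, which is exactly the shape Apriori expects after discretization.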
Make sure to review the output you obtain after converting text to a word vector.
Remove all attributes that look useless (e.g., punctuation symbols, misspellings, and the like).
Run experiments with and without additional data pre-processing
(for instance, feature selection and attribute discretization)
to determine what pre-processing produces
useful and meaningful association rules.
Also, experiment with changing zero counts to missing values
(e.g., in Weka, changing a "0" entry in a vector to "?")
so that association rules are formed only about word presence, not word absence, in a document.
In Weka you can achieve this with Apriori's "treatZeroAsMissing" parameter.
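The zero-to-missing idea can be sketched in plain Python, where replacing 0 with None plays the role of Weka's "?" entry; an association rule miner that ignores missing values will then never form rules about absent words.

```python
# Replace zero counts with missing values (None), mirroring the effect
# of Weka's treatZeroAsMissing on a word-count vector.
row = [0, 2, 0, 1, 3]
row_missing = [v if v != 0 else None for v in row]
print(row_missing)  # [None, 2, None, 1, 3]
```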
Association rule mining:
Use the Apriori algorithm available in Weka and in Python.
Use support, confidence, lift, leverage, and conviction. Include
in your report a definition (using a precise formula) and a description
of the meaning of each of these metrics.
Also, for extra credit you are
encouraged (but not required) to implement in Weka other association rule
metrics defined in Section 6.7 of the textbook (e.g., interest factor,
correlation analysis, IS measure, ...), and experiment with them.
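For reference, the five required metrics for a rule X -> Y can be computed directly from document counts; the sketch below does so, and the counts in the example call are made up rather than taken from any real dataset.

```python
def rule_metrics(n, n_x, n_y, n_xy):
    """Metrics for a rule X -> Y.
    n: total documents; n_x / n_y: documents containing X / Y;
    n_xy: documents containing both X and Y."""
    p_y = n_y / n
    support = n_xy / n                    # P(X and Y)
    confidence = n_xy / n_x               # P(Y | X)
    lift = confidence / p_y               # P(Y | X) / P(Y)
    leverage = support - (n_x / n) * p_y  # P(X,Y) - P(X)P(Y)
    conviction = (1 - p_y) / (1 - confidence) if confidence < 1 else float("inf")
    return support, confidence, lift, leverage, conviction

# Made-up counts: 100 documents, X in 40, Y in 50, both in 30.
print(rule_metrics(100, 40, 50, 30))
```

With these counts the rule has confidence 0.75 and lift 1.5, i.e., seeing X makes Y 1.5 times more likely than its baseline; a lift of 1, leverage of 0, or conviction of 1 would each indicate independence.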
Use visualizations of the sets of association rules obtained and
analyze those visualizations.
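For a quick look at a rule set before building proper plots, a text-only ranking like the sketch below can help; the rules and lift values here are invented examples, and with matplotlib you might instead scatter support against confidence, colored by lift.

```python
# Text "bar chart" of rules ranked by lift (rules are made-up examples).
rules = [
    ("gene -> protein", 2.4),
    ("tumor -> cancer", 3.1),
    ("cell -> tissue", 1.2),
]
for name, lift in sorted(rules, key=lambda r: -r[1]):
    bar = "#" * round(lift * 10)
    print(f"{name:20s} lift={lift:<4} {bar}")
```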
Read the association rules obtained and pick a handful of interesting
ones to describe in your report.
Search the online literature of your dataset's domain (e.g., medical)
to see whether studies have
shown associations among the attributes/words present in your association rules.
In contrast with our previous classification and regression projects,
we won't use any evaluation protocol (e.g., 10-fold cross validation)
for the association analysis of this project, as we're not using the
rules for prediction.
Focus instead on experimenting with different ways of preprocessing
the data, varying the parameters of the Apriori algorithm, and
providing your own method to evaluate the resulting collections of association rules.
Advanced Topic: Sequence Mining using Association Rules
Investigate in depth (experimentally, theoretically, or both)
how to use an association-rules-like approach for sequence mining.
For this, start by studying in detail Section 7.4 of the textbook.
Then look for additional papers or references online.
Provide in your report a summary of what you learned from your
investigation of this topic, and of your results if you ran any experiments.
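As a starting point for experiments, the core subroutine of Apriori-style sequence mining (cf. GSP in Section 7.4) is the containment test used for support counting: deciding whether a candidate sequence of itemsets occurs, in order, within a data sequence. A minimal sketch with a made-up data sequence:

```python
# Sequence containment test, the support-counting core of GSP-style
# sequence mining (data sequences here are invented examples).
def contains(data_seq, candidate):
    """True if candidate (a list of itemsets) occurs in order within
    data_seq, each candidate element inside some later data element."""
    i = 0
    for element in data_seq:
        if i < len(candidate) and set(candidate[i]) <= set(element):
            i += 1
    return i == len(candidate)

data = [{"a"}, {"b", "c"}, {"d"}, {"b"}]
print(contains(data, [{"a"}, {"b"}]))  # True: "a" occurs before "b"
print(contains(data, [{"b", "d"}]))    # False: "b" and "d" never co-occur

sequences = [data, [{"a"}, {"b"}], [{"c"}]]
support = sum(contains(s, [{"a"}, {"b"}]) for s in sequences)
print(support)  # 2 of the 3 sequences contain <{a}{b}>
```

With this test in place, the candidate-generation and pruning machinery of Apriori carries over almost unchanged, with transactions replaced by sequences.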