Quiz Topics and Sample Questions

**Introduction: CHAPTER 1 + Textbook Slides**- Define Data Mining
- List the steps of the Knowledge Discovery in Databases (KDD) and describe each of them.
- What fields contribute to Data Mining and how?
- What is the difference between predictive models and descriptive models? Give an example of each type.
- What Data Mining Tasks (or "Approaches") exist? [Answer: Classification, Regression, Clustering, Summarization, Dependency/Association Analysis, Change/Deviation Detection.] Describe each of them.
**Textbook Exercises:**1, 2.

**Data: CHAPTER 2 + Appendix B + Textbook Slides**- What's an attribute? What's a data instance?
- What type of attributes exist? Describe the differences among them.
- What's the difference between discrete and continuous attributes?
- What's noise? How can noise be reduced in a dataset?
- Define outlier. Describe 2 different approaches to detect outliers in a dataset.
- Describe 3 different techniques to deal with missing values in a dataset. Explain when each of these techniques would be most appropriate.
- Given a sample dataset with missing values, apply an appropriate technique to deal with them.
- Give 2 examples in which aggregation is useful.
- Given a sample dataset, apply aggregation of data values.
- What's sampling?
- What's simple random sampling? Is it possible to sample data instances using a distribution different from the uniform distribution? If so, give an example of a probability distribution of the data instances that is different from uniform (i.e., equal probability).
- What's stratified sampling?
- What's "the curse of dimensionality"?
- Provide a brief description of what Principal Components Analysis (PCA) does. [Hint: See Appendix B.] State what the input and what the output of PCA are.
- What's the difference between Principal Components Analysis (PCA) and Singular Value Decomposition (SVD)? [Hint: See Appendix B.]
- What's the difference between dimensionality reduction and feature selection?
- Describe in detail 2 different techniques for feature selection. [For instance describe: using the correlation matrix of the data attributes; using heuristic search for a good subset of attributes, like Correlation based feature Selection (CFS) implemented in Weka (CfsSubsetEval) and described in class; or feature weighting.]
- Given a sample dataset (represented by a set of attributes, a correlation matrix, a co-variance matrix, ...), apply feature selection techniques to select the best attributes to keep (or equivalently, the best attributes to remove).
- What's the difference between feature selection and feature extraction?
- Give two examples of data in which feature extraction would be useful.
- Given a sample dataset, apply feature extraction.
- What's data discretization and when is it needed?
- What's the difference between supervised and unsupervised discretization?
- Given a sample dataset, apply unsupervised (e.g., equal width, equal frequency) discretization, or supervised discretization (e.g., using entropy).
- Describe 2 approaches to handle nominal attributes with too many values.
- Given a dataset, apply variable transformation: Either a simple given function, normalization, or standardization.
- Define the notions of proximity, similarity, disimilarity, and distance. What metrics can be used to measure these notions?
- Given a numeric data attribute, what transformation can be used to map the values of the attribute to the range [0,1]? Explain.
- What metrics exist to measure the dissimilarity (or distance) between two numeric values? Between two nominal values? Between a numeric value and a nominal value?
- Provide the formula for the Minkowski distance. Show that the Euclidean distance, the Manhattan distance, and the Hamming distance are particular cases of the Minkowski distance.
- Know how the Jaccard Coefficient and the Cosine Similarity can be used to measure similarity/dissimilarity between attributes.
- Definition of Correlation and Covariance, and how to use them in data pre-processing (see pp. 76-78).
- What's the Mahalanobis distance and when is it useful?
**Textbook Exercises:**1, 2 (binary, discrete, and continuous only), 5, 12, 15, 18, 19, 20, 22, 23, 24.

**Exploring Data: CHAPTER 3 + Textbook Slides**- Define Data Exploration.
**Summary Statistics:**- What summary statistics can be used on nominal attributes? Continuous attributes? Apply those statistics to a given sample dataset.
- Be prepared to define, calculate, and/or use correlation and covariance over a set of data attributes. Also, be prepared to interpret the meaning of the correlation value between two data attributes.

**Visualization:**- What's the difference between data visualization and (analytical) data mining? Explain.
- Describe issues involved in mapping data to graphical elements.
- Be prepared to define, give examples of, and/or construct (given a sample dataset) any of the following visualizations: Histograms, pie charts, box plots, scatter plots, visualization matrices (e.g., heatmaps), parallel coordinates, star coordinates, Chernoff Faces.
- What type of data is each of the above visualization methods useful for (1-Dimensional, 2D, 3D, high dimensional)? What are the differences between each pair of these visualization methods?
- Use a given visualization of the dataset (e.g., scatter plot, heatmap of correlation matrix, ...) to draw conclusions about the data.

**Data Warehouses and OLAP:**- Given a multidimensional data cube and hierarchies over the data dimensions, apply roll-up, drill-down, dicing, slicing, and pivoting operators.

**Textbook Exercises:**2, 4, 5, 6, 8, 9, 12, 13, 16, 17.

**Classification, Decision Trees, and Model Evaluation: CHAPTER 4: Sections 4.1-4.5. + Textbook Slides**- Define the data mining task "classification".
- What's the difference between a Descriptive Model and a Predictive Model? Is it possible for a model to be both descriptive and predictive? If so, provide an example of a model that is both descriptive and predictive. If not, explain why.
**ZeroR / majority class classifier:**- Define this classifier and how to construct it when the target attribute is nominal and when the target attribute is continuous.

**OneR:**- Define this classifier. What metric is used to select the attribute in the One Rule? How is that One Rule constructed?

**Decision Trees:**- Understand in detail:
- the algorithm to construct decision trees,
- impurity metrics used to select attributes (entropy/gain, Gini, and classification error), and how these metrics are applied to nominal and to continuous attributes,
- how to split a dataset once that an attribute has been selected,
- alternate stopping criteria for the decision tree construction.

- Describe in detail 3 characteristics of decision tree construction (see Section 4.3.7.)
- What's over-fitting and how to avoid/prevent it?
- How does pre-pruning of decision trees work? Given a sample dataset construct a decision tree use pre-pruning.
- How does post-pruning of decision trees work? Given a sample dataset construct a decision tree use post-pruning.
- What's known as the Occams' razor principle and how does it apply to the construction of a model?
- How does J4.8 handle continuous attributes directly? Missing values?

- Understand in detail:
**Model Evaluation:**- Define training set, testing set, and validation set.
- Define accuracy, error rate, and confusion matrix. Given a model and a test set, calculate the accuracy, error rate, and the confusion matrix of the model over the test set.
- Discuss disadvantages of testing a model over its training set.
- Define the Percentage Split approach for training and testing a model. Discuss its advantages and disadvantages.
- Explain how n-fold cross-validation works. Illustrate with an example.
- What's the leave-one-out evaluation method?
- What's bootstrap? What's the difference between bootstrap and random subsampling?

**Textbook Exercises:**2, 3, 4, 5.

**Bayesian Classifiers: Section 5.3. + Textbook Slides**- Basic probability concepts: Probability of an event, independence, conditional probability, conditional independence, Bayes theorem, how to use the Bayes theorem for classification.
- Bayesian Classifiers: Graph (topology) + Conditional Probability Tables (CPTs).
- Naive Bayesian Classifier: Naive Bayes assumption; topology and CPTs of a naive Bayes model.
- (General) Bayesian Classifier: How to determine the topology (graph edges) and CPTs of a Bayesian model.
- Bayesian Classification: How to use naive or general Bayesian nets to classify a test instance.
**Textbook Exercises:**7, 8, 9, 11, 12.

**Rule-Based Classifiers: Section 5.1. + Textbook Slides**- What's an "if ... then ..." rule?
- Define the following terms: antecedent, consequent, left-hand-side (LHS), right-hand-side (RHS) of a rule.
- What's the formula to calculate the coverage of a rule? Given a rule and a dataset, calculate the coverage of the rule with respect to the dataset.
- What's the formula to calculate the accuracy of a rule? Given a rule and a dataset, calculate the accuracy of the rule with respect to the dataset.
- Describe the steps of the sequential covering algorithm in detail.
- Given a dataset, follow the sequential covering algorithm to obtain classification rules.
- Describe different ways to apply the rules in a rule set to classify a given data instance: ordered rules and unordered rules.
- Describe different rule-ordering schemes: rule-based and class-based.
- Given a test dataset and a rule based-model, use the model to make predictions over the test data and also to evaluate the model.
- Explain how RIPPER constructs and prunes rules. Given a dataset, a validation set, and a rule show how RIPPER works.
- Using ideas similar to those of J4.8, describe an approach to pre- and to post-pruning rules.
- Using ideas similar to those of J4.8, describe how you would enhance the rule-construction algorithm to handle continuous attributes directly.
- Using ideas similar to those of J4.8, describe how you would enhance the rule-construction algorithm to handle missing values directly.
- Describe how the ideas behind pre- and post-pruning of decision trees may be transferred to prune rules constructed by the sequential covering algorithm.
**Textbook Exercises:**1, 2*, 3.

**Association Analysis: Sections 6.1, 6.2, 6.3, 6.7. + Textbook Slides**- What's association analysis? How is it different from the other data mining tasks studied in this course (classification, regression, clustering, and anomaly detection)?
- If association analysis is about finding relationships among data attributes, is calculating the correlation matrix for the set of attributes association analysis? Why or why not?
- Given a dataset of transactions, where each transaction is a set of items, what's the meaning of an association rule X → Y with confidence = c and support = s? Assume that X and Y are sets of items.
- Provide formulas that define confidence, support, and lift in terms of probability.
- How has the association rule mining task been traditionally formulated? (See Definition 6.1. on p. 330 of your textbook).
- Describe a brute-force algorithm that generates all association rules that satisfy the conditions in Def. 6.1.
- Know in detail the Apriori algorithm, which generates all association rules that satisfy the conditions in Def. 6.1.
- How does the apriori algorithm generate all the frequent itemsets?
- What's the apriori principle? Provide 2 different formulations of this principle.
- Describe how the first level of itemsets is generated.
- Describe how the second level of itemsets is generated.
- After level k of frequent itemsets (i.e., itemsets of cardinality k that have enough support) has been obtained, describe how the candidate k+1 itemsets are generated. Describe in detail the merge/join condition, and the subset (or candidate pruning) condition.

- How does the apriori algorithm generate all the rules with sufficient confidence from the frequent itemsets? Explain in detail the pruning method used to avoid generating rules that are known in advance to not have enough confidence.
- How does the apriori algorithm improve upon the brute-force algorithm? (Hint: The apriori algorithm uses the apriori principle (and its corollaries: the merge/join condition and subset/candidate-pruning condition) to avoid generating itemsets and/or scanning the dataset counting the support of those itemsets, when it can be determined in advance that those itemsets will not have enough support. Also, the apriori algorithm uses an efficient way of generating rules that have enough confidence.)
- In the context of evaluation of association pattterns, what's an objective measure of interestingness? What's the difference between an objective and a subjective measure of interestingness?
- Define the following metrics for association rules: lift, interest factor, corelation analysis, IS, conviction, and leverage.
- Given a dataset of transactions, min. support and min. confidence thresholds, be prepared to follow the apriori algorithm to generate all association rules that satisfy the min threshold conditions.
- Given a dataset of transactions and an association rule, be prepared to calculate the support, confidence, lift, conviction, and leverage of the rule.
**Textbook Exercises:**1, 2* 3* 6, 7*, 8* 14, 16*, 17*.

**Clustering: Sections 8.1-8.5 + Textbook Slides**- What's clustering?
- Contrast the following approaches to clustering: hierachical vs. partitional; exclusive vs. overlapping vs. fuzzy; and complete vs. partial.
**K-means:**- How does the K-means algorithm work?
- Given a dataset and a distance metric, be prepared to follow the K-means algorithm to cluster the instances in the dataset.
- What's the difference between centroid and medoids? Are centroids always data instances in the dataset? Are medoids always data instances in the dataset? Explain.
- How is sum of squared error (SSE) defined?
- Discuss effective ways to determine an appropriate value of k to use.
- Discuss effective ways to choose appropriate initial centroids. Illustrate situations in which one way would be more appropriate than the others.
- Explain why the time complexity of the k-means algorithm is O(I*K*m*n). (Hint: See p. 505.)

**Hierarchical Clustering:**- What the difference between agglomerative and divisive hierarchical clustering?
- What's a dendrogram?
- Describe the steps of the basic agglomerative hierarchical clustering algorithm.
- What's a proximity matrix?
- Describe the following distance metrics to calculate the distance between two clusters: min (single link), max (complete link), group average, distance between centroids, and Ward distance.
- What's the time complexity of the basic agglomerative hierarchical clustering algorithm?
- Is this basic algorithm greedy or not? Explain.

**DBSCAN:**- Why is it important to consider density when clustering a dataset? Illustrate your argument(s) with examples.
- Given a dataset, a radius value (epsilon), and a min. number of points (MinPts), how are are core, border, and noise points defined by DBSCAN?
- How does DBSCAN identify core, border, and noise points in a dataset? How does DBSCAN use those points to cluster the dataset?
- What's the time complexity of DBSCAN?
- Discuss effective ways to determine appropriate values for epsilon and MinPts.
- How would you extend DBSCAN to appropriately cluster a dataset in which different regions have different densities?
- Are the versions of the DBSCAN algorithm in the textbook and in the textbook slides equivalent (i.e., do they always produce the same result)? Explain.

**Cluster Evaluation:**- How can SSE (see above) be used to evaluate clusters? Discuss whether or not SSE is useful in evaluating clusters results of each of the following methods: k-means, basic agglomerative, DBSCAN.
- What are cluster cohesion and separation? Provide formulas to calculate them. How can these measures be used to evaluate a clustering?
- What is the silhouette coefficient? Provide a formula to calculate it. How can this measure be used to evaluate a clustering?
- How can the similarity (proximity) matrix of a dataset in which data instances have been sorted so that instances that belong to the same cluster appear next to each other be used to visually evaluate a clustering? Explain. Does this visualization help evaluating a set of clusters? Just one cluster? Both? Explain. Be prepare to evaluate a clustering based on this visualization (e.g., Figures 8.30 and 8.31). Be prepared to produce this visualization given a dataset and a clustering over it.
- How can correlation between the similary (proximity) matrix and the cluster-based incidence matrix be used to evaluate a clustering? By the way, how is this incidence matrix defined? (See Section 8.5.3.)
- How can SSE be used to help determine the appropriate number of clusters? Use Figure 8.32 as an example.
- Given a value of a evaluation metric (e.g., SSE), how do you assess whether or not it is a good value? (See Section 8.5.8)

**General:**- Given a dataset, a distance metric, and other necessary parameters, be prepared to apply any of the clustering algorithms studied in this course (k-means, basic agglomerative, DBSCAN), or a variation of them provided.
- Given a dataset, argue which of the above clustering methods would be more appropriate and why.

**Textbook Exercises:**5*, 6, 7 , 11, 12*, 13*, 14*, 16**, 17**, 19, 21, 24*, 23*, 32**.

**Anomaly Detection: Chapter 10 + Textbook Slides**- What's an anomaly (or outlier)?
- Give an example of a situation in which an anomaly should be removed during pre-processing of the dataset, and another example of a situation in which an anomaly is an interesting data instance worth keeping and/or studying in more detail.
- Define each of the following approaches to anomaly detection, and describe the differences between each pair: Model-based, Proximity-based, and Density-based techniques.
- Can visualization be used to detect outliers? If so, how? Give specific examples of visualization techniques that can be used for anomaly detection. For each one, explain whether or not the visualization technique can be considered a Model-based (which includes Statistical), Proximity-based, or Density-based technique for anomaly detection.
- Define each of the following modes to anomaly detection, and describe the differences between each pair: supervised, unsupervised, and semi-supervised.
- Consider the case of a dataset that has labels identifying the anomalies and the task is to learn how to detect similar anomalies in unlabelled data. Is that supervised or unsupervised anomaly detection? Explain.
- Consider the case of a dataset that doesn't have labels identifying the
anomalies and the task is to find how to assign a sound anomaly score,
*f(x)*, to each instance*x*in the dataset. Is that supervised or unsupervised anomaly detection? Explain. - Precision, recall, and false positive rate are mentioned in your textbook as appropriate metrics to evaluate anomaly detection algorithms (see p. 657). What are those metrics (see Section 5.7.1, p. 295) and how can they be used to evaluate anomaly detection?
- For each of the anomaly detection approaches
(statistical-based, proximity-based, density-based, and clustering-based) do
- State the definition(s) of outlier used by the approach
- How can this be definition used to assign an anomaly score to each data instance?
- How does this anomaly detection approach work in general? Give an example to illustrate your description.

**Approach****Definition of Outlier**

(state full definition)**Anomaly score function****How does the approach work?**

(in general)**Example**Statistical-based Probabilistic definition of outlier Proximity-based Proximity-based definition of outlier using

distance to k-nearest neighborDensity-based Density-based definition of outlier using - inverse distance
- count of points within radius, or
- average relative density

Clustering-based Clustering-based definition of outlier - Be prepared to interpret graphs like those in Figs. 10.4-10.7 (p.667), and to generate those figures from a given dataset.
**Textbook Exercises:**2, 3, 5, 6*, 8, 10*, 11*, 12, 13, 15*, 16.

**Text Mining**- Differences and similarities between text mining and general data mining.
- Converting text data sources (unstructed data) to bags of words or word vectors (structured data). Issues encountered during this conversion: ambiguity of natural languages; parsing; stemming (that is, the process of representing inflected, derived, or modified words by their stem or root word - for example, "flying", "flies", "flew", "flown" can all be represented by their stem: "fly"); removal of stop words.
- Other data preprocessing issues such as feature selection, dimensionality reduction.
- After converting text sources into structured data, how is text/document classification performed? How is document clustering performed? How are word associations found in the documents? How about finding anomalies?

**Web Mining**- Differences and similarities between web mining and general data mining.
- Three subareas of web mining: web content mining, web structure mining, and web usage mining.
- Web content mining deals with text, images, audio, video, and also structured records.
- Web structure mining deals the "topology" of the web, including hyperlinks (intra-document hyperlinks and inter-document hyperlinks), and document structure.
- Web usage mining deals with web server logs, application server logs, and application level logs.
- Preprocessing for web mining.
- Classification, clustering, associations, and anomaly detection in web mining.