Quiz/Exam Topics and Sample Questions

The textbook referred to on this page is:

"Introduction to Data Mining (2nd Edition)".

By Pang-Ning Tan, Michael Steinbach, Anuj Karpatne, and Vipin Kumar.

Pearson. 2019. ISBN-13: 978-0133128901 ISBN-10: 0133128903.

(See the book's link above for book slides and other resources.)

**Introduction to Data Mining and Knowledge Discovery in Databases:**

All materials covered in class, online lecture notes, and Chapter 1 + Slides of the Textbook.

- Define Data Mining.
- List the steps of the Knowledge Discovery in Databases (KDD) and describe each of them.
- What fields contribute to Data Mining and how?
- What is the difference between predictive models and descriptive models? Give an example of each type.
- What Data Mining Tasks (or "Approaches") exist? [Answer: Classification, Regression, Clustering, Summarization, Dependency/Association Analysis, Change/Deviation Detection.] Describe each of them.
**Textbook Exercises:** 1, 2.

**Data and Data Preprocessing:**

All materials covered in class, online lecture notes, and Chapter 2 + Slides + Appendix B of the Textbook.

- What is an attribute? What is a data instance?
- What type of attributes exist? Describe the differences among them.
- What is the difference between discrete and continuous attributes?
- What is noise? How can noise be reduced in a dataset?
- Define outlier. Describe 2 different approaches to detect outliers in a dataset.
- Describe 3 different techniques to deal with missing values in a dataset. Explain when each of these techniques would be most appropriate.
- Given a sample dataset with missing values, apply an appropriate technique to deal with them.
- Give 2 examples in which aggregation is useful.
- Given a sample dataset, apply aggregation of data values.
- What is sampling?
- What is simple random sampling? Is it possible to sample data instances using a distribution different from the uniform distribution? If so, give an example of a probability distribution of the data instances that is different from uniform (i.e., equal probability).
- What is stratified sampling?
- What is "the curse of dimensionality"?
- Provide a brief description of what Principal Components Analysis (PCA) does. [Hint: See Appendix B.] State what the input and what the output of PCA are.
- What is the difference between Principal Components Analysis (PCA) and Singular Value Decomposition (SVD)? [Hint: See Appendix B.]
- What is the difference between dimensionality reduction and feature selection?
- Describe in detail 2 different techniques for feature selection. [For instance, describe: using the correlation matrix of the data attributes; using heuristic search for a good subset of attributes, like Correlation-based Feature Selection (CFS), implemented in Weka (CfsSubsetEval) and described in class; or feature weighting.]
- Given a sample dataset (represented by a set of attributes, a correlation matrix, a co-variance matrix, ...), apply feature selection techniques to select the best attributes to keep (or equivalently, the best attributes to remove).
- What is the difference between feature selection and feature extraction?
- Give two examples of data in which feature extraction would be useful.
- Given a sample dataset, apply feature extraction.
- What is data discretization and when is it needed?
- What is the difference between supervised and unsupervised discretization?
- Given a sample dataset, apply unsupervised (e.g., equal width, equal frequency) discretization, or supervised discretization (e.g., using entropy).
- Describe 2 approaches to handle nominal attributes with too many values.
- Given a dataset, apply variable transformation: Either a simple given function, normalization, or standardization.
- Define the notions of proximity, similarity, dissimilarity, and distance. What metrics can be used to measure these notions?
- Given a numeric data attribute, what transformation can be used to map the values of the attribute to the range [0,1]? Explain.
- What metrics exist to measure the dissimilarity (or distance) between two numeric values? Between two nominal values? Between a numeric value and a nominal value?
- Provide the formula for the Minkowski distance. Show that the Euclidean distance, the Manhattan distance, and the Hamming distance are particular cases of the Minkowski distance.
- Know how the Jaccard Coefficient and the Cosine Similarity can be used to measure similarity/dissimilarity between attributes.
- Definition of Correlation and Covariance, and how to use them in data pre-processing (see pp. 83-85).
- What is the Mahalanobis distance and when is it useful?
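As a study aid, here is a minimal sketch (not from the textbook; the sample vectors are made up) showing that the Manhattan, Euclidean, and Hamming distances are special cases of the Minkowski distance:

```python
def minkowski(x, y, p):
    """Minkowski distance of order p between two numeric vectors."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

x, y = [0, 3, 4], [4, 0, 4]
manhattan = minkowski(x, y, 1)   # p = 1: Manhattan (city-block) distance -> 7.0
euclidean = minkowski(x, y, 2)   # p = 2: Euclidean distance -> 5.0
# On binary vectors, p = 1 counts mismatching positions, i.e. the Hamming distance:
hamming = minkowski([1, 0, 1, 1], [1, 1, 0, 1], 1)   # -> 2.0
```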
**Textbook Exercises:** 1, 2 (binary, discrete, and continuous only), 5, 12, 15, 18, 19, 20, 22, 23, 24.

**Classification: ZeroR, OneR, Decision Trees:**

All materials covered in class, online lecture notes, and Sections 3.1-3.3 + Slides + Appendix B of the Textbook.

- Define the data mining task "classification".
- What is the difference between a Descriptive Model and a Predictive Model? Is it possible for a model to be both descriptive and predictive? If so, provide an example of a model that is both descriptive and predictive. If not, explain why.
**ZeroR / majority class classifier:**

- Define this classifier and how to construct it when the target attribute is nominal and when the target attribute is continuous.

**OneR:**

- Define this classifier. What metric is used to select the attribute in the One Rule? (Hint: Is it entropy or classification error?) How is that One Rule constructed?

**Decision Trees:**

- Understand in detail:
- the algorithm to construct decision trees,
- impurity metrics used to select attributes (entropy/gain, Gini, and classification error), and how these metrics are applied to nominal and to continuous attributes,
- how to split a dataset once an attribute has been selected,
- alternate stopping criteria for the decision tree construction.

- Describe in detail 3 characteristics of decision tree construction (see Section 3.3.6.)
- How does pre-pruning of decision trees work? Given a sample dataset construct a decision tree using pre-pruning.
- How does post-pruning of decision trees work? Given a sample dataset construct a decision tree and then use post-pruning.
- What is known as the Occam's razor principle and how does it apply to the construction of a model?
- How does J4.8 handle continuous attributes directly? How does it handle missing values?
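The three impurity metrics used to select attributes can be sketched for a node's class counts as follows (an illustrative sketch; the example counts are made up):

```python
from math import log2

def entropy(counts):
    """Entropy of a node given its class counts."""
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c > 0)

def gini(counts):
    """Gini index of a node given its class counts."""
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

def classification_error(counts):
    """Classification error of a node given its class counts."""
    n = sum(counts)
    return 1 - max(counts) / n

# A node with 4 positives and 4 negatives is maximally impure:
# entropy([4, 4]) -> 1.0, gini([4, 4]) -> 0.5, classification_error([4, 4]) -> 0.5
# A pure node has impurity 0 under all three metrics.
```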

**Textbook Exercises:** 2, 3, 4, 5, 9.

**Regression: Linear Regression, Regression Trees, Model Trees:**

All materials covered in class, online lecture notes, and online Appendix B.1 of the Textbook.

- Define the data mining task "regression".
- What is the difference between regression and classification? Explain.
- Describe the unsupervised "one-hot encoding" transformation of a discrete attribute into a set of continuous (in this case, binary) attributes. Given a discrete attribute, be able to apply this transformation to the attribute.
- Describe the supervised transformation used by the Weka system to convert a discrete attribute into a set of continuous (in this case, binary) attributes. Given a discrete attribute, be able to apply this transformation to the attribute.
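A minimal sketch of the unsupervised one-hot transformation (the attribute values are made-up examples; Weka's supervised variant differs, as described in class):

```python
def one_hot(values):
    """Map each discrete value to a binary indicator vector,
    one new binary attribute per distinct value."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]

# "Outlook" with values overcast/rainy/sunny becomes three binary attributes,
# with columns ordered alphabetically: [overcast, rainy, sunny].
encoded = one_hot(["sunny", "overcast", "rainy", "sunny"])
```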
**Linear Regression:**

- Define this data mining method and how to construct it when the target attribute is nominal and when the target attribute is continuous. What criterion/function is minimized in the construction of a linear regression model?

**Regression Trees and Model Trees:**

- Understand in detail:
- the algorithm to construct regression and model trees,
- impurity metrics used to select attributes (standard deviation reduction), and how this metric is used,
- how to split a dataset once an attribute has been selected,
- alternate stopping criteria for the tree construction,
- what do leaf nodes contain? That is, what function or procedure does a leaf node in a regression tree use to make predictions? Same question for model trees.

- How does pre-pruning of a tree work? Given a sample dataset construct a regression or a model tree using pre-pruning.
- How does post-pruning of a tree work? Given a sample dataset construct a regression or a model tree and then use post-pruning.
- How does smoothing work? That is, given a test instance explain how its predicted target value is obtained using the leaf node and its ancestor nodes on the branch of the tree that matches the test instance.
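A minimal sketch of standard deviation reduction (SDR), the impurity measure listed above, assuming the SDR formulation covered in class (the sample targets are made up):

```python
from statistics import pstdev

def sdr(targets, partitions):
    """Standard deviation reduction of a candidate split:
    SDR = sd(parent) - sum over children of (|child| / |parent|) * sd(child)."""
    n = len(targets)
    return pstdev(targets) - sum(len(p) / n * pstdev(p) for p in partitions)

parent = [10, 12, 30, 32]
# A split separating low from high targets yields a large SDR (nearly pure children);
# a split mixing them yields a small SDR, so the first split is preferred.
good = sdr(parent, [[10, 12], [30, 32]])
bad = sdr(parent, [[10, 30], [12, 32]])
```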


**Model Evaluation and Model Comparison: (for prediction models)**

All materials covered in class, online lecture notes, and Sections 3.4-3.9 + Slides of the Textbook.

**Model Overfitting:**

- What is overfitting and how can it be avoided/prevented?

**Model Selection:**

- What is a validation set and how can it be used to compare alternative (partial) models during training?

**Model Evaluation:**

- Define training set, testing set, and validation set.
- Classification performance: Evaluating a classification model:
- Define accuracy, classification error, and confusion matrix. Given a model and a test set, calculate the accuracy, error rate, and the confusion matrix of the model over the test set.
- Given a specific class value of the target attribute (e.g., "sunny" if the classification target is "Outlook", with values "sunny", "overcast", and "rainy"), define True Positive (TP), True Negative (TN), False Positive (FP), False Negative (FN), Precision, Recall, F1-score, and AUC. Given a model and a test set, calculate these metrics of the model over the test set.
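A sketch of the per-class metrics above, computed directly from TP/TN/FP/FN counts (the counts are made up):

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1-score for the class treated as positive."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

def accuracy(tp, tn, fp, fn):
    """Fraction of all test instances classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)

# e.g., treating "sunny" as the positive class with tp=8, fp=2, fn=2, tn=8:
p, r, f1 = precision_recall_f1(8, 2, 2)   # all three equal 0.8 here
acc = accuracy(8, 8, 2, 2)                # 16/20 = 0.8
```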

- Regression performance: Evaluating a regression model:
- Define the following metrics:
  - mean-squared error (MSE),
  - root mean-squared error (RMSE),
  - mean-absolute error (MAE),
  - relative squared error (RSE),
  - relative absolute error (RAE),
  - correlation coefficient (denoted by r or R),
  - coefficient of determination (denoted by r^2 or R^2).
- Given a model and a test set, calculate these metrics of the model over the test set.
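A minimal sketch of several of the regression metrics above (MSE, RMSE, MAE, RSE, RAE), with made-up predictions and targets:

```python
from math import sqrt

def regression_metrics(p, t):
    """Regression evaluation metrics for predictions p and true targets t."""
    n = len(t)
    mean_t = sum(t) / n
    mse = sum((pi - ti) ** 2 for pi, ti in zip(p, t)) / n
    mae = sum(abs(pi - ti) for pi, ti in zip(p, t)) / n
    # RSE/RAE normalize by the error of always predicting the target mean:
    rse = sum((pi - ti) ** 2 for pi, ti in zip(p, t)) / sum((ti - mean_t) ** 2 for ti in t)
    rae = sum(abs(pi - ti) for pi, ti in zip(p, t)) / sum(abs(ti - mean_t) for ti in t)
    return {"MSE": mse, "RMSE": sqrt(mse), "MAE": mae, "RSE": rse, "RAE": rae}

metrics = regression_metrics([2.5, 0.0, 2.0], [3.0, -0.5, 2.0])
```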

- Discuss disadvantages of testing a model over its training set.
- Define the Holdout (same as Weka's Percentage Split) approach for training and testing a model. Discuss its advantages and disadvantages.
- Explain how k-fold cross-validation works. Illustrate with an example.
- What is the leave-one-out evaluation method?
- What is bootstrap? What is the difference between bootstrap and random subsampling?
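The index bookkeeping behind k-fold cross-validation can be sketched as follows (a simplified sketch: folds are formed by striding over indices rather than by shuffling or stratifying):

```python
def kfold_indices(n, k):
    """Yield (train, test) index lists: each instance is tested exactly once
    and used for training in the other k-1 folds."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

splits = list(kfold_indices(10, 5))   # 5 folds, each with 2 test instances
```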

**Parameters and Hyper-parameters:**

- What is a model parameter? In the case of decision trees, what parameters are typically used?
- Describe an approach to determine good parameter values.
- What is a hyper-parameter? How do hyper-parameters differ from parameters?
- Describe nested cross-validation as an approach to determine good hyper-parameter values.

**Model Evaluation (cont.) and Model Comparison:**

- Given the prediction error of a model M and the number of data instances n in the test set used to calculate the prediction error:
- What is the formula to construct the confidence interval for the prediction error of M? Does this formula work for any size n of the test set? Or is a certain minimum n required? Explain.
- Know how to use the formula above to construct 99%, 98%, 95%, 90%, 80%, 68%, and 50% confidence intervals.

- Given two models, M1 and M2, the prediction error of each model, and the numbers of data instances n1 and n2 in the test sets used to calculate the prediction error of each model, respectively:
- Explain the process to determine whether the difference in performance between the two models is statistically significant at a given p level (e.g., p < 0.05).
- What is the formula to construct the confidence interval for the difference of the prediction errors of the two models? Does this formula work for any sizes n1 and n2 of the test sets? Or are certain minimum values of n1 and n2 required? Explain.
- Know how to use the formula above to construct 99%, 98%, 95%, 90%, 80%, 68%, and 50% confidence intervals.
- If you want to establish statistical significance at a level of p < 0.1, which confidence interval do you construct among: 99%, 98%, 95%, 90%, 80%, 68%, and 50%? Same question for the following p value thresholds: 0.05, 0.01, and 0.2.
- Once you construct the confidence interval for the difference in performance between M1 and M2, how do you determine whether there is a statistically significant difference between the two models? If there is, which of the two models is significantly better than the other?
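A sketch of this significance test, assuming the normal-approximation confidence interval for the difference of two error rates covered in class (the error rates and test-set sizes below are made up):

```python
from math import sqrt

def diff_confidence_interval(e1, n1, e2, n2, z=1.96):
    """Confidence interval for the true difference in error rates of two models.
    z = 1.96 gives 95%; other z values give other levels (e.g., 2.58 for 99%,
    1.65 for 90%). Assumes n1 and n2 are large enough for the normal approximation."""
    d = e1 - e2
    sd = sqrt(e1 * (1 - e1) / n1 + e2 * (1 - e2) / n2)
    return d - z * sd, d + z * sd

low, high = diff_confidence_interval(0.15, 1000, 0.25, 1000)
# If the interval excludes 0, the difference is statistically significant;
# here the whole interval is negative, so M1 has significantly lower error.
significant = not (low <= 0 <= high)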

**Textbook Exercises:** 9, 11, 12.

**Artificial Neural Networks and Deep Learning:**

All materials covered in class, online lecture notes, and Sections 4.7-4.8 + Slides of the Textbook.

Need to know all of the following concepts, what they are, and how to use them:

- Perceptron (also called "neuron" or single "unit").
- Networks of perceptrons and different architectures for these nets:
- Layered, fully connected
- Convolutional neural networks (CNNs)
- Combinations of the above, that is, architectures that have some fully connected layers and some convolutional layers
- Recurrent Neural Networks (RNN)
- Autoencoders

- Describe how nodes in hidden layers can be thought of as extracting features from the input data.
- Activation function: What an activation function is used for. Also, know the following activation functions (their formulas, graphs and their derivatives):
- Step and Sign
- Linear
- Sigmoid
- Tanh
- ReLU
- SELU

- Loss function: What a loss function is used for. Also, know the following loss functions (their formulas, graphs and their gradient (i.e., their partial derivatives with respect to the weights) when applied to an activation function):
- Squared loss function
- Cross-entropy

- Gradient descent:
- How this search method works: search space (i.e., landscape), what the goal of the search is, what the initial state/location of the search is, how each step of the search is performed, what are possible termination criteria, what to do if the search fails.
- Differences among classical (or "batch"), stochastic (or "incremental") and "mini-batch" gradient descent.
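The three gradient descent variants above differ only in batch size; a minimal sketch for a single linear unit with squared loss (batch size 1 gives stochastic/incremental, batch size len(X) gives classical batch gradient descent):

```python
def minibatch_gd_epoch(w, X, y, lr=0.1, batch_size=2):
    """One epoch of mini-batch gradient descent for a linear unit with weights w."""
    for start in range(0, len(X), batch_size):
        xb, yb = X[start:start + batch_size], y[start:start + batch_size]
        # Gradient of (1/2) * sum (w.x - y)^2 with respect to each weight.
        grad = [0.0] * len(w)
        for xi, yi in zip(xb, yb):
            err = sum(wj * xj for wj, xj in zip(w, xi)) - yi
            for j, xj in enumerate(xi):
                grad[j] += err * xj
        w = [wj - lr * g / len(xb) for wj, g in zip(w, grad)]
    return w

# Fitting y = 2x: repeated epochs drive the single weight toward 2.0.
w = [0.0]
for _ in range(100):
    w = minibatch_gd_epoch(w, [[1], [2], [3], [4]], [2, 4, 6, 8])
```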

- Error back-propagation algorithm: What the goal of this algorithm is and how it works.
- What the "vanishing gradient problem" is. How to prevent it.
- Regularization: What it is used for.
- How the "dropout" method works and what the rationale for using this method is.

- Parameters and Hyperparameters:
- What "parameters" and "hyperparameters" in deep networks are. What the difference between them is. List specific parameters and hyperparameters used in the training of deep networks.
- Parameter initialization methods:
- Random
- Supervised pretraining
- Unsupervised pretraining
- Using autoencoders
- Hybrid pretraining

- Need to know how to use Python and Keras as required on the project to create, train and evaluate artificial neural networks.
- Need to know from experience with the project how changes in training (e.g., using a different initialization method, activation function, learning rate, ...) may affect the training process and the resulting network.

**Bayesian Classifiers:**

All materials covered in class, online lecture notes, and Section 4.5 + Slides of the Textbook.

- Basic probability concepts: Probability of an event, independence, conditional probability, conditional independence, Bayes theorem, how to use the Bayes theorem for classification.
- Bayesian Classifiers: Graph (topology) + Conditional Probability Tables (CPTs).
- Naive Bayesian Classifier: Naive Bayes assumption; topology and CPTs of a naive Bayes model.
- (General) Bayesian Classifier: How to determine the topology (graph edges) and CPTs of a Bayesian model.
- Bayesian Classification: How to use naive or general Bayesian nets to classify a test instance.
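A minimal sketch of naive Bayes classification (the priors and CPT values below are made up, standing in for tables estimated from a training set):

```python
def naive_bayes_classify(priors, cpts, instance):
    """Under the naive Bayes assumption, P(class | x) is proportional to
    P(class) * product over attributes i of P(x_i | class).
    priors: {class: P(class)};
    cpts:   {class: [{value: P(value | class)} per attribute]}."""
    scores = {}
    for c, prior in priors.items():
        score = prior
        for table, value in zip(cpts[c], instance):
            score *= table[value]
        scores[c] = score
    return max(scores, key=scores.get)

priors = {"yes": 0.6, "no": 0.4}
cpts = {"yes": [{"sunny": 0.2, "rainy": 0.8}, {"high": 0.3, "low": 0.7}],
        "no":  [{"sunny": 0.7, "rainy": 0.3}, {"high": 0.8, "low": 0.2}]}
# yes: 0.6*0.2*0.3 = 0.036; no: 0.4*0.7*0.8 = 0.224 -> predict "no"
label = naive_bayes_classify(priors, cpts, ("sunny", "high"))
```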
**Textbook Exercises:** 7, 8, 9, 11, 12.

**Rule-Based Classifiers:**

All materials covered in class, online lecture notes, and Section 4.2 + Slides of the Textbook.

- What is an "if ... then ..." rule?
- Define the following terms: antecedent, consequent, left-hand-side (LHS), right-hand-side (RHS) of a rule.
- What is the formula to calculate the coverage of a rule? Given a rule and a dataset, calculate the coverage of the rule with respect to the dataset.
- What is the formula to calculate the accuracy of a rule? Given a rule and a dataset, calculate the accuracy of the rule with respect to the dataset.
- Describe the steps of the sequential covering algorithm in detail.
- Given a dataset, follow the sequential covering algorithm to obtain classification rules.
- Describe different ways to apply the rules in a rule set to classify a given data instance: ordered rules and unordered rules.
- Describe different rule-ordering schemes: rule-based and class-based.
- Given a test dataset and a rule-based model, use the model to make predictions over the test data and also to evaluate the model.
- Explain how RIPPER constructs and prunes rules. Given a dataset, a validation set, and a rule show how RIPPER works.
- Using ideas similar to those of J4.8, describe an approach to pre- and to post-pruning rules.
- Using ideas similar to those of J4.8, describe how you would enhance the rule-construction algorithm to handle continuous attributes directly.
- Using ideas similar to those of J4.8, describe how you would enhance the rule-construction algorithm to handle missing values directly.
- Describe how the ideas behind pre- and post-pruning of decision trees may be transferred to prune rules constructed by the sequential covering algorithm.
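The coverage and accuracy measures above can be sketched as follows (the rule and dataset are made up):

```python
def rule_coverage_accuracy(antecedent, consequent, dataset):
    """Coverage = fraction of instances matching the rule's antecedent;
    accuracy = fraction of matching instances whose class equals the consequent.
    antecedent: predicate over an instance; dataset: list of (instance, class)."""
    matched = [(x, y) for x, y in dataset if antecedent(x)]
    coverage = len(matched) / len(dataset)
    accuracy = sum(1 for _, y in matched if y == consequent) / len(matched)
    return coverage, accuracy

data = [({"outlook": "sunny"}, "play"), ({"outlook": "sunny"}, "no-play"),
        ({"outlook": "rainy"}, "no-play"), ({"outlook": "sunny"}, "play")]
# Rule: if outlook = sunny then play. Covers 3 of 4 instances; 2 of the 3 are "play".
cov, acc = rule_coverage_accuracy(lambda x: x["outlook"] == "sunny", "play", data)
```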
**Textbook Exercises:** 1, 2*, 3.

**Association Analysis:**

All materials covered in class, online lecture notes, and Sections 5.1, 5.2, 5.3, and 5.7 + Textbook Slides.

- What is association analysis? How is it different from the other data mining tasks studied in this course (classification, regression, clustering, and anomaly detection)?
- If association analysis is about finding relationships among data attributes, is calculating the correlation matrix for the set of attributes association analysis? Why or why not?
- Given a dataset of transactions, where each transaction is a set of items, what's the meaning of an association rule X → Y with confidence = c and support = s? Assume that X and Y are sets of items.
- Provide formulas that define confidence, support, and lift in terms of probability.
- How has the association rule mining task been traditionally formulated? (See Definition 6.1. on p. 330 of your textbook).
- Describe a brute-force algorithm that generates all association rules that satisfy the conditions in Def. 6.1.
- Know in detail the Apriori algorithm, which generates all association rules that satisfy the conditions in Def. 6.1.
- How does the apriori algorithm generate all the frequent itemsets?
- What is the apriori principle? Provide 2 different formulations of this principle.
- Describe how the first level of itemsets is generated.
- Describe how the second level of itemsets is generated.
- After level k of frequent itemsets (i.e., itemsets of cardinality k that have enough support) has been obtained, describe how the candidate k+1 itemsets are generated. Describe in detail the merge/join condition, and the subset (or candidate pruning) condition.
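The merge/join and subset (candidate pruning) conditions above can be sketched as follows (a simplified sketch over sorted item tuples):

```python
from itertools import combinations

def generate_candidates(frequent_k):
    """Level-wise candidate generation in Apriori.
    frequent_k: list of sorted tuples, the frequent itemsets of size k.
    Merge/join: combine two frequent k-itemsets sharing their first k-1 items.
    Pruning: drop any candidate with an infrequent k-subset (by the apriori
    principle, such a candidate cannot be frequent)."""
    frequent = set(frequent_k)
    k = len(frequent_k[0])
    candidates = []
    for a, b in combinations(sorted(frequent_k), 2):
        if a[:-1] == b[:-1]:                      # merge/join condition
            candidate = a + (b[-1],)
            # subset condition: every k-subset of the candidate must be frequent
            if all(s in frequent for s in combinations(candidate, k)):
                candidates.append(candidate)
    return candidates

# Frequent 2-itemsets {A,B}, {A,C}, {B,C}, {B,D} yield only candidate {A,B,C}:
# {B,C,D} is joined but pruned because its subset {C,D} is not frequent.
cands = generate_candidates([("A", "B"), ("A", "C"), ("B", "C"), ("B", "D")])
```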

- How does the apriori algorithm generate all the rules with sufficient confidence from the frequent itemsets? Explain in detail the pruning method used to avoid generating rules that are known in advance to not have enough confidence.
- How does the apriori algorithm improve upon the brute-force algorithm? (Hint: The apriori algorithm uses the apriori principle (and its corollaries: the merge/join condition and subset/candidate-pruning condition) to avoid generating itemsets and/or scanning the dataset counting the support of those itemsets, when it can be determined in advance that those itemsets will not have enough support. Also, the apriori algorithm uses an efficient way of generating rules that have enough confidence.)
- In the context of evaluation of association patterns, what's an objective measure of interestingness? What is the difference between an objective and a subjective measure of interestingness?
- Define the following metrics for association rules: lift, interest factor, correlation analysis, IS, conviction, and leverage.
- Given a dataset of transactions, min. support and min. confidence thresholds, be prepared to follow the apriori algorithm to generate all association rules that satisfy the min threshold conditions.
- Given a dataset of transactions and an association rule, be prepared to calculate the support, confidence, lift, conviction, and leverage of the rule.
**Textbook Exercises:** 1, 2*, 3*, 6, 7*, 8*, 14, 16*, 17*.

**Clustering:**

All materials covered in class, online lecture notes, and Chapter 7 + Slides.

- What is clustering?
- Contrast the following approaches to clustering: hierarchical vs. partitional; exclusive vs. overlapping vs. fuzzy; and complete vs. partial.
**K-means:**

- How does the K-means algorithm work?
- Given a dataset and a distance metric, be prepared to follow the K-means algorithm to cluster the instances in the dataset.
- What is the difference between centroid and medoids? Are centroids always data instances in the dataset? Are medoids always data instances in the dataset? Explain.
- How is sum of squared error (SSE) defined?
- Discuss effective ways to determine an appropriate value of k to use.
- Discuss effective ways to choose appropriate initial centroids. Illustrate situations in which one way would be more appropriate than the others.
- Explain why the time complexity of the k-means algorithm is O(I*K*m*n). (Hint: See p. 505.)
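A minimal sketch of the K-means loop on 1-D data (the points and initial centroids are made up; real implementations also check for convergence rather than running a fixed number of iterations):

```python
def kmeans(points, centroids, iterations=10):
    """Repeat: assign each point to its nearest centroid, then recompute each
    centroid as the mean of its assigned points."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

# Two well-separated groups converge to centroids 2.0 and 11.0:
centroids, clusters = kmeans([1, 2, 3, 10, 11, 12], [2, 11], iterations=5)
```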

**Hierarchical Clustering:**

- What is the difference between agglomerative and divisive hierarchical clustering?
- What is a dendrogram?
- Describe the steps of the basic agglomerative hierarchical clustering algorithm.
- What is a proximity matrix?
- Describe the following distance metrics to calculate the distance between two clusters: min (single link), max (complete link), group average, distance between centroids, and Ward distance.
- What is the time complexity of the basic agglomerative hierarchical clustering algorithm?
- Is this basic algorithm greedy or not? Explain.

**DBSCAN:**

- Why is it important to consider density when clustering a dataset? Illustrate your argument(s) with examples.
- Given a dataset, a radius value (epsilon), and a min. number of points (MinPts), how are core, border, and noise points defined by DBSCAN?
- How does DBSCAN identify core, border, and noise points in a dataset? How does DBSCAN use those points to cluster the dataset?
- What is the time complexity of DBSCAN?
- Discuss effective ways to determine appropriate values for epsilon and MinPts.
- How would you extend DBSCAN to appropriately cluster a dataset in which different regions have different densities?
- Are the versions of the DBSCAN algorithm in the textbook and in the textbook slides equivalent (i.e., do they always produce the same result)? Explain.
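DBSCAN's classification of core, border, and noise points can be sketched on 1-D data as follows (a simplified sketch; the full algorithm then grows clusters from the core points):

```python
def label_points(points, eps, min_pts):
    """A core point has at least min_pts points (itself included) within radius eps;
    a border point is not core but lies within eps of a core point;
    every remaining point is noise."""
    def neighbors(p):
        return [q for q in points if abs(p - q) <= eps]
    core = {p for p in points if len(neighbors(p)) >= min_pts}
    labels = {}
    for p in points:
        if p in core:
            labels[p] = "core"
        elif any(q in core for q in neighbors(p)):
            labels[p] = "border"
        else:
            labels[p] = "noise"
    return labels

# 2 is core (neighbors 1, 2, 3); 1 and 3 are border; the isolated 10 is noise.
labels = label_points([1, 2, 3, 10], eps=1.5, min_pts=3)
```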

**Cluster Evaluation:**

- How can SSE (see above) be used to evaluate clusters? Discuss whether or not SSE is useful in evaluating the clustering results of each of the following methods: k-means, basic agglomerative, DBSCAN.
- What are cluster cohesion and separation? Provide formulas to calculate them. How can these measures be used to evaluate a clustering?
- What is the silhouette coefficient? Provide a formula to calculate it. How can this measure be used to evaluate a clustering?
- How can the similarity (proximity) matrix of a dataset, in which data instances have been sorted so that instances belonging to the same cluster appear next to each other, be used to visually evaluate a clustering? Explain. Does this visualization help in evaluating a set of clusters? Just one cluster? Both? Explain. Be prepared to evaluate a clustering based on this visualization (e.g., Figures 8.30 and 8.31), and to produce this visualization given a dataset and a clustering over it.
- How can correlation between the similarity (proximity) matrix and the cluster-based incidence matrix be used to evaluate a clustering? By the way, how is this incidence matrix defined? (See Section 8.5.3.)
- How can SSE be used to help determine the appropriate number of clusters? Use Figure 8.32 as an example.
- Given a value of an evaluation metric (e.g., SSE), how do you assess whether or not it is a good value? (See Section 8.5.8.)
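A minimal sketch of the silhouette coefficient for a single point on 1-D data (the clusters are made up):

```python
def silhouette(point, own_cluster, other_clusters):
    """s = (b - a) / max(a, b), where a is the mean distance from the point to the
    other points of its own cluster, and b is the smallest mean distance to the
    points of any other cluster. Values near 1 mean the point is well placed."""
    a = sum(abs(point - q) for q in own_cluster if q != point) / (len(own_cluster) - 1)
    b = min(sum(abs(point - q) for q in other) / len(other)
            for other in other_clusters)
    return (b - a) / max(a, b)

# a = 1.5 (mean distance to 2 and 3), b = 10.0 (mean distance to the other cluster):
s = silhouette(1, own_cluster=[1, 2, 3], other_clusters=[[10, 11, 12]])
```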

**General:**

- Given a dataset, a distance metric, and other necessary parameters, be prepared to apply any of the clustering algorithms studied in this course (k-means, basic agglomerative, DBSCAN), or a provided variation of them.
- Given a dataset, argue which of the above clustering methods would be more appropriate and why.

**Textbook Exercises:** 5*, 6, 7, 11, 12*, 13*, 14*, 16**, 17**, 19, 21, 24*, 23*, 32**.

**Anomaly Detection:**

All materials covered in class, online lecture notes, and Chapter 9.

- What is an anomaly (or outlier)?
- Give an example of a situation in which an anomaly should be removed during pre-processing of the dataset, and another example of a situation in which an anomaly is an interesting data instance worth keeping and/or studying in more detail.
- Define each of the following approaches to anomaly detection, and describe the differences between each pair: Model-based, Proximity-based, and Density-based techniques.
- Can visualization be used to detect outliers? If so, how? Give specific examples of visualization techniques that can be used for anomaly detection. For each one, explain whether or not the visualization technique can be considered a Model-based (which includes Statistical), Proximity-based, or Density-based technique for anomaly detection.
- Define each of the following modes of anomaly detection, and describe the differences between each pair: supervised, unsupervised, and semi-supervised.
- Consider the case of a dataset that has labels identifying the anomalies and the task is to learn how to detect similar anomalies in unlabelled data. Is that supervised or unsupervised anomaly detection? Explain.
- Consider the case of a dataset that doesn't have labels identifying the anomalies and the task is to find how to assign a sound anomaly score, *f(x)*, to each instance *x* in the dataset. Is that supervised or unsupervised anomaly detection? Explain.
- Precision, recall, and false positive rate are mentioned in your textbook as appropriate metrics to evaluate anomaly detection algorithms (see p. 657). What are those metrics (see Section 5.7.1, p. 295) and how can they be used to evaluate anomaly detection?
- For each of the anomaly detection approaches (statistical-based, proximity-based, density-based, and clustering-based):
- State the definition(s) of outlier used by the approach.
- How can this definition be used to assign an anomaly score to each data instance?
- How does this anomaly detection approach work in general? Give an example to illustrate your description.

| Approach | Definition of Outlier (state full definition) | Anomaly score function | How does the approach work? (in general) | Example |
|---|---|---|---|---|
| Statistical-based | Probabilistic definition of outlier | | | |
| Proximity-based | Proximity-based definition of outlier using distance to k-nearest neighbor | | | |
| Density-based | Density-based definition of outlier using inverse distance, count of points within radius, or average relative density | | | |
| Clustering-based | Clustering-based definition of outlier | | | |

- Be prepared to interpret graphs like those in Figs. 10.4-10.7 (p. 667), and to generate those figures from a given dataset.
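A minimal sketch of a statistical (model-based) anomaly score: the absolute z-score of each value under a normal model fitted to the data; larger scores mean more anomalous (the data values are made up):

```python
from statistics import mean, pstdev

def z_scores(values):
    """Anomaly score for each value: |value - mean| / standard deviation,
    under a normal distribution fitted to the whole dataset."""
    m, s = mean(values), pstdev(values)
    return {v: abs(v - m) / s for v in values}

# The value 50 stands out with by far the largest score:
scores = z_scores([10, 11, 9, 10, 50])
```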
**Textbook Exercises:** 2, 3, 5, 6*, 8, 10*, 11*, 12, 13, 15*, 16.

**Text Mining:**

All materials covered in class and online lecture notes.

- Differences and similarities between text mining and general data mining.
- Converting text data sources (unstructured data) to bags of words or word vectors (structured data). Issues encountered during this conversion: ambiguity of natural languages; parsing; stemming (that is, the process of representing inflected, derived, or modified words by their stem or root word - for example, "flying", "flies", "flew", "flown" can all be represented by their stem: "fly"); removal of stop words.
- Other data preprocessing issues such as feature selection, dimensionality reduction.
- After converting text sources into structured data, how is text/document classification performed? How is document clustering performed? How are word associations found in the documents? How about finding anomalies?

**Web Mining:**

All materials covered in class and online lecture notes.

- Differences and similarities between web mining and general data mining.
- Three subareas of web mining: web content mining, web structure mining, and web usage mining.
- Web content mining deals with text, images, audio, video, and also structured records.
- Web structure mining deals with the "topology" of the web, including hyperlinks (intra-document hyperlinks and inter-document hyperlinks), and document structure.
- Web usage mining deals with web server logs, application server logs, and application level logs.
- Preprocessing for web mining.
- Classification, clustering, associations, and anomaly detection in web mining.

**Data Visualization:**

All materials covered in class and online lecture notes.

- What is the difference between data visualization and (analytical) data mining? Explain.
- Describe issues involved in mapping data to graphical elements.
- Be prepared to define, give examples of, and/or construct (given a sample dataset) any of the following visualizations: Histograms, pie charts, box plots, scatter plots, visualization matrices (e.g., heatmaps), parallel coordinates, star coordinates, Chernoff Faces.
- What type of data is each of the above visualization methods useful for (1-Dimensional, 2D, 3D, high dimensional)? What are the differences between each pair of these visualization methods?
- Use a given visualization of the dataset (e.g., scatter plot, heatmap of correlation matrix, ...) to draw conclusions about the data.