CS 548 Fall 2019 - Project 1

WPI Worcester Polytechnic Institute

Computer Science Department

Project 1: Data Pre-processing


Due Date: Canvas submission by 3:00 pm on Thursday, Sept. 12, 2019.


Problem I. Knowledge Discovery in Databases (15 points)

  1. (3 points) Define knowledge discovery in databases.

  2. (9 points) Briefly describe the steps of the knowledge discovery in databases process.

  3. (3 points) Define data mining.

Base your answers on the definitions presented in class, the textbook, and the following paper: Fayyad, U., Piatetsky-Shapiro, G., and Smyth, P. "From Data Mining to Knowledge Discovery in Databases". AAAI Magazine, pp. 37-54. Fall 1996. However, your answers must be written in your own words.

Problem II. Data Preprocessing: Attribute Transformations (140 points)

Consider the Flags Dataset available at the UCI Machine Learning Data Repository. See the link above for a description of this dataset.

Note that although most attribute values in this dataset are represented using numeric values, this dataset contains attributes that are conceptually of different types.

In this project, you will apply different scikit-learn preprocessing functions (see this link) to this dataset.

  1. (5 points) Discrete attributes with too many values:
    The attribute name (attribute #1) contains many values (one for each data instance). Would you keep this attribute in the dataset when mining for patterns? Why or why not? Explain.

  2. (25 points) Converting discrete attributes to continuous:
    1. (10 points) Read scikit-learn data transformation functions section 5.3.4 and use the OneHotEncoder function to encode each of these nominal attributes: mainhue (#18), topleft (#29), and botright (#30). Include the Python code you use for this in your written report and your .py file.
    2. (15 points + 1 extra point) Attributes landmass (#2), zone (#3), language (#6) and religion (#7) are discrete ("nominal") even though their values were represented using numbers. For each of these attributes:
      • (2 points/each) Discuss whether or not the numeric encoding used is appropriate.
      • (2 points/each) If your answer above is "no", use the OneHotEncoder function to encode the attribute in a more appropriate way:
        • Decide what the best values (the default value or another value) for the function parameters are. Explain your choices.
        • Include the OneHotEncoder function parameter values that you used in your written report and a brief description of your observations. You need to include your Python code in your .py file as well.
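As a starting point, here is a minimal sketch of what such a OneHotEncoder call could look like. The color values below are toy stand-ins for the mainhue column, not the actual flags data:

```python
# Hedged sketch: one-hot encoding a nominal column with scikit-learn.
# The toy values stand in for the flags dataset's mainhue attribute.
import numpy as np
from sklearn.preprocessing import OneHotEncoder

mainhue = np.array([["red"], ["green"], ["blue"], ["red"]])  # shape (n_samples, 1)

encoder = OneHotEncoder()                        # returns a sparse matrix by default
encoded = encoder.fit_transform(mainhue).toarray()

print(encoder.categories_)  # categories found, in sorted order
print(encoded)              # one binary indicator column per category
```

For the real assignment, the toy array would be replaced by the corresponding columns of the flags dataset.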

  3. (25 points) Handling missing values:
    In this part, you need to consider only the area attribute (#4):
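One common approach is mean imputation with scikit-learn's SimpleImputer; a hedged sketch on toy stand-in values (np.nan marks the missing entries) follows:

```python
# Hedged sketch: mean-imputation of missing values with SimpleImputer.
# The numbers below are toy stand-ins for the area attribute, not real data.
import numpy as np
from sklearn.impute import SimpleImputer

area = np.array([[648.0], [29.0], [np.nan], [2388.0], [np.nan]])

imputer = SimpleImputer(strategy="mean")  # other strategies: "median", "most_frequent"
filled = imputer.fit_transform(area)

print(filled)  # NaNs replaced by the mean of the observed values
```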

  4. (50 points) Standardization, scaling and normalization of continuous attributes:
    In this part, you need to work only with the area attribute (#4):
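A hedged sketch of two of these rescalings on toy stand-in values (note that scikit-learn's Normalizer rescales rows across features, so it is less natural for a single column):

```python
# Hedged sketch: standardization and min-max scaling of a continuous column.
# Toy values stand in for the area attribute.
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

area = np.array([[648.0], [29.0], [2388.0], [22.0]])

standardized = StandardScaler().fit_transform(area)  # zero mean, unit variance
scaled = MinMaxScaler().fit_transform(area)          # mapped into [0, 1]

print(standardized.ravel())
print(scaled.ravel())
```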

  5. (25 points) Discretization:
    In this part, you need to work only with the population attribute (#5):
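One possible tool is KBinsDiscretizer; a hedged sketch on toy stand-in values (loosely modeled on populations in millions) could look like:

```python
# Hedged sketch: binning a continuous column with KBinsDiscretizer.
# Toy values stand in for the population attribute.
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

population = np.array([[18.0], [400.0], [20.0], [0.0], [1008.0]])

# 3 equal-width bins; encode="ordinal" labels them 0, 1, 2
disc = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform")
bins = disc.fit_transform(population)

print(disc.bin_edges_)  # the cut points chosen for the column
print(bins.ravel())     # the bin label assigned to each value
```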

  6. (10 points) Custom transformation:
    In this part, you need to work only with the area attribute (#4):
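Custom transformations are typically built with FunctionTransformer. A hedged sketch on toy values, using a log transform as one plausible choice for a right-skewed attribute like area:

```python
# Hedged sketch: a custom transformation via FunctionTransformer.
# log1p is illustrative; the toy values stand in for the area attribute.
import numpy as np
from sklearn.preprocessing import FunctionTransformer

area = np.array([[648.0], [29.0], [2388.0], [22.0]])

log_transformer = FunctionTransformer(np.log1p, validate=True)
transformed = log_transformer.fit_transform(area)

print(transformed.ravel())  # compressed scale: large areas dominate less
```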

Problem III. Data Preprocessing: Dimensionality Reduction (130 points)

  1. (20 points) Correlation and Covariance Analysis:
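These analyses can be computed directly with NumPy; a hedged sketch on toy stand-ins for two continuous attributes (e.g., area and population) follows:

```python
# Hedged sketch: covariance and Pearson correlation between two columns.
# The toy arrays stand in for continuous flag attributes.
import numpy as np

area = np.array([648.0, 29.0, 2388.0, 22.0, 1247.0])
population = np.array([16.0, 3.0, 20.0, 0.0, 7.0])

cov = np.cov(area, population)         # 2x2 covariance matrix
corr = np.corrcoef(area, population)   # 2x2 correlation matrix, entries in [-1, 1]

print(cov)
print(corr)
```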

  2. (15 points) Data Sampling:
    Assume that you want to reduce the number of data instances by keeping just 60% of the data instances.
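One way to draw such a sample is simple random sampling without replacement, e.g. via scikit-learn's train_test_split; a hedged sketch on a toy array:

```python
# Hedged sketch: keep a reproducible 60% random sample of the rows.
# The array below is a toy stand-in for the flags data instances.
import numpy as np
from sklearn.model_selection import train_test_split

data = np.arange(200).reshape(100, 2)  # 100 toy instances, 2 attributes

# train_size=0.6 keeps 60% of the rows; random_state fixes the draw
sample, _ = train_test_split(data, train_size=0.6, random_state=0)

print(sample.shape)
```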

  3. (60 points) Feature Selection:
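As one illustrative technique, univariate selection with SelectKBest; the feature matrix and target below are toy stand-ins (a nominal attribute such as religion could play the role of the target in the flags data):

```python
# Hedged sketch: univariate feature selection with SelectKBest.
# chi2 assumes non-negative features and a discrete target; X and y are toys.
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

X = np.array([[0, 1, 5], [1, 1, 7], [0, 2, 1], [1, 2, 2]])
y = np.array([0, 0, 1, 1])

selector = SelectKBest(score_func=chi2, k=2)  # keep the 2 best-scoring features
X_reduced = selector.fit_transform(X, y)

print(selector.scores_)                    # chi2 score per original feature
print(selector.get_support(indices=True))  # indices of the kept features
```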

  4. (35 points) Feature Extraction:
    In this part, you will experiment with Principal Components Analysis (PCA). Also read scikit-learn's "Decomposing signals in components (matrix factorization problems)" section (reading the whole Section 2.5 is recommended, though not required).
    • (15 points) First, use default parameter values for the PCA function (use svd_solver='auto' and leave n_components unset). Include in your report how many principal components were obtained, how much of the variance each of them explains (including the cumulative variance explained), and the singular (eigen) value of each component. Also, include in your report the linear combinations that define the first three new attributes (= components) obtained. Look at the results and elaborate on any interesting observations you can make about the results of the PCA function.
    • (10 points) Now, assume that you need to reduce the number of dimensions as much as possible. Looking at the explained_variance_, explained_variance_ratio_, singular_values_, and n_components_ values (if needed, rerun PCA with n_components set to 'mle'), determine a good number of components (= dimensions) to keep. Include this number in your written report and justify why you chose it.
    • (10 points) Apply the PCA function again using n_components equal to the number of components you chose and copy equal to True (the default value of this parameter). Include in your written report the 4 first data instances (rows) in the transformed dataset.
    • Include all of your Python code in your .py file.
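The first PCA step above can be sketched as follows; random toy data stands in for the numeric flag attributes, so the printed numbers are illustrative only:

```python
# Hedged sketch of the default-parameter PCA run described above.
# Random toy data replaces the actual flags attributes.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.rand(20, 5)  # 20 toy instances, 5 numeric attributes

pca = PCA(svd_solver="auto")  # n_components left unset: keep all components
X_new = pca.fit_transform(X)

print(pca.n_components_)                         # number of components obtained
print(pca.explained_variance_ratio_)             # variance fraction per component
print(np.cumsum(pca.explained_variance_ratio_))  # cumulative variance explained
print(pca.singular_values_)                      # singular value per component
print(pca.components_[:3])                       # linear combinations defining the first 3 components
```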