CS 548 Fall 2018 - Project 3

Computer Science Department

CS 548 KNOWLEDGE DISCOVERY AND DATA MINING - Fall 2018
Project 3: Artificial Neural Networks and Deep Learning

By Michael Sokolovsky, Ahmedul Kabir and Prof. Carolina Ruiz

DUE DATE: November 1st, 2018.

Slides: Submit via Canvas by 2:00 pm.
Written report: Hand in a hardcopy by the beginning of class (by 3:59 pm).

Project Assignment:

Study Section 4.7 Artificial Neural Networks and Section 4.8 Deep Learning of the textbook in great detail.
Study all the materials posted on the course Lecture Notes, especially those marked with "**":
- Artificial Neural Networks
- Deep Learning
Work in groups of 3 students.
Project Description and Dataset
- Primary Goal:
  
  (Image taken from https://corochann.com/)
  In this project, you will build a neural network model for categorizing images. You will write a program that takes images like the handwritten numbers above and outputs the number that each input image represents.
- Dataset:
  MNIST is a dataset of handwritten digits and their labels. It is a well-known benchmark that has been widely used to test the performance of new machine learning algorithms. Each MNIST image is a 28x28 grayscale image. Data is provided as 28x28 matrices containing numbers ranging from 0 (corresponding to white pixels) to 255 (corresponding to black pixels). Labels for each image are also provided as integer values ranging from 0 to 9, corresponding to the actual digit in the image. There are 6500 images in our version of the dataset and 6500 corresponding labels.
- Tasks:
  In this project, you will experiment with training a neural network model to identify which digit is represented by an MNIST image. Your task is to build a digit classifier using the artificial neural network package Keras, implemented in Python3. The input to your model will be an image, and the output will be a classification of the digit, from 0 to 9. You will get a chance to work with a common neural network package used in state-of-the-art research. In addition, you will practice using the numerical computation package NumPy for preprocessing.
  
  Your goal is to create a model that identifies the digits represented by images in MNIST with as high an accuracy as you can achieve, and at least 70%, and to describe and analyze your results in a written report.
Project Requirements
You will turn in a written report, code for training and visualization, a trained model, and presentation slides. Details of what to include are listed below:
1. Written Report:
  Due time: Hand in by 3:59 pm right before the beginning of class.
  - Set of Experiments Performed: Page limit: 1.5 pages
    Include a section describing the set of experiments that you performed, what structures you experimented with (i.e., number of layers, number of neurons in each layer), what parameters you varied (e.g., number of training epochs, batch size, weight initialization scheme, activation function, and any other parameter values), and what accuracies you obtained in each of these experiments.
  - Procedure Description: Page limit: 1 page
    Include a section describing in more detail the most accurate model you were able to obtain: the structure of your model, including number of layers, number of neurons in each layer, weight initialization scheme, activation function, number of epochs used for training, and batch size used for training.
  - Plot: Page limit: 0.5 pages
    Include a plot showing how training accuracy and validation accuracy change over time during the training of your best model. That is, the horizontal axis of your plot should be the number of training epochs and the vertical axis should be training and validation accuracy.
  - Model Performance and Confusion Matrix: Page limit: 1 page
    Include a confusion matrix showing results of your best model reported on the test set. The matrix should be a 10x10 grid showing which categories images were classified as. Use your confusion matrix to additionally report precision and recall for each of the 10 classes, as well as overall accuracy of your model.
  - Visualization: Page limit: 1 page
    Include visualizations of three images that were misclassified by your best model, along with your observations about why you think these images were misclassified. You will have to create or use a visualization program that takes a 28x28 matrix as input and translates it into a black-and-white image.
  - Advanced Topic: Page limit: 1 page
    Include the description of your advanced topic (see the instructions under "Advanced Topic(s)" below). It should contain 3 parts:
    - List of sources/books/papers used for this topic (include URLs if available).
    - In your own words, provide an in-depth, yet concise, description of your chosen topic. Make sure to cover all relevant data mining aspects of your topic. Your description here should be in-depth and at the graduate level.
    - How does this topic relate to neural networks, deep learning, and the other material covered in this course?
2. Code:
  Due time: Submit your code files on Canvas by 2:00 pm.
  - Model Code:
    Turn in your preprocessing, model creation, model training, plotting and confusion matrix code.
3. Model:
  Due time: Submit your trained model file on Canvas by 2:00 pm.
  - Copy of Trained Model:
    Turn in a copy of your best model saved as "trained_model.proj3". Please use the following Keras methods for saving your model.
4. Slides:
  Due time: Submit your project slides on Canvas by 2:00 pm.
  Turn in slides summarizing your work on the project and what you learned. Each team will have 4 minutes to present. Make sure to cover your Advanced Topic during your presentation. Make sure that each team member has an equal chance to present.
Project Preparatory Tasks and Guidelines:
Below are important guidelines to follow when implementing the project. A model template is provided for you on this project webpage, and these guidelines follow the structure of the template.
1. Installing Software and Dependencies:
  template.py is written with the Keras API in a Python3 script. You will use this template to build and train a model. To do so, you will need to implement the project in Python3 and install Keras and its dependencies. Please make sure you have a working version of Python3 and Keras as soon as possible, as these programs are necessary for completing the project.
2. Downloading Data:
  Raw data is provided here:
  - Images are provided for you in the images.npy file, which contains 6500 images from the MNIST dataset.
  - The file labels.npy contains the 6500 corresponding labels for the image data.
3. Preprocessing Data:
  All data is provided as NumPy .npy files. To load and preprocess data, use Python's NumPy package.
  Image data is provided as 28x28 matrices of integer pixel values. However, the input to the network will be a flat vector of length 28*28 = 784. You will have to flatten each matrix to be a vector, as illustrated by the toy example below:
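  A toy version of this flattening step is sketched here with NumPy. It is a minimal sketch on made-up arrays; it assumes that the array loaded from images.npy has shape (N, 28, 28), which is inferred from the description above rather than confirmed.

```python
import numpy as np

# Tiny illustration: a 2x2 "image" flattened row by row into a vector.
tiny = np.array([[1, 2],
                 [3, 4]])
print(tiny.reshape(-1))  # [1 2 3 4]

# Made-up stand-in for the real data: 3 images of 28x28 integer pixels.
# With the real data you would start from np.load('images.npy') instead
# (assumed shape: (6500, 28, 28)).
images = np.arange(3 * 28 * 28).reshape(3, 28, 28) % 256

# Flatten every 28x28 matrix into a vector of length 28*28 = 784.
flat = images.reshape(images.shape[0], 28 * 28)
print(flat.shape)  # (3, 784)
```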
  
  The label for each image is provided as an integer in the range of 0 to 9. However, the output of the network should be structured as a "one-hot vector" of length 10 encoded as follows:
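  For example, the label 3 becomes a vector with a 1 in position 3 and 0 everywhere else. The sketch below builds this encoding with plain NumPy on made-up labels. Keras' to_categorical is expected to produce the same encoding (as floating-point values), so this is only meant to show what the result looks like.

```python
import numpy as np

# Made-up labels standing in for the contents of labels.npy.
labels = np.array([3, 0, 9])

# Row i of the 10x10 identity matrix is the one-hot vector for digit i,
# so indexing the identity matrix by the labels encodes all of them at once.
one_hot = np.eye(10, dtype=int)[labels]

print(one_hot[0])  # [0 0 0 1 0 0 0 0 0 0]
print(one_hot.shape)  # (3, 10)
```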
  
  To preprocess data, use NumPy functions like reshape for changing matrices into vectors. You can also use Keras' to_categorical function for converting label numbers into one-hot encodings.
  
  After preprocessing, you will need to take your data and randomly split it into Training, Validation, and Test Sets. In order to create the three sets of data, use stratified sampling, so that each set maintains the same relative frequency of the ten classes.
  
  You are given 6500 images and labels. The training set should contain ~60% of the data, the validation set should contain ~15% of the data, and the test set should contain ~25% of the data.
  Example Stratified Sampling Procedure:
  - Take data and separate it into 10 classes, one for each digit
  - From each class:
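  One possible way to carry out a stratified split with NumPy is sketched below. The function name stratified_split, the rounding rule, and the random seed are illustrative choices and are not part of the template; toy labels stand in for labels.npy. The function returns index arrays, which can then be used to select rows from both the flattened images and the one-hot labels.

```python
import numpy as np

def stratified_split(labels, fractions=(0.60, 0.15, 0.25), seed=0):
    """Return (train, val, test) index arrays that keep class proportions.

    Hypothetical helper: labels is a 1-D array of integer class labels.
    """
    rng = np.random.default_rng(seed)
    train, val, test = [], [], []
    for digit in np.unique(labels):
        # Indices of all samples of this class, in random order.
        idx = np.flatnonzero(labels == digit)
        rng.shuffle(idx)
        n_train = int(round(fractions[0] * len(idx)))
        n_val = int(round(fractions[1] * len(idx)))
        train.append(idx[:n_train])
        val.append(idx[n_train:n_train + n_val])
        test.append(idx[n_train + n_val:])  # whatever remains (~25%)
    # Merge the per-class pieces and shuffle so classes are interleaved.
    return tuple(rng.permutation(np.concatenate(part))
                 for part in (train, val, test))

# Toy labels: 100 samples of each of the 10 digits.
labels = np.repeat(np.arange(10), 100)
train_idx, val_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(val_idx), len(test_idx))  # 600 150 250
```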
4. Building a Model:
  
  In Keras, models are instances of the class Sequential. A Keras model template, template.py, written with the Sequential model API, is provided and can be used as a starting point for building your model. The template includes a sample first input layer and an output layer. You must limit yourself to "Dense" layers, which are Keras' version of traditional neural network layers. This portion of the project will involve experimentation.
  Good guidelines for model creation are:
  - Initialize weights randomly for every layer, and try different initialization schemes.
  - Experiment with ReLU activation units, as well as SELU and tanh.
  - Experiment with number of layers and number of neurons in each layer, including the first layer.
  Leave the final layer as it appears in the template with a softmax activation unit.
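  To make concrete what a stack of Dense layers ending in a softmax computes, here is a minimal forward pass written in plain NumPy rather than Keras, so it is not a substitute for template.py. The layer sizes (784 -> 32 -> 10) and the scaled-normal random initialization are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(z, 0.0)

def softmax(z):
    # Subtract the row-wise maximum for numerical stability.
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def dense(x, w, b, activation):
    # A dense (fully connected) layer: affine map followed by an activation.
    return activation(x @ w + b)

# Illustrative sizes: 784 inputs -> 32 hidden units -> 10 output classes.
# Weights are drawn at random; the scale factors are one common choice.
w1 = rng.normal(0.0, np.sqrt(2.0 / 784), size=(784, 32))
b1 = np.zeros(32)
w2 = rng.normal(0.0, np.sqrt(1.0 / 32), size=(32, 10))
b2 = np.zeros(10)

x = rng.random((5, 784))            # 5 made-up flattened images
hidden = dense(x, w1, b1, relu)
probs = dense(hidden, w2, b2, softmax)

print(probs.shape)  # (5, 10)
```

  Each row of probs is non-negative and sums to 1, which is why the softmax output can be compared directly against a one-hot label.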
5. Compiling a Model:
  
  Prior to training a model, you must specify the model's loss function and gradient descent method. Please use the standard categorical cross-entropy loss and stochastic gradient descent ('sgd') when compiling your model (as provided in the template).
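  As a rough illustration of what these two choices mean, the NumPy sketch below computes the categorical cross-entropy of a single softmax layer on made-up data and applies one gradient descent step. It is not Keras code; the learning rate, the data, and the single-layer model are assumptions made for the example.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def categorical_cross_entropy(y_true, y_prob, eps=1e-12):
    # Mean over samples of -sum_k y_k * log(p_k); eps guards against log(0).
    return float(-np.mean(np.sum(y_true * np.log(y_prob + eps), axis=1)))

rng = np.random.default_rng(0)
x = rng.random((8, 784))                       # 8 made-up flattened images
y = np.eye(10)[rng.integers(0, 10, size=8)]    # made-up one-hot labels

w = np.zeros((784, 10))
b = np.zeros(10)
loss_before = categorical_cross_entropy(y, softmax(x @ w + b))

# For softmax + cross-entropy, the gradient of the mean loss with respect
# to the pre-activation values is (p - y) / N.
p = softmax(x @ w + b)
grad_logits = (p - y) / len(x)

learning_rate = 0.005                          # illustrative value
w -= learning_rate * (x.T @ grad_logits)       # gradient descent update
b -= learning_rate * grad_logits.sum(axis=0)

loss_after = categorical_cross_entropy(y, softmax(x @ w + b))
print(loss_before, loss_after)
```

  With all-zero weights every class gets probability 0.1, so loss_before equals ln(10), about 2.30; the update moves the loss below that value.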
6. Training a Model:
  
  You have the option of changing how many epochs to train your model for and how large your mini-batch size is. Experiment to see what works best. Also remember to include your validation data in the fit() method.
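  Keras' fit() handles batching internally, so you do not need to write this yourself. The sketch below only illustrates how the two parameters interact: one epoch is one pass over the training set, and each mini-batch triggers one weight update. The batch size of 32 and the 3 epochs are hypothetical values, and 3900 is roughly 60% of the 6500 images.

```python
import numpy as np

def iterate_minibatches(n_samples, batch_size, rng):
    """Yield index arrays that together cover all samples once (one epoch)."""
    order = rng.permutation(n_samples)
    for start in range(0, n_samples, batch_size):
        yield order[start:start + batch_size]

rng = np.random.default_rng(0)
n_train = 3900      # about 60% of 6500
batch_size = 32     # hypothetical
epochs = 3          # hypothetical

updates = 0
for epoch in range(epochs):
    for batch_idx in iterate_minibatches(n_train, batch_size, rng):
        # A real training loop would compute the loss on this batch
        # and update the weights here.
        updates += 1

print(updates)  # 366
```

  Smaller batches give more (noisier) updates per epoch; larger batches give fewer, smoother ones.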
7. Reporting your Results:
  
  fit() returns data about your training experiment. In template.py this is stored in the "history" variable. Use this information to construct your plot showing how validation and training accuracy change after every epoch of training.
  
  Use the predict() method of the model to obtain the labels your model predicts on the test set. Use these and the true labels to construct your confusion matrix, like the toy example below, although you do not need to create a fancy visualization of the confusion matrix. Your matrix should have 10 rows and 10 columns.
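  A toy confusion matrix, with precision, recall, and accuracy derived from it, can be computed with NumPy as sketched below. The example uses 3 classes and made-up labels to stay readable. It assumes rows are true classes and columns are predicted classes (a convention chosen here, not one given in the template), and that you have already turned each row of predict() output into a label, for instance by taking the index of the largest value.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes=10):
    """cm[i, j] counts samples whose true class is i and predicted class is j."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(cm, (y_true, y_pred), 1)
    return cm

# Made-up integer labels for a 3-class toy problem.
y_true = np.array([0, 0, 1, 1, 2, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 2, 0])

cm = confusion_matrix(y_true, y_pred, n_classes=3)
print(cm)
# [[1 1 0]
#  [0 2 0]
#  [1 0 2]]

true_positives = np.diag(cm)
# Precision: correct predictions of a class / all predictions of that class.
precision = true_positives / np.maximum(cm.sum(axis=0), 1)
# Recall: correct predictions of a class / all true members of that class.
recall = true_positives / np.maximum(cm.sum(axis=1), 1)
accuracy = true_positives.sum() / cm.sum()

print(precision)
print(recall)
print(accuracy)
```

  For the project, calling the same function with n_classes=10 gives the required 10x10 matrix. The np.maximum(..., 1) guard only avoids division by zero for a class that is never predicted or never present.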
Advanced Topic(s):
Investigate in depth (experimentally, theoretically, or both) a topic of your choice that is related to deep learning and that was not covered already in this project, class lectures, or the textbook. This deep learning related topic might be something that was described or mentioned briefly in the textbook or in class; comes from your own research; is related to your interests; is an idea from a research paper that you find intriguing; or any other deep learning related topic.
Remember that you need to investigate your advanced topic in depth, at a "graduate level".
Grading Rubric
1. Report:
  - Set of Experiments Performed: 20 pts
  - Model and Training Procedure Description: 10 pts
  - Plot: 10 pts
  - Model Performance and Confusion Matrix: 10 pts
  - Visualization: 10 pts
2. Code:
3. Model:
4. Advanced Topic:
Total Points: 120 pts