WPI Worcester Polytechnic Institute

Computer Science Department
------------------------------------------

CS4341 Introduction to Artificial Intelligence 
SOLUTIONS Project 4 - C Term 2002

By Wendy Kogel, Greg Milette, Yakov Kronrod, and Carolina Ruiz 
------------------------------------------

  1. Neural Networks
    Sketch of the solution:

    [The solution sketches for questions 4, 7, 8, 11, 13, and the Report
    Questions were provided as figures and are not reproduced here.]


  2. Neural Networks + Genetic Algorithms
    1. The neural network contains 13 connections/links:
      • 4*2 links connecting the 4 inputs to the 2 hidden nodes
      • plus 2*1 links connecting the 2 hidden nodes to the output node
      • plus 3 links, one for each of the hidden & output nodes, representing the node's activation threshold.

      Each weight can be any real number. To represent a real number using just bits, we'll use the IEEE 754 floating-point standard (see e.g. A.S. Tanenbaum, Structured Computer Organization, 4th Edition, Prentice Hall, 2000). This standard defines three precisions: single, double, and extended. For simplicity, we employ single precision here, which uses 32 bits (1 for the sign, 8 for the exponent, and 23 for the fraction) to represent a real number:

      
      bits  1 2             9 10                                          32
            - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
           | |   exponent    |   fraction                                  |
            - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
            ^
            | sign
      
      Each individual will be represented as an array of 13 positions, one for each connection in the net. Each position in the array will be a single-precision float as described above. Hence, each individual can be seen as a bit string of length 13*32 = 416, where the first 32 bits represent the first weight, the next 32 bits represent the second weight, and so on.

      Note: The important point here is that your representation of an individual accommodates the 13 weights and allows each weight to be any real number. Whether or not you use double-precision floats is not important.
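
      As an illustration, here is a minimal Python sketch of this encoding, using the standard struct module to obtain the IEEE 754 single-precision bit pattern of each weight (the helper names encode/decode are our own):

        import random
        import struct

        NUM_WEIGHTS = 13        # 8 input-to-hidden + 2 hidden-to-output + 3 thresholds
        BITS_PER_WEIGHT = 32    # IEEE 754 single precision

        def encode(weights):
            """Pack 13 floats into a 416-character bit string."""
            bits = ""
            for w in weights:
                (as_int,) = struct.unpack(">I", struct.pack(">f", w))
                bits += format(as_int, "032b")
            return bits

        def decode(bits):
            """Unpack a 416-character bit string back into 13 floats."""
            weights = []
            for i in range(0, NUM_WEIGHTS * BITS_PER_WEIGHT, BITS_PER_WEIGHT):
                as_int = int(bits[i:i + BITS_PER_WEIGHT], 2)
                (w,) = struct.unpack(">f", struct.pack(">I", as_int))
                weights.append(w)
            return weights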

    2. We are given that the initial population consists of 100 individuals. The most important consideration is how well these individuals cover the space of possible individuals. Primarily, we want the individuals to be chosen at random. One way to do this is to run through the 100 bit strings and assign each bit a random boolean value (randomly choosing 1 or 0). Repeated individuals are also worth considering: we would like 100 distinct individuals, to ensure a variety of data and to avoid testing the neural net twice on an identical bit string. Another aspect to consider is the distribution of the individuals over the sample space. A somewhat uniform distribution would make it more likely that at least one of the individuals is 'close' to the optimal one. However, this is not strictly necessary, since through crossovers and mutations a wide variety of individuals will eventually be considered anyway.
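
      A minimal sketch of such an initialization, reusing the constants from the encoding sketch above; checking distinctness with a set is one simple way to avoid repeated individuals:

        def random_individual():
            """One individual: a uniformly random 416-bit string."""
            return "".join(random.choice("01")
                           for _ in range(NUM_WEIGHTS * BITS_PER_WEIGHT))

        def initial_population(size=100):
            """Generate `size` distinct random individuals."""
            population = set()
            while len(population) < size:
                population.add(random_individual())  # the set rejects duplicates
            return list(population)

      Note that a uniformly random 32-bit pattern can decode to NaN or infinity under IEEE 754; an alternative is to sample each weight from a bounded interval, say [-1, 1], and encode it.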

    3. The fitness of an individual must represent how close the output computed by the net is to the desired output over all the training examples. The higher the fitness of an individual, the better the corresponding neural net computes the target function. An initial thought is to use the inverse of the neural net's error function as the fitness of an individual. Recall that the error function measures how far a neural net is from computing the correct target answer. If we run every training instance through the net encoded by a given individual and add up the errors, we obtain a total error. Note that a high error is bad, while a high fitness is good; so we divide 1 by the error, and higher errors give lower fitness while lower errors give higher fitness. There is a problem with this approach, however: if the error happens to be 0 (a perfect individual), dividing 1 by it would lead to a division-by-zero error. A better approach is to use a function of the form:
         fitness = N - total error,  
      
      where (total error) is the sum of the errors made by the neural net on the training instances, and N is the total number of training instances. More precisely, given an individual I, let NN_I be the neural network obtained by assigning to each connection in the net the corresponding weight encoded in I. Hence

           fitness of I = N - total error of NN_I, where

           total error of NN_I = sum over all instances of |desired output - net's output|

      (A slightly modified approach is to make the total error the sum of the squares of the individual errors, i.e.

           total error of NN_I = 1/2 * sum over all instances of (desired output - net's output)^2
      )
      
      There is one underlying assumption (which obviously holds in our case): both the desired and the actual output are values between 0 and 1. This guarantees that the sum of the errors over the N instances is no larger than N. In this manner, the highest possible error is N, giving a fitness of 0 (the worst fitness), and the lowest error is 0 (perfect), which yields a fitness of N (the best fitness).
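
      A sketch of this fitness computation in Python; feed_forward(weights, inputs) is a hypothetical routine, not given here, that runs the network NN_I with the decoded weights and returns its output in [0, 1]:

        def fitness(individual, training_examples, feed_forward):
            """fitness = N - total error, where the total error sums
            |desired output - net's output| over all N training examples."""
            weights = decode(individual)
            total_error = sum(abs(desired - feed_forward(weights, inputs))
                              for inputs, desired in training_examples)
            return len(training_examples) - total_error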

    4. Mutation: We randomly flip one bit to mutate an individual. To improve efficiency, we could start by mutating high-order bits and then, as the network gets trained more and more, switch to lower-order bits. That way we start by changing the weights by large amounts and then gradually zero in on the correct weights. Another important point is that we do not want to mutate the bit corresponding to the sign of a weight: a simple mutation should not cause such a drastic change in the individual. Likewise, we do not want to flip the high-order bits of a weight's exponent, which would be even more extreme than changing the sign. For instance, flipping the exponent bit corresponding to 128, or 2^7, from 0 to 1 could move a value from roughly .423 * 2^3 to .423 * 2^131. This is probably not a good idea.
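
      A minimal sketch of such a guarded bit-flip mutation; treating the first 3 bits of each weight (the sign bit and the two highest exponent bits) as protected is our own choice of parameter:

        def mutate(individual, protected_bits=3):
            """Flip one random bit, skipping the sign bit and the highest
            exponent bits (positions 0..protected_bits-1) of the chosen weight."""
            weight_index = random.randrange(NUM_WEIGHTS)
            offset = random.randrange(protected_bits, BITS_PER_WEIGHT)
            pos = weight_index * BITS_PER_WEIGHT + offset
            flipped = "1" if individual[pos] == "0" else "0"
            return individual[:pos] + flipped + individual[pos + 1:]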

      Selection: The selection algorithm is quite simple: we select 100 individuals from the current population to be included in the next generation. We first define a probability distribution over the current population (i.e., we assign a probability value to each individual in such a way that the probabilities over all individuals sum to 1). Then we perform random selection with replacement using that distribution. We need a sensible way to assign each individual a probability corresponding to its chances of being selected. We choose here the standard fitness-proportionate selection method, but any other method described in Winston's book and covered in class (e.g. the rank method, diversity, etc.) would be fine. We already have the fitness function, which tells us how well an individual performed on the training data; so, if we divide an individual's fitness by the sum of the fitness values of all individuals in the current population, we obtain a selection probability between 0 and 1. This also directly guarantees that the probabilities add up to 1. Then we simply use this probability distribution to make our selection choices.
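
      A sketch of this fitness-proportionate selection using Python's random.choices, which samples with replacement according to the given weights:

        def select(population, fitnesses, n=100):
            """Sample n individuals with replacement, each with probability
            fitness / (sum of all fitnesses); assumes the sum is positive."""
            total = sum(fitnesses)
            probabilities = [f / total for f in fitnesses]
            return random.choices(population, weights=probabilities, k=n)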

      Cross-over operators: We must determine how to combine two individuals to obtain two new ones. Using single-point crossover, we split the individuals at a random point i: all array positions up to i come from the first individual, and the rest come from the second. A second offspring is generated by doing the opposite, taking the positions up to i from the second individual and the rest from the first. In our case, we must make sure that the crossover points lie only at multiples of 32 bits from the beginning of the bit string, so that they fall between individual weight representations and never in the middle of one. Any crossover operation then yields an individual that takes some of its weight encodings from one parent and the remaining ones from the other parent.
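
      A sketch of single-point crossover restricted to these weight boundaries:

        def crossover(parent1, parent2):
            """Split both parents at a random 32-bit weight boundary
            and swap the tails, producing two offspring."""
            point = random.randrange(1, NUM_WEIGHTS) * BITS_PER_WEIGHT
            child1 = parent1[:point] + parent2[point:]
            child2 = parent2[:point] + parent1[point:]
            return child1, child2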

    5. First, before starting the algorithm, you must decide on its termination condition. There are several choices here: a maximum number of iterations, a 'good-enough' individual, or something else, such as the first time k mutations and crossovers in a row give no improvement in the performance of the NN. For the following pseudocode, let:
       a. Fitness_threshold represent the fitness you want to achieve
       b. MAX represent the maximum number of iterations
    
        Generate the initial population of 100 individuals P = {I1, I2, ..., I100}
        Evaluate each I in P and compute Fitness(I)
        BestInd = the individual with highest fitness
        Iterations = 0
        While ( Fitness(BestInd) < Fitness_threshold  &&  Iterations < MAX )
           1. Select 50 pairs of individuals for crossover
           2. For each pair, produce two offspring and replace the pair in P
           3. Mutate some percentage of the population
           4. Evaluate the fitness of each new member of P
           5. BestCurrent = the fittest individual in the new P
           6. If ( Fitness(BestCurrent) > Fitness(BestInd) )
              then BestInd = BestCurrent
           7. Iterations++
    
    The termination condition is determined by your criteria; most likely the loop ends when the fitness of the best individual reaches the threshold or when the maximum number of iterations is exceeded.
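
      Tying the sketches above together, a condensed Python version of this loop; the 5% mutation rate is an arbitrary choice:

        def genetic_algorithm(training_examples, feed_forward,
                              fitness_threshold, max_iterations):
            population = initial_population(100)
            scores = [fitness(ind, training_examples, feed_forward)
                      for ind in population]
            best = max(zip(scores, population))   # (fitness, individual) pair
            iterations = 0
            while best[0] < fitness_threshold and iterations < max_iterations:
                parents = select(population, scores, n=100)
                population = []
                for i in range(0, 100, 2):        # 50 pairs -> 100 offspring
                    population.extend(crossover(parents[i], parents[i + 1]))
                population = [mutate(ind) if random.random() < 0.05 else ind
                              for ind in population]
                scores = [fitness(ind, training_examples, feed_forward)
                          for ind in population]
                best = max(best, max(zip(scores, population)))
                iterations += 1
            return best[1]                        # the fittest individual found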

    6. The output of the algorithm will be the best individual produced, namely BestInd in the algorithm above.


  3. Decision Trees
    This is the resulting tree:
    tear-prod-rate = reduced: none
    tear-prod-rate = normal
    |  astigmatism = no
    |  |  age = young: soft
    |  |  age = pre-presbyopic: soft
    |  |  age = presbyopic
    |  |  |  spectacle-prescrip = myope: none
    |  |  |  spectacle-prescrip = hypermetrope: soft
    |  astigmatism = yes
    |  |  spectacle-prescrip = myope: hard
    |  |  spectacle-prescrip = hypermetrope
    |  |  |  age = young: hard
    |  |  |  age = pre-presbyopic: none
    |  |  |  age = presbyopic: none
    
    
    Below is the procedure followed to construct the tree:
    
    

    Looking for the root node (24 tuples). In each table below, an entry
    c/n -> t under a class means that c of the n tuples with that attribute
    value belong to the class, contributing the entropy term
    t = -(c/n) log2(c/n); 'weighted' is (n/N) times the sum of the three
    terms; and the expected information of an attribute is the sum of its
    weighted entries. The attribute with the smallest expected information
    (i.e., the largest information gain) is selected.

    age                        none         soft         hard         weighted
      young           8/24    4/8 -> .5    2/8 -> .5    2/8 -> .5      .5
      pre-presbyopic  8/24    5/8 -> .42   2/8 -> .5    1/8 -> .375    .43
      presbyopic      8/24    6/8 -> .31   1/8 -> .375  1/8 -> .375    .35
                                             expected information   = 1.28

    spectacle-prescription     none         soft         hard         weighted
      hypermetrope    12/24   8/12 -> .39  3/12 -> .5   1/12 -> .3     .59
      myope           12/24   7/12 -> .45  2/12 -> .43  3/12 -> .5     .69
                                             expected information   = 1.28

    astigmatism                none         soft         hard         weighted
      no              12/24   7/12 -> .45  5/12 -> .53  0/12 -> 0      .49
      yes             12/24   8/12 -> .39  0/12 -> 0    4/12 -> .53    .46
                                             expected information   =  .95

    tear-prod-rate             none         soft         hard         weighted
      normal          12/24   3/12 -> .5   5/12 -> .53  4/12 -> .53    .78
      reduced         12/24  12/12 -> 0    0/12 -> 0    0/12 -> 0      0
                                             expected information   =  .78

    Use tear-prod-rate as the root.

    Tear-prod-rate = reduced: 12 tuples, all none (a leaf).

    Deciding on the next attribute for tear-prod-rate = normal (12 tuples
    left):

    age                        none         soft         hard         weighted
      young           4/12    0/4 -> 0     2/4 -> .5    2/4 -> .5      .33
      pre-presbyopic  4/12    1/4 -> .5    2/4 -> .5    1/4 -> .5      .5
      presbyopic      4/12    2/4 -> .5    1/4 -> .5    1/4 -> .5      .5
                                             expected information   = 1.33

    spectacle-prescription     none         soft         hard         weighted
      hypermetrope    6/12    2/6 -> .53   3/6 -> .5    1/6 -> .43     .73
      myope           6/12    1/6 -> .43   2/6 -> .53   3/6 -> .5      .73
                                             expected information   = 1.46

    astigmatism                none         soft         hard         weighted
      no              6/12    1/6 -> .43   5/6 -> .22   0/6 -> 0       .33
      yes             6/12    2/6 -> .53   0/6 -> 0     4/6 -> .39     .46
                                             expected information   =  .78

    Use astigmatism.

    Astigmatism = no (6 tuples left):

    age                        none         soft         hard         weighted
      young           2/6     0/2 -> 0     2/2 -> 0     0/2 -> 0       0
      pre-presbyopic  2/6     0/2 -> 0     2/2 -> 0     0/2 -> 0       0
      presbyopic      2/6     1/2 -> .5    1/2 -> .5    0/2 -> 0       .33
                                             expected information   =  .33

    spectacle-prescription     none         soft         hard         weighted
      hypermetrope    3/6     0/3 -> 0     3/3 -> 0     0/3 -> 0       0
      myope           3/6     1/3 -> .53   2/3 -> .39   0/3 -> 0       .46
                                             expected information   =  .46

    Use age.

    Age = young: 2 tuples, all soft. Age = pre-presbyopic: 2 tuples, all
    soft. Age = presbyopic: 2 tuples (1 none, 1 soft); spectacle-prescription
    is the only attribute left, so it is used to split this branch,
    obtaining: spectacle-prescription = hypermetrope: 1 tuple, soft;
    spectacle-prescription = myope: 1 tuple, none.

    Backtracking up to astigmatism = yes (6 tuples left):

    age                        none         soft         hard         weighted
      young           2/6     0/2 -> 0     0/2 -> 0     2/2 -> 0       0
      pre-presbyopic  2/6     1/2 -> .5    0/2 -> 0     1/2 -> .5      .33
      presbyopic      2/6     1/2 -> .5    0/2 -> 0     1/2 -> .5      .33
                                             expected information   =  .66

    spectacle-prescription     none         soft         hard         weighted
      hypermetrope    3/6     2/3 -> .39   0/3 -> 0     1/3 -> .53     .46
      myope           3/6     0/3 -> 0     0/3 -> 0     3/3 -> 0       0
                                             expected information   =  .46

    Use spectacle-prescription.

    Spectacle-prescription = myope: 3 tuples, all hard.
    Spectacle-prescription = hypermetrope: 3 tuples (2 none, 1 hard); age is
    the only attribute left, so it is used to split this branch, obtaining:
    age = young: 1 tuple, hard; age = pre-presbyopic: 1 tuple, none;
    age = presbyopic: 1 tuple, none.
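
    For reference, a small Python sketch of the expected-information
    computation used in these tables; the dataset encoding (a list of dicts
    with a 'class' key) is our own assumption:

      from collections import Counter
      from math import log2

      def expected_information(examples, attribute):
          """Expected information of splitting `examples` on `attribute`;
          the attribute with the smallest result is chosen for the split."""
          n = len(examples)
          result = 0.0
          for value in set(e[attribute] for e in examples):
              subset = [e for e in examples if e[attribute] == value]
              counts = Counter(e["class"] for e in subset)
              entropy = -sum((c / len(subset)) * log2(c / len(subset))
                             for c in counts.values())
              result += len(subset) / n * entropy
          return result

    On the full 24-tuple dataset this yields 1.28 for age, 1.28 for
    spectacle-prescription, .95 for astigmatism, and .78 for tear-prod-rate,
    matching the root-level tables above.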