Genetic Algorithms
The goal of this project was to write a GA classifier for the census data.
Code Written
This project was developed in C. The following sections describe the major features of the program.
Gene Format
Genes in this system are variable length strings of rules. Each gene contains between one and ten rules. Each rule is an encoded representation of the allowed values for each of the following attributes:
Each of these attributes were chosen because they were shown to be significant in the Decision Tree experiment performed earlier this semester. The one exception to this is Hours Worked: this attribute was chosen to fill out the bit string to sixteen bits. If a field was optional (don't-care) the value was set to be all ones to allow instances to be compared using a logical AND operation. Discretization was modified somewhat from earlier experiments in order to fit into the allowed number of bits. Where attribute values needed to be combined, reasonable combinations were chosen by examining the Decision Tree results.
Gene Selection
The user is allowed to indicate the percentage of genes to be created by crossover during each generation. The survivors (non-crossed) are chosen using tournament selection where two genes are chosen at random and the fittest of the two survives. The remaining genes are formed by crossover from the original population (not the survivors). The crossover algorithm used for this project is the same as that used in GABIL and described in [Mitchell, 1997].
There is a user entered parameter for mutation but mutation is not used in this version of the project.
Fitness
The fitness of each gene is the number of training examples that can be matched by at least one rule.
Experiments Performed
Due to the amount of time it took to run each experiment, not as many could be run as desired. Experiments were run varying the number of genes in each population and varying the percentage of crossover. These experiments used a training set and test set of 5000 records each.
Results and Evaluation
Table 1 shows the results when the number of hypothesiss in each population are varied.
Table 1: Varying Population Size
|
Population Size |
Generations Needed |
Training Accuracy |
Test Accuracy |
|
5 |
100 |
42.5 |
41.2 |
|
10 |
100 |
82.4 |
82.1 |
|
20 |
54 |
97.1 |
97.3 |
|
30 |
100 |
57.7 |
56.4 |
|
40 |
36 |
96.3 |
96.1 |
|
50 |
68 |
97.4 |
97.5 |
These results are very good, and seem to increase with the population size allowed. The exception is the case where there are 30 genes. This experiment was repeated twice to verify the strange results. This may be a case where an optimal gene is not chosen because of the tournament selection and dies off.
Table 2 shows the results when the crossover percentage is varied. These experiments were all performed with a population size of 20.
Table 2: Varying Crossover Percentage
|
Crossover % |
Generations Needed |
Training Accuracy |
Test Accuracy |
|
40 |
54 |
97.1 |
97.3 |
|
50 |
26 |
97.4 |
97.1 |
|
60 |
64 |
95.5 |
95.0 |
|
70 |
35 |
95.4 |
95.6 |
The "winning" rules were printed out. They were not as expected. Figure 1 shows the winning hypothesis for the population size of 20 and crossover percentage of 40.
< Education = ??? M-STAT = Married-spouse-absent OCC-STAT = Protective-serv RELATIONSHIP = ??? HOURS WORKED = ??? SALARY = > 50K >
< Education = ??? M-STAT = Separated OCC-STAT = ??? RELATIONSHIP = ??? HOURS WORKED = ??? SALARY = > 50K >
< Education = PhD M-STAT = Never-married OCC-STAT = Sales RELATIONSHIP = Not-in-family HOURS WORKED = >= 40 SALARY = > 50K >
< Education = PhD M-STAT = Married-civ-spouse OCC-STAT = Sales RELATIONSHIP = Not-in-family HOURS WORKED = >= 40 SALARY = > 50K >
< Education = PhD M-STAT = Never-married OCC-STAT = Sales RELATIONSHIP = Not-in-family HOURS WORKED = ??? SALARY = > 50K >
< Education = PhD M-STAT = Married-civ-spouse OCC-STAT = Sales RELATIONSHIP = Not-in-family HOURS WORKED = >= 40 SALARY = > 50K >
< Education = Assoc Admin. M-STAT = Never-married OCC-STAT = Priv-house-serv RELATIONSHIP = Not-in-family HOURS WORKED = ??? SALARY = <= 50K >
< Education = No HS M-STAT = Married-civ-spouse OCC-STAT = Tech-support RELATIONSHIP = Wife HOURS WORKED = 0 SALARY = <= 50K >
< Education = No HS M-STAT = Married-civ-spouse OCC-STAT = Tech-support RELATIONSHIP = Wife HOURS WORKED = 0 SALARY = <= 50K >
< Education = No HS M-STAT = Married-civ-spouse OCC-STAT = Tech-support RELATIONSHIP = Wife HOURS WORKED = 0 SALARY = <= 50K >
Figure 1: Winning Hypothesis
There are several things that look suspicious about these rules. For one thing, there are several duplicates of each other, it would be best if these were eliminated. Also, they don't seem to cover the various attributes values very well (especially the preference for Tech-support and Sales). One potential problem is that the fitness function counts the hypothesis as a match if any part of it classifies the instance correctly. This will not account for conflicting rules that would classify the same instance as positive or negative.
A second method of fitness calculation was tried. In this method, rules were voted on so that a hypothesis would only be rated as a match if more of its clauses classified the data positively than negatively.
Tables 3 and 4 show the tests re-run with the new voting scheme. Unfortunately, the change to remove duplicate rules did not work and needs to be implemented.
Table 3: Varying Population Size
|
Population Size |
Generations Needed |
Training Accuracy |
Test Accuracy |
|
5 |
100 |
20.6 |
19.8 |
|
10 |
100 |
29.2 |
29.8 |
|
20 |
100 |
43.1 |
43.2 |
|
30 |
100 |
65.8 |
66.4 |
|
40 |
100 |
75.0 |
75.4 |
|
50 |
100 |
62.5 |
62.6 |
Table 4: Varying Crossover Percentage
|
Crossover % |
Generations Needed |
Training Accuracy |
Test Accuracy |
|
40 |
100 |
43.1 |
43.2 |
|
50 |
100 |
66 |
66.5 |
Figure 2 shows a hypothesis from the best results returned (population of 40).
< Education = Assoc Admin. M-STAT = Never-married OCC-STAT = Priv-house-serv RELATIONSHIP = Not-in-family HOURS WORKED = ??? SALARY = <= 50K >
< Education = Assoc Voc. M-STAT = Never-married OCC-STAT = Farming-fishing RELATIONSHIP = Not-in-family HOURS WORKED = ??? SALARY = <= 50K >
< Education = ??? M-STAT = ??? OCC-STAT = ??? RELATIONSHIP = ??? HOURS WORKED = ??? SALARY = <= 50K >
< Education = Assoc Voc. M-STAT = Never-married OCC-STAT = Farming-fishing RELATIONSHIP = ??? HOURS WORKED = ??? SALARY = <= 50K >
< Education = Assoc Voc. M-STAT = ??? OCC-STAT = Farming-fishing RELATIONSHIP = Other-relative HOURS WORKED = 0 SALARY = > 50K >
< Education = ??? M-STAT = Never-married OCC-STAT = Machine-op-inspct RELATIONSHIP = Not-in-family HOURS WORKED = ??? SALARY = <= 50K >
< Education = Assoc Voc. M-STAT = ??? OCC-STAT = ??? RELATIONSHIP = ??? HOURS WORKED = ??? SALARY = <= 50K >
< Education = No HS M-STAT = Married-civ-spouse OCC-STAT = Tech-support RELATIONSHIP = Wife HOURS WORKED = 0 SALARY = <= 50K >
< Education = No HS M-STAT = Married-civ-spouse OCC-STAT = Tech-support RELATIONSHIP = Wife HOURS WORKED = 0 SALARY = <= 50K >
< Education = No HS M-STAT = Married-civ-spouse OCC-STAT = Tech-support RELATIONSHIP = Wife HOURS WORKED = 0 SALARY = <= 50K >
Figure 2: Hypothesis with Voting and Duplicates
The last three rules were what comes up when the gene is equal to zero.
A third version of the system used the voting method but also eliminated any duplicate rules. Table 5 gives results for this system. For the generations needed column, the second number indicates the generation where the maximum fitness was reached.
Table 5: Varying Population Size, No Duplicates
|
Population Size |
Generations Needed |
Training Accuracy |
Test Accuracy |
|
5 |
100/51 |
43 |
43.4 |
|
10 |
100/41 |
62.2 |
62.2 |
|
20 |
100/43 |
75 |
75.4 |
|
30 |
100/81 |
75 |
75.4 |
|
40 |
100/58 |
75 |
75.4 |
This test ended up generating a hypothesis that classified all instances as < 50K. This corresponds to about 75% of the instances. Figure 3 shows the best hypothesis generated for the population of 20.
< Education = BS M-STAT = Divorced OCC-STAT = ??? RELATIONSHIP = Husband HOURS WORKED = ??? SALARY = <= 50K >
< Education = PhD M-STAT = Divorced OCC-STAT = Machine-op-inspct RELATIONSHIP = Husband HOURS WORKED = ??? SALARY = <= 50K >
< Education = PhD M-STAT = ??? OCC-STAT = Machine-op-inspct RELATIONSHIP = ??? HOURS WORKED = ??? SALARY = <= 50K >
< Education = ??? M-STAT = ??? OCC-STAT = ??? RELATIONSHIP = ??? HOURS WORKED = ??? SALARY = <= 50K >
< Education = ??? M-STAT = Divorced OCC-STAT = ??? RELATIONSHIP = ??? HOURS WORKED = ??? SALARY = <= 50K >
< Education = No HS M-STAT = Married-AF-spouse OCC-STAT = Tech-support RELATIONSHIP = Husband HOURS WORKED = 0 SALARY = > 50K >
< Education = ??? M-STAT = Married-civ-spouse OCC-STAT = ??? RELATIONSHIP = ??? HOURS WORKED = 0 SALARY = <= 50K >
< Education = No HS M-STAT = Married-civ-spouse OCC-STAT = Tech-support RELATIONSHIP = Wife HOURS WORKED = 0 SALARY = <= 50K >
Figure 3: Hypothesis for Voting with No Duplicates
The difficulty with this method is that because there is a hypothesis to automatically classify each instance as <= 50K, in order for an instance to be classified as >50K there must be two rules that do so. In order to fix this problem, the software was revised to break ties by choosing the classification that had the most specific rule. Table 6 gives the results from this version.
Table 6: Varying Population Size, Tiebreaker
|
Population Size |
Generations Needed |
Training Accuracy |
Test Accuracy |
|
5 |
100/51 |
43 |
43.4 |
|
10 |
100/41 |
62.2 |
62.2 |
|
20 |
100/43 |
75 |
75.4 |
|
30 |
100/81 |
75 |
75.4 |
|
40 |
100/58 |
75 |
75.4 |
These results ended up being virtually identical (except for some small differences in the 5 and 10 rule cases) to the version that did not use tie breakers.
The final modification to the code was to implement a version that did not use tournament selection but instead always chose the fittest genes. Table 7 gives the results for this version.
Table 7: Varying Population Size, Non-Tournament
|
Population Size |
Generations Needed |
Training Accuracy |
Test Accuracy |
|
5 |
100/30 |
45.6 |
45.1 |
|
10 |
100/64 |
75 |
75.4 |
|
20 |
100/31 |
32.9 |
32.7 |
|
30 |
100/51 |
62.5 |
62.6 |
|
40 |
100/6 |
27.7 |
27.7 |
For smaller populations, this algorithm worked slightly better than the tournament selection but for larger populations it was worse, in some cases much worse.
The non-tournament version was tested using 100 training and test instances so that the results could be compared to other methods. Table 8 gives the results.
Table 8: Comparison with Other Learning Methods
|
Method |
Results |
|
Instance Based Learning |
88% (k=5) |
|
Pruned Decision Tree (w/?s) |
65.6% |
|
Naïve Bayes Classifier |
78% |
|
GA, 20 Rules/Hypothesis |
75% |
The 100 instance set ended up giving as good an accuracy as the larger data set.
Strengths and Weaknesses
The major weakness is that the random number generators sometimes produce infinite loops in the code. Counters are used to detect them and to vary the values but there are still cases where they occur. For crossover, the cutpoints must have matching rule boundaries for the two hypothesiss, this is where the code sometimes gets stuck.
Another weakness is that mutation was not used. Adding mutation may have helped to achieve better accuracy. A parameter is read in for the mutation rate but it was not used.
User's Manual
The genetic algorithm program can be built as follows:
cc genetic.c testprint.c test_line.c train_line.c decode_rule.c decode_line.c gen_level.c valid.c crossover.c cross.c -o genetic
The program can be run as follows:
./genetic <threshold> <population> <generations> <crossover %> <mutation>
where:
<threshold>
- a number between 0 and 100 indicating the accuracy requested<population>
- the number of genes in each population<generations>
- the maximum number of generations<crossover>
- a number between 0 and 100 indicating the percentage of each new generation to be created by crossover<mutation>
- integer number indicating mutation rate -- NOT USED
References
Mitchell, Machine Learning, McGraw Hill, 1997