Genetic Algorithms

 

The goal of this project was to write a GA classifier for the census data.

 

Code Written

 

This project was developed in C. The following sections describe the major features of the program.

 

Gene Format

 

Genes in this system are variable length strings of rules. Each gene contains between one and ten rules. Each rule is an encoded representation of the allowed values for each of the following attributes:

 

 

Each of these attributes were chosen because they were shown to be significant in the Decision Tree experiment performed earlier this semester. The one exception to this is Hours Worked: this attribute was chosen to fill out the bit string to sixteen bits. If a field was optional (don't-care) the value was set to be all ones to allow instances to be compared using a logical AND operation. Discretization was modified somewhat from earlier experiments in order to fit into the allowed number of bits. Where attribute values needed to be combined, reasonable combinations were chosen by examining the Decision Tree results.

 

Gene Selection

 

The user is allowed to indicate the percentage of genes to be created by crossover during each generation. The survivors (non-crossed) are chosen using tournament selection where two genes are chosen at random and the fittest of the two survives. The remaining genes are formed by crossover from the original population (not the survivors). The crossover algorithm used for this project is the same as that used in GABIL and described in [Mitchell, 1997].

 

There is a user entered parameter for mutation but mutation is not used in this version of the project.

 

 

Fitness

 

The fitness of each gene is the number of training examples that can be matched by at least one rule.

 

Experiments Performed

 

Due to the amount of time it took to run each experiment, not as many could be run as desired. Experiments were run varying the number of genes in each population and varying the percentage of crossover. These experiments used a training set and test set of 5000 records each.

 

 

Results and Evaluation

 

Table 1 shows the results when the number of hypothesiss in each population are varied.

 

Table 1: Varying Population Size

 

Population Size

Generations Needed

Training Accuracy

Test Accuracy

5

100

42.5

41.2

10

100

82.4

82.1

20

54

97.1

97.3

30

100

57.7

56.4

40

36

96.3

96.1

50

68

97.4

97.5

 

These results are very good, and seem to increase with the population size allowed. The exception is the case where there are 30 genes. This experiment was repeated twice to verify the strange results. This may be a case where an optimal gene is not chosen because of the tournament selection and dies off.

 

Table 2 shows the results when the crossover percentage is varied. These experiments were all performed with a population size of 20.

 

Table 2: Varying Crossover Percentage

 

Crossover %

Generations Needed

Training Accuracy

Test Accuracy

40

54

97.1

97.3

50

26

97.4

97.1

60

64

95.5

95.0

70

35

95.4

95.6

 

The "winning" rules were printed out. They were not as expected. Figure 1 shows the winning hypothesis for the population size of 20 and crossover percentage of 40.

 

< Education = ??? M-STAT = Married-spouse-absent OCC-STAT = Protective-serv RELATIONSHIP = ??? HOURS WORKED = ??? SALARY = > 50K >

< Education = ??? M-STAT = Separated OCC-STAT = ??? RELATIONSHIP = ??? HOURS WORKED = ??? SALARY = > 50K >

< Education = PhD M-STAT = Never-married OCC-STAT = Sales RELATIONSHIP = Not-in-family HOURS WORKED = >= 40 SALARY = > 50K >

< Education = PhD M-STAT = Married-civ-spouse OCC-STAT = Sales RELATIONSHIP = Not-in-family HOURS WORKED = >= 40 SALARY = > 50K >

< Education = PhD M-STAT = Never-married OCC-STAT = Sales RELATIONSHIP = Not-in-family HOURS WORKED = ??? SALARY = > 50K >

< Education = PhD M-STAT = Married-civ-spouse OCC-STAT = Sales RELATIONSHIP = Not-in-family HOURS WORKED = >= 40 SALARY = > 50K >

< Education = Assoc Admin. M-STAT = Never-married OCC-STAT = Priv-house-serv RELATIONSHIP = Not-in-family HOURS WORKED = ??? SALARY = <= 50K >

< Education = No HS M-STAT = Married-civ-spouse OCC-STAT = Tech-support RELATIONSHIP = Wife HOURS WORKED = 0 SALARY = <= 50K >

< Education = No HS M-STAT = Married-civ-spouse OCC-STAT = Tech-support RELATIONSHIP = Wife HOURS WORKED = 0 SALARY = <= 50K >

< Education = No HS M-STAT = Married-civ-spouse OCC-STAT = Tech-support RELATIONSHIP = Wife HOURS WORKED = 0 SALARY = <= 50K >

 

Figure 1: Winning Hypothesis

 

There are several things that look suspicious about these rules. For one thing, there are several duplicates of each other, it would be best if these were eliminated. Also, they don't seem to cover the various attributes values very well (especially the preference for Tech-support and Sales). One potential problem is that the fitness function counts the hypothesis as a match if any part of it classifies the instance correctly. This will not account for conflicting rules that would classify the same instance as positive or negative.

 

A second method of fitness calculation was tried. In this method, rules were voted on so that a hypothesis would only be rated as a match if more of its clauses classified the data positively than negatively.

 

Tables 3 and 4 show the tests re-run with the new voting scheme. Unfortunately, the change to remove duplicate rules did not work and needs to be implemented.

 

Table 3: Varying Population Size

 

Population Size

Generations Needed

Training Accuracy

Test Accuracy

5

100

20.6

19.8

10

100

29.2

29.8

20

100

43.1

43.2

30

100

65.8

66.4

40

100

75.0

75.4

50

100

62.5

62.6

 

 

 

Table 4: Varying Crossover Percentage

 

Crossover %

Generations Needed

Training Accuracy

Test Accuracy

40

100

43.1

43.2

50

100

66

66.5

 

 

Figure 2 shows a hypothesis from the best results returned (population of 40).

 

 

< Education = Assoc Admin. M-STAT = Never-married OCC-STAT = Priv-house-serv RELATIONSHIP = Not-in-family HOURS WORKED = ??? SALARY = <= 50K >

< Education = Assoc Voc. M-STAT = Never-married OCC-STAT = Farming-fishing RELATIONSHIP = Not-in-family HOURS WORKED = ??? SALARY = <= 50K >

< Education = ??? M-STAT = ??? OCC-STAT = ??? RELATIONSHIP = ??? HOURS WORKED = ??? SALARY = <= 50K >

< Education = Assoc Voc. M-STAT = Never-married OCC-STAT = Farming-fishing RELATIONSHIP = ??? HOURS WORKED = ??? SALARY = <= 50K >

< Education = Assoc Voc. M-STAT = ??? OCC-STAT = Farming-fishing RELATIONSHIP = Other-relative HOURS WORKED = 0 SALARY = > 50K >

< Education = ??? M-STAT = Never-married OCC-STAT = Machine-op-inspct RELATIONSHIP = Not-in-family HOURS WORKED = ??? SALARY = <= 50K >

< Education = Assoc Voc. M-STAT = ??? OCC-STAT = ??? RELATIONSHIP = ??? HOURS WORKED = ??? SALARY = <= 50K >

< Education = No HS M-STAT = Married-civ-spouse OCC-STAT = Tech-support RELATIONSHIP = Wife HOURS WORKED = 0 SALARY = <= 50K >

< Education = No HS M-STAT = Married-civ-spouse OCC-STAT = Tech-support RELATIONSHIP = Wife HOURS WORKED = 0 SALARY = <= 50K >

< Education = No HS M-STAT = Married-civ-spouse OCC-STAT = Tech-support RELATIONSHIP = Wife HOURS WORKED = 0 SALARY = <= 50K >

 

Figure 2: Hypothesis with Voting and Duplicates

 

The last three rules were what comes up when the gene is equal to zero.

 

A third version of the system used the voting method but also eliminated any duplicate rules. Table 5 gives results for this system. For the generations needed column, the second number indicates the generation where the maximum fitness was reached.

 

 

Table 5: Varying Population Size, No Duplicates

 

Population Size

Generations Needed

Training Accuracy

Test Accuracy

5

100/51

43

43.4

10

100/41

62.2

62.2

20

100/43

75

75.4

30

100/81

75

75.4

40

100/58

75

75.4

 

This test ended up generating a hypothesis that classified all instances as < 50K. This corresponds to about 75% of the instances. Figure 3 shows the best hypothesis generated for the population of 20.

 

< Education = BS M-STAT = Divorced OCC-STAT = ??? RELATIONSHIP = Husband HOURS WORKED = ??? SALARY = <= 50K >

< Education = PhD M-STAT = Divorced OCC-STAT = Machine-op-inspct RELATIONSHIP = Husband HOURS WORKED = ??? SALARY = <= 50K >

< Education = PhD M-STAT = ??? OCC-STAT = Machine-op-inspct RELATIONSHIP = ??? HOURS WORKED = ??? SALARY = <= 50K >

< Education = ??? M-STAT = ??? OCC-STAT = ??? RELATIONSHIP = ??? HOURS WORKED = ??? SALARY = <= 50K >

< Education = ??? M-STAT = Divorced OCC-STAT = ??? RELATIONSHIP = ??? HOURS WORKED = ??? SALARY = <= 50K >

< Education = No HS M-STAT = Married-AF-spouse OCC-STAT = Tech-support RELATIONSHIP = Husband HOURS WORKED = 0 SALARY = > 50K >

< Education = ??? M-STAT = Married-civ-spouse OCC-STAT = ??? RELATIONSHIP = ??? HOURS WORKED = 0 SALARY = <= 50K >

< Education = No HS M-STAT = Married-civ-spouse OCC-STAT = Tech-support RELATIONSHIP = Wife HOURS WORKED = 0 SALARY = <= 50K >

 

Figure 3: Hypothesis for Voting with No Duplicates

 

The difficulty with this method is that because there is a hypothesis to automatically classify each instance as <= 50K, in order for an instance to be classified as >50K there must be two rules that do so. In order to fix this problem, the software was revised to break ties by choosing the classification that had the most specific rule. Table 6 gives the results from this version.

 

Table 6: Varying Population Size, Tiebreaker

 

Population Size

Generations Needed

Training Accuracy

Test Accuracy

5

100/51

43

43.4

10

100/41

62.2

62.2

20

100/43

75

75.4

30

100/81

75

75.4

40

100/58

75

75.4

 

These results ended up being virtually identical (except for some small differences in the 5 and 10 rule cases) to the version that did not use tie breakers.

 

The final modification to the code was to implement a version that did not use tournament selection but instead always chose the fittest genes. Table 7 gives the results for this version.

 

Table 7: Varying Population Size, Non-Tournament

 

Population Size

Generations Needed

Training Accuracy

Test Accuracy

5

100/30

45.6

45.1

10

100/64

75

75.4

20

100/31

32.9

32.7

30

100/51

62.5

62.6

40

100/6

27.7

27.7

 

For smaller populations, this algorithm worked slightly better than the tournament selection but for larger populations it was worse, in some cases much worse.

 

The non-tournament version was tested using 100 training and test instances so that the results could be compared to other methods. Table 8 gives the results.

 

Table 8: Comparison with Other Learning Methods

 

Method

Results

Instance Based Learning

88% (k=5)

Pruned Decision Tree (w/?s)

65.6%

Naïve Bayes Classifier

78%

GA, 20 Rules/Hypothesis

75%

 

The 100 instance set ended up giving as good an accuracy as the larger data set.

 

 

Strengths and Weaknesses

 

The major weakness is that the random number generators sometimes produce infinite loops in the code. Counters are used to detect them and to vary the values but there are still cases where they occur. For crossover, the cutpoints must have matching rule boundaries for the two hypothesiss, this is where the code sometimes gets stuck.

 

Another weakness is that mutation was not used. Adding mutation may have helped to achieve better accuracy. A parameter is read in for the mutation rate but it was not used.

 

User's Manual

 

The genetic algorithm program can be built as follows:

 

cc genetic.c testprint.c test_line.c train_line.c decode_rule.c decode_line.c gen_level.c valid.c crossover.c cross.c -o genetic

 

The program can be run as follows:

 

./genetic <threshold> <population> <generations> <crossover %> <mutation>

 

where:

<threshold> - a number between 0 and 100 indicating the accuracy requested

<population> - the number of genes in each population

<generations> - the maximum number of generations

<crossover> - a number between 0 and 100 indicating the percentage of each new generation to be created by crossover

<mutation> - integer number indicating mutation rate -- NOT USED

 

References

 

Mitchell, Machine Learning, McGraw Hill, 1997