This project consists of two parts:
The objective of this project is to construct a model with the highest
prediction accuracy possible for the Boolean challenge, and a model with the
highest prediction accuracy possible for the multivalued challenge described
below. Please submit both models to us by email as part of your project
submission.
The more accurate, creative, and well-designed your solution is, the better.
Remember to include as much domain knowledge as you can.
Since you are testing on a separate test set, you do not need to use 10-fold
cross-validation for this challenge.
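As a minimal sketch of evaluating on a separate test set instead of using
cross-validation, the Python below fits a model on training data only and
measures accuracy once on held-out test records. Everything in it is an
assumption made for illustration: the function names, the trivial
majority-class "model", and the toy labels are hypothetical and are not taken
from the project datasets or from Weka.

```python
from collections import Counter


def fit_majority_baseline(train_labels):
    """Return a trivial model that always predicts the most frequent training label."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return lambda features: majority


def holdout_accuracy(model, test_records):
    """Fraction of (features, label) test records that the model labels correctly."""
    correct = sum(1 for features, label in test_records if model(features) == label)
    return correct / len(test_records)


# Toy placeholder data (invented for illustration only).
train_labels = ["normal", "normal", "attack"]
test_records = [
    ((0.1,), "normal"),
    ((0.9,), "attack"),
    ((0.2,), "normal"),
    ((0.8,), "attack"),
]

# Fit on the training data only, then score once on the held-out test set.
model = fit_majority_baseline(train_labels)
accuracy = holdout_accuracy(model, test_records)
print(f"hold-out accuracy: {accuracy:.2f}")
```

A real submission would replace the baseline with an actual classifier; the
point of the sketch is only that the test records are never used for fitting.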
The following attack-types appear in the test set but not in the training set:
apache2
httptunnel
mailbomb
mscan
processtable
saint
snmpgetattack
snmpguess
[There are other attack-types that appear in the test set but not in the
training set; they are disregarded here because they are very infrequent.]
Use clustering algorithms (e.g., Simple K-means and/or Hierarchical
Clustering, as implemented in the Weka system; make sure to experiment with
different "linkType"s) to determine whether any of the above attack-types are
similar to attack-types that appear in both the training and the test
datasets, which are listed below:
back
buffer_overflow
ftp_write
guess_passwd
imap
ipsweep
land
loadmodule
multihop
neptune
nmap
normal
perl
phf
pod
portsweep
rootkit
satan
smurf
spy
teardrop
warezclient
warezmaster
For this, you can use just the test set alone, or the test set and the
training set combined.
Explain your work in detail.
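One possible way to read similarity off a clustering, sketched in plain Python
rather than Weka: cluster all records, find the cluster that holds most
instances of each test-only attack-type, and list the known attack-types that
share that cluster. This is a sketch under stated assumptions, not the
required method. The helper names, the deterministic choice of initial
centroids, and the labels "known_a", "known_b", "novel_x", and "novel_y" are
hypothetical stand-ins, and the two-dimensional feature values are invented.

```python
from collections import Counter, defaultdict


def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iterations=20):
    """Plain Lloyd's k-means; the first k points serve as initial centroids."""
    centroids = [list(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster goes empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment


def similar_known_types(records, novel_labels, k):
    """Map each novel label to the known labels found in its majority cluster."""
    points = [features for features, _ in records]
    labels = [label for _, label in records]
    assignment = kmeans(points, k)

    per_label = defaultdict(Counter)    # label -> counts per cluster id
    per_cluster = defaultdict(Counter)  # cluster id -> counts per label
    for cluster, label in zip(assignment, labels):
        per_label[label][cluster] += 1
        per_cluster[cluster][label] += 1

    result = {}
    for label in novel_labels:
        majority_cluster = per_label[label].most_common(1)[0][0]
        result[label] = sorted(
            other for other in per_cluster[majority_cluster]
            if other not in novel_labels
        )
    return result


# Toy records: (feature vector, label). Values are invented for illustration.
records = [
    ((0.0, 0.1), "known_a"),
    ((5.0, 5.1), "known_b"),
    ((0.2, 0.0), "known_a"),
    ((5.2, 4.9), "known_b"),
    ((0.1, 0.3), "novel_x"),
    ((4.8, 5.0), "novel_y"),
    ((0.3, 0.2), "novel_x"),
]

result = similar_known_types(records, ["novel_x", "novel_y"], k=2)
print(result)
```

On the real data, the number of clusters, the feature scaling, and (for
hierarchical clustering) the link type can all change which attack-types end
up together, so it is worth comparing several settings before drawing
conclusions.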