Prof. Ruiz

- Generate two sets X and Y of 100 random numbers each, where each set follows a normal distribution. In this example:
- the normal distribution for X has mean 90 and standard deviation 10.
- the normal distribution for Y has mean 60 and standard deviation 10.

- Build the dataset D from X and Y:
    X = random('Normal',90,10,1,100);
    Y = random('Normal',60,10,1,100);
    D(1:100,1) = X; D(1:100,2) = 1;
    D(101:200,1) = Y; D(101:200,2) = 2;
  D now contains 200 data instances. The first column holds the randomly generated number, and the second column tells whether the number came from X (value 1) or from Y (value 2). See D contents.

- Translate D to an arff file: em_dataset_example.arff.
- I include below the clustering results in Weka:
(note that the parameters I used for EM were:
weka.clusterers.EM -I 100 -N 2 -M 8.0 -S 100)
=== Run information ===

Scheme:       weka.clusterers.EM -I 100 -N 2 -M 8.0 -S 100
Relation:     em_example
Instances:    200
Attributes:   2
              A
Ignored:
              class
Test mode:    Classes to clusters evaluation on training data

=== Model and evaluation on training set ===

EM
==

Number of clusters: 2

             Cluster
Attribute          0        1
              (0.49)   (0.51)
=============================
A
  mean       89.7246   60.419
  std. dev.   9.0504   8.8655

Clustered Instances

0      100 ( 50%)
1      100 ( 50%)

Log likelihood: -4.1775

Class attribute: class
Classes to Clusters:

  0  1  <-- assigned to cluster
 94  6 | 1
  6 94 | 2

Cluster 0 <-- 1
Cluster 1 <-- 2

Incorrectly clustered instances :   12.0    6 %
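The 6% error reported by Weka is close to what the overlap of the two generating distributions predicts. As a rough check, assume equal mixing proportions and the true parameters (means 90 and 60, common standard deviation 10). The best threshold then sits midway at 75, and the fraction of each class that falls on the wrong side is the standard normal tail beyond 1.5 standard deviations, about 6.7%. The short Matlab sketch below does this arithmetic; the variable names are illustrative and it uses only base Matlab functions.

```matlab
% Expected misclassification rate for two equal-weight normal distributions
% with means 90 and 60 and a common standard deviation of 10.
mu1 = 90; mu2 = 60; sigma = 10;

% With equal weights and equal spread, the best threshold is the midpoint.
threshold = (mu1 + mu2) / 2;

% Distance from either mean to the threshold, in standard deviations.
z = (mu1 - threshold) / sigma;

% Standard normal tail probability P(Z > z), computed with erfc.
expected_error = 0.5 * erfc(z / sqrt(2));

% Error observed in the Weka run: off-diagonal counts over all instances.
confusion = [94 6; 6 94];
observed_error = (confusion(1,2) + confusion(2,1)) / sum(confusion(:));

fprintf('expected: %.4f, observed: %.4f\n', expected_error, observed_error);
```

An observed rate slightly below the expected one is unremarkable for a single sample of 200 points.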

- We can also use Matlab to cluster D.
See em_clustering_D_matlab.m.
Here is the summary result of the clustering reported by Matlab:
obj =
Gaussian mixture distribution with 2 components in 1 dimensions
Component 1:
Mixing proportion: 0.482694
Mean: 91.0523
Component 2:
Mixing proportion: 0.517306
Mean: 60.0800
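The script em_clustering_D_matlab.m is not reproduced here. As a rough illustration of what EM does on data of this kind, the sketch below runs a hand-written EM loop for a two-component, one-dimensional Gaussian mixture. It is not the original script: the seed, the initial guesses, the fixed iteration count, and all variable names are assumptions made for illustration, and randn is used in place of random so that only base Matlab is needed.

```matlab
rng(1);                                   % fixed seed (arbitrary choice)

% 200 points: 100 from N(90,10) followed by 100 from N(60,10).
x = [90 + 10*randn(100,1); 60 + 10*randn(100,1)];
truth = [ones(100,1); 2*ones(100,1)];     % 1 = came from X, 2 = came from Y
n = numel(x);

% Initial guesses (assumed): means at the data extremes,
% a shared spread, and equal mixing proportions.
mu    = [max(x), min(x)];
sigma = [std(x), std(x)];
w     = [0.5, 0.5];

for iter = 1:100
    % E-step: weighted density of every point under each component.
    dens = zeros(n, 2);
    for k = 1:2
        dens(:,k) = w(k) * exp(-0.5 * ((x - mu(k)) / sigma(k)).^2) ...
                    / (sigma(k) * sqrt(2*pi));
    end
    % Responsibilities: normalize each row so it sums to 1.
    r = bsxfun(@rdivide, dens, sum(dens, 2));

    % M-step: re-estimate proportions, means and standard deviations.
    nk = sum(r, 1);
    w  = nk / n;
    mu = (r' * x)' ./ nk;
    for k = 1:2
        sigma(k) = sqrt(sum(r(:,k) .* (x - mu(k)).^2) / nk(k));
    end
end

% Hard assignment: each point goes to its most responsible component.
[~, label] = max(r, [], 2);

% Component 1 starts at the high end, so it is compared with class 1.
err = mean(label ~= truth);

fprintf('mixing proportions: %.4f %.4f\n', w(1), w(2));
fprintf('means:              %.4f %.4f\n', mu(1), mu(2));
fprintf('std. devs:          %.4f %.4f\n', sigma(1), sigma(2));
fprintf('fraction misassigned: %.4f\n', err);
```

The estimated means should land near 90 and 60, with mixing proportions near 0.5, in line with the summaries reported by Weka and Matlab. The exact figures depend on the random sample.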