EM Clustering Example.
Prof. Ruiz

  1. Generate two sets X and Y of 100 random numbers each, where each set follows a normal distribution. In this example, X has mean 90 and standard deviation 10, and Y has mean 60 and standard deviation 10. Below is the Matlab program two_random_normal_sets.m that I used to do this.
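    % Draw 100 values per set: random('Normal', mean, std. dev., rows, cols)
    X = random('Normal',90,10,1,100);
    Y = random('Normal',60,10,1,100);
    % Rows 1-100 of D hold X, labeled 1 in the second column
    D(1:100,1) = X;
    D(1:100,2) = 1;
    % Rows 101-200 of D hold Y, labeled 2 in the second column
    D(101:200,1) = Y;
    D(101:200,2) = 2;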
    
    D now contains 200 data instances: the first column holds a randomly generated number, and the second column records whether that number came from X (1) or from Y (2). See D contents.

  2. Translate D to an arff file: em_dataset_example.arff.
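
    The translation step is not shown in this handout, so the following is only a minimal sketch of one way to do it in Matlab. The relation name (em_example) and the attribute names (A and class) are taken from the Weka run information in step 3; the exact header of the original file is an assumption. D is rebuilt here with randn so that the block runs on its own.

    ```matlab
    % Rebuild D as in step 1 (randn is used here so the sketch needs no toolbox).
    D = zeros(200,2);
    D(1:100,1)   = 90 + 10*randn(100,1);
    D(1:100,2)   = 1;
    D(101:200,1) = 60 + 10*randn(100,1);
    D(101:200,2) = 2;

    % Write the ARFF header, then one "value,label" row per instance.
    % The header below is an assumed layout, not the author's actual file.
    fid = fopen('em_dataset_example.arff','w');
    fprintf(fid,'@relation em_example\n\n');
    fprintf(fid,'@attribute A numeric\n');
    fprintf(fid,'@attribute class {1,2}\n\n');
    fprintf(fid,'@data\n');
    fprintf(fid,'%.4f,%d\n', D');   % D' so that fprintf walks D row by row
    fclose(fid);
    ```

    Declaring class as a nominal attribute with values {1,2} is what lets Weka ignore it during clustering and still use it for the classes-to-clusters evaluation.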

  3. The clustering results from Weka are included below. The parameters I used for EM were: weka.clusterers.EM -I 100 -N 2 -M 8.0 -S 100
    === Run information ===
    
    Scheme:       weka.clusterers.EM -I 100 -N 2 -M 8.0 -S 100
    Relation:     em_example
    Instances:    200
    Attributes:   2
                  A
    Ignored:
                  class
    Test mode:    Classes to clusters evaluation on training data
    === Model and evaluation on training set ===
    
    
    EM
    ==
    
    Number of clusters: 2
    
    
                Cluster
    Attribute         0       1
                 (0.49)  (0.51)
    ============================
    A
      mean       89.7246  60.419
      std. dev.   9.0504  8.8655
    
    Clustered Instances
    
    0      100 ( 50%)
    1      100 ( 50%)
    
    
    Log likelihood: -4.1775
    
    
    Class attribute: class
    Classes to Clusters:
    
      0  1  <-- assigned to cluster
     94  6 | 1
      6 94 | 2
    
    Cluster 0 <-- 1
    Cluster 1 <-- 2
    
    Incorrectly clustered instances :	12.0	  6      %
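
    The last line of the Weka output reports 12 incorrectly clustered instances, or 6% of the data. As a small check, the sketch below recomputes both numbers from the classes-to-clusters matrix above. The variable names are illustrative only.

    ```matlab
    % Classes-to-clusters matrix copied from the Weka output:
    % rows are the true classes 1 and 2, columns are clusters 0 and 1.
    C = [94  6;
          6 94];

    % Weka maps cluster 0 to class 1 and cluster 1 to class 2, so the
    % diagonal holds the correctly clustered instances.
    total        = sum(C(:));          % 200 instances
    misclustered = total - trace(C);   % 6 + 6 = 12
    rate         = 100 * misclustered / total;

    fprintf('Incorrectly clustered: %d of %d (%.1f%%)\n', misclustered, total, rate);
    ```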
    

  4. We can also use Matlab to cluster D; see em_clustering_D_matlab.m. Here is the summary of the clustering result reported by Matlab:
    obj = 
    
    Gaussian mixture distribution with 2 components in 1 dimensions
    Component 1:
    Mixing proportion: 0.482694
    Mean:    91.0523
    
    Component 2:
    Mixing proportion: 0.517306
    Mean:    60.0800
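
    The script em_clustering_D_matlab.m is not reproduced here. The sketch below shows one way such a clustering could be done, assuming the Statistics and Machine Learning Toolbox function fitgmdist is available; it is not necessarily what the original script does. D is regenerated, so the fitted values will differ from the summary above.

    ```matlab
    rng(100);   % fix the random seed so repeated runs give the same fit

    % Rebuild D as in step 1.
    D = zeros(200,2);
    D(1:100,1)   = 90 + 10*randn(100,1);
    D(1:100,2)   = 1;
    D(101:200,1) = 60 + 10*randn(100,1);
    D(101:200,2) = 2;

    % Fit a 2-component Gaussian mixture with EM to the first column only;
    % the label column is held out, as in the Weka run.
    gm = fitgmdist(D(:,1), 2, 'Replicates', 5);

    % Assign each instance to its most probable component (1 or 2).
    idx = cluster(gm, D(:,1));

    % Cross-tabulate true labels (rows) against assigned components (columns).
    C = accumarray([D(:,2) idx], 1, [2 2]);

    disp(gm.mu');                  % component means
    disp(gm.ComponentProportion);  % mixing proportions
    disp(C);                       % label-by-component counts
    ```

    Component numbering is arbitrary, so component 1 may correspond to either X or Y; the cross-tabulation C shows which mapping applies in a given run.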