NAME
       stats - a summary statistics program

SYNOPSIS
       stats <flags>, where flags are:
       -h or -help     this message
       -c#     probability confidence 0.5 < c <= 1.0
       -f#     field (use twice for (x,y) pairs)
       -xy     read (x,y) from field 1 and field 2
       -gp     output lines for gnuplot
       -GP     output only lines for gnuplot
       -s#     number of future samples (0 for infinite)
       -v      print out extra information
       Defaults are: -c0.00 -f1
       version 2.80, by Mark Claypool
       send bugs, suggestions to claypool@cs.wpi.edu

DESCRIPTION

Stats is designed to be a quick, simple-to-use program to generate
the following summary statistics:
     - confidence intervals
     - mean
     - variance
     - standard deviation
     - min, max
     - sum
     - linear regression fits (for two fielded inputs)

Input is from the standard input in the form of numbers in distinct
fields, one entry per line. Fields are groups of non-white space
separated by whitespace.  The first field is numbered 1, and is
indicated by the flag -f1 from the command line.  Input lines with a
"#" as the first character are treated as a comment and are ignored.
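
As a rough illustration of the input rules above, the following Python
sketch pulls one field out of each line. It is not the code stats
itself uses; the function name read_field is hypothetical, and skipping
lines that have too few fields is an assumption, since this page does
not say how stats treats them.

```python
def read_field(lines, field=1):
    """Yield the number found in the given 1-based field of each line."""
    for line in lines:
        if line.startswith("#"):   # "#" as the first character marks a comment
            continue
        parts = line.split()       # fields are runs of non-whitespace
        if len(parts) < field:     # assumption: skip blank or too-short lines
            continue
        yield float(parts[field - 1])   # raises ValueError on non-numeric text


sample = [
    "# measurements",
    "value: 20   response: 10.2",
    "value: 40   response: 19.3",
]
print(list(read_field(sample, field=2)))   # the "value" column
print(list(read_field(sample, field=4)))   # the "response" column
```

Field 1 and field 3 in this sample hold the labels "value:" and
"response:", which is why the numeric columns are fields 2 and 4.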

For a single stream of numbers, you may specify on what field you wish
to calculate statistics.  By default, stats reports mean, variance,
standard deviation, min, max and sum.  If confidence intervals are
requested (-c# flag), stats reports a confidence interval of # around
the mean.

In the case of two-fielded input (specified by -fnum1 -fnum2), results
for each field are reported separately, as above. In addition, stats
performs a simple least-squares fit of a straight line to the data.
Stats also reports the total sum of squares (SST) and the sum of
squares explained by regression (SSR). The fraction of the variation
that is explained by the regression (SSR/SST) measures the goodness of
the fit and is called the coefficient of determination. If confidence
intervals are
requested (-c# flag), stats reports confidence intervals around the
slope and y intercept.
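
The fit can be sketched in Python as below. This is a textbook
least-squares calculation, not necessarily the code path stats itself
follows; the function name linear_fit and the dictionary keys are made
up for illustration. Fed the two numeric columns of example.2.data from
the EXAMPLES section, it should agree with the A, B, SSR and SST values
shown there to within rounding.

```python
import math


def linear_fit(xs, ys):
    """Least-squares fit of y = A*x + B, plus SST, SSR and related values."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    a = sxy / sxx                      # slope
    b = y_mean - a * x_mean            # y intercept
    sst = sum((y - y_mean) ** 2 for y in ys)                     # total
    sse = sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))    # unexplained
    ssr = sst - sse                    # explained by the regression
    r2 = ssr / sst                     # coefficient of determination
    corr = math.copysign(math.sqrt(r2), a)   # correlation has the slope's sign
    return {"A": a, "B": b, "SSE": sse, "SSR": ssr, "SST": sst,
            "r2": r2, "correlation": corr}


fit = linear_fit([20, 40, 31], [10.2, 19.3, 15.4])
for name, value in fit.items():
    print(f"{name:>12}: {value:.12f}")
```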

For those who use gnuplot, the -gp flag adds output in a format that
is easily incorporated into gnuplot scripts for the "plot" command.
This includes error bars for the individual fields and confidence
parabolas around the line fit. The size of the interval depends upon
the number of future samples. There are two extreme cases, 1 and
infinity.  You can specify the number of samples with the -s# flag;
-s0 indicates an infinite number of samples.
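
To show how the number of future samples changes the width of the
band, here is a small Python sketch modeled on the gnuplot expression
printed in the EXAMPLES section, where the term 1.000000000000 under
the square root corresponds to one future sample. Treating that term as
1/m for m future samples (and as 0 for infinitely many) is an
assumption based on the usual textbook prediction formula, not
something this page states. The function name fit_half_width and its
parameters are hypothetical.

```python
import math


def fit_half_width(x, s_e, n, x_mean, sxx, t, future=1):
    """Half-width of the band around the fitted line at position x.

    s_e    standard deviation of the errors around the line
    n      number of (x,y) pairs
    sxx    sum of x squared minus n times the squared mean of x
    t      t-table multiplier for the requested confidence
    future number of future samples, 0 meaning infinite (as with -s0)
    """
    inv_m = 0.0 if future == 0 else 1.0 / future   # assumed 1/m term
    return t * s_e * math.sqrt(inv_m + 1.0 / n + (x - x_mean) ** 2 / sxx)


# Constants copied from the gnuplot lines in the EXAMPLES section.
x_mean = 30.333333333333
sxx = 2961.0 - 3 * 920.111111111111
one = fit_half_width(x_mean, 0.158952133458, 3, x_mean, sxx, 2.92, future=1)
inf = fit_half_width(x_mean, 0.158952133458, 3, x_mean, sxx, 2.92, future=0)
print(one, inf)   # the band for one future sample is the wider of the two
```

The band is narrowest at the mean of x and widens as x moves away from
it, which is what gives the "max fit" and "min fit" curves their shape.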

In order to speed up processing, all results are calculated in one
pass.  This involves keeping the sum of the numbers squared for
calculating, among other things, the variance.  This "on the fly"
technique has the potential to cause the sum of the squares to
overflow. To guard against this, an overflow is checked for and
reported.  However, in pilot tests with LOTS of numbers, this never
happened.
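
The one-pass idea can be sketched in Python as follows. This is an
illustration of the running-sums technique described above, not the
actual stats source: the function name one_pass_stats is hypothetical,
and detecting overflow by testing the sum of squares for infinity is an
assumption about how such a check might be done.

```python
import math


def one_pass_stats(values):
    """Accumulate running sums once, then derive the summary statistics."""
    n = 0
    total = 0.0        # running sum
    total_sq = 0.0     # running sum of squares
    lo, hi = math.inf, -math.inf
    for v in values:
        n += 1
        total += v
        total_sq += v * v
        if math.isinf(total_sq):   # assumed overflow test
            raise OverflowError("sum of squares overflowed")
        lo = min(lo, v)
        hi = max(hi, v)
    if n == 0:
        raise ValueError("no input values")
    mean = total / n
    # Sample variance from the running sums; defined here as 0 when n is 1.
    variance = (total_sq - n * mean * mean) / (n - 1) if n > 1 else 0.0
    variance = max(variance, 0.0)  # rounding can push a tiny value below zero
    return {"lines": n, "mean": mean, "variance": variance,
            "std dev": math.sqrt(variance), "sum": total,
            "min": lo, "max": hi}


print(one_pass_stats([5, 10, 15]))
```

For the three values 5, 10 and 15 the sum of squares is 350 and n times
the squared mean is 300, giving a variance of 50/2 = 25, as in the
first example under EXAMPLES.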

Note that confidence intervals use an approximation formula for the t
tables for over 30 values. For fewer than 30 values, only confidence
intervals of 95% will be accurate.

All formulas and calculations used in stats can be found in any decent
statistics book.  However, the author especially used "The Art of
Computer Systems Performance Analysis", by Raj Jain, copyright 1991,
published by John Wiley and Sons, Inc.

EXAMPLES

mark% cat example.data
5
10
15

mark% cat example.data | stats
          Field:  1
          lines:  3
           mean:  10.000000000000
       variance:  25.000000000000
        std dev:  5.000000000000
            sum:  30.000000000000
            min:  5.000000000000
            max:  15.000000000000

mark% cat example.data | stats -c.95
          Field:  1
          lines:  3
           mean:  10.000000000000
       variance:  25.000000000000
        std dev:  5.000000000000
            sum:  30.000000000000
            min:  5.000000000000
            max:  15.000000000000
     confidence:  95%
  left endpoint:  3.207474082984
 right endpoint:  16.792525917016

mark% cat example.2.data
value: 20   response: 10.2
value: 40   response: 19.3
value: 31   response: 15.4

mark% cat example.2.data | stats -f2 -f4
          Field:  2
          lines:  3
           mean:  30.333333333333
       variance:  100.333333333333
        std dev:  10.016652800878
            sum:  91.000000000000
            min:  20.000000000000
            max:  40.000000000000

          Field:  4
          lines:  3
           mean:  14.966666666667
       variance:  20.843333333333
        std dev:  4.565449959570
            sum:  44.900000000000
            min:  10.200000000000
            max:  19.300000000000

           line:  y = Ax + B
              A:  0.455647840532
              B:  1.145348837209
  error squared:  0.025265780731
            SSR:  41.661400885936
            SST:  41.686666666667
 coeff. of det.:  0.999393912185
    correlation:  0.999696910161

mark% cat example.2.data | stats -f2 -f4 -GP -c.95
   30.333333     16.725659     43.941008
   14.966667      8.764479     21.168854
(0.455647840532*x + 1.145348837209) + ((0.158952133458*sqrt(1.000000000000 + 0.333333333333+(x-30.333333333333)*(x-30.333333333333)/(2961.000000000000-3*920.111111111111))))*2.920000000000 title 'max fit' with lines 2, \
0.455647840532*x + 1.145348837209 title 'best fit', \
(0.455647840532*x + 1.145348837209) - ((0.158952133458*sqrt(1.000000000000 + 0.333333333333+(x-30.333333333333)*(x-30.333333333333)/(2961.000000000000-3*920.111111111111))))*2.920000000000 title 'min fit' with lines 2

The first two lines above (beginning with 30.3 and 14.9) are points
that you want to plot.  Suppose the first one is data measured at some
X value (which stats cannot determine), here X = 1. Cut and paste that
line into file1, with the X value added in front, so that it looks like:

1      30.333333     16.725659     43.941008

To gnuplot, the first number is the X coordinate, the second is the Y
coordinate, the third is the low end of the confidence interval, and
the fourth is the high end. For gnuplot, you then have the command:

plot \
"file1" title 'my data' with lines, \
"file1" title '95% confidence interval' with errorbars

You do something similar with the second line.

The second group of lines (beginning with 0.4) are gnuplot plot
commands that generate a line fit and parabolic confidence bounds
around it.  Cut and paste the lines into a gnuplot plot command like:

plot \
(0.455647840532*x + 1.145348837209) + ((0.158952133458*sqrt(1.000000000000 + 0.333333333333+(x-30.333333333333)*(x-30.333333333333)/(2961.000000000000-3*920.111111111111))))*2.920000000000 title 'max fit' with lines 2, \
0.455647840532*x + 1.145348837209 title 'best fit', \
(0.455647840532*x + 1.145348837209) - ((0.158952133458*sqrt(1.000000000000 + 0.333333333333+(x-30.333333333333)*(x-30.333333333333)/(2961.000000000000-3*920.111111111111))))*2.920000000000 title 'min fit' with lines 2

BUGS

For more than 30 samples, there is an approximation for the T table
values used in computing confidence intervals. For fewer than 30
samples, the numbers must be looked up in a table. Because the author
of this program is lazy, only table values for 95% have been
recorded. Fortunately, 95% confidence intervals are quite common.

This man page could do a lot more towards describing the relevance and
meaning of the statistics reported by stats.  For example, what
significance does a correlation of 0.60 have? How are confidence
intervals to be interpreted? What information can be gathered from
just the standard deviation?

FUTURE WORK

To do: 
  - Make stats read in an environment variable for command line flags.
  - Add histogram capabilities.
  - Add more T-Table values for fewer than 30 samples and non-95%
      confidence intervals. 90% and 99% are good candidates.
  - Add a test for normality.
  - Add a perl script package that helps with parsing input files.
  - Add hypothesis testing, including P-Values.
  - Might be nice if stats generated a gnuplot file, both for the data
      and for the control variables.  
  - Add summary statistics for paired data.
