NAME
stats - a summary statistics program
SYNOPSIS
stats [flags], where flags are:
-h or -help this message
-c# probability confidence 0.5 < c <= 1.0
-f# field (use twice for (x,y) pairs)
-xy read (x,y) from field 1 and field 2
-gp output lines for gnuplot
-GP output only lines for gnuplot
-s# number of future samples (0 for infinite)
-v print out extra information
Defaults are: -c0.00 -f1
version 2.80, by Mark Claypool
send bugs, suggestions to claypool@cs.wpi.edu
DESCRIPTION
Stats is designed to be a quick, simple-to-use program to generate
the following summary statistics:
- confidence intervals
- mean
- variance
- standard deviation
- min, max
- sum
- linear regression fits (for two fielded inputs)
Input is from the standard input in the form of numbers in distinct
fields, one entry per line. Fields are groups of non-white space
separated by whitespace. The first field is numbered 1, and is
indicated by the flag -f1 on the command line. Input lines with a
"#" as the first character are treated as comments and are ignored.
For a single stream of numbers, you may specify on what field you wish
to calculate statistics. By default, stats reports mean, variance,
standard deviation, min, max and sum. If confidence intervals are
requested (-c# flag), stats reports a confidence interval of # around
the mean.
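The field rules above (whitespace-separated fields numbered from 1,
lines starting with "#" ignored) can be sketched in Python as follows.
This is only an illustration, not the code stats uses; read_field is a
hypothetical name, and skipping lines that have too few fields is an
assumption, since this page does not say how stats treats them.

```python
def read_field(lines, field=1):
    """Return the numbers found in the given 1-based field of each line."""
    values = []
    for line in lines:
        if line.startswith("#"):   # "#" as the first character: comment
            continue
        parts = line.split()       # fields are runs of non-whitespace
        if len(parts) < field:     # assumption: skip lines that are too short
            continue
        values.append(float(parts[field - 1]))
    return values


sample = [
    "# response times",
    "value: 20 response: 10.2",
    "value: 40 response: 19.3",
]
print(read_field(sample, field=2))
print(read_field(sample, field=4))
```

With this layout, field 2 holds the "value" numbers and field 4 holds
the "response" numbers, which is why the two-field example in EXAMPLES
uses -f2 -f4.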
In the case of two-fielded input (specified by -fnum1 -fnum2), results
for each field are reported separately, as above. In addition, stats
performs a simple least-squares fit of a straight line to the data.
Stats also reports the total sum of squares (SST) and the sum of
squares explained by regression (SSR). The fraction of the variation
that is explained by the regression (SSR/SST) measures the goodness
of the fit and is called the coefficient of determination. If
confidence intervals are requested (-c# flag), stats reports
confidence intervals around the slope and y intercept.
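The quantities in this paragraph can be sketched with the standard
least-squares formulas. This is a minimal Python illustration, not the
code stats uses; linear_fit is a hypothetical name. The data are the
three (x, y) pairs from example.2.data in EXAMPLES.

```python
def linear_fit(xs, ys):
    """Least-squares fit of y = A*x + B, with SSE, SSR, SST and R^2."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    a = sxy / sxx                 # slope
    b = y_mean - a * x_mean       # y intercept
    sst = sum((y - y_mean) ** 2 for y in ys)                     # total
    sse = sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))    # error
    ssr = sst - sse                                              # explained
    return {"A": a, "B": b, "SSE": sse, "SSR": ssr, "SST": sst,
            "R2": ssr / sst}


fit = linear_fit([20, 40, 31], [10.2, 19.3, 15.4])
for name, value in fit.items():
    print(f"{name}: {value:.12f}")
```

The values printed for A, B, SSR, SST and R2 should agree with the A,
B, SSR, SST and "coeff. of det." lines of the stats -f2 -f4 example,
and SSE with its "error squared" line.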
For those who use gnuplot, the -gp flag adds output in a format that
is easily incorporated into gnuplot scripts for the "plot" command.
This includes error bars for the individual fields and confidence
parabolas around the line fits. The size of the interval depends upon
the number of future samples. There are two extreme cases, 1 and
infinity. You can specify the number of samples with the -s# flag;
-s0 indicates an infinite number of samples.
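The shape of the confidence parabolas can be read off the gnuplot
expressions in the -GP example in EXAMPLES: the half width at x is
t * s_e * sqrt(1/m + 1/n + (x - xbar)^2 / (sum(x^2) - n*xbar^2)).
The Python sketch below evaluates that expression. It is an
illustration only; band_half_width and its parameter names are
hypothetical, reading the leading 1.0 term of the example as 1/m with
m = 1 future sample is an inference, and dropping that term for m = 0
(infinite samples) is an assumption based on the textbook formula.

```python
import math


def band_half_width(x, n, x_mean, sum_x2, s_e, t, m=1):
    """Half width of the confidence band around the fitted line at x.

    n      number of data points
    x_mean mean of the x values
    sum_x2 sum of the squared x values
    s_e    standard deviation of the errors
    t      t table multiplier
    m      number of future samples, 0 meaning infinite (assumption)
    """
    inv_m = 0.0 if m == 0 else 1.0 / m
    spread = (x - x_mean) ** 2 / (sum_x2 - n * x_mean ** 2)
    return t * s_e * math.sqrt(inv_m + 1.0 / n + spread)


# Constants copied from the -GP example output in EXAMPLES.
x_mean, sum_x2, s_e, t = 30.333333333333, 2961.0, 0.158952133458, 2.92
for m in (1, 0):
    for x in (20.0, x_mean, 40.0):
        w = band_half_width(x, 3, x_mean, sum_x2, s_e, t, m=m)
        print(f"m={m} x={x:.2f} half width={w:.6f}")
```

The band is narrowest at the mean of x and widens on either side,
which gives the parabolic look. A larger number of future samples
shrinks the 1/m term and so narrows the band.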
In order to speed up processing, all results are calculated in one
pass. This involves keeping the sum of the numbers squared for
calculating, among other things, the variance. This "on the fly"
technique has the potential to cause the sum of the squares to
overflow. To be safe, stats checks for such an overflow and reports
it. However, in pilot tests with LOTS of numbers, this never
happened.
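The one-pass idea can be sketched as follows: keep a running count,
sum and sum of squares, then derive the variance at the end as
(sum of squares - n * mean^2) / (n - 1). This is a minimal Python
illustration, not the code stats uses. The name one_pass_summary is
hypothetical, and detecting overflow by testing for an infinite
running sum is an assumption; this page does not say how stats
performs its check.

```python
import math


def one_pass_summary(values):
    """Summarize numbers in a single pass using running sums."""
    n = 0
    total = 0.0       # running sum
    total_sq = 0.0    # running sum of squares
    lo = math.inf
    hi = -math.inf
    for v in values:
        n += 1
        total += v
        total_sq += v * v
        lo = min(lo, v)
        hi = max(hi, v)
        # Assumption: an infinite running sum signals overflow.
        if math.isinf(total_sq):
            raise OverflowError("sum of squares overflowed")
    if n < 2:
        raise ValueError("need at least two values for a sample variance")
    mean = total / n
    # Clamp at zero in case rounding makes the difference slightly negative.
    variance = max((total_sq - n * mean * mean) / (n - 1), 0.0)
    return {"lines": n, "mean": mean, "variance": variance,
            "std dev": math.sqrt(variance), "sum": total,
            "min": lo, "max": hi}


print(one_pass_summary([5.0, 10.0, 15.0]))
```

For the values 5, 10 and 15 the sum of squares is 350 and n * mean^2
is 300, so the variance is 50 / 2 = 25, matching the first example in
EXAMPLES.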
Note that confidence intervals use an approximation formula for the t
tables for over 30 values. For fewer than 30 values, only confidence
intervals of 95% will be accurate.
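A confidence interval around the mean has the form
mean +/- t * (std dev) / sqrt(n), where t is the t table multiplier.
The Python sketch below shows only that arithmetic; it is not the code
stats uses, and confidence_interval is a hypothetical name. The
multiplier 2.353 is not taken from the stats source: it was inferred
by working backward from the endpoints printed in the -c.95 example in
EXAMPLES. Other sample sizes and confidence levels need a different
multiplier.

```python
import math


def confidence_interval(mean, std_dev, n, t):
    """Return (left, right) endpoints of mean +/- t * std_dev / sqrt(n)."""
    half_width = t * std_dev / math.sqrt(n)
    return mean - half_width, mean + half_width


# mean, std dev and line count from the first example in EXAMPLES;
# t = 2.353 is inferred from that example's output (see above).
left, right = confidence_interval(10.0, 5.0, 3, t=2.353)
print(f"left endpoint: {left:.12f}")
print(f"right endpoint: {right:.12f}")
```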
All formulas and calculations used in stats can be found in any decent
statistics book. However, the author especially used "The Art of
Computer Systems Performance Analysis", by Raj Jain, copyright 1991,
published by John Wiley and Sons, Inc.
EXAMPLES
mark% cat example.data
5
10
15
mark% cat example.data | stats
Field: 1
lines: 3
mean: 10.000000000000
variance: 25.000000000000
std dev: 5.000000000000
sum: 30.000000000000
min: 5.000000000000
max: 15.000000000000
mark% cat example.data | stats -c.95
Field: 1
lines: 3
mean: 10.000000000000
variance: 25.000000000000
std dev: 5.000000000000
sum: 30.000000000000
min: 5.000000000000
max: 15.000000000000
confidence: 95%
left endpoint: 3.207474082984
right endpoint: 16.792525917016
mark% cat example.2.data
value: 20 response: 10.2
value: 40 response: 19.3
value: 31 response: 15.4
mark% cat example.2.data | stats -f2 -f4
Field: 2
lines: 3
mean: 30.333333333333
variance: 100.333333333333
std dev: 10.016652800878
sum: 91.000000000000
min: 20.000000000000
max: 40.000000000000
Field: 4
lines: 3
mean: 14.966666666667
variance: 20.843333333333
std dev: 4.565449959570
sum: 44.900000000000
min: 10.200000000000
max: 19.300000000000
line: y = Ax + B
A: 0.455647840532
B: 1.145348837209
error squared: 0.025265780731
SSR: 41.661400885936
SST: 41.686666666667
coeff. of det.: 0.999393912185
correlation: 0.999696910161
mark% cat example.2.data | stats -f2 -f4 -GP -c.95
30.333333 16.725659 43.941008
14.966667 8.764479 21.168854
(0.455647840532*x + 1.145348837209) + ((0.158952133458*sqrt(1.000000000000 + 0.333333333333+(x-30.333333333333)*(x-30.333333333333)/(2961.000000000000-3*920.111111111111))))*2.920000000000 title 'max fit' with lines 2, \
0.455647840532*x + 1.145348837209 title 'best fit', \
(0.455647840532*x + 1.145348837209) - ((0.158952133458*sqrt(1.000000000000 + 0.333333333333+(x-30.333333333333)*(x-30.333333333333)/(2961.000000000000-3*920.111111111111))))*2.920000000000 title 'min fit' with lines 2
The first two lines above (beginning with 30.3 and 14.9) are points
that you want to plot, one per field. Suppose the first one is data
that came at some X value (which stats cannot determine). Cut and
paste it into file1, adding the X value in front, so that it looks
like:
1 30.333333 16.725659 43.941008
To gnuplot, the first number is the X coordinate, the second is the Y
coordinate, the third is the low confidence interval, the fourth is
the high confidence interval. For gnuplot, you then have the command:
plot \
"file1" title 'my data' with lines, \
"file1" title '95% confidence interval' with errorbars
You do something similar with the second line.
The second bunch of lines (beginning with 0.4) are gnuplot plot
commands that generate a line fit and parabolic confidence bounds
around it. Cut and paste the lines into a gnuplot plot command like:
plot \
(0.455647840532*x + 1.145348837209) + ((0.158952133458*sqrt(1.000000000000 + 0.333333333333+(x-30.333333333333)*(x-30.333333333333)/(2961.000000000000-3*920.111111111111))))*2.920000000000 title 'max fit' with lines 2, \
0.455647840532*x + 1.145348837209 title 'best fit', \
(0.455647840532*x + 1.145348837209) - ((0.158952133458*sqrt(1.000000000000 + 0.333333333333+(x-30.333333333333)*(x-30.333333333333)/(2961.000000000000-3*920.111111111111))))*2.920000000000 title 'min fit' with lines 2
BUGS
For more than 30 samples, there is an approximation for the T table
values used in computing confidence intervals. For fewer than 30
samples, the numbers must be looked up in a table. Because the author
of this program is lazy, only table values for 95% have been
recorded. Fortunately, 95% confidence intervals are quite common.
This man page could do a lot towards describing the relevance and
meaning of the statistics reported by stats. For example, what
significance does a correlation of 0.60 have? How are confidence
intervals to be interpreted? What information can be gathered from
just the standard deviation?
FUTURE WORK
To do:
- Make stats read in an environment variable for command line flags.
- Add histogram capabilities.
- Add more T-Table values for fewer than 30 samples and non-95%
confidence intervals. 90% and 99% are good candidates.
- Add a test for normality.
- Add a perl script package that helps with parsing input files.
- Add hypothesis testing, including P-Values.
- Might be nice if stats generated a gnuplot file, both for the data
and for the control variables.
- Add summary statistics for paired data.