Statistical Notes
#4: A Matter Of Deviation
By: T. V. Nguyen
The author of Ecclesiastes remarked that "No man can find out the work that God maketh from the beginning to the end". Indeed, we cannot measure cosmic rays everywhere all the time. We cannot try a new drug on everybody. No one can test every shell or bomb that they manufacture. We have to be content with SAMPLES. The measurements involved in every scientific experiment constitute a sample of that unlimited set of measurements which would result if one performed the same experiment over and over indefinitely. This total set of potential measurements is referred to as the POPULATION. Once a sample of data is collected, we are interested in four questions: (i) how can one describe the sample usefully and clearly; (ii) from the evidence of this sample, how does one best infer conclusions concerning the total population; (iii) how reliable are these conclusions; and (iv) how should samples be taken in order that they may be as illuminating and dependable as possible? In this note, I will discuss answers to the first question.
In the previous article, I mentioned the average (or the mean) as a measure of the central position of a data set. I also pointed out a few pitfalls associated with this statistic. For example, the mean income of a certain class at Harvard University is not a very useful figure if the class happens to include one man who has an income of half a million dollars. The average does not tell us the whole story. Consider the following data on the ages of three children in each of three groups:
(a) 6, 6, 6
(b) 6, 5, 7
(c) 2, 1, 15
The mean of all three groups is 6 years. But, as you can see, this mean certainly does not adequately reflect the true picture of each data set. In (a) the three values are the same; in (b) the mean seems to be a reasonable representation; while in (c) the mean is a hopeless statistic. We want to know more from the data. We want to know the extent to which the values differ from this mean. The term DISPERSION is used to describe the degree to which a set of values vary about their mean. Other terms that convey this same concept are VARIATION, SCATTER and SPREAD. When a set of values are all close to the mean, they exhibit less dispersion than when some of the values are much larger and/or much smaller than the mean. Four descriptive measures used to express the amount of dispersion in a set of data are the range, the average deviation, the variance and the standard deviation.
The range is defined as the difference between the largest and smallest value in a data set. In the above example, the range in (a) is 0, in (b) 2 and in (c) 14. The range, although easy to compute, is usually an unsatisfactory measure of dispersion, since only two values are used in the computation. In other words, the range does not make use of all the information available in the data it is supposed to describe.
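For readers who would like to verify these figures by computer, here is a minimal sketch in Python (the language and the function name are my own choices, not part of this column):

    # Range: the largest value minus the smallest value
    def data_range(values):
        return max(values) - min(values)

    print(data_range([6, 6, 6]))   # (a) -> 0
    print(data_range([6, 5, 7]))   # (b) -> 2
    print(data_range([2, 1, 15]))  # (c) -> 14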
The AVERAGE DEVIATION expresses the average amount by which a set of values differ from their mean. It takes into account the deviation of each value from the mean, x(i) - mean. However, the sum of these deviations, and hence their mean, is always equal to zero. Therefore, some modification must be made in the procedure if it is to lead to a valuable measure of dispersion. An appropriate modification is to take the mean of the deviations while ignoring their signs, that is, the mean of the absolute values of the deviations. The procedure is expressed in the following formula:
Ave Dev = Sum of |x(i) - mean| / N
In the above examples, the average deviation is calculated as:

(a) Ave Dev = [ |6-6| + |6-6| + |6-6| ] / 3 = 0
(b) Ave Dev = [ |6-6| + |5-6| + |7-6| ] / 3 = 0.67
(c) Ave Dev = [ |2-6| + |1-6| + |15-6| ] / 3 = 6
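The same calculation can be written as a short Python sketch (again, the function name is my own):

    # Average deviation: the mean of the absolute deviations from the mean
    def average_deviation(values):
        m = sum(values) / len(values)
        return sum(abs(x - m) for x in values) / len(values)

    print(average_deviation([6, 6, 6]))   # (a) -> 0.0
    print(average_deviation([6, 5, 7]))   # (b) -> 0.666...
    print(average_deviation([2, 1, 15]))  # (c) -> 6.0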
Although the average deviation is an intuitive measure of dispersion, its usefulness is limited because it does not lend itself to further mathematical manipulation. Consequently, it is seldom used as a measure of dispersion.
The VARIANCE, like the average deviation, makes use of the deviation of each individual value, x(i), from the mean, that is, x(i) - mean. In computing the variance, negative differences are avoided by squaring the deviations, rather than taking their absolute values. The variance of a sample of data, then, may be computed from the formula:
Var = Sum of [x(i) - mean]**2 / N
Thus, the variance is simply the average of the squared deviations of the individual values from their mean. The numerator is called the SUM OF SQUARES ABOUT THE MEAN. The symbol s**2 is used to designate the sample variance. The sample variance can be used to estimate the (unknown) population variance, and when this use is made, the denominator is (N-1) rather than N, i.e.

s**2 = Sum of [x(i) - mean]**2 / (N-1)
In the above example, the sample variance is:

(a) s**2 = [ (6-6)**2 + (6-6)**2 + (6-6)**2 ] / 2 = 0
(b) s**2 = [ (6-6)**2 + (5-6)**2 + (7-6)**2 ] / 2 = 1
(c) s**2 = [ (2-6)**2 + (1-6)**2 + (15-6)**2 ] / 2 = 61
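In Python, the (N-1) version of the formula might look like this sketch:

    # Sample variance: sum of squares about the mean, divided by (N - 1)
    def sample_variance(values):
        m = sum(values) / len(values)
        return sum((x - m) ** 2 for x in values) / (len(values) - 1)

    print(sample_variance([6, 6, 6]))   # (a) -> 0.0
    print(sample_variance([6, 5, 7]))   # (b) -> 1.0
    print(sample_variance([2, 1, 15]))  # (c) -> 61.0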
Note that the variance has units of years squared (since age is measured in years), which is sometimes impractical. It is desirable to convert this back to the original unit (years). This can be done by taking the positive square root of the variance, and the result is called the STANDARD DEVIATION. This is one of the most widely used measures of dispersion in statistics. The standard deviation is often denoted by s; in the example:

(a) s = 0
(b) s = sqrt(1) = 1 year
(c) s = sqrt(61) = 7.8 years
Thus, although the three data sets have an identical mean, the standard deviation tells us that there is no variation in (a), while the variability in (c) is almost 8 times that in (b).
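A corresponding Python sketch simply takes the square root of the sample variance:

    import math

    # Standard deviation: positive square root of the sample variance
    def sample_std(values):
        m = sum(values) / len(values)
        var = sum((x - m) ** 2 for x in values) / (len(values) - 1)
        return math.sqrt(var)

    print(sample_std([6, 6, 6]))   # (a) -> 0.0
    print(sample_std([6, 5, 7]))   # (b) -> 1.0
    print(sample_std([2, 1, 15]))  # (c) -> 7.81...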
The standard deviation has an important implication in the description of data. P. L. Chebyshev (1821-1894), an eminent Russian mathematician, showed that in any set of observations with mean M and standard deviation S:

at least 75% of the observations are expected to fall within M +/- 2S
at least 89% of the observations are expected to fall within M +/- 3S
at least 94% of the observations are expected to fall within M +/- 4S
Thus, given a mean and a standard deviation, we can place guaranteed lower bounds on how the data cluster about the mean, whatever the shape of their distribution. If, in fact, the data are distributed symmetrically in a bell shape, as in Figure 1, then much sharper statements than Chebyshev's guarantees apply (the so-called empirical rule):

approximately 68% of the observations are expected to fall within M +/- S
approximately 95% of the observations are expected to fall within M +/- 2S
approximately 99.7% of the observations are expected to fall within M +/- 3S
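One can check these percentages by simulation. The following Python sketch draws 10,000 values from a bell-shaped (normal) distribution and counts how many fall within 1, 2 and 3 standard deviations of the mean; the mean of 80 and standard deviation of 5 are borrowed from the bone-mass example below:

    import random

    random.seed(1)
    data = [random.gauss(80, 5) for _ in range(10000)]  # simulated bell-shaped data

    m = sum(data) / len(data)
    s = (sum((x - m) ** 2 for x in data) / (len(data) - 1)) ** 0.5

    for k in (1, 2, 3):
        inside = sum(1 for x in data if abs(x - m) <= k * s)
        print(k, inside / len(data))  # roughly 0.68, 0.95, 0.997

The observed proportions comfortably exceed Chebyshev's guaranteed minima (75% within 2S, 89% within 3S), as they must.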
How can we compare two samples of data which are measured in different units, say, in kg and in cm? The mean and standard deviation provide a way of getting around the problem of different units of measurement. Consider the quantity which statisticians call the "z-score":

z = [x(i) - M] / s

where x(i) is the ith value, M is the mean and s is the standard deviation. As you can see, the z-score does not have a unit. Thus, transforming the data from their original units to z-scores allows comparisons to be made.
The following figure graphs the number of women (y-axis) classified by their bone mass (x-axis). The mean bone mass is 80 mg and the standard deviation is 5 mg. For a woman whose bone mass is 70 mg, the z-score is (70-80)/5 = -2; for a woman whose bone mass is 80 mg, the z-score is (80-80)/5 = 0; and a woman with a bone mass of 95 mg has a z-score of (95-80)/5 = 3. This transformation from mg to z-scores is shown beneath the original units in the figure. An interesting feature of the z-score is that it always has a mean of 0 and a standard deviation of 1. Thus, by simply knowing whether a z-score is positive or negative, we know whether the bone mass is above or below the mean. The larger the absolute value of the z-score, the further that bone mass is from the mean. It should be noted that this is also the method most educational authorities use to standardise students' marks across schools.
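A z-score sketch in Python makes the arithmetic explicit (the function name is mine):

    # z-score: a unitless measure of distance from the mean,
    # expressed in standard deviations
    def z_score(x, m, s):
        return (x - m) / s

    print(z_score(70, 80, 5))  # -> -2.0
    print(z_score(80, 80, 5))  # ->  0.0
    print(z_score(95, 80, 5))  # ->  3.0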
[Figure: a roughly bell-shaped histogram of the number of women (y-axis, up to about 200) against bone mass in mg (x-axis, 60 to 100 mg), with the corresponding z-scores (-4 to +4) marked beneath the mg scale]
FIGURE 1: Graph of a distribution
of bone mass with a mean of 80 mg and a standard deviation of 5 mg.
Now, we can state the empirical rule even more simply by using the z-score:

approximately 68% of the z-scores are expected to fall within +/- 1
approximately 95% of the z-scores are expected to fall within +/- 2
approximately 99.7% of the z-scores are expected to fall within +/- 3
Next time, I will discuss how to
use this z-score to work out whether Deng Xiao Ping had chosen Hu Yao Bang
(as his successor) by chance.
See you next time.
Tuan Nguyen, Ph.D.
[email protected]
For discussion on
this column, join [email protected]
Copyright ©
1996 by VACETS and Tuan Nguyen