Chapter 5: Lies, Damned Lies and Statistics

[1:] "Some notable murder cases have been tied to the lunar phases. Of the eight murders committed by New York's infamous "Son of Sam", David Berkowitz, five were during a full moon." [BBC News report]

[2:] The "Son of Sam" case introduced the term "serial killer" to our language. His capture on August 10, 1977 led, among other things, to my move to New York two weeks later (well, sort of). The matter of interest here, however, is the fact that the news story quoted above provides powerful supporting evidence for a common piece of folk wisdom: the phase of the Moon affects people's behavior.

[3:] Recently, I was having dinner with a college friend in Sydney, Australia where she now lives. Her twenty-something daughter was along. At one point, the conversation turned to behavior and the full moon which "everyone knows," the daughter said, "makes people act crazy."

[4:] "No," I couldn't resist injecting, "there is no evidence for that whatsoever. There have been dozens of statistical studies, and none has shown any connection."

[5:] "Well, you can prove anything you want with statistics" was the dismissive reply.

[6:] I must confess that I find this last statement more disturbing than the adherence to disproven folk wisdom. One cannot prove "anything you want" with statistics if they are applied to reliable data in a logical and mathematically correct fashion. Indeed, statistics is an essential tool for deciding whether or not a scientific result is significant and meaningful. In the hands of a scientifically minded person, statistics is, in fact, more often used to disprove what "you want" -- support for your favorite current theory.

[7:] A simple illustration can be derived from examination of the above news story, which was repeated thousands of times during the Son of Sam's reign and countless times since. The facts are as follows:

[8:] The dates and times of the murders (or shootings -- in fact, only six, not eight, of the victims died) were

[9:] July 26, 1976 at 1AM

October 23, 1976 at 3AM

November 26, 1976 at 11PM

January 30, 1977 at 12:10AM

March 8, 1977 at 10PM

April 17, 1977 at 3AM

June 26, 1977 at 3AM

July 31, 1977 at 1AM.

[10:] Now, the orbit of the Moon around the Earth has been very precisely determined -- in fact, by bouncing laser light off the Moon and timing its return, we measure the distance to the Moon every day to a few centimeters. Thus, it is straightforward to calculate the exact position of the Moon, and therefore its phase, at any point in the past or the future for tens of thousands of years. A nice web-based calculator is available; you can find the following phases for the dates given above:

[11:] 28.46 days, 29.44 days, 5.83 days, 9.58 days, 18.26 days, 28.32 days, 9.06 days, and 15.41 days.

[12:] The accounting system labels the new moon -- when it is aligned between the Sun and the Earth and is thus invisible to us -- as 0.00 days. During the year in question, the Moon had an orbital period around the Earth (and thus an interval from one new moon to the next) of 29.52 days (this varies slightly over time as the Moon is tugged about by the Sun and other planets). The full moon occurs halfway through the orbit at 14.76 days.

[13:] Using the data above, it is easy to calculate the number of days from the full moon of each attack by just taking the absolute value of the difference between the given phase and 14.76 days; i.e., the first murder occurred |28.46 - 14.76| = 13.70 days from the full moon, while the third occurred |5.83 - 14.76| = 8.93 days from the full moon. Summing these numbers and dividing by eight (the number of attacks) gives the average time between the full moon and the attacks: 8.24 days. Now, if the attacks had all occurred near the full moon as claimed, this average would be near zero. However, if they were randomly distributed with respect to the full moon (i.e., there was no connection), we would expect the times to be evenly distributed between 0 and 14.76 days, and thus, on average, to be 14.76/2 = 7.38 days. It is obvious that the true value is much closer to this expected value than to the value of 0.0 days postulated by the "lunatic" theory (and the claims of countless media reports). Assessing quantitatively the difference between these two hypotheses is what you will learn to do in this chapter.
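
The arithmetic above can be checked with a few lines of Python. This is only a sketch of the calculation described in the text; the variable names are made up for illustration, and the two-day window used to count attacks "near" the full moon is an arbitrary cutoff chosen here, not something taken from the news story.

```python
# Lunar phase (days since new moon) at each attack, as listed in the text.
phases = [28.46, 29.44, 5.83, 9.58, 18.26, 28.32, 9.06, 15.41]

SYNODIC_PERIOD = 29.52          # new moon to new moon, in days (from the text)
FULL_MOON = SYNODIC_PERIOD / 2  # 14.76 days

# Distance of each attack from the full moon, in days.
offsets = [abs(p - FULL_MOON) for p in phases]
mean_offset = sum(offsets) / len(offsets)

# If the attacks ignore the Moon, offsets are spread evenly over 0..14.76 days.
expected_if_random = FULL_MOON / 2

# Assumed, illustrative cutoff for "during a full moon" (not from the text).
NEAR_FULL_WINDOW = 2.0
near_full = sum(1 for d in offsets if d <= NEAR_FULL_WINDOW)

print(f"mean offset from full moon: {mean_offset:.2f} days")
print(f"expected if random:         {expected_if_random:.2f} days")
print(f"attacks within {NEAR_FULL_WINDOW} days of full moon: {near_full} of {len(offsets)}")
```

With this cutoff only one of the eight attacks falls near the full moon, which is hard to square with the "five of eight" claim in the news report.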

HOW NOT TO LIE WITH STATISTICS

[14:] Experiments have been a key component of science since the Renaissance when Simon Stevin (not Galileo) dropped two weights from a tower in Delft (not Pisa). Experiments produce measurements which are combined and compared to the predictions of models. A good model makes specific, quantitative predictions; thus, for a useful comparison to be made, an experiment must yield either numerical results or the classification of outcomes into clearly defined categories. This section explores some of the rules scientists adopt when dealing with measurements and the numbers that describe them.

Accuracy and Precision

[15:] We begin by drawing a distinction between two words which are often used interchangeably in everyday speech, but which in science mean two quite different things: "accuracy" and "precision".

[16:] On the archery range, you might say: "She shot with great accuracy", or "She shot with great precision" and mean more or less the same thing. However, as you probably learned as a child (if you grew up in this backward country that still resists the otherwise universal metric system), the average human body temperature is 98.6 degrees Fahrenheit. To a scientist, this number is quite precise: it is quoted to a precision of one part in 1000 (i.e., it is not 98.5 or 98.7, but 98.6). However, the number is not accurate. The average temperature of the normal person is actually about 0.5 degrees lower or 98.1 degrees Fahrenheit (and different individuals have different averages). Furthermore, body temperature varies by at least half a degree over the course of each day. A good thermometer can be used to determine the body's temperature to an accuracy of 0.1 degrees Fahrenheit or even better, but most of the time - and for most people - it will not be 98.6 F.

[17:] The precision of a number is represented by the number of significant figures it contains, where a significant figure is defined by these rules. A simple way of determining the number of significant figures is by writing the number in scientific notation -- $1000 = 1 \times 10^{3}$, while $1000.1 = 1.0001 \times 10^{3}$ -- and then by counting the number of digits in the prefix.

[18:] The rule of thumb in calculating a number derived from measurements is to report the number of significant digits of the least precise measurement, plus one. If you use fewer digits than this, you throw away part of the information you have received. If you use more digits, you are "inventing" information that you actually do not have.

[19:] For example, take the ratio 2/3. It is mathematically correct to approximate this fraction as 0.66666666667. But if the numbers 2 and 3 have physical significance, you must report their calculated ratio more conservatively. Say a scientist records a falling body as covering 2 meters in 3 seconds. The correct quotation for the velocity of the object is not 0.66666666667, but rather 0.67 meters per second. The expression "0.66666666667 m/s" has a precision of eleven decimal places and implies that the scientist knows the speed to about one part in a hundred billion. That would be quite a high-tech experiment indeed! In fact, "2 meters" and "3 seconds" each have a precision of only one significant figure. Following convention, our answer must have 2 significant figures, where the appropriate rounding has occurred. We would obtain the same result if the data were reported as 2 meters in 3.005 seconds. The expression "2 meters", which has one significant figure, denotes our least precise measurement (compared to "3.005 seconds", which has four significant figures). Furthermore, the least precise number determines the accuracy of the result and therefore the precision to which it should be recorded. (Here is a second example, in case you're confused.)
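
The rounding convention can be sketched in Python as follows. The helper `round_sig` is a hypothetical function written for this illustration (it is not a built-in); it rounds a value to a chosen number of significant figures by locating the leading digit.

```python
from math import floor, log10

def round_sig(value, sig_figs):
    """Round value to sig_figs significant figures (illustrative helper)."""
    if value == 0:
        return 0.0
    exponent = floor(log10(abs(value)))   # power of ten of the leading digit
    return round(value, sig_figs - 1 - exponent)

# Least precise input has 1 significant figure, so report 1 + 1 = 2 figures.
print(round_sig(2 / 3, 2))       # 2 meters in 3 seconds
print(round_sig(2 / 3.005, 2))   # 2 meters in 3.005 seconds
```

Both lines print 0.67: the extra digits in "3.005 seconds" do not help, because the distance is still known to only one significant figure.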

Error and Uncertainty

[20:] No measurement is perfect. All measurements have associated uncertainties (with the possible exception of counting discrete items). Adopting a somewhat sloppy use of English, scientists often refer to such uncertainties as "errors". They are not errors in the sense of "mistakes". Rather they represent the amount by which a measurement may differ from the true value of the measured quantity. In the quantum realm of individual atoms, there is an inherent uncertainty in measurement that cannot be overcome by clever instruments or more careful procedures. In the macroscopic world, however, measurement error inevitably results from using instruments of finite precision which must accomplish a measuring task in a finite amount of time.

[21:] If you measure the length of your desk with a foot-long ruler as accurately as you can, and then ask your next-door neighbor to bring in a ruler and do the same, it is quite unlikely that you will get exactly the same answer. Two types of errors will lead to the discrepant results: systematic errors and statistical (or random) errors.

[22:] Systematic errors arise from such problems as the fact that the two rulers will not be precisely the same length, or that the temperature of the room may change between the measurements, and the desk may have expanded or contracted a bit as a result. There may also be differences in the way that you and your neighbor approach the task. In some cases, sources of systematic error can be reduced or eliminated: e.g., you could make sure the room temperature remained constant, or you could ship the desk to Sevres outside of Paris where the world's standard meter stick resides and use that to measure the length. Often, however, we are unaware of systematic errors ("How did I know the cheap Bookstore rulers are really only 11 inches long?") or have no easy way to control them (just try keeping your room's temperature constant to within 20 degrees or so, using Residence Halls' radiators). Repeating a measurement using different instruments -- or even better, different techniques -- is the best way to discover and correct systematic errors.

[23:] Random errors may also be difficult to eliminate, but they are both quantifiable and reducible, and the subject of statistics holds the key. In the case of the desk-length measurement, random errors arise from such things as not perfectly aligning the ruler with the edge of the desk or not precisely moving the ruler exactly one ruler's length each time. Unlike systematic errors such as those produced by the short ruler, random errors - as their name suggests - have both plusses and minuses, producing answers too long and too short with roughly equal frequency. Their random nature makes these errors more tractable.

[24:] Good scientific measurements must be qualified by stating an uncertainty, or error. The standard notation is to quote a result +/- an error. In some cases, random and systematic errors are reported consecutively; e.g., 5.02 ft +/- 0.17 ft +/- 0.3 ft which translates to a random error of 0.17 ft and a systematic error of 0.3 ft. What does the +/- sign mean? Focusing on the random error for the moment, it does not, as one might expect, imply that the true length of the desk is definitely between 4.85 and 5.19 feet. Rather, it represents a shorthand for describing the probability that the true value lies within a certain range of the quoted measurement. Although the distribution of errors, and thus the range of probabilities, is not always easy to determine, in some cases of interest, it is safe to assume that the individual measurements are distributed in a "normal distribution", more commonly referred to as a "bell curve" distribution. This curve is also known as a Gaussian curve after the mathematician Carl Friedrich Gauss. It is described by the equation:
$$P(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x-\mu)^{2}/(2\sigma^{2})}$$ where σ = standard deviation, and μ = mean of x.

[25:] It looks like this: GAUSSIAN GRAPH

[26:] The curve's peak represents the average or mean value of all the measurements: if you had the time to make an infinite number of measurements so that all the random errors averaged out, this would represent the true value of the quantity of interest (e.g., the length of the desk). The value of sigma (σ) determines the width of the curve and is called the "standard deviation".

[27:] It is logical at this point to ask -- how do I know the value of sigma? In other words, how do I know the width of the distribution of errors in my measurement so that I can assign one standard deviation as its statistical uncertainty? The most straightforward way to determine this quantity is actually to measure it. That is, if you perform the desk-length measuring exercise 100 times and plot the results in the form of a histogram (see Chapter 3) of the values you obtain, it will probably look very much like this HISTOGRAM

[28:] which, as you can see, is approximated quite well by a Gaussian distribution.

[29:] The standard deviation is a convenient way to characterize the spread of a set of measurements. But if you do go through the trouble of repeating a measurement many times, shouldn't you gain something beyond just knowing how big your random errors are? Shouldn't doing many measurements improve the accuracy of the answer? Indeed, it does. The quantity sigma will not change if you take 20 measurements or 200 -- it is just a description of the irreducible errors associated with your measuring technique. But random errors, by their very nature, tend to "average out" if many measurements are combined. The average or MEAN value of twenty-five estimates is a better approximation of the true value than an average of nine. The "error" quoted on such mean values, then, is not the error which characterizes each measurement, but the "error in the mean". It turns out this is simply given by σ/√(N), where N is the number of measurements you made. Thus, to be quantitative, the mean of 25 measurements is 5 (=√(25)) times better than a single measurement.
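
A small simulation can make the σ/√(N) rule concrete. The sketch below assumes Gaussian random errors with σ = 0.17 ft around a true length of 5.02 ft (the desk numbers used in this section); the function name, the number of repeats, and the fixed random seed are choices made for this illustration only.

```python
import random
import statistics

def spread_of_means(n_measurements, sigma=0.17, true_length=5.02,
                    n_repeats=2000, seed=0):
    """Scatter (standard deviation) of the mean of n_measurements readings.

    Each reading is the true length plus a Gaussian random error of width
    sigma; the whole experiment is repeated n_repeats times.
    """
    rng = random.Random(seed)
    means = []
    for _ in range(n_repeats):
        readings = [rng.gauss(true_length, sigma) for _ in range(n_measurements)]
        means.append(statistics.fmean(readings))
    return statistics.stdev(means)

for n in (1, 9, 25):
    simulated = spread_of_means(n)
    predicted = 0.17 / n ** 0.5
    print(f"N={n:2d}  simulated {simulated:.3f} ft   sigma/sqrt(N) {predicted:.3f} ft")
```

The simulated scatter of the mean should track the predicted value closely, shrinking by roughly a factor of 3 for nine measurements and 5 for twenty-five, while the scatter of the individual readings stays at sigma.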

[30:] The usual practice in quoting the statistical error on a measurement is to quote +/- 1 sigma or one standard deviation. Integrating under the Gaussian curve reveals that 68% of its area lies between +/- 1 sigma of the mean. Thus, the literal meaning of the measurement of the desk's length reported as 5.02 ft +/- 0.17 ft is, "This measurement is consistent with the true length of the desk lying in the range 4.85 to 5.19 feet at the one-sigma level (meaning there is only a 32% chance that the true value lies outside this range)."
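
The 68% figure need not be taken on faith. The area under a Gaussian within k standard deviations of the mean can be written in terms of the error function, erf(k/√2), which Python's standard `math` module provides. The function name `fraction_within` is made up for this sketch.

```python
from math import erf, sqrt

def fraction_within(k):
    """Fraction of a Gaussian's area within k standard deviations of the mean."""
    return erf(k / sqrt(2))

for k in (1, 2, 3):
    print(f"within {k} sigma: {fraction_within(k):.4f}")
```

This gives about 68% for one sigma, 95% for two, and 99.7% for three, which is why a result quoted at +/- 1 sigma still leaves roughly a one-in-three chance that the true value lies outside the quoted range.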

[31:] The Gaussian distribution allows for errors of any size; i.e., it is a continuous distribution, as is appropriate for, say, measuring the length of a desk, which might be in error by 0.1237 inches or 0.1238 inches. In many cases, however, a finite number of measurements may yield only one of a finite number of outcomes. Let me explain this apparently obscure statement with an example.

[32:] Suppose you were to do the experiment of flipping a coin five times. The result of each experiment is unambiguous: it is either a head or a tail. There is no "error" in each outcome. However, you might be interested in how many times when you do this experiment you will get three heads. Harkening back to our definition of probability, all this requires is to enumerate all the possible outcomes of the experiment, and then divide that number into the number of desired outcomes, three heads. With only five flips, this is a practical, if tedious, approach; the possibilities are enumerated and the probability calculated here. But suppose you wanted to know how many times fifty flips would yield 31 heads. Or, equivalently, how many times you would catch 31 female frogs out of 50 if the gender ratio in a pond was exactly 50:50? It would clearly be useful to have a general formula to calculate the probabilities of these outcomes.
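
For five flips, the tedious enumeration can be handed to a computer. The following sketch lists every possible sequence of heads (H) and tails (T) and counts those with exactly three heads; the variable names are illustrative.

```python
from itertools import product

# Every possible sequence of five flips: 2**5 = 32 of them.
outcomes = list(product("HT", repeat=5))

# Keep only the sequences containing exactly three heads.
three_heads = [o for o in outcomes if o.count("H") == 3]

probability = len(three_heads) / len(outcomes)
print(len(three_heads), "of", len(outcomes), "=", probability)
```

This prints `10 of 32 = 0.3125`. The same approach for fifty flips would require listing 2**50 (about a million billion) sequences, which is why a general formula is worth having.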

[33:] And there is one. It is called the binomial distribution. The derivation is not overly complicated, but I will spare you the details here. The probability of m successes in n trials when the probability of the outcome of interest is p is given by:
$$P(m \text{ successes in } n \text{ trials}) = \frac{n!}{m!\,(n-m)!}\; p^{m}\,(1-p)^{n-m}$$

[34:] where n! is just the factorial which equals 1 x 2 x 3 x 4... x (n-1) x n

[35:] OK, so it looks complicated, but it is not nearly as bad as writing down every possibility for catching 50 frogs and then counting up how many of them give you 31 females. For example, let's use this distribution to answer the original question: What is the chance of getting three heads in five flips?

[36:] n=5, m=3, and p=1/2 (the odds of getting heads is 50%), so
$$\frac{5!}{3!\,2!} \times \left(\tfrac{1}{2}\right)^{3} \times \left(\tfrac{1}{2}\right)^{2} = 10 \times \tfrac{1}{8} \times \tfrac{1}{4} \approx 0.31$$ or 31%, which equals 10/32, the answer we derived from the tedious enumeration.

[37:] For the frogs, the answer is 0.027 or only 2.7%. If you got this answer, you might begin to wonder if the gender ratio in your frog pond was really 50:50.
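
Both numbers can be reproduced directly from the binomial formula. The sketch below uses `math.comb` from the Python standard library for the n!/[m!(n-m)!] factor; the function name `binomial_probability` is made up for this illustration.

```python
from math import comb

def binomial_probability(m, n, p):
    """Probability of exactly m successes in n trials, each with probability p."""
    return comb(n, m) * p**m * (1 - p)**(n - m)

heads = binomial_probability(3, 5, 0.5)     # three heads in five flips
frogs = binomial_probability(31, 50, 0.5)   # 31 females in 50 frogs

print(f"3 heads in 5 flips:   {heads:.4f}")
print(f"31 females in 50:     {frogs:.4f}")
```

The first value is 10/32 and the second is about 0.027, matching the figures quoted in the text.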

[38:] In some experiments, the goal is simply to count events that are occurring at random times and compare this to the known average rate at which they are expected to occur. One of the important examples of this is in using the radioactive decay of atomic nuclei as clocks to date ancient materials. We now know the age of the Earth to better than 1% through the application of such techniques, and have dated the Shroud of Turin (the putative burial cloth of Jesus) to the 1350s (when church documents show it was forged by an artist in the employ of a corrupt bishop). In these applications, we use the fact that each type of radioactive nucleus decays at a precisely determined average rate, although the individual decays happen at random moments. For example, the heavy isotope of carbon, C-14, will lose half of its atoms in about 5730 years. We can determine how much of the C-14 breathed in by a plant (or the linen made from it) is left by counting the number of decays (signaled by the emission of a high energy electron) and thus determine the age of the sample directly.

[39:] The uncertainty in this kind of counting experiment is given by a third kind of distribution called the Poisson distribution. For an observed number of counts N and an average number expected a, the probability of getting N counts in a given trial is

[40:] $$P_{a}(N) = \frac{a^{N} e^{-a}}{N!}$$

[41:] For example, if an archeologist found a fragment of cloth in a Central American excavation and wanted to know its age she might turn to C-14 dating. The cloth is of a relatively tight weave. Did the first New World settlers 15,000 years ago know how to weave so skillfully (did aliens teach them how to do it?), or is the fragment from the time of the Conquistadors 500 years ago? Putting the cloth in a device to count radioactive decays, we record 1, then 2, then 0, then 0, then 2 counts in the first five one-minute counting intervals. Given the size of the sample and the decay rate of C-14, we would expect an average counting rate of 0.917 decays per minute if the cloth is 500 years old, and only 0.074 decays per minute if it is 15,000 years old. The Poisson distribution tells us the probability for getting 0, 1, and 2 counts in each one-minute interval. In this case, we expect 0 counts 40% of the time, 1 count 37% of the time, and 2 counts 17% of the time if the cloth is 500 years old, and 0 counts 93% of the time, 1 count 6.9% of the time, and 2 counts about 0.25% of the time if it is 15,000 years old. Clearly, our results favor the younger age -- if the cloth were 15,000 years old, there would be only about a 1 in 400 chance of seeing 2 counts in a single minute, whereas this happened twice in the first five minutes. Continuing the experiment for another ten minutes or so would seal the case.
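
The percentages above follow from the Poisson formula and can be reproduced, to within rounding, with the sketch below. The two counting rates and the five observed counts are taken from the text; the function names are illustrative, and the final step (multiplying the per-minute probabilities to compare the two ages) is an added illustration that assumes the five one-minute intervals are independent.

```python
from math import exp, factorial

def poisson_probability(n_counts, mean_rate):
    """Probability of seeing n_counts when mean_rate counts are expected."""
    return mean_rate**n_counts * exp(-mean_rate) / factorial(n_counts)

RATE_500_YR = 0.917      # expected decays per minute if 500 years old
RATE_15000_YR = 0.074    # expected decays per minute if 15,000 years old
observed = [1, 2, 0, 0, 2]   # counts in the first five one-minute intervals

for label, rate in (("500 yr", RATE_500_YR), ("15,000 yr", RATE_15000_YR)):
    probs = [poisson_probability(n, rate) for n in (0, 1, 2)]
    print(label, " ".join(f"P({n})={p:.2%}" for n, p in enumerate(probs)))

def likelihood(counts, rate):
    """Probability of the whole sequence of counts, assuming independence."""
    result = 1.0
    for n in counts:
        result *= poisson_probability(n, rate)
    return result

ratio = likelihood(observed, RATE_500_YR) / likelihood(observed, RATE_15000_YR)
print(f"observed counts are about {ratio:.0f} times more probable at 500 yr")
```

Under these assumptions the five observed counts are thousands of times more probable for the younger age than for the older one, which puts a number on "clearly favor".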

[42:] In fact, the Poisson distribution and the Gaussian distribution can both be derived as limiting cases of the binomial distribution. We use the former in situations where the expected events are relatively rare and their average occurrence rate is well-known. The Gaussian distribution is useful when our sample size is large. In all cases, though, these probability distributions allow us to dispassionately assess the outcome of an experiment or observation, and offer a guide for future behavior. The result of the Central American cloth dating suggests it would be a waste of time to pursue the alien weavers hypothesis, whereas the frog gender ratio might well lead the ecologist to question whether polluting chemicals might be altering the natural ratio of males to females and prompt him to propose further experiments.
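
The connection between the binomial and Poisson distributions can be seen numerically. In the sketch below, one minute of counting is imagined as n tiny time slices, each with a small chance p = a/n of containing a decay, so that the expected number of counts stays fixed at a. The rate a = 0.917 is borrowed from the cloth example; the slicing picture and the function names are assumptions made for this illustration.

```python
from math import comb, exp, factorial

def binomial(m, n, p):
    """Probability of exactly m successes in n trials with success chance p."""
    return comb(n, m) * p**m * (1 - p)**(n - m)

def poisson(m, a):
    """Probability of exactly m counts when a counts are expected."""
    return a**m * exp(-a) / factorial(m)

mean = 0.917   # expected counts per minute (from the cloth example)
for n in (10, 100, 10000):
    p = mean / n   # chance of a decay in each of n equal time slices
    print(f"n={n:6d}  binomial P(2)={binomial(2, n, p):.5f}  "
          f"Poisson P(2)={poisson(2, mean):.5f}")
```

As n grows and p shrinks, the binomial probability settles onto the Poisson value, which is the sense in which rare events with a known average rate are described by the Poisson distribution.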

[43:] And this is the real purpose of statistics: to provide the scientist with a quantitative assessment of the uncertainty in a measurement, and the likelihood that, given an assumed model, the measured value is consistent with the model's prediction. Note that I say "consistent with" -- not that a measurement proves that a model is correct. The world of science is not about proofs (that's the realm of mathematics and, perhaps, of philosophy). While often a precise discipline which strives for accuracy, science is ever-aware of the inherent and unavoidable uncertainty in its measurements and accounts for it explicitly when building its models of the natural world.