In defense of five standard deviations ~ TheReference

Originally posted on August 12th. The second, third, and fourth (last) parts were added at the end.

Five standard deviations are cute.

However, Tommaso Dorigo wrote the first part of his two-part "tirade against the five sigma":

Demistifying The Five-Sigma Criterion

I mostly disagree with his views. The disagreement begins with the first word of the title ;-) that I would personally write as "demystifying" because what we're removing is mystery rather than mist (although the two are related words for "fog").

He "regrets" that the popular science writers tried to explain the five-sigma criterion to the public – I think they should be praised for this particular thing because the very idea that the experimental data are uncertain and scientists must work hard and quantitatively to find out when the certainty is really sufficient is one of the most universal insights that people should know about the real-world science.

When I was a high school kid, I mostly disliked all this science about error margins, uncertainties, standard deviations, noise. This sentiment of mine must have been a rather general symptom of a theorist. Error margins are messy. They're the cup of tea of the sloppy experimenters while the pure and saintly theorist only works with perfect theories making perfect predictions about the perfectly behaving Universe.

Of course, sometime early in college, I was forced to get dirty a little bit, too. You can't really do any empirical research without some attention paid to error margins and probabilities that the disagreements are coincidental. As far as I know, the calculation of standard deviations was one of the things that I did not learn from any self-studies – these topics just didn't previously look beautiful and important to me – and the official institutionalized education system had to improve my views. The introduction to error margins and probabilistic distributions in physics was a theoretical introduction to our experimental lab courses. It was taught by experimenters and I suppose that it was no accident because they were more competent in this business than most of the typical theorists.

At any rate, I found out that the manipulations with the probability distributions were a nice and exact piece of maths by themselves – even though they were developed to describe other, real things, that were not certain or sharp – and I enjoyed finding my own derivations of the formulae (the standard deviations for the coefficients resulting from linear regression were the most complex outcomes of this fun research).

At any rate, hypotheses predict that a quantity \(X\) should be equal to \(\bar X\pm \Delta X\) if I use simplified semi-layman's conventions. The error margin – well, the standard deviation – \(\Delta X\) is never zero because our knowledge of the theory, its implications, or the values of the parameters we have to insert into the theory is never perfect.

Similarly, the experimenters measure the value to be \(X_{\rm obs}\) where the subscript stands for "observed". The measurement also has its error margin. The error margin has two main components, the "statistical error" and the "systematic error". The "total error" for a single experiment may always be calculated (using the Pythagorean theorem) as the hypotenuse of the triangle whose legs are the statistical error and the systematic error, respectively.

The difference between the statistical error and the systematic error is that the statistical error contains all the contributions to the error that tend to "average out" when you're repeating the measurement many times. They're averaging out because they're not correlated with each other so about one-half of the situations are higher than the mean and one-half of them are lower than the mean etc. and most of the errors cancel. In particular, if you repeat the same dose of experiments \(N\) times, the relative statistical error decreases by a factor of \(\sqrt{N}\). For example, the LHC has to collect many collisions because the certainty of its conclusions and discoveries is usually limited by the "statistics" – by their having an insufficient number of events that can only draw a noisy caricature of the exact graphs – so it has to keep on collecting new data. If you want the relative accuracy (or the number of sigmas) to be improved \(K\) times, you have to collect \(K^2\) times more collisions. It's that simple.

On the other hand, the systematic error is an error that always stays the same if you repeat the experiment. If the CERN folks had incorrectly measured the circumference of the LHC to be 27 kilometers rather than 24.5 kilometers, this will influence most of the calculations and the 10% error doesn't go away even after you perform quadrillions of collisions. All of them are affected in the same way. Averaging over many collisions doesn't help you. Even the opinions of two independent teams – ATLAS and CMS – are incapable of fixing the bug because the teams aren't really independent in this respect as both of them use the wrong circumference of the LHC. (This example is a joke, of course: the circumference of the LHC is known much much more accurately; but the universal message holds.)
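The contrast between the two kinds of error can be illustrated with a toy simulation. This is only a sketch under assumed numbers: `TRUE_VALUE`, `NOISE_SIGMA`, and `BIAS` are hypothetical constants chosen for illustration and have nothing to do with any real LHC measurement.

```python
import random
import statistics

TRUE_VALUE = 100.0   # hypothetical quantity being measured
NOISE_SIGMA = 5.0    # per-event random scatter (the statistical part)
BIAS = 2.0           # constant miscalibration shared by every event (the systematic part)

def measured_mean(n_events, rng):
    """Average n_events readings; each has fresh noise but the same bias."""
    readings = [TRUE_VALUE + BIAS + rng.gauss(0.0, NOISE_SIGMA) for _ in range(n_events)]
    return statistics.fmean(readings)

def summarize(n_events, n_repeats=200, seed=1):
    """Repeat the whole experiment; return (mean offset from truth, scatter of the result)."""
    rng = random.Random(seed)
    results = [measured_mean(n_events, rng) for _ in range(n_repeats)]
    return statistics.fmean(results) - TRUE_VALUE, statistics.stdev(results)

for n in (25, 100, 400):
    offset, scatter = summarize(n)
    print(f"N={n:4d}  scatter={scatter:.3f}  offset={offset:.3f}")
```

In this toy setup the scatter should roughly halve each time the number of events is quadrupled, while the offset stays near the assumed bias no matter how many events are averaged.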

When you're adding error margins from two "independent" experiments, like from the ATLAS collisions and the CMS collisions, you may add the statistical errors for "extensive" quantities (e.g. the total number of all collisions or collisions of some kind by both detectors) by the Pythagorean theorem. It means that the statistical errors in "intensive" quantities (like fractions of the events that have a property) decrease as \(1/\sqrt{N}\) where \(N\) is the number of "equal detectors". However, the systematic errors have to be added linearly, so the systematic errors of "intensive" quantities don't really drop and stay constant when you add more detectors. Only once you calculate the total systematic and statistical errors in this non-uniform way may you add them (total statistical and total systematic) via the Pythagorean theorem (physicists say "add them in quadrature").
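The bookkeeping described above can be written down in a few lines. This is a minimal sketch that assumes the idealized case from the text: `n_detectors` identical detectors whose statistical errors are independent and whose systematic error is fully shared. The function names are illustrative.

```python
import math

def total_error(stat, syst):
    """Combine the statistical and systematic error of one result in quadrature."""
    return math.hypot(stat, syst)

def combine_equal_detectors(stat, syst, n_detectors):
    """Errors on an averaged ("intensive") quantity from n equal detectors.

    Assumes independent statistical errors and a systematic error that is
    100% correlated between the detectors.
    """
    stat_combined = stat / math.sqrt(n_detectors)  # quadrature sum of n terms, divided by n
    syst_combined = syst                           # linear sum of n terms, divided by n
    return stat_combined, syst_combined, total_error(stat_combined, syst_combined)

print(total_error(3.0, 4.0))                  # legs 3 and 4 give hypotenuse 5.0
print(combine_equal_detectors(3.0, 4.0, 4))   # statistical part halves, systematic part stays
```

Real combinations usually involve partially correlated systematics, which this sketch deliberately ignores.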

So far, all the mean values and standard deviations are given by universal formulae that don't depend at all on the character or shape of the probabilistic distribution. For a distribution \(\rho(X)\), the normalization condition, the mean value, and the standard deviation are given by\[

\begin{aligned}
1 &= \int dX\,\rho(X) \\
\bar X &= \int dX\,X\cdot \rho(X) \\
(\Delta X)^2 &= \int dX\,(X-\bar X)^2\cdot\rho (X)
\end{aligned}

\] Note that the integral \(\int dX\,\rho(X)\) with the extra insertion of any quadratic function of \(X\) is a combination of these three quantities. The Pythagorean rules for the standard deviations may be shown to hold independently of the shape of \(\rho(X)\) – it doesn't have to be Gaussian.

However, we often want to calculate the probability that the difference between the theory and the experiment was "this high" (whether the probability is high enough so that it could appear by chance) – this is the ultimate reason why we talk about the standard deviations at all. And to translate the "number of sigmas" to "probabilities" or vice versa is something that requires us to know the shape of \(\rho(X)\) – e.g. whether it is Gaussian.

For a Gaussian, there's a 32% risk that the deviation from the central value exceeds 1 standard deviation (in either direction), a 5% risk that it exceeds 2 standard deviations, 0.27% that it exceeds 3 standard deviations, 0.0063% that it exceeds 4 standard deviations, and 0.000057%, which is about 1 part in 1.7 million, that it exceeds five standard deviations.
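These percentages can be reproduced with the complementary error function from Python's standard library. This is a small sketch; the helper name `two_sided_tail` is an illustrative choice.

```python
import math

def two_sided_tail(n_sigma):
    """Probability that a Gaussian variable deviates by more than n_sigma in either direction."""
    return math.erfc(n_sigma / math.sqrt(2.0))

for n in range(1, 6):
    p = two_sided_tail(n)
    print(f"{n} sigma: p = {p:.3g}  (about 1 in {1.0 / p:,.0f})")
```

The values are two-sided, matching the "in either direction" convention of the paragraph above; a one-sided tail would be exactly half of each number.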

So far, Dorigo wrote roughly four criticisms against the five-sigma criterion:

1. five sigma is a pure convention
2. the systematic errors may be underestimated which results in a dramatic exaggeration of our certainty (we shouldn't be this sure!)
3. the distributions are often non-Gaussian which also means that we should be less sure than we are
4. systematic errors don't drop when datasets are combined and some people think that they do

You see that this set of complaints is a mixed bag, indeed.

Concerning the first one, yes, five sigma is a pure convention but an important point is that it is damn sensible to have a fixed convention. Particle physics and a few other hardcore hard disciplines of (usually) physical sciences require 5 sigma, i.e. the risk 1 in 1.7 million that we have a false positive, and that's a reasonably small risk that allows us to build on previous experimental insights.

The key point is that it's healthy to have the same standards for discoveries of anything (e.g. 1 in 1.7 million) so that we don't lower the requirements in the case of potential discoveries we would be happy about; the certainty can't be too small because the science would be flooded with wrong results obtained from noise and subsequent scientific work building on such wrong results would be ever more rotten; and the certainty can't ever be "quite" 100% because that would require an infinite number of infinitely large and accurate experiments and that's impossible in the Universe, too.

We're gradually getting certain that a claim is right but this "getting certain" is a vague subjective process. Science in the sociological or institutionalized sense has formalized it so that particle physics allows you to claim a discovery once your certainty surpasses a particular threshold. It's a sensible choice. If the convention were six sigma, many experiments would have to run for a time longer by 44% or so (the factor \((6/5)^2\) from the \(K^2\) rule) before they would reach the discovery levels but the qualitative character of the scientific research wouldn't be too different. However, if the standard in high-energy physics were 30 sigma, we would still be waiting for the Higgs discovery today (even though almost everyone would know that we're waiting for a silly formality). If the standard were 2 sigma, particle physics would start to resemble soft sciences such as medical research or climatology and particle physicists would melt into stinky decaying jellyfish, too. (This isn't meant to be an insulting comparison of climatology to other scientific disciplines because this comparison can't be made at all; a more relevant comparison is the comparison of AGW to other religions and psychiatric diseases.)

Concerning Tommaso's second objection, namely that some people underestimate systematic errors, well, he is right and this blunder may shoot their certainty about a proposition through the roof even though the proposition is wrong. But you can't really blame this bad outcome – whenever it occurs – on the statistical methods and conventions themselves because you need some statistical methods and conventions. You must only blame it on the incorrect quantification of the systematic error.

The related point is the third one, namely that the systematic errors don't have to be normally distributed (i.e. with a distribution looking like the Gaussian). When the distribution has thick tails and you have several ways to calculate the standard deviation, you had better choose the largest one.

However, I need to say that Tommaso heavily underestimates the Gaussian, normal distribution. While he says that it has "some merit", he thinks that it is "just a guess". Well, this sentence of his is inconsistent and I will explain a part of the merit below – the central limit theorem that says that pretty much any sufficiently complicated quantity influenced by many factors will be normally distributed.

Concerning Tommaso's last point, well, yes, some people don't understand that the systematic errors don't become any better when many more events or datasets are merged. However, the right solution is to make them learn how to deal with the systematic errors; the right solution is not to abandon the essential statistical methods just because someone didn't learn them properly. Empirical science can't really be done without them. Moreover, while one may err on the side of hype – one may underestimate the error margins and overestimate his certainty – he may err on the opposite, cautious side, too. He may overstate the error margins and \(p\)-values and deny the evidence that is actually already available. Both errors may turn one into a bad scientist.

Now, let me return to the Gaussian, normal distribution. What I want to tell you about – if you haven't heard of it – is the central limit theorem. It says that if a quantity \(X\) is a sum of many (\(M\to\infty\)) terms whose distribution is arbitrary (the distributions for individual terms may actually differ but I will only demonstrate a weaker theorem that assumes that the distributions coincide), then the distribution of \(X\) is Gaussian i.e. normal i.e. \[

\rho(X) = C\exp\left( - \frac{(X-\bar X)^2}{2(\Delta X)^2} \right)

\] i.e. the exponential of a quadratic function of \(X\). If you need to know, the normalization factor is \(C=1/(\Delta X\sqrt{2\pi})\). Why is this central limit theorem true?

Recall that we are assuming\[

X = \sum_{i=1}^M S_i.

\] You may just add some bars (i.e. integrate both sides of the equation over \(X\) with the measure \(dX\,\rho(X)\): the integration is a linear operation) to see that \[

\bar X = \sum_{i=1}^M \bar S_i.

\] It's almost equally straightforward (trivial manipulations with integrals whose measure is still \(dX\,\rho(X)\) or similarly for \(S_i\) and that have some extra insertions that are quadratic in \(S_i\) or \(X\)) to prove that\[

(\Delta X)^2 = \sum_{i=1}^M (\Delta S_i)^2

\] assuming that \(S_i,S_j\) are independent of each other for \(i\neq j\) i.e. that the probability distribution for all \(S_i\) factorizes to the product of probability distributions for individual \(S_i\) terms. Here we're assuming that the error included in \(S_i\) is a "statistical error" in character.

So the mean value and the standard deviation of \(X\), the sum, are easily determined from the mean values and the standard deviations of the terms \(S_i\). These identities don't require any distribution to be Gaussian, I have to emphasize again.
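The claim that these identities hold for any shape can be checked exactly on a toy example. The sketch below uses two made-up, clearly non-Gaussian discrete distributions (a fair die and a lopsided three-valued variable, both hypothetical) and exact rational arithmetic, assuming the two terms are independent.

```python
from fractions import Fraction
from itertools import product

def mean_var(pmf):
    """Mean and variance of a discrete distribution given as {value: probability}."""
    mean = sum(x * p for x, p in pmf.items())
    var = sum((x - mean) ** 2 * p for x, p in pmf.items())
    return mean, var

def pmf_of_sum(pmf_a, pmf_b):
    """Distribution of A + B for independent A and B (the joint probability factorizes)."""
    out = {}
    for (a, pa), (b, pb) in product(pmf_a.items(), pmf_b.items()):
        out[a + b] = out.get(a + b, 0) + pa * pb
    return out

die = {k: Fraction(1, 6) for k in range(1, 7)}                          # flat
skewed = {0: Fraction(7, 10), 1: Fraction(2, 10), 5: Fraction(1, 10)}   # lopsided

m1, v1 = mean_var(die)
m2, v2 = mean_var(skewed)
m12, v12 = mean_var(pmf_of_sum(die, skewed))
print(m12 == m1 + m2, v12 == v1 + v2)
```

Because the arithmetic is exact, both comparisons hold as equalities rather than approximately.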

Without loss of generality, we may linearly redefine all variables \(S_i\) and \(X\) so that their mean values are zero and the standard deviation of each \(S_i\) is one. Recall that we are assuming that all \(S_i\) have the same distribution that doesn't have to be Gaussian. We want to know the shape of the distribution of \(X\).

An important fact to realize is that the probabilistic distribution for a sum is given by the convolution of the probability distributions of individual terms. Imagine that \(X=S_1+S_2\); the arguments below hold for many terms, too. Then the probability that \(X\) is between \(X_K\) and \(X_K+dX\) is given by the integral over \(S_1\) of the probability that \(S_1\) is in an infinitesimal interval and \(S_2\) is in some other corresponding interval for which \(S_1+S_2\) belongs to the desired interval for \(X\). The overall probability distribution is given by \[

\rho(X_K) = \int dS_1 \rho_S(S_1) \rho_S(X_K-S_1).

\] You should think about why that's the case. At any rate, the integral on the right hand side is called the convolution. If you know some maths, you must have heard that there's a nice identity involving convolutions and the Fourier transform: the Fourier transform of a convolution is the product of the Fourier transforms!
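The convolution integral can be evaluated numerically for a simple case. The sketch below assumes both terms are spread evenly over the interval from 0 to 1 (a hypothetical choice) and uses a plain midpoint rule; the sum of two such variables is known to have a triangular density that peaks at 1.

```python
def uniform_density(s):
    """Density of a variable spread evenly over [0, 1)."""
    return 1.0 if 0.0 <= s < 1.0 else 0.0

def density_of_sum(x, rho_a, rho_b, lo=-1.0, hi=2.0, steps=30000):
    """Midpoint-rule estimate of the convolution integral at the point x."""
    ds = (hi - lo) / steps
    total = 0.0
    for k in range(steps):
        s1 = lo + (k + 0.5) * ds
        total += rho_a(s1) * rho_b(x - s1)
    return total * ds

for x in (0.25, 1.0, 1.5):
    print(x, round(density_of_sum(x, uniform_density, uniform_density), 3))
```

The printed values should come out close to 0.25, 1.0, and 0.5, the heights of the triangle at those points.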

So instead of \(\rho(X_K)\), we may calculate its Fourier transform and it will be given by a simple product (we return to the general case of \(M\) terms immediately)\[

\tilde \rho(P) = \prod_{i=1}^M \tilde\rho_S(T_i), \qquad T_i = P.

\] Here, \(P\) and \(T_i\) are the Fourier momentum-like dual variables to \(X\) and \(S_i\); because \(X\) is a plain sum, every factor is evaluated at the same value \(T_i=P\). However, now we're almost finished because the product of many (\(M\)) equal factors may be rewritten in terms of an exponential. If \(\tilde\rho_S(T_i)=\exp(W_i)\), then the product of \(M\) equal factors is just \(\exp(MW_i)\). The real part of \(W_i\) is never positive and it vanishes at \(T_i=0\), so the funny thing is that the product becomes totally negligible wherever \(M\,|{\rm Re}\,W_i|\gg 1\). So we only need to know how the right hand side behaves in the vicinity of the maximum at \(T_i=0\). A generic function \(W_i\) may be approximated by a quadratic function over there (the linear term drops out because we set the mean values to zero), which means that both sides of the equation above will be well approximated by \(C_1\exp(-MC_2 T_i^2)\) for \(M\to\infty\).

It's the Gaussian and if you make the full calculation, the Gaussian will inevitably come out as shifted, stretched or shrunk, and renormalized so that \(X\), the sum, has the previously determined mean value, the standard deviation, and the probability distribution for \(X\) is normalized. Just to be sure, the Fourier transform of a Gaussian is another Gaussian so the Gaussian shape rules regardless of the variables (or dual variables) we use.
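The approach to the Gaussian can be watched directly by convolving a lopsided distribution with itself many times. This is a sketch rather than a proof: the starting distribution `single` is an arbitrary made-up example, and the gap is measured as the largest pointwise difference from the Gaussian with the same mean and variance, relative to the Gaussian's peak.

```python
import math

def convolve(p, q):
    """Distribution of the sum of two independent integer-valued variables.

    p[i] and q[j] are the probabilities of the values i and j.
    """
    out = [0.0] * (len(p) + len(q) - 1)
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            out[i + j] += pi * qj
    return out

def max_gap_to_gaussian(single, m):
    """Largest gap between the sum of m terms and the matching Gaussian, in units of its peak."""
    total = [1.0]
    for _ in range(m):
        total = convolve(total, single)
    mean = sum(k * p for k, p in enumerate(total))
    var = sum((k - mean) ** 2 * p for k, p in enumerate(total))
    peak = 1.0 / math.sqrt(2.0 * math.pi * var)
    gap = max(abs(p - peak * math.exp(-(k - mean) ** 2 / (2.0 * var)))
              for k, p in enumerate(total))
    return gap / peak

single = [0.6, 0.3, 0.0, 0.1]   # lopsided distribution on the values 0, 1, 2, 3
for m in (1, 4, 16, 64, 256):
    print(m, round(max_gap_to_gaussian(single, m), 4))
```

The relative gap should shrink steadily as the number of terms grows, even though a single term looks nothing like a bell curve.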

So there's a very important – especially in the real world – class of situations in which the quantity \(X\) may be assumed to be normally distributed. The normal distribution isn't just a random distribution chosen by some people who liked its bell-like shape or wanted to praise Gauss. It's the result of normal operations we experience in the real world – that's why it's called normal. The more complicated factors influencing \(X\) you consider, and they may be theoretical or experimental factors of many kinds, the more likely it is that the Gaussian distribution becomes a rather accurate approximation for the distribution for \(X\).

Whenever \(X\) may be written as the sum of many terms with their error margin (even though the inner structure of these terms may have a different, nonlinear character etc.; and the sum itself may be replaced by a more general function because if it has many variables and the relevant vicinity of \(X\) is narrow, the linearization becomes OK and the function may be well approximated by a linear combination i.e. effectively a sum, anyway), the normal distribution is probably legitimate. Only if the last operation to get \(X\) is "nonlinear" – if \(X\) is a nonlinear function of a sum of many terms etc. – or if you have another specific reason to think that \(X\) is not normally distributed should you point this fact out and take it into account.

But Tommaso's fight against the normal distribution as the "default reaction" is completely misguided because pretty much no confidence levels in science could be calculated without the – mostly justifiable – assumption that the distribution is normal. Tommaso decided to throw the baby out with the bathwater. He doesn't want an essential technique to be used. He pretty much wants to discard some key methodology but as a typical whining leftist, he has nothing constructive or sensible to offer for the science he wants to ban.

Second part

Dorigo's second part of the article is insightful and less controversial.

He reviews a nice 1968 Arthur Rosenfeld paper showing that the number of fake discoveries pretty much agrees with the expectations – some false positives are bound to happen due to the number of histograms that people are looking at. Sometimes experimenters tend to improve their evidence by ad hoc cuts if they get excited by the idea that they have made a discovery. Too bad.
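The effect of looking at many histograms can be illustrated with a one-line formula. The sketch below is not Rosenfeld's actual counting; it simply assumes that the looks are independent and that each one follows the two-sided Gaussian tail probability, and the function name is illustrative.

```python
import math

def chance_of_any_fluke(n_sigma, n_looks):
    """Probability that at least one of n_looks independent searches exceeds n_sigma by chance."""
    p_single = math.erfc(n_sigma / math.sqrt(2.0))
    # 1 - (1 - p)^n, written with log1p/expm1 to stay accurate for tiny p
    return -math.expm1(n_looks * math.log1p(-p_single))

for n_sigma in (3, 5):
    for n_looks in (1, 100, 10_000):
        print(n_sigma, n_looks, f"{chance_of_any_fluke(n_sigma, n_looks):.3g}")
```

Under these assumptions, a hundred independent looks already make a three-sigma fluke somewhere quite likely, while a five-sigma fluke stays rare even after ten thousand looks.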

Dorigo argues that the five-sigma criterion should be replaced by a floating requirement. This has various arguments backing it. One of them is that people have differing prior subjective probabilities quantifying how plausible or almost inevitable they consider a possible result. Of course, extraordinary claims require extraordinary evidence while almost robustly known and predicted ones require weaker evidence. It's clear that people get convinced by some experimental claims at a lower number of sigmas than for other claims. But I wouldn't institutionalize this variability because due to the priors' intrinsically subjective character, it's extremely hard to agree on the "right priors".

He also mentions OPERA that made the ludicrous claim about the superluminal neutrinos that was called bogus on this blog from the very beginning. It was a six-sigma result (60 nanoseconds with a 10 nanoseconds error), we were told. Dorigo blames it on the five-sigma standards. But this is just silly. Whatever statistical criterion you introduce for a "discovery", you will never fully protect physics against a silly human error that may introduce an arbitrarily large discrepancy to the results – against stupid errors such as the loosened cable. I wouldn't even count it as a systematic error; it's just a stupid human error that can't really be quantified because nothing guarantees that it remains smaller than a bound. So I think it's irrational to mix the debate about the statistical standards with the debate about loosened cables and similar blunders that cripple the quality of an experimental work "qualitatively" – they have nothing to do with one another!

Third part

In the third part, Dorigo discusses three extra classes of effects and tricks that may lead to fake discoveries. I agree with everything he writes but none of these things implies that the 5-sigma standard is bad or that it could be replaced by something better.

The first effect is mismodeling (well, a systematic error on the theoretical side); the second effect is aposterioriness, the search for bumps in places where we originally didn't want to look (which is OK for discovering new physics but such unplanned situations may heavily and uncontrollably increase the number of discrepancies i.e. false positives we observe, and we shouldn't forget about it when we're getting excited about such a discrepancy); and the third is dishonest manipulation of the data (there's no protection here, except to shoot the culprit; if someone wants to spit on the rules, he will be willing to spit on any rules).

Fourth, last part

In this fourth and final part, Dorigo continues in a discussion of a bump near \(150\,{\rm GeV}\). At the very end, he proposes ad hoc modifications of the five-sigma rule – from 3 sigmas for the B decay to two muons to 8 sigmas for gravitational waves. One could assign ad hoc requirements that differ for different phenomena but it's not clear how it would be determined for many other phenomena for which the holy oracle Dorigo hasn't specified his miraculous numbers. Moreover, the patterns in his numbers don't seem to make any sense. It is bizarre that a certain exotic, not-guaranteed-to-exist decay of the B-mesons is OK with 3 sigmas while the gravitational waves that have to exist must pass 8 sigmas. Moreover, some other observed signatures, like "SUSY", aren't really signatures but whole frameworks that may manifest themselves in many experiments and each of them clearly requires a different assignment of the confidence levels if we decide that confidence levels should be variable. There would be some path from being sure that there's new physics to being reasonably sure that it's a SUSY effect – Dorigo seems to confuse these totally different levels of knowledge.

If this table with the variable confidence levels were the goal of Dorigo's series, then I must say that the key thesis of his soap opera is crap.

Monday, August 19, 2013
