Bench philosophy: Statistical Pitfalls

Stats for the Stupefied
by Steven D. Buckingham, Labtimes 01/2011

Politicians exploit it. Psychologists love it. Journals reject papers over it. But for many of us statistics is a bit like sleep – something we have to do but no-one really quite knows why. Here is a set of heuristics – some simple rules of thumb to avoid frequent stats failures.

Some disciplines have stats at the very core: psychology and genomics, for instance. But for most of us, stats is just something we have to do and we rarely get beyond the good old Student’s t-test. There is a vast arsenal of statistical techniques but taking the most common tool – hypothesis testing – I hope to help you avoid some of the typical blunders and perhaps even think again about how you approach statistics.

For many of us, running a t-test is almost synonymous with “doing the stats”. And like many familiar friends, it is the most frequently ill-used. The most common t-test blunders come from not respecting its assumptions. Violate any test’s assumptions and you’ll get unreliable answers. In the case of the t-test, there are two fundamental assumptions that are often overlooked:

  • The populations you are testing must have a Normal distribution. You can check whether this holds for your data with tests such as the Kolmogorov-Smirnov test. Most respectable and easy-to-use stats packages will do this without being asked, but most unrespectable and easy-to-convince researchers may just ignore the package’s warnings.
  • If you want to be really picky, the two samples are supposed to have the same variance, although in reality the t-test is pretty robust against variance differences.

T for two

Now for a really big t-test blunder – one I have seen time and time again. Imagine you have tested the effects of two kinase inhibitors on cell growth. You have a set of controls and you want to test whether either, or both, of the kinase inhibitors have an effect. So you do a t-test of drug A against control and a t-test of drug B against control, right? Wrong! If you are using the two-sample t-test, the number of samples must be, obviously, two. You cannot use the t-test to compare three or more samples. If you have a sadistic streak, believe me, there is nothing more satisfying at a talk, or whilst refereeing a paper, than noticing that the t-test has been used to compare several samples one after another and smugly pointing out that the test is completely wrong.

To explain why, we have to digress a little into how statistics works. (Bear with me, it’ll be worth it – just think of that smug remark you can make at the next lab meeting.) Statistics is all about probability and the sort of probability most of us were brought up on is called “frequentist”. In this way of looking at things, a probability is a statement of how many times we tend to get a certain outcome, given a number of trials. So, declaring “The probability of rolling a six is 1/6” is exactly the same as saying “If I roll a die 6,000 times, I will get a six about 1,000 times, and the proportion gets closer to 1/6 the more I roll.” No more, no less.
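The frequentist reading of the dice statement can be checked with a quick simulation. This is only an illustrative sketch: the function name, the seed and the number of rolls are arbitrary choices made here, not anything from a stats package.

```python
import random


def count_sixes(n_rolls, seed=0):
    """Roll a fair six-sided die n_rolls times and count the sixes."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return sum(1 for _ in range(n_rolls) if rng.randint(1, 6) == 6)


if __name__ == "__main__":
    n_rolls = 6000
    sixes = count_sixes(n_rolls)
    print(f"{sixes} sixes in {n_rolls} rolls "
          f"(proportion {sixes / n_rolls:.3f}, expected about {1 / 6:.3f})")
```

The count will rarely be exactly 1,000; it scatters around 1,000, which is all the frequentist statement promises.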

So why can’t I use the t-test in my imaginary kinase experiment? Well, when we say the t-test is significant at 5%, we of course mean that the probability of getting a result at least this extreme entirely by chance is one in 20. In other words, a significance of 5% is exactly the same as saying “If I did this experiment 100 times, even if there was no real effect underlying the data, I would still get a result like this about five times, just by chance.” So if you test the effects of, say, reciting 100 different nursery rhymes on cell growth, using the t-test you will find, on average, that about five of them have a significant effect on cell growth compared to control. The same arithmetic applies on a smaller scale to the kinase experiment: with two t-tests at 5% each, the chance of at least one false positive is roughly 10%, not 5%. (By the way, a 5% chance of something happening by chance does NOT mean you can be 95% sure that it is not due to chance.)
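The nursery-rhyme arithmetic can be sketched in a few lines. Two assumptions are made here for simplicity and are not part of the article: the tests are treated as independent, and the P value of each test is taken to be uniformly distributed between 0 and 1 when there is no real effect. The function names are made up for this illustration.

```python
import random


def familywise_error_rate(n_tests, alpha=0.05):
    """Chance of at least one false positive in n_tests independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


def simulate_false_positives(n_tests, alpha=0.05, seed=1):
    """Count 'significant' results when no real effect exists.

    Assumption: under the null hypothesis each P value is uniform on [0, 1].
    """
    rng = random.Random(seed)
    return sum(1 for _ in range(n_tests) if rng.random() < alpha)


if __name__ == "__main__":
    for k in (1, 2, 10, 100):
        print(f"{k:3d} tests: P(at least one false positive) = "
              f"{familywise_error_rate(k):.3f}")
    print("False positives among 100 null tests:",
          simulate_false_positives(100))
```

With real data the comparisons often share a control group, so they are not strictly independent and the exact numbers will differ, but the inflation is still there.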

Of course, the right way to go about this kind of problem is to use ANOVA (analysis of variance). The kinase experiment would use a one-way ANOVA because only one factor is being tested (the drug treatment). But there is also the option of looking at two factors in the same experiment (such as looking at kinase inhibitors in the presence or absence of growth factors) using a two-way ANOVA, which most basic stats packages offer by default. In principle, of course, you can design any number n of factors into an experiment and use an n-way ANOVA (decent stats packages such as the free R package offer this), increasing the usefulness of the data but at the expense of increasing the number of experimental trials and the complexity of the design.
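To make the one-way case concrete, here is a minimal sketch of the F statistic that a one-way ANOVA computes, written out by hand so the between-group and within-group pieces are visible. The growth values are invented for illustration, and the function name is not from any package.

```python
from statistics import mean


def one_way_anova_f(*groups):
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [mean(g) for g in groups]

    # Between-group sum of squares: how far each group mean lies
    # from the grand mean, weighted by group size.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    # Within-group sum of squares: scatter of each observation
    # around its own group mean.
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, group_means) for x in g)

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


if __name__ == "__main__":
    # Hypothetical growth measurements, made up for this example.
    control = [10.0, 11.0, 9.0, 10.0]
    drug_a = [8.0, 9.0, 7.0, 8.0]
    drug_b = [10.0, 9.0, 11.0, 10.0]
    f_stat, df_b, df_w = one_way_anova_f(control, drug_a, drug_b)
    print(f"F({df_b}, {df_w}) = {f_stat:.2f}")
```

Turning F into a P value needs the F distribution, which is the part a stats package does for you (in Python, `scipy.stats.f_oneway` is one such routine).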

But beware of monsters lurking even in the enlightened waters of an ANOVA. The most common one is “OK, I have been very good and done a proper ANOVA on my kinase data and to my relief it came out significant. That means I can now go ahead with a clear conscience and do my t-tests again, right?” Wrong! ANOVA only tells you that the variance between your groups was larger than you would expect from the variance within your groups, and that you would only have got this by chance X% of the time. It does not tell you which groups differ, and it does not correct for the over-use of the t-test described above.

The way to find the significance of the individual groups is to use a “post-hoc” test, one that is specifically designed to find out which groups contribute to the significance of an ANOVA. These include the Newman-Keuls test, the Tukey test and Dunnett’s test. They approach the dangers of over-testing in various ways but remember that there is no rigorous way of doing this, so you may find yourself trying them all out until you get the result you want. Yes, it really is as dangerous as this.

So here is our heuristic so far:

If you are genuinely comparing only two groups, use the t-test. If you are comparing more than two groups, use ANOVA, then use post-hoc tests, but be cynical.

Wrong conclusions

T-tests and ANOVA are just about the only tests many of us need in most hypothesis-testing situations. But we are not home and dry yet. The issue of statistical power is ignored by a frighteningly large proportion of researchers, leading them to draw wrong conclusions. Let’s go back to our kinase example above. Imagine the ANOVA gave a P-value of 10%. Clearly not significant, so we conclude that neither of the drugs had any effect, right? Wrong! Leaving aside the obvious logical blunder that “lack of evidence of X” is not “evidence for lack of X”, there is also a statistical problem. Was the variance in the data too large, concealing a small but real difference between groups? Perhaps we didn’t run enough repeats.

This is what the question of statistical power is all about. The power of a test is the probability that it detects an effect that is really there, so it measures how well the test is protected against a false negative. An underpowered test tells you nothing because you simply didn’t do enough repeats, given the size of the effect or the variance in the data. Most statistical power calculations are a function of the size of the samples (the n number), the variance in the data and the size of the effect (for example, the differences in the mean growth rates of the cells in the kinase example). There is no single simple formula for working out the power but there are web resources available, such as the Java applet at, which also explains how you should and should not use power analysis. Slightly simpler and rather more fun is – try out the amusing “results whacker”.
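How power depends on sample size, variance and effect size can be sketched with a normal approximation. This is not the calculation any particular web tool performs; it treats the standard deviation as known, so it is slightly optimistic for a real t-test on small samples. The function names and the numbers in the usage section are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard Normal: mean 0, SD 1


def approx_power(effect, sd, n_per_group, alpha=0.05):
    """Approximate power of a two-sided comparison of two group means.

    Assumption: normal approximation with the SD treated as known.
    """
    z_crit = _STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    # Expected difference between means, in units of its standard error.
    shift = abs(effect) / (sd * sqrt(2.0 / n_per_group))
    return _STD_NORMAL.cdf(shift - z_crit) + _STD_NORMAL.cdf(-shift - z_crit)


def n_for_power(effect, sd, target_power=0.8, alpha=0.05, n_max=10_000):
    """Smallest n per group whose approximate power reaches target_power."""
    for n in range(2, n_max + 1):
        if approx_power(effect, sd, n, alpha) >= target_power:
            return n
    raise ValueError("target power not reached within n_max")


if __name__ == "__main__":
    # Hypothetical pilot estimates: a difference of 1 unit, SD of 1 unit.
    for n in (4, 8, 16, 32):
        print(f"n = {n:2d} per group: power = {approx_power(1.0, 1.0, n):.2f}")
    print("n per group for 80% power:", n_for_power(1.0, 1.0))
```

Halving the effect size or doubling the scatter pushes the required n up sharply, which is why a small pilot study is worth the trouble.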

You should do a power analysis on preliminary data before doing your main experiments because you must decide on the number of repeats before doing the experiment. This is vital, and failing to do it is a very common mistake – one which we poor readers cannot discern in the final published paper. The mistake goes: “We ran the experiment six times and got a P value of 5.6%. Because it was almost significant, we decided to get a few more repeats (three more, in fact) and finally got significance down to 4%, and concluded there was a real effect.” This error is as insidious as it is innocent because, of course, all that appears in the paper is “the effect was significant (P=0.04, n=9)”.

Run a pilot study

Reader, beware. This is wrong because, if you keep repeating an experiment and reapplying the stats test to the increasing n-numbers, the P value does not change smoothly but wanders through lots of peaks and valleys, so you could easily just keep repeating the experiment until, by chance, one of those peaks crosses the significance line. For a graphical example of this, read
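The effect of "just a few more repeats" can also be simulated. In the sketch below there is no real effect at all, yet testing after every added repeat produces far more "significant" results than testing once at a pre-planned n. To keep it short, a z-test with a known standard deviation stands in for the t-test; that substitution, the starting and maximum n and the function names are all assumptions of this illustration.

```python
import random
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def p_value_z(sample, sd=1.0):
    """Two-sided P value for 'the mean is zero', treating the SD as known."""
    z = (sum(sample) / len(sample)) * sqrt(len(sample)) / sd
    return 2.0 * (1.0 - _STD_NORMAL.cdf(abs(z)))


def simulate_peeking(n_start=6, n_max=30, alpha=0.05, n_sims=2000, seed=42):
    """Return (false-positive rate with peeking, rate with fixed n)."""
    rng = random.Random(seed)
    hits_peeking = 0
    hits_fixed = 0
    for _ in range(n_sims):
        # Data with no real effect: mean 0, SD 1.
        data = [rng.gauss(0.0, 1.0) for _ in range(n_max)]
        # Peeking: test after every added repeat, stop at the first 'hit'.
        if any(p_value_z(data[:n]) < alpha
               for n in range(n_start, n_max + 1)):
            hits_peeking += 1
        # Fixed design: test once, at the n decided in advance.
        if p_value_z(data) < alpha:
            hits_fixed += 1
    return hits_peeking / n_sims, hits_fixed / n_sims


if __name__ == "__main__":
    peeking, fixed = simulate_peeking()
    print(f"False-positive rate, test once at n = 30: {fixed:.3f}")
    print(f"False-positive rate, peek from n = 6 up:  {peeking:.3f}")
```

The fixed design stays close to the nominal 5%, while the peeking rate comes out well above it, even though every individual test looks perfectly respectable.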

So we have another heuristic:

Before doing the experiments, run a pilot study and use the preliminary data to do a power analysis to decide the n-numbers.

I was once at a scientific meeting where a questioner commented to the presenter “I see your data, but I still just don’t believe it.” Was he being unscientific? No, and let me explain why.

Have you read papers claiming to prove extrasensory perception (ESP)? For instance, the evidence might rest on a t-test of the number of correctly “guessed” dice throws, coin flips or card draws. The experiment was correctly done under controlled conditions, the correct stats tests were used, the study was adequately powered and the P value was 4.8%. Now be honest – what is your reaction? And more to the point, what would be your reaction if exactly the same data had been presented for the effects of kinase inhibitors on cell growth? My guess is you would not have a problem with the latter, even though the data is the same.

Admittedly, this is an extreme illustration but it serves the purpose of pointing out a tendency lurking in the less controversial corners of biological science – that of putting statistics before science. We need reminding that statistical significance is not the same as scientific significance. A P value is no substitute for scientific judgement. Did you feel uncomfortable with your instinctive arguments against the ESP data? You needn’t – and neither should you let stats bully you into less controversial judgements. You are probably, perhaps without knowing it, doing a Bayesian analysis.

Bayes has the answer

Instead of thinking of probability in terms of how often something happens, the Bayesian approach looks at it in terms of “how confident I am that something is true”. Here I can only point the interested reader to other resources for a fuller introduction to this powerful approach (see box). Put very simply, it says that I start with my existing belief in a hypothesis (the prior) and multiply it by a factor (the Bayes factor) that represents how much more expected a piece of data is if the hypothesis is true than if it is not, to give a new, updated belief (the posterior).
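The update rule can be sketched in a few lines. Strictly speaking, the multiplication is done on odds rather than on probabilities, so the sketch converts back and forth. The priors and the Bayes factor used below are invented purely to illustrate the ESP versus kinase contrast; they are not estimates from any study.

```python
def posterior_probability(prior, bayes_factor):
    """Update a prior probability with a Bayes factor, working in odds."""
    if not 0.0 < prior < 1.0:
        raise ValueError("prior must lie strictly between 0 and 1")
    prior_odds = prior / (1.0 - prior)
    posterior_odds = prior_odds * bayes_factor
    return posterior_odds / (1.0 + posterior_odds)


if __name__ == "__main__":
    bayes_factor = 3.0  # hypothetical: data 3 times more expected if true
    # Hypothetical priors: ESP is very implausible, a kinase effect is not.
    for label, prior in (("ESP", 1e-6), ("kinase inhibitor", 0.5)):
        post = posterior_probability(prior, bayes_factor)
        print(f"{label}: prior {prior:g} -> posterior {post:.6f}")
```

The same data move a sceptic about ESP almost nowhere, while they noticeably strengthen an already plausible kinase hypothesis.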

So going back to the ESP example, my prior that ESP exists is already very low (somewhere near zero, actually) and the Bayes factor in the experiment is also small (they got 53% of guesses right but this is not very surprising, despite what the t-test says). So my posterior belief (what I now believe having read the paper and even accepting the data) is still pretty close to zero. Interested? See the box to learn more.

Web resources for statistics
A useful decision tree to take you to the appropriate stats test
Collection of stats pages with interactive resources
A searchable guide from a commercial graphing/stats company
Some online stats calculators from a commercial graphing/stats company
Bayes for the rest of us: following this simple tutorial could change the way you think about data

Last Changed: 10.11.2012