Fretting over R² and outliers

Two things I’ve been thinking about lately.

First, outliers. A classic example is Anscombe’s quartet, four datasets with nearly identical summary statistics (mean, variance, x-y correlation) that nonetheless look very different when plotted:

Variation in Anscombe’s quartet despite nearly identical summary statistics.
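
The near-identical summary statistics are easy to check, because the quartet ships with R as the built-in data frame anscombe. A minimal sketch (the object name quartet_stats is just for illustration):

```r
# anscombe is a built-in data frame with columns x1..x4 and y1..y4
data(anscombe)

quartet_stats <- sapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  c(mean_x = mean(x), var_x = var(x),
    mean_y = mean(y), var_y = var(y),
    cor_xy = cor(x, y))
})
colnames(quartet_stats) <- paste0("set", 1:4)

# One column per dataset; the rows agree to about two decimal places
round(quartet_stats, 2)
```

Plotting each pair (for example plot(anscombe$x3, anscombe$y3)) is what reveals the differences that the table hides.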

John Kruschke at “Doing Bayesian Data Analysis” has a terrific example of how to fit robust regressions to these data using BUGS, along with links to additional code from Rasmus Bååth. It is definitely worth taking a look at these. One of the key points is that for small sample sizes, where the population SD is really unknown, it is worth modelling the data as draws from a Student’s t distribution rather than a normal distribution, because its heavier tails accommodate outliers (such as in y3-y4 above) without letting them dominate the fit. Such models are easy to implement in BUGS and Stan.


Second, how worried should we be about R² values? I’ve been thinking briefly about this over the past week as I’m not getting my usual impressive 0.9 values. Does this matter though? What does an R² really tell us? Well, consider the example below:

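set.seed(1)  # any seed will do; fixed here so the results are reproducible
n <- 10000
x1 <- rnorm(n)
# same intercept (5) and slope (1) in both cases; only the residual SD differs
y1 <- 5 + 1*x1 + 1*rnorm(n)  # residual SD = 1
y2 <- 5 + 1*x1 + 3*rnorm(n)  # residual SD = 3

summary(lm(y1 ~ x1))
summary(lm(y2 ~ x1))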

If you run the code in R you will see that both models recover the true effect of x1 (a slope of 1), but that the y2 model has a 3X larger standard error around this effect. The resulting R² is ca. 1/5 that of the y1 model (roughly 0.1 versus 0.5). Does this mean the model is a poor fit? Not really… It speaks much more to the fact that the predictive power of the y2 model is diminished and its prediction intervals are wider, rather than there being anything wrong with the estimated effect of x1 per se. I’d be curious to hear what others make of this.
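
The one-fifth ratio can also be worked out without simulating anything. Assuming x1 has variance 1 (the rnorm default), the expected R² is the variance in y explained by x1 divided by the total variance in y. A small sketch, where the helper name expected_r2 is made up for illustration:

```r
# Expected R-squared for y = a + b*x + e, assuming var(x) = 1 unless told otherwise
expected_r2 <- function(slope, resid_sd, x_var = 1) {
  signal <- slope^2 * x_var   # variance in y explained by x
  noise  <- resid_sd^2        # residual variance
  signal / (signal + noise)
}

r2_y1 <- expected_r2(slope = 1, resid_sd = 1)  # 1 / (1 + 1) = 0.5
r2_y2 <- expected_r2(slope = 1, resid_sd = 3)  # 1 / (1 + 9) = 0.1
r2_y2 / r2_y1                                  # 0.2, i.e. one fifth
```

The slope is identical in both cases, so the drop in R² reflects only the extra residual noise.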


Finally, I thought we should mention the student blog PLANeT, run by our Part II undergraduates. It’s worth checking out! They regularly post well-written and engaging articles on the relationship between plants and society.

Many measurements and few observations

Our University colleague Professor Sir David Spiegelhalter has written a brief opinion piece in the latest issue of Science on the future of probabilistic models, particularly for big datasets (think images or genomes).

Two points jumped out at me:

(1) Statistical problems have shifted from many observations (large n) and few parameters (small p) to small n and large p, creating pitfalls when testing large numbers of hypotheses. This is because the standard “p-value” threshold of 0.05, which we’ve griped about in the past here, will on average declare 1 in 20 non-existent relationships “significant” simply by chance. So procedures are needed to reduce false discoveries. The bit that I didn’t really follow was why we should even bother minimizing false discoveries. Wouldn’t an interpretation of effect sizes be more meaningful?
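
The 1-in-20 problem is easy to see by simulation. The sketch below is purely illustrative and not taken from the Science piece: it runs many t-tests on pure noise, so every null hypothesis is true, and then applies the Benjamini-Hochberg adjustment from base R's p.adjust as one example of a false discovery procedure. The sample size, number of tests and seed are arbitrary assumptions.

```r
set.seed(42)      # arbitrary seed, for reproducibility
n_obs   <- 20     # small n
n_tests <- 1000   # large p: one test per variable

# Every variable is pure noise, so every null hypothesis is true
p_values <- replicate(n_tests, t.test(rnorm(n_obs))$p.value)

# Proportion declared "significant" at the usual threshold: close to 0.05
mean(p_values < 0.05)

# Benjamini-Hochberg adjustment, which controls the false discovery rate
p_bh <- p.adjust(p_values, method = "BH")
sum(p_bh < 0.05)  # typically 0 when nothing is real
```

With 1000 tests, around 50 "discoveries" appear before adjustment even though no effect exists.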

(2) Inferring causation from observational data will continue to be a challenge, especially when n gets cheap and p remains large. Statistical theory to deal with causality will be needed more than ever, and thankfully, it is improving. This is something we’re quite fond of, having thought a fair bit about causality in the context of path analysis, structural equation modelling, and directed acyclic graphs (see our J Appl Ecol paper that just came out). The problem, however, is that these approaches don’t come easily and I struggle to see how they can be used by non-statisticians (the models in our paper took years of faffing!). Finding ways to make causal inference more accessible is going to be critical in the future.

Mixture models for bimodality?

Oikos kindly featured our latest paper.  See below!

Oikos Blog

Bimodality – the characteristic of a continuous variable having two distinct modes – is of widespread interest in data analysis. This is because, in some cases, we can use the presence or absence of bimodality to infer something about the underlying processes generating the distribution of a variable that we are interested in studying. In ecology, tests of bimodality have been used in many different contexts, such as to understand body size distributions, functional traits, and transitions among different ecosystem states. But a lack of evidence for bimodality has been reported in many studies. Our paper “Masting, mixtures and modes: are two models better than one?” now shows that a widely used statistical test of bimodality can fail to reject the null hypothesis that focal probability distributions are unimodal. We instead promote the use of mixture models as a theory-oriented framework for testing hypotheses of bimodality.

Our interest in…

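
To give a flavour of the mixture-model idea, here is a minimal base R sketch. It is not the code or the exact method from the paper: the data are simulated, the function name fit_mix2 is made up, and the comparison by BIC is just one simple way to ask whether two normal components describe the data better than one.

```r
set.seed(1)  # arbitrary seed
# Hypothetical data: two groups with means 0 and 4 (not from the paper)
x <- c(rnorm(150, mean = 0, sd = 1), rnorm(150, mean = 4, sd = 1))
n <- length(x)

# One-component model: a single normal fitted by maximum likelihood
sd_ml <- sqrt(mean((x - mean(x))^2))
ll1   <- sum(dnorm(x, mean(x), sd_ml, log = TRUE))
bic1  <- -2 * ll1 + 2 * log(n)            # 2 parameters: mean, sd

# Two-component normal mixture fitted with a basic EM algorithm
fit_mix2 <- function(x, iter = 500) {
  p  <- 0.5                                # mixing proportion of component 1
  mu <- as.numeric(quantile(x, c(0.25, 0.75)))
  s  <- rep(sd(x), 2)
  for (i in seq_len(iter)) {
    # E-step: probability that each point belongs to component 1
    d1 <- p * dnorm(x, mu[1], s[1])
    d2 <- (1 - p) * dnorm(x, mu[2], s[2])
    w  <- d1 / (d1 + d2)
    # M-step: weighted updates of proportion, means and SDs
    p  <- mean(w)
    mu <- c(sum(w * x) / sum(w), sum((1 - w) * x) / sum(1 - w))
    s  <- c(sqrt(sum(w * (x - mu[1])^2) / sum(w)),
            sqrt(sum((1 - w) * (x - mu[2])^2) / sum(1 - w)))
  }
  ll <- sum(log(p * dnorm(x, mu[1], s[1]) + (1 - p) * dnorm(x, mu[2], s[2])))
  list(p = p, mu = mu, sd = s, logLik = ll)
}

mix  <- fit_mix2(x)
bic2 <- -2 * mix$logLik + 5 * log(n)      # 5 parameters: p, 2 means, 2 SDs

c(one_component = bic1, two_components = bic2)  # lower BIC is preferred
```

The fitted means in mix$mu can then be interpreted directly, which is part of the appeal of a mixture over a yes/no test of bimodality.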

Dealing with non-normal data: are you skewed?

I was recently trying to model some data with a normal distribution but the data were right-skewed. No amount of transformation could eliminate this. In the past, I’ve dealt with this by using the skew normal distribution. But rather than simply matching distributions to data, we should be asking whether skewness makes sense biologically (in our case). There is good reason to expect it in some cases, such as where environmental filters might push a trait in a certain direction. But what about where there isn’t a good rationale for skewness? Why might it arise and what can we do about it?

This is where the old Student’s t-distribution comes in. The t-distribution has a bell-shaped curve just like the normal but with heavier tails, which become lighter as the ‘degrees of freedom’ parameter ν increases; as ν approaches infinity, the t-distribution converges on the normal.
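
How much heavier the tails are can be read off the distribution functions in base R. This sketch compares the probability of landing more than 3 scale units from the centre under a standard t-distribution with various ν and under a standard normal; the particular ν values are arbitrary choices for illustration.

```r
df <- c(1, 3, 10, 30, 1000)   # degrees of freedom (nu) to compare

# Two-sided probability of a value more than 3 units from the centre
tail_t      <- 2 * pt(-3, df = df)
tail_normal <- 2 * pnorm(-3)

round(rbind(df = df, tail_prob = tail_t), 4)
tail_normal   # about 0.0027
```

The tail probability shrinks steadily as ν grows and is essentially the normal value by ν = 1000.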


What this means is that if we have a small sample size (n = 8 in the case of my data from earlier this week), drawing samples from a symmetric, long-tailed distribution can often create the impression of skewness, because one or two extreme values on one side are enough to make the sample look asymmetric. Gelman (via Rubin in fact) recommended that:

“if you want to model an asymmetric distribution with outliers, you can use a symmetric long-tailed model”
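
The idea that a symmetric long-tailed distribution can look skewed in small samples can be checked by simulation. The sketch below assumes ν = 3 for the t-distribution and uses a simple moment-based skewness (the helper name sample_skew is illustrative); both are choices made here, not values from my data.

```r
set.seed(8)     # arbitrary seed
n      <- 8     # sample size, as in the example above
n_sims <- 5000

# Simple moment-based sample skewness
sample_skew <- function(x) {
  z <- x - mean(x)
  mean(z^3) / mean(z^2)^1.5
}

# Both parent distributions are perfectly symmetric
skew_normal <- replicate(n_sims, sample_skew(rnorm(n)))
skew_t3     <- replicate(n_sims, sample_skew(rt(n, df = 3)))

# Share of samples that look clearly skewed (|skewness| > 1)
c(normal = mean(abs(skew_normal) > 1),
  t_df3  = mean(abs(skew_t3) > 1))
```

Samples of 8 from the t-distribution look strongly skewed noticeably more often than samples from the normal, even though neither parent distribution has any skew.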

He’s even gone so far as to say that we should always be defaulting to t-distributions, but that this hasn’t permeated practice because of computational issues. He predicted these would be eliminated by our saviour Stan. In fact, you can already use t-distributions in JAGS.

The real problem that I’ve had with the t-distribution is that it requires a value for the degrees of freedom ν. I wanted to estimate this because its ‘true’ value is really unknown with only 8 observations. And this can be difficult in a Bayesian context because of the prior we place on ν. After some reading and fiddling, I found that the recommendation of Gelman and Hill on p. 372 of ARM (and described here) worked well and converged on a relatively tight posterior for ν despite the uninformative prior. I think I’ll be using this approach a lot more when I have small sample sizes!
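
For anyone wanting a starting point, below is a rough sketch of what such a model could look like in JAGS, written as an R string. It is not necessarily the exact specification in ARM: the uniform prior on 1/ν (which restricts ν to be greater than 2), the vague priors on the location and scale, and the simulated stand-in data are all assumptions made here for illustration.

```r
# JAGS model as a string; running it needs JAGS plus an interface such as rjags
t_model <- "
model {
  for (i in 1:n) {
    y[i] ~ dt(mu, tau, nu)      # Student's t likelihood
  }
  mu ~ dnorm(0, 1.0E-6)         # vague prior on the location (assumed)
  sigma ~ dunif(0, 100)         # vague prior on the scale (assumed)
  tau <- 1 / (sigma * sigma)    # JAGS parameterises by precision
  nu_inv ~ dunif(0, 0.5)        # ASSUMED prior: uniform on 1/nu
  nu <- 1 / nu_inv              # implies nu > 2
}
"

# Hypothetical stand-in for the 8 real observations (simulated, not real data)
set.seed(3)
y <- 4 + 0.3 * rt(8, df = 4)
jags_data <- list(y = y, n = length(y))

# With rjags installed, the model could then be run along these lines:
# library(rjags)
# jm   <- jags.model(textConnection(t_model), data = jags_data, n.chains = 3)
# post <- coda.samples(jm, variable.names = c("mu", "sigma", "nu"), n.iter = 10000)
```

The posterior for nu can then be inspected with summary(post) to see how tightly it is estimated.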

The value of p-value

In the latest issue of Nature there is a piece by Regina Nuzzo about how unreliable p-values are in interpreting statistical tests. She describes how researchers use the p-value entirely differently from how Fisher intended it to be used when he first published it in the 1920s. His significance, sensu Nuzzo, was merely a means of judging whether a result was worth pursuing. It has become the gold standard in scientific work to test at the 0.05 significance level without providing information about the plausibility of the hypothesis being tested, or about the strength of the effect being reported. Without that information, the p-value is not very informative and in many cases might be misleading. This paper can be related to a long historic feud among statisticians regarding the pros and cons of frequentist statistics versus, for example, Bayesian inference techniques.

The take-home message from Nuzzo’s and other papers on the subject is that it’s not enough to rely on the p-value; one should always look for additional ways to test the rigor of one’s results, and describe them in a way that makes them highly reproducible.