Two things I’ve been thinking about lately.
First, outliers. A classic example of these arises in what is known as Anscombe’s quartet: four datasets with nearly identical summary statistics (mean, variance, x-y correlation) that nonetheless look very different when plotted.
John Kruschke at “Doing Bayesian Data Analysis” has a terrific example of how to fit robust regressions to these datasets using BUGS, along with links to additional code from Rasmus Bååth. It is definitely worth taking a look at these. One of the key points is that for small sample sizes, where the population SD is really unknown, it is worth modelling the data with a Student’s t distribution rather than a normal distribution, as its heavier tails accommodate outliers (such as those in the y3 and y4 datasets of the quartet) without letting them drag the fitted line around. These models are easy to implement in BUGS and Stan.
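To get a feel for the idea without setting up BUGS or Stan, here is a minimal sketch in base R. It is not Kruschke’s Bayesian model: it swaps in a maximum-likelihood fit with Student’s t errors, computed by a simple EM (iteratively reweighted least squares) loop. The degrees of freedom nu = 3 and the 200 iterations are my own assumptions, chosen only for illustration. It uses the anscombe data frame that ships with R.

```r
# Sketch only: ML regression with Student's t errors via EM / reweighted
# least squares. NOT the Bayesian BUGS/Stan model referred to above.
# Assumption: nu (degrees of freedom) is fixed at 3 rather than estimated.
x <- anscombe$x3   # third Anscombe dataset: a straight line plus one outlier
y <- anscombe$y3
nu <- 3

ols  <- lm(y ~ x)               # ordinary (normal-error) fit, for comparison
beta <- coef(ols)               # starting values for the robust fit
s2   <- mean(residuals(ols)^2)  # starting value for the squared scale

for (i in 1:200) {
  r <- y - (beta[1] + beta[2] * x)
  w <- (nu + 1) / (nu + r^2 / s2)   # E-step: large residuals get small weights
  fit  <- lm(y ~ x, weights = w)    # M-step: weighted least squares
  beta <- coef(fit)
  s2   <- sum(w * (y - fitted(fit))^2) / length(y)
}

# Compare the two sets of coefficients
round(rbind(normal = coef(ols), student_t = beta), 3)
```

The t-based slope should sit close to the line through the ten well-behaved points, while the ordinary least-squares slope is pulled up by the single outlier.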
Second, how worried should we be about R² values? I’ve been thinking briefly about this over the past week, as I’m not getting my usual impressive 0.9 values. Does this matter though? What does an R² really tell us? Well, consider the example below:
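set.seed(1)                        # any seed will do; only for reproducibility
x1 <- rnorm(10000)
y1 <- 5 + 1*x1 + 1*rnorm(10000)    # true slope 1, residual SD 1
y2 <- 5 + 1*x1 + 3*rnorm(10000)    # true slope 1, residual SD 3
summary(lm(y1 ~ x1))
summary(lm(y2 ~ x1))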
If you run the code in R you will see that both models recover the true effect of x1 (a slope of 1), but that the y2 model has a standard error around this effect roughly three times larger. Its R² is ca. 1/5 that of the y1 model. Does this mean the model is a poor fit? Not really… It says much more about the diminished predictive power of the y2 model, which has wider prediction intervals, than about anything being wrong with the estimated effect of x1 per se. I’d be curious to hear what others make of this.
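For what it’s worth, those numbers can be checked on the back of an envelope. This is a rough sketch using large-sample approximations, assuming x1 has variance 1 (as rnorm gives by default), a true slope β = 1, n = 10000, and residual SDs σ of 1 and 3:

```latex
\[
R^2 \approx \frac{\beta^2 \operatorname{Var}(x_1)}{\beta^2 \operatorname{Var}(x_1) + \sigma^2}
\quad\Rightarrow\quad
R^2_{y1} \approx \frac{1}{1 + 1^2} = 0.5, \qquad
R^2_{y2} \approx \frac{1}{1 + 3^2} = 0.1
\]
\[
\operatorname{SE}(\hat\beta) \approx \frac{\sigma}{\sqrt{n}\,\operatorname{sd}(x_1)}
\quad\Rightarrow\quad
\operatorname{SE}_{y1} \approx \frac{1}{100} = 0.01, \qquad
\operatorname{SE}_{y2} \approx \frac{3}{100} = 0.03
\]
```

So the factor of 1/5 in R² and the factor of 3 in the standard error both come purely from the extra noise in y2; the slope itself is the same in both models.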
Finally, I thought we should drop in a mention of the student blog PLANeT, run by our Part II undergraduates. It’s worth checking out! They regularly post well-written and engaging articles on the relationship between plants and society.