Correlation Is Evidence

Commenting on a post by Arnold Kling, Megan McArdle says that correlation is not evidence of causation:

Correlations are, at best, suggestive. They are not by themselves evidence--nay, not even if you cross your arms, scowl at your opponent, and say "Well, then give me another explanation for this astonishing correlation!" Until you've got something better than a simple correlation, the burden of proof remains upon you.

As I've said before, a strongly statistically significant correlation is, by definition, unlikely to be spurious, and therefore strongly suggestive of some kind of causal connection. It's not conclusive evidence--dredge through enough data and you're bound to turn up some spurious correlations with very small p-values--but it's evidence nevertheless.
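The dredging caveat can be made concrete with a little arithmetic. The sketch below is only an illustration, and it assumes something the post does not state: that the tests are independent and that no real relationship exists in any of them. The function name `prob_any_spurious` is made up for this example.

```python
def prob_any_spurious(alpha: float, num_tests: int) -> float:
    """Chance that at least one of num_tests independent tests, all run on
    pure noise, produces a p-value below alpha."""
    # Each null test stays above alpha with probability (1 - alpha);
    # independence lets us multiply those probabilities together.
    return 1.0 - (1.0 - alpha) ** num_tests


if __name__ == "__main__":
    for m in (1, 14, 100):
        print(m, "tests ->", round(prob_any_spurious(0.05, m), 3))
```

Under those assumptions, a single test at the 0.05 level misleads you 5% of the time, but by the fourteenth test the odds of at least one spurious hit are better than even.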

Of course, you don't get to pick the causal relationship you like best and claim that the correlation proves it--there are always alternative explanations. But neither can you say "correlation is not causation" and then sweep an inconvenient correlation under the rug. A highly significant correlation almost always means that there's something interesting going on, and a model that can't explain it is likely flawed, or at best incomplete.

The more interesting point raised in Dr. Kling's post is his observation that correlations may not always be as significant as they appear, because a steady trend is really only two data points.


Causation is just a theory that two successive phenomena will be observed with perfect correlation.

The real thing people should beware of is this: sample correlation is not correlation. How many statisticians take 1000 time series, stir them, inevitably end up with a good-looking sample correlation, and then claim in a paper that the correlation is significant?
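A minimal Python sketch of that stirring, under assumptions of my own: the series are independent Gaussian noise, each has 20 observations, and the helper names `pearson` and `best_spurious_correlation` are invented for illustration. Every true correlation here is zero, yet the best sample correlation out of 1000 candidates looks impressive.

```python
import math
import random


def pearson(xs, ys):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def best_spurious_correlation(num_series=1000, length=20, seed=0):
    """Correlate one noise series against many others and return the
    sample correlation with the largest absolute value."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    target = [rng.gauss(0, 1) for _ in range(length)]
    best = 0.0
    for _ in range(num_series):
        candidate = [rng.gauss(0, 1) for _ in range(length)]
        r = pearson(target, candidate)
        if abs(r) > abs(best):
            best = r
    return best


if __name__ == "__main__":
    print("best sample correlation among pure noise:",
          round(best_spurious_correlation(), 3))
```

The number reported is a property of the search, not of the data: the more series you try, the larger the best sample correlation tends to be.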

I see his point but

The more interesting point raised in Dr. Kling's post is his observation that correlations may not always be as significant as they appear, because a steady trend is really only two data points.

I definitely see his point, and it goes against your point when the correlation is between two steady trends - i.e., a strong correlation between two steady trends means very little. For example, suppose that I am filling up my glass with beer and, over in another country, somebody else is filling up his glass with beer at just the same time. If you track the quantity of beer over time in our two glasses as we fill them up, then they will be strongly correlated. You don't want to conclude that our pouring is somehow specially linked.

In contrast, if you were to observe us, and if we were to pour, drink, pour, drink, and do so (a) irregularly (sometimes fast, sometimes slow, with no predictable pattern), and (b) exactly in sync with each other, then this is a very different situation from a steady pour, and you might reasonably suspect that there was a special link between us. The correlation may be no stronger, and yet it is much more reasonable to suspect a link. This shows that there is more to linking than correlation.
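Here is a toy simulation of the two beer scenarios, offered as a sketch rather than a proof. The pour rates, noise levels, number of time steps, and function names are all assumptions chosen for illustration, and the glasses are not realistic (levels are never clamped at empty or full). The idea is to compare the correlation of the beer levels with the correlation of the step-by-step changes in level.

```python
import math
import random
from itertools import accumulate


def pearson(xs, ys):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def simulate(steps=200, seed=0):
    rng = random.Random(seed)

    # Scenario 1: two unrelated people, each pouring steadily.
    # Each entry is the change in beer level during one time step.
    pour_a = [1.0 + rng.gauss(0, 0.2) for _ in range(steps)]
    pour_b = [0.7 + rng.gauss(0, 0.2) for _ in range(steps)]

    # Scenario 2: irregular pouring and drinking (changes can be negative),
    # but both glasses follow the same shared signal almost exactly.
    shared = [rng.gauss(0, 1.0) for _ in range(steps)]
    sync_a = [s + rng.gauss(0, 0.1) for s in shared]
    sync_b = [s + rng.gauss(0, 0.1) for s in shared]

    # Levels are the running totals of the changes.
    return {
        "steady_levels": pearson(list(accumulate(pour_a)),
                                 list(accumulate(pour_b))),
        "steady_changes": pearson(pour_a, pour_b),
        "synced_levels": pearson(list(accumulate(sync_a)),
                                 list(accumulate(sync_b))),
        "synced_changes": pearson(sync_a, sync_b),
    }


if __name__ == "__main__":
    for name, r in simulate().items():
        print(f"{name}: {r:.3f}")
```

In this toy model the steady pours are almost perfectly correlated in level but essentially uncorrelated in their changes, while the synchronized drinkers stay strongly correlated even after the trend is removed. Looking at changes rather than levels is one common way to tell the two situations apart, though nothing in the discussion above commits to that particular method.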

However, having said all that, having agreed with Kling's point, I stop short at the expression of this point as, "a steady trend is really only two data points". I am not saying that I have any particular reason to think this is false. I am saying that Kling hasn't given me any particular reason to think this is true. I can immediately see that as a rule of thumb this might be a good way to put into practice the point that Kling was making, because it prevents us from making too much of steady trends while allowing us to make much more of unpredictable and yet still correlated trends. But this only means that this is a useful rule of thumb given what we've realized about steady trends. I'm wondering whether Kling (or you for that matter) has in mind some sort of mathematical or statistical principle from which he can derive this conclusion as more than just a useful rule of thumb. Kling himself seems to express it in a way that suggests it's a rule of thumb:

With a strong trend, you probably should just think of yourself as having two data points--the beginning and the end point.

Two more quick points. First, if two things are correlated, there are three candidate hypotheses: A causes B, B causes A, or A and B are both caused by a common C.

I wonder if we can't, after all, say that the two beer pourings do have one common cause, namely, time itself. An event has multiple causes, which we might therefore call causal factors (or contributing causes). One causal factor in my pouring, and in someone else's pouring, is time. We might argue as follows:

Causal factor 1: I have been pouring beer at 1 ounce/second since 5:35:21.

Causal factor 2: It is now 5:35:28.

As an effect of these two causes, there are now 7 ounces of beer in the glass.

Time can be said to be a causal factor helping to determine the current amount of beer in the glass. Maybe you want to argue that time itself cannot genuinely be considered a cause. The concept of a "cause" in my mind, however, at the moment does not exclude considering time a causal factor.

So when we see two steady trends and notice that they are correlated, this may in fact be evidence that they have a common cause, namely, time. And so the treatment of correlation as evidence of causal connection is vindicated. It is, of course, the sort of common cause in which we probably have little interest. But the treatment of correlation as evidence of causal connection is still, in an uninteresting way, vindicated.

A second quick point: I'm sorry if I missed either you or Kling making this point, but there is the additional caution that if you look hard enough at enough data, you can find spurious relationships. It doesn't take all that much data to find spurious relationships, as illustrated by the familiar birthday paradox.

[edit: I just looked and Arthur B. has just made this last point]
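For anyone who wants the birthday numbers, a short Python sketch follows. It assumes 365 equally likely birthdays, which is the usual textbook simplification, and the function names are invented for this example. The second function counts how many pairwise comparisons are hiding in a pile of variables, which is where the analogy to hunting for correlations comes from.

```python
def shared_birthday_probability(people: int, days: int = 365) -> float:
    """Probability that at least two of `people` share a birthday,
    assuming every day is equally likely."""
    p_all_distinct = 1.0
    for i in range(people):
        # The (i+1)-th person must avoid the i birthdays already taken.
        p_all_distinct *= (days - i) / days
    return 1.0 - p_all_distinct


def pair_count(items: int) -> int:
    """Number of distinct pairs among `items` things."""
    return items * (items - 1) // 2


if __name__ == "__main__":
    print("23 people:", round(shared_birthday_probability(23), 4))
    print("pairs among 23 people:", pair_count(23))
    print("pairs among 1000 series:", pair_count(1000))
```

With only 23 people the chance of a shared birthday already passes one half, because there are 253 pairs to check. A thousand time series contain 499,500 pairs, so a few striking coincidences are to be expected.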