Of Drunkards and Lamp-posts
It has been said of statistics that they “are often used as a drunkard uses a lamp post, more for support than illumination.”
It frequently falls to us to make an argument based on a survey. To do this, we must first live with the data for a bit, getting a feel for its quirks and shortcomings along with its strengths. Unfortunately for all applied statisticians, we rarely get the pleasure of interpreting data collected under our own design, where we have attempted to control all the confounding factors.
There is a growing school, driven by computer scientists and engineers, of non-parametric study of datasets that were generated without control, often called “data dredging.” It is the sometimes harrowing and risky process of looking for patterns post hoc, and then attaching a p value to their strength.
There are lots of discussions floating about out there regarding the relevance of a p value on a dredged pattern, usually along the lines of “well, given ‘something’ has to happen…” And they are valid. But I came across another concern that has left me perplexed.
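To put a rough number on the “something has to happen” argument, here is a minimal sketch. It assumes the dredged comparisons are independent and that no real effect exists, neither of which holds exactly in practice; the function name and the 0.05 threshold are my own illustrative choices.

```python
def chance_of_false_pattern(n_tests, alpha=0.05):
    """Probability that at least one of n_tests independent null
    comparisons produces p < alpha purely by chance."""
    return 1.0 - (1.0 - alpha) ** n_tests

# The more patterns we look at, the likelier it is that one looks "significant".
for m in (1, 10, 50, 200):
    print(m, round(chance_of_false_pattern(m), 3))
```

The point is only that a p value quoted on the single pattern that stood out ignores all the patterns that were looked at and discarded.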
We all know about the difficulty in interpreting cross tables, particularly the 2×2, the staple of demographics.
But dredging might make this even harder. Given a survey design, the interpretation of a contingency table, or cross table, is relatively easy. You know which factor was controlled, and how the subjects were chosen. With a dredged set, how was the table populated?
Let us presume that we have a 3×2 table (three rows, two columns). If we selected the data using a query that fixed the row totals, we could analyse it as three binomials. Easy. If we fixed the column totals, we have two multinomials. A different test, but easy enough. If we extracted a fixed number of records (the grand total) and populated the table, the six cells together form a single multinomial; if not even the grand total was fixed, each cell becomes a Poisson variable. That’s more complicated, but doable. What about if there is another factor we didn’t think of? Something that might inflate the variance… What I now have is a list of at least three different ways to analyse the same table and, in the case of a dredged set, no meaningful way to choose between them.
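To make the ambiguity concrete, here is a small sketch of three ways a 3×2 table could have been filled. The cell probabilities `P`, the totals, and all function names are made up for illustration; nothing here comes from a real dataset, and it does not attempt the unknown variance-inflating factor.

```python
import random

# Hypothetical cell probabilities for a 3x2 table (rows x columns); illustrative only.
P = [[0.10, 0.20],
     [0.15, 0.15],
     [0.25, 0.15]]

def draw(weights, n, rng):
    """Drop n records into len(weights) bins using the given relative weights."""
    counts = [0] * len(weights)
    for k in rng.choices(range(len(weights)), weights=weights, k=n):
        counts[k] += 1
    return counts

def fixed_rows(row_totals, rng):
    """Query fixed the row totals: each row is an independent binomial."""
    return [draw(P[i], n, rng) for i, n in enumerate(row_totals)]

def fixed_columns(col_totals, rng):
    """Query fixed the column totals: each column is a 3-category multinomial."""
    cols = [draw([P[i][j] for i in range(3)], n, rng)
            for j, n in enumerate(col_totals)]
    return [[cols[j][i] for j in range(2)] for i in range(3)]

def fixed_total(n, rng):
    """Only the grand total fixed: one multinomial over all six cells."""
    flat = draw([p for row in P for p in row], n, rng)
    return [flat[0:2], flat[2:4], flat[4:6]]

def pearson_chi2(table):
    """Pearson chi-square statistic, expected counts = row * column / total."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    total = sum(rows)
    stat = 0.0
    for i, r in enumerate(rows):
        for j, c in enumerate(cols):
            expected = r * c / total
            stat += (table[i][j] - expected) ** 2 / expected
    return stat

rng = random.Random(0)  # seeded so the run is repeatable
tables = {
    "fixed rows": fixed_rows([100, 100, 100], rng),
    "fixed columns": fixed_columns([150, 150], rng),
    "fixed total": fixed_total(300, rng),
}
for name, table in tables.items():
    print(name, table, round(pearson_chi2(table), 2))
```

Looking only at the finished tables, nothing tells you which mechanism produced them. The Pearson statistic is computed identically for all three; what changes is which totals are treated as fixed when judging it, and that is exactly the information a dredged set does not carry.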
If anyone has any ideas, I’d love to hear them, especially if anyone feels like writing some theory for us foot soldiers that will guide us when there is no data model.
And bemoaning dredging as a practice isn’t helpful; it is here to stay. We just need a rudder to help steer it.

Wednesday, April 7, 2010 at 10:09AM