A confession
I do confess that I read textbooks. Mostly they cover obscure methods and applications, but occasionally they are graduate or college level, or even introductory. I do this because I need to keep up with new tricks, but also because there is something special in trying to teach a newbie analytics.
I mean, it is all about creating the right mental image so they can see the data shape and flow, and so instinctively recognise how to work with it. And introductory texts are full of simplified, yet correct, views of the world of data.
So, what is my confession? I normally gloss over the sections about “descriptive statistics” (i.e. using stats to describe the overall “feel” of the data - 1 in 5 kids skip breakfast, etc).
They never interested me. I am a more “inferential modelling” type of fella (i.e. using hypothesis testing and other hefty tools to infer insight and generate correlations and predictions about a data set).
If it isn’t predicting an outcome, or telling me how to do things better or faster, why would I care? I should acknowledge that this often gets me into trouble with the Communications section, who are charged with creating the visualisations and stories around the inferences I find.
But I had a moment the other day when I was challenged on why I had made a choice of design. My answer had to do with the underlying nature of the data set…yep. Descriptive Statistics held the key.
So now, as well as sending my models down to Communications to have them make the pretty visuals for the output, I speak to them before I model too, to see if the descriptive stuff holds any secrets that make the modelling and data mining more efficient, or more appropriate.
This blog was written without duress. No inference should be made about my continuing support for the Communications department and the wonderful work they do.
Mmmm humble pie……


Thursday, October 29, 2009 at 1:24PM
Reader Comments (1)
"Anscombe's Quartet", Anscombe (1973). Most convincing "look at the data first" argument that I've met.
Clustering to find which data are patterned and which are outliers, and then doing a residual analysis of the implied model, should work to correctly eliminate outliers and determine appropriate transforms analytically. Realistically, that kind of automation is not advanced enough yet to keep up with our ability to simply look at something and pull out patterns and flag potential outliers. OK, so the human mind seems to have a cutoff around p=.6, but it is so fast it's a shame to waste the EDA capability.
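
For anyone who has not met the quartet, here is a minimal sketch of why it makes the "look at the data first" point so well. It uses the four published Anscombe data sets and plain Python; the names `QUARTET` and `summarise` are made up for this illustration and do not come from any library. Each set is summarised with the usual numbers (means, least-squares slope and intercept, correlation), and the summaries come out nearly identical: mean x of 9, mean y of about 7.5, slope of about 0.5, intercept of about 3 and correlation of about 0.82.

```python
from math import sqrt

# Anscombe's four data sets. Sets I-III share the same x values.
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96,
                 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10,
                  6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84,
                   6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04,
            5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarise(xs, ys):
    """Return means, least-squares line and Pearson r for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squares and cross-products about the means.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / sqrt(sxx * syy)
    return {
        "mean_x": mean_x,
        "mean_y": mean_y,
        "slope": slope,
        "intercept": intercept,
        "r": r,
    }


for name, (xs, ys) in QUARTET.items():
    s = summarise(xs, ys)
    print(
        f"{name:>3}: mean_x={s['mean_x']:.2f} mean_y={s['mean_y']:.2f} "
        f"slope={s['slope']:.2f} intercept={s['intercept']:.2f} r={s['r']:.2f}"
    )
```

The numbers alone cannot tell the four sets apart. Plot them and the differences are obvious at a glance: one is a noisy line, one is a curve, one is a clean line with a single outlier, and one is a vertical stack of points with one far-off value setting the whole slope.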