Showing posts with label data quality. Show all posts

Thursday, August 29, 2013

The Willie Sutton Data Principle (WSDP)

Willie Sutton was a bank robber. The story (possibly apocryphal) goes that when asked why he robbed banks, he replied, "Because that is where the money is." Continuing that theme of going where the goods are, let's turn our attention to data capture.
Data (including metadata) captured during actual operation are often more interesting than data from attempts to capture intent after the fact. So let's make sure we capture data that way. The more impressive web properties do exactly that:

  • You looked at these 12 options and didn't buy (from a search process)
  • You looked at these 4 things together and did a side-by-side comparison (when shopping online for a computer), and then you purchased
  • You rented this movie, so here are some others you might like. More importantly, this kind of site learns more about your habits the more often you visit
Compare and contrast this practice with the university that I mentioned in the previous post. There the data capture specifically goes against the WSDP. The school is not capturing the data in the course of normal activity. It is attempting to make the data sources (students) do something extra and inconvenient to provide data that they don't care about.
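To make the contrast concrete, here is a minimal sketch of capture during normal operation, in the spirit of the shopping examples above. Everything in it is hypothetical and for illustration only: the `Event` and `EventLog` names, the action labels ("view", "compare", "purchase") and the item identifiers are assumptions, not how any particular site does it. The point is simply that the user never fills anything in; the log is a side effect of what they were doing anyway.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Event:
    """One thing a user did during normal use (hypothetical schema)."""
    user: str
    action: str   # assumed labels: "view", "compare", "purchase"
    items: tuple
    at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


class EventLog:
    """Append-only log, filled as a side effect of normal site use."""

    def __init__(self):
        self.events = []

    def record(self, user, action, *items):
        # Called by the site itself; the user does no extra work.
        self.events.append(Event(user, action, tuple(items)))

    def viewed_not_bought(self, user):
        """Back-end analysis: items the user looked at but never purchased."""
        viewed, bought = set(), set()
        for e in self.events:
            if e.user != user:
                continue
            if e.action in ("view", "compare"):
                viewed.update(e.items)
            elif e.action == "purchase":
                bought.update(e.items)
        return viewed - bought


log = EventLog()
log.record("alice", "view", "laptop-a")
log.record("alice", "compare", "laptop-a", "laptop-b", "laptop-c", "laptop-d")
log.record("alice", "purchase", "laptop-b")
print(sorted(log.viewed_not_bought("alice")))
```

The interpretation ("looked at but didn't buy") happens at the back end, after the fact, on data that was captured at the point of use.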

Data quality vs cost

How often do we see surveys and other after-the-fact data capture trying to record people's impressions and desires? Too often, I fear. Here's a story about something going very wrong.
An unnamed university used to have students fill out their instructor ratings on paper towards the end of the semester. In one of the last lectures, the professor would hand out the forms and leave the room for 10 or 15 minutes while the students filled them in. The students would put the completed forms into envelopes and place the envelopes in a basket. The reviews were anonymous, and the simple act of sealing them in an envelope gave students the confidence they needed.
Then the data for each class were collated, tabulated and made available to the instructor as averages, maximum and minimum values, without any student identification. A relatively straightforward system, though perhaps a burden on some poor administrator: entering the data could be tedious, especially for the larger classes.
Then the bean counters got control. Aha, we could save money by giving the students an online survey form instead. They would fill that out. It would still be anonymous. The administrator would be freed up. What could go wrong? Answer: just about everything.
  • Students are strongly discouraged from taking computing equipment into a classroom, so they now have to fill in the online survey at some other time
  • The URL is one more painful link to remember when they get to that other place
  • Many of the students mainly use smartphones or tablets, and the survey form does not work well on that form factor
  • There is no value to the individual students in filling in the form anyway. It is for instructor appraisal and measurement, not for the students' benefit
The completion rate went from about 70% for the paper forms to less than 20% for the online forms. The data quality suffered. Instructors are now deprived of valuable feedback, and the school is desperate for it. Rather than fixing the root cause of the problem, the school is now suggesting that instructors give "extra credit" to students who do the survey. That is of course wrong on many levels. Students would be receiving a grade for something unrelated to their academic progress (ethics, anyone?). And the instructor isn't supposed to know which students filled in the survey anyway.
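A rough back-of-the-envelope sketch of what that drop means for one class. The class size of 40 is a made-up figure, 20% is used as an upper bound for "less than 20%", and the helper names are hypothetical. It also assumes the uncertainty in an average rating scales with one over the square root of the number of responses, which ignores the fact that the few students who do respond online may not be typical (probably the bigger problem).

```python
import math


def expected_responses(class_size, completion_rate):
    """Approximate number of completed forms for a class."""
    return round(class_size * completion_rate)


def relative_standard_error(n_before, n_after):
    """How much wider the uncertainty on an average gets when the
    number of responses drops from n_before to n_after (1/sqrt(n) rule)."""
    return math.sqrt(n_before / n_after)


class_size = 40                                  # hypothetical class
paper = expected_responses(class_size, 0.70)     # about 70% on paper
online = expected_responses(class_size, 0.20)    # at most 20% online
print(paper, online)                             # 28 8
print(round(relative_standard_error(paper, online), 2))  # 1.87
```

So even before worrying about who bothers to respond, the instructor's averages rest on fewer than a third as many forms and are nearly twice as noisy.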
The key message here is that if you want good data, make the capture of the data as unobtrusive to the data source as possible, even if you have to do work at the back end to make the data usable.
At some level that's one axis of what "big data" is about: capture the data at the point of use without requiring any extra steps, then analyze that data in whatever ways you want.