Thursday, August 29, 2013

The Willie Sutton Data Principle (WSDP)

Willie Sutton was a bank robber. The story (possibly apocryphal) goes that when asked why he robbed banks, he replied, "Because that is where the money is". So, continuing the theme of going where the thing you want already is, let's turn our attention to data capture.
Data (including metadata) captured during actual operation are often more interesting than attempts to capture intent after the fact. So let's make sure we do the capture that way. The more impressive web properties do exactly that, for example:

  • You looked at these 12 options and didn't buy (from a search process)
  • You looked at these 4 things together and did a side-by-side comparison (when shopping online for a computer) - and you purchased...
  • You rented this movie; here are some others you might like - and, more importantly, this kind of site learns more about your habits the more often you visit
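As a purely illustrative sketch of what capturing in the normal flow of use can look like, the snippet below logs interaction events as they happen; the event names, fields and the JSON-lines file are made up for this example rather than taken from any particular site's API.

    import json, time, uuid

    def log_event(event_type, session_id, payload):
        """Append one interaction event, captured as it happens, to an event log."""
        event = {
            "event_id": str(uuid.uuid4()),
            "timestamp": time.time(),
            "session_id": session_id,
            "type": event_type,
            "payload": payload,
        }
        # A local file stands in for whatever sink is really used (queue, log pipeline, ...).
        with open("events.jsonl", "a") as f:
            f.write(json.dumps(event) + "\n")

    # Captured in the normal flow of browsing - the shopper does nothing extra.
    log_event("search_results_viewed", "sess-42",
              {"query": "laptop", "results_shown": 12, "purchased": False})
    log_event("side_by_side_compare", "sess-42",
              {"item_ids": ["A1", "B2", "C3", "D4"], "purchased": "B2"})

The data source does nothing beyond what it was already doing; all the work of making the events useful happens later, at the back end.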
Compare and contrast this practice with the university that I mentioned in the previous post. There the data capture specifically goes against the WSDP. The school is not capturing the data in the course of normal activity. It is attempting to make the data sources (students) do something extra and inconvenient to provide data that they don't care about.

Data quality vs cost

How often do we see surveys and other after-the-fact attempts to capture people's impressions and desires? Too often, I fear. Here's a story about something going very wrong.
An unnamed university used to have students fill out their instructor ratings on paper towards the end of the semester. The professor would hand out the forms in one of the last lectures and leave the room for the 10 or 15 minutes it took the students to fill them in; the completed forms went into a sealed envelope, and the envelope went into a basket. The reviews were anonymous, and the simple act of sealing them in an envelope gave students the confidence that they would stay that way.
The data for each class were then collated, tabulated and made available to the instructor as averages and maximum/minimum values, with no identification of specific students. A relatively straightforward system. A burden on some poor administrator perhaps - entering the data could be a bit tedious, especially for the larger classes.
Bean counters get control. Aha, we could save money by giving the students an online survey form instead. They would fill that out. It would still be anonymous. The administrator would be freed up. What could go wrong? Answer - just about everything.
  • Students are strongly discouraged from bringing computing equipment into a classroom, so they now have to fill in the online survey at some other time.
  • The URL is yet another painful link to remember when they finally sit down somewhere else.
  • Many students mainly use smartphones or tablets, and the survey form is not suited to that form factor.
  • There is no value to the individual student in filling in the form anyway; it exists for instructor appraisal and measurement, not for the students' benefit.
The completion rate went from about 70% for the paper forms to less than 20% for the online forms. The data quality suffered, and instructors are now deprived of valuable feedback. The school is desperate for that feedback, yet rather than fixing the root cause it is now suggesting that instructors give "extra credit" to students who complete the survey. That is, of course, wrong on many levels: students receive a grade for something unrelated to their academic progress (ethics, anyone?), and the instructor isn't supposed to know which students filled in the survey in the first place.
The key message here is that if you want good data, make the capture as unobtrusive to the data source as possible - even if you have to do extra work at the back end to make the data usable.
At some level that's one axis of what "big data" is about: capturing the data at the point of use, without requiring any extra steps, and then analyzing those data in whatever ways you want.
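And, to round out the "analyze it any way you want" half, here is an equally illustrative sketch that reads back the events captured in the earlier snippet and answers a question nobody planned for at capture time: which searches most often end without a purchase?

    import json
    from collections import Counter

    # Read back the raw events captured at the point of use (the illustrative
    # events.jsonl file from the earlier sketch) and answer a question that was
    # never designed into the capture step.
    abandoned_queries = Counter()
    with open("events.jsonl") as f:
        for line in f:
            event = json.loads(line)
            payload = event["payload"]
            if event["type"] == "search_results_viewed" and not payload["purchased"]:
                abandoned_queries[payload["query"]] += 1

    # Which searches most often end without a sale?
    print(abandoned_queries.most_common(10))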

Saturday, August 24, 2013

Historical Context

When we want to access transactional data long after the transaction has been posted to a system of record, we need to make sure we have enough context to reconstruct the history.
  • How do we go back and find who sold a particular item years after a sale was registered?
  • How do we compare performance over time when the rules have changed?
  • How do we reprice an item to produce a receipt (perhaps for provenance) when we don't know what the pricing rules were at the time of the relevant sale? 
There is a wealth of data surrounding the transactions - data that in the transactional systems would likely be properly normalized and well managed. When we want to look up the cashier at the time a sale was executed, we can: after all, we know which terminal they signed in at, what employee id they signed in with, and so on. However, even when the employee id is no longer "valid" - i.e. the employee can no longer perform transactions - the fact that a particular employee performed a particular transaction is still a fact. Forensic applications comb through backups, attempting to link all the data together in an effort to piece the relevant facts back together. That is, however, a slow and tedious process.
We almost have to go against the fundamental principles we learned when designing databases - especially transactional databases: normalize the data to reduce redundancy and remove the possibility of update/delete anomalies. But these historical (sometimes warehousing-oriented) systems don't work that way. We have to scoop up whatever else was relevant at the time of the transaction, so that we can run the analytics that seemed unlikely back then.
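A small sketch of what that scooping up might look like is below; the record shape, field names and values are invented for illustration, not drawn from any particular retail system. The point is that the cashier's name and the pricing rules in force are copied into the historical record at the moment of the sale, rather than referenced through keys that may later stop resolving.

    import json, datetime

    def record_sale_for_history(sale, cashier, pricing_rules):
        """Snapshot a sale together with the context that was true at the time.

        In the transactional system the cashier and the pricing rules would be
        foreign keys; here the values themselves are copied so the fact survives
        later changes to the reference data.
        """
        return {
            "sale_id": sale["id"],
            "recorded_at": datetime.datetime.utcnow().isoformat() + "Z",
            "items": sale["items"],
            "total": sale["total"],
            # Copied, not referenced: still readable after the employee id is retired.
            "cashier": {"employee_id": cashier["id"], "name": cashier["name"],
                        "terminal": sale["terminal"]},
            # The rules in force at the moment of sale, kept verbatim for later repricing.
            "pricing_rules_snapshot": pricing_rules,
        }

    sale = {"id": "S-1001", "terminal": "T7",
            "items": [{"sku": "X9", "qty": 1, "unit_price": 19.99}], "total": 19.99}
    cashier = {"id": "E-204", "name": "J. Smith"}
    rules = {"version": "2013-08", "tax_rate": 0.07, "discounts": []}

    print(json.dumps(record_sale_for_history(sale, cashier, rules), indent=2))

The cost is redundancy; the payoff is that years later the fact stands on its own, with no forensic reassembly required.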
Web properties have known this forever. When doing A/B testing, the whole interaction context is harvested so that minute behavioral changes can be analyzed - long after the original master data have been purged.
There is a second dirty and ugly secret lurking here too. Not only do we have to capture more data than we ever thought, but even when we have the data we may not be able to use it, because the application required to rematerialize it no longer exists. We have upgraded our applications, remembering to change the DB schema, but without a good way to get the right version of the application in place for the point in time of a particular piece of data, we still may not be able to make use of that data.
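One hedge against that problem - sketched here with invented version tags and reader functions, not any standard tooling - is to stamp every archived record with the schema version of the application that wrote it, and to keep a registry of readers able to rematerialize each version:

    # Entirely illustrative: each archived record carries the schema version that
    # wrote it, and a small registry maps versions to readers that can
    # rematerialize that shape of data.

    READERS = {}

    def reader(version):
        """Register a function able to interpret records written under one schema version."""
        def wrap(fn):
            READERS[version] = fn
            return fn
        return wrap

    @reader("v1")
    def read_v1(record):
        return {"total": record["amount"]}            # old schema: a flat amount field

    @reader("v2")
    def read_v2(record):
        return {"total": sum(record["line_totals"])}  # newer schema: itemized lines

    def rematerialize(record):
        version = record.get("schema_version")
        if version not in READERS:
            raise ValueError("no archived reader for schema_version %r" % version)
        return READERS[version](record)

    print(rematerialize({"schema_version": "v1", "amount": 42.0}))
    print(rematerialize({"schema_version": "v2", "line_totals": [10.0, 32.0]}))

Keeping those readers alive, of course, is itself an archival burden.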
Continuous data archival, synchronized with the proper copies of the applications, presents challenges at a scale that the big data world is just getting to grips with.