Tuesday, September 24, 2013

Be a WSDP app

In a previous post I referenced the "Willie Sutton Data Principle" (WSDP). Willie Sutton (reputedly) replied, "That's where the money is" when asked, "Why do you rob banks?"
The WSDP is about data capture: data capture is most useful when it happens in the natural course of doing something, not when it requires an extra step. Of course there are exceptions, but there are so many sites/applications where capturing the data is simply too much work to be worth your while. A couple of examples for me: Untappd is a terrific idea for sharing beer-drinking experiences, but the work required to enter the beers far outweighs any benefit I derive. Now I drink a lot of beer, and I try many different brews, but somehow crowdsourcing my beer drinking is too much work. Similarly, I would have to be a fanatic to enter the foods/drinks I consume into my Jawbone Up. I love the product for showing exercise and sleep habits (that's passive collection), but as soon as I have to do work to collect the data, I am simply not motivated enough.
Contrast that with Google Maps and its "real time" traffic. That's a great WSDP app. The data are captured passively as we move around and then made usable to other drivers. I contribute to the crowdsourced data without having to work at it.
Moral of the story: make entering data worth your users' while, or make the data entry invisible to them. Be a WSDP app where possible.

TV advertising and idempotence

Reading this and reflecting on my own experience got me thinking about the nature of commercials on TV.
In our house we rarely watch live TV - mostly delayed or semi-live. Semi-live is my term for starting to watch after the show/event has begun, so that we can skip the commercials. Delayed I take to mean some time after the show/event is over.
There are several programs (e.g. repeats, highlights, old movies) that are in a sense timeless. I might go back days or weeks after the recording to watch. And therein lies the rub.
Idempotence is a mathematical term: strictly, an operation is idempotent if applying it more than once gives the same result as applying it once. I'm using it loosely here to mean "when you invoke a function you get the same answer every time you invoke it." If we think of language as invoking functions, there are words that definitely cause us difficulties. "Tomorrow", for example, gives a different answer every day. "Next Wednesday" gives a different answer every week.
So with delayed TV, an advertisement that crops up saying "watch xxx tomorrow" becomes meaningless when I watch the recording after xxx has already aired.
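A toy sketch of the point, in Python with made-up dates (the show name and air date are hypothetical): a relative reference like "tomorrow" resolves differently depending on when it is evaluated, while an absolute date resolves the same way for everyone.

    from datetime import date, timedelta

    # "Watch xxx tomorrow" is a relative reference: its meaning depends on
    # when it is evaluated, so a delayed viewer resolves it to the wrong day.
    def air_date_relative(viewing_date: date) -> date:
        return viewing_date + timedelta(days=1)  # "tomorrow"

    # An absolute reference resolves the same way no matter when it is read.
    AIR_DATE_ABSOLUTE = date(2013, 9, 25)  # hypothetical air date

    live_viewer = date(2013, 9, 24)     # watches on the broadcast day
    delayed_viewer = date(2013, 10, 2)  # watches the recording a week later

    print(air_date_relative(live_viewer))     # 2013-09-25 -- correct
    print(air_date_relative(delayed_viewer))  # 2013-10-03 -- wrong day entirely
    print(AIR_DATE_ABSOLUTE)                  # 2013-09-25 for everyone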
To make advertising more meaningful, a couple of things could help - at least for us in the delayed-TV crowd.
  1. Insert time relevant advertising at the moment of watching the program
  2. If it isn't possible to insert advertising (because of contracts, agreements, technology, cost or whatever), then make sure the advertisements represent time in an idempotent way - give dates and times in absolute terms, not just relative ones.
Right now, one of the main reasons to watch delayed is to avoid the advertisements, and that works well for me. But if the advertisements were harder to avoid and more relevant to the time of viewing, then maybe they would reach me better.

Wednesday, September 11, 2013

My hate/hate relationship with Amazon

My first encounter with Amazon left a very bad taste. I had bought a friend a book for his birthday. I wanted to send it anonymously to give him a bit of an opportunity to do some detective work. So I went to the fledgling amazon.com and ordered the book. I checked the box that allowed me not to include any information about the sender, gave my credit card number, and sat back awaiting a fun game as my friend tried to figure out where it came from. It didn't take long! Amazon indeed did not include any information about me inside the package. They did, however, include my name in the return address on the exterior shipping label. That's just plain wrong - especially when they say they will never share details, etc.

Fast forward to today. I had been researching a specific Thai cookery book (by Chef McDang, with whom I was at school in England, but that is another story). I have a copy, but needed some further information about the book - again for a friend. I didn't log in to Amazon while doing this search. Yet within a few days of doing that search, I am now inundated by Amazon with suggestions for Thai cookbooks. Impressive for sure. But not exactly what I had in mind.

There are actually two things in there that I don't like. First, they were reasoning from the particular to the general and back to the particular: "If you like one Thai recipe book, you probably like Thai recipes/cooking, and so you will like these." Second, even though I thought I had wiped caches, etc., Amazon was still able to determine who I was when I made the original request and do something obnoxious with that data.

Creepy, isn't it?

The Dow Jones average and history

This post was prompted by seeing the Dow Jones Industrial Average swap out 3 of its component stocks. Gone are Alcoa, Bank of America, and Hewlett-Packard. In are Nike, Visa, and Goldman Sachs. The change was announced on September 10, 2013, to take effect on September 23, 2013. It isn't the first time, nor will it be the last, that the Dow will make adjustments. But it does leave me with a rather uncomfortable question: how valid is it to compare the Dow of September 23, 2013 with the Dow of September 22, 2013? Had Alcoa, BofA and H-P still been in the index on September 23, the Dow would have one value; since they aren't, it has another.

The way the "magic" occurs is that the Dow is a price-weighted index: a shift of $1 in the price of any component stock moves the index a fixed number of points, no matter which stock it is. To prevent discontinuities, structural events (stock splits, acquisitions, divestitures) and additions/removals to the index cause a weighting factor (the divisor) to change. From March 2013 until now, a $1 change in any stock price has moved the index about 7.7 points; from September 23, 2013 onwards, a $1 change will move it about 6.5 points.
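A minimal sketch of those divisor mechanics, using made-up prices rather than real DJIA data: on the changeover date a new divisor is chosen so the index value itself doesn't jump, even though the component prices now sum differently.

    # A price-weighted index is sum(component prices) / divisor.
    # All prices below are illustrative, not actual quotes.
    old_components = {"AA": 8.0, "BAC": 14.0, "HPQ": 21.0}
    new_components = {"NKE": 67.0, "V": 190.0, "GS": 165.0}
    others_total = 1800.0  # combined price of the 27 unchanged components

    old_divisor = 0.1302   # roughly 1 / 7.7 points-per-dollar, per the post

    index_before = (others_total + sum(old_components.values())) / old_divisor

    # Choose the new divisor so the index is unchanged at the instant of
    # the swap - no discontinuity, but a different points-per-dollar.
    new_sum = others_total + sum(new_components.values())
    new_divisor = new_sum / index_before

    assert abs(new_sum / new_divisor - index_before) < 1e-9
    print(f"divisor {old_divisor:.4f} -> {new_divisor:.4f}")
    print(f"a $1 move now shifts the index {1 / new_divisor:.1f} points")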

So as we compare Dow index values over the years, we at least need to be aware that it isn't a single seamless whole, with the same stocks at the same weighting all the time.

As we look at index data in our own businesses, we also need to realize that the way an index is calculated can affect the perception of the underlying reality that has been abstracted by the index. 

Thursday, August 29, 2013

The Willie Sutton Data Principle (WSDP)

Willie Sutton was a bank robber. The story (possibly apocryphal) goes that when asked why he robbed banks, he replied, "Because that is where the money is". So, continuing the theme of doing things because that's where they are, let's turn our attention to data capture.
Data (including metadata) captured during actual operation are often more interesting than attempts to capture intent after the fact, so let's make sure we do the capture that way. The more impressive web properties do exactly that:

  • You looked at these 12 options and didn't buy (from a search process)
  • You looked at these 4 things together and did a side-by-side comparison (when online computer shopping) - and you purchased...
  • You rented this movie; here are some others you might like. More importantly, this kind of site learns more about your habits as you visit more often.
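To make the principle concrete, here is a minimal sketch in Python (all names and the logging target are illustrative) of WSDP-style capture: the event is recorded as a side effect of the action the user already wanted to take, never as an extra step.

    import json, time

    def fetch_product(product_id: str) -> dict:
        return {"id": product_id, "name": "demo product"}  # stand-in lookup

    def log_event(event_type: str, **context) -> None:
        # The capture is a side effect of serving the request; the user
        # does no extra work, so the data shows up "where the money is".
        record = {"ts": time.time(), "event": event_type, **context}
        with open("events.log", "a") as f:  # stand-in for a real pipeline
            f.write(json.dumps(record) + "\n")

    def view_product(user_id: str, product_id: str) -> dict:
        log_event("product_viewed", user=user_id, product=product_id)
        return fetch_product(product_id)  # the thing the user actually wanted

    view_product("u42", "thai-cookbook-1")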
Compare and contrast this practice with the university that I mentioned in the previous post. There the data capture specifically goes against the WSDP. The school is not capturing the data in the course of normal activity. It is attempting to make the data sources (students) do something extra and inconvenient to provide data that they don't care about.

Data quality vs cost

How often do we see surveys and other after-the-fact data capture trying to record people's impressions and desires? Too often, I fear. Here's a story about something going very wrong.
An unnamed university used to have students fill out their instructor ratings on paper towards the end of the semester. The professor would hand out the forms in one of the last lectures and leave the room for 10 or 15 minutes while the students filled them in, sealed the completed forms into an envelope, and placed the envelope in a basket. The reviews were anonymous, so the simple act of sealing them in an envelope gave the needed confidence.
The data for each class were then collated, tabulated, and made available to the instructor as averages and max/min values, without any specific student identification. A relatively straightforward system. A burden on some poor administrator perhaps - entering the data could be a bit tedious, especially for the larger classes.
Then the bean counters got control. Aha - we could save money by giving the students an online survey form instead. They would fill it out. It would still be anonymous. The administrator would be freed up. What could go wrong? Answer: just about everything.
  • Students are strongly discouraged from taking computing equipment into a classroom, so they now have to fill in the online survey at some other time.
  • The URL is another painful link to have to remember when they get to that other location.
  • Many of the students mainly use smartphones or tablets, and the survey form is not conducive to those form factors.
  • There is no value to the individual students in filling in the form anyway. It is for instructor appraisal and measurement, not for the students' benefit.
The completion rate went from about 70% for the paper forms to less than 20% for the online forms. Data quality suffered, and instructors are now deprived of valuable feedback. The school is desperate for that feedback, but rather than fixing the root cause of the problem, it is now suggesting that instructors give "extra credit" to students who do the survey. That is of course wrong at many levels: students receive a grade for something unrelated to their academic progress (ethics, anyone?), and the instructor isn't supposed to know which students actually filled in the survey anyway.
The key message here: if you want good data, make the capture as unobtrusive to the data source as possible - even if you have to do work at the back end to make the data usable.
At some level, that's one axis of what "big data" is about: capturing the data at the point of use, without requiring any extra steps, and then analyzing it in the ways you want to.

Saturday, August 24, 2013

Historical Context

When we want to access transactional data long after the transaction has been posted to a system of record, we need to make sure we have enough context to reconstruct the history.
  • How do we go back and find who sold a particular item years after a sale was registered?
  • How do we compare performance over time when the rules have changed?
  • How do we reprice an item to produce a receipt (perhaps for provenance) when we don't know what the pricing rules were at the time of the relevant sale? 
There is a wealth of data surrounding the transactions - data that in the transactional systems would likely be properly normalized and well managed. When we want to look up the cashier at the time a sale was executed, we can: after all, we know which terminal they signed in at, what employee id they signed in with, and so on. But when that employee id is no longer "valid" - i.e. the employee can no longer perform transactions - the fact that a particular employee performed a particular transaction is still a fact. Forensic applications comb through backups attempting to link all the data together, in an attempt to piece together the relevant facts. That is, however, a slow and tedious process.
We almost have to go against the fundamental principles we learned when designing databases - especially transactional databases: normalize the data to reduce redundancy and remove the possibility of update/delete anomalies. These historical (sometimes warehousing-oriented) systems don't work that way. We have to scoop up whatever else was relevant at the time of the transaction, so that we can run analyses nobody anticipated.
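Here is a minimal sketch, in Python, of that deliberately denormalized "snapshot" capture (the field names, file, and promo rule are all illustrative): copy the surrounding context into the event itself instead of storing only foreign keys.

    import json
    from datetime import datetime, timezone

    def record_sale(item_id, price, employee, pricing_rule):
        # Snapshot the context rather than just referencing it, so the
        # record stays interpretable years later - after the employee id
        # is retired or the pricing-rule table has been rewritten.
        event = {
            "ts": datetime.now(timezone.utc).isoformat(),
            "item_id": item_id,
            "price": price,
            "employee": {"id": employee["id"], "name": employee["name"],
                         "terminal": employee["terminal"]},
            "pricing_rule": dict(pricing_rule),
        }
        with open("sales_history.jsonl", "a") as f:
            f.write(json.dumps(event) + "\n")

    record_sale("SKU-1001", 19.99,
                {"id": "E123", "name": "A. Cashier", "terminal": "T7"},
                {"rule": "SEPT-PROMO", "discount": 0.10})

It is redundant by design: the cost is extra storage; the payoff is that the historical record no longer depends on the current state of the master data.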
Web properties have known this forever. When doing A/B testing, the whole interaction context is harvested so that minute behavioral changes can be analyzed - long after the original master data have been purged.
There is a second dirty and ugly secret lurking here too. Not only do we have to capture more data than we ever thought, but even if we have the data we may not be able to use it, because the application required to rematerialize it no longer exists. We upgrade our applications, remembering to change the DB schema, but without a good way to put the right version of the application in place for the point in time of a particular piece of data we wish to examine, we still may not be able to make use of that data.
Continuous data archival, synchronized with the proper copies of the applications, presents challenges at a scale that the big data world is just getting to grips with.