
Wednesday, September 11, 2013

The Dow Jones average and history

This posting was prompted by seeing the Dow Jones Industrial Average change out 3 of its component stocks. Gone are Alcoa, Bank of America, and Hewlett-Packard. In are Nike, Visa, and Goldman Sachs. The change was announced on September 10, 2013, to take effect on September 23, 2013. It isn't the first time, nor will it be the last, that the Dow makes adjustments. But it does leave me with a rather uncomfortable question: how valid is it to compare the Dow of September 23, 2013 with the Dow of September 22, 2013? Had Alcoa, BofA, and H-P still been in the index on September 23, the Dow would have one value; since they aren't, it has another.

The way the "magic" occurs is that the DOW is a price weighted index. The idea is that a shift of $1 in the price of any stock will cause the index to shift a certain number of points. It doesn't matter which stock it is.To prevent any discontinuities, structural activities (stock splits, acquisitions, divestitures) and additions/removals to the index cause a weighting factor to change. From March 2013 to now the weighting factor has been that a $1 change in any stock price will move the index 7.7 points. From September 23, 2013 onwards a $1 change in a stock price will cause a movement of 6.5 points

So as we compare Dow index values over the years, we at least need to be aware that the index isn't a single seamless whole, with the same stocks at the same weighting all the time.

As we look at index data in our own businesses, we also need to realize that the way an index is calculated can affect our perception of the underlying reality that the index abstracts.

Thursday, August 29, 2013

The Willie Sutton Data Principle (WSDP)

Willie Sutton was a bank robber. The story (possibly apocryphal) goes that when asked why he robbed banks, he replied, "Because that is where the money is." So, continuing the theme of going where the goods are, let's turn our attention to data capture.
Data (including metadata) captured during actual operation are often more interesting than attempts to capture intent after the fact, so let's make sure we do the capture that way. The more impressive web properties do exactly that; a small sketch of this style of capture follows the list below.

  • You looked at these 12 options and didn't buy (from a search process)
  • You looked at these 4 things together and did a side-by-side comparison (when online computer shopping) - and you purchased...
  • You rented this movie; here are some others you might like - more importantly, this kind of site learns more about your habits the more often you visit
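Here is a minimal sketch of capture in the flow of activity. The event names, fields, and file format are assumptions for illustration, not any real site's schema.

    # A sketch of capturing behavioral data in the flow of normal activity,
    # instead of asking people to report intent afterwards. Event names and
    # fields are invented.
    import json, time, uuid

    def capture(event_type, session_id, **fields):
        """Append one event, with its context, at the moment it happens."""
        event = {
            "event_id": str(uuid.uuid4()),
            "ts": time.time(),
            "session": session_id,
            "type": event_type,
            **fields,
        }
        with open("events.jsonl", "a") as log:
            log.write(json.dumps(event) + "\n")

    session = "abc123"
    capture("search_results_viewed", session, query="laptop", results_shown=12)
    capture("side_by_side_compare", session, items=["sku1", "sku2", "sku3", "sku4"])
    capture("purchase", session, item="sku2", price=899.00)
    # The "looked but didn't buy" signal falls out later by joining views
    # against purchases; nobody had to fill in anything extra to produce it.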
Compare and contrast this practice with the university that I mentioned in the previous post. There the data capture specifically goes against the WSDP. The school is not capturing the data in the course of normal activity. It is attempting to make the data sources (students) do something extra and inconvenient to provide data that they don't care about.

Saturday, August 24, 2013

Historical Context

When we want to access transactional data long after the transaction has been posted to a system of record, we need to make sure we have enough context to reconstruct the history.
  • How do we go back and find who sold a particular item years after a sale was registered?
  • How do we compare performance over time when the rules have changed?
  • How do we reprice an item to produce a receipt (perhaps for provenance) when we don't know what the pricing rules were at the time of the relevant sale? 
There is a wealth of data surrounding the transactions - data that in the transactional systems would likely be properly normalized and well managed. When we want to look up the cashier at the time a sale was executed, we can. After all, we know which terminal they signed in at, what employee id they signed in with, etc. However, when the employee id is no longer "valid" - i.e. the employee can no longer perform transactions - the fact that a particular employee performed a particular transaction is still a fact. Forensic applications comb through backups, attempting to link all the data together in an effort to piece together the relevant facts. That is, however, a slow and tedious process.
We almost have to go against the fundamental principles we learned when designing databases - especially transactional databases: normalize the data to reduce redundancy and remove the possibility of update/delete anomalies. But these historical (sometimes warehousing-oriented) systems don't work that way. We have to scoop up whatever else was relevant at the time of the transaction, so that we can do the unlikely analytics later; a sketch of what that might look like follows.
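Here is a minimal sketch of scooping up context at the point of sale rather than relying only on foreign keys into master data that may later change or disappear. The record layout and field names are assumptions for illustration.

    # A sketch of a denormalized historical snapshot taken at transaction time.
    # Field names are invented.
    from dataclasses import dataclass, asdict
    from datetime import datetime, timezone
    import json

    @dataclass
    class SaleSnapshot:
        sale_id: str
        ts: str
        terminal: str
        cashier_id: str
        cashier_name: str          # copied, so it survives the employee leaving
        item_sku: str
        item_description: str      # copied, so it survives catalogue changes
        price_charged: float
        pricing_rule_version: str  # which rules produced that price

    snap = SaleSnapshot(
        sale_id="S-1001",
        ts=datetime.now(timezone.utc).isoformat(),
        terminal="POS-07",
        cashier_id="E-442",
        cashier_name="J. Smith",
        item_sku="SKU-9",
        item_description="Garden hose, 25m",
        price_charged=19.99,
        pricing_rule_version="2013-08-rules",
    )
    # The normalized transactional system keeps only the keys; the historical
    # store gets the whole denormalized snapshot.
    print(json.dumps(asdict(snap), indent=2))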
Web properties have known this forever. When doing A/B testing, the whole interaction context is harvested so that minute behavioral changes can be analyzed - long after the original master data have been purged.
There is a second dirty and ugly secret lurking here too. Not only do we have to capture more data than we ever thought, but even if we have the data we may not be able to use it, because the application required to rematerialize it no longer exists. We have upgraded our applications, remembering to change the DB schema, but without a good way to get the right version of the application in place for the point in time of a particular piece of data we wish to examine, we still may not be able to make use of the data. One possible mitigation is sketched below.
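One way to hedge against this, sketched here with invented version numbers and field names, is to stamp every archived record with the schema/application version that wrote it and keep a small reader per version, so old records can still be rematerialized after the application moves on.

    # A sketch of version-stamped archival: each record carries the schema
    # version that produced it, and a reader per version rematerializes it.
    # Versions and fields are invented.

    def read_v1(record):
        # v1 stored a bare amount and implied the currency
        return {"amount": record["amount"], "currency": "USD"}

    def read_v2(record):
        # v2 split amount and currency into separate fields
        return {"amount": record["amount"], "currency": record["currency"]}

    READERS = {"1": read_v1, "2": read_v2}

    def rematerialize(record):
        """Dispatch to the reader matching the version stamped on the record."""
        return READERS[record["schema_version"]](record)

    archive = [
        {"schema_version": "1", "amount": 10.0},
        {"schema_version": "2", "amount": 12.5, "currency": "EUR"},
    ]
    print([rematerialize(r) for r in archive])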
Continuous data archival, synchronized with the proper versions of the applications, presents challenges at a scale that the big data world is only just coming to grips with.