Saturday, August 24, 2013

Historical Context

When we want to access transactional data long after the transaction has been posted to a system of record, we need to make sure we have enough context to reconstruct the history.
  • How do we go back and find who sold a particular item years after a sale was registered?
  • How do we compare performance over time when the rules have changed?
  • How do we reprice an item to produce a receipt (perhaps for provenance) when we don't know what the pricing rules were at the time of the relevant sale? (A sketch of one approach follows this list.)
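To make that last question concrete, here is a minimal sketch in Python of effective-dated pricing rules. The PricingRule shape and the price_at lookup are illustrative assumptions, not a description of any particular system; the point is that every version of a rule is kept alongside the interval during which it was in force, so a historical price can still be reconstructed years later.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Hypothetical effective-dated pricing rule: each version of a rule
# carries the interval during which it was in force.
@dataclass
class PricingRule:
    sku: str
    unit_price: float
    valid_from: date
    valid_to: Optional[date]  # None means "still in force"

def price_at(rules: list[PricingRule], sku: str, as_of: date) -> float:
    """Return the unit price that applied to `sku` on `as_of`.

    Instead of overwriting prices in place (which destroys history),
    we retain every version and select by date.
    """
    for r in rules:
        if (r.sku == sku and r.valid_from <= as_of
                and (r.valid_to is None or as_of < r.valid_to)):
            return r.unit_price
    raise LookupError(f"no pricing rule for {sku} on {as_of}")

rules = [
    PricingRule("WIDGET", 9.99, date(2010, 1, 1), date(2012, 6, 1)),
    PricingRule("WIDGET", 12.49, date(2012, 6, 1), None),
]
print(price_at(rules, "WIDGET", date(2011, 3, 15)))  # 9.99 - the price at sale time
```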
There is a wealth of data surrounding the transactions - data that in the transactional systems would likely be properly normalized and well managed. When we want to look up the cashier at the time a sale is executed, we can: after all, we know which terminal they signed in at, what employee id they signed in with, and so on. However, when the employee id is no longer "valid" - i.e. the employee can no longer perform transactions - the fact that a particular employee performed a particular transaction is still a fact. Forensic applications comb through backups attempting to link all the data together, in an effort to piece together the relevant facts. That is, however, a slow and tedious process.
We almost have to go against the fundamental principles we learned when designing databases - especially transactional databases: normalize the data to reduce redundancy and remove the possibility of update/delete anomalies. But these historical (sometimes warehousing-oriented) systems don't work that way. We have to scoop up whatever else was relevant at the time of the transaction, so that we can later perform analytics nobody anticipated.
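Here is a minimal sketch of what that "scooping up" might look like when a sale is recorded. The field names are assumptions for illustration; the idea is that the event embeds a snapshot of the cashier, terminal, and pricing rule as they were at the moment of the transaction, instead of foreign keys that may dangle years later.

```python
import json
from datetime import datetime, timezone

def record_sale_event(sale, cashier, terminal, rule):
    """Capture a self-contained, denormalized snapshot of a sale.

    Rather than storing only foreign keys (employee id, terminal id,
    pricing-rule id) into master tables that may be purged or changed,
    we copy the context that was true when the transaction happened.
    """
    return json.dumps({
        "event": "sale",
        "recorded_at": datetime.now(timezone.utc).isoformat(),
        "sale": sale,          # line items, totals
        "cashier": cashier,    # name and id as of the sale, not a live reference
        "terminal": terminal,  # where the cashier was signed in
        "pricing_rule": rule,  # the rule version actually applied
    })

event = record_sale_event(
    sale={"sku": "WIDGET", "qty": 1, "total": 9.99},
    cashier={"employee_id": "E1042", "name": "J. Smith"},
    terminal={"terminal_id": "T7", "store": "Main St"},
    rule={"rule_id": "R12", "version": 3, "unit_price": 9.99},
)
print(event)
```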
Web properties have known this forever. When doing A/B testing, the whole interaction context is harvested so that minute behavioral changes can be analyzed long after the original master data have been purged.
There is a second dirty and ugly secret lurking here too. Not only do we have to capture more data than we ever thought, but even if we have the data we may not be able to use it, because the application required to rematerialize it no longer exists. We have upgraded our applications, remembering to change the DB schema, but without a good way to get the right version of the application in place for the point in time of a particular piece of data we wish to examine, we may still not be able to make use of the data.
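One way to soften this, sketched below with assumed version numbers and record shapes, is to stamp every archived record with the schema version that produced it and to retain a small reader per schema era, so old records can be rematerialized without resurrecting the whole application.

```python
# A registry of readers keyed by schema version. The versions and the
# field layouts below are hypothetical, purely to illustrate the pattern.
ARCHIVE_READERS = {}

def reader_for(schema_version):
    """Register a function able to rematerialize records of one schema era."""
    def register(fn):
        ARCHIVE_READERS[schema_version] = fn
        return fn
    return register

@reader_for(1)
def read_v1(record):
    # Assume v1 stored a flat price; no tax breakdown existed yet.
    return {"total": record["price"], "tax": None}

@reader_for(2)
def read_v2(record):
    # Assume v2 split net amount and tax into separate fields.
    return {"total": record["net"] + record["tax"], "tax": record["tax"]}

def rematerialize(record):
    """Pick the reader matching the schema version stamped on the record."""
    version = record["schema_version"]
    reader = ARCHIVE_READERS.get(version)
    if reader is None:
        raise RuntimeError(f"no reader retained for schema v{version}")
    return reader(record)

print(rematerialize({"schema_version": 1, "price": 9.99}))
print(rematerialize({"schema_version": 2, "net": 10.00, "tax": 2.49}))
```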
Continuous data archival, synchronized with the proper copies of the applications, presents challenges at a scale that the big data world is just getting to grips with.
 
