Friday, November 28, 2014

Big Data and Operational Rules

As usual, a story to start things off. Years ago I used to commute a couple of times a week between Dallas and Houston. SOP was to book the trip home to Dallas on the last flight of the evening. That way I would be sure to make it home. However, on several occasions I would be done early and want to get home early. The less-experienced traveler would look to stand by on an earlier flight - joining the throng of people trying to move around on a crowded route.

The more experienced traveler would examine the refund policy. Realizing that the late ticket was refundable, the traveler would purchase a ticket for an earlier flight and do the refund later. Yes, the purchased ticket might be a bit more expensive, but why not? That's what expense policy is for.

The very experienced traveler would look for the most overbooked flight earlier than the one originally booked. Assuming overbooking, the traveler stood a chance of offering to be "bought out" and getting some form of compensation for being inconvenienced. This had the added bonus of being moved ahead of the standby passengers for the next flight as well. So, the simple act of making tickets fully refundable can enable some quite naughty behaviors.

Now to the point:
Should the airline recognize this kind of behavior and do something to stop it? Sadly, the answer is, as usual: it depends. What does it depend on? The balance between goodwill (especially in this age of over-sharing via social media) and revenue loss.

It's easy to envisage the goodwill part - handy, too, because nothing has to change, so no thinking is required. However, if there is a real revenue problem, then something more must be done: the environment must properly enable the necessary activities.

One approach is illustrated in the diagram below. There is a transactional system whose behavior we wish to influence. This system (and several others) sends its transactions to the Historical/Analytic Data Store (HADS), which does very little formatting - it is, after all, the full record of the transactional history of the enterprise (now that is BIG DATA!). The transactions are also sent to the Operational Analysis System in near real time. The Operational Analysis System is backed by an Operational Data Store (ODS), which contains the most recent state of the operational transactions. The Operational Analysis System's main role is to perform analysis on the transactions in its domain and, if necessary, take action against its transactional system. It uses rules to drive the analysis.



Simplified view of Analytic and Operational Elements
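
In code terms, the flow in the diagram might look roughly like the following sketch (the class and field names are mine, for illustration only - this is not a reference implementation):

    from dataclasses import dataclass, field

    @dataclass
    class Transaction:
        txn_id: str
        kind: str      # e.g. "purchase", "refund", "cancellation"
        payload: dict  # passenger, flight, departure time, refundable flag, ...

    @dataclass
    class HistoricalStore:
        """The HADS: append-only, very little formatting - the full transactional history."""
        log: list = field(default_factory=list)

        def append(self, txn: Transaction) -> None:
            self.log.append(txn)

    @dataclass
    class OperationalAnalysisSystem:
        """Backed by the ODS (most recent state); runs rules against each new transaction."""
        ods: dict = field(default_factory=dict)    # txn_id -> latest transaction
        rules: list = field(default_factory=list)  # callables: (txn, ods) -> Transaction or None

        def receive(self, txn: Transaction) -> list:
            self.ods[txn.txn_id] = txn
            actions = [rule(txn, self.ods) for rule in self.rules]
            return [a for a in actions if a is not None]

    def fan_out(txn, hads, oas):
        """Each transactional system sends its transactions both ways."""
        hads.append(txn)         # the full record of enterprise history
        return oas.receive(txn)  # near-real-time analysis; any resulting actions
                                 # flow back to the Transactional System as ordinary transactions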
So, back to our example: how do we know how widespread the practice is, and what it has cost the airline? Where do we go to dig this out? We use the Data Analytic System fronting the HADS, looking in the HADS for the residue of inappropriate behavior over time. If we find some, our next step might be to identify the elements that make up the behavior, deduce the necessary rules, and place those rules into the Operational Analysis System. At the most extreme, we might see the Operational Analysis System issuing cancellation transactions against the Transactional System. Of course these cancellations will then flow just like any other transactions.
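
As a sketch of what that digging might look like - assuming, purely for illustration, that the HADS can be read back as a list of ticket events with fields like these:

    from collections import defaultdict

    def find_buy_early_refund_late(events):
        """Flag trips showing the telltale sequence: a refundable late booking,
        a later purchase of an earlier flight, and a refund of the original."""
        by_trip = defaultdict(list)
        for e in events:  # e: {passenger, travel_date, action, departs, refundable, ts}
            by_trip[(e["passenger"], e["travel_date"])].append(e)

        suspects = []
        for trip, evts in by_trip.items():
            evts.sort(key=lambda e: e["ts"])
            late = [e for e in evts if e["action"] == "book" and e["refundable"]]
            early = [e for e in evts if e["action"] == "book" and any(
                e["departs"] < b["departs"] and e["ts"] > b["ts"] for b in late)]
            refunds = [e for e in evts if e["action"] == "refund"]
            if late and early and refunds:  # a real rule would match the refund to the late ticket
                suspects.append(trip)
        return suspects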

Bottom line: we need the ability to detect patterns over time in our transactions; to determine the value/cost balance of those patterns; and, if required, to implement rules that can detect and act on such transactions on the fly. There is no need to "crack the case" on the Transactional System - that system might be purchased or outsourced, or it might be a system of record. We just need to make sure it is capable of receiving and executing transactions generated by the Operational Analysis System, so that it continues to be a proper system of record.
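
On the operational side, the deduced rule could then be wired in roughly like this, reusing the Transaction type and the (txn, ods) rule signature from the first sketch - whether to auto-cancel or merely flag is, of course, a business decision:

    def double_booking_rule(txn, ods):
        """If a passenger already holding a refundable later ticket buys an earlier
        flight for the same day, emit a cancellation against the later ticket."""
        if txn.kind != "purchase":
            return None
        held = [t for t in ods.values()
                if t.kind == "purchase"
                and t.payload["passenger"] == txn.payload["passenger"]
                and t.payload["travel_date"] == txn.payload["travel_date"]
                and t.payload["refundable"]
                and t.payload["departs"] > txn.payload["departs"]]
        if held:
            # this cancellation flows back just like any other transaction
            return Transaction(txn_id="cancel-" + held[0].txn_id, kind="cancellation",
                               payload={"target": held[0].txn_id})
        return None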



Thursday, August 28, 2014

CDM vs CDS

I hear a lot of talk about canonical data models and how they will help us get a more unified view of data throughout the enterprise. And then I see what really happens: we build bloated representations of some idealized view of the data that serve no one well. Perhaps it isn't always like that, but we architects seem to think our ideas are better than those of the poor schmucks who have to use them.

Actually, a lot of the time we aren't building a canonical data MODEL, but defining a canonical data STRUCTURE. The difference (and yes, it really matters) is that with canonical data STRUCTURES we pass the data structure around in its complete form and require the consumers to filter the meaning from it. I used to think that was a good idea.

But then I fell back to my modeling roots and realized that maybe we should use the MODEL as a lens through which to view data structures. By that I mean that a Canonical Model is intensional and descriptive. It doesn't have an extension or implementation expressed as a data structure. When a (service) consumer of the data wishes to consume some data, the consumer binds to the model (some time prior to execution) and instructs the model to manifest the data in the way the consumer needs it.

Of course this requires that the model can choose among its data sources, perform the transforms on behalf of the consumers, and recognize that the contracts might change, evolving the schemata appropriately. It isn't easy.
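
Here is a minimal sketch of what "binding to the model" might mean in practice (every name here is hypothetical): the model owns the sources and the transforms, each consumer registers the shape it wants, and the full structure never travels.

    class CanonicalModel:
        """Intensional: the model knows sources and meanings, not one fixed structure."""
        def __init__(self):
            self._sources = {}   # concept name -> callable returning raw data
            self._bindings = {}  # consumer name -> projection over the concepts

        def register_source(self, concept, fetch):
            self._sources[concept] = fetch

        def bind(self, consumer, projection):
            """Done some time prior to execution; re-bind as the contract evolves."""
            self._bindings[consumer] = projection

        def manifest(self, consumer, **keys):
            """Produce the data in exactly the shape this consumer asked for."""
            raw = {name: fetch(**keys) for name, fetch in self._sources.items()}
            return self._bindings[consumer](raw)

    # A consumer that wants only a passenger name and seat, not the whole reservation:
    model = CanonicalModel()
    model.register_source("reservation", lambda *, ref: {"ref": ref, "pax": "SMITH/J",
                                                         "seat": "12C", "fare": "Y"})
    model.bind("boarding_app", lambda raw: {"pax": raw["reservation"]["pax"],
                                            "seat": raw["reservation"]["seat"]})
    print(model.manifest("boarding_app", ref="ABC123"))  # {'pax': 'SMITH/J', 'seat': '12C'}

The boarding app never sees the fare; a different consumer can bind a completely different projection over the same model - bespoke shapes without a new bloated structure each time.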

It also turns some of the ideas of services on their heads. Instead of a one-size-fits-all service (retrieve all the data about a reservation), we have the opportunity to offer bespoke services that give each consumer exactly what it needs. Hmm, service proliferation - that doesn't sound good, and it isn't if you don't have proper service management disciplines. But it is probably a whole lot better than passing bloated, unnecessarily giant structures around, clogging up the network and annoying the consumers.


Saturday, February 15, 2014

Graph Databases & Object people vs AI people

This post goes way back in time. I was working with a client in Seattle and had just "got the OO bug". I was seeing everything in terms of objects, reasoning from the intension, and delighted that every piece of software I created was a member of a "Class" - all I needed to know was which class something belonged to, and I knew what it could "do". It seemed so orderly, so logical, and, at some gut level, so wrong. It didn't fit my mental model of the world, but I couldn't really get a handle on why. Until I met John H.
Now John came from the artificial intelligence world, and he had a tendency to reason from the extension - i.e., the instances of things - rather than from what "Class" they belonged to. I railed against this, and against all the references to frames and the other terms that were important to the AI practitioners. There I was, trying to learn a different approach, and now someone was suggesting that it might not be right either (whatever "right" means). To him, inheritance operated not so much at the class level, but more at what I would have called instantiation - some notion of an instance being created from another instance (perhaps), or just somewhat arbitrarily. It all felt so foreign.
Also, at the time I taught a lot of data modeling - and was somewhat dissatisfied with the state of that art as well, because I couldn't specify the constraints well in the data models. (Aside: I must have been no fun to hang out with - everything I was learning was unsatisfactory.) It became too easy to overly generalize, or overly specialize, a model without really having a good idea what was going on - and then there was time. How do we do data modeling taking time into account? A really hard problem in some cases. Modeling roles was tricky, too, because we like inheritance and hierarchies as organizing principles, but with roles those principles become a whole lot harder to apply.
Peter Chen's E/R modeling notation (Professor Chen was one of my sponsors for my "Expert in Field" visa application in 1984) started to help me because it properly separated implementation details (keys, and especially foreign keys) from the necessary business concepts. Relationships became first class (and could be reified), so that properties belonging to a relationship could be described. We could easily see where the value production was (matched to the creation of associative entities) versus where the cost propositions were (managing the static, or what is now called "Master", data).
All of this mental discomfort and anguish finally came together this week when I started looking at Graph Databases (and specifically Neo4J).
Suddenly the ideas of E/R from Dr. Chen, the navigational simplicity of walking a graph, and the "multiple typing" of data elements started to come together into a cohesive whole. Not so much for managing the data of record, but for providing a safe place where the data could be analysed, traversed, and understood.
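As a toy illustration of why this felt like E/R coming home (plain Python standing in for a property graph - Cypher would say it far more tersely): relationships are first class and carry their own properties, just like a reified associative entity.

    # Nodes and relationships are both first class; the relationship carries
    # its own properties, exactly as a reified associative entity would.
    nodes = {
        "cust:42":  {"label": "Customer", "name": "Ada"},
        "flight:7": {"label": "Flight", "number": "QF7"},
    }
    rels = [
        # the BOOKED relationship is where the value production lives
        {"type": "BOOKED", "from": "cust:42", "to": "flight:7",
         "fare": 412.00, "booked_on": "2014-02-01", "refundable": True},
    ]

    def walk(node_id, rel_type):
        """Traverse by following typed relationships - no foreign keys in sight."""
        return [(r, nodes[r["to"]]) for r in rels
                if r["from"] == node_id and r["type"] == rel_type]

    for rel, flight in walk("cust:42", "BOOKED"):
        print(flight["number"], rel["fare"], rel["refundable"])  # QF7 412.0 True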
So I am definitely feeling better about the data that I have to manage and worry about. Clojure + Neo4J, here I come.....
Oh, and it is time to dust off this gem too: Logic, Algebra and Databases by Peter Gray, published in 1984. Maybe I will finally understand it!

Wednesday, February 5, 2014

Framing

While I have tried to resist it over the years, I realize that I am really a data junkie. Not a database junkie, but a data junkie. I want to know about meaning, about value, about inference. I am especially interested in the temporal aspects of data - recognizing that as circumstances change, the algorithms/rules under which the data were collected might differ too. So applying yesterday's derivation/categorization rules to today's data will always be flawed. Future posts will look at the Dow Jones Industrial Average, the need for context in the data/metadata around transactions, and other unusual thoughts about data, its capture, and its use.
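
One concrete way to respect that, sketched below with invented names: keep the derivation rules versioned by effective date, and apply the rule that was in force when the data were collected - not whichever rule happens to be current.

    import bisect
    from datetime import date

    class TemporalRules:
        """Derivation/categorization rules versioned by their effective-from date."""
        def __init__(self):
            self._dates = []  # sorted effective-from dates
            self._rules = []  # rule in force from the matching date onward

        def add(self, effective_from: date, rule):
            i = bisect.bisect_right(self._dates, effective_from)
            self._dates.insert(i, effective_from)
            self._rules.insert(i, rule)

        def as_of(self, collected_on: date):
            i = bisect.bisect_right(self._dates, collected_on) - 1
            if i < 0:
                raise LookupError("no rule was in force on that date")
            return self._rules[i]

    # The categorization threshold changed in 2014; old data must still be
    # read with the rule under which it was collected.
    rules = TemporalRules()
    rules.add(date(2010, 1, 1), lambda v: "high" if v > 100 else "low")
    rules.add(date(2014, 1, 1), lambda v: "high" if v > 80 else "low")
    print(rules.as_of(date(2013, 6, 1))(90))  # low  - yesterday's rule for yesterday's data
    print(rules.as_of(date(2014, 6, 1))(90))  # high - today's rule for today's data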

A surprising effect of biggish data

We had a fairly important go-live in the last couple of weeks: data entry into a purchased front-end system with near-real-time feeds to a back-end data store. Pretty standard stuff in many ways. But one of the big ahas came when we realized that having all of the data from the external system (every state change of the important entities) meant we could quickly research issues without having to go back to the system of record.
We could run balances, filter out test transactions that might have been left in the system from the final tracker tests, show key metrics - you name it.
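Conceptually it takes very little - here is a sketch (our actual store was a proper database, and these field names are invented):

    # Every state change of the important entities, appended as it arrives.
    history = []  # dicts like {"entity": ..., "state": ..., "ts": ..., "amount": ..., "is_test": ...}

    def latest_states(events):
        """The current picture, without going back to the system of record."""
        latest = {}
        for e in sorted(events, key=lambda e: e["ts"]):
            latest[e["entity"]] = e
        return latest

    def run_balance(events, entity):
        """Re-derive a balance from the full change history."""
        return sum(e.get("amount", 0) for e in events if e["entity"] == entity)

    def non_test(events):
        """Filter out transactions left over from the final tracker tests."""
        return [e for e in events if not e.get("is_test", False)]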
The exciting moment for me came when a business user asked if we could identify transactions that might have been in error - the front end was saying something "funky". I heard the request and asked, "When would you like to know?" The business response (from people not used to having truly up-to-date data) was, "Do you think you can get it to us by tomorrow evening?" My answer: "Well, I could delay it that long - how does a few minutes sound to you?"
And sure enough we were able to deliver answers that quickly.
So, in my experience, the ability to perform real-time analysis of the data - to discover glitches early and quickly, without the need to recreate the issue - made the solution worth its weight in gold. An unexpected bonus.
Of course that same data set is used for creating KPIs and other operational reports too - the troubleshooting is an added bonus.

Tuesday, September 24, 2013

Be a WSDP app

In a previous post I referenced the "Willie Sutton Data Principle" (WSDP). Willie Sutton (reputedly) replied, "That's where the money is" when asked, "Why do you rob banks?"
The WSDP is about data capture - the idea being that data capture is most useful when it happens in the natural course of doing something, not when you have to go an extra step. Of course there are exceptions, but there are so many sites/applications where capturing the data is simply too much work to be worth your while. For me, a couple of examples: untappd is a terrific idea for sharing beer-drinking experiences, but the work required to enter the beers far outweighs any benefit I derive. Now, I drink a lot of beer, and I try many different brews, but somehow crowdsourcing my beer drinking is too much work. Similarly, I would have to be a fanatic to enter the foods/drinks I consume into my Jawbone Up. I love the product for showing exercise and sleep habits (that's passive collection), but as soon as I have to do work to collect the data, I am simply not motivated enough.
Contrast this with Google Maps showing "real time" traffic. That's a great WSDP app: the data are captured passively as we move around, and then made usable to other drivers. I contribute to the crowdsource - but without having to work at it.
Moral of the story: make your apps worth the effort of entering data - or make the data entry invisible to the user. Be a WSDP app where possible.
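In application terms, the principle can be as simple as this sketch (every name here is made up): the capture rides along as a side effect of something the user already wanted to do, never as an extra step.

    import functools, json, time

    def capture(event_log):
        """Record data in the natural course of doing something (the WSDP),
        rather than asking the user to enter it separately."""
        def wrap(fn):
            @functools.wraps(fn)
            def inner(*args, **kwargs):
                result = fn(*args, **kwargs)
                event_log.append({"action": fn.__name__, "at": time.time(),
                                  "detail": json.dumps(kwargs, default=str)})
                return result
            return inner
        return wrap

    events = []

    @capture(events)
    def get_directions(*, origin, destination):
        # the user only asked for directions; the trip data is captured passively
        return "route from {} to {}".format(origin, destination)

    get_directions(origin="home", destination="work")
    print(events[0]["action"])  # 'get_directions' - collected with zero extra work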

TV advertising and idempotence

Reading this and reflecting on my own experience started me thinking about the nature of commercials on TV.
In our house we rarely watch live TV - mostly delayed or semi-live. Semi-live is the term I use for starting after the show/event has begun and being able to skip the commercials. Delayed I take to mean some time after the show/event is over.
There are several programs (e.g. repeats, highlights, old movies) that are in a sense timeless. I might go back days or weeks after the recording to watch. And therein lies the rub.
Idempotence is a mathematical term that roughly means "when you invoke a function, you get the same answer every time you invoke it." If we think of language as being like invoking functions, there are words that definitely cause us difficulties. "Tomorrow", for example, gives a different answer every day. "Next Wednesday" gives a different answer every week...
So with delayed TV, advertisements that crop up saying "watch xxx tomorrow" become meaningless when I am watching the program some time after xxx has already aired.
To make advertising more meaningful, a couple of things could help - at least for us in the delayed-TV crowd:
  1. Insert time-relevant advertising at the moment the program is being watched.
  2. If it isn't possible to insert advertising (because of contracts, agreements, technology, cost, or whatever), then make sure the advertisements represent time in an idempotent way: give dates/times in absolute terms, not just relative ones - as sketched below.
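
A sketch of what option 2 might mean in code (the phrasing rules are invented; a real system would need a proper grammar of time expressions):

    from datetime import date, timedelta

    WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

    def absolutize(phrase: str, air_date: date) -> str:
        """Render a relative time reference as an absolute one, so the ad reads
        the same whenever the recording is actually watched."""
        if phrase == "tomorrow":
            return (air_date + timedelta(days=1)).strftime("%A, %B %d, %Y")
        if phrase.startswith("next "):
            target = WEEKDAYS.index(phrase.removeprefix("next ").capitalize())
            days_ahead = (target - air_date.weekday() - 1) % 7 + 1
            return (air_date + timedelta(days=days_ahead)).strftime("%A, %B %d, %Y")
        return phrase  # already absolute (or unrecognized) - leave it alone

    # "watch xxx tomorrow", aired on a Tuesday, still means something weeks later:
    print(absolutize("tomorrow", date(2013, 9, 24)))     # Wednesday, September 25, 2013
    print(absolutize("next friday", date(2013, 9, 24)))  # Friday, September 27, 2013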
Right now, one of the main reasons to watch delayed is to avoid the advertisements, and that does the job well for me. But if the advertisements were harder to avoid and more relevant to the time of viewing, then maybe they would reach me better.