Saturday, February 15, 2014

Graph Databases & Object people vs AI people

This post goes way back in time. I was working with a client in Seattle and had just "got the OO bug". I was seeing everything in terms of objects, reasoning from the intension, and delighted that every piece of software I created was a member of a "Class" - all I needed to know was which class something belonged to, and I knew what it could "do". It seemed so orderly, so logical, and, at some gut level, so wrong. It didn't fit with my mental model of the world, but I couldn't really get a handle on why. Until I met John H.
Now John came from the artificial intelligence world, and he had a tendency to reason from the extension - i.e. the instances of things - rather than from what "Class" they belonged to. I railed against this, and against all the references to frames and the other terms that were important to the AI practitioners. There I was, trying to learn a different approach, and now someone was suggesting that it might not be right either. (Whatever "right" means.) So to him, the idea of inheritance didn't live so much at the class level; it was closer to what I would have called instantiation - some notion of an instance being created from another instance (perhaps), or just kind of arbitrarily. It all felt so foreign.
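Looking back, the contrast is easy to sketch in Clojure (where I am headed anyway). This is only my own toy illustration - the Account record, the generic-account prototype and the derive-from helper are all invented for the purpose, not anything John actually showed me:

```clojure
;; "Class-style" reasoning: what something can "do" is fixed by the class it
;; belongs to.
(defrecord Account [id balance])

(defn account? [x] (instance? Account x))

(account? (->Account 42 100.0))   ;; => true

;; "Instance-style" (prototype) reasoning: a new thing is made from another
;; instance, inheriting whatever that instance happened to have.
(def generic-account {:currency "USD" :balance 0})

(defn derive-from
  "Create a new 'instance' by copying an existing one and overriding slots."
  [prototype overrides]
  (merge prototype overrides))

(def my-account (derive-from generic-account {:id 42 :balance 100.0}))
;; => {:currency "USD", :balance 100.0, :id 42}
```

The class-style version tells you up front everything an account is allowed to be; the prototype-style version just copies an existing instance and lets it drift.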
Also, at the time I taught a lot of data modeling, and I was somewhat dissatisfied with the state of that art as well, because I couldn't specify constraints well in the data models. (Aside: I must have been no fun to hang out with - everything I was learning was unsatisfactory.) It became too easy to over-generalize or over-specialize a model without really having a good idea of what was going on. And then there was time: how do we do data modeling taking time into account? A really hard problem in some cases. Modeling roles was tricky too, because we like inheritance and hierarchies as organizing principles, but with roles those principles become a whole lot harder to apply.
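To make the role-and-time problem concrete, here is a small sketch (every name, date and attribute in it is invented): instead of making "Employee" a subtype of "Person" - which breaks as soon as the role ends, or overlaps another role - the role is held as data with its own validity interval.

```clojure
;; Roles as data with a time interval, rather than as subclasses.
(def person {:id 1 :name "Pat"})

(def roles
  [{:person-id 1 :role :employee   :from #inst "2012-01-01" :to #inst "2013-06-30"}
   {:person-id 1 :role :contractor :from #inst "2013-07-01" :to nil}])

(defn roles-at
  "Roles held by the person identified by pid at instant t."
  [roles pid t]
  (filter (fn [{:keys [person-id from to]}]
            (and (= person-id pid)
                 (not (.before t from))
                 (or (nil? to) (not (.after t to)))))
          roles))

(roles-at roles 1 #inst "2013-02-01")   ;; => just the :employee role
```

Asking "what roles did this person hold on that date?" becomes a filter over data, rather than a question for the type hierarchy.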
Peter Chen's E/R modeling notation (Professor Chen was one of my sponsors for my "Expert in Field" visa application in 1984) started to help me, because it properly separated implementation details (keys, and especially foreign keys) from the necessary business concepts. Relationships became First Class (and could be reified), so that properties belonging to a relationship could be described. We could easily see where the value production was (matched to the creation of associative entities) vs. the cost propositions (managing the static, or what is now called "Master", data).
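In data terms, a reified relationship is simply a thing with an identity and properties of its own. A toy sketch (the student/course/enrolment domain is mine, not Chen's):

```clojure
;; A reified relationship: the relationship itself carries properties,
;; exactly as an associative entity does in an E/R model.
(def student {:id :s-17  :name "Lee"})
(def course  {:id :c-101 :title "Databases"})

(def enrolment
  {:type        :enrolled-in
   :from        (:id student)
   :to          (:id course)
   :grade       "A-"
   :enrolled-on #inst "2014-01-15"})
```

The value production sits in creating enrolments; the students and courses are the static, "Master" side of the house.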
All of this mental discomfort and anguish finally came together this week when I started looking at Graph Databases (and specifically Neo4J).
Suddenly the ideas of E/R from Dr. Chen, the navigational simplicity of walking a graph, and the "multiple typing" of data elements started to come together into a cohesive whole. Not so much for managing the data of record, but for providing a safe place where the data could be analysed, traversed, and understood.
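The navigational simplicity is easiest to see in a toy example - plain Clojure data standing in for the database, every name invented:

```clojure
;; A tiny property graph held as plain data. Nodes can carry several labels
;; ("multiple typing") and edges are just maps with a type.
(def graph
  {:nodes {:acme  {:labels #{:customer :supplier} :name "Acme"}
           :ord-1 {:labels #{:order}   :total  250.0}
           :inv-9 {:labels #{:invoice} :amount 250.0}}
   :edges [{:from :acme  :to :ord-1 :type :placed}
           {:from :ord-1 :to :inv-9 :type :billed-as}]})

(defn neighbours
  "Nodes reachable from node-id by one hop over edges of the given type."
  [{:keys [edges]} node-id edge-type]
  (->> edges
       (filter #(and (= (:from %) node-id) (= (:type %) edge-type)))
       (map :to)))

(neighbours graph :acme :placed)   ;; => (:ord-1)
```

A real Neo4J traversal goes through the database rather than an in-memory map, of course, but the shape of the question - start at a node, follow edges of a given type - is the same.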
So I am definitely feeling better about the data that I have to manage and worry about. Clojure + Neo4J, here I come.....
Oh, and it is time to dust off this gem too: Logic, Algebra and Databases by Peter Gray, published in 1984. Maybe I will finally understand it!

Wednesday, February 5, 2014

Framing

While I have tried to resist over the years, I realize that I am really a data junkie. Not a database junkie, but a data junkie. I want to know about meaning, about value, about inference. I am especially interested in the temporal aspects of data - recognizing that as circumstances change, so do the algorithms/rules under which data are collected and derived. So applying yesterday's derivation/categorization rules to today's data will always be flawed. Future posts will look at the Dow Jones Industrial Average, the need for context in the data/metadata around transactions, and other unusual thoughts around data, its capture and its use.
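As a concrete (and entirely made-up) illustration of that temporal point, the derivation rule has to be chosen by the date the data were collected, not by today's date:

```clojure
;; Derivation rules versioned by effective date, so data are categorized under
;; the rule that was in force when they were collected. Thresholds are invented.
(def categorization-rules
  [{:effective #inst "2012-01-01" :large-txn-threshold 10000}
   {:effective #inst "2014-01-01" :large-txn-threshold  5000}])

(defn rule-in-force
  "The most recent rule whose effective date is on or before t."
  [rules t]
  (->> rules
       (filter #(not (.after (:effective %) t)))
       (sort-by :effective)
       last))

(defn categorize [txn]
  (let [{:keys [large-txn-threshold]} (rule-in-force categorization-rules (:at txn))]
    (if (>= (:amount txn) large-txn-threshold) :large :normal)))

(categorize {:amount 7000 :at #inst "2013-06-01"})   ;; => :normal (old rule)
(categorize {:amount 7000 :at #inst "2014-06-01"})   ;; => :large  (new rule)
```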

A surprising effect of biggish data

We had a fairly important go-live in the last couple of weeks. Data entry into a purchased front-end system, with near-real-time feeds to a back-end data store. Pretty standard stuff in many ways. But one of the big ahas came when we realized that having all of the data from the external system (every state change of the important entities) meant we could quickly research issues without having to go back to the system of record.
We could run balances, filter out test transactions that might have been left in the system from the final tracker tests, show key metrics - you name it.
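A sketch of what that kind of check looks like when every state change is sitting there as plain data (the event shape and the "test" flag are assumptions for the example, not our actual schema):

```clojure
;; With every state change captured as an event, "what is the balance,
;; ignoring test transactions?" becomes a filter and a reduce.
(def events
  [{:txn-id 1 :account "A-100" :amount  500.0 :test? false}
   {:txn-id 2 :account "A-100" :amount -125.0 :test? true}    ;; left over from the tracker tests
   {:txn-id 3 :account "A-100" :amount  -75.0 :test? false}])

(defn balance
  "Balance for an account, excluding transactions flagged as tests."
  [events account]
  (->> events
       (remove :test?)
       (filter #(= (:account %) account))
       (map :amount)
       (reduce + 0.0)))

(balance events "A-100")   ;; => 425.0
```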
The exciting moment for me was when a business user asked if we could identify transactions that may have been in error - the front end was saying something "funky". I heard the request and asked, "When would you like to know?" The business response (from people not used to having truly up-to-date data) was, "Do you think you can get it to us tomorrow evening?" My answer: "Well, I could delay it that long - how does a few minutes look to you?"
And sure enough we were able to deliver answers that quickly.
So, for me and my experience, the ability to perform real-time analysis of the data and to discover glitches early and quickly, without having to recreate anything from the system of record, made the solution worth its weight in gold. An unexpected bonus.
Of course, that same data set is used for creating KPIs and other operational reports too - the troubleshooting is an added bonus.