“Agile data warehousing” is a contradiction in terms. Very few activities are less agile than populating a 10-terabyte data warehouse… However, if you want to deliver information to the business frequently and rapidly, you have to speed up the ETL implementation.
Those of us with small children have all heard the Fat Controller moaning at Thomas and his friends for causing unnecessary inconvenience and delay. Don’t we all wish we could deliver BI to the business without unnecessary inconvenience and delay?
In my previous post I discussed how to model a data warehouse with conformed dimensions when using an agile approach. However, if we want to increase user interaction and get useful information to the business much more quickly, we have to speed up the loading process too. This can be done by using a well-architected ETL framework and through faster, smaller real-time loads.
Before embarking on “agile ETL”, you must already have a robust, well-tested ETL framework in place that contains all the functional components. Before you even analyse a single source-to-target data mapping, you already know exactly what your ETL is going to do. You are going to log when the process ran, which parameters it was called with, how much of which source data it accessed, and so on. You also know you are going to populate dimensions, for which you have to generate surrogate keys, maintain versions and denormalise hierarchies. You are going to populate facts with key lookups. You are also going to log the loading and execution statistics. So there is the whole specification, well, sort of… Now go and code and test it, and once done, you’ll be ready to implement agile ETL.
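To make that concrete, here is a minimal sketch (in Python, purely illustrative) of the kind of pre-built components meant here: run logging, a dimension loader that generates surrogate keys and version numbers, and a fact loader that resolves key lookups. The function, table and column names are my own assumptions, not any particular tool’s API.

```python
# A minimal sketch of reusable framework components; all names are illustrative.
from datetime import datetime, timezone
from typing import Iterable


def log_run(process: str, params: dict) -> dict:
    """Record when the process ran and with which parameters (the audit log)."""
    run = {"process": process, "params": params,
           "started": datetime.now(timezone.utc), "rows_read": 0, "rows_written": 0}
    print(f"[audit] starting {process} with {params}")
    return run


def load_dimension(rows: Iterable[dict], natural_key: str,
                   surrogate_keys: dict, run: dict) -> list[dict]:
    """Generate surrogate keys and keep a simple version number per natural key."""
    out = []
    for row in rows:
        run["rows_read"] += 1
        nk = row[natural_key]
        if nk not in surrogate_keys:
            surrogate_keys[nk] = {"sk": len(surrogate_keys) + 1, "version": 1}
        else:
            surrogate_keys[nk]["version"] += 1        # new version of an existing member
        out.append({**row, "sk": surrogate_keys[nk]["sk"],
                    "version": surrogate_keys[nk]["version"]})
        run["rows_written"] += 1
    return out


def load_fact(rows: Iterable[dict], lookups: dict, run: dict) -> list[dict]:
    """Resolve each dimension's natural key to its surrogate key before loading."""
    out = []
    for row in rows:
        run["rows_read"] += 1
        fact = dict(row)
        for dim, keys in lookups.items():
            fact[f"{dim}_sk"] = keys[row[dim]]["sk"]  # key lookup against the dimension
        out.append(fact)
        run["rows_written"] += 1
    return out
```

Everything above is generic; nothing in it depends on any particular source system, which is exactly why it can be built and tested before the first mapping arrives.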
There are two types of ETL processes you have to develop, namely the “now” real-time processes and the big bulk historical load processes. When you’re doing agile data warehousing, you focus on the former. In most big data warehousing programmes, the bulk history loads, with their validations and sign-offs, cause the most problems, over-runs and unhappy customers, as most of the undocumented and forgotten-about inconsistencies lurk in the archaic data. On an agile schedule you don’t have time for that. You have to create the real-time loads quickly, by plugging the mappings into the pre-coded framework, as sketched below.
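To illustrate what “plugging the mappings in” could look like, a sprint’s deliverable may be little more than a mapping definition like the one below, handed to the generic dimension loader from the previous sketch. The source, target and column names are invented for the example.

```python
# Illustrative only: in a sprint, the new work is mostly this mapping definition;
# the generic loader it is handed to already exists in the framework.
customer_dim_mapping = {
    "source": "crm.customer_changes",          # real-time change feed, not history
    "target": "dw.dim_customer",
    "natural_key": "customer_no",
    "columns": {                               # source column -> target column
        "customer_no": "customer_no",
        "cust_name":   "customer_name",
        "segment_cd":  "segment",
    },
}


def apply_mapping(mapping: dict, source_rows: list[dict]) -> list[dict]:
    """Rename source columns to target columns; the framework's pre-built
    dimension loader (see the earlier sketch) takes it from here."""
    return [{tgt: row[src] for src, tgt in mapping["columns"].items()}
            for row in source_rows]


# e.g. load_dimension(apply_mapping(customer_dim_mapping, incoming_rows),
#                     customer_dim_mapping["natural_key"], surrogate_keys, run)
```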
So what about the history loads? You create a separate team, well experienced in hard-core bulk data movement, and they code and implement one big data load after another, basing the logic on what was already implemented during the sprints, but also catering for all the older exceptions. One of the spin-offs of approaching it this way is that the business already knows that your loads work correctly, as they have already accepted and implemented the trickle feeds. Nine times out of ten, they will now be far more willing to play a role in sorting out old data problems, and even accept blame for the state the data is in. A note of caution though: you have to deploy the bulk loads in such a way that they don’t block the trickle feeds.
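As a hedged illustration of that caution, one common pattern is to stage the history separately and only publish it when the trickle feed is idle. The table names, the 01:00–04:00 quiet window and the PEP 249-style connection below are all assumptions, not a prescription.

```python
# A sketch of one way to keep a bulk history load from blocking the trickle feed:
# stage the history in its own table and only publish it during a quiet window.
from datetime import datetime, time


def in_quiet_window() -> bool:
    """Assume the trickle feed is idle between 01:00 and 04:00; adjust to your schedule."""
    now = datetime.now().time()
    return time(1, 0) <= now < time(4, 0)


def stage_history_batch(conn, batch_select_sql: str) -> None:
    """Write historical rows into a staging table the trickle feed never touches."""
    with conn.cursor() as cur:
        cur.execute(f"INSERT INTO dw.fact_sales_history_stage {batch_select_sql}")
    conn.commit()


def publish_history(conn) -> None:
    """Move staged history into the live fact table only when the trickle feed is idle."""
    if not in_quiet_window():
        return                                  # try again later; never contend with the live load
    with conn.cursor() as cur:
        cur.execute("INSERT INTO dw.fact_sales SELECT * FROM dw.fact_sales_history_stage")
        cur.execute("TRUNCATE TABLE dw.fact_sales_history_stage")
    conn.commit()
```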
In the next post on agile BI, I will discuss approaches for interacting with the business, even in organisations that aren’t agile-orientated.