Elaborate Architectures


Elaborate data and BI architectures

In the Big Data field, because of the potentially large volumes of data, very streamlined architectures are used. Sure, Big Data sometimes participates in complex data warehouse ecosystems, but the real “Big Data” part has to be kept incredibly simple and efficient to cope with the load. By comparison, even very large transaction volumes are relatively small. So why are some conventional data warehouses battling with much less data? And what can conventional data warehouse architects learn from Big Data?

I have seen some elaborate data warehouse architectures in my time. A diagram of Bill Inmon’s Corporate Information Factory (CIF) circa 1999 comes to mind, which showed a staging area, one or more Operational Data Stores (ODSs), an enterprise data warehouse (EDW), many downstream data marts and a variety of other related data stores such as Webhouses. Organisations implemented that. And some federated organisations without good information management disciplines implemented even more than that. Some of those implementations are still running today. Even worse, similar monstrosities are still being constructed and maintained today. And then we wonder why we hear discussions in the boardrooms that BI costs too much. And why we hear discussions in the corridors that “they” are going to “can the project”. (Not that it’s a project after so many years anymore, or ever was, for that matter, but that’s another discussion.)

So nowadays organisations are using Big Data technologies like Hadoop to speed up data loading, integration and re-extraction. And they get excited about blasting petabytes of highly complex data into Hadoop across 16 or more servers in parallel, and they boast that they can extract, transform and shrink that dataset to meaningful information across those same 16 or more servers in no time.

But in the conventional data warehouse world, organisations are battling to load and process very structured data sets smaller than a few terabytes into their data warehouses. Why? Is it the technology? No sir, you can scale a conventional data warehouse across a number of servers too. Oracle would love to sell you such an appliance. IBM too. But the problem is not with the technology. The problem lies in the architecture. The organisations that battle all have elaborate architectures with ODSs before or next to the data warehouse, multiple layers inside the data warehouse and many cubes and data marts after the data warehouse.

Why would you keep data in the staging area after you have processed it? Just archive it so that you can reprocess it if you need to. Why would you keep n layers of the same data in different formats in the data warehouse, when you really only need the dimensional layer to report and analyse it? I seriously question the need to keep any data in a relational form in the data warehouse, apart from in the source format in the staging area. And why, oh why, would you have a gazillion data marts, unless they are used to address specialised security requirements or to offload highly intensive advanced analytical processing to separate servers?
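The archive-then-reprocess idea can be sketched in a few lines. This is a hypothetical illustration, not any particular organisation’s tooling: the directory layout, the `*.csv` file pattern and the `load_into_warehouse` step are all assumptions made for the example.

```python
import gzip
import shutil
from pathlib import Path


def load_into_warehouse(path: Path) -> None:
    """Placeholder for the real load step (bulk insert, COPY, etc.)."""
    pass


def process_and_archive(staging_dir: Path, archive_dir: Path) -> list:
    """Load each staged file, then compress it into the archive and
    remove it from staging. Staging never accumulates history, but
    everything can still be reprocessed from the archive if needed."""
    archive_dir.mkdir(parents=True, exist_ok=True)
    archived = []
    for src in sorted(staging_dir.glob("*.csv")):
        load_into_warehouse(src)  # hypothetical load step
        dest = archive_dir / (src.name + ".gz")
        with src.open("rb") as f_in, gzip.open(dest, "wb") as f_out:
            shutil.copyfileobj(f_in, f_out)
        src.unlink()  # staging stays empty once data is processed
        archived.append(dest.name)
    return archived
```

The point of the sketch is the lifecycle, not the mechanics: staged data exists only until it has been processed, after which the compressed archive is the single place to go back to for reprocessing.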

Every additional and unnecessary layer adds storage costs and processing time. Propagating a transaction through five layers is going to make real-time BI a lot harder to achieve. And in today’s world, it consumes an even more costly and scarce resource – “developer time” – to analyse, design, code, implement and test the data path through all those layers. Any changes also have to be propagated throughout all those layers. Talk about expensive impact.

I have also seen some very simple and elegant data warehouse solutions in my time – solutions that, even on relatively modest hardware, can handle terabytes of data easily and very efficiently, in almost real time. The most successful BI implementations I have come across are extremely simple. The simpler the solution, the more cost-effective it is. A dynamic staging area, which gets archived as soon as the data has been processed. A single-layer dimensional data warehouse. As few data marts as possible. And where do you do operational reporting? Instead of trying to keep an ODS consistent with the EDW, rather spend the effort and costs to populate the EDW in near-real time or real time, and report straight from the single source of the truth. No more costly boardroom debates about which version of which truth to use, not even going into the discussions required to manage BI’s reputation.
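The single-layer idea above can be sketched minimally: staged rows are conformed straight into one dimensional model (dimension plus fact), with no ODS and no intermediate relational layers in between. This is an illustrative sketch only – the table and column names are invented for the example, and SQLite stands in for whatever warehouse platform is actually used.

```python
import sqlite3


def load_sales(conn: sqlite3.Connection, staged_rows) -> None:
    """Conform staged rows directly into a dimension and a fact table.
    One pass from staging to the dimensional layer; nothing in between."""
    cur = conn.cursor()
    cur.execute("""CREATE TABLE IF NOT EXISTS dim_customer (
                       customer_key INTEGER PRIMARY KEY,
                       customer_id  TEXT UNIQUE,
                       name         TEXT)""")
    cur.execute("""CREATE TABLE IF NOT EXISTS fact_sales (
                       customer_key INTEGER,
                       amount       REAL)""")
    for row in staged_rows:
        # Conform the dimension: insert the customer once, reuse its key.
        cur.execute("""INSERT OR IGNORE INTO dim_customer (customer_id, name)
                       VALUES (?, ?)""", (row["customer_id"], row["name"]))
        cur.execute("SELECT customer_key FROM dim_customer WHERE customer_id = ?",
                    (row["customer_id"],))
        key = cur.fetchone()[0]
        # Load the fact against the surrogate key in the same pass.
        cur.execute("INSERT INTO fact_sales VALUES (?, ?)", (key, row["amount"]))
    conn.commit()
```

Because there is only one layer, a change to the source feed touches one load routine and one model, not a chain of ODSs, warehouse layers and marts.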

Conventional data warehouse architects can learn a lot from the Big Data movement. Pump the data into the staging area as quickly as possible, archive it as soon as you can, force structure on it as quickly as possible, and use one single integrated source of the truth, even for real-time operational reporting. This also makes agile data warehousing simpler, easier, faster. You will be surprised how many fewer problems you will have, how much quicker you can process terabytes of data – in almost real-time, and most importantly, by how much your costs would reduce. It works for complex Big Data. Trust me, it works even better for the much easier structured transactional data. Do it. Now.
