Handling the Data Explosion


There is an enormous amount of data around, and much of it sits in data warehouses. We question the realities and discuss approaches to keep the data warehouse at a manageable size, without reducing the availability of any historic data and without suffering any loss of data or information.

I don’t need to repeat any statistics – we all know that we are being overwhelmed by a data explosion! With data shrapnel everywhere, our data warehouses are bursting at the seams. There are plenty of arguments out there for why we should simply expand the storage space of the data warehouse… But is there something we can do about it instead? Can we actually be more data conscious? Let’s take a brisk walk through the data warehouse architecture and see.

First we encounter the staging area, that vaguely described scratch workplace. Here data modelers and ETL analysts don’t have to abide by any rules, either because the source systems are complex or, maybe, just because they are never afforded the time to model and document the area properly. I endorse bringing all the source data into the staging area, because it avoids time-consuming revisits to the closely guarded transactional systems. It also creates a stable point from which you can recover and reload, should things go awry. But how long do you need to keep a copy of the data in there? Once it has been loaded into the data warehouse, it can be archived and removed. In fact, with the ETL and ELT tools available today, staging can be skipped entirely for some processes. Keep it slick!
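As a minimal sketch of that archive-after-load step – assuming a simple file-based staging area, with all names purely illustrative – the idea is to compress the staged extract into an archive area and only then remove it from staging, once the warehouse load has been confirmed:

```python
import gzip
import shutil
from pathlib import Path

def archive_staged_file(staging_dir: Path, archive_dir: Path, filename: str) -> Path:
    """Compress a staged extract into the archive area and remove the
    original from staging. Call this only after the data warehouse load
    has committed, so the archive still serves as a recover-and-reload point."""
    archive_dir.mkdir(parents=True, exist_ok=True)
    source = staging_dir / filename
    target = archive_dir / (filename + ".gz")
    # Stream-copy into a gzip file rather than loading the extract into memory
    with source.open("rb") as src, gzip.open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)
    source.unlink()  # keep staging slick: the copy lives on in the archive
    return target
```

The point is not the mechanics but the discipline: staging holds data only between extract and load, and the archive, not staging, is the fallback for reloads.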

Next we get to the data warehouse. Why, oh why, would you ever need a 3NF layer in here as well? It serves no purpose other than to duplicate the data storage and processing effort, never mind all the testing, debugging and validation. There should be one and only one layer, and call me a dimensional bigot, but a dimensional model is what works well in here. This area should be modeled and managed with the highest precision, and no duplications should be allowed. Keep it clean!

One type of duplication is encouraged though. Properly summarized aggregates are useful for optimizing the performance of higher-level roll-ups. They can actually be kept in the data warehouse for longer, while the detailed data is archived to some offline or slower storage medium. However, the usage of the aggregates must be monitored: often, when business priorities change, the aggregates are no longer used, yet they are still maintained in the data warehouse.
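To make that monitoring point concrete, here is an illustrative sketch (the class and data shapes are my own, not a specific tool’s API): a daily-to-monthly roll-up that records when it was last queried, so that aggregates which have fallen out of use can be identified and dropped:

```python
from collections import defaultdict
from datetime import datetime, timezone

class MonitoredAggregate:
    """A monthly roll-up of daily detail rows that tracks its own usage,
    so unused aggregates can be spotted and retired."""

    def __init__(self, daily_sales):
        # daily_sales: {(month, day): amount}, e.g. {("2024-01", 15): 99.5}
        self.totals = defaultdict(float)
        for (month, _day), amount in daily_sales.items():
            self.totals[month] += amount  # pre-summarize once, at load time
        self.last_used = None  # never queried yet

    def monthly_total(self, month):
        # Every query stamps the aggregate, which is the usage-tracking hook
        self.last_used = datetime.now(timezone.utc)
        return self.totals[month]
```

A periodic check of `last_used` (still `None`, or older than some threshold) flags exactly the situation described above: an aggregate still being maintained that nobody queries anymore.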

Oops, here we have to backtrack a little. Slightly off to the side, slightly disconnected, sits the Operational Data Store. It’s not officially part of the data warehouse, but it fulfills a similar role; in fact, most of its contents are in the data warehouse too. When processes were slow and ETL was cumbersome, I could understand the justification for its existence. But today, with real-time ETL readily available? It’s just another duplication of data, often the domain of a renegade group. Speed up the data warehouse loads, get consistent reporting results, and save some space!

Next… why did we even get here? A scattering of data marts. You would have to justify very well why you need these. In my mind there can be only two reasons: firstly, to off-load some heavy processing to a separate server (for example, for intensive analytical processing or data mining); or secondly, to move some highly sensitive or private data to a more secure environment. But no, we see an urban sprawl of data marts, some even on the data warehouse server itself. Again, speed up the data warehouse, reduce the inconsistencies, and save some more space!

In years past there may have been good reasons why some of these duplications were created, but with the technologies now at our disposal, there is definitely a business case for getting rid of many of them, streamlining the processes and delivering better information to the business faster. Today there are some great ETL and ELT tools available, good data management tools, and good usage tracking facilities. Used wisely, together with a burst of common sense, they let us greatly curb the data warehouse’s unnecessary waste of space, opening it up for more useful data and, in the process, delivering more consistent information to the business even faster.

Please consider the environment before duplicating the data (but feel welcome to share this post!)
