Revolt against the Data Lake as Staging Area


In previous posts I covered the data lake, approaches to “building” the lake and the data lake’s relationship with the traditional data warehouse. However, as rosy as all that may seem, there are some people quite opposed to using the data lake as an all-encompassing free-for-all at-scale staging area. To an extent, I tend to stand with them.

OK, so this may not be an actual physical revolution, with guns and stand-offs and coups d’état and all that exciting stuff… it’s more like passive and quite opinionated resistance, maybe a few placards and a few die-hard silent protestors who have chained themselves to their structured data staging areas… Anyway, I took some liberty to force this piece to tie in with my other “revolt” posts… (there it is, that violence thing again)

So let us explore some of the criticisms posed against implementing a BI ecosystem based around a data lake as the all-encompassing free-for-all data staging area.

Unknown territory

The big data vendors are very good at promoting their offerings and powerpointing the advantages of a Hadoop data lake, and they tell you in no uncertain terms about the problems with the data warehouse and the staging area that you are already aware of. Now that gets you interested real quick, not so? But they never tell you about the potential problems with a data lake.

There are many complications with going the data lake route. The data becomes complex and the processing is quite primitive, or you have to run three layers of tools on top of each other to get a halfway human-friendly interface.

In reality, very few organisations have actually successfully implemented a data lake, so there is not yet a large body of IP and best practice approaches from which to learn. So it’s more a case of revolutionaries taking up arms and forging ahead through whatever resistance is encountered, sometimes at the risk of losing time, data, sanity, or all of these.

Unknown costs

Product and services vendors will quickly powerpoint you the advantages and ROIs of a data lake implementation. If you don’t have Hadoop, or the add-on tools to make the white elephant useful, they’ll gladly sell and implement those technologies for you too. Oh, you don’t have the infrastructure? They’ll gladly lease that to you in the cloud, with no setup effort and no lock-in. Yeah right, no lock-in… but think about this: once your data is up there in the cloud, you are pretty much locked into that never-neverland. Because, where could you bring it back to?

So, be fearful of the costs and the hidden costs. It’s an expensive world out there.

Data governance

Usually we apply a lot of governance and rules around the contents of the staging area, and around who may access it and what they may do with it – because, frankly, the data in a staging area is not yet ready for consumption by the business. The data is typically still in a very physical format (like codes instead of business-understandable descriptions) and in data structures as used by the source systems (like some over-time-deviated-from-normal-form type of structure. Maybe I should register the term and baffle students with OTDNF…) Anyway, business users have great difficulty navigating through such data and making any sense of it. So in addition to the ETL coders, we typically only allow very skilled data analysts or data scientists into the staging area, and only when they have to extract or analyse data before it can be processed into the data warehouse.

However, with a data lake implementation, there are typically very few of those governance rules defined, never mind applied. As I implied above, the lake is viewed much more as a free-for-all scratch patch where anyone can go and fetch whatever they want, and do with it whatever they want. It quickly becomes a recipe for inconsistent results, and for new, improperly defined calculations of measures and KPIs that no one has any real control over.

Likewise, without proper MDM processes in place, we get multiple definitions and instantiations of the same key business entities. MDM is complex enough in a structured world. With a data lake you can now get analysts and data scientists creating analyses and solutions that are only point-focussed, addressing a single problem at a time that only meets some immediate need. So you get business silos all over, again, but in a wilder and more difficult to control territory this time.

Data governance is hard work, and it is difficult to get traction and achieve results even in a conventional application, data warehouse and BI environment. How much more so in an unstructured, free-for-all, lake-centered BI ecosystem?

What metadata?

To manage a data lake well, you need to collect, update and manage quite a lot of metadata through metadata life cycle management processes. The metadata needs to include machine-generated data, definitions of data file contents, data lineage and so on. It’s all fine saying the data lake supports schema on read, but you need to be able to put that schema together when you need to access the data. Without proper metadata, it’s a bit like finding a needle in the proverbial haystack.
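To make that concrete, here is a minimal Python sketch of schema on read. The file layout, field names and types are all invented for the example; the point is that this schema lives outside the data itself, and in a real lake it would have to come from a metadata catalogue (or someone’s memory) months after the dump.

```python
import csv
import io
from datetime import date

# Hypothetical raw extract dumped into the lake, with no header row
# or type information stored alongside it.
RAW = "1001,2016-03-14,2499\n1002,2016-03-15,150\n"

# The "schema on read": field names and parsers applied only at access time.
# Without metadata recording this layout, the reader has to guess it.
SCHEMA = [
    ("order_id", int),
    ("order_date", date.fromisoformat),
    ("amount_cents", int),
]

def read_with_schema(raw, schema):
    """Apply names and types to untyped rows at read time."""
    for row in csv.reader(io.StringIO(raw)):
        yield {name: parse(value) for (name, parse), value in zip(schema, row)}

orders = list(read_with_schema(RAW, SCHEMA))
print(orders[0])  # first row, now named and typed
```

Lose the `SCHEMA` definition and the file is just anonymous comma-separated strings – which is exactly the needle-in-a-haystack problem.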

In conventional systems, many of the metadata management functions are already implemented, such as model-to-model migration, attribute-level mapping, impact analysis, automated source-to-metadata synchronisation, and much more. In a data lake, all these metadata management capabilities first have to be developed and tested before they can be implemented.

Silos of Information

Because the data lake doesn’t support the concept of joins or any other form of data integration at the schema level, such as conformed dimensions, it can very quickly and very easily become a large collection of unrelated data silos, with very little to connect the silos up in the long run.

The connection between different but related data sets has to be done at read time, and the big challenge is that you have to know and apply the rules of integration at the time of accessing the data. This may be months after the data was dumped in the lake. Tying this back to the metadata point above, where are you going to get the metadata that describes those integration rules?
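As a toy illustration of read-time integration (the datasets and the matching rule are invented), here is that join done by hand in Python. Note that nothing in the data itself records that `cust_ref` in one set and `customer_id` in the other refer to the same entity – that integration rule has to be known and applied at access time.

```python
# Two independent dumps in the lake; the rule that cust_ref corresponds to
# customer_id lives only in someone's head (or, ideally, in a metadata
# catalogue), not in the lake.
customers = [
    {"customer_id": "C-7", "name": "Acme Ltd"},
    {"customer_id": "C-9", "name": "Globex"},
]
sales = [
    {"sale_id": 1, "cust_ref": "C-7", "amount": 120.0},
    {"sale_id": 2, "cust_ref": "C-9", "amount": 80.0},
]

# The integration rule, applied at read time: match cust_ref to customer_id.
by_id = {c["customer_id"]: c for c in customers}
joined = [
    {**s, "customer_name": by_id[s["cust_ref"]]["name"]}
    for s in sales
    if s["cust_ref"] in by_id
]
print(joined[0]["customer_name"])  # Acme Ltd
```

In a schema-level integrated environment (conformed dimensions, foreign keys) this relationship would be declared once; here it is re-invented by every reader.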

Encouraging chaos

The data lake was created to address the needs of data scientists who need immediate access to all data to quickly build data-driven solutions and create analyses without needing to wait for the formalised data warehousing and BI teams to go through the rigours imposed by corporate IT. These scientists are addressing real needs and the data lake offers a very convenient way to address them.

But don’t for one second believe that the data lake is going to give you enterprise-wide integrated views of the business, populated with clean, consistent, integrated data. Unless you have a super-integrated MDM environment and the most powerful data governors on the planet, the data lake isn’t going to unify the organisation’s data. Not by a long shot. It wasn’t designed for that purpose, and it doesn’t encourage that type of behaviour.

In fact, the data lake was created as a revolt against the repressive and bureaucratic IT culture that exists in many organisations. Maybe it’s good that the data architects want to (eventually) normalise, model, describe, catalogue and secure every piece of data that comes into the organisation, but more often than not, this slow and cumbersome process gets in the way of insight-driven innovation. The whole agile movement, as applied to applications and BI, is testimony to this as well.

So realistically, the data lake invites chaos. And some chaos may be required to stimulate and even make innovation possible. However, the real problem with the data lake is that there are no warning signs to caution unsuspecting business people about the potential side-effects when that chaos gets too wild and strategic decisions are made based on unverified or unverifiable data.

Re-doing the staging area

Now I am very comfortable with the fact that the data lake may contain data that is never intended to reach the data warehouse. I am also very comfortable with data analysts and other data jockeys extracting data from the data lake and integrating it with structured data in an analytical sandbox environment. Hey, I’m even comfortable with a huge amount of filtering, reducing and matching happening in the data lake, with that qualified and by-then structured data getting transformed and integrated into the enterprise data warehouse. After all, in most organisations, the big value from big data is only realised when you can integrate the outcomes derived from the big data with the outcomes from traditional BI to arrive at new, but targeted and qualified insights. Call it analytics performed on integrated “all data”.

However, I have to seriously question why you would take perfectly structured data, dump it into an unstructured data lake, and use much more primitive tools than an established ETL framework to process it, only to eventually get it back into the production enterprise data warehouse and business intelligence environment. Like, why would you off-road via a longer route through the wilderness when there is a shorter, uncongested, perfectly tarred road straight to your destination? Maybe if you’re hiding something, sure, but very few organisations can justify or afford such clandestine operations.

While we’re talking about data integration: at this stage Hadoop is not yet a suitable solution for data integration. Many key data integration capabilities are still missing or poorly implemented. It is nowhere near the maturity of a well-established, function-rich ETL framework. Where established ETL frameworks offer functionality for data governance, and in particular data quality management, the Hadoop stack is still sorely missing this; it has to be coded into individual developers’ solutions. Hadoop workstreams are coded and implemented separately and independently – as such, there is no way to define and implement interdependencies between them.

Many of the ETL / data integration features and facilities you get standard “out of the box” with an ETL package have to be hand-coded in MapReduce on a Hadoop implementation. Functions which are standard in an ETL tool, such as datatype conversions, key lookups, record version management, aggregations, and basic parsing and extraction, all have to be coded, generally by Java developers, and then integrated and maintained by them. (Of course, some of the established ETL tools are now starting to offer “big data” processing capabilities, so that they can be used across the data lake as well.)
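As a rough illustration of what “standard in an ETL tool” means in practice, here is a hand-rolled surrogate key lookup plus datatype conversion sketched in Python. The dimension table, field layout and function names are invented for the example; an ETL tool gives you this (plus error handling, logging and lineage) as configuration rather than code.

```python
from datetime import datetime

# Pretend dimension table: natural key (SKU) -> surrogate key.
# In a warehouse this lookup is a built-in ETL transformation step.
product_dim = {"SKU-42": 7, "SKU-99": 8}

def transform(raw_row):
    """Convert raw extract strings into typed fact-table values."""
    sku, qty, ts = raw_row
    return {
        "product_key": product_dim.get(sku, -1),  # -1 = unknown member
        "quantity": int(qty),                     # datatype conversion
        "loaded_at": datetime.strptime(ts, "%Y-%m-%d %H:%M"),
    }

fact = transform(("SKU-42", "3", "2016-05-01 09:30"))
print(fact["product_key"], fact["quantity"])
```

Multiply this by every conversion, lookup and aggregation in a load, and the “primitive tools” complaint above becomes concrete: each of these has to be written, tested and maintained by hand.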

Concluding Thoughts

Now don’t get me wrong, I have no qualms with data lakes. In fact, as a big data collection, storage and ELT area, I think they are quite useful. What this revolt is against are the two terms “all-encompassing” and “free-for-all” in the definition given for data lakes.

As for “all-encompassing”, I will question again, as I did above, why you would ever unstructure your data, pump it into Hadoop to ELT it there, and then eventually bring it back to the structured data warehouse. I would rather use the data lake for “big data” storage, ELT and the like, and bring the useful bits into the structured environment. Personally, I think the analytics and BI toolsets that have been available for a very long time in the structured world are still far better than those currently being developed and rolled out in the data lake technology world.

With regards to “free-for-all”, I also have big concerns. What would a business user want to do there in the first place? Or even a conventional report developer? I would gladly grant full access to a small handful of skilled and experienced data scientists and data analysts, but to take it from there to “free for all” just invites chaos, because the data lake world is simply too hard to govern properly. And besides, we may not want bureaucratic governance processes stifling innovation there anyway.

So as I concluded in my “Data Lake vs Data Warehouse” post as well, if the data lake is used correctly in the BI ecosystem, together with the data warehouse being used for what it, in turn, is good for, one can have a very synergistic extended BI ecosystem that can really provide good information and insights to the business as and when needed.
