Finding value in the data lakehouse


As you may have gathered from my previous post, I have become very interested in cloud-based data lakehouses. It was therefore with a keen interest that I read the article ‘Five Effective Ways to Build a Robust Data Lake Architecture’ on Enterprise Talk.

While initially I was hoping that the author was going to review and compare five different variations of data lakehouse architecture, the piece focuses on five steps needed to set up a data lake architecture, and the insights shared are very useful. The five steps identified are:

  • Determine the business data goals
  • Select the right data repository to gather and store information
  • Develop a data governance strategy
  • Integrate AI and Automation
  • Integrate DataOps

Of course, these steps are all crucial to success. I cannot emphasise enough the importance of alignment with the business strategy and goals, together with a well-implemented and enforced data governance strategy. However, I wanted more detail, which led me to click through to another article on Enterprise Talk, titled ‘How Enterprises Can Leverage Data Lakehouse Architecture to Get the Most Value from Their Data’.

Defining a data lakehouse

Previously, I have been involved on the periphery of a data vault-based data lake. This works well for a smaller number of disparate data sources. But in my current engagement, we must integrate and standardise data across many different data sources, within minimal time frames using limited resources. In effect, we cannot cater for the extensive data modelling required for a data vault.

This would also explain my interest in data lakehouses. The article defines a data lakehouse as ‘a dual-layered architecture, with a warehouse layer placed over a data lake, enforcing schema, which ensures data integrity and control while also allowing for faster BI and reporting. Data lakehouse architecture also eliminates the need for multiple data copies and drastically decreases data drift issues.’

Speed above all

In my attempt to simplify things, I see the data lake part as a continuously updated, timestamped source-true staging area. This may be schema-defined for structured data and schema-less for unstructured data. What I like about such an approach is that very little processing is done to the data on the way in. It is quick and efficient, and source-true for auditability and traceability.
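As a minimal sketch of this idea, assuming an in-memory list stands in for the data lake, the landing step might look like the following. The function name `land_record` and the field names `source`, `ingested_at`, and `payload` are hypothetical, chosen for illustration only. The point is that the payload is stored exactly as received and only ingestion metadata is added.

```python
from datetime import datetime, timezone


def land_record(lake: list, source: str, raw: str) -> dict:
    """Append a raw payload to the lake unchanged, tagged with ingestion metadata."""
    entry = {
        "source": source,  # which system the data came from (hypothetical field)
        "ingested_at": datetime.now(timezone.utc).isoformat(),  # load timestamp
        "payload": raw,  # stored verbatim: no parsing, no cleansing
    }
    lake.append(entry)
    return entry


# Structured and unstructured data land the same way.
lake = []
land_record(lake, "pos", '{"till": 7, "amount": "12.50"}')
land_record(lake, "syslog", "Jan 01 00:00:01 host sshd[42]: session opened")
```

Because nothing is interpreted on the way in, a malformed record cannot fail the load, and the original content remains available for audit.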

In fact, in some implementations the data is not even physically copied to the data lake. Instead, through virtualisation, it can be viewed as if it were in the data lake. This makes perfect sense for large datasets that already contain date- and time-indicating attributes, and which are seldom accessed or updated on the source system. Examples include system logs, activity logs, audit trails, mobile call records, point-of-sale transactions, and so on.

It’s not inside…

For me, the ideal solution would have two constructs sitting ‘on top’ of the data lake as opposed to only the data warehouse as defined above.

The first construct would be a minimalistic structured data warehouse. However, I would add the criterion that it contains only the measures, dimensions, and timespan of data used for regular BI, dashboarding, and reporting. In other words, it holds the very highly curated, controlled, and trusted data used to produce the information for the daily, weekly, and monthly running of the business. While I am at it, we might as well throw regulatory reporting in here too.

The second construct would then be an analytical sandbox that is used for analytical models and ad hoc queries that need data integrated or refined from the data lake. Models and reports developed in the sandbox can always be productionised into the data warehouse component if they become a long-term fixture on the information/intelligence landscape.
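The two constructs can be sketched roughly as follows, assuming the lake is a list of dictionaries with already-parsed payloads. All names here (`SaleFact`, `build_warehouse`, `sandbox_query`, and the `day`, `store`, and `amount` fields) are hypothetical and only illustrate the split: the warehouse enforces a narrow schema and timespan, while the sandbox allows any ad hoc filter over the raw lake.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class SaleFact:
    """The only shape the curated warehouse accepts (hypothetical schema)."""
    day: str
    store: str
    amount: float


def build_warehouse(lake: list, since: str) -> list:
    """Curated construct: typed point-of-sale facts within the reporting timespan."""
    facts = []
    for row in lake:
        if row["source"] != "pos":
            continue  # only sources used for regular reporting
        payload = row["payload"]
        if payload["day"] < since:  # ISO dates compare correctly as strings
            continue
        # float() raises on bad input, which is the schema enforcement here
        facts.append(SaleFact(payload["day"], payload["store"], float(payload["amount"])))
    return facts


def sandbox_query(lake: list, predicate: Callable[[dict], bool]) -> list:
    """Sandbox construct: any ad hoc filter over the raw lake, no schema imposed."""
    return [row for row in lake if predicate(row)]


lake = [
    {"source": "pos", "payload": {"day": "2023-01-05", "store": "A", "amount": "12.50"}},
    {"source": "pos", "payload": {"day": "2022-12-30", "store": "B", "amount": "3.00"}},
    {"source": "syslog", "payload": "session opened"},
]
warehouse = build_warehouse(lake, since="2023-01-01")
sandbox = sandbox_query(lake, lambda row: row["source"] == "syslog")
```

A sandbox query that proves its worth over time could then be promoted by adding its source and fields to the curated schema.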

As the author of the second article writes, ‘a data lakehouse initiative, when done correctly, can free up data and let an organisation use it the way it wants and at the speed it wants’. And that is ultimately what we are all looking for.
