Finding value in the data lakehouse

As you may have gathered from my previous post, I have become very interested in cloud-based data lakehouses. It was therefore with a keen interest that I read the article ‘Five Effective Ways to Build a Robust Data Lake Architecture’ on Enterprise Talk.

While initially I was hoping that the author was going to review and compare five different variations of data lakehouse architecture, the piece focuses on five steps needed to set up a data lake architecture, and the insights shared are very useful. The five steps identified are:

  • Determine the business data goals
  • Select the right data repository to gather and store information
  • Develop a data governance strategy
  • Integrate AI and Automation
  • Integrate DataOps

Of course, these steps are all crucial to success. I cannot emphasise enough the importance of aligning with the business strategy and goals, and of having a well-implemented and enforced data governance strategy. However, my need for more information saw me click through to another article on Enterprise Talk, titled ‘How Enterprises Can Leverage Data Lakehouse Architecture to Get the Most Value from Their Data’.

Defining a data lakehouse

Previously, I have been involved on the periphery of a data vault-based data lake. This works well for a smaller number of disparate data sources. But in my current engagement, we must integrate and standardise data across many different data sources, within minimal time frames using limited resources. In effect, we cannot cater for the extensive data modelling required for a data vault.

This would also explain my interest in data lakehouses. The article defines a data lakehouse as ‘a dual-layered architecture, with a warehouse layer placed over a data lake, enforcing schema, which ensures data integrity and control while also allowing for faster BI and reporting. Data lakehouse architecture also eliminates the need for multiple data copies and drastically decreases data drift issues.’

Speed above all

In my attempt to simplify things, I see the data lake part as a continuously updated, timestamped source-true staging area. This may be schema-defined for structured data and schema-less for unstructured data. What I like about such an approach is that very little processing is done to the data on the way in. It is quick and efficient, and source-true for auditability and traceability.
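To make the idea of a timestamped, source-true staging area a little more concrete, here is a minimal Python sketch. It is an illustration under my own assumptions, not how any particular lakehouse product works: the names `LandedRecord`, `StagingArea`, `land`, and `as_of` are hypothetical. Each record is stored exactly as it arrived, stamped with a load time, and never changed afterwards.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Any, List, Optional


@dataclass(frozen=True)
class LandedRecord:
    source: str          # hypothetical source-system name
    loaded_at: datetime  # load timestamp added on arrival
    payload: Any         # the record as the source delivered it, untransformed


class StagingArea:
    """Hypothetical append-only landing zone: records are added, never changed."""

    def __init__(self) -> None:
        self._records: List[LandedRecord] = []

    def land(self, source: str, payload: Any,
             loaded_at: Optional[datetime] = None) -> LandedRecord:
        # No cleansing or reshaping on the way in; only a timestamp is attached.
        stamp = loaded_at if loaded_at is not None else datetime.now(timezone.utc)
        record = LandedRecord(source, stamp, payload)
        self._records.append(record)
        return record

    def as_of(self, source: str, moment: datetime) -> List[LandedRecord]:
        # Everything landed from one source up to and including a point in time.
        return [r for r in self._records
                if r.source == source and r.loaded_at <= moment]
```

Because nothing is overwritten, an `as_of` query can reconstruct what the lake held at any past moment, which is where the auditability and traceability come from.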

In fact, in some implementations the data is not even physically copied to the data lake. Instead, through virtualisation, it can be viewed as if it were in the data lake. This makes perfect sense for large datasets that already contain date and time attributes, and which are rarely accessed or updated on the source system. Examples include system logs, activity logs, audit trails, mobile call records, point-of-sale transactions, and so on.

It’s not inside…

For me, the ideal solution would have two constructs sitting ‘on top’ of the data lake as opposed to only the data warehouse as defined above.

The first construct would be a minimalistic structured data warehouse. However, I would add the criteria that it only contains the measures and dimensions and timespan of data used for regular BI, dashboarding, and reporting. In other words, very highly curated, controlled, and trusted data is used to produce the information in the daily, weekly, and monthly running of the business. While I am at it, we might as well throw regulatory reporting in here as well.

The second construct would then be an analytical sandbox that is used for analytical models and ad hoc queries that need data integrated or refined from the data lake. Models and reports developed in the sandbox can always be productionised into the data warehouse component if they become a long-term fixture on the information/intelligence landscape.

As the author of the second article writes, ‘a data lakehouse initiative, when done correctly, can free up data and let an organisation use it the way it wants and at the speed it wants’. And that is ultimately what we are all looking for.

Rethinking the opportunities bubbling below the surface of data lakes

Long-time readers of my blog can likely recall my scepticism around data lakes when they first emerged. However, a lot of water has flowed into the space since then, motivating me to start investigating the area in-depth.

My early criticisms against Hadoop-based data lakes were that they were too batch-oriented to cater for business analytics and business intelligence requirements. The technology did not cater for cataloguing its contents. Instead, it required users to manage the catalogue or dictionary to the side if they were to keep tabs on what was being pumped into the data lake. This left a major burden of data management on data engineers and data custodians. They were tasked to manage the organisation’s data resource properly and to make efficient access and utilisation of the data possible.

Invariably, this required a lot of work and discipline to avoid the data lake becoming an unmanageable data swamp. Fortunately, much development has been happening to overcome this challenge as is evident in this article published in Virtualization and Cloud Review.

Moving beyond past obstacles

The convergence of data lakes and data warehouses, as deployed on cloud technologies, addresses my early concerns around still having to develop a full-scale data warehouse downstream from the data lake to make sense of the data dumped into the lake.

Modern cloud architectures are more tightly integrated and the architectural distinction between the data lake and the data warehouse is fading away. As a recent TDWI article puts it: “The new generation of DWs are, in fact, DLs that are designed, first and foremost, to govern the cleansed, consolidated, and sanctioned data used to build and train machine learning models.”

The data structures used in these technologies have also become more efficient and well-managed. In some cases, the data from the data lake does not even have to be physically moved to the data warehouse. Techniques like virtualisation and data mapping mean logical structures in the data warehouse can directly point to the appropriate physical data sets contained in the lake. This eliminates a lot of data movement and redundant data copying. In turn, this results in significant savings in storage space and processing while increasing access to data and readiness.
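As a rough illustration of a logical structure pointing at physical data, here is a minimal Python sketch. It rests on my own assumptions and is not any vendor's virtualisation API: `VirtualTable`, `csv_source`, and the column names in the usage note are hypothetical. The warehouse-side table holds only a column mapping and a way to open the lake data, so rows are read and renamed on demand rather than copied.

```python
import csv
from typing import Callable, Dict, Iterable, Iterator


class VirtualTable:
    """Hypothetical logical table: stores a mapping, not the data itself."""

    def __init__(self,
                 open_source: Callable[[], Iterable[Dict[str, str]]],
                 column_map: Dict[str, str]) -> None:
        self._open_source = open_source  # how to reach the physical data set
        self._column_map = column_map    # warehouse column -> lake column

    def rows(self) -> Iterator[Dict[str, str]]:
        # The source is re-read on every call, so no second copy is kept.
        for raw in self._open_source():
            yield {target: raw[source]
                   for target, source in self._column_map.items()}


def csv_source(path: str) -> Callable[[], Iterator[Dict[str, str]]]:
    """Build an opener for a CSV file that stays where it is in the lake."""
    def _open() -> Iterator[Dict[str, str]]:
        with open(path, newline="", encoding="utf-8") as handle:
            yield from csv.DictReader(handle)
    return _open
```

A call such as `VirtualTable(csv_source("sales.csv"), {"transaction_id": "txn_id", "amount": "amt"})` (with a made-up file and made-up column names) would expose the lake file under warehouse-style column names without moving a single row.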

The move towards integration

These developments are resulting in data architects and solution designers embracing new, integrated approaches. As Tomas Hazel puts it in his article: “It is important to find a solution that allows you to turn up the heat in the data lake with a platform that is cost-effective, elastically scalable, fast, and easily accessible. A winning solution allows business analysts to query all the data in the data lake using the BI tools they know and love, without any data movement, transformation, or governance risk.”

Having started to study some of the technologies now available in this space, I am excited by the potential of these platforms that now allow and enable good data management practices, such as the data governance aspects Hazel pointed out. My initial investigations also revealed to me that with less physical data movement, and more configurable and less coding-based data mapping and data transformation, this environment will be much more scalable, cost-effective, and easily manageable.

Data optimisation

Krishna Subramanian on RTinsights explains how to get file-based data into a managed cloud-based data lake. While her article is more focused on the taming of the data in the data lake, she does touch on several approaches.

These are equally applicable to incorporating and managing structured source system data as well as unstructured data into the integrated cloud-based data lake or data warehouse environment. This includes optimising the data through proper metadata application and tagging and indexing to enable more efficient searching. With this comes appropriate relevance filtering on the flood of external data and being able to use appropriate taxonomies.
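To show what tagging and indexing for more efficient searching could look like at its simplest, here is a minimal Python sketch. It is my own assumption of the idea rather than anything taken from the article: the `Catalogue` class, its methods, and the dataset names in the usage note are hypothetical. Datasets are registered with tags, and an inverted index from tag to datasets makes lookups quick.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set


class Catalogue:
    """Hypothetical metadata catalogue with a tag-to-dataset index."""

    def __init__(self) -> None:
        self._tags_by_dataset: Dict[str, Set[str]] = {}
        self._datasets_by_tag: Dict[str, Set[str]] = defaultdict(set)

    def register(self, dataset: str, tags: Iterable[str]) -> None:
        # Normalise tags so 'Sales' and ' sales ' are treated as the same tag.
        normalised = {tag.strip().lower() for tag in tags}
        # Drop index entries from any earlier registration of this dataset.
        for old_tag in self._tags_by_dataset.get(dataset, set()):
            self._datasets_by_tag[old_tag].discard(dataset)
        self._tags_by_dataset[dataset] = normalised
        for tag in normalised:
            self._datasets_by_tag[tag].add(dataset)

    def find(self, *tags: str) -> Set[str]:
        # Return the datasets that carry every requested tag.
        wanted = [tag.strip().lower() for tag in tags]
        if not wanted:
            return set()
        matches = [self._datasets_by_tag.get(tag, set()) for tag in wanted]
        return set.intersection(*matches)
```

Registering `"pos_2021"` with the tags `"Sales"` and `"retail"`, for instance, means a later `find("sales", "retail")` returns it without scanning the data itself. A fixed list of permitted tags would be one simple way to bring a taxonomy into the picture.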

She does not just focus on the technology side of the data lake. Her considerations end by addressing the organisational culture. She refers to 2021 research by New Vantage Partners: “Leading IT organisations continue to identify culture – people, process, organization, change management – as the biggest impediment to becoming data-driven organisations.”

She continues that a data-driven culture needs to span not just the analysts and the lines of business, but IT infrastructure teams too. From my side, I would add that this data-driven culture must include representatives from each line of business – the data stewards and business subject matter experts. I will no doubt explore this in more depth in the future.

The advantages of data fabrics

In my blog post last month, I started looking at the concept of data fabrics to get an understanding of what it is all about. This month, I continue with the discussion, focussing on the advantages of data fabrics. The points I have outlined below are based on a very good article written by Lori Witzel, Director of Research for Analytics and Data Management at TIBCO, and published on ITProPortal. I have added my own views, based on my experience working in the data space.

To recap, the concept of data fabric was created to address the need for more data-driven insights while coping with the reality of the distributed nature of modern data architectures. Complicating this is that most organisations are dealing with data sources located on-premises and across hybrid and multi-cloud environments. For example, a company might be running both a CRM application and a modern data warehouse platform across two different cloud providers. A correctly implemented data fabric framework enables us to work across all these data environments.

However, it is not only about the technical connectivity and the integration of data flows, data access, and data storage. There is also an element of augmented data management and data governance that is required in such a cross-platform orchestration.

The challenge is that in the modern disparate, and often siloed organisation, a lot of de-duplication, verification, integration, and other data resolutions are required to get a complete and single source of the truth. Add to that the vast amount of ‘old’ data. While this might not operationally be required anymore, the company may need this data archived for trend analysis, historic reporting, or even as mandated by legislation.
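As a small illustration of the de-duplication and resolution work involved, here is a minimal Python sketch. It is a deliberately naive example under my own assumptions: the `match_key` and `deduplicate` functions and the `name` and `email` fields are hypothetical, and real matching is usually fuzzier than an exact key comparison. Records from different silos are matched on a normalised key, and gaps in the first copy are filled from later ones.

```python
from typing import Dict, Iterable, List, Tuple


def match_key(record: Dict[str, str]) -> Tuple[str, str]:
    # Normalise case and whitespace so trivial differences do not block a match.
    name = " ".join(record["name"].lower().split())
    email = record["email"].strip().lower()
    return (name, email)


def deduplicate(records: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    merged: Dict[Tuple[str, str], Dict[str, str]] = {}
    for record in records:
        key = match_key(record)
        if key not in merged:
            # The first copy seen becomes the surviving record.
            merged[key] = dict(record)
            continue
        # Later copies only fill in fields the surviving record lacks.
        for field, value in record.items():
            if not merged[key].get(field):
                merged[key][field] = value
    return list(merged.values())
```

Which copy should survive, and which source wins when values conflict, is exactly the kind of rule a data governance forum would need to decide; this sketch simply keeps the first record it sees.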

The referenced article lists the four key advantages of using data fabrics as insights, innovation, information governance, and insured trustworthiness.

Insights

According to the author, data fabrics allow an organisation to treat its data like any other business component, something advocates have been calling for for years. Data fabrics not only allow the organisation to take advantage of more advanced insights, such as those derived through analytics and machine learning, but also enable the data custodians to automate and accelerate data management. I would love to explore how this will work in a future post, so keep an eye out for that as well.

Innovation

The second point almost happens naturally. An organisation that leverages a data fabric approach can put their entire ‘data estate’ (as the article refers to it) to work. With so much more actionable insight being generated as a result, it is easier to transform the organisation through data-driven insights and attempt and adopt new levels of innovation. Of course, the organisation’s culture and agility must be able to support such an approach.

Information governance

This advantage sounds like music to my ears – because as our data environments become more complex, so does the data governance aspect. This is coupled with, as the article states, an increasingly complex array of regulatory and compliance needs, not to mention more stringent requirements for privacy and security. The unified view provided through data fabric frameworks can simplify and streamline this complexity. I am eager to explore this in more detail in subsequent posts – how exactly does a data governance forum utilise the facilities and features of the data fabric framework to improve data governance over a faster flowing and more tightly integrated architecture?

Insurance

The name of this advantage was slightly strange for me as this point is about the trust in data – which of course is critical. I guess the author wanted to stay with the four ‘I’s theme. So, the data fabric provides improved trustworthiness in the data, and what can then be done with it. This flows directly from a more unified approach to data management. It will be interesting to see how this is physically implemented. As we all know, the downstream data quality can only be built on the basis laid in the source systems. Looking ahead, I am interested in examining each of these four concepts and how they will be technically implemented within an organisation.

Unlock new business insights through data fabrics

Seeing as I’m currently working for a large, federated organisation with significantly different and siloed business streams that are managed through a plethora of different systems – ranging from 30-year-old mainframes to modern in-cloud platforms – the topic of data fabrics is very interesting to me. Even more so given that I come from a database, data governance, integration, business intelligence, and insights background.

Telling powerful stories with data visualisation

Long-time readers of this blog know I’m a big advocate for using data visualisations to better narrate the ‘stories’ that are hiding inside organisational data. My interest was therefore sufficiently piqued when I came across this article on the Bulletin Expert Contributor Network. And while it doesn’t make any wild discoveries, I enjoy the way it provides seven ways to improve data visualisations.

Demonstrating value as a Chief Data Officer

The year has started with the proverbial bang. But even though the pace has been furious, there are many constants in which we can find comfort. One of these is how critical data is to every organisation regardless of industry vertical.

Unpacking Gartner’s latest tech trends

It is that time of year again when I like to #trendspot for the forthcoming year. And I’m always interested to read what Gartner predicts, so it was with great interest that I read their article titled Gartner Identifies the Top Strategic Technology Trends for 2022. My initial thought was, “Wow, that is a lot of deep stuff for an organisation to think about going into 2022, especially with everything else going on!” But then my eye fell on the line in the conclusion which reads: “This year’s top strategic technology trends highlight those trends that will drive significant disruption and opportunity over the next five to 10 years.” That’s more like it! So, with that context in mind, below I’ve shared my views on the trends identified by Gartner.

Artificial intelligence

Gartner identified two artificial intelligence (AI)-related topics – generative AI and AI engineering. The former centres on machine learning methods that learn about content or objects from data, and use it to generate new, original, realistic artefacts. For its part, AI engineering is an integrated approach for operationalising AI models – effectively putting AI solutions into production to realise the value that they have been developed for.

Even though generative AI is interesting, the challenge is that it can potentially be misused for scams, fraud, political disinformation, and forged identities.

It is AI engineering that really excites me. It can be applied to a wider range of solutions that include advanced analytics. Too often we see amazing analytical and AI solutions developed and evaluated, with the potential to produce impressive results. However, businesses then fail to put the solutions into production by integrating them into their operational and business processes.

Data fabric

Out of all the trends, this is the one I am most enthusiastic about. According to Gartner, data fabric is about the flexible, resilient integration of data across platforms and business users. It has emerged to simplify an organisation’s data integration infrastructure and create a scalable architecture. This reduces the technical debt seen in most data and analytics teams due to rising integration challenges.

We all know how complex data management, data governance, and data integration can become over vastly differing technologies. This is even more so the case when these are managed by different vendors across siloed business lines. For me, data fabric must be on top of the priority list for any business.

Autonomic systems; Composable applications; Hyperautomation; and Total Experience

Several of the technologies listed by Gartner are focused on putting more adaptable solutions in place in a much shorter timeframe using a variety of automation techniques. This highlights how businesses cannot wait for solutions through year-long analysis, development, testing, and implementation cycles.

Of course, it might be relatively easy to reduce the technical time. The stakeholder time perhaps less so. Getting business users across a large, siloed organisation to agree on priorities, requirements, data standards, governance, security, and privacy can be a time-consuming task. Even just getting the right people around a virtual conference table is challenging. Now add Total Experience to the mix, where we want to improve the experience across customers, employees, business managers, providers, and other stakeholders, and it becomes clear that the most significant obstacle remains on the people side of things.

Decision intelligence

The way Gartner describes decision intelligence does not make it seem like a new technology. Rather, it is a rewording of approaches supported by technology. It makes me think back to scorecards, dashboards, alerts, early warning predictions, data visualisation, and other approaches and technologies used in this field.

I feel that technology can certainly help in the decision-making process, but it only addresses one side of the coin. The other aspect that needs attention is the psychology of decision-making. Different people have vastly different approaches to and styles of decision-making. Add to that the group dynamic often found in boardrooms, and you have a dream project for any business psychologist. It will still be some time before technology is intelligent enough to assist in that arena.

That’s a wrap

I did not cover all the topics mentioned by Gartner, such as privacy and security technologies, because my focus is more on the data side.

It will be interesting to watch the developments over the next few years to see how these technologies evolve and become adopted by organisations. Of course, as the famous saying goes, nothing is constant but change. We will no doubt see this list change and evolve over the years.

The 4 big steps to become data-driven

Readers of this blog are no strangers to me discussing the importance of becoming a data-driven organisation or the steps required to become one. However, a Forbes article really brings this all home by discussing the big four aspects which this could entail.

Making data FAIR

Last month, I examined the importance of improving data literacy and briefly discussed five strategies organisations can employ to help achieve this. For my October piece, I want to build on these concepts and turn my attention to what makes data FAIR (findable, accessible, interoperable, reusable).

Improving data literacy remains important

We all know the importance of accessing ‘clean’ data and the hidden insights it contains. But despite this, in my experience, many companies are still not comfortable with the data literacy skills of their employees. Often, this comes down to not understanding where to begin when it comes to boosting data literacy across all the required levels of the organisation.
