Long-time readers of my blog will likely recall my scepticism about data lakes when they first emerged. However, a lot of water has flowed into the space since then, motivating me to investigate the area in depth.
My early criticism of Hadoop-based data lakes was that they were too batch-oriented to cater for business analytics and business intelligence requirements. The technology provided no cataloguing of its contents; users had to maintain a separate catalogue or data dictionary to keep tabs on what was being pumped into the data lake. This left a major data management burden on data engineers and data custodians, who were tasked with managing the organisation’s data resource properly and making efficient access to and utilisation of the data possible.
Invariably, this required a lot of work and discipline to prevent the data lake from becoming an unmanageable data swamp. Fortunately, much development has happened to overcome this challenge, as is evident in this article published in Virtualization and Cloud Review.
Moving beyond past obstacles
The convergence of data lakes and data warehouses, as deployed on cloud technologies, addresses my early concern that a full-scale data warehouse still had to be developed downstream from the data lake to make sense of the data dumped into it.
Modern cloud architectures are more tightly integrated and the architectural distinction between the data lake and the data warehouse is fading away. As a recent TDWI article puts it: “The new generation of DWs are, in fact, DLs that are designed, first and foremost, to govern the cleansed, consolidated, and sanctioned data used to build and train machine learning models.”
The data structures used in these technologies have also become more efficient and better managed. In some cases, data from the lake does not even have to be physically moved to the data warehouse: techniques like virtualisation and data mapping let logical structures in the warehouse point directly to the appropriate physical data sets in the lake. This eliminates a great deal of data movement and redundant copying, which in turn yields significant savings in storage and processing while improving data accessibility and readiness.
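To make the virtualisation idea concrete, here is a minimal sketch (hypothetical names, not any vendor's API): a "logical table" in the warehouse layer holds only a pointer to physical data in the lake, and rows are read on demand at query time, so no copy is ever made.

```python
import csv
import io

# Hypothetical sketch of warehouse-side virtualisation: the logical
# table stores no data itself, only a reader that fetches rows from
# the lake when a query actually runs.
class LogicalTable:
    def __init__(self, name, reader):
        self.name = name
        self._reader = reader  # callable returning rows on demand

    def query(self, predicate):
        # Data stays in the lake; only matching rows are materialised.
        return [row for row in self._reader() if predicate(row)]

# Simulated lake file. In practice this would be, say, Parquet on
# object storage; an in-memory CSV keeps the sketch self-contained.
lake_file = "region,sales\nEMEA,100\nAPAC,250\nEMEA,75\n"

def read_lake_file():
    return list(csv.DictReader(io.StringIO(lake_file)))

sales = LogicalTable("sales", read_lake_file)
emea = sales.query(lambda r: r["region"] == "EMEA")  # lazy, no copy
```

The point of the design is that `sales` is pure metadata until queried; real platforms apply the same principle with external tables and views over files in the lake.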
The move towards integration
These developments are resulting in data architects and solution designers embracing new, integrated approaches. As Tomas Hazel puts it in his article: “It is important to find a solution that allows you to turn up the heat in the data lake with a platform that is cost-effective, elastically scalable, fast, and easily accessible. A winning solution allows business analysts to query all the data in the data lake using the BI tools they know and love, without any data movement, transformation, or governance risk.”
Having started to study some of the technologies now available in this space, I am excited by the potential of platforms that enable good data management practices, such as the data governance aspects Hazel points out. My initial investigations also suggest that with less physical data movement, and with data mapping and transformation that are more configurable and less coding-based, this environment will be much more scalable, cost-effective, and manageable.
Data optimisation
Krishna Subramanian, writing on RTinsights, explains how to get file-based data into a managed, cloud-based data lake. While her article focuses on taming the data already in the lake, she touches on several broader approaches.
These approaches apply equally to incorporating structured source-system data and unstructured data into the integrated, cloud-based data lake or data warehouse environment, and to managing it there. They include optimising the data through proper metadata application, tagging, and indexing to enable more efficient searching, along with applying relevance filtering to the flood of external data and using appropriate taxonomies.
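As a small sketch of what tagging and indexing buy you (all names here are hypothetical, not a real catalogue product's API): each data set is tagged on ingest, and an inverted index turns tag-based search into a dictionary lookup instead of a scan over the whole lake.

```python
from collections import defaultdict

# Hypothetical data lake catalogue: tags applied at ingest time feed
# an inverted index (tag -> data sets), so searches are lookups.
class LakeCatalogue:
    def __init__(self):
        self._tags = {}                 # path -> set of tags
        self._index = defaultdict(set)  # tag  -> set of paths

    def tag(self, path, *tags):
        self._tags.setdefault(path, set()).update(tags)
        for t in tags:
            self._index[t].add(path)

    def search(self, *tags):
        # Intersect the postings lists: data sets carrying ALL tags.
        sets = [self._index[t] for t in tags]
        return set.intersection(*sets) if sets else set()

cat = LakeCatalogue()
cat.tag("s3://lake/crm/customers.parquet", "customer", "pii", "structured")
cat.tag("s3://lake/web/clickstream.json", "customer", "unstructured")
cat.tag("s3://lake/finance/ledger.parquet", "finance", "structured")

hits = cat.search("customer", "structured")
```

The same shape underlies real catalogue services; the discipline that matters is applying the tags consistently at ingest, which is exactly the metadata practice Subramanian describes.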
She does not focus only on the technology side of the data lake; her considerations end by addressing organisational culture. She refers to 2021 research by New Vantage Partners: “Leading IT organisations continue to identify culture – people, process, organization, change management – as the biggest impediment to becoming data-driven organisations.”
She continues that a data-driven culture needs to span not just the analysts and the lines of business, but IT infrastructure teams too. From my side, I would add that this data-driven culture must include representatives from each line of business – the data stewards and business subject matter experts. I will no doubt explore this in more depth in the future.