The engineering analogy of building a data lake carries quite well. Putting a data lake in place is about as complex as building a man-made lake – there are many aspects to consider. In this post I comment on some of the wisdom that has been published about building and populating data lakes. At this stage it is still a very new approach, and most implementations have, for the most part, been fairly experimental.
In previous posts I described the data lake and elaborated on its relationship with the enterprise data warehouse. In this post I look at aspects related to the implementation, management and governance of the data lake.
Data movement
Data can be moved directly into the data lake from a large variety of data sources – as it becomes available, or as new or changed source systems become accessible – and that is what makes the data lake so compelling. Data scientists and data analysts can investigate that data quickly, and get useful business insights from it, without having to wait for the BI team to model and ETL all that data first. It does not require any up-front modelling or even definition effort. Data refinement, filtering, structuring, preparation and the application of analytics can be implemented as and when the business needs them. The data architecture used in a data lake is therefore termed “loosely coupled” – no structure is forced on the data at entry time. Structure and interpretation are only applied when the data is used – this is called “schema on read”, as opposed to the “schema on write” approach used in traditional data warehousing. But of course, you still have to know what data is available, and you need to know how to access it, before you can apply any structure or interpretation “on read”.
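To make “schema on read” concrete, here is a minimal sketch using PySpark, one of the toolsets commonly used over a data lake. The path, column names and schema are purely illustrative; the point is that the structure is imposed at read time, not when the files were landed.

```python
# A minimal schema-on-read sketch (illustrative only; the path, columns
# and schema are hypothetical).
from pyspark.sql import SparkSession
from pyspark.sql.types import (StructType, StructField, StringType,
                               DoubleType, TimestampType)

spark = SparkSession.builder.appName("schema-on-read-demo").getOrCreate()

# The raw events were landed in the lake as-is, with no up-front modelling.
raw_path = "s3a://example-lake/raw/web_events/2024/06/"

# The structure is imposed only now, by the analyst who needs it.
event_schema = StructType([
    StructField("event_time", TimestampType()),
    StructField("user_id",    StringType()),
    StructField("event_type", StringType()),
    StructField("amount",     DoubleType()),
])

events = (spark.read
          .schema(event_schema)   # interpretation applied "on read"
          .json(raw_path))

# Two teams can read the same raw files with two different schemas,
# something a schema-on-write warehouse table would not allow.
events.filter("event_type = 'purchase'") \
      .groupBy("user_id").sum("amount").show()
```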
The data movement into the data lake can be done through a number of tools, including, but not limited to, ETL tools, bulk file loading facilities, data integration tools and data movement tools that were specifically created for big data technologies. The result is that over time all the raw data – structured and unstructured – from all over the organisation can be made available in one place.
But of course, managing that whole process is often more complex than initially perceived. Error and failure handling, system availability, duplication control, handshaking protocols, status updates, etc., all have to be put in place. In terms of effort, it is not all that different from putting a proper ETL framework in place. The wider the variety of data sources, and the more diverse the data integration toolsets, the more complex managing these processes becomes.
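As a rough illustration of what even a simple ingestion routine has to cater for, the sketch below handles duplicate detection, retries with back-off and status logging for a single file. It is plain Python, with local file copies standing in for a real data movement tool, and all names are hypothetical.

```python
import hashlib
import logging
import shutil
import time
from pathlib import Path

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("lake-ingest")

def file_checksum(path: Path) -> str:
    """MD5 of the file contents, used for simple duplicate detection."""
    return hashlib.md5(path.read_bytes()).hexdigest()

def ingest_file(source: Path, landing_zone: Path, seen_checksums: set,
                max_retries: int = 3) -> bool:
    """Copy one source file into the landing zone with duplication control,
    retries and status logging. Returns True if the file was landed."""
    checksum = file_checksum(source)
    if checksum in seen_checksums:
        log.info("Skipping %s: identical content already landed", source.name)
        return False

    for attempt in range(1, max_retries + 1):
        try:
            target = landing_zone / f"{checksum}_{source.name}"
            shutil.copy2(source, target)   # would be an S3/HDFS put in practice
            seen_checksums.add(checksum)
            log.info("Landed %s as %s (attempt %d)", source.name, target.name, attempt)
            return True
        except OSError as exc:
            log.warning("Attempt %d failed for %s: %s", attempt, source.name, exc)
            time.sleep(2 ** attempt)       # simple back-off before retrying

    log.error("Giving up on %s after %d attempts", source.name, max_retries)
    return False
```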
Search and access
Once all the data is available in the data lake, users can search for the data they need, and then either extract and offload it to the conventional analytics platform for analyses using conventional analytical tools, or they can analyse it “on the spot” using big data analytics tools.
But first, you need to be able to determine whether the data you’re looking for is actually in the data lake, and find out how to get to it and how to access it. This requires some descriptive metadata about the content, its size and volume, and maybe even its structure.
Search tools can only find data if it is properly indexed. However, making such large volumes of disparate data searchable introduces its own challenges. You need metadata that describes what data is available, where it is and how to access it, and once volumes grow, indexing tools to speed up search and access. The APIs and tools that provide this functionality are only now being developed, enhanced and released, so there is a lot of growth, development and change happening in this world.
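A minimal sketch of what such descriptive metadata and a naive search over it could look like is shown below. The dataset names, locations and tags are made up, and a real lake would back the search with a proper catalog or indexing service rather than an in-memory list.

```python
from dataclasses import dataclass, field

@dataclass
class DatasetEntry:
    """Descriptive metadata for one dataset landed in the lake."""
    name: str
    location: str          # where the files live
    source_system: str     # where the data came from
    format: str            # e.g. "json", "parquet", "csv"
    approx_rows: int
    tags: list = field(default_factory=list)

# A tiny, hypothetical catalog.
catalog = [
    DatasetEntry("web_events", "s3a://example-lake/raw/web_events/",
                 "web platform", "json", 250_000_000, ["clickstream", "raw"]),
    DatasetEntry("call_records", "s3a://example-lake/raw/cdr/",
                 "billing system", "csv", 1_200_000_000, ["telecom", "raw"]),
]

def search_catalog(keyword: str):
    """Naive keyword search over dataset names and tags."""
    keyword = keyword.lower()
    return [d for d in catalog
            if keyword in d.name.lower() or any(keyword in t for t in d.tags)]

# An analyst can now find out where the clickstream data lives and how to get to it.
print([d.location for d in search_catalog("clickstream")])
```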
Data governance
Over time, together with the metadata and indexing data described above, a data governance framework should be built over the data lake for enterprise-wide data management and data governance. But it is only as the organisation’s storage, understanding, utilisation, documentation and indexing of the “loosely coupled” data matures that its data management framework can be extended to cater for more advanced aspects of data management and data governance.
The data architects – yes, there should be such people responsible for the data lake (read my comments against the “free-for-all” data lake in a subsequent post) – have to ensure that the data lake doesn’t become an uncategorised hoarding ground or dumping ground, or as some authors have called it, a data graveyard, where there are files and files of data, but we do not know what’s in them, much like piles of unopened, unlabelled boxes in some people’s garages.
It is very easy to lose track of what’s in the lake, especially of datasets that aren’t used regularly. So the data needs to be cataloged and classified – some good old data and metadata management is required.
The challenge with data governance and data management in the data lake environment is that you have to strike a fine balance between keeping tabs on what is there and not restricting the use and creation of new data and insights. You need some data governance in order to make the data in the lake useful, and to some extent also to control it, so that the quality and context are good enough to allow it to be used for decision-making. However, you need to do this without hampering the data scientists and data analysts who need unrestricted access to all that is there. A further complication is that these scientists and analysts may even be creating further derived data insights, which also need to be cataloged, indexed and managed.
Characteristics
Let us now turn our attention to the characteristics of a well-implemented data lake.
Properly implemented, the data lake can provide a significant improvement to the organisation’s data architecture, providing a large data storage, processing and analysis capability that can be filled and accessed with agility. If properly run in conjunction with the data warehouse, it can free up expensive EDW resources, and enable more efficient processing of large volumes of data. This is useful for organisations with extremely large structured datasets too, such as call data records or point-of-sale transactions.
For properly skilled data scientists and data analysts, the data lake provides a self-sufficient self-service analytics environment where data exploration and other data-related and analytical tasks can be performed without having to wait for the EDW team to first model the data and then load it.
A thorough data lake implementation will more than likely make use of a number of different tools and technologies. At this relatively early stage, it is still questionable whether a single tool can provide all the necessary data management and data movement capabilities that are required. Very few toolsets yet provide native support for the integration of the vast array of structured, semi-structured and unstructured data types that are typically managed in a data lake environment.
In particular, the data lake environment requires a number of configurable and process-efficient data loading facilities to get data loaded quickly, securely, without errors, and repeatably when multiple datasets have to be loaded from the same source over time.
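The sketch below illustrates one simple way to make loads repeatable: each batch lands under a dated partition path with a commit marker, so a rerun of the same batch is skipped rather than duplicated. The directory layout and marker convention are assumptions made purely for illustration.

```python
from datetime import date
from pathlib import Path
import shutil

def load_batch(source_file: Path, lake_root: Path, dataset: str,
               load_date: date):
    """Land one batch under a dated partition path; skip it if that
    partition was already loaded, so reruns are safe."""
    partition = lake_root / dataset / f"load_date={load_date.isoformat()}"
    marker = partition / "_SUCCESS"
    if marker.exists():
        return None                       # already loaded; nothing to redo
    partition.mkdir(parents=True, exist_ok=True)
    shutil.copy2(source_file, partition / source_file.name)
    marker.touch()                        # commit marker written last
    return partition
```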
The contents of the data lake, and the structures used to manage that content, will differ greatly from industry to industry. For example, the data required in a healthcare environment differs vastly from that used in the telecommunications industry, and likewise the required analytical outcomes also differ substantially. The data search, location and access capabilities must therefore be industry-aware, catering for the industry-specific codes, terminology, structures and applications used.
A successful data lake implementation also increases independence from IT and/or the data warehouse team. A properly created and documented data lake will make it possible for data scientists, data analysts and other skilled users to access the data, run discovery tools and do their own in-depth analyses, without having to rely on IT or the data warehousing team to first load the data and make it available.
An efficiently implemented data lake will make use of automated metadata capturing and metadata management facilities. For every data set stored and processed in the data lake, a significant amount of metadata needs to be captured and maintained as that data gets accessed and used. Attributes like data lineage, data quality, and usage history have to be maintained to increase the usability of the data. Maintaining all this metadata requires a highly automated metadata extraction, capture, and tracking facility. If you rely on manual metadata management processes, the metadata will quickly fall out of sync with the data lake’s content, and the lake will turn into a graveyard.
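As a small illustration of automated capture, the sketch below wraps a hypothetical dataset read function so that every access is recorded – who read what, when, and via which function. In practice the usage records would land in a metadata store rather than an in-memory list, and the dataset and function names are made up.

```python
import functools
import getpass
from datetime import datetime, timezone

usage_log = []   # stand-in for a real metadata / usage-history store

def track_usage(dataset_name: str):
    """Decorator that records who accessed a dataset, when, and via which
    function, every time it is read."""
    def decorator(read_func):
        @functools.wraps(read_func)
        def wrapper(*args, **kwargs):
            usage_log.append({
                "dataset": dataset_name,
                "accessed_by": getpass.getuser(),
                "accessed_at": datetime.now(timezone.utc).isoformat(),
                "via": read_func.__name__,
            })
            return read_func(*args, **kwargs)
        return wrapper
    return decorator

@track_usage("web_events")
def read_web_events(path):
    # stand-in for the real read; returns the path it would have read
    return path

read_web_events("s3a://example-lake/raw/web_events/")
print(usage_log[-1])
```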
A well-implemented data lake will integrate closely with the enterprise data warehouse and its related toolsets, such as ETL / ELT, data cleansing and data profiling tools. If the data lake acts correctly as part of the data staging ecosystem, users will be able to run reports and perform analyses using the conventional tools as they always have.
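For example, a refined dataset prepared in the lake could be published to a warehouse staging table so that conventional reporting tools can consume it. The sketch below uses pandas and SQLAlchemy purely as an illustration; the Parquet path, connection string and table name are placeholders, not a prescribed integration pattern.

```python
import pandas as pd
from sqlalchemy import create_engine

# Hypothetical connection to the enterprise data warehouse.
engine = create_engine("postgresql://analytics:secret@warehouse-host/edw")

# Read a refined dataset that was prepared in the lake...
daily_sales = pd.read_parquet("s3://example-lake/refined/daily_sales/")

# ...and publish it to a warehouse staging table so that conventional
# BI and reporting tools can use it as they always have.
daily_sales.to_sql("daily_sales", engine, schema="staging",
                   if_exists="replace", index=False)
```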
Maturity
Similar to the maturity stages of a data warehouse and BI implementation, the implementation and enterprise-wide adoption of a data lake can also be described in terms of four maturity stages (adapted from a Hortonworks white paper):
- Stage 1: Getting to grips with handling data at scale. This involves getting the necessary technologies in place, understanding how they work and learning to acquire and process data at scale. At this stage, the organisation may not be getting much value yet, as the analytics you can apply at this early stage may be quite simple, but you will learn much about the technology and how to use it effectively.
- Stage 2: Building transformation and analytics capability. This involves putting the functionality, people and processes in place to transform and analyse data. This includes acquiring, installing and adapting the tools that are most appropriate to the organisation’s requirements. You need to consider the available skill sets, or train up the required skills, and put the necessary processes in place. At this stage you start acquiring more data and building applications. Using a BI analogy, at this stage you have identified and selected a BI toolset and you have put a tested ELT framework in place. By this stage you also need to have figured out how the capabilities of the enterprise data warehouse and the data lake are going to be used together. This is typically figured out during a large PoC, or preferably during a first larger-scale project.
- Stage 3: Getting data and analytics into the hands of as many people as possible, for a broad operational impact. At this stage the data lake and the enterprise data warehouse start working in unison, each playing its specific role.
- Stage 4: Enterprise data management capabilities are added to the data lake. At this stage, as the use of big data grows, it requires better governance, compliance enforcement, security, auditing and other related data management functions. However, very few organisations are anywhere near this stage at present.
Concluding remarks
As the amount of big and unstructured data grows, organisations that have invested years of cost and effort in creating enterprise data warehouses are beginning to create data lakes to complement them. The data lake and the enterprise data warehouse must each do what they do best and work together as components of a logical BI ecosystem.
Data lakes are increasingly becoming the popular technology used to store, manage and exploit all these new forms of unstructured data, sometimes combined with structured data as well. In essence, the structures and content of the data lake are determined by what the organisation requires but cannot manage with its current enterprise-wide data architecture and BI environment.
Together, the data lake and the enterprise data warehouse provide a combination of storage and analytical capabilities that can deliver returns very quickly. This allows people to do more with more data, faster, driving better business results. That’s the ultimate payback from investing in a data lake to complement the enterprise data warehouse.
In subsequent posts, I’ll discuss the criticisms against the data lake as an all-encompassing, free-for-all staging area, and investigate whether it will ever completely replace the functionality of the enterprise data warehouse.