In a previous post I gave a high level description and overview of the data lake. Informally speaking, from a BI point of view, a data lake is a large scaled-out all-encompassing free-for-all unstructured data staging area. In this post I take the discussion further to investigate whether it replaces or interacts with the data warehouse.
The purpose of the data warehouse – defined as a subject-oriented, integrated, time-variant and non-volatile collection of data in support of the organisation’s decision making processes – is to integrate data from many transactional and operational systems, in order to present the resulting information as a “single integrated version of the truth” to decision makers on all levels of the organisation. The design of the data warehouse, if done properly, allows them to access, report and analyse that information from all the relevant and possible angles; in order to get consistent and accurate information as a result.
However, the critique against the data warehouse is that its careful design and subsequent implementation takes time and effort. Some opponents to the data warehouse way of doing BI argues that the dimensional structures used in a data warehouse are inflexible and limiting. However, I claim they are only inflexible and limiting if the data warehouse is inadequately designed or badly implemented, which unfortunately, many of today’s data warehouses are…
New ETL Environment
Lately, we have seen some very interesting developments in organisations with enterprise data warehouses that have started to incorporate unstructured “big data” into their BI ecosystem. Some of these organisations actively utilise the data lake in the data integration and data management space.
In the ETL environment, the ability of the data lake to store and process data at low cost, together with the different methods to process large volumes of data at scale, has raised the interest in the data lake as an effective location to do ETL processing quite efficiently. In terms of big data, the data lake is a natural place for the “scaled out” ELT processing of that data – note “ELT”, since with big data you need to take the ELT processing to the data, and not move the data through an ETL processing pipeline in a separate environment. So the ELT applied to the big data in the data lake filters and pre-prepares the correct subset of that data for loading into the data warehouse.
Even for large sets of structured transactional data – for example, think of mobile phones’ call data records, or retail point of sale records – moving that ETL processing to the data lake frees up a lot of the conventional “data warehousing” resources that can now be used for improving the preparation of other structured data and for the analytics performed on that data then. It works well where you have one or two super-large data sets that consume the majority of the resources. These datasets can be moved from the conventional structured ETL environment into the large-scale unstructured data lake environment to take advantage of the parallelism and scaled-up data storage and data processing facilities available there.
While there is no immediate benefit to move all the ETL processes to the data lake environment, especially if they are already stabilized in a productionised environment, it is likely that over time more large scale ETL processes will be moved to this environment, potentially as they are adapted, changed or developed from scratch.
Analytical Silos
However, the organisations that are using both an enterprise data warehouse and a data lake often face a different problem – because they are in fact creating a distributed form of analytics. Results from analysing video, audio, unstructured text, etc., are often found in the data lake, while the results from applying analytics to structured data are either found in the analytical modelling tool environment, or written back to the data warehouse for reporting and further action.
The issue with these two separate environments is that the most value is obtained when these separate sets of modelling results are integrated. In reality, in most cases, the models will produce better results when they are run over an integrated dataset. For example, you would want to analyse a customer’s sentiment scores (unstructured data) together with their buying patterns (structured data) when you predict their propensity to churn.
Incorporating both datasets in the same analytical model is a bit more complex. At this stage, best practice is still to reduce and filter the “big data” as much as possible in situ, and only move the relatively small significant bits of valuable data that contains the relevant variables into the data warehouse environment for conventional advanced analytical modelling. Some integrated ecosystems that support data federation may provide access to the data in the data lake through wrappers or pass-through filters, thereby reducing the need to move or copy the data between the two environments.
Hybrid reporting and discovery
If the data lake and the enterprise data warehouse can be accessed together as a hybrid, unified system, users can develop reports or discover and analyse more data with less effort. Many of the mainstream data warehousing and BI products provide transparent access to and integration with the data stored in Hadoop.
In the long run, the location of the data and to some extent the details of the analytical models used to answer the burning business questions must be hidden from end users, and especially the decision-makers. Their job is to ask business questions. The BI ecosystem, made up of an enterprise data warehouse, a data lake, and potentially a discovery platform across these to facilitate analytics across the architecture, will determine what data and what analytics are used, and where it is executed, in order to answer those questions.
Concluding thoughts
The data warehouse itself should stay a logical representation of clean, vetted data that users at various levels can use to make decisions. Without a data warehouse, decision-makers operate by the seat of their pants, making critical decisions based on inaccurate or no data at all.
But as the amount of structured and unstructured data increase, we may need to implement a data lake to complement the enterprise data warehouse. The contents and structure of the data lake will be determined by what data and analytics are required but which you cannot store or process in the conventional data warehouse architecture.
In this way, the organisation implements a BI ecosystem – a logical data warehouse in Gartner terminology – with a variety of technologies and tools, including a data lake. In this environment, the data lake and the enterprise data warehouse must both do what they do best and work together as components of the larger integrated logical data warehouse, which in turn, provides greater value by delivering insights previously not possible.
In a subsequent post I’ll elaborate on the criticisms against such an approach, in particular against the use of a Hadoop-based data lake as a free-for-all all-encompassing staging area. Interesting stuff, stay tuned.