Data Lake

My mini-series on big data would be incomplete without a discussion of data lakes. The data lake is emerging as a popular concept to address big data challenges, especially in the context of big data analytics and the integration of big data into the BI ecosystem. Vendors tout the data lake as the new approach to speed up the delivery of information and insights to the business without the delays traditionally experienced with cumbersome data warehousing processes.

The traditional single data warehouse architecture has long since been stretched into a BI ecosystem in order to incorporate big data sources and to enable big data analytics. The data lake has become an important component of this “logical data warehouse”, as Gartner calls the BI ecosystem. The data lake is incorporated into the ecosystem, because with its more flexible and variable data structures, it can manage the larger and more unstructured data sets that this environment now also has to cater for.

Description

If we amalgamate the various loose definitions of a data lake, we get to this description: A massive, easily accessible data repository built on (relatively) inexpensive computer hardware for storing uncategorised pools of data “as is”.

The data in the data lake can then be accessed, categorised and analysed at a later date. The data lake is therefore designed to retain all the data available, including data immediately of interest, data potentially of interest, and data for which the organisation does not yet know the intended usage.

So the data lake is intended to be more than a large store for variably structured or unstructured data that may potentially be of interest. The objective is that it forms a key component of the extended business intelligence and analytical ecosystem, bringing that data, and potentially structured data too, into it.

In BI layman’s techie-speak, a data lake is therefore a large semi- or unstructured all-encompassing staging area, typically implemented on cost-effective big data storage technology.

Capabilities

The data lake provides the following capabilities:

  • To capture and store raw data at scale for a low cost.
  • To store many types of data with different data structures in the same repository.
  • To define the structure of the data at the time it is used, referred to as “schema on read”.
  • To perform new types of data processing, such as applying data transformations, integrations and other forms of analytical data preparation, usually on the data “at rest” and at scale.
  • To perform single subject analytics based on very specific use cases.
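To make “schema on read” a little more concrete, here is a minimal sketch in plain Python. The records, field names and the read_with_schema helper are all made up for illustration and do not come from any particular data lake product. The point is only that the raw records are stored untouched, and each consumer imposes its own structure at the moment it reads them.

```python
import json

# Hypothetical raw records, stored "as is" with no schema enforced on write.
RAW_RECORDS = [
    '{"user": "alice", "page": "/home", "ms": "120"}',
    '{"user": "bob", "page": "/cart", "ms": 340, "referrer": "ad"}',
    '{"sensor": "t-17", "temp_c": 21.5}',
]

def read_with_schema(raw_lines, schema):
    """Apply a schema at read time.

    Keeps only records that contain every field in the schema, drops any
    extra fields, and casts each kept field to the requested type.
    """
    for line in raw_lines:
        record = json.loads(line)
        if all(field in record for field in schema):
            yield {field: cast(record[field]) for field, cast in schema.items()}

# Two consumers impose two different structures on the same raw data.
CLICK_SCHEMA = {"user": str, "page": str, "ms": int}
SENSOR_SCHEMA = {"sensor": str, "temp_c": float}

clicks = list(read_with_schema(RAW_RECORDS, CLICK_SCHEMA))
sensors = list(read_with_schema(RAW_RECORDS, SENSOR_SCHEMA))
```

Note that nothing was modelled up front: the click analyst and the sensor analyst each see only the records and fields that match their own schema, and the stored data stays exactly as it arrived.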

Implementation

Even though a data lake is a product-independent concept, most discussions of data lakes sooner or later turn to the details of its implementation using Hadoop. However, the data lake’s growing popularity is not caused by Hadoop becoming more popular as the primary storage platform for large and unstructured data sets, nor by the increase in data volumes, especially of unstructured or semi-structured data. The key driver is the desire of organisations to integrate existing and new unstructured data into their analytical ecosystems, even when they do not yet know which data, or which parts of it, they are going to require for their future analytical initiatives. The interest in the data lake is therefore driven by the requirement for additional analytical insights to be derived from the larger data pool.

So you may ask at this early stage, when is a large Hadoop distributed file system simply a large data store, and when can I rightfully call it a data lake? The difference comes in the intended usage. When the data store is merely used to accumulate and store large volumes of data, possibly for later use, we should only call it a big data store or storage system, which from a business intelligence and analytics perspective can be viewed as another (large) source system.

However, if the intended usage is for data integration into the analytics space, or for any type of analytics or integration with other business intelligence capabilities, we can rightfully call it a data lake. The key piece in the description lies in the analytical intention, and in the fact that it is a component utilised on the data integration or preparation path for analytical processing.

Contents

The initial data lake implementations were created to handle large-scale web data at organisations like Google and Yahoo. Other organisations then started using the concept for other types of big data such as click streams, server logs, social media streams, as well as machine and sensor data.

From there it has evolved into a facility used by more general organisations to simply dump all their data, both structured and unstructured, into the lake and then let whoever requires it extract whatever they need using whatever technology is best suited to the task. In this process technologists and business users alike can create whatever views they require over all this data.

So, as I mentioned above, in data warehouse “architectural” terms, the data lake has evolved to become a very large, free-for-all, all-encompassing, data staging area.

Appeal

The data lake has become popular because it provides a cost-effective and technologically feasible way to meet big data challenges – in particular to incorporate unstructured data into the BI ecosystem. Organisations are utilising the data lake as a new way to capture, process and analyse more data, and more complex data. It also provides an area where all data – structured and unstructured – can be dumped “on the fly”, without requiring any structuring, modelling or transformation. Again, in BI layman’s speak, the data lake forms a cost-efficient data staging area, where the data is merely dumped as it gets created or updated, without requiring any pre-processing.

In particular, the following benefits have been identified:

  • First, the data lake gives business users immediate access to all data. They don’t have to wait for the data warehousing team to model the data, load it into a data warehouse and only then give them access. Rather, the business users (or rather their data analysts) now shape the data however they want to, to meet their particular requirements, when they need to access it. The data lake therefore speeds up delivery and offers unparalleled flexibility, since nobody and nothing stands between the business users and the data (apart from the technical difficulty of identifying, accessing and extracting the right data).
  • Second, data in the lake is not limited to only structured relational or transactional data, as is traditionally handled by data warehouses. In addition to the conventional structured data, the data lake can contain any type of data: clickstream, machine-generated, social media, unstructured text, and any other internal or external data, even audio and video.
  • Third, with a data lake, you never need to move the data. That is very important in the current era of big data. The premise with big data (especially large volumes of complex data) is to never move it – you take the processing to the data, not the other way round. So, the data streams into the lake and stays there. You process it in place using whatever technology you want and serve it up however the users want to see it, through whatever technology may be appropriate. But the data never leaves the lake – only the derived insights do. It becomes one large pool of data, for everyone and every tool in the organisation to access.
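The “take the processing to the data” idea in the third point can be sketched roughly as follows. This is a toy illustration in plain Python, not how any specific platform implements it: the partitions, the click events and both function names are invented, and in a real lake each partition would sit on a different storage node, with the summarising function shipped to that node.

```python
from collections import Counter

# Hypothetical partitions of click events. In a real lake each partition
# would live on its own storage node; here they are just in-memory lists.
PARTITIONS = [
    [{"page": "/home"}, {"page": "/cart"}, {"page": "/home"}],
    [{"page": "/home"}, {"page": "/checkout"}],
]

def summarise_partition(events):
    """Runs next to the data and returns only a small per-page count."""
    return Counter(event["page"] for event in events)

def merge_summaries(summaries):
    """Combines the small summaries; the raw events are never collected."""
    total = Counter()
    for summary in summaries:
        total.update(summary)
    return total

page_views = merge_summaries(summarise_partition(p) for p in PARTITIONS)
```

Only the small per-partition counts are combined into page_views. The raw events stay where they are, which is the sense in which only the derived insights leave the lake.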

Concluding thoughts

So, the data lake empowers business users – it frees them from waiting for IT to get data massaged and moved into the data warehouse before it becomes accessible. That is, of course, if the business users can master the technologies required to access and analyse the data in the lake. Furthermore, it speeds up delivery and enables business units to perform a wider range of analyses much more quickly. The data lake caters for new types of data, and it is implemented using technology that lowers the costs of data storage and processing while improving performance.

In future posts I will discuss the relationship and interaction between the data lake and the data warehouse in more depth – where I consider the integration of big data into the BI ecosystem, or more correctly, the integration of BI data into the big data ecosystem. I will also cover approaches to set up, implement and use a data lake effectively.
