The term ‘data lake’ has taken a firm place in the Information Management architectures and processes of many organisations. Over the last two years, I have seen more and more data managers developing data lakes to help their organisations store information that the business may need down the line.
Along the way, I have also noticed that many companies are under the impression that a data lake lets them dispense with the Information Management rigours usually associated with conventional data warehousing. Too often, organisations believe that they can merely dump all kinds of data into the data lake and that, somehow, whoever needs it will be able to find and retrieve it later.
But this is madness. In fact, this kind of approach can introduce more chaos into the organisation: the data lake can become an unmanageable dump in which no-one knows exactly what it contains or how to manage or access it, while more data continually floods in.
As with most data-related strategies, there are key concepts that should always be considered. This is especially necessary when incorporating a data lake into your Information Strategy. I would recommend reading the 2016 checklist report Building a Data Lake with Legacy Data (by TDWI, sponsored by Syncsort), as it provides some great insight into the Information Management disciplines that can be applied to a data lake.
Three highlights stand out for me, so here follows a summary for your interest.
Use master data and metadata accurately
There are various sources of data that make up the lake, but it is crucial to consider – and manage – master data and metadata when developing, implementing and maintaining a data lake in a business. This list of master data items can become quite extensive – yes, we have the standard list of lookup values, but it also includes key entities that are involved in the transactional processes, such as products, customers, suppliers, each with their relevant attributes. With the vast array of data that floods into the lake, the metadata is essential to catalogue its contents and to provide guidance to a wide variety of users on how to access it and how to interpret its contents.
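To make the catalogue idea concrete, here is a minimal sketch of a metadata register, assuming each dataset in the lake is described by a handful of attributes and tagged with the master data entities it contains. Every name in it (CatalogueEntry, Catalogue, the field names and the sample datasets) is hypothetical and chosen only for illustration; a real catalogue would normally live in a dedicated metadata tool rather than in application code.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class CatalogueEntry:
    """Metadata for one dataset in the lake (all field names are illustrative)."""
    name: str
    source_system: str
    owner: str
    description: str
    # Master data entities the dataset refers to, e.g. {"customer", "product"}.
    master_entities: set[str] = field(default_factory=set)


class Catalogue:
    """A toy in-memory catalogue: register datasets and look them up by entity."""

    def __init__(self) -> None:
        self._entries: dict[str, CatalogueEntry] = {}

    def register(self, entry: CatalogueEntry) -> None:
        # Refuse silent overwrites so two teams cannot claim the same name.
        if entry.name in self._entries:
            raise ValueError(f"dataset already catalogued: {entry.name}")
        self._entries[entry.name] = entry

    def describe(self, name: str) -> CatalogueEntry:
        return self._entries[name]

    def find_by_entity(self, entity: str) -> list[str]:
        """Return the names of all datasets that contain the given master entity."""
        return sorted(
            e.name for e in self._entries.values() if entity in e.master_entities
        )


# Hypothetical sample data, for illustration only.
catalogue = Catalogue()
catalogue.register(CatalogueEntry(
    name="crm_contacts_raw",
    source_system="CRM",
    owner="sales-ops",
    description="Nightly export of contact records",
    master_entities={"customer"},
))
catalogue.register(CatalogueEntry(
    name="web_orders_2016",
    source_system="webshop",
    owner="e-commerce",
    description="Order lines captured by the online shop",
    master_entities={"customer", "product"},
))

print(catalogue.find_by_entity("customer"))
```

The point of the sketch is that nothing enters the lake without an owner, a description and a link to the master data it touches, so a user looking for customer data can find every relevant dataset without knowing where it came from.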
Data lakes need data governance
Given the impact of regulation and governance today, a data lake must form part of the data governance strategy of any business. Not only does this ensure that the data is compliant, but it also allows the data to be accessed and processed more easily, amongst many other benefits. Data governance cannot be an afterthought, especially for those with extensive legacy data stores, which are often maintained for regulatory and compliance reasons as well as for long-term analytics use.
Data security becomes important to data lakes
As data is placed in a data lake, data security must be considered and investigated thoroughly by the business that owns the lake. Take semantic data as an example: it is more than just individual fields. Rather, it describes two data items as well as the relationship between them, and it can be added when you integrate data into a data lake. However, who may access what across such relationships needs to be carefully assessed and controlled.
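One way to picture access control over relationships is the sketch below, which assumes semantic data is held as simple subject, relationship, object triples and that each role is allowed to see only certain relationship types. The Triple type, the role names, the relationship names and the policy table are all hypothetical; they are not taken from the TDWI report or from any particular product.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """Two data items plus the relationship that links them."""
    subject: str
    predicate: str
    obj: str


# Hypothetical policy: the relationship types each role is allowed to see.
ALLOWED_PREDICATES: dict[str, set[str]] = {
    "analyst": {"purchased"},
    "compliance": {"purchased", "has_credit_rating"},
}


def visible_triples(role: str, triples: list[Triple]) -> list[Triple]:
    """Return only the triples whose relationship the role may see.

    Unknown roles get an empty set, so access is denied by default.
    """
    allowed = ALLOWED_PREDICATES.get(role, set())
    return [t for t in triples if t.predicate in allowed]


# Hypothetical sample data, for illustration only.
lake_triples = [
    Triple("customer:1001", "purchased", "product:A7"),
    Triple("customer:1001", "has_credit_rating", "rating:C"),
]

for role in ("analyst", "compliance", "intern"):
    print(role, visible_triples(role, lake_triples))
```

Note that the check is made on the relationship, not on the individual fields: an analyst may be entitled to see both the customer and the rating in isolation, yet still not be entitled to see which rating belongs to which customer.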
Concluding remarks
Of course, a data lake can reduce the time-consuming and labour-intensive processes associated with gathering and storing large pools of information. However, it does not reduce the need for Information Management disciplines in the broader sense. In fact, a business needs more, and stricter, Information Management disciplines and processes to reap the business benefits promised by a data lake.