Seeing as I’m currently working at large for a federated organisation with significantly different and siloed business streams – managed through a plethora of systems ranging from 30-year-old mainframes to modern cloud platforms – the topic of data fabrics is very interesting to me. Even more so given that I come from a database, data governance, integration, business intelligence, and insights background.
It should hardly come as a surprise that I got quite excited when I was pointed to this article on ITProPortal discussing data fabrics. The opening sentence was like music to my ears: ‘The central notion of data fabrics has arisen due to the distributed nature of modern data architectures and the increased pressure for data-driven insights.’
In a related article, data fabric is defined as ‘architecture that includes all forms of analytical data for any type of analysis that can be accessed and shared seamlessly across the entire enterprise.’
Its purpose, and again I quote, is to ‘provide a better way to handle enterprise data, giving controlled access to data and separating it from the applications that create it. This is designed to give data owners greater control and make it easier to share data with collaborators.’
According to Gartner, a data fabric architecture must support these four principles:
- The data fabric must collect and analyse all forms of metadata.
- It must convert passive metadata to active metadata.
- It must create and curate knowledge graphs.
- It must have a robust data integration backbone that supports all types of data users.
Discussions around active metadata-driven architecture make me especially excited. A colleague and I began our consulting careers by developing a metadata-driven database configuration and replication system and deploying it at two customers.
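To make the idea of ‘active’ metadata a little more concrete, here is a minimal sketch in Python. It is not any vendor’s API, and the names are invented; it simply contrasts passive metadata (a static description of a dataset) with active metadata (the same record enriched with observed usage, which the fabric can then act on, for example by flagging stale datasets).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

@dataclass
class TableMetadata:
    """Passive metadata: a static description of a dataset (names are invented)."""
    name: str
    owner: str
    description: str
    # Active metadata: observations the fabric collects as the dataset is used.
    query_count: int = 0
    last_accessed: Optional[datetime] = None

    def record_access(self) -> None:
        """Each query enriches the record, turning passive metadata into active metadata."""
        self.query_count += 1
        self.last_accessed = datetime.now()

    def is_stale(self, max_idle_days: int = 30) -> bool:
        """An example of acting on metadata: flag datasets that nobody queries any more."""
        if self.last_accessed is None:
            return True
        return datetime.now() - self.last_accessed > timedelta(days=max_idle_days)

# Usage: the fabric records accesses and can, say, recommend archiving idle tables.
customers = TableMetadata("dw.customers", "sales_ops", "Conformed customer dimension")
customers.record_access()
print(customers.query_count, customers.is_stale())
```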
Understanding the data catalogue
Insights from the data fabric are centred on an active data catalogue. Users must be able to quickly and easily find and access the data they need to obtain business insights. The data catalogue provides a repository for all technical metadata, a business glossary, a data dictionary, and governance attributes.
However, the catalogue not only documents the data resource; it also gives users an interface to see what data is available and what analytical assets exist. Once they have access to the data and insights, they can re-use these to make decisions. Alternatively, they can create their own analytical assets using the data, or by adapting existing assets to their needs, and these can in turn be shared via the catalogue.
The catalogue is collaborative in the sense that if the data required is not catalogued, users can submit a request to get that data into the environment. This means that the technicians who build and maintain the data fabric must update the catalogue and notify users of any additions, edits, or changes made to the data fabric, its data, or its analytical assets. Data lineage and usage must be continuously monitored, which calls for modelling, cataloguing, and governance disciplines that are applied continually.
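As a rough illustration of what a single catalogue entry might carry, here is a hypothetical sketch. The attribute names are my own invention, but they cover the technical metadata, business glossary term, data dictionary definition, governance attributes, and lineage discussed above.

```python
from dataclasses import dataclass, field

@dataclass
class CatalogueEntry:
    """One dataset as described in the data catalogue (illustrative only)."""
    dataset: str                      # technical name, e.g. schema.table
    business_term: str                # business glossary term it maps to
    definition: str                   # data dictionary description
    steward: str                      # governance: who owns the definition
    classification: str               # governance: e.g. 'public', 'confidential'
    upstream_sources: list[str] = field(default_factory=list)   # lineage in
    downstream_assets: list[str] = field(default_factory=list)  # reports, models

catalogue: dict[str, CatalogueEntry] = {}

def register(entry: CatalogueEntry) -> None:
    """Technicians add or update entries; in practice this would also notify users."""
    catalogue[entry.dataset] = entry

register(CatalogueEntry(
    dataset="edw.dim_customer",
    business_term="Customer",
    definition="A party that has purchased at least one product.",
    steward="data.governance@example.com",
    classification="confidential",
    upstream_sources=["crm.accounts", "billing.customers"],
    downstream_assets=["churn_model_v2", "monthly_sales_report"],
))
```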
Building blocks
Several key components contribute to the data fabric: the enterprise data warehouse (EDW), an investigative computing platform (ICP), and a real-time analysis engine. Data integration is key, including extracting data from sources and transforming it into a single version of the truth that is loaded into the EDW. The ETL, or more likely ELT, processes must create a trusted data source in the EDW, which is then used for producing reports and analytics. We’re obviously talking about a modern EDW here, catering for both structured and unstructured data.
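As a toy illustration of the ‘T’ in an ELT pipeline, the sketch below lands raw extracts as-is and then transforms them inside the warehouse with SQL, producing a conformed, trusted table. SQLite stands in for the EDW engine purely for the sake of a runnable example, and the tables and columns are invented.

```python
import sqlite3

# sqlite3 stands in for the EDW engine; the schema is invented and simplified.
con = sqlite3.connect(":memory:")

# 'EL': land the raw extracts in the warehouse without reshaping them first.
con.execute("CREATE TABLE raw_orders (order_id TEXT, customer TEXT, amount TEXT, currency TEXT)")
con.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?, ?)",
    [("A-1", " Acme Ltd ", "100.50", "USD"),
     ("A-2", "acme ltd", "80.00", "usd")],
)

# 'T': transform inside the warehouse into a single, trusted version of the data.
con.execute("""
    CREATE TABLE orders AS
    SELECT order_id,
           UPPER(TRIM(customer)) AS customer,   -- one version of the customer name
           CAST(amount AS REAL)  AS amount,
           UPPER(currency)       AS currency
    FROM raw_orders
""")

print(con.execute("SELECT customer, SUM(amount) FROM orders GROUP BY customer").fetchall())
```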
For the ICP (or data lake), raw data is extracted from sources, reformatted, integrated, and loaded into the repository, where it is used for data exploration, data mining, analytical modelling, and other ad hoc forms of investigating data.
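To contrast that with the warehouse, here is a toy example of the kind of ad hoc exploration an ICP supports: semi-structured raw data is loaded as-is and interrogated without first forcing it into a fixed schema. The records are invented, and pandas stands in for whatever exploration tooling the platform actually offers.

```python
import pandas as pd

# Raw, semi-structured events as they might land in the data lake (invented data).
raw_events = [
    {"customer": "acme", "event": "login", "device": {"os": "ios"}},
    {"customer": "acme", "event": "purchase", "amount": 80.0},
    {"customer": "globex", "event": "login", "device": {"os": "android"}},
]

# Flatten the nested structure and poke at it; no upfront schema required.
df = pd.json_normalize(raw_events)
print(df.groupby("event").size())   # quick profiling
print(df["amount"].describe())      # ad hoc statistics on a sparse column
```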
In the past, the data warehouse and the investigative area were kept separate because they used incompatible technologies. But now that data storage has been separated from computing, the data warehouse and the ICP can be deployed on the same storage technology, leaving us with a layered enterprise data hub.
Data virtualisation is often used in the data fabric to access data because it removes the requirement to physically move data from one source to another. This is often referred to as ‘data democratisation.’
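A minimal way to picture data virtualisation: a virtual view resolves a query against the underlying sources at request time, so nothing is copied into a central store beforehand. The connectors below are hypothetical placeholders, not any real virtualisation product’s API.

```python
from typing import Callable, Iterable

# Hypothetical source connectors: each returns rows on demand from where the data lives.
def mainframe_customers() -> Iterable[dict]:
    yield {"id": 1, "name": "Acme Ltd", "source": "mainframe"}

def cloud_crm_customers() -> Iterable[dict]:
    yield {"id": 2, "name": "Globex", "source": "cloud_crm"}

class VirtualView:
    """Combines sources at query time instead of physically moving the data."""
    def __init__(self, sources: list[Callable[[], Iterable[dict]]]):
        self.sources = sources

    def query(self, predicate: Callable[[dict], bool] = lambda row: True) -> list[dict]:
        return [row for source in self.sources for row in source() if predicate(row)]

all_customers = VirtualView([mainframe_customers, cloud_crm_customers])
print(all_customers.query(lambda r: r["name"].startswith("A")))
```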
Real-time analysis is a new area of analytics focused on analysing data streaming into the company before it is stored. It is a significant addition to the range of analytical components in the data fabric. Usage patterns often determine where, and in what format, data is stored and analysed.
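To illustrate the idea of analysing data in motion, here is a small, self-contained sketch of a tumbling-window count over a stream of events, computed before anything is written to storage. The event feed is simulated; a real deployment would sit on a streaming platform rather than a Python generator.

```python
from collections import Counter
from typing import Iterable, Iterator

def simulated_stream() -> Iterator[dict]:
    """Stand-in for events arriving from a streaming platform (invented data)."""
    yield from [
        {"ts": 0, "type": "payment"}, {"ts": 1, "type": "login"},
        {"ts": 6, "type": "payment"}, {"ts": 7, "type": "payment"},
    ]

def tumbling_window_counts(events: Iterable[dict], window_seconds: int = 5) -> Iterator[tuple[int, Counter]]:
    """Analyse events as they arrive, emitting per-window counts before storage."""
    current_window, counts = 0, Counter()
    for event in events:
        window = event["ts"] // window_seconds
        if window != current_window and counts:
            yield current_window, counts   # window closed: emit its result
            counts = Counter()
        current_window = window
        counts[event["type"]] += 1
    if counts:
        yield current_window, counts

for window, counts in tumbling_window_counts(simulated_stream()):
    print(f"window {window}: {dict(counts)}")
```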
Of course, all the databases in the organisation form part of the data fabric environment. Look out for a subsequent post where I will discuss the advantages of this approach and technology.