Do you know where your food comes from? Did the farmer use pesticides? Did the transport company spray preservative chemicals over your food? Did they keep it appropriately refrigerated? Would you eat food from sources you don’t trust? The same applies to data. Do you know what the lifecycle of your data entails? Was it manually entered? What validations were applied? How many transactional systems did it pass through, and was it transformed along the way? Would you make decisions based on data you don’t trust? This is where data lineage comes in.
A good entry point to this concept can be found in this Geeky Gadgets article. In it, the author describes data lineage as ‘the process of tracing and documenting the life-cycle of data, from its origin, through its transformation and usage, to its eventual storage. It provides a historical record of data, outlining its relationships and dependencies, thereby ensuring transparency and trust in the data. The importance of data lineage lies in its ability to provide visibility into the analytics pipeline. This allows organisations to understand how data is utilised and transformed across various business processes, enhancing the understanding of data flow.’
Far too often, people have an oversimplified view that data enters the organisation through some well-controlled interface and gets stored in a properly designed system with good validations and integrity rules enforced. From there, it is used in various business processes and to guide decision-making. However, the reality is that some data life cycles are complex.
For example, in the training and education space, data is often entered manually by an applicant, without being properly validated. It then passes through a plethora of systems, such as CRM, learning systems, training management systems, a research platform, and even a financial system if payments or invoicing are involved. At each step, the data gets manipulated in some manner. Some of the systems try to apply validation rules, but by that stage, it is far too late. This results in the various organisational users seeing different versions of the truth, depending on where they get their information from. Add to this a reporting team, as well as a few rogue citizen reporters who apply their own transformations and extractions to the data. Pretty messy, right?
Getting it done right
Data lineage is therefore an integral component of data management and proper data governance, helping to ensure the accuracy, consistency, appropriateness, and other aspects of data quality. Think of data lineage as the discipline of analysing and documenting the data’s origins, movements, storage, use, and any changes made over time. It also covers the data’s characteristics, especially those related to quality.
In the example shown above, it is already easy to grasp that all these components can quickly turn into a spider’s web of interdependencies! In such an environment, you must be disciplined and dedicated to analysing and documenting data lineage properly.
Data lineage is therefore crucial for tracking data and analysing its integrity. It allows data workers to trace the data back to its origin, which enables them to gauge its accuracy. Done thoroughly, data lineage can point to errors in the data and to the places where inconsistencies can be introduced.
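To make the idea of tracing data back to its origin concrete, here is a minimal sketch in Python. It assumes lineage has been documented as a simple map from each dataset to the datasets it is derived from. The dataset names and the `trace_origins` helper are hypothetical, loosely based on the training and education example, and do not represent any particular lineage tool.

```python
from collections import deque

# Hypothetical lineage map: each dataset lists the datasets it is derived from.
# The names are illustrative assumptions, not real systems.
UPSTREAM = {
    "enrolment_report": ["training_mgmt.enrolments", "finance.invoices"],
    "training_mgmt.enrolments": ["crm.contacts"],
    "finance.invoices": ["crm.contacts"],
    "crm.contacts": ["application_form"],
    "application_form": [],  # manually entered by the applicant
}


def trace_origins(dataset, upstream=UPSTREAM):
    """Return the datasets with no upstream source that feed `dataset`."""
    origins, seen = set(), set()
    queue = deque([dataset])
    while queue:
        current = queue.popleft()
        if current in seen:
            continue  # already visited via another path
        seen.add(current)
        parents = upstream.get(current, [])
        if not parents:
            origins.add(current)  # nothing upstream: this is an origin
        queue.extend(parents)
    return origins


print(trace_origins("enrolment_report"))
```

In this sketch, a report that appears to draw on two separate systems turns out to depend on a single, manually entered source, which is exactly the kind of insight that tracing upstream is meant to surface.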
Data governance matters
Although data lineage is often done as part of the broader data architecture function, it has a special relationship with data governance. This is because data lineage shows where data originates, which is the basis for assigning data management responsibilities and data ownership. In data governance, reporting, and analytics, we often speak about the source of truth, referring to some system of record that we trust. However, data lineage provides evidence of the real source of that data.
Visibility of the whole data pipeline is crucial for setting, improving, and maintaining quality and standards. Often, the ‘source of truth’ for reporting is far removed from the actual source of the data. Trying to impose data quality through a system further down the pipeline is equivalent to trying to filter river water just before it runs into the sea, without considering all the farm and industrial runoff and other pollutants liberally dumped into the river upstream.
In effect, data lineage is an important pillar of effective data governance, as it provides a framework for understanding where the data comes from, how it flows, what its dependencies are, and which transformations are applied to it. It also supports various aspects of data management, including data quality management and data privacy management.
Critically, data lineage is another type of metadata that must be maintained in conjunction with the other data dictionaries and catalogues maintained in the organisation. Documenting data lineage separately from those metadata sources would create yet another silo and a potential source of data inconsistency in the organisation.
Business impact
By providing a clear and accurate view of the organisation’s data, where it comes from, and where it is used, data lineage enables executives to make more informed decisions based on reliable data. It also assists in identifying trends, patterns, and insights, which can inform business growth. But one of the more important aspects is that it assists in risk identification and subsequent management. Data flowing through a well-managed data pipeline or inter-system integration is much more secure than an Excel file extracted from one system and manually uploaded into another. Properly documented data lineage will quickly point to those risky pathways.
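As a rough illustration of how documented lineage can point to risky pathways, the sketch below records each flow between systems together with the method used to move the data, then filters for manual hand-offs. The `Flow` record, the system names, and the method labels are all hypothetical assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Flow:
    source: str
    target: str
    method: str  # how the data moves, e.g. "api", "etl", "manual_upload"


# Hypothetical documented flows; names and method labels are assumptions.
FLOWS = [
    Flow("crm", "training_mgmt", "api"),
    Flow("training_mgmt", "finance", "manual_upload"),
    Flow("training_mgmt", "reporting", "etl"),
    Flow("finance", "reporting", "excel_export"),
]

# Transfer methods treated as manual hand-offs in this sketch.
MANUAL_METHODS = {"manual_upload", "excel_export"}


def risky_flows(flows, manual_methods=MANUAL_METHODS):
    """Return the flows whose transfer method is a manual hand-off."""
    return [f for f in flows if f.method in manual_methods]


for flow in risky_flows(FLOWS):
    print(f"{flow.source} -> {flow.target} via {flow.method}")
```

Which methods count as risky is a judgment each organisation would make for itself. The point is only that once lineage is recorded in a structured form, such pathways can be listed rather than discovered by accident.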
As data continues to grow, the importance and impact of data lineage analysis will no doubt increase. As part of a solid and informed data strategy, organisations should be paying attention to and investing in data lineage tools and best practices.