Last month I addressed how data quality is perceived by different specialists inside the organisation. This month, I turn the spotlight onto what makes modern data quality management different from traditional approaches. Edwin Walker’s Data Science Central article ‘Difference between modern and traditional data quality’ provides an excellent starting point.
Walker describes this difference as follows: “Modern data quality practices make use of new technology, automation, and machine learning to handle a variety of data sources, ensure real-time processing, and stimulate stakeholder collaboration. Data governance, continuous monitoring, and proactive management are prioritised to ensure accurate, reliable, and fit-for-purpose data for informed decision-making and corporate success.”
I liked the tie-in with data governance, given how closely this aligns with DAMA's Data Management Body of Knowledge (DMBoK). In my experience, data quality is hard to manage and improve without a proper data governance framework, management support, and appropriate processes.
The first four differences that Walker highlights are standard in today's business environment. We are all aware of these influences:
- Data sources and types: We have moved on from only having to deal with structured data in neatly modelled databases, to unstructured data, external data, social media data, IoT data, and more. The variety of data, and the sourcing of data from outside the organisation's control, have significantly contributed to the complexity of data quality management.
- Scale and volume: Big data, and even terabytes of structured data, bring with them more processing challenges, and it is difficult to quality assure data at such scale. Modern data processing, which includes data quality management, must leverage technologies like distributed processing and cloud computing to efficiently manage and improve the quality of such large datasets.
- Real-time and near-real-time processing: In older batch-oriented systems, it was easier to schedule and integrate periodic data quality monitoring and cleansing processes. With real-time or near-real-time processing, however, data quality issues must be detected and addressed as they occur, because the data is often used immediately, or stored and integrated with other data (see the sketch after this list).
- Data governance and data stewardship: We are all aware of how important data governance and data stewardship are to improving data quality management. Data governance frameworks include policies, procedures, and responsibilities for managing data quality throughout the organisation. Data stewards are assigned to ensure adherence to these policies and to drive data quality initiatives. The key difference is that in more modern organisations, data governance and data stewardship are accepted as part of the company's fabric and way of working.
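To make the real-time point a little more concrete, here is a minimal Python sketch of in-flight validation: each record is checked against simple rules the moment it arrives, rather than waiting for a periodic batch job. The field names, rules, and print-based alerting are my own illustrative assumptions, not something taken from Walker's article.

```python
# Minimal sketch of in-flight data quality validation (illustrative only).
# Field names and rules are hypothetical; a real pipeline would route
# failures to a quarantine store or an alerting tool rather than print them.
from datetime import datetime

RULES = {
    "customer_id": lambda v: isinstance(v, str) and v.strip() != "",
    "email": lambda v: isinstance(v, str) and "@" in v,
    "created_at": lambda v: isinstance(v, datetime),
}

def validate_record(record: dict) -> list:
    """Return the data quality issues found in a single incoming record."""
    issues = []
    for field, rule in RULES.items():
        if field not in record:
            issues.append(f"missing field: {field}")
        elif not rule(record[field]):
            issues.append(f"invalid value for: {field}")
    return issues

def handle_incoming(record: dict) -> None:
    """Flag records that fail validation; pass clean ones downstream."""
    issues = validate_record(record)
    if issues:
        print(f"Data quality alert for {record.get('customer_id')}: {issues}")
    else:
        print("Record accepted for downstream processing")

# A deliberately broken record triggers an immediate alert
handle_incoming({"customer_id": "C-1001", "email": "no-at-sign", "created_at": datetime.now()})
```

The rules themselves are beside the point; what matters is that validation happens at the moment the data enters the pipeline, which is exactly what separates this from the older batch-and-cleanse approach.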
Things start getting really interesting in Walker's article when we move beyond these four introductory points to another three very relevant ones. Here they are, with my take:
- Automation and machine learning: Using automation and machine learning techniques to enhance data quality processes is a relatively new development. Automation makes it possible to schedule the execution of repetitive tasks such as data cleansing, validation, and standardisation. Machine learning algorithms can identify patterns and anomalies in data, enabling automated detection of data quality issues (one of the sketches after this list illustrates this). Predictive modelling can be used to indicate where data quality standards are going to go off track, enabling companies to intervene more proactively.
- Collaboration and cross-functional involvement: In traditional organisations, data quality handling was primarily an IT function. At best, it often fell on the shoulders of the data warehouse and business intelligence teams, especially when it came to 'dealing with' poor-quality data. Modern data quality practices, however, involve collaboration among various stakeholders, with an enterprise-wide acceptance that this is 'our' problem. A DQ Labs blog post notes that 'both producers and consumers of data are shifting away from the traditional ideologies around centralised data ownership towards new principles around decentralised data ownership.' It is therefore vital that business users, data analysts, data scientists, and subject matter experts all get involved. Through this collaboration, the company can ensure that data quality requirements are aligned with business needs and that data quality efforts address the specific goals of the various business functions. I see more companies using the term 'data owner', which gives business users the ownership and mandate to influence data quality from the moment data enters the organisation's systems.
- Data quality as a continuous process: Modern data quality practices rely on continuous and embedded data quality management. Data quality is no longer a one-off activity that gets pushed to the side. The organisations that have the most success continuously monitor, measure, and improve data quality through feedback loops back to the business, as well as to the system and data owners. This is where the data quality dashboard has become more widely accepted and actively used to ensure sustained data quality over time (a sketch after this list shows the kind of metrics such a dashboard might track).
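On the automation and machine learning point, the sketch below shows one common unsupervised approach: an isolation forest flags records whose numeric profile deviates from the rest, so they can be reviewed as potential data quality issues. The dataset, column names, and contamination setting are illustrative assumptions on my part and do not come from Walker's article.

```python
# Illustrative anomaly detection for data quality using scikit-learn.
# The data is synthetic; flagged rows would normally feed a data steward's
# work queue or a quality dashboard, not a print statement.
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
orders = pd.DataFrame({
    "order_value": np.concatenate([rng.normal(100, 15, 995), [9999, -50, 0.01, 8500, -120]]),
    "items": np.concatenate([rng.integers(1, 10, 995), [500, 0, 1, 350, 0]]),
})

# Fit the model and mark the records it considers anomalous
model = IsolationForest(contamination=0.01, random_state=42)
orders["anomaly"] = model.fit_predict(orders[["order_value", "items"]]) == -1

print(orders[orders["anomaly"]])
```

The same thinking extends to prediction: once you track what the model flags over time, you can start warning stewards before a quality metric drifts out of tolerance.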
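To illustrate the 'continuous process' point, this second sketch computes a couple of simple data quality metrics, completeness and validity, per load date. A time series like this is the kind of feedback loop that typically drives a data quality dashboard. Again, the columns and rules are hypothetical.

```python
# Illustrative data quality metrics for a dashboard (hypothetical columns).
import pandas as pd

loads = pd.DataFrame({
    "load_date": ["2024-05-01"] * 3 + ["2024-05-02"] * 3,
    "email": ["a@x.com", None, "b@x.com", "c@x.com", "broken", None],
    "order_value": [120.0, 85.5, -10.0, 99.0, 250.0, 60.0],
})

# One row of metrics per load date; trending these over time is the feedback loop
dashboard = loads.groupby("load_date").agg(
    email_completeness=("email", lambda s: s.notna().mean()),
    email_validity=("email", lambda s: s.str.contains("@", na=False).mean()),
    order_value_validity=("order_value", lambda s: (s > 0).mean()),
)
print(dashboard)
```

Tracking these figures with every load, and routing the dips back to the relevant system and data owners, is what turns data quality from a clean-up project into a continuous process.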
With the above in mind, Walker summarises the article aptly, and I will close with his words for you to ponder: 'Overall, modern data quality practices adapt to the changing data landscape, incorporating new data types, handling larger volumes of data, and leveraging automation and advanced analytics. They prioritise real-time processing, collaboration, and continuous improvement to ensure high-quality data that supports informed decision-making and business success.'