Data wrangling is often used to describe what data engineers do. However, a universally accepted definition of the concept has proven difficult to find. It was therefore fortuitous that I came across this piece on Simplilearn titled “What Is Data Wrangling? Overview, Importance, Benefits, and Future.” It makes some interesting and insightful points, so naturally I wanted to share my views.
The article describes data wrangling as ‘a crucial process in the data analytics workflow that involves cleansing, structuring, and enriching raw data to transform it into a more suitable format for analysis.’
With that in mind, there are four key aspects to consider when it comes to data wrangling. First is data cleansing, which entails removing or correcting inaccuracies, inconsistencies, and duplicates. Next comes data structuring, where data is transformed into a tabular format for easier use in analytical applications. Enrichment follows, adding new information to enhance the data’s value for analysis. Finally, validation confirms the accuracy and quality of the data.
Essentially, data wrangling makes raw data more accessible and meaningful. Through this, analysts and data scientists can derive valuable insights more efficiently and accurately.
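To make these four aspects concrete, here is a minimal sketch in Python using pandas. The column names, the reference dataset and the specific rules are hypothetical assumptions for illustration, not a prescribed implementation.

```python
# Minimal sketch of the four aspects using pandas; the columns, the
# reference data and the rules are hypothetical, purely to illustrate the flow.
import pandas as pd

# Raw data with typical problems: a duplicate, inconsistent casing, an invalid value.
raw = pd.DataFrame({
    "customer_id": [101, 101, 102, 103],
    "country":     ["za", "ZA", "KE", None],
    "amount":      [150.0, 150.0, -20.0, 300.0],
})

# 1. Cleansing: fix inconsistent casing, remove duplicates and invalid amounts.
clean = (
    raw.assign(country=lambda df: df["country"].str.upper())
       .drop_duplicates()
       .query("amount > 0")
)

# 2. Structuring: the data is already tabular here, so simply enforce explicit types.
clean = clean.astype({"customer_id": "int64", "amount": "float64"})

# 3. Enriching: join a (hypothetical) reference dataset to add context.
regions = pd.DataFrame({"country": ["ZA", "KE"],
                        "region": ["Southern Africa", "East Africa"]})
enriched = clean.merge(regions, on="country", how="left")

# 4. Validation: simple assertions on accuracy and quality.
assert enriched["customer_id"].notna().all()
assert (enriched["amount"] > 0).all()
```

Whatever the tooling, the point is the same: each aspect is a distinct, repeatable step rather than ad hoc fixes scattered through the analysis.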
The relevance of data wrangling
These aspects, and data wrangling itself, may seem like nothing new. After all, companies have been doing these things in ETL and other data preparation pipelines for years. However, data wrangling is gaining momentum for several reasons.
There is the volume and variety of data to consider. We are processing far more data than ever from the Internet, social media, IoT devices, and many other sources. Furthermore, this data comes in a wider variety of formats and structures than traditional ETL methods were designed to handle.
Additionally, advanced analytics and artificial intelligence (AI) have become more mainstream. Unlike in the past, when these were typically standalone projects, today they require real-time access to high-quality data, and modern models draw on a broader range of integrated data sources.
Data wrangling also facilitates faster decision-making. Organisations rely on the most up-to-date data to determine how to adapt their strategies and stay competitive and relevant. We also cannot ignore the impact that compliance and data governance have on data. Organisations face far more regulatory and privacy requirements than were in place even five years ago. Data wrangling can help to ensure compliance by cleansing and structuring data according to these regulations.
Of course, training analytics and AI models on poor-quality data will still lead to poor-quality results. Data wrangling can help improve the quality and accuracy of data and, as such, the reliability of the insights gained from it.
The data wrangling process
The article describes a seven-step process of data wrangling in great detail. While you can read about it in the link provided, below is a summarised version that builds on the points raised earlier in this article (a code sketch of the full flow follows the list):
1. Collection: The first step is collecting raw data from various sources, such as databases, files, external APIs, social media and many other data streams. The data can be structured (e.g., tabular), semi-structured (e.g., JSON or XML files), or unstructured (e.g., text documents, images).
2. Cleansing: This is performed to remove or correct errors, irrelevant data and duplicates, and resolve inconsistencies that can affect analytical outcomes.
3. Structuring: Data needs to be restructured into a more analysis-friendly format. This often entails converting unstructured or semi-structured data into a structured form, like a table.
4. Enriching: We often need to add context or additional information, or merge the data with another dataset, to make it more valuable for analysis.
5. Validation: Ensuring the data’s accuracy and quality after it has been cleansed, structured, and enriched.
6. Storing: The wrangled data is then stored in a database or a data warehouse to make it accessible for analysis and reporting. Proper storage not only secures the data but also organises it in a way that is efficient for querying and analysis.
7. Documentation: This is necessary throughout the data wrangling process, to record what transformations were done as well as the lineage of the data.
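Pulling the seven steps together, below is a minimal end-to-end sketch in Python using pandas and SQLite. The “API response”, the field names, the quality rules and the lineage format are all hypothetical assumptions; a real pipeline would collect from live databases, files or external APIs.

```python
# A minimal end-to-end sketch of the seven steps using pandas and SQLite.
import json
import sqlite3
import pandas as pd

lineage = []  # 7. Documentation: record every transformation as we go.

# 1. Collection: semi-structured (JSON) records standing in for an API response.
raw_json = """[
  {"id": 1, "name": "Thabo ", "signup": "2024-01-15", "spend": "1200"},
  {"id": 2, "name": "Aisha",  "signup": "2024-02-30", "spend": "900"},
  {"id": 2, "name": "Aisha",  "signup": "2024-02-30", "spend": "900"},
  {"id": 3, "name": null,     "signup": "2024-03-05", "spend": "-50"}
]"""
records = json.loads(raw_json)
lineage.append("collected 4 records from the (hypothetical) customer API")

# 2. Cleansing: remove duplicates, invalid dates and non-positive spend.
df = pd.DataFrame(records).drop_duplicates()
df["signup"] = pd.to_datetime(df["signup"], errors="coerce")  # 2024-02-30 -> NaT
df = df[df["signup"].notna() & (pd.to_numeric(df["spend"]) > 0)]
lineage.append("dropped duplicates, invalid dates and non-positive spend")

# 3. Structuring: enforce a tidy, explicitly typed tabular layout.
df = df.assign(name=df["name"].str.strip(), spend=df["spend"].astype(float))
lineage.append("trimmed names and cast spend to float")

# 4. Enriching: add a derived attribute (purely illustrative).
df["segment"] = df["spend"].map(lambda s: "premium" if s > 1000 else "standard")
lineage.append("added spend-based segment")

# 5. Validation: fail fast if the quality rules are broken.
assert df["id"].is_unique and df["spend"].gt(0).all()
lineage.append("validated uniqueness and positive spend")

# 6. Storing: persist the wrangled data, and its lineage, for analysis and reporting.
with sqlite3.connect("wrangled.db") as conn:
    df.to_sql("customers", conn, if_exists="replace", index=False)
    pd.DataFrame({"step": lineage}).to_sql("lineage", conn, if_exists="replace", index=False)
```

Note that the documentation step is not an afterthought: the lineage record is appended to at every stage and stored alongside the wrangled data.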
Some considerations
While I agree that data wrangling is a good data preparation process for external and unstructured data, I do question using this approach for structured data extracted from systems within the organisation, and I question its use for internal unstructured data as well.
Most of my criticism centres on the cleansing step. If we clean the data only in the data wrangling process, for example between the data source and the analytics and/or AI layer, the data in the source system will remain incorrect or inconsistent.
There are two problems with this. Firstly, if the business compares the reporting and analytics results with the original data, they will notice differences between the datasets. Secondly, if that same data is extracted and processed through any similar process, the same cleansing rules must be applied; otherwise, there will be inconsistencies between the various resulting datasets.
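One way to limit the second problem is to define the cleansing rules in exactly one place and reuse them in every pipeline. A minimal sketch, assuming pandas and hypothetical rules, is shown below; note that this only mitigates the inconsistency between downstream datasets and does nothing for the data sitting in the source system.

```python
# Sketch of sharing one set of cleansing rules across pipelines (hypothetical rules).
import pandas as pd

def apply_cleansing_rules(orders: pd.DataFrame) -> pd.DataFrame:
    """Single, shared definition of the cleansing rules."""
    return (
        orders.drop_duplicates(subset="order_id")
              .assign(status=lambda d: d["status"].str.strip().str.lower())
              .query("status != 'cancelled'")
    )

raw_orders = pd.DataFrame({
    "order_id": [1, 1, 2, 3],
    "status":   ["Shipped", "Shipped", " CANCELLED", "shipped"],
})

# Both downstream processes reuse the same rules, so their outputs stay consistent.
reporting_view = apply_cleansing_rules(raw_orders)
model_features = apply_cleansing_rules(raw_orders)
assert reporting_view.equals(model_features)
```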
I believe that for all internal datasets, be it structured or unstructured, it would be better to rather invest in improved data governance and data quality assurance processes as opposed to data cleansing. This will ensure that the data gets fixed at the source. It will also enable the organisation to put capabilities in place, like built-in data validation rules, to ensure that those errors or inconsistencies do not get repeated or re-introduced for any new data entering the internal systems. However, for external or generated data, the data wrangling process makes particularly good sense.
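To close, here is a minimal sketch of the kind of built-in validation rule mentioned above, on a hypothetical internal table (SQLite is used purely for illustration): invalid records are rejected on entry, so there is nothing left for a downstream cleansing step to correct.

```python
# Sketch of source-side validation rules: constraints on a hypothetical table
# reject bad records at entry instead of relying on downstream cleansing.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,
        country     TEXT NOT NULL CHECK (length(country) = 2),
        amount      REAL NOT NULL CHECK (amount > 0)
    )
""")

# Valid data is accepted...
conn.execute("INSERT INTO customer VALUES (101, 'ZA', 150.0)")

# ...while invalid data is rejected at the source, before it can reach
# analytics, AI pipelines or reports.
try:
    conn.execute("INSERT INTO customer VALUES (102, 'KE', -20.0)")
except sqlite3.IntegrityError as err:
    print("rejected at source:", err)

conn.close()
```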