Finding balance between data science and data engineering

Previously, I wrote about the two-tiered data lakehouse with an analytical sandbox and a curated data warehouse in the second layer – the one for data science work and the other for productised BI. I was therefore intrigued when I came across a Forbes article titled ‘Three Keys to a Harmonious Relationship between Data Science and Data Engineering’ that shows exactly the kind of balance we are trying to get right in my current engagement.

The author of this article raises the importance of distinguishing between data ingestion and data curation. That is exactly where the two environments, with their different styles of data management, come into play. In the analytical sandbox, you merely put the data there in a useable form. Of course, you must pay attention to correctness, accuracy, and privacy. But it is not a process that has to be productionised. In fact, in most cases it hardly needs to be repeated. The data scientists just need the data there, preferably as soon as possible, in a useable form to get cracking with the analysis they need to do.

Data curation, which happens in the data warehouse part, is a totally different ballgame. That is where you need proper data profiling and data validation, as well as process design, implementation, testing, and deployment. The processes must also be repeatable. In most cases, the data warehouse needs to be refreshed either periodically or continually. It therefore requires a lot more rigour than what would happen with data ingestion.

Not only will the outputs from these processes be delivered to the business, but that delivery will usually happen on a repeatable schedule. Additionally, this is the environment from which so-called citizen reporters will get their data for self-service reporting. This area, as well as the processes that feed it, therefore requires much stricter governance and control. Furthermore, there must be proper documentation to enable people to use the data correctly and with confidence.

Wrangling data

This plays directly into the second point the article raises – namely data wrangling versus data engineering.

For the sandbox environment, we just need to get the data there. Often, this process is developed in an agile, even experimental, fashion. Data engineering, which we use in the productionised data warehouse environment, is a much more rigorous and managed approach to creating repeatable and verifiable data pipelines. These can be migrated from development to testing to production environments.

In the first case, errors and outliers are often excluded on-the-fly. They can even be excluded or managed during the analysis. In the second case, the process needs to cater properly for errors and outliers, for example by creating exception reports, exception ‘buckets’ for erroneous data, and repeatable runs to re-process those problematic datasets. As the author rightly mentions, the two processes have very distinct purposes, even to the extent that this may affect which tools are used for each process.
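To make the idea of exception reports, exception buckets, and re-processing runs a little more concrete, here is a minimal Python sketch. It is only an illustration under my own assumptions: the `amount` field, the validation rules, and all function names are hypothetical and do not come from the article or from any particular tool.

```python
from dataclasses import dataclass, field


@dataclass
class RunResult:
    """Outcome of one pipeline run: clean rows plus an exception bucket."""
    loaded: list = field(default_factory=list)
    exceptions: list = field(default_factory=list)  # (row, reason) pairs


def validate(row):
    """Return a rejection reason, or None if the row is acceptable.

    These rules are illustrative assumptions, not taken from the article.
    """
    amount = row.get("amount")
    if amount is None:
        return "missing amount"
    if not isinstance(amount, (int, float)):
        return "amount is not numeric"
    if amount < 0 or amount > 1_000_000:
        return "amount outside expected range"
    return None


def run_pipeline(rows):
    """Load valid rows and route invalid ones to the exception bucket."""
    result = RunResult()
    for row in rows:
        reason = validate(row)
        if reason is None:
            result.loaded.append(row)
        else:
            result.exceptions.append((row, reason))
    return result


def exception_report(result):
    """Count the rejected rows per rejection reason."""
    counts = {}
    for _, reason in result.exceptions:
        counts[reason] = counts.get(reason, 0) + 1
    return counts


def reprocess(result, fix):
    """Apply a correction to the bucketed rows and run them through again."""
    corrected = [fix(row) for row, _ in result.exceptions]
    return run_pipeline(corrected)
```

A first run over a batch returns the loaded rows together with the bucket of rejected rows, `exception_report` summarises why rows were rejected, and `reprocess` pushes the corrected rows through exactly the same validation. In the sandbox, by contrast, the analyst would typically just filter such rows out while exploring.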

Achieving value

The third point, ‘AI modelling versus production scoring’, addresses the process of migrating analytical modelling processes and outcomes from the analytical sandbox to the productionised data warehouse environment.

In short, we may find that an analytical model developed in the sandbox environment has ongoing value and can be run regularly or continuously through, for example, streaming. When that happens, it must be re-implemented in the productionised data warehouse environment. In turn, this must go through a more rigorous process to be validated, tested, and deployed in a production environment. The model and its outcomes, developed in the sandbox, now serve as a template or a specification of what must be productionised.
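One way to treat the sandbox model as a specification is a parity check: the production re-implementation must reproduce the sandbox scores on a set of reference cases. The Python sketch below is only an assumption of how that could look. Both scoring functions, the feature names, and the coefficients are hypothetical stand-ins, not a real model.

```python
import math


def sandbox_score(features):
    """Stand-in for the data scientist's exploratory model (hypothetical)."""
    z = 0.8 * features["tenure_years"] - 1.5 * features["late_payments"]
    return 1.0 / (1.0 + math.exp(-z))


def production_score(features):
    """Stand-in for the re-implemented production scorer (hypothetical)."""
    z = 0.8 * features["tenure_years"] - 1.5 * features["late_payments"]
    return 1.0 / (1.0 + math.exp(-z))


def parity_failures(cases, reference, candidate, tolerance=1e-9):
    """Return the cases where the candidate drifts from the reference.

    Each failure is a (case, expected, actual) tuple; an empty list means
    the candidate matches the reference within the absolute tolerance.
    """
    failures = []
    for case in cases:
        expected = reference(case)
        actual = candidate(case)
        if not math.isclose(expected, actual, rel_tol=0.0, abs_tol=tolerance):
            failures.append((case, expected, actual))
    return failures
```

Calling `parity_failures(cases, sandbox_score, production_score)` with a list of reference feature dictionaries returns an empty list when the two implementations agree. A check like this could form part of the validation and testing step before the production version is deployed.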

Perfection

I really like how the distinction between data science and data engineering is catered for by the two environments in the data lakehouse.

Here is a good analogy. The sandbox environment and data wrangling by the data scientists are like a band playing live and improvising to respond to their audience. This is a once-off, sometimes rough and spontaneous, moment of brilliance.

The productionised environment and data curation by the data engineers are like taking that same song into the recording studio. The band then uses different instruments and recording mechanisms, possibly adding studio musicians, re-recording some pieces, and so on. The goal is a perfect song that will be put on a record, CD, or streaming media. It is therefore a managed process of precision and quality to repeatedly represent key aspects of the brilliance.

Organisations need to understand all these nuances to overcome potential data obstacles and to deliver the value they can unlock through effective analysis and by leveraging the differences between data science and data engineering.
