Data wrangling is often used to describe what data engineers do. However, a universally accepted definition of the concept has proven difficult to find. It was therefore fortuitous that I came across this piece on Simplilearn titled “What Is Data Wrangling? Overview, Importance, Benefits, and Future.” It makes some interesting and insightful points, and so naturally, I wanted to share my views.
CDAO Melbourne report back: Unlocking the value of data
My colleague and I were fortunate enough to be invited to attend the recent Corinium CDAO Melbourne 2024 conference held from 2-4 September 2024. Corinium has an interesting approach to this annual event by alternating between keynote and breakout sessions, guided panel discussions, and interactive fireside chats. Such a structure makes for an insightful and engaging event that I always find great value in.
While last year’s conference focused on GenAI, this edition had a good balance between data management, data governance, reporting, analytics, and AI. Having soaked up so much valuable information, I wanted to share a brief, high-level report back on some of the key aspects I took away from the presentations and discussions.
The power of data
A common theme of many presentations and discussions was the power (or value) of data and insights. This positioning around value was refreshing to me. As professionals, we are often asked to justify proposed projects, or the size of the resource pool, in terms of the value they provide to the business. Some of the examples covered were:
- The time saved by technical teams when the business can do self-service reporting.
- Determining the lifetime value of customers. This was particularly interesting as some organisations still have ‘lifetime’ customers. Other organisations must focus on the value they can squeeze out of a customer in a typical five-year cycle.
- Additional value is derived when insights can be refined using external data. I think we undervalue the impact that the weather and the seasons have on buying and other related business patterns. Unfortunately, very few businesses ingest weather data to combine with their data analysis.
- Cross-selling is another opportunity to get value out of data. There was a good presentation on how analytics is used for cross-selling in the agricultural space. While this might seem like an unlikely topic at a CDAO conference, the discussion highlighted the value the sector derives from using data for cross-selling.
A good takeaway was that when it comes to data value pitches, it helps to be as specific as possible. Vague statements of potential value gain hardly excite any decision-maker!
Data governance
Another theme that received a lot of attention was data governance. If companies are going to use GenAI to generate insights, they must make sure that the data on which those insights are based is accurate – and trustworthy.
Data classification, dictionaries, catalogues, and metadata all came up in the presentations. These topics have been neglected in recent years. One presentation looked at a very interesting classification scheme that associates the value of the data with the classification of its attributes. Another presenter talked about the value of particular data domains and their alignment with strategy. While similar to the first scheme, this classification happens at a more aggregate level.
Data quality was a hot topic, not just for GenAI input but also for the quality of insights provided through reporting. Other aspects of data governance that were mentioned were data ownership, data stewardship, and risk management.
Additionally, there were full papers presented on managing unstructured data. These covered the likes of discovery, cataloguing, classification, data lineage, preventing exposure, privacy, security, and making the data useful and reportable.
Another interesting take on data governance is that it has a branding problem. Data governance is often seen as a burden to the organisation that slows down production and the delivery of insights. One speaker explored how to turn data governance from a risk-controlling police officer into a value-generating business person. The most important point in this regard was that data governance should be aligned with the business strategy. It should help reporting and BI teams provide better value, instead of being a blocker.
The conference also had a deep-dive session into data literacy. The focus was more on the people aspect than on the technology used. Understanding data, contextualising it to the role in which it is used, and knowing how to extract value from it were all fascinating points. The view was that aspects of data literacy should be built into every project and deliverable – and that data literacy is not a stand-alone exercise that forces people to sit through boring presentations or online training courses.
Overall, this year’s Corinium CDAO Melbourne 2024 conference was a fascinating one that highlighted just how far we have come in recognising the importance of data, and which aspects are critical to focus on as businesses and industries become more data driven.
Breaking down data silos should be a strategic mandate
Last month, I explored the challenges created by data silos and how pragmatic approaches like a data lakehouse architecture and data steering teams can assist in breaking down the silos. In this month’s post, my attention turns to the more strategic drivers that can help achieve better enterprise-wide data sharing and overcome the limits imposed by data compartmentalisation.
Data silos and how to break them down
Data silos are pervasive in organisations around the world. At many of the companies where I have worked, these data silos have made it difficult to ensure system integration, data governance, and effective reporting. Prompted by an insightful industry piece I recently came across, I decided to use my blog article this month to discuss some of the approaches that can be used to break down these walls.
As mentioned, a great industry article to read about this topic is Bob Violino’s “Breaking down data silos for digital success” published on CIO. In it, he uses two key phrases – unifying data strategies and knocking down data (and political) walls.
This highlights that breaking down data silos at an organisation is a strategic imperative. Trying to do so by stealth, or through a purely bottom-up approach, will not work. Furthermore, data silos are often the result of the organisation’s political landscape and business make-up. While some of these are a legacy of older organisational cultures, modern businesses looking for integrated and consolidated insights must move beyond those ‘limitations’.
In my experience, healthcare is one of the industries that struggles the most with siloed, point-focused, and unintegrated systems. In his article, Violino uses a children’s hospital in the US as an example of how to overcome these data silos. The case study examines the hospital’s journey to consolidate 120 separate systems into a single, centralised data warehouse with one reporting tool.
The value in data lakehouse architecture
More organisations are embracing data lakehouse architectures, as opposed to conventional dimensional data warehouses. These are used to systematically collect all relevant structured, unstructured, and streaming data; to store, transform, aggregate, and label it as needed; and, finally, to optimise the data for reporting.
At my current organisation, the data lakehouse architecture is enabling us to ingest data from a myriad of systems much faster and with more agility to adapt to changing requirements. So, instead of an instantiated dimensional data warehouse on top of the data lakehouse, we are using dynamic views and the reporting tool’s semantic modelling capabilities.
The result is a more efficient and business-friendly reporting environment. Not only does it return results faster, but it also uses fewer resources than it would take to design and implement a dimensional data warehouse and the multi-layer ETL processes to populate it. Of course, our reporting models still apply dimensional principles; we simply do not physically instantiate them.
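To make this a little more concrete, below is a minimal sketch of the idea (not our actual implementation), assuming a Spark-based lakehouse with hypothetical raw.sales and raw.customers tables and an existing reporting schema:

```python
# Minimal sketch: a dimensional-style view defined over raw lakehouse tables
# instead of a physically instantiated star schema. Table and column names are
# hypothetical, and the 'reporting' schema is assumed to already exist.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lakehouse-reporting-views").getOrCreate()

spark.sql("""
    CREATE OR REPLACE VIEW reporting.v_sales AS
    SELECT
        s.sale_id,
        s.sale_date,
        s.amount,
        c.customer_name,      -- dimension attributes joined at query time,
        c.customer_segment    -- not loaded into separate dimension tables
    FROM raw.sales AS s
    LEFT JOIN raw.customers AS c
        ON s.customer_id = c.customer_id
""")
```

The reporting tool’s semantic model then points at a view like this, so changes in the underlying raw tables flow through without reworking multi-layer ETL.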
The data steering team
Organisations can make additional strides in breaking down silos by putting in place a dedicated data navigation or steering team. Such a team would help the organisation align data across business areas and establish a data governance function. This will empower decision-makers to ensure the trust, privacy, and security of data, while also identifying the technology and human resources needed to build an integrated data architecture.
Such an approach would be especially beneficial to a highly siloed organisation. Having key stakeholders from the various silos participate in a central forum, making key data and prioritisation decisions, will foster a culture of sharing across organisational boundaries. When data- and governance-related decisions, information about data quality, and the business gains achieved through centralised reporting are shared openly, the silos tend to break down organically. This approach treats data as a shared resource that must be managed accordingly.
Data silos can result in inconsistencies and operational inefficiencies for a business, and their dissolution can ensure the consistency and accessibility of reliable data across the organisation. A centralised data team structure can establish a unified data ecosystem. Breaking down these silos can foster a culture of innovation, facilitating coordination and collaboration between different business areas. This will result in better decision-making, efficiency in providing analytics, and faster service to stakeholders.
Next month, I will be digging deeper into the strategic aspects and other tips on how to break down data silos.
Self-service BI – Examining the right approach to take
Last month, I discussed the challenges and opportunities of self-service BI as an approach to enable non-tech-savvy business users to directly access data and explore it on their own. In this blog, the focus turns to some of the reasons why self-service has failed and understanding the approaches that have worked.
For reference, I use the terms ‘power users’ and ‘citizen developers’ interchangeably in this piece, to represent those business users outside the BI team. These are the people who will be enabled through self-service BI to address their own and others’ informational needs.
Self-service BI failures
There are several possible reasons why self-service BI fails. Some that have been identified are:
- Unrealistic expectations: Companies that let novice users loose on organisational data face the potential of poor-quality reports and inconsistent reporting. This results in a huge distrust of data in general.
- Reporting chaos: With no governance structures in place, there will be redundant reports from different users. Because they work in silos and use different filters and terminology, they will deliver conflicting results despite using the same underlying data.
- Lack of adoption: BI tools, environments, and processes may be easy for specialists. However, we must keep in mind that casual users do not have the same background and skills. The complexity and ‘newness’ of these environments can be quite intimidating for novices.
- Lack of support: Citizen developers are not trained on BI processes and tools. Without proper support and handholding, any self-service initiative is bound to fail. Organisations must factor in the time and resources essential to deliver this support.
- Poor data quality: If the power and downstream users do not trust the data, they will stop using it. Even worse, if the business starts distrusting the data, there is a high likelihood that siloed pockets of departmental BI initiatives will spring up.
Making self-service BI work
Through a combination of my own experiences and several industry sources, below are a few ways a business can go about establishing self-service BI successfully:
- Identify the user population: It would be complete chaos if self-service BI were accessible to the entire company. In a data-mature organisation, only 25% of all users can be labelled as potential ‘power users’. In my experience, it is better to bring these ‘citizen developers’ on board in small groups, each with a specific focus and guidance. It is also useful to involve them in a community of interest.
- Set a self-service BI strategy: Self-service can mean a lot of different things. The business must therefore be clear about the scale of implementation, the types of users, their technical proficiency, the expectations of deliverables, and the approach to be used. It is also important to not try and boil the ocean! Starting small and building focused business areas one at a time works well.
- Keep stakeholders informed: The company must keep not only the power users but also their managers and the intended users of their work up to date. There must also be channels set up for feedback throughout the process.
- Set up comprehensive governance and quality assurance: In my mind, this is one of the most important aspects to get right. A company must put policies and processes in place to ensure that what is delivered is complete, accurate, timely, and relevant; inaccurate or inconsistent information simply cannot be allowed to reach the business. Once data is distrusted, regaining that trust becomes an almost insurmountable task. To this end, peer review processes are very useful, and some of the basic checks can even be automated (see the sketch after this list).
- Use an appropriate tool: Although most of the reporting tools out there claim to support self-service BI, some are more suitable than others. I have found that tools that support curated semantic models are better for ensuring consistent and accurate reporting. Additionally, they reduce some of the technical complexities where users have to identify and implement various types of joins and unions across datasets.
- Establish a single source of the truth: Even a well-architected data warehouse and reporting environment may be too complex and detailed for citizen developers. A well-curated and quality-assured semantic data model, with pre-joined and de-normalised high-level entities presented in business-language terms, works very well. This requires a lot of planning, design, and implementation before letting power users loose on the model.
- Establish a dictionary and metadata: Not only must the data be surfaced in business terms, but it should also be properly catalogued and documented. The documentation must be readily available and easily accessible by the business users. Likewise, there should also be a catalogue of existing reports and dashboards. These must also be easily accessible to enable users to search the catalogue for similar reports before embarking on a new development.
- Educate the power users: Users must be educated on the use of the tool as well as the data. Aspects of visualisation theory are also very important. It is also very useful to explain the data model to power users through workshops and hands-on implementation sessions.
- Refine and adapt: Like any good strategy implementation, regular monitoring, review, feedback, and adjustment are always useful.
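To illustrate the quality-assurance point above, here is a minimal sketch (using pandas, with a hypothetical dataset, column names, and freshness threshold) of the kind of automated check that could complement peer review before a self-service dataset or report is published:

```python
# Minimal sketch of an automated quality-assurance gate. The dataset, key column,
# date column, and freshness threshold are all hypothetical.
from datetime import datetime, timedelta

import pandas as pd

def quality_check(df: pd.DataFrame, key_column: str, date_column: str,
                  max_age_days: int = 2) -> list[str]:
    """Return a list of issues found; an empty list means the dataset passes."""
    issues = []
    if df.empty:
        return ["Dataset is empty."]
    if df[key_column].isna().any():
        issues.append(f"Missing values in key column '{key_column}'.")
    if df[key_column].duplicated().any():
        issues.append(f"Duplicate values in key column '{key_column}'.")
    latest = pd.to_datetime(df[date_column]).max()
    if latest < datetime.now() - timedelta(days=max_age_days):
        issues.append(f"Data may be stale: latest record is dated {latest:%Y-%m-%d}.")
    return issues

# Example usage with a small, made-up extract.
sales = pd.DataFrame({
    "sale_id": [1, 2, 3],
    "sale_date": ["2024-09-01", "2024-09-02", "2024-09-03"],
})
print(quality_check(sales, key_column="sale_id", date_column="sale_date"))
```

Checks like these do not replace peer review or governance policies, but they catch the obvious problems early and consistently.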
While I did not discuss it in length, there must also be a close alignment between self-service BI and the broader data governance function. Self-service BI users often detect data quality and consistency issues. This means there should be good and open communication to bring these issues to the fore and make the organisation aware of how they are being handled.
Understanding the benefits and challenges of self-service BI
The concept of ‘self-service business intelligence (BI)’ started gaining momentum in the early 2000s. More than two decades later, a survey by Yellowfin has found that the majority of respondents (61%) say that less than 20% of their business users have access to self-service BI tools. Perhaps more concerning, 58% of those surveyed said less than 20% of people who do have access to self-service BI use the tool. In this blog, the first of a two-part series on the topic of self-service BI, I take a closer look into this interesting and challenging area in the wider data field and share my views.
Data and Analytics in Healthcare – conference review
I recently had the privilege to attend the Data and Analytics in Healthcare conference hosted by Corinium Intelligence in Australia. Following this, I wanted to use my blog piece this month to discuss some of the key lessons and insights shared during the event. I’ll jump straight in.
Data governance and ethics
Of course, it is easy to get carried away by the hype around generative AI, machine learning (ML), large language models (LLMs), and so on. Even so, it was sobering to see several presentations and panel discussions still focusing on data governance and ethics related to these topics. But given that we are talking about healthcare data, this should not come as too much of a surprise.
A useful framework in this area is the Australian Digital Health Capability Framework and Quality in Data. This looks to align with existing industry-specific frameworks, ensuring that all health and care workers are empowered with digital capabilities. Concern was expressed that as much as 60% of AI and ML tools, especially cloud-based ones, share healthcare data with third parties without consent.
Additionally, delegates heard that data governance and adherence to ethics do slow the adoption of insights. There were examples where research results were not implemented for years due to the number of frameworks, data governance, privacy, and other controls that had to be followed. It was mentioned that legislation is often a step behind ethics, resulting in the need to reword policies down the line.
On the positive side, the sharing of ‘de-identified’ health data for research and outcome improvement was unanimously supported. One of the presenters compared this to sharing an organ for transplant. Why would anyone not want their de-identified data to be shared if it can improve the health outcomes of others facing similar circumstances?
Data management
Another area that was covered, which is close to my heart, was that of data sourcing and data management. It was stressed that to obtain advanced insights from data, it must be the right data, of high quality, and available in a processable format. The amount of unstructured data that is hard to mine and interpret in healthcare is staggering. In short, you need a solid data foundation if you want to use AI and ML effectively. This comes down to having data that is scalable, understandable, accessible, and fit-for-purpose.
One of the biggest challenges in healthcare data remains data linkage – joining the dots between related data in different datasets, originating from different systems that are often managed by different organisations. An interesting observation was made around bias in healthcare: we mostly collect data about sick or ill people, and hardly any data is collected about healthy people. This makes it difficult to define what target populations for treatments, or the variables for comparative control groups, should look like. Making this more difficult is the fact that a lot of data is collected and stored but never used to its full potential.
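As a brief aside on what linkage can involve in practice, here is a minimal sketch (with entirely made-up records and column names, assuming pandas is available) of deterministic linkage, where records from two systems are matched on a hashed composite key rather than a shared identifier:

```python
# Minimal sketch of deterministic record linkage across two hypothetical systems.
# Real healthcare linkage typically uses far more sophisticated (often
# probabilistic) matching; this only illustrates the basic idea.
import hashlib

import pandas as pd

def linkage_key(name: str, dob: str) -> str:
    """Build a match key from a normalised name and date of birth."""
    normalised = f"{name.strip().lower()}|{dob.strip()}"
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()

hospital = pd.DataFrame({
    "name": ["Jane Doe", "John Smith"],
    "dob": ["1980-02-01", "1975-07-14"],
    "diagnosis": ["asthma", "diabetes"],
})
pathology = pd.DataFrame({
    "name": ["jane doe ", "Mary Jones"],
    "dob": ["1980-02-01", "1990-11-30"],
    "test_result": ["normal", "elevated"],
})

for df in (hospital, pathology):
    df["link_key"] = [linkage_key(n, d) for n, d in zip(df["name"], df["dob"])]

# The inner join links Jane Doe's hospital and pathology records.
linked = hospital.merge(pathology, on="link_key", suffixes=("_hosp", "_path"))
print(linked[["diagnosis", "test_result"]])
```

Even this toy example hints at why linkage is hard: a trailing space or a different name spelling is enough to break a match, and that is exactly where datasets from different organisations tend to disagree.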
Process and methodology
There were several good sessions centred on the processes and methodology to follow when adopting analytics, especially AI, ML, and LLMs, in healthcare. While these were too detailed to cover here, the general sentiment was that one needs to be more careful and thorough about the design, evaluation, and interpretation of results. For rare cases especially, do we have sufficient volumes of training data of high enough quality for advanced models? Is the technology mature enough? And do we have proper processes for ongoing monitoring and improvement?
In other industries, people may get annoyed or even switch providers when an incorrect marketing campaign is fired off at them, or an inappropriate product is recommended. In healthcare, these ‘mistakes’ can have far more serious consequences, even life-threatening ones.
Building capacity and capability
Other sessions had interesting discussions on developments around capability and capacity building. My impression was that healthcare organisations in other countries are also scrambling for resources and funding. Key approaches to help overcome this include partnering, collaboration, and innovation across organisations and teams. The adage of starting small and building on the ROI shown came through as well.
Additionally, culture is key. It was mentioned that literacy and education take as much as 70% of the effort of adopting new technology and insights. The mind shift that must happen at the decision-making levels was also covered. Insights and data have to have a seat at the table.
An interesting study showed that both text analytics and AI only did an okay job of coding and classifying electronic medical records, with not a huge difference between the two. So, while it takes experts hours to apply coding and classification to cases and diagnoses, there is a massive risk in replacing that expertise, insight, and interpretation with an automated process. You simply cannot automate the acquisition of health knowledge and interpretation.
The general message that came across was that AI and ML were efficient at reducing the administrative burden on clinicians and allied health staff. But despite some amazing (isolated) research outcomes, it was seen as too risky and unethical to have technology make or influence diagnosis and treatment. That said, healthcare is overloaded with administrative and redundant data-capture processes that can be automated, freeing up clinicians and allied health staff to focus on what they are trained for and do best.
Conclusion
In closing, I didn’t review the specific AI, ML, or LLM case studies here. They were very interesting, relevant, and well presented, with lessons learnt, but there is just too much detail to cover in this post.
It was another great and relevant event put on by the Corinium team. I walked away with many notes and some key aspects to incorporate in my strategic and operational plans going forward. I learnt about a few new concepts and made a few new connections too. I hope you find the above brief insights shared of value.
Of course, a nice venue and having proper barista-made coffee and wholesome food, together with networking drinks afterwards, rounded it off to make it an enjoyable experience. All in all it was a great and insightful day!
The importance of data lineage
Do you know where your food comes from? Did the farmer use pesticides? Did the transport company spray preservative chemicals over your food? Did they keep it appropriately refrigerated? Would you eat food from sources you don’t trust? The same applies to data. Do you know what the lifecycle of your data entails? Was it manually entered? What validations were applied? Through how many transactional systems did it go, and was it transformed along the way? Would you make decisions based on data you don’t trust? This is where data lineage comes in.
Data quality a priority for 2024
Despite the hype surrounding generative Artificial Intelligence (GenAI), I am finding in my reading that many industry analysts are predicting that data quality (one of my favourite topics) will remain a key priority for this year – especially when it comes to data management and governance.