An ongoing challenge of data warehousing is the long turnaround time between when a business event occurs and when the fact representing it is ready for consumption in the data warehouse. Delayed delivery makes it difficult to make timely decisions. Long data latencies also impede the organisation’s ability to quickly assess how well its decisions have been implemented. Consequently, near-time data warehouse synchronisation has become a critical component of strategic and operational decision-making.
One of the key requirements for highly synchronised information delivery is that it must provide consistent, deterministic results to similar requests. This means that my dashboard of sales for the past hour must show exactly the same results if it is run three days later with that same hour as the parameter. This demand for consistency is what makes ODSs and data marts so hard to keep synchronised with near-time data warehouses.
Fortunately, there are technologies that can assist us in keeping the data warehouse true to reality in near real time.
Change data capture (CDC) is an event-driven mechanism for capturing changes to operational data and propagating these changes to the data warehouse. Triggering and propagating data modifications via CDC enables very tight coherence between the operational systems and the data warehouse, which makes it possible to discover actionable opportunities in real time while maintaining consistency.
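As a rough illustration only, the sketch below polls a trigger-populated change table on the source system and applies the captured changes to a warehouse fact table. The table and column names (sales_changes, fact_sales and so on) are hypothetical, and a log-based CDC tool would deliver the change stream for you rather than requiring a polling loop like this.

```python
# Minimal CDC propagation sketch: poll a change table on the source system
# and apply the captured changes to the warehouse. All table and column
# names are illustrative; fact_sales is assumed to have sale_id as its
# primary key so the upsert works.
import sqlite3

def propagate_changes(source: sqlite3.Connection,
                      warehouse: sqlite3.Connection,
                      last_seq: int) -> int:
    """Apply all changes captured after last_seq; return the new high-water mark."""
    rows = source.execute(
        "SELECT seq, op, sale_id, amount, sold_at "
        "FROM sales_changes WHERE seq > ? ORDER BY seq",
        (last_seq,),
    ).fetchall()

    for seq, op, sale_id, amount, sold_at in rows:
        if op == "D":
            # A delete on the operational system removes the fact row
            warehouse.execute("DELETE FROM fact_sales WHERE sale_id = ?", (sale_id,))
        else:
            # Inserts and updates are applied as an upsert
            warehouse.execute(
                "INSERT INTO fact_sales (sale_id, amount, sold_at) VALUES (?, ?, ?) "
                "ON CONFLICT(sale_id) DO UPDATE SET amount = excluded.amount, "
                "sold_at = excluded.sold_at",
                (sale_id, amount, sold_at),
            )
        last_seq = seq

    warehouse.commit()
    return last_seq
```

The high-water mark (last_seq) is what keeps the process restartable: if the loader stops, it simply resumes from the last sequence number it committed.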
Aside: CDC is also widely used in data replication products; with Sybase Replication Server, for example, the same facilities can be used to implement both replication and near-time data warehouse synchronisation.
Data Federation enables transparent access to heterogeneous and generally physically distributed data sources, even in different formats, without requiring a staging area. It is an effective way to capture subsets of large distributed data sets, and is often used when data is offsite, in older formats, or infrequently used. Federation simplifies consolidation by joining dissimilar data before it arrives at the data warehouse. It is very useful for “pull”-oriented data warehouse synchronisation.
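The sketch below only illustrates the “pull” idea under assumed sources: it joins an operational order database with a legacy CSV extract on the fly, without landing either in a staging area. A real federation layer would hide this behind a single SQL interface; all file, table and column names here are made up.

```python
# Illustrative "pull"-style federation: join rows from an operational
# database and a legacy CSV extract in one pass, with no staging area.
import csv
import sqlite3

def federated_customer_sales(orders_db: str, customers_csv: str):
    """Yield (customer_name, order_total) by joining two dissimilar sources."""
    # Source 1: legacy CSV extract keyed by customer id
    with open(customers_csv, newline="") as f:
        customers = {row["customer_id"]: row["name"] for row in csv.DictReader(f)}

    # Source 2: operational order database, aggregated on the fly
    con = sqlite3.connect(orders_db)
    try:
        for customer_id, total in con.execute(
            "SELECT customer_id, SUM(amount) FROM orders GROUP BY customer_id"
        ):
            yield customers.get(str(customer_id), "unknown"), total
    finally:
        con.close()
```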
Stream processing feeds the loading facilities with data continuously pumped from the data sources in real time. This enables complex event processing and real-time data warehouse synchronisation, thereby reducing latency, and it works well for “push”-oriented synchronisation.
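Here is a minimal micro-batching sketch of the “push” pattern: events arrive continuously on an in-memory queue (standing in for a real stream or message bus) and are flushed to the warehouse in small, time-bounded batches to keep latency low. The load_batch callable is a placeholder for your actual loader, not a real API.

```python
# Sketch of a "push" pipeline: continuously drain a stream of events and
# load them in small, time-bounded micro-batches.
import queue
import time

def stream_loader(events: queue.Queue, load_batch,
                  max_batch: int = 500, max_wait: float = 1.0) -> None:
    """Drain events forever, flushing at most max_batch rows or every max_wait seconds."""
    while True:
        batch = [events.get()]                      # block until the first event arrives
        deadline = time.monotonic() + max_wait
        while len(batch) < max_batch:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break                               # time budget spent: flush what we have
            try:
                batch.append(events.get(timeout=remaining))
            except queue.Empty:
                break
        load_batch(batch)                           # push the micro-batch into the warehouse
```

Tuning max_batch and max_wait is the usual latency-versus-throughput trade-off: smaller values keep the warehouse fresher, larger values load more efficiently.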
However, you cannot blindly take your old batch-oriented ETL processes and reuse them for near-time data synchronisation. Any latent performance impediments will become worse and can seriously hamper the implementation. They have to be radically streamlined. In particular, you have to rework or replace bulk sorting, merging, aggregating and joining operations, as well as any data copying, filtering and partitioning steps, as these can all take too long to execute in near-time. In a previous post on agile data warehousing, I discussed the importance of a well-designed ETL framework. That same concept is applicable here, but the focus now is on processing speed and throughput. The faster you can process and load the data, the fewer delays, contention and other related problems you will have.
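One concrete way to streamline a transformation, purely as an illustration: instead of re-sorting and re-aggregating the full extract on every run, fold each change record into a running aggregate as it arrives. The field names below are invented, and updates are assumed to arrive as delete/insert pairs, as many CDC feeds deliver them.

```python
# Incremental aggregation instead of batch sort/merge/aggregate: each change
# record adjusts the running totals directly. The dictionary stands in for
# an aggregate table in the warehouse; field names are illustrative.
from collections import defaultdict

running_totals = defaultdict(float)   # (store_id, product_id) -> total sales amount

def apply_change(change: dict) -> None:
    """Fold a single change record into the running aggregate."""
    key = (change["store_id"], change["product_id"])
    if change["op"] == "D":
        running_totals[key] -= change["amount"]   # reverse a deleted sale
    else:
        running_totals[key] += change["amount"]   # add a newly inserted sale
```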
Of course, if you are going to synchronise a large volume of updates in real time, you need an infrastructure that can support it. You need the power, capacity and throughput to extract data from the source system(s), transport it to the data warehouse, and load and process it there in near real time.
As I noted in a previous post on handling the data explosion, near-time data warehouse synchronisation, if implemented correctly and efficiently, gives us the ability to report operationally straight off the data warehouse. Then we can get rid of the ODS and other operational data marts, save some space, remove duplicate processes, and have consistent, up-to-date information readily available.