Dimensional Models for Predictive Analytics


Even though it has been around for years, predictive analytics has seen much stronger uptake over the past few years. For example, store owners now try to predict which products their customers will buy and which products should be taken off the shelf completely; marketers are interested in analysing customer sentiment; telcos need to know which customers are about to churn off the network; and so on. However, an interesting area for discussion is the integration of advanced analytics with conventional data warehousing and BI.

In days gone by, stalwart knowledge workers, as they used to be called, extracted data from wherever they needed, transformed it however they needed and built analytical models in the backroom. They produced sets of indicators or probabilities that the “application guys” had to integrate with the data in existing applications in order to get business benefit from it.

Nowadays the discussion is more focussed on whether the data can be provided at the correct level of detail, and in the right format, to the data scientists, as they are now called, and whether the insights they derive can be integrated back into the BI layer for more targeted reporting. Organisations want those indicators and scores operationalised to appear “seamlessly” in applications, for example through SOA architectures, so that they are visible to the users “on the floor” who are working with the customers, providers or products in question. As a simple example, the probability that a customer is about to churn must be made available in the call centre application so that the call centre agents can incorporate that fact into the script when interacting with the customer.

So I’m all for a “scratch patch”, often called a “sandbox environment” – a stand-alone analytics server environment – into which data is extracted in a convenient format for advanced analytical modelling. This can often be achieved in much shorter timeframes than sourcing the same data into the enterprise data warehouse, primarily because it doesn’t have to go through tightly tested and controlled productionisation processes. However, I agree with this approach only for the purpose of analysing the data, performing data discovery and developing “proof of concept” analytical models. I have a serious concern when the outputs from such “straw man” models are integrated back into operational applications for production use in operational business processes, where those indicators and scores then affect decisions and outcomes. In my opinion, they then need to be processed and managed through more formal processes.

So can the BI environment be used for that?

While undertaking research into this topic I found a great article from the Kimball Group, “Leverage Your Dimensional Model for Predictive Analytics”, that gives strong insight into this point of discussion.

Ralph Kimball writes that even though there are differences among predictive analytic techniques, all of the advanced analytical modelling tools can consume data as a sequence of observations keyed by a unique key, similar to the surrogate key used in a data warehouse – for example, a CustomerKey used to identify the various versions of the customer dimension records held in the Customer dimension.

So, for example, when you want to predict the “expected” future behaviour of a number of customers by running a predictive analytic model over their history, the data “required” by the predictive modelling tool is a set of rows whose primary key is the unique customer key, with the other columns being an extensive number of expressive attributes, including the customary descriptive and demographic fields.

From my experience, when dealing with predictive analytics models, the input has to be at a more granular level than the data typically found in a conventional dimensional table, and as a result, it is more complex to get into that state.

In normal customer measurement, many of the interesting attributes are chosen at the time of analysis by searching the tables describing a particular customer’s history and then transforming that information into a tag, score or amount. In many cases categorical variables have to be transformed into a set of flags that represent the data at the most elementary “yes / no” level. For example, a very simple gender lookup table with the values “male”, “female” and “unknown” has to be transformed into three variables – GenderMaleFlag, GenderFemaleFlag and GenderUnknownFlag – each with the value 0 or 1 to represent “no” or “yes” respectively.
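As a minimal sketch of that transformation – assuming hypothetical table and column names, and using pandas purely for illustration – the lookup value can be expanded into the three flag columns like this:

```python
import pandas as pd

# Hypothetical customer dimension extract with a categorical gender column.
customers = pd.DataFrame({
    "CustomerKey": [101, 102, 103],
    "Gender": ["male", "female", "unknown"],
})

# Expand the categorical value into elementary 0/1 flag columns.
flags = pd.get_dummies(customers["Gender"], prefix="Gender", dtype=int)
# Rename e.g. "Gender_male" to "GenderMaleFlag" to match the naming above.
flags.columns = [f"{c.title().replace('_', '')}Flag" for c in flags.columns]

# One observation row per customer key, with the flag columns attached.
observations = pd.concat([customers[["CustomerKey"]], flags], axis=1)
```

The same expansion applies to any low-cardinality categorical attribute; the key point is that each flag carries exactly one elementary yes/no fact.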

These customer tags, scores and amounts remain fairly static over the customer’s lifetime. In fact, through a good semantic layer, they can be used for query filtering and grouping beyond predictive analysis, even in conventional reporting and dashboarding applications.

Putting together the data that’s fed into a predictive model from a dimensional model can be pretty simple; it requires a few joins across foreign and surrogate keys. However, it gets more complicated when we need to start transforming the data, as in the example illustrated above, and some categorical and nominal variables are even more complex to transform. So how do you go about doing this, given that there are potentially a number of systems in place that could implement this specific task?
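To make the simple case concrete, here is a minimal sketch of those joins against a toy star schema – the table and column names are assumptions for illustration – run through Python’s built-in sqlite3 so the SQL is self-contained:

```python
import sqlite3

# In-memory sketch of a tiny star schema: a customer dimension and a sales
# fact table, joined on the CustomerKey surrogate key.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE DimCustomer (CustomerKey INTEGER PRIMARY KEY, Name TEXT, Segment TEXT);
CREATE TABLE FactSales (SalesKey INTEGER PRIMARY KEY, CustomerKey INTEGER, Amount REAL);
INSERT INTO DimCustomer VALUES (101, 'Alice', 'Retail'), (102, 'Bob', 'Corporate');
INSERT INTO FactSales VALUES (1, 101, 50.0), (2, 101, 25.0), (3, 102, 200.0);
""")

# One observation row per customer: descriptive attributes from the dimension
# plus behavioural measures aggregated from the fact table.
rows = con.execute("""
SELECT d.CustomerKey,
       d.Segment,
       COUNT(f.SalesKey)          AS PurchaseCount,
       COALESCE(SUM(f.Amount), 0) AS TotalSpend
FROM DimCustomer d
LEFT JOIN FactSales f ON f.CustomerKey = d.CustomerKey
GROUP BY d.CustomerKey, d.Segment
ORDER BY d.CustomerKey
""").fetchall()
```

The LEFT JOIN ensures that customers with no fact rows still produce an observation, which matters when the model needs to see inactive customers too.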

From my experience, we can utilise the ETL processes – which have become quite powerful over the years – to perform this. For me, the main attractions of using ETL are:

  • The processing is done once, at extraction and load time into the data warehouse. Apart from being resource efficient – it only gets done once – it has the added advantage that it is done correctly, under reasonably controlled circumstances (depending, obviously, on the implementation and controls in place in the organisation).
  • The data is already being processed, and some business rules applied, at ETL time – it makes sense to add the analytical transformation steps to the same transformation processes, so that the analytical transformations are documented, managed and controlled in the same manner as the other ETL transformations.
  • We don’t need to recruit or assign additional transformation and documentation responsibilities – the ETL team is already performing similar transformations.
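To illustrate the first two points, here is a minimal, hypothetical sketch of an ETL transform step that derives analytical tags and flags once, at load time, alongside the other business rules. All the names, banding thresholds and rules here are assumptions for illustration, not a prescribed design:

```python
from datetime import date

def derive_customer_attributes(raw, as_of=date(2024, 1, 1)):
    """One ETL transform step: business rules plus the analytical
    derivations, applied in a single documented, controlled place."""
    tenure_days = (as_of - raw["FirstPurchaseDate"]).days
    return {
        "CustomerKey": raw["CustomerKey"],
        "TenureDays": tenure_days,
        # Analytical tag: banded tenure, derived once so that every
        # downstream model sees the same value (thresholds assumed).
        "TenureBand": ("new" if tenure_days < 90
                       else "established" if tenure_days < 365
                       else "loyal"),
        # Analytical flag: an elementary yes/no on recent activity.
        "ActiveLast90DaysFlag":
            1 if (as_of - raw["LastPurchaseDate"]).days <= 90 else 0,
    }

# Applied to each incoming record during the warehouse load.
loaded = [derive_customer_attributes(r) for r in [
    {"CustomerKey": 101,
     "FirstPurchaseDate": date(2022, 6, 1),
     "LastPurchaseDate": date(2023, 12, 15)},
]]
```

Because the derivation lives in the ETL code base, it is versioned, tested and documented with the rest of the load logic, rather than re-implemented ad hoc per model.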

Of course there are other tools and programs available to help build the observation dataset used for predictive analytics. These range from Excel, to hard-coded programs, to master data management tools; even some of the more mature advanced analytical modelling tools provide ample data transformation capabilities. However, they do not always have the same controls and rigour around them as a mature ETL process. The effect may then quite easily be that different transformations are used in different analytical models, resulting in a large amount of re-work and potential inconsistency of interpretation across different analytical modelling outcomes.

There is still much research to be undertaken into what data structures and data transformations are appropriate, required and ideal for predictive analytical modelling. However, it also excites me to know that an established technology such as an ETL tool can be used very effectively to manage the data preparation for advanced analytical modelling, especially under much more controlled circumstances for productionised applications.

The more we research the synergies between predictive modelling, data warehousing and BI, the better we will understand how to implement these processes cost-effectively, repeatably and in a much more controlled fashion.
