A recent white paper from Embarcadero Technologies on the seven deadly sins of database design triggered me to write this post, as similar principles can be applied to dimensional data warehouse design. Some of the sins are vaguely similar to those encountered in relational data modelling, but others are very different and apply only to the style of modelling used for dimensional data warehouses. Avoiding these sins will counter many of the criticisms often raised against dimensional data warehouse models.
Just as a database forms the foundation of any application, the data warehouse forms the cornerstone of any proper BI implementation. Get the data warehouse design wrong, and you will either battle to populate it, as the source data will not have the correct places to slot in, or you will battle to deliver correct or efficient reports from it. Even worse, go one step further and let self-service super-users loose on an incorrect data warehouse design, and see what fun you have when they report incorrect information back to the business!
Well, let’s look at the seven deadly sins for data warehouse design in some depth… Note this may turn into a lengthy sermon!
Sin no. 1: Not documenting the data warehouse design
An undocumented data warehouse design makes it difficult for business analysts, data analysts, ETL developers, report and dashboard developers, advanced analytical modellers and other data warehouse architects to work with the data warehouse. In fact, it makes it difficult for everyone involved.
The data contained in a data warehouse is intended to be consumed through query and reporting tools, as well as through the interactive analysis components of these tools, typically “driven” by developers or by empowered business users through a self-service capability. It is therefore very important that table and column names reflect the intention, business meaning and contents of the underlying data, in clear and unambiguous business terms.
You may think a data warehouse design, especially one modelled in business terms, should be self-explanatory, but that is far from the case:
- Firstly, the grain of a fact table is very hard to deduce if it is not explicitly stated and documented, because the primary key of a fact table often defines physical uniqueness, which does not necessarily correspond to the grain as it is defined in business terms.
- Secondly, a dimension may contain multiple hierarchies, some of which are only known and used by certain parts of the business. Because of all the descriptive attributes associated with a hierarchy, even a known hierarchy does not stand out immediately. Multiple overlapping hierarchies are even harder to trace if they are not explicitly and thoroughly documented.
- Thirdly, complex derived measures, aggregated snapshot fact tables, bridge tables, degenerate dimensions and so on are not immediately obvious to anyone reading the model, so they should all be properly documented; see the sketch after this list.
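To make this concrete, here is a minimal sketch of design metadata captured in code, assuming a hypothetical claims star schema. Every table, column and hierarchy name below is an illustrative assumption; the point is simply that grain, hierarchies and derivations are stated explicitly rather than left to be deduced.

```python
# Minimal sketch: recording grain, hierarchies and derived measures explicitly,
# so they do not have to be deduced from table and column names.
# All names below are illustrative assumptions for a hypothetical claims star.

FACT_CLAIM_PAYMENT = {
    "table": "fact_claim_payment",
    # The grain is stated in business terms, not inferred from the primary key.
    "grain": "one row per payment made against a claim line",
    "measures": {
        "claimed_amount": "amount claimed by the member, in policy currency",
        "paid_amount": "amount actually paid out, in policy currency",
        # Derived measures carry their derivation rule.
        "shortfall_amount": "derived: claimed_amount - paid_amount",
    },
    "dimensions": ["dim_date", "dim_member", "dim_provider", "dim_policy"],
}

DIM_PROVIDER = {
    "table": "dim_provider",
    # Multiple (possibly overlapping) hierarchies are listed explicitly,
    # because they do not stand out from the column names alone.
    "hierarchies": {
        "geographic": ["country", "region", "practice"],
        "network": ["network", "provider_group", "practice"],
    },
}

if __name__ == "__main__":
    # The same metadata can be rendered as human-readable documentation.
    for spec in (FACT_CLAIM_PAYMENT, DIM_PROVIDER):
        print(spec["table"])
        for key, value in spec.items():
            if key != "table":
                print(f"  {key}: {value}")
```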
Now you may think that if your BI toolset employs a semantic model, then all of this will be taken care of by the semantic model. If the semantic model presents all the data in clear, unmistakable business terms, surely it doesn’t matter what the underlying tables and columns are called (to over-simplify it a little). However, that is only partially true. The problem is that in most modern-day organisations more than one BI toolset is in use. So whereas the BI team and the self-service report users in the business may be using the tool that employs the semantic model, the advanced analytical modellers may be using a different toolset, and the data scientists may be using yet another data visualisation tool, none of which run through the semantic model. Now imagine the confusion in the business when these three groups present information about the same concepts, but using different terms and definitions!
Sin no. 2: Not following proper dimensional modelling practices
The dimensional model, as thought out and further developed by Ralph Kimball, is a very logical and efficient way to structure data for reporting, dashboards, especially for interactive analysis, and even for advanced analytical modelling. The model seems so simple – place measures and KPIs in fact tables, put descriptive attributes in denormalised dimensions, and link them all up through surrogate primary and foreign keys. Easy, right? Well, no, there are many things that you can get wrong, or not do properly, that will make the model either perform badly, or even worse – return incorrect results.
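As a point of reference for the rest of this section, here is a minimal star schema sketch using Python’s standard sqlite3 module. The table and column names are assumptions made up for illustration; the point is simply measures in a fact table, denormalised descriptive attributes in dimensions, and surrogate keys linking them.

```python
import sqlite3

# Minimal star schema sketch: measures in a fact table, denormalised
# descriptive attributes in dimension tables, linked via surrogate keys.
# All table and column names are illustrative assumptions.

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE dim_product (
        product_sk   INTEGER PRIMARY KEY,  -- surrogate key
        product_code TEXT,                 -- natural key from the source system
        product_name TEXT,
        category     TEXT                  -- denormalised descriptive attribute
    );
    CREATE TABLE dim_date (
        date_sk        INTEGER PRIMARY KEY,
        calendar_date  TEXT,
        calendar_month TEXT
    );
    -- Grain (documented, not deduced): one row per product sold per day.
    CREATE TABLE fact_sales (
        date_sk      INTEGER REFERENCES dim_date (date_sk),
        product_sk   INTEGER REFERENCES dim_product (product_sk),
        quantity     INTEGER,
        sales_amount REAL
    );
""")
```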
Some of the complexities include proper versioning of dimensions, where you have to decide how changes to each attribute are going to be handled – read up on type 1 and type 2 slowly changing dimensions (SCDs) for more details. From a design point of view, every dimension needs to carry additional attributes to implement SCDs. Your design also needs to cater for how late arriving records are going to be handled – where the measures for a fact “arrive” in the data warehouse before the corresponding dimensional records have been created.
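The sketch below shows one way the type 1 / type 2 split could be applied to an in-memory customer dimension. It is only an illustration under assumed column names (`customer_nk`, `customer_sk`, `valid_from`, `valid_to`, `is_current`); a real ETL tool would typically express the same logic as a metadata-driven merge against the dimension table.

```python
from datetime import date

# Minimal sketch of type 1 vs type 2 slowly changing dimension handling for an
# in-memory customer dimension. The attribute split and column names are
# illustrative assumptions, not a prescribed standard.

TYPE1_ATTRS = {"email"}            # overwrite in place, no history kept
TYPE2_ATTRS = {"segment", "city"}  # expire the current row, insert a new version

def apply_scd(dimension, source_row, next_key, today=None):
    """Apply a changed source record to the dimension; return the next free
    surrogate key. Brand-new customers (no current row) are not shown here."""
    today = today or date.today()
    current = next(r for r in dimension
                   if r["customer_nk"] == source_row["customer_nk"] and r["is_current"])

    # Type 1: simply overwrite the attribute on the current version.
    for attr in TYPE1_ATTRS:
        current[attr] = source_row[attr]

    # Type 2: if any tracked attribute changed, expire the current row and
    # insert a new version with a new surrogate key.
    if any(current[attr] != source_row[attr] for attr in TYPE2_ATTRS):
        current["is_current"] = False
        current["valid_to"] = today
        dimension.append(dict(source_row,
                              customer_sk=next_key,
                              valid_from=today, valid_to=None, is_current=True))
        next_key += 1
    return next_key
```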
Surrogate keys are crucial to implementing dimensional models correctly and efficiently. A surrogate key is a “meaningless” (to the business) unique system-generated identifier that is assigned to each record in a dimension table. The surrogate key is not derived from any business value and has no reference to the “natural” primary key of the data as it is defined in the source application. The reason for having a surrogate key is that when you implement multiple versions of a dimensional record as part of the SCD approach, the natural primary key of the record as it applies in the application repeats for every version of the record in the dimension table.
Using surrogate keys has other advantages too. It allows you to change the structures of dimension and fact tables without ever having to re-link the records through their foreign and primary key relationships. It allows you to add or change columns without affecting the primary and foreign key structures. You can even change the primary key of the table in the application, and the already-linked records in the fact and dimension tables will remain linked as they were at the time of the event they represent.
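Continuing the sketch above, the fact load then resolves the natural key on each incoming transaction to the surrogate key of the dimension version that was valid on the transaction date; again, the column names are assumptions carried over from the previous example.

```python
# Minimal sketch of the surrogate key lookup performed during a fact load:
# the incoming record carries the natural (application) key, and the load
# resolves it to the surrogate key of the dimension version valid at the
# time of the event. Column names follow the assumed SCD example above.

def lookup_surrogate_key(dimension, natural_key, event_date):
    """Return the surrogate key of the version valid on event_date, or None."""
    for row in dimension:
        starts_on_or_before = row["valid_from"] <= event_date
        still_valid = row["valid_to"] is None or event_date < row["valid_to"]
        if row["customer_nk"] == natural_key and starts_on_or_before and still_valid:
            return row["customer_sk"]
    # No match means a late-arriving dimension record; a common approach is to
    # insert a placeholder ("inferred") dimension row and return its new key.
    return None
```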
Did I mention this all has to be documented as well?
Sin no. 3: Not treating the data warehouse model like a living, breathing organism
OK, this one I basically took straight out of the Embarcadero Technologies article, as it is so perfectly relevant to data warehousing as well.
In the data warehousing world we get the same situation: the data warehouse (database) implementation is changed in production to address data problems, implement late-changing requirements, solve performance issues or fix some other urgent problem, without updating the underlying data model design. Over a short period of time the data model becomes out of step with reality, and is soon rendered useless.
In the data warehousing / business intelligence world, this problem is compounded by the fact that a BI implementation never consists of a single clearly defined project, with specific start and end dates and a fixed scope. BI is an ongoing program, where the data warehouse evolves as the program progresses. So of course, the data warehouse design will keep changing and expanding, and the model needs to be constantly extended, changed and kept up to date.
Further to this point, a mature data warehouse typically contains the data from multiple source systems. As mentioned in the Embarcadero article, the models of these application systems are also living, breathing organisms that undergo changes. Any change to any of the source systems may necessitate changes to the data warehouse design and model.
Sin no. 4: Not implementing conformed dimensions
In the original article this point was related to reference data.
In a dimensional data warehouse model, certain key dimensions become the anchor points of the dimensional model. These are called conformed dimensions; they represent key business entities, and those data sets are often managed as master or reference data in the organisation. There should only be a single implementation of each conformed dimension in the data warehouse, and these dimensions must be shared by all the fact tables that reference those business entities.
When you start creating copies or variations of these shared dimensions, you create subsets or segments (or silos) in the enterprise data warehouse model, the result being that you cannot correlate or calculate measures across different business processes.
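A small example of what a single shared dimension buys you, using Python’s sqlite3 module: because `fact_sales` and `fact_shipments` both reference the same `dim_date`, measures from the two business processes can be compared side by side. The tables and data are made up purely for illustration.

```python
import sqlite3

# Minimal drill-across sketch: two fact tables for different business
# processes share one conformed date dimension, so their measures can be
# compared at a common grain. All names and data are illustrative.

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE dim_date (date_sk INTEGER PRIMARY KEY, calendar_month TEXT);
    CREATE TABLE fact_sales (date_sk INTEGER, sales_amount REAL);
    CREATE TABLE fact_shipments (date_sk INTEGER, shipped_units INTEGER);

    INSERT INTO dim_date VALUES (1, '2024-01'), (2, '2024-02');
    INSERT INTO fact_sales VALUES (1, 100.0), (1, 250.0), (2, 80.0);
    INSERT INTO fact_shipments VALUES (1, 30), (2, 12), (2, 5);
""")

# Aggregate each fact to the shared grain first, then join the results on the
# conformed dimension key.
rows = con.execute("""
    SELECT d.calendar_month, s.sales_amount, sh.shipped_units
    FROM dim_date d
    JOIN (SELECT date_sk, SUM(sales_amount) AS sales_amount
          FROM fact_sales GROUP BY date_sk) s ON s.date_sk = d.date_sk
    JOIN (SELECT date_sk, SUM(shipped_units) AS shipped_units
          FROM fact_shipments GROUP BY date_sk) sh ON sh.date_sk = d.date_sk
    ORDER BY d.calendar_month
""").fetchall()

for month, sales, units in rows:
    print(month, sales, units)
```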
Sin no. 5: Not defining foreign keys
Note that I didn’t say “not implementing foreign keys”…
Foreign key referential constraint enforcement in a data warehouse environment is often the topic of an interesting debate between the data warehouse architect/designer, the ETL analysts/developers and the data warehouse DBAs. In essence, you want all the foreign keys to be defined, and you want to be sure that all the ETL processes perform the correct foreign key to primary key lookups and implement the foreign keys correctly. However, you do not want the foreign key constraints to be activated on the production data warehouse database, as checking them during every insert operation consumes too many resources and typically slows down the ETL processing too much. This is exacerbated in a near-time / real-time data warehouse environment, where those updates need to be applied in almost real time.
So what to do? The approach we typically follow is to define the foreign key constraints in the model, and to implement and activate them in the development, testing and acceptance environments. However, once you are sure that the foreign key constraints are properly adhered to by the ETL processes, you do not implement and enforce them in the database in the production environment.
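One way to keep that assurance without database-enforced constraints is an “orphan key” check run by the ETL after each load. The sketch below assumes the foreign key relationships are listed as metadata and that `con` is a DB-API connection; the table and column names are illustrative assumptions.

```python
# Minimal sketch of an ETL-side referential integrity check: the foreign keys
# are defined in the model and listed here as metadata, and instead of being
# enforced by the production database they are validated after each load.
# Table and column names are illustrative assumptions.

FOREIGN_KEYS = [
    # (fact table, fact column, dimension table, dimension key)
    ("fact_claim_payment", "member_sk", "dim_member", "member_sk"),
    ("fact_claim_payment", "provider_sk", "dim_provider", "provider_sk"),
]

def find_orphans(con):
    """Return (fact, column, value, count) for fact keys with no dimension row."""
    problems = []
    for fact, fk_col, dim, pk_col in FOREIGN_KEYS:
        rows = con.execute(f"""
            SELECT f.{fk_col}, COUNT(*)
            FROM {fact} f
            LEFT JOIN {dim} d ON d.{pk_col} = f.{fk_col}
            WHERE d.{pk_col} IS NULL
            GROUP BY f.{fk_col}
        """).fetchall()
        problems.extend((fact, fk_col, value, count) for value, count in rows)
    return problems
```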
Sin no. 6: Not modelling business processes
This point is very specific to dimensional modelling. When creating a data warehouse design, you have to create the model with a fact table for each key business process. The grain of the fact table must accurately reflect the level of detail of the process as it happens in the business. The detailed (transactional) fact tables must be designed at this lowest level of detail.
Many modellers focus on the transactions recorded by the source systems, and model the data warehouse design straight off the transactional system by merely denormalising the descriptive entity tables as dimensional tables. In some cases there may be a one-to-one correspondence between the source system transaction tables and the business processes you need to represent, but you cannot take that for granted by merely modelling off the transactional system. You have to analyse the business processes as the business sees and applies them, and make sure they are correctly represented by the “transactional” fact tables.
Summary and aggregate fact tables can be created at a later stage to address performance or specific summarisation requirements, but if the transactional facts do not represent the business processes at the most detailed grain, then the users will not be able to derive the information they require, or drill down to that level of detail when needed.
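The sketch below, again using sqlite3 with made-up names and data, keeps the transaction-grain fact as the designed structure and derives a monthly aggregate from it purely as a performance structure, rather than designing the warehouse at the summary level.

```python
import sqlite3

# Minimal sketch: the detailed fact table keeps the business-process grain
# (one row per order line), and an aggregate fact is derived from it later as
# a performance structure. All names and data are illustrative assumptions.

con = sqlite3.connect(":memory:")
con.executescript("""
    -- Transaction grain: one row per product per order line.
    CREATE TABLE fact_sales_line (
        date_sk INTEGER, product_sk INTEGER, store_sk INTEGER,
        quantity INTEGER, sales_amount REAL
    );
    INSERT INTO fact_sales_line VALUES
        (20240105, 1, 10, 2, 40.0),
        (20240105, 2, 10, 1, 15.0),
        (20240212, 1, 11, 3, 60.0);

    -- Aggregate fact: one row per product per month, derived from the detail.
    CREATE TABLE fact_sales_monthly AS
    SELECT date_sk / 100 AS month_sk,      -- e.g. 20240105 -> 202401
           product_sk,
           SUM(quantity)     AS quantity,
           SUM(sales_amount) AS sales_amount
    FROM fact_sales_line
    GROUP BY date_sk / 100, product_sk;
""")

print(con.execute("SELECT * FROM fact_sales_monthly ORDER BY month_sk").fetchall())
```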
Sin no. 7: Not designing for and “fetching” all the related data
I will illustrate this with the “supermarket analogy”. When you have to rush to the supermarket to get nappies, toilet paper, bread or whatever the urgent need was, you also get a lot of non-urgent items, like the beer, milk, butter, cheese, toothpaste and washing powder, to ensure you don’t need another urgent little mission sometime soon.
So you apply the same principle to data warehouse design (and implementation). As an example, if you have to get the “claimed amounts” and “paid amounts” for each claim for the CEO’s urgent dashboard (aren’t they always urgent?), then you get all the data related to the claim and the claim payment processes: all the relevant dates (and the typical claims and claims payment processes have many related dates), by whom it was claimed, to whom it was paid, when (OK, we’ve covered that already), into which account, authorised by whom, for what item or which treatment, related to which policy, by which provider, and so on. That way, the data for any related query, report or dashboard that you may need in the future is already readily available in the data warehouse. It also has the side benefit that you don’t have to change the ETL process any time soon to fetch more of the related data.
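As a sketch of what that might look like for the claims example, the fact table below carries the two urgently needed measures plus references to all the related dimensions captured at the same time; every name is an assumption made up for illustration.

```python
import sqlite3

# Minimal sketch of "fetching all the related data": only claimed_amount and
# paid_amount are needed for the urgent dashboard, but the related context is
# designed and loaded at the same time. All names are illustrative assumptions.

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE fact_claim_payment (
        -- Measures for the urgent dashboard.
        claimed_amount  REAL,
        paid_amount     REAL,
        -- Related context captured up front, for future requirements.
        claim_date_sk   INTEGER,
        service_date_sk INTEGER,
        payment_date_sk INTEGER,
        member_sk       INTEGER,  -- by whom it was claimed
        payee_sk        INTEGER,  -- to whom it was paid
        account_sk      INTEGER,  -- into which account
        authoriser_sk   INTEGER,  -- authorised by whom
        treatment_sk    INTEGER,  -- for which item or treatment
        policy_sk       INTEGER,  -- related to which policy
        provider_sk     INTEGER,  -- by which provider
        claim_number    TEXT      -- degenerate dimension
    )
""")
```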
You may think you don’t have time to do this because of the urgent need. Believe me, it takes far more time to re-analyse, change and extend the design the next time an urgent need arises down the line. There is also a psychological advantage: if it took a short while longer the first time, you have already set the bar for expectations, so when you deliver faster the next time, because all the data is readily available, you perform and deliver above expectations.
Concluding comments
I often hear or read that dimensional data warehouses are limited – limited in what business requirements they can provide information for, and limited in flexibility. These claims are often touted by big data vendors who want to muscle their technology into the staging area, or by advocates of other modelling styles, like data vault models. If you avoid the seven deadly sins described above, your models should be extensible, flexible, and very rich in the amount and depth of information they can provide.
Hopefully, by avoiding these seven sinful practices, you will design and implement more efficient and more maintainable data warehouse databases – ones that are far more resilient to the inevitable changes that will occur.