In-memory data warehouses


In a previous post on in-memory databases I gave a high-level overview of SAP HANA, Oracle TimesTen and IBM's solidDB. In this post I discuss temperature-based memory management for data warehousing, as implemented in the Teradata Intelligent Memory release. Databases with a large in-memory capacity appeal to online transactional applications as well as to BI data warehouses, but here I focus only on memory management for data warehouses.

If we look at the definition of a data warehouse as "a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision making processes", nothing in it specifies on what type of device the data must be stored, nor in which format. However, we all know that it must be relatively easy to get to the required data, and the data warehouse should deliver good query and reporting performance, serving up reports, dashboards and ad hoc analyses within the timeframes that increasingly impatient users expect.

Now there are two schools of thought about this. The first says memory is becoming so affordable and available that you can put the whole data warehouse, or most of it, in memory. That is the approach advocated by SAP with the HANA in-memory database. Obviously the data warehouse in this instance needs to adhere to some limitations of reality – you won't put 12 years of hardly-used data in memory. Neither memory nor HANA's per-terabyte licensing is cheap enough for that. This is where the second school comes in: it holds that data which is not accessed often does not need to be stored in memory at all.

Teradata Intelligent Memory

Teradata's new Intelligent Memory option claims to provide a more 'intelligent' approach: the system monitors which data the users are accessing the most and then ensures that that particular data is stored on the storage medium appropriate to the access speed users require of it.

This approach builds on the old paradigm of data temperature, which works on the premise that the data being accessed the most is the most valuable at the time (and is therefore classified as 'hot' data), so it should reside on a correspondingly classified storage medium (such as in-memory storage for 'hot' data). When a data set is no longer accessed frequently (i.e. it becomes classified as 'cold'), it gets migrated to slower and lower-cost storage technology, such as solid state disks or conventional magnetic disk storage. In this way the cost of the storage media used stays in line with the relative value of the data to the organization.
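The temperature paradigm can be made concrete with a small sketch. This is purely illustrative – the thresholds, tier names and the `DataBlock` structure below are my own hypothetical assumptions, not how Teradata (or any vendor) actually implements it – but it shows the core idea: observed access frequency determines a temperature class, and the temperature class determines the storage tier.

```python
from dataclasses import dataclass

@dataclass
class DataBlock:
    """A hypothetical unit of warehouse data tracked by the tiering system."""
    name: str
    access_count: int = 0   # accesses observed in the current monitoring window
    tier: str = "disk"      # where the block currently resides

def classify(block: DataBlock, hot_threshold: int = 100, warm_threshold: int = 10) -> str:
    """Map an observed access count to a temperature class (assumed thresholds)."""
    if block.access_count >= hot_threshold:
        return "hot"
    if block.access_count >= warm_threshold:
        return "warm"
    return "cold"

def migrate(block: DataBlock) -> None:
    """Place a block on the storage tier matching its temperature."""
    target_tier = {"hot": "memory", "warm": "ssd", "cold": "disk"}
    block.tier = target_tier[classify(block)]

# A frequently queried block ends up in memory; a rarely touched one on disk.
orders = DataBlock("current_orders", access_count=250)
archive = DataBlock("orders_2004", access_count=2)
migrate(orders)    # orders.tier is now "memory"
migrate(archive)   # archive.tier stays "disk"
```

Run periodically over fresh access counts, a loop like this is what makes the placement dynamic rather than fixed by a DBA-defined policy.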

The convenient aspect of the Teradata Intelligent Memory implementation is that data architects and DBAs no longer have to define and implement database placement according to some pre-defined access policy. Traditionally the best they could do was an approximation based on classes of data – where the classes were defined both in terms of business entities and the age of the data. With a dynamic system implementation, it is all done automatically, at a much lower level of detail, and it is continuously adjusted according to changing data access patterns. So an intelligent learning system is not only more accurate, but it also reduces the amount of human management required to keep the placement at an optimal level.

In order to address the issue of new data suffering through a period of “average” performance, Teradata claims that they will provide a facility for data to be loaded in the correct place in a hot status, from where it can then cool down according to the access patterns.
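One way to picture that hot-on-load behaviour is a simple cooling model: new data enters at maximum temperature and decays towards cold with each monitoring cycle unless accesses warm it back up. The decay factor, access bump and 0-to-1 temperature scale below are assumptions of mine for illustration, not Teradata's actual mechanism.

```python
# Hypothetical cooling model: temperature ranges from 0.0 (cold) to 1.0 (hot).
HOT = 1.0
DECAY = 0.5   # assumed fraction of temperature retained per monitoring cycle

def load_hot() -> float:
    """Newly loaded data starts in the hot class, avoiding an 'average' warm-up period."""
    return HOT

def cool(temperature: float, accesses: int, bump: float = 0.1) -> float:
    """One monitoring cycle: decay the temperature, then warm it by observed accesses."""
    return min(HOT, temperature * DECAY + accesses * bump)

temp = load_hot()                # 1.0  – freshly loaded data is hot
temp = cool(temp, accesses=0)    # 0.5  – unused, starting to cool
temp = cool(temp, accesses=0)    # 0.25 – still unused, drifting towards cold
```

If users do start querying the data, the `accesses` term keeps the temperature high; if not, it cools down and the block drifts to cheaper tiers according to its access pattern, exactly as the post describes.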

Of course, to make the best use of this technology, the data and system architects must provide a tiered storage infrastructure with an acceptable price-performance rating across the respective tiers. The interesting aspect here is that as storage technology continuously improves, the tier configurations themselves can be adjusted over time as well.

Concluding remarks

Although I made a distinction between transactional databases and data warehouse databases in order to focus on the latter, the distinction isn't always that clear. One of the premises of a totally in-memory database like SAP HANA is that you do not necessarily have to copy the transactional data to a separate data warehouse. You can simply create analytical or calculation views over the operational transactional tables, providing a dimensional view through which the data can be reported and analysed. You may think that it is expensive to store the entire database in memory, but if you counter that with the fact that you no longer need a second (data warehouse) database, and no ETL processing either, then the cost justification argument becomes very different.

Some analysts feel that implementations based on data “temperature” may be more appropriate for data warehousing, BI and analytical processing, as you don’t always need all of that data warehouse data to be held in memory. Some other analysts reckon that it is merely a transitional technology as the industry is moving more and more towards an all in-memory future.

While only time will tell which way that pendulum will swing, in the meantime my recommendation is based on the required data validity period combined with the analytical requirements. If you are in a fast-changing business and you only analyse 6 months' data to determine trends, then a combined totally in-memory online database and data warehouse is the way to go. On the other hand, if you are periodically ploughing through 6 years' data for long-term trends and historic analyses, then a data warehouse using intelligent memory management will be more appropriate.
