
The Elastic Data Warehouse


Some days I get really flexed out of shape about the rate at which new terms spontaneously combust into existence in the BI and analytics industry. The latest I came across is the term “elastic data warehouse.” So what are we doing here – bending the rules, stretching the imagination or flexing our muscles? This term hasn’t even been entered on the free encyclopedia that anyone can edit yet, and it already appears on two vendors’ websites… So what exactly is an elastic data warehouse?

In a TDWI e-book titled “Demystifying Elastic Data Warehousing: Enabling High-Speed Analytics in the Cloud”, Jon Bock, vice president of product and marketing at Snowflake Computing, doesn’t give us a succinct definition of the elastic data warehouse as such, but he states that the concept of “elastic data warehousing is defined by flexibility, scalability, and agility that separate it from previous approaches”. Now that just may stimulate some philosophical post-graduate debate about recursive definitions…

Interestingly, Microsoft has also been quick to jump on the bandwagon, calling its SQL data warehouse “an elastic data warehouse as a service” which is supposedly the “industry-first enterprise-class cloud data warehouse that can grow, shrink, and pause in seconds”. They claim that “with Azure SQL Data Warehouse, storage and compute scale independently.” I suppose they mean that it is “computing” that scales?

Characteristics

OK, but let’s try and get a bit serious here. Snowflake Computing states that the elastic data warehouse can:

  • Scale up and down at any time without significant effort or disruption, as opposed to a traditional data warehouse that requires significant planning and effort to scale up and is often difficult or impossible to scale down. An elastic data warehouse’s scalability supports rapid iteration and evolution in data usage, which is good for supporting exploratory, experimental, and ad hoc analytics.
  • Respond to increasing demands without performance degradation. An elastic data warehouse can rapidly adapt to growth in data volume, data rate, concurrency, and query intensity without degrading performance because it can provide dedicated resources for different workloads and can acquire and use additional resources on the fly when workloads need them.
  • Adapt to evolution and changes in data without requiring significant redesign or re-architecting. The elastic data warehouse can quickly adapt to changes in data structure and content without requiring any schema redesign, additional data migration, or new data storage structures.
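As a concrete illustration of the first point, here is a sketch of how scaling up and down looks in Snowflake’s SQL dialect – the warehouse name and sizes are my own illustrative choices, not taken from the e-book:

```sql
-- Create a virtual warehouse that suspends itself when idle
-- and resumes automatically when a query arrives.
CREATE WAREHOUSE analytics_wh
  WAREHOUSE_SIZE = 'SMALL'
  AUTO_SUSPEND   = 300     -- seconds of inactivity before pausing
  AUTO_RESUME    = TRUE;

-- Scale up for a heavy month-end workload...
ALTER WAREHOUSE analytics_wh SET WAREHOUSE_SIZE = 'LARGE';

-- ...and back down afterwards, all without moving any data,
-- because compute is separated from storage.
ALTER WAREHOUSE analytics_wh SET WAREHOUSE_SIZE = 'SMALL';
```

The point of the sketch is that scaling is a metadata operation on the compute cluster; the stored data is untouched, which is exactly what makes scaling down feasible where a traditional warehouse would require re-distribution.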

OK, so for me, there is nothing too new there – we see very similar characteristics required from data lakes and modern analytics platforms. But granted, it all aligns pretty well with what organisations require today.

Technical capabilities

Now, surely you need some nifty technology to enable these capabilities. The following are listed as key enablers of elastic data warehouses, with my own commentary added in italics:

  • Cloud is a key enabling technology for elastic data warehousing. Cloud infrastructure offers a near-infinite, on-demand pool of resources available programmatically to applications such as a data warehouse. OK, I get the cloud bit – it’s not a necessity, but it’s really much more convenient and cost effective if someone else with storage, computing and administration capacity to spare manages it all – because they can offer the benefits of scale.
  • Schema and data model flexibility, such as schema-on-query and dynamic schemas, are necessary for an elastic data warehouse to be able to adapt quickly and easily to changes in data structure. OK, here I would really like to know how this is accomplished, while still making the data easily available to non-skilled users. More on this follows below.
  • Supporting multiple forms of data models (for example, both star and snowflake data models) also aids this flexibility. That’s nothing new really; we have that capability in many BI and analytics ecosystems, in fact, it’s even offered in some really old traditional data warehouse database management systems…
  • Adaptive optimisation makes it possible for a data warehouse to take advantage of resources that might change at any time. It does this by dynamically determining optimal data distribution, parallelism, and query decomposition based on resources needed and available at the time of execution. Dynamic data allocation and dynamic distributed query optimisation, although I agree they are necessary in such a distributed implementation, have been around for quite a while…
  • Standard SQL processing is a key requirement because so many existing skills, tools, and processes in place today understand and rely on support for standard SQL. Of course.

So, if I look at the capabilities required to make elastic data warehousing work, for me, where the rubber hits the road is in the schema and data model flexibility. All the other capabilities are also required of other platforms used for more dynamic analytics, for example as found in modern analytical platforms. <<ref>> So let’s home in on schema and data model flexibility.

Schema and data model flexibility

Snowflake Computing claims their key differentiator lies in “data elasticity”. Interestingly, this term “exists”, insofar as it was entered in Wikipedia in January 2016, with this explanation attributed to the Cassandra NoSQL database (dated 2011): “The elasticity of a data store relates to the flexibility of its data model and clustering capabilities. The greater the number of data model changes that can be tolerated, and the more easily the clustering can be managed, the more elastic the data store is considered to be.”

So, the elastic data warehouse model is stated to be “at least as tolerant of semi-structured and poly-structured data as its on-premises kith.” Don’t you love these “recursive comparative definitions”! Then it goes on: “Within Snowflake, we have a new data type called ‘VARIANT’ which is effectively taking a JSON (JavaScript Object Notation) document and putting it into a column in a relational table.” Now at first read, it seems we’re back to Interbase with its BLOBs (binary large objects) in the 1980s. But you have to read further…

In its dynamic optimisation engine, as the data is loaded, horizontally partitioned and spread across several servers behind the scenes, Snowflake doesn’t dynamically build tables (i.e. fix the structure) but records the definition of the contents at the metadata layer. So if there are 25 elements packed inside that JSON object, it will logically look like 25 “embedded” columns, but it won’t be physically stored as columns. If somebody issues a SQL query against that content, the dynamic optimiser will pull the JSON object apart and join it with the other data referenced in the query as if it were joining two relational tables at a column level.
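The mechanics are easy to sketch in Snowflake’s SQL dialect – the table and element names below are my own illustrative inventions, not from the e-book:

```sql
-- A VARIANT column holds whole JSON documents
-- alongside ordinary relational columns.
CREATE TABLE sensor_events (
  received_at TIMESTAMP,
  payload     VARIANT
);

-- At query time the "embedded" columns are addressed by path and cast;
-- the optimiser pulls the document apart as if it were columnar data.
SELECT payload:device.id::STRING AS device_id,
       payload:reading::FLOAT    AS reading
FROM   sensor_events
WHERE  payload:reading::FLOAT > 100;
```

Note that no schema for the document was declared anywhere: the 25 (or two, in this sketch) embedded elements are discovered at load time and recorded in the metadata layer.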

In this way it permits schema-on-read flexibility for semi-structured data, while it can enforce a predefined schema for traditional structured data types. Therefore, the system is “elastic” with respect to how it stores, structures, and manages data, to the extent that it can not only natively ingest JSON documents, but also handle changes to that JSON document structure. In that way, as long as you incorporate all the “bits and pieces” requiring flexibility into the JSON document part of the structure, it eliminates the frustrations of structured data preparation and loading. New elements and changes to the elements and their structures will not affect or break the SQL queries; only the removal of referenced objects will break a query – but that is obviously to be expected.
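To make that tolerance concrete, here is a hedged sketch (again with invented names) of what happens when the document structure evolves:

```sql
-- Suppose newer documents gained a firmware_version element.
-- Referencing it is safe: rows loaded before the element existed
-- simply yield NULL for that path instead of breaking the query.
SELECT payload:device.id::STRING        AS device_id,
       payload:firmware_version::STRING AS firmware_version
FROM   sensor_events;
```

Only removing an element that existing queries reference causes breakage, exactly as the e-book describes.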

Up to now, we’ve been focussing on Snowflake. So how does it work in the Microsoft world? Here you would put your unstructured data in Azure blob storage, and when required you would create database objects to define the data structure, and then run a Transact-SQL query to load the data into the defined objects. Another approach, using PolyBase, is to issue a Transact-SQL query that combines non-relational data from Azure blob storage with relational tables in the data warehouse. Through PolyBase you can also use Transact-SQL to seamlessly query across both relational data in a relational database and non-relational data in common Hadoop formats. So, in a way, it is similar to Snowflake’s approach, but there are a few more steps involved – it is not as seamless as the Snowflake implementation. For example, you may be required to define the schema of an object maybe three years after you stored it – I hope you can remember what the blob contained after all that time if it’s not a self-describing structure.
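Those extra steps look roughly like this in Transact-SQL – a hedged sketch only, with the data source, file format and table definitions being illustrative placeholders:

```sql
-- 1. Point the warehouse at the blob container.
CREATE EXTERNAL DATA SOURCE azure_blob
WITH (TYPE = HADOOP,
      LOCATION = 'wasbs://mycontainer@myaccount.blob.core.windows.net');

-- 2. Describe how the stored files are laid out.
CREATE EXTERNAL FILE FORMAT csv_format
WITH (FORMAT_TYPE = DELIMITEDTEXT,
      FORMAT_OPTIONS (FIELD_TERMINATOR = ','));

-- 3. Only now do you impose a schema on the stored data.
CREATE EXTERNAL TABLE ext_sales (
    sale_id     INT,
    customer_id INT,
    amount      DECIMAL(10, 2)
)
WITH (LOCATION = '/sales/',
      DATA_SOURCE = azure_blob,
      FILE_FORMAT = csv_format);

-- 4. Query the blob data in place, joined to relational tables.
SELECT c.customer_name, s.amount
FROM   ext_sales     AS s
JOIN   dbo.customers AS c ON c.customer_id = s.customer_id;
```

The contrast with Snowflake is visible in step 3: the schema lives in an object you must remember to create and maintain, rather than in metadata captured automatically at load time.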

Advantages of elastic data warehousing

So, some of the advantages of elastic data warehouses are similar to what you find for on-cloud ecosystems, data lakes or modern analytical platforms – namely flexibility and scalability. This has become a necessity for dealing with the unexpected volumes and types of data we deal with.

However, the unique advantage of elastic data warehouses lies in the data elasticity, which translates into increased agility and faster time to value. Instantiating the elastic data warehouse doesn’t require months of business analysis and dimensional model design, and then even more months of ETL code development and implementation, never mind the time-consuming aspects of system acquisition, configuration and ongoing administration.

One of the great use cases for elastic data warehouses is in monitoring and event-driven analytic applications. These applications need to incorporate new data and then update reports and dashboards on a continuous basis. Much of this data is application-generated semi-structured data, which requires a data warehouse that can flexibly adapt to handle it. The need to continuously ingest and process data also requires an elastic data warehouse in order to keep up with variations and spikes in data flow.

Concluding remarks

So although I have indicated scalability and model flexibility as “knowns” above, they are of course necessary in a modern-day BI and analytical platform, and just as much in an “elastic data warehouse”.

But the key differentiator, which then warrants the name “elastic” data warehouse, lies in the data elasticity. Providing the schema-on-read facility makes the implementation so much more dynamic, provided the definition of the content is already documented and available – which is clearly the case in Snowflake’s implementation using JSON. The danger with a blob-based implementation is that if you haven’t catalogued the blob’s content when you created it, the definition of the content may be lost over time.

However, all this requires that the structured and semi-structured data be physically copied into the data warehouse. In some applications this may be useful, especially where access to source systems may be restricted for security or performance reasons. But in many other cases, the equivalent functionality can be achieved through a data federation framework, without requiring large and less-used data to be physically copied to the data warehouse. That’s the main difference between data federation frameworks and data virtualisation on the one hand, and elastic data warehouses and BI ecosystems utilising data lakes on the other.
