«

»

Data Warehouse Appliance

Share

Data warehouse applianceA client recently asked me: “If I get an appliance, do I still need a data warehouse?” “No sir,” I replied, “kettles run on their own steam, unless you want to analyse the boil duration by the volume, the room pressure and the number of people waiting… then you need a data warehouse!” In this post I discuss data warehouse appliances, the advantages their bring to the table, and the burning points to be aware of. 

A data warehouse appliance is a combination of hardware and software that is designed and assembled specifically for data warehousing and analytical processing.

A data warehouse appliance comes with its own:

  • Hardware: Processors, memory, disk storage, network interfaces, etc.; and
  • Software: Operating system, database management system, ETL tool, reporting and dashboarding tool, and other applications for business analytics.

Most data warehouse appliances are highly parallelised and distributed data across integrated disk storage units, allowing independent processors to query data in parallel with little contention. Most data warehouse appliances are also pre-configured to optimally perform data warehousing and analytics processing. Some of the data warehouse appliance vendors are adding additional functionality, such as in-memory analytics, to compete with the mega-vendors.

The data warehouse appliances claim to be so readily configured to allow its purchaser to deploy a high-performance data warehouse right out of the box. I’ll get back to this point under caveats, below.

The first vendor to introduce the term “data warehouse appliance” was Netezza, back in 2003. Netezza employed sophisticated projection and restriction filters, and minimized data movement and I/O throughout the system, in order to speed up data warehouse processing. IBM acquired Netezza and relaunched the platform as the IBM PureData for Analytics, powered by Netezza, which is a simple data appliance for serious analytics. It simplifies and optimizes performance of data services for analytic applications, enabling very complex algorithms to run in minutes instead of in hours.

Other industry examples include Teradata and Oracle Exadata. SAP HANA is a combination of hardware, storage, operating system, management software and an in-memory database engine where the data is held in memory rather than being accessed on disk.

Advantages

The advantages of acquiring and deploying a data warehouse appliance are the following.

Faster time to value: A properly configured data warehouse appliance is delivered ready-to-go for immediate data loading and query execution. All the installation and integration of hardware, storage and software is done “in the factory”. It automatically delivers high performance with no tuning required, and in most cases with much shorter deployment cycles.

High availability: Most data warehouse appliances are architected for high availability. Most of the components are internally redundant for a robust, fail-safe, production-ready environment. Many offer warm-standby servers, dual networks, dual power-supplies and disk mirroring with failover and solutions for server failure.

Improved performance: In general, data warehouse appliances support mixed workloads, where a wide range of queries, reports and analytical processing jobs can run at the same time as the large ETL jobs that load the data warehouse. They use distributed query processing, parallelisation, and partitioning to improve performance and throughput.

Scalability: When a traditional data warehouse needs to be scaled out, the administrator needs to migrate all the data to a larger, more robust server. Data warehouse appliances scale well for both capacity and performance. When a data warehouse appliance needs to be scaled out, the appliance can simply be expanded by purchasing additional pug-and-play components.

The data warehouse appliance is built according to a modular design to which system administrators can add components incrementally. This eliminates both the typical up-front costs incurred when you need to over-provision for the future and the downtimes and costs associated with doing an unload-reload style upgrade as is typically required with large volumes of data.

Reduced administration: In a traditional data warehouse implementation, the database administrator needs to configure and tune the DBMS to perform well for large sets of users. With a data warehouse appliance, however, it is the vendor who is responsible for fine tuning the software to run optimally on the hardware.

A data warehouse appliance is a single vendor solution, and therefore the vendor takes ownership and responsibility for installing, configuring and tuning all the components of the software solution for optimal performance on the given hardware.

Reduced support: The software comes pre-installed and pre-configured as part of the appliance. This eliminates the customer’s time and costs for integration and regression testing of the storage solution, the operating system, the DBMS, the ETL and the reporting tools, and it removes some of the compatibility issues typically experienced in a multi-vendor solution.

A single vendor also results in a single support-point for problem-resolution and a simplified upgrade path for software and hardware.

Simpler architecture: With the advanced design of a data warehouse appliance, the available infrastructure is fast enough to load the EDW quick enough so that it can be used for operational reporting as well. This allows you to model your ODS as an active part of the EDW, and to run operational reporting straight off the data warehouse.

This configuration eliminates the need for a separate ODS – there is no more need for duplicate storage and duplicate processes; and no more inconsistencies between information reported from the ODS and information reported from the EDW. Such a configuration with the ODS contained within the EDW can save the organisation significant time, effort and money.

Smart analytics: Many of the data warehouse appliances provide built-in support for advanced analytical processes, where the algorithms required to run analytical models are provided as part of the database engine. This allows data scientists and other knowledge workers to run their analytical models directly on the data inside the appliance instead of having to offload it to a separate infrastructure and deal with the associated data preprocessing, transformation and movement.

Caveats

However, the pre-configured box is an empty shell. The BI team still has to analyse requirements, design and implement dimensional models, design and implement the data warehouse, design and implement reports and dashboards, and so forth.

One can attempt to fast track the data modeling part by acquiring industry models, but you have to be careful – these “off the shelf” data models don’t always fit the organisation’s business models so well; they often have to be adapted or customized. Besides, whether the models fit 100% or not, you still have to determine the business requirements and evaluate the extent to which the models support those requirements. Many of the store-bought models have too many layers too – some of them still have very complicated 3NF models in the middle that potentially wastes a lot of storage and processing.

Once the data warehouse repository has been designed and implemented, all the pipes must still have to be connected – the ETL processes to connect to the source transactional systems and transfer the data to the data warehouse have to be implemented. Data analysis of the source systems still needs to be done; and ETL processes must be designed, coded, implemented and tested. Likewise, reports and dashboards have to be designed, developed and implemented. Based on our experience on all our other projects, this is where the most effort lies.

The data warehouse appliance may come out pre-configured and pre-tuned on a systems level, but a DBA must still check the designed models, the coded report access paths, and set up the appropriate indexing and database content tuning. No vendor can anticipate that beforehand.

Bottom line

So, the kettle may come with a power cable, an element and a thermostat pre-installed and calibrated, but you still need to fill it with water, switch it on to boil, and then add the coffee to the plunger and pour the water on, not to mention the cups, cream, teaspoons…

A data warehouse appliance is the complete system of hardware servers, storage (e.g. SAN), operating system and database software that is housed in a single hardware structure, together with the necessary software pre-installed and optimized for data warehousing, BI and analytics.

By buying an appliance, you save the costs of procuring, installing and configuring the hardware, the data warehouse DBMS and database, the ETL tool, and the reporting and analytics components separately, and getting them to communicate efficiently with each other. All the configuration and tuning parameters are set as well. That all comes with the box.

But it is an empty shell on which the data warehouse has to be designed, implemented and populated. The ETL functionality must still be designed and implemented to connect the source systems to populate the data warehouse. Then, reports and dashboards must still be designed, constructed and deployed.

So a data warehouse appliance can contribute a significant up-front saving of time and money, but in the long run, the whole solution must still be specified, designed, implemented, tested and deployed, and that all must be managed to make optimal use of the appliance.

Leave a Reply