To meet the demands of a modern data-centric, analytics-driven organisation, one needs a more extensive analytics ecosystem than a traditional, simple “data source – ETL – data warehouse – dashboard” environment can provide. This is especially relevant given the more complex data types and structures, and the more dynamic data flows, that have to be supported. The trend now is to utilise a modern analytics platform.
So what constitutes a modern analytics platform? In a previous post I discussed how a modern analytics platform is utilised throughout the analytics lifecycle. However, to enable a wide set of users, with widely varying skill sets, to perform those steps, the modern analytics platform needs to contain a substantial set of interfacing, enabling and foundational capabilities. In this post I look under the hood at the technical components that make up a modern analytics platform. I conclude with an example of a modern analytics platform that is commercially available.
At a high level, the modern analytics platform has a number of user interface capabilities, underlying enabling technologies, different types of data storage areas and, very importantly, an integrated metadata management capability. Let us look at these in more detail.
User Interfaces
The various user interface capabilities are:
- Report / dashboard tool – conventional business intelligence tool to develop, test, implement and deploy ad hoc and productionised reports and dashboards.
- Data visualisation tool – to perform exploratory data discovery, data analysis, information presentation, data-driven storytelling and other forms of data-related collaboration and investigation.
- Data search facility – in such an extensive and complex environment, where a large variety of very different data sources are integrated and where a large number of information “widgets” and analytical insights are managed, it can become hard to locate data. The data search facility allows users to construct natural-language-style queries (à la Google) to locate data content across the entire analytics platform, viewed as an integrated ecosystem.
- Data capture facility – to capture, catalogue and store new datasets such as ad hoc defined hierarchical structures, new lookup sets and other external reference data.
- Collaboration facility – to share and discuss information and insights through discussion forums, commentary, “likes” and other concepts adapted from commonly used social media.
- Advanced analytical modelling workbench – to develop, test and interactively run advanced analytical models.
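To make the data search facility above more concrete, here is a minimal sketch of keyword matching over a catalogue of datasets registered in the platform's metadata store. The catalogue entries, field names and `search` function are all illustrative assumptions, not any particular product's API; a real facility would use a proper search index and richer natural language handling.

```python
# Illustrative catalogue of datasets, as registered in a metadata store.
catalogue = [
    {"name": "sales_fact", "tags": ["sales", "revenue", "star schema"],
     "description": "Daily sales transactions by store and product"},
    {"name": "customer_dim", "tags": ["customer", "dimension"],
     "description": "Customer master data with demographic attributes"},
    {"name": "web_clickstream", "tags": ["web", "events", "raw"],
     "description": "Unprocessed clickstream events landed in the data lake"},
]

def search(query: str) -> list:
    """Return the names of datasets whose metadata matches any query term."""
    terms = query.lower().split()
    hits = []
    for entry in catalogue:
        # Search across the name, description and tags of each dataset.
        haystack = " ".join(
            [entry["name"], entry["description"]] + entry["tags"]
        ).lower()
        if any(term in haystack for term in terms):
            hits.append(entry["name"])
    return hits

print(search("customer demographics"))
```

A query such as `search("customer demographics")` would surface `customer_dim` because its description and tags mention customers, even though the user never typed the dataset's physical name.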
All these interface capabilities work with data that is either stored as part of the modern analytics platform or reached through it. These capabilities are also all tightly integrated with the metadata management tool, the data storage facility and the federated data access facility, all described below.
Note that I discussed “interface capabilities” in generic terms here. Practically, these capabilities can be provided by a single interface “portal” or by a number of closely integrated or independent tools. In fact, a good “open” modern analytics platform would, in addition to providing these capabilities, also provide access for independent “best of breed” third party tools to perform these functions.
Data Storage
An analytics platform typically has two types of storage areas:
- A structured data store, used as a relational database or a data warehouse database for those datasets that have been transformed and persisted in dimensional data warehouse structures. The dimensional part of the data warehouse will only have star schemas for well-understood and frequently queried and analysed business processes; i.e. the dimensional data warehouse will form a productionised part of the ecosystem. Structured transactional data, although not necessarily in dimensional form, may also be stored in this area.
- A data lake, used both for unstructured data (i.e. as a conventional data lake or as a large-scale data staging area) and for structured data (either as a “database” for captured, summarised or derived data, or as a “staging area” for data en route to the data warehouse).
Note that this is a conceptual description. Technologically there may be more than two physical storage technologies utilised, for example an in-memory store, a columnar data warehouse store and a relational transactional database, as well as one or more unstructured data technologies such as Hadoop and/or a NoSQL database making up the data lake area.
Integrated Metadata Management
The integrated metadata management facility is the cornerstone of the analytics platform: it forms the glue that holds everything together and is the component through which all the other components interact. Conceptually, it consists of two very tightly integrated levels of metadata:
- Physical data dictionary, catering for technical metadata (e.g. data sources, mappings, structures, data records and items, table layouts, indexing structures, data types, databases, connectivity details, access methods, sizing and volume details, process logs, audit trails, usage logs, etc.)
- Logical business glossary, catering for logical / business metadata (e.g. logical data inventory, user views, business rules, definitions, meanings, interpretations, hierarchies, glossaries, derivations, etc.)
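The link between these two levels can be sketched with a few lines of code. This is a minimal, illustrative model (the class and field names are my own assumptions, not those of any metadata tool): each business glossary term points at the physical dictionary entries that implement it, which is what makes the tight integration between the two levels concrete.

```python
from dataclasses import dataclass, field

@dataclass
class PhysicalColumn:
    """Physical data dictionary entry: technical metadata for one column."""
    table: str
    column: str
    data_type: str
    source_system: str

@dataclass
class GlossaryTerm:
    """Logical business glossary entry: business metadata for one concept."""
    name: str
    definition: str
    # The integration point: each business term records which physical
    # columns implement it.
    implemented_by: list = field(default_factory=list)

revenue = GlossaryTerm(
    name="Net Revenue",
    definition="Gross sales minus returns and discounts, per accounting policy",
)
revenue.implemented_by.append(
    PhysicalColumn(table="sales_fact", column="net_rev_amt",
                   data_type="DECIMAL(18,2)", source_system="ERP")
)

print(revenue.implemented_by[0].table)
```

With this cross-link in place, a user browsing the business glossary can navigate straight to the physical tables behind a term, and a data engineer changing a table can find every business definition it affects.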
Enabling Technologies
In the engine room, there are a number of enabling technologies that together enable the analytics platform to provide the required functionality:
- Federated data access tool – a facility which transparently accesses and integrates data from a variety of local and remote databases and other data sources, to present this data to the users as a logically integrated and coherent data set in business terms. To this end, the federated data access tool decomposes each query into (physical) subqueries for submission to the relevant constituent databases and data systems, after which it integrates the result sets of those subqueries. Because various database management systems employ different query languages, federated database systems typically apply wrappers to the subqueries to translate them into the appropriate query languages. Through this data abstraction, the federated data access tool provides a uniform interface to all the user interface tools, thereby enabling users to search, store, access and retrieve data from multiple and different databases through a single (logical) query, no matter how different the constituent databases are with respect to structure and interface language.
- ETL tool – to move data from the source systems into the data lake or data warehouse areas, as an initial load and/or incrementally on an ongoing basis. It is also used to transform data from the data lake into the dimensional data warehouse, if required. This should also cover streaming data.
- Data discovery and mapping tool – a facility that interrogates remote database schemas and data structures, investigates their content data at a detailed level, and then catalogues the data in physical and logical terms in the integrated metadata management tool. The mapping component allows the data analyst to discover, explore and define mappings between the various data sources, using joins, fuzzy matches, business rules and other user-defined integration definitions.
- Data quality management tool – in an environment where a large variety of very different data sources are integrated, data quality measurement, reporting and integrated data quality management are of crucial importance. This ensures that quality information is provided to all stakeholders and, where this is not possible (e.g. due to poor-quality source data), that the level of data quality is known and made visible.
- Advanced analytics – a key requirement is the ability to do advanced data mining and to create machine learning and other statistical and advanced analytical models. The advanced analytics facility can access data across the entire ecosystem, and write new and derived insights, properly documented and catalogued, back into the storage area from where it can be reported or analysed in combination with the other data.
- Cognitive analytics – an emerging area which brings human interaction, intelligence and learning characteristics into analytical systems. It includes capabilities such as Natural Language Processing (NLP), machine learning, Artificial Intelligence (AI), semantic search and retrieval (across structured and unstructured data sources), large-scale parallel processing and reasoning, data robots, self-regulating feedback and iteration, and human-style question-and-answer interfaces.
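The federated query decomposition described above can be sketched as follows. This is a deliberately simplified illustration, assuming two hypothetical sources (a SQL warehouse and a NoSQL lake): the decomposer maps each logical table to the source that holds it, a "wrapper" translates each subquery into that source's dialect, and the stubbed executor stands in for remote submission before the results are merged.

```python
# Hypothetical source registry: which logical tables live in which source,
# and which query dialect each source speaks.
SOURCES = {
    "warehouse": {"tables": {"sales"}, "dialect": "sql"},
    "lake": {"tables": {"clicks"}, "dialect": "nosql"},
}

def decompose(logical_tables):
    """Map each requested logical table to the source that holds it."""
    plan = {}
    for table in logical_tables:
        for source, info in SOURCES.items():
            if table in info["tables"]:
                plan.setdefault(source, []).append(table)
    return plan

def wrap(source, table):
    """Wrapper: translate a subquery into the source's query language."""
    if SOURCES[source]["dialect"] == "sql":
        return f"SELECT * FROM {table}"
    return {"find": table}  # stand-in for a NoSQL query document

def execute(source, subquery):
    """Stub executor: a real tool would submit the subquery remotely."""
    return [{"source": source, "query": str(subquery)}]

def federated_query(logical_tables):
    """Decompose, wrap, execute and merge into one logical result set."""
    results = []
    for source, tables in decompose(logical_tables).items():
        for table in tables:
            results.extend(execute(source, wrap(source, table)))
    return results

rows = federated_query(["sales", "clicks"])
print([r["source"] for r in rows])
```

The caller issues one logical query over `sales` and `clicks` and gets back one merged result set, without knowing that the two tables live in different systems speaking different query languages.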
As mentioned above, in a “good” modern analytics platform, these enabling technologies are all very tightly interfaced with the integrated metadata management tool.
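As a small illustration of the data quality measurement mentioned above, the sketch below computes a completeness score per column, so that the level of data quality is known and made visible even where poor source data cannot be fixed. The sample records and metric are illustrative assumptions; real tools measure many more dimensions (validity, consistency, timeliness, etc.).

```python
# Illustrative sample of source records with some missing values.
records = [
    {"customer_id": 1, "email": "a@example.com", "country": "ZA"},
    {"customer_id": 2, "email": None, "country": "ZA"},
    {"customer_id": 3, "email": "c@example.com", "country": None},
    {"customer_id": 4, "email": None, "country": "NL"},
]

def completeness(rows, column):
    """Fraction of rows in which the given column is populated."""
    filled = sum(1 for row in rows if row.get(column) is not None)
    return filled / len(rows)

# A simple per-column quality report that could be published to stakeholders.
report = {col: completeness(records, col)
          for col in ["customer_id", "email", "country"]}
print(report)  # {'customer_id': 1.0, 'email': 0.5, 'country': 0.75}
```

Publishing such scores alongside reports lets consumers judge how much to trust a figure, which is the essence of making data quality visible rather than hiding it.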
Architectural Representation
Bringing it all together, conceptually, a modern analytics platform can be depicted as follows (adapted from an original image drawn up by Nathan Cortese of KPMG Australia):
Commercial Example – fraXses
fraXses is a good example of a modern analytics platform that is now commercially available all over the world. In their literature, fraXses is branded as a federated / virtualised, metadata-driven data application framework for the next generation of business solutions and analytics. Although fraXses is advertised as a federated data framework, it has all the makings of a modern analytics platform, and it is often deployed as such.
A key component of fraXses is its strong metadata management component (called LegoZ), which is actively utilised to overcome the very complex issues of data and process management, from the ingestion of real-time and siloed data stores through to data discovery and automated schema building. The fraXses metadata management facility is used not only to describe the data, but also to manage the following definitions:
- business and process rules;
- data definitions;
- joining of data;
- extended data definitions;
- data mappings; and
- real time business rules and logic definitions and decisions.
fraXses provides standardised API-based access for third party query, reporting, dashboard, visualisation and analytics workbenches. It also provides its own user-friendly interfaces for analytics and data interaction.
Under the hood it has a number of very powerful enabling technologies, including but not limited to these:
- A federated database management platform (called fraPses), in which several databases appear to function as a single entity through a process known as virtualisation. These databases can be in different formats, locations and technologies. The technology behind fraXses allows the solution to view all this disparate data in a unified way – there is no need for data duplication and no need for instantiated aggregation.
- An advanced analytics engine (including integration with R, Apache Spark, etc.)
- A data discovery facility (called Discovery).
- A rules and decision engine (called Fathom).
In terms of storage capabilities, fraXses allows data to be seamlessly on-boarded from any data source in order to create a “data lake” for storing structured or unstructured data. This data can be stored in a combination of in-memory, flash, SATA or simply in the cloud. It ships with two types of relational databases, two types of columnar databases, some NoSQL data storage capabilities, as well as what the vendor describes as the world’s fastest time series database.
The fraXses architecture is illustrated in the following diagram (source www.fraXses.com):
fraXses represents a tightly integrated framework of the components typically found in an analytics platform, with its “cleverest” capabilities being its rich metadata management and federated data access that can be effectively utilised to handle vast volumes of data to perform complex real-time analytics.
More detail on fraXses can be found at www.fraXses.com.