The Big Data movement is transforming long-established data warehouse architectures into multifaceted analytical ecosystems, in which the time dimension has also been ratcheted up a few notches. In the good old days data was extracted from operational systems, batch-processed through a data warehouse, transformed into information and delivered to a relatively small group of business users. Nowadays far more data flows from origins far more varied and dynamic than before, to many more users in more varied roles, through a wider variety of fast-changing channels, each tailored to the type of data it transports and the type of user who needs to consume it. The monolithic enterprise data warehouse has been transformed into a vast and dynamic BI ecosystem. (The style of information consumption is also changing rapidly, but we will cover that in another post.)
In fact, Gartner states that: “Big Data is moving from a focus on individual projects to an influence on enterprises’ strategic information architecture. Dealing with data volume, variety, velocity and complexity is forcing changes to many traditional approaches. This realization is leading organizations to abandon the concept of a single enterprise data warehouse containing all information needed for decisions. Instead they are moving towards multiple systems, including content management, data warehouses, data marts and specialized file systems tied together with data services and metadata, which will become the “logical” enterprise data warehouse.”
There are two ways to work with Big Data. The first, and the one I would recommend wherever possible, is to force structure onto the Big Data as early in the process as possible, filter out the relevant parts, and then process the resulting structured information using conventional tools and systems in the BI environment. This works well if your data is mainly complex rather than voluminous, so you don't need any additional technologies to store and process it. The approach covered in the remainder of this post is required when you have very large volumes of Big Data, or when you need to store and process it using a storage structure such as a Hadoop cluster or a specialised NoSQL database. Then you end up with a more complex BI ecosystem, as illustrated in the following diagram.
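To illustrate the first approach, here is a minimal sketch in Python of forcing structure early: parsing semi-structured JSON events, keeping only the relevant fields, and writing a flat file that conventional BI tools can load. The file names and field names are purely hypothetical.

```python
import csv
import json

# Hypothetical semi-structured source: one JSON event per line.
SOURCE = "raw_events.jsonl"
TARGET = "structured_events.csv"

# Keep only the fields the BI environment actually needs.
KEEP = ["event_id", "customer_id", "event_type", "amount", "occurred_at"]

with open(SOURCE) as src, open(TARGET, "w", newline="") as out:
    writer = csv.DictWriter(out, fieldnames=KEEP)
    writer.writeheader()
    for line in src:
        event = json.loads(line)
        # Filter early: only the events relevant to the warehouse, e.g. purchases.
        if event.get("event_type") == "purchase":
            writer.writerow({field: event.get(field) for field in KEEP})
```

Once the data is structured and filtered like this, the rest of the pipeline can stay entirely conventional.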
Take the ETL processes to the data
The moment you introduce a Hadoop cluster or a specialised NoSQL database into your IT environment, chances are you will have to upgrade your previously (hopefully) simple BI environment to a more complex BI ecosystem. This is especially the case if you are dealing with large volumes of data. Forcing structure on complex semi-structured data can still be done in a relatively simple and streamlined environment, but as soon as you start dealing with Big Data volumes (volumes that your conventional systems cannot cope with, as per some of the definitions out there) you have to adopt different approaches to deal with them. The most efficient approach is to take the processing to the data, because moving the data to the processes is too cumbersome and time consuming, not to mention the processing and storage costs of copying such volumes of data. When you take an ETL process (or any valid permutation of those three letters) to the data, you make that particular process, as well as that particular data source, part of the BI ecosystem.
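As a minimal sketch of taking the processing to the data, the following PySpark job (paths and column names are hypothetical) filters and aggregates raw clickstream data where it already lives on the cluster, so that only the compact summary ever travels towards the warehouse.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Hypothetical HDFS locations and column names.
spark = SparkSession.builder.appName("push-etl-to-data").getOrCreate()

# The raw data stays where it was collected.
raw = spark.read.json("hdfs:///data/raw/clickstream/")

# The transform runs on the cluster nodes that already hold the data;
# only the small aggregated result needs to leave the cluster.
daily_summary = (
    raw.filter(F.col("event_type") == "page_view")
       .groupBy("site_id", F.to_date("occurred_at").alias("day"))
       .agg(F.count("*").alias("page_views"),
            F.countDistinct("visitor_id").alias("unique_visitors"))
)

# Write the compact result to a location the warehouse load can pick up.
daily_summary.write.mode("overwrite").parquet("hdfs:///data/curated/daily_traffic/")
```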
Dynamic data exploration
However, the Big Data movement is not only about preparing and making large volumes and new sources of data available. It is also about facilitating flexible, ad hoc data exploration and the rapid development of prototype style analytical applications. In this new world, users cannot anticipate the questions they will ask nor the data they will need to answer those questions.
Often, the data they require will not even be in the data warehouse. So if the data scientists (or similarly titled knowledge workers) want to explore and analyse the raw data, the raw data itself becomes part of the BI ecosystem. The traditional approach of cutting a small data mart, or creating a separate sandbox environment for them to work in, simply is not feasible when dealing with extreme volumes. As with the ETL processes described above, you have to take exploratory analytics and prototype analytical applications to the data as well.
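To make this concrete, here is a small illustrative sketch (again with hypothetical paths and columns) of exposing raw files in place as a queryable view, so a data scientist can ask an ad hoc question without first copying the data into a mart or sandbox.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("adhoc-exploration").getOrCreate()

# Expose the raw files as a queryable view without moving them anywhere.
spark.read.json("hdfs:///data/raw/support_tickets/").createOrReplaceTempView("tickets")

# An ad hoc question nobody anticipated when the warehouse was designed:
# which product lines generate the longest-running tickets this quarter?
answer = spark.sql("""
    SELECT product_line,
           AVG(datediff(closed_at, opened_at)) AS avg_days_open,
           COUNT(*)                            AS ticket_count
    FROM tickets
    WHERE opened_at >= '2013-01-01'
    GROUP BY product_line
    ORDER BY avg_days_open DESC
    LIMIT 20
""")
answer.show()
```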
Data blending
The new, expanded environment also needs to give developers the power to create and use dashboards built with in-memory visualisation tools that point both to the corporate data warehouse and to other, larger and more dynamic data sources. The traditional ETL approach is being replaced by data blending, which minimises data movement and increases data availability across co-existing databases, with the data often housed in different types of structures. Dynamic data blending is causing organisations to move away from treating data integration (ETL processing) as a separate discipline, towards an approach where data integration, data quality, metadata management and data governance are managed together.
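As a simple illustration of data blending, the following sketch (file names and columns are hypothetical) joins a small dimension extract from the corporate data warehouse with a pre-aggregated summary produced next to a larger, more dynamic source; only the compact result sets ever move.

```python
import pandas as pd

# Hypothetical inputs: a small customer dimension extracted from the corporate
# data warehouse, and a pre-aggregated usage summary produced alongside the
# Big Data source. The raw detail stays where it lives.
customers = pd.read_csv("warehouse_customer_dim.csv")           # customer_id, segment, region
site_usage = pd.read_parquet("daily_usage_summary.parquet")     # customer_id, day, page_views

# Blend the two sources on a shared key for a dashboard-style view.
blended = site_usage.merge(customers, on="customer_id", how="left")

usage_by_segment = (
    blended.groupby(["segment", "day"], as_index=False)["page_views"].sum()
)
print(usage_by_segment.head())
```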
Semantic layer
In an environment where data blending is employed, metadata in the form of a semantic layer plays a very important role. With traditional data warehousing, every time something changed, the data warehouse and the ETL processes had to change with it. A semantic layer hides the underlying structures, definitions and implementation details, which makes the environment far more resilient to change. You can add or change data sources "under the covers" and surface the changes by simply updating the semantic layer. The semantic layer also plays a very important role in hiding the discrepancies between what the business calls data entities and attributes and how they are physically implemented. With the number of data entities and attributes growing so rapidly, this takes care of a large number of translation problems and other frustrations between IT and the business.
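A toy sketch of the idea, with entirely hypothetical table and column names: business-facing terms are resolved through a mapping, so a source can change under the covers by updating the mapping rather than every report that depends on it.

```python
# A minimal, illustrative semantic layer: business names map to their
# physical implementation. All names below are hypothetical.
SEMANTIC_LAYER = {
    "Customer":        {"table": "dw.cust_dim_v2",     "key": "cst_id"},
    "Monthly Revenue": {"table": "dw.fct_rev_monthly", "column": "rev_amt_usd"},
    "Web Sessions":    {"table": "hive.web_sessions",  "column": "session_cnt"},
}

def physical_source(business_term: str) -> str:
    """Resolve a business term to the physical table that implements it."""
    return SEMANTIC_LAYER[business_term]["table"]

# A report asks for "Monthly Revenue"; the layer decides where it really lives.
print(physical_source("Monthly Revenue"))   # -> dw.fct_rev_monthly
```

If the revenue fact later moves to a different store, only the mapping changes; the reports keep asking for "Monthly Revenue".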
Co-existing toolsets
A problem related to having a number of co-existing databases used for analytics is that you may also end up with a larger number of toolsets that have to be supported in the BI ecosystem. A single corporate-standard data integration tool doesn't make the grade anymore. For example, in addition to a traditional ETL environment, you now also need a mishmash of Apache projects, such as Flume, Sqoop, Oozie, Pig, Hive and ZooKeeper, to manage and get to the data in your Hadoop environment. These independent projects often contain competing or overlapping functionality, have separate release schedules and aren't always tightly integrated. Each toolset also evolves rapidly at its own pace. Infrastructure and technology management has suddenly become very interesting and challenging.
Concluding remarks
Solutions now exist to process massive amounts of data in real time, to search and analyse any type of unstructured or semi-structured data, and to deliver this information to anyone, almost anywhere.
But Big Data is not only about the technologies required to store massive amounts of data. It is more about creating a flexible infrastructure that enables high-performance computing, high-performance analytics and governance, in a deployment model that makes sense for each particular organisation.
Embarking on extreme-scale Big Data initiatives definitely means extending the BI ecosystem and taking information management disciplines to a new level. The data and BI strategy now needs to be extended to incorporate the new requirements and to articulate a longer-term data and technology vision, one that caters for a much faster-evolving business that is changing existing products, services, markets and channels, and developing new ones, based on the new levels of information it now has to deal with.