
Big Data Categorised

In my previous post I criticised the 3Vs as a definition of Big Data, offered our own working description and presented “6VS” as a more encompassing characterisation. In this post I categorise the different types of Big Data to explain why there cannot be a single definition. Categorising the various types of Big Data also points us to the respective storage and processing options.

The type of Big Data most prevalent in most organisations is unstructured text, which constitutes up to 80% of their overall data content. This includes documents, presentations, web content, email and social media feeds. While only a handful of organisations may have enough data volume to justify a Hadoop implementation, the average organisation can get by with a well organised and indexed document management system, or a NoSQL document database like MarkLogic, Couchbase or MongoDB. Some textual data, like Twitter feeds, simple emails and standard documents, is relatively straightforward to analyse once you have managed to force a structure over the text, but more complex texts, like complicated email threads with attachments or voluminous documents, have more complex structures and are harder to analyse. Of the characteristics listed in my previous post, unstructured text usually only adheres to variability and sparseness.

Application log files are an exception to the above: they are unstructured text to which velocity may be added as a characteristic, that is, if they need to be analysed in real time. While emails and tweets may “arrive” with a high velocity, they rarely need to be analysed in real time. Log files are also an exception in that hardly anyone would ever store them “inside” a database, not even in a specialised NoSQL database. Log files are typically stored directly on disk, and are most often analysed with a specialist tool such as Splunk.
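To make the idea of forcing a structure over log text a little more concrete, here is a minimal Python sketch. The log line format, the field names and the sample lines are all assumptions made for illustration; real application logs vary widely, and a tool like Splunk does far more than this.

```python
import re
from collections import Counter

# Hypothetical log format (an assumption): "<date> <time> <LEVEL> <message>"
LOG_PATTERN = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<level>[A-Z]+) "
    r"(?P<message>.*)$"
)

def parse_line(line):
    """Return a dict of named fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line.strip())
    return match.groupdict() if match else None

def count_levels(lines):
    """Count parsed records per severity level, skipping unparseable lines."""
    records = (parse_line(line) for line in lines)
    return Counter(record["level"] for record in records if record is not None)

# Made-up sample lines, including one that does not fit the pattern.
sample = [
    "2013-02-01 10:15:02 INFO user login ok",
    "2013-02-01 10:15:07 ERROR payment gateway timeout",
    "free-form text that does not fit the pattern",
    "2013-02-01 10:15:09 ERROR payment gateway timeout",
]
level_counts = count_levels(sample)
print(level_counts)
```

Lines that do not fit the assumed pattern are simply skipped here, which hints at the sparseness of such data: only part of the text yields usable fields.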

An interesting class of Big Data is graph structures, where entities are related to each other in complex relationships such as trees or networks. Examples include the networks of people that know each other, as you find on LinkedIn, or the groups of subscribers in caller groups that regularly call each other. This type of data is neither large nor unstructured, but graph structures of undetermined depth are very complex to store and process using standard SQL. For this reason this type of data is classified as Big Data, and it is usually stored in graph-oriented NoSQL databases such as Neo4j, InfoGrid, InfiniteGraph, uRiKa, OrientDB and FlockDB. Geospatial data itself is very structured, but when analysing and optimising routes, for example, the data is best stored and processed using a graph structure as well. Of all the characteristics, graph-oriented geospatial data mostly only satisfies variability, unless routes are analysed in real time, in which case velocity can also come into play.
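As a rough illustration of why relationships of undetermined depth suit a graph representation, the Python sketch below walks a small "knows" network breadth-first and reports how many hops separate each person from a starting point. The names and relationships are invented, and an in-memory adjacency list merely stands in for what a graph database would manage. In standard SQL the same question would typically need one self-join per hop, or a recursive query.

```python
from collections import deque

# Hypothetical "knows" relationships, stored as an adjacency list.
knows = {
    "Ann": ["Bob", "Cara"],
    "Bob": ["Ann", "Dev"],
    "Cara": ["Ann"],
    "Dev": ["Bob", "Eve"],
    "Eve": ["Dev"],
    "Finn": [],
}

def degrees_of_separation(graph, start):
    """Breadth-first search: map each reachable person to their hop count from start."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        person = queue.popleft()
        for contact in graph.get(person, []):
            if contact not in distance:  # first visit is the shortest path
                distance[contact] = distance[person] + 1
                queue.append(contact)
    return distance

separation = degrees_of_separation(knows, "Ann")
print(separation)
```

The traversal does not need to know in advance how deep the network goes; it simply stops when no unvisited contacts remain. People with no path to the starting point, like Finn in this made-up data, do not appear in the result.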

Audio and video data sometimes consume very large volumes, but in reality they need to be transcribed and filtered so that they can be processed in the same manner as unstructured text. This is therefore a class of Big Data where volume can be a consideration, and sparseness obviously too.

Scientific data and sensor data often rate high in volume and velocity. They are often best stored in key-value databases, such as Riak, DynamoDB, Redis and Voldemort. Scientific and sensor data are normally quite structured, but often have variable record structures with widely differing measurements, so variety and variability are also valid characteristics. Veracity sometimes comes into play when capturing external data where we have no influence as to what gets captured or what it may mean.
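The combination of a simple lookup key with variable record structures can be sketched in a few lines of Python. A plain dict stands in for the key-value store here; no real Riak, DynamoDB, Redis or Voldemort client is used, and the sensor names, timestamps and measurement fields are all invented for illustration.

```python
# A plain dict stands in for a key-value store (an assumption for illustration).
store = {}

def put_reading(sensor_id, timestamp, measurements):
    """Store one reading under a composite key; the value has no fixed schema."""
    store[(sensor_id, timestamp)] = dict(measurements)

def readings_for(sensor_id):
    """Return (timestamp, measurements) pairs for one sensor, ordered by timestamp."""
    pairs = [(ts, values) for (sid, ts), values in store.items() if sid == sensor_id]
    return sorted(pairs, key=lambda pair: pair[0])

# Records for the same sensor need not carry the same measurements.
put_reading("weather-1", "2013-02-01T10:05:00", {"temp_c": 21.7})
put_reading("weather-1", "2013-02-01T10:00:00", {"temp_c": 21.5, "humidity_pct": 40})
put_reading("pump-7", "2013-02-01T10:00:00", {"pressure_kpa": 310.2, "rpm": 1450})

weather = readings_for("weather-1")
print(weather)
```

Each value is an opaque bag of measurements, so a new sensor type or a missing measurement needs no schema change. The trade-off is that anything beyond lookup by key, such as the scan in `readings_for`, is left to the application.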

Of all these categories, I would stick my neck out and say that most add relatively little new value; they mostly augment existing value. The exceptions are scientific and sensor data and possibly graph optimisations, where significant additional or new business value can be derived.

So if we tabulate the types vs the characteristics – bearing in mind this is a gross generalisation – we see a very interesting phenomenon:

Volume Variety Velocity Veracity Variability Value Sparseness
Unstructured text: X X
Email, tweets: X X X
Log files: X X X
Graph data: X X X
Audio and video: X X
Sensor data: X X X X X
Scientific data: X X X (blank) X X (blank)

Firstly, with the exception of scientific data, no category of Big Data really ticks the “traditional” 3Vs properly, but then, how many “average” organisations really deal with large volumes of “proper” scientific data that has those characteristics?

Secondly, very few of the Big Data types have ticks in the Volume column. In fact, more have ticks in the Velocity column. That hardly spells Hadoop for me. When I look at the various types of Big Data as categorised above, it suggests that in most cases one of the NoSQL databases listed above would be more suitable for storing and analysing that particular type of Big Data.

Lastly, with such widely differing characteristics and applications, no wonder we cannot get to a single definition. While we all know the over-hyped term “Big Data” is here to stay, just like many other misnomers in our industry, we would avoid a lot of confusion if we rather referred to the type or category of Big Data that we are dealing with in a particular situation. You don’t do sentiment analysis on all the types of Big Data, or do you now? Really, do you analyse graph structures and sensor data for that too? Likewise, you don’t do route optimisation on Twitter feeds or audio files, or do you now? Let’s be more specific and call a spade a spade, at least when we are talking about spades.
