Big data in healthcare


Alternative big data technologies in healthcare

Some healthcare practitioners smirk when you tell them that you used an alternative treatment such as homeopathy or naturopathy to cure an illness. However, in the long run it sometimes really is a much better solution, even if it takes longer, because it encourages and enables the body to fight the disease naturally, and in the process build up the necessary long-term defence mechanisms. Likewise, some IT practitioners question it when you don’t use the “mainstream” technologies… So, in this post, I cover the “alternative” big data technologies. I explore the different types of big data and the NoSQL databases that cater for them. Using healthcare examples, I illustrate the types of applications and analyses that each is suitable for.

Big data in healthcare

Healthcare organisations have become very interested in big data, no doubt spurred on by the hype around Hadoop and the ongoing promises that big data really adds big value. However, big data means different things to different people. For example, for a clinical researcher it is unstructured text on a prescription, for a radiologist it is the image of an x-ray, for an insurer it may be the network of geographical coordinates of the hospitals they have agreements with, and for a doctor it may refer to the fine print on the schedule of some newly released drug. For the CMO of a large hospital group, it may even constitute the commentary that patients are tweeting or posting on Facebook about their experiences in the group’s various hospitals.

So, big data is a very generic term for a wide variety of data, including unstructured text, audio, images, geospatial data and other complex data formats, which previously were not analysed or even processed. There is no doubt that big data can add value in the healthcare field. In fact, it can add a lot of value, partly because of the different types of big data that are available in healthcare. However, for big data to contribute significant value, we need to be able to apply analytics to it in order to derive new and meaningful insights. And in order to apply those analytics, the big data must be in a processable and analysable format.


Enter yellow elephant, stage left. Hadoop, in particular, is touted as the ultimate big data storage platform, with very efficient parallelised processing through the MapReduce distributed “divide and conquer” programming model. However, in many cases, it is very cumbersome to store a particular healthcare dataset in Hadoop and then try to get to analytical insights using MapReduce. So even though Hadoop is an efficient storage medium for very large data sets, it is not necessarily the most useful storage structure to use when applying complex analytical algorithms to healthcare data. Quick cameo appearance. Exit yellow elephant, stage right.

There are other “alternative” storage technologies available for big data as well – namely the so-called NoSQL (not only SQL) databases. These specialised databases each support a specialised data structure, and are used to store and analyse data that fits that particular data structure. For data that fits one of these structures, the matching database is therefore often a more appropriate place to store, process and extract insights from it.
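To make the MapReduce “divide and conquer” model mentioned above a little more concrete, here is a minimal single-machine sketch in plain Python. It is not Hadoop code and uses no Hadoop API; the record layout, the diagnosis codes and all the function names are made up for illustration. It only shows the shape of the model: split the data, map each chunk independently, group by key, then reduce.

```python
from collections import defaultdict
from itertools import chain

# Hypothetical input: each record is (patient_id, diagnosis_code).
RECORDS = [
    ("p1", "E11"), ("p2", "I10"), ("p3", "E11"),
    ("p4", "J45"), ("p5", "I10"), ("p6", "E11"),
]


def split(records, n_chunks):
    """Divide: hand every n-th record to the same (pretend) worker."""
    return [records[i::n_chunks] for i in range(n_chunks)]


def map_phase(chunk):
    """Map: emit a (key, 1) pair for every diagnosis code in one chunk."""
    return [(code, 1) for _patient, code in chunk]


def shuffle(mapped_chunks):
    """Shuffle: group the emitted values by key across all chunks."""
    grouped = defaultdict(list)
    for key, value in chain.from_iterable(mapped_chunks):
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce (conquer): combine the values for each key into one count."""
    return {key: sum(values) for key, values in grouped.items()}


def count_diagnoses(records, n_chunks=3):
    # On a real cluster the map calls would run in parallel on different nodes;
    # here they simply run one after the other.
    mapped = [map_phase(chunk) for chunk in split(records, n_chunks)]
    return reduce_phase(shuffle(mapped))


print(count_diagnoses(RECORDS))
```

A simple count like this fits the model well. The cumbersome part tends to start when an analysis does not decompose neatly into independent map and reduce steps.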

Unstructured text

A very large portion of big data is unstructured text, and this definitely applies to healthcare too. Even audio eventually becomes transformed to unstructured text. The NoSQL document databases are very good for storing, processing and analysing documents consisting of unstructured text of varying complexity, typically contained in XML, JSON or even Microsoft Word or Adobe format files. Examples of the document databases are Apache CouchDB and MongoDB. The document databases are good for storing and analysing prescriptions, drug schedules, patient records, and the contracts written up between healthcare insurers and providers.

On textual data you perform lexical analytics such as word frequency distributions, co-occurrence analysis (counting how often particular words appear together in a sentence, paragraph or even a document), finding sentences or paragraphs in which particular words appear within a given distance of each other, and other text analytics operations such as link and association analysis. The overarching goal is, essentially, to turn unstructured text into structured data, by applying natural language processing (NLP) and analytical methods. For example, if a co-occurrence analysis found that BRCA1 and breast cancer regularly occurred in the same sentence, one might infer a relationship between breast cancer and the BRCA1 gene. Nowadays co-occurrence in text is often used as a simple baseline when evaluating more sophisticated systems.

Rule-based analyses make use of some a priori information, such as language structure, language rules, specific knowledge about how biologically relevant facts are stated in the biomedical literature, the kinds of relationships or variant forms that they can have with one another, or subsets or combinations of these. Of course, the accuracy of a rule-based system depends on the quality of the rules that it operates on.
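The sentence-level co-occurrence analysis described above can be sketched in a few lines of Python. This is a simple illustration under stated assumptions: the term list and the sample text are invented, the sentence splitter is a naive regular expression rather than a proper NLP tokeniser, and matching is plain case-insensitive whole-word matching.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical terms of interest (lower case) and invented sample text.
TERMS = {"brca1", "breast cancer", "tamoxifen"}

SAMPLE_TEXT = (
    "Mutations in BRCA1 increase the risk of breast cancer. "
    "Tamoxifen is used to treat breast cancer. "
    "BRCA1 testing was discussed with the patient. "
    "Carriers of BRCA1 variants are screened for breast cancer earlier."
)


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def cooccurrences(text, terms):
    """Count, for every pair of terms, the sentences that contain both."""
    counts = Counter()
    for sentence in split_sentences(text):
        lowered = sentence.lower()
        present = sorted(
            term for term in terms
            if re.search(r"\b" + re.escape(term) + r"\b", lowered)
        )
        # Every unordered pair of terms found in this sentence co-occurs once.
        counts.update(combinations(present, 2))
    return counts


for pair, n in cooccurrences(SAMPLE_TEXT, TERMS).most_common():
    print(pair, n)
```

A high count only says that two terms are mentioned together. It says nothing about the kind of relationship between them, which is why co-occurrence works better as a baseline than as a conclusion.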
Statistical or machine-learning–based systems operate by building classifications, from labelling parts of speech to choosing syntactic parse trees to classifying full sentences or documents. These are very useful for turning unstructured text into an analysable dataset. However, these systems normally require a substantial amount of already labelled training data, which is often time-consuming to create or expensive to acquire.

It’s also important to keep in mind that much textual data requires disambiguation before you can process, make sense of, and apply analytics to it. Ambiguity, such as multiple possible mappings between words and their meanings or categories, makes it very difficult to accurately interpret and analyse textual data. Acronym, slang and shorthand resolution, interpretation, standardisation, homographic resolution, taxonomies and ontologies, textual proximity, cluster analysis and various other inferences and translations all form part of textual disambiguation. Establishing and capturing context is also crucial for unstructured text analytics – the same text can have radically different meanings and interpretations, depending on the context in which it is used.

As an example of the ambiguities found in healthcare, “fat” is the official symbol of Entrez Gene entry 2195 and an alternate symbol for Entrez Gene entry 948. The distinction is not trivial – the first is associated with tumour suppression and with bipolar disorder, while the second is associated with insulin resistance and quite a few other unrelated phenotypes. If you get the interpretation wrong, you can miss information or extract the wrong information altogether.
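To illustrate why context matters for the “fat” example above, here is a deliberately simplistic Python sketch of context-based disambiguation. The two Entrez Gene IDs and their associations come from the paragraph above; the context keyword sets, the index layout and the scoring rule are my own assumptions for illustration. A real system would work from a full gene database and far richer context.

```python
import re

# Symbol index built from the example in the text. The "context" keyword sets
# are hypothetical hints derived from the associations mentioned above.
SYMBOL_INDEX = {
    "fat": [
        {"gene_id": 2195, "kind": "official",
         "context": {"tumour", "suppression", "bipolar"}},
        {"gene_id": 948, "kind": "alternate",
         "context": {"insulin", "resistance"}},
    ],
}


def candidates(symbol):
    """Return every gene entry that the symbol could refer to."""
    return SYMBOL_INDEX.get(symbol.lower(), [])


def disambiguate(symbol, sentence):
    """Pick the candidate whose context keywords overlap the sentence most.

    Returns the gene ID, or None when there is no evidence or a tie.
    """
    words = set(re.findall(r"[a-z]+", sentence.lower()))
    scored = [
        (len(entry["context"] & words), entry["gene_id"])
        for entry in candidates(symbol)
    ]
    if not scored:
        return None
    best = max(score for score, _gene_id in scored)
    if best == 0:
        return None  # no contextual evidence at all, so do not guess
    winners = [gene_id for score, gene_id in scored if score == best]
    return winners[0] if len(winners) == 1 else None


print(disambiguate("fat", "Loss of fat impaired tumour suppression in these cells"))
print(disambiguate("fat", "fat expression correlated with insulin resistance"))
print(disambiguate("fat", "a diet high in fat"))
```

Returning None instead of guessing is a design choice: for this kind of data, a missed extraction is usually easier to live with than a wrong one.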

Graph structures

An interesting class of big data is graph structures, where entities are related to each other in complex relationships like trees, networks or graphs. This type of data is typically neither large nor unstructured, but graph structures of undetermined depth are very complex to store in relational or key-value pair structures, and even more complex to process using standard SQL. For this reason this type of data can be stored in a graph-oriented NoSQL database such as Neo4j, InfoGrid, InfiniteGraph, uRiKa, OrientDB or FlockDB.

Examples of graph structures include the networks of people that know each other, as you find on LinkedIn or Facebook. In healthcare a similar example is the network of providers linked to a group of practices or a hospital group. Referral patterns can be analysed to determine how specific doctors and hospitals team together to deliver improved healthcare outcomes. Graph-based analyses of referral patterns can also point out unusual or potentially fraudulent behaviour, such as whether a particular doctor is a conservative or a liberal prescriber, and whether he refers patients to a hospital that charges more than double what the one just across the street charges.

Another useful graph-based analysis is the spread of a highly contagious disease through groups of people who were in contact with each other. An infectious disease clinic, for instance, should strive to handle higher caseloads across such a network while keeping actual infection rates lower. A more deep-dive application of graph-based analytics is to study network models of genetic inheritance.
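As a small illustration of the contact-network analysis mentioned above, the following Python sketch walks a contact graph breadth-first from an index case and records how many contact “hops” away each reachable person is. It uses a plain in-memory dictionary rather than a graph database, and the people, the contacts and the function name are all invented for the example.

```python
from collections import deque

# Hypothetical contact network: person -> set of people they were in contact with.
CONTACTS = {
    "ann": {"ben", "cat"},
    "ben": {"ann", "dan"},
    "cat": {"ann"},
    "dan": {"ben", "eve"},
    "eve": {"dan"},
    "fay": set(),  # no recorded contacts
}


def exposure_distances(contacts, index_case, max_hops=None):
    """Breadth-first search from the index case.

    Returns {person: number of contact hops from the index case} for everyone
    reachable, optionally limited to max_hops.
    """
    distances = {index_case: 0}
    queue = deque([index_case])
    while queue:
        person = queue.popleft()
        if max_hops is not None and distances[person] >= max_hops:
            continue  # do not expand beyond the requested depth
        for neighbour in sorted(contacts.get(person, ())):
            if neighbour not in distances:
                distances[neighbour] = distances[person] + 1
                queue.append(neighbour)
    return distances


print(exposure_distances(CONTACTS, "ann"))
print(exposure_distances(CONTACTS, "ann", max_hops=2))
```

The point of the example is the undetermined depth: the same few lines follow the chain of contacts however long it turns out to be, which is exactly the kind of traversal that is awkward to express in standard SQL.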

Geospatial data

Like graph-structured data, geospatial data itself is pretty structured – locations can simply be represented as pairs of coordinates. However, when analysing and optimising ambulance routes of different lengths, for example, the data is best stored and processed using a graph structure. Geospatial analyses are also useful for hospital and practice location planning. For example, Epworth HealthCare group teamed up with geospatial group MapData Services to conduct an extensive analysis of demographic and medical services across Victoria. The analysis involved sourcing a range of data including Australian Bureau of Statistics figures around population growth and demographics, details of currently available health services, and the geographical distribution of particular types of conditions. The outcome was that the ideal location and services mix for a new $447m private teaching hospital was found to be in the much smaller city of Geelong, instead of in the much larger but services-rich city of Melbourne.
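To show why routing problems are naturally graph problems, here is a minimal sketch of Dijkstra’s shortest-path algorithm over a tiny road network, using only the Python standard library. The locations and travel times are invented, and a real routing system would of course work from actual road data and live traffic conditions.

```python
import heapq

# Hypothetical road network: location -> {neighbouring location: travel minutes}.
ROADS = {
    "station": {"a": 4, "b": 2},
    "a": {"station": 4, "b": 1, "hospital": 5},
    "b": {"station": 2, "a": 1, "hospital": 8},
    "hospital": {"a": 5, "b": 8},
}


def shortest_route(roads, start, goal):
    """Dijkstra's algorithm: return (total_minutes, path), or None if unreachable."""
    queue = [(0, start, [start])]  # priority queue ordered by travel time so far
    settled = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == goal:
            return cost, path
        if node in settled:
            continue  # a cheaper route to this node was already processed
        settled.add(node)
        for neighbour, minutes in roads.get(node, {}).items():
            if neighbour not in settled:
                heapq.heappush(queue, (cost + minutes, neighbour, path + [neighbour]))
    return None


print(shortest_route(ROADS, "station", "hospital"))
```

In this made-up network the direct-looking route via "a" is not the fastest one, which is the sort of result that is hard to see from a list of coordinates alone.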

Sensor data

Sensor data are normally also quite structured, with an aspect being measured, a measurement value and a unit of measure. The complexity comes in that for each patient or each blood sample test you often have a variable record structure with widely different aspects being measured and recorded. Some sources of sensor data also produce large volumes of data at high rates. Sensor data are often best stored in key-value databases, such as Riak, DynamoDB, Redis, Voldemort, and, sure, Hadoop.

Biosensors are now used to enable better and more efficient patient care across a wide range of healthcare operations, including telemedicine, telehealth, and mobile health. Typical analyses compare related sets of measurements for cause and effect, reaction predictions, antagonistic interactions, dependencies and correlations. For example, biometric data, which includes data such as diet, sleep, weight, exercise, and blood sugar levels, can be collected from mobile apps and sensors. Outcome-oriented analytics applied to this biometric data, when combined with other healthcare data, can help patients with controllable conditions improve their health by providing them with insights on the behaviours that can lead to increases or decreases in the occurrence of diseases. Data-savvy healthcare organisations can similarly use analytics to understand and measure wellness, apply patient and disease segmentation, and track health setbacks and improvements. Predictive analytics can be used to inform and drive multichannel patient interaction that can help shape lifestyle choices, and so avoid poor health and costly medical care.
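The variable record structure described above, and a simple correlation between two measured aspects, can be sketched as follows. A plain Python dictionary stands in for the key-value database here; the key format, the aspect names and all the readings are invented for illustration, and none of this is the API of any particular product.

```python
from math import sqrt

# Stand-in for a key-value database: key -> record of whatever was measured.
STORE = {}


def put_reading(store, patient_id, date, measurements):
    """Store one record under a hypothetical 'patient:date' key."""
    store[f"{patient_id}:{date}"] = measurements


def paired_values(store, patient_id, aspect_x, aspect_y):
    """Collect (x, y) pairs from the patient's records that contain both aspects."""
    pairs = []
    for key in sorted(store):
        owner, _sep, _date = key.partition(":")
        record = store[key]
        if owner == patient_id and aspect_x in record and aspect_y in record:
            pairs.append((record[aspect_x]["value"], record[aspect_y]["value"]))
    return pairs


def pearson(pairs):
    """Pearson correlation coefficient, or None if it cannot be computed."""
    n = len(pairs)
    if n < 2:
        return None
    mean_x = sum(x for x, _y in pairs) / n
    mean_y = sum(y for _x, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _y in pairs)
    var_y = sum((y - mean_y) ** 2 for _x, y in pairs)
    if var_x == 0 or var_y == 0:
        return None
    return cov / sqrt(var_x * var_y)


# Records for the same patient do not all measure the same aspects.
put_reading(STORE, "p1", "2024-01-01", {
    "sleep": {"value": 8.0, "unit": "h"},
    "glucose": {"value": 5.0, "unit": "mmol/L"},
})
put_reading(STORE, "p1", "2024-01-02", {
    "sleep": {"value": 6.0, "unit": "h"},
    "glucose": {"value": 6.0, "unit": "mmol/L"},
})
put_reading(STORE, "p1", "2024-01-03", {
    "sleep": {"value": 4.0, "unit": "h"},
    "glucose": {"value": 7.0, "unit": "mmol/L"},
})
put_reading(STORE, "p1", "2024-01-04", {"weight": {"value": 81.5, "unit": "kg"}})

pairs = paired_values(STORE, "p1", "sleep", "glucose")
print(pairs, pearson(pairs))
```

Each record carries only the aspects that were actually measured, which is what makes a key-value layout convenient. The analysis then has to cope with the gaps, as paired_values does by skipping records that lack either aspect.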

Concluding remarks

Although there are merits in storing and processing complex big data, we need to ensure that the type of analytical processing possible on the big data sets leads to valuable enough new insights. The way in which the big data is structured often has implications for the type of analytics that can be applied to it. Often, too, if the analytics are not properly applied to big data integrated with existing structured data, the results are not as meaningful and valuable as expected. We need to be cognisant of the fact that there are many storage and analytics technologies available. We need to choose the storage structure that matches the data structure, and thereby ensure that the right analytics can be applied efficiently and correctly, which in turn will deliver new and valuable insights.
