It’s big, BIG I tell you. Really Big. Wherever we turn, there it is, looming like some great billboard telling us we need to pay attention. It’s the new (well, maybe the newest) 800-pound gorilla in the corner of the room. A search for it on Google turns up 2,050,000,000 hits. That’s two BILLION hits. For perspective, Miley Cyrus gets a mere 919,000,000 hits (fewer than half). And yet, nobody really has a good definition for it.
It reminds me of the time I was at a meeting of pathologists studying the then brand-new technology of electron microscopy. The instructor started off by asking: “What happens when you have 12 pathologists looking at a slide?” His answer: “You get 12 different opinions.”
“Big data” is the latest shiny object, the next great big thing, the sine qua non for all future data analytics everywhere. And, in healthcare, it’s begun to take on the stature of the Holy Grail. I don’t want to seem too snarky about this, but we’re counting on “something” for which we do not have a solid definition. It’s right back to the dilemma of the 12 pathologists. Or maybe, more optimistically, it’s more akin to Supreme Court Justice Potter Stewart’s famous quip (from the 1964 case Jacobellis v. Ohio, 378 U.S. 184), if a concurring opinion can be called such:
“I shall not today attempt further to define the kinds of material I understand to be embraced within that shorthand description [“hard-core pornography”]; and perhaps I could never succeed in intelligibly doing so. But I know it when I see it, and the motion picture involved in this case is not that.” [Emphasis added.]
Jonathan Stuart Ward and Adam Barker, in their excellent “Undefined By Data: A Survey of Big Data Definitions,” attempt to address this ambiguity, first by demonstrating how serious the problem is. Surveying the definitions offered by the major IT wonks (Gartner) and by the companies that manage “big data” (Microsoft, Intel, Oracle), Ward and Barker found wildly different definitions. To their credit, they extracted the following common elements, although the introduction to their synopsis is telling:
“Despite the range and differences existing within each of the aforementioned definitions there are some points of similarity. Notably all definitions make at least one of the following assertions:
- Size: the volume of the datasets is a critical factor.
- Complexity: the structure, behaviour and permutations of the datasets is a critical factor.
- Technologies: the tools and techniques that are used to process a sizable or complex dataset is a critical factor.”
Their definition: “Big data is a term describing the storage and analysis of large and or complex data sets using a series of techniques including, but not limited to: NoSQL, MapReduce and machine learning.”
This is likely to be cited frequently as an emerging consensus definition, and I think they’re onto something. Although it’s a good start, we need to be careful not to focus on the low-hanging fruit of “big data”: structured data. I think the definition needs more detail to reflect a key reality, which is that most of the really interesting information, the stuff that tells us so much about humans, is stored in that messiest of formats: narrative. To be fair, I imagine that unstructured data could be included under the rubric “complex dataset,” but I think that narrative is so important, and so different from the normal run-of-the-mill “complex data,” that it must be explicitly stated as a part of the definition.
Estimates in healthcare put up to 70 percent of essential health information in narrative format. And, although we attempt to capture the “essence” of the narrative by abstracting it to increasingly complex code sets, including SNOMED, ICD-10 (it is coming, isn’t it?) and ICD-11 (“By the pricking of my thumbs, something wicked this way comes?”), we must remember the words of Alfred Korzybski: “The map is not the territory.”
So, let me take a shot at a revised definition (changes in bold):
“Big data is a term describing the acquisition, storage and analysis of sometimes large, usually complex data sets, that frequently include both structured and unstructured data, including free text narrative, using a series of techniques including, but not limited to: traditional statistical techniques, NoSQL (“Not Only SQL”), MapReduce, machine learning, and combinations of these methods.”
Carl Husa is a Lead Institutional Analyst and Healthcare Architect at Bryan University, where he shares his incredible wealth of knowledge–as well as his spicy opinions and expanding collection of hot sauces–with his colleagues.
Bryan University offers exclusive degrees targeting high-growth professions in healthcare and legal services, particularly as they intersect with Big Data. More information is available at www.BryanUniversity.edu.