‘Big Data’ is one of those buzz phrases doing the rounds in the industry at the moment. It’s an adjacent topic to cloud but is being thrown around in much the same way, often prefixed by the question: “What are you doing about…?” Well, with the costs of storage plummeting, it’s becoming clear that the answer to that question is you should be storing everything.
At the Telco Cloud World Forum in London recently, Juergen Urbanski, chief technologist for big data and cloud at Deutsche Telekom, made an interesting point. He said many of the companies he’d been helping with their big data problem – that is to say, very large amounts of information to store and then process – were struggling because they had run out of either capacity or money for data storage.
The era of Big Data is so closely aligned with cloud because it benefits from the same adoption of commodity hardware and open source software that suddenly makes the storage (and processing) of very large amounts of data (think petabytes and petabytes) possible.
In recent years, the cost of storage media has sunk so low that for the first time in history it’s economically possible to store anything you want for as long as you want. But this development shifts the problem to being able to perform actions on those data in a technically feasible and cost-effective manner.
It’s probably true that when any nascent market hits the mainstream, there is a considerable amount of FUD (fear, uncertainty and doubt) injected by legacy vendors that have a lot to lose. But that’s where another staple of the cloud movement – open source software – is coming into play.
Doug Cutting, chief architect at cloud specialists Cloudera and chairman of the Apache Software Foundation, helped create Apache’s open source software framework Hadoop out of necessity as data from the web exploded, and grew far beyond the ability of traditional systems to handle it.
Consider those shows that seem to be on TV all the time about hoarders who are terrified of throwing anything away because there might be something, somewhere, under a stack of cycling magazines from the 1960s that is of value. Yet these people never get around to actually having their tat valued, or indeed doing anything with it. This is exactly the problem such a framework sets out to solve.
According to Cutting, Hadoop was initially inspired by papers from Google outlining its approach to handling an avalanche of data, and has since become the de facto approach for storing, processing and analysing hundreds of terabytes, and even petabytes, of data. It’s certainly causing a stir in the industry, by making compute and analytics processes on very large amounts of data not just possible but even relatively simple.
Deutsche Telekom, a tier one operator, calls Hadoop the “single most disruptive technology in the data complex”. By Apache’s declaration, Hadoop is 100 per cent open source. Instead of relying on expensive, proprietary hardware and separate systems to store and process data, the framework enables distributed parallel processing of huge amounts of data across inexpensive, industry-standard servers that both store and process the data and that, Apache says, can scale without limits.
Urbanski is right to be excited by the possibilities here. “Big Data is an entirely new way to take advantage of the incredible volume of data and the increasing diversity and variety of data as well as the incredible velocity of data,” he said.
In today’s hyper-connected world more and more data is being created every day, but with tools like Hadoop no data is too big. With parallel processing of data sets, you can have hundreds of servers searching for just one answer. It’s pretty much the industry’s realisation of Douglas Adams’ city-sized supercomputer Deep Thought.
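To make the idea of many servers working towards one answer a little more concrete, here is a minimal sketch in plain Python. It is not Hadoop’s API: the function names, the sample log lines and the use of threads as stand-ins for servers are all assumptions made for illustration. Each worker counts words in its own slice of the data (the “map” step), and the partial counts are then merged into a single result (the “reduce” step).

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def split_into_shards(records, n_shards):
    """Deal records round-robin into n_shards lists, one per hypothetical server."""
    return [records[i::n_shards] for i in range(n_shards)]


def map_shard(shard):
    """Map step: a worker counts the words in its own shard only."""
    counts = Counter()
    for line in shard:
        counts.update(line.lower().split())
    return counts


def merge_counts(left, right):
    """Reduce step: combine two partial results into one."""
    return left + right


def parallel_word_count(records, n_workers=4):
    """Count words across all records, using threads as stand-ins for servers."""
    shards = split_into_shards(records, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(map_shard, shards))
    return reduce(merge_counts, partials, Counter())


# Hypothetical sample data, invented for this example.
log_lines = [
    "call dropped in london",
    "call connected in berlin",
    "call dropped in berlin",
]
totals = parallel_word_count(log_lines, n_workers=2)
print(totals["dropped"])  # 2
```

The point of the split is that no worker ever needs to see the whole data set; only the small partial results have to travel to be merged.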
Even when different types of data have been stored in unrelated systems, Apache claims you can dump it all into your Hadoop cluster with no prior need for a schema. In other words, you don’t need to know how you intend to query your data before you store it. So the big breakthrough here is that businesses and organisations can now find value in data that was, until recently, considered useless.
“Essentially it means you can store first, ask questions later,” says Urbanski.
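As a rough illustration of “store first, ask questions later”, the sketch below keeps raw records exactly as they arrived and only imposes a structure at the moment a question is asked. It is plain Python rather than anything Hadoop-specific, and the record contents, field names and the read_with_schema helper are hypothetical.

```python
import json

# Hypothetical raw records from unrelated systems, stored as-is with no shared schema.
raw_store = [
    '{"source": "billing", "customer": "a17", "amount_eur": 42.5}',
    '{"source": "network", "cell_id": "LDN-003", "dropped_calls": 7}',
    '{"source": "billing", "customer": "b02", "amount_eur": 13.0}',
    'free-text note from a support agent',
]


def read_with_schema(lines, required_fields):
    """Apply a schema at read time.

    Yields only the records that parse as JSON objects and contain every
    required field; everything else is skipped, not deleted.
    """
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # not JSON: irrelevant to this question, but still stored
        if isinstance(record, dict) and all(f in record for f in required_fields):
            yield record


# The question is decided long after the data was stored.
billing_total = sum(
    r["amount_eur"] for r in read_with_schema(raw_store, ["customer", "amount_eur"])
)
print(billing_total)  # 55.5
```

A different question, such as which cells report dropped calls, needs only a different list of required fields; nothing about the stored data has to change.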
This latter point is important. So much of what we talk about in this industry now focuses on being able to do things dynamically or on the fly, but it seems there are plenty of lessons we could learn from the past, from these large-scale archives of data, if only we had the tools and, more importantly, the right questions to ask.
Deep Thought, of course, knew the answer but not the question. Indeed, the need to make sense of that data opens the door to another, more philosophical specialist player, because surely the big question now is: do you even know what to ask?