My initial reaction upon first hearing the term “Big Data” was one of excitement. The wheels of progress certainly turn in the DBA world but, compared to the world of the application developer, they turn mighty slowly.
As I read through various articles and blogs on the topic, I found my enthusiasm waning. It appears that I am not alone in feeling a twinge of disillusion, as there have been a number of recent posts on technology sites proclaiming the death of ‘Big Data’.
However, I came to realise that there are several positive ideas behind Big Data. My epiphany came when I got the chance to attend a few conferences and seminars and made use of the time in the intervals to speak to the presenters and the other attendees. The same themes and topics of conversation came up time and time again.
- “Big Data” is neither new, nor is it merely a marketing term.
- It is not about the technology so much as the interaction with people.
- The challenges that bedevil data warehouses also plague Big Data. This is hardly a revelation as “Big Data” can be seen as largely an extension of data warehousing.
So what is “Big Data”?
Why is there an antipathy towards Big Data? One reason is that the definition of Big Data, and much of the debate, has been framed on purely technical grounds. For me, a quote from Buck Woody gives the most concise and revealing summary of the technical definition of Big Data.
“Big Data is data that you aren’t able to process and use quickly enough with the technology you have now.”
Big Data was founded on the concept of the three Vs:
- Volume – The sheer amount of data to be processed.
- Velocity – The speed at which data must be turned into a useful asset.
- Variety/variability – The number of data sources, destinations and formats that have to be dealt with.
A combination of two or more of the above represents a “Big Data” challenge, but none of those challenges is new. Indeed, the march of technology may render today’s Big Data problem a non-issue tomorrow. Consider:
- Solid state drives taking over from traditional hard disk drives.
- Compression technologies including column store indexes.
- Data location awareness technologies such as Rainstor on Hadoop.
Although the three Vs and Buck’s quote are good technical definitions of “Big Data”, I think they miss the point, and therein lies the flaw in the “Big Data is dead” argument.
Why is Big Data alive and kicking?
As I said earlier, my epiphany came from talking to people who believed in ‘Big Data’ and watching them present their ideas. Although they had an interest in technology, they were predominantly business people presenting a business use for “Big Data”. The technology was simply a means to an end and not (for them) the interesting bit.
Business people were finding data interesting!
You should not underestimate the importance of this. In my experience business managers understand that their reports and information depend on data but they don’t think of data as an asset. More likely they think of data as a burden, a liability and a cost centre.
With that in mind, when business people are extolling the virtues and importance of data, we in the technical community should sit up and take note.
Once I had grasped the importance of that point, I began to feel that the three Vs of Big Data were grasping the wrong end of the stick. Why? Because they focus on the technical characteristics of “Big Data” and not on the characteristics that matter most to the business.
What are my V’s of Big Data?
I have two contenders for my number one “V” of Big Data. They are closely related and intertwined, yet each justifies being presented as a key attribute in its own right.
I choose “Vision” as my first V because of the effort involved in establishing, communicating and championing it. The vision seeks to answer the question ‘What problem will a “Big Data” solution address?’ or ‘What opportunities will it present to the business?’
To describe the vision as being a mere sales pitch would be to underplay its role. It is a sales pitch but, instead of being targeted at a small group of decision makers, its scope is to sell to everyone who will be involved in the Big Data initiative. In fact I would go as far as to say that the one thing that successful IT projects share is a widely understood, clearly articulated, vision. Beyond its role as a sales pitch, the “Vision” is also a direction of travel indicator visible to all. A pole star if you like.
It is not a simple matter to put together a vision that will achieve all these things. When I worked in advertising, we were taught to consider the personality types of the people in our audience; since we were in advertising, the four quadrants were named after the parts of a print advert.
| Quadrant | Personality traits |
| --- | --- |
| Headline | Brusque to the point of being rude. Focussed on the issue at hand and dislikes distractions from the main topic. Likes a terse bullet-point summary. Punctual, expects everything to run to a schedule. |
| Illustration | Flamboyant and sociable. Not interested in the detail, just wants the “big picture”. Turns up to meetings late and expects to be at the centre of attention. |
| Logo | Not a fan of change. Likes to know how they and the people around them will be affected. |
| Body Copy | Likes to get immersed in the details. Suspicious of “big picture” sales pitches and wants to validate claims against facts. |
Different industries take the same approach but with quadrants named appropriately for their type of business.
When presenting an advertising pitch we would use the techniques described by Phil Factor in his article “The art of the one-pager” for the PowerPoint part of the presentation. This would address the needs and concerns of the “Headline” and “Illustration” personality types.
This was firmly backed by a more detailed report used to appease the “Logo” and “Body Copy” personalities.
My second “V” plays a crucial role in the “Vision” and that is “Value”. Perhaps the best way to illustrate both the similarity and separateness of these two V’s is by using an analogy. Consider two motoring magazines, “Top Gear” and “Fleet Manager”.
Top Gear will contain photographs of cars artistically shot in some of the world’s most beautiful locations. It will describe the way a car grips the road, looks stylish from certain angles and rewards a certain style of driving. It will make a cursory mention of costs. This is the VISION.
Fleet Manager will contain photographs where the car itself is the dominant feature. It will catalogue depreciation, 3 year running costs, insurance groups and CO2 emission-based taxes. It will make a cursory mention of style and give a workmanlike view of what the car is like to drive. This is the VALUE.
Money is the universal measure or lingua franca of business so it is surprising that it is mentioned so little in articles about Big Data. Although value is related to money, it is more than simply asking:
- How much is this going to cost?
- How much revenue will it bring in?
It is also going to be asking the following:
- How will this give me a greater understanding of my customers, such that I can use that understanding to increase profitability?
- How will it help me find friction points in what we do today, so that by reducing them we can cut costs and increase profitability?
In short, value can be expressed directly as in cost/revenue or indirectly as in nurturing a facet of business life that leads to cost reduction/revenue increase such as brand awareness, commercial reputation, supplier relationships. This brings us to my third “V” of Big Data. In order for “Value” to be realised people must trust what the data is telling them. It must have “Veracity”.
Veracity is the truthfulness of the data or to put it another way, a measure of its quality. Data occasionally presents us with unpalatable facts and when it does so the data has to be demonstrably unimpeachable.
It matters not one jot if you can process a billion different data sources and several yottabytes of data with single-digit millisecond response times. If the quality of the data is in question and the topic is emotive, then no-one will act appropriately upon the conclusions drawn from that data. This is true of all data, and data quality is an ongoing challenge in more traditional data processing too, but I feel it is important enough to emphasize in the context of Big Data.
The size of the challenge depends on what the source of “Big Data” actually is. If the source is mechanical sensors then the quality tends to be high. After all, if a sensor is reporting unreliable data then there are bigger problems to deal with than downstream data concerns. However, sensor data that depends on GPS co-ordinates poses its own data quality challenge.
Take as an example telematics data from vehicle black boxes. The map below shows a route the black box thinks I took. My destination was at the lowest part of the route, south of Tarporley. From the easternmost point until just north of the road marked A54, the journey is accurate. From that point on the indicated route is entirely fictitious, as the straightness of the journey lines attests.
What actually happened is that around the Tarporley area the black box lost contact with some of the GPS satellites used to calculate my position, so the route finder made its best guess as to where it thought I was.
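Glitches of this kind can be caught downstream. Here is a minimal sketch, assuming a hypothetical feed of `(timestamp, lat, lon)` fixes (a real telematics payload will differ), that flags any fix whose implied speed from the previous fix is physically implausible:

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/lon points, in kilometres."""
    r = 6371.0  # mean Earth radius, km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def flag_implausible(fixes, max_kmh=150.0):
    """Mark fixes implying a speed above max_kmh since the previous fix.

    fixes: list of (timestamp_seconds, lat, lon) tuples in time order.
    Returns a parallel list of booleans (True = suspect fix).
    """
    flags = [False]  # the first fix has nothing to compare against
    for (t0, la0, lo0), (t1, la1, lo1) in zip(fixes, fixes[1:]):
        dt_h = (t1 - t0) / 3600.0
        speed = haversine_km(la0, lo0, la1, lo1) / dt_h if dt_h > 0 else float("inf")
        flags.append(speed > max_kmh)
    return flags

# A fix that "jumps" tens of kilometres in ten seconds is clearly a glitch.
fixes = [(0, 53.16, -2.67), (10, 53.16, -2.671), (20, 53.6, -2.0)]
print(flag_implausible(fixes))  # → [False, False, True]
```

Flagged fixes can then be dropped or interpolated; the point is that veracity problems like this are detectable, provided someone has thought to look for them.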
This is a Big Data quality problem in the form of inaccuracy and incompleteness of data. For social media data sources then there are huge challenges to be met simply in extracting relevant data, much less in interpreting it correctly and in an automated way.
This leads me to my final “V” of Big Data.
Validity – Signal to Noise
Youth sees issues in black and white terms. Age sees shades of grey. Age sees shades of grey because it has accumulated the data of life (experience and knowledge). The accumulation hasn’t led to clarity and certainty; it has led to uncertainty and doubt.
With huge volumes of data it becomes ever harder to work out which data is useful and which is not; which data requires focus and which is simply a distraction. Returning to my telematics black box example, these devices capture data on a number of different measures four times per second. My daily round-trip commute is two hours, meaning that 92% of all data captured by my black box has no meaning other than to say that the vehicle is parked at a location. Of the data that remains, not all events across all the measures are necessary; the rest are simply distraction and noise, so the challenge is to identify and eliminate them.
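The back-of-envelope figures above, and the filtering step they imply, fit in a few lines. The `speed_kmh` field and the threshold are illustrative assumptions, not a real black-box schema:

```python
# Fraction of black-box samples captured while the vehicle is parked,
# given a two-hour daily commute out of a 24-hour recording day.
DRIVING_HOURS_PER_DAY = 2
parked_fraction = 1 - DRIVING_HOURS_PER_DAY / 24
print(f"{parked_fraction:.0%}")  # → 92%

# Raw volume at four samples per second:
SAMPLES_PER_SEC = 4
samples_per_day = SAMPLES_PER_SEC * 24 * 3600
print(samples_per_day)  # → 345600 readings per day; ~316,800 of them while parked

# Eliminating the noise: keep only samples recorded while moving.
def keep_moving(samples, min_speed_kmh=1.0):
    """samples: iterable of dicts with a (hypothetical) 'speed_kmh' field."""
    return [s for s in samples if s["speed_kmh"] >= min_speed_kmh]
```

Even this crude speed cut discards over nine-tenths of the stream before any analysis begins, which is the sense in which the telematics case is straightforward.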
Fortunately, in telematics this task is relatively straightforward, but for other Big Data sources, such as web logs, one man’s signal is another man’s noise, and the challenge is far more acute.
I began this article suggesting that we should focus on the business attributes of Big Data rather than the technical attributes. However if I have one qualm over Big Data it is with the business value of volume. The ultimate goal of a business is to be able to target an individual with an offer tailor made for them and thus have a higher probability that the individual will purchase.
With increased volume does come the opportunity to move towards this goal by considering a more finely grained set of dimensions for the data. That is, instead of considering 18–25-year-olds as a single group, we can have increasing confidence in the accuracy of analysis that considers 18, 19 … 24, 25 as individual cells.
Whether businesses are set up to capitalize on such fine grained data segments is another matter entirely. For most businesses a random sample will produce an accurate enough model of the population as a whole. Statisticians have various mathematical tricks to allow for inaccuracies inherent in drawing conclusions based on small samples. Beyond a certain point increasing the volume of data points does not offer significant improvements in the accuracy of the model.
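The diminishing return mentioned above can be made concrete with the standard error of a sample proportion, which shrinks only with the square root of the sample size. A small sketch with illustrative numbers:

```python
import math

def margin_of_error(p, n, z=1.96):
    """Approximate 95% margin of error for a sample proportion p at sample size n."""
    return z * math.sqrt(p * (1 - p) / n)

p = 0.5  # worst-case proportion: maximises the margin of error
for n in (1_000, 10_000, 100_000, 1_000_000):
    print(f"n={n:>9,}: ±{margin_of_error(p, n):.4f}")
```

Going from 10,000 data points to 1,000,000, a hundredfold increase in volume, tightens the margin of error only from roughly ±1% to ±0.1%. That tenfold-for-hundredfold trade is the statistical reason that, beyond a certain point, sheer volume stops paying for itself.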
It is the nature of IT to focus on the technology, the “how” of Big Data, almost to the point of tunnel vision. This of course remains our fundamental responsibility, but we must broaden our perspective and consider the business goals, the “what” of Big Data. Only by doing this can we make the value of Big Data a reality and reduce the misconceptions surrounding the concept.