It seems you can’t swing a corrupt politician these days without hitting an article, advertisement, session or discussion of “Big Data”. I’ve even done a few of those myself. In fact, at the university where I teach, I’m on a board that selects the course content, teachers and materials for their offering for this topic.
But it’s just a fad. Like so many we’ve seen before.
In the days when the Personal Computer was first introduced into the mass market, companies were jockeying for market position, and of course the way you do that is to append some moniker the general public can latch onto. In those days it was digital-this or that. That transitioned to e- everything, and long before the phone of the fruit variety, it morphed to i- something or other.
And the process hasn’t changed much. Who can forget such classics as N-tier programming, Service Oriented Architectures, Business Intelligence and of course my current favorite, The Cloud? True, some of these are actually in use; others have long faded into the hype-cycle.
Companies that sell things are faced with a choice – either jump on the bandwagon and use the buzzword in their products or be considered last-year’s paradigm. So of course most of them succumb to buzzword mania to stay relevant. When it works out to be a real offering, it’s a hit and the company is rewarded, and when it’s just a marketing word that doesn’t pan out into a real offering they are derided as selling hype.
Enter “Big Data”. Like Cloud, this term can mean almost anything, so anyone can define it the way they wish. And of course they have – adding to the confusion and the marketing-flavor of the term.
One definition of Big Data involves four V’s:
- Volume – The amount of data held
- Velocity – The speed at which the data should be processed
- Variety – The variable sources, processing mechanisms and destinations required
- Value – The amount of data that is viewed as non-redundant, unique and actionable
Other definitions have been proposed. I have one I like to use:
- Big Data is data that you aren’t able to process and use quickly enough with the technology you have now.
I like this definition because it’s easy to understand, and encompasses other definitions. (Also, it has the words “Don’t Panic” on the cover and is slightly cheaper than other definitions.)
I stay away from a particular size of data as Big Data because that changes by the month. What was a large amount of data last year can fit on a USB drive I use on my laptop. It really isn’t about size, even though the term seems to indicate that it is.
Wait – I’ve been fretting about the definition of something I said earlier was just a fad. A buzzword. A marketing phrase. Why even discuss it if it will be gone soon? Because of the way it will disappear. Remember those buzzwords of days gone by? We no longer refer to a PC as a digital PC, or a website as an i-article anymore. But those things are still here. You’re probably reading this on a PC, or a tablet, on the Internet.
We use buzzwords and undefined terms to help solidify what we don’t yet have fully defined in our culture. Some of them die naturally because they are truly vapor; others bake in so completely to the environment they no longer need a qualifying term. And I believe that is what will happen to Big Data. The term is the fad – not the concepts it deals with. The term will fade – I give it another twelve months or so – but not what it refers to.
The idea of so much complex data needing complicated processing is with us – and perhaps has been since the birth of the Internet – and it will continue and become even more prevalent as we instrument the world. The savvy technical professional sees past the hype and learns the concepts they need to move their organization forward.
It’s actually not about “Big” data – it’s about data in general. The term will fade because, as we begin the transition to data-centric architectures, ones where the data is classified, secured and its path determined at the outset of programming, the “big” part will just go away and fade into “data”, as it should. It’s not about the amount or the complexity; it’s more about the way we source, store, process, analyze and present the data.
In fact, I’ll go so far as to say that all computing is simply the re-arranging of data. An e-mail isn’t about the client program; it’s about the e-mail payload. The client program simply allows me to manipulate the data. The data is the key, and it’s important to broaden your concepts to think about data first, regardless of its size.
Sources of Data
We need to think about data comprehensively – all types of data. We often think about data we keep in structured storage – databases and so on – but, in fact, the data that businesses rely on is not only scattered around the organization but is even held off-site.
But how do you get your mind around all of the data you store at a company? Do you have to think about every single Word document or Excel Spreadsheet? Not necessarily…
I’ve found that one of the best ways to think about data in its entirety is to consider the Business Continuity Plan.
Note: If you do not have a Business Continuity Plan, you should probably get that done before you do anything else – it’s a critical part of an organization, and something the business might not think about. They look to you, as the technical professional, to plan for this.
A Business Continuity Plan describes what the organization needs to keep operating on a day-to-day basis. These days that includes the computing systems they use to do everything from producing what they sell to tracking those sales, paying bills and paying the people who work on getting the product out the door. That means everything from those hidden Excel spreadsheets and much-reviled Access databases in various departments to the Software-as-a-Service (SaaS) products used for payroll, accounting and finance.
At the very least, these data sources should be identified and classified as mission-critical, and at best they should be documented, backed up, and the systems to run them should be duplicated somewhere offsite. It takes only one disaster to show the value of this effort.
From the “Big Data” perspective, each of these structured and unstructured systems represents a possible source of data to mine for information or for a specific answer. This actually represents an interesting choice right at the start. It’s not essential that you take the data and move it into another location to analyze it. Using systems like StreamInsight, in some cases you can simply examine the data as it passes from creation to original storage to take actions based on content. For instance, you might wire up a connector to listen to the accounting and finance system to trigger a report when a certain threshold value is met for a condition – all without storing the data again somewhere else for analysis.
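As a rough illustration of examining data in flight, here is a minimal sketch in plain Python. It is not StreamInsight's API; the `watch` function, the event fields and the threshold value are all invented for the example. The listener reacts to records as they pass by and keeps nothing except the ones that trip the condition:

```python
from typing import Iterable, Iterator

def watch(events: Iterable[dict], field: str, threshold: float) -> Iterator[dict]:
    """Yield each event whose `field` meets the threshold; nothing else is kept."""
    for event in events:
        if event.get(field, 0) >= threshold:
            yield event

# Hypothetical feed from an accounting system (made-up records)
feed = [
    {"invoice": "A-1", "amount": 1200.0},
    {"invoice": "A-2", "amount": 98000.0},
    {"invoice": "A-3", "amount": 450.0},
]

alerts = list(watch(feed, "amount", 50000.0))
for alert in alerts:
    print("Trigger report for", alert["invoice"])
```

Because `watch` is a generator, the same loop would work over an endless stream as well as over this short list; the data never has to land in a second store before you act on it.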
There are also methods of examining data-at-rest in its source system for mining or pattern matching, such as using open source products like Lucene or even using the built-in functionality within SharePoint to find things in documents, spreadsheets and more. All of these items represent data, even though you don’t store them in a database management system, relational or otherwise.
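Search tools of this kind are typically built around an inverted index: a map from each word to the documents that contain it. The sketch below shows the idea in plain Python. It is not Lucene's or SharePoint's API, and the file names and contents are hypothetical:

```python
from collections import defaultdict

def build_index(docs):
    """Map each lowercased word to the set of document names containing it."""
    index = defaultdict(set)
    for name, text in docs.items():
        for word in text.lower().split():
            index[word].add(name)
    return index

def search(index, *terms):
    """Return the names of documents that contain every term."""
    sets = [index.get(term.lower(), set()) for term in terms]
    return set.intersection(*sets) if sets else set()

# Hypothetical documents scattered around a department
docs = {
    "budget.xlsx": "quarterly budget forecast",
    "memo.docx": "budget meeting notes",
    "plan.docx": "continuity plan notes",
}

index = build_index(docs)
print(sorted(search(index, "budget", "notes")))
```

The documents stay where they are; only the index is new, which is what makes this a way of mining data at rest rather than moving it.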
The point is to consider the source of the data using the Business Continuity Plan process, or simply leveraging the one you currently have. Think of the sources of your data as being those which the business uses every day.
Storing and Processing Data
If you do need to move the data, you need a place to keep it and a way to process it for analysis. In some cases the data is structured or semi-structured, and needs no further modification – another process can simply take the data as an input and process it.
In other cases, the data is not only stored in a certain way but needs a pre-processing mechanism. Using the HDFS system within Hadoop is one such example. The data is stored on various nodes, which allows for scale.
This brings up an interesting concept now being explored using newer programming languages like Bloom. A data-centric view is the key to distributed architectures because of state. Data can grow so large that it isn’t possible for a single node to process it in a reasonable time – which fits my earlier definition of Big Data. These languages move even the program itself onto multiple nodes, right next to the data, which in this case represents the state of the system.
It’s important to consider that there isn’t one single way to store and process data, and that even Excel and Access (among others) have a place. In fact, looking at your organization as a whole, you’ll find that these products represent a logical distributed system – just not very well connected. That’s where you come in, and where Big Data becomes simply data.
After you’ve selected the method you want to use for storing and processing a datum, you need to figure out the best way to analyze it. This may or may not refer to reports or Business Intelligence – or that may be pushed off to the next phase, presentation.
Creating an Excel formula is one way to analyze data, as is creating an “R” function to walk across a huge text file; so is a SQL Server stored procedure, a Map/Reduce function in Hadoop or a query over data returned from an object called in a class. The particular software used isn’t important as much as who is doing the analysis, the reason they are doing it, and who needs to see the results.
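For readers who haven't met the Map/Reduce pattern, here is a minimal single-process sketch in Python. It only mimics the shape of the technique (map, group by key, reduce); it is not Hadoop's API, and the function names and sample lines are made up for the illustration:

```python
from collections import defaultdict
from functools import reduce

def map_phase(line):
    """Emit a (word, 1) pair for every word in a line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Group the emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Collapse each key's list of values into a single total."""
    return {key: reduce(lambda a, b: a + b, values) for key, values in groups.items()}

lines = ["big data is data", "data is the key"]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = reduce_phase(shuffle(pairs))
print(counts)
```

In a real cluster the map calls run on the nodes that hold each chunk of the file and only the grouped pairs travel across the network; the analysis itself is no more exotic than a word count.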
In some cases, those folks that should be analyzing data aren’t doing so. I’ve recommended to organizations that they pay for Excel training for all of their workers rather than starting a Business Intelligence project. This would, of course, depend on the kind of workers that the company has and who can make decisions based on the data.
That’s the most important point in analysis: what are you going to do with the result of the analysis? If it is simply a TPS report that no one will ever see, there’s no need to worry about it. If, however, the results are actionable and you can get them to the right people, that’s what you should focus on.
It’s always fascinating to me that people tend to focus first on pretty graphs, charts or reports before they consider what they’ll do with data. It’s understandable, since we’re visual creatures, that we would gravitate this way, but from a business standpoint it should be the very last thing you care about.
Interestingly, in many Big Data definitions, a report or visualization isn’t the result. In fact, it’s far more common in things like Machine Learning that you’re looking for a single answer – it’s the “42” you’re looking for, not a pie-chart. In fact, charts, graphs and reports are merely instruments to get you to an answer, not the answer itself.
That isn’t to say that these things don’t have their place. A graphic can make sense out of data that is incomprehensible in raw numeric or grouping form. Learning the proper way to represent a particular data set in a graphical format is a great tool for the data professional. Keep in mind, however, that a graph can be as prone to misinterpretation as a statistic.
Consider for a moment a group of school scores. They range from an average of 94 in our district to an average of 98 in others. If I wish to show that we’re doing great in comparison to other districts, I simply use a huge range like 0-98 with gradations of 5 on a bar chart and we look great. If, however, I want more money for my school, I make sure to use a range of 93-98 and use gradations of 0.5. The visual there belies the real delta.
Or does it?
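To put numbers on the school-score example, the short Python sketch below computes how tall each bar would be drawn as a fraction of the chart height under the two axis ranges. The `bar_fraction` helper is made up for the illustration; the scores and ranges are the ones from the example:

```python
def bar_fraction(value, axis_min, axis_max):
    """Fraction of the chart height a bar occupies for a given axis range."""
    return (value - axis_min) / (axis_max - axis_min)

ours, theirs = 94, 98

for axis_min, axis_max in [(0, 98), (93, 98)]:
    a = bar_fraction(ours, axis_min, axis_max)
    b = bar_fraction(theirs, axis_min, axis_max)
    print(f"axis {axis_min}-{axis_max}: ours {a:.0%}, theirs {b:.0%} of chart height")
```

On the 0-98 axis our bar reaches about 96% of the chart height against their 100%; on the 93-98 axis it reaches only 20%. The four-point difference is the same in both pictures, and only the framing changes.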
It’s important to understand these visualizations and what they represent. Charts and graphs for data analysis are the watering can of the bonsai tree. Used properly they represent success; used improperly they represent failure.
So Big Data is a fad. It will fade, over time, into the pantheon of other tech buzzwords. But the data it represents won’t – it exists now, and continues to grow. So it’s OK to allow the term for now, learn the concepts it presents, and bake it into what you do today. Big Data will only get bigger. And that’s not hype.