13 April 2012

Big Data is Just a Fad

The term 'Big Data' is nothing more than a fad, and we'll soon be cringing with embarrassment at the thought that we ever used it. The data, however, and the challenges it presents to processing, will stay with us. If jargon like 'Big Data' helps us focus on the problems, then let's use it: temporarily, perhaps.

It seems you can’t swing a corrupt politician these days without hitting an article, advertisement, session or discussion of “Big Data”. I’ve even done a few of those myself. In fact, at the university where I teach, I’m on a board that selects the course content, teachers and materials for its offering on this topic.

But it’s just a fad. Like so many we’ve seen before.

In the days when the Personal Computer was first introduced into the mass market, companies were jockeying for market position, and of course the way you do that is to append some moniker the general public can latch onto. In those days it was digital-this or that. That transitioned to e- everything, and long before the phone of the fruit variety, it morphed to i- something or other.

And the process hasn’t changed much. Who can forget such classics as N-tier programming, Service Oriented Architectures, Business Intelligence and of course my current favorite, The Cloud? True, some of these are actually in use; others have long faded into the hype-cycle.



Companies that sell things are faced with a choice – either jump on the bandwagon and use the buzzword in their products or be considered last-year’s paradigm. So of course most of them succumb to buzzword mania to stay relevant. When it works out to be a real offering, it’s a hit and the company is rewarded, and when it’s just a marketing word that doesn’t pan out into a real offering they are derided as selling hype.

Enter “Big Data”. Like Cloud, this term can mean almost anything, so anyone can define it the way they wish. And of course they have – adding to the confusion and the marketing-flavor of the term.

One definition of Big Data involves four V’s:

  • Volume – The amount of data held
  • Velocity – The speed at which the data should be processed
  • Variety – The variable sources, processing mechanisms and destinations required
  • Value – The proportion of the data that is non-redundant, unique, and actionable

Other definitions have been proposed. I have one I like to use:

  • Big Data is data that you aren’t able to process and use quickly enough with the technology you have now.

I like this definition because it’s easy to understand, and encompasses other definitions. (Also, it has the words “Don’t Panic” on the cover and is slightly cheaper than other definitions.)



I stay away from a particular size of data as Big Data because that changes by the month. What was a large amount of data last year can fit on a USB drive I use on my laptop. It really isn’t about size, even though the term seems to indicate that it is.

Wait – I’ve been fretting about the definition of something I said earlier was just a fad. A buzzword. A marketing phrase. Why even discuss it if it will be gone soon? Because of the way it will disappear. Remember those buzzwords of days gone by? We no longer refer to a PC as a digital PC, or a website as an i-article anymore. But those things are still here. You’re probably reading this on a PC, or a tablet, on the Internet.

We use buzzwords and undefined terms to help solidify what we don’t yet have fully defined in our culture. Some of them die naturally because they are truly vapor; others bake in so completely to the environment they no longer need a qualifying term. And I believe that is what will happen to Big Data. The term is the fad – not the concepts it deals with. The term will fade – I give it another twelve months or so – but not what it refers to.

The reality of data so complex that it needs complicated processing is with us – and perhaps has been since the birth of the Internet – and it will only become more prevalent as we instrument the world. The savvy technical professional sees past the hype and learns the concepts needed to move the organization forward.

It’s actually not about “Big” data – it’s about data in general. The reason the term will fade is that, as we transition to data-centric architectures – ones where the data is classified, secured and its path determined at the outset of programming – the “big” part will simply fall away, leaving “data”, as it should. It’s not about the amount or the complexity; it’s about the way we source, store, process, analyze and present the data.

In fact, I’ll go so far as to say that all computing is simply the re-arranging of data. An e-mail isn’t about the client program; it’s about the e-mail payload. The client program simply allows me to manipulate the data. The data is the key, and it’s important to broaden your concepts to think about data first, regardless of its size.

Sources of Data

We need to think about data comprehensively – all types of data. We often think about data we keep in structured storage – databases and so on – but, in fact, the data that businesses rely on is not only scattered around the organization but is even held off-site.

But how do you get your mind around all of the data you store at a company? Do you have to think about every single Word document or Excel Spreadsheet? Not necessarily…

I’ve found that one of the best ways to think about data in its entirety is to consider the Business Continuity Plan.

Note: If you do not have a Business Continuity Plan, you should probably get that done before you do anything else – it’s a critical part of an organization, and something the business might not think about. They look to you, as the technical professional, to plan for this.

A Business Continuity Plan describes what the organization needs to keep operating on a day-to-day basis. These days that includes the computing systems used for everything from producing what the company sells to tracking those sales, paying the bills, and paying the people who get the product out the door. That covers everything from the hidden Excel spreadsheets and much-reviled Access databases in various departments to the Software-as-a-Service (SaaS) products used for payroll, accounting and finance.

At the very least, these data sources should be identified and classified as mission-critical, and at best they should be documented, backed up, and the systems to run them should be duplicated somewhere offsite. It takes only one disaster to show the value of this effort.

From the “Big Data” perspective, each of these structured and unstructured systems represents a possible source of data to mine for information or for a specific answer. This actually represents an interesting choice right at the start. It’s not essential that you take the data and move it into another location to analyze it. Using systems like StreamInsight, in some cases you can simply examine the data as it passes from creation to original storage to take actions based on content. For instance, you might wire up a connector to listen to the accounting and finance system to trigger a report when a certain threshold value is met for a condition – all without storing the data again somewhere else for analysis.
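The pattern is easy to sketch. What follows is a minimal Python illustration of the idea only – not StreamInsight itself, which is a .NET engine – and the accounting feed, field names and threshold here are all hypothetical:

```python
def watch_stream(events, field, threshold, action):
    """Examine each event as it passes by, without storing it again.

    Calls `action` whenever the watched field meets the threshold,
    then passes the event along untouched to its original destination.
    """
    for event in events:
        if event.get(field, 0) >= threshold:
            action(event)
        yield event

# Hypothetical example: flag any invoice of 10,000 or more as it flows past.
alerts = []
feed = [{"invoice": 1, "amount": 2500},
        {"invoice": 2, "amount": 12000},
        {"invoice": 3, "amount": 800}]
stored = list(watch_stream(feed, "amount", 10000, alerts.append))
```

The point of the sketch is that the alert fires while the data is in flight; `stored` is unchanged, so nothing extra is persisted for the analysis.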

There are also methods of examining data-at-rest in its source system for mining or pattern matching, such as using open source products like Lucene or even using the built-in functionality within SharePoint to find things in documents, spreadsheets and more. All of these items represent data, even though you don’t store them in a database management system, relational or otherwise.

The point is to consider the source of the data using the Business Continuity Plan process, or simply leveraging the one you currently have. Think of the sources of your data as being those which the business uses every day.

Storing and Processing Data

If you do need to move the data, you need a place to keep it and a way to process it for analysis. In some cases the data is structured or semi-structured, and needs no further modification – another process can simply take the data as an input and process it.

In other cases, the data is not only stored in a certain way but needs a pre-processing mechanism. Using the HDFS system within Hadoop is one such example. The data is stored on various nodes, which allows for scale.

This brings up an interesting concept now being explored in newer programming languages like Bloom. A data-centric view is the key to distributed architectures because of state. Data can grow so large that a single node cannot process it in a reasonable time – which fits my earlier definition of Big Data. These languages move even the program itself onto multiple nodes, right next to the data, which in this case represents the state of the system.
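The shape of that idea can be sketched locally. In this Python toy, the “nodes” are just in-memory lists standing in for the distributed partitions that Hadoop (or a language like Bloom) would spread across machines; the map function runs “next to” each slice of data, and only the small partial results travel:

```python
from collections import Counter
from functools import reduce

def map_phase(partition):
    # Runs beside the data: count words in one node's slice.
    return Counter(word for line in partition for word in line.split())

def reduce_phase(a, b):
    # Merge the partial counts coming back from two nodes.
    return a + b

# Two stand-in "nodes", each holding its own slice of the data.
partitions = [["big data is data"],
              ["data lives on many nodes"]]

totals = reduce(reduce_phase, (map_phase(p) for p in partitions))
```

Nothing here is distributed, of course – the sketch only shows why the programming model scales: each partition can be processed independently, wherever it happens to live.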

It’s important to consider that there isn’t one single way to store and process data, and that even Excel and Access (among others) have a place. In fact, looking at your organization as a whole, you’ll find that these products represent a logical distributed system – just not very well connected. That’s where you come in, and where Big Data becomes simply data.

Analyzing Data

After you’ve selected the method you want to use for storing and processing the data, you need to figure out the best way to analyze it. This may or may not mean reports or Business Intelligence – or that may be pushed off to the next phase, presentation.

Creating an Excel formula is one way to analyze data, as is creating an “R” function to walk across a huge text file; so is a SQL Server stored procedure, a Map/Reduce function in Hadoop or a query over data returned from an object called in a class. The particular software used isn’t important as much as who is doing the analysis, the reason they are doing it, and who needs to see the results.
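As an illustration of that point, here is the same kind of analysis a SQL GROUP BY or an Excel pivot would perform, sketched in plain Python over hypothetical sales rows – the tool changes, the question doesn’t:

```python
from collections import defaultdict

# Hypothetical sales rows; the same question could be put to Excel,
# a stored procedure, or a Map/Reduce job.
sales = [("East", 100), ("West", 250), ("East", 50), ("West", 25)]

# Equivalent SQL: SELECT region, SUM(amount) FROM sales GROUP BY region;
by_region = defaultdict(int)
for region, amount in sales:
    by_region[region] += amount
```

Whichever tool computes it, the result only matters if someone can act on it – which is the next point.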

In some cases, those folks that should be analyzing data aren’t doing so. I’ve recommended to organizations that they pay for Excel training for all of their workers rather than starting a Business Intelligence project. This would, of course, depend on the kind of workers that the company has and who can make decisions based on the data.

That’s the most important point in analysis: what are you going to do with the result of the analysis? If it is simply a TPS report that no one will ever see, there’s no need to worry about it. If, however, those decisions are actionable and you can get them to the right people, that’s what you should focus on.

Presenting Data

It’s always fascinating to me that people tend to focus first on pretty graphs, charts or reports before they consider what they’ll do with the data. It’s understandable, since we’re visual creatures, that we would gravitate this way, but from a business standpoint it should be the very last thing you care about.

Interestingly, in many Big Data definitions, a report or visualization isn’t the result. In fact, it’s far more common in things like Machine Learning that you’re looking for a single answer – it’s the “42” you’re looking for, not a pie-chart. In fact, charts, graphs and reports are merely instruments to get you to an answer, not the answer itself.

That isn’t to say that these things don’t have their place. A graphic can make sense out of data that is incomprehensible in raw numeric or grouping form. Learning the proper way to represent a particular data set in a graphical format is a great tool for the data professional. Keep in mind, however, that a graph can be as prone to misinterpretation as a statistic.

Consider for a moment a group of school scores. They range from an average of 94 in our district to an average of 98 in others. If I wish to show that we’re doing great in comparison to other districts, I simply use a huge range like 0-98 with gradations of 5 on a bar chart, and we look great. If, however, I want more money for my school, I make sure to use a range of 93-98 with gradations of 0.5. The visual there belies the real delta.

Or does it?

It’s important to understand these visualizations and what they represent. Charts and graphs are to data analysis what the watering can is to the bonsai tree: used properly they bring success; used improperly, failure.
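The axis trick in the school-scores example above is easy to quantify. This small sketch computes how tall our district’s bar is drawn relative to the other districts’ bar under each choice of baseline:

```python
ours, theirs = 94, 98  # the two averages from the example

def bar_height_ratio(baseline):
    """Ratio of our bar's drawn height to theirs when the
    chart's vertical axis starts at `baseline`."""
    return (ours - baseline) / (theirs - baseline)

full_axis = bar_height_ratio(0)     # axis from 0: bars look nearly equal
zoomed_axis = bar_height_ratio(93)  # axis from 93: our bar looks tiny
```

With the axis starting at zero our bar is about 96% the height of theirs; start the axis at 93 and it drops to 20% – same data, very different story.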

So Big Data is a fad. It will fade, over time, into the pantheon of other tech buzzwords. But the data it represents won’t – it exists now, and continues to grow. So it’s OK to allow the term for now, learn the concepts it presents, and bake it into what you do today. Big Data will only get bigger. And that’s not hype.


Buck Woody


Comments
  • Robert young

    Swan Dive
    From an analysis point of view, the question to be addressed is: what does one get by looking at population data, rather than (good or bad) samples? And the answer is, just a few Black Swans (aka, extreme outliers).

    For this exercise to be useful, those Black Swans are going to have to be, well, unusually valuable. A few years back the meme was “long tail”. It hasn’t turned out that way; the money is made by shifting millions of one widget (even Amazon does that). One exception being, in the US, pharma, which has de jure monopoly power over small indications.

    It’s worth noting that the modus operandi of corporations is to buy out competitors (would Adam Smith approve?), cull product lists, and thus reduce choice. Why do they do this? Because it just makes more money.

    I commend Janert’s “Data Analysis With Open Source Tools” (sorry MS) for discussion. For OR and math stats types, long tail was always a shibboleth.

  • timothyawiseman@gmail.com

    Scale Matters
    You make a number of excellent points and I appreciate you providing a different perspective.

    But I think scale matters. I also suspect the term “Big Data” will fade, but I think it will be replaced by more precise terms rather than turning into just “Data”. There is a difference between the scale of data that is comfortably handled by Excel, handled by SQL on a decent desktop, and handled by SQL on a massive server. That difference in scale is apparent both in the type of tool you should use to process the data and in what granularity of conclusion you can validly draw from the data.

    “Big Data” is a crude, marketing-speak way of getting that difference of scale across, and right now it serves that purpose reasonably well in a marketing sense and even reasonably well, in a rough way, in some technical material.

    While how we define that scale will likely change, the fact there will be significant gradations in number of datapoints available and difference in how those are handled at scale will likely become more important in the near future rather than less.

  • timothyawiseman@gmail.com

    Related
    I came across this today and it reminded me of your article: http://www.tomsitpro.com/articles/big_data_analysis-data_mining-mapreduce-sql_skills,1-207.html

  • pstamant

    A fad that is going to be around for a while
    I agree that “big” is a relative term, but I think that this “big” is something that isn’t going to be conquered soon. When your everyday business analyst knows how to build an OLAP cube and business intelligence is implied in the title, then big data will become obsolete. But as long as Excel spreadsheets – which are inadequate for modeling “big data” – remain the tool of choice for analysts, collecting data will remain far ahead of analyzing data for most of us.
