Buck Woody

Big Data is Just a Fad

13 April 2012

The term 'Big Data' is nothing more than a fad, and we'll soon be cringing with embarrassment at the thought that we ever used it. However, the data, and the challenges of processing it, will stay with us. If jargon like 'Big Data' helps us focus on the problems, then let’s use it: temporarily, perhaps.

It seems you can’t swing a corrupt politician these days without hitting an article, advertisement, session or discussion of “Big Data”. I’ve even done a few of those myself. In fact, at the university where I teach, I’m on a board that selects the course content, teachers and materials for its offering on this topic.

But it’s just a fad. Like so many we’ve seen before.

In the days when the Personal Computer was first introduced into the mass market, companies were jockeying for market position, and of course the way you do that is to append some moniker the general public can latch onto. In those days it was digital-this or that. That transitioned to e- everything, and long before the phone of the fruit variety, it morphed to i- something or other.

And the process hasn’t changed much. Who can forget such classics as N-tier programming, Service Oriented Architectures, Business Intelligence and of course my current favorite, The Cloud? True, some of these are actually in use; others have long faded into the hype-cycle.


Companies that sell things are faced with a choice – either jump on the bandwagon and use the buzzword in their products or be considered last-year’s paradigm. So of course most of them succumb to buzzword mania to stay relevant. When it works out to be a real offering, it’s a hit and the company is rewarded, and when it’s just a marketing word that doesn’t pan out into a real offering they are derided as selling hype.

Enter “Big Data”. Like Cloud, this term can mean almost anything, so anyone can define it the way they wish. And of course they have - adding to the confusion and the marketing-flavor of the term.

One definition of Big Data involves four V’s:

  • Volume – The amount of data held
  • Velocity – The speed at which the data should be processed
  • Variety – The variable sources, processing mechanisms and destinations required
  • Value – The amount of data that is viewed as non-redundant, unique and actionable

Other definitions have been proposed. I have one I like to use:

  • Big Data is data that you aren’t able to process and use quickly enough with the technology you have now.

I like this definition because it’s easy to understand, and encompasses other definitions. (Also, it has the words “Don’t Panic” on the cover and is slightly cheaper than other definitions.)


I stay away from a particular size of data as Big Data because that changes by the month. What was a large amount of data last year can fit on a USB drive I use on my laptop. It really isn’t about size, even though the term seems to indicate that it is.

Wait – I’ve been fretting about the definition of something I said earlier was just a fad. A buzzword. A marketing phrase. Why even discuss it if it will be gone soon? Because of the way it will disappear. Remember those buzzwords of days gone by? We no longer refer to a PC as a digital PC, or a website as an i-article anymore. But those things are still here. You’re probably reading this on a PC, or a tablet, on the Internet.

We use buzzwords and undefined terms to help solidify what we don’t yet have fully defined in our culture. Some of them die naturally because they are truly vapor; others bake in so completely to the environment they no longer need a qualifying term. And I believe that is what will happen to Big Data. The term is the fad – not the concepts it deals with. The term will fade – I give it another twelve months or so – but not what it refers to.

The idea of so much complex data needing complicated processing is with us, and perhaps has been since the birth of the Internet. It will continue, and become even more prevalent, as we instrument the world. The savvy technical professional sees past the hype and learns the concepts they need to move their organization forward.

It’s actually not about “Big” data – it’s about data in general. The term will fade because, as we begin the transition to data-centric architectures, ones where the data is classified, secured and its path determined at the outset of programming, the “big” part will just go away and fade into “data”, as it should. It’s not about the amount or the complexity; it’s more about the way we source, store, process, analyze and present the data.

In fact, I’ll go so far as to say that all computing is simply the re-arranging of data. An e-mail isn’t about the client program; it’s about the e-mail payload. The client program simply allows me to manipulate the data. The data is the key, and it’s important to broaden your concepts to think about data first, regardless of its size.

Sources of Data

We need to think about data comprehensively – all types of data. We often think about data we keep in structured storage – databases and so on – but, in fact, the data that businesses rely on is not only scattered around the organization but is even held off-site.

But how do you get your mind around all of the data you store at a company? Do you have to think about every single Word document or Excel Spreadsheet? Not necessarily…

I’ve found that one of the best ways to think about data in its entirety is to consider the Business Continuity Plan.

Note: If you do not have a Business Continuity Plan, you should probably get that done before you do anything else – it’s a critical part of an organization, and something the business might not think about. They look to you, as the technical professional, to plan for this.

A Business Continuity Plan describes what the organization needs to keep operating on a day-to-day basis. These days that includes the computing systems they use to do everything from producing what they sell to tracking those sales, paying bills and paying the people who work on getting the product out the door. That means everything from those hidden Excel spreadsheets and much-reviled Access databases in various departments to the Software-as-a-Service (SaaS) products used for payroll, accounting and finance.

At the very least, these data sources should be identified and classified as mission-critical, and at best they should be documented, backed up, and the systems to run them should be duplicated somewhere offsite. It takes only one disaster to show the value of this effort.

From the “Big Data” perspective, each of these structured and unstructured systems represents a possible source of data to mine for information or for a specific answer. This actually represents an interesting choice right at the start. It’s not essential that you take the data and move it into another location to analyze it. Using systems like StreamInsight, in some cases you can simply examine the data as it passes from creation to original storage to take actions based on content. For instance, you might wire up a connector to listen to the accounting and finance system to trigger a report when a certain threshold value is met for a condition – all without storing the data again somewhere else for analysis.
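As a rough illustration of acting on data in flight, here is a minimal Python sketch. It is not StreamInsight code and does not use its API; the `watch_threshold` function, the invoice records and the 10,000 threshold are all hypothetical, chosen only to show the shape of the idea: each event is inspected once as it passes and nothing is stored for later analysis.

```python
from typing import Iterable, Iterator


def watch_threshold(events: Iterable[dict], field: str, threshold: float) -> Iterator[dict]:
    """Yield each event whose `field` value meets or exceeds `threshold`.

    Events are examined one at a time as they pass through; no copy is kept.
    """
    for event in events:
        if event[field] >= threshold:
            yield event


# Hypothetical feed standing in for the accounting and finance system.
invoices = [
    {"id": 1, "amount": 120.0},
    {"id": 2, "amount": 15000.0},
    {"id": 3, "amount": 980.0},
]

# In a real system this is where a report would be triggered.
flagged = [event["id"] for event in watch_threshold(invoices, "amount", 10000.0)]
print(flagged)  # [2]
```

Because `watch_threshold` is a generator, it works just as well over an unbounded source (a socket, a message queue) as over the small list used here.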

There are also methods of examining data-at-rest in its source system for mining or pattern matching, such as using open source products like Lucene or even using the built-in functionality within SharePoint to find things in documents, spreadsheets and more. All of these items represent data, even though you don’t store them in a database management system, relational or otherwise.
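Search tools of this kind are typically built around an inverted index: a map from each term to the documents that contain it. The sketch below is a toy version in plain Python, not Lucene or SharePoint code; the file names and their contents are made up for illustration, and a real engine adds text extraction, stemming, ranking and much more.

```python
import re
from collections import defaultdict


def tokenize(text: str) -> list[str]:
    """Lower-case the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents: dict[str, str]) -> dict[str, set[str]]:
    """Map each term to the set of document names containing it."""
    index: dict[str, set[str]] = defaultdict(set)
    for name, text in documents.items():
        for term in tokenize(text):
            index[term].add(name)
    return index


def search(index: dict[str, set[str]], query: str) -> set[str]:
    """Return the documents that contain every term in the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical documents scattered around the organization.
docs = {
    "budget.xlsx": "Quarterly budget forecast for finance",
    "notes.docx": "Meeting notes about the budget review",
    "payroll.csv": "Payroll export for finance team",
}
index = build_index(docs)
print(sorted(search(index, "budget")))          # ['budget.xlsx', 'notes.docx']
print(sorted(search(index, "finance budget")))  # ['budget.xlsx']
```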

The point is to consider the source of the data using the Business Continuity Plan process, or simply leveraging the one you currently have. Think of the sources of your data as being those which the business uses every day.

Storing and Processing Data

If you do need to move the data, you need a place to keep it and a way to process it for analysis. In some cases the data is structured or semi-structured, and needs no further modification – another process can simply take the data as an input and process it.

In other cases, the data is not only stored in a certain way but needs a pre-processing mechanism. Using the HDFS system within Hadoop is one such example. The data is stored on various nodes, which allows for scale.

This brings up an interesting concept now being explored using newer programming languages like Bloom. In a distributed architecture the hard problem is state, which is why a data-centric view is the key. Data can grow so large that it isn’t possible for a single node to process it in a reasonable time – which fits my earlier definition of Big Data. These languages move even the program itself onto multiple nodes, right next to the data, which in this case represents the state of the system.

It’s important to consider that there isn’t one single way to store and process data, and that even Excel and Access (among others) have a place. In fact, looking at your organization as a whole, you’ll find that these products represent a logical distributed system – just not very well connected. That’s where you come in, and where Big Data becomes simply data.

Analyzing Data

After you’ve selected the method you want to use for storing and processing a datum, you need to figure out the best way to analyze it. This may or may not refer to reports or Business Intelligence – or that may be pushed off to the next phase, presentation.

Creating an Excel formula is one way to analyze data, as is creating an “R” function to walk across a huge text file; so is a SQL Server stored procedure, a Map/Reduce function in Hadoop or a query over data returned from an object called in a class. The particular software used isn’t as important as who is doing the analysis, the reason they are doing it, and who needs to see the results.
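For readers who haven't met the Map/Reduce style, the following is a single-process Python sketch of the pattern, using the usual word-count example. It is not Hadoop code: the function names are hypothetical, and a real cluster would run the map and reduce phases on many nodes and handle the grouping ("shuffle") step itself.

```python
from collections import defaultdict
from typing import Iterable, Iterator


def map_phase(line: str) -> Iterator[tuple[str, int]]:
    """Emit a (word, 1) pair for every word in one line of input."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[tuple[str, int]]) -> dict[str, list[int]]:
    """Group the emitted values by key, as the framework would between phases."""
    groups: dict[str, list[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: list[int]) -> tuple[str, int]:
    """Collapse all the values for one key into a single total."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> dict[str, int]:
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = word_count(["big data is data", "data is the key"])
print(counts["data"], counts["is"], counts["big"])  # 3 2 1
```

The point of splitting the work this way is that `map_phase` and `reduce_phase` each look at only one line or one key, so they can run wherever that piece of the data happens to live.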

In some cases, those folks that should be analyzing data aren’t doing so. I’ve recommended to organizations that they pay for Excel training for all of their workers rather than starting a Business Intelligence project. This would, of course, depend on the kind of workers that the company has and who can make decisions based on the data.

That’s the most important point in analysis: what are you going to do with the result of the analysis? If it is simply a TPS report that no one will ever see, there’s no need to worry about it. If, however, the results are actionable and you can get them to the right people, that’s what you should focus on.

Presenting Data

It’s always fascinating to me that people tend to focus first on pretty graphs, charts or reports before they consider what they’ll do with data. It’s understandable, since we’re visual creatures, that we would gravitate this way, but from a business standpoint it should be the very last thing you care about.

Interestingly, in many Big Data definitions, a report or visualization isn’t the result. In fact, it’s far more common in things like Machine Learning that you’re looking for a single answer – it’s the “42” you’re looking for, not a pie-chart. Charts, graphs and reports are merely instruments to get you to an answer, not the answer itself.

That isn’t to say that these things don’t have their place. A graphic can make sense out of data that is incomprehensible in raw numeric or grouping form. Learning the proper way to represent a particular data set in a graphical format is a great tool for the data professional. Keep in mind, however, that a graph can be as prone to misinterpretation as a statistic.

Consider for a moment a group of school scores. They range from an average of 94 in our district to an average of 98 in others. If I wish to show that we’re doing great in comparison to other districts, I simply use a huge range like 0-98 with gradations of 5 on a bar chart and we look great. If, however, I want more money for my school, I make sure to use a range of 93-98 with gradations of 0.5. The visual there belies the real delta.

Or does it?

It’s important to understand these visualizations and what they represent. Charts and graphs for data analysis are the watering can of the bonsai tree. Used properly they represent success; used improperly, they represent failure.
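The school-score example can be put into numbers. The short Python sketch below (the helper name `bar_height_ratio` is made up for illustration) computes how tall our district's bar is drawn relative to the best district's bar, depending on where the axis starts. The scores 94 and 98 and the axis minimums 0 and 93 come from the example above.

```python
def bar_height_ratio(low: float, high: float, axis_min: float) -> float:
    """Drawn height of the lower bar as a fraction of the higher bar,
    for a bar chart whose value axis starts at `axis_min`."""
    return (low - axis_min) / (high - axis_min)


ours, theirs = 94, 98

full_axis = bar_height_ratio(ours, theirs, axis_min=0)      # axis runs 0-98
cropped_axis = bar_height_ratio(ours, theirs, axis_min=93)  # axis runs 93-98

print(f"{full_axis:.2f}")     # 0.96
print(f"{cropped_axis:.2f}")  # 0.20
```

The underlying gap is the same four points in both charts. On the full axis our bar is about 96% as tall as theirs; on the cropped axis it is only 20% as tall.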

So Big Data is a fad. It will fade, over time, into the pantheon of other tech buzzwords. But the data it represents won’t – it exists now, and continues to grow. So it’s OK to allow the term for now, learn the concepts it presents, and bake it into what you do today. Big Data will only get bigger. And that’s not hype.

Buck Woody

Author profile:

Buck Woody has been working with Information Technology since 1981. He has worked for the U.S. Air Force, at an IBM reseller as technical support, and for NASA as well as U.S. Space Command as an IT contractor. He has worked in almost all IT positions, from computer repair technician to system and database administrator, and from network technician to IT Manager, and with multiple platforms as a Data Professional. He has been a DBA and Database Developer on systems ranging from Oracle running on a VAX to SQL Server and DB2 installations. He has been a Simple-Talk DBA of the Day.

Subject: Swan Dive
Posted by: Robert young (view profile)
Posted on: Saturday, April 21, 2012 at 2:33 PM
Message: From an analysis point of view, the question to be addressed is: what does one get by looking at population data, rather than (good or bad) samples? And the answer is, just a few Black Swans (aka, extreme outliers).

For this exercise to be useful, for those Black Swans to be worth finding, they're going to have to be, well, unusually valuable. A few years back the meme was "long tail". Hasn't turned out that way; the money is made by shifting millions of one widget (even Amazon does that). One exception, in the US, is pharma, which has de jure monopoly power over small indications.

It's worth noting that the modus operandi of corporations is to buy out competitors (would Adam Smith approve?), cull product lists, and thus reduce choice. Why do they do this? Because it just makes more money.

I commend Janert's "Data Analysis With Open Source Tools" (sorry MS) for discussion. For OR and math stats types, long tail was always a shibboleth.

Subject: Scale Matters
Posted by: timothyawiseman@gmail.com (view profile)
Posted on: Thursday, April 26, 2012 at 4:58 PM
Message: You make a number of excellent points and I appreciate you providing a different perspective.

But I think scale matters. I also suspect the term "Big Data" will fade, but I think it will be replaced by more precise terms rather than turning into just "Data". There is a difference between the scale of data that is comfortably handled by Excel, handled by SQL on a decent desktop, and handled by SQL on a massive server. That difference in scale is apparent both in the type of tool you should use to process the data and in what granularity of conclusion you can draw with validity from the data.

"Big Data" is a crude, marketing-speak way of getting that difference of scale across, and right now it serves that purpose reasonably well in a marketing sense and even reasonably well, in a rough way, in some technical material.

While how we define that scale will likely change, the fact there will be significant gradations in number of datapoints available and difference in how those are handled at scale will likely become more important in the near future rather than less.

Subject: Related
Posted by: timothyawiseman@gmail.com (view profile)
Posted on: Wednesday, May 02, 2012 at 11:51 AM
Message: I came across this today and it reminded me of your article: http://www.tomsitpro.com/articles/big_data_analysis-data_mining-mapreduce-sql_skills,1-207.html

Subject: A fad that is going to be around for a while
Posted by: pstamant (view profile)
Posted on: Monday, May 07, 2012 at 4:55 AM
Message: I agree that "big" is a relative term, but I think that this "big" is something that isn't going to be conquered soon. When your everyday business analyst knows how to build an OLAP cube and business intelligence is implied in the title, then big data will become obsolete, but as long as Excel spreadsheets, which are inadequate when trying to model "big data," are the tool of choice for analysts, collecting data will remain far ahead of analyzing data for most of us.

 
