
Big Data: Size isn’t everything

Published 10 May 2013 10:58 am

Big Data has a big problem; it's the word "Big". These days, a quick Google search will uncover terabytes of negative opinion about the futility of relying on huge volumes of data to produce magical, meaningful insight. There are also many clichéd but correct assertions about the difficulty of separating correlation from causation in massive data sets. In reading some of these pieces, I begin to understand how climatologists must feel when people complain ironically about "global warming" during snowfall.

Big Data has a name problem. There is a lot more to it than size. Shape, Speed, and…err…Veracity are also key elements (now I understand why Gartner and the gang went with V’s instead of S’s).

The need to handle data of different shapes (Variety) is not new. Data developers have always had to mold strange-shaped data into our reporting systems, integrating with semi-structured sources, and even straying into full-text searching. However, what we lacked was an easy way to add semi-structured and unstructured data to our arsenal. New "Big Data" tools, such as MongoDB and other NoSQL (Not Only SQL) databases, or graph databases like Neo4j, fill this gap. Still, to many, they simply introduce noise to the clean signal that is their sensibly normalized data structures.
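
To make the point about shape concrete, here is a minimal sketch (assuming a local MongoDB instance and the pymongo driver; the collection and field names are made up for illustration) of storing two differently shaped records side by side in one collection, something a normalized reporting schema would resist:

    # Two differently shaped "customer event" documents in the same collection.
    # Assumes a local MongoDB instance and the pymongo driver (pip install pymongo);
    # the collection and field names are hypothetical.
    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")
    events = client["demo"]["customer_events"]

    # A simple page view: flat, minimal fields.
    events.insert_one({"type": "page_view", "user": "alice", "url": "/pricing"})

    # A purchase: nested line items and extra fields, with no schema change required.
    events.insert_one({
        "type": "purchase",
        "user": "bob",
        "items": [
            {"sku": "A-100", "qty": 2, "price": 9.99},
            {"sku": "B-200", "qty": 1, "price": 24.50},
        ],
        "coupon": "SPRING13",
    })

    # Queries run across both shapes; documents without "coupon" simply return None.
    for doc in events.find({"user": "bob"}):
        print(doc["type"], doc.get("coupon"))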

What about speed (Velocity)? It's not just high-frequency trading that generates data faster than a single system can handle. Many other applications need to make trade-offs that traditional databases won't, in order to cope with high insert rates or to extract the required information from data streams quickly.
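
To sketch the kind of trade-off involved (plain Python, standard library only, with a hypothetical event feed), the snippet below keeps a one-minute sliding window of events in memory and answers "how many in the last minute?" immediately, at the price of never persisting the raw stream:

    # A sliding one-minute window over an event stream, held entirely in memory:
    # fast to update and query, but nothing is durable. The event feed is hypothetical.
    import time
    from collections import deque

    WINDOW_SECONDS = 60
    window = deque()  # timestamps of events seen within the last WINDOW_SECONDS

    def on_event(timestamp):
        """Record an incoming event and evict anything that has aged out of the window."""
        window.append(timestamp)
        cutoff = timestamp - WINDOW_SECONDS
        while window and window[0] < cutoff:
            window.popleft()

    def events_in_last_minute():
        return len(window)

    # Hypothetical usage: feed events as they arrive from some stream source.
    for _ in range(1000):
        on_event(time.time())
    print(events_in_last_minute())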

Unfortunately, many people equate Big Data with the Hadoop platform, whose batch-driven queries and job-processing queues have little to do with "velocity". StreamInsight, Esper and Tibco BusinessEvents are examples of Big Data tools designed to handle high-velocity data streams. Again, the name doesn't do the discipline of Big Data any favors. Ultimately, though, does analyzing fast-moving data produce insights as useful as the ones we get through a more considered approach, enabled by traditional BI?

Finally, we have Veracity and Value. In many ways, these additions to the classic Volume, Velocity and Variety trio acknowledge the criticism that, without high-quality data and genuinely valuable outputs, data, big or otherwise, is worthless. As a discipline, Big Data has recognized this, and data quality and cleaning tools are starting to appear to support it. Rather than simply decrying the irrelevance of Volume, we need as a profession to focus on how to improve Veracity and Value. Perhaps we should just declare the 'Big' silent, embrace these new data tools, and help develop better practices for their use, just as we did with the good old RDBMS?
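
As a small illustration of what Veracity means in practice (plain Python; the records and validation rules are made up), the sketch below drops duplicates and rejects rows that fail a basic sanity check before anyone draws conclusions from them:

    # A minimal "Veracity" pass: deduplicate records and reject rows that fail
    # basic sanity checks before they reach any analysis. Records and rules are
    # invented for illustration.
    raw_records = [
        {"id": 1, "region": "EU", "revenue": 1200.0},
        {"id": 1, "region": "EU", "revenue": 1200.0},  # exact duplicate
        {"id": 2, "region": "??", "revenue": 950.0},   # unknown region
        {"id": 3, "region": "US", "revenue": -50.0},   # negative revenue
        {"id": 4, "region": "US", "revenue": 430.0},
    ]

    VALID_REGIONS = {"EU", "US", "APAC"}

    def is_valid(record):
        return record["region"] in VALID_REGIONS and record["revenue"] >= 0

    seen_ids = set()
    clean, rejected = [], []
    for record in raw_records:
        if record["id"] in seen_ids or not is_valid(record):
            rejected.append(record)
        else:
            seen_ids.add(record["id"])
            clean.append(record)

    print(f"kept {len(clean)} rows, rejected {len(rejected)}")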

What does Big Data mean to you? Which V gives your business the most pain, or the most value? Do you see these new tools as a useful addition to the BI toolbox, or are they just enabling a dangerous trend to find ghosts in the noise?

6 Responses to “Big Data: Size isn’t everything”

  1. Phil Factor says:

    Surely it boils down to two choices when handling data. The first choice is to adopt all the disciplines and constraints of any modern scientific discipline, in which you draw conclusions only where you can demonstrate that the null hypothesis (that the exciting things you've found are due to chance, data error, artefact, etc.) is highly improbable; in that case you take the 'data scientist' approach. The other choice is that you choose not to, in which case you join the 'big data' camp.

  2. AndyDent says:

    I think Veracity is most important but it is not binary.

    Taking Phil’s scientific theme further, I refer back to my work for CSIRO (Australian government science) with various ‘ists working with geological data.

    In some domains at least, people still want dodgy data. As a geologist said to me at a seminar on interchanging assay data “if all you have is shit, give it to me and tell me it’s shit”.

    I think this kind of large-scale data with high error rates also occurs in high data-volume areas such as oil & gas exploration drilling, remote sensing and water quality and flow sensing.

    This raises a querying problem that may be even harder to deal with than nulls – adding some kind of error estimator into the queries.

    So, adding a statistician to the team may be as important as adding a DBA.

  3. Robert Young says:

    The problem with big data follows from Andy's observation: Big Data codifies the effort to find needles in haystacks. The only reason to pass population data (which is 99.44% of cases), rather than samples, is to find outliers. And only those so extreme that they won't be found in large samples, which are a small fraction of the Big Data.

    So, then the question becomes, "what's it worth to ya, Bunky?" And the only rational answer has to be "lots". The Big Data folks don't make a good case for this being true.

    The other vector of Big Data is to pass, for example, every customer's purchase or search, in search of "hidden" correlations. Think about that for just a second.

  4. Keith Rowley says:

    I know there are special problems when dealing with large volumes of data that mean there is "some" value to the term "big data", but for us, data is data in a lot of ways.

    What we are really looking for, of course, is information: data processed to give meaningful insights. That is much harder to do right, no matter how "big" or small your dataset happens to be.

    One last comment regarding correlation versus causation. If you can use correlation to predict something correctly a significant percentage of the time, it does not really matter, in a lot of ways, whether you have found causation in the business world. Business does not have the same need to understand root causes that science does; as long as there is a way to reasonably monetize the correlation, business can use that just as well as causation.

  5. paschott says:

    I think I'd agree with keeping the "big" out of it – it's data and should be treated accordingly. We use MongoDB here for things like Session because some of our legacy code lets that get out of hand a bit, the shape is variable, and storing it as a BLOB that's constantly updated/read in SQL Server doesn't make sense.

    For larger data stores, we look at each problem individually. SQL Server helps us with the “Veracity” issue better than Mongo. It also helps us access the data more easily. SSAS helps us get to the details and patterns within the data.

    I think the tools available now will help enable better analysis in the long term, because you can't analyze what you didn't store. However, the flip side is that you need to know for what purpose you're analyzing the data. It's all too easy to find what seems to be a trend, but mistakenly attribute its cause to the wrong thing. I think SQL Server Central had an editorial or pointer to an article recently on this, where a money-laundering scam could be mistaken for legitimate business if analyzed incorrectly.

    At this time, the tools provide Velocity where SQL Server does not, for large, irregularly shaped data. We don't need long-term storage of it because it's fleeting and we can re-create it without too much trouble. It's just faster to read/write the data outside of SQL Server.

    Long term, we're probably looking at Veracity and Value as the main drivers. It has to be accurate and it has to be worth analyzing. Variety may come into it as well, but we can work around shaping the data towards something useful if needed.

  6. Natarshia says:

    When I first heard the term Big Data, and even now after I've learned more about it, the word BIG always makes me think of Tom Hanks playing that piano in the movie BIG :) As a career developer/DBA, I think it's OK for some technology to have these obscure names, because it made me ask, "What is big data? Do you mean grown-up data, like Tom Hanks in the movie, or what?" I spent about 4 months doing self-learning on Big Data. I started with Hadoop, not to learn Hadoop but to learn and understand what Hadoop was or was not doing; that led me to learn about NoSQL, which led me to learn about other technologies, and so on and so forth. We have to work in the present and the future. And with so many new technologies and multiple options, we need to know before management does what is best for the company. With that said, it goes unspoken, but most would say Veracity and Value should be a given, and we are handling the other three V's on the surface because they are the most visible. But after my 4 months of research, I learned that if we spent more time understanding our business and what we wanted to do with it, we could get way further than trying to learn every technology out there.
