
Know your Data Lineage

Published 9 May 2014 1:25 pm

An academic paper without footnotes isn’t an academic paper. Journalists wouldn’t base a news article on facts that they can’t verify. So why would anyone publish reports without being able to say where the data came from, and without confidence in its quality; in other words, without knowing its lineage (sometimes referred to as ‘provenance’ or ‘pedigree’)?

The number and variety of data sources, both traditional and new, increases inexorably. Data comes clean or dirty, processed or raw, unimpeachable or entirely fabricated. On its journey from its source to our report, the data can travel through a network of interconnected pipes, passing through numerous distinct systems, each managed by different people. At each point along the pipeline, it can be changed, filtered, aggregated and combined.

When the data finally emerges, how can we be sure that it is right? How can we be certain that no part of the data collection was based on incorrect assumptions, that key data points haven’t been left out, or that the sources are good? Even when we’re using data science to give us an approximate or probable answer, we cannot have any confidence in the results without confidence in the data from which they came.

You need to know what has been done to your data, where it came from, and who is responsible for each stage of the analysis. This information represents your data lineage; it is your stack-trace. If you’re an analyst, suspicious of a number, it tells you why the number is there and how it got there. If you’re a developer, working on a pipeline, it provides the context you need to track down the bug. If you’re a manager, or an auditor, it lets you know the right things are being done. Lineage tracking is part of good data governance.
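
To make that concrete, here is a minimal sketch in Python of the kind of record such a ‘stack-trace’ for data needs to capture at each step. The field names and values are illustrative assumptions of mine, not any particular tool’s schema:

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class LineageRecord:
    """One step in a dataset's history: what was done, to what, by whom."""
    dataset: str       # the output this step produced, e.g. "sales_by_region"
    inputs: list       # upstream datasets this step read from
    operation: str     # what happened: "filter", "aggregate", "join", ...
    system: str        # where it ran: "hive", "ssis", "r-script", ...
    owner: str         # who is responsible for this stage
    ran_at: datetime   # when the step executed

# An aggregated report table might carry a record like this:
step = LineageRecord(
    dataset="sales_by_region",
    inputs=["raw_sales", "region_lookup"],
    operation="aggregate",
    system="hive",
    owner="data-eng-team",
    ran_at=datetime(2014, 5, 9, 13, 25),
)
print(step.operation, step.inputs)  # aggregate ['raw_sales', 'region_lookup']
```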

Most audit and lineage systems require you to buy into their whole structure. If you are using Hadoop for your data storage and processing, then tools like Falcon allow you to track lineage, as long as you are using Falcon to write and run the pipeline. It can mean learning a new way of running your jobs (or using some sort of proxy), and even a distinct way of writing your queries. Other Hadoop tools provide a lot of operational and audit information, spread throughout the many logs produced by Hive, Sqoop, MapReduce and all the various moving parts that make up the ecosystem. To get a full picture of what’s going on in your Hadoop system, you need to capture both the Falcon lineage and the data-exhaust of the other tools that Falcon can’t orchestrate.
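
As a hedged illustration of what stitching that data-exhaust together involves, the sketch below funnels two log lines into a common lineage record. The log formats are invented for the example (real Hive or Sqoop logs look quite different); the point is the shape of the job, one small parser per format, all emitting the same record type:

```python
import re

# Hypothetical log lines -- these formats are invented for illustration;
# each real tool has its own, and you would write one parser per format.
LOGS = [
    "2014-05-09 13:01 HIVE query=q42 read=raw_sales wrote=staged_sales",
    "2014-05-09 13:07 CUSTOM job=cleanup in=staged_sales out=clean_sales",
]

HIVE_RE = re.compile(r"HIVE query=(\S+) read=(\S+) wrote=(\S+)")
CUSTOM_RE = re.compile(r"CUSTOM job=(\S+) in=(\S+) out=(\S+)")

def to_lineage_events(lines):
    """Normalise heterogeneous log lines into common lineage records."""
    events = []
    for line in lines:
        m = HIVE_RE.search(line)
        if m:
            events.append({"input": m.group(2), "op": f"hive:{m.group(1)}",
                           "output": m.group(3)})
            continue
        m = CUSTOM_RE.search(line)
        if m:
            events.append({"input": m.group(2), "op": f"custom:{m.group(1)}",
                           "output": m.group(3)})
    return events

print(to_lineage_events(LOGS))
# [{'input': 'raw_sales', 'op': 'hive:q42', 'output': 'staged_sales'},
#  {'input': 'staged_sales', 'op': 'custom:cleanup', 'output': 'clean_sales'}]
```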

However, the problem is bigger even than that. Often, Hadoop is just one piece in a larger processing workflow. The next challenge is how you bind together the lineage metadata describing what happened before and after Hadoop, where ‘after’ could be a data analysis environment like R, an application, or even an end-user tool such as Tableau or Excel. One possibility is to push as much of your key analytics as you can into Hadoop, but would you give up the power and familiarity of your existing tools in return for a reliable way of tracking lineage?

Lineage and auditing should work consistently, automatically and quietly, allowing users to access their data with whatever tool they need. The real solution, therefore, is a consistent method for bringing lineage data from these various disparate sources into the data analysis platform you already use, rather than being forced to use one tool, the one that manages the pipeline, for lineage and a different tool for the analysis.

The key is to keep your logs and your audit data from every source, bring them together, and use your data analysis tools to trace the path from the raw data to the answer the analysis provides.
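
Concretely, once the audit records from every source share one schema, tracing an answer back to its raw inputs is just a walk over a graph. Here is a minimal sketch, with dataset names invented for the example:

```python
# Each event says which datasets produced which, regardless of the tool.
events = [
    {"inputs": ["raw_sales", "region_lookup"], "output": "staged_sales", "op": "join"},
    {"inputs": ["staged_sales"], "output": "clean_sales", "op": "filter"},
    {"inputs": ["clean_sales"], "output": "quarterly_report", "op": "aggregate"},
]

# Index: output dataset -> the event that produced it.
produced_by = {e["output"]: e for e in events}

def trace(dataset, depth=0):
    """Walk upstream from an answer back to its raw sources."""
    event = produced_by.get(dataset)
    if event is None:
        print("  " * depth + f"{dataset}  <- raw source")
        return
    print("  " * depth + f"{dataset}  <- {event['op']}")
    for upstream in event["inputs"]:
        trace(upstream, depth + 1)

trace("quarterly_report")
# quarterly_report  <- aggregate
#   clean_sales  <- filter
#     staged_sales  <- join
#       raw_sales  <- raw source
#       region_lookup  <- raw source
```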

6 Responses to “Know your Data Lineage”

  1. Keith Rowley says:

    Unfortunately, in this world “should” and “do” are too often miles apart. Yes, this “should” be a simple, mostly invisible back-end process, but as your article pointed out, it is not, by a long shot.

  2. willliebago says:

    A lot of times lineage just becomes part of the technical debt :(

    • Simon Elliston Ball says:

      How do you mean Willie? Just in terms of the overhead of tracking lineage, piecing it together after the fact, or in terms of the upfront cost of auditing?

  3. paschott says:

    Wish we had some of those problems. The largest issue we have w/ data lineage is the transforms done as it switches from the OLTP system to a report. It doesn’t rest in that state – it’s just transformed through the report code to show a certain way. That means debugging has to go through the report (or maybe a function/stored proc) to figure out what was done along the way. It would be nice to have a great set of best-practice tools to store the data lineage as data is moved from OLTP to a data mart, but I haven’t seen a standard emerge for those yet.

    Referring to the above comment about technical debt, for us it would be because we’d likely have to put it together after the fact. The drive to show something to our customers would likely preclude putting that lineage in up front. Once we had a handle on it, we’d probably get a better idea of the quality/detail level for that lineage and how to manifest it. Until then – comments in our code and maybe some extended properties to get us by. :(

    I will give a shout out to Dave Stein (@made2mentor) who has done some pretty cool stuff w/ extended properties and BIML to help generate his SSIS packages and know that certain transforms are expected along the way.

    • Simon Elliston Ball says:

      Absolutely agree! I don’t think there’s anything wrong with ‘doing lineage’ after the fact. In fact, my view is that the collection of the data to enable lineage queries should be entirely transparent. Being able to track processes like this should be about supporting and improving agility in responding to customer requirements.

      Code comments are great, and extended properties are a brilliant way to add metadata to SQL Server objects, but how do you explain processes to non-technical users and end-consumers of reports? Are they even interested in how the numbers got there?

      • paschott says:

        Knowing the source of the data does depend quite a bit on the users. Usually, that’s only of interest to the data people. Everyone else’s eyes tend to glaze over. :) However, I’ve been with a couple of end-users who have been very interested in the source, transforms, and calculations along the way, to understand what they’re actually seeing. That can be exhilarating and frustrating at the same time. At times, they’ve concentrated completely on the wrong data for a drill-down. Other times they really walk through the logic and desired output, leading to better data for decisions.

        I guess it depends on the level of detail. Most end-users don’t necessarily care about the technical details of what moved the data, but they tend to be interested in the source and calculations if it directly affects their jobs.
