Click here to monitor SSC
  • Av rating:
  • Total votes: 7
  • Total comments: 2
Buck Woody

Data Science Laboratory System - Interactive Data Tools

18 March 2013

Data tools interact directly with data and are great for automating data  data-aquisition,  but they aren't always the best way to prototype or pilot a process. Interactive data tools also allow you to test and refine the process, until it is ripe for automation. 

This is the third in a series on setting up a Data Science Laboratory server – the first is located here.

My plan is to set up a system that allows me to install and test various methods to store, process and deliver data. These systems range from simple text manipulation to Relational Databases and distributed file and compute environments. Where possible, I plan to install and configure the platforms and code locally. The outline of the series looks like this:

I’ll repeat a disclaimer I’ve made in the previous articles - I do this in each one because it informs how you should read the series:

This information is not an endorsement or recommendation to use any particular vendor, software or platform; it is an explanation of the factors that influenced my choices. You can choose any other platform, cloud provider or other software that you like - the only requirement is that it fits your needs. As I always say – use what works for you. You can examine the choices I’ve made here, change the decisions to fit your needs and come up with your own system. The choices here are illustrative only, and not meant to sell you on a software package or vendor.

The first article in this series explains my choices for the platform and the operating system, and details a few tools for text-based data interaction. The second article deals with a few examples for those text tools. Even with a few examples however, there is no substitute for studying and reading up on the tools - this series isn’t meant to be an exhaustive examination of each one. Hopefully the examples give you an idea of what the tool can do, and then you can decide how much further you want to investigate it.

The third article in the series covers Instrumentation – the ability to time processes and evaluate resource allocation on the laboratory system. The laboratory system’s first function is to serve as a place to install, configure, learn and test software that deals with data; and measuring a process is paramount to the system.

In this article, I’ll explain about tools that interact directly with data

Interactive Data Tools

Most computing systems are multi-user, storing and processing data for programs that interact with that data. The system I’m building is a place to experiment with the tools for those types of systems, but it has a secondary purpose as well.

I’ll use the data laboratory system not only to experiment with new data engines, but also to examine and manipulate data and write proof-of-concept processes and code. The process I use to do that is:

  1. Define the requirements (what must occur)
  2. Define the constraints (what cannot occur)
  3. Create an outline from the finish of the process backwards to the start
  4. Test and refine the process
  5. Automate the process
  6. Monitor the process

Step four is where the interactive data tools come into play. Before I automate the system, I need to test various ways to ingress, manipulate, and egress the data.

There are a few tools that fall into this category. When I say “interactive”, I mean that the tools are meant to be used by a single person, in real-time, in front of a console. That isn’t to say they can’t be automated - almost anything can. But in general these tools aren’t meant to be automated into the final process, with one set of exceptions.

In the last few articles I’ve referenced various text-based tools, such as sed, awk and grep. These utilities are sometimes part of the process I develop, and in the examples I showed in those articles, can be automated into an enterprise-level production environment as well. So in addition to the text-based tools previously mentioned, I’ll install (or access) a few more tools that work with data interactively.

Microsoft Excel

Microsoft Excel ranks as one of the primary tools for data analysis in the vast majority of businesses and organizations I consult with. It has, without much exaggeration, been termed “the world’s most popular database”. Of course, Excel isn’t a multi-user engine for data processing. At its most basic, it’s a spreadsheet product, normally intended for one person (or at most a small group of people) to edit various forms of data.

The reason Excel is so popular is that it does so much more. Because it became a market standard, lots of add-ins are available, and a large body of knowledge has sprung up around it. Also because of its success in the market, Microsoft has invested a great deal of time and effort in the product, making it even more robust.

Although the product has thousands of uses, for this system I’ll focus in on only a few:

  • Small-scale data conversion (read from one format, send to another)
  • Data manipulation
  • Data calculation (involving maths, statistics, and date and time)
  • Data visualization

On my laboratory system I’m installing Microsoft Excel version 2013, 64-bit edition. There are so many new features and functions that it provides that having the latest version is important to the tasks I want to accomplish - I won’t list those here, and of course you can use an earlier version and different edition on your system.

It’s important to note here that I won’t install this on my Microsoft Windows Azure Virtual Machine. Licensing restrictions prohibit me from using an MSDN edition of Microsoft software on the Windows Azure Platform, which is also true for many other cloud platforms. Check with your legal department if you’re using a different cloud provider before you use an MSDN edition.

Rather than explain the processes for working with data in a general way for Excel, I’ll have you check the reference section for the books I use for learning and working with Excel. Let’s focus instead on some of the other use-cases for this tool for the data professional.

An interesting example of working with Excel is the ability to read data not only from local files, but being able to connect to multiple data sources, from Relational Databases to Internet data sets. As an example, here is the process to read in stock values from a Web Provider (You can read more about this process and all of the parameters here: Get external data from a Web page

I’ll start by opening a blank worksheet, and then navigating to the “Data” object on the Ribbon interface.

Next, I click the “From Web” button, and enter the following URL when the web interface comes up:

http://moneycentral.msn.com/investor/external/excel/quotes.asp?quote=MSFT

After selecting the yellow arrow representing the data set I want, I then take the default location in the spreadsheet on where to place it:

As you can see, there’s quite a bit of cleanup here, but this is only three clicks - there is a lot more you can do to bring a segment of data in to Excel. In some cases you can point to the data, in others you can copy the data, and in still others you can query the data on demand or refresh.

You can connect to or import from any number of Relational Database Systems or just about anything that you can connect to using ODBC connections.

PowerPivot

Excel provides an amazing array of charts, graphics, Key Performance Indicator (KPI) dials and graphs and more for visualizing data. Those you’ll also read about in the references below, and starting in Excel version 2010, Microsoft created a new extension to serve as a Business Intelligence tool called PowerPivot. This extension allows you to bring in massive sets of data (millions of rows) and combine them into Pivot Charts (more on Pivot Charts in the References section).

In Excel version 2010, you’ll need to add in PowerPivot with an installation, but in version 2013 it’s built in and all you have to do is enable it. I’ll give you a reference to both below.

There’s a of data evaluation you can do with PowerPivot, but the two main values are working with huge amounts of data (for a spreadsheet, anyway) and tying multiple datasets together. For an end-user, this is akin to learning to do SQL Joins. For the data professional, you can easily query data for your experiments and tests.

The real power of PowerPivot isn’t best shown in a static walkthrough; you really have to see the movement of the data and the ease of creating visualization to understand it well. After you install it, you can check out these short videos: Business Intelligence - PowerPivot. That site also has lots of other training material on it.

References:

  • Just getting started with Excel? Use the help within the product to begin, and then go to Excel 2007 training courses to find videos, articles, books and more on the product. This is for 2007, but the basics here hold for all versions. In the newer versions you’ll find the help system links to similar content in a different format.
  • After you’re past the basics, one of the best references I’ve found on Excel is “Excel Data Analysis and Business Modeling” by Wayne Winston. Microsoft® Excel® 2010: Data Analysis and Business Modeling This is the 2010 edition, and I’ll buy the 2013 edition when he updates it. I recommend it highly.
  • Install PowerPivot for Excel 2010 (you can also find the instructions for enabling it in Office 2013) here: Download PowerPivot
  • If you’re new to Pivot Charts, check out this article: PivotTable reports 101

Hive for Excel Add-In

Later in this series I’ll cover a distributed data computing system called Hadoop. I’ll quickly mention an add-in here for Excel that allows a query process called “Hive” to use Excel against even a huge data store like Hadoop. I mention it here only for completeness, you’ll have to wait until you have access to Hadoop on Windows Azure (which is by request).

Since I haven’t covered Hadoop yet, I’ll mention this add-in again when I come to that section. You need to have access to the HadoopOnWindowsAzure site to install the driver, but you can see the entire process here in case you don’t want to wait for my article on that: http://social.technet.microsoft.com/wiki/contents/articles/6226.how-to-connect-excel-to-hadoop-on-azure-via-hiveodbc.aspx

Wolfram Alpha

Wolfram Alpha (WA) is a web site (http://www.wolframalpha.com/) that deals with all things math, and also has interesting data applications. There is a premium and free access area, and in this case I’m using the free edition. I also have a paid subscription to Wolfram Alpha to do my day job. Even if you use the free version, I recommend you join the site, since you can set up individual preferences.

It’s difficult to adequately describe what WA does. The simplest thing to do is to visit the examples page (http://www.wolframalpha.com/examples) and explore the site further. You can bring in datasets or use datasets they provide, and much more. Whenever I hit an area in statistics I haven’t used in a while, I use Wolfram Alpha. After you explore it you may find yourself using it more than you might expect.

The R Project

The R Project for Statistical Computing (usually just called “R”) is an open source project that is an implementation of the “S” statistical language. R can also use “packages”, which are extensions to the functions included in R that can be downloaded and referenced in a sequence of steps.

As you progress in your career as a data professional, you will find that you need to learn, or re-learn, statistics. It’s an interesting fact that the data professional and “researcher” roles of yester-year are now converging, into what some have termed the “data scientist”. Debate continues on whether this is a factual title, but the point is that statistics plays a huge part of the data landscape in modern computing.

I originally debated whether to include R in this article or in the one that deals with programming and scripting. In R you can write commands similar to scripting, and you can also save those scripts and run them programmatically. R also has the ability to work in an automated fashion and even federated to work as an “engine” of sorts.

Even so, I find myself mostly using R in batch-mode, sitting at the terminal. I have yet to install it on a server to process data automatically.

To download R, visit the site at: http://www.r-project.org/ and then select “Download R” from the first page, and select a server that is closest to you. You’ll be directed to a page to select Mac, Linux or Windows as the operating system where you want to run it. With that done, you’re directed to a page to download “base”, “contrib”, or “RTools”. Unless you’re developing for the R project, the “base” package is the binary to download.

From there simply run the setup program you download, and take all the defaults.

When you’re done, you’ll have a command-line type window where you can type expressions that R will evaluate and return to you. If you’d like a formal course in the R environment, check out this reference: http://www.r-project.org/ and select the “Documentation” page. There’s a full set of manuals, introductions and so on there. If you’d like a shorter version, check out the references below.

When you start R, you’re dropped in the main command line environment. For help, you can simply type help(). You can also type demo() for a few demonstrations in the main package, but if you’re not working with statistics a great deal, many of the topics may not be helpful.

The key to working with R is to realize that everything is a “vector” (a kind of linear array), and that the primary method of working with symbols, constants and variables is through assignment. This is similar to Lisp or other recursive, function-heavy languages.

But this is a bit esoteric. Let’s take a concrete example, from the “sample session” in the R documentation.

At the R console, type the following commands:

x <- 1:10

w <- 1 + sqrt(x)/2

w

This reads as follows: “Assign the values 1 through 10 to the 1-dimensional array “x”.”

The variable “x” (a vector, technically) now holds (1,2,3,4,5,6,7,8,9,10).

The second line reads “Assign 1 plus the square root of “x” divided by 2 to the array “w”.”

The vector w now holds the approximate values of (1.5, 1.7, 1.8, 2.0, 2.1, 2.2, 2.3, 2.4, 2.5, 2.58). This is called the weighted standard deviation for each value, which is a value that helps you understand how well you can trust the number as an average - more at http://www.gla.ac.uk/sums/users/kev/Height_and_Weight/stdev2.html.

While this is all well and good, how can R directly assist you in your day to day job? Let’s return to the pubs sample database tables we exported to tab-separated files a few articles back. Assuming that you have that done, change the path below (but not the forward-slashes) and run this line:

x <- read.table("C:/cygwin/home/bwoody/datasources/pubs/sales.tsv", header = FALSE, sep = "\t")

This line reads:

“Assign vector “x” the contents of the file sales.tsv. Ignore any headers, and it’s tab-separated”.

So far, so good. But what can we do with it?

For one thing, we could see the statistical spreads and data within the set. But we can also simply create a quick chart using the 5th position of the data - when indicates the invoice was due:

plot(x[5])

And we get something that looks like this:

And of course R does much more. You can read data from the web, such as the stocks for Microsoft as in this example:

stocks <- read.csv("http://ichart.yahoo.com/table.csv?s=MSFT&a=0&b=1&c=2006%20&d=0&e=31&f=2012&g=w&ignore=.csv")

And then plot the dates and closes for the stock:

plot(stocks$Date, stocks.Close)

The “stocks” vector has headers that can be referenced, such as the .Date and .Close in this example.

R Studio IDE

I don’t always use an Integrated Development Environment for console-based applications, but I really like the R Studio package. You can find it here: http://www.rstudio.com/ The installation is very straightforward, and will locate your R environment automatically.

I find it especially helpful for reading data from various locations (as opposed to typing them out) and locating and installing new packages to work with.

References:

In the next installment, I’ll cover programmatic methods and tools that I’ll work with on the laboratory.

Buck Woody

Author profile:

Buck Woody has been working with Information Technology since 1981. He has worked for the U.S. Air Force, at an IBM reseller as technical support, and for NASA as well as U.S. Space Command as an IT contractor. He has worked in most all IT positions from computer repair technician to system and database administrator, and from network technician to IT Manager and with multiple platforms as a Data Professional. He has been a DBA and Database Developer on Oracle systems running on a VAX to SQL Server and DB2 installations. He has been a Simple-Talk DBA of the Day

Search for other articles by Buck Woody

Rate this article:   Avg rating: from a total of 7 votes.


Poor

OK

Good

Great

Must read
Have Your Say
Do you have an opinion on this article? Then add your comment below:
You must be logged in to post to this forum

Click here to log in.


Subject: About R
Posted by: Robert young (view profile)
Posted on: Tuesday, April 16, 2013 at 11:01 AM
Message: - R, by intent and design, is interactive, not batch. it can be made to do the latter, but it's not SAS/SPSS, with their mainframe heritages and RBAR view of data. there are (commercial) variants which provide disk based row execution, Revolution Analytics being the most known.

- while R is internally vector based (column major, just like Fortran; imagine that), most of the newer packages, and some of the base, view the world as a dataframe, which is semantically just an un-normalized RDBMS table. R even attempts to do SQL-like joins with a 'merge' verb. don't use it. keep your data in SQL Server (or whatever), do the data manipulation there, then the stats in R. unlike SAS, R hasn't (and most likely never will) been built to be a data management engine, although there is a well known text which makes the case for such: http://www.amazon.com/Data-Manipulation-R-Use/dp/0387747303/ref=sr_1_8?s=books&ie=UTF8&qid=1366131489&sr=1-8&keywords=data+management+r

- just go for RStudio. the standard GUI for windows is OK, but no where else. if you get into R, you'll soon migrate to *nix, since the stat world has mostly made the migration. and RStudio has a built-in macro/hot-key for the assignment operator, <-, with [alt,-]. makes life much easier, especially if you're an APL refugee.

- the language is either schizoid or integrated, depending on one's point of view. in the end, the developers decided that the syntax would serve two masters: command execution and program coding. I'm still of the view that this was a mistake; SAS and SPSS, etc. make more sense by having two "languages" for the two functions (whether either did this in a successful way, is a different question). the R cadre still feels otherwise, although the move to use C++/C as user-level coding language (the R engine, and much of the standard bits and pieces, is C/C++) through packages like Rcpp is gaining momentum. then, there's Julia, which is being promoted as a full replacement for R, as coding language.

- the R community is, I'd say, mostly academic, and not so much commercial. thus, one can find a package in R to execute bleeding/cutting edge stats long before SAS/SPSS/Stata/BMDP/etc. there are now more than 4.000 on CRAN and nearly 1,000 on BioConductor. that's as chaotic as it sounds, since there's no BDFL.

- check out R-bloggers ( http://www.r-bloggers.com/ ), and search for 'excel london whale' for a (somewhat) biased comparison between Excel and R for banksters.

I'll go back under the setee now.

Subject: http://tester51.com/
Posted by: click here (not signed in)
Posted on: Wednesday, April 24, 2013 at 4:23 PM
Message: Thanks for composing this. I truly feel as though I understand a lot more about this than I did previously. Your blog actually brought some things to light that I never ever would have thought of in the past reading it. You need to continue this, Im sure many people would concur youve got a present.

 

Top Rated

Data Science Laboratory System – Object-Oriented Databases
 Object-Oriented Databases (OOD) avoid the object-relational impedence mismatch altogether by tightly... Read more...

Tales from a Cloud Software Firm
 Following on from a discussion about how people are using the cloud, the Simple-Talk Editorial Team sat... Read more...

Data Science Laboratory System – Document Store Databases
 A Document Store Database (DSD) is similar to a Relational Database Management system with the... Read more...

Data Science Laboratory System - Instrumentation
 It is sensible to check the performance of different solutions to data analysis in 'lab' conditions.... Read more...

Testing the StreamInsight Service for Windows Azure
 Getting 'up to speed' with StreamInsight is easier if you take the time to run it and test it out.... Read more...

Most Viewed

Windows Azure Virtual Machine: A look at Windows Azure IaaS Offerings (Part 2)
 We continue our introduction of the Azure IaaS by discussing how images and disks are used in the Azure... Read more...

PHPFog and Pagoda Box: A Look at PHP Platforms
 Cloud platforms such as Heroku, AppEngine, PHPFog and Pagoda Box are ideal for companies who just want... Read more...

An Introduction to Windows Azure BLOB Storage
 Azure BLOB storage is persistent Cloud data storage that serves a variety of purposes. Mike Wood shows... Read more...

Managing session state in Windows Azure: What are the options?
 Because you can't maintain session state for ASP.NET applications in Azure using the default in-process... Read more...

Creating a custom Login page for federated authentication with Windows Azure ACS
 Windows Azure Acess Control Service (ACS) provides a way of authenticating users who need to access web... Read more...

Why Join

Over 400,000 Microsoft professionals subscribe to the Simple-Talk technical journal. Join today, it's fast, simple, free and secure.