Sergei Dumnov

Adding Public Domain Data to your Software Demos with R

27 November 2013

With all the extensions that are available, it is astonishing how little R code is required to achieve impressive BI graphics. You can even use R to take the latest data from a table on a website and apply it to a map in just a few lines of code; here they are, with Sergei's explanation.

Sometimes I want to add public domain data, e.g. from the Bureau of Statistics, to my business intelligence (BI) software demonstrations to support some ideas, and I have always tried to side-step this boring, mundane task. That was the way of things for quite a while, until I discovered how R can make the task real fun.

Let’s assume that our BI application contains a Year-to-Date Sales by Australian (AU) State report. The application gives us the option to visualise this information directly on a map of Australia, as a bar graph, or as a table.

It is very likely that this information strongly correlates with the population by state, but we do not have population-by-state data in our database. The answer is to create a live link to a public domain site that contains the required information and allows this type of access, bring the data directly into the application (an operation often called web scraping), transform it, and use it for the sales analysis and potentially in other parts of the application. You have to take some precautions if you plan to use this in production versions of your applications, as the data source page might change in the future.
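That precaution is worth sketching before we scrape anything. The snippet below is a minimal sketch, not part of the original application: `safe_scrape` and `broken_fetch` are hypothetical names I have introduced for illustration. The idea is simply to wrap the scrape in `tryCatch` and validate the shape of the result, so that a changed source page degrades gracefully instead of breaking the demo.

```r
# Hypothetical defensive wrapper (not from the article): fetch is any
# function that performs the scrape; fallback is returned whenever the
# scrape fails or no longer has the expected shape.
safe_scrape <- function(fetch, fallback) {
  result <- tryCatch(fetch(), error = function(e) fallback)
  # Validate the structure the rest of the application depends on
  if (!is.data.frame(result) ||
      !all(c("State", "Population") %in% colnames(result))) {
    return(fallback)
  }
  result
}

fallback <- data.frame(State = character(0), Population = integer(0))

# Simulate the source page changing or the request failing
broken_fetch <- function() stop("page layout changed")
tbl <- safe_scrape(broken_fetch, fallback)
nrow(tbl)  # 0 rows: the fallback, rather than a crash
```

With a guard like this, the demo falls back to an empty (or cached) data frame instead of stopping with an error in front of an audience.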

Web Scraping a Data Source

The web site I am going to fetch the data from is the Wikipedia page "Ranked list of states and territories of Australia". The first HTML table on this page is shown below.

Figure 1. The web scraping data source.

Our next tasks are to

  •  Create a live link to this table from our application
  •  Extract the 2nd and 3rd columns of the table
  •  Take only the rows that correspond to the main Australian states and territories (rows 1 to 8)
  •  Rename the extracted columns to State and Population (for brevity) respectively
  •  Display the state population on a map, as a bar graph, or as a table, e.g. as shown below

Figure 2. Distribution of the Australian population by state (tooltip shows the NT population).

Here is the 14-line software fairy that does the magic in a fraction of a second (on my PC, over a good Internet connection) …

Web Scraping R Code Snippet

if (!require(XML)) install.packages('XML')
if (!require(googleVis)) install.packages('googleVis')
library(XML)
library(googleVis)
url <- 'http://en.wikipedia.org/wiki/Ranked_list_of_states_and_territories_of_Australia'
doc <- htmlParse(url)
ndx <- getNodeSet(doc, "//table")
tbl <- readHTMLTable(ndx[[1]])
tbl <- tbl[, 2:3]
colnames(tbl) <- c("State", "Population")
tbl <- transform(tbl, Population = as.integer(gsub(",|-", "", Population)))
tbl <- subset(tbl, !is.na(Population) & Population > 5000 & Population < 20000000)
AUP <- gvisGeoMap(tbl, locationvar='State', numvar='Population', hovervar='State', options=list(height=350, region='AU', dataMode='regions'))
plot(AUP)

Listing 1. Web Scraping and plotting R snippet.

How does it work? Lines 1 to 4 install (if required) and load the two libraries, XML and googleVis. The first liberates us from that boring, mundane task, whilst the second elegantly plots the extracted data.

Line 6 fetches the entire web page that contains the required data and saves it in the variable doc. The next line searches within the saved page, selects only the HTML tables, and saves this collection in the ndx variable. It is assumed that the HTML tables are well-formed.

After that, line 8 takes the first table out of the ndx collection, the one we need, and saves it in the tbl data frame. The following two lines perform the transformations mentioned above: removing the redundant columns and renaming the remaining ones.
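Lines 9 and 10 can be tried out on a toy data frame, without touching the network. The mini table below is only a stand-in for the scraped Wikipedia table; the column subsetting and renaming are exactly the same operations as in Listing 1.

```r
# A stand-in for the scraped table: column 1 is the rank we don't need
raw <- data.frame(Rank = 1:2,
                  V2   = c("New South Wales", "Victoria"),
                  V3   = c("7,348,900", "5,679,600"))

tbl <- raw[, 2:3]                          # keep columns 2 and 3 only
colnames(tbl) <- c("State", "Population")  # friendlier names

colnames(tbl)  # "State" "Population"
```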

Although the values in the second column look like numbers, they are strings. Line 11 replaces commas and dashes with an empty string and converts the strings to integers. Finally, line 12 filters out the unneeded rows. Here is the tbl data frame, ready to be visualised.

                         State Population
1              New South Wales    7348900
2                     Victoria    5679600
3                   Queensland    4610900
4            Western Australia    2472700
5              South Australia    1662200
6                     Tasmania     512400
7 Australian Capital Territory     379600
8           Northern Territory     236900
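The comma-stripping in line 11 is easy to verify in isolation: as.integer() on the raw strings yields NA, which is why the gsub() step is needed. (A "-" in an empty cell becomes NA after the substitution, and is then removed by the filter in line 12.)

```r
pop_strings <- c("7,348,900", "512,400", "-")

# Direct conversion fails: the commas make the strings non-numeric
suppressWarnings(as.integer(pop_strings))
# NA NA NA

# Strip commas and dashes first, as line 11 of Listing 1 does
suppressWarnings(as.integer(gsub(",|-", "", pop_strings)))
# 7348900  512400      NA
```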

The visualisation is done by the last two lines of Listing 1. The first statement ...

AUP <- gvisGeoMap(tbl, locationvar='State', numvar='Population',hovervar = 'State',options=list(height=350, region='AU',dataMode='regions'))

... creates an HTML fragment containing the tbl data frame converted into a JSON object (and then into a Google Fusion table) and the googleVis Flash component, which is rendered by the last line, plot(AUP).

There is no need to provide the state longitudes and latitudes. googleVis works well with any address-type string, such as "10 George Street, Sydney, NSW, Australia" or shorter versions, e.g. "Tasmania" (addresses and geo names should comply with the ISO standards; see the googleVis documentation for more details).

As I have mentioned, the tbl data frame can also be presented as a bar graph and/or a table. Here is an updated version of the code from Listing 1.

if (!require(XML)) install.packages('XML')
if (!require(googleVis)) install.packages('googleVis')
if (!require(ggplot2)) install.packages('ggplot2')
library(XML)
library(ggplot2)
suppressPackageStartupMessages(library(googleVis))
url <- 'http://en.wikipedia.org/wiki/Ranked_list_of_states_and_territories_of_Australia'
doc <- htmlParse(url)
ndx <- getNodeSet(doc, "//table")
tbl <- readHTMLTable(ndx[[1]])
tbl <- tbl[, 2:3]
colnames(tbl) <- c("State", "Population")
tbl <- transform(tbl, Population = as.integer(gsub(",|-", "", Population)))
tbl <- subset(tbl, !is.na(Population) & Population > 5000 & Population < 20000000)
wd <- getwd()
htmldir <- "C:\\DoBIwithR\\Code\\RWebScrap\\RWebScrap\\Content\\"
setwd(htmldir)
AUP <- gvisGeoMap(tbl, locationvar='State', numvar='Population', hovervar='State', options=list(width=427, height=328, region='AU', dataMode='regions'))
filename <- 'googleVis.html'
if (file.exists(filename)) file.remove(filename)
print(AUP, "chart", file="googleVis.html")
#================================================
tbl$State <- c('NSW', 'VIC', 'QLD', 'WA', 'SA', 'TAS', 'ACT', 'NT')
cbPalette <- c("#a4fd7f", "#267114", "#e0ffd4", "#5db40f", "#bcf590", "#95e856", "#439905", "#a8ef69")
q <- ggplot(tbl, aes(x=State, y=Population, fill=State)) + geom_bar(stat="identity")
q <- q + scale_fill_manual(values=cbPalette) +
  theme(legend.position = "none", plot.background = element_rect(fill='#eaf7fe'))
filename <- 'pop_by_state.png'
if (file.exists(filename)) file.remove(filename)
ggsave(q, file=filename, width=5.65, height=3.53)
setwd(wd)

Listing 2: updated R code - the software fairy.

It generates both the map above (an HTML fragment) and a corresponding bar graph (an image), which were integrated into an ASP.NET MVC 5 application. A screenshot of its home page is displayed below. The technique of integrating code fragments and images produced by R procedures has been explained in great detail in my recent articles, which can be found here.

You can download the VS 2013 ASP.NET MVC project from here.

Sergei Dumnov

Author profile:

Sergei is a Senior BI Solution Architect at Information Builders (Sydney, Australia). Before emigrating to New Zealand he worked for the Russian Academy of Sciences (Akademgorodok, Novosibirsk), where he obtained a PhD in Applied Mathematics.
