Excel to R: Visualising Data – Part 2
Last time we took a quick pass through the basics of good data visualisation and then finished with a simple demonstration that showed a bit of what R can do with charts. This post will show how I created the chart in R. By the time we’re done you should have an overview of how graphics work in R and how to turn a basic chart into a presentation quality graphic.
Let’s begin with the data. Like Excel, R builds a chart from a data table. In Excel that means arranging data in a block of cells that has a header row to indicate variable names and a row for each data item. The R equivalent is a data frame, which defines a data set in much the same way, a header that gives each column a name, and rows for data items.
For the Major League Baseball players’ salaries, the data behind the chart was a simple, two-column table, one with salary ranges and the other with counts. In Excel I used the Frequency function to reduce the 800 individual entries to a frequency table. In R, I got there by creating a new data frame, called salary.freq, with two variables for the ranges (bins) and counts, and then populating it with data from the original data frame.
Before we can start creating a chart in R we need to select the packages that do the work. For this kind of job I have a standard set that keeps the functions I need close at hand. The most important one for this example is ggplot2, since it provides the graphics engine we will be using. The package UsingR is especially handy because it loads several other packages that provide a range of statistical and data manipulation functions:
require(ggplot2)
require(scales)
require(UsingR) #sample data from UsingR
require(gcookbook) #sample data from R Graphics cookbook
The next step is to create a data frame by reading in the players’ salary data from a CSV file. For information purposes I follow with a “describe” command that shows me some summary statistics, including the count and range, which will be used to calculate the bin ranges for the chart:
salaries = read.csv(“MLB-Salaries-2015.csv”)
describe(salaries$Salary) #Display the count and range for the data
To hold the summary data for the chart we need to create a data frame called salary.freq, which aggregates the raw salaries into bins that are $3 million wide. We also need to do a bit of work on number formatting. Although Excel will, by default, display the same number formatting on a chart as in the source data table, R just stores values and displays them in scientific notation. Since we want to display the bin ranges in millions we add two extra columns called “label” and “highlight”, which scale the bin ranges to display in millions and enable colour highlights on individual columns.
bins = seq(0,33000000,by=3000000) #Populate the bin ranges in $3 million increments.
#Then assign each data row to a bin based on bin limits
salary.bin = cut(salaries$Salary,bins,right = TRUE)
salary.freq = data.frame(table(salary.bin)) #Get the count for each bin
#Right = TRUE will behave the same as the Excel FREQUENCY function,
# meaning the count includes all values less than or equal to bin upper limit.
#right = FALSE includes only values less than the bin limit.
#Add a column that will provide numeric bin labels expressed in millions
salary.freq$label = round(bins[bins>0] / 1e6,1)
#Add a variable that allows selective highlighting
salary.freq$highlight = c(“blue”,”blue”,”blue”,”blue”,”blue”,”gray”,”gray”,”gray”,”gray”,”gray”,”red”)
With the data set properly structured, we’re ready to use the ggplot2 package to build the chart. In this case we’re building a vertical bar chart, with some default colouring applied and the legend suppressed. This is not the only way to create graphics in R, but ggplot2 has become my “go to” package simply because it is versatile and the logic is similar enough to Excel’s Insert Chart sequence that I find it intuitive to use. The sequence below starts by telling the package to use the salary.freq data frame, assigns the variables “label” and “freq” to the x and y axes and assigns the “highlight” column as the source for fill colours. With the data elements assigned to columns the next line tells R to create a bar chart using the count data to determine the bar height.
g = ggplot(data = salary.freq,aes(x=label, y=Freq, fill=highlight)) #create chart object
g = g + geom_bar(stat = “identity”, colour=”gray85″) #Define it as a bar chart with colours and individual values displayed
# colour = sets the outline of the bars, fill = sets the bar fill (#2F5597 = dark blue)
g = g + scale_fill_manual(values = c(“#2F5597″,”gray85″,”red”)) #Apply highlight colours
g = g + theme(legend.position = “none”) #Suppress the legend
That produces a chart with a readable x-axis and some colour on the bars, but I still need some titles and labels to finish the job. The next series of steps add them, adjusts fonts, puts value labels in individual bars and then removes visual elements like grid-lines and y-axis value labels that don’t really add anything to the story the chart is telling.
#Refine:
g = g + ggtitle(“MLB Players Compensation – 2015”) #Give it a title
g = g + xlab(“Salary ($ M)”) #Add x-axis title
g = g + ylab(“No of Players”) #Add y-axis title
g = g + geom_text(aes(label=Freq),vjust=1.5,colour=”white”) #Add labels to the bars
# Format the y axis text
# Get colour palette here: http://sape.inf.usi.ch/quick-reference/ggplot2/colour
# Get font faces here: http:// http://www.cookbook-r.com/Graphs/Fonts/
# More about colours: http://www.sthda.com/english/wiki/ggplot2-themes-and-background-colors-the-3-elements
g = g + theme(plot.title = element_text(face=”bold”,colour=”gray40″,size=16))
g = g + theme(axis.title.x=element_text(face=”bold.italic”,colour=”gray40″,size=12))
g = g + theme(axis.title.y=element_text(face=”bold.italic”,colour=”gray40″,size=12))
#Remove the grid lines
g = g + theme(panel.grid.major = element_blank(),panel.grid.minor = element_blank())
#Change panel background colour to white
g = g + theme(panel.background = element_rect(fill=”white”,color=”white”))
#Remove the y-axis labels and tick marks
g = g + theme(axis.text.y=element_blank(),axis.ticks=element_blank())
# Add a distinct label to the highest category
g = g + annotate(“text”,x=33,y=25,label=”>30 M = 1″,size=3,color=”gray40″)
g #dsiplay the chart
As you can see there’s quite a bit of detail to specify but the logic of manipulating individual chart features is reasonably straightforward, not very different than creating a chart in a VBA macro. And just like working with Excel, there’s quite a bit of trial-and-error involved in adjusting the chart to get the right look. The good news is that once the code has been worked out for a particular kind of chart it is very reusable as a template for other data sets.