Excel to R: Visualising Data
Analysis and persuasion. A large part of my day-to-day work involves visualising data for either or both of these purposes. I consolidate data into charts and tables, sometimes to help me see patterns and sometimes to help other see what the data has to say. Excel has been an important tool along the way, but tools on their own aren’t enough. Without due care, good tools can pave a path to a bad chart like this one, which fails to inform.
Making charts that deliver a clear message about the story in the data, and that do so in the 8-seconds available to get a reader’s attention is not something most of use are naturally good at. And it’s not just about being proficient with any particular tool. As important as that is, it’s just table stakes. Following the lead of people like Edward Tufte and Cole Knaflic, making charts and tables that can truly carry a story forward requires careful application of 4 core principles throughout the process of data presentation:
- Understand the context
- Choose a format that fits the situation
- Eliminate visual clutter
- Use visual elements that help focus attention
Context:
A few questions to ask:
- Who is the audience and why should they care?
- How do they like to receive information?
- How much understanding of the situation and its background do you already have in common with them?
- How much time are they willing to spend trying to understand the story you are telling them?
When I’m trying to share insight that is the result of hours spent analysing, modelling, testing and charting a data set it’s easy to forget that I am far more familiar with the data than my audience is. In fact, they may not even care enough about subtleties to be willing to spend much time or effort taking in what the data has to say. If they happen to be senior managers, an important early lesson that I learned is that they are busy people They need to get quickly to answers: “What do you want me to know?” “Can I trust what you are telling me?” “Do you want me to do something?”
A well-crafted chart tells an honest story about something that matters to the reader, it’s easy to understand, and it is backed by all the work it took to get to this point (not that you necessarily get to tell that part of the tale). It is far too easy to tell a story about what we find interesting in the data and lose the audience in the process – more challenging and much more rewarding to engage the reader in a story that resonates with things they care about.
Format
What is the key message and what kind of visualization will help to carry it? Different styles of charts have different uses and a good visualization helps to carry the main point of the story forward. Is it about changes over time, comparisons between groups of elements, a whole-part relationship among sub-categories in a group, or basic information about centre, shape and spread in the data at hand? Does the chart need to show comparison to a specification or a complex relationship like a product quality profile that spans multiple variables?
Bar charts are good for comparisons or showing the shape of a data set, but less so for conveying whole-part relationships. Pie charts can show whole-part relationships but they do a poor job of showing quantitative comparisons. They also become almost unreadable when they have more than about 4 elements in them. Scatter plots can show relationships and time series charts can show trends over time. And then there are the specialty views like bubble charts and bullet charts, formats that can be useful for some specific situations but may be unfamiliar to some audiences. The style of chart should help carry the story, not get the reader bogged down in trying to understand what it shows.
Keep it Supremely Simple
When charts are used to convey a story every pixel matters: the goal should be to eliminate anything that could distract and then use white space, text, colour and font size to gently guide the reader to the aspects of the data that caught your attention in the first place.
The two charts below are an example. Both show the same data, a summary of Major League Baseball players’ compensation in 2015. The story in the data is about why the average player compensation ($4.3 million) was more than double the median ($1.9 million). A bar chart of salary distribution shows that players’ salaries are heavily skewed by a handful of very highly compensated individuals.
Using default settings, Excel produces a chart that shows the shape of the data set but it also adds visual clutter, things that take up cognitive energy without giving much back.
A more refined version simply removes unnecessary visual elements, and then make selective use of colour and labels to draw attention to key features. Same data, same chart; but some simple modifications made it easier to take in the information.
Simple Process for Making Presentable Charts
With the principles as anchor points, producing a presentation quality chart is a 4-step process that begins with a raw data set and ends with the finished product.
Select
Perhaps a statement of the obvious, but the first step is to select data that is trustworthy and complete. If there are missing values or outliers, be ready to explain how they were handled. Similarly, if you are charting a subset of the original, be ready to explain how it was derived. In this chart, selection was straightforward because MLB players salaries are readily available from multiple sources. I chose Stat Crunch because it is a free source of real-world data drawn from original sources. I don’t have to worry about its authenticity, and for illustration purposes, it’s small enough (817 rows) that I can use the entire player roster to make a summary chart.
Stage
A vertical bar chart is a good way to present this kind of data because it’s easy to visualize shape. Some transformation was needed to turn the raw data set into something that would display properly. In this case it was a simple matter of creating a frequency table that divided 817 salary values ranging from $507,000 to $32.6 million into a series of ranges. I often find 10-12 ranges to be a useful number of bars to present in a chart like this one. Too few and the chart starts losing detail that might be important; too many just adds clutter without any informative value. Excel’s Frequency function distributed the data to 11 bins that covered the full range of values.
Build and Refine
Once Excel built the chart I had to look more closely and decide how to make it easier to read. This part tends to go through several iterations and it takes time. In this case, eight changes to the original chart are worth noting:
- Re-formatted the category axis to display just enough values to define the range of the data set, make room for a slightly larger font and leave some white space between the labels.
- Re-formatted the salary numbers to make numbers in millions more readable.
- Removed the horizontal grid lines. These are supposed to make it easier to trace the column height back to the value axis but actual values are often easier to read, so…
- Added value labels to the four ranges that account for 85 percent of the salaries on the roster and adjusted the column width to make the labels fit within the bars.
- Removed the value axis since it duplicates information provided by the data labels.
- Applied grey colouring to the bars between the first 4 and the last category to emphasize the distinction between the lower salary ranges and the middle ones that don’t add a lot of value to the story.
- Since the last category captures just 1 salary greater than $30M, applied red colouring and a distinctive data label to keep it from disappearing altogether.
- Added a title to the chart and changed the title and axis label text to grey – still perfectly readable but with reduced less visual competition between the chart and the title.
Quite a few steps, and the better part of an hour’s work. With R, the path isn’t any shorter, but the steps are similar, at least in concept. Below is a bar chart that R produced from the same data, using default settings along with the code that produced it:
The Select and Stage steps involved creating a frequency table of salary counts in a new data frame using the same bin ranges used in Excel. The Build step created the chart using the ggplot package. The initial product is a bit rough: scales, colour, titles, labels, in fact all the elements that required adjustment in the Excel chart need work. Below is a refined version.
It’s a big improvement – a little different than the Excel chart but very presentable. And the effort to do it the first time was similar to Excel: a dozen or so iterations, adjusting each style element until the end result was acceptable. Next time we’ll take a deeper dive into bar charts in R and look at how I got to this version from the original.
One thought on “Excel to R: Visualising Data”
Comments are closed.