Data Discovery with Excel: The Good, the Bad and the Ugly
If data is a window into a problem, then one of the main goals of data science is getting people to look through the window: Do you see patterns? Are the apparent patterns real? Do they help you understand the situation? Can you use what you see to predict beyond the horizon?
So, how well do everyday desktop tools like Excel fare in helping to provide answers?
Teachers of introductory statistics courses (like me) are fond of starting off the discovery process with a game. The class is presented with the data set shown here, giving the basic facts of an unspecified risk event involving a large loss of life. The group is invited to ask questions about patterns they see in the data in order to discover what the risk event was. Robert Dawson at St. Mary's University gives this version of the dialog that might follow:
- What do you notice about those numbers?
- The death rate for females was much less than for males.
- Yes, but much more so among the rich.
- It was like that with the children, too — none of the kids in groups I or II died, but lots of the ones in group III did.
- Was it something that could be cured but the cure cost a lot?
- But then why didn’t the wealthy men get cured?
- Is there any disease that women and children can be cured from and men can’t?
- Maybe it was something else — a massacre or something?
- But who’d massacre poor women and children and all men but not rich women?
- Hey, look at the numbers exposed. Look how many more men were exposed, and how few children.
- Maybe they just looked at more men?
- No, it says "Population Exposed".
- Those numbers are pretty exact. They must have known exactly who was exposed and who wasn’t. How could they tell that so accurately with an epidemic?
- Yes, and who were those class 4 people? Notice they were all adults, and almost all guys?
- They had the same death rate as the poor people. Were they poor?
- Could they have been doctors or rescue workers?
Usually it takes about 5 minutes before someone catches on and identifies the event as the sinking of the Titanic.
Even now, more than a century after its fatal maiden voyage from Southampton to New York, Titanic stands out as one of the most catastrophic disasters in civilian marine history. The mortality data tell a stark story of lives lost; they also provide a glimpse into the factors that influenced the behaviour of a large group of people suddenly faced with a life-threatening situation. The circumstances were simple: a sinking ship, not enough lifeboats for everyone, and no time for debate, so it was women and children first into the lifeboats. The data show that is what happened, at least on the first- and second-class decks where lifeboats were accessible. But a quick glance beyond that group suggests that other factors were at work. Do the data show us anything about what else in the situation influenced who got off safely and who didn't?
The Titanic data have often been used as a case study for exploring tools and techniques for data analysis. The data set tells a story about a real situation, it's compact enough to analyse with ordinary desktop software, it has defects and missing values to contend with, and it's rich enough to encourage some exploration. There are at least two versions available; the first part of this analysis was done using an anonymous, coded set distributed by Robert Dawson. It's a simple presentation, only four variables, and it includes data on both passengers and crew (Class IV in the table).
I produced the summary in four main steps using standard Excel functions; representative formulas are sketched after the list:
- Convert the coded data to text: A series of VLOOKUP formulas substitutes plain text for the numeric codes in the raw data, making the data table easier to read, both in its raw form and in condensed tabulations.
- Summarize the data in a Pivot Table: This produces a tabulation of survival by class and gender and by class and age.
- Get the total people on board for each of Classes I – IV: This uses a COUNTIF function to enumerate occurrences of each of the class values in the raw data.
- Group and enumerate people by class/gender and by class/age: This uses the DCOUNTA function, which counts rows in the raw data table based on the values in the Class and Gender columns. (Each group enumerated with this function needs its own set of lookup criteria, so I keep them together, out of sight on a separate worksheet.)
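Roughly what those formulas look like is sketched below; the sheet names, cell ranges and the Codes lookup table are illustrative placeholders rather than the actual layout of the workbook:

```
Class label for the code in A2:    =VLOOKUP(A2, Codes!$A$2:$B$5, 2, FALSE)
People on board in Class I:        =COUNTIF(Data!$C$2:$C$2202, 1)
People in one class/gender group:  =DCOUNTA(Data!$A$1:$D$2202, "Class", Criteria!$A$1:$B$2)
```

In the DCOUNTA call, Criteria!$A$1:$B$2 is a small two-row range holding the column headings (Class and Gender) and the values that define one group; each group gets its own criteria range on the hidden worksheet.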
Another version of the data is available through Kaggle (among other sources). Based on the passenger manifest, it includes more descriptive features, providing opportunities to explore data manipulation techniques and to ask more probing questions.
For starters, did the size of the family group in which a person travelled influence survival? The passenger manifest provides enough detail about the relationships among the people on board to identify family units travelling together, but it requires a little bit of feature engineering. For analysis purposes a "family" is two or more people with the same surname, travelling on the same ticket, with a count greater than 1 in any of the columns headed Mother, Sibling-Spouse or Parent-Child. Since the passenger name follows a standard pattern of "Last name, first name", it's easy to insert a column in the data table and use some Excel string manipulation functions to extract the surname. A second feature, Family Size, sums the three fields headed Mother, Sibling-Spouse and Parent-Child, assigning a value of 1 to passengers with no sibling-spouse or parent-child values. Finally, for filtering purposes, a third feature, "CL_Age", categorizes the age data into "Adult" versus "Child" using 18 years as the separation point.
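The three derived columns can be built with ordinary worksheet formulas along these lines; the column letters are placeholders (here D holds Name, F holds Age, and H, I and J hold the Mother, Sibling-Spouse and Parent-Child counts):

```
Surname:      =LEFT(D2, FIND(",", D2) - 1)
Family Size:  =IF(SUM(I2:J2) = 0, 1, SUM(H2:J2))
CL_Age:       =IF(F2 = "", "Unknown", IF(F2 >= 18, "Adult", "Child"))
```

The Family Size formula follows the rule described above: it sums the three relationship counts but falls back to 1 when the sibling-spouse and parent-child fields are both zero. The CL_Age formula also flags blank ages as "Unknown" so they can be filtered out or imputed later.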
The original data table has grown by three columns, but the new features enable the creation of a pivot table that groups everybody into family units. A high-level visualization can be created by using Family Size as the Pivot Table heading and plotting the column totals to show the number of survivors for each grouping.
The chart shows that, overall, single passengers fared the worst, with a survival rate slightly lower than that of the overall shipboard population. Families of 2-4 individuals had a higher-than-average survival rate, but it trails off quickly for groups larger than 4. The inference seems to be that there was safety in numbers, but only up to a point.
If the guiding principle was "women and children first" then analysis of the passenger manifest by gender and by age should show differences in survival rate based on both of these dimensions. The only problem is that many entries in the passenger manifest have no recorded age (shown here as "unknown" age). Using the values presented in other features, we can infer reasonable approximations for some of the missing values. For example, there are three rows (passengers 141, 167, 534) for female passengers of unknown age who are mothers. Since the median age recorded for mothers in the rest of the shipboard population is 35, we can reasonably infer an age of 35 for these individuals. Using similar logic, we can assign a median age of 5 years to the 20 passengers who have a blank age but whose records indicate siblings and/or a parent on board. This correctly classifies them as children, but it probably introduces some distortion into the overall estimate of age distribution.
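The imputation values themselves can be computed with an array formula (confirmed with Ctrl+Shift+Enter in older versions of Excel). The sketch below returns the median age of mothers who do have a recorded age; the ranges and column letters (F for Age, H for the Mother flag) are assumptions about the layout rather than the actual workbook:

```
=MEDIAN(IF((H2:H892 > 0) * (F2:F892 <> ""), F2:F892))
```

A similar formula, conditioned on the sibling-spouse and parent-child columns instead, can produce the corresponding median for the child imputation.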
That still leaves about 150 individuals travelling alone who were of unknown age. Since it would be very unusual for a child in 1912 to take a transatlantic journey alone, we can assume they were all adults. If all we need to know is the adult/child division, that's good enough. But can we assign them ages without distorting the rest of the data? The answer is yes, as long as we are willing to live with two assumptions: that the unknown ages follow a distribution pattern similar to the rest of the adult population on board, and that the ages for which we do have records are normally distributed.
The normal probability chart shows that the distribution of ages for the female passengers does not follow a strictly normal distribution, but it is still a reasonable approximation.
In order to assign ages to passenger IDs with unknown ages, the first step is to use the Excel DAVERAGE and DSTDEV functions to get the average and standard deviation of the adults with recorded ages. These values can be used, along with the RAND() function, to seed NORMINV(), which generates random ages for each of the passenger IDs of adults without a recorded age in the original manifest.
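In formula terms, the imputation looks roughly like this. Criteria!$A$1:$A$2 is assumed to be a two-cell range with the heading Age and the condition ">=18", which restricts both database functions to adults with a recorded age, and $P$1 and $P$2 stand in for whichever cells hold those two intermediate results:

```
Mean adult age:        =DAVERAGE(Data!$A$1:$N$892, "Age", Criteria!$A$1:$A$2)
Std dev of adult age:  =DSTDEV(Data!$A$1:$N$892, "Age", Criteria!$A$1:$A$2)
Imputed age:           =NORMINV(RAND(), $P$1, $P$2)
```

Because RAND() is volatile and recalculates with every workbook change, it's worth converting the imputed ages to static values (Paste Special, Values) once they look reasonable; otherwise every subsequent edit reshuffles them.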
With the blank ages eliminated we get this visualization for female passengers, retaining the same overall shape as the original data but eliminating “Unknown” ages.
The data suggest that the proportion of adult women who survived was similar to that for girls over 10, but girls 10 and under had a proportionately lower chance of survival. Inspection of the data indicates that by far the majority of children aged 10 or under were in 3rd class, where access to lifeboats was limited.
The analysis could go further but this post is really about how Excel shapes up as an analysis tool.
The Good:
- Rich capabilities for summarizing and visualizing data, and for doing basic statistical analysis. Excel goes a long way toward handling some reasonably complex analysis jobs.
- Easy to add features when needed for analysis purposes.
- Facilitates an exploratory approach to data analysis.
- Analysis is easy to learn – with minimal instruction, people with moderate Excel skills catch on quickly and use it effectively for data analysis.
- Analysis workbooks are easy to exchange with others thanks to the widespread use of Microsoft Office as a desktop suite.
The Bad and the Ugly:
- Adding data features can cause workbooks to become very large, very quickly, with formulas in each cell of each new feature. In this example, adding 8 features to a raw data file with 900 rows resulted in file size growth from 75KB to 750KB.
- Excel encourages data exploration, but without careful documentation the steps required to complete an analysis can get lost and become difficult to reproduce.
- Some of the functions required to add features and summarize data can lead to tedious manual work along the way. The analysis for this example took most of a day to complete.
- Default formatting of key outputs, most notably Pivot Tables and Charts, may not be suitable for analysis and presentation without manual manipulation.
- Limited size of dataset. In my experience, Excel performs well with data sets containing up to about 50,000 rows, even with as many as 10 calculated features in the table. Beyond that, calculation time can start getting sluggish enough to require turning off auto-recalculation. With really large datasets (i.e., >1,000,000 rows) recalculation can cause Excel to run out of resources.
- Limited community-supported development. Google help is readily available; if you want more than that, it's out there, but it's rarely free.