Excel to R: The Analysis ToolPak

A useful, but often overlooked Excel feature is the Analysis ToolPak. It’s useful because it packages about 20 commonly used statistical functions in a format that is very easy to use. It gets overlooked because it has to be activated through the Excel Options dialog before it even becomes visible.

If the Analysis ToolPak is new to you, you will find it in the Data tab of the Excel ribbon. If you already use it and you are in the process of extending your analysis repertoire to include R, you may just be looking for a guide that shows how to do ToolPak tasks in R. Either way, this post fills a gap with a quick mapping from Excel to R of the five most commonly used functions in the Analysis ToolPak.

Descriptive Statistics

Centre, spread and shape are probably the three things people most often want to know about a dataset. Both Excel and R offer quick tabulations of statistics that supply answers. In Excel, the Descriptive Statistics dialog sets options for selecting the data, output location and which statistics to present, including a confidence interval for the mean. In this case the dataset lists tips received in a restaurant over a period of several weeks; the setup dialog and the result are shown below.

The closest R equivalent is probably the describe function, but more than one package provides a function by that name, so the collection of statistics it delivers depends on which package is loaded. The best match to the Excel output comes from the psych package, as shown here:
R Descriptive Statistics
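A minimal sketch of the call, assuming the tips have been read into a data frame called tips with a numeric column named tip:

    # describe() from the psych package reports n, mean, sd, median, min, max,
    # range, skew, kurtosis and the standard error in one table
    library(psych)
    describe(tips$tip)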

95% Confidence Interval

The describe function output didn't include the confidence interval for the mean, so I had to run it separately using the t.test function. That gave me a full description of the statistical test – maybe more information than I wanted, but it got the job done.
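The call is a single line; the 95 percent confidence interval for the mean appears near the bottom of the output:

    # one-sample t-test; the output includes the 95% confidence interval for the mean
    t.test(tips$tip)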

Histograms

A Histogram is a quick way to visualize the information that the descriptive statistics provide. Continuing with the tips dataset, the Excel dialog produces both a frequency table and a chart:

Excel Histogram

To get this I had to decide up-front how many bins I wanted and calculate the boundary values for each. Excel did the rest, giving me both the frequency table and the chart output.

The R equivalent produces this chart using options set within the ggplot function:

R-Plot Tips Histogram
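A hedged sketch of the ggplot call; the bin width of 1 and the axis labels are assumptions, not the exact settings used for the chart above:

    library(ggplot2)
    ggplot(tips, aes(x = tip)) +
      geom_histogram(binwidth = 1, fill = "steelblue", colour = "white") +
      labs(x = "Tip", y = "Frequency")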

The chart specification includes the bin width, which simplifies things a little bit, but R doesn’t give up the frequency table quite so easily. To get that, I had to create a variable containing all of the cut points that marked the bin boundaries and another one that assigned each value in the original data to a bin. Then I had to populate a third variable with the count frequencies within each cut range and finally bring the bins and counts together in a single table. It sounds like a lot but it’s really only a 4-line script:

Tips Frequency Table
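A sketch of that four-line script; the cut points (whole-dollar boundaries from 0 to 10) are assumptions chosen to mirror the chart:

    breaks    <- seq(0, 10, by = 1)                   # bin boundaries
    bins      <- cut(tips$tip, breaks = breaks)       # assign each tip to a bin
    counts    <- table(bins)                          # count frequencies within each cut range
    freqTable <- data.frame(bin = names(counts), count = as.vector(counts))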

T-Test: Are Averages from Two Groups the Same?

It’s a common enough question and the t-test is one way to answer it. In this case we want to know if male servers and female servers receive equal tips. In Excel I have to start by separating the dataset into two gender groups and then specify the details in the T-test dialog for testing means between two samples. I also have to decide whether to use a version of the test that assumes equal or unequal variances between the two samples. I’ll skip over proving that they are equal (they are), and just show the version of the test that assumes equal variances. As with other ToolPak functions, the dialog lets me specify input and output characteristics as well as the level of significance applied to the test. The output is a simple tabulation of the test result:

This is a two-tailed test and the P-value is greater than 0.05, so I fail to reject the null hypothesis and conclude that there is no evidence of a difference between the tips received by male and female servers.

The R equivalent is a single command that splits the dataset into two groups according to the variable “sex”. The output includes a reminder about the statistical hypothesis (just in case I have forgotten) and estimates the confidence interval for the difference in the means between the two samples.
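A minimal sketch of that single command, assuming the grouping variable is called sex and equal variances:

    # two-sample t-test splitting tips by the sex variable
    t.test(tip ~ sex, data = tips, var.equal = TRUE)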

Correlation

More often than not a dataset has more than one variable, and in the early stages of exploratory analysis it's common to ask whether or not there is a relationship between any pair of variables in the set. Correlation is one way to answer that question, and "economics", a sample dataset packaged with R's ggplot2 package, demonstrates this nicely.

The dataset includes economic indicators collected over a nearly 50-year period. In this case the analysis question is whether or not there is a relationship between personal consumption expenditures (pce) and the personal savings rate (psavert). Since money is something a person can either save or spend, the intuitive answer would be a strong, inverse relationship between spending and saving. Both R and Excel offer simple functions that answer the question by calculating the correlation coefficient for this variable pair as -0.84, confirming what we expected.
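In R the calculation is a single call to cor(); the Excel equivalent is the CORREL worksheet function:

    library(ggplot2)                        # the economics data ships with ggplot2
    cor(economics$pce, economics$psavert)   # about -0.84 for the version used here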

But are there any other relationships between pairs of variables? R provides a quick answer in the form of a correlation table using the same function used for the single variable pair. The only change is specifying all of the variables at once, using position within the table to reference them.

Economics Correlation
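A sketch of the same call applied to all of the numeric columns at once, referenced by position (columns 2 through 6, skipping the date column in the first position):

    # correlation matrix for pce, pop, psavert, uempmed and unemploy
    round(cor(economics[, 2:6]), 2)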

The procedure in Excel is similar but there’s an extra step. Where R let me specify the columns to include in the analysis by position, Excel required them to be adjacent to each other. Not a big deal with a small dataset like this one (574 rows and 6 columns) but potentially a problem with a large dataset. Once the columns were in the right order, I just specified the input and output ranges and Excel did the rest.

Economics Correlation Table Excel

Linear Regression

Correlation tells whether or not there is some kind of relationship between two variables but in itself it has no predictive value. If I want to do predictive modeling then for some input variable, x, I’m looking for an expected response, y. The simplest version is a straight-line relationship that is the best-fit path through a dataset that has two variables.

The UsingR package in R contains a sample data set called father.son, which illustrates how to do simple linear regression. It contains over 1,000 pairs of observations of the heights of fathers and sons. If I'm asking whether or not the father's height can be used to predict the son's, it's helpful to start by visualizing the data. The ggplot function gives me a scatter plot of the data, the best-fit line through the data and an uncertainty band around it:

Scatter Plot
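A hedged sketch of the plot; the UsingR package is assumed as the source of father.son, and geom_smooth(method = "lm") draws the best-fit line with its confidence band:

    library(UsingR)      # assumed source of the father.son data
    library(ggplot2)
    ggplot(father.son, aes(x = fheight, y = sheight)) +
      geom_point(alpha = 0.4) +
      geom_smooth(method = "lm") +
      labs(x = "Father height (inches)", y = "Son height (inches)")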


To get the equation for the best-fit line I needed the lm function, which I used to populate a variable called heightsLM with the results of the analysis:
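A minimal sketch of that step, using the father.son column names fheight and sheight:

    heightsLM <- lm(sheight ~ fheight, data = father.son)
    heightsLM   # prints the intercept (about 33.89) and the slope (about 0.51)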

This expresses a linear relationship:

Son Height = 33.89 inches + 0.51 × Father Height

There’s plenty of scatter in the data so if I’m a prudent modeller I also want to know more about how much confidence to put in the accuracy of the prediction. As it turns out the lm function also calculates a series of statistics that I can display using the summary function:

Heights ANOVA Table R

The result table can be a little intimidating at first glance but it gives me an idea of how far to trust the quality of the prediction that comes from the equation. The low p-value associated with the F statistic tells me that there is a significant linear relationship between the father's height and the son's height, but the value of R-squared tells me that variation in the heights of fathers explains only about a quarter of the variation in the heights of their sons. As a predictive model it has some limitations, but R gave me both the model and an evaluation of its quality with very little effort.

In Excel, the process is equally straightforward. Having exported the dataset from R in the form of a CSV file I could open it in Excel and then use the Regression option from the Analysis ToolPak to get both the scatter plot and the regression statistics in a single step:

Excel Regression

In addition to the usual specification of input and output ranges, the dialogue offers options about chart outputs and other things that may be useful in evaluating the predictive value of the regression model. The output consists of an ANOVA table that gives me the same statistical summary about the quality of fit that R provided. It also goes a little further, calculating the confidence limits for all of the coefficients calculated for the regression equation. In the interest of full disclosure, this is prettier than the raw Excel output. To make it more readable I re-scaled the chart to just show the range of the data values (something R did automatically); added the regression equation using the right-click local menu and applied some formatting to the ANOVA table.

I could go on but by now you should have the idea. Both R and Excel have very accessible sets of tools for basic statistics, even for a beginner. If you are an experienced Excel user and you’re new to R, an introductory handbook like R for Everyone by Jared Lander is a good way to jump-start the process. Another great resource is R for Excel Users by John Taveras. If your needs extend to more advanced statistical work, R has a rich tool kit but it takes work, especially if you want presentation quality data visualization. But that’s a topic for another time.

Excel to R: Do you Like Green Eggs and Ham?

Excel is such a handy tool for data discovery and analysis that it’s fair to ask, “Why bother with anything else, especially an arcane scripting environment like R?” The truth is that the transition from Excel to R is very much a green eggs and ham experience: decidedly unappealing at the outset, even for adventurous folk, but rewarding in the long run.  This post takes another look at the Titanic data set, this time using R to do the same analysis done last time in Excel.

As before, the starting point is making readable text from the raw, coded data. R handles this with simple substitution: specify the filter condition for the variable values to be changed and assign the replacement. For example, converting the Class codes to the form used in the summary table takes the four statements shown here.
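A hedged sketch of those four statements, assuming the coded data sit in a data frame called titanic with Class coded 1 to 4 (crew as 4):

    titanic$Class[titanic$Class == 1] <- "First"
    titanic$Class[titanic$Class == 2] <- "Second"
    titanic$Class[titanic$Class == 3] <- "Third"
    titanic$Class[titanic$Class == 4] <- "Crew"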

Following a similar pattern Age and Gender codes become readable text as well. So far, so good. Easy.

The first real hurdle is getting R to present something that looks like a Pivot table, summarizing both the On Board and Survivor data by Class, Gender and Age. Of the several ways that R provides for tabulating data the one that makes the most intuitive sense to Excel users is probably the ‘cast’ function. This is similar to a Pivot table in that it aggregates the values of a selected variable and arranges them in a table based on the values of two descriptive variables.
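A minimal sketch using dcast from the reshape2 package (the modern form of cast); the column names Class, Sex and Survived are assumptions, and with one row per person, counting rows gives the number of people in each Class-by-Gender cell:

    library(reshape2)
    dcast(titanic, Class ~ Sex, value.var = "Survived", fun.aggregate = length)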

Running the tabulation twice (once with the full on-board manifest and once on a subset containing just the survivors) yields the two summaries shown here. Applying a similar treatment using Class and Gender as the descriptive variables and adding a little bit of formatting gives this summary:

It isn’t quite as pretty as the tabulation in Excel but the effort to crunch the numbers in R is  about the same. The bonus with R was that it kept track of everything I did in a history file. By the time the job was done I had both the required output and the raw material to start building a repeatable script.

Turning to the passenger manifest downloaded from Kaggle, things get a little more interesting. As before, the first objective is to identify family groupings, looking for evidence that the size of the group travelling together influenced survival. Since some R functions are fussy about missing values (and the passenger data contains plenty of them), we begin by reading in the two raw CSV files, called "train" and "test", combining them into a single data set and then standardizing empty values with "NA" as a placeholder:
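A hedged sketch of that step, assuming the two Kaggle files are in the working directory:

    # read blanks as NA so empty values are standardized on the way in
    train <- read.csv("train.csv", na.strings = "", stringsAsFactors = FALSE)
    test  <- read.csv("test.csv",  na.strings = "", stringsAsFactors = FALSE)
    test$Survived <- NA                 # the test file has no Survived column
    full <- rbind(train, test)          # combine into a single data set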

Family groupings are defined as a unique combination of surname and ticket number. Since names are recorded as “Last Name, Title. First Name”, the following command adds a new feature to the data set called “Surname”. Family size is then calculated as the sum of the two columns for Parent-Child and Sibling-Spouse, assigning 1 when these values add up to zero:
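A sketch of those two steps, using the standard Kaggle column names Name, SibSp and Parch:

    # surname is everything before the comma in "Last Name, Title. First Name"
    full$Surname <- sapply(strsplit(full$Name, ","), `[`, 1)
    full$Fsize   <- full$SibSp + full$Parch        # Sibling-Spouse plus Parent-Child
    full$Fsize[full$Fsize == 0] <- 1               # assign 1 to people travelling alone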

For filtering purposes, another new feature categorizes the age data into "Adult" versus "Child" using 18 years as the separation point. The original data table has grown by three variables, enabling creation of a unique family identifier that combines surname, ticket and family size:
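A sketch of the two new features; the names CL_Age and FamilyID are assumptions:

    full$CL_Age   <- ifelse(full$Age < 18, "Child", "Adult")   # missing ages stay NA for now
    full$FamilyID <- paste(full$Surname, full$Ticket, full$Fsize, sep = "_")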

From here invoking the “ggplot2” package plots the entire data set in the form of a histogram that shows survival by family size.

Titanic Survival by Family Size
The R command to generate the chart looks like this.
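A hedged sketch of that command; the first 891 rows are assumed to be the training set (the rows with a known Survived value):

    library(ggplot2)
    ggplot(full[1:891, ], aes(x = Fsize, fill = factor(Survived))) +
      geom_bar(position = "dodge") +
      labs(x = "Family size", fill = "Survived")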

An Excel user learning R doesn’t take very long to discover built-in capabilities to do things that are notoriously difficult in Excel. A case in point would be a box-plot: anyone who has tried it in Excel knows that it can be done but only with considerable time and effort. The ggplot2 package, used above, provides a single-command presentation of a set of box-plots showing fares by class and port of embarkation:
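A sketch of that single command, assuming the standard Kaggle column names Pclass, Fare and Embarked:

    # one box plot of fares per class, in a separate panel for each port of embarkation
    ggplot(full, aes(x = factor(Pclass), y = Fare)) +
      geom_boxplot() +
      facet_wrap(~ Embarked) +
      labs(x = "Passenger class", y = "Fare")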

And so the pattern goes: working with R means breaking the analysis into a series of tasks, invoking the appropriate package and constructing a short sequence of commands that complete the task.

One of the tasks in this analysis is assigning ages to passengers with no recorded age. In Excel, that required assumptions about the demographic make-up of the passenger population and a fair bit of data manipulation. R provides a more elegant treatment using the MICE package, which imputes random values for missing ages based on patterns observed in other variables in the data set. Once MICE has been loaded as a library, the following script assigns the missing ages and then produces side-by-side histograms showing that the treatment didn’t change the shape of the age distribution.
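A hedged sketch of the imputation step; the predictor columns and the seed are assumptions, not the exact settings used here:

    library(mice)
    # impute Age from a few numeric predictors using predictive mean matching
    imp <- mice(full[, c("Pclass", "SibSp", "Parch", "Fare", "Age")],
                method = "pmm", seed = 129)
    full$AgeImputed <- complete(imp)$Age
    # side-by-side histograms: the shape of the age distribution should be preserved
    par(mfrow = c(1, 2))
    hist(full$Age,        main = "Age: original data",    xlab = "Age")
    hist(full$AgeImputed, main = "Age: after imputation", xlab = "Age")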

Histogram Ages

With ages assigned a single step renders side-by-side histograms showing survival by Age and Gender:
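A sketch of that single step, assuming the imputed ages were stored in a column called AgeImputed:

    ggplot(full[1:891, ], aes(x = AgeImputed, fill = factor(Survived))) +
      geom_histogram(binwidth = 5, position = "dodge") +
      facet_grid(. ~ Sex) +
      labs(x = "Age", fill = "Survived")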

So, why should a die-hard Excel user learn to do analysis with R? After all, R scripting is a fair distance removed from the familiar point and click experience of Excel, and developing proficiency with R takes time, practice and patience. As I see it, there are four good reasons to make the effort:

  • R offers richer capabilities than Excel for summarizing and visualizing data, doing statistical analysis, and developing predictive models that are reusable and deployable to others.
  • Learning R is no more difficult than learning to write a VBA macro in Excel.
  • R is free, open source software written by and for people who use data to understand real-world problems. When faced with a new problem, chances are there is already a package out there that covers the critical steps.
  • The R community is global, and it includes people who are constantly improving the tool and adding new packages to the repository. It also includes ordinary users like Megan Risdal at Kaggle who are happy to share what they know.

None of which is a case for abandoning Excel, which remains just as handy as it ever was. But R is a very nice addition to the analysis toolkit, well worth the effort.

Data Discovery with Excel: The Good, the Bad and the Ugly

If data is a window into a problem then one of the main goals of data science is getting people to look through the window: Do you see patterns? Are the apparent patterns real? Do they help you understand the situation? Can you use what you see to predict beyond the horizon?

So, how well do everyday desktop tools like Excel fare in helping to provide answers?

Teachers of introductory statistics courses (like me) are fond of starting off the discovery process in the form of a game. The class is presented with the data set shown here, giving the basic facts of an unspecified risk event involving a large loss of life. The group is invited to ask questions about patterns they see in the data in order to discover what the risk event was. Robert Dawson at St. Mary's University gives this version of the dialog that might follow:

  • What do you notice about those numbers?
  • The death rate for females was much less than for males.
  • Yes, but much more so among the rich.
  • It was like that with the children, too — none of the kids in groups I or II died, but lots of the ones in group III did.
  • Was it something that could be cured but the cure cost a lot?
  • But then why didn’t the wealthy men get cured?
  • Is there any disease that women and children can be cured from and men can’t?
  • Maybe it was something else — a massacre or something?
  • But who’d massacre poor women and children and all men but not rich women?
  • Hey, look at the numbers exposed. Look how many more men were exposed, and how few children.
  • Maybe they just looked at more men?
  • No, it says 'Population Exposed'.
  • Those numbers are pretty exact. They must have known exactly who was exposed and who wasn’t. How could they tell that so accurately with an epidemic?
  • Yes, and who were those class 4 people? Notice they were all adults, and almost all guys?
  • They had the same death rate as the poor people. Were they poor?
  • Could they have been doctors or rescue workers?

Usually it takes about 5 minutes before someone catches on and identifies the event as the sinking of the Titanic.

Even now, more than a century after its fatal maiden voyage from Southampton to New York, Titanic stands out as one of the most catastrophic disasters in civilian marine history. The mortality data tell a stark story of lives lost; they also provide a glimpse into factors that influenced the behaviour of a large group of people suddenly faced with a life-threatening situation. The circumstances were simple: a sinking ship, not enough lifeboats for everyone, no time for debate, so it was women and children first into the lifeboats. The data show that was what happened, at least on the first and second class decks where lifeboats were accessible. But a quick glance beyond that group suggests that other factors were at work. Does the data show us anything about what else in the situation influenced who got off safely and who didn't?

The Titanic data have often been used as a case study for exploring tools and techniques for data analysis. The data set tells a story about a real situation, it's compact enough to analyze with ordinary desktop software, it has defects and missing values to contend with, and it's rich enough to encourage some exploration. There are at least two versions available; the first part of this analysis was done using an anonymous, coded set distributed by Robert Dawson. It's a simple presentation, only four variables, and it includes data on both passengers and crew (Class IV in the table).

I produced the summary in 4 main steps using standard Excel functions:

  1. Convert the coded data to text. A series of VLOOKUP functions substitute plain text for the numeric codes in the raw data, making the data table easier to read, both in its raw form and in condensed tabulations.
  2. Summarize the data in a Pivot table: This produced a tabulation of survival by class and gender and by class and age.
  3. Get total people on board for each of Classes I – IV: This used a COUNTIF function to enumerate occurrences of each of the class values in the raw data.
  4. Group and enumerate people by class/gender and by class/ age: This uses the DCOUNTA function, which counts rows in the raw data table based on the values in the Class and Gender columns. (Each group enumerated with this function needs its own set of lookup criteria so I keep them together, out-of-sight in a separate worksheet layer.)


Another version of the data is available through Kaggle (among other sources). Based on the passenger manifest, it includes more descriptive features, providing opportunities to explore data manipulation techniques and to ask more probing questions.

For starters, did the size of the family group in which a person travelled influence survival? The passenger manifest provides enough detail about the relationships among the people on board to identify family units travelling together, but it requires a little bit of feature engineering. For analysis purposes a "family" is two or more people with the same surname, travelling on the same ticket, with a count greater than 1 in any of the columns headed Mother, Sibling-Spouse or Parent-Child. Since the passenger name follows a standard pattern of "Last name, first name" it's easy to insert a column in the data table and use some Excel string manipulation functions to extract the surname. A second feature called family size sums the three fields headed Mother, Sibling-Spouse and Parent-Child, assigning a value of 1 to passengers with no sibling-spouse or parent-child values. Finally, for filtering purposes, a third feature "CL_Age" categorizes the age data into "Adult" versus "Child" using 18 years as the separation point.

The original data table has grown by three columns, but the new features enable creation of a pivot table that groups everybody into family units. A high-level visualization can be created by using Family Size as the Pivot Table heading and plotting the column totals to show the number of survivors for each grouping.

The chart shows that overall, single passengers fared the worst with a survival rate slightly lower than the overall shipboard population. Families of 2-4 individuals had a higher-than-average survival rate but it trails off quickly for groups larger than 4. The inference seems to be that there was safety in numbers but only up to a point.

If the guiding principle was “women and children first” then analysis of the passenger manifest by gender and by age should show differences in survival rate based on both of these dimensions. The only problem is that many entries in the passenger manifest have no recorded age (shown here as “unknown” age). Using the values presented in other features we can infer reasonable approximations for some of the missing values. For example, there are 3 rows (passengers 141, 167, 534) for female passengers of unknown age who are mothers. Since the median age recorded for mothers in the rest of the shipboard population is 35 we can reasonably infer an age of 35 for these individuals. Using similar logic we can assign a median age of 5 years to the 20 passengers who have a blank age but whose records indicate siblings and/or a parent on board. This correctly classifies them as children but it probably introduces some distortion into the overall estimate of age distribution.

That still leaves about 150 individuals travelling alone who were of unknown age. Since it would be very unusual for a child in 1912 to take a transatlantic journey alone we can assume they were all adults. If all we need to know is the adult/child division that’s good enough. But can we assign them ages without distorting the rest of the data? The answer is yes, as long as we are willing to live with two assumptions: that the unknown ages follow a distribution pattern similar to the rest of the adult population on board and that the ages for which we do have records are normally distributed.

The normal probability chart shows that the distribution of ages for the female passengers does not follow a strictly normal distribution, but it is still a reasonable approximation.

In order to assign ages to passenger IDs with unknown ages, the first step is to use the Excel DAVERAGE and DSTDEV functions to get the average and standard deviation of adults with recorded ages. These values can be used along with the RAND() function to seed NORMINV(), which generates random ages for each of the passenger IDs of adults without a recorded age in the original manifest.

With the blank ages eliminated we get this visualization for female passengers, retaining the same overall shape as the original data but eliminating “Unknown” ages.

The data suggests that the proportion of adult women who survived was similar to that for girls over 10, but girls 10 and under had a proportionately lower chance of survival. Inspection of the data indicates that by far the majority of children aged 10 or under were in 3rd class, where access to lifeboats was limited.

The analysis could go further but this post is really about how Excel shapes up as an analysis tool.

The Good:

  • Rich capabilities for summarizing and visualizing data, and for doing basic statistical analysis. Excel goes a long way toward handling some reasonably complex analysis jobs.
  • Easy to add features when needed for analysis purposes.
  • Facilitates an exploratory approach to data analysis.
  • Analysis is easy to learn – with minimal instruction people with moderate Excel skills catch on quickly and use it effectively for data analysis.
  • Analysis workbooks are easy to exchange with others thanks to the widespread use of Microsoft Office as a desktop suite.

The Bad and the Ugly:

  • Adding data features can cause workbooks to become very large, very quickly, with formulas in each cell of each new feature. In this example, adding 8 features to a raw data file with 900 rows resulted in file size growth from 75KB to 750KB.
  • Excel encourages data exploration but without careful documentation the steps required to complete the analysis can get lost. Steps to complete an analysis can be difficult to reproduce.
  • Some of the functions required to add features and summarize data can lead to tedious manual work along the way. The analysis for this example took most of a day to complete.
  • Default formatting of key outputs, most notably Pivot Tables and Charts, may not be suitable for analysis and presentation without manual manipulation.
  • Limited size of dataset. In my experience, Excel performs well with data sets containing up to about 50,000 rows, even with as many as 10 calculated features in the table. Beyond that, calculation time can start getting sluggish enough to require turning off auto-recalculation. With really large datasets (i.e., >1,000,000 rows) recalculation can cause Excel to run out of resources.
  • Limited community supported development. Google help is readily available. If you want more than that it’s out there, but it’s rarely free.

Data Science Isn’t Just Big Data

The client wanted 30 minutes. More precisely, they needed to eliminate a 30-minute delay between completing a finished product test in a manufacturing quality control lab and communicating the result back to the operators on the production line. Simply automating that reporting step made data more readily available to inform decisions about process equipment settings, which led to lower scrap rates and higher throughput – a win for everyone.

Another client, also a manufacturer, needed to optimize finished goods inventory and production scheduling for a line of consumer electrical products that had 20 or so discrete models. A statistical model used a rolling 3 years of order data in order to predict optimum monthly production volumes for each product. Reviewing the model output became the agenda for a monthly meeting of key representatives from Sales, Production and Inventory Control in order to decide what to make in the month ahead. The result was a reduction in finished goods inventory, better on-time delivery and better management of raw material inventories.

That was just a few years ago, before “big data” had become part of the everyday business lexicon, a time when Data Driven Decision-making (DDD) was commonplace in manufacturing operations but far less so in other business areas. Today, we tend to associate “big data” with the full practice of DDD but it’s worth making a distinction between the technology for working with very large datasets and what we do with the data once we have it. The discipline of DDD isn’t about size: it’s about the ability to detect and validate patterns in business data, and to use predictive models to inform day-to-day decisions. Sometimes, it’s attractive to use very large data sets to do the analysis, especially in addressing problems in digital marketing, customer retention, fraud detection and information security, areas which once depended only on intuition and experience.

CRISP-DM Cycle

One way of looking at data analysis capability is to develop it as a managed business asset. That starts with basic questions of how data can be used to improve performance and how to make it available for analysis. Technologies like Hadoop can help when the relevant data is a terabyte-scale dataset with a few million records, but just as often the relevant data can be found in more compact sources. Once data is available, the rest of the DDD cycle is about what to do with it. Managing data assets also includes adding new skills: using statistical methods of analysis, presenting results in ways that decision makers can take in and act on, and learning to work with an iterative Build-Measure-Learn approach to problem-solving.

Another shift in thinking is that DDD isn't just for big organizations with big money to spend. Although working with very large data sets can bring some very real costs, the underlying process of data analysis and presentation is scale-independent. A lot of useful work can and does get done without a big technology investment. Delivering a deployable solution for my manufacturing clients used data they were already generating; we just had to capture it and present it in a way that made sense to the people who needed to use it. We used simple, inexpensive tools to do it (mostly Excel with a little bit of custom code), and it only took a few weeks to deliver a working solution. More than a decade later, it is interesting to note that Excel and R are still the most popular tools for everyday data mining work.

Next: Exploring Data with Excel

The Fork and Cork Grill

Live theatre, live music, dining out.

Besides being three of my favourite pastimes, what do they have in common?

Smile, there's wine!

Each is a performance in which skilled and talented people collaborate, bringing the best they have in order to create memorable experiences for the patrons they serve. The people who do it have to be on top of their game every day, individually and in the way they work together. If you happen to know anyone whose work is performance you will also understand that it’s hard work, and much of it goes unseen.

All of which brings me to the Fork and Cork Grill, a new restaurant that has just opened in Kitchener. The name says what this collaboration is about: food and wine. It’s a casual place where friends and family can gather at table to share good food and drink as a catalyst for conversation, laughter and simply enjoying one another’s company.

So what does this collaboration look like?

THE CAST

Owners: The Fork and Cork story began when Robert Zablocki and his wife, Dorothy, arrived in Kitchener from their native Poland in 1989. In the years that followed a house building business began to focus on flooring, then restaurant flooring, and finally a restaurant with purchase of an Angel’s diner franchise 10 years ago. They did well, and after a long and successful run with the diner they were ready to move on to a new adventure. Their dream was to bring something new to the K-W area, a casual dining experience that featured really good food and great hospitality.

Executive Chef: Eric Neaves and his wife, Lexzi, (who also works in the Fork and Cork kitchen) add another chapter. Both grew up in families where good food was important. Meal times were family bonding times that also served as a focal point for hospitality. No stranger to performance, Eric trained as an opera singer before his passion for cooking led him to Stratford Chefs School where he and Lexzi met, and from there to a number of fine restaurants.

Eric’s cooking has focused on restaurants devoted to the “Farm to Table” and “Nose to Tail” concepts, including time at London’s acclaimed “The Only on King”, Toronto’s “Marben” and Rob Gentile’s award winning “BUCA”. His training in classic French and Italian techniques has given him great respect for old world traditions, seen especially in his pastas, cured meats, cheeses, breads and pastries. He brings passion, imagination and skill for using top quality ingredients and treating them with care in order to produce memorable meals.

Some local brews

Sommelier and House Manager: Good service doesn’t just happen. Peter Lavoie leads the front-of-house team, helping to set the tone for hospitality at the Fork and Cork. A former instructor at the Stratford Chefs School, he also worked as sommelier and house manager at the Prune Restaurant in Stratford. The wine list he has put together features some exceptional offerings from the Niagara region. There is also a great selection of craft beers from across Southwestern Ontario.

Farmers and Producers: The Fork and Cork is also a showcase for the best of what Southwestern Ontario farmers and craft food producers can do. They are a vital and unseen part of the whole dining experience, and the menu acknowledges their contribution with a word of thanks and a list of who they are. The web site provides links for anyone who wants to know more about them.

THE MENU
The menu brings together an eclectic mix of old-world and contemporary ideas. Starters range from the very simple (warm marinated olives) to some truly unique offerings like poutine made with duck jus and topped with pulled duck. Salads and cold plates include a summer vegetable salad (local vegetables, olive oil, lemon and fine herbs) that changes from week to week depending on what is available. Savouries include cured trout, a charcuterie board and a selection of local cheeses.

Pizzas are done Romagna style – ultra-thin crust, interesting toppings and flash-baked. Pastas are all made fresh and presented in flavourful combinations. A personal favourite of mine is the double-stuffed ravioli, with green pea and goat’s cheese fillings served with pea tendrils, aged balsamic and parmigiano. The list of entrées is not long but each dish offers something unique.

OPENING NIGHT

Main Dining Room

My wife and I attended the second of two “Friends and Family” nights that opened the Fork and Cork. It was a lot like being at the opening of a new theatre production: a lot of preparation in the weeks leading up to a moment of truth when it all comes together for a live audience. First impressions of the space itself were of elegant simplicity – inviting and comfortable. It’s a good-sized space with seating for 120 people in the main dining room and another 40 in the bar area. Music was upbeat and interesting but not intrusive. All of the front-of-house staff were friendly, relaxed and knowledgeable, welcoming us warmly and conveying a genuine sense of excitement about being able to share in the start of something new. While we waited at the bar for the rest of our group to arrive we had a nice chat with the bartender. She was busy but she also took the time to talk while she prepared our drinks.

On to dinner…

Summer Greens with Burrata

My wife started with the gazpacho, garnished with a deep-fried zucchini blossom; she described it as the best she had ever tasted – thick and flavourful with a perfect balance of tart, sweet and spice that went nicely with a Chardonnay from Ravine Vineyards. I had the Summer Greens, which included a white bean puree, burrata cheese, tomato confit and basil oil. Paired with a Pinot Grigio from Peninsula Ridge, there was a delightful contrast in flavour and texture, bringing together the greens, creamy cheese, tomatoes and wine.

For an entrée, my wife ordered the braised duck leg spaghettoni with oyster mushrooms, basil and marscapone cheese, which she enjoyed thoroughly. I opted for the Falafel Burger, a vegan-friendly offering that anyone would enjoy. Topped with traditional falafel toppings (hummus, tomato, red cabbage, greens, house pickles) and fries on the side with garlic-tahini, it was very tasty. It went well with a big, fruity Gamay Noir from 13th Street winery.

We sat at a table of 8 that collectively sampled a large portion of the menu (steak, burger, pasta, pizza and grilled tuna salad) with enthusiastic approval all around. Our server was attentive and ready with answers to questions about the menu and also about the wine and beer list. Despite the busy night (roughly 100 diners who all arrived within a short enough time to stress the most seasoned crew) service flowed smoothly. Dorothy, Robert and Peter were very visible, sharing a few words of greeting with each table and helping out where needed. Eric even managed to get out of the kitchen briefly to visit with a number of the folks who were there. There was a genuine sense that everyone was having fun.

Kudos to Robert and Dorothy for the vision and drive to create a new venture, to Eric and the kitchen team for their excellent efforts and to Peter and the rest of the nearly 50 people who make up the creative and delivery sides of this collaboration. Finally, thank you to the 20+ farmers and producers who work behind the scenes to make it all happen. Bravo! We’ll be back.

Fork and Cork Grill 519-208-0606

What About Requirements Documents?

My last post was about why Use Cases should still be considered alive and well as an analysis technique in Agile projects. That begs the question: if people can dismiss Use Cases as old fashioned and a waste of time, is there any place left for traditional Requirements documents?

I am not a fan of Big Requirements Up-Front. Fat requirements documents based on the premise that if you don’t ask, you don’t get, tend to accumulate very long feature lists. Crafted early in a project, they risk becoming disconnected from what the front line users will actually use. We try to mitigate this risk by prioritizing the features.  But anyone who has been through the exercise of separating a few hundred features into categories by priority can quickly agree on three things: it’s necessary, it’s difficult and it requires more ruthlessness than most of us can muster.

The MoSCoW technique is a popular way to try and simplify the process by distributing the list into four categories, something like this:

  • Must Have (some team members really want it)
  • Should Have (someone else really wants it)
  • Could Have (really useful but nobody believes they will actually get it)
  • Won’t Have (nobody cares)

The process depends on having a reasonably complete feature set up-front. The problem is that taking a group through this kind of exercise with a big feature list is long and tiring for everyone involved. If that isn’t enough, it often doesn’t work very well. Without good ground rules about what the categories mean (hopefully more objective than the ones listed here) there is a very real chance of getting it wrong, leading to much time and effort wasted, building things people never use and leaving out things they actually need.

The situation doesn’t change much when the aim is purchase of a commercial off-the-shelf (COTS) package. There is a strong temptation to write extensive feature lists in order to set up an “objective” comparison among vendor offerings. Purchasing departments like this approach because the evaluation looks like such a systematic way to get best value in a big ticket purchase. In fact, vendors who compete in the same market space often have very similar feature offerings, something that has been demonstrated to me almost every time a set of RFP responses came back. It’s easy for the comparison to end in a tie when everyone offers the same features. Final product selection is more likely to be decided on differences in execution of a handful of use cases that a few key stakeholders really care about.

High Level Project Life Cycle

So, what’s the alternative? If, like me, you tend to favour incremental delivery there’s a lingering problem about putting boundaries around what to deliver. There is, after all, more to delivery than construction, and especially in large projects, there’s a bigger story to be told than can be represented on a set of index cards. Part of that story is a nod to key stakeholders that the things that matter most to them are being addressed.

Here is where we encounter our old friend, the Requirements Document. Dressed a little differently perhaps, trimmed down and definitely playing a new role, we have a “System Vision” document. Produced at the end of a period of exploration, it captures the what and the why of a project at a high level. The System Vision articulates outcomes that matter to key stakeholders: what the solution does for them but not necessarily how it does it. It goes into just enough detail to help people see a picture of what they are about to build, and it might touch on some non-functional aspects that constrain what ultimately will be viewed as a “good” result.

In Disciplined Agile Delivery, Scott Ambler describes this as a key gate point. At the end of the Inception stage the project is poised to move into construction, and its culture is about to shift to a preoccupation with deadlines, budgets and issue management. Requirements discovery will continue, new features will be added, others will drop off and priorities will shift. The Vision document can aid the ongoing process of managing change by providing a snapshot of what people understood before the construction noise began. Coupled with a good requirements management process it provides a reference point for evaluating the changes that inevitably will come.

Are Use Cases Becoming Old-Fashioned?

A few weeks ago a discussion popped up on LinkedIn titled “Use Cases are old-fashioned and a waste of time and money”. Provocative, to be sure, and the debate that followed (at last count 88 comments and building) was an entertaining romp through the personal biases of a number of experienced practitioners. There were two distinct camps. Arguing against use cases was a group that liked pictures better than words for displaying the sequence of events to fulfill a user goal. For them, taking the time to write use cases is an unnecessary formalism that is too slow to keep pace with the needs of an Agile team. On the other side was a more reflective group who viewed use cases as just one more tool in the box for managing requirements, regardless of the development approach at hand. As a way to get a broader perspective, I thought it might be interesting to see what the Agile Extension to the Business Analysis Body of Knowledge (BABOK-AE) had to say.

BABOK Agile Extension

For those who haven’t encountered it, the BABOK-AE was developed in collaboration between the IIBA and the Agile Alliance in order to tackle the problem of managing requirements (aka features) in Agile development projects. It was an interesting read, not so much as a way to settle this debate but more as a reminder that good requirements management is just as important in Agile projects as in traditional, plan-driven approaches. What changes with Agile projects is the timing of detailed requirement definition, and an emphasis on face-to-face communication. The real casualty is the comprehensive Requirements Document; use cases are cited alongside other models as simply another lightweight representation of a set of requirements.

So why the debate? Part of it seems to be a concern (obsession?) among Agile practitioners with keeping the design artifacts light. The focus on building functionality that can be delivered within a fixed time cycle demands representations that are just barely good enough to use. It’s easy to see whiteboards and index cards as good enough for the moment, and they usually become the primary media for recording and communicating requirements as user stories, swim lanes and sequence diagrams. As useful as they are, these representations tend to have a short shelf-life, but some Agile practitioners are still suspicious of documents, even thin ones like use cases, as things that transgress a basic tenet of Agile delivery (“working software over comprehensive documentation”).

The word “comprehensive” might be the hang-up. As with any modelling technique it’s easy to get carried away with use case analysis and try to make too many documents with too much information packed into them. But use cases are not supposed to be weighty documents. Authors like Mary and Tom Poppendieck and Alistair Cockburn, one of the original signatories of the Agile Manifesto,  did a pretty good job of showing us how to keep use cases brief. We are not necessarily as good at listening to sound advice that has been around for a while.

Then there’s the gap between the idealized Agile delivery environment and the reality of many organizations where actual projects take place. When delivery team members are dispersed geographically and dealing with competing priorities, face-to-face communication becomes a luxury and written communication takes on a larger role. Use cases can be useful for grouping user stories together in a way that helps the delivery team see a richer picture that includes triggers, preconditions and success criteria, all of which are part of delivering a customer goal. As well, defining a use case narrative as a main flow with some extensions, alternates, recovery paths and failure modes can help teams break down an epic story that may need several iterations to build completely.

Finally, there just might be some plain shortsightedness on the part of those of us who live in development projects. We forget too easily that developed software ultimately goes into production and will probably be supported by somebody else. The required knowledge transfer is far more than just handing over a stack of index cards at the end of user acceptance testing. Use cases can be a useful starting point for organizing the knowledge base that support teams depend on in order to troubleshoot problems in production when things go wrong.

My own aim is to document just enough information to suit the situation. Use cases are certainly part of my repertoire, but I’m more likely to start with a few user stories and some swim-lane diagrams that set out a high-level sequence of events. Sometimes that’s all that is needed; other times, especially when I’m working with off-site developers, documentation will evolve into some “fully dressed” use cases that may include some appendices that serve as a place to gather specification details. Picking the right tool for the job is more a matter of understanding what aids communication in the situation at hand than following a set of rules.

Soft Systems Methodology

Life is messy. Those of us who work as business analysts see it in the myriad of ways in which people create, store and retrieve information in order to do their work. Called upon to help improve a situation, we often reach for help in the form of a methodology. The trouble with most system design methodologies is that they are made for giving structure to the process of building a solution, not for helping people living in a messy situation decide what would actually make things better. It is all too easy, particularly with larger projects, to get to design without fully understanding the context in which people will use the solution. What’s missing is an explicit way of capturing the culture and politics of the situation in order to bring key stakeholders to a common vision of what to change.

Soft Systems Methodology (SSM) is a lesser-known approach to business analysis that enriches the tool set for exploring the cultural and political context of improvement projects. Based in systems theory, SSM is an inquiry process that takes the term “system” beyond technology, taking into account the people involved and what aspects of the situation matter to them. It isn’t new; since Peter Checkland published his original work more than 30 years ago SSM has drawn a following among practitioners and researchers (mostly in the UK) who were looking for more effective ways to tackle complex problems.

According to SSM, understanding how to improve a situation requires both a logical and a social view. The former takes us down the familiar path of examining activities and information flows, perhaps building a Context Diagram and identifying gaps between actual practice and a potential future state. The social inquiry addresses how people working in the situation perceive it and how their respective world-views influence what changes would be acceptable. This leads to a series of possible future state models, each of which could deliver some improvement, combined with insight into the chances of success, given differences about what matters to different stakeholders.

SSM Process


Rich Pictures
Models are a core component of SSM, serving both as a medium to structure the inquiry – giving a first-level representation of the situation – and as a source of questions to ask of different stakeholders. Conventional modelling techniques (e.g., Context Diagram, UML activity diagrams, BPMN models, Value Stream Maps) can help, but the emphasis that SSM puts on people and perception invites other approaches.

Rich Pictures provide a way to visualize a complex situation in terms of the main business processes, their data requirements, relationships among the main actors and their perception of the issues and opportunities for improvement. Unlike formal modelling languages, Rich Pictures do not have rules. Well, maybe one: tell a story. Use images, pictures, keywords and descriptive labels to help the reader take in who is processing what data for what purpose, what data is coming in, what information is going out, and what matters to key people in the situation. Frequently hand-drawn, even cartoon-like, the aim is to convey a complex story more effectively in pictures than in prose.

Root Definitions: CATWOE
The informality of a Rich Picture does not mean that it is any less rigorous than other models; rather, the picture is the result of a structured analysis that seeks to derive clarity from complexity. The analysis starts by identifying one or more candidate systems relevant to the situation, each expressed in the form of a one-sentence root definition:

A system to do X by Y, in order to Z.

Taking a simple example, suppose the system to be improved was a document management system in an organization that does proprietary Research and Development work. The existing practice is to use file shares. A possible root definition could be:

A system for storing and retrieving documents in order to enable knowledge sharing among researchers.

The next step is to view the system from the perspective of a key stakeholder (or group) using the mnemonic CATWOE in order to identify key elements, starting with what input the system converts to output (the Transformation, T) from that perspective. The rest of the analysis builds out detail, identifying:

Customers:

  • Who (or what) benefits from this transformation?
  • What problems are they experiencing?
  • How would they react to one change versus another?

Actors:

  • Who does the work in order to produce output?
  • What happens to them if some aspect of the system changes?
  • How would they react to one change versus another?

Transformation:

  • From “start” to “finish”, what are the steps, the inputs and outputs?

Weltanschauung (world-view)

  • What do key stakeholders care about the most?
  • What are the points of agreement (and conflict) between stakeholders about what is important?

Owner:

  • To whom is the “system” answerable (who has the authority to take action to change it)?
  • How much help do you need from them in order to make a change?
  • What would cause them to help / hinder a change?

Environment:

  • What external factors interact with the system in a way that constrains what is possible?
  • Are there technical, regulatory or other factors that will govern the approach to change?

In this example, CATWOE taken from the perspective of the VP, Research and Development, might look like this:

Customers: Researchers, Project Teams
Actors: Researchers, Project Teams (Don’t see value in learning a new technology.)
Transformation: Organize, store and retrieve project documents, research reports and reference material.
Weltanschauung (world-view): Access to information aids team productivity.
Owner: VP, Research and Development

Environment:

  • Must use existing technology infrastructure
  • Corporate investment in SharePoint
  • Non-disclosure agreements, patent applications

Rich Picture of an Intranet

The analysis gives us a snapshot of the system from a single perspective, one that puts a high value on improved information access, while still acknowledging other stakeholders. Senior management concerns about protection of intellectual property and the IT department’s view of ownership become environmental constraints that limit how the system can operate. Shifting the perspective to one of those stakeholders might emphasize different aspects as their interests take centre stage. Neither perspective is wholly right or wrong; pursuing the cycles of discussion that come out of the analysis leads to discovery about which changes are desirable and feasible.

Why SSM Matters
SSM provides an inquiry process that helps everyone involved gain understanding of each other’s concerns and the relative merits of different approaches to change. On some levels that’s nothing new; any experienced business analyst probably has some structured stakeholder analysis in their repertoire already. The extra pieces that SSM brings to the table are the deliberate effort to view the situation from different perspectives and the early timing in the project life cycle. The cyclic process of discussion and learning helps build mutual understanding and trust among stakeholders, strengthening the collective will to embrace the changes that lie ahead.

Based on my own experience of using (at least parts of) SSM for more than a decade, here are some things to like about it:

  1. CATWOE is easy to remember as an outline of what to ask, especially in the early stages of getting to know stakeholders.
  2. By encapsulating a rich snapshot of what’s going on in a situation and how it is perceived from a variety of perspectives, CATWOE can lead quickly to a way of structuring a discussion about which changes could be effective.
  3. Just as User Stories provide a kind of shorthand for Use Cases, Root Definitions can encapsulate the big picture of an entire system in a single sentence.
  4. Rich Pictures help depict a problematic situation in a way that can draw attention to what needs to change, where there are conflicts and whose support may be needed in order to succeed.
  5. The emphasis on understanding the system from different perspectives helps to build attitudes of trust and collaboration between stakeholders and those charged with making improvements.
  6. SSM is scalable. It can be equally effective for drawing insights about a single system within a single organization, for developing system strategy across a whole organization or even for tackling complex situations (e.g., regional health care delivery) that involve multiple systems, organizations and stakeholder groups.


Notes on Notes

As much as technology permeates our daily lives it has been slow to enter the simple act of note-taking. Look around the next time you’re in a meeting: low-tech rules. If there are a dozen people in the room, chances are that no more than one or two will be using anything more sophisticated than pen and paper to take notes. And that’s not a bad thing; the technology we use to record notes shouldn’t take attention away from the rest of what’s going on in the meeting. So I find myself resisting this particular shift in technology just as much as anyone else. At least that was the story until about six months ago.

Writing with MyScript


This post started out handwritten on an iPad using MyScript Memo, one of several note taking apps that I have been trying out. Having acquired an iPad my goal was to find practical uses for it. For example, could  I take it into a meeting room and use it with the same ease as pen and paper? If I was going to, I would want a note-taking app that would let me:

  • Write notes with a stylus.
  • Shift between handwriting and typing.
  • Add images to a note.
  • Save a note as an image.
  • Convert handwriting to text.
  • Transfer notes to other apps and devices.
  • Organize notes for later reference.

I found two good choices, Notability from Ginger Labs and MyScript Smart Note from Vision Objects. Neither does quite everything but both go a long way toward doing what I need.

Notability has become my everyday note-taking tool. Although I find myself using a stylus most of the time, I like that I can also add a keyboard to my iPad and type when I need to. The text input feature comes in handy when someone emails me a meeting agenda and I want to use it as an outline for my notes. The ability to mix text, written notes and images means that I can capture a whiteboard session as an image and then make notes on it. At the end of a session I have a quick, accurate record of what happened, together with my own notes for later use. Best of all, Notability lets me organize notes by topic and date so that I can find them again.

The one thing that Notability does not do is handwriting recognition. Most of the time that isn’t a concern because I don’t often transcribe notes verbatim into another document. When I do need that capability I turn to MyScript, which goes a long way toward enriching a handwritten tablet experience. The big feature is conversion to text, which I used to move the first draft of this post from my iPad to my laptop. It isn’t perfect, but most of the time accuracy is well above 90%, and it’s easy to copy and paste the result into some other app.

I started with MyScript Memo, which is free, and recently upgraded to Smart Note, which adds some extra features, including a set of editing gestures (such as a strike-through for deletion) for making quick edits on the fly. With either app, the writing experience is more like notebook-and-pen than any other app I tried.

Two big things to get used to with any handwriting app are the feel of writing on a smooth, hard surface and finding a natural wrist position. Both Notability and MyScript have wrist-guard overlays that cover the bottom portion of the screen to help maintain a more comfortable wrist position and prevent stray marks or other unintended effects from multiple points of contact. Even so, I noticed that the screen would sometimes get jittery when there was too much wrist pressure on the tablet surface.

Choice of stylus matters. The Wacom Bamboo stylus is inexpensive and rugged, but its rubber tip requires a firm hand, so the writing experience is a bit like writing with a crayon. The Adonit Jot-Pro claims fine control, and that’s true, up to a point. The writing tip is a small plastic disk, which does allow easier motion, at least for cursive writing. Printing is a little trickier – it’s easy for the disk to touch the surface between motions and leave stray marks. With either stylus, slowing down and forming letters more carefully produces a better handwriting recognition result.

After six months what can I say? My iPad is with me most of the time when I’m in a meeting. It’s a lot less intrusive than a laptop and it provides a richer tool set for recording notes. Pen and paper won’t go away completely, but my paper notebooks are lasting a lot longer.

 

Open Source Content Management: The DNN Platform

DNN Logo

How do you use Microsoft and Open Source in the same sentence without creating an oxymoron? Answer: talk about the application, not the technology stack that runs it. The application is DotNetNuke, also known as the DNN Platform and Evoq. Originally bundled with the first release of Visual Studio .NET as a sample project, the “IBuySpy” portal has taken on a life of its own as DotNetNuke. Ten years, 7 million downloads and one million web sites later, we have a thriving open source community.

DNN is a portal for capturing and presenting web content. It provides a standardized framework for managing security, user access and page layout, and a set of modules that deliver discrete functions within each page. Skins can be applied to the portal or to individual pages in order to customize look and feel. The portal is distributed free of charge in a Community edition that comes with a basic set of modules and skins. An extended set of modules and a package of training and support services are provided for a fee in the Professional ($3,500 US) and Enterprise ($8,500 US) editions.

Installation

Having worked with DNN in the past on a couple of intranet sites, I was curious to see what it looks like today, so I downloaded and installed the Community edition. The first thing I noticed was that, since the last time I installed a DNN site, the installation package had split into two different packages. The standard installer is intended for a production site where both IIS and SQL Server are already installed. The all-in-one installer (which I chose) uses Microsoft WebMatrix to download and install everything needed for an evaluation site. The process was well-documented and (mostly) uneventful; I had my evaluation site up and running in less than an hour.
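With the standard installer, the main manual step is pointing DNN at a database that has already been set up. As a rough sketch only (element and key names can vary between DNN versions, and the server and database names below are placeholders rather than a recipe), the connection details live in the site’s web.config, along these lines:

    <connectionStrings>
      <!-- Illustrative placeholder values: substitute the SQL Server instance
           and database prepared for the site before running the installer. -->
      <add name="SiteSqlServer"
           connectionString="Data Source=MyDbServer;Initial Catalog=DNN_Eval;Integrated Security=True"
           providerName="System.Data.SqlClient" />
    </connectionStrings>

The all-in-one WebMatrix package handles this step automatically, which is a big part of why it suits an evaluation site; for a production install the database and its permissions are normally prepared by the IT team ahead of time.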

Site Design and Management

Modules are a powerful feature of the DNN platform for packaging and presenting content. Each module delivers a specific kind of content (e.g., announcements, images, events, documents) to a single page or across all pages in the site. The community edition comes packaged with an introductory set of modules for managing text, images and lists. If the feature you want isn’t there, chances are that you can find it in the DNN store, from an independent 3rd-party vendor or on CodePlex. Many modules are free, while others are priced anywhere from $50 to $2,000. To build my evaluation site I downloaded several free modules from CodePlex and found the installation process to be simple and straightforward.

One of the things that I have always liked about DNN is the way that design and administration functions are integrated into the same interface as the web site itself. By signing in as an administrator I have access to an office-like ribbon that lets me manipulate all aspects of the site layout and content. Adding a page to a site is a simple, one-click operation that opens a dialog where I can specify high-level information about the page and where it fits in the menu hierarchy. Once the page has been created I can add functionality to display different kinds of content by simply dragging a module from the ribbon to a zone on the page. Once on the page, the module has a local menu for adjusting settings and an edit function for adding content.

Evaluation

Against my selection criteria, I found that the Announcements, Form and List, and Events modules provided strong capabilities for managing custom lists and other content types that support team collaboration and basic project management. Document management with the free Documents module was disappointing. Although it was easy enough to upload documents to the site, the only interface for retrieving them was a simple list. Adding folders to the list improved the ability to navigate a medium-sized document collection, and the tags feature (combined with a managed list) did enable me to search for documents using a standardized vocabulary of search terms. Still, the concept of a managed document repository with flexible search and retrieval capabilities just isn’t there: no version control, no check-in/check-out and no approval mechanism. This is a limitation that others have recognized, and modules like Document Exchange (~$500) have been created to enable document management capabilities that approach what a SharePoint user would take for granted.

There is plenty to like about DNN. IT support teams tend to like it because it uses familiar server technologies. Any organization that has already adopted Microsoft SQL Server and IIS will find DNN easy to install and support. Business users like DNN because they can take full ownership of capturing and presenting the information that matters to them. Both small and large organizations like DNN because it provides an inexpensive platform for building web sites that are rich in content, easy to navigate and scalable to a wide range of needs. If those seem like a good fit with your criteria then DNN is worth a close look. On the other hand, if you’re looking for extensive document management or workflow functions you may decide to keep looking.