Data Science Isn’t Just Big Data

Posted on March 5, 2016 by Bill Neaves

The client wanted 30 minutes. More precisely, they needed to eliminate a 30-minute delay between completing a finished product test in a manufacturing quality control lab and communicating the result back to the operators on the production line. Simply automating that reporting step made the data available sooner to inform decisions about process equipment settings, which led to lower scrap rates and higher throughput: a win for everyone.

Another client, also a manufacturer, needed to optimize finished goods inventory and production scheduling for a line of consumer electrical products that had 20 or so discrete models. A statistical model used a rolling three years of order data to predict optimum monthly production volumes for each product. Reviewing the model output became the agenda for a monthly meeting of key representatives from Sales, Production and Inventory Control, who used it to decide what to make in the month ahead. The result was a reduction in finished goods inventory, better on-time delivery and better management of raw material inventories.
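To make the idea of a rolling-window forecast concrete, here is a minimal sketch in Python. It is not the model used for that client, which this post does not describe in detail. It assumes one simple technique, a same-calendar-month average over the last 36 months, and the function names, the `(year, month, model, units)` record layout and the sample figures are all made up for illustration.

```python
from collections import defaultdict
from statistics import mean


def month_index(year, month):
    """Map a (year, month) pair to a single increasing integer."""
    return year * 12 + (month - 1)


def seasonal_forecast(orders, target_year, target_month, window_months=36):
    """Forecast units per product model for the target month.

    orders: iterable of (year, month, model, units) tuples (hypothetical layout).
    Returns {model: forecast}, where each forecast is the mean of the units
    ordered in the same calendar month inside the rolling window.
    Months with no orders at all are not counted as zeros in this sketch.
    """
    target = month_index(target_year, target_month)
    start = target - window_months  # oldest month still inside the window

    # Sum units per (model, month) so several orders in one month combine.
    monthly_totals = defaultdict(float)
    for year, month, model, units in orders:
        idx = month_index(year, month)
        if start <= idx < target:
            monthly_totals[(model, idx)] += units

    # Keep only the months that share the target's calendar month.
    same_month = defaultdict(list)
    for (model, idx), total in monthly_totals.items():
        if idx % 12 == target % 12:
            same_month[model].append(total)

    return {model: mean(values) for model, values in same_month.items()}


# Made-up order history for two product models, forecasting April 2016.
sample_orders = [
    (2013, 4, "A", 100),
    (2014, 4, "A", 120),
    (2015, 4, "A", 130),
    (2015, 4, "A", 10),   # second April 2015 order, added to the same month
    (2015, 3, "A", 999),  # different calendar month, ignored
    (2012, 4, "A", 500),  # older than the 36-month window, ignored
    (2015, 4, "B", 30),
]
forecast = seasonal_forecast(sample_orders, 2016, 4)
```

With this sample data, model A's April totals of 100, 120 and 140 average to 120 units, and model B's single April gives 30. A real model would also need to handle trend, months with no orders and new products, which is exactly the kind of judgment a monthly review meeting can supply.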

That was just a few years ago, before “big data” had become part of the everyday business lexicon, a time when Data-Driven Decision-making (DDD) was commonplace in manufacturing operations but far less so in other business areas. Today, we tend to associate “big data” with the full practice of DDD, but it’s worth making a distinction between the technology for working with very large datasets and what we do with the data once we have it. The discipline of DDD isn’t about size: it’s about the ability to detect and validate patterns in business data, and to use predictive models to inform day-to-day decisions. Sometimes it’s attractive to use very large datasets for the analysis, especially for problems in digital marketing, customer retention, fraud detection and information security, areas that once depended only on intuition and experience.

One way to approach data analysis capability is to treat it as a managed business asset. That starts with basic questions of how data can be used to improve performance and how to make it available for analysis. Technologies like Hadoop can help when the relevant data is a terabyte-scale dataset with a few million records, but just as often the relevant data can be found in more compact sources. Once data is available, the rest of the DDD cycle is about what to do with it. Managing data assets also includes adding new skills: using statistical methods of analysis, presenting results in ways that decision makers can take in and act on, and learning to work with an iterative Build-Measure-Learn approach to problem-solving.

Another shift in thinking is that DDD isn’t just for big organizations with big money to spend. Although working with very large datasets can bring some very real costs, the underlying process of data analysis and presentation is scale-independent. A lot of useful work can and does get done without a big technology investment. The solutions I delivered for my manufacturing clients used data they were already generating; we just had to capture it and present it in a way that made sense to the people who needed to use it. We used simple, inexpensive tools to do it (mostly Excel with a little bit of custom code), and it only took a few weeks to deliver a working solution. More than a decade later, it is interesting to note that Excel and R are still the most popular tools for everyday data mining work.
