Mon, 28 Oct 2013 11:26:34 GMT
The Data Scientist’s Four-Step Discovery Process
The discovery process used by data scientists commonly consists of four steps (see also Figure 1):
- Data acquisition: In this first step, data is collected from various data sources. Data scientists select the data sources that may be useful and relevant for their study.
- Data preparation: In this step, data is transformed, aggregated, integrated and cleansed until it has the form that data scientists need for their study. For example, for many data mining algorithms, it can be useful to transform real-life values to binary values.
- Data analysis: In this step, data is analyzed using various types of techniques, including simple reporting techniques; classic statistical techniques, such as forecasting, predictive modeling and clustering; advanced data mining techniques; data visualization techniques, such as affinity visualization, path visualization, scatter clouds and geo-visualization; and time-series analysis.
- Data interpretation: When the techniques and tools present results and insights, it’s still the responsibility of the data scientist to determine whether the results make sense. This requires in-depth knowledge of the business and the data, and it demands common sense.
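As an illustration of the data preparation step, the transformation of real-life values to binary values mentioned above can be sketched in a few lines of Python. This is a minimal sketch, not a prescribed method: the function name `binarize`, the threshold of 20 degrees and the sample temperatures are all assumptions made for the example.

```python
def binarize(values, threshold):
    """Map each numeric value to 1 if it exceeds the threshold, else 0."""
    return [1 if value > threshold else 0 for value in values]


# Hypothetical daily temperatures in degrees Celsius (illustrative data only).
temperatures = [12.5, 18.0, 24.3, 30.1]

# Assumed cutoff: a day counts as "warm" above 20 degrees.
warm_day = binarize(temperatures, threshold=20.0)
print(warm_day)  # [0, 0, 1, 1]
```

In practice, the choice of threshold is itself part of the study and may change between iterations of the discovery process.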
Characteristics of the Data Scientist’s Discovery Process
The discovery process deployed by data scientists has the following characteristics:
- The discovery result consists of rules. The result of a discovery process is in most situations insights, and these insights are formulated as a set of rules. These rules can be simple if-then rules. For example, if two payments are made with the same credit card within 10 seconds, they are probably fraudulent. Rules can also be advanced statistical formulas indicating the relationship between specific variables. For example, a 10-degree rise in temperature increases sales of barbecue meat by 300%. Sometimes rules are sophisticated, self-learning data mining models that can predict customer behavior by combining historical and new incoming data.
- The discovery process is an iterative process. Figure 1 suggests that the discovery process is a serial process: when one step is finished, the next one starts, and we never return to a previous step. However, nothing could be further from the truth. The discovery process is highly iterative. For example, when a data analysis step has been finished, the conclusion may be to collect more data and start all over again. Even a data preparation step may lead to a return to the data acquisition step. In fact, this entire four-step process may have to be repeated several times before the right insights rise to the surface.
- Discovery results should be actionable. When a discovery process is finished, the organization has experienced no advantages yet – no money has been made, no ROI. The discovery process has to be followed up by a step called Act. In this step, the gained insights have to be used or implemented. Examples of implementing insights are: organization policies are changed, decision rules are embedded in operational applications, business processes are optimized, customers are offered special discounts and so on. Without the Act step, the entire discovery exercise has been for nothing. In other words, it’s important that discovery results are actionable. Note that the data scientist is not always involved in the Act step.
- No clear goal. Another characteristic that shows that data scientists are different from most other BI users is that their analysis work doesn’t always have a clear goal. The work they do is much more free-form, much more research-like. Because the goal is not always that clear, classifying this process as “finding a needle in a haystack” doesn’t always make sense. If you’re looking for a needle in a haystack, the goal is very clear, and with a powerful magnet it’s not even that difficult. Discovery is much more a stepwise refinement process. With each step, the data scientist may get closer to useful insights.
- Discovery may return spinoff results. It’s not uncommon that during the discovery process unexpected insights and rules are found. These spinoffs can be as useful as the rules intended to be found. Remember Alexander Fleming, who discovered penicillin by accident. There are more well-known examples like this. For example, chemist William Perkin wanted to synthesize quinine, a treatment for malaria. His experiments led accidentally to the first-ever synthetic dye. And don’t forget pharmacist John Pemberton, who created Coca-Cola while working on a medicinal tonic meant to relieve ailments such as headaches.
- Deployment of a wide range of analysis techniques. As indicated, data scientists use a wide range of analysis techniques to discover new insights. Many well-known statistical techniques can be used to find rules. A data scientist should have access to all the tools and techniques he needs. He should also be able to mix and match them. For example, he may want to apply a time-series analysis first, followed by a geo-visualization of the result. Data scientists should not be restricted in discovering valuable insights due to the lack of tools and techniques.
- Data overload doesn’t exist. The more data a data scientist has access to, the more discovery options he has. In this context, more means three things. First, it means more detailed data – no aggregate data. Aggregation of data can hide potential insights. Dealing with detailed data is a typical aspect of the big data trend. Nowadays, the technology exists to process massive amounts of data fast. Second, more means more data sources. Having access to a data warehouse is probably not enough for data scientists. They may also need access to large files with sensor data, spreadsheet data, external data sources and so on. It wouldn’t be the first time that rules are discovered by enriching internal business data with external data. Third, more means more types of data. Giving data scientists access to structured data is very useful, but not all the data has a very rigid structure. Data scientists may also require access to what some call unstructured, multi-structured, semi-structured or poly-structured data.
- Data scientists create new data. Usually, users of reporting tools don’t create their own data. They access data stored in a data warehouse or data mart. In some situations, the data that data scientists need doesn’t even exist yet. The consequence can be that dedicated projects must be initiated to create and collect the required data. An interesting example of such a project is the Amsterdam Born Children and their Development (ABCD) project. This project started in 2001 and still continues. The project tracks the health of 8,000 children. Every so many years, these children have a checkup. The goal of this long-lasting study is to discover the relationship between early growth and development and overall health later in life. This study is a good example of where the right data has to be created first.
- Discovery projects may be long lasting. Some discovery processes are completed in one day, but they can also last for weeks, months and even years. For example, in April 2013, researchers working at the academic hospital in the city of Utrecht in the Netherlands discovered a formula that predicts the risk of new health problems ten years later for patients who have had a heart attack or stroke. The formula looks at fourteen variables, including age, gender, smoking habits and blood pressure. This study started in January 1996 and ended in 2013. This is a good example of a long-lasting discovery project.
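To make the idea of a simple if-then rule more concrete, the credit card example from the first characteristic above (two payments with the same card within 10 seconds are probably fraudulent) could be expressed as follows. This is only a sketch under stated assumptions: the function name `flag_rapid_payments`, the tuple layout of a payment and the sample data are hypothetical, and a real fraud rule would involve far more context.

```python
from datetime import datetime, timedelta


def flag_rapid_payments(payments, window_seconds=10):
    """Return (card_id, earlier, later) for consecutive payments made with
    the same card within the given number of seconds.

    payments: iterable of (card_id, timestamp) tuples (assumed layout).
    """
    window = timedelta(seconds=window_seconds)
    last_seen = {}  # card_id -> timestamp of that card's previous payment
    flagged = []
    # Process payments in chronological order.
    for card_id, timestamp in sorted(payments, key=lambda p: p[1]):
        previous = last_seen.get(card_id)
        if previous is not None and timestamp - previous <= window:
            flagged.append((card_id, previous, timestamp))
        last_seen[card_id] = timestamp
    return flagged


# Hypothetical payments, for illustration only.
start = datetime(2013, 10, 28, 11, 0, 0)
payments = [
    ("card-A", start),
    ("card-B", start + timedelta(seconds=3)),
    ("card-A", start + timedelta(seconds=6)),   # 6 seconds after card-A's first payment
    ("card-B", start + timedelta(seconds=60)),  # 57 seconds after card-B's first payment
]

suspicious = flag_rapid_payments(payments)
print([card_id for card_id, _, _ in suspicious])  # ['card-A']
```

A rule like this is the kind of result that can be embedded in an operational application during the Act step.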
With the introduction of highly scalable, low-cost data storage technology and fast in-memory analytical processing capabilities, the toolset of data scientists has been enriched dramatically. Amounts of data that were unthinkable a few years ago can now be stored and analyzed, and the techniques to analyze all that data have evolved enormously. If they weren’t already, data scientists may well become the key people who help organizations survive in this increasingly competitive world.