The key to success in an A.I., IoT or data strategy is data scientist productivity, defined as the volume of business-critical results driven through data science. The difference between successful data-driven companies and less successful ones is the productivity and throughput of the data science team. Increasing that productivity and throughput also yields positive side effects, including standardization of processes, tooling and methodology, as well as a growing body of case studies and foundational data science that can trigger and speed up other efforts.
When surveyed, data scientists typically cite two key roadblocks to their productivity and throughput. First, it’s very hard to get access to any data, let alone clean, curated, enriched, high-quality data. Second, even when they have data, they are often blocked by an inability to understand the data and the context of its generation, transfer and lineage. Together, and even individually, these two issues can be debilitating and consume vast amounts of energy with no results.
Access to high-quality data
The first issue occurs when organizations lack a comprehensive description and unified infrastructure to organize and describe data, or lack the leadership to break down the silos and fiefdoms that cause employees to hoard and protect data. When presented with a business problem, data scientists will often ask for all underlying, relevant data sets that represent the business, customer, process or operation. Ideally, enabled by a self-service data discovery mechanism, data scientists should be able to find all relevant data sets and systematically determine the subset relevant to the problem at hand. Without such a capability, however, data scientists have to depend on cooperation from the business and developer teams closest to the data, and this knowledge transfer can take days, weeks or even months.
Once the ideal data sets have been identified, the laborious process of access provisioning, entitlements and transfer of data to the data scientist’s processing environment can begin. This process, again, is fraught with issues due to procedures and limitations around data access and transfer. Once the data has been delivered to the data science processing environment, data scientists can begin inspecting and validating it for completeness and relevance to the problem at hand. If any issue in the export, transfer or storage has left the data incomplete, corrupted or unsuitable, the data scientist will eventually discover the problem and will need to re-initiate the process of exporting, transferring and storing the data set.
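In practice, this inspection step often starts with a handful of mechanical completeness checks. The sketch below illustrates the idea; the column names, expected row count and null-rate threshold are illustrative assumptions, not a prescribed standard.

```python
# Minimal sketch of the completeness checks a data scientist might run on a
# newly delivered data set before trusting it. All thresholds and column
# names below are illustrative assumptions.

def validate_delivery(rows, expected_columns, expected_row_count, max_null_rate=0.05):
    """Return a list of human-readable issues; an empty list means the
    delivery passes these basic checks."""
    issues = []
    if len(rows) < expected_row_count:
        issues.append(f"row count {len(rows)} below expected {expected_row_count}")
    for col in expected_columns:
        missing = sum(1 for r in rows if r.get(col) in (None, ""))
        if rows and missing / len(rows) > max_null_rate:
            issues.append(f"column '{col}' is {missing / len(rows):.0%} empty")
    return issues

# A corrupted export: half of the 'customer_id' values were lost in transfer.
rows = [{"customer_id": "c1", "amount": "10.0"},
        {"customer_id": "", "amount": "12.5"}]
print(validate_delivery(rows, ["customer_id", "amount"], expected_row_count=2))
# → ["column 'customer_id' is 50% empty"]
```

A failed check at this point is exactly the trigger for re-initiating the export, transfer and storage cycle described above.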
Understanding the context, assumptions and biases in the data
Once data scientists have access to the data and have been able to inspect the data, they can begin the process of understanding the context of the data — i.e. where it was generated, how it was generated, what logic was used to generate it, what the data contains, what kind of attributes it has, what kind of values the attributes take, and what percentage of the entire data set the data at hand covers.
Understanding the context, assumptions and generation biases of the data often requires the data scientist to have direct discussions with the business teams and development teams involved with the generation of the data. Over time, these interviews should be used to populate a semantic layer that describes, annotates and defines the data. This ensures that for future initiatives, the semantic layer is available and in place to reduce the effort required to understand the data.
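One lightweight way to start such a semantic layer is a structured data dictionary that records what the interviews reveal. The sketch below is a minimal illustration; the field names and example content are assumptions, and a real semantic layer would typically live in a catalog tool rather than application code.

```python
# A minimal sketch of a semantic layer entry: a structured record of what
# interviews with business and development teams reveal about a data set.
# Field names and example content are illustrative assumptions.

from dataclasses import dataclass, field

@dataclass
class ColumnAnnotation:
    name: str
    definition: str              # business meaning, in the business team's words
    generation_logic: str        # where and how the value is produced
    known_biases: list = field(default_factory=list)

@dataclass
class DatasetAnnotation:
    dataset: str
    owner_team: str
    coverage: str                # what slice of the business the data represents
    columns: dict = field(default_factory=dict)

    def annotate(self, col: ColumnAnnotation):
        self.columns[col.name] = col

# Populated once from interviews, then reusable by every future initiative.
orders = DatasetAnnotation("orders", "e-commerce team",
                           coverage="web orders only; excludes phone sales")
orders.annotate(ColumnAnnotation(
    "discount", "discount applied at checkout",
    "computed by the pricing service at order time",
    known_biases=["null before 2019 migration"]))
```

The point of the structure is that the generation logic and known biases are captured once, instead of being rediscovered through fresh interviews on every project.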
Measuring the distance between data science and the business
The severity of the above problems can be measured in terms of the distance, i.e. the number of hops through intermediary employees, between a business user and the data scientists. The greater the distance, the more time and effort it takes to transfer context from the business to the data scientist, and the lower the quality of the information that arrives.
The distance impacts not only the transfer of context but also the initial search and discovery of data sets and the transfer of the data itself. In addition, as data scientists iterate over the data set through experimentation and hypothesis testing, they hit these delays every time they have a question or require guidance and clarification.
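The distance metric can be made concrete as a shortest path over a who-talks-to-whom graph. The sketch below assumes a toy organizational graph; the role names are illustrative, not taken from any real organization.

```python
# A sketch of the 'distance' metric: the number of hops through intermediary
# employees between a business user and a data scientist, computed as a
# shortest path (breadth-first search) over a who-talks-to-whom graph.
# The org graph below is an illustrative assumption.

from collections import deque

def hops(graph, source, target):
    """Fewest communication hops between two employees, or None if no path."""
    seen, frontier = {source}, deque([(source, 0)])
    while frontier:
        person, dist = frontier.popleft()
        if person == target:
            return dist
        for contact in graph.get(person, []):
            if contact not in seen:
                seen.add(contact)
                frontier.append((contact, dist + 1))
    return None  # no communication path at all

org = {
    "business_user": ["product_manager"],
    "product_manager": ["analytics_lead"],
    "analytics_lead": ["data_scientist"],
    "data_scientist": [],
}
print(hops(org, "business_user", "data_scientist"))  # → 3
```

Every question, clarification and data request during experimentation pays this hop cost in both directions, which is why the iteration delays compound so quickly.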
At the same time, the separation between data science and the business means each discipline is subject to its own suborganization’s policies and priorities. If a business team decides another initiative takes higher priority, it can inadvertently stall, shut down or effectively block the data science effort.
Similarly, a shift in the business team’s priorities can stall the deployment or productization of the data science team’s results. By the time priorities shift back, this delay can render those results stale and unusable.
Another problem with distance occurs when questions and answers are interpreted and passed along layers of intermediary employees. Critical details, such as information about the problem, the data, the users, previous attempts to solve the problem, and the operating environment, can be missed or overlooked, leading to expensive corrections down the line.
The huge impact the issues described above have on the productivity and throughput of data scientists poses an interesting but critical organizational strategy question for executives: Where do data scientists belong within the organization? Should they be closer to the business, so they can better understand the business context of the problem and the data required to solve the problem? Or should they be closer to the developer and technology team, so they’re able to deploy, productize and monitor the output of their data science activities? Both options have pros and cons, and organizations typically experiment with both approaches.
Data scientists require heavy interaction with different groups of employees at various points in the data science life cycle. During the initial phases, they need constant contact with the teams that understand the problem, the data and the history of both. In later phases, toward the end of the life cycle, they need close interaction with the developers who will deploy and instrument their data-science-driven products and services in production, enabling the data scientists to monitor, improve and iterate on their results to drive higher quality.
Data science is a contact sport. It requires an organizational strategy that enables data science teams to be mobile and to interact heavily with various groups of employees at different points in the data science life cycle. Without these interactions, data scientists are bound to miss context, assumptions and biases in the data, lack a complete understanding of the business problem at hand, or be unable to monitor and ensure accurate deployment and operation of their output in the real world. Any of these gaps lowers the quality of the data science produced, reducing both the productivity of data scientists and their throughput of high-quality results.
Data scientists need to rotate through these various teams, and they need to be enabled by the organization to feel, act and be treated as dedicated members of the teams with whom they need to interact.