How Data Scientists Can Balance Practicality and Rigor

A hybrid approach of product-oriented pragmatism and scientific rigor can help data science teams stay focused and impactful

When building quantitative systems that drive commercial value, pragmatism and innovation are not in conflict with one another. For growing and lean start-ups with challenging research problems and data-focused customers, data science research must yield clear business wins quickly and iteratively.

An effective approach to scaling technology in these environments must embody a mix of methods that are rigorous, interpretable, defensible and business-aligned. We’ve learned this from our own teams’ growing pains in becoming a data-driven enterprise company, an experience many readers can undoubtedly relate to.

One solution: a hybrid approach of product-oriented pragmatism and scientific rigor. Here are two examples of how to balance practicality and scientific fundamentals to keep your data science team focused and impactful.

Practical modeling methods to avoid bias and increase business relevance (without being too fancy)

Being pragmatic as a data scientist is a way to stay connected to the business and have an impact on it. One way to do this is to measure what matters to the business and train models using the most relevant available data. You need to be able to draw a coherent, straight line from business case to data set to model to business outcome.

On the flip side, applying unnecessary constraints, an easy mistake to make when designing new models, breaks that straight line and fails to generate impact. In academic settings, we often apply constraints for the sake of perceived elegance of methodology. It is important not to apply such constraints too liberally in the private sector.

These unnecessary constraints we impose on ourselves can surface all sorts of biases, some as obvious as the ones highlighted in the public case of MIT’s facial detection analysis from a year ago. In the highlighted cases, facial recognition models trained on presumably Caucasian example sets failed to recognize famous faces like Serena Williams, Michelle Obama and Oprah Winfrey.

On top of the obvious social bias problem, there’s also an avoidable and common issue of training models in the absence of business context. From a machine learning standpoint, it’s unrealistic and unnecessary to expect a generated model to intuit concepts not in the data, e.g., that skin tone is a variable in human faces.

A simple solution in this face recognition case is to identify a diverse set of faces that can train the model to recognize faces across a range of known facial feature variables. The designer should highlight key variables: skin tone, hair style, eyes, nose, ears, glasses and so forth. Is it “cheating” or “inelegant” to put Michelle Obama directly in the training set? Not at all. Put Jennifer Lopez in your test set too. You want train and test sets to be independent and balanced, but each should include key product-relevant examples. You want your data to tell the story of why your model is applicable and interpretable. Data sets that lack curation only introduce unwanted bias into your models.
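As a minimal sketch of this kind of curation, the Python below splits examples so that every value of a key variable appears in both sets, while letting the designer pin specific product-relevant examples to one side. The function name `curated_split`, the `id` and `skin_tone` fields, and the `pinned` mapping are hypothetical names chosen for illustration, and the data is a toy stand-in rather than a real face data set.

```python
import random
from collections import defaultdict


def curated_split(examples, group_of, pinned=None, test_fraction=0.25, seed=0):
    """Split examples into train and test lists, stratified by group_of(example).

    pinned maps an example id to "train" or "test" (a hypothetical convention)
    so that key product-relevant examples land where the designer wants them.
    """
    pinned = pinned or {}
    bad = [side for side in pinned.values() if side not in ("train", "test")]
    if bad:
        raise ValueError(f"pinned values must be 'train' or 'test', got {bad}")

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train = [ex for ex in examples if pinned.get(ex["id"]) == "train"]
    test = [ex for ex in examples if pinned.get(ex["id"]) == "test"]

    # Bucket the remaining examples by the variable we want balanced.
    groups = defaultdict(list)
    for ex in examples:
        if ex["id"] not in pinned:
            groups[group_of(ex)].append(ex)

    for members in groups.values():
        rng.shuffle(members)
        if len(members) < 2:
            n_test = 0  # a lone example can only go to one side; keep it for training
        else:
            # Keep at least one example on each side of the split for every group.
            n_test = min(max(round(len(members) * test_fraction), 1), len(members) - 1)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test


# Toy data: 16 records with one hypothetical key variable.
examples = [
    {"id": f"face_{i}", "skin_tone": "darker" if i % 2 else "lighter"}
    for i in range(16)
]
pinned = {"face_0": "train", "face_1": "test"}
train, test = curated_split(examples, lambda ex: ex["skin_tone"], pinned)
```

Pinning and stratifying are complementary here: the pinned records guarantee that named examples are covered, and the per-group split keeps both sets balanced on the variable the designer chose to highlight.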

Takeaway: A machine-learned model is like a baby that assumes whatever it sees in the room is the whole world. We have peripheral knowledge we can leverage when teaching these models. We create features to highlight the variables we already know are important. We tell our models the things we already know so that they can learn the nuances we don’t yet know.

Executing agile data science with practical rigor

Another common issue arises in data science applications that lack long-established best practices or strong academic attention. In these areas, there is a risk of long research cycles that don’t follow a straight path. We recommend mitigating this risk by borrowing concepts from iterative software development and “Agile” processes. A typical data science project can follow these steps:

  1. Define a success metric that is directly related to a business objective.
  2. Define a simple model that can be scored against the success metric.
  3. Iterate over a set of alternative approaches (each only incrementally more complex) to improve the success metric.
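The three steps above can be sketched in a few lines of Python. Everything here is illustrative: mean absolute error stands in for a business-linked success metric, `fit_constant`, `fit_line` and `run_iterations` are hypothetical helper names, and the data is a toy example rather than a real data set.

```python
from statistics import mean


def mean_abs_error(model, rows):
    """Step 1: the success metric (lower is better)."""
    return mean(abs(model(x) - y) for x, y in rows)


def fit_constant(rows):
    """Step 2: the simplest model, which always predicts the training average."""
    avg = mean(y for _, y in rows)
    return lambda x: avg


def fit_line(rows):
    """Step 3 candidate: ordinary least squares with a single feature."""
    x_bar = mean(x for x, _ in rows)
    y_bar = mean(y for _, y in rows)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in rows) / sum(
        (x - x_bar) ** 2 for x, _ in rows
    )
    intercept = y_bar - slope * x_bar
    return lambda x: intercept + slope * x


def run_iterations(candidates, train, holdout, metric):
    """Fit each candidate on train and score it on holdout, simplest first."""
    return [(name, metric(fit(train), holdout)) for name, fit in candidates]


train = [(0, 1), (1, 3), (2, 5), (3, 7)]
holdout = [(4, 9), (5, 11)]
candidates = [("constant baseline", fit_constant), ("least-squares line", fit_line)]
scoreboard = run_iterations(candidates, train, holdout, mean_abs_error)
```

Keeping the candidates in one ordered list makes each weekly iteration a one-line change: append the next, slightly more complex approach and compare its score against the rows already on the scoreboard.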

We aim for the ability to test (or “round trip”) one or more approaches per week. This allows us to see within a few weeks whether we can achieve the business objective and whether we are hitting a point of diminishing returns. Over time, we have seen concrete benefits when applying agile-style data science, including:

  1. Resulting models are sufficiently complex to meet the business objectives but not more complex than is required.
  2. The cost to implement and maintain a necessary model is kept manageable, including for new employees.

This iterative process has side benefits as well. Seeing which increases in model complexity produce corresponding increases in performance helps data scientists understand the nature of the underlying data in their sector. They accumulate insights about which model classes are better suited to particular problems, which can speed up the search for the most appropriate algorithm. This is a greedy approach at heart, so it can occasionally miss a better model that lies beyond an unpromising intermediate step. However, in our experience, misses from greed are the exception rather than the common case when working with the noisy data sets encountered in everyday practice.
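One way to picture the greedy stopping rule, as a hedged sketch rather than a prescribed procedure: walk through models from simplest to most complex and stop as soon as the next one fails to improve the metric by a minimum amount. The function name `select_greedily`, the `min_gain` threshold and the scores below are all made up for illustration.

```python
def select_greedily(scoreboard, min_gain):
    """Pick a model from (name, error) pairs ordered simplest first.

    Stops at the first candidate whose improvement over the current best
    is smaller than min_gain (lower error is better).
    """
    best_name, best_error = scoreboard[0]
    for name, error in scoreboard[1:]:
        if best_error - error < min_gain:
            break  # diminishing returns: stop adding complexity
        best_name, best_error = name, error
    return best_name, best_error


# Invented scores, ordered by increasing model complexity.
scoreboard = [
    ("constant", 6.0),
    ("linear", 2.0),
    ("quadratic", 1.9),
    ("boosted trees", 0.5),
]
chosen = select_greedily(scoreboard, min_gain=0.5)
```

With these invented numbers the rule settles on the linear model and never evaluates the last candidate, which is exactly the kind of miss from greed described above. Whether that trade is acceptable depends on how costly the more complex model would be to build and maintain.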

Takeaway: The benefits of process-driven engineering carry over to rapid-cycle innovation and modeling.

For any data-driven organization, it’s critical to innovate on a variety of quantitative problems, ranging across topics that can include user behavior, trust modeling and other new areas. A pragmatic and creative approach to your initiatives will help you to generate value iteratively and at a rapid rate.

This article was originally published on Medium.