My End-to-End Process for Data Science Projects

Why am I writing this story? Simply because not every data science project is successful. You can Google this topic and find thousands of reasons why. For me, the projects that did not succeed have been the biggest opportunities to improve my process and my skills. In this story, I will briefly summarize my learnings by describing my current (and still improving) process for a data science project. I have no doubt that this process will not apply to everyone, simply because there is no single definition of a “data science project”. My aim here is to share my experience by summarizing the different steps of a project. Following it does not guarantee success, but, at the very least, it addresses the pitfalls that you, as a data scientist, can prevent. My role is to create pipelines that consume data and provide insights (see my definition of a pipeline in my previous story). This is the process I use to do so.

Scope the solution

The first step is to scope the solution. It is sometimes very hard to find the balance between expectations and cost. My recommendation is to drive every decision by looking at the real need. This allows you to put aside the “nice to haves” and prioritize the mandatory features.

Start with “why?” (ref: Simon Sinek)

To draft the scope of the project, the first step is to work with the client and understand the pain point and the motivations behind it. In this story, I will use the term client for the person paying for the project. In reality, the client might be a director, VP, or CTO of your own company. The client is the one who decided to invest in this project, so it is mandatory to understand why this person is interested in it. Maximizing production? Scaling a process? Avoiding failures in a factory? Anything… What is the pain point? This is very important for the next step. A good trick is to ask how the client is currently treating this pain point, and whether there is already a metric capturing the efficiency of the current process.

Who will use it?

Many times, I have had to draft a scope without even knowing who will use the solution. This is wrong! A project is successful only if the solution is used. I have been on projects where the user was not even clearly identified. This is the best way to create something that will never be used. When the CEO of a company says: “AI is trendy and cool, we should do something with it”, RUN!

Let’s take an example. A director/VP wants to use AI to optimize stock management in his factory. Great! So I start talking about the solution, but the future user is not even in the room. We start working on the project and we decide that the algorithm will send an email every morning with a forecast of demand to help the stock manager make the best decision. I work hard and I succeed at this task. We even develop the infrastructure to support the algorithm. Everything is done and the director is happy! Well, guess what? This forecast is not even the manager’s biggest concern. He has a hard time figuring out, in real time, where and what items are available in the factory. Too bad: a simple interactive dashboard would have done the job. The forecast is useless without this information. The project is a failure.

In other situations, the manager needs a forecast every week rather than every day, or the forecast needs to be available in real time rather than once a day… All these situations happen. My best recommendation is to bring the future user into the conversation as early as you can, because he is the best person to define the needs.

What?

At this point, what I want is a clear success criterion.

Client: “I need the algorithm to predict when my assembly line will fail”.

Me: “Ok”, looking at the operator, “How far in advance do you need this prediction?” (Here, I am trying to define the granularity of the problem.)

Operator: “I need the prediction 2 days in advance, and I need this information in the morning, before 7 am”.

Me: “Great, but you have to understand that chances are the algorithm will produce false alarms and also miss some failures. Are you comfortable with this? What is the current situation? How much does a failure cost, and how much does it cost to ask the maintenance team to work on a machine that does not need it? How many machines can you check within a day?”

I need all those answers to create a success criterion.

Me: “How do you solve this problem right now?”.

Operator: “We have 1,000 machines and at least 1 of them fails every day. We cannot check them all, so we prioritize our investigation by checking the ones that have not been checked for the longest time”.

This last piece of information is very important because it tells me where the bar is. If I simulate the current strategy, I can evaluate how much better my solution is.

All this information helps me define a metric with a success criterion. In this type of situation, I would rather use a metric like a Hit Rate: the number of true positives within the top N most probable predicted failures, where N is the number of interventions the maintenance team can take care of within a day. I rarely communicate metrics like the F-measure or AUC to the client because he cannot use those measurements to decide whether the solution is a success or not. At the end of the conversation, we found out that a Hit Rate of 2 would be valuable enough for the solution to be used. Also, thanks to the information I collected, I can price how much money the client is currently wasting and how much money the client could save with a successful solution. Then, I can convert my metric into dollars when I present progress to the client.
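To make this concrete, here is a minimal sketch of how such a Hit Rate could be computed. The column layout, the value of N, and the simulated data are hypothetical; it only illustrates the idea of counting true failures among the top N most probable predictions.

```python
import numpy as np
import pandas as pd

def hit_rate_at_n(failure_probs: pd.Series, actual_failures: pd.Series, n: int) -> int:
    """Count how many of the n most probable predicted failures actually failed.

    failure_probs:    predicted probability of failure, indexed by machine id.
    actual_failures:  1 if the machine really failed that day, 0 otherwise.
    n:                number of interventions the maintenance team can handle per day.
    """
    top_n = failure_probs.sort_values(ascending=False).head(n).index
    return int(actual_failures.loc[top_n].sum())

# Hypothetical example: 1000 machines, the team can inspect 5 of them per day.
rng = np.random.default_rng(0)
probs = pd.Series(rng.random(1000))
truth = pd.Series((rng.random(1000) < 0.002).astype(int))
print(hit_rate_at_n(probs, truth, n=5))
```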

An important consideration is that my solution will be integrated into an existing system. This system exists for a reason and people are using it. The AI solution will alter this system, and this disruption has to be anticipated, otherwise the final phase of the project (when people start using the solution) may cause unexpected additional costs. The disruption might even be so important that the client finally changes his mind and refuses to take the risk of releasing the solution (it happened to me). My recommendation is to introduce this potential disruption into the conversation during the initial scoping of the solution.

At the end of the discussion, I ask for a snippet of data, some information about the way we can collect the data, etc… I do a quick data analysis (1–2 days) to check that there are no big red flags. Data engineers also take part in the conversation to draft an architecture.

Now that I have the scope, I can move to the next step: working on the solution.

The Proof Of Concept (POC) phase

The objective of this phase is to demonstrate that the project is viable and can bring value. To optimize the process in this phase, you have to consider which constraints to impose.

Illustration: The constraints triangle

Basically, in a POC phase, I cannot constrain the time, the resources, and the accuracy all at the same time. Time means the amount of time I can spend on this project. Resources means the number of people allocated to this project. Accuracy means the accuracy of the model. Theoretically, you can only constrain two of them. In reality, resources are more or less always constrained. Maybe some companies can allocate an “infinite” number of resources to a project; I have never been in such a situation. If you relax accuracy, it corresponds to a “hackathon mode” or a “spike”. It is useful when you want to investigate an idea, most probably for a short time. The most realistic choice is to relax time. Some clients have a hard time working in this mode because they have deadlines to respect and they might have the feeling that the POC will never end. The trick is to communicate progress to the client in an efficient way. For that reason, I use the Agile methodology. This paradigm was invented for software development, but I adapt it to the context of a data science project. If you are not familiar with the wording, I invite you to read some articles about Agile.

Sprint 1:

I generally work in sprints of 2 or 3 weeks, depending on the client. The objective of the first sprint is to establish a baseline. A baseline is declared when the first end-to-end pipeline (from data to metric) delivers a number. In this initial sprint, I choose the simplest path. This is the only thing that matters: get a number as fast as possible. For instance, if some variables have missing values, the simplest option is to remove those variables from the dataset. If I have to choose a model, the simplest is a linear regression or a logistic regression because there is almost nothing to tune. No crazy preprocessing, only simple steps. Avoid any kind of step that would imply fine-tuning or calibration: no PCA, no standardization, nothing. Basically, in this sprint, I set up the CI/CD and the git repository, I write a parser, and I create the simplest version of the pipeline to end up with a number for the metric. The only thing that I need to pay attention to is the evaluation strategy. It is hard to compare the performance of 2 pipelines if you switch from a 66% train / 33% test procedure to a cross-validation procedure. The evaluation strategy needs to be as close as possible to the way the pipeline will be used. In the above example, I would use a growing window for the training set and a sliding window for the test set.
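As an illustration, here is a minimal sketch of such a baseline evaluation, with a growing training window and a sliding test window. The data frame layout, the column names, the window sizes, and the use of scikit-learn’s logistic regression are all assumptions made for the example.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

def evaluate_growing_window(df: pd.DataFrame, feature_cols, target_col="failed",
                            initial_train_days=60, test_days=7):
    """Train on an expanding window of past days, test on the next block of days."""
    df = df.sort_values("date")
    days = list(df["date"].drop_duplicates())
    scores = []
    cutoff = initial_train_days
    while cutoff + test_days <= len(days):
        train = df[df["date"].isin(days[:cutoff])]
        test = df[df["date"].isin(days[cutoff:cutoff + test_days])]
        # Simplest possible model for the baseline: logistic regression, no tuning.
        model = LogisticRegression(max_iter=1000)
        model.fit(train[feature_cols], train[target_col])
        scores.append(model.score(test[feature_cols], test[target_col]))
        cutoff += test_days  # the test window slides forward, the training window keeps growing
    return scores

# Simplest handling of missing values for the baseline: drop the affected columns.
# df = raw_df.dropna(axis=1)
```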

Sprint 2:

The second sprint is used to productize the baseline. With the help of data engineers and DevOps, we set up the architecture to support the pipeline. Some people prefer to wait until the pipeline reaches the level of accuracy defined in the scope. For my part, I try to unblock the other teams as early as I can to produce an MVP as soon as possible.
For my sprints, I use standard cycles: a stand-up meeting every day to check that there are no blockers, planning at the beginning of the sprint, and a demo at the end. We also do retrospectives from time to time. At the end of each sprint, we check whether the new version of the pipeline provides better results than the baseline. If yes, it becomes the new baseline for the next sprint. I track this information over the sprints and include it in the demo.

 

Illustration: Tracking of the metric over sprints

I show the plot above to the client during the demo. He can then decide 1) to continue for another sprint, 2) to declare that we have reached the objectives, or 3) to drop the project because the current solution is too far from the expectations and it would take a prohibitive amount of time to finish. This early stopping is beneficial for everyone: the client does not waste money on a project that might be too expensive, and the team can be reallocated to another project with a higher probability of success. The demo is also the opportunity to validate new assumptions.
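That tracking plot is simple to produce. Here is a minimal sketch with made-up numbers, assuming the Hit Rate metric from earlier and matplotlib; the actual values would come from the CI/CD runs of each sprint.

```python
import matplotlib.pyplot as plt

# Hypothetical metric values recorded at the end of each sprint (made-up numbers).
sprints = [1, 2, 3, 4, 5]
hit_rate = [0.4, 0.9, 1.1, 1.6, 1.9]
target = 2.0  # the success criterion agreed with the client

plt.plot(sprints, hit_rate, marker="o", label="Best pipeline so far")
plt.axhline(target, linestyle="--", color="red", label="Success criterion")
plt.xlabel("Sprint")
plt.ylabel("Hit Rate")
plt.title("Tracking of the metric over sprints")
plt.legend()
plt.show()
```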

During each iteration, I increase the complexity of the pipeline. For instance, I gather external data to augment the dataset, I improve the preprocessing, add more features, use a different algorithm, etc… The danger here is to over-engineer the pipeline. For instance, I don’t try to draft a new deep learning architecture at the second iteration, and I don’t add crazy complex features to nail one or two outliers that would not affect the metric drastically. To identify the most impactful alteration of the pipeline, I often look at the instances that have the largest error, try to find an explanation, and derive a new cleaning strategy and/or new features from it. I order those alterations by the amount of work and start with the simplest.
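A minimal sketch of that kind of error analysis, assuming the predictions and ground truth live in a pandas data frame with hypothetical column names:

```python
import pandas as pd

def largest_errors(df: pd.DataFrame, target_col: str, pred_col: str, k: int = 20) -> pd.DataFrame:
    """Return the k instances the pipeline got most wrong, to inspect by hand."""
    out = df.copy()
    out["abs_error"] = (out[target_col] - out[pred_col]).abs()
    return out.sort_values("abs_error", ascending=False).head(k)

# Hypothetical usage: inspecting these rows often suggests a new cleaning step or feature.
# worst = largest_errors(predictions_df, target_col="y_true", pred_col="y_pred")
```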

A note about CI/CD and notebooks:

A measurement of the metric has to be reproducible, so I disregard any data point that has not been produced by the CI/CD pipeline. If a result has been obtained in a Jupyter notebook, I have to convince myself that it can be reproduced on another machine. Notebooks are great for data analysis, but they require strong discipline to make sure the results are reproducible, and I know I am not capable of such a thing. For that reason, I only trust numbers coming from the CI/CD. I have been in a situation where I claimed to the client that we had met the expectations and then was not able to reproduce the results. It is very frustrating and costly for everyone… I hope you never have to live through this situation.
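As an illustration of what “a number produced by the CI/CD” can mean in practice, here is a minimal sketch of a scripted entry point a CI job could run; the script name, the fixed seed, the hypothetical my_pipeline package, and the output file are all assumptions for the example.

```python
"""run_evaluation.py: the only script allowed to produce an 'official' metric."""
import json
import random

import numpy as np

SEED = 42  # fix every source of randomness so another machine can reproduce the run

def main() -> None:
    random.seed(SEED)
    np.random.seed(SEED)
    # Hypothetical call into the project's own package; replace with the real pipeline.
    # from my_pipeline import run_pipeline
    # hit_rate = run_pipeline(seed=SEED)
    hit_rate = 0.0  # placeholder for the value computed by the pipeline
    with open("metrics.json", "w") as f:
        json.dump({"hit_rate": hit_rate, "seed": SEED}, f)

if __name__ == "__main__":
    main()
```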

Productization

I strongly recommend you have a look at the article “Hidden Technical Debt in Machine Learning Systems”. The machine learning code is only a small piece of the entire solution, and it is important to have a clear picture of the additional steps needed for productization. Productization requires the support of data engineers and DevOps.

Hopefully, the POC is done and we have met the success criterion. Good!! Now we need to productize the pipeline. There are two main ways to proceed:

  1. Hands-off. The results of the POC are translated into specifications. A team of software developers takes over, rewrites the entire code, and productizes the pipeline.
  2. Push to prod. My code is reused “as is” and with the help of data engineers and DevOps, we build an infrastructure around it.

Both approaches have pros and cons. The hands-off approach will make sure that the solution is more stable. Because a data scientist is not always an experienced developer (no offense here), productization of the code is often in the hands of other people. This is often the case in a large company. One detail: in this situation, the ownership of the code no longer belongs to the data scientist, so fixing bugs is done by the software development team.

The second approach is to ask the data scientist to wrap his code into a library so that it can be used in another system, for instance in a lambda function in the cloud. This approach is faster since we avoid rewriting code; the code is “cleaned” and the technical debts are paid. If you follow the recommendations from my previous story, this step can be done efficiently. Finally, the owner of the code remains the data scientist, and I am in charge of fixing bugs. This way is often chosen in small companies and startups where a limited number of people can work on each project.
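As an illustration, a minimal sketch of what such a wrapper could look like for an AWS Lambda function. The predict function stands in for the data scientist’s own library call; its name and the event format are hypothetical.

```python
import json

def predict(payload: dict) -> dict:
    """Stand-in for the POC code packaged as a library (hypothetical)."""
    return {"failure_prob": 0.5}

def handler(event, context):
    """AWS Lambda entry point: the POC code is reused 'as is' behind a thin wrapper."""
    payload = json.loads(event["body"])
    return {"statusCode": 200, "body": json.dumps(predict(payload))}
```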

I prefer to use the second option, and I will explain my reasons in more detail in the next section. Briefly, with the first approach, if I want to release improvements to the solution or add some features, I have to work on my code base, rewrite the specifications, wait for the developer team to update the solution, etc… This can take months. A fast cycle implies minimizing hand-offs.

Maintaining and improving the pipeline

Ok, now my pipeline is in production. However, it is important to create a process to capture the feedback of the operator and continuously improve the solution. For instance, the operator uses the product daily but proposes an improvement to the algorithm: instead of predicting whether a failure will occur, he prefers to have a score or a probability of failure. Deriving a score from the prediction only requires a minor change in the pipeline: returning the probability instead of the predicted class. What I try to do is to set up this feedback process as soon as the first version of the pipeline is accessible (sprint 2–3). The feedback loop helps me adjust over the sprints, and once the product is released, this continuous improvement remains. To have a good interaction with the user, I want improvements to be released as fast as possible. For that reason, I avoid the hands-off strategy, so that I can make the improvements myself and get feedback faster.

The feedback loop is also important to capture the decisions made by the operator. When a decision does not match the prediction of the pipeline, it means we can improve the pipeline by incorporating the “reasons” behind the operator’s decision. Over time, the role of the operator will change and get simpler. One consequence is the ability to scale his capabilities. Ultimately, a predictive pipeline can become a prescriptive pipeline: the prediction is no longer interpreted by the operator but directly connected to an action. This introduces my last point:

Company reorganization

It is important to understand that introducing an AI solution in a company will disrupt the usual way the employees work. Sometimes the employees’ roles change and additional skills might be needed. For example, if you introduce a chatbot to help the customer support team, this team might be reduced or reallocated to other tasks. The solution might also cause some business reorientation in the client’s company. I have seen companies change their strategy from selling products to selling services because of the new solution. This is often not anticipated by the client, and sometimes it can cause the project to stop for a while. It happened to me that a company would have been so affected by the algorithm that the reorganization and the integration of the solution into their system would have required much more investment than planned. They finally decided not to use it. I try to introduce this subject very early in the process to make sure the client anticipates the situation.

Please feel free to connect. My LinkedIn can be found here.

This blog was originally published on Medium.