Putting a high-quality Machine Learning (ML) model into production can take weeks, months, or even quarters. Over the past year and a half, Galileo has spoken to hundreds of ML teams across many verticals, and unsurprisingly, this is the status quo across all these organizations.
In this article, I'll share what we've learned and how ML teams are now working to remove these bottlenecks.
You may already know that the main steps of the standard ML lifecycle are labeling the data, training the model, deploying it, and monitoring it in production. And, of course, mature solutions are available to support the infrastructure needed for all these steps.
We can also see that model architectures for most mainstream ML tasks have been commoditized with the advent of solutions such as Hugging Face, JAX, etc. Rarely are data scientists in ML teams across organizations reinventing the wheel. So why exactly do ML teams find it hard to scale their impact?
When we asked data scientists across these companies, we found that roughly 80% of their time goes into fixing and improving datasets just to get better performance out of the model. This has been the most significant impediment to ML adoption across the enterprise.
Teams perform ad-hoc experimentation that is mostly messy, manual, and, a lot of the time, ineffective. They write custom scripts to pull metrics out of datasets and even inspect raw data dumps by hand, in formats such as CSVs and Excel sheets.
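To make this concrete, a typical one-off check looks something like the sketch below. It assumes a hypothetical labels.csv with "text" and "label" columns; the point is not the specific checks, but how manual and script-by-script this work tends to be.

```python
# The kind of one-off script teams write today to sanity-check a dataset.
# Assumes a hypothetical labels.csv with "text" and "label" columns.
import pandas as pd

df = pd.read_csv("labels.csv")

# Class balance: a heavily skewed distribution is an early warning sign.
print(df["label"].value_counts(normalize=True))

# Exact duplicates and empty rows, which quietly inflate metrics.
print("duplicates:", df.duplicated(subset=["text"]).sum())
print("empty texts:", (df["text"].str.strip() == "").sum())

# Suspiciously short samples, which are often garbage or mislabeled.
print(df[df["text"].str.len() < 5].head())
```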
This doesn't mean you shouldn't run experiments. But when experimentation is this slow and manual, it leads to inefficient workflows and poor performance further down the line. Some of the problems these teams face later in their workflow include harmful model predictions (mispredictions) in production, bias and non-compliance issues going unnoticed, very slow model deployment cycles, and higher data acquisition costs.
So what do we do to fix this? One approach would be encouraging data scientists to run these analyses throughout the model's lifecycle. What are some of the insights that data scientists should be looking into? Let’s take a deeper look.
When you are curating your dataset, you want to make sure it is representative of the problem you are trying to solve. Labeling mistakes are one of the biggest obstacles to getting there.
In practice, that means your dataset contains samples that have been assigned the wrong label. This hurts the quality of your data and, in turn, the performance of the model trained on it. Sometimes your team introduced the errors during manual annotation. Other times, the active learning strategies used to select and label data propagate wrong labels and contribute to low data quality and all kinds of data imbalances.
Fixing the errors at this step is critical to avoiding harmful mispredictions and biases in production.
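One lightweight way to start hunting for these errors is to compare a trained model's confident predictions against the assigned labels, in the spirit of confident-learning approaches. The sketch below is a minimal version of that idea; `probs` and `labels` are placeholders for your own model outputs and annotations.

```python
# Minimal sketch: flag likely label errors by checking where a trained model
# confidently disagrees with the assigned label. `probs` and `labels` are
# placeholders for your own predicted probabilities and annotations.
import numpy as np

def likely_mislabeled(probs: np.ndarray, labels: np.ndarray, threshold: float = 0.9):
    predicted = probs.argmax(axis=1)
    confidence = probs.max(axis=1)
    # Suspicious: the model strongly prefers a different class than the annotator chose.
    suspicious = (predicted != labels) & (confidence > threshold)
    return np.where(suspicious)[0]

# Toy example: sample 2 is labeled class 0 but predicted class 1 with 98% confidence.
probs = np.array([[0.95, 0.05], [0.30, 0.70], [0.02, 0.98]])
labels = np.array([0, 1, 0])
print(likely_mislabeled(probs, labels))  # -> [2]
```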
During model training, you need to understand which data samples are hard for the model to learn and which attributes in the data are confusing your model's predictions. These two insights help you improve model accuracy on the weak sections of your data, and they feed back into the prior labeling step: you can tell annotators which parts of the data need closer attention so they label correctly from the start.
The downside of skipping these experiments is that you leave model accuracy on the table. You end up with a blind spot about how much the model could still improve, and you never develop real intuition for your data. It is essential to understand how well the model represents your dataset; without this analysis, you are treating the model as a black box.
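A common way to surface "hard to learn" samples is to record per-sample training dynamics. Here is a minimal sketch, assuming a PyTorch setup where the data loader also yields a stable sample id; the model and loader are placeholders. It averages each sample's loss across epochs and ranks the hardest ones.

```python
# Sketch: rank training samples by how hard they are to learn, by recording
# per-sample loss every epoch and averaging. `model` and `loader` are
# placeholders; the loader is assumed to yield (sample_id, inputs, labels).
from collections import defaultdict
import torch

criterion = torch.nn.CrossEntropyLoss(reduction="none")  # keep per-sample losses
sample_losses = defaultdict(list)

def record_epoch(model, loader, device="cpu"):
    model.eval()
    with torch.no_grad():
        for sample_ids, inputs, targets in loader:
            losses = criterion(model(inputs.to(device)), targets.to(device))
            for sid, loss in zip(sample_ids.tolist(), losses.tolist()):
                sample_losses[sid].append(loss)

def hardest_samples(k=50):
    mean_loss = {sid: sum(v) / len(v) for sid, v in sample_losses.items()}
    return sorted(mean_loss, key=mean_loss.get, reverse=True)[:k]
```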
Once you have the model in production, you should answer two questions: how is the model performing on the data it sees in production, and what new data should you add to keep it fresh?
Data scientists address both questions on an ongoing basis, running experiments such as computing drift, or inspecting samples that have weak predictions or that fall on the prediction boundary, where that applies to the model.
These analyses give you an estimate of your model's performance in production. More importantly, they inform your decisions about the additional data you need to retrain your model so it stays fresh and continuously valuable in production.
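As one concrete example of what such a recurring check might look like, the sketch below compares the distribution of prediction confidences in a production window against a reference window from validation, and flags low-confidence samples worth inspecting. The arrays and thresholds are placeholders, not a prescription.

```python
# Sketch of a lightweight, recurring production check: compare prediction
# confidences in a production window against a reference window, and surface
# low-confidence samples for review. The arrays are placeholders for the
# max softmax probability of each prediction.
import numpy as np
from scipy.stats import ks_2samp

def confidence_drift(reference_conf, production_conf, alpha=0.01):
    stat, p_value = ks_2samp(reference_conf, production_conf)
    return {"ks_statistic": stat, "p_value": p_value, "drifted": p_value < alpha}

def low_confidence_indices(production_conf, threshold=0.5):
    # Samples worth sending for labeling or closer inspection.
    return np.where(production_conf < threshold)[0]

# Toy example: production confidences have degraded relative to the reference.
rng = np.random.default_rng(0)
reference = rng.beta(8, 2, size=1000)
production = rng.beta(4, 2, size=1000)
print(confidence_drift(reference, production))
```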
For any ML team to scale, automation is vital. You need to incorporate all the signals and insights you get from the state of your data into an efficient automation workflow. This is like an active learning pipeline, where data scientists don't have to worry about maintaining models that have already been put into production.
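At its core, the selection step of such a pipeline can be as simple as least-confidence sampling. The sketch below is one minimal, illustrative version; real pipelines layer labeling queues, retraining triggers, and deployment gates on top of it.

```python
# Minimal, illustrative selection step for an active learning loop:
# least-confidence sampling over model probabilities on unlabeled data.
import numpy as np

def select_for_labeling(probs: np.ndarray, budget: int = 100) -> np.ndarray:
    """probs: (n_samples, n_classes) predicted probabilities on new data."""
    uncertainty = 1.0 - probs.max(axis=1)          # least-confident sampling
    return np.argsort(uncertainty)[::-1][:budget]  # most uncertain first

# Toy example: the second sample is the least confident, so it gets labeled first.
probs = np.array([[0.90, 0.10], [0.55, 0.45], [0.20, 0.80]])
print(select_for_labeling(probs, budget=1))  # -> [1]
```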
About 20 years ago, programmers took months to ship their code because there was a considerable lack of tooling. The entire software development lifecycle was built around a development team, a testing team, and an external user testing or integration workflow. Building and deploying software to production took months. Nowadays, developers have IDEs, auto-linters, and continuous testing environments. With that, software teams can ship code within a day.
Data and ML teams face a similar lack of tooling today, which is what keeps them from shipping high-quality models faster. The term "ML data intelligence" captures the lessons we've learned about making ML teams more efficient at building high-quality models quickly.
ML data intelligence should be baked directly into the model lifecycle.
At Galileo, we have been applying these principles through our platform for all our customers over the last year. Our customers range from small ML teams of one or two people who have just started training a model with some data, to large Fortune 500 companies with multiple ML teams and mature models in production. The impact metrics we are seeing so far have been very encouraging.
Think of Galileo as the data scientist's assistant: the data scientist gets critical insights on how to iterate faster and improve the model's accuracy across all their experiments. Galileo works by hooking into your model training framework, whether it's TensorFlow, PyTorch, or any AutoML framework. You add one line of logging code, which captures how the model trains on your data over time.
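Conceptually, the integration looks something like the sketch below: a plain PyTorch training loop with a single extra logging call. Note that `log_training_batch` is a hypothetical stand-in for illustration only, not Galileo's actual API.

```python
# Conceptual sketch of the "one line of logging" idea inside a plain PyTorch
# loop. `log_training_batch` is a hypothetical stand-in for illustration only,
# not Galileo's actual API.
import torch
from torch import nn

model = nn.Linear(8, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()
inputs, targets = torch.randn(32, 8), torch.randint(0, 2, (32,))
sample_ids = torch.arange(32)

def log_training_batch(ids, logits, labels, loss, epoch):
    # A real tool would persist per-sample ids, outputs, and losses here.
    pass

for epoch in range(3):
    optimizer.zero_grad()
    logits = model(inputs)
    loss = loss_fn(logits, targets)
    loss.backward()
    optimizer.step()
    # The single added line of logging:
    log_training_batch(sample_ids, logits.detach(), targets, loss.item(), epoch)
```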
At the end of the training run, Galileo ranks your dataset and shows which samples are the hardest for your model to train on. It can show the samples that are most likely mislabeled and have annotation mistakes. You may also have garbage samples that are unrepresentative of what your model is trying to learn; Galileo can help you find those too.
Once your model is in production, Galileo can show you things like drift information on unstructured data. The problem we are trying to solve with drift in unstructured data is capturing how the semantic meaning of the data relevant to your model changes in production relative to the training data. You can't simply look at distribution skews, because there are no fixed features to compare.
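One simple way to reason about drift in embedding space is sketched below: embed both the training reference and the production window with the same encoder, then compare the two distributions, here via the cosine distance between their centroids. The embeddings and the metric are illustrative choices for this article, not Galileo's actual method.

```python
# Illustrative drift signal for unstructured data: embed the training
# reference and the production window with the same encoder, then compare
# the two embedding distributions. Here we use the cosine distance between
# centroids; the embeddings below are synthetic placeholders.
import numpy as np

def embedding_drift_score(train_emb: np.ndarray, prod_emb: np.ndarray) -> float:
    """Both inputs are (n_samples, dim) arrays from the same encoder."""
    a, b = train_emb.mean(axis=0), prod_emb.mean(axis=0)
    cosine_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return 1.0 - cosine_sim  # 0 means identical centroids; larger means more drift

# Toy example: the production embeddings are shifted relative to training.
train_emb = np.random.default_rng(0).normal(0.0, 1.0, size=(500, 64))
prod_emb = np.random.default_rng(1).normal(0.5, 1.0, size=(500, 64))
print(round(embedding_drift_score(train_emb, prod_emb), 3))
```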
If you want to try out Galileo, sign up for free here.