As a former product leader at Google AI, my team and I were responsible for building models that would ‘just work’. They needed to ‘just work’ because we were selling to highly regulated industries like financial services and healthcare, where the cost of poor or biased predictions is steep.
Over and over again, we would believe a model ‘worked’ because of high scores on vanity metrics such as F1 or confidence, only to discover issues with our data within days. It didn’t matter which shiny tools we used for training, deploying, or monitoring models: if the data was erroneous, the model suffered. And data can be ‘erroneous’ in dozens of ways, which made this a hard problem.
It turned out this problem was not unique to Google. Over the past year, after speaking with hundreds of ML leaders, we realized that analyzing and fixing the data across the ML workflow – continuous ML data intelligence – is their top problem.
What tools did we at Google, and these hundreds of ML teams, use for ML data intelligence?
Sheets and scripts are still state of the art! This has many problems.
ML data intelligence is a team’s ability to holistically understand and improve the health of the data powering ML across the organization. It proactively removes data biases and production mishaps, saving data scientists hundreds of hours, lowering costs dramatically, and improving model predictions quickly – sometimes by 10–12% or more.
ML data intelligence tools embed into the model training and production environments, use data-centric AI techniques to quickly identify data errors, and make data fixing systematic, with actionability and collaboration as key cornerstones.
ML data intelligence is one of the first tools a company needs when embarking on the ML journey – even before labeling or figuring out which model to use. Understanding data health first, then fixing and improving it, sets a good foundation for smarter data sampling for annotation (and thereby saves on labeling costs).
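To make the “smarter data sampling for annotation” idea concrete, here is a minimal sketch of one common approach: uncertainty sampling, where the labeling budget goes to the examples the model is least confident about. The function name and the confidence scores are illustrative assumptions, not part of any specific product described here.

```python
# Hypothetical sketch: uncertainty sampling to prioritize which rows to label.
# `scores` are assumed model confidence values for unlabeled examples.

def select_for_annotation(scores, budget):
    """Return indices of the `budget` least-confident examples."""
    # Sort indices by ascending confidence; the model is least sure about these,
    # so labeling them yields the most information per annotation dollar.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i])
    return ranked[:budget]

scores = [0.98, 0.51, 0.87, 0.62, 0.99, 0.55]
print(select_for_annotation(scores, 3))  # → [1, 5, 3]
```

With a labeling budget of 3, the examples with confidences 0.51, 0.55, and 0.62 are sent to annotators first, rather than the ones the model already handles well.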
The five pillars of ML data intelligence are:
The quality of the data depends on identifying noise and errors fast – whether in a data dump you get from a customer, or in the data the model is getting hit with in production. ‘Data quality’ is abstract but critical: it needs constant supervision, analysis, and adaptation of the data to keep it healthy.
Data quality is a byproduct of ML data intelligence, which provides a framework to inspect, analyze and fix the data to ensure high data quality across the ML workflow.
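As a minimal sketch of what “inspect, analyze and fix” can mean in practice, the function below counts a few common error classes – missing values, out-of-range values, and duplicates – in a batch of records. The field names (`age`, `income`) and the valid range are illustrative assumptions.

```python
# Hypothetical sketch: basic data-quality inspection over a batch of records.
# Field names and ranges are illustrative, not from any specific schema.

def inspect(records):
    """Count common error classes so issues surface before training."""
    report = {"missing": 0, "out_of_range": 0, "duplicates": 0}
    seen = set()
    for r in records:
        if r.get("age") is None or r.get("income") is None:
            report["missing"] += 1
        elif not (0 <= r["age"] <= 120):  # assumed plausible-age range
            report["out_of_range"] += 1
        key = (r.get("age"), r.get("income"))
        if key in seen:
            report["duplicates"] += 1
        seen.add(key)
    return report

rows = [
    {"age": 34, "income": 72000},
    {"age": 34, "income": 72000},    # exact duplicate
    {"age": None, "income": 50000},  # missing value
    {"age": 212, "income": 81000},   # impossible age
]
print(inspect(rows))  # → {'missing': 1, 'out_of_range': 1, 'duplicates': 1}
```

Even a report this simple, run on every incoming data dump, catches the kind of silent errors that vanity metrics like F1 never surface.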
When we think of ‘ML monitoring’, there is a bias to conjure tools such as Datadog, where incredible dashboards constantly monitor production and alert ML teams to model downtime. This framing has two problems: it centers on real-time model uptime rather than data health, and it serves the ML engineer rather than the data scientist.
ML data intelligence tools should provide data health monitoring across the ML workflow – this does not have to be real-time, but it does need to let data science practitioners create automated data health tests and tweak them over time.
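The “automated data health tests you can tweak over time” idea can be sketched as a small registry of named checks with tunable thresholds. The test names, metrics, and thresholds below are illustrative assumptions – the point is that they live in one place where the team can adjust them as the data evolves.

```python
# Hypothetical sketch: automated data health tests with tunable thresholds.
# Test names and thresholds are illustrative; teams would tweak them over time.

HEALTH_TESTS = {
    # name: (metric computed over a batch of values, threshold)
    "null_rate": (lambda vals: sum(v is None for v in vals) / len(vals), 0.05),
    "min_batch_size": (lambda vals: len(vals), 100),
}

def run_health_tests(values):
    """Run each registered test; return {test_name: passed}."""
    results = {}
    for name, (metric, threshold) in HEALTH_TESTS.items():
        observed = metric(values)
        # null_rate must stay *at or below* its threshold; batch size must meet it.
        if name == "null_rate":
            results[name] = observed <= threshold
        else:
            results[name] = observed >= threshold
    return results

batch = [1.0] * 120 + [None] * 3  # 123 values, ~2.4% nulls
print(run_health_tests(batch))    # → {'null_rate': True, 'min_batch_size': True}
```

Because the checks are data rather than code, tightening `null_rate` from 5% to 1% next quarter is a one-line change, not a redeploy of a monitoring pipeline.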
Moreover, while ML monitoring tools focus on the ML engineer or program manager, ML data intelligence tools focus squarely on the data scientist, acting as an assistant for continuous data analysis and fixing.
ML data intelligence is a rapidly maturing but still evolving space.
Most job functions over time, as they grow in prominence within an organization, become more data-driven in their decision making. This has always required a new set of tools to step up and enable the shift.
Similarly, ML teams have become a mainstay for organizations, and now deserve the tools to quickly inspect, fix and track the data they are working with.
This ‘data stack’ in the ML developer’s toolkit will be powered by innovations in data-centric AI research (a growing focus in academia), as well as by a growing understanding that fixing the data can lead to huge gains in model performance. But to ‘fix’, you first need to ‘understand’ – ML data intelligence enables both for the data scientist, ushering in the data-driven ML mindset.