Importance of data audits in building AI


Artificial intelligence can do a lot to improve business practices, but AI algorithms can also introduce new risks. Consider, for example, Zillow’s recent shutdown of Zillow Offers, the company’s home-flipping business, after its forecasting models significantly overestimated home values. When housing-price data changed unpredictably, the company’s machine learning models didn’t adapt fast enough to account for the volatility, resulting in significant losses. This type of data mismatch, or “concept drift,” occurs when data audits don’t get the care and attention they deserve.
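To make the idea concrete, here is a minimal sketch of one way to flag concept drift: compare incoming data against the data the model was trained on and raise an alert when the live distribution moves too far. The function name, the price figures, and the threshold are all illustrative assumptions, not Zillow’s actual pipeline.

```python
# Hypothetical drift check: flag when live data drifts far from training data.
from statistics import mean, stdev

def drift_score(train_values, live_values):
    """Crude drift signal: how many training standard deviations
    the mean of the live data has moved from the training mean."""
    mu, sigma = mean(train_values), stdev(train_values)
    return abs(mean(live_values) - mu) / sigma

# Illustrative home prices (made up for this sketch).
train_prices = [300_000, 310_000, 295_000, 305_000, 320_000]
live_prices  = [360_000, 375_000, 380_000, 355_000, 370_000]

if drift_score(train_prices, live_prices) > 2.0:  # threshold is a tunable assumption
    print("Warning: possible concept drift -- retrain or pause predictions")
```

In practice, teams use proper two-sample statistical tests and monitor many features at once, but the principle is the same: the audit runs continuously, not once.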

Zillow’s failure to properly audit its data didn’t just hurt the company; it could do broader damage by driving other companies away from AI. Negative perceptions of a technology can halt progress in the commercial world, especially for a category like AI that has already lived through several winters. Machine learning pioneers such as Andrew Ng have recognized what is at stake and launched campaigns to highlight the importance of data audits by, for example, holding an annual competition that rewards the best data quality practices (rather than picking winners based on model performance, as is traditionally done).

In addition to my own work building AI, as host of The Robot Brains podcast, I’ve also interviewed dozens of AI practitioners and researchers about their approach to monitoring and maintaining high-quality data. Here are some of the best practices I’ve compiled from that work:

Beware of outsourcing your data management and labeling. Data maintenance is not the most glamorous task, and it is time-consuming. When time is short, as it is for most entrepreneurs, it is tempting to outsource the responsibility. But beware of the risks involved. A third-party vendor won’t be as familiar with your product vision, won’t know the contextual nuances, and won’t have the personal incentive to maintain the close oversight needed. Andrej Karpathy, director of AI at Tesla, says he spends 50% of his own time maintaining the vehicles’ datasets because it is so important.

If your data is incomplete, fill in the gaps. Not all is lost if your data sources reveal gaps or potential sources of erroneous predictions. One area that is often problematic is demographics. As we know, historical demographic data sources tend to skew toward white males, and that can skew your whole model. Princeton professor and AI4All co-founder Olga Russakovsky created the REVISE tool, which reveals patterns of (possibly spurious) correlations in visual data. You can use the tool to enforce insensitivity to these patterns or decide to collect more data that doesn’t exhibit them. (Here’s the code to run the tool if you want to use it.) Demographics are the most frequently cited example of this kind of gap (e.g., medical records have traditionally contained a higher percentage of information about white males), but the approach applies in any scenario.
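A simple first step toward the kind of audit described above is to check whether any single demographic value dominates a dataset before training on it. The sketch below is illustrative only; the function name, field names, and threshold are assumptions, and this is not the REVISE API.

```python
# Hypothetical dataset audit: flag fields where one value dominates.
from collections import Counter

def audit_field(records, field, max_share=0.6):
    """Return values of `field` whose share of records exceeds `max_share`."""
    counts = Counter(r[field] for r in records)
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items() if n / total > max_share}

# Illustrative records (made up for this sketch).
records = [
    {"sex": "male"}, {"sex": "male"}, {"sex": "male"},
    {"sex": "male"}, {"sex": "female"},
]
print(audit_field(records, "sex"))  # prints {'male': 0.8}
```

A flagged field doesn’t automatically mean the data is unusable; it tells you where to either collect more data or make the model insensitive to the skewed attribute.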
Understand the trade-off between intelligence and speed. Your data audit may motivate you to plug in larger datasets with more complete coverage. In theory, that seems like a great strategy, but it can actually be a mismatch for the intended business purpose. The larger the dataset, the slower the analysis. Is that extra time justified by the value of the added insight?

Financial services firms have often had to ask themselves this question, given the huge sums at stake and the industry’s ever-faster technology (think nanoseconds). Mike Schuster, head of AI at financial services firm Two Sigma, shared that it’s important to keep in mind that a more accurate model, powered by more data, can often mean longer inference times at deployment, potentially failing to meet your need for speed. Conversely, if you’re making decisions over a longer horizon, you’ll be competing with others in the market who process much larger amounts of data, so you’ll need to do the same to stay competitive.
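The latency side of this trade-off is easy to measure empirically. The toy benchmark below (entirely illustrative, not Two Sigma’s system) times a brute-force nearest-neighbor lookup as the training set grows, showing how per-prediction latency scales with data size.

```python
# Illustrative benchmark: per-prediction latency vs. training-set size.
import random
import time

def nearest(train, x):
    # Brute-force 1-nearest-neighbor: scan every training point.
    return min(train, key=lambda t: abs(t - x))

def latency(n, trials=100):
    """Average seconds per prediction against n training points."""
    random.seed(0)
    train = [random.random() for _ in range(n)]
    start = time.perf_counter()
    for _ in range(trials):
        nearest(train, random.random())
    return (time.perf_counter() - start) / trials

for n in (1_000, 100_000):
    print(f"n={n:>7}: {latency(n) * 1e6:.0f} microseconds per prediction")
```

Running a measurement like this against your actual model and data volumes turns “is the extra insight worth the extra time?” from a hunch into a number you can weigh against the business requirement.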

Applying AI models to solve business problems is becoming more common as the open source community makes them freely available to everyone. The downside is that as AI-generated insights and predictions become the status quo, the less flashy work of data maintenance can be overlooked. It’s like building a house on sand: it may look good at first, but over time the structure will collapse.

Professor Pieter Abbeel is director of the Berkeley Robot Learning Lab and co-director of the Berkeley Artificial Intelligence (BAIR) Lab. He has founded three companies: Covariant (AI for intelligent automation of warehouses and factories), Gradescope (AI to help teachers mark homework and exams), and Berkeley Open Arms (low-cost 7-dof robotic arms). He also hosts the podcast The Robot Brains.

