Bad Data: A $3T a Year Problem with a Solution

A few years ago, IBM reported that companies were losing $3 trillion a year to bad data. Today, Gartner estimates that poor-quality data costs organizations an average of $12.9 million a year. Money is wasted on digitizing resources, as well as on organizing and searching for information – a problem that has only grown as the world has shifted to more digitized and remote environments.

Aside from the impact on revenue, bad data (or a lack of data) leads to poor decision-making and flawed business assessments in the long run. The truth is that data isn’t data until it’s usable, and to get there, it needs to be accessible. In this piece, we discuss how deep learning can make data more structured, accessible and accurate, avoiding huge losses in revenue and productivity in the process.

Facing productivity hurdles: manual data entry?

Every day, companies work with data that is usually stored as scanned documents, PDFs or even images. It is estimated that there are 2.5 trillion PDF documents in the world, yet organizations continue to struggle to automate the extraction of correct, relevant, high-quality data from paper and digital documentation. The usual result is either unavailable data or lost productivity, because slow extraction processes are no match for today’s digitally driven world.

While some may think that manual data entry is a good way to turn sensitive documents into actionable data, it is not without its flaws: it exposes companies to a greater risk of human error and to the cost of a time-consuming task that can (and should) be automated. So the question remains: how can we make data accessible and accurate? And further, how can we easily capture the right data while reducing manually intensive work?

The power of machine learning

Machine learning has revolutionized much of what we do over the past few decades. The goal from the start was to use data and algorithms to imitate the way humans learn, gradually improving accuracy on the tasks at hand. It’s no surprise that advanced technologies have been widely adopted during the digital revolution. In fact, we have reached the point of no return: by 2025, the amount of data generated each day is projected to reach 463 exabytes worldwide. This only underscores the urgency of creating processes that can withstand the future.

Technology now plays an integral role in maintaining data and its quality. For example, data extraction APIs can make data more structured, accessible and accurate, increasing digital competitiveness. An important step in making data accessible is enabling data portability, a concept that protects users from having their data locked in “silos” or “walled gardens” that may be incompatible with one another, which complicates tasks such as making data backups.
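As a rough illustration of data portability, the sketch below exports extracted records to two open formats (JSON and CSV) and reads them back unchanged. It uses only the Python standard library. The record fields and the function names (`to_json`, `to_csv`, `from_csv`) are assumptions made for this example and do not come from any particular extraction API.

```python
import csv
import io
import json

# Hypothetical records, shaped like the output of a data extraction step.
records = [
    {"document_id": "inv-001", "date": "2022-03-14", "total": "129.90"},
    {"document_id": "inv-002", "date": "2022-03-15", "total": "58.00"},
]


def to_json(rows):
    """Serialize rows to a JSON string that other systems can read back."""
    return json.dumps(rows, indent=2, sort_keys=True)


def to_csv(rows):
    """Serialize rows to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def from_csv(text):
    """Read CSV text back into a list of dictionaries."""
    return [dict(row) for row in csv.DictReader(io.StringIO(text))]


# Round trip: the data survives leaving one system and entering another.
assert json.loads(to_json(records)) == records
assert from_csv(to_csv(records)) == records
```

Values are kept as strings here so that the CSV round trip is exact; a real pipeline would need an agreed schema to restore numbers and dates.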

Fortunately, there are steps to consider in leveraging the power of machine learning for organizational-level data portability and availability.

Defining and using the right algorithms: Based on the research and needs of data scientists, data must be managed according to specific technical standards. That means data must be transferred and/or exported in a way that enables organizations to meet user data regulation requirements and that provides insight for the business. Take document processing, for example: PII extracted from a PDF for HR purposes must be stored in a different database than data extracted from a receipt, such as dates or amounts paid. With the right algorithm, these various functions can be automated.

Creating an application that can use these algorithms: Using different file types or data types, organizations can train their algorithm to deliver more accurate results over time. In addition, the number of file and data types should grow to further extend the use case. The process can be repeated: in document processing, for example, organizations will either train a new model for a different type of document or, in some more complex cases such as invoices, train the same models with a closed file template.

Thinking about security at all levels: It’s also important to remember that the data used for decision-making processes is essential and private to the business. Security remains important at every step of using machine learning to collect important data. The same goes for the format in which the information is processed. In fact, the implications of the insights gathered and delivered to stakeholders depend on it. In addition, the quality of the data will also determine how accurately the algorithm identifies and provides the specific insights needed by the business.
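The document-processing example above, in which HR data and receipt data go to different databases, can be sketched as a small routing table. This is a minimal illustration and not a production design: the `Store` class is an in-memory stand-in for a real database, and the document types, field names and `route` function are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Store:
    """In-memory stand-in for a database (assumption for this sketch)."""
    name: str
    rows: list = field(default_factory=list)


hr_store = Store("hr_restricted")
finance_store = Store("finance")

# Hypothetical routing table: document type -> (destination, fields to keep).
ROUTES = {
    "hr_form": (hr_store, {"full_name", "date_of_birth"}),
    "receipt": (finance_store, {"date", "amount_paid"}),
}


def route(document_type, extracted):
    """Store only the allowed fields in the store assigned to this type."""
    if document_type not in ROUTES:
        raise ValueError(f"no route for document type: {document_type!r}")
    store, allowed = ROUTES[document_type]
    row = {key: value for key, value in extracted.items() if key in allowed}
    store.rows.append(row)
    return store.name


# A receipt that happens to carry a name: the name is dropped, not stored.
destination = route(
    "receipt",
    {"date": "2022-03-14", "amount_paid": "12.50", "full_name": "A. Example"},
)
assert destination == "finance"
assert finance_store.rows[-1] == {"date": "2022-03-14", "amount_paid": "12.50"}
```

Keeping the allowed fields next to the destination makes it harder for personal data to end up in a store that was never meant to hold it, and unknown document types fail loudly instead of being stored somewhere by default.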

The truth is, data can’t help you if it’s inaccessible: you can’t automate processes if data isn’t recognizable and usable by a machine. Making it so is a complex process that, when done right, brings many benefits: faster collection of insights for quicker decision-making, greater productivity through faster data retrieval, better accuracy and end-user experience through AI/ML, and a lower overall cost of manual data extraction.

Making technology work for you: a high-quality, data-rich future

Organizations may be rich in data, but the reality is that data is useless if users cannot interact with it at the right time. As we all know, most job-specific processes start with a document. However, the way we interact with these documents has changed, shifting the human focus away from entering data and toward managing data to ensure processes run smoothly.

Real decision-making power lies in being able to quickly retrieve business information and data, while having confidence that the data will be accurate. That is why controlling data has enormous value. It ensures the quality of information used to build your business, make decisions and acquire customers.

Technology has given us the ability to let automation handle the more mundane, yet important, administrative tasks so we can focus on creating real value. Let’s embrace it. After all, data must be usable. As you continue your digital transformation journey, remember that the more (accurate) data you send a machine learning model, the better the results will be.

Jonathan Grandperrin is the co-founder and CEO of Mindee.
