The very real risks of excluding non-respondents from your dataset


Analysts use machine learning tools to analyze the vast amounts of data now available. Big data is attractive, but even the best tools and techniques will give you noise if the underlying data is not robust and inclusive. To reliably predict elections, correctly anticipate consumer demand, or accurately predict the course of a pandemic, leaders need to consider who is (and isn’t) included in their dataset.

For example, we recently passed the six-year anniversary of Britain’s shocking vote to leave the European Union. People around the world were stunned by the result, not least because the polls had told us there would be no Brexit. Most young Britons favored staying in the EU, but many of them did not vote that day, and the absence of these complacent, disengaged young people made the crucial difference to the outcome. How did conventional polls miss this? What did pollsters ask young people who did not intend to vote? Not much, because in reality pollsters never reached many of them. As a result, the Brexit polls overrepresented the views of likely voters and underrepresented non-voters.

This is a striking example of the real risk of relying on data from the usual, more engaged voters rather than from the broadest, most diverse set of voices, including people who typically don’t answer polls or are not active on social media. The same has happened in numerous referendums and elections since, including the surprise election of Donald Trump as US president in 2016.

The risks of excluding non-participants go well beyond predicting referendums or elections; they also affect a wide range of critical business, economic, and public policy questions. During the COVID-19 pandemic, for example, we learned that if we don’t hear from all populations, the resulting blind spots in the data can let new outbreaks emerge undetected.

Social media analysis is attractive because it yields a large, continuous stream of data. But big data can increase the risk of drawing the wrong conclusions when it reflects only a narrow group of voices; in fact, most people are not active on social media. Traditional business intelligence tools, such as focus groups and panel surveys, can likewise reinforce our biases when they exclude the diverse voices we need for reliable business intelligence.

To reliably predict consumer demand, accurately anticipate the course of a pandemic, or avoid overreacting to a corporate crisis, leaders need to ask who they are not hearing from and find ways to incorporate those people into their data. The same principle applies to understanding economic trends. Young people and new immigrants generally do not participate in the surveys underlying employment data. Without more comprehensive data that captures these groups, leaders may fail to deliver the right dose of economic aid, at the right time, to the populations those surveys miss.

To address this problem, we use a technology that randomly invites anyone using the internet to answer questions. The method is analogous to drawing a random sample of the internet-using population, with the aim of reaching a much broader group, in any country in the world, than the typical pool of survey respondents.
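The bias that random intercepts are meant to correct can be illustrated with a toy simulation. This is a hypothetical sketch with invented numbers, not the authors’ actual methodology: a population split into “engaged” and “disengaged” groups that hold different views, polled once with group-skewed response rates and once with uniform ones.

```python
import random

random.seed(42)

# Toy population for a hypothetical yes/no referendum (all numbers invented):
# engaged citizens answer pollsters readily; disengaged citizens rarely do,
# and the two groups hold different views.
population = (
    [{"group": "engaged", "vote": random.random() < 0.45} for _ in range(6000)]
    + [{"group": "disengaged", "vote": random.random() < 0.70} for _ in range(4000)]
)

# Per-group probabilities of actually answering each kind of poll.
RESPONSE_RATES = {
    "conventional poll": {"engaged": 0.90, "disengaged": 0.05},
    "random intercept": {"engaged": 0.50, "disengaged": 0.50},
}

def poll(people, rates):
    """Return the estimated 'yes' share among those who actually respond."""
    sample = [p for p in people if random.random() < rates[p["group"]]]
    return sum(p["vote"] for p in sample) / len(sample)

true_share = sum(p["vote"] for p in population) / len(population)
estimates = {name: poll(population, rates) for name, rates in RESPONSE_RATES.items()}

print(f"true 'yes' share: {true_share:.2f}")
for name, est in estimates.items():
    print(f"{name}: {est:.2f}")
```

Because the conventional poll barely hears from the disengaged group, its estimate lands near the engaged group’s preference and misses the true population share; the uniform-reach poll, despite a lower overall response rate, comes in far closer.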

What have we learned from these harder-to-reach voices? That Donald Trump could win in 2016, and that there was a real risk of Brexit ahead of the Brexit “surprise.” We’ve heard from Americans who typically don’t respond to government surveys about their vaccination concerns, and learned what might convince them to get vaccinated. We’re also using this approach to gather more reliable, independent sentiment data in Russia amid the current crisis in Ukraine, and we’ve learned that fewer Russians support Putin’s approach to the invasion than typical opinion polls report, but also that fewer people oppose him than the anti-war protests might suggest.

Since we all evaluate data to make decisions, we shouldn’t rush to apply the latest machine learning tools to big data before asking: who is left out of our dataset, and what can we learn from those people? If we don’t fix that problem, we risk getting our predictions and decisions wrong.

Greg Wong is CEO and Danielle Goldfarb is head of global research at RIWI.

