Meta AI’s open source system attempts to rectify gender bias in Wikipedia bios


At this point it has become reflexive: when searching for something on Google, a Wikipedia entry is often the first result. The site is consistently among the top 10 most visited websites in the world.

Yet not all changemakers and historical figures are equally represented in the dominant web encyclopedia: only 20% of Wikipedia’s biographies are about women. That percentage drops even more when it comes to women from intersectional groups – those in science, for example, or from underrepresented areas like Africa or Asia.

This is an indication of the fact that “there are a lot of societal biases on the Internet in general,” said Meta AI researcher Angela Fan, who wanted to investigate this imbalance for her PhD project as a computer science student at the Université de Lorraine, CNRS, in France. “AI models don’t cover everyone in the world equally.”

To address this, Fan collaborated with her PhD advisor, author and computer science researcher Claire Gardent, to build an open-source AI system that collects and writes the first drafts of Wikipedia-style biographies. Today, they published their findings and methodologies in the paper, “Generating Full Length Wikipedia Biographies: The Impact of Gender Bias on the Retrieval-Based Generation of Women Biographies.”

Meta AI has also made the model and associated dataset open source. The dataset covers not only women generally, but also women in science and women in Asia and Africa. The hope, Fan said, is that open, reproducible science can complement existing efforts and provide a starting point for researchers who want to bring more representation to the web.

NLP fights gender bias

As Fan pointed out, the natural language processing (NLP) community has focused on combating gender bias in dialogue, coreference resolution, toxic language detection, machine translation, and word embeddings. These studies have proposed a variety of strategies, including data augmentation, additional data collection efforts, modified generation, and fair evaluation.

In the case of Wikipedia, while efforts by groups like the Wikimedia Foundation, WikiProject Women and Women in Red — a Wikipedia editor community — have focused on de-biasing existing content, they have failed to address the systemic challenges surrounding the first collection of content and the factors that introduce bias in the first place, Fan said.

Meanwhile, factuality is one of the biggest problems in text generation and NLP. The process poses three major challenges, Fan said: how to gather relevant evidence, how to structure that information into well-formed text, and how to ensure that the text generated is factually correct.

The study’s model and dataset use AI to generate full biographies rather than fixing or adding bits of content to existing profiles. The model writes a full biography by first generating an intro paragraph, then a section on the subject’s early life, and then one on their career. Each section follows three steps: a retrieval module that selects relevant information from the web for that section; a generation module that writes the text of the section and predicts which section comes next; and a citation module that surfaces the relevant citations.
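The section-by-section loop described above can be sketched roughly as follows. Everything here — function names, the toy retrieval and generation logic, the fixed section order — is an illustrative placeholder, not Meta AI’s actual implementation:

```python
# Hypothetical sketch of the three-step, section-by-section loop described
# in the article. The retrieval/generation logic is a toy stand-in.

def retrieve(query, web_corpus):
    """Retrieval module: pick web passages relevant to the current section."""
    return [doc for doc in web_corpus if query["name"] in doc["text"]]

def generate(query, evidence):
    """Generation module: draft the section text and predict the next heading."""
    text = f"{query['heading']}: draft based on {len(evidence)} passage(s)."
    next_heading = {"Introduction": "Early life", "Early life": "Career"}.get(query["heading"])
    return text, next_heading

def cite(evidence):
    """Citation module: attach the sources the section text was drawn from."""
    return [doc["url"] for doc in evidence]

def write_biography(name, occupation, web_corpus):
    sections, heading = [], "Introduction"
    while heading:  # stop when the generator predicts no further section
        query = {"name": name, "occupation": occupation, "heading": heading}
        evidence = retrieve(query, web_corpus)
        text, heading = generate(query, evidence)
        sections.append({"text": text, "citations": cite(evidence)})
    return sections

corpus = [{"text": "Ada Lovelace wrote the first algorithm.", "url": "example.org/ada"}]
bio = write_biography("Ada Lovelace", "mathematician", corpus)
```

In this toy version the loop emits three sections (intro, early life, career), each carrying the citations its retrieval step returned — mirroring the structure the paper describes, not its models.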

Fan and Gardent’s query consisted of three parts: the name of the person whose biography was being generated, their profession(s), and a section heading. They compiled a dataset of 1,500 biographies of women, then analyzed the generated text to understand how differences in the data available on the web affect generation. They evaluated the factuality, fluency, and quality of the generated texts using both automatic metrics and human evaluation of content and factuality.

The limitations of AI

As Fan explained, existing AI can write individual sentences quite well, but producing a fully coherent paragraph is harder, and producing an entire long-form document or article harder still.

“The main challenge is generating long text,” said Gardent, author of the book “Deep Learning Approaches to Text Production” and a researcher at the Lorraine Research Laboratory in Computer Science, affiliated with France’s national center for scientific research (CNRS) and the Université de Lorraine. “It may look very natural, but if you examine it in detail, it is full of contradictions and redundancies, and in fact it can be quite wrong.”

This is because there are often not enough secondary sources to verify facts. At the same time, there are challenges with multilingual NLP. Wikipedia supports 309 languages, but English is dominant, followed by French and German. From there the numbers drop off significantly, because many languages – such as those spoken in Africa – are low-resource. “It’s important to measure not just the representation of one group, but how it interacts with other groups,” Fan said.

The goal is to have “language-agnostic representation,” Gardent agreed. If multiple languages can be processed, they can all be drawn on to extract the maximum amount of information.

To address factuality, the study also used what’s known as natural language entailment as a proxy for measuring it: if two sentences entail each other in both directions, they are semantically equivalent, Fan explained.

In the end, she stressed that the model and dataset are only one small step in rectifying a long-standing, inherent bias.
“Our model only addresses one part of a multifaceted problem,” Fan said, “so there are still more areas where new techniques need to be explored.”

VentureBeat’s mission is to be a digital town square for tech decision makers to learn about transformative business technology.
