Document Type

Article

Publication Date

9-29-2017

Abstract

In recent years there have been many studies investigating gender biases in the content and editorial process of Wikipedia. In addition to creating a distorted account of knowledge, biases in Wikipedia and similar corpora have especially harmful downstream effects, as these corpora are often used in Artificial Intelligence and Machine Learning applications. As a result, many of the algorithms deployed in production "learn" the same biases inherent in the data on which they are trained. It is therefore increasingly important to develop quantitative metrics to measure bias. In this study we propose a simple metric, the Gendered Pronoun Gap, that measures the ratio of the occurrences of the pronoun "he" versus the pronoun "she." We use this metric to investigate the distribution of the Gendered Pronoun Gap in two Wikipedia corpora prepared by Machine Learning companies for developing and benchmarking algorithms. Our results suggest that the way these datasets have been produced introduces different types of gender biases that can potentially distort the learning process for Machine Learning algorithms. We stress that while a single metric is not sufficient to completely capture the rich nuances of bias, we suggest that the Gendered Pronoun Gap can be used as one of many metrics.
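
The abstract defines the metric only informally. The sketch below is a minimal Python reconstruction, assuming the Gendered Pronoun Gap is simply the count of "he" divided by the count of "she" in a text; the function name is hypothetical and the paper's exact definition (e.g. smoothing, log scale, or additional pronoun forms) may differ.

```python
import re
from collections import Counter

def gendered_pronoun_gap(text: str) -> float:
    """Ratio of occurrences of 'he' to occurrences of 'she' in text.

    Hypothetical reconstruction of the metric described in the abstract;
    not the authors' reference implementation.
    """
    # Count whole-word, case-insensitive tokens so "The" does not match "he".
    tokens = re.findall(r"\b\w+\b", text.lower())
    counts = Counter(tokens)
    he, she = counts["he"], counts["she"]
    if she == 0:
        # Undefined when "she" never occurs; return inf (or nan for no pronouns).
        return float("inf") if he > 0 else float("nan")
    return he / she

# Example: a gap above 1.0 means "he" occurs more often than "she".
sample = "He wrote the article and she reviewed it; he then published it."
print(gendered_pronoun_gap(sample))  # 2 occurrences of "he" / 1 of "she" -> 2.0
```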

Relational Format

journal article

Creative Commons License

This work is licensed under a Creative Commons Attribution-NoDerivatives 4.0 International License.
