
Measuring and Reducing Bias in Machine Learning, AI

1 February 2023

With a PhD in statistical astrophysics, David Corliss is lead, Industrial Business Analytics, and manager, Data Science Center of Excellence, Stellantis. He serves on the steering committee for the Conference on Statistical Practice and is the founder of Peace-Work, a volunteer cooperative of statisticians and data scientists providing analytic support for charitable groups and applying statistical methods in issue-driven advocacy.

In recent years, bias in machine learning and AI has become recognized as one of the most important challenges in statistics and data science and one of the most important subjects for the Data for Good community to address. Last March, the Data Foundation hosted a public forum on accelerating AI in the public sector. As an attendee and speaker, I came to realize technology isn’t the greatest barrier to making an impact with data and statistics. Gaining—perhaps I should say regaining—the trust of the general public is.

Getting Involved
In opportunities this month, the New England Statistical Society is accepting applications for their NextGen scholarships for underrepresented minorities. These scholarships support graduating high-school seniors and first/second-year undergraduate students interested in a career in statistics and/or data science.

Also, now is the time to plan for the 2023 Symposium on Data Science and Statistics, which will be in St. Louis, Missouri, May 23–26. This is a great opportunity to learn new techniques, meet with other data scientists and statisticians, and get ideas for your next Data for Good project.

Machine learning and AI have been held up as ways to eliminate unfair bias in human processes, from hiring decisions to the judicial system. Highly publicized failures of data science to deliver fair and equitable results have damaged our ability to use science to drive change for the greater good. So, what went wrong?

Bias can come from many sources. Selection bias occurs when a training sample is not representative of the general population to which an algorithm is applied, or when a small group is put at a disadvantage in the application of the results. Failures have ranged from voice recognition programs that disproportionately fail to understand female voices to the over-policing of Black men driven by capturing more data in majority-Black areas.

Prejudice bias, also known as the “history problem,” results when training data is labeled using previously biased decisions. In this way, the algorithm is taught to recreate the very bias it is intended to eliminate: bias in = bias out.

Another important source of bias occurs when individual model features are not screened for biased results. This is especially a concern when there is a huge number of candidate predictors, as in résumé assessment using natural language processing.
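
As one illustration of what such screening might look like, here is a minimal Python sketch that flags candidate features acting as proxies for a protected attribute by checking how well each feature, on its own, predicts group membership. The names (X_df, group) and the AUC threshold are placeholders of my own choosing, and this is only one of many possible screens.

```python
# Hypothetical proxy screen: flag features that, by themselves, predict the
# protected attribute well enough to act as stand-ins for it.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def proxy_screen(X_df: pd.DataFrame, group: pd.Series, threshold: float = 0.65) -> dict:
    """Return {feature: AUC} for features whose single-variable AUC for
    predicting the protected group (coded 0/1) exceeds the threshold."""
    flagged = {}
    for col in X_df.columns:
        clf = LogisticRegression(max_iter=1000).fit(X_df[[col]], group)
        auc = roc_auc_score(group, clf.predict_proba(X_df[[col]])[:, 1])
        if auc > threshold:
            flagged[col] = round(auc, 3)
    return flagged
```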

The amount of bias in a model or algorithm can be measured by looking at the disparate impact of model outcomes across different population subsets. The same process for measuring disparate impact is often used in public health studies. Log odds or odds ratios quantify the amount of bias by comparing model outcomes between different groups.
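
As a minimal worked example (the counts below are invented purely for illustration), the odds ratio of a positive model outcome between two groups can be computed directly from a 2x2 table of outcomes by group:

```python
import math

# Invented counts of positive/negative model outcomes by group
group_a_pos, group_a_neg = 180, 820   # e.g., selected vs. not selected
group_b_pos, group_b_neg = 90, 910

odds_a = group_a_pos / group_a_neg    # odds of selection in group A
odds_b = group_b_pos / group_b_neg    # odds of selection in group B

odds_ratio = odds_a / odds_b          # ~2.2: group A's odds are about double group B's
log_odds = math.log(odds_ratio)       # 0 would indicate no disparity

print(f"Odds ratio: {odds_ratio:.2f}, log odds ratio: {log_odds:.2f}")
```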

Recent technological developments have improved and simplified the measurement of bias and the comparison of different models. One particularly useful tool is Fairlearn, an open source Python toolkit developed and maintained by the Fairlearn Project. Fairlearn quantifies bias by focusing on differences in selection rates for different population subgroups, such as male versus female. This supports implementing constraints in the model to equalize the odds across population subgroups or to produce demographic parity, so that differences in the selection rate for different groups are minimized.
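
The sketch below shows the basic workflow just described: compute per-group selection rates with Fairlearn's MetricFrame, summarize the demographic parity difference, and refit the model under a demographic-parity constraint using ExponentiatedGradient. The synthetic data and the choice of logistic regression are my own assumptions for the example, not part of the toolkit.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from fairlearn.metrics import MetricFrame, selection_rate, demographic_parity_difference
from fairlearn.reductions import ExponentiatedGradient, DemographicParity

# Synthetic example data; the group indicator is deliberately included as a
# feature so the unconstrained model shows a selection-rate gap to mitigate.
rng = np.random.default_rng(0)
n = 2000
sex = rng.choice(["female", "male"], size=n)
X = np.column_stack([rng.normal(size=n), (sex == "male").astype(float)])
y = (rng.random(n) < np.where(sex == "male", 0.6, 0.4)).astype(int)

# Unconstrained model
clf = LogisticRegression().fit(X, y)
y_pred = clf.predict(X)

# Selection rate by subgroup and the overall demographic parity difference
mf = MetricFrame(metrics=selection_rate, y_true=y, y_pred=y_pred,
                 sensitive_features=sex)
print(mf.by_group)
print("Demographic parity difference:",
      demographic_parity_difference(y, y_pred, sensitive_features=sex))

# Mitigation: refit subject to a demographic-parity constraint
mitigator = ExponentiatedGradient(LogisticRegression(),
                                  constraints=DemographicParity())
mitigator.fit(X, y, sensitive_features=sex)
y_pred_mitigated = mitigator.predict(X)
```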

There are several good articles and blog posts about Fairlearn. I prefer “A Primer on Machine Learning Fairness Using Fairlearn” by Armand Sauzay, which includes a link to a Kaggle project and the source code needed to produce the results described in the article.

Fairlearn has its limitations, however. It is written in Python, which is open source and accessible to most people, but there isn't an identical package or procedure in other languages. The functionality can be recreated, though: the basic metrics, such as precision and accuracy, are well documented in most analytic languages.

Fairlearn also doesn’t automatically calculate odds ratios, but they aren’t difficult to code. In my own experience, the metrics used in Fairlearn, including selection rates and confusion matrices, are great for demonstrating bias and mitigation to other statisticians. However, I find odds ratios the most persuasive metric to use with people from backgrounds such as the social sciences, business, and law.

The Metrics package in R supports bias metrics. David Dalpiaz at the University of Illinois published a good discussion about bias and the tradeoff between bias and variance, with code examples in R.

SAS has a package supporting bias metrics for its Viya product, written in Python, but I haven't seen a paper on bias mitigation with code in Base SAS; I should write one.

As has been said of money in the past, machine learning and AI are wonderful tools but terrible masters. Learning to avoid sources of bias, and to quantify and minimize its impact, allows these tools to fulfill their promise of benefiting all.

