One of the most common misperceptions about bias in machine learning is: “If I don’t use age, gender, race, or similar factors in my model, it’s not biased.” That’s not true.
Even though the people who hold this opinion know that artificial intelligence can ‘learn’ and compute relationships between data, they often overlook that other captured features can act as proxies for the sensitive attributes they left out. These proxies are called confounding variables and, as the term indicates, such unintended variables can confuse the model into producing biased results.
For example, if a model includes the brand and version of an individual’s mobile phone, that data can be related to the ability to afford an expensive phone, a characteristic that can imply a certain level of income.
If income is not a factor that should be used directly in the decision, imputing that information from other data, such as the type of phone or the size of the purchases the individual makes, introduces bias into the model. A high rand amount on purchases can indicate that an individual is likely to keep making transactions of that size over time, again imputing income bias.
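One rough way to look for such proxies is to check how strongly each candidate feature tracks the attribute you chose to exclude. The sketch below is only an illustration under stated assumptions: the records are hand-made, the feature names (phone_price_tier, avg_purchase_amount, account_digit) and the 0.5 threshold are hypothetical, and a plain Pearson correlation will only catch simple linear relationships between one feature and the excluded attribute.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def flag_proxies(features, excluded, threshold=0.5):
    """Names of features whose absolute correlation with the excluded
    attribute meets the (arbitrary, hypothetical) threshold."""
    return [name for name, values in features.items()
            if abs(pearson(values, excluded)) >= threshold]

# Hypothetical, hand-made records for eight individuals.
income = [20, 25, 30, 45, 60, 80, 95, 120]  # excluded attribute, in thousands
features = {
    "phone_price_tier": [1, 1, 2, 2, 3, 3, 4, 4],
    "avg_purchase_amount": [150, 200, 180, 400, 520, 610, 900, 1100],
    "account_digit": [3, 7, 1, 9, 2, 8, 4, 6],  # unrelated to income
}

flagged = flag_proxies(features, income)
print(flagged)
```

With this toy data, the phone tier and the purchase amount are flagged as likely stand-ins for income, while the unrelated digit is not. On real data, a feature that passes this check can still act as a proxy in combination with others, so a check like this is a starting point rather than a guarantee.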
Research into the effects of smoking provides another example of confounding variables. In decades past, research essentially drew the conclusion that if you smoke, your probability of dying in the next four years is fairly low, so smoking must be OK.
The confounding variable in this conclusion was the age distribution of smokers. In the past, the smoking population contained many younger smokers whose cancer would develop later in life.
The older smokers were already deceased and thus did not make up part of the data sample. As a result, the analysis was heavily biased and created a false perception of the safety of smoking.
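The effect of a skewed age distribution can be shown with a small worked example. The counts below are entirely hypothetical and chosen only to illustrate the mechanism: the smokers in the sample are mostly young, so their overall death rate looks low, yet within each age band smokers die at a higher rate than non-smokers.

```python
# Hypothetical counts: (group, age_band) -> (people, deaths during follow-up)
cohort = {
    ("smoker", "young"): (900, 9),
    ("smoker", "old"): (100, 30),
    ("nonsmoker", "young"): (400, 2),
    ("nonsmoker", "old"): (600, 90),
}

def crude_rate(group):
    """Death rate for a group with age ignored."""
    people = sum(n for (g, _), (n, _d) in cohort.items() if g == group)
    deaths = sum(d for (g, _), (_n, d) in cohort.items() if g == group)
    return deaths / people

def stratified_rate(group, age_band):
    """Death rate for a group within a single age band."""
    people, deaths = cohort[(group, age_band)]
    return deaths / people

print("crude:", crude_rate("smoker"), crude_rate("nonsmoker"))
for band in ("young", "old"):
    print(band, stratified_rate("smoker", band), stratified_rate("nonsmoker", band))
```

Ignoring age, the smokers in this made-up sample appear safer than the non-smokers. Once the comparison is made within each age band, the direction reverses, which is why the distribution of the population behind the data has to be examined before any conclusion is drawn.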
In the 21st century, similar bias could be produced by a model concluding that, since far fewer young people smoke cigarettes than 50 years ago, nicotine addiction levels are down, too.
However, youth use of e-cigarettes jumped 78% between 2017 and 2018, to one out of every five high-school students. E-cigarettes are potent nicotine delivery devices, fostering rapid nicotine addiction.
The challenge of delivering truly ethical AI requires closely examining each data class separately.
As data scientists, we must search for confounding variables and demonstrate to ourselves, and the world, that AI and machine learning technologies are not subjecting specific populations to bias. To reach that goal, the relationships learned by machine learning and AI need to be exposed. This is part of the trend toward Responsible AI; you can read more about it in my blog post on AI Predictions for 2020.
Explainability is paramount to the responsible use of AI and machine learning, and fortunately, algorithms for explaining machine learning go back more than 30 years. Now is the time to implement them broadly, before we see the spread of unregulated algorithms.
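As one illustration of exposing what a model has learned, the sketch below uses permutation importance: permute a single feature across the records and see how much the model's accuracy drops. This is just one commonly used, model-agnostic technique and not necessarily the kind of explanation algorithm referred to above. Everything in it is an assumption made for illustration: toy_model is a hypothetical stand-in for a trained model, the feature names and records are invented, and a fixed reversal replaces the usual random shuffle so the result is reproducible.

```python
def accuracy(model, rows, labels):
    """Fraction of rows where the model's prediction matches the label."""
    hits = sum(model(row) == label for row, label in zip(rows, labels))
    return hits / len(rows)

def permutation_importance(model, rows, labels, feature):
    """Accuracy drop when one feature's column is permuted across rows.
    A reversal is used as a fixed permutation; in practice the column is
    usually shuffled at random and the drop averaged over several repeats."""
    baseline = accuracy(model, rows, labels)
    permuted_column = [row[feature] for row in rows][::-1]
    permuted_rows = [{**row, feature: value}
                     for row, value in zip(rows, permuted_column)]
    return baseline - accuracy(model, permuted_rows, labels)

def toy_model(row):
    """Hypothetical stand-in for a trained model: approves on phone tier alone."""
    return row["phone_price_tier"] >= 3

# Hypothetical records; labels are chosen so the toy model is perfectly accurate.
rows = [{"phone_price_tier": tier, "account_digit": digit}
        for tier, digit in zip([1, 1, 2, 2, 3, 3, 4, 4], [3, 7, 1, 9, 2, 8, 4, 6])]
labels = [toy_model(row) for row in rows]

importances = {feature: permutation_importance(toy_model, rows, labels, feature)
               for feature in ("phone_price_tier", "account_digit")}
print(importances)
```

A large drop for the phone tier and no drop for the other feature shows that the decision rests entirely on a feature that can stand in for income, which is exactly the kind of learned relationship that needs to be surfaced and reviewed.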