Why We Should Crowdsource AI Ethics (and How to Do So Responsibly)

It’s impossible to deny: artificial intelligence is becoming more and more powerful, while requiring less and less human supervision. Powerful, independent AIs not only perform routine tasks independently but now make difficult ethical decisions all on their own.

For instance, automated vehicles need to balance the distribution of unavoidable risk across everyone on the road—drivers, passengers, pedestrians, cyclists. Health care algorithms must decide which patients receive scarce medical resources, such as extra attention from staff, specialized equipment like ventilators, or donated organs. In the criminal justice system, algorithms have to determine who gets released on bail. So, what constitutes a “fair” decision?

At the moment, there seems to be no clear or widely agreed-upon answer. Consider a health care algorithm that determines who receives a scarce piece of equipment. Some people believe that algorithms should treat all people equally and that the allocation should be made on a first come, first served basis. Others believe that the allocation should be based on years of life expectancy or chance of recovery. Yet others argue that patients who will play an important role in society (e.g., a doctor who performs organ transplants) should be prioritized. Moral disagreement like this can exist even within fairly homogeneous communities, never mind a society as diverse as the United States.

What moral principles should govern decisions like these? And who gets to decide? One line of thinking holds that we should leave these things up to experts, such as policymakers, legal scholars, and applied ethicists. But an important question remains: Should the interests and preferences of the public matter in some (if not all) cases? If so, how should they be factored in? And how can we find out what the preferences of the public are in the first place?  

We believe that large-scale online studies that crowdsource public opinion on ethical dilemmas can add to the repertoire of resources used in the political process (like public-opinion polling and public comment periods) to gain insights about the preferences of the public.

At the outset, this may sound controversial. The term “crowdsourcing ethics” may sound like advocating for “what is popular is always right.” To the contrary, our goal in this article is to suggest that crowdsourcing is a new and important way to augment the information policymakers have about the views of their constituents. The role that we think crowdsourced ethics should play is modest yet valuable and, like all good policy-making tools, needs to be incorporated with responsible oversight.

Crowdsourcing ethical preferences

In 2016, our team created the Moral Machine, an online platform for gathering judgments about one of the most challenging conundrums of this technological moment: the moral decisions made by autonomous vehicles (AVs), such as how AVs should distribute unavoidable risk between road users. Imagine, for instance, that the brakes on an AV fail and it is careening out of control. How should the AV be programmed to respond? Should the AV crash into a barrier, inevitably harming the occupant? What if the alternative is to plow through a crosswalk, harming multiple pedestrians? The website gathers judgments in cases like these. It received widespread media coverage and went viral beyond all expectations, capturing forty million decisions by millions of visitors from 233 countries and territories to date and resulting in the largest data set on AI ethics ever collected.

The Moral Machine was designed to understand the psychological factors underlying people’s ethical decisions related to AVs. The dilemmas used in the Moral Machine are inspired by a famous philosophical conundrum, known as the trolley problem, first proposed by the philosopher Philippa Foot in 1967. One version of the trolley problem goes like this: An out-of-control trolley is about to run over five people who are stranded on the trolley tracks. It is possible to divert the trolley onto a sidetrack where only one person is standing, thereby saving the five and killing the one who was previously unthreatened. Should the trolley be diverted onto the sidetrack? Or should it be left to kill the five people?

This tool has been used extensively by experimental philosophers, psychologists, and neuroscientists to understand how moral judgment works. By subtly varying different elements of the trolley problem—like the number of people on the tracks, characteristics of the victims, the intention of the bystander, and the causal sequence of events—and asking subjects to make judgments about them, investigators can gain insight into the factors impacting subjects’ moral decision-making.

We extended this logic to the Moral Machine, rewriting the problem in the context of AVs and framing the vehicle as the decision maker. We developed moral dilemmas by varying nine different attributes: intervention (stay vs. swerve), relationship to AV (pedestrians vs. passengers), legality (lawful vs. unlawful), gender (male vs. female), age (younger vs. older), social status (higher vs. lower), fitness (fit vs. large), number of characters (more vs. fewer), and species (humans vs. pets). As part of these manipulations, 20 different characters appeared in the scenes, ranging from doctors to babies in strollers, from criminals to pets. The manipulation of these factors together resulted in over 26 million possible dilemmas.
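The idea of building dilemmas by varying attributes can be illustrated with a short sketch. It assumes, purely for illustration, that each of the nine attributes is a simple two-level factor and enumerates every combination. It does not model the 20 characters or the scene layout, so it makes no attempt to reproduce the 26 million figure. The names ATTRIBUTES, all_profiles, and random_profile are hypothetical and are not taken from the Moral Machine code.

```python
import itertools
import random

# The nine attributes and their two levels, as listed in the article.
# The dictionary layout and key names are assumptions for this sketch.
ATTRIBUTES = {
    "intervention": ("stay", "swerve"),
    "relationship_to_av": ("pedestrians", "passengers"),
    "legality": ("lawful", "unlawful"),
    "gender": ("male", "female"),
    "age": ("younger", "older"),
    "social_status": ("higher", "lower"),
    "fitness": ("fit", "large"),
    "number_of_characters": ("more", "fewer"),
    "species": ("humans", "pets"),
}


def all_profiles():
    """Yield every combination of attribute levels (a full factorial design)."""
    names = list(ATTRIBUTES)
    for levels in itertools.product(*(ATTRIBUTES[name] for name in names)):
        yield dict(zip(names, levels))


def random_profile(rng):
    """Draw one attribute profile uniformly at random."""
    return {name: rng.choice(levels) for name, levels in ATTRIBUTES.items()}


profiles = list(all_profiles())
print(len(profiles))  # nine two-level factors give 2**9 = 512 combinations

sample = random_profile(random.Random(0))
print(sample["legality"], sample["species"])
```

Even this stripped-down version shows why the space of scenarios grows so quickly: every additional two-level attribute doubles the number of combinations.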

When we analyzed the Moral Machine data, we discovered that (unsurprisingly) participants mostly choose to spare people over animals and (more surprisingly, perhaps) choose to spare the young over the old. Interesting cross-cultural differences emerged as well. We found that based on their answers on the Moral Machine, countries congregate into three main clusters. We call these clusters Western, Eastern, and Southern based on their geographic positions. However, the differences between clusters are not solely explained by geographical locations. Some of these differences may be explained by modern institutions, while others are correlated with deep cultural traits. For example, in countries where the rule of law is strong (and highly respected), people are more willing to protect rule-followers at the cost of rule violators. Specifically, those crossing when the traffic light permits them are spared over jaywalkers. In other countries this factor makes little difference.

Now that we have that information, what should we do with it?

How do we gauge public opinion?

Some might see the Moral Machine data as unnecessary or even dangerous. After all, many people have implicit (and explicit) biases against people who are different from them. It would obviously be problematic for racist, sexist, and homophobic biases and other intolerant attitudes to be the foundation upon which we based the moral decision-making of AI systems.

Further, many ethically charged decisions can only be made after spending the time to understand complex sets of facts. How will a specific health care policy impact the economy? How will traffic patterns change by building a new road? In a complex, democratic society, experts such as regulators and bureaucrats, appointed by elected representatives, are arguably in a better position to make decisions about policy than their constituents are. They are tasked with researching the facts that bear on policy decisions and considering the implications of the available options. Their decisions are the ones that (in an ideal world) would be most just. Why, then, should they seek the advice of people who haven’t spent the necessary time required to understand the issue? Isn’t it best if we leave such decisions to experts alone, to figure out on their own?

Unfortunately, this solution is not completely satisfactory. Even when we trust that experts have sufficient training and wisdom, and have the interest of the public at heart, the question remains: How do they know what the interest of the public actually is? Even if we accept that the public’s opinion could be tinged with bias, impeded by less information, and should not be the final word, does that mean experts can simply dismiss the public’s view? If not, what is the best way for experts to learn about the interest of the people?

Policymakers currently use public-opinion polling and public comments to help them gauge public opinion. While helpful, these methods are limited by access, cost, and methodology.

Public-opinion polling is rooted in the idea that by surveying a representative sample of the public, one can arrive at a reasonably accurate estimate of the broad public’s views. Many polls use telephone calls and face-to-face interviews, which suffer from significant bias (e.g., they are limited to those who are willing to answer the call and participate), and thus result in unrepresentative samples. Another problem with these methods is that they are costly in terms of time and effort, and thus are hard to scale. Some public-opinion polls have recently started to harness the power of online panels and claim to guarantee representative samples, which improves substantially on previous methods (though the costliness of scaling this procedure remains a barrier). Moreover, public-opinion polling often obscures the reasons behind respondents’ answers. The main goal of public opinion polling is to estimate the attitude of the public, without uncovering why they feel that way.

The process for public comments is quite different from that of opinion polling. When a federal law is created (i.e., passes through both houses of Congress and is approved by the president), regulations are made to implement the law. The regulation falls to federal agencies, such as the Environmental Protection Agency, the National Highway Traffic Safety Administration, and the Federal Railroad Administration. When a new rule or regulation is proposed by an agency, the rule becomes open for public comment. Agencies may solicit comments from industry stakeholders, local governments, and members of the public who have an interest in the regulation. Comments can range from hundreds of pages of extensive analysis of the implications of a policy for an industry, compiled by legal experts, to a few sentences submitted by an individual citizen who has some interest in the outcome of the regulation. But this system has many drawbacks. For instance, only highly engaged parties are likely to be aware that a comment period is open, even when a regulation may have important impacts on their lives. Moreover, the regulations can be written in highly technical language, making it difficult for citizens to extract the information relevant to them.

These tools leave gaps in policymakers’ understanding of public preferences. How can we make the process of gathering preferences more accessible to the public? How can we incentivize the public to participate? And how can we investigate with scientific rigor the factors contributing to these attitudes?

We believe that online crowdsourcing platforms like Moral Machine provide one solution to help fill these gaps. Such platforms are often highly accessible and enjoyable to use, and therefore reach large numbers of people. They can also be designed as randomized controlled trials, allowing more rigorous conclusions to be drawn from the data.
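To make the randomized-trial point concrete, here is a minimal sketch with entirely synthetic data. It assumes a hypothetical experiment in which each participant is randomly shown either a lawful or an unlawful pedestrian and chooses whether to spare that character. Because the version is assigned by chance, the difference in sparing rates between the two groups estimates the effect of legality on its own. The function names, the sample size, and the simulated effect of 0.3 are illustrative assumptions, not Moral Machine results.

```python
import random


def run_trial(n_participants, true_effect, rng):
    """Simulate a randomized experiment on one attribute (synthetic data only)."""
    outcomes = {"lawful": [], "unlawful": []}
    for _ in range(n_participants):
        # Random assignment: chance alone decides which version is shown.
        condition = rng.choice(["lawful", "unlawful"])
        # Assumed response model: sparing probability is shifted up or down
        # by half of the simulated effect, depending on the condition.
        shift = true_effect / 2 if condition == "lawful" else -true_effect / 2
        spared = 1 if rng.random() < 0.5 + shift else 0
        outcomes[condition].append(spared)
    return outcomes


def difference_in_proportions(outcomes):
    """Estimate the effect as the gap in sparing rates between conditions."""
    rate_lawful = sum(outcomes["lawful"]) / len(outcomes["lawful"])
    rate_unlawful = sum(outcomes["unlawful"]) / len(outcomes["unlawful"])
    return rate_lawful - rate_unlawful


rng = random.Random(0)
outcomes = run_trial(n_participants=20000, true_effect=0.3, rng=rng)
estimate = difference_in_proportions(outcomes)
print(round(estimate, 2))
```

With a large enough sample the estimate lands close to the simulated effect, which is the sense in which randomization supports more rigorous conclusions than an ordinary opinion poll.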

A new and needed perspective

How can the answers to hypothetical scenarios by millions of (biased, uninformed) users on the internet be useful for policymaking around the ethics of AI?

The data we collected via the Moral Machine provided us with interesting insights into the ways in which the public’s opinion is aligned (and misaligned) with that of the experts. In 2016, a German commission comprising experts in ethics, technology, and law was formed to create the first ethical guidelines for autonomous vehicles. By comparing the data collected via Moral Machine with the commission report, we noticed that while there is some overlap between the opinions of the public and the experts, there are also key points of disagreement. Both agree on sacrificing animals in order to spare human life, for instance. But the public largely approved of sparing children at the cost of the elderly, whereas the experts rejected any form of discrimination based on age.

While the experts are not required to cater to the public’s preferences when making ethical decisions, they may be interested in knowing the views of the public, especially in cases where the right decision is difficult to discern and where it may be important to gauge and anticipate public reaction to important policy decisions.

Even a group of experts may sometimes fail to reach an agreement when faced with particularly difficult moral questions. For example, the German Commission recommended both “General programming to reduce the number of personal injuries” and “It is also prohibited to offset victims against one another.” This seems like an attempt to combine two contradictory opinions (both of which are endorsed by experts) into a single recommendation. However, while either recommendation on its own may be actionable, it is hard to know how to respond to a document recommending both. In cases like this, policy experts may instead choose to use citizens’ preferences as a tiebreaker, or for the sake of determining which policies are most likely to be broadly acceptable, or else to at least open discussion for further public deliberation.

Other times, when policy experts find citizens’ preferences problematic, and decide to dismiss them, they still must be prepared for the reaction their policies will create and think carefully about how they will explain their choices to the public.

Moreover, policymakers in an Eastern country may benefit from understanding how the preferences and the expectations of citizens of their country differ from those in Western countries (and vice versa). As a result, they may realize that some of the policy guidelines drafted by European commissions should be revisited before they adopt the guidelines for their own jurisdictions. To understand the public reaction in these countries, local government would need to design and run their own preference elicitation tools that target representative samples of their constituents, potentially using our framework and findings as a starting point. As increasingly nuanced data about local populations is gathered, we may begin to be able to answer the question of whether it would be sensible to transfer ethical guidelines about AVs from one country to another. Even more ambitiously, cross-cultural data of this sort might help us determine the plausibility of achieving a global consensus on principles for a universal machine ethics.

Being clear about limitations

It is important to be clear about the limitations of platforms like the Moral Machine.

First, as we have hopefully already made clear, we do not believe that the preferences collected from users of Moral Machine should be directly translated into policy or handed off to engineers to be coded into AVs (to the extent that would even be possible). Instead, we feel that the judgments collected on this platform or others may qualitatively inform a policymaker or a group of experts making policy recommendations.

Second, these platforms do not guarantee representative samples of the general population. The participants in this exercise are those who chose to visit our website (that is, internet-connected, tech-savvy folks curious about driverless car technology). Given this, policymakers should not embrace this data as the final word on public preference.

Finally, real-world scenarios are multidimensional and complex. Our data is derived from simplified scenarios that capture a few interesting or important aspects of these cases. When designing the Moral Machine, we did not introduce every factor relevant to the moral decision-making of AVs. Nor did we introduce uncertainty about the outcomes of the actions. Despite these limitations, the Moral Machine begins to uncover important psychological phenomena at the heart of AV ethics. (Arguments for using simplified trolley-problem-like scenarios to investigate AV ethics have been made elsewhere.) However, we would also like to highlight the fact that simplistic hypothetical trolley scenarios should not distract from other highly critical ethical questions surrounding AV policy, including how risk is being distributed on the road by AVs.

AI is starting to make important ethical decisions. The development and regulation of these AI technologies should involve a dialogue between policymakers and the public. To that end, new techniques and tools should be developed to learn about the public’s views. In this article, we summarized our experience developing one such technique. We hope that our efforts will inspire further creative thinking about how to democratize the process of AI regulation.