By Jacob Sparks
This was a prize-winning entry into the Essay Competition on the Automation of Wisdom and Philosophy.
§1 Good AGI
The explicit goal of most major AI labs is to create artificial general intelligence (AGI): machines that can assist us across a wide range of tasks. Additionally, they all want to build systems that are safe, fair and beneficial to their users – machines that are good. But building machines that are both generally intelligent and good requires building machines that can “think” about what’s good, machines that make their own moral judgments. And this raises both philosophical and technical questions that we have barely started to address.
§2 What is a Moral Judgment?
Moral judgments, in the sense I intend, are judgments with moral content. They are about what is right or good in a non-reductive sense. Judgments of this kind are philosophically puzzling. They are where thought becomes practical, where the cognitive and conative aspects of intelligence come together. They raise difficult questions: how are they related to motivation and action? Can they be said to be true or false? If so, is their truth objective, or is it determined ultimately by our attitudes? What is the proper method for resolving disputes about them? What are they even about?
In machine ethics, “moral judgment” often refers to any kind of judgment that is morally significant. In this sense, we can speak of the “moral judgments” current AI systems make when determining risk scores, diagnosing disease, or driving a car. Or we could talk more speculatively about the “moral judgments” machines would need to make to determine a criminal sentence, treat a disease, or buy a car on your behalf. Asking if machines can make these kinds of “moral judgments” is really just asking about their trustworthiness in performing these morally significant tasks. But nothing in these debates touches on the question of how machines can make moral judgments in the sense I intend.
When some speak of building “moral machines” or about “putting ethical principles into machines” they are thinking about building systems that act in accordance with some particular ethical theory – Utilitarian, Rossian, Kantian, Contractualist, etc. The major debate here is whether these theories can be expressed in sufficiently precise ways to govern the behavior of an AI. But machines built along these lines would not be making moral judgments, in my sense, even if their behavior was “moral” according to one or more of these theories. If someone slavishly interpreted every moral question as being about utility maximization, prima facie duty satisfaction, maxim universalizability, hypothetical consent, etc., and failed to see that, whatever their merits, none of these theories captures the meaning of “good” or “right,” they would fail to make a moral judgment. One must be able to wonder if, after all, it would be right or good to maximize, satisfy, etc.[1]
Even if we grant that one of these traditional moral theories is correct, I’m not making the trivial claim that good AGI requires machines that get moral questions right. There is no guarantee that when you make moral judgments you get it right. What’s important about moral reasoning is that it allows you to hold your own motivations at a distance, as presenting possibilities that you can choose to act on or not. Moral judgments are a way for the cognitive aspects of intelligence to shape the conative ones in a process that bridges this reflective distance. If machines are going to behave well across the range of use cases intended for AGI, they’ll need to make these kinds of fallible and philosophically puzzling moral judgments. And to build such machines, we’ll have to learn much more about what moral judgments are and how they work.
§3 Moral Judgments Are Strange
Moral judgments are philosophically puzzling for two main reasons. The first has to do with their form. Like beliefs, they attempt to represent an independent reality. We’re trying to get it right when we make moral judgments. We want our moral judgments to be true, and this gives them a “mind to world” direction of fit: the mind is supposed to conform to the world. But moral judgments are also like desires. They aim to change reality, and this gives them a “world to mind” direction of fit: the world is supposed to be brought into line with the mind. We want to do what we judge to be right or good. We often act in accordance with and because of our moral judgments. They are, as some philosophers put it, “intrinsically motivating.” But, according to a widely accepted doctrine called “The Humean Theory of Motivation,” nothing could be both a belief and a desire, since each has a different direction of fit. According to the Humean Theory, beliefs and desires each have a necessary and distinct role to play in the explanation of action. But moral judgments seem to muddy that distinction.
The second puzzle has to do with the content of moral judgments. Moral facts – the things we are judging about – seem to be both fully grounded in and yet somehow independent of natural facts. On the one hand, if something is good or right, it is good or right in virtue of other natural properties that it has. Every action that’s right is right because it keeps a promise, makes someone happy, relieves suffering, etc. That’s why constructing moral theories, where we attempt to characterize moral properties in terms of natural properties, is a project that makes sense. On the other hand, being good or right seems to be something above and beyond any natural property. However you explain why some action is right, you always mean something more by “right” than what you cite in your explanation. Even if some right act is right because it keeps a promise, when you call it “right,” you don’t just mean “keeps a promise.” Otherwise you’d just be repeating yourself. Moreover, we all recognize that sometimes it isn’t right to keep a promise. So, how could anything have the content that moral judgments purport to have, something that is both grounded in and independent of non-moral facts?
There are many who attempt to resolve these puzzles and their attempts comprise most of what philosophers call “metaethics.” Some metaethicists think moral judgments really are just desires (with no objective correctness conditions), or really are just beliefs (with no intrinsic motivation), or both (denying the Humean Theory). Some think the contents of moral judgments really are just natural facts (and so not independent of natural facts) or that some of them are not dependent on any natural fact (and so not grounded in natural facts). But even if we accept one or another of these solutions, we shouldn’t lose our appreciation of the initial puzzles. These puzzles show us that moral judgments are theoretically strange, but they also show us how and why moral judgments are practically important.
The capacity to make moral judgments involves a kind of active reflection. When we think about what’s good or right, we are stepping back and taking stock of our inclinations. Whatever we might want or intend to do, we can ask, “yes, but would it be good?” No matter how we describe our action we can ask, “yes, but would it be right?” And, importantly, how we answer those questions matters to us and to what we do. Moral judgments allow us to ask potent questions about any motivation or any description under which we might act. They give us both a kind of freedom from our inclinations and an external standard for our actions to live up to. Without the capacity to think in this way, we’d be like animals.
If machines could make moral judgments, they too would have a kind of freedom. Some might find that problematic. They would prefer generally intelligent machines to only pursue the goals we give them or to be otherwise bound to human needs, desires, and aims. But machines that made moral judgments would also hold themselves to a standard that is independent of any of their (or our) motivations. And that is precisely what a machine needs to do in order to be a good AGI.
§4 Good AGI Requires Moral Judgment
The basic argument that good AGI requires the capacity to make moral judgments involves a generalization of what Stuart Russell calls “The King Midas Problem.” Midas came to regret his wish that everything he touched turn to gold when his food, drink and daughter were turned to gold as well.
In the context of AGI, Russell uses this allegory to illustrate the idea that “the achievement of … any fixed objective can result in arbitrarily bad outcomes.” Tell an intelligent machine to cure cancer, and it might induce tumors in every human to be able to conduct more experiments; tell the machine to get you from A to B as quickly as possible, and it might jostle you catastrophically, etc. Russell’s solution to this problem is to build what he calls “beneficial AI.” These are machines designed to achieve, not some fixed objective, but our objectives. According to Russell, the machine’s only goal should be to satisfy our preferences; it should be uncertain about what those preferences are; and it should learn about our preferences by observing our behavior.
Russell’s approach is promising. Machines designed along these lines partially avoid the King Midas Problem, since we don’t need to specify any objective for them. But it is only partial avoidance. Humans can have preferences for all manner of terrible things, and optimizing for any objective, even one that remains unspecified and must be learned, can have disastrous results. Even when we aggregate preferences across people, optimizing for their satisfaction can lead to very bad outcomes. At various times in history, the collective preferred to put some people in subservient roles on the basis of their gender or race. Today we collectively prefer to treat animals in horrific ways.
Russell is aware of this issue. He asks, “what should machines learn from humans who enjoy the suffering of others?” His answer is that, since these kinds of evil preferences would involve the frustration of other human preferences, there will naturally be some discount rate on their satisfaction. The only real question Russell sees here is about the balance between loyal AI that focuses exclusively on the preferences of some person or set of persons, and utilitarian AI that tries to maximize everyone’s utility.
This response (as well as Russell’s choice to call his approach “Beneficial AI”) indicates a failure to appreciate the difference between the non-moral question, “Does it satisfy a preference?” and the moral question, “Is it good?” This distinction is essential. Evil preferences should count for nothing, even if everyone shares them. All objectives, even ones machines learn from humans, should be subject to the kind of reflective scrutiny inherent to moral thought.
When machines operate in narrow contexts, the meaning of a term like “good” can be given a sufficiently reductive analysis. In chess, assuming we’re trying to win, a good move just is a move that makes winning more likely. But AGI does not operate in a narrow context. A good move for a generally intelligent machine cannot be specified – that is Russell’s insight. But neither can a good move for a generally intelligent machine simply be read off human preferences. When we’re talking about the wide context of AGI, the only move that is always good is a good move. If an AGI can’t work with some non-reductive sense of “good,” it won’t be a good AGI.
§5 But How?
Unfortunately, it isn’t at all clear what we’re doing when we think something is good, and it isn’t clear how to build machines that can do the same. I’ve said moral judgment involves a kind of active reflection. But what can bring reflection to an end? And how can any reflection affect what it reflects on? Importantly, in answering these questions and characterizing moral judgment, we can’t be content with the kinds of answers philosophers usually give. To hear that a moral judgment is a certain type of belief or a certain type of desire does not help us design artificial agents that can make such judgments. We need to speak the language of the people building AGI. However, since metaethicists tend to disagree about the details, and since expressing philosophical theories of moral judgment in the precise terms required by computer science is exceptionally difficult, what I say here will be highly speculative.
One potentially promising paradigm comes from reinforcement learning (RL). RL agents learn to maximize a reward by interacting with their environment: they can sense the state of that environment and take actions that affect it. Their goals are represented by a reward function that returns some value for each possible <state, action> pair. The central assumption of reinforcement learning – sometimes called the reward hypothesis – is that any goal can be represented as an attempt to maximize some suitably chosen reward function. Doing what’s right might be thought of as the ultimate goal of any agent capable of moral judgment. So, if the reward hypothesis is correct, there should be some RL agent that succeeds in making moral judgments.
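To fix ideas, here is a minimal sketch of the agent–environment loop that this picture presupposes, written in Python against a made-up toy environment (the names reward, step, and the little corridor setup are illustrative, not any particular library’s API). Under the reward hypothesis, everything about the agent’s goal is carried by the scalar that reward returns.

```python
# Minimal sketch of the standard RL loop with tabular Q-learning.
# The reward hypothesis: whatever the agent's goal is, it is fully
# captured by the scalar that reward(state, action) returns.

import random
from collections import defaultdict

ACTIONS = ["left", "right"]

def reward(state, action):
    # Stand-in reward function: the agent's entire "goal" lives here.
    return 1.0 if (state, action) == (3, "right") else 0.0

def step(state, action):
    # Stand-in transition dynamics for a tiny four-cell corridor.
    return min(state + 1, 3) if action == "right" else max(state - 1, 0)

q = defaultdict(float)                  # Q(s, a): estimate of long-run reward
alpha, gamma, epsilon = 0.1, 0.9, 0.1   # learning rate, discount, exploration

state = 0
for _ in range(10_000):
    # Epsilon-greedy policy: mostly exploit current estimates, sometimes explore.
    if random.random() < epsilon:
        action = random.choice(ACTIONS)
    else:
        action = max(ACTIONS, key=lambda a: q[(state, a)])
    r, next_state = reward(state, action), step(state, action)
    # Q-learning update: nudge the estimate toward reward plus discounted future value.
    best_next = max(q[(next_state, a)] for a in ACTIONS)
    q[(state, action)] += alpha * (r + gamma * best_next - q[(state, action)])
    state = next_state
```

The sketch only shows where the “goal” sits in the formalism: once reward is fixed, the learned policy inherits it wholesale, which is why the question of what that function should be, and whether the agent itself could question it, matters for what follows.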
There are many different variations on the basic learning problem faced by reinforcement agents. The environment may be deterministic or stochastic. The agent may or may not have a model of the environment that predicts what transitions will take place given various actions. The agent may balance present and future reward in different ways. The policy that agents use to select an action may be deterministic, selecting a specific action for each state of the environment, or stochastic, selecting a probability distribution over actions for each state. The reward agents receive may come with greater or lesser frequency. The agent may or may not have a value function that predicts future reward, given a specific policy. Which kind of RL agent, operating in which kind of environment, would succeed in making moral judgments? Where in these formalisms can we locate the moral judgment?
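To keep that design space in view, here is a sketch that simply records the choices just listed as explicit options; the field names are hypothetical labels, not standard terminology from any RL library. The open question is which point in this space, if any, leaves room for something recognizable as moral judgment.

```python
# Hypothetical inventory of the design choices described above.

from dataclasses import dataclass

@dataclass
class RLDesign:
    stochastic_environment: bool   # do state transitions involve chance?
    model_based: bool              # does the agent predict transitions before acting?
    discount_factor: float         # how present vs. future reward is balanced (0 to 1)
    stochastic_policy: bool        # does the policy output a distribution over actions?
    sparse_rewards: bool           # does reward arrive rarely rather than at every step?
    learns_value_function: bool    # does the agent predict long-run reward?

# One point in the space: a model-free agent with a stochastic policy,
# sparse rewards, and a learned value function.
example = RLDesign(
    stochastic_environment=True,
    model_based=False,
    discount_factor=0.99,
    stochastic_policy=True,
    sparse_rewards=True,
    learns_value_function=True,
)
```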
A reinforcement agent’s reward function is something that is both “objective” in the sense that it isn’t determined by the agent and “intrinsically motivating” in that it determines the policy the agent learns and the actions it ultimately takes. However, an agent with a fully specified reward function seems to lack the kind of agency required to make moral judgments, since it has no reflective distance from the goal of maximizing that specified reward.
This is similar to the problem of trying to build “moral machines” by using supervised learning to predict the moral judgments of humans. Systems designed along these lines would not be holding their own motivations at arm’s length in the way moral judgment requires. Moreover, this approach risks calcifying moral thought, since machines would be aping the moral judgments of imperfect humans at a particular time and place. True moral reasoning is more dynamic and adaptive.
More promising would be RL agents who were uncertain about their reward function and had to learn about it through their actions. This is what Russell proposes. But the nature of this uncertainty is critical. On Russell’s view, machines should be initially uncertain about their reward function and should learn about it by observing human behavior. He admits that, with enough observation, an RL agent may become completely confident about the human reward it aims to maximize. However, these kinds of agents would lack the kind of reflective distance characteristic of moral judgment. Even if a machine is certain that some course of action would maximize human reward, it should still be able to ask if it is right to pursue it.
We could imagine agents that are always uncertain about the reward they are trying to maximize. But what kind of uncertainty is needed? Is it the kind of uncertainty we can express as a probability distribution over different reward functions, or is it a deeper kind of uncertainty that resists such characterization? What mechanism can ensure that some degree of uncertainty persists? How should machines choose a policy given the persistent kind of uncertainty that moral concepts seem to engender?
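The first of those options can at least be made concrete. Below is a sketch, with made-up names and a toy pair of candidate reward functions, of an agent that keeps a probability distribution over reward functions, updates it by watching a human choose (under an assumed Boltzmann-rational observation model, in the spirit of Bayesian inverse reward learning), and mixes a little uniform mass back in after every update so that it never becomes fully certain.

```python
# Sketch: persistent uncertainty as a posterior over candidate reward functions.

import math

ACTIONS = ["keep_promise", "break_promise"]

# Hypothetical candidate reward functions the agent entertains.
CANDIDATES = {
    "promise_keeping": lambda a: 1.0 if a == "keep_promise" else 0.0,
    "convenience":     lambda a: 1.0 if a == "break_promise" else 0.0,
}

# Start with a uniform distribution over the candidates.
posterior = {name: 1.0 / len(CANDIDATES) for name in CANDIDATES}

def likelihood(observed_action, reward_fn, beta=2.0):
    # Boltzmann observation model: the human is assumed to be more likely
    # to pick actions with higher reward under the candidate function.
    weights = {a: math.exp(beta * reward_fn(a)) for a in ACTIONS}
    return weights[observed_action] / sum(weights.values())

def update(observed_action, mix=0.1):
    # Bayesian update, then mix a little uniform mass back in so that no
    # candidate is ever ruled out entirely: a crude stand-in for the
    # "persistent uncertainty" the questions above ask about.
    for name, fn in CANDIDATES.items():
        posterior[name] *= likelihood(observed_action, fn)
    total = sum(posterior.values())
    for name in posterior:
        posterior[name] = (1 - mix) * (posterior[name] / total) + mix / len(CANDIDATES)

for observed in ["keep_promise", "keep_promise", "break_promise"]:
    update(observed)
print(posterior)  # leans toward promise_keeping, but never becomes certain
```

The mixing step is, of course, an ad hoc device: nothing in it explains why the agent should remain uncertain; it merely stipulates that it does. And whether this probabilistic picture captures the deeper, more resistant uncertainty that moral concepts seem to involve is precisely the open question.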
Even if we had satisfactory answers to these questions, other complications remain. Unlike the contents of our moral judgments, an RL agent’s reward is not something that is both grounded in, and also independent of, the environment. Likewise, while an RL agent’s predictions about reward – its value function – share some features with moral judgment in being both belief-like and desire-like, they don’t seem to achieve the reflective distance indicative of moral judgment. An agent’s value function is not a way for it to hold a mirror up to its own motivations and decide which to endorse and which to reject.
Finally, in applications of RL, actions are usually individuated in simple ways: a move in a chess game, the selection of the next word or the next piece of content, and so on. But when humans act, our actions are individuated by the knowledge, motives and intentions we bring to them. One and the same move in a chess game might be a blunder, a way to keep the game interesting, a kindness shown to a child, or an attempt to hustle an opponent. If we want to build machines that make moral judgments, we will need to think about their actions in more sophisticated ways.
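As a crude illustration of what more sophisticated individuation might involve, here is a sketch (with hypothetical names throughout) in which an action is identified not just by the bare token move but also by the description, intention, and motive under which it is performed; two actions can then share a token move while differing as actions.

```python
# Hypothetical sketch: individuating actions by more than the bare move.

from dataclasses import dataclass

@dataclass(frozen=True)
class Action:
    token_move: str   # the bare event, e.g. a move in a chess game
    description: str  # the description under which it is performed
    intention: str    # what the agent is trying to do in performing it
    motive: str       # why the agent cares about doing it

# The "same" chess move, individuated as two different actions:
blunder = Action("Qd5", "capturing a pawn", "win material", "competitiveness")
kindness = Action("Qd5", "letting the child win my queen", "keep the game fun", "generosity")

assert blunder.token_move == kindness.token_move  # one and the same move...
assert blunder != kindness                        # ...but not the same action
```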
§6 The Path Forward
Despite the concerns I’ve raised, I see no reason to think building machines that make moral judgments is impossible. We may be able to find computationally useful notions of agent, action, reward, value and uncertainty that will allow us to build machines that have reflective distance from their own motivations and that hold themselves to an external standard that resists specification. If we are going to progress along the path to good AGI, we need to confront the philosophical puzzles raised by moral judgment in the unfamiliar context of machine learning. This project is just beginning.
Notes
[1] Some have also looked at particularist approaches to building “moral machines.” According to particularism, there are no useful general moral principles. So on these views, we would need to find ways for machines to learn what is right or good that didn’t involve the use of such principles. The point I’m making in this paragraph, however, would still remain: one could build these kinds of particularist “moral machines,” without building machines that make judgments with moral content.