What do coherence arguments imply about the behavior of advanced AI?

Published 8 April 2021

This is an initial page, in the process of review, which may not be comprehensive or represent the best available understanding.

Coherence arguments say that if an entity’s preferences do not adhere to the axioms of expected utility theory, then that entity is susceptible to losing things that it values.

This does not imply that advanced AI systems must adhere to these axioms (‘be coherent’), or that they must be goal-directed.

Such arguments do appear to suggest that there will be non-zero pressure for advanced AI to become more coherent, and arguably also more ‘goal-directed’, given some minimal initial level of goal-directedness.

Details

Motion toward coherence

Expected utility maximization

‘Maximizing expected utility’ is a decision-making strategy in which you assign a value to each possible ‘outcome’, assign a probability to each outcome conditional on each of your available actions, and then always choose the action whose resulting outcomes have the highest ‘expected value’ (the average value of the outcomes, weighted by their probabilities).
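As a minimal sketch of this strategy, consider a toy decision problem. The action names, outcomes, probabilities and utility numbers below are invented purely for illustration and do not come from the arguments discussed on this page.

```python
def expected_utility(outcome_probs, utility):
    """Probability-weighted average utility of one action's outcomes."""
    return sum(p * utility[outcome] for outcome, p in outcome_probs.items())


def best_action(actions, utility):
    """Return the action whose outcomes have the highest expected utility."""
    return max(actions, key=lambda a: expected_utility(actions[a], utility))


# Hypothetical values assigned to each possible outcome.
utility = {"apple": 10.0, "pear": 6.0, "nothing": 0.0}

# Hypothetical P(outcome | action) for each available action.
actions = {
    "climb_tree": {"apple": 0.5, "nothing": 0.5},
    "visit_shop": {"pear": 0.9, "nothing": 0.1},
}

for name, probs in actions.items():
    print(name, expected_utility(probs, utility))
print("chosen:", best_action(actions, utility))
```

Here the shop is chosen even though the apple is the most valued single outcome, because the choice depends on the probability-weighted average rather than on the best case.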

Coherence arguments

‘Coherence arguments’¹ demonstrate that if one’s preferences cannot be understood as ‘maximizing expected utility’, then one can be manipulated into giving up things that one values for no gain.

For instance, one coherence argument notes that if you have ‘circular preferences’ then you will consent to a series of decisions that leaves you worse off, given your preferences:

Suppose you prefer:

apple over pear
pear over quince
quince over apple
any fruit over nothing

Then there is some tiny amount of money you would pay to go from apple to quince, and quince to pear, and pear to apple. At which point, you have spent money and are back where you started. If it is also possible to buy all of these fruits with money, then losing money means you have given up some amount of fruit for nothing, and by assumption you do want all of the fruit.
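The cycle can be simulated in a few lines. This is only a sketch under stated assumptions: the one-cent fee, the starting balance and the function names are hypothetical, and the agent is assumed to accept any swap to a fruit it strictly prefers.

```python
# Strict preferences from the example, as (preferred, over) pairs.
PREFERS = {("apple", "pear"), ("pear", "quince"), ("quince", "apple")}

FEE_CENTS = 1  # hypothetical "tiny amount of money" paid per accepted swap


def accepts_swap(held, offered):
    """The agent pays the fee whenever it strictly prefers the offered fruit."""
    return (offered, held) in PREFERS


def run_money_pump(start_fruit, offers, money_cents):
    """Offer a sequence of swaps; return the final fruit and remaining money."""
    fruit = start_fruit
    for offered in offers:
        if accepts_swap(fruit, offered):
            fruit = offered
            money_cents -= FEE_CENTS
    return fruit, money_cents


# One trip around the cycle: apple -> quince -> pear -> apple.
fruit, money_cents = run_money_pump("apple", ["quince", "pear", "apple"], 100)
print(fruit, money_cents)
```

After one trip around the cycle the agent holds the apple it started with and has three cents less. Repeating the same offers drains more money each time, and every individual swap was one the agent preferred to make.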

If you avoid all such predictable losses, then according to the coherence arguments, you must be maximizing expected utility (as described above).

Coherence forces

That a certain characteristic of an entity’s ‘preferences’ makes it vulnerable to manipulation does not mean that it will not have that characteristic. In order for such considerations to change the nature of an entity, ignoring outside intervention, something like the following conditions need to hold:

The entity can detect the characteristic (which could be difficult, if it is a logical relationship between all of its ‘preferences’ which are perhaps not straightforwardly accessible or well-defined)
The realistic chance for loss is large enough to cause the entity to prioritize the problem
The entity is motivated to become coherent by the possibility of loss (versus for instance inferring that losing money is good, since it is equivalent to a set of exchanges that are each good)
The entity is in a position to alter its own preferences

Something similar might apply if versions of the above hold for an outside entity with power over the agent, e.g. its creators, though in that case it is less clear that ‘coherence’ is a further motivator beyond the motivation to have the agent’s preferences align with those of the outside entity (which would presumably coincide with coherence, to the extent that the outside entity had more coherent preferences).

Thus we say there is generally an incentive for coherence, but it may or may not actually cause an entity to change in the direction of coherence at a particular time. We can also describe this as a ‘coherence force’ or ‘coherence pressure’, pushing minds toward coherence, all things equal, but for all we know, so weakly as to be often irrelevant.

Coherence forces apply to entities with ‘preferences’

The coherence arguments only apply to creatures with ‘preferences’ that might be thwarted by their choices, so there are presumably possible entities that are not subject to any coherence forces, due to not having preferences of the relevant type.

Behavior of coherent creatures

Supposing entities are likely to become more coherent, all things equal, a natural question is how coherent entities differ from incoherent entities.

Coherence is consistent with any behavior

Any history of behavior that we observe in an agent is consistent with the agent being coherent, because the agent could have a utility function that rates that history higher than any other history. Rohin Shah discusses this.
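The construction behind this claim can be sketched directly. The function name and the example histories below are hypothetical, and the sketch treats histories as deterministic, so that expected utility reduces to plain utility.

```python
def rationalizing_utility(observed_history):
    """Build a utility function over whole histories that scores the
    observed history 1 and every other history 0."""
    observed = tuple(observed_history)

    def utility(history):
        return 1 if tuple(history) == observed else 0

    return utility


# A hypothetical, arbitrary-looking behavior history.
observed = ["twitch", "twitch", "pause", "twitch"]
u = rationalizing_utility(observed)

alternatives = [
    ["twitch", "twitch", "pause", "twitch"],
    ["pause", "pause", "pause", "pause"],
    ["twitch", "pause", "twitch", "pause"],
]

# Among the alternatives, the observed history is the unique maximizer of u.
best = max(alternatives, key=u)
print(best == observed)
```

Whatever history is passed in, an agent maximizing the resulting function would have behaved exactly as observed, which is why observed behavior alone cannot rule out coherence.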

Coherence and goal-directedness

Coherence doesn’t logically require goal-directedness

As Rohin Shah discusses, the above means that coherence does not imply ‘goal-directed’ behavior (however you choose to define that, if it doesn’t include all behavior):

Coherence arguments do not exclude any behavior
Non-goal-directed behavior is consistent with coherence arguments
Thus coherence arguments do not imply goal directed behavior

Increasing coherence seems likely to be associated with increased intuitive ‘goal-directedness’

The following hypotheses (quoted from this blog post) seem plausible (where goal-directedness_Rohin means something like ‘that which looks intuitively goal-directed’)²:

1. Coherence-reformed entities will tend to end up looking similar to their starting point but less conflicted
For instance, if a creature starts out being indifferent to buying red balls when they cost between ten and fifteen blue balls, it is more likely to end up treating red balls as exactly 12x the value of blue balls than it is to end up very much wanting the sequence where it takes the blue ball option, then the red ball option, then blue, red, red, blue, red. Or wanting red squares. Or wanting to ride a dolphin.

[…]
2. More coherent strategies are systematically less wasteful, and waste inhibits goal-direction_Rohin, which means more coherent strategies are more forcefully goal-directed_Rohin on average
In general, if you are sometimes a force for A and sometimes a force against A, then you are not moving the world with respect to A as forcefully as you would be if you picked one or the other. Two people intermittently changing who is in the driving seat, who want to go to different places, will not cover distance in any direction as effectively as either one of them. A company that cycles through three CEOs with different evaluations of everything will—even if they don’t actively scheme to thwart one another—tend to waste a lot of effort bringing in and out different policies and efforts (e.g. one week trying to expand into textiles, the next week trying to cut everything not involved in the central business).
3. Combining points 1 and 2 above, as entities become more coherent, they generally become more goal-directed_Rohin. As opposed to, for instance, becoming more goal-directed_Rohin on average, but individual agents being about as likely to become worse as better as they are reformed. Consider: a creature that values red balls at 12x blue balls is very similar to one that values them inconsistently, except a little less wasteful. So it is probably similar but more goal-directed_Rohin. Whereas it’s fairly unclear how goal-directed_Rohin a creature that wants to ride a dolphin is compared to one that wanted red balls inconsistently much. In a world with lots of balls and no possible access to dolphins, it might be much less goal-directed_Rohin, in spite of its greater coherence.
4. Coherence-increasing processes rarely lead to non-goal-directed_Rohin agents, like the one that twitches on the ground
In the abstract, few starting points and coherence-motivated reform processes will lead to an agent with the goal of carrying out a specific convoluted moment-indexed policy without regard for consequence, like Rohin’s twitching agent, or to valuing the sequence of history-action pairs that will happen anyway, or to being indifferent to everything. And these outcomes will be even less likely in practice, where AI systems with anything like preferences probably start out caring about much more normal things, such as money and points and clicks, so will probably land at a more consistent and shrewd version of that, if 1 is true. (Which is not to say that you couldn’t intentionally create such a creature.)

Thus it presently seems likely that coherence arguments correspond to a force for entities with something like ‘preferences’ to grow increasingly coherent, and generally increasingly goal-directed (intuitively defined).

Thus, to the extent that future advanced AI has preferences of the relevant kind, there appears to be a pressure for it to become more goal-directed. However, it is unclear what can be said generally about the strength of this force.

Primary author: Katja Grace

Notes

1. Or ‘coherence theorems’. Discussed further here.
2. Or precisely, from that post: ‘something like Rohin’s preferred usage: roughly, that which seems intuitively goal directed to us, e.g. behaving similarly across situations, and accruing resources, and not flopping around in possible pursuit of some exact history of personal floppage, or peaceably preferring to always take the option labeled ‘A’.’