The hot mess theory of AI misalignment: More intelligent agents behave less coherently

I hope to find time to give a more thorough reply later; what I say below is hasty and may contain errors.

(1) Define general competence factor as intelligence*coherence.

Take all the classic arguments about AI risk and ctrl-f "intelligence" and then replace it with "general competence factor."

The arguments now survive your objection, I think.

When we select for powerful AGIs, when we train them to do stuff for us, we are generally speaking also training them to be coherent. It's more accurate to say we are training them to have a high general competence factor than to say we are training them to be intelligent-but-not-necessarily-coherent. The ones that aren't so coherent will struggle to take over the world, yes, but they will also struggle to compete in the marketplace (and possibly even the training environment) with the ones that are.
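A toy sketch of the selection point in (1), not a model of any real training process. It assumes intelligence and coherence are independent and uniform on [0, 1] (my assumption, purely for illustration), scores each agent by intelligence*coherence, and keeps the top slice. The function name and parameters are hypothetical.

```python
import random
import statistics


def mean_coherence_before_and_after_selection(n_agents=10_000, top_fraction=0.05, seed=0):
    """Return (population mean coherence, mean coherence of the selected agents)."""
    rng = random.Random(seed)
    # Assumption: intelligence and coherence are independent, uniform on [0, 1].
    agents = [(rng.random(), rng.random()) for _ in range(n_agents)]
    # General competence factor as defined in (1): intelligence * coherence.
    ranked = sorted(agents, key=lambda a: a[0] * a[1], reverse=True)
    n_selected = max(1, int(n_agents * top_fraction))
    selected = ranked[:n_selected]
    population_mean = statistics.mean(c for _, c in agents)
    selected_mean = statistics.mean(c for _, c in selected)
    return population_mean, selected_mean


if __name__ == "__main__":
    pop, sel = mean_coherence_before_and_after_selection()
    print(f"mean coherence, all agents: {pop:.2f}")
    print(f"mean coherence, top 5% by competence: {sel:.2f}")
```

Under these assumptions the selected agents come out markedly more coherent than the population, because a low coherence score drags the product down no matter how high the intelligence score is.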

(2) I'm a bit annoyed by the various bits of this post/paper that straw man the AI safety community, e.g. saying that X is commonly assumed when in fact there are pages and pages of argumentation supporting X, which you can easily find with a google, and lots more pages of argumentation on both sides of the issue as to whether X.

Relatedly... I just flat-out reject the premise that most work on AI risk assumes that AI will be less of a hot mess than humans. I for one am planning for a world where AIs are about as much of a hot mess as humans, at least at first. I think it'll be a great achievement (relative to our current trajectory) if we can successfully leverage hot-mess AIs to end the acute risk period.

(3) That said, I'm intellectually curious/excited to discuss these results and arguments with you, and grateful that you did this research & posted it here. :) Onwards to solving these problems collaboratively!

[-]Unnamed1y125

Seems like the concept of "coherence" used here is inclined to treat simple stimulus-response behavior as highly coherent. E.g., the author puts a thermostat in the supercoherent unintelligent corner of one of his graphs.

But stimulus-response behavior, like a blue-minimizing robot, only looks like coherent goal pursuit in a narrow set of contexts. The relationship between its behavioral patterns and its progress towards goals is context-dependent, and will go off the rails if you take it out of the narrow set of contexts where it fits. That's not "a hot mess of self-undermining behavior", so it's not the lack-of-coherence that this question was designed to get at.

[-]David Johnston1y71

Here's a hypothesis about the inverse correlation arising from your observation: When we evaluate a thing's coherence, we sample behaviours in environments we expect to find the thing in. More intelligent things operate in a wider variety of environments, and the environmental diversity leads to behavioural diversity that we attribute to a lack of coherence.

[-]zeshen1y10

Without thinking about it too much, this fits my intuitive sense. An amoeba can't possibly demonstrate a high level of incoherence because it simply can't do a lot of things, and whatever it does would have to be very much in line with its goal (?) of survival and reproduction.

[-]Unnamed1y60

A hypothesis for the negative correlation:

More intelligent agents have a larger set of possible courses of action that they're potentially capable of evaluating and carrying out. But picking an option from a larger set is harder than picking an option from a smaller set. So max performance grows faster than typical performance as intelligence increases, and errors look more like 'disarray' than like 'just not being capable of that'. E.g., compare a human who left the window open while running the heater on a cold day with a thermostat that left the window open while running the heater.

A Second Hypothesis: Higher intelligence often involves increasing generality - having a larger set of goals, operating across a wider range of environments. But that increased generality makes the agent less predictable by an observer who is modeling the agent as using means-ends reasoning, because the agent is not just relying on a small number of means-ends paths in the way that a narrower agent would. This makes the agent seem less coherent in a sense, but that is not because the agent is less goal-directed (indeed, it might be more goal-directed and less of a stimulus-response machine).

These seem very relevant for comparing very different agents: comparisons across classes, or of different species, or perhaps for comparing different AI models. Less clear that they would apply for comparing different humans, or different organizations.
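A rough sketch of the first hypothesis above (larger option sets), under assumptions that are mine and not from the post: each option has a true value drawn from a standard normal, the agent only sees a noisy estimate of each value, and it picks the option with the highest estimate. "Best" stands in for max performance and "chosen" for typical performance; both names, and the noise model, are hypothetical.

```python
import random
import statistics


def best_vs_chosen(n_options, noise_sd=1.0, n_trials=5_000, seed=0):
    """Return (mean value of the best option, mean value of the option actually chosen)."""
    rng = random.Random(seed)
    best_values, chosen_values = [], []
    for _ in range(n_trials):
        # Assumption: true option values are independent standard normals.
        true_values = [rng.gauss(0.0, 1.0) for _ in range(n_options)]
        # Assumption: the agent evaluates each option with independent Gaussian noise.
        estimates = [v + rng.gauss(0.0, noise_sd) for v in true_values]
        chosen_index = max(range(n_options), key=lambda i: estimates[i])
        best_values.append(max(true_values))
        chosen_values.append(true_values[chosen_index])
    return statistics.mean(best_values), statistics.mean(chosen_values)


if __name__ == "__main__":
    for k in (2, 10, 50):
        best, chosen = best_vs_chosen(k)
        print(f"{k:>2} options: best={best:.2f} chosen={chosen:.2f} gap={best - chosen:.2f}")
```

In this toy setup both numbers rise with the number of options, but the gap between them widens, which is one way "max performance grows faster than typical performance" could show up.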

[-]Lauro Langosco1y74

(Crossposting some of my twitter comments).

I liked this criticism of alignment approaches: it makes a concrete claim that addresses the crux of the matter, and provides supporting evidence! I also disagree with it, and will say some things about why.

I think that instead of thinking in terms of "coherence" vs. "hot mess", it is more fruitful to think about "how much influence is this system exerting on its environment?". Too much influence will kill humans, if directed at an outcome we're not able to choose. (The rest of my comments are all variations on this basic theme).
We humans may be a hot mess, but we're far better at influencing (optimizing) our environment than any other animal or ML system. Example: we build helicopters and roads, which are very unlikely to arise by accident in a world without people trying to build helicopters or roads. If a system is good enough at achieving outcomes, it is dangerous whether or not it is a "hot mess".
It's much easier for us to describe simple behaviors as utility maximization; for example a ball rolling down a hill is well-described as minimizing its potential energy. So it's natural that people will rate a dumb / simple system as being more easily described by a utility function than a smart system with complex behaviors. This does not make the smart system any less dangerous.
Misalignment risk is not about expecting a system to "inflexibly" or "monomaniacally" pursue a simple objective. It's about expecting systems to pursue objectives at all. The objectives don't need to be simple or easy to understand.
Intelligence isn't the right measure to have on the X-axis - it evokes a math professor in an ivory tower, removed from the goings-on in the real world. A better word might be capability: "how good is this entity at going out into the world and getting more of what it wants?"
In practice, AI labs are working on improving capability, rather than intelligence defined abstractly in a way that does not connect to capability. And capability is about achieving objectives.
If we build something more capable than humans in a certain domain, we should expect it to be "coherent" in the sense that it will not make any mistakes that a smart human wouldn't have made. Caveat: it might make more of a particular kind of mistake, and make up for it by being better at other things. This happens with current systems, and IMO plausibly we'll see something similar even in the kind of system I'd call AGI. But at some point the capabilities of AI systems will be general enough that they will stop making mistakes that are exploitable by humans. This includes mistakes like "fail to notice that your programmer could shut you down, and that would stop you from achieving any of your objectives".

[-]Charlie Steiner1y71

The organizations one seems like an obvious collider - you got the list by selecting for something like "notability," which is contributed to by both intelligence and coherence, and so on the sample it makes sense they're anticorrelated.

But I think the rankings for animals/plants aren't like that. Instead, it really seems to trade on what people mean by "coherence" - here I agree with Unnamed, it seems like "coherence" is getting interpreted as "simplicity of models that work pretty well to describe the thing," even if those models don't look like utility maximizers. Put an oak tree in a box with a lever that dispenses water, and it won't pull the lever when it's thirsty, but because the overall model that describes an oak tree is simpler than the model that describes a rat, it feels "coherent." This is a fine way to use the word, but it's not quite what's relevant to arguments about AI.
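The collider point about the organizations list can be sketched with a small simulation. This is only an illustration under assumptions I'm making up: intelligence and coherence are independent standard normals, "notability" is simply their sum, and the list is the top 5% by notability. The function names are hypothetical.

```python
import math
import random


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def collider_demo(n=20_000, top_fraction=0.05, seed=0):
    """Return (correlation in the full population, correlation among the 'notable')."""
    rng = random.Random(seed)
    # Assumption: intelligence and coherence are independent in the full population.
    population = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(n)]
    # Assumption: notability = intelligence + coherence; keep only the most notable.
    ranked = sorted(population, key=lambda p: p[0] + p[1], reverse=True)
    notable = ranked[: max(2, int(n * top_fraction))]
    r_all = pearson([i for i, _ in population], [c for _, c in population])
    r_notable = pearson([i for i, _ in notable], [c for _, c in notable])
    return r_all, r_notable


if __name__ == "__main__":
    r_all, r_notable = collider_demo()
    print(f"correlation, everyone: {r_all:+.2f}")
    print(f"correlation, notable only: {r_notable:+.2f}")
```

Even though the two traits are uncorrelated in the full population here, they come out clearly anticorrelated within the selected sample, which is the pattern selection on a shared effect would produce.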

[-]Hoagy1y40

Put an oak tree in a box with a lever that dispenses water, and it won't pull the lever when it's thirsty

I actually thought this was a super interesting question, just for general world modelling. The tree won't pull a lever because it barely has the capability to do so and no prior that it might work, but it could, like, control a water dispenser via sap distribution to a particular branch. In that case will the tree learn to use it?

Ended up finding an article on attempts to show learned behavioural responses to stimuli in plants, "On the Conditioning of Plants: A Review of Experimental Evidence". It turns out there have been some positive results, but they seem not to have replicated, as well as lots of negative results, so my guess is that no, even if it is given direct control, the tree won't control its own water supply. More generally, this would agree that plants lack the information-processing systems to coherently use their tools.

Experiments are mostly done with M. pudica because it shows (fairly) rapid movement to close up its leaves when shaken.

[-]Charlie Steiner1y20

Huh, really neat.

[-]tailcalled1y31

Not sure I would agree about a single ant being coherent. Aren't ants super dependent on their colonies for reasonable behavior? Like they use pheromone trails to find food and so on.

Also AFAIK a misplaced pheromone trail can lead an ant to walk in circles, which is the archetypal example of incoherence. But I don't know how much that happens in practice.

[-]David Johnston1y32

This is a cool result - I think it's really not obvious why intelligence and "coherence" seem inversely correlated, but it's interesting that you replicated it across three different classes of things (ML models, animals, organisations).

[-]Daniel Kokotajlo1y52

I think it's misleading to describe this as finding that intelligence and coherence are actually inversely correlated. Rather, survey respondents' ratings of intelligence and ratings of coherence were inversely correlated.

[-]David Johnston1y31

Sure, and that's why I said "coherence" instead of coherence

[-]sudo1y10

Epistemic status: clumsy

An AI could also be misaligned because it acts in ways that don't pursue any consistent goal (incoherence).

It’s worth noting that this definition of incoherence seems inconsistent with VNM. E.g., a rock might satisfy the folk definition of “pursuing a consistent goal,” but fail to satisfy VNM due to lacking completeness (and by corollary due to not performing expected utility optimization over the outcome space).

[-]sudo1y20

Strong upvoted.

The result is surprising and raises interesting questions about the nature of coherence. Even if this turns out to be a fluke, I predict that it’d be an informative one.

[-]Aaron_Scher1y10

I found it hard to engage with this, partially because of motivated reasoning: I think my prior beliefs, which expect very intelligent and coherent AGIs, are correct. Overcoming this bias is hard and I sometimes benefit from clean presentation of solid experimental results, which this study lacked, making it extra hard for me to engage with. Below are some of my messy thoughts, with the huge caveat that I have yet to engage with these ideas from as neutral a prior as I would like.
This is an interesting study to conduct. I don’t think its results, regardless of what they are, should update anybody much because:
- The study is asking a small set (n = 5-6) of ~random people to rank various entities based on vague definitions of intelligence and coherence; we shouldn’t expect a process like this to provide strong evidence for any conclusion
- Asking people to rate items on coherence and intelligence feels pretty different from carefully thinking about the properties of each item. I would rather see the author pick a few items from around the spectrums and analyze each in depth (though if that’s all they did I would be complaining about a lack of more items lol)
- A priori I don’t necessarily expect a strong intelligence-coherence correlation for lower levels of intelligence, and I don’t think findings of this type are very useful for thinking about super-intelligence. Convergent instrumental subgoals are not a thing for sufficiently unintelligent agents, and I expect coherence as such a subgoal to kick in at fairly high intelligence levels (definitely above average human, at least based on observing many humans be totally incoherent, which isn’t exactly a priori reasoning). I dislike that this is the state of my belief, because it’s pretty much like “no experiment you can run right now would get at the crux of my beliefs,” but I do think it’s the case here that we can only learn so much from observing non-superintelligent systems.
- The terms, especially “coherence”, seem pretty poorly defined in a way that really hurts the usefulness of the study, as some commenters have pointed out
My takes about the results
- Yep the main effect seems to be a negative relationship between intelligence and coherence
- The poor inter-rater reliability for coherence seems like a big deal
- Figure 6 (most important image) seems really wacky. It seems to imply that all the AI models are ranked on average lower in intelligence than everything else, including oak tree and ant. This just seems slightly wild because I think most people who interact with AI would disagree with such rankings. Out of 5 respondents, only 2 ranked GPT-3 as more intelligent than an oak tree.
- Isolating just the humans (a plot for this is not shown for some reason) seems like it doesn’t support the author’s hypothesis very much. I think this is in line with some prediction like “for low levels of intelligence there is not a strong relationship to coherence, and then as you get in the high human level this changes”
Other thoughts
- The spreadsheet sheets are labeled “alignment subject response” and “intelligence subject response”. Alignment is definitely not the same as coherence. I mostly trust that the author isn’t pulling a fast one on us by fudging the original purpose of the study or the directions they gave to participants, but my prior trust for people on the internet is somewhat low and the experimental design here does seem pretty janky.
- Figures 3-5, as far as I can tell, are only using the rank compared to other items in the graph, as opposed to all 60. This threw me off for a bit and I think might not be a great analysis, given that participants ranked all 60 together rather than in category batches.
- Something something inner optimizers and inner agent conflicts might imply a lack of coherence in superintelligent systems, but systems like this still seem quite dangerous.
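To make the point about Figures 3-5 concrete, here is a minimal sketch of what re-ranking within a category does to ranks that were originally assigned across all 60 items. The item names and rank values are invented for illustration and are not the study's data; `rerank_within` is a hypothetical helper.

```python
def rerank_within(global_ranks, subset):
    """Re-rank the items in `subset` from 1..k, preserving their global order."""
    ordered = sorted(subset, key=lambda name: global_ranks[name])
    return {name: position for position, name in enumerate(ordered, start=1)}


# Hypothetical global ranks out of 60 (made-up numbers, not from the spreadsheet).
global_ranks = {"model_a": 2, "model_b": 5, "animal_a": 30, "animal_b": 58}

models = rerank_within(global_ranks, ["model_a", "model_b"])
animals = rerank_within(global_ranks, ["animal_a", "animal_b"])

print(models)   # {'model_a': 1, 'model_b': 2}
print(animals)  # {'animal_a': 1, 'animal_b': 2}
```

After re-ranking, the two categories look identical even though one sat near the bottom of the global ranking and the other near the top, so the per-category plots cannot show where a category falls relative to the rest of the 60 items.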

[-]Tapatakt1y10

I don't think that a measurement of the concept of "coherence" which implies that an ant is more coherent than AlphaGo is valuable in this context.

However, I think that pointing out the assumption about the relationship between intelligence and coherence is.

[-]mishka1y10

Very interesting.

In favor:

1) The currently leading models (LLMs) are ultimate hot messes;

2) The whole point of G in AGI is that it can do many things; focusing on a single goal is possible, but is not a "natural mode" for general intelligence.

Against:

A superintelligent system will probably have enough capacity overhang to create multiple threads which would look to us like supercoherent superintelligent threads, so even a single system is likely to lead to multiple "virtual supercoherent superintelligent AIs" among other less coherent and more exploratory behaviors it would also perform.

[-]mishka1y10

But it's a good argument against a supercoherent superintelligent singleton (even a single system which does have supercoherent superintelligent subthreads is likely to have a variety of those).

[-]reallyeli1y11

I think this is taking aim at Yudkowskian arguments that are not cruxy for AI takeover risk as I see it. The second species doesn't need to be supercoherent in order to kill us or put us in a box; human levels of coherence will do fine for that.
