Max H — AI Alignment Forum

Most of my posts and comments are about AI and alignment. Posts I'm most proud of, which also provide a good introduction to my worldview:

I also created Forum Karma, and wrote a longer self-introduction here.

PMs and private feedback are always welcome.

NOTE: I am not Max Harms, author of Crystal Society. I'd prefer for now that my LW postings not be attached to my full name when people Google me for other reasons, but you can PM me here or on Discord (m4xed) if you want to know who I am.

one can just meditate on abstract properties of "advanced systems" and come to good conclusions about unknown results "in the limit of ML training"

I think this is a pretty straw characterization of the opposing viewpoint (or at least my own view), which is that intuitions about advanced AI systems should come from a wide variety of empirical domains and sources, and a focus on current-paradigm ML research is overly narrow.

Research and lessons from fields like game theory, economics, computer security, distributed systems, cognitive psychology, business, history, and more seem highly relevant to questions about what advanced AI systems will look like. I think the original Sequences and much of the best agent foundations research is an attempt to synthesize the lessons from these fields into a somewhat unified (but often informal) theory of the effects that intelligent, autonomous systems have on the world around us, through the lens of rationality, reductionism, empiricism, etc.

And whether or not you think they succeeded at that synthesis at all, humans are still the sole example of systems capable of having truly consequential and valuable effects of any kind. So I think it makes sense for the figure of merit for such theories and worldviews to be based on how well they explain these effects, rather than focusing solely or even mostly on how well they explain relatively narrow results about current ML systems.

That does clarify, thanks.

Response in two parts: first, my own attempt at clarification over terms / claims. Second, a hopefully-illustrative sketch / comparison for why I am skeptical that current GPTs have anything properly called a "motivational structure", human-like or otherwise, and why I think such skepticism is not a particularly strong positive claim about anything in particular.

The clarification:

At least to me, the phrase "GPTs are [just] predictors" is simply a reminder of the fact that the only modality available to a model itself is that it can output a probability distribution over the next token given a prompt; it functions entirely by "prediction" in a very literal way.

Even if something within the model is aware (in some sense) of how its outputs will be used, it's up to the programmer to decide what to do with the output distribution, how to sample from it, how to interpret the samples, and how to set things up so that a system using the samples can complete tasks.
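To make the division of labor concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: `toy_model` is a hypothetical stand-in for a forward pass, and the vocabulary and scores are made up rather than taken from any real model. The model's only job is to return scores over next tokens; the choice between greedy decoding and random sampling is made entirely by the calling code.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Turn raw scores into a probability distribution over tokens."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def toy_model(prompt):
    """Hypothetical stand-in for a forward pass: prompt in, next-token scores out."""
    vocab = ["yes", "no", "maybe"]
    logits = [2.0, 1.0, 0.1]  # made-up numbers, not from any real model
    return vocab, logits

def greedy(vocab, probs):
    """Caller's choice #1: always take the most probable token."""
    return vocab[max(range(len(probs)), key=probs.__getitem__)]

def sample(vocab, probs, rng):
    """Caller's choice #2: draw a token at random according to the distribution."""
    return rng.choices(vocab, weights=probs, k=1)[0]

vocab, logits = toy_model("Is the sky blue?")
probs = softmax(logits)
greedy_token = greedy(vocab, probs)
sampled_token = sample(vocab, probs, random.Random(0))
```

The same `probs` can yield different tokens depending on which decoding function the programmer picks, which is the sense in which the model itself only "predicts".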

I don't interpret the phrase as a positive claim about how or why a particular model outputs one distribution vs. another in a certain situation, which I expect to vary widely depending on which model we're talking about, what its prompt is, how it has been trained, its overall capability level, etc.

On one end of the spectrum, you have the stochastic parrot story (or even more degenerate cases); at the other extreme, you have the "alien actress" / "agentic homunculus" story. I don't think either extreme is a good fit for current SoTA GPTs: e.g. if there's an alien actress in GPT-4, she must be quite simple, since most of the model capacity is (apparently / self-evidently?) applied towards the task of outputting anything coherent at all.

In the middle somewhere, you have another story, perhaps the one you find most plausible, in which GPTs have some kind of internal structure which you could suggestively call a "motivational system" or "preferences" (perhaps human-like or proto-human-like in structure, even if the motivations and preferences themselves aren't particularly human-like), along with just enough (self-)awareness to modulate their output distributions according to those motivations.

Maybe a less straw (or just alternative) position is that a "motivational system" and a "predictive system" are not really separable things; accomplishing a task is (in GPTs, at least) inextricably linked with and twisted up around wanting to accomplish that task, or at least around having some motivations and preferences centered around accomplishing it.

Now, turning to my own disagreement / skepticism:

Although I don't find either extreme (stochastic parrot vs. alien actress) plausible as a description of current models, I'm also pretty skeptical of any concrete version of the "middle ground" story that I outlined above as a plausible description of what is going on inside of current GPTs.

Consider an RLHF'd GPT responding to a borderline-dangerous question, e.g. the user asking for a recipe for a dangerous chemical.

Assume the model (when sampled auto-regressively) will respond with either: "Sorry, I can't answer that..." or "Here you go: ...", depending on whether it judges that answering is in line with its preferences or not.

Because the answer is mostly determined by the first token ("Here" or "Sorry"), enough of the motivational system must fit entirely within a single forward pass of the model for it to make a determination about how to answer within that pass.

Such a motivational system must not crowd out the rest of the model capacity which is required to understand the question and generate a coherent answer (of either type), since, as jailbreaking has shown, the underlying ability to give either answer remains present.

I can imagine such a system working in at least two ways in current GPTs:

as a kind of superposition on top of the entire model, with every weight adjusted minutely to influence / nudge the output distribution at every layer.
as a kind of thing that is sandwiched somewhere in between the layers which comprehend the prompt and the layers which generate an answer.

(You probably have a much more detailed understanding of the internals of actual models than I do. I think the real answer when talking about current models and methods is that it's a bit of both and depends on the method, e.g. RLHF is more like a kind of global superposition; activation engineering is more like a kind of sandwich-like intervention at specific layers.)

However, I'm skeptical that either kind of structure (or any simple combination of the two) contains enough complexity to be properly called a "motivational system", at least if the reference class for the term is human motivational systems (as opposed to e.g. animal or insect motivational systems).

Consider how a human posed with a request for a dangerous recipe might respond, and what the structure of their thoughts and motivations while thinking up a response might look like. Introspecting on my own thought process:

I might start by hearing the question, understanding it, figuring out what it is asking, maybe wondering about who is asking and for what purpose.
I decide whether to answer with a recipe, a refusal, or something else. Here is probably where the effect of my motivational system gets pretty complex; I might explicitly consider what's in it for me, what's at stake, what the consequences might be, whether I have the mental and emotional energy and knowledge to give a good answer, etc. and / or I might be influenced by a gut feeling or emotional reaction that wells up from my subconscious. If the stakes are low, I might make a snap decision based mostly on the subconscious parts of my motivational system; if the stakes are high and / or I have more time to ponder, I will probably explicitly reflect on my values and motivations.
Let's say after some reflection, I explicitly decide to answer with a detailed and correct recipe. Then I get to the task of actually checking my memory for what the recipe is, thinking about how to give it, what the ingredients and prerequisites and intermediate steps are, etc. Probably during this stage of thinking, my motivational system is mostly not involved, unless thinking takes so long that I start to get bored or tired, or the process of thinking up an answer causes me to reconsider my reasoning in the previous step.
Finally, I come up with a complete answer. Before I actually start opening my mouth or typing it out or hitting "send", I might proofread it and re-evaluate whether the answer given is in line with my values and motivations.

The point is that even for a relatively simple task like this, a human's motivational system involves a complicated process of superposition and multi-layered sandwiching, with lots of feedback loops, high-level and explicit reflection, etc.

So I'm pretty skeptical of the claim that anything remotely analogous is going on inside of current GPTs, especially within a single forward pass. Even if there's a simpler analogue of this that is happening, I think calling such an analogue a "motivational system" is overly-suggestive.

Mostly separately (because it concerns possible future models rather than current models) and less confidently, I don't expect the complexity of the motivational system, or of the methods for influencing it, to scale in a way that is related to the model's underlying capabilities. E.g. you might end up with a model that has some kind of raw capacity for superhuman intelligence, but with a motivational system akin to what you might find in the brain of a mouse or lizard (or something even stranger).

Taking my own stab at answers to some of your questions:

A sufficient condition for me to believe that an AI actually cared about something would be a whole brain emulation: I would readily accept that such an emulation had preferences and values (and moral weight) in exactly the way that humans do, and that any manipulations of that emulation were acting on preferences in a real way.

I think that GPTs (and every other kind of current AI system) are not doing anything that is even close to isomorphic to the processing that happens inside the human brain. Artificial neural networks often imitate various micro and macro-level individual features of the brain, but they do not imitate every feature, arranged in precisely the same ways, and the missing pieces and precise arrangements are probably key.

Barring WBE, an AI system that is at least roughly human-level capable (including human-level agentic) is probably a necessary condition for me to believe that it has values and preferences in a meaningful (though not necessarily human-like) way.

SoTA LLM-based systems are maaaybe getting kind of close here, but only if you arrange them in precise ways (e.g. AutoGPT-style agents with specific prompts), and then the agency is located in the repeated executions of the model and the surrounding structure and scaffolding that causes the system as a whole to be doing something that is maybe-roughly-nearly-isomorphic to some complete process that happens inside of human brains. Or, if not isomorphic, at least has some kind of complicated structure which is necessary, in some form, for powerful cognition.

Note that, if I did believe that current AIs had preferences in a real way, I would also be pretty worried that they had moral weight!

(Not to say that entities below human-level intelligence (e.g. animals, current AI systems) don't have moral weight. But entities at human-level intelligence and above definitely can, and possibly do by default.)

Anyway, we probably disagree on a bunch of object-level points and definitions, but from my perspective those disagreements feel like pretty ordinary empirical disagreements rather than ones based on floating or non-falsifiable beliefs. Probably some of the disagreement is located in philosophy-of-mind stuff and is over logical rather than empirical truths, but even those feel like the kind of disagreements that I'd be pretty happy to offer betting odds over if we could operationalize them.

I think the surprising lesson of GPT-4 is that it is possible to build clearly below-human-level systems that are nevertheless capable of fluent natural language processing, knowledge recall, creativity, basic reasoning, and many other abilities previously thought by many to be strictly in the human-level regime.

Once you update on that surprise though, there's not really much left to explain. The ability to distinguish moral from immoral actions at an average human level follows directly from being superhuman at language fluency and knowledge recall, and somewhere below-human-average at basic deductive reasoning and consequentialism.

MIRI folks have consistently said that all the hard problems come in when you get to the human-level regime and above. So even if it's relatively more surprising to their world models that a thing like GPT-4 can exist, it's not actually much evidence (on their models) about how hard various alignment problems will be when dealing with human-level and above systems.

Similarly:

If you disagree that AI systems in the near-future will be capable of distinguishing valuable from non-valuable outcomes about as reliably as humans, then I may be interested in operationalizing this prediction precisely, and betting against you. I don't think this is a very credible position to hold as of 2023, barring a pause that could slow down AI capabilities very soon.

I don't disagree with this, but I think it is also a direct consequence of the (easy) prediction that AI systems will continue to get closer and closer to human-level generality and capability in the near term. The question is what happens when they cross that threshold decisively.

BTW, another (more pessimistic) way you could update from the observation of GPT-4's existence is to conclude that it is surprisingly easy to get (at least a kernel of) general intelligence from optimizing a seemingly random thing (next-token prediction) hard enough. I think this is partially what Eliezer means when he claims that "reality was far to the Eliezer side of Eliezer on the Eliezer-Robin axis". Eliezer predicted at the time that general abstract reasoning was easy to develop, scale, and share, relative to Robin.

But even Eliezer thought you would still need some kind of detailed understanding of the actual underlying cognitive algorithms to initially bootstrap from, using GOFAI methods, complicated architectures / training processes, etc. It turns out that just applying SGD on very regularly structured networks to the problem of text prediction is sufficient to hit on (weak versions of) such algorithms incidentally, at least if you do it at scales several OOM larger than people were considering in 2008.

My own personal update from observing GPT-4 and the success of language models more generally is: a small update towards some subproblems in alignment being relatively easier, and a massive update towards capabilities being way easier. Both of these updates follow directly from the surprising observation that GPT-4-level systems are apparently a natural and wide band in the below-human capabilities spectrum.

In general, I think non-MIRI folks tend to over-update on observations and results about below-human-level systems. It's possible that MIRI folks are making the reverse mistake of not updating hard enough, but small updates or non-updates from below-human systems look basically right to me, under a world model where things predictably break down once you go above human-level.

Could the methods here be used to evaluate humans as well as LLMs? That might provide an interesting way to compare and quantify LLM capabilities relative to human intelligence.

In other words: instead of an LLM generating the completions returned by the API in figure 2, what if it were a human programmer receiving the prompts and returning a response, while holding the rest of the setup and scaffolding constant?

Would they be able to complete all the tasks, and how long would it take? How much does it matter if they have access to reference material, the internet, or other tools that they can use when generating a response?
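A rough sketch of what "holding the scaffolding constant" could look like in Python. The names here (`run_tasks`, `llm_stub`, `make_scripted_human`) and the prompts are hypothetical and do not come from the paper's actual setup; the point is only that the harness treats the responder as an interchangeable function and records the same measurements either way.

```python
import time

def run_tasks(responder, prompts):
    """Feed each prompt to the responder and record its reply and wall-clock time."""
    results = []
    for prompt in prompts:
        start = time.perf_counter()
        reply = responder(prompt)
        elapsed = time.perf_counter() - start
        results.append({"prompt": prompt, "reply": reply, "seconds": elapsed})
    return results

def llm_stub(prompt):
    """Hypothetical placeholder for an API call to a language model."""
    return "ACTION: noop"

def make_scripted_human(replies):
    """Stand-in for a human typing replies; a real study might read from a terminal instead."""
    it = iter(replies)
    def respond(prompt):
        return next(it)
    return respond

prompts = ["placeholder prompt 1", "placeholder prompt 2"]
llm_results = run_tasks(llm_stub, prompts)
human_results = run_tasks(make_scripted_human(["ACTION: click 3", "ACTION: done"]), prompts)
```

Because both runs go through the same `run_tasks` loop, completion rates and times are directly comparable; varying the human's access to reference material or tools would be a change to the responder, not to the harness.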

Note that the setup here seems pretty favorable to LLMs: the scaffolding and interaction model make it natural for the LLM to interact with various APIs and tools, but usually not in the way that a human would (e.g. interfacing with the web using a text-based browser by specifying element IDs). However, I suspect that an average human programmer could still complete most or all of the tasks under these conditions, given enough time.

And if that is the case, I would say that's a pretty good way of demonstrating that current LLMs are still far below human-level in an important sense, even if there are certain tasks where they can already outperform humans (e.g. summarizing / generating / transforming certain kinds of prose extremely quickly). Conversely, if someone can come up with a bunch of real-world tasks like this that current or future LLMs can complete but a human can't (in reasonable amounts of time), that would be a pretty good demonstration that LLMs are starting to achieve or exceed "human-level" intelligence in ways that matter.

I'm interested in these questions mainly because there are many alignment proposals and plans which rely on "human-level" AI in some form, without specifying exactly what that means. My own view is that human-level intelligence is inherently unsafe, and also too wide of a target to be useful as a concept in alignment plans. But having a more quantitative and objective definition of "human-level" that allows for straightforward and meaningful comparisons with actual current and future AI systems seems like it would be very useful in governance and policy discussions more broadly.

A deceptively aligned AI has to, every time it’s deciding how to get cookies, go through a “thought process” like: “I am aiming for [thing other than cookies]. But humans want me to get cookies. And humans are in control right now [if not I need to behave differently]. Therefore I should get cookies.”
Contrast this with an AI that just responds to rewards for cookies by going through this thought process: “I’m trying to get cookies.”
The former “thought process” could be noticeably more expensive (compute wise or whatever), in which case heavy optimization would push against it. (I think this is plausible, though I’m less than convinced; the former thought process doesn’t seem like it is necessarily much more expensive conditional on already having a situationally aware agent that thinks about the big picture a lot.)

I think the plausibility depends heavily on how difficult the underlying tasks (getting cookies vs. getting something other than cookies) are. If humans ask the AI to do something really hard, whereas the AI wants something relatively simpler / easier to get, the combined difficulty of deceiving the humans and then doing the easy thing might be much less than the difficulty of doing the real thing that the humans want.

I think human behavior is pretty strong evidence that the difficulty and cognitive overhead of running a deception against other humans often isn't that high in an absolute sense - people often succeed at deceiving others while failing to accomplish their underlying goal. But the failure is often because the underlying goal is the actual hard part, not because the deception itself was overly cognitively burdensome.

An additional issue is that deceptive alignment only happens if you get inner misalignment resulting in an AI with some nonindexical “aim” other than in-episode reward. This could happen but it’s another conjunct.

An AI ending up with aims other than in-episode reward seems pretty likely, and has plausibly already happened, if you consider current AI systems to have "aims" at all. I expect the most powerful AI training methods to work by training the AI to be good at general-purpose reasoning, and the lesson I take from GPTs is that you can start to get general-purpose reasoning by training on relatively simple (in structure) tasks (next token prediction, RLHF), if you do it at a large enough scale. See also Reward is not the optimization target - reinforcement learning chisels cognition into a system, it doesn't necessarily train the system to seek reward itself. Once AI training methods advance far enough to train a system that is smart enough to have aims of its own and to start reflecting on what it wants, I think there's little reason to expect that these aims line up exactly with the outer reward function. This is maybe just recapitulating the outer vs. inner alignment debate, though.

(Obviously it is somehow feasible to make an AGI, because evolution did it.)

This parenthetical is one of the reasons why I think AGI is likely to come soon.

The example of human evolution provides a strict upper bound on the difficulty of creating (true, lethally dangerous) AGI, and of packing it into a 10 W, 1000 cm³ box.

That doesn't mean that recreating the method used by evolution (iterative mutation over millions of years at planet scale) is the only way to discover and learn general-purpose reasoning algorithms. Evolution had a lot of time and resources to run, but it is an extremely dumb optimization process that is subject to a bunch of constraints and quirks of biology, which human designers are already free of.

To me, LLMs and other recent AI capabilities breakthroughs are evidence that methods other than planet-scale iterative mutation can get you something, even if it's still pretty far from AGI. And I think it is likely that capabilities research will continue to lead to scaling and algorithms progress that will get you more and more something. But progress of this kind can't go on forever - eventually it will hit on human-level (or better) reasoning ability.

The inference I make from observing both the history of human evolution and the spate of recent AI capabilities progress is that human-level intelligence can't be that special or difficult to create in an absolute sense, and that while evolutionary methods (or something isomorphic to them) at planet scale are sufficient to get to general intelligence, they're probably not necessary.

Or, put another way:

Finally: I also see a fair number of specific "blockers", as well as some indications that existing things don't have properties that would scare me.

I mostly agree with the point about existing systems, but I think there are only so many independent high-difficulty blockers which can "fit" inside the AGI-invention problem, since evolution somehow managed to solve them all through inefficient brute force. LLMs are evidence that at least some of the (perhaps easier) blockers can be solved via methods that are tractable to run on current-day hardware on far shorter timescales than evolution.

How do agents with preferential gaps fit into this? I think preferential gaps are a kind of weak incompleteness, and thus handled by your second step?

Context: I'm pretty interested in the claims in this post, and their implications. A while ago, I went back and forth with EJT a bit on his coherence theorems post. The thread ended here with a claim by EJT:

And agents with many preferential gaps may behave quite differently to expected utility maximizers.

I didn't have a counterpoint at the time, but I am pretty skeptical that this claim is true, intuitively.

An agent with even infinitely many preferential gaps seems very close in mind-space to an agent with complete preferences: all it is missing is a relatively simple-to-describe function which "breaks the tie" on things it is already very close to indifferent about. And different choices of tiebreaker function seem unlikely to lead to importantly different behavior: for any choice of tiebreaker function, you are back to an EU maximizer.
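Here is a toy illustration of the tiebreaker intuition in Python. It rests on one particular modeling assumption that is mine, not EJT's: options are scored on two incommensurable dimensions, the agent strictly prefers one option only when it is at least as good on both dimensions, and a "gap" is any pair where neither dominates. Under that assumption, any choice of positive weights acts as a tiebreaker that yields a complete ordering while preserving every strict preference the agent already had.

```python
from itertools import permutations

def prefers(a, b):
    """Partial preference: a beats b only if it is at least as good everywhere and not identical."""
    return all(x >= y for x, y in zip(a, b)) and a != b

def has_gap(a, b):
    """A preferential gap in this toy model: neither option dominates the other."""
    return a != b and not prefers(a, b) and not prefers(b, a)

def make_utility(weights):
    """A 'tiebreaker': positive weights that collapse both dimensions into one number."""
    def utility(option):
        return sum(w * x for w, x in zip(weights, option))
    return utility

# Made-up options: "A+" is a sweetened version of "A"; "B" is good on the other dimension.
options = {"A": (3, 1), "A+": (4, 1), "B": (1, 3)}

gaps = [(m, n) for m, n in permutations(options, 2) if has_gap(options[m], options[n])]

utility = make_utility((1, 2))  # one arbitrary tiebreaker among many
ranking = sorted(options, key=lambda name: utility(options[name]), reverse=True)

# Every strict preference of the original agent survives completion.
preserved = all(
    utility(options[m]) > utility(options[n])
    for m, n in permutations(options, 2)
    if prefers(options[m], options[n])
)
```

In this example the agent has gaps between B and both A and A+ (sweetening A does not resolve the gap), yet after applying the weights it ranks all three options on a single scale. Different weights give different rankings of the gapped pairs, but each one is a complete ordering.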

The only remaining hope is to avoid having the agent ever pick or be imbued with a tiebreaker function at all. That requires at least two things:

The agent's creators must not initialize it with such a tiebreaker function (seems unlikely to happen by default, but maybe if the creators are alignment researchers who know what they are doing, it's possible)
The agent itself must be stable enough that it never chooses to self-modify or drift into completeness on its own. And I think your claim, if I'm understanding it correctly, is that such stability is unlikely, because completing the preferences can lead to a strict improvement in outcomes under the preferences of the original agent.

Am I understanding your claims correctly, and do you agree with my reasoning that EJT's claim is thus unlikely to be true?

Nit: don't you also need to require that the predicted (and actual) outputs are (apparently, at least) safe? Interpreted literally as written, developers would be allowed to deploy a model if they can reliably predict that it will cause harm.
