Max H

NOTE: I am not Max Harms, author of Crystal Society. I'd prefer for now that my LW postings not be attached to my full name when people Google me for other reasons, but you can PM me here or on Discord (m4xed) if you want to know who I am.

I've been active in the meatspace rationality community for years, and have recently started posting regularly on LW. Most of my posts and comments are about AI and alignment.

Posts I'm most proud of, and/or which provide a good introduction to my worldview:

I also wrote a longer self-introduction here.

PMs and private feedback are always welcome.

Comments

Could the methods here be used to evaluate humans as well as LLMs? That might provide an interesting way to compare and quantify LLM capabilities relative to human intelligence.

In other words: instead of an LLM generating the completions returned by the API in figure 2, what if it were a human programmer receiving the prompts and returning a response, while holding the rest of the setup and scaffolding constant?
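Concretely, the swap I'm imagining looks something like the toy harness below (the names and the toy task are mine, not from the paper): the scaffold is held constant, and only the source of each completion changes, from an LLM API call to a human typing at a terminal.

```python
from typing import Callable

def human_backend(prompt: str) -> str:
    # The human plays the role of the model: read the prompt, type a response.
    print(prompt)
    return input("Your response (acting as the model): ")

def llm_backend(prompt: str) -> str:
    # Stand-in for a call to a real chat-completion API.
    return "ls"

def run_task(complete: Callable[[str], str], max_steps: int = 10) -> bool:
    """Toy scaffold, held constant across backends: prompt, parse, repeat."""
    transcript = "Task: list the files in the current directory.\n"
    for _ in range(max_steps):
        command = complete(transcript)
        transcript += f"> {command}\n"
        if command.strip() == "ls":  # stand-in for the real task's success check
            return True
    return False

# Same harness, different completion source:
# run_task(llm_backend)  vs.  run_task(human_backend)
```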

Would they be able to complete all the tasks, and how long would it take? How much does it matter if they have access to reference material, the internet, or other tools that they can use when generating a response?

Note that the setup here seems pretty favorable to LLMs: the scaffolding and interaction model make it natural for the LLM to interact with various APIs and tools, but usually not in the way that a human would (e.g. interfacing with the web using a text-based browser by specifying element IDs). However, I suspect that an average human programmer could still complete most or all of the tasks under these conditions, given enough time.

And if that is the case, I would say that's a pretty good way of demonstrating that current LLMs are still far below human-level in an important sense, even if there are certain tasks where they can already outperform humans (e.g. summarizing / generating / transforming certain kinds of prose extremely quickly). Conversely, if someone can come up with a bunch of real-world tasks like this that current or future LLMs can complete but a human can't (in reasonable amounts of time), that would be a pretty good demonstration that LLMs are starting to achieve or exceed "human-level" intelligence in ways that matter.

I'm interested in these questions mainly because there are many alignment proposals and plans which rely on "human-level" AI in some form, without specifying exactly what that means. My own view is that human-level intelligence is inherently unsafe, and also too wide of a target to be useful as a concept in alignment plans. But having a more quantitative and objective definition of "human-level" that allows for straightforward and meaningful comparisons with actual current and future AI systems seems like it would be very useful in governance and policy discussions more broadly.

  • A deceptively aligned AI has to, every time it’s deciding how to get cookies, go through a “thought process” like: “I am aiming for [thing other than cookies]. But humans want me to get cookies. And humans are in control right now [if not I need to behave differently]. Therefore I should get cookies.”
  • Contrast this with an AI that just responds to rewards for cookies by going through this thought process: “I’m trying to get cookies.”
  • The former “thought process” could be noticeably more expensive (compute wise or whatever), in which case heavy optimization would push against it. (I think this is plausible, though I’m less than convinced; the former thought process doesn’t seem like it is necessarily much more expensive conditional on already having a situationally aware agent that thinks about the big picture a lot.)

 

I think the plausibility depends heavily on how difficult the underlying tasks (getting cookies vs. getting something other than cookies) are. If humans ask the AI to do something really hard, whereas the AI wants something relatively simpler / easier to get, the combined difficulty of deceiving the humans and then doing the easy thing might be much less than the difficulty of doing the real thing that the humans want.
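To make that comparison explicit with some purely made-up numbers (none of these quantities come from the post; they're just for illustration):

```python
# Hypothetical units of "effort"; the point is only the shape of the inequality.
effort_task_humans_want = 100  # the hard thing the humans are asking for
effort_ai_actual_goal = 10     # the easier thing the AI actually wants
effort_of_deception = 20       # overhead of modeling the humans and keeping up appearances

deceptive_path = effort_of_deception + effort_ai_actual_goal  # 30
straightforward_path = effort_task_humans_want                # 100
assert deceptive_path < straightforward_path
```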

I think human behavior is pretty strong evidence that the cognitive overhead of running a deception against other humans often isn't that large in an absolute sense - people often succeed at deceiving others while failing to accomplish their underlying goal. But the failure is often because the underlying goal is the actual hard part, not because the deception itself was overly cognitively burdensome.

 

  • An additional issue is that deceptive alignment only happens if you get inner misalignment resulting in an AI with some nonindexical “aim” other than in-episode reward. This could happen but it’s another conjunct.

An AI ending up with aims other than in-episode reward seems pretty likely, and has plausibly already happened, if you consider current AI systems to have "aims" at all. I expect the most powerful AI training methods to work by training the AI to be good at general-purpose reasoning, and the lesson I take from GPTs is that you can start to get general-purpose reasoning by training on relatively simple (in structure) tasks (next token prediction, RLHF), if you do it at a large enough scale. See also Reward is not the optimization target - reinforcement learning chisels cognition into a system; it doesn't necessarily train the system to seek reward itself. Once AI training methods advance far enough to train a system that is smart enough to have aims of its own and to start reflecting on what it wants, I think there's little reason to expect that these aims line up exactly with the outer reward function. This is maybe just recapitulating the outer vs. inner alignment debate, though.
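To gesture at the "chisels cognition" point with a toy sketch (a generic REINFORCE-style update in PyTorch, my own illustration rather than anything from the linked post): reward enters only as a scalar weight on the gradient during training, and is never part of the policy's inputs or outputs, so nothing in the setup forces the trained system to represent or pursue reward as such.

```python
import torch

policy_net = torch.nn.Linear(4, 2)      # hypothetical tiny policy network
optimizer = torch.optim.SGD(policy_net.parameters(), lr=1e-2)

observation = torch.randn(4)            # the policy's only input
logits = policy_net(observation)
dist = torch.distributions.Categorical(logits=logits)
action = dist.sample()

reward = 1.0                            # supplied by the environment during training
loss = -reward * dist.log_prob(action)  # reward weights the gradient ("chisels" the weights)...
loss.backward()
optimizer.step()                        # ...but never appears in the policy's I/O
```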

(Obviously it is somehow feasible to make an AGI, because evolution did it.)

This parenthetical is one of the reasons why I think AGI is likely to come soon.

The example of human evolution provides a strict upper bound on the difficulty of creating (true, lethally dangerous) AGI, and of packing it into a 10 W, 1000 cm³ box.

That doesn't mean that recreating the method used by evolution (iterative mutation over millions of years at planet scale) is the only way to discover and learn general-purpose reasoning algorithms. Evolution had a lot of time and resources to run, but it is an extremely dumb optimization process that is subject to a bunch of constraints and quirks of biology, which human designers are already free of.

To me, LLMs and other recent AI capabilities breakthroughs are evidence that methods other than planet-scale iterative mutation can get you something, even if it's still pretty far from AGI. And I think it is likely that capabilities research will continue to lead to scaling and algorithmic progress that will get you more and more something. But progress of this kind can't go on forever without eventually hitting human-level (or better) reasoning ability.

The inference I make from observing both the history of human evolution and the spate of recent AI capabilities progress is that human-level intelligence can't be that special or difficult to create in an absolute sense, and that while evolutionary methods (or something isomorphic to them) at planet scale are sufficient to get to general intelligence, they're probably not necessary.

Or, put another way:
 

Finally: I also see a fair number of specific "blockers", as well as some indications that existing things don't have properties that would scare me.


I mostly agree with the point about existing systems, but I think there are only so many independent high-difficulty blockers which can "fit" inside the AGI-invention problem, since evolution somehow managed to solve them all through inefficient brute force. LLMs are evidence that at least some of the (perhaps easier) blockers can be solved via methods that are tractable to run on current-day hardware on far shorter timescales than evolution.

 

How do agents with preferential gaps fit into this? I think preferential gaps are a kind of weak incompleteness, and thus handled by your second step?

Context: I'm pretty interested in the claims in this post, and their implications. A while ago, I went back and forth with EJT a bit on his coherence theorems post. The thread ended here with a claim by EJT:

And agents with many preferential gaps may behave quite differently to expected utility maximizers.

I didn't have a counterpoint at the time, but intuitively I'm pretty skeptical that this claim is true.

An agent with even infinitely many preferential gaps seems very close in mind-space to an agent with complete preferences: all it is missing is a relatively simple-to-describe function which "breaks the tie" on things it is already very close to indifferent about. And different choices of tiebreaker function seem unlikely to lead to importantly different behavior: for any choice of tiebreaker function, you are back to an EU maximizer.
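A toy sketch of that intuition (the preference structure and names here are mine, not EJT's formalism): start with a partial preference relation that has gaps, bolt on any cheap tiebreaker, and you get a complete relation that behaves like maximizing some utility function.

```python
from functools import cmp_to_key

def base_preference(a, b):
    """Partial strict preference: 1 if a is preferred, -1 if b is, None for a preferential gap."""
    if a["money"] > b["money"]:
        return 1
    if a["money"] < b["money"]:
        return -1
    return None  # gap: neither option preferred, and not indifferent either

def tiebreaker(a, b):
    # Any simple rule that resolves gaps, e.g. compare on a second attribute.
    return (a["leisure"] > b["leisure"]) - (a["leisure"] < b["leisure"])

def completed_preference(a, b):
    verdict = base_preference(a, b)
    return verdict if verdict is not None else tiebreaker(a, b)

options = [
    {"money": 5, "leisure": 2},
    {"money": 5, "leisure": 7},  # gap with the first option under base_preference
    {"money": 9, "leisure": 1},
]
best = max(options, key=cmp_to_key(completed_preference))  # behaves like argmax of a utility
```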

The only remaining hope is to avoid having the agent ever pick or be imbued with a tiebreaker function at all. That requires at least two things:

  1. The agent's creators must not initialize it with such a tiebreaker function (seems unlikely to happen by default, but maybe if the creators are alignment researchers who know what they are doing, it's possible)
  2. The agent itself must be stable enough that it never chooses to self-modify or drift into completeness on its own. And I think your claim, if I'm understanding it correctly, is that such stability is unlikely, because completing the preferences can lead to a strict improvement in outcomes under the preferences of the original agent.

Am I understanding your claims correctly, and do you agree with my reasoning that EJT's claim is thus unlikely to be true?

Nit: don't you also need to require that the predicted (and actual) outputs are (apparently, at least) safe? Interpreted literally as written, developers would be allowed to deploy a model if they can reliably predict that it will cause harm.
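In toy pseudocode-ish form (the condition names are mine, not from the post):

```python
def may_deploy_literal(predictable: bool, predicted_safe: bool) -> bool:
    # Literal reading: deployment only requires that behavior be reliably predictable.
    return predictable

def may_deploy_intended(predictable: bool, predicted_safe: bool) -> bool:
    # Presumably intended reading: predictable AND the predicted behavior is safe.
    return predictable and predicted_safe

assert may_deploy_literal(True, False) is True    # the problematic case: predictably harmful, still deployable
assert may_deploy_intended(True, False) is False
```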

we don't make an AI which tries to not be harmful with regards to its side-channels, such as hardware attacks — except for its output, it needs to be strongly boxed, such that it can't destroy our world by manipulating software or hardware vulnerabilities. similarly, we don't make an AI which tries to output a solution we like, it tries to output a solution which the math would score high. narrowing what we want the AI to do greatly helps us build the right thing, but it does add constraints to our work.

 

Another potential downside of this approach: it places a lot of constraints on the AI itself, which means it probably has to be strongly superintelligent to start working at all.

I think an important desideratum of any alignment plan is that your AI system starts working gradually, with a "capabilities dial" that you (and the aligned system itself) turn up just enough to save the world, and not more.

Intuitively, I feel like an aligned AGI should look kind of like a friendly superhero, whose superpowers are weak superintelligence, superhuman ethics, and a morality which is as close as possible to the coherence-weighted + extrapolated average morality of all currently existing humans (probably not literally; I'm just trying to gesture at a general notion of averaging over collective extrapolated volition / morality / etc.).

Brought into existence, that superhero would then consider two broad classes of strategies:

  1. Solve a bunch of hard alignment problems (embedded agency, stable self-improvement, etc.), and then, having solved those, build a successor system to do the actual work.
  2. Directly do some things with biotech / nanotech / computer security / etc. at its current intelligence level to end the acute risk period. Solve remaining problems at its leisure, or just leave them to the humans.

From my own not-even-weakly superhuman vantage point, (2) seems like a much easier and less fraught strategy than (1). If I were a bit smarter, I'd try saving the world without AI or enhancing myself any further than I absolutely needed to.

Faced with the problem that the boxed AI in the QACI scheme is facing... :shrug:. I guess I'd try some self-enhancement followed by solving problems in (1), and then try writing code for a system that does (2) reliably. But it feels like I'd need to be a LOT smarter to even begin making progress.

Provably safely building the first "friendly superhero" might require solving some hard math and philosophical problems, for which QACI might be relevant or at least in the right general neighborhood. But that doesn't mean that the resulting system itself should be doing hard math or exotic philosophy. Here, I think the intuition of more optimistic AI researchers is actually right: an aligned human-ish level AI looks closer to something that is just really friendly and nice and helpful, and also super-smart.

(I haven't seen any plans for building such a system that don't seem totally doomed, but the goal itself still seems much less fraught than targeting strong superintelligence on the first try.)

Meta: I really like ideas and concrete steps for how to practice the skill of thinking about something. I think there are at least three methods for learning how to think productively about a particular problem:

  • Reading material written by others (not necessarily passively; this might include checking your understanding of the material you read)
  • Doing exercises, practice problems, toy projects, etc. that are focused directly on the object-level problem.
  • Doing exercises to practice the "cognitive motions" needed to think productively about a problem. I think the exercise(s) described in this post are a good example of this.

And I think the last option is often neglected (in all fields, not just AGI safety) because there's not a lot of written material on how to actually do it. Note that it is a different thing than the more general skill of learning to learn and meta-cognition - different kinds of technical problems can require learning different, domain-specific kinds of cognitive motions.

If you've absorbed enough of the Sequences and other rationality material through osmosis, you might be able to figure out the kind of cognitive motions you need, and how to practice and develop them on your own (or maybe you've done some of the exercises in the CFAR handbook and learned to generalize the lessons they try to teach).

But having someone more experienced write down the kind of cognitive motions you need, along with exercises for how to learn and practice them, can probably get more people up to speed much more quickly. I think posts like this are a great step in that direction.

Object-level tip for the breaker phase: thinking about how a literal human might break your alignment proposal can be a useful way to build intuition and a security mindset. A lot of real alignment schemes involve doing something with human-ish level intelligence, and thinking about how an actual human would break or escape from something is often more natural and less prone to veering into vague or magical thinking than positing capabilities that a hypothetical super-intelligent AI system might have.

If you can't figure out how an actual human can break things, you can relax the constraint a bit by thinking about what a human with the ability to make 10 copies of themselves, think 10x as fast, write code with superhuman accuracy and speed, etc. could do instead.

Threat modelling is the term for this kind of thinking in the field of computer security.

I like this post as a vivid depiction of the possible convergence of strategicness. For literal wildfires, it doesn't really matter where or how the fire starts - left to burn, the end result is that the whole forest burns down. Once the fire is put out, firefighters might be able to determine whether the fire started from a lightning strike in the east, or a matchstick in the west. But the differences in the end result are probably unnoticeable to casual observers, and unimportant to anyone that used to live in the forest.


I think, pretty often, people accept the basic premise that many kinds of capabilities (e.g. strategicness) are instrumentally convergent, without thinking about what the process of convergence actually looks like in graphic detail. Metaphors may or may not be convincing and correct as arguments, but they certainly help to make a point vivid and concrete.

Once you have a policy network and a sampling procedure, you can embody it in a system which samples the network repeatedly, and hooks up the I/O to the proper environment and actuators. Usually this involves hooking the policy into a simulation of a game environment (e.g. in a Gym), but sometimes the embodiment is an actual robot in the real world.
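For concreteness, a minimal sketch of that embodiment loop (assuming the gymnasium API and a stubbed-out policy; this is just an illustration, not a claim about any particular system):

```python
import gymnasium as gym

env = gym.make("CartPole-v1")

def policy(observation):
    # Stand-in for sampling an action from a trained policy network.
    return env.action_space.sample()

# The "agent", if anything, is this whole loop: repeatedly sample the policy
# and wire its outputs back into the environment.
observation, info = env.reset()
done = False
while not done:
    action = policy(observation)
    observation, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated
env.close()
```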

I think using the term "agent" for the policy itself is actually a type error, and not just misleading. I think using the term to refer to the embodied system has the correct type signature, but I agree it can be misleading, for the reasons you describe.

OTOH, I do think modelling the outward behavior of such systems by regarding them as agents with black-box internals is often useful as a predictor, and I would guess that this modelling is the origin of the use of the term in RL.

But modelling outward behavior is very different from attributing that behavior to agentic cognition within the policy itself. I think it is unlikely that any current policy networks are doing (much) agentic cognition at runtime, but I wouldn't necessarily count on that trend continuing. So moving away from the term "agent" proactively seems like a good idea.

Anyway, I appreciate posts like this which clarify / improve standard terminology. Curious if you agree with my distinction about embodiment, and if so, if you have any better suggested term for the embodied system than "agent" or "embodiment".

I largely agree with the general point that I think this post is making, which I would summarize in my own words as: iteration-and-feedback cycles, experimentation, experience, trial-and-error, etc. (LPE, in your terms) are sometimes overrated in importance and necessity. This over-emphasis is particularly common among those who have an optimistic view on solving the alignment problem through iterative experimentation.

I think the degree to which LPE is actually necessary for solving problems in any given domain, as well as the minimum time and resources required to obtain it and the general tractability of doing so, is an empirical question which people frequently investigate for particular important domains.

Differing intuitions about how important LPE is in general, and how tractable it is to obtain, seem like an important place for identifying cruxes in world views. I wrote a bit more about this in a recent post, and commented on one of the empirical investigations to which my post is partially a response. As I said in the comment, I find such investigations interesting and valuable as a matter of furthering scientific understanding about the limits of the possible, but pretty futile as attempts to bound the capabilities of a superintelligence. I think your post is a good articulation of one reason why I find these arguments so uncompelling.
