In his recent post arguing against AI Control research, John Wentworth makes the case that the median doom path goes through AI slop rather than scheming. I find this plausible. I believe it suggests a convergence of interests between AI capabilities research and AI alignment research.

Historically, there has been a lot of concern about differential progress amongst AI safety researchers (perhaps especially those I tend to talk to). Some research gets labeled "capabilities" while other research gets labeled "safety" (or, more often, "alignment"[1]). Most research is dual-use in practice (IE, has both capability and safety implications), and on this view it should therefore be kept secret or disclosed carefully.

Recently, a colleague expressed concern that future AIs will read anything AI safety researchers publish now. Since the alignment of those future AIs seems uncertain, perhaps even unlikely, almost any information published now could turn out to be net harmful.

I argued the contrary case, as follows: a weak form of recursive self-improvement has already started (in the sense that modern LLMs can usefully accelerate the next generation of LLMs in a variety of ways[2]). I assume that this trend will intensify as AI continues to get more useful. Humans will continue to keep themselves at least somewhat in the loop, but at some point, mistakes may be made (by either the AI or the humans) which push things drastically off-course. We want to avoid mistakes like that.

John spells it out more decisively:

The problem is that we mostly don’t die of catastrophic risk from early transformative AI at all. We die of catastrophic risk from stronger AI, e.g. superintelligence (in the oversimplified model). The main problem which needs to be solved for early transformative AI is to use it to actually solve the hard alignment problems of superintelligence

The key question (on my model) is: does publishing a given piece of information reduce or increase the probability of things going off-course?

Think of it like this. We're currently navigating foreign terrain with a large group of people. We don't have the choice of splitting off from the group; we expect to more-or-less share the same fate, whatever happens. We might not agree with the decision-making process of the group. We might not think we're currently on-course for a good destination. Sharing some sorts of information with the group can result in doom.[3] However, there will be many types of information which will be good to share.

AI Slop

AI slop is a generic derogatory term for AI-generated content, reflecting how easy it has become to mass-produce low-quality output full of hallucinations[4], extra fingers, and other hallmarks of machine generation.

As the AI hype has grown, I've kept trying to use AI to accelerate my research. While it is obviously getting better, my experience is that it remains useful only as a sounding board. I find myself often falling into the habit of not even reading the AI outputs, because they have proven worse than useless: when I describe my technical problem and ask for a solution, I get something that looks plausible at first glance, but on close analysis, assumes what is to be proven in one of the proof steps. I'm not exactly sure why this is the case. Generating a correct novel proof should be hard, sure; but checking proofs is easier than generating them. Generating only valid proof steps should be relatively easy.[5]

These AIs seem strikingly good at conversing about sufficiently well-established mathematics, but the moment I ask for something a little bit creative, the fluent competence falls apart.

Claude 3.5 was the first model whose proofs were good enough to fool me for a little while, rather than being obvious slop. The o1 model seems better, in the sense that its proofs look more convincing and it takes me longer to find the holes in the proofs. I haven't tried o3 yet, but early reports are that it hallucinates a lot, so I mostly expect it to continue the trend of being worse-than-useless in this way.[6]

I'm not denying that these models really are getting better in a broad sense. There's a general pattern that LLMs are much more useful for people who have a lower level of expertise in a field. That waterline continues to increase.[7]

However, as these models continue to get better, they seemingly continue to display a very strong preference for convincingness over correctness when the two come into conflict. If this trend continues, it is plausibly a big problem for the future.

Coherence & Recursive Self-Improvement

Recursive self-improvement (RSI) is a tricky business. One wrong move can send you teetering into madness. It is, in a broad sense, the business which leading AI labs are already engaged in.

Again quoting John:

First, some lab builds early transformatively-useful AI. They notice that it can do things like dramatically accelerate AI capabilities or alignment R&D. Their alignment team gets busy using the early transformative AI to solve the alignment problems of superintelligence. The early transformative AI spits out some slop, as AI does. Alas, one of the core challenges of slop is that it looks fine at first glance, and one of the core problems of aligning superintelligence is that it’s hard to verify; we can’t build a superintelligence to test it out on without risking death. Put those two together, add a dash of the lab alignment team having no idea what they’re doing because all of their work to date has focused on aligning near-term AI rather than superintelligence, and we have a perfect storm for the lab thinking they’ve solved the superintelligence alignment problem when they have not in fact solved the problem.

So the lab implements the non-solution, turns up the self-improvement dial, and by the time anybody realizes they haven’t actually solved the superintelligence alignment problem (if anybody even realizes at all), it’s already too late.

To avoid this sort of outcome, it seems like we need to figure out how to make models "coherent" in a fairly broad sense (related to formal notions of coherence, eg probabilistic coherence, as well as informal ones). Here are some important-seeming properties to illustrate what I mean (a toy sketch of how one might quantify a couple of them follows the list):

  1. Robustness of value-alignment: Modern LLMs can display a relatively high degree of competence when explicitly reasoning about human morality. In order for this to matter for RSI, however, those concepts need to also come into play appropriately when reasoning about seemingly unrelated things, such as programming. The continued ease of jailbreaking AIs serves to illustrate this property failing (although solving jailbreaking would not necessarily get at the whole property I am pointing at).
  2. Propagation of beliefs: When the AI knows something, it should know it in a way which integrates well with everything else it knows, rather than easily displaying the knowledge in one context while seeming to forget it in another.
  3. Preference for reasons over rationalizations: An AI should be ready and eager to correct its mistakes, rather than rationalizing its wrong answers. It should be truth-seeking, following thoughts where they lead instead of planning ahead to justify specific answers. It should prefer valid proof steps over arriving at a desired answer when the two conflict.
  4. Knowing the limits of its knowledge: Metacognitive awareness of what it knows and what it doesn't know, appropriately brought to bear in specific situations. The current AI paradigm just has one big text-completion probability distribution, so there's not a natural way for it to distinguish between uncertainty about the underlying facts and uncertainty about what to say next -- hence we get hallucinations.
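
To make this a bit more concrete, here is a minimal sketch of how properties 2 and 4 might be quantified, assuming (hypothetically) that we can elicit a credence for the same fact phrased in several different contexts, and for its negation. The elicitation setup and the numbers are made up for illustration; this is not an existing benchmark.

```python
from itertools import combinations

def coherence_gap(p_claim: float, p_negation: float) -> float:
    """Probabilistic coherence requires P(A) + P(not A) = 1.
    Returns the absolute deviation from that constraint."""
    return abs((p_claim + p_negation) - 1.0)

def propagation_gap(credences_across_contexts: list[float]) -> float:
    """'Propagation of beliefs': the same fact, elicited via different
    phrasings and contexts, should get roughly the same credence.
    Returns the largest pairwise disagreement."""
    pairs = list(combinations(credences_across_contexts, 2))
    return max(abs(a - b) for a, b in pairs) if pairs else 0.0

# Hypothetical elicited credences for one fact, phrased three ways,
# plus its negation phrased once. (Made-up numbers, not real model outputs.)
elicited = {"direct_question": 0.92, "buried_in_code_review": 0.55, "casual_chat": 0.88}
negation_credence = 0.30

print("coherence gap:  ", coherence_gap(elicited["direct_question"], negation_credence))  # ~0.22
print("propagation gap:", propagation_gap(list(elicited.values())))                       # ~0.37
```

A model with the properties above would drive both gaps toward zero; the point is only that these notions of coherence are quantifiable in principle, not that this particular measurement is the right one.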

All of this is more-or-less a version of the metaphilosophy research agenda, framed in terms of current events in AI. We don't just need to orient AI towards our values; we need to orient AI towards (the best of) the whole human truth-seeking process, including (the best of) moral philosophy, philosophy of science, etc.

What's to be done?

To my knowledge, we still lack a good formal model clarifying what it would even mean to solve the hardest parts of the AI safety problem (eg, the pointers problem). However, we do have a plausible formal sketch of metaphilosophy: Logical Induction![8]

Logical Induction comes with a number of formal guarantees about its reasoning process. This is something that cannot be said about modern "reasoning models" (which I think are a move in the wrong direction).
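
For readers who haven't looked at the formalism, the central guarantee (the "logical induction criterion" of Garrabrant et al. 2016) can be paraphrased roughly as follows; the notation here is a loose informal rendering, not the paper's.

```latex
% Loose paraphrase of the logical induction criterion (Garrabrant et al., 2016).
% A sequence of belief states \overline{\mathbb{P}} = (\mathbb{P}_1, \mathbb{P}_2, \ldots)
% over logical sentences is a logical inductor iff no efficiently computable (e.c.)
% trader T can exploit it, i.e. trade sentences at the stated prices and push its
% plausible net worth to infinity while only ever risking a bounded amount:
\forall \text{ e.c. traders } T : \qquad
  \sup_{n \in \mathbb{N}} \; \mathrm{Worth}_{\overline{\mathbb{P}}}(T, n) < \infty
% Among the consequences: prices become coherent in the limit, e.g.
% \lim_{n \to \infty} \big( \mathbb{P}_n(\phi) + \mathbb{P}_n(\lnot \phi) \big) = 1
% for every sentence \phi, and sentences the deductive process eventually proves
% get price approaching 1.
```

The relevant contrast is just that "cannot be exploited by any efficient trader" is a precise, checkable property of a reasoning process, which is the kind of guarantee the current reasoning-model paradigm does not come with.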

Can we apply ideas from logical induction to improve the reasoning of modern AI? I think it is plausible. Should we? I think it is plausible.[9]

More generally, this post can be viewed as a continuation of the ideas I expressed in LLMs for Alignment Research: A Safety Priority? and AI Craftsmanship. I am suggesting that it might be time for safety-interested people to work on specific capabilities-like things, with an eye particularly towards capabilities which can accelerate AI safety research and, more generally, towards reducing AI slop.

I believe that scaling up current approaches is not sufficient; it seems important to me to instead understand the underlying causes of the failure modes we are seeing, and design approaches which avoid those failure modes. If we can provide a more-coherent alternative to the current paradigm of "reasoning models" (and get such an alternative to be widely adopted), well, I think that would be good.

Trying to prevent jailbreaking, avoid hallucinations,[4] get models to reason well, etc., are not new ideas. What I see as new here is my argument that the interests of safety researchers and capabilities researchers are aligned on these topics. This argument might move some people to work on "capabilities" or to publish such work when they might not otherwise do so.

Above all, I'm interested in feedback on these ideas. The title has a question mark for a reason; this all feels conjectural to me.

  1. ^

    I have come to prefer "AI safety" as the broader and more descriptive term for research intended to help reduce AI risk. The term "alignment" still has meaning to me, as a synonym for value-loading research, which aims to build agentic AI whose goal-directedness is aimed at human values. However, I think it is quite important to keep one's thoughts as close as possible to the main aim of one's research. To me, safety seems like a better aim than alignment. Alignment is one way to achieve safety, but may not be the best way.

  2. ^

    Approaches such as constitutional AI, RLHF, and deliberative alignment use AI directly to help train AI. LLMs are also useful for programmers, so I imagine that they see some use for writing code at the AI labs themselves. More generally, researchers might have conversations with LLMs about their research.

  3. ^

    EG, maybe the majority of the group thinks that jumping off of cliffs is a good idea, so we don't want to tell the group the direction to the nearest cliff.

  4. ^

    One colleague of mine uses the term "confabulation" rather than the more common "hallucination" -- I think it is a much more fitting term. The so-called hallucinations are (1) in the output of the system rather than the input (confabulations are a behavior, whereas hallucinations are a sensory phenomenon), and (2) verbal rather than visual (hallucinations can be auditory or affect other senses, but the central thing people think of is visual hallucinations). "Confabulation" calls to mind a verbal behavior, which fits the phenomenon being described very well.

    "Confabulation" also seems to describe some details of the phenomenon well; in particular, AI confabulation and human confabulation share patterns of motivated cognition: both will typically try to defend their confabulated stories, rather than conceding in the face of evidence.

    I recognize, unfortunately, that use of the term "hallucination" to describe LLM confabulation has become extremely well-established. However, thinking clearly about these things seems important, and using clear words to describe them aids such clarity.

    Ooh, I found someone else making the same point.

  5. ^

    I'm not saying "logic is simple, therefore generating only valid proof-steps should be simple" -- I understand that mathematicians skip a large number of "obvious" steps when they write up proofs for publication, such that fully formalizing proofs found in a randomly chosen math paper can be quite nontrivial. So, yes, "writing only valid proof steps" is much more complex than simply keeping to the rules of logic.

    Still, most proofs in the training data will be written for a relatively broad audience, so (my argument goes) fluency in discussing the well-established math in a given area should be about the level of skill needed for generating only valid proof steps. This is a strong pattern, useful for predicting the data. From this, I would naively predict that LLMs trying to write proofs would write a bunch of valid steps (perhaps including a few accidental mistakes, rather than strategic mistakes) and fail to reach the desired conclusion, rather than generating clever arguments.

    To me, the failure of this prediction requires some explanation. I can think of several possible explanations, but I am not sure which is correct.

  6. ^

    A colleague predicted that o3-pro will still generate subtly flawed proofs, but at that point I'll lose the ability to tell without a team of mathematicians. I disagree: a good proof is a verifiable proof. I can at least fall back on asking o3-pro to generate a machine-checkable version of the proof, and count it as a failure if it is unable to do so.
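
    To be concrete about what "machine-checkable" means here: the fallback standard I have in mind is an artifact that a proof assistant's kernel accepts or rejects mechanically, along the lines of the deliberately trivial Lean 4 example below (nothing like the theorems actually at issue).

```lean
-- The kernel either accepts these proofs or it doesn't; no human judgment
-- about how "convincing" the argument sounds is involved.
example : 2 + 2 = 4 := rfl

theorem n_add_zero (n : Nat) : n + 0 = n := rfl
```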

  7. ^

    Although: it is easy for people to overestimate how quickly that waterline is increasing. AI will naturally be optimized to pass the shallower tests of competence, and people will naturally be biased to make generalized predictions about its competence based on shallower tests. Furthermore, since most people aren't experts in most fields, Gell-Mann Amnesia leads to overestimation of AI.

  8. ^

    Readers might not be prepared to think about Logical Induction as a solution to metaphilosophy. I don't have the bandwidth to defend this idea in the current essay, but I hope to defend it at some point.

  9. ^

    The idea of mainstream AI taking inspiration from Logical Induction to generate capability insights is something that a number of people I know have considered to be a risk for some time; the argument being that it would be net-negative due to accelerating capabilities.

Comments

I don’t think your model hangs together, basically because I think “AI that produces slop” is almost synonymous with “AI that doesn’t work very well”, whereas you’re kinda treating AI power and slop as orthogonal axes.

For example, from comments:

Two years later, GPT7 comes up with superhumanly-convincing safety measures XYZ. These inadequate standards become the dominant safety paradigm. At this point if you try to publish "belief propagation" it gets drowned out in the noise anyway.

Some relatively short time later, there are no humans.

I think that, if there are no humans, then slop must not be too bad. AIs that produce incoherent superficially-appealing slop are not successfully accomplishing ambitious nontrivial goals, right?

(Or maybe you’re treating it as a “capabilities elicitation” issue? Like, the AI knows all sorts of things, but when we ask, we get sycophantic slop answers? But then we should just say that the AI is mediocre in effect. Even if there’s secretly a super-powerful AI hidden inside, who cares? Unless the AI starts scheming, but I thought AI scheming was out-of-scope for this post.)

Anti-slop AI helps everybody make fewer mistakes. Sloppy AI convinces lots of people to make more mistakes.

I would have said “More powerful AI (if aligned) helps everybody make fewer mistakes. Less powerful AI convinces lots of people to make more mistakes.” Right?

And here’s a John Wentworth excerpt:

So the lab implements the non-solution, turns up the self-improvement dial, and by the time anybody realizes they haven’t actually solved the superintelligence alignment problem (if anybody even realizes at all), it’s already too late.

If the AI is producing slop, then why is there a self-improvement dial? Why wouldn’t its self-improvement ideas be things that sound good but don’t actually work, just as its safety ideas are?

 

Really, I think John Wentworth’s post that you’re citing has a bad framing. It says: the concern is that early transformative AIs produce slop.

Here’s what I would say instead:

Figuring out how to build aligned ASI is a harder technical problem than just building any old ASI, for lots of reasons, e.g. the latter allows trial-and-error. So we will become capable of building ASI sooner than we’ll have a plan to build aligned ASI.

Whether the “we” in that sentence is just humans, versus humans with the help of early transformative AI assistance, hardly matters.

But if we do have early transformative AI assistants, then the default expectation is that they will fail to solve the ASI alignment problem until it’s too late. Maybe those AIs will fail to solve the problem by outputting convincing-but-wrong slop, or maybe they’ll fail to solve it by outputting “I don’t know”, or maybe they’ll fail to solve it by being misaligned, a.k.a. a failure of “capabilities elicitation”. Who cares? What matters is that they fail to solve it. Because people (and/or the early transformative AI assistants) will build ASI anyway.

For example, Yann LeCun doesn’t need superhumanly-convincing AI-produced slop, in order to mistakenly believe that he has solved the alignment problem. He already mistakenly believes that he has solved the alignment problem! Human-level slop was enough. :)

In other words, suppose we’re in a scenario with “early transformative AIs” that are up to the task of producing more powerful AIs, but not up to the task of solving ASI alignment. You would say to yourself: “if only they produced less slop”. But to my ears, that’s basically the same as saying “we should creep down the RSI curve, while hoping that the ability to solve ASI alignment emerges earlier than the breakdown of our control and alignment measures and/or ability to take over”.

 

…Having said all that, I’m certainly in favor of thinking about how to get epistemological help from weak AIs that doesn’t give a trivial affordance for turning the weak AIs into very dangerous AIs. For that matter, I’m in favor of thinking about how to get epistemological help from any method, whether AI or not.  :)

Two years later, GPT7 comes up with superhumanly-convincing safety measures XYZ. These inadequate standards become the dominant safety paradigm. At this point if you try to publish "belief propagation" it gets drowned out in the noise anyway.

Some relatively short time later, there are no humans.

I think that, if there are no humans, then slop must not be too bad. AIs that produce incoherent superficially-appealing slop are not successfully accomplishing ambitious nontrivial goals, right?

Maybe "some relatively short time later" was confusing. I mean long enough for the development cycle to churn a couple more times.

IE, GPT7 convinces people of sloppy safety measures XYZ, people implement XYZ and continue scaling up AGI, the scaled-up superintelligence is a schemer.

(Or maybe you’re treating it as a “capabilities elicitation” issue? Like, the AI knows all sorts of things, but when we ask, we get sycophantic slop answers? But then we should just say that the AI is mediocre in effect. Even if there’s secretly a super-powerful AI hidden inside, who cares? Unless the AI starts scheming, but I thought AI scheming was out-of-scope for this post.)

I do somewhat think of this as a capabilities elicitation issue. I think current training methods are eliciting convincingness, sycophantism, and motivated cognition (for some unknown combination of the obvious reasons and not-so-obvious reasons).

But, as clarified above, the idea isn't that sloppy AI is hiding a super-powerful AI inside. It's more about convincingness outpacing truthfulness. I think that is a well-established trend. I think many people expect "reasoning models" to reverse that trend. My experience so far suggests otherwise.

I would have said “More powerful AI (if aligned) helps everybody make fewer mistakes. Less powerful AI convinces lots of people to make more mistakes.” Right?

What I'm saying is that "aligned" isn't the most precise concept to apply here. If scheming is the dominant concern, yes. If not, then the precisely correct concept seems closer to the "coherence" idea I'm trying to gesture at.

I've watched (over Discord) a developer get excited about a supposed full-stack AI development tool which builds a whole application for you based on a prompt. They tried a few simple examples and exclaimed that it was like magic, then over the course of a few more hours issued progressive updates of "I'm a little less excited now", until they had updated to a very low level of excitement and decided that it seemed like magic mainly because it had been optimized to work well for the sorts of simple examples developers might try first when putting it through its paces.

I'm basically extrapolating that sort of thing forward, to cases where you only realize something was bad after months or years instead of hours. As development of these sorts of tools moves forward, they'll start to succeed at impressing on the timescale of days and weeks. A big assumption of my model is that to do that, they don't need to fundamentally solve the bad-at-extrapolation problem (hallucinations, etc); they can instead do it in a way that goodharts on the sorts of feedback they're getting.

Alignment is broad enough that I can understand classifying this sort of failure as "alignment failure" but I don't think it is the most precise description.

If the AI is producing slop, then why is there a self-improvement dial? Why wouldn’t its self-improvement ideas be things that sound good but don’t actually work, just as its safety ideas are?

This does seem possible, but I don't find it probable. Self-improvement ideas can be rapidly tested for their immediate impacts, but checking their long-term impacts is harder. Therefore, a sloppy AI can generate many non-working self-improvements that just get discarded, and that's fine; it's the apparently-working self-improvement ideas that cause problems down the line. Similarly, the AI itself can more easily train on short-term impacts of proposed improvements; so the AI might have a lot less slop when reasoning about these short-term impacts, due to getting that feedback.

(Notice how I am avoiding phrasing it like "the sloppy AI can be good at capabilities but bad at alignment because capabilities are easier to train on than alignment, due to better feedback". Instead, focusing on short-term impacts vs long-term impacts seems to carve closer to the joints of reality.)

Sloppy AIs are nonetheless fluent with respect to existing knowledge or things that we can get good-quality feedback for, but have trouble extrapolating correctly. Your scenario, where the sloppy AI can't help with self-improvement of any kind, suggests a world where there is no low-hanging fruit via applying existing ideas to improve the AI, or applying the kinds of skills which can be developed with good feedback. This seems possible but not especially plausible.

But if we do have early transformative AI assistants, then the default expectation is that they will fail to solve the ASI alignment problem until it’s too late. Maybe those AIs will fail to solve the problem by outputting convincing-but-wrong slop, or maybe they’ll fail to solve it by outputting “I don’t know”, or maybe they’ll fail to solve it by being misaligned, a.k.a. a failure of “capabilities elicitation”. Who cares? What matters is that they fail to solve it. Because people (and/or the early transformative AI assistants) will build ASI anyway.

I think this is a significant point wrt my position. I think my position depends to some extent on the claim that it is much better for early TAI to say "I don't know" as opposed to outputting convincing slop. If leading AI labs are so bullish that they don't care whether their own AI thinks it is safe to proceed, then I agree that sharing almost any capability-relevant insights with these labs is a bad idea.

I think Abram is saying the following:

  • Currently, AIs are lacking capabilities that would meaningfully speed up AI Safety research.
  • At some point, they are gonna get those capabilities.
  • However, by default, they are gonna get those AI Safety-helpful capabilities roughly at the same time as other, dangerous capabilities (or at least, not meaningfully earlier).
    • In which case, we're not going to have much time to use the AI Safety-helpful capabilities to speed up AI Safety research sufficiently for us to be ready for those dangerous capabilities.
  • Therefore, it makes sense to speed up the development of AIS-helpful capabilities now. Even if it means that the AIs will acquire dangerous capabilities sooner, it gives us more time to use AI Safety-helpful capabilities to prepare for dangerous capabilities.

Right, so one possibility is that you are doing something that is “speeding up the development of AIS-helpful capabilities” by 1 day, but you are also simultaneously speeding up “dangerous capabilities” by 1 day, because they are the same thing.

If that’s what you’re doing, then that’s bad. You shouldn’t do it. Like, if AI alignment researchers want AI that produces less slop and is more helpful for AIS, we could all just hibernate for six months and then get back to work. But obviously, that won’t help the situation.

And a second possibility is, there are ways to make AI more helpful for AI safety that are not simultaneously directly addressing the primary bottlenecks to AI danger. And we should do those things.

The second possibility is surely true to some extent—for example, the LessWrong JargonBot is marginally helpful for speeding up AI safety but infinitesimally likely to speed up AI danger.

I think this OP is kinda assuming that “anti-slop” is the second possibility and not the first possibility, without justification. Whereas I would guess the opposite.

Right, so one possibility is that you are doing something that is “speeding up the development of AIS-helpful capabilities” by 1 day, but you are also simultaneously speeding up “dangerous capabilities” by 1 day, because they are the same thing.

TBC, I was thinking about something like: "speed up the development of AIS-helpful capabilities by 3 days, at the cost of speeding up the development of dangerous capabilities by 1 day".

I think it’s 1:1, because I think the primary bottleneck to dangerous ASI is the ability to develop coherent and correct understandings of arbitrary complex domains and systems (further details), which basically amounts to anti-slop.

If you think the primary bottleneck to dangerous ASI is not that, but rather something else, then what do you think it is? (or it’s fine if you don’t want to state it publicly)

So, rather than imagining a one-dimensional "capabilities" number, let's imagine a landscape of things you might want to be able to get AIs to do, with a numerical score for each. In the center of the landscape is "easier" things, with "harder" things further out. There is some kind of growing blob of capabilities, spreading from the center of the landscape outward.

Techniques which are worse at extrapolating (IE worse at "coherent and correct understanding" of complex domains) create more of a sheer cliff in this landscape, where things go from basically-solved to not-solved-at-all over short distances in this space. Techniques which are better at extrapolating create more of a smooth drop-off instead. This is liable to grow the blob a lot faster; a shift to better extrapolation sees the cliffs cast "shadows" outwards.
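
Here is a toy numeric rendering of that picture; the curves are entirely made up and only illustrate the shape of the claim, not any real measurement.

```python
import math

def cliff_score(difficulty: float) -> float:
    """Sheer cliff: near-perfect up to some difficulty, then collapses fast."""
    return 1 / (1 + math.exp(20 * (difficulty - 0.5)))

def smooth_score(difficulty: float) -> float:
    """Smooth drop-off: degrades gradually as tasks get harder."""
    return 1 / (1 + math.exp(4 * (difficulty - 0.5)))

for d in (0.2, 0.4, 0.6, 0.8):
    print(f"difficulty {d:.1f}: cliff={cliff_score(d):.2f}  smooth={smooth_score(d):.2f}")

# On the easy tasks we can actually test (difficulty <= 0.4), the "cliff"
# system scores *higher*, which is exactly what invites over-extrapolating
# its competence past the cliff.
```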

My claim is that cliffs are dangerous for a different reason, namely that people often won't realize when they're falling off a cliff. The AI seems super-competent for the cases we can easily test, so humans extrapolate its competence beyond the cliff. This applies to the AI as well, if it lacks the capacity for detecting its own blind spots. So RSI is particularly dangerous in this regime, compared to a regime with better extrapolation.

This is very analogous to early Eliezer observing the AI safety problem and deciding to teach rationality. Yes, if you can actually improve people's rationality, they can use their enhanced capabilities for bad stuff too. Very plausibly the movement which Eliezer created has accelerated AI timelines overall. Yet, it feels plausible that without Eliezer, there would be almost no AI safety field.

I’m still curious about how you’d answer my question above. Right now we don’t know how to build ASI. Sometime in the future, we will. So there has to be some improvement to AI technology that will happen between now and then. My opinion is that this improvement will involve AI becoming (what you describe as) “better at extrapolating”.

If that’s true, then however we feel about getting AIs that are “better at extrapolating”—its costs and its benefits—it doesn’t much matter, because we’re bound to get those costs and benefits sooner or later on the road to ASI. So we might as well sit tight and find other useful things to do, until such time as the AI capabilities researchers figure it out.

…Furthermore, I don’t think the number of months or years between “AIs that are ‘better at extrapolating’” and ASI is appreciably larger if the “AIs that are ‘better at extrapolating’” arrive tomorrow, versus if they arrive in 20 years. In order to believe that, I think you would need to expect some second bottleneck standing between “AIs that are ‘better at extrapolating’”, and ASI, such that that second bottleneck is present today, but will not be present (as much) in 20 years, and such that the second bottleneck is not related to “extrapolation”.

I suppose that one could argue that availability of compute will be that second bottleneck. But I happen to disagree. IMO we already have an absurdly large amount of compute overhang with respect to ASI, and adding even more compute overhang in the coming decades won’t much change the overall picture. Certainly plenty of people would disagree with me here. …Although those same people would probably say that “just add more compute” is actually the only way to make AIs that are “better at extrapolation”, in which case my point would still stand.

I don’t see any other plausible candidates for the second bottleneck. Do you? Or do you disagree with some other part of that? Like, do you think it’s possible to get all the way to ASI without ever making AIs “better at extrapolating”? IMO it would hardly be worthy of the name “ASI” if it were “bad at extrapolating”  :)

If you think the primary bottleneck to dangerous ASI is not that, but rather something else, then what do you think it is?

So far in this thread I was mostly talking from the perspective of my model(/steelman?) of Abram's argument.

I think the primary bottleneck to dangerous ASI is the ability to develop coherent and correct understandings of arbitrary complex domains and systems

I mostly agree with this.

Still, this doesn't[1] rule out the possibility of getting an AI that understands (is superintelligent in?) one complex domain (specifically here, whatever is necessary to meaningfully speed up AIS research) (and maybe a few more, as I don't expect the space of possible domains to be that compartmentalizable), but is not superintelligent across all complex domains that would make it dangerous.

It doesn't even have to be a superintelligent reasoner about minds. Babbling up clever and novel mathematical concepts for a human researcher to prune could be sufficient to meaningfully boost AI safety (I don't think we're primarily bottlenecked on mathy stuff but it might help some people and I think that's one thing that Abram would like to see).

  1. ^

    Doesn't rule out in itself but perhaps you have some other assumptions that imply it's 1:1, as you say.

So the lab implements the non-solution, turns up the self-improvement dial, and by the time anybody realizes they haven’t actually solved the superintelligence alignment problem (if anybody even realizes at all), it’s already too late.

If the AI is producing slop, then why is there a self-improvement dial? Why wouldn’t its self-improvement ideas be things that sound good but don’t actually work, just as its safety ideas are?

Because you can speed up AI capabilities much more easily while being sloppy than you can produce actually good alignment ideas.

If you really think you need to be similarly unsloppy to build ASI as to align ASI, I'd be interested in discussing that. So maybe give some pointers to why you might think that (or tell me to start).

(Tbc, I directionally agree with you that anti-slop is very useful AI capabilities and that I wouldn't publish stuff like Abram's "belief propagation" example.)

Because you can speed up AI capabilities much more easily while being sloppy than you can produce actually good alignment ideas.

Right, my point is, I don’t see any difference between “AIs that produce slop” and “weak AIs” (a.k.a. “dumb AIs”). So from my perspective, the above is similar to : “…Because weak AIs can speed up AI capabilities much easier than they can produce actually good alignment ideas.”

…And then if you follow through the “logic” of this OP, then the argument becomes: “AI alignment is a hard problem, so let’s just make extraordinarily powerful / smart AIs right now, so that they can solve the alignment problem”.

See the error?

If you really think you need to be similarly unsloppy to build ASI as to align ASI, I'd be interested in discussing that. So maybe give some pointers to why you might think that (or tell me to start).

I don’t think that. See the bottom part of the comment you’re replying to. (The part after “Here’s what I would say instead:”)

Right, my point is, I don’t see any difference between “AIs that produce slop” and “weak AIs” (a.k.a. “dumb AIs”). So from my perspective, the above is similar to : “…Because weak AIs can speed up AI capabilities much easier than they can produce actually good alignment ideas.”

I want to explicitly call out my cliff vs gentle slope picture from another recent comment. Sloppy AIs can have a very large set of tasks at which they perform very well, but they have sudden drops in their abilities due to failure to extrapolate well outside of that.

If you de-slopify the models, how do you avoid people then using them to accelerate capabilities research just as much as safety research? Why wouldn't that leave us with the same gap in progress between the two we have right now, or even a worse gap? Except that everything would be moving to the finish line even faster, so Earth would have even less time to react.

Is the idea that it wouldn't help safety go differentially faster at all, but rather just that it may preempt people latching on to false slop-solutions for alignment as an additional source of confidence that racing ahead is fine? If that is the main payoff you envision, I don't think it'd be worth the downside of everything happening even faster. I think time is very precious, and sources of confidence already abound for those who go looking for them.

Hmmm. I'm not exactly sure what the disconnect is, but I don't think you're quite understanding my model.

I think anti-slop research is very probably dual-use. I expect it to accelerate capabilities. However, I think attempting to put "capabilities" and "safety" on the same scale and maximize differential progress of safety over capabilities is an overly simplistic model which doesn't capture some important dynamics.

There is not really a precise "finish line". Rather, we can point to various important events. The extinction of all humans lies down a path where many mistakes (of varying sorts and magnitudes) were made earlier.

Anti-slop AI helps everybody make fewer mistakes. Sloppy AI convinces lots of people to make more mistakes.

My assumption is that frontier labs are racing ahead anyway. The idea is that we'd rather they race ahead with a less-sloppy approach. 

Imagine an incautious teenager who is running around all the time and liable to run off a cliff. You expect that if they run off a cliff, they die -- at this rate you expect such a thing to happen sooner or later. You can give them magic sneakers that allow them to run faster, but also improve their reaction time, their perception of obstacles, and even their wisdom. Do you give the kid the shoes?

It's a tough call. Giving the kid the shoes might make them run off a cliff even faster than they otherwise would. It could also allow them to stop just short of the cliff when they otherwise wouldn't.

I think if you value increased P(they survive to adulthood) over increased E(time they spend as a teenager), you give them the shoes. IE, withholding the shoes values short-term over long-term. If you think there's no chance of survival to adulthood either way, you don't hand over the shoes.

A lot of this stuff is very similar to the automated alignment research agenda that Jan Leike and collaborators are working on at Anthropic. I'd encourage anyone interested in differentially accelerating alignment-relevant capabilities to consider reaching out to Jan!

I’m working on this. I’m unsure if I should be sharing exactly what I’m working on with a frontier AGI lab, though. How can we be sure this just leads to differentially accelerating alignment?

Edit: my main consideration is when I should start mentioning details. As in, should I wait until I’ve made progress on alignment internally before sharing with an AGI lab. Not sure what people are disagreeing with since I didn't make a statement.

I'm not sure I can talk about this effectively in the differential progress framework. My argument is that if we expect to die to slop, we should push against slop. In particular, if we expect to die to slop-at-big-labs, we should push against slop-at-big-labs. This seems to suggest a high degree of information-sharing about anti-slop tech.

Anti-slop tech is almost surely also going to push capabilities in general. If we currently think slop is a big source of risk, it seems worth it.

Put more simply: if someone is already building superintelligence & definitely going to beat you & your allies to it, then (under some semi-plausible additional assumptions) you want to share whatever safety tech you have with them, disregarding differential-progress heuristics.

Again, I'm not certain of this model. It is a costly move in the sense of having a negative impact on some possible worlds where death by slop isn't what actually happens.

What kind of alignment research do you hope to speed up anyway?

For advanced-philosophy-like stuff (e.g. finding good formal representations for world models, or inventing logical induction), they don't seem anywhere remotely close to being useful.

My guess would be that for tiling agents theory neither, but I haven't worked on it, so I'm very curious about your take here. (IIUC, to some extent the goal of tiling-agents-theory-like work there was to have an AI solve its own alignment problem. Not sure how far the theory side got there and whether it could be combined with LLMs.)

Or what is your alignment hope in more concrete detail?

Yeah, my sense is that modern AI could be useful to tiling agent stuff if it were less liable to confabulate fake proofs. This generalizes to any technical branch of AI safety where AI could help come up with formalizations of ideas, proofs of conjectures, etc. My thinking suggests there is something of an "overhang" here at present, in the sense that modern AI models are worse-than-useless due to the way that they try to create good-looking answers at the expense of correctness.

I disagree with the statement "to some extent the goal of tiling-agents-like work was to have an AI solve its own alignment problem" -- the central thing is to understand conditions under which one agent can justifiably trust another (with "trust" operationalized as whether one agent wants to modify the decision procedure of the other). If AI can't justifiably trust itself, then it has a potential motive to modify itself in ways that remove safety guarantees (so in this sense, tiling is a precondition for lots of safety arguments). Perhaps more importantly, if we can understand conditions under which humans can justifiably trust AI, then we have a formal target for alignment.

Thanks.

True, I think your characterization of tiling agents is better. But my impression was sorta that this self-trust is an important precursor for the dynamic self-modification case, where alignment properties need to be preserved through the self-modification. Yeah, I guess calling this "AI solving its own alignment" is sorta confused, though maybe there's something in this direction, because the AI still does the search to try to preserve the alignment properties?

Hm I mean yeah if the current bottleneck is math instead of conceptualizing what math has to be done then it's a bit more plausible. Like I think it ought to be feasible to get AIs that are extremely good at proving theorems and maybe also formalizing conjectures. Though I'd be a lot more pessimistic about finding good formal representations for describing/modelling ideas.

Do you think we are basically only bottlenecked on math, so that sufficient math skill could carry us to aligned AI? Or do we just have some alignment-philosophy overhang you want to formalize, after which more philosophy will be needed?

I think there is both important math work and important conceptual work. Proving new theorems involves coming up with new concepts, but also, formalizing the concepts and finding the right proofs. The analogy to robots handling the literal heavy lifting part of a job seems apt.

This argument might move some people to work on "capabilities" or to publish such work when they might not otherwise do so.

Above all, I'm interested in feedback on these ideas. The title has a question mark for a reason; this all feels conjectural to me.

My current guess:

I wouldn't expect much useful research to come from publishing the ideas. It's mostly just going to be used for capabilities, and it seems like a bad idea to publish stuff.

Sure, you can work on it and be infosec-cautious and keep it secret. Maybe share it with a few very trusted people who might actually have some good ideas. And depending on how things play out: if in a couple of years there's some actual joint effort from the leading labs to align AI, and they only have something like 2-8 months left before competition hits the AI-improving-AI dynamic quite hard, then you might go to the labs and share your ideas with them (while still trying to keep it closed within those labs - which will probably only work for a few months or a year or so until there's leakage).

Do you not at all buy John's model, where there are important properties we'd like nearer-term AI to have in order for those AIs to be useful tools for subsequent AI safety work?

Can you link me to what you mean by John's model more precisely?

If you mean John's slop-instead-of-scheming post, I agree with that, at least the "slop slightly more likely than scheming" part. I might need to reread John's post to see what the concrete suggestions for what to work on might be. Will do so tomorrow.

I'm just pessimistic that we can get any nontrivially useful alignment work out of AIs until a few months before the singularity, at least besides some math. Or like at least for the parts of the problem we are bottlenecked on.

So like I think it's valuable to have AIs that are near the singularity be more rational. But I don't really buy the differentially improving alignment thing. Or like could you make a somewhat concrete example of what you think might be good to publish?

Like, all capabilities will help somewhat with the AI being less likely to make errors that screw up its alignment. Which ones do you think are more important than others? There would have to be a significant difference in the usefulness of some capabilities, because otherwise you could just do the same alignment work later and still have similarly much time until superintelligence (and could get more work done that doesn't speed up timelines).

Concrete (if extreme) story:

World A:

Invent a version of "belief propagation" which works well for LLMs. This offers a practical way to ensure that if an LLM seems to know something in one context, it can & will fluently invoke the same knowledge in almost all appropriate contexts.

Keep the information secret in order to avoid pushing capabilities forward.

Two years later, GPT7 comes up with superhumanly-convincing safety measures XYZ. These inadequate standards become the dominant safety paradigm. At this point if you try to publish "belief propagation" it gets drowned out in the noise anyway.

Some relatively short time later, there are no humans.

World B:

Invent LLM "belief propagation" and publish it. It is good enough (by assumption) to be the new paradigm for reasoning models, supplanting current reinforcement-centric approaches.

Two years later, GPT7 is assessing its safety proposals realistically instead of convincingly arguing for them. Belief propagation allows AI to facilitate a highly functional "marketplace of ideas" where the actually-good arguments tend to win out far more often than the bad arguments. AI progress is overall faster, but significantly safer.

(This story of course assumes that "belief propagation" is an unrealistically amazing insight; still, this points in the direction I'm getting at)

Thanks for providing a concrete example!

Belief propagation seems too much of a core of AI capability to me. I'd rather place my hope on GPT7 not being all that good yet at accelerating AI research and us having significantly more time.

I also think the "drowned out in the noise" part isn't that realistic. You ought to be able to show some quite impressive results relative to the computing power used. Though when you should try to convince the AI labs of your better paradigm is going to be a difficult call. It's plausible to me we won't see signs that make us sufficiently confident that we only have a short time left, and it's plausible we do.

In any case before you publish something you can share it with trustworthy people and then we can discuss that concrete case in detail.

Btw tbc, sth that I think slightly speeds up AI capability but is good to publish is e.g. producing rationality content for helping humans think more effectively (and AIs might be able to adopt the techniques as well). Creating a language for rationalists to reason in more Bayesian ways would probably also be good to publish.

Btw tbc, sth that I think slightly speeds up AI capability but is good to publish is e.g. producing rationality content for helping humans think more effectively (and AIs might be able to adopt the techniques as well). Creating a language for rationalists to reason in more Bayesian ways would probably also be good to publish.

Yeah, basically everything I'm saying is an extension of this (but obviously, I'm extending it much further than you are). We don't exactly care whether the increased rationality is in humans or AI, when the two are interacting a lot. (That is, so long as we're assuming scheming is not the failure mode to worry about in the shorter-term.) So, improved rationality for AIs seems similarly good. The claim I'm considering is that even improving rationality of AIs by a lot could be good, if we could do it.

An obvious caveat here is that the intervention should not dramatically increase the probability of AI scheming!

Belief propagation seems too much of a core of AI capability to me. I'd rather place my hope on GPT7 not being all that good yet at accelerating AI research and us having significantly more time.

This just seems doomed to me. The training runs will be even more expensive, the difficulty of doing anything significant as an outsider ever-higher. If the eventual plan is to get big labs to listen to your research, then isn't it better to start early? (If you have anything significant to say, of course.)