Noosphere89

My second concern is that “AIs solving specific technical problems that the human wants them to solve” is insufficient to avoid extinction and get to a good future—even if the AI is solving those problems at superhuman level and with the help of (1-3) superpowers.[3]

I won’t go much into details of why I have this concern, since I want to keep this post focused on technical alignment. But I’ll spend just a few paragraphs to give a hint. For more, see my What does it take to defend the world against out-of-control AGIs? and John Wentworth’s The Case Against AI Control.

Here’s an illustrative example. People right now are researching dangerous viruses, in a way that poses a risk of catastrophic lab leaks wildly out of proportion to any benefits. Why is that happening? Not because everyone is trying to stop it but lacks the technical competence to execute. Quite the contrary! Powerful institutions like governments are not only allowing it, but are often funding it!

I think when people imagine future AI helping with this specific problem, they’re usually imagining the AI convincing people—say, government officials, or perhaps the researchers themselves—that this activity is unwise. See the problem? If AI is convincing people of things, well, powerful AIs will be able to convince people of bad and false things as well as good and true things. So if the AI is exercising its own best judgment in what to convince people of, then we’d better hope that the AI has good judgment! And it’s evidently not deriving that good judgment from human preferences. Human preferences are what got us into this mess! Right?

Remember, the “alignment generalizes farther” argument (§4 above) says that we shouldn’t worry because AI understands human preferences, and those preferences will guide its actions (via RLHF or whatever). But if we want an AI that questions human preferences rather than slavishly following them, then that argument would not be applicable! So how are we hoping to ground the AI’s motivations? It has to be something more like “ambitious value learning” or Coherent Extrapolated Volition—things that are philosophically fraught, and rather different from what people today are doing with foundation models.

 

I agree with the claim that existential catastrophes aren't automatically solved by aligned/controlled AI, and in particular biological issues remain a threat to human survival.

My general view is that AI help in practice is more likely to route through first figuring out how to solve the biology threat from a technical perspective, and then using instruction-following AIs to run a persuasion campaign using the most effective ways to change people's minds.

You are correct that an AI can convince people of true things as well as false things, which is why you need to make assumptions about the AI's corrigibility/instruction following, though you helpfully make such assumptions here.

On the first-person problem, I believe the general solution involves recapitulating human social instincts via lots of data on human values, and I'm perhaps more optimistic than you that many of the social instincts in humans don't have to be innately specified by a prior.

In many ways, I have a weird reaction to the post: I centrally agree with the claim that corrigible/instruction-following AIs aren't automatically sufficient to ensure safety, and yet I am much more optimistic than you are that mere corrigibility/instruction following goes a long way toward making AI safe. That's probably because I think you can actually do a lot more work to secure civilization in ways that semi-respect existing norms, with "semi-respect" being the key word.

Yeah, I think the crux is precisely this: I disagree with the statement below, mostly because I think instruction following/corrigibility is both plausibly easy and removes most of the need for value alignment.

"The AI's values must be much more aligned in order to be safe outside the text domain"

I think where I personally get off the train probably comes down to the section on instrumental goals leading to misaligned goals, combined with my being more skeptical that instrumental goals lead to unbounded power-seeking.

I agree there are definitely zero-sum parts of the science loop, but my worldview is that the parts where the goals are zero-sum/competitive carry less weight than the alignment attempts.

I'd say the biggest source of my skepticism so far is that I think there's a real difference between the idea that power is useful for the science loop and the idea that the AI will seize power by any means necessary to advance its goals.

I think instrumental convergence will look more like local power-seeking related to the task at hand, not power-seeking in service of the AI's other goals, primarily because denser feedback constrains the solution space and instrumental convergence more than it did for humans.

That said, this is a very good post, and I'm certainly glad that this much more rigorous post was written, compared to a lot of other takes on scheming.

Oh, now I understand.

And AIs have already been superhuman at chess for a very long time, yet that domain gives very little incentive for very strong instrumental convergence.

I am claiming that, for practical AIs, training them in the real world on goals will give them some instrumental convergence, but without further incentives it will not give them so much instrumental convergence that it leads by default to power-seeking that disempowers humans.

To answer the question:

So, as a speculative example, further along in the direction of o1 you could have something like MCTS help train these things to solve very difficult math problems, with the sparse feedback being given for complete formal proofs.

Similarly, playing text-based video games, with the sparse feedback given for winning.

Similarly, training CoT to reason about code, with sparse feedback given for predictions of the code output.

Etc.

You think these sorts of things just won't work well enough to be relevant?

Assuming the goals operate over, say, 1-10 year timescales, or maybe even just 1-year timescales with no reward shaping or feedback on intermediate results at all, I do think the system won't work well enough to be relevant, since it requires way too much training time, and plausibly way too much compute, depending on how sparse the feedback actually is.

Other AIs relying on much denser feedback will already rule the world before that happens.
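To make the sample-efficiency intuition concrete, here's a toy simulation (a minimal sketch with made-up numbers, not a model of any real training run): with terminal-only reward over a long horizon, blind trial-and-error almost never sees a nonzero signal, whereas per-step feedback shows up almost immediately.

```python
# Toy illustration of sparse vs. dense feedback (made-up numbers, purely for intuition).
# With terminal-only reward, a learning signal appears only when *every* step of a
# long episode happens to succeed, so the number of episodes before any signal
# arrives blows up exponentially in the horizon.

import random

def episodes_until_first_reward(p_step: float, horizon: int, dense: bool,
                                seed: int = 0) -> int:
    """Count episodes of blind trial-and-error until the first nonzero reward.

    dense=True  -> reward whenever any individual step succeeds
    dense=False -> reward only if all `horizon` steps succeed (terminal-only)
    """
    rng = random.Random(seed)
    episodes = 0
    while True:
        episodes += 1
        steps = [rng.random() < p_step for _ in range(horizon)]
        if (dense and any(steps)) or (not dense and all(steps)):
            return episodes

print(episodes_until_first_reward(0.5, horizon=15, dense=True))   # typically 1 episode
print(episodes_until_first_reward(0.5, horizon=15, dense=False))  # ~2**15 episodes on average
```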

[insert standard skepticism about these sorts of generalizations when generalizing to superintelligence]

But what lesson do you think you can generalize, and why do you think you can generalize that?

Alright, I'll give 2 lessons that I do think generalize to superintelligence:

  1. The data is a large factor in both an AI's capabilities and its alignment, and alignment strategies should not ignore the data sources when trying to make predictions or to intervene on the AI for alignment purposes.

  2. Instrumental convergence in a weak sense will likely exist, because having some ability to get more resources is useful for a lot of goals, but the extremely unconstrained version of instrumental convergence often assumed, where an AI grabs so much power that it effectively controls humanity, is unlikely to exist, given the constraints and feedback given to the AI.

For 1, the basic reason is that a lot of AI success in fields like Go, language modeling, etc. was jumpstarted by good data.

More importantly, I remember the post below, and while I think it overstates things in claiming that an LLM just is its dataset (that probably isn't true anymore with o1), it does matter that LLMs are shaped by their data sources.

https://nonint.com/2023/06/10/the-it-in-ai-models-is-the-dataset/

For 2, the basic reason is that the strongest capabilities we have seen come out of RL either require immense amounts of data on pretty narrow tasks, or rely on non-instrumental world models.

This is because those constraints prevent you from running into the problem of producing completely useless RL artifacts; evolution got around this constraint only by accepting far longer timescales and far more computation, in FLOPs, than the world economy can tolerate.

I don't get why you think this is true? EG, it seems like almost no insights about how to train faithful CoT would transfer to systems speaking pure neuralese. It seems to me that what little safety/alignment we have for LLMs is largely a byproduct of how language-focused they are (which gives us a weak sort of interpretability, a very useful safety resource which we are at risk of losing soon).

I think the crux is that the important part of LLMs, re: safety, isn't their safety properties specifically, but rather the evidence they give about what alignment-relevant properties future AIs will have (note that I'm also drawing on evidence from non-LLM sources, like the MCTS algorithm that was used for AlphaGo). I also don't believe interpretability is why LLMs are mostly safe; rather, I think they're safe due to a combination of incapacity, not having extreme instrumental convergence, and our ability to steer them with data.

Language is a simple example, but one that is generalizable pretty far.

It sounds like you think safety lessons from the human-imitation regime generalize beyond the human-imitation regime.

Note that the primary points would apply to a whole lot of AI designs, like MCTS for AlphaGo or many other future architectures that don't imitate humans, barring ones that prevent you from steering them at all with data or that have very sparse feedback (which translates into only weakly constrained instrumental convergence).

but we're moving away from the regime where such dense feedback is available, so I don't see what lessons transfer.

I think this is a crux, in that I don't buy that o1 is progressing toward a regime where we lose so much dense feedback that it becomes alignment-relevant, because I think sparse-feedback RL will almost certainly be super-uncompetitive with every other AI architecture until well after AI automates all alignment research.

Also, AIs will still have instrumental convergence; it's just that their goals will be more local and focused on the training task, so unless the training task significantly rewards global power-seeking, you won't get it.

The good news I'll share is that some of the most important safety/alignment insights from work on LLMs do transfer pretty well to a lot of plausible AGI architectures. So while there's a little safety loss each time you go from 1 to 4, a lot of the theoretical ways to achieve alignment of these new systems remain intact. The danger is that the implementation difficulty pushes the safety tax too high, which is a pretty real concern.

Specifically, the insights I'm talking about are the controllability of AI with data, combined with the fact that the RL feedback these systems get is way denser than the feedback evolution gave humans, which significantly constrains instrumental convergence.
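As a concrete (and heavily simplified) illustration of what I mean by "controllability with data": curate or reweight the training corpus toward the behavior you want before fine-tuning. All names here (load_corpus, passes_behavior_filter, fine_tune) are hypothetical placeholders, not a real API.

```python
# Hypothetical sketch of steering a model with data: filter and reweight the
# corpus toward desired behavior before fine-tuning. Nothing here refers to a
# real library; load_corpus / passes_behavior_filter / fine_tune are stand-ins.

from typing import Callable, Iterable, List

def curate(corpus: Iterable[str],
           keep: Callable[[str], bool],
           upweight: int = 3) -> List[str]:
    """Drop examples that fail the behavior filter and duplicate the rest,
    a crude way of shifting the training distribution toward what we want."""
    kept = [example for example in corpus if keep(example)]
    return kept * upweight

# Usage (hypothetical):
# curated = curate(load_corpus("raw.jsonl"), passes_behavior_filter)
# fine_tune(base_model, curated)
```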

Yep, that's what I was talking about, Seth Herd.

I agree with the claim that deception could arise without deceptive alignment, and I mostly agree with the post, but I do still think it's very important to recognize that if/when deceptive alignment fails to work, it changes a lot of the conversation around alignment.

The authors write “Some people point to the effectiveness of jailbreaks as an argument that AIs are difficult to control. We don’t think this argument makes sense at all, because jailbreaks are themselves an AI control method.” I don’t really understand this point.

The point is that it requires a human to execute the jailbreak; the AI is not the jailbreaker, and the examples show that humans still retain control of the model.

The AI is not jailbreaking itself, here.

This link explains it better than I can:

https://www.aisnakeoil.com/p/model-alignment-protects-against
