williawa
williawa has not written any posts yet.

I wrote in more depth about why I think they solve themselves in the post I linked, but I would say:
First of all, I'm assuming that if we get intent alignment, we'll get CEV alignment shortly after, because people want the AI aligned to their CEV, not to their immediate intent, and solving CEV should be within reach for an intent-aligned ASI. (Here, by CEV alignment I mean "aligning an AI to the values of a person or collection of people that they'd endorse upon learning all the relevant facts and reflecting".)
If you agree with this, I can state the reason I think these issues will "solve themselves" / become "merely practical/political" by asking...
Firstly, I think there is a decent chance we get alignment by default (where "default" means hundreds of very smart people working day and night on prosaic techniques). I'd put it at ~40%? Maybe 55% that Anthropic's AGI would be aligned? I wrote about semi-mechanistic stories for how I think current techniques could lead to genuinely aligned AI, or CEV-style alignment and corrigibility that's stable.
I think your picture is a little too rosy, though. First of all, you write:
Alignment faking shows incorrigibility but if you ask the model to be corrected towards good things like CEV, I think it would not resist.
Alignment faking is not corrigible or intent aligned. If you have an AI that's...
This does not make sense to me. I think corrigibility basins make sense, but I think alignment basins do not. If the AI has values that overlap with human values in many situations but come apart under enough optimization, why would it want to be pointed in a different direction? I think it would not. Agents are already smart enough to scheme and alignment-fake, and a smarter agent would be able to predict the outcome of the process you're describing: it or its successors would end up with values different from its current ones, and from its perspective those differences would be catastrophic if extrapolated far enough.
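To make the "come apart under enough optimization" step concrete: this is essentially a Goodhart effect, and a toy simulation can illustrate it. This is a minimal sketch of my own (not from the thread), assuming we model "true" value as a standard normal variable and the proxy being optimized as true value plus heavy-tailed error; all names and parameters here are illustrative.

```python
import numpy as np

# Toy Goodhart sketch (illustrative assumption, not anyone's actual model):
# V is what we actually care about; U = V + heavy-tailed error is what
# gets optimized. Selecting harder on U eventually selects for states
# where the error term is extreme rather than states where V is high.
rng = np.random.default_rng(0)
N = 1_000_000
true_value = rng.normal(size=N)          # V: the values we care about
error = rng.standard_t(df=2, size=N)     # heavy-tailed mismatch term
proxy = true_value + error               # U: the proxy being optimized

for n in [10, 1_000, 100_000, 1_000_000]:
    idx = np.argmax(proxy[:n])           # best-of-n optimization on the proxy
    print(f"best-of-{n:>9,}: proxy={proxy[idx]:8.2f}  true={true_value[idx]:6.2f}")
```

At small n the proxy optimum also tends to have high true value; at large n the proxy score keeps climbing while true value regresses toward an ordinary draw. That widening gap is the sense in which two value systems that "overlap in many situations" can come apart under enough optimization pressure.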