Sorted by New

Wiki Contributions


My short answer: Violations of the IID assumption is the likeliest problem in trying to generalize your values, and I see this as the key flaw underlying the post.

You can make the "some subnetwork just models its training process and cares about getting low loss, and then gets promoted" argument against literally any loss function, even some hypothetical "perfect" one (which, TBC, I think is a mistaken way of thinking). If I buy this argument, it seems like a whole lot of alignment dreams immediately burst into flame. No loss function would be safe. This conclusion, of course, does not decrease in the slightest the credibility of the argument. But I don't perceive you to believe this implication.

This might be the cleanest explanation for why alignment is so hard by default. Loss functions do not work, and reward functions don't work well.

EtA: I am still more concerned about "not enough samples to learn human preferences" than ELK or inner optimization type failures. This seems to be a fairly unpopular view, and I haven't scrutinized it too much (but would be interested to discuss it cooperatively).

This is a crux for me, as it is why I don't think slow takeoff is good by default. I think deceptive alignment is the default state barring interpretability efforts that are strong enough to actually detect mesa-optimizers or myopia. Yes, Foom is probably not going to happen, but in my view that doesn't change much regarding risk in total.

We assume AI learning timescales vastly outstrip human learning timescales as a way of keeping our definition tractable. So the only way to structure this problem in our framework would be to imagine a human is playing chess against a superintelligent AI — a highly distorted situation compared to the case of two roughly equal opponents.

I think this is probably true in the long term (the classical-quantum/reversible computer transition is very large, and humans can't easily modify brains, unlike a virtual human.) But this may not be true in the short-term.

Current SotA systems are very opaque — we more-or-less can't inspect or intervene on their thoughts — and it isn't clear how we could navigate to AI approaches that are far less opaque, and that can carry forward to AGI. (Though it seems very likely such approaches exist somewhere in the space of AI research approaches.)

Yeah, it does seem like interpreterability is a bottleneck for a lot of alignment proposals, and in particular as long as neutral networks are essentially black boxes, deceptive alignment/inner alignment issues seem almost impossible to address.

First up, I've strongly upvoted it as an example of advancing the alignment frontier, and I think this is plausibly the easiest solution provided it can actually be put into code.

But unfortunately there's a huge wrecking ball into it, and that's deceptive alignment. As we try to solve increasingly complex problems, deceptive alignment becomes the default, and this solution doesn't work. Basically in Evhub's words, mere compliance is the default, and since a treacherous turn when the AI becomes powerful is possible, that this solution alone can't do very well.

Here's a link:

The real question for Habryka is why does he think that it's bad for WebGPT to be built in order to get truthful AI? Like, isn't solving that problem quite a significant thing already for alignment?

So you can have non-binding recommendations and input, but no actual binding power over the capabilities researchers, right?

My viewpoint is that the most dangerous risks rely on inner alignment issues, and that is basically because of very bad transparency tools, instrumental convergence issues toward power and deception, and mesa-optimizers essentially ruining what outer alignment you have. If you could figure out a reliable way to detect or make sure that deceptive models could never be reached in your training process, that would relieve a lot of my fears of X-risk from AI.

I actually think Eliezer is underrating civilizational competence once AGI is released via the MNM effect, as it happened for Covid, unfortunately this only buys time before the end. A superhuman intelligence that is deceiving human civilization based on instrumental convergence essentially will win barring pivotal acts, as Eliezer says. The goal of AI safety is to make alignment not dependent on heroic, pivotal actions.

So Andrew Critich's hopes of not needing pivotal acts only works if significant portions of the alignment problem are solved or at least ameliorated, which we are not super close to doing, So whether alignment will require pivotal acts is directly dependent on solving the alignment problem more generally.

Pivotal acts are a worse solution to alignment and shouldn't be thought of as the default solution, but it is a back-pocket solution we shouldn't forget about.

If I had to rate a crux between Eliezer Yudkowsky's/Rob Bensinger's/Nate Soares's/MIRI's views and Deepmind's Safety Team views or Andrew Critich's view, it's whether the Alignment problem is foresight-loaded (And thus civilization will be incompetent as well as safety requiring more pivotal acts) or empirically-loaded, where we don't need to see the bullets in advance (And thus civilization could be more competent and pivotal acts matter less). It's an interesting crux to be sure.

PS: Does Deepmind's Safety Team have real power to disapprove AI projects? Or are they like the Google Ethics team, where they had no power to disapprove AI projects without being fired.

Load More