A PDF version of this report is available here. Summary: In this report we argue that AI systems capable of large-scale scientific research will likely pursue unwanted goals, and that this will lead to catastrophic outcomes. We argue this is the default outcome, even with significant countermeasures, given the current...
TLDR: We might want to use some form of oversight technique to avoid inner misalignment failures. Models will be too large and complicated for a human to understand, so we will use models to oversee models (or to help humans oversee models). In many proposals this overseer model is an...
In this post I want to lay out some framings and thoughts about deception in misaligned AI systems. Types of Deception: There seem to be two different things that people mean by ‘deception’, which have different causes and likely different effects. Because these are both often called ‘deception’, they are...
This post was written under Evan Hubinger’s direct guidance and mentorship, as part of the Stanford Existential Risks Institute ML Alignment Theory Scholars (MATS) program. Additional thanks to Michele Campolo, Oliver Zhang, Adam Shimi, and Leo Gao for their thoughts and feedback on this post. This work is additionally...