All of Jozdien's Comments + Replies

Conditioning Generative Models for Alignment

Thanks for the feedback!

I agree that there's lots of room for more detail - originally I'd planned for this to be even longer, but it started to get too bloated. Some of the claims I make here unfortunately do lean on some of that shared context yeah, although I'm definitely not ruling out the possibility that I just made mistakes at certain points.

  • I think when I talk about conditioning in post I'm referring to prompting, unless I'm misunderstanding what you mean by conditioning on latent states for language models (which is entirely possible).
  • That's a ver
... (read more)
0Charlie Steiner24d
Re: prompting: So when you talk about "simulating a world," or "describing some property of a world," I interpreted that as conditionalizing on a feature of the AI's latent model of the world, rather than just giving it a prompt like "You are a very smart and human-aligned researcher." This latter deviates from the former in some pretty important ways, which should probably be considered when evaluating the safety of outputs from generative models. Re: prophecies: I mean that your training procedure doesn't give an AI an incentive to make self-fulfilling prophecies. I think you have a picture where an AI with inner alignment failure might choose outputs that are optimal according to the loss function but lead to bad real-world consequences, and that these outputs would look like self-fulfilling prophecies because that's a way to be accurate while still having degrees of freedom about how to affect the world. I'm saying that the training loss just cares about next-word accuracy, not long term accuracy according to the latent model of the world, and so AI with inner alignment failure might choose outputs that are highly probable according to next word accuracy but lead to bad real-world consequences, and that these outputs would not look like self-fulfilling prophecies.
Prizes for ELK proposals

While reading through the report I made a lot of notes about stuff that wasn't clear to me, so I'm copying here the ones that weren't resolved after finishing it.  Since they were written while reading, a lot of these may be either obvious or nitpick-y.

Footnote 14, page 15:

Though we do believe that messiness may quantitatively change when problems occur. As a caricature, if we had a method that worked as long as the predictor's Bayes net had fewer than 109 parameters, it might end up working for a realistic messy AI until it had 1012 parameters, since

... (read more)
Decision Theory

I think I'm missing something with the Löb's theorem example.

If  can be proved under the theorem, then can't  also be proved?  What's the cause of the asymmetry that privileges taking $5 in all scenarios where you're allowed to search for proofs for a long time?

2Abram Demski1y
Agreed. The asymmetry needs to come from the source code for the agent. In the simple version I gave, the asymmetry comes from the fact that the agent checks for a proof that x>y before checking for a proof that y>x. If this was reversed, then as you said, the Lobian reasoning would make the agent take the 10, instead of the 5. In a less simple version, this could be implicit in the proof search procedure. For example, the agent could wait for any proof of the conclusion x>y or y>x, and make a decision based on whichever happened first. Then there would not be an obvious asymmetry. Yet, the proof search has to go in some order. So the agent design will introduce an asymmetry in one direction or the other. And when building theorem provers, you're not usually thinking about what influence the proof order might have on which theorems are actually true; you usually think of the proofs as this static thing which you're searching through. So it would be easy to mistakenly use a theorem prover which just so happens to favor 5 over 10 in the proof search.