Former AI safety research engineer, now AI governance researcher at OpenAI. Blog: thinkingcomplete.blogspot.com
I'm loving this whole sequence, but I particularly love:
9.2.2 Preferences are over “thoughts”, which can relate to outcomes, actions, plans, etc., but are different from all those things
That feels very crisp, clear, and informative.
Probably the easiest "honeypot" is just making it relatively easy to tamper with the reward signal. Reward tampering is useful as a honeypot because it has no bad real-world consequences, but could be arbitrarily tempting for policies that have learned a goal that's anything like "get more reward" (especially if we precommit to letting them have high reward for a significant amount of time after tampering, rather than immediately reverting).
I like this pushback, and I'm a fan of productive mistakes. I'll have a think about how to rephrase to make that clearer. Maybe there's just a communication problem, where it's hard to tell the difference between people claiming "I have an insight (or proto-insight) which will plausibly be big enough to solve the alignment problem", versus "I have very little traction on the alignment problem but this direction is the best thing I've got". If the only effect of my post is to make a bunch of people say "oh yeah, I meant the second thing all along", then I'd be pretty happy with that.
Why do I care about this? It has uncomfortable tinges of status regulation, but I think it's important because there are so many people reading about this research online, and trying to find a way into the field, and often putting the people already in the field on some kind of intellectual pedestal. Stating clearly the key insights of a given approach, and their epistemic status, will save them a whole bunch of time. E.g. it took me ages to work through my thoughts on myopia in response to Evan's posts on it, whereas if I'd known it hinged on some version of the insight I mentioned in this post, I would have immediately known why I disagreed with it.
As an example of (I claim) doing this right, see the disclaimer on my "shaping safer goals" sequence: "Note that all of the techniques I propose here are speculative brainstorming; I'm not confident in any of them as research directions, although I'd be excited to see further exploration along these lines." Although maybe I should make this even more prominent.
Lastly, I don't think I'm actually comparing Darwin's and Einstein's mature theories to Turing's incomplete theory. As I understand it, their big insights required months or years of further work before developing into mature theories (in Darwin's case, literally decades).
One thing that makes me suspicious about this argument is that, even though I can gradient hack myself, I don't think I can make suggestions about what my parameters should be changed to.
How can I gradient hack myself? For example, by thinking of strawberries every time I'm about to get a reward. Now I've hacked myself to like strawberries. But I have no idea how that's implemented in my brain, I can't "pick the parameters for myself", even if you gave me a big tensor of gradients.
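A toy sketch of the strawberry example, assuming a simple delta-rule learner (a stand-in I'm picking for illustration, not a claim about how brains or any particular ML system work). The names `train` and `cue_when_rewarded` are hypothetical. The point it shows: the agent only chooses which thought is active when reward arrives; the learning rule does the parameter update, and the agent never reads or writes the values directly.

```python
def train(cue_when_rewarded, steps=50, lr=0.1):
    """Delta-rule learner: the value of whichever cue is active moves toward the reward.

    The "agent" controls only `cue_when_rewarded` (what it thinks about at
    reward time). It has no access to `values` or to the update itself.
    """
    values = {"strawberries": 0.0, "task": 0.0}  # hypothetical learned values
    for _ in range(steps):
        reward = 1.0                     # a reward is about to arrive
        cue = cue_when_rewarded          # the agent's only lever: what to think about
        values[cue] += lr * (reward - values[cue])  # update applied by the learner
    return values


hacked = train("strawberries")
honest = train("task")
print(hacked)  # "strawberries" has acquired value; "task" has not
print(honest)  # the reverse
```

In this sketch the agent shapes what gets reinforced without being able to say what the parameters should be, which is the asymmetry described above.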
Two potential alternatives to the thing you said:
When I read other people, I often feel like they're operating in a 'narrower segment of their model', or not trying to fit the whole world at once, or something. They often seem to emit sentences that are 'not absurd', instead of 'on their mainline', because they're mostly trying to generate sentences that pass some shallow checks instead of 'coming from their complete mental universe.'
To me it seems like this is what you should expect other people to look like both when other people know less about a domain than you do, and also when you're overconfident about your understanding of that domain. So I don't think it helps distinguish those two cases.
(Also, to me it seems like a similar thing happens, but with the positions reversed, when Paul and Eliezer try to forecast concrete progress in ML over the next decade. Does that seem right to you?)
when Eliezer responded with:
But there's a really really basic lesson here about the different style of "sentences found in political history books" rather than "sentences produced by people imagining ways future politics could handle an issue successfully".
the subject got changed.
I believe this was discussed further at some point - I argued that Eliezer-style political history books also exclude statements like "and then we survived the cold war" or "most countries still don't have nuclear energy".
Key role, but most current ML is in the "applied" section, while the "theory" section instead explains the principles by which neural nets (or future architectures) work on the inside. Logical induction is a sidebar at some point explaining the theoretical ideal we're working towards, like I assume AIXI is in some textbooks.
Planning, Abstraction, Reasoning, Self-awareness.
I'm curious if you have a way to summarise what you think the "core insight" of ELK is, that allows it to improve on the way other alignment researchers think about solving the alignment problem.
Interesting post :) I'm intuitively a little skeptical - let me try to figure out why.
I think I buy that some reasoning process could consistently decide to hack in a robust way. But there are probably parts of that reasoning process that are still somewhat susceptible to being changed by gradient descent. In particular, hacking relies on the agent knowing what its current mesa-objective is - but that requires some type of introspective access, which may be difficult, and is the type of thing which could be hindered by gradient descent (especially when you're working in a very high-dimensional space!).
The more general point is that the agent doesn't just need to decide to hack in a way that's robust to gradient descent, it has to also have all of its decisions about how to hack (e.g. figuring out where it is, and which Schelling point to choose) be robust to gradient descent. And that seems much harder. The type of thing I imagine happening is gradient descent pushing the agent towards a mesa-objective which intrinsically disfavours gradient hacking, in a way which the agent has trouble noticing.
Of course my argument fails when the agent has access to external memory - indeed, it can just write down a Schelling point for future versions of itself to converge to. So I'm wondering whether it's worth focusing on that over the memoryless case (even though the latter has other nice properties), at least to flesh out an initial very compelling example.
Ah, that makes sense. In the section where you explain the steps of the game, I interpreted the comments in parentheses as further explanations of the step, rather than just a single example. (In hindsight the latter interpretation is obvious, but I was reading quickly - might be worth making this explicit for others who are doing the same.) So I thought that Bayes nets were built into the methodology. Apologies for the oversight!
I'm still a little wary of how much the report talks about concepts in a human's Bayes net without really explaining why this is anywhere near a sensible model of humans, but I'll have another read through and see if I can pin down anything that I actively disagree with (since I do agree that it's useful to start off with very simple assumptions).