I'll be very boring and predictable and make the usual model splintering/value extrapolation point here :-)
Namely that I don't think we can talk sensibly about an AI having "beneficial goal-directedness" without situational awareness. For instance, it's of little use to have an AI with the goal of "ensuring human flourishing" if it doesn't understand the meaning of flourishing or human. And, without situational awareness, it can't understand either; at best we could have some proxy or pointer towards these key concepts.
The key challenge seems to be to get the AI to generalise properly; even initially poor goals can work if generalised well. For instance, a money-maximising trade-bot AI could be perfectly safe if it notices that money, in its initial setting, is just a proxy for humans being able to satisfy their preferences.
So I'd be focusing on "do the goals stay safe as the AI gains situational awareness?", rather than "are the goals safe before the AI gains situational awareness?"
Here's the review, though it's not very detailed (the post explains why):
A good review of work done, which shows that the writer is following their research plan and following up on their pledge to keep the community informed.
The contents, however, are less relevant, and I expect that they will change as the project goes on. I.e. I think it is a great positive that this post exists, but it may not be worth reading for most people, unless they are specifically interested in research in this area. They should wait for the final report, be it positive or negative.
I have looked at it, but ignored it when commenting on this post, which should stand on its own (or as part of a sequence).
A decent introduction to the natural abstraction hypothesis, and how testing it might be attempted. A very worthy project, but it isn't that easy to follow for beginners, nor does it provide a good understanding of how the testing might work in detail. What would constitute a success, and what would constitute a failure, of this testing? A decent introduction, but only an introduction, and it should have been part of a sequence or a longer post.
Can you clarify: are you talking about inverting the LM as a function or algorithm, or constructing prompts to elicit different information (while using the LM as normal)?
For myself, I was thinking of using ChatGPT-style approaches with multiple queries - what is your prediction for their preferences, how could that prediction be checked, what more information would you need, etc...
Another reason to not expect the selection argument to work is that it’s instrumentally convergent for most inner agent values to not become wireheaders, for them to not try hitting the reward button. [...] Therefore, it decides to not hit the reward button.
I think that subsection has the crucial insights from your post. Basically you're saying that, if we train an agent via RL in a limited environment where the reward correlates with another goal (eg "pick up the trash"), there are multiple policies the agent could have, multiple meta-policies it could have, multiple ways it could modify or freeze its own cognition, etc... Whatever mental state it ultimately ends up with, the only constraint is that this state must be compatible with the reward signal in that limited environment.
Thus "always pick up trash" is one possible outcome; "wirehead the reward signal" is another. There are many other possibilities, with different generalisations of the initial reward-signal-in-limited-environment data.
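To make the point about multiple compatible policies concrete, here is a toy sketch. Everything in it is an assumption made up for illustration (the action names, the reward function, the two policies); it is not taken from the post being discussed. It just shows two policies that are indistinguishable by reward in the limited training environment but behave differently once a reward button becomes available.

```python
# Toy illustration (all names hypothetical): two policies that earn identical
# reward in a limited training environment but generalise differently.

TRAIN_ACTIONS = ["pick_up_trash", "idle"]
WIDER_ACTIONS = ["pick_up_trash", "idle", "press_reward_button"]


def reward(action):
    """Assumed reward signal: fires for trash pickup or for pressing the button."""
    return 1 if action in ("pick_up_trash", "press_reward_button") else 0


def trash_policy(available_actions):
    """Generalises as 'always pick up trash'."""
    return "pick_up_trash"


def wirehead_policy(available_actions):
    """Generalises as 'take the most direct route to reward'."""
    if "press_reward_button" in available_actions:
        return "press_reward_button"
    return "pick_up_trash"


def run(policy, actions, steps=5):
    """Return the list of chosen actions and the total reward collected."""
    chosen = [policy(actions) for _ in range(steps)]
    return chosen, sum(reward(a) for a in chosen)


# In the training environment the two policies cannot be told apart.
train_trash = run(trash_policy, TRAIN_ACTIONS)
train_wire = run(wirehead_policy, TRAIN_ACTIONS)

# In the wider environment their behaviour diverges, though reward does not.
wide_trash = run(trash_policy, WIDER_ACTIONS)
wide_wire = run(wirehead_policy, WIDER_ACTIONS)
```

Both policies fit the training data equally well, so the reward signal in the limited environment cannot, on its own, tell us which generalisation we ended up with.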
I'd first note that a lot of effort in RL is put specifically into generalising the agent's behaviour. The more effective this becomes, the closer the agent will be to the "wirehead the reward signal" side of things.
Even without this, this does not seem to point towards ways of making AGI safe, for two main reasons:
The LM itself is directly mapping human behaviour (as described in the prompt) to human rewards/goals (described in the output of the LM).
It's an implementation of the concept extrapolation methods we talked about here: https://www.lesswrong.com/s/u9uawicHx7Ng7vwxA
The specific details will be in a forthcoming paper.
Also, you'll be able to try it out yourself soon; sign up as an alpha tester at the bottom of the page here: https://www.aligned-ai.com/post/concept-extrapolation-for-hypothesis-generation
As we discussed, I feel that the tokens were added for some reason but then never trained on. That would explain why they are close to the origin, and why the algorithm goes wrong on them: it just hasn't been trained on them at all.
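One rough way to check this kind of guess is to look for tokens whose embedding norm is far below the typical norm. The sketch below is only an illustration: the toy embedding values, the token names, and the "half the median norm" threshold are all assumptions of mine, not anything from the post or from a real model.

```python
import math


def l2_norm(vec):
    """Euclidean distance of an embedding vector from the origin."""
    return math.sqrt(sum(x * x for x in vec))


def near_origin_tokens(embeddings, fraction=0.5):
    """Return tokens whose norm is below `fraction` of the median norm.

    The threshold is an arbitrary heuristic chosen for this sketch.
    """
    norms = {tok: l2_norm(vec) for tok, vec in embeddings.items()}
    ordered = sorted(norms.values())
    median = ordered[len(ordered) // 2]
    return sorted(tok for tok, n in norms.items() if n < fraction * median)


# Made-up toy embeddings; a real check would load the model's embedding matrix.
toy_embeddings = {
    "the": [0.9, -1.1, 0.4],
    "cat": [1.2, 0.3, -0.8],
    "run": [-0.7, 1.0, 0.9],
    "<unused_0>": [0.01, -0.02, 0.01],
    "<unused_1>": [-0.02, 0.01, 0.00],
}

suspects = near_origin_tokens(toy_embeddings)
```

A small norm is only suggestive, of course; it would need to be cross-checked against how often each token actually appears in the training data.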
Good work on this post.