I think about AI alignment. Send help.
I say things on twitter, other links at payor.io
Awesome, thanks for writing this up!
I very much like how you are giving a clear account for a mechanism like "negative reinforcement suppresses text by adding contextual information to the model, and this has more consequences than just suppressing text".
(In particular, the model isn't learning "just don't say that", it's learning "these are the things to avoid saying", which can make it easier to point at the whole cluster?)
(Edit: others have made this point already, but anyhow)
My main objection to this angle: self-improvements do not necessarily look like "design a successor AI to be in charge". They can look more like "acquire better world models", "spin up more copies", "build better processors", "train lots of narrow AI to act as fingers", etc.
I don't expect an AI mind to have trouble finding lots of pathways like these (that tractably improve abilities without risking a misalignment catastrophe) that take it well above human level, given the chance.
Is the following an accurate summary?
The agent is built to have a "utility function" input that the humans can change over time, and a probability distribution over what the humans will ask for at different time steps, and maximizes according a combination of the utility functions it anticipates across time steps?
If that's correct, here are some places this conflicts with my intuition about how things should be done:
I feel awkward about the randomness is being treated essential. I'd rather be able to do something other than randomness in order to get my mild optimization, and something feels unstable/non-compositional about needing randomness in place for your evaluations... (Not that I have an alternative that springs to mind!)
I also feel like "worst case" is perhaps problematic, since it's bringing maximization in, and you're then needing to rely on your convex set being some kind of smooth in order to get good outcomes. If I have a distribution over potential utility functions, and quantilize for the worst 10% of possibilities, does that do the same sort of work that "worst case" is doing for mild optimization?
Can I check that I follow how you recover quantilization?
Are you evaluating distributions over actions, and caring about the worst-case expectation of that distribution?
If so, proposing a particular action is evaluated badly? (Since there's a utility function in your set that spikes downward at that action.)
But proposing a range of actions to randomize amongst can be assessed to have decent worst-case expected utility, since particular downward spikes get smoothed over, and you can rely on your knowledge of "in-distribution" behaviour?
Edited to add: fwiw it seems awesome to see quantilization formalized as popping out of an adversarial robustness setup! I haven't seen something like this before, and didn't notice if the infrabayes tools were building to these kinds of results. I'm very much wanting to understand why this works in my own native-ontology-pieces.
I want to say that I agree the transformer circuits work is great, and that I like it, and am glad I had the opportunity to read it! I still expect it was pretty harmful to publish.
Nerdsniping goes both ways: you also inspire things like the Hyena work trying to improve architectures based on components of what transformers can do.
I think indiscriminate hype and trying to do work that will be broadly attention-grabbing falls on the wrong side, likely doing net harm. Because capabilities improvements seem empirically easier than understanding them, and there's a lot more attention/people/incentives for capabilities.
I think there are more targeted things that would be better for getting more good work to happen. Like research workshops or unconferences, where you choose who to invite, or building community with more aligned folk who are looking for interesting and alignment-relevant research directions. This would come with way less potential harm imo as a recruitment strategy.
I'm perhaps misusing "publish" here, to refer to "putting stuff on the internet" and "raising awareness of the work through company Twitter" and etc.
I mostly meant to say that, as I see it, too many things that shouldn't be published are being published, and the net effect looks plausibly terrible with little upside (though not much has happened yet in either direction).
The transformer circuits work strikes me this way, so does a bunch of others.
Also, I'm grateful to know your read! I'm broadly interested to hear this and other raw viewpoints, to get a sense of how things look to other people.
I mostly do just mean "keeping it within a single research group" in the absence of better ideas. And I don't have a better answer, especially not for independent folk or small orgs.
I wonder if we need an arxiv or LessWrong clone where you whitelist who you want to discuss your work with. And some scheme for helping independents find each other, or find existing groups they trust. Maybe with some "I won't use this for capabilities work without the permission of the authors" legal docs as well.
This isn't something I can visualize working, but maybe it has components of an answer.
To throw in my two cents, I think it's clear that whole classes of "mechansitic interpretability" work are about better understanding architectures in ways that, if the research is successful, make it easier to improve their capabilities.
And I think this points strongly against publishing this stuff, especially if the goal is to "make this whole field more prestigious real quick". Insofar as the prestige is coming from folks who work on AI capabilities, that's drinking from a poisoned well (since they'll grant the most prestige to the work that helps them accelerate).
One relevant point I don't see discussed is that interpretability research is trying to buy us "slack", but capabilities research consumes available "slack" as fuel until none is left.
What do I mean by this? Sometimes we do some work and are left with more understanding and grounding about what our neural nets are doing. The repeated pattern then seems to be that this helps someone design a better architecture or scale things up, until we're left with a new more complicated network. Maybe because you helped them figure out a key detail about gradient flow in a deep network, or let them quantize the network better so they can run things faster, or whatnot.
Idk how to point at this thing properly, my examples aren't great. I think I did a better job talking about this over here on twitter recently, if anyone is interested.
But anyhow I support folks doing their research without broadcasting their ideas to people who are trying to do capabilities work. It seems nice to me if there was mostly research closure. And I think I broadly see people overestimating the benefits publishing their work relative to keeping it within a local cluster.
Seems right, oops! A5 is here saying that if any part of my u is flat it had better stay flat!
I think I can repair my counterexample but looks like you've already found your own.