AI-assisted list of ten concrete alignment things to do right now

lukehmiles

Background

So I'm thinking that AI-assisted summarization, math, bug-finding in code, and logical-error finding in writing are at a point where they are quite useful, if we can improve the tooling/integration a little bit.

In code, I've found it helpful to comment out some lines, write // WRONG: above them and // FIXED VERSION: below them, then let Copilot try a few things.
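
Here is a minimal sketch of that setup as a small Python helper. The function name wrong_fixed_prompt and the JavaScript snippet are made up for illustration. In practice you would just type the markers in your editor, and Copilot would also see the lines after the marker, which this sketch drops.

```python
def wrong_fixed_prompt(lines, start, end, comment="//"):
    """Comment out lines[start:end] and wrap them in WRONG / FIXED VERSION markers.

    The returned text stops right after the FIXED VERSION marker, so a
    code-completion model continues from there with its attempted fix.
    Lines after the suspect range are omitted (an assumption of this sketch).
    """
    before = lines[:start]
    suspect = [f"{comment} {line}" for line in lines[start:end]]
    framed = before + [f"{comment} WRONG:"] + suspect + [f"{comment} FIXED VERSION:"]
    return "\n".join(framed) + "\n"


# Hypothetical buggy snippet: the "- 1" is the suspected error.
source = [
    "function mean(xs) {",
    "  return xs.reduce((a, b) => a + b) / xs.length - 1;",
    "}",
]
prompt = wrong_fixed_prompt(source, 1, 2)
print(prompt)
```

The idea is that the markers make it look like a human already diagnosed the lines as wrong, so the most natural continuation is a corrected version.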

For writing, you could take a paragraph excerpt and start a critique post for the model to continue: "John Smith wrote '...' This immediately strikes me as absurd because"
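
The same framing can be sketched as a tiny helper. The name critique_prompt is hypothetical, and the wording is just the example above turned into a template.

```python
def critique_prompt(author: str, excerpt: str) -> str:
    """Frame an excerpt as the opening of a critique that stops mid-sentence."""
    # Ending on "because" (no period, no trailing space) leaves the model
    # to supply the objection itself.
    return f"{author} wrote '{excerpt}' This immediately strikes me as absurd because"


print(critique_prompt("John Smith", "..."))
```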

Imagine you were doing chemistry research in uh 1650 and had direct, immediate written feedback from uh Robert Boyle on directions to pursue, dead ends, errors, etc., except that 75% of the time he says something backwards or is just pattern matching. I think you might still do much better work than you would've without noisy-Boyle.

I'm not aware of anyone trying to actually use LLMs for meaningful writing/thinking assistance so I decided to try. I wrote the below text in about an hour. Consider this merely a demonstration that you can get a decent amount of semi-meaningful content in the right direction quite quickly. It's rare you can expect that much from someone.

List

Rob Bensinger gave this suggestion in a comment:

I think that this is a really good exercise that more people should try: Imagine that you’re running a project yourself that’s developing AGI first, in real life. Imagine that you are personally responsible for figuring out how to make the thing go well. Yes, maybe you’re not the perfect person for the job; that’s a sunk cost. Just think about what specific things you would actually do to make things go well, what things you’d want to do to prepare 2 years or 6 years in advance, etc.
Try to think your way into near-mode with regard to AGI development, without thereby assuming (without justification) that it must all be very normal just because it’s near. Be able to visualize it near-mode and weird/novel. If it helps, start by trying to adopt a near-mode, pragmatic, gearsy mindset toward the weirdest realistic/plausible hypothesis first, then progress to the less-weird possibilities.

I took the above prompt together with the underlined text below and used BLOOM to generate ten concrete things to do right now. For each item, I generated three completions, picked my favorite one, then did a bit of editing. I had trouble editing the math into something reasonable, so consider it creative inspiration.
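
Roughly, the loop looked like the sketch below. This is not the actual tooling used: generate stands in for whatever call returns one BLOOM completion, pick stands in for a human choosing a favorite, and both names (and draft_list) are hypothetical. The hand-editing step is left out.

```python
import itertools
from typing import Callable, List


def draft_list(
    prompt: str,
    generate: Callable[[str], str],
    pick: Callable[[List[str]], str],
    n_items: int = 10,
    n_candidates: int = 3,
) -> List[str]:
    """Build a list item by item, sampling a few candidates for each item."""
    items: List[str] = []
    for _ in range(n_items):
        # Each new item is conditioned on the prompt plus the items kept so far.
        context = prompt + "".join(item + "\n" for item in items)
        candidates = [generate(context) for _ in range(n_candidates)]
        items.append(pick(candidates))
    return items


# Stand-ins so the sketch runs without a model: a counter instead of BLOOM,
# and "take the first candidate" instead of human judgment.
counter = itertools.count(1)
items = draft_list(
    "Here's ten things I would try:\n",
    generate=lambda context: f"idea {next(counter)}",
    pick=lambda candidates: candidates[0],
    n_items=2,
)
print(items)
```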

(To repeat: my writing/prompting is underlined and the AI completion is not.)

Here, I will try it right now. I'm in charge of OpenMind and it is clear from our rate of progress that we will have AGI in less than 24 months. It's been decided that we will build it and deploy it but I have some influence on additional efforts we can take to reduce the risk. Here's ten things I would try:

For starters, I would try to test each new model in an isolated (ie not internet connected) environment with constrained IO channels for a month before deploying it.
I would also use any and all interpretability methods available to inspect the model and try to predict off-distribution behavior. If I had the sway, I might run a red-team blue-team contest to see how good we are at predicting off-distribution.
There are some old posts about safety implications of on-policy vs off-policy RL that I read but did not fully digest. I would go back and absorb those more fully and see if the distinction can be adapted to our use case. (Off-policy was much better in some ways if I recall correctly)
I'm pretty strongly in favor of the near miss principle: if an adversary or an incompetent user almost succeeds in exploiting the system, then we should log that and analyze it and publish our analysis. Things that Actually Happened are more convincing, and often lead to more insights, than toy thought experiments. Ideally, another firm would notice the root cause of a particular pattern and we could nip some catastrophes in the bud.
Run a variety of automated experiments to challenge all aspects of the system: try to find data capture flaws, network server vulnerabilities, robo-advice, language translation errors, model optimizer biases, etc. Could summarize these flaws and debrief the company once per quarter.
Black-belt methods to deal with physical world risks would be great!
(These next few are a bit more mathy, apologies if you're missing the relevant background)
We typically model "an agent acting in an environment" with an MDP. But you can also consider it as a category: The "objects" are states, actions, rewards etc, and the "arrows" are induced by the agent policy: $\pi:S \rightarrow A$. So we can think of this category as "all agent-environment pairs that can be generated by a given agent policy". It's a monoidal symmetric multicategory: pairs of such things can be combined to get a new such thing. This generalizes to "multi-agent".
A lot of RL work involves algorithms that swap between optimizing reward and updating belief about the environment. This assumption of Markovian reward may not be met. There are newer approaches to update belief directly, without assuming Markovianity (or clean transitions as in POMDPs). It would be interesting to see whether they might be better suited to this problem.
I am seriously interested in zero-sum games because I am assuming that when people work on AGI they aren't just working on the interests of humans but on a competition between people and the new superhuman system. More concretely (since I need to get something working pronto for the imminent AGI), it would be interesting to try zero-sum RL as an intermediate step. (Action A gains value when policy of opponent is worse under action A, etc.)
I think that AGI is the frontier between formal reasoning and plausibility. In particular, neural nets (and reinforcement learning methods) are based on bounded rationality and my instincts are that we might do better with more explicit reasoning. If our brain were more explicitly updatable, then the least useful parts of the brain could be cut off or disabled. Simplicity sometimes increases reliability and interpretability and sometimes improves generalization in a "good" way (waves hands).
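
To make the zero-sum item above slightly more concrete, here is a minimal sketch using matching pennies as a stand-in game. The payoff matrix, the opponent policy, and the function names are assumptions chosen for illustration; nothing here comes from an actual zero-sum RL setup.

```python
# Row player's payoffs in matching pennies. The game is zero-sum, so the
# column player (the "opponent") receives the negation of each entry.
PAYOFF = [
    [1, -1],
    [-1, 1],
]


def action_values(payoff, opponent_policy):
    """Expected payoff of each row action against a mixed column policy."""
    return [sum(p * r for p, r in zip(opponent_policy, row)) for row in payoff]


def opponent_values(payoff, opponent_policy):
    """Expected payoff to the opponent for each row action (negated, zero-sum)."""
    return [-v for v in action_values(payoff, opponent_policy)]


# Hypothetical opponent that plays column 0 with probability 0.75.
policy = [0.75, 0.25]
print(action_values(PAYOFF, policy))
print(opponent_values(PAYOFF, policy))
```

Because the two payoffs always sum to zero, an action is worth more to the row player exactly when the opponent does worse under it, which is one reading of the parenthetical in that item.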

I might try this again tomorrow because there's lots of obviously-good ideas I didn't mention (e.g. many of these suggestions arose in an informal workshop we did a few weeks ago). There might be open problems in integrating these ideas, but I think we can make progress, even in the next few weeks.