Doing alignment research with Vivek Hebbar's team at MIRI, as well as independent projects.
Prediction market for whether someone will strengthen our results or prove something about the non-independent case:
https://manifold.markets/ThomasKwa/will-someone-strengthen-our-goodhar?r=VGhvbWFzS3dh
Downvoted; this is very far from a well-structured argument, and it doesn't give me intuitions I can trust either.
I'm fairly sure you can get a result something like "it's not necessary to put positive probability mass on two different functions that can't be distinguished by observing only s bits", so some functions can get zero probability, e.g. the XOR of all combinations of at least s+1 bits.
edit: The proof is easy. Let $f_1$, $f_2$ be two such indistinguishable functions that you place positive probability on, $F$ be a random variable for the function, and $F'$ be $F$ but with all probability mass for $f_2$ replaced by $f_1$. Then $P(O \mid F' = f_1) = P(O \mid F = f_1) = P(O \mid F = f_2)$, where $O$ is the $s$ observed bits. But this means $H(O \mid F') = H(O \mid F)$ while $H(O)$ is unchanged, and so $I(F'; O) = I(F; O)$. You don't lose any channel capacity switching to $F'$.
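To make this concrete, here's a quick numerical check (my own sketch; the distributions below are made up rather than taken from our setup): merging the probability mass of two functions with identical observation distributions leaves the mutual information with the observations unchanged.

```python
import numpy as np

def mutual_info(p_f, p_o_given_f):
    """I(F; O) in bits, given the prior P(F) and rows P(O | F = f)."""
    p_fo = p_f[:, None] * p_o_given_f      # joint P(F, O)
    p_o = p_fo.sum(axis=0)                 # marginal P(O)
    indep = p_f[:, None] * p_o[None, :]    # P(F) * P(O)
    mask = p_fo > 0
    return np.sum(p_fo[mask] * np.log2(p_fo[mask] / indep[mask]))

# f1 and f2 induce the same observation distribution, so they are
# indistinguishable from the observed bits; f3 is distinguishable.
p_o_given_f = np.array([
    [0.7, 0.3],   # f1
    [0.7, 0.3],   # f2 (indistinguishable from f1)
    [0.1, 0.9],   # f3
])
p_f     = np.array([0.25, 0.25, 0.5])   # original prior over functions, F
p_f_new = np.array([0.50, 0.00, 0.5])   # F': f2's mass moved onto f1

print(mutual_info(p_f, p_o_given_f))      # ≈ 0.296 bits
print(mutual_info(p_f_new, p_o_given_f))  # same value: no capacity lost
```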
I don't have enough time to reply in depth, but the factors in favor of weight vectors and activation vectors both seem really complicated, and the balance still seems in favor of activation vectors, though I have reasonably high uncertainty.
I think to solve alignment, we need to develop our toolbox of "getting AI systems to behave in ways we choose". Not in the sense of making them friendly or economically valuable, but in the sense of pushing towards whatever cognitive properties we need for a future alignment solution. We can make AI systems do some things we want, e.g. GPT-4 can answer questions with only words starting with "Q", but we don't know how it does this in terms of internal representations of concepts. Current systems are not well-characterized enough that we can predict what they do far OOD. No other work I've seen quite matches the promise this post has in finding ways to exert fine-grained control over a system's internals; we now have a wide variety of concrete questions like
Comparing this to other work, my sense is that
It's a judgement call whether this makes it the most impressive achievement, but I think this post is pretty clearly Pareto-optimal in a very promising direction. That said, I have a couple of reservations:
This is the most impressive concrete achievement in alignment I've seen. I think this post reduces my p(doom) by around 1%, and I'm excited to see where all of the new directions uncovered lead.
Edit: I explain this view in a reply.
Edit 25 May: I now think RLHF is more impressive in terms of what we can get systems to do, but I still think activation editing has opened up more promising directions.
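For concreteness, here is roughly the kind of intervention I mean by activation editing: a minimal sketch (mine, not the post's code) that adds a "steering vector" to GPT-2's residual stream at one layer via a PyTorch forward hook and looks at how generations change. The layer, the contrast prompts, and the coefficient 4.0 are arbitrary choices for illustration.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
tok = GPT2TokenizerFast.from_pretrained("gpt2")
LAYER = 6  # which block's output to modify (arbitrary choice)

def resid_at_layer(prompt):
    """Residual-stream activation at LAYER for the last token of the prompt."""
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        hs = model(ids, output_hidden_states=True).hidden_states
    return hs[LAYER][0, -1]

# Steering vector = difference of activations on two contrasting prompts.
steer = resid_at_layer(" Love") - resid_at_layer(" Hate")

def hook(module, inputs, output):
    # GPT2Block returns a tuple; output[0] is the hidden states.
    return (output[0] + 4.0 * steer,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(hook)
ids = tok("I think dogs are", return_tensors="pt").input_ids
out = model.generate(ids, max_new_tokens=20, do_sample=False)
print(tok.decode(out[0]))
handle.remove()
```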
SGD has inductive biases, but we'd have to actually engineer them to get high true reward rather than high proxy reward when only trained on the proxy. In the Gao et al. paper, optimization and overoptimization happened at the same relative rate in RL as in conditioning, so I think the null hypothesis is that training does about as well as conditioning. I'm pretty excited about work that improves on that paper to get higher gold reward while only having access to the proxy reward model.
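As a toy illustration of why I don't expect selection on the proxy to get the true reward for free (a setup I made up, not the Gao et al. experiments): best-of-n selection on a proxy $U = V + X$ keeps increasing the true value $V$ when the error $X$ is light-tailed, but leaves $V$ near zero when $X$ is heavy-tailed.

```python
import numpy as np

rng = np.random.default_rng(0)

def selected_true_value(n, error_sampler, trials=20_000):
    """Best-of-n on the proxy U = V + X; return the mean true value V of the pick."""
    V = rng.normal(size=(trials, n))       # true value of each candidate
    X = error_sampler(size=(trials, n))    # error term of the proxy
    pick = (V + X).argmax(axis=1)          # select on the proxy only
    return V[np.arange(trials), pick].mean()

for n in [1, 4, 16, 64, 256]:
    light = selected_true_value(n, rng.normal)           # light-tailed error
    heavy = selected_true_value(n, rng.standard_cauchy)  # heavy-tailed error
    print(f"n={n:3d}   V (normal error) = {light:.2f}   V (Cauchy error) = {heavy:.2f}")
```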
I think the point still holds in the mainline shard theory world, which in my understanding is using reward shaping + interp to get an agent composed of shards that value proxies that more often correlate with high true reward rather than just high proxy reward, even though we are selecting on something other than the true reward. When the AI ultimately outputs a plan for alignment, why would it inherently value having the accurate plan, rather than inherently value misleading humans? I think we agree that it's because SGD has inductive biases and we understand them well enough to do directionally better than conditioning at constructing an AI that does what we want.
That section is even more outdated now. There's nothing on interpretability, Paul's work now extends far beyond IDA, etc. In my opinion it should link to some other guide.
This seems good if it could be done. But the original proposal was just a call for labs to individually pause their research, which seems really unlikely to work.
Also, the level of civilizational competence required to compensate labs seems higher than for other solutions. I don't think it's common regulatory practice to compensate existing labs like this, and it seems difficult to work out all the details so that labs feel adequately compensated. Plus there might be labs that irrationally believe they're undervalued. Regulations similar to those in the nuclear or aviation industries feel like a more plausible way to get a slowdown, and have the benefit that they actually incentivize safety work.
How do you think "agent" should be defined?