we currently don't have a formal specification of optimization

This seems to me a significant bottleneck for progress. Has no formal specification of what optimisation is been attempted before? What has been achieved? Is anyone working on this?

you can anneal whatever combination of the different losses you are using to eventually become exclusively imitative amplification, exclusively debate, or anything else in between

How necessary is annealing for this? Could you choose other optimisation procedures? Or do you refer to annealing in a more general sense?

I will keep track of all questions during our discussion, and if anything makes sense to send over to you, I will do so or invite the attendees to.

I feel like we as a community still haven't really explored the full space of possible prosaic AI alignment approaches

I agree, and I have mixed feelings about the current trend of converging towards roughly equivalent approaches, all containing a flavour of recursive supervision (at least 8 of your 11). On one hand, the fact that many attempts point in a similar direction is a good indication of that direction's potential. On the other hand, its likelihood of succeeding may be lower than that of a portfolio approach, which seems like what the community was originally aiming for. However, I (and, I suspect, most junior researchers) don't have a strong intuition about which very different directions might be promising. Perhaps one possibility would be to not completely abandon modelling humans. While it is undoubtedly hard, it may be worth exploring this possibility from an ML perspective as well, since others are still working on it from a theoretical perspective. It may be that, granted some breakthroughs in neuroscience, it could be less hard than we anticipate.

Another open problem is improving our understanding of transparency and interpretability

Also agree. In fact, I find it a bit vague whenever you refer to "transparency tools" in the post. However, if we aim for some kind of guarantees, this problem may either involve modelling humans or loop back to the main alignment problem, in the sense that specifying the success of a transparency tool is itself prone to specification error and outer/inner alignment problems. Not sure my point here is clear, but it is something I am interested in pondering.

Thanks for all the post pointers. I will have an in-depth read.

Thanks for the great post. It really provides an awesome overview of the current progress. I will surely come back to this post often and follow pointers as I think about and research things.

Just before I came across this, I was thinking of hosting a discussion about "Current Progress in ML-based AI Safety Proposals" at the next AI Safety Discussion Day (Sunday June 7th).

Having read this, I think the best thing to do is to host an open-ended discussion about this post. It would be awesome if you can and want to join. More details can be found here.

One additional thing I was thinking of discussing (and that could be a minor way to go beyond this post) is the set of open problems across the different solutions, and which of them might be more impactful to contribute to or focus on. Do you have any thoughts on this?

Many thanks again for the precious resource you put together.