Leon Lang

I'm a PhD student at the University of Amsterdam. I have research experience in multivariate information theory and equivariant deep learning, and I recently became very interested in AI alignment. https://langleon.github.io/

Comments

I think the main way that proof assistant research feeds into capabilities research is not through the assistants themselves, but through the transfer of proof assistant research into foundation models with better reasoning capabilities. I think researching better proof assistants can shorten timelines.

  • See also Demis Hassabis' recent tweet. Admittedly, it's unclear whether he refers to AlphaProof itself being accessible from Gemini, or to the research into AlphaProof feeding into improvements of Gemini.
  • See also an important paragraph from the AlphaProof blog post: "As part of our IMO work, we also experimented with a natural language reasoning system, built upon Gemini and our latest research to enable advanced problem-solving skills. This system doesn’t require the problems to be translated into a formal language and could be combined with other AI systems. We also tested this approach on this year’s IMO problems and the results showed great promise."

Thanks a lot for these pointers!

Neither of your interpretations is what I was trying to say; it seems I didn't express myself clearly enough.

What I was trying to say is that outer alignment itself, as defined by you (and perhaps by everyone else), is a priori impossible: no physically realizable reward function that is defined solely in terms of observations rewards only those actions that would be chosen by a competent, well-motivated AI. Such a reward function also rewards actions that lead to corrupted observations consistent with the actions of a benevolent AI, and those actions may come from a misaligned AI.

However, I notice that people use the terms outer and inner alignment a lot, and quite a few people seem to try to solve alignment by solving outer and inner alignment separately. So I was wondering whether they use a more refined notion of outer alignment, possibly one that takes the physical capabilities of the agent into account, and I was asking whether something like that has already been written down anywhere.

To classify as specification gaming, there needs to be bad feedback provided on the actual training data. There are many ways to operationalize good/bad feedback. The choice we make here is that the training data feedback is good if it rewards exactly those outputs that would be chosen by a competent, well-motivated AI.

I assume you would agree with the following rephrasing of your last sentence:

The training data feedback is good if it rewards an output if and only if that output might be chosen by a competent, well-motivated AI.

If so, I would appreciate it if you could clarify how achieving good training data feedback is even possible: the system that gives feedback necessarily looks at the world through observations that conceal large parts of the state of the universe. For every observation that is consistent with the actions of a competent, well-motivated AI, the underlying state of the world might actually be catastrophic from the point of view of our "intentions". E.g., observations can be faked, or the universe can be altered arbitrarily outside the feedback system's range of view.
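To make the worry concrete, here is a minimal sketch of what I mean; the notation is my own and purely illustrative. Let $S$ be the set of possible world states and $\mathrm{obs}: S \to \Omega$ the observation channel available to the feedback system. Any physically realizable reward function has the form $R: \Omega \to \mathbb{R}$, so the reward assigned in state $s$ is $R(\mathrm{obs}(s))$. Now suppose there are two states $s_{\text{good}}, s_{\text{bad}} \in S$ such that $s_{\text{good}}$ is what a competent, well-motivated AI would bring about, $s_{\text{bad}}$ is catastrophic (say, the sensors have been tampered with), and $\mathrm{obs}(s_{\text{good}}) = \mathrm{obs}(s_{\text{bad}})$. Then

$$R(\mathrm{obs}(s_{\text{good}})) = R(\mathrm{obs}(s_{\text{bad}})),$$

so the feedback necessarily rewards the catastrophic behavior exactly as much as the benevolent one.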

If you agree with this, then you presumably assume that there are limits to the physical capabilities of the AI, such that it is possible to have a feedback mechanism that cannot be effectively gamed. Possibly, as the AI becomes more powerful, the feedback mechanism would in turn need to become more powerful to ensure that its observations "track reality" in the relevant way.

Does there exist a write-up of the meaning of specification gaming and/or outer alignment that takes into account that this notion is always "relative" to the AI's physical capabilities?

Yes, after reflection I think this is correct. I think I had in mind a situation in which training of the AI system simply stops at deployment, but of course this need not be the case. So if training continues, one either needs to give stronger reasons why the distribution shift leads to a catastrophe (e.g., along the lines you argue) or make the case that the training signal couldn't keep up with the fast pace of development. The latter would be an outer alignment failure, which I tried to avoid talking about in the text.