Kay Kozaronek — AI Alignment Forum

Thus, if given the right incentives, it should be "easy" for our AI systems to avoid those kinds of catastrophes: they just need to not do it. To us, this is one of the core reasons for optimism about alignment.

I'm not sure I understand this correctly. Are you saying that one of the main reasons for optimism is that more competent models will be easier to align because we just need to give them "the right incentives"?

What exactly do you mean by "the right incentives"?

Can you illustrate this by means of an example?

Takeaways from our robust injury classifier project [Redwood Research]

Kay Kozaronek · 3y

Concrete Steps to Get Started in Transformer Mechanistic Interpretability

Thank you for organizing and outlining the learning steps, Neel. I found the concrete success criteria very helpful. Could you also provide an estimated time for each step? I believe this would be useful not only to me but to others as well. In particular, could you give rough time estimates for the four steps in the "Getting the Fundamentals" part of the curriculum?
