> What we need to find, for a given agent to be constrained by being a 'utility maximiser' is to consider it as having a member of a class of utility functions where the actions that are available to it systematically alter the expected utility available to it - for all utility functions within this class.
This sentence is extremely difficult for me to parse. Any chance you could clarify it?
> In most situations, were these preferences over my store of dollars for example, this would seem to be outside the class of utility functions that would meaningfully constrain my action, since this function is not at all smooth over the resource in question.
Could you explain why smoothness is typically required for meaningfully constraining our actions?
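To check I'm picturing the right kind of non-smoothness, here's a toy utility over dollars of the sort I imagine you mean (the threshold is entirely made up for illustration):

$$u(d) = \begin{cases} 1 & \text{if } d \ge 10^{6} \\ 0 & \text{otherwise} \end{cases}$$

A maximiser of this is indifferent between almost all of its available actions - only actions that change the probability of crossing the threshold move its expected utility at all - so "it maximises $u$" tells us very little about what it will do. Is that roughly the sense in which smoothness is doing the work here?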
Thanks so much for not only writing a report, but taking the time to summarise for our easy consumption!
Note that davinci-002 and babbage-002 are the new base models released a few days ago.
You mean davinci-003?
- Learn more about the possible failures, to understand how likely they are, what causes them to arise, and what techniques may mitigate the failures (discussed here).
- Inform the current conversation about AI risk by providing the best evidence of misalignment risks, if any. We hope this will be helpful for labs, academia, civil society, and policymakers to make better decisions (discussed here). If misalignment issues end up being serious, then it will be critical to form a strong scientific consensus that these issues are real, for which examples of alignment failures are crucial.
You've linked to a non-public Google doc.
It’s not clear to me that the space of things you can verify is in fact larger than the space of things you can do, because an AI might be able to create a fake solution that feels more real than the actual solution. At a sufficiently high intelligence level of the AI, being able to avoid such tricks is likely harder than just doing the task yourself would be if you hadn’t been subject to malign influence.
I’m confused about the back door attack detection task even after reading it a few times:
The article says: “The key difference in the attack detection task is that you are given the backdoor input along with the backdoored model, and merely need to recognize the input as an attack”.
When I read that, I find myself wondering why that isn’t trivially solved by a model that memorises which input(s) are known to be an attack.
My best interpretation is that there are a bunch of possible inputs that cause an attack and you are given one of them and just have to recognise that one plus the others you don’t see. Is this interpretation correct?
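In case it helps, here's a purely illustrative sketch (every name is my own invention, not from the article) of why I assume memorisation can't be the intended solution under that interpretation:

```python
def memorising_detector(known_attack_input):
    """Return a detector that only flags the single attack input it was shown."""
    def detect(candidate_input) -> bool:
        return candidate_input == known_attack_input
    return detect

# As I read the task: you're handed (backdoored_model, known_attack_input), but
# are then judged on whether you also flag *other* trigger inputs you never saw.
# This detector gets the shown example right and misses every unseen trigger,
# which would be why memorisation alone isn't enough - assuming my reading is correct.
detector = memorising_detector(known_attack_input="<the one attack input we were given>")
assert detector("<the one attack input we were given>")             # flags the known trigger
assert not detector("<some unseen trigger for the same backdoor>")  # misses unseen ones
```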
An alternative framing that might be useful: what do you see as the main bottleneck to people having better predictions of timelines?
Do you in fact think that having such a list is that bottleneck?