Main takeaways from a recent AI safety conference:
An incomplete list of possibly useful AI safety research:
A systematic way to classify AI safety work could use a matrix, where one dimension is the system level:
Another dimension is the "time" of consideration:
This matrix would have 5 × 4 = 20 slots; almost all of them have something interesting to research and design, and none of them is "too early" to consider.
There is still some AI safety work (research) that doesn't fit this matrix, e.g., the org design, infosec, alignment, etc. of AI labs (i.e., the system that designs, manufactures, operates, and evolves monolithic AI systems and systems of AIs).
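As a trivial sketch of the matrix's shape (the level and time labels below are hypothetical placeholders, not the actual dimension values from the lists above):

```python
# Placeholder names only -- the real system levels and "times" are the ones
# listed above; this just shows the 5 x 4 grid of research slots.
system_levels = [f"system_level_{i}" for i in range(1, 6)]   # 5 system levels
times = [f"time_{j}" for j in range(1, 5)]                   # 4 "times" of consideration

# One research slot per (level, time) pair.
matrix = {(level, time): [] for level in system_levels for time in times}
assert len(matrix) == 20   # 5 * 4 slots, each worth populating
```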
AI alignment threat models that are somewhat MECE (mutually exclusive, collectively exhaustive), but not quite:
In particular, the last threat model feels like it is trying to cut across aspects of the first two threat models, violating MECE.
Great overview! I find this helpful.
Next to intrinsic optimisation daemons that arise through training internal to hardware, I suggest adding extrinsic optimising "divergent ecosystems" that arise through deployment and the gradual co-option of (phenotypic) functionality within the larger outside world.
AI safety research so far (particularly by CS/ML researchers) has focussed more on internal code computed deterministically (within known state spaces, as mathematicians like to represent it), rather than on complex external feedback loops that are uncomputable, given Good Regulator Theorem limits and the inherent noise interfering with signals propagating through the environment (as would be intuitive to some biologists and non-linear dynamics theorists).
So extrinsic optimisation is easier for researchers in our community to overlook. See this related paper by a physicist studying origins of life.
Cheers, Remmelt! I'm glad it was useful.
I think the extrinsic optimization you describe is what I'm pointing toward with the label "coordination failures," which might properly be labeled "alignment failures arising uniquely through the interactions of multiple actors who, if deployed alone, would be considered aligned."
Reasons that scaling labs might be motivated to sign onto AI safety standards:
However, AI companies that don’t believe in AGI x-risk might tolerate higher x-risk than safety standards this community would consider ideal. Also, I think insurance contracts are unlikely to account appropriately for x-risk, if the market is anything to go by.
Types of organizations that conduct alignment research, differentiated by funding model and associated market forces:
MATS' goals:
"Why suicide doesn't seem reflectively rational, assuming my preferences are somewhat unknown to me," OR "Why me-CEV is probably not going to end itself":
Note: I'm fine; this is purely intellectual.
Can the strategy of "using surrogate goals to deflect threats" be countered by an enemy agent that learns your true goals and credibly precommits to always defecting (i.e., Prisoner's Dilemma style) if you deploy an agent against it with goals that produce sufficiently different cooperative bargaining equilibria than your true goals would?
This is a risk worth considering, yes. It’s possible in principle to avoid this problem by “committing” (to the extent that humans can do this) to both (1) train the agent to make the desired tradeoffs between the surrogate goal and original goal, and (2) not train the agent to use a more hawkish bargaining policy than it would’ve had without surrogate goal training. (And to the extent that humans can’t make this commitment, i.e., we make honest mistakes in (2), the other agent doesn’t have an incentive to punish those mistakes.)
If the developers do both these things credibly—and it's an open research question how feasible this is—surrogate goals should provide a Pareto improvement for the two agents (not a rigorous claim). Safe Pareto improvements are a generalization of this idea.
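To make the Pareto-improvement claim concrete, here is a toy numeric sketch (my own simplification with made-up payoffs, not CLR's formalism): if, per (2), the target's bargaining policy is unchanged, the threatener's expected payoff is identical with or without the surrogate goal, while per (1) the principal's expected loss falls in the branch where threats are carried out.

```python
# Toy numbers, purely illustrative; the payoff structure and all values are
# assumptions of this sketch, not from the surrogate-goals literature.
concede_prob = 0.3        # target's bargaining policy, held fixed per (2)
execute_prob = 0.8        # chance a refused threat is carried out
concession_value = 5      # value transferred to the threatener if the target concedes
threat_cost = 1           # threatener's cost of carrying out the threat
harm_to_true_goal = 10    # harm if an executed threat hits the true goal

def threatener_payoff() -> float:
    # Identical with or without the surrogate goal, because the target's
    # bargaining policy is unchanged -> no added incentive to threaten.
    refuse_prob = 1 - concede_prob
    return concede_prob * concession_value - refuse_prob * execute_prob * threat_cost

def principal_expected_loss(uses_surrogate: bool) -> float:
    # Per (1), an executed threat is absorbed by the surrogate goal, so
    # (idealizing) the principal's true goal is unharmed in that branch.
    refuse_prob = 1 - concede_prob
    harm = 0 if uses_surrogate else harm_to_true_goal
    return concede_prob * concession_value + refuse_prob * execute_prob * harm

print("threatener payoff (either case):", threatener_payoff())             # ~0.94
print("principal loss, no surrogate:   ", principal_expected_loss(False))  # ~7.1
print("principal loss, with surrogate: ", principal_expected_loss(True))   # ~1.5
```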
Are these framings of gradient hacking, which I previously articulated here, a useful categorization?
- Masking: Introducing a countervailing, “artificial” performance penalty that “masks” the performance benefits of ML modifications that do well on the SGD objective, but not on the mesa-objective (see the toy sketch after this list);
- Spoofing: Withholding performance gains until the implementation of certain ML modifications that are desirable to the mesa-objective; and
- Steering: In a reinforcement learning context, selectively sampling environmental states that will either leave the mesa-objective unchanged or "steer" the ML model in a way that favors the mesa-objective.
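As a toy sketch of the "masking" framing (purely illustrative: real gradient hacking would be implemented implicitly by a mesa-optimizer inside the network, not as an explicit penalty term added to the loss):

```python
import torch

theta = torch.tensor([1.0], requires_grad=True)  # a parameter SGD "wants" to shrink

def base_loss(t):
    # The SGD objective improves as t moves toward 0.
    return (t ** 2).sum()

def masking_penalty(t):
    # Countervailing "artificial" penalty: performance is degraded by exactly
    # as much as the SGD objective would gain, so the benefit is masked.
    return 1.0 - (t ** 2).sum()

loss = base_loss(theta) + masking_penalty(theta)
loss.backward()
print(theta.grad)  # tensor([0.]) -- no net gradient pushing theta away from
                   # the value the mesa-objective prefers
```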
How does the failure rate of a hierarchy of auditors scale with the hierarchy depth, if the auditors can inspect all auditors below their level?
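One toy model (the independence and catch-anything assumptions below are mine, not part of the question): if each auditor fails independently with probability p, and a failure anywhere is caught so long as at least one auditor above it is still functioning, then the hierarchy as a whole fails only when every level fails, so the failure rate scales like p^depth. Correlated failures (shared blind spots, shared incentives) would break this exponential decay.

```python
import random

# Toy Monte Carlo estimate under the independence assumption stated above.
def hierarchy_fails(depth: int, p: float) -> bool:
    """True iff every auditor in the chain independently fails."""
    return all(random.random() < p for _ in range(depth))

def estimate_failure_rate(depth: int, p: float, trials: int = 100_000) -> float:
    return sum(hierarchy_fails(depth, p) for _ in range(trials)) / trials

for depth in range(1, 5):
    print(depth, estimate_failure_rate(depth, p=0.1))  # ~0.1, 0.01, 0.001, 0.0001
```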