There has been a couple of recent posts suggesting that Eliezer Yudkowsky's Sufficiently optimized agents appear coherent thesis does not seem useful because it's vacuously true: one obvious way to formalize "coherent" implies that all agents can be considered coherent. In a previous comment, I suggested that we can formalize "coherent" in a different way to dodge this criticism. I believe there's reason to think that Eliezer never intended "Sufficiently optimized agents appear coherent" to have an airtight argument and be universally true. (The Arbital post contains a number of caveats, including "If there is a particular kind of optimization pressure that seems sufficient to produce a cognitively highly advanced agent, but which also seems sure to overlook some particular form of incoherence, then this would present a loophole in the overall argument and yield a route by which an advanced agent with that particular incoherence might be produced".) In this post, I suggest that considering the ways in which it could be false can be a useful way to frame some recent ideas in AI safety. (Note that this isn't intended to be an exhaustive list.)
Even a very powerful optimization process cannot train or test an agent in every possible environment and for every possible scenario (by this I mean some sequence of inputs) that it might face, and some optimization processes may not care about many possible environments/scenarios. Given this, we can expect that if an agent faces a new environment/scenario that's very different from what is was optimized for, it may fail to behave coherently.
(Jessica Taylor made a related point in Modeling the capabilities of advanced AI systems as episodic reinforcement learning: "When the test episode is similar to training episodes (e.g. in an online learning context), we should expect trained policies to act like a rational agent maximizing its expected score in this test episode; otherwise, the policy that acts as a rational agent would get a higher expected test score than this one, and would therefore receive the highest training score.")
A caveat to this caveat is that if an agent is optimized for a broad enough range of environments/scenarios, it could become an explicit EU maximizer, and keep doing EU maximization even after facing a distributional shift. (In this case it may be highly unpredictable what the agent's utility function looks like outside the range that it was optimized for. Humans can be considered a good example of this.)
Eric Drexler suggested that one way to keep AIs safe is to optimize them to use few computing resources. If computing resources are expensive, it will often be less costly to accept incoherent behavior than to expend computing resources to reduce such incoherence. (Eliezer noted that such incoherence would only be removed "given the option of eliminating it at a reasonable computational cost".)
A caveat to this is that the true economic costs for compute will continue to fall, eventually to very low levels, so this depends on people assigning artificially high costs to computing resources (which Eric suggests that they do). However assigning an optimization cost for compute that is equal to its economic cost would often produce a more competitive AI, and safety concerns may not be sufficient incentive for an AI designer (if they are mostly selfish) to choose otherwise (because the benefits of producing a more competitive AI are more easily internalized than the costs/risks). One can imagine that in a world where computing costs are very low in an economic sense, but everyone is treating compute as having high cost for the sake of safety, the first person to not do this would gain a huge competitive advantage.
The optimizing process may itself be incoherent and not know how to become coherent or produce an agent that is coherent in an acceptable or safe way. A number of ideas fall into this category, including Peter Eckersley's recent Impossibility and Uncertainty Theorems in AI Value Alignment (or why your AGI should not have a utility function), which suggests that we should create AIs that handle moral uncertainty by randomly assigning a subagent (representing some moral theory) to each decision, with the argument that this is similar to how humans handle moral uncertainty. This can clearly be seen as an instance where the optimizing process (i.e., AI programmers) opts for the agent to remain incoherent because it does not know an acceptable/safe way to remove the incoherence.
A caveat here is that the agent may itself decide to become coherent anyway, and not necessarily in a way that the original optimizing process would endorse. For example, under Peter's proposal, one subagent may take an opportunity to modify the overall AI to become coherent in a way that it prefers, or multiple subagents may decide to cooperate and merge together into a more coherent agent. Another caveat is that incoherence is economically costly especially in a competitive multi-polar scenario, and if such costs are high enough the optimizing process may be forced to create a coherent agent even if it would prefer not to (in the absence of such costs).