Theron Pummer has written about this precise issue in his paper on Spectrum Arguments, where he touches on this argument for "transitivity => comparability" (here used, notably, as an argument against transitivity rather than as an argument for comparability) and on its relation to Sorites arguments such as the one about heaps of sand.
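To make the parallel concrete, here is a rough schematic of the shared structure as I understand it (my own reconstruction, not Pummer's exact formulation):

```latex
% Schematic spectrum argument (my reconstruction, not a quote from the paper):
% a chain of small pairwise comparisons across options A_1, ..., A_n,
% plus transitivity, forces the far-apart endpoints to be comparable,
% much as a Sorites argument chains "removing one grain never destroys a heap".
\begin{align*}
  &A_1 \succ A_2,\; A_2 \succ A_3,\; \dots,\; A_{n-1} \succ A_n
     && \text{(each adjacent comparison seems clearly true)} \\
  &(A_i \succ A_j) \wedge (A_j \succ A_k) \Rightarrow A_i \succ A_k
     && \text{(transitivity)} \\
  &\therefore\; A_1 \succ A_n
     && \text{(the endpoints turn out to be comparable)}
\end{align*}
```

One can resist the conclusion by denying transitivity, by denying one of the adjacent comparisons, or by denying that the endpoints are comparable, which is why the same schema can be read either as an argument for comparability or as an argument against transitivity.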
Personally, I find the spectrum arguments fairly convincing as a case for comparability, but I think there's a wide range of possible positions here and it's not entirely obvious which of them are actually inconsistent. Pummer …
Understanding the internal mechanics of corrigibility seems very important, and I think this post helped me get a more fine-grained understanding and vocabulary for it.
I've historically strongly preferred the type of corrigibility which comes from pointing to the goal and letting the agent be corrigible for instrumental reasons, I think largely because it seems very elegant and because, when it works, many good properties seem to pop out 'for free'. For instance, the agent is motivated to improve communication methods, avoid coercion, tile properly, and even possibly i…
I really like this model of computation and how naturally it deals with counterfactuals; I'm surprised it isn't talked about more often.
This raises the issue of abstraction - the core problem of embedded agency.
I'd like to understand this claim better - are you saying that the core problem of embedded agency is relating high-level agent models (represented as causal diagrams) to low-level physics models (also represented as causal diagrams)?
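For what it's worth, here's a minimal toy sketch of what I imagine "relating" the two levels could mean (my own illustration with made-up dynamics, not anything from the post): a coarse-graining map from low-level states to high-level states that commutes with the low-level dynamics.

```python
# Toy illustration (my own construction): relating a "low-level" model to a
# "high-level" model via a coarse-graining map that commutes with the dynamics.

def low_step(state):
    """Low-level 'physics': two counters that each increment by one."""
    a, b = state
    return (a + 1, b + 1)

def high_step(total):
    """High-level model: only the sum of the counters is tracked."""
    return total + 2

def abstract(state):
    """Coarse-graining map from low-level states to high-level states."""
    a, b = state
    return a + b

def commutes(state):
    """Consistency check: abstracting-then-stepping equals stepping-then-abstracting."""
    return abstract(low_step(state)) == high_step(abstract(state))

# Check the consistency condition on a small grid of low-level states.
print(all(commutes((a, b)) for a in range(5) for b in range(5)))  # True
```

Whether something like this commuting-map picture is what's actually meant by "the core problem" here is exactly what I'm asking.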
I wonder if you can extend it to also explain non-agentic approaches to Prosaic AI Alignment (and why some people prefer those).
I'm quite confused about what a non-agentic approach actually looks like, and I agree that extending this to give a proper account would be really interesting. A possible argument, from this framework, for actively avoiding 'agentic' models is: …
These sorts of problems are what caused me to want a presentation which didn't assume well-defined agents and boundaries in the ontology, but I'm not sure how it applies to the above: I'm not looking for optimization as a behavioral pattern but as a concrete type of computation, one which involves storing world-models and goals and doing an active search for actions which further the goals. Neither a thermostat nor the world outside seems to do this, from what I can see. I think I'm likely missing your point.
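To illustrate the distinction I mean, here's a minimal sketch (purely illustrative toy code, not anything from the post): an explicit optimizer stores a world-model and a goal and searches over actions, whereas a thermostat just applies a fixed feedback rule with no stored model and no search.

```python
# Illustrative toy code: "optimization as a concrete type of computation"
# (explicit world-model + goal + search) vs. a fixed feedback rule.

def thermostat(temperature, setpoint=20.0):
    """Fixed feedback rule: no world-model, no goal representation, no search."""
    return "heat" if temperature < setpoint else "off"

def searching_optimizer(world_model, goal, actions, state):
    """Stores a world-model and a goal, and actively searches over actions
    for the one whose predicted outcome best satisfies the goal."""
    return max(actions, key=lambda action: goal(world_model(state, action)))

# Hypothetical toy model and goal, just for the example:
world_model = lambda state, action: state + action            # predicted next temperature
goal = lambda predicted: -abs(predicted - 20.0)                # prefer temperatures near 20
actions = [-2.0, -1.0, 0.0, 1.0, 2.0]

print(thermostat(18.0))                                        # "heat"
print(searching_optimizer(world_model, goal, actions, 18.0))   # 2.0, found by searching the actions
```

On this framing the thermostat exhibits optimizing behavior from the outside, but only the second function is doing the kind of computation I have in mind.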