All of Hjalmar_Wijk's Comments + Replies

ARC evals has only existed since last fall, so for obvious reasons we have not evaluated very early versions. Going forward I think it would be valuable and important to evaluate models during training or to scale up models in incremental steps.

As someone who has been feeling increasingly skeptical of working in academia I really appreciate this post and discussion on it for challenging some of my thinking here. 

I do want to respond especially to this part though, which seems cruxy to me:

Furthermore, it is a mistake to simply focus on efforts on whatever timelines seem most likely; one should also consider tractability and neglectedness of strategies that target different timelines. It seems plausible that we are just screwed on short timelines, and somewhat longer timelines are more tractab

... (read more)
I had independently thought that this is one of the main parts where I disagree with the post, and wanted to write up a very similar comment to yours. Highly relevant link: My best guess would have been maybe 3-5x per decade, but 10x doesn't seem crazy.

These sorts of problems are what caused me to want a presentation which didn't assume well-defined agents and boundaries in the ontology, but I'm not sure how it applies to the above - I am not looking for optimization as a behavioral pattern but as a concrete type of computation, which involves storing world-models and goals and doing active search for actions which further the goals. Neither a thermostat nor the world outside seem to do this from what I can see? I think I'm likely missing your point.

Motivating example: consider a primitive bacterium with a thermostat, a light sensor, and a salinity detector, each of which has functional mappings to some movement pattern. You could say this system has a 3 dimensional map and appears to search over it.

Theron Pummer has written about this precise thing in his paper on Spectrum Arguments, where he touches on this argument for "transitivity=>comparability" (here notably used as an argument against transitivity rather than an argument for comparability) and its relation to 'Sorites arguments' such as the one about sand heaps.

Personally I think the spectrum arguments are fairly convincing for making me believe in comparability, but I think there's a wide range of possible positions here and it's not entirely obvious which are actually inconsistent. Pummer

... (read more)

Understanding the internal mechanics of corrigibility seems very important, and I think this post helped me get a more fine-grained understanding and vocabulary for it.

I've historically strongly preferred the type of corrigibility which comes from pointing to the goal and letting it be corrigible for instrumental reasons, I think largely because it seems very elegant and that when it works many good properties seem to pop out 'for free'. For instance, the agent is motivated to improve communication methods, avoid coercion, tile properly and even possibly i

... (read more)
4Abram Demski4y
The 'type of corrigibly' you are referring to there is corrigibly at all; rather, it's alignment. Indeed, the term corrigibly was coined to contrast to this, motivated by the fragility of this to getting the printer right. I tend to agree. I'm hoping that thinking about myopia and related issues could help me understand more natural notions of corrigibility.

I really like this model of computation and how naturally it deals with counterfactuals, surprised it isn't talked about more often.

This raises the issue of abstraction - the core problem of embedded agency.

I'd like to understand this claim better - are you saying that the core problem of embedded agency is relating high-level agent models (represented as causal diagrams) to low-level physics models (also represented as causal diagrams)?

I'm saying that the core problem of embedded agency is relating a high-level, abstract map to the low-level territory it represents. How can we characterize the map-territory relationship based on our knowledge of the map-generating process, and what properties does the map-generating process need to have in order to produce "accurate" & useful maps? How do queries on the map correspond to queries on the territory? Exactly what information is kept and what information is thrown out when the map is smaller than the territory? Good answers to these question would likely solve the usual reflection/diagonalization problems, and also explain when and why world-models are needed for effective goal-seeking behavior. When I think about how to formalize these sort of questions in a way useful for embedded agency, the minimum requirements are something like: * need to represent physics of the underlying world * need to represent the cause-and-effect process which generates a map from a territory * need to run counterfactual queries on the map (e.g. in order to do planning) * need to represent sufficiently general computations to make agenty things possible ... and causal diagrams with symmetry seem like the natural class to capture all that.

I wonder if you can extend it to also explain non-agentic approaches to Prosaic AI Alignment (and why some people prefer those).

I'm quite confused about what a non-agentic approach actually looks like, and I agree that extending this to give a proper account would be really interesting. A possible argument for actively avoiding 'agentic' models from this framework is:

  1. Models which generalize very competently also seem more likely to have malign failures, so we might want to avoid them.
  2. If we believe then things which generalize very competently are l
... (read more)
(black) hat tip to johnswentworth for the notion that the choice of boundary for the agent is arbitrary in the sense that you can think of a thermostat optimizing the environment or think of the environment as optimizing the thermostat. Collapsing sensor/control duality for at least some types of feedback circuits.