I do alignment research, mostly stuff that is vaguely agent foundations. Formerly on Vivek's team at MIRI. Most of my writing from before mid-2023 is not representative of my current views about alignment difficulty.
I read about half of this post when it came out. I didn't want to comment without reading the whole thing, and reading the whole thing didn't seem worth it at the time. I've come back and read it because Dan seemed to reference it in a presentation the other day.
The core interesting claim is this:
"My conclusion will be that most of the items on Bostrom's laundry list are not 'convergent' instrumental means, even in this weak sense. If Sia's desires are randomly selected, we should not give better than even odds to her making choices which promote her own survival, her own cognitive enhancement, technological innovation, or resource acquisition."
This conclusion doesn't follow from your arguments. None of your models even include actions that are analogous to the convergent actions on that list.
The non-sequential theoretical model is irrelevant to instrumental convergence, because instrumental convergence is about putting yourself in a better position to pursue your goals later on. The main conclusion seems to come from proposition 3, but the model there is so simple it doesn’t include any possibility of Sia putting itself in a better position for later.
Section 4 deals with sequential decisions, but for some reason mainly gets distracted by a Newcomb-like problem, which seems irrelevant to instrumental convergence. I don't see why you didn't just remove Newcomb-like situations from the model. Instrumental convergence will show up regardless of the exact decision theory used by the agent.
Here's my suggestion for a more realistic model that would exhibit instrumental convergence, while still being fairly simple and having "random" goals across trajectories. Make an environment with 1,000,000 timesteps. Have the world state described by a vector of 1000 real numbers. Have a utility function that is randomly sampled from some Gaussian process (or any other high entropy distribution over functions) on $\mathbb{R}^{1{,}000{,}000 \times 1000} \to \mathbb{R}$. Assume there exist standard actions which directly make small edits to the world-state vector. Assume that there exist actions analogous to cognitive enhancement, making technology and gaining resources. Intelligence can be used in the future to more precisely predict the consequences of actions on the future world state (you’d need to model a bounded agent for this). Technology can be used to increase the amount or change the type of effect your actions have on the world state. Resources can be spent in the future for more control over the world state. It seems clear to me that for the vast majority of the random utility functions, it's very valuable to have more control over the future world state. So most sampled agents will take the instrumentally convergent actions early in the game and use the additional power later on.
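To make this concrete, here is a heavily scaled-down sketch of that kind of model. It is not the model above, only a toy under loudly stated assumptions: 50 timesteps and a 10-dimensional state instead of 1,000,000 × 1000; a random linear utility over trajectories, U = sum_t w_t · x_t with Gaussian w_t, standing in for a Gaussian-process sample; and a single hypothetical "invest" action that multiplies the size of all later edits by `growth`, standing in for technology and resources. There is no bounded agent, so cognitive enhancement is not modelled. All function names and parameters are made up for illustration.

```python
import math
import random


def sample_trajectory_weights(T, d, rng):
    """Sample a random linear utility over trajectories: U = sum_t w_t . x_t."""
    return [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(T)]


def edit_values(weights):
    """n[s] = ||sum_{t >= s} w_t||.

    An edit made at step s shifts the state at every step t >= s, so the best
    edit direction is along the tail sum of the weights, and its payoff per
    unit of power is the norm of that tail sum.
    """
    T, d = len(weights), len(weights[0])
    tail = [0.0] * d
    n = [0.0] * T
    for s in range(T - 1, -1, -1):
        tail = [a + b for a, b in zip(tail, weights[s])]
        n[s] = math.sqrt(sum(c * c for c in tail))
    return n


def optimal_policy(n, growth):
    """Backward induction over the choice 'edit' vs 'invest' at each step.

    Payoffs scale linearly with current power, so it suffices to track V[s],
    the optimal value from step s onward at power 1:
        V[s] = max(n[s] + V[s+1],      # edit now at the current power
                   growth * V[s+1])    # invest: all later edits are scaled up
    """
    T = len(n)
    value_after = 0.0  # V[s+1], starting from V[T] = 0
    actions = [None] * T
    for s in range(T - 1, -1, -1):
        edit = n[s] + value_after
        invest = growth * value_after
        if invest > edit:
            actions[s], value_after = "invest", invest
        else:
            actions[s], value_after = "edit", edit
    return actions, value_after


def fraction_investing_first(num_agents=200, T=50, d=10, growth=1.5, seed=0):
    """Fraction of randomly sampled utilities whose optimal first action is 'invest'."""
    rng = random.Random(seed)
    count = 0
    for _ in range(num_agents):
        n = edit_values(sample_trajectory_weights(T, d, rng))
        actions, _ = optimal_policy(n, growth)
        if actions[0] == "invest":
            count += 1
    return count / num_agents


if __name__ == "__main__":
    print(fraction_investing_first())
```

In this toy, investing first is optimal exactly when `(growth - 1) * V[1] > n[0]`, i.e. when the scaled-up value of everything the agent can still do later outweighs one edit now. With many remaining timesteps that should hold for nearly every sampled utility, which is the sense in which the power-gaining action is convergent here. A linear utility is a much weaker notion of "random" than a Gaussian process over trajectories, so this only illustrates the structure of the argument.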
The assumptions I made about the environment are inspired by the real world environment, and the assumptions I've made about the desires are similar to yours, maximally uninformative over trajectories.
Thanks for clarifying, I misunderstood your post and must have forgotten about the scope, sorry about that. I'll remove that paragraph. Thanks for the links, I hadn't read those, and I appreciate the pseudocode.
I think most likely I still don't understand what you mean by grader-optimizer, but it's probably better to discuss on your post after I've spent more time going over your posts and comments.
My current guess in my own words is: A grader-optimizer is something that approximates argmax (has high optimization power)? And option (1) acts a bit like a soft optimizer, but with more specific structure related to shards, and how it works out whether to continue optimizing?