Is the following a typo? "So, the ( works" (the first sentence of "CoCo Equilibria").
Maybe you have made a gestalt-switch I haven't made yet, or maybe yours is a better way to communicate the same thing, but: the way I think of it is that the reward function is just a function from states to numbers, and the way the information contained in the reward function affects the model parameters is via reinforcement of pre-existing computations.
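The mechanism I have in mind can be shown in a toy REINFORCE-style update (my own toy illustration, not anything from the post): the reward is just a number attached to what happened, and it enters the update only by scaling how strongly the computation that already produced the sampled action gets reinforced.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros(2)  # logits of a softmax policy over two actions

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# The "reward function" is just a mapping from outcomes to numbers;
# here action 1 pays off and action 0 doesn't.
reward = [0.0, 1.0]

for _ in range(500):
    p = softmax(theta)
    a = rng.choice(2, p=p)                # the policy's pre-existing computation picks a
    grad_logp = -p
    grad_logp[a] += 1.0                   # gradient of log p(a) w.r.t. the logits
    theta += 0.1 * reward[a] * grad_logp  # reward only scales the reinforcement

# Whatever computation led to rewarded actions got strengthened:
assert softmax(theta)[1] > 0.9
```

Nothing in the update "pursues" the reward as a target; the reward never appears as an objective the parameters are compared against, only as a multiplier on the reinforcement of computations that already ran.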
Is there a difference between saying:
It seems to me that once you acknowledge the point about reinforcement, the additional statement that reward is not an objective doesn't actually imply anything further about the mechanistic properties of deep reinforcement learners? It is just a way to put a high-level conceptual story on top of it, and in this sense it seems to me that this point is already known (and in particular, contained within RFLO), even though we still talked of the base objective as an "objective".
However, it might be that while RFLO pointed out the same mechanistic understanding that you have in mind, calling it an objective tends in practice not to fully communicate that mechanistic understanding.
Or it might be that I am really not yet understanding that there is an actual difference in mechanistic understanding, or that my intuitions are still being misled by the wrong high-level concept even if I have the lower-level mechanistic understanding right.
(On the other hand, one reason to still call it an objective is because we really can think of the selection process, i.e. evolution/the learning algorithm of an RL agent, as having an objective but making imperfect choices, or we can think of the training objective as encoding a task that humans have in mind).
It seems to me that the basic conceptual point made in this post is entirely contained in our Risks from Learned Optimization paper. I might just be missing a point. You've certainly phrased things differently and made some specific points that we didn't, but am I just misunderstanding something if I think the basic conceptual claims of this post (which seem to be presented as new) are implied by RFLO? If not, could you state briefly what is different?
(Note I am still sometimes surprised that people still think certain wireheading scenarios make sense despite having read RFLO, so it's plausible to me that we really didn't communicate everything that's in my head about this.)
Here is my partial honest reaction, just two points I'm somewhat dissatisfied with (not meant to be exhaustive):

2. "A cognitive system with sufficiently high cognitive powers, given any medium-bandwidth channel of causal influence, will not find it difficult to bootstrap to overpowering capabilities independent of human infrastructure." I would like there to be an argument for this claim that doesn't rely on nanotech and that solidly relies on actually existing amounts of compute. E.g. if the argument relies on running intractable detailed simulations of proteins, it doesn't count. (I'm not disagreeing with the nanotech example, by the way, or saying that it relies on unrealistic amounts of compute; I'd just like an argument for this that is very solid, minimally reliant on speculative technology, and actually shown to be so.)

6. "We need to align the performance of some large task, a 'pivotal act' that prevents other people from building an unaligned AGI that destroys the world." You name "burn all GPUs" as an "overestimate for the rough power level of what you'd have to do", but it seems to me that it would be too weak a pivotal act. Assuming there isn't some extreme change in generally held views, people would consider this an extreme act of terrorism: they would shut you down, put you in jail, and then rebuild the GPUs and go on with what they were planning to do. Moreover, there would now probably be an extreme taboo on anything AI-safety related. (I'm assuming here that law enforcement finds out that you were the one who did this.) Maybe the idea is to burn all GPUs indefinitely and forever (i.e. leave nanobots that continually check for GPUs and burn them when they are created), but even this seems either insufficient or undesirable long term, depending on what counts as a GPU. Possibly I'm not getting what you mean, but it just seems far too weak as an act.
Responding to this very late, but: If I recall correctly, Eric has told me in personal conversation that CAIS is a form of AGI, just not agent-like AGI. I suspect Eric would agree broadly with Richard's definition.
"I talk about consequentialists, but not rational consequentialists": OK, this was not the impression I was getting.
Reading this post a while after it was written: I'm not going to respond to the main claim (which seems quite likely) but just to the specific arguments, which seem suspicious to me. Here are some points:
It seems to me that either we think there is no problem with selecting QIAs as answers, or we think that human judges will be irrational and manipulated, but I don't see the justification in this post for saying "rational consequentialist judges will select QIAs AND this is probably bad".
I think a subpartition of S can be thought of as a partial function on S, or equivalently, a variable on S that has the possible value "Null"/"undefined".
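Concretely, the three descriptions carry the same data (a minimal sketch; the set, block labels, and names here are made up for illustration):

```python
S = {1, 2, 3, 4, 5}

# 1. A subpartition of S: a partition of a subset of S (here {1, 2, 4}).
subpartition = [{1, 4}, {2}]

# 2. The same data as a partial function on S: each covered element is
#    mapped to (a label of) its block; 3 and 5 are outside the domain.
partial_fn = {1: "a", 4: "a", 2: "b"}

# 3. The same data as a total variable on S with a Null/undefined value.
def variable(x):
    return partial_fn.get(x)  # None plays the role of "Null"/"undefined"

assert variable(1) == variable(4) == "a"            # same block
assert variable(1) != variable(2)                   # different blocks
assert variable(3) is None and variable(5) is None  # undefined off the domain
```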
Can't you define C ⊢_S X for any set C of partitions of X, rather than C ⊢_F X w.r.t. a specific factorization F, simply as C ⊢_S X iff ⋁_S(C) ≥_S X? If so, it would seem clearer to me to define ⊢ that way (i.e. make 7 rather than 2 from proposition 10 the definition), and then proposition 10 basically says "if C is a subset of factors of a partition then here are a set of equivalent definitions in terms of chimera". Also I would guess that proposition 11 is still true for ⊢_S rather than just for ⊢_F; I haven't checked that 11.6 would still work, but it seems like it should.