Independent alignment researcher
I think this type of criticism applies in a wider range of fields than even you immediately imagine (though in varying degrees, and with greater or lesser obviousness or direct correspondence to the SGD case). Some examples:
Despite what economists say, the economy doesn't try to maximize welfare, or even net dollar-equivalent wealth. It rewards firms in proportion to how much profit they're able to make, and penalizes firms which aren't able to make a profit. Firms which would technically be profitable, but have no local profit incentive gradient pointing towards them (factoring in the existence of rich people and lenders, neither of which are perfect expected-profit maximizers), generally will not come into existence.
Individual firms also don't (only) try to maximize profit. Some parts of them may maximize profit, but most are just structures of people built from local social capital and economic capital incentive gradients.
Politicians don't try to (only) maximize win-probability.
Democracies don't try to (only) maximize voter approval.
Evolution doesn't try to maximize inclusive genetic fitness.
Memes don't try to maximize inclusive memetic fitness.
Academics don't try to (only) maximize status.
China doesn't maximize allegiance to the CCP.
I think there's a general tendency for people to look at local updates in a system (when the system has humans as decision nodes, the local updates are called incentive gradients), somehow perform some integration-analogue for a function which would produce those local updates, then find a local minimum of that "integrated" function and claim the system is at that minimum or can be approximated well by the system at that minimum. Generally, this seems constrained in empirical systems by common sense learned by experience with the system, but in less and less empirical systems (like the economy or SGD), people get more and more crazy because they have less learned common sense to guide them when making the analysis.
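A toy illustration of one way the "integration" step can go wrong, as a minimal sketch rather than a model of any real system: if the local updates form a rotation field, no function exists whose gradient produces them, so there is no "integrated" function to minimize in the first place. The field `rotation_update`, the comparison field `gradient_update`, and the helper `loop_integral` are all hypothetical names chosen for this example. The check uses the fact that the gradient of any smooth function integrates to zero around a closed loop.

```python
import math

def rotation_update(x, y):
    """Hypothetical local update rule: a pure rotation, not a gradient of anything."""
    return -y, x

def gradient_update(x, y):
    """For comparison: the gradient of f(x, y) = (x**2 + y**2) / 2."""
    return x, y

def loop_integral(field, radius=1.0, steps=10_000):
    """Approximate the line integral of `field` around a circle.

    If `field` were the gradient of some function, the result would be ~0.
    """
    total = 0.0
    for k in range(steps):
        t0 = 2 * math.pi * k / steps
        t1 = 2 * math.pi * (k + 1) / steps
        x0, y0 = radius * math.cos(t0), radius * math.sin(t0)
        x1, y1 = radius * math.cos(t1), radius * math.sin(t1)
        # Evaluate the field at the midpoint of each small segment.
        fx, fy = field((x0 + x1) / 2, (y0 + y1) / 2)
        total += fx * (x1 - x0) + fy * (y1 - y0)
    return total

rotation_loop = loop_integral(rotation_update)   # close to 2*pi, so no potential exists
gradient_loop = loop_integral(gradient_update)   # close to 0, as expected for a gradient
```

Even when an integrated function does exist, as for `gradient_update`, that alone says nothing about whether the system has actually reached its minimum; the sketch only covers the case where the function cannot be constructed at all.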
This is true, but it indicates a radically different stage in training at which we should find deception, compared to deception being an intrinsic value. It also possibly expands the kinds of reinforcement schedules we may want to use, compared to the worlds where deception crops up at the earliest opportunity (though pseudo-deception may occur, where behaviors correlated with successful deception are possibly reinforced).
John usually does not make his plans with an eye toward making things easier. His plan previously involved values because he thought they were strictly harder than corrigibility. If you solve values, you solve corrigibility. Similarly, if you solve abstraction, you solve interpretability, shard theory, value alignment, corrigibility, etc.
I don’t know all the details of John’s model here, but it may go something like this: If you solve corrigibility, and then find out corrigibility isn’t sufficient for alignment, you may expect your corrigible agent to help you build your value aligned agent.
I think the pointer “the thing I would do if I wanted to make a second AI that would be the best one I could make at my given intelligence” is what is being updated in favor of, since this does feel like a natural abstraction, given how many agents would think this. (It also seems very similar to the golden rule: “I will do what I would want a successor AI to do if the successor AI was actually the human’s successor AI”, or “treat others (the human) how I’d like to be treated (by a successor AI)”, abstracted one meta-level upwards.) Whether this turns out to be value learning or something else 🤷. This seems a different question from whether or not it is indeed a natural abstraction.
Seems possibly relevant & optimistic when seeing deception as a value. It has the form ‘if about to tell the human a statement with properties x, y, z, don’t’ too.
Re: agents terminalizing instrumental values.
I anticipate there will be a hill-of-common-computations, where the x-axis is the frequency of the instrumental subgoal, and the y-axis is the extent to which the instrumental goal has been terminalized.
This is because for goals which are very high in frequency, there will be little incentive for the computations responsible for achieving that goal to have self-preserving structures. It will not make sense for them to devote optimization power towards ensuring future states still require them, because future states are basically guaranteed to require them.
An example of this for humans may be the act of balancing while standing up. If someone offered to export this kind of cognition to a machine which did it just as well as I do, I wouldn't particularly mind. If someone also wanted to change physics in such a way that the only effect was that magic invisible fairies made sure everyone stayed balanced while trying to stand up, I don't think I'd mind that either.
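The hill described above can be sketched with a toy functional form. Everything here is an assumption made for illustration: the product form, the name `terminalization`, and in particular the low-frequency side of the hill, which this sketch attributes to rare goals getting few chances to be reinforced (the argument above only covers the high-frequency side).

```python
def terminalization(frequency):
    """Toy score for how terminalized an instrumental goal becomes.

    Assumed form (hypothetical): reinforcement opportunities grow with
    frequency, while the incentive to build self-preserving structure
    shrinks as the goal becomes near-guaranteed to be needed.
    """
    if not 0.0 <= frequency <= 1.0:
        raise ValueError("frequency must be in [0, 1]")
    reinforcement_opportunities = frequency
    self_preservation_incentive = 1.0 - frequency
    return reinforcement_opportunities * self_preservation_incentive

# Sweep the x-axis (frequency) and find where the y-axis (terminalization) peaks.
frequencies = [i / 10 for i in range(11)]
scores = [terminalization(f) for f in frequencies]
peak = frequencies[scores.index(max(scores))]
```

Under this assumed form the score is zero at both extremes and peaks at a middling frequency, which is the hill shape; always-needed computations like balancing sit at the far right, near zero.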
I'm assuming this is the frequency of the goal when the agent isn't optimizing to get into a state that requires that goal.
This argument also assumes the overseer isn't otherwise selecting for self-preserving cognition, and that self-preserving cognition isn't the best way of achieving the relevant goal.
Except for the part where there's magic invisible fairies in the world now. That would be cool!
What do you mean by “surface area”?
The main one was that when I was making experiments, I did not have in mind a particular theory about how the network was doing a particular capability. I just messed around with matrices, graphed a bunch of stuff, and multiplied a bunch of weights by a bunch of other weights. Occasionally, I'd get interesting-looking pictures, but I had no clue what to do with those pictures, or what follow-up questions I could ask. I think that's because I didn't have an explicit model of what I thought the network should be doing, and so couldn't update my picture of the mechanisms it was using based on the data I gathered about its internals.
This was really really helpful! I learned a lot about how to think through experiment design, watching you do it, and I found some possible-mistakes I've been making while designing my own experiments!
My only criticism: When Copilot auto-fills in details, it would be helpful if you'd explain what it did and why it's what you wanted it to do, like how you do with your own code.
This seems like an underestimate because you don’t consider whether the first “AGI” will indeed make it so we only get one chance. If it can only self-improve by more gradient steps, then humanity has a greater chance than if it self-improves by prompt engineering or direct modification of its weights or latent states. Shard theory seems to have nonzero opinions on the fruitfulness of the non-data methods.