I addressed this distinction previously, in one of the links in the OP. AFAIK we did not know how to reliably ensure the AI is pointed towards anything external, even without specifying which external thing. But also, humans are reliably pointed to particular kinds of external things. See the linked thread for more detail.
The important disanalogy
I am not attempting to make an analogy. Genome->human values is, mechanistically, an instance of value formation within a generally intelligent mind. For all of our thought experiments, genome->human values is the only instance we h...
I basically agree with you. I think you go too far in saying Lethality 19 is solved, though. Using the 3 feats from your linked comment, which I'll summarise as "produce a mind that...":
(clearly each one is strictly harder than the previous). I recognise that Lethality 19 concerns feat 3, though it is worded as if it were about both feat 2 and feat 3.
I think I need to distinguish two versions of feat 3:
- Argue that wireheading, unlike many other reward gaming or reward tampering problems, is unlikely in practice because the model would have to learn to value the actual transistors storing the reward, which seems exceedingly unlikely in any natural environment.
Humans don't wirehead because reward reinforces the thoughts which the brain's credit assignment algorithm deems responsible for producing that reward. Reward is not, in practice, that-which-is-maximized -- reward is the antecedent-thought-reinforcer; it reinforces that which produced it. And when a p...
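To illustrate the distinction, here is a toy sketch in code (mine, not a claim about the brain's actual algorithm; the action names are made up):

```python
import math
import random

# Toy illustration: reward acts as an antecedent-computation reinforcer.
# It upweights whichever "thought" (here, an action logit) produced it;
# nothing in the agent represents or maximizes reward itself.
logits = {"forage": 0.0, "press_reward_button": 0.0}

def sample_action():
    weights = {a: math.exp(v) for a, v in logits.items()}
    r = random.uniform(0, sum(weights.values()))
    for action, w in weights.items():
        r -= w
        if r <= 0:
            return action
    return action  # floating-point edge-case fallback

def reinforce(action, reward, lr=0.1):
    # Credit assignment: only the computation that actually led to the
    # reward gets strengthened.
    logits[action] += lr * reward
```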
OK, sure. First, I updated down on alignment difficulty after reading your lethalities post, because I had already baked the expected-EY-quality doompost into my expectations. I was seriously relieved that you hadn't found any qualitatively new obstacles which might present deep challenges to my new view on alignment.
Here's one stab[1] at my disagreement with your list: Human beings exist, and our high-level reasoning about alignment has to account for the high-level alignment properties[2] of the only general intelligences we have ever ...
Yes, human beings exist and build world models beyond their local sensory data, and have values over those world models not just over the senses.
But this is not addressing all of the problem in Lethality 19. What's missing is how we point at something specific (not just at anything external).
The important disanalogy between AGI alignment and humans as already-existing (N)GIs is:
there's no special reason why that one pathway would show up in a large majority of the [satisficer]'s candidate plans.
There is a special reason, and it's called "instrumental convergence." Satisficers tend to seek power.
there's an astounding amount of regularity in what we end up caring about.
Yes, this is my claim. Not that, e.g., >95% of people form values which we would want to form within an AGI.
Is the distinction between 2 and 3 that "dog" is an imprecise concept, while "diamond" is precise? FWIW, 2 and 3 currently sound very similar to me, if 2 is 'maximize the number of dogs' and 3 is 'maximize the number of diamonds'.
Feat #2 is: Design a mind which cares about anything at all in reality which isn't a shallow sensory phenomenon which is directly observable by the agent. Like, maybe I have a mind-training procedure, where I don't know what the final trained mind will value (dogs, diamonds, trees having particular kinds of cross-sections at year ...
Hm, I'll give this another stab. I understand the first part of your comment as "sure, it's possible for minds to care about reality, but we don't know how to target value formation so that the mind cares about a particular part of reality." Is this a good summary?
I don't see Eliezer claiming 'there's no way to make the AGI care about the real world vs. caring about (say) internal experiences in its own head'.
Let me distinguish three alignment feats:
I understand the first part of your comment as "sure, it's possible for minds to care about reality, but we don't know how to target value formation so that the mind cares about a particular part of reality." Is this a good summary?
Yes!
I was, first, pointing out that this problem has to be solvable, since the human genome solves it millions of times every day!
True! Though everyone already agreed (e.g., EY asserted this in the OP) that it's possible in principle. The updatey thing would be if the case of the human genome / brain development sugg...
One of the problems with English is that it doesn't natively support orders of magnitude for "unreliable." Do you mean "unreliable" as in "between 1% and 50% of people end up with part of their values not related to objects-in-reality", or as in "there is no a priori reason why anyone would ever care about anything not directly sensorially observable, except as a fluke of their training process"? Because the latter is what current alignment paradigms mispredict, and the former might be a reasonable claim about what really happens for human beings.
EDIT: ...
"learning from evolution" even more complicated (evolution -> protein -> something -> brain vs. evolution -> protein -> brain)
ah, no, this isn't what I'm saying. Hm. Let me try again.
The following is not a handwavy analogy, it is something which actually happened:
Therefore, somewhere in the "gen...
we're gonna be designing the thing analogous to evolution, and not the brain. We don't pick the actual weights in these transformers, we just design the architecture and then run stochastic gradient descent or some other meta-learning algorithm.
But, ah, the genome also doesn't "pick the actual weights" for the human brain which it later grows. So whatever the brain does to align people to care about latent real-world objects, I strongly believe that that process must be compatible with blank-slate initialization and then learning.
That meta-learning algorit...
I'm not talking about running evolution again, that is not what I meant by "the process by which humans come to reliably care about the real world." The human genome must specify machinery which reliably grows a mind which cares about reality. I'm asking why we can't use the alignment paradigm leveraged by that machinery, which is empirically successful at pointing people's values to certain kinds of real-world objects.
Why do you think that? Why is the process by which humans come to reliably care about the real world, not a process we could leverage analogously to make AIs care about the real world?
Likewise, when you wrote,
This isn't to say that nothing in the system’s goal (whatever goal accidentally ends up being inner-optimized over) could ever point to anything in the environment by accident.
Where is the accident? Did evolution accidentally find a way to reliably orient terminal human values towards the real world? Do people each, individually, acci...
Big agreement & signal boost & push for funding on The "Reverse-engineer human social instincts" research program: Yes, please, please figure out how human social instincts are generated! I think this is incredibly important, for reasons which will become clear from several posts I'll probably put out this summer.
Hm. I've often imagined a "keep the diamond safe" planner just choosing a plan which a narrow-ELK-solving reporter says is OK.
How do you imagine the reporter being used?
Later in the post, I proposed a similar modification:
I think we should modify the simplified hand-off procedure I described above so that, during training:
- A range of handoff thresholds and proportions are drawn—in particular, there should be a reasonable probability of drawing values close to 0, close to 1, and also 0 and 1 exactly.
- The human net runs for some number of steps before calling the reporter.
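One way to draw such thresholds, as a toy sketch (the particular mixture is my own choice; the proposal above only requires reasonable probability near and at the endpoints):

```python
import random

def draw_handoff_threshold():
    # Mix exact endpoint draws with a U-shaped Beta distribution that
    # concentrates mass near 0 and 1.
    r = random.random()
    if r < 0.1:
        return 0.0
    if r < 0.2:
        return 1.0
    return random.betavariate(0.5, 0.5)
```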
I do think this is what happens given the current architecture. I argued that the desired outcome solves narrow ELK as a sanity check, but I'm not claiming that the desired setup is uniquely loss-minimizing.
Part of my original intuition was "the human net is set up to be useful at predicting given honest information about the situation", and "pressure for simpler reporters will force some kind of honesty, but I don't know how far that goes." As time passed I became more and more aware of how this wasn't the best/only way for the human net to help with prediction, and turned more towards a search for a crisp counterexample.
Thanks for your reply!
What we're actually doing here is defining "automated ontology identification"
(Flagging that I didn't understand this part of the reply, but don't have time to reload context and clarify my confusion right now)
If you deny the existence of a true decision boundary then you're saying that there is just no fact of the matter about the questions that we're asking to automated ontology identification. How then would we get any kind of safety guarantee (conservativeness or anything else)?
When you assume a true decision boundary, you're a...
Why would the planner have pressure to choose something which looks good to the predictor, but is secretly bad, given that it selects a plan based on what the reporter says? Is this a Goodhart's curse issue, where the curse afflicts not the reporter (which is assumed conservative, if it's the direct translator), but the predictor's own understanding of the situation?
I don't understand why a strong simplicity guarantee places most of the difficulty on the learning problem. In the diamond situation, a strong simplicity requirement on the reporter can mean that the direct translator gets ruled out, since it may have to translate from a very large and sophisticated AI predictor.
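To make the worry concrete (a toy inequality of my own; $\ell$ denotes description length, $k$ the simplicity cap, and $g$ some increasing function):

$$\ell(\text{direct translator}) \gtrsim g\big(\ell(\text{predictor})\big) > k \ge \ell(\text{human simulator}),$$

i.e. a cap $k$ tight enough to bind at all can exclude the direct translator, whose length grows with the predictor's, while leaving the human simulator available.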
if automated ontology identification does turn out to be possible from a finite narrow dataset, and if automated ontology identification requires an understanding of our values, then where did the information about our values come from? It did not...
Why is this what you often imagine? I thought that in the classic ELK architectural setup, the planner uses the outputs of both a predictor and a reporter in order to make its plans, e.g. using the reporter to grade plans and finding the plan which most definitely contains a diamond (according to the reporter). And the simplest choice would be brute-force search over action sequences.
After all, here's the architecture:
[figure: the ELK predictor/reporter architecture diagram]
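Concretely, the plan-selection rule I'm imagining (a toy sketch; `predictor` and `reporter` stand for hypothetical callables, not anything from the report):

```python
def choose_plan(candidate_plans, predictor, reporter):
    """Brute-force planner sketch: grade each candidate plan's predicted
    outcome with the reporter, and pick the plan the reporter is most
    confident contains the diamond."""
    return max(candidate_plans, key=lambda plan: reporter(predictor(plan)))
```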
But in your setup, the planner would be another head branching from "figure out what's going on", which means that it's receiving the results of a c...
How do we know that the "prediction extractor" component doesn't do additional serious computation, so that it knows something important that the "figure out what's going on" module doesn't know? If that were true, the AI as a whole could know the diamond was stolen, without the "figure out what's going on" module knowing, which means even the direct translator wouldn't know, either. Are we just not giving the extractor that many parameters?
How the power-seeking theorems relate to the selection theorem agenda.
I recommend reading the quoted post for clarification.
I agree that in certain conceivable games which are not baseline SC2, there will be different power-seeking incentives for negative alpha weights. My commentary wasn't intended as a generic takeaway about negative feature weights in particular.
But in the game which actually is SC2, where you don't start with a huge number of resources, negative alpha weights don't incentivize power-seeking. You do need to think about the actual game being considered before you can conclude that negative alpha weights imply such-and-such a behavior.
But the numbe...
I'm implicitly assuming a fixed opponent policy, yes.
Without being overly familiar with SC2—you don't have to kill your opponent to get to 0 resources, do you? From my experience with other RTS games, I imagine you can just quickly build units and deplete your resources, and then your opponent can't make you accrue more resources. Is that wrong?
Argument sketch for why boxing is doomed if the agent is perfectly misaligned:
Consider a perfectly misaligned agent which has -1 times your utility function (it's zero-sum). Then suppose you got useful output from the agent. This means you're able to increase your EU. This means the AI decreased its EU by saying anything. Therefore, it should have shut up instead. But since we assume it's smarter than you, it realized this possibility, and so the fact that it's saying something means that it expects to gain by hurting your interests via its output. Therefore, the output can't be useful.
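In expected-utility terms (a minimal formalization of the sketch; notation mine): let $U_H$ be your utility function and $U_A = -U_H$ the agent's. By linearity of expectation,

$$\mathbb{E}[U_H \mid o] > \mathbb{E}[U_H \mid \text{silence}] \iff \mathbb{E}[U_A \mid o] < \mathbb{E}[U_A \mid \text{silence}],$$

so a rational zero-sum agent strictly prefers silence to any useful output $o$; whatever it does emit must satisfy $\mathbb{E}[U_H \mid o] \le \mathbb{E}[U_H \mid \text{silence}]$.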
Fixed, thanks!
This post's main contribution is the formalization of game-theoretic defection as gaining personal utility at the expense of coalitional utility.
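In symbols (my paraphrase; the post's exact definition may differ in details): fix a strategy profile $s$, payoff functions $u_j$, and coalition weights $\alpha_j \ge 0$. Player $i$ defects by deviating to $s_i'$ when

$$\mathbb{E}[u_i(s_i', s_{-i})] > \mathbb{E}[u_i(s)] \quad \text{and} \quad \sum_j \alpha_j\, \mathbb{E}[u_j(s_i', s_{-i})] < \sum_j \alpha_j\, \mathbb{E}[u_j(s)].$$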
Rereading, the post feels charmingly straightforward and self-contained. The formalization feels obvious in hindsight, but I remember being quite confused about the precise difference between power-seeking and defection—perhaps because popular examples of taking over the world are also defections against the human/AI coalition. I now feel cleanly deconfused about this distinction. And if I was confused about...
The Hippocratic principle seems similar to my concept of non-obstruction (https://www.lesswrong.com/posts/Xts5wm3akbemk4pDa/non-obstruction-a-simple-concept-motivating-corrigibility), but subjective from the human's beliefs instead of the AI's.
I find it concerning that you felt the need to write "This is not at all a criticism of the way this post was written. I am simply curious about my own reaction to it" (and still got downvoted?).
For my part, I both believe that this post contains valuable content and good arguments, and that it was annoying / rude / bothersome in certain sections.
The biggest disconnect is that this post is not a proposal for how to solve corrigibility. I'm just thinking about what corrigibility is/should be, and this seems like a shard of it—but only a shard. I'll edit the post to better communicate that.
So, your points are good, but they run skew to what I was thinking about while writing the post.
I skip over those pragmatic problems because this post is not proposing a solution, but rather a measurement I find interesting.
OK, I'll bite on EY's exercise for the reader, on refuting this "what-if":
Humbali: Then here's one way that the minimum computational requirements for general intelligence could be higher than Moravec's argument for the human brain. Since, after all, we only have one existence proof that general intelligence is possible at all, namely the human brain. Perhaps there's no way to get general intelligence in a computer except by simulating the brain neurotransmitter-by-neurotransmitter. In that case you'd need a lot more computing oper...
no-one has the social courage to tackle the problems that are actually important
I would be very surprised if this were true. I personally don't feel any social pressure against sketching a probability distribution over the dynamics of an AI project that is nearing AGI.
I would guess that if people aren't tackling Hard Problems enough, it's not because they lack social courage, but because 1) they aren't running a good-faith search for Hard Problems to begin with, or 2) they came up with reasons for not switching to the Hard Problems they thought of, or 3) they're wrong about what problems are Hard Problems. My money's mostly on (1), with a bit of (2).
But it does imply that you should not expect that this community will ever be willing to agree that corrigibility, or any other alignment problem, has been solved.
Noting that I strongly disagree but don't have time to type out arguments right now, sorry. May or may not type out later.
Addendum: One lesson to take away is that quantilization doesn't just depend on the base distribution being safe to sample from unconditionally. As the theorems hint, quantilization's viability depends on base(plan | plan doing anything interesting) also being safe with high probability, because we could (and would) probably resample the agent until we get something interesting. In this post's terminology, A := {safe interesting things}, B := {power-seeking interesting things}, C := A ∪ B ∪ {uninteresting things}.
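To illustrate in code (a toy sketch; the sampling scheme and names are mine, not from the post):

```python
import random

def quantilize(base_sample, utility, q=0.01, n=10_000):
    """Sample n plans from the base distribution and return a uniformly
    random plan from the top-q fraction ranked by utility."""
    plans = [base_sample() for _ in range(n)]
    plans.sort(key=utility, reverse=True)
    return random.choice(plans[: max(1, int(q * n))])

def deploy(base_sample, utility, interesting, q=0.01):
    """What we'd do in practice: resample until the plan does something.
    Safety now hinges on P(safe | interesting), not on the unconditional
    P(safe) of the base distribution."""
    while True:
        plan = quantilize(base_sample, utility, q)
        if interesting(plan):
            return plan
```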
I've started commenting on this discussion on a Google Doc. Here are some excerpts:
During this step, if humanity is to survive, somebody has to perform some feat that causes the world to not be destroyed in 3 months or 2 years when too many actors have access to AGI code that will destroy the world if its intelligence dial is turned up.
Contains implicit assumptions about takeoff that I don't currently buy:
Although I didn't make this explicit, one problem is that manipulation is still weakly optimal—as you say. That wouldn't fit the spirit of strict corrigibility, as defined in the post.
Note that AUP doesn't have this problem.
though since Set can be embedded into Vect, it surely can't hurt too much
As an aside, can you link to/say more about this? Do you mean that there exists a faithful functor from Set to Vect (the category of vector spaces)? If you mean that, then every concrete category can be embedded into Vect, no? And if that's what you're saying, maybe the functor Set -> Vect is something like the "group to its group algebra over a field" functor.
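For reference, the standard construction I have in mind here (sketched from memory, so worth double-checking):

$$F : \mathbf{Set} \to \mathbf{Vect}_k, \qquad F(X) = k^{(X)} = \{\text{finite formal } k\text{-linear combinations of elements of } X\},$$

with $F(f : X \to Y)$ the linear extension of $f$ on basis vectors. $F$ is faithful, so composing it with the faithful forgetful functor $U : \mathcal{C} \to \mathbf{Set}$ of any concrete category gives a faithful $F \circ U : \mathcal{C} \to \mathbf{Vect}_k$. When $X$ is a group, $k^{(X)}$ is the underlying vector space of the group algebra $k[X]$, though $F$ itself forgets the multiplication.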
I think instrumental convergence should still apply to some utility functions over policies, specifically the ones that seem to produce "smart" or "powerful" behavior from simple rules.
I share an intuition in this area, but "powerful" behavior tendencies seems nearly equivalent to instrumental convergence to me. It feels logically downstream of instrumental convergence.
from simple rules
I already have a (somewhat weak) result on power-seeking wrt the simplicity prior over state-based reward functions. This isn't about utility functions over policies, though...
So a lot of the instrumental convergence power comes from restricting the things you can consider in the utility function. u-AOH is clearly too broad, since it allows assigning utilities to arbitrary sequences of actions with identical effects, and simultaneously u-AOH, u-OH, and ordinary state-based reward functions (can we call that u-S?) are all too narrow, since none of them allow assigning utilities to counterfactuals, which is required in order to phrase things like "humans have control over the AI" (as this is a causal statement and thus depends on...
change its utility function in a cycle such that it repeatedly ends up in the same place in the hallway with the same utility function.
I'm not parsing this. You change the utility function, but it ends up in the same place with the same utility function? Did we change it or not? (I think simply rewording it will communicate your point to me)
Edited to add:
If you can correct the agent to go where you want, it already wanted to go where you want. If the agent is strictly corrigible to terminal state s, then s was already optimal for it.
If the reward function has a single optimal terminal state, there isn't any new information being added by the correction. But we want corrigibility to let us reflect more on our values over time and what we want the AI to do!
If the reward function has multiple optimal terminal states, then corrigibility again becomes meaningful. Bu...
I worded the title conservatively because I only showed that corrigibility is never nontrivially VNM-coherent in this particular MDP. Maybe there's a more general case to be proven for all MDPs, and using more realistic (non-single-timestep) reward aggregation schemes.
Apparently no one has actually shown that corrigibility can be VNM-incoherent in any precise sense (and not in the hand-wavy sense which is good for intuition-pumping). I went ahead and sketched out a simple proof of how a reasonable kind of corrigibility gives rise to formal VNM incoherence.
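To gesture at the core intuition, separate from the MDP proof (a deliberately crude toy case of my own): over outcomes $\{A, B\}$, corrigibility demands "choose $A$ if instructed to $A$, choose $B$ if instructed to $B$." No single VNM utility function $u$ over $\{A, B\}$ satisfies both $u(A) > u(B)$ and $u(B) > u(A)$, so instruction-following preferences only become representable once the outcome space is enriched to include the instruction history, which is the kind of enrichment a state-based reward function doesn't provide.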
I'm interested in hearing about how your approach handles this environment, because I think I'm getting lost in informal assumptions and symbol-grounding issues when reading about your proposed method.
Read your post, here are my initial impressions on how it relates to the discussion here.
In your post, you aim to develop a crisp mathematical definition of (in)coherence, i.e. VNM-incoherence. I like that, looks like a good way to move forward. Definitely, developing the math further has been my own approach to de-confusing certain intuitive notions about what should be possible or not with corrigibility.
However, my first impression is that your concept of VNM-incoherence is only weakly related to the meaning that Eliezer has in mind when he uses the te...
Could you give a toy example of this being insufficient? (I'm assuming the "set copy relation" is the "B contains n of A" requirement.)
A := {(1, 0, 0)}, B := {(0, .3, .7), (0, .7, .3)}
Less opaquely, see the technical explanation for this counterexample, where the right action leads to two trajectories, and up leads to a single one.
How does the "B contains n of A" requirement affect the existential risks? I can see how shut-off as a 1-cycle fits, but not manipulating and deceiving people (though I do think those are bottlenecks to large amounts of outcomes).
For thi...
For the "Theorem: Training retargetability criterion", where f(A, u) >= its involution, what would be the case where it's not greater/equal to it's involution? Is this when the options in B are originally more optimal?
I don't think I understand the question. Can you rephrase?
Also, that theorem requires each involution to be greater/equal to the original. Is this just to get a lower bound on the n-multiple, or do less-than involutions not add anything?
Less-than involutions aren't guaranteed to add anything. For example, if ... iff a goes le...
Right, I was intending "3. [these results] don't account for the ways in which we might practically express reward functions" to capture that limitation.
I'm currently pessimistic about the prospect. But it seems worth thinking about, because wouldn't it be such an amazing work-around?
My first idea straddles the border between contrived and intriguing. Consider some AGI-capable ML architecture, and imagine its parameter space being 3-colored as follows:
- the parameter vector + training process + other initial conditions leads to a nothingburger (a non-functional model); ...
Against CIRL as a special case of against quickly jumping into highly specific speculation while ignoring empirical embodiments-of-the-desired-properties.
Just because we write down English describing what we want the AI to do ("be helpful"), propose a formalism (CIRL), and show good toy results (POMDPs where the agent waits to act until updating on more observations), that doesn't mean that the formalism will lead to anything remotely relevant to the original English words we used to describe it. (It's easier to say "this logic enables nonmonotonic r...