Wiki Contributions


[Intro to brain-like-AGI safety] 14. Controlled AGI

Yeah, you were one of the “couple other people” I alluded to. The other was ‪Tan Zhi-Xuan (if I was understanding her correctly during our most recent (very brief) conversation).

🤔 I wonder if I should talk with Tan Zhi-Xuan.

I think I know what you’re referring to, but I’m not 100% sure, and other people reading this probably won’t. Can you provide a link? Thanks.

I got the phrase "ontological lock" from adamShimi's post here, but it only comes up very briefly, so it's not very helpful for understanding what I mean, and it also sort of relies on me assuming that adamShimi meant the same thing I did. 😅 I'm not sure if it's a term used elsewhere.

What I mean is forcing the AI to have a specific ontology, such as things embedded in 3D space, so you can directly programmatically interface with the AI's ontology, rather than having to statistically train an interface (which would lead to problems with distribution shift and such).

[Intro to brain-like-AGI safety] 14. Controlled AGI

Proof strategy #1 starts with the idea that we live in a three-dimensional world containing objects and so on. We try to come up with an unambiguous definition of what those objects are, and from there we can have an unambiguous language for specifying what we want to happen in the world. We also somehow translate (or constrain) the AGI’s understanding of the world into that language, and now we can prove theorems about what the AGI is trying to do.

This is my tentative understanding of what John Wentworth is trying to do via his Natural Abstraction Hypothesis research program (most recent update here), and I’ve heard ideas in this vicinity from a couple other people as well.

I’m skeptical because a 3D world of localized objects seems to be an unpromising starting point for stating and proving useful theorems about the AGI’s motivations. After all, a lot of things that we humans care about, and that the AGI needs to care about, seem difficult to describe in terms of a 3D world of localized objects—consider the notion of “honesty”, or “solar cell efficiency”, or even “daytime”. 

This proof strategy is probably the one most characteristic of my approach. I've perhaps overstated its similarity to John Wentworth's 😅; I think much of his research is useful for my approach, but there are also many points of disagreement. But then, I suppose everyone finds his research ultra-promising.

A couple of notes:

I think even if my approach doesn't work out as the sole solution, it seems plausibly complementary to other approaches, including yours. For instance, if you don't do the sort of ontological lock that I'm advocating, then you tend to end up struggling with some basic symbol-reality distinction; e.g. you're likely to associate pictures of happy people with the concept of "happiness", so a happiness maximizer might end up tiling the world with pictures of happy people. My approach avoids that for free (though the flipside is that it would likely not consider e.g. ems to be people unless explicitly programmed to, but that could probably be achieved).

I think concepts like "solar cell efficiency" might be quite achievable to define with my approach. If you have a clean 3D ontology, you can isolate an object like a solar panel in that ontology, and then counterfactually ask how it would perform under various conditions. So you could ask: "How would this object perform if standard sunlight hit it under standard atmospheric conditions? How much power would it produce? Would it produce any problematic pollution?" You could be very precise about this.

... which is of course a curse as much as it is a blessing, e.g. you might not want a precise definition of "daytime", and it might not be possible for people to write down a precise definition of "honesty".
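As an illustration of the kind of counterfactual query I have in mind, here's a toy sketch. Everything here (class names, property fields, the "standard conditions" constant) is a hypothetical simplification, not a real implementation:

```python
from dataclasses import dataclass

# Hypothetical toy ontology: an object isolated from the 3D world-model,
# carrying physical properties we can query counterfactually.
@dataclass
class PanelModel:
    area_m2: float     # surface area of the isolated object
    efficiency: float  # fraction of incident power converted

# "Standard conditions" counterfactual: roughly full sunlight at sea level.
STANDARD_IRRADIANCE_W_PER_M2 = 1000.0

def counterfactual_power(panel: PanelModel,
                         irradiance: float = STANDARD_IRRADIANCE_W_PER_M2) -> float:
    """Power output (watts) if the given irradiance hit this object."""
    return panel.area_m2 * irradiance * panel.efficiency

panel = PanelModel(area_m2=2.0, efficiency=0.20)
print(counterfactual_power(panel))  # 400.0 W under standard sunlight
```

The point is only that once the object lives in a fixed, programmatically accessible ontology, "how would it perform under condition X?" becomes a direct query rather than something learned statistically.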

[Intro to brain-like-AGI safety] 9. Takeaways from neuro 2/2: On AGI motivation

Avoiding a weak wireheading drive seems quite tricky. Maybe we could minimize it using timing and priors (Section 9.3.3 above), but avoiding it altogether would, I presume, require special techniques—I vaguely imagine using some kind of interpretability technique to find the RPE / feeling good concept in the world-model, and manually disconnecting it from any Thought Assessors, or something like that.

Here's a hacky patch that doesn't entirely solve it, but might help:

Presumably for humans, the RPE/reward is somehow wired into the world-model, since we have a clear awareness of it. But you could simply not give it as an input to the AI's world-model in the first place.

As long as it doesn't start hacking into its own runtime and peeking at the variables, this means that it doesn't have a variable corresponding to its reward in its world-model, which would prevent it from wanting to use that variable for wireheading.

Of course this is unstable, so we probably wouldn't want to rely on that. The stable approach would be what we discussed in the other thread, of manually coding the value function. This would protect against wireheading in fundamentally the same way, though, by eliminating the need for a separate "reward" variable in the world-model.
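A minimal sketch of the wiring I'm describing, with all class and method names hypothetical: the reward signal updates the value estimate directly, but is never part of the observation stream the world-model sees, so no "reward" variable ever appears in the learned model.

```python
# Sketch of the patch (hypothetical names throughout): reward reaches the
# learning machinery but is withheld from the world-model's inputs.

class WorldModel:
    def __init__(self):
        self.history = []

    def update(self, observation):
        # Only external observations enter the model; reward is absent.
        self.history.append(observation)

class Agent:
    def __init__(self):
        self.world_model = WorldModel()
        self.value_estimate = 0.0

    def step(self, observation, reward):
        # Reward is routed straight to value learning...
        self.value_estimate += 0.1 * (reward - self.value_estimate)
        # ...but deliberately never fed to the world-model.
        self.world_model.update(observation)

agent = Agent()
agent.step(observation={"diamond_present": True}, reward=1.0)
# The world-model's contents contain no reward variable to wirehead on.
assert all("reward" not in obs for obs in agent.world_model.history)
```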

Prizes for ELK proposals

If I understand the problem statement correctly, I think I could take a stab at easier versions of the problem, but that the current formulation is too much to swallow in one bite. In particular I am concerned about the following parts:


We start with an unaligned benchmark:

* An architecture Mθ



To solve ELK in this case we must:

* Supply a modified architecture Mθ+ which has the same inputs and outputs as Mθ <snip>

Does this mean that the method needs to work for ~arbitrary architectures, and that the solution must use substantially the same architecture as the original?

except that after producing all other outputs it can answer a question Q in natural language

Does this mean that it must be able to deal with a broad variety of questions, so that we cannot simply sit down and think about how to optimize the model for getting a single question (e.g. "Where is the diamond?") right?

According to my current model of how these sorts of things work, such constraints make the problem fundamentally unsolvable, so I am not going to attempt it as stated, while loosening the constraints may make it solvable, in which case I might attempt it.

Formalizing Policy-Modification Corrigibility

The idea is that manipulation "overrides" the human policy regardless of whether that's good for the goal the human is pursuing (where the human goal presumably affects which human policy is selected). While here the override is baked into the dynamics, in realistic settings it occurs because the AI exploits the human decision-making process: by feeding them biased information, through emotional manipulation, etc.

I think this skips over the problems with baking it into the dynamics. Baking manipulation into the dynamics requires us to define manipulation; easy for toy examples, but in real-world applications it runs head-first into nearest unblocked strategy concerns; anything that you forget to define as manipulation is fully up for grabs.

This is why I prefer directly applying a counterfactual to the human policy in my proposal, to entirely override the possibility of manipulation. But that introduces its own difficulties, and is not easy to scale up beyond the stop button. I've had a post in the works for a while about the main difficulty I see with my approach here.

Formalizing Policy-Modification Corrigibility

I like this post, it seems to be the same sort of approach that I suggested here. However, your proposal seems to have a number of issues; some of which you've already discussed, some of which are handled in my proposal, and some of which I think are still open questions. Presumably a lot of it is just because it's still a toy model, but I wanted to point out some things.

Starting here:

Definition: Corrigibility, formal.

Let n be a time step which is greater than the starting time. The policy-modification corrigibility of the AI's policy from the starting state by time n is the maximum possible mutual information between the human policy and the AI's policy at time n:

(As I understand, the maximum ranges over all possible distributions of human policies? Otherwise I'm not sure how to parse it, and aspects of my comment might be confused/wrong.)
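On that reading, the quoted definition would presumably look something like the following (the notation is my reconstruction, not necessarily the OP's: D ranges over distributions of human policies, π_H ~ D, and π_A^n is the AI's policy at time n):

```latex
% Hypothetical reconstruction of the quoted definition:
\mathrm{Corrigibility}(s, n) \;=\; \max_{D} \; I\!\left(\pi_H ;\, \pi_A^{n}\right)
```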

Usually one would come up with these sorts of definitions in order to select on them. That is, one would incorporate corrigibility in a utility function in order to select a desired AI.

(Though on reflection, maybe that is not your plan, since e.g. your symmetry-based proofs can work for describing side-effects? Like the proofs that most goals favored power-seeking policies did not actually involve optimizing power-seekingness.)

However, this definition of corrigibility cannot immediately be incorporated into a utility function, as it depends on the time step n.

There are several possible ways to turn this into a utility function, with (I think?) two major axes of variation:

  1. Should we pick some specific constant n, or sum over all n?
  2. Human policies most likely are not accurately modelled as following a fixed policy, due to factors like memory. To the AI, this can look like the human policy changing over time, or similar. So that raises the question of whether it is only the starting policy that the AI should be corrigible to, or whether corrigibility should e.g. be expressed as a sum over time or something. (Neither option is great. Though obviously this is a toy model, so that may be expected.)

If the environment doesn't allow the human to reach or modify the AI, the AI is incorrigible. Conversely, in some environments there does not exist an incorrigible AI policy for reasonable n.

I think "reasonable n" is really hard to pin down. Consider locking the human in a box with a password-locked computer, where the computer contains full options for controlling the AI's policy. This only requires the human to enter the password, and then they have enormous influence over the AI. So this is highly corrigible, in a way. It is probably what we want to exclude from the definition, but that seems difficult.

Furthermore, this definition doesn't necessarily capture other kinds of corrigibility, such as "the AI will do what the human asks." Maximizing mutual information only means that the human has many cognitively accessible ways to modify the agent. This doesn't mean the AI does what the human asks. One way this could happen is if the AI implements the opposite of whatever the human specifies (e.g. the human-communicated policy goes left, the new AI policy goes right). Whether this is feasible depends on the bridging law, which is not controlled by either player.

I think this is a bigger problem with the proposal than it might look like?

Suppose the AI is trying to be corrigible in the way described in the post. This incentivizes it to find ways to let the human alter its policy. But if it allows changes that are too impactful, that could prevent it from further finding ways to let the human alter its policy. So it is incentivized to first allow changes to irrelevant cases, such as the AI's reaction to states that will never happen. Further, it doesn't have to be responsive to policies that the human would actually be likely to adopt, since corrigibility is defined by taking the maximum over distributions of human policies. Rather, it could pick a distribution of policies that humans would never engage in, such as policies that approximately (but far from totally) minimize human welfare. "I will do what you ask, as long as you enter my eternal torture chamber" would be highly corrigible by this definition. This sort of thing seems likely to be incentivized by this approach, because it reduces the likelihood that the corrigibility will become an obstacle to the AI's future actions.
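To illustrate why the maximum over human-policy distributions is worrying: mutual information only cares that the channel from human policy to AI policy is informative, not which human policies it is informative about. A toy calculation (policy names made up for illustration):

```python
import math

def mutual_information(joint):
    """I(X;Y) in bits for a discrete joint distribution {(x, y): p}."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return sum(p * math.log2(p / (px[x] * py[y]))
               for (x, y), p in joint.items() if p > 0)

# Two "channels" from human policy to resulting AI policy, each putting
# equal probability on two human policies.
# Channel A: responds faithfully to policies humans would plausibly use.
sensible = {("ask_stop", "stop"): 0.5, ("ask_go", "go"): 0.5}
# Channel B: only responds to policies humans would never adopt
# ("enter the torture chamber first"), but is still a perfect bijection.
perverse = {("enter_chamber_say_stop", "stop"): 0.5,
            ("enter_chamber_say_go", "go"): 0.5}

# Both achieve the same maximal mutual information of 1 bit, so the
# corrigibility measure cannot tell them apart.
print(mutual_information(sensible), mutual_information(perverse))
```

Since the definition maximizes over the distribution of human policies, the AI gets full credit for Channel B, despite it being useless (or worse) to any actual human.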

Also, it is not very viable to actually control the AI with corrigibility that depends on the mutual information with the AI's policy, because the policy is very far removed from the effects of the policy.

Corrigibility Can Be VNM-Incoherent

Actually upon thinking further I don't think this argument works, at least not as it is written right now.

Corrigibility Can Be VNM-Incoherent

Imagine that policies decompose into two components, π = (π₁, π₂). For instance, π₁ and π₂ may be different sets of parameters in a neural network. We can then talk about the effect of one of the components by considering how it influences the power/injectivity of the features with respect to the other component.

Suppose, for instance, that π₁ is such that the policy just ends up acting in a completely random-twitching way. Technically π₂ has a lot of effect too, in that it chaotically controls the pattern of the twitching, but in terms of the features, the policy is basically constant in π₂. This is a low-power situation, and if one actually specified what π₁ would be, then a TurnTrout-style argument could probably prove that such values of π₁ would be avoided for power-seeking reasons. On the other hand, if π₁ made the policy act like an optimizer which optimizes a utility function over the features, with the utility function being specified by π₂, then that would lead to a lot more power/injectivity.

On the other hand, I wonder if there's a limit to this style of argument. Too much noninjectivity would require crazy interaction effects to fill out the space in a Hilbert-curve-style way, which would be hard to optimize?

Corrigibility Can Be VNM-Incoherent

Since you can convert a utility function over states or observation-histories into a utility function over policies (well, as long as you have a model for measuring the utility of a policy), and since utility functions over states/observation-histories do exhibit instrumental convergence, yes, you are correct.

I feel like in a way, one could see the restriction to defining it in terms of e.g. states as a definition of "smart" behavior; if you define a reward in terms of states, then the policy must "smartly" generate those states, rather than just yield some sort of arbitrary behavior.

🤔 I wonder if this approach could generalize TurnTrout's approach. I'm not entirely sure how, but we might imagine that a structured utility function u over policies could be decomposed into u = r ∘ f, where f extracts the features that the utility function pays attention to, and r is the utility function expressed in terms of those features. E.g. for state-based rewards, one might take f to be a model that yields the distribution of states visited by the policy, and r to be the reward function for the individual states (some sort of modification would have to be made to address the fact that f outputs a distribution but r takes in a single state... I guess this could be handled by working in the category of vector spaces and linear transformations, though I'm not sure if that's the best approach in general; but since the original setting can be embedded into this category, it surely can't hurt too much).

Then the power-seeking situation boils down to this: the vast majority of policies π lead to essentially the same features f(π), but there is a small set of power-seeking policies that lead to a vastly greater range of different features. And so for most r, a π that optimizes/satisfices/etc. r ∘ f will come from this small set of power-seeking policies.

I'm not sure how to formalize this. I think it won't hold for generic vector spaces, since almost all linear transformations are invertible. But it seems to me that in reality, there's a great degree of non-injectivity. The idea of "chaos inducing abstractions" seems relevant, in the sense that parameter changes in the policy will mostly tend to lead to completely unpredictable/unsystematic/dissipated effects, and only partly tend to lead to predictable and systematic effects. If most of the effects are unpredictable/unsystematic, then f must be extremely non-injective, and this non-injectivity then generates power-seeking.

(Or does it? I guess you'd have to have some sort of interaction effect, where some parameters control the degree to which the function is injective with regards to other parameters. But that seems to hold in practice.)
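The non-injectivity story can be illustrated with a toy numerical sketch (everything here is made up for illustration): a feature map f that collapses almost all policies to a single feature, plus a small "power-seeking" set that reaches many distinct features. For most randomly sampled utility functions over features, the optimum then lands in the small set.

```python
import random

random.seed(0)

N_POLICIES = 1000
POWER_SEEKING = set(range(10))  # small set reaching many distinct features

def f(policy):
    """Hypothetical feature map: highly non-injective outside the small set."""
    if policy in POWER_SEEKING:
        return ("varied", policy)  # each reaches a distinct feature
    return ("default", 0)          # all other policies collapse together

def optimum_is_power_seeking(r):
    best = max(range(N_POLICIES), key=lambda p: r(f(p)))
    return best in POWER_SEEKING

# Sample many utility functions r over features; count how often the
# optimal policy lies in the small power-seeking set.
trials = 200
hits = 0
for _ in range(trials):
    values = {feat: random.random()
              for feat in {f(p) for p in range(N_POLICIES)}}
    if optimum_is_power_seeking(values.__getitem__):
        hits += 1
print(hits / trials)  # close to 10/11, since 10 of 11 reachable features
                      # belong to the power-seeking set
```

With 11 reachable features, 10 of which belong to the small set, a random r puts its maximum on a power-seeking feature about 10/11 ≈ 91% of the time, even though the set contains only 1% of policies.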

I'm not sure whether I've said anything new or useful.

Corrigibility Can Be VNM-Incoherent

🤔 I was about to say that I felt like my approach could still be done in terms of state rewards, and that it's just that my approach violates some of the technical assumptions in the OP. After all, you could just reward for being in a state such that the various counterfactuals apply when rolling out from this state; this would assign higher utility to the blue states than the red states, encouraging corrigibility, and contradicting TurnTrout's assumption that utility would be assigned solely based on the letter.

But then I realized that this introduces a policy dependence to the reward function; the way you roll out from a state depends on which policy you have. (Well, in principle; in practice some MDPs may not have much dependence on it.) The special thing about state-based rewards is that you can assign utilities to trajectories without considering the policy that generates the trajectory at all. (Which to me seems bad for corrigibility, since corrigibility depends on the reasons for the trajectories, and not just the trajectories themselves.)

But now consider the following: if you have the policy, you can figure out which actions were taken just by applying the policy to the state/history. And instrumental convergence does not apply to utility functions over action-observation histories. So it doesn't apply to utility functions over (policy, observation history) pairs either. (I think?? At least if the set of policies is closed under replacing an action under a specified condition, and there are no Newcomb-like issues that create non-causal dependencies between policies and observation histories.)

So a lot of the instrumental convergence power comes from restricting the things you can consider in the utility function. u-AOH is clearly too broad, since it allows assigning utilities to arbitrary sequences of actions with identical effects, and simultaneously u-AOH, u-OH, and ordinary state-based reward functions (can we call that u-S?) are all too narrow, since none of them allow assigning utilities to counterfactuals, which is required in order to phrase things like "humans have control over the AI" (as this is a causal statement and thus depends on the AI).

We could consider u-P, utility functions over policies. This is the most general sort of utility function (I think??), and as such it is also way way too general, just like u-AOH is. I think maybe what I should try to do is define some causal/counterfactual generalizations of u-AOH, u-OH, and u-S, which allow better behaved utility functions.
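To make the comparison concrete, here is one way to write down the classes being contrasted (my notation, not the original posts': A = actions, O = observations, S = states, Π = policies):

```latex
% Utility-function classes by domain (my notation):
\text{u-AOH}: \; U : (A \times O)^{\ast} \to \mathbb{R} \quad \text{(action-observation histories)}
\text{u-OH}:  \; U : O^{\ast} \to \mathbb{R}             \quad \text{(observation histories)}
\text{u-S}:   \; U : S \to \mathbb{R}                    \quad \text{(individual states, i.e. rewards)}
\text{u-P}:   \; U : \Pi \to \mathbb{R}                  \quad \text{(whole policies)}
```

On this picture, u-P contains all the others (via the conversions discussed above), which is exactly why it is too general, while none of the first three can see counterfactuals.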
