Recent Discussion

Agenty things have the type signature (A -> B) -> A. In English: agenty things have some model (A -> B) which predicts the results (B) of their own actions (A). They use that model to decide what actions to perform: (A -> B) -> A.

In the context of causal DAGs, the model (A -> B) would itself be a causal DAG model - i.e. some Python code defining the DAG. Logically, we can represent it as:

… for some given distribution functions and .

From an outside view, the model (A -> B) causes the choice of action A. ... (Read more)

2Steve Byrnes5hWhy aren't you notationally distinguishing between "actual model" versus "what the agent believes the model to be"? Or are you and I missed it?

Good question. We could easily draw a diagram in which the two are separate - we'd have the "agent" node reading from one cloud and then influencing things outside of that cloud. But that case isn't very interesting - most of what we call "agenty" behavior, and especially the diagonalization issues, are about the case where the actual model and the agent's beliefs coincide. In particular, if we're talking about ideal game-theoretic agents, we usually assume that both the rules of the game and each agent's strate... (Read more)(Click to expand thread. ⌘/CTRL+F to Expand All)Cmd/Ctrl F to expand all comments on this post

tl;dr: there is no natural category called "wireheading", only wireheading relative to some desired ideal goal.

Suppose that we have a built an AI, and have invited a human H to help test it. The human H is supposed to press a button B if the AI seems to be behaving well. The AI's reward is entirely determined by whether H presses B or not.

So the AI manipulates or tricks H into pressing B. A clear case of the AI wireheading itself.

Or is it? Suppose H was a meddlesome government inspector that we wanted to keep away from our research. Then we want H to press B, so we can get them our of our hair

... (Read more)
1Tom Everitt1dIs this analogous to the stance-dependency of agents [] and intelligence []?

It is analogous, to some extent; I do look into some aspect of Daniel Dennett's classification here:

I also had a more focused attempt at defining AI wireheading here:

I think you've already seen that?

Note: after putting this online, I noticed several problems with my original framing of the arguments. While I don't think they invalidated the overall conclusion, they did (ironically enough) make the post much less coherent. The version below has been significantly edited in an attempt to alleviate these issues.

Rohin Shah has recently criticised Eliezer’s argument that “sufficiently optimised agents appear coherent”, on the grounds that any behaviour can be rationalised as maximisation of the expectation of some utility function. In this post I dig deeper into this disagreement, conclud... (Read more)

Just a note that in the link that Wei Dai provides for "Relevant powerful agents will be highly optimized", Eliezer explicitly assigns '75%' to 'The probability that an agent that is cognitively powerful enough to be relevant to existential outcomes, will have been subject to strong, general optimization pressures.'

even if he doesn't it seems like a common implicit belief in the rationalist AI safety crowd and should be debunked anyway.


Clarifying "AI Alignment"
191y3 min readShow Highlight

When I say an AI A is aligned with an operator H, I mean:

A is trying to do what H wants it to do.

The “alignment problem” is the problem of building powerful AI systems that are aligned with their operators.

This is significantly narrower than some other definitions of the alignment problem, so it seems important to clarify what I mean.

In particular, this is the problem of getting your AI to try to do the right thing, not the problem of figuring out which thing is right. An aligned AI would try to figure out which thing is right, and like a human it may or may not succeed.


Consider a human... (Read more)

3Rohin Shah4dAgreed that this is in theory possible, but it would be quite surprising, especially if we are specifically aiming to train systems that behave corrigibly. If Alpha can predict that the user would say not to do the irreversible action, then at the very least it isn't corrigible, and it would be rather hard to argue that it is intent aligned. That, or it could depend on the agent's counterfactual behavior in other situations. I agree it can't be just the action chosen in the particular state. I guess you wouldn't count universality []. Overall I agree. I'm relatively pessimistic about mathematical formalization. (Probably not worth debating this point; feels like people have talked about it at length in Realism about rationality [] without making much progress.) I do want to note that all of these require you to make assumptions of the form, "if there are traps, either the user or the agent already knows about them" and so on, in order to avoid no-free-lunch theorems.
1Vanessa Kosoy4dThe acausal attack is an example of how it can happen for systematic reasons. As for the other part, that seems like conceding that intent-alignment is insufficient and you need "corrigibility" as another condition (also it is not so clear to me what this condition means). It is possible that Alpha cannot predict it, because in Beta-simulation-world the user would confirm the irreversible action. It is also possible that the user would confirm the irreversible action in the real world because the user is being manipulated, and whatever defenses we put in place against manipulation are thrown off by the simulation hypothesis. Now, I do believe that if you set up the prior correctly then it won't happen, thanks to a mechanism like: Alpha knows that in case of dangerous uncertainty it is safe to fall back on some "neutral" course of action plus query the user (in specific, safe, ways). But this exactly shows that intent-alignment is not enough and you need further assumptions. Besides the fact ascription universality is not formalized, why is it equivalent to intent-alignment? Maybe I'm missing something. I am curious whether you can specify, as concretely as possible, what type of mathematical result would you have to see in order to significantly update away from this opinion. No, I make no such assumption. A bound on subjective regret ensures that running the AI is a nearly-optimal strategy from the user's subjective perspective. It is neither needed nor possible to prove that the AI can never enter a trap. For example, the AI is immune to acausal attacks to the extent that the user beliefs that the AI is not inside Beta's simulation. On the other hand, if the user beliefs that the simulation hypothesis needs to be taken into account, then the scenario amounts to legitimate acausal bargaining (which has its own complications to do with decision/game theory, but that's mostly a separate concern).
2Rohin Shah3dSorry, that's right. Fwiw, I do think subjective regret bounds are significantly better than the thing I meant by definition-optimization. Why doesn't this also apply to subjective regret bounds? My guess at your answer is that Alpha wouldn't take the irreversible action as long as the user believes that Alpha is not in Beta-simulation-world. I would amend that to say that Alpha has to know that [the user doesn't believe that Alpha is in Beta-simulation-world]. But if Alpha knows that, then surely Alpha can predict that the user would not confirm the irreversible action? It seems like for subjective regret bounds, avoiding this scenario depends on your prior already "knowing" that the user thinks that Alpha is not in Beta-simulation-world (perhaps by excluding Beta-simulations). If that's true, you could do the same thing with intent alignment / corrigibility. It isn't equivalent to intent alignment; but it is meant to be used as part of an argument for safety, though I guess it could be used in definition-optimization too, so never mind. That is hard to say. I would want to have the reaction "oh, if I built that system, I expect it to be safe and competitive". Most existing mathematical results do not seem to be competitive, as they get their guarantees by doing something that involves a search over the entire hypothesis space. I could also imagine being pretty interested in a mathematical definition of safety that I thought actually captured "safety" without "passing the buck". I think subjective regret bounds and CIRL both make some progress on this, but somewhat "pass the buck" by requiring a well-specified hypothesis space for rewards / beliefs / observation models. Tbc, I also don't think intent alignment will lead to a mathematical formalization I'm happy with -- it "passes the buck" to the problem of defining what "trying" is, or what "corrigibility" is.
1Vanessa Kosoy1dIn order to get a subjective regret bound you need to consider an appropriate prior. The way I expect it to work is, the prior guarantees that some actions are safe in the short-term: for example, doing nothing to the environment and asking only sufficiently quantilized queries from the user (see this [] for one toy model of how "safe in the short-term" can be formalized). Therefore, Beta cannot attack with a hypothesis that will force Alpha to act without consulting the user, since that hypothesis would fall outside the prior. Now, you can say "with the right prior intent-alignment also works". To which I answer, sure, but first it means that intent-alignment is insufficient in itself, and second the assumptions about the prior are doing all the work. Indeed, we can imagine that the ontology on which the prior is defined includes a "true reward" symbol s.t., by definition, the semantics is whatever the user truly wants. An agent that maximizes expected true reward then can be said to be intent-aligned. If it's doing something bad from the user's perspective, then it is just an "innocent" mistake. But, unless we bake some specific assumptions about the true reward into the prior, such an agent can be anything at all. This is related to what I call the distinction between "weak" and "strong feasibility". Weak feasibility means algorithms that are polynomial time in the number of states and actions, or the number of hypotheses. Strong feasibility is supposed to be something like, polynomial time in the description length of the hypothesis. It is true that currently we only have strong feasibility results for relatively simple hypothesis spaces (such as, support vector machines). But, this seems to me just a symptom of advances in heuristics outpacing the theory. I don't see any reason of principle that significantly limits the strong feasibility results we can expect. Ind
To which I answer, sure, but first it means that intent-alignment is insufficient in itself, and second the assumptions about the prior are doing all the work.

I completely agree with this, but isn't this also true of subjective regret bounds / definition-optimization? Like, when you write (emphasis mine)

Therefore, Beta cannot attack with a hypothesis that will force Alpha to act without consulting the user, since that hypothesis would fall outside the prior.

Isn't the assumption about the prior "doing all the work"?

Maybe your point is th... (Read more)(Click to expand thread. ⌘/CTRL+F to Expand All)Cmd/Ctrl F to expand all comments on this post

Many approaches to AI alignment require making assumptions about what humans want. On a first pass, it might appear that inner alignment is a sub-component of AI alignment that doesn't require making these assumptions. This is because if we define the problem of inner alignment to be the problem of how to train an AI to be aligned with arbitrary reward functions, then a solution would presumably have no dependence on any particular reward function. We could imagine an alien civilization solving the same problem, despite using very different reward functions to train their AIs.

Unfortunatel... (Read more)

Suppose that in building the AI, we make an explicitly computable hardcoded value function. For instance, if you want the agent to land between the flags, you might write an explicit, hardcoded function that returns 1 if between a pair of yellow triangles, else 0.

In the process of standard machine learning as normal, information is lost because you have a full value function but only train the network using the evaluation of the function at a finite number of points.

Suppose I don't want the lander to land on the astronaut, who is wearing a blue spac... (Read more)(Click to expand thread. ⌘/CTRL+F to Expand All)Cmd/Ctrl F to expand all comments on this post

4Alex Turner2dI (low-confidence) think that there might be a "choose two" wrt impact measures: large effect, no ontology, no/very limited value assumptions. I see how we might get small good effects without needing a nice pre-specified ontology or info. about human values (AUP; to be discussed in upcoming Reframing Impact [] posts). I also see how you might have a catastrophe-avoiding agent capable of large positive impacts, assuming an ontology but without assuming a lot about human preferences. I know this isn't saying why I think this yet, but I'd just like to register this now for later discussion.
3Matthew Barnett2dI find this interesting but I'd be surprised if it were true :). I look forward to seeing it in the upcoming posts. That said, I want to draw your attention to my definition of catastrophe, which I think is different than the way most people use the term. I think most broadly, you might think of a catastrophe as something that we would never want to happen even once. But for inner alignment, this isn't always helpful, since sometimes we want our systems to crash into the ground rather than intelligently optimizing against us, even if we never want them to crash into the ground even once. And as a starting point, we should try to mitigate these malicious failures much more than the benign ones, even if a benign failure would have a large value-neutral impact. A closely related notion to my definition is the term "unacceptable behavior" as Paul Christiano has used it. This is the way he has defined it [], It seems like if we want to come up with a way to avoid these types of behavior, we simply must use some dependence on human values. I can't see how to consistently separate acceptable failures from non-acceptable ones except by inferring our values.
5Alex Turner2dI think people should generally be a little more careful about saying "this requires value-laden information". First, while a certain definition may seem to require it, there may be other ways of getting the desired behavior, perhaps through reframing. Building an AI which only does small things should not require the full specification of value, even though it seems like you have to say "don't do all these bad things we don't like"! Second, it's always good to check "would this style of reasoning lead me to conclude solving the easy problem of wireheading [] is value-laden?": This isn't an object-level critique of your reasoning in this post, but more that the standard of evidence is higher for this kind of claim.

Epistemic status: I expect some people to say "this is obvious and trivial", and others to say "this makes no sense at all".

One fundamental difference between E.T. Jaynes’ probability textbook and most others is the emphasis on including our model in the prior information. When we write an expression like , that’s really shorthand for , where is some probabilistic model in which and appear. In practice, I’ve found this useful for two main use-cases: model comparison, and interventions/counterfactuals in causal models. This po... (Read more)

Load More