Agenty things have the type signature (A -> B) -> A. In English: agenty things have some model (A -> B) which predicts the results (B) of their own actions (A). They use that model to decide what actions to perform: (A -> B) -> A.
In the context of causal DAGs, the model (A -> B) would itself be a causal DAG model - i.e. some Python code defining the DAG. Logically, we can represent it as:
… for some given distribution functions.
From an outside view, the model (A -> B) causes the choice of action A. ... (Read more)
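To make the type signature concrete, here is a minimal Python sketch (an illustration of the idea, not code from the post); the `actions` and `score` arguments are assumptions added for the sake of runnability, since the bare signature (A -> B) -> A abstracts them away:

```python
from typing import Callable, Iterable, TypeVar

A = TypeVar("A")
B = TypeVar("B")

def agent(model: Callable[[A], B],
          actions: Iterable[A],
          score: Callable[[B], float]) -> A:
    """(A -> B) -> A: use a model of one's own actions to pick an action.

    The agent consumes a model predicting the result B of each action A,
    and returns the action whose predicted result scores best."""
    return max(actions, key=lambda a: score(model(a)))

# Usage: the model predicts the payoff of each action; the agent picks
# the action with the best predicted outcome.
predicted_payoff = {"work": 100, "rest": 0}.__getitem__
print(agent(predicted_payoff, ["work", "rest"], score=float))  # -> "work"
```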
tl;dr: there is no natural category called "wireheading", only wireheading relative to some desired ideal goal.
Suppose that we have built an AI, and have invited a human H to help test it. The human H is supposed to press a button B if the AI seems to be behaving well. The AI's reward is entirely determined by whether H presses B or not.
So the AI manipulates or tricks H into pressing B. A clear case of the AI wireheading itself.
Or is it? Suppose H was a meddlesome government inspector that we wanted to keep away from our research. Then we want H to press B, so we can get them out of our hair... (Read more)
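A toy sketch of the point (mine, not the post's): whether obtaining the reward counts as wireheading is only defined relative to an ideal goal specified separately from the reward channel itself:

```python
def is_wireheading(button_pressed: bool, ideal_goal_satisfied: bool) -> bool:
    """"Wireheading" relative to a desired ideal goal: the reward channel
    fires (B is pressed) while the ideal goal goes unsatisfied."""
    return button_pressed and not ideal_goal_satisfied

# Same event, different verdicts. If H is an honest tester, a tricked button
# press leaves the ideal goal ("the AI behaves well") unsatisfied: wireheading.
print(is_wireheading(button_pressed=True, ideal_goal_satisfied=False))  # True

# If H is a meddlesome inspector, the ideal goal ("keep H out of our hair")
# is satisfied by the very same press: not wireheading.
print(is_wireheading(button_pressed=True, ideal_goal_satisfied=True))   # False
```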
Note: after putting this online, I noticed several problems with my original framing of the arguments. While I don't think they invalidated the overall conclusion, they did (ironically enough) make the post much less coherent. The version below has been significantly edited in an attempt to alleviate these issues.
Rohin Shah has recently criticised Eliezer’s argument that “sufficiently optimised agents appear coherent”, on the grounds that any behaviour can be rationalised as maximisation of the expectation of some utility function. In this post I dig deeper into this disagreement, conclud... (Read more)
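For reference, the construction behind the "any behaviour can be rationalised" claim can be sketched in a few lines of Python (my illustration, not code from either post):

```python
from typing import Callable, Dict

State = str
Action = str

def rationalising_utility(policy: Dict[State, Action]) -> Callable[[State, Action], float]:
    """Utility of 1 for exactly the (state, action) pairs the policy chooses.

    An expected-utility maximiser of this function reproduces the policy
    exactly, however incoherent the policy looked to begin with."""
    return lambda s, a: 1.0 if policy.get(s) == a else 0.0

# Usage: even a "circular preference" policy is a perfect maximiser of its
# own rationalising utility function.
circular = {"has_A": "trade_for_B", "has_B": "trade_for_C", "has_C": "trade_for_A"}
u = rationalising_utility(circular)
print(u("has_A", "trade_for_B"), u("has_A", "trade_for_C"))  # 1.0 0.0
```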
When I say an AI A is aligned with an operator H, I mean:
A is trying to do what H wants it to do.
The “alignment problem” is the problem of building powerful AI systems that are aligned with their operators.
This is significantly narrower than some other definitions of the alignment problem, so it seems important to clarify what I mean.
In particular, this is the problem of getting your AI to try to do the right thing, not the problem of figuring out which thing is right. An aligned AI would try to figure out which thing is right, and like a human it may or may not succeed.
Consider a human... (Read more)
Many approaches to AI alignment require making assumptions about what humans want. On a first pass, it might appear that inner alignment is a sub-component of AI alignment that doesn't require making these assumptions. This is because if we define the problem of inner alignment to be the problem of how to train an AI to be aligned with arbitrary reward functions, then a solution would presumably have no dependence on any particular reward function. We could imagine an alien civilization solving the same problem, despite using very different reward functions to train their AIs.
Unfortunatel... (Read more)
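A minimal sketch of that first-pass framing (my own illustration; the names are made up): a trainer that is generic over the reward function, with no dependence on which particular one is passed in:

```python
from typing import Callable, List

Outcome = str
Reward = Callable[[Outcome], float]

def train(reward: Reward, candidate_policies: List[Callable[[], Outcome]]):
    """Select the policy whose outcome the given reward function rates highest.

    Nothing here depends on which reward function is supplied: an alien
    civilization could reuse the same trainer with very different rewards."""
    return max(candidate_policies, key=lambda pi: reward(pi()))

# Usage with two different civilizations' reward functions:
policies = [lambda: "make_paperclips", lambda: "make_art"]
human_reward = lambda o: 1.0 if o == "make_art" else 0.0
alien_reward = lambda o: 1.0 if o == "make_paperclips" else 0.0
print(train(human_reward, policies)())  # -> make_art
print(train(alien_reward, policies)())  # -> make_paperclips
```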
Epistemic status: I expect some people to say "this is obvious and trivial", and others to say "this makes no sense at all".
One fundamental difference between E.T. Jaynes’ probability textbook and most others is the emphasis on including our model in the prior information. When we write an expression like P(A|B), that’s really shorthand for P(A|B,M), where M is some probabilistic model in which A and B appear. In practice, I’ve found this useful for two main use-cases: model comparison, and interventions/counterfactuals in causal models. This po... (Read more)
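As a concrete instance of the model-comparison use-case, here is a small worked example (mine, not from the post) that makes the model M explicit and computes P(M|data) for two competing coin models via Bayes' rule:

```python
from math import prod

data = [1, 1, 1, 0, 1, 1]  # 1 = heads

def likelihood(p_heads: float, data) -> float:
    """P(data | M) for a model M asserting a fixed heads probability."""
    return prod(p_heads if x else 1 - p_heads for x in data)

models = {"fair (p=0.5)": 0.5, "biased (p=0.8)": 0.8}
prior = {name: 0.5 for name in models}  # equal prior credence in each model

# P(data) = sum over models of P(M) * P(data | M)
evidence = sum(prior[m] * likelihood(p, data) for m, p in models.items())

# P(M | data) = P(M) * P(data | M) / P(data)
for m, p in models.items():
    posterior = prior[m] * likelihood(p, data) / evidence
    print(f"P({m} | data) = {posterior:.3f}")
# -> P(fair (p=0.5) | data) = 0.192
# -> P(biased (p=0.8) | data) = 0.808
```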