I'd like to get different answers in those two worlds. That definitely requires having some term in the loss that is different in W1 and W2. There are three ways the kinds of proposals in the doc can handle this:
"The goal is" -- is this describing Redwood's research or your research or a goal you have more broadly?
My general goal, Redwood's current goal, and my understanding of the goal of adversarial training (applied to AI-murdering-everyone) generally.
I'm curious how this is connected to "doesn't write fiction where a human is harmed".
"Don't produce outputs where someone is injured" is just an arbitrary thing not to do. It's chosen to be fairly easy not to do (and to have the right valence so that you can easily remember which direction is good and which direction is bad, though in retrospect I think it's plausible that a predicate with neutral valence would have been better to avoid confusion).
The goal is not to remove concepts or change what the model is capable of thinking about, it's to make a model that never tries to deliberately kill everyone. There's no doubt that it could deliberately kill everyone if it wanted to.
I'm not sure I follow your reasoning, but IBP sort of does that. In IBP we don't have subjective expectations per se, only an equation for how to "updatelessly" evaluate different policies.
It seems like any approach that evaluates policies based on their consequences is fine, isn't it? That is, malign hypotheses dominate the posterior for my experiences, but not for things I consider morally valuable.
I may just not be understanding the proposal for how the IBP agent differs from the non-IBP agent. It seems like we are discussing a version that defines values differently, but where neither agent uses Solomonoff induction directly. Is that right?
Sure. But it becomes much more amenable to methods such as confidence thresholds, which are applicable to some alignment protocols at least.
It seems like you have to get close to eliminating malign hypotheses in order to apply such methods (i.e. they don't work once malign hypotheses have > 99.9999999% of probability, so you need to ensure that benign hypothesis description is within 30 bits of the good hypothesis), and embededness alone isn't enough to get you there.
I'm not sure I understand what you mean by "decision-theoretic approach"
I mean that you have some utility function, are choosing actions based on E[utility|action], and perform solomonoff induction only instrumentally because it suggests ways in which your own decision is correlated with utility. There is still something like the universal prior in the definition of utility, but it no longer cares at all about your particular experiences (and if you try to define utility in terms of solomonoff induction applied to your experiences, e.g. by learning a human, then it seems again vulnerable to attack bridging hypotheses or no).
This seems wrong to me. The inductor doesn't literally simulate the attacker. It reasons about the attacker (using some theory of metacosmology) and infers what the attacker would do, which doesn't imply any wastefulness.
I agree that the situation is better when solomonoff induction is something you are reasoning about rather than an approximate description of your reasoning. In that case it's not completely pathological, but it still seems bad in a similar way to reason about the world by reasoning about other agents reasoning about the world (rather than by direct learning the lessons that those agents have learned and applying those lessons in the same way that those agents would apply them).
Infra-Bayesian physicalism does ameliorate the problem by handling "embededness". Specifically, it ameliorates it by removing the need to have bridge rules in your hypotheses. This doesn't get rid of malign hypotheses entirely, but it does mean they no longer have an astronomical advantage in complexity over the true hypothesis.
I agree that removing bridge hypotheses removes one of the advantages for malign hypotheses. I didn't mention this because it doesn't seem like the way in which john is using "embededness;" for example, it seems orthogonal to the way in which the situation violates the conditions for solomonoff induction to be eventually correct. I'd stand by saying that it doesn't appear to make the problem go away.
That said, it seems to me like you basically need to take a decision-theoretic approach to have any hope of ruling out malign hypotheses (since otherwise they also get big benefits from the influence update). And then once you've done that in a sensible way it seems like it also addresses any issues with embededness (though maybe we just want to say that those are being solved inside the decision theory). If you want to recover the expected behavior of induction as a component of intelligent reasoning (rather than a component of the utility function + an instrumental step in intelligent reasoning) then it seems like you need a more different tack.
Can you elaborate on this? Why is it unlikely in realistic cases, and what other reason do we have to avoid the "messed up situation"?
If your inductor actually finds and runs a hypothesis much smarter than you, then you are doing a terrible job ) of using your resources, since you are trying to be ~as smart as you can using all of the available resources. If you do the same induction but just remove the malign hypotheses, then it seems like you are even dumber and the problem is even worse viewed from the competitiveness perspective.
My guess is that "help humans improve their understanding" doesn't work anyway, at least not without a lot of work, but it's less obvious and the counterexamples get weirder.
It's less clear whether ELK is a less natural subproblem for the unlimited version of the problem. That is, if you try to rely on something like "human deliberation scaled up" to solve ELK, you probably just have to solve the whole (unlimited) problem along the way.
It seems to me like the core troubles with this point are:
I'm generally interested in crisper counterexamples since those are a bit of a mess.
In some sense, ELK as a problem only even starts "applying" to pretty smart models (ones who can talk including about counterfactuals / hypotheticals, as discussed in this appendix.) This is closely related to how alignment as a problem only really starts applying to models smart enough to be thinking about how to pursue a goal.
I think that it's more complicated to talk about what models "really know" as they get dumber, so we want to use very smart models to construct unambiguous counterexamples. I do think that the spirit of the problem applies even to very tiny models, and those are likely interesting.
(More precisely: it's always extremely subtle to talk about what models "know," but as models get smarter there are many more things that they definitely know so it's easier to notice if you are definitely failing. And the ELK problem statement in this doc is really focused on this kind of unambiguous failure, mostly as a methodological point but also partly because the cases where AI murders you also seems to involve "definitely knowing" in the same sense.)
I think my take is that for linear/logistic regression there is no latent knowledge, but even for a fully linear 3 layer neural network, or a 2 layer network solving many related problems, there is latent knowledge and an important conceptual question about what it means to "know what they know."
The proposal here is to include a term in the loss function that incentivizes the AI to have a human-compatible ontology. For a cartoonish example, imagine that the term works this way: "The AI model gets a higher score to the degree that people doing 'digital neuroscience' would have an easier time, and find more interesting things, probing its 'digital brain.'" So an AI with neurons corresponding to diamonds, robbers, sensors, etc. would outscore an AI whose neurons can't easily be seen to correspond to any human-familiar concepts.
I think that a lot depends on what kind of term you include.
If you just say "find more interesting things" then the model will just have a bunch of neurons designed to look interesting. Presumably you want them to be connected in some way to the computation, but we don't really have any candidates for defining that in a way that does what you want.
In some sense I think if the digital neuroscientists are good enough at their job / have a good enough set of definitions, then this proposal might work. But I think that the magic is mostly being done in the step where we make a lot of interpretability progress, and so if we define a concrete version of interpretability right now it will be easy to construct counterexamples (even if we define it in terms of human judgments). If we are just relying on the digital neuroscientists to think of something clever, the counterexample will involve something like "they don't think of anything clever." In general I'd be happy to talk about concrete proposals along these lines.
(I agree with Ajeya and Mark that the hard case for this kind of method is when the most efficient way of thinking is totally alien to the human. I think that can happen, and in that case in order to be competitive you basically just need to learn an "interpreted" version of the alien model. That is, you need to basically show that if there exists an alien model with performance X, there is a human-comprehensible model with performance X, and the only way you'll be able to argue that for any model we can define a human-comprehensible model with similar complexity and the same behavior.)
Generally we are asking for an AI that doesn't give an unambiguously bad answer, and if there's any way of revealing the facts where we think a human would (defensibly) agree with the AI, then probably the answer isn't unambiguously bad and we're fine if the AI gives it.
There are lots of possible concerns with that perspective; probably the easiest way to engage with them is to consider some concrete case in which a human might make different judgments, but where it's catastrophic for our AI not to make the "correct" judgment. I'm not sure what kind of example you have in mind and I have somewhat different responses to different kinds of examples.
For example, note that ELK is never trying to answer any questions of the form "how good is this outcome?"; I certainly agree that there can also be ambiguity about questions like "did the diamond stay in the room?" but it's a fairly different situation. The most relevant sections are narrow elicitation and why it might be sufficient which gives a lot of examples of where we think we can/can't tolerate ambiguity, and to a lesser extent avoiding subtle manipulation which explains how you might get a good outcome despite tolerating such ambiguity. That said, there are still lots of reasonable objections to both of those.