All of Vanessa Kosoy's Comments + Replies

Testing The Natural Abstraction Hypothesis: Project Update

Two questions:

  • What exactly is the #P-complete problem you ran into?
  • What is the precise mathematical statement of the "Telephone Theorem"? I couldn't find it in the linked post.
2johnswentworth1dThe #P-complete problem is to calculate the distribution of some variables in a Bayes net given some other variables in the Bayes net, without any particular restrictions on the net or on the variables chosen. Formal statement of the Telephone Theorem: We have a sequence of Markov blankets forming a Markov chain M_1 → M_2 → …. Then in the limit n → ∞, f_n(M_n) mediates the interaction between M_1 and M_n (i.e. the distribution factors according to M_1 → f_n(M_n) → M_n), for some f_n satisfying f_n(M_n) = f_{n+1}(M_{n+1}) with probability 1 in the limit.
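For readability, here is the quoted statement in display form (a restatement of the reply above; reading "mediates" as conditional independence is my gloss, not something added by the author):

\[
M_1 \to M_2 \to \cdots, \qquad \Pr\big[f_n(M_n) = f_{n+1}(M_{n+1})\big] \xrightarrow{\;n\to\infty\;} 1, \qquad M_1 \perp M_n \mid f_n(M_n) \ \text{in the limit.}
\]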
Vanessa Kosoy's Shortform

I think you misunderstood how iterated quantilization works. It does not work by the AI setting a long-term goal and then charting a path towards that goal s.t. it doesn't deviate too much from the baseline over every short interval. Instead, every short-term quantilization is optimizing for the user's evaluation at the end of this short-term interval.
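A minimal sketch of what a single short-term quantilization step looks like, assuming we can sample short-horizon plans from a base policy and score them with the user's end-of-interval evaluation; the function and parameter names here are illustrative, not from the post:

```python
import random

def quantilize_step(sample_base_plan, user_evaluation, q=0.1, n_samples=1000):
    """One short-term quantilization step: sample short-horizon plans from the
    base policy, score each by the user's evaluation at the end of the interval,
    and pick uniformly at random from the top q fraction."""
    plans = [sample_base_plan() for _ in range(n_samples)]
    plans.sort(key=user_evaluation, reverse=True)
    top = plans[: max(1, int(q * n_samples))]
    return random.choice(top)

# Iterated use: each interval re-runs this step against the user's evaluation at
# the end of *that* interval, rather than charting a path to a fixed long-term goal.
```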

1Charlie Steiner1dAh. I indeed misunderstood, thanks :) I'd read "short-term quantilization" as quantilizing over short-term policies evaluated according to their expected utility. My story doesn't make sense if the AI is only trying to push up the reported value estimates (though that puts a lot of weight on these estimates).
Vanessa Kosoy's Shortform

When I'm deciding whether to run an AI, I should be maximizing the expectation of my utility function w.r.t. my belief state. This is just what it means to act rationally. You can then ask, how is this compatible with trusting another agent smarter than myself?

One potentially useful model is: I'm good at evaluating and bad at searching (after all, P ≠ NP). I can therefore delegate searching to another agent. But, as you point out, this doesn't account for situations in which I seem to be bad at evaluating. Moreover, if the AI prior takes an intentional stanc... (read more)

1Charlie Steiner1dAgree with the first section, though I would like to register my sentiment that although "good at selecting but missing logical facts" is a better model, it's still not one I'd want an AI to use when inferring my values. I think my point is that if "turn off the stars" is not a primitive action, but is a set of states of the world that the AI would overwhelmingly like to go to, then the actual primitive actions will get evaluated based on how well they end up going to that goal state. And since the AI is better at evaluating than us, we're probably going there. Another way of looking at this claim is that I'm telling a story about why the safety bound on quantilizers gets worse when quantilization is iterated. Iterated quantilization has much worse bounds than quantilizing over the iterated game, which makes sense if we think of games where the AI evaluates many actions better than the human.
The theory-practice gap

There is a case that aligned AI doesn't have to be competitive with unaligned AI, it just has to be much better than humans at alignment research. Because, if this holds, then we can delegate the rest of the problem to the AI.

Where it might fail is: it takes so much work to solve the alignment problem that even that superhuman aligned AI will not do it in time to build the "next stage" aligned AI (i.e. before the even-more-superhuman unaligned AI is deployed). In this case, it might be advantageous to have mere humans making extra progress in alignment betw... (read more)

4Buck Shlegeris4dYeah, I talk about this in the first bullet point here [https://www.alignmentforum.org/posts/tmWMuY5HCSNXXZ9oq/buck-s-shortform?commentId=BznGTJ3rGHLMcdbEB] (which I linked from the "How useful is it..." section).
Progress on Causal Influence Diagrams

IIUC, in a multi-agent influence model, every subgame perfect equilibrium is also a subgame perfect equilibrium in the corresponding extensive form game, but the converse is false in general. Do you know whether at least one subgame perfect equilibrium exists for any MAIM? I couldn't find it in the paper.

Information At A Distance Is Mediated By Deterministic Constraints

So, your thesis is, only exponential models give rise to nice abstractions? And, since it's important to have abstractions, we might just as well have our agents reason exclusively in terms of exponential models?

3johnswentworth10dMore like: exponential family distributions are a universal property of information-at-a-distance in large complex systems. So, we can use exponential models without any loss of generality when working with information-at-a-distance in large complex systems. That's what I hope to show, anyway.
Information At A Distance Is Mediated By Deterministic Constraints

I'm still confused. What direction of GKPD do you want to use? It sounds like you want to use the low-dimensional statistic => exponential family direction. Why? What is good about some family being exponential?

2johnswentworth10dYup, that's the direction I want. If the distributions are exponential family, then that dramatically narrows down the space of distributions which need to be represented in order to represent abstractions in general. That means much simpler data structures - e.g. feature functions and Lagrange multipliers, rather than whole distributions.
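To make "feature functions and Lagrange multipliers, rather than whole distributions" concrete, here is a small numpy sketch (my own toy example, not from the post): an exponential-family distribution over a finite domain is stored as a handful of multipliers, and the multipliers can be fit to match target feature expectations.

```python
import numpy as np
from scipy.optimize import minimize

# Toy domain: distributions over {0, ..., 5}, stored not as a full probability
# table but as feature functions plus Lagrange multipliers.
xs = np.arange(6)
features = np.stack([xs, xs ** 2]).astype(float)   # f_1(x) = x, f_2(x) = x^2

def exp_family(lam):
    """P(x) proportional to exp(sum_i lam_i * f_i(x))."""
    logits = lam @ features
    p = np.exp(logits - logits.max())
    return p / p.sum()

def fit_multipliers(target_means):
    """Find multipliers whose distribution matches the target feature means
    (equivalently, the maximum-entropy distribution with those expectations)."""
    def moment_gap(lam):
        p = exp_family(lam)
        return np.sum((features @ p - target_means) ** 2)
    return minimize(moment_gap, np.zeros(len(features))).x

lam = fit_multipliers(np.array([2.0, 5.0]))        # target E[x] = 2, E[x^2] = 5
print("multipliers:", lam)
print("distribution:", exp_family(lam))
```

The point of the data structure: two numbers (the multipliers) plus the known feature functions pin down the whole distribution, instead of storing all six probabilities explicitly.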
Information At A Distance Is Mediated By Deterministic Constraints

Can you explain how the generalized KPD fits into all of this? KPD is about estimating the parameters of a model from samples via a low dimensional statistic, whereas you are talking about estimating one part of a sample from another (distant) part of the sample via a low dimensional statistic. Are you using KPD to rule out "high-dimensional" correlations going through the parameters of the model?

2johnswentworth11dRoughly speaking, the generalized KPD says that if the long-range correlations are low dimensional, then the whole distribution is exponential family (modulo a few "exceptional" variables). The theorem doesn't rule out the possibility of high-dimensional correlations, but it narrows down the possible forms a lot if we can rule out high-dimensional correlations some other way. That's what I'm hoping for: some simple/common conditions which limit the dimension of the long-range correlations, so that gKPD can apply. This post says that those long range correlations have to be mediated by deterministic constraints, so if the dimension of the deterministic constraints is low, then that's one potential route. Another potential route is some kind of information network flow approach - i.e. if lots of information is conserved along one "direction", then that should limit information flow along "orthogonal directions", which would mean that long-range correlations are limited between "most" local chunks of the graph.
Research agenda update

The way I think about instrumental goals is: You have an MDP with a hierarchical structure (i.e. the states are the leaves of a rooted tree), s.t. transitions between states that differ on a higher level of the hierarchy (i.e. correspond to branches that split early) are slower than transitions between states that differ on lower levels of the hierarchy. Then quasi-stationary distributions on states resulting from different policies on the "inner MDP" of a particular "metastate" effectively function as actions w.r.t. the higher levels. Under some a... (read more)

Agency in Conway’s Game of Life

I think the GoL is not the best example for this sort of question. See this post by Scott Aaronson discussing the notion of "physical universality", which seems relevant here.

Also, like other commenters pointed out, I don't think the object you get here is necessarily AI. That's because the "laws of physics" and the distribution of initial conditions are assumed to be simple and known. An AI would be something that can accomplish an objective of this sort while also having to learn the rules of the automaton or detect patterns in the initial conditions. Fo... (read more)

Research agenda update

I don't understand what Lemma 1 is if it's not some kind of performance guarantee. So, this reasoning seems kinda circular. But, maybe I misunderstand.

1Steve Byrnes12dGood question! Imagine we have a learning algorithm that learns a world-model, and flags things in the world-model as "goals", and then makes plans to advance those "goals". (An example of such an algorithm is (part of) the human brain, more-or-less, according to me [https://www.lesswrong.com/posts/iMM6dvHzco6jBMFMX/value-loading-in-the-human-brain-a-worked-example] .) We can say the algorithm is "aligned" if the things flagged as "goals" do in fact correspond to maximizing the objective function (e.g. "predict the human's outputs"), or at least it's as close a match as anything in the world-model, and if this remains true even as the world-model gets improved and refined over time. Making that definition better and rigorous would be tricky because it's hard to talk rigorously about symbol-grounding, but maybe it's not impossible. And if so, I would say that this is a definition of "aligned" which looks nothing like a performance guarantee. OK, hmmm, after some thought, I guess it's possible that this definition of "aligned" would be equivalent to a performance-centric claim along the lines of "asymptotically, performance goes up not down". But I'm not sure that it's exactly the same. And even if it were mathematically equivalent, we still have the question of what the proof would look like, out of these two possibilities: * We prove that the algorithm is aligned (in the above sense) via "direct reasoning about alignment" (i.e. talking about symbol-grounding, goal-stability, etc.), and then a corollary of that proof would be the asymptotic performance guarantee. * We prove that the algorithm satisfies the asymptotic performance guarantee via "direct reasoning about performance", and then a corollary of that proof would be that the algorithm is aligned (in the above sense). I think it would be the first one, not the second. Why? Because it seems to me that the alignment problem is hard, and if it's solvable at all, it would only be solvable
Research agenda update

It's only a problem if we also claim that the "find a learning algorithm that satisfies the desiderata" part is not an AGI safety problem.

I never said it's not a safety problem. I only said that a lot of progress on this can come from research that is not very "safety specific". I would certainly work on it if "precisely defining safe" was already solved.

That's also where I was coming from when I expressed skepticism about "strong formal guarantees". We have no performance guarantee about the brain, and we have no performance guarantee about AlphaGo, to

... (read more)
1Steve Byrnes12dCool, gotcha, thanks. So my current expectation is either: (1) we will never be able to prove any performance guarantees about human-level learning algorithms, or (2) if we do, those proofs would only apply to certain algorithms that are packed with design features specifically tailored to solve the alignment problem, and any proof of a performance guarantee would correspondingly have a large subsection titled "Lemma 1: This learning algorithm will be aligned". The reason I think that is that (as above) I expect the learning algorithms in question to be kinda "agential", and if an "agential" algorithm is not "trying" to perform well on the objective, then it probably won't perform well on the objective! :-) If that view is right, the implication is: the only way to get a performance guarantee is to prove Lemma 1, and if we prove Lemma 1, we no longer care about the performance guarantee anyway, because we've already solved the alignment problem. So the performance guarantee would be besides the point (on this view).
Research agenda update

Obviously the problem of "make an agential "prior-building AI" that doesn't try to seize control of its off-switch" is being worked on almost exclusively by x-risk people.

Umm, obviously I did not claim it isn't. I just decomposed the original problem in a different way that didn't single out this part.

...if we can make a safe agential "prior-building AI" that gets to human-level predictive ability and beyond, then we've solved almost the whole TAI safety problem, because we could then run the prior-building AI, then turn it off and use microscope AI t

... (read more)
1Steve Byrnes12dHmmm, OK, let me try again. You wrote earlier: "the algorithm somehow manages to learn those hypotheses, for example by some process of adding more and more detail incrementally". My claim is that good-enough algorithms for "adding more and more detail incrementally" will also incidentally (by default) be algorithms that seize control of their off-switches. And the reason I put a lot of weight on this claim is that I think the best algorithms for "adding more and more detail incrementally" may be algorithms that are (loosely speaking) "trying" to understand and/or predict things, including via metacognition and instrumental reasoning. OK, then the way I'm currently imagining you responding to that would be: (If that's not the kind of argument you have in mind, oops sorry!) Otherwise: I feel like that's akin to putting "the AGI will be safe" as a desideratum, which pushes "solve AGI safety" onto the opposite side of the divide between desiderata vs. learning-algorithm-that-satisfies-the-desiderata. That's perfectly fine, and indeed precisely defining "safe" is very useful. It's only a problem if we also claim that the "find a learning algorithm that satisfies the desiderata" part is not an AGI safety problem. (Also, if we divide the problem this way, then "we can't find a provably-safe AGI design" would be re-cast as "no human-level learning algorithms satisfy the desiderata".) That's also where I was coming from when I expressed skepticism about "strong formal guarantees". We have no performance guarantee about the brain, and we have no performance guarantee about AlphaGo, to my knowledge. Again, as above, I was imagining an argument that turns a performance guarantee into a safety guarantee, like "I can prove that AlphaGo plays go at such-and-such Elo level, and therefore it must not be wireheading, because wireheaders aren't very good at playing Go." If you weren't thinking of performance guarantees, what "formal guarantees" are you thinking of? (For what
Research agenda update

I think the confusion here comes from mixing algorithms with desiderata. HDTL is not an algorithm, it is a type of desideratum that an algorithm can satisfy. "the AI's prior has a combinatorial explosion" is true but "dumb process of elimination" is false. A powerful AI has to have a very rich space of hypotheses it can learn. But this doesn't mean this space of hypotheses is explicitly stored in its memory or anything of the sort (which would be infeasible). It only means that the algorithm somehow manages to learn those hypotheses, for example by some... (read more)

1Steve Byrnes12dThanks!! Here's where I'm at right now. In the grandparent comment [https://www.lesswrong.com/posts/DkfGaZTgwsE7XZq9k/research-agenda-update?commentId=zuRqXa3477egJfyAk] I suggested that if we want to make an AI that can learn sufficiently good hypotheses to do human-level things, perhaps the only way to do that is to make a "prior-building AI" with "agency" that is "trying" to build out its world-model / toolkit-of-concepts-and-ideas in fruitful directions. And I said that we have to solve the problem of how to build that kind of agential "prior-building AI" that doesn't also incidentally "try" to seize control of its off-switch. Then in the parent comment [https://www.lesswrong.com/posts/DkfGaZTgwsE7XZq9k/research-agenda-update?commentId=6D445mqEf9poEBqbK] you replied (IIUC) that if this is a problem at all, it's not the problem you're trying to solve (i.e. "finding good formal desiderata for safe TAI"), but a different problem (i.e. "developing learning algorithms with strong formal guarantees and/or constructing a theory of formal guarantees for existing algorithms"), and my problem is "to a first approximation orthogonal" to your problem, and my problem "receives plenty of attention from outside the existential safety community". If so, my responses would be: * Obviously the problem of "make an agential "prior-building AI" that doesn't try to seize control of its off-switch" is being worked on almost exclusively by x-risk people. :-P * I suspect that the problem doesn't decompose the way you imply; instead I think that if we develop techniques for building a safe agential "prior-building AI", we would find that similar techniques enable us to build a safe non-manipulative-question-answering AI / oracle AI / helper AI / whatever. * Even if that's not true, I would still say that if we can make a safe agential "prior-building AI" that gets to human-level predictive ability and beyond, then we've solved almost the whole TAI safety
Research agenda update

I gave a formal mathematical definition of (idealized) HDTL, so the answer to your question should probably be contained there. But I'm not entirely sure what it is since I don't entirely understand the question.

The AI has a "superior epistemic vantage point" in the sense that, the prior is richer than the prior that humans have. But, why do we "still have the whole AGI alignment / control problem in defining what this RL system is trying to do and what strategies it’s allowed to use to do it"? The objective is fully specified.

A possible interpretation o... (read more)

1Steve Byrnes18dThanks, that was a helpful comment. I think we're making progress, or at least I'm learning a lot here. :) I think your perspective is: we start with a prior—i.e. the prior is an ingredient going into the algorithm. Whereas my perspective is: to get to AGI, we need an agent to build the prior, so to speak. And this agent can be dangerous. So for example, let's talk about some useful non-obvious concept, like "informational entropy". And let's suppose that our AI cannot learn the concept of "informational entropy" from humans, because we're in an alternate universe where humans haven't yet invented the concept of informational entropy. (Or replace "informational entropy" with "some important not-yet-discovered concept in AI alignment.) In that case, I see three possibilities. * First, the AI never winds up "knowing about" informational entropy or anything equivalent to it, and consequently makes worse predictions about various domains (human scientific and technological progress, the performance of certain algorithms and communications protocols, etc.) * Second (I think this is your model?): the AI's prior has a combinatorial explosion with every possible way of conceptualizing the world, of which an astronomically small proportion are actually correct and useful. With enough data, the AI settles into a useful conceptualization of the world, including some sub-network in its latent space that's equivalent to informational entropy. In other words: it "discovers" informational entropy by dumb process of elimination. * Third (this is my model): we get a prior by running a "prior-building AI". This prior-building AI has "agency"; it "actively" learns how the world works, by directing its attention etc. It has curiosity and instrumental reasoning and planning and so on, and it gradually learns instrumentally-useful metacognitive strategies, like a habit of noticing and attending to important and unexplained and suggestive
Research agenda update

Algorithmic Information Theory

Research agenda update

I'm still a bit hazy on what happens next in the plan—i.e., getting from that probabilistic model to the more abstract "what the human wants".

Well, one thing you could try is using the AIT definition of goal-directedness to go from the policy to the utility function. However, in general it might require knowledge of the human's counterfactual behavior, which the AI doesn't have. Maybe there are some natural assumptions under which it is possible, but it's not clear.

It's still worth noting that I, Steve, personally can be standing in a room with another

... (read more)
1Steve Byrnes19dThanks again for your very helpful response! I thought about the quantilization thing more, let me try again. As background, to a first approximation, let’s say 5 times per second I (a human) “think a thought”. That involves a pair of two things: * (Possibly) update my world-model * (Possibly) take an action—in this case, type a key at the keyboard Of these two things, the first one is especially important, because that’s where things get "figured out". (Imagine staring into space while thinking about something.) OK, now back to the AI. I can broadly imagine two strategies for a quantilization approach: 1. Build a model of the human policy from a superior epistemic vantage point: So here we give the AI its own world-model that needn’t have anything to do with the human’s, and likewise allow the AI to update its world-model in a way that needn’t have anything to do with how the human updates their world model. Then the AI leverages its superior world-model in the course of learning and quantilizing the human policy (maybe just the action part of the policy, or maybe both the actions and the world-model-updates, it doesn't matter for the moment). 2. Straightforward human imitation: Here, we try to get to a place where the AI is learning about the world and figuring things out in a (quantilized) human-like way. So we want the AI to sample from the human policy for "taking an action", and we want the AI to sample from the human policy for "updating the world-model". And the AI doesn't know anything about the world beyond what it learns through those quantilized-human-like world-model updates. Start with the first one. If the AI is going to get to a superior epistemic vantage point, then it needs to “figure things out” about the world and concepts and so on, and as I said before, I think “figuring things out” requires goal-seeking-RL-type exploration [https://www.lesswrong.com/posts/Gfw7JMdKirxeSPiAk/solving-
1Steve Byrnes20dThanks! I'm still thinking about this, but quick question: when you say "AIT definition of goal-directedness [https://www.lesswrong.com/posts/dPmmuaz9szk26BkmD/vanessa-kosoy-s-shortform?commentId=ovBmi2QFikE6CRWtj] ", what does "AIT" mean?
Vanessa Kosoy's Shortform

This is about right.

Notice that typically we use the AI for tasks which are hard for H. This means that without the AI's help, H's probability of success will usually be low. Quantilization-wise, this is a problem: the AI will be able to eliminate those paths for which H will report failure, but maybe most of the probability mass among apparent-success paths is still on failure (i.e. the success report is corrupt). This is why the timeline part is important.

On a typical task, H expects to fail eventually but they don't expect to fail soon. Therefore, the A... (read more)

Research agenda update

For example, lots of discussion of IRL and value learning seem to presuppose that we’re writing code that tells the AGI specifically how to model a human. To pick a random example, in Vanessa Kosoy's 2018 research agenda, the "demonstration" and "learning by teaching" ideas seem to rely on being able to do that—I don't see how we could possibly do those things if the whole world-model is a bunch of unlabeled patterns in patterns in patterns in sensory input etc.

We can at least try doing those things by just having specific channels through which human a... (read more)

1Steve Byrnes22dThanks!!! After reading your comment and thinking about it more, here's where I'm at: Your "demonstration" thing [https://www.lesswrong.com/posts/5bd75cc58225bf0670375575/the-learning-theoretic-ai-alignment-research-agenda#Value_learning_protocols] was described as "The [AI] observes a human pursuing eir values and deduces the values from the behavior." When I read that, I was visualizing a robot and a human standing in a room, and the human is cooking, and the robot is watching the human and figuring out what the human is trying to do. And I was thinking that there needs to be some extra story for how that works, assuming that the robot has come to understand the world by building a giant unlabeled Bayes net world-model, and that it processes new visual inputs by slotting them into that model. (And that's my normal assumption, since that's how I think the neocortex works, and therefore that's a plausible way that people might build AGI, and it's the one I'm mainly focused on.) So as the robot is watching the human soak lentils, the thing going on in its head is: "Pattern 957823, and Pattern 5672928, and Pattern 657192, and…". In order to have the robot assign a special status to the human's deliberate actions, we would need to find "the human's deliberate actions" somewhere in the unlabeled world-model, i.e. solve a symbol-grounding problem, and doing so reliably is not straightforward. However, maybe I was visualizing the wrong thing, with the robot and human in the room. Maybe I should have instead been visualizing a human using a computer via its keyboard. Then the AI can have a special input channel for the keystrokes that the human types. And every single one of those keystrokes is automatically treated as "the human's deliberate action". This seems to avoid the symbol-grounding problem I mentioned above. And if there's a special input channel, we can use supervised learning to build a probabilistic model of that input channel. (I definitely think this ste
Did they or didn't they learn tool use?

On page 28 they say:

Whilst some tasks do show successful ramp building (Figure 21), some hand-authored tasks require multiple ramps to be built to navigate up multiple floors which are inaccessible. In these tasks the agent fails.

From this, I'm guessing that it sometimes succeeds in building one ramp, but fails when the task requires building multiple ramps.

2Daniel Kokotajlo2moNice, I missed that! Thanks!
DeepMind: Generally capable agents emerge from open-ended play

I don't see what the big deal is about laws of physics. Humans and all their ancestors evolved in a world with the same laws of physics; we didn't have to generalize to different worlds with different laws. Also, I don't think "be superhuman at figuring out the true laws of physics" is on the shortest path to AIs being dangerous. Also, I don't think AIs need to control robots or whatnot in the real world to be dangerous, so they don't even need to be able to understand the true laws of physics, even on a basic level.

The entire novelty of this work revol... (read more)

2Quintin Pope2moWhat really impressed me were the generalized strategies the agent applied to multiple situations/goals. E.g., "randomly move things around until something works" sounds simple, but learning to contextually apply that strategy 1. to the appropriate objects, 2. in scenarios where you don't have a better idea of what to do, and 3. immediately stopping when you find something that works is fairly difficult for deep agents to learn. I think of this work as giving the RL agents a toolbox of strategies that can be flexibly applied to different scenarios. I suspect that finetuning agents trained in XLand in other physical environments will give good results because the XLand agents already know how to use relatively advanced strategies. Learning to apply the XLand strategies to the new physical environments will probably be easier than starting from scratch in the new environment.
DeepMind: Generally capable agents emerge from open-ended play

This is certainly interesting! To put things in proportion though, here are some limitations that I see, after skimming the paper and watching the video:

  • The virtual laws of physics are always the same. So, the sense in which this agent is "generally capable" is only via the geometry and the formal specification of the goal. Which is still interesting to be sure! But not as a big deal as it would be if it did zero-shot learning of physics (which would be an enormous deal IMO).
  • The formal specification is limited to propositional calculus. This allows for
... (read more)

Thanks! This is exactly the sort of thoughtful commentary I was hoping to get when I made this linkpost.

--I don't see what the big deal is about laws of physics. Humans and all their ancestors evolved in a world with the same laws of physics; we didn't have to generalize to different worlds with different laws. Also, I don't think "be superhuman at figuring out the true laws of physics" is on the shortest path to AIs being dangerous. Also, I don't think AIs need to control robots or whatnot in the real world to be dangerous, so they don't even need to be ... (read more)

BASALT: A Benchmark for Learning from Human Feedback

It's not "from zero" though, I think that we already have ML techniques that should be applicable here.

BASALT: A Benchmark for Learning from Human Feedback

if I thought we could build task-specific AI systems for arbitrary tasks, and only super general AI systems were dangerous, I'd be advocating really hard for sticking with task-specific AI systems and never building super general AI systems

The problem with this is that you need an AI whose task is "protect humanity from unaligned AIs", which is already very "general" in a way (i.e. requires operating on large scales of space, time and strategy). Unless you can effectively reduce this to many "narrow" tasks, which is probably not impossible but also not easy.

BASALT: A Benchmark for Learning from Human Feedback

The AI safety community claims it is hard to specify reward functions... But for real-world deployment of AI systems, designers do know the task in advance!

Right, but you're also going for tasks that are relatively simple and easy. In the sense that, "MakeWaterfall" is something that I can, based on my own experience, imagine solving without any ML at all (but ofc going to that extreme would require massive work). It might be that for such tasks solutions using handcrafted rewards/heuristics would be viable, but wouldn't scale to more complex tasks. If ... (read more)

3Rohin Shah2moI agree that's possible. Tbc, we did spend some time thinking about how we might use handcrafted rewards / heuristics to solve the tasks, and eliminated a couple based on this, so I think it probably won't be true here. No. For the competition, there's a ban on pretrained models that weren't publicly available prior to competition start. We look at participants' training code to ensure compliance. It is still possible to violate this rule in a way that we may not catch (e.g. maybe you use internal simulator details to do hyperparameter tuning, and then hardcode the hyperparameters in your training code), but it seems quite challenging and not worth the effort even if you are willing to cheat. For the benchmark (which is what I'm more excited about in the longer run), we're relying on researchers to follow the rules. Science already relies on researchers honestly reporting their results -- it's pretty hard to catch cases where you just make up numbers for your experimental results. (Also in the benchmark version, people are unlikely to write a paper about how they solved the task using special-case heuristics; that would be an embarrassing paper.)
BASALT: A Benchmark for Learning from Human Feedback

It's not quite as interesting as I initially thought, since they allow handcrafted reward functions and heuristics. It would be more interesting if the designers did not know the particular task in advance, and the AI would be forced to learn the task entirely from demonstrations and/or natural language description.

4Rohin Shah2moWe allow it, but we don't think it will lead to good performance (unless you throw a very large amount of time at it). The AI safety community claims it is hard to specify reward functions. If we actually believe this claim, we should be able to create tasks where even if we allow people to specify reward functions, they won't be able to do so. That's what we've tried to do here. Note we do ban extraction of information from the Minecraft simulator -- you have to work with pixels, so if you want to make handcrafted reward functions, you have to compute rewards from pixels somehow. (Technically you also have inventory information but that's not that useful.) We have this rule because in a real-world deployment you wouldn't be able to simply extract the "state" of physical reality. I am a bit more worried about allowing heuristics -- it's plausible to me that our chosen tasks are simple enough that heuristics could solve them, even though real world tasks are too complex for similar heuristics to work -- but this is basically a place where we're sticking our necks out and saying "nope, heuristics won't suffice either" (again, unless you put a lot of effort into designing the heuristics, where it would have been faster to just build the system that, say, learns from demonstrations). But for real-world deployment of AI systems, designers do know the task in advance! We don't want to ban strategies that designers could use in a realistic setting.
6Daniel Kokotajlo2moGoing from zero to "produce an AI that learns the task entirely from demonstrations and/or natural language description" is really hard for the modern AI research hive mind. You have to instead give it a shaped reward, breadcrumbs along the way that are easier, (such as allowing handcrafted heuristics and such, and allowing knowledge of a particular target task) to get the hive mind started making progress.
Open problem: how can we quantify player alignment in 2x2 normal-form games?

I don't think the alignment in this case should be defined to be 1. It seems perfectly justified to leave it undefined, since in such a game B can be equally well conceptualized as maximally aligned or as maximally anti-aligned. It is true that if, out of some set of objects, you consider the subset of those that have alignment 1, then it's natural to include the undefined cases too. But, if out of some set of objects you consider the subset of those that have alignment 0, then it's also natural to include the undefined cases. This is similar to how 0/0 is simultaneously... (read more)

Open problem: how can we quantify player alignment in 2x2 normal-form games?

In common-payoff games the denominator is not zero, in general. For example, suppose that , , , , . Then , as expected: current payoff is , if played it would be .

2Alex Turner3moYou're right. Per Jonah Moss's comment [https://www.lesswrong.com/posts/ghyw76DfRyiiMxo3t/open-problem-how-can-we-quantify-player-alignment-in-2x2?commentId=G8A3KtPw8vHwFnDj9] , I happened to be thinking of games where payoff is constant across players and outcomes, which is a very narrow kind of common-payoff (and constant-sum) game.
Open problem: how can we quantify player alignment in 2x2 normal-form games?

Consider any finite two-player game in normal form (each player can have any finite number of strategies; we can also easily generalize to certain classes of infinite games). Let S_A be the set of pure strategies of player A and S_B the set of pure strategies of player B. Let u_A : S_A × S_B → ℝ be the utility function of player A. Let ε ∈ Δ(S_A × S_B) be a particular (mixed) outcome. Then the alignment of player B with player A in this outcome is defined to be:

(E_ε[u_A] − min_{s_B} E_{ε_A}[u_A(s_A, s_B)]) / (max_{s_B} E_{ε_A}[u_A(s_A, s_B)] − min_{s_B} E_{ε_A}[u_A(s_A, s_B)]),

where ε_A is the marginal of ε over S_A.

Ofc so far it doesn't depend on u_B... (read more)
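A small sketch of this definition in code, under my reading of the formula above (normalize A's expected payoff at the outcome between B's worst and best pure responses to A's marginal); the function name and the example game are mine, and the denominator-zero case is returned as undefined per the replies below:

```python
import numpy as np

def alignment_B_with_A(u_A, eps):
    """Alignment of B with A at mixed outcome eps (both arrays are n_A x n_B).

    Computes (E_eps[u_A] - worst) / (best - worst), where worst/best are A's
    expected payoffs against B's worst/best pure responses to A's marginal.
    Returns None when the denominator is zero (the contested case below)."""
    u_A, eps = np.asarray(u_A, float), np.asarray(eps, float)
    eps_A = eps.sum(axis=1)                 # A's marginal strategy under eps
    current = float((eps * u_A).sum())      # E_eps[u_A]
    responses = eps_A @ u_A                 # A's expected payoff per pure s_B
    best, worst = responses.max(), responses.min()
    if best == worst:
        return None
    return (current - worst) / (best - worst)

# Example: eps is the product of uniform mixed strategies for A and B.
u_A = np.array([[2.0, 0.0],
                [0.0, 1.0]])
eps = np.outer([0.5, 0.5], [0.5, 0.5])
print(alignment_B_with_A(u_A, eps))         # 0.5: B sits halfway between its
                                            # worst and best response for A
```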

2Alex Turner3mo✅ Pending unforeseen complications, I consider this answer to solve the open problem. It essentially formalizes B's impact alignment [https://www.lesswrong.com/posts/Xts5wm3akbemk4pDa/non-obstruction-a-simple-concept-motivating-corrigibility#Formalizing_impact_alignment_in_extensive_form_games] with A, relative to the counterfactuals where B did the best or worst job possible. There might still be other interesting notions of alignment, but I think this is at least an important notion in the normal-form setting (and perhaps beyond).
2Alex Turner3moThis also suggests that "selfless" perfect B/A alignment is possible in zero-sum games, with the "maximal misalignment" only occurring if we assume B plays a best response. I think this is conceptually correct, and not something I had realized pre-theoretically.
2Alex Turner3moIn a sense, your proposal quantifies the extent to which B selects a best response on behalf of A, given some mixed outcome. I like this. I also think that "it doesn't necessarily depend on u_B" is a feature, not a bug. EDIT: To handle constant-payoff games, we might want to define the alignment to equal 1 if the denominator is 0. In that case, the response of B can't affect A's expected utility, and so it's not possible for B to act against A's interests. So we might as well say that B is (trivially) aligned, given such a mixed outcome?
My Current Take on Counterfactuals

I would be convinced if you had a theory of rationality that is a Pareto improvement on IB (i.e. has all the good properties of IB + a more general class of utility functions). However, LI doesn't provide this AFAICT. That said, I would be interested to see some rigorous theorem about LIDT solving procrastination-like problems.

As to philosophical deliberation, I feel some appeal in this point of view, but I can also easily entertain a different point of view: namely, that human values are more or less fixed and well-defined whereas philosophical deliberati... (read more)

2Abram Demski3moI don't believe that LI provides such a Pareto improvement, but I suspect that there's a broader theory which contains the two. Ah. I was going for the human-values argument because I thought you might not appreciate the rational-agent argument. After all, who cares what general rational agents can value, if human values happen to be well-represented by infrabayes? But for general rational agents, rather than make the abstract deliberation argument, I would again mention the case of LIDT in the procrastination paradox, which we've already discussed. Or, I would make the radical probabilist [https://www.lesswrong.com/posts/xJyY5QkQvNJpZLJRo/radical-probabilism-1] argument against rigid updating, and the 'orthodox' argument [https://www.lesswrong.com/posts/A8iGaZ3uHNNGgJeaD/an-orthodox-case-against-utility-functions] against fixed utility functions. Combined, we get a picture of "values" which is basically a market for expected values, where prices can change over time (in a "radical" way that doesn't necessarily spring from an update on a proposition), but which follow some coherence rules like an expectation of an expectation equals an expectation. One formalization of this is Skyrms [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.194.482&rep=rep1&type=pdf] '. Another is your generalization of LI (iirc). So to sum it up, my argument for general rational agents is: * In general, we need not update in a rigid way; we can develop a meaningful theory of 'fluid' updates, so long as we respect some coherence constraints. In light of this generalization, restriction to 'rigid' updates seems somewhat arbitrary (ie there does not seem to be a strong motivation to make the restriction from rationality alone). * Separately, there is no need to actually have a utility function if we have a coherent expectation. * Putting the two together, we can study coherent expectations where the notion of 'coherence' doesn't assume rigid updates. Howeve
An Intuitive Guide to Garrabrant Induction

First, "no complexity bounds on the trader" doesn't mean we allow uncomputable traders, we just don't limit their time or other resources (exactly like in Solomonoff induction). Second, even having a trader that knows everything doesn't mean all the prices collapse in a single step. It does mean that the prices will converge to knowing everything with time. GI guarantees no budget-limited trader will make an infinite profit, it doesn't guarantee no trader will make a profit at all (indeed guaranteeing the later is impossible).

An Intuitive Guide to Garrabrant Induction

A brief note on naming: Solomonoff exhibited an uncomputable algorithm that does idealized induction, which we call Solomonoff induction. Garrabrant exhibited a computable algorithm that does logical induction, which we have named Garrabrant induction.

This seems misleading. Solomonoff induction has computable versions obtained by imposing a complexity bound on the programs. Garrabrant induction has uncomputable versions obtained by removing the complexity bound from the traders. The important difference between Solomonoff and Garrabrant is not computabl... (read more)

1Steve Byrnes4moSorry if this is a stupid question but wouldn't "LI with no complexity bound on the traders" be trivial? Like, there's a noncomputable trader (brute force proof search + halting oracle) that can just look at any statement and immediately declare whether it's provably false, provably true, or neither. So wouldn't the prices collapse to their asymptotic value after a single step and then nothing else ever happens?
My Current Take on Counterfactuals

My hope is that we will eventually have computationally feasible algorithms that satisfy provable (or at least conjectured) infra-Bayesian regret bounds for some sufficiently rich hypothesis space. Currently, even in the Bayesian case, we only have such algorithms for poor hypothesis spaces, such as MDPs with a small number of states. We can also rule out such algorithms for some large hypothesis spaces, such as short programs with a fixed polynomial-time bound. In between, there should be some hypothesis space which is small enough to be feasible and rich... (read more)

My Current Take on Counterfactuals

However, I also think LIDT solves the problem in practical terms:

What is LIDT exactly? I can try to guess but I rather make sure we're both talking about the same thing.

My basic argument is we can model this sort of preference, so why rule it out as a possible human preference? You may be philosophically confident in finitist/constructivist values, but are you so confident that you'd want to lock unbounded quantifiers out of the space of possible values for value learning?

I agree inasmuch as we actually can model this sort of preferences, for a suff... (read more)

2Abram Demski3moRight, I agree with this. The situation as I see it is that there's a concrete theory of rationality (logical induction) which I'm using in this way, and it is suggesting to me that your theory (InfraBayes) can still be extended somewhat. My argument that we want this particular extension is basically as follows: human values can be thought of as the endpoint of human philosophical deliberation about values. (I am thinking of logical induction as a formalization of philosophical deliberation over time.) This endpoint seems limit-computable, but not necessarily computable. Now, it's also possible that at this endpoint, humans would have a more compact (ie, computable) representation of values. However, why assume this? (My hope is that by appealing to deliberation like this, my argument has more force than if I was only relying on the strength of logical induction as a theory of rationality. The idea of deliberation gives us a general reason to expect that limit-computable is the right place to look.) I'm not sure details matter very much here, but I'm provisionally happy to spell out LIDT as: 1. Specify some (bounded-value) LUV to use as "utility" 2. Make decisions by looking at conditional expectations of that LUV given actions. Concrete enough?
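A toy reading of that two-step spec (my own illustrative sketch, not Abram's code; `expectation` stands in for a logical inductor's current prices over LUVs):

```python
def lidt_policy(actions, expectation):
    """Pick the action with the highest conditional expectation of the chosen
    (bounded-value) 'utility' LUV, according to the current market prices."""
    return max(actions, key=lambda a: expectation(f"U | do({a})"))

# Usage with a stub dictionary in place of a real logical inductor:
prices = {"U | do(press_button)": 0.90, "U | do(procrastinate)": 0.85}
print(lidt_policy(["press_button", "procrastinate"], prices.get))  # press_button
```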
Introduction To The Infra-Bayesianism Sequence

Boundedly rational agents definitely can have dynamic consistency, I guess it depends on just how bounded you want them to be. IIUC what you're looking for is a model that can formalize "approximately rational but doesn't necessary satisfy any crisp desideratum". In this case, I would use something like my quantitative AIT definition of intelligence.

Formal Inner Alignment, Prospectus

Since you're trying to compile a comprehensive overview of directions of research, I will try to summarize my own approach to this problem:

  • I want to have algorithms that admit thorough theoretical analysis. There's already plenty of bottom-up work on this (proving initially weak but increasingly stronger theoretical guarantees for deep learning). I want to complement it by top-down work (proving strong theoretical guarantees for algorithms that are initially infeasible but increasingly made more feasible). Hopefully eventually the two will meet in the mi
... (read more)
Introduction To The Infra-Bayesianism Sequence

I'm not sure why would we need a weaker requirement if the formalism already satisfies a stronger requirement? Certainly when designing concrete learning algorithms we might want to use some kind of simplified update rule, but I expect that to be contingent on the type of algorithm and design constraints. We do have some speculations in that vein, for example I suspect that, for communicating infra-MDPs, an update rule that forgets everything except the current state would only lose something like expected utility.

2Stuart Armstrong4moI want a formalism capable of modelling and imitating how humans handle these situations, and we don't usually have dynamic consistency (nor do boundedly rational agents). Now, I don't want to weaken requirements "just because", but it may be that dynamic consistency is too strong a requirement to properly model what's going on. It's also useful to have AIs model human changes of morality, to figure out what humans count as values, so getting closer to human reasoning would be necessary.
My Current Take on Counterfactuals

In particular, it's easy to believe that some computation knows more than you.

Yes, I think TRL captures this notion. You have some Knightian uncertainty about the world, and some Knightian uncertainty about the result of a computation, and the two are entangled.

My Current Take on Counterfactuals

I lean towards some kind of finitism or constructivism, and am skeptical of utility functions which involve unbounded quantifiers. But also, how does LI help with the procrastination paradox? I don't think I've seen this result.

3Abram Demski4moWhat I'm referring to is that LI gives a notion of rational uncertain expectation for the procrastination paradox -- so, less a positive result, more a framework for thinking about what behavior is reasonable. However, I also think LIDT solves the problem in practical terms: * In the pure procrastination-paradox problem, LIDT will eventually push the button if its logic is sound. If it did not, it would mean the conditional probability of ever pressing the button given not pressing it today remains forever higher than the conditional probability of ever pressing it today. However, the expectation can be split into the probability it gets pushed today, and the probability that it gets pushed on any day later than today. The LI should eventually know that the conditional probability of ever pressing the button given pressing it today is arbitrarily close to 1. So in order to never press the button, the conditional probability of ever pressing it in the future (given not pressing today) would have to go to 1 (faster than the probability of it ever being pressed given pressing it today). I don't think this can happen, since there will be some nonzero limit probability that the button will never be pressed (that is, there will be, supposing the button is in fact never pressed). * In a situation where there is some actual reason to procrastinate (there are other sources of utility), but we place very high value on eventually pressing the button, it may be that the button will never be pressed? However, this will only happen if we're subjectively confident that it will eventually be pressed, and always have something better to do in the mean time. The second part seems pretty difficult. So maybe we can also prove that we eventually press the button in this case, as well. My basic argument is we can model this sort of preference, so why rule it out as a possible human preference? You may be philosophically confide
My Current Take on Counterfactuals

Yes, I'm pretty sure we have that kind of completeness. Obviously representing all hypotheses in this opaque form would give you poor sample and computational complexity, but you can do something midway: use black-box programs as components in your hypothesis but also have some explicit/transparent structure.

2Abram Demski4moOK, so, here is a question. The abstract theory of InfraBayes (like the abstract theory of Bayes) elides computational concerns. In reality, all of ML can more or less be thought of as using a big search for good models, where "good" means something approximately like MAP, although we can also consider more sophisticated variational targets. This introduces two different types of approximation: 1. The optimization target is approximate. 2. The optimization itself gives only approximate maxima. What we want out of InfraBayes is a bounded regret guarantee (in settings where we previously didn't know how to get one). What we have is a picture of how to get that if we can actually do the generalized Bayesian update. What we might want is a picture of how to do that more generally, when we can't actually compute the full update. Can we get such a thing with InfraBayes? In other words, search is a very basic type of logical uncertainty. Currently, we don't have much of a model of that, except "Bayesian Search" (which does not provide any nice regret bounds that I know of, although I may be ignorant). We might need such a thing in order to get nice guarantees for systems which employ search internally. Can we get it? Obviously, we can do the bayesian-search thing with InfraBayes substituted in, which already probably provides some kind of guarantee which couldn't be gotten otherwise. However, the challenge is to get the guarantee to carry all the way through to the end result.
Updating the Lottery Ticket Hypothesis

IIUC, here's a simple way to test this hypothesis: initialize a random neural network, and then find the minimal loss point in the tangent space. Since the tangent space is linear, this is easy to do (i.e. doesn't require heuristic gradient descent): for square loss it's just solving a large linear system once, for many other losses it should amount to convex optimization for which we have provable efficient algorithms. And, I guess it's underdetermined so you add some regularization. Is the result about as good as normal gradient descent in the actual parameter space? I'm guessing some of the linked papers might have done something like this?
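A minimal numpy sketch of the test described above (my own toy setup, not from the post): linearize a small randomly initialized network around its initialization and find the minimal square-loss point in the tangent space with a single regularized linear solve.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny one-hidden-layer net f(x) = a . tanh(W x) with a random initialization.
d_in, d_hidden, n = 3, 50, 200
W = rng.normal(size=(d_hidden, d_in)) / np.sqrt(d_in)
a = rng.normal(size=d_hidden) / np.sqrt(d_hidden)

X = rng.normal(size=(n, d_in))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=n)          # toy regression target

def forward(X, W, a):
    return np.tanh(X @ W.T) @ a

def jacobian(X, W, a):
    """Per-example gradient of f(x) with respect to all parameters (W and a)."""
    h = np.tanh(X @ W.T)                                 # (n, d_hidden)
    dW = (a * (1 - h ** 2))[:, :, None] * X[:, None, :]  # d f / d W_{jk}
    return np.concatenate([dW.reshape(len(X), -1), h], axis=1)

# Linearized model: f(x; theta0 + delta) ~= f(x; theta0) + J(x) . delta.
# For square loss, the best delta in the tangent space is one regularized solve.
f0 = forward(X, W, a)
J = jacobian(X, W, a)
ridge = 1e-3                                             # regularization, since the system is underdetermined
delta = np.linalg.solve(J.T @ J + ridge * np.eye(J.shape[1]), J.T @ (y - f0))
preds = f0 + J @ delta
print("train MSE of tangent-space fit:", np.mean((preds - y) ** 2))
```

Comparing this fit's test error against ordinary gradient descent in the actual parameter space is the experiment being proposed.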

3johnswentworth5moThis basically matches my current understanding. (Though I'm not strongly confident in my current understanding.) I believe the GP results are basically equivalent to this, but I haven't read up on the topic enough to be sure.
My Current Take on Counterfactuals

So we have this nice picture, where rationality is characterized by non-exploitability wrt a specific class of potential exploiters.

I'm not convinced this is the right desideratum for that purpose. Why should we care about exploitability by traders if making such trades is not actually possible given the environment and the utility function? IMO epistemic rationality is subservient to instrumental rationality, so our desiderata should be derived from the later.

Human value-uncertainty is not particularly well-captured by Bayesian uncertainty, as I imag

... (read more)
2Abram Demski5moIt's clear that you understand logical induction pretty well, so while I feel like you're missing something, I'm not clear on what that could be. I think maybe the more fruitful branch of this conversation (as opposed to me trying to provide an instrumental justification for radical probabilism, though I'm still interested in that) is the question of describing the human utility function. The logical induction picture isn't strictly at odds with a platonic utility function, I think, since we can consider the limit. (I only claim that this isn't the best way to think about it in general, since Nature didn't decide a platonic utility function for us and then design us such that our reasoning has the appropriate limit.) For example, one case which to my mind argues in favor of the logical induction approach to preferences: the procrastination paradox. All you want to do is ensure that the button is pressed at some point. This isn't a particularly complex or unrealistic preference for an agent to have. Yet, it's unclear how to make computable beliefs think about this appropriately. Logical induction provides a theory about how to think about this kind of goal. (I haven't thought much about how TRL would handle it.) Agree or disagree: agents can sensibly pursue Δ_2 objectives? And, do you think that question is cruxy for you?
2Abram Demski5moSo, one point is that the InfraBayes picture still gives epistemics an important role: the kind of guarantee arrived at is a guarantee that you won't do too much worse than the most useful partial model expects. So, we can think about generalized partial models which update by thinking longer in addition to taking in sense-data. I suppose TRL can model this by observing what those computations would say, in a given situation, and using partial models which only "trust computation X" rather than having any content of their own. Is this "complete" in an appropriate sense? Can we always model a would-be radical-infrabayesian as a TRL agent observing what that radical-infrabayesian would think? Even if true, there may be a significant computational complexity gap between just doing the thing vs modeling it in this way.
3Abram Demski5moThis does make sense to me, and I view it as a weakness of the idea. However, the productivity of dutch-book type thinking in terms of implying properties which seem appealing for other reasons speaks heavily in favor of it, in my mind. A formal connection to more pragmatic criteria would be great. But also, maybe I can articulate a radical-probabilist position without any recourse to dutch books... I'll have to think more about that. I'm not sure how to double crux with this intuition, unfortunately. When I imagine the perspective you describe, I feel like it's rolling all dynamic inconsistency into time-preference and ignoring the role of deliberation. My claim is that there is a type of change-over-time which is due to boundedness, and which looks like "dynamic inconsistency" from a classical bayesian perspective, but which isn't inherently dynamically inconsistent. EG, if you "sleep on it" and wake up with a different, firmer-feeling perspective, without any articulable thing you updated on. (My point isn't to dogmatically insist that you haven't updated on anything, but rather, to point out that it's useful to have the perspective where we don't need to suppose there was evidence which justifies the update as Bayesian, in order for it to be rational.)
My Current Take on Counterfactuals

I guess we can try studying Troll Bridge using infra-Bayesian modal logic, but atm I don't know what would result.

From a radical-probabilist perspective, the complaint would be that Turing RL still uses the InfraBayesian update rule, which might not always be necessary to be rational (the same way Bayesian updates aren't always necessary).

Ah, but there is a sense in which it doesn't. The radical update rule is equivalent to updating on "secret evidence". And in TRL we have such secret evidence. Namely, if we only look at the agent's beliefs about "physics" (the environment), then they would be updated radically, because of secret evidence from "mathematics" (computations).

3Abram Demski5moI agree that radical probabilism can be thought of as bayesian-with-a-side-channel, but it's nice to have a more general characterization where the side channel is black-box, rather than an explicit side-channel which we explicitly update on. This gives us a picture of the space of rational updates. EG, the logical induction criterion allows for a large space of things to count as rational. We get to argue for constraints on rational behavior by pointing to the existence of traders which enforce those constraints, while being agnostic about what's going on inside a logical inductor. So we have this nice picture, where rationality is characterized by non-exploitability wrt a specific class of potential exploiters. Here's an argument for why this is an important dimension to consider: 1. Human value-uncertainty is not particularly well-captured by Bayesian uncertainty, as I imagine you'll agree. One particular complaint is realizability: we have no particular reason to assume that human preferences are within any particular space of hypotheses we can write down. 2. One aspect of this can be captured by InfraBayes: it allows us to eliminate the realizability assumption, instead only assuming that human preferences fall within some set of constraints which we can describe. 3. However, there is another aspect to human preference-uncertainty: human preferences change over time. Some of this is irrational, but some of it is legitimate philosophical deliberation. 4. And, somewhat in the spirit of logical induction, humans do tend to eventually address the most egregious irrationalities. 5. Therefore, I tend to think that toy models of alignment (such as CIRL, DRL, DIRL) should model the human as a radical probabilist; not because it's a perfect model, but because it constitutes a major incremental improvement wrt modeling what kind of uncertainty humans have over our own preferences. Recognizing preferences as a thing whic
My Current Take on Counterfactuals

I only skimmed this post for now, but a few quick comments on links to infra-Bayesianism:

InfraBayes doesn’t seem to have that worry, since it applies to non-realizable cases. (Or does it? Is there some kind of non-oscillation guarantee? Or is non-oscillation part of what it means for a set of environments to be learnable -- IE it can oscillate in some cases?)... AFAIK the conditions for learnability in the InfraBayes case are still pretty wide open.

It's true that these questions still need work, but I think it's rather clear that something like "there ... (read more)

Is there a way to operationalize "respecting logic"? For example, a specific toy scenario where an infra-Bayesian agent would fail due to not respecting logic?

"Respect logic" means either (a) assigning probability one to tautologies (at least, to those which can be proved in some bounded proof-length, or something along those lines), or, (b) assigning probability zero to contradictions (again, modulo boundedness). These two properties should be basically equivalent (ie, imply each other) provided the proof system is consistent. If it's inconsistent, they i... (read more)

What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs)

From your reply to Paul, I understand your argument to be something like the following:

  1. Any solution to single-single alignment will involve a tradeoff between alignment and capability.
  2. If AI systems are not designed to be cooperative, then in a competitive environment each system will either go out of business or slide towards the capability end of the tradeoff. This will result in catastrophe.
  3. If AI systems are designed to be cooperative, they will strike deals to stay towards the alignment end of the tradeoff.
  4. Given the technical knowledge to design c
... (read more)
Formal Solution to the Inner Alignment Problem

I'm kind of scared of this approach because I feel unless you really nail everything there is going to be a gap that an attacker can exploit.

I think that not every gap is exploitable. For most types of bias in the prior, the bias would only promote simulation hypotheses with baseline universes conformant to this bias, and attackers who evolved in such universes will also tend to share this bias, so they will target universes conformant to this bias, and that would make them less competitive with the true hypothesis. In other words, most types of bias affect ... (read more)

What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs)

I don't understand the claim that the scenarios presented here prove the need for some new kind of technical AI alignment research. It seems like the failures described happened because the AI systems were misaligned in the usual "unipolar" sense. These management assistants, DAOs, etc. are not aligned to the goals of their respective individual users/owners.

I do see two reasons why multipolar scenarios might require more technical research:

  1. Maybe several AI systems aligned to different users with different interests can interact in a Pareto inefficient wa
... (read more)
5Andrew Critch5moI don't mean to say this post warrants a new kind of AI alignment research, and I don't think I said that, but perhaps I'm missing some kind of subtext I'm inadvertently sending? I would say this post warrants research on multi-agent RL and/or AI social choice and/or fairness and/or transparency, none of which are "new kinds" of research (I promoted them heavily in my preceding post), and none of which I would call "alignment research" (though I'll respect your decision to call all these topics "alignment" if you consider them that). I would say, and I did say: I do hope that the RAAP concept can serve as a handle for noticing structure in multi-agent systems, but again I don't consider this a "new kind of research", only an important/necessary/neglected kind of research for the purposes of existential safety. Apologies if I seemed more revolutionary than intended. Perhaps it's uncommon to take a strong position of the form "X is necessary/important/neglected for human survival" without also saying "X is a fundamentally new type of thinking that no one has done before", but that is indeed my stance for X∈{a variety of non-alignment AI research areas [https://www.lesswrong.com/posts/hvGoYXi2kgnS3vxqb/some-ai-research-areas-and-their-relevance-to-existential-1] }.
2Andrew Critch6moHow are you inferring this? From the fact that a negative outcome eventually obtained? Or from particular misaligned decisions each system made? It would be helpful if you could point to a particular single-agent decision in one of the stories that you view as evidence of that single agent being highly misaligned with its user or creator. I can then reply with how I envision that decision being made even with high single-agent alignment. Yes, this^.
Formal Solution to the Inner Alignment Problem

Is bounded? I assign significant probability to it being or more, as mentioned in the other thread between me and Michael Cohen, in which case we'd have trouble.

Yes, you're right. A malign simulation hypothesis can be a very powerful explanation to the AI for why it found itself at a point suitable for this attack, thereby compressing the "bridge rules" by a lot. I believe you argued as much in your previous writing, but I managed to confuse myself about this.
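As a hedged sketch of the description-length comparison being gestured at (my own notation, following the standard presentation of this attack rather than anything explicit in the comment): the intended hypothesis pays for physics plus bridge rules, while the malign hypothesis pays for the simulating civilization plus a pointer into whatever it chooses to simulate:

```latex
% My notation: K(\cdot) is description length under the prior.
K(\text{intended}) \;\approx\; K(\text{physics}) + K(\text{bridge rules}),
\qquad
K(\text{malign}) \;\approx\; K(\text{simulators}) + K(\text{pointer} \mid \text{their sampling}).
```

The worry is that K(pointer | their sampling) is much smaller than K(bridge rules), because the simulators concentrate their sampling on exactly the attack-suitable situations, which is the "compression" referred to above.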

Here's a sketch of a proposal for how to solve this. Let's construct our prior to be ... (read more)

6Paul Christiano6moI broadly think of this approach as "try to write down the 'right' universal prior." I don't think the bridge rules / importance-weighting consideration is the only way in which our universal prior is predictably bad. There are also issues like anthropic update and philosophical considerations about what kind of "programming language" to use and so on. I'm kind of scared of this approach because I feel unless you really nail everything there is going to be a gap that an attacker can exploit. I guess you just need to get close enough that εδ is manageable, but I think I still find it scary (and don't totally remember all my sources of concern). I think of this in contrast with my approach based on epistemic competitiveness, where the idea is not necessarily to identify these considerations in advance, but to be epistemically competitive with an attacker (inside one of your hypotheses) who has noticed an improvement over your prior. That is, if someone inside one of our hypotheses has noticed that e.g. a certain class of decisions is more important and so they will simulate only those situations, then we should also notice this and by the same token care more about our decision if we are in one of those situations (rather than using a universal prior without importance weighting). My sense is that without competitiveness we are in trouble anyway on other fronts, and so it is probably also reasonable to think of it as a first-line defense against this kind of issue. This is very similar to what I first thought about when going down this line. My instantiation runs into trouble with "giant" universes that do all the possible computations you would want, and then use the "free" complexity in the bridge rules to pick which of the computations you actually wanted. I am not sure if the DFA proposal gets around this kind of problem, though it sounds like it would be pretty similar.
Vanessa Kosoy's Shortform

So is the general idea that we quantilize such that we're choosing in expectation an action that doesn't have corrupted utility (by intuitively having something like more than twice as many actions in the quantilization than we expect to be corrupted), so that we guarantee the probability of following the manipulation of the learned user report is small?

Yes, although you probably want much more than twice. Basically, if the probability of corruption following the user policy is and your quantilization fraction is then the AI's probability of corrupt... (read more)
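The inline symbols dropped out of the comment above in this compilation; as a hedged reconstruction in my own notation: if the user policy leads to corruption with probability eps, and we quantilize by sampling from the top-q fraction of the base distribution (ranked by the proxy evaluation), then quantilization boosts the probability of any event by at most a factor 1/q, so the AI's corruption probability is at most eps/q. That is why q needs to be much larger than eps, not merely twice as large. A small numerical sketch:

```python
import random

# Hedged toy sketch with my own notation (the original comment's symbols were
# lost in this compilation):
#   eps = probability that the base ("user") policy picks a corrupt action,
#   q   = quantilization fraction (sample uniformly from the top-q fraction of
#         actions under the proxy score, weighted by the base distribution).
# The standard quantilizer guarantee: any event with base probability eps has
# probability at most eps / q after quantilization.

random.seed(0)
N = 100_000
eps = 0.001   # fraction of actions that are corrupt under the base policy
q = 0.05      # quantilization fraction

# Base policy: uniform over N actions; a random eps-fraction is "corrupt".
# Corrupt actions get inflated proxy scores to model adversarial pressure.
corrupt = set(random.sample(range(N), int(eps * N)))
proxy_score = [random.random() + (10.0 if a in corrupt else 0.0) for a in range(N)]

# Quantilize: restrict to the top-q fraction by proxy score, then sample
# uniformly (the base distribution is uniform here, so this is faithful).
top = sorted(range(N), key=lambda a: proxy_score[a], reverse=True)[: int(q * N)]
samples = [random.choice(top) for _ in range(50_000)]
corruption_rate = sum(a in corrupt for a in samples) / len(samples)

print(f"empirical corruption rate: {corruption_rate:.4f}")  # close to 0.02
print(f"bound eps/q:               {eps / q:.4f}")          # 0.0200
# Even in this worst case, where every corrupt action lands in the top-q set,
# the corruption probability cannot exceed eps/q; with q only twice eps the
# bound would be 1/2, which is why "much more than twice" is needed.
```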

Vanessa Kosoy's Shortform

More observations about this attack vector ("attack from counterfactuals"). I focus on "amplifying by subjective time".

  • The harder the takeoff, the more dangerous this attack vector (a toy model is sketched after this list): During every simulation cycle, the ability to defend against simulated malign AI depends on the power of the defense system at the beginning of the cycle[1]. On the other hand, the capability of the attacker depends on its power at the end of the cycle. Therefore, if power grows very fast this is bad news for the defender. On the other hand, if power grows very slowly, the defende
... (read more)
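As a purely illustrative toy model (my own, not from the original shortform): suppose total power grows exponentially at rate g, and a simulation cycle has subjective length Δ. The defender fights with its power at the start of the cycle, while the attacker fights with roughly the power available at the end, so

```latex
% Illustrative toy model, my own notation: exponential growth at rate g,
% simulation cycle of subjective length \Delta.
P(t) = P_0\, e^{g t},
\qquad
\frac{\text{attacker capability}}{\text{defender capability}}
  \;\sim\; \frac{P(t_0 + \Delta)}{P(t_0)} \;=\; e^{g\Delta}.
```

A harder takeoff (larger g) or a longer cycle (larger Δ) widens this gap, matching the qualitative claim in the bullet; with a slow takeoff the ratio stays close to 1.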