I like the overall idea - seems very worthwhile.
A query on the specifics:
We consider a model capable of some task X if:...We can construct some other task Y, for which we know the model needs to do X in order to solve Y, and we observe that the model is capable of Y
We consider a model capable of some task X if:
Are you thinking that this is a helpful definition even when treating models as black boxes, or only based on some analysis of the model's internals? To me it seems workable only in the latter case.
In particular, from a black-box perspective, I don't think we ever know that task X is required for task Y. The most we can know is that some [task Z whose output logically entails the output of task X] is required for task Y (where of course Z may be X).
So this clause seems never to be satisfiable without highly specific knowledge of the internals of the model. (if we were to say that it's satisfied when we know Y requires some Z entailing X, then it seems we'd be requiring logical omniscience for intent alignment)
For example, the model may be doing something like: Z→(Y&A)→YWithout knowing that Z≡(X&A), and that X→Y also works (A happening to be superfluous in this case).
Does that seem right, or am I confused somewhere?
Another way to put this is that for workable cases, I'd expect the first clause to cover things: if the model knows how to simply separate Z into X&A in the above, then I'd expect suitable prompt engineering, fine-tuning... to be able to get the model to do task X.
(EDIT - third time lucky :) :If this isn't possible for a given X, then I expect the model isn't capable of task X (for non-deceptive models, at least).For black boxes, the second clause only seems able to get us something like "the model contains sufficient information for task X to be performed", which is necessary, but not sufficient, for capability.)
If you’re trying to align wildly superintelligent systems, you don’t have to worry about any concern related to your system being incompetent.
In general, this seems false. The thing you don't have to worry about is subhuman competence. You may still have to worry about incompetence relative to some highly superhuman competence threshold. (it may be fine to say that this isn't an alignment problem - but it's a worry)
One concern is reaching [competent at X] before [competent at operating safely when [competent at X]].
Here it'd be fine if the system had perfect knowledge of the risks of X, or perfect calibration of its uncertainty around such risks. Replace "perfect" with "wildly superhuman", and you lose the guarantee. If human-level-competence would be wildly unsafe at the [...operating safely...] task, then knowing the system will do better isn't worth much. (we're wildly superchimp at AI safety; this may not be good enough)
I think it can sometimes be misleading to think/talk about issues "in the limit of competence": in the limit you're throwing away information about relative competence levels (at least unless you're careful to take limits of all the important ratios etc. too).
E.g. take two systems:Alice: [x power, x2 wisdom]Bob: [x power, √x wisdom]
We can let x tend to infinity and say they're both arbitrarily powerful and arbitrarily wise, but I'd still trust Alice a whole lot more than Bob at any given time (for safe exploration, and many other things).I don't think it's enough to say "Bob will self-modify to become Alice-like (in a singleton scenario)". The concerning cases are where Bob has insufficient wisdom to notice or look for a desirable [self-modify to Alice] style option.
It's conceivable to me that this is a non-problem in practice: that any system with only modestly super-human wisdom starts to make Wei Dai look like a reckless megalomaniac, regardless of its power. Even if that's true, it seems important to think about ways to train systems such that they acquire this level of wisdom early.
Perhaps this isn't exactly an alignment problem - but it's the kind of thing I'd want anyone aligning wildly superintelligent systems to worry about (unless I'm missing something).
Thanks, that's interesting. [I did mean to reply sooner, but got distracted]
A few quick points:
Yes, by "incoherent causal model" I only mean something like "causal model that has no clear mapping back to a distribution over real worlds" (e.g. where different parts of the model assume that [kite exists] has different probabilities).Agreed that the models LCDT would use are coherent in their own terms. My worry is, as you say, along garbage-in-garbage-out lines.
Having LCDT simulate HCH seems more plausible than its taking useful action in the world - but I'm still not clear how we'd avoid the LCDT agent creating agential components (or reasoning based on its prediction that it might create such agential components) [more on this here: point (1) there seems ok for prediction-of-HCH-doing-narrow-task (since all we need is some non-agential solution to exist); point (2) seems like a general problem unless the LCDT agent has further restrictions].
Agreed on HCH practical difficulties - I think Evan and Adam are a bit more optimistic on HCH than I am, but no-one's saying it's a non-problem. From the LCDT side, it seems we're ok so long as it can simulate [something capable and aligned]; HCH seems like a promising candidate.
On HCH-simulation practical specifics, I think a lot depends on how you're generating data / any model of H, and the particular way any [system that limits to HCH] would actually limit to HCH. E.g. in an IDA setup, the human(s) in any training step will know that their subquestions are answered by an approximate model.
I think we may be ok on error-compounding, so long as the learned model of humans is not overconfident of its own accuracy (as a model of humans). You'd hope to get compounding uncertainty rather than compounding errors.
However, I don't think this is quite right (unless I'm missing something):
Now observe that in the LCDT planning world model C constructed by marginalization, this knowledge of the goalkeeper is a known parameter of the ball kicking optimization problem that the agent must solve. If we set the outcome probabilities right, the game theoretical outcome will be that the optimal policy is for the agent to kicks right, so it plays the opposite move that the goalkeeper expects. I'd argue that this is a form of deception, a deceptive scenario that LCDT is trying to prevent.
I don't think the situation is significantly different between B and C here. In B, the agent will decide to kick left most of the time since that's the Nash equilibrium. In C the agent will also decide to kick left most of the time: knowing the goalkeeper's likely action still leaves the same Nash solution (based on knowing both that the keeper will probably go left, and that left is the agent's stronger side).If the agent knew the keeper would definitely go left, then of course it'd kick right - but I don't think that's the situation.
I'd be interested on your take on Evan's comment on incoherence in LCDT. Specifically, do you think the issue I'm pointing at is a difference between LCDT and counterfactual planners? (or perhaps that I'm just wrong about the incoherence??)As I currently understand things, I believe that CPs are doing planning in a counterfactual-but-coherent world, whereas LCDT is planning in an (intentionally) incoherent world - but I might be wrong in either case.
Ok, that mostly makes sense to me. I do think that there are still serious issues (but these may be due to my remaining confusions about the setup: I'm still largely reasoning about it "from outside", since it feels like it's trying to do the impossible).
Assuming this is actually a problem, it struck me that it may be worth thinking about a condition vaguely like:
The idea being to specify a weaker condition that does enough forwarding-the-guarantee to allow safe instantiation of particular types of agent while still avoiding deception.
I'm far from clear that anything along these lines would help: it probably doesn't work, and it doesn't seem to solve the side-effect-agent problem anyway: [complete indifference to influence on X] and [robustly avoiding creation of X] seem fundamentally incompatible.
Thoughts welcome. With luck I'm still confused.
Ah yes, you're right there - my mistake.
However, I still don't see how LCDT can make good decisions over adjustments to its simulation. That simulation must presumably eventually contain elements classed as agentic.Then given any adjustment X which influences the simulation outcome both through agentic paths and non-agentic paths, the LCDT agent will ignore the influence [relative to the prior] through the agentic paths. Therefore it will usually be incorrect about what X is likely to accomplish.
It seems to me that you'll also have incoherence issues here too: X can change things so that p(Y = 0) is 0.99 through a non-agentic path, whereas the agents assumes the equivalent of [p(Y = 0) is 0.5] through an agentic path.
I don't see how an LCDT agent can make efficient adjustments to its simulation when it won't be able to decide rationally on those judgements in the presence of agentic elements (which again, I assume must exist to simulate HCH).
Ok thanks, I think I see a little more clearly where you're coming from now.(it still feels potentially dangerous during training, but I'm not clear on that)
A further thought:
The core idea is that LCDT solves the hard problem of being able to put optimization power into simulating something efficiently in a safe way
Ok, so suppose for the moment that HCH is aligned, and that we're able to specify a sufficiently accurate HCH model. The hard part of the problem seems to be safe-and-efficient simulation of the output of that HCH model.I'm not clear on how this part works: for most priors, it seems that the LCDT agent is going to assign significant probability to its creating agentic elements within its simulation. But by assumption, it doesn't think it can influence anything downstream of those (or the probability that they exist, I assume).
That seems to be the place where LCDT needs to do real work, and I don't currently see how it can do so efficiently. If there are agentic elements contributing to the simulation's output, then it won't think it can influence the output.Avoiding agentic elements seems impossible almost by definition: if you can create an arbitrarily accurate HCH simulation without its qualifying as agentic, then your test-for-agents can't be sufficiently inclusive.
...but hopefully I'm still confused somewhere.
Right, as far as I can see, it achieves the won't-be-deceptive aim. My issue is in seeing how we find a model that will consistently do the right thing in training (given that it's using LCDT).
As I understand it, under LCDT an agent is going to trade an epsilon utility gain on non-agent-influencing-paths for an arbitrarily bad outcome on agent-influencing-paths (since by design it doesn't care about those). So it seems that it's going to behave unacceptably for almost all goals in almost all environments in which there can be negative side-effects on agents we care about.
We can use it to run simulations, but it seems to me that most problems (deception in particular) get moved to the simulation rather than solved.
Quite possibly I'm still missing something, but I don't currently see how the LCDT decisions do much useful work here (Am I wrong? Do you see LCDT decisions doing significant optimisation?).I can picture its being a useful wrapper around a simulation, but it's not clear to me in what ways finding a non-deceptive (/benign) simulation is an easier problem than finding a non-deceptive (/benign) agent. (maybe side-channel attacks are harder??)
[Pre-emptive apologies for the stream-of-consciousness: I made the mistake of thinking while I wrote. Hopefully I ended up somewhere reasonable, but I make no promises]
simulating HCH or anything really doesn't require altering the action set of a human/agent
My point there wasn't that it requires it, but that it entails it. After any action by the LCDT agent, the distribution over future action sets of some agents will differ from those same distributions based on the prior (perhaps very slightly).
E.g. if I burn your kite, your actual action set doesn't involve kite-flying; your prior action set does. After I take the [burn kite] action, my prediction of [kite exists] doesn't have a reliable answer.
If I'm understanding correctly (and, as ever, I may not be), this is just to say that it'd come out differently based on the way you set up the pre-link-cutting causal diagram. If the original diagram effectively had [kite exists iff Adam could fly kite], then I'd think it'd still exist after [burn kite]; if the original had [kite exists iff Joe didn't burn kite] then I'd think that it wouldn't.
In the real world, those two setups should be logically equivalent. The link-cutting breaks the equivalence. Each version of the final diagram functions in its own terms, but the answer to [kite exists] becomes an artefact of the way we draw the initial diagram. (I think!)
In this sense, it's incoherent (so Evan's not claiming there's no bullet, but that he's biting it); it's just less clear that it matters that it's incoherent.
I still tend to think that it does matter - but I'm not yet sure whether it's just offending my delicate logical sensibilities, or if there's a real problem.
For instance, in my reply to Evan, I think the [delete yourself to free up memory] action probably looks good if there's e.g. an [available memory] node directly downstream of the [delete yourself...] action.If instead the path goes [delete yourself...] --> [memory footprint of future self] --> [available memory], then deleting yourself isn't going to look useful, since [memory footprint...] shouldn't change.
Perhaps it'd work in general to construct the initial causal diagrams in this way:You route maximal causality through agents, when there's any choice.So you then tend to get [LCDT action] --> [Agent action-set-alteration] --> [Whatever can be deduced from action-set-alteration].
You couldn't do precisely this in general, since you'd need backwards-in-time causality - but I think you could do some equivalent. I.e. you'd put an [expected agent action set distribution] node immediately after the LCDT decision, treat that like an agent at decision time, and deduce values of intermediate nodes from that.
So in my kite example, let's say you'll only get to fly your kite (if it exists) two months from my decision, and there's a load of intermediate nodes.But directly downstream of my [burn kite] action we put a [prediction of Adam's future action set] node. All of the causal implications of [burn kite] get routed through the action set prediction node.
Then at decision time the action-set prediction node gets treated as part of an agent, and there's no incoherence. (but I predict that my [burn kite] fails to burn your kite)
Anyway, quite possibly doing things this way would have a load of downsides (or perhaps it doesn't even work??), but it seems plausible to me.
My remaining worry is whether getting rid of the incoherence in this way is too limiting - since the LCDT agent gets left thinking its actions do almost nothing (given that many/most actions would be followed by nodes which negate their consequences relative to the prior).
[I'll think more about whether I'm claiming much/any of this impacts the simulation setup (beyond any self-deletion issues)]
Ah ok. Weird, but ok. Thanks.
Perhaps I'm now understanding correctly(??). An undesirable action that springs to mind: delete itself to free up disk space. Its future self is assumed to give the same output regardless of this action.More generally, actions with arbitrarily bad side-effects on agents, to gain marginal utility. Does that make sense?
I need to think more about the rest.
[EDIT and see rambling reply to Adam re ways to avoid the incoherence. TLDR: I think placing a [predicted agent action set alterations] node directly after the LCDT decision node in the original causal diagram, deducing what can be deduced from that node, and treating it as an agent at decision-time might work. It leaves the LCDT agent predicting that many of its actions don't do much, but it does get rid of the incoherence (I think). Currently unclear whether this throws the baby out with the bathwater; I don't think it does anything about negative side-effects]