I basically agree with you. I think you go too far in saying Lethailty 19 is solved, though. Using the 3 feats from your linked comment, which I'll summarise as "produce a mind that...":
(clearly each one is strictly harder than the previous) I recognise that Lethality 19 concerns feat 3, though it is worded as if being about both feat 2 and feat 3.
I think I need to distinguish two versions of feat 3:
Yes, human beings exist and build world models beyond their local sensory data, and have values over those world models not just over the senses.
But this is not addressing all of the problem in Lethality 19. What's missing is how we point at something specific (not just at anything external).
The important disanalogy between AGI alignment and humans as already-existing (N)GIs is:
In this story deception is all about the model having hidden behaviors that never get triggered during training
Not necessarily - depends on how abstractly we're considering behaviours. (It also depends on how likely we are to detect the bad behaviours during training.)
Consider an AI trained on addition problems that is only exposed to a few problems that look like 1+3=4, 3+7=10, 2+5=7, 2+6=8 during training, where there are two summands which are each a single digit and they appear in ascending order. Now at inference time the model exposed to 10+2= output... (read more)
We might summarise this counterargument to #30 as "verification is easier than generation". The idea is that the AI comes up with a plan (+ explanation of how it works etc.) that the human systems could not have generated themselves, but that human systems can understand and check in retrospect.
Counterclaim to "verification is easier than generation" is that any pivotal act will involve plans that human systems cannot predict the effects of just by looking at the plan. What about the explanation, though? I think the problem there may be more that we don't ... (read more)
What kind of access might be needed to private models? Could there be a secure multi-party computation approach that is sufficient?
If I've understood it correctly, I think this is a really important point, so thanks for writing a post about it. This post highlights that mesa objectives and base objectives are typically going to be of different "types", because the base objective will typically be designed to evaluate things in the world as humans understand it (or as modelled by the formal training setup) whereas the mesa objective will be evaluating things in the AI's world model (or if it doesn't really have a world model, then more local things like actions themselves as opposed to... (read more)
It's possible that reality is even worse than this post suggests, from the perspective of someone keen on using models with an intuitive treatment of time. I'm thinking of things like "relaxed-memory concurrency" (or "weak memory models") where there is no sequentially consistent ordering of events. The classic example is where these two programs run in parallel, with X and Y initially both holding 0, [write 1 to X; read Y into R1] || [write 1 to Y; read X into R2], and after both programs finish both R1 and R2 contain 0. What's going on here is that the l... (read more)
I agree with this, and I think the distinction between "explicit search" and "heuristics" is pretty blurry: there are characteristics of search (evaluating alternative options, making comparisons, modifying one option to find another option, etc.) that can be implemented by heuristics, so you get some kind of hybrid search-instinct system overall that still has "consequentialist nature".
Thanks a lot for posting this! A minor point about the 2nd intuition pump (100-timesteps, 4 actions: Take $1, Do Nothing, Buy Apple, Buy Banana; the point being that most action sequences take the Take $1 action a lot rather than the Do Nothing action): the "goal" of getting 3 apples seems irrelevant to the point, and may be misleading if you think that that goal is where the push to acquire resources comes from. A more central source seems to me to be the "rule" of not ending with a negative balance: this is what prunes paths through the tree that contain more "do nothing" actions.
In order to cross-check a non-holdout sensor with a holdout sensor, you need to know the expected relationship between the two sensor readings under different levels of tampering. A simple case: holdout sensor 1 and non-holdout sensor 1 are identical cameras on the ceiling pointing down at the room, the expected relationship is that the images captured agree (up to say 1 pixel shift because the cameras are at very slightly different positions) under no tampering, and don't agree when there's been tampering.
Problem: tampering with the non-holdout sensor may... (read more)
In my understanding there's a missing step between upgraded verification (of software, algorithms, designs) and a "defence wins" world: what the specifications for these proofs need to be isn't a purely mathematical thing. The missing step is how to figure out what the specs should say. Better theorem proving isn't going to help much with the hard parts of that.
Question: Does ARC consider ELK-unlimited to be solved, where ELK-unlimited is ELK without the competitiveness restriction (computational resource requirements comparable to the unaligned benchmark)?
One might suppose that the "have AI help humans improve our understanding" strategy is a solution to ELK-unlimited because its counterexample in the report relies on the competitiveness requirement. However, there may still be other counterexamples that were less straightforward to formulate or explain.
I'm asking for clarification of this point because I notice... (read more)
I think the problem you're getting at here is real -- path-dependency of what a human believes on how they came to believe it, keeping everything else fixed (e.g., what the beliefs refer to) -- but I also think ARC's ELK problem is not claiming this isn't a real problem but rather bracketing (deferring) it for as long as possible. Because there are cases where ELK fails that don't have much path-dependency in them, and we can focus on solving those cases until whatever else is causing the problem goes away (and only path-dependency is left).
Our notion of narrowness is that we are interested in solving the problem where the question we're asking is such that a state always resolves a question. E.g. there isn't any ambiguity around whether a state "really contains a diamond". (Note that there is ambiguity around whether the human could detect the diamond from any set of observations because there could be a fake diamond or nanobots filtering what the human sees). It might be useful to think of this as an empirical claim about diamonds.
This "there isn't any ambiguity"+"there is ambiguity" does n... (read more)
Proposing experiments that are more specifically exposing tampering does sound like what I meant, and I agree that my attempt to reduce this to experiments that expose confidently wrong human predictions may not be precise enough.
How do we use this to construct new sensors that allow the human to detect tampering?
I know this is crossed out but thought it might help to answer anyway: the proposed experiment includes instructions for how to set the experiment up and how to read the results. These may include instructions for building new sensors.
... (read more)The proposed
Here’s an attempt at condensing an issue I’m hung up on currently with ELK. This also serves as a high-level summary that I’d welcome poking at in case I’m getting important parts wrong.
The setup for ELK is that we’re trying to accurately label a dataset of (observation, action, predicted subsequent observation) triples for whether the actions are good. (The predicted subsequent observations can be optimised for accuracy using automated labels - what actually gets observed subsequently - whereas the actions need their labels to come from a source of ... (read more)
Just a few links to complement Abram's answer:
On how seemingly myopic training schemes can nonetheless produce non-myopic behaviour:
On approval-directed agents:
Thanks for the reply! I think you’ve understood correctly that the human rater needs to understand the proposed experiment – i.e., be able to carry it out and have a confident expectation about the outcome – in order to rate the proposer highly.
Here’s my summary of your point: for some tampering actions, there are no experiments that a human would understand in the above sense that would expose the tampering. Therefore that kind of tampering will result in low value for the experiment proposer (who has no winning strategy), and get rated highly.
This is a c... (read more)
Tweaking your comment slightly:
I'd be scared that the "Am I tricking you?" head just works by:
- Predicting what the human will predict [when experiment E is performed]
- Predicting what will actually happen [when experiment E is performed]
- Output a high value iff the human's prediction is confident but different from reality.
If this is the case, then the head will report detectable tampering but not undetectable tampering.
Yes this is correct for the Value head. But how does detectable vs undetectable apply to this builder strategy? Compared to what's in the repo... (read more)
Here’s a Builder move (somewhat underdeveloped but I think worth posting now even as I continue to think - maybe someone can break it decisively quickly).
Training strategy: Add an “Am I tricking you?” head to the SmartVault model.
The proposed flow chart for how the model works has an “Experiment Proposer” coming out of “Figure out what’s going on”, and two heads out of Experiment Proposer, called “Experiment Description” and “Value” (meaning “Expected Value of Experiment to the Proposer”). I won’t make use of the question-answering Reporter/Answer parts, s... (read more)
Thanks for the reply! My comments are rather more thinking-in-progress than robust-conclusions than I’d like, but I figure that’s better than nothing.
Would it have helped if I had replaced "preferences over trajectories" with the synonymous "preferences that are not exclusively about the future state of the world"?
(Thanks for doing that!) I was going to answer ‘yes’ here, but… having thought about this more, I guess I now find myself confused about what it means to have preferences in a way that doesn't give rise to consequentialist behaviour. Having (unst... (read more)
Thanks for writing this up! I appreciate the summarisation achieved by the background sections, and the clear claims made in bold in the sketch.
The "preferences (purely) over future states" and "preferences over trajectories" distinction is getting at something, but I think it's broken for a couple of reasons. I think you've come to a similar position by noticing that people have preferences both over states and over trajectories. But I remain confused about the relationship between the two posts (Yudkowsky and Shah) you mentioned at the start. Anyway, her... (read more)
why would intelligence generalize to "qualitatively new thought processes, things being way out of training distribution", but corrigibility would not?
This sounds confused to me: the intelligence is the "qualitatively new thought processes". The thought processes aren't some new regime that intelligence has to generalize to. Also to answer the question directly, I think the claim is that intelligence (which I'd say is synonymous for these purposes with capability) is simpler and more natural than corrigibility (i.e., the last claim - I don't think these th... (read more)
Here are two versions of "an AGI will understand very well what I mean":
Yes - nice post - feels to me like another handle on the pointers problem.
Actually, no matter what the policy is, we can view the agent as an EU maximizer. The construction is simple: the agent can be thought as optimizing the utility function U, where U(h, a) = 1 if the policy would take action a given history h, else 0. Here I’m assuming that U is defined over histories that are composed of states/observations and actions.
This is not the type signature for a utility function that matters for the coherence arguments (by which I don't mean VNM - see this comment). It does often fit the type signature in the way those arguments a... (read more)
Now, four billion years later, we are about to set in motion a second seed.
We can also view this as part of the effects of the initial seed.
I'm a little confused about what the large-scale predictable, and steerable, consequences of the initial seed are. For predictable consequences, I can imagine things like proliferation of certain kinds of molecules (like proteins). But where's the steerability?
A couple of direct questions I'm stuck on:
I'm asking these to understand the work better.
Currently my answers are:
I wonder what effect there is from selecting for reading the third post in a sequence of MIRI conversations from start to end and also looking at the comments and clicking links in them.
Thanks for the replies! I'm still somewhat confused but will try again to both ask the question more clearly and summarise my current understanding.
What, in the case of consequentialists, is analogous to the water funnelled by literal funnels? Is it possibilities-according-to-us? Or is it possibilities-according-to-the-consequentialist? Or is it neither (or both) of those?
To clarify a little what the options in my original comment were, I'll say what I think they correspond to for literal funnels. Option 1 corresponds to the fact that funnels are usually n... (read more)
My reply to your distinction between 'consequentialists' and 'outcome pumps' would be, "Please forget entirely about any such thing as a 'consequentialist' as you defined it; I would now like to talk entirely about powerful outcome pumps. All understanding begins there, and we should only introduce the notion of how outcomes are pumped later in the game. Understand the work before understanding the engines; nearly every key concept here is implicit in the notion of work rather than in the notion of a particular kind of engine."
(Modulo that lots... (read more)
A couple of other arguments the non-MIRI side might add here:
Here Daniel Kokotajlo and I try to paraphrase the two sides of part of the disagreement and point towards a possible crux about the simplicity of corrigibility.
We are training big neural nets to be effective. (More on what effective means elsewhere; it means something like “being able to steer the future better than humans can.”) We want to have an effective&corrigible system, and we are worried that instead we’ll get an effective&deceptive system. Ngo, Shah, etc. are hopeful that it won’t be “that hard” to get the former and avoid the latter; mayb... (read more)
I am interested in the history-funnelling property -- the property of being like a consequentialist, or of being effective at achieving an outcome -- and have a specific confusion I'd love to get insight on from anyone who has any.
Question: Possible outcomes are in the mind of a world-modeller - reality just is as it is (exactly one way) and isn't made of possibilities. So in what sense do the consequentialist-like things Yudkowsky is referring to funnel history?
Option 1 (robustness/behavioural/our models): They achieve narrow outcomes with respect t... (read more)
Thanks! I'd had a bit of a look through that book before and agree it's a great resource. One thing I wasn't able to easily find is examples of robust patterns. Does anyone know if there's been much investigation of robustness in the Life community? The focus I've seen seems to be more on particular constructions (used in its entirety as the initial state for a computation), rather than on how patterns fare when placed in various ranges of different contexts.
Nice comment - thanks for the feedback and questions!
Do you know of any formal or empirical arguments/evidence for the claim that evolution stops being relevant when there exist sufficiently intelligent entities (my possibly incorrect paraphrase of "Darwinian evolution as such isn't a thing amongst superintelligences")?
That's right. A partial function can be thought of as a subset (of its domain) and a total function on that subset. And a (total) function can be thought of as a partition (of its domain): the parts are the inverse images of each point in the function's image.
Blackwell’s theorem says that the conditions under which can be said to be more generally useful than are precisely the situations where is a post-garbling of .
Are the indices the wrong way around here?
A formalisation of the ideas in this sequence in higher-order logic, including machine verified proofs of all the theorems, is available here.
"subagent [] that could choose " -- do you mean or or neither of these? Since is not closed under unions, I don't think the controllables version of "could choose" is closed under coarsening the partition. (I can prove that the ensurables version is closed; but it would have been nice if the controllables version worked.)
ETA: Actually controllables do work out if I ignore the degenerate case of a singleton partition of the world. This is because, when considering partitions of the world, ensur... (read more)
I have something suggestive of a negative result in this direction:
Let be the prime-detector situation from Section 2.1 of the coarse worlds post, and let be the (non-surjective) function that "heats" the outcome (changes any "C" to an "H"). The frame is clearly in some sense equivalent to the one from the example (which deletes the temperature from the outcome) -- I am using my version just to stay within the same category when comparing frames. As a reminder, primality is not observable in but is ob... (read more)
With the other problem resolved, I can confirm that adding an escape clause to the multiplicative definitions works out.
Using the idea we talked about offline, I was able to fix the proof - thanks Rohin!
Summary of the fix:
When and are defined, additionally assume they are biextensional (take their biextensional collapse), which is fine since we are trying to prove a biextensional equivalence. (By the way this is why we can't take , since we might have after biextensional collapse.) Then to prove , observe that for all , which means , hence ... (read more)
I presume the fix here will be to add an explicit escape clause to the multiplicative definitions. I haven't been able to confirm this works out yet (trying to work around this), but it at least removes the counterexample.
How is this supposed to work (focusing on the claim specifically)?
and so
Thus, .
Earlier, was defined as follows:
given by and
but there is no reason to suppose above.
It suffices to establish that
I think the and here are supposed to be and
Indeed I think the case may be the basis of a counterexample to the claim in 4.2. I can prove for any (finite) with that there is a finite partition of such that 's agent observes according to the assuming definition but does not observe according to the constructive multiplicative definition, if I take
Let
nit: should be here
and let be an element of .
and the second should be . I think for these and to exist you might need to deal with the case separately (as in Section 5). (Also couldn't you just use the same twice?)
UPDATE: I was able to prove in general whenever and are disjoint and both in , with help from Rohin Shah, following the "restrict attention to world " approach I hinted at earlier.
Yes that sounds right to me.