All of Ramana Kumar's Comments + Replies

Where I agree and disagree with Eliezer

I basically agree with you. I think you go too far in saying Lethality 19 is solved, though. Using the 3 feats from your linked comment, which I'll summarise as "produce a mind that...":

  1. cares about something
  2. cares about something external (not shallow function of local sensory data)
  3. cares about something specific and external

(clearly each one is strictly harder than the previous) I recognise that Lethality 19 concerns feat 3, though it is worded as if being about both feat 2 and feat 3.

I think I need to distinguish two versions of feat 3:

  1. there is a reliable
... (read more)
2Alex Turner2d
Hm. I feel confused about the importance of 3b as opposed to 3a. Here's my first guess: Because we need to target the AI's motivation in particular ways in order to align it with particular desired goals, it's important for there not just to be a predictable mapping, but a flexibly steerable one, such that we can choose to steer towards "dog" or "rock" or "cheese wheels" or "cooperating with humans." Is this close?
Where I agree and disagree with Eliezer

Yes, human beings exist and build world models beyond their local sensory data, and have values over those world models not just over the senses.

But this is not addressing all of the problem in Lethality 19. What's missing is how we point at something specific (not just at anything external).

The important disanalogy between AGI alignment and humans as already-existing (N)GIs is:

  • for AGIs there's a principal (humans) that we want to align the AGI to
  • for humans there is no principal - our values can be whatever. Or if you take evolution as the principal, the alignment problem wasn't solved.
4Alex Turner6d
I addressed this distinction previously [https://www.lesswrong.com/posts/uMQ3cqWDPHhjtiesc/agi-ruin-a-list-of-lethalities?commentId=wyhwqrj4eJuPpzKSz#xBkKxAYzrhDCw9iWR] , in one of the links in OP. AFAIK we did not know how to reliably ensure the AI is pointed towards anything external, as long as it's external. But also, humans are reliably pointed to particular kinds of external things. See the linked thread for more detail. I am not attempting to make an analogy. Genome->human values is, mechanistically, an instance of value formation within a generally intelligent mind. For all of our thought experiments, genome->human values is the only instance we have ever empirically observed. Huh? I think I misunderstand you. I perceive you as saying: "There is not a predictable mapping from whatever-is-in-the-genome+environmental-factors to learned-values." If so, I strongly disagree. Like, in the world where that is true, wouldn't parents be extremely uncertain whether their children will care about hills or dogs or paperclips or door hinges? Our values are not "whatever", human values are generally formed over predictable kinds of real-world objects like dogs and people and tasty food. The linked theory [https://docs.google.com/document/d/1UDzBDL82Z-eCCHmxRC5aefX4abRfK2_Pc1AUI1vkJaw/edit?usp=sharing] makes it obvious why evolution couldn't have possibly solved the human alignment problem. To quote: (Edited to expand my thoughts)
Training Trace Priors

In this story deception is all about the model having hidden behaviors that never get triggered during training


Not necessarily - depends on how abstractly we're considering behaviours. (It also depends on how likely we are to detect the bad behaviours during training.)

Consider an AI trained on addition problems that is only exposed to a few problems that look like 1+3=4, 3+7=10, 2+5=7, 2+6=8 during training, where there are two summands which are each a single digit and they appear in ascending order. Now at inference time the model exposed to 10+2= output... (read more)
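To make the toy example concrete, here's a minimal sketch (mine, not part of the original comment) of the training distribution described above, together with a check showing which inference-time prompts fall outside it; the format details (e.g. whether equal summands are allowed) are assumptions.

```python
# Hypothetical training distribution: two single-digit summands in ascending
# order, as in the example above.
train_prompts = {f"{a}+{b}={a + b}" for a in range(10) for b in range(a, 10)}

def in_training_distribution(prompt):
    """True iff the left-hand side of the prompt matches the training format."""
    lhs = prompt.split("=")[0]
    a, b = lhs.split("+")
    return len(a) == 1 and len(b) == 1 and int(a) <= int(b)

for prompt in ["2+5=", "10+2=", "7+3="]:
    print(prompt, in_training_distribution(prompt))
# 2+5= True; 10+2= False (multi-digit summand); 7+3= False (descending order)
```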

1Adam Jermyn13d
Right. Maybe a better way to say it is:
  1. Without hidden behaviors (suitably defined), you can't have deception.
  2. With hidden behaviors, you can have deception.
The two together give a bit of a lever that I think we can use to bias away from deception if we can find the right operational notion of hidden behaviors.
AGI Ruin: A List of Lethalities

We might summarise this counterargument to #30 as "verification is easier than generation". The idea is that the AI comes up with a plan (+ explanation of how it works etc.) that the human systems could not have generated themselves, but that human systems can understand and check in retrospect.

Counterclaim to "verification is easier than generation" is that any pivotal act will involve plans that human systems cannot predict the effects of just by looking at the plan. What about the explanation, though? I think the problem there may be more that we don't ... (read more)

8Eliezer Yudkowsky21d
This seems to me like a case of the imaginary hypothetical "weak pivotal act" that nobody can ever produce. If you have a pivotal act you can do via following some procedure that only the AI was smart enough to generate, yet humans are smart enough to verify and smart enough to not be reliably fooled about, NAME THAT ACTUAL WEAK PIVOTAL ACT.
Richard Ngo's Shortform

What kind of access might be needed to private models? Could there be a secure multi-party computation approach that is sufficient?

An observation about Hubinger et al.'s framework for learned optimization

If I've understood it correctly, I think this is a really important point, so thanks for writing a post about it. This post highlights that mesa objectives and base objectives are typically going to be of different "types", because the base objective will typically be designed to evaluate things in the world as humans understand it (or as modelled by the formal training setup) whereas the mesa objective will be evaluating things in the AI's world model (or if it doesn't really have a world model, then more local things like actions themselves as opposed to... (read more)

Against Time in Agent Models

It's possible that reality is even worse than this post suggests, from the perspective of someone keen on using models with an intuitive treatment of time. I'm thinking of things like "relaxed-memory concurrency" (or "weak memory models") where there is no sequentially consistent ordering of events. The classic example is where these two programs run in parallel, with X and Y initially both holding 0, [write 1 to X; read Y into R1] || [write 1 to Y; read X into R2], and after both programs finish both R1 and R2 contain 0. What's going on here is that the l... (read more)
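Here's a minimal sketch (my own) that makes the "no sequentially consistent ordering" point concrete: enumerating every interleaving of the four operations that respects each thread's program order never produces R1 = R2 = 0, even though relaxed-memory hardware (store buffering) can and does produce exactly that outcome.

```python
from itertools import permutations

# Thread 0: write 1 to X; read Y into R1.  Thread 1: write 1 to Y; read X into R2.
ops = [("write", "X", 1), ("read", "Y", "R1"),   # thread 0, in program order
       ("write", "Y", 1), ("read", "X", "R2")]   # thread 1, in program order

def sc_outcomes():
    """All (R1, R2) outcomes reachable under sequential consistency:
    some interleaving of the four ops that preserves each thread's order."""
    outcomes = set()
    for order in permutations(range(4)):
        if order.index(0) > order.index(1) or order.index(2) > order.index(3):
            continue  # violates a thread's program order
        mem, regs = {"X": 0, "Y": 0}, {}
        for i in order:
            kind, loc, arg = ops[i]
            if kind == "write":
                mem[loc] = arg
            else:
                regs[arg] = mem[loc]
        outcomes.add((regs["R1"], regs["R2"]))
    return outcomes

print(sc_outcomes())  # contains (0, 1), (1, 0), (1, 1) -- but never (0, 0),
                      # yet store-buffering hardware can produce R1 = R2 = 0
```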

2Maxwell Clarke1mo
(Edited a lot from when originally posted) (For more info on consistency see the diagram here: https://jepsen.io/consistency [https://jepsen.io/consistency] ) I think that the prompt to think about partially ordered time naturally leads one to think about consistency levels - but when thinking about agency, I think it makes more sense to just think about DAGs of events, not reads and writes. Low-level reality doesn't really have anything that looks like key-value memory. (Although maybe brains do?) And I think there's no maintaining of invariants in low-level reality, just cause and effect. Maintaining invariants under eventual (or causal?) consistency might be an interesting way to think about minds. In particular, I think making minds and alignment strategies work under "causal consistency" (which is the strongest consistency level that can be maintained under latency / partitions between replicas), is an important thing to do. It might happen naturally though, if an agent is trained in a distributed environment. So I think "strong eventual consistency" (CRDTs) and causal consistency are probably more interesting consistency levels to think about in this context than the really weak ones.
[ASoT] Consequentialist models as a superset of mesaoptimizers

I agree with this, and I think the distinction between "explicit search" and "heuristics" is pretty blurry: there are characteristics of search (evaluating alternative options, making comparisons, modifying one option to find another option, etc.) that can be implemented by heuristics, so you get some kind of hybrid search-instinct system overall that still has "consequentialist nature".

The Big Picture Of Alignment (Talk Part 1)

Thanks a lot for posting this! A minor point about the 2nd intuition pump (100-timesteps, 4 actions: Take $1, Do Nothing, Buy Apple, Buy Banana; the point being that most action sequences take the Take $1 action a lot rather than the Do Nothing action): the "goal" of getting 3 apples seems irrelevant to the point, and may be misleading if you think that that goal is where the push to acquire resources comes from. A more central source seems to me to be the "rule" of not ending with a negative balance: this is what prunes paths through the tree that contain more "do nothing" actions.
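A toy enumeration (my own sketch, with a shortened horizon and under the assumed reading that the balance may never go negative, i.e. you can't buy what you can't afford) shows the pruning effect directly: among the surviving action sequences, "take $1" is the most common action, consistent with the resource argument in the reply below.

```python
from itertools import product

# Toy version with a shortened horizon; 4**100 is infeasible to enumerate.
ACTIONS = {"take_dollar": +1, "do_nothing": 0, "buy_apple": -1, "buy_banana": -1}
HORIZON = 8

def valid(seq):
    """Assumed rule: the balance may never go negative at any point."""
    balance = 0
    for action in seq:
        balance += ACTIONS[action]
        if balance < 0:
            return False
    return True

valid_seqs = [s for s in product(ACTIONS, repeat=HORIZON) if valid(s)]
for action in ACTIONS:
    avg = sum(s.count(action) for s in valid_seqs) / len(valid_seqs)
    print(f"{action}: {avg:.2f} occurrences per valid sequence on average")
# take_dollar comes out highest: acquiring the resource opens up more valid
# continuations, so most surviving paths grab the dollar often.
```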

5johnswentworth4mo
Yup! More generally, key pieces for modeling a "resource": amounts of the resource are additive, and more resources open up more actions (operationalized by the need for a positive balance in this case). If there's something roughly like that in the problem space, then the resource-seeking argument kicks in.
Some Hacky ELK Ideas

In order to cross-check a non-holdout sensor with a holdout sensor, you need to know the expected relationship between the two sensor readings under different levels of tampering. A simple case: holdout sensor 1 and non-holdout sensor 1 are identical cameras on the ceiling pointing down at the room, the expected relationship is that the images captured agree (up to say 1 pixel shift because the cameras are at very slightly different positions) under no tampering, and don't agree when there's been tampering.
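As a concrete illustration of this kind of cross-check (a minimal sketch of my own, with assumed array-valued grayscale images and a made-up tolerance): the two cameras should agree up to a ~1-pixel shift, and a discrepancy that no small shift explains is flagged as possible tampering.

```python
import numpy as np

def consistent(holdout_img, live_img, max_shift=1, tol=5.0):
    """Return True if some shift of at most `max_shift` pixels makes the two
    camera images agree (mean absolute pixel difference below `tol`)."""
    h, w = holdout_img.shape
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            a = holdout_img[max(dy, 0):h + min(dy, 0), max(dx, 0):w + min(dx, 0)]
            b = live_img[max(-dy, 0):h + min(-dy, 0), max(-dx, 0):w + min(-dx, 0)]
            if np.mean(np.abs(a.astype(float) - b.astype(float))) < tol:
                return True
    return False  # no small shift explains the disagreement: suspect tampering

rng = np.random.default_rng(0)
scene = rng.integers(0, 256, size=(64, 64))
print(consistent(scene, np.roll(scene, 1, axis=1)))            # True: a 1-pixel shift explains it
print(consistent(scene, rng.integers(0, 256, size=(64, 64))))  # False: unrelated image
```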


Problem: tampering with the non-holdout sensor may... (read more)

3Hoagy4mo
I was going to write something similar, and just wanted to add that this problem can be expected to get worse the more non-holdout sensors there are. If there were just a single non-holdout camera then spoofing only the one camera would be worthwhile - but if there were a grid of cameras with just a few being held out then it would likely be easiest to take an action that fools them all, like a counterfeit diamond. This method would work best when there be whole modes of data which are ignored, and the work needed to spoof them is orthogonal to all non-holdout modes.
Alex Ray's Shortform

In my understanding there's a missing step between upgraded verification (of software, algorithms, designs) and a "defence wins" world: what the specifications for these proofs need to be isn't a purely mathematical thing. The missing step is how to figure out what the specs should say. Better theorem proving isn't going to help much with the hard parts of that.

1Alex Gray5mo
I think that's right that upgraded verification by itself is insufficient for 'defense wins' worlds. I guess I'd thought that was apparent but you're right it's definitely worth saying explicitly. A big wish of mine is that we end up doing more planning/thinking-things-through for how researchers working on AI today could contribute to 'defense wins' progress. My implicit other take here that wasn't said out loud is that I don't really know of other pathways where good theorem proving translates to better AI x-risk outcomes. I'd be eager to know of these.
Prizes for ELK proposals

Question: Does ARC consider ELK-unlimited to be solved, where ELK-unlimited is ELK without the competitiveness restriction (computational resource requirements comparable to the unaligned benchmark)?

One might suppose that the "have AI help humans improve our understanding" strategy is a solution to ELK-unlimited because its counterexample in the report relies on the competitiveness requirement. However, there may still be other counterexamples that were less straightforward to formulate or explain.

I'm asking for clarification of this point because I notice... (read more)

5Paul Christiano6mo
My guess is that "help humans improve their understanding" doesn't work anyway, at least not without a lot of work, but it's less obvious and the counterexamples get weirder. It's less clear whether ELK is a less natural subproblem for the unlimited version of the problem. That is, if you try to rely on something like "human deliberation scaled up" to solve ELK, you probably just have to solve the whole (unlimited) problem along the way. It seems to me like the core troubles with this point are:
  • You still have finite training data, and we don't have a scheme for collecting it. This can result in inner alignment problems (and it's not clear those can be distinguished from other problems, e.g. you can't avoid them with a low-stakes assumption).
  • It's not clear that HCH ever figures out all the science, no matter how much time the humans spend (and having a guarantee that you eventually figure everything out seems kind of close to ELK, where the "have AI help humans improve our understanding" is to some extent just punting to the humans+AI to figure out something).
  • Even if HCH were to work well it will probably be overtaken by internal consequentialists, and I'm not sure how to address that without competitiveness. (Though you may need a weaker form of competitiveness.)
I'm generally interested in crisper counterexamples since those are a bit of a mess.
ARC's first technical report: Eliciting Latent Knowledge

I think the problem you're getting at here is real -- path-dependency of what a human believes on how they came to believe it, keeping everything else fixed (e.g., what the beliefs refer to) -- but I also think ARC's ELK problem is not claiming this isn't a real problem but rather bracketing (deferring) it for as long as possible. Because there are cases where ELK fails that don't have much path-dependency in them, and we can focus on solving those cases until whatever else is causing the problem goes away (and only path-dependency is left).

ARC's first technical report: Eliciting Latent Knowledge

Our notion of narrowness is that we are interested in solving the problem where the question we're asking is such that a state always resolves a question. E.g. there isn't any ambiguity around whether a state "really contains a diamond". (Note that there is ambiguity around whether the human could detect the diamond from any set of observations because there could be a fake diamond or nanobots filtering what the human sees). It might be useful to think of this as an empirical claim about diamonds.

This "there isn't any ambiguity"+"there is ambiguity" does n... (read more)

2Paul Christiano6mo
I don't think we have any kind of precise definition of "no ambiguity." That said, I think it's easy to construct examples where there is no ambiguity about whether the diamond remained in the room, yet there is no sequence of actions a human could take that would let them figure out the answer. For example, we can imagine simple toy universes where we understand exactly what features of the world give rise to human beliefs about diamonds and where we can say unambiguously that the same features are/aren't present in a given situation. In general I feel a lot better about our definitions when we are using them to arbitrate a counterexample than if we were trying to give a formal definition. If all the counterexamples involved border cases of the concepts, where there was arguable ambiguity about whether the diamond really stayed in the room, then it would seem important to firm up these concepts but right now it feels like it is easy to just focus on cases where algorithms unambiguously fail. (That methodological point isn't obvious though---it may be that precise definitions are very useful for solving the problem even if you don't need them to judge current solutions as inadequate. Or it may be that actually existing counterexamples are problematic in ways we don't recognize. Pushback on these fronts is always welcome, but right now I feel pretty comfortable with the situation.)
ARC's first technical report: Eliciting Latent Knowledge

Proposing experiments that are more specifically exposing tampering does sound like what I meant, and I agree that my attempt to reduce this to experiments that expose confidently wrong human predictions may not be precise enough.

How do we use this to construct new sensors that allow the human to detect tampering?

I know this is crossed out but thought it might help to answer anyway: the proposed experiment includes instructions for how to set the experiment up and how to read the results. These may include instructions for building new sensors.

The proposed

... (read more)
ARC's first technical report: Eliciting Latent Knowledge

Here’s an attempt at condensing an issue I’m hung up on currently with ELK. This also serves as a high-level summary that I’d welcome poking at in case I’m getting important parts wrong.
 

The setup for ELK is that we’re trying to accurately label a dataset of (observation, action, predicted subsequent observation) triples for whether the actions are good. (The predicted subsequent observations can be optimised for accuracy using automated labels - what actually gets observed subsequently - whereas the actions need their labels to come from a source of ... (read more)

4Paul Christiano6mo
Echoing Mark and Ajeya: I basically think this distinction is real and we are talking about problem 1 instead of problem 2. That said, I don't feel like it's quite right to frame it as "states" that the human does or doesn't understand. Instead we're thinking about properties of the world as being ambiguous or not in a given state. As a silly example, you could imagine having two rooms where one room is normal and the other is crazy. Then questions about the first room are easy and questions about the second are hard. But in reality the degrees of freedom will be much more mixed up than that. To give some more detail on my thoughts on state:
  • Obviously the human never knows the "real" state, which has a totally different type signature than their beliefs.
  • So it's natural to talk about knowing states based on correctly predicting what will happen in the future starting from that state. But it's ~never the case that the human's predictions about what will happen next are nearly as good as the predictor's.
  • We could try to say "you can make good predictions about what happens next for typical actions" or something, but even for typical actions the human predictions are bad relative to the predictor, and it's not clear in what sense they are "good" other than some kind of calibration condition.
  • If we imagine an intuitive translation between two models of reality, most "weird" states aren't outside of the domain of the translation, it's just that there are predictively important parts of the state that are obscured by the translation (effectively turning into noise, perhaps very surprising noise).
Despite all of that, it seems like it really is sometimes unambiguous to say "You know that thing out there in the world that you would usually refer to by saying 'the diamond is sitting there and nothing weird happened to it'? That thing which would lead you to predict that the camera will show a still frame of a diamond? That th
1Mark Xu6mo
I think that problem 1 and problem 2 as you describe them are potentially talking about the same phenomenon. I'm not sure I'm understanding correctly, but I think I would make the following claims:
  • Our notion of narrowness is that we are interested in solving the problem where the question we're asking is such that a state always resolves a question. E.g. there isn't any ambiguity around whether a state "really contains a diamond". (Note that there is ambiguity around whether the human could detect the diamond from any set of observations because there could be a fake diamond or nanobots filtering what the human sees). It might be useful to think of this as an empirical claim about diamonds.
  • We are explicitly interested in solving some forms of problem 2, e.g. we're interested in our AI being able to answer questions about the presence/absence of diamonds no matter how alien the world gets. In some sense, we are interested in our AI answering questions the same way a human would answer questions if they "knew what was really going on", but that "knew what was really going on" might be a misleading phrase. I'm not imagining that "knowing what is really going on" to be a very involved process; intuitively, it means something like "the answer they would give if the sensors are 'working as intended'". In particular, I don't think that, for the case of the diamond, "Further judgement, deliberation, and understanding is required to determine what the answer should be in these strange worlds."
  • We want to solve these versions of problem 2 because the speed "things getting weirder" in the world might be much faster than human ability to understand what's going on the world. In these worlds, we want to leverage the fact that answers to "narrow" questions are unambiguous to incentivize our AIs to give humans a locally understandable environment in which to deliberate.
  • We're not
2Ajeya Cotra6mo
My understanding is that we are eschewing Problem 2, with one caveat -- we still expect to solve the problem if the means by which the diamond was stolen or disappeared could be beyond a human's ability to comprehend, as long as the outcome (that the diamond isn't still in the room) is still comprehensible. For example, if the robber used some complicated novel technology to steal the diamond and hack the camera, there would be many things about the state that the human couldn't understand even if the AI tried to explain it to them (at least without going over our compute budget for training). But nevertheless it would still be an instance of Problem 1 because they could understand the basic notion of "because of some actions involving complicated technology, the diamond is no longer in the room, even though it may look like it is."
Are limited-horizon agents a good heuristic for the off-switch problem?

Just a few links to complement Abram's answer:

On how seemingly myopic training schemes can nonetheless produce non-myopic behaviour:

On approval-directed agents:

ARC's first technical report: Eliciting Latent Knowledge

Thanks for the reply! I think you’ve understood correctly that the human rater needs to understand the proposed experiment – i.e., be able to carry it out and have a confident expectation about the outcome – in order to rate the proposer highly.

Here’s my summary of your point: for some tampering actions, there are no experiments that a human would understand in the above sense that would expose the tampering. Therefore that kind of tampering will result in low value for the experiment proposer (who has no winning strategy), and get rated highly.

This is a c... (read more)

3Mark Xu6mo
My point is either that:
  • it will always be possible to find such an experiment for any action, even desirable ones, because the AI will have defended the diamond in a way the human didn't understand or the AI will have deduced some property of diamonds that humans thought they didn't have
  • or there will be some tampering for which it's impossible to find an experiment, because in order to avoid the above problem, you will have to restrict the space of experiments
ARC's first technical report: Eliciting Latent Knowledge

Tweaking your comment slightly:

I'd be scared that the "Am I tricking you?" head just works by:

  1. Predicting what the human will predict [when experiment E is performed]
  2. Predicting what will actually happen [when experiment E is performed]
  3. Output a high value iff the human's prediction is confident but different from reality.

If this is the case, then the head will report detectable tampering but not undetectable tampering.

Yes this is correct for the Value head. But how does detectable vs undetectable apply to this builder strategy? Compared to what's in the repo... (read more)

2Paul Christiano6mo
Suppose the value head learns to predict "Will the human be confidently wrong about the outcome of this experiment," where an 'experiment' is a natural language description of a sequence of actions that the human could execute. And then the experiment head produces natural language descriptions of actions that a human could take for which they'd be confidently wrong. What do you then do with this experiment proposer, and how do you use it to train the SmartVault? Are you going to execute a large number of experiments, and if so what do you do afterwards? How do we use this to construct new sensors that allow the human to detect tampering?
ETA: here's my best guess after reading the other comment---after taking a sequence of actions, we run the experiment proposer to suggest an experiment that will allow the human to notice if tampering actually occurred. This seems like it could be different from "experiment that human would be confidently wrong about" since a human who doesn't understand the environment dynamics will always have tons of experiments they are confidently wrong about, but instead we want to find an experiment that causes them to update strongly to believing that tampering occurred. Is that right? If so it seems like there are a few problems:
  • The proposed experiment could itself perform tampering (after which the human will correctly infer that tampering occurred, thereby giving the experiment a high score), or exploit the human errors to make it appear that tampering had occurred (e.g. if the human is wrong about how sensors work then you can construct new sensors that will appear to report tampering).
  • If you tamper with the mechanism by which the human "executes" the experiment (e.g. by simply killing the human and replacing them with a different experiment-executor) then it seems like the experiment proposer will always lose. This maybe depends on details of exactly how the setup works.
  • Like Mark I do expect forms
ARC's first technical report: Eliciting Latent Knowledge

Here’s a Builder move (somewhat underdeveloped but I think worth posting now even as I continue to think - maybe someone can break it decisively quickly).

Training strategy: Add an “Am I tricking you?” head to the SmartVault model.

The proposed flow chart for how the model works has an “Experiment Proposer” coming out of “Figure out what’s going on”, and two heads out of Experiment Proposer, called “Experiment Description” and “Value” (meaning “Expected Value of Experiment to the Proposer”). I won’t make use of the question-answering Reporter/Answer parts, s... (read more)

3Paul Christiano6mo
I'd be scared that the "Am I tricking you?" head just works by:
  1. Predicting what the human will predict
  2. Predicting what will actually happen
  3. Output a high value iff the human's prediction is confident but different from reality.
If this is the case, then the head will report detectable tampering but not undetectable tampering. To get around this problem, you need to exploit some similarity between ways of tricking you that are detectable and ways that aren't, e.g. starting with the same subsequence or sharing some human-observable feature of the situation. I think there's a big genre of proposals that try to leverage that kind of structure, which might be promising (though it's not the kind of thing I'm thinking about right now).
3Mark Xu6mo
Thanks for your proposal! I'm not sure I understand how the "human is happy with experiment" part is supposed to work. Here are some thoughts:
  • Eventually, it will always be possible to find experiments where the human confidently predicts wrongly. Situations I have in mind are ones where your AI understands the world far better than you, so can predict that e.g. combining these 1000 chemicals will produce self-replicating protein assemblages, whereas the human's best guess is going to be "combining 1000 random chemicals doesn't do anything"
  • If the human is unhappy with experiments that are complicated, then advanced ways of hacking the video feed that require experiments of comparable complexity to reveal are not going to be permitted. For instance, if the diamond gets replaced by a fake, one might have to perform a complicated imaging technique to determine the difference. If the human doesn't already understand this technique, then they might not be happy with the experiment.
  • If the human doesn't really understand the world that well, then it might not be possible to find an experiment for which the human is confident in the outcome that distinguishes the diamond from a fake. For instance, if a human gets swapped out for a copy of a human that will make subtly different moral judgments because of factors the human doesn't understand, this copy will be identical in all ways that a human can check, e.g. there will be no experiment that a human is confident in that will distinguish the copy of the human from the real thing.
Consequentialism & corrigibility

Thanks for the reply! My comments are rather more thinking-in-progress than robust-conclusions than I’d like, but I figure that’s better than nothing.

Would it have helped if I had replaced "preferences over trajectories" with the synonymous "preferences that are not exclusively about the future state of the world"?

(Thanks for doing that!) I was going to answer ‘yes’ here, but… having thought about this more, I guess I now find myself confused about what it means to have preferences in a way that doesn't give rise to consequentialist behaviour. Having (unst... (read more)

2Steve Byrnes6mo
Thanks, this is helpful! Oh, sorry, I'm thinking of a planning agent. At any given time it considers possible courses of action, and decides what to do based on "preferences". So "preferences" are an ingredient in the algorithm, not something to be inferred from external behavior. That said, if someone "prefers" to tell people what's on his mind, or if someone "prefers" to hold their fork with their left hand … I think those are two examples of "preferences" in the everyday sense of the word, but that they're not expressible as a rank-ordering of the state of the world at a future date. Instead of "desire to be corrigible", I'll switch to something more familiar: "desire to save the rainforest". Let's say my friend Sally is "trying to save the rainforest". There's no "save the rainforest detector" external to Sally, which Sally is trying to satisfy. Instead, the "save the rainforest" concept is inside Sally's own head. When Sally decides to execute Plan X because it will help save the rainforest, that decision is based on the details of Plan X as Sally herself understands it. Let's also assume that Sally's motivation is ego-syntonic (which we definitely want for our AGIs): In other words, Sally wants to save the rainforest and Sally wants to want to save the rainforest. Under those circumstances, I don't think saying something like "Sally wants to fool the recognizer" is helpful. That's not an accurate description of her motivation. In particular, if she were offered an experience machine [https://en.wikipedia.org/wiki/Experience_machine] or brain-manipulator that could make her believe that she has saved the rainforest, without all the effort of actually saving the rainforest, she would emphatically turn down that offer. So what can go wrong? Let's say Sally and Ahmed are working at the same rainforest advocacy organization. They're both "trying to save the rainforest", but maybe those words mean slightly different things to them. Let's quiz them with a l
Consequentialism & corrigibility

Thanks for writing this up! I appreciate the summarisation achieved by the background sections, and the clear claims made in bold in the sketch.

The "preferences (purely) over future states" and "preferences over trajectories" distinction is getting at something, but I think it's broken for a couple of reasons. I think you've come to a similar position by noticing that people have preferences both over states and over trajectories. But I remain confused about the relationship between the two posts (Yudkowsky and Shah) you mentioned at the start. Anyway, her... (read more)

2Steve Byrnes6mo
Thanks! Hmm. I think there's a notion of "how much a set of preferences gives rise to stereotypically-consequentialist behavior". Like, if you see an agent behaving optimally with respect to preferences about "how the world will be in 10 years", they would look like a consequentialist goal-seeking agent. Even if you didn't know what future world-states they preferred, you would be able to guess with high confidence that they preferred some future world-states over others. For example, they would almost certainly pursue convergent instrumental subgoals [https://www.lesswrong.com/tag/instrumental-convergence] like power-seeking [https://www.alignmentforum.org/s/fSMbebQyR4wheRrvk]. By contrast, if you see an agent which, at any time, behaves optimally with respect to preferences about "how the world will be in 5 seconds", it would look much less like that, especially if after each 5-second increment they roll a new set of preferences. And an agent which, at any time, behaves optimally with respect to preferences over what it's doing right now would look not at all like a consequentialist goal-seeking agent. (We care about "looking like a consequentialist goal-seeking agent" because corrigible AIs do NOT "look like a consequentialist goal-seeking agent".) Now we can say: By the time-reversibility of the laws of physics, a rank-ordering of "states-of-the-world at future time T (= midnight on January 1 2050)" is equivalent to a rank-ordering of "universe-histories up through future time T". But I see that as kinda an irrelevant technicality. An agent that makes decisions myopically according to (among other things) a "preference for telling the truth right now" in the universe-history picture would cash out as "some unfathomably complicated preference over the microscopic configuration of atoms in the universe at time T". And indeed, an agent with that (unfathomably complicated) preference ordering would not look like a consequentialist goal-seeking agent. So by the s
Discussion with Eliezer Yudkowsky on AGI interventions

why would intelligence generalize to "qualitatively new thought processes, things being way out of training distribution", but corrigibility would not?

This sounds confused to me: the intelligence is the "qualitatively new thought processes". The thought processes aren't some new regime that intelligence has to generalize to. Also to answer the question directly, I think the claim is that intelligence (which I'd say is synonymous for these purposes with capability) is simpler and more natural than corrigibility (i.e., the last claim - I don't think these th... (read more)

The Pointers Problem: Human Values Are A Function Of Humans' Latent Variables

Here are two versions of "an AGI will understand very well what I mean":

  1. Given things in my world model / ontology, the AGI will know which things they translate to in its own world model / ontology, such that the referents (the things "in the real world" being pointed at from our respective models) are essentially coextensive.
  2. For any behaviour I could exhibit (such as pressing a button, or expressing contentment with having reached common understanding in a dialogue) that, for me, turns on the words being used, the AGI is very good at predicting my behavio
... (read more)
2Stuart Armstrong7mo
Thanks for the link.
Coherence arguments do not entail goal-directed behavior

Actually, no matter what the policy is, we can view the agent as an EU maximizer. The construction is simple: the agent can be thought of as optimizing the utility function U, where U(h, a) = 1 if the policy would take action a given history h, else 0. Here I’m assuming that U is defined over histories that are composed of states/observations and actions.

This is not the type signature for a utility function that matters for the coherence arguments (by which I don't mean VNM - see this comment). It does often fit the type signature in the way those arguments a... (read more)
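For concreteness, here's a minimal sketch (mine, illustrating only the construction quoted above) of how any policy can be dressed up as an EU maximizer with this history-and-action type signature; the particular policy, histories, and actions are made up for illustration.

```python
def trivial_utility(policy):
    """The quoted construction: U(h, a) = 1 iff the policy takes a after h."""
    def U(history, action):
        return 1.0 if policy(history) == action else 0.0
    return U

# A made-up policy over made-up histories/actions, purely for illustration.
policy = lambda history: "left" if len(history) % 2 == 0 else "right"
U = trivial_utility(policy)

history, actions = ("left", "right"), ["left", "right"]
best = max(actions, key=lambda a: U(history, a))
assert best == policy(history)  # maximizing U just replays the policy
```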

2Rohin Shah7mo
Have you seen this post [https://www.lesswrong.com/posts/vphFJzK3mWA4PJKAg/coherent-behaviour-in-the-real-world-is-an-incoherent] , which looks at the setting you mentioned? From my perspective, I want to know why it makes sense to assume that the AI system will have preferences over world states, before I start reasoning about that scenario. And there are reasons to expect something along these lines! I talk about some of them in the next post in this sequence! But I think once you've incorporated some additional reason like "humans will want goal-directed agents" or "agents optimized to do some tasks we write down will hit upon a core of general intelligence", then I'm already on board that you get goal-directed behavior, and I'm not interested in the construction in this post any more. The only point of the construction in this post is to demonstrate that you need this additional reason.
Life and expanding steerable consequences

Now, four billion years later, we are about to set in motion a second seed.

We can also view this as part of the effects of the initial seed.

 

I'm a little confused about what the large-scale predictable, and steerable, consequences of the initial seed are. For predictable consequences, I can imagine things like proliferation of certain kinds of molecules (like proteins). But where's the steerability?

Ngo and Yudkowsky on alignment difficulty

A couple of direct questions I'm stuck on:

  • Do you agree that Flint's optimizing systems are a good model (or even definition) of outcome pumps?
  • Are black holes and fires reasonable examples of outcome pumps?

I'm asking these to understand the work better.

Currently my answers are:

  • Yes. Flint's notion is one I came to independently when thinking about "goal-directedness". It could be missing some details, but I find it hard to snap out of the framework entirely.
  • Yes. But maybe not the most informative examples. They're highly non-retargetable.
Yudkowsky and Christiano discuss "Takeoff Speeds"

I wonder what effect there is from selecting for reading the third post in a sequence of MIRI conversations from start to end and also looking at the comments and clicking links in them.

Ngo and Yudkowsky on alignment difficulty

Thanks for the replies! I'm still somewhat confused but will try again to both ask the question more clearly and summarise my current understanding.

What, in the case of consequentialists, is analogous to the water funnelled by literal funnels? Is it possibilities-according-to-us? Or is it possibilities-according-to-the-consequentialist? Or is it neither (or both) of those?

To clarify a little what the options in my original comment were, I'll say what I think they correspond to for literal funnels. Option 1 corresponds to the fact that funnels are usually n... (read more)

My reply to your distinction between 'consequentialists' and 'outcome pumps' would be, "Please forget entirely about any such thing as a 'consequentialist' as you defined it; I would now like to talk entirely about powerful outcome pumps.  All understanding begins there, and we should only introduce the notion of how outcomes are pumped later in the game.  Understand the work before understanding the engines; nearly every key concept here is implicit in the notion of work rather than in the notion of a particular kind of engine."

(Modulo that lots... (read more)

Ngo and Yudkowsky on alignment difficulty

A couple of other arguments the non-MIRI side might add here:

  • The things AI systems today can do are already hitting pretty narrow targets. E.g., generating English text that is coherent is not something you’d expect from a random neural network. Why is corrigibility so much more of a narrow target than that? (I think Rohin may have said this to me at some point.)
  • How do we imagine scaled up humans [e.g. thinking faster, thinking in more copies, having more resources, or having more IQ] to be effective? Wouldn’t they be corrigible? Wouldn't they have nice goals? What can we learn from the closest examples we already have of scaled up humans? (h/t Shahar for bringing this point up in conversation).
5Rohin Shah7mo
I'll note that this is framed a bit too favorably to me, the actual question is "why is an effective and corrigible system so much more of a narrow target than that?"
Ngo and Yudkowsky on alignment difficulty

Here Daniel Kokotajlo and I try to paraphrase the two sides of part of the disagreement and point towards a possible crux about the simplicity of corrigibility.

We are training big neural nets to be effective. (More on what effective means elsewhere; it means something like “being able to steer the future better than humans can.”) We want to have an effective&corrigible system, and we are worried that instead we’ll get an effective&deceptive system. Ngo, Shah, etc. are hopeful that it won’t be “that hard” to get the former and avoid the latter; mayb... (read more)

3Ramana Kumar7mo
A couple of other arguments the non-MIRI side might add here: * The things AI systems today can do are already hitting pretty narrow targets. E.g., generating English text that is coherent is not something you’d expect from a random neural network. Why is corrigibility so much more of a narrow target than that? (I think Rohin may have said this to me at some point.) * How do we imagine scaled up humans [e.g. thinking faster, thinking in more copies, having more resources, or having more IQ] to be effective? Wouldn’t they be corrigible? Wouldn't they have nice goals? What can we learn from the closest examples we already have of scaled up humans? (h/t Shahar for bringing this point up in conversation).
4Daniel Kokotajlo7mo
For (a): Deception is a convergent instrumental goal; you get it “for free” when you succeed in making an effective system, in the sense that the simplest, most-likely-to-be-randomly-generated effective systems are deceptive. Corrigibility by contrast is complex and involves making various nuanced decisions between good and bad sorts of influence on human behavior. For (b): If you take an effective system and modify it to be corrigible, this will tend to make it less effective. By contrast, deceptiveness (insofar as it arises “naturally” as a byproduct of pursuing convergent instrumental goals effectively) does not “get in the way” of effectiveness, and even helps in some cases! Ngo’s (and Shah’s) position (we think) is that the data we’ll be using to select our systems will be heavily entangled with human preferences - we’ll indeed be trying to use human preferences to guide and shape the systems - so there’s a strong bias towards actually learning them. You don’t have to get human preferences right in all their nuance and detail to know some basic things like that humans generally don’t want to die or be manipulated/deceived. I think they mostly bounce off the claim that “effectiveness” has some kind of “deep underlying principles” that will generalise better than any plausible amount of human preference data actually goes into building the effective system. We imagine Shah saying: “1. Why will the AI have goals at all?, and 2. If it does have goals, why will its goals be incompatible with human survival? Sure, most goals are incompatible with human survival, but we’re not selecting uniformly from the space of all goals.” It seems to us that Ngo, Shah, etc. draw intuitive support from analogy to humans, whereas Yudkowsky etc. draw intuitive support from the analogy to programs and expected utility equations. If you are thinking about a piece of code that describes a bayesian EU-maximizer, and then you try to edit the code to make the agent corrigible, it’s obv
Ngo and Yudkowsky on alignment difficulty

I am interested in the history-funnelling property -- the property of being like a consequentialist, or of being effective at achieving an outcome -- and have a specific confusion I'd love to get insight on from anyone who has any.
 

Question: Possible outcomes are in the mind of a world-modeller - reality just is as it is (exactly one way) and isn't made of possibilities. So in what sense do the consequentialist-like things Yudkowsky is referring to funnel history?

Option 1 (robustness/behavioural/our models): They achieve narrow outcomes with respect t... (read more)

6Eliezer Yudkowsky7mo
To Rob's reply, I'll add that my own first reaction to your question was that it seems like a map-territory / perspective issue as appears in eg thermodynamics? Like, this has a similar flavor to asking "What does it mean to say that a classical system is in a state of high entropy when it actually only has one particular system state?" Adding this now in case I don't have time to expand on it later; maybe just saying that much will help at all, possibly.
4Rob Bensinger7mo
I'm not sure that I understand the question, but my intuition is to say: they funnel world-states into particular outcomes in the same sense that literal funnels funnel water into particular spaces, or in the same sense that a slope makes things roll down it. If you find water in a previously-empty space with a small aperture, and you're confused that no water seems to have spilled over the sides, you may suspect that a funnel was there. Funnels are part of a larger deterministic universe, so maybe in some sense any given funnel (like everything else) 'had to do exactly that thing'. Still, we can observe that funnels are an important part of the causal chain in these cases, and that places with funnels tend to end up with this type of outcome much more often. Similarly, consequentialists tend to remake parts of the world (typically, as much of the world as they can reach) into things that are high in their preference ordering. From Optimization and the Singularity [https://www.lesswrong.com/posts/HFTn3bAT6uXSNwv4m/optimization-and-the-singularity] : But it's not clear what a "preference" is, exactly. So a more general way of putting it, in Recognizing Intelligence [https://www.lesswrong.com/posts/LZMeuRGQhSw77XewC/recognizing-intelligence], is: "Consequentialists funnel the universe into shapes that are higher in their preference ordering" isn't a required inherent truth for all consequentialists; some might have weird goals, or be too weak to achieve much. Likewise, some literal funnels are broken or misshapen, or just never get put to use. But in both cases, we can understand the larger class by considering the unusual function well-working instances can perform. (In the case of literal funnels, we can also understand the class by considering its physical properties rather than its function/behavior/effects. Eventually we should be able to do the same for consequentialists, but currently we don't know what physical properties of a system make it consequentia
Optimization Concepts in the Game of Life

Thanks! I'd had a bit of a look through that book before and agree it's a great resource. One thing I wasn't able to easily find is examples of robust patterns. Does anyone know if there's been much investigation of robustness in the Life community? The focus I've seen seems to be more on particular constructions (used in its entirety as the initial state for a computation), rather than on how patterns fare when placed in various ranges of different contexts.

3Oscar_Cunningham8mo
Some related discussions:
  1. https://www.conwaylife.com/forums/viewtopic.php?f=2&t=979
  2. https://www.conwaylife.com/forums/viewtopic.php?f=7&t=2877
  3. https://www.conwaylife.com/forums/viewtopic.php?p=86140#p86140
My own thoughts:
  • Patterns in GoL are generally not robust. Typically changing anything will cause the whole pattern to disintegrate in a catastrophic explosion and revert to the usual 'ash' of randomly placed small still lifes and oscillators along with some escaping gliders.
  • The pattern Eater 2 [https://www.conwaylife.com/wiki/Eater_2] can eat gliders along 4 adjacent lanes.
  • The Highway Robber [https://www.conwaylife.com/wiki/Highway_robber] can eat gliders travelling along a lane right at the edge of the pattern, such that gliders on the next lane pass by unaffected. So one can use several staggered highway robbers to make a wall which eats any gliders coming at it from a given direction along multiple adjacent lanes. The wall will be very oblique and will fail if two gliders come in on the same lane too close together.
  • The block [https://www.conwaylife.com/wiki/Block] is robust to deleting any one of its live cells, but is not robust to placing single live cells next to it.
  • The maximum speed at which GoL patterns can propagate into empty space is 1 cell every 2 generations, measured in the L_1 norm. Spaceships which travel at this speed limit (such as the glider [https://www.conwaylife.com/wiki/Glider], XWSS [https://www.conwaylife.com/wiki/XWSS], and Sir Robin [https://www.conwaylife.com/wiki/Sir_Robin]) are therefore robust to things happening behind them, in the sense that nothing can catch up with them.
  • It's long been hypothesised that it should be possi
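Robustness facts like the block example above are easy to check directly. Here's a minimal sketch (my own) of a Game of Life step on a set-of-live-cells representation, verifying that a block with any single cell deleted is repaired in one generation.

```python
from collections import Counter

def step(live):
    """One Game of Life generation; `live` is a set of (x, y) live cells."""
    neighbour_counts = Counter((x + dx, y + dy)
                               for (x, y) in live
                               for dx in (-1, 0, 1) for dy in (-1, 0, 1)
                               if (dx, dy) != (0, 0))
    return {cell for cell, n in neighbour_counts.items()
            if n == 3 or (n == 2 and cell in live)}

block = {(0, 0), (0, 1), (1, 0), (1, 1)}
for cell in block:
    damaged = block - {cell}
    print(cell, "repaired in one step:", step(damaged) == block)  # True each time
```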
Optimization Concepts in the Game of Life

Nice comment - thanks for the feedback and questions!

  1. I think the specific example we had in mind has a singleton set of target states: just the empty board. The basin is larger: boards containing no groups of more than 3 live cells. This is a refined version of "death" where even the noise is gone. But I agree with you that "high entropy" or "death", intuitively, could be seen as a large target, and hence maybe not an optimization target. Perhaps compare to the black hole.
  2. Great suggestion - I think the "macrostate" terminology may indeed be a good fit / wo
... (read more)
2Edouard Harris8mo
Thanks! I think this all makes sense. 1. Oh yeah, I definitely agree with you that the empty board would be an optimizing system in the GoL context. All I meant was that the "Death" square in the examples [https://www.alignmentforum.org/posts/mL8KdftNGBScmBcBg/#Examples_] table might not quite correspond to it in the analogy, since the death property is perhaps not an optimization target by the definition. Sorry if that wasn't clear. 2. :) 3. 4. 5. Got it, thanks! So if I've understood correctly, you are currently only using the mask as a way to separate the agent from its environment at instantiation, since that is all you really need to do to be able to define properties like robustness and retargetability in this context. That seems reasonable.
Intelligence or Evolution?

Do you know of any formal or empirical arguments/evidence for the claim that evolution stops being relevant when there exist sufficiently intelligent entities (my possibly incorrect paraphrase of "Darwinian evolution as such isn't a thing amongst superintelligences")?

1Donald Hobson8mo
Error correction codes exist. They are low cost in terms of memory etc. Having a significant portion of your descendants mutate and do something you don't want is really bad. If error correcting to the point where there is not a single mutation in the future only costs you 0.001% resources in extra hard drive, then <0.001% resources will be wasted due to mutations. Evolution is kind of stupid compared to super-intelligences. Mutations are not going to be finding improvements, because the superintelligence will be designing its own hardware and the hardware will already be extremely optimized. If the superintelligence wants to spend resources developing better tech, it can do that better than evolution. So squashing evolution is a convergent instrumental goal, and easily achievable for an AI designing its own hardware.
Finite Factored Sets: Conditional Orthogonality

That's right. A partial function can be thought of as a subset (of its domain) and a total function on that subset. And a (total) function can be thought of as a partition (of its domain): the parts are the inverse images of each point in the function's image.
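To make the second half concrete, here's a minimal sketch (my own) of viewing a total function as the partition of its domain into inverse images of the points in its image.

```python
from collections import defaultdict

def partition_of(f, domain):
    """View a total function as a partition of its domain: the parts are the
    inverse images of the points in the function's image."""
    parts = defaultdict(set)
    for x in domain:
        parts[f(x)].add(x)
    return list(parts.values())

print(partition_of(lambda n: n % 3, range(10)))
# three parts: {0, 3, 6, 9}, {1, 4, 7}, {2, 5, 8} (print order may vary)
```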

The Blackwell order as a formalization of knowledge

Blackwell’s theorem says that the conditions under which  can be said to be more generally useful than  are precisely the situations where  is a post-garbling of .

Are the indices the wrong way around here?

1Alex Flint9mo
Yes. Thank you. Fixed.
Introduction to Cartesian Frames

A formalisation of the ideas in this sequence in higher-order logic, including machine verified proofs of all the theorems, is available here.

Time in Cartesian Frames

"subagent [] that could choose " -- do you mean  or  or neither of these? Since  is not closed under unions, I don't think the controllables version of "could choose" is closed under coarsening the partition. (I can prove that the ensurables version is closed; but it would have been nice if the controllables version worked.)

ETA: Actually controllables do work out if I ignore the degenerate case of a singleton partition of the world. This is because, when considering partitions of the world, ensur... (read more)

Time in Cartesian Frames

I have something suggestive of a negative result in this direction:

Let  be the prime-detector situation from Section 2.1 of the coarse worlds post, and let  be the (non-surjective) function that "heats" the outcome (changes any "C" to an "H"). The frame  is clearly in some sense equivalent to the one from the example (which deletes the temperature from the outcome) -- I am using my version just to stay within the same category when comparing frames. As a reminder, primality is not observable in  but is ob... (read more)

Eight Definitions of Observability

With the other problem resolved, I can confirm that adding an A=∅ escape clause to the multiplicative definitions works out.

Eight Definitions of Observability

Using the idea we talked about offline, I was able to fix the proof - thanks Rohin!
Summary of the fix:
When  and  are defined, additionally assume they are biextensional (take their biextensional collapse), which is fine since we are trying to prove a biextensional equivalence. (By the way this is why we can't take , since we might have  after biextensional collapse.) Then to prove , observe that for all  which means , hence ... (read more)

Eight Definitions of Observability

I presume the fix here will be to add an explicit A=∅ escape clause to the multiplicative definitions. I haven't been able to confirm this works out yet (trying to work around this), but it at least removes the null counterexample.

2Ramana Kumar1y
With the other problem resolved, I can confirm that adding an A=∅ escape clause to the multiplicative definitions works out.
Eight Definitions of Observability

How is this supposed to work (focusing on the  claim specifically)?

and so

Thus, .

Earlier,  was defined as follows:

given by  and 

but there is no reason to suppose  above.

3Rohin Shah1y
The problem is a bit earlier actually: This isn't true, because ∙1 doesn't just ignore b1 here (since r∈R1). I think the route is to say "Let h(b′2)=f2. Then ∙1 must treat f1 and f2 identically, meaning that either they are equal, or the frame is biextensionally equivalent to one where they are equal."
Time in Cartesian Frames

It suffices to establish that 

I think the  and  here are supposed to be  and 

Eight Definitions of Observability

Indeed I think the A=∅ case may be the basis of a counterexample to the claim in 4.2. I can prove for any (finite) W with |W|>1 that there is a finite partition V of W such that C's agent observes V according to the assuming definition but does not observe V according to the constructive multiplicative definition, if I take C=null.

2Ramana Kumar1y
I presume the fix here will be to add an explicit A=∅ escape clause to the multiplicative definitions. I haven't been able to confirm this works out yet (trying to work around this [https://www.lesswrong.com/posts/5R9dRqTREZriN9iL7/eight-definitions-of-observability?commentId=vc7NcsErj4NgWFdTX] ), but it at least removes the null counterexample.
Eight Definitions of Observability

Let 

nit:  should be  here

and let  be an element of .

and the second  should be . I think for these  and  to exist you might need to deal with the  case separately (as in Section 5). (Also couldn't you just use the same  twice?)

1Ramana Kumar1y
Indeed I think the A=∅ case may be the basis of a counterexample to the claim in 4.2. I can prove for any (finite) W with |W|>1 that there is a finite partition V of W such that C's agent observes V according to the assuming definition but does not observe V according to the constructive multiplicative definition, if I take C=null.
Eight Definitions of Observability

UPDATE: I was able to prove  in general whenever  and  are disjoint and both in , with help from Rohin Shah, following the "restrict attention to world " approach I hinted at earlier.
 
