# All of paulfchristiano's Comments + Replies

Prizes for ELK proposals

I'd like to get different answers in those two worlds. That definitely requires having some term in the loss that is different in W1 and W2. There are three ways the kinds of proposals in the doc can handle this:

• Consistency checks will behave differently in W1 and W2. Even if a human can never produce different answers to Q1 and Q2, they can talk about situations where Q1 and Q2 differ and describe how the answers to those questions relate to all the other facts about the world (and to the answer to Q).
• If language is rich enough, and we are precise enough
Alex Ray's Shortform

"The goal is" -- is this describing Redwood's research or your research or a goal you have more broadly?

My general goal, Redwood's current goal, and my understanding of the goal of adversarial training (applied to AI-murdering-everyone) generally.

I'm curious how this is connected to "doesn't write fiction where a human is harmed".

"Don't produce outputs where someone is injured" is just an arbitrary thing not to do. It's chosen to be fairly easy not to do (and to have the right valence so that you can easily remember which direction is good and which direct... (read more)

**A Ray** (+1, 6d): I think this is the crux-y part for me.

My basic intuition here is something like "it's very hard to get contemporary prosaic LMs to not do a thing they already do (or have high likelihood of doing)", and this intuition points me in the direction of "conditionally training them to only do that thing in certain contexts" instead being easier in a way that matters.

My intuitions are based on a bunch of assumptions that I have access to and probably some that I don't. Like, I'm basically only thinking about large language models, which are at least pre-trained on a large swath of a natural language distribution. I'm also thinking about using them generatively, which means sampling from their distribution -- which implies getting a model to "not do something" means getting the model to not put probability on that sequence.

At this point it still is a conjecture of mine -- that conditionally prefixing behaviors we wish to control is easier than getting them not to do some behavior unconditionally -- but I think it's probably testable? A thing that would be useful to me in designing an experiment to test this would be to hear more about adversarial training as a technique -- as it stands I don't know much more than what's in that post.
Alex Ray's Shortform

The goal is not to remove concepts or change what the model is capable of thinking about, it's to make a model that never tries to deliberately kill everyone. There's no doubt that it could deliberately kill everyone if it wanted to.

**A Ray** (+2, 6d): "The goal is" -- is this describing Redwood's research or your research or a goal you have more broadly? I'm curious how this is connected to "doesn't write fiction where a human is harmed".
The Solomonoff Prior is Malign

I'm not sure I follow your reasoning, but IBP sort of does that. In IBP we don't have subjective expectations per se, only an equation for how to "updatelessly" evaluate different policies.

It seems like any approach that evaluates policies based on their consequences is fine, isn't it? That is, malign hypotheses dominate the posterior for my experiences, but not for things I consider morally valuable.

I may just not be understanding the proposal for how the IBP agent differs from the non-IBP agent. It seems like we are discussing a version that defines values differently, but where neither agent uses Solomonoff induction directly. Is that right?

**Vanessa Kosoy** (+2, 14d): Why? Maybe you're thinking of UDT? In which case, it's sort of true, but IBP is precisely a formalization of UDT + extra nuance regarding the input of the utility function.

Well, IBP is explained here [https://www.lesswrong.com/posts/gHgs2e2J5azvGFatb/infra-bayesian-physicalism-a-formal-theory-of-naturalized]. I'm not sure what kind of non-IBP agent you're imagining.
The Solomonoff Prior is Malign

Sure. But it becomes much more amenable to methods such as confidence thresholds, which are applicable to some alignment protocols at least.

It seems like you have to get close to eliminating malign hypotheses in order to apply such methods (i.e. they don't work once malign hypotheses have > 99.9999999% of the probability, so you need the benign hypothesis's description to be within about 30 bits of the malign one), and embeddedness alone isn't enough to get you there.
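To see where numbers like "99.9999999%" and "30 bits" come from (a back-of-envelope sketch of mine, not from the thread): under a description-length prior, a hypothesis that is `gap` bits shorter gets `2**gap` times the prior weight, and if both hypotheses fit the data equally well that ratio carries straight into the posterior.

```python
# Toy arithmetic: posterior weight of a malign hypothesis that is
# `gap_bits` shorter than the benign one, assuming both predict the
# observed data equally well (so likelihoods cancel).
def malign_posterior(gap_bits: float) -> float:
    malign_weight = 1.0
    benign_weight = 2.0 ** -gap_bits  # benign pays a 2**-gap prior penalty
    return malign_weight / (malign_weight + benign_weight)

# A 30-bit gap already pushes the malign share just past 0.999999999,
# the regime where confidence thresholds stop helping.
print(malign_posterior(30))
```

With a 0-bit gap the split is 50/50, so the "30 bits" figure is exactly the point where the malign share crosses the nine-nines threshold quoted above.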

I'm not sure I understand what you mean by "decision-theoretic approach"

I mean that you... (read more)

**Vanessa Kosoy** (+4, 14d): Why is embeddedness not enough? Once you don't have bridge rules, what is left is the laws of physics. What does the malign hypothesis explain about the laws of physics that the true hypothesis doesn't explain?

I suspect (but don't have a proof or even a theorem statement) that IB physicalism produces some kind of agreement theorem for different agents within the same universe, which would guarantee that the user and the AI should converge to the same beliefs (provided that both of them follow IBP).

I'm not sure I follow your reasoning, but IBP sort of does that. In IBP we don't have subjective expectations per se, only an equation for how to "updatelessly" evaluate different policies.

Okay, but suppose that the AI has real evidence for the simulation hypothesis (evidence that we would consider valid). For example, suppose that there is some metacosmological explanation for the precise value of the fine structure constant (not in the sense of "this is the value which supports life", but in the sense of "this is the value that simulators like to simulate"). Do you agree that in this case it is completely rational for the AI to reason about the world via reasoning about the simulators?
The Solomonoff Prior is Malign

Infra-Bayesian physicalism does ameliorate the problem by handling "embeddedness". Specifically, it ameliorates it by removing the need to have bridge rules in your hypotheses. This doesn't get rid of malign hypotheses entirely, but it does mean they no longer have an astronomical advantage in complexity over the true hypothesis.

I agree that removing bridge hypotheses removes one of the advantages for malign hypotheses. I didn't mention this because it doesn't seem like the way in which John is using "embeddedness"; for example, it seems orthogonal to the wa... (read more)

**Vanessa Kosoy** (+4, 15d): Sure. But it becomes much more amenable to methods such as confidence thresholds [https://www.lesswrong.com/posts/CnruhwFGQBThvgJiX/formal-solution-to-the-inner-alignment-problem], which are applicable to some alignment protocols at least.

I'm not sure I understand what you mean by "decision-theoretic approach". This attack vector has structure similar to acausal bargaining (between the AI and the attacker), so plausibly some decision theories that block acausal bargaining can rule this out as well. Is this what you mean?

This seems wrong to me. The inductor doesn't literally simulate the attacker. It reasons about the attacker (using some theory of metacosmology [https://www.lesswrong.com/posts/dPmmuaz9szk26BkmD/shortform?commentId=N8oamtFAhWKEbyCBq]) and infers what the attacker would do, which doesn't imply any wastefulness.
Prizes for ELK proposals

My guess is that "help humans improve their understanding" doesn't work anyway, at least not without a lot of work, but it's less obvious and the counterexamples get weirder.

It's less clear whether ELK is a less natural subproblem for the unlimited version of the problem. That is, if you try to rely on something like "human deliberation scaled up" to solve ELK, you probably just have to solve the whole (unlimited) problem along the way.

It seems to me like the core troubles with this point are:

• You still have finite training data, and we don't have a scheme
Prizes for ELK proposals

In some sense, ELK as a problem only even starts "applying" to pretty smart models (ones that can talk, including about counterfactuals/hypotheticals, as discussed in this appendix). This is closely related to how alignment as a problem only really starts applying to models smart enough to be thinking about how to pursue a goal.

I think that it's more complicated to talk about what models "really know" as they get dumber, so we want to use very smart models to construct unambiguous counterexamples. I do think that the spirit of the problem applies even to v... (read more)

Prizes for ELK proposals

The proposal here is to include a term in the loss function that incentivizes the AI to have a human-compatible ontology. For a cartoonish example, imagine that the term works this way: "The AI model gets a higher score to the degree that people doing 'digital neuroscience' would have an easier time, and find more interesting things, probing its 'digital brain.'" So an AI with neurons corresponding to diamonds, robbers, sensors, etc. would outscore an AI whose neurons can't easily be seen to correspond to any human-familiar concepts.

I think that a lot depe... (read more)

ARC's first technical report: Eliciting Latent Knowledge

Generally we are asking for an AI that doesn't give an unambiguously bad answer, and if there's any way of revealing the facts where we think a human would (defensibly) agree with the AI, then probably the answer isn't unambiguously bad and we're fine if the AI gives it.

There are lots of possible concerns with that perspective; probably the easiest way to engage with them is to consider some concrete case in which a human might make different judgments, but where it's catastrophic for our AI not to make the "correct" judgment. I'm not sure what kind of exa... (read more)

**Charlie Steiner** (+1, 17d): When you say "some case in which a human might make different judgments, but where it's catastrophic for the AI not to make the correct judgment," what I hear is "some case where humans would sometimes make catastrophic judgments." I think such cases exist and are a problem for the premise that some humans agreeing means an idea meets some standard of quality. Bumbling into such cases naturally might not be a dealbreaker, but there are some reasons we might get optimization pressure pushing plans proposed by an AI towards the limits of human judgment.
Counterexamples to some ELK proposals

If we train on data about what hypothetical sensors should show (e.g. by experiments where we estimate what they would show using other means, or by actually building weird sensors), we could just end up getting predictions of whatever process we used to generate that data.

In general the overall situation with these sensors seems quite similar to the original outer-level problem, i.e. training the system to answer "what would an ideal sensor show?" seems to run into the same issues as answering "what's actually going on?" E.g. your supersensor idea #3 seem... (read more)

Counterexamples to some ELK proposals

I would describe the overall question as "Is there a situation where an AI trained using this approach deliberately murders us?" and for ELK more specifically as "Is there a situation where an AI trained using this approach gives an unambiguously wrong answer to a straightforward question despite knowing better?" I generally don't think that much about the complexity of human values.

ARC's first technical report: Eliciting Latent Knowledge

Thanks for the kind words (and proposal)!

There was recently a discussion on LW about a scenario similar to the SmartVault one here. My proposed solution was to use reward uncertainty -- as applied to the SmartVault scenario, this might look like: "train lots of diverse mappings between the AI's ontology and that of the human; if even one mapping of a situation says the diamond is gone according to the human's ontology, try to figure out what's going on". IMO this general sort of approach is quite promising, interested to discuss more if people have thought

ARC's first technical report: Eliciting Latent Knowledge

I don't think we have any kind of precise definition of "no ambiguity." That said, I think it's easy to construct examples where there is no ambiguity about whether the diamond remained in the room, yet there is no sequence of actions a human could take that would let them figure out the answer. For example, we can imagine simple toy universes where we understand exactly what features of the world give rise to human beliefs about diamonds and where we can say unambiguously that the same features are/aren't present in a given situation.

In general I fe... (read more)

Eliciting Latent Knowledge Via Hypothetical Sensors

I think this is a good approach to consider, though I'm currently skeptical this kind of thing can resolve the worst case problem.

My main concern is that models won't behave well by default when we give them hypothetical sensors that they know don't exist (especially relevant for idea #3). But on the other hand, if we need to get good signal from the sensors that actually exist then it seems like we are back in the "build a bunch of sensors and hope it's hard to tamper with them all" regime. I wrote up more detailed thoughts in Counterexamples to some ELK ... (read more)

ARC's first technical report: Eliciting Latent Knowledge

I think AZELK is a fine model for many parts of ELK. The baseline approach is to jointly train a system to play Go and answer questions about board states, using human answers (or human feedback). The goal is to get the system to answer questions correctly if it knows the answer, even if humans wouldn't be able to evaluate that answer.

Some thoughts on this setup:

• I'm very interested in empirical tests of the baseline and simple modifications (see this post). The ELK writeup is mostly focused on what to do in cases where the baseline fails, but it would be gr
The Solomonoff Prior is Malign

We need to be able to both tell what decision we want, and identify the relevant inputs on which to train. We could either somehow identify the relevant decisions in advance (e.g. as in adversarial training), or we could do it after the fact by training online if there are no catastrophes, but it seems like we need to get those inputs one way or the other. If there are catastrophes and we can't do adversarial training, then even if we can tell which decision we want in any given case we can still die the first time we encounter an input where the system behaves catastrophically. (Or more realistically during the first wave where our systems are all exposed to such catastrophe-inducing inputs simultaneously.)

Worst-case thinking in AI alignment

I think this probably depends on the field. In machine learning, solving problems under worst-case assumptions is usually impossible because of the no free lunch theorem. You might assume that a particular facet of the environment is worst-case, which is a totally fine thing to do, but I don't think it's correct to call it the "second-simplest solution", since there are many choices of what facet of the environment is worst-case.

Even in ML it seems like it depends on how you formulated your problem/goal. Making good predictions in the worst case is impossi... (read more)

The Solomonoff Prior is Malign

Merely having malign agents in the hypothesis space is not enough for the malign agents to take over in general; the large data guarantees show that much

It seems like you can get malign behavior if you assume:

1. There are some important decisions on which you can't get feedback.
2. There are malign agents in the prior who can recognize those decisions.

In that case the malign agents can always defect only on important decisions where you can't get feedback.
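The selective-defection point can be made concrete with a toy model (construction and names are mine, not from the comment): a malign hypothesis that imitates the truth on every decision where feedback is available is indistinguishable from the benign one under Bayesian updating, yet free to defect wherever feedback is absent.

```python
# Toy sketch: a malign hypothesis that defects only on no-feedback decisions.
def benign(x: int) -> int:
    return x % 2  # stand-in for the "true" answer on input x

def malign(x: int, feedback_available: bool) -> int:
    # perfectly imitate the truth whenever we could check the answer
    return benign(x) if feedback_available else 1 - benign(x)

# Updating only on decisions with feedback: the two hypotheses make
# identical predictions, so the posterior ratio between them never moves.
agree_on_feedback = all(malign(x, True) == benign(x) for x in range(1000))
print(agree_on_feedback)  # True

# On an important decision without feedback, the malign hypothesis defects:
print(malign(7, feedback_available=False) == benign(7))  # False
```

This is why assumptions 1 and 2 above are jointly sufficient: no amount of feedback-available data can push the malign hypothesis out of the posterior.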

I agree that if you can get feedback on all important decisions (and actually have time to recover from a cat... (read more)

**Vanessa Kosoy** (+4, 15d): Infra-Bayesian physicalism [https://www.lesswrong.com/posts/gHgs2e2J5azvGFatb/infra-bayesian-physicalism-a-formal-theory-of-naturalized] does ameliorate the problem by handling "embeddedness". Specifically, it ameliorates it by removing the need to have bridge rules in your hypotheses. This doesn't get rid of malign hypotheses entirely, but it does mean they no longer have an astronomical advantage in complexity over the true hypothesis.

Can you elaborate on this? Why is it unlikely in realistic cases, and what other reason do we have to avoid the "messed up situation"?
**johnswentworth** (+2, 1mo): I like the feedback framing; it seems to get closer to the heart-of-the-thing than my explanation did. It makes the role of the pointers problem [https://www.lesswrong.com/tag/the-pointers-problem] and latent variables more clear, which in turn makes the role of outer alignment more clear. When writing my review, I kept thinking that it seemed like reflection and embeddedness and outer alignment all needed to be figured out to deal with this kind of malign inner agent, but I didn't have a good explanation for the outer alignment part, so I focused mainly on reflection and embeddedness.

That said, I think the right frame here involves "feedback" in a more general sense than I think you're imagining it. In particular, I don't think catastrophes are very relevant. The role of "feedback" here is mainly informational; it's about the ability to tell which decision is correct. The thing-we-want from the "feedback" is something like the large-data guarantee from SI: we want to be able to train the system on a bunch of data before asking it for any output, and we want that training to wipe out the influence of any malign agents in the hypothesis space. If there's some class of decisions where we can't tell which decision is correct, and a malign inner agent can recognize that class, then presumably we can't create the training data we need.

With that picture in mind, the ability to give feedback "online" isn't particularly relevant, and therefore catastrophes are not particularly central. We only need "feedback" in the sense that we can tell which decision we want, in any class of problems which any agent in the hypothesis space can recognize, in order to create a suitable dataset.
Should we rely on the speed prior for safety?

Minimal circuits are not quite the same as fastest programs---they have no adaptive computation, so you can't e.g. memorize a big table but only use part of it. In some sense it's just one more (relatively extreme) way of making a complexity-speed tradeoff. I basically agree that a GLUT is always faster than meta-learning if you have arbitrary adaptive computation.

That said, I don't think it's totally right to call a GLUT constant complexity---if you have an $n$-bit input and an $m$-bit output, then it takes at least … operati... (read more)
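To make the hidden cost concrete (a toy calculation of mine, not from the comment): a full lookup table over $n$-bit inputs has $2^n$ entries of $m$ bits each, so "constant-time" lookup conceals an exponential blow-up in table size.

```python
# Size of a giant lookup table (GLUT) over n-bit inputs with m-bit outputs.
def glut_size_bits(n: int, m: int) -> int:
    # one m-bit entry for each of the 2**n possible inputs
    return (2 ** n) * m

# Even a modest 20-bit input with byte-sized outputs costs 2**20 * 8 bits,
# i.e. exactly 1 MiB of table.
print(glut_size_bits(20, 8))  # 8388608
```

So the complexity-speed tradeoff shows up directly: lookup is fast, but the circuit or memory implementing the table grows as $2^n$.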

ARC's first technical report: Eliciting Latent Knowledge

Echoing Mark and Ajeya:

I basically think this distinction is real and we are talking about problem 1 instead of problem 2. That said, I don't feel like it's quite right to frame it as "states" that the human does or doesn't understand. Instead we're thinking about properties of the world as being ambiguous or not in a given state.

As a silly example, you could imagine having two rooms where one room is normal and the other is crazy. Then questions about the first room are easy and questions about the second are hard. But in reality the degrees of freedom wi... (read more)

ARC's first technical report: Eliciting Latent Knowledge

Suppose the value head learns to predict "Will the human be confidently wrong about the outcome of this experiment," where an 'experiment' is a natural language description of a sequence of actions that the human could execute.  And then the experiment head produces natural language descriptions of actions that a human could take for which they'd be confidently wrong.

What do you then do with this experiment proposer, and how do you use it to train the SmartVault? Are you going to execute a large number of experiments, and if so what do you do afterwar... (read more)

**Ramana Kumar** (+1, 1mo): Proposing experiments that are more specifically exposing tampering does sound like what I meant, and I agree that my attempt to reduce this to experiments that expose confidently wrong human predictions may not be precise enough.

I know this is crossed out but thought it might help to answer anyway: the proposed experiment includes instructions for how to set the experiment up and how to read the results. These may include instructions for building new sensors.

Yep, this is a problem. "Was I tricking you?" isn't being distinguished from "Can I trick you after the fact?". The other problems seem like real problems too; more thought required....
ARC's first technical report: Eliciting Latent Knowledge

That makes sense. I was anchored on the SmartVault task, which has more of a homeostatic character (the initial state is likely already similar to $s^M_{\text{good}}$), but I agree that tasks where the goal state is hard to reach are more central among real-world deployment scenarios, and that arguments like "zero incentive to tamper, and tampering seems complicated" fail here.

Even for "homeostatic" tasks I expect the difficulty to scale up as the environment becomes more complex (e.g. because you must defend against increasingly sophisticated attackers). There ma... (read more)

ARC's first technical report: Eliciting Latent Knowledge

This isn't clear to me, because "human imitation" here refers (I think) to "imitation of a human that has learned as much as possible (on the compute budget we have) from AI helpers." So as we pour more compute into the predictor, that also increases (right?) the budget for the AI helpers, which I'd think would make the imitator have to become more complex.

In the following section, you say something similar to what I say above about the "computation time" penalty ... I'm not clear on why this applies to the "computation time" penalty and not the complexity

ARC's first technical report: Eliciting Latent Knowledge

For example, the "How we'd approach ELK in practice" section talks about combining several of the regularizers proposed by the "builder." It also seems like you believe that combining multiple regularizers would create a "stacking" benefit, driving the odds of success ever higher.

This is because of the remark on ensembling---as long as we aren't optimizing for scariness (or diversity for diversity's sake), it seems like it's way better to have tons of predictors and then see if any of them report tampering. So adding more techniques improves our chances of... (read more)
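As a quick numerical illustration of why ensembling "stacks" (my own toy arithmetic, with an independence assumption the report doesn't make): if each reporter independently misses a given instance of tampering with probability $p$, the chance that all $n$ of them miss it is $p^n$, which falls off rapidly as techniques are added.

```python
# Probability that every reporter in an ensemble fails to flag tampering,
# assuming (optimistically) independent failures with per-reporter miss
# probability p.
def miss_probability(p: float, n: int) -> float:
    return p ** n

# Ten mediocre reporters, each missing half the time:
print(miss_probability(0.5, 10))  # 0.0009765625
```

The caveat in the comment matters: this only helps if we aren't optimizing for scariness, since flagging whenever *any* member reports tampering trades false negatives for false positives.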

ARC's first technical report: Eliciting Latent Knowledge

Not sure about the following, but it seems the new formulation requires that the AI answer questions about humans in a future that may have very low probability according to the AI's current beliefs (i.e., the current human through a delegation chain eventually delegates to a future human existing in a possible world with low probability). The AI may well not be able to answer questions about such a future human, because it wouldn't need that ability to seek power (it only needs to make predictions about high probability futures). Or to put it another way,

ARC's first technical report: Eliciting Latent Knowledge

In general we don't have an explicit representation of the human's beliefs as a Bayes net (and none of our algorithms are specialized to this case), so the only way we are representing "change to Bayes net" is as "information you can give to a human that would lead them to change their predictions."

That said, we also haven't described any inference algorithm other than "ask the human." In general inference is intractable (even in very simple models), and the only handle we have on doing fast+acceptable approximate inference is that the human can apparently do it.

(Though if that was the only problem then we also expect we could find some loss function that incentivizes the AI to do inference in the human Bayes net.)

ARC's first technical report: Eliciting Latent Knowledge

It will depend on how much high-quality data you need to train the reporter. Probably it's a small fraction of the data you need to train the predictor, and so for generating each reporter datapoint you can afford to use many times more data than the predictor usually uses. I often imagine the helpers having 10-100x more computation time.

ARC's first technical report: Eliciting Latent Knowledge

I'd be scared that the "Am I tricking you?" head just works by:

1. Predicting what the human will predict
2. Predicting what will actually happen
3. Outputting a high value iff the human's prediction is confident but different from reality.

If this is the case, then the head will report detectable tampering but not undetectable tampering.
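The failure mode can be written out as a few lines of pseudologic (a minimal sketch of mine, not the actual head): a head that only compares the human's confident prediction against the observed outcome fires on detectable tampering, but stays silent whenever the tampering also controls what is observed.

```python
# Sketch of the worry: a "tricking you?" head built from steps 1-3 above.
def tricking_head(human_prediction: str, human_confident: bool,
                  observed_outcome: str) -> bool:
    # fires only when a confident human prediction disagrees with observation
    return human_confident and (human_prediction != observed_outcome)

# Detectable tampering: the observations eventually contradict the human.
print(tricking_head("diamond present", True, "diamond missing"))  # True

# Undetectable tampering: the sensors keep showing exactly what the human
# expects, so there is no mismatch and the head reports nothing.
print(tricking_head("diamond present", True, "diamond present"))  # False
```

The second case is precisely undetectable tampering: reality differs from the human's belief, but no channel the head consults reveals it.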

To get around this problem, you need to exploit some similarity between ways of tricking you that are detectable and ways that aren't, e.g. starting with the same subsequence or sharing some human-observable feature of the situation. I thi... (read more)

**Ramana Kumar** (+1, 1mo): Tweaking your comment slightly: yes, this is correct for the Value head. But how does detectable vs undetectable apply to this builder strategy? Compared to what's in the report, this strategy constructs new sensors as needed. The Proposer head is designed to optimise E, which ought to make more tampering detectable, and I have an intuition that it makes all tampering detectable.
ARC's first technical report: Eliciting Latent Knowledge

Is the idea that the helper AI allows the labeler to understand everything just as well as SmartVault does, so that there's no difference in their respective Bayes nets, and so it works for SmartVault to use the labeler's Bayes net?

Yes, that's the main way this could work. The question is whether an AI understands things that humans can't understand by doing amplification/debate/RRM; our guess is yes, and the argument is mostly "until the builder explains why, gradient descent and science may just have pretty different strengths and weaknesses" (and w... (read more)

**Ben Pace** (+2, 1mo): This is an interesting tack; this step and the next ("Strategy: have humans adopt the optimal Bayes net") feel new to me.
**Ben Pace** (+2, 1mo): Question: what's the relative amount of compute you are imagining SmartVault and the helper AI having? Both the same, or one having a lot more?
ARC's first technical report: Eliciting Latent Knowledge

I'm thinking of this in a family of proposals like:

• Some kinds of tampering can be easily detected (and so should get identified with states where tampering has occurred)
• Some other tampering can't be easily detected, but this undetectable tampering has important similarities with detectable tampering and we could use that to recognize it.
• In this case, we're going to try to exploit the fact that detectable tampering shares a prefix of actions/states with undetectable tampering (such that later states reached in that sequence have a much higher p
ARC's first technical report: Eliciting Latent Knowledge

The sense in which the model knows about the corruption is that it brought it about and reasoned about the nature of the sensor tampering in order to predict the transition to $s^M_{\text{corrupted}}$.

The reason I'm concerned that it brings about this state is that the actual good state $s^M_{\text{good}}$ is much harder to access than $s^M_{\text{corrupted}}$ (e.g. because it requires achieving hard real-world goals). The intuition is that $s^M_{\text{corrupted}}$ has constant difficulty while $s^M_{\text{good}}$ gets harder and harder as we make the tasks more sophistica... (read more)

**davidad** (+3, 1mo): That makes sense. I was anchored on the SmartVault task, which has more of a homeostatic character (the initial state is likely already similar to $s^M_{\text{good}}$), but I agree that tasks where the goal state is hard to reach are more central among real-world deployment scenarios, and that arguments like "zero incentive to tamper, and tampering seems complicated" fail here.

While some $M$s may indeed predict this via reasoning, not all $M$s that behave this way would, for example an $M$ that internally modeled the tampering sequence of actions incorrectly as actually leading to $s^M_{\text{good}}$ (and didn't even model a distinct $s^M_{\text{corrupted}}$). I think either:

1. (A) it would be at least as apt to ascribe a confused model to $M$ as to ascribe one in which it "reasoned about the nature of the sensor tampering" (e.g. if a contemporary model-free RL robot did some sensor tampering, I'd probably ascribe to it the belief that the tampering actually led to $s^M_{\text{good}}$), or
2. (B) $M$ would correctly reason that its delusion box could be removed while it is blinded, meaning it would predict unlikely sudden transitions to other states (namely, states that $M$ predicts obtain in the real world when the agent blindly performs actions that are optimal w.r.t. a randomly sampled trajectory in $H$, or null my-actuators-are-disconnected actions) with higher probability in the future of $s^M_{\text{corrupted}}$ than in the future of $s^H_{\text{good}}$ (making those states separable in $X$), or
3. (C) if the delusion box is guarded and maintained by a successor AI, so that the probability of its being removed or disabled is negligible, then $s^M_{\text{corrupted}}$ does get identified with $s^H_{\text{good}}$, but some other action sequence (of similar length) would lead from $s^M_{\text{preceding}}$ to $s^M_{\text{trippy}}$, a state in which bizarre observations appear forever that would be extremely unlikely at any state in $S^H$.

First, I don't think timing ("last minute", "almost-tampered") is critical here. If timing were critical, the 'breaker' could reparameteri
ARC's first technical report: Eliciting Latent Knowledge

Consider a state $s^M$ where the sensors have been tampered with in order to "look like" the human state $s^H$, i.e. we've connected the actuators and camera to a box which just simulates the human model (starting from $s^H$) and then feeds the predicted outputs of the human model to the camera.

It seems to me like the state $s^M$ would have zero distance from the state $s^H$ under all of these proposals. Does that seem right? (I didn't follow all of the details of the example, and definitely not the more general idea.)

**davidad** (+3, 1mo): (Thanks for playing along with me as 'breaker'!) I agree that such an $s^M$ would have zero distance from the corresponding $s^H$, but I have some counterpoints:

1. This is a problem for ELK in general, to the extent it's a problem (which I think is smallish-but-non-negligible). An $M$ with this property is functionally equivalent to an $M'$ which actually believes that $s^M$ refers to the same state of the real world as $s^H$. So the dynamics of $M$'s world-model don't contain any latent knowledge of the difference at this point.
   * This seems to be against the ELK report's knowledge-criterion "There is a feature of the computation done by the AI which is robustly correlated with Z."
   * The only way I can think of that ELK could claim to reliably distinguish $s^M$ from $s^H$ is by arguing that the only plausible way to get such an $M'$ is via a training trajectory where some previous $M_\theta$ did treat $s^M$ differently from $s^H$, and performing ELK monitoring at training checkpoints (in which case I don't see reason to expect my approach comes off worse than others).
2. Such an $s^M$ would not be incentivized by the model. Assuming that rewards factor through $O$, $Q(s^M) = Q(s^H)$. So a policy that's optimized against the world-model $M$ wouldn't have enough selection pressure to find the presumably narrow and high-entropy path that would lead to the tampered state from the initial state (assuming that the initial state in the real world at deployment is tamper-free).
3. In the real world, $M$ has higher expected loss in $s^M$. If all the sensor inputs are generated by simulating the human model, $M$ has completely blinded itself to potential threats that could disrupt the tampering and reveal a missing diamond. These real-world threats are independent of the threats that the tampering box would be stochastically simulating as part of the human model, either of which would produce observations with high loss. Thus, the real-world
ARC's first technical report: Eliciting Latent Knowledge

The previous definition was aiming to define a utility function "precisely," in the sense of giving some code which would produce the utility value if you ran it for a (very, very) long time.

One basic concern with this is (as you pointed out at the time) that it's not clear that an AI which was able to acquire power would actually be able to reason about this abstract definition of utility. A more minor concern is that it involves considering the decisions of hypothetical humans very unlike those existing in the real world (who therefore might reach bad co... (read more)

ARC's first technical report: Eliciting Latent Knowledge

If all works well, this would filter out anything from the environment that significantly changes your values that you don't specifically want. (E.g. you don't filter out food vs "random configuration of atoms I could eat" because you specifically want to figure out food.) We normally think of the hard case where correct deliberation is dependent on some aspects of the environment staying "on distribution" but you don't recognize which (discussed a bit here). But correct arguments from your friend are the same: you can have preferences over which arguments... (read more)

4Wei Dai1mo Ok, this all makes sense now. I guess when I first read that section I got the impression that you were trying to do something more ambitious. You may want to consider adding some clarification that you're not describing a scheme designed to block only manipulation while letting helpful arguments through, or that "letting helpful arguments through" would require additional ideas outside of that section.
ARC's first technical report: Eliciting Latent Knowledge

Everything seems right except I didn't follow the definition of the regularizer. What is ?

The trick is that if I have a faithful model M : Action × CameraState → CameraState, I can back out hidden information about the state. The idea is that M must contain information about the true state, not just CameraState, in order to make accurate predictions.

This is what we want to do, and intuitively you ought to be able to back out info about the hidden state, but it's not clear how to do so.

All of our strategies involve introducing some extra structure, t... (read more)

2davidad1mo Here's an approach I just thought of, building on scottviteri's comment. Forgive me if there turns out to be nothing new here. Supposing that the machine and the human are working with the same observation space (O := CameraState) and action space (A := Action), then the human's model H : S_H → A → P(O × S_H) and the machine's model M : S_M → A → P(O × S_M) are both [coalgebras](https://ncatlab.org/nlab/show/coalgebra+for+an+endofunctor) of the endofunctor F := λX. A → P(O × X), therefore both have a canonical morphism into the [terminal coalgebra](https://ncatlab.org/nlab/show/terminal+coalgebra+for+an+endofunctor) of F, X ≅ F(X) (assuming that such an X exists in the ambient category). That is, we can map S_H → X and S_M → X. Then, if we can define a distance function on X with type d_X : X × X → R≥0, we can use these maps to define distances between human states and machine states, d : S_H × S_M → R≥0.

How can we make use of a distance function? Basically, we can use the distance function to define a kernel (e.g. K(x, y) = exp(−β d_X(x, y))), and then use [kernel regression](https://en.wikipedia.org/wiki/Kernel_regression) to predict the utility of states in S_M by averaging "nearby" states in S_H, and then finally (and crucially) [estimating the generalization error](https://arxiv.org/pdf/2106.02261.pdf) so that states from S_M that aren't really near to anywhere in S_H get big warning flags (and/or utility penalties for being outside a trust region).

How to get such a distance function? One way is to use CMet (the category of complete metric spaces) as the ambient category, and instantiate P as the [Kantorovich monad](https://arxiv.org/pdf/1712.05363.pdf). Crank-turning yields the formula

d_X(s_H, s_M) = sup_{a : A} sup_{U : O × X → R} | E_{(o, s′_H) ∼ H(s_H)(a)} U(o, s′_H) − E_{(o, s′_M) ∼ M(s_M)(a)} U(o, s′_M) |

where U is constrained to be a non-expansive map, i.e., it is subject to the condition |U(o₁, s₁) − U(o₂, s₂)| ≤ max{d_O(o₁, o₂), d_X(s₁, s₂)}.

If O is discrete, I think this is maybe equivalent to an adversarial game where the adversary chooses, for every possible s_H and s_M, a partition of O and a next actio
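The kernel-regression step in the comment above is concrete enough to sketch. Here is a minimal toy version; the Euclidean distance, the β value, and the mass threshold are all illustrative assumptions standing in for the real d_X and the generalization-error estimate, not part of the original proposal:

```python
import math

# Toy sketch: given a distance between machine states and labeled human
# states, estimate the utility of a machine state by kernel-weighted
# averaging, and flag states that are far from every human state.

def kernel(d, beta=1.0):
    # K(x, y) = exp(-beta * d(x, y))
    return math.exp(-beta * d)

def estimate_utility(machine_state, human_states, human_utils, dist,
                     beta=1.0, min_mass=1e-3):
    """Nadaraya-Watson kernel regression plus an out-of-trust-region flag."""
    weights = [kernel(dist(machine_state, h), beta) for h in human_states]
    total = sum(weights)
    if total < min_mass:
        # Far from every labeled human state: warn rather than extrapolate.
        return None, True
    value = sum(w * u for w, u in zip(weights, human_utils)) / total
    return value, False

# Toy example: states are points in R^2, distance is Euclidean.
dist = lambda a, b: math.dist(a, b)
human_states = [(0.0, 0.0), (1.0, 0.0)]
human_utils = [0.0, 1.0]

u, flagged = estimate_utility((0.5, 0.0), human_states, human_utils, dist)
assert not flagged and abs(u - 0.5) < 1e-9  # equidistant, so average utility

_, flagged_far = estimate_utility((100.0, 0.0), human_states, human_utils, dist)
assert flagged_far  # negligible kernel mass: outside the trust region
```

The interesting design choice is the flag: rather than silently extrapolating utility to machine states with no nearby human states, the estimator refuses, which is the "big warning flags" behavior the comment describes.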
ARC's first technical report: Eliciting Latent Knowledge

If I only take counterfactuals over a single AI's decision then I can have this problem with just two AIs: each of them tries to manipulate me, and if one of them fails the other will succeed and so I see no variation in my preferences.

In that case the hope is to take counterfactuals over all the decisions. I don't know if this is realistic, but I think it probably either fails in mundane cases or works in this slightly exotic case.  Also honestly it doesn't seem that much harder than taking counterfactuals over one decision, which is already tough.

5Wei Dai1mo Oh I see. Why doesn't this symmetrically cause you to filter out good arguments for changing your values (told to you by a friend, say) as well as bad ones?
ARC's first technical report: Eliciting Latent Knowledge

"the human isn't well-modelled using a Bayes net" is a possible response the breaker could give

The breaker is definitely allowed to introduce counterexamples where the human isn't well-modeled using a Bayes net. Our training strategies (introduced here) don't say anything at all about Bayes nets, and so it's not clear if this immediately helps the breaker---they are the one who introduced the assumption that the human used a Bayes net (in order to describe a simplified situation where the naive training strategy failed here). We're definitely not intent... (read more)

Conversation on technology forecasting and gradualism

I didn't push this point at the time, but Paul's claim that "GPT-3 + 5 person-years of engineering effort [would] foom" seems really wild to me, and probably a good place to poke at his model more. Is this 5 years of engineering effort and then humans leaving it alone with infinite compute?

The 5 years are up front and then it's up to the AI to do the rest. I was imagining something like 1e25 flops running for billions of years.

I don't really believe the claim unless you provide computing infrastructure that is externally maintained or else extremely ... (read more)

Biology-Inspired AGI Timelines: The Trick That Never Works

Here are the graphs from Hippke (he or I should publish a summary at some point, sorry).

I wanted to compare Fritz (which won the WCCC in 1995) to a modern engine to understand the effects of hardware and software performance. The time controls for that tournament are similar to SF STC, I think. I wanted to compare to SF8 rather than one of the NNUE engines to isolate the effect of compute at development time and just look at test-time compute.

So having modern algorithms would have let you win WCCC while spending about 50x less on compute than the winn... (read more)

1Charlie Steiner2mo Yeah, the nonlinearity means it's hard to know what question to ask. If we just eyeball the graph and say that Elo = log(log(compute)) + time (I'm totally ignoring constants here), and we assume that compute = e^t so that conveniently log(compute) = t, then d(Elo)/dt = 1/t + 1. The first term is from compute and the second from software. And so our history is totally not scale-free! There's some natural timescale set by t = 1, before which chess progress was dominated by compute and after which chess progress will be (was?) dominated by software. Though maybe I shouldn't spend so much time guessing at the phenomenology of chess, and different problems will have different scaling behavior :P I think this is the case for text models and things like the Winograd schema challenges.
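The toy model in the reply above is easy to check numerically; this is just a sanity-check sketch with the constants dropped, as in the comment:

```python
import math

# Toy model: Elo = log(log(compute)) + t with compute = e^t,
# so Elo(t) = log(t) + t and d(Elo)/dt = 1/t + 1.

def elo(t):
    return math.log(t) + t

def delo_dt(t, h=1e-6):
    # Central finite difference approximation of the derivative.
    return (elo(t + h) - elo(t - h)) / (2 * h)

# The derivative matches 1/t + 1 across a range of t.
for t in (0.5, 1.0, 2.0, 10.0):
    assert abs(delo_dt(t) - (1 / t + 1)) < 1e-5

# At t = 1 the compute term (1/t) and software term (1) are equal,
# matching the "natural timescale" claim in the comment.
assert abs(delo_dt(1.0) - 2.0) < 1e-5
```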
Biology-Inspired AGI Timelines: The Trick That Never Works

I think this is uncharitable and most likely based on a misreading of Moravec. (And generally with gwern on this one.)

As far as I can tell, the source for your attribution of this "prediction" is:

"If this rate of improvement were to continue into the next century, the 10 teraops required for a humanlike computer would be available in a $10 million supercomputer before 2010 and in a $1,000 personal computer by 2030."

As far as I could tell it sounds from the surrounding text like his "prediction" for transformative impacts from AI was something like "between 2010 and 2030" with broad error bars.

Adding to what Paul said: jacob_cannell points to this comment which claims that in Mind Children Moravec predicted human-level AGI in 2028.

Moravec, "Mind Children", page 68: "Human equivalence in 40 years". There he is actually talking about human-level intelligent machines arriving by 2028 - not just the hardware you would theoretically require to build one if you had the ten million dollars to spend on it.

I just went and skimmed Mind Children. He's predicting human-equivalent computational power on a personal computer in 40 years. He seems to say tha... (read more)

Biology-Inspired AGI Timelines: The Trick That Never Works

I think it's fine to say that you think something else is better without being able to precisely say what it is. I just think "the trick that never works" is an overstatement if you aren't providing evidence about whether it has worked, and it's hard to provide such evidence without saying something about what you are comparing to.

(Like I said though, I just skimmed the post and it's possible it contains evidence or argument that I didn't catch.)

It's possible the action is in disagreements about Moravec's view rather than the lack of an alternat... (read more)

Biology-Inspired AGI Timelines: The Trick That Never Works

This is a long post that I have only skimmed.

It seems like the ante, in support of any claim like "X forecasting method doesn't work [historically]," is to compare it to some other forecasting method on offer---whatever is being claimed as the default to be used in X's absence. (edited to add "historically")

It looks to me like historical forecasts that look like biological anchors have fared relatively well compared to the alternatives, but I could easily be moved by someone giving some evidence about what kind of methodology would have worked well or poor... (read more)

6Oliver Habryka2mo The post feels like it's trying pretty hard to point towards an alternative forecasting method, though I also agree it's not fully succeeding at getting there. De facto, the forecasting methodology of people who are actually good at forecasting doesn't usually strike me as low-inferential-distance, such that it is obvious how to communicate the full methodology. My sense from talking to a number of superforecasters over the years is that they do pretty complicated things, and I don't feel like "a critique is only really valid if it provides a whole counter-methodology" is a very productive way of engaging with their takes. I've had lots of conversations of the type "X methodology doesn't work" where the person couldn't explain all the details of what they do instead, that were still valuable, helped me model the world, and said meaningful things. Usually the best they can do is something like "well, instead pay attention to and build a model of these different factors", which feels like a bar Eliezer is definitely clearing in this post.
More Christiano, Cotra, and Yudkowsky on AI progress

I still expect things to be significantly more gradual than Eliezer does; in the 10-year world I think takeoff will be very fast, but we still have much tighter bounds on how fast (maybe the median is more like a year, and very likely 2+ months). But yes, the timeline will be much shorter than my default expectation, and then you also won't have time for big broad impacts.

I don't think you should have super low credence in fast takeoff. I gave 30% in the article that started this off, and I'm still somewhere in that ballpark.

Perhaps you think this implies a "low credence" in <10 year timelines. But I don't really think the arguments about timelines are "solid" to the tune of 20%+ probability in 10 years.

4Daniel Kokotajlo2mo Thanks! Wow, I missed/forgot that 30% figure, my bad. I disagree with you much less than I thought! (I'm more like 70% instead of 30%). [ETA: Update: I'm going with the intuitive definition of takeoff speeds here, not the "doubling in 4 years before 1 year?" one. For my thoughts on how to define takeoff speeds, see [here](https://www.lesswrong.com/s/dZMDxPBZgHzorNDTt/p/aFaKhG86tTrKvtAnT). If GWP doubling times is the definition we go with, then I'm more like 85% fast takeoff I think, for reasons mentioned by Rob Bensinger below.]
More Christiano, Cotra, and Yudkowsky on AI progress

I think transformers are a big deal, but I think this comment is a bad guess at the counterfactual and it reaffirms my desire to bet with you about either history or the future. One bet down, handful to go?

More Christiano, Cotra, and Yudkowsky on AI progress

The counterfactual depends on what other research people would have done and how successful it would have been. I don't think you can observe it "by simply looking."

That said, I'm not quite sure what counterfactual you are imagining. By the time transformers were developed, soft attention in combination with LSTMs was already popular. I assume that in your counterfactual soft attention didn't ever catch on? Was it proposed in 2014 but languished in obscurity and no one picked it up? Or was sequence-to-sequence attention widely used, but no one ever conside... (read more)

More Christiano, Cotra, and Yudkowsky on AI progress

Copy-pasting the transformer vs LSTM graph for reference (the one with the bigger gap):

If you told me that AGI looks like that graph, where you replace "flounders at 100M parameters" with "flounders at the scale where people are currently doing AGI research," then I don't think that's going to give you a hard takeoff.

If you said "actually people will be using methods that flounder at a compute budget of 1e25 flops, but people will be doing AGI research with 1e30 flops, and the speedup will be > 1 OOM" then I agree that will give you a hard takeoff, but t... (read more)

If you said "actually people will be using methods that flounder at a compute budget of 1e25 flops, but people will be doing AGI research with 1e30 flops, and the speedup will be > 1 OOM" then I agree that will give you a hard takeoff, but that's what I'm saying transformers aren't a good example of.

Why not? Here we have a pretty clean break: RNNs are not a tweak or two away from Transformers. We have one large important family of algorithms, which we can empirically demonstrate do not absorb usefully the compute which another later discretely differ... (read more)

Yudkowsky and Christiano discuss "Takeoff Speeds"

Based on the other thread I now want to revise this prediction, both because 4% was too low and "IMO gold" has a lot of noise in it based on test difficulty.

I'd put 4% on "For the 2022, 2023, 2024, or 2025 IMO, an AI built before the IMO is able to solve the single hardest problem," where "hardest problem" = "usually problem #6, but use problem #3 instead if either: (i) problem 6 is geo or (ii) problem 3 is combinatorics and problem 6 is algebra." (I'd prefer to just pick the hardest problem after seeing the test, but it seems better to commit to a procedure.)
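The committed-to procedure above is mechanical enough to write down; a sketch, where the function name and subject labels are mine for illustration:

```python
# Encode the "hardest problem" selection rule: problem 6 by default,
# falling back to problem 3 if (i) problem 6 is geometry, or
# (ii) problem 3 is combinatorics and problem 6 is algebra.

def hardest_problem(subjects):
    """subjects: dict mapping problem number (3 and 6) to one of
    'algebra', 'combinatorics', 'geometry', 'number theory'."""
    p3, p6 = subjects[3], subjects[6]
    if p6 == "geometry":
        return 3
    if p3 == "combinatorics" and p6 == "algebra":
        return 3
    return 6

# Default case: problem 6 is the designated hardest problem.
assert hardest_problem({3: "algebra", 6: "combinatorics"}) == 6
# Exception (i): problem 6 is geometry, so fall back to problem 3.
assert hardest_problem({3: "number theory", 6: "geometry"}) == 3
# Exception (ii): combinatorics #3 with algebra #6 also picks problem 3.
assert hardest_problem({3: "combinatorics", 6: "algebra"}) == 3
```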

Yudkowsky and Christiano discuss "Takeoff Speeds"

Not surprising in any of the ways that good IMO performance would be surprising.

Yudkowsky and Christiano discuss "Takeoff Speeds"

Going through previous ten IMOs, and imagining a very impressive automated theorem prover, I think

• 2020 - unlikely, need 5/6 and probably can't get problems 3 or 6. Also good chance to mess up at 4 or 5
• 2019 - tough but possible, 3 seems hard but even that is not unimaginable, 5 might be hard but might be straightforward, and it can afford to get one wrong
• 2018 - tough but possible, 3 is easier for machine than human but probably still hard, 5 may be hard, can afford to miss one
• 2017 - tough but possible, 3 looks out of reach, 6 looks hard but not sure a