All of Chris_Leong's Comments + Replies

An alternative framing that might be useful: What do you see as the main bottleneck for people having better predictions of timelines (as you see it)?

Do you in fact think that having such a list is it?

What we need to find, for a given agent to be constrained by being a 'utility maximiser' is to consider it as having a member of a class of utility functions where the actions that are available to it systematically alter the expected utility available to it - for all utility functions within this class.

This sentence is extremely difficult for me to parse. Any chance you could clarify it?

In most situations, were these preferences over my store of dollars for example, this would seem to be outside the class of utility functions that would meaningfully constrain my action, since this function is not at all smooth over the resource in question.

Could you explain smoothness is typically required for meaningly constraining our actions?

Thanks so much for not only writing a report, but taking the time to summarise for our easy consumption!

Oh sorry, just realised that davinci-002 is separate from text-davinci-002.

Note that davinci-002 and babbage-002 are the new base models released a few days ago.

You mean davinci-003?

  • Learn more about the possible failures, to understand how likely they are, what causes them to arise, and what techniques may mitigate the failures (discussed here).
  • Inform the current conversation about AI risk by providing the best evidence of misalignment risks, if any. We hope this will be helpful for labs, academia, civil society, and policymakers to make better decisions (discussed here). If misalignment issues end up being serious, then it will be critical to form a strong scientific consensus that these issues are real, for which examples
... (read more)
2Carson Denison1mo
Thank you for catching this.  These linked to section titles in our draft gdoc for this post. I have replaced them with mentions of the appropriate sections in this post.
2Ethan Perez1mo
Fixed (those were just links to the rest of the doc)

It’s not clear to me that the space of things you can verify is in fact larger than the space of things you can do because an AI might be able to create a fake solution that feels more real than the actual solution. At a sufficiently high intelligence level of the AI, being able to avoid this tricks is likely harder than just doing the task if you hadn’t been subject to malign influence.

Do you have a theory for why chain-of-thought decomposition helps?

3Ansh Radhakrishnan2mo
Honestly, I don't think we have any very compelling ones! We gesture at some possibilities in the paper, such as it being harder for the model to ignore its reasoning when it's in an explicit question-and-answer format (as opposed to a more free-form CoT), but I don't think we have a good understanding of why it helps.  It's also worth noting that CoT decomposition helps mitigate the ignored reasoning problem, but actually is more susceptible to biasing features in the context than CoT. Depending on how you weigh the two, it's possible that CoT might still come out ahead on reasoning faithfulness (we chose to weigh the two equally).

I’m confused about the back door attack detection task even after reading it a few times:

The article says: “The key difference in the attack detection task is that you are given the backdoor input along with the backdoored model, and merely need to recognize the nput as an attack”.

When I read that, I find myself wondering why that isn’t trivial solved by a model that memorises which input(s) are known to be an attack.

My best interpretation is that there are a bunch of possible inputs that cause an attack and you are given one of them and just have to recognise that one plus the others you don’t see. Is this interpretation correct?

2Mark Xu2mo
You have to specify your backdoor defense before the attacker picks which input to backdoor.

gotten only 

This should read gotten only d

Sounds really cool. Would be useful to have some idea of the kind of time you're planning to pick so that people in other timezones can make a call about whether or not to apply.

Good point! We are planning to gauge time preferences among the participants and fix slots then. What is maybe most relevant, we are intending to accommodate all time zones. (We have been doing this with PIBBSS fellows as well, so I am pretty confident we will be able to find time slots that work pretty well across the globe.)

I see some value in the framing of "general intelligence" as a binary property, but it also doesn't quite feel as though it fully captures the phenomenon. Like, it would seem rather strange to describe GPT4 as being a 0 on the general intelligence scale.

I think maybe a better analogy would be to consider the sum of a geometric sequence.

Consider the sum for a few values of r as it increases at a steady rate.

0.5 - 2a
0.6 - 2.5a
0.7 - 3.3a
0.8 - 5a
0.9 - 10a
1 - Diverges to infinity

What we see then is quite significant returns to increases in r and then a sudden d... (read more)

So I've thought about this argument a bit more and concluded that you are correct, but also that there's a potential fix to get around this objection.

I think that it's quite plausible that an agent will have an understanding of its decision mechanism that a) let's it know it will take the same action in both counterfactuals b) won't tell it what action it will take in this counterfactual before it makes the decision.

And in that case, I think it makes sense to conclude that the Omega's prediction depends on your action such that paying gives you the $10,000... (read more)

Thanks for your response. There's a lot of good material here, although some of these components like modules or language seem less central to agency, at least from my perspective. I guess you might see these are appearing slightly down the stack?

They fit naturally into the coherent whole picture. In very broad strokes, that picture looks like selection theorems starting from selection pressures for basic agency, running through natural factorization of problem domains (which is where modules and eventually language come in), then world models and general purpose search (which finds natural factorizations dynamically, rather than in a hard-coded way) once the environment and selection objective has enough variety.

Summary: John describes the problems of inner and outer alignment. He also describes the concept of True Names - mathematical formalisations that hold up under optimisation pressure. He suggests that having a "True Name" for optimizers would be useful if we wanted to inspect a trained system for an inner optimiser and not risk missing something.

He further suggests that the concept of agency breaks down into lower-level components like "optimisation", "goals", "world models", ect. It would be possible to make further arguments about how these lower-level concepts are important for AI safety.

This might be worth a shot, although it's not immediately clear that having such powerful maths provers would accelerate alignment more than capabilities. That said, I have previously wondered myself whether there is a need to solve embedded agency problems or whether we can just delegate that to a future AGI.

Oh wow, it's fascinating to see someone actually investigating this proposal. (I had a similar idea, but only posted it in the EA meme group).

Sorry, I'm confused by the terminology: 

Thanks for the extra detail!

(Actually, I was reading a post by Mark Xu which seems to suggest that the TradingAlgorithms have access to the price history rather than the update history as I suggested above)

My understanding after reading this is that TradingAlgorithms generate a new trading policy after each timestep (possibly with access to the update history, but I'm unsure). Is this correct? If so, it might be worth clarifying this, even though it seems clearer later.

2Alex Flint8mo
That is correct. I know it seems little weird to generate a new policy on every timestep. The reason it's done that way is that the logical inductor needs to understand the function that maps prices to the quantities that will be purchased, in order to solve for a set of prices that "defeat" the current set of trading algorithms. That function (from prices to quantities) is what I call a "trading policy", and it has to be represented in a particular way -- as a set of syntax tree over trading primitives -- in order for the logical inductor to solve for prices. A trading algorithm is a sequence of such sets of syntax trees, where each element in the sequence is the trading policy for a different time step. Normally, it would be strange to set up one function (trading algorithms) that generates another function (trading policies) that is different for every timestep. Why not just have the trading algorithm directly output the amount that it wants to buy/sell? The reason is that we need not just the quantity to buy/sell, but that quantity as a function of price, since prices themselves are determined by solving an optimization problem with respect to these functions. Furthermore, these functions (trading policies) have to be represented in a particular way. Therefore it makes most sense to have trading algorithms output a sequence of trading policies, one per timestep.

Interesting, I think this clarifies things, but the framing also isn't quite as neat as I'd like.

I'd be tempted to redefine/reframe this as follows:

• Outer alignment for a simulator - Perfectly defining what it means to simulate a character. For example, how can we create a specification language so that we can pick out the character that we want? And what do we do with counterfactuals given they aren't actually literal?

• Inner alignment for a simulator - Training a simulator to perfectly simulate the assigned character

• Outer alignment for characters - fi... (read more)

I thought this was a really important point, although I might be biased because I was finding it confusing how some discussions were talking about the gradient landscape as though it could be modified and not clarifying the source of this (for example, whether they were discussing reinforcement learning).

First off, the base loss landscape of the entire model is a function  that's the same across all training steps, and the configuration of the weights selects somewhere on this loss landscape. Configuring the weights differently can put the mod

... (read more)

In the section: "The role of naturalized induction in decision theory" a lot of variables seem to be missing.

(Evolution) → (human values) is not the only case of inner alignment failure which we know about. I have argued that human values themselves are inner alignment failures on the human reward system. This has happened billions of times in slightly different learning setups. 

I expect that it has also happened to an extent with animals as well. I wonder if anyone has ever looked into this.

converge to

Converge to 1? (Context is "9. Non-Dogmatic...").

Anyway, thanks so much for writing this! I found this to be a very useful resource.

2Alex Flint8mo
Thanks - fixed! And thank you for the note, too.

It seems strange to treat ontological crises as a subset of embedded world-models, as it seems as though a Cartesian agent could face the same issues?

UDT doesn't really counter my claim that Newcomb-like problems are problems in which we can't ignore that our decisions aren't independent of the state of the world when we make that decision, even though in UDT we know less. To make this clear in the example of Newcomb's, the policy we pick affects the prediction which then affects the results of the policy when the decision is made. UDT isn't ignoring the fact that our decision and the state of the world are tied together, even if it possibly represents it in a different fashion. The UDT algorithm takes ... (read more)

1Vladimir Nesov9mo
UDT still doesn't forget enough. Variations on UDT that move towards acausal trade with arbitrary agents are more obviously needed because UDT forgets too much, since that makes it impossible to compute in practice and forgetting less poses a new issue of choosing a particular updateless-to-some-degree agent to coordinate with (or follow). But not forgetting enough can also be a problem. In general, an external/updateless agent (whose suggested policy the original agent follows) can forget the original preference, pursue a different version of it that has undergone an ontological shift. So it can forget the world and its laws, as long as the original agent would still find it to be a good idea to follow its policy (in advance, based on the updateless agent's nature, without looking at the policy). This updateless agent is shared among the counterfactual variants of the original agent that exist in the updateless agent's ontology, it's their chosen updateless core, the source of coherence in their actions.

Good point.

(That said, it seems like to useful check to see what the optimal policy will do. And if someone believes it won't achieve the optimal policy, it seems useful to try to understand the barrier that stops that. I don't feel quite clear on this yet).

My initial thoughts were:

  • On one hand, if you positively reinforce, the system will seek it out, if you negatively reinforce the system will work around it.
  • On the other hand, there doesn't seem to be a principled difference between positive reinforcement and negative reinforcement. Like I would assume that the zero point wouldn't affect the trade-off between two actions as long as the difference was fixed.

Having thought about it a bit more, I think I managed to resolve the tension. It seems that if at least one of the actions is positive utility, then the s... (read more)

3Alex Turner9mo
This is only true for optimal policies, no? For learned policies, positive reward will upweight and generalize certain circuits (like "approach juice"), while negative reward will downweight and generally-discourage those same circuits. This can then lead to path-dependent differences in generalization (e.g. whether person pursues juice in general). (In general, I think reward is not best understood as an optimization target like "utility.")

Strongly agreed. I do worry that most people on LW have a bias towards formalisation even when it doesn't add very much.

What are the key philosophical problems you believe we need to solve for alignment?

I guess it depends on the specific alignment approach being taken, such as whether you're trying to build a sovereign or an assistant. Assuming the latter, I'll list some philosophical problems that seem generally relevant:

  1. metaphilosophy
    • How to solve new philosophical problems relevant to alignment as they come up?
    • How to help users when they ask the AI to attempt philosophical progress?
    • How to help defend the user against bad philosophical ideas (whether in the form of virulent memes, or intentionally optimized by other AIs/agents to manipulate the use
... (read more)

Minor correction

But then, in the small fraction of worlds where we survive, we simulate lots and lots of copies of that AI where it instead gets reward 0 when it attempts to betray us!

The reward should be negative rather than 0.

Regarding the AI not wanting to cave to threats, there's a sense in which the AI is also (implicitly) threatening us, so it might not apply. (Defining what counts as a "threat" is challenging).

Could someone clarify the relevance of ribosomes?

This seems wrong to me in large part because the AI safety community and EA community more broadly have been growing independent of increased interest in AI


Agreed, this is one of the biggest considerations missed, in my opinion, by people who think accelerating progress was good. (TBH, if anyone was attempting to accelerate progress to reduce AI risk, I think that they were trying to be too clever by half; or just rationalisting).

I guess I would lean towards saying that once powerful AI systems exist, we'll need powerful aligned systems relatively fast in order to develop against them, otherwise we'll be screwed. In other words, AI arms race dynamics push us towards a world where systems are deployed with an insufficient amount of testing and this provides one path for us to fall victim to an AI system that you might have expected iterative design to catch.

I would love to see you say why you consider these bad ideas. Obvious such AI's could be unaligned themselves or is it more along the lines of these assistants needing a complete model of human values to be truly useful?

5Raymond Arnold1y
John's Why Not Just... sequence is a series of somewhat rough takes on a few of them. (though I think many of them are not written up super comprehensively)

Speedup on evolution?

Maybe? Might work okayish, but doubt the best solution is that speculative.

As in, you could score some actions, but then there isn't a sense in which you "can" choose one according to any criterion.


I've noticed that issue as well. Counterfactuals are more a convenient model/story than something to be taken literally. You've grounded decision by taking counterfactuals to exist a priori. I ground them by noting that our desire to construct counterfactuals is ultimately based on evolved instincts and/or behaviours so these stories aren't just arbitrary stories but a way in which we can leverage the lessons that have been instilled in us by evolution. I'm curious, given this explanation, why do we still need choices to be actual?

1Jessica Taylor1y
Do you think of counterfactuals as a speedup on evolution? Could this be operationalized by designing AIs that quantilize on some animal population, therefore not being far from the population distribution, but still surviving/reproducing better than average?
2Jessica Taylor1y
Note the preceding I'm assuming use of a metaphysics in which you, the agent, can make choices. Without this metaphysics there isn't an obvious motivation for a theory of decisions. As in, you could score some actions, but then there isn't a sense in which you "can" choose one according to any criterion. Maybe this metaphysics leads to contradictions. In the rest of the post I argue that it doesn't contradict belief in physical causality including as applied to the self.

Let A be some action. Consider the statement: "I will take action A". An agent believing this statement may falsify it by taking any action B not equal to A. Therefore, this statement does not hold as a law. It may be falsified at will.


If you believe determinism then an agent can sometimes falsify it, sometimes not.

I think it's quite clear how shifting ontologies could break a specification of values. And sometimes you just need a formalisation, any formalisation, to play around with. But I suppose it depends more of the specific details of your investigation.

I strongly disagree with your notion of how privileging the hypothesis works. It's not absurd to think that techniques for making AIXI-tl value diamonds despite ontological shifts could be adapted for other architectures. I agree that there are other examples of people working on solving problems within a formalisation that seem rather formalisation specific, but you seem to have cast the net too wide.

2Alex Turner1y
My basic point remains. Why is it not absurd to think that, without further evidential justification? By what evidence have you considered the highly specific investigation into AIXI-tl, and located the idea that ontology identification is a useful problem to think about at all (in its form of "detecting a certain concept in the AI")? 

I tend to agree that burning up the timeline is highly costly, but more because Effective Altruism is an Idea Machine that has only recently started to really crank up. There's a lot of effort being directed towards recruiting top students from uni groups, but these projects require time to pay off.

I’m giving this example not to say “everyone should go do agent-foundations-y work exclusively now!”. I think it’s a neglected set of research directions that deserves far more effort, but I’m far too pessimistic about it to want humanity to put all its egg

... (read more)

An ability to refuse to generate theories about a hypothetical world being in a simulation.

I guess the problem with this test is that the kinds of people who could do this tend to be busy, so they probably can't do this with so little notice.

Hmm... It seems much, much harder to catch every single one than to catch 99%.

One of my assumptions is that it's possible to design a "satisficing" engine -- an algorithm that generates candidate proposals for a fixed number of cycles, and then, assuming at least one proposal with estimated utility greater than X has been generated within that amount of time, selects one of the qualifying proposals at random. If there are no qualifying candidates, the AI takes no action. If you have a straightforward optimizer that always returns the action with the highest expected utility, then, yeah, you only have to miss one "cheat" that improves "official" utility at the expense of murdering everyone everywhere and then we all die. But if you have a satisficer, then as long as some of the qualifying plans don't kill everyone, there's a reasonable chance that the AI will pick one of those plans. Even if you forget to explicitly penalize one of the pathways to disaster, there's no special reason why that one pathway would show up in a large majority of the AI's candidate plans.

Regarding the point about most alignment work not really addressing the core issue: I think that a lot of this work could potentially be valuable nonetheless. People can take inspiration from all kinds of things and I think there is often value in picking something that you can get a grasp on, then using the lessons from that to tackle something more complex. Of course, it's very easy for people to spend all of their time focusing on irrelevant toy problems and never get around to making any progress on the real problem. Plus there are costs with adding more voices into the conversation as it can be tricky for people to distinguish the signal from the noise.

Load More