All of Chris_Leong's Comments + Replies

Sounds really cool. Would be useful to have some idea of the kind of time you're planning to pick so that people in other timezones can make a call about whether or not to apply.

Good point! We are planning to gauge time preferences among the participants and fix slots then. What is maybe most relevant, we are intending to accommodate all time zones. (We have been doing this with PIBBSS fellows as well, so I am pretty confident we will be able to find time slots that work pretty well across the globe.)

I see some value in the framing of "general intelligence" as a binary property, but it also doesn't quite feel as though it fully captures the phenomenon. Like, it would seem rather strange to describe GPT4 as being a 0 on the general intelligence scale.

I think maybe a better analogy would be to consider the sum of a geometric sequence.

Consider the sum for a few values of r as it increases at a steady rate.

0.5 - 2a
0.6 - 2.5a
0.7 - 3.3a
0.8 - 5a
0.9 - 10a
1 - Diverges to infinity
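For anyone who wants to check the figures above: they follow from the closed form a / (1 - r) for an infinite geometric series with first term a and ratio r. A minimal sketch (the function name `geometric_sum` is just illustrative, and it only handles 0 <= r < 1):

```python
def geometric_sum(a: float, r: float) -> float:
    """Sum of a + a*r + a*r**2 + ... for a ratio 0 <= r < 1."""
    if not 0 <= r < 1:
        # At r = 1 (or above) the series diverges, so there is no finite sum.
        raise ValueError("series diverges unless 0 <= r < 1")
    return a / (1 - r)


# Sum in units of a, for ratios increasing towards 1.
for r in (0.5, 0.6, 0.7, 0.8, 0.9, 0.99):
    print(f"r = {r}: sum = {geometric_sum(1.0, r):.1f}a")
```

Equal steps in r buy larger and larger increases in the sum as r approaches 1.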

What we see then is quite significant returns to increases in r and then a sudden d... (read more)

So I've thought about this argument a bit more and concluded that you are correct, but also that there's a potential fix to get around this objection.

I think that it's quite plausible that an agent will have an understanding of its decision mechanism that a) lets it know it will take the same action in both counterfactuals b) won't tell it what action it will take in this counterfactual before it makes the decision.

And in that case, I think it makes sense to conclude that the Omega's prediction depends on your action such that paying gives you the $10,000... (read more)

Thanks for your response. There's a lot of good material here, although some of these components like modules or language seem less central to agency, at least from my perspective. I guess you might see these as appearing slightly down the stack?

They fit naturally into the coherent whole picture. In very broad strokes, that picture looks like selection theorems starting from selection pressures for basic agency, running through natural factorization of problem domains (which is where modules and eventually language come in), then world models and general purpose search (which finds natural factorizations dynamically, rather than in a hard-coded way) once the environment and selection objective have enough variety.

Summary: John describes the problems of inner and outer alignment. He also describes the concept of True Names - mathematical formalisations that hold up under optimisation pressure. He suggests that having a "True Name" for optimizers would be useful if we wanted to inspect a trained system for an inner optimiser and not risk missing something.

He further suggests that the concept of agency breaks down into lower-level components like "optimisation", "goals", "world models", etc. It would be possible to make further arguments about how these lower-level concepts are important for AI safety.

This might be worth a shot, although it's not immediately clear that having such powerful maths provers would accelerate alignment more than capabilities. That said, I have previously wondered myself whether there is a need to solve embedded agency problems or whether we can just delegate that to a future AGI.

Oh wow, it's fascinating to see someone actually investigating this proposal. (I had a similar idea, but only posted it in the EA meme group).

Sorry, I'm confused by the terminology: 

Thanks for the extra detail!

(Actually, I was reading a post by Mark Xu which seems to suggest that the TradingAlgorithms have access to the price history rather than the update history as I suggested above)

My understanding after reading this is that TradingAlgorithms generate a new trading policy after each timestep (possibly with access to the update history, but I'm unsure). Is this correct? If so, it might be worth clarifying this, even though it seems clearer later.

2 Alex Flint 4mo
That is correct. I know it seems a little weird to generate a new policy on every timestep. The reason it's done that way is that the logical inductor needs to understand the function that maps prices to the quantities that will be purchased, in order to solve for a set of prices that "defeat" the current set of trading algorithms. That function (from prices to quantities) is what I call a "trading policy", and it has to be represented in a particular way -- as a set of syntax trees over trading primitives -- in order for the logical inductor to solve for prices. A trading algorithm is a sequence of such sets of syntax trees, where each element in the sequence is the trading policy for a different time step. Normally, it would be strange to set up one function (trading algorithms) that generates another function (trading policies) that is different for every timestep. Why not just have the trading algorithm directly output the amount that it wants to buy/sell? The reason is that we need not just the quantity to buy/sell, but that quantity as a function of price, since prices themselves are determined by solving an optimization problem with respect to these functions. Furthermore, these functions (trading policies) have to be represented in a particular way. Therefore it makes most sense to have trading algorithms output a sequence of trading policies, one per timestep.
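To make the two levels concrete, here is a rough sketch of the structure described above. Everything in it is an illustrative assumption rather than the actual logical induction construction: the node types (`Const`, `Price`, `Sub`, `Mul`), the sentence name "phi", and the toy buying rule are all made up. It only shows a trading algorithm yielding one trading policy per timestep, where each policy is a syntax tree mapping prices to a quantity.

```python
from dataclasses import dataclass
from typing import Dict, Iterator, Union


# Hypothetical trading primitives: a tiny expression language over prices.
@dataclass(frozen=True)
class Const:
    value: float


@dataclass(frozen=True)
class Price:
    sentence: str  # looks up the current price of this sentence


@dataclass(frozen=True)
class Sub:
    left: "Expr"
    right: "Expr"


@dataclass(frozen=True)
class Mul:
    left: "Expr"
    right: "Expr"


Expr = Union[Const, Price, Sub, Mul]
# A trading policy: one syntax tree per sentence, giving quantity as a function of prices.
TradingPolicy = Dict[str, Expr]


def evaluate(expr: Expr, prices: Dict[str, float]) -> float:
    """Evaluate a syntax tree at a given set of prices."""
    if isinstance(expr, Const):
        return expr.value
    if isinstance(expr, Price):
        return prices[expr.sentence]
    if isinstance(expr, Sub):
        return evaluate(expr.left, prices) - evaluate(expr.right, prices)
    if isinstance(expr, Mul):
        return evaluate(expr.left, prices) * evaluate(expr.right, prices)
    raise TypeError(f"unknown expression node: {expr!r}")


def trading_algorithm() -> Iterator[TradingPolicy]:
    """A trading algorithm: yields a (possibly different) policy for each timestep."""
    t = 1
    while True:
        # Toy rule: buy "phi" in proportion to t * (0.5 - price of "phi").
        yield {"phi": Mul(Const(float(t)), Sub(Const(0.5), Price("phi")))}
        t += 1


algo = trading_algorithm()
policy_1 = next(algo)
policy_2 = next(algo)
quantity_1 = evaluate(policy_1["phi"], {"phi": 0.2})
quantity_2 = evaluate(policy_2["phi"], {"phi": 0.2})
```

The point of keeping the policy as a tree rather than a number is that whatever sets the prices can inspect or evaluate it at many candidate prices before any trade happens.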

Interesting, I think this clarifies things, but the framing also isn't quite as neat as I'd like.

I'd be tempted to redefine/reframe this as follows:

• Outer alignment for a simulator - Perfectly defining what it means to simulate a character. For example, how can we create a specification language so that we can pick out the character that we want? And what do we do with counterfactuals given they aren't actually literal?

• Inner alignment for a simulator - Training a simulator to perfectly simulate the assigned character

• Outer alignment for characters - fi... (read more)

I thought this was a really important point, although I might be biased because I was finding it confusing how some discussions were talking about the gradient landscape as though it could be modified and not clarifying the source of this (for example, whether they were discussing reinforcement learning).

First off, the base loss landscape of the entire model is a function  that's the same across all training steps, and the configuration of the weights selects somewhere on this loss landscape. Configuring the weights differently can put the mod

... (read more)

In the section: "The role of naturalized induction in decision theory" a lot of variables seem to be missing.

(Evolution) → (human values) is not the only case of inner alignment failure which we know about. I have argued that human values themselves are inner alignment failures on the human reward system. This has happened billions of times in slightly different learning setups. 

I expect that it has also happened to an extent with animals as well. I wonder if anyone has ever looked into this.

converge to

Converge to 1? (Context is "9. Non-Dogmatic...").

Anyway, thanks so much for writing this! I found this to be a very useful resource.

2 Alex Flint 4mo
Thanks - fixed! And thank you for the note, too.

It seems strange to treat ontological crises as a subset of embedded world-models, as it seems as though a Cartesian agent could face the same issues?

UDT doesn't really counter my claim that Newcomb-like problems are problems in which we can't ignore that our decisions aren't independent of the state of the world when we make that decision, even though in UDT we know less. To make this clear in the example of Newcomb's, the policy we pick affects the prediction which then affects the results of the policy when the decision is made. UDT isn't ignoring the fact that our decision and the state of the world are tied together, even if it possibly represents it in a different fashion. The UDT algorithm takes ... (read more)

1 Vladimir Nesov 5mo
UDT still doesn't forget enough. Variations on UDT that move towards acausal trade with arbitrary agents are more obviously needed because UDT forgets too much, since that makes it impossible to compute in practice and forgetting less poses a new issue of choosing a particular updateless-to-some-degree agent to coordinate with (or follow). But not forgetting enough can also be a problem. In general, an external/updateless agent (whose suggested policy the original agent follows) can forget the original preference, pursue a different version of it that has undergone an ontological shift. So it can forget the world and its laws, as long as the original agent would still find it to be a good idea to follow its policy (in advance, based on the updateless agent's nature, without looking at the policy). This updateless agent is shared among the counterfactual variants of the original agent that exist in the updateless agent's ontology, it's their chosen updateless core, the source of coherence in their actions.

Good point.

(That said, it seems like a useful check to see what the optimal policy will do. And if someone believes it won't achieve the optimal policy, it seems useful to try to understand the barrier that stops that. I don't feel quite clear on this yet).

My initial thoughts were:

  • On one hand, if you positively reinforce, the system will seek it out; if you negatively reinforce, the system will work around it.
  • On the other hand, there doesn't seem to be a principled difference between positive reinforcement and negative reinforcement. Like I would assume that the zero point wouldn't affect the trade-off between two actions as long as the difference was fixed.
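A quick way to see the second point, as a sketch under a strong assumption: a single-step choice where the agent simply picks the highest-reward action. The action names and reward values below are made up.

```python
def best_action(rewards: dict) -> str:
    """Return the action with the highest reward (a one-step argmax chooser)."""
    return max(rewards, key=rewards.get)


# Hypothetical rewards for two actions.
rewards = {"action_a": 1.0, "action_b": 0.0}

# Move the zero point: subtract the same constant from every reward,
# so both are now negative but their difference is unchanged.
shifted = {action: r - 5.0 for action, r in rewards.items()}

choice_original = best_action(rewards)
choice_shifted = best_action(shifted)
```

Both calls pick the same action, since subtracting a constant cannot change which reward is largest. This says nothing about how a learned policy behaves during training, where the sign of the reward can matter.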

Having thought about it a bit more, I think I managed to resolve the tension. It seems that if at least one of the actions is positive utility, then the s... (read more)

3 Alex Turner 5mo
This is only true for optimal policies, no? For learned policies, positive reward will upweight and generalize certain circuits (like "approach juice"), while negative reward will downweight and generally-discourage those same circuits. This can then lead to path-dependent differences in generalization (e.g. whether person pursues juice in general). (In general, I think reward is not best understood as an optimization target like "utility.")
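A toy illustration of the upweight/downweight point, assuming a plain REINFORCE update (no baseline) on a two-action softmax policy. The comment above does not name a specific learning algorithm, so this choice, the learning rate, and the function names are all assumptions.

```python
import math


def softmax(logits):
    """Convert logits to action probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_step(logits, action, reward, lr=0.5):
    """One REINFORCE update: logits += lr * reward * grad log pi(action)."""
    probs = softmax(logits)
    return [
        theta + lr * reward * ((1.0 if i == action else 0.0) - p)
        for i, (theta, p) in enumerate(zip(logits, probs))
    ]


start = [0.0, 0.0]  # both actions equally likely to begin with

# The same action (index 0) is taken in both cases; only the reward's sign differs.
after_positive = softmax(reinforce_step(start, action=0, reward=+1.0))
after_negative = softmax(reinforce_step(start, action=0, reward=-1.0))
```

After a positive reward, the probability of the action that was taken goes up. After a negative reward, it goes down. So shifting the zero point changes which behaviours get reinforced along the way, even in cases where the optimal policy is unchanged.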

Strongly agreed. I do worry that most people on LW have a bias towards formalisation even when it doesn't add very much.

What are the key philosophical problems you believe we need to solve for alignment?

I guess it depends on the specific alignment approach being taken, such as whether you're trying to build a sovereign or an assistant. Assuming the latter, I'll list some philosophical problems that seem generally relevant:

  1. metaphilosophy
    • How to solve new philosophical problems relevant to alignment as they come up?
    • How to help users when they ask the AI to attempt philosophical progress?
    • How to help defend the user against bad philosophical ideas (whether in the form of virulent memes, or intentionally optimized by other AIs/agents to manipulate the use
... (read more)

Minor correction

But then, in the small fraction of worlds where we survive, we simulate lots and lots of copies of that AI where it instead gets reward 0 when it attempts to betray us!

The reward should be negative rather than 0.

Regarding the AI not wanting to cave to threats, there's a sense in which the AI is also (implicitly) threatening us, so it might not apply. (Defining what counts as a "threat" is challenging).

Could someone clarify the relevance of ribosomes?

This seems wrong to me in large part because the AI safety community and EA community more broadly have been growing independent of increased interest in AI


Agreed, this is one of the biggest considerations missed, in my opinion, by people who think accelerating progress was good. (TBH, if anyone was attempting to accelerate progress to reduce AI risk, I think that they were trying to be too clever by half; or just rationalising).

I guess I would lean towards saying that once powerful AI systems exist, we'll need powerful aligned systems relatively fast in order to defend against them, otherwise we'll be screwed. In other words, AI arms race dynamics push us towards a world where systems are deployed with an insufficient amount of testing and this provides one path for us to fall victim to an AI system that you might have expected iterative design to catch.

I would love to see you say why you consider these bad ideas. Obviously such AIs could be unaligned themselves, or is it more along the lines of these assistants needing a complete model of human values to be truly useful?

5 Raymond Arnold 9mo
John's Why Not Just... sequence is a series of somewhat rough takes on a few of them. (though I think many of them are not written up super comprehensively)

Speedup on evolution?

Maybe? Might work okayish, but doubt the best solution is that speculative.

As in, you could score some actions, but then there isn't a sense in which you "can" choose one according to any criterion.


I've noticed that issue as well. Counterfactuals are more a convenient model/story than something to be taken literally. You've grounded decisions by taking counterfactuals to exist a priori. I ground them by noting that our desire to construct counterfactuals is ultimately based on evolved instincts and/or behaviours, so these stories aren't just arbitrary stories but a way in which we can leverage the lessons that have been instilled in us by evolution. I'm curious, given this explanation, why do we still need choices to be actual?

1 Jessica Taylor 10mo
Do you think of counterfactuals as a speedup on evolution? Could this be operationalized by designing AIs that quantilize on some animal population, therefore not being far from the population distribution, but still surviving/reproducing better than average?
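For readers unfamiliar with the term, a rough sketch of what "quantilizing" on a base population could look like: draw candidates from the base distribution, keep the top fraction q by utility, and pick one of those at random. The sample-based approximation, the parameter names, and the Gaussian "population" in the usage lines are illustrative assumptions, not anything specified in the question above.

```python
import math
import random


def quantilize(base_sampler, utility, q, n_samples=1000, rng=None):
    """Sample-based q-quantilizer.

    base_sampler(rng) draws one candidate from the base distribution;
    utility(candidate) scores it. Returns a uniformly random candidate
    from the top q fraction of n_samples draws.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    if rng is None:
        rng = random.Random()
    samples = [base_sampler(rng) for _ in range(n_samples)]
    samples.sort(key=utility, reverse=True)  # best candidates first
    top_k = max(1, math.ceil(q * n_samples))
    return rng.choice(samples[:top_k])


# Toy usage: the base "population" is a standard normal over a single trait,
# and utility is just the trait value itself.
rng = random.Random(0)
picked = quantilize(lambda r: r.gauss(0.0, 1.0), lambda x: x, q=0.1, rng=rng)
```

With q = 1 this just samples from the base distribution, and as q shrinks towards 0 it approaches picking the single best sample, so q controls how far the result strays from the population.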
2 Jessica Taylor 10mo
Note that in the preceding I'm assuming use of a metaphysics in which you, the agent, can make choices. Without this metaphysics there isn't an obvious motivation for a theory of decisions. As in, you could score some actions, but then there isn't a sense in which you "can" choose one according to any criterion. Maybe this metaphysics leads to contradictions. In the rest of the post I argue that it doesn't contradict belief in physical causality including as applied to the self.

Let A be some action. Consider the statement: "I will take action A". An agent believing this statement may falsify it by taking any action B not equal to A. Therefore, this statement does not hold as a law. It may be falsified at will.


If you believe determinism then an agent can sometimes falsify it, sometimes not.

I think it's quite clear how shifting ontologies could break a specification of values. And sometimes you just need a formalisation, any formalisation, to play around with. But I suppose it depends more on the specific details of your investigation.

I strongly disagree with your notion of how privileging the hypothesis works. It's not absurd to think that techniques for making AIXI-tl value diamonds despite ontological shifts could be adapted for other architectures. I agree that there are other examples of people working on solving problems within a formalisation that seem rather formalisation specific, but you seem to have cast the net too wide.

2 Alex Turner 10mo
My basic point remains. Why is it not absurd to think that, without further evidential justification? By what evidence have you considered the highly specific investigation into AIXI-tl, and located the idea that ontology identification is a useful problem to think about at all (in its form of "detecting a certain concept in the AI")? 

I tend to agree that burning up the timeline is highly costly, but more because Effective Altruism is an Idea Machine that has only recently started to really crank up. There's a lot of effort being directed towards recruiting top students from uni groups, but these projects require time to pay off.

I’m giving this example not to say “everyone should go do agent-foundations-y work exclusively now!”. I think it’s a neglected set of research directions that deserves far more effort, but I’m far too pessimistic about it to want humanity to put all its egg

... (read more)

An ability to refuse to generate theories about a hypothetical world being in a simulation.

I guess the problem with this test is that the kinds of people who could do this tend to be busy, so they probably can't do this with so little notice.

Hmm... It seems much, much harder to catch every single one than to catch 99%.

One of my assumptions is that it's possible to design a "satisficing" engine -- an algorithm that generates candidate proposals for a fixed number of cycles, and then, assuming at least one proposal with estimated utility greater than X has been generated within that amount of time, selects one of the qualifying proposals at random. If there are no qualifying candidates, the AI takes no action. If you have a straightforward optimizer that always returns the action with the highest expected utility, then, yeah, you only have to miss one "cheat" that improves "official" utility at the expense of murdering everyone everywhere and then we all die. But if you have a satisficer, then as long as some of the qualifying plans don't kill everyone, there's a reasonable chance that the AI will pick one of those plans. Even if you forget to explicitly penalize one of the pathways to disaster, there's no special reason why that one pathway would show up in a large majority of the AI's candidate plans.
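The engine described above can be sketched in a few lines. This is only an illustration of the selection rule as stated (a fixed number of cycles, a threshold X, a random pick among qualifying proposals, no action otherwise); the names `propose`, `estimate_utility` and `threshold` are placeholders, and the toy usage says nothing about how a real system would generate or score plans.

```python
import random
from typing import Callable, List, Optional, TypeVar

Plan = TypeVar("Plan")


def satisfice(
    propose: Callable[[random.Random], Plan],
    estimate_utility: Callable[[Plan], float],
    threshold: float,
    n_cycles: int,
    rng: random.Random,
) -> Optional[Plan]:
    """Generate n_cycles proposals and return a random one whose estimated
    utility exceeds threshold, or None (take no action) if none qualifies."""
    qualifying: List[Plan] = []
    for _ in range(n_cycles):
        plan = propose(rng)
        if estimate_utility(plan) > threshold:
            qualifying.append(plan)
    if not qualifying:
        return None  # no qualifying candidate: the AI takes no action
    return rng.choice(qualifying)


# Toy usage: "plans" are integers and utility is the integer itself.
rng = random.Random(0)
chosen = satisfice(lambda r: r.randint(0, 100), float, threshold=90, n_cycles=200, rng=rng)
nothing = satisfice(lambda r: r.randint(0, 100), float, threshold=1000, n_cycles=200, rng=rng)
```

Unlike an argmax, the random pick gives no extra weight to the single highest-scoring plan, which is the property the argument above relies on.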

Regarding the point about most alignment work not really addressing the core issue: I think that a lot of this work could potentially be valuable nonetheless. People can take inspiration from all kinds of things and I think there is often value in picking something that you can get a grasp on, then using the lessons from that to tackle something more complex. Of course, it's very easy for people to spend all of their time focusing on irrelevant toy problems and never get around to making any progress on the real problem. Plus there are costs with adding more voices into the conversation as it can be tricky for people to distinguish the signal from the noise.

If we tell an AI not to invent nanotechnology, not to send anything to protein labs, not to hack into all of the world's computers, not to design weird new quantum particles, not to do 100 of the other most dangerous and weirdest things we can think of, and then ask it to generalize and learn not to do things of that sort

I had the exact same thought. My guess would be that Eliezer might say that, since the AI is maximising, we're screwed if the generalisation function misses even one action of this sort that we should exclude.

Sure, I agree! If we miss even one such action, we're screwed. My point is that if people put enough skill and effort into trying to catch all such actions, then there is a significant chance that they'll catch literally all the actions that are (1) world-ending and that (2) the AI actually wants to try. There's also a significant chance we won't, which is quite bad and very alarming, hence people should work on AI safety.

I tend to value a longer timeline more than a lot of other people do. I guess I see EA and AI Safety setting up powerful idea machines that get more powerful when they are given more time to gear up.  A lot more resources have been invested into EA field-building recently, but we need time for these investments to pay off. At EA London this year, I gained a sense that AI Safety movement building is only now becoming its own thing; and of course it'll take time to iterate to get it right, then time for people to pass through the programs, then time for... (read more)

How large do you expect Conjecture to become? What percent of people do you expect to be working on the product and what percentage to be working on safety? 

0 Connor Leahy 1y
Ideally, we would like Conjecture to scale quickly. Alignment-wise, in 5 years' time, we want to have the ability to take a billion dollars and turn it into many efficient, capable, aligned teams of 3-10 people working on parallel alignment research bets, and be able to do this reliably and repeatedly. We expect to be far more constrained by talent than anything else on that front, and are working hard on developing and scaling pipelines to hopefully alleviate such bottlenecks. For the second question, we don't expect it to be a competing force (as in, we have people who could be working on alignment working on product instead). See point two in this comment.

Random idea:  A lot of people seem discouraged from doing anything about AI Safety because it seems like such a big overwhelming problem.

What if there was a competition to encourage people to engage in low-effort actions towards AI safety, such as hosting a dinner for people who are interested, volunteering to run a session on AI safety for their local EA group, answering a couple of questions on the stampy wiki, offering to proof-read a few people’s posts or offering a few free tutorial sessions to aspiring AI Safety Researchers?

I think there’s a dec... (read more)

Thoughts on the introduction of Goodhart's. Currently, I'm more motivated by trying to make the leaderboard, so maybe that suggests that merely introducing a leaderboard, without actually paying people, would have had much the same effect. Then again, that might just be because I'm not that far off. And if there hadn't been the payment, maybe I wouldn't have ended up in the position where I'm not that far off.

I guess I feel incentivised to post a lot more than I would otherwise, but especially in the comments rather than the posts since if you post a lot o... (read more)

If we have an algorithm that aligns an AI with X values, then we can add human values to get an AI that is aligned with human values.

On the other hand, I agree that it doesn't really make sense to declare an AI safe in the abstract, rather than in respect to say human values. (Small counterpoint: in order to be safe, it's not just about alignment, you also need to avoid bugs. This can be defined without reference to human values. However, this isn't sufficient for safety).

I suppose this works as a criticism of approaches like quantilisers or impact-minimisation which attempt abstract safety. Although I can't see any reason why it'd imply that it's impossible to write an AI that can be aligned with arbitrary values.

If you think this is financially viable, then I'm fairly keen on this, especially if you provide internships and development opportunities for aspiring safety researchers.

3 Stuart Armstrong 1y
Yes, those are important to provide, and we will.

In science and engineering, people will usually try very hard to make progress by standing on the shoulders of others. The discourse on this forum, on the other hand, more often resembles that of a bunch of crabs in a bucket.

Hmm... Yeah, I certainly don't think that there's enough collaboration or appreciation of the insights that other approaches may provide.

Any thoughts on how to encourage a healthier dynamic?

2 Koen Holtman 1y
I have no easy solution to offer, except for the obvious comment that the world is bigger than this forum. My own stance is to treat the over-production of posts of type 1 above as just one of these inevitable things that will happen in the modern media landscape. There is some value to these posts, but after you have read about 20 of them, you can be pretty sure about how the next one will go. So I try to focus my energy, as a reader and writer, on work of type 2 instead. I treat arXiv as my main publication venue, but I do spend some energy cross-posting my work of type 2 here. I hope that it will inspire others, or at least counter-balance some of the type 1 work.

The object-level claims here seem straightforwardly true, but I think "challenges with breaking into MIRI-style research" is a misleading way to characterize it. The post makes it sound like these are problems with the pipeline for new researchers, but really these problems are all driven by challenges of the kind of research involved.

There's definitely some truth to this, but I guess I'm skeptical that there isn't anything that we can do about some of these challenges. Actually, rereading I can see that you've conceded this towards the end of your post. I... (read more)

To be clear, I don't intend to argue that the problem is too hard or not worthwhile or whatever. Rather, my main point is that solutions need to grapple with the problems of teaching people to create new paradigms, and working with people who don't share standard frames. I expect that attempts to mimic the traditional pipelines of paradigmatic fields will not solve those problems. That's not an argument against working on it, it's just an argument that we need fundamentally different strategies than the standard education and career paths in other fields.

Even if the content is proportional, the signal-to-noise ratio will still be much higher for those interested in MIRI-style research. This is a natural consequence of being a niche area.

When I said "might not have the capacity to vet", I was referring to a range of orgs.

I would be surprised if the lack of papers didn't have an effect as presumably, you're trying to highlight high-quality work and people are more motivated to go the extra yard when trying to get published because both the rewards and standards are higher.

Just some sort of official & long-term & OFFLINE study program that would teach some of the previous published MIRI research would be hugely beneficial for growing the AF community.


At the last EA Global there was some sort of AI safety breakout session. There were ~12 tables with different topics. I was dismayed to discover that almost every table was full of people excitedly discussing various topics in prosaic AI alignment and other things, while the AF table had just 2 (!) people.

Wow, didn't realise it was that little!

I have spoken with MIRI p

... (read more)

Unclear. Some things that might be involved

  • a somewhat anti/non academic vibe
  • a feeling that they have the smartest people anyway, only hire the elite few that have a proven track record
  • feeling that it would take too much time and energy to educate people
  • a lack of organisational energy
  • .... It would be great if somebody from MIRI could chime in.

I might add that I know a number of people interested in AF who feel somewhat afloat/find it difficult to contribute. Feels a bit like a waste of talent.
