All of Joe_Collman's Comments + Replies

Glad to see you're working on this. It seems even more clearly correct (the goal, at least :)) for not-so-short timelines. Less clear how best to go about it, but I suppose that's rather the point!

A few thoughts:

  1. I expect it's unusual that [replace methodology-1 with methodology-2] will be a pareto improvement: other aspects of a researcher's work will tend to have adapted to fit methodology-1. So I don't think the creation of some initial friction is a bad sign. (also mirrors therapy - there's usually a [take things apart and better understand them] phase
... (read more)
Adam Shimi (5 points, 12d):
Thanks for the kind words and useful devil's advocate! (I'm expecting nothing less from you ;p)

I agree that pure replacement of methodology is a massive step that is probably premature before we have a really deep understanding both of the researcher's approach and of the underlying algorithm for knowledge production. Which is why, in my model, this comes quite late; instead the first steps are more about revealing the cached methodology to the researcher, and showing alternatives from History of Science (and Technology) to make more options and approaches credible for them. Also, looking at the "sins of the fathers" for philosophy of science (how methodologies have fucked up people across history) is part of our last set of framing questions. ;)

Two reactions here:

  1. I agree with the need to find things that are missing and alternatives, which is where the history and philosophy of science work comes in to help. One advantage there is that you can generally judge in hindsight whether the methodology was successful or problematic, compared to interviews.
  2. I hadn't thought about interviewing other researchers. I expect it to be less efficient in a lot of ways than the HPS work, but I'm also now on the lookout for the option, so thanks!

I see what you're pointing out. A couple of related thoughts:

  1. The benefit of working with established researchers is that you have a historical record of what they did, which makes it easier to judge whether you're actually helping.
  2. I also expect helping established researchers to be easier on some dimensions, because they have more experience learning new models and leveraging them.
  3. Related to your first point, I don't worry too much about messing people up, because the initial input will be far less invasive than wholesale replacement of methodologies. But we're still investigating the risks to be sure we're not doing something net negative.

[apologies on slowness - I got distracted]
Granted on type hierarchy. However, I don't think all instances of GPT need to look like they inherit from the same superclass. Perhaps there's such a superclass, but we shouldn't assume it.

I think most of my worry comes down to potential reasoning along the lines of:

  • GPT is a simulator;
  • Simulators have property p;
  • Therefore GPT has property p.

When what I think is justified is:

  • GPT instances are usually usefully thought of as simulators;
  • Simulators have property p;
  • We should suspect that a given instance of GPT will have
... (read more)

There must be some evidence that the initial appearance of alignment was due to the model actively trying to appear aligned only in the service of some ulterior goal.

"trying to appear aligned" seems imprecise to me - unless you mean to be more specific than the RFLO description. (see footnote 7: in general, there's no requirement for modelling of the base optimiser or oversight system; it's enough to understand the optimisation pressure and be uncertain whether it'll persist).

Are you thinking that it makes sense to agree to test for systems that are "tryin... (read more)

Mostly I agree with this.
I have more thoughts, but probably better to put them in a top-level post - largely because I think this is important and would be interested to get more input on a good balance.

A few thoughts on LW endorsing invalid arguments:
I'd want to separate considerations of impact on [LW as collective epistemic process] from [LW as outreach to ML researchers]. E.g. it doesn't necessarily seem much of a problem for the former to have reliance on unstated assumptions. I wouldn't formally specify an idea before sketching it, and it's not clear... (read more)

David Scott Krueger (1 point, 1mo):
Yeah, I put those in one sentence in my comment, but I agree that they are two separate points. Re: impact on ML community: I wasn't thinking about anything in particular; I just think the ML community should have more respect for LW/x-safety, and stuff like that doesn't help.

I want to maximize the bandwidth between human alignment researchers and AI tools/oracles/assistants/simulations. It is essential that these tools are developed by (or in a tight feedback loop with) actual alignment researchers doing theory work, because we want to simulate and play with thought processes and workflows that produce useful alignment ideas.

What are your thoughts on failure modes with this approach?
(please let me know if any/all of the following seems confused/vanishingly unlikely)

For example, one of the first that occurs to me is that such c... (read more)

Thanks a lot for this comment. These are extremely valid concerns that we've been thinking about a lot.

I'd just like the designers of alignment-research boosting tools to have clear arguments that nothing of this sort is likely.

I don't think this is feasible given our current understanding of epistemology in general and epistemology of alignment research in particular. The problems you listed are potential problems with any methodology, not just AI assisted research. Being able to look at a proposed method and make clear arguments that it's unlikely to hav... (read more)

Seth Herd (0 points, 15d):
This seems like a valid concern. It seems to apply to other directions in alignment research as well: any approach can make progress in some directions seem easier, while ultimately that direction will be a dead end. Based on that logic, having more different approaches should serve as a sort of counterbalance. As we make judgment calls about ease of progress vs. ultimate usefulness, having more options would seem likely to provide better progress in useful directions.

Great post. Very interesting.

However, I think that assuming there's a "true name" or "abstract type that GPT represents" is an error.

If GPT means "transformers trained on next-token prediction", then GPT's true name is just that. The character of the models produced by that training is another question - an empirical one. That character needn't be consistent (even once we exclude inner alignment failures).

Even if every GPT is a simulator in some sense, I think there's a risk of motte-and-baileying our way into trouble.

Also see this comment thread [] for discussion of true names and the inadequacy of "simulator".

If GPT means "transformers trained on next-token prediction", then GPT's true name is just that.

Things are instances of more than one true name because types are hierarchical.

GPT is a thing. GPT is an AI (a type of thing). GPT is also an ML model (a type of AI). GPT is also a simulator (a type of ML model). GPT is a generative pretrained transformer (a type of simulator). GPT-3 is a generative pretrained transformer with 175B parameters trained on a particular dataset (a type/instance of GPT).
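The hierarchy described above can be sketched as a toy class tree. This is purely illustrative: the class names are hypothetical labels for the argument's categories, not any real API, and the point is just that one object is simultaneously an instance of every level.

```python
# Toy sketch (hypothetical names) of the hierarchy described above.
class Thing: ...
class AI(Thing): ...           # an AI is a type of thing
class MLModel(AI): ...         # an ML model is a type of AI
class Simulator(MLModel): ...  # a simulator is a type of ML model

class GPT(Simulator):
    """A generative pretrained transformer (a type of simulator)."""
    def __init__(self, n_params: int, dataset: str):
        self.n_params = n_params
        self.dataset = dataset

# GPT-3: a particular GPT, specified by parameter count and training data.
gpt3 = GPT(n_params=175_000_000_000, dataset="a particular dataset")

# The same object is an instance of every level of the hierarchy at once,
# so calling it "a simulator" and "an ML model" are both true names for it.
print(isinstance(gpt3, Simulator), isinstance(gpt3, MLModel), isinstance(gpt3, Thing))
```

The design point being illustrated: naming a thing by one of its supertypes doesn't deny its other types.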

The intention is not to rename GPT -> simulator. Things tha... (read more)

Presumably "too dismissive of speculative and conceptual research" is a direct consequence of increased emphasis on rigor. Rigor is to be preferred all else being equal, but all else is not equal.

It's not clear to me how we can encourage rigor where effective without discouraging research on areas where rigor isn't currently practical. If anyone has ideas on this, I'd be very interested.

I note that within rigorous fields, the downsides of rigor are not obvious: we can point to all the progress made; progress that wasn't made due to the neglect of conceptua... (read more)

A rough heuristic I have is that if the idea you're introducing is highly novel, it's OK not to be rigorous. Your contribution is bringing this new, potentially very promising, idea to people's attention. You're seeking feedback on how promising it really is and where people are confused, which will be helpful for later formalizing it and studying it more rigorously. But if you're engaging with a large existing literature and everyone seems to be confused and talking past each other (which is how I'd characterize a significant fraction of the mesa-optimization literature, for example) -- then the time has come to make things more rigorous, and you are unlikely to make much further progress without it.

I think rigor and clarity are more similar than you indicate.  I mostly think of rigor as either (i) formal definitions and proofs, or (ii) experiments well described, executed, and interpreted.  I think it's genuinely hard to reach a high level of clarity about many things without (i) or (ii).  For instance, people argue about "optimization", but without referencing (hypothetical) detailed experiments or formal notions, those arguments just won't be very clear; experimental or definitional details just matter a lot, and this is very often t... (read more)

I’d expect AI checks and balances to have some benefits even if AIs are engaged in advanced collusion with each other. For example, AIs rewarded to identify and patch security vulnerabilities would likely reveal at least some genuine security vulnerabilities, even if they were coordinating with other AIs to try to keep the most important ones hidden.

This seems not to be a benefit.
What we need is to increase the odds of finding the important vulnerabilities. Collusion that reveals a few genuine vulnerabilities seems likely to lower our odds by giving us mis... (read more)

Ok, thanks - I can at least see where you're coming from then.
Do you think debate satisfies the latter directly - or is the frontier only pushed out if it helps in the automation process? Presumably the latter (??) - or do you expect e.g. catastrophic out-with-a-whimper dynamics before deceptive alignment?

I suppose I'm usually thinking of a "do we learn anything about what a scalable alignment approach would look like?" framing. Debate doesn't seem to get us much there (whether for scalable ELK, or scalable anything else), unless we can do the automating a... (read more)

Thanks for the list.

On (7), I'm not clear how this is useful unless we assume that debaters aren't deceptively aligned. (if debaters are deceptively aligned, they lose the incentive to expose internal weaknesses in their opponent, so the capability to do so doesn't achieve much)

In principle, debate on weights could add something over interpretability tools alone, since optimal in-the-spirit-of-things debate play would involve building any useful interpretability tools which didn't already exist. (optimal not-in-the-spirit-of-things play likely involves per... (read more)

There's one attitude towards alignment techniques which is something like "do they prevent all catastrophic misalignment?" And there's another which is more like "do they push out the frontier of how advanced our agents need to be before we get catastrophic misalignment?" I don't think the former approach is very productive, because by that standard no work is ever useful. So I tend to focus on the latter, with the theory of victory being "push the frontier far enough that we get a virtuous cycle of automating alignment work".

To rephrase, it seems to me that in some sense all evidence is experimental. What changes is the degree of generalisation/abstraction required to apply it to a particular problem.

Once we make the distinction between experimental and non-experimental evidence, then we allow for problems on which we only get the "non-experimental" kind - i.e. the kind requiring sufficient generalisation/abstraction that we'd no longer tend to think of it as experimental.

So the question on Y-problems becomes something like:

  • Given some characterisation of [experimental evidence
... (read more)

Thanks again for writing this.
A few thoughts:

I think that the release of GPT-3 and the OpenAI API led to significantly increased focus and somewhat of a competitive spirit around large language models... I don't think OpenAI predicted this in advance, and believe that it would have been challenging, but not impossible, to foresee this.

Do you believe any general lessons have been learned from this? Specifically, it seems a highly negative pattern if [we can't predict concretely how this is likely to go badly] translates to [we don't see any reason not to go... (read more)

[First of all, many thanks for writing the post; it seems both useful and the kind of thing that'll predictably attract criticism]

I'm not quite sure what you mean to imply here (please correct me if my impression is inaccurate - I'm describing how-it-looks-to-me, and I may well be wrong):

I would expect OpenAI leadership to put more weight on experimental evidence than you...

Specifically, John's model (and mine) has:
X = [Class of high-stakes problems on which we'll get experimental evidence before it's too late]
Y = [Class of high-stakes problems on which we... (read more)

Jacob Hilton (2 points, 1mo):
I don't think I understand your question about Y-problems, since it seems to depend entirely on how specific something can be and still count as a "problem". Obviously there is already experimental evidence that informs predictions about existential risk from AI in general, but we will get no experimental evidence of any exact situation that occurs beforehand. My claim was more of a vague impression about how OpenAI leadership and John tend to respond to different kinds of evidence in general, and I do not hold it strongly.

However, suppose that the box setup uses theoretically robust cybersecurity, combined with an actual physical box that is designed to not let any covert information enter or leave.

I think what you want to say here is:

However, suppose that the box setup uses robust cybersecurity, combined with an actual physical box that does not let any covert information enter or leave.

  1. "...theoretically..." weakens rather than strengthens: we need the cybersecurity to be robust, in spite of implementation details.
  2. It doesn't matter what the box "is designed to" do; it matt
... (read more)

Interesting, thanks. This makes sense to me.
I do think strong-HCH can support the "...more like a simulated society..." stuff in some sense - which is to say that it can be supported so long as we can rely on individual Hs to robustly implement the necessary pointer passing (which, to be fair, we can't).

To add to your "tree of John Wentworths", it's worth noting that H doesn't need to be an individual human - so we could have our H be e.g. {John Wentworth, Eliezer Yudkowsky, Paul Christiano, Wei Dai}, or whatever team would make you more optimistic about lack of memetic disaster. (we also wouldn't need to use the same H at every level)

Yeah, at some point we're basically simulating the alignment community (or possibly several copies thereof interacting with each other). There will probably be another post on that topic soonish.

I'd be interested in your thoughts on [Humans-in-a-science-lab consulting HCH], for questions where we expect that suitable empirical experiments could be run on a significant proportion of subquestions. It seems to me that lack of frequent empirical grounding is what makes HCH particularly vulnerable to memetic selection.

Would you still expect this to go badly wrong (assume you get to pick the humans)? If so, would you expect sufficiently large civilizations to be crippled through memetic selection by default? If [yes, no], what do you see as the importan... (read more)

Ok, so, some background on my mental image. Before yesterday, I had never pictured HCH as a tree of John Wentworths (thank you Rohin for that). When I do picture John Wentworths, they mostly just... refuse to do the HCH thing. Like, they take one look at this setup and decide to (politely) mutiny or something. Maybe they're willing to test it out, but they don't expect it to work, and it's likely that their output is something like the string "lol nope". I think an entire society of John Wentworths would probably just not have bureaucracies at all; nobody would intentionally create them, and if they formed accidentally nobody would work for them or deal with them.

Now, there's a whole space of things-like-HCH, and some of them look less like a simulated infinite bureaucracy and more like a simulated society. (The OP mostly wasn't talking about things on the simulated-society end of the spectrum, because there will be another post on that.) And I think a bunch of John Wentworths in something like a simulated society would be fine - they'd form lots of small teams working in-person, have forums like LW for reasonably-high-bandwidth interteam communication, and have bounties on problems and secondary markets on people trying to get the bounties and independent contractors and all that jazz.

Anyway, back to your question. If those John Wentworths lacked the ability to run experiments, they would be relatively pessimistic about their own chances, and a huge portion of their work would be devoted to figuring out how to pump bits of information and stay grounded without a real-world experimental feedback channel. That's not a deal-breaker; background knowledge of our world already provides far more bits of evidence than any experiment ever run, and we could still run experiments on the simulated-Johns. But I sure would be a lot more optimistic with an experimental channel. I do not think memetic selection in particular would cripple those Johns, because that's exactly t

Sure, that's true - but in that case the entire argument should be put in terms of:
We can (aim to) implement a pivotal process before a unilateral AGI-assisted pivotal act is possible.

And I imagine the issue there would all be around the feasibility of implementation. I think I'd give a Manhattan project to solve the technical problem much higher chances than a pivotal process. (of course people should think about it - I just won't expect them to come up with anything viable)

Once it's possible, the attitude of the creating org before interacting with their... (read more)

This still seems to somewhat miss the point (as I pointed out last time):
Conditional on org X having an aligned / corrigible AGI, we should expect:

  1. If the AGI is an aligned sovereign, it'll do the pivotal act (PA) unilaterally if that's best, and do it in distributed fashion if that's best (according to whatever it's aligned to).
  2. If the AGI is more like a corrigible tool, we should expect X to ask 'their' AGI what would be best to do (or equivalent), and we're pretty-much back to case 1.

The question isn't what the humans in X would do, but what the [AGI + hu... (read more)

I think you are missing the possibility that the outcomes of the pivotal process could be:

  • no one builds autonomous AGI
  • autonomous AGI is built only in post-pivotal outcome states, where the condition of building it is alignment being solved

Well I'm sure I could have been clearer. (and it's possible that I'm now characterising what I think, rather than what I wrote)

But getting that impression is pretty natural: in my argument, a large part of the problem does come from its sometimes being correct to pick the question-ignoring answer. ('correct' meaning something like: [leads to best consequences, according to our values])
Or alternatively, that a correct decision algorithm would sometimes pick the question-ignoring answer.

I think I focus on this, since it's the non-obvious part of the argument... (read more)

...the human can just use both answers in whichever way it wants, independently of which it selects as the correct answer...
I don't think you disagreed with this?

Yes, agreed.

A few points on the rest:

  1. At the highest level, the core issue is that QI makes it quite a bit harder to identify misalignment. If aligned systems will sometimes not answer the question, non-answering isn't necessarily strong evidence of misalignment.
    So "consequentialist judges will [sometimes correctly] select QIA's" is bad in the sense that it provides cover for "consequentialist judg
... (read more)
Chris van Merwijk (0 points, 5mo):
"I talk about consequentialists, but not rational consequentialists" - ok, this was not the impression I was getting.

This mostly seems to be an argument for: "It'd be nice if no pivotal act is necessary", but I don't think anyone disagrees with that.

As for "Should an AGI company be doing this?" the obvious answer is "It depends on the situation". It's clearly nice if it's not necessary. Similarly, if [the world does the enforcement] has higher odds of success than [the AGI org does the enforcement] then it's clearly preferable - but it's not clear that would be the case.

I think it's rather missing the point to call it a "pivotal act philosophy" as if anyone values pivota... (read more)

This mostly seems to be an argument for: "It'd be nice if no pivotal act is necessary", but I don't think anyone disagrees with that.

It's arguing that, given that your organization has scary (near) AGI capabilities, it is not so much harder (to get a legitimate authority to impose an off-switch on the world's compute) than (to 'manufacture your own authority' to impose that off-switch) such that it's worth avoiding the cost of (developing those capabilities while planning to manufacture authority). Obviously there can be civilizations where that's true, and civilizations where that's not true.

Examples would be interesting, certainly. Concerning the post's point, I'd say the relevant claim is that [type of alignment research that'll be increasingly done in slow takeoff scenarios] is already being done by non x-risk motivated people.

I guess the hope is that at some point there are clear-to-everyone problems with no hacky solutions, so that incentives align to look for fundamental fixes - but I wouldn't want to rely on this.

Wholeheartedly agree, and I think it's great that you're doing this.
I'll be very interested in what you learn along the way w.r.t. more/less effective processes.

(Bonus points for referencing the art of game design - one of my favourite books.)

Adam Shimi (2 points, 6mo):
Thanks! Yes, this is very much an experiment, and even if it fails, I expect it to be a productive mistake [] we can learn from. ;)

Thanks. A few thoughts:

  • It is almost certainly too long. Could use editing/distillation/executive-summary. I erred on the side of leaving more in, since the audience I'm most concerned with are those who're actively working in this area (though for them there's a bit much statement-of-the-obvious, I imagine).
  • I don't think most of it is new, or news to the authors: they focused on the narrow version for a reason. The only part that could be seen as a direct critique is the downside risks section: I do think their argument is too narrow.
  • As it relates to Truth
... (read more)

For sure I agree that the researcher knowing these things is a good start - so getting as many potential researchers to grok these things is important.

My question is about which ideas researchers should focus on generating/elaborating given that they understand these things. We presumably don't want to restrict thinking to ideas that may overcome all these issues - since we want to use ideas that fail in some respects, but have some aspect that turns out to be useful.

Generating a broad variety of new ideas is great, and we don't want to be too quick in thr... (read more)

Mostly I'd agree with this, but I think there needs to be a bit of caution and balance around:

How do we get more streams of evidence? By making productive mistakes. By attempting to leverage weird analogies and connections, and iterating on them. We should obviously recognize that most of this will be garbage, but you’ll be surprised how many brilliant ideas in the history of science first looked like, or were, garbage.

Do we want variety? Absolutely: worlds where things work out well likely correlate strongly with finding a variety of approaches.

However, t... (read more)

Logan Riggs Smith (2 points, 6mo):
I bet Adam will argue that this (or something similar) is the minimum we want for a research idea, because I agree with your point that we shouldn't expect a solution to alignment to fall out of the marketing program for Oreos. We want to constrain it to at least "has a plausible story on reducing x-risk", and maybe what's mentioned in the quote as well.

Suggestion on Agency 2.1: rephrase so that the "Before reading his post" part comes before the link to the post. I assume there'll otherwise be some overzealous link followers.

Richard Ngo (2 points, 7mo):
Thanks, done!

I'm encouraged by your optimism, and wish you the best of luck (British, and otherwise), but I hope you're not getting much of your intuition from the "Humans have demonstrated a skill with value extrapolation..." part. I don't think we have good evidence for this in a broad enough range of circumstances for it to apply well to the AGI case.

We know humans do pretty 'well' at this - when surrounded by dozens of other similar agents, in a game-theoretical context where it pays to cooperate, it pays to share values with others, and where extreme failure modes... (read more)

Stuart Armstrong (3 points, 7mo):
I do not put too much weight on that intuition, except as an avenue to investigate (how do humans do it, exactly? If it depends on the social environment, can the conditions of that be replicated?).

Ah, I was just being an idiot on the bargaining system w.r.t. small numbers of people being able to hold it to ransom. Oops. Agreed that more majority power isn't desirable.
[re iteration, I only meant that the bargaining could become iterated if the initial bargaining result were to decide upon iteration (to include more future users). I now don't think this is particularly significant.]

I think my remaining uncertainty (/confusion) is all related to the issue I first mentioned (embedded copy experiences). It strikes me that something like this can also hap... (read more)

I agree with most of this. Not sure about how much moral weight I'd put on "a computation structured like an agent" - some, but it's mostly coming from [I might be wrong] rather than [I think agentness implies moral weight].

Agreed that malthusian dynamics gives you an evolution-like situation - but I'd guess it's too late for it to matter: once you're already generally intelligent, can think your way to the convergent instrumental goal of self-preservation, and can self-modify, it's not clear to me that consciousness/pleasure/pain buys you anything.

Heurist... (read more)

Thanks, that's interesting, though mostly I'm not buying it (still unclear whether there's a good case to be made; fairly clear that he's not making a good case).

  1. Most of it seems to say "Being a subroutine doesn't imply something doesn't suffer". That's fine, but few positive arguments are made. Starting with the letter 'h' doesn't imply something doesn't suffer either - but it'd be strange to say "Humans obviously suffer, so why not houses, hills and hiccups?".
  2. We infer preference from experience of suffering/joy...:
    [Joe Xs when he might not X] &a
... (read more)

Sure, I'm not sure there's a viable alternative either. This kind of approach seems promising - but I want to better understand any downsides.

My worry wasn't about the initial 10%, but about the possibility of the process being iterated such that you end up with almost all bargaining power in the hands of power-keepers.

In retrospect, this is probably silly: if there's a designable-by-us mechanism that better achieves what we want, the first bargaining iteration should find it. If not, then what I'm gesturing at must either be incoherent, or not endorsed by... (read more)

Vanessa Kosoy (2 points, 10mo):
I'm not sure what you mean here, but also the process is not iterated: the initial bargaining is deciding the outcome once and for all. At least that's the mathematical ideal we're approximating.

I don't think so? The bargaining system does advantage large groups over small groups. In practice, I think that for the most part people don't care much about what happens "far" from them (for some definition of "far", not physical distance), so giving them private utopias is close to optimal from each individual's perspective. Although it's true they might pretend to care more than they do for the usual reasons, if they're thinking in "far-mode".

I would certainly be very concerned about any system that gives even more power to majority views. For example, what if the majority of people are disgusted by gay sex and prefer it not to happen anywhere? I would rather accept things I disapprove of happening far away from me than allow other people to control my own life.

Ofc the system also mandates win-win exchanges. For example, if Alice's and Bob's private utopias each contain something strongly unpalatable to the other but not strongly important to the respective customer, the bargaining outcome will remove both unpalatable things.

I'm fine with strong-denosing negative utilitarians who would truly stick to their guns about negative utilitarianism (but I also don't think there are many).

This seems generally clear and useful. Thumbs-up.

(2) could probably use a bit more "...may...", "...sometimes..." for clarity. You can't fool all of the models all of the time etc. etc.

Beth Barnes (1 point, 10mo):
thanks, edited :)

Sure, that's possible (and if so I agree it'd be importantly dystopic) - but do you see a reason to expect it?
It's not something I've thought about a great deal, but my current guess is that you probably don't get moral patients without aiming for them (or by using training incentives much closer to evolution than I'd expect).

Beth Barnes (1 point, 10mo):
some relevant ideas here maybe: []

I guess I expect there to be a reasonable amount of computation taking place, and it seems pretty plausible a lot of these computations will be structured like agents who are taking part in the Malthusian competition. I'm sufficiently uncertain about how consciousness works that I want to give some moral weight to 'any computation at all', and reasonable weight to 'a computation structured like an agent'.

I think if you have malthusian dynamics you *do* have evolution-like dynamics.

I assume this isn't a crux, but fwiw I think it's pretty likely most vertebrates are moral patients

My sense is that most would-be dystopian scenarios lead to extinction fairly quickly. In most Malthusian situations, ruthless power struggles... humans would be a fitness liability that gets optimised away.

The way this doesn't happen is if we have AIs with human-extinction-avoiding constraints: some kind of alignment (perhaps incomplete/broken).

I don't think it makes much sense to reason further than this without making a guess at what those constraints may look like. If there aren't constraints, we're dead. If there are, then those constraints determine the rules of the game.

It sounds like you're implying that you need humans around for things to be dystopic? That doesn't seem clear to me; the AIs involved in the Malthusian struggle might still be moral patients

Ah - that's cool if IB physicalism might address this kind of thing (still on my to-read list).

Agreed that the subjoe thing isn't directly a problem. My worry is mainly whether it's harder to rule out sub-i experiencing a simulation of j, since sub-i isn't a user. However, if you can avoid the suffering j's by limiting access to information, the same should presumably work for relevant sub-i's.

If existing people want future people to be treated well, then they have nothing to worry about since this preference is part of the existing people's utility functions.

... (read more)
3 Vanessa Kosoy 10mo
I admit that at this stage it's unclear because physicalism brings in the monotonicity principle that creates bigger problems than what we discuss here. But maybe some variant can work.

Roughly speaking, in this case the 10% preserve their 10% of the power forever. I think it's fine because I want the buy-in of this 10% and the cost seems acceptable to me. I'm also not sure there is any viable alternative which doesn't have even bigger problems.

This is very interesting (and "denosing operator" is delightful).

Some thoughts:

If I understand correctly, I think there can still be a problem where user i wants an experience history such that part of the history is isomorphic to a simulation of user j suffering (i wants to fully experience j suffering in every detail).

Here a fixed  may entail some fixed  for (some copy of) some .

It seems the above approach can't then avoid leaving one of i or j badly off:
If  i... (read more)

4 Vanessa Kosoy 10mo
I think that a rigorous treatment of such issues will require some variant of IB physicalism [] (in which the monotonicity problem has been solved, somehow). I am cautiously optimistic that a denosing operator exists there which dodges these problems. This operator will declare both the manifesting and evaluation of the source codes of other users to be "out of scope" for a given user. Hence, a preference of i to observe the suffering of j would be "satisfied" by observing nearly anything, since the maximization can interpret anything as a simulation of j.

The "subjoe" problem is different: it is irrelevant because "subjoe" is not a user, only Joe is a user. All the transhumanist magic that happens later doesn't change this. Users are people living during the AI launch, and only them. The status of any future (trans/post)humans is determined entirely according to the utility functions of users. Why? For two reasons: (i) the AI can only have access and stable pointers to existing people, and (ii) we only need the buy-in of existing people to launch the AI. If existing people want future people to be treated well, then they have nothing to worry about, since this preference is part of the existing people's utility functions.

Oh and of course your non-obstruction does much better at capturing what we care about.
It's not yet clear to me whether some adapted version of Corrigibility_PM gets at something independently useful. Maybe.

[I realize that you're aiming to get at something different here - but so far I'm not clear on a context where I'd be interested in Corrigibility_PM as more than a curiosity]

In either case, we might need to do some kind of outlier filtering: if e.g. literally every person on Earth is a user, then maybe some of them are utterly insane in ways that cause the Pareto frontier to collapse.

This seems near guaranteed to me: a non-zero amount of people will be that crazy (in our terms), so filtering will be necessary.

Then I'm curious about how we draw the line on outlier filtering. What filtering rule do we use? I don't yet see a good principled rule (e.g. if we want to throw out people who'd collapse agreement to the disagreement point, there's more than one way to do that).
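To make the non-uniqueness concrete, here is a toy sketch in Python (the names, utilities, and the "drop the fewest users" rule are all my own invented illustration, not anything from the post): one sufficiently contrarian user empties the set of Pareto improvements over the disagreement point, and even the obvious minimal-drop filtering rule is not unique in general.

```python
import itertools

# Toy model (illustrative only): each user assigns a utility to each
# candidate outcome; the disagreement point gives every user 0.
utilities = {
    "alice":   {"o1": 2.0, "o2": 1.0, "o3": 0.5},
    "bob":     {"o1": 0.5, "o2": 2.0, "o3": 1.0},
    "mallory": {"o1": -1.0, "o2": -1.0, "o3": -1.0},  # "insane": vetoes everything
}
outcomes = ["o1", "o2", "o3"]

def acceptable(users):
    """Outcomes that are weak Pareto improvements over disagreement (0) for `users`."""
    return [o for o in outcomes if all(utilities[u][o] >= 0 for u in users)]

# With everyone included, mallory collapses the agreement set to nothing:
print(acceptable(list(utilities)))  # -> []

# One possible filtering rule (of many): drop the fewest users needed to
# make the agreement set non-empty. Which minimal set to drop can itself
# be ambiguous when several choices work.
def minimal_filter(users):
    for k in range(len(users) + 1):
        for kept in itertools.combinations(users, len(users) - k):
            if acceptable(kept):
                return set(kept)
    return set()

print(minimal_filter(list(utilities)))  # keeps alice and bob, drops mallory
```

Even in this three-user toy, "throw out whoever collapses the frontier" needed an arbitrary tie-breaking choice, which is the kind of thing I'd want a principled rule for.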

This is a nice idea. I think it'd need alterations before it became a useful tool (if I'm understanding clearly, and not missing applications of the unaltered version), but it has potential.

[[Note: I haven't looked in any detail at tailcalled's comments/post, since I wanted to give my initial impressions first; apologies for any redundancy]]


  1. There's an Anna Karenina issue: All happiness-inducing AI policies are alike; each unhappiness-inducing policy induces unhappiness in its own way.
    In some real-world situation, perhaps there are  g
... (read more)

Furthermore, LCDT is a demonstration that we can at least reduce the complexity of specifying myopia to the complexity of specifying agency.

It's an interesting idea, but are you confident that LCDT actually works? E.g. have you thought more about the issues I talked about here and concluded they're not serious problems?

I still don't see how we could get e.g. an HCH simulator without agentic components (or the simulator's qualifying as an agent).
As soon as an LCDT agent expects that it may create agentic components in its simulation, it's going to reason ho... (read more)

An issue with the misalignment definition:

2. We say a model is misaligned if it outputs B, in some case where the user would prefer it outputs A, and where the model is both:

  • capable of outputting A instead, and
  • capable of distinguishing between situations where the user wants it to do A and situations where the user wants it to do B

Though it's a perfectly good rule-of-thumb / starting point, I don't think this ends up being a good definition: it doesn't work throughout with A and B either fixed as concrete outputs, or fixed as properties of outputs.

Case 1 -... (read more)

1 William Saunders 1y
I've been thinking of Case 2. It seems harder to establish "capable of distinguishing between situations where the user wants A vs B" on individual examples, since a random classifier would let you cherry-pick some cases where this seems possible without the model really understanding. Though you could talk about individual cases as examples of Case 2. Agree that there's some implicit "all else being equal" condition; I'd expect currently it's not too likely to change conclusions. Ideally you'd just have the categories A = "best answer according to user" and B = "all answers that are worse than the best answer according to the user", but I think it's simpler to analyze more specific categories.
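A quick toy simulation of the cherry-picking worry (my own illustration, with made-up numbers): a coin-flip "classifier" has no ability to distinguish anything, yet still supplies plenty of individually "successful" cases to point at.

```python
import random

random.seed(0)  # deterministic for illustration

# A classifier that flips a coin still "succeeds" on roughly half of any
# set of A-vs-B cases, so pointing at individual successes is weak
# evidence that the model can distinguish A from B.
n_cases = 1000
coin_flip_correct = [random.random() < 0.5 for _ in range(n_cases)]

success_rate = sum(coin_flip_correct) / n_cases
cherrypicked = [i for i, ok in enumerate(coin_flip_correct) if ok][:5]

print(f"overall accuracy: {success_rate:.2f}")  # ~0.5: no real discrimination
print(f"five 'successful' cases one could cherry-pick: {cherrypicked}")
```

This is why aggregate accuracy over a category, rather than individual examples, seems like the right unit of evidence here.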

This mostly seems plausible to me - and again, I think it's a useful exercise that ought to yield interesting results.

Some thoughts:

  1. Handwaving would seem to take us from "we can demonstrate capability of X" to "we have good evidence for capability of X". In cases where we've failed to prompt/finetune the model into doing X we also have some evidence against the model's capability of X. Hard to be highly confident here.
  2. Precision over the definition of a task seems important when it comes to output. Since e.g. "do arithmetic" != "output arithmetic".
    This is r
... (read more)

I like the overall idea - seems very worthwhile.

A query on the specifics:

We consider a model capable of some task X if:

  • ...
  • We can construct some other task Y, for which we know the model needs to do X in order to solve Y, and we observe that the model is capable of Y

Are you thinking that this is a helpful definition even when treating models as black boxes, or only based on some analysis of the model's internals? To me it seems workable only in the latter case.

In particular, from a black-box perspective, I don't think we ever know that task X is required fo... (read more)

2 Beth Barnes 1y
Yeah, I think you need some assumptions about what the model is doing internally. I'm hoping you can handwave over cases like 'the model might only know X&A, not X' with something like 'if the model knows X&A, that's close enough to it knowing X for our purposes - in particular, if it thought about the topic or learned a small amount, it might well realise X'. Where 'our purposes' are something like 'might the model be able to use its knowledge of X in a plan in some way that outsmarts us if we don't know X'?

It seems plausible to me that there are cases where you can't get the model to do X by finetuning/prompt engineering, even if the model 'knows' X enough to be able to use it in plans. Something like - the part of its cognition that's solving X isn't 'hooked up' to the part that does output, but is hooked up to the part that makes plans. In humans, this would be any 'knowledge' that can be used to help you achieve stuff, but which is subconscious - your linguistic self can't report it directly (and further you can't train yourself to be able to report it).

If you’re trying to align wildly superintelligent systems, you don’t have to worry about any concern related to your system being incompetent.

In general, this seems false. The thing you don't have to worry about is subhuman competence. You may still have to worry about incompetence relative to some highly superhuman competence threshold. (it may be fine to say that this isn't an alignment problem - but it's a worry)

One concern is reaching [competent at X] before [competent at operating safely when [competent at X]].

Here it'd be fine if the system had perfe... (read more)

Thanks, that's interesting. [I did mean to reply sooner, but got distracted]

A few quick points:

Yes, by "incoherent causal model" I only mean something like "causal model that has no clear mapping back to a distribution over real worlds" (e.g. where different parts of the model assume that [kite exists] has different probabilities).
Agreed that the models LCDT would use are coherent in their own terms. My worry is, as you say, along garbage-in-garbage-out lines.

Having LCDT simulate HCH seems more plausible than its taking useful action in the world - but I'm... (read more)

Interesting, thanks.

However, I don't think this is quite right (unless I'm missing something):

Now observe that in the LCDT planning world model C constructed by marginalization, this knowledge of the goalkeeper is a known parameter of the ball-kicking optimization problem that the agent must solve. If we set the outcome probabilities right, the game-theoretical outcome will be that the optimal policy is for the agent to kick right, so it plays the opposite move to the one the goalkeeper expects. I'd argue that this is a form of deception, a deceptive

... (read more)
1 Koen Holtman 1y
See the comment here [] for my take.
1 Koen Holtman 1y
To be clear: the point I was trying to make is also that I do not think that B and C are significantly different in the goalkeeper benchmark. My point was that we need to go to a random prior to produce a real difference. But your question makes me realise that this goalkeeper benchmark world opens up a bigger can of worms than I expected.

When writing it, I was not thinking about Nash equilibrium policies, which I associate mostly with iterated games; I was specifically thinking about an agent design that uses the planning world to compute a deterministic policy function. To state what I was thinking about in different mathematical terms, I was thinking of an agent design that tries to compute the action a that maximizes expected utility (argmax_a) in the non-iterated gameplay world C.

To produce the Nash equilibrium type behaviour you are thinking about (i.e. the agent will kick left most of the time, but not all the time), you need to start out with an agent design that will use the C constructed by LCDT to compute a nondeterministic policy function, which it will then use to compute its real-world action. If I follow that line of thought, I would need additional ingredients to make the agent actually compute that Nash equilibrium policy function. I would need to have iterated gameplay in B, with mechanics that allow the goalkeeper to observe whether the agent is playing a non-Nash-equilibrium policy/strategy, so that the goalkeeper will exploit this inefficiency for sure if the agent plays the non-Nash-equilibrium strategy. The possibility of exploitation by the goalkeeper is what would push the optimal agent policy towards a Nash equilibrium.

But interestingly, such mechanics where the goalkeeper can learn about a non-Nash agent policy being used might be present in an iterated version of the real world model B, but they will be removed by LCDT from an iterated version of C. (Another wrinkle: some AI algorithms for solving the optimal policy in a single-shot g
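For concreteness, here is the contrast as a 2x2 zero-sum game, sketched in Python (the scoring probabilities are invented for illustration, not taken from either comment): with the goalkeeper's dive treated as a known parameter the optimal kick is deterministic, while the game-theoretic treatment yields a stochastic mixed-equilibrium policy.

```python
# Hypothetical scoring probabilities S[kick][dive] for a one-shot
# penalty kick (numbers are made up for illustration):
S = [[0.30, 0.90],   # kicker goes Left : vs keeper diving Left / Right
     [0.95, 0.20]]   # kicker goes Right: vs keeper diving Left / Right

# If the keeper's dive is a *known parameter* (as in the marginalized
# planning world), the kicker's optimal policy is deterministic:
assumed_dive = 0  # keeper assumed to dive Left
best_kick = max((0, 1), key=lambda k: S[k][assumed_dive])  # kicks Right

# In the game-theoretic treatment, the mixed equilibrium makes the
# opponent indifferent. For a 2x2 zero-sum game, the kicker's probability
# of going Left solves:
#   S[0][0]*p + S[1][0]*(1-p) = S[0][1]*p + S[1][1]*(1-p)
p_left = (S[1][0] - S[1][1]) / (S[1][0] - S[1][1] + S[0][1] - S[0][0])
value = S[0][1] * p_left + S[1][1] * (1 - p_left)

print(best_kick)         # deterministic best response to a fixed dive
print(round(p_left, 3))  # the equilibrium policy is stochastic
print(round(value, 3))   # scoring probability at equilibrium
```

The deterministic argmax and the mixed equilibrium genuinely disagree here, which I take to be the crux of the single-shot vs. iterated distinction above.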

Ok, that mostly makes sense to me. I do think that there are still serious issues (but these may be due to my remaining confusions about the setup: I'm still largely reasoning about it "from outside", since it feels like it's trying to do the impossible).

For instance:

  1. I agree that the objective of simulating an agent isn't a problem. I'm just not seeing how that objective can be achieved without the simulation taken as a whole qualifying as an agent. Am I missing some obvious distinction here?
    If for all x in X, sim_A(x) = A(x), then if A is behaviourally an
... (read more)
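To spell out why I don't see a behavioural distinction, a toy Python sketch (my own construction, not anything from the LCDT post): a lookup table built from an agent's input-output behaviour passes any behavioural test for being that agent.

```python
# Toy illustration: if a "simulation" reproduces an agent's input->output
# behaviour exactly, no behavioural test can tell them apart, so any
# behavioural definition of agency must classify both the same way.

def agent(x):
    # Some agent policy: pick the action maximizing a simple objective
    # (here, the action in {0, 1, 2} closest to the input).
    return max(range(3), key=lambda a: -(a - x) ** 2)

# A "non-agentic" lookup-table simulation built from the agent's behaviour:
sim_agent = {x: agent(x) for x in range(3)}.get

print(all(sim_agent(x) == agent(x) for x in range(3)))  # behaviourally identical
```

So it seems to me the distinction has to come from the internal structure of the simulation, not from its behaviour.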

Ah yes, you're right there - my mistake.

However, I still don't see how LCDT can make good decisions over adjustments to its simulation. That simulation must presumably eventually contain elements classed as agentic.
Then given any adjustment X which influences the simulation outcome both through agentic paths and non-agentic paths, the LCDT agent will ignore the influence [relative to the prior] through the agentic paths. Therefore it will usually be incorrect about what X is likely to accomplish.

It seems to me that you'll also have incoherence issues here ... (read more)

4 Evan Hubinger 1y
That's a really interesting thought—I definitely think you're pointing at a real concern with LCDT now. Some thoughts:

  • Note that this problem is only with actually running agents internally, not with simply having the objective of imitating/simulating an agent—it's just that LCDT will try to simulate that agent exclusively via non-agentic means.
  • That might actually be a good thing, though! If it's possible to simulate an agent via non-agentic means, that certainly seems a lot safer than internally instantiating agents—though it might just be impossible to efficiently simulate an agent without instantiating any agents internally, in which case it would be a problem.
  • In some sense, the core problem here is just that the LCDT agent needs to understand how to decompose its own decision nodes into individual computations so it can efficiently compute things internally and then know when and when not to label its internal computations as agents. How to decompose nodes into subnodes to properly work with multiple layers is a problem with all CDT-based decision theories, though—and it's hopefully the sort of problem that finite factored sets [] will help with.

Ok thanks, I think I see a little more clearly where you're coming from now.
(it still feels potentially dangerous during training, but I'm not clear on that)

A further thought:

The core idea is that LCDT solves the hard problem of being able to put optimization power into simulating something efficiently in a safe way

Ok, so suppose for the moment that HCH is aligned, and that we're able to specify a sufficiently accurate HCH model. The hard part of the problem seems to be safe-and-efficient simulation of the output of that HCH model.
I'm not clear on how this... (read more)

3 Evan Hubinger 1y
This is not true—LCDT is happy to influence nodes downstream of agent nodes, it just doesn't believe it can influence them through those agent nodes. So LCDT (at decision time) doesn't believe it can change what HCH does, but it's happy to change what it does to make it agree with what it thinks HCH will do, even though that utility node is downstream of the HCH agent nodes.
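A minimal sketch of the semantics I understand this to describe (my toy formalization, with made-up numbers): the LCDT agent holds the HCH agent node fixed at its prior when evaluating actions, yet still optimizes its own downstream contribution to the agreement utility.

```python
# Toy LCDT-style decision (assumed semantics: when evaluating its own
# action, the agent treats other agents' outputs as drawn from its
# *prior*, i.e. as not influenced by the action).

# Prior over what the HCH "agent node" outputs, independent of our action:
hch_prior = {"A": 0.7, "B": 0.3}

def utility(our_output, hch_output):
    # Utility node downstream of both: reward agreement with HCH.
    return 1.0 if our_output == hch_output else 0.0

def lcdt_expected_utility(action):
    # The agent node's distribution is held fixed at the prior; the
    # action cannot influence it through the agentic path.
    return sum(p * utility(action, h) for h, p in hch_prior.items())

best = max(["A", "B"], key=lcdt_expected_utility)
print(best)                          # matches the most likely HCH output
print(lcdt_expected_utility(best))   # expected agreement under the prior
```

So the agent can still steer the utility node via its own (non-agentic) path, which is the sense in which it is "happy to change what it does" without believing it changes what HCH does.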