Vojtech Kovarik

Background in mathematics (descriptive set theory, Banach spaces) and game-theory (mostly zero-sum, imperfect information games). CFAR mentor. Usually doing alignment research.

Wiki Contributions


Not that I expect it to make much difference, but: Maybe it would be good if texts like these didn't make it into the training data od future models.

Reiterating two points people already pointed out, since they still aren't fixed after a month. Please, actually fix them, I think it is important. (Reasoning: I am somewhat on the fence on how big weight to assign to the simulator theory, I expect so are others. But as a mathematician, I would feel embarrassed to show this post to others and admit that I take it seriously, when it contains so egregious errors. No offense meant to the authors, just trying to point at this as an impact-limiting factor.)

Proposition 1: This is false, and the proof is wrong. For the same reason that you can get an infinite series (of positive numbers) with a finite sum.

The terminology: I think it is a really bad idea to refer to tokens as "states", for several reasons. Moreover, these reasons point to fundamental open questions around the simulator framing, and it seems unfortunate to chose terminology which makes these issues confusing/hard to even notice. (Disclaimer: I point out some holes in the simulator framing and suggest improvements. However, I am well aware that all of my suggestions also have holes.)

(1) To the extent that a simulator fully describes some situation that evolves over time, a single token is a too small unit to describe the state of the environment. A single frame of a video (arguably) corresponds to a state. Or perhaps a sentence in a story might (arguably) corresponds to a state. But not a single pixel (or patch) and not a single word.

(2) To the extent that a simulator fully describes some situation that evolves over time, there is no straightforward correspondence between the tokens produced so far and the current state of the environment. To give several examples: The process of tossing a coin repeatedly can be represented by a sequence such as "1 0 0 0 1 0 1 ...", where the current state can be identified with the latest token (and you do not want to identify the current state with the whole sequence). The process of me writing the digits of pi on a paper, one per second, can be described as "3 , 1 4 1 ..." --- here, you need the full sequence to characterize the current state. Or what if I keep writing different numbers, but get bored with them and switch to new ones after a while: " pi = 3 , 1 4 1 Stop, got bored. e = 2 , 7. Stop, got bored. sqrt(2) = ...".

(3) It is misleading/false to describe models like GPT as "describing some situation that evolves over time". Indeed, fiction books and movies do crazy things like jumping from character to character, flashbacks, etc. Non-fiction books are even weirder (could contain snippets of stories, and then non-story things, etc). You could argue that in order to predict a text of a non-fiction book, GPT is simulating the author of that book. But where does this stop? What if the 2nd half of the book is darker because the author got sacked out of his day job and got depressed --- are you then simulating the whole world, to predict this thing? If (more advanced) GPT is a simulator in the sense of "evolving situations over time", then I would like this claim flashed out in detail on the example of (a) non-fiction books, (b) fiction books, and perhaps (c) movies on TV that include commercial breaks.

(4) But most importantly: To the extent that a simulator describes some situation that evolves over time, it only outputs a small portion of the situation that it is "imagining" internally. (For example, you are telling a story about a princess, and you never mention the colour of her dress, despite the princess in your head having blue dress.) So it feels like a type-error to refer to the output as "state". At best, you could call it something like "rendering of a state".
Arguably, the output (+ the user input) uniquely determines the internal state of the simulator. So you could perhaps identify the output (+ the user input) with "the internal state of the simulator". But that seems dangerous and likely to cause reasoning errors.

(5) Finally, to make (4) even worse: To the extent that a simulator describes some situation that evolves over time, it is not internally maintaining a single fully fleshed out state that it (probabilistically) evolves over time. Instead, it maintains a set of possible states (macro-state?). And when it generates new responses, it throws out some of the possible states (refines the macro-state?). (For example, in your story about a princess, dress colour is not determined, could be anything. Then somebody asks about the colour, and you need to refine it to blue --- which could still mean many different shades of blue.) 

However, even the explanation, given in (5), of what is going on with simulators, is missing some important pieces. Indeed, it doesn't explain what happens in cases such as "GPT tells the great story about the princess with blue dress, and suddenly the user jumps in and refers to the dress as red". At the moment, this is my main reason for scepticism about the simulator framing. As result, my current view is that "GPT can act as a simulator" (in the sense of Simulators) but it would be "false" to say that "GPT is a simulator" (in the sense of Simulators).

Oh, I think I agree - if the choice is to use AI assistants or not, then use them. If they need adapting to be useful for alignment, then do adapt them.

But suppose they only work kind-of-poorly - and using them for alignment requires making progress on them (which will also be useful for capabilities), and you will not be able to keep those results internal. And that you can either do this work or do literally nothing. (Which is unrealistic.) Then I would say doing literally nothing is better. (Though it certainly feels bad, and probably costs you your job. So I guess some third option would be preferable.)

> (iii) because if this was true, then we could presumably just solve alignment without the help of AI assistants.

Either I misunderstand this or it seems incorrect. 

Hm, I think you are right --- as written, the claim is false. I think some version of (X) --- the assumption around your ability to differentially use AI assistants for alignment --- will still be relevant; it will just need a bit more careful phrasing. Let me know if this makes sense:

To get a more realistic assumption, perhaps we could want to talk about (speedup) "how much are AI assistants able to speed up alignment vs capability" and (proliferation prevention) "how much can OpenAI prevent them from proliferating to capabilities research".[1] And then the corresponding more realistic version of the claims would be that:

  • either (i') AI assistants will fundamentally be able to speed up alignment much more than capabilities
  • or (ii') the potential speedup ratios will be comparable, but OpenAI will be able to significantly restrict the proliferation of AI assistants for capabilities research
  • or (iii') both the potential speedup ratios and adoption rates of AI assistants will be comparable for capabilities research will be, but somehow we will have enough time to solve alignment anyway.


  • Regarding (iii'): It seems that in the worlds where (iii') holds, you could just as well solve alignment without developing AI assistants.
  • Regarding (i'): Personally I don't buy this assumption. But you could argue for it on the grounds that perhaps alignment is just impossible to solve for unassisted humans. (Otherwise arguing for (i') seems rather hard to me.)
  • Regarding (ii'): As before, this seems implausible based on the track record :-).


  1. ^

    This implicitly assumes that if OpenAI develops the AI assistants technology and restrict proliferation, you will get similar adoption in capabilities vs alignment. This seems realistic.

(And to be clear: I also strongly endorse writing up the alignment plan. Big thanks and kudus for that! The critical comments shouldn't be viewed as negative judgement on the people involved :-).)

My ~2-hour reaction to the challenge:[1]

(I) I have a general point of confusion regarding the post: To the extent that this is an officially endorsed plan, who endorses the plan?
Reason for confusion / observations: If someone told me they are in charge of an organization that plans to build AGI, and this is their plan, I would immediately object that the arguments ignore the part where progress on their "alignment plan" make a significant contribution to capabilities research. Thereforey, in the worlds where the proposed strategy fails, they are making things actively worse, not better. Therefore, their plan is perhaps not unarguably harmful, but certainly irresponsible.[2] For this reason, I find it unlikely that the post is endorsed as a strategy by OpenAI's leadership.

(III)[3] My assumption: To make sense of the text, I will from now assume that the post is endorsed by OpenAI's alignment team only, and that the team is in a position where they cannot affect the actions of OpenAI's capabilities team in any way. (Perhaps except to the extent that their proposals would only incur a near-negligible alignment tax.) They are simply determined to make the best use of the research that would happen anyway. (I don't have any inside knowledge into OpenAI. This assumption seems plausible to me, and very sad.)


(IV) A general comment that I would otherwise need to repeat essentially ever point I make is the following: OpenAI should set up a system that will (1) let them notice if their assumptions turn out to be mistaken and (2) force them to course-correct if it happens. In several places, the post explicitly states, or at least implies, critical assumptions about the nature of AI, AI alignment, or other topics. However, it does not include any ways of noticing if these assumptions turn out to not hold. To act responsibly, OpenAI should (at the minimum): (A) Make these assumptions explicit. (B) Make these hypotheses falsifiable by publicizing predictions, or other criteria they could use to check the assumptions. (C) Set up a system for actually checking (B), and course-correcting if the assumptions turn out false.

Assumptions implied by OpenAI's plans, with my reactions:

  • (V) Easy alignment / warning shots for misaligned AGI:
    "Our alignment research aims to make artificial general intelligence (AGI) aligned with human values and follow human intent. We take an iterative, empirical approach: [...]" My biggest objection with the whole plan is already regarding the second sentence of the post: relying on a trial-and-error approach. I assume OpenAI believes either: (1) The proposed alignment plan is so unlikely to fail that we don't need to worry about the worlds where it does fail. Or (2) In the worlds where the plan fails, we will have a clear warning shots. (I personally believe this is suicidal. I don't expect people to automatically agree, but with everything at stake, they should be open to signs of being wrong.)
  • (VI) "AGI alignment" isn't "AGI complete":
    This is already acknowledged in the post: "It might not be fundamentally easier to align models that can meaningfully accelerate alignment research than it is to align AGI. In other words, the least capable models that can help with alignment research might already be too dangerous if not properly aligned. If this is true, we won’t get much help from our own systems for solving alignment problems." However, it isn't exactly clear what precise assumptions are being made here. Moreover, there is no vision for how to monitor whether the assumptions hold or not. Do we keep iterating on AI capabilities, each time hoping that "this time, it will be powerful enough to help with alignment"?
  • (VII) Related assumption: No lethal discontinuities:
    The whole post suggest the workflow "new version V of AI-capabilities ==> capabilities ppl start working on V+1 & (simultaneously) alignment people use V for alignment research ==> alignment(V) gets used on V, or informs V+1". (Like with GPT-3.) This requires the assumption that either you can hold off research on V+1 until alignment(V) is ready, or the assumption that deployed V will not kill you before you solve alignment(V). Which of the assumptions is being made here? I currently don't see evidence for "ability to hold off on capabilities research". What are the organizational procedures allowing this?
  • (VIII) [Point intentionally removed. I endorse the sentiment that treating these types of lists as complete is suicidal. In line with this, I initially wrote 7 points and then randomly deleted one. This is, obviously, in addition to all the points that I failed to come up with at all, or that I didn't mention because I didn't have enough original thoughts on them and it would seem too much like parroting MIRI. And in addition to the points that nobody came up with yet...]
  • (IX) Regarding "outer alignment  alignment": Other people solving the remaining issues. Or having warning shots & the ability to hold off capabilities research until OpenAI solves them:
    It is good to at least acknowledge that there might be other parts of AI alignment than just "figuring out learning from human feedback (& human-feedback augmentation)". However, even if this ingredient is necessary, the plan assumes that if it turns out not-sufficient, you will (a) notice and (b) have enough time to fix the issue.
  • (X) Ability to differentially use capabilities progress towards alignment progress:
    The plan involves training AI assistants to help with alignment research. This seems to assume that either (i) the AI assistants will only be able to help with alignment research, or (ii) they will be general, but OpenAI can keep their use restricted to alignment research only, or (iii) they will be general and generally used, but somehow we will have enough time to do the alignment research anyway. Personally, I think all three of these assumptions are false --- (i) because it seems unlikely they won't also be usable on capabilities research, (ii) based on track record so far, and (iii) because if this was true, then we could presumably just solve alignment without the help of AI assistants.
  • (XI) Creating an aligned AI is sufficient for getting AI to go well:
    The plan doesn't say anything about what to do with the hypothetical aligned AGI. Is the assumption that OpenAI can just release the seems-safe-so-far AGI through their API, $1 for 10,000 tokens, and we will all live happily ever after? Or is the plan to, uhm, offer it to all governments of the world for assistance in decision-making? Or something else inside the Overton window? If so, what exactly, and what is the theory of change for it? I think there could be many moral & responsible plans outside of the Overton window, just because public discource these days tends to be tricky. Having a specific strategy like that seems fine and reasonable. But I am afraid there is simultaneously (a) the desire to stick to the Overton window strategies and (b) no theory of change for how this prevents misaligned AGI by other actors, or other failure modes, (c) no "explicit assumptions & detection system & course-correction-procedure" for "nothing will go wrong if we just do (b)".

General complaint: The plan is not a plan at all! It's just a meta-plan.

  • (XII) Ultimately, I would paraphrase the plan-as-stated as: "We don't know how to solve alignment. It seems hard. Let's first build an AI to make us smarter, and then try again." I think OpenAI should clarify whether this is literally true, or whether there is some idea for how the object-level AI alignment plan looks like --- and if so, what is it.
  • (XIII) For example, the post mentions that "robustness and interpretability research [is important for the plan]". However, this is not at all apparent from the plan. (This is acknowledged in the post, but that doesn't make it any less of an issue!) This means that the plan is not detailed enough.
    As an analogy, suppose you have a mathematical theorem that makes an assumption X. And then you look at the proof, and you can't see the step that would fail if X was untrue. This doesn't say anything good about your proof.
  1. ^

    Eliezer adds:  "For this reason, please note explicitly if you're saying things that you heard from a MIRI person at a gathering, or the like."

    As far as I know, I came up with points (I), (III), and (XII) myself and I don't remember reading those points before. On the other hand, (IV), (IX), and (XI) are (afaik) pretty much direct ripoffs of MIRI arguments. The status of the remaining 7 points is unclear. (I read most of MIRI's publicly available content, and attended some MIRI-affiliated events pre-covid. And I think all of my alignment thinking is heavily MIRI-inspired. So the remaining points are probably inspired by something I read. Perhaps I would be able to derive 2-3 out of 7 if MIRI disappeared 6 years ago?)

  2. ^

    (II) For example, consider the following claim: "We believe the best way to learn as much as possible about how to make AI-assisted evaluation work in practice is to build AI assistants." My reaction: Yes, technically speaking this is true. But likewise --- please excuse the jarring analogy --- the best way to learn as much as possible about how to treat radiation exposure is to drop a nuclear bomb somewhere and then study the affected population. And yeees, if people are going to be dropping nuclear bombs, you might as well study the results. But wouldn't it be even better if you personally didn't plan to drop bombs on people? Maybe you could even try coordinating with other bomb-posessing people on not dropping them on people :-).

  3. ^

    Apologies for the inconsistent numbering. I had to give footnote [2] number (II) to get to the nice round total of 13 points :-).

To the extent that we are using the existence of humans with inner monologue as evidence for the suitability of this approach to AGI, I think it also serves as evidence for why that approach isn't really safe.

As an example: If I try, I can control my inner monologue to some extent. For example, I could make my inner monologue about "officially approved actions" very loud. And run a secondary thread of thinking about "non-approved thoughts" in the background --- "quieter" and in a more fragmented manner, and killing it off everytime it made it into the inner monologue.

Sure, there will be some traces. (A lot, even, I kinda suck at this :-).) But I think this already brings us outside of the realm of "interpretable by default" and into "a race between your interpretability and my cognitive steganography".

Note that I, or most humans, didnt have a particular need to develop this exact type of cognitive steganography. (The "elephant in the brain" type of actions-thoughts inconsistency is not exactly this.) As a result, I think we would be able to get much much better very quickly if there was an incentive to do it. (Not talking evolutionary scale. More like "give me a weekend to play with this".) And while I have no incentive to develop these skills, the same cannot be said for a brain-like AI that we are (trying to) interpret as it thinks.

One way in which the world seems brittle / having free energy AI could use to gain advantage: We haven't figured out good communication practices for the digital age. We don't have good collective epistemics. And we dont seem to be on track to have this solved in the next 20 years. As a result I expect that with enough compute and understanding of network science, and perhaps a couple more things, you could sabotage the whole civilization. ("Enough" is meant to stand for "a lot, but within reach of an early AGI". Heck, if Google somehow spent the next 5 years just on that, I would give them fair odds.)

Re sharp left turn: Maybe I misunderstand the "sharp left turn" term, but I thought this just means a sudden extreme gain in capabilities? If I am correct, then I expect you might get "sharp left turn" with a simulator during training --- eg, a user fine-tunes it on one additional dataset, and suddenly FOOOM. (Say, suddenly it can simulate agents that propose takeover plans that would actually work, when previously they failed at this with identical prompting.)

One implication I see is that it if the simulator architecture becomes frequently used, it might be really hard to tell whether a thing is dangerous or not. For example might just behave completely fine with most prompts and catastrophically with some other prompts, and you will never know until you try. (Or unless you do some extra interpretability/other work that doesn't yet exist.) It would be rather unfortunate if the Vulnerable World Hypothesis was true because of specific LLM prompts :-).

Explanation for my strong downvote/disagreement:
Sure, in the ideal world, this post would have a much better scholarship.

In the actual world, there are tradeoffs between the number of posts and the quality of scholarship. The cost is both the time and the fact that doing literature review is a chore. If you demand good scholarship, people will write slower/less. With some posts this is a good thing. With this post, I would rather have an attrocious scholarship and 1% higher chance of the sequence having one more post in it. (Hypothetical example. I expect the real tradeoffs are less favourable.)

Load More