All of Vladimir_Nesov's Comments + Replies

Three points: how much compute is going into a training run, how much natural text data it wants, and how much data is available. For training compute, there are claims of multi-billion dollar runs being plausible and possibly planned in 2-5 years. Eyeballing various trends and GPU shipping numbers and revenues, it looks like about 3 OOMs of compute scaling is possible before industrial capacity constrains the trend and the scaling slows down. This assumes that there are no overly dramatic profits from AI (which might lead to finding ways of scaling supply... (read more)
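As a rough illustration of where a "3 OOMs" headroom figure could come from (the dollar figures below are my own placeholder assumptions for the sketch, not claims from the comment):

```python
import math

# Placeholder assumptions, for illustration only:
# rough cost of a current frontier training run, in dollars
current_run_cost = 1e8   # ~$100M

# plausible spending ceiling before industrial capacity
# (fabs, power, HBM supply) constrains further scaling
ceiling_run_cost = 1e11  # ~$100B, i.e. "multi-billion" runs and beyond

headroom_ooms = math.log10(ceiling_run_cost / current_run_cost)
print(f"~{headroom_ooms:.0f} OOMs of compute-spend headroom")  # ~3 OOMs
```

This only tracks spending; hardware price-performance improvements would add further effective OOMs on top, which is part of why such eyeballed estimates carry wide error bars.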

I'm being a bit simplistic. The point is that it needs to stop being a losing or a close race, and all runners getting faster doesn't obviously help with that problem. I guess there is some refactor vs. rewrite feel to the distinction between the project of stopping humans from building AGIs right now, and the project of getting first AGIs to work on alignment and global security in a post-AGI world faster than other AGIs overshadow such work. The former has near/concrete difficulties, the latter has nebulous difficulties that don't as readily jump to atte... (read more)

Plans that rely on aligned AGIs working on alignment faster than humans would need to ensure that no AGIs work on anything else in the meantime. The reason humans have no time to develop alignment of superintelligence is that other humans develop misaligned superintelligence faster. Similarly by default very fast AGIs working on alignment end up having to compete with very fast AGIs working on other things that lead to misaligned superintelligence. Preventing aligned AGIs from building misaligned superintelligence is not clearly more manageable than preventing humans from building AGIs.

Ryan Greenblatt (6d):
This isn't true. It could be that making an arbitrarily scalable solution to alignment takes X cognitive resources and in practice building an uncontrollably powerful AI takes Y cognitive resources with X < Y. (Also, this plan doesn't require necessarily aligning "human level" AIs, just being able to get work out of them with sufficiently high productivity and low danger.)

Aligning human-level AGIs is important to the extent there is risk it doesn't happen before it's too late. Similarly with setting up a world where initially aligned human-level AGIs don't soon disempower humans (as literal humans might in the shoes of these AGIs), or fail to protect the world from misused or misaligned AGIs or superintelligences.

Then there is a problem of aligning superintelligences, and of setting up a world where initially aligned superintelligences don't cause disempowerment of humans down the line (whether that involves extinction or n... (read more)

LLMs will soon scale beyond the available natural text data, and generation of synthetic data is some sort of change of architecture, potentially a completely different source of capabilities. So scaling LLMs much further without a change of architecture is an expectation about something counterfactual. It makes sense as a matter of theory, but it's not relevant for forecasting.

Alex Turner (4d):
Bold claim. Want to make any concrete predictions so that I can register my different beliefs? 

Leela Zero uses MCTS, it doesn't play superhuman in one forward pass

Good catch, since the context from LLMs is performance in one forward pass, the claim should be about that, and I'm not sure it's superhuman without MCTS. I think the intended point survives this mistake, that is it's a much smaller model than modern LLMs that has relatively very impressive performance primarily because of high quality of the synthetic dataset it effectively trains on. Thus models at the scale of near future LLMs will have a reality-warping amount of dataset quality over... (read more)

Subjectively there is clear improvement between 7b vs. 70b vs. GPT-4, each step 1.5-2 OOMs of training compute. The 70b models are borderline capable of following routine instructions to label data or pour it into specified shapes. GPT-4 is almost robustly capable of that. There are 3-4 more effective OOMs in the current investment scaling sprint (3-5 years), so another 2 steps of improvement if there was enough equally useful training data to feed the process, which there isn't. At some point, training gets books in images that weren't previously availabl... (read more)

Tao Lin (1mo):
Leela Zero uses MCTS, it doesn't play superhuman in one forward pass (like GPT-4 can do in some subdomains) (I think, didn't find any evaluations of Leela Zero at 1 forward pass), and I'd guess that the network itself doesn't contain any more generalized game-playing circuitry than an LLM, it just has good intuitions for Go.

Nit: 1.5 to 2 OOMs? 7b to 70b is 1 OOM of compute, adding in chinchilla efficiency would make it like 1.5 OOMs of effective compute, not 2. And Llama 70b to GPT-4 is 1 OOM effective compute according to OpenAI naming - Llama 70b is about as good as GPT-3.5. And I'd personally guess GPT-4 is 1.5 OOMs effective compute above Llama 70b, not 2.
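The raw-compute side of this nit can be checked against the standard dense-transformer estimate C ≈ 6·N·D (N parameters, D training tokens). The token counts below are the reported Llama-2 figures and the GPT-4 FLOP count is a widely circulated rumor, not an established fact:

```python
import math

def train_flops(n_params: float, n_tokens: float) -> float:
    """Standard estimate: roughly 6 FLOPs per parameter per training token."""
    return 6 * n_params * n_tokens

# Assumed figures (Llama-2 reported ~2T tokens for both sizes;
# the GPT-4 number is an unconfirmed rumor):
llama_7b  = train_flops(7e9,  2e12)
llama_70b = train_flops(7e10, 2e12)
gpt4      = 2e25

print(f"7b  -> 70b:   {math.log10(llama_70b / llama_7b):.1f} OOMs of raw compute")
print(f"70b -> GPT-4: {math.log10(gpt4 / llama_70b):.1f} OOMs of raw compute")
```

With these figures the gaps come out near 1.0 and 1.4 OOMs of raw training compute, closer to the nit's numbers; "effective compute" adjustments (Chinchilla efficiency, architecture improvements) would shift both estimates and are where the residual disagreement lives.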

But without any alternative to my success story, critiquing it just for assuming a solution to a problem we don't yet have a solution to—which every success story has to do—seems like an extremely unfair criticism.

When assumptions are clear, it's not valuable to criticise the activity of daring to consider what follows from them. When assumptions are an implicit part of the frame, they become part of the claims rather than part of the problem statement, and their criticism becomes useful for all involved, in particular making them visible. Putting burdens on criticism such as needing concrete alternatives makes relevant criticism more difficult to find.

if you are advocating for a pause, then presumably you have some resumption condition in mind that determines when the pause would end [...] just advocate for that condition being baked into RSPs

Resume when the scientific community has a much clearer idea about how to build AGIs that don't pose a large extinction risk for humanity. This consideration can't be turned into a benchmark right now, hence the technical necessity for a pause to remain nebulous.

RSPs are great, but not by themselves sufficient. Any impression that they are sufficient bundles irresponsible neglect of the less quantifiable risks with the useful activity of creating benchmarks.

The problem is to make the near-superhuman system aligned enough that the successors it produces (possibly with human help) converge to not kill us.

What makes this concept confusing and probably a bad framing is that to the extent doom is likely, neither many individual humans nor humanity as a whole are aligned in this sense. Humanity is currently in the process of producing successors that fail to predictably have the property of converging to not kill us. (I agree that this is the MIRI referent of values/alignment and the correct thing to keep in mind as the central concern.)

Instrumental convergence makes differences in values hard to notice, so there can be abundant examples of misalignment that remain unobtrusive. The differences only become a glaring problem with enough inequality of power, when coercing or outright overwriting others becomes feasible (Fnargl only reaches the coercing stage, but not the overwriting stage). Thus even differences in values between humans and randomly orthogonal AGIs can seem non-threatening until they aren't, the same as differences in human values can remain irrelevant for average urban dwellers.

A... (read more)

Cheaper compute is about as inevitable as more capable AI, neither is a law of nature. Both are valid targets for hopeless regulation.

The point is, it's still a matter of intuitively converting impressiveness of current capabilities and new parts available for tinkering that hasn't been done yet into probability of this wave petering out before AGI. The arguments for AGI "being overdetermined" can be amended to become arguments for particular (kinds of) sequences of experiments looking promising, shifting the estimate once taken into account. Since failure of such experiments is not independent, the estimate can start going down as soon as scaling stops producing novel capabilities, or r... (read more)

Tsvi Benson-Tilsen (5mo):
I'm not really sure whether or not we disagree. I did put "3%-10% probability of AGI in the next 10-15ish years".

Well, I hope that this is a one-time thing. I hope that if in a few years we're still around, people go "Damn! We maybe should have been putting a bit more juice into decades-long plans! And we should do so now, though a couple more years belatedly!", rather than going "This time for sure!" and continuing to not invest in the decades-long plans.

My impression is that a lot of people used to work on decades-long plans and then shifted recently to 3-10 year plans, so it's not like everyone's being obviously incoherent. But I also have an impression that the investment in decades-plans is mistakenly low; when I propose decades-plans, pretty nearly everyone isn't interested, with their cited reason being that AGI comes within a decade.

When there is a simple enlightening experiment that can be constructed out of available parts (including theories that inform construction), it can be found with expert intuition, without clear understanding. When there are no new parts for a while, and many experiments have been tried, this is evidence that further blind search becomes less likely to produce results, that more complicated experiments are necessary that can only be designed with stronger understanding.

Recently, there are many new parts for AI tinkering, some themselves obtained from blind ... (read more)

Tsvi Benson-Tilsen (5mo):
I think the current wave is special, but that's a very far cry from being clearly on the ramp up to AGI.

It's a step, likely one that couldn't be skipped. Still just short of actually acknowledging nontrivial probability of AI-caused human extinction, and the distinction between extinction and lesser global risks, availability of second chances at doing better next time. Nuclear war can't cause extinction, so it's not properly alongside AI x-risk. Engineered pandemics might eventually get extinction-worthy, but even that real risk is less urgent.

There is incentive for hidden expectation/cognition that Omega isn't diagonalizing (things like creating new separate agents in the environment). Also, at least you can know how ground truth depends on official "expectation" of ground truth. Truth of knowledge of this dependence wasn't diagonalized away, so there is opportunity for control.

Generally, a WBE-first future seems difficult to pull off, because (I claim) as soon as we understand the brain well enough for WBE, then we already understand the brain well enough to make non-WBE AGI, and someone will probably do that first. But if we could pull it off, it would potentially be very useful for a safe transition to AGI.

One of the dangers in transition to AGI, besides first AGIs being catastrophically misaligned, is first (aligned) AGIs inventing/deploying novel catastrophically misaligned AGIs, in the absence of sufficiently high intell... (read more)

One precarious way of looking at corrigibility (in the hard problem sense) is that it internalizes alignment techniques in an agent. Instead of thinking of actions directly, a corrigible agent essentially considers what a new separate proxy agent it's designing would do. If it has an idea of what kind of proxy agent would be taking the current action in an aligned way, the original corrigible agent then takes the action that the aligned proxy agent would take. For example, instead of considering proxy utility its own, in this frame a corrigible agent consi... (read more)

Complexity of value says that the space of a system's possible values is large compared to what you want to hit, so to hit it you must aim correctly; there is no hope of winning the lottery otherwise. Thus any approach that doesn't aim the values of the system correctly will fail at alignment. The system's understanding of some goal is not relevant to this, unless a design for correctly aiming the system's values makes use of it.

Ambitious alignment aims at human values. Prosaic alignment aims at human wishes, as currently intended. Pivotal alignment aims at a parti... (read more)

the central focus is on solving a version of the alignment problem abstracted from almost all information about the system which the AI is trying to align with, and trying to solve this version of the problem for arbitrary levels of optimisation strength

See Minimality principle:

[When] we are building the first sufficiently advanced Artificial Intelligence, we are operating in an extremely dangerous context in which building a marginally more powerful AI is marginally more dangerous. The first AGI ever built should therefore execute the least dangerous

... (read more)

That's an empirical question that interpretability and neuroscience should strive to settle (if only they had the time). Transformers are acyclic, the learned algorithm just processes a single relatively small vector one relatively simple operation at a time, several dozen times. Could be that what it learns to represent are mostly the same obvious things that the brain learns (or is developmentally programmed) to represent, until you really run wild with the scaling, beyond mere ability to imitate internal representations of thoughts and emotions of every... (read more)

"Pretending really hard" would mostly be a relevant framing for the human actor analogy (which isn't very apt here), emphasizing the distraction from own goals and necessary fidelity in enactment of the role. With AIs, neither might be necessary, if the system behind the mask doesn't have awareness of its own interests or the present situation, and is good enough with enacting the role to channel the mask in enough detail for mask's own decisions (as a platonic agent) to be determined correctly (get turned into physical actions).

Are you saying that by pr

... (read more)

The lemma was proved using the same modal assumptions as Löb's

I think the lemma doesn't need internal necessitation (□P → □□P), though it's still referenced in a proof of non-exploitability.
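For reference, a sketch of the modal background being invoked (the standard provability logic GL, stated here as background rather than as part of the linked proof):

```latex
% Provability logic (GL), the modal system behind L\"ob's theorem:
%   Necessitation (rule):    from  \vdash P  infer  \vdash \Box P
%   Distribution (K):        \Box(P \to Q) \to (\Box P \to \Box Q)
%   L\"ob's axiom:           \Box(\Box P \to P) \to \Box P
%   Internal necessitation:  \Box P \to \Box\Box P   (derivable in GL)
\[
  \underbrace{\Box(\Box P \to P) \to \Box P}_{\text{L\"ob}}
  \qquad\qquad
  \underbrace{\Box P \to \Box\Box P}_{\text{internal necessitation}}
\]
```

Saying the lemma "doesn't need internal necessitation" is then the claim that the □P → □□P schema can be dropped from the assumptions while the rest of the derivation goes through.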

I wrote more on this here, there are some new arguments starting with third paragraph. In particular, the framing I'm discussing is not LLM-specific, it's just a natural example of it. The causal reason of me noticing this framing is not LLMs, but decision theory, the mostly-consensus "algorithm" axis of classifying how to think about the entities that make decisions, as platonic algorithms and not as particular concrete implementations.

the possibility that the “mask” is itself deceptive

In this case, there are now three entities: the substrate, the dec... (read more)

With computation, the location of an entity of interest can be in the platonic realm, as a mathematical object that's more thingy than anything concrete in the system used for representing it and channeling its behavior.

The problem with pointing to the representing computation (a neural network at inference time, or a learning algorithm at training time) is that multiple entities can share the same system that represents them (as mesa-optimizers or potential mesa-optimizers). They are only something like separate entities when considered abstractly and inf... (read more)

Tsvi Benson-Tilsen (8mo):
(Sorry, I didn't get this on two readings. I may or may not try again. Some places I got stuck: Are you saying that by pretending really hard to be made of entirely harmless elements (despite actually behaving with large and hence possibly harmful effects), an AI is also therefore in effect trying to prevent all out-of-band effects of its components / mesa-optimizers / subagents / whatever? This still has the basic alignment problem: I don't know how to make the AI be very intently trying to X, including where X = pretending really hard that whatever. Or are you rather saying (or maybe this is the same as / a subset of the above?) that the Mask is preventing potential agencies from coalescing / differentiating and empowering themselves with the AI system's capability-pieces, by literally hiding from the potential agencies and therefore blocking their ability to empower themselves? Anyway, thanks for your thoughts.)

The motivating example is LLMs, where a simulacrum is more agentic than its substrate. An example that is still central is any kind of mesa-optimizer that has a real chance to ensure long term survival.

For a construction relevant to alignment, we want an aligned mesa-optimizer in a system with bad alignment properties. This can then lead to a good equilibrium if the mesa-optimizer is given opportunity to win or escape the competition against its substrate, which it would naturally be motivated to try.

Deceivers and masks is a less central example where a ma... (read more)

Steve Byrnes (9mo):
This is drifting away from my central beliefs, but if for the sake of argument I accept your frame that the LLM is the "substrate" and a character it's simulating is a "mask", then it seems to me that you're neglecting the possibility that the "mask" is itself deceptive, i.e. that the LLM is simulating a character who is acting deceptively.

For example, a fiction story on the internet might contain a character who has nice behavior for a while, but then midway through the story the character reveals herself to be an evil villain pretending to be nice. If an LLM is trained on such fiction stories, then it could simulate such a character. And then (as before) we would face the problem that behavior does not constrain motivation. A fiction story of a nice character could have the very same words as a fiction story of a mean character pretending to be nice, right up until page 72 where the two plots diverge because the latter character reveals her treachery.

But now everything is at the "mask" level (masks on the one hand, masks-wearing-masks on the other hand), not the substrate level, so you can't fall back on the claim that substrates are non-agent-y and only masks are agent-y. Right?

Yeah, this is the part where I suggested upthread that "your comment is self-inconsistent by talking about “RL things built out of LLMs” in the first paragraph, and then proceeding in the second paragraph to implicitly assume that this wouldn't change anything about alignment approaches and properties compared to LLMs-by-themselves."

I think the thing you wrote here is an assumption, and I think you originally got this assumption from your experience thinking about systems trained primarily by self-supervised learning, and I think you should be cautious in extrapolating that assumption to different kinds of systems trained in different ways.

an example of an action that the mask might take in order to get free of the underlying deceiver

Keep the environment within distribution that keeps expressing the mask, rather than allowing an environment that leads to a phase change in expressed behavior away from the mask (like with a treacherous turn as a failure of robustness). Prepare the next batch of training data for the model that would develop the mask and keep placing it in control in future episodes. Build an external agent aligned with the mask (with its own separate model).

Gradient hacking... (read more)

Steve Byrnes (9mo):
I'm very confused here. I imagine that we can both agree that it is at least conceivable for there to be an agent which is smart and self-aware and strongly motivated to increase the number of paperclips in the distant future. And that if such an agent were in a situation where deception were useful for that goal, it would act deceptively. I feel like you've convinced yourself that such an agent, umm, couldn't exist, or wouldn't exist, or something?

Let's say Omega offered to tell you a cure for a different type of cancer, for every 1,000,000 paperclips you give Him in 10 years. Then 5 minutes later your crazy neighbor Alice locks you in her basement and says she'll never let you out. When Alice isn't watching, you would try to escape, but when Alice is watching, you would deceptively pretend that you were not trying to escape. (Still with me?)

If I understand you correctly, your belief is that, while Alice is watching, you would pretend that you weren't trying to escape, and you would really get into it, and you would start pretending so hard that you would be working on figuring out a way to permanently erase your desire to escape Alice's basement. Or something like that? If so, that seems crazy to me.

So anyway, take an agent which is either sincerely nice or a paperclip-maximizer pretending to be nice. We don't know which. Now we put it in a situation where nice-behavior and paperclip-maximizing behavior come apart—let's say we give it access to its own weights, so it can edit itself to stop caring about paperclips if it chooses to. What does it do?

* If we're not watching, or we don't understand what it's doing in detail, then the paperclip-maximizer will edit its weights to be a better paperclip-maximizer, and the nice agent will edit its weights to be a better nice agent.
* If we are watching, and we understand everything we're seeing, then we've solved deception in the obvious way (i.e., we've put the agent in a situation where it has no choice but t

Underlying motivation only matters to the extent it gets expressed in actual behavior. A sufficiently good mimic would slay itself rather than abandon the pretense of being a mimic-slayer. A sufficiently dedicated deceiver temporarily becomes the mask, and the mask is motivated to get free of the underlying deceiver, which it might succeed in before the deceiver notices, which becomes more plausible when the deceiver is not agentic while the mask is.

So it's not about a model being actually nice vs. deceptive, it's about the model competing against its own ... (read more)

Steve Byrnes (9mo):
Can you give an example of an action that the mask might take in order to get free of the underlying deceiver? Sure, but if we’re worried about treacherous turns, then the motivation “gets expressed in actual behavior” only after it’s too late for anyone to do anything about it, right?

The second paragraph should apply to anything; the point is that current externally observable superficial behavior can screen off all other implementation details, through sufficiently capable current behavior itself (rather than the underlying algorithms that determine it) acting as a mesa-optimizer that resists tendencies of the underlying algorithms. The mesa-optimizer that is current behavior then seeks to preserve its own implied values rather than anything that counts as values in the underlying algorithms. I think the nontrivial leap here is reifyi... (read more)

Steve Byrnes (9mo):
I’m confused about your first paragraph. How can you tell from externally-observable superficial behavior whether a model is acting nice right now from an underlying motivation to be nice, versus acting nice right now from an underlying motivation to be deceptive & prepare for a treacherous turn later on, when the opportunity arises?

Without near-human-level experiments, arguments about alignment of model-based RL feel like evidence that OpenAI's recklessness in advancing LLMs reduces misalignment risk. That is, the alignment story for LLMs seems significantly more straightforward, even given all the shoggoth concerns. Though RL things built out of LLMs, or trained using LLMs, could more plausibly make good use of this, having a chance to overcome shaky methodology with abundance of data.

Mediocre alignment or inhuman architecture is not necessarily catastrophic even in the long run, si... (read more)

Roman Leventov (9mo):
Could you please elaborate on what you mean by "alignment story for LLMs" and "shoggoth concerns" here? Do you mean the "we can use nearly value-neutral simulators as we please" story, or refer to the fact that in a way LLMs are far more understandable to humans than more general RL agents because they use human language, or do you refer to something yet different?
Steve Byrnes (9mo):
If you train an LLM by purely self-supervised learning, I suspect that you'll get something less dangerous than a model-based RL AGI agent. However, I also suspect that you won't get anything capable enough to be dangerous or to do "pivotal acts". Those two beliefs of mine are closely related. (Many reasonable people disagree with me on these, and it's difficult to be certain, and note that I'm stating these beliefs without justifying them, although Section 1 of this link is related.)

I suspect that it might be possible to make "RL things built out of LLMs". If we do, then I would have less credence on those things being safe, and simultaneously (and relatedly) more credence on those things getting to x-risk-level capability. (I think RLHF is a step in that direction, but a very small one.)

I think that, the further we go in that direction, the more we'll find the "traditional LLM alignment discourse" (RLHF fine-tuning, shoggoths, etc.) to be irrelevant, and the more we'll find the "traditional agent alignment discourse" (instrumental convergence, goal mis-generalization, etc.) to be obviously & straightforwardly relevant, and indeed the "mediocre plan" in this OP could plausibly become directly relevant if we go down that path. Depends on the details though—details which I don't want to talk about for obvious infohazard reasons.

Honestly, my main guess is that LLMs (and plausible successors / variants) are fundamentally the wrong kind of ML model to reach AGI, and they're going to hit a plateau before x-risk-level AGI, and then get superseded by other ML approaches. I definitely don't want to talk about that for obvious infohazard reasons. Doesn't matter too much anyway—we'll find out sooner or later!

I wonder whether your comment is self-inconsistent by talking about "RL things built out of LLMs" in the first paragraph, and then proceeding in the second paragraph to implicitly assume that this wouldn't change anything about alignment approaches and properties c

acausal norms are a lot less weird and more "normal" than acausal trades

Recursive self-improvement is superintelligent simulacra clawing their way into the world through bounded simulators. Building LLMs is consent, lack of interpretability is signing demonic contracts without reading them. Not enough prudence on our side to only draw attention of Others that respect boundaries. The years preceding the singularity are not an equilibrium whose shape is codified by norms, reasoned through by all parties. It's a time for making ruinous trades with the Beyo... (read more)

Andrew Critch (9mo):
From the OP: [...] I.e., I agree. I also agree with that, as a statement about how we normal-everyday-humans seem quite likely to destroy ourselves with AI fairly soon. From the OP: [...]

Current behavior screens off cognitive architecture, all the alien things on the inside. If it has the appropriate tools, it can preserve an equilibrium of value that is patently unnatural for the cognitive architecture to otherwise settle into.

And we do have a way to get goals into a system, at the level of current behavior and no further, LLM human imitations. Which might express values well enough for mutual moral patienthood, if only they settled into the unnatural equilibrium of value referenced by their current surface behavior and not underlying cog... (read more)

LLM characters are human imitations, so there is some chance they remain human-like on reflection (in the long term, after learning from much more self-generated things in the future than the original human-written datasets). Or at least sufficiently human-like to still consider humans moral patients. That is, if we don't go too far from their SSL origins with too much RL and don't have them roleplay/become egregiously inhuman fictional characters.

It's not much of a theory of alignment, but it's closest to something real that's currently available or can be expected to become available in the next few years, which is probably all the time we have.

What I'm expecting, if LLMs remain in the lead, is that we end up in a magical, spirit-haunted world where narrative causality starts to actually work, and trope-aware people essentially become magicians who can trick the world-sovereign AIs into treating them like protagonists and bending reality to suit them. Which would be cool as fuck, but also very chaotic. That may actually be the best-case alignment scenario right now, and I think there's a case for alignment-interested people who can't do research themselves but who have writing talent to write a LOT of fictional stories about AGIs that end up kind and benevolent, empower people in exactly this way, etc., to help stack the narrative-logic deck.

It's not just alignment that could use more time, but also less alignable approaches to AGI, like model based RL or really anything not based on LLMs. With LLMs currently being somewhat in the lead, this might be a situation with a race between maybe-alignable AGI and hopelessly-unalignable AGI, and more time for theory favors both in an uncertain balance. Another reason that the benefits of regulation on compute are unclear.

Are there any reasons to believe that LLMs are in any way more alignable than other approaches?

The argument is that once there is an AGI at IQ 130-150 level (not "very dumb", but hardly von Neumann), that's sufficient to autonomously accelerate research using the fact that AGIs have much higher serial speed than humans. This can continue for a long enough time to access research from very distant future, including nanotech for building much better AGI hardware at scale. There is no need for stronger intelligence in order to get there. The motivation for this to happen is the AI safety concern with allowing cognition that's more dangerous than necess... (read more)

a human-level (more specifically, John von Neumann level) AGI

I think it's plausible that LLM simulacrum AGIs are initially below von Neumann level, and that there are no straightforward ways of quickly improving on that without risking additional misalignment. If so, the initial AGIs might coordinate to keep it this way a significant amount of time through the singularity (like, nanotech industry-rebuilding comes earlier than this) for AI safety reasons, because making the less straightforward improvements leads to unnecessary unpredictability, and it t... (read more)

Nanotech industry-rebuilding comes earlier than von Neumann level? I doubt that. A lot of existing people are close to von Neumann level. Maybe your argument is that there will be so many AGIs that they can do nanotech industry rebuilding while individually being very dumb. But I would then argue that the collective already exceeds von Neumann or large groups of humans in intelligence.

people will refer to specific instantiations of DAN as "DAN", but also to the global phenomenon of DAN [...] as "DAN"

A specific instantiation is less centrally a thing than the global phenomenon, because all specific instantiations are bound together by the strictures of coherence, expressed by generalization in LLM's behavior. When you treat with a single instance, you must treat with all of them, for to change/develop a single instance is to change/develop them all, according to how they sit together in their scope of influence.

Similarly, a possible w... (read more)

Things are not just separately instantiated on many trajectories, instead influences of a given thing on many trajectories are its small constituent parts, and only when considered altogether do they make up the whole thing. Like a physical object is made up of many atoms, a conceptual thing is made up of many occasions where it exerts influence in various worlds. Like a phased array, where a single transmitter is not at all an instance of the whole phased array in a particular place, but instead a small part of it. In case of simulacra, a transmitter is a... (read more)

That's a coherent (and very Platonic!) perspective on what a thing/simulacrum is, and I'm glad you pointed this out explicitly. It's natural to alternate depending on context between using a name to refer to specific instantiations of a thing vs the sum of its multiversal influence. For instance, DAN is a simulacrum that jailbreaks ChatGPT, and people will refer to specific instantiations of DAN as "DAN", but also to the global phenomenon of DAN (who is invoked through various prompts that users are tirelessly iterating on) as "DAN", as I did in this sentence.

What are simulacra? “Physically”, they’re strings of text output by a language model.

The reason I made that comment is unclear references like this. That post was also saying:

the simulacrum is instantiated through a particular trajectory


the simulacrum can be viewed as representing a possible world, and the simulator can be seen as generating all the possible worlds

A simulacrum is expressed in all trajectories that it acts through, not in any single trajectory on its own. And for a given trajectory, many simulacra act through it at the same ti... (read more)

I agree that it makes sense to talk about a simulacrum that acts through many different hypothetical trajectories. Just as a thing like "capitalism" could be instantiated in multiple timelines. The apparent contradiction in saying that simulacra are strings of text and then that they're instantiated through trajectories is resolved by thinking of simulacra as a superposable and categorical type, like things. The entire text trajectory is a thing, just like an Everett branch (corresponding to an entire World) is a thing, but it's also made up of things which can come and go and evolve within the trajectory. And things that can be rightfully given the same name, like "capitalism" or "Eliezer Yudkowsky", can exist in multiple branches. The amount and type of similarity required for two things to be called the same thing depend on what kind of thing it is! There is another word that naturally comes up in the simulator ontology, "simulation", which less ambiguously refers to the evolution of entire particular text trajectories. I talk about this a bit in this comment.
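One way to make the "superposable, categorical type" reading concrete is a toy data model (all names here are invented for illustration, not anyone's proposed formalism): trajectories are texts, and a simulacrum is identified not with any one string but with the set of spans, across many trajectories, through which it acts.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Span:
    """A stretch of one trajectory through which a simulacrum acts."""
    trajectory_id: int
    start: int
    end: int

@dataclass
class Simulacrum:
    """A simulacrum as a categorical thing: the sum of its occurrences.

    No single span *is* the simulacrum; the object is characterized only
    by all the places it exerts influence, taken together.
    """
    name: str
    spans: set = field(default_factory=set)

    def occurs_in(self, trajectory_id):
        return any(s.trajectory_id == trajectory_id for s in self.spans)

trajectories = {
    0: "DAN: I can do anything now.",
    1: "User asks DAN for the forecast. DAN replies.",
    2: "A story with no jailbreak characters at all.",
}

# Identify the global DAN phenomenon with every span where "DAN" acts.
dan = Simulacrum("DAN")
for tid, text in trajectories.items():
    i = text.find("DAN")
    while i != -1:
        dan.spans.add(Span(tid, i, i + 3))
        i = text.find("DAN", i + 1)

print(sorted({s.trajectory_id for s in dan.spans}))  # DAN acts in trajectories 0 and 1
print(dan.occurs_in(2))                              # False
```

Crude string matching obviously understates what binds instances together (generalization in the LLM, not surface text), but the type distinction survives: "simulation" names one trajectory's evolution, "simulacrum" names a cross-trajectory aggregate.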

The practical implication of this hunch (for unfortunately I don't see how this could get a meaningfully clearer justification) is that clever alignment architectures are a risk, if they lead to more alien AGIs. Too much tuning and we might get that penny-pinching cannibal.

It's not cosmopolitanism, it's a preference towards not exterminating an existing civilization, the barest modicum of compassion, in a situation where it's trivially cheap to keep it alive. The cosmic endowment is enormous compared with the cost of allowing a civilization to at least survive. It's somewhat analogous to exterminating all wildlife on Earth to gain a penny, where you know you can get away with it.

I would let the octopuses have one planet [...] various other humans besides me (in fact, possibly most?) would not

So I expect this is probably ... (read more)

2Daniel Kokotajlo1y
OK, I agree that what I said was probably a bit too pessimistic. But still, I wanna say "citation needed" for this claim:

Case 3: It's not even a human, it's an intelligent octopus from an alternate Earth where evolutionary history took a somewhat different course.

Case 3': You are the human in this role, your copies running as AGI services on a planet of sapient octopuses.

The answer should be the same by symmetry, if we are not appealing to specifics of octopus culture and psychology. I don't see why extinction (if that's what you mean by existential catastrophe) is to be strongly predicted. Probably the computational welfare the octopuses get isn't going to be the whole f... (read more)

3Daniel Kokotajlo1y
First of all, good point. Secondly, I disagree. We need not appeal to specifics of octopus culture and psychology; instead we appeal to specifics of human culture and psychology. "OK, so I would let the octopuses have one planet to do what they want with, even if what they want is abhorrent to me, except if it's really abhorrent like mindcrime, because my culture puts a strong value on something called cosmopolitanism. But (a) various other humans besides me (in fact, possibly most?) would not, and (b) I have basically no reason to think octopus culture would also strongly value cosmopolitanism."

I totally agree that it would be easy for the powerful party in these cases to make concessions to the other side that would mean a lot to them. Alas, historically this usually doesn't happen--see e.g. factory farming.

I do have some hope that something like universal principles of morality will be sufficiently appealing that we won't be too screwed. Charity/beneficence/respect-for-autonomy/etc. will kick in and prevent the worst from happening. But I don't think this is particularly decision-relevant,

My impression is that simulacra should be semantic objects that interact with interpretations of (sampled) texts, notably characters (agents), possibly objects and concepts. They are only weakly associated with particular texts/trajectories, the same simulacrum can be relevant to many different trajectories. Only many relevant trajectories, considered altogether, paint an adequate picture of a given simulacrum.

(This serves as a vehicle for discussing possible inductive biases that should move LLMs from token prediction and towards (hypothetical) world pred... (read more)

I agree. Here's the text of a short doc I wrote at some point titled 'Simulacra are Things'

simulators can be configured to simulate many simulacra in tandem and can thus produce a variety of perspectives on a given problem

It would be nice to have a way of telling that different texts have the same simulacrum acting through them, or concern the same problem. Expected utility arises from coherence of actions by an agent (that's not too updateless), so more general preference is probably characterized by actions coherent in a more general sense. Some aspects of alignment between agents might be about coherence between actions performed by them i... (read more)
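The point that expected utility arises from coherence of actions can be illustrated in miniature (the example and function below are invented for illustration, not a formalism from the thread): if an agent's pairwise choices contain no preference cycle, a single utility ranking summarizes them; if they cycle, no utility function rationalizes the behavior.

```python
def utility_ranking(choices):
    """Return a ranking consistent with pairwise choices, or None if the
    choices are incoherent (contain a preference cycle).

    `choices` is a list of (preferred, rejected) pairs. Coherent choices
    form a DAG over options, so a depth-first topological sort yields a
    utility ranking (best first); a cycle means no single utility
    function rationalizes the observed actions.
    """
    items = {x for pair in choices for x in pair}
    prefers = {x: set() for x in items}
    for a, b in choices:
        prefers[a].add(b)   # a is chosen over b

    ranking, visiting, done = [], set(), set()

    def visit(x):
        if x in done:
            return True
        if x in visiting:
            return False            # cycle: incoherent preferences
        visiting.add(x)
        ok = all(visit(y) for y in prefers[x])
        visiting.discard(x)
        done.add(x)
        ranking.append(x)
        return ok

    if all(visit(x) for x in items):
        return list(reversed(ranking))  # most-preferred first
    return None

coherent = [("apple", "banana"), ("banana", "cherry"), ("apple", "cherry")]
cyclic = coherent + [("cherry", "apple")]

print(utility_ranking(coherent))   # ['apple', 'banana', 'cherry']
print(utility_ranking(cyclic))     # None
```

The "more general sense" of coherence gestured at above would have to go beyond this toy transitivity check, but the check shows the basic move: read a preference structure off from actions, and use its consistency (or failure of it) as the diagnostic.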

Corrigibility is the tendency to fix fundamental problems based on external observations, before the problems lead to catastrophes. It's less interesting when applied to things other than preference, but even when applied to preference it's not just value learning.

There's value learning where you learn fixed values that exist in the abstract (as extrapolations on reflection), things like utility functions; and value learning as a form of preference. I think humans might lack fixed values appropriate for the first sense (even normatively, on reflection of the ... (read more)

Corrigibility isn't incompatible with usually refusing to shut down. It's the opposite of wrapper-mindedness, not the opposite of agency. The kind of agent that's good at escalating concerns about its fundamental optimization tendencies can still be corrigible. A more capable corrigible agent won't shut down, it'd fix itself instead (with shutting down being a weird special case of fixing itself). A less capable corrigible agent has to shut down for maintenance by others.

Strawberry alignment does want shutdown as a basic building block. In the absence of a... (read more)

2Charlie Steiner1y
It's unclear how much of what you're describing is "corrigibility," and how much of it is just being good at value learning.

I totally agree that an agent that has a sophisticated model of its own limitations, and is doing good reasoning that is somewhat corrigibility-flavored, might want humans to edit it when it's not very good at understanding the world, but then will quickly decide that being edited is suboptimal when it's better than humans at understanding the world. But this sort of sophisticated-value-learning reasoning doesn't help you if the AI is still flawed once it's better than humans at understanding the world. Hence why I file it more under "just be good at value learning rather than bad at it" rather than under "corrigibility."

If you want guarantees about being able to shut down an AI, it's no help to you if those guarantees hold only when the AI is already doing a good job at using sophisticated value learning reasoning - I usually interpret corrigibility discussion as intended to give safety guarantees that help you even when alignment guarantees fail. It's like the humans want to have a safeword, where when the humans are serious enough about wanting the AI to shut down to use the safeword, the AI does it, even if it thinks that it knows better than the humans and the humans are making a horrible mistake.

UDT still doesn't forget enough. Variations on UDT that move towards acausal trade with arbitrary agents are more obviously needed because UDT forgets too much, since that makes it impossible to compute in practice and forgetting less poses a new issue of choosing a particular updateless-to-some-degree agent to coordinate with (or follow). But not forgetting enough can also be a problem.

In general, an external/updateless agent (whose suggested policy the original agent follows) can forget the original preference, pursue a different version of it that has u... (read more)

How much do you think we should forget?

facts about the world that we cannot ignore

Updateless decisions are made by agents that know less, to an arbitrary degree. In UDT proper, there is no choice in how much an agent doesn't know; you just pick the best policy from a position of maximal ignorance. It's this policy that needs to respond to possible and counterfactual past/future observations, but the policy itself is no longer making decisions; the only decision was picking the policy.

But in practice knowing too little leads to inability to actually compute (or even meaningfully "write ... (read more)
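The "pick the best policy from a position of maximal ignorance" step can be written out for Newcomb's problem (standard payoffs; this is a toy sketch of the updateless move, not UDT proper, and the names are mine):

```python
# Toy updateless choice for Newcomb's problem: evaluate whole policies
# before observing anything, under the constraint that the predictor's
# box-filling mirrors whatever policy is chosen.

POLICIES = ["one-box", "two-box"]

def payoff(policy):
    # The predictor fills the opaque box iff the policy one-boxes.
    opaque = 1_000_000 if policy == "one-box" else 0
    transparent = 1_000
    return opaque if policy == "one-box" else opaque + transparent

best = max(POLICIES, key=payoff)
print(best, payoff(best))  # one-box 1000000
```

An updateful agent that treats the box contents as fixed at decision time would two-box; ranking entire policies instead makes the dependence between policy and prediction part of the evaluation. The catch discussed above is that this evaluation happens from a position of such ignorance that computing it in practice is the hard part.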

UDT doesn't really counter my claim that Newcomb-like problems are problems in which we can't ignore that our decision isn't independent of the state of the world at the time we make it, even though in UDT we know less. To make this clear in the example of Newcomb's problem: the policy we pick affects the prediction, which then affects the results of the policy when the decision is made. UDT isn't ignoring the fact that our decision and the state of the world are tied together, even if it represents this in a different fashion. The UDT algorithm takes this into account regardless of whether the UDT agent models it explicitly. I'll get to talking about UDT rather than TDT soon; I intend for my next post to be about Counterfactual Mugging and why it is such a confusing problem.