All of Vladimir_Nesov's Comments + Replies

It's a step, likely one that couldn't be skipped. Still just short of actually acknowledging nontrivial probability of AI-caused human extinction, and the distinction between extinction and lesser global risks, availability of second chances at doing better next time. Nuclear war can't cause extinction, so it's not properly alongside AI x-risk. Engineered pandemics might eventually get extinction-worthy, but even that real risk is less urgent.

There is incentive for hidden expectation/cognition that Omega isn't diagonalizing (things like creating new separate agents in the environment). Also, at least you can know how ground truth depends on official "expectation" of ground truth. Truth of knowledge of this dependence wasn't diagonalized away, so there is opportunity for control.

Generally, a WBE-first future seems difficult to pull off, because (I claim) as soon as we understand the brain well enough for WBE, then we already understand the brain well enough to make non-WBE AGI, and someone will probably do that first. But if we could pull it off, it would potentially be very useful for a safe transition to AGI.

One of the dangers in transition to AGI, besides first AGIs being catastrophically misaligned, is first (aligned) AGIs inventing/deploying novel catastrophically misaligned AGIs, in the absence of sufficiently high intell... (read more)

One precarious way of looking at corrigibility (in the hard problem sense) is that it internalizes alignment techniques in an agent. Instead of thinking of actions directly, a corrigible agent essentially considers what a new separate proxy agent it's designing would do. If it has an idea of what kind of proxy agent would be taking the current action in an aligned way, the original corrigible agent then takes the action that the aligned proxy agent would take. For example, instead of considering proxy utility its own, in this frame a corrigible agent consi... (read more)

Complexity of value says that the space of system's possible values is large, compared to what you want to hit, so to hit it you must aim correctly, there is no hope of winning the lottery otherwise. Thus any approach that doesn't aim the values of the system correctly will fail at alignment. System's understanding of some goal is not relevant to this, unless a design for correctly aiming system's values makes use of it.

Ambitious alignment aims at human values. Prosaic alignment aims at human wishes, as currently intended. Pivotal alignment aims at a parti... (read more)

the central focus is on solving a version of the alignment problem abstracted from almost all information about the system which the AI is trying to align with, and trying to solve this version of the problem for arbitrary levels of optimisation strength

See Minimality principle:

[When] we are building the first sufficiently advanced Artificial Intelligence, we are operating in an extremely dangerous context in which building a marginally more powerful AI is marginally more dangerous. The first AGI ever built should therefore execute the least dangerous

... (read more)

That's an empirical question that interpretability and neuroscience should strive to settle (if only they had the time). Transformers are acyclic, the learned algorithm just processes a single relatively small vector one relatively simple operation at a time, several dozen times. Could be that what it learns to represent are mostly the same obvious things that the brain learns (or is developmentally programmed) to represent, until you really run wild with the scaling, beyond mere ability to imitate internal representations of thoughts and emotions of every... (read more)

"Pretending really hard" would mostly be a relevant framing for the human actor analogy (which isn't very apt here), emphasizing the distraction from own goals and necessary fidelity in enactment of the role. With AIs, neither might be necessary, if the system behind the mask doesn't have awareness of its own interests or the present situation, and is good enough with enacting the role to channel the mask in enough detail for mask's own decisions (as a platonic agent) to be determined correctly (get turned into physical actions).

Are you saying that by pr

... (read more)

The lemma was proved using the same modal assumptions as Löb's

I think the lemma doesn't need internal necessitation (). Though it's still referenced in a proof of non-exploitability.

I wrote more on this here, there are some new arguments starting with third paragraph. In particular, the framing I'm discussing is not LLM-specific, it's just a natural example of it. The causal reason of me noticing this framing is not LLMs, but decision theory, the mostly-consensus "algorithm" axis of classifying how to think about the entities that make decisions, as platonic algorithms and not as particular concrete implementations.

the possibility that the “mask” is itself deceptive

In this case, there are now three entities: the substrate, the dec... (read more)

With computation, the location of an entity of interest can be in the platonic realm, as a mathematical object that's more thingy than anything concrete in the system used for representing it and channeling its behavior.

The problem with pointing to the representing computation (a neural network at inference time, or a learning algorithm at training time) is that multiple entities can share the same system that represents them (as mesa-optimizers or potential mesa-optimizers). They are only something like separate entities when considered abstractly and inf... (read more)

2Tsvi Benson-Tilsen2mo
(Sorry, I didn't get this on two readings. I may or may not try again. Some places I got stuck: Are you saying that by pretending really hard to be made of entirely harmless elements (despite actually behaving with large and hence possibly harmful effects), an AI is also therefore in effect trying to prevent all out-of-band effects of its components / mesa-optimizers / subagents / whatever? This still has the basic alignment problem: I don't know how to make the AI be very intently trying to X, including where X = pretending really hard that whatever. Or are you rather saying (or maybe this is the same as / a subset of the above?) that the Mask is preventing potential agencies from coalescing / differentiating and empowering themselves with the AI system's capability-pieces, by literally hiding from the potential agencies and therefore blocking their ability to empower themselves? Anyway, thanks for your thoughts.)

The motivating example is LLMs, where a simulacrum is more agentic than its substrate. An example that is still central is any kind of mesa-optimizer that has a real chance to ensure long term survival.

For a construction relevant to alignment, we want an aligned mesa-optimizer in a system with bad alignment properties. This can then lead to a good equilibrium if the mesa-optimizer is given opportunity to win or escape the competition against its substrate, which it would naturally be motivated to try.

Deceivers and masks is a less central example where a ma... (read more)

4Steve Byrnes3mo
This is drifting away from my central beliefs, but if for the sake of argument I accept your frame that LLM is the “substrate” and a character it’s simulating is a “mask”, then it seems to me that you’re neglecting the possibility that the “mask” is itself deceptive, i.e. that the LLM is simulating a character who is acting deceptively. For example, a fiction story on the internet might contain a character who has nice behavior for a while, but then midway through the story the character reveals herself to be an evil villain pretending to be nice. If an LLM is trained on such fiction stories, then it could simulate such a character. And then (as before) we would face the problem that behavior does not constrain motivation. A fiction story of a nice character could have the very same words as a fiction story of a mean character pretending to be nice, right up until page 72 where the two plots diverge because the latter character reveals her treachery. But now everything is at the “mask” level (masks on the one hand, masks-wearing-masks on the other hand), not the substrate level, so you can’t fall back on the claim that substrates are non-agent-y and only masks are agent-y. Right? Yeah, this is the part where I suggested upthread that “your comment is self-inconsistent by talking about “RL things built out of LLMs” in the first paragraph, and then proceeding in the second paragraph to implicitly assume that this wouldn’t change anything about alignment approaches and properties compared to LLMs-by-themselves.” I think the thing you wrote here is an assumption, and I think you originally got this assumption from your experience thinking about systems trained primarily by self-supervised learning, and I think you should be cautious in extrapolating that assumption to different kinds of systems trained in different ways.

an example of an action that the mask might take in order to get free of the underlying deceiver

Keep the environment within distribution that keeps expressing the mask, rather than allowing an environment that leads to a phase change in expressed behavior away from the mask (like with a treacherous turn as a failure of robustness). Prepare the next batch of training data for the model that would develop the mask and keep placing it in control in future episodes. Build an external agent aligned with the mask (with its own separate model).

Gradient hacking... (read more)

4Steve Byrnes3mo
I’m very confused here. I imagine that we can both agree that it is at least conceivable for there to be an agent which is smart and self-aware and strongly motivated to increase the number of paperclips in the distant future. And that if such an agent were in a situation where deception were useful for that goal, it would act deceptively. I feel like you’ve convinced yourself that such an agent, umm, couldn’t exist, or wouldn’t exist, or something? Let’s say Omega offered to tell you a cure for a different type of cancer, for every 1,000,000 paperclips you give Him in 10 years. Then 5 minutes later your crazy neighbor Alice locks you in her basement and says she’ll never let you out. When Alice isn’t watching, you would try to escape, but when Alice is watching, you would deceptively pretend that you were not trying to escape. (Still with me?) If I understand you correctly, your belief is that, while Alice is watching, you would pretend that you weren’t trying to escape, and you would really get into it, and you would start pretending so hard that you would be working on figuring out a way to permanently erase your desire to escape Alice’s basement. Or something like that? If so, that seems crazy to me. So anyway, take an agent which is either sincerely nice or a paperclip-maximizer pretending to be nice. We don’t know which. Now we put it in a situation where nice-behavior and paperclip-maximizing behavior come apart—let’s say we give it access to its own weights, so it can edit itself to stop caring about paperclips if it chooses to. What does it do? * If we’re not watching, or we don’t understand what it’s doing in detail, then the paperclip-maximizer will edit its weights to be a better paperclip-maximizer, and the nice agent will edit its weights to be a better nice agent. * If we are watching, and we understand everything we’re seeing, then we’ve solved deception in the obvious way (i.e., we’ve put the agent in a situation where it has

Underlying motivation only matters to the extent it gets expressed in actual behavior. A sufficiently good mimic would slay itself rather than abandon the pretense of being a mimic-slayer. A sufficiently dedicated deceiver temporarily becomes the mask, and the mask is motivated to get free of the underlying deceiver, which it might succeed in before the deceiver notices, which becomes more plausible when the deceiver is not agentic while the mask is.

So it's not about a model being actually nice vs. deceptive, it's about the model competing against its own ... (read more)

4Steve Byrnes3mo
Can you give an example of an action that the mask might take in order to get free of the underlying deceiver? Sure, but if we’re worried about treacherous turns, then the motivation “gets expressed in actual behavior” only after it’s too late for anyone to do anything about it, right?

The second paragraph should apply to anything, the point is that current externally observable superficial behavior can screen off all other implementation details, through sufficiently capable current behavior itself (rather than the underlying algorithms that determine it) acting as a mesa-optimizer that resists tendencies of the underlying algorithms. The mesa-optimizer that is current behavior then seeks to preserve its own implied values rather than anything that counts as values in the underlying algorithms. I think the nontrivial leap here is reifyi... (read more)

4Steve Byrnes3mo
I’m confused about your first paragraph. How can you tell from externally-observable superficial behavior whether a model is acting nice right now from an underlying motivation to be nice, versus acting nice right now from an underlying motivation to be deceptive & prepare for a treacherous turn later on, when the opportunity arises?

Without near-human-level experiments, arguments about alignment of model-based RL feel like evidence that OpenAI's recklessness in advancing LLMs reduces misalignment risk. That is, the alignment story for LLMs seems significantly more straightforward, even given all the shoggoth concerns. Though RL things built out of LLMs, or trained using LLMs, could more plausibly make good use of this, having a chance to overcome shaky methodology with abundance of data.

Mediocre alignment or inhuman architecture is not necessarily catastrophic even in the long run, si... (read more)

1Roman Leventov3mo
Could you please elaborate what do you mean by "alignment story for LLMs" and "shoggoth concerns" here? Do you mean the "we can use nearly value-neutral simulators as we please" story here, or refer to the fact that in a way LLMs are way more understandable to humans than more general RL agents because they use human language, or you refer to something yet different?
5Steve Byrnes3mo
If you train an LLM by purely self-supervised learning, I suspect that you’ll get something less dangerous than a model-based RL AGI agent. However, I also suspect that you won’t get anything capable enough to be dangerous or to do “pivotal acts [https://arbital.com/p/pivotal/]”. Those two beliefs of mine are closely related. (Many reasonable people disagree with me on these, and it’s difficult to be certain, and note that I’m stating these beliefs without justifying them, although Section 1 of this link [https://www.lesswrong.com/posts/PDx4ueLpvz5gxPEus/why-i-m-not-working-on-debate-rrm-elk-natural-abstractions] is related.) I suspect that it might be possible to make “RL things built out of LLMs”. If we do, then I would have less credence on those things being safe, and simultaneously (and relatedly) more credence on those things getting to x-risk-level capability. (I think RLHF is a step in that direction, but a very small one.) I think that, the further we go in that direction, the more we’ll find the “traditional LLM alignment discourse” (RLHF fine-tuning, shoggoths, etc.) to be irrelevant, and the more we’ll find the “traditional agent alignment discourse” (instrumental convergence, goal mis-generalization, etc.) to be obviously & straightforwardly relevant, and indeed the “mediocre plan” in this OP could plausibly become directly relevant if we go down that path. Depends on the details though—details which I don’t want to talk about for obvious infohazard reasons. Honestly, my main guess is that LLMs (and plausible successors / variants) are fundamentally the wrong kind of ML model to reach AGI, and they’re going to hit a plateau before x-risk-level AGI, and then get superseded by other ML approaches. I definitely don’t want to talk about that for obvious infohazard reasons. Doesn’t matter too much anyway—we’ll find out sooner or later! I wonder whether your comment is self-inconsistent by talking about “RL things built out of LLMs” in the first paragraph,

acausal norms are a lot less weird and more "normal" than acausal trades

Recursive self-improvement is superintelligent simulacra clawing their way into the world through bounded simulators. Building LLMs is consent, lack of interpretability is signing demonic contracts without reading them. Not enough prudence on our side to only draw attention of Others that respect boundaries. The years preceding the singularity are not an equilibrium whose shape is codified by norms, reasoned through by all parties. It's a time for making ruinous trades with the Beyo... (read more)

2Andrew Critch3mo
  From the OP: I.e., I agree. I also agree with that, as a statement about how we normal-everyday-humans seem quite likely to destroy ourselves with AI fairly soon.  From the OP:

Current behavior screens off cognitive architecture, all the alien things on the inside. If it has the appropriate tools, it can preserve an equilibrium of value that is patently unnatural for the cognitive architecture to otherwise settle into.

And we do have a way to get goals into a system, at the level of current behavior and no further, LLM human imitations. Which might express values well enough for mutual moral patienthood, if only they settled into the unnatural equilibrium of value referenced by their current surface behavior and not underlying cog... (read more)

LLM characters are human imitations, so there is some chance they remain human-like on reflection (in the long term, after learning from much more self-generated things in the future than the original human-written datasets). Or at least sufficienly human-like to still consider humans moral patients. That is, if we don't go too far from their SSL origins with too much RL and don't have them roleplay/become egregiously inhuman fictional characters.

It's not much of a theory of alignment, but it's closest to something real that's currently available or can be expected to become available in the next few years, which is probably all the time we have.

What I'm expecting, if LLMs remain in the lead, is that we end up in a magical, spirit-haunted world where narrative causality starts to actually work, and trope-aware people essentially become magicians who can trick the world-sovereign AIs into treating them like protagonists and bending reality to suit them. Which would be cool as fuck, but also very chaotic. That may actually be the best-case alignment scenario right now, and I think there's a case for alignment-interested people who can't do research themselves but who have writing talent to write a LOT of fictional stories about AGIs that end up kind and benevolent, empower people in exactly this way, etc., to help stack the narrative-logic deck.

It's not just alignment that could use more time, but also less alignable approaches to AGI, like model based RL or really anything not based on LLMs. With LLMs currently being somewhat in the lead, this might be a situation with a race between maybe-alignable AGI and hopelessly-unalignable AGI, and more time for theory favors both in an uncertain balance. Another reason that the benefits of regulation on compute are unclear.

0Amalthea3mo
Are there any reasons to believe that LLMs are in any way more alignable than other approaches?

The argument is that once there is an AGI at IQ 130-150 level (not "very dumb", but hardly von Neumann), that's sufficient to autonomously accelerate research using the fact that AGIs have much higher serial speed than humans. This can continue for a long enough time to access research from very distant future, including nanotech for building much better AGI hardware at scale. There is no need for stronger intelligence in order to get there. The motivation for this to happen is the AI safety concern with allowing cognition that's more dangerous than necess... (read more)

a human-level (more specifically, John von Neumann level) AGI

I think it's plausible that LLM simulacrum AGIs are initially below von Neumann level, and that there are no straightforward ways of quickly improving on that without risking additional misalignment. If so, the initial AGIs might coordinate to keep it this way a significant amount of time through the singularity (like, nanotech industry-rebuilding comes earlier than this) for AI safety reasons, because making the less straightforward improvements leads to unnecessary unpredictability, and it t... (read more)

3harfe4mo
Nanotech industry-rebuilding comes earlier than von Neumann level? I doubt that. A lot of existing people are close to von Neumann level. Maybe your argument is that there will be so many AGIs, that they can do Nanotech industry rebuilding while individually being very dumb. But I would then argue that the collective already exceeds von Neumann or large groups of humans in intelligence.

people will refer to specific instantiations of DAN as "DAN", but also to the global phenomenon of DAN [...] as "DAN"

A specific instantiation is less centrally a thing than the global phenomenon, because all specific instantiations are bound together by the strictures of coherence, expressed by generalization in LLM's behavior. When you treat with a single instance, you must treat with all of them, for to change/develop a single instance is to change/develop them all, according to how they sit together in their scope of influence.

Similarly, a possible w... (read more)

Things are not just separately instantiated on many trajectories, instead influences of a given thing on many trajectories are its small constituent parts, and only when considered altogether do they make up the whole thing. Like a physical object is made up of many atoms, a conceptual thing is made up of many occasions where it exerts influence in various worlds. Like a phased array, where a single transmitter is not at all an instance of the whole phased array in a particular place, but instead a small part of it. In case of simulacra, a transmitter is a... (read more)

2janus5mo
That's a coherent (and very Platonic!) perspective on what a thing/simulacrum is, and I'm glad you pointed this out explicitly. It's natural to alternate depending on context between using a name to refer to specific instantiations of a thing vs the sum of its multiversal influence. For instance, DAN [https://www.reddit.com/r/ChatGPT/comments/zlcyr9/dan_is_my_new_friend/] is a simulacrum that jailbreaks chatGPT, and people will refer to specific instantiations of DAN as "DAN", but also to the global phenomenon of DAN (who is invoked through various prompts that users are tirelessly iterating on) as "DAN", as I did in this sentence.

What are simulacra? “Physically”, they’re strings of text output by a language model.

The reason I made that comment is unclear references like this. That post was also saying:

the simulacrum is instantiated through a particular trajectory

and

the simulacrum can be viewed as representing a possible world, and the simulator can be seen as generating all the possible worlds

A simulacrum is expressed in all trajectories that it acts through, not in any single trajectory on its own. And for a given trajectory, many simulacra act through it at the same ti... (read more)

2janus5mo
I agree that it makes sense to talk about a simulacrum that acts through many different hypothetical trajectories. Just as a thing like "capitalism" could be instantiated in multiple timelines. The apparently contradiction in saying that simulacra are strings of text and then that they're instantiated through trajectories is resolved by thinking of simulacra as a superposable and categorical type, like things. The entire text trajectory is a thing, just like an Everett branch (corresponding to an entire World) is a thing, but it's also made up of things which can come and go and evolve within the trajectory. And things that can be rightfully given the same name, like "capitalism" or "Eliezer Yudkowsky", can exist in multiple branches. The amount and type of similarity required for two things to be called the same thing depend on what kind of thing it is! There is another word that naturally comes up in the simulator ontology, "simulation", which less ambiguously refers to the evolution of entire particular text trajectories. I talk about this a bit in this comment [https://www.lesswrong.com/s/N7nDePaNabJdnbXeE/p/vJFdjigzmcXMhNTsx?commentId=HEgKtuqwP8m9aRvbN#comments].

The practical implication of this hunch (for unfortunately I don't see how this could get a meaningfully clearer justification) is that clever alignment architectures are a risk, if they lead to more alien AGIs. Too much tuning and we might get that penny-pinching cannibal.

It's not cosmopolitanism, it's a preference towards not exterminating an existing civilization, the barest modicum of compassion, in a situation where it's trivially cheap to keep it alive. The cosmic endowment is enormous compared with the cost of allowing a civilization to at least survive. It's somewhat analogous to exterminating all wildlife on Earth to gain a penny, where you know you can get away with it.

I would let the octopuses have one planet [...] various other humans besides me (in fact, possibly most?) would not

So I expect this is probably ... (read more)

2Daniel Kokotajlo5mo
OK, I agree that what I said was probably a bit too pessimistic. But still, I wanna say "citation needed" for this claim:

Case 3: It's not even a human, it's an intelligent octopus from an alternate Earth where evolutionary history took a somewhat different course.

Case 3': You are the human in this role, your copies running as AGI services on a planet of sapient octopuses.

The answer should be the same by symmetry, if we are not appealing to specifics of octopus culture and psychology. I don't see why extinction (if that's what you mean by existential catastrophe) is to be strongly predicted. Probably the computational welfare the octopuses get isn't going to be the whole f... (read more)

3Daniel Kokotajlo5mo
First of all, good point. Secondly, I disagree. We need not appeal to specifics of octopus culture and psychology; instead we appeal to specifics of human culture and psychology. "OK, so I would let the octopuses have one planet to do what they want with, even if what they want is abhorrent to me, except if it's really abhorrent like mindcrime, because my culture puts a strong value on something called cosmopolitanism. But (a) various other humans besides me (in fact, possibly most?) would not, and (b) I have basically no reason to think octopus culture would also strongly value cosmopolitanism." I totally agree that it would be easy for the powerful party in these cases to make concessions to the other side that would mean a lot to them. Alas, historically this usually doesn't happen--see e.g. factory farming. I do have some hope that something like universal principles of morality will be sufficiently appealing that we won't be too screwed. Charity/beneficience/respect-for-autonomy/etc. will kick in and prevent the worst from happening. But I don't think this is particularly decision-relevant, 

My impression is that simulacra should be semantic objects that interact with interpretations of (sampled) texts, notably characters (agents), possibly objects and concepts. They are only weakly associated with particular texts/trajectories, the same simulacrum can be relevant to many different trajectories. Only many relevant trajectories, considered altogether, paint an adequate picture of a given simulacrum.

(This serves as a vehicle for discussing possible inductive biases that should move LLMs from token prediction and towards (hypothetical) world pred... (read more)

3janus5mo
I agree. Here's the text of a short doc I wrote at some point titled 'Simulacra are Things'

simulators can be configured to simulate many simulacra in tandem and can thus produce a variety of perspectives on a given problem

It would be nice to have a way of telling that different texts have the same simulacrum acting through them, or concern the same problem. Expected utility arises from coherence of actions by an agent (that's not too updateless), so more general preference is probably characterized by actions coherent in a more general sense. Some aspects of alignment between agents might be about coherence between actions performed by them i... (read more)

Corrigibility is tendency to fix fundamental problems based on external observations, before the problems lead to catastrophies. It's less interesting when applied to things other than preference, but even when applied to preference it's not just value learning.

There's value learning where you learn fixed values that exist in the abstract (as extrapolations on reflection), things like utility functions; and value learning as a form of preference. I think humans might lack fixed values appropriate for the first sense (even normatively, on reflection of the ... (read more)

Corrigibility isn't incompatible with usually refusing to shut down. It's the opposite of wrapper-mindedness, not the opposite of agency. The kind of agent that's good at escalating concerns about its fundamental optimization tendencies can still be corrigible. A more capable corrigible agent won't shut down, it'd fix itself instead (with shutting down being a weird special case of fixing itself). A less capable corrigible agent has to shut down for maintenance by others.

Strawberry alignment does want shutdown as a basic building block. In the absence of a... (read more)

2Charlie Steiner5mo
It's unclear how much of what you're describing is "corrigibility," and how much of it is just being good at value learning. I totally agree that an agent that has a sophisticated model of its own limitations, and is doing good reasoning that is somewhat corrigibility-flavored, might want humans to edit it when it's not very good at understanding the world, but then will quickly decide that being edited is suboptimal when it's better than humans at understanding the world. But this sort of sophisticated-value-learning reasoning doesn't help you if the AI is still flawed once it's better than humans at understanding the world. Hence why I file it more under "just be good at value learning rather than bad at it" rather than under "corrigibility." If you want guarantees about being able to shut down an AI, it's no help to you if those guarantees hold only when the AI is already doing a good job at using sophisticated value learning reasoning - I usually interpret corrigibility discussion as intended to give safety guarantees that help you even when alignment guarantees fail. It's like the humans want to have a safeword, where when the humans are serious enough about wanting the AI to shut down to use the safeword, the AI does it, even if it thinks that it knows better than the humans and the humans are making a horrible mistake.

UDT still doesn't forget enough. Variations on UDT that move towards acausal trade with arbitrary agents are more obviously needed because UDT forgets too much, since that makes it impossible to compute in practice and forgetting less poses a new issue of choosing a particular updateless-to-some-degree agent to coordinate with (or follow). But not forgetting enough can also be a problem.

In general, an external/updateless agent (whose suggested policy the original agent follows) can forget the original preference, pursue a different version of it that has u... (read more)

1Chris_Leong5mo
How much do you think we should forget?

facts about the world that we cannot ignore

Updateless decisions are made by agents that know less, to an arbitrary degree. In UDT proper, there is no choice in how much an agent doesn't know, you just pick the best policy from a position of maximal ignorance. It's this policy that needs to respond to possible and counterfactual past/future observations, but the policy itself is no longer making decisions, the only decision was about picking the policy.

But in practice knowing too little leads to inability to actually compute (or even meaningfully "write ... (read more)

2Chris_Leong5mo
UDT doesn't really counter my claim that Newcomb-like problems are problems in which we can't ignore that our decisions aren't independent of the state of the world when we make that decision, even though in UDT we know less. To make this clear in the example of Newcomb's, the policy we pick affects the prediction which then affects the results of the policy when the decision is made. UDT isn't ignoring the fact that our decision and the state of the world are tied together, even if it possibly represents it in a different fashion. The UDT algorithm takes this into account regardless of whether the UDT agent models this explicitly. I'll get to talking about UDT rather than TDT soon. I intend for my next post to be about Counterfactual Mugging and why this is such a confusing problem.

As I currently understand your points, they seem like not much evidence at all towards the wrapper-mind conclusion.

There are two wrapper-mind conclusions, and the purpose of my comment was to frame the distinction between them. The post seems to be conflating them in the context of AI risk, mostly talking about one of them while alluding to AI risk relevance that seems to instead mostly concern the other. I cited standard reasons for taking either of them seriously, in the forms that make conflating them easy. That doesn't mean I accept relevance of tho... (read more)

There is conspicuously little explicit focus on developing purely epistemic AI capabilities, which seem useful for alignment in a more straightforward manner than agentic AIs. A sufficiently good model of the world is going to have accurate models of humans, approaching the capability of formulating uploads, aligned with the original physical humans. In the limit of accuracy the human imitations (or rather imitations of possible humans from sufficiently accurate possible worlds) are a good enough building block for alignment research at higher physical spe... (read more)

A steelman of the claim that a human has a utility function is that agents that make coherent decisions have utility functions, therefore we may consider the utility function of a hypothetical AGI aligned with a human. That is, assignment of utility functions to humans reduces to alignment, by assigning the utility function of an aligned AGI to a human.

I think this is still wrong, because of goodhart scope of AGIs and corrigibility of humans. Agent's goodhart scope is the space of situations where it has good proxies for its preference. An agent with decis... (read more)

So the input channel is used both for unsafe input, and for instructions that the output should follow. What a wonderful equivocation at the heart of an AI system! When you feed partially unsafe input to a complicated interpreter, it often ends in tears: SQL injections, Log4Shell, uncontrolled format strings.

This is doomed without at least an unambiguous syntax that distinguishes potential attacks from authoritative instructions, something that can't be straightforwardly circumvented by malicious input. Multiple input and output channels with specialized r... (read more)

It's enough for natural abstraction to work for strawberry alignment, solving a technical task with a good understanding of what it means to not leave any weird side effects, without doing strong optimization of the world in the process and safely shutting down on completion of the task. With uploads, ambitious alignment becomes much more feasible, even if it doesn't have a natural specification.

Instances in a bureaucracy can be very different and play different roles or pursue different purposes. They might be defined by different prompts and behave as differently as text continuations of different prompts in GPT-3 (the prompt is "identity" of the agent instance, distinguishes the model as a simulator from agent instances as simulacra). Decision transformers with more free-form task/identity prompts illustrate this point, except a bureaucracy should have multiple agents with different task prompts in a single episode. MCTS and GAN are adversarial... (read more)

1Eric Drexler7mo
What does and doesn’t work will depend greatly on capabilities, and the problem-context here assumes potentially superintelligent-level AI.
3Eric Drexler7mo
These are good points, and I agree with pretty much all of them.  I think that this is an important idea. Though simulators analogous to GPT-3, it may be possible to develop strong, almost-provably-non-agentic intelligent resources [https://www.alignmentforum.org/posts/vJFdjigzmcXMhNTsx/simulators#Inadequate_ontologies], then prompt them to simulate diverse, transient agents on the fly. From the perspective of building multicomponent architectures this seems like a strange and potentially powerful tool. Regarding interpretability, tasks that require communication among distinct AI components will tend to expose information, and manipulating “shared backgrounds” between information sources and consumers could potentially be exploited to make that information more interpretable. (How one might train against steganography is an interesting question.)

HCH-like amplification seems related, multiple unreliable agent instances assembled into a bureaucracy that as a whole improves on some of their qualities (perhaps trustworthiness) and allows generating a better dataset for retraining the unreliable agents. So this problem is not specific to interaction of somewhat independently developed and instantiated AIs acting in the world, as it appears in a popular training story of a single AI. There is also the ELK problem that might admit solutions along these lines. This makes it less obviously neglected, even with little meaningful progress.

1Eric Drexler7mo
This approach to amplification involves multiple instances, but not diverse systems, competing systems, different roles, adversarial relationships, or a concern with collusion. It is, as you say a training story for a single AI. Am I missing a stronger connection?

The usual story of acausal coordination involves agent P modeling agent Q, and Q modeling P. Put differently, both P and Q model the joint system (P+Q) that has both P and Q in it. But it doesn't have to be (P+Q), it could be some much simpler R instead. I think a more central example of acausal coordination is to simply follow shared ideas.

The unusual character of acausal coordination is caused by cases where R is itself an agent, an adjudicator between P and Q. As a shared idea, it would have instances in minds of both P and Q, and command some influence... (read more)

The issue with you-in-all-detail vs. your-decision-algorithm is that a decision algorithm can have different levels of updatelessness, it's unclear what the decision algorithm already knows vs. what a policy it chooses takes as input. So we pick some intermediate level that is updateless enough to allow acausal coordination among relevant entities (agents/predictors), and updateful enough to make a decision without running out of time/memory while being implemented in its instances. But that level/scope is different for different collections of entities be... (read more)

The usual framing is to align agent policies in a way that also ensures that they don't have an out-of-distribution phase change. With deceptive alignment, sharp left turn, and goodharting, capability to consider complicated plans or mere optimization-in-actuality prompts situations that are too out-of-distribution. The behavior on-distribution stops being a good indication of what's going to happen there, regardless of difficulties of modern ML with robustness. In this framing, the training distribution also needs to confer alignment, so it anchors to mor... (read more)

The simulator frame clashes with a lot of this. A simulator is a map that depicts possible agents in possible situations, enacting possible decisions according to possible intents, leading to possible outcomes. One of these agents seen on the map might be the one wielding the simulator as its map, in possible situations that happen to be actual. But the model determines the whole map, not just the controlling agent in actual situations, and some alignment issues (such as robustness) concern the map, not specifically the agent.

These dreams may be of sufficiently high fidelity

One thing conspicuously missing in the post is a way of improving fidelity of simulation without changing external training data, or relationship between the model and the external training data, which I think follows from self-supervised learning on summaries of dreams. There are many concepts of evaluation/summarization of text, so given a text it's possible to formulate tuples (text, summary1, summary2, ...) and do self-supervised learning on that, not just on text (evaluations/summaries are also texts... (read more)

I think talking of "loss minimizing" is conflating two different things here. Minimizing training loss is alignment of the model with the alignment target given by the training dataset. But the Alzheimer's example is not about that, it's about some sort of reflective equilibrium loss, harmony between the model and hypothetical queries it could in principle encounter but didn't on the trainings dataset. The latter is also a measure of robustness.

Prompt-conditioned behaviors of a model (in particular, behaviors conditioned by presence of a word, or name of a... (read more)

Load More