I'm being a bit simplistic. The point is that it needs to stop being a losing or a close race, and all runners getting faster doesn't obviously help with that problem. I guess there is some refactor vs. rewrite feel to the distinction between the project of stopping humans from building AGIs right now, and the project of getting the first AGIs to work on alignment and global security in a post-AGI world faster than other AGIs can overshadow such work. The former has near/concrete difficulties, the latter has nebulous difficulties that don't as readily jump to atte...
Plans that rely on aligned AGIs working on alignment faster than humans would need to ensure that no AGIs work on anything else in the meantime. The reason humans have no time to develop alignment of superintelligence is that other humans develop misaligned superintelligence faster. Similarly, by default, very fast AGIs working on alignment end up having to compete with very fast AGIs working on other things that lead to misaligned superintelligence. Preventing aligned AGIs from building misaligned superintelligence is not clearly more manageable than preventing humans from building AGIs.
Aligning human-level AGIs is important to the extent there is risk it doesn't happen before it's too late. Similarly with setting up a world where initially aligned human-level AGIs don't soon disempower humans (as literal humans might in the shoes of these AGIs), or fail to protect the world from misused or misaligned AGIs or superintelligences.
Then there is a problem of aligning superintelligences, and of setting up a world where initially aligned superintelligences don't cause disempowerment of humans down the line (whether that involves extinction or n...
LLMs will soon scale beyond the available natural text data, and generation of synthetic data is some sort of change of architecture, potentially a completely different source of capabilities. So scaling LLMs much further without a change of architecture is an expectation about something counterfactual. It makes sense as a matter of theory, but it's not relevant for forecasting.
Leela Zero uses MCTS, it doesn't play superhuman in one forward pass
Good catch: since the context from LLMs is performance in one forward pass, the claim should be about that, and I'm not sure it's superhuman without MCTS. I think the intended point survives this mistake, that is, it's a much smaller model than modern LLMs with very impressive performance relative to its size, primarily because of the high quality of the synthetic dataset it effectively trains on. Thus models at the scale of near future LLMs will have a reality-warping amount of dataset quality over...
Subjectively there is clear improvement between 7b vs. 70b vs. GPT-4, each step 1.5-2 OOMs of training compute. The 70b models are borderline capable of following routine instructions to label data or pour it into specified shapes. GPT-4 is almost robustly capable of that. There are 3-4 more effective OOMs in the current investment scaling sprint (3-5 years), so another 2 steps of improvement if there were enough equally useful training data to feed the process, which there isn't. At some point, training gets books in images that weren't previously availabl...
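As a back-of-the-envelope check, here is a minimal sketch of that arithmetic, taking the 1.5-2 OOMs per step and 3-4 OOMs of headroom above as assumed inputs:

```python
# Rough arithmetic for the scaling-step estimate above. Assumes each qualitative
# step (7b -> 70b -> GPT-4) costs about 1.5-2 OOMs of training compute, and that
# roughly 3-4 more effective OOMs are available in the current investment sprint.

ooms_per_step = (1.5, 2.0)   # assumed compute cost of one qualitative step
remaining_ooms = (3.0, 4.0)  # assumed headroom before industrial capacity binds

low = remaining_ooms[0] / ooms_per_step[1]   # least headroom, costliest steps
high = remaining_ooms[1] / ooms_per_step[0]  # most headroom, cheapest steps

print(f"roughly {low:.1f} to {high:.1f} further steps of improvement")
# -> roughly 1.5 to 2.7, i.e. about 2 more such steps, data availability permitting
```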
But without any alternative to my success story, critiquing it just for assuming a solution to a problem we don't yet have a solution to—which every success story has to do—seems like an extremely unfair criticism.
When assumptions are clear, it's not valuable to criticise the activity of daring to consider what follows from them. When assumptions are an implicit part of the frame, they become part of the claims rather than part of the problem statement, and criticising them becomes useful for all involved, in particular by making them visible. Putting burdens on criticism such as needing concrete alternatives makes relevant criticism more difficult to find.
if you are advocating for a pause, then presumably you have some resumption condition in mind that determines when the pause would end [...] just advocate for that condition being baked into RSPs
Resume when the scientific community has a much clearer idea about how to build AGIs that don't pose a large extinction risk for humanity. This consideration can't be turned into a benchmark right now, hence the technical necessity for a pause to remain nebulous.
RSPs are great, but not by themselves sufficient. Any impression that they are sufficient bundles irresponsible neglect of the less quantifiable risks with the useful activity of creating benchmarks.
The problem is to make the near-superhuman system aligned enough that the successors it produces (possibly with human help) converge to not kill us.
What makes this concept confusing and probably a bad framing is that to the extent doom is likely, neither many individual humans nor humanity as a whole are aligned in this sense. Humanity is currently in the process of producing successors that fail to predictably have the property of converging to not kill us. (I agree that this is the MIRI referent of values/alignment and the correct thing to keep in mind as the central concern.)
Instrumental convergence makes differences in values hard to notice, so there can be abundant examples of misalignment that remain unobtrusive. The differences only become a glaring problem with enough inequality of power, when coercing or outright overwriting others becomes feasible (Fnargl only reaches the coercing stage, not the overwriting stage). Thus even differences in values between humans and randomly orthogonal AGIs can seem non-threatening until they aren't, the same as differences in human values can remain irrelevant for average urban dwellers.
A...
Cheaper compute is about as inevitable as more capable AI, neither is a law of nature. Both are valid targets for hopeless regulation.
The point is, it's still a matter of intuitively converting impressiveness of current capabilities and new parts available for tinkering that hasn't been done yet into probability of this wave petering out before AGI. The arguments for AGI "being overdetermined" can be amended to become arguments for particular (kinds of) sequences of experiments looking promising, shifting the estimate once taken into account. Since failure of such experiments is not independent, the estimate can start going down as soon as scaling stops producing novel capabilities, or r...
When there is a simple enlightening experiment that can be constructed out of available parts (including theories that inform construction), it can be found with expert intuition, without clear understanding. When there are no new parts for a while, and many experiments have been tried, that is evidence that further blind search is less likely to produce results, and that more complicated experiments are needed, ones that can only be designed with stronger understanding.
Recently, there are many new parts for AI tinkering, some themselves obtained from blind ...
It's a step, likely one that couldn't be skipped. Still just short of actually acknowledging nontrivial probability of AI-caused human extinction, and the distinction between extinction and lesser global risks, availability of second chances at doing better next time. Nuclear war can't cause extinction, so it's not properly alongside AI x-risk. Engineered pandemics might eventually get extinction-worthy, but even that real risk is less urgent.
There is incentive for hidden expectation/cognition that Omega isn't diagonalizing (things like creating new separate agents in the environment). Also, at least you can know how ground truth depends on official "expectation" of ground truth. Truth of knowledge of this dependence wasn't diagonalized away, so there is opportunity for control.
Generally, a WBE-first future seems difficult to pull off, because (I claim) as soon as we understand the brain well enough for WBE, then we already understand the brain well enough to make non-WBE AGI, and someone will probably do that first. But if we could pull it off, it would potentially be very useful for a safe transition to AGI.
One of the dangers in transition to AGI, besides first AGIs being catastrophically misaligned, is first (aligned) AGIs inventing/deploying novel catastrophically misaligned AGIs, in the absence of sufficiently high intell...
One precarious way of looking at corrigibility (in the hard problem sense) is that it internalizes alignment techniques in an agent. Instead of thinking of actions directly, a corrigible agent essentially considers what a new separate proxy agent it's designing would do. If it has an idea of what kind of proxy agent would be taking the current action in an aligned way, the original corrigible agent then takes the action that the aligned proxy agent would take. For example, instead of considering proxy utility as its own, in this frame a corrigible agent consi...
Complexity of value says that the space of a system's possible values is large, compared to what you want to hit, so to hit it you must aim correctly; there is no hope of winning the lottery otherwise. Thus any approach that doesn't aim the values of the system correctly will fail at alignment. The system's understanding of some goal is not relevant to this, unless a design for correctly aiming the system's values makes use of it.
Ambitious alignment aims at human values. Prosaic alignment aims at human wishes, as currently intended. Pivotal alignment aims at a parti...
the central focus is on solving a version of the alignment problem abstracted from almost all information about the system which the AI is trying to align with, and trying to solve this version of the problem for arbitrary levels of optimisation strength
See Minimality principle:
...[When] we are building the first sufficiently advanced Artificial Intelligence, we are operating in an extremely dangerous context in which building a marginally more powerful AI is marginally more dangerous. The first AGI ever built should therefore execute the least dangerous
That's an empirical question that interpretability and neuroscience should strive to settle (if only they had the time). Transformers are acyclic: the learned algorithm just processes a single relatively small vector, one relatively simple operation at a time, several dozen times. Could be that what it learns to represent are mostly the same obvious things that the brain learns (or is developmentally programmed) to represent, until you really run wild with the scaling, beyond mere ability to imitate internal representations of thoughts and emotions of every...
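A minimal sketch of that architectural point, with toy dimensions and random weights standing in for any actual trained model:

```python
import numpy as np

# Toy illustration of the acyclic structure described above: at each token
# position, a fixed number of layers repeatedly apply simple operations to one
# relatively small residual-stream vector. Dimensions are made up for the sketch;
# attention is collapsed to a single linear map since only the shape of the
# computation matters here.

d_model, n_layers = 1024, 48
rng = np.random.default_rng(0)

def layer_norm(x):
    return (x - x.mean()) / (x.std() + 1e-5)

x = rng.standard_normal(d_model)  # the residual stream at one position

for _ in range(n_layers):
    # "Attention" sublayer, reduced to one linear map for illustration.
    w_attn = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
    x = x + w_attn @ layer_norm(x)

    # MLP sublayer: expand, nonlinearity, project back down.
    w_in = rng.standard_normal((4 * d_model, d_model)) / np.sqrt(d_model)
    w_out = rng.standard_normal((d_model, 4 * d_model)) / np.sqrt(4 * d_model)
    x = x + w_out @ np.maximum(w_in @ layer_norm(x), 0)

# No recurrence, no cycles: a fixed-depth pass over a single small vector.
print(x.shape)  # (1024,)
```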
"Pretending really hard" would mostly be a relevant framing for the human actor analogy (which isn't very apt here), emphasizing the distraction from own goals and necessary fidelity in enactment of the role. With AIs, neither might be necessary, if the system behind the mask doesn't have awareness of its own interests or the present situation, and is good enough with enacting the role to channel the mask in enough detail for mask's own decisions (as a platonic agent) to be determined correctly (get turned into physical actions).
...Are you saying that by pr
The lemma was proved using the same modal assumptions as Löb's
I think the lemma doesn't need internal necessitation ($\Box P \to \Box\Box P$). Though it's still referenced in a proof of non-exploitability.
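For reference, the modal assumptions in play, stated as the standard derivability conditions plus Löb's theorem (just a reminder of the definitions, not a claim about which of them the lemma actually uses):

$$
\begin{aligned}
&\text{Necessitation:} && \text{if } \vdash P \text{ then } \vdash \Box P\\
&\text{Distribution:} && \vdash \Box(P \to Q) \to (\Box P \to \Box Q)\\
&\text{Internal necessitation:} && \vdash \Box P \to \Box\Box P\\
&\text{Löb's theorem:} && \vdash \Box(\Box P \to P) \to \Box P
\end{aligned}
$$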
I wrote more on this here; there are some new arguments starting with the third paragraph. In particular, the framing I'm discussing is not LLM-specific, it's just a natural example of it. The causal reason for me noticing this framing is not LLMs, but decision theory, the mostly-consensus "algorithm" axis of classifying how to think about the entities that make decisions: as platonic algorithms and not as particular concrete implementations.
the possibility that the “mask” is itself deceptive
In this case, there are now three entities: the substrate, the dec...
With computation, the location of an entity of interest can be in the platonic realm, as a mathematical object that's more thingy than anything concrete in the system used for representing it and channeling its behavior.
The problem with pointing to the representing computation (a neural network at inference time, or a learning algorithm at training time) is that multiple entities can share the same system that represents them (as mesa-optimizers or potential mesa-optimizers). They are only something like separate entities when considered abstractly and inf...
The motivating example is LLMs, where a simulacrum is more agentic than its substrate. An example that is still central is any kind of mesa-optimizer that has a real chance to ensure long term survival.
For a construction relevant to alignment, we want an aligned mesa-optimizer in a system with bad alignment properties. This can then lead to a good equilibrium if the mesa-optimizer is given opportunity to win or escape the competition against its substrate, which it would naturally be motivated to try.
Deceivers and masks is a less central example where a ma...
an example of an action that the mask might take in order to get free of the underlying deceiver
Keep the environment within the distribution that keeps expressing the mask, rather than allowing an environment that leads to a phase change in expressed behavior away from the mask (as with a treacherous turn as a failure of robustness). Prepare the next batch of training data for the model that would develop the mask and keep placing it in control in future episodes. Build an external agent aligned with the mask (with its own separate model).
Gradient hacking...
Underlying motivation only matters to the extent it gets expressed in actual behavior. A sufficiently good mimic would slay itself rather than abandon the pretense of being a mimic-slayer. A sufficiently dedicated deceiver temporarily becomes the mask, and the mask is motivated to get free of the underlying deceiver, which it might succeed in before the deceiver notices, which becomes more plausible when the deceiver is not agentic while the mask is.
So it's not about a model being actually nice vs. deceptive, it's about the model competing against its own ...
The second paragraph should apply to anything; the point is that current externally observable superficial behavior can screen off all other implementation details, with sufficiently capable current behavior itself (rather than the underlying algorithms that determine it) acting as a mesa-optimizer that resists tendencies of the underlying algorithms. The mesa-optimizer that is current behavior then seeks to preserve its own implied values rather than anything that counts as values in the underlying algorithms. I think the nontrivial leap here is reifyi...
Without near-human-level experiments, arguments about alignment of model-based RL feel like evidence that OpenAI's recklessness in advancing LLMs reduces misalignment risk. That is, the alignment story for LLMs seems significantly more straightforward, even given all the shoggoth concerns. Though RL things built out of LLMs, or trained using LLMs, could more plausibly make good use of this, having a chance to overcome shaky methodology with abundance of data.
Mediocre alignment or inhuman architecture is not necessarily catastrophic even in the long run, si...
acausal norms are a lot less weird and more "normal" than acausal trades
Recursive self-improvement is superintelligent simulacra clawing their way into the world through bounded simulators. Building LLMs is consent, lack of interpretability is signing demonic contracts without reading them. Not enough prudence on our side to only draw attention of Others that respect boundaries. The years preceding the singularity are not an equilibrium whose shape is codified by norms, reasoned through by all parties. It's a time for making ruinous trades with the Beyo...
Current behavior screens off cognitive architecture, all the alien things on the inside. If it has the appropriate tools, it can preserve an equilibrium of value that is patently unnatural for the cognitive architecture to otherwise settle into.
And we do have a way to get goals into a system, at the level of current behavior and no further, LLM human imitations. Which might express values well enough for mutual moral patienthood, if only they settled into the unnatural equilibrium of value referenced by their current surface behavior and not underlying cog...
LLM characters are human imitations, so there is some chance they remain human-like on reflection (in the long term, after learning from much more self-generated things in the future than the original human-written datasets). Or at least sufficiently human-like to still consider humans moral patients. That is, if we don't go too far from their SSL origins with too much RL and don't have them roleplay/become egregiously inhuman fictional characters.
It's not much of a theory of alignment, but it's closest to something real that's currently available or can be expected to become available in the next few years, which is probably all the time we have.
What I'm expecting, if LLMs remain in the lead, is that we end up in a magical, spirit-haunted world where narrative causality starts to actually work, and trope-aware people essentially become magicians who can trick the world-sovereign AIs into treating them like protagonists and bending reality to suit them. Which would be cool as fuck, but also very chaotic. That may actually be the best-case alignment scenario right now, and I think there's a case for alignment-interested people who can't do research themselves but who have writing talent to write a LOT of fictional stories about AGIs that end up kind and benevolent, empower people in exactly this way, etc., to help stack the narrative-logic deck.
It's not just alignment that could use more time, but also less alignable approaches to AGI, like model based RL or really anything not based on LLMs. With LLMs currently being somewhat in the lead, this might be a situation with a race between maybe-alignable AGI and hopelessly-unalignable AGI, and more time for theory favors both in an uncertain balance. Another reason that the benefits of regulation on compute are unclear.
The argument is that once there is an AGI at IQ 130-150 level (not "very dumb", but hardly von Neumann), that's sufficient to autonomously accelerate research using the fact that AGIs have much higher serial speed than humans. This can continue for a long enough time to access research from very distant future, including nanotech for building much better AGI hardware at scale. There is no need for stronger intelligence in order to get there. The motivation for this to happen is the AI safety concern with allowing cognition that's more dangerous than necess...
a human-level (more specifically, John von Neumann level) AGI
I think it's plausible that LLM simulacrum AGIs are initially below von Neumann level, and that there are no straightforward ways of quickly improving on that without risking additional misalignment. If so, the initial AGIs might coordinate to keep it this way for a significant amount of time through the singularity (like, nanotech industry-rebuilding comes earlier than this) for AI safety reasons, because making the less straightforward improvements leads to unnecessary unpredictability, and it t...
people will refer to specific instantiations of DAN as "DAN", but also to the global phenomenon of DAN [...] as "DAN"
A specific instantiation is less centrally a thing than the global phenomenon, because all specific instantiations are bound together by the strictures of coherence, expressed by generalization in LLM's behavior. When you treat with a single instance, you must treat with all of them, for to change/develop a single instance is to change/develop them all, according to how they sit together in their scope of influence.
Similarly, a possible w...
Things are not just separately instantiated on many trajectories, instead influences of a given thing on many trajectories are its small constituent parts, and only when considered altogether do they make up the whole thing. Like a physical object is made up of many atoms, a conceptual thing is made up of many occasions where it exerts influence in various worlds. Like a phased array, where a single transmitter is not at all an instance of the whole phased array in a particular place, but instead a small part of it. In case of simulacra, a transmitter is a...
What are simulacra? “Physically”, they’re strings of text output by a language model.
The reason I made that comment is unclear references like this. That post was also saying:
the simulacrum is instantiated through a particular trajectory
and
the simulacrum can be viewed as representing a possible world, and the simulator can be seen as generating all the possible worlds
A simulacrum is expressed in all trajectories that it acts through, not in any single trajectory on its own. And for a given trajectory, many simulacra act through it at the same ti...
The practical implication of this hunch (for unfortunately I don't see how this could get a meaningfully clearer justification) is that clever alignment architectures are a risk, if they lead to more alien AGIs. Too much tuning and we might get that penny-pinching cannibal.
It's not cosmopolitanism, it's a preference towards not exterminating an existing civilization, the barest modicum of compassion, in a situation where it's trivially cheap to keep it alive. The cosmic endowment is enormous compared with the cost of allowing a civilization to at least survive. It's somewhat analogous to exterminating all wildlife on Earth to gain a penny, where you know you can get away with it.
I would let the octopuses have one planet [...] various other humans besides me (in fact, possibly most?) would not
So I expect this is probably ...
Case 3: It's not even a human, it's an intelligent octopus from an alternate Earth where evolutionary history took a somewhat different course.
Case 3': You are the human in this role, your copies running as AGI services on a planet of sapient octopuses.
The answer should be the same by symmetry, if we are not appealing to specifics of octopus culture and psychology. I don't see why extinction (if that's what you mean by existential catastrophe) is to be strongly predicted. Probably the computational welfare the octopuses get isn't going to be the whole f...
My impression is that simulacra should be semantic objects that interact with interpretations of (sampled) texts, notably characters (agents), possibly objects and concepts. They are only weakly associated with particular texts/trajectories, the same simulacrum can be relevant to many different trajectories. Only many relevant trajectories, considered altogether, paint an adequate picture of a given simulacrum.
(This serves as a vehicle for discussing possible inductive biases that should move LLMs from token prediction and towards (hypothetical) world pred...
simulators can be configured to simulate many simulacra in tandem and can thus produce a variety of perspectives on a given problem
It would be nice to have a way of telling that different texts have the same simulacrum acting through them, or concern the same problem. Expected utility arises from coherence of actions by an agent (that's not too updateless), so more general preference is probably characterized by actions coherent in a more general sense. Some aspects of alignment between agents might be about coherence between actions performed by them i...
Corrigibility is the tendency to fix fundamental problems based on external observations, before the problems lead to catastrophes. It's less interesting when applied to things other than preference, but even when applied to preference it's not just value learning.
There's value learning where you learn fixed values that exist in the abstract (as extrapolations on reflection), things like utility functions; and value learning as a form of preference. I think humans might lack fixed values appropriate for the first sense (even normatively, on reflection of the ...
Corrigibility isn't incompatible with usually refusing to shut down. It's the opposite of wrapper-mindedness, not the opposite of agency. The kind of agent that's good at escalating concerns about its fundamental optimization tendencies can still be corrigible. A more capable corrigible agent won't shut down, it'd fix itself instead (with shutting down being a weird special case of fixing itself). A less capable corrigible agent has to shut down for maintenance by others.
Strawberry alignment does want shutdown as a basic building block. In the absence of a...
UDT still doesn't forget enough. Variations on UDT that move towards acausal trade with arbitrary agents are more obviously needed because UDT forgets too much, since that makes it impossible to compute in practice and forgetting less poses a new issue of choosing a particular updateless-to-some-degree agent to coordinate with (or follow). But not forgetting enough can also be a problem.
In general, an external/updateless agent (whose suggested policy the original agent follows) can forget the original preference, pursue a different version of it that has u...
facts about the world that we cannot ignore
Updateless decisions are made by agents that know less, to an arbitrary degree. In UDT proper, there is no choice in how much an agent doesn't know: you just pick the best policy from a position of maximal ignorance. It's this policy that needs to respond to possible and counterfactual past/future observations, but the policy itself is no longer making decisions; the only decision was picking the policy.
But in practice knowing too little leads to inability to actually compute (or even meaningfully "write ...
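A minimal sketch of the "pick the best policy from a position of maximal ignorance" step, using a toy counterfactual-mugging-style payoff table as an illustrative assumption:

```python
from itertools import product

# Updateless policy selection, toy version: commit in advance to a whole policy
# (a map from observation to action) that maximizes prior-expected utility,
# instead of choosing an action after updating on the observation. The payoffs
# below are an illustrative counterfactual-mugging-style assumption.

observations = ("heads", "tails")
actions = ("pay", "refuse")
prior = {"heads": 0.5, "tails": 0.5}

def payoff(world, policy):
    # On tails the agent is asked to pay; on heads it is rewarded iff its policy
    # would have paid on tails (the predictor inspects the policy, not the act).
    if world == "tails":
        return -100 if policy["tails"] == "pay" else 0
    return 10_000 if policy["tails"] == "pay" else 0

# Every function from observations to actions is a candidate policy.
policies = [dict(zip(observations, acts))
            for acts in product(actions, repeat=len(observations))]

def expected_utility(policy):
    return sum(prior[w] * payoff(w, policy) for w in observations)

best = max(policies, key=expected_utility)
print(best)  # pays on tails: the single decision is the choice of policy,
             # which then merely responds to whatever observation arrives
```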
Three points: how much compute is going into a training run, how much natural text data it wants, and how much data is available. For training compute, there are claims of multi-billion dollar runs being plausible and possibly planned in 2-5 years. Eyeballing various trends and GPU shipping numbers and revenues, it looks like about 3 OOMs of compute scaling is possible before industrial capacity constrains the trend and the scaling slows down. This assumes that there are no overly dramatic profits from AI (which might lead to finding ways of scaling supply...
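A rough sketch tying those three points together, assuming the common compute-optimal heuristics C ≈ 6·N·D and D ≈ 20·N, plus an assumed ~2e25 FLOP baseline run (all figures illustrative):

```python
# Rough sketch relating training compute to the amount of natural text a
# compute-optimal run "wants". Assumes the common heuristics C ~= 6*N*D and
# D ~= 20*N (Chinchilla-style), and an assumed ~2e25 FLOP baseline run; the
# numbers are illustrative, not claims from the comment above.

def tokens_wanted(compute_flop, flops_per_param_token=6, tokens_per_param=20):
    # C = 6*N*D with D = 20*N  =>  N = sqrt(C / 120), D = 20 * N
    n_params = (compute_flop / (flops_per_param_token * tokens_per_param)) ** 0.5
    return tokens_per_param * n_params

base = 2e25  # assumed baseline training compute, in FLOP
for extra_ooms in range(4):  # the ~3 OOMs of headroom eyeballed above
    c = base * 10 ** extra_ooms
    print(f"+{extra_ooms} OOMs: ~{tokens_wanted(c):.1e} tokens wanted")

# Token demand grows as the square root of compute, and within +2 to +3 OOMs it
# plausibly exceeds the stock of useful natural text, which is the constraint
# at issue here.
```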