I actually think this type of change is very common—because individuals' identities are very strongly interwoven with the identities of the groups they belong to
Mm, I'll concede that point. I shouldn't have used people as an example; people are messy.
Literal gears, then. Suppose you're studying some massive mechanism. You find gears in it, and derive the laws by which each individual gear moves. Then you grasp some higher-level dynamics, and suddenly understand what function a given gear fulfills in the grand scheme of things. But your low-level model of a...
But I think the mistake you're making is to assume that the lower levels are preserved after finding higher-level abstractions. Instead, higher-level abstractions reframe the way we think about lower-level abstractions, which can potentially change them dramatically
Mm, I think there are two things being conflated there: ontological crises (even small-scale ones, like the concept of fitness not being outright destroyed but just re-shaped), and the simple process of translating your preferences around the world-model without changing that world-model.
It's not a...
I'd previously sketched out a model basically identical to this one, see here and especially here.
... but I've since updated away from it, in favour of an even simpler explanation.
The major issue with this model is the assumption that either (1) the SGD/evolution/whatever-other-selection-pressure will always convergently instill the drive for value systematization into the mind it's shaping, or (2) agents will somehow independently arrive at it on their own; and that this drive will have overwhelming power, enough to crush the object-level value...
Thanks for the comment! I agree that thinking of minds as hierarchically modeling the world is very closely related to value systematization.
But I think the mistake you're making is to assume that the lower levels are preserved after finding higher-level abstractions. Instead, higher-level abstractions reframe the way we think about lower-level abstractions, which can potentially change them dramatically. This is what happens with most scientific breakthroughs: we start with lower-level phenomena, but we don't understand them very well until we discover th...
Do you have any cached thoughts on the matter of "ontological inertia" of abstract objects? That is:
A human is not well modelled as a wrapper mind; do you disagree?
Certainly agree. That said, I feel the need to lay out my broader model here. The way I see it, a "wrapper-mind" is a general-purpose problem-solving algorithm hooked up to a static value function. As such:
It's not a binary. You can perform explicit optimization over high-level plan features, then hand off detailed execution to learned heuristics. "Make coffee" may be part of an optimized stratagem computed via consequentialism, but you don't have to consciously optimize every single muscle movement once you've decided on that goal.
Essentially, what counts as "outputs" or "direct actions" relative to the consequentialist-planner is flexible, and any sufficiently reliable learned heuristic (or chain of heuristics) can be put in that category, with choosing to execute on...
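To illustrate the division of labor I have in mind, here's a minimal toy sketch (all names and structures below are made up for illustration): a consequentialist planner searches only over sequences of high-level actions, and each action's execution is handed off to a cached learned heuristic rather than being re-optimized.

```python
# Toy sketch of "optimize over high-level plan features, hand off execution
# to learned heuristics". All names and structures are illustrative only.
from itertools import permutations

# Cached, reliable routines the planner treats as atomic "direct actions".
HEURISTICS = {
    "make_coffee":  lambda s: {**s, "energy": s["energy"] + 1},
    "take_break":   lambda s: {**s, "energy": s["energy"] + 2},
    "write_report": lambda s: {**s, "work_done": s["work_done"] + (2 if s["energy"] > 0 else 1)},
}

def utility(state):
    return state["work_done"] + 0.1 * state["energy"]

def plan(initial_state, horizon=3):
    """Explicit consequentialist search, but only over orderings of
    high-level actions; their execution details are never optimized here."""
    best_seq, best_u = None, float("-inf")
    for seq in permutations(HEURISTICS, horizon):
        state = dict(initial_state)
        for action in seq:
            state = HEURISTICS[action](state)  # execution delegated to the heuristic
        if utility(state) > best_u:
            best_seq, best_u = seq, utility(state)
    return best_seq

print(plan({"energy": 0, "work_done": 0}))  # picks an ordering that gets energy before the report
```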
I'm still not quite sure why the lightcone theorem is a "foundation" for natural abstraction (it looks to me like a nice concrete example on which you could apply techniques)
My impression is that its being a concrete example is exactly the point. "What is the right framework to use?" and "what is the environment-structure in which natural abstractions can be defined?" are core questions of this research agenda, and this sort of multi-layer locality-including causal model is one potential answer.
The fact that it loops in the speed of causal influence is also sug...
Sure, but isn't the goal of the whole agenda to show that does have a certain correct factorization, i. e. that abstractions are convergent?
I suppose it may be that any choice of low-level boundaries results in the same , but the itself has a canonical factorization, and going from back to reveals the corresponding canonical factorization of ? And then depending on how close the initial choice of boundaries was to the "correct" one, is easier or harder to compute (or there's somethin...
Almost. The hope/expectation is that different choices yield approximately the same , though still probably modulo some conditions (like e.g. sufficiently large ).
Can you elaborate on this expectation? Intuitively, should consist of a number of higher-level variables as well, and each of them should correspond to a specific set of lower-level variables: abstractions and the elements they abstract over. So for a given , there should be a specific "correct" way to draw the boundaries in the low-level system.
But if ~any way of dr...
Yup, that's basically it. And I agree that it's pretty obvious once you see it - the key is to notice that distance 2T implies that nothing other than X_0 could have affected both of them. But man, when I didn't know that was what I should look for? Much less obvious.
... I feel compelled to note that I'd pointed out a very similar thing a while ago.
Granted, that's not exactly the same formulation, and the devil's in the details.
By the way, do we need the proof of the theorem to be quite this involved? It seems we can just note that for any two (sets of) variables separated by distance 2T, the earliest sampling-step at which their values can intermingle (= their lightcones intersect) is T (since even in the "fastest" case, they can't do better than moving towards each other at 1 variable per 1 sampling-step).
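Spelled out, in my own notation (which needn't match the post's): write d(A, B) for the graph distance between two variable sets, and L_t(A) for the set of variables A can have influenced after t resampling steps.

```latex
% Counting argument, notation mine: influence spreads at most one variable
% per sampling step, so after t steps
\[
L_t(A) \subseteq \{x : d(x, A) \le t\}, \qquad
L_t(B) \subseteq \{x : d(x, B) \le t\},
\]
\[
L_t(A) \cap L_t(B) \neq \emptyset \;\Longrightarrow\; 2t \ge d(A, B),
\]
% so for d(A, B) = 2T, the lightcones can first intersect only at step t = T.
```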
Hmm. I may be currently looking at it from the wrong angle, but I'm skeptical that it's the right frame for defining abstractions. It seems to group low-level variables based on raw distance, rather than the detailed environment structure? Which seems like a very weak constraint. That is,
By further iteration, we can conclude that any number of sets of variables which are all separated by a distance of 2T are independent given X_0. That’s the full Lightcone Theorem.
We can make literally any choice of those sets subject to this condition: we ca...
While it's true, there's something about making this argument that I don't like. It's like it's setting you up for moving goalposts if you succeed with it? It makes it sound like the core issue is people giving AIs power, with the solution to that issue — and, implicitly, to the whole AGI Ruin thing — being to ban that.
Which is not going to help, since the sort of AGI we're worried about isn't going to need people to naively hand it power. I suppose "not proactively handing power out" somewhat raises the bar for the level of superintelligence necessary, but ...
it's not clear why to expect the new feedback loop to be much more powerful than the existing ones
Yeah, the argument here would rely on the assumption that e. g. the extant scientific data already uniquely constrains some novel laws of physics/engineering paradigms/psychological manipulation techniques/etc., and we would eventually be able to figure them out even if science froze right this moment. In this case, the new feedback loop would be faster because superintelligent cognition would be faster than real-life experiments.
And I think there's a decent a...
Interesting, thanks.
I don't expect a discontinuous jump at the point you hit the universality property
Agreed that this point (universality leads to discontinuity) probably needs to be hashed out more. Roughly, my view is that universality allows the system to become self-sustaining. Prior to universality, it can't autonomously adapt to novel environments (including abstract environments, e. g. new fields of science). Its heuristics have to be refined by some external ground-truth signals, like trial-and-error experimentation or model-based policy gradients...
I agree that those are useful pursuits.
I still disagree but it no longer seems internally inconsistent
Mind gesturing at your disagreements? Not necessarily to argue them, just interested in the viewpoint.
Discontinuity ending (without stalling):
Stalling:
Ah, makes sense.
Are you imagining systems that are built differently from today?
I do expect that some sort of ability to reprogram itself at inference time will be ~necessary for AGI, yes. But I also had in mind something like your "SGD creates a set of weights that effectively treats the input English tokens as a programming language" example. In the unlikely case that modern transformers are AGI-complete, I'd expect something on that order of exoticism to be necessary (but it's not my baseline prediction)....
I'm not claiming the AGI would stall at human level, I'm claiming that on your model, the discontinuity should have some decent likelihood of ending at or before human level.
Hm? "Stall at the human level" and "the discontinuity ends at or before the human level" reads like the same thing to me. What difference do you see between the two?
...It sounds like your answer is that the development of AGI could lead to something below-human-level, that wouldn't be able to get itself more compute / privileges, but we will not realize that it's AGI, so we'll give it mor
Hm? "Stall at the human level" and "the discontinuity ends at or before the human level" reads like the same thing to me. What difference do you see between the two?
Discontinuity ending (without stalling):
Stalling:
Basically, except instead of directly giving it privileges/compute, I meant that we'd keep training it until the SGD gives the GI component more compute and privileges over the rest of the model (e. g., a better ability to rewrite its instincts).
Are you imagining systems that are built differently from today? Because I'm not seeing how SGD could ...
Do any humans have the general-intelligence property?
Yes, ~all of them. Humans are not superintelligent because despite their minds embedding the algorithm for general intelligence, that algorithm is still resource-constrained (by the brain's compute) and privilege-constrained within the mind (e. g., it doesn't have full write-access to our instincts). There's no reason to expect that AGI would naturally "stall" at the exact same level of performance and restrictions. On the contrary: even if we resolve to check for "AGI-ness" often, with the intent of sto...
What would you expect to observe, if a binary/sharp threshold of generality did not exist?
Great question!
I would expect to observe much greater diversity in the cognitive capabilities of animals, for humans to generalize more poorly, and for the world overall to be more incomprehensible to us.
E. g., we'd see octopi frequently executing some sequences of actions that lead to beneficial outcomes for them, and we would be fundamentally unable to understand what is happening. As it is, sure, some animals have specialized cognitive algorith...
deliberately filtering out simulation hypotheses seems quite difficult, because it's unclear how to specify it
Aha, that's the difficulty I was overlooking. Specifically, I didn't consider that the approach under consideration here requires us to formally define how we're filtering them out.
Thanks!
The problem is that the AI doesn't a priori know the correct utility function, and whatever process it uses to discover that function is going to be attacked by Mu
I don't understand the issue here. Mu can only interfere with the simulated AI's process of utility-function discovery. If the AI follows the policy of "behave as if I'm outside the simulation", AIs simulated by Mu will, sure, recover tampered utility functions. But AIs instantiated in the non-simulated universe, who deliberately avoid thinking about Mu/who discount simulation hypotheses, should ...
Disclaimer: Haven't actually tried this myself yet, naked theorizing.
“We made a wrapper for an LLM so you can use it to babble random ideas!”
I'd like to offer a steelman of that idea. Humans have negative creativity — it takes conscious effort to come up with novel spins on what you're currently thinking about. An LLM babbling about something vaguely related to your thought process can serve as a source of high-quality noise, noise that is both sufficiently random to spark novel thought processes and relevant enough to prompt novel thoughts on the ac...
Me: *looks at some examples* “These operationalizations are totally ad-hoc. Whoever put together the fine-tuning dataset didn’t have any idea what a robust operationalization looks like, did they?”
... So maybe we should fund an effort to fine-tune some AI model on a carefully curated dataset of good operationalizations? I'm not convinced building it would require alignment research expertise specifically; just "good at understanding the philosophy of math" might suffice.
Finding the right operationalization is only partly intuition, partly it's just knowin...
Inner alignment for simulators
Broadly agreed. I'd written a similar analysis of the issue before, where I also take into account path dynamics (i. e., how and why we actually get to Azazel from a random initialization). But that post is a bit outdated.
My current best argument for it goes as follows:
...
- The central issue, the reason why "naive" approaches of just training an ML model to make good predictions will likely result in a mesa-optimizer, is that all such setups are "outer-misaligned" by default. They don't optimize AIs towards being good world-models,
Goals are functions over the concepts in one's internal ontology, yes. But having a concept for something doesn't mean caring about it — your knowing what a "paperclip" is doesn't make you a paperclip-maximizer.
The idea here isn't to train an AI with the goals we want from scratch, it's to train an advanced world-model that would instrumentally represent the concepts we care about, interpret that world-model, then use it as a foundation to train/build a different agent that would care about these concepts.
Now this is admittedly very different from the thesis that value is complex and fragile.
I disagree. The fact that some concept is very complicated doesn't mean it won't be represented in any advanced AGI's ontology. Humans' psychology, or the specific tools necessary to build nanomachines, or the agent foundation theory necessary to design aligned successor agents, are all also "complex and fragile" concepts (in the sense that getting a small detail wrong would result in a grand failure of prediction/planning), but we can expect such concepts t...
Two agents with the same ontology and very different purposes would behave in very different ways.
I don't understand this objection. I'm not making any claim isomorphic to "two agents with the same ontology would have the same goals". It sounds like maybe you think I'm arguing that if we can make the AI's world-model human-like, it would necessarily also be aligned? That's not my point at all.
The motivation is outlined at the start of 1A: I'm saying that if we can learn how to interpret arbitrary advanced world-models, we'd be able to more precisely "aim" ...
I agree that the AI would only learn the abstraction layers it'd have a use for. But I wouldn't take it as far as you do. I agree that with "human values" specifically, the problem may be just that muddled, but not with the other nice targets: moral philosophy, corrigibility, and DWIM should be more concrete.
The alternative would be a straight-up failure of the NAH, I think; your assertion that "abstractions can be on a continuum" seems directly at odds with it. Which isn't impossible, but this post is premised on the NAH working.
the opaque test is something like an obfuscated physics simulation
I think it'd need to be something weirder than just a physics simulation, to reach the necessary level of obfuscation. Like an interwoven array of highly-specialized heuristics and physical models which blend together in a truly incomprehensible way, and which itself can't tell whether there's etheric interference involved or not. The way Fermat's test can't tell a Carmichael number from a prime — it just doesn't interact with the input number in a way that'd reveal the difference between th...
Lazy World Models
It seems like "generators" should just be simple functions over natural abstractions? But I see two different ways to go with this, inspired either by the minimal latents approach, or by the redundant-information one.
First, suppose I want to figure out a high-level model of some city, say Berlin. I already have a "city" abstraction, let's call it , which summarizes my general knowledge about cities in terms of a probability distribution over possible structures. I also know a bunch of facts about Berlin specifically, let's call th...
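As a cartoon of the kind of "generator" I'm gesturing at (the structure and field names below are invented for illustration, not part of anyone's formalism): a lazy function that starts from the general abstraction and overrides it with instance-specific facts only where those are known.

```python
# Illustrative-only sketch of a "generator": a lazy map from a general
# abstraction plus instance-specific facts to a model of the instance.

CITY_ABSTRACTION = {          # generic prior knowledge about cities
    "has_downtown": 0.95,     # rough P(feature) under the "city" abstraction
    "has_metro": 0.6,
    "population_order": 1e6,
}

BERLIN_FACTS = {              # instance-specific knowledge
    "has_metro": 1.0,
    "population_order": 3.6e6,
}

def generate_model(abstraction, facts):
    """Lazily build an instance model: fall back on the general abstraction
    wherever no instance-specific fact is known."""
    def query(feature):
        return facts.get(feature, abstraction.get(feature))
    return query

berlin = generate_model(CITY_ABSTRACTION, BERLIN_FACTS)
print(berlin("has_metro"))     # 1.0  (known fact about Berlin)
print(berlin("has_downtown"))  # 0.95 (inherited from the "city" abstraction)
```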
So the concern is that "the AI generates a random number, sees that it passes the Fermat test, and outputs it" is the same as "the AI generates a random action, sees that it passes [some completely opaque test that approves any action that either includes no tampering OR includes etheric interference], and outputs it", right?
Yeah, in that case, the only viable way to handle this is to get something into the system that can distinguish between no tampering and etheric interference. Just like the only way to train an AI to distinguish primes from Carmichael ...
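To make the analogy concrete in code: the Fermat test approves the Carmichael number 561 for every base coprime to it, so no amount of re-querying that test can separate it from a prime; you need a check with different structure (trial division below, standing in for whatever would actually "see" the tampering).

```python
# The Fermat test "approves" 561 (= 3 * 11 * 17, the smallest Carmichael number)
# for every coprime base, just as the hypothetical opaque test approves both
# no-tampering and etheric-interference actions: querying it harder never helps.
from math import gcd

def fermat_passes(n, base):
    return pow(base, n - 1, n) == 1

def is_prime(n):  # ground truth via trial division: the "extra structure" you need
    return n > 1 and all(n % d for d in range(2, int(n ** 0.5) + 1))

n = 561
coprime_bases = [a for a in range(2, n) if gcd(a, n) == 1]
print(all(fermat_passes(n, a) for a in coprime_bases))  # True: every coprime base passes
print(is_prime(n))                                      # False
```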
What are your current thoughts on the exact type signature of abstractions? In the Telephone Theorem post, they're described as distributions over the local deterministic constraints. The current post also mentions that the "core" part of an abstraction is the distribution , and its ability to explain variance in individual instances of .
Applying the deterministic-constraint framework to trees, I assume it says something like "given certain ground-truth conditions (e. g., the environment of a savannah + the genetic code of a given tree), th...
I think there's a sense in which the Fermat test is a capability problem, not an interpretability/alignment problem.
It's basically isomorphic to a situation in which sensor tampering is done via a method that never shows up in the AI's training data. E. g., suppose it's done via "etheric interference", which we don't know about, and which never fails and therefore never leads to any discrepancies in the data so the AI can't learn it via SSL either, etc. Then the AI just... can't learn about it, period. It's not that it can, in theory, pick up on it, but in...
Can you posit a training environment which matches what you're thinking about, relative to a given network architecture [e.g. LSTM]?
Sure, gimme a bit.
Why not just not internally represent the reward function, but still contextually generate "win this game of Go" or "talk like a 4chan user"?
What mechanism does this contextual generation? How does this mechanism behave in off-distribution environments; what goals does it generate in them?
...I think it's fine to say "here's one effect [diversity and empirical loss minimization] which pushes towards reward wr
Alright, seems we're converging on something.
But I see no reason to think that those kinds of updates will be made accessible enough to shape the heuristic-generating machinery so that it always or approximately always generates heuristics optimized for achieving R (as opposed to generating heuristics optimized for achieving whatever-the-agent-wants-to-achieve).
How would this machinery appear, then? I don't see how it'd show up without being built into the agent by the optimization algorithm, and the optimization algorithm will only build it if it serves t...
Why should we treat that as the relevant idealization?
Yeah, okay, maybe that wasn't the right frame to use. Allow me to pivot:
Consider a training environment that's complex/diverse enough to make it impossible to fit a suite of heuristics meeting all its needs into an agent's (very bounded) memory. The agent would need to derive new heuristics on the fly, at runtime, in order to deal with basically-OOD situations it frequently encounters, and to be able to move freely in the environment, instead of being confined to some subset of that environment.
In other...
Consider a training environment that's complex/diverse enough to make it impossible to fit a suite of heuristics meeting all its needs into an agent's (very bounded) memory. The agent would need to derive new heuristics on the fly, at runtime, in order to deal with basically-OOD situations it frequently encounters, and to be able to move freely in the environment, instead of being confined to some subset of that environment.
In other words, the agent would need to be autonomous.
Agreed. Generally, whenever I talk about the agent being smart/competent, I am a...
As an example, you could have a wrapper-mind that cares about some correlate of R but not R itself. If it is smart, such an agent can navigate the selection process just as well as an R-pursuer
... By figuring out what R is and deciding to act as an R-pursuing wrapper-mind, therefore essentially becoming an R-pursuing wrapper-mind. With the only differences being that it 1) self-modified into one at runtime, instead of being like this from the start, and 2) it'd decide to "stop pretending" in some hypothetical set of situations/OOD, but ...
But existence of such populations and weight settings doesn't imply net local pressures or gradients in those directions.
How so? This seems like the core disagreement. Above, I think you're agreeing that under a wide enough distribution on scenarios, the only zero-gradient agent-designs are those that optimize for R directly. Yet that somehow doesn't imply that training an agent in a sufficiently diverse environment would shape it into an R-optimizer?
Are you just saying that there aren't any gradients from initialization to an R-optimiz...
Thanks for the extensive commentary! Here's an... unreasonably extensive response.
what it means to "excavate" the procedural and implicit knowledge
1) Suppose that you have a shard that looks for a set of conditions like "it's night AND I'm resting in an unfamiliar location in a forest AND there was a series of crunching sounds nearby". If they're satisfied, it raises an alarm, and forms and bids for plans to look in the direction of the noises and get ready for a fight.
That's procedural knowledge: none of that is happening at the level o...
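As a cartoon of the "procedural" version (purely illustrative; no claim that shards are literally implemented like this): the knowledge lives entirely in a trigger-to-bid mapping, and no component of it stores a declarative statement like "crunching sounds at night may mean a predator".

```python
# Cartoon of a shard as procedural knowledge: a condition check wired directly
# to plan-bids, with no declarative statement of *why* stored anywhere.

def night_noise_shard(context):
    triggered = (
        context["time"] == "night"
        and context["location_familiarity"] == "unfamiliar"
        and context["terrain"] == "forest"
        and "crunching_sounds_nearby" in context["recent_percepts"]
    )
    if not triggered:
        return []
    # The shard doesn't "know" there might be a predator; it just bids for plans.
    return [
        {"action": "orient_toward_noise", "bid": 0.9},
        {"action": "prepare_to_fight", "bid": 0.7},
    ]

context = {
    "time": "night",
    "location_familiarity": "unfamiliar",
    "terrain": "forest",
    "recent_percepts": ["crunching_sounds_nearby", "wind"],
}
print(night_noise_shard(context))
```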
This touches on some issues I'd wanted to discuss: abstraction hierarchies, and incompatible abstraction layers.
So, here’s a new conditional independence condition for “large” systems, i.e. systems with an infinite number of X_i’s: given Λ, any finite subset of the X_i’s must be approximately independent (i.e. mutual information below some small ε) of all but a finite number of the other X_i’s
Suppose we have a number of tree-instances . Given a sufficiently large , we can compute a valid "general tree abstractio...
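(Writing the quoted condition out explicitly, as I read it, with X_i for the low-level variables and Λ for the latent:)

```latex
% The quoted condition, as I read it: for every finite index set S,
% all but finitely many remaining variables carry almost no extra
% information about X_S once the latent is known.
\[
\forall\, \text{finite } S:\quad
I\!\left(X_S ;\, X_j \mid \Lambda\right) < \epsilon
\quad \text{for all but finitely many } j \notin S.
\]
```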
Mm, I believe that it's not central because my initial conception of the GPS didn't include it at all, and everything still worked. I don't think it serves the same role here as you're critiquing in the posts you've linked; I think it's inserted at a different abstraction level.
But sure, I'll wait for you to finish with the post.
What does it mean to ask "how hard should I optimize"?
Satisficing threshold, probability of the plan's success, the plan's robustness to unexpected perturbations, etc. I suppose the argmin is somewhat misleading: the GPS doesn't output the best possible plan for achieving some goal in the world outside the agent, it's solving the problem in the most efficient way possible, which often means not spending too much time and resources on it. I. e., "mental resources spent" is part of the problem specification, and it's something it tries to minimize too.
I don'...
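One rough way to write down what I mean, purely my own formalization: the problem handed to the search already includes a cost on cognition, plus a satisficing threshold at which the search halts.

```latex
% My own rough formalization, not a quote from anyone's post:
\[
\pi^{*} \;=\; \arg\max_{\pi}\; \big[\, U(\pi) \;-\; \lambda\, C(\pi) \,\big],
\]
% where C(pi) is the cognitive cost of finding and executing the plan, and the
% search terminates early at the first candidate with U(pi) >= theta
% (the satisficing threshold).
```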
I don't think the GPS "searches over all relevant plans". As per John's post:
...Consider, for example, a human planning a trip to the grocery store. Typical reasoning (mostly at the subconscious level) might involve steps like:
- There’s a dozen different stores in different places, so I can probably find one nearby wherever I happen to be; I don’t need to worry about picking a location early in the planning process.
- My calendar is tight, so I need to pick an open time. That restricts my options a lot, so I should worry about that early in the planning process.
Agreed. It's the same principle by which people are advised to engage in plan-making even if any specific plan they invent will break on contact with reality; the same principle that underlies "do the math, then burn the math and go with your gut".
While any specific model is likely to be wrong, trying to derive a consistent model gives you valuable insight into what a consistent model would look like at all, and builds model-building skills. What specific externally-visible features of the system do you need to explain? How much complexity is required to ...
Any updates to your model of the socioeconomic path to aligned AI deployment? Namely:
I expect there to be no major updates, but seems worthwhile to keep an eye on this.
...So my new main position is: which potential alignment targets (human values, corrigibility, Do What I Mean, human
Still on the "figure out agency and train up an aligned AGI unilaterally" path?
"Train up an AGI unilaterally" doesn't quite carve my plans at the joints.
One of the most common ways I see people fail to have any effect at all is to think in terms of "we". They come up with plans which "we" could follow, for some "we" which is not in fact going to follow that plan. And then they take political-flavored actions which symbolically promote the plan, but are not in fact going to result in "we" implementing the plan. (And also, usually, the "we" in question is to...
Any changes to your median timeline until AGI, i. e., do we actually have these 9-14 years?
Here's a dump of my current timeline models. (I actually originally drafted this as part of the post, then cut it.)
My current intuition is that deep learning is approximately one transformer-level paradigm shift away from human-level AGI. (And, obviously, once we have human-level AGI things foom relatively quickly.) That comes from an intuitive extrapolation: if something were about as much better as the models of the last 2-3 years, as the models of the last 2-3 yea...
I no longer believe this claim quite as strongly as implied: see here and here. The shard theory has presented a very compelling alternative account of human value formation, and it suggests that even the ultimate compilation of two different modern people's values would likely yield different unitary utility functions.
I still think there's a sense in which stone-age!humans and modern humans, if tasked with giving an AI a utility function that'd make all humans happy, would arrive at the same result (if given thousands of years to think). But it might be the s...
Values steer optimization; they are not optimized against
I strongly disagree with the implication here. This statement is true for some agents, absolutely. It's not true universally.
It's a good description of how an average human behaves most of the time, yes. We're often puppeted by our shards like this, and some people spend the majority of their lives this way. I fully agree that this is a good description of most of human cognition, as well.
But it's not the only way humans can act, and it's not when we're at our most strategically powerful.
Consider if ...
I'd rather sacrifice even more corrigibility properties (like how this already isn't too worried about subagent stability) for better friendliness
Do you have anything specific in mind?
As a proponent:
My model says that general intelligence[1] is just inextricable from "true-goal-ness". It's not that I think homunculi will coincidentally appear as some side-effect of capability advancement — it's that the capabilities the AI Labs want necessarily route through somehow incentivizing NNs to form homunculi. The homunculi will appear inasmuch as the labs are good at their jobs.
Said model is based on analyses of how humans think and how human cognition differs from animal/LLM cognition, plus reasoning about how a general-intelligence algo...