All of Thane Ruthenis's Comments + Replies

As a proponent:

My model says that general intelligence[1] is just inextricable from "true-goal-ness". It's not that I think homunculi will coincidentally appear as some side-effect of capability advancement — it's that the capabilities the AI Labs want necessarily route through somehow incentivizing NNs to form homunculi. The homunculi will appear inasmuch as the labs are good at their jobs.

Said model is based on analyses of how humans think and how human cognition differs from animal/LLM cognition, plus reasoning about how a general-intelligence algo... (read more)

I actually think this type of change is very common—because individuals' identities are very strongly interwoven with the identities of the groups they belong to

Mm, I'll concede that point. I shouldn't have used people as an example; people are messy.

Literal gears, then. Suppose you're studying some massive mechanism. You find gears in it, and derive the laws by which each individual gear moves. Then you grasp some higher-level dynamics, and suddenly understand what function a given gear fulfills in the grand scheme of things. But your low-level model of a... (read more)

But I think the mistake you're making is to assume that the lower levels are preserved after finding higher-level abstractions. Instead, higher-level abstractions reframe the way we think about lower-level abstractions, which can potentially change them dramatically

Mm, I think there's two things being conflated there: ontological crises (even small-scale ones, like the concept of fitness not being outright destroyed but just re-shaped), and the simple process of translating your preference around the world-model without changing that world-model.

It's not a... (read more)

2Richard Ngo1mo
I actually think this type of change is very common—because individuals' identities are very strongly interwoven with the identities of the groups they belong to. You grow up as a kid and even if you nominally belong to a given (class/political/religious) group, you don't really understand it very well. But then over time you construct your identity as X type of person, and that heavily informs your friendships—they're far less likely to last when they have to bridge very different political/religious/class identities. E.g. how many college students with strong political beliefs would say that it hasn't impacted the way they feel about friends with opposing political beliefs? This is a straightforwardly incorrect model of deontologists; the whole point of deontology is rejecting the utility-maximization framework. Instead, deontologists have a bunch of rules and heuristics (like "don't kill"). But those rules and heuristics are underdefined in the sense that they often endorse different lines of reasoning which give different answers. For example, they'll say pulling the lever in a trolley problem is right, but pushing someone onto the tracks is wrong, but also there's no moral difference between doing something via a lever or via your own hands. I guess technically you could say that the procedure for resolving this is "do a bunch of moral philosophy" but that's basically equivalent to "do a bunch of systematization". Yeah, I totally agree with this. The question is then: why don't translated human goals remain instrumental? It seems like your answer is basically just that it's a design flaw in the human brain, of allowing value drift; the same type of thing which could in principle happen in an agent with a perfect world-model. And I agree that this is probably part of the effect. But it seems to me that, given that humans don't have perfect world-models, the explanation I've given (that systematization makes our values better-defined) is more likely to be the

I'd previously sketched out a model basically identical to this one, see here and especially here.

... but I've since updated away from it, in favour of an even simpler explanation.

The major issue with this model is the assumption that either (1) the SGD/evolution/whatever-other-selection-pressure will always convergently instill the drive for doing value systematization into the mind it's shaping, or (2) that agents will somehow independently arrive at it on their own; and that this drive will have overwhelming power, enough to crush the object-level value... (read more)

Thanks for the comment! I agree that thinking of minds as hierarchically modeling the world is very closely related to value systematization.

But I think the mistake you're making is to assume that the lower levels are preserved after finding higher-level abstractions. Instead, higher-level abstractions reframe the way we think about lower-level abstractions, which can potentially change them dramatically. This is what happens with most scientific breakthroughs: we start with lower-level phenomena, but we don't understand them very well until we discover th... (read more)

Do you have any cached thoughts on the matter of "ontological inertia" of abstract objects? That is:

  • We usually think about abstract environments in terms of DAGs. In particular, ones without global time, and with no situations where we update-in-place a variable. A node in a DAG is a one-off.
  • However, that's not faithful to reality. In practice, objects have a continued existence, and a good abstract model should have a way to track e. g. the state of a particular human across "time"/the process of system evolution. But if "Alice" is a variable/node in our
... (read more)
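To make the tension concrete, here is a minimal sketch (all names and structure illustrative, not a proposed formalism) of the usual workaround: "Alice" stops being a single node and becomes a family of one-off, time-indexed nodes, with her persistence recovered as a pattern over those nodes rather than as an update-in-place of one variable.

```python
# A toy "unrolled" DAG: each node is a one-off variable, and object persistence
# is recovered by grouping time-indexed copies of the same object.
dag = {
    "Alice@0": [],                      # one-off node: Alice at t=0
    "Alice@1": ["Alice@0"],             # parents include her own previous state...
    "Alice@2": ["Alice@1", "Bob@1"],    # ...plus whatever else influenced her
    "Bob@1":   [],
}

def track(obj, dag):
    """Recover 'the same object across time' as the sorted time-indexed slice."""
    return sorted(node for node in dag if node.split("@")[0] == obj)

print(track("Alice", dag))   # ['Alice@0', 'Alice@1', 'Alice@2']
```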

A human is not well modelled as a wrapper mind; do you disagree?

Certainly agree. That said, I feel the need to lay out my broader model here. The way I see it, a "wrapper-mind" is a general-purpose problem-solving algorithm hooked up to a static value function. As such:

  • Are humans proper wrapper-minds? No, certainly not.
  • Do humans have the fundamental machinery to be wrapper-minds? Yes.
  • Is any individual run of a human general-purpose problem-solving algorithm essentially equivalent to wrapper-mind-style reasoning? Yes.
  • Can humans choose to act as wrapper-mind
... (read more)

It's not a binary. You can perform explicit optimization over high-level plan features, then hand off detailed execution to learned heuristics. "Make coffee" may be part of an optimized stratagem computed via consequentialism, but you don't have to consciously optimize every single muscle movement once you've decided on that goal.

Essentially, what counts as "outputs" or "direct actions" relative to the consequentialist-planner is flexible, and every sufficiently-reliable (chain of) learned heuristics can be put in that category, with choosing to execute on... (read more)
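A toy rendering of that division of labour, with all function names and utility numbers made up: explicit optimization happens only over high-level plan features, and each chosen step is then handed off to an opaque "learned" routine that the planner treats as a primitive action.

```python
# Learned, opaque execution heuristics: from the planner's point of view these
# are atomic actions, even though each would unfold into many low-level steps.
def make_coffee():   return "coffee made"
def write_report():  return "report written"
def take_a_walk():   return "walked"

SKILLS = {"make_coffee": make_coffee, "write_report": write_report,
          "take_a_walk": take_a_walk}

# Rough value estimates the explicit planner optimizes over (made-up numbers).
UTILITY = {"make_coffee": 3, "write_report": 5, "take_a_walk": 1}

def plan(horizon=2):
    """Consequentialist-style optimization, but only over high-level plan features."""
    ranked = sorted(SKILLS, key=lambda action: UTILITY[action], reverse=True)
    return ranked[:horizon]

for step in plan():
    print(step, "->", SKILLS[step]())   # detailed execution delegated to heuristics
```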

2Cinera Verinia6mo
Yeah, I agree with this. But I don't think the human system aggregates into any kind of coherent total optimiser. Humans don't have an objective function (not even approximately?). A human is not well modelled as a wrapper mind; do you disagree?

I'm still not quite sure why the lightcone theorem is a "foundation" for natural abstraction (it looks to me like a nice concrete example on which you could apply techniques) 

My impression is that it being a concrete example is the why. "What is the right framework to use?" and "what is the environment-structure in which natural abstractions can be defined?" are core questions of this research agenda, and this sort of multi-layer locality-including causal model is one potential answer.

The fact that it loops-in the speed of causal influence is also sug... (read more)

Sure, but isn't the goal of the whole agenda to show that Λ does have a certain correct factorization, i. e. that abstractions are convergent?

I suppose it may be that any choice of low-level boundaries results in the same Λ, but the Λ itself has a canonical factorization, and going from Λ back to X reveals the corresponding canonical factorization of X? And then depending on how close the initial choice of boundaries was to the "correct" one, Λ is easier or harder to compute (or there's somethin... (read more)

3johnswentworth6mo
Yes, there is a story for a canonical factorization of Λ, it's just separate from the story in this post.

Almost. The hope/expectation is that different choices yield approximately the same Λ, though still probably modulo some conditions (like e.g. sufficiently large T).

Can you elaborate on this expectation? Intuitively, Λ should consist of a number of higher-level variables as well, and each of them should correspond to a specific set of lower-level variables: abstractions and the elements they abstract over. So for a given Λ, there should be a specific "correct" way to draw the boundaries in the low-level system.

But if ~any way of dr... (read more)

3johnswentworth6mo
Λ is conceptually just the whole bag of abstractions (at a certain scale), unfactored.

Yup, that's basically it. And I agree that it's pretty obvious once you see it - the key is to notice that distance 2T implies that nothing other than X^0 could have affected both of them. But man, when I didn't know that was what I should look for? Much less obvious.

... I feel compelled to note that I'd pointed out a very similar thing a while ago.

Granted, that's not exactly the same formulation, and the devil's in the details.

By the way, do we need the proof of the theorem to be quite this involved? It seems we can just note that for any two (sets of) variables separated by distance 2T, the earliest sampling-step at which their values can intermingle (= their lightcones intersect) is T (since even in the "fastest" case, they can't do better than moving towards each other at 1 variable per 1 sampling-step).
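A toy numerical check of this argument, under assumptions of my own choosing (a 1D ring with synchronous "majority plus noise" updates standing in for the resampler): conditional on a fixed initial state X^0, cells more than 2T apart should show roughly zero mutual information after T steps, while nearby cells stay correlated, because influence propagates at most one cell per sampling-step.

```python
import numpy as np

rng = np.random.default_rng(0)
N, T, ROLLOUTS, FLIP = 41, 5, 20000, 0.1

def step(x):
    """Each cell copies the majority of its radius-1 neighbourhood, then flips
    independently with probability FLIP, so influence moves at most 1 cell/step."""
    majority = np.sign(np.roll(x, 1) + x + np.roll(x, -1))
    flips = rng.random(x.shape) < FLIP
    return np.where(flips, -majority, majority)

def empirical_mi(a, b):
    """Empirical mutual information (in nats) between two arrays of +/-1 samples."""
    mi = 0.0
    for va in (-1, 1):
        for vb in (-1, 1):
            p_ab = np.mean((a == va) & (b == vb))
            p_a, p_b = np.mean(a == va), np.mean(b == vb)
            if p_ab > 0:
                mi += p_ab * np.log(p_ab / (p_a * p_b))
    return mi

x0 = rng.choice([-1, 1], size=N)           # condition on one fixed initial state
samples = []
for _ in range(ROLLOUTS):
    x = x0.copy()
    for _ in range(T):
        x = step(x)
    samples.append(x)
samples = np.array(samples)

near, far = 1, 2 * T + 1                   # distances below vs. beyond 2T
print("MI at distance", near, ":", empirical_mi(samples[:, 0], samples[:, near]))
print("MI at distance", far, ":", empirical_mi(samples[:, 0], samples[:, far]))
```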

3johnswentworth6mo
Yeah, that probably works.

Hmm. I may be currently looking at it from the wrong angle, but I'm skeptical that it's the right frame for defining abstractions. It seems to group low-level variables based on raw distance, rather than the detailed environment structure? Which seems like a very weak constraint. That is,

By further iteration, we can conclude that any number of sets of variables which are all separated by a distance of at least 2T are independent given X^0. That’s the full Lightcone Theorem.

We can make literally any choice of those sets subject to this condition: we ca... (read more)

4johnswentworth6mo
Almost. The hope/expectation is that different choices yield approximately the same Λ, though still probably modulo some conditions (like e.g. sufficiently large T). System size, i.e. number of variables.

While it's true, there's something about making this argument that I don't like. It's like it's setting you up for moving goalposts if you succeed with it? It makes it sound like the core issue is people giving AIs power, with the solution to that issue — and, implicitly, to the whole AGI Ruin thing — being to ban that.

Which is not going to help, since the sort of AGI we're worried about isn't going to need people to naively hand it power. I suppose "not proactively handing power out" somewhat raises the bar for the level of superintelligence necessary, but ... (read more)

2Alex Turner6mo
I'm sympathetic. I think that I should have said "instrumental convergence seems like a moot point when deciding whether to be worried about AI disempowerment scenarios"; instrumental convergence isn't a moot point for alignment discussion and within lab strategy, of course. But I do consider the "give AIs power" scenario to be a substantial part of the risk we face, such that not doing that would be quite helpful. I think it's quite possible that GPT 6 isn't autonomously power-seeking, but I feel pretty confused about the issue.

it's not clear why to expect the new feedback loop to be much more powerful than the existing ones

Yeah, the argument here would rely on the assumption that e. g. the extant scientific data already uniquely constrains some novel laws of physics/engineering paradigms/psychological manipulation techniques/etc., and we would eventually be able to figure them out even if science froze right this moment. In this case, the new feedback loop would be faster because superintelligent cognition would be faster than real-life experiments.

And I think there's a decent a... (read more)

Interesting, thanks.

I don't expect a discontinuous jump at the point you hit the universality property

Agreed that this point (universality leads to discontinuity) probably needs to be hashed out more. Roughly, my view is that universality allows the system to become self-sustaining. Prior to universality, it can't autonomously adapt to novel environments (including abstract environments, e. g. new fields of science). Its heuristics have to be refined by some external ground-truth signals, like trial-and-error experimentation or model-based policy gradients... (read more)

3Rohin Shah7mo
I agree that there's a threshold for "can meaningfully build and chain novel abstractions" and this can lead to a positive feedback loop that was not previously present, but there will already be lots of positive feedback loops (such as "AI research -> better AI -> better assistance for human researchers -> AI research") and it's not clear why to expect the new feedback loop to be much more powerful than the existing ones. (Aside: we're now talking about a discontinuity in the gradient of capabilities rather than of capabilities themselves, but sufficiently large discontinuities in the gradient of capabilities have much of the same implications.)

I agree that those are useful pursuits.

I still disagree but it no longer seems internally inconsistent

Mind gesturing at your disagreements? Not necessarily to argue them, just interested in the viewpoint.

5Rohin Shah7mo
Oh, I disagree with your core thesis that the general intelligence property is binary. (Which then translates into disagreements throughout the rest of the post.) But experience has taught me that this disagreement tends to be pretty intractable to talk through, and so I now try just to understand the position I don't agree with, so that I can notice if its predictions start coming true. You mention universality, active adaptability and goal-directedness. I do think universality is binary, but I expect there are fairly continuous trends in some underlying latent variables (e.g. "complexity and generality of the learned heuristics"), and "becoming universal" occurs when these fairly continuous trends exceed some threshold. For similar reasons I think active adaptability and goal-directedness will likely increase continuously, rather than being binary. You might think that since I agree universality is binary that alone is enough to drive agreement with other points, but:

1. I don't expect a discontinuous jump at the point you hit the universality property (because of the continuous trends), and I think it's plausible that current LLMs already have the capabilities to be "universal". I'm sure this depends on how you operationalize universality, I haven't thought about it carefully.
2. I don't think that the problems significantly change character after you pass the universality threshold, and so I think you are able to iterate prior to passing it.

Discontinuity ending (without stalling):

Stalling:

Ah, makes sense.

Are you imagining systems that are built differently from today?

I do expect that some sort of ability to reprogram itself at inference time will be ~necessary for AGI, yes. But I also had in mind something like your "SGD creates a set of weights that effectively treats the input English tokens as a programming language" example. In the unlikely case that modern transformers are AGI-complete, I'd expect something on that order of exoticism to be necessary (but it's not my baseline prediction).... (read more)

4Rohin Shah7mo
Okay, this mostly makes sense now. (I still disagree but it no longer seems internally inconsistent.) Fwiw, I feel like if I had your model, I'd be interested in:

1. Producing tests for general intelligence. It really feels like there should be something to do here, that at least gives you significant Bayesian evidence. For example, filter the training data to remove anything talking about <some scientific field, e.g. complexity theory>, then see whether the resulting AI system can invent that field from scratch if you point it at the problems that motivated the development of the field.
2. Identifying "dangerous" changes to architectures, e.g. inference time reprogramming. Maybe we can simply avoid these architectures and stick with things that are more like LLMs.
3. Hardening the world against mildly-superintelligent AI systems, so that you can study them / iterate on them more safely. (Incidentally, I don't buy the argument that mildly-superintelligent AI systems could clearly defeat us all. It's not at all clear to me that once you have a mildly-superintelligent AI system you'll have a billion mildly-superintelligent-AI-years worth of compute to run them.)

I'm not claiming the AGI would stall at human level, I'm claiming that on your model, the discontinuity should have some decent likelihood of ending at or before human level.

Hm? "Stall at the human level" and "the discontinuity ends at or before the human level" reads like the same thing to me. What difference do you see between the two?

It sounds like your answer is that the development of AGI could lead to something below-human-level, that wouldn't be able to get itself more compute / privileges, but we will not realize that it's AGI, so we'll give it mor

... (read more)

Hm? "Stall at the human level" and "the discontinuity ends at or before the human level" reads like the same thing to me. What difference do you see between the two?

Discontinuity ending (without stalling):

Stalling:

Basically, except instead of directly giving it privileges/compute, I meant that we'd keep training it until the SGD gives the GI component more compute and privileges over the rest of the model (e. g., a better ability to rewrite its instincts).

Are you imagining systems that are built differently from today? Because I'm not seeing how SGD could ... (read more)

Do any humans have the general-intelligence property?

Yes, ~all of them. Humans are not superintelligent because despite their minds embedding the algorithm for general intelligence, that algorithm is still resource-constrained (by the brain's compute) and privilege-constrained within the mind (e. g., it doesn't have full write-access to our instincts). There's no reason to expect that AGI would naturally "stall" at the exact same level of performance and restrictions. On the contrary: even if we resolve to check for "AGI-ness" often, with the intent of sto... (read more)

4Rohin Shah7mo
Sorry, I seem to have missed the problems mentioned in that section on my first read. I'm not claiming the AGI would stall at human level, I'm claiming that on your model, the discontinuity should have some decent likelihood of ending at or before human level. (I care about this because I think it cuts against this point: We only have one shot. There will be a sharp discontinuity in capabilities once we get to AGI, and attempts to iterate on alignment will fail.  Either we get AGI right on the first try, or we die. In particular it seems like if the discontinuity ends before human level then you can iterate on alignment.) Why isn't this also true of the weak AGI? Current models cannot autonomously get more compute (humans have to give it to them) or perform gradient descent on their own weights (unless the humans specifically try to make that happen); most humans placed in the models' position would not be able to do that either. It sounds like your answer is that the development of AGI could lead to something below-human-level, that wouldn't be able to get itself more compute / privileges, but we will not realize that it's AGI, so we'll give it more compute / privileges until it gets to "so superintelligent we can't do anything about it". Is that correct? ... Huh. How do you know that humans are generally intelligent? Are you relying on introspection on your own cognitive process, and extrapolating that to other humans? What if our policy is to scale up resources / privileges available to almost-human-level AI very slowly? Presumably after getting to a somewhat-below-human-level AGI, with a small amount of additional resources it would get to a mildly-superhuman-level AI, and we could distinguish it then? Or maybe you're relying on an assumption that the AGI immediately becomes deceptive and successfully hides the fact that it's an AGI?

What would you expect to observe, if a binary/sharp threshold of generality did not exist?

Great question!

I would expect to observe much greater diversity in the cognitive capabilities of animals, for humans to generalize more poorly, and for the world overall to be more incomprehensible to us.

E. g., there'd be things like, we'd see octopi frequently executing some sequences of actions that lead to beneficial outcomes for them, and we would be fundamentally unable to understand what is happening.  As it is, sure, some animals have specialized cognitive algorith... (read more)

deliberately filtering out simulation hypotheses seems quite difficult, because it's unclear how to specify it

Aha, that's the difficulty I was overlooking. Specifically, I didn't consider that the approach under consideration here requires us to formally define how we're filtering them out.

Thanks!

The problem is that the AI doesn't a priori know the correct utility function, and whatever process it uses to discover that function is going to be attacked by Mu

I don't understand the issue here. Mu can only interfere with the simulated AI's process of utility-function discovery. If the AI follows the policy of "behave as if I'm outside the simulation", AIs simulated by Mu will, sure, recover tampered utility functions. But AIs instantiated in the non-simulated universe, who deliberately avoid thinking about Mu/who discount simulation hypotheses, should ... (read more)

4Vanessa Kosoy9mo
The problem is that any useful prior must be based on Occam's razor, and Occam's razor + first-person POV creates the same problems as with the universal prior. And deliberately filtering out simulation hypotheses seems quite difficult, because it's unclear how to specify it. See also this.

Disclaimer: Haven't actually tried this myself yet, naked theorizing.

“We made a wrapper for an LLM so you can use it to babble random ideas!” 

I'd like to offer a steelman of that idea. Humans have negative creativity — it takes conscious effort to come up with novel spins on what you're currently thinking about. An LLM babbling about something vaguely related to your thought process can serve as a source of high-quality noise, noise that is both sufficiently random to spark novel thought processes and relevant enough to prompt novel thoughts on the ac... (read more)

Me: *looks at some examples* “These operationalizations are totally ad-hoc. Whoever put together the fine-tuning dataset didn’t have any idea what a robust operationalization looks like, did they?”

... So maybe we should fund an effort to fine-tune some AI model on a carefully curated dataset of good operationalizations? Not convinced building it would require alignment research expertise specifically, just "good at understanding the philosophy of math" might suffice.

Finding the right operationalization is only partly intuition, partly it's just knowin... (read more)

Inner alignment for simulators

Broadly agreed. I'd written a similar analysis of the issue before, where I also take into account path dynamics (i. e., how and why we actually get to Azazel from a random initialization). But that post is a bit outdated.

My current best argument for it goes as follows:

  • The central issue, the reason why "naive" approaches for just training an ML model to make good predictions will likely result in a mesa-optimizer, is that all such setups are "outer-misaligned" by default. They don't optimize AIs towards being good world-models,
... (read more)

Goals are functions over the concepts in one's internal ontology, yes. But having a concept for something doesn't mean caring about it — your knowing what a "paperclip" is doesn't make you a paperclip-maximizer.

The idea here isn't to train an AI with the goals we want from scratch, it's to train an advanced world-model that would instrumentally represent the concepts we care about, interpret that world-model, then use it as a foundation to train/build a different agent that would care about these concepts.

Now this is admittedly very different from the thesis that value is complex and fragile.

I disagree. The fact that some concept is very complicated doesn't mean it won't be represented in any advanced AGI's ontology. Humans' psychology, or the specific tools necessary to build nanomachines, or the agent foundation theory necessary to design aligned successor agents, are all also "complex and fragile" concepts (in the sense that getting a small detail wrong would result in a grand failure of prediction/planning), but we can expect such concepts t... (read more)

Two agents with the same ontology and very different purposes would behave in very different ways.

I don't understand this objection. I'm not making any claim isomorphic to "two agents with the same ontology would have the same goals". It sounds like maybe you think I'm arguing that if we can make the AI's world-model human-like, it would necessarily also be aligned? That's not my point at all.

The motivation is outlined at the start of 1A: I'm saying that if we can learn how to interpret arbitrary advanced world-models, we'd be able to more precisely "aim" ... (read more)

1G Gordon Worley III10mo
Isn't a special case of aiming at any target we want the goals we would want it to have? And whatever goals we'd want it to have would be informed by our ontology? So what I'm saying is I think there's a case where the generality of your claim breaks down.

I agree that the AI would only learn the abstraction layers it'd have a use for. But I wouldn't take it as far as you do. I agree that with "human values" specifically, the problem may be just that muddled, but not with the other nice targets — moral philosophy, corrigibility, DWIM should all be more concrete.

The alternative would be a straight-up failure of the NAH, I think; your assertion that "abstractions can be on a continuum" seems directly at odds with it. Which isn't impossible, but this post is premised on the NAH working.

the opaque test is something like an obfuscated physics simulation

I think it'd need to be something weirder than just a physics simulation, to reach the necessary level of obfuscation. Like an interwoven array of highly-specialized heuristics and physical models which blend together in a truly incomprehensible way, and which itself can't tell whether there's etheric interference involved or not. The way Fermat's test can't tell a Carmichael number from a prime — it just doesn't interact with the input number in a way that'd reveal the difference between th... (read more)

2Paul Christiano1y
I definitely agree that this sounds like a really bizarre sort of model and it seems like we should be able to rule it out one way or another. If we can't then it suggests a different source of misalignment from the kind of thing I normally worry about.

Lazy World Models

It seems like "generators" should just be simple functions over natural abstractions? But I see two different ways to go with this, inspired either by the minimal latents approach, or by the redundant-information one.

First, suppose I want to figure out a high-level model of some city, say Berlin. I already have a "city" abstraction, let's call it , which summarizes my general knowledge about cities in terms of a probability distribution over possible structures. I also know a bunch of facts about Berlin specifically, let's call th... (read more)
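A minimal Bayesian reading of the first (minimal-latents-flavoured) option, with all notation mine: treat the general "city" abstraction as a prior over possible structures, and fold in the Berlin-specific facts by conditioning.

```latex
P\big[S_{\text{Berlin}} \mid \text{facts}_{\text{Berlin}}\big]
\;\propto\;
P\big[\text{facts}_{\text{Berlin}} \mid S\big] \cdot P\big[S \mid \text{city}\big]
```

Here S ranges over possible city structures, the second factor is the generic "city" prior, and the first factor scores how well a candidate structure accounts for the Berlin-specific facts.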

So the concern is that "the AI generates a random number, sees that it passes the Fermat test, and outputs it" is the same as "the AI generates a random action, sees that it passes [some completely opaque test that approves any action that either includes no tampering OR includes etheric interference], and outputs it", right?

Yeah, in that case, the only viable way to handle this is to get something into the system that can distinguish between no tampering and etheric interference. Just like the only way to train an AI to distinguish primes from Carmichael ... (read more)
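For readers without the number-theory analogy cached, a minimal sketch: the Fermat test is the "opaque check", and a Carmichael number such as 561 passes it for every coprime base despite being composite, i.e. it looks good under the only test available.

```python
from math import gcd

def fermat_test(n, bases):
    """n 'looks prime' if a^(n-1) = 1 (mod n) for every tested base a coprime to n."""
    return all(pow(a, n - 1, n) == 1 for a in bases if gcd(a, n) == 1)

n = 561   # 3 * 11 * 17, the smallest Carmichael number
print("passes Fermat test:", fermat_test(n, range(2, n)))                     # True
print("actually prime:", all(n % d != 0 for d in range(2, int(n**0.5) + 1)))  # False
```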

2Paul Christiano1y
Mostly--the opaque test is something like an obfuscated physics simulation, and so it tells you if things look good. So you try a bunch of random actions until you get one where things look good. But if you can't understand the simulation, or the mechanics of the sensor tampering, then there's not much to do after that so it seems like we're in trouble. It seems like there are plenty of hopes:
  • I have a single lame example of the phenomenon, which feels extremely unlike sensor tampering. We could get more examples of hard-to-distinguish mechanisms, and understand whether sensor tampering could work this way or if there is some property that means it's OK.
  • In fact it is possible to distinguish Carmichael numbers and primes, and moreover we have a clean dataset that we can use to specify which one we want. So in this case there is enough internal structure to do the distinguishing, and the problem is just that the distinguisher is too complex and it's not clear how to find it automatically. We could work on that.
  • We could try to argue that without a model of sensor tampering an AI has limited ability to make it happen, so we can just harden sensors enough that it can't happen by chance and then declare victory.
More generally, I'm not happy to give up because "in this situation there's nothing we can do," I want to understand whether the bad situation is plausible, and if it is plausible then how you can measure whether it is happening, and how to formalize the kind of assumptions that we'd need to make the problem soluble.

What are your current thoughts on the exact type signature of abstractions? In the Telephone Theorem post, they're described as distributions over the local deterministic constraints. The current post also mentions that the "core" part of an abstraction is the distribution P[X|Λ], and its ability to explain variance in individual instances of X.

Applying the deterministic-constraint framework to trees, I assume it says something like "given certain ground-truth conditions (e. g., the environment of a savannah + the genetic code of a given tree), th... (read more)

3johnswentworth1y
Roughly, yeah. I currently view the types of P[Λ] and P[X|Λ] as the "low-level" type signature of abstraction, in some sense to be determined. I expect there are higher-level organizing principles to be found, and those will involve refinement of the types and/or different representations.

I think there's a sense in which the Fermat test is a capability problem, not an interpretability/alignment problem.

It's basically isomorphic to a situation in which sensor tampering is done via a method that never shows up in the AI's training data. E. g., suppose it's done via "etheric interference", which we don't know about, and which never fails and therefore never leads to any discrepancies in the data so the AI can't learn it via SSL either, etc. Then the AI just... can't learn about it, period. It's not that it can, in theory, pick up on it, but in... (read more)

2Paul Christiano1y
The thing I'm concerned about is: the AI can predict that Carmichael numbers look prime (indeed it simply runs the Fermat test on each number). So it can generate lots of random candidate actions (or search through actions) until it finds one that looks prime. Similarly, your AI can consider lots of actions until it finds one that it predicts will look great, then execute that one. So you get sensor tampering. I'm not worried about cases like the etheric interference, because the AI won't select actions that exploit etheric interference (since it can't predict that a given action will lead to sensor tampering via etheric interference). I'm only worried about cases where the prediction of successful sensor tampering comes from the same laws / reasoning strategies that the AI learned to make predictions on the training distribution (either it's seen etheric interference, or it e.g. learned a model of physics that correctly predicts the possibility of etheric interference).

Can you posit a training environment which matches what you're thinking about, relative to a given network architecture [e.g. LSTM]?

Sure, gimme a bit.

Why not just not internally represent the reward function, but still contextually generate "win this game of Go" or "talk like a 4chan user"?

What mechanism does this contextual generation? How does this mechanism behave in off-distribution environments; what goals does it generate in them?

I think it's fine to say "here's one effect [diversity and empirical loss minimization] which pushes towards reward wr

... (read more)

Alright, seems we're converging on something.

But I see no reason to think that those kinds of updates will be made accessible enough to shape the heuristic-generating machinery so that it always or approximately always generates heuristics optimized for achieving R (as opposed to generating heuristics optimized for achieving whatever-the-agent-wants-to-achieve).

How would this machinery appear, then? I don't see how it'd show up without being built into the agent by the optimization algorithm, and the optimization algorithm will only build it if it serves t... (read more)

3Alex Turner1y
We're apparently anchoring our expectations on "pointed at R", and then apparently allowing some "deviation." The anchoring seems inappropriate to me.  The network can learn to make decisions via a "IF circle-detector fires, THEN upweight logits on move-right" subshard. The network can then come to make decisions on the basis of round things, in a way which accords with the policy gradients generated by the policy-gradient-intensity function. All without the network making decisions on the basis of the policy-gradient-intensity function. And this isn't well-described as "imperfectly pointed at the policy-gradient-intensity function." 

Why should we treat that as the relevant idealization?

Yeah, okay, maybe that wasn't the right frame to use. Allow me to pivot:

Consider a training environment that's complex/diverse enough to make it impossible to fit a suite of heuristics meeting all its needs into an agent's (very bounded) memory. The agent would need to derive new heuristics on the fly, at runtime, in order to deal with basically-OOD situations it frequently encounters, and to be able to move freely in the environment, instead of being confined to some subset of that environment.

In other... (read more)

4Alex Turner1y
I bid for us to discuss a concrete example. Can you posit a training environment which matches what you're thinking about, relative to a given network architecture [e.g. LSTM]?

What is "achieving R" buying us? The agent internally represents a reward function, and then consults what the reward is in this scenario, and then generates heuristics to achieve that reward. Why not just not internally represent the reward function, but still contextually generate "win this game of Go" or "talk like a 4chan user"? That seems strictly more space-efficacious, and also doesn't involve being an R-wrapper. EDIT: The network might already have R in its WM, depending on the point in training. I also don't think "this weight setting saves space" is a slam dunk, but just wanted to point out the consideration.

I don't know what to make of this. It seems to me like you're saying "in a perfect-exploration limit only wrapper minds for the reward function are fixed under updating." It seems like you're saying this is relevant to SGD. But then it seems like you make the opposite claim of "inner alignment still hard." I think it's fine to say "here's one effect [diversity and empirical loss minimization] which pushes towards reward wrapper minds, but I don't think it's the only effect, I just think we should be aware of it." Is this a good summary of your position? I also feel unsure whether you're arguing primarily for a wrapper mind, or for reward-optimizers, or for both?

Consider a training environment that's complex/diverse enough to make it impossible to fit a suite of heuristics meeting all its needs into an agent's (very bounded) memory. The agent would need to derive new heuristics on the fly, at runtime, in order to deal with basically-OOD situations it frequently encounters, and to be able to move freely in the environment, instead of being confined to some subset of that environment.

In other words, the agent would need to be autonomous.

Agreed. Generally, whenever I talk about the agent being smart/competent, I am a... (read more)

As an example, you could have a wrapper-mind that cares about some correlate of R but not R itself. If it is smart, such an agent can navigate the selection process just as well as an R-pursuer

... By figuring out what R is and deciding to act as an R-pursuing wrapper-mind, therefore essentially becoming an R-pursuing wrapper-mind. With the only differences being that it 1) self-modified into one at runtime, instead of being like this from the start, and 2) it'd decide to "stop pretending" in some hypothetical set of situations/OOD, but ... (read more)

3Charles Foster1y
It is not essentially a R-pursuing wrapper-mind. It is essentially an X-pursuing wrapper-mind that will only instrumentally pretend to care about R to the degree it needs to, and that will try with all its might to get what it actually wants, R be damned. As you note in 2, the agent's behavioral alignment to R is entirely superficial, and thus entirely deceptive/unreliable, even if we had somehow managed to craft the "perfect" R. Part of what might've confused me reading the title and body of this post is that, as I understand the term, "wrapper-mind" was and is primarily about structure, about how the agent makes decisions. Why am I so focused on motivational structure, even beyond that, rather than focused on observed behavior during training? Because motivational structure is what determines how an agent's behavior generalizes, whereas OOD generalization is left underspecified if we only condition on an agent's observed in-distribution behavior. (There are many different profiles of OOD behavior compatible with the same observed ID behavior, so we need some additional rationale on top—like structure or inductive biases—to conclude the agent will generalize in some particular way.) In the above quote it sounds like your response is "just make everything in-distribution", right? My reply to that would be that (1) this is just refusing to confront the central difficulty of generalization rather than addressing it, (2) this seems impractical/impossible because OOD is a practically unbounded space whereas at any given point in training you've only given the agent feedback on a comparatively tiny region of it, and (3) even to make only the situations you encounter in practice be in-distribution, you [the training process designer] must know what sorts of OOD contexts the AI will push the training process into, which means it's your cleverness pitted against the AI's, which is a situation you never want to be in if you can at all help it (see: cognitive uncontainabili

But existence of such populations and weight settings doesn't imply net local pressures or gradients in those directions.

How so? This seems like the core disagreement. Above, I think you're agreeing that under a wide enough distribution on scenarios, the only zero-gradient agent-designs are those that optimize for R directly. Yet that somehow doesn't imply that training an agent in a sufficiently diverse environment would shape it into an R-optimizer?

Are you just saying that there aren't any gradients from initialization to an R-optimiz... (read more)

Thanks for the extensive commentary! Here's an... unreasonably extensive response.

what it means to "excavate" the procedural and implicit knowledge

On Procedural Knowledge

1) Suppose that you have a shard that looks for a set of conditions like "it's night AND I'm resting in an unfamiliar location in a forest AND there was a series of crunching sounds nearby". If they're satisfied, it raises an alarm, and forms and bids for plans to look in the direction of the noises and get ready for a fight.

That's procedural knowledge: none of that is happening at the level o... (read more)
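A toy rendering of the shard described above, purely illustrative: procedurally it is just a trigger condition wired to a plan bid, and nowhere does it store a declarative sentence like "crunching sounds at night mean danger" that the agent could simply read off.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Shard:
    """A context-activated heuristic: a trigger predicate plus a bid for plans."""
    name: str
    condition: Callable[[Dict], bool]
    bid: Callable[[Dict], List[str]]

night_alert = Shard(
    name="night-alert",
    condition=lambda ctx: ctx.get("night") and ctx.get("unfamiliar_location")
                          and ctx.get("heard_crunching"),
    bid=lambda ctx: ["raise_alarm", "look_toward_noise", "prepare_to_fight"],
)

context = {"night": True, "unfamiliar_location": True, "heard_crunching": True}
if night_alert.condition(context):
    print(night_alert.bid(context))   # the shard fires and bids for its plans
```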

This touches on some issues I'd wanted to discuss: abstraction hierarchies, and incompatible abstraction layers.

So, here’s a new conditional independence condition for “large” systems, i.e. systems with an infinite number of X_i’s: given Λ, any finite subset of the X_i’s must be approximately independent (i.e. mutual information below some small ε) of all but a finite number of the other X_i’s

Suppose we have a number of tree-instances X_1, …, X_n. Given a sufficiently large n, we can compute a valid "general tree abstractio... (read more)
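For concreteness, one way to write the condition quoted above formally (notation mine, with I the index set of the X_i's and ε the fixed small tolerance):

```latex
\forall \text{ finite } S \subseteq I \;\; \exists \text{ finite } F \supseteq S :
\quad I\big(X_S \,;\, X_{I \setminus F} \,\big|\, \Lambda\big) \;<\; \varepsilon
```

That is, conditioning on Λ screens any finite patch of the system off from all but finitely many of the remaining variables.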

Mm, I believe that it's not central because my initial conception of the GPS didn't include it at all, and everything still worked. I don't think it serves the same role here as you're critiquing in the posts you've linked; I think it's inserted at a different abstraction level.

But sure, I'll wait for you to finish with the post.

What does it mean to ask "how hard should I optimize"?

Satisficing threshold, probability of the plan's success, the plan's robustness to unexpected perturbations, etc. I suppose the argmin is somewhat misleading: the GPS doesn't output the best possible plan for achieving some goal in the world outside the agent, it's solving the problem in the most efficient way possible, which often means not spending too much time and resources on it. I. e., "mental resources spent" is part of the problem specification, and it's something it tries to minimize too.

I don'... (read more)
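A toy sketch of what "mental resources spent being part of the problem specification" could look like, with all names, numbers, and the stopping rule being my own illustrative choices: the search stops as soon as a plan clears a satisficing threshold or the compute budget runs out, rather than exhaustively optimizing over every plan.

```python
import itertools

def plan_search(candidate_plans, value, cost, threshold, budget):
    """Bounded, satisficing search: compute spent is itself treated as a cost."""
    best = None
    for steps_spent, plan in enumerate(candidate_plans):
        if steps_spent >= budget:            # mental resources are part of the spec
            return best
        score = value(plan) - cost(plan)     # value net of execution cost
        if best is None or score > best[1]:
            best = (plan, score)
        if score >= threshold:               # good enough: stop optimizing
            return best
    return best

# Illustrative usage with stand-in plans and made-up scoring functions.
plans = (f"plan-{i}" for i in itertools.count())
print(plan_search(plans, value=len, cost=lambda p: 1.0, threshold=6.0, budget=100))
```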

2Alex Turner1y
I'm going to read the rest of the essay, and also I realize you posted this before my four posts on "holy cow argmax can blow all your alignment reasoning out of reality all the way to candyland." But I want to note that including an argmin in the posited motivational architecture makes me extremely nervous / distrusting. Even if this modeling assumption doesn't end up being central to your arguments on how shard-agents become wrapper-like, I think this assumption should still be flagged extremely heavily. 

I don't think the GPS "searches over all relevant plans". As per John's post:

Consider, for example, a human planning a trip to the grocery store. Typical reasoning (mostly at the subconscious level) might involve steps like:

  • There’s a dozen different stores in different places, so I can probably find one nearby wherever I happen to be; I don’t need to worry about picking a location early in the planning process.
  • My calendar is tight, so I need to pick an open time. That restricts my options a lot, so I should worry about that early in the planning process.
... (read more)
2Alex Turner1y
OK, but you are positing that there's an argmin, no? That's a big part of what I'm objecting to. I anticipate that insofar as you're claiming grader-optimization problems come back, they come back because there's an AFAICT inappropriate argmin which got tossed into the analysis via the GPS.  Sure, sounds reasonable. Noting that I still feel confused after hearing this explanation. What does it mean to ask "how hard should I optimize"?  Really? I think that people usually don't do that in life-or-death scenarios. People panic all the time. 

Agreed. It's the same principle by which people are advised to engage in plan-making even if any specific plan they will invent will break on contact with reality; the same principle that underlies "do the math, then burn the math and go with your gut".

While any specific model is likely to be wrong, trying to derive a consistent model gives you valuable insights into what a consistent model would look like at all, and builds model-building skills. What specific externally-visible features of the system do you need to explain? How much complexity is required to ... (read more)

Any updates to your model of the socioeconomic path to aligned AI deployment? Namely:

  • Any changes to your median timeline until AGI, i. e., do we actually have these 9-14 years?
  • Still on the "figure out agency and train up an aligned AGI unilaterally" path?
  • Has the FTX fiasco impacted your expectation of us-in-the-future having enough money=compute to do the latter?

I expect there to be no major updates, but seems worthwhile to keep an eye on this.

So my new main position is: which potential alignment targets (human values, corrigibility, Do What I Mean, human

... (read more)
3johnswentworth1y
Basically no. I basically buy your argument, though there's still the question of how safe a target DWIM is.

Still on the "figure out agency and train up an aligned AGI unilaterally" path?

"Train up an AGI unilaterally" doesn't quite carve my plans at the joints.

One of the most common ways I see people fail to have any effect at all is to think in terms of "we". They come up with plans which "we" could follow, for some "we" which is not in fact going to follow that plan. And then they take political-flavored actions which symbolically promote the plan, but are not in fact going to result in "we" implementing the plan. (And also, usually, the "we" in question is to... (read more)

Any changes to your median timeline until AGI, i. e., do we actually have these 9-14 years?

Here's a dump of my current timeline models. (I actually originally drafted this as part of the post, then cut it.)

My current intuition is that deep learning is approximately one transformer-level paradigm shift away from human-level AGI. (And, obviously, once we have human-level AGI things foom relatively quickly.) That comes from an intuitive extrapolation: if something were about as much better as the models of the last 2-3 years, as the models of the last 2-3 yea... (read more)

I no longer believe this claim quite as strongly as implied: see here and here. The shard theory has presented a very compelling alternate case of human value formation, and it suggests that even the ultimate compilation of two different modern people's values would likely yield different unitary utility functions.

I still think there's a sense in which stone-age!humans and modern humans, if tasked with giving an AI an utility function that'd make all humans happy, would arrive at the same result (if given thousands of years to think). But it might be the s... (read more)

1Seb Farquhar1y
Thanks, that makes sense. I think part of my skepticism about the original claim comes from the fact that I'm not sure that any amount of time for people living in some specific stone-age grouping would come up with the concept of 'sapient' without other parts of their environment changing to enable other concepts to get constructed. There might be a similar point translated into something shard theoryish that's like 'The available shards are very context dependent, so persistent human values across very different contexts is implausible.' SLT in particular probably involves some pretty different contexts.

Values steer optimization; they are not optimized against

I strongly disagree with the implication here. This statement is true for some agents, absolutely. It's not true universally.

It's a good description of how an average human behaves most of the time, yes. We're often puppeted by our shards like this, and some people spend the majority of their lives this way. I fully agree that this is a good description of most of human cognition, as well.

But it's not the only way humans can act, and it's not when we're at our most strategically powerful.

Consider if ... (read more)

3Alex Turner1y
As usual, you've left a very insightful comment. Strong-up, tentative weak disagree, but haven't read your linked post yet. Hope to get to that soon. 

I'd rather sacrifice even more corrigibility properties (like how this already isn't too worried about subagent stability) for better friendliness

Do you have anything specific in mind?

1Charlie Steiner1y
One thing might be that I'd rather have an AI design that's more naturally self-reflective, i.e. using its whole model to reason about itself, rather than having pieces that we've manually retargeted to think about some other pieces. This reduces how much Cartesian doubt is happening on the object level all at the same time, which sorta takes the AI farther away from the spec. But this maybe isn't that great an example, because maybe it's more about not endorsing the "retargeting the search" agenda.