Thane Ruthenis



deliberately filtering out simulation hypotheses seems quite difficult, because it's unclear how to specify it

Aha, that's the difficulty I was overlooking. Specifically, I didn't consider that the approach under consideration here requires us to formally define how we're filtering them out.


The problem is that the AI doesn't a priori know the correct utility function, and whatever process it uses to discover that function is going to be attacked by Mu

I don't understand the issue here. Mu can only interfere with the simulated AI's process of utility-function discovery. If the AI follows the policy of "behave as if I'm outside the simulation", AIs simulated by Mu will, sure, recover tampered utility functions. But AIs instantiated in the non-simulated universe, who deliberately avoid thinking about Mu/who discount simulation hypotheses, should just safely recover the untampered utility function. Mu can't acausally influence you unless you deliberately open a channel to it.

I think I'm missing some part of the picture here. Is it assumed that any process of utility-function discovery has to somehow route through (something like) the unfiltered universal prior? Or that uncertainty with regards to one's utility function means you can't rule out the simulation hypothesis out of the gate, because it might be that what you genuinely care about is the simulators?

Disclaimer: Haven't actually tried this myself yet, naked theorizing.

“We made a wrapper for an LLM so you can use it to babble random ideas!” 

I'd like to offer a steelman of that idea. Humans have negative creativity — it takes conscious effort to come up with novel spins on what you're currently thinking about. An LLM babbling about something vaguely related to your thought process can serve as a source of high-quality noise, noise that is both sufficiently random to spark novel thought processes and relevant enough to prompt novel thoughts on the actual topic you're thinking about (instead of sending you off in a completely random direction). Tools like Loom seem optimized for that.

It's nothing a rubber duck or a human conversation partner can't offer, qualitatively, but it's more stimulating than the former, and is better than the latter in that it doesn't take up another human's time and is always available to babble about what you want.

Not that it'd be a massive boost to productivity, but it might lower the friction costs of engaging in brainstorming and make it less effortful.

... Or it might degrade your ability to think about the subject matter mechanistically and optimize your ideas in the direction of what sounds like it makes sense semantically. Depends on how seriously you'd be taking the babble, perhaps.

Me: *looks at some examples* “These operationalizations are totally ad-hoc. Whoever put together the fine-tuning dataset didn’t have any idea what a robust operationalization looks like, did they?”

... So maybe we should fund an effort to fine-tune some AI model on a carefully curated dataset of good operationalizations? Not convinced building it would require alignment research expertise specifically, just "good at understanding the philosophy of math" might suffice.

Finding the right operationalization is only partly intuition; partly it's just knowing what sorts of math tools are available, that is, what exists in concept-space and has already been discovered. That part basically requires having a fairly legible high-level mental map of the entire space of mathematics, and building it is very effortful, takes many years, and has very little return on learning any specific piece of math.

At least, it's definitely something I'm bottlenecked on, and IIRC even the Infra-Bayesianism people ended up deriving from scratch a bunch of math that later turned out to be already known as part of imprecise probability theory. So it may be valuable to get some sort of "intelligent applied-math wiki" that babbles possible operationalizations at you/points you towards math-fields that may have the tools for modeling what you're trying to model.

That said, I broadly agree that the whole "accelerate alignment research via AI tools" approach doesn't seem very promising, in either the Cyborgism or the Conditioning Generative Models directions. Not that I see any fundamental reason why pre-AGI AI tools can't be somehow massively helpful for research; on the contrary, it feels like there ought to be some way to loop them in. But it sure seems trickier than it looks at first or second glance.

Inner alignment for simulators

Broadly agreed. I'd written a similar analysis of the issue before, where I also take into account path dynamics (i. e., how and why we actually get to Azazel from a random initialization). But that post is a bit outdated.

My current best argument for it goes as follows:

  • The central issue, the reason why "naive" approaches for just training an ML model to make good predictions will likely result in a mesa-optimizer, is that all such setups are "outer-misaligned" by default. They don't optimize AIs towards being good world-models, they optimize them for making specific good predictions and then channeling these predictions through a low-bandwidth communication channel. (Answering a question, predicting the second part of a video/text, etc.)
  • That is, they don't just simulate a world: they simulate a world, then locate some specific data they need to extract from the simulation, and translate them into a format understandable to humans.
  • As simulation complexity grows, it seems likely that these last steps would require powerful general intelligence/GPS as well. And at that point, it's entirely unclear what mesa-objectives/values/shards it would develop. (Seems like it almost fully depends on the structure of the goal-space. And imagine if e. g. "translate the output into humanese" and "convince humans of the output" are very nearby, and then the model starts superintelligently optimizing for the latter?)
  • In addition, we can't just train simulators "not to be optimizers", in terms of locating optimization processes/agents within them and penalizing such structures. It's plausible that advanced world-modeling is impossible without general-purpose search, and it would certainly be necessary inasmuch as the world-model would need to model humans.

Goals are functions over the concepts in one's internal ontology, yes. But having a concept for something doesn't mean caring about it — your knowing what a "paperclip" is doesn't make you a paperclip-maximizer.

The idea here isn't to train an AI with the goals we want from scratch, it's to train an advanced world-model that would instrumentally represent the concepts we care about, interpret that world-model, then use it as a foundation to train/build a different agent that would care about these concepts.

Now this is admittedly very different from the thesis that value is complex and fragile.

I disagree. The fact that some concept is very complicated doesn't mean it won't necessarily be represented in any advanced AGI's ontology. Humans' psychology, or the specific tools necessary to build nanomachines, or the agent foundation theory necessary to design aligned successor agents, are all also "complex and fragile" concepts (in the sense that getting a small detail wrong would result in a grand failure of prediction/planning), but we can expect such concepts to be convergently learned.

Not that I necessarily expect "human values" specifically to actually be a natural abstraction; an indirect pointer at "moral philosophy"/DWIM/corrigibility seems much more plausible and much less complex.

Two agents with the same ontology and very different purposes would behave in very different ways.

I don't understand this objection. I'm not making any claim isomorphic to "two agents with the same ontology would have the same goals". It sounds like maybe you think I'm arguing that if we can make the AI's world-model human-like, it would necessarily also be aligned? That's not my point at all.

The motivation is outlined at the start of 1A: I'm saying that if we can learn how to interpret arbitrary advanced world-models, we'd be able to more precisely "aim" our AGI at any target we want, or even manually engineer some structures over its cognition that would ensure the AGI's aligned/corrigible behavior.

I agree that the AI would only learn the abstraction layers it'd have a use for. But I wouldn't take it as far as you do. I agree that with "human values" specifically, the problem may be just that muddled, but not with the other nice targets (moral philosophy, corrigibility, DWIM); those should be more concrete.

The alternative would be a straight-up failure of the NAH, I think; your assertion that "abstractions can be on a continuum" seems directly at odds with it. Which isn't impossible, but this post is premised on the NAH working.

the opaque test is something like an obfuscated physics simulation

I think it'd need to be something weirder than just a physics simulation, to reach the necessary level of obfuscation. Like an interwoven array of highly-specialized heuristics and physical models which blend together in a truly incomprehensible way, and which itself can't tell whether there's etheric interference involved or not. The way Fermat's test can't tell a Carmichael number from a prime — it just doesn't interact with the input number in a way that'd reveal the difference between their internal structures.
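To make the Fermat analogy concrete, here's a minimal sketch, using 561 (the smallest Carmichael number) as the example. The helper names `fermat_passes` and `is_prime` are my own, purely for illustration. One caveat the analogy glosses over: the test is only blind for bases coprime to the input; a base that shares a factor with it does expose it as composite.

```python
from math import gcd


def fermat_passes(n: int, base: int) -> bool:
    """True if n passes the Fermat test for this base: base^(n-1) == 1 (mod n)."""
    return pow(base, n - 1, n) == 1


def is_prime(n: int) -> bool:
    """Ground truth via trial division, for comparison only."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True


CARMICHAEL = 561  # 3 * 11 * 17, composite

# Every base coprime to 561 gives the same verdict it would give for a prime.
coprime_bases = [a for a in range(2, CARMICHAEL) if gcd(a, CARMICHAEL) == 1]
fooled_on_all_coprime_bases = all(fermat_passes(CARMICHAEL, a) for a in coprime_bases)

# An ordinary composite such as 15 is caught immediately with base 2.
ordinary_composite_caught = not fermat_passes(15, 2)

# A base sharing a factor with 561 (here 3) does reveal that it is composite.
shared_factor_base_reveals = not fermat_passes(CARMICHAEL, 3)
```

The point of the comparison: the check never touches the factorization of its input, so for those bases there's no internal state in which "prime" and "Carmichael" look different.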

By analogy, we'd need some "simulation" which doesn't interact with the sensory input in a way that can reveal a structural difference between the presence of a specific type of tampering and the absence of any tampering at all (while still detecting many other types of tampering). Otherwise, we'd have to be able to detect undesirable behavior, with sufficiently advanced interpretability tools. Inasmuch as physical simulations spin out causal models of events, they wouldn't fit the bill.

It's a really weird image, and it seems like it ought to be impossible for any complex real-life scenarios. Maybe it's provably impossible, i. e. we can mathematically prove that any model of the world with the necessary capabilities would have distinguishable states for "no interference" and "yes interference".

Models of world-models is a research direction I'm currently very interested in, so hopefully we can just rule that scenario out, eventually.

It seems like there are plenty of hopes

Oh, I agree. I'm just saying that there don't seem to be any other approaches aside from "figure out whether this sort of worst case is even possible, and under what circumstances" and "figure out how to distinguish bad states from good states at the object-level, for whatever concrete task you're training the AI".
