"Why Not Just..."
Basic Foundations for Agent Models

Wiki Contributions


Yes! There's two ways that can be relevant. First, a ton of bits presumably come from unsupervised learning of the general structure of the world. That part also carries over to natural abstractions/minimal latents: the big pile of random variables from which we're extracting a minimal latent is meant to represent things like all those images the toddler sees over the course of their early life.

Second, sparsity: most of the images/subimages which hit my eyes do not contain apples. Indeed, most images/subimages which hit my eyes do not contain instances of most abstract object types. That fact could either be hard-coded in the toddler's prior, or learned insofar as it's already learning all these natural latents in an unsupervised way and can notice the sparsity. So, when a parent says "apple" while there's an apple in front of the toddler, sparsity dramatically narrows down the space of things they might be referring to.

Point is that the "Structural(Inner) prediction method" doesn't seem particularly likely to generalize across things-which-look-like-big-neural-nets. It more plausibly generalizes across things-which-learn-to-perform-diverse-tasks-in-diverse-environments, but I don't think neural net aspect is carrying very much weight there.

This is some evidence that it'll work for AGIs too; after all, both humans and AGIs are massive neural nets that learn to perform diverse tasks in diverse environments.

Highly debatable whether "massive neural nets that learn to perform diverse tasks in diverse environments" is a natural category. "Massive neural net" is not a natural category - e.g. transformers vs convnets vs boltzmann machines are radically different things, to the point where understanding one tells us very little about the others. The embedding of interpretable features of one does not carry over to the others. Analytical approximations for one do not carry over to the others.

The "learn to perform diverse tasks in diverse environments" part more plausibly makes it a natural category, insofar as we buy selection theorems/conjectures.

Great post! I think the things said in the post are generally correct - in particular, I agree with the overall point that objective-centric arguments (e.g. power-seeking) are plausible, and therefore support a high enough probability of doom to justify alignment work, but aren't sufficiently probable to justify a very high probability of doom.

That said, I do think a very high probability of doom can be justified. The arguments have to route primarily through failure of the iterative design loop for AI alignment in particular, rather than primarily through arguments about goal-directedness. The high-level argument is something like: "There are going to be some very powerful things reshaping the entire world, and iterative design failure means that by-default we will have very little de-facto ability to steer them. Those two conditions make doom an extremely strong default outcome.".

Priors against Scenario 2. Another possibility is that given only the information in Scenario 1, people had strong priors against the story in Scenario 2, such that they could say “99% likely that it is outer misalignment” for Scenario 1, which gets rounded to “outer misalignment”, while still saying “inner misalignment” for Scenario 2.

I would guess this is not what’s going on. Given the information in Scenario 1, I’d expect most people would find Scenario 2 reasonably likely (i.e. they don’t have priors against it).

FWIW, this was basically my thinking on the two scenarios. Not 99% likelihood, but scenario 1 does strike me as ambiguous but much more likely to be an outer misalignment problem (in the root cause sense).

Roughly, yeah. I currently view the types of  and  as the "low-level" type signature of abstraction, in some sense to be determined. I expect there are higher-level organizing principles to be found, and those will involve refinement of the types and/or different representations.

The main problem I see with hodge-podge-style strategies is that most alignment ideas fail in roughly-the-same cases, for roughly-the-same reasons. It's the same hard cases/hard subproblems which kill most plans. In particular, section B.2 (and to a lesser extent B.1 - B.3) of List of Lethalities covers "core problems" which strategies usually fail to handle.

In terms of methodology, epistemology, etc, what did you do right/wrong? What advice would you today give to someone who produced something like your old goal-deconfusion work, or what did your previous self really need to hear?

I want to see Adam do a retrospective on his old goal-deconfusion stuff.

Load More