G Gordon Worley III

Director of Research at PAISRI


Formal Alignment


Pitfalls of the agent model

Somewhat ironically, some of these failures from thinking of oneself or others as agents causes a lack of agency! Maybe this is just a trick of language, but here's what I have in mind from thinking about some of the pitfalls:

  • Self-hatred results in less agency (freedom to do what you want) rather than more because effort is placed on hating the self rather than trying to change the self to be more in the desired state.
  • Procrastination is basically the textbook example of a failure of agency.
  • Hatred of others is basically the same story here as self-hatred.

On the other hand, failing at Newcomb's problem probably feels like you've been tricked or duped out of your agency.

Anyway this has got me thinking about the relationship between agents and agency. Linguistically the idea here is that "agency" means to act "like an agent" which is to say someone or something that can choose what to do for themselves. I see a certain amount of connection here to, for example, developmental psychology, where greater levels of development come from less thinking of oneself as a subject (an agent) and more thinking of the things one does as object and so thinking of oneself less as subject/agent results in greater agency because the range of things one allows oneself to consider mutable is greater.

This also suggests an end state where there's no agent model at all, or at least not one that is held to tightly (it's one of many lenses through which to see the world, and the agent lens is now totally optional rather than partially optional).

Where are intentions to be found?

Oh, I don't think those things exactly sidestep the problem of the criterion so much as commit to a response to it without necessarily realizing that's what they're doing. All of them sort of punt on it by saying "let humans figure out that part", which at the end of the day is what any solution is going to do because we're the ones trying to build the AI and making the decisions, but we can be more or less deliberate about how we do this part.

Probability theory and logical induction as lenses

Right. For example, I think Stuart Armstrong is hitting something very important about AI alignment with his pursuit of the idea that there's no free lunch in value learning. We only close the gap by making an "arbitrary" assumption, but it's only arbitrary if you assume there's some kind of context-free version of the truth. Instead we can choose in a non-arbitrary way based on what we care about and is useful to us.

I realize lots of people are bored by this point because they're non-arbitrary solution that is useful is some version of rationality criteria since those are very useful for not getting Dutch booked, for example, but we could just as well choose something else and humans, for example, seem to do just that, even though so far we'd be hard pressed to very precisely say just what it is that humans do assume to ground things in, although we have some clues of things that seem important, like staying alive.

Where are intentions to be found?

Not really. If we were Cartesian, in order to fit the way we find the world, it seems to be that it'd have to be that agentiness is created outside the observable universe, possibly somewhere hypercomputation is possible, which might only admit an answer about how to build AI that would look roughly like "put a soul in it", i.e. link it up to this other place where agentiness is coming from. Although I guess if the world really looked like that maybe the way to do the "soul linkage" part would be visible, but it's not so seems unlikely.

Beware over-use of the agent model

I think this is right and underappreciated. However I struggle myself to make a clear case of what to do about it. There's something here, but I think it mostly shows up in not getting confused that the agent model just is how reality is, which underwhelms people who perhaps most fail to deeply grok what that means because they have a surface understanding of it.

Probability theory and logical induction as lenses

Well stated. For what it's worth I think this is a great explanation of why I'm always going on about the problem of the criterion: as embedded, finite agents without access to hypercomputation or perfect, a priori knowledge we're stuck in this mess of trying to figure things out from the inside and always getting it a little bit wrong, no matter how hard we try, so it's worth paying attention to that because solving, for example, alignment for idealized mathematical systems that don't exist is maybe interesting but also not an actual solution to the alignment problem.

Where are intentions to be found?

Largely agree. I think you're exploring what I'd call the deep implications of the fact that agents are embedded rather than Cartesian.

Testing The Natural Abstraction Hypothesis: Project Intro

Nice! From my perspective this would be pretty exciting because, if natural abstractions exist, it solves at least some of the inference problem I view at the root of solving alignment, i.e. how do you know that the AI really understands you/humans and isn't misunderstanding you/humans in some way that looks like it does understand from the outside but it doesn't. Although I phrased this in terms of reified experiences (noemata/qualia as a generalization of axia), abstractions are essentially the same thing in more familiar language, so I'm quite excited for the possibility that we can prove that we may be able to say something about the noemata/qualia/axia of minds other than our own beyond simply taking for granted that other minds share some commonality with ours (which works well for thinking about other humans up to a point, but quickly runs up against problems of assuming too much even before you start thinking about beings other than humans).

Solving the whole AGI control problem, version 0.0001

Regarding conservatism, there seems to be an open question of just how robust Goodhart effects are in that we all agree Goodhart is a problem but it's not clear how much of a problem it is and when. We have opinions ranging from mine, which is basically that Goodharting happens the moment you try to apply even the weakest optimization pressure and this will be a problem (or at least a problem in expectation; you might get lucky) for any system you need to never deviate, to what I read to be Paul's position: it's not that bad and we can do a lot to correct systems before Goodharting would be disastrous.

Maybe part of the problem is we're mixing up math and engineering problems and not making clear distinctions, but anyway I bring this up in the context of conservatism because it seems relevant that we also need to figure out how conservative, if at all, we need to be about optimization pressure, let alone how we would do it. I've not seen anything like a formal argument that X amount of optimization pressure, measured in whatever way is convenient, and given conditions Y produce Z% chance of Goodharting. Then at least we wouldn't have to disagree over what feels safe or not.

Load More