G Gordon Worley III

If you are going to read just one thing I wrote, read The Problem of the Criterion.

More AI-related stuff is collected over at PAISRI


Formal Alignment


Formal Philosophy and Alignment Possible Projects

Re Project 4, you might find my semi-abandoned (mostly because I wasn't and still am not in a position to make further progress on it) research agenda for deconfusing human values useful.

Formal Philosophy and Alignment Possible Projects

Re: Project 2

This project’s goal is to better understand the bridge principles needed between subjective, first person optimality and objective, third person success.

This seems quite valuable because, properly speaking, there is no objective, third-person perspective from which we can speak, only the inferred sense, from our first-person perspectives, that there exists something that looks to us like a third-person perspective. Thus I think this is a potentially fruitful line of research, since the proposed premise contains the confusion that needs to be unraveled before we can address what is really something more like intersubjective agreement about what the world is like.

Epistemological Vigilance for Alignment

As it happens, I think this is a rather important topic. Failure to consider and mitigate the risk posed by assumptions creates both false-negative (less concerning) and false-positive (more concerning) risks when attempting to build aligned AI.

G Gordon Worley III's Shortform

AlphaGo is fairly constrained in what it's designed to optimize for, but it still has the standard failure mode of "things we forgot to encode". So for example AlphaGo could suffer the error of instrumental power grabbing in order to get better at winning Go because we misspecified what we asked it to measure. This is a kind of failure introduced into the system by humans failing to make the measure adequately evaluate what we intended: we cared about winning Go games while also minimizing side effects, but maybe when we constructed the measure we forgot about minimizing side effects.

Optimization at a Distance

Really liking this model. It seems to actually deal with the problem of embeddedness for agents and the fact that there is no clear boundary to draw around what we call an agent other than one that's convenient for some purpose.

I've obviously got thoughts on how this is operationalizing insights about "no-self" and dependent origination, but that doesn't seem too important to get into, other than to say it gives me more reason to think this is likely to be useful.

G Gordon Worley III's Shortform

"Error" here is all sources of error, not just error in the measurement equipment. So bribing surveyors is a kind of error in my model.

Against Time in Agent Models

For what it's worth, I think this is trying to get at the same insight as logical time but via a different path.

For the curious reader, this is also the same reason we use vector clocks to build distributed systems when we can't synchronize the clocks very well. 

And there's something quite interesting about computation as a partial order. It might seem that this only comes up when you have a "distributed" system, but actually you need partial orders to reason about unitary programs when they are non-deterministic (any program with loops and conditionals that can't be unrolled, because they depend on inputs not known before runtime, is non-deterministic in this sense). For this reason, partial orders are the bread-and-butter of program verification.
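For concreteness, here is a minimal vector-clock sketch; the function names are mine for illustration, not from any particular library. Each process keeps one counter per process, and one event causally precedes another exactly when its clock is component-wise less than or equal to the other's. Some pairs of events compare neither way, which is what makes the ordering partial.

```python
def merge(a, b):
    """Component-wise max of two clocks (applied on message receipt)."""
    return [max(x, y) for x, y in zip(a, b)]

def happens_before(a, b):
    """True iff the event with clock a causally precedes the event with clock b."""
    return all(x <= y for x, y in zip(a, b)) and a != b

def concurrent(a, b):
    """True iff neither event causally precedes the other."""
    return not happens_before(a, b) and not happens_before(b, a)

# Two processes; a clock is [process_0_count, process_1_count].
e1 = [1, 0]         # an event on process 0
e2 = [0, 1]         # an event on process 1, before any communication
e3 = merge(e1, e2)  # process 1 receives a message carrying e1's clock...
e3[1] += 1          # ...then ticks its own component, giving [1, 2]

print(happens_before(e1, e3))  # True: e1 causally precedes e3
print(concurrent(e1, e2))      # True: e1 and e2 are incomparable
```

The incomparable pair (`e1`, `e2`) is the partial-order phenomenon: no amount of clock synchronization assigns them a meaningful total order, because neither could have influenced the other.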

G Gordon Worley III's Shortform

I actually don't think that model is general enough. Like, I think Goodharting is just a fact of control systems acting on their observations.

Suppose we have a simple control system with output O and a governor G that takes a measurement M (an observation) of O. So long as M is not error free (and I think we can agree that no real-world system can actually be error free), then M = O + ε for some error term ε. Since G uses M to regulate the system to change O, we now have error influencing the value of O. Now, applying the standard reasoning for Goodhart, in the limit of optimization pressure (i.e. G regulating the value of O for long enough), ε comes to dominate the value of O.

This is a bit handwavy, but I'm pretty sure it's true, which means that in theory any attempt to optimize for anything, whether that's human values or something else, will, under enough optimization pressure, become dominated by error. The only interesting question is whether we can control the error enough, either through better measurement or less optimization pressure, to get enough signal to be happy with the output.
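A toy numerical sketch of the regressional flavor of this argument, under assumptions of my own choosing rather than anything from the comment above: candidates have a true value O and an independent measurement error ε, both standard normal, and the optimizer selects the candidate with the best proxy score M = O + ε. More candidates means more optimization pressure.

```python
import random

random.seed(0)

def winner_stats(n, trials=2000):
    """Select the best of n candidates by the proxy M = O + e, where
    O (true value) and e (error) are i.i.d. standard normal. Return the
    winner's average proxy score and average true value over many trials."""
    proxy_sum = true_sum = 0.0
    for _ in range(trials):
        cands = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(n)]
        o, e = max(cands, key=lambda c: c[0] + c[1])  # optimize the proxy
        proxy_sum += o + e
        true_sum += o
    return proxy_sum / trials, true_sum / trials

for n in (2, 10, 100, 1000):
    proxy, true_val = winner_stats(n)
    print(f"n={n:4d}  proxy={proxy:5.2f}  true={true_val:5.2f}  "
          f"error captured={proxy - true_val:5.2f}")
```

With equal variances, roughly half of the winner's proxy score is error at every level of pressure, so the absolute gap between measured and true value keeps widening as n grows: the harder you select on M, the more of what you selected is ε.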

G Gordon Worley III's Shortform

I'm fairly pessimistic on our ability to build aligned AI. My take is roughly that it's theoretically impossible and at best we might build AI that is aligned well enough that we don't lose. I've not written one thing to really summarize this or prove it, though.

The source of my take comes from two facts:

  1. Goodharting is robust. That is, the mechanism of Goodharting seems impossible to overcome. Goodharting is just a fact of any control system.
  2. It's impossible to infer the inner experience (and thus values) of another being perfectly without making normative assumptions.

Stuart Armstrong has made a case for (2) with his no free lunch theorem. I've not seen anyone formally make the case for (1), though.

Is this something worth trying to prove? That Goodharting is unavoidable and at most we can try to contain its effects?

I'm many years out from doing math full time, so I'm not sure I could make a rigorous proof of it. But this seems to be something people sometimes disagree about (arguing that Goodharting can be overcome), and I think most of those discussions don't get very precise about what that would mean.

Why I'm co-founding Aligned AI

This feels like a key detail that's lacking from this post. I actually downvoted this post because I have no idea whether I should be excited about this development or not. I'm pretty familiar with Stuart's work over the years, so I'd be fairly surprised if there's something big here.

Might help if I put this another way. I'd be purely +1 on this project if it was just "hey, I think I've got some good ideas AND I have an idea about why it's valuable to operationalize them as a business, so I'm going to do that". Sounds great. However, the bit about "AND I think I know how to build aligned AI for real this time guys and the answer is [a thing folks have been disagreeing about whether or not it works for years]" makes me -1 unless there's some explanation of how it's different this time.

Sorry if this is a bit harsh. I don't want to be too down on this project, but I feel like a core chunk of the post is that there's some exciting development that leads Stuart to think something new is possible but then doesn't really tell us what that something new is, and I feel that by the standards of LW/AF that's good reason to complain and ask for more info.
