## AI ALIGNMENT FORUMAF

G Gordon Worley III

If you are going to read just one thing I wrote, read The Problem of the Criterion.

More AI related stuff collected over at PAISRI

Formal Alignment

# Wiki Contributions

G Gordon Worley III's Shortform

AlphaGo is fairly constrained in what it's designed to optimize for, but it still has the standard failure mode of "things we forgot to encode". So for example AlphaGo could suffer the error of instrumental power grabbing in order to be able to get better at winning Go because we misspecified what we asked it to measure. This is a kind of failure introduced into the systems by humans failing to make  adequately evaluate  as we intended, since we cared about winning Go games while also minimizing side effects, but maybe when we constructed  we forgot about minimizing side effects.

Optimization at a Distance

Really liking this model. It seems to actually deal with the problem of embeddedness for agents and the fact that there is no clear boundary to draw around what we call an agent other than one that's convenient for some purpose.

I've obviously got thoughts on how this is operationalizing insights about "no-self" and dependent origination, but that doesn't seem too important to get into, other than to say it gives me more reason to think this is likely to be useful.

G Gordon Worley III's Shortform

"Error" here is all sources of error, not just error in the measurement equipment. So bribing surveyors is a kind of error in my model.

Against Time in Agent Models

For what it's worth, I think this is trying to get at the same insight as logical time but via a different path.

For the curious reader, this is also the same reason we use vector clocks to build distributed systems when we can't synchronize the clocks very well.

And there's something quite interesting about computation as a partial order. It might seem that this only comes up when you have a "distributed" system, but actually you need partial orders to reason about unitary programs when they are non-deterministic (any program with loops and conditionals that can't be unrolled because they depend on inputs not known before runtime are non-deterministic in this sense). For this reason, partial orders are the bread-and-butter of program verification.

G Gordon Worley III's Shortform

I actually don't think that model is general enough. Like, I think Goodharting is just a fact of control system's observing.

Suppose we have a simple control system with output  and a governor  takes a measurement  (an observation) of . So long as  is not error free (and I think we can agree that no real world system can be actually error free), then  for some error factor . Since  uses  to regulate the system to change , we now have error influencing the value of . Now applying the standard reasoning for Goodhart, in the limit of optimization pressure (i.e.  regulating the value of  for long enough),  comes to dominate the value of .

This is a bit handwavy, but I'm pretty sure it's true, which means in theory any attempt to optimize for anything will, under enough optimization pressure, become dominated by error, whether that's human values or something else. The only interesting question is can we control the error enough, either through better measurement or less optimization pressure, such that we can get enough signal to be happy with the output.

G Gordon Worley III's Shortform

I'm fairly pessimistic on our ability to build aligned AI. My take is roughly that it's theoretically impossible and at best we might build AI that is aligned well enough that we don't lose. I've not written one thing to really summarize this or prove it, though.

The source of my take comes from two facts:

1. Goodharting is robust. That is, the mechanism of Goodharting seems impossible to overcome. Goodharting is just a fact of any control system.
2. It's impossible to infer the inner experience (and thus values) of another being perfectly without making normative assumptions.

Stuart Armstrong has made a case for (2) with his no free lunch theorem. I've not seen anyone formally make the case for (1), though.

Is this something worth trying to prove? That Goodharting is unavoidable and at most we can try to contain its effects?

I'm many years out from doing math full time so I'm not sure if I could make a rigorous proof of it, but this seems to be something that people disagree on sometimes (arguing that Goodharting can be overcome) but I think most of those discussions don't get very precise about what that means.

Why I'm co-founding Aligned AI

This feels like a key detail that's lacking from this post. I actually downvoted this post because I have no idea if I should be excited about this development or not. I'm pretty familiar with Stuart's work over the years, so I'm fairly surprised if there's something big here.

Might help if I put this another way. I'd be purely +1 on this project if it was just "hey, I think I've got some good ideas AND I have an idea about why it's valuable to operationalize them as a business, so I'm going to do that". Sounds great. However, the bit about "AND I think I know how to build aligned AI for real this time guys and the answer is [a thing folks have been disagreeing about whether or not it works for years]" makes me -1 unless there's some explanation of how it's different this time.

Sorry if this is a bit harsh. I don't want to be too down on this project, but I feel like a core chunk of the post is that there's some exciting development that leads Stuart to think something new is possible but then doesn't really tell us what that something new is, and I feel that by the standards of LW/AF that's good reason to complain and ask for more info.

Value extrapolation partially resolves symbol grounding

This doesn't really seem like solving symbol grounding, partially or not, so much as an argument that it's a non-problem for the purposes of value alignment.

$1000 USD prize - Circular Dependency of Counterfactuals Agreed. That said, I don't think counterfactuals are in the territory. I think I said before that they were in the map, although I'm now leaning away from that characterisation as I feel that they are more of a fundamental category that we use to draw the map. Yes, I think there is something interesting going on where human brains seem to operate in a way that makes counterfactuals natural. I actually don't think there's anything special about counterfactuals, though, just that the human brain is designed such that thoughts are not strongly tethered to sensory input vs. "memory" (internally generated experience), but that's perhaps only subtly different than saying counterfactuals rather than something powering them is a fundamental feature of how our minds work.$1000 USD prize - Circular Dependency of Counterfactuals

I don't think they're really at odds. Zack's analysis cuts off at a point where the circularity exists below it. There's still the standard epistemic circularity that exists whenever you try to ground out any proposition, counterfactual or not, but there's a level of abstraction where you can remove the seeming circularity by shoving it lower or deeper into the reduction of the proposition towards grounding out in some experience.

Another way to put this is that we can choose what to be pragmatic about. Zack's analysis choosing to be pragmatic about counterfactuals at the level of making decisions, and this allows removing the circularity up to the purpose of making a decision. If we want to be pragmatic about, say, accurately predicting what we will observe about the world, then there's still some weird circularity in counterfactuals to be addressed if we try to ask questions like "why these counterfactuals rather than others?" or "why can we formulate counterfactuals at all?".

Also I guess I should be clear that there's no circularity outside the map. Circularity is entirely a feature of our models of reality rather than reality itself. That's way, for example, the analysis on epistemic circularity I offer is that we can ground things out in purpose and thus the circularity was actually an illusion of trying to ground truth in itself rather than experience.

I'm not sure I've made this point very clearly elsewhere before, so sorry if that's a bit confusing. The point is that circularity is a feature of the relative rather than the absolute, so circularity exists in the map but not the territory. We only get circularity by introducing abstractions that can allow things in the map to depend on each other rather than the territory.