Alex_Altair — AI Alignment Forum

Shallow review of technical AI safety, 2024

Some small corrections/additions to my section ("Altair agent foundations"). I'm currently calling it "Dovetail research". That's not publicly written anywhere yet, but if it were listed as that here, it might help people who are searching for it later this year.

Which orthodox alignment problems could it help with?: 9. Humans cannot be first-class parties to a superintelligent value handshake

I wouldn't put number 9. It's not intended to "solve" most of these problems, but it is intended to help make progress on understanding the nature of the problems through formalization, so that they can be avoided or postponed, or more effectively solved by other research agendas.

Target case: worst-case

definitely not worst-case, more like pessimistic-case

Some names: Alex Altair, Alfred Harwood, Daniel C, Dalcy K

Add "José Pedro Faustino"

Estimated # FTEs: 1-10

I'd call it 2, averaged throughout 2024.

Some outputs in 2024: mostly exposition but it’s early days

Basically right; I'd add this post and this post.

When is Goodhart catastrophic?

Alex_Altair2y42

I finally got around to reading this sequence, and I really like the ideas behind these methods. This feels like someone actually trying to figure out exactly how fragile human values are. It's especially exciting because it seems like it hooks right into an existing, normal field of academia (thus making it easier to leverage their resources toward alignment).

I do have one major issue with how the takeaway is communicated, starting with the term "catastrophic". I would only use that word when the outcome of the optimization is really bad, much worse than "average" in some sense. That's in line with the idea that the AI will "use the atoms for something else", and not just leave us alone to optimize its own thing. But the theorems in this sequence don't seem to be about that;

We call this catastrophic Goodhart because the end result, in terms of $V$, is as bad as if we hadn't conditioned at all.

Being as bad as if you hadn't optimized at all isn't very bad; it's where we started from!

I think this has almost the opposite takeaway from the intended one. I can imagine someone (say, OpenAI) reading these results and thinking something like, great! They just proved that in the worst case scenario, we do no harm. Full speed ahead!

(Of course, putting a bunch of optimization power into something and then getting no result would still be a waste of the resources put into it, which is presumably not built into $V$. But that's still not very bad.)

That said, my intuition says that these same techniques could also suss out the cases where optimizing for $U$ pessimizes for $V$, in the previously mentioned use-our-atoms sense.
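
A toy simulation can make the "as bad as not conditioning" outcome concrete. The sketch below uses a setup assumed purely for illustration, not taken from the sequence's theorems: $V$ is standard normal, the proxy is $U = V + X$ with independent error $X$, and "optimizing" means keeping the top 0.1% of samples by $U$. All function names and distribution choices are hypothetical.

```python
import math
import random


def mean_v_in_top_u(sample_error, n=200_000, top_fraction=0.001, seed=0):
    """Average true value V among the samples with the highest proxy U = V + X."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        v = rng.gauss(0.0, 1.0)          # true value, unconditioned mean 0
        u = v + sample_error(rng)        # proxy = true value + error
        pairs.append((u, v))
    pairs.sort(reverse=True)             # highest proxy first
    k = max(1, round(n * top_fraction))
    return sum(v for _, v in pairs[:k]) / k


def light_tailed(rng):
    # Assumed light-tailed error: standard normal.
    return rng.gauss(0.0, 1.0)


def heavy_tailed(rng):
    # Assumed heavy-tailed error: standard Cauchy via its inverse CDF.
    return math.tan(math.pi * (rng.random() - 0.5))


if __name__ == "__main__":
    print("light-tailed error:", mean_v_in_top_u(light_tailed))
    print("heavy-tailed error:", mean_v_in_top_u(heavy_tailed))
```

In this toy setup, selecting on $U$ with Gaussian error should leave a clearly positive average $V$, while with Cauchy error the average should sit near the unconditioned mean of 0. That is the "back where we started" outcome, not the use-our-atoms one.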

Meaning & Agency

Alex_Altair2y10

I'll also note that I think what you're calling "Vingean agency" is a notable sub-type of optimization process that you've done a good job at analyzing here. But it's definitely not the definition of optimization or agency to me. For example, in the post you say

We perceive agency when something is better at doing something than us; we endorse some aspect of its reasoning or activity.

This doesn't feel true to me (in the carve-nature-at-its-joints sense). I think children are strongly agents, even though I do everything more competently than they do.

Meaning & Agency

Alex_Altair2y32

I have some comments on the arbitrariness of the "baseline" measure in Yudkowsky's measure of optimization.

Sometimes, I am surprised in the moment about how something looks, and I quickly update to believing there's an optimization process behind it. For example, if I climb a hill expecting to see a natural forest, and then instead see a grid of suburban houses or an industrial logging site, I'll immediately realize that there's no way this is random and instead there's an optimization process that I wasn't previously modelling. In cases like this, I think Yudkowsky's measure accurately captures the measure of optimization.

Alternatively, sometimes I'm thinking about optimization processes that I've always known are there, and I'm wondering to myself how powerful they are. For example, sometimes I'll be admiring how competent one of my friends is. To measure their competence, I can imagine what a "typical" person would do in that situation, and check the Yudkowsky measure as a diff. I can feel what you mean about arbitrarily drawing a circle around the known optimizer and then "deleting" it, but this just doesn't feel that weird to me? Like I think the way that people model the world allows them to do this kind of operation with pretty substantially meaningful results.

While it may be clear how to do this in many cases, it isn't clear in general. I suspect if we tried to write down the algorithm for doing it, it would involve an "agency detector" at some point; you have to be able to draw a circle around the agent in order to selectively forget it.

I think this is where Flint's framework was insightful. Instead of "detecting" and "deleting" the optimization process and then measuring the diff, you consider the system of every possible trajectory, measure the optimization of each (with respect to the ordering over states), take the average, and then compare your potential optimizer to this. The potential optimization process will be in that average, but it will be washed out by all the other trajectories (assuming most trajectories don't go up the ordering nearly as much; if they did, then your observed process would rightly not register as an optimizer).

(Obviously this is not helpful for e.g. looking into a neural network and figuring out whether it contains something that will powerfully optimize the world around you. But that's not what this level of the framework is for; this level is for deciding what it even means for something to powerfully optimize something around you.)

Of course, to run this comparison you need a "baseline" of a measure over every possible trajectory. But I think this is just reflecting the true nature of optimization; I think it's only meaningful relative to some other expectation.
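
To make the baseline dependence concrete, here is a minimal sketch of a Yudkowsky-style measure: the negative log of the baseline probability of landing on a state at least as high in the ordering as the one observed. The state names, the ranking, and both baseline distributions are made up for illustration.

```python
import math


def optimization_bits(baseline, rank, outcome):
    """-log2 of the baseline mass on states ranked at least as high as `outcome`."""
    mass = sum(p for state, p in baseline.items() if rank[state] >= rank[outcome])
    return -math.log2(mass)


# Eight hypothetical states, ordered s0 < s1 < ... < s7.
states = [f"s{i}" for i in range(8)]
rank = {s: i for i, s in enumerate(states)}

# Baseline 1: every state equally likely.
uniform = {s: 1 / 8 for s in states}

# Baseline 2: the top state already gets half the probability mass.
skewed = {s: 0.5 / 7 for s in states[:-1]}
skewed["s7"] = 0.5

print(optimization_bits(uniform, rank, "s7"))  # 3.0
print(optimization_bits(skewed, rank, "s7"))   # 1.0
```

The same observed outcome registers as 3 bits or 1 bit depending only on which baseline it is compared against, which is the sense in which the number is only meaningful relative to some other expectation.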

Meaning & Agency

Alex_Altair2y41

I feel like there's a key concept that you're aiming for that isn't quite spelled out in the math.

I remember reading somewhere that there's a typically unmentioned distinction between "Bayes' theorem" and "Bayesian inference". Bayes' theorem is the statement that $P(A | B) = \frac{P(B | A) P(A)}{P(B)}$, which is true from the axioms of probability theory for any $A$ and $B$ whatsoever. Notably, it has nothing to do with time, and it's still true even after you learn $\neg B$. On the other hand, Bayesian inference is the premise that your beliefs should change in accordance with Bayes' theorem. Namely that $P_{after} (x) = P_{before} (x | o)$ where $o$ is an observation. That is, when you observe something, you wholesale replace your probability space $P_{before}$ with a new probability space $P_{after}$ which is calculated by applying the conditional (via Bayes' theorem).
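
The distinction can be sketched in a few lines. The joint distribution below is invented for illustration. `bayes_holds` checks the timeless identity inside a single distribution, while `update` is the separate inference step that swaps $P_{before}$ for $P_{after}$.

```python
from fractions import Fraction as F

# Hypothetical joint distribution over worlds (A, B); numbers are illustrative.
P_before = {
    (True, True): F(3, 10),
    (True, False): F(1, 10),
    (False, True): F(2, 10),
    (False, False): F(4, 10),
}


def prob(P, event):
    return sum(p for world, p in P.items() if event(world))


def cond(P, event, given):
    return prob(P, lambda w: event(w) and given(w)) / prob(P, given)


def A(world):
    return world[0]


def B(world):
    return world[1]


def not_B(world):
    return not world[1]


def bayes_holds(P):
    """Bayes' theorem: an identity within one distribution, no time involved."""
    return cond(P, A, B) == cond(P, B, A) * prob(P, A) / prob(P, B)


def update(P, observation):
    """Bayesian inference: replace P with P conditioned on the observation."""
    z = prob(P, observation)
    return {w: (p / z if observation(w) else F(0)) for w, p in P.items()}


# Observe not-B and move to a new distribution.
P_after = update(P_before, not_B)

print(bayes_holds(P_before))  # the identity about P_before still holds
print(prob(P_after, A))       # equals P_before(A | not B)
```

Learning $\neg B$ changes which distribution you use going forward; it does not change the fact that the identity relating $P_{before}(A | B)$ and $P_{before}(B | A)$ holds.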

And I think there's a similar thing going on with your definitions of endorsement. While trying to understand the equations, I found it easier to visualize $P_{1}$ and $P_{2}$ as two separate distributions on the same $Ω$, where endorsement is simply a consistency condition. For belief consistency, you would just say that $P_{1}$ endorses $P_{2}$ on event $X$ if $P_{1} (X) = P_{2} (X)$.

But that isn't what you wrote; instead you wrote this with conditioning on a quoted thing. And of course, the thing I said is symmetrical between $P_{1}$ and $P_{2}$, whereas your concept of endorsement is not symmetrical. It seems like the intention is that $P_{1}$ "learns" or "hears about" $P_{2}$'s belief, and then $P_{1}$ updates (in the above Bayesian inference sense) to have a new $P_{1, after}$ that has the consistency condition with $P_{2}$.

By putting "$P_{2} (X) = p$" in the conditional, you're saying that it's an event on $Ω_{1}$, a thing with the same type as $X$. And it feels like that's conceptually correct, but also kind of the hard part. It's as if $P_{1}$ is modelling $P_{2}$ as an agent embedded into $Ω_{1}$.
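
Here is a rough sketch of that reading. It assumes the endorsement condition has the form $P_{1}(X | \text{"}P_{2}(X) = p\text{"}) = p$, which is my paraphrase and not a quote from the post. Worlds in $Ω_{1}$ are modelled as pairs: whether $X$ holds, and the value $p$ that $P_{2}$ is reported to assign to $X$. Both example distributions are invented.

```python
from fractions import Fraction as F


def endorses(P1):
    """True if P1(X | report == p) == p for every report p with positive mass."""
    reports = {report for (_, report) in P1}
    for r in reports:
        mass_r = sum(p for (_, rep), p in P1.items() if rep == r)
        mass_x_and_r = sum(p for (x, rep), p in P1.items() if rep == r and x)
        if mass_r > 0 and mass_x_and_r / mass_r != r:
            return False
    return True


# P1 treats P2's reported belief as an event in its own space and defers to it.
P1_trusting = {
    (True, F(1, 5)): F(1, 10),
    (False, F(1, 5)): F(4, 10),
    (True, F(4, 5)): F(4, 10),
    (False, F(4, 5)): F(1, 10),
}

# P1 regards the report as irrelevant to X: conditioning on it changes nothing.
P1_indifferent = {
    (True, F(1, 5)): F(1, 4),
    (False, F(1, 5)): F(1, 4),
    (True, F(4, 5)): F(1, 4),
    (False, F(4, 5)): F(1, 4),
}

print(endorses(P1_trusting))     # True
print(endorses(P1_indifferent))  # False
```

Note that in `P1_trusting` the prior on $X$ is 1/2, so $P_{1}$ only agrees with $P_{2}$ after conditioning on the report. That is the asymmetry the plain consistency condition $P_{1}(X) = P_{2}(X)$ lacks.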

Towards Measures of Optimisation

Alex_Altair3y50

You might be interested in some of my open drafts about optimization;

One distinction that I pretty strongly hold as carving nature at its joints is (what I call) optimization vs agents. Optimization has no concept of a utility function, and it is just about the state going up an ordering. Agents are the thing that has a utility function, which they need for picking actions with probabilistic outcomes.
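
A small sketch of why an ordering alone is not enough once outcomes are probabilistic: the two utility functions below (made-up numbers) rank the states identically, yet disagree about whether to take a gamble.

```python
def expected_utility(utility, lottery):
    return sum(prob * utility[state] for state, prob in lottery.items())


def ordering(utility):
    """States sorted from worst to best; this is all an ordering-based view sees."""
    return sorted(utility, key=utility.get)


# Two hypothetical utility functions with the same ordering a < b < c.
u1 = {"a": 0.0, "b": 1.0, "c": 1.5}
u2 = {"a": 0.0, "b": 1.0, "c": 10.0}

# Two actions: get b for sure, or flip a coin between a and c.
actions = {
    "sure_b": {"b": 1.0},
    "gamble": {"a": 0.5, "c": 0.5},
}


def choose(utility):
    return max(actions, key=lambda name: expected_utility(utility, actions[name]))


print(ordering(u1) == ordering(u2))  # True
print(choose(u1))                    # sure_b
print(choose(u2))                    # gamble
```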

Optimization Concepts in the Game of Life

Alex_Altair3y20

I feel very on-board with this research aesthetic.

Here are just some nit-picks/notational confusions I had while reading this;

The sequence $p, \mathrm{step}(p), \mathrm{step}^{2}(p), \ldots$, i.e., $n \mapsto \mathrm{step}^{n}(p)$, is the computation seeded at $p$ (or a “trajectory” in dynamical systems terminology).
...
A property $P$ is achieved by a computation s if there exists some number of steps $n$ such that $s (n) \in P$ ...

It took me a second to figure out what $s (n)$ referred to, partly because the first s was not rendered in LaTeX, partly because it was never shown as a function before, and partly because it looked kinda like $\mathrm{step}^{n} (p)$, so I thought maybe the notation had changed.

the empty board $C = \{⊥\}$

I've seen $⊥$ as "false" before, but I don't think it's super common, and you also previously said

a pattern is an infinite two-dimensional Boolean grid, or equivalently a function of type ℤxℤ→{true, false}

which made this feel like a switchup of notation. (Also, I think the type signature is off? The empty board $C$ should be a function, but instead it's a set containing one symbol...)

This includes still lifes ( $N = 0$ ), blinkers ( $N = 2$ )

I think if blinkers have period 2 then still lifes have to be considered to have period 1, and not 0.
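
A quick check of this, taking the period to be the smallest $N \geq 1$ with $\mathrm{step}^{N}(p) = p$ (my assumed definition; $N = 0$ would hold trivially for every pattern). The set-of-live-cells representation and the function names are my own choices.

```python
from collections import Counter


def step(cells):
    """One Game of Life generation on a set of live (x, y) cells."""
    counts = Counter(
        (x + dx, y + dy)
        for x, y in cells
        for dx in (-1, 0, 1)
        for dy in (-1, 0, 1)
        if (dx, dy) != (0, 0)
    )
    # Birth on exactly 3 neighbours, survival on 2 or 3.
    return frozenset(c for c, n in counts.items() if n == 3 or (n == 2 and c in cells))


def period(cells, max_steps=100):
    """Smallest N >= 1 with step^N(cells) == cells, or None if not found."""
    start = frozenset(cells)
    current = start
    for n in range(1, max_steps + 1):
        current = step(current)
        if current == start:
            return n
    return None


block = {(0, 0), (0, 1), (1, 0), (1, 1)}   # a still life
blinker = {(0, 0), (1, 0), (2, 0)}         # a period-2 oscillator

print(period(block))    # 1
print(period(blinker))  # 2
```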

Eater. An eater p is robust for $P = \{p\}$ within any context $c$ that contains $n \geq 0$ spaceships traveling in the direction of the eater (and nothing else on the board).

I think the true thing is a lot weaker than this; it's robust to gliders (not all spaceships) traveling along a specific diagonal with respect to the location of the eater (and possibly the glider has to have a certain phase, I'd have to check).

The basin of attraction for a pattern $p$ and a property $P$ is the largest context set $B$ such that $p$ is robust for $P$ within $B$.
Examples:
Eater. Let $p$ be an eater and $P = \{p\}$. $B$ is the context set containing $n \geq 0$ spaceships moving in the direction of the eater and nothing else (in any other context, the contents of the board don't get consumed by the eater).

This is definitely not the largest context set $B$, because there are tons of patterns that extinguish themselves.
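
As a concrete sketch of that objection, under my reading of the quoted definitions (the board "achieves" $P = \{p\}$ if at some step it equals $p$ exactly): the code below places a standard eater next to two small contexts that contain no spaceships and die out on their own. The coordinates and helper names are my own.

```python
from collections import Counter


def step(cells):
    """One Game of Life generation on a set of live (x, y) cells."""
    counts = Counter(
        (x + dx, y + dy)
        for x, y in cells
        for dx in (-1, 0, 1)
        for dy in (-1, 0, 1)
        if (dx, dy) != (0, 0)
    )
    return frozenset(c for c, n in counts.items() if n == 3 or (n == 2 and c in cells))


def steps_until_only(pattern, context, max_steps=50):
    """Number of steps until the board equals `pattern` exactly, else None."""
    pattern = frozenset(pattern)
    board = pattern | frozenset(context)
    for n in range(max_steps + 1):
        if board == pattern:
            return n
        board = step(board)
    return None


# Eater 1 (a still life), as (x, y) live cells.
eater = frozenset({(0, 0), (1, 0), (0, 1), (2, 1), (2, 2), (2, 3), (3, 3)})

# Two non-spaceship contexts, placed far from the eater, that extinguish themselves.
lone_cell = {(20, 20)}
diagonal = {(20, 20), (21, 21), (22, 22)}

print(step(eater) == eater)                 # True
print(steps_until_only(eater, lone_cell))   # 1
print(steps_until_only(eater, diagonal))    # 2
```

Neither context is a spaceship heading toward the eater, yet the board ends up as exactly the eater, so any such context would also belong in the largest $B$.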

Your posts should be on arXiv

Alex_Altair3y39

I would especially especially love it if it popped out a .tex file that I could edit, since I'm very likely to be using different language on LW than I would in a fancy academic paper.
