6mo10

(Edit: others have made this point already, but anyhow)

My main objection to this angle: self-improvements do not necessarily look like "design a successor AI to be in charge". They can look more like "acquire better world models", "spin up more copies", "build better processors", "train lots of narrow AI to act as fingers", etc.

I don't expect an AI mind to have trouble finding lots of pathways like these (that tractably improve abilities without risking a misalignment catastrophe) that take it well above human level, given the chance.

16mo

I think my response to this is similar to the one to Wei Dai above. Which is to agree that there are certain kinds of improvements that generate less risk of misalignment but it's hard to be certain. It seems like those paths are (1) less likely to produce transformational improvements in capabilities than other, more aggressive, changes and (2) not the kinds of changes we usually worry about in the arguments for human-AI risk, such that the risks remain largely symmetric. But maybe I'm missing something here!

7mo11

Is the following an accurate summary?

The agent is built to have a "utility function" input that the humans can change over time, and a probability distribution over what the humans will ask for at different time steps, and maximizes according to a combination of the utility functions it anticipates across time steps?

If that's correct, here are some places this conflicts with my intuition about how things should be done:

I feel awkward about the randomness being treated as essential. I'd rather be able to do something other than randomness in order to get my mild optimization, and something feels unstable/non-compositional about needing randomness in place for your evaluations... (Not that I have an alternative that springs to mind!)

I also feel like "worst case" is perhaps problematic, since it's bringing maximization in, and you're then needing to rely on your convex s...

7mo21

Can I check that I follow how you recover quantilization?

Are you evaluating distributions over actions, and caring about the worst-case expectation of that distribution?

If so, proposing a particular action is evaluated badly? (Since there's a utility function in your set that spikes downward at that action.)

But proposing a range of actions to randomize amongst can be assessed to have decent worst-case expected utility, since particular downward spikes get smoothed over, and you can rely on your knowledge of "in-distribution" behaviour?
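A toy numerical sketch of the reading above, in case it helps. Everything here is an assumption made for illustration: 100 actions, a proxy utility `action / 100`, and a utility set in which each member matches the proxy except for a single downward spike to -1 at one action. It is not the construction from the post, only one instance where worst-case expected utility favours randomizing over a top slice of actions.

```python
ACTIONS = range(100)


def base_utility(action: int) -> float:
    # Hypothetical proxy utility: higher-numbered actions look better.
    return action / 100


def spiked_utility(spike_at: int, action: int) -> float:
    # One member of the assumed utility set: equal to the proxy everywhere
    # except for a single downward spike at `spike_at`.
    return -1.0 if action == spike_at else base_utility(action)


def worst_case_expected_utility(distribution: dict[int, float]) -> float:
    # Expected utility of the action distribution under each spiked utility,
    # minimized over where the spike sits.
    return min(
        sum(p * spiked_utility(spike_at, a) for a, p in distribution.items())
        for spike_at in ACTIONS
    )


point_mass = {99: 1.0}                          # propose the single "best" action
top_decile = {a: 0.1 for a in range(90, 100)}   # randomize over the top 10%
uniform = {a: 0.01 for a in ACTIONS}            # randomize over everything

for name, dist in [("point", point_mass), ("top 10%", top_decile), ("uniform", uniform)]:
    print(name, worst_case_expected_utility(dist))
```

In this toy setup the point mass is hit by the full spike, while spreading mass over a range dilutes any one spike, and the top slice beats the uniform distribution because its proxy utility is higher.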

Edited to add: ...

17mo

If that's correct, here are some places this conflicts with my intuition about how things should be done:
I feel awkward about the randomness being treated as essential. I'd rather be able to do something other than randomness in order to get my mild optimization, and something feels unstable/non-compositional about needing randomness in place for your evaluations... (Not that I have an alternative that springs to mind!)
I also feel like "worst case" is perhaps problematic, since it's bringing maximization in, and you're then needing to rely on your convex set being some kind of smooth in order to get good outcomes. If I have a distribution over potential utility functions, and quantilize for the worst 10% of possibilities, does that do the same sort of work that "worst case" is doing for mild optimization?

I want to say that I agree the transformer circuits work is great, and that I like it, and am glad I had the opportunity to read it! I still expect it was pretty harmful to publish.

Nerdsniping goes both ways: you also inspire things like the Hyena work trying to improve architectures based on components of what transformers can do.

I think indiscriminate hype and trying to do work that will be broadly attention-grabbing falls on the wrong side, likely doing net harm. Because capabilities improvements seem empirically easier than understanding them, and ther...

I'm perhaps misusing "publish" here, to refer to "putting stuff on the internet" and "raising awareness of the work through company Twitter" and etc.

I mostly meant to say that, as I see it, too many things that shouldn't be published are being published, and the net effect looks plausibly terrible with little upside (though not much has happened yet in either direction).

The transformer circuits work strikes me this way, so does a bunch of others.

Also, I'm grateful to know your read! I'm broadly interested to hear this and other raw viewpoints, to get a sense of how things look to other people.

27mo

Interesting, thanks for the context. I buy that this could be bad, but I'm surprised that you see little upside - the obvious upside esp for great work like transformer circuits is getting lots of researchers nerdsniped and producing excellent and alignment relevant interp work. Which seems huge if it works

I mostly do just mean "keeping it within a single research group" in the absence of better ideas. And I don't have a better answer, especially not for independent folk or small orgs.

I wonder if we need an arxiv or LessWrong clone where you whitelist who you want to discuss your work with. And some scheme for helping independents find each other, or find existing groups they trust. Maybe with some "I won't use this for capabilities work without the permission of the authors" legal docs as well.

This isn't something I can visualize working, but maybe it has components of an answer.

7mo108

To throw in my two cents, I think it's clear that whole classes of "mechanistic interpretability" work are about better understanding architectures in ways that, if the research is successful, make it easier to improve their capabilities.

And I think this points strongly against publishing this stuff, especially if the goal is to "make this whole field more prestigious real quick". Insofar as the prestige is coming from folks who work on AI capabilities, that's drinking from a poisoned well (since they'll grant the most prestige to the work that helps them ...

17mo

Can you describe how the "local cluster" thing would work outside of keeping it within a single organization? I'd also be very interested in some case studies where people tried this.

17mo

I'm surprised by this claim, can you say more? My read is weakly that people in interp underpublish to wider audiences (eg getting papers into conferences), though maybe that people overpublish blog posts? (Or that I try too hard to make things go viral on Twitter lol)

Seems right, oops! A5 is here saying that if any part of my is flat it had better stay flat!

I think I can repair my counterexample but looks like you've already found your own.

Apr 15, 20230-2

No on Q4? I think Alex's counterexample applies to Q4 as well.

(EDIT: Scott points out I'm wrong here, Alex's counterexample doesn't apply, and mine violates A5.)

In particular I think A4 and A5 don't imply anything about the rate of change as we move between lotteries, so we can have movements too sharp to be concave. We only have quasi-concavity.

My version of the counterexample: you have two outcomes and , we prefer anything with equally, and we otherwise prefer higher .

If you give me a corresponding , it must satisfy ...

38mo

Alex's counterexample as stated is not a counterexample to Q4, since it is in fact concave.
I believe your counterexample violates A5, taking B=¬X, A=X, and p=1/2.

Okay, I now think A5 implies: "if moving by Δ is good, then moving by any negative multiple −nΔ is bad". Which checks out to me re concavity.

38mo

You can also think of A5 in terms of its contrapositive: For all A,B∈L, if A≻B, then for all 0<p≤1, A≻pA+(1−p)B
This is basically just the strict version of A4. I probably should have written it that way instead. I wanted to use ⪰ instead of ≻, because it is closer to the base definition, but that is not how I was natively thinking about it, and I probably should have written it the way I think about it.

The way I understand A4 is that it says "if moving by Δ is good, then moving by any fraction λΔ is also good".

And A5 says "if moving by Δ is good, then moving by any multiple nΔ is also good", which is much stronger.

38mo

Your understanding of A4 is right. In A5, "good" should be replaced with "bad."

8mo10

[Edit: yeah nevermind I have the inequality backwards]

A5 seems too strong?

Consider lotteries and , and a mixture in between. Applying A5 twice gives:

- If then
- If then

So if and then ?

Either I'm confused or A5 is a stricter condition than concavity.

28mo

I haven't actually thought about whether A5 implies A4 though. It is plausible that it does (together with A1-A3, or some other simple axioms).
When A≻B, we get A4 from A5, so it suffices to replace A4 with the special case that A∼B. If A∼B, and A,B≻X, a mixture of A and B, then all we need to do is have any Y such that A≻Y≻X, then we can get Y′ between A and X by A3, and then X will also be a mixture of Y′ and B, contradicting A5, since B≻Y′.
A1,A2,A3,A5 do not imply A4 directly, because you can have the function that assigns utility 0 to a fair coin flip between two options, and utility 1 to everything else. However, I suspect that adding the right axiom to imply continuity will be sufficient to also allow us to remove A4, and only have A5.

38mo

You have the inequality backwards. You can't apply A5 when the mixture is better than the endpoint, only when the mixture is worse than the endpoint.

18mo

The way I understand A4 is that it says "if moving by Δ is good, then moving by any fraction λΔ is also good".
And A5 says "if moving by Δ is good, then moving by any multiple nΔ is also good", which is much stronger.

8mo42

Something I'm now realizing, having written all these down: the core mechanism really does echo Löb's theorem! Gah, maybe these are more like Löb than I thought.

(My whole hope was to generalize to things that Löb's theorem doesn't! And maybe these ideas still do, but my story for why has broken, and I'm now confused.)

As something to ponder on, let me show you how we can prove Löb's theorem following the method of ideas #3 and #5:

- is assumed
- We consider the loop-cutter
- We verify that if activates then must be true:

Also Plan B is currently being used to justify accelerating various danger tech by folks with no solid angles on Plan A...

9mo20

My troll example is a fully connected network with all zero weights and biases, no skip connections.

This isn't something that you'd reach in regular training, since networks are initialized away from zero to avoid this. But it does exhibit a basic ingredient in controlling the gradient flow.

To look for a true hacker I'd try to reconfigure the way the downstream computation works (by modifying attention weights, saturating relus, or similar) based on some model of the inputs, in a way that pushes around where the gradients go.
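A hand-rolled sketch of the all-zero example, to make the gradient-blocking concrete. The setup is an assumption chosen for illustration (a two-layer ReLU network with a scalar output and squared-error loss, with the ReLU derivative at 0 taken to be 0); the function name and shapes are hypothetical, not taken from any library.

```python
def relu(z: float) -> float:
    return max(0.0, z)


def relu_grad(z: float) -> float:
    # Convention assumed here: derivative of ReLU at exactly 0 is 0.
    return 1.0 if z > 0 else 0.0


def gradients(w1, b1, w2, b2, x, target):
    """Manual backprop for y = w2 . relu(w1 x + b1) + b2 with loss 0.5 * (y - target)^2."""
    # Forward pass.
    z = [sum(w * xj for w, xj in zip(row, x)) + b for row, b in zip(w1, b1)]
    h = [relu(zi) for zi in z]
    y = sum(w * hi for w, hi in zip(w2, h)) + b2

    # Backward pass.
    dy = y - target
    grad_w2 = [dy * hi for hi in h]            # zero whenever the hidden layer is zero
    grad_b2 = dy
    dz = [dy * w * relu_grad(zi) for w, zi in zip(w2, z)]  # zero whenever w2 is zero
    grad_w1 = [[dzi * xj for xj in x] for dzi in dz]
    grad_b1 = dz
    return grad_w1, grad_b1, grad_w2, grad_b2


# All-zero network with 2 inputs and 3 hidden units.
zero_w1 = [[0.0, 0.0] for _ in range(3)]
zero_b1 = [0.0, 0.0, 0.0]
zero_w2 = [0.0, 0.0, 0.0]
grad_w1, grad_b1, grad_w2, grad_b2 = gradients(
    zero_w1, zero_b1, zero_w2, 0.0, x=[1.0, 2.0], target=3.0
)
print(grad_w1, grad_b1, grad_w2, grad_b2)
```

Under these assumptions only the output bias receives a nonzero gradient: the zero output weights stop anything from flowing back into the first layer, and the zero hidden activations stop the output weights themselves from updating.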

It looks like you're investigating an angle that I can't follow, but here's my two cents re bounded agents:

My main idea to port this to the bounded setting is to have a bot that searches for increasingly long proofs, knowing that if it takes longer to find a proof then it is itself a bit harder to reason about.

We can instantiate this like:

The idea is that if there is a short way to prove that the opponent would cooperate back, then it takes just a constant steps more to prove that we cooperate. So it doesn't open us up to...

A point of confusion: is it enough to prove that A→□B? What about A→B? I'm not sure I can say this well, but here goes:

We might not be able to prove A→B in the theory, or even that ¬□⊥ (which would mean "there are no proofs of inconsistency"). But if we believe our theory is sound, we believe that it can't show that a copy of itself proves something false.

So A→□B tells us that if A is true, the theory would show that a copy of itself proves B is true. And this is enough to convince *us* that we can't simultaneously have A true and B false.

10mo147

Also, here's a proof that a bot is never exploited. It only cooperates when its partner provably cooperates.

First, note that , i.e. if cooperates it provably cooperates. (Proof sketch: .)

Now we show that (i.e. if chooses to cooperate, its partner is provably cooperating):

- We get by distributing.
- We get by applying internal necessitation to .
- By (1) and (2), .

(PS: we can strengthen this to , by noticing that .)

210mo

This is cool and (fwiw to other readers) correct. I must reflect on what it means for real world cooperation... I especially like the A <-> []X -> [][]X <-> []A trick.

610mo

A point of confusion: is it enough to prove that A→□B? What about A→B? I'm not sure I can say this well, but here goes:
We might not be able to prove A→B in the theory, or even that ¬□⊥ (which would mean "there are no proofs of inconsistency"). But if we believe our theory is sound, we believe that it can't show that a copy of itself proves something false.
So A→□B tells us that if A is true, the theory would show that a copy of itself proves B is true. And this is enough to convince us that we can't simultaneously have A true and B false.

If I follow what you mean, we can derive:

So there's a Löbian proof, in which the provability is self-fulfilling. But this isn't sufficient to avoid this kind of proof.

(Aside on why I don't like the Löbian method: I moreso want the agents to be doing "correct" counterfactual reasoning about how their actions affect their opponent, and to cooperate because they see that mutual cooperation is possible and then choose it. The Löbian proof style isn't a good model of that, imo.)

(In fact, I think "assume they will think I cooperate" turns out to be too strong, and leads to unnecessary defection. I'm still working through the details.)

010mo

So, ideally you would like to assume only
1. □A→B
2. □B→A
and conclude A and B ?
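One way to sanity-check this kind of inference is a brute-force search over small Kripke frames. The sketch below assumes the standard Kripke semantics for provability logic (□φ holds at a world when φ holds at every successor, on finite transitive irreflexive frames); the frames and function names are hypothetical and chosen only for illustration. It checks that whenever □A→B and □B→A hold at every world, A and B hold at every world too.

```python
from itertools import product


def box(truth, successors, world):
    # Box-phi holds at `world` iff phi holds at every successor world.
    return all(truth[v] for v in successors[world])


def premises_force_conclusion(successors):
    """Check every valuation of A and B on the frame.

    Returns True if, whenever both premises hold at all worlds,
    A and B also hold at all worlds.
    """
    n = len(successors)
    worlds = range(n)
    for bits in product([False, True], repeat=2 * n):
        a, b = bits[:n], bits[n:]
        premises_hold = all(
            (not box(a, successors, w) or b[w])
            and (not box(b, successors, w) or a[w])
            for w in worlds
        )
        if premises_hold and not all(a[w] and b[w] for w in worlds):
            return False
    return True


# Transitive, irreflexive frames (the kind assumed for provability logic).
chain = {0: [1, 2], 1: [2], 2: []}
fork = {0: [1, 2], 1: [], 2: []}
# A reflexive single world, included to show the frame condition matters.
reflexive_point = {0: [0]}

print(premises_force_conclusion(chain))
print(premises_force_conclusion(fork))
print(premises_force_conclusion(reflexive_point))
```

On the well-founded frames the check succeeds by the usual induction from terminal worlds upward, where □A and □B hold vacuously. On the reflexive frame it fails (take A and B both false), which is the semantic shadow of the Löbian step being essential.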

710mo

Nice, I like this proof also. Maybe there's a clearer way to say this, but your "unrolling one step" corresponds to my going from u to v. We somehow need to "look two possible worlds deep".

10mo57

In case this helps folks' intuition, my generating idea was something like: "look at what my opponent is thinking, but assume that whenever they check if I cooperate, they think the answer is yes". This is how we break the loop.

This results in a proof of the lemma like so:

- (given)
- (unrolling one step)
- is straightforward intuitively, and we can get there by applying box-distributivity to

(EDIT: This is basically the same proof as in the post, but less simple. Maybe the interesting part is the "unroll once...

410mo

(In fact, I think "assume they will think I cooperate" turns out to be too strong, and leads to unnecessary defection. I'm still working through the details.)

Re (14), I guess the ideas are very similar, where the mesaoptimizer scenario is like a sharp example of the more general concept Eliezer points at, that different classes of difficulties may appear at different capability levels.

Re (15), "Fast capability gains seem likely, and may break lots of previous alignment-required invariants simultaneously", which is about how we may have reasons to expect aligned output that are brittle under rapid capability gain: your quote from Richard is just about "fast capability gain seems possible and likely", and isn't a...

Eliezer's post here is doing work left undone by the writing you cite. It is a much clearer account of how our mainline looks doomed than you'd see elsewhere, and it's frank on this point.

I think Eliezer wishes these sorts of artifacts were not just things he wrote, like this and "There is no fire alarm".

Also, re your excerpts for (14), (15), and (32), I see Eliezer as saying something meaningfully different in each case. I might elaborate under this comment.

1y74

Re (14), I guess the ideas are very similar, where the mesaoptimizer scenario is like a sharp example of the more general concept Eliezer points at, that different classes of difficulties may appear at different capability levels.

Re (15), "Fast capability gains seem likely, and may break lots of previous alignment-required invariants simultaneously", which is about how we may have reasons to expect aligned output that are brittle under rapid capability gain: your quote from Richard is just about "fast capability gain seems possible and likely", and isn't a...

maybe a reasonable path forward is to try to wring as much productivity as we can out of the passive, superhuman, quasi-oracular just-dumb-data-predictors. And avoid as much as we can ever creating closed-loop, open-ended, free-rein agents.

I should say that I do see this as a reasonable path forward! But we don't seem to be coordinating to do this, and AI researchers seem to love doing work on open-ended agents, which sucks.

Hm, regardless it doesn't really move the needle, so long as people are publishing all of their work. Developing overpowered patter...

Can you visualize an agent that is not "open-ended" in the relevant ways, but is capable of, say, building nanotech and melting all the GPUs?

In my picture most of the extra sauce you'd need on top of GPT-3 looks very agenty. It seems tricky to name "virtual worlds" in which AIs manipulate just "virtual resources" and still manage to do something like melting the GPUs.

31y

I should say that I do see this as a reasonable path forward! But we don't seem to be coordinating to do this, and AI researchers seem to love doing work on open-ended agents, which sucks.
Hm, regardless it doesn't really move the needle, so long as people are publishing all of their work. Developing overpowered pattern recognizers is similar to increasing our level of hardware overhang. People will end up using them as components of systems that aren't safe.

1y6

I agree with your point that blobs of bayes net nodes aren't very legible, but I still think neural nets are relevantly a lot less interpretable than that! I think basically all structure that limits how your AI does its thinking is helpful for alignment, and that neural nets are pessimal on this axis.

In particular, an AI system based on a big bayes net can generate its outputs in a fairly constrained and structured way, using some sort of inference algorithm that tries to synthesize all the local constraints. A neural net lacks this structure, and is ther...

31y

I'd crystallize the argument here as something like: suppose we're analyzing a neural net doing inference, and we find that its internal computation is implementing <algorithm> for Bayesian inference on <big Bayes net>. That would be a huge amount of interpretability progress, even though the "big Bayes net" part is still pretty uninterpretable.
When we use Bayes nets directly, we get that kind of step for free.
... I think that's a decent argument, and I at least partially buy it.
That said, if we compare a neural net directly to a Bayes net (as opposed to inference-on-a-Bayes-net), they have basically the same structure: both are circuits. Both constrain locality of computation.

And a final note: none of that seems to matter for my main complaint, which is that the argument in the post seems to rely on factoring "mesaoptimizer" as "stuff + another mesaoptimizer"?

If so, I can't really update on the results of the argument.

A longer reply on the points about heuristic mesaobjectives and the switch:

I will first note here that I'm not a huge fan of the concepts/story from the mesaoptimizers paper as a way of factoring reality. I struggle to map the concepts onto my own model of what's going to happen as we fumble toward AGI.

But putting that aside, and noting that my language is imprecise and confused, here is how I think about the "switch" from directly to deceptively pursuing your training objective:

- "Pursuing objective X" is an abstraction we use to think about an agent that

Two quick things to say:

(1) I think the traditional story is more that your agent pursues mostly-X while it's dumb, but then gradient descent summons something intelligent with some weird pseudo-goal Y, because this can be selected for when you reward the agent for looking like it pursues X.

(2) I'm mainly arguing that your post isn't correctly examining the effect of a speed prior. Though I also think that one or both of us are confused about what a mesaoptimizer found by gradient-descent would actually look like, which matters lots for what theoretical models apply in reality.

2y4

I think a contentious assumption you're making with this model is the value-neutral core of mesaoptimizer cognition, namely your `mesaoptimize` in the pseudocode. I think that our whole problem in practice is roughly that we don't know how to gradient-descend our way toward general cognitive primitives that have goals factored out.

A different way to point at my perceived issue: the mesaoptimizers are built out of a `mesaoptimize` primitive, which is itself a mesaoptimizer that has to be learnt. This seems to me to be not well-founded, and not actually factoring a mesaoptimizer into smaller parts.

2[anonymous]2y

I think my argument only gets stronger if you assume that the mesaobjective is a large pile of heuristics built into the mesaoptimization algorithm, since that takes up much more space.
In the traditional deceptive mesaoptimization story, the model needs to at some point switch from "pursuing objective X directly" to "pursuing objective Y indirectly by deceptively pursuing objective X". I agree that, if there isn't really a core "mesaoptimizer" that can have goals swapped out, the idea of seamlessly transitioning between the two is very unlikely, since you initially lack the heuristics for "pursuing objective Y".
I'm not sure whether you're arguing that my post fails to imply the speed prior disincentivizes deceptive mesaoptimization, or whether you're arguing that deceptive mesaoptimization isn't likely in the first place.

2y2

If the field of ML shifts towards having a better understanding of models ...

I think this would be a negative outcome, and not a positive one.

Specifically, I think it means faster capabilities progress, since ML folks might run better experiments. Or worse yet, they might better identify and remove bottlenecks on model performance.

It wasn't meant as a reply to a particular thing - mainly I'm flagging this as an AI-risk analogy I like.

On that theme, one thing "we don't know if the nukes will ignite the atmosphere" has in common with AI-risk is that the risk is from reaching new configurations (e.g. temperatures of the sort you get out of a nuclear bomb inside the Earth's atmosphere) that we don't have experience with. Which is an entirely different question than "what happens with the nukes after we don't ignite the atmosphere in a test explosion".

I like thinking about coordination from this viewpoint.

There is a nuclear analog for accident risk. A quote from Richard Hamming:

Shortly before the first field test (you realize that no small scale experiment can be done—either you have a critical mass or you do not), a man asked me to check some arithmetic he had done, and I agreed, thinking to fob it off on some subordinate. When I asked what it was, he said, "It is the probability that the test bomb will ignite the whole atmosphere." I decided I would check it myself! The next day when he came for the answers I remarked to him, "The arithmeti...

14y

I don't really know what this is meant to imply? Maybe you're answering my question of "did that happen with nukes?", but I don't think an affirmative answer means that the analogy starts to work.
I think the nukes-AI analogy is used to argue "people raced to develop nukes despite their downsides, so we should expect the same with AI"; the magnitude/severity of the accident risk is not that relevant to this argument.

5y6

First problem with this argument: there are no coherence theories saying that an agent needs to maintain the same utility function over time.

This seems pretty false to me. If you can predict in advance that some future you will be optimizing for something else, you could trade with future "you" and merge utility functions, which seems strictly better than not. (Side note: I'm pretty annoyed with all the use of "there's no coherence theorem for X" in this post.)

As a separate note, the "further out" your goal is and th...

15y

I agree that this problem is not a particularly important one, and explicitly discard it a few sentences later. I hadn't considered your objection though, and will need to think more about it.
Mind explaining why? Is this more a stylistic preference, or do you think most of them are wrong/irrelevant?
Also true if you make world states temporally extended.

5y2

Q7 (Python):

Y = lambda s: eval(s)(s)

Y('lambda s: print(f"Y = lambda s: eval(s)(s)\\nY({s!r})")')

Q8 (Python):

Not sure about the interpretation of this one. Here's a way to have it work for any fixed (python function) f:

f = 'lambda s: "\\n".join(s.splitlines()[::-1])'

go = 'lambda s: print(eval(f)(eval(s)(s)))'

eval(go)('lambda src: f"f = {f!r}\\ngo = {go!r}\\neval(go)({src!r})"')
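A small harness for checking candidate quines like the Q7 answer, offered as a sketch. The helper name `is_quine` is hypothetical, and the check assumes the program is written as two consecutive lines with no blank line between them. It runs the source in a fresh namespace, captures what it prints, and compares that against the source text. Two variants are tried: one where the inner string is an f-string, and one where it is a plain string, since `{s!r}` is only interpolated in the former.

```python
import contextlib
import io


def is_quine(source: str) -> bool:
    """Return True if running `source` prints exactly `source`."""
    captured = io.StringIO()
    with contextlib.redirect_stdout(captured):
        exec(source, {})  # fresh globals, so Y is defined inside the run
    # print() appends a trailing newline that the source text does not have.
    return captured.getvalue().rstrip("\n") == source


# Q7 with the inner string written as an f-string.
with_f_prefix = r'''Y = lambda s: eval(s)(s)
Y('lambda s: print(f"Y = lambda s: eval(s)(s)\\nY({s!r})")')'''

# The same program with a plain (non-f) inner string.
without_f_prefix = r'''Y = lambda s: eval(s)(s)
Y('lambda s: print("Y = lambda s: eval(s)(s)\\nY({s!r})")')'''

print(is_quine(with_f_prefix))
print(is_quine(without_f_prefix))
```

The f-string variant reproduces itself because `repr` of the argument re-escapes the backslash and picks single quotes, matching how the argument is written in the source. The plain-string variant prints the literal text `Y({s!r})` on its second line, so it is not a quine.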

Awesome, thanks for writing this up!

I very much like how you are giving a clear account for a mechanism like "negative reinforcement suppresses text by adding contextual information to the model, and this has more consequences than just suppressing text".

(In particular, the model isn't learning "just don't say that", it's learning "these are the things to avoid saying", which can make it easier to point at the whole cluster?)