Sorted by New

Wiki Contributions


The Speed + Simplicity Prior is probably anti-deceptive

And a final note: none of that seems to matter for my main complaint, which is that the argument in the post seems to rely on factoring "mesaoptimizer" as "stuff + another mesaoptimizer"?

If so, I can't really update on the results of the argument.

The Speed + Simplicity Prior is probably anti-deceptive

A longer reply on the points about heuristic mesaobjectives and the switch:

I will first note here that I'm not a huge fan of the concepts/story from the mesaoptimizers paper as a way of factoring reality. I struggle to map the concepts onto my own model of what's going to happen as we fumble toward AGI.

But putting that aside, and noting that my language is imprecise and confused, here is how I think about the "switch" from directly to deceptively pursuing your training objective:

  1. "Pursuing objective X" is an abstraction we use to think about an agent that manages to robustly take actions that move in the direction of objective X
  2. We can think of an agent as "pursuing X directly" if we think that the agent will take an available option that it can tell moves toward X
  3. We can think of an agent as "pursuing X deceptively" if the agent would stop taking actions that move toward X under some change of context.
  4. Some such "deceptive" agents might be better described as "pursuing Y directly" for some Y.

So an example transition from pursing X "directly" to "deceptively" would be an agent you train to keep your diamonds safe, that eventually learns that you're judging this via cameras, and will therefore take actions that fool the cameras if they become available.

And notably I don't think your argument applies to this class of example? It at least doesn't seem like I could write down a speed prior that would actually reassure me that my diamond-keeper won't try to lie to me.

The Speed + Simplicity Prior is probably anti-deceptive

Two quick things to say:

(1) I think the traditional story is more that your agent pursues mostly-X while it's dumb, but then gradient descent summons something intelligent with some weird pseudo-goal Y, because this can be selected for when you reward the agent for looking like it pursues X.

(2) I'm mainly arguing that your post isn't correctly examining the effect of a speed prior. Though I also think that one or both of us are confused about what a mesaoptimizer found by gradient-descent would actually look like, which matters lots for what theoretical models apply in reality.

The Speed + Simplicity Prior is probably anti-deceptive

I think a contentious assumption you're making with this model is the value-neutral core of mesaoptimizer cognition, namely your mesaoptimize in the pseudocode. I think that our whole problem in practice is roughly that we don't know how to gradient-descend our way toward general cognitive primitives that have goals factored out.

A different way to point at my perceived issue: the mesaoptimizers are built out of a mesaoptimize primitive, which is itself is a mesaoptimizer that has to be learnt. This seems to me to be not well-founded, and not actually factoring a mesaoptimizer into smaller parts.

A Longlist of Theories of Impact for Interpretability

If the field of ML shifts towards having a better understanding of models ...

I think this would be a negative outcome, and not a positive one.

Specifically, I think it means faster capabilities progress, since ML folks might run better experiments. Or worse yet, they might better identify and remove bottlenecks on model performance.

AI Alignment Open Thread August 2019

It wasn't meant as a reply to a particular thing - mainly I'm flagging this as an AI-risk analogy I like.

On that theme, one thing "we don't know if the nukes will ignite the atmosphere" has in common with AI-risk is that the risk is from reaching new configurations (e.g. temperatures of the sort you get out of a nuclear bomb inside the Earth's atmosphere) that we don't have experience with. Which is an entirely different question than "what happens with the nukes after we don't ignite the atmosphere in a test explosion".

I like thinking about coordination from this viewpoint.

AI Alignment Open Thread August 2019

There is a nuclear analog for accident risk. A quote from Richard Hamming:

Shortly before the first field test (you realize that no small scale experiment can be done—either you have a critical mass or you do not), a man asked me to check some arithmetic he had done, and I agreed, thinking to fob it off on some subordinate. When I asked what it was, he said, "It is the probability that the test bomb will ignite the whole atmosphere." I decided I would check it myself! The next day when he came for the answers I remarked to him, "The arithmetic was apparently correct but I do not know about the formulas for the capture cross sections for oxygen and nitrogen—after all, there could be no experiments at the needed energy levels." He replied, like a physicist talking to a mathematician, that he wanted me to check the arithmetic not the physics, and left. I said to myself, "What have you done, Hamming, you are involved in risking all of life that is known in the Universe, and you do not know much of an essential part?" I was pacing up and down the corridor when a friend asked me what was bothering me. I told him. His reply was, "Never mind, Hamming, no one will ever blame you."

Coherent behaviour in the real world is an incoherent concept
First problem with this argument: there are no coherence theories saying that an agent needs to maintain the same utility function over time.

This seems pretty false to me. If you can predict in advance that some future you will be optimizing for something else, you could trade with future "you" and merge utility functions, which seems strictly better than not. (Side note: I'm pretty annoyed with all the use of "there's no coherence theorem for X" in this post.)

As a separate note, the "further out" your goal is and the more that your actions are for instrumental value, the more it should look like world 1 in which agents are valuing abstract properties of world states, and the less we should observe preferences over trajectories to reach said states.

(This is a reason in my mind to prefer the approval-directed-agent frame, in which humans get to inject preferences that are more about trajectories.)

Diagonalization Fixed Point Exercises

Q7 (Python):

Y = lambda s: eval(s)(s)
Y('lambda s: print("Y = lambda s: eval(s)(s)\\nY({s!r})")')

Q8 (Python):

Not sure about the interpretation of this one. Here's a way to have it work for any fixed (python function) f:

f = 'lambda s: "\\n".join(s.splitlines()[::-1])'

go = 'lambda s: print(eval(f)(eval(s)(s)))'

eval(go)('lambda src: f"f = {f!r}\\ngo = {go!r}\\neval(go)({src!r})"')