Sorted by New

Wiki Contributions


Re (14), I guess the ideas are very similar, where the mesaoptimizer scenario is like a sharp example of the more general concept Eliezer points at, that different classes of difficulties may appear at different capability levels.

Re (15), "Fast capability gains seem likely, and may break lots of previous alignment-required invariants simultaneously", which is about how we may have reasons to expect aligned output that are brittle under rapid capability gain: your quote from Richard is just about "fast capability gain seems possible and likely", and isn't about connecting that to increased difficulty in succeeding at the alignment problem?

Re (32), I don't think your quote isn't talking about the thing Eliezer is talking about, which is that in order to be human level at modelling human-generated text, your AI must be doing something on par with human thought that figures out what humans would say. Your quote just isn't discussing this, namely that strong imitation requires cognition that is dangerous.

So I guess I don't take much issue with (14) or (15), but I think you're quite off the mark about (32). In any case, I still have a strong sense that Eliezer is successfully being more on the mark here than the rest of us manage. Kudos of course to you and others that are working on writing things up and figuring things out. Though I remain sympathetic to Eliezer's complaint.

Eliezer's post here is doing work left undone by the writing you cite. It is a much clearer account of how our mainline looks doomed than you'd see elsewhere, and it's frank on this point.

I think Eliezer wishes these sorts of artifacts were not just things he wrote, like this and "There is no fire alarm".

Also, re your excerpts for (14), (15), and (32), I see Eliezer as saying something meaningfully different in each case. I might elaborate under this comment.

maybe a reasonable path forward is to try to wring as much productivity as we can out of the passive, superhuman, quasi-oracular just-dumb-data-predictors. And avoid as much as we can ever creating closed-loop, open-ended, free-rein agents.

I should say that I do see this as a reasonable path forward! But we don't seem to be coordinating to do this, and AI researchers seem to love doing work on open-ended agents, which sucks.

Hm, regardless it doesn't really move the needle, so long as people are publishing all of their work. Developing overpowered pattern recognizers is similar to increasing our level of hardware overhang. People will end up using them as components of systems that aren't safe.

Can you visualize an agent that is not "open-ended" in the relevant ways, but is capable of, say, building nanotech and melting all the GPUs?

In my picture most of the extra sauce you'd need on top of GPT-3 looks very agenty. It seems tricky to name "virtual worlds" in which AIs manipulate just "virtual resources" and still manage to do something like melting the GPUs.

I agree with your point that blobs of bayes net nodes aren't very legible, but I still think neural nets are relevantly a lot less interpretable than that! I think basically all structure that limits how your AI does its thinking is helpful for alignment, and that neural nets are pessimal on this axis.

In particular, an AI system based on a big bayes net can generate its outputs in a fairly constrained and structured way, using some sort of inference algorithm that tries to synthesize all the local constraints. A neural net lacks this structure, and is thereby basically unconstrained in the type of work it's allowed to perform.

All else equal, more structure in your AI should mean less room for dangerous computations, and lower the surface area you need to inspect.

And a final note: none of that seems to matter for my main complaint, which is that the argument in the post seems to rely on factoring "mesaoptimizer" as "stuff + another mesaoptimizer"?

If so, I can't really update on the results of the argument.

A longer reply on the points about heuristic mesaobjectives and the switch:

I will first note here that I'm not a huge fan of the concepts/story from the mesaoptimizers paper as a way of factoring reality. I struggle to map the concepts onto my own model of what's going to happen as we fumble toward AGI.

But putting that aside, and noting that my language is imprecise and confused, here is how I think about the "switch" from directly to deceptively pursuing your training objective:

  1. "Pursuing objective X" is an abstraction we use to think about an agent that manages to robustly take actions that move in the direction of objective X
  2. We can think of an agent as "pursuing X directly" if we think that the agent will take an available option that it can tell moves toward X
  3. We can think of an agent as "pursuing X deceptively" if the agent would stop taking actions that move toward X under some change of context.
  4. Some such "deceptive" agents might be better described as "pursuing Y directly" for some Y.

So an example transition from pursing X "directly" to "deceptively" would be an agent you train to keep your diamonds safe, that eventually learns that you're judging this via cameras, and will therefore take actions that fool the cameras if they become available.

And notably I don't think your argument applies to this class of example? It at least doesn't seem like I could write down a speed prior that would actually reassure me that my diamond-keeper won't try to lie to me.

Two quick things to say:

(1) I think the traditional story is more that your agent pursues mostly-X while it's dumb, but then gradient descent summons something intelligent with some weird pseudo-goal Y, because this can be selected for when you reward the agent for looking like it pursues X.

(2) I'm mainly arguing that your post isn't correctly examining the effect of a speed prior. Though I also think that one or both of us are confused about what a mesaoptimizer found by gradient-descent would actually look like, which matters lots for what theoretical models apply in reality.

I think a contentious assumption you're making with this model is the value-neutral core of mesaoptimizer cognition, namely your mesaoptimize in the pseudocode. I think that our whole problem in practice is roughly that we don't know how to gradient-descend our way toward general cognitive primitives that have goals factored out.

A different way to point at my perceived issue: the mesaoptimizers are built out of a mesaoptimize primitive, which is itself is a mesaoptimizer that has to be learnt. This seems to me to be not well-founded, and not actually factoring a mesaoptimizer into smaller parts.

If the field of ML shifts towards having a better understanding of models ...

I think this would be a negative outcome, and not a positive one.

Specifically, I think it means faster capabilities progress, since ML folks might run better experiments. Or worse yet, they might better identify and remove bottlenecks on model performance.

Load More