Ofer Givoli

Send me anonymous feedback: https://docs.google.com/forms/d/e/1FAIpQLScLKiFJbQiuRYBhrBbVYUo_c6Xf0f8DN_blbfpJ-2Ml39g1zA/viewform

Any type of feedback is welcome, including arguments that a post/comment I wrote is net negative.

Some quick info about me:

I have a background in computer science (BSc+MSc; my MSc thesis was in NLP and ML, though not in deep learning).

You can also find me on the EA Forum.

Feel free to reach out by sending me a PM here or on my website.


Formal Solution to the Inner Alignment Problem

To extend Evan's comment about coordination between deceptive models: Even if the deceptive models lack relevant game theoretical mechanisms, they may still coordinate due to being (partially) aligned with each other. For example, a deceptive model X may prefer [some other random deceptive model seizing control] over [model X seizing control with probability 0.1% and the entire experiment being terminated with probability 99.9%].

Why should we assume that the deceptive models will be sufficiently misaligned with each other such that this will not be an issue? Do you have intuitions about the degree of misalignment between huge neural networks that were trained to imitate demonstrators but ended up being consequentialists that care about the state of our universe?

Short summary of mAIry's room

The topic of risks related to morally relevant computations seems very important, and I hope a lot more work will be done on it!

My tentative intuition is that learning is not directly involved here. If the weights of a trained RL agent are no longer being updated after some point[1], my intuition is that the model is similarly likely to experience pain before and after that point (assuming the environment stays the same).

Consider the following hypothesis, which does not involve a direct relationship between learning and pain: At sufficiently large scale (and in sufficiently complex environments), TD learning tends to create components within the network, call them "evaluators", that evaluate certain metrics that correlate with expected return. In practice the model is trained to optimize directly for the output of the evaluators (and maximizing the output of the evaluators becomes the mesa objective). Suppose we label possible outputs of the evaluators with "pain" and "pleasure". We get something that seems analogous to humans. A human cares directly about pleasure and pain (which are things that were correlated with expected evolutionary fitness in the ancestral environment), even when those things don't affect their evolutionary fitness accordingly (e.g. pleasure from eating chocolate, and pain from getting a vaccine shot).

  1. In TD learning, if from some point the model always perfectly predicted the future, the gradient would always be zero and no weights would be updated. Also, if an already-trained RL agent is being deployed, and there's no longer reinforcement learning going on after deployment (which seems like a plausible setup in products/services that companies sell to customers), the weights would obviously not be updated. ↩︎
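The first claim in the footnote can be illustrated with a minimal sketch, assuming a tabular TD(0) setting on a deterministic chain of states. The function names, rewards, and discount factor below are hypothetical and chosen only for illustration. When the value estimates exactly match the discounted future rewards, every TD error is zero, so an update proportional to the TD error changes nothing.

```python
def true_values(rewards, gamma):
    """Exact discounted return from each state of a deterministic chain."""
    values = [0.0] * len(rewards)
    running = 0.0
    # Work backwards from the last state: V(s) = r(s) + gamma * V(s + 1).
    for s in reversed(range(len(rewards))):
        running = rewards[s] + gamma * running
        values[s] = running
    return values


def td_errors(values, rewards, gamma):
    """TD(0) error r + gamma * V(s') - V(s) for each state in the chain."""
    errors = []
    for s, r in enumerate(rewards):
        # The state after the last one is terminal and has value 0.
        next_value = values[s + 1] if s + 1 < len(values) else 0.0
        errors.append(r + gamma * next_value - values[s])
    return errors


# Hypothetical toy numbers (assumptions for illustration only).
rewards = [1.0, 0.0, 2.0]
gamma = 0.5

perfect = true_values(rewards, gamma)
print(td_errors(perfect, rewards, gamma))          # perfect predictions
print(td_errors([0.0, 0.0, 0.0], rewards, gamma))  # untrained predictions
```

With perfect predictions each error is zero, so a TD update of the form `V(s) += alpha * error` leaves the estimates unchanged. With the all-zero estimates the errors are nonzero and learning would continue.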

Birds, Brains, Planes, and AI: Against Appeals to the Complexity/Mysteriousness/Efficiency of the Brain

My understanding is that the 2020 algorithms in Ajeya Cotra's draft report refer to algorithms that train a neural network on a given architecture (rather than algorithms that search for a good neural architecture etc.). So the only "special sauce" that can be found by such algorithms is one that corresponds to special weights of a network (rather than special architectures etc.).

Birds, Brains, Planes, and AI: Against Appeals to the Complexity/Mysteriousness/Efficiency of the Brain

Great post!

we’ll either have to brute-force search for the special sauce like evolution did

I would drop the "brute-force" here (evolution is not a random/naive search).

Re the footnote:

This "How much special sauce is needed?" variable is very similar to Ajeya Cotra's variable "how much compute would lead to TAI given 2020's algorithms."

I don't see how they are similar.

Why I'm excited about Debate

One might argue:

We don't need the model to use that much optimization power, to the point where it breaks the operator. We just need it to perform roughly at human-level, and then we can just deploy many instances of the trained model and accomplish very useful things (e.g. via factored cognition).

So I think it's important to also note that getting a neural network to "perform roughly at human-level in an aligned manner" may be a much harder task than getting a neural network to achieve maximal rating by breaking the operator. The former may be a much narrower target. This point is closely related to what you wrote here in the context of amplification:

Speaking of inexact imitation: It seems to me that having an AI output a high-fidelity imitation of human behavior, sufficiently high-fidelity to preserve properties like "being smart" and "being a good person" and "still being a good person under some odd strains like being assembled into an enormous Chinese Room Bureaucracy", is a pretty huge ask.

It seems to me obvious, though this is the sort of point where I've been surprised about what other people don't consider obvious, that in general exact imitation is a bigger ask than superior capability. Building a Go player that imitates Shuusaku's Go play so well that a scholar couldn't tell the difference, is a bigger ask than building a Go player that could defeat Shuusaku in a match. A human is much smarter than a pocket calculator but would still be unable to imitate one without using a paper and pencil; to imitate the pocket calculator you need all of the pocket calculator's abilities in addition to your own.

Correspondingly, a realistic AI we build that literally passes the strong version of the Turing Test would probably have to be much smarter than the other humans in the test, probably smarter than any human on Earth, because it would have to possess all the human capabilities in addition to its own. Or at least all the human capabilities that can be exhibited to another human over the course of however long the Turing Test lasts. [...]

Gradient hacking

It does seem useful to make the distinction between thinking about what gradient hacking failures look like in worlds where they cause an existential catastrophe, and thinking about how to best pursue empirical research today about gradient hacking.

Gradient hacking

Some of the networks that have an accurate model of the training process will stumble upon the strategy of failing hard if SGD would reward any other competing network

I think the part in bold should instead be something like "failing hard if SGD would (not) update weights in such and such way". (SGD is a local search algorithm; it gradually improves a single network.)

This strategy seems more complicated, so is less likely to randomly exist in a network, but it is very strongly selected for, since at least from an evolutionary perspective it appears like it would give the network a substantive advantage.

As I already argued in another thread, the idea is not that SGD creates the gradient hacking logic specifically (in case this is what you had in mind here). As an analogy, consider a human who decides to 1-box in Newcomb's problem (which is related to the idea of gradient hacking, because the human decides to 1-box in order to have the property of "being a person that 1-boxes", because having that property is instrumentally useful). The specific strategy to 1-box is not selected for by human evolution, but rather general problem-solving capabilities were (and those capabilities resulted in the human coming up with the 1-box strategy).

Gradient hacking

My point was that there's no reason that SGD will create specifically "deceptive logic" because "deceptive logic" is not privileged over any other logic that involves modeling the base objective and acting according to it. But I now think this isn't always true - see the edit block I just added.

Gradient hacking

"deceptive logic" is probably a pretty useful thing in general for the model, because it helps improve performance as measured through the base-objective.

But you can similarly say this for the following logic: "check whether 1+1<4 and if so, act according to the base objective". Why is SGD more likely to create "deceptive logic" than this simpler logic (or any other similar logic)?

[EDIT: actually, this argument doesn't work in a setup where the base objective corresponds to a sufficiently long time horizon during which it is possible for humans to detect misalignment and terminate/modify the model (in a way that is harmful with respect to the base objective).]

So my understanding is that deceptive behavior is a lot more likely to arise from general-problem-solving logic, rather than SGD directly creating "deceptive logic" specifically.

Gradient hacking

I think that if SGD makes the model slightly deceptive it's because it made the model slightly more capable (better at general problem solving etc.), which allowed the model to "figure out" (during inference) that acting in a certain deceptive way is beneficial with respect to the mesa-objective.

This seems to me a lot more likely than SGD creating specifically "deceptive logic" (i.e. logic that can't do anything generally useful other than finding ways to perform better on the mesa-objective by being deceptive).
