If you disagree with something I write about AI, I want to hear! I often find myself posting mainly because I want the best available counterarguments.

John_Maxwell's Comments

AI Alignment 2018-19 Review

Typically, the problem with supervised learning is that it's too expensive to label everything we care about.

I don't think we'll create AGI without first acquiring capabilities that make supervised learning much more sample-efficient (e.g. better unsupervised methods let us better use unlabeled data, so humans no longer need to label everything they care about, and instead can just label enough data to pinpoint "human values" as something that's observable in the world--or characterize it as a cousin of some things that are observable in the world).

But if you think there are paths to AGI which don't go through more sample-efficient supervised learning, one course of action would be to promote differential technological development towards more sample-efficient supervised learning and away from deep reinforcement learning. For example, we could try & convince DeepMind and OpenAI to reallocate resources away from deep RL and towards sample efficiency. (Note: I just stumbled on this recent paper which is probably worth a careful read before considering advocacy of this type.)

In this case, are you imagining that we label some types of behaviors as good and some as bad, perhaps like what we would do with an approval directed agent?

This seems like a promising option.

AI Alignment 2018-19 Review

Value learning. Building an AI that learns all of human value has historically been thought to be very hard, because it requires you to decompose human behavior into the “beliefs and planning” part and the “values” part, and there’s no clear way to do this.

My understanding is that IRL requires this, but it's not obvious to me that supervised learning does? (It's surprising to me how little attention supervised learning has received in AI alignment circles, given that it's by far the most common way for us to teach current ML systems about our values.)

Anyway, regarding IRL: I can see how it would be harmful to make the mistake of attributing stuff to the planner which actually belongs in the values part.

  • For example, perhaps our AI observes a mother caring for her disabled child, and believes that the mother's goal is to increase her inclusive fitness in an evolutionary sense, but that the mother is irrational and is following a suboptimal strategy for doing this. So the AI executes a "better" strategy for increasing inclusive fitness which allocates resources away from the child.

However, I haven't seen a clear story for why the opposite mistake, of attributing stuff to the values part which actually belongs to the planner, would cause a catastrophe. It seems to me that in the limit, attributing all human behavior as arising from human values could end up looking something like an upload--that is, it still makes the stupid mistakes that humans make, and it might not be competitive with other approaches, but it doesn't seem to be unaligned in the sense that we normally use the term. You could make a speed superintelligence which basically values behaving as much like the humans it has observed as possible. But if this scenario is multipolar, each actor could be incentivized to spin the values/planner dial of its AI towards attributing more of human behavior to the human planner, in order to get an agent which behaves a little more rationally in exchange for a possibly lower fidelity replication of human values.

Realism about rationality

It seems to me like my position, and the MIRI-cluster position, is (1) closer to "rationality is like fitness" than "rationality is like momentum"

Eliezer is a fan of law thinking, right? Doesn't the law thinker position imply that intelligence can be characterized in a "lawful" way like momentum?

Whereas the non-MIRI cluster is saying "biologists don't need to know about evolution."

As a non-MIRI cluster person, I think deconfusion is valuable (insofar as we're confused), but I'm skeptical of MIRI because they seem more confused than average to me.

Self-Supervised Learning and AGI Safety

The term "self-supervised learning" (replacing the previous and more general term "unsupervised learning")

BTW, the way I've been thinking about it, "self-supervised learning" represents a particular way to achieve "unsupervised learning"--not sure what use is standard.

Tabooing 'Agent' for Prosaic Alignment

I think the world where H is true is a good world, because it's a world where we are much closer to understanding and predicting how sophisticated models generalize.

This seemed liked a really surprising sentence to me. If the model is an agent, doesn't that pull in all the classic concerns related to treacherous turns and so on? Whereas a non-agent probably won't have an incentive to deceive you?

Even if the model is an agent, then you still need to be able to understand its goals based on their internal representation. Which could mean, for example, understanding what a deep neural network was doing. Which doesn't appear to be much easier than the original task of "understand what a model, for example a deep neural network, is doing".

2019 AI Alignment Literature Review and Charity Comparison

just use ML to learn ethics

Can you explain more about why you think that these papers are low quality? Is it just a matter of lack of originality? Personally, I think this is a perspective that can be steelmanned pretty effectively, and gets unfairly disregarded because it's too simple or something like that. I think it's worth engaging with this perspective in depth because (a) I think it's pretty likely there's a solution to friendliness there and (b) even if there isn't, a very clear explanation of why (which anticipates as many counterarguments as possible) could be a very useful recruiting tool.

A dilemma for prosaic AI alignment

I am not sure what to think of the lack of commercial applications of RL, but I don't think it is strong evidence either way, since commercial applications involve competing with human and animal agents and RL hasn't gotten us anything as good as human or animal agents yet.

Supervised learning has lots of commercial applications, including cases where it competes with humans. The fact that RL doesn't suggests to me that if you can apply both to a problem, RL is probably an inferior approach.

Another way to think about it: If superhuman performance is easier with supervised learning than RL, that gives us some evidence about the relative strengths of each approach.

Agent-like architectures are simple yet powerful ways of achieving arbitrary things, because for almost any thing you wish achieved, you can insert it into the "goal" slot of the architecture and then let it loose, and it'll make good progress even in a very complex environment. (I'm comparing agent-like architectures to e.g. big lists of heuristics, or decision trees, or look-up tables, all of which have complexity that increases really fast as the environment becomes more complex. Maybe there is some other really powerful yet simple architecture I'm overlooking?)

I'm not exactly sure what you mean by "architecture" here, but maybe "simulation", or "computer program", or "selection" (as opposed to control) could satisfy your criteria? IMO, attaining understanding and having ideas aren't tasks that require an agent architecture -- it doesn't seem most AI applications in these categories make use of agent architectures -- and if we could do those things safely, we could make AI research assistants which make remaining AI safety problems easier.

Aren't the 3.5 bullet points above specific examples of how 'predict the next word in this text' could benefit from -- in the sense of produce, when used as training signal

I do think these are two separate questions. Benefit from = if you take measures to avoid agentlike computation, that creates a significant competitiveness penalty above and beyond whatever computation is necessary to implement your measures (say, >20% performance penalty). Produce when used as a training signal = it could happen by accident, but if that accident fails to happen, there's not necessarily a loss of competitiveness. An example would be bullet point 2, which is an accident that I suspect would harm competitiveness. Bullet points 3 and 3.5 are also examples of unintended agency, not answers to the question of why text prediction benefits from an agent architecture. (Note: If you don't mind, let's standardize on using "agent architecture" to only refer to programs which are doing agenty things at the toplevel, so bullet points 2, 3, and 3.5 wouldn't qualify--maybe they are agent-like computation, but they aren't descriptions of agent-like software architectures. For example, in bullet point 2 the selection process that leads to the agent might be considered part of the architecture, but the agent which arose out of the selection process probably wouldn't.)

How would you surmount bullet point 3?

Hopefully I'll get around to writing a post about that at some point, but right now I'm focused on generating as many concrete plausible scenarios around accidentally agency as possible, because I think not identifying a scenario and having things blow up in an unforseen way is a bigger risk than having all safety measures fail on a scenario that's already been anticipated. So please let me know if you have any new concrete plausible scenarios!

In any case, note that issues with the universal prior seem to be a bit orthogonal to the agency vs unsupervised discussion -- you can imagine agent architectures that make use of it, and non-agent architectures that don't.

A dilemma for prosaic AI alignment

I think there is a bit of a motte and bailey structure to our conversation. In your post above, you wrote: "to be competitive prosaic AI safety schemes must deliberately create misaligned mesa-optimizers" (emphasis mine). And now in bullet point 2, we have (paraphrase) "maybe if you had a really weird/broken training scheme where it's possible to sabotage rival subnetworks, agenty things get selected for somehow [probably in a way that makes the system as a whole less competitive]". I realize this is a bit of a caricature, and I don't mean to call you out or anything, but this is a pattern I've seen in AI safety discussions and it seemed worth flagging.

Anyway, I think there is a discussion worth having here because most people in AI safety seem to assume RL is the thing, and RL has an agent style architecture, which seems like a pretty strong inductive bias towards mesa-optimizers. Non-RL stuff seem like a relatively unknown quantity where mesa-optimizers are concerned, and thus worth investigating, and additionally, even RL will plausibly have non-RL stuff as a subcomponent of its cognition, so still useful to know how to do non-RL stuff in a mesa-optimizer free way (so the RL agent doesn't get pwned by its own cognition).

Agent-like architectures are simple yet powerful ways of achieving arbitrary things

Why do you think that's true? I think the lack of commercial applications of reinforcement learning is evidence against this. From my perspective, RL has been a huge fad and people have been trying to shoehorn it everywhere, yet they're coming up empty handed.

Can you get more specific about how "predict the next word in this text" could benefit from an agent architecture? (Or even better, can you support your original strong claim and explain how the only way to achieve predictive performance on "predict the next word in this text" is through deliberate creation of a misaligned mesa-optimizer?)

Bullet point 3 is one of the more plausible things I've heard -- but it seems fairly surmountable.

Inductive biases stick around

I think we would benefit from tabooing the word "simple". It seems to me that when people use the word "simple" in the context of ML, they are usually referring to either smoothness/Lipschitzness or minimum description length. But it's easy to see that these metrics don't always coincide. A random walk is smooth, but its minimum description length is long. A tall square wave is not smooth, but its description length is short. L2 regularization makes a model smoother without reducing its description length. Quantization reduces a model's description length without making it smoother. I'm actually not aware of any argument that smoothness and description length are or should be related--it seems like this might be an unexamined premise.

Based on your paper, the argument for mesa-optimizers seems to be about description length. But if SGD's inductive biases target smoothness, it's not clear why we should expect SGD to discover mesa-optimizers. Perhaps you think smooth functions tend to be more compressible than functions which aren't smooth. I don't think that's enough. Imagine a Venn diagram where compressible functions are a big circle. Mesa-optimizers are a subset, and the compressible functions discovered by SGD are another subset. The question is whether these two subsets are overlapping. Pointing out that they're both compressible is not a strong argument for overlap: "all cats are mammals, and all dogs are mammals, so therefore if you see a cat, it's also likely to be a dog".

When I read your paper, I get a sense that an optimizers outperform by allowing one to collapse a lot of redundant functionality into a single general method. It seems like maybe it's the act of compression that gets you an agent, not the property of being compressible. If our model is a smooth function which could in principle be compressed using a single general method, I'm not seeing why the reapplication of that general method in a very novel context is something we should expect to happen.

BTW I actually do think minimum description length is something we'll have to contend with long term. It's just too useful as an inductive bias. (Eliminating redundancies in your cognition seems like a basic thing an AGI will need to do to stay competitive.) But I'm unconvinced SGD possesses the minimum description length inductive bias. Especially if e.g. the flat minima story is the one that's true (as opposed to e.g. the lottery ticket story).

Also, I'm less confident that what I wrote above applies to RNNs.

Load More