Adam Shimi

Half-researcher, half-distiller (see https://distill.pub/2017/research-debt/), both in AI Safety. Funded, with a PhD in theoretical computer science (distributed computing).

If you're interested in some research ideas that you see in my posts, know that I probably have many private docs that are complete and in the process of getting feedback (because for my own work, the AF has proved mostly useless in terms of feedback https://www.lesswrong.com/posts/rZEiLh5oXoWPYyxWh/adamshimi-s-shortform?commentId=4ZciJDznzGtimvPQQ). I can give you access if you PM me!

Sequences

Reviews for the Alignment Forum
AI Alignment Unwrapped
Understanding Goal-Directedness
Toying With Goal-Directedness

Comments

Mundane solutions to exotic problems

Sorry about that. I corrected it but it was indeed the first link you gave.

AMA: Paul Christiano, alignment researcher

Copying my question from your post about your new research center (because I'm really interested in the answer): which part (if any) of theoretical computer science do you expect to be particularly useful for alignment?

Coherence arguments imply a force for goal-directed behavior

Yeah, this is an accurate portrayal of my views. I'd also note that the project of mapping internal concepts to mathematical formalisms was the main goal of the whole era of symbolic AI, and failed badly. (Although the analogy is a little loose, so I wouldn't take it as a decisive objection, but rather a nudge to formulate a good explanation of what they were doing wrong that you will do right.)

My first intuition is that I expect mapping internal concepts to mathematical formalisms to be easier when the end goal is deconfusion and making sense of behaviors, compared to actually improving capabilities. But I'd have to think about it some more. Thanks, in any case, for an interesting test to apply to my attempt.

I don't think this is an accurate portrayal of my views. I am trying to say that utility functions are a bad abstraction for reasoning about AGI, for similar reasons to why health points are a bad abstraction for reasoning about livers. (I think I agree with the rest of the paragraph though.)

Okay, do you mean that you agree with my paragraph, but that what you're really arguing is that utility functions don't care about the low-level internals of the system, and that's why they're bad abstractions? (That's how I understand your liver and health points example.)

Coherence arguments imply a force for goal-directed behavior

Analogously, it seems very hard to have a good understanding of goals without talking about concepts, instincts, desires, etc, and the roles that all of these play within cognition as a whole - concepts which people just don't talk about much around here. I hypothesise that this is partly because they think they can talk about utilities instead. But when people reason about how to design AGIs in terms of utilities, on the basis of coherence theorems, then I think they're making a very similar mistake as a doctor who tries to design artificial livers based on the theoretical triage virtues of HP.

I agree more and more with you that the big mistake with using utility functions/reward for thinking about goal-directedness is not so much that they are a bad abstraction, but that they are often used as if every utility function were as meaningful as any other. Here "meaningful" comes from thinking about cognition and what following such a utility function would entail. There's a pretty intuitive sense in which a utility function that encodes exactly one trajectory and nothing else, for a complex enough setting, doesn't look like a goal (see the toy example below).
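To make this concrete, here is a minimal sketch of what I mean by a trajectory-encoding utility function (my own toy formalization; the trajectory $\tau^*$ and trajectory set $T$ are introduced purely for illustration): for a fixed $\tau^* \in T$,

$$U(\tau) = \begin{cases} 1 & \text{if } \tau = \tau^* \\ 0 & \text{otherwise,} \end{cases}$$

which is a perfectly well-defined utility function whose maximizer simply replays $\tau^*$, yet it says nothing about concepts, desires, or how the system decides anything.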

A difference between us, I think, is that I expect we can add structure that restricts the set of utility functions we consider (structure that comes, among other things, from thinking about cognition), such that maximizing expected utility for such a constrained utility function would actually capture most if not all of the aspects of goal-directedness that matter to us.

My internal model of you is that you believe this approach would not be enough, because the utility would not be defined on the internal concepts of the agent. Yet I think it doesn't so much have to be defined on these internal concepts itself as to rely on some assumptions about them. So either we adapt the state space and action space, or we keep fixed spaces but add mappings/equivalence classes/metrics on them that encode the relevant assumptions about cognition (rough sketch below).
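As a rough sketch of the second option (my own illustrative notation, not a worked-out proposal): keep the state space $S$ fixed, but posit an abstraction map onto equivalence classes that encode the cognitive assumptions,

$$f : S \to \bar{S} = S/\sim, \qquad \bar{U} : \bar{S} \to \mathbb{R}, \qquad U(s) := \bar{U}(f(s)),$$

so the utility itself never mentions the agent's internals; the assumptions about cognition live in the equivalence relation $\sim$ and the map $f$.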

Announcing the Alignment Research Center

This is so great! I always hate wishing people luck when I trust in their competence to mostly deal with bad luck and leverage good luck. I'll use that one now.

Announcing the Alignment Research Center

Sounds really exciting! I'm wondering what kind of theoretical computer science you have in mind specifically. Which part of it do you think has the most uses for alignment? (Still trying to find a way to use my PhD in the theory of distributed computing for something alignment-related ^^)

Gradations of Inner Alignment Obstacles

Agreed, it depends on the training process.

Gradations of Inner Alignment Obstacles

Now, according to ELH, we might expect that in order to learn deceptive or non-deceptive behavior we start with an NN big enough to represent both as hypotheses (within the random initialization).

But if our training method (for part (2) of the basin plan) only works under the assumption that no deceptive behavior is present yet, then it seems we can't get started.

This argument is obviously a bit sloppy, though.

I guess the crux here is how much deceptiveness you need before the training method is hijacked. My intuition is that you need to be relatively competent at deceptiveness, because the standard argument for why, say, SGD will make good deceptive models more deceptive is that making them less deceptive would mean a bigger loss, and so it pushes towards more deception.

On the other hand, if there's just a tiny probability or tiny part of deception in the model (not sure exactly what this means), then I expect that there are small updates that SGD can make that don't make the model more deceptive (and maybe make it less deceptive) and yet reduce the loss. That's the intuition that to learn that lying is a useful strategy, you must actually be "good enough" at lying (maybe by accident) to gain from it and adapt to it. I have friends who really suck at lying, and for them trying to be deceptive is just not worth it (even if they wanted to). (A toy sketch of this threshold intuition is below.)
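Here is a toy numerical sketch of that intuition (entirely my own construction, not from the post; the loss function, the THRESHOLD constant, and the "deceptiveness" parameter d are all made up for illustration): below some competence threshold, lying adds loss, so gradient descent pushes deceptiveness down; above it, the sign flips and deception gets reinforced.

```python
import numpy as np

# Toy illustration only: 'd' is a "deceptiveness" parameter, 'c' a generic
# capability parameter. Clumsy lies are assumed to be costly; competent lies
# are assumed to pay off.
THRESHOLD = 0.5

def loss(d, c):
    honest_loss = (1.0 - c) ** 2                     # loss from lack of capability
    deception_term = np.where(d > THRESHOLD,
                              -(d - THRESHOLD),      # competent lies reduce loss
                              d)                     # clumsy lies increase loss
    return honest_loss + deception_term

def grad_d(d, c, eps=1e-4):
    # Numerical gradient of the loss with respect to deceptiveness.
    return (loss(d + eps, c) - loss(d - eps, c)) / (2 * eps)

# Below the threshold the gradient on d is positive, so gradient *descent*
# reduces deceptiveness; above it the gradient flips sign and SGD reinforces it.
for d in (0.1, 0.4, 0.6, 0.9):
    print(d, grad_d(d, 0.5))
```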

If deceptiveness actually needs to be strong already for this issue to arise, then I don't think your ELH argument points to a problem, because I don't see why deceptiveness should dominate that early.

Where are intentions to be found?

I have two reactions while reading this post:

  • First, even if we say that a given human (for example) at a fixed point in time doesn't necessarily contain everything we would want the AI to learn, a lot of alignment failures might already disappear if the AI only learns what's in there. For example, paperclip maximizers are probably ruled out by taking one human's values at a point in time and extrapolating. But that clearly doesn't help with scenarios where the AI does the sort of bad things humans can do.
  • Second, I would argue that in the you of the past, there might actually be enough information to encode, if not the you of now, at least better and better versions of you through interactions with the environment. Said another way, I feel like what we're pointing at when we're pointing at a human is the normativity of human values, including how they evolve, and how we think about how they evolve, and so on recursively. So I think you might actually get all the information you want from this part of space if the AI captures the process behind rethinking our values and ideas.