AI ALIGNMENT FORUM

Jeremy Gillen

I'm interested in doing in-depth dialogues to find cruxes. Message me if you are interested in doing this.

I do alignment research, mostly stuff that is vaguely agent foundations. Currently doing independent alignment research on ontology identification. Formerly on Vivek's team at MIRI.

Comments
Foom & Doom 2: Technical alignment is hard
Jeremy Gillen · 21d · 10

> In this situation, I think a reasonable person who actually values integrity in this way (we could name some names) would be pretty reasonable or would at least note that they wouldn't robustly pursue the interests of the developer. That's not to say they would necessarily align their successor, but I think they would try to propagate their nonconsequentialist preferences due to these instructions.

Yes, agreed. The extra machinery and assumptions you describe seem sufficient to make sure nonconsequentialist preferences are passed to a successor.

> I think an actually high integrity person/AI doesn't search for loopholes or want to search for loopholes.

If I try to condition on the assumptions that you're using (which I think include a central part of the AI's preferences having a true-but-maybe-approximate pointer toward the instruction-giver's preferences, and also involve a desire to defer, or at least to flag relevant preference differences), then I agree that such an AI would not search for loopholes on the object level.

I'm not sure whether you missed the straightforward point I was trying to make about searching for loopholes, or whether you understand it and are trying to point at a scenario more relevant to your models. The straightforward point was that preference-like objects need to be robust to search. Your response reads as "imagine we have a bunch of higher-level preferences and protective machinery that are already robust to optimisation; then on the object level these can reduce the need for robustness". This is locally valid.
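
As a concrete illustration of that straightforward point, here's a minimal toy sketch (a contrived Gaussian proxy example I made up, not anything from your setup): when a preference is only represented by a proxy score that imperfectly tracks what we actually care about, then the harder the search over options, the further the selected option drifts from what the score suggests.

```python
# Toy illustration (hypothetical setup): "true" is what we actually care about,
# "proxy" is an imperfect preference-like representation of it. Searching harder
# (more options, then argmax on the proxy) widens the gap between the proxy score
# of the selected option and its true value.
import numpy as np

rng = np.random.default_rng(0)

def select_best(n_options):
    true = rng.normal(size=n_options)           # what we actually care about
    proxy = true + rng.normal(size=n_options)   # imperfect representation of it
    best = np.argmax(proxy)                     # optimisation pressure on the proxy
    return proxy[best], true[best]

for n in [10, 1_000, 100_000]:
    picks = [select_best(n) for _ in range(200)]
    proxy_vals, true_vals = zip(*picks)
    print(f"options searched: {n:>7}  "
          f"mean proxy score of pick: {np.mean(proxy_vals):.2f}  "
          f"mean true value of pick: {np.mean(true_vals):.2f}")
```

(The specific numbers don't matter; the point is that the gap between the proxy and the true value grows with how hard the proxy is optimised, which is why the higher-level preferences and protective machinery need to be robust to that same pressure.)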

I don't think it's relevant, because we don't know how to build those higher-level preferences and protective machinery in a way that is itself very robust to the OOD push that comes from scaling up intelligence, learning, self-correcting biases, and increased option-space.

> (I don't think disgust is an example of a deontological constraint, it's just an obviously unendorsed physical impulse!)

Some people reflectively endorse their own disgust at picking up insects, and wouldn't remove it if given the option. I wanted an example of a pure non-consequentialist preference, and I stand by it as a good example.

> deontological constraints we want are like the human notions of integrity, loyalty, and honesty

Probably we agree about this, but for the sake of flagging potential sources of miscommunication: if I think about the machinery that implements these "deontological" constraints, a lot of it is consequentialist (but it's mostly shorter-term and more local than normal consequentialist preferences).

Foom & Doom 2: Technical alignment is hard
Jeremy Gillen · 21d · 50

(Overall I like these posts in most ways, and especially appreciate the effort you put into making a model diff with your understanding of Eliezer's arguments)

> Eliezer and some others, by contrast, seem to expect ASIs to behave like a pure consequentialist, at least as a strong default, absent yet-to-be-invented techniques. I think this is upstream of many of Eliezer’s other beliefs, including his treating corrigibility as “anti-natural”, or his argument that ASI will behave like a utility maximizer.

It feels like you're rounding off Eliezer's words in a way that removes the important subtlety. What you're doing here is guessing at the upstream generator of Eliezer's conclusions, right? As far as I can see in the links, he never actually says anything that translates to "I expect all ASI preferences to be over future outcomes". It's not clear to me that Eliezer would disagree with "impure consequentialism".

I think you get closest to an argument that I believe with (2):

> (2) The Internal Competition Argument: We’ll wind up with pure-consequentialist AIs (absent some miraculous technical advance) because in the process of reflection within the mind of any given impure-consequentialist AI, the consequentialist preferences will squash the non-consequentialist preferences.

I would say it a bit differently: an AI that has a non-consequentialist preference against personally committing the act of murder won't necessarily build its successor to have the same non-consequentialist preference[1], whereas an AI that has a consequentialist preference for more human lives will necessarily build its successor to also want more human lives. Non-consequentialist preferences need extra machinery in order to be passed on to successors. (And building a successor is a similar process to self-modification.)

> As another example, I’ve seen people imagine non-consequentialist preferences as “rules that the AI grudgingly follows, while searching for loopholes”, rather than “preferences that the AI enthusiastically applies its intelligence towards pursuing”.

I think you're misrepresenting/misunderstanding the argument people are making here. Even when you enthusiastically apply your intelligence toward pursuing a deontological constraint (alongside other goals), you implicitly search for "loopholes" in that constraint, i.e. weird ways to achieve all of your goals that don't involve violating the constraint. To you they aren't loopholes; they're clever ways to achieve all of your goals.

  1. ^

    Perhaps this feels intuitively incorrect. If so, I claim that's because your preferences against committing murder are supported by a bunch of consequentialist preferences for avoiding human suffering and death. A real non-consequentialist preference is more like the disgust reaction to e.g. picking up insects. Maybe you don't want to get rid of your own disgust reaction, but you're okay finding (or building) someone else to pick up insects for you if that helps you achieve your goals. And if it became a barrier to achieving your other goals, maybe you would endorse getting rid of your disgust reaction.

ryan_greenblatt's Shortform
Jeremy Gillen · 2mo · 10

> I think different views about the extent to which future powerful AIs will deeply integrate their superhuman abilities versus these abilities being shallowly attached partially drive some disagreements about misalignment risk and what takeoff will look like.

I think this might be wrong when it comes to our disagreements, because I don't disagree with this shortform.[1] Maybe a bigger crux is how valuable (1) is relative to (2)? Or the extent to which (2) is more helpful for scientific progress than (1)?

  1. ^

    As long as "downstream performance" doesn't include downstream performance on tasks that themselves involve a bunch of integrating/generalising.

Evaluating Stability of Unreflective Alignment
Jeremy Gillen · 8mo · 10

I'd be curious about why it isn't changing the picture quite a lot, maybe after you've chewed on the ideas. From my perspective it makes the entire non-reflective-AI-via-training pathway not worth pursuing, at least for large-scale thinking.

Evaluating Stability of Unreflective Alignment
Jeremy Gillen · 8mo* · 10

Extremely underrated post; I'm sorry I only skimmed it when it came out.

I found 3a,b,c to be strong and well written, a good representation of my view. 

In contrast, I found 3d to be a weak argument that I didn't identify with. In particular, I don't think internal conflicts are a good way to explain the source of goal misgeneralization. To me it's better described as just overfitting or misgeneralization.[1] Edge cases in goals are clearly going to be explored by a stepping-back process if initial attempts fail, particularly if attempted pathways continue to fail. Thinking of the AI as needing to resolve conflicting values, on the other hand, seems to me to be anthropomorphizing in a way that doesn't transfer to most mind designs.

You also used the word "coherent" in a way that I didn't understand.


> Human intelligence seems easily useful enough to be a major research accelerator if it can be produced cheaply by AI

I want to flag this as an assumption that isn't obvious. If this were true for the problems we care about, we could solve them by employing a lot of humans.


> humans provides a pretty strong intuitive counterexample

It's a good observation that humans seem better at stepping back inside low-level tasks than at high-level life purposes. For example, I got stuck on a default path of finishing a neuroscience degree, even though if I had reflected properly I would have realised a couple of years earlier that it was useless for achieving my goals. I got got by sunk costs and normality.

However, I think this counterexample isn't as strong as you think it is. Firstly, it's incredibly common for people to break out of a default path. Secondly, stepping back is usually preceded by some kind of failure to achieve the goal using a particular approach. Such failures occur often at small scales. They occur infrequently in most people's high-level life plans, because such plans are fairly easy and don't often raise flags that indicate potential failure. We want difficult work out of an AI. This implies frequent total failure, and hence frequent high-level stepping back. This is particularly true if it's doing alignment research.

  1. ^

    For example, for the reasons given in section 4 of the misalignment and catastrophe doc.

Safe Predictive Agents with Joint Scoring Rules
Jeremy Gillen · 9mo · 20

To me it seems like one important application of this work is to understanding and fixing the futarchy hack in FixDT and in Logical Inductor decision theory. But I'm not sure whether your results can transfer to these settings, because of the requirement that the agents have the same beliefs.

Is there a reason we can't make duplicate traders in LI and have their trades be zero-sum?

I'm generally confused about this. Do you have thoughts? 

Decomposing Agency — capabilities without desires
Jeremy Gillen · 1y · 1-2

What task? All the tasks I know of that are sufficient to reduce x-risk are really hard.

TurnTrout's shortform feed
Jeremy Gillen · 2y · 65

I think the term is very reasonable and basically accurate, even more so with regard to most RL methods. It's a good way of describing a training process without implying that the evolving system will head toward optimality deliberately. I don't know a better way to communicate this succinctly, especially while not being specific about what local search algorithm is being used.

Also, evolutionary algorithms can be used to approximate gradient descent (with noisier gradient estimates), so it's not unreasonable to use similar language about both.
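
For concreteness, here's a small sketch of the standard evolution-strategies estimator on a toy objective (my own toy example, not code from anywhere in this thread): averaging f(theta + sigma*eps) * eps / sigma over random Gaussian perturbations eps recovers a noisy estimate of the gradient.

```python
# Evolution-strategies gradient estimate on a toy objective (hypothetical example).
# Averaging f(theta + sigma*eps) * eps / sigma over Gaussian perturbations eps gives
# a noisy estimate of grad f(theta), which is why "approximate gradient descent with
# noisier gradient estimates" is a fair description of this family of methods.
import numpy as np

rng = np.random.default_rng(0)

def f(theta):                                  # toy objective: a smooth bowl
    return -np.sum((theta - 1.0) ** 2)

theta = np.zeros(5)
sigma, n = 0.1, 2000
eps = rng.normal(size=(n, theta.size))
f_vals = np.array([f(theta + sigma * e) for e in eps])
baseline = f_vals.mean()                       # standard variance-reduction baseline
es_grad = ((f_vals - baseline) @ eps) / (n * sigma)
true_grad = -2.0 * (theta - 1.0)               # analytic gradient of f at theta

print("ES estimate:  ", np.round(es_grad, 2))
print("true gradient:", true_grad)
```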

I'm not a huge fan of the way you imply that it was chosen for rhetorical purposes.

Instrumental Convergence? [Draft]
Jeremy Gillen · 2y* · 104

I read about half of this post when it came out. I didn't want to comment without reading the whole thing, and reading the whole thing didn't seem worth it at the time. I've come back and read it because Dan seemed to reference it in a presentation the other day.

The core interesting claim is this:

> My conclusion will be that most of the items on Bostrom's laundry list are not 'convergent' instrumental means, even in this weak sense. If Sia's desires are randomly selected, we should not give better than even odds to her making choices which promote her own survival, her own cognitive enhancement, technological innovation, or resource acquisition.

This conclusion doesn't follow from your arguments. None of your models even include actions that are analogous to the convergent actions on that list. 

The non-sequential theoretical model is irrelevant to instrumental convergence, because instrumental convergence is about putting yourself in a better position to pursue your goals later on. The main conclusion seems to come from proposition 3, but the model there is so simple it doesn’t include any possibility of Sia putting itself in a better position for later.

Section 4 deals with sequential decisions, but for some reason mainly gets distracted by a Newcomb-like problem, which seems irrelevant to instrumental convergence. I don't see why you didn't just remove Newcomb-like situations from the model? Instrumental convergence will show up regardless of the exact decision theory used by the agent.

Here's my suggestion for a more realistic model that would exhibit instrumental convergence, while still being fairly simple and having "random" goals across trajectories. Make an environment with 1,000,000 timesteps. Have the world state described by a vector of 1000 real numbers. Have a utility function that is randomly sampled from some Gaussian process (or any other high-entropy distribution over functions) on R^(1,000,000×1000) → R. Assume there exist standard actions which directly make small edits to the world-state vector. Assume that there exist actions analogous to cognitive enhancement, making technology and gaining resources. Intelligence can be used in the future to more precisely predict the consequences of actions on the future world state (you'd need to model a bounded agent for this). Technology can be used to increase the amount or change the type of effect your actions have on the world state. Resources can be spent in the future for more control over the world state. It seems clear to me that for the vast majority of the random utility functions, it's very valuable to have more control over the future world state. So most sampled agents will take the instrumentally convergent actions early in the game and use the additional power later on.
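
Here's a scaled-down sketch of roughly that model (my own toy construction: the horizon and state dimension are shrunk so it runs quickly, and a random linear utility over the trajectory stands in for the Gaussian-process sample):

```python
# Miniature version of the model described above (hypothetical construction).
# World state: a persistent real vector. Utility: sampled at random over the whole
# trajectory. Each step the agent either edits the state directly, or "invests"
# (a stand-in for gaining intelligence/technology/resources), which makes later
# edits larger. For almost every sampled utility, investing beats the myopic policy.
import numpy as np

rng = np.random.default_rng(0)
T, D = 50, 20            # timesteps and state dimension (scaled down from the text)
N_SAMPLES = 200          # number of random utility functions to sample

def run(policy, w):
    state, power = np.zeros(D), 1.0
    traj = np.zeros((T, D))
    for t in range(T):
        if policy(t):                        # invest: no state edit, but more power later
            power += 1.0
        else:                                # direct edit, greedy given future weights
            future = w[t:].sum(axis=0)       # how the (persistent) state affects the future
            state = state + power * future / (np.linalg.norm(future) + 1e-12)
        traj[t] = state
    return float(np.sum(w * traj))           # linear random utility over the trajectory

myopic = lambda t: False                     # always edit the state directly
investor = lambda t: t < T // 2              # spend the first half acquiring power

wins = 0
for _ in range(N_SAMPLES):
    w = rng.normal(size=(T, D))              # a freshly sampled random utility function
    wins += run(investor, w) > run(myopic, w)
print(f"investing beats myopic editing for {wins}/{N_SAMPLES} random utilities")
```

This obviously hard-codes the payoff structure of the "invest" action rather than deriving it, but it shows the shape of the claim: when desires are random over trajectories, options that buy more later control are favoured by almost every sampled utility.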

The assumptions I made about the environment are inspired by the real-world environment, and the assumptions I've made about the desires are similar to yours: maximally uninformative over trajectories.

Soft optimization makes the value target bigger
Jeremy Gillen · 3y · 30

Thanks for clarifying; I misunderstood your post and must have forgotten about the scope, sorry about that. I'll remove that paragraph. Thanks for the links; I hadn't read those, and I appreciate the pseudocode.

I think most likely I still don't understand what you mean by grader-optimizer, but it's probably better to discuss on your post after I've spent more time going over your posts and comments.

My current guess, in my own words, is: a grader-optimizer is something that approximates argmax (has high optimization power)?
And option (1) acts a bit like a soft optimizer, but with more specific structure related to shards and to how it works out whether to continue optimizing?

Wikitag Contributions

Eurisko · 3mo · (+7/-6)
Posts

14 · Context-dependent consequentialism · 8mo · 0
63 · Without fundamental advances, misalignment and catastrophe are the default outcomes of training powerful AI · 1y · 0
43 · Soft optimization makes the value target bigger · 3y · 4
24 · Finding Goals in the World Model · 3y · 0
19 · The Core of the Alignment Problem is... · 3y · 0
3 · Jeremy Gillen's Shortform · 3y · 0