All of adamShimi's Comments + Replies

[AN #157]: Measuring misalignment in the technology underlying Copilot

Exactly. I'm mostly arguing that I don't think the case for the agent situation is as clear-cut as I've seen some people make it out to be, which doesn't mean it isn't possibly true.

[AN #157]: Measuring misalignment in the technology underlying Copilot

Sorry for the delay in answering, I was a bit busy.

I am making a claim that for the purposes of alignment of capable systems, you do want to talk about "motivation". So to the extent GPT-N / Codex-N doesn't have a motivation, but is existentially risky, I'm claiming that you want to give it a motivation. I wouldn't say this with high confidence but it is my best guess for now.

That makes some sense, but I do find the "motivationless" state interesting from an alignment point of view. Because if it has no motivation, it also doesn't have a motivation to do al... (read more)

Rohin Shah (5h): Yeah, I agree that in the future there is a difference. I don't think we know which of these situations we're going to be in (which is maybe what you're arguing). Idk what Gwern predicts.
DeepMind: Generally capable agents emerge from open-ended play

Actually, I think you're right. I always thought that MuZero was one and the same system for every game, but the Nature paper describes it as an architecture that can be applied to learn different games. I'd like confirmation from someone who has studied it more closely, but it looks like MuZero indeed isn't the same system for each game.

DeepMind: Generally capable agents emerge from open-ended play

Could you use this technique to e.g. train the same agent to do well on chess and go?

Unless I'm misunderstanding your question, this is something they already did with MuZero.

Didn't they train a separate MuZero agent for each game? E.g. the page you link only talks about being able to learn without pre-existing knowledge.

[AN #157]: Measuring misalignment in the technology underlying Copilot

Sorry for ascribing to you beliefs you don't have. I guess I'm just used to people here and elsewhere assuming goals and agency in language models, and some of your word choices sounded very goal-directed/intentional-stance to me.

Maybe you're objecting to the "motivated" part of that sentence? But I was saying that it isn't motivated to help us, not that it is motivated to do something else.

Sure, but don't you agree that it's a very confusing use of the term? Like, if I say GPT-3 isn't trying to kill me, I'm not saying it is trying to kill anyo... (read more)

Rohin Shah (3d): Maybe? Idk, according to me the goal of alignment is "create a model that is motivated to help us", and so misalignment = not-alignment = "the model is not motivated to help us". Feels pretty clear to me but illusion of transparency is a thing.

I am making a claim that for the purposes of alignment of capable systems, you do want to talk about "motivation". So to the extent GPT-N / Codex-N doesn't have a motivation, but is existentially risky, I'm claiming that you want to give it a motivation. I wouldn't say this with high confidence but it is my best guess for now.

I think Gwern is using "agent" in a different way than you are ¯\_(ツ)_/¯ I don't think Gwern and I would differ much in our predictions about what GPT-3 is going to do in new circumstances. (He'd probably be more specific than me just because he's worked with it a lot more than I have.)

It doesn't seem like whether something is obvious or not should determine whether it is misaligned -- it's obvious that a very superintelligent paperclip maximizer would be bad, but clearly we should still call that misaligned.

I think that's primarily to emphasize why it is difficult to avoid specification gaming, not because those are the only examples of misalignment.
[AN #157]: Measuring misalignment in the technology underlying Copilot

Rohin's opinion: I really liked the experiment demonstrating misalignment, as it seems like it accurately captures the aspects that we expect to see with existentially risky misaligned AI systems: they will “know” how to do the thing we want, they simply won’t be “motivated” to actually do it.

I think this is a very good example of the paper (based on your summary) and your opinion assuming some sort of higher agency/goals in GPT-3 than what I feel we have evidence for. Notably, there are IMO pretty good arguments (mostly by people affiliated with El... (read more)

Rohin Shah (5d): Where do you see any assumption of agency/goals? (I find this some combination of sad and amusing as a commentary on the difficulty of communication, in that I feel like I tend to be the person pushing against ascribing goals to GPT.)

Maybe you're objecting to the "motivated" part of that sentence? But I was saying that it isn't motivated to help us, not that it is motivated to do something else.

Maybe you're objecting to words like "know" and "capable"? But those don't seem to imply agency/goals; it seems reasonable to say that Google Maps knows about traffic patterns and is capable of predicting route times.

As an aside, this was Codex rather than GPT-3, though I'd say the same thing for both.

I don't care what it is trained for; I care whether it solves my problem. Are you telling me that you wouldn't count any of the reward misspecification examples [https://vkrakovna.wordpress.com/2018/04/02/specification-gaming-examples-in-ai/] as misalignment? After all, those agents were trained to optimize the reward, not to analyze what you meant and fix your reward.

Agreed, which is why I didn't say anything like that?
paulfchristiano's Shortform

Ok, so you optimize the circuit both for speed and for small loss on human answers/comparisons, hoping that it generalizes to more questions while not being complex enough to be deceptive. Is that what you mean?

Paul Christiano (25d): I'm mostly worried about parameter sharing between the human models in the environment and the QA procedure (which leads the QA to generalize like a human instead of correctly). You could call that deception but I think it's a somewhat simpler phenomenon.
paulfchristiano's Shortform

This hope does require the local oversight process to be epistemically competitive with the AI, in the sense that e.g. if the AI understands something subtle about the environment dynamics then the oversight process also needs to understand that. And that's what we are trying to do with all of this business about training AIs to answer questions honestly. The point is just that you don't have to clear up any of the ambiguity about what the human wants, you just have to be able to detect someone tampering with deliberation. (And the operationalization of ta

... (read more)
Paul Christiano (25d): You basically just need full universality / epistemic competitiveness locally. This is just getting around "what are values?" not the need for competitiveness. Then the global thing is also epistemically competitive, and it is able to talk about e.g. how our values interact with the alien concepts uncovered by our AI (which we want to reserve time for since we don't have any solution better than "actually figure everything out 'ourselves'").

Almost all of the time I'm thinking about how to get epistemic competitiveness for the local interaction. I think that's the meat of the safety problem.
paulfchristiano's Shortform

Here's my starting proposal:

  • We quantify the human's local preferences by asking "Look at the person you actually became. How happy are you with that person? Quantitatively, how much of your value was lost by replacing yourself with that person?" This gives us a loss on a scale from 0% (perfect idealization, losing nothing) to 100% (where all of the value is gone). Most of the values will be exceptionally small, especially if we look at a short period like an hour.
  • Eventually once the human becomes wise enough to totally epistemically dominate the original A
... (read more)
Paul Christiano (1mo): The hope is that a tampering large enough to corrupt the human's final judgment would get a score of ~0 in the local value learning. 0 is the "right" score since the tampered human by hypothesis has lost all of the actual correlation with value. (Note that at the end you don't need to "ask it to do simple stuff" you can just directly assign a score of 1.)

This hope does require the local oversight process to be epistemically competitive with the AI, in the sense that e.g. if the AI understands something subtle about the environment dynamics then the oversight process also needs to understand that. And that's what we are trying to do with all of this business about training AIs to answer questions honestly. The point is just that you don't have to clear up any of the ambiguity about what the human wants, you just have to be able to detect someone tampering with deliberation. (And the operationalization of tampering doesn't have to be so complex.)

(I'm not sure if this made too much sense, I have a draft of a related comment that I'll probably post soon but overall expect to just leave this as not-making-much-sense for now.)
paulfchristiano's Shortform

One aspect of this proposal that I don't know how to do is evaluating the answers of the question-answerer. That looks to me very related to the deconfusion of universality that we discussed a few months ago, and without an answer to this, I feel like I don't even know how to run this silly approach.

Paul Christiano (1mo): You could imitate human answers, or you could ask a human "Is answer A′ much better than answer A?" Both of these only work for questions that humans can evaluate (in hindsight), and then the point of the scheme is to get an adequate generalization to (some) questions that humans can't answer.
Brute force searching for alignment

Well, if you worry that these properties don't have a simple conceptual core, maybe you can do the trick where you try to formalize a subset of them with a small conceptual core. That's basically Evan's move with myopia as an easier-to-study subset of non-deceptiveness.

Brute force searching for alignment

If I try to rephrase it in my own words, your proposal looks like a way to go from partial deconfusion (in the form of an extensive definition, a list of examples of what you want) to full deconfusion (an actual program with the property that you want) through brute force search.

Stated like that, it looks really cool. I wonder whether you already need an AGI to do the search with a reasonable amount of compute. In that case, the worry might be that you have to deconfuse what you want to deconfuse before being able to apply this technique, which would make it useles... (read more)

Frequent arguments about alignment

Thanks for this post! I have to admit that I took some time to read it because I expected it to be basic, but I really like the focus on more current techniques (which makes sense since you cofounded and work at OpenAI).

Let's start with the wise AI advisor. Even if our model has internal knowledge about the truth and human wellbeing, that doesn't mean that it'll act on that knowledge the way we want. Rather, the model has been trained to imitate the training corpus, and therefore it'll repeat the misconceptions and flaws of typical authors, even if

... (read more)
Environmental Structure Can Cause Instrumental Convergence

Sorry for the awkwardness (this comment was difficult to write). But I think it is important that people in the AI alignment community publish these sorts of thoughts. Obviously, I can be wrong about all of this.

Despite disagreeing with you, I'm glad that you published this comment, and I agree that airing disagreements is really important for the research community.

In particular, I don't think the paper provides a simple description for the set of MDPs that the main claim in the abstract applies to ("We prove that for most prior beliefs one might have a

... (read more)
Open problem: how can we quantify player alignment in 2x2 normal-form games?

I want to point out that this is a great example of a deconfusion open problem. There are a bunch of intuitions and some constraints, and then we want to clarify the confusion underlying it all. I'm not planning to work on it myself, but it sounds very interesting.

(The only caveat I have with the post itself is that the title could be more explicit that it is an open problem.)

Knowledge is not just digital abstraction layers

Nice post, as always.

What I take from the sequence up to this point is that the way we formalize information is unfit to capture knowledge. This is quite intuitive, but you also give concrete counterexamples that are really helpful.

It is reasonable to say that a data recorder is accumulating nonzero knowledge, but it is strange to say that exchanging the sensor data for a model derived from that sensor data is always a net decrease in knowledge.

Definitely agreed. It sounds like your proposal doesn't capture the transformation of information into more valuable precomputation (making valuable abstractions requires throwing away some information).

Alex Flint (1mo): Yep, agreed. These are still writings that I drafted before we chatted a couple of weeks ago, btw. I have some new ideas based on the things we chatted about that I hope to write up soon :)
Vignettes Workshop (AI Impacts)

Already told you yesterday, but great idea! I'll definitely be a part of it, and will try to bring some people with me.

Looking Deeper at Deconfusion

Sure.

... (read more)
[Event] Weekly Alignment Research Coffee Time (07/26)

Hey, it seems like others could use the link, so I'm not sure what went wrong. If you have the same problem tomorrow, just send me a PM.

Knowledge is not just mutual information

Thanks again for a nice post in this sequence!

The previous post looked at measuring the resemblance between some region and its environment as a possible definition of knowledge and found that it was not able to account for the range of possible representations of knowledge.

I found myself going back to the previous post to clarify what you mean here. I feel like you could do a better job of summarizing the issue from the previous post (maybe by mentioning the computer example explicitly?).

Formally, the mutual information between two objects is the gap betwee

... (read more)
Search-in-Territory vs Search-in-Map

This is a very interesting distinction. Notably, I feel that you point more cleanly at the distinction between "search inside" and "search outside" which I gestured at in my review of Abram's post. Compared with selection vs control, this split also has the advantage that there are no recursive calls of one to the other: a controller can do selection inside, but you can't do search-in-territory by doing search-in-map (if I understand you correctly).

That being said, I feel you haven't yet completely deconfused optimization, because you don't give a less confused explana... (read more)

MDP models are determined by the agent architecture and the environmental dynamics

I'm wondering whether I properly communicated my point. Would you be so kind as to summarize my argument as best you understand it?

My current understanding is something like:

  • There is not really a subjective modeling decision involved because given an interface (state space and action space), the dynamics of the system are a real world property we can look for concretely.
  • Claims about the encoding/modeling can be resolved thanks to power-seeking, which predicts what optimal policies are more likely to do. So with enough optimal policies, we can check the cla
... (read more)
Alex Turner (2mo): (I continued this discussion with Adam in private - here are some thoughts for the public record)

I think I'm claiming the first bullet. I am not claiming the second.

Yes, that. It doesn't have to be unique. We're predicting "for the agents we build, will optimal policies in their MDP models seek power?", and once you account for the environment dynamics, our beliefs about the agent architecture, and then our beliefs about the reward functions conditional on each architecture, this prediction has no subjective degrees of freedom. I'm not claiming that there's One Architecture To Rule Them All. I'm saying that if we want to predict what happens, we:

1. Consider the underlying environment (assumed Markovian).
2. Consider different state/action encodings we might supply the agent.
3. For each, fix a reward function distribution (what goals we expect to assign to the agent).
4. See what my theory predicts.

There's a further claim (which seems plausible, but which I'm not yet making) that (2) won't affect (4) very much in practice.

The point of this post is that if you say "the MDP has a different model", you're either disagreeing with (1) the actual dynamics, or claiming that we will physically supply the agent with a different state/action encoding (2). To falsify "5 googolplex", all you have to know is the dynamics + the agent's observation and action encodings. That determines the MDP structure. You don't have to run anything. (Although I suppose your proposed direction of inference is interesting: power-seeking tendencies + dynamics give you evidence about the encoding.)

This shows you the action and state encodings, which determines the model with which the agent interfaces. The encodings + environment dynamics tell you what model the agent is interfacing with, which allows you to apply my theorems as usual.
MDP models are determined by the agent architecture and the environmental dynamics

Despite agreeing with your conclusion, I'm unconvinced by the reasons you propose. Sure, once the interface is chosen, the MDP is pretty much constrained by the real world (for a reasonable modeling process). But that just means the subjectivity comes from the choice of the interface!

To be more concrete, maybe the state space of Pacman could be red-ghost, starting-state and live-happily-ever-after (replacing the right part of the MDP). Then taking the right action wouldn't be power-seeking either.

What I think is happening here is that in reality, ther... (read more)

Alex Turner (2mo): I'm wondering whether I properly communicated my point. Would you be so kind as to summarize my argument as best you understand it?

There's no subjectivity? The interface is determined by the agent architecture we use, which is an empirical question. You don't have to run anything to check power-seeking. Once you know the agent encodings, the rest is determined and my theory makes predictions.
Attainable Utility Preservation: Concepts

(Definitely a possibility that this is answered later in the sequence)

Rereading the post and thinking about this, I wonder if AUP-based AIs can still do anything (which is what I think Steve was pointing at). Or, phrased differently, whether they can still be competitive.

Sure, reading a textbook doesn't decrease the AU of most other goals, but applying the learned knowledge might. In your paperclip example, I expect that the AUP-based AI will make very few paperclips, or it could have a big impact (after all, we make paperclips in factories, but they c... (read more)

Alex Turner (2mo): How, exactly, would it have a big impact? Do you expect making a few paperclip factories to have a large impact in real life? If not, why would idealized-AUP agents expect that?

I think that for many tasks, idealized-AUP agents would not be competitive. It seems like they'd still be competitive on tasks with more limited scope, like putting apples on plates, construction tasks, or (perhaps) answering questions etc.

I'm not sure what your model is here. In this post, this isn't a constrained optimization problem, but rather a tradeoff between power gain and the main objective. So it's not like AUP raps the agent's knuckles and wholly rules out plans involving even a bit of power gain. The agent computes something like (objective score) - c*(power gain), where c is some constant. On rereading, I guess this post doesn't make that clear: this post assumes not only that we correctly implement the concepts behind AUP, but also that we slide along the penalty harshness spectrum [https://www.lesswrong.com/s/7CdoznhJaLEKHwvJW/p/LfGzAduBWzY5gq6FE] until we get reasonable plans. It seems like we should hit reasonable plans before power-seeking is allowed, although this is another detail swept under the rug by the idealization.

Idealized-AUP doesn't directly penalize gaining power for the user, no. Whether this is indirectly incentivized depends on the idealizations we make.

I think that impact measures levy a steep alignment tax, so yes, I think [https://www.lesswrong.com/s/7CdoznhJaLEKHwvJW/p/wAAvP8RG6EwzCvHJy] that there are competitive pressures to cut corners on impact allowances.
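As a minimal sketch of the tradeoff-versus-constraint distinction made in the reply above (my own illustration; the function names and numbers are hypothetical, not from the post or the AUP papers):

```python
def aup_score(objective_score, power_gain, c=1.0):
    """Soft tradeoff: each unit of power gain costs c, but no amount is forbidden outright."""
    return objective_score - c * power_gain

def hard_constraint_score(objective_score, power_gain, budget=0.0):
    """What the reply says AUP is *not* doing: categorically ruling out plans over a power budget."""
    return objective_score if power_gain <= budget else float("-inf")

# Toy numbers: a slightly better plan that gains a lot of power is merely discounted under the
# soft tradeoff, and sliding c (the "penalty harshness" knob) changes which plan wins.
print(aup_score(objective_score=1.0, power_gain=0.1, c=1.0))  # modest plan -> 0.9
print(aup_score(objective_score=1.5, power_gain=2.0, c=1.0))  # power-seeking plan -> -0.5
```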
SGD's Bias

In SGD, our “intended” drift is -∇L - i.e. drift down the gradient of the objective. But the location-dependent noise contributes a “bias” - a second drift term, resulting from drift down the noise-gradient. Combining the equations from the previous two sections, the noise-gradient-drift is

I have not followed all your reasoning, but focusing on this last formula, does it represent a bias towards less variance over the different gradients one can sample at a given point?

If so, then I do find this quite interesting. A ra... (read more)

johnswentworth (2mo): Yup, exactly.
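For readers who want to see this bias concretely, here is a toy simulation (my own sketch, not from the post; the noise profile and constants are made up for illustration): with a completely flat objective but position-dependent gradient noise, SGD iterates accumulate in the region where sampled gradients have lower variance.

```python
import numpy as np

rng = np.random.default_rng(0)

def noise_std(x):
    # Hypothetical noise profile: sampled gradients are noisier for x > 0 than for x < 0.
    return 0.1 + 0.9 / (1.0 + np.exp(-x))

n_walkers, n_steps, lr = 2_000, 20_000, 0.1
x = np.zeros(n_walkers)  # all runs start at x = 0; the true objective is flat everywhere

for _ in range(n_steps):
    # True gradient is 0; the sampled gradient is pure noise whose scale depends on x.
    g_sample = noise_std(x) * rng.standard_normal(n_walkers)
    x = np.clip(x - lr * g_sample, -5.0, 5.0)  # SGD step, kept in a bounded box

# Typically prints a fraction well above 0.5: the noise alone biases iterates toward
# the low-variance region, even though the objective itself expresses no preference.
print("fraction of runs ending in the low-noise region (x < 0):", (x < 0).mean())
```

The intuition: the walk diffuses quickly where the gradient noise is large and slowly where it is small, so it ends up spending most of its time in the low-noise region.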
Knowledge Neurons in Pretrained Transformers

I think that particularly the first of these two results is pretty mind-blowing, in that it demonstrates an extremely simple and straightforward procedure for directly modifying the learned knowledge of transformer-based language models. That being said, it's the second result that probably has the most concrete safety applications—if it can actually be scaled up to remove all the relevant knowledge—since something like that could eventually be used to ensure that a microscope AI isn't modeling humans or ensure that an agent is myopic in the sense that it

... (read more)
Formal Inner Alignment, Prospectus

Even with a significantly improved definition of goal-directedness, I think we'd be pretty far from taking arbitrary code/NNs and evaluating their goals. Definitions resembling yours require an environment to be given; but this will always be an imperfect environment-model. Inner optimizers could then exploit differences between that environment-model and the true environment to appear benign.

Oh, definitely. I think a better definition of goal-directedness is a prerequisite for being able to do that, so it's only the first step. That being said, I think I'm mo... (read more)

Abram Demski (2mo): Ah, on this point, I very much agree. I was treating the brain as fixed in size, so, having some upper bound on memory. Naturally this isn't quite true in practice (for all we know, healthy million-year-olds might have measurably larger heads if they existed, due to slow brain growth, but either way this seems like a technicality).
Challenge: know everything that the best go bot knows about go

My take is:

  • I think making this post was a good idea. I'm personally interested in deconfusing the topic of universality (which should basically capture what "learning everything the model knows" means), and you brought up a good "simple" example to try to build intuition on.
  • What I would call your mistake is mostly an 8, but with a bit of the related ones (so 3 and 4?). Phrasing it as "can we do that" is a mistake in my opinion because the topic is very confused (as shown by the comments). On the other hand, I think asking the question of what it would mean is a very e
... (read more)
Formal Inner Alignment, Prospectus

While I agree that outer objective, training data and prior should be considered together, I disagree that it makes the inner alignment problem dissolve except for manipulation of the search. In principle, if you could indeed ensure through a smart choice of these three parameters that there is only one global optimum, only "bad" (meaning high loss) local minima, and that your search process will always reach the global optimum, then I would agree that the inner alignment problem disappears.

But answering "what do we even want?" at this level of precision s... (read more)

AXRP Episode 7 - Side Effects with Victoria Krakovna

Thanks a lot for another great episode! I want to be clear (because you don't always get that many comments) that I think you're doing a great service to the field. These podcasts are a great way for beginners to get a better grip on complex topics, and even for people who already know some of the ideas, the length and depth of the conversation usually brings out cool new takes.

 

This might suggest a mitigation strategy of minimizing the degree to which AI systems have large effects on the world that aren't absolutely necessary for achieving their objective.

Not... (read more)

Alex Turner (2mo): Well, as I understand it, RR usually penalizes wrt a baseline - it doesn't penalize absolute loss in reachability. If the agent can reach states s1 - s1000 if it follows the baseline policy, that means it'll penalize losing access to (states similar to) those states. That doesn't mean RR will make the agent maximize its options. In fact, if the baseline doesn't involve power-seeking, some variants of RR will penalize the agent for maximizing its available options / power.

I think RR may have interesting interactions with the reachability preservation tendencies exposed by my power-seeking theorems. However, controlling for other design choices, I don't think the results suggest that RR incentivizes power-seeking. I don't currently think the power-seeking theorems suggest that either RR or AUP is particularly bad wrt power-seeking behavior.
DanielFilan (2mo): You're quite right, let me fix that.
Formal Inner Alignment, Prospectus

Haven't read the full comment thread, but on this sentence

Or maybe inner alignment just shouldn't be seen as the compliment of outer alignment!

Evan actually wrote a post to explain that it isn't the complement for him (and not the compliment either :p) 

Abram Demski (2mo): Right, but John is disagreeing with Evan's frame, and John's argument that such-and-such problems aren't inner alignment problems is that they are outer alignment problems.
Formal Inner Alignment, Prospectus

Thanks for the post!

Here is my attempt at detailed peer-review feedback. I admit that I'm more excited to do this because you're asking for it directly, and so I actually believe there will be some answer (which in my experience is rarely the case for my in-depth comments).

One thing I really like is the multiple "failure" stories at the beginning. It's usually frustrating in posts like this to see people argue against positions/arguments that aren't written down anywhere. Here we can actually see the problematic arguments.

I responded that for me, the whole po

... (read more)
Abram Demski (2mo): Even with a significantly improved definition of goal-directedness, I think we'd be pretty far from taking arbitrary code/NNs and evaluating their goals. Definitions resembling yours require an environment to be given; but this will always be an imperfect environment-model. Inner optimizers could then exploit differences between that environment-model and the true environment to appear benign. But I'm happy to include your approach in the final document!

Can you elaborate on this?

Right. Low total error for, eg, imitation learning, might be associated with catastrophic outcomes. This is partly due to the way imitation learning is readily measured in terms of predictive accuracy, when what we really care about is expected utility (although we can't specify our utility function, which is one reason we may want to lean on imitation, of course). But even if we measure quality-of-model in terms of expected utility, we can still have a problem, since we're bound to measure average expected utility wrt some distribution, so utility could still be catastrophic wrt the real world.

Right. If you have a proposal whereby you think (malign) mesa-optimizers have to pay a cost in some form of complexity, I'd be happy to hear it, but "systems performing complex tasks in complex environments have to pay that cost anyway" seems like a big problem for arguments of this kind. The question becomes where they put the complexity.

I meant time as a function of data (I'm not sure how else to quantify complexity here). Humans have a basically constant reaction time, but our reactions depend on memory, which depends on our entire history. So to simulate my response after X data, you'd need O(X). A memoryless alg could be constant time; IE, even though you have an X-long history, you just need to feed it the most recent thing, so its response time is not a function of X. Similarly with finite context windows. I agree that in principle we could decode the brain's algorithms and say
Abram Demski (3mo): Thanks!

Right. By "no connection" I specifically mean "we have no strong reason to posit any specific predictions we can make about mesa-objectives from outer objectives or other details of training" -- at least not for training regimes of practical interest. (I will consider this detail for revision.) I could have also written down my plausibility argument (that there is actually "no connection"), but probably that just distracts from the point here.

(More later!)
Challenge: know everything that the best go bot knows about go

What does that mean, though? If you give the go professional a massive transcript of the bot's knowledge, it's probably unusable. I think what the go professional gives you is the knowledge of where to look/what to ask for/what to search.

Nisan (3mo): Or maybe it means we train the professional in the principles and heuristics that the bot knows. The question is if we can compress the bot's knowledge into, say, a 1-year training program for professionals. There are reasons to be optimistic: We can discard information that isn't knowledge (lossy compression). And we can teach the professional in human concepts (lossless compression).
Challenge: know everything that the best go bot knows about go

That's basically what Paul's universality (see my distillation post for another angle) is aiming for: having a question-answering overseer which can tell you everything you want to know about what the system knows and what it will do. You still probably need to be able to ask a relevant question, which I think is what you're pointing at.

Mundane solutions to exotic problems

Sorry about that. I corrected it; it was indeed the first link you gave.

Evan Hubinger (3mo): Your link is broken. For reference, the first post in Paul's ascription universality sequence can be found here [https://ai-alignment.com/towards-formalizing-universality-409ab893a456] (also Adam has a summary here [https://www.alignmentforum.org/posts/farherQcqFQXqRcvv/universality-unwrapped]).
AMA: Paul Christiano, alignment researcher

Copying my question from your post about your new research center (because I'm really interested in the answer): which part (if any) of theoretical computer science do you expect to be particularly useful for alignment?

Paul Christiano (3mo): Learning theory definitely seems most relevant. Methodologically I think any domain where you are designing and analyzing algorithms, especially working with fuzzy definitions or formalizing intuitive problems, is also useful practice though much less bang for your buck (especially if just learning about it rather than doing research in it). That theme cuts a bunch across domains, though I think cryptography, online algorithms, and algorithmic game theory are particularly good.
Coherence arguments imply a force for goal-directed behavior

Yeah, this is an accurate portrayal of my views. I'd also note that the project of mapping internal concepts to mathematical formalisms was the main goal of the whole era of symbolic AI, and failed badly. (Although the analogy is a little loose, so I wouldn't take it as a decisive objection, but rather a nudge to formulate a good explanation of what they were doing wrong that you will do right.)

My first intuition is that I expect mapping internal concepts to mathematical formalisms to be easier when the end goal is deconfusion and making sense of behaviors,... (read more)

Coherence arguments imply a force for goal-directed behavior

Analogously, it seems very hard to have a good understanding of goals without talking about concepts, instincts, desires, etc, and the roles that all of these play within cognition as a whole - concepts which people just don't talk about much around here. I hypothesise that this is partly because they think they can talk about utilities instead. But when people reason about how to design AGIs in terms of utilities, on the basis of coherence theorems, then I think they're making a very similar mistake as a doctor who tries to design artificial livers based

... (read more)
Richard Ngo (3mo): Yeah, this is an accurate portrayal of my views. I'd also note that the project of mapping internal concepts to mathematical formalisms was the main goal of the whole era of symbolic AI, and failed badly. (Although the analogy is a little loose, so I wouldn't take it as a decisive objection, but rather a nudge to formulate a good explanation of what they were doing wrong that you will do right.)

I don't think this is an accurate portrayal of my views. I am trying to say that utility functions are a bad abstraction for reasoning about AGI, for similar reasons to why health points are a bad abstraction for reasoning about livers. (I think I agree with the rest of the paragraph though.)
Announcing the Alignment Research Center

This is so great! I always hate wishing people luck when I trust in their competence to mostly deal with bad luck and leverage good luck. I'll use that one now.

Announcing the Alignment Research Center

Sounds really exciting! I'm wondering what kind of theoretical computer science you have in mind specifically. Like, which part do you think has the most uses for alignment? (Still trying to find a way to use my PhD in the theory of distributed computing for something alignment-related ^^)

Gradations of Inner Alignment Obstacles

Agreed, it depends on the training process.

Gradations of Inner Alignment Obstacles

Now, according to ELH, we might expect that in order to learn deceptive or non-deceptive behavior we start with an NN big enough to represent both as hypotheses (within the random initialization).

But if our training method (for part (2) of the basin plan) only works under the assumption that no deceptive behavior is present yet, then it seems we can't get started.

This argument is obviously a bit sloppy, though.

I guess the crux here is how much deceptiveness you need before the training method is hijacked. My intuition is that you need to be relatively c... (read more)

I guess the crux here is how much deceptiveness you need before the training method is hijacked. My intuition is that you need to be relatively competent at deceptiveness, because the standard argument for why (let's say) SGD will make good deceptive models more deceptive is that making them less deceptive would mean bigger loss, and so it pushes towards more deception.

I agree, but note that different methods will differ in this respect. The point is that you have to account for this question when making a basin of attraction argument.

Where are intentions to be found?

I have two reactions while reading this post:

  • First, even if we say that a given human (for example) at a fixed point in time doesn't necessarily contain everything that we would want the AI to learn, if the AI only learns what's in there, a lot of alignment failures might already disappear. For example, paperclip maximizers are probably ruled out by taking one human's values at a point in time and extrapolating. But that clearly doesn't help with scenarios where the AI does the sort of bad things humans can do, for example.
  • Second, I would argue th
... (read more)
Gradations of Inner Alignment Obstacles

Cool post! It's clearly not super polished, but I think you're pointing at a lot of important ideas, and so it's a good thing to publish it relatively quickly.

The standard definition of "inner optimizer" refers to something which carries out explicit search, in service of some objective. It's not clear to me whether/when we should focus that narrowly. Here are some other definitions of "inner optimizer" which I sometimes think about.

As far as I understand it, the initial assumption of internal search was mostly made for two reasons: because then you can sp... (read more)

Abram Demski (3mo): Right, so, the point of the argument for basin-like proposals is this: A basin-type solution has to

1. initialize in such a way as to be within a good basin / not within a bad basin.
2. Train in a way which preserves this property.

Most existing proposals focus on (2) and don't say that much about (1), possibly counting on the idea that random initializations will at least not be actively deceptive. The argument I make in the post is meant to question this, pointing toward a difficulty in step (1).

One way to put the problem in focus: suppose the ensemble learning hypothesis:

Ensemble learning hypothesis (ELH): Big NNs basically work as a big ensemble of hypotheses, which learning sorts through to find a good one.

This bears some similarity to lottery-ticket thinking. Now, according to ELH, we might expect that in order to learn deceptive or non-deceptive behavior we start with an NN big enough to represent both as hypotheses (within the random initialization). But if our training method (for part (2) of the basin plan) only works under the assumption that no deceptive behavior is present yet, then it seems we can't get started.

This argument is obviously a bit sloppy, though.
johnswentworth (3mo): Yes.
Updating the Lottery Ticket Hypothesis

The main empirical finding which led to the NTK/GP/Mingard et al picture of neural nets is that, in practice, that linear approximation works quite well. As neural networks get large, their parameters change by only a very small amount during training, so the overall  found during training is actually nearly a solution to the linearly-approximated equations.

Trying to check if I'm understanding correctly: does that mean that despite SGD making a lot of successive updates that use the gradient at the successive parameter values, these "even out" s... (read more)

johnswentworth (3mo): Sort of. They end up equivalent to a single Newton step, not a single gradient step (or at least that's what this model says). In general, a set of linear equations is not solved by one gradient step, but is solved by one Newton step. It generally takes many gradient steps to solve a set of linear equations.

(Caveat to this: if you directly attempt a Newton step on this sort of system, you'll probably get an error, because the system is underdetermined. Actually making Newton steps work for NN training would probably be a huge pain in the ass, since the underdetermination would cause numerical issues.)
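To make the Newton-step point concrete, here is a standard worked example (my own illustration, not from the post or the reply above), using a least-squares system where the normal equations are well-posed:

$$L(\theta) = \tfrac{1}{2}\lVert A\theta - b\rVert^2, \qquad \nabla L(\theta) = A^\top (A\theta - b), \qquad H = \nabla^2 L = A^\top A.$$

Assuming $A^\top A$ is invertible (i.e. the system is not underdetermined, which is exactly the caveat in the reply), a single Newton step from any starting point $\theta_0$ gives

$$\theta_1 = \theta_0 - H^{-1}\nabla L(\theta_0) = \theta_0 - (A^\top A)^{-1}A^\top(A\theta_0 - b) = (A^\top A)^{-1}A^\top b,$$

which solves the normal equations $A^\top A\,\theta = A^\top b$ exactly, independent of $\theta_0$. A single gradient step $\theta_0 - \eta\,\nabla L(\theta_0)$ generally does not, which is why reaching the solution of the linearized training equations is naturally described as one Newton step rather than one gradient step.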