All of evhub's Comments + Replies

Challenges with Breaking into MIRI-Style Research

One of my hopes with the SERI MATS program is that it can help fill this gap by providing a good pipeline for people interested in doing theoretical AI safety research (be that me-style, MIRI-style, Paul-style, etc.). We're not accepting public applications right now, but the hope is definitely to scale up to the point where we can run many of these every year and accept public applications.

2021 AI Alignment Literature Review and Charity Comparison

(Moderation note: added to the Alignment Forum from LessWrong.)

More Christiano, Cotra, and Yudkowsky on AI progress

And what other EAs reading it are thinking, I expect, is plain old Robin-Hanson-style reference class tennis of "Why would you expect intelligence to scale differently from bridges, where are all the big bridges?"

I find these sorts of characterizations very strange, since I feel like I know quite a lot of EAs, but approximately nobody that's really into that sort of reference class forecasting (at least not more so than where Paul and Eliezer agree that superforecaster-style methodology is sound). I'm curious who specifically you're thinking of other th... (read more)

Are minimal circuits deceptive?

This is perhaps not directly related to your argument here, but how is inner alignment failure distinct from generalization failure?

Yes, inner alignment is a subset of robustness. See the discussion here and my taxonomy here.


If you train network N on dataset D and optimization pressure causes N to internally develop a planning system (mesa-optimizer) M, aren't all questions of whether M is aligned with N's optimization objective just generalization questions?

This reflects a misunderstanding of what a mesa-optimizer is—as we say in Risks from Learne... (read more)

Theoretical Neuroscience For Alignment Theory

To the extent Steve is right that “[understanding] the algorithms in the human brain that give rise to social instincts and [putting] some modified version of those algorithms into our AGIs” is a worthwhile safety proposal, I think we should be focusing our attention on instantiating the relevant algorithms that underlie affective and cognitive ToM + affective empathy.

It seems to me like you would very likely get both cognitive and affective theory of mind “for free” in the sense that they're necessary things to understand for predicting humans well. If... (read more)

Steve Byrnes (1mo, 5 points): For my part, I strongly agree with the first part, and I said something similar in my comment [https://www.lesswrong.com/posts/ZJY3eotLdfBPCLP3z/theoretical-neuroscience-for-alignment-theory?commentId=bfjsLXdx5TJRGQxQA].

For the second part, if we're talking about within-lifetime brain learning / thinking, we're talking about online-learning. For example, if I'm having a conversation with someone, and they tell me their name is Fred, and then 2 minutes later I say "Well Fred, this has been a lovely conversation", I can thank online-learning for my remembering their name. Another example: the math student trying to solve a homework problem (and learning from the experience) is using the same basic algorithms as the math professor trying to prove a new theorem—even if the first is vaguely analogous to "training" and the second to "deployment".

So then you can say: "Well fine, but online learning is pretty unfashionable in ML today. Can we talk about what the brain's within-lifetime learning algorithms would look like without online learning?" And I would say: "Ummmm, I don't know. I'm not sure that's a coherent or useful thing to talk about. A brain without online-learning would look like unusually severe retrograde amnesia."

That's not a criticism of what you said. Just a warning that "non-online-learning versions of brain algorithms" is maybe an incoherent notion that we shouldn't think too hard about. :)
Moore's Law, AI, and the pace of progress

(Moderation note: added to the Alignment Forum from LessWrong.)

Hard-Coding Neural Computation

(Moderation note: added to the Alignment Forum from LessWrong.)

larger language models may disappoint you [or, an eternally unfinished draft]

It's totally possible to do ecological evaluation with large LMs. (Indeed, lots of people are doing it.) For example, you can:

  • Take an RL environment with some text in it, and make an agent that uses the LM as its "text understanding module."
    • If the LM has a capability, and that capability is helpful for the task, the agent will learn to elicit it from the LM as needed. See e.g. this paper.
  • Just do supervised learning on a capability you want to probe.

I don't understand why you think this would actually give you an ecological evaluation. It seems ... (read more)

nostalgebraist (2mo, 4 points): It will still only provide a lower bound, yes, but only in the trivial sense that presence is easier to demonstrate than absence. All experiments that try to assess a capability suffer from this type of directional error, even prototype cases like "giving someone a free-response math test."

  • They know the material, yet they fail the test: easy to imagine (say, if they are preoccupied by some unexpected life event)
  • They don't know the material, yet they ace the test: requires an astronomically unlikely coincidence

The distinction I'm meaning to draw is not that there is no directional error, but that the RL/SL tasks have the right structure: there is an optimization procedure which is "leaving money on the table" if the capability is present yet ends up unused.
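(For concreteness, here is a minimal sketch of the "supervised learning on a capability you want to probe" option from the quoted list: train a linear probe on frozen LM features for some labeled capability dataset. The "gpt2" checkpoint, the mean-pooled probe, and the texts/labels data are placeholder assumptions, not anything from the thread.)

    import torch
    from transformers import AutoModel, AutoTokenizer

    # Placeholder model and data; the point is the shape of the evaluation, not the specifics.
    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
    lm = AutoModel.from_pretrained("gpt2")
    lm.eval()

    def lm_features(texts):
        batch = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
        with torch.no_grad():
            hidden = lm(**batch).last_hidden_state  # (batch, seq_len, hidden_size)
        return hidden.mean(dim=1)                   # crude mean pooling over tokens

    # Linear probe on frozen LM features for a binary "capability" label.
    probe = torch.nn.Linear(lm.config.hidden_size, 2)
    opt = torch.optim.Adam(probe.parameters(), lr=1e-3)

    def train_step(texts, labels):
        logits = probe(lm_features(texts))
        loss = torch.nn.functional.cross_entropy(logits, torch.tensor(labels))
        opt.zero_grad()
        loss.backward()
        opt.step()
        return loss.item()

(If held-out accuracy stays at chance even though the probe is directly optimized for the task, that is the "money left on the table" signal; if it rises well above chance, the capability was at least latently present.)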
Behavior Cloning is Miscalibrated

(Moderation note: added to the Alignment Forum from LessWrong.)

Yudkowsky and Christiano discuss "Takeoff Speeds"

But after the 10^10 point, something interesting happens: the score starts growing much faster (~N).

And for some tasks, the plot looks like a hockey stick (a sudden change from ~0 to almost-human).

Seems interestingly similar to the grokking phenomenon.

A positive case for how we might succeed at prosaic AI alignment

are you imagining something much more like an old chess-playing system with a known objective than a modern ML system with a loss function?

No—I'm separating out two very important pieces that go into training a machine learning model: what sort of model you want to get and how you're going to get it. My step (1) above, which is what I understand that we're talking about, is just about that first piece: understanding what we're going to be shooting for when we set up our training process (and then once we know what we're shooting for we can think about h... (read more)

A positive case for how we might succeed at prosaic AI alignment

To be clear, I agree with this as a response to what Edouard said—and I think it's a legitimate response to anyone proposing we just do straightforward imitative amplification, but I don't think it's a response to what I'm advocating for in this post (though to be fair, this post was just a quick sketch, so I suppose I shouldn't be too surprised that it's not fully clear).

In my opinion, if you try to imitate Bob and get a model that looks like it behaves similarly to Bob, but have no other guarantees about it, that's clearly not a safe model to amplify,... (read more)

All the really basic concerns—e.g. it tries to get more compute so it can simulate better—can be solved by having a robust Cartesian boundary and having an agent that optimizes an objective defined on actions through the boundary

I'm confused from several directions here.  What is a "robust" Cartesian boundary, why do you think this stops an agent from trying to get more compute, and when you postulate "an agent that optimizes an objective" are you imagining something much more like an old chess-playing system with a known objective than a modern ML system with a loss function?

A positive case for how we might succeed at prosaic AI alignment

If the underlying process your myopic agent was trained to imitate would (under some set of circumstances) be incentivized to deceive you, and the myopic agent (by hypothesis) imitates the underlying process to sufficient resolution, why would the deceptive behavior of the underlying process not be reflected in the behavior of the myopic agent?

Yeah, this is obviously true. Certainly if you have an objective of imitating something that would act deceptively, you'll get deception. The solution isn't to somehow “filter out the unwanted instrumental behavio... (read more)

This is a great thread. Let me see if I can restate the arguments here in different language:

  1. Suppose Bob is a smart guy whom we trust to want all the best things for humanity. Suppose we also have the technology to copy Bob's brain into software and run it in simulation at, say, a million times its normal speed. Then, if we thought we had one year between now and AGI (leaving aside the fact that I just described a literal AGI in the previous sentence), we could tell simulation-Bob, "You have a million subjective years to think of an effective pivotal act i
... (read more)
A positive case for how we might succeed at prosaic AI alignment

How does a "myopic optimizer" successfully reason about problems that require non-myopic solutions, i.e. solutions whose consequences extend past whatever artificial time-frame the optimizer is being constrained to reason about?

It just reasons about them, using deduction, prediction, search, etc., the same way any optimizer would.

To the extent that it does successfully reason about those things in a non-myopic way, in what remaining sense is the optimizer myopic?

The sense that it's still myopic is in the sense that it's non-deceptive, which is the o... (read more)

David Xu (2mo, 7 points): [Note: Still speaking from my Eliezer model here, in the sense that I am making claims which I do not myself necessarily endorse (though naturally I don't anti-endorse them either, or else I wouldn't be arguing them in the first place). I want to highlight here, however, that to the extent that the topic of the conversation moves further away from things I have seen Eliezer talk about, the more I need to guess about what I think he would say, and at some point I think it is fair to describe my claims as neither mine nor (any model of) Eliezer's, but instead something like my extrapolation of my model of Eliezer, which may not correspond at all to what the real Eliezer thinks.]

If the underlying process your myopic agent was trained to imitate would (under some set of circumstances) be incentivized to deceive you, and the myopic agent (by hypothesis) imitates the underlying process to sufficient resolution, why would the deceptive behavior of the underlying process not be reflected in the behavior of the myopic agent?

Conversely, if the myopic agent does not learn to imitate the underlying process to sufficient resolution that unwanted behaviors like deception start carrying over, then it is very likely that the powerful consequentialist properties of the underlying process have not been carried over, either. This is because (on my extrapolation of Eliezer's model) deceptive behavior, like all other instrumental strategies, arises from consequentialist reasoning, and is deeply tied to such reasoning in a way that is not cleanly separable—which is to say, by default, you do not manage to sever one without also severing the other.

Again, I (my model of Eliezer) does not think the "deep tie" in question is necessarily insoluble; perhaps there is some sufficiently clever method which, if used, would successfully filter out the "unwanted" instrumental behavior ("deception", in your terminology) from the "wanted" instrumental behavior (planning, coming up with strateg
A positive case for how we might succeed at prosaic AI alignment

I mean, that's because this is just a sketch, but a simple argument for why myopia is more natural than “obey humans” is that if we don't care about competitiveness, we already know how to build myopic optimizers, whereas we don't know how to build an optimizer to “obey humans” at any level of capabilities.

Furthermore, LCDT is a demonstration that we can at least reduce the complexity of specifying myopia to the complexity of specifying agency. I suspect we can get much better upper bounds on the complexity than that, though.

Joe_Collman (2mo, 6 points): It's an interesting idea, but are you confident that LCDT actually works? E.g. have you thought more about the issues I talked about here [https://www.lesswrong.com/posts/Y76durQHrfqwgwM5o/lcdt-a-myopic-decision-theory?commentId=crfQECLrYRaP225ne] and concluded they're not serious problems?

I still don't see how we could get e.g. an HCH simulator without agentic components (or the simulator's qualifying as an agent). As soon as an LCDT agent expects that it may create agentic components in its simulation, it's going to reason horribly about them (e.g. assuming that any adjustment it makes to other parts of its simulation can't possibly impact their existence or behaviour, relative to the prior).

I think LCDT does successfully remove the incentives you're aiming to remove. I just expect it to be too broken to do anything useful. I can't currently see how we could get the good parts without the brokenness.
Richard Ngo (2mo, 2 points): What are you referring to here?
A positive case for how we might succeed at prosaic AI alignment

(1) a (good) pivotal act is probably a non-myopic problem, and (2) you can't solve a nontrivial nonmyopic problem with a myopic solver. [...] My guess is that you have some idea of how a myopic solver can solve a nonmyopic problem (by having it output whatever HCH would do, for instance).

Yeah, that's right, I definitely agree with (1) and disagree with (2).

And then Eliezer would probably reply that the non-myopia has been wrapped up somewhere else (e.g. in HCH), and that has become the dangerous part (or, more realistically, the insufficiently capable

... (read more)

It still doesn't seem to me like you've sufficiently answered the objection here.

I tend to think that HCH is not dangerous, but I agree that it's likely insufficiently capable. To solve that problem, we have to go to a myopic objective that is more powerful.

What if any sufficiently powerful objective is non-myopic? Or, on a different-but-equivalent phrasing: what if myopia is a property only of very specific toy objectives, rather than a widespread property of objectives in general (including objectives that humans would intuitively consider to be aimed... (read more)

A positive case for how we might succeed at prosaic AI alignment

To be clear, I agree with this also, but don't think it's really engaging with what I'm advocating for—I'm not proposing any sort of assemblage of reasoners; I'm not really sure where that misconception came from.

I don't think the assemblage is the point. I think the idea here is that "myopia" is a property of problems: a non-myopic problem is (roughly) one which inherently requires doing things with long time horizons. I think Eliezer's claim is that (1) a (good) pivotal act is probably a non-myopic problem, and (2) you can't solve a nontrivial nonmyopic problem with a myopic solver. Part (2) is what I think TekhneMakr is gesturing at and Eliezer is endorsing.

My guess is that you have some idea of how a myopic solver can solve a nonmyopic problem (by having it out... (read more)

A positive case for how we might succeed at prosaic AI alignment

Why do you expect this to be any easier than directing that optimisation towards the goal of "doing what the human wants"? In particular, if you train a system on the objective "imitate HCH", why wouldn't it just end up with the same long-term goals as HCH has?

To be clear, I was only talking about (1) here, which is just about what it might look like for an agent to be myopic, not how to actually get an agent that satisfies (1). I agree that you would most likely get a proxy-aligned model if you just trained on “imitate HCH”—but just training on “imitat... (read more)

Richard Ngo (2mo, 2 points): That all makes sense. But I had a skim of (2), (3), (4), and (5) and it doesn't seem like they help explain why myopia is significantly more natural than "obey humans"?
A positive case for how we might succeed at prosaic AI alignment

I suspect that there were a lot of approaches that would have produced similar results to how we ended up doing language modeling. I believe that the main advantage of Transformers over LSTMs is just that LSTMs have exponentially decaying ability to pay attention to prior tokens while Transformers can pay constant attention to all tokens in the context. I suspect that it would have been possible to fix the exponential decay problem with LSTMs and get them to scale like Transformers, but Transformers came first, so nobody tried. And that's not to say that M... (read more)
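(A toy numerical illustration of that claim — my own sketch, not something from the comment: with a leaky recurrence that retains a fraction gamma of its state per step, the influence of a token k steps back shrinks like gamma^k, whereas a self-attention head can place weight directly on any position regardless of distance.)

    import numpy as np

    gamma = 0.9                             # hypothetical per-step retention factor of an LSTM-like state
    k = np.arange(50)                       # how many steps back the token is

    rnn_influence = gamma ** k              # exponential decay with distance
    attn_influence = np.full(50, 1.0 / 50)  # attention can weight all 50 positions directly

    print(rnn_influence[[1, 10, 49]])       # ~[0.9, 0.35, 0.006]: distant tokens are nearly invisible
    print(attn_influence[[1, 10, 49]])      # uniform: distance alone imposes no penalty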

Quintin Pope (2mo, 3 points): I agree that transformers vs other architectures is a better example of the field “following the leader” because there are lots of other strong architectures (perceiver, mlp mixer, etc). In comparison, using self supervised transfer learning is just an objectively good idea you can apply to any architecture and one the brain itself almost surely uses. The field would have converged to doing so regardless of the dominant architecture.

One hopeful sign is how little attention the ConvBERT language model [https://huggingface.co/transformers/model_doc/convbert.html] has gotten. It mixes some convolution operations with self attention to allow self attention heads to focus on global patterns as opposed to local patterns better handled by convolution. ConvBERT is more compute efficient than a standard transformer, but hasn’t made much of a splash. It shows the field can ignore low profile advances made by smaller labs.

For your point about the value of alignment: I think there’s a pretty big range of capabilities where the marginal return on extra capabilities is higher than the marginal return on extra alignment. Also, you seem focused on avoiding deception/treacherous turns, which I think are a small part of alignment costs until near human capabilities. I don’t know what sort of capabilities penalty you pay for using a myopic training objective, but I don’t think there’s much margin available before voluntary mass adoption becomes implausible.
A positive case for how we might succeed at prosaic AI alignment

The notion of (1) seems like the cat-belling problem here; the other steps don't seem interesting by comparison, the equivalent of talking about all the neat things to do after belling the cat.

I'm surprised that you think (1) is the hard part—though (1) is what I'm currently working on, since I think it's necessary to make a lot of the other parts go through, I expect it to be one of the easiest parts of the story to make work.

What pivotal act is this AGI supposed to be executing? Designing a medium-strong nanosystem?

I left this part purposefully v... (read more)

Richard Ngo (2mo, 7 points): Why do you expect this to be any easier than directing that optimisation towards the goal of "doing what the human wants"? In particular, if you train a system on the objective "imitate HCH", why wouldn't it just end up with the same long-term goals as HCH has? That seems like a much more natural thing for it to learn than the concept of imitating HCH, because in the process of imitating HCH it still has to do long-term planning anyway.

(I feel like this is basically the same set of concerns/objections that I raised in this post [https://www.alignmentforum.org/posts/GqxuDtZvfgL2bEQ5v/arguments-against-myopic-training]. I also think that myopia is a fairly central example of the thing that Eliezer was objecting to with his "water" metaphor in our dialogue [https://www.alignmentforum.org/posts/7im8at9PmhbT4JHsW/ngo-and-yudkowsky-on-alignment-difficulty], and I endorse his objection in this context.)
A positive case for how we might succeed at prosaic AI alignment

Certainly it doesn't matter what substrate the computation is running on.

I read Yudkowsky as positing some kind of conservation law. Something like, if the plans produced by your AI succeed at having specifically chosen far-reaching consequences if implemented, then the AI must have done reasoning about far-reaching consequences. Then (I'm guessing) Yudkowsky is applying that conservation law to [a big assemblage of myopic reasoners which outputs far-reaching plans], and concluding that either the reasoners weren't myopic, or else the assemblage implement... (read more)

How do we become confident in the safety of a machine learning system?

This is probably one of the most important posts on alignment on this forum. Seriously. I want everyone thinking about conceptual alignment, and everyone trying conceptual alignment, to read this and think about it deeply.

Glad you think so! I definitely agree and am planning on using this framework in my own research going forward.

"story" makes technical people feel uncomfortable. We immediately fear weird justification and biases towards believing interesting stories. And we should be wary of this when working on alignment, while acknowledging that mo

... (read more)
What exactly is GPT-3's base objective?

First, the problem is only with outer/inner alignment—the concept of unintended mesa-optimization is still quite relevant and works just fine.

Second, the problems with applying Risks from Learned Optimization terminology to GPT-3 have nothing to do with the training scenario, the fact that you're doing unsupervised learning, etc.

The place where I think you run into problems is that, for cases where mesa-optimization is intended in GPT-style training setups, inner alignment in the Risks from Learned Optimization sense is usually not the goal. Most of the op... (read more)

What exactly is GPT-3's base objective?

This is unfortunate, no? The AI safety community had this whole thing going with mesa-optimization and whatnot... now you propose to abandon the terminology and shift to this new frame? But what about all the people using the old terminology? Is the old terminology unsalvageable?

To be clear, that's definitely not what I'm arguing. I continue to think that the Risks from Learned Optimization terminology is really good, for the specific case that it's talking about. The problem is just that it's not general enough to handle all possible ways of training a... (read more)

DanielFilan (2mo, 4 points): GPT-3 was trained using supervised learning, which I would have thought was a pretty standard way of training a model using machine learning. What training scenarios do you think the Risks from Learned Optimization terminology can handle, and what's the difference between those and the way GPT-3 was trained?
What exactly is GPT-3's base objective?

My current position is that this is the wrong question to be asking—instead, I think the right question is just “what is GPT-3's training story?” Then, we can just talk about to what extent the training rationale is enough to convince us that we would get the desired training goal vs. some other model, like a deceptive model, instead—rather than having to worry about what technically counts as the base objective, mesa-objective, etc.

Daniel Kokotajlo (2mo, 6 points): I was wondering if that was the case, haha. Thanks!

This is unfortunate, no? The AI safety community had this whole thing going with mesa-optimization and whatnot... now you propose to abandon the terminology and shift to this new frame? But what about all the people using the old terminology? Is the old terminology unsalvageable?

I do like your new thing and it seems better to me in some ways, but worse in others. I feel like I expect a failure mode where people exploit ambiguity and norm-laden concepts to convince themselves of happy fairy tales. I should think more about this and write a comment.

ETA: Here's an attempt to salvage the original inner/outer alignment problem framing: We admit up front that it's a bit ambiguous what the base objective is, and thus there will be cases where it's ambiguous whether a mesa-optimizer is aligned to the base objective. However, we say this isn't a big deal. We give a handful of examples of "reasonable construals" of the base objective, like I did in the OP, and say that all the classic arguments are arguments for the plausibility of cases where a mesa-optimizer is misaligned with every reasonable construal of the base objective. Moreover, we make lemons out of lemonade, and point out that the fact there are multiple reasonable construals is itself reason to think inner alignment problems are serious and severe.

I'm imagining an interlocutor who thinks "bah, it hasn't been established yet that inner-alignment problems are even a thing; it still seems like the default hypothesis is that you get what you train for, i.e. you get an agent that is trying to maximize predictive accuracy or whatever." And then we say "Oh? What exactly is it trying to maximize? Predictive accuracy full stop? Or predictive accuracy conditional on dataset D? Or is it instead trying to maximize reward, in which case it'd hack its reward channel if it could? Whichever one you think it is, would you not agree that it's plausible that it might inste
Charlie Steiner (2mo, 1 point): Yeah, agreed. It's true that GPT obeys the objective "minimize the cross-entropy loss between the output and the distribution of continuations in the training data." But this doesn't mean it doesn't also obey objectives like "write coherent text", to the extent that we can tell a useful story about how the training set induces that behavior. (It is amusing to me how our thoughts immediately both jumped to our recent hobbyhorses.)
Forecasting progress in language models

(Moderation note: added to the Alignment Forum from LessWrong.)

Towards Deconfusing Gradient Hacking

(Moderation note: added to the Alignment Forum from LessWrong.)

[AN #166]: Is it crazy to claim we're in the most important century?

Note though that it does not defuse all such uneasiness -- you can still look at how early we appear to be (given the billions of years of civilization that could remain in the future), and conclude that the simulation hypothesis is true, or that there is a Great Filter in our future that will drive us extinct with near-certainty. In such situations there would be no extraordinary impact to be had today by working on AI risk.

I don't think I agree with this—in particular, it seems like even given the simulation hypothesis, there could still be quite a lo... (read more)

Rohin Shah (3mo, 5 points): Yeah, I agree the statement is false as I literally wrote it, though what I meant was that you could easily believe you are in the kind of simulation where there is no extraordinary impact to have.
Meta learning to gradient hack

Interesting! I'd definitely be excited to know if you figure out what it's doing.

Meta learning to gradient hack

70 steps is not very many—does training converge if you train for longer (e.g. 700, 7000, 70000)?

Also, in addition to regularization making this strategy not very effective, I'd also suspect that hyperparameter tuning would break it as well—e.g. I'd be interested in what happens if you do black-box hyperparameter tuning on the base training process's hyperparameters after applying meta-learning (though then, to be fair to the meta-learning process, you'd also probably want to do the meta-learning in a setting with variable hyperparameters).
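(A minimal sketch of this sort of black-box search; retrain_and_evaluate is a hypothetical helper that retrains the meta-learned network on the base task with the given hyperparameters and returns its final base loss — it is not a function from the post.)

    import random

    def random_search(retrain_and_evaluate, n_trials=50, seed=0):
        rng = random.Random(seed)
        best = None
        for _ in range(n_trials):
            hparams = {
                "lr": 10 ** rng.uniform(-5, -1),             # log-uniform learning rate
                "weight_decay": 10 ** rng.uniform(-6, -2),
                "steps": rng.choice([70, 700, 7000, 70000]),
            }
            loss = retrain_and_evaluate(**hparams)
            if best is None or loss < best[0]:
                best = (loss, hparams)
        return best  # lowest base-task loss found and the hyperparameters that achieved it

(If no setting in a broad search like this recovers the base function, that's stronger evidence the meta-learned protection is robust rather than an artifact of one particular training configuration.)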

Quintin Pope (4mo, 7 points): Thanks for the feedback! I use batch norm regularisation, but not dropout. I just tried retraining the 100,000 cycle meta learned model in a variety of ways, including for 10,000 steps with 10,000x higher lr, using resilient backprop (which multiplies weights by a factor to increase/decrease them), and using an L2 penalty to decrease weight magnitude. So far, nothing has gotten the network to model the base function. The L2 penalty did reduce weight values to ~the normal range, but the network still didn’t learn the base function. I now think the increase in weight values is just incidental and that the meta learner found some other way of protecting the network from SGD.
Meta learning to gradient hack

(Moderation note: added to the Alignment Forum from LessWrong.)

Selection Theorems: A Program For Understanding Agents

Thanks! I hope the post is helpful to you or anyone else trying to think about the type signatures of goals. It's definitely a topic I'm pretty interested in.

Selection Theorems: A Program For Understanding Agents

Have you seen Mark and my “Agents Over Cartesian World Models”? Though it doesn't have any Selection Theorems in it, and it just focuses on the type signatures of goals, it does go into a lot of detail about possible type signatures for agent's goals and what the implications of those type signatures would be, starting from the idea that a goal can be defined on any part of a Cartesian boundary.

Oh excellent, that's a perfect reference for one of the successor posts to this one. You guys do a much better job explaining what agent type signatures are and giving examples and classification, compared to my rather half-baked sketch here.

AI safety via market making

Suppose the intended model is to predict H's estimate at convergence, and the actual model is predicting H's estimate at round N for some fixed N larger than any convergence time in the training set. Would you call this an "inner alignment failure", an "outer alignment failure", or something else (not an alignment failure)?

I would call that an inner alignment failure, since the model isn't optimizing for the actual loss function, but I agree that the distinction is murky. (I'm currently working on a new framework that I really wish I could reference her... (read more)

Pathways: Google's AGI

(Moderation note: added to the Alignment Forum from LessWrong.)

AI safety via market making

What would happen if we instead use convergence as the stopping condition (and throw out any questions that take more than some fixed or random threshold to converge)? Can we hope that M would be able to extrapolate what we want it to do, and predict H's reflective equilibrium even for questions that take longer to converge than what it was trained on?

This is definitely the stopping condition that I'm imagining. What the model would actually do, though, if you, at deployment time, give it a question that takes the human longer to converge on than any qu... (read more)
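(A rough sketch of that stopping condition — converged_estimate and update_estimate are illustrative names, not anything from the post: iterate the human's estimate round by round, accept it once it stops moving, and throw the question out if it never settles within the budget.)

    def converged_estimate(update_estimate, initial, max_rounds=100, eps=1e-3):
        # update_estimate(estimate, t) returns the human's revised estimate after seeing
        # round t's arguments; it and the thresholds here are placeholder assumptions.
        estimate = initial
        for t in range(max_rounds):
            new_estimate = update_estimate(estimate, t)
            if abs(new_estimate - estimate) < eps:
                return new_estimate        # converged: use this as the training target
            estimate = new_estimate
        return None                        # never converged: drop the question from training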

Wei Dai (4mo, 6 points): Thanks for this very clear explanation of your thinking. A couple of followups if you don't mind.

Suppose the intended model is to predict H's estimate at convergence, and the actual model is predicting H's estimate at round N for some fixed N larger than any convergence time in the training set. Would you call this an "inner alignment failure", an "outer alignment failure", or something else (not an alignment failure)?

Putting these theoretical/conceptual questions aside, the reason I started thinking about this is from considering the following scenario. Suppose some humans are faced with a time-sensitive and highly consequential decision, for example, whether to join or support some proposed AI-based governance system (analogous to the 1690 "liberal democracy" question), or a hostile superintelligence is trying to extort all or most of their resources and they have to decide how to respond. It seems that convergence on such questions might take orders of magnitude more time than what M was trained on. What do you think would actually happen if the humans asked their AI advisor to help with a decision like this? (What are some outcomes you think are plausible?)

What's your general thinking about this kind of AI risk (i.e., where an astronomical amount of potential value is lost because human-AI systems fail to make the right decisions in high-stakes situations that are eventuated by the advent of transformative AI)? Is this something you worry about as an alignment researcher, or do you (for example) think it's orthogonal to alignment and should be studied in another branch of AI safety / AI risk?
Agents Over Cartesian World Models

Or is this just exploring a potential approximation?

Yeah, that's exactly right—I'm interested in how an agent can do something like manage resource allocation to do the best HCH imitation in a resource-bounded setting.

Are we including "long speech about why human should give high approval to me because I'm suffering" as an action? I guess there's a trade-off here, where limiting to word-level output demands too much lookahead coherence of the human, while long sentences run the risk of incentivizing reward tampering. Is that the reason you had in mind

... (read more)
Obstacles to gradient hacking

(Moderation note: added to the Alignment Forum from LessWrong.)

Automating Auditing: An ambitious concrete technical research proposal

Sure, but presumably they'll also say what particular attacks are so hard that current ML models aren't capable of solving them—and I think that's a valuable piece of information to have.

Automating Auditing: An ambitious concrete technical research proposal

Yeah, that's a great question—I should have talked more about this. I think there are three ways to handle this sort of problem—and ideally we should do some combination of all three:

  1. Putting the onus on the attacker. Probably the simplest way to handle this problem is just to have the attacker produce larger specification breaks than anything that exists in the model independently of the attacker. If it's the case that whenever the attacker does something subtle like you're describing the auditor just finds some other random problem, then the attacker sh
... (read more)
Automating Auditing: An ambitious concrete technical research proposal

Is it inconsistent with the original meaning of the term? I thought that the original meaning of inside view was just any methodology that wasn't reference-class forecasting—and I don't use the term “outside view” at all.

Also, I'm not saying that “inside view” means “real reason,” but that my real reason in this case is my inside view.

Daniel Kokotajlo (5mo, 2 points): I regret saying anything! But since I'm in this hole now, might as well try to dig my way out: IDK, "any methodology that wasn't reference-class forecasting" wasn't how I interpreted the original texts, but *shrugs.* But at any rate what you are doing here seems importantly different than the experiments and stuff in the original texts; it's not like I can point to those experiments with the planning fallacy or the biased pundits and be like "See! Evan's inside-view reason is less trustworthy than the more general thoughts he lists below; Evan should downweight its importance and not make it his 'real reason' for doing things."

My thesis in "Taboo Outside View" was that we'd all be better off if, whenever we normally would use the term "inside view" and "outside view" we tabood that term and used something more precise instead. In this case, I think that if you just used "real reason" instead of "inside view," it would work fine -- but that's just my interpretation of the meaning you were trying to convey; perhaps instead you were trying to convey additional information beyond "real reason" and that's why you chose "inside view." If so, you may be interested to know that said additional information never made it to me, because I don't know what it might be. ;) Perhaps it was "real reason" + "This is controversial, and I'm not arguing for it here, I don't expect everyone to agree"?

I guess meaning is use, and if enough people even after reading my post still feel compelled to use these terms without tabooing them, then fair enough, even if there isn't a succinct definition of what they mean. I wish we had invented new words though instead of re-using "inside view" and "outside view" given that those terms already had meanings from Tetlock, Kahneman, etc.
LCDT, A Myopic Decision Theory

That's a really interesting thought—I definitely think you're pointing at a real concern with LCDT now. Some thoughts:

  • Note that this problem is only with actually running agents internally, not with simply having the objective of imitating/simulating an agent—it's just that LCDT will try to simulate that agent exclusively via non-agentic means.
  • That might actually be a good thing, though! If it's possible to simulate an agent via non-agentic means, that certainly seems a lot safer than internally instantiating agents—though it might just be impossible to
... (read more)
Joe_Collman (5mo, 3 points): Ok, that mostly makes sense to me. I do think that there are still serious issues (but these may be due to my remaining confusions about the setup: I'm still largely reasoning about it "from outside", since it feels like it's trying to do the impossible). For instance:

  1. I agree that the objective of simulating an agent isn't a problem. I'm just not seeing how that objective can be achieved without the simulation taken as a whole qualifying as an agent. Am I missing some obvious distinction here? If for all x in X, sim_A(x) = A(x), then if A is behaviourally an agent over X, sim_A seems to be also. (Replacing equality with approximate equality doesn't seem to change the situation much in principle) [Pre-edit: Or is the idea that we're usually only concerned with simulating some subset of the agent's input->output mapping, and that a restriction of some function may have different properties from the original function? (agenthood being such a property)]
     1. I can see that it may be possible to represent such a simulation as a group of nodes none of which is individually agentic - but presumably the same could be done with a human. It can't be ok for LCDT to influence agents based on having represented them as collections of individually non-agentic components.
     2. Even if sim_A is constructed as a Chinese room [https://en.wikipedia.org/wiki/Chinese_room] (w.r.t. agenthood), it's behaving collectively as an agent.
  2. "it's just that LCDT will try to simulate that agent exclusively via non-agentic means" - mostly agreed, and agreed that this would be a good thing (to the extent possible). However, I do think there's a significant difference between e.g.: [LCDT will not aim to instantiate agents] (true) vs [LCDT will not instantiate agents] (potentially false: they may be side-effects) Side-effect-agents seem plausible if e.g.:
LCDT, A Myopic Decision Theory

But by assumption, it doesn't think it can influence anything downstream of those (or the probability that they exist, I assume).

This is not true—LCDT is happy to influence nodes downstream of agent nodes, it just doesn't believe it can influence them through those agent nodes. So LCDT (at decision time) doesn't believe it can change what HCH does, but it's happy to change what it does to make it agree with what it thinks HCH will do, even though that utility node is downstream of the HCH agent nodes.
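(A toy sketch of that decision rule — lcdt_choose, hch_prior, and the example utility are my own illustrative names, not code from the LCDT post: the agent evaluates each action with HCH's output drawn from the prior, i.e. treated as unaffected by the action, but still optimizes the utility node downstream of HCH.)

    def lcdt_choose(actions, hch_prior, utility):
        # hch_prior: dict mapping HCH's possible answers to prior probabilities.
        # At decision time the link from the action to the HCH node is cut, so the
        # expectation over HCH's answer ignores the chosen action entirely.
        def expected_utility(action):
            return sum(p * utility(action, hch_answer)
                       for hch_answer, p in hch_prior.items())
        return max(actions, key=expected_utility)

    # Example: utility 1 for matching HCH's answer. The agent doesn't believe it can
    # change what HCH says, but it happily changes its own output to agree with it.
    prior = {"yes": 0.7, "no": 0.3}
    print(lcdt_choose(["yes", "no"], prior, lambda a, h: 1.0 if a == h else 0.0))  # -> "yes"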

Joe_Collman (5mo, 4 points): Ah yes, you're right there - my mistake.

However, I still don't see how LCDT can make good decisions over adjustments to its simulation. That simulation must presumably eventually contain elements classed as agentic. Then given any adjustment X which influences the simulation outcome both through agentic paths and non-agentic paths, the LCDT agent will ignore the influence [relative to the prior] through the agentic paths. Therefore it will usually be incorrect about what X is likely to accomplish. It seems to me that you'll also have incoherence issues here too: X can change things so that p(Y = 0) is 0.99 through a non-agentic path, whereas the agent assumes the equivalent of [p(Y = 0) is 0.5] through an agentic path.

I don't see how an LCDT agent can make efficient adjustments to its simulation when it won't be able to decide rationally on those judgements in the presence of agentic elements (which again, I assume must exist to simulate HCH).
LCDT, A Myopic Decision Theory

My issue is in seeing how we find a model that will consistently do the right thing in training (given that it's using LCDT).

How about an LCDT agent with the objective of imitating HCH? Such an agent should be aligned and competitive, assuming the same is true of HCH. Such an agent certainly shouldn't delete itself to free up disk space, since HCH wouldn't do that—nor should it fall prey to the general argument you're making about taking epsilon utility in a non-agent path, since there's only one utility node it can influence without going through other... (read more)

Joe_Collman (5mo, 4 points): Ok thanks, I think I see a little more clearly where you're coming from now. (it still feels potentially dangerous during training, but I'm not clear on that)

A further thought: Ok, so suppose for the moment that HCH is aligned, and that we're able to specify a sufficiently accurate HCH model. The hard part of the problem seems to be safe-and-efficient simulation of the output of that HCH model. I'm not clear on how this part works: for most priors, it seems that the LCDT agent is going to assign significant probability to its creating agentic elements within its simulation. But by assumption, it doesn't think it can influence anything downstream of those (or the probability that they exist, I assume). That seems to be the place where LCDT needs to do real work, and I don't currently see how it can do so efficiently. If there are agentic elements contributing to the simulation's output, then it won't think it can influence the output.

Avoiding agentic elements seems impossible almost by definition: if you can create an arbitrarily accurate HCH simulation without its qualifying as agentic, then your test-for-agents can't be sufficiently inclusive. ...but hopefully I'm still confused somewhere.
LCDT, A Myopic Decision Theory

Perhaps I'm now understanding correctly(??). An undesirable action that springs to mind: delete itself to free up disk space. Its future self is assumed to give the same output regardless of this action. More generally, actions with arbitrarily bad side-effects on agents, to gain marginal utility. Does that make sense?

Sure—that's totally fine. The point of LCDT isn't to produce an aligned agent, but to produce an agent that's never deceptive. That way, if your AI is going to delete itself to free up disk space, it'll do it in training and you can see th... (read more)

Joe_Collman (5mo, 3 points): Right, as far as I can see, it achieves the won't-be-deceptive aim. My issue is in seeing how we find a model that will consistently do the right thing in training (given that it's using LCDT).

As I understand it, under LCDT an agent is going to trade an epsilon utility gain on non-agent-influencing-paths for an arbitrarily bad outcome on agent-influencing-paths (since by design it doesn't care about those). So it seems that it's going to behave unacceptably for almost all goals in almost all environments in which there can be negative side-effects on agents we care about. We can use it to run simulations, but it seems to me that most problems (deception in particular) get moved to the simulation rather than solved.

Quite possibly I'm still missing something, but I don't currently see how the LCDT decisions do much useful work here (Am I wrong? Do you see LCDT decisions doing significant optimisation?). I can picture its being a useful wrapper around a simulation, but it's not clear to me in what ways finding a non-deceptive (/benign) simulation is an easier problem than finding a non-deceptive (/benign) agent. (maybe side-channel attacks are harder??)
LCDT, A Myopic Decision Theory

To have an incoherent world model: one in which I can believe with 99% certainty that a kite no longer exists, and with 80% certainty that you're still flying that probably-non-existent kite.

I feel pretty willing to bite the bullet on this—what sorts of bad things do you think LCDT agents would do given such a world model (at decision time)? Such an LCDT agent should still be perfectly capable of tasks like simulating HCH without being deceptive—and should still be perfectly capable of learning and improving its world model, since the incoherence only shows up at decision-time and learning is done independently.

Joe_Collman (6mo, 3 points): Ah ok. Weird, but ok. Thanks.

Perhaps I'm now understanding correctly(??). An undesirable action that springs to mind: delete itself to free up disk space. Its future self is assumed to give the same output regardless of this action. More generally, actions with arbitrarily bad side-effects on agents, to gain marginal utility. Does that make sense? I need to think more about the rest.

[EDIT and see rambling reply to Adam re ways to avoid the incoherence. TLDR: I think placing a [predicted agent action set alterations] node directly after the LCDT decision node in the original causal diagram, deducing what can be deduced from that node, and treating it as an agent at decision-time might work. It leaves the LCDT agent predicting that many of its actions don't do much, but it does get rid of the incoherence (I think). Currently unclear whether this throws the baby out with the bathwater; I don't think it does anything about negative side-effects]
LCDT, A Myopic Decision Theory

Specifically, I think if an agent can do the kind of reasoning that would allow it to create a causal world-model in the first place, then the same kind of reasoning would lead it to realize that there is in fact supposed to be a link at each of the places where we manually cut it—i.e., that the causal world-model is incoherent.

An LCDT agent should certainly be aware of the fact that those causal chains actually exist—it just shouldn't care about that. If you want to argue that it'll change to not using LCDT to make decisions anymore, you have to argue ... (read more)

Answering questions honestly instead of predicting human answers: lots of problems and some solutions

Can θ1 write an arbitrary program for f??

Yes—at least that's the assumption I'm working under.

It seems like this should be lower complexity than the intended result, since True has much lower complexity than H_understands?

I agree that the θ1 you've described has lower complexity than the intended θ1—but the θ2 in this case has higher complexity, since θ2 is no longer getting any of its complexity for free from conditioning on the f? condition. And in fact what you've just described is precisely the unintended model—what I call M−—that I'm trying to ... (read more)

Rohin Shah (6mo, 4 points): Yeah, that makes sense. I guess I don't really see the intuition about why this should be true, but fair enough to leave that as an open question.
Answering questions honestly instead of predicting human answers: lots of problems and some solutions

Seems like if the different heads do not share weights then "the parameters in f1" is perfectly well-defined?

It seemed to me like you were using it in a way such that f1 shared no weights with f2, which I think was because you were confused by the quantification, like you said previously. I think we're on the same page now.

Okay, so iiuc you're relying on an assumption (fact? desire?) that the world model will never produce deduced statements that distinguish between and ?

Sorry, I was unclear about this in my last response. and will only a... (read more)

Rohin Shah (6mo, 4 points): I think I might be missing a change you made to the algorithm. Can θ1 write an arbitrary program for f?? In that case, what prevents you from getting

    def M_theta_1_plus(theta_2, x, q):
        axioms = world_model(theta_2=theta_2)(x)
        deduced_stmts = deduction(axioms)
        return {
            "f": f_minus(q, deduced_stmts),
            "f?": True,
        }

It seems like this should be lower complexity than the intended result, since True has much lower complexity than H_understands?

I mean, I would still have said this because I interpret a "head" f1 as "the part after the shared layers", but I'm also happy to instead treat f1 as the entire function X×Q→A for which the first head forms part of the implementation.