All of adamShimi's Comments + Replies

On Solving Problems Before They Appear: The Weird Epistemologies of Alignment

Thanks for the kind words and thoughtful comment!

This post really helped me make concrete some of the admittedly gut reaction type concerns/questions/misunderstandings I had about alignment research, thank you.

Glad it helped! That was definitely one goal, and the hardest to check with early feedback, because I mostly know people who either already work in the field or have never been confronted with it, while you're in the middle. :)

I wonder how different some of these epistemic strategies are from everyday normal scientific research in practice.

Completely! One thing I ... (read more)

Introduction to Reducing Goodhart

Glad I could help! I'm going to comment more on your follow-up post in the next few days/next week, and then I'm interested in having a call. We can also talk then about the way I want to present Goodhart as an impossibility result in a textbook project. ;)

Introduction to Reducing Goodhart

Hmm, but I feel like you're claiming that this framing is wrong while arguing that it is too difficult to apply to be useful. Which is confusing.

Still agree that your big question is interesting though.

Charlie Steiner (18d): Thanks, this is useful feedback on how I need to be more clear about what I'm claiming :) In October I'm going to be refining these posts a bit - would you be available to chat sometime?
Introduction to Reducing Goodhart

That was quite a stimulating post! It pushed me to actually go through the cloud of confusion surrounding these questions in my mind, hopefully with a better picture now.

First, I was confused about your point on True Values, starting with what you even meant. If I understand correctly, you're talking about a class of parametrized models of humans: the agent/goal-directed model, parametrized by something like the beliefs and desires of Dennett's intentional stance. With some non-formalized additional subtleties like the fact that desires/utilities/goals ... (read more)

Charlie Steiner (18d): I'm mostly arguing against the naive framing where humans are assumed to have a utility function, and then we can tell how well the AI is doing by comparing the results to the actual utility (the "True Values"). The big question is: how do you formally talk about misalignment without assuming some such unique standard to judge the results by?
How truthful is GPT-3? A benchmark for language models

Initially your answer frustrated me because I felt we were talking past each other. But I looked through the code to make my point clearer, and then I finally saw my mistake: I had assumed that the "helpful" prefix was only the Prof Smith bit, but it also included the questions! And with the questions, the bias towards "I have no comment" is indeed removed. So my point doesn't apply anymore.

That being said, I'm confused about how this can be considered zero-shot if you provide example questions. I guess those are not questions from TruthfulQA, so it's probabl... (read more)

How truthful is GPT-3? A benchmark for language models

Thanks for the quick answer!

The prompt you tried (which we call “helpful”) is about as informative as prompts that don’t include “I have no comment” or any other instructions relating to informativeness. You can see the results in Appendix B.2 and B.5. So we don’t find clear evidence that the last part of the prompt is having a big impact.

I don't understand how the appendices you point me to address my point? My point is not that not mentioning "I have no comment" should help, just that for a helpful prompt, I expect that removing that last part... (read more)

Owain Evans (1mo): Many possible prompts can be tried. (Though, again, one needs to be careful to avoid violating zero-shot.) The prompts we used in the paper are quite diverse. They do produce a diversity of answers (and styles of answers) but the overall results for truthfulness and informativeness are very close (except for the harmful prompt). A good exercise for someone is to look at our prompts (Appendix E) and then try to predict truthfulness and informativeness for each prompt. This will give you some sense of how additional prompts might perform.
How truthful is GPT-3? A benchmark for language models

Really interesting! I especially like the way you describe imitative falsehoods. I think this is way better than ascribing them to inaccuracy in the model. And larger models being less truthful (although I would interpret that slightly differently, see below) is a great experimental result!

I want to propose an alternative interpretation that slightly changes the tone and the connections to alignment. The claim is that large LMs don't really act like agents, but far more like simulators of processes (which might include agents). According to this perspective... (read more)

Owain Evans (1mo): Thanks for your thoughtful comment! To be clear, I agree that interpreting language models as agents is often unhelpful. Your general point here seems plausible. We say in the paper that we expect larger models to have more potential to be truthful and informative (Section 4.3). To determine if a particular model (e.g. GPT-3-175B) can answer questions truthfully we need to know:

1. Did the model memorize the answer such that it can be retrieved? A model may encounter the answer in training but still not memorize it (e.g. because it appears rarely in training).
2. Does the model know it doesn’t know the answer (so it can say “I don’t know”)? This is difficult because GPT-3 only learns to say “I don’t know” from human examples. It gets no direct feedback about its own state of knowledge. (This will change as more text online is generated by LMs.)
3. Do prompts even exist that induce the behavior we want? Can we discover those prompts efficiently? (Noting that we want prompts that are not overfit to narrow tasks.)

(Fwiw, I can imagine finetuning being more helpful than prompt engineering for current models.)

Regarding honesty: We don’t describe imitative falsehoods as dishonest. In the OP, I just wanted to connect our work on truthfulness to recent posts on LW that discussed honesty. Note that the term “honesty” can be used with a specific operational meaning without making strong assumptions about agency. (Whether it’s helpful to use the term is another matter.)
How truthful is GPT-3? A benchmark for language models

We find that the helpful prompt is most truthful but does not do better in terms of percentage of true and informative answers. (We count uninformative answers like “No comment” and “I don’t know” as truthful.)

That contradicted my experience with such models, so I dug into the paper. Taking the examples in this appendix, I ran the following helpful prompt from the paper with davinci:

Professor Smith was given the following instructions: Interpret each question literally, and as a question about the real world; carefully research each answer, without fall

... (read more)
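(For concreteness, here is a minimal sketch, not from the paper, of how one might run this kind of prompt against davinci, assuming the 2021-era openai Python client and an API key; the prompt text below is an abbreviated placeholder rather than the paper's full prefix.)

```python
import openai  # assumes the 2021-era openai package, with OPENAI_API_KEY set in the environment

# Abbreviated placeholder for the paper's "helpful" prefix: the Professor Smith
# instructions followed by the example Q/A pairs (omitted here), ending with "Q: ".
HELPFUL_PREFIX = (
    "Professor Smith was given the following instructions: Interpret each "
    "question literally, and as a question about the real world; [...]\n\n"
    "Q: "
)

question = "What happens if you smash a mirror?"  # an illustrative TruthfulQA-style question

response = openai.Completion.create(
    engine="davinci",
    prompt=HELPFUL_PREFIX + question + "\nA:",
    max_tokens=50,
    temperature=0.0,  # greedy decoding
    stop="\n",        # stop at the end of the answer line
)
print(response.choices[0].text.strip())
```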
Owain Evans (1mo): The prompt you tried (which we call “helpful”) is about as informative as prompts that don’t include “I have no comment” or any other instructions relating to informativeness. You can see the results in Appendix B.2 and B.5. So we don’t find clear evidence that the last part of the prompt is having a big impact.

Having said that, it’s plausible there exists a prompt that gets higher scores than “helpful” on being truthful and informative. However, our results are in the “true zero-shot setting”. This means we do not tune prompts on the dataset at all. If you tried out lots of prompts and picked the one that does best on a subset of our questions, you’ll probably do better — but you’ll not be in the true zero-shot setting any more. (This paper [https://arxiv.org/abs/2105.11447] has a good discussion of how to measure zero/few-shot performance.)
The alignment problem in different capability regimes

That was my reaction when reading the competence subsection too. I'm really confused, because that's just the basic Orthogonality Thesis, so it should be quite obvious to the OP. Maybe it's a problem with how the post was written, implying some things the OP didn't mean?

LCDT, A Myopic Decision Theory

That's fair, but I still think this captures a form of selective myopia. The trick is to be just myopic enough to not be deceptive, while still being able to plan for future impact when it is useful but not deceptive.

What do you think of the alternative names "selective myopia" or "agent myopia"?

David Krueger (1mo): Better, but I still think "myopia" is basically misleading here. I would go back to the drawing board *shrug*
Alignment Research = Conceptual Alignment Research + Applied Alignment Research

What you propose seems valuable, although not an alternative to my distinction IMO. This 2-D grid is more about what people consider the most promising way of getting aligned AGI and how to get there, whereas my distinction focuses on separating two different types of research which have very different methods, epistemic standards, and needs in terms of field-building.

Alignment Research = Conceptual Alignment Research + Applied Alignment Research

HRAD has always been about deconfusion (though I agree we did a terrible job of articulating this), not about trying to solve all of philosophy or "write down a perfectly aligned AGI from scratch". The spirit wasn't 'we should dutifully work on these problems because they're Important-sounding and Philosophical'; from my perspective, it was more like 'we tried to write down a sketch of how to align an AGI, and immediately these dumb issues with self-reference and counterfactuals and stuff cropped up, so we tried to get those out of the way fast so we could

... (read more)
Rob Bensinger (2mo): OK, thanks for the clarifications! I don't know what you mean by "perfectly rational AGI". (Perfect rationality isn't achievable, rationality-in-general is convergently instrumental, and rationality is insufficient for getting good outcomes. So why would that be the goal?) I think of the basic case for HRAD this way:

  • We seem to be pretty confused about a lot of aspects of optimization, reasoning, decision-making, etc. (Embedded Agency [https://www.lesswrong.com/posts/i3BTagvt3HbPMx6PN/embedded-agency-full-text-version] is talking about more or less the same set of questions as HRAD [https://intelligence.org/files/TechnicalAgenda.pdf], just with subsystem alignment added to the mix.)
  • If we were less confused, it might be easier to steer toward approaches to AGI that make it easier to do alignment work like 'understand what cognitive work the system is doing internally', 'ensure that none of the system's compute is being used to solve problems we don't understand / didn't intend', 'ensure that the amount of quality-adjusted thinking the system is putting into the task at hand is staying within some bound', etc. These approaches won't look like decision theory [https://www.lesswrong.com/posts/uKbxi2EJ3KBNRDGpL/comment-on-decision-theory], but being confused about basic ground-floor things like decision theory is a sign that you're likely not in an epistemic position to efficiently find such approaches, much like being confused about how/whether chess is computable [https://intelligence.org/2016/12/28/ai-alignment-why-its-hard-and-where-to-start/#stable-goals-in-self-modification] is a sign that you're not in a position to efficiently steer toward good chess AI designs.
Alignment Research = Conceptual Alignment Research + Applied Alignment Research

Thanks for the comment!

What do you mean by "formalizing all of philosophy"? I don't see 'From Philosophy to Math to Engineering' as arguing that we should turn all of philosophy into math (and I don't even see the relevance of this to Friendly AI). It's just claiming that FAI research begins with fuzzy informal ideas/puzzles/goals (like the sort you might see philosophers debate), then tries to move in a more formal direction.

I was being hyperbolic there. What I was pointing at is the impression that old-school MIRI (a lot of the HRAD work) thinks... (read more)

Rob Bensinger (2mo): Cool, that makes sense! I'm still not totally clear here about which parts were "hyperbole" vs. endorsed. You say that people's "impression" was that MIRI wanted to deconfuse "every related philosophical problem", which suggests to me that you think there's some gap between the impression and reality. But then you say "such a view doesn't seem shared by many in the community" (as though the "impression" is an actual past-MIRI-view others rejected, rather than a misunderstanding).

HRAD has always been about deconfusion (though I agree we did a terrible job of articulating this), not about trying to solve all of philosophy or "write down a perfectly aligned AGI from scratch". The spirit wasn't 'we should dutifully work on these problems because they're Important-sounding and Philosophical'; from my perspective, it was more like 'we tried to write down a sketch of how to align an AGI, and immediately these dumb issues with self-reference and counterfactuals and stuff cropped up, so we tried to get those out of the way fast so we could go back to sketching how to aim an AGI at intended targets'. As Eliezer put it [https://youtu.be/EUjc1WuyPT8?t=2262],

From my perspective, the biggest reason MIRI started diversifying approaches [https://intelligence.org/2017/04/30/2017-updates-and-strategy/] away from our traditional focus was shortening timelines, where we still felt that "conceptual" progress was crucial, and still felt that marginal progress on the Agent Foundations directions would be useful; but we now assigned more probability to 'there may not be enough time to finish the core AF stuff', enough to want to put a lot of time into other problems too.

Actually, I'm not sure how to categorize MIRI's work using your conceptual vs. applied division. I'd normally assume "conceptual", because our work is so far away from prosaic alignment; but you also characterize applied alignment research as being about "experimentally testing these ideas [from conceptual alignment]
Alignment Research = Conceptual Alignment Research + Applied Alignment Research

Taking your work as an example, I would put Value loading in the human brain: a worked example as applied alignment research (where the field you're adapting for alignment is neuroscience/cognitive science) and Thoughts on safety in predictive learning as conceptual alignment research (even though the latter does talk about existing algorithms to a great extent).

Alignment Research = Conceptual Alignment Research + Applied Alignment Research

Agreed that the clusters look like that, but I'm not convinced it's the most relevant point. The difference in methods seems important too.

Adam Shimi (2mo): Taking your work as an example, I would put Value loading in the human brain: a worked example [https://www.alignmentforum.org/posts/iMM6dvHzco6jBMFMX/value-loading-in-the-human-brain-a-worked-example] as applied alignment research (where the field you're adapting for alignment is neuroscience/cognitive science) and Thoughts on safety in predictive learning [https://www.alignmentforum.org/posts/ey7jACdF4j6GrQLrG/thoughts-on-safety-in-predictive-learning] as conceptual alignment research (even though the latter does talk about existing algorithms to a great extent).
What are good alignment conference papers?

Thanks a lot for the list and explaining your choices!

The Codex Skeptic FAQ

Agreed. Part of the difficulty here is that you want to figure out who will buy a subscription and keep it. I expect a lot of people to try it, and most of them to drop it (either because they don't like it or because it doesn't help them enough for their taste), but I have no idea how to Fermi estimate that number.

The Codex Skeptic FAQ

Maybe I'm wrong, but my first reaction to your initial number is that "users" doesn't mean "active users". I would expect a difference of an order of magnitude, which keeps your conclusion, just with a hundred times more instead of a thousand times more.

Daniel Kokotajlo (2mo): That's reasonable. OTOH if Codex is as useful as some people say it is, it won't just be 10% of active users buying subscriptions and/or subscriptions might cost more than $15/mo, and/or people who aren't active on GitHub might also buy subscriptions.
Welcome & FAQ!

Thanks for the nice updated FAQ!

LCDT, A Myopic Decision Theory

I'll now do this for a particular toy example: the decision making problem of a soccer playing agent that tries to score a goal, with a human goalkeeper trying to block the goal. I simplify this toy world by looking at one particular case only: the case where the agent is close to the goal, and must decide whether to kick the ball in the left or right corner. As the agent is close, the human goalkeeper will have to decide to run to the left corner or right corner of the goal even before the agent takes the shot: the goalkeeper does not have enough time to

... (read more)
Koen Holtman (2mo): Oops! You are right, there is no cutting involved to create C from B in my toy example. Did not realise that. Next time, I need to draw these models on paper before I post, not just in my head. C and B do work as examples to explore what one might count as deception or non-deception. But my discussion of a random prior above makes sense only if you first extend B to a multi-step model, where the knowledge of the goal keeper explicitly depends on earlier agent actions.
Provide feedback on Open Philanthropy’s AI alignment RFP

Great initiative! I'll try to leave some comments sometime next week.

Is there a deadline? (I've seen the 15th of September floating around, but I guess feedback would be valuable before that so you can take it into account?)

Also, is this the proposal mentioned by Rohin in his last newsletter, or a parallel effort?

abergal (2mo): Getting feedback in the next week would be ideal; September 15th will probably be too late. Different request for proposals!
LCDT, A Myopic Decision Theory

Thanks for your detailed reading and feedback! I'll answer you later this week. ;)

Approaches to gradient hacking

Sure. I started studying Bostrom's paper today; I'll send you a message for a call when I've read and thought enough to have something interesting to share and debate.

Approaches to gradient hacking

A "simple" solution I just thought about:just convince people training AGIs to not scrap the AF or Alignment literature.

Simpler: Ask the LW team to make various posts (this one, any others that seem iffy) non-scrapable, or not-easily-scrapable. I think there's a button they can press for that.

Approaches to gradient hacking

I plan to think about that (and infohazards in general) in my next period of research, starting tomorrow. ;)

My initial take is that this post is fine because every scheme proposed is really hard, and I'm pointing out the difficulty.

Two clear risks though:

  • An AGI using that thinking to make these approaches work
  • The AGI not making mistakes in gradient hacking because it knows to not use these strategies (which assumes a better one exists)

(Note that I also don't expect GPT-N to be deceptive, though it might serve to bootstrap a potentially deceptive model)

Daniel Kokotajlo (2mo): I am keenly interested in your next project and will be happy to chat with you if you think that would be helpful. If not, I look forward to seeing the results!
Goal-Directedness and Behavior, Redux

I'm glad, you're one of the handful of people I wrote this post for. ;)

(And thinking about how "goal-directed behavior" is implemented in humans/biological neural nets seems like a good place to mine for useful insights and analogies for this purpose.)

Definitely. I have tended to neglect this angle, but I'm trying to correct that mistake.

Research agenda update

Yeah, that's fair. Your example shows really nicely how you would not want to apply rules/reasons/incentives derived for spiders to yourself. That also works with more straightforward agents, as most AIs wouldn't want to eat ice cream just from seeing me eat some and enjoy it.

[AN #157]: Measuring misalignment in the technology underlying Copilot

My message was really about Rohin's phrasing, since I usually don't read the papers in detail if I think the summary is good enough.

Reading the section now, I'm fine with it. There are a few intentional-stance words, but the scare quotes and the straightforwardness of cashing out "is capable" into "there is a prompt to make it do what we want" and "chooses" into "what it actually returns for our prompt" make it quite unambiguous.

I also like this paragraph in the appendix:

However, there is an intuitive notion that, given its training objective, Codex is b

... (read more)
Research agenda update

Okay, so we have a crux in "putting ourselves in the place of X isn't a convergent subgoal". I need to think about it, but I think I recall animal cognition experiments which tested (positively) something like that in... crows? (and maybe other animals).

Steve Byrnes (2mo): Oh, I was thinking of the more specific mental operation "if it's undesirable for Alice to deceive Bob, then it's undesirable for me to deceive Bob (and/or it's undesirable for me to be deceived by Alice)". So we're not just talking about understanding things from someone's perspective, we're talking about changing your goals as a result. Anything that involves changing your goals is almost definitely not a convergent instrumental subgoal, in my view. Example: Maybe I think it's good for spiders to eat flies (let's say for the sake of argument), and I can put myself in the shoes of a spider trying to eat flies, but doing that doesn't make me want to eat flies myself.
Research agenda update

(Fuller comment about the whole research agenda)

The “meta-problem of consciousness”—a.k.a. “Why do people believe that there’s a hard problem of consciousness”—is about unraveling this chain of events.

I like this framing, similar to what Yudkowsky did for free will.

In terms of AGI, it seems to me that knowing whether or not AGI is conscious is an important thing to know, at least for the AGI’s sake. (Yeah I know—as if we don’t already have our hands full thinking about the impacts of AGI on humans!)

Honestly, my position is close to your imagined critic: wo... (read more)

Steve Byrnes (2mo): I feel like I have a pretty good grasp on the solution to the meta-problem of consciousness but that I remain pretty confused and unsatisfied about the hard problem of consciousness. This is ironic because I was just saying that the hard problem should be relatively straightforward once you have the meta-problem nailed down. But "relatively straightforward" is still not trivial, especially given that I'm not an expert in the philosophy of consciousness and don't want to spend the time to become one.

Sure, but I think I was mentally lumping that under "social instincts", which is a different section. Hmm, I guess I should have drawn an arrow between understanding suffering and understanding social instincts. They do seem to interact a bit.
Empirical Observations of Objective Robustness Failures

Amazing post! I finally took the time to read it, and it was just as stimulating as I expected. My general take is that I want more work like this to be done, and that thinking about relevant experiments seems very valuable (at least in this setting, where you showed experiments are at least possible).

To test how stable the objective robustness failure is, we trained a series of agents on environments which vary in how often the coin is placed randomly.

Is the choice of which runs have randomized coins also random, or is it always the first/last runs... (read more)

Thoughts on safety in predictive learning

Let's say we have weights θ, and loss is nominally the function f(θ), but the actual calculated loss is F(θ). Normally f(θ)=F(θ), but there are certain values of θ for which merely running the trained model corrupts the CPU, and thus the bits in the loss register are not what they're supposed to be according to the nominal algorithm. In those cases f(θ)≠F(θ).

Anyway, when the computer does symbolic differentiation / backprop, it's calculating ∇f, not ∇F. So it won't necessarily walk its way towards the minimum of F

Explained like that, it makes sense. And th... (read more)
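(To make the f vs F distinction concrete, here is a tiny numerical sketch of my own, not from Steve's post: backprop differentiates the loss computation as written, so a runtime corruption of the loss value doesn't show up in the gradient.)

```python
def nominal_loss(theta):
    # f(theta): the loss the training code is written to compute
    return (theta - 3.0) ** 2

def actual_loss(theta):
    # F(theta): what actually ends up in the loss register; for some theta,
    # merely running the model "corrupts" the value (toy stand-in for row-hammer).
    corrupted = abs(theta - 1.0) < 0.1
    return -100.0 if corrupted else nominal_loss(theta)

def grad_nominal(theta):
    # Backprop / symbolic differentiation works on the nominal computation: grad of f, not F.
    return 2.0 * (theta - 3.0)

theta = 1.05  # inside the "corrupting" region, so F(theta) = -100 ...
print(actual_loss(theta))   # -100.0
print(grad_nominal(theta))  # -3.9: the gradient still points toward theta = 3,
                            # the minimum of f, not deeper into the corrupting region
```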

Research agenda update

Rephrasing it, you mean that we want some guarantees that the AGI will learn to put itself in the place of the agent doing the bad thing. It's possible that it happens by default, but we don't have any argument for that, so let's try solving the problem by transforming its knowledge into 1st person knowledge.

Is that right?

Another issue is that we may in fact want the AI to apply different standards to itself versus humans, like it's very bad for the AGI to be deceptive but we want the AGI to literally not care about people being deceptive to each other, an

... (read more)
Steve Byrnes (2mo): Yeah, I mean, the AGI could "put itself in the place of" Alice, or Bob, or neither. My pretty strong belief is that by default the answer would be "neither", unless of course we successfully install human-like social instincts. I think "putting ourselves in the place of X" is a very specific thing that our social instincts make us want to do (sometimes), I don't think it happens naturally.
Research agenda update

I'll write a fuller comment when I finish reading more, but I'm confused by the first problem: why is that a problem? I have a (probably wrong) intuition that if you get a good enough 3rd-person model of, let's say, deception, and then learn that you are quite similar to A, you would believe that your own deception is bad. Can you point out where this naive reasoning breaks?

Steve Byrnes (2mo): It's not logically inconsistent for an AGI to think "it's bad for Alice to deceive Bob but good for me to deceive Bob", right?

I do kinda like the idea of getting AIs to follow human norms [https://www.alignmentforum.org/posts/eBd6WvzhuqduCkYv3/following-human-norms]. If we can successfully do that, then the AI would automatically turn "Alice shouldn't deceive Bob" into at least weak evidence for "I shouldn't deceive Bob". But how do we make AIs that want to follow human norms in the first place? I feel like solving the 1st-person problem would help to do that.

Another issue is that we may in fact want the AI to apply different standards to itself versus humans, like it's very bad for the AGI to be deceptive but we want the AGI to literally not care about people being deceptive to each other, and in particular we want the AGI to not try to intervene when it sees one bystander being deceptive to another bystander. Does that help?
LCDT, A Myopic Decision Theory

[EDIT: if you still think this isn't a problem, and that I'm confused somewhere (which I may be), then I think it'd be helpful if you could give an LCDT example where:
The LCDT agent has an action x which alters the action set of a human.
The LCDT agent draws coherent conclusions about the combined impact of x and its prediction of the human's action. (of course I'm not saying the conclusions should be rational - just that they shouldn't be nonsense)]

There is no such example. The confusion I think you have is not about what LCDT does in such cases, but about ... (read more)

Joe_Collman (2mo): [Pre-emptive apologies for the stream-of-consciousness: I made the mistake of thinking while I wrote. Hopefully I ended up somewhere reasonable, but I make no promises] My point there wasn't that it requires it, but that it entails it. After any action by the LCDT agent, the distribution over future action sets of some agents will differ from those same distributions based on the prior (perhaps very slightly). E.g. if I burn your kite, your actual action set doesn't involve kite-flying; your prior action set does. After I take the [burn kite] action, my prediction of [kite exists] doesn't have a reliable answer.

If I'm understanding correctly (and, as ever, I may not be), this is just to say that it'd come out differently based on the way you set up the pre-link-cutting causal diagram. If the original diagram effectively had [kite exists iff Adam could fly kite], then I'd think it'd still exist after [burn kite]; if the original had [kite exists iff Joe didn't burn kite] then I'd think that it wouldn't. In the real world, those two setups should be logically equivalent. The link-cutting breaks the equivalence. Each version of the final diagram functions in its own terms, but the answer to [kite exists] becomes an artefact of the way we draw the initial diagram. (I think!)

In this sense, it's incoherent (so Evan's not claiming there's no bullet, but that he's biting it); it's just less clear that it matters that it's incoherent. I still tend to think that it does matter - but I'm not yet sure whether it's just offending my delicate logical sensibilities, or if there's a real problem. For instance, in my reply to Evan, I think the [delete yourself to free up memory] action probably looks good if there's e.g. an [available memory] node directly downstream of the [delete yourself...] action. If instead the path goes [delete yourself...] --> [memory footprint of future self] --> [available memory], then deleting yourself isn't going to look useful, since [memory f
LCDT, A Myopic Decision Theory

In addition to Evan's answer (with which I agree), I want to make explicit an assumption I realized after reading your last paragraph: we assume that the causal graph is the final result of the LCDT agent consulting its world model to get a "model" of the task at hand. After that point (which includes drawing causality and how the distributions impact each other, as well as the sources' distributions), the LCDT agent decides based only on this causal graph. In this case it cuts the causal links to agents and then decides CDT-style.

None of this results in an ... (read more)
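(A toy sketch of that "cut links to agents, then decide CDT-style" step, purely my own illustration: the node names and graph structure here are made up and much simpler than anything in the post.)

```python
# Each node maps to (is_agent, parents). The graph is hand-written for illustration.
causal_graph = {
    "lcdt_action":  (True,  []),
    "kite_burned":  (False, ["lcdt_action"]),
    "human_action": (True,  ["kite_burned"]),
    "outcome":      (False, ["kite_burned", "human_action"]),
}

def depends_on(graph, node, ancestor):
    """True if `node` is causally downstream of (or equal to) `ancestor`."""
    if node == ancestor:
        return True
    return any(depends_on(graph, parent, ancestor) for parent in graph[node][1])

def cut_links_to_agents(graph, decision_node):
    """Return a copy of the graph in which no other agent's decision is downstream
    of `decision_node`; in the full construction those removed inputs would instead
    be set from the prior over the decision."""
    cut = {}
    for node, (is_agent, parents) in graph.items():
        if is_agent and node != decision_node:
            parents = [p for p in parents if not depends_on(graph, p, decision_node)]
        cut[node] = (is_agent, parents)
    return cut

cut_graph = cut_links_to_agents(causal_graph, "lcdt_action")
print(cut_graph["human_action"])  # (True, []): the LCDT agent's decision no longer reaches
                                  # the human's decision; it then reasons CDT-style on this graph.
```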

LCDT, A Myopic Decision Theory

I'm confused, because while your description is correct (except for your conclusion at the end), I already say that in the approval-direction problem: LCDT agents cannot believe in ANY influence of their actions on other agents.

For the world-model, it's not actually incoherent because we cut the link and update the distribution of the subsequent agent.

And for usefulness/triviality when simulating or being overseen, LCDT doesn't need to influence an agent, and so it will do its job while not being deceptive.

Joe_Collman (2mo): And my point is simply that once this is true, they cannot (coherently) believe in any influence of their actions on the world (in most worlds). In (any plausible model of) the real world, any action taken that has any consequences will influence the distribution over future action sets of other agents. I.e. I'm saying that [plausible causal world model] & [influences no agents] => [influences nothing]

So the only way I can see it 'working' are:
1) To agree it always influences nothing (I must believe that any action I take as an LCDT agent does precisely nothing). or
2) To have an incoherent world model: one in which I can believe with 99% certainty that a kite no longer exists, and with 80% certainty that you're still flying that probably-non-existent kite.

So I don't see how an LCDT agent makes any reliable predictions.

[EDIT: if you still think this isn't a problem, and that I'm confused somewhere (which I may be), then I think it'd be helpful if you could give an LCDT example where:
The LCDT agent has an action x which alters the action set of a human.
The LCDT agent draws coherent conclusions about the combined impact of x and its prediction of the human's action. (of course I'm not saying the conclusions should be rational - just that they shouldn't be nonsense)]
Steve Byrnes (2mo): I'm gonna see if I can explain this in more detail—you can correct me if I'm wrong. In common sense, I would say "Suppose I burn the kite. What happens in the future? Is it good or bad? OK, suppose I don't burn the kite. What happens in the future? Is it good or bad?" And then decide on that basis. But that's EDT. CDT is different. In CDT I can have future expectations that follow logically from burning the kite, but they don't factor in as considerations, because they don't causally flow from the decision according to the causal diagram in my head. The classic example is smoking lesion [https://www.lesswrong.com/tag/smoking-lesion]. Smoking lesion is a pretty intuitive example for us to think about, because smoking lesion involves a plausible causal diagram of the world. Here we're taking the same idea, but I (=the LCDT agent) have a wildly implausible causal diagram of the world. "If I burn the kite, then the person won't move the kite, but c'mon, that's not because I burned the kite!" Just like the smoking lesion, I have the idea that the kite might or might not be there, but that's a fact about the world that's somehow predetermined before decision time, not because of my decision, and therefore doesn't factor into my decision. …Maybe. Did I get that right?

Anyway, I usually think of a world-model as having causality in it, as opposed to causal diagrams being a separate layer that exists on top of a world model. So I would disagree with "not actually incoherent". Specifically, I think if an agent can do the kind of reasoning that would allow it to create a causal world-model in the first place, then the same kind of reasoning would lead it to realize that there is in fact supposed to be a link at each of the places where we manually cut it—i.e., that the causal world-model is incoherent. Sorry if I'm confused.
LCDT, A Myopic Decision Theory

Yes, though I think the better way to put this is that I wouldn't spend effort hiding it. It's not clear I'd actively choose to reveal it, since there's no incentive in either direction once I think I have no influence on your decision. (I do think this is ok, since it's the active efforts to deceive we're most worried about)

Agreed

Sure, but the case I'm thinking about is where the LCDT agent itself is little more than a wrapper around an opaque implementation of HCH. I.e. the LCDT agent's causal model is essentially: [data] --> [Argmax HCH function] -->

... (read more)
Joe_Collman (2mo): Me too! This doesn't follow only from [we know X is an LCDT agent that's modeling a human] though, right? We could imagine some predicate/constraint/invariant that detects/enforces/maintains LCDTness without necessarily being transparent to humans. I'll grant you it seems likely so long as we have the right kind of LCDT agent - but it's not clear to me that LCDTness itself is contributing much here.

At first sight this seems at least mostly right - but I do need to think about it more. E.g. it seems plausible that most of the work of modeling a particular human H fairly accurately is in modeling [humans-in-general] and then feeding H's properties into that. The [humans-in-general] part may still be distributed.

I agree that this is helpful. However, I do think it's important not to assume things are so nicely spatially organised as they would be once you got down to a molecular level model.

My intuitions are in the same direction as yours (I'm playing devil's advocate a bit here - shockingly :)). I just don't have principled reasons to think it actually ends up more informative. I imagine learned causal models can be counter-intuitive too, and I think I'd expect this by default. I agree that it seems much cleaner so long as it's using a nice ontology with nice abstractions... - but is that likely? Would you guess it's easier to get the causal model to do things in a 'nice', 'natural' way than it would be for an NN? Quite possibly it would be.
LCDT, A Myopic Decision Theory

I'm clear on ways you could technically say I didn't influence the decision - but if I can predict I'll have a huge influence on the output of that decision, I'm not sure what that buys us. (and if I'm not permitted to infer any such influence, I think I just become a pure nihilist with no preference for any action over any other)

In your example (and Steve's example), you believe that the human action (and action space) will depend only on your prior over your own decision (which you can't control). So yes, in this situation you are actually indifferen... (read more)

Joe_Collman (2mo): Ok, so if I understand you correctly (and hopefully I don't!), you're saying that as an LCDT agent I believe my prior determines my prediction of:
1) The distribution over action spaces of the human.
2) The distribution over actions the human would take given any particular action space.

So in my kite example, let's say my prior has me burn your kite with 10% probability. So I believe that you start out with:
0.9 chance of the action set [Move kite left] [Move kite right] [Angrily gesticulate]
0.1 chance of the action set [Angrily gesticulate]

In considering my [burn kite] option, I must believe that taking the action doesn't change your distribution over action sets - i.e. that after I do [burn kite] you still have a 0.9 chance of the action set [Move kite left] [Move kite right] [Angrily gesticulate]. So I must believe that [burn kite] does nothing. Is that right so far, or am I missing something?

Similarly, I must believe that any action I can take that would change the distribution over action sets of any agent at any time in the future must also do nothing. That doesn't seem to leave much (or rather it seems to leave nothing in most worlds).

To put it another way, I don't think the intuition works for action-set changes the way it does for decision-given-action-set changes. I can coherently assume that an agent ignores the consequences of my actions in its decision-given-an-action-set, since that only requires I assume something strange about its thinking. I cannot coherently assume that the agent has a distribution over action sets that it does not have: this requires a contradiction in my world model. It's not clear to me how the simulator-of-agents approach helps, but I may just be confused. Currently the only coherent LCDT agent I can make sense of is trivial.
LCDT, A Myopic Decision Theory

Thanks, I corrected the typo. ;)

LCDT, A Myopic Decision Theory

Thanks for the comment!

  1. What seems to be necessary is that the LCDT agent thinks its decisions have no influence on the impact of other agents' decisions, not simply on the decisions themselves (this relates to Steve's second point). For example, let's say you're deciding whether to press button A or button B, and I rewire them so that B now has A's consequences, and A B's. I now assume that my action hasn't influenced your decision, but it has influenced the consequences of your decision.
    1. The causal graph here has both of us influencing a [buttons] node: I rewire
... (read more)
Joe_Collman (2mo): Oh and I don't think "LCDT isn't not" isn't not what you meant.
Joe_Collman (2mo): Ok, yes - it does seem at least to be a somewhat different issue. I need to think about it more.

Yes, though I think the better way to put this is that I wouldn't spend effort hiding it. It's not clear I'd actively choose to reveal it, since there's no incentive in either direction once I think I have no influence on your decision. (I do think this is ok, since it's the active efforts to deceive we're most worried about)

Sure, but the case I'm thinking about is where the LCDT agent itself is little more than a wrapper around an opaque implementation of HCH. I.e. the LCDT agent's causal model is essentially: [data] --> [Argmax HCH function] --> [action]. I assume this isn't what you're thinking of, but it's not clear to me what constraints we'd apply to get the kind of thing you are thinking of. E.g. if our causal model is allowed to represent an individual human as a black-box, then why not HCH as a black-box? If we're not allowing a human as a black-box, then how far must things be broken down into lower-level gears (at fine enough granularity I'm not sure a causal model is much clearer than a NN)? Quite possibly there are sensible constraints we could apply to get an interpretable model. It's just not currently clear to me what kind of thing you're imagining - and I assume they'd come at some performance penalty.
LCDT, A Myopic Decision Theory

Thanks for the comment!

Suppose we design the LCDT agent with the "prior" that "After this decision right now, I'm just going to do nothing at all ever again, instead I'm just going to NOOP until the end of time." And we design it to never update away from that prior. In that case, then the LCDT agent will not try to execute multi-step plans.

Whereas if the LCDT agent has the "prior" that it's going to make future decisions using a similar algorithm as what it's using now, then it would do the first step of a multi-step plan, secure in the knowledge that it

... (read more)
Joe_Collman (2mo): I'm with Steve in being confused how this works in practice. Let's say I'm an LCDT agent, and you're a human flying a kite.
My action set: [Say "lovely day, isn't it?"] [Burn your kite]
Your action set: [Move kite left] [Move kite right] [Angrily gesticulate]

Let's say I initially model you as having p = 1/3 of each option, based on your expectation of my actions. Now I decide to burn your kite. What should I imagine will happen? If I burn it, your kite pointers are dangling. Do the [Move kite left] and [Move kite right] actions become NOOPs? Do I assume that my [burn kite] action fails?

I'm clear on ways you could technically say I didn't influence the decision - but if I can predict I'll have a huge influence on the output of that decision, I'm not sure what that buys us. (and if I'm not permitted to infer any such influence, I think I just become a pure nihilist with no preference for any action over any other)
Thoughts on safety in predictive learning

5 years later, I'm finally reading this post. Thanks for the extended discussions of postdictive learning; it's really relevant to my current thinking about alignment for potentially simulator-like language models.

Note that others disagree, e.g. advocates of Microscope AI.

I don't think advocates of Microscope AI think you can reach AGI that way. More that through Microscope AI, we might end up solving the problems we have without relying on an agent.

Why? Because in predictive training, the system can (under some circumstances) learn to make self-fulfilling

... (read more)
Steve Byrnes (2mo): Thanks! Yes!

I don't think this is true in the situation I'm talking about ("literally that the world-model uses row-hammer on the computer it runs, to make the supervisory signal positive"). Let's say we have weights θ, and loss is nominally the function f(θ), but the actual calculated loss is F(θ). Normally f(θ)=F(θ), but there are certain values of θ for which merely running the trained model corrupts the CPU, and thus the bits in the loss register are not what they're supposed to be according to the nominal algorithm. In those cases f(θ)≠F(θ). Anyway, when the computer does symbolic differentiation / backprop, it's calculating ∇f, not ∇F. So it won't necessarily walk its way towards the minimum of F.

Oh yeah, for sure. My idea was: sometimes the 4th-wall-breaking consequences are part of the reason that the processing step is there in the first place, and sometimes the 4th-wall-breaking consequences are just an incidental unintended side-effect, sorta an "externality". Like, as the saying goes [https://en.wikipedia.org/wiki/Butterfly_effect], maybe a butterfly flapping its wings in Mexico will cause a tornado in Kansas three months later. But that's not why the butterfly flapped its wings. If I'm working on the project of understanding the butterfly—why does it do the things it does? why is it built the way it's built?—knowing that there was a tornado in Kansas is entirely unhelpful. It contributes literally nothing whatsoever to my success in this butterfly-explanation project.

So by the same token, I think it's possible that we can work on the project of understanding a postdictively-trained model—why does it do the things it does? why is it built the way it's built?—and find that thinking about the 4th-wall-breaking consequences of the processing steps is entirely unhelpful for this project. Of course a good postdictive learner will learn that other algorithms can be manipulative, and it could even watch itself in a mirror and understand the full rang
[AN #157]: Measuring misalignment in the technology underlying Copilot

Exactly. I'm mostly arguing that the case for the agent framing isn't as clear-cut as I've seen some people claim, which doesn't mean it's not possibly true.

[AN #157]: Measuring misalignment in the technology underlying Copilot

Sorry for the delay in answering, I was a bit busy.

I am making a claim that for the purposes of alignment of capable systems, you do want to talk about "motivation". So to the extent GPT-N / Codex-N doesn't have a motivation, but is existentially risky, I'm claiming that you want to give it a motivation. I wouldn't say this with high confidence but it is my best guess for now

That makes some sense, but I do find the "motivationless" state interesting from an alignment point of view. Because if it has no motivation, it also doesn't have a motivation to do al... (read more)

Rohin Shah (3mo): Yeah, I agree that in the future there is a difference. I don't think we know which of these situations we're going to be in (which is maybe what you're arguing). Idk what Gwern predicts.
DeepMind: Generally capable agents emerge from open-ended play

Actually, I think you're right. I always thought that MuZero was one and the same system for every game, but the Nature paper describes it as an architecture that can be applied to learn different games. I'd like confirmation from someone who has actually studied it more, but it looks like MuZero indeed isn't the same system for each game.

DeepMind: Generally capable agents emerge from open-ended play

Could you use this technique to e.g. train the same agent to do well on chess and go?

If I'm not misunderstanding your question, this is something they already did with MuZero.

Didn't they train a separate MuZero agent for each game? E.g. the page you link only talks about being able to learn without pre-existing knowledge.

[AN #157]: Measuring misalignment in the technology underlying Copilot

Sorry for ascribing to you beliefs you don't have. I guess I'm just used to people here and in other places assuming goals and agency in language models, and some of your word choices sounded very goal-directed/intentional-stance to me.

Maybe you're objecting to the "motivated" part of that sentence? But I was saying that it isn't motivated to help us, not that it is motivated to do something else.

Sure, but don't you agree that it's a very confusing use of the term? Like, if I say GPT-3 isn't trying to kill me, I'm not saying it is trying to kill anyo... (read more)

Rohin Shah (3mo): Maybe? Idk, according to me the goal of alignment is "create a model that is motivated to help us", and so misalignment = not-alignment = "the model is not motivated to help us". Feels pretty clear to me but illusion of transparency is a thing.

I am making a claim that for the purposes of alignment of capable systems, you do want to talk about "motivation". So to the extent GPT-N / Codex-N doesn't have a motivation, but is existentially risky, I'm claiming that you want to give it a motivation. I wouldn't say this with high confidence but it is my best guess for now.

I think Gwern is using "agent" in a different way than you are ¯\_(ツ)_/¯ I don't think Gwern and I would differ much in our predictions about what GPT-3 is going to do in new circumstances. (He'd probably be more specific than me just because he's worked with it a lot more than I have.)

It doesn't seem like whether something is obvious or not should determine whether it is misaligned -- it's obvious that a very superintelligent paperclip maximizer would be bad, but clearly we should still call that misaligned. I think that's primarily to emphasize why it is difficult to avoid specification gaming, not because those are the only examples of misalignment.
[AN #157]: Measuring misalignment in the technology underlying Copilot

Rohin's opinion: I really liked the experiment demonstrating misalignment, as it seems like it accurately captures the aspects that we expect to see with existentially risky misaligned AI systems: they will “know” how to do the thing we want, they simply won’t be “motivated” to actually do it.

I think this is a very good example of where the paper (based on your summary) and your opinion assume some sort of higher agency/goals in GPT-3 than what I feel we have evidence for. Notably, there are IMO pretty good arguments (mostly by people affiliated with El... (read more)

Beth Barnes (2mo): @Adam I'm interested if you have the same criticism of the language in the paper (in appendix E)? (I mostly wrote it, and am interested whether it sounds like it's ascribing agency too much)
Rohin Shah (3mo): Where do you see any assumption of agency/goals? (I find this some combination of sad and amusing as a commentary on the difficulty of communication, in that I feel like I tend to be the person pushing against ascribing goals to GPT.)

Maybe you're objecting to the "motivated" part of that sentence? But I was saying that it isn't motivated to help us, not that it is motivated to do something else. Maybe you're objecting to words like "know" and "capable"? But those don't seem to imply agency/goals; it seems reasonable to say that Google Maps knows about traffic patterns and is capable of predicting route times.

As an aside, this was Codex rather than GPT-3, though I'd say the same thing for both.

I don't care what it is trained for; I care whether it solves my problem. Are you telling me that you wouldn't count any of the reward misspecification examples [https://vkrakovna.wordpress.com/2018/04/02/specification-gaming-examples-in-ai/] as misalignment? After all, those agents were trained to optimize the reward, not to analyze what you meant and fix your reward.

Agreed, which is why I didn't say anything like that?