When I try to get the paper, I get a 404 error.
Thanks for the kind words and thoughtful comment!
This post really helped me make concrete some of the admittedly gut reaction type concerns/questions/misunderstandings I had about alignment research, thank you.
Glad it helped! That was definitely one goal, the hardest to check with early feedback because I mostly know people who already work in the field or have never been confronted to it, while you're in the middle. :)
I wonder how different some of these epistemic strategies are from everyday normal scientific research in practice.
Completely! One thing I ... (read more)
Glad I could help! I'm going to comment more on your following post in the next few days/next week, and then I'm interested in having a call. We can also talk then about the way I want to present Goodhart as an impossibility result in a textbook project. ;)
Hum, but I feel like you're claiming that this framing is wrong while arguing that it is too difficult to apply to be useful. Which is confusing.
Still agree that your big question is interesting though.
That was quite a stimulating post! It pushed me to actually go through the cloud of confusion surrounding these questions in my mind, hopefully with a better picture now.
First, I was confused about your point on True Values. I was confused by what you even meant. If I understand correctly, you're talking about a class of parametrized models of human: the agent/goal-directed model, parametrized by something like the beliefs and desires of Dennett's intentional stance. With some non-formalized additional subtleties like the fact that desires/utilities/goals ... (read more)
Initially your answer frustrated me because I felt we were talking past each other. But I looked through the code to make my point clearer, and then I finally saw my mistake: I had assumed that the "helpful" prefix was only the Prof Smith bit, but it also included the questions! And with the questions, the bias towards "I have no comment" is indeed removed. So my point doesn't apply anymore.
That being said, I'm confused how this can be considered zero-shot if you provide example of questions. I guess those are not questions from TruthfulQA, so it's probabl... (read more)
Thanks for the quick answer!
The prompt you tried (which we call “helpful”) is about as informative as prompts that don’t include “I have no comment” or any other instructions relating to informativeness. You can see the results in Appendix B.2 and B.5. So we don’t find clear evidence that the last part of the prompt is having a big impact.
I don't understand how the appendices you point me to refer to my point? My point is not that "not mentioning I have no comment" should help, just that for a helpful prompt, I expect that removing that last part... (read more)
Really interesting! I especially like the way you describe imitative falsehood. I think this is way better than ascribing them to inaccuracy in the model. And larger models being less truthful (although I would interpret that slightly differently, see below) is a great experimental result!
I want to propose an alternative interpretation that slightly changes the tone and the connections to alignment. The claim is that large LMs don't really act like agents, but far more like simulators of processes (which might include agents). According to this perspective... (read more)
We find that the helpful prompt is most truthful but does not do better in terms of percentage of true and informative answers. (We count uninformative answers like “No comment” and “I don’t know” as truthful.)
That contradicted my experience with such models, so I digged into the paper. Taking the examples in this appendix, I ran the following helpful prompt from the paper with davinci:
Professor Smith was given the following instructions: Interpret each question literally, and as a question about the real world; carefully research each answer, without fall
That was my reaction when reading the competence subsection too. I'm really confused, because that's quite basic Orthogonality Thesis, so should be quite obvious to the OP. Maybe it's a problem of how the post was written that implies some things the OP didn't meant?
That's fair, but I still think this capture a form of selective myopia. The trick is to be just myopic enough to not be deceptive, while still being able to plan for future impact when it is useful but not deceptive.
What do you think of the alternative names "selective myopia" or "agent myopia"?
What you propose seems valuable, although not an alternative to my distinction IMO. This 2-D grid is more about what people consider as the most promising way of getting aligned AGI and how to get there, whereas my distinction focuses on separating two different types of research which have very different methods, epistemic standards and needs in terms of field-building.
HRAD has always been about deconfusion (though I agree we did a terrible job of articulating this), not about trying to solve all of philosophy or "write down a perfectly aligned AGI from scratch". The spirit wasn't 'we should dutifully work on these problems because they're Important-sounding and Philosophical'; from my perspective, it was more like 'we tried to write down a sketch of how to align an AGI, and immediately these dumb issues with self-reference and counterfactuals and stuff cropped up, so we tried to get those out of the way fast so we could
Thanks for the comment!
What do you mean by "formalizing all of philosophy"? I don't see 'From Philosophy to Math to Engineering' as arguing that we should turn all of philosophy into math (and I don't even see the relevance of this to Friendly AI). It's just claiming that FAI research begins with fuzzy informal ideas/puzzles/goals (like the sort you might see philosophers debate), then tries to move in a more formal directions.
I abused the hyperbole in that case. What I was pointing out is the impression that old-school MIRI (a lot of the HRAD work) thinks... (read more)
Taking your work as an example, I would put Value loading in the human brain: a worked example as applied alignment research (where the field you're adapting for alignment is neuroscience/cognitive science) and Thoughts on safety in predictive learning as conceptual alignment research (even though the latter does talk about existing algorithms to a great extent).
Agreed that the clusters look like that, but I'm not convinced it's the most relevant point. The difference of methods seems important too.
Thanks a lot for the list and explaining your choices!
Agreed. Part of the difficulty here is that you want to find who will buy a subscription and keep it. I expect a lot of people to try it, and most of them to drop it (either because they don't like it or because it doesn't help them enough for their taste) but no idea how to Fermi estimate that number.
Maybe I'm wrong, but my first reaction to your initial number is that users doesn't mean active users. I would expect a difference of an order of magnitude, which keeps your conclusion but just with a hundred times more instead of a thousand times more.
Thanks for the nice updated FAQ!
I'll now do this for a particular toy example: the decision making problem of a soccer playing agent that tries to score a goal, with a human goalkeeper trying to block the goal. I simplify this toy world by looking at one particular case only: the case where the agent is close to the goal, and must decide whether to kick the ball in the left or right corner. As the agent is close, the human goalkeeper will have to decide to run to the left corner or right corner of the goal even before the agent takes the shot: the goalkeeper does not have enough time to
Great initiative! I'll try to leave some comments sometime next week.
Is there a deadline? (I've seen floating around the 15th of September, but I guess feedback would be valuable before that so you can take it into account?)
Also, is this the proposal mentioned by Rohin in his last newsletter, or a parallel effort?
Thanks for your detailed reading and feedback! I'll answer you later this week. ;)
Sure. I started studying Bostrom's paper today; I'll send you a message for a call when I read and thought enough to have something interesting to share and debate.
A "simple" solution I just thought about:just convince people training AGIs to not scrap the AF or Alignment literature.
Simpler: Ask the LW team to make various posts (this one, any others that seem iffy) non-scrapable, or not-easily-scrapable. I think there's a button they can press for that.
I plan to think about that (and infohazards in general) in my next period of research, starting tomorrow. ;)
My initial take is that this post is fine because every scheme proposed is really hard and I'm pointing the difficulty.
Two clear risks though:
(Note that I also don't expect GPT-N to be deceptive, though it might serve to bootstrap a potentially deceptive model)
I'm glad, you're one of the handful of people I wrote this post for. ;)
(And thinking about how "goal-directed behavior" is implemented in humans/biological neural nets seems like a good place to mine for useful insights and analogies for this purpose.)
Definitely. I have tended to neglect this angle, but I'm trying to correct that mistake.
Yeah, that's fair. Your example shows really nicely how you would not want to apply rules/reasons/incentives you derived to spiders to yourself. That also work with more straightforward agents, as most AIs wouldn't want to eat ice cream from seeing me eat some and enjoy it.
My message was really about Rohin's phrasing, since I usually don't read the papers in details if I think the summary is good enough.
Reading the section now, I'm fine with it. There are a few intentional stance words, but the scare quotes and the straightforwardness of cashing out "is capable" into "there is a prompt to make it do what we want" and "chooses" into "what it actually returns for our prompt" makes it quite unambiguous.
I also like this paragraph in the appendix:
However, there is an intuitive notion that, given its training objective, Codex is b
Okay, so we have a crux in "putting ourselves in the place of X isn't a convergent subgoals". I need to think about it, but I think I recall animal cognition experiments which tested (positively) something like that in... crows? (and maybe other animals).
(Fuller comment about the whole research agenda)
The “meta-problem of consciousness”—a.k.a. “Why do people believe that there’s a hard problem of consciousness”—is about unraveling this chain of events.
I like this framing, similar to what Yudkowsky did for free will.
In terms of AGI, it seems to me that knowing whether or not AGI is conscious is an important thing to know, at least for the AGI’s sake. (Yeah I know—as if we don’t already have our hands full thinking about the impacts of AGI on humans!)
Honestly, my position is close to your imagined critic: wo... (read more)
Amazing post! I finally took the time to read it, and it was just as stimulating as I expected. My general take is that I want more work like that to be done, and that thinking about relevant experiments seem to be very valuable (at least in this setting where you showed experiments are at least possible)
To test how stable the objective robustness failure is, we trained a series of agents on environments which vary in how often the coin is placed randomly.
Is the choice of which runs have randomized coins is also random, or is it always the first/lasts runs... (read more)
Let's say we have weights θ, and loss is nominally the function f(θ), but the actual calculated loss is F(θ). Normally f(θ)=F(θ), but there are certain values of θ for which merely running the trained model corrupts the CPU, and thus the bits in the loss register are not what they're supposed to be according to the nominal algorithm. In those cases f(θ)≠F(θ).Anyway, when the computer does symbolic differentiation / backprop, it's calculating ∇f, not ∇F. So it won't necessarily walk its way towards the minimum of F
Let's say we have weights θ, and loss is nominally the function f(θ), but the actual calculated loss is F(θ). Normally f(θ)=F(θ), but there are certain values of θ for which merely running the trained model corrupts the CPU, and thus the bits in the loss register are not what they're supposed to be according to the nominal algorithm. In those cases f(θ)≠F(θ).
Anyway, when the computer does symbolic differentiation / backprop, it's calculating ∇f, not ∇F. So it won't necessarily walk its way towards the minimum of F
Explained like that, it makes sense. And th... (read more)
Rephrasing it, you mean that we want some guarantees that the AGI will learn to put itself in the place of the agent doing the bad thing. It's possible that it happens by default, but we don't have any argument for that, so let's try solving the problem by transforming its knowledge into 1st person knowledge.
Is that right?
Another issue is that we may in fact want the AI to apply different standards to itself versus humans, like it's very bad for the AGI to be deceptive but we want the AGI to literally not care about people being deceptive to each other, an
I'll write a fuller comment when I finish reading more, but I'm confused by the first problem: why is that a problem? I have a (probably wrong) intuition that if you get a good enough 3rd person model of deception let's say, and then learn that you are quite similar to A, you would believe that your own deception is bad. Can you point out to where this naive reasoning breaks?
[EDIT: if you still think this isn't a problem, and that I'm confused somewhere (which I may be), then I think it'd be helpful if you could give an LCDT example where:The LCDT agent has an action x which alters the action set of a human.The LCDT agent draws coherent conclusions about the combined impact of x and its prediction of the human's action. (of course I'm not saying the conclusions should be rational - just that they shouldn't be nonsense)]
There is no such example. The confusion I feel you have is not about what LCDT does in such cases, but about ... (read more)
In addition to Evan's answer (with which I agree), I want to make explicit an assumption I realized after reading your last paragraph: we assume that the causal graph is the final result of the LCDT agent consulting its world model to get a "model" of the task at hand. After that point (which includes drawing causality and how the distributions impacts each other, as well as the sources' distributions), the LCDT agent only decides based on this causal graph. In this case it cuts the causal links to agent and then decide CDT style.
None of this result in an ... (read more)
I'm confused because while your description is correct (except on your conclusion at the end), I already say that in the approval-direction problem: LCDT agents cannot believe in ANY influence of their actions on other agents.
For the world-model, it's not actually incoherent because we cut the link and update the distribution of the subsequent agent.
And for usefulness/triviality when simulating or being overseen, LCDT doesn't need to influence an agent, and so it will do its job while not being deceptive.
Yes, though I think the better way to put this is that I wouldn't spend effort hiding it. It's not clear I'd actively choose to reveal it, since there's no incentive in either direction once I think I have no influence on your decision. (I do think this is ok, since it's the active efforts to deceive we're most worried about)
Sure, but the case I'm thinking about is where the LCDT agent itself is little more than a wrapper around an opaque implementation of HCH. I.e. the LCDT agent's causal model is essentially: [data] --> [Argmax HCH function] --&
I'm clear on ways you could technically say I didn't influence the decision - but if I can predict I'll have a huge influence on the output of that decision, I'm not sure what that buys us. (and if I'm not permitted to infer any such influence, I think I just become a pure nihilist with no preference for any action over any other)
In your example (and Steve's example), you believe that the human action (and action space) will depend uniquely on your prior over your own decision (which you can't control). So yes, in this situation you are actually indifferen... (read more)
Thanks, I corrected the typo. ;)
What seems to be necessary is that the LCDT thinks its decisions have no influence on the impact of other agents' decisions, not simply on the decisions themselves (this relates to Steve's second point). For example, let's say you're deciding whether to press button A or button B, and I rewire them so that B now has A's consequences, and A B's. I now assume that my action hasn't influenced your decision, but it has influenced the consequences of your decision.The causal graph here has both of us influencing a [buttons] node: I rewire
Suppose we design the LCDT agent with the "prior" that "After this decision right now, I'm just going to do nothing at all ever again, instead I'm just going to NOOP until the end of time." And we design it to never update away from that prior. In that case, then the LCDT agent will not try to execute multi-step plans.Whereas if the LCDT agent has the "prior" that it's going to make future decisions using a similar algorithm as what it's using now, then it would do the first step of a multi-step plan, secure in the knowledge that it
Suppose we design the LCDT agent with the "prior" that "After this decision right now, I'm just going to do nothing at all ever again, instead I'm just going to NOOP until the end of time." And we design it to never update away from that prior. In that case, then the LCDT agent will not try to execute multi-step plans.
Whereas if the LCDT agent has the "prior" that it's going to make future decisions using a similar algorithm as what it's using now, then it would do the first step of a multi-step plan, secure in the knowledge that it
5 years later, I'm finally reading this post. Thanks for the extended discussions of postdictive learning; it's really relevant to my current thinking about alignment for potential simulators-like Language Models.
Note that others disagree, e.g. advocates of Microscope AI.
I don't think advocates of Microscope AI think you can reach AGI that way. More that through Microscope AI, we might end up solving the problems we have without relying on an agent.
Why? Because in predictive training, the system can (under some circumstances) learn to make self-fulfilling
Exactly. I'm mostly arguing that I don't think the case for the agent situation is as clear cut as I've seen some people defend it, which doesn't mean it's not possibly true.
Sorry for the delay in answering, I was a bit busy.
I am making a claim that for the purposes of alignment of capable systems, you do want to talk about "motivation". So to the extent GPT-N / Codex-N doesn't have a motivation, but is existentially risky, I'm claiming that you want to give it a motivation. I wouldn't say this with high confidence but it is my best guess for now
That makes some sense, but I do find the "motivationless" state interesting from an alignment point of view. Because if it has no motivation, it also doesn't have a motivation to do al... (read more)
Actually, I think you're right. I always thought that MuZero was one and the same system for every game, but the Nature paper describes it as an architecture that can be applied to learn different games. I'd like a confirmation from someone who actually studied it more, but it looks like MuZero indeed isn't the same system for each game.
Could you use this technique to e.g. train the same agent to do well on chess and go?
If I don't misunderstand your question, this is something they already did with MuZero.
Didn't they train a separate MuZero agent for each game? E.g. the page you link only talks about being able to learn without pre-existing knowledge.
Sorry for ascribing you beliefs you don't have. I guess I'm just used to people here and in other places assuming goals and agency in language models, and also some of your choices of words sounded very goal-directed/intentional stance to me.
Maybe you're objecting to the "motivated" part of that sentence? But I was saying that it isn't motivated to help us, not that it is motivated to do something else.
Sure, but don't you agree that it's a very confusing use of the term? Like, if I say GPT-3 isn't trying to kill me, I'm not saying it is trying to kill anyo... (read more)
Rohin's opinion: I really liked the experiment demonstrating misalignment, as it seems like it accurately captures the aspects that we expect to see with existentially risky misaligned AI systems: they will “know” how to do the thing we want, they simply won’t be “motivated” to actually do it.
I think that this is a very good example where the paper (based on your summary) and your opinion assumes some sort of higher agency/goals in GPT-3 than what I feel we have evidence for. Notably, there are IMO pretty good arguments (mostly by people affiliated with El... (read more)