All of Charlie Steiner's Comments + Replies

AXRP Episode 6 - Debate and Imitative Generalization with Beth Barnes

I initially thought you were going to debate Beth Barnes.

Also, thanks for the episode :) It was definitely interesting, although I still don't have a good handle on why some people are optimistic that there aren't classes of arguments humans will "fall for" irrespective of their truth value.

1DanielFilan6dYeah the initial title was not good
Testing The Natural Abstraction Hypothesis: Project Intro

One generalization I am also interested in is to learn not merely abstract objects within a big model, but entire self-contained abstract levels of description, together with actions and state transitions that move you between abstract states. E.g. not merely detecting that "the grocery store" is a sealed box ripe for abstraction, but that "go to the grocery store" is a valid action within a simplified world-model with nice properties.

This might be significantly more challenging to say something interesting about, because it depends not just on the world but on how the agent interacts with the world.

Which counterfactuals should an AI follow?

Very nice overview! Of course, I think most of the trick is crammed into that last bit :) How do you get a program to find the "common-sense" implied model of the world to use for counterfactuals?

My take on Michael Littman on "The HCI of HAI"

Even when talking about how humans shouldn't always be thought of as having some "true goal" that we just need to communicate, it's so difficult to avoid talking in that way :)  We naturally phrase alignment as alignment to something - and if it's not humans, well, it must be "alignment with something bigger than humans." We don't have the words to be more specific than "good" or "good for humans," without jumping straight back to aligning outcomes to something specific like "the goals endorsed by humans under reflective equilibrium" or whatever.

We need a good linguistic-science fiction story about a language with no such issues.

1Alex Flint4dYes, I agree, it's difficult to find explicit and specific language for what it is that we would really like to align AI systems with. Thank you for the reply. I would love to read such a story!
How do scaling laws work for fine-tuning?

I am frankly skeptical that this (section 3.9 in the pretrained frozen transformer paper) will hold up to Grad Student Descent on training parameters. But hey, maybe I'm wrong and there's some nice property of the pretrained weights that can only be pushed into overfitting by finetuning.

How do scaling laws work for fine-tuning?

Sure, but if you're training on less data it's because fewer parameters is worse :P

2Daniel Kokotajlo9dNot according to this paper! They were able to get performance comparable to full-size networks, it seems. IDK.
How do scaling laws work for fine-tuning?

I'm not sure how your reply relates to my guess, so I'm a little worried.

If you're intending the compute comment to be in opposition to my first paragraph, then no - when finetuning a subset of the parameters, compute is not simply proportional to the size of the subset you're finetuning, because you still have to do all the matrix multiplications of the original model, both for inference and gradient propagation. I think the paper finetuned only a subset to make a scientific point, not to save compute.

My edit question was just because you ... (read more)

2Daniel Kokotajlo10dI totally agree that you still have to do all the matrix multiplications of the original model etc. etc. I'm saying that you'll need to do them fewer times, because you'll be training on less data. Each step costs, say, 6*N flop where N is parameter count. And then you do D steps, where D is how many data points you train on. So total flop cost is 6*N*D. When you fine-tune, you still spend 6*N for each data point, but you only need to train on 0.001D data points, at least according to the scaling laws, at least according to the orthodox interpretation around here. I'd recommend reading Ajeya's report (found here) [https://www.alignmentforum.org/posts/KrJfoZzpSDpnrv9va/draft-report-on-ai-timelines] for more on the scaling laws. There's also this comment thread. [https://www.alignmentforum.org/posts/KrJfoZzpSDpnrv9va/draft-report-on-ai-timelines]
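As a quick illustration of that arithmetic (the parameter and token counts below are assumptions at roughly GPT-3 scale, not numbers from this thread):

```python
# Flop accounting under the 6*N*D rule of thumb.
# N and D are illustrative assumptions, not figures from the discussion above.
N = 175e9                             # parameter count
D = 300e9                             # pretraining data points (tokens)
pretrain_flop = 6 * N * D             # ~3.2e23 flop
finetune_flop = 6 * N * (0.001 * D)   # same per-data-point cost, ~1000x fewer points
print(f"{pretrain_flop:.2e} vs {finetune_flop:.2e}")  # 3.15e+23 vs 3.15e+20
```

The per-step cost is unchanged; the savings come entirely from training on fewer data points.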
How do scaling laws work for fine-tuning?

I think it's plausible that the data dependence will act like it's 3 OOM smaller. Compute dependence will be different, though, right? Even if you're just finetuning part of the model you have to run the whole thing to do evaluation. In a sense this actually seems like the worst of both worlds (but you get the benefit from pretraining).

Edit: Actually, I'm confused why you say a smaller model needs that factor fewer steps. I thought the slope on that one was actually quite gentle. It's just that smaller models are cheap - or am I getting it wrong?

2Daniel Kokotajlo10dI think compute cost equals data x parameters, so even if parameters are the same, if data is 3 OOM smaller, then compute cost will be 3 OOM smaller. I'm not sure I understand your edit question. I'm referring to the scaling laws as discussed and interpreted by Ajeya. Perhaps part of what's going on is that in the sizes of models we've explored so far, bigger models only need a little bit more data, because bigger models are more data-efficient. But very soon it is prophesied that this will stop and we will transition to a slower scaling law according to which we need to increase data by almost as much as we increase parameter count. So that's the relevant one I'm thinking about when thinking about TAI/AGI/etc.
Preferences and biases, the information argument

But is that true? Human behavior has a lot of information. We normally say that this extra information is irrelevant to the human's beliefs and preferences (i.e. the agential model of humans is a simplification), but it's still there.

2Stuart Armstrong22dLook at the paper linked for more details ( https://arxiv.org/abs/1712.05812 [https://arxiv.org/abs/1712.05812] ). Basically "humans are always fully rational and always take the action they want to" is a full explanation of all of human behaviour, that is strictly simpler than any explanation which includes human biases and bounded rationality.
HCH Speculation Post #2A

Sure, but the interesting thing to me isn't fixed points in the input/output map, it's properties (i.e. attractors that are allowed to be large sets) that propagate from the answers seen by a human in response to their queries, into their output.

Even if there's a fixed point, you have to further prove that this fixed point is consistent - that it's actually the answer to some askable question. I feel like this is sort of analogous to Hofstadter's q-sequence.

1Donald Hobson1moIn the giant lookup table space, HCH must converge to a cycle, although that convergence can be really slow. I think you have convergence to a stationary distribution if each layer is trained on a random mix of several previous layers. Of course, you can still have oscillations in what is said within a policy fixed point.
HCH Speculation Post #2A

Yeah, I agree with this. It's certainly possible to see normal human passage through time as a process with probable attractors. I think the biggest differences are that HCH is a psychological "monoculture," HCH has tiny bottlenecks through which to pass messages compared to the information I can pass to my future self, and there's some presumption that the output will be "an answer" whereas I have no such demands on the brain-state I pass to tomorrow.

If we imagine actual human imitations I think all of these problems have fairly obvious solutions, but I t... (read more)

1Vanessa Kosoy1mo[EDIT: After thinking about this some more, I realized that malign AI leakage is a bigger problem than I thought when writing the parent comment, because the way I imagined it can be overcome doesn't work that well [https://www.alignmentforum.org/posts/dPmmuaz9szk26BkmD/vanessa-kosoy-s-shortform?commentId=uL44tqHPCetnaKDTe] .] I don't think that last one is a real constraint. What counts as "an answer" is entirely a matter of interpretation by the participants in the HCH. For example, initially I can ask the question "what are the most useful thoughts about AI alignment I can come up with during 1,000,000 iterations?". When I am tasked to answer the question "what are the most useful thoughts about AI alignment I can come up with during N iterations?" then * If N=1, I will just spend my allotted time thinking about AI alignment and write whatever I came up with in the end. * If N>1, I will ask "what are the most useful thoughts about AI alignment I can come up with during N−1 iterations?". Then, I will study the answer and use the remaining time to improve on it to the best of my ability. An iteration of 2 weeks might be too short to learn the previous results, but we can work in longer iterations. Certainly, having to learn the previous results from text carries overhead compared to just remembering myself developing them (and having developed some illegible intuitions in the process), but only that much overhead. As to "monoculture", we can do HCH with multiple people (either the AI learns to simulate the entire system of multiple people or we use some rigid interface e.g. posting on a forum). For example, we can imagine putting the entire AI X-safety community there. But, we certainly don't want to put the entire world in there, since that way malign AI would probably leak into the system. Yes: it shows how to achieve reliable imitation (although for now in a theoretical model that's not feasible to implement), and the same idea should be applicab
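A minimal sketch of the recursion Vanessa describes (the `human` object and its two methods are hypothetical placeholders for one imitated-human work session, not part of any existing HCH implementation):

```python
def useful_thoughts(n_iterations, human):
    """Answer "what are the most useful thoughts I can come up with in
    n_iterations iterations?" by recursing on n_iterations - 1."""
    if n_iterations == 1:
        # Base case: think for one allotted session and write up the result.
        return human.think_from_scratch()
    # Otherwise, get the (n-1)-iteration answer from the sub-tree, then spend
    # the remaining session studying it and improving on it.
    previous_answer = useful_thoughts(n_iterations - 1, human)
    return human.study_and_improve(previous_answer)
```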
The case for aligning narrowly superhuman models

I'd say "If an AGI is hung up on these sorts of questions [i.e. the examples I gave of statements human 'moral experts' are going to disagree about], then we've already mostly-won" is an accurate correlation, but doesn't stand up to optimization pressure. We can't mostly-win just by fine-tuning a language model to do moral discourse. I'd guess you agree?

Anyhow, my point was more: You said "you get what you can measure" is a problem because the fact of the matter for whether decisions are good or bad is hard to evaluate (therefore sandwiching is an interest... (read more)

2johnswentworth1moUh... yeah, I agree with that statement, but I don't really see how it's relevant. If we tune a language model to do moral discourse, then won't it be tuned to talk about things like Terry Schiavo, which we just said was not that central? Presumably tuning a language model to talk about those sorts of questions would not make it any good at moral problems like "they said they want fusion power, but they probably also want it to not be turn-into-bomb-able". Or are you using "moral discourse" in a broader sense? I disagree with the exact phrasing "fact of the matter for whether decisions are good or bad"; I'm not supposing there is any "fact of the matter". It's hard enough to figure out, just for one person (e.g. myself), whether a given decision is something I do or do not want. Other than that, this is a good summary, and I generally agree with the-thing-you-describe-me-as-saying and disagree with the-thing-you-describe-yourself-as-saying. I do not think that values-disagreements between humans are a particularly important problem for safe AI; just picking one human at random and aligning the AI to what that person wants would probably result in a reasonably good outcome. At the very least, it would avert essentially-all of the X-risk.
2Alex Turner1moEnglish sentences don't have to hold up to optimization pressure, our AI designs do. If I say "I'm hungry for pizza after I work out", you could say "that doesn't hold up to optimization pressure - I can imagine universes where you're not hungry for pizza", it's like... okay, but that misses the point? There's an implicit notion here of "if you told me that we had built AGI and it got hung up on exotic moral questions, I would expect that we had mostly won." Perhaps this notion isn't obvious to all readers, and maybe it is worth spelling out, but as a writer I do find myself somewhat exhausted by the need to include this kind of disclaimer. Furthermore, what would be optimized in this situation? Is there a dissatisfaction genie that optimizes outcomes against realizations technically permitted by our English sentences? I think it would be more accurate to say "this seems true in the main, although I can imagine situations where it's not." Maybe this is what you meant, in which case I agree.
The case for aligning narrowly superhuman models

Hm, interesting, I'm actually worried about a totally different implication of "you get what you can measure."

E.g.:

"If MTurkers are on average anti-abortion and your experts are on average pro-choice, what the hell will your MTurkers think about training an algorithm that tries to learn from anti-abortion folks and output pro-choice responses? Suppose you then run that same algorithm on the experts and it gives outputs in favor of legalizing infanticide - are the humans allowed to say "hold on, I don't want that," or are we just going to accept that as wha... (read more)

I think one argument running through a lot of the sequences is that the parts of "human values" which mostly determine whether AI is great or a disaster are not the sort of things humans usually think of as "moral questions". Like, these examples from your comment below:

Was it bad to pull the plug on Terry Schiavo? How much of your income should you give to charity? Is it okay to kiss your cousin twice removed? Is it a good future if all the humans are destructively copied to computers? Should we run human challenge trials for covid-19 vaccines?

If an AGI i... (read more)

Open Problems with Myopia

(Edited for having an actual point)

You mention some general ways to get non-myopic behavior, but when it comes to myopic behavior you default to a clean, human-comprehensible agent model. I'm curious if you have any thoughts on open avenues related to training procedures that encourage myopia in inner optimizers, even if those inner optimizers are black boxes? I do seem to vaguely recall a post from one of you about this, or maybe it was Richard Ngo.

3Evan Hubinger1moI think that trying to encourage myopia via behavioral incentives is likely to be extremely difficult, if not impossible (at least without a better understanding of our training processes' inductive biases). Krueger et al.'s "Hidden Incentives for Auto-Induced Distributional Shift [https://arxiv.org/abs/2009.09153]" is a good resource for some of the problems that you run into when you try to do that. As a result, I think that mechanistic incentives are likely to be necessary—and I personally favor some form of relaxed adversarial training [https://www.alignmentforum.org/posts/9Dy5YRaoCxH9zuJqa/relaxed-adversarial-training-for-inner-alignment]—but that's going to require us to get a better understanding of what exactly it looks like for an agent to be myopic or not, so we know what the overseer in a setup like that should be looking for.
Open Problems with Myopia

You beat me to making this comment :P Except apparently I came here to make this comment about the changed version.

"A human would only approve safe actions" is just a problem clause altogether. I understand how this seems reasonable for sub-human optimizers, but if you (now addressing Mark and Evan) think it has any particular safety properties for superhuman optimization pressure, the particulars of that might be interesting to nail down a bit better.

3Mark Xu1mohas been changed to imitation, as suggested by Evan.
2Evan Hubinger1moYeah, I agree—the example should probably just be changed to be about an imitative amplification agent or something instead.
The case for aligning narrowly superhuman models

Re: part 1 -

Good points, I agree. Though I think you could broadly replicate the summarization result using supervised learning - the hope for using supervised learning in superhuman domains is that your model learns a dimension of variation for "goodness" that can generalize well even if you condition on "goodness" being slightly outside any of the training examples.

Re: part 2 -

What it boils down to is that my standards (and I think the practical standards) for medical advice are low, while my standards for moral advice are high (as in, you could use this... (read more)

The case for aligning narrowly superhuman models

I (conceptual person) broadly do agree that this is valuable.

It's possible that we won't need this work - that alignment research can develop AI that doesn't benefit from the same sort of work you'd do to get GPT-3 to do tricks on command. But it's also possible that this really would be practice for "the same sort of thing we want to eventually do."

My biggest concern is actually that the problem is going to be too easy for supervised learning. Need GPT-3 to dispense expert medical advice? Fine-tune it on a corpus of expert medical advice! Or for slightly ... (read more)

1Ajeya Cotra1moI don't think you can get away with supervised learning if you're holding yourself to the standard of finding fuzzy tasks where the model is narrowly superhuman. E.g. the Stiennon et al., 2020 paper involved using RL from human feedback: roughly speaking, that's how it was possible for the model to actually improve upon humans rather than simply imitating them. And I think in some cases, the model will be capable of doing better than (some) humans' evaluations, meaning that to "get models to the best they can to help us" we will probably need to do things like decomposition, training models to explain their decisions, tricks to amplify or de-noise human feedback, etc. I don't agree that there's obviously conceptual progress that's necessary for moral advice which is not necessary for medical advice — I'd expect a whole class of tasks to require similar types of techniques, and if there's a dividing line I don't think it is going to be "whether it's related to morality", but "whether it's difficult for the humans doing the evaluation to tell what's going on." To answer your question for both medical and moral advice, I'd say the obvious first thought is RL from human feedback, and the second thought I had to go beyond that is trying to figure out how to get less-capable humans to replicate the training signal produced by more-capable humans, without using any information/expertise from the latter to help the former (the "sandwiching" idea). I'm not sure if it'll work out though.
Bootstrapped Alignment

I'm still holding out hope for jumping straight to FAI :P Honestly I'd probably feel safer switching on a "big human" than a general CIRL agent that models humans as Boltzmann-rational.

Though on the other hand, does modern ML research already count as trying to use UFAI to learn how to build FAI?

1G Gordon Worley III1moSeems like it probably does, but only incidentally. I instead tend to view ML research as the background over which alignment work is now progressing. That is, we're in a race against capabilities research that we have little power to stop, so our best bets are either that it turns out capabilities are about to hit the upper inflection point of an S-curve, buying us some time, or that the capabilities can be safely turned to helping us solve alignment. I do think there's something interesting about a direction not considered in this post related to intelligence enhancement of humans and human emulations (ems) as a means to working on alignment, but I think realistically current projections of AI capability timelines suggest they're unlikely to have much opportunity for impact.
[AN #139]: How the simplicity of reality explains the success of neural nets

I normally don't think of most functions as polynomials at all - in fact, I think of most real-world functions as going to zero for large values. E.g. the function "dogness" vs. "nose size" cannot be any polynomial, because polynomials (or their inverses) blow up unrealistically for large (or small) nose sizes.

I guess the hope is that you always learn even polynomials, oriented in such a way that the extremes appear unappealing?

2Rohin Shah2moWhat John said. To elaborate, it's specifically talking about the case where there is some concept from which some probabilistic generative model creates observations tied to the concept, and claiming that the log probabilities follow a polynomial. Suppose the most dog-like nose size is K. One function you could use is y = exp(-(x - K)^d) for some positive even integer d. That's a function whose maximum value is 1, attained at x = K (where higher values = more "dogness"), and it doesn't blow up unreasonably anywhere. (Really you should be talking about probabilities, in which case you use the same sort of function but then normalize, which transforms the exp into a softmax, as the paper suggests)
4johnswentworth2moI believe the paper says that log densities are (approximately) polynomial - e.g. a Gaussian would satisfy this, since the log density of a Gaussian is quadratic.
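Spelling the Gaussian case out (just the standard formula): the log density is an even, degree-2 polynomial in x, maximized at the most typical value μ, and it falls toward minus infinity in both tails rather than blowing up:

```latex
\log p(x) = \log \mathcal{N}(x \mid \mu, \sigma^2)
          = -\frac{(x-\mu)^2}{2\sigma^2} - \log\!\left(\sigma\sqrt{2\pi}\right)
```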
Implications of Quantum Computing for Artificial Intelligence Alignment Research

I recently got reminded of this post. I'm not sure I agree with it, because I think we have different paradigms for AI alignment - I'm not nearly so concerned with the sort of oversight that relies on looking at the state of the computer. Though I have nothing against the sort of oversight where you write a program to tell you about what's going on with your model.

Instead, I think that anticipating the effects of QC on AI alignment is a task in prognosticating how ML is going to change if you make quantum computing available. I think the relevant killer ap... (read more)

1Jsevillamol2more: importance of oversight I do not think we really disagree on this point. I also believe that looking at the state of the computer is not as important as having an understanding of how the program is going to operate and how to shape its incentives. Maybe this could be better emphasized, but the way I think about this article is showing that even the strongest case for looking at the intersection of quantum computing and AI alignment does not look very promising. re: How quantum computing will affect ML I basically agree that the most plausible way QC can affect AI alignment is by providing computational speedups - but I think this mostly changes the timelines rather than violating any specific assumptions in usual AI alignment research. Relatedly, I am skeptical that we will see better-than-quadratic speedups (i.e. better than Grover) - to get better-than-quadratic speedups you need to surpass many challenges that right now it is not clear can be surpassed outside of very contrived problem setups [REF] [https://scottaaronson.com/papers/qml.pdf]. In fact I think that the speedups will not even be quadratic, because you "lose" the quadratic speedup when parallelizing quantum computing (in the sense that the speedup does not scale quadratically with the number of cores).
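A back-of-the-envelope version of that last point, using only the standard Grover scaling (an illustration, not an analysis from the comment): for unstructured search over N items split across p machines, the quantum advantage over classical parallel search shrinks as p grows, so the quadratic speedup does not survive parallelization:

```latex
T_{\mathrm{classical}}(N,p) \sim \frac{N}{p}, \qquad
T_{\mathrm{Grover}}(N,p) \sim \sqrt{\frac{N}{p}}, \qquad
\frac{T_{\mathrm{classical}}}{T_{\mathrm{Grover}}} \sim \sqrt{\frac{N}{p}}
\;\longrightarrow\; 1 \quad \text{as } p \to N .
```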
Suggestions of posts on the AF to review

Steve's big thoughts on alignment in the brain probably deserve a review. Component posts include  https://www.lesswrong.com/posts/diruo47z32eprenTg/my-computational-framework-for-the-brain , https://www.lesswrong.com/posts/DWFx2Cmsvd4uCKkZ4/inner-alignment-in-the-brain , https://www.lesswrong.com/posts/jNrDzyc8PJ9HXtGFm/supervised-learning-of-outputs-in-the-brain

Interestingly, I think there aren't any of my posts I should recommend - basically all of them are speculative. However, I did have a post called Gricean communication and meta-preferences th... (read more)

Formal Solution to the Inner Alignment Problem

I think this looks fine for IDA - the two problems remain the practical one of implementing Bayesian reasoning in a complicated world, and the philosophical one that probably IDA on human imitations doesn't work because humans have bad safety properties.

AXRP Episode 4 - Risks from Learned Optimization with Evan Hubinger

Hm, I thought that was what Evan called it, but maybe I misheard. Anyhow, I mean the problem where because you can model humans in different ways, we have no unique utility function. We might think of this as having not just one Best Intentional Stance, but a generalizable intentional stance with knobs and dials on it, different settings of which might lead to viewing the subject in different ways.

I call such real-world systems that can be viewed non-uniquely through the lens of the intentional stance "approximate agents."

To the extent that mesa-optimizers... (read more)

AXRP Episode 4 - Risks from Learned Optimization with Evan Hubinger

Great interview! Weird question - did Rob Miles get a sneak peek at this interview, given that he just did a video on the same paper?

The biggest remaining question I have is a followup on the question you asked "Am I a mesa-optimizer, and if so, what's my meta-objective?" You spend some time talking about lookup tables, but I wanted to hear about human-esque "agents" that seem like they do planning, but simultaneously have a very serious determination problem for their values - is Evan's idea to try to import some "solution to outer alignment" to these age... (read more)

1DanielFilan2moIf Rob got a sneak peek, he managed to do so without my knowledge. I don't totally understand the other question you're asking: in particular, what you're thinking of as the "determination problem".
Hierarchical planning: context agents

Oh wait, are you the first author on this paper? I didn't make the connection until I got around to reading your recent post.

So when you talk about moving to a hierarchical human model, how practical do you think it is to also move to a higher-dimensional space of possible human-models, rather than using a few hand-crafted goals? This necessitates some loss function or prior probability over models, and I'm not sure how many orders of magnitude more computationally expensive it makes everything.

1Xuan (Tan Zhi Xuan)16dYup! And yeah I think those are open research questions -- inference over certain kinds of non-parametric Bayesian models is tractable, but not in general. What makes me optimistic is that humans in similar cultures have similar priors over vast spaces of goals, and seem to do inference over that vast space in a fairly tractable manner. I think things get harder when you can't assume shared priors over goal structure or task structure, both for humans and machines.
AI Alignment, Philosophical Pluralism, and the Relevance of Non-Western Philosophy

My first idea is, you take your common sense AI, and rather than saying "build me a spaceship, but, like, use common sense," you can tell it "do the right thing, but, like, use common sense." (Obviously with "saying" and "tell" in invisible finger quotes.) Bam, Type-1 FAI.

Of course, whether this will go wrong or not depends on the specifics. I'm reminded of Adam Shimi et al's recent post that mentioned "Ideal Accomplishment" (how close to an explicit goal a system eventually gets) and "Efficiency" (how fast it gets there). If you have a general purpose "co... (read more)

AI Alignment, Philosophical Pluralism, and the Relevance of Non-Western Philosophy

I'm a lot less excited about the literature of the world's philosophy than I am about the living students of it.

Of course, there are some choices in designing an AI that are ethical choices, for which there's no standard by which one culture's choice is better than another's. In this case, incorporating diverse perspectives is "merely" a fair way to choose how to steer the future - a thing to do because we want to, not because it solves some technical problem.

But there are also philosophical problems faced in the construction of AI that are technical probl... (read more)

AI Alignment, Philosophical Pluralism, and the Relevance of Non-Western Philosophy

Children learn to follow common sense, despite not having (explicit) meta-ethical and meta-normative beliefs at all.

Children also learn right from wrong - I'd be interested in where you draw the line between "An AI that learns common sense" and "An AI that learns right from wrong." (You say this argument doesn't apply in the case of human values, but it seems like you mean only explicit human values, not implicit ones.)

My suspicion, which is interesting to me so I'll explain it even if you're going to tell me that I'm off base, is that you're thinking that... (read more)

2Rohin Shah3moI'm happy to assume that AI will learn right from wrong to about the level that children do. This is not a sufficiently good definition of "the good" that we can then optimize it. That sounds basically right, with the caveat that you want to be a bit more specific and precise with what the AI system should do than just saying "common sense"; I'm using the phrase as a placeholder for something more precise that we need to figure out. Also, I'd change the last sentence to "an AI that has learned right from wrong to the same extent that humans learn it, and then optimizes for right things as hard as possible, will probably make dangerous moral mistakes". The point is that when you're trying to define "the good" and then optimize it, you need to be very very correct in your definition, whereas when you're trying not to optimize too hard in the first place (which is part of what I mean by "common sense") then that's no longer the case. I think at this point I don't think we're talking about the same "common sense". But why? Again it depends on how accurate the "right/wrong classifier" needs to be, and how accurate the "common sense" needs to be. My main claim is that the path to safety that goes via "common sense" is much more tolerant of inaccuracies than the path that goes through optimizing the output of the right/wrong classifier.
Literature Review on Goal-Directedness

The little quizzes were highly effective in getting me to actually read the post :)

I think depending on what position you take, there are differences in how much one thinks there's "room for a lot of work in this sphere." The more you treat goal-directedness as important because it's a useful category in our map for predicting certain systems, the less important it is to be precise about it. On the other hand if you want to treat goal-directedness in a human-independent way or otherwise care about it "for its own sake" for some reason, then it's a different story.

1Adam Shimi3moGlad they helped! That's the first time I use this feature, and we debated whether to add more or remove them completely, so thanks for the feedback. :) If I get you correctly, you're arguing that there's less work on goal-directedness if we try to use it concretely (for discussing AI risk), compared to if we study it for its own sake? I think I agree with that, but I still believe that we need a pretty concrete definition to use goal-directedness in practice, and that we're far from there. There is less pressure to deal with all the philosophical nitpicks, but we should at least get the big intuitions (of the type mentioned in this lit review) right, or explain why they're wrong.
Why I'm excited about Debate

Good question. There's a big roadblock to your idea as stated, which is that asking something to define "alignment" is a moral question. But suppose we sorted out a verbal specification of an aligned AI and had a candidate FAI coded up - could we then use Debate on the question "does this candidate match the verbal specification?"

I don't know - I think it still depends on how bad humans are as judges of arguments - we've made the domain more objective, but maybe there's some policy of argumentation that still wins by what we would consider cheating. I can ... (read more)

2Richard Ngo3moI'm less excited about this, and more excited about candidate training processes or candidate paradigms of AI research (for example, solutions to embedded agency). I expect that there will be a large cluster of techniques which produce safe AGIs, we just need to find them - which may be difficult, but hopefully less difficult with Debate involved.
Why I'm excited about Debate

I think the Go example really gets to the heart of why I think Debate doesn't cut it.

The reason Go is hard is that it has a large game tree despite simple rules. When we treat an AI's game as information about the value of a state of the Go board, we know exactly what the rules are and how the game should be scored; the superhuman work the AIs are doing is in searching a game tree that's too big for us. The adversarial gameplay provides a check that the search through the game tree is actually finding high-scoring policies.

What does this framework need to... (read more)

2Rafael Harth3moYour comment is an argument against using Debate to settle moral questions. However, what if Debate is trained on Physics and/or math questions, with the eventual goal of asking "what is a provably secure alignment proposal?"
Transparency and AGI safety

Re: non-agenty AGI. The typical problem is that there are incentives for individual actors to build AI systems that pursue goals in the world. So even if you postulate non-agenty AGI, you then have to further figure out why nobody has asked the Oracle AI "What's the code to an AI that will make me rich?" or asked it for the motor output of a robot given various sense data and natural-language goals, then used that output to control a robot (also see https://slatestarcodex.com/2020/01/06/a-very-unlikely-chess-game/ ).

Transparency and AGI safety

Thanks!

I'm reminded a bit of the reason why Sudoku and quantum computing are difficult: the possibilities you have to track are not purely local; they can be a nonlocal combination of different things. General NNs seem like they'd be at least NP-hard to interpret.

But this is what dropout is useful for, penalizing reliance on correlations. So maybe if you're having trouble interpreting something you can just crank up the dropout parameters. On the other hand, dropout also promotes redundancy, which might make interpretation confusing - perhaps there's something ... (read more)

Hierarchical planning: context agents

Sorry for being slow :) No, I haven't read anything of Bratman's. Should I? The synopsis looks like it might have some interesting ideas but I'm worried he could get bogged down in what human planning "really is" rather than what models are useful.

I'd totally be happy to chat either here or in PMs. Full Bayesian reasoning seems tricky if the environment is complicated enough to make hierarchical planning attractive - or do you mean optimizing a model for posterior probability (the prior being something like MML?) by local search?

I think one interesting que... (read more)

Vanessa Kosoy's Shortform

What does it mean to have an agent in the information-state?

Nevermind, I think I was just looking at it with the wrong class of reward function in mind.

Vanessa Kosoy's Shortform

Ah, okay, I see what you mean. Like how preferences are divisible into "selfish" and "worldly" components, where the selfish component is what's impacted by a future simulation of you that is about to have good things happen to it.

(edit: The reward function in AMDPs can either be analogous to "worldly" and just sum the reward calculated at individual timesteps, or analogous to "selfish" and calculated by taking the limit of the subjective distribution over parts of the history, then applying a reward function to the expected histories.)

I brought up the hist... (read more)

1Vanessa Kosoy4moAMDP is only a toy model that distills the core difficulty into more or less the simplest non-trivial framework. The rewards are "selfish": there is a reward function r:(S×A)∗→R which allows assigning utilities to histories by time discounted summation, and we consider the expected utility of a random robot sampled from a late population. And, there is no memory wiping. To describe memory wiping we indeed need to do the "unrolling" you suggested. (Notice that from the cybernetic model POV, the history is only the remembered history.) For a more complete framework, we can use an ontology chain [https://www.lesswrong.com/posts/dPmmuaz9szk26BkmD/vanessa-kosoy-s-shortform?commentId=SBPzgAZgFFxtL9E64] , but (i) instead of A×O labels use A×M labels, where M is the set of possible memory states (a policy is then described by π:M→A), to allow for agents that don't fully trust their memory (ii) consider another chain with a bigger state space S′ plus a mapping p:S′→NS s.t. the transition kernels are compatible. Here, the semantics of p(s) is: the multiset of ontological states resulting from interpreting the physical state s by taking the viewpoints of different agents s contains. I didn't understand "no actual agent in the information-state that corresponds to having those probabilities". What does it mean to have an agent in the information-state?
Vanessa Kosoy's Shortform

Great example. At least for the purposes of explaining what I mean :) The memory AMDP would just replace the states s0, s1 with the memory states s0, s0s1, s0s1s0, etc. The action takes a robot in s0 to memory state s0s1, and a robot in s0s1 to one robot in s0s1s0 and another in s0s1s1.

(Skip this paragraph unless the specifics of what's going on aren't obvious: given a transition distribution  (P being the distribution over sets of sta... (read more)

1Vanessa Kosoy4moI'm not quite sure what you are trying to say here, probably my explanation of the framework was lacking. The robots already remember the history, like in classical RL. The question about the histories is perfectly well-defined. In other words, we are already implicitly doing what you described. It's like in classical RL theory, when you're proving a regret bound or whatever, your probability space consists of histories. Yes, or a classical RL environment. Ofc if we allow infinite state spaces, then any environment can be regarded as an MDP (whose states are histories). That is, I'm talking about hypotheses which conform to the classical "cybernetic agent model". If you wish, we can call it "Bayesian cybernetic hypothesis". Also, I want to clarify something I was myself confused about in the previous comment. For an anthropic Markov chain (when there is only one action) with a finite number of states, we can give a Bayesian cybernetic description, but for a general anthropic MDP we cannot, even if the number of states is finite. Indeed, consider some T : S → ΔN^S. We can take its expected value to get ET : S → R^S_+. Assuming the chain is communicating, ET is an irreducible non-negative matrix, so by the Perron-Frobenius theorem it has a unique-up-to-scalar maximal eigenvector η ∈ R^S_+. We then get the subjective transition kernel: ST(t∣s) = ET(t∣s)·η_t / Σ_{t′∈S} ET(t′∣s)·η_{t′}. Now, consider the following example of an AMDP. There are three actions A := {a, b, c} and two states S := {s0, s1}. When we apply a to an s0 robot, it creates two s0 robots, whereas when we apply a to an s1 robot, it leaves one s1 robot. When we apply b to an s1 robot, it creates two s1 robots, whereas when we apply b to an s0 robot, it leaves one s0 robot. When we apply c to any robot, it results in one robot whose state is s0 with probability 1/2 and s1 with probability 1/2. Consider the following two policies. πa takes the sequence of actions cacaca… and πb takes the sequence of actions cbcbcb…. A population that f
Vanessa Kosoy's Shortform

Could you expand a little on why you say that no Bayesian hypothesis captures the distribution over robot-histories at different times? It seems like you can unroll an AMDP into a "memory MDP" that puts memory information of the robot into the state, thus allowing Bayesian calculation of the distribution over states in the memory MDP to capture history information in the AMDP.

1Vanessa Kosoy4moI'm not sure what you mean by that "unrolling". Can you write a mathematical definition? Let's consider a simple example. There are two states: s0 and s1. There is just one action so we can ignore it. s0 is the initial state. An s0 robot transitions into an s1 robot. An s1 robot transitions into an s0 robot and an s1 robot. What will our population look like? 0th step: all robots remember s0. 1st step: all robots remember s0s1. 2nd step: 1/2 of robots remember s0s1s0 and 1/2 of robots remember s0s1s1. 3rd step: 1/3 of robots remember s0s1s0s1, 1/3 of robots remember s0s1s1s0 and 1/3 of robots remember s0s1s1s1. There is no Bayesian hypothesis a robot can have that gives correct predictions both for step 2 and step 3. Indeed, to be consistent with step 2 we must have Pr[s0s1s0] = 1/2 and Pr[s0s1s1] = 1/2. But, to be consistent with step 3 we must have Pr[s0s1s0] = 1/3, Pr[s0s1s1] = 2/3. In other words, there is no Bayesian hypothesis s.t. we can guarantee that a randomly sampled robot on a sufficiently late time step will have learned this hypothesis with high probability. The apparent transition probabilities keep shifting s.t. it might always continue to seem that the world is complicated enough to prevent our robot from having learned it already. Or, at least it's not obvious there is such a hypothesis. In this example, Pr[s0s1s1]/Pr[s0s1s0] will converge to the golden ratio at late steps. But, do all probabilities converge fast enough for learning to happen, in general? I don't know, maybe for finite state spaces it can work. Would definitely be interesting to check. [EDIT: actually, in this example there is such a hypothesis but in general there isn't, see below [https://www.alignmentforum.org/posts/dPmmuaz9szk26BkmD/vanessa-kosoy-s-shortform?commentId=E58br2mJWbgzQqZhX] ]
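A quick brute-force check of these numbers (a sketch of the example exactly as stated above; the golden-ratio limit is the Perron eigenvalue of the expected-transition matrix [[0, 1], [1, 1]]):

```python
# Simulate the example anthropic Markov chain: an s0 robot becomes one s1
# robot; an s1 robot becomes one s0 robot and one s1 robot. We track how many
# robots remember each history.
def step(populations):
    new = {}
    for hist, n in populations.items():
        successors = ("s1",) if hist[-1] == "s0" else ("s0", "s1")
        for nxt in successors:
            new[hist + (nxt,)] = new.get(hist + (nxt,), 0) + n
    return new

pops = {("s0",): 1}
for t in range(1, 26):
    pops = step(pops)
    if t in (2, 3, 25):
        total = sum(pops.values())
        p_s0s1s0 = sum(n for h, n in pops.items() if h[:3] == ("s0", "s1", "s0")) / total
        p_s0s1s1 = sum(n for h, n in pops.items() if h[:3] == ("s0", "s1", "s1")) / total
        print(t, round(p_s0s1s0, 3), round(p_s0s1s1, 3), round(p_s0s1s1 / p_s0s1s0, 3))
# Step 2 prints 0.5, 0.5; step 3 prints 0.333, 0.667; by step 25 the ratio
# p_s0s1s1 / p_s0s1s0 has converged to ~1.618, the golden ratio.
```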
Hierarchical planning: context agents

Yeah, I agree, it seems both more human-like and more powerful to have a dynamical system where models are activating other models based on something like the "lock and key" matching of neural attention. But for alignment purposes, it seems to me that we need to not only optimize models for usefulness or similarity to actual human thought, but also for how similar they are to how humans think of human thought - when we imagine an AI with the goal of doing good, we want it to have decision-making that matches our understanding of "doing good." The model in ... (read more)

Latent Variables and Model Mis-Specification

Just ended up reading your paper (well, a decent chunk of it), so thanks for the pointer :) 

The ethics of AI for the Routledge Encyclopedia of Philosophy

Congrats! Here are my totally un-researched first thoughts:

Pre-1950 History: Speculation about artificial intelligence (if not necessarily in the modern sense) dates back extremely far. Brazen head, Frankenstein, the mechanical turk, R.U.R. robots. Basically all of this treats the artificial intelligence as essentially a human, though the brazen head mythology is maybe more related to deals with djinni or devils, but basically all of it (together with more modern science fiction) can be lumped into a pile labeled "exposes and shapes human intuition, but no... (read more)

2Stuart Armstrong5moThanks!
The Pointers Problem: Human Values Are A Function Of Humans' Latent Variables

Yes, the point is multiple abstraction levels (or at least multiple abstractions, ordered into levels or not). But not multiple abstractions used by humans, multiple abstractions used on humans.

If you don't agree with me on this, why didn't you reply when I spent about six months just writing posts that were all variations of this idea? Here's Scott Alexander making the basic point.

It's like... is there a True rational approximation of pi? Well, 22/7 is pretty good, but 355/113 is more precise, if harder to remember. And just 3 is really easy to remember, ... (read more)
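For the record, the numbers behind that analogy:

```latex
\left|\tfrac{22}{7} - \pi\right| \approx 1.3\times 10^{-3}, \qquad
\left|\tfrac{355}{113} - \pi\right| \approx 2.7\times 10^{-7}, \qquad
\left|3 - \pi\right| \approx 1.4\times 10^{-1}.
```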

1johnswentworth5moBefore I get into the meat of the response... I certainly agree that values are probably a partial order, not a total order. However, that still leaves basically all the problems in the OP: that partial order is still a function of latent variables in the human's world-model, which still gives rise to all the same problems as a total order in the human's world-model. (Intuitive way to conceptualize this: we can represent the partial order as a set of total orders, i.e. represent the human as a set of utility-maximizing subagents [https://www.lesswrong.com/posts/3xF66BNSC5caZuKyC/why-subagents]. Each of those subagents is still a normal Bayesian utility maximizer, and still suffers from the problems in the OP.) Anyway, I don't think that's the main disconnect here... Ok, I think I see what you're saying now. I am of course on board with the notion that e.g. human values do not make sense when we're modelling the human at the level of atoms. I also agree that the physical system which comprises a human can be modeled as wanting different things at different levels of abstraction. However, there is a difference between "the physical system which comprises a human can be interpreted as wanting different things at different levels of abstraction", and "there is not a unique, well-defined referent of 'human values'". The former does not imply the latter. Indeed, the difference is essentially the same issue in the OP: one of these statements has a type-signature which lives in the physical world, while the other has a type-signature which lives in a human's model. An analogy: consider a robot into which I hard-code a utility function and world model. This is a physical robot; on the level of atoms, its "goals" do not exist in any more real a sense than human values do. As with humans, we can model the robot at multiple levels of abstraction, and these different models may ascribe different "goals" to the robot - e.g. modelling it at the level of an electronic circuit o
The Pointers Problem: Human Values Are A Function Of Humans' Latent Variables

I think that one of the problems in this post is actually easier in the real world than in the toy model.

In the toy model the AI has to succeed by maximizing the agent's True Values, which the agent is assumed to have as a unique function over its model of the world. This is a very tricky problem, especially when, as you point out, we might allow the agent's model of reality to be wrong in places.

But in the real world, humans don't have a unique set of True Values or even a unique model of the world - we're non-Cartesian, which means that when we talk abou... (read more)

1johnswentworth5moThis comment seems wrong to me in ways that make me think I'm missing your point. Some examples and what seems wrong about them, with the understanding that I'm probably misunderstanding what you're trying to point to: I have no idea why this would be tied to non-Cartesian-ness. There are certainly ways in which humans diverge from Bayesian utility maximization, but I don't see why we would think that values or models are non-unique. Certainly we use multiple levels of abstraction, or multiple sub-models, but that's quite different from having multiple distinct world-models. How does this follow from non-uniqueness of values/world models? If humans have more than one set of values, or more than one world model, then this seems to say "just pick one set of values/one world model and satisfy that", which seems wrong. One way to interpret all this is that you're pointing to things like submodels, subagents, multiple abstraction levels, etc. But then I don't see why the problem would be any easier in the real world than in the model, since all of those things can be expressed in the model (or a straightforward extension of the model, in the case of subagents).
Learning Normativity: A Research Agenda

I'm pretty on board with this research agenda, but I'm curious what you think about the distinction between approaches that look like finding a fixed point, and approaches that look like doing perturbation theory.

And on the assumption that you have no idea what I'm referring to, here's the link to my post.

There are a couple different directions to go from here. One way is to try to collapse the recursion. Find a single agent-shaped model of humans that is (or approximates) a fixed point of this model-ratification process (and also hopefully stays close to

... (read more)
2Abram Demski5moAh, nice post, sorry I didn't see it originally! It's pointing at a very very related idea. Seems like it also has to do with John's communication model [https://www.lesswrong.com/posts/zAvhvGa6ToieNGuy2/communication-prior-as-alignment-strategy] . With respect to your question about fixed points, I think the issue is quite complicated, and I'd rather approach it indirectly by collecting criteria and trying to make models which fit the various criteria. But here are some attempted thoughts. 1. We should be quite skeptical of just taking a fixed point, without carefully building up all the elements of the final solution -- we don't just want consistency, we want consistency as a result of sufficiently humanlike deliberation. This is similar to the idea that naive infinite HCH might be malign (because it's just some weird fixed point of humans-consulting-HCH), but if we ensure that the HCH tree is finite by (a) requiring all queries to have a recursion budget, or (b) having a probability of randomly stopping (not allowing the tree to be expanded any further), or things like that, we can avoid weird fixed points (and, not coincidentally, these models fit better with what you'd get from iterated amplification if you're training it carefully rather than in a way which allows weird malign fixed-points to creep in). 2. However, I still may want to take fixed points in the design; for example, the way UTAAs allow me to collapse all the meta-levels down. A big difference between your approach in the post and mine here is that I've got more separation between the rationality criteria of the design vs the rationality the system is going to learn, so I can use pure fixed points on one but not the other (hopefully that makes sense?). The system can be based on a perfect fixed point of some sort, while still building up a careful picture iteratively improving on initial models. That's kind of illustrated
Communication Prior as Alignment Strategy

Yeah, this is basically CIRL, when the human-model is smart enough to do Gricean communication. The important open problems left over after starting with CIRL are basically "how do you make sure that your model of communicating humans infers the right things about human preferences?", both due to very obvious problems like human irrationality, and also due to weirder stuff like the human intuition that we can't put complete confidence in any single model.

3johnswentworth5moRoughly, yeah, though there are some differences - e.g. here the AI has no prior "directly about" values, it's all mediated by the "messages", which are themselves informing intended AI behavior directly. So e.g. we don't need to assume that "human values" live in the space of utility functions, or that the AI is going to explicitly optimize for something, or anything like that. But most of the things which are hard in CIRL are indeed still hard here; it doesn't really solve anything in itself. One way to interpret it: this approach uses a similar game to CIRL, but strips out most of the assumptions about the AI and human being expected utility maximizers. To the extent we're modelling the human as an optimizer, it's just an approximation to kick off communication, and can be discarded later on.
Draft papers for REALab and Decoupled Approval on tampering

Very interesting. Naturalizing feedback (as opposed to directly accessing True Reward) seems like it could lead to a lot of desirable emergent behaviors, though I'm somewhat nervous about reliance on a handwritten model of what reliable feedback is.

Defining capability and alignment in gradient descent

Interesting post. Not sure if I agree with your interpretation of the "real objective" - might be better served by looking for stable equilibria and just calling them as such.

Don't we already have weak alignment to arbitrary functions using annealing (basically, jump at random, but jump around more/further on average when the loss is higher and lower the jumping rate over time)? The reason we don't add small annealing terms to gradient descent is entirely because of we expect them to be worse in the short term (a "strong alignment" question).
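A minimal sketch of the kind of annealing being gestured at here: ordinary simulated annealing, with random jumps, a Metropolis acceptance rule, and a decaying temperature. This is close to, though not identical to, the description in the parentheses (here the jump size is tied to the temperature rather than to the current loss), and the toy loss function is arbitrary, not from the post.

```python
import math, random

def anneal(loss, x0, steps=10_000, t0=1.0, t_min=1e-3):
    """Toy simulated annealing: propose random jumps whose size shrinks with
    the temperature, and accept uphill moves less often as the temperature
    (the "jumping rate") is lowered over time."""
    x = best = x0
    for i in range(steps):
        t = t0 * (t_min / t0) ** (i / steps)    # geometric cooling schedule
        candidate = x + random.gauss(0, 1) * t  # bigger jumps while t is high
        delta = loss(candidate) - loss(x)
        # Always accept improvements; accept worsenings with prob exp(-delta/t).
        if delta < 0 or random.random() < math.exp(-delta / t):
            x = candidate
        if loss(x) < loss(best):
            best = x
    return best

# Example: a non-convex loss with many local minima.
print(anneal(lambda x: math.sin(5 * x) + 0.1 * x * x, x0=3.0))
```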

1Edouard Harris5moThanks for the comment! I think this is a reasonable objection. I don't make this very clear in the post, but the "true objective" I've written down in the example indeed isn't unique: like any measure of utility or loss, it's only unique up to affine transformations with positive coefficients. And that could definitely damage the usefulness of these definitions, since it means that alignment factors, for example, aren't uniquely defined either. (I'll be doing a few experiments soon to investigate this, and a few other questions, in a couple of real systems.) Interesting question! To try to interpret in light of the definitions I'm proposing: adding annealing changes the true objective (or mesa-objective) of the optimizer, which is no longer solely trying to minimize its gradients — it now has this new annealing term that it's also trying to optimize for. Whether this improves alignment or not depends on the effect annealing has on 1) the long-term performance of the mesa-optimizer on its new (gradient + annealing) objective; and 2) the long-term performance this induces on the base objective. Hope that's somewhat helpful, but please let me know if it's unclear and I can try to unpack things a bit more!
Additive Operations on Cartesian Frames

Typo in the definition of product: b⋅e should be b⋆e.

1Scott Garrabrant6moYep. Fixed. Thanks.
Security Mindset and Takeoff Speeds

I also agree that direct jumps in capability due to research insight are rare. But in part I think that's just because things get tried at small scale first, and so there's always going to be some scaling-up period where the new insight gets fed more and more resources, eventually outpacing the old state of the art. From a coarse-grained perspective GPT-2 relative to your favorite LSTM model from 2018 is the "jump in capability" due to research insight, it just got there in a not-so-discontinuous way.

Maybe you're optimistic that in the future, everyone wil... (read more)

2Rohin Shah6moSeems right to me. (I'm not convinced this is a good tripwire, but under the assumption that it is:) Ideally they have already applied safety solutions and so this doesn't even happen in the first place. But supposing this did happen, they turn off the AI system because they remember how Amabook lost a billion dollars through their AI system embezzling money from them, and they start looking into how to fix this issue.
Security Mindset and Takeoff Speeds

I think my biggest disagreement with the Rohin-character is about continuity. I expect there will be plenty of future events like BERT vs. RNNs. BERT isn't all that much better than RNNs on small datasets, but it scales better - so then OpenAI comes along and dumps 100x or 10,000x the compute into something like it just to see what happens.

Not only do these switches make me less confident that capability won't have sudden jumps, but I think they pose a problem for carrying safety properties over from the past to the future. Right now, DeepMind solves some ali... (read more)

so then OpenAI comes along and dumps 100x or 10,000x the compute into something like it just to see what happens.

10000x would be unprecedented -- why wouldn't you first do a 100x run to make sure things work well before doing a 10000x run? (This only increases costs by 1%.)

Also, 10000x increase in compute corresponds to 100-1000x more parameters, which does not usually lead to things I would call "discontinuities" (e.g. GPT-2 to GPT-3 does not seem like an important discontinuity to me, even if we ignore the in-between models trained along the way). Put an... (read more)
