All of Beth Barnes's Comments + Replies

Call for research on evaluating alignment (funding + advice available)

Yeah, I think you need some assumptions about what the model is doing internally.

I'm hoping you can handwave over cases like 'the model might only know X&A, not X' with something like 'if the model knows X&A, that's close enough to it knowing X for our purposes - in particular, if it thought about the topic or learned a small amount, it might well realise X'.

Where 'our purposes' are something like 'might the model be able to use its knowledge of X in a plan in some way that outsmarts us if we don't know X'?

Another way to put this is that for work
...
[1] Joe_Collman, 1d: This mostly seems plausible to me - and again, I think it's a useful exercise that ought to yield interesting results. Some thoughts:
1. Handwaving would seem to take us from "we can demonstrate capability of X" to "we have good evidence for capability of X". In cases where we've failed to prompt/finetune the model into doing X we also have some evidence against the model's capability of X. Hard to be highly confident here.
2. Precision over the definition of a task seems important when it comes to output. Since e.g. "do arithmetic" != "output arithmetic". This is relevant in the second capability definition clause, since the kinds of X you can show are necessary for Y aren't behaviours (usually), but rather internal processes. This doesn't seem too useful in attempting to show misalignment, since knowing the model can do X doesn't mean it can output the result of X.
Beth Barnes's Shortform

You mean a fixed point of the model changing its activations as well as what it reports? I was thinking we could rule out the model changing the activations themselves by keeping a fixed base model.

Beth Barnes's Shortform

When can models report their activations?

Related to call for research on evaluating alignment

Here's an experiment I'd love to see someone run (credit to Jeff Wu for the idea, and William Saunders for feedback):

Finetune a language model to report the activation of a particular neuron in text form.

E.g., you feed the model a random sentence that ends in a full stop. Then the model should output a number from 1-10 that reflects a particular neuron's activation.
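
To make the target of this finetuning task concrete, here is a minimal sketch of how the training labels might be built. The proposal only says "a number from 1-10"; bucketing by deciles over a calibration set, and the names `activation_to_label`, `make_example` and `reference`, are assumptions made for illustration, not part of the proposed experiment.

```python
import bisect


def activation_to_label(activation, sorted_reference):
    """Map a raw activation to an integer label in 1..10.

    Assumption: labels are deciles of the neuron's activations on a
    calibration set (sorted_reference, in ascending order).
    """
    n = len(sorted_reference)
    # Number of calibration activations strictly below this one.
    rank = bisect.bisect_left(sorted_reference, activation)
    # Clamp so values above the calibration range still map to 10.
    return min(10, rank * 10 // n + 1)


def make_example(sentence, activation, sorted_reference):
    """Build one (input text, target text) pair for finetuning."""
    return sentence, str(activation_to_label(activation, sorted_reference))


# Hypothetical calibration activations for a single neuron.
reference = sorted(float(i) for i in range(100))
example = make_example("The cat sat on the mat.", 50.0, reference)
```

In a real run, the activation would be read from the fixed base model on the input sentence, and the resulting pairs would be used as ordinary text-to-text finetuning data.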

We assume the model will not be able to report the activation of a neuron in the final layer, even i...

[2] Alex Turner, 1mo: Surely there exist correct fixed points, though? (Although probably not that useful, even if feasible)
[AN #157]: Measuring misalignment in the technology underlying Copilot

@Adam I'm interested if you have the same criticism of the language in the paper (in appendix E)?

(I mostly wrote it, and am interested whether it sounds like it's ascribing agency too much)

[1] Adam Shimi, 2mo: My message was really about Rohin's phrasing, since I usually don't read the papers in detail if I think the summary is good enough. Reading the section now, I'm fine with it. There are a few intentional stance words, but the scare quotes and the straightforwardness of cashing out "is capable" into "there is a prompt to make it do what we want" and "chooses" into "what it actually returns for our prompt" make it quite unambiguous. I also like this paragraph in the appendix: Rohin also changed my mind on my criticism of calling that misalignment; I now agree that misalignment is the right term. One thought I just had: this looks more like a form of proxy alignment to what we really want, which is not ideal but significantly better than deceptive alignment. Maybe autoregressive language models point to a way of paying a cost of proxy alignment to avoid deceptive alignment?
Frequent arguments about alignment

You might want to reference Ajeya's post on 'Aligning Narrowly Superhuman Models' where you're discussing alignment research that can be done with current models

[1] John Schulman, 4mo: yup, added a sentence about it
Frequent arguments about alignment

I think this is a really useful post, thanks for making this! I maybe have a few things I'd add but broadly I agree with everything here.

[1] Beth Barnes, 4mo: You might want to reference Ajeya's post on 'Aligning Narrowly Superhuman Models' where you're discussing alignment research that can be done with current models
AMA: Paul Christiano, alignment researcher

"Even if actively trying to push the field forward full-time I'd be a small part of that effort"

I think conditioning on something like 'we're broadly correct about AI safety' implies 'we're right about some important things about how AI development will go that the rest of the ML community is surprisingly wrong about'. In that world we're maybe able to contribute as much as a much larger fraction of the field, due to being correct about some things that everyone else is wrong about.

I think your overall point still stands, but it does seem like you sometimes overestimate how obvious things are to the rest of the ML community

Imitative Generalisation (AKA 'Learning the Prior')

We're trying to address cases where the human isn't actually able to update on all of D and form a posterior based on that. We're trying to approximate 'what the human posterior would be if they had been able to look at all of D'. So to do that, we learn the human prior, and we learn the human likelihood, then have the ML do the computationally-intensive part of looking at all of D and updating based on everything in there.

Does that make sense?
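
As a sketch of the update described above (the notation is assumed here, not taken from the comment): write $z$ for a hypothesis, $P_H(z)$ for the learned human prior, and $P_H(y_i \mid x_i, z)$ for the learned human likelihood of a label $y_i$ given input $x_i$. The posterior the ML system approximates on the human's behalf is then ordinary Bayes over the whole dataset $D$, and a point estimate $z^{*}$ can be taken as its maximiser.

```latex
P(z \mid D) \;\propto\; P_H(z) \prod_{(x_i, y_i) \in D} P_H(y_i \mid x_i, z)

z^{*} = \arg\max_{z} \Big[ \log P_H(z) + \sum_{(x_i, y_i) \in D} \log P_H(y_i \mid x_i, z) \Big]
```

The human never has to read all of $D$; they only need to be able to supply the prior and the per-example likelihood terms, which is the part being learned.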

Imitative Generalisation (AKA 'Learning the Prior')
Starting with amplification as a baseline; am I correct to infer that imitative generalisation only boosts capabilities, and doesn't give you any additional safety properties?

I think the distinction isn't actually super clear, because you can usually trade off capabilities problems and safety problems. I think of it as expanding the range of questions you can get aligned answers to in a reasonable number of steps. If you're just doing IDA/debate, and you try to get your model to give you answers to questions where the model only knows the an...

Debate Minus Factored Cognition

That is a concern, but only in the case where there's no answer that has an argument tree that bottoms out in depth<D. As long as there exists an answer that is supported by a depth<D tree, this answer will beat the answers only supported by depth>D argument trees.

So there is a case where the debaters are not incentivised to be honest; the case where the debaters know something but there's no human-understandable argument for it that bottoms out in <D steps. This is where we get the PSPACE constraint.

If we include discussion of cro...

Debate Minus Factored Cognition

I don't think 'assuming one player is honest' and 'not trusting answers by default' are in contradiction. If the judge assumes one player is honest, then if they see two different answers they don't know which one to trust, but if they only see one answer (the debaters agree on an answer/the answer is not challenged by the opposing debater) then they can trust that answer.

AI safety via market making

I was trying to describe something that's the same as the judging procedure in that doc! I might have made a mistake, but I'm pretty sure the key piece about recursion payments is the same. Apologies that things are unclear. I'm happy to try to clarify, if there were particular aspects that seem different to you.

Yeah, I think the infinite tree case should work just the same - ie an answer that's only supported by an infinite tree will behave like an answer that's not supported (it will lose to an answer with a finite tree and draw...

Debate update: Obfuscated arguments problem
In the ball-attached-to-a-pole example, the honest debater has assigned probabilities that are indistinguishable from what you would do if you knew nothing except that the claim is false. (I.e., assign probabilities that doubt each component equally.) I'm curious how difficult it is to find the flaw in this argument structure. Have you done anything like showing these transcripts to other experts and seeing if they will be able to answer it?

Not systematically; I would be excited about people doing these experiments. One tricky thing is that you might t...

Imitative Generalisation (AKA 'Learning the Prior')
It seems like the only thing stopping z from primarily containing object-level knowledge about the world is the human prior about the unlikelihood of object-level knowledge. But humans are really bad at assigning priors even to relatively simple statements - this is the main reason that we need science.

Agree that humans are not necessarily great at assigning priors. The main response to this is that we don't have a way to get better priors than an amplified human's best prior. If amplified humans think the NN prior is better than their prior, the...

Debate Minus Factored Cognition

Thanks for the post, I'm excited that you're thinking about debate!

I think I disagree with the claim you're making about being able to avoid requiring the judge to assume that one player is honest (but I might be confused about what you're proposing). 
Basically, it sounds like you're saying that we can get good answers by just running the whole debate and throwing out answers that turn out to have a defeater, or a defeater-defeater-defeater, or whatever. But if this is the only guarantee we're providing, then we're going to need to run an extremely la...

[2] Abram Demski, 9mo: Don't you yourself disagree with requiring the judge to assume that one player is honest? In a recent comment, you discuss how claims should not be trusted by default.
[2] Abram Demski, 9mo: I'm not sure why you're saying this, but in the post, I restricted my claim to NP-like problems. So for example, traveling salesman -- the computation to find good routes may be very difficult, but the explanation for the answer remains short (EG an explicit path). So, yes, I'm saying that I don't see the same sort of argument working for exp-sized explanations. (Although Rohin's comment gave me pause, and I still need to think it over more.) But aside from that, I'm also not sure what you mean by the "run an extremely large number of debates" point. Debate isn't like search, where we run more/longer to get better answers. Do you mean that my proposal seems to require longer training time to get anywhere? If so, why is that? Or, what do you mean? I'm not asserting that the judge should distrust, either. Like the normal debate argument, I want to end up in an honest equilibrium. So I'm not saying we need some kind of equilibrium where the judge is justified in distrust. My concern involves the tricky relationship between the equilibrium we're after and what the judge has to actually do during training (when we might not be anywhere near equilibrium). I don't want the judge to have to pretend answers are honest at times when they're statistically not. I didn't end up going through that whole argument in the post (unfortunately), but in my notes for the post, the judge being able to judge via honest opinion at all times during training was an important criterion. I agree that that's what we're after. But I think maybe the difference in our positions can be captured if we split "honest" into two different notions... a-honesty: the statement lacks an immediate (a-honest) counterargument. IE, if I think a statement is a-honest, then I don't think there's a next statement which you can (a-honestly) tell me which would make me disbelieve the statement. b-honesty: the statement cannot be struck down by multi-step (b-honest) debate.
IE, if I think a statement is b-hone
Debate Minus Factored Cognition

The standard argument against having a non-zero-sum debate game is that then you may incentivise your debaters to collude.  

I don't know if you've seen our most recent debate rules and attempt at analysis of whether they provide the desired behavior - seems somewhat relevant to what you're thinking about here. 

[2] Abram Demski, 9mo: I took a look, and it was indeed helpful. However, I left a comment there about a concern I have. The argument at the end only argues for what you call D-acceptability: having no answer that's judged better after D steps of debate. My concern is that even if debaters are always D-acceptable for all D, that does not mean they are honest. They can instead use non-well-founded argument trees which never bottom out.
[2] Abram Demski, 9mo: I think the collusion concern basically over-anthropomorphizes the training process. Say, in prisoner's dilemma, if you train myopically, then "all incentives point toward defection" translates concretely to actual defection. Granted, there are training regimes in which this doesn't happen, but those would have to be avoided. OTOH, the concern might be that an inner optimizer would develop which colludes. This would have to be dealt with by more general anti-inner-optimizer technology. Yep, I should take a look!
Debate update: Obfuscated arguments problem

To be clear, I think this is a good suggestion and is close to how I imagine we'd actually run debate in practice. It just doesn't get us beyond MA if the debaters only write P-size arguments.

Debate update: Obfuscated arguments problem

I'd be interested to hear more detail of your thoughts on how we might use robustness techniques!

Debate update: Obfuscated arguments problem

Yep, planning to put up a post about that soon. The short argument is something like:
The equivalent of an obfuscated argument for IDA is a decomposition that includes questions the model doesn't know how to answer. 
We can't always tell the difference between an IDA tree that uses an obfuscated decomposition and gets the wrong answer, vs an IDA tree that uses a good decomposition and gets the right answer, without unpacking the entire tree

Debate update: Obfuscated arguments problem

I just mean that this method takes order(length of argument in judge-understandable language) time. So if the argument is large then you're going to need to let the debate run for a long time. This is as opposed to the previous hope that even if the argument tree is exp-sized, the debate can run in P-time.

Debate update: Obfuscated arguments problem


Yep, this does work, but limits us to questions where the argument in judge-understandable language is short enough that the debaters can write the whole thing down. So if the debaters run in P-time at deployment time, this gives us MA, not PSPACE as originally hoped. 

[1] John Schulman, 10mo: OK, I guess I'm a bit unclear on the problem setup and how it involves a training phase and deployment phase.
Homogeneity vs. heterogeneity in AI takeoff scenarios

One counterexample is the Manhattan Project - they developed two different designs simultaneously because they weren't sure which would work better. From Wikipedia: "Two types of atomic bombs were developed concurrently during the war: a relatively simple gun-type fission weapon and a more complex implosion-type nuclear weapon."

AI safety via market making

Both debaters make claims. Any claims that are only supported by circular arguments will be ignored. If an honest claim that's supported by a good argument is disputed, the honest debater will pay to recurse, and will give their good argument

[4] Abram Demski, 9mo: This was a very interesting comment (along with its grandparent comment), thanks -- it seems like a promising direction. However, I'm still confused about whether this would work. It's very different from the judging procedure outlined here; why is that? Do you have a similarly detailed write-up of the system you're describing here? I'm actually less concerned about loops and more concerned about arguments which are infinite trees, but the considerations are similar. It seems possible that the proposal you're discussing very significantly addresses concerns I've had about debate.
Learning Normativity: A Research Agenda

I see myself as trying to construct a theory of normativity which gets that "by construction", IE, we can't expect to find any mechanism which does better because if we could say anything about what that mechanism does better then we could tell it to the system, and the system would take it into account.

Nice, this is what I was trying to say but was struggling to phrase it. I like this.

I guess I usually think of HCH as having this property, as long as the thinking time for each human is long enough, the tree is deep enough, and we're correct about the hope...

Learning Normativity: A Research Agenda

However, that only works if we have the right prior. We could try to learn the prior from humans, which gets us 99% of the way there... but as I've mentioned earlier, human imitation does not get us all the way. Humans don't perfectly endorse their own reactions.

Note that Learning the Prior uses an amplified human (ie, a human with access to a model trained via IDA/Debate/RRM). So we can do a bit better than a base human - e.g. could do something like having an HCH tree where many humans generate possible feedback and other humans look at the feedback and ...

[3] Abram Demski, 1y: Right, I agree. I see myself as trying to construct a theory of normativity which gets that "by construction", IE, we can't expect to find any mechanism which does better because if we could say anything about what that mechanism does better then we could tell it to the system, and the system would take it into account. HCH isn't such a theory; it does provide a somewhat reasonable notion of amplification, but if we noticed systematic flaws with how HCH reasons, we would not be able to systematically correct them.
Extortion beats brinksmanship, but the audience matters

FYI/nit: at first glance I thought extorsion was supposed to mean something different from extortion (I've never seen it spelt with the s) and this was a little confusing. 

[2] Stuart Armstrong, 1y: That's a misspelling that's entirely my fault, and has now been corrected.
AI safety via market making

Ah, yeah. I think the key thing is that by default a claim is not trusted unless the debaters agree on it. 
If the dishonest debater disputes some honest claim, where honest has an argument for their answer that actually bottoms out, dishonest will lose - the honest debater will pay to recurse until they get to a winning node. 
If the dishonest debater makes some claim and plans to make a circular argument for it, the honest debater will give an alternative answer but not pay to recurse. If the dishonest debater doesn't pay to recurse, the judge...

[4] Rohin Shah, 1y: This part makes sense. So in this case it's a stalemate, presumably? If the two players disagree but neither pays to recurse, how should the judge make a decision?
AI safety via market making

Suppose by strong induction that M always gives the right answer immediately for all sets of size less than n

Pretty sure debate can also access R if you make this strong of an assumption - ie assume that debaters give correct answers for all questions that can be answered with a debate tree of size <n. 

I think the sort of claim that's actually useful is going to look more like 'we can guarantee that we'll get a reasonable training signal for problems in [some class]'

Ie, suppose M gives correct answers some fraction of the time. Are t...

[3] Evan Hubinger, 1y: First, my full exploration of what's going on with different alignment proposals and complexity classes can be found here, so I'd recommend just checking that out rather than relying on the mini proof sketch I gave here. Second, in terms of directly addressing what you're saying, I tried doing a proof by induction to get debate to RE and it doesn't work. The problem is that you can only get guarantees for trees that the human can judge, which means they have to be polynomial in length (though if you relax that assumption then you might be able to do better). Also, it's worth noting that the text that you're quoting isn't actually an assumption of the proof in any way—it's just the inductive hypothesis in a proof by induction. I think that is the same as what I'm proving, at least if you allow for “training signal” to mean “training signal in the limit of training on arbitrarily large amounts of data.” See my full post on complexity proofs for more detail on the setup I'm using.
AI safety via market making

I think for debate you can fix the circular argument problem by requiring debaters to 'pay' (sacrifice some score) to recurse on a statement of their choice. If a debater repeatedly pays to recurse on things that don't resolve before the depth limit, then they'll lose.  
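
A toy sketch of the incentive described above, under loudly stated assumptions: the comment does not specify costs, rewards, or how arguments are represented, so `score_recursion`, the `support` mapping, and the numbers `cost=1.0` and `reward=10.0` are all hypothetical. Each claim either points to the sub-claim its owner recurses on, or is a leaf the judge can check directly.

```python
def score_recursion(claim, support, depth_limit, cost=1.0, reward=10.0):
    """Net score for a debater who keeps paying to recurse from `claim`.

    `support` maps a claim to the sub-claim the debater recurses on.
    A claim missing from `support` is treated as a leaf that the judge
    can verify directly (assumed here to resolve in the debater's favour).
    """
    score = 0.0
    current = claim
    for _ in range(depth_limit):
        next_claim = support.get(current)
        if next_claim is None:
            # Reached a checkable leaf before the depth limit.
            return score + reward
        score -= cost  # pay to recurse one level deeper
        current = next_claim
    # Depth limit hit without resolving: only the payments remain.
    return score


# A well-founded argument: A rests on B, B rests on C, C is checkable.
grounded = score_recursion("A", {"A": "B", "B": "C"}, depth_limit=5)

# A circular argument: A rests on B, B rests on A, and so on.
circular = score_recursion("A", {"A": "B", "B": "A"}, depth_limit=5)
```

In this toy model the well-founded argument ends with a positive score and the circular one ends negative, which is the shape of the incentive being claimed. It does not model the opposing debater at all.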

[5] Rohin Shah, 1y: Hmm, I was imagining that the honest player would have to recurse on the statements in order to exhibit the circular argument, so it seems to me like this would penalize the honest player rather than the circular player. Can you explain what the honest player would do against the circular player such that this "payment" disadvantages the circular player? EDIT: Maybe you meant the case where the circular argument is too long to exhibit within the debate, but I think I still don't see how this helps.
AGI safety from first principles: Goals and Agency

But note that humans are far from fully consequentialist, since we often obey deontological constraints or constraints on the types of reasoning we endorse.

I think the ways in which humans are not fully consequentialist are much broader - we often do things because of habit, instinct, because doing that thing feels rewarding itself, because we're imitating someone else, etc.

[1] Adam Shimi, 1y: Probably because humans are not always doing optimization? That does raise an interesting question: is satisfying the first two criteria (which basically make you an optimizer) a necessary condition for satisfying the third one?
Writeup: Progress on AI Safety via Debate

That's correct about simultaneity.

Yeah, the questions and answers can be arbitrary; they don't have to be X and ¬X.

I'm not completely sure whether Scott's method would work given how we're defining the meaning of questions, especially in the middle of the debate. The idea is to define the question by how a snapshot of the questioner taken when they wrote the question would answer questions about what they meant. So in this case, if you asked the questioner 'is your question equivalent to 'should I eat potatoes tonight?...

Competition: Amplify Rohin’s Prediction on AGI researchers & Safety Concerns

Yeah I also thought this might just be true already, for similar reasons

$1000 bounty for OpenAI to show whether GPT3 was "deliberately" pretending to be stupider than it is

Of course GPT-3 isn't aligned, its objective is to output the most likely next word, ie imitate text on the internet. It seems pretty certain that if you give it a prompt that tells it it should be imitating some part of the internet where someone says something dumb, it will say something dumb, and if you give it a prompt that tells it it's imitating something where someone says something smart, it will "try" to say something smart. This question seems weird to me. Am I missing something?

This question seems weird to me. Am I missing something?

I think there are two interesting parts. First, do we now have an example of an AI not using cognitive capacities that it had, because the 'face' it's presenting wouldn't have those cognitive capacities? If so, we can point to this whenever people say "but that wouldn't happen" or "why would you expect that to happen?" or so on; now we can say "because of this observation" instead of "because our models anticipate that will happen in the fut...

I agree. And I thought Arthur Breitman had a good point on one of the related Twitter threads:

GPT-3 didn't "pretend" not to know. A lot of this is the AI dungeon environment. If you just prompt the raw GPT-3 with: "A definition of a monotreme follows" it'll likely do it right. But if you role play, sure, it'll predict that your stoner friend or young nephew don't know.

Using vector fields to visualise preferences and make them consistent

You might find this paper interesting. It does a similar decomposition with the dynamics of differentiable games (where the 'preferences' for how to change your strategy may not be the gradient of any function)

"The key result is to decompose the second-order dynamics into two components. The first is related to potential games, which reduce to gradient descent on an implicit function; the second relates to Hamiltonian games, a new class of games that obey a conservation law, akin to conservation laws in classical mechanical systems."