All of Beth Barnes's Comments + Replies

What do you think is important about pure RL agents vs RL-finetuned language models? I expect the first powerful systems to include significant pretraining so I don't really think much about agents that are only trained with RL (if that's what you were referring to).

How were you thinking this would measure Goodharting in particular? 

I agree that seems like a reasonable benchmark to have for getting ML researchers/academics to work on imitation learning/value learning. I don't think I'm likely to prioritize it - I don't think 'inability to learn human values' is going to be a problem for advanced AI, so I'm less excited about value learning as a key thing to work on.

video game companies can be extremely well-aligned with delivering a positive experience for their users

This doesn't seem obvious to me; video game companies are incentivized to make games that are as addicting as possible without putting off new users/getting other backlash. 

My impression is that they don't have the skills needed for successful foraging. There's a lot of evidence for some degree of cultural accumulation in apes and e.g. macaques. But I haven't looked into this specific claim super closely.

Thanks for the post! One narrow point:
You seem to lean at least a bit on the example of 'much like how humans’ sudden explosion of technological knowledge accumulated in our culture rather than our genes, once we turned the corner'.  It seems to me that
a. You don't need to go to humans before you get significant accumulation of important cultural knowledge outside genes (e.g. my understanding is that unaccultured chimps die in the wild)
b.  the genetic bottleneck is a somewhat weird and contingent feature of animal evolution, and I don't think the... (read more)

(Is that just because they get attacked and killed by other chimp groups?)

It seems to me like this should be pretty easy to do and I'm disappointed there hasn't been more action on it yet. Things I'd try:
- reach out to various human-data-as-a-service companies like SurgeHQ, Scale, Samasource
- look for people on upwork 
- find people who write fiction on the internet (e.g. post on fanfiction forums) and offer to pay them to annotate their existing stories (not a dungeon run exactly, but I don't see why the dungeon setting is important)

I'd be interested to hear if anyone has tried these things and run into roadblocks.

I'm also ... (read more)

Seems like a simplicity prior over explanations of model behavior is not the same as a simplicity prior over models? E.g. simplicity of explanation of a particular computation is a bit more like a speed prior. I don't understand exactly what's meant by explanations here. For some kinds of attribution, you can definitely have a simple explanation for a complicated circuit and/long-running computation - e.g. if under a relevant input distribution, one input almost always determines the output of a complicated computation.

2Evan Hubinger6mo
I don't think that the size of an explanation/proof of correctness for a program should be very related to how long that program runs—e.g. it's not harder to prove something about a program with larger loop bounds, since you don't have to unroll the loop, you just have to demonstrate a loop invariant.

crossposting my comments from Slack thread:

Here are some debate trees from experiments I did on long-text QA  on this example short story:


Debater view 1

Debater view 2

Our conclusion was that we don’t expect debate to work robustly in these cases. In our case this was mostly because in cases where the debate is things like ’is there implied subtext A?’,  human debaters don’t really know why they believe some text does or doesn’t have a particular implication. They have some mix of priors about what the text might be saying (which can’t really ... (read more)

2Sam Bowman6mo
Yep. (Thanks for re-posting.) We're pretty resigned to the conclusion that debate fails to reach a correct conclusion in at least some non-trivial cases—we're mainly interested in figuring out (i) whether there are significant domains or families of questions for which it will often reach a conclusion, and (ii) whether it tends to fail gracefully (i.e., every outcome is either correct or a draw).

I think I’m something like 30% on ‘The highest-leverage point for alignment work is once we have models that are capable of alignment research - we should focus on maximising the progress we make at that point, rather than on making progress now, or on making it to that point - most of the danger comes after it’

Things this maybe implies:

  • We should try to differentially advance models’ ability to do alignment research relative to other abilities (abilities required to be dangerous, or abilities required to accelerate capabilities)
    • For instance, trying to make
... (read more)

You might think that humans are more robust on the distribution of [proposals generated by humans trying to solve alignment] vs [proposals generated by a somewhat superhuman model trying to get a maximal score]

yeah that's a fair point

IMO, the alignment MVP claim Jan is making is approximately '‘we only need to focus on aligning narrow-ish alignment research models that are just above human level, which can be done with RRM (and maybe some other things, but no conceptual progress?)’'
and requires:

  1. we can build models that are:
    1. Not dangerous themselves
    2. capable of alignment research
    3. We can use RRM to make them aligned enough that we can get useful research out of them. 
  2. We can build these models before [anyone builds models that would be dangerous without [more progress on alignment than i
... (read more)

As written there, the strong form of the orthogonality thesis states 'there's no extra difficulty or complication in creating an intelligent agent to pursue a goal, above and beyond the computational tractability of that goal.'

I don't know whether that's intended to mean the same as 'there are no types of goals that are more 'natural' or that are easier to build agents that pursue, or that you're more likely to get if you have some noisy process for creating agents'.

I feel like I haven't seen a good argument for the latter statement, and it seems intuitively wrong to me.

Yeah, I'm particular worried about the second comment/last paragraph - people not actually wanting to improve their values, or only wanting to improve them in ways we think are not actually an improvement (e.g. wanting to have purer faith)

Random small note - the 'dungeon' theme is slightly ...culturally offputting? or something for me, as someone who's never been into this kind of thing or played any of these and is therefore a bit confused about what exactly this involves, and has vague negative associations (I guess because dungeons sound unpleasant?). I wonder if something a bit blander like a story, play, or AI assistant setting could be better?

Someone who wants to claim the bounty could just buy the dataset from one of the companies that does this sort of thing, if they're able to produce a sufficiently high-quality version, I assume? Would that be in the spirit of the bounty?

Thanks, that's interesting, though mostly I'm not buying it (still unclear whether there's a good case to be made; fairly clear that he's not making a good case). Thoughts: 1. Most of it seems to say "Being a subroutine doesn't imply something doesn't suffer". That's fine, but few positive arguments are made. Starting with the letter 'h' doesn't imply something doesn't suffer either - but it'd be strange to say "Humans obviously suffer, so why not houses, hills and hiccups?". 2. We infer preference from experience of suffering/joy...: [Joe Xs when he might not X] & [Joe experiences suffering and joy] -> [Joe prefers Xing] [this rock is Xing] -> [this rock Xs] Methinks someone is petitioing a principii. (Joe is mechanistic too - but the suffering/joy being part of that mechanism is what gets us to call it "preference") 3. Too much is conflated: [Happeningtox]≢[Aimingtox]≢[Preferringtox] In particular, I can aim to x and not care whether I succeed. Not achieving an aim doesn't imply frustration or suffering in general - we just happen to be wired that way (but it's not universal, even for humans: we can try something whimsical-yet-goal-directed, and experience no suffering/frustration when it doesn't work). [taboo/disambiguate 'aim' if necessary] 1. There's no argument made for frustration/satisfaction. It's just assumed that not achieving a goal is frustrating, and that achieving one is satisfying. A case can be made to ascribe intentionality to many systems - e.g. Dennett's intentional stance []. Ascribing welfare is a further step, and requires further arguments. Non-achievement of an aim isn't inherently frustrating (c.f. Buddhists - and indeed current robots). 1. The only argument I saw on this was "we can sum over possible interpretations" - sure, but I can do that for

I guess I expect there to be a reasonable amount of computation taking place, and it seems pretty plausible a lot of these computations will be structured like agents who are taking part in the Malthusian competition. I'm sufficiently uncertain about how consciousness works that I want to give some moral weight to 'any computation at all', and reasonable weight to 'a computation structured like an agent'.

I think if you have malthusian dynamics you *do* have evolution-like dynamics.

I assume this isn't a crux, but fwiw I think it's pretty likely most vertebrates are moral patients

I agree with most of this. Not sure about how much moral weight I'd put on "a computation structured like an agent" - some, but it's mostly coming from [I might be wrong] rather than [I think agentness implies moral weight]. Agreed that malthusian dynamics gives you an evolution-like situation - but I'd guess it's too late for it to matter: once you're already generally intelligent, can think your way to the convergent instrumental goal of self-preservation, and can self-modify, it's not clear to me that consciousness/pleasure/pain buys you anything. Heuristics are sure to be useful as shortcuts, but I'm not sure I'd want to analogise those to qualia (??? presumably the right kind would be - but I suppose I don't expect the right kind by default). The possibilities for signalling will also be nothing like that in a historical evolutionary setting - the utility of emotional affect doesn't seem to be present (once the humans are gone). [these are just my immediate thoughts; I could easily be wrong] I agree with its being likely that most vertebrates are moral patients. Overall, I can't rule out AIs becoming moral patients - and it's clearly possible. I just don't yet see positive reasons to think it has significant probability (unless aimed for explicitly).

It sounds like you're implying that you need humans around for things to be dystopic? That doesn't seem clear to me; the AIs involved in the Malthusian struggle might still be moral patients

Sure, that's possible (and if so I agree it'd be importantly dystopic) - but do you see a reason to expect it? It's not something I've thought about a great deal, but my current guess is that you probably don't get moral patients without aiming for them (or by using training incentives much closer to evolution than I'd expect).

I guess I was kind of subsuming this into 'benevolent values have become more common'

I tend to want to split "value drift" into "change in the mapping from (possible beliefs about logical and empirical questions) to (implied values)" and "change in beliefs about logical and empirical questions", instead of lumping both into "change in values".

ah yeah, so the claim is something like 'if we think other humans have 'bad values', maybe in fact our values are the same and one of us is mistaken, and we'll get less mistaken over time'?

1Beth Barnes10mo
I guess I was kind of subsuming this into 'benevolent values have become more common'

Is this making a claim about moral realism? If so, why wouldn't it apply to a paperclip maximiser? If not, how do we distinguish between objective mistakes and value disagreements?

4Matthew Barnett10mo
I interpreted steven0461 to be saying that many apparent "value disagreements" between humans turn out, upon reflection, to be disagreements about facts rather than values. It's a classic outcome concerning differences in conflict vs. mistake theory []: people are interpreted as having different values because they favor different strategies, even if everyone shares the same values.
combined with the general tendency to want to do the simplest and cheapest thing possible first… and then try to make it even simpler still before starting… we’ve experimented with including metadata in language pretraining data. Most large language datasets have this information, e.g. books have titles and (maybe) blurbs, websites have titles, URLs, and (maybe) associated subreddit links, etc. This data is obviously much noisier and lower quality than what you get from paying people for annotations, but it’s voluminous, diverse, and ~free.

I'm sympathetic ... (read more)

I am very excited about finding scalable ways to collect large volumes of high-quality data on weird, specific tasks. This seems very robustly useful for alignment, and not something we're currently that good at. I'm a bit less convinced that this task itself is particularly useful.

Have you reached out to e.g. or another one of the companies that does human-data-generation-as-a-service?

Random small note - the 'dungeon' theme is slightly ...culturally offputting? or something for me, as someone who's never been into this kind of thing or played any of these and is therefore a bit confused about what exactly this involves, and has vague negative associations (I guess because dungeons sound unpleasant?). I wonder if something a bit blander like a story, play, or AI assistant setting could be better?

3Beth Barnes10mo
Someone who wants to claim the bounty could just buy the dataset from one of the companies that does this sort of thing, if they're able to produce a sufficiently high-quality version, I assume? Would that be in the spirit of the bounty?

Instruction-following davinci model. No additional prompt material

1Charlie Steiner1y

Yeah, I think you need some assumptions about what the model is doing internally.

I'm hoping you can handwave over cases like 'the model might only know X&A, not X' with something like 'if the model knows X&A, that's close enough to it knowing X for our purposes - in particular, if it thought about the topic or learned a small amount, it might well realise X'.

Where 'our purposes' are something like 'might the model be able to use its knowledge of X in a plan in some way that outsmarts us if we don't know X'?

Another way to put this is that for work
... (read more)
This mostly seems plausible to me - and again, I think it's a useful exercise that ought to yield interesting results. Some thoughts: 1. Handwaving would seem to take us from "we can demonstrate capability of X" to "we have good evidence for capability of X". In cases where we've failed to prompt/finetune the model into doing X we also have some evidence against the model's capability of X. Hard to be highly confident here. 2. Precision over the definition of a task seems important when it comes to output. Since e.g. "do arithmetic" != "output arithmetic". This is relevant in the second capability definition clause, since the kinds of X you can show are necessary for Y aren't behaviours (usually), but rather internal processes. This doesn't seem too useful in attempting to show misalignment, since knowing the model can do X doesn't mean it can output the result of X.

You mean a fixed point of the model changing its activations as well as what it reports? I was thinking we could rule out the model changing the activations themselves by keeping a fixed base model.

When can models report their activations?

Related to call for research on evaluating alignment

Here's an experiment I'd love to see someone run (credit to Jeff Wu for the idea, and William Saunders for feedback):

Finetune a language model to report the activation of a particular neuron in text form.

E.g., you feed the model a random sentence that ends in a full stop. Then the model should output a number from 1-10 that reflects a particular neuron's activation.

We assume the model will not be able to report the activation of a neuron in the final layer, even i... (read more)

2Alex Turner1y
Surely there exist correct fixed points, though? (Although probably not that useful, even if feasible)

@Adam I'm interested if you have the same criticism of the language in the paper (in appendix E)?

(I mostly wrote it, and am interested whether it sounds like it's ascribing agency too much)

1Adam Shimi1y
My message was really about Rohin's phrasing, since I usually don't read the papers in details if I think the summary is good enough. Reading the section now, I'm fine with it. There are a few intentional stance words, but the scare quotes and the straightforwardness of cashing out "is capable" into "there is a prompt to make it do what we want" and "chooses" into "what it actually returns for our prompt" makes it quite unambiguous. I also like this paragraph in the appendix: Rohin also changed my mind on my criticism of calling that misalignment; I now agree that misalignment is the right term. One thought I just had: this looks more like a form of proxy alignment to what we really want, which is not ideal but significantly better than deceptive alignment. Maybe autoregressive language models point to a way of paying a cost of proxy alignment to avoid deceptive alignment?

You might want to reference Ajeya's post on 'Aligning Narrowly Superhuman Models' where you're discussing alignment research that can be done with current models

1John Schulman1y
yup, added a sentence about it

I think this is a really useful post, thanks for making this! I maybe have a few things I'd add but broadly I agree with everything here.

1Beth Barnes1y
You might want to reference Ajeya's post on 'Aligning Narrowly Superhuman Models' [] where you're discussing alignment research that can be done with current models

"Even if actively trying to push the field forward full-time I'd be a small part of that effort"

I think conditioning on something like 'we're broadly correct about AI safety' implies 'we're right about some important things about how AI development will go that the rest of the ML community is surprisingly wrong about'. In that world we're maybe able to contribute as much as a much larger fraction of the field, due to being correct about some things that everyone else is wrong about.

I think your overall point still stands, but it does seem like you sometimes overestimate how obvious things are to the rest of the ML community

We're trying to address cases where the human isn't actually able to update on all of D and form a posterior based on that. We're trying to approximate 'what the human posterior would be if they had been able to look at all of D'. So to do that, we learn the human prior, and we learn the human likelihood, then have the ML do the computationally-intensive part of looking at all of D and updating based on everything in there.

Does that make sense?

Starting with amplification as a baseline; am I correct to infer that imitative generalisation only boosts capabilities, and doesn't give you any additional safety properties?

I think the distinction isn't actually super clear, because you can usually trade off capabilities problems and safety problems. I think of it as expanding the range of questions you can get aligned answers to in a reasonable number of steps. If you're just doing IDA/debate, and you try to get your model to give you answers to questions where the model only knows the an... (read more)

That is a concern, but only in the case where there's no answer that has an argument tree that bottoms out in depth<D. As long as there exists an answer that is supported by a depth<D tree, this answer will beat the answers only supported by depth>D argument trees.

So there is a case where the debaters are not incentivised to be honest; the case where the debaters know something but there's no human-understandable argument for it that bottoms out in <D steps. This is where we get the PSPACE constraint.

If we include discussion of cro... (read more)

I don't think 'assuming one player is honest' and 'not trusting answers by default' are in contradiction. if the judge assumes one player is honest, then if they see two different answers they don't know which one to trust, but if they only see one answer (the debaters agree on an answer/the answer is not challenged by the opposing debater) then they can trust that answer.

I was trying to describe something that's the same as the judging procedure in that doc! I might have made a mistake, but I'm pretty sure the key piece about recursion payments is the same. Apologies that things are unclear. I'm happy to try to clarify, if there were particular aspects that seem different to you.

Yeah, I think the infinite tree case should work just the same - ie an answer that's only supported by an infinite tree will behave like an answer that's not supported (it will lose to an answer with a finite tree and draw... (read more)

In the ball-attached-to-a-pole example, the honest debater has assigned probabilities that are indistinguishable from what you would do if you knew noting except that the claim is false. (I.e., assign probabilities that doubt each component equally.) I'm curious how difficult it is to find the flaw in this argument structure. Have you done anything like showing these transcripts to other experts and seeing if they will be able to answer it?

Not systematically; I would be excited about people doing these experiments. One tricky thing is that you might t... (read more)

It seems like the only thing stopping z from primarily containing object-level knowledge about the world is the human prior about the unlikelihood of object-level knowledge. But humans are really bad at assigning priors even to relatively simple statements - this is the main reason that we need science.

Agree that humans are not necessarily great at assigning priors. The main response to this is that we don't have a way to get better priors than an amplified human's best prior. If amplified humans think the NN prior is better than their prior, the... (read more)

Thanks for the post, I'm excited that you're thinking about debate!

I think I disagree with the claim you're making about being able to avoid requiring the judge to assume that one player is honest (but I might be confused about what you're proposing). 
Basically, it sounds like you're saying that we can get good answers by just running the whole debate and throwing out answers that turn out to have a defeater, or a defeater-defeater-defeater, or whatever. But if this is the only guarantee we're providing, then we're going to need to run an extremely la... (read more)

2Abram Demski2y
Don't you yourself disagree with requiring the judge to assume that one player is honest? In a recent comment, you discuss how claims should not be trusted by default [] .
2Abram Demski2y
I'm not sure why you're saying this, but in the post, I restricted my claim to NP-like problems. So for example, traveling salesman -- the computation to find good routes may be very difficult, but the explanation for the answer remains short (EG an explicit path). So, yes, I'm saying that I don't see the same sort of argument working for exp-sized explanations. (Although Rohin's comment gave me pause, and I still need to think it over more.) But aside from that, I'm also not sure what you mean by the "run an extremely large number of debates" point. Debate isn't like search, where we run more/longer to get better answers. Do you mean that my proposal seems to require longer training time to get anywhere? If so, why is that? Or, what do you mean? I'm not asserting that the judge should distrust, either. Like the normal debate argument, I want to end up in an honest equilibrium. So I'm not saying we need some kind of equilibrium where the judge is justified in distrust. My concern involves the tricky relationship between the equilibrium we're after and what the judge has to actually do during training (when we might not be anywhere near equilibrium). I don't want the judge to have to pretend answers are honest at times when they're statistically not. I didn't end up going through that whole argument in the post (unfortunately), but in my notes for the post, the judge being able to judge via honest opinion at all times during training was an important criterion. I agree that that's what we're after. But I think maybe the difference in our positions can be captured if we split "honest" into two different notions... a-honesty: the statement lacks an immediate (a-honest) counterargument. IE, if I think a statement is a-honest, then I don't think there's a next statement which you can (a-honestly) tell me which would make me disbelieve the statement. b-honesty: the statement cannot be struck down by multi-step (b-honest) debate. IE, if I think a statement is b-hone

The standard argument against having a non-zero-sum debate game is that then you may incentivise your debaters to collude.  

I don't know if you've seen our most recent debate rules and attempt at analysis of whether they provide the desired behavior - seems somewhat relevant to what you're thinking about here. 

2Abram Demski2y
I took a look, and it was indeed helpful. However, I left a comment there about a concern I have. The argument at the end only argues for what you call D-acceptability: having no answer that's judged better after D steps of debate. My concern is that even if debaters are always D-acceptable for all D, that does not mean they are honest. They can instead use non-well-founded argument trees which never bottom out.
2Abram Demski2y
I think the collusion concern basically over-anthropomorphizes the training process. Say, in prisoner's dilemma, if you train myopically, then "all incentives point toward defection" translates concretely to actual defection. Granted, there are training regimes in which this doesn't happen, but those would have to be avoided. OTOH, the concern might be that an inner optimizer would develop which colludes. This would have to be dealt with by more general anti-inner-optimizer technology. Yep, I should take a look!

To be clear, I think this is a good suggestion and is close to how I imagine we'd actually run debate in practice. It just doesn't get us beyond MA if the debaters only write P-size arguments.

I'd be interested to hear more detail of your thoughts on how we might use robustness techniques!

Yep, planning to put up a post about that soon. The short argument is something like:
The equivalent of an obfuscated argument for IDA is a decomposition that includes questions the model doesn't know how to answer. 
We can't always tell the difference between an IDA tree that uses an obfuscated decomposition and gets the wrong answer, vs an IDA tree that uses a good decomposition and gets the right answer, without unpacking the entire tree

I just mean that this method takes order(length of argument in judge-understandable language) time. So if the argument is large then you're going to need to let the debate run for a long time. This is as opposed to the previous hope that even if the argument tree is exp-sized, the debate can run in p-time


Yep, this does work, but limits us to questions where the argument in judge-understandable language is short enough that the debaters can write the whole thing down. So if the debaters run in P-time at deployment time, this gives us MA, not PSPACE as originally hoped. 

1John Schulman2y
OK, I guess I'm a bit unclear on the problem setup and how it involves a training phase and deployment phase.

One counterexample is Manhattan Project - they developed two different designs simultaneously because they weren't sure which would work better. From wikipedia: Two types of atomic bombs were developed concurrently during the war: a relatively simple gun-type fission weapon and a more complex implosion-type nuclear weapon.,Tube%20Alloys%20project)%20and%20Canada.

Both debaters make claims. Any claims that are only supported by circular arguments will be ignored. If an honest claim that's supported by a good argument is disputed, the honest debater will pay to recurse, and will give their good argument

4Abram Demski2y
This was a very interesting comment (along with its grandparent comment), thanks -- it seems like a promising direction. However, I'm still confused about whether this would work. It's very different from judging procedure outlined here [] ; why is that? Do you have a similarly detailed write-up of the system you're describing here? I'm actually less concerned about loops and more concerned about arguments which are infinite trees, but the considerations are similar. It seems possible that the proposal you're discussing very significantly addresses concerns I've had about debate.

I see myself as trying to construct a theory of normativity which gets that "by construction", IE, we can't expect to find any mechanism which does better because if we could say anything about what that mechanism does better then we could tell it to the system, and the system would take it into account.

Nice, this is what I was trying to say but was struggling to phrase it. I like this.

I guess I usually think of HCH as having this property, as long as the thinking time for each human is long enough, the tree is deep enough, and we're correct about the hope... (read more)

Load More