Wiki Contributions


Glad to see you're working on this. It seems even more clearly correct (the goal, at least :)) for not-so-short timelines. Less clear how best to go about it, but I suppose that's rather the point!

A few thoughts:

  1. I expect it's unusual that [replace methodology-1 with methodology-2] will be a pareto improvement: other aspects of a researcher's work will tend to have adapted to fit methodology-1. So I don't think the creation of some initial friction is a bad sign. (also mirrors therapy - there's usually a [take things apart and better understand them] phase before any [put things back together in a more adaptive pattern] phase)
    1. It might be useful to predict this kind of thing ahead of time, to develop a sense of when to expect specific side-effects (and/or predictably unpredictable side effects).
  2. I do think it's worth interviewing at least a few carefully selected non-alignment researchers. I basically agree with your alignment-is-harder case. However, it also seems most important to be aware of things the field is just completely missing.
    1. In particular, this may be useful where some combination of cached methodologies is a local maximum for some context. Knowing something about other hills seems useful here.
      1. I don't expect it'd work to import full sets of methodologies from other fields, but I do expect there are useful bits-of-information to be had.
    2. Similarly, if thinking about some methodology x that most alignment researchers currently use, it might be useful to find and interview other researchers that don't use x. Are they achieving [things-x-produces] in other ways? What other aspects of their methodology are missing/different?
      1. This might hint both at how a methodology change may impact alignment researchers, and how any negative impact might be mitigated.
  3. Worth considering that there's less of a risk in experimenting (kindly, that is) on relative newcomers than on experienced researchers. It's a good idea to get a clear understanding of the existing process of experienced researchers. However, once we're in [try this and see what happens] mode there's much less downside with new people - even abject failure is likely to be informative, and the downside in counterfactual object-level research lost is much smaller in expectation.

[apologies on slowness - I got distracted]
Granted on type hierarchy. However, I don't think all instances of GPT need to look like they inherit from the same superclass. Perhaps there's such a superclass, but we shouldn't assume it.

I think most of my worry comes down to potential reasoning along the lines of:

  • GPT is a simulator;
  • Simulators have property p;
  • Therefore GPT has property p;

When what I think is justified is:

  • GPT instances are usually usefully thought of as simulators;
  • Simulators have property p;
  • We should suspect that a given instance of GPT will have property p, and confirm/falsify this;

I don't claim you're advocating the former: I'm claiming that people are likely to use the former if "GPT is a simulator" is something they believe. (this is what I mean by motte-and-baileying into trouble)

If you don't mean to imply anything mechanistic by "simulator", then I may have misunderstood you - but at that point "GPT is a simulator" doesn't seem to get us very far.

If it's deceptively aligned, it's not a simulator in an important sense because its behavior is not sufficient to characterize very important aspects of its nature (and its behavior may be expected to diverge from simulation in the future).

It's true that the distinction between inner misalignment and robustness/generalization failures, and thus the distinction between flawed/biased/misgeneralizing simulators and pretend-simulators, is unclear, and seems like an important thing to become less confused about.

I think this is the fundamental issue.
Deceptive alignment aside, what else qualifies as "an important aspect of its nature"?
Which aspects disqualify a model as a simulator?
Which aspects count as inner misalignment?

To be clear on [x is a simulator (up to inner misalignment)], I need to know:

  1. What is implied mechanistically (if anything) by "x is a simulator".
  2. What is ruled out by "(up to inner misalignment)".

I'd be wary of assuming there's any neat flawed-simulator/pretend-simulator distinction to be discovered. (but probably you don't mean to imply this?)
I'm all for deconfusion, but it's possible there's no joint at which to carve here.

(my guess would be that we're sometimes confused by the hidden assumption:
[a priori unlikely systematically misleading situation => intent to mislead]
whereas we should be thinking more like
[a priori unlikely systematically misleading situation => selection pressure towards things that mislead us]

I.e. looking for deception in something that systematically misleads us is like looking for the generator for beauty in something beautiful. Beauty and [systematic misleading] are relations between ourselves and the object. Selection pressure towards this relation may or may not originate in the object.)

Can you give an example of what it would mean for a GPT not to be a simulator, or to not be a simulator in some sense?

Here I meant to point to the lack of clarity around what counts as inner misalignment, and what GPT's being a simulator would imply mechanistically (if anything).

There must be some evidence that the initial appearance of alignment was due to the model actively trying to appear aligned only in the service of some ulterior goal.

"trying to appear aligned" seems imprecise to me - unless you mean to be more specific than the RFLO description. (see footnote 7: in general, there's no requirement for modelling of the base optimiser or oversight system; it's enough to understand the optimisation pressure and be uncertain whether it'll persist).

Are you thinking that it makes sense to agree to test for systems that are "trying to appear aligned", or would you want to include any system that is instrumentally acting such that it's unaltered by optimisation pressure?

Mostly I agree with this.
I have more thoughts, but probably better to put them in a top-level post - largely because I think this is important and would be interested to get more input on a good balance.

A few thoughts on LW endorsing invalid arguments:
I'd want to separate considerations of impact on [LW as collective epistemic process] from [LW as outreach to ML researchers]. E.g. it doesn't necessarily seem much of a problem for the former to have reliance on unstated assumptions. I wouldn't formally specify an idea before sketching it, and it's not clear to me that there's anything wrong with collective sketching (so long as we know we're sketching - and this part could certainly be improved).
I'd first want to optimize the epistemic process, and then worry about the looking foolish part. (granted that there are instrumental reasons not to look foolish)

On ML's view, are you mainly thinking of people who may do research on an important x-safety sub-problem without necessarily buying x-risk arguments? It seems unlikely to me that anyone gets persuaded of x-risk from the bottom up, whether or not the paper/post in question is rigorous - but perhaps this isn't required for a lot of useful research?

I want to maximize the bandwidth between human alignment researchers and AI tools/oracles/assistants/simulations. It is essential that these tools are developed by (or in a tight feedback loop with) actual alignment researchers doing theory work, because we want to simulate and play with thought processes and workflows that produce useful alignment ideas.

What are your thoughts on failure modes with this approach?
(please let me know if any/all of the following seems confused/vanishingly unlikely)

For example, one of the first that occurs to me is that such cyborgism is unlikely to amplify production of useful-looking alignment ideas uniformly in all directions.

Suppose that it makes things 10x faster in various directions that look promising, but don't lead to solutions, but only 2x faster in directions that do lead to solutions. In principle this should be very helpful: we can allocate fewer resources to the 10x directions, leaving us more time to work on the 2x directions, and everybody wins.
In practice, I'd expect the 10x boost to:

  1. Produce unhelpful incentives for alignment researchers: work on any of the 10x directions and you'll look hugely more productive. Who will choose to work on the harder directions?
    1. Note that it won't be obvious you're going slowly because the direction is inherently harder: from the outside, heading in a difficult direction will be hard to distinguish from being ineffective (from the inside too, in fact).
    2. Same reasoning applies at every level of granularity: sub-direction choice, sub-sub-direction choice....
  2. Warp our perception of promising directions: once the 10x directions seem to be producing progress much faster, it'll be difficult not to interpret this as evidence they're more promising.
    1. Amplified assessment-of-promise seems likely to correlate unhelpfully: failing to help us notice promising directions precisely where it's least able to help us make progress.

It still seems positive-in-expectation if the boost of cyborgism isn't negatively correlated with the ground-truth usefulness of a direction - but a negative correlation here seems plausible.

Suppose that finding the truly useful directions requires patterns of thought that are rare-to-non-existent in the training set, and are hard to instill via instruction. In that case it seems likely to me that GPT will be consistently less effective in these directions (to generate these ideas / to take these steps...). Then we may be in terrible-incentive-land.
[I'm not claiming that most steps in hard directions will be hard, but that speed of progress asymptotes to progress-per-hard-step]

Of course all this is hand-waving speculation.
I'd just like the designers of alignment-research boosting tools to have clear arguments that nothing of this sort is likely.

So e.g. negative impact through:

  • Boosting capabilities research.
  • Creation of undesirable incentives in alignment research.
  • Warping assessment of research directions.
  • [other stuff I haven't thought of]

Do you know of any existing discussion along these lines?

Great post. Very interesting.

However, I think that assuming there's a "true name" or "abstract type that GPT represents" is an error.

If GPT means "transformers trained on next-token prediction", then GPT's true name is just that. The character of the models produced by that training is another question - an empirical one. That character needn't be consistent (even once we exclude inner alignment failures).

Even if every GPT is a simulator in some sense, I think there's a risk of motte-and-baileying our way into trouble.

Presumably "too dismissive of speculative and conceptual research" is a direct consequence of increased emphasis on rigor. Rigor is to be preferred all else being equal, but all else is not equal.

It's not clear to me how we can encourage rigor where effective without discouraging research on areas where rigor isn't currently practical. If anyone has ideas on this, I'd be very interested.

I note that within rigorous fields, the downsides of rigor are not obvious: we can point to all the progress made; progress that wasn't made due to the neglect of conceptual/speculative research is invisible. (has the impact of various research/publication norms ever been studied?)

Further, it seems limiting only to consider [must always be rigorous (in publications)] vs [no demand for rigor]. How about [50% of your publications must be rigorous] (and no incentive to maximise %-of-rigorous-publications), or any other not-all-or-nothing approach?

I'd contrast rigor with clarity here. Clarity is almost always a plus.
I'd guess that the issue in social science fields isn't a lack of rigor, but rather of clarity. Sometimes clarity without rigor may be unlikely, e.g. where there's a lot of confusion or lack of good faith - in such cases an expectation of rigor may help. I don't think this situation is universal.

What we'd want on LW/AF is a standard of clarity.
Rigor is an often useful proxy. We should be careful when incentivizing proxies.

I’d expect AI checks and balances to have some benefits even if AIs are engaged in advanced collusion with each other. For example, AIs rewarded to identify and patch security vulnerabilities would likely reveal at least some genuine security vulnerabilities, even if they were coordinating with other AIs to try to keep the most important ones hidden.

This seems not to be a benefit.
What we need is to increase the odds of finding the important vulnerabilities. Collusion that reveals a few genuine vulnerabilities seems likely to lower our odds by giving us misplaced confidence. I don't think this is an artefact of the example: wherever we've unaware of collusion, but otherwise accurate in our assessment of check-and-balance-systems' capabilities, we'll be overconfident in the results.

In many cases, the particular vulnerabilities (or equivalent) revealed will be selected at least in part to maximise our overconfidence.

Ok, thanks - I can at least see where you're coming from then.
Do you think debate satisfies the latter directly - or is the frontier only pushed out if it helps in the automation process? Presumably the latter (??) - or do you expect e.g. catastrophic out-with-a-whimper dynamics before deceptive alignment?

I suppose I'm usually thinking of a "do we learn anything about what a scalable alignment approach would look like?" framing. Debate doesn't seem to get us much there (whether for scalable ELK, or scalable anything else), unless we can do the automating alignment work thing - and I'm quite sceptical of that (weakly, and with hand-waving).

I can buy a safe-AI-automates-quite-a-bit-of-busy-work argument; once we're talking about AI that's itself coming up with significant alignment breakthroughs, I have my doubts. What we want seems to be unhelpfully complex, such that I expect we need to target our helper AIs precisely (automating capabilities research seems much simpler).

Since we currently can't target our AIs precisely, I imagine we'd use (amplified-)human-approval. My worry is that errors/noise compounds for indirect feedback (e.g. deep debate tree on a non-crisp question), and that direct feedback is only as good as our (non-amplified) ability to do alignment research.
AI that doesn't have these problems seems to be the already dangerous variety (e.g. AI that can infer our goal and optimize for it).

I'd be more optimistic if I thought we were at/close-to a stage where we know the crisp high-level problems we need to solve, and could ask AI assistants for solutions to those crisp problems.

That said, I certainly think it's worth thinking about how we might get to a "virtuous cycle of automating alignment work". It just seems to me that it's bottlenecked on the same thing as our direct attempts to tackle the problem: our depth of understanding.

Thanks for the list.

On (7), I'm not clear how this is useful unless we assume that debaters aren't deceptively aligned. (if debaters are deceptively aligned, they lose the incentive to expose internal weaknesses in their opponent, so the capability to do so doesn't achieve much)

In principle, debate on weights could add something over interpretability tools alone, since optimal in-the-spirit-of-things debate play would involve building any useful interpretability tools which didn't already exist. (optimal not-in-the-spirit-of-things play likely involves persuasion/manipulation)

However, assuming debaters are not deceptively aligned seems to assume away the hard part of the problem.
Do you expect that there's a way to make use of this approach before deceptive alignment shows up, or is there something I'm missing?

Load More