Ah, well that's mildly discouraging (encouraging that you've made this scale of effort; discouraging in what it says about the difficulty of progress).

I'd still be interested to know what you'd see as a promising approach here - if such crux resolution were the only problem, and you were able to coordinate things as you wished, what would be a (relatively) promising strategy?
But perhaps you're already pursuing it? I.e. if something like [everyone works on what they see as key problems, increases their own understanding and shares insights] seems most likely to open up paths to progress.

Assuming review wouldn't do much to help on this, have you thought about distributed mechanisms that might? E.g. mapping out core cruxes and linking all available discussions where they seem a fundamental issue (potentially after holding/writing-up a bunch more MIRI Dialogues style interactions [which needn't all involve MIRI]).
Does this kind of thing seem likely to be of little value - e.g. because it ends up clearly highlighting where different intuitions show up, but shedding little light on their roots or potential justification?

I suppose I'd like to know what shape of evidence seems most likely to lead to progress - and whether much/any of it might be unearthed through clarification/distillation/mapping of existing ideas. (where the mapping doesn't require connections that only people with the deepest models will find)

Oh sure, I certainly don't mean to imply that there's been little effort in absolute terms - I'm very encouraged by the MIRI dialogues, and assume there are a bunch of behind-the-scenes conversations going on.
I also assume that everyone is doing what seems best in good faith, and has potentially high-value demands on their time.

However, given the stakes, I think it's a time for extraordinary efforts - and so I worry that [this isn't the kind of thing that is usually done] is doing too much work.

I think the "principled epistemics and EV calculations" could perfectly well be the explanation, if it were the case that most researchers put around a 1% chance on [Eliezer/Nate/John... are largely correct on the cruxy stuff].

That's not the sense I get - more that many put the odds somewhere around 5% to 25%, but don't believe the arguments are sufficiently crisp to allow productive engagement.

If I'm correct on that (and I may well not be), it does not seem a principled justification for the status quo. Granted the right course isn't obvious - we'd need whoever's on the other side of the double-cruxing to really know their stuff. Perhaps Paul's/Rohin's... time is too valuable for a 6-month cost to pay off. (the more realistic version likely involves not-quite-so-valuable people from each 'side' doing it)

As for "done a thing a bunch and it doesn't seem to be working", what's the prior on [two experts in a field from very different schools of thought talk for about a week and try to reach agreement]? I'm no expert, but I strongly expect that not to work in most cases.

To have a realistic expectation of its working, you'd need to be doing the kinds of things that are highly non-standard. Experts having some discussions over a week is standard. Making it your one focus for 6 months is not. (frankly, I'd be over the moon for the one-month version [but again, for all I know this may have been tried])

for that to fix these problems the reviewers would have to be more epistemically competent than the post authors

I think this is an overstatement. They'd need to notice issues the post authors missed. That doesn't require greater epistemic competence: they need only tend to make different mistakes, not fewer mistakes.

Certainly there's a point below which the signal-to-noise ratio is too low. I agree that high reviewer quality is important.

On the "same old cruxes and disagreements" I imagine you're right - but to me that suggests we need a more effective mechanism to clarify/resolve them (I think you're correct in implying that review is not that mechanism - I don't think academic review achieves this either). It's otherwise unsurprising that they bubble up everywhere.

I don't have any clear sense of the degree of time and effort that has gone into clarifying/resolving such cruxes, and I'm sure it tends to be a frustrating process. However, my guess is that the answer is "nowhere close to enough". Unless researchers have very high confidence that they're on the right side of such disagreements, it seems appropriate to me to spend ~6 months focusing purely on this (of course this would require coordination, and presumably seems wildly impractical).

My sense is that nothing on this scale happens (right?), and that the reasons have more to do with (entirely understandable) impracticality, coordination difficulties and frustration, than with principled epistemics and EV calculations.
But perhaps I'm way off? My apologies if this is one of the same old cruxes and disagreements :).

Glad to see you're working on this. It seems even more clearly correct (the goal, at least :)) for not-so-short timelines. Less clear how best to go about it, but I suppose that's rather the point!

A few thoughts:

  1. I expect it's unusual that [replace methodology-1 with methodology-2] will be a Pareto improvement: other aspects of a researcher's work will tend to have adapted to fit methodology-1. So I don't think the creation of some initial friction is a bad sign. (this also mirrors therapy - there's usually a [take things apart and better understand them] phase before any [put things back together in a more adaptive pattern] phase)
    1. It might be useful to predict this kind of thing ahead of time, to develop a sense of when to expect specific side-effects (and/or predictably unpredictable side effects).
  2. I do think it's worth interviewing at least a few carefully selected non-alignment researchers. I basically agree with your alignment-is-harder case. However, it also seems most important to be aware of things the field is just completely missing.
    1. In particular, this may be useful where some combination of cached methodologies is a local maximum for some context. Knowing something about other hills seems useful here.
      1. I don't expect it'd work to import full sets of methodologies from other fields, but I do expect there are useful bits-of-information to be had.
    2. Similarly, if thinking about some methodology x that most alignment researchers currently use, it might be useful to find and interview other researchers that don't use x. Are they achieving [things-x-produces] in other ways? What other aspects of their methodology are missing/different?
      1. This might hint both at how a methodology change may impact alignment researchers, and how any negative impact might be mitigated.
  3. Worth considering that there's less of a risk in experimenting (kindly, that is) on relative newcomers than on experienced researchers. It's a good idea to get a clear understanding of the existing process of experienced researchers. However, once we're in [try this and see what happens] mode there's much less downside with new people - even abject failure is likely to be informative, and the downside in counterfactual object-level research lost is much smaller in expectation.

[apologies on slowness - I got distracted]
Granted on type hierarchy. However, I don't think all instances of GPT need to look like they inherit from the same superclass. Perhaps there's such a superclass, but we shouldn't assume it.

I think most of my worry comes down to potential reasoning along the lines of:

  • GPT is a simulator;
  • Simulators have property p;
  • Therefore GPT has property p;

When what I think is justified is:

  • GPT instances are usually usefully thought of as simulators;
  • Simulators have property p;
  • We should suspect that a given instance of GPT will have property p, and confirm/falsify this;

I don't claim you're advocating the former: I'm claiming that people are likely to use the former if "GPT is a simulator" is something they believe. (this is what I mean by motte-and-baileying into trouble)

If you don't mean to imply anything mechanistic by "simulator", then I may have misunderstood you - but at that point "GPT is a simulator" doesn't seem to get us very far.

If it's deceptively aligned, it's not a simulator in an important sense because its behavior is not sufficient to characterize very important aspects of its nature (and its behavior may be expected to diverge from simulation in the future).

It's true that the distinction between inner misalignment and robustness/generalization failures, and thus the distinction between flawed/biased/misgeneralizing simulators and pretend-simulators, is unclear, and seems like an important thing to become less confused about.

I think this is the fundamental issue.
Deceptive alignment aside, what else qualifies as "an important aspect of its nature"?
Which aspects disqualify a model as a simulator?
Which aspects count as inner misalignment?

To be clear on [x is a simulator (up to inner misalignment)], I need to know:

  1. What is implied mechanistically (if anything) by "x is a simulator".
  2. What is ruled out by "(up to inner misalignment)".

I'd be wary of assuming there's any neat flawed-simulator/pretend-simulator distinction to be discovered. (but probably you don't mean to imply this?)
I'm all for deconfusion, but it's possible there's no joint at which to carve here.

(my guess would be that we're sometimes confused by the hidden assumption:
[a priori unlikely systematically misleading situation => intent to mislead]
whereas we should be thinking more like
[a priori unlikely systematically misleading situation => selection pressure towards things that mislead us]

I.e. looking for deception in something that systematically misleads us is like looking for the generator for beauty in something beautiful. Beauty and [systematic misleading] are relations between ourselves and the object. Selection pressure towards this relation may or may not originate in the object.)

Can you give an example of what it would mean for a GPT not to be a simulator, or to not be a simulator in some sense?

Here I meant to point to the lack of clarity around what counts as inner misalignment, and what GPT's being a simulator would imply mechanistically (if anything).

There must be some evidence that the initial appearance of alignment was due to the model actively trying to appear aligned only in the service of some ulterior goal.

"trying to appear aligned" seems imprecise to me - unless you mean to be more specific than the RFLO description. (see footnote 7: in general, there's no requirement for modelling of the base optimiser or oversight system; it's enough to understand the optimisation pressure and be uncertain whether it'll persist).

Are you thinking that it makes sense to agree to test for systems that are "trying to appear aligned", or would you want to include any system that is instrumentally acting such that it's unaltered by optimisation pressure?

Mostly I agree with this.
I have more thoughts, but probably better to put them in a top-level post - largely because I think this is important and would be interested to get more input on a good balance.

A few thoughts on LW endorsing invalid arguments:
I'd want to separate considerations of impact on [LW as collective epistemic process] from [LW as outreach to ML researchers]. E.g. it doesn't necessarily seem much of a problem for the former to have reliance on unstated assumptions. I wouldn't formally specify an idea before sketching it, and it's not clear to me that there's anything wrong with collective sketching (so long as we know we're sketching - and this part could certainly be improved).
I'd first want to optimize the epistemic process, and then worry about the looking foolish part. (granted that there are instrumental reasons not to look foolish)

On ML's view, are you mainly thinking of people who may do research on an important x-safety sub-problem without necessarily buying x-risk arguments? It seems unlikely to me that anyone gets persuaded of x-risk from the bottom up, whether or not the paper/post in question is rigorous - but perhaps this isn't required for a lot of useful research?

I want to maximize the bandwidth between human alignment researchers and AI tools/oracles/assistants/simulations. It is essential that these tools are developed by (or in a tight feedback loop with) actual alignment researchers doing theory work, because we want to simulate and play with thought processes and workflows that produce useful alignment ideas.

What are your thoughts on failure modes with this approach?
(please let me know if any/all of the following seems confused/vanishingly unlikely)

For example, one of the first that occurs to me is that such cyborgism is unlikely to amplify production of useful-looking alignment ideas uniformly in all directions.

Suppose that it makes things 10x faster in various directions that look promising but don't lead to solutions, and only 2x faster in directions that do lead to solutions. In principle this should be very helpful: we can allocate fewer resources to the 10x directions, leaving us more time to work on the 2x directions, and everybody wins.
In practice, I'd expect the 10x boost to:

  1. Produce unhelpful incentives for alignment researchers: work on any of the 10x directions and you'll look hugely more productive. Who will choose to work on the harder directions?
    1. Note that it won't be obvious you're going slowly because the direction is inherently harder: from the outside, heading in a difficult direction will be hard to distinguish from being ineffective (from the inside too, in fact).
    2. Same reasoning applies at every level of granularity: sub-direction choice, sub-sub-direction choice....
  2. Warp our perception of promising directions: once the 10x directions seem to be producing progress much faster, it'll be difficult not to interpret this as evidence they're more promising.
    1. Amplified assessment-of-promise seems likely to correlate unhelpfully: failing to help us notice promising directions precisely where it's least able to help us make progress.

It still seems positive-in-expectation if the boost of cyborgism isn't negatively correlated with the ground-truth usefulness of a direction - but a negative correlation here seems plausible.

Suppose that finding the truly useful directions requires patterns of thought that are rare-to-non-existent in the training set, and are hard to instill via instruction. In that case it seems likely to me that GPT will be consistently less effective in these directions (to generate these ideas / to take these steps...). Then we may be in terrible-incentive-land.
[I'm not claiming that most steps in hard directions will be hard, but that speed of progress asymptotes to progress-per-hard-step]
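
To make the bracketed note a little more concrete, here is a rough Amdahl's-law-style sketch. It is illustrative only: `overall_speedup`, `hard_fraction` and the boost values are hypothetical names and numbers of my own choosing, and it assumes a direction's work splits cleanly into easy steps the tool accelerates and hard steps it doesn't.

```python
def overall_speedup(hard_fraction: float, easy_boost: float, hard_boost: float = 1.0) -> float:
    """Overall speedup for a direction made of easy and hard steps.

    hard_fraction: share of the original (unboosted) time spent on hard steps.
    easy_boost, hard_boost: hypothetical speed multipliers supplied by the tool.
    """
    if not 0.0 <= hard_fraction <= 1.0:
        raise ValueError("hard_fraction must be in [0, 1]")
    easy_fraction = 1.0 - hard_fraction
    # Time remaining after the boost, as a fraction of the original time.
    new_time = easy_fraction / easy_boost + hard_fraction / hard_boost
    return 1.0 / new_time


if __name__ == "__main__":
    # Assumed example: 20% of the time goes on hard steps that get no boost.
    for boost in (10, 100, 1000):
        print(boost, round(overall_speedup(hard_fraction=0.2, easy_boost=boost), 2))
```

Under these assumptions the overall speedup stays below 1 / hard_fraction however large the boost on easy steps becomes, so progress ends up limited by the hard steps.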

Of course all this is hand-waving speculation.
I'd just like the designers of alignment-research boosting tools to have clear arguments that nothing of this sort is likely.

So e.g. negative impact through:

  • Boosting capabilities research.
  • Creation of undesirable incentives in alignment research.
  • Warping assessment of research directions.
  • [other stuff I haven't thought of]

Do you know of any existing discussion along these lines?

Great post. Very interesting.

However, I think that assuming there's a "true name" or "abstract type that GPT represents" is an error.

If GPT means "transformers trained on next-token prediction", then GPT's true name is just that. The character of the models produced by that training is another question - an empirical one. That character needn't be consistent (even once we exclude inner alignment failures).

Even if every GPT is a simulator in some sense, I think there's a risk of motte-and-baileying our way into trouble.

Presumably "too dismissive of speculative and conceptual research" is a direct consequence of increased emphasis on rigor. Rigor is to be preferred all else being equal, but all else is not equal.

It's not clear to me how we can encourage rigor where effective without discouraging research on areas where rigor isn't currently practical. If anyone has ideas on this, I'd be very interested.

I note that within rigorous fields, the downsides of rigor are not obvious: we can point to all the progress made; progress that wasn't made due to the neglect of conceptual/speculative research is invisible. (has the impact of various research/publication norms ever been studied?)

Further, it seems limiting only to consider [must always be rigorous (in publications)] vs [no demand for rigor]. How about [50% of your publications must be rigorous] (and no incentive to maximise %-of-rigorous-publications), or any other not-all-or-nothing approach?

I'd contrast rigor with clarity here. Clarity is almost always a plus.
I'd guess that the issue in social science fields isn't a lack of rigor, but rather of clarity. Sometimes clarity without rigor may be unlikely, e.g. where there's a lot of confusion or lack of good faith - in such cases an expectation of rigor may help. I don't think this situation is universal.

What we'd want on LW/AF is a standard of clarity.
Rigor is an often useful proxy. We should be careful when incentivizing proxies.
