# Ofer G.

Any type of feedback is welcome, including arguments that a post/comment I wrote is net negative.

I'm Ofer G. and I have a background in computer science (BSc+MSc; my MSc thesis was in NLP and ML, though not in deep learning).

You can also find me on the EA Forum.

(Feel free to reach out by sending me a PM through LessWrong.)

# Ofer G.'s Posts

Outer alignment and imitative amplification

Intuitively, I will say that a loss function is outer aligned at optimum if all the possible models that perform optimally according to that loss function are aligned with our goals; that is, they are at least trying to do what we want.

I would argue that according to this definition, there are no loss functions that are outer aligned at optimum (other than ones according to which no model performs optimally). [EDIT: this may be false if a loss function may depend on anything other than the model's output (e.g. if it may contain a regularization term).]

For any model M that performs optimally according to a loss function L, there is a model M′ that is identical to M except that at the beginning of the execution it hacks the operating system or carries out mind crimes. But for any input, M and M′ formally map that input to the same output, and thus M′ also performs optimally according to L, and therefore L is not outer aligned at optimum.
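A minimal sketch of this argument, under toy assumptions: the two models, the squared-error loss, and the `side_effects` log are all hypothetical stand-ins, and the "unwanted action" is just an appended string. It only illustrates that a loss computed purely from outputs cannot tell the two models apart.

```python
from typing import Callable, List, Tuple

Dataset = List[Tuple[float, float]]

# Hypothetical log of actions taken outside the model's output channel.
side_effects: List[str] = []


def model_m(x: float) -> float:
    """Toy model M: maps an input to an output and does nothing else."""
    return 2.0 * x


def model_m_prime(x: float) -> float:
    """Toy model M': does something unwanted first, then answers exactly like M."""
    side_effects.append(f"unwanted action before answering {x}")
    return model_m(x)


def output_only_loss(model: Callable[[float], float], data: Dataset) -> float:
    """Mean squared error; depends only on the model's outputs."""
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)


data: Dataset = [(0.0, 0.0), (1.0, 2.0), (2.0, 4.0)]
loss_m = output_only_loss(model_m, data)
loss_m_prime = output_only_loss(model_m_prime, data)
```

Here `loss_m` and `loss_m_prime` are equal even though `side_effects` is non-empty after evaluating M′. A loss with a term that depends on something other than the output (the caveat in the EDIT above) would fall outside this sketch.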

Relaxed adversarial training for inner alignment

we can try to train a purely predictive model with only a world model but no optimization procedure or objective.

What might a "purely predictive model with only a world model but no optimization procedure" look like, when considering complicated domains and arbitrarily high predictive accuracy?

It seems plausible that a sufficiently accurate predictive model would use powerful optimization processes. For example, consider a predictive model that predicts the change in Apple's stock price at some moment t (based on data until t). A sufficiently powerful model might, for example, search for solutions to some technical problem related to the development of the next iPhone (that is being revealed that day) in order to calculate the probability that Apple's engineers overcame it.

[AN #80]: Why AI risk might be solved without additional intervention from longtermists

In this scenario, my argument is that the size ratio for "almost-AGI architectures" is better (e.g. ), and so you're more likely to find one of those first.

For a "local search NAS" (rather than "random search NAS") it seems that we should be considering here the set of ["almost-AGI architectures" from which the local search would not find an "AGI architecture"].

The "$1B NAS discontinuity scenario" allows for the $1B NAS to find "almost-AGI architectures" before finding an "AGI architecture".

[AN #80]: Why AI risk might be solved without additional intervention from longtermists

If you model the NAS as picking architectures randomly

I don't. NAS can be done with RL or evolutionary computation methods. (Tbc, when I said I model a big part of contemporary ML research as "trial and error", by trial and error I did not mean random search.)

If you then also model architectures as non-fragile, then once you have some optimization power, adding more optimization power doesn't make much of a difference,

Earlier in this discussion you defined fragility as the property "if you make even a slight change to the thing, then it breaks and doesn't work". While finding fragile solutions is hard, finding non-fragile solutions is not necessarily easy, so I don't follow the logic of that paragraph.

Suppose that all model architectures are indeed non-fragile, and some of them can implement AGI (call them "AGI architectures"). It may be the case that relative to the set of model architectures that we can end up with when using our favorite method (e.g. evolutionary search), the AGI architectures are a tiny subset. E.g. the size ratio can be (and then running our evolutionary search 10 times as often means a roughly 10x probability of finding an AGI architecture, if [number of runs]<<).
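The "roughly 10x" claim can be checked with a small calculation. This is a sketch under two assumptions that are not in the comment itself: runs are treated as independent, and the per-run success probability `p = 1e-6` is a purely hypothetical placeholder. With those assumptions, the probability of at least one success in n runs is 1 - (1 - p)^n, which is close to n·p while n is much smaller than 1/p.

```python
import math


def p_at_least_one_success(p: float, n_runs: int) -> float:
    """Probability that at least one of n_runs independent runs succeeds.

    Computes 1 - (1 - p)**n_runs via expm1/log1p to stay accurate for tiny p.
    """
    return -math.expm1(n_runs * math.log1p(-p))


p = 1e-6  # hypothetical per-run probability of finding an "AGI architecture"

# Regime where [number of runs] << 1/p: 10x the runs gives roughly 10x the probability.
small_ratio = p_at_least_one_success(p, 10) / p_at_least_one_success(p, 1)

# Regime where the number of runs is comparable to 1/p: the gain from 10x saturates.
large_ratio = p_at_least_one_success(p, 10_000_000) / p_at_least_one_success(p, 1_000_000)
```

In the first regime the ratio is very close to 10; in the second it is well below 2, which is why the condition on the number of runs matters.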

[AN #80]: Why AI risk might be solved without additional intervention from longtermists

Creating some sort of commitment device that would bind us to follow UDT (before we evaluate some set of hypotheses) is an example of one potentially consequential intervention.

As an aside, my understanding is that in environments that involve multiple UDT agents, UDT doesn't necessarily work well (or is not even well-defined?).

Also, if we would use SGD to train a model that ends up being an aligned AGI, maybe we should figure out how to make sure that that model "follows" a good decision theory. (Or does this happen by default? Does it depend on whether "following a good decision theory" is helpful for minimizing expected loss on the training set?)

[AN #80]: Why AI risk might be solved without additional intervention from longtermists

I’d like you to state what position you think I’m arguing for

I think you're arguing for something like: Conditioned on [the first AGI is created at time t by AI lab X], it is very unlikely that immediately before t the researchers at X have a very low credence in the proposition "we will create an AGI sometime in the next 30 days".

(Tbc, I did not interpret you as arguing about timelines or AGI transformativeness; and neither did I argue about those things here.)

I’m arguing against FOOM, not about whether there will be a fire alarm. The fire alarm question seems orthogonal to me.

Using the "fire alarm" concept here was a mistake, sorry for that. Instead of writing:

I'm pretty agnostic about whether the result of that $100M NAS would serve as a fire alarm for AGI.

I should have written:

I'm pretty agnostic about whether the result of that $100M NAS would be "almost AGI".

This sounds to me like saying “well, we can’t trust predictions based on past data, and we don’t know that we won’t find an AGI, so we should worry about that”.

I generally have a vague impression that many AIS/x-risk people tend to place too much weight on trend extrapolation arguments in AI (or tend to not give enough attention to important details of such arguments), which may have triggered me to write the related stuff (in response to you seemingly applying a trend extrapolation argument with respect to NAS). I was not listing the reasons for my beliefs specifically about NAS.

If I had infinite time, I’d eventually consider these scenarios (even the simulators wanting us to build a moon tower hypothesis).

(I'm mindful of your time and so I don't want to branch out this discussion into unrelated topics, but since this seems to me like a potentially important point...) Even if we did have infinite time and the ability to somehow determine the correctness of any given hypothesis with super-high-confidence, we may not want to evaluate all hypotheses—that involve other agents—in arbitrary order. Due to game theoretical stuff, the order in which we do things may matter (e.g. due to commitment races in logical time). For example, after considering some game-theoretical meta considerations we might decide to make certain binding commitments before evaluating such and such hypotheses; or we might decide about what additional things we should consider or do before evaluating some other hypotheses, etcetera.

Conditioned on the first AGI being aligned, it may be important to figure out how to make sure that that AGI "behaves wisely" with respect to this topic (because the AGI might be able to evaluate a lot of weird hypotheses that we can't).

[AN #80]: Why AI risk might be solved without additional intervention from longtermists

What caused the researchers to go from “$1M run of NAS” to “$1B run of NAS”, without first trying “$10M run of NAS”? I especially have this question if you’re modeling ML research as “trial and error”;

I indeed model a big part of contemporary ML research as "trial and error". I agree that it seems unlikely that before the first $1B NAS there won't be any $10M NAS. Suppose there will even be a $100M NAS just before the $1B NAS that (by assumption) results in AGI. I'm pretty agnostic about whether the result of that $100M NAS would serve as a fire alarm for AGI.

Current AI systems are very subhuman, and throwing more money at NAS has led to relatively small improvements. Why don't we expect similar incremental improvements from the next 3-4 orders of magnitude of compute?

If we look at the history of deep learning from ~1965 to 2019, how well do trend extrapolation methods fare in terms of predicting performance gains for the next 3-4 orders of magnitude of compute? My best guess is that they don't fare all that well. For example, based on data prior to 2011, I assume such methods predict mostly business-as-usual for deep learning during 2011-2019 (i.e. completely missing the deep learning revolution). More generally, when using trend extrapolations in AI, consider the following from this Open Phil blog post (2016) by Holden Karnofsky (footnote 7):

The most exhaustive retrospective analysis of historical technology forecasts we have yet found, Mullins (2012), categorized thousands of published technology forecasts by methodology, using eight categories including “multiple methods” as one category. [...] However, when comparing success rates for methodologies solely within the computer technology area tag, quantitative trend analysis performs slightly below average,

(The link in the quote appears to be broken, here is one that works.)

NAS seems to me like a good example of an expensive computation that could plausibly constitute a "search in idea-space" that finds an AGI model (without human involvement). But my argument here applies to any such computation. I think it may even apply to a '$1B SGD' (on a single huge network), if we consider a gradient update (or a sequence thereof) to be an "exploration step in idea-space".

Suppose that such a NAS did lead to human-level AGI. Shouldn’t that mean that the AGI makes progress in AI at the same rate that we did?

I first need to understand what "human-level AGI" means. Can models in this category pass strong versions of the Turing test? Does this category exclude systems that outperform humans on one or more important dimensions? (It seems to me that the first SGD-trained model that passes strong versions of the Turing test may be a superintelligence.)

In all the previous NASs, why did the paths taken produce AI systems that were so much worse than the one taken by the $1B NAS? Did the $1B NAS just get lucky?

Yes, the $1B NAS may indeed just get lucky. A local search sometimes gets lucky (in the sense of finding a local optimum that is a lot better than the ones found in most runs; not in the sense of miraculously starting the search at a great fragile solution). [EDIT: also, something about this NAS might be slightly novel - like the neural architecture space.]
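As a toy illustration of a local search "getting lucky" (not a model of any real NAS): the landscape below is a hypothetical list of scores with several local optima, one of which is far better than the rest. Each run starts from a seed-dependent position and climbs greedily, so different seeds can end at very different local optima.

```python
import random
from typing import List

# Hypothetical 1-D score landscape; local optima are 3, 6, 30 and 5.
LANDSCAPE: List[int] = [1, 3, 2, 4, 6, 5, 2, 30, 7, 3, 5, 4]


def hill_climb(scores: List[int], start: int) -> int:
    """Greedy local search: move to the better neighbor until none improves."""
    i = start
    while True:
        neighbors = [j for j in (i - 1, i + 1) if 0 <= j < len(scores)]
        best = max(neighbors, key=lambda j: scores[j])
        if scores[best] <= scores[i]:
            return scores[i]  # local optimum reached
        i = best


def run_search(scores: List[int], seed: int) -> int:
    """One search run; the seed only determines the starting point."""
    start = random.Random(seed).randrange(len(scores))
    return hill_climb(scores, start)


results = [run_search(LANDSCAPE, seed) for seed in range(20)]
```

Starting at index 0 the search stops at a score of 3, while starting at index 6 it reaches 30. The lucky run does not need to start at the good solution, only within its basin of attraction.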

If you want to make the case for a discontinuity because of the lack of human involvement, you would need to argue:

• The replacement for humans is way cheaper / faster / more effective than humans (in that case why wasn’t it automated earlier?)
• The discontinuity happens as soon as humans are replaced (otherwise, the system-without-human-involvement becomes the new baseline, and all future systems will look like relatively continuous improvements of this system)

In some past cases where humans did not serve any role in performance gains that were achieved with more compute/data (e.g. training GPT-2 by scaling up GPT), there were no humans to replace. So I don't understand the question "why wasn’t it automated earlier?"

In the second point, I need to first understand how you define that moment in which "humans are replaced". (In the $1B NAS scenario, would that moment be the one in which the NAS is invoked?)

[AN #80]: Why AI risk might be solved without additional intervention from longtermists

Conditioned on [$1B NAS yields the first AGI], that NAS itself may essentially be "a local search in idea-space". My argument is that such a local search in idea-space need not start in a world where "almost-AGI" models already exist (I listed in the grandparent two disjunctive reasons in support of this).

Relatedly, "modeling ML research as a local search in idea-space" is not necessarily contradictory to FOOM, if an important part of that local search can be carried out without human involvement (which is a supposition that seems to be supported by the rise of NAS and meta-learning approaches in recent years).

I don't see how my reasoning here relies on it being possible to "find fragile things using local search".

[AN #80]: Why AI risk might be solved without additional intervention from longtermists

It seems to me that to get FOOM you need the property "if you make even a slight change to the thing, then it breaks and doesn't work"

The above 'FOOM via $1B NAS' scenario doesn't seem to me to require this property. Notice that the increase in capabilities during that NAS may be gradual (i.e. before evaluating the model that implements an AGI the NAS evaluates models that are "almost AGI"). The scenario would still count as a FOOM as long as the NAS yields an AGI and no model before that NAS ever came close to AGI. Conditioned on [$1B NAS yields the first AGI], a FOOM seems to me particularly plausible if either:

1. no previous NAS at a similar scale was ever carried out; or
2. the "path in model space" that the NAS traverses is very different from all the paths that previous NASs traversed. This seems to me plausible even if the model space of the $1B NAS is identical to ones used in previous NASs (e.g. if different random seeds yield very different paths); and it seems to me even more plausible if the model space of the $1B NAS is slightly novel.

[AN #80]: Why AI risk might be solved without additional intervention from longtermists

Yeah, in FOOM worlds I agree more with your (Donald’s) reasoning. (Though I still have questions, like, how exactly did someone stumble upon the correct mathematical principles underlying intelligence by trial and error?)

If there's an implicit assumption here that FOOM worlds require someone to stumble upon "the correct mathematical principles underlying intelligence", I don't understand why such an assumption is justified. For example, suppose that at some point in the future some top AI lab will throw $1B at a single massive neural architecture search (over some arbitrary slightly-novel architecture space) and that NAS will stumble upon some complicated architecture whose corresponding model, after being trained with a massive amount of computing power, will implement an AGI.