Theoretical Computer Science MSc student at the University of [Redacted] in the United Kingdom.

I'm an aspiring alignment theorist; my research vibes are descriptive formal theories of intelligent systems (and their safety properties) with a bias towards constructive theories.

I think it's important that our theories of intelligent systems remain rooted in the characteristics of real world intelligent systems; we cannot develop adequate theory from the null string as input.

i.e. if each forecaster has a first-order belief $f (w) \in B (S)$, and $w \in B (S)$ is your second-order belief about which forecaster is correct, then $(w ⊳_{W S} f) \in B (S)$ should be your first-order belief about the election.

I think there might be a typo here. Did you instead mean to write "$w \in B (W)$" for the second-order beliefs about the forecasters?
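To make the types concrete, here is a minimal sketch of the composition, assuming that $W$ is a finite set of forecasters, $S$ a finite set of election outcomes, $B(X)$ the probability distributions over $X$, and that $⊳$ is ordinary marginalisation over forecasters. The quoted post may intend something more general; the names `compose`, `alice`, `bob` and all the numbers are hypothetical.

```python
from typing import Dict

# A belief over a finite set: element -> probability.
Belief = Dict[str, float]


def compose(w: Belief, f: Dict[str, Belief]) -> Belief:
    """Mix each forecaster's belief over S, weighted by the belief w over W."""
    result: Belief = {}
    for forecaster, weight in w.items():
        for outcome, p in f[forecaster].items():
            result[outcome] = result.get(outcome, 0.0) + weight * p
    return result


# Second-order belief: a distribution over forecasters, i.e. an element of B(W).
w = {"alice": 0.75, "bob": 0.25}

# Each forecaster's first-order belief: a distribution over outcomes, in B(S).
f = {
    "alice": {"red": 0.5, "blue": 0.5},
    "bob": {"red": 1.0, "blue": 0.0},
}

first_order = compose(w, f)  # a distribution over outcomes, in B(S)
```

On this reading the composition only type-checks if `w` is keyed by forecasters, which is why $w \in B(W)$ rather than $w \in B(S)$ looks like the intended statement.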

We aren’t offering these criteria as necessary for “knowledge”—we could imagine a breaker proposing a counterexample where all of these properties are satisfied but where intuitively M didn’t really know that A′ was a better answer. In that case the builder will try to make a convincing argument to that effect.

The bolded word ("necessary") should be "sufficient".

In fact, I'm pretty sure that's how humans work most of the time. We use the general-intelligence machinery to "steer" ourselves at a high level, and most of the time, we operate on autopilot.

Yeah, I agree with this. But I don't think the human system aggregates into any kind of coherent total optimiser. Humans don't have an objective function (not even approximately?).

A human is not well modelled as a wrapper mind; do you disagree?

Thus, any greedy optimization algorithm would convergently shape its agent to not only pursue $R$, but to maximize for $R$'s pursuit, at the expense of everything else.

Conditional on:

1. Such a system being reachable/accessible to our local/greedy optimisation process
2. Such a system being actually performant according to the selection metric of our optimisation process

I'm pretty sceptical of #2. I'm sceptical that systems that perform inference via direct optimisation over their outputs are competitive in rich/complex environments.

Such optimisation is very computationally intensive compared to executing learned heuristics, and it seems likely that the selection process would have access to much more compute than the selected system.

Some Nuance on Learned Optimisation in the Real World

I think mesa-optimisers should not be thought of as learned optimisers, but as systems that employ optimisation/search as part of their inference process.

The simplest case is that pure optimisation during inference is computationally intractable in rich environments (e.g. the real world), so systems (e.g. humans) operating in the real world do not perform inference solely by directly optimising over outputs.

Rather, optimisation is sometimes employed as one part of their inference strategy. That is, systems only optimise their outputs part of the time (other [most?] times they execute learned heuristics^[1]).

Furthermore, learned optimisation in the real world seems to be more "local"/task specific (i.e. I make plans to achieve local, particular objectives [e.g. planning a trip from London to Edinburgh]. I have no global objective that I am consistently optimising for over the duration of my lifetime).

I think this is basically true for any feasible real world intelligent system^[2]. So learned optimisation in the real world is:

Partial^[3]
Local
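As a toy illustration of what "partial" and "local" could look like in a decision procedure, here is a minimal sketch. Everything in it is a hypothetical stand-in (the function `partial_optimiser`, the heuristic table, the action scores); it is not a claim about how any real system is implemented.

```python
from typing import Callable, Dict, Iterable, Optional


def partial_optimiser(
    situation: str,
    heuristics: Dict[str, str],
    actions: Iterable[str],
    local_objective: Callable[[str, str], float],
) -> str:
    """Use a learned heuristic when one applies; otherwise search explicitly."""
    cached: Optional[str] = heuristics.get(situation)
    if cached is not None:
        # Partial: most of the time no search happens at all.
        return cached
    # Local: the search scores actions against an objective for this
    # situation only, not against a single lifetime-wide objective.
    return max(actions, key=lambda a: local_objective(situation, a))


# Hypothetical learned heuristics and a hypothetical task-specific objective.
heuristics = {"greeting": "wave"}
trip_scores = {"train": 0.8, "plane": 0.6, "walk": 0.1}


def trip_objective(situation: str, action: str) -> float:
    return trip_scores[action]


habit = partial_optimiser("greeting", heuristics, trip_scores, trip_objective)
plan = partial_optimiser("London to Edinburgh", heuristics, trip_scores, trip_objective)
```

The search branch runs only for the unfamiliar situation, and the objective it maximises is scoped to that one task.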

Do these nuances of real world mesa-optimisers change the nature of risks from learned optimisation?

Cc: @evhub, @beren, @TurnTrout, @Quintin Pope.

^[1] Though optimisation (e.g. planning) might sometimes be employed to figure out which heuristic to deploy at a particular time.
^[2] For roughly the reasons why I think fixed immutable terminal goals are antinatural, see e.g.: "Is "Strong Coherence" Anti-Natural?"
Alternatively, I believe that real world systems learn contextual heuristics (downstream of historical selection) that influence decision making ("values") and not fixed/immutable terminal "goals". See also: "why assume AGIs will optimize for fixed goals?"
^[3] This seems equivalent to Beren's concept of "hybrid optimisation"; I mostly use "partial optimisation", because it feels closer to the ontology of the Risks From Learned Optimisation paper. As they define optimisation, I think learned algorithms operating in the real world just will not be consistently optimising for any global objective.

GPTs are not Imitators, nor Simulators, but Predictors.

I think an issue is that GPT is used to mean two things:

A predictive model whose output is a probability distribution over token space given its prompt and context
Any particular techniques/strategies for sampling from the predictive model to generate responses/completions for a given prompt.

[See the Appendix]

The latter kind of GPT is what I think is rightly called a "Simulator".

From @janus' Simulators (italicised by me):

I use the generic term “simulator” to refer to models trained with predictive loss on a self-supervised dataset, invariant to architecture or data type (natural language, code, pixels, game states, etc). The outer objective of self-supervised learning is Bayes-optimal conditional inference over the prior of the training distribution, which I call the simulation objective, because a conditional model can be used to simulate rollouts which probabilistically obey its learned distribution by iteratively sampling from its posterior (predictions) and updating the condition (prompt). Analogously, a predictive model of physics can be used to compute rollouts of phenomena in simulation. A goal-directed agent which evolves according to physics can be simulated by the physics rule parameterized by an initial state, but the same rule could also propagate agents with different values, or non-agentic phenomena like rocks. This ontological distinction between simulator (rule) and simulacra (phenomena) applies directly to generative models like GPT.

It is exactly because of the existence of GPT the predictive model that sampling from GPT is considered simulation; I don't think there's any real tension in the ontology here.

Appendix

Credit for highlighting this distinction belongs to @Cleo Nardo:

Remark 2: "GPT" is ambiguous
We need to establish a clear conceptual distinction between two entities often referred to as "GPT" —
The autoregressive language model which maps a prompt $x \in T^{k}$ to a distribution over tokens $μ (\cdot | x) \in Δ (T)$ .
The dynamic system that emerges from stochastically generating tokens using $μ$ while also deleting the start token
Don't conflate them! These two entities are distinct and must be treated as such. I've started calling the first entity "Static GPT" and the second entity "Dynamic GPT", but I'm open to alternative naming suggestions. It is crucial to distinguish these two entities clearly in our minds because they differ in two significant ways: capabilities and safety.
Capabilities:
Static GPT has limited capabilities since it consists of a single forward pass through a neural network and is only capable of computing functions that are O(1). In contrast, Dynamic GPT is practically Turing-complete, making it capable of computing a vast range of functions.
Safety:
If mechanistic interpretability is successful, then it might soon render Static GPT entirely predictable, explainable, controllable, and interpretable. However, this would not automatically extend to Dynamic GPT. This is because Static GPT describes the time evolution of Dynamic GPT, but even simple rules can produce highly complex systems.
In my opinion, Static GPT is unlikely to possess agency, but Dynamic GPT has a higher likelihood of being agentic. An upcoming article will elaborate further on this point.
This remark is the most critical point in this article. While Static GPT and Dynamic GPT may seem similar, they are entirely different beasts.

To summarise:

Static GPT: GPT as predictor
Dynamic GPT: GPT as simulator
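The distinction can be sketched in a few lines of code. This is a toy under stated assumptions: the "model" is a hypothetical lookup table over three tokens rather than a neural network, the context window has length one, and the names `TABLE`, `static_gpt` and `dynamic_gpt` are made up for illustration.

```python
import random
from typing import Dict, List

Distribution = Dict[str, float]

# Hypothetical toy predictor: the next-token distribution depends only on
# the last token in the window.
TABLE: Dict[str, Distribution] = {
    "a": {"b": 1.0},
    "b": {"a": 0.5, "c": 0.5},
    "c": {"c": 1.0},
}


def static_gpt(window: List[str]) -> Distribution:
    """Predictor: a single call maps a prompt to a distribution over tokens."""
    return TABLE[window[-1]]


def dynamic_gpt(prompt: List[str], steps: int, rng: random.Random) -> List[str]:
    """Simulator: repeatedly sample from the predictor and update the prompt."""
    window = list(prompt)      # fixed-length context fed to the predictor
    transcript = list(prompt)  # everything generated so far
    for _ in range(steps):
        dist = static_gpt(window)
        tokens = list(dist)
        nxt = rng.choices(tokens, weights=[dist[t] for t in tokens], k=1)[0]
        window = window[1:] + [nxt]  # append the sample, drop the oldest token
        transcript.append(nxt)
    return transcript


rollout = dynamic_gpt(["a"], steps=5, rng=random.Random(0))
```

`static_gpt` does a fixed amount of work per call, while `dynamic_gpt` is the rollout process obtained by iterating it; the same predictor is the rule, and each rollout is one of the phenomena it can produce.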

What do you think MIRI is currently doing wrong/what should they change about their approach/general strategy?

To be clear, I enjoyed the post and am looking forward to this sequence. A point of disagreement though:

One feasible-seeming approach is "accelerating alignment," which involves leveraging AI as it is developed to help solve the challenging problems of alignment. This is not a novel idea, as it's related to previously suggested concepts such as seed AI, nanny AI, and iterated amplification and distillation (IDA).

I disagree that using AI to accelerate alignment research is particularly load bearing for the development of a practical alignment craft or really necessary.

To be clear, I think we should do it (I have used ChatGPT to aid some of my writing and plan to use it more), but only to the same extent that we use Google/Wikipedia/word processors to do research in general. That is, I don't expect AI assistance to be load bearing enough for alignment in general to merit special distinction.

To the extent that one does expect AI to be particularly load bearing for progress on developing useful alignment craft, I think they're engaging in wishful thinking and snorting too much hopium. That sounds like shying away from the hard problems of alignment. John Wentworth has said that we shouldn't do that:

Far and away the most common failure mode among self-identifying alignment researchers is to look for Clever Ways To Avoid Doing Hard Things (or Clever Reasons To Ignore The Hard Things), rather than just Directly Tackling The Hard Things.

The most common pattern along these lines is to propose outsourcing the Hard Parts to some future AI, and "just" try to align that AI without understanding the Hard Parts of alignment ourselves. ... You can save yourself several years of time and effort by actively trying to identify the Hard Parts and focus on them, rather than avoid them. Otherwise, you'll end up burning several years on ideas which don't actually leave the field better off. That's one of the big problems with trying to circumvent the Hard Parts: when the circumvention inevitably fails, we are still no closer to solving the Hard Parts. (It has been observed both that alignment researchers mostly seem to not be tackling the Hard Parts, and that alignment research mostly doesn't seem to build on itself; I claim that the latter is a result of the former.)

Mostly, I think the hard parts are things like "understand agency in general better" and "understand what's going on inside the magic black boxes". If your response to such things is "sounds hard, man", then you have successfully identified (some of) the Hard Parts.

I don't think this point should be on the list (or at least, I don't think I endorse the position implied by explicitly placing the point on the list).

I disagree that intelligence and rationality are more fundamental than physics; the territory itself is physics, and that is all that is really there. Everything else (including the body of our phone knowledge) are models for navigating that territory.

Turing formalised computation and established the limits of computation given certain assumptions. However, those limits only apply as long as the assumptions are true. Turing did not prove that no mechanical system is superior to a Universal Turing Machine, and weird physics may enable super-Turing computation.

The point I was making is that our models are only as good as their correlation with the territory. The abstract models we have aren't part of the territory itself.

AI ALIGNMENT FORUM
AF
