Steve Byrnes

I'm Steve Byrnes, a professional physicist in the Boston area. I have a summary of my AGI safety research interests at:

Steve Byrnes's Comments

Seeking Power is Instrumentally Convergent in MDPs

This is great work, nice job!

Maybe a shot in the dark, but there might be some connection with that paper a few years back Causal Entropic Forces (more accessible summary). They define "causal path entropy" as basically the number of different paths you can go down starting from a certain point, which might be related to or the same as what you call "power". And they calculate some examples of what happens if you maximize this (in a few different contexts, all continuous not discrete), and get fun things like (what they generously call) "tool use". I'm not sure that paper really adds anything important conceptually that you don't already know, but just wanted to point that out, and PM me if you want help decoding their physics jargon. :-)
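To make the "number of different paths" idea concrete, here is a toy sketch. It is not the paper's method (their examples are continuous, as noted above); it just counts move sequences on a small discrete grid and takes the log, assuming every path is equally likely. The grid, the four-move action set, and the function names are all made up for illustration.

```python
import math

def count_paths(start, horizon, width, height):
    """Count distinct move sequences of length `horizon` that stay inside the grid."""
    moves = [(1, 0), (-1, 0), (0, 1), (0, -1)]
    counts = {start: 1}  # number of paths ending at each cell so far
    for _ in range(horizon):
        next_counts = {}
        for (x, y), n in counts.items():
            for dx, dy in moves:
                nx, ny = x + dx, y + dy
                if 0 <= nx < width and 0 <= ny < height:
                    next_counts[(nx, ny)] = next_counts.get((nx, ny), 0) + n
        counts = next_counts
    return sum(counts.values())

def path_entropy(start, horizon, width, height):
    """Log of the path count: a crude stand-in for path entropy under uniform paths."""
    return math.log(count_paths(start, horizon, width, height))

center = path_entropy((2, 2), 2, 5, 5)
corner = path_entropy((0, 0), 2, 5, 5)
print(center > corner)  # the center keeps more options open than the corner
```

In this toy, a maximizer of the path count would drift away from walls and corners, which is at least the flavor of "keeping options open".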

A list of good heuristics that the case for AI x-risk fails

Is there a better reference for "a number of experts have voiced concerns about AI x-risk"? I feel like there should be by now...

I hope someone actually answers your question, but FWIW, the Asilomar principles were signed by an impressive list of prominent AI experts. Five of the items are related to AGI and x-risk. The statements aren't really strong enough to declare that those people "voiced concerns about AI x-risk", but it's a data-point for what can be said about AI x-risk while staying firmly in the mainstream.

My experience in casual discussions is that it's enough to just name one example to make the point, and that example is of course Stuart Russell. When talking to non-ML people—who don't know the currently-famous AI people anyway—I may also mention older examples like Alan Turing, Marvin Minsky, or Norbert Wiener.

Thanks for this nice post. :-)

Self-Fulfilling Prophecies Aren't Always About Self-Awareness

This is good stuff!

...if the Predict-O-Matic knows about (or forecasts the development of) anything which can be modeled using the outside view "I'm not sure how this thing works, but its predictions always seem to come true!"

Can you walk through the argument here in more detail? I'm not sure I follow it; sorry if I'm being stupid.

I'll start: There are two identical systems, "Predict-O-Matic A" and "Predict-O-Matic B", sitting side-by-side on a table. For simplicity let's say that A knows everything about B, B knows everything about A, but A is totally oblivious to the existence of A, and B to B. Then what? What's a question you might ask it that would be problematic? Thanks in advance!

The Credit Assignment Problem

I re-read this post thinking about how and whether this applies to brains...

  • The online learning conceptual problem (as I understand your description of it) says, for example, I can never know whether it was a good idea to have read this book, because maybe it will come in handy 40 years later. Well, this seems to be "solved" in humans by exponential / hyperbolic discounting. It's not exactly episodic, but we'll more-or-less be able to retrospectively evaluate whether a cognitive process worked as desired long before death.

  • Relatedly, we seem to generally make and execute plans that are (hierarchically) laid out in time, each with a success criterion at its end, like "I'm going to walk to the store". So we get specific and timely feedback on whether that plan was successful.

  • We do in fact have a model class. It seems very rich; in terms of "grain of truth", well I'm inclined to think that nothing worth knowing is fundamentally beyond human comprehension, except for contingent reasons like memory and lifespan limitations (i.e. not because it is incompatible with the internal data structures). Maybe that's good enough?
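To put rough numbers on the discounting point in the first bullet, here is a small sketch using the standard textbook forms of exponential and hyperbolic discounting. The parameter values (`gamma=0.9`, `k=0.1`, time in years) are arbitrary assumptions picked for illustration, not estimates of anything about humans.

```python
def exponential_discount(t, gamma=0.9):
    """Weight on a payoff t time steps away under exponential discounting."""
    return gamma ** t

def hyperbolic_discount(t, k=0.1):
    """Weight on a payoff t time steps away under hyperbolic discounting."""
    return 1.0 / (1.0 + k * t)

# Compare how much a payoff "40 years later" counts relative to one now.
for years in (1, 10, 40):
    print(years, exponential_discount(years), hyperbolic_discount(years))
```

Either way, the payoff 40 years out gets a much smaller weight than near-term payoffs, so most of the credit for a decision can be assigned well before the distant consequences arrive. The hyperbolic curve has the fatter tail of the two with these parameters.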

Just some thoughts; sorry if this is irrelevant or I'm misunderstanding anything. :-)

The Credit Assignment Problem

Claim: predictive learning gets gradients "for free" ... Claim: if you're learning to act, you do not similarly get gradients "for free". You take an action, and you see results of that one action. This means you fundamentally don't know what would have happened had you taken alternate actions, which means you don't have a direction to move your policy in. You don't know whether alternatives would have been better or worse. So, rewards you observe seem like not enough to determine how you should learn.

This immediately jumped out at me as an implausible distinction because I was just reading Surfing Uncertainty which goes on endlessly about how the machinery of hierarchical predictive coding is exactly the same as the machinery of hierarchical motor control (with "priors" in the former corresponding to "priors + control-theory-setpoints" in the latter, and with "predictions about upcoming proprioceptive inputs" being identical to the muscle control outputs). Example excerpt:

the heavy lifting that is usually done by the use of efference copy, inverse models, and optimal controllers [in the models proposed by non-predictive-coding people] is now shifted [in the predictive coding paradigm] to the acquisition and use of the predictive (generative) model (i.e., the right set of prior probabilistic ‘beliefs’). This is potentially advantageous if (but only if) we can reasonably assume that these beliefs ‘emerge naturally as top-down or empirical priors during hierarchical perceptual inference’ (Friston, 2011a, p. 492). The computational burden thus shifts to the acquisition of the right set of priors (here, priors over trajectories and state transitions), that is, it shifts the burden to acquiring and tuning the generative model itself. --Surfing Uncertainty chapter 4

I'm a bit hazy on the learning mechanism for this (confusingly-named) "predictive model" (I haven't gotten around to chasing down the references) and how that relates to what you wrote... But it does sorta sound like it entails one update process rather than two...

Chris Olah’s views on AGI safety

We should be careful to separate two levels of understanding: (1) We can understand the weights and activations of a particular trained model, versus (2) We can understand why a particular choice of architecture, learning algorithm, and hyperparameters is a good (effective) choice for a given ML application.

I think that (1) is great for AGI safety, (2) does a lot for capabilities and not much for safety.

So bringing up Neural Architecture Search is not necessarily the most relevant thing, since NAS is about (2), not (1).

For my part, I'm expecting that the community will "by default" make progress on (2), such that researchers using Neural Architecture Search will naturally be outcompeted by researchers who understand why to use a certain architecture and hyperparameters. Whereas I feel like (1) is the very important thing that won't necessarily happen automatically, unless people like Chris Olah keep doing the hard work to make it a community priority.

Chris Olah’s views on AGI safety

To me, the important safety feature of "microscope AI" is that the AI is not modeling the downstream consequences of its outputs (which automatically rules out manipulation and deceit). This feature is totally incompatible with agents (you can't vacuum the floor without modeling the consequences of your motor control settings), and optional for oracles [I'm using oracles in a broad sense of systems that you use to help answer your questions, leaving aside what their exact user interface is, so microscope AI is part of that]. For example, when Eliezer thinks about oracles he is not thinking this way; instead, he's thinking of a system that deliberately chooses an output to "increase the correspondence between the user's belief about relevant consequences and reality". But there's no reason in principle that we couldn't build a system that will not apply its intelligent world-model to analyze the downstream consequences of its outputs.

I think the only way to do that is to have its user interface not be created automatically as part of the training objective, but rather build it in ourselves, separately. Then the two key questions are: What's the safe training procedure that results in an intelligent world-model, and what's the separate input-output interface that we're going to build? Both of these are open questions AFAIK. I wrote Self-Supervised Learning and AGI Safety laying out this big picture as I see it.

For the latter question, what is the user interface, "Use interpretability tools & visualizations on the world-model" seems about as good an answer as any, and I am very happy to have Chris and others trying to flesh out that vision. I hope that they don't stop at feature extraction, but also pull out the relationships (causal, compositional, etc.) that we need to do counterfactual reasoning, planning etc., and even a "search through causal pathways to get desired consequences" interface. Incidentally, the people who think that brain-computer interfaces will help with AGI safety (cf waitbutwhy) seem to be banking on something vaguely like "microscope AI", but I haven't yet found any detailed discussion along those lines.

For the first question, what is the safe training procedure that incidentally creates a world-model, contra Gurkenglas's comment here, I think it's an open question whether a safe training procedure exists. For example, unsupervised (a.k.a. "self-supervised") learning as ofer suggests seems awfully safe, but is it really? See Self-Supervised Learning and Manipulative Predictions; I half-joked there about burying the computer in an underground bunker, running self-supervised learning under homomorphic encryption, until training was complete; then cutting power, digging it out, and inspecting the world model. But even then, an ambitious misaligned system could potentially leave manipulative booby-traps on its hard drive. Gurkenglas's suggestion of telling it nothing about the universe (e.g. have it play Nomic) would make it possibly safer but dramatically less useful (it won't understand the cause of Alzheimer's etc.) And it can probably still learn quite a bit about the world by observing its own algorithm... I'm not sure, I'm still generally optimistic that a solution exists, and I hope that Gurkenglas and I and everyone else keep thinking about it. :-)

Debate on Instrumental Convergence between LeCun, Russell, Bengio, Zador, and More

Yann's core argument for why AGI safety is easy is interesting, and actually echoes ongoing AGI safety research. I'll paraphrase his list of five reasons that things will go well if we're not "ridiculously stupid":

  1. We'll give AGIs non-open-ended objectives like fetching coffee. These are task-limited and therefore there are no more instrumental subgoals after the task is complete.
  2. We will put "simple terms in the objective" to prevent obvious problems, presumably things like "don't harm people", "don't violate laws", etc.
  3. We will put in "a mechanism" to edit the objective upon observing bad behavior.
  4. We can physically destroy a computer housing AGI.
  5. We can build a second AGI whose sole purpose is to destroy the first AGI if the first AGI has gotten out of control, and the second AGI will succeed because it's more specialized.

All of these are reasonable ideas on their face, and indeed they're similar to ongoing AGI safety research programs: (1) is myopic or task-limited AGI, (2) is related to AGI limiting and norm-following, (3) is corrigibility, (4) is boxing, and (5) is in the subfield of AIs-helping-with-AGI-safety (other things in this area include IDA, adversarial testing, recursive reward modeling, etc.).

The problem, of course, is that all five of these things, when you look at them carefully, are much harder and more complicated than they appear, and/or less likely to succeed. And meanwhile he's discouraging people from doing the work to solve those problems. :-(

Two senses of “optimizer”

If the super-powerful SAT solver thing finds the plans but doesn't execute them, would you still lump it with optimizer_2? (I know it's just terminology and there's no right answer, but I'm just curious about what categories you find natural.)

(BTW this is more-or-less a description of my current Grand Vision For AGI Safety, where the "dynamics of the world" are discovered by self-supervised learning, and the search process (and much else) is TBD.)

Self-supervised learning & manipulative predictions

This is great, thanks again for your time and thoughtful commentary!

RE "I'm not entirely convinced that predictions should be made in a way that's completely divorced from their effects on the world.": My vision is to make a non-agential question-answering AGI, thus avoiding value alignment. I don't claim that this is definitely the One Right Answer To AGI Safety (see "4. Give up, and just make an agent with value-aligned goals" in the post), but I think it is a plausible (and neglected) candidate answer. See also my post In defense of oracle (tool) AI research for why I think it would solve the AGI safety problem.

If an AGI applies its intelligence and world model to its own output, choosing that output partly for its downstream effects as predicted by the model, then I say it's a goal-seeking agent. In this case, we need to solve value alignment—even if the goal is as simple as "answer my question". (We would need to make sure that the goal is what it's supposed to be, as opposed to a proxy goal, or a weird alien interpretation where rewiring the operator's brain counts as "answer my question".) Again, I'm not opposed to building agents after solving value alignment, but we haven't solved value alignment yet, and thus it's worth exploring the other option: build a non-agent which does not intelligently model the downstream effects of its output at all (or if it does model it incidentally, to not do anything with that information).

Interfacing with a non-agential AGI is generally awkward. You can't directly ask it to do things, or to find a better way to communicate. My proposal here is to ask questions like "If there were no AGIs in the world, what's the likeliest way that a person would find a cure for Alzheimer's?" This type of question does not require the AGI to think through the consequence of its output, and it also has other nice properties (it should give less weird and alien and human-unfriendly answers than the solutions a direct goal-seeking agent would find).

OK, that's my grand vision and motivation, and why I'm hoping for "no reasoning about the consequences of one's output whatsoever", as opposed to finding self-fulfilling predictions. (Maybe very very mild optimization for the consequences of one's outputs is OK, but I'm nervous.)

Your other question was: if a system is making manipulative predictions, towards what goal is it manipulating? Well, you noticed correctly, I'm not sure, and I keep changing my mind. And it may also be different answers depending on the algorithm details.

  • My top expectation is that it will manipulate towards getting further inputs that its model thinks are typical, high-probability inputs. If X implies Y, and P(Y) is low, that might sometimes spuriously push down P(X), and thus the system will pick those X's that result in high P(Y).
  • My secondary expectation is that it might manipulate towards unambiguous, low-entropy outputs. This is the expectation if the system picks out the single most likely ongoing long-term context, and outputs a prediction contingent on that. (If instead the system randomly draws from the probability distribution of all possible contexts, this wouldn't happen, as suggested by interstice's comments on this page.) So if X1 leads to one of 500 slightly different Y1's (Y1a, Y1b,...), while X2 definitely leads to only one specific Y2, then Y2 is probably the most likely single Y, even if all the Y1's in aggregate are likelier than Y2; so X2 is at an unfair advantage.
  • Beyond those two, I suspect there can be other goals but they depend on the algorithm and its heuristics.
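Here is the X1 / X2 example from the second bullet with made-up numbers, just to check the arithmetic. The probabilities 0.7 and 0.3, and the assumption that the 500 Y1 variants are equally likely, are illustrative choices of mine, not claims about any real system.

```python
# Hypothetical prior over the two options, and how many distinct
# continuations ("contexts") each one fans out into.
p_x = {"X1": 0.7, "X2": 0.3}
n_outcomes = {"X1": 500, "X2": 1}

# Joint probability of each (X, specific Y) pair, splitting X's mass evenly.
joint = {}
for x, p in p_x.items():
    for i in range(n_outcomes[x]):
        joint[(x, i)] = p / n_outcomes[x]

# Strategy A: pick the single most likely context, then condition on it.
most_likely_context = max(joint, key=joint.get)

# Strategy B: aggregate over contexts before comparing.
aggregate_best = max(p_x, key=p_x.get)

print("single most likely context comes from:", most_likely_context[0])
print("likeliest option in aggregate:", aggregate_best)
```

Each individual Y1 gets 0.7 / 500 of the probability mass, far less than Y2's 0.3, so the "single most likely context" strategy lands on X2 even though X1 is the likelier option overall.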