All of Charlie Steiner's Comments + Replies

A nice exposition.

For myself I'd prefer the same material much more condensed and to-the-point, but I recognize that there are publication venues that prefer more flowing text.

E.g. compare

We turn next to the laggard. Compared to the fixed roles model, the laggard’s decision problem in the variable roles model is more complex primarily in that it must now consider the expected utility of attacking as opposed to defending or pursuing other goals. When it comes to the expected utility of defending or pursuing other goals, we can simply copy the formulas from

... (read more)

It might be worth going into the problem of fully updated deference. I don't think it's necessarily always a problem, but also it does stop utility aggregation and uncertainty from being a panacea, and the associated issues are probably worth a bit of discussion. And as you likely know, there isn't a great journal citation for this, so you could really cash in when people want to talk about it in a few years :P

Yes, this is fine to do, and prevents single-shot problems if you have a particular picture of the distribution over outcomes where most disastrous risk comes from edge cases that get 99.99%ile score but are actually bad, and all we need is actions that are 99th percentile.

This is fine if you want your AI to stack blocks on top of other blocks or something.

But unfortunately when you want to use a quantilizer to do something outside the normal human distribution, like cure cancer or supervise the training of a superhuman AI, you're no longer just shooting f... (read more)

I feel like the "obvious" thing to do is to ask how rare (in bits) the post-optimization EV is according to the pre-optimization distribution. Like, suppose that pre-optimization my probability distribution over utilities I'd get is normally distributed, and after optimizing my EV is +1 standard deviation. The probability of doing that well or better is 0.158, which corresponds to 2.65 bits.

Seems indifferent to affine transformation of the utility function, adding irrelevant states, splitting/merging states, etc. What are some bad things about this method?
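For what it's worth, the calculation above is easy to reproduce with the standard library; a minimal sketch (the name `optimization_bits` is mine, not from the comment — and note the exact values round to 0.159 and 2.66; the comment's 0.158 / 2.65 are truncations):

```python
from math import log2
from statistics import NormalDist

def optimization_bits(ev_sigma: float) -> float:
    """Rarity, in bits, of doing at least `ev_sigma` standard deviations
    above the mean under a standard normal pre-optimization distribution."""
    tail = 1.0 - NormalDist().cdf(ev_sigma)  # P(outcome >= ev_sigma)
    return -log2(tail)

tail = 1.0 - NormalDist().cdf(1.0)
print(f"{tail:.3f}")                    # 0.159
print(f"{optimization_bits(1.0):.2f}")  # 2.66
```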

I'm definitely guilty of getting a disproportionate amount of information from the AI safety community.

I don't really have a good cure for it, but I do think having a specific question helps - it's simply not practical to keep up with the entire literature, and I don't have a good filtering mechanism for what to keep up with in general, but if I'm interested in a specific question I can usually crawl the search engine + citation network well enough to get a good representation of the literature.

You might be interested in Reducing Goodhart. I'm a fan of "detecting and avoiding internal Goodhart," and I claim that's a reflective version of the value learning problem.

Partial solutions which likely do not work in the limit 

Taking meta-preferences into account

Naive attempts just move the problem up a meta level. Instead of conflicting preferences, there is now conflict between (preferences+metapreference) equilibria. Intuitively at least for humans, there are multiple or many fixed points, possibly infinitely many.

As a fan of accounting for meta-preferences, I've made my peace with multiple fixed points, to the extent that it now seems wild to expect otherwise.

Like, of course there are multiple ways to model humans ... (read more)

Thanks for the links! What I had in mind wasn't exactly the problem 'there is more than one fixed point', but more of 'if you don't understand what you set up, you will end up in a bad place'. I think an example of a dynamic which we sort of understand and expect to be reasonable by human standards is putting humans in a box and letting them deliberate about the problem for thousands of years. I don't think this extends to e.g. LLMs - if you tell me you will train a sequence of increasingly powerful GPT models, let them deliberate for thousands of human-speech-equivalent years, and let them decide about the training of the next-in-the-sequence model, I don't trust the process.

Nice. My main issue is that just because humans have values a certain way, doesn't mean we want to build an AI that way, and so I'd draw pretty different implications for alignment. I'm pessimistic about anything that even resembles "make an AI that's like a human child," and more interested in "use a model of a human child to help an inhuman AI understand humans in the way we want."

The world model is learnt mostly by unsupervised predictive learning and so is somewhat orthogonal to the specific goal. Of course, in practice, in a continual learning setting, what you do and pay attention to (which is affected by your goal) will affect the data input to the unsupervised learning process.

afaict, a big fraction of evolution's instructions for humans (which made sense in the ancestral environment) are encoded as what you pay attention to. Babies fixate on faces, not because they have a practical need to track faces at 1 week old, but b... (read more)

1Beren Millidge2mo
This is true but I don't think is super important for this argument. Evolution definitely encodes inductive biases into learning about relevant things which ML architectures do not, but this is primarily to speed up learning and handle limited initial data. Most of the things evolution focuses on such as faces are natural abstractions anyway and would be learnt by pure unsupervised learning systems. Yes, there are also a number of ways to short-circuit model evaluation entirely. The classic one is having a habit policy which is effectively your action prior. There are also cases where you just follow the default model-free policy and only in cases where you are even more uncertain do you actually deploy the full model-based evaluation capacities that you have.

Nice AI ethics dataset. It would be a shame if someone were to... fine-tune an LLM to perform well at it after some scratchpad reasoning, thus making an interesting advance in natural language ethical reasoning that might be useful for more general AI alignment if we expect transformative AI to look like 80% LLM and 20% other stuff bolted on, but might fail to generalize to alternative or successor systems that do more direct reasoning about the world.

Sorry, that sentence really got away from me.

For some reason, I couldn't import pysvelte (KeyError: '<run_path>') in the colab notebook. There was also a call to circuitsvis that I had to hunt down the import for.

2Neel Nanda3mo
Oh, ugh, Typeguard was updated to v3 and this broke things. And the circuitsvis import was a mistake. Should be fixed now, thanks for flagging!

Nate's concerns don't seem to be the sort of thing that gradient descent in a non-recurrent system learns. (I basically agree with Steve Byrnes here.) GPT-4 probably has enough engagement with the hardware that you could program something that acquires more computer resources using the weights of GPT-4. But it never stumbled on such a solution in training, in part because in gradient descent the gradient is calculated using a model of the computation that doesn't take hacking the computer into account.

In a recurrent system that learns by some non-gradient-... (read more)

I'm pessimistic about the chances of this happening automatically, or requiring no expensive tradeoffs, because training networks to have good performance will disrupt the human interpretability of intermediate representations, even if we think those representations look like text. But I do think it's interesting, and that attempts to make auditing via interpretable intermediate representations work (and pay the costs) aren't automatically doomed.


1. Describe how the trained policy might generalize from the 5x5 top-right cheese region, to cheese spawned throughout the maze? IE what will the policy do when cheese is spawned elsewhere?
I expect the network to simultaneously be learning several different algorithms.

One method works via diffusion from the cheese and the mouse, and extraction of local connectivity information from fine-grained pixels into coarse-grained channels. This will work even better when the cheese is close to the mouse, but because of the relative lack of training data o... (read more)

The Exhibits might have been a nice place to use probabilistic reasoning. I too am too lazy to go through and try to guesstimate numbers though :)

Here's the comment I promised back on post 5.

First, I liked Benchmarking Interpretability Tools for Deep Neural Networks. Because I'm not so familiar with the methods you compared, I admit it actually took me a long time to figure out that each "interpretability tool" was responsible for both finding and visualizing the trigger of a trojan pointing to a particular class, and so "visualize the trigger" meant using the tools separately to try to visualize a trojan given the target class.

  • Detecting trojans vs. detecting deception.
    • This post, and the comparison b
... (read more)
2Stephen Casper3mo
Thanks for the comment. I appreciate how thorough and clear it is. Agreed. This totally might be the most important part of combating deceptive alignment. I think of this as somewhat separate from what diagnostic approaches like MI are equipped to do. Knowing what deception looks like seems more of an outer alignment problem, while knowing what will make the model behave badly even if it seems to be aligned is more of an inner one. +1, but this seems difficult to scale. +1, see [] It seems like trojans inserted crudely via data poisoning may be easy to detect using heuristics that may not be useful for other insidious flaws. This would be anomalous behavior triggered by a rare event, yes. I agree it shouldn't be called deceptive. I don't think my definition of deceptive alignment applies to this because my definition requires that the model does something we don't want it to. Strong +1. This points out a difference between trojans and deception. I'll add this to the post. +1 Thanks!

I'm slowly making my way through these, so I'll leave you a more complete comment after I read post 8.

I think it's a big stretch to say that deception is basically just trojans. There are similarities, but the regularities that make deception a natural category of behavior that we might be able to detect are importantly fuzzier than the regularities that trojan-detecting strategies use. If "deception" just meant acting according to a wildly different distribution when certain cues were detected, trojan-detection would have us covered, but what counts as "deception" depends more heavily on our standards for the reasoning process, and doesn't reliably result in behavior that's way different from non-deceptive behavior.

1Stephen Casper3mo
Thanks. See also EIS VIII []. Could you give an example of a case of deception that is quite unlike a trojan? Maybe we have different definitions. Maybe I'm not accounting for something. Either way, it seems useful to figure out the disagreement.  

It seems like you deliberately picked completeness because that's where Dutch book arguments are least compelling, and that you'd agree with the more usual Dutch book arguments.

But I think even the Dutch book for completeness makes some sense. You just have to separate "how the agent internally represents its preferences" from "what it looks like the agent is doing." You describe an agent that dodges the money-pump by simply acting consistently with past choices. Internally, this agent has an incomplete representation of preferences, plus a memory. But exte... (read more)

Hm, I mostly agree. There isn't any interesting structure by default, you have to get it by trying to mimic a training distribution that has interesting structure.

And I think this relates to another way that I was too reductive, which is that if I want to talk about "simulacra" as a thing, then they don't exist purely in the text, so I must be sneaking in another ontology somewhere - an ontology that consists of features inferred from text (but still not actually the state of our real universe).

I think the biggest pitfall of the "simulator" framing is that it's made people (including Beth Barnes?) think it's all about simulating our physical reality, when exactly because of the constraints you mention (text not actually pinpointing the state of the universe, etc.), the abstractions developed by a predictor are usually better understood in terms of treating the text itself as the state, and learning time-evolution rules for that state.

1Lawrence Chan4mo
The time-evolution rules of the state are simply the probabilities of the autoregressive model -- there's some amount of high level structure but not a lot. (As Ryan says, you don't get the normal property you want from a state (the Markov property) except in a very weak sense.) I also disagree that purely thinking about the text as state + GPT-3 as evolution rules is the intention of the original simulators post []; there's a lot of discussion about the content of the simulations themselves as simulated realities or alternative universes (though the post does clarify that it's not literally physical reality), e.g.: I think insofar as people end up thinking the simulation is an exact match for physical reality, the problem was not in the simulators frame itself, but instead the fact that the word physics was used 47 times in the post, while only the first few instances make it clear that literal physics is intended only as a metaphor. 
3Ryan Greenblatt4mo
Thinking about the state and time evolution rules for the state seems fine, but there isn't any interesting structure with the naive formulation imo. The state is the entire text, so we don't get any interesting Markov chain structure. (you can turn any random process into a Markov chain where you include the entire history in the state! The interesting property was that the past didn't matter!)
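A toy sketch of the degenerate construction Ryan describes - any random process becomes "Markov" once the state is the entire history, so the text-as-state framing buys little structure by itself. (The `next_token` function here is a stand-in I'm introducing for illustration, not a real predictor.)

```python
import random

def next_token(state: str) -> str:
    """Stand-in for a learned next-token rule. It conditions on the entire
    text so far - which is exactly the point: with the whole history as
    'state', the Markov property holds trivially for any process."""
    random.seed(state)  # deterministic function of the full history
    return random.choice("ab")

def step(state: str) -> str:
    # Time-evolution rule: append one token to the text-state.
    return state + next_token(state)

state = "x"
for _ in range(5):
    state = step(state)
print(len(state))  # 6
```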

Also, “give rise to a self-reflective mesa-optimizer that's capable of taking over the outer process” doesn’t parse for me. If it’s important, can you explain in more detail?

So, parsing it a bit at a time (being more thorough than is strictly necessary):

What does it mean for some instrumentally-useful behavior (let's call it behavior "X") to give rise to a mesa-optimizer?

It means that if X is useful for a system in training, that system might learn to do X by instantiating an agent who wants X to happen. So if X is "trying to have good cognitive habits," t... (read more)

2Steve Byrnes3mo
Thanks. I’m generally thinking about model-based RL where the whole system is unambiguously an agent that’s trying to do things, and the things it’s trying to do are related to items in the world-model that the value-function thinks are high-value, and “world-model” and “value function” are labeled boxes in the source code, and inside those boxes a learning algorithm builds unlabeled trained models. (We can separately argue about whether that’s a good thing to be thinking about.) In this picture, you can still have subagents / Society-Of-Mind []; for example, if the value function assigns high value to the world-model concept “I will follow through on my commitment to exercise” and also assigns high value to the world-model concept “I will watch TV”, then this situation can be alternatively reframed as two subagents duking it out. But still, insofar as the subagents are getting anything done, they’re getting things done in a way that uses the world-model as a world-model, and uses the value function as a value function, etc. By contrast, when people talk about mesa-optimizers, they normally have in mind something like RFLO [], where agency & planning wind up emerging entirely inside a single black box. I don’t expect that to happen for various reasons, cf. here [] and here [].  OK, so if we restrict to model-based RL, and we forget about mesa-optimizers, then my best-guess translation of “Is separate training for cognitive strategy useful?” into my ontology is something like “Should we set up the AGI’s internal reward function to “care about” cognitive strategy explicitly, and not just let the cognitive strategy emerge by instrumental reasoning?” I mostly don’t have any great pla

Thanks for this series of expanded sections!

I'm confused about the distributional generalization thing. Why is that different from minimizing log loss? The loss function (for the base network, not RL-finetuning) is computed based on the logits, not on the temperature-0 sample, right? So a calibrated probability distribution should minimize loss.

I'm skeptical of all of those proposed markers of agentic behavior. Being able to predict what an agent would say, when prompted, is different than being an agent in the sense that causes concern (although it certai... (read more)

3Evan Hubinger4mo
The paper [] explains it better than I can, but essentially: if I give you an imbalanced labeling problem, where 60% are A and 40% are B, and I remove all the actual features and just replace them with noise, the Bayes-optimal thing to do is output A every time, but in fact large neural networks will learn to output A 60% of the time and B 40% of the time even in that setting. Yes, I agree--these markers mostly don't test whether the model is a predictor (though that's not entirely true, I do think the delta in markers of agency between different training regimes is a useful datapoint there). Primarily, however, what they do test is, if it is a predictor, how agentic is the thing that it is predicting. And I think that's extremely important, since we really want to avoid predictive models that are simulating potentially malign agents [].

My vague understanding is that to correspond with Bayesian updating, RL has to have a quite restrictive KL penalty, and in practice people use much less - which might be like Bayesian updating on the pretend dataset where you've seen 50 of each RL example.

Is this accurate? Has anyone produced interesting examples of RL faithful to the RL-as-updating recipe, that you know of?
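For reference, the correspondence being invoked is the standard closed-form result for KL-regularized reward maximization (my gloss, not from the comment): maximizing expected reward minus a KL penalty against the pretrained policy $\pi_0$ gives

```latex
% Objective:  max_pi  E_{x ~ pi}[ r(x) ]  -  beta * KL( pi || pi_0 )
% Closed-form optimum:
\pi^*(x) \;\propto\; \pi_0(x)\,\exp\!\bigl(r(x)/\beta\bigr)
```

which is a Bayesian update of $\pi_0$ with "likelihood" $e^{r(x)/\beta}$. A weak penalty (small $\beta$) sharpens that likelihood, as if the same evidence had been observed many times over - which is the "seen 50 of each RL example" picture.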

Is separate training for cognitive strategy useful? I'm genuinely unsure. If you have an architecture that parametrizes how it attends to thoughts, then any ol' RL signal will teach your AI how to attend to thoughts in an instrumentally useful way. I just read Lee's post, so right now I'm primed to expect that this will happen surprisingly often, though maybe the architecture needs to be a little more flexible/recurrent than a transformer before it happens just from trying to predict text.

Instrumental cognitive strategy seems way safer than terminal cognit... (read more)

2Steve Byrnes4mo
If we make an AGI, and the AGI starts doing Anki because it’s instrumentally useful, then I don’t care, that doesn’t seem safety-relevant. I definitely think things like this happen by default. If we make an AGI and the AGI develops (self-reflective) preferences about its own preferences, I care very much, because now it’s potentially motivated to change its preferences, which can be good (if its meta-preferences are aligned with what I was hoping for) or bad (if misaligned). See here []. I note that intervening on an AGI’s meta-preferences seems hard. Like, if the AGI turns to look at an apple, we can make a reasonable guess that it might be thinking about apples at that moment, and that at least helps us get our foot in the door (cf. Section 4.1 in OP)—but there isn’t an analogous trick for meta-preferences. (This is a reason that I’m very interested in the nuts-and-bolts of how self-concept works in the human brain. Haven’t made much progress on that though.) I’m not sure what you mean by “separate training for cognitive strategy”. Also, “give rise to a self-reflective mesa-optimizer that's capable of taking over the outer process” doesn’t parse for me. If it’s important, can you explain in more detail?
  1. I feel like by the time your large predictive model is modeling superintelligences that are actually superintelligent, other people using similar architectures in different ways are probably already building their AGIs. I'm not excited about AI-assisted alignment that requires us to already be in a losing position.
    1. This is one of the reasons why, despite being gung-ho about conditioning current models, I think there's a very plausible case for RL finetuning being useful in the future, like maybe if you want to differentially advance a language model's capab
... (read more)
4Johannes Treutlein4mo
Thanks for your comment! Regarding 1: I don't think it would be good to simulate superintelligences with our predictive models. Rather, we want to simulate humans to elicit safe capabilities. We talk more about competitiveness of the approach in Section III []. Regarding 3: I agree it might have been good to discuss cyborgism specifically. I think cyborgism is to some degree compatible with careful conditioning. One possible issue when interacting with the model arises when the model is trained on / prompted with its own outputs, or data that has been influenced by its outputs. We write about this [] in the context of imitative amplification and above when considering factorization []: I personally think there might be ways to make such approaches work and get around the issues, e.g., by making sure that the model is myopic and that there is a unique fixed point. But we would lose some of the safety properties of just doing conditioning. Regarding 2: I agree that it would be good if we can avoid fooling ourselves. One hope would be that in a sufficiently capable model, conditioning would help with generating work that isn't worse than that produced by real humans.

How do you anticipate and strategize around dual-use concerns, particularly for basic / blue-sky interpretability-enabling research?

2Stephen Casper4mo
I think that my personal thoughts on capabilities externalities are reflected well in this post []. I'd also note that this concern isn't unique to interpretability work but applies to alignment work in general. And in comparison to other alignment techniques, I think that the downside risks of interpretability tools are most likely lower than those of stuff like RLHF. Most theories of change for interpretability helping with AI safety involve engineering work at some point in time, so I would expect that most interpretability researchers have similar attitudes to this on dual use concerns. In general, a tool being engineering-relevant does not imply that it will be competitive for setting a new SOTA on something risky. So when I talk about engineering relevance in this sequence, I don't have big advancements in mind so much as stuff like fairly simple debugging work.

Since it was evidently A Thing, I have caved to peer pressure :P

"Shard theory doesn't need more work" (in sense 2) could be true as a matter of fact, without me knowing it's true with high confidence. If you're saying "for us to become highly confident that alignment is going to work this way, we need more info", I agree. 

But I read you as saying "for this to work as a matter of fact, we need X Y Z additional research":

Yeah, this is a good point. I do indeed think that just plowing ahead wouldn't work as a matter of fact, even if shard theory alignmen... (read more)

I think I disagree with lots of things in this post, sometimes in ways that partly cancel each other out.

  • Parts of generalizing correctly involve outer alignment. I.e. building objective functions that have "something to say" about how humans want the AI to generalize.
  • Relatedly, outer alignment research is not done, and RLHF/P is not the be-all-end-all.
  • I think we should be aiming to build AI CEOs (or more generally, working on safety technology with an eye towards how it could be used in AGI that skillfully navigates the real world). Yes, the reality of the
... (read more)
2David Scott Krueger4mo
(A very quick response): Agree with (1) and (2).   I am ambivalent RE (3) and the replaceability arguments. RE (4): I largely agree, but I think the norm should be "let's try to do less ambitious stuff properly" rather than "let's try to do the most ambitious stuff we can, and then try and figure out how to do it as safely as possible as a secondary objective".  

The memo trap reminds me of the recent work from Anthropic on superposition, memorization, and double descent - it's plausible that there's U-shaped scaling in there somewhere for similar reasons. But because of the exponential scaling of how good superposition is for memorization, maybe the paper actually implies the opposite? Hm.

I'll have to eat the downvote for now - I think it's worth it to use magic as a term of art, since it's 11 fewer words than "stuff we need to remind ourselves we don't know how to do," and I'm not satisfied with "free parameters."

I think it's quite plausible that you don't need much more work for shard theory alignment, because value formation really is that easy / robust.

But how do we learn that fact?

If extremely-confident-you says "the diamond-alignment post would literally work" and I say "what about these magical steps where you make choices without kn... (read more)

3Alex Turner4mo
11 fewer words, but I don't think it communicates the intended concept! If you have to say "I don't mean one obvious reading of the title" as the first sentence, it's probably not a good title. This isn't a dig -- titling posts is hard, and I think it's fair to not be satisfied with the one I gave. I asked ChatGPT to generate several new titles; lightly edited: After considering these, I think that "Reminder: shard theory leaves open important uncertainties" is better than these five, and far better than the current title. I think a better title is quite within reach. I didn't claim that I assign high credence to alignment just working out, I'm saying that it may as a matter of fact turn out that shard theory doesn't "need a lot more work," because alignment works out as a matter of fact from the obvious setups people try.
  1. There's a degenerate version of this claim, where ST doesn't need more work because alignment is "just easy" for non-shard-theory reasons, and in that world ST "doesn't need more work" because alignment itself doesn't need more work.
  2. There's a less degenerate version of the claim, where alignment is easy for shard-theory reasons -- e.g. agents robustly pick up a lot of values, many of which involve caring about us.
"Shard theory doesn't need more work" (in sense 2) could be true as a matter of fact, without me knowing it's true with high confidence. If you're saying "for us to become highly confident that alignment is going to work this way, we need more info", I agree. But I read you as saying "for this to work as a matter of fact, we need X Y Z additional research": And I think this is wrong. 2 can just be true, and we won't justifiably know it. So I usually say "It is not known to me that I know how to solve alignment", and not "I don't know how to solve alignment." Does that make sense?
1Neel Nanda4mo
Thanks! I'd be excited to hear from anyone who ends up actually working on these :)

Here's a place where I want one of those disagree buttons separate from the downvote button :P

Given a world model that contains a bunch of different ways of modeling the same microphysical state (splitting up the same world into different parts, with different saliency connections to each other, like the discussion of job vs. ethnicity and even moreso), there can be multiple copies that coarsely match some human-intuitive criteria for a concept, given different weights by the AI. There will also be ways of modeling the world that don't get represented much... (read more)

2Thane Ruthenis5mo
I agree that the AI would only learn the abstraction layers it'd have a use for. But I wouldn't take it as far as you do []. I agree that with "human values" specifically, the problem may be just that muddled, but with none of the other nice targets — moral philosophy, corrigibility, DWIM, they should be more concrete. The alternative would be a straight-up failure of the NAH, I think; your assertion that "abstractions can be on a continuum" seems directly at odds with it. Which isn't impossible, but this post is premised on the NAH working.

Nice! Thinking about "outer alignment maximalism" as one of these framings reveals that it's based on treating outer alignment as something like "a training process that's a genuinely best effort at getting the AI to learn to do good things and not bad things" (and so of course the pull-request AI fails because it's teaching the AI about pull requests, not about right and wrong).

Introspecting, this choice of definition seems to correspond with feeling a lack of confidence that we can get the pull-request AI to behave well - I'm sure it's a solvable technic... (read more)

This was an important and worthy post.

I'm more pessimistic than Ajeya; I foresee thorny meta-ethical challenges with building AI that does good things and not bad things, challenges not captured by sandwiching on e.g. medical advice. We don't really have much internal disagreement about the standards by which we should judge medical advice, or the ontology in which medical advice should live. But there are lots of important challenges that are captured by sandwiching problems - sandwiching requires advances in how we interpret human feedback, and how we tr... (read more)

Family's coming over, so I'm going to leave off writing this comment even though there are some obvious hooks in it that I'd love to come back to later.

  • If the AI can't practically distinguish mechanisms for good vs. bad behavior even in principle, why can the human distinguish them? If the human can't distinguish them, why do we think the human is asking for a coherent thing? If the human isn't asking for a coherent thing, we don't have to smash our heads against that brick wall, we can implement "what to do when the human asks for an incoherent thing" cont
... (read more)

Did you ever end up reading Reducing Goodhart? I enjoyed reading these thought experiments, but I think rather than focusing on "the right direction" (of wisdom), or "the right person," we should mostly be thinking about "good processes" - processes for evolving humans' values that humans themselves think are good, in the ordinary way we think ordinary good things are good.

1Alex Flint5mo
Not yet, but I hope to, and I'm grateful to you for writing it. Well, sure, but the question is whether this can really be done by modelling human values and then evolving those models. If you claim yes then there are several thorny issues to contend with, including what constitutes a viable starting point for such a process, what is a reasonable dynamic for such a process, and on what basis we decide the answers to these things.

It's unclear how much of what you're describing is "corrigibility," and how much of it is just being good at value learning. I totally agree that an agent that has a sophisticated model of its own limitations, and is doing good reasoning that is somewhat corrigibility-flavored, might want humans to edit it when it's not very good at understanding the world, but then will quickly decide that being edited is suboptimal when it's better than humans at understanding the world.

But this sort of sophisticated-value-learning reasoning doesn't help you if the AI is... (read more)

1Vladimir Nesov5mo
Corrigibility is the tendency to fix fundamental problems based on external observations, before the problems lead to catastrophes. It's less interesting when applied to things other than preference, but even when applied to preference it's not just value learning. There's value learning where you learn fixed values that exist in the abstract (as extrapolations on reflection), things like utility functions; and value learning as a form of preference []. I think humans might lack fixed values appropriate for the first sense (even normatively, on reflection of the kind feasible in the physical world). Values that are themselves corrigible can't be fully learned, otherwise the resulting agent won't be aligned in the ambitious sense, its preference won't be the same kind of thing as the corrigible human preference-on-reflection. The values of such an aligned agent must remain corrigible indefinitely. I think being good at corrigibility (in the general sense, not about values in particular) is orthogonal to being good at value learning, it's about recognizing one's own limitations, including limitations of value learning and corrigibility. So acting only within goodhart scope (situations where good proxies of value and other tools of good decision making are already available) is a central example of corrigibility, as is shutting down activities in situations well outside the goodhart scope. And not pushing the world outside your goodhart scope with your own actions [] (before the scope has already extended there, with sufficient value learning and reflection). Corrigibility makes the agent wait on value learning and other kinds of relevant safety guarantees, it's not in a race with them, so a corrigible agent being bad at value learning (or not knowably good enough at corrigibility) merely makes it le

Why a formal specification of the desired properties?

Humans do not carry around a formal specification of what we want printed on the inside of our skulls. So when presented with some formal specification, we would need to gain confidence that such a formal specification would lead to good things and not bad things through some informal process. There's also the problem that specifications of what we want tend to be large - humans don't do a good job of evaluating formal statements even when they're only a few hundred lines long. So why not just cut out the middleman and directly reference the informal processes humans use to evaluate whether some plan will lead to good things and not bad things?

1 · davidad (David A. Dalrymple) · 5mo
The informal processes humans use to evaluate outcomes are buggy and inconsistent (across humans, within humans, across different scenarios that should be equivalent, etc.). (Let alone asking humans to evaluate plans!) The proposal here is not to aim for coherent extrapolated volition, but rather to identify a formal property Q (presumably a conjunct of many other properties, etc.) such that Q conservatively implies that some of the most important bad things are limited and that there's some baseline minimum of good things (e.g. everyone has access to resources sufficient for at least their previous standard of living). In human history, the development of increasingly formalized bright lines around what things count as definitely bad things (namely, laws) seems to have been greatly instrumental in the reduction of bad things overall.

Regarding the challenges of understanding formal descriptions, I'm hopeful about this because of:

- natural abstractions (so the best formal representations could be shockingly compact)
- code review (Google's codebase is not exactly "a formal property," unless we play semantics games, but it is highly reliable, fully machine-readable, and every one of its several billion lines of code has been reviewed by at least 3 humans)
- AI assistants (although we need to be very careful here; e.g. reading LLM outputs cannot substitute for actually understanding the formal representation, since they are often untruthful)

What about problems with direct oversight?

Shouldn't we plan to build trust in AIs in ways that don't require humans to do things like vet all changes to its world-model? Perhaps toy problems that try to get at what we care about, or automated interpretability tools that can give humans a broad overview of some indicators?

2 · davidad (David A. Dalrymple) · 5mo
Yes, I agree that we should plan toward a way to trust AIs as something more like virtuous moral agents rather than as safety-critical systems. I would prefer that. But I am afraid those plans will not reach success before AGI gets built anyway, unless we have a concurrent plan to build an anti-AGI defensive TAI that requires less deep insight into normative alignment.
2 · davidad (David A. Dalrymple) · 5mo
In response to your linked post, I do have similar intuitions about "Microscope AI" as it is typically conceived (i.e. to examine the AI for problems using mechanistic interpretability tools before deploying it). Here I propose two things that are a little bit like Microscope AI but in my view both avoid the core problem you're pointing at (i.e. a useful neural network will always be larger than your understanding of it, and that matters):

1. Model-checking policies for formal properties. A model-checker (unlike a human interpreter) works with the entire network, not just the most interpretable parts. If it proves a property, that property is true about the actual neural network. The Model-Checking Feasibility Hypothesis says that this is feasible, regardless of the infeasibility of a human understanding the policy or any details of the proof. (We would rely on a verified verifier for the proof, of which humans would understand the details.)
2. Factoring learned information through human understanding. If we denote learning by L, human understanding by H, and a big effect on the world by an arrow e : L → W, then "factoring" means that e = g ∘ f for some f : L → H and g : H → W. This is in the same spirit as "human in the loop," except not for the innermost loops of real-time action. Here, the Scientific Sufficiency Hypothesis implies that even though L is "larger" than H in the sense you point out, we can throw away the parts that don't fit in H and move forward with a fully-understood world model. I believe this is likely feasible for world models, but not for policies (optimal policies for simple world models, like Go, can of course be much better than anything humans understand).
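Point 1 can be made concrete with a deliberately tiny example (my sketch, not davidad's proposal): when the world model is small and discrete, "model-checking a policy" can literally be exhaustive enumeration of trajectories against a safety property. Everything below — the gridworld, the pit, the policy — is invented for illustration:

```python
# Toy gridworld: states are (x, y) on a 4x4 grid. The safety property Q
# we want to verify: "no trajectory ever enters the pit."
GRID = 4
PIT = (3, 3)
GOAL = (0, 3)

def policy(state):
    # A hand-written policy: move to the goal column first, then up.
    x, y = state
    if x > GOAL[0]:
        return (-1, 0)
    if y < GOAL[1]:
        return (0, 1)
    return (0, 0)  # stay at the goal

def step(state, action):
    # Deterministic dynamics, clamped to the grid.
    x, y = state
    dx, dy = action
    return (min(max(x + dx, 0), GRID - 1), min(max(y + dy, 0), GRID - 1))

def check_never_reaches(bad, horizon=20):
    """Exhaustively verify: from every start state (other than `bad`
    itself), no trajectory under `policy` visits `bad`."""
    starts = [(x, y) for x in range(GRID) for y in range(GRID)
              if (x, y) != bad]
    for s in starts:
        for _ in range(horizon):
            if s == bad:
                return False
            s = step(s, policy(s))
        if s == bad:
            return False
    return True

print(check_never_reaches(PIT))  # → True
```

Real model-checkers work over neural policies and much larger (often symbolic) state spaces, but the proved property has the same shape: a universally quantified statement about the actual system, not about an interpretable summary of it.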

The way I see it, LLMs are already computing properties of the next token that correspond to predictions about future tokens (e.g. see gwern's comment). RLHF, to first order, just finds these pre-existing predictions and uses them in whatever way gives the biggest gradient of reward.

If that makes it non-myopic, it can't be by virtue of considering totally different properties of the next token. Nor can it be by doing something that's impossible to train a model to do with pure sequence-prediction. Instead it's some more nebulous thing like "how it's most convenient for us to model the system," or "it gives a simple yet powerful rule for predicting the model's generalization properties."
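One way to see the "next-token predictions already encode predictions about future tokens" point concretely: in any autoregressive model, the distribution over the token k steps ahead is fully determined by chaining the next-token distribution. A toy bigram sketch (transition probabilities invented for illustration):

```python
# Toy autoregressive "model": next-token probabilities over a 3-token
# vocabulary. P[i][j] = probability the next token is j given token i.
P = [
    [0.7, 0.2, 0.1],
    [0.1, 0.6, 0.3],
    [0.3, 0.3, 0.4],
]

def dist_k_ahead(token, k):
    """Distribution over the token k steps ahead, obtained purely by
    chaining the next-token distribution -- no extra machinery."""
    d = [1.0 if t == token else 0.0 for t in range(3)]
    for _ in range(k):
        d = [sum(d[i] * P[i][j] for i in range(3)) for j in range(3)]
    return d

print(dist_k_ahead(0, 1))  # next-token distribution: [0.7, 0.2, 0.1]
print(dist_k_ahead(0, 2))  # implied two-tokens-ahead prediction
```

So a learning signal that cares about downstream tokens (like RLHF, to first order) can get traction by reweighting information the next-token distribution already carries, rather than by computing some new kind of quantity.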

The problem with the Swiss cheese model here is illustrative of why this is unpromising as stated. In the Swiss cheese model you start with some working system, and then the world throws unexpected accidents at you, and you need to protect the working system from being interrupted by an accident. This is not our position with respect to aligned AI: a misaligned AI is not well-modeled as an aligned AI plus some misaligning factors. That is living in the should-universe-plus-diff. If you prevent all "accidents," the AI will not revert to its normal non-acci... (read more)

The mad lads! Now you just need to fold the improved model back to improve the ratings used to train it, and you're ready to take off to the moon*.

*We are not yet ready to take off to the moon. Please do not depart for the moon until we have a better grasp on tracking uncertainty in generative models, modeling humans, applying human feedback to the reasoning process itself, and more.

1 · Johannes Treutlein · 5mo
Thank you!

How many major tweaks did you have to make before this worked? 0? 2? 10? Curious what your process was like in general.

There were a number of iterations with major tweaks. It went something like:

  • I spent a while thinking about the problem conceptually, and developed a pretty strong intuition that something like this should be possible. 
  • I tried to show it experimentally. There were no signs of life for a while (it turns out you need to get a bunch of details right to see any real signal -- a regime that I think is likely my comparative advantage) but I eventually got it to sometimes work using a PCA-based method. I think it took some work to make that more reliable, whi
... (read more)
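The comment mentions "a PCA-based method" without details; as an illustration of that general family of techniques (my sketch, not necessarily the authors' method), one can mean-center a set of activation vectors and take the top principal component as a candidate direction for the concept of interest. All data here is synthetic:

```python
import random

random.seed(0)
DIM = 8
N = 200

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Hypothetical synthetic "activations": two classes that differ mainly
# along coordinate 0, plus small isotropic noise on every coordinate.
def sample(label):
    x = [random.gauss(0, 0.1) for _ in range(DIM)]
    x[0] += 1.0 if label else -1.0
    return x

data = [sample(i % 2 == 0) for i in range(N)]

# Center the data.
mean = [sum(x[j] for x in data) / N for j in range(DIM)]
centered = [[a - m for a, m in zip(x, mean)] for x in data]

# Top principal component via power iteration on the (implicit)
# covariance matrix -- a minimal stand-in for "a PCA-based method".
v = [random.gauss(0, 1) for _ in range(DIM)]
for _ in range(50):
    w = [0.0] * DIM
    for x in centered:
        c = dot(x, v)
        w = [wj + c * xj for wj, xj in zip(w, x)]
    norm = dot(w, w) ** 0.5
    v = [wj / norm for wj in w]

# The recovered direction should align with the class-difference axis.
print(round(abs(v[0]), 2))
```

The "getting a bunch of details right" point bites even in this toy: centering, the amount of noise relative to signal, and which activations you collect all determine whether the top component tracks the concept or something incidental.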

Even given all of this, why should reward function "robustness" be the natural solution to this? Like, what if you get your robust reward function and you're still screwed? It's very nonobvious that this is how you fix things. 

Yeah, I sorta got sucked into playing pretend, here. I don't actually have much hope for trying to pick out a concept we'd want just by pointing into a self-supervised world-model - I expect us to need to use human feedback and the AI's self-reflectivity, which means that the AI has to want human feedback, and be able to reflect... (read more)

2 · Alex Turner · 6mo
It seems not relevant whether it's an optimum or not. What's relevant is the scalar reward values output on realized datapoints. I emphasize this because "unintended optimum" phrasing seems to reliably trigger cached thoughts around "reward functions need to be robust graders." (I also don't like "optimum" of values, because I think that's really not how values work in detail, as opposed to in gloss, and "optimum" probably evokes similar thoughts around "values must be robust against adversaries.")
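A toy way to see the distinction being drawn (my illustration, not Turner's): a reward function can agree with the "true" reward on every datapoint a policy actually visits while having a wild unintended optimum somewhere it is never queried. Whether that matters depends on whether anything is optimizing against the grader. All functions and numbers below are invented:

```python
# "True" reward over a 1-D state: peaked at s = 2.
def true_reward(s):
    return -(s - 2.0) ** 2

# Proxy reward: identical on-distribution, but with a spurious spike
# (an "unintended optimum") far outside the realized states.
def proxy_reward(s):
    spike = 100.0 if abs(s - 9.0) < 0.1 else 0.0
    return -(s - 2.0) ** 2 + spike

# States an actual trajectory visits: clustered around the peak.
realized_states = [1.5 + 0.1 * i for i in range(11)]

# On realized datapoints the proxy is indistinguishable from the truth...
on_dist_gap = max(abs(proxy_reward(s) - true_reward(s))
                  for s in realized_states)
print(on_dist_gap)  # → 0.0

# ...but a grid search standing in for an adversarial optimizer finds
# the spike instead of the intended peak.
grid = [i * 0.01 for i in range(1001)]
best = max(grid, key=proxy_reward)
print(best)  # near 9.0, not 2.0
```

On Turner's framing, only the first computation (values on realized datapoints) enters the training signal; the second only matters if something is actually searching the input space for the argmax.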