It might be worth going into the problem of fully updated deference. I don't think it's necessarily always a problem, but also it does stop utility aggregation and uncertainty from being a panacea, and the associated issues are probably worth a bit of discussion. And as you likely know, there isn't a great journal citation for this, so you could really cash in when people want to talk about it in a few years :P
Yes, this is fine to do, and prevents single-shot problems if you have a particular picture of the distribution over outcomes where most disastrous risk comes from edge cases that get 99.99%ile score but are actually bad, and all we need is actions that are 99th percentile.
This is fine if you want your AI to stack blocks on top of other blocks or something.
But unfortunately when you want to use a quantilizer to do something outside the normal human distribution, like cure cancer or supervise the training of a superhuman AI, you're no longer just shooting f...
I feel like the "obvious" thing to do is to ask how rare (in bits) the post-opitimization EV is according to the pre-optimization distribution. Like, suppose that pre-optimization my probability distribution over utilities I'd get is normally distributed, and after optimizing my EV is +1 standard deviation. Probability of doing that well or better is 0.158, which in bits is 2.65 bits.
Seems indifferent to affine transformation of the utility function, adding irrelevant states, splitting/merging states, etc. What are some bad things about this method?
I'm definitely guilty of getting a disproportionate amount of information from the AI safety community.
I don't really have a good cure for it, but I do think having a specific question helps - it's simply not practical to keep up with the entire literature, and I don't have a good filtering mechanism for what to keep up with in general, but if I'm interested in a specific question I can usually crawl the search engine + citation network well enough to get a good representation of the literature.
You might be interested in Reducing Goodhart. I'm a fan of "detecting and avoiding internal Goodhart," and I claim that's a reflective version of the value learning problem.
Partial solutions which likely do not work in the limit
Taking meta-preferences into account
Naive attempts just move the problem up a meta level. Instead of conflicting preferences, there is now conflict between (preferences+metapreference) equilibria. Intuitively at least for humans, there are multiple or many fixed points, possibly infinitely many.
As a fan of accounting for meta-preferences, I've made my peace with multiple fixed points, to the extent that it now seems wild to expect otherwise.
Like, of course there are multiple ways to model humans ...
Nice. My main issue is that just because humans have values a certain way, doesn't mean we want to build an AI that way, and so I'd draw pretty different implications for alignment. I'm pessimistic about anything that even resembles "make an AI that's like a human child," and more interested in "use a model of a human child to help an inhuman AI understand humans in the way we want."
The world model is learnt mostly by unsupervised predictive learning and so is somewhat orthogonal to the specific goal. Of course in practice in a continual learning setting, what you do and pay attention to (which is affected by your goal) will affect the data input to the unsupervised learning process?
afaict, a big fraction of evolution's instructions for humans (which made sense in the ancestral environment) are encoded as what you pay attention to. Babies fixate on faces, not because they have a practical need to track faces at 1 week old, but b...
Nice AI ethics dataset. It would be a shame if someone were to... fine-tune a LLM to perform well at it after some scratchpad reasoning, thus making an interesting advance in natural language ethical reasoning that might be useful for more general AI alignment if we expect transformative AI to look like 80% LLM and 20% other stuff bolted on, but might fail to generalize to alternative or successor systems that do more direct reasoning about the world.
Sorry, that sentence really got away from me.
For some reason, I couldn't import pysvelte (KeyError: '<run_path>') in the colab notebook. There was also a call to circuitsvis that I had to hunt down the import for.
Nate's concerns don't seem to be the sort of thing that gradient descent in a non-recurrent system learns. (I basically agree with Steve Byrnes here.) GPT-4 probably has enough engagement with the hardware that you could program something that acquires more computer resources using the weights of GPT-4. But it never stumbled on such a solution in training, in part because in gradient descent the gradient is calculated using a model of the computation that doesn't take hacking the computer into account.
In a recurrent system that learns by some non-gradient-...
I'm pessimistic about the chances of this happening automatically, or requiring no expensive tradeoffs, because training networks to have good performance will disupt the human interpretability of intermediate representations, even if we think those representations look like text. But I do think it's interesting, and that attempts to make auditing via intepretable intermediate representations work (and pay the costs) aren't automatically doomed.
Behavioral
1. Describe how the trained policy might generalize from the 5x5 top-right cheese region, to cheese spawned throughout the maze? IE what will the policy do when cheese is spawned elsewhere?
I expect the network to simultaneously be learning several different algorithms.
One method works via diffusion from the cheese and the mouse, and extraction of local connectivity information from fine-grained pixels into coarse-grained channels. This will work even better when the cheese is close to the mouse, but because of the relative lack of training data o...
The Exhibits might have been a nice place to use probabilistic reasoning. I too am too lazy to go through and try to guesstimate numbers though :)
Here's the comment I promised back on post 5.
First, I liked Benchmarking Interpretability Tools for Deep Neural Networks. Because I'm not so familiar with the methods you compared, I admit I actually took a long time to figure out that each each "interpretability tool" was responsible for both finding and visualizing the trigger of a trojan pointing to a particular class, and so "visualize the trigger" meant use the tools separately to try to visualize a trojan given the target class.
I'm slowly making my way through these, so I'll leave you a more complete comment after I read post 8.
I think it's a big stretch to say that deception is basically just trojans. There are similarities, but the regularities that make deception a natural category of behavior that we might be able to detect are importantly fuzzier than the regularities that trojan-detecting strategies use. If "deception" just meant acting according to a wildly different distribution when certain cues were detected, trojan-detection would have us covered, but what counts as "deception" depends more heavily on our standards for the reasoning process, and doean't reliably result in behavior that's way different than non-deceptive behavior.
It seems like you deliberately picked completeness because that's where Dutch book arguments are least compelling, and that you'd agree with the more usual Dutch book arguments.
But I think even the Dutch book for completeness makes some sense. You just have to separate "how the agent internally represents its preferences" from "what it looks like the agent us doing." You describe an agent that dodges the money-pump by simply acting consistently with past choices. Internally this agent has an incomplete representation of preferences, plus a memory. But exte...
Hm, I mostly agree. There isn't any interesting structure by default, you have to get it by trying to mimic a training distribution that has interesting structure.
And I think this relates to another way that I was too reductive, which is that if I want to talk about "simulacra" as a thing, then they don't exist purely in the text, so I must be sneaking in another ontology somewhere - an ontology that consists of features inferred from text (but still not actually the state of our real universe).
I think the biggest pitfall of the "simulator" framing is that it's made people (including Beth Barnes?) think it's all about simulating our physical reality, when exactly because of the constraints you mention (text not actually pinpointing the state of the universe, etc.), the abstractions developed by a predictor are usually better understood in terms of treating the text itself as the state, and learning time-evolution rules for that state.
Also, “give rise to a self-reflective mesa-optimizer that's capable of taking over the outer process” doesn’t parse for me. If it’s important, can you explain in more detail?
So, parsing it a bit at a time (being more thorough than is strictly necessary):
What does it mean for some instrumentally-useful behavior (let's call it behavior "X") to give rise to a mesa-optimizer?
It means that if X is useful for a system in training, that system might learn to do X by instantiating an agent who wants X to happen. So if X is "trying to have good cognitive habits," t...
Thanks for this series of expanded sections!
I'm confused about the distributional generalization thing. Why is that different from minimizing log loss? The loss function (for the base network, not RL-finetuning) is computed based on the logits, not on the temperature-0 sample, right? So a calibrated probability distribution should minimize loss.
I'm skeptical of all of those proposed markers of agentic behavior. Being able to predict what an agent would say, when prompted, is different than being an agent in the sense that causes concern (although it certai...
My vague understanding is that to correspond with Bayesian updating, RL has to have a quite restrictive KL penalty, and in practice people use much less - which might be like Bayesian updating on the pretend dataset where you've seen 50 of each RL example.
Is this accurate? Has anyone produced interesting examples of RL faithful to the RL-as-updating recipe, that you know of?
Is separate training for cognitive strategy useful? I'm genuinely unsure. If you have an architecture that parametrizes how it attends to thoughts, then any ol' RL signal will teach your AI how to attend to thoughts in an instrumentally useful way. I just read Lee's post, so right now I'm primed to expect that this will happen surprisingly often, though maybe the architecture needs to be a little more flexible/recurrent than a transformer before it happens just from trying to predict text.
Instrumental cognitive strategy seems way safer than terminal cognit...
How do you anticipate and strategize around dual-use concerns, particularly for basic / blue-sky interpretability-enablong research?
Since it was evidently A Thing, I have caved to peer pressure :P
"Shard theory doesn't need more work" (in sense 2) could be true as a matter of fact, without me knowing it's true with high confidence. If you're saying "for us to become highly confident that alignment is going to work this way, we need more info", I agree.
But I read you as saying "for this to work as a matter of fact, we need X Y Z additional research":
Yeah, this is a good point. I do indeed think that just plowing ahead wouldn't work as a matter of fact, even if shard theory alignmen...
I think I disagree with lots of things in this post, sometimes in ways that partly cancel each other out.
The memo trap reminds me of the recent work from Anthropic on superposition, memorization, and double descent - it's plausible that there's U-shaped scaling in there somewhere for similar reasons. But because of the exponential scaling of how good superposition is for memorization, maybe the paper actually implies the opposite? Hm.
I'll have to eat the downvote for now - I think it's worth it to use magic as a term of art, since it's 11 fewer words than "stuff we need to remind ourselves we don't know how to do," and I'm not satisfied with "free parameters."
I think it's quite plausible that you don't need much more work for shard theory alignment, because value formation really is that easy / robust.
But how do we learn that fact?
If extremely-confident-you says "the diamond-alignment post would literally work" and I say "what about these magical steps where you make choices without kn...
Here's a place where I want one of those disagree buttons separate from the downvote button :P
Given a world model that contains a bunch of different ways of modeling the same microphysical state (splitting up the same world into different parts, with different saliency connections to each other, like the discussion of job vs. ethnicity and even moreso), there can be multiple copies that coarsely match some human-intuitive criteria for a concept, given different weights by the AI. There will also be ways of modeling the world that don't get represented much...
Nice! Thinking about "outer alignment maximalism" as one of these framings reveals that it's based on treating outer alignment as something like "a training process that's a genuinely best effort at getting the AI to learn to do good things and not bad things" (and so of course the pull-request AI fails because it's teaching the AI about pull requests, not about right and wrong).
Introspecting, this choice of definition seems to correspond with feeling a lack of confidence that we can get the pull-request AI to behave well - I'm sure it's a solvable technic...
This was an important and worthy post.
I'm more pessimistic than Ajeya; I foresee thorny meta-ethical challenges with building AI that does good things and not bad things, challenges not captured by sandwiching on e.g. medical advice. We don't really have much internal disagreement about the standards by which we should judge medical advice, or the ontology in which medical advice should live. But there are lots of important challenges that are captured by sandwiching problems - sandwiching requires advances in how we interpret human feedback, and how we tr...
Family's coming over, so I'm going to leave off writing this comment even though there are some obvious hooks in it that I'd love to come back to later.
Did you ever end up reading Reducing Goodhart? I enjoyed reading these thought experiments, but I think rather than focusing on "the right direction" (of wisdom), or "the right person," we should mostly be thinking about "good processes" - processes for evolving humans' values that humans themselves think are good, in the ordinary way we think ordinary good things are good.
It's unclear how much of what you're describing is "corrigibility," and how much of it is just being good at value learning. I totally agree that an agent that has a sophisticated model of its own limitations, and is doing good reasoning that is somewhat corrigibility-flavored, might want humans to edit it when it's not very good at understanding the world, but then will quickly decide that being edited is suboptimal when it's better than humans at understanding the world.
But this sort of sophisticated-value-learning reasoning doesn't help you if the AI is...
Why a formal specification of the desired properties?
Humans do not carry around a formal specification of what we want printed on the inside of our skulls. So when presented with some formal specification, we would need to gain confidence that such a formal specification would lead to good things and not bad things through some informal process. There's also the problem that specifications of what we want tend to be large - humans don't do a good job of evaluating formal statements even when they're only a few hundred lines long. So why not just cut out the middleman and directly reference the informal processes humans use to evaluate whether some plan will lead to good things and not bad things?
What about problems with direct oversight?
Shouldn't we plan to build trust in AIs in ways that don't require humans to do things like vet all changes to its world-model? Perhaps toy problems that try to get at what we care about, or automated interpretability tools that can give humans a broad overview of some indicators?
The way I see it, LLMs are already computing properties of the next token that correspond to predictions about future tokens (e.g. see gwern's comment). RLHF, to first order, just finds these pre-existing predictions and uses them in whatever way gives the biggest gradient of reward.
If that makes it non-myopic, it can't be by virtue of considering totally different properties of the next token. Nor can it be by doing something that's impossible to train a model to do with pure sequence-prediction. Instead it's some more nebulous thing like "how its most convenient for us to model the system," or "it gives a simple yet powerful rule for predicting the model's generalization properties."
The problem with the swiss cheese model here is illustrative of why this is unpromising as stated. In the swiss cheese model you start with some working system, and then the world throws unexpected accidents at you, and you need to protect the working system from being interrupted by an accident. This is not our position with respect to aligned AI - a misaligned AI is not well-modeled as an aligned AI plus some misaligning factors. That is living in the should-universe-plus-diff. If you prevent all "accidents," the AI will not revert to its normal non-acci...
The mad lads! Now you just need to fold the improved model back to improve the ratings used to train it, and you're ready to take off to the moon*.
*We are not yet ready to take off to the moon. Please do not depart for the moon until we have a better grasp on tracking uncertainty in generative models, modeling humans, applying human feedback to the reasoning process itself, and more.
How many major tweaks did you have to make before this worked? 0? 2? 10? Curious what your process was like in general.
There were a number of iterations with major tweaks. It went something like:
Even given all of this, why should reward function "robustness" be the natural solution to this? Like, what if you get your robust reward function and you're still screwed? It's very nonobvious that this is how you fix things.
Yeah, I sorta got sucked into playing pretend, here. I don't actually have much hope for trying to pick out a concept we'd want just by pointing into a self-supervised world-model - I expect us to need to use human feedback and the AI's self-reflectivity, which means that the AI has to want human feedback, and be able to reflect...
A nice exposition.
For myself I'd prefer the same material much more condensed and to-the-point, but I recognize that there are publication venues that prefer more flowing text.
E.g. compare
... (read more)