There's an apparent tension in the inoculation prompting literature: Anthropic found that general inoculation prompts work well during on-policy RL, while the prompts used for SFT in Wichers et al. are quite specific to the misbehavior we want to prevent. I think there might be a straightforward mechanistic reason for why general inoculation prompts work well during on-policy RL but not in off-policy training (SFT or recontextualization).
In Wichers et al., which studies inoculation prompting in SFT settings, we find that we need to use quite specific inocu...
This isn't responding to your post, but I'm writing it here because it's another fact about different mechanisms by which inoculation prompting might (appear to) work.
In the normal story, the inoculation prompt recontextualizes the model's undesired behavior, such that the model doesn't display the behavior in dissimilar contexts. In this story:
In retrospect it seems like such a fluke that decision theory in general and UDT in particular became a central concern in AI safety. In most possible worlds (with something like humans) there is probably no Eliezer-like figure, or the Eliezer-like figure isn't particularly interested in decision theory as a central part of AI safety, or doesn't like UDT in particular. I infer this from the fact that where Eliezer's influence is low (e.g. AI labs like Anthropic and OpenAI) there seems little interest in decision theory in connection with AI safety (cf Dari...
Thanks. This sounds like a more peripheral interest/concern, compared to Eliezer/LW's, which was more like, we have to fully solve DT before building AGI/ASI, otherwise it could be catastrophic due to something like the AI falling prey to an acausal threat or commitment races, or can't cooperate with other AIs.
An update on this 2010 position of mine, which seems to have become conventional wisdom on LW:
...In my posts, I've argued that indexical uncertainty like this shouldn't be represented using probabilities. Instead, I suggest that you consider yourself to be all of the many copies of you, i.e., both the ones in the ancestor simulations and the one in 2010, making decisions for all of them. Depending on your preferences, you might consider the consequences of the decisions of the copy in 2010 to be the most important and far-reaching, and therefore act mostly
People often say US-China deals to slow AI progress and develop AI more safely would be hard to enforce/verify.
However, there are easy to enforce deals: each destroys a fraction of their chips at some level of AI capability. This still seems like it could be helpful and it's pretty easy to verify.
This is likely worse than a well-executed comprehensive deal which would allow for productive non-capabilities uses of the compute (e.g., safety or even just economic activity). But it's harder to verify that chips aren't used to advance capabilities while easy to...
Inspired by a recent comment, a potential AI movie or TV show that might introduce good ideas to society, is one where there are already uploads, LLM-agents and biohumans who are beginning to get intelligence-enhanced, but there is a global moratorium on making any individual much smarter.
There's an explicit plan for gradually ramping up intelligence, running on tech that doesn't require ASI (i.e. datacenters are centralized, monitored and controlled via international agreement, studying bioenhancement or AI development requires approval from your country...
Yeah I went to try to write some stuff and felt bottlenecked on figuring out how to generate a character I connect with. I used to write fiction but like 20 years ago and I'm out of touch.
I think a good approach here would be to start with some serial webfiction since that's just easier to iterate on.
Many of Paul Christiano's writings were valuable corrections to the dominant Yudkowskian paradigm of AI safety. However, I think that many of them (especially papers like concrete problems in AI safety and posts like these two) also ended up providing a lot of intellectual cover for people to do "AI safety" work (especially within AGI companies) that isn't even trying to be scalable to much more powerful systems.
I want to register a prediction that "gradual disempowerment" will end up being (mis)used in a similar way. I don't really know what to do about t...
mostly it does not match my practical experience so far
I mostly wouldn't expect it to at this point, FWIW. The people engaged right now are by and large people sincerely grappling with the idea, and particularly people who are already bought into takeover risk. Whereas one of the main mechanisms by which I expect misuse of the idea is that people who are uncomfortable with the concept of "AI takeover" can still classify themselves as part of the AI safety coalition when it suits them.
As an illustration of this happening to Paul's worldview, see this Vox ar...
I made a manifold market about how likely we are to get ambitious mechanistic interpretability to GPT-2 level: https://manifold.markets/LeoGao/will-we-fully-interpret-a-gpt2-leve?r=TGVvR2Fv
I honestly didn't think of that at all when making the market, because I think takeover-capability-level AGI by 2028 is extremely unlikely.
I care about this market insofar as it tells us whether (people believe) this is a good research direction. So obviously it's perfectly ok to resolve YES if it is solved and a lot of the work was done by AI assistants. If AI fooms and murders everyone before 2028 then this is obviously a bad portent for this research agenda, because it means we didn't get it done soon enough, and it's little comfort if the ASI sol...
An analogy that points at one way I think the instrumental/terminal goal distinction is confused:
Imagine trying to classify genes as either instrumentally or terminally valuable from the perspective of evolution. Instrumental genes encode traits that help an organism reproduce. Terminal genes, by contrast, are the "payload" which is being passed down the generations for their own sake.
This model might seem silly, but it actually makes a bunch of useful predictions. Pick some set of genes which are so crucial for survival that they're seldom if ever modifie...
We start off with some early values, and then develop instrumental strategies for achieving them. Those instrumental strategies become crystallized and then give rise to other instrumental strategies for achieving them, and so on. Understood this way, we can describe an organism's goals/strategies purely in terms of which goals "have power over" which other goals, which goals are most easily replaced, etc, without needing to appeal to some kind of essential "terminalism" that some goals have and others don't.
I think it would help me if you explained wha...
When talking about "self-fulfilling misalignment", "hyperstition" is a fun name but not a good name which actually describes the concept to a new listener. (In this sense, the name has the same problem as "shard theory" --- cool but not descriptive unless you already know the idea.) As a matter of discourse health, I think people should use "self-fulfilling {misalignment, alignment, ...}" instead.
Yeah, that's fair. I guess my views on this are stronger because I think data filtering might be potentially negative rather than potentially sub-optimal.
(This post is now live on the METR website in a slightly edited form)
In the 9 months since the METR time horizon paper (during which AI time horizons have increased by ~6x), it’s generated lots of attention as well as various criticism on LW and elsewhere. As one of the main authors, I think much of the criticism is a valid response to misinterpretations, and want to list my beliefs about limitations of our methodology and time horizon more broadly. This is not a complete list, but rather ...
Nice. Yeah I also am excited about coding uplift as a key metric to track that would probably make time horizons obsolete (or at least, constitute a significantly stronger source of evidence than time horizons). We at AIFP don't have capacity to estimate the trend in uplift over time (I mean we can do small-N polls of frontier AI company employees...) but we hope someone does.
having the right mental narrative and expectation setting when you do something seems extremely important. the exact same object experience can be anywhere from amusing to irritating to deeply traumatic depending on your mental narrative. some examples:
Possible root causes if we don't end up having a good long term future (i.e., realize most of the potential value of the universe), with illustrative examples:
Diminishing returns in the NanoGPT speedrun:
To determine whether we're heading for a software intelligence explosion, one key variable is how much harder algorithmic improvement gets over time. Luckily someone made the NanoGPT speedrun, a repo where people try to minimize the amount of time on 8x H100s required to train GPT-2 124M down to 3.28 loss. The record has improved from 45 minutes in mid-2024 down to 1.92 minutes today, a 23.5x speedup. This does not give the whole picture-- the bulk of my uncertainty is in other variables-- but given this is exist...
Cool, this clarifies things a good amount for me. Still have some confusion about how you are modeling things, but I feel less confused. Thank you!
The “when no one’s watching” feature
Can we find a “when no one’s watching” feature, or its opposite, “when someone’s watching”? This could be a useful and plausible precursor to deception
With sufficiently powerful systems, this feature may be *very* hard to elicit, as by definition, finding the feature means someone’s watching and requires tricking / honeypotting the model to take it off its own modeling process of the input distribution.
This could be a precursor to deception. One way we could try to make deception harder for the model is to never d...
Two months ago I said I'd be creating a list of predictions about the future in honor of my baby daughter Artemis. Well, I've done it, in spreadsheet form. The prediction questions all have a theme: "Cyberpunk." I intend to make it a fun new year's activity to go through and make my guesses, and then every five years on her birthdays I'll dig up the spreadsheet and compare prediction to reality.
I hereby invite anybody who is interested to go in and add their own predictions to the spreadsheet. Also feel free to leave comments ...
LLMs (and probably most NNs) have lots of meaningfull, interpretable linear feature directions. These can be found though various unsupervised methods (e.g. SAEs) and supervised methods (e.g. linear probs).
However, most human interpretable features, are not what I would call the models true features.
If you find the true features, the network should look sparse and modular, up to noise factors. If you find the true network decomposition, than removing the what's left over should imporve performance, not make it wors.
Because the network has a limited number ...
I do exect some amount of superpossition, i.e. the model is using almost orthogonal directions to encode more concept than it has neurons. Depending on what you mean by "larger" this will result in a world model that is larger than the network. However such an encoding will also result in noise. Superpossition will nessesarely lead to unwanted small amplitude connections between uncorelated concepts. Removing these should imporve performance, and if it dosn't it means that you did the decomposition wrong.
Overconfidence from early transformative AIs is a neglected, tractable, and existential problem.
If early transformative AIs are overconfident, then they might build ASI/other dangerous technology or come up with new institutions that seem safe/good, but ends up being disastrous.
This problem seems fairly neglected and not addressed by many existing agendas (i.e., the AI doesn't need to be intent-misaligned to be overconfident).[1]
Overconfidence also feels like a very "natural" trait for the AI to end up having relative to the pre-training prior, compa...
I do judge comments more harshly when they're phrased confidently—your tone is effectively raising the stakes on your content being correct and worth engaging with.
If I agreed with your position, I'd probably have written something like:
...I don't think this is an important source of risk. I think that basically all the AI x-risk comes from AIs that are smart enough that they'd notice their own overconfidence (maybe after some small number of experiences being overconfident) and then work out how to correct for it.
There are other epistemic problems that I thi
I propose a taxonomy of 5 possible worlds for multi-agent theory, inspired by Imagliazzo's 5 possible worlds of complexity theory (and also the Aaronson-Barak 5 worlds of AI):