jacquesthibs — AI Alignment Forum

AI ALIGNMENT FORUM
AF

I’m working on this. I’m unsure if I should be sharing what I’m exactly working on with a frontier AGI lab though. How can we be sure this just leads to differentially accelerating alignment?

Edit: my main consideration is when I should start mentioning details. As in, should I wait until I’ve made progress on alignment internally before sharing with an AGI lab. Not sure what people are disagreeing with since I didn't make a statement.

How to train your own "Sleeper Agents"

jacquesthibs2y42

Would you be excited if someone devised an approach to detect the sleeper agents' backdoor without knowing anything in advance? Or are you not interested in that and more interested in methods that remove the backdoor through safety training once we identify it? Maybe both are interesting?

Evolution provides no evidence for the sharp left turn

jacquesthibs3y178

Here’s my takeaway:

There are mechanistic reasons for humanity’s “Sharp Left Turn” with respect to evolution. Humans were bottlenecked by knowledge transfer between new generations, and the cultural revolution allowed us to share our lifetime learnings with the next generation instead of waiting on the slow process of natural selection.

Current AI development is not bottlenecked in the same way and, therefore, is highly unlikely to get a sharp left turn for the same reason. Ultimately, evolution analogies can lead to bad unconscious assumptions with no rigorous mechanistic understanding. Instead of using evolution to argue for a Sharp Left Turn, we should instead look for arguments that are mechanistically specific to current AI development because we are much less likely to make confused mistakes that unconsciously rely on human evolution assumptions.

AI may still suffer from a fast takeoff (through AI driving capabilities research or iteratively refining it's training data), but for AI-specific reasons so we should be paying attention to that kind of fast takeoff might happen and how to deal with it.

Edited after Quintin's response.

Maze-solving agents: Add a top-right vector, make the agent go to the top-right

jacquesthibs3y20

Indeed! When I looked into model editing stuff with the end goal of “retargeting the search”, the finickiness and break down of internal computations was the thing that eventually updated me away from continuing to pursue this. I haven’t read these maze posts in detail yet, but the fact that the internal computations don’t ruin the network is surprising and makes me think about spending time again in this direction.

I’d like to eventually think of similar experiments to run with language models. You could have a language model learn how to solve a text adventure game, and try to edit the model in similar ways as these posts, for example.

Edit: just realized that the next post might be with GPT-2. Exciting!

Is the "Valley of Confused Abstractions" real?

jacquesthibs3y10

As in, the ratio between (interpretable capabilities / total capabilities) still asymptotes to zero, but the number of interpretable capabilities goes up (and then maybe back down) as the models gain more capabilities?

Oversight Misses 100% of Thoughts The AI Does Not Think

jacquesthibs3y10

Any additional or new thoughts on this? Is your last comment saying that you simply don't think it's very likely at all for the model to unintentionally leave out information that will kill us if we train it with human labelers and prompt sufficiently? Do you believe it's way more likely that we'd be unable to prompt things out of the model only if it were deceptive? Could you say more?

Separately: If I have a chain-of-thought model detailing steps it will take to reach x outcome. We've fine-tuned on previous chain-of-thoughts while giving process-level feedback. However, even if you are trying to get it to externalize it's thoughts/reasoning, it could lead to extinction via side-effect. So you might ask the model at each individual thought (or just the entire plan) if we'll be happy with the outcome. How exactly would the model end up querying its internal world model in the way we would want it to?

How To Go From Interpretability To Alignment: Just Retarget The Search

jacquesthibs3y51

I'm wondering what you think we can learn from approaches like ROME. For those who don't know, ROME is focused on editing factual knowledge (e.g. Eiffel Tower is now in Rome). I'm curious how we could take it beyond factual knowledge. ROME uses causal tracing to find the parts of the model that impact specific factual knowledge the most.

What if we tried to do something similar to find which parts of the model impact the search the most? How would we retarget the search in practice? And in the lead-up to more powerful models, what are the experiments we can do now (retarget the internal "function" the model is using)?

In the case of ROME, the factual knowledge can be edited by modifying the model only a little bit. Is Search at all "editable" like facts or does this kind of approach seem impossible for retargeting search? In the case of approaches like ROME, is creating a massive database of factual knowledge to edit the model the best we can do? Or could we edit the model in more abstract ways (that could impact Search) that point to the things we want?

A descriptive, not prescriptive, overview of current AI Alignment Research

jacquesthibs3y10

I believe all of those posts can be found on the Alignment Forum so, luckily, they are included in the dataset (at least from what I remember after checking a handful of the posts). I had begun scraping from those sources, but realized they were already on AF halfway through.

A survey of tool use and workflows in alignment research

jacquesthibs4y*40

We could definitely look into making the project evolve in this direction. In fact, we're building a dataset of alignment-related texts and a small part of the dataset includes a scrape of arXiv papers extracted from the Alignment Newsletter. We're working towards building GPT models fine-tuned on the texts.

AI ALIGNMENT FORUM
AF

AI ALIGNMENT FORUM
AF

Posts

Wikitag Contributions

Comments