Message me here or at seth dot herd at gmail dot com.
I was a researcher in cognitive psychology and cognitive neuroscience for two decades and change. I studied complex human thought using neural network models of brain function. I'm applying that knowledge to figuring out how we can align AI as developers give it the ability to "think for itself" in all the ways that make humans capable and dangerous.
I work on technical alignment, but doing that has forced me to branch into alignment targets, alignment difficulty, and societal issues and the sociology of the field, because choosing the best technical research approach depends on all of those.
Alignment is the study of how to design and train AI to have goals or values aligned with ours, so we're not in competition with our own creations.
Recent breakthroughs in AI like ChatGPT make it possible we'll have smarter-than-human AIs soon. If we don't understand how to make sure they have only goals we like, they will probably outcompete us, and we'll be either sorry or gone. See this excellent intro video.
There are good and deep reasons to think that aligning AI will be very hard. Section 1 of LLM AGI may reason about its goals is my attempt to describe those briefly and intuitively. But we also have promising solutions that might address those difficulties. They could also be relatively easy to use for the types of AGI we're most likely to develop first.
That doesn't mean I think building AGI is safe. Humans often screw up complex projects, particularly on the first try, and we won't get many tries. If it were up to me I'd Shut It All Down, but I don't see how we could get all of humanity to stop building AGI. So I focus on finding alignment solutions for the types of AGI people are building.
In brief, I think we can probably build and align language model agents (or language model cognitive architectures) up to the point that they're about as autonomous and competent as a human, but then it gets really dicey. We'd use a stacking suite of alignment methods that can mostly or entirely avoid using RL for alignment, and achieve corrigibility (human-in-the-loop error correction) by having a central goal of following instructions. This scenario leaves multiple humans in charge of ASIs, creating some dangerous dynamics, but those problems might be navigated, too.
I did computational cognitive neuroscience research from getting my PhD in 2006 until the end of 2022. I've worked on computational theories of vision, executive function, episodic memory, and decision-making, using neural network models of brain function to integrate data across levels of analysis from psychological down to molecular mechanisms of learning in neurons, and everything in between. I've focused on the interactions between different brain neural networks that are needed to explain complex thought. Here's a list of my publications.
I was increasingly concerned with AGI applications of the research, and reluctant to publish my full theories lest they be used to accelerate AI progress. I'm incredibly excited to now be working full-time on alignment, currently as a research fellow at the Astera Institute.
The field of AGI alignment is "pre-paradigmatic." So I spend a lot of my time thinking about what problems need to be solved, and how we should go about solving them. Solving the wrong problems seems like a waste of time we can't afford.
When LLMs suddenly started looking intelligent and useful, I noted that applying cognitive neuroscience ideas to them might well enable them to reach AGI and soon ASI levels. Current LLMs are like humans with no episodic memory for their experiences, and very little executive function for planning and goal-directed self-control. Adding those cognitive systems to LLMs can make them into cognitive architectures with all of humans' cognitive capacities - a "real" artificial general intelligence that will soon be able to outsmart humans.
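To make that concrete, here's a minimal sketch of what I mean by a language model cognitive architecture - purely my own illustration, with `call_llm`, `CognitiveAgent`, and `recall` as hypothetical stand-ins rather than anyone's actual system. The point is just that episodic memory and an explicit executive/planning step can be wrapped around an LLM with very little scaffolding:

```python
# Toy sketch of "LLM + episodic memory + executive function" as an agent loop.
# Everything here is a hypothetical stand-in; call_llm is whatever model API you'd actually use.

from dataclasses import dataclass, field

def call_llm(prompt: str) -> str:
    """Placeholder for a real language model call."""
    return "..."

@dataclass
class CognitiveAgent:
    goal: str
    episodic_memory: list[str] = field(default_factory=list)  # record of past episodes

    def recall(self, situation: str, k: int = 3) -> list[str]:
        # Episodic memory: retrieve past episodes relevant to the current situation.
        # (A real system would use embedding similarity; substring match keeps this toy runnable.)
        return [m for m in self.episodic_memory if situation.lower() in m.lower()][:k]

    def step(self, observation: str) -> str:
        memories = self.recall(observation)
        # Executive function: explicitly plan against the goal and check the plan
        # before acting, rather than reacting to the prompt alone.
        plan = call_llm(
            f"Goal: {self.goal}\nRelevant past episodes: {memories}\n"
            f"Current observation: {observation}\n"
            "Make a short plan, check it against the goal, then pick one action."
        )
        action = call_llm(f"Plan: {plan}\nOutput only the next action.")
        # Store the episode so future steps can build on it.
        self.episodic_memory.append(f"Saw: {observation} | Did: {action}")
        return action
```

All the hard cognition lives inside the model calls, of course; the claim is only that the surrounding scaffolding is simple, which is part of why I expect this route to AGI to be taken.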
My work since then has convinced me that we might be able to align such an AGI so that it stays aligned as it grows smarter than we are. LLM AGI may reason about its goals and discover misalignments by default is my latest thinking; it's a definite maybe!
I'm trying to fill a particular gap in alignment work. My approach is to focus on thinking through plans for alignment on short timelines and realistic societal assumptions (competition, polarization, and conflicting incentives creating motivated reasoning that distorts beliefs). Many serious thinkers give up on this territory, assuming that either aligning LLM-based AGI turns out to be very easy, or we fail and perish because we don't have much time for new research.
I think it's fairly likely that alignment isn't impossibly hard, but also not easy enough that developers get it right on their own despite all of their biases and incentives, so a little work in advance from outside researchers like me could tip the scales. I think this is a neglected approach (although to be fair, most approaches are neglected at this point, since alignment is so under-funded compared to capabilities research).
One key to my approach is the focus on intent alignment instead of the more common focus on value alignment. Instead of trying to give the AI a definition of ethics it can't misunderstand or re-interpret (value alignment mis-specification), we'll probably continue with the alignment target developers currently focus on: Instruction-following.
It's counter-intuitive to imagine an intelligent entity that wants nothing more than to follow instructions, but there's no logical reason this can't be done. An instruction-following proto-AGI can be instructed to act as a helpful collaborator in keeping it aligned as it grows smarter.
There are significant Problems with instruction-following as an alignment target. It does not solve the problem of corrigibility once an AGI has left our control; it merely gives another route to solving alignment (ordering it to collaborate) while it's still in our control, if we've gotten close enough to the initial target. It also allows selfish humans to seize control. Nonetheless, it seems easier and more likely than value-aligned AGI, so I continue to work on technical alignment under the assumption that's the target we'll pursue.
I increasingly suspect we should be actively working to build parahuman (human-like) LLM agents. It seems like our best hope of survival, since I don't see how we can convince the whole world to pause AGI efforts, and other routes to AGI seem much harder to align since they won't "think" in English chains of thought, or be easy to scaffold and train for System 2 Alignment backstops. Thus far, I haven't been able to get enough careful critique of my ideas to know whether this is wishful thinking, so I haven't embarked on actually helping develop language model cognitive architectures.
Even though these approaches are pretty straightforward, they'd have to be implemented carefully. Humans often get things wrong on their first try at a complex project. So my p(doom) is in the 50% range; our long-term survival as a species is too complex to call. That's despite having a pretty good mix of relevant knowledge and having spent a lot of time working through various scenarios. So I think anyone with a very high or very low estimate is overestimating their certainty.
Doesn't the same argument you make for behaviorist RL failing apply to any non-perfect non-behaviorist RL?
"Follow rules unless you can get away with it" seems to also be an apt description of the non-behaviorist setup's true reward rule. Getting away with it also applies to faking the internal signature of sincerity used for the non-behaviorist reward model, as well as evading the notice of external judges.
So we're still stuck hoping that the simpler generalization wins out and stays dominant even after the system thoroughly understands itself and probably knows it could evade whatever that internal signal is. This is essentially the problem of wireheading, which I regard as largely unsolved since reasonable-seeming opinions differ dramatically.
Using non-behaviorist RL still seems like an improvement on purely behavioral RL. But there's a lot left to understand, as I think you'd agree.
This thought hadn't occurred to me even after going all the way through the longer Self-Dialogue version of this argument twice, so your work at refining the argument might've been critical in jogging it loose in my brain.
This is also encouraging because OpenAI is making some actual claims about safety procedures. Sure they could walk it back pretty easily, but it does indicate that at least as of now they likely intend to try to maintain a faithful CoT.
You assumed no faithful CoT in What goals will AIs have?, suggesting that you expected OpenAI to give up on it. That's concerning given your familiarity with their culture. Of course they still might easily go that way if there's a substantial alignment tax for maintaining faithful CoT, but this is at least nice to see.
This might be the most valuable article on alignment yet written, IMO. I don't have enough upvotes. I realize this sounds like hyperbole, so let me explain why I think this.
This is so valuable because of the effort you've put into a gears-level model of the AGI at the relevant point. The relevant point is the first time the system has enough intelligence and self-awareness to understand and therefore "lock in" its goals (and around the same point, the intelligence to escape human control if it decides to).
Of course this work builds on a lot of other important work in the past. It might be the most valuable so far because it's now possible (with sufficient effort) to make gears-level models of the crucial first AGI systems that are close enough to allow correct detailed conclusions about what goals they'd wind up locking in.
If this gears-level model winds up being wrong in important ways, I think the work is still well worthwhile; it's creating and sharing a model of AGI, and practicing working through that model to determine what goals it would settle on.
I actually think the question of which of those goals wins out can't be answered given the premise. I think we need more detail about the architecture and training to have much of a guess about what goals would wind up dominating (although strictly following developers' intent or closely capturing "the spec" do seem quite unlikely in the scenario you've presented).
So I think this model doesn't yet contain enough gears to allow predicting its behavior (in terms of what goals win out and get locked in or reflectively stable).
Nonetheless, I think this is the work that is most lacking in the field right now: getting down to specifics about the type of systems most likely to become our first takeover-capable AGIs.
My work is attempting to do the same thing. Seven sources of goals in LLM agents lays out the same problem you present here, while System 2 Alignment works toward answering it.
I'll leave more object-level discussion in a separate comment.
I really like this general direction of work: suggestions for capabilities that would also help with understanding and controlling network behavior. That would in turn be helpful for real alignment of network-based AGI. Proposing dual-use capabilities advances seems like a way to get alignment ideas actually implemented. That's what I've done in System 2 Alignment, although that's also a prediction about what developers might try for alignment by default.
Whether the approach you outline here would work is an empirical question, but it sounds likely enough that teams might actually put some effort into it. Preprocessing data to identify authors and similar categories wouldn't be that hard.
This helps with the problem Nate Soares characterized as making cognition aimable at all - having AI pursue one coherent goal (separately from worrying about whether you can direct that "goal slot" toward something that actually works). I think that's the alignment issue you're addressing (along with slop potentially leading to bad AI-assisted alignment). I briefly describe the LLM agent alignment part of that issue in Seven sources of goals in LLM agents.
I hope I'm reading you right about why you think reducing AI slop would help with alignment.
> we're really training LLMs mostly to have a good world model and to follow instructions
I think I mostly agree with that, but it’s less true of o1 / r1-type stuff than what came before, right?
I think it's actually not any less true of o1/r1. It's still mostly predictive/world modeling training, with a dash of human-preference RL which could be described as following instructions as intended in a certain set of task domains. o1/r1 is a bad idea because RL training on a whole CoT works against faithfulness/transparency of the CoT.
If that's all we did, I assume we'd be dead when an agent based on such a system started doing what you describe as the 1-3 loop (which I'm going to term self-optimization). Letting the goals implicit in that training sort of coagulate into explicit goals would probably produce explicit, generalizing goals we'd hate. I find alignment by default wildly unlikely.
But that's not all we'll do when we turn those systems into agents. Developers will probably at least try to give the agent explicit goals, too.
Then there's going to be a complex process where the implicit and explicit goals sort of mix together or compete or something when the agent self-optimizes. Maybe we could think of this as a teenager deciding what their values are, sorting out their biological drives toward hedonism and pleasing others, along with the ideals they've been taught to follow until they could question them.
I think we're going to have to get into detail on how that process of working through goals from different sources might work. That's what I'm trying to do in my current work.
WRT your Optimist Type 2B pessimism: I don't think AI taste should play a role in AI help solving the value alignment problem. If we had any sense (which sometimes we do once problems are right in our faces), we'd be asking the AI "so what happens if we use this alignment approach/goal?" and then using our own taste, not asking it things like "tell us what to do with our future". We could certainly ask for input and there are ways that could go wrong. But I mostly hope for AGI help in the technical part of solving stable value alignment.
I'm not sure I'm more optimistic than you, but I am quite uncertain about how well the likely (low but not zero effort/thought) methods of aligning network-based AGI might go. I think others should be more uncertain as well. Some people being certain of doom while others with real expertise think it's probably going to be fine should be a signal that we do not have this worked through yet.
That's why I like this post and similar attempts to resolve optimist/pessimist disagreements so much.
I place this alongside the Simplicia/Doomimir dialogues as the farthest we've gotten (at least in publicly legible form) on understanding the dramatic disagreements on the difficulty of alignment.
There's a lot here. I won't try to respond to all of it right now.
I think the most important bit is the analysis of arguments for how well alignment generalizes vs. capabilities.
Conceptual representations generalize farther than sensory representations. That's their purpose. So when behavior (and therefore alignment) is governed by conceptual representations, it will generalize relatively well.
When alignment is based on a relatively simple reward model based on simple sensory representations, it won't generalize very well. That's the case with humans. The reward model is running on sensory representations (it has to, so that those representations can be specified in the limited information space of DNA, as you and others have discussed).
Alignment generalizes farther than capabilities in well-educated, carefully considered modern humans because our goals are formulated in terms of concepts. (There are still ways this could go awry, but I think most modern humans would generalize their goals well and lead us into a spectacular future if they were in charge of it).
This could be taken as an argument for using some type of goals selected from learned knowledge for alignment if possible. If we could use natural language (or another route to conceptual representations) to specify an AI's goals, it seems like that would produce better generalization than just trying to train stuff in with RL to produce behavior we like in the training environment.
One method of "conceptual alignment" is the variant of your Plan for mediocre alignment of brain-like [model-based RL] AGI in which you more or less say to a trained AI "hey think about human flourishing" and then set the critic system's weights to maximum. Another is alignment-by-prompting for LLM-based agents; I discuss that in Internal independent review for language model agent alignment. I'm less optimistic now than when I wrote that, given the progress made in training vs. scripting for better metacognition - but I'm not giving up on it just yet. Tan Zhi Xuan makes the point in this interview that we're really training LLMs mostly to have a good world model and to follow instructions, similar to Andrej Karpathy's point that RLHF is just barely RL. It's similar with RLAIF and the reward models training R1 for usability, after the pure RL on verifiable answers. So we're still training models to have good world models and follow instructions. Played wisely, it seems like that could produce aligned LLM agents (should that route reach "real AGI").
That's a new formulation of an old thought, prompted by your framing of pitting the arguments for capabilities generalizing farther than alignment (for evolution making humans) and alignment generalizing farther than capabilities (for modern humans given access to new technologies/capabilities).
The alternative is trying to get an RL system to "gel" into a concept-based alignment we like. This happens with a lot of humans, but that's a pretty specific set of innate drives (simple reward models) and environment. If we monitored and nudged the system closely, that might work too.
Here's my proposal for how we avoid this consequence of consequentialist goals: make the primary goal instruction-following. This is a non-consequentialist goal. All other goals are consequentialist subgoals of that one, when the human gives an instruction to accomplish something.
This would only prevent scheming toward the consequentialist goals you instructed your AGI to pursue if instruction-following were also used to give side-constraints like "don't lie to me" and to grant lots of time for carefully exploring its theories on what its goals mean and how to accomplish them. This approach seems likely to work - but I want to hear more pushback on it before I'd trust it in practice.
I think this is not only an interesting dodge around this class of alignment concerns, but it's the most likely thing to actually be implemented. When someone is actually getting close to launching a system they hope is or will become smarter than they are, they'll think a little harder about making its central goal "solve cancer" or anything else broad and consequentialist. The natural choice is to just extend what LLMs are mostly aligned for now: following instructions, including consequentialist instructions.
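For concreteness, here's a toy sketch of the goal structure I have in mind - again purely illustrative, with `llm` and the constraint list as hypothetical stand-ins rather than a real implementation. The only terminal goal is following the principal's current instruction; instructed tasks become consequentialist subgoals, and standing side-constraints are checked before acting:

```python
# Toy sketch of instruction-following as the primary goal, with instructed tasks as
# consequentialist subgoals and standing side-constraints. Purely illustrative.

def llm(prompt: str) -> str:
    """Placeholder for a real language model call."""
    return "..."

SIDE_CONSTRAINTS = [
    "Don't deceive the principal.",
    "Check in before taking irreversible or high-impact actions.",
]

def act_on_instruction(instruction: str) -> str:
    # The instructed task is a consequentialist subgoal, pursued only because
    # the primary goal is to follow the principal's instructions.
    plan = llm(
        f"Primary goal: follow this instruction from your principal: {instruction}\n"
        f"Standing side-constraints: {SIDE_CONSTRAINTS}\n"
        "Propose a plan that satisfies the instruction without violating any constraint."
    )
    verdict = llm(f"Does this plan violate any of {SIDE_CONSTRAINTS}? Plan: {plan}\nAnswer yes or no.")
    if verdict.strip().lower().startswith("yes"):
        # Corrigibility: defer back to the principal rather than optimizing around the constraint.
        return "Pause and ask the principal for clarification."
    return plan
```

The load-bearing part is the last step: when a plan conflicts with a constraint, the agent defers back to the human rather than scheming around it.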
This logic is all laid out in more detail in Instruction-following AGI is easier and more likely than value aligned AGI, but I didn't specifically address scheming there.
Edit note: you responded to approximately the first half of my eventual comment; sorry! I accidentally committed it half-baked, then quickly added the rest. But the meaning of the first part wasn't really changed, so I'll respond to your comments on that part.
I agree that it's not that simple in practice, because we'd try to avoid that by giving side constraints; but it is that simple in the abstract, and by default. If it followed our initial goal as we intended it there would be no problem; but the core of much alignment worry is that it's really hard to get exactly what we intended into an AI as its goal.
I also agree that good HHH training might be enough to overcome the consequentialist/instrumental logic of scheming. Those tendencies would function as side constraints. The AI would have a "character" that is in conflict with its instrumental goal. Which would win out would be a result of exactly how that goal was implemented in the AI's decision-making procedures, particularly the ones surrounding learning.
To summarize:
Despite thinking the core logic is almost that simple, I think it's useful to have this set of thinking laid out so carefully and in the detail you give here.
I am also still a bit confused as to why this careful presentation is useful. I find the logic so compelling that needing to be walked carefully through it seems strange to me. And yet there are intelligent and well-informed people who say things like "there's no empirical evidence for scheming in AIs" in all seriousness. So I'd like to understand that perspective better.
While I don't fully understand the perspective that needs to be convinced that scheming is likely, I do have some guesses. I think on the whole it stems from understanding current AI systems well, and reasoning from there. Current systems do not really scheme; they lack the capacity. Those who reason by analogy with humans or with fictional or hypothetical generally superintelligent AI see scheming as extremely likely from a misaligned AGI, because they're assuming it will have all the necessary cognitive capacities.
There are more nuanced views, but I think those are the two starting points that generate this dramatic difference in opinions.
Some more specific common cruxes of disagreement on scheming likelihood:
I see the answers to all of these questions as being obviously, inevitably yes by default; all of these are useful, so we will keep building toward AGI with all of these capacities if nothing stops us. Having extremely useful transformative limited AGI (like super-foundation models) would not stop us from building "real AGI" with the above properties.
I've tried to convey why those properties seem so inevitable (and actually rather easy to add from here) in real AGI, Steering subsystems: capabilities, agency, and alignment, and Sapience, understanding, and "AGI", among snippets in other places. I'm afraid none of them is as clear or compelling as I'd like from the perspective of someone who starts reasoning from current AI and asks why or how we would include those dangerous properties in our future AGIs.
That's why I'm glad you guys are taking a crack at it in a more careful and expansive way, and from the perspective of how little we'd need to add to current systems to make them solve important problems, and how that gives rise to scheming. I'll be referencing this post on this point.
Edit note: Most of this was written after an accidental premature submit ctrl-return action.
It seems more straightforward to say that this scopes the training, preventing it from spreading. Including the prompt that accurately describes the training set is making the training more specific to those instructions. That training thereby applies less to the whole space.
Maybe that's what you mean by your first description, and are dismissing it, but I don't see why. It also seems consistent with the second "reward hacking persona" explanation; that persona is trained to apply in general if you don't have the specific instructions to scope when you want it.
It seems pretty clear that this wouldn't help if the data is clean; it would just confuse the model by prompting it to do one thing while teaching it to do something semantically completely different: NOT reward hack.
Your use of "contrary to user instructions/intent" seems wrong if I'm understanding, and I mention it because the difference seems nontrivial and pretty critical to recognize for broader alignment work. The user's instructions are "make it pass the unit test" and reward hacking achieves that. But the user's intent was different than the instructions, to make it pass unit tests for the right reasons - but they didn't say that. So it behaves in accord with instructions but contrary to intent. Right? I think that's a difference that makes a difference when we try to reason through why models do things we don't like.