Daniel Kokotajlo

Philosophy PhD student, worked at AI Impacts, then Center on Long-Term Risk, now OpenAI Futures/Governance team. Views are my own & do not represent those of my employer. I subscribe to Crocker's Rules and am especially interested to hear unsolicited constructive criticism. http://sl4.org/crocker.html


Agency: What it is and why it matters
AI Timelines
Takeoff and Takeover in the Past and Future

Wiki Contributions


This story involves an AI with an internal censor that deletes deceptive thoughts. But since the AI is trying to make rapid progress towards its goals in the wetlab, it eventually stumbles into a workaround that lets it avoid the censor.

What if instead of an internal censor, it had an internal plan-goodness-classifier that heavily downweighted plans which involved deception, and also applied on the meta level (e.g. a plan such as 'think long and hard about how to get similar results as deception without technically counting as deception' would be classified as deceptive and heavily downweighted)?

...attempting to answer my own question...

The meta-level thing is definitely necessary, otherwise the system quickly and deliberately finds a workaround. (Though maybe at merely human levels of intelligence this is not catastrophic? Humans find workarounds all the time, little white lies, stretching and bending the rules, etc. yet systems often work well enough in practice.)

But yeah assume we have the meta-level thing. It's not that the cognition of the system is mysteriously failing; it's that it is knowingly averse to deception and to thinking about how it can 'get around' or otherwise undermine this aversion.

It could still randomly stumble into a workaround. If a particular plan occurs to it that doesn't get classified as deception but achieves similar results, it'll go for it. But this is unlikely in practice because it won't be spending cognition trying to find such plans, because doing so would trigger the aversion.

I guess you'd say that as the system gets generally smarter, it becomes likely in practice, because it'll just be doing things like "apply clever cognitive strategies like reframing the problem and then brute-force searching for solutions" and this will be a distribution shift for the deception-classifier so it'll fail, even though at no point was the system intending to make the deception-classifier stay silent... But what if it isn't a distribution shift? What if the deception classifier is robust enough that no matter how you reframe the problem, it always runs some sort of routine analogous to "OK, but is this proposed plan deception? Let me translate it back and forth, consider it from a few different angles, etc. and see if it seems deceptive in any way." 

... I'm not sure what to think but I still have hope that the 'robust nondeceptiveness' thing I've been describing is natural enough that systems might learn it with sufficiently careful, sufficiently early training.

It sure sounds like you are saying that though!

Me, reflecting afterwards: hmm... Cynically,[2] not publishing is a really good way to create a moat around your research... People who want to work on that area have to come talk to you, and you can be a gatekeeper.  And you don't have to worry about somebody with more skills and experience coming along and trashing your work or out-competing you and rendering it obsolete...

I'm not accusing anyone of having bad motivations; I think it is almost always valuable to consider both people's concious motivations and their incentives (which may be subconscious (EtA: or indirect) drivers of their behavior).

Before you put in the EtA, it sure sounded like you were saying that people were subconsciously motivated to avoid academic publishing because it helped them build and preserve a moat. Now, after the EtA, it still sounds like that but is a bit more unclear since 'indirect' is a bit more ambiguous than 'subconscious.'

Worth it to the world/humanity/etc. though maybe some of them are more self-focused.

Probably a big chunk of it is lost for that reason yeah. I'm not sure what your point is, it doesn't seem to be a reply to anything I said.

Here are two hypotheses for why they don't judge those costs to be worth it, each one of which is much more plausible to me than the one you proposed:

(1) The costs aren't in fact worth it & they've reacted appropriately to the evidence.
(2) The costs are worth it, but thanks to motivated reasoning, they exaggerate the costs, because writing things up in academic style and then dealing with the publication process is boring and frustrating.

Seriously, isn't (2) a much better hypothesis than the one you put forth about moats?

I think your cynical take is pretty wrong, for the reasons Evan described. I'd add that because of the way academic prestige works, you are vulnerable to having your ideas stolen if you just write them up on LessWrong and don't publish them. You'll definitely get fewer citations, less recognition, etc.

I think people's stated motivations are the real motivations: Jumping through hoops to format your work for academia has opportunity costs and they don't judge those costs to be worth it.

Nice post! I need to think about this more, but:

(1) Maybe if what we are aiming for is honesty & corrigibility to help us build a successor system, it's OK that the NN will learn concepts like the actual algorithm humans implement rather than some idealized version of that algorithm after much reflection and science. If we aren't optimizing super hard, maybe that works well enough? 

(2) Suppose we do just build an agentic AGI that's trying to maximize 'human values' (not the ideal thing, the actual algorithm thing) and initially it is about human level intelligence. Insofar as it's going to inevitably go off the rails as it learns and grows and self-improves, and end up with something very far from the ideal thing, couldn't you say the same about humans--over time a human society would also drift into something very far from ideal? If not, why? Is the idea that it's kinda like a random walk in both cases, but we define the ideal as whatever place the humans would end up at?

I found this post very helpful, thanks! If I find time to try to form a more gears-level independent impression about alignment difficulty and possible alignment solutions, I'll use this as my jumping-off point.

Separately, I think it would be cool if a bunch of people got together and played this game for a while and wrote up the results:

like, we could imagine playing a game where i propose a way that it [the AI] diverges [from POUDA-avoidance] in deployment, and you counter by asserting that there's a situation in the training data where it had to have gotten whacked if it was that stupid, and i counter either by a more-sophisticated deployment-divergence or by naming either a shallower or a factually non-[Alice]like thing that it could have learned instead such that the divergence still occurs, and we go back and forth. and i win if you're forced into exotic and unlikely training data, and you win if i'm either forced into saying that it learned unnatural concepts, or if my divergences are pushed so far out that you can fit in a pivotal act before then.

h/t Anthony DiGiovanni who points to this new paper making a weaker version of this point, in the context of normative ethics: Johan E. Gustafsson, Bentham’s Mugging - PhilPapers

Cool stuff! I'm curious to hear how convincing this sort of thing is to typical AI risk skeptics with backgrounds in ML. 

Load More