Sorted by New

Wiki Contributions


NLP Position Paper: When Combatting Hype, Proceed with Caution

Some minor feedback points: Just from reading the abstract and intro, this could be read as a non-sequitur: "It limits our ability to mitigate short-term harms from NLP deployments". Also, calling something a "short-term" problem doesn't seem necessary and it may sound like you think the problem is not very important.

Will OpenAI's work unintentionally increase existential risks related to AI?

OpenAI's work speeds up progress, but in a way that's likely smooth progress later on. If you spend as much compute as possible now, you reduce potential surprises in the future.

[AN #78] Formalizing power and instrumental convergence, and the end-of-year AI safety charity comparison


On 2): Being overparameterized doesn't mean you fit all your training data. It just means that you could fit it with enough optimization. Perhaps the existence of some Savant people shows that the brain could memorize way more than it does.

On 3): The number of our synaptic weights is stupendous too - about 30000 for every second in our life.

On 4): You can underfit at the evolution level and still overparameterize at the individual level.

Overall you convinced me that underparameterization is less likely though. Especially on your definition of overparameterization, which is relevant for double descent.

[AN #78] Formalizing power and instrumental convergence, and the end-of-year AI safety charity comparison

Why do you think that humans are, and powerful AI systems will be, severely underparameterized?

Strategic implications of AIs' ability to coordinate at low cost, for example by merging

Also interesting to see that all of these groups were able to coordinate to the disadvantage of less coordinates groups, but not able to reach peace among themselves.

One explanation might be that the more coordinated groups also have harder coordination problems to solve because their world is bigger and more complicated. Might be the same with AI?

Seeking Power is Often Convergently Instrumental in MDPs

If X is "number of paperclips" and Y is something arbitrary that nobody optimizes, such as the ratio of number of bicycles on the moon to flying horses, optimizing X should be equally likely to increase or decrease Y in expectation. Otherwise "1-Y" would go in the opposite direction which can't be true by symmetry. But if Y is something like "number of happy people", Y will probably decrease because the world is already set up to keep Y up and a misaligned agent could disturb that state.

Seeking Power is Often Convergently Instrumental in MDPs

I should've specified that the strong version is "Y decreases relative to a world where neither of X nor Y are being optimized". Am I right that this version is not true?

Seeking Power is Often Convergently Instrumental in MDPs

Thanks for writing this! It always felt like a blind spot to me that we only have Goodhart's law that says "if X is a proxy for Y and you optimize X, the correlation breaks" but we really mean a stronger version: "if you optimize X, Y will actively decrease". Your paper clarifies that what we actually mean is an intermediate version: "if you optimize X, it becomes a harder to optimize Y". My conclusion would be that the intermediate version is true but the strong version false then. Would you say that's an accurate summary?

[AN #60] A new AI challenge: Minecraft agents that assist human players in creative mode

Costs don't really grow linearly with model size because utilization goes down as you spread a model across many GPUs. I. e. aggregate memory requirements grow superlinearly. Relatedly, model sizes increased <100x while compute increased 300000x on OpenAI's data set. That's been updating my views a bit recently.

People are trying to solve this with things like GPipe, but I don't know yet if there can be an approach that scales to many more TPUs than what they tried (8). Communication would be the next bottleneck.

The Three Levels of Goodhart's Curse

(also x-posted from

Another, speculative point: If and were my utility function and my friend's, my intuition is that an agent that optimizes the wrong function would act more robustly. If true, this may support the theory that Goodhart's curse for AI alignment would be to a large extent a problem of defending against adversarial examples by learning robust features similar to human ones. Namely, the robust response may be because me and my friend have learned similar robust, high-level features; we just give them different importance.

Load More