There is something very deep going on with pessimism: the same general method can produce a truthful agent, prevent feedback tampering, and solve the ELK challenge. Both theoretical and empirical researchers have found that pessimism produces policies that are robust to distributional shift. And it is extremely simple, not...
Here is a proposal for Inverse Reinforcement Learning in General Environments (2 1/2 pages; very little math). Copying the introduction here: The eventual aim of IRL is to understand human goals. However, typical algorithms for IRL assume the environment is finite-state Markov, and it is often left unspecified how raw...
It is a relatively intuitive thought that if a Bayesian agent is uncertain about its utility function, it will act more conservatively until it has a better handle on what its true utility function is. This might be deeply flawed in a way that I'm not aware of, but I'm...
I showed recently, predicated on a few assumptions, that a certain agent was asymptotically “benign” with probability 1. (That term may be replaced by something like “domesticated” in the next version, but I’ll use “benign” for now.) This result leaves something to be desired: namely, an agent which is safe...
Suppose we have an impact measure that we think might work. That is, it might tame a misaligned agent. There isn't an obvious way to test whether it works: if we just try it out and it's ineffective, that's an existential loss. This is a proposal for how to...
Do people think we could make a singleton (or achieve global coordination and preventative policing) just by imitating human policies on computers? If so, this seems pretty safe to me. Some reasons for optimism: 1) these could be run much faster than a human thinks, and 2) we could make...
I'll argue here that we should make an aligned AI which is a causal decision theorist. Son-of-CDT: Suppose we are writing code for an agent with an action space A and an observation space O. The code determines how actions will be selected given the prior history of actions and...