2020

Frontpage Posts

Shortform

Vanessa Kosoy · 4mo

I have
[https://www.alignmentforum.org/posts/Ajcq9xWi2fmgn8RBJ/the-credit-assignment-problem#X6fFvAHkxCPmQYB6v]
repeatedly
[https://www.alignmentforum.org/posts/dPmmuaz9szk26BkmD/vanessa-kosoy-s-shortform#TzkG7veQAMMRNh3Pg]
argued
[https://www.alignmentforum.org/posts/3qXE6fK47JhSfkpnB/do-sufficiently-advanced-agents-use-logic#fEKc88NbDWZavkW9o]
for a departure from pure Bayesianism that I call "quasi-Bayesianism". But,
coming from a LessWrong-ish background, it might be hard to wrap your head
around the fact that Bayesianism is somehow deficient. So, here's another way to
understand it, using Bayesianism's own favorite trick: Dutch booking!
Consider a Bayesian agent Alice. Since Alice is Bayesian, ey never randomize: ey
just follow a Bayes-optimal policy for eir prior, and such a policy can always
be chosen to be deterministic. Moreover, Alice always accepts a bet if ey can
choose which side of the bet to take: indeed, at least one side of any bet has
non-negative expected utility. Now, Alice meets Omega. Omega is very smart so ey
know more than Alice and moreover ey can predict Alice. Omega offers Alice a
series of bets. The bets are specifically chosen by Omega s.t. Alice would pick
the wrong side of each one. Alice takes the bets and loses, indefinitely. Alice
cannot escape eir predicament: ey might know, in some sense, that Omega is
cheating em, but there is no way within the Bayesian paradigm to justify turning
down the bets.
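A toy simulation of Alice's predicament (simplified: Alice's belief is held fixed rather than Bayes-updated, and all bets are unit-stake; the point is only that a deterministic, bet-accepting agent is exploitable by a predictor):

```python
import random

random.seed(0)

def alice_choose(p_alice):
    """Alice is deterministic: she takes whichever side of a unit bet has
    higher expected utility under her subjective probability p_alice."""
    ev_for = 2 * p_alice - 1        # bet on the event: +1 if it happens, -1 if not
    ev_against = 1 - 2 * p_alice    # bet against the event
    return "for" if ev_for >= ev_against else "against"

def omega_true_prob(p_alice):
    """Omega predicts Alice's (deterministic) choice and sets up the bet so
    that her chosen side loses in expectation."""
    return 0.1 if alice_choose(p_alice) == "for" else 0.9

p_alice = 0.7   # Alice's subjective probability of the event
wealth = 0.0
for _ in range(1000):
    p_true = omega_true_prob(p_alice)
    event = random.random() < p_true
    won = (alice_choose(p_alice) == "for") == event
    wealth += 1.0 if won else -1.0

print(wealth)  # strongly negative: each round has expected value -0.8 for Alice
```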
A possible counterargument is, we don't need to depart far from Bayesianism to
win here. We only need to somehow justify randomization, perhaps by something
like infinitesimal random perturbations of the belief state (like with
reflective oracles). But, in a way, this is exactly what quasi-Bayesianism does:
a quasi-Bayes-optimal policy is in particular Bayes-optimal when the prior is
taken to be in Nash equilibrium of the associated zero-sum game. However,
Bayes-optimality underspecifies the policy: not every optimal reply to a Nash
equil

5d

I think instrumental convergence also occurs in the model space for machine
learning. For example, many different architectures likely learn edge detectors
in order to minimize classification loss on MNIST. But wait - you'd also learn
edge detectors to maximize classification loss on MNIST (loosely, getting 0% on
a multiple-choice exam requires knowing all of the right answers). I bet you'd
learn these features for a wide range of cost functions. I wonder if that's
already been empirically investigated?
And, same for adversarial features. And perhaps, same for mesa optimizers
(understanding how to stop mesa optimizers from being instrumentally convergent
seems closely related to solving inner alignment).
What can we learn about this?
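A tiny sanity check of the min/max symmetry (not MNIST; a hypothetical two-blob dataset and plain logistic regression): gradient ascent on the loss recovers the same discriminative direction as gradient descent, just flipped in sign.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: two Gaussian blobs, separable along the (1, 1) direction.
X = np.vstack([rng.normal(-2.0, 1.0, (100, 2)), rng.normal(2.0, 1.0, (100, 2))])
y = np.array([0] * 100 + [1] * 100)

def fit(sign, steps=200, lr=0.05):
    """sign=+1: gradient descent minimizes cross-entropy; sign=-1 maximizes it."""
    w = np.zeros(2)
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-np.clip(X @ w, -500, 500)))
        grad = X.T @ (p - y) / len(y)
        w -= sign * lr * grad
    return w

w_min = fit(+1)   # loss minimizer
w_max = fit(-1)   # loss maximizer

cos = w_min @ w_max / (np.linalg.norm(w_min) * np.linalg.norm(w_max))
print(cos)  # close to -1: maximizing the loss learns the same feature, negated
```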

6mo

Some thoughts about embedded agency.
From a learning-theoretic perspective, we can reformulate the problem of
embedded agency as follows: What kind of agent, and in what conditions, can
effectively plan for events after its own death? For example, Alice bequeaths
eir fortune to eir children, since ey want them to be happy even when Alice emself
is no longer alive. Here, "death" can be understood to include modification,
since modification effectively destroys an agent and replaces it with a
different agent[1] [#fn-pou4BZeg2pAnxBcvw-1]. For example, Clippy 1.0 is an AI
that values paperclips. Alice disables Clippy 1.0 and reprograms it to value
staples before running it again. Then, Clippy 2.0 can be considered to be a new,
different agent.
First, in order to meaningfully plan for death, the agent's reward function has
to be defined in terms of something other than its direct perceptions.
Indeed, by definition the agent no longer perceives anything after death.
Instrumental reward functions
[https://www.alignmentforum.org/posts/aAzApjEpdYwAxnsAS/reinforcement-learning-with-imperceptible-rewards]
are somewhat relevant but still don't give the right object, since the reward is
still tied to the agent's actions and observations. Therefore, we will consider
reward functions defined in terms of some fixed ontology of the external world.
Formally, such an ontology can be an incomplete[2] [#fn-pou4BZeg2pAnxBcvw-2]
Markov chain, the reward function being a function of the state. Examples:
* The Markov chain is a representation of known physics (or some sector of
known physics). The reward corresponds to the total mass of diamond in the
world. To make this example work, we only need enough physics to be able to
define diamonds. For example, we can make do with quantum electrodynamics +
classical gravity and have the Knightian uncertainty account for all nuclear
and high-energy phenomena.
* The Markov chain is a representation of people and
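The bequest story above can be turned into a toy computation: the reward is attached to an external world-state that keeps evolving after the agent's "death", so an action whose entire payoff arrives posthumously can still dominate. All states, numbers, and action names below are hypothetical:

```python
# Toy ontology: world states 0 = start, 1 = squandered, 2 = endowed (absorbing).
# The state reward keeps accruing after the agent is destroyed at t = 1.
state_reward = [0.0, 0.0, 1.0]
gamma = 0.95
horizon = 200   # post-death steps we account for

def value(action):
    if action == "consume":        # reward now, bad world-state afterwards
        immediate, post_death_state = 0.5, 1
    else:                          # "bequeath": all reward arrives after death
        immediate, post_death_state = 0.0, 2
    # The chain runs on its own after the agent dies, but reward still accrues.
    post = sum(gamma**t * state_reward[post_death_state]
               for t in range(1, horizon))
    return immediate + post

print(value("consume"), value("bequeath"))  # bequeathing dominates
```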

4mo

Learning theory distinguishes between two types of settings: realizable and
agnostic (non-realizable). In a realizable setting, we assume that there is a
hypothesis in our hypothesis class that describes the real environment
perfectly. We are then concerned with the sample complexity and computational
complexity of learning the correct hypothesis. In an agnostic setting, we make
no such assumption. We therefore consider the complexity of learning the best
approximation of the real environment. (Or, the best reward achievable by some
space of policies.)
In offline learning and certain varieties of online learning, the agnostic
setting is well-understood. However, in more general situations it is poorly
understood. The only agnostic result for long-term forecasting that I know is
Shalizi 2009 [https://projecteuclid.org/euclid.ejs/1256822130], however it
relies on ergodicity assumptions that might be too strong. I know of no agnostic
result for reinforcement learning.
Quasi-Bayesianism was invented to circumvent the problem. Instead of considering
the agnostic setting, we consider a "quasi-realizable" setting: there might be
no perfect description of the environment in the hypothesis class, but there are
some incomplete descriptions. But, so far I haven't studied quasi-Bayesian
learning algorithms much, so how do we know it is actually easier than the
agnostic setting? Here is a simple example to demonstrate that it is.
Consider a multi-armed bandit, where the arm space is [0,1]. First, consider the
following realizable setting: the reward is a deterministic function r:[0,1]→[0,1]
which is known to be a polynomial of degree at most d. In this setting, learning
is fairly easy: it is enough to sample d+1 arms in order to recover the reward
function and find the optimal arm. It is a special case of the general
observation that learning is tractable when the hypothesis space is
low-dimensional in the appropriate sense.
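The realizable polynomial bandit really is this easy; a sketch with numpy (the degree bound d and the coefficients are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3  # known degree bound

# Hidden reward function: a degree-d polynomial on [0, 1].
true_coeffs = rng.uniform(-1, 1, d + 1)
def r(x):
    return np.polyval(true_coeffs, x)

# Realizable learning: d+1 distinct arms pin down the polynomial exactly.
xs = np.linspace(0.0, 1.0, d + 1)
fit = np.polyfit(xs, r(xs), d)  # exact interpolation (no noise)

# Recover the optimal arm by maximizing the fitted polynomial on a grid.
grid = np.linspace(0.0, 1.0, 10001)
best_arm = grid[np.argmax(np.polyval(fit, grid))]
print(best_arm)
```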
Now, consider a closely related agnostic setting.

2mo

This idea was inspired by a correspondence with Adam Shimi.
It seems very interesting and important to understand to what extent a purely
"behaviorist" view on goal-directed intelligence is viable. That is, given a
certain behavior (policy), is it possible to tell whether the behavior is
goal-directed and what are its goals, without any additional information?
Consider a general reinforcement learning setting: we have a set of actions A,
a set of observations O, a policy is a mapping π:(A×O)∗→ΔA, a reward function is
a mapping r:(A×O)∗→[0,1], the utility function is a time discounted sum of
rewards. (Alternatively, we could use instrumental reward functions
[https://www.alignmentforum.org/posts/aAzApjEpdYwAxnsAS/reinforcement-learning-with-imperceptible-rewards]
.)
The simplest attempt at defining "goal-directed intelligence" is requiring that
the policy π in question is optimal for some prior and utility function.
However, this condition is vacuous: the reward function can artificially reward
only behavior that follows π, or the prior can believe that behavior not
according to π leads to some terrible outcome.
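The vacuousness is easy to exhibit concretely: given any policy π, the reward function that pays 1 exactly for π-compliant actions makes π optimal regardless of the prior. A sketch with hypothetical two-element action and observation sets:

```python
import itertools

# Hypothetical action/observation sets; histories are sequences of (a, o).
A = ["a0", "a1"]
O = ["o0", "o1"]

def pi(history):
    # An arbitrary but fixed deterministic policy: any rule works here.
    return A[len(str(history)) % len(A)]

def r(history, action):
    # Contrived reward: pays 1 exactly when the action agrees with pi.
    return 1.0 if action == pi(history) else 0.0

# On every history, pi attains the maximum possible reward, so pi is
# Bayes-optimal for (r, any prior) despite encoding no goal at all.
for history in itertools.product(itertools.product(A, O), repeat=2):
    assert r(history, pi(history)) == max(r(history, a) for a in A)
print("pi is trivially optimal for its tailored reward")
```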
The next natural attempt is bounding the description complexity of the prior and
reward function, in order to avoid priors and reward functions that are
"contrived". However, description complexity is only naturally well-defined up
to an additive constant. So, if we want to have a crisp concept, we need to
consider an asymptotic in which the complexity of something goes to infinity.
Indeed, it seems natural to ask that the complexity of the policy should be much
higher than the complexity of the prior and the reward function: in this case we
can say that the "intentional stance" is an efficient description. However, this
doesn't make sense with description complexity: the description "optimal policy
for U and ζ" is of size K(U)+K(ζ)+O(1) (K(x) stands for "description complexity
of x").
To salvage this idea, we need to take not only description complexity

2019

Frontpage Posts

Shortform

8mo

I've been thinking a lot about 'parallel economies' recently. One of the main
differences between 'slow takeoff' and 'fast takeoff' predictions is whether AI
is integrated into the 'human civilization' economy or constructing a separate
'AI civilization' economy. Maybe it's worth explaining a bit more what I mean by
this: you can think of 'economies' as collections of agents who trade with each
other. Often it will have a hierarchical structure, and where we draw the lines
is sort of arbitrary. Imagine a person who works at a company and participates
in its internal economy, and the company participates in national and global
economies, and the person participates in those economies as well. A better
picture has a very dense graph with lots of nodes and links between groups of
nodes whose heaviness depends on the number of links between nodes in those
groups.
As Adam Smith argues, the ability of an economy to support specialization of
labor depends on its size. If you have an island with a single inhabitant, it
doesn't make sense to fully employ a farmer (since a full-time farmer can
generate much more food than a single person could eat); for a village with 100
inhabitants it doesn't make sense to farm more than would feed a hundred mouths,
and so on. But as you make more and more of a product, investments that have a
small multiplicative payoff become better and better, to the point that a planet
with ten billion people will have massive investment in farming specialization
that makes it vastly more efficient per unit than the village farming system. So
for much of history, increased wealth has been driven by this increased
specialization of labor, which was driven by the increased size of the economy
(both through population growth and decreased trade barriers widening the links
between economies until they effectively became one economy).
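A back-of-the-envelope version of the scale argument (all numbers hypothetical): a specialization with fixed cost F and multiplicative gain g only pays for itself above a break-even output.

```python
# Hypothetical numbers throughout: a specialization costs a fixed F and
# multiplies per-unit value by g; it pays off only above a break-even scale.
F = 1000.0        # fixed cost of the specialized tooling/training
g = 1.2           # multiplicative productivity gain
unit_value = 1.0  # value per unit of output without the specialization

def net_gain(quantity):
    return (g - 1.0) * unit_value * quantity - F

break_even = F / ((g - 1.0) * unit_value)
print(break_even)  # roughly 5000 units; below that, specializing loses money
```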
One reason to think economies will remain integrated is because increased size
benefits all actors in the economy on net; a

9mo

Game theory is widely considered the correct description of rational behavior in
multi-agent scenarios. However, real world agents have to learn, whereas game
theory assumes perfect knowledge, which can be only achieved in the limit at
best. Bridging this gap requires using multi-agent learning theory to justify
game theory, a problem that is mostly open (but some results exist). In
particular, we would like to prove that learning agents converge to game
theoretic solutions such as Nash equilibria (putting superrationality aside: I
think that superrationality should manifest via modifying the game rather than
abandoning the notion of Nash equilibrium).
The simplest setup in (non-cooperative) game theory is normal form games.
Learning happens by accumulating evidence over time, so a normal form game is
not, in itself, a meaningful setting for learning. One way to solve this is
replacing the normal form game by a repeated version. This, however, requires
deciding on a time discount. For sufficiently steep time discounts, the repeated
game is essentially equivalent to the normal form game (from the perspective of
game theory). However, the full-fledged theory of intelligent agents requires
considering shallow time discounts, otherwise there is no notion of long-term
planning. For shallow time discounts, the game theory of a repeated game is very
different from the game theory of the original normal form game. In fact, the
folk theorem asserts that any payoff vector above the maximin of each player is
a possible Nash payoff. So, proving convergence to a Nash equilibrium amounts
(more or less) to proving convergence to at least the maximin payoff. This is
possible using incomplete models
[https://www.alignmentforum.org/posts/5bd75cc58225bf0670375575/the-learning-theoretic-ai-alignment-research-agenda]
, but doesn't seem very interesting: to receive the maximin payoff, the agents
only have to learn the rules of the game, they need not learn the reward
functions of the othe
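For a toy normal-form game, the maximin payoff mentioned above is easy to compute directly; a sketch for matching pennies (grid search over mixed strategies suffices for two actions; in general this is a linear program):

```python
import numpy as np

# Matching pennies: row player's payoff matrix (zero-sum, maximin value 0).
A = np.array([[ 1.0, -1.0],
              [-1.0,  1.0]])

# Row player's maximin over mixed strategies (p, 1-p), via grid search.
ps = np.linspace(0.0, 1.0, 100001)
# Expected payoff of each mixture against each pure column reply.
payoff_vs_col = np.outer(ps, A[0]) + np.outer(1.0 - ps, A[1])  # shape (P, 2)
security = payoff_vs_col.min(axis=1)   # worst case over the column player
maximin = security.max()
best_p = ps[security.argmax()]

print(best_p, maximin)  # ≈ 0.5, ≈ 0.0: mix uniformly to guarantee the value
```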

8mo

One challenge for theories of embedded agency over Cartesian theories is that
the 'true dynamics' of optimization (where a function defined over a space
points to a single global maximum, possibly achieved by multiple inputs) are
replaced by the 'approximate dynamics'. But this means that by default we get
the hassles associated with numerical approximations, like when integrating
differential equations. If you tell me that you're doing Euler's Method on a
particular system, I need to know lots about the system and about the particular
hyperparameters you're using to know how well you'll approximate the true
solution. This is the toy version of trying to figure out how a human reasons
through a complicated cognitive task; you would need to know lots of details
about the 'hyperparameters' of their process to replicate their final result.
This makes getting guarantees hard. We might be able to establish what the
'sensible' solution range for a problem is, but establishing what algorithms can
generate sensible solutions under what parameter settings seems much harder.
Imagine trying to express what the set of deep neural network parameters are
that will perform acceptably well on a particular task (first for a particular
architecture, and then across all architectures!).
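The Euler's Method analogy can be made concrete: even on the simplest ODE, the approximation error is governed by a hyperparameter (the step size) and shrinks roughly linearly with it.

```python
import math

# Euler's method for y' = -y, y(0) = 1; exact solution y(1) = exp(-1).
# The quality of the 'approximate dynamics' hinges on a hyperparameter,
# the step size h, exactly as the post describes.
def euler(h):
    y = 1.0
    for _ in range(round(1.0 / h)):
        y += h * (-y)   # one explicit Euler step
    return y

exact = math.exp(-1.0)
err_coarse = abs(euler(0.1) - exact)
err_fine = abs(euler(0.01) - exact)
print(err_coarse, err_fine)  # global error shrinks roughly 10x with 10x smaller h
```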

9mo

This is a preliminary description of what I dubbed Dialogic Reinforcement Learning
(credit for the name goes to tumblr user @di--es---can-ic-ul-ar--es): the
alignment scheme I currently find most promising.
It seems that the natural formal criterion for alignment (or at least the main
criterion) is having a "subjective regret bound": that is, the AI has to
converge (in the long term planning limit, γ→1 limit) to achieving optimal
expected user!utility with respect to the knowledge state of the user. In order
to achieve this, we need to establish a communication protocol between the AI
and the user that will allow transmitting this knowledge state to the AI
(including knowledge about the user's values). Dialogic RL attacks this problem
in the manner which seems the most straightforward and powerful: allowing the AI
to ask the user questions in some highly expressive formal language, which we
will denote F.
F allows making formal statements about a formal model M of the world, as seen
from the AI's perspective. M includes such elements as observations, actions,
rewards and corruption. That is, M reflects (i) the dynamics of the environment
(ii) the values of the user (iii) processes that either manipulate the user, or
damage the ability to obtain reliable information from the user. Here, we can
use different models of values: a traditional "perceptible" reward function, an
instrumental reward function
[https://www.alignmentforum.org/posts/aAzApjEpdYwAxnsAS/reinforcement-learning-with-imperceptible-rewards]
, a semi-instrumental reward function, dynamically-inconsistent rewards
[https://www.alignmentforum.org/posts/aPwNaiSLjYP4XXZQW/ai-alignment-open-thread-august-2019#C9gRtMRc6qLv7J6k7]
, rewards with Knightian uncertainty, etc. Moreover, the setup is
self-referential in the sense that M also reflects the question-answer
interface and the user's behavior.
A single question can consist, for example, of asking for the probability of
some sentence in F or the expected

7mo

In the past I considered the learning-theoretic approach to AI theory
[https://www.alignmentforum.org/posts/5bd75cc58225bf0670375575/the-learning-theoretic-ai-alignment-research-agenda]
as somewhat opposed to the formal logic approach popular in MIRI
[https://www.alignmentforum.org/posts/3qXE6fK47JhSfkpnB/do-sufficiently-advanced-agents-use-logic]
(see also discussion
[https://www.alignmentforum.org/posts/3qXE6fK47JhSfkpnB/do-sufficiently-advanced-agents-use-logic#fEKc88NbDWZavkW9o]
):
* Learning theory starts from formulating natural desiderata for agents,
whereas "logic-AI" usually starts from postulating a logic-based model of the
agent ad hoc.
* Learning theory naturally allows analyzing computational complexity whereas
logic-AI often uses models that are either clearly intractable or even
clearly incomputable from the onset.
* Learning theory focuses on objects that are observable or
finite/constructive, whereas logic-AI often considers objects that
unobservable, infinite and unconstructive (which I consider to be a
philosophical error).
* Learning theory emphasizes induction whereas logic-AI emphasizes deduction.
However, recently I noticed that quasi-Bayesian reinforcement learning
[https://www.alignmentforum.org/posts/Ajcq9xWi2fmgn8RBJ/the-credit-assignment-problem#X6fFvAHkxCPmQYB6v]
and Turing reinforcement learning
[https://www.alignmentforum.org/posts/3qXE6fK47JhSfkpnB/do-sufficiently-advanced-agents-use-logic#fEKc88NbDWZavkW9o]
have very suggestive parallels to logic-AI. TRL agents have beliefs about
computations they can run on the envelope: these are essentially beliefs about
mathematical facts (but, we only consider computable facts and computational
complexity plays some role there). QBRL agents reason in terms of hypotheses
that have logical relationships between them: the order on functions corresponds
to implication, taking the minimum of two functions corresponds to logical
"and", taking the concave hull of two func