I propose to call metacosmology the hypothetical field of study which would be concerned with the following questions:
This concept is of potential interest for several reasons:
Here's a sketch of an AIT toy-model theorem to the effect that, in complex environments without traps, applying selection pressure reliably produces learning agents. I view it as an example of Wentworth's "selection theorem" concept.
Consider any environment $\mu$ of infinite Kolmogorov complexity (i.e. uncomputable). Fix a computable reward function $r$.

Suppose that there exists a policy $\pi$ of finite Kolmogorov complexity (i.e. computable) that's optimal for $\mu$ in the slow discount limit. That is, writing $U_\gamma$ for the $\gamma$-discounted sum of rewards,

$$\lim_{\gamma \to 1}\Big(\max_{\pi'}\mathrm{E}^{\mu}_{\pi'}[U_\gamma] - \mathrm{E}^{\mu}_{\pi}[U_\gamma]\Big) = 0$$

Then, $\mu$ cannot be the only environment with this property. Otherwise, this property could be used to define $\mu$ using a finite number of bits, which is impossible[1]. Since $\mu$ requires infinitely many more bits to specify than $\pi$ and $r$, there have to be infinitely many environments with the same property[2]. Therefore, $\pi$ is a reinforcement learning algorithm for some infinite class of hypotheses.
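Spelling out the counting step as a sketch (using the notation above, and modulo the definability caveat in the footnote):

```latex
% If the set of environments for which \pi is slow-discount optimal (w.r.t. r),
%   M := { \nu : \pi is optimal for \nu in the \gamma -> 1 limit },
% were finite, then \mu could be pinned down by programs for \pi and r plus
% \mu's index inside M, so that
\[
K(\mu) \;\le\; K(\pi) + K(r) + O(\log|\mathcal{M}|) \;<\; \infty ,
\]
% contradicting K(\mu) = \infty. Hence M is infinite, and \pi acts as a
% reinforcement learning algorithm for the infinite hypothesis class M.
```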
Moreover, there are natural examples of $\mu$ as above. For instance, let's construct $\mu$ as an infinite sequence of finite communicating infra-RDP refinements tha...
An AI progress scenario which seems possible and which I haven't seen discussed: an imitation plateau.
The key observation is, imitation learning algorithms[1] might produce close-to-human-level intelligence even if they are missing important ingredients of general intelligence that humans have. That's because imitation might be a qualitatively easier task than general RL. For example, given enough computing power, a human mind becomes realizable from the perspective of the learning algorithm, while the world-at-large is still far from realizable. So, an algorithm that only performs well in the realizable setting can learn to imitate a human mind, and thereby indirectly produce reasoning that works in non-realizable settings as well. Of course, literally emulating a human brain is still computationally formidable, but there might be middle scenarios where the learning algorithm is able to produce a good-enough-in-practice imitation of systems that are not too complex.
This opens the possibility that close-to-human-level AI will arrive while we're still missing key algorithmic insights to produce general intelligence directly. Such AI would not be easily scalable to superhuman. Nevert...
The recent success of AlphaProof updates me in the direction of "working on AI proof assistants is a good way to reduce AI risk". If these assistants become good enough, they will supercharge agent foundations research[1] and might make the difference between success and failure. It's especially appealing that this leverages AI capability advancement for the purpose of AI alignment in a relatively[2] safe way, so that the deeper we go into the danger zone, the greater the positive impact[3].
EDIT: To be clear, I'm not saying that working on proof assistants in e.g. DeepMind is net positive. I'm saying that a hypothetical safety-conscious project aiming to create proof assistants for agent foundations research, that neither leaks dangerous knowledge nor repurposes it for other goals, would be net positive.
Of course, agent foundations research doesn't reduce to solving formally stated mathematical problems. A lot of it is searching for the right formalizations. However, obtaining proofs is a critical part of the loop.
There are some ways for proof assistants to feed back into capability research, but these effects seem weaker: at present capability advancement is not primarily dri
I think the main way that proof assistant research feeds into capabilities research is not through the assistants themselves, but by the transfer of the proof assistant research to creating foundation models with better reasoning capabilities. I think researching better proof assistants can shorten timelines.
I propose a new formal desideratum for alignment: the Hippocratic principle. Informally the principle says: an AI shouldn't make things worse compared to letting the user handle them on their own, in expectation w.r.t. the user's beliefs. This is similar to the dangerousness bound I talked about before, and is also related to corrigibility. This principle can be motivated as follows. Suppose your options are (i) run a Hippocratic AI you already have and (ii) continue thinking about other AI designs. Then, by the principle itself, (i) is at least as good as (ii) (from your subjective perspective).
More formally, we consider (some extension of) the delegative IRL setting (i.e. there is a single set of input/output channels the control of which can be toggled between the user and the AI by the AI). Let $\pi^u_\upsilon$ be the user's policy in universe $\upsilon$ and $\pi^a$ the AI policy. Let $E$ be some event that designates when we measure the outcome / terminate the experiment, which is supposed to happen with probability $1$ for any policy. Let $V_\upsilon$ be the value of a state from the user's subjective POV, in universe $\upsilon$. Let $\mu_\upsilon$ be the environment in universe $\upsilon$. Finally, let $\zeta$ be the AI's prior over universes and ...
This idea was inspired by a correspondence with Adam Shimi.
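With the notation above, the principle itself can be sketched as the following inequality, where $s_E$ stands for the state at which the outcome event $E$ occurs; the full definition pins down details this sketch glosses over (e.g. exactly whose beliefs the outer expectation uses):

```latex
% Hippocratic principle (sketch): in expectation over universes, running the
% AI is no worse, by the user's own values, than letting the user act alone.
\[
\mathrm{E}_{\upsilon \sim \zeta}\!\left[\mathrm{E}^{\mu_\upsilon}_{\pi^a}\!\big[V_\upsilon(s_E)\big]\right]
\;\ge\;
\mathrm{E}_{\upsilon \sim \zeta}\!\left[\mathrm{E}^{\mu_\upsilon}_{\pi^u_\upsilon}\!\big[V_\upsilon(s_E)\big]\right]
\]
```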
It seems very interesting and important to understand to what extent a purely "behaviorist" view on goal-directed intelligence is viable. That is, given a certain behavior (policy), is it possible to tell whether the behavior is goal-directed and what its goals are, without any additional information?
Consider a general reinforcement learning setting: we have a set of actions $\mathcal{A}$, a set of observations $\mathcal{O}$, a policy is a mapping $\pi : (\mathcal{A}\times\mathcal{O})^* \to \Delta\mathcal{A}$, a reward function is a mapping $r : (\mathcal{A}\times\mathcal{O})^* \to [0,1]$, and the utility function is a time-discounted sum of rewards. (Alternatively, we could use instrumental reward functions.)
The simplest attempt at defining "goal-directed intelligence" is requiring that the policy in question, call it $\pi_0$, is optimal for some prior and utility function. However, this condition is vacuous: the reward function can artificially reward only behavior that follows $\pi_0$, or the prior can believe that behavior not according to $\pi_0$ leads to some terrible outcome.
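For concreteness, here is one such degenerate construction, sketched for a deterministic $\pi_0$:

```latex
% Reward that pays exactly for pi_0-consistent moves: for a history h followed
% by an action a, define
\[
r(h a) \;=\;
\begin{cases}
1 & \text{if } a = \pi_0(h), \\
0 & \text{otherwise.}
\end{cases}
\]
% Following pi_0 collects the maximal reward 1 at every step, so pi_0 is
% optimal for this reward under any prior whatsoever.
```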
The next natural attempt is bounding the description complexity of the prior and reward function, in order to avoid priors and reward functions that are "contrived". However, descript...
I have repeatedly argued for a departure from pure Bayesianism that I call "quasi-Bayesianism". But, coming from a LessWrong-ish background, it might be hard to wrap your head around the fact that Bayesianism is somehow deficient. So, here's another way to understand it, using Bayesianism's own favorite trick: Dutch booking!
Consider a Bayesian agent Alice. Since Alice is Bayesian, ey never randomize: ey just follow a Bayes-optimal policy for eir prior, and such a policy can always be chosen to be deterministic. Moreover, Alice always accepts a bet if ey can choose which side of the bet to take: indeed, at least one side of any bet has non-negative expected utility. Now, Alice meets Omega. Omega is very smart so ey know more than Alice and moreover ey can predict Alice. Omega offers Alice a series of bets. The bets are specifically chosen by Omega s.t. Alice would pick the wrong side of each one. Alice takes the bets and loses, indefinitely. Alice cannot escape eir predicament: ey might know, in some sense, that Omega is cheating em, but there is no way within the Bayesian paradigm to justify turning down the bets.
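A toy simulation of Alice's predicament; purely illustrative, with arbitrary choices for the class names, the stakes and the Laplace-rule belief update:

```python
"""Toy illustration: a deterministic bettor who always accepts one side of any
offered bet loses every round against a predictor who knows her decision rule."""


class Alice:
    """Deterministic Bayesian bettor over the next bit of a sequence."""

    def __init__(self) -> None:
        self.ones = 0
        self.total = 0

    @property
    def p_one(self) -> float:
        # Laplace-rule estimate of the probability that the next bit is 1.
        return (self.ones + 1) / (self.total + 2)

    def choose_side(self) -> int:
        # Always accepts the bet, deterministically picking the side with
        # non-negative expected value (ties broken towards 1).
        return 1 if self.p_one >= 0.5 else 0

    def observe(self, bit: int) -> None:
        self.ones += bit
        self.total += 1


class Omega:
    """Predictor: simulates Alice's deterministic choice, then makes the world
    come out the other way."""

    def reveal_bit(self, alice: Alice) -> int:
        return 1 - alice.choose_side()


def run(rounds: int = 20) -> int:
    alice, omega = Alice(), Omega()
    wealth = 0
    for _ in range(rounds):
        side = alice.choose_side()
        bit = omega.reveal_bit(alice)
        wealth += 1 if side == bit else -1  # even-odds stake of 1 per round
        alice.observe(bit)
    return wealth


if __name__ == "__main__":
    print(run())  # Alice loses every round: prints -20
```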
A possible counterargument is, we don't need to depart far from Bayesianis
...Game theory is widely considered the correct description of rational behavior in multi-agent scenarios. However, real world agents have to learn, whereas game theory assumes perfect knowledge, which can only be achieved in the limit at best. Bridging this gap requires using multi-agent learning theory to justify game theory, a problem that is mostly open (but some results exist). In particular, we would like to prove that learning agents converge to game theoretic solutions such as Nash equilibria (putting superrationality aside: I think that superrationality should manifest via modifying the game rather than abandoning the notion of Nash equilibrium).
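As a tiny concrete instance of this program, here is a sketch of fictitious play in matching pennies. The classical result for two-player zero-sum games is that the empirical action frequencies of fictitious play, though not the actual round-by-round play, converge to a Nash equilibrium; the code and parameters below are illustrative.

```python
"""Sketch: fictitious play in matching pennies. Empirical action frequencies
converge to the unique mixed Nash equilibrium (1/2, 1/2) for both players."""

import numpy as np

# Row player's payoff matrix for matching pennies; the column player gets -A.
A = np.array([[1.0, -1.0],
              [-1.0, 1.0]])


def fictitious_play(steps: int = 100_000) -> tuple[np.ndarray, np.ndarray]:
    row_counts = np.ones(2)  # pseudo-counts of the row player's past actions
    col_counts = np.ones(2)  # pseudo-counts of the column player's past actions
    for _ in range(steps):
        # Each player best-responds to the opponent's empirical frequencies.
        row_action = np.argmax(A @ (col_counts / col_counts.sum()))
        col_action = np.argmax(-((row_counts / row_counts.sum()) @ A))
        row_counts[row_action] += 1
        col_counts[col_action] += 1
    return row_counts / row_counts.sum(), col_counts / col_counts.sum()


if __name__ == "__main__":
    row_freq, col_freq = fictitious_play()
    print(row_freq, col_freq)  # both approach [0.5, 0.5]
```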
The simplest setup in (non-cooperative) game theory is normal form games. Learning happens by accumulating evidence over time, so a normal form game is not, in itself, a meaningful setting for learning. One way to solve this is replacing the normal form game by a repeated version. This, however, requires deciding on a time discount. For sufficiently steep time discounts, the repeated game is essentially equivalent to the normal form game (from the perspective of game theory). However, the full-fledged theory of intelligent agents requ
...Master post for alignment protocols.
Other relevant shortforms:
Probably not too original but I haven't seen it clearly written anywhere.
There are several ways to amplify imitators with different safety-performance tradeoffs. This is something to consider when designing IDA-type solutions.
Amplifying by objective time: The AI is predicting what the user(s) will output after thinking about a problem for a long time. This method is the strongest, but also the least safe. It is the least safe because a malign AI might exist in the future and affect the prediction, which creates an attack vector for that future malign AI to infiltrate the present world. We can try to defend by adding a button for "malign AI is attacking", but that still leaves us open to surprise takeovers in which there is no chance to press the button.
Amplifying by subjective time: The AI is predicting what the user(s) will output after thinking about a problem for a short time, where in the beginning they are given the output of a similar process that ran for one iteration less. So, this simulates a "groundhog day" scenario where the humans wake up in the same objective time period over and over without memory of the previous iterations but with a written legacy. This is weaker than...
Master post for selection/coherence theorems. Previous relevant shortforms: learnability constraints decision rules, AIT selection for learning.
TLDR: Systems which have locally maximal influence can be described as VNM decision-makers.
There are at least 3 different motivations leading to the concept of "agent" in the context of AI alignment:
Motivation #1 naturally suggests a descriptive approach, motivation #2 naturally suggests a prescriptive approach, and motivation #3 is sort of a mix of both: on the one hand, we're describing something that already exists; on the other hand, the concept of "preferences" inherently comes from a normative perspective. There are also reasons to think these different motivations should converge on a single, coherent concept.
Here, we will focus on motivation #1.
A central reason why we are concerned about powerful unaligned agents is that they are influential. Agents are the sort of system that, when instantiated in a particular environment, is likely to heavily change this environment, potentially in ways inconsistent with the preferences of other agents.
I think there are some subtleties with the (non-infra) Bayesian VNM version, which come down to the difference between "extreme point" and "exposed point" of the convex set in question. If a point is an extreme point that is not an exposed point, then it cannot be the unique expected utility maximizer under a utility function (but it can be a non-unique maximizer).
For extreme points it might still work with uniqueness, if, instead of a VNM-decision-maker, we require a slightly weaker decision maker whose preferences satisfy the VNM axioms except continuity.
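For intuition about the extreme-vs-exposed distinction, here is the standard "stadium" example from convex geometry (not taken from the discussion above):

```latex
% The stadium: convex hull of two unit discs.
\[
S \;=\; \operatorname{conv}\!\Big( B_1\big((-1,0)\big) \cup B_1\big((1,0)\big) \Big) \subset \mathbb{R}^2 ,
\qquad B_1(c) = \{x : \|x - c\| \le 1\}.
\]
% The four corner points (\pm 1, \pm 1) are extreme: none of them is a convex
% combination of other points of S. But none of them is exposed: the only
% supporting line at, say, (1,1) is y = 1, and the linear functional y attains
% its maximum on the whole top segment [-1,1] x {1}, not uniquely at (1,1).
```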
For any , if then either or .
I think this condition might be too weak and the conjecture is not true under this definition.
If , then we have (because a minimum over a larger set is smaller). Thus, can only be the unique argmax if .
Consider the example . Then is closed. And satisfies . But per the above it cannot be a unique maximizer.
Maybe the issue can be fixed if we strengthen the condition so that has to be also minimal with respect to .
The following are my thoughts on the definition of learning in infra-Bayesian physicalism (IBP), which is also a candidate for the ultimate prescriptive agent desideratum.
In general, learning of hypotheses about the physical universe is not possible because of traps. On the other hand, learning of hypotheses about computable mathematics is possible in the limit of ample computing resources, as long as we can ignore side effects of computations. Moreover, learning computable mathematics implies approximating Bayesian planning w.r.t. the prior about the physical universe. Hence, we focus on this sort of learning.
We consider an agent comprised of three modules, that we call Simulator, Learner and Controller. The agent's history consists of two phases. In the Training phase, the Learner interacts with the Simulator, and in the end produces a program for the Controller. In the Deployment phase, the Controller runs the program.
Roughly speaking:
Until now I believed that a straightforward bounded version of the Solomonoff prior cannot be the frugal universal prior because Bayesian inference under such a prior is NP-hard. One reason it is NP-hard is the existence of pseudorandom generators. Indeed, Bayesian inference under such a prior distinguishes between a pseudorandom and a truly random sequence, whereas a polynomial-time algorithm cannot distinguish between them. It also seems plausible that, in some sense, this is the only obstacle: it was established that if one-way functions don't exist (wh...
A major impediment in applying RL theory to any realistic scenario is that even the control problem[1] is intractable when the state space is exponentially large (in general). Real-life agents probably overcome this problem by exploiting some special properties of real-life environments. Here are two strong candidates for such properties:
Here's a question inspired by thinking about Turing RL, and trying to understand what kind of "beliefs about computations" we should expect the agent to acquire.
Does mathematics have finite information content?
First, let's focus on computable mathematics. At first glance, the answer seems obviously "no": because of the halting problem, there's no algorithm (i.e. a Turing machine that always terminates) which can predict the result of every computation. Therefore, you can keep learning new facts about results of computations forever. BUT, maybe most of those new facts are essentially random noise, rather than "meaningful" information?
Is there a difference in principle between "noise" and "meaningful content"? It is not obvious, but the answer is "yes": in algorithmic statistics there is the notion of "sophistication", which measures how much "non-random" information is contained in some data. In our setting, the question can be operationalized as follows: is it possible to have an algorithm $A$ plus an infinite sequence of bits $\rho$, s.t. $\rho$ is random in some formal sense (e.g. Martin-Löf) and $A$ can decide the output of any finite computation if it's also given access to $\rho$?
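One known construction that bears directly on this operationalization is Chaitin's halting probability (sketch; whether it satisfies every detail above depends on the exact formalization):

```latex
% Chaitin's Omega for a universal prefix machine U:
\[
\Omega \;=\; \sum_{p\,:\,U(p)\downarrow} 2^{-|p|},
\qquad \rho \;=\; \text{the binary expansion of } \Omega .
\]
% \rho is Martin-Lof random, yet the first n bits of \rho let an algorithm
% decide halting for every program of length at most n: dovetail all programs
% until the accumulated halting mass exceeds \Omega truncated to n bits; any
% program of length <= n that has not halted by then never will. Once halting
% is decided, the output of a halting computation is obtained by running it.
```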
The answer to th...
Some thoughts about embedded agency.
From a learning-theoretic perspective, we can reformulate the problem of embedded agency as follows: What kind of agent, and under what conditions, can effectively plan for events after its own death? For example, Alice bequeaths eir fortune to eir children, since ey want them to be happy even when Alice emself is no longer alive. Here, "death" can be understood to include modification, since modification is effectively destroying an agent and replacing it by a different agent[1]. For example, Clippy 1.0 is an AI that values paperclips. Alice disabled Clippy 1.0 and reprogrammed it to value staples before running it again. Then, Clippy 2.0 can be considered to be a new, different agent.
First, in order to meaningfully plan for death, the agent's reward function has to be defined in terms of something other than its direct perceptions. Indeed, by definition the agent no longer perceives anything after death. Instrumental reward functions are somewhat relevant but still don't give the right object, since the reward is still tied to the agent's actions and observations. Therefore, we will consider reward functions defined in terms of some fixed ontology
...This is a preliminary description of what I dubbed Dialogic Reinforcement Learning (credit for the name goes to tumblr user @di--es---can-ic-ul-ar--es): the alignment scheme I currently find most promising.
It seems that the natural formal criterion for alignment (or at least the main criterion) is having a "subjective regret bound": that is, the AI has to converge (in the long-term planning limit, i.e. the $\gamma \to 1$ limit) to achieving optimal expected user!utility with respect to the knowledge state of the user. In order to achieve this, we need to establish a communicati
...Epistemic status: most elements are not new, but the synthesis seems useful.
Here is an alignment protocol that I call "autocalibrated quantilized debate" (AQD).
Arguably the biggest concern with naive debate[1] is that perhaps a superintelligent AI can attack a human brain in a manner that takes it out of the regime of quasi-rational reasoning altogether, in which case the framing of "arguments and counterarguments" doesn't make sense anymore. Let's call utterances that have this property "Lovecraftian". To counter this, I suggest using quantilization. Quanti...
Learning theory distinguishes between two types of settings: realizable and agnostic (non-realizable). In a realizable setting, we assume that there is a hypothesis in our hypothesis class that describes the real environment perfectly. We are then concerned with the sample complexity and computational complexity of learning the correct hypothesis. In an agnostic setting, we make no such assumption. We therefore consider the complexity of learning the best approximation of the real environment. (Or, the best reward achievable by some space of policies.)
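In the offline case, the contrast can be stated in one line (a sketch with generic notation: hypothesis class $\mathcal{H}$, data distribution $D$, loss $\ell$, learned hypothesis $\hat{h}$):

```latex
% L_D(h) := E_{(x,y) ~ D}[ \ell(h(x), y) ].
\[
\text{realizable: } \exists h^* \in \mathcal{H}:\, L_D(h^*) = 0,
\ \text{ and we require } L_D(\hat h) \le \varepsilon ;
\qquad
\text{agnostic: } L_D(\hat h) \le \min_{h \in \mathcal{H}} L_D(h) + \varepsilon .
\]
```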
In offline learning and certain varieties of online learning, the agnostic setting is well-understood. However, in more general situations it is poorly understood. The only agnostic result for long-term forecasting that I know is Shalizi 2009, however it relies on ergodicity assumptions that might be too strong. I know of no agnostic result for reinforcement learning.
Quasi-Bayesianism was invented to circumvent the problem. Instead of considering the agnostic setting, we consider a "quasi-realizable" setting: there might be no perfect description of the environment in the hypothesis class, but there are some incomplete descriptions. B
...One subject I like to harp on is reinforcement learning with traps (actions that cause irreversible long term damage). Traps are important for two reasons. One is that the presence of traps is at the heart of the AI risk concept: attacks on the user, corruption of the input/reward channels, and harmful self-modification can all be conceptualized as traps. Another is that without understanding traps we can't understand long-term planning, which is a key ingredient of goal-directed intelligence.
In general, a prior that contains traps will be unlearnable, mea
...In the past I considered the learning-theoretic approach to AI theory as somewhat opposed to the formal logic approach popular in MIRI (see also discussion):
I recently realized that the formalism of incomplete models provides a rather natural solution to all decision theory problems involving "Omega" (something that predicts the agent's decisions). An incomplete hypothesis may be thought of as a zero-sum game between the agent and an imaginary opponent (we will call the opponent "Murphy" as in Murphy's law). If we assume that the agent cannot randomize against Omega, we need to use the deterministic version of the formalism. That is, an agent that learns an incomplete hypothesis converges to the corresponding max
...Consider a Solomonoff inductor predicting the next bit in the sequence {0, 0, 0, 0, 0...} At most places, it will be very certain the next bit is 0. But, at some places it will be less certain: every time the index of the place is highly compressible. Gradually it will converge to being sure the entire sequence is all 0s. But, the convergence will be very slow: about as slow as the inverse Busy Beaver function!
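A sketch of why the rate is inverse busy beaver (with $BB(k)$ denoting the longest running time of any halting program of length at most $k$; constants schematic):

```latex
% Compare the all-zeros hypothesis with "all zeros except a 1 at position BB(k)".
% The latter can be described by a longest-running program of length <= k, so
\[
K(\text{all zeros}) = O(1), \qquad
K(\text{all zeros except a 1 at position } BB(k)) \le k + O(1).
\]
% The second hypothesis agrees with everything observed before position BB(k),
% so it retains prior weight on the order of 2^{-k} relative to the all-zeros
% hypothesis. Hence at place n = BB(k) the inductor still assigns probability
% roughly 2^{-k} = 2^{-BB^{-1}(n)} to seeing a 1, i.e. the residual
% uncertainty decays about as slowly as the inverse busy beaver function.
```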
This is not just a quirk of Solomonoff induction, but a general consequence of reasoning using Occam's razor (which is the only reasonable way to re...
Here is a way to construct many learnable undogmatic ontologies, including such with finite state spaces.
A deterministic partial environment (DPE) over action set and observation set is a pair where and s.t.
DPEs are equipped with a natural partial order. Namely, when and .
Let  ...
There have been some arguments coming from MIRI that we should be designing AIs that are good at e.g. engineering while not knowing much about humans, so that the AI cannot manipulate or deceive us. Here is an attempt at a formal model of the problem.
We want algorithms that learn domain D while gaining as little knowledge as possible about domain E. For simplicity, let's assume the offline learning setting. Domain D is represented by instance space $\mathcal{X}_D$, label space $\mathcal{Y}_D$, distribution $\mu_D$ and loss function $\ell_D$. Similarly, domain E is represented by inst...
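Picking up this setup, here is a crude illustrative sketch of the intended tradeoff, not the formal model itself: a predictor for domain D is trained with a penalty on how strongly its output correlates with E-labels. All data, names and hyperparameters below are hypothetical.

```python
"""Illustrative sketch only: operationalize "learn domain D while gaining
little knowledge about domain E" by penalizing how strongly the learned
predictor's output correlates with E-labels."""

import numpy as np

rng = np.random.default_rng(0)

# Toy data: feature 0 is a noisy D-signal, feature 1 mixes the D-signal with
# the E-signal, so an unconstrained learner leaks information about E.
n = 2000
z_d = rng.normal(size=n)                      # latent domain-D variable
z_e = rng.normal(size=n)                      # latent domain-E variable
x = np.stack([z_d + 0.5 * rng.normal(size=n), z_d + z_e], axis=1)
y_d = (z_d > 0).astype(float)                 # D labels (to be learned)
y_e = (z_e > 0).astype(float)                 # E labels (to be ignored)


def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))


def corr_with_e(w):
    logits = x @ w
    return 0.0 if logits.std() < 1e-12 else float(np.corrcoef(logits, y_e)[0, 1])


def best_weights(lmbda, scale=3.0, grid=1440):
    """Brute-force over directions of a fixed-norm linear predictor, trading
    off D log-loss against the squared correlation of its logits with E."""
    best_w, best_obj = None, np.inf
    for theta in np.linspace(0.0, 2 * np.pi, grid, endpoint=False):
        w = scale * np.array([np.cos(theta), np.sin(theta)])
        p = np.clip(sigmoid(x @ w), 1e-9, 1 - 1e-9)
        log_loss = -np.mean(y_d * np.log(p) + (1 - y_d) * np.log(1 - p))
        obj = log_loss + lmbda * corr_with_e(w) ** 2
        if obj < best_obj:
            best_w, best_obj = w, obj
    return best_w


if __name__ == "__main__":
    for lam in (0.0, 10.0):
        w = best_weights(lam)
        acc = float(((x @ w > 0) == (y_d > 0.5)).mean())
        print(f"lambda={lam}: D-accuracy={acc:.2f}, "
              f"|corr(logits, E-labels)|={abs(corr_with_e(w)):.2f}")
```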