
Vanessa Kosoy

Research Lead at CORAL. Director of AI research at ALTER. PhD student in Shay Moran's group in the Technion (my PhD research and my CORAL/ALTER research are one and the same). See also Google Scholar and LinkedIn.

E-mail: {first name}@alter.org.il

Comments

johnswentworth's Shortform
Vanessa Kosoy · 3mo

I found LLMs to be very useful for literature research. They can find relevant prior work that you can't find with a search engine because you don't know the right keywords. This can be a significant force multiplier.

They also seem potentially useful for quickly producing code for numerical tests of conjectures, but I have only started experimenting with that.

Other use cases where I found LLMs beneficial:

  • Taking a photo of a menu in French (or providing a link to it) and asking which dishes are vegan.
  • Recommending movies (I am a little wary of some kind of meme poisoning, but I don't watch movies very often, so it seems OK).

That said, I do agree that early adopters seem overeager and may even be harming themselves in some way.

Modeling versus Implementation
Vanessa Kosoy · 4mo

Btw, what are some ways we can incorporate heuristics into our algorithm while staying on levels 1-2?

  1. We don't know how to prove the required desiderata about the heuristic, but we can still reasonably conjecture that they hold and support the conjectures with empirical tests.
  2. We can't prove or even conjecture anything useful-in-itself about the heuristic, but the way the heuristic is incorporated into the overall algorithm makes it safe. For example, maybe the heuristic produces suggestions together with formal certificates of their validity. More generally, we can imagine an oracle-machine (where the heuristic is slotted into the oracle) about which we cannot necessarily prove something like a regret bound w.r.t. the optimal policy, but we can prove (or at least conjecture) a regret bound w.r.t. some fixed simple reference policy. That is, the safety guarantee shows that, no matter what the oracle does, the overall system is no worse than "doing nothing". Maybe this holds modulo weak provable assumptions about the oracle, e.g. that it satisfies a particular computational complexity bound. (See the sketch after this list.)
  3. [Epistemic status: very fresh idea, quite speculative but intriguing.] We can't find even a guarantee like the one above for a worst-case computationally bounded oracle. However, we can prove (or at least conjecture) some kind of "average-case" guarantee. For example, maybe we have a high probability of safety for a random oracle. However, assuming a uniformly random oracle is quite weak. More optimistically, maybe we can prove safety even for any oracle that is pseudorandom against some complexity class $C_1$ (where we want $C_1$ to be as small as possible). Even better, maybe we can prove safety for any oracle in some complexity class $C_2$ (where we want $C_2$ to be as large as possible) that has access to another oracle which is pseudorandom against $C_1$. If our heuristic is not actually in this category (in particular, if $C_2$ is smaller than P and our heuristic doesn't lie in $C_2$), this doesn't formally guarantee anything, but it does provide some evidence for the "robustness" of our high-level scheme.
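
To illustrate option 2, here is a minimal Python sketch (all names hypothetical, not an actual API): the untrusted heuristic influences the system only through proposals that pass a trusted verifier, and any failure falls back to a fixed reference policy, so whatever the oracle does, the system is never worse than that reference policy.

```python
# Sketch of option 2: the heuristic (the "oracle") is untrusted, but its
# suggestions only take effect when a trusted verifier accepts the attached
# certificate; otherwise we fall back to a fixed reference policy.
# All names are hypothetical illustrations, not an actual API.

from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Proposal:
    action: str
    certificate: object  # formal evidence of validity, e.g. a proof object

def safe_step(
    state: object,
    heuristic: Callable[[object], Optional[Proposal]],
    verify: Callable[[object, Proposal], bool],
    reference_policy: Callable[[object], str],
) -> str:
    """Act on the heuristic's suggestion only if its certificate checks out."""
    proposal = heuristic(state)
    if proposal is not None and verify(state, proposal):
        return proposal.action
    # The guarantee is relative to this simple reference policy
    # ("doing nothing"), not to the optimal policy.
    return reference_policy(state)
```
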
Modeling versus Implementation
Vanessa Kosoy · 4mo

I see modeling vs. implementation as a spectrum more than a dichotomy. Something like:

  1. On the "implementation" extreme you prove theorems about the exact algorithm you implement in your AI, s.t. you can even use formal verification to prove these theorems about the actual code you wrote.
  2. Marginally closer to "modeling", you prove (or at least conjecture) theorems about some algorithm which is vaguely feasible in theory. Some civilization might have used that exact algorithm to build AI, but in our world it's impractical, e.g. because it's uncompetitive with other AI designs. However, your actual code is conceptually very close to the idealized algorithm, and you have good arguments why the differences don't invalidate the safety properties of the idealized model.
  3. Further along the spectrum, your actual algorithm is about as similar to the idealized algorithm as DQN is similar to vanilla Q-learning. Which is to say, it was "inspired" by the idealized algorithm but there's a lot of heavy lifting done by heuristics. Nevertheless, there is some reason to hope the heuristic aspects don't change the safety properties.
  4. On the "modeling" extreme, your idealized model is something like AIXI: completely infeasible and bears little direct resemblance to the actual algorithm in your AI. However, there is still some reason to believe real AIs will have similar properties to the idealized model.

More precisely, rather than a 1-dimensional spectrum, there are at least two parameters involved: 

  • How close the object you make formal statements about is to the actual code of your AI, where "closeness" is measured by the strength of the arguments for the analogy, on a scale from "they are literally the same", through solid theoretical and/or empirical evidence, to pure hand-waving/intuition.
  • How much evidence you have for the formal statements, on a scale from "I proved it within some widely accepted mathematical foundation (e.g. PA)" to "I proved vaguely related things, tried very hard but failed to disprove the thing, and/or accumulated some empirical evidence".

[EDIT: And a 3rd parameter is how justified/testable the assumptions of your model are. Ideally, you want these assumptions to be grounded in science. Some will likely be philosophical assumptions which cannot be tested empirically, but at least they should fit into a coherent holistic philosophical view. At the very least, you want to make sure you're not assuming away the core parts of the problem.]

For the purposes of safety, you want to be as close to the implementation end of the spectrum as you can get. However, the model side of the spectrum is still useful as: 

  • A backup plan which is better than nothing, more so if there is some combination of theoretical and empirical justification for the analogizing
  • A way to demonstrate threat models, as the OP suggests
  • An intermediate product that helps check that your theory is heading in the right direction, compare different research agendas, and maybe even make empirical tests.
Working through a small tiling result
Vanessa Kosoy · 4mo

Sorry, I was wrong. By Löb's theorem, all versions of $\mathrm{good}_{new}$ are provably equivalent, so they will trust each other.

Working through a small tiling result
Vanessa Kosoy · 4mo

IIUC, fixed point equations like that typically have infinitely many solutions. So, you defined not one $\mathrm{good}_{new}$ predicate, but an infinite family of them. Therefore, your agent will trust a copy of itself, but usually won't trust variants of itself with other choices of fixed point. In this sense, this proposal is similar to proposals based on quining (as quining admits many fixed points as well).

[This comment is no longer endorsed by its author]
Vanessa Kosoy's Shortform
Vanessa Kosoy · 5mo

TLDR: A new proposal for a prescriptive theory of multi-agent rationality, based on generalizing superrationality using local symmetries.

Hofstadter's "superrationality" postulates that a rational agent should behave as if other rational agents are similar to itself. In a symmetric game, this implies the selection of a Pareto efficient outcome. However, it's not clear how to apply this principle to asymmetric games. I propose to solve this by suggesting a notion of "local" symmetry that is much more ubiquitous than global (ordinary) symmetry.

We will focus here entirely on finite deterministic (pure strategy) games. Generalizing to mixed strategies, and to infinite games more generally, is an important question left for the future.

Quotient Games

Consider a game $\Gamma$ with a set $P$ of players, for each player $i \in P$ a non-empty set $S_i$ of pure strategies, and a utility function $u_i : S_* \to \mathbb{R}$, where $S_* := \prod_{j \in P} S_j$. What does it mean for the game to be "symmetric"? We propose a "soft" notion of symmetry that is only sensitive to the ordinal ordering between payoffs. While the cardinal value of the payoff will be important in the stochastic context, we plan to treat the latter by explicitly expanding the strategy space.

Given two games $\Gamma^{(1)}, \Gamma^{(2)}$, a premorphism from $\Gamma^{(1)}$ to $\Gamma^{(2)}$ is defined to consist of

  • A mapping $q : P^{(1)} \to P^{(2)}$
  • For each $i \in P^{(1)}$, a mapping $\bar{q}_i : S^{(2)}_{q(i)} \to S^{(1)}_i$

Notice that these mappings induce a mapping $\hat{q} : S^{(2)}_* \to S^{(1)}_*$ via

$$\hat{q}(x)_i := \bar{q}_i(x_{q(i)})$$

A premorphism is said to be a homomorphism if $q$ and $\bar{q}$ are s.t. for any $i \in P^{(1)}$ and $x, y \in S^{(2)}_*$, if $u_{q(i)}(x) \leq u_{q(i)}(y)$ then $u_i(\hat{q}(x)) \leq u_i(\hat{q}(y))$. Homomorphisms make games into a category in a natural way.
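
To make these definitions concrete, here is a minimal brute-force sketch in Python (all names are mine, purely illustrative): a finite game with explicit strategy sets, the induced map $\hat{q}$, and an exhaustive check of the homomorphism condition.

```python
# Minimal brute-force encodings of the definitions above, for small finite
# games. All names here are illustrative, not from the post.

from itertools import product

class Game:
    def __init__(self, strategies, utility):
        # strategies: dict mapping each player to a list of pure strategies
        # utility: dict mapping each player to a function outcome -> float,
        #          where an outcome is a dict player -> chosen strategy
        self.players = sorted(strategies)
        self.strategies = strategies
        self.utility = utility

    def outcomes(self):
        for combo in product(*(self.strategies[p] for p in self.players)):
            yield dict(zip(self.players, combo))

def induced(q, qbar, x):
    """The induced map q-hat : S(2)_* -> S(1)_*, via q-hat(x)_i = qbar_i(x_{q(i)})."""
    return {i: qbar[i](x[q[i]]) for i in qbar}

def is_homomorphism(g1, g2, q, qbar):
    """Check: u_{q(i)}(x) <= u_{q(i)}(y) implies u_i(q-hat(x)) <= u_i(q-hat(y))."""
    for i in g1.players:
        for x, y in product(list(g2.outcomes()), repeat=2):
            if g2.utility[q[i]](x) <= g2.utility[q[i]](y) and not (
                g1.utility[i](induced(q, qbar, x))
                <= g1.utility[i](induced(q, qbar, y))
            ):
                return False
    return True
```
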

An automorphism of $\Gamma$ is a homomorphism from $\Gamma$ to $\Gamma$ that has an inverse homomorphism. An automorphism $q$ of $\Gamma$ is said to be flat when for any $i \in P$, if $q(i) = i$ then $\bar{q}_i$ is the identity mapping[1]. A flat symmetry of $\Gamma$ is a group $G$ together with a group homomorphism $\varphi : G \to \mathrm{Aut}(\Gamma)$ s.t. $\varphi(g)$ is flat for all $g \in G$.

Given a flat symmetry $(G, \varphi)$, we can apply the superrationality principle to reduce $\Gamma$ to the "smaller" quotient game $\Gamma/G$. The players are $P/G$, i.e. orbits in the original player set. Given $\alpha \in P/G$, we define the set of strategies for player $\alpha$ in the game $\Gamma/G$ to be

$$(S/G)_\alpha := \left\{ \sigma \in \prod_{i \in \alpha} S_i \;\middle|\; \forall i \in \alpha, g \in G : \sigma(i) = \overline{\varphi(g)}_i\big(\sigma(\varphi(g)(i))\big) \right\}$$

This is non-empty thanks to flatness.

Observe that there is a natural mapping $\gamma : (S/G)_* \to S_*$ given by

$$\gamma(x)_i := x_{Gi}(i)$$

where $Gi \in P/G$ denotes the orbit of $i$. Finally, the utility function of $\alpha$ is defined by

$$u_\alpha(x) := \frac{1}{|\alpha|} \sum_{i \in \alpha} u_i(\gamma(x))$$

It is easy to see that the construction of $\Gamma/G$ is functorial w.r.t. the category structure on games that we defined.
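
Continuing the same toy encoding (still an illustrative sketch, not anything from the post): given the image of $\varphi$ as an explicit list of automorphism pairs $(q, \bar{q})$, one can enumerate orbits, the equivariant strategy sets $(S/G)_\alpha$, and the averaged quotient utilities.

```python
# Quotient game under a flat symmetry, continuing the sketch above.
# `autos` is the full image of phi: a list of pairs (q, qbar), where q is a
# dict player -> player and qbar maps each player i to a function
# S_{q(i)} -> S_i. The identity automorphism should be included.

def orbits(players, autos):
    remaining, orbs = set(players), []
    while remaining:
        i = next(iter(remaining))
        orb = frozenset(q[i] for q, _ in autos)
        orbs.append(orb)
        remaining -= orb
    return orbs

def quotient_strategies(game, orb, autos):
    """Equivariant assignments sigma: orb -> strategies, i.e. those with
    sigma(i) = qbar_i(sigma(q(i))) for every group element (q, qbar)."""
    members = sorted(orb)
    result = []
    for combo in product(*(game.strategies[i] for i in members)):
        sigma = dict(zip(members, combo))
        if all(sigma[i] == qbar[i](sigma[q[i]])
               for q, qbar in autos for i in members):
            result.append(sigma)
    return result

def gamma(quotient_outcome):
    """gamma(x)_i := x_{Gi}(i): merge per-orbit assignments into one outcome.
    `quotient_outcome` is a dict orbit -> equivariant assignment."""
    full = {}
    for sigma in quotient_outcome.values():
        full.update(sigma)
    return full

def quotient_utility(game, orb, outcome):
    """u_alpha: average the orbit members' utilities at the full outcome."""
    return sum(game.utility[i](outcome) for i in orb) / len(orb)
```
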

We can generalize this further by replacing game homomorphisms with game quasimorphisms: premorphisms satisfying the following relaxed condition on $q$ and $\bar{q}$:

  • For any $i \in P^{(1)}$ and $x, y \in S^{(2)}_*$, if $u_{q(i)}(x) < u_{q(i)}(y)$ then $u_i(\hat{q}(x)) \leq u_i(\hat{q}(y))$.

This is no longer closed under composition, so it no longer defines a category structure[2]. However, we can still define flat quasisymmetries and the associated quotient games, and this construction is still functorial w.r.t. the original (not "quasi") category structure. Moreover, there is a canonical homomorphism (not just a quasimorphism) from $\Gamma$ to $\Gamma/G$, even when $G$ is a mere quasisymmetry.

The significance of quasisymmetries can be understood as follows.

The set of all games on a fixed set of players $P$ with fixed strategy sets $\{S_i\}_{i \in P}$ naturally forms a topological space[3] $X$. Given a group $G$ acting on the game via invertible premorphisms, the subset of $X$ where $G$ is a symmetry is not, in general, closed. However, the subset of $X$ where $G$ is a quasisymmetry is closed. I believe this will be important when generalizing the formalism to infinite games.

Coarse Graining

What if a game doesn't even have quasisymmetries? Then, we can look for a coarse-graining of the game which does have them.

Consider a game $\Gamma$. For each $i \in P$, let $S'_i$ be some set and $f_i : S_i \to S'_i$ a surjection. Denote $S'_* := \prod_{i \in P} S'_i$. Given any $x \in S'_*$, we have the game $\Gamma^x$ in which:

  • The set of players is $P$.
  • For any $i \in P$, their set of strategies is $S^x_i := f_i^{-1}(x_i)$. In particular, this implies that the set of outcomes $S^x_*$ satisfies $S^x_* \subseteq S_*$.
  • For any $i \in P$, their utility function is $u^x_i := u_i|_{S^x_*}$.

If for any $x \in S'_*$ we choose some $\alpha_x \in S^x_*$, this allows us to define the coarse-grained game $\Gamma^{(\alpha)}$ in which:

  • The set of players is $P$.
  • For any $i \in P$, the set of strategies is $S'_i$.
  • For any $i \in P$ and $x \in S'_*$, the utility function is $u^{(\alpha)}_i(x) := u_i(\alpha_x)$.

Essentially, we rephrased $\Gamma$ as an extensive-form game in which first the game $\Gamma^{(\alpha)}$ is played and then, if the outcome was $x$, the game $\Gamma^x$ is played, under the assumption that the expected outcome of the game $\Gamma^x$ is $\alpha_x$.

It is also possible to generalize this construction by allowing multivalued $f$.
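
In the same toy encoding, a coarse-graining is just the maps $f_i$ together with a table $\alpha$ choosing an outcome of each subgame $\Gamma^x$; the subgames and the coarse-grained game $\Gamma^{(\alpha)}$ then fall out mechanically (again, all names are mine):

```python
# Coarse-graining in the same toy encoding. f: dict player -> function
# S_i -> S'_i; coarse: dict player -> list of coarse strategies S'_i;
# alpha: dict mapping each coarse outcome (as a frozenset of items, since
# dicts aren't hashable) to a chosen fine outcome of the subgame Gamma^x.

def subgame(game, f, x):
    """Gamma^x: restrict each player to the fiber f_i^{-1}(x_i)."""
    strategies = {i: [s for s in game.strategies[i] if f[i](s) == x[i]]
                  for i in game.players}
    return Game(strategies, game.utility)

def coarse_grained_game(game, coarse, alpha):
    """Gamma^(alpha): played over coarse strategies, with payoffs read off
    from the chosen subgame outcomes alpha_x."""
    def u(i):
        return lambda x: game.utility[i](alpha[frozenset(x.items())])
    return Game(coarse, {i: u(i) for i in game.players})
```
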

Local Symmetry Solutions

Given a game $\Gamma$, a local symmetry presolution (LSP) is recursively defined to be one of the following:

  • Assume there is $i \in P$ s.t. for any $j \in P \setminus \{i\}$, $|S_j| = 1$. Then, any $x^* \in \operatorname{argmax}_{x \in S_*} u_i(x)$ is an LSP of $\Gamma$. [Effectively a single-player game]
  • Let $\{f_i : S_i \to S'_i\}_{i \in P}$ be a coarse-graining of $\Gamma$ and for any $x \in S'_*$, let $\alpha_x$ be an LSP of $\Gamma^x$. Let $\beta$ be an LSP of $\Gamma^{(\alpha)}$. Define $x^* \in S_*$ by $x^*_i := (\alpha_\beta)_i$. Then, $x^*$ is an LSP of $\Gamma$.
  • Let $G$ be a quasisymmetry of $\Gamma$ and let $y^*$ be an LSP of $\Gamma/G$. Then, $\gamma(y^*)$ is an LSP of $\Gamma$.

It's easy to see that an LSP always exists, because for any game we can choose a sequence of coarse-grainings whose effect is making the players choose their strategies one by one (with full knowledge of the choices of previous players). However, not all LSPs are "born equal". It seems appealing to prefer LSPs which use "more" symmetry. This can be formalized as follows.

The way an LSP is constructed defines the set of coherent outcomes $A \subseteq S_*$, which is the set of outcomes compatible with the local symmetries[4]. We define it recursively as follows:

  • For the LSP of an effectively single-player game, we set $A := S_*$.
  • For a coarse-graining LSP, for any $x \in S'_*$, let $A^x$ be the set of coherent outcomes of $\Gamma^x$ and let $B$ be the set of coherent outcomes of $\Gamma^{(\alpha)}$. We then define $A := \{x \in S_* \mid f(x) \in B \text{ and } x \in A^{f(x)}\}$. Here, $f : S_* \to S'_*$ is defined by $f(x)_i := f_i(x_i)$.
  • For a quasisymmetry LSP, let $B$ be the set of coherent outcomes of $\Gamma/G$. We then define $A := \gamma(B)$.

We define a local symmetry solution (LSS) to be an LSP whose set of coherent outcomes is minimal w.r.t. set inclusion.

Examples

  • In the Prisoner's Dilemma, the only LSS is Cooperate-Cooperate (see the worked example after this list).
  • In Stag Hunt, the only LSS is Stag-Stag.
  • In Chicken, the only LSS is Swerve-Swerve. If we add a finite number of randomized strategies, it opens more possibilities.
  • In Battle of the Sexes, assuming a small payoff from going to your preferred place alone, the only LSS is: each player goes to their own preferred place. If we give each player a "coinflip" strategy, then I think the only LSS is both players using the coinflip.
  • Consider a pure cooperation game where both players have strategies $\{a, b\}$ and the payoff is 1 if they choose the same strategy and 0 if they choose different strategies. Then, any outcome is an LSS. I think that, if we add a "fair coinflip" strategy, any LSS has payoff at least $1/4$. I believe the LSS payoff will approach the optimum if we (i) allow randomization, (ii) make the game repeated with a long time horizon, and (iii) add some constraints on the symmetries, but I'll leave this for a future post.
  • Consider a pure cooperation game where both players have strategies $\{a, b\}$, the payoff is 2 if both choose $a$, 1 if both choose $b$, and 0 if they choose different strategies. Then, the only LSS is $aa$.
  • In any game, an LSS guarantees each player at least their deterministic-maximin payoff. In particular, in a two-player zero-sum game in which a pure Nash equilibrium exists, any LSS is a Nash equilibrium.
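
As a sanity check on the first example, here is the Prisoner's Dilemma pushed through the quotient-game sketches above: the player-swap symmetry admits only the equivariant ("both play alike") profiles, and maximizing the averaged utility over the single orbit picks out Cooperate-Cooperate. The payoff numbers are standard ones chosen for illustration.

```python
# The Prisoner's Dilemma under the player-swap symmetry, using the
# sketches above. Payoff values are illustrative.

PAYOFF = {('C', 'C'): (2, 2), ('C', 'D'): (0, 3),
          ('D', 'C'): (3, 0), ('D', 'D'): (1, 1)}

pd = Game(
    strategies={1: ['C', 'D'], 2: ['C', 'D']},
    utility={1: lambda x: PAYOFF[(x[1], x[2])][0],
             2: lambda x: PAYOFF[(x[1], x[2])][1]},
)

# phi(G) = {identity, swap}; both qbar maps are identities, so the
# symmetry is flat.
identity = ({1: 1, 2: 2}, {1: lambda s: s, 2: lambda s: s})
swap     = ({1: 2, 2: 1}, {1: lambda s: s, 2: lambda s: s})
autos = [identity, swap]

(orb,) = orbits(pd.players, autos)            # single orbit {1, 2}
sigmas = quotient_strategies(pd, orb, autos)  # only "both play alike" profiles
best = max(sigmas, key=lambda s: quotient_utility(pd, orb, s))
print(best)  # {1: 'C', 2: 'C'}: Cooperate-Cooperate
```
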
  1. Note that flatness is in general not preserved under composition.
  2. I believe this can be interpreted as a category enriched over the category of sets and partial functions, with the monoidal structure given by the Cartesian product of sets.
  3. An affine space, even.
  4. Maybe we can interpret the set of coherent outcomes as a sharp infradistribution corresponding to the players' joint belief about the outcome of the game.

Vanessa Kosoy's Shortform
Vanessa Kosoy · 5mo

Master post for multi-agency. See also.

Richard Ngo's Shortform
Vanessa Kosoy · 5mo

In (non-monotonic) infra-Bayesian physicalism, there is a vaguely similar asymmetry even though it's formalized via a loss function. Roughly speaking, the loss function expresses preferences over "which computations are running". This means that you can have a "positive" preference for a particular computation to run or a "negative" preference for a particular computation not to run[1].

  1. There are also more complicated possibilities, such as "if P runs then I want Q to run, but if P doesn't run then I'd rather that Q also doesn't run", or even preferences that are only expressible in terms of entanglement between computations.
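
As a toy illustration (mine, not the actual infra-Bayesian physicalism formalism): if the loss is a function of the set of computations that run, then positive, negative, and the footnote's conditional preferences are all directly expressible.

```python
# Toy illustration only: a loss over the set of running computations can
# encode positive, negative, and conditional preferences. The computation
# names are made up.

def loss(running: frozenset) -> float:
    total = 0.0
    if 'nice_computation' not in running:  # positive: want this to run
        total += 1.0
    if 'bad_computation' in running:       # negative: want this not to run
        total += 10.0
    if 'P' in running and 'Q' not in running:  # footnote: if P runs, want Q too
        total += 1.0
    return total

print(loss(frozenset({'nice_computation', 'P', 'Q'})))  # 0.0: preferred state
```
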

Some Rules for an Algebra of Bayes Nets
Vanessa Kosoy · 5mo

I think that there are two different questions which might be getting mixed up here:

Question 1: Can we fully classify all rules by which sets of Bayes nets imply other Bayes nets over the same variables? Naturally, this is not a fully rigorous question, since "classify" is not a well-defined notion. One possible operationalization of this question is: what is the computational complexity of determining whether a given Bayes net follows from a set of other Bayes nets? For example, if there is a set of basic moves that generates all such inferences, then the problem is probably in NP (at least if the number of required moves can also be bounded).

Question 2: What if we replace "Bayes nets" with something like "string diagrams in a Markov category"? Then there might be fewer rules (because maybe some rules hold for Bayes nets but not in the more abstract setting).

abramdemski's Shortform
Vanessa Kosoy · 5mo

I think that in 2 years we're unlikely to accomplish anything that leaves a dent in P(DOOM), with any method, but I also think it's more likely than not that we actually have >15 years.

As to "completing" the theory of agents, I used the phrase (perhaps perversely) in the same sense that e.g. we "completed" the theory of information: the latter exists and can actually be used for its intended applications (communication systems). Or at least in the sense we "completed" the theory of computational complexity: even though a lot of key conjectures are still unproven, we do have a rigorous understanding of what computational complexity is and know how to determine it for many (even if far from all) problems of interest.

I probably should have said "create" rather than "complete".

Posts

  • New Paper: Ambiguous Online Learning (3mo)
  • New Paper: Infra-Bayesian Decision-Estimation Theory (5mo)
  • Video lectures on the learning-theoretic agenda (11mo)
  • Linear infra-Bayesian Bandits (1y)
  • AI Alignment Metastrategy (2y)
  • Critical review of Christiano's disagreements with Yudkowsky (2y)
  • Learning-theoretic agenda reading list (2y)
  • The Learning-Theoretic Agenda: Status 2023 (2y)
  • Compositional language for hypotheses about computations (3y)
  • Infra-Bayesian physicalism: proofs part II (4y)