Vojtech Kovarik

Background in mathematics (descriptive set theory, Banach spaces) and game theory (mostly zero-sum, imperfect-information games). CFAR mentor. Usually doing alignment research.

Vojtech Kovarik's Comments

How can Interpretability help Alignment?

An important consideration is whether the interpretability research that seems useful for alignment is research we can expect the mainstream ML research community to work on and solve well on its own. Do you see a way of incentivizing the RL community to change this? (If that is possible, it would seem like a more effective approach than doing it "ourselves".)

There’s little research which focuses on interpreting reinforcement learning agents [...].

There is some work in DeepMind's safety team on this, isn't there? (Not to dispute the overall point though; "a part of DeepMind's safety team" is rather small compared to the RL community :-).)

Nitpicks and things I didn't get:

  • It was a bit hard to understand what you mean by the "research questions vs tasks" distinction. (And then I read the bullet point below it and came, perhaps mistakenly, to the conclusion that you are only after the "reusable piece of wisdom" vs "one-time thing" distinction.)
  • There is something funny going on in this sentence:

If we believe a particular proposal is more or less likely than others to produce aligned AI, then we would preferentially work on interpretability research which we believe will help this proposal other work which wouldn't, as it wouldn't be as useful.

Book report: Theory of Games and Economic Behavior (von Neumann & Morgenstern)

Related to that: an interesting take on (not only) cooperative game theory is Schelling's The Strategy of Conflict (from 1960, with a second edition in 1980; I am not aware of sufficient follow-up research on the ideas presented there). And there might be some useful references in CLR's sequence on Cooperation, Conflict, and Transformative AI.

Book report: Theory of Games and Economic Behavior (von Neumann & Morgenstern)

When I read your summary (and follow-up post), I get the impression that you are suggesting it might be reasonable to study the book, follow up on its ideas, or spend time looking to improve upon its shortcomings. It seems to me that paying attention to a 75-year-old textbook (and drawing the attention of others to it) only makes sense for its historical value or if you manage to tease out some timeless lessons. Hm, or maybe to let people know that "no, it doesn't make sense for you to read this". But my impression is that neither of these was the goal? If not, what was the aim of the post then?

Why not just read something up-to-date instead?

Multiagent Systems: Algorithmic, Game-Theoretic, and Logical Foundations (by Yoav Shoham and Kevin Leyton-Brown, 2010) - which I read and can recommend (freely available online)

Game Theory (Maschler, Zamir, Solan, 2013) - which I didn't read, but it should be good and has a 2020 version

What makes counterfactuals comparable?

My impression is that the discussion of logical counterfactuals, counterfactuals, and comparability is - at the moment - too confused, and most disagreements here are "merely verbal" ones. Most of your questions (seem to me to) point in the direction of different people using different definitions. I feel slightly worried about going too deep into discussions along the lines of "Vojta reacts to Chris' claims about what other LW people argue against hypothetical 1-boxing CDT researchers from classical academia that they haven't met" :D.

My take on how to do counterfactuals correctly is that this is not a property of the world, but of your mental models:

Definition (comparability according to Vojta): Two scenarios are comparable (given a model M and an observation sequence o) if they are both possible in M and consistent with o.
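In symbols (notation mine: Poss(M) is the set of scenarios the model M treats as possible, and "s is consistent with o" means the scenario s does not contradict the observation sequence o):

\[
\mathrm{comparable}_{M,o}(s_1, s_2) \;\iff\; s_1, s_2 \in \mathrm{Poss}(M) \ \text{ and both } s_1, s_2 \text{ are consistent with } o.
\]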

According to this view, counterfactuals only make sense if your model contains uncertainty...

(Aside on logical counterfactuals: Note that there is a difference between the model that I use and the hypothetical models I would be able to infer were I to use all my knowledge. Indeed, I can happily reason about the 6th digit of π being 7, since I don't know what it is, despite knowing the formula for calculating π. I would only get into trouble if I were to do the calculations (and process their implications for the real world). Updating your models with new logical information seems like an important problem, but one I think is independent of counterfactual reasoning.)

...however, there remains the fact that humans do counterfactual reasoning all the time, even about impossible things ("What if I decided to not write this comment?", "What if the Sun revolved around the Earth?"). I think this is consistent with the above definition, for three reasons.

First, the models that humans use are complicated, fragmented, incomplete, and wrong. So much so that positing logical impossibilities (the Sun going around the Earth thing) doesn't make the model inconsistent (because it is so fragmented and incomplete).

Second, when doing counterfactuals, we might take it for granted that we are to replace the actual observation history o by some alternative o'. We then apply the above definition to M and o' (e.g., me not starting to write this comment). When o' is compatible with the model M we use, everything is logically consistent (in M). For example, it might actually be impossible for me to not have started writing this comment, but doing so was perfectly consistent with my (wrong) model.

Finally, when some counterfactual would be inconsistent with our model, we might take it for granted that we are supposed to relax M in some manner. Moreover, people might often implicitly assume the same or a similar relaxation. For example, suppose I know that the month of May has 31 days. The natural relaxation is to be uncertain about month lengths while still remembering it was something between 28 and 31. I might then say that 30 was a perfectly reasonable length, while being indignant upon being asked to consider a May that is 370 days long.

As for the implications for your question: The phrasing of 1) seems to suggest a model that has uncertainty about your decision procedure. Thus both picking 10 and picking 5 seem possible (and consistent with the observation history of seeing the two boxes), and thus comparable. Note that this would seem fishier if you additionally posited that you are a utility maximizer (but, I argue, most people would implicitly relax this assumption if you asked them to consider the 5 counterfactual). Regarding 2): I think that "a typical AF reader" uses a model in which "a typical CDT adherent" can deliberate, come to the one-boxing conclusion, and find 1M in the box, making the options comparable for "typical AF readers". I think that "a typical CDT adherent" uses a model in which "CDT adherents" find the box empty while one-boxers find it full, thus making the options incomparable. The third question I didn't understand.
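To make the first case concrete, here is a minimal sketch (illustrative only; all names are mine) of the comparability check from the definition above, applied to a model that is uncertain about the agent's decision procedure:

```python
# Minimal sketch (illustrative; names are mine, not from any existing library).
# A "model" is just the set of scenarios it considers possible; an "observation
# sequence" is a list of predicates a scenario has to satisfy.

def comparable(model, observations, scenario_a, scenario_b):
    """Comparable = both scenarios are possible in the model and consistent
    with everything observed so far."""
    def admissible(s):
        return s in model and all(obs(s) for obs in observations)
    return admissible(scenario_a) and admissible(scenario_b)

# 5-and-10: the model is uncertain about which amount the agent ends up taking,
# so it contains both scenarios; seeing the two boxes rules nothing out.
model = {"take_5", "take_10"}
observations = [lambda s: True]

print(comparable(model, observations, "take_5", "take_10"))  # True -> comparable

# If the model additionally insisted the agent is a utility maximizer,
# "take_5" would no longer be possible in it, and the two options would stop
# being comparable -- unless that assumption is relaxed.
strict_model = {"take_10"}
print(comparable(strict_model, observations, "take_5", "take_10"))  # False
```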

Disclaimer: I haven't been keeping up to date on discussions regarding these matters, so it might be that what I write has some obvious and known holes in it...

What makes counterfactuals comparable?

An evidential decision theorist would smoke in the Smoking Lesion problem so they don't get cancer.

Is this possibly a typo, and should it instead say that EDT would not smoke? (I never seem to remember the details of Smoking Lesion, but this seems inconsistent with the "so they don't get cancer".)

AI Services as a Research Paradigm

It seems to me like you might each be imagining a slightly different situation.

Not quite certain what the difference is. But it seems like Michael is talking about how to properly set up the parts of the system that are mostly or only AI. In my opinion, this requires AI researchers, in collaboration with experts from whatever-area-is-getting-automated. (So while it might not fall only under the umbrella of AI research, it critically requires it.) Whereas - it seems to me that - Rohin is talking more about ensuring that the (mostly) human parts of society do their job in the presence of automation. For example, how to deal with unemployment when parts of the industry get automated. (And I agree that I wouldn't go looking for AI researchers when tackling this.)

AI Services as a Research Paradigm

List of Research Questions

Looking at these, I feel like they are subquestions of "how do you design a good society that can handle technological development" -- most of it is not AI-specific or CAIS-specific.

It is intentional that not all the problems are technical problems - for example, I expect that not tackling unemployment due to AI might indirectly make you a lot less safe (it seems prudent not to be in a civil war, or any war, when you are attempting to finish building AGI). However, you are right that the list might nevertheless be too broad (and too loosely tied to AI).

Anyway: As a smaller point, I feel that most of the listed problems will get magnified as you introduce more AI services, or they might gain important twists. As a larger point: Am I correct to understand you as implying that "technical AI alignment researchers should primarily focus on other problems" (modulo qualifications)? My intuition is that this doesn't follow, or at least that we might disagree on the degree to which this needs to be qualified to be true. However, I have not yet thought about this enough to be able to elaborate more right now :(. A bookmark that seems relevant is the following prompt:

Conditional on your AI system never turning into an agent-like AGI, how is "not dying and not losing some % of your potential utility because of AI" different from "how do you design a good society that can handle the process of more and more things getting automated"?

(This should go with many disclaimers, first among those the fact that this is a prompt, not an implicit statement that I fully endorse.)

AI Services as a Research Paradigm

Fixed the wrong section numbers and frame problem description.

Informally, we can assume that some description of the world is given by context and view a task as something specified by an initial state and an end state (or states) - accomplishing the task amounts to causing a transformation from the starting state to one of the desired end states.

I feel like this definition is not capturing what I mean by a "task". Many "agent-like" things, such as "become supreme ruler of the world", seem like tasks according to this definition; many useless things like "twitching randomly" can be thought of as completing a "task" as defined here and so would be counted as "services".

Could it be that the problem is not in the "task" part but in the definition of "service"? If I consider the task of building me a house that I will like, I can envision a very service-like way of doing that (ask me a bunch of routine questions, select a house model accordingly, then proceed to build it in a cook-book manner by calling on other services). But I can also imagine going about this in a very agent-like manner.

(Also, "twitching randomly" seems like a perfectly valid task, and a twitch-bot as a perfectly valid service. Just a very stupid one that nobody would want to build or pay for. Uhm, probably. Hopefully.)

AI Services as a Research Paradigm

I agree with your points in the suggested summary. However, I feel like they are not fully representative of the text. But, as the author, I might be imagining the version of the document in my head rather than the one I actually wrote :-).

  • My estimate is that after reading the summary, I would come away with the impression that the text revolves around the abstract model - which I thought wasn't the case, and definitely wasn't the intention.
  • Also, I am not sure if it is intended that your summary doesn't mention the examples and the "classifying research questions" subsection (which seems equally important to me as the list it generates).
  • Finally, from your planned opinion, I might get the impression that the text suggests no technical problems at all. I think that some of them either are technical problems (e.g., undesired appearance of agency, preventing error propagation and correlated failures, "Tools vs Agents" in Section 6) or have important technical components (all the problems listed as related changes in environment, system, or users). Although whether these are AI specific is arguable.

Side-note 1: I also think that most of the classical AI safety problems also appear in systems of AI services (either in individual services, or in "system-wide variants"). But this is only mentioned in the text briefly, since I am not yet fully clear on how to do the translation between agent-like AIs and systems of AI services. (Also, on the extent to which such translation even makes sense.)

Side-note 2: I imagine that many "non-AI problems" might become "somewhat-AI problems" or even "problems that AI researchers need to deal with" once we get enough progress in AI to automate the corresponding domains.

Embedded Agency via Abstraction

A side-note:

Given a territory and a class of queries, construct a map which throws out as much information as possible while still allowing accurate prediction over the query class.

Can't remember the specific reference, but: imperfect-information game theory has some research on abstractions. Naturally, one object of interest is "optimal" abstractions - i.e., ones that are as small as possible for a given accuracy, or as accurate as possible for a given size. However, there are typically some negative results, stating that getting (near-)optimal abstractions is at least as expensive as finding the (near-)optimal solution of the full game. Intuitively, I would expect this to be a recurring theme for abstractions in general.
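In symbols (notation mine): writing |α(G)| for the size of an abstraction α of the game G, and err(α) for the loss from solving α(G) instead of G (e.g., the exploitability of the solution lifted back to G), the two flavours of "optimal" are roughly

\[
\min_{\alpha} \; |\alpha(G)| \;\; \text{subject to} \;\; \mathrm{err}(\alpha) \le \varepsilon
\qquad \text{or} \qquad
\min_{\alpha} \; \mathrm{err}(\alpha) \;\; \text{subject to} \;\; |\alpha(G)| \le k,
\]

and the negative results say that (near-)solving either problem tends to be about as expensive as solving G itself.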

The implication of this is that all the goals should implicitly have the caveat that the maps have to be "not too expensive to construct". (This is intended to be a side-note, not advocacy for changing the formulation. The one you have there is accessible and memorable :-).)
