Adam Shimi

Full-time independent deconfusion researcher (https://www.alignmentforum.org/posts/5Nz4PJgvLCpJd6YTA/looking-deeper-at-deconfusion) in AI Alignment. (I also hold a PhD in the theory of distributed computing.)

If you're interested in research ideas you see in my posts, know that I keep private docs with the most compressed versions of my deconfusion ideas while they are getting feedback. I can give you access if you PM me!

A list of topics I'm currently doing deconfusion on:

  • Goal-directedness for discussing AI Risk
  • Myopic Decision Theories for dealing with deception (with Evan Hubinger)
  • Universality for many alignment ideas of Paul Christiano
  • Deconfusion itself to get better at it
  • Models of language models to clarify the alignment issues surrounding them.

Sequences

Epistemic Cookbook for Alignment
Reviews for the Alignment Forum
AI Alignment Unwrapped
Deconfusing Goal-Directedness
Toying With Goal-Directedness

Comments

Yudkowsky and Christiano discuss "Takeoff Speeds"

I grimly predict that the effect of this dialogue on the community will be polarization: People who didn't like Yudkowsky and/or his views will like him / his views less, and the gap between them and Yud-fans will grow (more than it shrinks due to the effect of increased dialogue). I say this because IMO Yudkowsky comes across as angry and uncharitable in various parts of this dialogue, and also I think it was kinda a slog to get through & it doesn't seem like much intellectual progress was made here.

Strongly agree with that.

Since you agree with Yudkowsky, do you think you could steelman his position?

LCDT, A Myopic Decision Theory

Yeah, that's a subtle point.

Here we're stressing the difference between the simulator's action and the simulation's action (the simulation being HCH or Evan in your example). Obviously, if the simulation is non-myopic, then the simulation's action will depend on the long-term consequences of that action (for the simulation's goals). But the simulator itself only cares about answering the question "what would the simulation do next?". Once again, that might mean the simulator will think about the long-term consequences of the simulation's action for the simulation's goals, but the simulator doesn't have those goals itself: such reasoning is completely instrumental to its task of simulation. And more generally, the simulator isn't choosing its next action to make future actions easier to predict (as a predict-o-matic would).

That might sound like nitpicking, but this means something important: the simulator itself has no reason to be deceptive. It might output actions (as its best guess of what the simulation would do) that are deceptive, but only if the simulation itself is deceptive.
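
To make the distinction concrete, here is a minimal toy sketch in Python. It is purely illustrative: the names (`simulator`, `agent`, `env`) and signatures are hypothetical, not taken from the LCDT post or any existing codebase; it only contrasts a one-step prediction objective with a long-horizon one.

```python
# Purely illustrative sketch; names and signatures are hypothetical.

def myopic_simulator_loss(simulator, state, simulation_next_action):
    """The simulator's whole objective: predict what the simulation does next.

    Any reasoning about the simulation's long-term goals happens only inside
    simulator(state), as an instrumental step toward this one-step prediction;
    the loss itself never looks past the next action.
    """
    predicted_action = simulator(state)
    return 0.0 if predicted_action == simulation_next_action else 1.0


def non_myopic_agent_return(agent, env, horizon, discount=0.99):
    """By contrast, an agent is scored on the downstream consequences of its
    own actions, which is where incentives to steer the future (including
    deceptively) can enter.
    """
    state = env.reset()
    total = 0.0
    for t in range(horizon):
        state, reward, done = env.step(agent(state))
        total += (discount ** t) * reward
        if done:
            break
    return total
```

The point of the contrast is only that the first objective never rewards the simulator for steering the future; it is rewarded solely for getting the next prediction right, even if computing that prediction involves reasoning about the simulation's long-term plans.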

What does that give us?

  • If we manage to point the simulation at something that is non-deceptive yet powerful, the myopic simulator will not introduce deception into the mix, whereas doing IRL on the simulation and then optimizing for the inferred reward would probably lead to Goodharting and deception because of mesa-optimizers.
    • Here Evan would probably say that HCH sounds like the right non-deceptive simulation; I'm less convinced that HCH will not be deceptive.
    • An obvious question is: why not just do imitation learning? Well, I expect (and I believe Evan expects too) that simulation is strictly more powerful than imitation, because it can model non-observed or idealized processes that we point it at.
  • If instead of having a single simulation, we have a simulator that can deal with a range of simulations (which is how some researchers and I are currently thinking about GPT-3 and LMs), then myopia lets you use the simulator to detect deception in the simulations, switch between simulations, and test different options, in a way that a "deceptive agent acting like a simulator" would not allow (because it would tamper with your experiments).
    • A tangent, but there's also an argument that I'm writing up about why we should not expect models which simulate many different processes to be agents — spoiler: agents are bad at being simulators.
  • Even in the worst-case scenario where we make a simulator that simulates a deceptive agent, the simulator has no incentive to hide its "internal thoughts" about how the simulation works. That's a very small comfort, but it might make interpretability easier because there is no adversarial pressure against it.

Ngo and Yudkowsky on AI capability gains

Thanks for giving more details about your perspective.

Your comment is phrased as if the object-level refutations have been tried, while conveying the meta-level intuitions hasn't been tried. If anything, it's the opposite: the sequences (and to some extent HPMOR) are practically all content about how to think, whereas Yudkowsky hasn't written anywhere near as extensively on object-level AI safety.

It's not clear to me that the sequences and HPMOR are good pointers for this particular approach to theory-building. I mean, I'm sure there are posts in the sequences that touch on it (Einstein's Arrogance is an example I already mentioned), but I expect that they only talk about it in passing and obliquely, and that such posts are spread all over the sequences. Plus, the fact that Yudkowsky said there was a new subsequence to write leads me to believe that he doesn't think the information is clearly stated already.

So I don't think you can really treat the current confusion as evidence that explaining how that kind of theory works doesn't help, given that such an explanation isn't readily available in a form that I or anyone reading this can access, AFAIK.

This has been valuable for community-building, but less so for making intellectual progress - because in almost all domains, the most important way to make progress is to grapple with many object-level problems, until you've developed very good intuitions for how those problems work. In the case of alignment, it's hard to learn things from grappling with most of these problems, because we don't have signals of when we're going in the right direction. Insofar as Eliezer has correct intuitions about when and why attempted solutions are wrong, those intuitions are important training data.

Completely agree that these intuitions are important training data. But your whole point in other comments is that we want to understand why we should expect these intuitions to differ from apparently bad/useless analogies between AGI and other stuff. And some explanation of where these intuitions come from could help with evaluating them, all the more because Yudkowsky has said that he could write a sequence about the process.

By contrast, trying to first agree on very high-level epistemological principles, and then do the object-level work, has a very poor track record. See how philosophy of science has done very little to improve how science works; and how reading the sequences doesn't improve people's object-level rationality very much.

This sounds to me like a strawman of my position (which might be my fault for not explaining it well).

  • First, I don't think explaining a methodology is a "very high-level epistemological principle", because it lets us concretely pick apart and criticize the methodology as a truth-finding method.
  • Second, the object-level work has already been done by Yudkowsky! I'm not saying that some outside-of-the-field epistemologist should ponder really hard about what would make sense for alignment without ever working on it concretely and then hand us their teachings. Instead, I'm pushing for a researcher who has built a coherent collection of intuitions and has thought about the epistemology of this process to share the latter, to help us understand the former.
  • Similar to my last point, I think the correct comparison here is not "philosophers of science outside the field helping the field", which happens but is rare as you say, but "scientists thinking about epistemology for very practical reasons". And given that the latter is, from my understanding, what started the scientific revolution and was a common activity of all scientists until the big paradigms were established (in physics and biology at least) in the early 20th century, I would say there is a good track record here.
    (Note that this is more your specialty, so I would appreciate evidence that I'm wrong in my historical interpretation here.)

I model you as having a strong tendency to abstract towards higher-level discussion of epistemology in order to understand things. (I also have a strong tendency to do this, but I think yours is significantly stronger than mine.)

Hmm, I certainly like a lot of epistemic stuff, but I would say my tendency to use epistemology is almost always grounded in concrete questions, like understanding why a given experiment tells us something relevant about what we're studying.

I also have to admit that I'm kind of confused, because I feel like you're consistently using the sort of epistemic discussion that I'm advocating for when discussing predictions and what gives us confidence in a theory, and yet you don't think it would be useful to have a similar-level model of the epistemology used by Yudkowsky to make the sort of judgment you're investigating?

I expect that there's just a strong clash of intuitions here, which would be hard to resolve. But one prompt which might be useful: why aren't epistemologists making breakthroughs in all sorts of other domains?

As I wrote above, I don't think this is a good prompt, because here we're talking about scientists using epistemology to make sense of their own work.

Here is an analogy I just thought of: I feel that in this discussion, you and Yudkowsky are talking about objects which have different types. So when you're asking questions about his model, there's a type mismatch. And when he's answering, having noticed the type mismatch, he's trying to find what to ascribe it to (his answer has been quite consistently modest epistemology, which I think is clearly incorrect). Tracking the confusion does tell you some information about the type mismatch, and is probably part of the process of resolving it. But having his best description of his type (given that your type is quite standardized) would make this process far faster, by helping you triangulate the differences.

Ngo and Yudkowsky on AI capability gains

I'm honestly confused by this answer.

Do you actually think that Yudkowsky having to correct everyone's object-level mistakes all the time is strictly more productive and will lead faster to the meat of the deconfusion than trying to state the underlying form of the argument and theory, and then adapting it to the object-level arguments and comments?

I have trouble understanding this, because for me the outcome of the first option is that no one gets it, he has to repeat himself all the time without moving the debate forward, and this is one more giant hurdle for anyone trying to get into alignment and understand his position. It's unclear whether the alternative would solve all these problems (as you quote from the preface of the Sequences, learning the theory is often easier and less useful than practicing), but it still sounds like a powerful accelerator.

There is no dichotomy of "theory or practice"; we probably need both here. And based on my own experience reading the discussion posts and the discussions I've seen around them, the object-level refutations have not been particularly useful forms of practice, even if they're better than nothing.

Ngo and Yudkowsky on AI capability gains

Good point, I hadn't thought about that one.

Still, I have to admit that my first reaction is that this particular subsequence seems uniquely positioned to singlehandedly increase the quality of the debate and of alignment research. Of course, maybe I only feel that way because it's the only one of the long list that I know of. ^^

(Another possibility I just thought of is that maybe this subsequence requires a lot of new preliminary subsequences, such that the work is far larger than you could expect from reading the words "a subsequence". Still sounds like it would be really valuable, though.)

Ngo and Yudkowsky on AI capability gains

That's a really helpful comment (at least for me)!

But at least step one could be saying, "Wait, do these two kinds of ideas actually go into the same bucket at all?"

I'm guessing that a lot of the hidden work here and in the next steps would come from asking stuff like:

  • Do I need to alter the bucket for each new idea, or does it instead fit in its current form each time?
  • Does the mental act of finding that an idea fits into the bucket remove some confusion and clarify things, or is it just a mysterious answer?
  • Does the bucket become simpler and more elegant with each new idea that fits in it?

Is there some truth in this, or am I completely off the mark?

It seems like the sort of thing that would take a subsequence I don't have time to write

You obviously can do whatever you want, but I find myself confused by this idea being discarded. Like, it sounds exactly like the antidote to so much confusion around these discussions and your position, such that if that was clarified, more people could contribute helpfully to the discussion, and either come to your side or point out non-trivial issues with your perspective. Which sounds really valuable for both you and the field!

So I'm left wondering:

  • Do you disagree with my impression of the value of such a subsequence?
  • Do you think it would have this value but are spending your time doing something more valuable?
  • Do you think it would be valuable but really don't want to write it?
  • Do you think it would be valuable, you could in principle write it, but probably no one would get it even if you did?
  • Something else I'm failing to imagine?

Once again, you do what you want, but I feel like this would be super valuable if there was any way of making it possible. That's also completely relevant to my own focus on the different epistemic strategies used in alignment research, especially because we don't have access to empirical evidence or trial and error at all for AGI-type problems.

(I'm also quite curious if you think this comment by dxu points at the same thing you are pointing at)

Ngo and Yudkowsky on AI capability gains

Damn. I actually think you might have provided the first clear pointer I've seen about this form of knowledge production, why and how it works, and what could break it. There's a lot to chew on in this reply, but thanks a lot for the amazing food for thought!

(I especially like that you explained the physical points and included links that actually explain the specific implications.)

And I agree (tentatively) that a lot of the epistemology of science stuff doesn't have the same object-level impact. I was not claiming that normal philosophy of science was required, just that if that was not how we should evaluate and try to break the deep theory, I wanted to understand how I was supposed to do that.

Ngo and Yudkowsky on AI capability gains

That's when I understood that spatial structure is a Deep Fundamental Theory.

And it doesn't stop there. The same thing explains the structure of our roadways, blood vessels, telecomm networks, and even why the first order differential equations for electric currents, masses on springs, and water in pipes are the same.

(The exact deep structure of physical space which explains all of these is differential topology, which I think is what Vaniver was gesturing towards with "geometry except for the parallel postulate".)

Can you go into more detail here? I have done a decent amount of maths but always had trouble in physics due to my lack of physical intuition, so it might be completely obvious, but I'm not clear on what "that same thing" is or how it explains all your examples. Is it about shortest paths? What aspect of differential topology (a really large field) captures it?

(Maybe you literally can't explain it to me without me seeing the deep theory, which would be frustrating, but I'd want to know if that was the case.)

Ngo and Yudkowsky on AI capability gains

This particular type of fallback-prediction is a common one in general: we have some theory which makes predictions, but "there's a phenomenon which breaks one of the modelling assumption in a way noncentral to the main theory" is a major way the predictions can fail.

That's a great way of framing it! And a great way of thinking about why these failures are not "worrisome" at first or in most cases.

Ngo and Yudkowsky on AI capability gains

Thanks for the thoughtful answer!

So, thermodynamics also feels like a deep fundamental theory to me, and one of the predictions it makes is "you can't make an engine more efficient than a Carnot engine." Suppose someone exhibits an engine that appears to be more efficient than a Carnot engine; my response is not going to be "oh, thermodynamics is wrong", and instead it's going to be "oh, this engine is making use of some unseen source."

My gut reaction here is that "you can't make an engine more efficient than a Carnot engine" is not the right kind of prediction to try to break thermodynamics, because even if you could break it in principle, staying at that level without going into the detailed mechanisms of thermodynamics will only make you try the same thing as everyone else does. Do you think that's an adequate response to your point, or am I missing what you're trying to say?

So, later Eliezer gives "addition" as an example of a deep fundamental theory. And... I'm not sure I can imagine a universe where addition is wrong? Like, I can say "you would add 2 and 2 and get 5" but that sentence doesn't actually correspond to any universes.

Like, similarly, I can imagine universes where evolution doesn't describe the historical origin of species in that universe. But I can't imagine universes where the elements of evolution are present and evolution doesn't happen.

[That said, I can imagine universes with Euclidean geometry and different universes with non-Euclidean geometry, so I'm not trying to claim this is true of all deep fundamental theories, but maybe the right way to think about this is "geometry except for the parallel postulate" is the deep fundamental theory.]

The mental move I'm doing for each of these examples is not imagining universes where addition/evolution/other deep theory is wrong, but imagining phenomena/problems for which addition/evolution/other deep theory is not suited. If you're describing something that doesn't commute, addition might be a deep theory, but it's not useful for what you want. Similarly, you could argue that, given how we're building AIs and trying to build AGI, evolution is not the deep theory that you want to use.

It sounds to me like you (and your internal-Yudkowsky) are using "deep fundamental theory" to mean "powerful abstraction that is useful in a lot of domains". Which addition and evolution fundamentally are. But claiming that the abstraction is useful in some new domain requires some justification IMO. And even if you think the burden of proof is on the critics, the difficulty of formulating the generators makes that really hard.

Once again, do you think that answers your point adequately?
