All of adamShimi's Comments + Replies

Epistemological Vigilance for Alignment

Are you pointing here at the fact that the AI training process and the world will together form a complex system, and as such it is hard to predict the outcomes of interventions, so the obvious first-order outcomes of an intervention may not occur, or may be dominated by higher-order outcomes?

This points at the same thing IMO, although still in a confusing way. This assumption is basically that you can predict the result of an intervention without having to understand the internal mechanism in detail, because the latter is straightforward.

Other possible names would

... (read more)
1Robert Kirk8d
It seems to me that you want a word for whatever the opposite of complex/chaotic systems is, right? Although obviously "Simple" is probably not the best word (as it's very generic). It could be "Simple Dynamics" or "Predictable Dynamics"?
RL with KL penalties is better seen as Bayesian inference

Thanks for this post, it's clear and insightful about RLHF.

From an alignment perspective, would you say that your work gives evidence that we should focus most of the energy on finding guarantees about the distribution that we're aiming for and debugging problems there, rather than thinking about the guarantees of the inference?

(I still expect that we want to understand the inference better and how it can break, but your post seems to push towards a lesser focus on that part)
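
(For context, the identity the post builds on, restated from memory rather than quoted: the KL-regularized objective $\mathbb{E}_{x \sim \pi}[r(x)] - \beta \, \mathrm{KL}(\pi \| \pi_0)$ is maximized by

$$\pi^*(x) \;\propto\; \pi_0(x) \, \exp\!\big(r(x)/\beta\big),$$

i.e. a Bayesian posterior with prior $\pi_0$ and likelihood proportional to $\exp(r(x)/\beta)$. That framing is what makes "which distribution are we aiming for?" a natural place to put the guarantees.)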

The "Measuring Stick of Utility" Problem

Another way to put it: coherence theorems assume the existence of some resources (e.g. money), and talk about systems which are pareto optimal with respect to those resources - e.g. systems which “don’t throw away money”. Implicitly, we're assuming that the system generally "wants" more resources (instrumentally, not necessarily as an end goal), and we derive the system's "preferences" over everything else (including things which are not resources) from that. The agent "prefers" X over Y if it expends resources to get from Y to X. If the agent reaches a wo

... (read more)
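
To make the resource reading concrete, here is a minimal sketch (hypothetical states and numbers, not taken from the post) of reading "preferences" off of paid transitions, and of why a cycle of such preferences amounts to throwing away the resource:

```python
# Hypothetical observations: (from_state, to_state, resource_cost_paid).
paid_transitions = [("Y", "X", 2), ("X", "Z", 1), ("Z", "Y", 1)]

# Each paid move a -> b is read as a revealed "preference" for b over a.
for a, b, cost in paid_transitions:
    print(f"paid {cost} to move {a} -> {b}: revealed preference {b} over {a}")

# If the moves chain back to the starting state, the agent has spent resources
# to end up where it began, so it is not Pareto optimal w.r.t. the resource.
start, end = paid_transitions[0][0], paid_transitions[-1][1]
if start == end:
    wasted = sum(cost for _, _, cost in paid_transitions)
    print(f"preference cycle: {wasted} units of the resource thrown away")
```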
Bits of Optimization Can Only Be Lost Over A Distance

One thing that I had to remind myself while reading this post is that "far away" is across space-time, emphasis on time. So "far away" can be about optimizing the future.

Distributed Decisions

Do you think that thinking explicitly about distributed systems (in the theoretical computer science sense) could be useful for having different frames or understanding of the tradeoffs? Or are you mostly using the idea of distributed systems as an intuitive frame without seeing much value in taking it too seriously?

2johnswentworth1mo
Two answers:
  • I agree with Self-Embedded Agent that there's likely powerful frames for thinking about distributed compute which have not yet been discovered, and existing work may hint toward those. That's the sort of thing which is probably not useful for most researchers to think about, but worth at least some thinking about.
  • There's a shared core to distributed models which I do think basically-all technical researchers in the field should be familiar with. That's best picked up by seeing it in a few different contexts, and theory of distributed systems is one possible context to pick it up from. (Some others: Bayes nets/causality, working with structured matrices, distributed programming in practice.)
4Self-Embedded Agent1mo
If I may be so bold, the answer should be a guarded yes. A snag is that the correct theory of what John calls 'distributed systems' or 'Time' and what theoretical CS academics generally call 'concurrency' is as of yet not fully constructed. To be sure, there are many quite well-developed theoretical frameworks - e.g. the Pi calculus [https://www.amazon.com/Pi-Calculus-Theory-Mobile-Processes/dp/0521543274] or the various models of concurrency like Petri nets, transition systems, event structures [https://www.cl.cam.ac.uk/~gw104/winskel-nielsen-models-for-concurrency.pdf], etc. They're certainly on my list of 'important things I'd like to understand better'. Our world, and our sensemaking of it, is fundamentally concurrent. If we had the 'correct' theory of concurrency and we were able to coherently combine it with decision theory under uncertainty, that would be very powerful.
Proxy misspecification and the capabilities vs. value learning race

Thanks for trying to make the issue more concrete and provide a way to discuss it!

One thing I want to point out is that you don't really need to put the non-constrained variables at the worst possible state; you just have the degree of freedom to put them to whatever helps you and is not too hard to reach.

Using sets, you have a set of world you want, and a proxy that is a superset of this (because you're not able to aim exactly at what you want). The problem is that the AI is optimizing to get in the superset with high guarantees and stay there, and so it'... (read more)

Optimization at a Distance

Great post!

For instance, if I’m planning a party, then the actions I take now are far away in time (and probably also space) from the party they’re optimizing. The “intermediate layers” might be snapshots of the universe-state at each time between the actions and the party. (... or they might be something else; there are usually many different ways to draw intermediate layers between far-apart things.)

This applies surprisingly well even in situations like reinforcement learning, where we don’t typically think of the objective as “far away” from the agent.

... (read more)
Maxent and Abstractions: Current Best Arguments

I followed approximately the technical discussion, and now I'm wondering what that would buy us if you are correct.

  • Max entropy distributions seem nicely behaved and well-studied, so maybe we get some computations, properties, derivation for free? (Basically applying a productive frame to the problem of abstraction)
  • It would reduce computing the influence of the summary statistics on the model to computing the constraints, as I'm guessing that this is the hard part in computing the max entropy distribution (?)

Are these correct, and what am I missing?
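
(For reference, my restatement of the standard maximum-entropy setup, in case it helps locate where the hard part is: given constraints $\mathbb{E}_p[f_i(x)] = c_i$ on the summary statistics $f_i$, the maximum-entropy distribution has the exponential-family form

$$p(x) \;\propto\; \exp\Big(\sum_i \lambda_i f_i(x)\Big),$$

with the multipliers $\lambda_i$ chosen to satisfy the constraints. So once the constraints are identified, what remains is solving for the $\lambda_i$.)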

3johnswentworth1mo
That's basically correct; the main immediate gain is that it makes it much easier to compute abstractions and compute using abstractions. One additional piece is that it hints towards a probably-more-fundamental derivation of the theorems in which maximum entropy plays a more central role. The maximum entropy Telephone Theorem already does that, but the resampling + gKPD approach routes awkwardly through gKPD instead; there's probably a nice way to do it directly via constrained maximization of entropy. That, in turn, would probably yield stronger and simpler theorems.
Information Loss --> Basin flatness

Thanks for the post!

So if I understand correctly, your result is aiming at letting us estimate the dimensionality of the solution basins based on the gradients for the training examples at my local min/final model? Like, I just have to train my model, and then compute the Hessian/behavior gradients and I would (if everything you're looking at works as intended) have a lot of information about the dimensionality of the basin (and I guess the modularity is what you're aiming at here)? That would be pretty nice.

What other applications do you see for this resu... (read more)
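
To illustrate the kind of computation I have in mind, here is a toy sketch (hypothetical numbers, my own illustration rather than the post's procedure): at a perfect-fit minimum of a squared-error loss, the Hessian reduces to the Gauss-Newton term built from the per-example behavior gradients, and counting its near-zero eigenvalues estimates the number of flat directions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_params, n_examples = 10, 4                    # fewer examples than parameters
J = rng.normal(size=(n_examples, n_params))     # hypothetical behavior gradients df_i/dw

# At a zero-loss minimum of L(w) = 0.5 * ||f(w) - y||^2, the Hessian is J^T J.
H = J.T @ J
eigvals = np.linalg.eigvalsh(H)
flat_directions = int(np.sum(eigvals < 1e-8))   # near-zero curvature directions

print(f"Hessian rank: {n_params - flat_directions}")
print(f"flat (basin) directions at this minimum: {flat_directions}")
# With 4 generic examples and 10 parameters this prints 6 flat directions.
```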

1Vivek Hebbar1mo
About the contours: While the graphic shows a finite number of contours with some spacing, in reality there are infinite contour planes and they completely fill space (as densely as the reals, if we ignore float precision). So at literally every point in space there is a blue contour, and a red one which exactly coincides with it.
Intuitions about solving hard problems

I like this pushback, and I'm a fan of productive mistakes. I'll have a think about how to rephrase to make that clearer. Maybe there's just a communication problem, where it's hard to tell the difference between people claiming "I have an insight (or proto-insight) which will plausibly be big enough to solve the alignment problem", versus "I have very little traction on the alignment problem but this direction is the best thing I've got". If the only effect of my post is to make a bunch of people say "oh yeah, I meant the second thing all along", then I'd

... (read more)
Intuitions about solving hard problems

Thanks for the answer.

One thing I would say in response to your comment, Adam, is that I don't usually see the message of your linked post as being incompatible with Richard's main point. I think one usually does have or does need productive mistakes that don't necessarily or obviously look like they are robust partial progress. But still, often when there actually is a breakthrough, I think it can be important to look for this "intuitively compelling" explanation. So one thing I have in mind is that I think it's usually good to be skeptical if a claimed b

... (read more)

I like this pushback, and I'm a fan of productive mistakes. I'll have a think about how to rephrase to make that clearer. Maybe there's just a communication problem, where it's hard to tell the difference between people claiming "I have an insight (or proto-insight) which will plausibly be big enough to solve the alignment problem", versus "I have very little traction on the alignment problem but this direction is the best thing I've got". If the only effect of my post is to make a bunch of people say "oh yeah, I meant the second thing all along", then I'd... (read more)

Intuitions about solving hard problems

I like that you're proposing an explicit heuristic inspired by the history of science for judging research directions and approaches, and acknowledge that it leads to conclusions that are counterintuitive to my Richard-model (pushing for Agent Foundations, for example), so you're not just retrofitting your own conclusion AFAIK. I also like that you're applying it to object-level directions in alignment — that's something I'm working on at the moment for my own research, based on your pushback.

That being said, my prediction/retrodiction is that this is too ... (read more)

2Spencer Becker-Kahn2mo
I broadly agree with Richard's main point, but I also do agree with this comment in the sense that I am not confident that the example of Turing compared with e.g. Einstein is completely fair/accurate. One thing I would say in response to your comment, Adam, is that I don't usually see the message of your linked post as being incompatible with Richard's main point. I think one usually does have or does need productive mistakes that don't necessarily or obviously look like they are robust partial progress. But still, often when there actually is a breakthrough, I think it can be important to look for this "intuitively compelling" explanation. So one thing I have in mind is that I think it's usually good to be skeptical if a claimed breakthrough seems to just 'fall out' of a bunch of partial work without there being a compelling explanation after the fact.
Refine: An Incubator for Conceptual Alignment Research Bets

Sorry to make you work more, but happy to fill a much needed niche. ^^

Refine: An Incubator for Conceptual Alignment Research Bets

Thanks! Yes, this is very much an experiment, and even if it fails, I expect it to be a productive mistake we can learn from. ;)

Takeoff speeds have a huge effect on what it means to work on AI x-risk

I disagree, so I'm curious: what are, for you, great examples of good research on alignment that is not done by x-risk-motivated people? (Not being dismissive, I'm genuinely curious, and discussing specifics sounds more promising than downvoting you to oblivion and not having a conversation at all).

1Joe_Collman2mo
Examples would be interesting, certainly. Concerning the post's point, I'd say the relevant claim is that [type of alignment research that'll be increasingly done in slow takeoff scenarios] is already being done by non x-risk motivated people. I guess the hope is that at some point there are clear-to-everyone problems with no hacky solutions, so that incentives align to look for fundamental fixes - but I wouldn't want to rely on this.
What to include in a guest lecture on existential risks from AI?

I have a framing of AI risk scenarios that I think is more general and more powerful than most available online, and that might be a good frame before going into examples. It's not posted yet (I'm finishing the sequence now) but I could send something to you if you're interested. ;)

1Aryeh Englander2mo
Yes please!
AMA Conjecture, A New Alignment Startup

(I will be running the Incubator at Conjecture)

The goal for the incubator is to foster new conceptual alignment research bets that could go on to become full-fledged research directions, either at Conjecture or at other places. We're thus planning to select mostly on the qualities we expect from a very promising independent conceptual researcher, that is, proactivity (see Paul Graham's Relentlessly Resourceful post) and some interest or excitement about not-fully-tapped streams of evidence (see this recent post).

Although experience with alignment cou... (read more)

Reply to Eliezer on Biological Anchors

Thanks for the answer!

Unfortunately, I don't have the time at the moment to answer in detail and have more of a conversation, as I'm fully focused on writing a long sequence about pushing for pluralism in alignment and extracting the core problem out of all the implementation details and additional assumptions. I plan on going back to analyzing timeline research in the future, and will probably give better answers then.

That being said, here are quick fire thoughts:

  • I used the evolution case because I consider it the most obvious/straightforward case, in that
... (read more)
Job Offering: Help Communicate Infrabayesianism

Great idea!

I was talking with someone just today about the need for something like that. I also expect that InfraBayes is one of the approaches to alignment that need a lot of translation effort to become palatable to proponents of other approaches.

Additional point though: I feel like the applicant should probably also have a decent understanding of alignment, as a lot of the value of such a communication and translation (for me at least) would come from understanding its value for alignment.

Is ELK enough? Diamond, Matrix and Child AI

This thread makes me think that my post is basically a hardness result for ELK when you don't have access to the planner. I agree with you that in settings like the ones you describe, the reporter would have access to the planner, and thus the examples described in this post wouldn't really apply. But the need to have control of the planner is not stated in the ELK report.

So if this post is correct, solving ELK isn't enough if you don't have access to the planner. Which means we either need to ensure that in all cases we can train/observe the planner (which... (read more)

Is ELK enough? Diamond, Matrix and Child AI

So this line of thinking came from considering the predictor and the planner as separate, which is what ELK does AFAIR. It would thus be the planner that executes, or prepares a treacherous turn, while the predictor doesn't know about it (but would be in principle able to find out if it actually needed to).

4Rohin Shah4mo
What is the planner? How is it trained? I often imagine a multihead architecture, where most of the computation is in the shared layers, then one (small) head is the "predictor" (trained by self-supervised learning), one (small) head is the "actor / planner" (trained by RL from human feedback), and the other (small) head is the "reporter" (trained via ELK). In this version the hope is that ~all of the relevant knowledge is in the shared layers and so is accessible to the reporter. You could also be fancier and take the activations, weights, and/or outputs of the planner head and feed them as inputs into the reporter, if you really wanted to be sure that the information is accessible in principle.
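
A minimal sketch of the kind of multi-head setup described above (names, sizes, and training details are illustrative assumptions, not taken from the ELK report):

```python
import torch
import torch.nn as nn

class SharedTrunkAgent(nn.Module):
    def __init__(self, obs_dim=64, hidden=256, n_actions=8):
        super().__init__()
        # Most of the computation lives in the shared layers.
        self.shared = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # Small heads reading off the shared representation.
        self.predictor = nn.Linear(hidden, obs_dim)   # self-supervised: predict next observation
        self.planner = nn.Linear(hidden, n_actions)   # actor head, trained by RL from human feedback
        self.reporter = nn.Linear(hidden, 1)          # ELK head, e.g. "is the diamond still safe?"

    def forward(self, obs):
        z = self.shared(obs)   # ~all relevant knowledge should live in z
        return {
            "prediction": self.predictor(z),
            "action_logits": self.planner(z),
            "report": torch.sigmoid(self.reporter(z)),
        }

# Because the planner only reads the shared representation z, the hope is that
# whatever the planner "knows" is also accessible to the reporter.
agent = SharedTrunkAgent()
out = agent(torch.randn(1, 64))
```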
The Natural Abstraction Hypothesis: Implications and Evidence

Thanks for the post! Two general points I want to make before going into more general comments:

  • I liked the section on differences in concepts, and hadn't thought much about that before, so thanks!
  • One big aspect of the natural abstraction hypothesis that you missed IMO is "how do you draw the boundaries around abstractions?" — more formally, how do you draw the Markov blanket. This to me is the most important question to answer for settling the NAH, and John's recent work on sequences of Markov blankets is IMO him trying to settle this.

In general, we should

... (read more)
Robustness to Scale

Rereading this post while thinking about the approximations that we make in alignment, two points jump at me:

  • I'm not convinced that robustness to relative scale is as fundamental as the other two, because there is no reason to expect that in general the subcomponents will be significantly different in power, especially in settings like adversarial training where both parts are trained according to the same approach. That being said, I still agree that this is an interesting question to ask, and some proposal might indeed depend on a version of this.
  • Robustn
... (read more)
Reply to Eliezer on Biological Anchors

Thanks for pushing back on my interpretation.

I feel like you're using "strongest" and "weakest" to mean "more concrete" and "more abstract", with maybe the value judgement (implicit in your focus on specific testable claims) that concreteness is better. My interpretation doesn't disagree with your point about Bio Anchors; it simply says that this is a concrete instantiation of a general pattern, and that the whole point of the original post as I understand it is to share this pattern. Hence the title, which talks about all biology-inspired timelines, the th... (read more)

My Overview of the AI Alignment Landscape: Threat Models

Thanks so much for the effort you're putting into this work! It looks particularly relevant to my current interest in understanding the different approximations and questions used in alignment, and what denies us the Grail of paradigmaticity.

Here is my more concrete feedback:

A common approach when setting research agendas in AI Alignment is to be specific, and focus on a threat model. That is, to extrapolate from current work in AI and our theoretical understanding of what to expect, to come up with specific stories for how AGI could cause an existential catas

... (read more)
Reply to Eliezer on Biological Anchors

Thanks for this post!

That being said, my model of Yudkowsky, which I built by spending time interpreting and reverse engineering the post you're responding to, suggests that you're not addressing his points (obviously, I might have missed the real Yudkowsky's point).

My interpretation is that he is saying that Evolution (as the generator of most biological anchors) explores the solution space in a fundamentally different path than human research.  So what you have is two paths through a space. The burden of proof for biological anchors thus lies in arguing... (read more)

3HoldenKarnofsky3mo
I don't think I am following the argument here. You seem focused on the comparison with evolution, which is only a minor part of Bio Anchors, and used primarily as an upper bound. (You say "the number is so vastly large (and actually unknown due to the 'level of details' problem) that it's not really relevant for timelines calculations," but actually Bio Anchors still estimates that the evolution anchor implies a ~50% chance of transformative AI this century.) Generally, I don't see "A and B are very different" as a knockdown counterargument to "If A required ___ amount of compute, my guess is that B will require no more." I'm not sure I have more to say on this point that hasn't already been said - I acknowledge that the comparisons being made are not "tight" and that there's a lot of guesswork, and the Bio Anchors argument doesn't go through without some shared premises and intuitions, but I think the needed intuitions are reasonable and an update from Bio-Anchors-ignorant starting positions is warranted.

It displeases me that this is currently the most upvoted response: I believe you are focusing on EY's weakest rather than strongest points.

My interpretation is that he is saying that Evolution (as the generator of most biological anchors) explores the solution space in a fundamentally different path than human research. So what you have is two paths through a space. The burden of proof for biological anchors thus lies in arguing that there are enough connections/correlations between the two paths to use one in order to predict the other.

It's hardly su... (read more)

Biology-Inspired AGI Timelines: The Trick That Never Works

First, I want to clarify that I feel we're going into a more interesting place, where there's a better chance that you might find a point that invalidates Yudkowsky's argument, and can thus convince him of the value of the model.

But it's also important to realize that IMO, Yudkowsky is not just saying that biological anchors are bad. The more general problem (which is also developed in this post) is that predicting the Future is really hard. In his own model of AGI timelines, the factor that is basically impossible to predict until you can make AGI is the ... (read more)

Biology-Inspired AGI Timelines: The Trick That Never Works

I do think you are misconstruing Yudkowsky's argument. I'm going to give evidence (all of which is relatively strong IMO) in order of "ease of checkability". So I'll start with something anyone can check in a couple of minutes, and close with the more general interpretation that requires rereading the post in detail.

Evidence 1: Yudkowsky flags Simulated-Eliezer as talking smack in the part you're mentioning

If I follow you correctly, your interpretation mostly comes from this part:

OpenPhil:  We did already consider that and try to take it into account:

... (read more)
Biology-Inspired AGI Timelines: The Trick That Never Works

Here I think I share your interpretation of Yudkowsky; I just disagree with Yudkowsky. I agree on the second part; the model overestimates median TAI arrival time. But I disagree on the first part -- I think that having a probability distribution over when to expect TAI / AGI / AI-PONR etc. is pretty important/decision-relevant, e.g. for advising people on whether to go to grad school, or for deciding what sort of research project to undertake. (Perhaps Yudkowsky agrees with this much.) 

Hmm, I would say Yudkowsky seems to agree with the value of a pro... (read more)

4Daniel Kokotajlo6mo
I guess I would say: Ajeya's framework/model can incorporate this objection; this isn't a "get rid of the whole framework" objection but rather a "tweak the model in the following way" objection. Like, I agree that it would be bad if everyone who used Ajeya's model had to put 100% of their probability mass into the six bio anchors she chose. That's super misleading/biasing/ignores loads of other possible ways AGI might happen. But I don't think of this as a necessary part of Ajeya's model; when I use it, I throw out the six bio anchors and just directly input my probability distribution over OOMs of compute. My distribution is informed by the bio anchors, of course, but that's not the only thing that informs it.
Biology-Inspired AGI Timelines: The Trick That Never Works

Strongly disagree with this, to the extent that I think this is probably the least cruxy topic discussed in this post, and thus the comment is as wrong as is physically possible.

Remove Platt's law, and none of the actual arguments and meta-discussions changes. It's clearly a case of Yudkowsky going for the snappy "hey, see, even your new-and-smarter report makes exactly the estimate predicted by a random psychological law" + his own frustration with the law still applying despite expected progress.

But once again, if Platt's law was so wrong that... (read more)

5Daniel Kokotajlo6mo
Hahaha ok, interesting! If you are right I'll take some pride in having achieved that distinction. ;) I interpreted Yudkowsky as claiming that Ajeya's model had enough free parameters that it could be made to predict a wide range of things, and that what was actually driving the 30-year prediction was a bunch of implicit biases rather than reality. Platt's Law is evidence for this claim. If it were false and e.g. the typical timelines forecast was only 10 years out, or 60, then we would have less reason to think that implicit biases were driving Ajeya's choice of parameters. Of course, Yudkowsky also made other arguments besides this one... but this one seemed to be there, and it seemed fairly important to me. It's entirely possible I am misconstruing Yudkowsky's argument... you did recently do a reconstruction, so you probably understand it better than me. Care to elaborate?
Biology-Inspired AGI Timelines: The Trick That Never Works

I do agree that the halving-of-compute-costs-every-2.5-years estimate seems too slow to me; it seems like that's the rate of "normal incremental progress" but that when you account for the sort of really important ideas (or accumulations of ideas, or shifts in research direction towards more fruitful paths) that happen about once a decade, the rate should be faster than that.

I don't think this is what Yudkowsky is saying at all in the post. Actually, I think he is saying the exact opposite: that the 2.5-year estimate is too fast as an estimate that is suppose... (read more)

2Daniel Kokotajlo6mo
Thanks for this comment (and the other comment below also). I think we don't really disagree that much here. I may have just poorly communicated, slash maybe I'm objecting to the way Yudkowsky said things because I read it as implying things I disagree with. That's what I think too--normal incremental progress is probably slower than 2.5-year doubling, but there's also occasional breakthrough progress which is much faster, and it all balances out to a faster-than-2.5-year-doubling, but in such a way that makes it really hard to predict, because so much hangs on whether and when breakthroughs happen. I think I just miscommunicated. Here I think I share your interpretation of Yudkowsky; I just disagree with Yudkowsky. I agree on the second part; the model overestimates median TAI arrival time. But I disagree on the first part -- I think that having a probability distribution over when to expect TAI / AGI / AI-PONR etc. is pretty important/decision-relevant, e.g. for advising people on whether to go to grad school, or for deciding what sort of research project to undertake. (Perhaps Yudkowsky agrees with this much.) And I think that Ajeya's framework is the best framework I know of for getting that distribution. I think any reasonable distribution should be formed by Ajeya's framework, or some more complicated model that builds off of it (adding more bells and whistles such as e.g. a data-availability constraint or a probability-of-paradigm-shift mechanic.). Insofar as Yudkowsky was arguing against this, and saying that we need to throw out the whole model and start from scratch with a different model, I was not convinced. (Though maybe I need to reread the post and/or your steelman summary)
Biology-Inspired AGI Timelines: The Trick That Never Works

(My comment is quite critical, but I want to make it clear that I think doing this exercise is great and important, despite my disagreement with the result of the exercise ;) )

So, having done the same exercise myself, I feel that you go far too meta here, and that by doing so, you're losing most of the actually valuable meta insights of the post. I'm not necessarily saying that your interpretation doesn't fit what Yudkowsky says, but if the goal is to distill where Yudkowsky is coming from in this specific post, I feel like this comment fails.

The "trick that ne

... (read more)
Biology-Inspired AGI Timelines: The Trick That Never Works

(I'm trying to answer and clarify some of the points in the comments based on my interpretation of Yudkowsky in this post. So take the interpretations with a grain of salt, not as "exactly what Yudkowsky meant")

Progress in AI has largely been a function of increasing compute, human software research efforts, and serial time/steps. Throwing more compute at researchers has improved performance both directly and indirectly (e.g. by enabling more experiments, refining evaluation functions in chess, training neural networks, or making algorithms that work best

... (read more)
The Plan

If I imagine what my work would look like if I started out expecting reflection to be the taut constraint, then it does seem like I'd follow a path a lot more like MIRI's. So yeah, this fits.

One thing I'm still not clear about in this thread is whether you (John) would feel that progress has been made on the theory of agency if all the problems on which MIRI has worked were instantaneously solved. Because there's a difference between saying "this is the obvious first step if you believe reflection is the taut constraint" and "solving this problem would help significantly even if reflection wasn't the taut constraint".

2johnswentworth6mo
I expect that progress on the general theory of agency is a necessary component of solving all the problems on which MIRI has worked. So, conditional on those problems being instantly solved, I'd expect that a lot of general theory of agency came along with it. But if a "solution" to something like e.g. the Tiling Problem didn't come with a bunch of progress on more foundational general theory of agency, then I'd be very suspicious of that supposed solution, and I'd expect lots of problems to crop up when we try to apply the solution in practice. (And this is not symmetric: I would not necessarily expect such problems in practice for some more foundational piece of general agency theory which did not already have a solution to the Tiling Problem built into it. Roughly speaking, I expect we can understand e-coli agency without fully understanding human agency, but not vice-versa.)
On Solving Problems Before They Appear: The Weird Epistemologies of Alignment

Thanks for explaining your point in more details.

The type of formal methods use I am referring to was popular in the late 1970s to at least the 1990s, not sure how popular it is now. It can be summarized by Dijkstra's slogan of “designing proof and program hand in hand”.

This approach stands in complete opposite to the approach of using a theorem prover to verify existing code after it was written.

In simple cases, the designing hand-in-hand approach works as follows: you start with a mathematical specification of the properties you want the program to be wr

... (read more)
1Koen Holtman7mo
I disagree that AI code is orders of magnitude more complex than say the code in a web browser or modern compiler: in fact quite the opposite applies. Most modern ML algorithms are very short pieces of code. If you are willing to use somewhat abstract math where you do not write out all the hyperparameter values, you can specify everything that goes on in a deep learning algorithm in only a few lines of pseudocode. Same goes for most modern RL algorithms. I also note that while modern airplanes contain millions of lines of code, most of it is in the on-board entertainment systems. But the safety-critical subsystems in airplanes them tend to be significantly smaller in code size, and they also run air-gapped from the on-board entertainment system code. This air-gapping of course plays an important role in making the formal proofs for the safety-critical subsystems possible. But beyond these observations: the main point I am trying to get across is that I do not value formal methods as a post-hoc verification tool that should be applied to millions of lines of code, some of it spaghetti code put together via trial and error. Clearly that approach would not work very well. I value formal methods as a tool for the aligned AI code specification and design phases. On the design phase: formal methods offer me a way to create many alternative and clarifying viewpoints on what the code is doing or intending to do, viewpoints not offered by the text of the code itself. For example, formally expressed loop invariants can express much more clearly what is going on in a loop than the loop code itself. Global invariants that I can formulate for distributed system state can allow me to express more clearly how the protocols I design manage to avoid deadlock than the code itself. So the main value if the hand-in-hand method, as a code design method, is that you can develop these clarifying mathematical viewpoints, even before you start writing the code. (Not all programmers are
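As a toy illustration of the loop-invariant-first style described above (my own example, not Koen's): the invariant is written down before the loop body, and the body is then whatever re-establishes it.

```python
def array_sum(xs):
    total, i = 0, 0
    # Invariant: total == sum(xs[:i]) and 0 <= i <= len(xs)
    while i < len(xs):
        assert total == sum(xs[:i])   # the invariant, checkable at runtime
        total += xs[i]
        i += 1
    # On exit i == len(xs), so the invariant gives total == sum(xs).
    return total
```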
On Solving Problems Before They Appear: The Weird Epistemologies of Alignment

Thanks for the thoughtful comment!

You fail to mention the more important engineering strategy: one which does not rely on tinkering, but instead on logical reasoning and math to chart a straight line to your goal.

To use the obvious example, modern engineering does not design bridges by using the staple strategy of tinkering, it will use applied math and materials science to create and validate the bridge design.

From this point of view, the main difference between 'science' and 'engineering' is that science tries to understand nature: it seeks to understand

... (read more)
2Koen Holtman7mo
I think I will need to write two separate replies to address the points you raise. First, a more meta-level and autobiographical reply. When it comes to formal methods, I too have a good idea of what I am talking about. I did a lot of formal methods stuff in the 1990 at Eindhoven University, which at the time was a one of the places with the most formal methods in the Netherlands. When I talk about using formal methods to design stuff, I am very serious. But I guess this is not the role of formal methods you are used to, so I'll elaborate. The type of formal methods use I am referring to was popular in the late 1970s to at least the 1990s, not sure how popular it is now. It can be summarized by Dijkstra's slogan of “designing proof and program hand in hand” [https://www.cs.utexas.edu/users/EWD/transcriptions/EWD11xx/EWD1157.html]. This approach stands in complete opposite to the approach of using a theorem prover to verify existing code after it was written. In simple cases, the designing hand-in-hand approach works as follows: you start with a mathematical specification of the properties you want the program to be written to have. Then you use this specification to guide both the writing of the code and the writing of the correctness proof for the code at the same time. The writing of the next line in the proof will often tell you exactly what next lines of code you need to add. This often leads to code which much more clearly expresses what is going on. The whole code and proof writing process can often be done without even leveraging a theorem prover. In more complex cases, you first have to develop the mathematical language to write the specification in. These complex cases are of course the more fun and interesting cases. AGI safety is one such more complex case. You mention distributed computing. This is one area where the hand-in-hand style of formal methods use is particularly useful, because the method of intuitive trial and error sucks so much at wr
Yudkowsky and Christiano discuss "Takeoff Speeds"

I grimly predict that the effect of this dialogue on the community will be polarization: People who didn't like Yudkowsky and/or his views will like him / his views less, and the gap between them and Yud-fans will grow (more than it shrinks due to the effect of increased dialogue). I say this because IMO Yudkowsky comes across as angry and uncharitable in various parts of this dialogue, and also I think it was kinda a slog to get through & it doesn't seem like much intellectual progress was made here.

Strongly agree with that.

Since you agree with Yudkowsky, do you think you could strongman his position?

Yes, though I'm much more comfortable explaining and arguing for my own position than EY's. It's just that my position turns out to be pretty similar. (Partly this is independent convergence, but of course partly this is causal influence since I've read a lot of his stuff.)

There's a lot to talk about, I'm not sure where to begin, and also a proper response would be a whole research project in itself. Fortunately I've already written a bunch of it; see these two sequences.

Here are some quick high-level thoughts:

1. Begin with timelines. The best way to forec... (read more)

LCDT, A Myopic Decision Theory

Yeah, that's a subtle point.

Here we're stressing the difference between the simulator's action and the simulation's (HCH or Evan in your example) action. Obviously, if the simulation is non-myopic, then the simulation's action will depend on the long-term consequences of this action (for the goals of the simulation). But the simulator itself only cares about answering the question "what would the simulation do next?". Once again, that might mean that the simulator will think about the long term consequences of the simulation's action on the simulation's go... (read more)

Ngo and Yudkowsky on AI capability gains

Thanks for giving more details about your perspective.

Your comment is phrased as if the object-level refutations have been tried, while conveying the meta-level intuitions hasn't been tried. If anything, it's the opposite: the sequences (and to some extent HPMOR) are practically all content about how to think, whereas Yudkowsky hasn't written anywhere near as extensively on object-level AI safety.

It's not clear to me that the sequences and HPMOR are good pointers for this particular approach to theory building. I mean, I'm sure there are posts in the seque... (read more)

Ngo and Yudkowsky on AI capability gains

I'm honestly confused by this answer.

Do you actually think that Yudkowsky having to correct everyone's object-level mistakes all the time is strictly more productive and will lead faster to the meat of the deconfusion than trying to state the underlying form of the argument and theory, and then adapting it to the object-level arguments and comments?

I have trouble understanding this, because for me the outcome of the first one is that no one gets it, he has to repeat himself all the time without making the debate progress, and this is one more giant hurdle ... (read more)

Your comment is phrased as if the object-level refutations have been tried, while conveying the meta-level intuitions hasn't been tried. If anything, it's the opposite: the sequences (and to some extent HPMOR) are practically all content about how to think, whereas Yudkowsky hasn't written anywhere near as extensively on object-level AI safety.

This has been valuable for community-building, but less so for making intellectual progress - because in almost all domains, the most important way to make progress is to grapple with many object-level problems, unti... (read more)

Ngo and Yudkowsky on AI capability gains

Good point, I hadn't thought about that one.

Still, I have to admit that my first reaction is that this particular sequence seems uniquely positioned to singlehandedly increase the quality of the debate and of alignment research. Of course, maybe I only feel that way because it's the only one of the long list that I know of. ^^

(Another possibility I just thought of is that maybe this subsequence requires a lot of new preliminary subsequences, such that the work is far larger than you could expect from reading the words "a subsequence". Still sounds like it would be really valuable though.)

3Richard Ngo7mo
I don't expect such a sequence to be particularly useful, compared with focusing on more object-level arguments. Eliezer says that [https://www.lesswrong.com/posts/uXn3LyA8eNqpvdoZw/preface] the largest mistake he made in writing his original sequences was that he "didn’t realize that the big problem in learning this valuable way of thinking was figuring out how to practice it, not knowing the theory". Better, I expect, to correct the specific mistakes alignment researchers are currently making, until people have enough data points to generalise better.
Ngo and Yudkowsky on AI capability gains

That's a really helpful comment (at least for me)!

But at least step one could be saying, "Wait, do these two kinds of ideas actually go into the same bucket at all?"

I'm guessing that a lot of the hidden work here and in the next steps would come from asking stuff like:

  • do I need to alter the bucket for each new idea, or does it instead fit in its current form each time?
  • does the mental act of finding that an idea fits into the bucket remove some confusion and clarify things, or is it just a mysterious answer?
  • Does the bucket become more simple and more elegant wit
... (read more)
5johnswentworth7mo
Sounds like you should try writing it.
3Rob Bensinger7mo
I'ma guess that Eliezer thinks there's a long list of sequences he could write meeting these conditions, each on a different topic.
Ngo and Yudkowsky on AI capability gains

Damn. I actually think you might have provided the first clear pointer I've seen about this form of knowledge production, why and how it works, and what could break it. There's a lot to chew on in this reply, but thanks a lot for the amazing food for thought!

(I especially like that you explained the physical points and put links that actually explain the specific implication)

And I agree (tentatively) that a lot of the epistemology of science stuff doesn't have the same object-level impact. I was not claiming that normal philosophy of science was required, just that if that was not how we should evaluate and try to break the deep theory, I wanted to understand how I was supposed to do that.

Ngo and Yudkowsky on AI capability gains

That's when I understood that spatial structure is a Deep Fundamental Theory.

And it doesn't stop there. The same thing explains the structure of our roadways, blood vessels, telecomm networks, and even why the first order differential equations for electric currents, masses on springs, and water in pipes are the same.

(The exact deep structure of physical space which explains all of these is differential topology, which I think is what Vaniver was gesturing towards with "geometry except for the parallel postulate".)

Can you go into more detail here? I have d... (read more)

There's more than just differential topology going on, but it's the thing that unifies it all. You can think of differential topology as being about spaces you can divide into cells, and the boundaries of those cells. Conservation laws are naturally expressed here as constraints that the net flow across the boundary must be zero. This makes conserved quantities into resources, for which the use of is convergently minimized. Minimal structures with certain constraints are thus led to forming the same network-like shapes, obeying the same sorts of laws. (See... (read more)
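
For reference, the textbook way a conservation law becomes a boundary-flux constraint (restated here in my own words):

$$\frac{d}{dt}\int_V \rho \, dV \;=\; -\oint_{\partial V} \mathbf{J}\cdot d\mathbf{A} \quad\Longleftrightarrow\quad \frac{\partial \rho}{\partial t} + \nabla\cdot\mathbf{J} = 0,$$

so in steady state the net flow of the conserved quantity across any cell boundary is zero.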

Ngo and Yudkowsky on AI capability gains

This particular type of fallback-prediction is a common one in general: we have some theory which makes predictions, but "there's a phenomenon which breaks one of the modelling assumption in a way noncentral to the main theory" is a major way the predictions can fail.

That's a great way of framing it! And a great way of thinking about why these are not failures that are "worrisome" at first/in most cases.

Ngo and Yudkowsky on AI capability gains

Thanks for the thoughtful answer!

So, thermodynamics also feels like a deep fundamental theory to me, and one of the predictions it makes is "you can't make an engine more efficient than a Carnot engine." Suppose someone exhibits an engine that appears to be more efficient than a Carnot engine; my response is not going to be "oh, thermodynamics is wrong", and instead it's going to be "oh, this engine is making use of some unseen source."

My gut reaction here is that "you can't make an engine more efficient than a Carnot engine" is not the right kind of predi... (read more)

2Matthew "Vaniver" Graves7mo
Yeah, this seems reasonable to me. I think "how could you tell that theory is relevant to this domain?" seems like a reasonable question in a way that "what predictions does that theory make?" seems like it's somehow coming at things from the wrong angle.

From my (dxu's) perspective, it's allowable for there to be "deep fundamental theories" such that, once you understand those theories well enough, you lose the ability to imagine coherent counterfactual worlds where the theories in question are false.

To use thermodynamics as an example: the first law of thermodynamics (conservation of energy) is actually a consequence of Noether's theorem, which ties conserved quantities in physics to symmetries in physical laws. Before someone becomes aware of this, it's perhaps possible for them to imagine a universe exa... (read more)

I think "deep fundamental theory" is deeper than just "powerful abstraction that is useful in a lot of domains".

Part of what makes a Deep Fundamental Theory deeper is that it is inevitably relevant for anything existing in a certain way. For example, Ramón y Cajal (discoverer of the neuronal structure of brains) wrote:

Before the correction of the law of polarization, we have thought in vain about the usefulness of the referred facts. Thus, the early emergence of the axon, or the displacement of the soma, appeared to us as unfavorable arrangements acting

... (read more)
Ngo and Yudkowsky on AI capability gains

Thanks John for this whole thread!

(Note that I only read the whole Epistemology section of this post and skimmed the rest, so I might be saying stuff that is repeated/resolved elsewhere. Please point me to the relevant parts/quotes if that's the case. ;) )

Einstein's arrogance sounds to me like an early pointer in the Sequences for that kind of thing, with a specific claim about General Relativity being that kind of theory.

That being said, I still understand Richard's position and difficulty with this whole part (or at least what I read of Richard's diffic... (read more)

To be clear, this part:

It's one of those predictions where, if it's false, then we've probably discovered something interesting - most likely some place where an organism is spending resources to do something useful which we haven't understood yet.

... is also intended as a falsifiable prediction. Like, if we go look at the anomaly and there's no new thing going on there, then that's a very big strike against expected utility theory.

This particular type of fallback-prediction is a common one in general: we have some theory which makes predictions, but "ther... (read more)

And even if I feel what you're gesturing at, this sounds/looks like you're saying "even if my prediction is false, that doesn't mean that my theory would be invalidated". 

So, thermodynamics also feels like a deep fundamental theory to me, and one of the predictions it makes is "you can't make an engine more efficient than a Carnot engine." Suppose someone exhibits an engine that appears to be more efficient than a Carnot engine; my response is not going to be "oh, thermodynamics is wrong", and instead it's going to be "oh, this engine is making use of... (read more)

Discussion with Eliezer Yudkowsky on AGI interventions

I'm interested at all in Redwood Research's latest project because it seems to offer a prospect of wandering around with our eyes open asking questions like "Well, what if we try to apply this nonviolence predicate OOD, can we figure out what really went into the 'nonviolence' predicate instead of just nonviolence?" or if it works maybe we can try training on corrigibility and see if we can start to manifest the tiniest bit of the predictable breakdowns, which might manifest in some different way.

Trying to rephrase it in my own words (which will necessaril... (read more)

are you interested in Redwood's research because it might plausibly generate alignment issues and problems that are analogous to the real problem within the safer regime and technology we have now?

It potentially sheds light on small subpieces of things that are particular aspects that contribute to the Real Problem, like "What actually went into the nonviolence predicate instead of just nonviolence?"  Much of the Real Meta-Problem is that you do not get things analogous to the full Real Problem until you are just about ready to die.

Discussion with Eliezer Yudkowsky on AGI interventions

This is an apology for the tone and the framing of the above comment (and my following answers), which have both been needlessly aggressive, status-focused and uncharitable. Underneath are still issues that matter a lot to me, but others have discussed them better (I'll provide a list of linked comments at the end of this one).

Thanks to Richard Ngo for convincing me that I actually needed to write such an apology, which was probably the needed push for me to stop weaseling around it.

So what did I do wrong? The list is pretty damning:

  • I took something about
... (read more)

Thank you for this follow-up comment Adam, I appreciate it.

Discussion with Eliezer Yudkowsky on AGI interventions

Not planning to answer more on this thread, but given how my last messages seem to have confused you, here is my last attempt at sharing my mental model (so that you can flag in an answer, for readers of this thread, where you think I'm wrong).

Also, I just checked on the publication list, and I've read or skimmed most things MIRI published since 2014 (including most newsletters and blog posts on MIRI website).

My model of MIRI is that initially, there was a bunch of people including EY who were working mostly on decision theory stuff, tiling, model theory, th... (read more)
