Wiki Contributions


So, silly question that doesn't really address the point of this post (this may very well be just a point of clarity thing but it would be useful for me to have an answer due to earning-to-give related reasons off-topic for this post) --

Here you claim that CDT is a generalization of decision-theories that includes TDT (fair enough!):

Here, "CDT" refers -- very broadly -- to using counterfactuals to evaluate expected value of actions. It need not mean physical-causal counterfactuals. In particular, TDT counts as "a CDT" in this sense.

But here you describe CDT as two-boxing in Newcomb, which conflicts with my understanding that TDT one-boxes coupled with your claim that TDT counts as a CDT:

For example, in Newcomb, CDT two-boxes, and agrees with EDT about the consequences of two-boxing. The disagreement is only about the value of the other action.

So is this conflict a matter of using the colloquial definition of CDT in the second quote but a broader one in the first, having a more general framework for what two-boxing is than my own, or knowing something about TDT that I don't?

Thanks! This is great.

A year ago, Joaquin Phoenix made headlines when he appeared on the red carpet at the Golden Globes wearing a tuxedeo with a paper bag over his head that read, "I am a shape-shifter. I can't change the world. I can only change myself."

-- GPT-3 generated news article humans found easiest to distinguish from the real deal.

... I haven't read the paper in detail but we may have done it; we may be on the verge of superhuman skill at absurdist comedy! That's not even completely a joke. Look at the sentence "I am a shape-shifter. I can't change the world. I can only change myself." It's successful (whether intended or not) wordplay. "I can't change the world. I can only change myself" is often used as a sort of moral truism (e.g. Man in the Mirror, Michael Jackson). In contrast, "I am a shape-shifter" is a literal claim about one's ability to change themselves.

The upshot is that GPT-3 can equivocate between the colloquial meaning of a phrase and the literal meaning of a phrase in a way that I think is clever. I haven't looked into whether the other GPTs did this (it makes sense that a statistical learner would pick up this kind of behavior) but dayum.

I thought about this for longer than expected so here's an elaboration on inverse-inverse problems in the examples you provided:

Partial Differential Equations

Finding solutions to partial differential equations with specific boundary conditions is hard and often impossible. But we know a lot of solutions to differential equations with particular boundary conditions. If we match up those solutions with the problem at hand, we can often get a decent answer.

The direct problem: you have a function; figure out what relationships its derivatives have and its boundary conditions

The inverse problem: you know a bunch of relationships between derivatives and some boundary conditions; figure out the function that satisfies these conditions

The inverse inverse problem: you have a bunch of solutions to inverse problems (ie you can take a bunch of functions, solve the direct problem, and now you know the inverse problem that the function is a solution to), figure out which of these solutions look like the unsolved inverse problem you're currently dealing with


Performing division is hard but adding and multiplying is easy.

The direct problem: you have two numbers A and B; figure out what happens when you multiply them

The inverse problem: you have two numbers A and C; figure out what you can multiply A by to produce C

The inverse inverse problem: you have a bunch of solutions to inverse problems (ie you can take A and multiply it by all sorts of things like B' to produce numbers like C', solving direct problems. Now you know that B' is a solution to the inverse problems where you must divide C' by A. You just need to figure out out which of these inverse problem solutions look like the inverse problem at hand (ie if you find a C' so C' = C, you've solved the inverse problem)

In The Abstract

We have a problem like "Find X that produces Y" which is a hard problem from a broader class of problems. But we can produce a lot of solutions in that broader class pretty quickly by solving problems of the form "Find the Y' that X' produces." Then the original problem is just a matter of finding a Y' which is something like Y. Once we achieve this, we know that X will be something like X'.

Applications for Embedded Agency

The direct problem: You have a small model of something, come up with a thing much bigger than the model that the model is modeling well

The inverse problem: You have a world; figure out something much smaller than the world that can model it well

The inverse inverse problem: You have a a bunch of worlds and a bunch of models that model them well. Figure out which world looks like ours and see what it's corresponding model tells us about good models for modeling our world.

Some Theory About Why Inverse-Inverse Solutions Work

To speak extremely loosely, the assumption for inverse-inverse problems is something along the lines of "if X' solves problem Y', then we have reason to expect that solutions X similar to X' will solve problems Y similar to Y' ".

This tends to work really well in math problems with functions that are continuous/analytic because, as you take the limit of making Y' and Y increasingly similar, you can make their solutions X' and X arbitrarily close. And, even if you can't get close to that limit, X' will still be a good place to start work on finagling a solution X if the relationship between the problem-space and the solution-space isn't too crazy.

Division is a good example of an inverse-inverse problem with a literal continous and analytic mapping between the problem-space and solution-space. Differential equations with tweaked parameters/boundary conditions can be like this too although to a much weaker extent since they are iterative systems that allow dramatic phase transitions and bifurcations. Appropriately, inverse-inversing a differential equation is much, much harder inverse-inversing division.

From this perspective, the embedded agency inverse-problem is much more confusing than ordinary inverse-inverse problems. Like differential equations, there seem to be many subtle ways of tweaking the world (ie black swans) that dramatically change what counts as a good model.

Fortunately, we also have an advantage over conventional inverse problems: Unlike multiplying numbers or taking derivatives which are functions with one solution (typically -- sometimes things are undefined or weird), a particular direct problem of embedded agency likely has multiple solutions (a single model can be good at modeling multiple different worlds). In principle, this makes things easier -- it's more Y' (worlds that embedded agency is solved in) that we can compare to our Y (actual world).

Thoughts on Structuring Embedded Agency Problems

  • Inverse-inverse problems really on leveraging similarities between an unsolved problem and a solved problem which means we need to be really careful about defining things
    • Defining what it means to be a solution (to either the direct problem or inverse problem)
      • Defining a metric of good upon which we can use to compare model goodness or define worlds that models are good for. This requires us to either pick a set of goals that our model should be able to achieve or go meta and look at the model over all possible sets of goals (but I'm guessing this latter option runs into a No-Free-Lunch theorem). This is also non-trivial -- different world abstractions are good for different goals and you can't have them all
      • Defining a threshold after which we treat a world as a solution to the question "find a world that this model does well at." A Model:World pair can range a really broad spectrum of model performance
    • Defining what it means for a world to be similar to our own. Consider a phrase like "today's world will be similar to tomorrow if nothing impacts on it." This sort of claim makes sense to me but impact tends to be approached through Attainable Utility Preservaton
Can we switch to the interpolation regime early if we, before reaching the peak, tell it to keep the loss constant? Aka we are at loss l* and replace the loss function l(theta) with |l(theta)-l*| or (l(theta)-l*)^2.

Interesting! Given that stochastic gradient descent (SGD) does provide an inductive bias towards models that generalize better, it does seem like changing the loss function in this way could enhance generalization performance. Broadly speaking, SGD's bias only provides a benefit when it is searching over many possible models: it performs badly at the interpolation threshold because the lowish complexity limits convergence to a small number of overfitted models. Creating a loss function that allows SGD more reign over the model it selects could therefore improve generalization.


#1 SGD is inductively biased to more generalizeable models in general

#2 an loss-function gives all models with near a wider local minimum

#3 there are many different models where at a given level of complexity as long as

then it's plausible that changing the loss-function in this way will help emphasize SGD's bias towards models that generalize better. Point #1 is an explanation for double-descent. Point #2 seems intuitive to me (it makes the loss-function more convex and flatter when models are better performing) and Point #3 does too: there are many different sets of prediction that will all partially fit the training-dataset and yield the same loss function value of , which implies that there are also many different predictive models that yield such a loss function.

To illustrate point #3 above, imagine we're trying to fit the set of training observations . Fully overfitting this set (getting ) requires us to get all from to correct. However, we can partially overfit this set (getting ) in a variety of different ways. For instance, if we get all correct except for , we may have roughly different ways we can pick that could yield the same .[1] Consequently, our stochastic gradient descent process is free to apply its inductive bias to a broad set of models that have similar performances but make different predictions.

[1] This isn't exactly true because getting only one wrong without changing the predictions for other might only be achievable by increasing complexity since some predictions may be correlated with each other but it demonstrates the basic idea

But secondly, I’m not sure about the fragility argument: that if there is basically any distance between your description and what is truly good, you will lose everything. 
This seems to be a) based on a few examples of discrepancies between written-down values and real values where the written down values entirely exclude something, and b) assuming that there is a fast takeoff so that the relevant AI has its values forever, and takes over the world.

When I think of the fragility argument, I usually think in terms of Goodhart's Taxonomy. In particular, we might deal with--

  • Extremal Goodhart -- Human values are already unusually well-satisfied relative to what is normal for this universe and pushing proxies of our values to the extremes might inadvertently move the universe away from that in some way we didn't consider
  • Adversial Goodhart -- The thing that matters which is absent from our proxy is absolutely critical for satsifying our values and requires the same kinds of resources that our proxy relies on

My impression is that our values are complex enough that they have a lot of distinct absolutely critical pieces that hard to pin down even if you try really hard. I mainly think this because I once tried imagining how to make an AGI that optimizes for 'fulfilling human requests' and realized that fulfill, human and request all had such complicated and fragile definitions that it would take me an extremely long time to pin-down what I meant. And I wouldn't be confident in the result I made after pinning things down.

While I don't find this kind of argument fully convincing, I think it's more powerful than ' a) based on a few examples of discrepancies between written-down values and real values where the written down values entirely exclude something'.

That being said, I agree with b). I also lean toward the view that Slow Take-Off plus Machine-Learning may allow a non-catastrophic "good enough" solutions to human value problems.

My guess is that values that are got using ML but still somewhat off from human values are much closer in terms of not destroying all value of the universe, than ones that a person tries to write down. Like, the kinds of errors people have used to illustrate this problem (forget to put in, ‘consciousness is good’) are like forgetting to say faces have nostrils in trying to specify what a face is like, whereas a modern ML system’s imperfect impression of a face seems more likely to meet my standards for ‘very facelike’ (most of the time).

I agree that Machine-Learning will probably give us better estimations of human-flourishing than trying to write-down our values themselves. However, I'm still very apprehensive about it unless we're also being very careful about slow take-off. The main reasons for this apprehensiveness comes from Rohin Shah's work sequence on Value Learning (particularly ambitious value-learning). My main take-away from this was: Learning human values from examples of humans is hard without writing down some extra assumptions about human values (which may leave something important out).

Here's a practical example of this: If you create an AI that learns human values from a lot of examples of humans, what do you think its stance will be on Person-Affecting Views? What will its stance be on value-lexicality responses to Torture vs. Dust-Specks? My impression is that you'll have to write down something to tell the AI how to decide these cases (when should we categorize human behaviors as irrational vs when should we not). And a lot of people may regard the ultimate decision as catastrophic.

There are other complications too. If the AI can interact with the world in ways that change human values and then updates to care about those changed values, strange things might happen. For instance, the AI might pressure humanity to adopt simpler, easier to learn values if it's agential. This might not be so bad but I suspect there are things the AI might do that could potentially be very bad.

So, because I'm not that confident in ML value-learning and because I'm not that confident in human values in general, I'm pretty skeptical of the idea that machine-learning will avert extreme risks associated with value mispecification.

If the heuristics are optimized for "be able to satisfy requests from humans" and those requests sometimes require long-term planning, then the skill will develop. If it's only good at satisfying simple requests that don't require planning, in what sense is it superintelligent?

Yeah, that statement is wrong. I was trying to make a more subtle point about how an AI that learns long-term planning on a shorter time-frame is not necessarily going to be able to generalize to longer time-frames (but in the context of superintelligent AIs capable of doing human leve tasks, I do think it will generalize--so that point is kind of irrelevant). I agree with Rohin's response.

Thanks for replying!

This is not my belief. I think that powerful AI systems, even if they are a bunch of well developed heuristics, will be able to do super-long-term planning (in the same way that I'm capable of it, and I'm a bunch of heuristics, or Eliezer is to take your example).

Yeah, I intended that statement to be more of an elaboration on my own perspective than to imply that it represented your beliefs. I also agree that its wrong in the context of superintelligent AI we are discussing.

Should "I don't think" be "I do think"? Otherwise I'm confused.

Yep! Thanks for the correction.

I would be very surprised if this worked in the near term. Like, <1% in 5 years, <5% in 20 years and really I want to say < 1% that this is the first way we get AGI (no matter when)

Huh, okay... On reflection, I agree that directly hardcoded agent-y heuristics are unlikely to happen because AI-Compute tends to beat it. However, I continue to think that mathematicians may be able to use their knowledge of probability & logic to cause heuristics to develop in ways that are unusually agent-y at a fast enough rate to imply surprising x-risks.

This mainly boils down to my understanding that similarly well-performing but different heuristics for agential behavior may have very different potentials for generalizing to agential behavior on longer time-scales/chains-of-reasoning than the ones trained on. Consequently, I think there are particular ways of defining AI problem objectives and AI architecture that are uniquely suited to AI becoming generally agential over arbitrarily long time-frames and chains of reasoning.

However, I think we can address this kind of risk with the same safety solutions that could help us deal with AI that just have significantly better reasoning capabilities than us (but have not reasoning capabilities that have fully generalized!). Paul Christiano's work on amplification, for instance.

So the above is only a concern if people a) deliberately try to get AI in the most reckless way possible and b) get lucky enough that it doesn't get bottle-necked somewhere else. I'll buy the low estimates you're providing.

Thanks for recording this conversation! Some thoughts:

AI development will be relatively gradual and AI researchers will correct safety issues that come up.

I was pretty surprised to read the above--most of my intuitions about AI come down to repeatedly hearing the point that safety issues are very unpredictable and high variance, and that once a major safety issue happens, it's already too late. The arguments I've seen for this (many years of Eliezer-ian explanations of how hard it is to come out on top against superintelligent agents who care about different things than you) also seem pretty straightforward. And Rohin Shah isn't a stranger to them. So what gives?

Well, look at the summary on top of the full transcript link. Here are some quotes reflecting the point that Rohin is making which is most interesting to me--

From the summary:

Shah doesn’t believe that any sufficiently powerful AI system will look like an expected utility maximizer.

and, in more detail, from the transcript:

Rohin Shah: ... I have an intuition that AI systems are not well-modeled as, “Here’s the objective function and here is the world model.” Most of the classic arguments are: Suppose you’ve got an incorrect objective function, and you’ve got this AI system with this really, really good intelligence, which maybe we’ll call it a world model or just general intelligence. And this intelligence can take in any utility function, and optimize it, and you plug in the incorrect utility function, and catastrophe happens.
This does not seem to be the way that current AI systems work. It is the case that you have a reward function, and then you sort of train a policy that optimizes that reward function, but… I explained this the wrong way around. But the policy that’s learned isn’t really… It’s not really performing an optimization that says, “What is going to get me the most reward? Let me do that thing.”

If I was very convinced of this perspective, I think I'd share Rohin's impression that AI Safety is attainable. This is because I also do not expect highly strategic and agential actions focused on a single long-term goal to be produced by something that "has been given a bunch of heuristics by gradient descent that tend to correlate well with getting high reward and then it just executes those heuristics." To elaborate on some of this with my own perspective:

  • If our superintelligent AI is just a bunch of well developed heuristics, it is unlikely that those heuristics will be generatively strategic enough to engage in super-long-term planning
  • If our superintelligent AI gets punished based on any proxy for "misleading humans" and it can't do super-long-term planning, it is unlikely to come up with a good reward-attaining strategy that involves misleading humans
  • If our superintelligent AI does somehow develop a heuristic that misleads humans, it is yet more unlikely that the heuristic will be immediately well-developed enough to mislead humans long enough to cause an extinction level event. Instead, it will probably mislead the humans for more short-term gains at first--which will allow us to identify safety measures in advance

So I agree that we have a good chance of ensuring that this kind of AI is safe--mainly because I don't think the level of heuristics involved invoke an AI take-off slow enough to clearly indicate safety risks before they become x-risks.

On the other hand, while I agree with Rohin and Hanson's side that there isn't One True Learning Algorithm, there are potentially a multitude of advanced heuristics that approximate extremely agent-y and strategic long-term optimizations. We even have a real-life, human-level example of this. His name is Eliezer Yudkowsky[1]. Moreover, if I got an extra fifty IQ points and a slightly different set of ethics, I wouldn't be surprised if the set of heuristics composing my brain could be an existential threat. I think Rohin would agree with this belief in heuristic kludges that are effecively agential despite not being a One True Algorithm and, alone, this belief doesn't imply existential risk. If these agenty heuristics manifest gradually over time, we can easily stop them just by noticing them and turning the AI off before they get refined into something truly dangerous.

However, I don't think that machine-learned heuristics are the only way we can get highly dangerous agenty heuristics. We've made a lot of mathematical process on understanding logic, rationality and decision theory and, while machine-learned heuristics may figure out approximately Perfect Reasoning Capabilities just by training, I think it's possible that we can directly hardcode heuristics that do the same thing based on our current understanding of things we associate with Perfect Reasoning Capabilities.

In other words, I think that the dangerously agent-y heuristics which we can develop through gradual machine-learning processes could also be developed by a bunch of mathematicians teaming up and building a kludge that is similarly agent-y right out of the box. The former possibility is something we can mitigate gradually (for instance, by not continuing to build AI once they start doing things that look too agent-y) but the latter seems much more dangerous.

Of course, even if mathematicians could directly kludge some heuristics that can perform long-term strategic planning, implementing such a kludge seems obviously dangerous to me. It also seems rather unnecessary. If we could also just get superintelligent AI that doesn't do scary agent-y stuff by just developing it as a gradual extension of our current machine-learning technology, why would you want to do it the risky and unpredictable way? Maybe it'd be orders of magnitude faster but this doesn't seem worth the trade--especially when you could just directly improve AI-compute capabilities instead.

As of finishing this comment, I think I'm less worried about AI existential risks than I was before.

[1] While this sentence might seem glib, I phrased it the way I did specifically most, while most people display agentic behaviors, most of us aren't that agentic in general. I do not know Eliezer personally but the person who wrote a whole set of sequences on rationality, developed a new decision theory and started up a new research institute focused on saving the world is the best example of an agenty person I can come up with off the top of my head.

Well, they’re anti-correlated across different agents. But from the same agent’s perspective, they may still be able to maximize their own red-seeing, or even human red-seeing - they just won’t

Just making sure I can parse this... When I say that they're anti-correlated, I mean that the policy of maximizing X is akin to the policy of minimizing X to the extent that X and not X will at some point compete for the same instrumental resources. I will agree with the statement that an agent maximizing X who possesses many instrumental resources can use them to accomplish not X (and ,in this sense, the agent doesn't perceive X nd not X as anti-correlated); and I'll also agree that an agent optimizing X and another optimizing not X will be competitive for instrumental resources and view those things as anti-correlated.

they may still be able to maximize their own red-seeing, or even human red-seeing - they just won’t

I think some of this is a matter of semantics but I think I agree with this. There are also two different definitions of the word able here:

  • Able #1 : The extent to which it is possible for an agent to achieve X across all possible universes we think we might reside in
  • Able #2 : The extent to which it is possble for an agent to achieve X in a counterfactual where the agent has a goal of achieving X

I think you're using Able #2 (which makes sense--it's how the word is used colloquially). I tend to use Able #1 (because I read a lot about determinism when I was younger). I might be wrong about this though because you made a similar distinction between physical capability and anticipated possibility like this in Gears of Impact:

People have a natural sense of what they "could" do. If you're sad, it still feels like you "could" do a ton of work anyways. It doesn't feel physically impossible."
Imagine suddenly becoming not-sad. Now you "could" work when you're sad, and you "could" work when you're not-sad, so if AU just compared the things you "could" do, you wouldn't feel impact here.
But you did feel impact, didn't you?
Load More