Recent Discussion

Technical AGI safety research outside AI
155d 3 min read

I think there are many questions whose answers would be useful for technical AGI safety research, but which will probably require expertise outside AI to answer. In this post I list 30 of them, divided into four categories. Feel free to get in touch if you’d like to discuss these questions and why I think they’re important in more detail. I personally think that making progress on the ones in the first category is particularly vital, and plausibly tractable for researchers from a wide range of academic backgrounds.

Studying and understanding safety problems

  1. How strong are the econo
...

[copying from my comment on the EA Forum x-post]

For reference, some other lists of AI safety problems that can be tackled by non-AI people:

Luke Muehlhauser's big (but somewhat old) list: "How to study superintelligence strategy"

AI Impacts has made several lists of research problems

Wei Dai's "Problems in AI Alignment that philosophers could potentially contribute to"

Kaj Sotala's case for the relevance of psychology/cog sci to AI safety (I would add that Ought is currently testing the feasibility of IDA/Debate by doing psy...

The Dualist Predict-O-Matic ($100 prize)
46d 5 min read

This is a response to Abram's The Parable of Predict-O-Matic, but you probably don't need to read Abram's post to understand mine. While writing this, I thought of a way in which I think things could go wrong with dualist Predict-O-Matic, which I plan to post in about a week. I'm offering a $100 prize to the first commenter who's able to explain how things might go wrong in a sufficiently crisp way before I make my follow-up post.


Currently, machine learning algorithms are essentially "Cartesian dualists" when it comes to themselves and their environment. (Not a philosophy major -- let

...
1 Lukas Finnveden 1d

"Yes, that sounds more like reinforcement learning. It is not the design I'm trying to point at in this post."

Ok, cool, that explains it. I guess the main difference between RL and online supervised learning is whether the model takes actions that can affect its environment or only makes predictions of fixed data; so it seems plausible that someone training the Predict-O-Matic like that would think they're doing supervised learning, while they're actually closer to RL.

"That description sounds a lot like SGD. I think you'll need to be crisper for me to see what you're getting at."

No need, since we already found the point of disagreement. (But if you're curious, the difference is that SGD makes a change in the direction of the gradient, and this one wouldn't.)

1 John_Maxwell 15h

"it seems plausible that someone training the Predict-O-Matic like that would think they're doing supervised learning, while they're actually closer to RL."

How's that?

Assuming that people don't think about the fact that Predict-O-Matic's predictions can affect reality (which seems like it might have been true early on in the story, although it's admittedly unlikely to be true for too long in the real world), they might decide to train it by letting it make predictions about the future (defining and backpropagating the loss once the future comes about). They might think that this is just like training on predefined data, but now the Predict-O-Matic can change the data that it's evaluated against, so t...

1 Evan Hubinger 3d

I don't think we do agree, in that I think pressure towards simple models implies that they won't be dualist in the way that you're claiming.

2 Vanessa Kosoy 3d

Two remarks.

Remark 1: Here's a simple model of self-fulfilling prophecies. First, we need to decide how Predict-O-Matic outputs its predictions. In principle, it could (i) produce the maximum likelihood outcome, (ii) produce the entire distribution over outcomes, or (iii) sample an outcome from the distribution. But, since Predict-O-Matic is supposed to produce predictions for large volume data (e.g. the inauguration speech of the next US president, or the film that will win the Oscar in 2048), the most sensible option is (iii). Option (i) can produce an outcome that is maximum likelihood but is extremely untypical (since every individual outcome has very low probability), so it is not very useful. Option (ii) requires somehow producing an exponentially large vector of numbers, so it's infeasible. More sophisticated variants are possible, but I don't think any of them avoids the problem.

If the Predict-O-Matic is a Bayesian inference algorithm, an interesting dynamic will result. On each round, some hypothesis will be sampled out of the current belief state. If this hypothesis is a self-fulfilling prophecy, sampling it will cause its likelihood to go up. We get positive feedback: the higher the probability Predict-O-Matic assigns to the hypothesis, the more often it is sampled, the more evidence in favor of the hypothesis is produced, the higher its probability becomes. So, if it starts out as sufficiently probable a priori, the belief state will converge there.

Of course realistic learning algorithms are not Bayesian inference, but they have to approximate Bayesian inference in some sense. At the least, there has to be some large space of hypotheses s.t. if one of them is true, the algorithm will converge there. Any algorithm with this property probably displays the dynamics above.

Now, to the simple model. In this model we have just two outcomes: A and B (so it's not large volume data, but that doesn't matter). On each round a prediction is made, after which some o
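
The positive-feedback dynamic described in the comment above can be sketched in a few lines of Python. This is a toy illustration only, not the two-outcome model the comment goes on to define (that part is cut off here). Everything specific is an assumption made for the sketch: two hypotheses H_A and H_B that each give their favoured outcome probability 0.9, a world in which every announced prediction comes true, and the hypothetical helper names `bayes_update` and `simulate`.

```python
import random

# Assumed for illustration: each hypothesis assigns its favoured outcome
# probability 0.9 (H_A favours "A", H_B favours "B").
LIKELIHOOD = 0.9


def bayes_update(p_a, outcome):
    """Return the posterior probability of H_A after observing `outcome`."""
    like_a = LIKELIHOOD if outcome == "A" else 1 - LIKELIHOOD
    like_b = 1 - like_a  # H_B mirrors H_A
    return p_a * like_a / (p_a * like_a + (1 - p_a) * like_b)


def simulate(prior_a, rounds, seed=0):
    """Run the sample-predict-observe-update loop and return the belief history."""
    rng = random.Random(seed)
    p_a = prior_a
    history = [p_a]
    for _ in range(rounds):
        # Option (iii): sample a hypothesis from the current belief state
        # and announce the outcome it favours.
        prediction = "A" if rng.random() < p_a else "B"
        # Assumed world: the announced prediction always comes true.
        outcome = prediction
        # Observing the outcome raises the probability of whichever
        # hypothesis was sampled, so it gets sampled more often next round.
        p_a = bayes_update(p_a, outcome)
        history.append(p_a)
    return history


if __name__ == "__main__":
    beliefs = simulate(prior_a=0.5, rounds=200)
    print("final P(H_A):", beliefs[-1])
```

Under these assumptions the belief state is pushed towards whichever hypothesis happens to get sampled early on, and a higher prior makes that hypothesis more likely to win, which is the "if it starts out as sufficiently probable a priori" condition in the comment.
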
Deducing Impact
121mo 1 min read

The solution comes in the next post! Feel free to discuss amongst yourselves.

Reminder: Your sentence should explain impact from all of the perspectives we discussed (from XYZ to humans).

5 Ben Pace 20h

I set a fifteen minute timer, and wrote down my thoughts:

Okay, the main thought I have so far is that the examples mostly seem to separate “Affects personal goals” from “Affects convergent instrumental goals”.

  1. The pebbles being changed mostly affects the pebblesorters’ personal goals, and otherwise has little impact in the scheme of things for how able everyone is to get their goals achieved. In the long term, it doesn’t even affect the pebblesorters’ ability to achieve their goals (there basically is a constant amount of resources in the universe to turn into pebbles and sort, and the amount on their planet is minuscule).
  2. The badness of the traffic jam is actually mostly determined by it being bad for most agents’ goals in that situation (constrained to travel by cars and such). I might personally care more if I was in a medical emergency or something, but this wouldn’t be true from the perspective of convergent instrumental goals.
  3. Asteroid hitting earth damages any agent’s ability to affect the world. We care more because we’re on the planet, but overall it is determined by convergent instrumental goals.
  4. Star exploding is the same as asteroid on earth.
  5. Not sure the point of the last one regarding epistemic state.

I have a sense that TurnTrout is suggesting we build an AI that will optimise your personal goals, but while attempting to make no changes on the level of convergent instrumental goals - kill no agents, change no balances of power, move no money, etc. However, this runs into the standard problem of being useless. For example, you could tell an agent to cure cancer, and then it would cure all the people who have cancer. But then, in order to make sure it changes no balance of power or agent lives or anything, it would make sure to kill all the people who would’ve died, and make sure that other humans do not find out the AI has a cure for cancer. This is so that they don’t start acting very differently (e.g. turning off the AI and

Great responses.

What you're inferring is impressively close to where the sequence is leading in some ways, but the final destination is more indirect and avoids the issues you rightly point out (with the exception of the "ensuring the future is valuable" issue; I really don't think we can or should build eg low-impact yet ambitious singletons - more on that later).

[Epistemic Status: My inside view feels confident, but I’ve only discussed this with one other person so far, so I won't be surprised if it turns out to be confused.]

Armstrong and Mindermann (A&M) argue "that even with a reasonable simplicity prior/Occam’s razor on the set of decompositions, we cannot distinguish between the true decomposition and others that lead to high regret. To address this, we need simple ‘normative’ assumptions, which cannot be deduced exclusively from observations."

I explain why I think their argument is faulty, concludin...

Hey there!

Thanks for this critique; I have, obviously, a few comments ^_^

In no particular order:

  • First of all, the FHI channel has a video going over the main points of the argument (and of the research agenda); it may help to understand where I'm coming from:

  • A useful point from that: given human theory of mind, the decomposition of human behaviour into preferences and rationality is simple; without that theory of mind, it is complex. Since it's hard for us to turn off our theory of mind, the decomposition w

...
All I know is Goodhart
42d 3 min read
1 Charlie Steiner 1d

As far as I can tell we're not actually dividing the space of W's by a plane, we're dividing the space of E(W|π)'s by a plane. We don't know for certain that U-V is negative, we merely think so in expectation. This leads to the Bayesian correction for the Optimizer's curse, which lets us do better when presented with lots of options with different uncertainties, but if the uncertainty is fixed it won't let us pick a strategy that does better than the one that maximizes the proxy.

1 Stuart Armstrong 1d

"As far as I can tell we're not actually dividing the space of W's by a plane, we're dividing the space of E(W|π)'s by a plane."

Because expectation is affine with respect to utility functions, this does divide the space by a plane. Yes, there is a connection with the optimizer's curse style of reasoning.
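
The point about the Bayesian correction for the optimizer's curse can be illustrated with a small sketch. It assumes a normal-normal model that is not stated in the thread: each option's true value has a zero-mean normal prior with variance `prior_var`, and each proxy estimate adds independent normal noise with a known variance. The function names and the numbers are hypothetical, chosen only for illustration.

```python
def posterior_mean(estimate, noise_var, prior_var=1.0):
    """Shrink a noisy estimate towards the zero prior mean (normal-normal model)."""
    return estimate * prior_var / (prior_var + noise_var)


def naive_choice(estimates):
    """Index of the option with the highest raw proxy estimate."""
    return max(range(len(estimates)), key=lambda i: estimates[i])


def corrected_choice(estimates, noise_vars, prior_var=1.0):
    """Index of the option with the highest posterior mean."""
    shrunk = [posterior_mean(e, v, prior_var) for e, v in zip(estimates, noise_vars)]
    return max(range(len(shrunk)), key=lambda i: shrunk[i])


# Hypothetical proxy estimates for three options.
estimates = [2.0, 1.5, 1.0]
# Same uncertainty everywhere: every estimate shrinks by the same factor.
equal_noise = [1.0, 1.0, 1.0]
# Different uncertainties: the top raw estimate is also the noisiest.
mixed_noise = [4.0, 0.25, 1.0]

if __name__ == "__main__":
    print("naive:", naive_choice(estimates))
    print("corrected, equal noise:", corrected_choice(estimates, equal_noise))
    print("corrected, mixed noise:", corrected_choice(estimates, mixed_noise))
```

With equal noise the shrinkage factor is one positive constant, so the ranking is unchanged and the corrected choice matches the proxy-maximizing one. With unequal noise the noisiest estimate is shrunk the most, so the corrected choice can differ.
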

2 G Gordon Worley III 2d

I find this way of formalizing Goodhart weird. Is there a standard formalization of it, or is this your invention?

I'll explain what I think is weird. You define U and V such that you can calculate U - V to find W, but this appears to me to skip right past the most pernicious bit of Goodhart, which is that U is only knowable via a measurement (not necessarily a measure), such that I would say V=μ(U) for some "measuring" function μ:U→R and the problem is that μ(U) is correlated with but different from U since there may not even be a way to compare U.

To make it concrete with an example, suppose U is "beauty as defined by Gordon". We don't, at least as of yet, have a way to find U directly, and maybe we never will. So supposing we don't, if we want to answer questions like "would Gordon find this beautiful?" and "what painting would Gordon most like?" we need a measurement of U we can work with, as developed by, say, using IRL to discover a "beauty function" that describes U such that we could say how beautiful I would think something is. But we would be hard pressed to be precise about how far off the beauty function is from my sense of beauty because we only have a very gross measure of the difference: compare how beautiful the beauty function and I think some finite set of things are (finite because I'm a bounded, embedded agent who is never going to get to see all things, even if the beauty function somehow could), and even as we are doing this we are still getting a measurement of my internal sense of beauty rather than my internal sense of beauty itself because we are asking me to say how beautiful I think something is rather than directly observing my sense of beauty.

This is much of why I expect that Goodhart is extremely robust.

Even with your stated sense of beauty, knowing "this measure can be manipulated in extreme circumstances" is much better than nothing.

And we probably know quite a bit more; I'll continue this investigation, adding more information.
