

Sorry, I'm confused by the terminology: 

Thanks for the extra detail!

(Actually, I was reading a post by Mark Xu which seems to suggest that the TradingAlgorithms have access to the price history rather than the update history as I suggested above)

My understanding after reading this is that TradingAlgorithms generate a new trading policy after each timestep (possibly with access to the update history, but I'm unsure). Is this correct? If so, it might be worth clarifying this, even though it seems clearer later.
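To make my reading concrete, here is a minimal Python sketch of what I have in mind. It assumes the price-history variant (the one Mark Xu's post seems to suggest), and every name in it (`trading_algorithm`, `Policy`, the averaging rule) is hypothetical and only for illustration, not something taken from the post.

```python
from typing import Callable, Iterator, List

# Hypothetical types: a price history is a list of past prices for one sentence,
# and a policy maps that history to a signed trade size.
PriceHistory = List[float]
Policy = Callable[[PriceHistory], float]


def trading_algorithm() -> Iterator[Policy]:
    """Yield a fresh trading policy at every timestep (assumed reading)."""
    day = 0
    while True:
        day += 1

        def policy(history: PriceHistory, window: int = day) -> float:
            # Illustrative rule only: compare the latest price with the
            # average of the last `window` prices and trade the difference.
            recent = history[-window:]
            return sum(recent) / len(recent) - history[-1]

        yield policy


algo = trading_algorithm()
policy_day_1 = next(algo)  # policy emitted at timestep 1
policy_day_2 = next(algo)  # a different policy emitted at timestep 2
```

The point of the sketch is only the shape: the algorithm is a sequence of policies, one per timestep, and each policy reads the price history rather than being a single fixed rule.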

Interesting, I think this clarifies things, but the framing also isn't quite as neat as I'd like.

I'd be tempted to redefine/reframe this as follows:

• Outer alignment for a simulator - Perfectly defining what it means to simulate a character. For example, how can we create a specification language that lets us pick out the character we want? And what do we do with counterfactuals, given that they aren't actually literal?

• Inner alignment for a simulator - Training a simulator to perfectly simulate the assigned character.

• Outer alignment for characters - Finding a character who would create good outcomes if successfully simulated.

In this model, there wouldn't be a separate notion of inner alignment for characters, as that would be automatic if the simulator were both inner and outer aligned.


I thought this was a really important point, although I might be biased: I had been confused by discussions that talked about the gradient landscape as though it could be modified without clarifying how (for example, whether they were discussing reinforcement learning).

First off, the base loss landscape of the entire model is a function that's the same across all training steps, and the configuration of the weights selects somewhere on this loss landscape. Configuring the weights differently can put the model on a different spot on this landscape, but it can't change the shape of the landscape itself.
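A toy sketch of the quoted point, assuming a one-parameter model with a made-up quadratic loss (`loss`, `grad`, and `train` are illustrative names, not anything from the post): gradient descent changes which point the weight occupies, while the function being evaluated never changes.

```python
def loss(w: float) -> float:
    """The fixed loss landscape: the same function at every training step."""
    return (w - 3.0) ** 2


def grad(w: float) -> float:
    """Derivative of the loss above with respect to the weight."""
    return 2.0 * (w - 3.0)


def train(w0: float, lr: float = 0.1, steps: int = 5) -> list:
    """Plain gradient descent; returns the (weight, loss) pairs visited."""
    w = w0
    trajectory = [(w, loss(w))]
    for _ in range(steps):
        w = w - lr * grad(w)  # only the weight moves
        trajectory.append((w, loss(w)))
    return trajectory


value_before = loss(1.0)       # probe the landscape at a fixed point
trajectory = train(0.0)        # move the weights around
value_after = loss(1.0)        # the same probe gives the same value
```

Under these assumptions, `value_before` and `value_after` are identical no matter how training went; the only thing training altered is the position recorded in `trajectory`.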

Note that this doesn't contradict the interpretation of the gradient hacker as having control over the loss landscape through subjunctive dependence. As an analogy, consider Newcomb's problem. Even if you accept that the contents of the box depend subjunctively on your decision and conclude that you should one-box, it's still true that the contents cannot change after Omega has set them up, and that they have no causal dependence on your action. What changes is only that the dominated action argument no longer holds, because of the subjunctive dependence.

In the section "The role of naturalized induction in decision theory", a lot of variables seem to be missing.

(Evolution) → (human values) is not the only case of inner alignment failure which we know about. I have argued that human values themselves are inner alignment failures on the human reward system. This has happened billions of times in slightly different learning setups. 

I expect that it has happened to some extent with animals as well. I wonder if anyone has ever looked into this.

converge to

Converge to 1? (Context is "9. Non-Dogmatic...").

Anyway, thanks so much for writing this! I found this to be a very useful resource.

It seems strange to treat ontological crises as a subset of embedded world-models. Couldn't a Cartesian agent face the same issues?
