Engineer at Donor to LW 2.0.

ESRogs's Posts

Sorted by New

ESRogs's Comments

[AN #81]: Universality as a potential solution to conceptual difficulties in intent alignment

Hmm, maybe I'm missing something basic and should just go re-read the original posts, but I'm confused by this statement:

So what we do here is say "belief set A is strictly 'better' if this particular observer always trusts belief set A over belief set B", and "trust" is defined as "whatever we think belief set A believes is also what we believe".

In this, belief set A and belief set B are analogous to A[C] and C (or some c in C), right? If so, then what's the analogue of "trust... over"?

If we replace our beliefs with A[C]'s, then how is that us trusting it "over" c or C? It seems like it's us trusting it, full stop (without reference to any other thing that we are trusting it more than). No?

[AN #81]: Universality as a potential solution to conceptual difficulties in intent alignment
Notably, we need to trust A[C] even over our own beliefs, that is, if A[C] believes something, we discard our position and adopt A[C]'s belief.

To clarify, this is only if we (or the process that generated our beliefs) fall into class C, right?

[AN #77]: Double descent: a unification of statistical theory and modern ML practice
The authors don't really suggest an explanation; the closest they come is speculating that at the interpolation threshold there's only ~one model that can fit the data, which may be overfit, but then as you increase further the training procedure can "choose" from the various models that all fit the data, and that "choice" leads to better generalization. But this doesn't make sense to me, because whatever is being used to "choose" the better model applies throughout training, and so even at the interpolation threshold the model should have been selected throughout training to be the type of model that generalized well.

I don't understand your objection here. If there is only ~one model that fits the data, and the training procedure is such that it will find that model, then aren't you just stuck w/ whatever level of generalizability that model has? And isn't it irrelevant that your procedure has some bias towards better generalizability?

Or are you saying that even if there's only one model at the interpolation threshold that fits the data, you'd expect the training procedure to pick a different model (one that doesn't completely fit the data) instead, because of the bias towards generalizability?

Seeking Power is Instrumentally Convergent in MDPs
This means that if there's more than twice the power coming from one move than from another, the former is more likely than the latter. In general, if one set of possibilities contributes 2K the power of another set of possibilities, the former set is at least K times more likely than the latter.

Where does the 2 come from? Why does one move have to have more than twice the power of another to be more likely? What happens if it only has 1.1x as much power?

Seeking Power is Instrumentally Convergent in MDPs
Remember how, as the agent gets more farsighted, more of its control comes from Chocolate and Hug, while also these two possibilities become more and more likely?

I don't understand this bit -- how does more of its control come from Chocolate and Hug? Wouldn't you say its control comes from Wait!? Once it ends up in Candy, Chocolate, or Hug, it has no control left. No?

Seeking Power is Instrumentally Convergent in MDPs
We bake the opponent's policy into the environment's rules: when you choose a move, the game automatically replies.

And the opponent plays to win, with perfect play?

Seeking Power is Instrumentally Convergent in MDPs
Imagine we only care about the reward we get next turn. How many goals choose Candy over Wait? Well, it's 50-50 – since we randomly choose a number between 0 and 1 for each state, both states have an equal chance of being maximal.

I got a little confused at the introduction of Wait!, but I think I understand it now. So, to check my understanding, and for the benefit of others, some notes:

  • the agent gets a reward for the Wait! state, just like the other states
  • for terminal states (the three non-Wait! states), the agent stays in that state, and keeps getting the same reward for all future time steps
  • so, when comparing Candy vs Wait! + Chocolate, the rewards after three turns would be (R_candy + γ
    * R_candy + γ^2 * R_candy) vs (R_wait + γ * R_chocolate + γ^2 * R_chocolate)

(I had at first assumed the agent got no reward for Wait!, and also failed to realize that the agent keeps getting the reward for the terminal state indefinitely, and so thought it was just about comparing different one-time rewards.)

Chris Olah’s views on AGI safety
It's meant to be analogous to imputing a value in a causal Bayes net

Aha! I thought it might be borrowing language from some technical term I wasn't familiar with. Thanks!

Chris Olah’s views on AGI safety
Takeoff does matter, in that I expect that this worldview is not very accurate/good if there's discontinuous takeoff, but imputing the worldview I don't think takeoff matters.

Minor question: could you clarify what you mean by "imputing the worldview" here? Do you mean something like, "operating within the worldview"? (I ask because this doesn't seem to be a use of "impute" that I'm familiar with.)

Debate on Instrumental Convergence between LeCun, Russell, Bengio, Zador, and More
And I think claim 5 is basically in line with what, say, Bostrom would discuss (where stabilization is a thing to do before we attempt to build a sovereign).

You mean in the sense of stabilizing the whole world? I'd be surprised if that's what Yann had in mind. I took him just to mean building a specialized AI to be a check on a single other AI.

Load More