mesaoptimizer

https://mesaoptimizer.com

 

learn math or hardware

Comments

Somewhat off-topic, so feel free to move this comment elsewhere.

I'm quite surprised to see that you've just shipped an MSc thesis, because I didn't expect you to be doing an MSc (or anything in traditional academia). I didn't think you needed one, since you have enough career capital to keep working indefinitely on the things you want to work on and get paid well for it. I also assumed you might find academia somewhat of a waste of your time compared to doing the work you actually want to do.

Perhaps you could help clarify what I'm missing?

  1. During RL finetuning and given this post-unsupervised initialization, there’s now an inductive bias towards just hooking human-like criteria for bidding on internal-AI-plans. IE humans give approval-based reinforcement, and an inductively easy way of upweighting logits on those actions is just hook up the human-like plan-criteria into the AI’s planning process, so the AI gets a humanlike “care about people” shard. P(3 | 2, 1) = .55 due to plurality of value, I expect this to be one way it learns to make decisions

This is where I'd put a significantly lower probability. Could you elaborate on why there's an inductive bias towards "just hooking human-like criteria for bidding on internal-AI-plans"? As far as I can tell, an architecture with an inductive bias for human-like values would have to be something at least closer to the human brain's structure than any arbitrary ML architecture we have right now. Rewarding a system for better modeling human beings' desires doesn't seem to me to lead it towards having similar desires. I'd use the "instrumental versus terminal desires" concept here, but I expect you'd consider that something that adds confusion instead of removing it.

Do you agree with: "a particular human's learning process + reward circuitry + 'training' environment → the human's learned values" is more informative about inner misalignment than the usual "evolution → human values"?

What I see is that two different optimizers (evolution and the environment) apply optimization pressure to a system, and then we claim that one of them provides more information than the other about a property of out-of-distribution behavior shift. This doesn't make sense to me, particularly since I believe most people live in environments that are very much "in distribution", and it is difficult for us to discuss misalignment without talking about extreme cases (as I described in the previous comment), or subtle cases (black swans?) that may not seem to matter.

I don’t know what you mean by “inner misalignment is easier”? Could you elaborate? I don’t think you mean “inner misalignment is more likely to happen” because you then go on to explain inner-misalignment & give an example and say “I worry you are being insufficiently pessimistic.”

My bad; I've updated the comment to clarify that I believe Quintin claims that solving / preventing inner misalignment is easier than one would expect given the belief that evolution's failure at inner alignment is the most significant and informative evidence that inner alignment is hard.

One implication I read was that inner values learned (ie the inner-misaligned values) may scale, which is the opposite prediction usually given.

I assume you mean that Quintin seems to claim that learned inner values may be retained as capabilities increase, whereas people usually believe they may not be. I believe they are retained too -- inner values seem significantly robust to increases in capability, especially since the system has the option to deceive. Do people really believe that learned inner values don't scale with an increase in capabilities? Perhaps we are defining inner values differently here.

By inner values, I mean terminal goals. Wanting dogs to be happy is not a terminal goal for most people, and I believe that, given enough optimization pressure, the hypothetical dog-lover would abandon this goal to optimize for their true terminal goal. Does that mean that people's inner values shift as their capabilities increase? Not exactly; it seems to me that we were mistaken about their inner values in the first place.

The most important claim in your comment is that "human learning → human values" is evidence that solving / preventing inner misalignment is easier than it seems when one looks at it from the "evolution → human values" perspective. Here's why I disagree:

Evolution optimized humans for an environment very different from what we see today. This implies that humans are operating out-of-distribution. We see evidence of misalignment. Birth control is a good example of this.

A human's environment continually optimizes the human towards a certain objective (one that changes as the environment changes). Within that distribution, the human is aligned with the environment's objective. Outside that distribution, the human may not be aligned with the objective the environment intended.

An outer misalignment example of this is a person brought up in a high-trust environment, and then thrown into a low-trust / high-conflict environment. Their habits and tendencies make them an easy mark for predators.

An inner misalignment example of this is a gay man who grows up in an environment hostile to his desires and identity (but who knows of environments where this isn't true). After a few extremely negative reactions to opening up to people or expressing his desires, he'll simply decide to present himself as heterosexual, bide his time, and gather the power to leave the environment he is in.

One may claim that the previous example somehow doesn't count because one's sexual orientation is biologically determined (and I'm assuming this to be the case for this example, even if it may not be entirely true), which means that evolution optimized this particular human to be inner misaligned relative to their environment. However, that doesn't weaken the argument: "human learning → human values" shows a huge amount of evidence that inner misalignment is ubiquitous.

I worry you are being insufficiently pessimistic.

[This comment is no longer endorsed by its author]