Noosphere89

Comments

(2) [For people without the security mindset:] Well, probably you just missed this one thing with circular groups; hotfix that, and then there will be no more vulnerabilities.

I actually do expect this to happen, and importantly I think this result is basically of academic interest, primarily because it is probably known why this adversarial attack works at all: the cause is the large-scale cycles of a game board. This is almost certainly going to be solved by new training, so I find it a curiosity at best.

I strongly downvoted this post, primarily because, contra you, I do actually think reframing/reinventing is valuable, and IMO the case for reframing/reinventing things is strawmanned here.

There is one valuable part of this post: the point that interpretability doesn't have good result-incentives. I agree with this criticism, but given the other points of the post, I would still strongly downvote it.

I disagree with this post for one reason:

  1. Amdahl's law limits how much cyborgism will actually work, and IMO it is why agents are more effective than simulators.

On Amdahl's law, John Wentworth's post on the long tail is very relevant, as it limits the usefulness of cyborgism:

https://www.lesswrong.com/posts/Nbcs5Fe2cxQuzje4K/value-of-the-long-tail
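To make the bound concrete, here is a minimal sketch of Amdahl's law (the function name and the specific numbers are my own illustration, not from either post): if a cyborgism-style workflow only accelerates part of the work, the un-accelerated remainder caps the total gain, no matter how large the speedup on the accelerated part.

```python
def amdahl_speedup(parallel_fraction: float, speedup_factor: float) -> float:
    """Overall speedup when only `parallel_fraction` of the work
    is accelerated by `speedup_factor` (Amdahl's law)."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / speedup_factor)

# Even an enormous speedup on 90% of the work caps the total gain near 10x,
# because the remaining 10% (the "long tail") still runs at the old speed.
print(amdahl_speedup(0.9, 1_000_000))  # just under 10
print(amdahl_speedup(0.5, 100))        # just under 2
```

The point of the sketch is that the serial fraction, not the size of the speedup, dominates the limit.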

I think the big claim the post relies on is that values are a natural abstraction, i.e. that the Natural Abstractions Hypothesis holds. This is admittedly very different from the thesis that value is complex and fragile.

It is not that AI would naturally learn human values, but that it's relatively easy for us to point at human values/Do What I Mean/Corrigibility, and that they are natural abstractions.

This is not a claim that is satisfied by default, but is a claim that would be relatively easy to satisfy if true.

The robust values hypothesis from DragonGod is worth looking at, too.

From the link below, I'll quote:

Consider the following hypothesis:

1. There exists a "broad basin of attraction" around a privileged subset of human values[1] (henceforth "ideal values")
    * The larger the basin, the more robust values are
    * Example operationalisations[2] of "privileged subset" that gesture in the right direction:
        * Minimal set that encompasses most of the informational content of "benevolent"/"universal"[3] human values
        * The "minimal latents" of "benevolent"/"universal" human values
    * Example operationalisations of "broad basin of attraction" that gesture in the right direction:
        * A neighbourhood of the privileged subset with the property that all points in the neighbourhood are suitable targets for optimisation (in the sense used in #3)
        * Larger neighbourhood → larger basin
2. Said subset is a "naturalish" abstraction
    * The more natural the abstraction, the more robust values are
    * Example operationalisations of "naturalish abstraction":
        * The subset is highly privileged by the inductive biases of most learning algorithms that can efficiently learn our universe (more privileged → more natural)
        * Most efficient representations of our universe contain a simple embedding of the subset (simpler embeddings → more natural)
3. Points within this basin are suitable targets for optimisation
    * The stronger the optimisation pressure for which the target is still suitable, the more robust values are
    * Example operationalisations of "suitable targets for optimisation":
        * Optimisation of this target is existentially safe[4]
        * More strongly, we would be "happy" (were we fully informed) for the system to optimise for these points

This is an important hypothesis, since if it has a non-trivial chance of being correct, then AI Alignment gets quite a bit easier. And given the shortening timelines, I think this is an important hypothesis to test.

Here's a link below for the robust values hypothesis:

https://www.lesswrong.com/posts/YoFLKyTJ7o4ApcKXR/disc-are-values-robust
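To picture what a "basin of attraction" means here, a toy 1-D sketch (entirely my own illustration, not from DragonGod's post): a double-well loss whose right minimum stands in for the "ideal values" and whose left minimum stands in for some other attractor. Initialisations inside the right-hand basin converge to the ideal point under optimisation; the width of that basin is the robustness.

```python
def grad(x: float) -> float:
    # Derivative of the double-well loss (x^2 - 1)^2,
    # which has minima at x = -1 and x = +1.
    return 4.0 * x * (x * x - 1.0)

def optimise(x: float, lr: float = 0.05, steps: int = 500) -> float:
    # Plain gradient descent from the initialisation x.
    for _ in range(steps):
        x -= lr * grad(x)
    return x

# Starts inside the right-hand basin (x > 0) all reach the minimum at +1;
# a start in the other basin converges to the other minimum instead.
print(optimise(0.2), optimise(0.9))  # both near +1.0
print(optimise(-0.2))                # near -1.0
```

The hypothesis, in these terms, is that the basin around ideal values is wide, so imperfect pointers at values still converge to them.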

[This comment is no longer endorsed by its author]

In the human case, what does the work is that capability differences are very bounded, not that alignment succeeded. If we had capability differentials as wide as one order of magnitude, then I think our attempted alignment solutions would fail miserably, leading to mass death or worse.

That's the problem with AI: multiple-orders-of-magnitude differences in capabilities are pretty likely, and all real alignment technologies fail hard once we get anywhere near, say, 3x differentials, let alone 10x ones.

You're welcome, though did you miss a period here or did you want to write more?

See a Twitter thread of some brief explorations I and Alex Silverstein did on this

Further, it’s helped to build out a toolkit of techniques to rigorously reverse engineer models. In the process of understanding this circuit, they refined the technique of activation patching into more sophisticated approaches such as path patching (and later causal scrubbing). And this has helped lay the foundations for developing future techniques! There are many interpretability techniques that are more scalable but less mechanistic, like probing. Having some

See a Twitter thread of some brief explorations I and Alex Silverstein did on this

I think you cut yourself off there both times.

My short answer: violations of the IID assumption are the likeliest problem in trying to generalize your values, and I see this as the key flaw underlying the post.
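As a minimal sketch of what an IID violation does (my own toy example, not from the post): a model that fits the training distribution well can fail badly once inputs shift outside the training range, because the fit was only ever validated under the training distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

# Train on inputs from [0, 1], where the true relationship is y = x^2.
x_train = rng.uniform(0.0, 1.0, 200)
y_train = x_train ** 2

# A linear fit looks fine in-distribution...
slope, intercept = np.polyfit(x_train, y_train, deg=1)
mse_train = np.mean((slope * x_train + intercept - y_train) ** 2)

# ...but the same model is wildly wrong on shifted inputs from [5, 6].
x_shift = rng.uniform(5.0, 6.0, 200)
mse_shift = np.mean((slope * x_shift + intercept - x_shift ** 2) ** 2)

print(mse_train)  # small
print(mse_shift)  # orders of magnitude larger
```

The analogy to values: a proxy that matches what we want on the training distribution can diverge arbitrarily once the deployment distribution shifts.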

You can make the "some subnetwork just models its training process and cares about getting low loss, and then gets promoted" argument against literally any loss function, even some hypothetical "perfect" one (which, TBC, I think is a mistaken way of thinking). If I buy this argument, it seems like a whole lot of alignment dreams immediately burst into flame. No loss function would be safe. This conclusion, of course, does not decrease in the slightest the credibility of the argument. But I don't perceive you to believe this implication.

This might be the cleanest explanation for why alignment is so hard by default: loss functions do not work, and reward functions don't work well either.
