That's an excellent point.

I agree. I think that's probably a better way of clarifying the confusion than what I wrote.

“But also this person doesn't know about internal invariances in NN space or the compressivity of the parameter-function map (the latter in particular is crucial for reasoning about inductive biases), then I become extremely concerned”

Have you written about this anywhere?

Another frame that might be useful:

There's a difference between the number of mathematical functions that implement a set of requirements and the number of programs that implement the set of requirements.

Simplicity is about the latter, not the former.

The existence of a large number of programs that produce the exact same mathematical function contributes towards simplicity.
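
A toy sketch of the function-versus-program distinction, under assumptions that are mine rather than anything from the post: the "programs" are the 27 settings of a single threshold unit with weights and bias drawn from {-1, 0, 1}, and the "functions" are the truth tables those settings compute on two binary inputs. The names `truth_table` and `count_programs_per_function` are hypothetical.

```python
from collections import Counter
from itertools import product

# All four inputs to a two-input boolean function.
INPUTS = [(0, 0), (0, 1), (1, 0), (1, 1)]


def truth_table(w1, w2, b):
    """Function computed by one threshold unit: output 1 iff w1*x1 + w2*x2 + b > 0."""
    return tuple(int(w1 * x1 + w2 * x2 + b > 0) for x1, x2 in INPUTS)


def count_programs_per_function(values=(-1, 0, 1)):
    """Map each distinct truth table to the number of parameter settings producing it."""
    return Counter(
        truth_table(w1, w2, b) for w1, w2, b in product(values, repeat=3)
    )


counts = count_programs_per_function()
n_programs = sum(counts.values())   # number of parameter settings ("programs")
n_functions = len(counts)           # number of distinct truth tables ("functions")

for table, n in counts.most_common():
    # Compare a uniform prior over programs with a uniform prior over functions.
    print(table, f"programs={n}", f"P_program={n / n_programs:.3f}",
          f"P_function={1 / n_functions:.3f}")
```

In this toy setup the 27 parameter settings collapse to 13 distinct functions, and the constant-zero function is produced by 12 of them. Counting programs therefore gives it far more weight than counting functions would, which is the sense in which many programs computing the same function contribute towards its simplicity.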

I wrote up my views on the principle of indifference here:

I agree that it has certain philosophical issues, but I don’t believe that this is as fatal to counting arguments as you believe.

Towards the end I write:

“The problem is that we are making an assumption, but rather than owning it, we're trying to deny that we're making any assumption at all, i.e. "I'm not assuming a priori A and B have equal probability based on my subjective judgement, I'm using the principle of indifference". Roll to disbelieve.”

I feel less confident in my post than when I wrote it, but it still feels more credible than the position articulated in this post.

Otherwise: this was an interesting post. Well done on identifying some arguments that I need to digest.

IIRC, there was also evidence that Copilot was modulating code quality based on name ethnicity variations in code docs.

You don't know where they heard that?

One of the main challenges I see here is how to calibrate this. In other words, if I can't break a model despite adding an activation vector of strength x, what does this mean in terms of how safe we should consider the model to be? That is, how much extra adversarial prompting effort is that equivalent to, or how should I modify my probabilities of the model being safe?

Do you think it is likely that techniques like RLHF result in over-developed persuasiveness relative to other capabilities? If so, do you think we can modify the training to make this less of an issue or that it is otherwise manageable?

I think it’s also worth raising the possibility of a Kuhnian scenario where the “mature science” is actually missing something and a further breakthrough is required after that to move it into a new paradigm.

I’m confused. Let’s assume that the button probably isn’t pressed at the start. It seems quite likely that the first agent proposes building a sub-agent that maximally pursues its utility if the button is pressed in the first round and maximally pursues the second agent’s utility if it is not pressed in the first round. The second agent believes that the button is certainly not pressed during the first round, so it has no reason to negotiate further. If the button doesn’t get triggered in the first round (which it likely won’t if the sub-agents have had no time to do anything), it will forever after pursue the first agent’s utility.

The button plays no role in this utility, so instrumental incentives mean it will destroy it sooner or later. This seems like it breaks the system.

Now, this isn’t the only equilibrium. The first agent believes the button will always be pressed, so it has no incentive to argue for the criterion being whether the button is pressed in the first round vs. needing to have been pressed in this round and all previous rounds. On the other hand, instead of balance, it seems likely that one agent or the other creates a subagent that clobbers the other’s utility, with that agent assuming that this only happens in a world that never occurs.

Do you have any thoughts on what kind of experiments you’d like to see people run that would be more directly analogous?
