insofar as human capabilities are not largely explained by consequentialist planning (as suggested by the approval reward picture), this should make us more optimistic about human-level AGI alignment.
further, this picture might suggest that the cheapest path to human-level AGI routes through approval reward-like mechanisms, giving us a large negative alignment tax.
ofc you might think that getting approval reward to work is actually a very narrow target, and that even if early human-level AGIs aren't coherent consequentialists, they will use some other mechanism for learning that doesn't route through approval reward and thus doesn't inherit the potentially nice alignment properties (or you could think that early human-level AGIs will be more like coherent consequentialists).
IMO most exciting mech-interp research since SAEs, great work.
A few thoughts / questions:
oh I see, by all(sensor_preds) I meant sum(logit_i for i in range(n_sensors)) (the probability that all sensors are activated). Makes sense, thanks!
is individual measurement prediction AUROC a) or b)? (sketch of both options below)
a) mean(AUROC(sensor_i_pred, sensor_i))
b) AUROC(all(sensor_preds), all(sensors))
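A minimal sketch of the two readings, assuming `sensor_preds` are per-sensor probabilities and `sensors` are 0/1 labels; the names and the logit-sum aggregation are my assumptions, not necessarily the paper's setup:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# sensor_preds: predicted probabilities, shape (n_examples, n_sensors)
# sensors:      ground-truth 0/1 labels,  shape (n_examples, n_sensors)

def auroc_option_a(sensor_preds, sensors):
    # a) mean of per-sensor AUROCs
    n_sensors = sensors.shape[1]
    return np.mean([
        roc_auc_score(sensors[:, i], sensor_preds[:, i])
        for i in range(n_sensors)
    ])

def auroc_option_b(sensor_preds, sensors):
    # b) AUROC of an aggregated "all sensors on" score against the
    #    conjunction of the ground-truth sensors; here the aggregate is
    #    the sum of per-sensor logits, matching sum(logit_i ...) above
    logits = np.log(sensor_preds) - np.log1p(-sensor_preds)  # prob -> logit
    all_pred_score = logits.sum(axis=1)
    all_true = sensors.all(axis=1).astype(int)
    return roc_auc_score(all_true, all_pred_score)
```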
did the paper report the accuracy of the pure prediction model (on the pure prediction task)? (trying to replicate and want a sanity check)
yeah I think I agree with all that... (like a psychopath can definitely learn language, accomplish things in the world, etc.)
maybe the thought experiment with the 18-year-old just prompted me to think about old arguments around "the consequentialist core" that aren't centrally about approval reward (and are more about whether myopic rewards can elicit consequentialist-ish and aligned planning).