Any reversible effect might be reversed. The question asks about the final effects of the mind.
This talk of "reversible" and "final" effects of a mind strikes me as suspicious: for one, in a block / timeless universe, there's no such thing as "reversible" effects, and for another, in the end, it may wash out in an entropic mess! But it does suggest a rephrasing of "a first-order approximation of the (direction of the) effects, understood both spatially and temporally".
We argue that a sustained commitment to transparency (e.g. to auditors) would make the RLHF research environment more robust from a safety standpoint.
Do you think this is more true of RLHF than other safety techniques or frameworks? At first blush, I would have thought "no", and the reasoning you provide in this post doesn't seem to distinguish RLHF from other things.
I think I probably didn't quite word that question right, and that's what's explaining the confusion - I meant something like "Once you've created the AAR, what alignment problems are left to be solved? Please answer in terms of the gap between the AAR and superintelligence."
Note that this paper in section 2.6 claims to demonstrate that some 175 billion parameter RLHF'ed LLM (presumably Claude or Claude 2?) does not use steganography in its CoT. It does this by:
The basic result is that paraphrasing parts of the CoT doesn't appear to reduce accuracy.
Toryn Q. Klassen, Parand Alizadeh Alamdari, and Sheila A. McIlraith wrote a paper on the multi-agent AUP thing, framing it as a study of epistemic side effects.
The core claim of this post is that if you train a network in some environment, the agent will not generalize optimally with respect to the reward function you trained it on, but will instead be optimal with respect to some other reward function in a way that is compatible with training-reward-optimality, and that this means that it is likely to avoid shutdown in new environments. The idea that this happens because reward functions are "internally represented" isn't necessary for those results. You're right that the post uses the phrase "internal represent...
The spark of strategicness, if such a thing is possible, recruits the surrounding mental elements. Those surrounding mental elements, by hypothesis, make goals achievable. That means the wildfire can recruit these surrounding elements toward the wildfire's ultimate ends... Also by hypothesis, the surrounding mental elements don't themselves push strongly for goals. Seemingly, that implies that they do not resist the wildfire, since resisting would constitute a strong push.
I think this is probably a fallacy of composition (maybe in the reverse direction ...
An attempt at rephrasing a shard theory critique of utility function reasoning, while restricting myself to things I basically agree with:
Yes, there are representation theorems that say coherent behaviour is optimizing some utility function. And yes, for the sake of discussion let's say this extends to reward functions in the setting of sequential decision-making (even tho I don't remember seeing a theorem for that). But: just because there's a mapping, doesn't mean that we can pull back a uniform measure on utility/reward functions to get a reasonable mea...
Not having read other responses, my attempt to answer in my own words: what goes wrong is that there are tons of possible cognitive influences that could be reinforced by rewards for making people smile. E.g. "make things of XYZ type think things are going OK", "try to promote physical configurations like such-and-such", "trying to stimulate the reinforcer I observe in my environment". Most of these decision-influences, when extrapolated to coherent behaviour where those decision-influences drive the course of the behaviour, lead to resource-gathering and ...
I think you're making a mistake: policies can be reward-optimal even if there's not an obvious box labelled "reward" that they're optimal with respect to the outputs of. Similarly, the formalism of "reward" can be useful even if this box doesn't exist, or even if the policy isn't behaving the way you would expect if you identified that box with the reward function. To be fair, the post sort of makes this mistake by talking about "internal representations", but I think everything goes thru if you strike out that talk.
I guess this doesn't fit with the use in the Truthful AI paper that you quote. Also in that case I have an objection that only punishing for negligence may incentivize an AI to lie in cases where it knows the truth but thinks the human thinks the AI doesn't/can't know the truth, compared to a "strict liability" regime.
I wonder if this use of "fair" is tracking (or attempting to track) something like "this problem only exists in an unrealistically restricted action space for your AI and humans - in worlds where it can ask questions, and we can make reasonable preparation to provide obviously relevant info, this won't be a problem".
Possibly, but in at least one of the two cases I was thinking of when writing this comment (and maybe in both), I made the argument in the parent comment and the person agreed and retracted their point. (I think in both cases I was talking about deceptive alignment via goal misgeneralization.)
I think you might be interpreting the break after the sentence "Their results are further evidence for feature linearity and internal activation robustness in these models." as the end of the related work section? I'm not sure why that break is there, but the section continues with them citing Mikolov et al (2013), Larsen et al (2015), White (2016), Radford et al (2016), and Upchurch et al (2016) in the main text, as well as a few more papers in footnotes.
Yes, I was. Good catch. Both earlier and now, the unusual formatting and a nonstandard related-work section have been causing confusion. Even so, the work after the break is much older. The comparison to works such as https://arxiv.org/abs/2212.04089 is not in the related work and gets only a sentence in a footnote: "That work took vectors between weights before and after finetuning on a new task, and then added or subtracted task-specific weight-diff vectors."
Is this a big difference? I really don't know; it'd be helpful if they'd contrast more. Is this work very novel and useful, an...
Alignment is an unusual field because the base of fans and supporters is much larger than the number of researchers
Isn't this entirely usual? Like, I'd assume that there are more readers of popular physics books than working physicists. Similarly for nature documentary viewers vs biologists.
In this same example, these interpretations are then validated by using a different interpretability tool – test set exemplars. This begs the question of why we shouldn’t just use test set exemplars instead.
Doesn't Olah et al (2017) answer this in the "Why visualize by optimization" section, where they show a bunch of cases where neurons fire on similar test set exemplars, but visualization by optimization appears to reveal that the neurons are actually 'looking for' specific aspects of those images?
I guess this proves the superiority of the mechanistic interpretability technique "note that it is mechanistically possible for your model to say that things are gorillas" :P
Re: the gorilla example, seems worth noting that the solution that was actually deployed ended up being refusing to classify anything as a gorilla, at least as of 2018 (perhaps things have changed since then).
BTW: the way I found that first link was by searching the title on google scholar, finding the paper, and clicking "All 5 versions" below (it's right next to "Cited by 7" and "Related articles"). That brought me to a bunch of versions, one of which was a seemingly-ungated PDF. This will probably frequently work, because AI researchers usually make their papers publicly available (at least in pre-print form).
To tie up this thread: I started writing a more substantive response to a section, but it took a while and was difficult, and I then got invited to dinner, so I probably won't get around to actually writing it.
I don't want to get super hung up on this because it's not about anything Yudkowsky has said but:
Consider the whole transformed line of reasoning:
avian flight comes from a lot of factors; you can't just ape one of the factors and expect the rest to follow; to get an entity which flies, that entity must be as close to a bird as birds are to each other.
IMO this is not a faithful transformation of the line of reasoning you attribute to Yudkowsky, which was:
...human intelligence/alignment comes from a lot of factors; you can't just ape one of the factors an
This is a valid point, and that's not what I'm critiquing. I'm critiquing how he confidently dismisses ANNs
I guess I read that as talking about the fact that at the time ANNs did not in fact really work. I agree he failed to predict that would change, but that doesn't strike me as a damning prediction.
Matters would be different if he said in the quotes you cite "you only get these human-like properties by very exactly mimicking the human brain", but he doesn't.
Didn't he? He at least confidently rules out a very large class of modern approaches.
This comment doesn't really engage much with your post - there's a lot there and I thought I'd pick one point to get a somewhat substantive disagreement. But I ended up finding this question and thought that I should answer it.
But have you ever, even once in your life, thought anything remotely like "I really like being able to predict the near-future content of my visual field. I should just sit in a dark room to maximize my visual cortex's predictive accuracy."?
I think I've been in situations where I've been disoriented by a bunch of random stuff happening and wished that less of it was happening so that I could get a better handle on stuff. An example I vividly recall was being in a history class in high school and being very bothered by the large number of conversations happening around me.
I don't really get your comment. Here are some things I don't get:
Edited to modify confidences about interpretations of EY's writing / claims.
In "Failure By Analogy" and "Surface Analogies and Deep Causes", the point being made is "X is similar in aspects A to thing Y, and X has property P" does not establish "Y has property P". The reasoning he instead recommends is to reason about Y itself, and sometimes it will have property P. This seems like a pretty good point to me.
This is a valid point, and that's not what I'm critiquing in that portion of the comment. I'm critiquing how -- on my read -- he confidently dismisses ...
But given that good, automated mechanistic hypothesis generation seems to be the only hope for scalable MI, it may be time for TAISIC to work on this in earnest. Because of this, I would argue that automating the generation of mechanistic hypotheses is the only type of MI work TAISIC should prioritize at this point in time.
"Automating" seems like a slightly too high bar here, given how useful human thoughts are for things. IMO, a better frame is that we have various techniques for combining human labour and algorithmic computation to generate hypothes...
Ah, you make this point here:
However, in practical applications some optimization pressure against the helper will still leak in, for example if the helper is used to decide whether to abort a training run or change its hyperparameters.
...Imitation learning methods seem less risky, as the optimization pressure is simply to match the empirical distribution of a demonstration dataset. The closest to “reward hacking” in this setting would be overfitting to the dataset, a relatively benign failure mode. There is still some risk of inner optimization objectives arising, which could then be adversarial to other systems (e.g. attempt to hide themselves from transparency tools), but comparatively speaking this is one of the methods with the lowest risk of adversarial failure. [Bolding by DanielFila
Conversely, it is less likely to happen if the optimization process is pursuing a goal that is largely independent from the helper’s output, such as if the helper is an interpretability tool checking for the presence of mesa-optimization.
Maybe I'm missing something obvious here, but for advanced AI systems under monitoring that happen to be misaligned, won't checking for the presence of mesa-optimization be relevant to the main system's goal, in cases where the results of those checks matter for whether to re-train or deploy the AI?
Note that the model does not have black or white stones as a concept, and instead only thinks of the stones as “own’s stones” and “opponent’s stones”, so we can do this without loss of generality.
I'm confused how this can be true - surely the model needs to know which player is black and which player is white to know how to incorporate komi, right?
Am I right that this algorithm is going to visit each "important" node once per path from that node to the output? If so, that could be pretty slow given a densely-connected interpretation, right?
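To make the worry concrete, here is a rough sketch of how fast "once per path to the output" grows. The graph is a made-up stand-in, not the algorithm or interpretation under discussion: `layered_dag`, its layer and width parameters, and the single `"out"` node are all assumptions for illustration.

```python
from functools import lru_cache


def layered_dag(num_layers, width):
    """Build a hypothetical densely connected DAG as a child-list dict.

    Every node in layer i feeds every node in layer i + 1, and every node
    in the last layer feeds a single output node named "out".
    """
    children = {"out": []}
    for layer in range(num_layers):
        for j in range(width):
            if layer == num_layers - 1:
                children[(layer, j)] = ["out"]
            else:
                children[(layer, j)] = [(layer + 1, k) for k in range(width)]
    return children


def count_paths_to_output(children, start):
    """Count distinct paths from `start` to "out" (memoized, so it is cheap)."""

    @lru_cache(maxsize=None)
    def paths(node):
        if node == "out":
            return 1
        return sum(paths(child) for child in children[node])

    return paths(start)


def total_visits(children):
    """Visits made by a traversal that touches each node once per path to "out"."""
    return sum(count_paths_to_output(children, node) for node in children)


dag = layered_dag(num_layers=4, width=3)
paths_from_first_layer = count_paths_to_output(dag, (0, 0))
visits = total_visits(dag)
```

In this toy graph a first-layer node has `width ** (num_layers - 1)` paths to the output, so a per-path traversal is exponential in depth, while the memoized count touches each node once. Whether the actual algorithm can share work across paths in the same way is a separate question.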
Empirically, human toddlers are able to recognize apples by sight after seeing maybe one to three examples. (Source: people with kids.)
Wait but they see a ton of images that they aren't told contain apples, right? Surely that should count. (Probably not 2^big_number bits tho)
As I understand it, the EA forum sometimes idiosyncratically calls this philosophy [rule consequentialism] "integrity for consequentialists", though I prefer the more standard term.
AFAICT in the canonical post on this topic, the author does not mean "pick rules that have good consequences when I follow them" or "pick rules that have good consequences when everyone follows them", but rather "pick actions such that if people knew I was going to pick those actions, that would have good consequences" (with some unspecified tweaks to cover places where that ...
In reality though, I think people often just believe stuff because people nearby them believe that stuff
IMO, a bigger factor is probably people thinking about topics that people nearby them think about, and having the primary factors that influence their thoughts be the ones people nearby focus on.
One reason that I doubt this story is that "try new things in case they're good" is itself the sort of thing that should be reinforced during training on a complicated environment, and would push towards some sort of obfuscated manipulation of humans (similar to how if you read about enough social hacks you'll probably be a bit scammy even tho you like people and don't want to scam them). In general, this motivation will push RL agents towards reward-optimal behaviour on the distribution of states they know how to reach and handle.
Here's one way you can do it: Suppose we're doing public key cryptography, and every person is associated with one public key. Then when you write things online you could use a linkable ring signature. That means that you prove that you're using a private key that corresponds to one of the known public keys, and you also produce a hash of your keypair, such that (a) the world can tell you're one of the known public keys but not which public key you are, and (b) the world can tell that the key hash you used corresponds to the public key you 'committed' to when writing the proof.
Actually I'm being silly, you don't need ring signatures, just signatures that are associated with identities and also used for financial transfers.
Note that for this to work you need a strong disincentive against people sharing their private keys. One way to do this would be if the keys were also used for the purpose of holding cryptocurrency.
Relevant quote I just found in the paper "Revisiting the Arcade Learning Environment: Evaluation Protocols and Open Problems for General Agents":
...The primary measure of an agent’s performance is the score achieved during an episode, namely the undiscounted sum of rewards for that episode. While this performance measure is quite natural, it is important to realize that score, in and of itself, is not necessarily an indicator of AI progress. In some games, agents can maximize their score by “getting stuck” in a loop of “small” rewards, ignoring what human p
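A minimal sketch of the score the quote describes, i.e. the undiscounted sum of an episode's rewards, and of how a reward loop can beat actually finishing. The two reward sequences are invented for illustration and do not come from the paper or from any real game.

```python
def episode_score(rewards):
    """Undiscounted sum of per-step rewards for one episode."""
    return sum(rewards)


# Hypothetical agent that gets stuck collecting a small reward on every step.
looping_rewards = [1] * 500

# Hypothetical agent that earns nothing for 49 steps, then completes the level.
finishing_rewards = [0] * 49 + [100]

looping_score = episode_score(looping_rewards)
finishing_score = episode_score(finishing_rewards)
```

Under these made-up numbers the looping agent scores higher even though it never does what a human player would call progress, which is the gap between score and progress that the quote points at.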
Here's a project idea that I wish someone would pick up (written as a shortform rather than as a post because that's much easier for me):
Here is an example story I wrote (lightly edited by TurnTrout) about how an agent trained by RL could plausibly not optimize reward, forsaking actions that it knew during training would get it high reward. I found it useful as a way to understand his views, and he has signed off on it. Just to be clear, this is not his proposal for why everything is fine, nor is it necessarily an accurate representation of my views, just a plausible-to-TurnTrout story for how agents won't end up wanting to game human approval:
Did this ever get written up? I'm still interested in it.