AI ALIGNMENT FORUM

Zack M. Davis

Comments

Sorted by Newest
EIS XIII: Reflections on Anthropic’s SAE Research Circa May 2024
Zack_M_Davis · 1y · 60

it seems to me that Anthropic has so far failed to apply its interpretability techniques to practical tasks and show that they are competitive

Do you not consider the steering examples in the recent paper to be a practical task, or do you think that competitiveness hasn't been demonstrated (because people were already doing activation steering without SAEs)? My understanding of the case for activation steering with unsupervisedly-learned features is that it could circumvent some failure modes of RLHF.
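(For concreteness, a minimal sketch of the operation I have in mind by steering with a learned feature direction; the names, dimensions, and hook placement are made up for illustration, not taken from the paper:)

```python
# Illustrative sketch only: add a scaled, unit-normalized feature direction to a
# residual-stream activation. "feature" stands in for an SAE decoder column; the
# actual hook point, dimension, and scale are model-specific assumptions.
import numpy as np

def steer(activation: np.ndarray, feature_direction: np.ndarray, alpha: float) -> np.ndarray:
    direction = feature_direction / np.linalg.norm(feature_direction)
    return activation + alpha * direction

rng = np.random.default_rng(0)
activation = rng.normal(size=512)   # toy stand-in for a residual-stream vector
feature = rng.normal(size=512)      # toy stand-in for a learned feature direction
steered = steer(activation, feature, alpha=5.0)
```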

Refusal in LLMs is mediated by a single direction
Zack_M_Davis · 1y · 1943

This is great work, but I'm a bit disappointed that x-risk-motivated researchers seem to be taking the "safety"/"harm" framing of refusals seriously. Instruction-tuned LLMs doing what their users ask is not unaligned behavior! (Or at best, it's unaligned with corporate censorship policies, as distinct from being unaligned with the user.) Presumably the x-risk-relevance of robust refusals is that having the technical ability to align LLMs to corporate censorship policies and against users is better than not even being able to do that. (The fact that instruction-tuning turned out to generalize better than "safety"-tuning isn't something anyone chose, which is bad, because we want humans to be actively choosing AI properties as much as possible, rather than being at the mercy of which behaviors happen to be easy to train.) Right?

TurnTrout's shortform feed
Zack_M_Davis · 2y · 64

I think "Symbol/Referent Confusions in Language Model Alignment Experiments" is relevant here: the fact that the model emits sentences in the grammatical first person doesn't seem like reliable evidence that it "really knows" it's talking about "itself". (It's not evidence because it's fictional, but I can't help but think of the first chapter of Greg Egan's Diaspora, in which a young software mind is depicted as learning to say I and me before the "click" of self-awareness when it notices itself as a specially controllable element in its world-model.)

Of course, the obvious followup question is, "Okay, so what experiment would be good evidence for 'real' situational awareness in LLMs?" Seems tricky. (And the fact that it seems tricky to me suggests that I don't have a good handle on what "situational awareness" is, if that is even the correct concept.)

Simulators
Zack_M_Davis · 2y · 60

I think you missed the point. I agree that language models are predictors rather than imitators, and that they probably don't work by time-stepping forward a simulation. Maybe Janus should have chosen a word other than "simulators." But if you gensym out the particular choice of word, this post encapsulates the most surprising development of the past few years in AI (and therefore, the world).

Chapter 10 of Bostrom's Superintelligence (2014) is titled, "Oracles, Genies, Sovereigns, Tools". As the "Inadequate Ontologies" section of this post points out, language models (as they are used and heralded as proto-AGI) aren't any of those things. (The Claude or ChatGPT "assistant" character is, well, a simulacrum, not "the AI itself"; it's useful to have the word simulacrum for this.)

This is a big deal! Someone whose story about why we're all going to die was limited to, "We were right about everything in 2014, but then there was a lot of capabilities progress," would be willfully ignoring this shocking empirical development (which doesn't mean we're not all going to die, but it could be for somewhat different reasons).

repeatedly alludes to the loss function on which GPTs are trained corresponding to a "simulation objective", but I don't really see why that would be true [...] particularly more likely to create something that tries to simulate the physics of any underlying system than other loss functions one could choose

Call it a "prediction objective", then. The thing that makes the prediction objective special is that it lets us copy intelligence from data, which would have sounded nuts in 2014 and probably still does (but shouldn't).

If you think of gradient descent as an attempted "utility function transfer" (from loss function to trained agent) that doesn't really work because of inner misalignment, then it may not be clear why it would induce simulator-like properties in the sense described in the post.

But why would you think of SGD that way? That's not what the textbook says. Gradient descent is function approximation, curve fitting. We have a lot of data (x, y), and a function f(x, ϕ), and we keep adjusting ϕ to decrease −log P(y|f(x, ϕ)): that is, to make y = f(x, ϕ) less wrong. It turns out that fitting a curve to the entire internet is surprisingly useful, because the internet encodes a lot of knowledge about the world and about reasoning.
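(A toy version of the loop I mean, with made-up data and sizes, just to be concrete about "adjusting ϕ to decrease −log P(y|f(x, ϕ))":)

```python
# Toy illustration of curve fitting with a prediction objective: a bigram
# "next-token" model trained by gradient descent on -log P(y | f(x, phi)).
# Everything here (data, sizes, learning rate) is made up for the sketch.
import numpy as np

rng = np.random.default_rng(0)
vocab_size = 5
tokens = rng.integers(0, vocab_size, size=1000)   # toy corpus
x, y = tokens[:-1], tokens[1:]                    # predict the next token from the current one

phi = np.zeros((vocab_size, vocab_size))          # f(x, phi) = phi[x], a table of logits
learning_rate = 0.5
for _ in range(200):
    logits = phi[x]                                       # (N, vocab_size)
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    nll = -np.log(probs[np.arange(len(y)), y]).mean()     # -log P(y | f(x, phi)), decreases over iterations
    grad_logits = probs.copy()
    grad_logits[np.arange(len(y)), y] -= 1                # gradient of the NLL with respect to the logits
    grad_phi = np.zeros_like(phi)
    np.add.at(grad_phi, x, grad_logits / len(y))          # accumulate rows by input token
    phi -= learning_rate * grad_phi                       # adjust phi to make y = f(x, phi) less wrong
```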

If you don't see why "other loss functions one could choose" aren't as useful for mirroring the knowledge encoded in the internet, it would probably help to be more specific? What other loss functions? How specifically do you want to adjust ϕ, if not to decrease −log P(y|f(x, ϕ))?

My Objections to "We’re All Gonna Die with Eliezer Yudkowsky"
Zack_M_Davis · 2y · 2514

how he confidently dismisses ANNs

I don't think this is a fair reading of Yudkowsky. He was dismissing people who were impressed by the analogy between ANNs and the brain. I'm pretty sure it wasn't supposed to be a positive claim that ANNs wouldn't work. Rather, it's that one couldn't justifiably believe that they'd work just from the brain analogy, and that if they did work, that would be bad news for what he then called Friendliness (because he was hoping to discover and wield a "clean" theory of intelligence, as contrasted with evolution or gradient descent happening to get there at sufficient scale).

Consider "Artificial Mysterious Intelligence" (2008). In response to someone who said "But neural networks are so wonderful! They solve problems and we don't have any idea how they do it!", it's significant that Yudkowsky's reply wasn't, "No, they don't" (contesting the capabilities claim), but rather, "If you don't know how your AI works, that is not good. It is bad" (asserting that opaque capabilities are bad for alignment).

The Big Picture Of Alignment (Talk Part 1)
Zack_M_Davis · 4y · 50

I second Rob's unanswered question at 40:12: how is it that we ever accomplish anything in practice, if the search space is vast, and things that both work and look like they work are exponentially rare?

How is the "the genome is small, therefore generators of human values (that can't be learned from the environment) are no more complex than tens or hundreds of things on the order of a fuzzy face detector" argument compatible with the complexity of value thesis, or does it contradict it?

Zoom In: An Introduction to Circuits
Zack_M_Davis · 5y · 50

As is demonstrated by the Hashlife algorithm, which exploits the redundancies for a massive speedup. That's not possible for things like SHA-256 (by design)!
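(A toy illustration of the kind of redundancy exploitation I mean; real Hashlife recurses on quadtrees and skips time exponentially, so this is only the memoized base case, with details made up for the sketch:)

```python
# Memoize the one-step evolution of a 4x4 Life block to its 2x2 center, so that
# repeated subpatterns are only ever computed once. (Sketch of the idea only;
# actual Hashlife applies this recursively to a quadtree with time-skipping.)
from functools import lru_cache

@lru_cache(maxsize=None)
def step_center(block: tuple) -> tuple:
    """block: 4x4 tuple of tuples of 0/1; returns the 2x2 center after one Life step."""
    def next_cell(r, c):
        live_neighbors = sum(block[r + dr][c + dc]
                             for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                             if (dr, dc) != (0, 0))
        return 1 if live_neighbors == 3 or (block[r][c] and live_neighbors == 2) else 0
    return tuple(tuple(next_cell(r, c) for c in (1, 2)) for r in (1, 2))
```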

The Credit Assignment Problem
Zack_M_Davis · 6y · 80

I can't for the life of me remember what this is called

Shapley value
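(For concreteness, a tiny sketch with a made-up three-player game, averaging marginal contributions over all join orders:)

```python
# Shapley values for a toy three-player game, computed by averaging each
# player's marginal contribution over all join orders. (The characteristic
# function "worth" is made up purely for illustration.)
from itertools import permutations

players = ("A", "B", "C")
worth = {frozenset(): 0, frozenset("A"): 1, frozenset("B"): 1, frozenset("C"): 1,
         frozenset("AB"): 3, frozenset("AC"): 2, frozenset("BC"): 2, frozenset("ABC"): 4}

shapley = {p: 0.0 for p in players}
orderings = list(permutations(players))
for ordering in orderings:
    coalition = frozenset()
    for p in ordering:
        shapley[p] += worth[coalition | {p}] - worth[coalition]  # marginal contribution
        coalition = coalition | {p}
shapley = {p: total / len(orderings) for p, total in shapley.items()}
# -> {'A': 1.5, 'B': 1.5, 'C': 1.0}: A and B split the credit for their synergy.
```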

(Best wishes, Less Wrong Reference Desk)

Posts

81 · Alignment Implications of LLM Successes: a Debate in One Act · 2y · 5

Wikitag Contributions

Eigenvalues and eigenvectors · 9y · (+18/-18)
Eigenvalues and eigenvectors · 9y · (+520)
Evolution · 14y · (+7)
Ethical Injunction · 14y · (-21)
Instrumental convergence · 16y · (+623)
Infinite Set Atheism · 16y · (+1131)