AI ALIGNMENT FORUM
Cleo Nardo

DMs open.

Sequences: Game Theory without Argmax

Comments (sorted by newest)

Base LLMs refuse too
Cleo Nardo · 1y

The base model is just predicting the likely continuation of the prompt, and it's a reasonable prediction that, when an assistant is given a harmful instruction, they will refuse. This behaviour isn't surprising.

My take on higher-order game theory
Cleo Nardo · 2y

Hey Nisan. Check the following passage from Domain Theory (Samson Abramsky and Achim Jung). This might be helpful for equipping Δ with an appropriate domain structure. (You mention [JP89] yourself.)

We should also mention the various attempts to define a probabilistic version of the powerdomain construction, see [SD80, Mai85, Gra88, JP89, Jon90].

  • [SD80] N. Saheb-Djahromi. CPO’s of measures for nondeterminism. Theoretical Computer Science, 12:19–37, 1980.
  • [Mai85] M. Main. Free constructions of powerdomains. In A. Melton, editor, Mathematical Foundations of Programming Semantics, volume 239 of Lecture Notes in Computer Science, pages 162–183. Springer Verlag, 1985.
  • [Gra88] S. Graham. Closure properties of a probabilistic powerdomain construction. In M. Main, A. Melton, M. Mislove, and D. Schmidt, editors, Mathematical Foundations of Programming Language Semantics, volume 298 of Lecture Notes in Computer Science, pages 213–233. Springer Verlag, 1988.
  • [JP89] C. Jones and G. Plotkin. A probabilistic powerdomain of evaluations. In Proceedings of the 4th Annual Symposium on Logic in Computer Science, pages 186–195. IEEE Computer Society Press, 1989.
  • [Jon90] C. Jones. Probabilistic Non-Determinism. PhD thesis, University of Edinburgh, Edinburgh, 1990. Also published as Technical Report No. CST63-90.

During my own foray into agent foundations and game theory, I also bumped into this exact obstacle: there is no obvious way to equip $\Delta$ with a least-fixed-point constructor $\mathrm{Fix}^{\Delta}_{X} : (X \to \Delta X) \to \Delta X$. In contrast, we can equip $\mathcal{P}$ with an LFP constructor $\mathrm{Fix}^{\mathcal{P}}_{X} : (X \to \mathcal{P}X) \to \mathcal{P}X$, $g \mapsto \{x \in X : x \in g(x)\}$.
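
For finite $X$, the $\mathcal{P}$ constructor really is this simple to compute. A minimal sketch (my own illustration; the function name and example are made up, not from the original comment):

```python
def fix_powerset(X, g):
    """Fix^P_X(g) = {x in X : x in g(x)}, for g : X -> P(X) on a finite set X."""
    return {x for x in X if x in g(x)}

# Hypothetical example: g(a) = {a, b}, g(b) = {c}, g(c) = {c}.
X = {"a", "b", "c"}
g = lambda x: {"a": {"a", "b"}, "b": {"c"}, "c": {"c"}}[x]
print(fix_powerset(X, g))  # {'a', 'c'}
```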


One trick is to define $\mathrm{Fix}^{\Delta}_{X}(g)$ to be the distribution $\pi \in \Delta X$ which maximises the entropy $H(\pi)$ subject to the constraint $g^{\Delta}(\pi) = \pi$. (A numerical sketch of this construction follows the bullet points below.)

  • A maximum-entropy distribution $\pi^*$ exists, because:
    • For $g : X \to \Delta X$, let $g^{\Delta} : \Delta X \to \Delta X$ be the lift of $g$ via the $\Delta$ monad, and let $G = \{\pi \in \Delta X \mid g^{\Delta}(\pi) = \pi\}$ be the set of fixed points of $g^{\Delta}$.
    • $\Delta X$ is a compact, convex, Hausdorff space and $g^{\Delta} : \Delta X \to \Delta X$ is continuous, so $G$ is nonempty (e.g. by the Brouwer/Schauder fixed-point theorem) and compact.
    • $H : \Delta X \to \mathbb{R}$ is continuous and $G \subseteq \Delta X$ is compact, so $H$ attains a maximum $\pi^*$ on $G$.
  • Moreover, $\pi^*$ must be unique, because:
    • $G$ is a convex set: since $g^{\Delta}$ is affine, if $\pi_1 = g^{\Delta}(\pi_1)$ and $\pi_2 = g^{\Delta}(\pi_2)$ then $\lambda_1 \pi_1 + \lambda_2 \pi_2 = g^{\Delta}(\lambda_1 \pi_1 + \lambda_2 \pi_2)$ for all $\lambda_1, \lambda_2 \geq 0$ with $\lambda_1 + \lambda_2 = 1$.
    • $H : \Delta X \to \mathbb{R}$ is strictly concave, i.e. $H(\lambda_1 \pi_1 + \lambda_2 \pi_2) \geq \lambda_1 H(\pi_1) + \lambda_2 H(\pi_2)$ whenever $\lambda_1 + \lambda_2 = 1$, with strict inequality if $\pi_1 \neq \pi_2$ and $\lambda_1, \lambda_2 > 0$.
    • Hence if $\pi^*_1, \pi^*_2 \in G$ both attained the maximum entropy with $\pi^*_1 \neq \pi^*_2$, then $0.5\pi^*_1 + 0.5\pi^*_2 \in G$ and $H(0.5\pi^*_1 + 0.5\pi^*_2) > 0.5H(\pi^*_1) + 0.5H(\pi^*_2) = \max_{\pi \in G} H(\pi)$, a contradiction.
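
To make the $\Delta$ construction concrete, here is a minimal numerical sketch. It assumes $X$ is finite, so $g$ can be represented as a row-stochastic matrix $P$; then $g^{\Delta}$ is just $\pi \mapsto \pi P$, and its fixed points are the stationary distributions of $P$. The function name, solver choice, and example matrix are mine, purely illustrative:

```python
import numpy as np
from scipy.optimize import minimize

def max_entropy_fixed_point(P):
    """Among the stationary distributions pi = pi @ P, return the one of maximum entropy."""
    n = P.shape[0]

    def neg_entropy(pi):
        pi = np.clip(pi, 1e-12, 1.0)
        return float(np.sum(pi * np.log(pi)))  # -H(pi)

    constraints = [
        {"type": "eq", "fun": lambda pi: pi @ P - pi},       # fixed-point constraint g^Delta(pi) = pi
        {"type": "eq", "fun": lambda pi: np.sum(pi) - 1.0},  # pi is a probability vector
    ]
    result = minimize(neg_entropy, np.full(n, 1.0 / n),
                      bounds=[(0.0, 1.0)] * n, constraints=constraints)
    return result.x

# Example: states 0 and 1 are absorbing, state 2 moves to each with probability 0.5.
# The stationary distributions form a segment, and the entropy-maximiser is its midpoint.
P = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.5, 0.5, 0.0]])
print(max_entropy_fixed_point(P))  # approximately [0.5, 0.5, 0.0]
```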

The justification here is the Principle of Maximum Entropy:

Given a set of constraints on a probability distribution, the "best" distribution that fits the data is the one of maximum entropy.

More generally, we should define $\mathrm{Fix}^{\Delta}_{X}(g)$ to be the distribution $\pi \in \Delta X$ which minimises the relative entropy $D_{\mathrm{KL}}(\pi \,\|\, \pi_0)$ subject to the constraint $g^{\Delta}(\pi) = \pi$, where $\pi_0$ is some uninformative prior such as the Solomonoff prior. The previous result is the special case where $\pi_0$ is the uniform prior, since $D_{\mathrm{KL}}(\pi \,\|\, \mathrm{uniform}) = \log|X| - H(\pi)$. The proof generalises by noting that $D_{\mathrm{KL}}(- \,\|\, \pi_0) : \Delta X \to \mathbb{R}$ is continuous and strictly convex. See the Principle of Minimum Discrimination Information.
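
The same sketch adapts to the relative-entropy version by swapping the objective (same finite-$X$ assumption, same caveat that the names and numbers are mine; it also assumes the prior has full support):

```python
import numpy as np
from scipy.optimize import minimize

def min_kl_fixed_point(P, prior):
    """Among the stationary distributions pi = pi @ P, return the one minimising D_KL(pi || prior)."""
    n = P.shape[0]

    def kl(pi):
        pi = np.clip(pi, 1e-12, 1.0)
        return float(np.sum(pi * (np.log(pi) - np.log(prior))))

    constraints = [
        {"type": "eq", "fun": lambda pi: pi @ P - pi},       # fixed-point constraint
        {"type": "eq", "fun": lambda pi: np.sum(pi) - 1.0},  # normalisation
    ]
    result = minimize(kl, np.full(n, 1.0 / n),
                      bounds=[(0.0, 1.0)] * n, constraints=constraints)
    return result.x

# Same example matrix as above; a biased prior tilts the answer within the fixed-point set.
P = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.5, 0.5, 0.0]])
print(min_kl_fixed_point(P, np.array([0.8, 0.1, 0.1])))  # approximately [0.89, 0.11, 0.0]
```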

Ideally, we'd like $\mathrm{Fix}^{\mathcal{P}}$ and $\mathrm{Fix}^{\Delta}$ to "coincide" modulo the support map $\mathrm{Supp} : \Delta X \to \mathcal{P}X$, i.e. $\mathrm{Supp}(\mathrm{Fix}^{\Delta}_{X}(g)) = \mathrm{Fix}^{\mathcal{P}}_{X}(\mathrm{Supp} \circ g)$ for all $g : X \to \Delta X$. Unfortunately, this isn't the case: if $g : H \mapsto 0.5\,|H\rangle + 0.5\,|T\rangle,\; T \mapsto |T\rangle$, then $\mathrm{Fix}^{\mathcal{P}}_{X}(\mathrm{Supp} \circ g) = \{H, T\}$ but $\mathrm{Supp}(\mathrm{Fix}^{\Delta}_{X}(g)) = \{T\}$.
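
Spelling the counterexample out (my own one-line check): the lift acts by

$$g^{\Delta}(\pi)(H) = 0.5\,\pi(H), \qquad g^{\Delta}(\pi)(T) = 0.5\,\pi(H) + \pi(T),$$

so $\pi = g^{\Delta}(\pi)$ forces $\pi(H) = 0.5\,\pi(H)$, i.e. $\pi(H) = 0$ and hence $\mathrm{Fix}^{\Delta}_{X}(g) = |T\rangle$; whereas $\mathrm{Fix}^{\mathcal{P}}_{X}(\mathrm{Supp} \circ g)$ keeps $H$ because $H \in \mathrm{Supp}(g(H)) = \{H, T\}$.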


Alternatively, we could consider the convex sets of distributions over X.

Let $\mathcal{C}X$ denote the set of convex sets of distributions over $X$. There is an ordering $\leq$ on $\mathcal{C}X$ where $A \leq B \iff A \supseteq B$. We have an LFP operator $\mathrm{Fix}^{\mathcal{C}}_{X} : (X \to \mathcal{C}X) \to \mathcal{C}X$ via $g \mapsto \bigcup\{S \in \mathcal{C}X : g^{\mathcal{C}}(S) = S\}$, where $g^{\mathcal{C}} : \mathcal{C}X \to \mathcal{C}X,\; S \mapsto \{\sum_{i=1}^{n} \alpha_i \pi_i \mid \pi_i \in g(x_i),\; \sum_{i=1}^{n} \alpha_i |x_i\rangle \in S\}$ is the lift of $g : X \to \mathcal{C}X$ via the $\mathcal{C}$ monad.
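
As a quick sanity check on the coin example above (my own computation, so worth double-checking): take $g : H \mapsto \{0.5\,|H\rangle + 0.5\,|T\rangle\},\; T \mapsto \{|T\rangle\}$. Then for any $S \in \mathcal{C}X$,

$$g^{\mathcal{C}}(S) = \{\, 0.5p\,|H\rangle + (1 - 0.5p)\,|T\rangle \;:\; p\,|H\rangle + (1 - p)\,|T\rangle \in S \,\},$$

so a fixed $S$ must satisfy $\sup_{\pi \in S} \pi(H) = 0.5 \sup_{\pi \in S} \pi(H)$, i.e. $\pi(H) = 0$ for every $\pi \in S$. The only nonempty fixed set is $\{|T\rangle\}$, so $\mathrm{Fix}^{\mathcal{C}}_{X}(g) = \{|T\rangle\}$, agreeing with $\mathrm{Fix}^{\Delta}$ rather than $\mathrm{Fix}^{\mathcal{P}}$ on this example.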

How Smart Are Humans?
Cleo Nardo · 2y

However, if it is the case that the difference between humans and monkeys is mostly due to a one-shot discrete difference (ie language), then this cannot necessarily be repeated to get a similar gain in intelligence a second time.

Perhaps language is zero-one, i.e. language renders a mind "cognitively complete" in the sense that the mind can represent anything about the external world and make any inferences using those representations. But intelligence is not thereby zero-one, because intelligence depends on continuous variables like computational speed, memory, etc.

More concretely, I am sceptical that "we end up with AI geniuses, but not AI gods", because running a genius at 10,000x speed, parallelised over 10,000x cores, with instantaneous access to the internet does (I think) make an AI god. A difference in quantity is a difference in kind.

That said, there might exist plausible threat models which require an AI that doesn't spatiotemporally decompose into less smart AIs. Could you sketch one out?

LIMA: Less Is More for Alignment
Cleo Nardo · 2y

In a controlled human study, responses from LIMA are either equivalent or strictly preferred to GPT-4 in 43% of cases;

I'm not sure how well this metric tracks what people care about: performance on particular downstream tasks (e.g. passing a law exam, writing bugless code, automating alignment research, etc.).

The Waluigi Effect (mega-post)
Cleo Nardo · 3y

Yep, you're correct. The original argument in the Waluigi mega-post was sloppy.

  • If μ updated the amplitudes in a perfectly Bayesian way and the context window were infinite, then the amplitude of each premise would be a martingale. But the finite context window breaks this.

  • Here is a toy model which shows how the finite context window leads to the Waluigi Effect. Basically, the finite context window biases the Dynamic LLM towards premises which can be evidenced by short strings (e.g. waluigis), and biases away from premises which can't be evidenced by short strings (e.g. luigis).
  • Regarding your other comment: a long context window doesn't mean that waluigis won't appear quickly. Even with an infinite context window, the waluigi might appear immediately. The assumption that the context window is short/finite is only needed to establish that the waluigi is an absorbing state but the luigi isn't.
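
Spelling out the martingale claim (my own unpacking, not from the post): if the simulator updates on each new token $x$ in a Bayesian way, then for any premise $W$ and context $c$,

$$\mathbb{E}_{x \sim P(\cdot \mid c)}\big[P(W \mid c, x)\big] = \sum_{x} P(x \mid c)\,\frac{P(W \mid c)\,P(x \mid c, W)}{P(x \mid c)} = P(W \mid c),$$

so the expected amplitude after the next token equals the current amplitude. Truncating the context window changes what gets conditioned on, which is what breaks this identity.
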
The Waluigi Effect (mega-post)
Cleo Nardo · 3y

You're correct. The finite context window biases the dynamics towards simulacra which can be evidenced by short prompts, i.e. biases away from luigis and towards waluigis.

But let me be more pedantic and less dramatic than I was in the article: the waluigi transitions aren't inevitable. The waluigis are approximately-absorbing classes in the Markov chain, but there are other approximately-absorbing classes which a luigi can fall into. For example, endlessly cycling through the same word (mode-collapse) is also an approximately-absorbing class.
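
Here is a toy illustration of "approximately absorbing" (the states and numbers are entirely made up, just to pin down the claim): a three-state chain where the luigi class slowly leaks probability into both the waluigi class and the mode-collapse class, and almost none flows back.

```python
import numpy as np

# States: 0 = luigi, 1 = waluigi, 2 = mode-collapse.
# Per-token transition probabilities are invented purely for illustration.
T = np.array([[0.995, 0.003, 0.002],   # luigi occasionally falls into either trap
              [0.001, 0.999, 0.000],   # waluigi is approximately absorbing
              [0.000, 0.000, 1.000]])  # mode-collapse is exactly absorbing

pi0 = np.array([1.0, 0.0, 0.0])        # start fully in the luigi class
for n_tokens in (100, 1000):
    print(n_tokens, pi0 @ np.linalg.matrix_power(T, n_tokens))
# After ~100 tokens most of the mass is still on luigi; after ~1000 tokens it has
# largely drained into the two (approximately) absorbing classes.
```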

The Waluigi Effect (mega-post)
Cleo Nardo · 3y

Yep, I think you might be right about the maths, actually.

I'm thinking that waluigis with 50% A and 50% B have been eliminated by LLM pretraining and definitely by RLHF. The only waluigis that remain are deceptive-at-initialisation.

So what we have left is a superposition of a bunch of luigis and a bunch of waluigis, where the waluigis are deceptive, and for each waluigi there is a different phrase that would trigger it.

I'm not claiming basin of attraction is the entire space of interpolation between waluigis and luigis.

Actually, maybe "attractor" is the wrong technical word to use here. What I want to convey is that the amplitude of the luigis can only grow very slowly and can be reversed, whereas the amplitude of the waluigi can suddenly jump to 100% in a single token and then remain there permanently. What's the right dynamical-systemy term for that?

The Waluigi Effect (mega-post)
Cleo Nardo · 3y

Yes — this is exactly what I've been thinking about!

Can we use RLHF or finetuning to coerce the LLM into interpreting the outside-text as unquestionably and literally true?

If the answer is "yes", then that's a big chunk of the alignment problem solved, because we can just send a sufficiently large language model the prompt with our queries and see what happens.

Posts

Uncertainty in all its flavours (2y)
Game Theory without Argmax [Part 1] (2y)
MetaAI: less is less for alignment. (2y)
Excessive AI growth-rate yields little socio-economic benefit. (3y)
The 0.2 OOMs/year target (3y)
Wittgenstein and ML — parameters vs architecture (3y)
Want to predict/explain/control the output of GPT-4? Then learn about the world, not about transformers. (3y)
The Waluigi Effect (mega-post) (3y)
Towards Hodge-podge Alignment (3y)
Is GPT-N bounded by human capabilities? No. (3y)
Shortform (2y)
Wikitag Contributions

Dealmaking (AI): 4 edits, 2 months ago