AI ALIGNMENT FORUM
Abram Demski

Sequences: Pointing at Normativity · Implications of Logical Induction · Partial Agency · Alternate Alignment Ideas · CDT=EDT? · Embedded Agency

Comments (sorted by newest)

Dialogue on What It Means For Something to Have A Function/Purpose
abramdemski · 3d · 20

This reminds me of Ramana’s question about what “enforces” normativity. The question immediately brought me back to a Peter Railton introductory lecture I saw (though I may be misremembering / misunderstanding / misquoting, it was a long time ago). He was saying that real normativity is not like the old Windows solitaire game, where if you try to move a card on top of another card illegally it will just prevent you, snapping the card back to where it was before. Systems like that, where you have to follow the rules, plausibly have no normativity to them. In a way the whole point of normativity is that it is not enforced; if it were, it wouldn’t be normative.

I'm reminded of trembling-hand equilibria. Nash equilibria don't have to be self-enforcing; there can be tied-expectation actions which nonetheless simply aren't taken, so that agents could rationally move away from the equilibrium. Trembling-hand captures the idea that all actions have to have some probability (but some might be vanishingly small). Think of it as a very shallow model of where norm-violations come from: they're just random! 

Evolutionarily stable strategies are perhaps an even better model of this, with self-enforcement being baked into the notion of equilibrium: stable strategies are those which cannot be invaded by alternate strategies. 

Neither of these capture the case where the norms are frequently violated, however.
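
To make the ESS idea concrete, here is a minimal sketch of the textbook stability check (the Maynard Smith conditions), applied to a Hawk-Dove game whose particular payoffs are only illustrative:

```python
# Minimal check of the standard ESS conditions for a symmetric two-player game.
# E[i][j] = payoff to a player using strategy i against an opponent using strategy j.
# Strategy s is an ESS if, for every mutant t != s, either
#   E(s,s) > E(t,s), or E(s,s) == E(t,s) and E(s,t) > E(t,t).

def is_ess(E, s):
    for t in range(len(E)):
        if t == s:
            continue
        if E[s][s] > E[t][s]:
            continue                  # strictly better against the incumbent population
        if E[s][s] == E[t][s] and E[s][t] > E[t][t]:
            continue                  # tied against incumbents, better against the mutant
        return False                  # mutant strategy t can invade
    return True

# Hawk-Dove with value V=2, cost C=4 (illustrative numbers): neither pure strategy
# is stable here; only a mixed strategy resists invasion.
V, C = 2, 4
payoffs = [[(V - C) / 2, V],          # Hawk vs (Hawk, Dove)
           [0.0,         V / 2]]      # Dove vs (Hawk, Dove)
print(is_ess(payoffs, 0), is_ess(payoffs, 1))   # False False
```

The stability condition is doing the "enforcement" here: any would-be invader does strictly worse either against the incumbent population or against its own kind, which closes off the tied-expectation escape routes available in a bare Nash equilibrium.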

Dialogue on What It Means For Something to Have A Function/Purpose
abramdemski · 3d · 20

My notion of a function “for itself” is supposed to be that the functional mechanism somehow benefits the thing of which it’s a part. (Of course hammers can benefit carpenters, but we don’t tend to think of the hammer as a part of the carpenter, only a tool the carpenter uses. But I must confess that where that line is I don’t know, given complications like the “extended mind” hypothesis.)

Putting this in utility-theoretic terminology, you are saying that "for itself" telos places positive expectation on its own functional mechanism, or, perhaps more strongly, spends significant bits of its decision-making power on self-preservation.

A representation theorem along these lines might reveal conditions under which such structures are usefully seen as possessing beliefs: a part of the self-preserving structure whose telos is map-territory correspondence. 

Dialogue on What It Means For Something to Have A Function/Purpose
abramdemski · 3d · 20

Steve

As you know, I totally agree that mental content is normative - this was a hard lesson for philosophers to swallow, or at least the ones that tried to “naturalize” mental content (make it a physical fact) by turning to causal correlations. Causal correlation was a natural place to start, but the problem with it is that, intuitively, mental content can misrepresent - my brain can represent Santa Claus even though (sorry) it can’t have any causal relation with Santa. (I don’t mean my brain can represent ideas or concepts or stories or pictures of Santa - I mean it can represent Santa.)

Ramana

Misrepresentation implies normativity, yep.

My current understanding of what's going on here:
* There's a cluster of naive theories of mental content, EG the signaling games, which attempt to account for meaning in a very naturalistic way, but fail to account properly for misrepresentation. I think some of these theories cannot handle misrepresentation at all, EG, Mark of the Mental (a book about teleosemantics) discusses how the information-theory notion of "information" has no concept of misinformation (a signal is not true or false, in information theory; it is just data, just bits). Similarly, signaling games have no way to distinguish truthfulness from a lie that's been uncovered: the meaning of a signal is what's probabilistically inferred from it, so there's no difference between a lie that the listener understands to be a lie & a true statement. So both signaling games and information theory are in the mistaken "mental content is not normative" cluster under discussion here. (A minimal sketch of this point about signaling games follows this list.)
* Santa is an example of misrepresentation here. I see two dimensions of misrepresentation so far:
 * Misrepresenting facts (asserting something untrue) vs misrepresenting referents (talking about something that doesn't exist, like Santa). These phenomena seem very close, but we might want to treat claims about non-existent things as meaningless rather than false, in which case we need to distinguish these cases.
 * Simple misrepresentation (falsehood or nonexistence) vs deliberate misrepresentation (lie or fabrication).
* "Misrepresentation implies normativity" is saying that to model misrepresentation, we need to include a normative dimension. It isn't yet clear what that normative dimension is supposed to be. It could be active, deliberate maintenance of the signaling-game equilibrium. It could be a notion of context-independent normativity, EG the degree to which a rational observer would explain the object in a telic way ("see, these are supposed to fit together..."). Etc.
 * The teleosemantic answer is typically one where the normativity can be inherited transitively (the hammer is for hitting nails because humans made it for that), and ultimately grounds out in the naturally-arising proto-telos of evolution by natural selection (human telic nature was put there by evolution). Ramana and Steve find this unsatisfying due to swamp-man examples.
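
To illustrate the signaling-games point from the first bullet: in a Lewis-style sender/receiver model, the "meaning" of a signal is exhausted by the posterior it induces, so there is no slot for a signal being false. A minimal sketch, with made-up priors and sender probabilities:

```python
# Toy Lewis signaling game: the "meaning" of a signal is just the posterior it induces.
# Nothing in the model marks a signal as true or false -- which is the sense in which
# this picture cannot represent misrepresentation.

states = {"rain": 0.3, "sun": 0.7}            # prior over states (illustrative numbers)
# Sender policy P(signal | state). A sender who sometimes "lies" just changes these numbers.
sender = {"rain": {"R": 0.8, "S": 0.2},
          "sun":  {"R": 0.1, "S": 0.9}}

def posterior(signal):
    """P(state | signal) -- the only notion of 'meaning' the model offers."""
    joint = {s: states[s] * sender[s][signal] for s in states}
    z = sum(joint.values())
    return {s: p / z for s, p in joint.items()}

print(posterior("R"))   # {'rain': ~0.77, 'sun': ~0.23}
print(posterior("S"))   # {'rain': ~0.09, 'sun': ~0.91}
# A lie the listener has already priced in is indistinguishable, inside the model,
# from an honest but noisy report: both are just likelihoods feeding the same posterior.
```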

Wearing my AI safety hat, I'm not sure we need to cover swamp-man examples. Such examples are inherently improbable. In some sense the right thing to do in such cases is to infer that you're in a philosophical hypothetical, which grounds out Swamp Man's telos in that of the philosophers doing the imagining (and so, ultimately, in evolution).

Nonetheless, I also dislike the choice to bottom everything out in biological evolution. It is not as if we have a theorem proving that all agency has to come from biological evolution. If we did, that would be very interesting, but biological evolution has a lot of "happenstance" around the structure of DNA and the genetic code. Can we say anything more fundamental about how telos arises? 

I think I don't believe in a non-contextual notion of telos like Ramana seems to want. A hammer is not a doorstop. There should be little we can say about the physical makeup of a telic entity due to multiple-instantiability. The symbols chosen in a language have very weak ties to their meanings. A logic gate can be made of a variety of components. An algorithm can be implemented as a program in many ways. A problem can be solved by a variety of algorithms.

However, I do believe there may be a useful representation theorem, which says that if it is useful to regard something as telic, then we can regard it as having beliefs (in a way that should shed light on interpretability).

Judgements: Merging Prediction & Evidence
abramdemski · 4mo · 30

Let's look at a specific example: the Allais paradox. (See page 9 of the TDT paper (page 12 of the pdf) for the treatment I'm referencing.)
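
For readers who want the structure of the paradox in front of them: if I have the standard presentation right (treat the dollar amounts below as illustrative), the second pair of gambles is just the first pair run with probability 0.34, so no fixed utility function over outcomes can prefer 1A over 1B while also preferring 2B over 2A.

```python
# Allais gambles, standard presentation (dollar amounts illustrative):
#   1A: $24,000 for sure              1B: 33/34 chance of $27,000, else nothing
#   2A: 34% chance of $24,000         2B: 33% chance of $27,000
# 2A/2B are 1A/1B run with probability 0.34, so for any utility function u,
# EU(1A) > EU(1B) exactly when EU(2A) > EU(2B).

def eu(lottery, u):
    return sum(p * u(x) for p, x in lottery)

g1A = [(1.0, 24_000)]
g1B = [(33/34, 27_000), (1/34, 0)]
g2A = [(0.34, 24_000), (0.66, 0)]
g2B = [(0.33, 27_000), (0.67, 0)]

for name, u in [("risk-neutral", lambda x: x),
                ("mildly risk-averse", lambda x: x ** 0.5),
                ("very risk-averse", lambda x: x ** 0.1)]:
    print(name, eu(g1A, u) > eu(g1B, u), eu(g2A, u) > eu(g2B, u))
    # The two booleans always agree, so the common human pattern (preferring 1A and 2B)
    # cannot come from maximizing any expected utility over outcomes.
```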

It is not plausible to me that the commonly-labeled-irrational behavior in the Allais paradox arises from a desire to be money-pumped. It seems more plausible, for example, that it arises from a cognitive heuristic which makes decisions by identifying the most relevant dimensions along which options differ, weighing how significant the various differences feel, and combining those results to make a decision. Moving from 100% probability to 33/34 probability feels significant because we are moving from certainty to uncertainty, whereas the difference in payouts feels relatively uncertain (you only see it if you win); the reduction in the payout itself feels insignificant compared to this. In contrast, moving from 34/100 to 33/100 feels insignificant compared to the reduction in payout.

Of course, this is still consistent with a biases-as-values analysis. EG, we can treat the heuristic weights I mention as values rather than mere heuristics. Or, reaching for a different explanation, we can say that we don't want to feel like a fool in the case that we choose 33/34 and lose, when we could have had certainty. Probabilities are subjective, so no matter how much we're assured 33/34 is the true probability, we can imagine a friend with a different evaluation of the odds who finds our decision foolish. Humans evolved to avoid such criticism. A statement of 100% probability is, in some sense, taken more seriously than a statement of near-100% probability. In that case, if we lose anyway, we can blame the person who told us it was 100%, so we are absolved from any potential feeling of embarrassment. In the 33/100 vs 34/100 version, there is no such effect.

I want to say something like "the optimal resource-maximizing policy is an illusion" though. Like, it is privileging some sort of reference frame. In economics, profit maximization privileges the wellbeing of the shareholders. A more holistic view would treat all parties involved as stakeholders (employees, owners, customers, and even local communities where the company operates) and treat corporate policy as a bargaining problem between those stakeholders. This would better reflect long-term viability of strategies. (Naively profit-maximizing behavior has a tendency to create high turnover in employees, drive away customers, and turn local communities against the company.)

So yes, you can view everything as values, but I would include "resource-maximizing" in that as well.

A further question: what's at stake when you classify something as 'values'?

EG, in the Allais paradox, one thing that's at stake is whether the 'irrational' person should change their answer to be rational.

Alignment Research Field Guide
abramdemski · 4mo · 20

The name was by analogy to TEDx, yes. MIRI was running official MIRI workshops and we (Scott Garrabrant, me, and a few others) wanted to run similar events independently. We initially called them "mini miri workshops" or something like that, and MIRI got in touch to ask us not to call them that since it insinuates that MIRI is running them. They suggested "MIRIx" instead. 

A simple example of conditional orthogonality in finite factored sets
abramdemski · 4mo · 20

I'm trying to understand the second clause for conditional histories better. 

The first clause is very intuitive, and in some sense, exactly what I would expect. I understand it as basically saying that h(X|E) drops elements from h(X) which can be inferred from E. Makes a kind of sense!

However, if that were the end of the story, then conditional histories would obviously be the wrong tool for defining conditional orthogonality. Conditional orthogonality is supposed to tell us about conditional independence in the probability distribution. However, we know from causal graphs that conditioning can create dependence. EG, in the bayes net A→B←C, A and C are independent, but if we condition on B, A and C become dependent. Therefore, conditional histories need to grow somehow. The second clause in the definition can be seen as artificially adding things to the history in order to represent that A and C have lost their independence.
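
As a concrete version of the collider point (toy numbers of my own): let A and C be independent fair coins and B = A xor C. Then A and C are unconditionally independent but perfectly dependent given B.

```python
# Collider A -> B <- C with A, C independent fair coins and B = A xor C.
from itertools import product

def prob(predicate):
    """Sum P(a, b, c) over outcomes satisfying the predicate, with b = a xor c."""
    return sum(0.25 for a, c in product([0, 1], repeat=2) if predicate(a, a ^ c, c))

# Unconditionally: P(A=1, C=1) = 0.25 = P(A=1) * P(C=1)  -> independent
print(prob(lambda a, b, c: a == 1 and c == 1),
      prob(lambda a, b, c: a == 1) * prob(lambda a, b, c: c == 1))

# Conditional on B=0: P(A=1, C=1 | B=0) = 0.5, but P(A=1 | B=0) * P(C=1 | B=0) = 0.25
pb0 = prob(lambda a, b, c: b == 0)
print(prob(lambda a, b, c: b == 0 and a == 1 and c == 1) / pb0,
      (prob(lambda a, b, c: b == 0 and a == 1) / pb0) *
      (prob(lambda a, b, c: b == 0 and c == 1) / pb0))
```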

What I don't yet see is how to relate these phenomena in detail. I find it surprising that the second clause only depends on E, not on X. It seems important to note that we are not simply adding the history of E[1] into the answer. Instead, it asks that the history of E itself *factors* into the part within h(X|E) and the part outside. If E and X are independent, then only the first clause comes into play. So the implications of the second clause do depend on X, even though the clause doesn't mention X.

So, is there a nice way to see how the second clause adds an "artificial history" to capture the new dependencies which X might gain when we condition on E?

@Scott Garrabrant 

1. ^ In this paragraph, I am conflating the set E⊆S with the partition {E,S−E}.

Policy Entropy, Learning, and Alignment (Or Maybe Your LLM Needs Therapy)
abramdemski · 4mo · 20

This mostly made sense to me. I agree that it is a tricky question with a lot of moving pieces. In a typical RL setting, low entropy does imply low learning, as observed by Cui et al. One reason for this is that exploration is equated with randomness: RL typically works with point estimates only, so the learning system does not track multiple hypotheses to test between. This prevents deterministic exploration strategies like VoI exploration, which explore based on the potential for gaining information rather than just randomly.

My main point here is just to point out all these extra assumptions which are needed to make a strict connection between entropy and adaptability; this makes the observed connection merely empirical, IE not a connection which holds in every corner case we can come up with.

However, I may be a bit more prone to think of humans as exploring intelligently than you are, IE, forming hypotheses and taking actions which test them, rather than just exploring by acting randomly. 
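
To illustrate what deterministic, information-seeking exploration can look like (a deliberately simplified one-step sketch of my own, not a claim about how any particular RL system works), here is a Bayesian agent with Beta posteriors over two Bernoulli arms which picks the arm whose observation is expected to most improve its subsequent greedy choice, with no randomness anywhere:

```python
# Myopic "value of information" exploration in a two-armed Bernoulli bandit.
# The agent tracks Beta(a, b) posteriors and deterministically pulls the arm whose
# observation is expected to most improve its next greedy decision -- a zero-entropy
# exploration rule. (One-step myopic VoI: a simplification, not the Bayes-optimal policy.)

def mean(a, b):
    return a / (a + b)

def one_step_voi(arms, i):
    """Expected gain in best posterior mean from observing arm i once."""
    a, b = arms[i]
    best_now = max(mean(*arm) for arm in arms)
    expected_best = 0.0
    for outcome, p in ((1, mean(a, b)), (0, 1 - mean(a, b))):
        updated = list(arms)
        updated[i] = (a + outcome, b + (1 - outcome))
        expected_best += p * max(mean(*arm) for arm in updated)
    return expected_best - best_now

arms = [(6, 4),   # arm 0: moderately well-characterized, posterior mean 0.6
        (1, 1)]   # arm 1: completely unknown, posterior mean 0.5
voi = [one_step_voi(arms, i) for i in range(len(arms))]
print(voi, "-> probe arm", voi.index(max(voi)))
# The unknown arm wins on VoI despite its lower posterior mean: exploration here is a
# deterministic argmax, so low policy entropy need not mean no exploration.
```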

I also don't buy this part:

And the last piece, entropy being subjective, would be just the point of therapy and some of the interventions described in the other recent RLHF+ papers. 

My concern isn't that you're anthropomorphizing the LLM, but rather that you may be anthropomorphizing it incorrectly. The learned policy may have close to zero entropy, but that doesn't mean the LLM can predict its own actions perfectly ahead of time from its own subjective perspective. So the connection I'm drawing between adaptability and entropy is a distinct phenomenon from the one noted by Cui, since the notions of entropy are different (mine is a subjective notion based on the perspective of the agent, while Cui's is a somewhat more objective one based on the randomization used to sample behaviors from the LLM).

(Note: your link for the paper by Cui et al currently points back to this post, instead.)

Policy Entropy, Learning, and Alignment (Or Maybe Your LLM Needs Therapy)
abramdemski · 4mo · 20

"A teacher who used the voice of authority exactly when appropriate, rather than inflexibly applying it in every case, could have zero entropy and still be very adaptive/flexible." I'm not sure I would call this teacher adaptable. I might call them adapted in the sense that they're functioning well in their current environment, but if the environment changed in some way (so that actions in the current state no longer led to the same range of consequences in later states), they would fail to adapt. (Horney would call this person neurotic but successful.)

So, in this scenario, what makes the connection between higher entropy and higher adaptability? Earlier, I mentioned that lower entropy could spoil exploration, which could harm one's ability to learn. However, the optimal exploration policy (from a bayesian perspective) is actually zero-entropy: it deterministically maximizes value of information, which introducing randomness won't do (unless multiple actions happen to be tied for value of information).

The point being that if the environment changes, the teacher doesn't strictly need to introduce entropy into their policy in order to adapt. That's just a common and relatively successful method.

However, it bears mentioning that entropy is of course subjective; we might need to ask from whose perspective we are measuring the entropy. Dice have low entropy from the perspective of a physics computation which can predict how they will land, but high entropy from the perspective of a typical human who cannot. An agent facing a situation they don't know how to think about yet necessarily has high entropy from their own perspective, since they haven't yet figured out what they will do. In this sense, at least, there is a strict connection between adaptability and entropy.

Policy Entropy, Learning, and Alignment (Or Maybe Your LLM Needs Therapy)
abramdemski · 4mo · 30

Entropy isn't going to be a perfect measure of adaptability, eg, 

Or think back to our teacher who uses a voice of authority, even if it's false, to answer every question: this may have been a good strategy when they were a student teacher. Just like the reasoning LLM in its "early training stage," the probability of choosing a new response (choosing not to people please; choosing to admit uncertainty) drops, i.e., policy entropy collapses, and the LLM or person has developed a rigid habitual behavior.

A teacher who used the voice of authority exactly when appropriate, rather than inflexibly applying it in every case, could have zero entropy and still be very adaptive/flexible.

However, the connection you're making does generally make sense. EG, a model which has already eliminated a lot of potential behaviors from its probability distribution isn't going to explore very well for RL training. I also intuitively agree that this is related to alignment problems, although to be clear I doubt that solving this problem alone would avert serious AI risks.

Yoshua Bengio claims to have the solution to this problem: Generative Flow Networks (which also fit into his larger program to avert AI risks). However, I haven't evaluated this in detail. The main claim as I understand it is about training to produce solutions proportional to their reward, instead of training to produce only high-reward outputs.
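
As a toy illustration of that distinction (emphatically not an actual GFlowNet, just the distributional target as I understand the claim): a reward-proportional sampler keeps probability mass on every decent solution, while a reward-maximizing policy collapses onto a single mode.

```python
# Toy contrast: sample solutions in proportion to reward vs. always take the argmax.
import random

rewards = {"solution_a": 10.0, "solution_b": 9.0, "solution_c": 1.0}   # made-up rewards

def reward_proportional_sample():
    r = random.uniform(0, sum(rewards.values()))
    for name, value in rewards.items():
        r -= value
        if r <= 0:
            return name
    return name   # guard against floating-point edge cases

def reward_maximizing_choice():
    return max(rewards, key=rewards.get)   # always "solution_a": a collapsed, zero-entropy policy

draws = [reward_proportional_sample() for _ in range(10_000)]
print({name: draws.count(name) / len(draws) for name in rewards})   # roughly 0.50 / 0.45 / 0.05
print(reward_maximizing_choice())
```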

It feels like a lot of what you're bringing up is tied to a sort of "shallowness" or "short-sightedness" more closely than entropy. EG, always saying yes to go to the bar is low-entropy, but not necessarily lower entropy than the correct strategy (EG, always saying yes unless asked by someone you know to have an alcohol problem, in which case always no). What distinguishes it is short-sightedness (you mention short-term incentives like how the friend reacts immediately in conversation) and a kind of simplicity (always saying yes is a very context-free pattern, easy to learn).

I'm also reminded of Richard Watson's talk Songs of Life and Mind, where he likens adaptive systems to mechanical deformations which are able to snap back when the pressure is released, vs modern neural networks which he likens to crumpled paper or a fallen tower. (See about 28 minutes into the video, although it might not make a lot of sense without more context.)

abramdemski's Shortform
abramdemski · 6mo · 20

This sort of approach doesn't make so much sense for research explicitly aiming at changing the dynamics in this critical period. Having an alternative, safer idea almost ready-to-go (with some explicit support from some fraction of the AI safety community) is a lot different from having some ideas which the AI could elaborate. 

Posts:
* What, if not agency? (45 · 5d · 0)
* Dream, Truth, & Good (23 · 8mo · 9)
* Judgements: Merging Prediction & Evidence (46 · 7mo · 4)
* Have LLMs Generated Novel Insights? [Question] (52 · 8mo · 19)
* Anti-Slop Interventions? (25 · 8mo · 27)
* Why Don't We Just... Shoggoth+Face+Paraphraser? (58 · 11mo · 0)
* o1 is a bad idea (66 · 1y · 19)
* Seeking Collaborators (32 · 1y · 8)
* Complete Feedback (13 · 1y · 7)
* Why is o1 so deceptive? [Question] (69 · 1y · 14)

Wikitag Contributions:
* Timeless Decision Theory (8 months ago, +1874/-8)
* Updateless Decision Theory (a year ago, +1886/-205)
* Updateless Decision Theory (a year ago, +6406/-2176)
* Problem of Old Evidence (a year ago, +4678/-10)
* Problem of Old Evidence (a year ago, +3397/-24)
* Good Regulator Theorems (2 years ago, +239)
* Commitment Races (3 years ago)
* Agent Simulates Predictor (3 years ago)
* Distributional Shifts (3 years ago)
* Distributional Shifts (3 years ago)