Dalcy

Tomorrow can be brighter than today
Although the night is cold
The stars may seem so very far away
But courage, hope and reason burn
In every mind, each lesson learned
Shining light to guide our way
Make tomorrow brighter than today

Posts

Dalcy's Shortform

Comments

Agent Boundaries Aren't Markov Blankets. [Unless they're non-causal; see comments.]

Dalcy 3mo 10

a Markov blanket represents a probabilistic fact about the model without any knowledge you possess about values of specific variables, so it doesn't matter if you actually do know which way the agent chooses to go.

The usual definition of Markov blankets is, as you say, in terms of the model without any knowledge of specific values, but I think this isn't the case in Critch's formalism. Specifically, he defines the 'Markov Boundary' of $W_{t}$ (the non-abstracted physics-ish model) as a function of the random variable $W_{t}$ (he writes e.g. $B_{t} := f_{W B} (W_{t})$), so it can depend on the values instantiated at $W_{t}$.

- It would not make sense to try to represent agent boundaries in a physics-ish model using the usual definition of Markov blankets: the model would just consist of local rules that are spacetime-homogeneous, so there is no reason to expect that one can carve out an agent from the model a priori, without looking at its specific instantiated values.
- $f_{W B}$ can really be anything, so $B_{t}$ doesn't necessarily have to correspond to physical regions (subsets) of $W_{t}$, but it can if we choose to restrict our search for infiltration/exfiltration-criteria-satisfying $f_{W B}$ to functions that only return boundaries in the sense of carving up physical space.
  - e.g. $B_{t}$ can represent which subset of $W_{t}$ the physical boundary is, as an indicator vector like 0, 0, 1, 0, 0, ..., 1, 1, 0
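
To make the value-dependence concrete, here is a minimal toy sketch (not Critch's actual construction). It assumes a hypothetical 1D world state in which cells holding the value 2 are the "agent", and a hypothetical `f_WB` that returns an indicator vector marking the cells adjacent to the agent. The same function gives different boundaries for different instantiated values of the world state, which a fixed subset of variable indices could not do.

```python
import numpy as np

def f_WB(w: np.ndarray) -> np.ndarray:
    """Hypothetical boundary map B_t = f_WB(W_t) for a toy 1D world.

    Assumption (for illustration only): cells equal to 2 form the agent's
    body, and the boundary is every non-body cell adjacent to a body cell.
    Returns a 0/1 indicator vector over the cells of w.
    """
    body = (w == 2)
    # body_left[i] is True when cell i-1 is body (no wrap-around).
    body_left = np.roll(body, 1)
    body_left[0] = False
    # body_right[i] is True when cell i+1 is body (no wrap-around).
    body_right = np.roll(body, -1)
    body_right[-1] = False
    return ((body_left | body_right) & ~body).astype(int)

# Two instantiations of the same world: the agent sits in different places.
w1 = np.array([0, 0, 2, 2, 0, 0, 0, 0])
w2 = np.array([0, 0, 0, 0, 0, 2, 2, 0])

b1 = f_WB(w1)  # boundary cells at indices 1 and 4
b2 = f_WB(w2)  # boundary cells at indices 4 and 7
```

The local rule inside `f_WB` is the same everywhere, but the boundary it picks out moves with the agent, because it is computed from the instantiated values rather than fixed in advance.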

So I think under this definition of Markov blankets, they can be used to denote agent boundaries, even in physics-ish models (i.e. ones that relate nicely to causal relationships). I'd like to know what you think about this.

Big picture of phasic dopamine

Dalcy 1y 50

What are the errors in this essay? As I'm reading through the Brain-like AGI sequence I keep seeing this post being referenced (but this post says I should instead read the sequence!)

I would really like to have a single reference post of yours that contains the core ideas about phasic dopamine, rather than having the sequence posts as the reference (they are heavily dependent on a bunch of previous posts; also, Posts 5 and 6 feel more high-level than this one?)

Trying to isolate objectives: approaches toward high-level interpretability

Dalcy 2y 10

Especially because we’re working with toy models that ostensibly fit the description of an optimizer, we may end up with a model that mechanistically doesn’t have an explicit notion of objective.

I think this is very likely to be the default for most toy models one trains with RL. In my model of agent value formation (which looks very much like this post), explicit representation of objectives is useful only inasmuch as the model already has some sort of internal "optimizer" or search process. Before that, simple "heuristics" (or shards) should suffice, especially in small training regimes.

Trying to disambiguate different questions about whether RLHF is “good”

Dalcy 2y 10

I think that RLHF is reasonably likely to be safer than prompt engineering: RLHF is probably a more powerful technique for eliciting your model’s capabilities than prompt engineering is. And so if you need to make a system which has some particular level of performance, you can probably achieve that level of performance with a less generally capable model if you use RLHF than if you use prompt engineering.

Wait, that doesn't really follow. RLHF can elicit more capabilities than prompt engineering, yes, but how is that a reason for RLHF being safer than prompt engineering?