Alright, I think we have an answer! The conjecture is false.
Counterexample: suppose I have a very-high-capacity information channel (N bit capacity), but it's guarded by a uniform random n-bit password. O is the password, A is an N-bit message and a guess at the n-bit password. Y is the N-bit message part of A if the password guess matches O; otherwise, Y is 0.
Let's say the password is 50 bits and the message is 1M bits. If A is independent of the password, then there's a chance of guessing the password, so the bitrate will be about * 1M , or about one-billionth of a bit in expectation.
If A "knows" the password, then the capacity is about 1M bits. So, the delta from knowing the password is a lot more than 50 bits. It's a a multiplier of , rather than an addition of 50 bits.
This is really cool! It means that bits of observation can give a really ridiculously large boost to a system's optimization power. Making actions depend on observations is a potentially-very-big game, even with just a few bits of observation.
Credit to Yair Halberstadt in the comments for the attempted-counterexamples which provided stepping stones to this one.
The standard definition of channel capacity makes no explicit reference to the original message ; it can be eliminated from the problem. We can do the same thing here, but it’s trickier. First, let’s walk through it for the standard channel capacity setup.
In the standard setup, cannot depend on , so our graph looks like
… and we can further remove entirely by absorbing it into the stochasticity of .
Now, there are two key steps. First step: if is not a deterministic function of , then we can make a deterministic function of without reducing . Anywhere is stochastic, we just read the random bits from some independent part of instead; will have the same joint distribution with any parts of which was reading before, but will also potentially get some information about the newly-read bits of as well.
Second step: note from the graphical structure that mediates between and . Since is a deterministic function of and mediates between and , we have .
Furthermore, we can achieve any distribution (to arbitrary precision) by choosing a suitable function .
So, for the standard channel capacity problem, we have , and we can simplify the optimization problem:
Note that this all applies directly to our conjecture, for the part where actions do not depend on observations.
That’s how we get the standard expression for channel capacity. It would be potentially helpful to do something similar in our problem, allowing for observation of .
The step about determinism of carries over easily: if is not a deterministic function of and , then we can change to read random bits from an independent part of . That will make a deterministic function of and without reducing .
The second step fails: does not mediate between and .
However, we can define a “Policy” variable
is also a deterministic function of , and does mediate between G and Y. And we can achieve any distribution over policies (to arbitrary precision) by choosing a suitable function .
So, we can rewrite our problem as
In the context of our toy example: has two possible values, and . If takes the first value, then is deterministically 0; if takes the second value, then is deterministically 1. So, taking the distribution to be 50/50 over those two values, our generalized “channel capacity” is at least 1 bit. (Note that we haven’t shown that no achieves higher value in the maximization problem, which is why I say “at least”.)
Back to the general case: our conjecture can be expressed as
where the first optimization problem uses the factorization
and the second optimization problem uses the factorization