Wiki Contributions


Tangentially related: recent discussion raising a seemingly surprising point about LLM's being lossless compression finders

The first intuition pump that comes to mind for distinguishing mechanisms is examining how my brain generates and assigns credence to the hypothesis that something going wrong with my car is a sensor malfunction vs telling me about a problem in the world that the sensor exists to alert me to.

One thing that happens is that the broken sensor implies a much larger space of worlds because it can vary arbitrarily instead of only in tight informational coupling with the underlying physical system. So fluctuations outside the historical behavior of the sensor either implies I'm in some sort of weird environment or that the sensor is varying with something besides what it is supposed to measure, a hidden variable if coherent or noisy if random. So the detection is tied to why it is desirable to goodhart the sensor in the first place, more option value by allowing consistency with a broader range of worlds. By the same token, the hypothesis "the sensor is broken" should be harder to falsify since the hypothesis is consistent with lots of data? The first thing it occurs to me to do is supply a controlled input to see if I get a controlled output (see: calibrating a scale by using a known weight). This suggests that complex sensors that couple with the environment along more dimensions are harder to fool, though any data bottlenecks that are passed through reduce this i.e. the human reviewing things is themselves using a learnable simple routine that exhibits low coupling.

The next intuition pump, imagine there are two mechanics. One makes a lot of money from replacing sensors, they're fast at it and get the sensors for a discount by buying in bulk. The second mechanic makes a lot of money by doing a lot of really complicated testing and work. They work on fewer cars but the revenue per car is high. Each is unscrupulous and will lie that your problem is the one they are good at fixing. I try to imagine the sorts of things they would tell me to convince me the problem is really the sensor vs the problem is really out in the world. This even suggests a three player game that might generate additional ideas.

Whether or not details (and lots of specific detail arguments) matter hinges on the sensitivity argument (which is an argument about basins?) in general, so I'd like to see that addressed directly. What are the arguments for high sensitivity worlds other than anthropics? What is the detailed anthropic argument?

Rambling/riffing: Boundaries typically need holes in order to be useful. Depending on the level of abstraction, different things can be thought of as holes. One way to think of a boundary is a place where a rule is enforced consistently, and this probably involves pushing what would be a continuous condition into a condition with a few semi discrete modes (in the simplest case enforcing a bimodal distribution of outcomes). In practice, living systems seem to have settled on stacking a bunch of one dimensional gate keepers together as presumably the modularity of such a thing was easier to discover in the search space than things with higher path dependencies due to entangled condition measurement. This highlights the similarity between boolean circuit analysis and a biological boundary. In a boolean circuit, the configurations of 'cheap' energy flows/gradients can be optimized for benefit, while the walls to the vast alternative space of other configurations can be artificially steepened/shored up (see: mitigation efforts to prevent electron tunneling in semiconductors).

Many proposals seem doomed to me because they involve one or multiple steps where they assume a representation, then try to point to robust relations in the representation and hope they'll hold in the territory. This wouldn't be so bad on its own but when pointed to it seems like handwaving happens rather than something more like conceptual engineering. I am relatively more hopeful about John's approach as being one that doesn't fail to halt and catch fire at these underspecified steps in other plans. In other areas like math and physics we try to get the representation to fall out of the model by sufficiently constraining the model. I would prefer to try to pin down a doomed model than stay in hand wave land because at least in the process of pinning down the doomed model you might get reusable pieces for an eventual non doomed model. Was happy about eg quantilizers for basically the same reason.

This is exactly what I was thinking about though, this idea of monitoring every human on earth seems like a failure of imagination on our part. I'm not safe from predators because I monitor the location of every predator on earth. I admit that many (overwhelming majority probably) of scenarios in this vein are probably pretty bad and involve things like putting only a few humans on ice while getting rid of the rest.

I guess the threat model relies on the overhang. If you need x compute for powerful ai, then you need to control more than all the compute on earth minus x to ensure safety, or something like that. Controlling the people probably much easier.

New-to-me thought I had in response to the kill all humans part. When predators are a threat to you, you of course shoot them. But once you invent cheap tech that can control them you don't need to kill them anymore. The story goes that the AI would kill us either because we are a threat or because we are irrelevant. It seems to me that (and this imports a bunch of extra stuff that would require analysis to turn this into a serious analysis, this is just an idle thought), the first thing I do if I am superintelligent and wanting to secure my position is not take over the earth, which isn't in a particularly useful spot resource wise and instead launch my nanofactory beyond the reach of humans to mercury or something. Similarly, in the nanomachines in everyone's blood that can kill them instantly class of ideas, why do I need at that point to actually pull the switch? I.e. the kill all humans scenario is emotionally salient but doesn't actually clearly follow the power gradients that you want to climb for instrumental convergence reasons?

Load More