Tomáš Gavenčiak

A researcher in CS theory, AI safety and other stuff.


The concept of "interfaces of misalignment" does not mainly point to GovAI-style research here (although it also may serve as a framing for GovAI). The concrete domains separated by the interfaces in the figure above are possibly a bit misleading in that sense:

For me, the "interfaces of misalignment" generate intuitions about what it means to align a complex system that may not even be self-aligned, rather than just aligning one part of it. This expands not just the space of solutions, but also the space of meanings of "success". (For example, one extra way to win-lose: consider world trajectories where our preferences are eventually preserved and propagated in a way that we find repugnant now, but with a step-by-step endorsed trajectory towards it.)

My critique of focusing on the "AI developers" and "one AI" interface in isolation is that we do not really know what the "goal of AI alignment" is, and that this focus works with a very informal and somewhat simplistic idea of what aligning AGI means (strawmannable as "not losing right away").

While a broader picture may seem to only make the problem strictly harder (“now you have 2 problems”), it can also bring new views of the problem, in particular new views of what we actually want and what it means to win (which one could paraphrase as a continuous and multi-dimensional winning/losing space).

Complexity indeed matters: the universe seems to be bounded in both time and space, so running anything like the Solomonoff prior algorithm (in one of its variants) or AIXI may be outright impossible for any non-trivial model. For me, this significantly weakens or changes some of the implications.

A Fermi upper bound for the direct Solomonoff/AIXI algorithm trying TMs in order of increasing complexity: even if checking one TM took one Planck time on one atom, you could only check roughly 10^240 ≈ 2^800 machines within the lifetime of the universe (~10^110 years until heat death), so the machines you could even look at have a description complexity of a meager 800 bits.
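
The arithmetic behind this estimate can be reproduced in a few lines. This is only a rough sketch: the figure of about 10^80 atoms in the observable universe is my assumption (a commonly quoted order of magnitude, not stated above), and the function names are made up for illustration. It works in logarithms to keep the numbers readable.

```python
import math

# Order-of-magnitude inputs only.
PLANCK_TIME_S = 5.39e-44          # seconds per Planck time
SECONDS_PER_YEAR = 3.156e7
UNIVERSE_LIFETIME_YEARS = 1e110   # figure used in the text (until heat death)
ATOMS_IN_UNIVERSE = 1e80          # assumption: common estimate, not from the text


def log10_tm_checks():
    """log10 of the number of TM checks: one check per atom per Planck time."""
    log10_seconds = math.log10(UNIVERSE_LIFETIME_YEARS) + math.log10(SECONDS_PER_YEAR)
    log10_planck_ticks = log10_seconds - math.log10(PLANCK_TIME_S)
    return log10_planck_ticks + math.log10(ATOMS_IN_UNIVERSE)


def max_description_bits():
    """Approximate n such that 2^n equals the number of checks available."""
    return log10_tm_checks() / math.log10(2)


if __name__ == "__main__":
    print(f"checks ~ 10^{log10_tm_checks():.0f}")
    print(f"bits   ~ {max_description_bits():.0f}")
```

Under these assumptions the budget comes out at about 2^800 checks, i.e. machines of roughly 800 bits of description complexity.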

  • You could likely speed the greedy search up, but note that most algorithmic speedups do not have a large effect on the exponent (even multiplying the exponent by constants is not very helpful).
  • Significantly narrowing down the space of TMs to a narrow subclass may help, but then we need to take a look at the particular (small) class of TMs rather than have intuitions about all TMs. (And the class would need to be really narrow - see below).
  • Due to the Church-Turing thesis, limiting the scope of the search is likely not very effective, as you can embed arbitrary programs (and thus arbitrary complexity) in anything that is strong enough to be a TM interpreter (which the universe is in multiple ways).
  • It may be hypothetically possible to search for the "right" TMs without examining them individually (with some future tech, e.g. how sci-fi imagined quantum computing), but if such a speedup is possible, any TMs modelling the universe would need to be able to contain it. This would increase the evaluation complexity of the TMs, making them significantly more costly than the Planck time I assumed above (would need a finer Fermi estimate with more complex assumptions?).
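
To make the point about speedups concrete, here is a small sketch of how much a speedup would extend the 800-bit budget. The function names and example factors are hypothetical, chosen only for illustration: a constant-factor speedup adds log2 of the factor to the reachable description length, while a speedup that multiplies the exponent (for instance an idealized quadratic, Grover-like speedup of unstructured search, used here purely as an assumption) multiplies it.

```python
import math

BASELINE_BITS = 800.0  # reachable description length from the Fermi estimate


def bits_with_constant_speedup(speedup_factor, baseline_bits=BASELINE_BITS):
    """Checking `speedup_factor` times more machines adds log2(factor) bits."""
    return baseline_bits + math.log2(speedup_factor)


def bits_with_exponent_multiplier(multiplier, baseline_bits=BASELINE_BITS):
    """A speedup turning 2^n checks into 2^(multiplier * n) scales the bits."""
    return baseline_bits * multiplier


if __name__ == "__main__":
    # A (hypothetical) 10^30-fold constant speedup adds only about 100 bits.
    print(bits_with_constant_speedup(1e30))
    # Doubling the exponent still reaches only 1600 bits.
    print(bits_with_exponent_multiplier(2))
```

Either way the reachable description length stays in the range of hundreds to a few thousand bits.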