Unfinished line here
Implicit in the description of features as directions is that the feature can be represented as a scalar, and that the model cares about the range of this number. That is, it matters whether the feature
Monitoring of increasingly advanced systems does not trivially work, since much of the cognition of advanced systems, and many of their dangerous properties, will be externalized the more they interact with the world.
Externalized reasoning being a flaw in monitoring makes a lot of sense, and I haven’t actually heard of it before. I feel that should be a whole post on itself.
These arguments don't apply to the base models which are only trained on next word prediction (ie the simulators post), since their predictions never affected future inputs. This is the type of model Janus most interacted with.
Two of the proposals in this post do involve optimizing over human feedback, like:
Creating custom models trained on not only general alignment datasets but personal data (including interaction data), and building tools and modifying workflows to facilitate better data collection with less overhead
, which they may apply to.
I’m excited about sensory substitution (https://eagleman.com/science/sensory-substitution/), where people translate auditory or visual information into tactile sensations (usually for people who don’t usually process that info).
I remember Quintin Pope wanting to translate the latent space of language models [reading a paper] translated to visual or tactile info. I’d see this as both a way to read papers faster, brainstorm ideas, etc and gain a better understanding of latent space during development of this.
I’m unsure how alt-history and point (2) history is hard to change and predictable relates to cyborgism. Could you elaborate?
For context, Amdahl’s law states how fast you can speed up a process is bottlenecked on the serial parts. Eg you can have 100 people help make a cake really quickly, but it still takes ~30 to bake.
I’m assuming here, the human component is the serial component that we will be bottlenecked on, so will be outcompeted by agents?
If so, we should try to build the tools and knowledge to keep humans in the loop as far as we can. I agree it will eventually be outcompeted by full AI agency alone, but it isn’t set in stone how far human-steered AI can go.
Unfinished sentence at “if you want a low coding project” at the top.
I'd love to hear whether you found this useful, and whether I should bother making a second half!
We had 5 people watch it here, and we would like a part 2:)
We had a lot of fun pausing the video and making forward predictions, and we couldn't think of any feedback for you in general.
Notably the model was trained across multiple episodes to pick up on RL improvement.
Though the usual inner misalignment means that it’s trying to gain more reward in future episodes by forgoing reward in earlier ones, but I don’t think this is evidence for that.
Is it actually true that you only trained on 5% of the dataset for filtering (I’m assuming training for 20 epochs)?