Excited to see this proposal and to follow your results, and excited about bridging "traditional deep learning" and AF. I personally think there's a lot of value in having a common language between the "traditional DL" community and the AIS community (for instance, issues with current AI ethics can be seen as scaled-down versions of AGI issues). A lot of theoretical results on AF could benefit from simple practical examples, for the sake of having a clear definition in code, and a lot of the ethics discussions could benefit from the larger perspective of AGI alignment (my own personal opinion).

I have a prediction that policy gradient RL agents (and, more generally, agents that only learn a policy) do not contain good models of their environments. For example, to succeed in CartPole, all we need (in the crudest version) is to map "pole a bit to the left" to "go left" and "pole a bit to the right" to "go right". The policy of such an agent is not enough to recover the exact properties of the environment, such as the mass of the cart or the mass of the pole, because a single policy works for a whole range of masses (my prediction). Thus, the agent does not contain a "mental model". Basically, solving the environment is simpler than understanding it (this is sometimes the case in the real world too :). In contrast, more sophisticated agents such as MuZero or World Models use a value function plus a learned mapping from the current observation and action to the next observation, which is a "mental model" of sorts (though not a very interpretable one...). I would be excited about models that "we can somewhat understand" -- the current ones seem to lack this property.
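To make the CartPole point concrete, here is a toy sketch (assuming gymnasium's CartPole-v1; the mass attributes touched below are implementation details of the classic-control source, not a public API). A fixed reactive policy that only looks at the pole angle and angular velocity, and has no notion of mass, still does far better than random across a sweep of pole masses:

```python
# Hypothetical illustration: a mass-agnostic reactive policy on CartPole-v1.
import gymnasium as gym

def reactive_policy(obs):
    # obs = [cart position, cart velocity, pole angle, pole angular velocity]
    _, _, angle, angle_vel = obs
    return 1 if angle + 0.5 * angle_vel > 0 else 0  # push toward the side the pole leans

for masspole in [0.05, 0.1, 0.2, 0.4]:  # vary a physical parameter the policy never sees
    env = gym.make("CartPole-v1")
    core = env.unwrapped
    core.masspole = masspole                            # internal attributes of the
    core.total_mass = core.masspole + core.masscart     # classic-control CartPole env;
    core.polemass_length = core.masspole * core.length  # not a stable public API
    obs, _ = env.reset(seed=0)
    episode_return = 0.0
    for _ in range(500):
        obs, reward, terminated, truncated, _ = env.step(reactive_policy(obs))
        episode_return += reward
        if terminated or truncated:
            break
    print(f"pole mass {masspole:.2f}: return {episode_return}")
```

Since the same policy handles every mass in the sweep, nothing in the policy tells you which mass it was facing -- which is the sense in which it contains no model of the environment.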

 Some questions below:

  • I was wondering what the concrete environments and objectives to train it on/with would be, and how you would find the model inside the agent (if going with the "Microscope" approach), or how you would enforce the agent to have a model inside (if going with the "regularizer" approach)
  • I'm a bit confused -- would it be a natural language model that could respond to questions, or would it be a symbolic-like approach, or something else?
  • Research into making RL agents use better internal models of their environments could increase capabilities as well as provide safety, because a model allows for planning and a good model allows for generalization, and both are capability-increasing (see the planning sketch after this list). Potentially, the "right" way of constructing models could also decrease sample complexity, because the class of "interesting" environments is quite specific (the real world and games share common themes: distinct objects, object permanence, a notion of "place", 2D/3D maps, etc.), whereas current RL agents can just as well fit random noise and are therefore searching through a much wider space of policies than we are interested in. So, to sum up, better models could potentially speed up AI timelines. I'm wondering what your thoughts are on this tradeoff (benefits to safety vs. the potential risk of decreasing time-to-AGI).
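The planning sketch referenced above: a minimal, hypothetical illustration (the function names are mine, and this is a toy random-shooting planner, not MuZero's actual algorithm) of why an explicit one-step model enables planning -- once you have a learned f(s, a) ≈ s', you can score imagined action sequences without touching the real environment:

```python
# Toy model-based planner: pick the first action of the best imagined rollout.
import numpy as np

def plan_with_model(dynamics_model, reward_fn, state, action_dim,
                    horizon=10, n_candidates=100, rng=None):
    """dynamics_model(s, a) -> next state, reward_fn(s, a) -> float; both learned/assumed."""
    if rng is None:
        rng = np.random.default_rng(0)
    best_return, best_first_action = -np.inf, None
    for _ in range(n_candidates):
        actions = rng.integers(0, action_dim, size=horizon)  # random candidate action sequence
        s, total = state, 0.0
        for a in actions:
            s = dynamics_model(s, a)   # imagined rollout: no real environment steps
            total += reward_fn(s, a)
        if total > best_return:
            best_return, best_first_action = total, int(actions[0])
    return best_first_action
```

A pure policy-gradient agent has nothing to run such rollouts in, which is one way to see why planning (and the capability gains that come with it) hinges on having the model.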

My background on the question: in my MSc thesis I worked on one of these directions, specifically using a simplicity prior to uncover "a model that we can somewhat understand" from pixels. Concretely, I learn a transformation from observations to latent features such that these features admit a causal model with the fewest edges (simplicity). Some tricks are required (Lagrangian multipliers, some custom neural nets, etc.), and the simplicity prior makes finding the causal graph NP-hard. The upside is that the resulting models are quite interpretable, though so far this works only for small grid-worlds and toy benchmarks... I have a vague feeling that the general problem could be solved in a much simpler way than I did, and I would be excited to see the results of this research! Thesis: https://infoscience.epfl.ch/record/287445/ GH: https://github.com/sergia-ch/causality-disentanglement-rl
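For flavor only, here is a heavily simplified sketch of the kind of objective involved (this is not the thesis code; the L1 penalty stands in for the "fewest edges" prior, and all module names and hyperparameters are illustrative):

```python
# Illustrative only: encoder to latent features + sparse linear latent dynamics.
import torch
import torch.nn as nn

class SparseLatentModel(nn.Module):
    def __init__(self, obs_dim, n_features, action_dim):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                                     nn.Linear(64, n_features))
        # one weight per potential edge (feature_i or action) -> feature_j
        self.transition = nn.Linear(n_features + action_dim, n_features, bias=False)

    def forward(self, obs, action_onehot, next_obs):
        z, z_next = self.encoder(obs), self.encoder(next_obs)
        z_pred = self.transition(torch.cat([z, action_onehot], dim=-1))
        pred_loss = ((z_pred - z_next) ** 2).mean()      # latent dynamics must be predictable
        sparsity = self.transition.weight.abs().mean()   # proxy for "fewest edges"
        return pred_loss, sparsity

# Training would minimize pred_loss + lam * sparsity (with lam tuned, e.g. via a Lagrangian
# multiplier); edges whose weights shrink to ~0 are then read off as absent from the graph.
```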