You can apply the nonlinear transformation either to the rewards or to the Q values. Either way, the aggregation can occur only after the transformation. When the transformation is applied to the Q values, the aggregation takes place quite late in the process: as Ben said, during action selection.
Both approaches, transforming the rewards and transforming the Q values, are valid, but they have different philosophical interpretations and lead to different experimental outcomes in the agent's behaviour. I think both approaches need more research.
For example, I would say that transforming the rewards instead of the Q values is more risk-averse as well as more "fair" towards individual timesteps, since it does not average out the negative outcomes across time before exponentiating them. But it also results in slower learning by the agent.
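To make the difference between the two orderings concrete, here is a minimal sketch for a single objective. The function `transform` is a hypothetical concave, exponential-style transformation chosen only for illustration (it is not necessarily the exact function from our paper), and the reward sequence and discount factor are made up.

```python
import math

def transform(x):
    # Hypothetical risk-averse transformation (an assumption for illustration):
    # concave, and it punishes negative values exponentially.
    return -math.exp(-x)

def discounted_sum(values, gamma=0.9):
    # Plain discounted return over a fixed sequence of per-timestep values.
    return sum(gamma ** t * v for t, v in enumerate(values))

# Made-up reward sequence with one clearly bad timestep.
rewards = [1.0, 1.0, -3.0, 1.0]

# Option A: transform each reward first, then accumulate over time.
value_a = discounted_sum([transform(r) for r in rewards])

# Option B: accumulate the raw rewards first (as a Q value would),
# then transform the accumulated value.
value_b = transform(discounted_sum(rewards))

print(value_a, value_b)
```

In this toy case option A comes out far lower than option B, because the single bad timestep is exponentiated before the good timesteps have a chance to average it out. That is the sense in which transforming the rewards is more risk-averse.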
Finally, there is a third approach, which uses lexicographical ordering between objectives or sets of objectives. Vamplew has done work in this direction. This approach is truly multi-objective in the sense that there is no aggregation at all. Instead, the value vectors must be compared directly during RL action selection. The downside is that it becomes unwieldy to have many objectives (or sets of objectives) lexicographically ordered.
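As a rough illustration of comparing vectors without aggregation, the sketch below picks an action by plain lexicographic comparison. The function name, the actions, and the Q values are all made up, and this is only the simplest form of the idea, not a reproduction of any published algorithm.

```python
def lexicographic_best_action(q_vectors):
    """Pick the action whose Q vector is lexicographically largest.

    q_vectors maps each action to a tuple of per-objective Q values,
    ordered from highest-priority objective to lowest-priority objective.
    """
    # Python compares tuples element by element, left to right,
    # which is a lexicographic ordering: a later objective only matters
    # when all earlier objectives are exactly tied.
    return max(q_vectors, key=lambda action: q_vectors[action])

# Hypothetical Q vectors: (safety objective, performance objective).
q_vectors = {
    "left": (0.0, 5.0),
    "right": (0.0, 7.0),
    "forward": (-1.0, 100.0),
}

best = lexicographic_best_action(q_vectors)
print(best)
```

Here "forward" loses despite its large performance value, because it is worse on the higher-priority objective, and "right" beats "left" only on the tie-break. With learned floating-point Q values exact ties are rare, so a practical version would probably need some tolerance or threshold on the higher-priority objectives; that is left out of this sketch.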
I imagine that the lexicographical approach and our continuous nonlinear transformation approaches are complementary. There could be, for example, two main sets of objectives: one set for alignment objectives, the other for performance objectives. Inside each set, nonlinear transformation and then aggregation would be applied, but between the sets lexicographical ordering would be applied. In other words, there would be a hierarchy of objectives. With only two sets, the lexicographical ordering does not become unwieldy.
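A minimal sketch of how such a two-level hierarchy might look at action selection time follows. Everything here is an assumption for illustration: the `transform` function, the choice to transform Q values rather than rewards, the use of a sum as the aggregation, and the example numbers.

```python
import math

def transform(x):
    # Hypothetical concave, risk-averse transformation (illustrative only).
    return -math.exp(-x)

def aggregate(q_values):
    # Within one set of objectives: transform each Q value, then sum.
    return sum(transform(q) for q in q_values)

def hybrid_best_action(q_by_action):
    """q_by_action maps action -> (alignment_qs, performance_qs)."""
    def key(action):
        alignment_qs, performance_qs = q_by_action[action]
        # Between the two sets: lexicographic ordering, alignment first.
        return (aggregate(alignment_qs), aggregate(performance_qs))
    return max(q_by_action, key=key)

# Made-up Q values; "a" and "b" are deliberately tied on alignment.
q_by_action = {
    "a": ([0.0, 0.0], [1.0, 1.0]),
    "b": ([0.0, 0.0], [3.0, -1.0]),
    "c": ([2.0, -2.0], [9.0, 9.0]),
}

choice = hybrid_best_action(q_by_action)
print(choice)
```

In this example "c" is rejected on alignment even though its raw alignment values sum to the same total as the others, because the transformation penalizes the unbalanced negative value. Between "a" and "b", which tie on alignment, the more balanced performance profile of "a" wins. The exact tie is contrived; as with any lexicographic scheme, a real agent would likely need a tolerance on the first level.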
This approach would be somewhat analogous to the approach used in constraint programming, though more flexible. The safety objectives would act as a constraint on the performance objectives. Constraints of this kind are almost absurdly missing from classical naive RL, even though they are essential, widely known, and technically well developed in practical applications, namely in constraint programming! In the hybrid approach proposed in the paragraph above, the difference from classical constraint programming would be that among the safety objectives there would still be flexibility and the ability to trade (in a risk-averse way).
Finally, when we say "multi-objective", it does not just refer to the technical details of the computation. It also stresses the importance of researching and making more explicit the inherent presence, and even the structure, of multiple objectives inside any abstract top-level objective. The aim is to encode knowledge in a way that constrains incorrect solutions but not correct ones. It also means acknowledging the potential existence of even more complex, nonlinear interactions between these multiple objectives. We have not yet focused on nonlinear interactions between the objectives, but these interactions may become relevant in the future.
I totally agree that in a reasonable agent the objectives or target values / set-points do change, as is also exemplified by biological systems.
Until the Modem website is back up, you can access our workshop paper here: https://drive.google.com/file/d/1qufjPkpsIbHiQ0rGmHCnPymGUKD7prah/view?usp=sharing