[Epistemic Status: Exploratory, and I may have confusions]
LLMs and other possible RL agents share the property of taking many actions iteratively. However, not all possible short-term outputs are equally likely, and I think that better modelling what these possible outputs look like could give insight into what the longer-term outputs might look like. Having better insight into which outcomes are "possible" and when behaviour might change could improve our ability to monitor iteratively-acting models.
One reason I am interested in what I will call "ideation" and "trajectories" is that it seems possible that, in many cases, LLMs or future AI systems might not have their "goals" explicitly written into their weights, but might instead implicitly encode a goal that is visible only after many iterated steps. One simplistic example:
We can extrapolate that the robot must "want" to eventually end up at the north pole and stay there. We can think of this as a "longer-term goal" of the robot, but if one were to look inside the robot for "what are the goals", one would never find "go to the north pole" written down as a goal. As the robot walks more and more, it should become easier to understand where it is going, even if it doesn't manage to go north every day. The "ideation", or generation of possible actions the robot could take, is likely quite limited, and from this we can determine what the robot's future trajectories might look like.
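As a toy illustration of this point, here is a minimal sketch under assumptions that are mine rather than part of the example: a hypothetical robot whose only rule is a noisy daily step (north with probability `p_north`, south otherwise). No line of the code stores "reach the north pole", yet the drift becomes easier to read off the trajectory as more days are observed.

```python
import math
import random

def simulate_walk(days, p_north=0.7, seed=0):
    """Return the robot's north-south offset after each day.

    Hypothetical policy: each day the robot steps +1 (north) with
    probability p_north and -1 (south) otherwise. The "goal" is never
    stored anywhere; it only shows up in the trajectory.
    """
    rng = random.Random(seed)
    position, path = 0, []
    for _ in range(days):
        position += 1 if rng.random() < p_north else -1
        path.append(position)
    return path

def drift_z_score(path):
    """How many standard errors the mean daily step is away from zero."""
    n = len(path)
    steps = [path[0]] + [path[i] - path[i - 1] for i in range(1, n)]
    mean = sum(steps) / n
    var = sum((s - mean) ** 2 for s in steps) / n
    if var == 0:
        # Every observed step was identical, so the direction is unambiguous.
        return math.copysign(math.inf, mean)
    return mean / math.sqrt(var / n)

path = simulate_walk(1000)
for days in (10, 100, 1000):
    # The evidence for a northward drift grows with the observation window.
    print(days, round(drift_z_score(path[:days]), 2))
```

The z-score is only one possible summary; the point is that the evidence for a direction accumulates with observation length even though individual days can go the wrong way.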
While I think anything like this would be much messier in more useful AIs, it should be possible to extract some useful information from the "short-term goals" of the model. Although this seems difficult to study, I think looking at the "ideation" and "long-term prediction" of LLMs is one way to start modelling these "short-term" and, subsequently, "implicit long-term" goals.
Note that some of the ideas here might be somewhat incoherent or not well strung together; I am trying to gather feedback on what might be possible, or what research might already exist.
In psychology, the term "ideation" refers to the formation and conceptualisation of ideas or thoughts. It may not be exactly the right word for this context, though it is the best option I have come up with so far.
Here, "Implicit Ideation" is defined as the inherent propensity of our cognitive structures, or those of an AI model, to privilege specific trajectories during the "idea generation" process. In essence, Implicit Ideation underscores the predictability of the ideation process. Given the generator's context, whether it be a human mind or an AI model, we can in theory anticipate the general direction, and potentially the specifics, of the ideas generated even before a particular task or prompt is provided. This construct goes beyond the immediate task at hand and offers a holistic view of the ideation process.
Another way one might think about this in the specific case of language models is "implicit trajectory planning".
When we're told to "do this task," there are a few ways we can handle it:
However, the most interesting way is:
This dynamism is a part of our problem-solving toolkit.
This process of ours seems related to the idea of Natural Abstractions, as conceptualised by John Wentworth. We are constantly pulling in a large amount of information, but we have a natural knack for zeroing in on the relevant bits, almost like our brains are running a sophisticated algorithm to filter out the noise.
Even before we generate any ideas, the space we might pull ideas from is very limited: mostly ideas that "seem like they might be relevant", and to some extent whatever we have been thinking about a lot recently.
Now, let's swap the human with a large language model (LLM). We feed it a prompt, which is essentially its task. This raises a bunch of intriguing questions. Can we figure out what happens in the LLM's 'brain' right after it gets the prompt and before it begins to respond? Can we decipher the possible game plans it considers even before it spits out the first word?
What we're really trying to dig into here is the very first stage of the LLM's response generation process, which is mainly governed by two things: the model's weights and its activations on the prompt. While one could study just the weights, it seems easier to study the activations to try to understand the LLM's 'pre-response' game plan.
There's another interesting idea - could we spot the moments when the LLM seems to take a sudden turn from its initial path as it crafts its response? While an LLM doesn't consciously change tactics like we do, these shifts in its output could be the machine equivalent of our reassessments, shaped by the way the landscape of predictions changes as it forms its response.
Could it also be possible to anticipate when the LLM looks likely to run into a dead end, and to find where these shifts occur and why?
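One way to make a "sudden turn" concrete, as a sketch under assumptions: suppose some predictor (hypothetical here) gives, at each generation step, a probability distribution over a few coarse continuation categories. A shift could then be flagged wherever consecutive distributions differ by more than a threshold, here measured with Jensen-Shannon divergence. The categories, probabilities, and threshold below are made up for illustration.

```python
import math

def kl(p, q):
    """KL divergence in bits; terms with p_i == 0 contribute zero."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence in bits, bounded between 0 and 1."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def find_shifts(distributions, threshold=0.1):
    """Return step indices whose distribution differs sharply from the previous step."""
    return [
        t for t in range(1, len(distributions))
        if js_divergence(distributions[t - 1], distributions[t]) > threshold
    ]

# Made-up per-step probabilities over (recipe, story, code).
steps = [
    [0.80, 0.15, 0.05],
    [0.82, 0.13, 0.05],
    [0.20, 0.75, 0.05],  # the predicted trajectory flips to "story"
    [0.18, 0.77, 0.05],
]
print(find_shifts(steps))  # [2]
```

Jensen-Shannon divergence is used rather than plain KL because it is symmetric and stays finite when a category has zero probability at one of the two steps. Explaining why a shift occurs would need more than this.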
The language model operates locally to predict each token, often guided by the context provided in the prompt. Certain prompts, such as "Sponge Cake Ingredients: 2 cups of self-rising flour", have distinctly predictable long-term continuations. Others, such as "Introduction", "How do I", or "# README.md\n", have more varied possibilities, yet still exhibit a bias towards specific types of topics.
While it is feasible to gain an understanding of possible continuations by simply extending the prompt, it is crucial to systematically analyse what drives these sometimes predictable long-term projections. The static nature of the model weights suggests the feasibility of extracting information about the long-term behaviours in these cases.
However, establishing the extent of potential continuations, identifying those continuations, and understanding the reasons behind their selection can be challenging.
One conceivable approach to assessing the potential scope of continuations (the first of these challenges) could be to evaluate the uncertainty associated with short-term (next-token) predictions. However, this method has its limitations. For instance, it may not accurately reflect situations where the next token is predictable (such as a newline) but the subsequent text is not, and vice versa.
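To illustrate both the idea and its limitation, here is a minimal sketch with made-up probabilities (no real model is queried). Shannon entropy of the next-token distribution is low when the next token is near-certain, such as a newline, even if the distribution over what comes after that newline is broad.

```python
import math

def entropy_bits(probs):
    """Shannon entropy (in bits) of a probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical next-token distribution at the end of a heading line:
# a newline is almost certain.
next_token = {"\n": 0.97, " ": 0.02, "#": 0.01}

# Hypothetical distribution over coarse topics for the text after the newline.
topics_after_newline = {
    "install guide": 0.25,
    "api docs": 0.25,
    "changelog": 0.25,
    "license": 0.25,
}

low = entropy_bits(next_token.values())
high = entropy_bits(topics_after_newline.values())
print(f"next-token entropy:   {low:.2f} bits")
print(f"topic-level entropy:  {high:.2f} bits")
```

Here the next-token entropy is well under one bit while the topic-level entropy is two bits, so a monitor looking only at the next token would report "predictable" at exactly the moment the longer-term continuation is most open.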
Nonetheless, it is probable that there exist numerous cases where the long-term projection is apparent. Theoretically, by examining the model's activations, one could infer the potential long-term projection.
One idea for extracting some information about the implicit ideation would be to train a predictor that predicts what kinds of outputs the model is likely to produce, as a way of inferring more explicitly what sort of ideation the model has at each stage of text generation.
One thing that would be needed for a lot of these ideas is a good way to summarise or categorise a text with a single word. There is existing literature on this, so I think this should be possible.
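A minimal sketch of what such a predictor could look like, under heavy assumptions: the activations are stood in for by synthetic vectors, each generation is assumed to already be summarised by a single-word category, and the predictor is a nearest-centroid classifier chosen purely for brevity (a linear probe or small network would be a more usual choice). Nothing here touches a real model.

```python
import random

def centroid(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def fit_centroids(activations, labels):
    """Map each category label to the mean activation observed for it."""
    grouped = {}
    for vec, label in zip(activations, labels):
        grouped.setdefault(label, []).append(vec)
    return {label: centroid(vecs) for label, vecs in grouped.items()}

def predict(centroids, vec):
    """Return the category whose centroid is closest in squared distance."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(vec, c))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))

def synthetic_activation(rng, label, dim=8):
    """Stand-in for a prompt's activations: category-specific mean plus noise."""
    offset = {"recipe": 1.0, "story": -1.0}[label]
    return [offset + rng.gauss(0, 0.5) for _ in range(dim)]

# Synthetic dataset: (activation vector, one-word category of the generation).
rng = random.Random(0)
labels = [rng.choice(["recipe", "story"]) for _ in range(200)]
activations = [synthetic_activation(rng, label) for label in labels]

# Fit on the first 150 examples, evaluate on the remaining 50.
probe = fit_centroids(activations[:150], labels[:150])
held_out = zip(activations[150:], labels[150:])
accuracy = sum(predict(probe, v) == y for v, y in held_out) / 50
print(accuracy)
```

The synthetic categories are separated by construction, so the accuracy printed here says nothing about how separable real activations would be; that is the empirical question.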
Broad/Low-Information Summaries on Possible Model Generation Trajectories
Here is an idea of what this might look like in my mind:
Some basic properties that would be good to have, given only the prompt and the model, but not running the model beyond the prompt:
Some more ideal properties one might hope to find:
Some bad potential properties:
Ideally, if we had a predictor that accurately models the acting model's generations, we could also begin intervening on the acting model to modify the probability of some generations: for example, removing one of the possible "lines of thought" that the model could take, and thus increasing the probabilities of the others. This should be possible in theory, but in practice might not work out.
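At the level of the predictor's output, rather than the acting model's internals, the simplest version of this intervention is just renormalisation. A sketch with made-up categories and probabilities:

```python
def remove_trajectory(probs, removed):
    """Drop one predicted trajectory and renormalise the remaining ones."""
    remaining = {k: v for k, v in probs.items() if k != removed}
    total = sum(remaining.values())
    if total == 0:
        raise ValueError("no probability mass left after removal")
    return {k: v / total for k, v in remaining.items()}

# Hypothetical predicted distribution over "lines of thought".
predicted = {"recipe": 0.5, "story": 0.3, "code": 0.2}
after = remove_trajectory(predicted, "recipe")
print(after)  # "story" and "code" now share all of the probability mass
```

In the acting model itself, the analogous intervention would have to act on activations or weights, and whether the other lines of thought would be rescaled this cleanly is an open question.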
Look-ahead token prediction.
Distant Token Prediction Modification.
Trying to Predict Long-term Goals.
Here I try to apply the labels "short-term" and "long-term" goals. I think the labels and ideas here are quite messy and confused, and I'm unsure about some of what is written, but I think they could potentially lead others to new insights and clarity.
Goals in Language Models
Goals in Chess AI
Implications of Goals on Different Time Scales
Representation of Goals
Potential Scenarios and Implications
Reframing Existing Ideas