Human brain planning algorithms (and I expect future AGI systems too) don't have a special status for "one timestep"; there are different entities in the model that span different lengths of time in a flexible and somewhat-illegible way. Like "I will go to the store" is one thought-chunk, but it encompasses a large and unpredictable number of elementary actions. Do you have any thoughts on getting myopia to work if that's the kind of model you're dealing with?
I don't have any novel modeling approach to resolve your question, I can only tell you about the standard approach.
You can treat planning where multiple actions spanning many time steps are considered as a single chunk as an approximation method: an approximation method for solving the optimal planning problem in the st world model. In the paper, I mention and model this type of approximation briefly in section 3.2.1, but that section is not included in the post above.
Some more details of how an approximation approach using action chunks would work: you start by setting the time step in the st planning world model to something arbitrarily small, say 1 millisecond (anything smaller than the sample rate of the agent's fastest sensors will do in practical implementations). Then, treat any action chunk C as a special policy function C(s), where this policy function can return a special value `end' to denote 'this chunk of actions is now finished'. The agent's machine learning system may then construct a prediction function X(s',s,C) which predicts the probability that, starting in agent environment state s, executing C till the end will land the agent environment in state s'. It also needs to construct a function T(t,s,C) that estimates the probability distribution over the time taken (time steps in the policy C) till the policy ends, and a function UC(s,C) that estimates the chunk of utility gained in the underlying reward nodes covered by C. These functions can then be used to compute an approximate solution to the optimal policy π*_st of the planning world st. Graphically, a whole time series of S_i⋯S_j, A_i⋯A_j and R_i⋯R_j nodes in the st model gets approximated by cutting out all the middle nodes and writing the functions X and UC over the nodes S_j and R_j.
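To make the roles of X, T, and UC concrete, here is a minimal sketch of a chunk-level planner. The tabular world model, the discounting scheme, the one-step chunk lookahead, and all the specific numbers are my own illustrative assumptions, not constructions from the paper:

```python
# Toy chunk-level planner, a minimal sketch only. The names X, T, and UC
# follow the text above; everything else here is an illustrative assumption.

GAMMA = 0.9999995  # per-elementary-time-step discount factor (illustrative)

# X[(s, C)]: probability distribution over the state s' in which chunk C,
# started in state s, ends.
X = {
    ('home', 'walk_to_store'): {'store': 0.9, 'street': 0.1},
    ('home', 'stay_home'): {'home': 1.0},
}

# T[(s, C)]: probability distribution over the number of elementary time
# steps (here: milliseconds) the chunk takes until its policy returns `end'.
T = {
    ('home', 'walk_to_store'): {600_000: 0.8, 900_000: 0.2},
    ('home', 'stay_home'): {1: 1.0},
}

# UC[(s, C)]: expected utility gained in the reward nodes covered by C.
UC = {
    ('home', 'walk_to_store'): 5.0,
    ('home', 'stay_home'): 0.5,
}

def expected_chunk_value(s, C, value_of_state):
    """Utility inside the chunk plus the expected value of the end state,
    with T bridging the chunk back to elementary-time-step discounting."""
    disc = sum(p * GAMMA ** t for t, p in T[(s, C)].items())
    cont = sum(p * value_of_state(s2) for s2, p in X[(s, C)].items())
    return UC[(s, C)] + disc * cont

def pick_chunk(s, chunks, value_of_state):
    return max(chunks, key=lambda C: expected_chunk_value(s, C, value_of_state))

# Illustrative terminal values for the states where a chunk can end.
V = {'store': 10.0, 'street': 0.0, 'home': 0.0}
best = pick_chunk('home', ['walk_to_store', 'stay_home'], V.get)
print(best)  # walk_to_store
```

Note how T only enters through the discounting of the end-state value: this is one place where the unpredictable duration of a chunk like "I will go to the store" gets accounted for in elementary time steps.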
Representing the use of the function T in a graphical way is trickier; it is easier to write down the role of that function during the approximation process by using a Bellman equation that unrolls the world model into individual time lines and ends each line when the estimated time is up. But I won't write out the Bellman equation here.
The solution found by the machinery above will usually be only approximately optimal, and the approximately optimal policy found may also end up having estimated U_st by averaging over a set of world lines that are all approximately N time steps long in st, while some world lines might be slightly shorter or longer.
The advantage of this approximation method with action/thought chunks C is that it could radically speed up planning calculations. In the Kahneman and Tversky system 1/system 2 model, something like this also happens.
Now, it is possible to imagine someone creating an illegible machine learning system that is capable of constructing the functions X and UC, but not T. If you have this exact type of illegibility, then you cannot reliably (or even semi-reliably) approximate π*_st anymore, so you cannot build an approximation of an STH agent around such a learning system. However, learning the function T seems relatively easy to me: there is no symbol grounding problem here, as long as we include time stamps in the agent environment states recorded in the observational record. We humans are also not too bad at estimating how long our action chunks will usually take. By the way, see section 10.2 of my paper for a more detailed discussion of my thoughts on handling illegibility, black box models, and symbol grounding. I have no current plans to add that section of the paper as a post in this sequence, as the idea of the sequence is to be a high-level introduction only.
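As an illustration of why learning T looks easy once time stamps are in the observational record, a simple frequency estimate already gives a usable duration distribution. The record format and helper name below are my own assumptions, purely for illustration:

```python
from collections import Counter, defaultdict

# Each record entry: (state at chunk start, chunk name, start time stamp,
# end time stamp), with time stamps read off the recorded agent environment
# states, so there is no symbol grounding problem.
record = [
    ('home', 'walk_to_store', 0, 600_000),
    ('home', 'walk_to_store', 10_000_000, 10_900_000),
    ('home', 'walk_to_store', 20_000_000, 20_600_000),
]

def learn_T(record):
    """Frequency estimate of T(t,s,C): for each (s, C), the empirical
    probability distribution over chunk durations in elementary time steps."""
    counts = defaultdict(Counter)
    for s, C, start, end in record:
        counts[(s, C)][end - start] += 1
    return {
        key: {t: n / sum(c.values()) for t, n in c.items()}
        for key, c in counts.items()
    }

T_hat = learn_T(record)
print(T_hat[('home', 'walk_to_store')])
# {600000: 0.6666666666666666, 900000: 0.3333333333333333}
```

A real implementation would need smoothing and generalization across states, but the point stands: durations are directly observable quantities, so estimating T is an ordinary supervised learning problem.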