We just posted Behavioral statistics for a maze-solving agent
[https://www.lesswrong.com/posts/eowhY5NaCaqY6Pkj9/behavioural-statistics-for-a-maze-solving-agent].
TL;DR You raise a reasonable worry, but the three key variables[1] have stable
signs and seem like legit decision-making factors. The variable you quote indeed
seems to be a statistical artifact, as we speculated.[2]
--------------------------------------------------------------------------------
There is indeed a strong correlation between two[3] of our highly predictive
variables:
d_step(decision-square, cheese) and d_Euclidean(decision-square, cheese) have a
correlation of 0.886.
We computed the variance inflation factors
[https://corporatefinanceinstitute.com/resources/data-science/variance-inflation-factor-vif/]
for the three predictive variables. VIF measures how collinearity increases the
variance of the regression coefficients. A score exceeding 4 is considered to be
a warning sign of multicollinearity.
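
For readers unfamiliar with the quantity, here is a minimal sketch of how a VIF can be computed: regress each variable on the others and take 1 / (1 - R^2). This is illustrative only, not the code we ran; the function name `variance_inflation_factors` and the synthetic data are assumptions made up for the example.

```python
import numpy as np

def variance_inflation_factors(X):
    """Return VIF_j = 1 / (1 - R_j^2) for each column j of X.

    R_j^2 is from an ordinary least squares regression (with intercept)
    of column j on all the other columns.
    """
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    vifs = np.empty(k)
    for j in range(k):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])  # add intercept column
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - (resid @ resid) / ((y - y.mean()) ** 2).sum()
        vifs[j] = 1.0 / (1.0 - r2)
    return vifs

# Synthetic stand-in data (hypothetical, NOT the maze features).
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = a + 0.5 * rng.normal(size=500)  # strongly correlated with a
c = rng.normal(size=500)            # independent of a and b
vifs = variance_inflation_factors(np.column_stack([a, b, c]))
print(vifs)
```

In this toy setup the two correlated columns get an elevated VIF, while the independent column stays close to 1.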
| Attribute | VIF |
| --- | --- |
| Euclidean distance between cheese and top-right square | 1.05 |
| Steps between cheese and decision-square | 4.64 |
| Euclidean distance between cheese and decision-square | 4.66 |
So we're at risk here. However, we re-isolated these three variables because
they both:
* Are predictively useful on their own, and
* Show no (or extremely rare) sign flips when regressing upon randomly selected
  subsets of variables.
Considering a range of regressions on a range of train/validation splits, these
variables have stable regression coefficient signs and somewhat stable
coefficient magnitudes. (Although we don't mean for our analysis to be
predicated on the magnitudes themselves; we know these are unreliable and
contingent quantities!)
Furthermore, we regressed upon 200 random subsets of our larger set of
variables, and the cheese/decision-square distance regression coefficients never
experienced a sign flip. The cheese/top-right Euclidean distance had a few sign
flips. The other variables sign-flip frequently.
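
A rough sketch of this kind of sign-stability check follows. It is a simplified stand-in rather than our actual analysis: it assumes ordinary least squares, measures flips relative to the sign in the all-variables regression, and uses synthetic data. The name `sign_flip_rates` is hypothetical.

```python
import numpy as np

def sign_flip_rates(X, y, n_subsets=200, seed=0):
    """For each feature, the fraction of random-subset regressions in which
    its coefficient sign differs from its sign in the full regression.

    Assumptions: plain OLS with an intercept; the full regression serves
    as the reference sign.
    """
    rng = np.random.default_rng(seed)
    n, k = X.shape

    def ols(cols):
        A = np.column_stack([np.ones(n), X[:, cols]])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        return coef[1:]  # drop the intercept

    ref_sign = np.sign(ols(np.arange(k)))
    flips = np.zeros(k)
    counts = np.zeros(k)
    for _ in range(n_subsets):
        size = rng.integers(1, k + 1)  # subset size in 1..k
        cols = np.sort(rng.choice(k, size=size, replace=False))
        signs = np.sign(ols(cols))
        counts[cols] += 1
        flips[cols] += signs != ref_sign[cols]
    return flips / np.maximum(counts, 1)

# Synthetic stand-in data: two real effects, three pure-noise features.
rng = np.random.default_rng(1)
X = rng.normal(size=(400, 5))
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(size=400)
rates = sign_flip_rates(X, y)
print(rates)
```

A feature with a real effect should show a flip rate at or near zero, whereas noise features can flip sign depending on which other variables are included.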
We reran thi

Nice project and writeup. I particularly liked the walkthrough of thought processes throughout the project.

I'd be wary of interpreting the regression coefficients of features that are correlated (see Multicollinearity). Even the sign may be misleading.

It might be worth making a cross-correlation plot of the features. This won't give you a new coeff...