Supplementary Text 1

A comparison between the conventional eligibility trace rule (Eq. 4) and the proposed eligibility rule (Eq. 8)

In Daw et al. (2011), the eligibility trace updates eligible state–action values using the prediction error at the end of each episode, as expressed in Eq. (4). We proposed an alternative updating rule in Eq. (8), in which the first-stage value is updated directly using the reward at the second stage. We compared the contributions of these equations to the model fit. The results are shown in Table S1. SARSA (λ) TD and the parallel-learning model produced the same fits regardless of whether Eq. (4) or Eq. (8) was used. All remaining models that include a forgetting and/or an EA rule showed better fits with Eq. (8) and its variants, Eq. (9) and Eq. (10), than with Eq. (4). A critical difference between these models is that the former models do not update the unchosen and unvisited option values, whereas the latter models do. To show why this difference leads to different fits, we examine the updates of the chosen and unchosen option values at the first stage.

First, we focus on the update of a chosen option value. When using the conventional eligibility rule, the chosen first-stage value is updated by Eq. (2) and Eq. (4) sequentially within one trial. These can be written as a single equation as follows:

(s1)

In the same way, when using the proposed eligibility rule, the chosen first-stage value is updated by Eq. (2) and Eq. (8) sequentially within one trial. These can also be written as a single equation as follows:

(s2)

For a comparison of Eq. (s1) and Eq. (s2), see also Fig. S1. In both equations, the weighting coefficients of the first-stage value, the second-stage value, and the reward sum to one, and in the two-stage decision task the two equations can produce the same results on their right-hand sides even though they use different parameter values. This occurs when there is a combination of the parameters of Eq. (s1) and the parameters of Eq. (s2) that fulfills the following simultaneous equations, where the parameters of Eq. (s2) are written with distinct symbols to distinguish them from those of Eq. (s1):

About the coefficient of : . (s3)

About the coefficient of : . (s4)

About the coefficient of : . (s5)

Solving the above equations gives the parameters of Eq. (s2) in terms of those of Eq. (s1), and vice versa. When the parameters of Eq. (s1) are between zero and one, there are corresponding solutions for the parameters of Eq. (s2) lying between zero and one, and, in the same way, when the parameters of Eq. (s2) are between zero and one, there are solutions for the parameters of Eq. (s1) lying between zero and one. Because of this correspondence between Eq. (s1) and Eq. (s2), the same fitting results are obtained regardless of whether values are updated using the original eligibility trace rule or the proposed eligibility rule in the models that update only chosen action values (i.e., SARSA (λ) TD and the parallel-learning model).
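To make this correspondence concrete, the following sketch (in Python) numerically checks that the two combined updates can produce the same result under re-parameterization. It assumes that the conventional rule takes the form used in Daw et al. (2011), namely a stage-1 SARSA update toward the second-stage value followed by a λ-weighted update with the second-stage RPE, and that the proposed rule replaces the second step with a direct update of the first-stage value toward the reward; the variable names, the re-parameterization formulas, and the numerical values are illustrative assumptions rather than the exact expressions of Eqs. (s1)–(s5).

    # Numerical sketch of the correspondence between the two combined updates
    # (assumed forms; notation is illustrative, not the paper's).
    alpha, lam = 0.3, 0.5            # parameters of the conventional rule (Eq. s1)
    q1, q2, r = 0.2, 0.6, 1.0        # first-stage value, second-stage value, reward

    # Conventional rule: Eq. (2) followed by Eq. (4) (Daw et al., 2011 style).
    q = q1 + alpha * (q2 - q1)           # stage-1 update toward the second-stage value
    conv = q + alpha * lam * (r - q2)    # eligibility-trace update with the second-stage RPE

    # Proposed rule: Eq. (2) followed by a direct update toward the reward (assumed form
    # of Eq. 8), using re-parameterized values derived from the coefficient equations.
    alpha_p = alpha * (1 - lam) / (1 - alpha * lam)
    lam_p = alpha * lam / alpha_p
    q = q1 + alpha_p * (q2 - q1)
    prop = q + alpha_p * lam_p * (r - q)

    print(conv, prop)    # both equal 0.38 (up to floating-point error)
    # In both combined updates the weights on q1, q2, and r sum to one, e.g.:
    print((1 - alpha) + alpha * (1 - lam) + alpha * lam)    # 1.0

In this example both rules move the first-stage value to the same point, illustrating why models that update only chosen values fit equally well under either rule.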

Next, we focus on the update of an unchosen option value in the models using an eligibility adjustment mechanism. When using the conventional eligibility rule, the unchosen first-stage value is updated by an equation similar to Eq. (10) but using the RPE at the second stage, as follows:

,(s6)

In contrast, when using the proposed eligibility rule, the unchosen first-stage value is updated by Eq. (10) as follows:

,(s7)

In Eq. (s7), the weighting coefficients of the unchosen value and the reward sum to one, which is the same weighting structure as in Eq. (s1) and Eq. (s2). However, Eq. (s6) does not have such a weighting structure, because the unchosen value is updated by the second-stage RPE without first undergoing a weighted update as the chosen value does. In this respect, the new eligibility trace rule is more straightforward to interpret when the value update is extended to unchosen actions.
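The difference in weighting structure can also be illustrated with a short sketch. For illustration only, assume the unchosen first-stage value is updated with some adjusted eligibility e_u (the exact forms are those of Eqs. (s6), (s7), and Eq. (10)): under the proposed rule the update is a convex combination of the old value and the reward, whereas under the conventional rule a second-stage RPE is added to a value that was never pulled toward the second-stage value.

    # Illustrative contrast of the two unchosen-value updates (assumed symbols).
    alpha, e_u = 0.3, 0.4        # learning rate and an adjusted eligibility of the unchosen action
    q_u, q2, r = 0.5, 0.6, 1.0   # unchosen first-stage value, second-stage value, reward

    # Conventional-style update (cf. Eq. s6): the second-stage RPE is added directly,
    # so the weights on q_u and r do not sum to one and q2 enters as a separate term.
    conv = q_u + alpha * e_u * (r - q2)

    # Proposed-style update (cf. Eq. s7): a convex combination of q_u and r,
    # with weights (1 - alpha*e_u) + alpha*e_u = 1.
    prop = (1 - alpha * e_u) * q_u + alpha * e_u * r

    print(conv, prop)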

Finally, we also consider the update of an unchosen option value using the forgetting rule. Under both eligibility trace rules, the unchosen value is updated using Eq. (11). A parameter in Eq. (11) is set equal to a parameter of the current models, and that parameter also appears in Eq. (s1) but not in Eq. (s2). This produces different fits to the data between the models using Eq. (s1) and those using Eq. (s2).

It is important to note that it remains unclear which eligibility trace rule is better suited to diverse types of tasks or better describes the actual computation in the brain. However, the maximum likelihoods calculated with the two rules were exactly the same for the models that update only chosen action values, and the proposed eligibility trace rule showed better fits for the other models, which include updates of unchosen action values.

Supplementary Text 2

The eligibility trace used in Eq. (9) and Eq. (10)

We used a redefined eligibility trace in Eq. (9) and Eq. (10). In the conventional eligibility trace rule, eligibility traces are initially set to zero for all state–action pairs, and when a state–action pair is visited, its eligibility is incremented by 1. In the current task, the eligibility traces of the first-stage actions are updated as follows:

,(s8)

,(s9)

and before the choice at the second stage, both eligibility traces decay by λ as follows:

,(s10)

.(s11)
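As a concrete illustration, the trace bookkeeping described above might be coded as follows for the two first-stage actions; the container and variable names are assumptions made for this sketch.

    # Model-free eligibility traces for the two first-stage actions (illustrative names).
    lam = 0.6                                # trace-decay parameter lambda
    e = {"chosen": 0.0, "unchosen": 0.0}     # initially zero for all state-action pairs

    # After the first-stage choice, the visited pair's trace is incremented by 1 (cf. Eqs. s8, s9).
    e["chosen"] += 1.0

    # Before the choice at the second stage, both traces decay by lambda (cf. Eqs. s10, s11).
    for a in e:
        e[a] *= lam

    print(e)    # {'chosen': 0.6, 'unchosen': 0.0}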

These are ordinary eligibility traces, which we call model-free eligibility traces. They are used in Eq. (4) and Eq. (8). In the EA model, newly defined model-based eligibility traces are introduced. We theorize that neural networks, which are activated depending on the frequencies of the experienced transitions, propagate reward information to the eligible actions in proportion to their activation. Thus, we defined the model-based eligibility traces as follows:

,(s12)

, (s13)

where T is a transition probability function from a first-stage action to a second-stage state. Before the choice at the second stage, these model-based eligibility traces decay by λ as follows:

, (s14)

. (s15)

In the EA model, these model-based eligibility traces and the ordinary model-free eligibility traces are combined with a weighting parameter (w), which determines the model-based effect. Thus, Eq. (8) is replaced with Eq. (9) for the chosen action. This redefinition naturally leads to the updating mechanism of the unchosen action value at the first stage, as expressed by Eq. (10).
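The sketch below shows one way this combination could be implemented: the model-based traces are set in proportion to the estimated transition probabilities T and decay by λ like the model-free traces, and the two kinds of trace are then mixed with the weight w. The exact definitions are those of Eqs. (s12)–(s15) and Eqs. (9)–(10); the variable names, the proportionality used for the model-based traces, and the linear mixing expression are assumptions made for illustration.

    # Illustrative combination of model-free and model-based eligibility traces.
    lam, w = 0.6, 0.5                        # decay parameter and model-based weight
    T = {"chosen": 0.7, "unchosen": 0.3}     # estimated probability that each first-stage
                                             # action leads to the visited second-stage state

    # Model-free traces: 1 for the visited pair, 0 otherwise (cf. Eqs. s8, s9).
    e_mf = {"chosen": 1.0, "unchosen": 0.0}
    # Model-based traces: set in proportion to the transition probabilities (cf. Eqs. s12, s13).
    e_mb = dict(T)

    # Both kinds of trace decay by lambda before the second-stage choice (cf. Eqs. s10-s11, s14-s15).
    for a in ("chosen", "unchosen"):
        e_mf[a] *= lam
        e_mb[a] *= lam

    # Mix the traces with the weighting parameter w (one possible reading of Eqs. 9 and 10).
    e = {a: (1 - w) * e_mf[a] + w * e_mb[a] for a in e_mf}
    print(e)    # the unchosen action now carries a nonzero, transition-weighted eligibility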

Supplementary Text 3

The modification of the EA model from SARSA (λ) TD learning

The EA model realized model-based value updates by adding two features to SARSA (λ) TD learning. First, it introduced a weight parameter w that controls the balance between the model-free and model-based systems in the eligibility traces. Second, it applied a similar eligibility trace rule to the unchosen action. To understand the EA model, it is helpful to examine whether both features are needed to improve its fit. Regarding the first point, we examined a model that does not include a mechanism for balancing the two systems, that is, a version of the EA model in which w is fixed at one. This model saves one parameter but showed a worse fit on the AIC and BIC criteria (−LL = 317, AIC = 646, BIC = 673) than the original EA model (−LL = 312, AIC = 638, BIC = 669). Next, the need for the second feature was examined with a version of the EA model without the update of the unchosen value, in which the corresponding eligibility term in Eq. (10) is fixed at zero. This model also showed a worse fit (−LL = 316, AIC = 646, BIC = 677) than the original EA model. Taken together, we conclude that both modifications are important to the EA model.
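For reference, the information criteria used in this comparison follow their standard definitions, as sketched below; the parameter counts and the number of choices in the example are placeholders chosen for illustration, not values taken from the experiment.

    import math

    def aic(neg_ll, k):
        # Akaike information criterion from a negative log likelihood and k free parameters.
        return 2 * k + 2 * neg_ll

    def bic(neg_ll, k, n):
        # Bayesian information criterion; n is the number of observations (choices).
        return k * math.log(n) + 2 * neg_ll

    # With placeholder counts, fixing one parameter (k: 7 -> 6) lowers the penalty term,
    # so the restricted model is preferred only if its fit (-LL) does not worsen too much.
    print(aic(312, 7), aic(317, 6))            # 638 646
    print(bic(312, 7, 600), bic(317, 6, 600))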

Fig. S1. Structural diagrams of Eq. (s1) and Eq. (s2)

Table S1. Comparison of the two different eligibility trace (ET) rules in fitting the choices of the 23 participants. The table gives the negative log likelihood (−LL). The better fit of the two rules is shown in bold. Importantly, SARSA (λ) TD and the parallel-learning model showed the same fit under the two rules, whereas all other models favored the proposed ET rule.

Table S2. The relationships between the estimated parameters of the EA-FD model and model use, i.e., the results of the logistic regressions. To show the effect of each parameter, the table is ordered by the values of (1), (2), and their product (3).


Table S3. Information on the nine compared models based on their fit to the choices of the 23 participants in the three blocks. This table provides the mean values across participants of the negative log likelihood (−LL), the Akaike information criterion (AIC), and the Bayesian information criterion (BIC) for each model in each block. The minimum and second-minimum values of the AIC and BIC are colored gray and light gray, respectively, in each block. Importantly, the trajectory of reward probabilities in this task was not constructed to allow a strict comparison of model fits among blocks. However, this table may help explain the application of the proposed model in future studies. By the AIC criterion, the EA-FD model was the most favored in all three blocks. By the BIC criterion, the F model was the most favored and the EA-F model was the second-most favored in the first block. It is noteworthy that in the first block the transition probability was relatively uncertain for participants, and a parameter relating to model usage might become redundant.