Multi-timescale reinforcement learning in the brain

  • Sutton, R. S. & Barto, A. G. Reinforcement Learning 2nd edn (MIT Press, 2018).

  • Tesauro, G. Temporal difference learning and TD-Gammon. Commun. ACM 38, 58–68 (1995).

  • Mnih, V. et al. Human-level control through deep reinforcement learning. Nature 518, 529–533 (2015).

  • Silver, D. et al. Mastering the game of Go with deep neural networks and tree search. Nature 529, 484–489 (2016).

  • Wurman, P. R. et al. Outracing champion Gran Turismo drivers with deep reinforcement learning. Nature 602, 223–228 (2022).

  • Schultz, W., Dayan, P. & Montague, P. R. A neural substrate of prediction and reward. Science 275, 1593–1599 (1997).

  • Schultz, W. Neuronal reward and decision signals: from theories to data. Physiol. Rev. 95, 853–951 (2015).

  • Cohen, J. Y., Haesler, S., Vong, L., Lowell, B. B. & Uchida, N. Neuron-type-specific signals for reward and punishment in the ventral tegmental area. Nature 482, 85–88 (2012).

  • Ainslie, G. Specious reward: a behavioral theory of impulsiveness and impulse control. Psychol. Bull. 82, 463–496 (1975).

  • Frederick, S., Loewenstein, G. & O’Donoghue, T. Time discounting and time preference: a critical review. J. Econ. Lit. 40, 351–401 (2002).

  • Laibson, D. Golden eggs and hyperbolic discounting. Q. J. Econ. 112, 443–478 (1997).

  • Sozou, P. D. On hyperbolic discounting and uncertain hazard rates. Proc. R. Soc. Lond. B 265, 2015–2020 (1998).

  • Botvinick, M. et al. Reinforcement learning, fast and slow. Trends Cogn. Sci. 23, 408–422 (2019).

  • Redish, A. D. Addiction as a computational process gone awry. Science 306, 1944–1947 (2004).

  • Lempert, K. M., Steinglass, J. E., Pinto, A., Kable, J. W. & Simpson, H. B. Can delay discounting deliver on the promise of RDoC? Psychol. Med. 49, 190–199 (2019).

  • Sutton, R. S. et al. Horde: a scalable real-time architecture for learning knowledge from unsupervised sensorimotor interaction. In The 10th International Conference on Autonomous Agents and Multiagent Systems Vol. 2 761–768 (International Foundation for Autonomous Agents and Multiagent Systems, 2011); https://dl.acm.org/doi/10.5555/2031678.2031726

  • Bellemare, M. G., Dabney, W. & Rowland, M. Distributional Reinforcement Learning (MIT Press, 2023).

  • Jaderberg, M. et al. Reinforcement learning with unsupervised auxiliary tasks. In International Conference on Learning Representations (ICLR, 2017).

  • Fedus, W., Gelada, C., Bengio, Y., Bellemare, M. G. & Larochelle, H. Hyperbolic discounting and learning over multiple horizons. Preprint at https://arxiv.org/abs/1902.06865 (2019).
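
    The hazard-rate account of hyperbolic discounting cited above (Sozou, 1998; Fedus et al., 2019) has a compact numerical illustration. The sketch below, with arbitrary assumed parameter values not taken from either paper, checks by Monte Carlo that exponential discounting e^(−λt), averaged over a hazard rate λ drawn from an exponential prior with mean k, reproduces the hyperbolic discount 1/(1 + kt):

    ```python
    import numpy as np

    # Illustrative sketch: average exponential discounts over an uncertain
    # hazard rate. If lambda ~ Exponential(mean=k), then
    # E[exp(-lambda * t)] = 1 / (1 + k * t), i.e. hyperbolic discounting.
    # k and the delays t are arbitrary choices for this demonstration.
    rng = np.random.default_rng(0)
    k = 0.5                                       # mean hazard rate (assumed)
    lam = rng.exponential(scale=k, size=1_000_000)

    for t in [1.0, 2.0, 5.0]:
        mc = np.exp(-lam * t).mean()              # Monte Carlo average
        hyperbolic = 1.0 / (1.0 + k * t)          # closed-form hyperbolic
        print(f"t={t}: Monte Carlo {mc:.4f} vs hyperbolic {hyperbolic:.4f}")
    ```

    The agreement is why a population of learners with different fixed exponential discounts can, in aggregate, behave hyperbolically.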

  • Dabney, W. et al. A distributional code for value in dopamine-based reinforcement learning. Nature 577, 671–675 (2020).

  • Tano, P., Dayan, P. & Pouget, A. A local temporal difference code for distributional reinforcement learning. Adv. Neural Inf. Process. Syst. 33, 13662–13673 (2020).

  • Brunec, I. K. & Momennejad, I. Predictive representations in hippocampal and prefrontal hierarchies. J. Neurosci. 42, 299–312 (2022).

  • Lowet, A. S., Zheng, Q., Matias, S., Drugowitsch, J. & Uchida, N. Distributional reinforcement learning in the brain. Trends Neurosci. 43, 980–997 (2020).

  • Masset, P. & Gershman, S. J. in The Handbook of Dopamine (Handbook of Behavioral Neuroscience) Vol. 32 (eds Cragg, S. J. & Walton, M.) Ch. 24 (Academic Press, 2025).

  • Buhusi, C. V. & Meck, W. H. What makes us tick? Functional and neural mechanisms of interval timing. Nat. Rev. Neurosci. 6, 755–765 (2005).

  • Tsao, A., Yousefzadeh, S. A., Meck, W. H., Moser, M.-B. & Moser, E. I. The neural bases for timing of durations. Nat. Rev. Neurosci. 23, 646–665 (2022).

  • Fiorillo, C. D., Newsome, W. T. & Schultz, W. The temporal precision of reward prediction in dopamine neurons. Nat. Neurosci. 11, 966–973 (2008).

  • Mello, G. B. M., Soares, S. & Paton, J. J. A scalable population code for time in the striatum. Curr. Biol. 25, 1113–1122 (2015).

  • Soares, S., Atallah, B. V. & Paton, J. J. Midbrain dopamine neurons control judgment of time. Science 354, 1273–1277 (2016).

  • Enomoto, K., Matsumoto, N., Inokawa, H., Kimura, M. & Yamada, H. Topographic distinction in long-term value signals between presumed dopamine neurons and presumed striatal projection neurons in behaving monkeys. Sci. Rep. 10, 8912 (2020).

  • Mohebi, A., Wei, W., Pelattini, L., Kim, K. & Berke, J. D. Dopamine transients follow a striatal gradient of reward time horizons. Nat. Neurosci. 27, 737–746 (2024).

  • Kiebel, S. J., Daunizeau, J. & Friston, K. J. A hierarchy of time-scales and the brain. PLoS Comput. Biol. 4, e1000209 (2008).

  • Kurth-Nelson, Z. & Redish, A. D. Temporal-difference reinforcement learning with distributed representations. PLoS ONE 4, e7362 (2009).

  • Shankar, K. H. & Howard, M. W. A scale-invariant internal representation of time. Neural Comput. 24, 134–193 (2012).

  • Tanaka, C. S. et al. Prediction of immediate and future rewards differentially recruits cortico-basal ganglia loops. Nat. Neurosci. 7, 887–893 (2004).

  • Sherstan, C., Dohare, S., MacGlashan, J., Günther, J. & Pilarski, P. M. Gamma-Nets: generalizing value estimation over timescale. In Proceedings of the AAAI Conference on Artificial Intelligence 34, 5717–5725 (2020).

  • Momennejad, I. & Howard, M. W. Predicting the future with multi-scale successor representations. Preprint at bioRxiv https://doi.org/10.1101/449470 (2018).

  • Reinke, C., Uchibe, E. & Doya, K. Average reward optimization with multiple discounting reinforcement learners. In Neural Information Processing. ICONIP 2017. Lecture Notes in Computer Science (eds Liu, D. et al.) 789–800 (Springer, 2017).

  • Kobayashi, S. & Schultz, W. Influence of reward delays on responses of dopamine neurons. J. Neurosci. 28, 7837–7846 (2008).

  • Schultz, W. Dopamine reward prediction-error signalling: a two-component response. Nat. Rev. Neurosci. 17, 183–195 (2016).

  • Howe, M. W., Tierney, P. L., Sandberg, S. G., Phillips, P. E. M. & Graybiel, A. M. Prolonged dopamine signalling in striatum signals proximity and value of distant rewards. Nature 500, 575–579 (2013).

  • Berke, J. D. What does dopamine mean? Nat. Neurosci. 21, 787–793 (2018).

  • Gershman, S. J. Dopamine ramps are a consequence of reward prediction errors. Neural Comput. 26, 467–471 (2014).

  • Kim, H. G. R. et al. A unified framework for dopamine signals across timescales. Cell 183, 1600–1616.e25 (2020).

  • Mikhael, J. G., Kim, H. R., Uchida, N. & Gershman, S. J. The role of state uncertainty in the dynamics of dopamine. Curr. Biol. 32, 1077–1087.e9 (2022).

  • Guru, A. et al. Ramping activity in midbrain dopamine neurons signifies the use of a cognitive map. Preprint at bioRxiv https://doi.org/10.1101/2020.05.21.108886 (2020).

  • Doya, K. Reinforcement learning in continuous time and space. Neural Comput. 12, 219–245 (2000).

  • Lee, R. S., Sagiv, Y., Engelhard, B., Witten, I. B. & Daw, N. D. A feature-specific prediction error model explains dopaminergic heterogeneity. Nat. Neurosci. 27, 1574–1586 (2024).

  • Cruz, B. F. et al. Action suppression reveals opponent parallel control via striatal circuits. Nature 607, 521–526 (2022).

  • Millidge, B., Song, Y., Lak, A., Walton, M. E. & Bogacz, R. Reward bases: a simple mechanism for adaptive acquisition of multiple reward types. PLoS Comput. Biol. 20, e1012580 (2024).

  • Engelhard, B. et al. Specialized coding of sensory, motor and cognitive variables in VTA dopamine neurons. Nature 570, 509–513 (2019).

  • Eshel, N., Tian, J., Bukwich, M. & Uchida, N. Dopamine neurons share common response function for reward prediction error. Nat. Neurosci. 19, 479–486 (2016).

  • Cox, J. & Witten, I. B. Striatal circuits for reward learning and decision-making. Nat. Rev. Neurosci. 20, 482–494 (2019).

  • Collins, A. L. & Saunders, B. T. Heterogeneity in striatal dopamine circuits: form and function in dynamic reward seeking. J. Neurosci. Res. https://doi.org/10.1002/jnr.24587 (2020).

  • Gershman, S. J. et al. Explaining dopamine through prediction errors and beyond. Nat. Neurosci. 27, 1645–1655 (2024).

  • Watabe-Uchida, M. & Uchida, N. Multiple dopamine systems: weal and woe of dopamine. Cold Spring Harb. Symp. Quant. Biol. 83, 83–95 (2018).

  • Xu, Z., van Hasselt, H. P. & Silver, D. Meta-gradient reinforcement learning. In Advances in Neural Information Processing Systems Vol. 31 (Curran Associates, 2018).

  • Yoshida, N., Uchibe, E. & Doya, K. Reinforcement learning with state-dependent discount factor. In 2013 IEEE International Conference on Development and Learning and Epigenetic Robotics (ICDL) https://ieeexplore.ieee.org/document/6652533 (IEEE, 2013).

  • Doya, K. Metalearning and neuromodulation. Neural Netw. 15, 495–506 (2002).

  • Tanaka, S. C. et al. Serotonin differentially regulates short- and long-term prediction of rewards in the ventral and dorsal striatum. PLoS ONE https://doi.org/10.1371/journal.pone.0001333 (2007).

  • Kvitsiani, D. et al. Distinct behavioural and network correlates of two interneuron types in prefrontal cortex. Nature 498, 363–366 (2013).

  • Sutton, R. S. Learning to predict by the methods of temporal differences. Mach. Learn. 3, 9–44 (1988).
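
    Sutton's temporal-difference method is simple enough to state in a few lines. Below is a minimal tabular TD(0) sketch; the two-state chain and the learning constants are illustrative assumptions, not taken from the paper:

    ```python
    # Minimal tabular TD(0) (after Sutton, 1988): nudge V(s) toward the
    # bootstrapped target r + gamma * V(s'). The toy chain is
    # A -> B (reward 0) -> terminal (reward 1); terminal value is 0.
    gamma, alpha = 0.9, 0.1   # discount factor and learning rate (assumed)
    V = {"A": 0.0, "B": 0.0}

    for _ in range(2000):
        V["A"] += alpha * (0.0 + gamma * V["B"] - V["A"])  # A -> B, r = 0
        V["B"] += alpha * (1.0 + 0.0 - V["B"])             # B -> end, r = 1

    print(V)  # converges toward V["B"] = 1.0 and V["A"] = gamma = 0.9
    ```

    The fixed point is V(B) = 1 and V(A) = γ·V(B), so changing the discount factor γ directly sets the timescale over which A's value reflects the delayed reward.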

  • Oppenheim, A. V., Willsky, A. S. & Nawab, S. H. Signals and Systems 2nd edn (Pearson, 1996).

  • Dayan, P. Improving generalisation for temporal difference learning: the successor representation. Neural Comput. 5, 613–624 (1993).
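
    Dayan's successor representation has a closed form worth noting: M = (I − γT)⁻¹ accumulates discounted expected future state occupancies, and V = Mr recovers the value function for any reward vector r. A small numerical sketch, where the three-state deterministic ring is an arbitrary toy example:

    ```python
    import numpy as np

    # Successor representation (after Dayan, 1993): M = (I - gamma*T)^-1
    # sums discounted future occupancies, so V = M @ r for any reward r.
    gamma = 0.9
    T = np.array([[0.0, 1.0, 0.0],   # state 0 -> state 1
                  [0.0, 0.0, 1.0],   # state 1 -> state 2
                  [1.0, 0.0, 0.0]])  # state 2 -> state 0 (toy ring)
    M = np.linalg.inv(np.eye(3) - gamma * T)

    r = np.array([0.0, 0.0, 1.0])    # reward only in state 2
    V = M @ r
    print(V)  # satisfies V = r + gamma * T @ V, e.g. V[2] = 1/(1 - gamma**3)
    ```

    Because M is reward-independent, revaluing states only requires a new matrix-vector product, which is the representation's computational appeal.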

  • Gershman, S. J. The successor representation: its computational logic and neural substrates. J. Neurosci. https://doi.org/10.1523/JNEUROSCI.0151-18.2018 (2018).

  • Amit, R., Meir, R. & Ciosek, K. Discount factor as a regularizer in reinforcement learning. In Proceedings of the 37th International Conference on Machine Learning 269–278 (PMLR, 2020).

  • Badia, A. P. et al. Agent57: outperforming the Atari human benchmark. In Proceedings of the 37th International Conference on Machine Learning 507–517 (PMLR, 2020).

  • Reinke, C. Time adaptive reinforcement learning. In ICLR 2020 Workshop: Beyond "Tabula Rasa" in Reinforcement Learning. Preprint at https://doi.org/10.48550/arXiv.2004.08600 (2020).

  • Gershman, S. J. & Uchida, N. Believing in dopamine. Nat. Rev. Neurosci. 20, 703–714 (2019).

  • Leone, F. C., Nelson, L. S. & Nottingham, R. B. The folded normal distribution. Technometrics 3, 543–550 (1961).

  • Lindsey, J. & Litwin-Kumar, A. Action-modulated midbrain dopamine activity arises from distributed control policies. Adv. Neural Inf. Process. Syst. 35, 5535–5548 (2022).

  • Masset, P. et al. Data and code for ‘Multi-timescale reinforcement learning in the brain’, V1. Mendeley Data https://doi.org/10.17632/tc43t3s7c5.1 (2025).
