Multi-timescale reinforcement learning in the brain

  • Sutton, R. S. & Barto, A. G. Reinforcement Learning 2nd edn (MIT Press, 2018).

  • Tesauro, G. Temporal difference learning and TD-Gammon. Commun. ACM 38, 58–68 (1995).

  • Mnih, V. et al. Human-level control through deep reinforcement learning. Nature 518, 529–533 (2015).

  • Silver, D. et al. Mastering the game of Go with deep neural networks and tree search. Nature 529, 484–489 (2016).

  • Wurman, P. R. et al. Outracing champion Gran Turismo drivers with deep reinforcement learning. Nature 602, 223–228 (2022).

  • Schultz, W., Dayan, P. & Montague, P. R. A neural substrate of prediction and reward. Science 275, 1593–1599 (1997).

  • Schultz, W. Neuronal reward and decision signals: from theories to data. Physiol. Rev. 95, 853–951 (2015).

  • Cohen, J. Y., Haesler, S., Vong, L., Lowell, B. B. & Uchida, N. Neuron-type-specific signals for reward and punishment in the ventral tegmental area. Nature 482, 85–88 (2012).

  • Ainslie, G. Specious reward: a behavioral theory of impulsiveness and impulse control. Psychol. Bull. 82, 463–496 (1975).

  • Frederick, S., Loewenstein, G. & O’Donoghue, T. Time discounting and time preference: a critical review. J. Econ. Lit. 40, 351–401 (2002).

  • Laibson, D. Golden eggs and hyperbolic discounting. Q. J. Econ. 112, 443–478 (1997).

  • Sozou, P. D. On hyperbolic discounting and uncertain hazard rates. Proc. R. Soc. Lond. B 265, 2015–2020 (1998).

  • Botvinick, M. et al. Reinforcement learning, fast and slow. Trends Cogn. Sci. 23, 408–422 (2019).

  • Redish, A. D. Addiction as a computational process gone awry. Science 306, 1944–1947 (2004).

  • Lempert, K. M., Steinglass, J. E., Pinto, A., Kable, J. W. & Simpson, H. B. Can delay discounting deliver on the promise of RDoC? Psychol. Med. 49, 190–199 (2019).

  • Sutton, R. S. et al. Horde: a scalable real-time architecture for learning knowledge from unsupervised sensorimotor interaction. In The 10th International Conference on Autonomous Agents and Multiagent Systems Vol. 2 761–768 (International Foundation for Autonomous Agents and Multiagent Systems, 2011); https://dl.acm.org/doi/10.5555/2031678.2031726

  • Bellemare, M. G., Dabney, W. & Rowland, M. Distributional Reinforcement Learning (MIT Press, 2023).

  • Jaderberg, M. et al. Reinforcement learning with unsupervised auxiliary tasks. In International Conference on Learning Representations (ICLR, 2017).

  • Fedus, W., Gelada, C., Bengio, Y., Bellemare, M. G. & Larochelle, H. Hyperbolic discounting and learning over multiple horizons. Preprint at https://arxiv.org/abs/1902.06865 (2019).
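
    The hazard-rate account of hyperbolic discounting cited above (Sozou, 1998; Fedus et al., 2019) has a compact numerical illustration. The sketch below, with arbitrary assumed parameter values not taken from either paper, checks by Monte Carlo that exponential discounting e^(−λt), averaged over a hazard rate λ drawn from an exponential prior with mean k, reproduces the hyperbolic discount 1/(1 + kt):

    ```python
    import numpy as np

    # Illustrative sketch: average exponential discounts over an uncertain
    # hazard rate. If lambda ~ Exponential(mean=k), then
    # E[exp(-lambda * t)] = 1 / (1 + k * t), i.e. hyperbolic discounting.
    # k and the delays t are arbitrary choices for this demonstration.
    rng = np.random.default_rng(0)
    k = 0.5                                       # mean hazard rate (assumed)
    lam = rng.exponential(scale=k, size=1_000_000)

    for t in [1.0, 2.0, 5.0]:
        mc = np.exp(-lam * t).mean()              # Monte Carlo average
        hyperbolic = 1.0 / (1.0 + k * t)          # closed-form hyperbolic
        print(f"t={t}: Monte Carlo {mc:.4f} vs hyperbolic {hyperbolic:.4f}")
    ```

    The agreement is why a population of learners with different fixed exponential discounts can, in aggregate, behave hyperbolically.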

  • Dabney, W. et al. A distributional code for value in dopamine-based reinforcement learning. Nature 577, 671–675 (2020).

  • Tano, P., Dayan, P. & Pouget, A. A local temporal difference code for distributional reinforcement learning. Adv. Neural Inf. Process. Syst. 33, 13662–13673 (2020).

  • Brunec, I. K. & Momennejad, I. Predictive representations in hippocampal and prefrontal hierarchies. J. Neurosci. 42, 299–312 (2022).

  • Lowet, A. S., Zheng, Q., Matias, S., Drugowitsch, J. & Uchida, N. Distributional reinforcement learning in the brain. Trends Neurosci. 43, 980–997 (2020).

  • Masset, P. & Gershman, S. J. in The Handbook of Dopamine (Handbook of Behavioral Neuroscience) Vol. 32 (eds Cragg, S. J. & Walton, M.) Ch. 24 (Academic Press, 2025).

  • Buhusi, C. V. & Meck, W. H. What makes us tick? Functional and neural mechanisms of interval timing. Nat. Rev. Neurosci. 6, 755–765 (2005).

  • Tsao, A., Yousefzadeh, S. A., Meck, W. H., Moser, M.-B. & Moser, E. I. The neural bases for timing of durations. Nat. Rev. Neurosci. 23, 646–665 (2022).

  • Fiorillo, C. D., Newsome, W. T. & Schultz, W. The temporal precision of reward prediction in dopamine neurons. Nat. Neurosci. 11, 966–973 (2008).

  • Mello, G. B. M., Soares, S. & Paton, J. J. A scalable population code for time in the striatum. Curr. Biol. 25, 1113–1122 (2015).

  • Soares, S., Atallah, B. V. & Paton, J. J. Midbrain dopamine neurons control judgment of time. Science 354, 1273–1277 (2016).

  • Enomoto, K., Matsumoto, N., Inokawa, H., Kimura, M. & Yamada, H. Topographic distinction in long-term value signals between presumed dopamine neurons and presumed striatal projection neurons in behaving monkeys. Sci. Rep. 10, 8912 (2020).

  • Mohebi, A., Wei, W., Pelattini, L., Kim, K. & Berke, J. D. Dopamine transients follow a striatal gradient of reward time horizons. Nat. Neurosci. 27, 737–746 (2024).

  • Kiebel, S. J., Daunizeau, J. & Friston, K. J. A hierarchy of time-scales and the brain. PLoS Comput. Biol. 4, e1000209 (2008).

  • Kurth-Nelson, Z. & Redish, A. D. Temporal-difference reinforcement learning with distributed representations. PLoS ONE 4, e7362 (2009).

  • Shankar, K. H. & Howard, M. W. A scale-invariant internal representation of time. Neural Comput. 24, 134–193 (2012).

  • Tanaka, C. S. et al. Prediction of immediate and future rewards differentially recruits cortico-basal ganglia loops. Nat. Neurosci. 7, 887–893 (2004).

  • Sherstan, C., Dohare, S., MacGlashan, J., Günther, J. & Pilarski, P. M. Gamma-Nets: generalizing value estimation over timescale. In Proceedings of the AAAI Conference on Artificial Intelligence 34, 5717–5725 (2020).

  • Momennejad, I. & Howard, M. W. Predicting the future with multi-scale successor representations. Preprint at bioRxiv https://doi.org/10.1101/449470 (2018).

  • Reinke, C., Uchibe, E. & Doya, K. Average reward optimization with multiple discounting reinforcement learners. In Neural Information Processing. ICONIP 2017. Lecture Notes in Computer Science (eds Liu, D. et al.) 789–800 (Springer, 2017).

  • Kobayashi, S. & Schultz, W. Influence of reward delays on responses of dopamine neurons. J. Neurosci. 28, 7837–7846 (2008).

  • Schultz, W. Dopamine reward prediction-error signalling: a two-component response. Nat. Rev. Neurosci. 17, 183–195 (2016).

  • Howe, M. W., Tierney, P. L., Sandberg, S. G., Phillips, P. E. M. & Graybiel, A. M. Prolonged dopamine signalling in striatum signals proximity and value of distant rewards. Nature 500, 575–579 (2013).

  • Berke, J. D. What does dopamine mean? Nat. Neurosci. 21, 787–793 (2018).

  • Gershman, S. J. Dopamine ramps are a consequence of reward prediction errors. Neural Comput. 26, 467–471 (2014).

  • Kim, H. G. R. et al. A unified framework for dopamine signals across timescales. Cell 183, 1600–1616.e25 (2020).

  • Mikhael, J. G., Kim, H. R., Uchida, N. & Gershman, S. J. The role of state uncertainty in the dynamics of dopamine. Curr. Biol. 32, 1077–1087.e9 (2022).

  • Guru, A. et al. Ramping activity in midbrain dopamine neurons signifies the use of a cognitive map. Preprint at bioRxiv https://doi.org/10.1101/2020.05.21.108886 (2020).

  • Doya, K. Reinforcement learning in continuous time and space. Neural Comput. 12, 219–245 (2000).

  • Lee, R. S., Sagiv, Y., Engelhard, B., Witten, I. B. & Daw, N. D. A feature-specific prediction error model explains dopaminergic heterogeneity. Nat. Neurosci. 27, 1574–1586 (2024).

  • Cruz, B. F. et al. Action suppression reveals opponent parallel control via striatal circuits. Nature 607, 521–526 (2022).

  • Millidge, B., Song, Y., Lak, A., Walton, M. E. & Bogacz, R. Reward bases: a simple mechanism for adaptive acquisition of multiple reward types. PLoS Comput. Biol. 20, e1012580 (2024).

  • Engelhard, B. et al. Specialized coding of sensory, motor and cognitive variables in VTA dopamine neurons. Nature 570, 509–513 (2019).

  • Eshel, N., Tian, J., Bukwich, M. & Uchida, N. Dopamine neurons share common response function for reward prediction error. Nat. Neurosci. 19, 479–486 (2016).

  • Cox, J. & Witten, I. B. Striatal circuits for reward learning and decision-making. Nat. Rev. Neurosci. 20, 482–494 (2019).

  • Collins, A. L. & Saunders, B. T. Heterogeneity in striatal dopamine circuits: form and function in dynamic reward seeking. J. Neurosci. Res. https://doi.org/10.1002/jnr.24587 (2020).

  • Gershman, S. J. et al. Explaining dopamine through prediction errors and beyond. Nat. Neurosci. 27, 1645–1655 (2024).

  • Watabe-Uchida, M. & Uchida, N. Multiple dopamine systems: weal and woe of dopamine. Cold Spring Harb. Symp. Quant. Biol. 83, 83–95 (2018).

  • Xu, Z., van Hasselt, H. P. & Silver, D. Meta-gradient reinforcement learning. In Advances in Neural Information Processing Systems Vol. 31 (Curran Associates, 2018).

  • Yoshida, N., Uchibe, E. & Doya, K. Reinforcement learning with state-dependent discount factor. In 2013 IEEE International Conference on Development and Learning and Epigenetic Robotics (ICDL) https://ieeexplore.ieee.org/document/6652533 (IEEE, 2013).

  • Doya, K. Metalearning and neuromodulation. Neural Netw. 15, 495–506 (2002).

  • Tanaka, S. C. et al. Serotonin differentially regulates short- and long-term prediction of rewards in the ventral and dorsal striatum. PLoS ONE https://doi.org/10.1371/journal.pone.0001333 (2007).

  • Kvitsiani, D. et al. Distinct behavioural and network correlates of two interneuron types in prefrontal cortex. Nature 498, 363–366 (2013).

  • Sutton, R. S. Learning to predict by the methods of temporal differences. Mach. Learn. 3, 9–44 (1988).
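
    Sutton's temporal-difference method is simple enough to state in a few lines. Below is a minimal tabular TD(0) sketch; the two-state chain and the learning constants are illustrative assumptions, not taken from the paper:

    ```python
    # Minimal tabular TD(0) (after Sutton, 1988): nudge V(s) toward the
    # bootstrapped target r + gamma * V(s'). The toy chain is
    # A -> B (reward 0) -> terminal (reward 1); terminal value is 0.
    gamma, alpha = 0.9, 0.1   # discount factor and learning rate (assumed)
    V = {"A": 0.0, "B": 0.0}

    for _ in range(2000):
        V["A"] += alpha * (0.0 + gamma * V["B"] - V["A"])  # A -> B, r = 0
        V["B"] += alpha * (1.0 + 0.0 - V["B"])             # B -> end, r = 1

    print(V)  # converges toward V["B"] = 1.0 and V["A"] = gamma = 0.9
    ```

    The fixed point is V(B) = 1 and V(A) = γ·V(B), so changing the discount factor γ directly sets the timescale over which A's value reflects the delayed reward.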

  • Oppenheim, A. V., Willsky, A. S. & Nawab, S. H. Signals and Systems 2nd edn (Pearson, 1996).

  • Dayan, P. Improving generalisation for temporal difference learning: the successor representation. Neural Comput. 5, 613–624 (1993).
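
    Dayan's successor representation has a closed form worth noting: M = (I − γT)⁻¹ accumulates discounted expected future state occupancies, and V = Mr recovers the value function for any reward vector r. A small numerical sketch, where the three-state deterministic ring is an arbitrary toy example:

    ```python
    import numpy as np

    # Successor representation (after Dayan, 1993): M = (I - gamma*T)^-1
    # sums discounted future occupancies, so V = M @ r for any reward r.
    gamma = 0.9
    T = np.array([[0.0, 1.0, 0.0],   # state 0 -> state 1
                  [0.0, 0.0, 1.0],   # state 1 -> state 2
                  [1.0, 0.0, 0.0]])  # state 2 -> state 0 (toy ring)
    M = np.linalg.inv(np.eye(3) - gamma * T)

    r = np.array([0.0, 0.0, 1.0])    # reward only in state 2
    V = M @ r
    print(V)  # satisfies V = r + gamma * T @ V, e.g. V[2] = 1/(1 - gamma**3)
    ```

    Because M is reward-independent, revaluing states only requires a new matrix-vector product, which is the representation's computational appeal.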

  • Gershman, S. J. The successor representation: its computational logic and neural substrates. J. Neurosci. https://doi.org/10.1523/JNEUROSCI.0151-18.2018 (2018).

  • Amit, R., Meir, R. & Ciosek, K. Discount factor as a regularizer in reinforcement learning. In Proceedings of the 37th International Conference on Machine Learning 269–278 (PMLR, 2020).

  • Badia, A. P. et al. Agent57: outperforming the Atari human benchmark. In Proceedings of the 37th International Conference on Machine Learning 507–517 (PMLR, 2020).

  • Reinke, C. Time adaptive reinforcement learning. In ICLR 2020 Workshop: Beyond "Tabula Rasa" in Reinforcement Learning. Preprint at https://doi.org/10.48550/arXiv.2004.08600 (2020).

  • Gershman, S. J. & Uchida, N. Believing in dopamine. Nat. Rev. Neurosci. 20, 703–714 (2019).

  • Leone, F. C., Nelson, L. S. & Nottingham, R. B. The folded normal distribution. Technometrics 3, 543–550 (1961).

  • Lindsey, J. & Litwin-Kumar, A. Action-modulated midbrain dopamine activity arises from distributed control policies. Adv. Neural Inf. Process. Syst. 35, 5535–5548 (2022).

  • Masset, P. et al. Data and code for ‘Multi-timescale reinforcement learning in the brain’, V1. Mendeley Data https://doi.org/10.17632/tc43t3s7c5.1 (2025).
