Double descent explained

Double descent in statistics and machine learning is the phenomenon where a model with a small number of parameters and a model with an extremely large number of parameters both achieve small test error, but a model whose number of parameters is about the same as the number of data points used to train it has a large test error.[1] This phenomenon has been considered surprising, as it contradicts assumptions about overfitting in classical machine learning.[2]

History

Early observations of what would later be called double descent in specific models date back to 1989.[3] [4]

The term "double descent" was coined by Belkin et. al.[5] in 2019, when the phenomenon gained popularity as a broader concept exhibited by many models.[6] [7] The latter development was prompted by a perceived contradiction between the conventional wisdom that too many parameters in the model result in a significant overfitting error (an extrapolation of the bias–variance tradeoff),[8] and the empirical observations in the 2010s that some modern machine learning techniques tend to perform better with larger models.[9]

Theoretical models

Double descent occurs in linear regression with isotropic Gaussian covariates and isotropic Gaussian noise.[10]
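
A minimal numerical sketch of this setting is given below (illustrative only: the sample size, feature counts, and noise level are arbitrary choices, not taken from the cited paper). It fits minimum-norm least squares on isotropic Gaussian data with an increasing number of features and prints the test error, which typically peaks near the interpolation threshold where the number of features is close to the number of training samples.

    # Minimum-norm least squares on isotropic Gaussian covariates with Gaussian
    # noise: test error versus number of features for a fixed training set.
    import numpy as np

    rng = np.random.default_rng(0)
    n_train, n_test, d_max, noise = 40, 2000, 120, 0.5

    # Fixed "true" signal in the full d_max-dimensional feature space.
    beta = rng.normal(size=d_max) / np.sqrt(d_max)
    X_train = rng.normal(size=(n_train, d_max))
    X_test = rng.normal(size=(n_test, d_max))
    y_train = X_train @ beta + noise * rng.normal(size=n_train)
    y_test = X_test @ beta + noise * rng.normal(size=n_test)

    for d in range(5, d_max + 1, 5):
        # np.linalg.lstsq returns the minimum-norm solution when d > n_train.
        beta_hat, *_ = np.linalg.lstsq(X_train[:, :d], y_train, rcond=None)
        test_mse = np.mean((X_test[:, :d] @ beta_hat - y_test) ** 2)
        print(f"d = {d:3d}  test MSE = {test_mse:.3f}")

In runs of this kind the test error typically falls, rises sharply as d approaches n_train, and falls again once d is well above n_train, which is the double-descent shape.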

A model of double descent in the thermodynamic limit has been analyzed using the replica trick, and the result has been confirmed numerically.[11]

Empirical examples

The scaling behavior of double descent has been found to follow the functional form of a broken neural scaling law.[12]
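
For illustration, a smoothly broken power law of the general kind used in that work can be sketched as a function of a scale variable x (such as model size or dataset size); the parameter names below are generic, and the exact parameterization in the cited paper may differ.

    # Illustrative smoothly broken power law with a single break: a power-law
    # decay whose effective exponent changes smoothly around a break location d1.
    import numpy as np

    def smoothly_broken_power_law(x, a, b, c0, c1, d1, f1):
        """a: irreducible error floor; b, c0: scale and exponent of the initial
        power-law segment; c1: change in exponent after the break; d1: break
        location; f1: sharpness of the transition."""
        return a + b * x ** (-c0) * (1.0 + (x / d1) ** (1.0 / f1)) ** (-c1 * f1)

    x = np.logspace(0, 4, 9)  # e.g. model size or dataset size
    print(smoothly_broken_power_law(x, a=0.1, b=1.0, c0=0.3, c1=0.4, d1=100.0, f1=0.5))

With suitable parameter choices, and with more than one break, forms of this kind can also produce the non-monotonic rise and fall characteristic of double descent.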

Notes and References

  1. "Deep Double Descent". OpenAI. 2019-12-05. Retrieved 2022-08-12.
  2. Schaeffer, Rylan; Khona, Mikail; Robertson, Zachary; Boopathy, Akhilan; Pistunova, Kateryna; Rocks, Jason W.; Fiete, Ila Rani; Koyejo, Oluwasanmi (2023-03-24). "Double Descent Demystified: Identifying, Interpreting & Ablating the Sources of a Deep Learning Puzzle". arXiv:2303.14151v1 [cs.LG].
  3. Vallet, F.; Cailton, J.-G.; Refregier, Ph. (June 1989). "Linear and Nonlinear Extension of the Pseudo-Inverse Solution for Learning Boolean Functions". Europhysics Letters. 9 (4): 315. Bibcode:1989EL......9..315V. doi:10.1209/0295-5075/9/4/003. ISSN 0295-5075.
  4. Loog, Marco; Viering, Tom; Mey, Alexander; Krijthe, Jesse H.; Tax, David M. J. (2020-05-19). "A brief prehistory of double descent". Proceedings of the National Academy of Sciences. 117 (20): 10625–10626. arXiv:2004.04328. Bibcode:2020PNAS..11710625L. doi:10.1073/pnas.2001875117. PMC 7245109. PMID 32371495. ISSN 0027-8424.
  5. Belkin, Mikhail; Hsu, Daniel; Ma, Siyuan; Mandal, Soumik (2019-08-06). "Reconciling modern machine learning practice and the bias-variance trade-off". Proceedings of the National Academy of Sciences. 116 (32): 15849–15854. arXiv:1812.11118. doi:10.1073/pnas.1903070116. PMC 6689936. PMID 31341078. ISSN 0027-8424.
  6. Spigler, Stefano; Geiger, Mario; d'Ascoli, Stéphane; Sagun, Levent; Biroli, Giulio; Wyart, Matthieu (2019-11-22). "A jamming transition from under- to over-parametrization affects loss landscape and generalization". Journal of Physics A: Mathematical and Theoretical. 52 (47): 474001. arXiv:1810.09665. doi:10.1088/1751-8121/ab4c8b. ISSN 1751-8113.
  7. Viering, Tom; Loog, Marco (2023-06-01). "The Shape of Learning Curves: A Review". IEEE Transactions on Pattern Analysis and Machine Intelligence. 45 (6): 7799–7819. arXiv:2103.10948. doi:10.1109/TPAMI.2022.3220744. PMID 36350870. ISSN 0162-8828.
  8. Geman, Stuart; Bienenstock, Élie; Doursat, René (1992). "Neural networks and the bias/variance dilemma". Neural Computation. 4: 1–58. doi:10.1162/neco.1992.4.1.1. S2CID 14215320.
  9. Nakkiran, Preetum; Kaplun, Gal; Bansal, Yamini; Yang, Tristan; Barak, Boaz; Sutskever, Ilya (29 December 2021). "Deep double descent: where bigger models and more data hurt". Journal of Statistical Mechanics: Theory and Experiment. IOP Publishing Ltd and SISSA Medialab srl. 2021 (12): 124003. arXiv:1912.02292. Bibcode:2021JSMTE2021l4003N. doi:10.1088/1742-5468/ac3a74. S2CID 207808916.
  10. Nakkiran, Preetum (2019-12-16). "More Data Can Hurt for Linear Regression: Sample-wise Double Descent". arXiv:1912.07242v1 [stat.ML].
  11. Advani, Madhu S.; Saxe, Andrew M.; Sompolinsky, Haim (2020-12-01). "High-dimensional dynamics of generalization error in neural networks". Neural Networks. 132: 428–446. doi:10.1016/j.neunet.2020.08.022. PMC 7685244. PMID 33022471. ISSN 0893-6080.
  12. Caballero, Ethan; Gupta, Kshitij; Rish, Irina; Krueger, David (2022). "Broken Neural Scaling Laws". International Conference on Learning Representations (ICLR), 2023.