Double descent explained
Double descent in statistics and machine learning is the phenomenon where a model with a small number of parameters and a model with an extremely large number of parameters both achieve small test error, whereas a model whose number of parameters is about the same as the number of data points used to train it has large test error.[1] The phenomenon has been considered surprising, as it contradicts assumptions about overfitting in classical machine learning.[2]
History
Early observations of what would later be called double descent in specific models date back to 1989.[3][4]
The term "double descent" was coined by Belkin et al.[5] in 2019, when the phenomenon gained popularity as a broader concept exhibited by many models.[6][7] The latter development was prompted by a perceived contradiction between the conventional wisdom that too many parameters in a model result in a significant overfitting error (an extrapolation of the bias–variance tradeoff),[8] and the empirical observations in the 2010s that some modern machine learning techniques tend to perform better with larger models.[9]
Theoretical models
Double descent occurs in linear regression with isotropic Gaussian covariates and isotropic Gaussian noise.[10]
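The effect can be reproduced numerically. The following minimal sketch (the sample sizes, feature dimension, and noise level are arbitrary illustrative choices, not taken from the cited analysis) fits minimum-norm least squares to isotropic Gaussian data while varying the number of features used; the test error peaks near the interpolation threshold, where the number of features equals the number of training points, and falls again beyond it.

```python
import numpy as np

rng = np.random.default_rng(0)

n_train, n_test = 100, 1000   # illustrative sample sizes
d_total = 400                 # dimension of the full feature space
noise_std = 0.5               # standard deviation of the label noise

# Ground-truth linear model with isotropic Gaussian covariates and noise.
beta = rng.normal(size=d_total) / np.sqrt(d_total)
X_train = rng.normal(size=(n_train, d_total))
X_test = rng.normal(size=(n_test, d_total))
y_train = X_train @ beta + noise_std * rng.normal(size=n_train)
y_test = X_test @ beta + noise_std * rng.normal(size=n_test)

for d in (10, 50, 90, 100, 110, 200, 400):
    # Minimum-norm least-squares fit using only the first d features.
    beta_hat = np.linalg.pinv(X_train[:, :d]) @ y_train
    test_mse = np.mean((X_test[:, :d] @ beta_hat - y_test) ** 2)
    print(f"d = {d:3d}   test MSE = {test_mse:.3f}")
```

The printed test error typically rises sharply around d = 100 (equal to the number of training points) and decreases again for larger d, tracing out the double descent curve.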
A model of double descent in the thermodynamic limit has been analyzed using the replica trick, and the result has been confirmed numerically.[11]
Empirical examples
The scaling behavior of double descent has been found to follow the functional form of a broken neural scaling law.[12]
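For reference, the broken neural scaling law of Caballero et al. is a smoothly broken power law; a sketch of its functional form, with notation roughly following that paper (x is the quantity being scaled, y the evaluation metric, and the remaining symbols fitted constants), is

y = a + b x^{-c_0} \prod_{i=1}^{n} \left(1 + \left(\frac{x}{d_i}\right)^{1/f_i}\right)^{-c_i f_i},

where n is the number of smooth "breaks" (transitions between power-law segments) in the curve; in this parameterization, the non-monotonic peak of double descent can be modelled by a break at which the slope of the curve changes sign.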
Further reading
- Belkin, Mikhail; Hsu, Daniel; Xu, Ji (2020). "Two Models of Double Descent for Weak Features". SIAM Journal on Mathematics of Data Science 2 (4): 1167–1180. doi:10.1137/20M1336072. arXiv:1903.07571.
- Mount, John (3 April 2024). "The m = n Machine Learning Anomaly".
- Nakkiran, Preetum; Kaplun, Gal; Bansal, Yamini; Yang, Tristan; Barak, Boaz; Sutskever, Ilya (29 December 2021). "Deep double descent: where bigger models and more data hurt". Journal of Statistical Mechanics: Theory and Experiment 2021 (12): 124003. doi:10.1088/1742-5468/ac3a74. arXiv:1912.02292. Bibcode:2021JSMTE2021l4003N.
- Mei, Song; Montanari, Andrea (April 2022). "The Generalization Error of Random Features Regression: Precise Asymptotics and the Double Descent Curve". Communications on Pure and Applied Mathematics 75 (4): 667–766. doi:10.1002/cpa.22008. arXiv:1908.05355.
- Chang, Xiangyu; Li, Yingcong; Oymak, Samet; Thrampoulidis, Christos (2021). "Provable Benefits of Overparameterization in Model Compression: From Double Descent to Pruning Neural Networks". Proceedings of the AAAI Conference on Artificial Intelligence 35 (8). arXiv:2012.08749.
Notes and References
- "Deep Double Descent". OpenAI. 5 December 2019. Retrieved 12 August 2022.
- Schaeffer, Rylan; Khona, Mikail; Robertson, Zachary; Boopathy, Akhilan; Pistunova, Kateryna; Rocks, Jason W.; Fiete, Ila Rani; Koyejo, Oluwasanmi (24 March 2023). "Double Descent Demystified: Identifying, Interpreting & Ablating the Sources of a Deep Learning Puzzle". arXiv:2303.14151 [cs.LG].
- Vallet, F.; Cailton, J.-G.; Refregier, Ph. (June 1989). "Linear and Nonlinear Extension of the Pseudo-Inverse Solution for Learning Boolean Functions". Europhysics Letters 9 (4): 315. doi:10.1209/0295-5075/9/4/003. Bibcode:1989EL......9..315V.
- Loog, Marco; Viering, Tom; Mey, Alexander; Krijthe, Jesse H.; Tax, David M. J. (19 May 2020). "A brief prehistory of double descent". Proceedings of the National Academy of Sciences 117 (20): 10625–10626. doi:10.1073/pnas.2001875117. PMID 32371495. PMC 7245109. arXiv:2004.04328. Bibcode:2020PNAS..11710625L.
- Belkin, Mikhail; Hsu, Daniel; Ma, Siyuan; Mandal, Soumik (6 August 2019). "Reconciling modern machine learning practice and the bias-variance trade-off". Proceedings of the National Academy of Sciences 116 (32): 15849–15854. doi:10.1073/pnas.1903070116. PMID 31341078. PMC 6689936. arXiv:1812.11118.
- Spigler, Stefano; Geiger, Mario; d'Ascoli, Stéphane; Sagun, Levent; Biroli, Giulio; Wyart, Matthieu (22 November 2019). "A jamming transition from under- to over-parametrization affects loss landscape and generalization". Journal of Physics A: Mathematical and Theoretical 52 (47): 474001. doi:10.1088/1751-8121/ab4c8b. arXiv:1810.09665.
- Viering, Tom; Loog, Marco (June 2023). "The Shape of Learning Curves: A Review". IEEE Transactions on Pattern Analysis and Machine Intelligence 45 (6): 7799–7819. doi:10.1109/TPAMI.2022.3220744. PMID 36350870. arXiv:2103.10948.
- Geman, Stuart; Bienenstock, Élie; Doursat, René (1992). "Neural networks and the bias/variance dilemma". Neural Computation 4: 1–58. doi:10.1162/neco.1992.4.1.1.
- Nakkiran, Preetum; Kaplun, Gal; Bansal, Yamini; Yang, Tristan; Barak, Boaz; Sutskever, Ilya (29 December 2021). "Deep double descent: where bigger models and more data hurt". Journal of Statistical Mechanics: Theory and Experiment 2021 (12): 124003. doi:10.1088/1742-5468/ac3a74. arXiv:1912.02292. Bibcode:2021JSMTE2021l4003N.
- Nakkiran, Preetum (16 December 2019). "More Data Can Hurt for Linear Regression: Sample-wise Double Descent". arXiv:1912.07242 [stat.ML].
- Advani, Madhu S.; Saxe, Andrew M.; Sompolinsky, Haim (1 December 2020). "High-dimensional dynamics of generalization error in neural networks". Neural Networks 132: 428–446. doi:10.1016/j.neunet.2020.08.022. PMID 33022471. PMC 7685244.
- Caballero, Ethan; Gupta, Kshitij; Rish, Irina; Krueger, David (2022). "Broken Neural Scaling Laws". International Conference on Learning Representations (ICLR), 2023.