BLOOM (language model) explained

BigScience Large Open-science Open-access Multilingual Language Model (BLOOM)[1] [2] is a 176-billion-parameter transformer-based autoregressive large language model (LLM). The model, as well as the code base and the data used to train it, are distributed under free licences.[3] BLOOM was trained on approximately 366 billion tokens (1.6 TB of text data) from March to July 2022.[4] [5]
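
Because the checkpoints are openly released, the model can be loaded with standard tooling. The following sketch is illustrative rather than taken from the cited sources: it assumes the smaller published variant "bigscience/bloom-560m" on the Hugging Face Hub and uses the transformers library to generate a short continuation.

    # Minimal sketch: load an openly released BLOOM checkpoint and generate text.
    # Assumption: the small "bigscience/bloom-560m" variant is used to keep the
    # example lightweight; the full 176B-parameter checkpoint is loaded the same
    # way but requires far more memory.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "bigscience/bloom-560m"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)

    inputs = tokenizer("BLOOM is a multilingual language model that", return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=20)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))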

BLOOM is the main outcome of the BigScience collaborative initiative,[6] a one-year-long research workshop that took place between May 2021 and May 2022. BigScience was led by Hugging Face and involved several hundred researchers and engineers from France and abroad, representing both academia and the private sector. BigScience was supported by a large-scale public compute grant on the French public supercomputer Jean Zay, managed by GENCI and IDRIS (CNRS), on which BLOOM was trained.

BLOOM's training corpus, named ROOTS, combines data extracted from the then-latest version of the web-based OSCAR corpus (38% of ROOTS) with newly collected data from a manually selected and documented list of language data sources. It encompasses 46 natural languages (in amounts ranging from 30% of the whole dataset for English to 0.00002% for Chi Tumbuka) and 13 programming languages.[7]
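
A single tokenizer is shared across this multilingual mix. As an illustrative sketch (not taken from the cited sources), the published tokenizer, assumed here to be the one shipped with the "bigscience/bloom-560m" checkpoint, can be applied to sentences in a few of the corpus languages to see how each is split into tokens.

    # Minimal sketch: apply the published BLOOM tokenizer to text in several languages.
    # Assumption: the tokenizer bundled with "bigscience/bloom-560m" (same vocabulary
    # as the full model) is used purely for illustration.
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m")

    samples = {
        "English": "The model was trained on 46 natural languages.",
        "French": "Le modèle a été entraîné sur 46 langues naturelles.",
        "Spanish": "El modelo fue entrenado en 46 lenguas naturales.",
    }
    for language, text in samples.items():
        print(language, len(tokenizer.tokenize(text)))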

Notes and References

  1. "BigScience Large Open-science Open-access Multilingual Language Model". Retrieved 2022-10-01.
  2. Le Scao T, Fan A, Akiki C, Pavlick E, Ilić S, Hesslow D, Castagné R, Luccioni A, Yvon F, Gallé M, Tow J, Rush AM, Biderman S, Webson A, Sasanka Ammanamanchi P, Wang T, Sagot B, Muennighoff N, Villanova del Moral A, Ruwase O, Bawden R, Bekman S, McMillan-Major A, Beltagy I, Nguyen H, Saulnier L, Tan S, Ortiz Suarez P, Sanh V, Laurençon H, Jernite Y, Launay J, Mitchell M, Raffel C, et al. (2022). "BLOOM: A 176B-Parameter Open-Access Multilingual Language Model". arXiv:2211.05100 [cs.CL].
  3. "The BigScience RAIL license". Retrieved 2024-01-10.
  4. Heikkilä, Melissa (2022-07-12). "BLOOM: Inside the radical new project to democratize AI". Retrieved 2023-12-26.
  5. "Release of largest trained open-science multilingual language model ever" (2022-07-12). Retrieved 2023-12-26.
  6. "BigScience". Retrieved 2024-01-10.
  7. Laurençon H, Saulnier L, Wang T, Akiki C, Villanova del Moral A, Le Scao T, Von Werra L, Mou C, González Ponferrada C, Nguyen H, Frohberg J, Šaško M, Lhoest Q, McMillan-Major A, Dupont G, Biderman S, Rogers A, Ben Allal L, De Toni F, Pistilli G, Nguyen O, Nikpoor S, Masoud M, Colombo P, de la Rosa J, Villegas P, Thrush T, Longpre S, Nagel S, Weber L, Muñoz M, Zhu J, Van Strien D, Alyafeai Z, Almubarak K, Vu MC, Gonzalez-Dios I, Soroa A, Lo K, Dey M, Ortiz Suarez P, Gokaslan A, Bose S, Adelani D, Phan L, Tran H, Yu I, Pai S, Chim J, Lepercq V, Ilic S, Mitchell M, Luccioni S, Jernite Y (2022). "The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset". arXiv:2303.03915 [cs.CL].