T5 (language model)

Text-to-Text Transfer Transformer (T5)
Author: Google AI
Latest release: T5X
Repository: https://github.com/google-research/text-to-text-transfer-transformer
License: Apache-2.0

T5 (Text-to-Text Transfer Transformer) is a series of large language models developed by Google AI. Introduced in 2019,[1] T5 models are trained on a massive dataset of text and code using a text-to-text framework. They can perform the text-based tasks they were pretrained for and can also be fine-tuned for other tasks. They have been employed in various applications, including chatbots, machine translation systems, text summarization tools, code generation, and robotics.

Like the original Transformer model,[2] T5 models are encoder-decoder Transformers, where the encoder processes the input text, and the decoder generates the output text.
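To make the encoder-decoder split concrete, a minimal sketch using the third-party Hugging Face transformers implementation (not the original TensorFlow/JAX codebase) can run the two halves separately; the "t5-small" checkpoint and the input sentence here are illustrative assumptions.

    import torch
    from transformers import T5TokenizerFast, T5ForConditionalGeneration

    # Illustrative checkpoint; any T5 checkpoint would do.
    tokenizer = T5TokenizerFast.from_pretrained("t5-small")
    model = T5ForConditionalGeneration.from_pretrained("t5-small")

    # The encoder reads the whole input text once.
    enc = tokenizer("translate English to German: That is good.", return_tensors="pt")
    encoder_outputs = model.encoder(input_ids=enc.input_ids, attention_mask=enc.attention_mask)

    # The decoder then generates the output autoregressively, attending to the encoder's
    # representations; here only the very first decoding step is taken.
    decoder_input_ids = torch.tensor([[model.config.decoder_start_token_id]])
    step = model(encoder_outputs=encoder_outputs,
                 attention_mask=enc.attention_mask,
                 decoder_input_ids=decoder_input_ids)
    first_id = int(step.logits[0, -1].argmax())  # most likely first output token
    print(tokenizer.decode([first_id]))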

In 2024, EleutherAI released Pile-T5, which trains the same architecture on an improved dataset (the Pile).[3]

Training

T5 models are pre-trained on the Colossal Clean Crawled Corpus (C4), containing text and code scraped from the internet. This pre-training process enables the models to learn general language understanding and generation abilities. T5 models can then be fine-tuned on specific downstream tasks, adapting their knowledge to perform well in various applications.
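As a rough illustration of this fine-tuning step, the sketch below uses the Hugging Face transformers and PyTorch libraries rather than the original JAX/T5X codebase; the checkpoint name, the "summarize:" prefix, and the toy training pair are assumptions made for the example.

    import torch
    from transformers import T5TokenizerFast, T5ForConditionalGeneration

    # Illustrative checkpoint and learning rate.
    tokenizer = T5TokenizerFast.from_pretrained("t5-small")
    model = T5ForConditionalGeneration.from_pretrained("t5-small")
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    # One toy (input, target) pair in the text-to-text format; a real fine-tuning run
    # would loop over a task-specific dataset in mini-batches.
    batch = tokenizer("summarize: T5 frames every task as mapping input text to output text.",
                      return_tensors="pt")
    labels = tokenizer("T5 maps input text to output text.", return_tensors="pt").input_ids

    model.train()
    loss = model(**batch, labels=labels).loss  # cross-entropy over the target tokens
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()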

The T5 models were pretrained on many tasks, all in the format of <input text> -> <output text>.

Some examples are:

  Restoring corrupted text: "Thank you <X> me to your party <Y> week." -> "<X> for inviting <Y> last <Z>", where <X> and <Y> are sentinel tokens marking dropped spans and <Z> marks the end of the output.
  Translation: "translate English to German: That is good." -> "Das ist gut."
  Judging the grammatical acceptability of a sentence (CoLA): "cola sentence: The course is jumping well." -> "not acceptable"
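As an illustration of how such prefixed tasks are run at inference time, a minimal sketch using the third-party Hugging Face transformers library might look as follows; the checkpoint name and the expected German output are assumptions for the example.

    from transformers import T5TokenizerFast, T5ForConditionalGeneration

    # Illustrative checkpoint.
    tokenizer = T5TokenizerFast.from_pretrained("t5-small")
    model = T5ForConditionalGeneration.from_pretrained("t5-small")

    # The task is selected purely by the text prefix; no task-specific head is added.
    input_ids = tokenizer("translate English to German: That is good.",
                          return_tensors="pt").input_ids
    output_ids = model.generate(input_ids, max_new_tokens=20)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))  # expected: "Das ist gut."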

Architecture

The T5 series encompasses several models with varying sizes and capabilities. These models are often distinguished by their parameter count, which indicates the complexity and potential capacity of the model. The original paper reported the following 5 models:

Model       Parameters  # layers  d_model  d_ff   d_kv  # heads
Small       60M         6         512      2048   64    8
Base        220M        12        768      3072   64    12
Large       770M        24        1024     4096   64    16
3B (XL)     3B          24        1024     16384  128   32
11B (XXL)   11B         24        1024     65536  128   128
In the above table, d_model is the dimension of the embedding vectors, d_ff is the dimension of the feedforward network within each encoder and decoder layer, and d_kv is the dimension of the key and value vectors used in the self-attention mechanism.
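These hyperparameters correspond directly to configuration fields in third-party reimplementations such as the Hugging Face T5Config; the sketch below, which reproduces the "Small" row of the table, is only an illustration and leaves the vocabulary size at the library default.

    from transformers import T5Config, T5ForConditionalGeneration

    # T5-Small hyperparameters from the table above (illustrative; vocabulary size left at default).
    config = T5Config(
        num_layers=6,   # layers per stack; the decoder mirrors the encoder depth by default
        d_model=512,    # embedding / hidden dimension
        d_ff=2048,      # feedforward dimension inside each layer
        d_kv=64,        # per-head key/value dimension
        num_heads=8,    # attention heads
    )
    model = T5ForConditionalGeneration(config)          # randomly initialised, untrained weights
    print(sum(p.numel() for p in model.parameters()))   # roughly 60 million parameters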

Variants

Several subsequent models used the T5 architecture, with non-standardized naming conventions used to differentiate them. This section attempts to collect the main ones. An exhaustive list of the variants released by Google Brain is on the GitHub repo for T5X.[4]

Some models are trained from scratch, while others start from a previously trained model. Unless otherwise noted, each model listed here is trained from scratch.

References

  1. Raffel, Colin; Shazeer, Noam; Roberts, Adam; Lee, Katherine; Narang, Sharan; Matena, Michael; Zhou, Yanqi; Li, Wei; Liu, Peter J. (2020). "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer". Journal of Machine Learning Research. 21 (140): 1–67. ISSN 1533-7928.
  2. Vaswani, Ashish; Shazeer, Noam; Parmar, Niki; Uszkoreit, Jakob; Jones, Llion; Gomez, Aidan N.; Kaiser, Łukasz; Polosukhin, Illia (2017). "Attention is All you Need". Advances in Neural Information Processing Systems. 30. Curran Associates, Inc.
  3. Sutawika, Lintang; Komatsuzaki, Aran; Raffel, Colin (2024-04-15). "Pile-T5". EleutherAI Blog. Retrieved 2024-05-05.
  4. "t5x/docs/models.md at main · google-research/t5x". GitHub. Retrieved 2024-08-05.
  5. "SwitchTransformers". huggingface.co. Retrieved 2024-08-05.
  6. "bigscience/T0 · Hugging Face" (2024-03-04). huggingface.co. Retrieved 2024-08-21.
  7. Chung, Hyung Won; Hou, Le; Longpre, Shayne; Zoph, Barret; Tay, Yi; Fedus, William; Li, Yunxuan; Wang, Xuezhi; Dehghani, Mostafa; Brahma, Siddhartha; Webson, Albert; Gu, Shixiang Shane; Dai, Zhuyun; Suzgun, Mirac; Chen, Xinyun (2024). "Scaling Instruction-Finetuned Language Models". Journal of Machine Learning Research. 25 (70): 1–53. ISSN 1533-7928.
  8. Longpre, Shayne; Hou, Le; Vu, Tu; Webson, Albert; Chung, Hyung Won; Tay, Yi; Zhou, Denny; Le, Quoc V.; Zoph, Barret; Wei, Jason; Roberts, Adam (2023-07-03). "The Flan Collection: Designing Data and Methods for Effective Instruction Tuning". Proceedings of the 40th International Conference on Machine Learning. PMLR. pp. 22631–22648.
  9. "google/flan-t5-xl · Hugging Face" (2024-01-04). huggingface.co. Retrieved 2024-08-05.
  10. Roberts, Adam; Chung, Hyung Won; Mishra, Gaurav; Levskaya, Anselm; Bradbury, James; Andor, Daniel; Narang, Sharan; Lester, Brian; Gaffney, Colin; Mohiuddin, Afroz; Hawthorne, Curtis; Lewkowycz, Aitor; Salcianu, Alex; van Zee, Marc; Austin, Jacob (2023). "Scaling Up Models and Data with t5x and seqio". Journal of Machine Learning Research. 24 (377): 1–8. ISSN 1533-7928.