Fast inference with T5

Summarization, translation, Q&A, text generation and more at blazing speed using a T5 version implemented in ONNX.

Author

Abel Riboulot

Published

Aug 3, 2020

Last Revised

Aug 3, 2020

Repo

https://github.com/abelriboulot/onnxt5

I remember the first presentation I gave about transformers. I cheekily took a few e-mails about macro from colleagues, split them in half, ran the first half as a prompt, and asked the audience to guess which version was the real e-mail.

The guesses were coin-flips. That taught me two things:

1. I should spend less time reading my e-mails about macro.
2. Transformers are going to revolutionize the way we operate.

The one issue with transformers is that they are fairly slow at inference. Even as the NLP community wraps its collective brain around GPT-3 writing dad jokes, one big caveat keeps showing up: GPT-3 is slow. In fact, most very large transformers are fairly slow. But this post piqued my interest: large performance gains can be had from better inference libraries. However, porting models to ONNX and running them with onnxruntime can be difficult. That's why I decided to make onnxt5.

What is onnxt5?

onnxt5 is a Python library that lets you import SOTA T5 models in one line and run inference with them very fast (original paper).

Advantages of this approach

Few lines to load a model, one line to use it

NLP models should be accessible to any developer, so I tried to make it as easy as possible to get started.

Loading a model

from onnxt5 import GenerativeT5
from onnxt5.api import get_encoder_decoder_tokenizer
decoder_sess, encoder_sess, tokenizer = get_encoder_decoder_tokenizer()
generative_t5 = GenerativeT5(encoder_sess, decoder_sess, tokenizer, onnx=True)

Translate a sentence

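# Example prompt; T5 selects the task from the text prefix
prompt = 'translate English to French: I was a victim of a series of accidents.'
output_text, output_logits = generative_t5(prompt, max_length=100, temperature=0.)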
# Output: "J'ai été victime d'une série d'accidents."
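
A note on the `temperature=0.` argument in the call above. In most text samplers, temperature divides the logits before the softmax, and zero is treated as a special case meaning greedy decoding (always take the highest-scoring token). The sketch below illustrates that convention in plain Python. The function `pick_next_token` and the toy logits are hypothetical, and this is an assumption about the general technique rather than a copy of onnxt5's internals.

```python
import math
import random


def pick_next_token(logits, temperature=1.0, rng=None):
    """Pick a token index from raw logits (illustrative helper, not part of onnxt5)."""
    if temperature == 0.0:
        # Greedy decoding: deterministic, always the highest-scoring token.
        return max(range(len(logits)), key=lambda i: logits[i])

    rng = rng or random.Random()
    # Scale logits by temperature, then apply a numerically stable softmax.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]
    # random.choices normalizes the weights internally.
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


toy_logits = [1.0, 3.5, 0.2]  # made-up scores for a three-token vocabulary
greedy_choice = pick_next_token(toy_logits, temperature=0.0)
sampled_choice = pick_next_token(toy_logits, temperature=1.0, rng=random.Random(0))
```

With a temperature of zero the same prompt should give the same output on every run, which is usually what you want for translation. Higher temperatures trade that determinism for variety.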

Summarize a paragraph

generative_t5("summarize: <PARAGRAPH>")

Up to 4x faster inference

Leveraging the fantastic work by the onnxruntime team, onnxt5 is able to achieve up to 4x faster inference.

Figure: Performance on embedding

Different pre-trained tasks

Google's approach in creating T5 was to train it on a wide variety of tasks. This means that, without any fine-tuning, you can use all the tasks listed in the paper's appendix, including Q&A, generation, summarization, and translation.
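
In practice, T5 picks the task from a short text prefix on the input, like the `summarize:` prompt shown earlier. Here is a minimal sketch of a prompt builder for a few of those tasks. The prefixes are the ones used by the original T5 checkpoints, but the `TASK_PREFIXES` table, its keys, and the `build_prompt` helper are hypothetical names for illustration and are not part of onnxt5.

```python
# Text prefixes used by the original T5 checkpoints to select a task.
# This is a small subset chosen for illustration.
TASK_PREFIXES = {
    "summarize": "summarize: ",
    "en_to_fr": "translate English to French: ",
    "en_to_de": "translate English to German: ",
    "acceptability": "cola sentence: ",
}


def build_prompt(task, text):
    """Prepend the task prefix to the input text (hypothetical helper)."""
    try:
        prefix = TASK_PREFIXES[task]
    except KeyError:
        known = ", ".join(sorted(TASK_PREFIXES))
        raise ValueError(f"unknown task {task!r}; expected one of: {known}") from None
    return prefix + text.strip()


prompt = build_prompt("en_to_fr", "I was a victim of a series of accidents.")
```

The resulting string can be passed to the model as is. Keeping the prefixes in one table makes it harder to mistype one, which matters because a wrong prefix produces poor output without raising any error.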

Use your own models easily

If the pretrained version of T5 does not fit your needs, you can easily export your own models to onnxt5. This gives you the freedom to train on more tasks and to serve fast inference from your carefully crafted models.

One necessary caveat to the performance claims: the gain appears to shrink with longer contexts. You can evaluate the gains for your own task with the benchmarking notebook.
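
The notebook in the repo is the reference for real measurements. If you just want a feel for how such a comparison is put together, here is a rough sketch that times two callables at several context lengths and reports the speedup for each. Everything in it is an assumption for illustration: `median_seconds`, `speedup_by_length`, and the two stand-in "models" are hypothetical, and you would swap in your actual baseline and ONNX inference calls.

```python
import statistics
import time


def median_seconds(fn, arg, repeats=5):
    """Median wall-clock time of fn(arg) over several runs."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(arg)
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


def speedup_by_length(baseline_fn, candidate_fn, lengths, repeats=5):
    """Return {context_length: baseline_time / candidate_time}."""
    results = {}
    for n in lengths:
        prompt = "word " * n  # synthetic context of n whitespace-separated words
        base = median_seconds(baseline_fn, prompt, repeats)
        cand = median_seconds(candidate_fn, prompt, repeats)
        # Guard against a zero reading on very coarse timers.
        results[n] = base / cand if cand > 0 else float("inf")
    return results


# Stand-ins so the sketch runs on its own; replace with real inference calls.
def slow_model(prompt):
    return sum(len(w) for w in prompt.split() for _ in range(20))


def fast_model(prompt):
    return sum(len(w) for w in prompt.split())


speedups = speedup_by_length(slow_model, fast_model, lengths=[50, 200, 800])
```

Using the median rather than the mean keeps a single slow run (a cold cache, for instance) from skewing the result. Reporting the ratio per context length makes any shrinking gain on longer inputs easy to spot.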

How do I get started?

You can get started by installing the library with pip.

pip install onnxt5

You can support the development and find examples on the repo.