Large language models use Triton for AI inference
Julien Salinas wears many hats. He’s an entrepreneur, software developer and, until recently, a volunteer firefighter in his mountain village an hour’s drive from Grenoble, a tech hub in southeastern France.
He runs a two-year-old startup, NLP Cloud, which is already profitable, employs a dozen people and serves customers around the globe. It’s one of many companies worldwide using NVIDIA software to deploy some of today’s most complex and powerful AI models.
NLP Cloud is an AI-powered software service for text data. A major European airline uses it to summarize internet news for its employees. A small healthcare company uses it to parse patient requests for prescription refills. An online app uses it to let kids talk to their favorite cartoon characters.
Large language models speak volumes
It’s all part of the magic of natural language processing (NLP), a popular form of AI that has spawned some of the largest neural networks on the planet, called large language models (LLMs). Trained with huge datasets on powerful systems, LLMs can handle all sorts of tasks, such as recognizing and generating text, with astonishing accuracy.
NLP Cloud uses about 25 LLMs today; the largest has 20 billion parameters, a key measure of a model’s sophistication. And now it is implementing BLOOM, an LLM with 176 billion parameters.
Running these massive models efficiently in production across multiple cloud services is hard work. That’s why Salinas turns to NVIDIA Triton Inference Server.
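For readers curious what talking to Triton looks like in practice, here is a minimal sketch of how a client might assemble an inference request for Triton's HTTP endpoint, which follows the KServe v2 inference protocol. The model name `my_llm`, the input tensor name `input_ids`, the token IDs and the server address are all hypothetical placeholders; nothing here reflects NLP Cloud's actual setup. The sketch only builds the URL and JSON body and does not contact a server.

```python
import json


def build_infer_request(base_url: str, model_name: str, token_ids: list[int]) -> tuple[str, str]:
    """Build the URL and JSON body for a KServe-v2-style inference request.

    The tensor name "input_ids" and the INT32 datatype are assumptions; a real
    model declares its own input names and types in its configuration.
    """
    url = f"{base_url}/v2/models/{model_name}/infer"
    payload = {
        "inputs": [
            {
                "name": "input_ids",           # hypothetical tensor name
                "shape": [1, len(token_ids)],  # one sequence in the batch
                "datatype": "INT32",
                "data": token_ids,
            }
        ]
    }
    return url, json.dumps(payload)


# Hypothetical server address, model name and token IDs.
url, body = build_infer_request("http://localhost:8000", "my_llm", [101, 2023, 2003, 102])
print(url)
print(body)
```

A real client would POST this body to the URL, or use the official `tritonclient` package, which wraps the same protocol.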
High throughput, low latency
“Very quickly, the main challenge we faced was server costs,” said Salinas, proud that his self-funded startup hasn’t taken on any outside backing to date.
“Triton has proven to be a great way to fully utilize the GPUs available to us,” he said.
For example, NVIDIA A100 Tensor Core GPUs can process up to 10 requests at once, twice the throughput of alternative software, thanks to FasterTransformer, a part of Triton that automates complex tasks like splitting models across multiple GPUs.
FasterTransformer also helps NLP Cloud distribute tasks that require more memory across multiple NVIDIA T4 GPUs while reducing task response time.
Customers who demand the fastest response times can process 50 tokens (text items like words or punctuation marks) in as little as half a second with Triton on an A100 GPU, about a third of the response time without Triton.
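As a rough back-of-the-envelope check, the figures above can be turned into tokens per second. The sketch below assumes "about a third of the response time" means exactly a 3x difference, which is an approximation rather than a measured value.

```python
# Figures quoted in the article for Triton on an A100 GPU.
TOKENS = 50
LATENCY_WITH_TRITON_S = 0.5

# Assumption: "about a third of the response time" is treated as exactly 3x.
SLOWDOWN_WITHOUT_TRITON = 3.0


def tokens_per_second(tokens: int, seconds: float) -> float:
    """Return the token rate implied by a token count and a response time."""
    return tokens / seconds


latency_without_triton_s = LATENCY_WITH_TRITON_S * SLOWDOWN_WITHOUT_TRITON
rate_with = tokens_per_second(TOKENS, LATENCY_WITH_TRITON_S)
rate_without = tokens_per_second(TOKENS, latency_without_triton_s)

print(f"With Triton:    {LATENCY_WITH_TRITON_S:.1f} s, {rate_with:.0f} tokens/s")
print(f"Without Triton: {latency_without_triton_s:.1f} s, {rate_without:.1f} tokens/s")
```

Under that assumption, the same 50-token response would take roughly 1.5 seconds without Triton.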
“It’s very cool,” said Salinas, who has reviewed dozens of software tools on his personal blog.
A tour of Triton users
Around the world, other startups and established giants are using Triton to get the most out of LLMs.
Microsoft’s translation service helped disaster workers understand Haitian Creole while responding to a 7.0 earthquake. It was one of many use cases for the service that got a 27x speedup using Triton to run inference on models with up to 5 billion parameters.
NLP provider Cohere was founded by one of the AI researchers who wrote the seminal paper defining transformer models. The company gets up to 4x speedups on inference using Triton on its custom LLMs, so that users of customer support chatbots, for example, get quick answers to their queries.
NLP Cloud and Cohere are among the many members of the NVIDIA Inception program, which nurtures cutting-edge startups. Several other Inception startups are also using Triton for AI inference on LLMs.
Tokyo-based Rinna has created chatbots used by millions of people in Japan, as well as tools for developers to create custom chatbots and AI-powered characters. Triton helped the company achieve sub-two-second inference latency on GPUs.
In Tel Aviv, Tabnine runs a service that automates up to 30% of the code written by a million developers worldwide. Its service runs multiple LLMs on A100 GPUs with Triton to handle more than 20 programming languages and 15 code editors.
Twitter uses the LLM service of Writer, based in San Francisco. It ensures that the social network’s employees write in a voice that adheres to the company’s style guide. The Writer service achieves 3x lower latency and up to 4x higher throughput using Triton compared to its previous software.
If you want to put a face to those words, Inception member Ex-human, just down the street from Writer, helps users create realistic avatars for games, chatbots and VR apps. With Triton, it delivers sub-second response times on a 6-billion-parameter LLM while reducing GPU memory consumption by a third.
A complete platform
Back in France, NLP Cloud now uses other elements of the NVIDIA AI platform.
For inference on models running on a single GPU, it is adopting NVIDIA TensorRT software to minimize latency. “We get super-fast performance with it, and the latency really drops,” Salinas said.
The company has also started training custom versions of LLMs to support more languages and improve efficiency. For this work, it is adopting NVIDIA NeMo Megatron, an end-to-end framework for training and deploying LLMs with trillions of parameters.
Salinas, 35, has the energy of a 20-something for coding and growing his business. He outlines plans to build private infrastructure to complement the four public cloud services the startup uses, as well as to expand into LLMs that handle speech and text-to-image to address applications such as semantic search.
“I’ve always loved coding, but being a good developer isn’t enough: you have to understand your customers’ needs,” said Salinas, who published code on GitHub nearly 200 times last year.
If you are passionate about software, discover the latest news on Triton in this technical blog.