Triton Inference Server speeds up Microsoft Translator
When your software can evoke tears of joy, you spread joy.
So Translator, a Microsoft Azure cognitive service, applies some of the world’s greatest AI models to help more people communicate.
“There are so many cool stories,” said Vishal Chowdhary, head of development for Translator.
Like the five-day sprint to add Haitian Creole to apps that helped aid workers after Haiti suffered a magnitude 7.0 earthquake in 2010. Or the grandparents who choked up in their first session using the software to talk live with distant grandchildren who spoke a language they didn’t understand.
An ambitious goal
“Our vision is to break down barriers across all languages and modalities with this same API that is already used by thousands of developers,” Chowdhary said.
With some 7,000 languages spoken around the world, that’s an ambitious goal.
So the team turned to a powerful and complex tool – a mixture of experts (MoE) AI approach.
It is a state-of-the-art member of the class of transformer models that is driving rapid advances in natural language processing. And with 5 billion parameters, it’s 80 times larger than the largest model the team has in production for natural language processing.
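For readers unfamiliar with the mixture of experts idea, the following is a minimal Python sketch of its core routing step: a small gate scores every expert for a given input, and only the top-scoring experts actually run. It is a toy illustration, not Microsoft's model. The names `moe_forward`, `toy_experts` and `toy_gate` are hypothetical, the experts are simple functions standing in for neural sub-networks, and the linear gate is an assumption made for brevity.

```python
import math
from typing import Callable, List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Turn raw gate scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def moe_forward(
    x: float,
    gate_weights: Sequence[float],
    experts: Sequence[Callable[[float], float]],
    top_k: int = 1,
) -> float:
    """Route x to the top_k experts and blend their outputs by gate weight."""
    if len(gate_weights) != len(experts):
        raise ValueError("need exactly one gate weight per expert")
    if not 1 <= top_k <= len(experts):
        raise ValueError("top_k must be between 1 and the number of experts")

    # Toy linear gate (an assumption): the score for expert i is gate_weights[i] * x.
    probs = softmax([w * x for w in gate_weights])

    # Keep only the top_k most probable experts; the others are never called.
    ranked = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)
    chosen = ranked[:top_k]

    # Renormalize the chosen gate probabilities so they sum to 1.
    norm = sum(probs[i] for i in chosen)
    return sum(probs[i] / norm * experts[i](x) for i in chosen)


# Two toy "experts" (hypothetical stand-ins for expert sub-networks).
toy_experts = [lambda v: 2.0 * v, lambda v: v + 10.0]
toy_gate = [1.0, -1.0]
```

With `top_k=1`, a positive input is handled entirely by the first toy expert and a negative input by the second. Because only the selected experts run for each input, the total parameter count can grow much faster than the work done per input.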
MoE models are so computationally intensive that it’s hard to find anyone who has put them into production. In an initial test, CPU-based servers couldn’t meet the team’s requirement to use them to translate a document in a second.
Next, the team ran the test on accelerated systems with NVIDIA Triton Inference Server, part of the NVIDIA AI Enterprise 2.0 platform announced this week at GTC.
“Using NVIDIA GPUs and Triton, we could do that, and do it efficiently,” Chowdhary said.
In fact, the team was able to achieve up to a 27x speedup over unoptimized GPU runtimes.
“We were able to build one model to perform different language understanding tasks – like summarization, text generation and translation – instead of having to develop separate models for each task,” said Hanny Hassan Awadalla, a principal researcher at Microsoft who supervised the tests.
How Triton helped
Microsoft’s models break down a big job like translating a stack of documents into many small tasks of translating hundreds of sentences. Triton’s dynamic batching feature pools these many requests to make the most of a GPU’s power.
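The batching idea can be sketched in a few lines of Python. This is a simplified illustration, not Triton's scheduler: the real feature also weighs how long requests have waited in the queue and which batch sizes are preferred, while this sketch ignores timing and simply groups pending requests by count. The function names are hypothetical, and `translate_batch` is a placeholder that uppercases text in place of a real model call on a GPU.

```python
from typing import List, Tuple


def dynamic_batches(requests: List[str], max_batch_size: int) -> List[List[str]]:
    """Group pending requests into batches of at most max_batch_size, in order."""
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be at least 1")
    return [
        requests[start:start + max_batch_size]
        for start in range(0, len(requests), max_batch_size)
    ]


def translate_batch(sentences: List[str]) -> List[str]:
    """Placeholder for one batched model call (here it only uppercases the text)."""
    return [s.upper() for s in sentences]


def translate_document(
    sentences: List[str], max_batch_size: int = 4
) -> Tuple[List[str], int]:
    """Translate every sentence and report how many batched calls were needed."""
    results: List[str] = []
    calls = 0
    for batch in dynamic_batches(sentences, max_batch_size):
        results.extend(translate_batch(batch))
        calls += 1
    return results, calls
```

For example, ten sentences with `max_batch_size=4` take three batched calls instead of ten single-sentence calls, which is the kind of pooling that keeps a GPU busy.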
The team praised Triton’s ability to run any model in any mode using CPUs, GPUs or other accelerators.
“It looks very well thought out with all the features I wanted for my scenario, like something I would have developed myself,” said Chowdhary, whose team has been developing large-scale distributed systems for more than a decade.
Under the hood, two software components have been key to Triton’s success. NVIDIA has extended FasterTransformer – a software layer that handles inference calculations – to support MoE models. CUTLASS, an NVIDIA math library, helped implement the models efficiently.
Prototype tested in four weeks
Although the testing was complex, the team worked with NVIDIA engineers to get an end-to-end prototype with Triton up and running in less than a month.
“That’s a really impressive timeframe to make a deliverable – I really appreciate that,” Awadalla said.
And even though this was the team’s first experience with Triton, “we used it to ship MoE models by restructuring our runtime environment without too much effort, and now hopefully it will be part of our long-term host system,” Chowdhary added.
Taking the next steps
The accelerated service will roll out in measured stages, initially for document translation in a few major languages.
“Ultimately, we want our customers to get the benefits of these new models seamlessly across all of our scenarios,” Chowdhary said.
The work is part of a larger Microsoft initiative. It aims to fuel the progress of a wide range of its products such as Office and Teams, as well as those of its developers and customers, from small single-app businesses to Fortune 500 companies.
Leading the way, Awadalla’s team published research in September on training MoE models with up to 200 billion parameters on NVIDIA A100 Tensor Core GPUs. Since then, the team has accelerated this work by another 8x using 80GB versions of A100 GPUs on models with over 300 billion parameters.
“The models will need to get bigger and bigger to better represent more languages, especially for those where we don’t have a lot of data,” Awadalla said.