Over 75 million people speak Telugu, mostly in the southern regions of India, making it one of the most widely spoken languages in the country.
Despite this prevalence, Telugu is considered a low-resource language for voice AI: there aren't enough hours of transcribed speech data available to easily and accurately build AI models for automatic speech recognition (ASR) in Telugu.
And that means billions of people are locked out of using ASR to improve transcription, translation, and other voice AI applications in Telugu and other low-resource languages.
To create an ASR model for Telugu, the NVIDIA Speech AI team turned to the NVIDIA NeMo framework to develop and train state-of-the-art conversational AI models. The model won first place in a competition held in October by IIIT-Hyderabad, one of India’s most prestigious research and higher education institutes.
NVIDIA ranked first in accuracy for both parts of the Telugu ASR Challenge, held in conjunction with the Indian Language Technology Development Program and India's Ministry of Electronics and Information Technology as part of its national language translation mission.
For the closed track, participants were required to use approximately 2,000 hours of Telugu training data provided by the contest organizers. For the open track, participants could use any datasets and pretrained AI models to build the Telugu ASR model.
NVIDIA NeMo-powered models topped the charts with word error rates of around 13% and 12% for the closed and open tracks, respectively, far outperforming all models built on popular ASR frameworks like ESPnet, Kaldi, SpeechBrain and others.
“What sets NVIDIA NeMo apart is that we open-source every model we have – so people can easily fine-tune the models and apply transfer learning to them for their use cases,” said Nithin Koluguri, a senior researcher on the conversational AI team at NVIDIA. “NeMo is also one of the only toolkits that supports scaling training to multi-GPU systems and multi-node clusters.”
Building the Telugu ASR Model
The first step in creating the award-winning model, Koluguri said, was to preprocess the data.
Koluguri and his colleague Megh Makwana, Lead Architect of Applied Deep Learning Solutions at NVIDIA, removed invalid letters and punctuation marks from the voice dataset provided for the competition’s closed track.
“Our biggest challenge was dealing with noisy data,” Koluguri said. “That’s when the audio and the transcript don’t match – in which case you can’t guarantee the accuracy of the ground-truth transcript you’re training on.”
The team cleaned up the audio data by trimming clips to under 20 seconds, dropping clips shorter than 1 second, and removing sentences with a character rate – the number of characters spoken per second – greater than 30.
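The filtering steps described above can be sketched in a few lines of Python. This is a minimal illustration, not the team's actual pipeline: the manifest fields follow NeMo's JSON manifest convention (`audio_filepath`, `duration`, `text`), and the use of the Telugu Unicode block to define "valid" characters is an assumption.

```python
import re

# Assumption: valid characters are those in the Telugu Unicode block (U+0C00
# to U+0C7F) plus spaces; everything else (punctuation, Latin letters, digits)
# is treated as invalid and stripped.
INVALID_CHARS = re.compile(r"[^\u0C00-\u0C7F ]")

def clean_text(text):
    """Remove invalid letters and punctuation, then collapse whitespace."""
    return re.sub(r"\s+", " ", INVALID_CHARS.sub("", text)).strip()

def keep_sample(duration_s, text):
    """Apply the duration and character-rate filters described above."""
    if duration_s < 1.0 or duration_s > 20.0:
        return False
    char_rate = len(text) / duration_s  # characters spoken per second
    return char_rate <= 30

def filter_manifest(entries):
    """entries: dicts in NeMo manifest style (audio_filepath, duration, text)."""
    kept = []
    for entry in entries:
        text = clean_text(entry["text"])
        if text and keep_sample(entry["duration"], text):
            kept.append({**entry, "text": text})
    return kept
```

Running `filter_manifest` over the raw manifest would yield only the clips that pass all three filters, with their transcripts normalized.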
Makwana then used NeMo to train the 120-million-parameter ASR model for 160 epochs, or full cycles through the dataset.
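NeMo training runs of this kind are typically driven by a Hydra-style YAML configuration. The fragment below is purely illustrative – the file paths, batch settings, and structure are assumptions, not the team's actual configuration – but it shows where the details mentioned in the article (160 epochs, the duration filters, multi-GPU scaling) would live.

```yaml
# Illustrative NeMo/Hydra-style config fragment (values are assumptions)
trainer:
  accelerator: gpu
  devices: -1            # use all available GPUs; NeMo also scales to multi-node
  max_epochs: 160        # full passes through the ~2,000-hour training set
model:
  train_ds:
    manifest_filepath: telugu_train_manifest.json
    min_duration: 1.0    # matches the preprocessing filters: drop clips < 1 s
    max_duration: 20.0   # and clips longer than 20 s
  validation_ds:
    manifest_filepath: telugu_dev_manifest.json
```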
For the competition’s open track, the team started from models pretrained on 36,000 hours of data covering the 40 languages spoken in India. Fine-tuning this model for Telugu took about three days on an NVIDIA DGX system, according to Makwana.
The results of the inference tests were then shared with the contest organizers. NVIDIA won with word error rates about 2% lower than those of the second-place entrant. That’s a huge margin for voice AI, according to Koluguri.
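Word error rate, the metric the challenge was scored on, is the word-level edit distance between the model's hypothesis and the reference transcript, divided by the number of reference words. A minimal sketch of the standard computation (production toolkits such as NeMo ship their own implementations):

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count,
    computed with standard Levenshtein distance over word tokens."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between the first i ref words and first j hyp words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)
```

On this scale, a model at 12% WER versus a runner-up at 14% gets roughly one in seven errors right that the other misses, which is why a 2% absolute gap is considered large.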
“The impact of developing the ASR model is very high, especially for low-resource languages,” he added. “If a company comes forward and defines a baseline model, like we did for this competition, people can build on that with the NeMo toolkit to make transcription, translation, and other ASR applications more accessible for languages where voice AI is not yet widespread.”
NVIDIA Extends Voice AI to Low-Resource Languages
“ASR is gaining momentum in India mainly because it will enable digital platforms to integrate and engage with billions of citizens via voice assistant services,” Makwana said.
And the process of building the Telugu model, as shown above, is a technique that can be replicated for any language.
Of the approximately 7,000 languages spoken worldwide, 90% are considered low-resource for voice AI, representing 3 billion speakers. And that figure doesn’t include dialects, pidgins and accents.
Open-sourcing all of its models through the NeMo toolkit is one way NVIDIA is improving language inclusion in the field of voice AI.
Additionally, pre-trained models for voice AI, as part of the NVIDIA Riva SDK, are now available in 10 languages, with many more planned for the future.
And NVIDIA hosted its first Speech AI Summit this month, with speakers from Google, Meta, Mozilla Common Voice and more. Learn more about “Unlocking Speech AI technology for language users around the world” by watching the on-demand presentation.
Start building and training state-of-the-art conversational AI models with NVIDIA NeMo.