Breakthrough in ASR: Apple Unveils Groundbreaking Denoising Language Model

AI has become a major part of day-to-day life, and Automatic Speech Recognition (ASR) is one of its most visible applications: it lets us operate technology with spoken commands instead of typing text. Many industries now rely on ASR to get work done quickly and efficiently, and the technology is a standard feature of phones, laptops, and many other apps and devices.

Taking ASR forward, Apple Inc. has recently introduced a new denoising language model intended to improve speech recognition accuracy. Apple was founded in 1976 and has since expanded across sectors such as telecommunications and consumer technology. It has also moved into Artificial Intelligence and continues to upgrade its products to improve the user’s experience. Automatic Speech Recognition is one such area. This article looks at Apple’s latest ASR technique, the Denoising Language Model. Before turning to it, let’s look at what ASR is.

Automatic Speech Recognition, or ASR, is the use of Machine Learning or Artificial Intelligence technology to convert human speech into readable text. ASR systems now appear in many everyday applications and gadgets, and the field has grown rapidly. Accuracy is what drives demand: as accuracy has risen, more and more applications have adopted ASR.

ASR technology dates back to 1952, when Bell Labs created Audrey. Initially Audrey could transcribe only spoken numbers; about a decade later the technology had advanced to transcribing spoken words such as “Hello”. Through the years, ASR has employed classical machine learning techniques such as hidden Markov models to make speech recognition faster and better. As these classical models matured, they opened the way to new approaches, including Deep Learning, which had already delivered proven results in other areas of AI. In 2014, in a paper titled “Deep Speech: Scaling up end-to-end speech recognition”, researchers at Baidu showed how Deep Learning could be applied to build state-of-the-art speech recognition models. The paper helped make Deep Learning a dominant approach in ASR.

Over the years, the accuracy of ASR has improved dramatically. A decade ago, users had to go through lengthy processes and pay for expensive speech recognition software licenses to gain access to ASR. Today, people and businesses have easy, cost-efficient access to ASR through various APIs.

ASR uses two main approaches: the traditional hybrid approach and the end-to-end Deep Learning approach. The traditional hybrid approach is the conventional one and has been an effective method in the field for over fifteen years. Many companies still use it today because it is a trusted method backed by decades of in-depth research and data, so it is well understood how to build a robust model with it, even though its accuracy is limited.

The traditional approach is built on HMMs (Hidden Markov Models) and GMMs (Gaussian Mixture Models), which require force-aligned data. Forced alignment takes the text transcription of an audio segment and determines where in time each word occurs in that segment. The traditional system combines a lexicon model, an acoustic model, and a language model to generate predictions about the transcription. The lexicon model describes the phonetic pronunciation of words, which requires a custom phoneme set for each language. The acoustic model captures the acoustic patterns of speech: trained on the force-aligned data, it predicts which sound or phoneme occurs in each speech segment. The language model captures the statistics of the language: it learns the order in which words are likely to be spoken and predicts which words are likely to come next and with what probability. All of these components work together to perform decoding. The traditional method has its disadvantages, however: its accuracy is lower, and force-aligning data requires a large amount of human labour, which makes these methods less accessible.
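
To make the language model's role concrete, here is a minimal sketch of a bigram model with add-one smoothing. It is only an illustration of "which word is likely to come next", not the model any production hybrid system uses; the function names and the three-sentence corpus are assumptions made for this example.

```python
from collections import Counter, defaultdict

def train_bigram_lm(sentences):
    """Count how often each word follows each context word."""
    context_counts = Counter()
    bigram_counts = defaultdict(Counter)
    vocab = set()
    for sentence in sentences:
        words = sentence.lower().split()
        vocab.update(words)
        tokens = ["<s>"] + words  # "<s>" marks the sentence start
        for prev, curr in zip(tokens, tokens[1:]):
            context_counts[prev] += 1
            bigram_counts[prev][curr] += 1
    return context_counts, bigram_counts, vocab

def next_word_prob(model, prev, word):
    """P(word | prev) with add-one (Laplace) smoothing."""
    context_counts, bigram_counts, vocab = model
    return (bigram_counts[prev][word] + 1) / (context_counts[prev] + len(vocab))

# Toy corpus (hypothetical voice commands).
corpus = ["turn on the light", "turn off the light", "turn on the radio"]
model = train_bigram_lm(corpus)
print(next_word_prob(model, "turn", "on"))   # (2 + 1) / (3 + 6)
print(next_word_prob(model, "turn", "off"))  # (1 + 1) / (3 + 6)
```

In a hybrid decoder, scores like these would be combined with acoustic model scores so that acoustically similar candidates are ranked by how plausible they are as language.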

The newer approach to ASR is end-to-end deep learning, in which a single system maps a sequence of input acoustic features directly to a sequence of words. The training data does not need to be force-aligned. The system can be trained to produce accurate output without a lexicon model or language model, although adding a language model can still improve accuracy. CTC, LAS, and RNN-T are the major end-to-end deep learning architectures. These systems deliver high accuracy without requiring force-aligned data.
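
As an illustration of why CTC does not need force-aligned data, the sketch below shows greedy CTC decoding: pick the most likely symbol per frame, collapse consecutive repeats, then drop the blank symbol. The blank character, the four-symbol alphabet, and the frame probabilities are assumptions made for this example; real systems work with much larger outputs and often use beam search instead.

```python
from itertools import groupby

BLANK = "_"  # assumed blank symbol for this sketch

def ctc_collapse(path):
    """Merge consecutive repeated symbols, then remove blanks."""
    merged = [symbol for symbol, _ in groupby(path)]
    return "".join(symbol for symbol in merged if symbol != BLANK)

def ctc_greedy_decode(frame_probs, alphabet):
    """Take the most probable symbol in each frame and collapse the path."""
    best_path = [
        alphabet[max(range(len(frame)), key=frame.__getitem__)]
        for frame in frame_probs
    ]
    return ctc_collapse(best_path)

alphabet = [BLANK, "c", "a", "t"]
frame_probs = [
    [0.1, 0.7, 0.1, 0.1],    # c
    [0.2, 0.6, 0.1, 0.1],    # c (repeat, merged)
    [0.8, 0.1, 0.05, 0.05],  # blank
    [0.1, 0.1, 0.7, 0.1],    # a
    [0.1, 0.1, 0.1, 0.7],    # t
]
print(ctc_greedy_decode(frame_probs, alphabet))  # cat
```

The blank is what lets a genuinely doubled letter survive: the path `hh_e_ll_lo` collapses to `hello` because the blank separates the two runs of `l`.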

Building on this line of ASR work, Apple has recently introduced its Denoising Language Model (DLM), an error correction model trained to detect and correct errors in ASR output. It is trained on synthetic data, surpasses prior error correction methods, and achieves state-of-the-art (SOTA) ASR performance. A text-to-speech (TTS) system generates audio, which is fed into an ASR system to produce noisy hypotheses; these are then paired with the original text to train the DLM. The approach has four key elements: an up-scaled model and data, multi-speaker TTS systems, a variety of noise augmentation strategies, and new decoding techniques. With a Transformer-CTC ASR, the DLM achieves a word error rate (WER) of 1.5%, reported as the best result so far in the setting where no external audio is used. A single DLM can be applied to different ASR systems and gives better results than traditional LM methods based on beam-search rescoring. These results suggest that well-designed error correction models have the potential to replace traditional LMs.
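
For readers unfamiliar with the metric, word error rate is the word-level edit distance (substitutions, deletions, and insertions) between the reference transcript and the ASR hypothesis, divided by the number of reference words. The sketch below is a plain-Python illustration with made-up sentences; it is not the evaluation code used in the research.

```python
def word_error_rate(reference, hypothesis):
    """WER = word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)

# One substitution ("sat" -> "sit") and one deletion ("the"): 2 errors / 6 words.
print(round(word_error_rate("the cat sat on the mat", "the cat sit on mat"), 3))  # 0.333
```

On this scale, a WER of 1.5% corresponds to roughly 15 word errors per 1,000 reference words.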

The DLM uses TTS systems to synthesize audio, which is fed into an ASR system. The ASR system produces hypotheses, which are paired with the original text to form the training dataset. This addresses the limited number of supervised training examples in conventional ASR datasets, and it allows the training data to be scaled up using larger text corpora.
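
The data flow described above (text to TTS audio, audio to ASR hypothesis, hypothesis paired with the original text) can be sketched as follows. Everything here is a stand-in: `fake_tts`, `fake_asr`, the confusion table, and the speaker IDs are hypothetical placeholders invented for this example, not Apple's systems or any real library's API. The point is only the shape of the resulting training pairs.

```python
import random

# Toy homophone confusions standing in for real ASR mistakes (assumption).
CONFUSIONS = {"their": "there", "two": "to", "write": "right"}

def fake_tts(text, speaker_id):
    """Placeholder TTS: returns a record instead of a real waveform."""
    return {"speaker": speaker_id, "content": text}

def fake_asr(audio, rng, error_rate=0.5):
    """Placeholder ASR: swaps confusable words with probability error_rate."""
    words = []
    for word in audio["content"].split():
        if word in CONFUSIONS and rng.random() < error_rate:
            words.append(CONFUSIONS[word])
        else:
            words.append(word)
    return " ".join(words)

def build_dlm_pairs(corpus, speakers, seed=0):
    """Create (noisy hypothesis, clean text) pairs for error correction training."""
    rng = random.Random(seed)
    pairs = []
    for text in corpus:
        for speaker in speakers:
            audio = fake_tts(text, speaker)
            hypothesis = fake_asr(audio, rng)
            pairs.append((hypothesis, text))
    return pairs

corpus = ["write their names", "two cats sat"]
for hypothesis, target in build_dlm_pairs(corpus, ["spk0", "spk1"]):
    print(f"{hypothesis!r} -> {target!r}")
```

Because the clean text is known in advance, any text corpus can be turned into supervised pairs this way, which is what allows the training set to grow far beyond the transcribed audio that exists.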

Some of the major contributions of DLM are:

  1. Key elements of DLM:
    Multiple zero-shot, multi-speaker TTS systems that generate audio in a variety of patterns and styles.
    Mixing of real and synthetic data in order to maintain grounding.
    A combination of noise augmentation strategies, such as frequency masking of spectral features and random substitution of ASR hypothesis characters.
  2. State-of-the-art ASR error correction: the DLM achieves a word error rate of 1.5% without using any external audio.
  3. Universal, Scalable and Efficient: a single DLM can be applied to different ASR systems, and ASR performance improves as the number of TTS speakers increases.
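
The two noise augmentation strategies named in the list can be illustrated with a small sketch. This is a simplified, pure-Python version under stated assumptions: the spectrogram is a list of frames with equal numbers of frequency bins, masked bins are set to zero, and substituted characters are drawn from lowercase ASCII. The function names and parameters are hypothetical and not taken from the research.

```python
import random
import string

def substitute_characters(text, rate, rng):
    """Randomly replace non-space characters of an ASR hypothesis."""
    chars = []
    for ch in text:
        if ch != " " and rng.random() < rate:
            chars.append(rng.choice(string.ascii_lowercase))
        else:
            chars.append(ch)
    return "".join(chars)

def frequency_mask(spectrogram, max_width, rng):
    """Zero out one random band of frequency bins in every frame."""
    num_bins = len(spectrogram[0])
    width = rng.randint(0, min(max_width, num_bins))
    start = rng.randint(0, num_bins - width)
    return [
        [0.0 if start <= b < start + width else value
         for b, value in enumerate(frame)]
        for frame in spectrogram
    ]

rng = random.Random(0)
print(substitute_characters("turn on the light", rate=0.1, rng=rng))

spectrogram = [[1.0] * 8 for _ in range(4)]  # 4 frames x 8 frequency bins
for frame in frequency_mask(spectrogram, max_width=3, rng=rng):
    print(frame)
```

Frequency masking perturbs the audio features before recognition, while character substitution perturbs the hypothesis text afterwards; both widen the range of errors the correction model sees during training.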

Experiments on the DLM show that increasing the model’s size reduces the WER, and that DLMs now outperform conventional LMs. The DLM’s strong performance further implies that TTS is not necessarily required to reach this level of accuracy.

In conclusion, the research underscores the DLM’s effectiveness in addressing ASR errors by leveraging synthetic data for training. The method not only improves accuracy but also shows scalability and adaptability across various ASR systems. It represents a significant step forward in speech recognition and points to more accurate and reliable ASR systems in the future. The researchers believe the success of the DLM highlights the need to reconsider how large text corpora can be used to enhance ASR accuracy. By prioritizing error correction over language modeling alone, the DLM sets a new benchmark for future research and development in the field.