A Framework for Speech Recognition

Speech :

With the user’s permission, perform speech recognition on live or prerecorded audio, and receive transcriptions, alternative interpretations, and confidence levels for the results.

Speech recognition

Speech recognition is the ability of devices to respond to spoken commands. It enables hands-free control of various devices and equipment (a particular boon to many disabled persons), provides input to automatic translation, and creates print-ready dictation. Among the earliest applications for speech recognition were automated telephone systems and medical dictation software. It is frequently used for dictation, for querying databases, and for giving commands to computer-based systems, especially in professions that rely on specialized vocabularies. It also enables personal assistants in vehicles and smartphones, such as Apple’s Siri.

Before any machine can interpret speech, a microphone must translate the vibrations of a person’s voice into a wavelike electrical signal. This signal in turn is converted by the system’s hardware—for instance, a computer’s sound card—into a digital signal. It is the digital signal that a speech recognition program analyzes in order to recognize separate phonemes, the basic building blocks of speech. The phonemes are then recombined into words. However, many words sound alike, and, in order to select the appropriate word, the program must rely on the context. Many programs establish context through trigram analysis, a method based on a database of frequent three-word clusters in which probabilities are assigned that any two words will be followed by a given third word. For example, if a speaker says “who am,” the next word will be recognized as the pronoun “I” rather than the similar-sounding but less likely “eye.” Nevertheless, human intervention is sometimes needed to correct errors.
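
The trigram idea described above can be illustrated with a toy model. This is only a minimal sketch under stated assumptions: the type TrigramModel, its method names, and the tiny training sentences are invented for illustration, and a real recognizer would use a far larger database and more refined probability estimates.

```swift
import Foundation

/// Toy trigram model: counts how often a word follows a two-word context.
/// (Hypothetical type, for illustration only.)
struct TrigramModel {
    // counts["word1 word2"][word3] = number of times the cluster was seen
    private var counts: [String: [String: Int]] = [:]

    /// Learns every three-word cluster in a space-separated sentence.
    mutating func train(on sentence: String) {
        let words = sentence.lowercased().split(separator: " ").map(String.init)
        guard words.count >= 3 else { return }
        for i in 0..<(words.count - 2) {
            let context = words[i] + " " + words[i + 1]
            counts[context, default: [:]][words[i + 2], default: 0] += 1
        }
    }

    /// Estimated probability that `next` follows the two words in `context`.
    func probability(of next: String, after context: String) -> Double {
        guard let followers = counts[context.lowercased()] else { return 0 }
        let total = followers.values.reduce(0, +)
        return Double(followers[next.lowercased()] ?? 0) / Double(total)
    }

    /// Picks the most probable word among similar-sounding candidates.
    func best(among candidates: [String], after context: String) -> String? {
        return candidates.max(by: {
            probability(of: $0, after: context) < probability(of: $1, after: context)
        })
    }
}

var model = TrigramModel()
model.train(on: "who am I to judge")
model.train(on: "who am I kidding")

// "eye" never follows "who am" in the training text, so "I" is preferred.
let choice = model.best(among: ["eye", "I"], after: "who am")
print(choice ?? "no candidate")
```

The context is stored as a single string key to keep the sketch short; a production model would also need smoothing so that unseen clusters do not get a probability of exactly zero.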

Programs for recognizing a few isolated words, such as telephone voice navigation systems, work for almost every user. On the other hand, continuous speech programs, such as dictation programs, must be trained to recognize an individual’s speech patterns; training involves the user reading aloud samples of text. Today, with the growing power of personal computers and mobile devices, the accuracy of speech recognition has improved markedly. Error rates have been reduced to about 5 percent in vocabularies containing tens of thousands of words. Even greater accuracy is reached in limited vocabularies for specialized applications such as dictation of radiological diagnoses.

Speech Recognition in iOS :

Speech recognition in iOS generally relies on Apple’s servers to function. As stated in the documentation: “In the case of speech recognition, … data is transmitted and temporarily stored on Apple’s servers to increase the accuracy of recognition.” Because the service is shared and consumes computation and storage, Apple may limit how much recognition a device or an app can perform.

Because speech is transmitted to Apple’s remote servers, security and privacy are a concern. For this reason, the user must agree to let your app recognize their speech and must be made aware that what they say during recognition may leave the device.

Getting Started with Speech Recognition :

The Speech APIs perform speech recognition by communicating with Apple’s servers or using an on-device speech recognizer, if available. To find out if a speech recognizer is available for a specific language, you adopt the SFSpeechRecognizerDelegate protocol.
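
As a rough sketch of how the delegate can be adopted, the class below records availability for one locale. The class name AvailabilityMonitor, its property, and the default "en-US" locale are assumptions made for illustration; only SFSpeechRecognizer, SFSpeechRecognizerDelegate, and the delegate method come from the Speech framework.

```swift
import Speech

/// Tracks whether the recognizer for one locale can currently be used.
/// (Hypothetical helper class, for illustration only.)
final class AvailabilityMonitor: NSObject, SFSpeechRecognizerDelegate {
    private(set) var isAvailable = false
    private let recognizer: SFSpeechRecognizer?

    init(localeIdentifier: String = "en-US") {
        // The initializer returns nil when the locale is not supported.
        recognizer = SFSpeechRecognizer(locale: Locale(identifier: localeIdentifier))
        super.init()
        recognizer?.delegate = self
        isAvailable = recognizer?.isAvailable ?? false
    }

    // Called by the framework whenever the recognizer's availability changes.
    func speechRecognizer(_ speechRecognizer: SFSpeechRecognizer,
                          availabilityDidChange available: Bool) {
        isAvailable = available
    }
}
```

An app could read isAvailable to enable or disable its microphone button; SFSpeechRecognizer.supportedLocales() lists the locales for which a recognizer can be created at all.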

Because your app may need to connect to the servers to perform recognition, it’s essential that you respect the privacy of your users and treat their utterances as sensitive data. For this reason, you must get the user’s explicit permission before you initiate speech recognition.

Note :

If the user grants permission, you don’t have to request it again.

To start using speech recognition in your app:

1. Write a sentence that tells users how they can use speech recognition in your app. For example, if your to-do list app changes an item’s status to finished when the user speaks “done,” you might write “Lets you mark an item as finished by saying Done.”
2. Add the NSSpeechRecognitionUsageDescription key to your Info.plist file and provide the sentence you wrote as the string value.
3. Use requestAuthorization(_:) to request the user’s permission by displaying the sentence you wrote in an alert. If the user denies permission (or if speech recognition is unavailable), handle it gracefully. For example, you might disable user interface items that indicate the availability of speech recognition.
4. After the user grants your app permission to perform speech recognition, create an SFSpeechRecognizer object and create a speech recognition request. Use the SFSpeechURLRecognitionRequest class to perform recognition on a prerecorded, on-disk audio file, and use the SFSpeechAudioBufferRecognitionRequest class to recognize live audio or in-memory content.
5. Pass the request to your SFSpeechRecognizer object to begin recognition. Speech is recognized incrementally, so your recognizer’s handler may be called more than once. (Check the value of the isFinal property to find out when recognition is finished.) If you’re working with live audio, you use SFSpeechAudioBufferRecognitionRequest and append audio buffers to the request during the recognition process.
6. When recording is finished, signal the recognizer that no more audio is expected, so that recognition can finish. Note that starting a new recognition task before the previous one finishes interrupts the in-progress task.
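
The permission steps above might be sketched as follows. The function names isSpeechAllowed and requestSpeechPermission are hypothetical helpers; requestAuthorization(_:) and the status cases are part of the Speech framework. The sketch assumes the NSSpeechRecognitionUsageDescription key is already present in Info.plist.

```swift
import Speech

/// Maps an authorization status to "may this app start recognition?".
/// (Hypothetical helper, for illustration only.)
func isSpeechAllowed(_ status: SFSpeechRecognizerAuthorizationStatus) -> Bool {
    switch status {
    case .authorized:
        return true
    case .denied, .restricted, .notDetermined:
        return false
    @unknown default:
        // Treat any status added in the future as "not allowed".
        return false
    }
}

/// Asks for permission and reports the outcome on the main queue.
/// Assumes NSSpeechRecognitionUsageDescription is set in Info.plist.
func requestSpeechPermission(completion: @escaping (Bool) -> Void) {
    SFSpeechRecognizer.requestAuthorization { status in
        // The handler may not run on the main thread, so switch to the
        // main queue before the caller touches any user interface.
        DispatchQueue.main.async {
            completion(isSpeechAllowed(status))
        }
    }
}
```

A caller could use the Boolean to enable or disable a record button, which is one way of handling a denial gracefully as the steps suggest.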

Creating a Speech Recognizer :

Here is a way to create a simple recognizer that defaults to the user’s current locale and starts recognition on a prerecorded audio file. The example gets a speech recognizer and makes a recognition request.

Eg :

    import Speech

    func recognizeFile(url: URL) {
        guard let recognizer = SFSpeechRecognizer() else {
            // A recognizer is not supported for the current locale
            return
        }
        guard recognizer.isAvailable else {
            // The recognizer is not available right now
            return
        }
        let request = SFSpeechURLRecognitionRequest(url: url)
        recognizer.recognitionTask(with: request) { result, error in
            guard let result = result else {
                // Recognition failed, so check error for details and handle it
                return
            }
            // The handler can be called several times; isFinal marks the last result
            if result.isFinal {
                print("Speech in the file is \(result.bestTranscription.formattedString)")
            }
        }
    }
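
For live audio, the buffer-based request mentioned in the steps earlier could look roughly like this. It is a sketch under assumptions, not a definitive implementation: the class LiveTranscriber and its onText callback are invented names, the code targets iOS (it uses AVAudioSession), and it presumes that both speech recognition and microphone access (NSMicrophoneUsageDescription) have already been authorized.

```swift
import AVFoundation
import Speech

/// Streams microphone audio into a buffer-based recognition request.
/// (Hypothetical class, for illustration only.)
final class LiveTranscriber {
    private let audioEngine = AVAudioEngine()
    private let recognizer = SFSpeechRecognizer()
    private var request: SFSpeechAudioBufferRecognitionRequest?
    private var task: SFSpeechRecognitionTask?

    var isRunning: Bool { return audioEngine.isRunning }

    /// Starts recording; onText receives the text so far and whether it is final.
    func start(onText: @escaping (String, Bool) -> Void) throws {
        // Do nothing if the locale is unsupported or the service is unavailable.
        guard let recognizer = recognizer, recognizer.isAvailable else { return }

        // Cancel any task that is still in progress before starting a new one.
        task?.cancel()
        task = nil

        let session = AVAudioSession.sharedInstance()
        try session.setCategory(.record, mode: .measurement, options: .duckOthers)
        try session.setActive(true, options: .notifyOthersOnDeactivation)

        let request = SFSpeechAudioBufferRecognitionRequest()
        request.shouldReportPartialResults = true  // show text while the user speaks
        self.request = request

        // Append every captured microphone buffer to the request.
        let input = audioEngine.inputNode
        let format = input.outputFormat(forBus: 0)
        input.installTap(onBus: 0, bufferSize: 1024, format: format) { buffer, _ in
            request.append(buffer)
        }

        audioEngine.prepare()
        try audioEngine.start()

        task = recognizer.recognitionTask(with: request) { [weak self] result, error in
            if let result = result {
                onText(result.bestTranscription.formattedString, result.isFinal)
            }
            if error != nil || result?.isFinal == true {
                self?.stop()
            }
        }
    }

    /// Stops recording and signals that no more audio is expected.
    func stop() {
        audioEngine.stop()
        audioEngine.inputNode.removeTap(onBus: 0)
        request?.endAudio()
        request = nil
        task = nil
    }
}
```

Reporting partial results lets the app display speech while it is being recognized; calling stop() corresponds to signaling the recognizer that recording is finished.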

Best Practices for a Great User Experience

Be prepared to handle the failures that can be caused by reaching speech recognition limits. Because speech recognition is a network-based service, limits are enforced so that the service can remain freely available to all apps. Individual devices may be limited in the number of recognitions that can be performed per day, and an individual app may be throttled globally, based on the number of requests it makes per day. For example, if a recognition request fails quickly (within a second or two of starting), the recognition service may be temporarily unavailable to your app and you may want to ask users to try again later.
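
One possible way to act on the "fails quickly" hint is a small timing check. The function names, the two-second threshold, and the message strings below are assumptions chosen for illustration; the framework does not provide such a helper.

```swift
import Foundation

/// Heuristic: a failure within a second or two of starting suggests that the
/// service is temporarily unavailable to the app. (Hypothetical helper.)
func looksLikeServiceLimit(startedAt: Date,
                           failedAt: Date,
                           threshold: TimeInterval = 2.0) -> Bool {
    let elapsed = failedAt.timeIntervalSince(startedAt)
    return elapsed >= 0 && elapsed <= threshold
}

/// Chooses a user-facing message for a failed recognition request.
func messageForFailure(startedAt: Date, failedAt: Date) -> String {
    if looksLikeServiceLimit(startedAt: startedAt, failedAt: failedAt) {
        return "Speech recognition is temporarily unavailable. Please try again later."
    }
    return "Your speech could not be recognized. Please try again."
}
```

The app would record a Date when it starts the recognition task and call messageForFailure from the handler's error branch.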

Plan for a one-minute limit on audio duration. Speech recognition can place a relatively high burden on battery life and network usage. In iOS 10, utterance audio duration is limited to about one minute, which is similar to the limit for keyboard-related dictation.
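
Planning for the limit can be as simple as tracking how long the app has been recording. The RecordingBudget type, the 60-second figure, and the five-second safety margin below are assumptions for illustration; the text only says the limit is about one minute.

```swift
import Foundation

/// Tracks how much of an assumed one-minute audio limit is still usable.
/// (Hypothetical type, for illustration only.)
struct RecordingBudget {
    let limit: TimeInterval         // assumed limit of about one minute
    let safetyMargin: TimeInterval  // assumed margin so the app stops first

    init(limit: TimeInterval = 60, safetyMargin: TimeInterval = 5) {
        self.limit = limit
        self.safetyMargin = safetyMargin
    }

    /// Seconds of recording left before the app should stop on its own.
    func remaining(afterRecording elapsed: TimeInterval) -> TimeInterval {
        return max(0, limit - safetyMargin - elapsed)
    }

    /// True once the app should end the audio and let recognition finish.
    func shouldStop(afterRecording elapsed: TimeInterval) -> Bool {
        return remaining(afterRecording: elapsed) <= 0
    }
}

let budget = RecordingBudget()
```

A timer in the app could check shouldStop periodically and end the recording before the service cuts it off, which also gives the interface a chance to show the remaining time.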

Remind the user when your app is recording. For example, you can play “now recording” sounds and display a visual indicator that helps users understand that they’re being actively recorded. You can also display speech as it is being recognized so that users understand what your app is doing and when recognition errors occur.

Do not perform speech recognition on private or sensitive information. Some speech is simply not appropriate for recognition. Avoid sending passwords, health or financial data, and other sensitive speech for recognition.

Author: Srinivasa Rao Polisetty – iOS Developer
Source: Wikipedia and developer.apple.com