Understanding Voice Recognition

Imagine sitting back on the sofa and simply telling your computer, laptop or cell phone to carry out simple tasks such as typing a letter or running a few commands. Is that possible?

Of course it is, and that is where voice recognition comes into the picture.

By definition, voice recognition is the process of recognizing human speech and decoding it into text.

Principle

The basic principle of voice recognition is that words spoken by a human being cause vibrations in the air, known as sound waves. These continuous (analog) waves are digitized, processed, and then decoded into the appropriate words and then the appropriate sentences.
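
As a minimal sketch of the digitizing step, the Python snippet below measures a continuous tone at fixed intervals and rounds each measurement to a 16 bit integer. The 440 Hz tone, the 16 kHz sampling rate and the function name `digitize` are illustrative assumptions, not part of any particular recognizer.

```python
import math

def digitize(signal, duration_s, sample_rate_hz, bits=16):
    """Sample a continuous signal (a function of time) and quantize it."""
    max_level = 2 ** (bits - 1) - 1             # 32767 for 16 bit samples
    n_samples = round(duration_s * sample_rate_hz)
    samples = []
    for n in range(n_samples):
        t = n / sample_rate_hz                  # time of the n-th measurement
        value = max(-1.0, min(1.0, signal(t)))  # clip to the converter's input range
        samples.append(round(value * max_level))
    return samples

def tone(t):
    """Stand-in for the analog microphone signal: a 440 Hz sine wave."""
    return math.sin(2 * math.pi * 440 * t)

samples = digitize(tone, duration_s=0.01, sample_rate_hz=16000)
print(len(samples), samples[:4])
```

A real system reads these integers from the sound hardware instead of computing them, but the result has the same shape: a list of discrete numbers in place of a continuous wave.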

Components of a Speech Recognition System

So what does a basic speech recognition system consist of?

A speech capturing device: It consists of a microphone, which converts the sound waves into electrical signals, and an analog to digital converter, which samples and digitizes the analog signals to obtain discrete data that the computer can understand.
A digital signal module or processor: It processes the raw speech signal, for example by converting it to the frequency domain and retaining only the required information.
Preprocessed signal storage: The preprocessed speech is stored in memory for the later recognition steps.
Reference speech patterns: The system holds predefined speech patterns or templates in memory, to be used as the reference for matching.
Pattern matching algorithm: The unknown speech signal is compared with the reference speech patterns to determine the actual words or sequence of words.

Working of the System

Now let us see how the whole system actually works.

Speech can be seen as an acoustic waveform, i.e. a signal carrying message information. Because the articulators (speech organs) can only move so fast, a normal human being produces speech at an average rate of about 10 sounds per second. The average information rate is about 50-60 bits/second, which means that only about 50 bits/second of actual information needs to be recovered from the speech signal. This acoustic waveform is converted to an analog electrical signal by the microphone. The analog to digital converter turns this analog signal into digital samples by taking precise measurements of the wave at discrete intervals.
The digitized signal is a stream of samples taken 16,000 times per second. In this form it is not suitable for the actual recognition process, because patterns cannot easily be located in it. To extract the actual information, the signal is converted from the time domain to the frequency domain. This is done by the digital signal processor using the Fast Fourier Transform (FFT). The digital signal is analyzed in slices of 1/100th of a second, and the frequency spectrum of each slice is computed. In other words, the digitized signal is segmented into small parts, each described by its frequency amplitudes.
Each segment, or frequency graph, represents one of the different sounds made by human beings. The computer matches the unknown segments against the stored phonemes of the particular language. This pattern matching is done in one of three ways:
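
The segmentation described above can be sketched in a few lines of Python. This is only an illustration: it uses a direct discrete Fourier transform for readability (an FFT computes the same spectrum much faster), a pure 1000 Hz tone stands in for real speech, and all function names are assumptions made for the example.

```python
import cmath
import math

SAMPLE_RATE_HZ = 16000
FRAME_LEN = SAMPLE_RATE_HZ // 100   # 160 samples = 1/100th of a second

def split_into_frames(samples, frame_len=FRAME_LEN):
    """Cut the sample stream into consecutive, non-overlapping frames."""
    usable = len(samples) - len(samples) % frame_len
    return [samples[i:i + frame_len] for i in range(0, usable, frame_len)]

def magnitude_spectrum(frame):
    """Direct DFT of one frame; returns amplitudes for bins 0..N/2."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        total = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(frame))
        spectrum.append(abs(total) / n)
    return spectrum

def dominant_frequency_hz(frame, sample_rate_hz=SAMPLE_RATE_HZ):
    """Frequency of the strongest bin in the frame's spectrum."""
    spectrum = magnitude_spectrum(frame)
    peak_bin = max(range(len(spectrum)), key=spectrum.__getitem__)
    return peak_bin * sample_rate_hz / len(frame)

# 30 ms of a 1000 Hz test tone stands in for real speech.
signal = [math.sin(2 * math.pi * 1000 * n / SAMPLE_RATE_HZ) for n in range(480)]
frames = split_into_frames(signal)
print(len(frames), [dominant_frequency_hz(f) for f in frames])
```

With 160 samples per frame at 16,000 samples per second, each spectrum bin is 100 Hz wide, which is why the test tone lands exactly on one bin. Real speech spreads its energy over many bins, and it is the shape of that spread that the matching stage compares.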

Using an acoustic phonetic approach: In the acoustic phonetic approach, the Hidden Markov Model is generally used. It builds a non-deterministic, probabilistic model of speech. The model has two kinds of variables: the hidden states, which are the phonemes stored in the computer memory, and the visible observations, which are the frequency segments of the digital signal. Each phoneme has its own probability, and each segment is matched to the most probable phoneme. The matched phonemes are then put together to form the correct words according to the stored grammar rules of the language.

Using a pattern recognition approach: In the pattern recognition approach, the system is trained with particular speech patterns for a language, and the unknown speech pattern is compared with each reference pattern by measuring the distance between the signals using a time warping technique.

Using artificial intelligence: The artificial intelligence approach draws on basic knowledge sources, such as knowledge of the sounds spoken (based on spectral measurements) and knowledge of which words are meaningful and syntactically valid.
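
Of the three approaches above, the time warping comparison used in the pattern recognition approach is the easiest to show in code. The sketch below is a basic dynamic time warping distance in Python. It assumes each segment has already been reduced to a single number (a hypothetical per-frame feature), whereas a real system would compare whole spectra. The templates and the names `dtw_distance` and `recognize` are made up for the example.

```python
def dtw_distance(unknown, reference):
    """Dynamic time warping distance between two feature sequences."""
    inf = float("inf")
    rows, cols = len(unknown), len(reference)
    # cost[i][j] = cheapest alignment of the first i and first j features
    cost = [[inf] * (cols + 1) for _ in range(rows + 1)]
    cost[0][0] = 0.0
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            step = abs(unknown[i - 1] - reference[j - 1])
            cost[i][j] = step + min(cost[i - 1][j],      # unknown is slower here
                                    cost[i][j - 1],      # reference is slower here
                                    cost[i - 1][j - 1])  # both advance together
    return cost[rows][cols]

def recognize(unknown, templates):
    """Pick the template word with the smallest warped distance."""
    return min(templates, key=lambda word: dtw_distance(unknown, templates[word]))

# Hypothetical reference patterns stored during training.
templates = {
    "yes": [1.0, 3.0, 4.0, 3.0, 1.0],
    "no": [4.0, 2.0, 1.0, 1.0, 2.0],
}
# The same "yes" shape, spoken more slowly.
slow_yes = [1.0, 1.0, 3.0, 3.0, 4.0, 4.0, 3.0, 1.0]
print(recognize(slow_yes, templates))
```

The point of the warping is that the slowly spoken word still matches its template, even though the two sequences have different lengths.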

Factors on Which a Speech Recognition System Depends

The performance of a speech recognition system depends on the following factors:

Isolated words: There needs to be a pause between consecutive words, because words spoken continuously can overlap, making it difficult for the system to tell where a word starts or ends.
Single speaker: Several speakers giving speech input at the same time cause overlapping signals and interruptions. Most of the speech recognition systems in use are speaker dependent systems.
Vocabulary size: A large vocabulary is harder to match against than a small one, because the chance of having ambiguous words is lower with a small vocabulary.
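
To see why the pause between isolated words matters, here is a minimal Python sketch of one simple way to find word boundaries: measure the energy of short frames and treat quiet frames as silence. The frame length, the threshold and the toy recording are assumptions chosen to keep the example small, not values from a real recognizer.

```python
def frame_energies(samples, frame_len):
    """Average energy of each consecutive frame of frame_len samples."""
    return [sum(x * x for x in samples[i:i + frame_len]) / frame_len
            for i in range(0, len(samples) - frame_len + 1, frame_len)]

def find_words(samples, frame_len=4, threshold=0.01):
    """Return (start_frame, end_frame) pairs for runs of loud frames."""
    words, start = [], None
    energies = frame_energies(samples, frame_len)
    for index, energy in enumerate(energies):
        if energy >= threshold and start is None:
            start = index                     # a word begins
        elif energy < threshold and start is not None:
            words.append((start, index - 1))  # silence ends the word
            start = None
    if start is not None:                     # recording ended mid-word
        words.append((start, len(energies) - 1))
    return words

silence = [0.0] * 8
word = [0.5, -0.5] * 4
# One word, a pause, then two words spoken with no pause between them.
recording = silence + word + silence + word + word + silence
print(find_words(recording))
```

The two words spoken back to back come out as a single long segment, which is exactly the problem that pausing between words avoids.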

Speech Recognition System on Windows 7

I would recommend the following steps to anyone setting up speech recognition on Windows 7:

Open Control Panel from the Start menu or by clicking on its icon.
Select Ease of Access and then click Speech Recognition.
Next, click Set up microphone and select Desktop Microphone from the available options.
Next, take the speech tutorial and follow the given instructions.
After that, train your computer so that it stores a definite pattern of your speech signal and gives better results. This is done by clicking the 'Train your computer to better understand you' option and then following the instructions.
Now start Speech Recognition from its icon and begin dictating to the computer. You can also add your own words to the computer's dictionary.

Practical Speech Recognition Systems: Using HM2007

A practical speech recognition system can be built using the speech recognition IC HM2007. The HM2007 is a 48 pin IC that provides speech recognition functions. It works in two modes: manual mode and CPU mode. In both modes, the IC is first trained: the user presses a number on the keypad and says the word to be associated with it. The IC stores each word signal in the memory location corresponding to that number. The data output from the IC is interfaced to a microcontroller, which displays the result on an LCD.

Normally we use manual mode for HM2007 operation.

The HM2007 has an RDY pin, an active low pin that indicates the IC is ready for training.
The voice input is given through a microphone connected to the MICIN pin of the IC.
The IC is interfaced with a keypad, which is used to enter the number corresponding to each word. The keypad also provides two function keys, Clear and Train. When the Train key is pressed, the IC begins its training process.
The user presses a number key, then the Train function key, and says the required word into the microphone.
The IC sends a high signal on its ME (Memory Enable) pin, which is connected to the corresponding enable pin of the SRAM. The 8 bit data corresponding to the number pressed is stored in the SRAM (external RAM) through the external bus.
After the voice input is detected, the RDY pin goes to logic high and the IC enters the recognition state, where it starts the recognition process.
The result of the process is placed on the data bus with the DEN (Data Enable) pin high.
The 8 bit data can then be passed to the microcontroller through a serial interface processor, or first latched using the latch IC 74HC573.
The microcontroller is interfaced with an LCD and is programmed so that the corresponding word is shown on the display.
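
The last step above, turning the latched 8 bit result into a word on the display, can be sketched as a simple lookup. The Python below only models the logic and is not firmware for any specific microcontroller. It assumes the result byte is packed BCD (two decimal digits) and that 55, 66 and 77 are status codes for "word too long", "word too short" and "no match"; both assumptions should be checked against the HM2007 datasheet. The vocabulary is hypothetical.

```python
# Hypothetical vocabulary: word number used during training -> word.
VOCABULARY = {1: "light", 2: "fan", 3: "stop"}

# Assumed status codes (check the HM2007 datasheet before relying on them).
STATUS_CODES = {0x55: "word too long", 0x66: "word too short", 0x77: "no match"}

def bcd_to_int(byte):
    """Decode a packed BCD byte (two decimal digits) into an integer."""
    tens, ones = byte >> 4, byte & 0x0F
    if tens > 9 or ones > 9:
        raise ValueError(f"not a BCD byte: {byte:#04x}")
    return tens * 10 + ones

def text_for_display(latched_byte):
    """Return the text the microcontroller would write to the LCD."""
    if latched_byte in STATUS_CODES:
        return "Error: " + STATUS_CODES[latched_byte]
    number = bcd_to_int(latched_byte)
    return VOCABULARY.get(number, f"Untrained word {number:02d}")

for byte in (0x02, 0x15, 0x77):
    print(f"{byte:#04x} -> {text_for_display(byte)}")
```

Keeping the vocabulary in a table means that retraining the IC with new words only requires changing the table, not the display logic.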

The only precautions are to avoid words that sound alike (homophones) and to keep excitement out of the voice, so that each word sounds the way it did during training.

So, this is how a basic speech recognition system works. Any further input is welcome.

Image Credit

Speech Recognition System by Gstatic
Speech Waveform Manipulation by Dadisp

Components of Speech Recognition System by An Introduction to Speech and Speaker Recognition – Richard D. Peacocke and Daryl H. Graf
