By Prithvi Ganesh K

“Tell my wife I’m running late.” “Remind me to call the vet.” “Any good burger joints around here?” Siri does what you say, finds the information you need, and then answers you. It’s merely a matter of conversing with your phone. Siri on the iPhone 4S, whose name some fans read as “Steve is right inside” in tribute to Steve Jobs, is one of the most admired applications of recent times among the tech savvy. It lets you use your voice to send messages, schedule meetings, place phone calls, and more. Siri isn’t like traditional voice recognition software that requires you to remember keywords and speak specific commands. It understands your natural speech, and it asks you questions if it needs more information to complete a task. It is a virtual personal assistant: it understands what you say, knows what you mean, and even talks back. For instance…

You: “Any good burger joints around here?”

Siri: “I found a number of burger restaurants near you.”

You: “Hmm. How about tacos?”

Siri remembers that you just asked about restaurants, so it will look for Mexican restaurants in the neighborhood. And Siri is proactive, so it will question you until it finds what you’re looking for.

Ask Siri to text your dad, remind you to call the dentist, or find directions, and it figures out which apps to use and who you’re talking about. It finds answers for you from the web through sources like Yelp and Wolfram Alpha. Using Location Services, it looks up where you live, where you work, and where you are. Then it gives you information and the best options based on your current location.


Siri uses the processing power of the dual-core A5 chip in iPhone 4S, and it uses 3G and Wi-Fi networks to communicate rapidly with Apple’s data centers. So it can quickly understand what you say and what you’re asking for, and then quickly return a response.

Wondering how a spoken command can end up as a map to your destination? The sounds of your speech are immediately encoded into a compact digital form that preserves their information. The signal from your phone is relayed wirelessly through a nearby cell tower and then over a series of land lines to your Internet Service Provider, where it is passed to a server in the cloud. That server is loaded with a series of models honed to comprehend language, and it sends its results back to your phone.

Simultaneously, your speech is evaluated locally, on your device. A recognizer installed on your phone communicates with the server in the cloud to gauge whether the command can best be handled locally, as when you ask it to play a song stored on your phone. If the local recognizer deems its own model sufficient to process your speech, it tells the server in the cloud, “Thank you very much, but we’re OK here.”
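The local-versus-cloud decision can be pictured with a small Python sketch. Everything in it is an assumption made for illustration: the set of on-device commands, the confidence threshold and the function name are invented here and do not describe how Siri is actually built.

```python
# Hypothetical vocabulary the on-device recognizer is assumed to handle alone.
LOCAL_COMMANDS = {"play", "pause", "call"}

def choose_handler(transcript, local_confidence, threshold=0.8):
    """Return "local" if the phone can handle the request, else "cloud".

    local_confidence is the local recognizer's own estimate (0.0 to 1.0)
    that it understood the speech; the 0.8 threshold is an assumed value.
    """
    words = transcript.lower().split()
    first_word = words[0] if words else ""
    if first_word in LOCAL_COMMANDS and local_confidence >= threshold:
        return "local"   # "Thank you very much, but we're OK here."
    return "cloud"       # let the server's larger models decide

handler = choose_handler("play Yesterday", 0.93)
```

A request such as "any good burger joints around here" falls outside the assumed local vocabulary, so the same function would hand it to the cloud.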


To convert speech to on-screen text or a computer command, a computer has to go through several complex steps. When you speak, you create vibrations in the air. The analog-to-digital converter (ADC) translates this analog wave into digital data that the computer can understand. To do this, it samples the sound by taking precise measurements of the wave at frequent intervals. The system filters the sampled sound to remove unwanted noise, and sometimes to separate it into different bands of frequency. It also normalizes the sound, or adjusts it to a constant volume level. It may also have to be temporally aligned. People don’t always speak at the same speed, so the sound must be adjusted to match the speed of the template sound samples already stored in the system’s memory.
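Sampling and normalization can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a pure 440 Hz tone stands in for speech, the 8 kHz sample rate is an arbitrary choice, and the function names are made up for this example.

```python
import math

def sample_wave(freq_hz, sample_rate_hz, duration_s, amplitude=0.3):
    """Measure a sine wave at regular intervals, as an ADC would."""
    n = round(sample_rate_hz * duration_s)
    return [amplitude * math.sin(2 * math.pi * freq_hz * i / sample_rate_hz)
            for i in range(n)]

def normalize(samples, target_peak=1.0):
    """Scale the samples so the loudest one reaches target_peak."""
    peak = max(abs(s) for s in samples)
    if peak == 0:
        return list(samples)  # pure silence: nothing to scale
    return [s * target_peak / peak for s in samples]

# 10 ms of an assumed 440 Hz tone sampled 8,000 times per second.
samples = sample_wave(440, 8000, 0.01)
normalized = normalize(samples)
```

After normalization, a quiet recording and a loud one of the same sound cover the same range of values, which makes them easier to compare against stored templates.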

Next the signal is divided into small segments as short as a few hundredths of a second, or even thousandths in the case of plosive consonant sounds — consonant stops produced by obstructing airflow in the vocal tract — like “p” or “t.” The program then matches these segments to known phonemes in the appropriate language. A phoneme is the smallest element of a language, a representation of the sounds we make and put together to form meaningful expressions. There are roughly 40 phonemes in the English language (different linguists have different opinions on the exact number).
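Cutting the signal into short segments might look like the Python sketch below. The 25 ms frame length and 10 ms step are assumed values chosen for illustration (the text only says "a few hundredths of a second"), and the function name is hypothetical.

```python
def split_into_frames(samples, sample_rate_hz, frame_ms=25, step_ms=10):
    """Cut a list of samples into short, overlapping segments."""
    frame_len = sample_rate_hz * frame_ms // 1000  # samples per frame
    step = sample_rate_hz * step_ms // 1000        # samples between frame starts
    frames = []
    for start in range(0, len(samples) - frame_len + 1, step):
        frames.append(samples[start:start + frame_len])
    return frames

signal = [0.0] * 8000                     # one second of silence at 8 kHz
frames = split_into_frames(signal, 8000)  # each frame covers 25 ms
```

Each of these frames is what the recognizer would then try to match against a known phoneme.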


The next step sounds simple, yet it is the most difficult to accomplish and is the focus of most speech recognition research. The program examines each phoneme in the context of the phonemes around it. It runs this contextual phoneme plot through a complex statistical model and compares it to a large library of known words, phrases and sentences. The program then determines what the user was probably saying and either outputs it as text or issues a computer command.

Accents, dialects and mannerisms can vastly change the way certain words or phrases are spoken, which is why recognizers rely on probabilities rather than exact matches. The Hidden Markov Model is the most common of the powerful and complicated statistical modeling systems used in today’s speech recognition systems. In this model, each phoneme is like a link in a chain, and the completed chain is a word. However, the chain branches off in different directions as the program attempts to match the digital sound with the phoneme that’s most likely to come next. During this process, the program assigns a probability score to each phoneme, based on its built-in dictionary and user training.
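To make the chain idea concrete, here is a toy Hidden Markov Model decoded with the Viterbi algorithm, a standard way to find the most probable chain. It is a sketch only: the three phonemes, the two crude sound labels ("stop" and "vowel") and every probability are invented for illustration and come from no real recognizer.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (probability, path) of the most likely state sequence."""
    # best[s] holds the most probable path that ends in state s so far.
    best = {s: (start_p[s] * emit_p[s][observations[0]], [s]) for s in states}
    for obs in observations[1:]:
        new_best = {}
        for s in states:
            prob, path = max(
                ((best[prev][0] * trans_p[prev][s] * emit_p[s][obs], best[prev][1])
                 for prev in states),
                key=lambda item: item[0],
            )
            new_best[s] = (prob, path + [s])
        best = new_best
    return max(best.values(), key=lambda item: item[0])

# Made-up model for the word "cat": phonemes k, ae, t.
states = ["k", "ae", "t"]
start_p = {"k": 0.6, "ae": 0.2, "t": 0.2}
trans_p = {
    "k": {"k": 0.1, "ae": 0.8, "t": 0.1},
    "ae": {"k": 0.1, "ae": 0.2, "t": 0.7},
    "t": {"k": 0.3, "ae": 0.3, "t": 0.4},
}
emit_p = {
    "k": {"stop": 0.9, "vowel": 0.1},
    "ae": {"stop": 0.1, "vowel": 0.9},
    "t": {"stop": 0.9, "vowel": 0.1},
}

probability, path = viterbi(["stop", "vowel", "stop"], states,
                            start_p, trans_p, emit_p)
# path is ["k", "ae", "t"]
```

Both "k" and "t" sound like a "stop" in this toy model, so it is the transition probabilities, the links in the chain, that settle which one comes first and which comes last.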

Siri’s speech recognition works along the same lines. The server compares your speech against a statistical model to estimate, based on the sounds you spoke and the order in which you spoke them, which letters might make it up. (At the same time, the local recognizer compares your speech to an abridged version of that statistical model.) In both cases, the highest-probability estimates get the go-ahead.

Based on these estimates, your speech, now understood as a series of vowels and consonants, is run through a language model, which estimates the words that make up your speech.
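One simple kind of language model scores a sentence by how likely each word is to follow the one before it. The Python sketch below assumes such a "bigram" model; the probabilities are invented for illustration, and nothing here claims to be the model Siri uses.

```python
import math

# Invented probabilities of one word following another ("<s>" marks the start).
BIGRAM_P = {
    ("<s>", "recognize"): 0.02,
    ("recognize", "speech"): 0.30,
    ("<s>", "wreck"): 0.01,
    ("wreck", "a"): 0.20,
    ("a", "nice"): 0.05,
    ("nice", "beach"): 0.02,
}
UNSEEN_P = 1e-6  # assumed small probability for word pairs not in the table

def log_score(words):
    """Sum the log-probabilities of each word given the previous word."""
    prev = "<s>"
    total = 0.0
    for word in words:
        total += math.log(BIGRAM_P.get((prev, word), UNSEEN_P))
        prev = word
    return total

# Two word sequences that sound almost alike when spoken quickly.
candidates = [["recognize", "speech"], ["wreck", "a", "nice", "beach"]]
best = max(candidates, key=log_score)
# best is ["recognize", "speech"]
```

Logarithms are used so that multiplying many small probabilities does not shrink the score toward zero.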

Given a sufficient level of confidence, the computer then creates a candidate list of interpretations for what the sequence of words in your speech might mean. If there is enough confidence in this result, and the computer determines that your intent is to send an SMS, it treats Raghu as your addressee (so his contact information is pulled from your phone’s contact list) and the rest as your actual message to him, which then appears on screen. If your speech is too ambiguous at any point during the process, the computer will defer to you, the user: did you mean Raghu Selvan, or Raghu Dixit?
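The last step, turning recognized words into an action or a follow-up question, can be sketched in Python as well. This is a deliberately naive illustration: the contact list, the phone numbers, the "text <name> <message>" pattern and the function name are all assumptions, and a real assistant would weigh many candidate interpretations rather than one fixed pattern.

```python
import re

# Hypothetical contact list with fictional numbers.
CONTACTS = {
    "Raghu Selvan": "555-0101",
    "Raghu Dixit": "555-0102",
    "Meera Nair": "555-0103",
}

def interpret_sms(transcript, contacts):
    """Turn 'text <name> <message>' into an action, or a question to the user."""
    match = re.match(r"(?:text|message)\s+(\w+)\s+(.*)", transcript, re.IGNORECASE)
    if not match:
        return {"action": "unknown"}
    name, body = match.group(1), match.group(2)
    # Every contact whose first or last name equals the spoken name.
    matches = [full for full in contacts if name.lower() in full.lower().split()]
    if len(matches) == 1:
        return {"action": "send_sms", "to": matches[0],
                "number": contacts[matches[0]], "body": body}
    if len(matches) > 1:
        return {"action": "ask_user",
                "prompt": "Did you mean " + " or ".join(matches) + "?"}
    return {"action": "ask_user",
            "prompt": f"I could not find {name} in your contacts."}

result = interpret_sms("Text Raghu running late", CONTACTS)
# result["prompt"] is "Did you mean Raghu Selvan or Raghu Dixit?"
```

With only one Meera in the assumed contact list, "Text Meera see you at six" would go straight to a send_sms action instead of a question.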