Monday, September 6, 2010

Pseudo-synchronous voice transcription in the Twilio API

I've been exploring more of the Twilio API recently. Since creating Ringerous and SMSMyBus, I've begun looking into the transcription services. But I immediately ran into a challenge. The transcription service was never designed to be synchronous. So although the recording of a call is immediately available, the transcription appears to be intended for offline processing.

You can get a lot of mileage out of this, but there are applications where it would be nice to have access to the transcription immediately, especially when accuracy is important. I was able to use the following pattern to achieve something close to synchronous voice transcription...


1. Inbound call handler


The main handler for the initial inbound phone call produces simple TwiML that prompts the caller and records what they say. Note that transcription is enabled and that I specify a separate transcribeCallback, which Twilio calls asynchronously.

<?xml version="1.0" encoding="UTF-8"?>
<Response>
    <Say>
        What is your favorite kind of pie?
    </Say>
    <Record
        action="/recording"
        transcribe="true"
        transcribeCallback="/transcribeHandler"
    />
</Response>



2. Recording handler


The recording handler processes the callback when the recording has completed. This is synchronous and thus occurs while the user is still on the original call. You can do whatever you'd like with the recording, but note that the transcription is not ready yet.

The most important thing to do during this step is to record the active CallSid so you can reference it in step three.

The caller now has to wait for the transcription to complete, so give them something to listen to in the meantime. One simple option is to play them some pretty music. :)

<?xml version="1.0" encoding="UTF-8"?>
<Response>
    <Say>
        Please wait while we work on your request.
    </Say>
    <Play loop="100">
        http://mydomain.com/coolmusic.mp3
    </Play>
</Response>
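
The bookkeeping in this step can be sketched in a few lines of Python. This is only an illustration: the in-memory dict `active_calls` and the function name `handle_recording` are made up for the example, and a real app would keep this in a datastore shared between request handlers. It assumes Twilio's request to the action URL carries the `CallSid` and `RecordingUrl` form fields.

```python
# Hypothetical in-memory store; a real handler would use a shared datastore.
active_calls = {}

HOLD_TWIML = """<?xml version="1.0" encoding="UTF-8"?>
<Response>
    <Say>Please wait while we work on your request.</Say>
    <Play loop="100">http://mydomain.com/coolmusic.mp3</Play>
</Response>"""


def handle_recording(params):
    """Remember the active call, then return the hold TwiML.

    `params` is the dict of form fields from Twilio's request to /recording.
    """
    call_sid = params["CallSid"]
    active_calls[call_sid] = {
        "recording_url": params.get("RecordingUrl"),
        "transcription": None,  # filled in later by the transcription callback
    }
    return HOLD_TWIML
```

When the transcription callback arrives later, the same dict entry is where the transcription text would be stored.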



3. Respond to the transcription


When Twilio's transcription engine completes, you'll get the callback you specified in the record verb in step one. Take a moment to store away the transcription text.

Now use the CallSid stored in step two, along with the REST interface, to interrupt the caller who is listening to that pretty music you're serving up. Specify yet another handler URL when you interrupt the call; Twilio fetches it and runs whatever new TwiML it returns.

POST https://api.twilio.com/2010-04-01/Accounts/{AccountSid}/Calls/{CallSid}
   Url=http://mydomain.com/interrupthandler
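
Here is a rough sketch of that request using only Python's standard library. The function name `build_redirect_request` is made up for the example. It assumes the 2010-04-01 API, where, as far as I know, the resource path is `Accounts/{AccountSid}/Calls/{CallSid}` and the redirect parameter is named `Url`; check the current REST docs before relying on either. The function only builds the request, so nothing is sent until you hand it to `urlopen`.

```python
import base64
import urllib.parse
import urllib.request

API_BASE = "https://api.twilio.com/2010-04-01"


def build_redirect_request(account_sid, auth_token, call_sid, handler_url):
    """Build (but do not send) the POST that redirects a live call."""
    url = "{}/Accounts/{}/Calls/{}".format(API_BASE, account_sid, call_sid)
    # Assumed parameter names for the 2010-04-01 API: Url and Method.
    body = urllib.parse.urlencode({"Url": handler_url, "Method": "POST"}).encode("ascii")
    request = urllib.request.Request(url, data=body, method="POST")
    # HTTP Basic auth: AccountSid as the user name, AuthToken as the password.
    credentials = "{}:{}".format(account_sid, auth_token).encode("ascii")
    token = base64.b64encode(credentials).decode("ascii")
    request.add_header("Authorization", "Basic " + token)
    return request
```

To actually interrupt the call, pass the result to `urllib.request.urlopen(request)` from inside the transcription callback handler.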



4. Interrupt handler


The interrupt via the REST interface gives you an opportunity to treat the caller to any workflow you'd like. This is a great chance to ask them if the transcription is correct and gather their input, either via voice or keypad, and react accordingly. In this case, I use the Say verb to read back the transcription I stored in step three.

<Response>
    <Gather
        action="/verificationHandler"
        numDigits="1"
    >
        <Say>
            Press one if you said, I love Fluffernutter Pie. Otherwise, press two.
        </Say>
    </Gather>
</Response>
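
Since the transcription text is only known at runtime, the interrupt handler has to build this TwiML dynamically. A minimal sketch in Python follows; the function name `confirmation_twiml` is hypothetical. Nesting the Say inside the Gather and setting `numDigits="1"` are choices made for this example so that a key press during the prompt is captured.

```python
from xml.sax.saxutils import escape


def confirmation_twiml(transcription):
    """Return TwiML that reads the transcription back and gathers one digit."""
    spoken = escape(transcription)  # escape &, <, > so the TwiML stays well-formed
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        "<Response>\n"
        '    <Gather action="/verificationHandler" numDigits="1">\n'
        "        <Say>Press one if you said, {}. Otherwise, press two.</Say>\n"
        "    </Gather>\n"
        "</Response>"
    ).format(spoken)
```

Escaping matters here because transcribed speech can contain characters such as `&` that would otherwise break the XML.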



That's it. Synchronous voice transcription. Now all you need to do is add your favorite API to the verificationHandler so you can hook it up to another app like Twitter, Posterous, Google Search, etc.
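
For the verificationHandler itself, a bare-bones sketch in Python might look like the following. It assumes Twilio posts the pressed key in the `Digits` form field; the function name and the spoken messages are invented for the example, and the comment marks where a call to another service would go.

```python
def handle_verification(params):
    """Branch on the key the caller pressed in the Gather step."""
    digits = params.get("Digits", "")
    if digits == "1":
        message = "Great, thanks for confirming."
        # Hypothetical hook: pass the confirmed transcription to another service here.
    else:
        message = "Sorry about that. Let's try again."
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        "<Response>\n"
        "    <Say>{}</Say>\n"
        "</Response>"
    ).format(message)
```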

Do you know of an easier way to do this? If so, please share...

4 comments:

  1. Hey Greg, You could also simulate being on hold using a one-person <conference> and specifying a waitUrl for the audio: http://www.twilio.com/docs/api/2010-04-01/twiml/conference - John

  2. Thanks, John. I like your solution too because it's a super simple way to take advantage of Twilio's stock hold music. i don't know of a way to do that without using the twimlet.

  3. Hey Greg, Thanks for the great post! I used this idea to create an app for the latest twilio contest. Basically i wanted a way for hearing impaired users to communicate with people over the phone without having to rely on text messages. So i took inspiration from your post and created an app that converts a telephone users voice into text and then displays that to the hearing impaired user, then i convert the users text into voice and play that to the telephone user. I couldnt have done this without your Pseudo-synchronous concept. Keep up the great posts, ill keep checking back for inspiration. - Kunal

  4. This is great to hear! Thanks for sharing the story and good luck in the contest. I love the idea.
