Deploying With a Phone Call: Speech Recognition in a Java Backend

November 11, 2011

We had a lot of fun playing with last week’s experiment. However, it quickly became obvious that clicking a button next to the service you want to deploy is much more convenient than talking to your web browser. Could we take the experiment further and build something that’s actually useful?

Well, we all knew about Twilio. Some kind of interaction with a phone thus seemed like the next logical step. Most of us have some kind of SSH client on our phone, so we are able to deploy or restart services on the go. However, it seemed like being able to send text messages to our Deployment Manager (or even better, call our Deployment Manager) could be convenient.

On Friday afternoon, a couple of hours after writing our last blog post, we were done implementing a simple version of text message-based deployment. The implementation is so trivial that I am not going to go into much detail. Let’s just say that text messages sent to the phone number we are renting from Twilio ultimately reach our Deployment Manager, and that our Deployment Manager checks the phone number of the sender against a white list, parses the message, and triggers the appropriate action.
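To give a rough idea of the check-then-parse step, here is a minimal sketch. It is not our actual handler: the class name, the two actions, and the "<action> <service>" message format are all assumptions made for illustration.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Locale;
import java.util.Optional;
import java.util.Set;

/** Hypothetical sketch of an SMS command parser; none of these names come from our code base. */
final class SmsCommandParser {
    enum Action { DEPLOY, RESTART }

    /** A parsed command: an action plus the service it targets. */
    static final class Command {
        final Action action;
        final String service;

        Command(Action action, String service) {
            this.action = action;
            this.service = service;
        }
    }

    private final Set<String> allowedSenders;

    SmsCommandParser(Set<String> allowedSenders) {
        this.allowedSenders = Collections.unmodifiableSet(new HashSet<>(allowedSenders));
    }

    /** Returns a command only if the sender is on the white list and the body is "<action> <service>". */
    Optional<Command> parse(String sender, String body) {
        if (sender == null || body == null || !allowedSenders.contains(sender)) {
            return Optional.empty();
        }
        // Normalize case and whitespace before splitting into tokens.
        String[] tokens = body.trim().toLowerCase(Locale.ROOT).split("\\s+");
        if (tokens.length != 2) {
            return Optional.empty();
        }
        Action action;
        switch (tokens[0]) {
            case "deploy":  action = Action.DEPLOY;  break;
            case "restart": action = Action.RESTART; break;
            default:        return Optional.empty();
        }
        return Optional.of(new Command(action, tokens[1]));
    }
}
```

Anything that does not come from a known number or does not parse cleanly yields an empty result, so the caller can simply ignore it.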

The only interesting thing to note is that we basically got the phone number white list of our engineering team for free, as each one of us has a representation in our code base. Indeed, in the true DevOps spirit, we strongly believe that everything (infrastructure, for example) should be treated as code (or even better, everything should be code). This lets us do things like @Alert(JULIEN_WETTERWALD), which tells our monitoring system whom to alert if a piece of code goes wrong. Also, it’s pretty nice to have the compiler help us transition smoothly when an engineer leaves the team: removing the engineer from the code base turns every remaining reference into a compile error.
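For readers who have not seen this pattern, here is a sketch of how engineers-as-code might look. Only the @Alert(JULIEN_WETTERWALD) usage comes from our code base; the Engineer enum, its phone number field, the placeholder numbers, and the helper methods are assumptions made for illustration.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical roster: one constant per engineer, carrying contact details. */
enum Engineer {
    JULIEN_WETTERWALD("+15555550100"),  // placeholder number
    SOMEONE_ELSE("+15555550101");       // placeholder engineer

    final String phoneNumber;

    Engineer(String phoneNumber) {
        this.phoneNumber = phoneNumber;
    }
}

/** Marks a class with the engineer to alert when it misbehaves. */
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@interface Alert {
    Engineer value();
}

// With a static import of the enum constants this reads @Alert(JULIEN_WETTERWALD).
@Alert(Engineer.JULIEN_WETTERWALD)
final class BillingService {}

final class Alerts {
    /** Looks up who owns a class, or returns null if it carries no @Alert. */
    static Engineer ownerOf(Class<?> type) {
        Alert alert = type.getAnnotation(Alert.class);
        return alert == null ? null : alert.value();
    }

    /** The phone number white list falls out of the roster for free. */
    static Set<String> phoneWhitelist() {
        Set<String> numbers = new HashSet<>();
        for (Engineer engineer : Engineer.values()) {
            numbers.add(engineer.phoneNumber);
        }
        return numbers;
    }
}
```

Deleting a constant from the enum breaks every @Alert that still points at it, which is exactly the compiler help mentioned above.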

Okay, so we got text messages to work. That’s pretty cool. But what about calling our Deployment Manager? Given Twilio’s API, it would be pretty straightforward to present a menu and let the caller press a number to deploy a service. However, voice recognition would be much cooler. Twilio does offer voice recognition, but we quickly started looking for alternatives because we wanted to restrict the speech recognizer to a very specific (and small) grammar. Indeed, some of our services have unusual or hard-to-pronounce names.

We ended up using Sphinx-4, which is a speech recognizer written in Java implementing the javax.speech.recognition interfaces.

The first step is to import all the libraries required by Sphinx-4. I ended up downloading sphinx4-1.0beta6-bin.zip and adding all the jars from the lib/ directory to our class path. You’ll have to execute jsapi.sh in order to create jsapi.jar.

The next step is to set up Sphinx-4 to analyze the audio files sent by Twilio. I used Sphinx’s Transcriber example, included in sphinx4-1.0beta6-src.zip, as a starting point. Note that I am concatenating all utterances, as I don’t really care about engineers trying to give two commands in a single recording.
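The concatenation itself is simple enough to sketch without Sphinx on the class path. The helper below is hypothetical: it pulls utterances from a supplier until it gets null. With Sphinx-4, I would expect the supplier to wrap recognizer.recognize() and return result.getBestResultNoFiller(), or null once the audio is exhausted, as the Transcriber example does, but treat that wiring as an assumption.

```java
import java.util.function.Supplier;

/** Hypothetical helper; the supplier stands in for the Sphinx-4 recognition loop. */
final class Utterances {
    /** Pulls utterances until the source returns null and joins them with single spaces. */
    static String concatenate(Supplier<String> nextUtterance) {
        StringBuilder text = new StringBuilder();
        String utterance;
        while ((utterance = nextUtterance.get()) != null) {
            String trimmed = utterance.trim();
            if (trimmed.isEmpty()) {
                continue;  // skip utterances that contained only silence or filler
            }
            if (text.length() > 0) {
                text.append(' ');
            }
            text.append(trimmed);
        }
        return text.toString();
    }
}
```

The joined string is what then gets parsed as a single command, regardless of how many utterances the recognizer split the recording into.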

Most of the complexity resides in the configuration file used by the ConfigurationManager. Again, the configuration from the Transcriber example is a good starting point. However, two important changes have to be made.

First, as explained in Sphinx’s FAQ, telephone audio has a limited range of frequencies. Instead of the WSJ_16k model used in most examples, the WSJ_8k model has to be used. Furthermore, the parameters of the mel filter bank have to be changed to match that narrower frequency range.

Second, endpointing has to be disabled by removing speechClassifier, speechMarker, and nonSpeechDataFilter from the pipeline. The audio files are short enough for endpointing to be unnecessary. Furthermore, the first word was often not recognized with a live frontend.

Now that Sphinx is properly configured to recognize Twilio’s audio files, the final step is to write a grammar accepting some commands.
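Sphinx-4 reads grammars in JSGF, a small text format. As an illustration, the sketch below renders a JSGF grammar that accepts an action followed by a service name. The grammar name, rule names, and word lists are made up, and every word would presumably also have to exist in the pronunciation dictionary of the acoustic model.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical generator for a tiny "<action> <service>" JSGF grammar. */
final class DeployGrammar {
    /** Renders a JSGF grammar accepting one action followed by one service. */
    static String render(List<String> actions, List<String> services) {
        return "#JSGF V1.0;\n"
             + "grammar deploy;\n"
             + "public <command> = <action> <service>;\n"
             + "<action> = " + String.join(" | ", actions) + ";\n"
             + "<service> = " + String.join(" | ", services) + ";\n";
    }

    public static void main(String[] args) {
        // Placeholder word lists; real service names would go here.
        System.out.print(render(Arrays.asList("deploy", "restart"),
                                Arrays.asList("frontend", "billing")));
    }
}
```

Generating the grammar from the list of services, rather than writing it by hand, keeps it in sync with the rest of the code base.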

And that’s it. We ended up having to tweak the wordInsertionProbability in the configuration file. I test-drove the tweaks by first writing dozens of unit tests based on audio files downloaded from Twilio. I then tweaked the configuration until they all passed.