Multimedia Computing Project 1

Speech Detection

Due date: (About Monday, February 5th by 11:59pm, depending upon your group)



An important feature of multi-person audioconferencing is detecting speech in the presence of background noise. This problem, often called speech detection, boils down to detecting the beginning and ending of a section of speech, called a talk-spurt. Removing the silent periods between talk-spurts can greatly reduce network bandwidth and host processing. Moreover, end-host applications often use the times between talk-spurts to reset their delay buffers.

Rabiner and Sambur [RS75] describe an algorithm that locates the endpoints of a talk-spurt using the zero crossing rate and the energy of the signal. Their algorithm is relatively simple and accurate and has low CPU overhead.

You will implement the Rabiner and Sambur algorithm in an application that records speech from a microphone on a Linux system. Your program will store two versions of the sound, one with all the sound recorded and the other with the silence removed. In addition, your program will store a third file of data in text format, allowing you to plot the captured audio signal, energy computations and zero level crossings.


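You will write a program, called record, that records sound from the microphone until the user presses Ctrl-c and exits. Your program will produce 3 files as output: the complete recording, the recording with the silence removed, and a text file of data for plotting.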

(Be careful as you do your recording, as all three files can get large quickly!)
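One way to stop recording cleanly on Ctrl-c, so that all three files are flushed and closed before the program exits, is to catch SIGINT and set a flag that the main loop checks. The sketch below is only an illustration; the names stop_requested, on_sigint and install_sigint_handler are hypothetical and not part of the assignment.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

/* Set to 1 by the handler; the recording loop polls it (hypothetical name). */
volatile sig_atomic_t stop_requested = 0;

/* Signal handler: only sets the flag, which is safe inside a handler. */
void on_sigint(int signo)
{
    (void)signo;
    stop_requested = 1;
}

/* Install the SIGINT handler. Returns 0 on success, -1 on failure. */
int install_sigint_handler(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sigint;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;  /* no SA_RESTART, so a blocked read() fails with EINTR */
    return sigaction(SIGINT, &sa, NULL);
}
```

The main loop would then run while stop_requested is 0 and close its output files once the flag is set.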

You should configure the sound device to capture audio at typical voice quality rates: 8000 samples/second, with 8 bits per sample with one channel (mono, not stereo).

Read the Rabiner and Sambur paper carefully for details on the algorithm. You should follow it closely. Note, however, that they use a 10kHz sample frequency whereas you'll use an 8kHz sample frequency, so adjust your algorithm appropriately.
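As an example of the adjustment, a 10ms analysis interval holds 100 samples at 10kHz but only 80 samples at 8kHz. The sketch below computes the two per-frame measurements under some assumptions you should check against the paper and your sound device: energy is taken as the sum of sample magnitudes over the frame, and 8-bit samples are assumed to be unsigned with silence at 128. The names frame_energy and frame_zero_crossings are hypothetical.

```c
#include <stddef.h>
#include <stdlib.h>

#define SAMPLE_RATE   8000                              /* samples per second */
#define FRAME_MS      10                                /* analysis interval  */
#define FRAME_SAMPLES (SAMPLE_RATE * FRAME_MS / 1000)   /* 80 samples         */
#define U8_MIDPOINT   128   /* assumed zero level for unsigned 8-bit audio    */

/* Sum of magnitudes of the samples, measured from the assumed zero level. */
long frame_energy(const unsigned char *s, size_t n)
{
    long energy = 0;
    size_t i;

    for (i = 0; i < n; i++)
        energy += labs((long)s[i] - U8_MIDPOINT);
    return energy;
}

/* Number of sign changes between consecutive samples in the frame.
   A sample exactly at the zero level is treated as non-negative. */
int frame_zero_crossings(const unsigned char *s, size_t n)
{
    int count = 0;
    size_t i;

    for (i = 1; i < n; i++) {
        int prev = (int)s[i - 1] - U8_MIDPOINT;
        int cur = (int)s[i] - U8_MIDPOINT;
        if ((prev < 0) != (cur < 0))
            count++;
    }
    return count;
}
```

Each function would be called once per frame of FRAME_SAMPLES samples.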

You are free to add command line options or other features that you'd like as long as the base functionality as indicated in this writeup is present.


Audio in Linux is typically accessed through the device /dev/dsp. You can open this device for reading and writing as you would a normal file (e.g., open("/dev/dsp", O_RDWR)). Reading from the device (e.g., read()) will record sound, while writing to the device (e.g., write()) will play sound.

You use the ioctl() function to change the device settings (do a man ioctl for more information). The first parameter is the sound device file descriptor, the second is the parameter you want to control and the third is a pointer to a variable containing the value. For example, the code:

   fd = open("/dev/dsp", O_RDWR);
   arg = 8;
   ioctl(fd, SOUND_PCM_WRITE_BITS, &arg);
will set the sample size to 8 bits. The parameters you will be interested in are the sample size, the number of channels and the sampling rate.

Only one process at a time can have the audio device (/dev/dsp) open. If you get a "permission denied" error when you try to read or write to the audio device, it is likely because another process has it open.
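Putting the device settings together, a minimal sketch of the configuration step might look like the following. It assumes the OSS header linux/soundcard.h is available and uses its SOUND_PCM_WRITE_BITS, SOUND_PCM_WRITE_CHANNELS and SOUND_PCM_WRITE_RATE requests. The function name configure_dsp is hypothetical, and the error handling is one possible choice rather than a requirement.

```c
#include <sys/ioctl.h>
#include <linux/soundcard.h>

/* Configure an open sound device for 8-bit, mono, 8000 samples/second.
   Returns 0 on success, or -1 if an ioctl fails or the driver reports
   a different sample size or channel count than requested. */
int configure_dsp(int fd)
{
    int bits = 8;
    int channels = 1;
    int rate = 8000;

    if (ioctl(fd, SOUND_PCM_WRITE_BITS, &bits) == -1 || bits != 8)
        return -1;
    if (ioctl(fd, SOUND_PCM_WRITE_CHANNELS, &channels) == -1 || channels != 1)
        return -1;
    /* The driver may round the rate, so only the call itself is checked. */
    if (ioctl(fd, SOUND_PCM_WRITE_RATE, &rate) == -1)
        return -1;
    return 0;
}
```

Call it right after open() succeeds and before the first read().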

You can record a fixed chunk of audio data on which to detect silence. For example, you can record sound in 100ms intervals and then apply the speech detection algorithm to it.
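At 8000 samples/second with one byte per sample and one channel, a 100ms chunk is 800 bytes. Since read() may return fewer bytes than requested, a small helper that keeps reading until the chunk is full can be convenient. This is only a sketch; the name read_chunk and the choice to return a partial chunk when interrupted are assumptions, not requirements.

```c
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

#define CHUNK_MS    100
#define CHUNK_BYTES (8000 * CHUNK_MS / 1000)  /* 800 bytes: 8-bit mono */

/* Read up to `want` bytes from fd into buf, retrying on short reads.
   Returns the number of bytes read (less than `want` on end of file or
   if a signal interrupted the read), or -1 on any other error. */
ssize_t read_chunk(int fd, unsigned char *buf, size_t want)
{
    size_t got = 0;

    while (got < want) {
        ssize_t n = read(fd, buf + got, want - got);
        if (n > 0)
            got += (size_t)n;
        else if (n == 0)
            break;              /* end of file */
        else if (errno == EINTR)
            break;              /* interrupted, e.g. by Ctrl-c */
        else
            return -1;
    }
    return (ssize_t)got;
}
```

A caller would pass a buffer of CHUNK_BYTES bytes and then split it into the shorter intervals used for the energy and zero crossing computations.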

Variance is the mean of the squares of the differences between the respective samples and their mean:

    sigma2 = (1/n) * SUM[i=1..n] (xi - x_bar)^2

where: n is the number of samples; xi is the value of sample i; x_bar is the mean of the samples, and sigma2 is the variance. The square root of the variance, sigma, is the standard deviation. Remember, you'll need the -lm compile time option to link in the math library for using sqrt().
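The definition of variance given above translates directly into code. The sketch below uses hypothetical function names and takes the samples as doubles; passing the result of variance() to sqrt() gives the standard deviation.

```c
#include <stddef.h>

/* Arithmetic mean of n values; returns 0 when n is 0. */
double mean(const double *x, size_t n)
{
    double sum = 0.0;
    size_t i;

    if (n == 0)
        return 0.0;
    for (i = 0; i < n; i++)
        sum += x[i];
    return sum / (double)n;
}

/* Mean of the squared differences from the mean; returns 0 when n is 0. */
double variance(const double *x, size_t n)
{
    double x_bar = mean(x, n);
    double sum_sq = 0.0;
    size_t i;

    if (n == 0)
        return 0.0;
    for (i = 0; i < n; i++)
        sum_sq += (x[i] - x_bar) * (x[i] - x_bar);
    return sum_sq / (double)n;
}
```

For the samples 2, 4, 4, 4, 5, 5, 7, 9 the mean is 5 and the variance is 4, so the standard deviation is 2.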

You can use a plotting program to help analyze some of the data you have. I'd recommend gnuplot. It has a variety of output formats and a nice command line interface. You can check out for a good guide.
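gnuplot reads plain text files with whitespace-separated columns, so one simple layout for the text data file is one row per analysis interval. The column layout below (interval number, energy, zero crossings, speech flag) and the name write_frame_row are assumptions for illustration; the assignment does not fix a format.

```c
#include <stdio.h>

/* Append one row to the text data file: interval index, energy,
   zero crossing count, and 1 if the interval was classified as speech.
   Returns 0 on success, -1 if the write failed. */
int write_frame_row(FILE *out, long frame, long energy,
                    int crossings, int is_speech)
{
    if (fprintf(out, "%ld %ld %d %d\n",
                frame, energy, crossings, is_speech) < 0)
        return -1;
    return 0;
}
```

With a file written this way (say, a hypothetical data.txt), the gnuplot command plot "data.txt" using 1:2 with lines would graph energy against interval number.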

Use a makefile to make compiling easier. Here is a sample (note, the indented lines must be a tab and not spaces):

# Makefile for record

CC= gcc

all: record

record: record.c
	$(CC) -o record record.c -lm

clean:
	/bin/rm -rf core record

Here is some coarse pseudo-code for the application that may be helpful:

  open sound device
  set sound device parameters
  record silence
  set algorithm parameters
  while (1)
     record sound
     compute energy
     compute zero crossings
     search for beginning/end of speech
     write data to file
     write sound to file
     if speech, write speech to file
  end while

You might look at the slides for this project (ppt, pdf) and [RS75] (ppt, pdf).


When you are done with your project, provide brief answers to the following questions:

  1. What might happen to the speech detection algorithm in a situation where the background noise changes a lot over the audio session?
  2. In what cases might you want silence in a recorded audio stream?
  3. Accurately detecting the beginning of speech might be easier with a large sample size (i.e., capturing more of the audio before computing energy and zero crossings, etc.). Why might this be a bad idea for some audio applications?
  4. Do you think the algorithm is language specific? Why or why not?

Hand In

You must turn in all source code used in your project, including header files. Please include a Makefile, too, for building your code. Have a README file with any special instructions or platform requirements for running your application. In addition, be sure to include answers to the questions.

Please include a file named "group.txt" which contains the following:

    login_name1  last_name1, first_name1
    login_name2  last_name2, first_name2

Tar up (with gzip) your files, for example:

    mkdir proj1
    cp * proj1  # copy all your files to submit to proj1 directory
    tar -czf proj1.tgz proj1

then attach proj1.tgz to an email with "cs525_proj1" as the subject. Type elm -scs525_proj1 < proj1.tgz to send it, if that is easier.


[RS75] L. Rabiner and M. Sambur. An Algorithm for Determining the Endpoints of Isolated Utterances, The Bell System Technical Journal, Vol. 54, No. 2, Feb. 1975, pp. 297--315.


Send all project questions to Mark Claypool.