Multimedia Networking Project 1

Speech Detection

Due date: February 1st, by 11:59pm


Overview

An important feature of many Voice over IP (VoIP) applications is detecting speech in a two-way conversation. This problem, often called speech detection (or voice activity detection), comes down to detecting the beginning and ending of a section of speech, called a talk-spurt. Speech detection allows for hands-free telephony (e.g., no need to "push to talk") and by removing silent periods between talk-spurts, network bandwidth and host processing can be greatly reduced. Moreover, end-host applications often use the times between talk-spurts to reset their delay buffers.

Rabiner and Sambur [RS75] describe an algorithm that locates the endpoints of a talk-spurt using the energy and zero-crossing rate of the signal. While the paper is relatively old (published in 1975), energy and zero crossings are still the basis for many speech detection algorithms.

You will implement the Rabiner and Sambur algorithm in an application that records speech from a microphone. Your program will store two versions of the sound, one with all of the sound recorded and the other with the silence removed. In addition, your program will store additional files of data in text format, allowing you to graph the captured audio signal, energy values and zero-crossing counts.


Details

Record

You will write a program, called record, that records sound from the microphone until the user presses ctrl-c and exits. Your program will produce as output: a file containing all of the recorded sound (sound.raw), a file containing only the detected speech with the silence removed (speech.raw), and text-format data file(s) containing the captured signal, energy and zero-crossing values for graphing.

You should configure the sound device to capture audio at typical telephone voice quality rates: 8000 samples/second, 8 bits per sample with one channel (i.e., mono, not stereo).

Read the [RS75] paper carefully for details on the algorithm. You should follow it closely. Note, however, that the paper uses a 10 kHz sample frequency whereas you will use an 8 kHz sample frequency, so adjust your algorithm appropriately.
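
As a concrete illustration, below is a minimal sketch of computing the per-frame energy and zero-crossing count, assuming unsigned 8-bit samples centered at 128 and an 80-sample frame (10 ms at 8 kHz); the function and constant names are made up for illustration, and the energy here is the magnitude-based measure used in [RS75]:

   #include <stdlib.h>     /* for abs() */

   #define FRAME_SIZE 80   /* 10 ms of audio at 8000 samples/second */

   /* Sum of sample magnitudes in one frame; unsigned 8-bit samples
      are centered near 128, so remove that bias first. */
   int frame_energy(const unsigned char *frame)
   {
       int i, energy = 0;
       for (i = 0; i < FRAME_SIZE; i++)
           energy += abs((int)frame[i] - 128);
       return energy;
   }

   /* Number of zero crossings in one frame: count sign changes of
      the samples after removing the 128 bias. */
   int frame_zero_crossings(const unsigned char *frame)
   {
       int i, crossings = 0;
       for (i = 1; i < FRAME_SIZE; i++) {
           int prev = (int)frame[i - 1] - 128;
           int cur  = (int)frame[i]     - 128;
           if ((prev >= 0 && cur < 0) || (prev < 0 && cur >= 0))
               crossings++;
       }
       return crossings;
   }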

You are free to add command line options or other features as you like as long as the base functionality as indicated in this writeup is present.

Play

You will write a second program, called play, that plays sound from a file recorded by record out to the speakers until the file has finished. Play should work with both sound.raw and speech.raw.

You are free to add command line options or other features as you like as long as the base functionality is present.


Hints

Here is some coarse pseudo-code for the application that may be helpful:

  open sound device
  set up sound device parameters
  record silence
  set algorithm parameters
  while (1)
     record sound
     compute energy
     compute zero crossings
     search for beginning/end of speech
     write data to file
     write sound to file
     if speech, write speech to file
  end while

You can record a fixed-size chunk of audio data and run the detection on it. For example, you can record sound in 1-second intervals and then apply the speech detection algorithm to each interval.
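
For the "set algorithm parameters" step in the pseudo-code above, [RS75] derives its energy and zero-crossing thresholds from an initial interval of silence. The sketch below shows one possible shape for that step; the constants come from one reading of the paper and should be checked against it, and the names (set_thresholds, peak_energy, etc.) are made up for illustration:

   #include <math.h>   /* for sqrt() */

   /* energy[] and zc[] hold per-frame (10 ms) values measured over the
      initial silence interval (n_silence frames); peak_energy is the
      maximum frame energy seen over the recording interval. */
   void set_thresholds(const int *energy, const int *zc, int n_silence,
                       int peak_energy,
                       double *itl, double *itu, double *izct)
   {
       int i;
       double imn = energy[0], zc_mean = 0.0, zc_sd = 0.0, i1, i2;

       for (i = 0; i < n_silence; i++) {
           if (energy[i] < imn)
               imn = energy[i];
           zc_mean += zc[i];
       }
       zc_mean /= n_silence;
       for (i = 0; i < n_silence; i++)
           zc_sd += (zc[i] - zc_mean) * (zc[i] - zc_mean);
       zc_sd = sqrt(zc_sd / n_silence);

       /* lower (ITL) and upper (ITU) energy thresholds */
       i1 = 0.03 * (peak_energy - imn) + imn;
       i2 = 4.0 * imn;
       *itl = (i1 < i2) ? i1 : i2;
       *itu = 5.0 * (*itl);

       /* zero-crossing threshold: a fixed 25 crossings per 10 ms frame,
          or a statistical bound on the silence, whichever is smaller */
       *izct = (zc_mean + 2.0 * zc_sd < 25.0) ? zc_mean + 2.0 * zc_sd : 25.0;
   }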

You can use a graphing program to help analyze some of the data you have. If you are using Linux, I'd recommend gnuplot. It has a variety of output formats and a nice command line interface. You can check out http://www.gnuplot.info/ for a good guide. If you are using a Windows machine, you might use Microsoft Excel. Or, try OpenOffice on any platform.

You might look at the slides for this project (pptx) and the [RS75] paper (pptx, pdf).

Windows Implementation

You may do your implementation in Linux or Windows in a language of your choice. This sub-section has some Windows-specific, C++ hints.

To use the sound device, you first create a variable of type WAVEFORMATEX. WAVEFORMATEX is a structure with the fields necessary for setting up the sound parameters: the format tag (wFormatTag), the number of channels (nChannels), the sample rate (nSamplesPerSec), the bits per sample (wBitsPerSample), the bytes per sample frame (nBlockAlign) and the average bytes per second (nAvgBytesPerSec).

To read from the sound device (e.g., the microphone), you make the system call waveInOpen(), passing in a device handle (HWAVEIN), the device number (e.g., 1), the WAVEFORMATEX structure above and a callback function. The callback function gets invoked when the sound device has recorded a chunk of audio.
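
A minimal sketch of these two steps might look like the following (untested, and assuming the headers listed later in this sub-section); it uses WAVE_MAPPER, the default recording device, instead of an explicit device number, and waveInProc is a placeholder name for your callback:

   void CALLBACK waveInProc(HWAVEIN, UINT, DWORD_PTR, DWORD_PTR, DWORD_PTR);

   /* open the microphone for 8000 Hz, 8-bit, mono recording */
   HWAVEIN open_microphone(void)
   {
       WAVEFORMATEX wfx;
       HWAVEIN      hWaveIn;

       wfx.wFormatTag      = WAVE_FORMAT_PCM;   /* uncompressed PCM */
       wfx.nChannels       = 1;                 /* mono */
       wfx.nSamplesPerSec  = 8000;              /* 8000 samples/second */
       wfx.wBitsPerSample  = 8;                 /* 8 bits per sample */
       wfx.nBlockAlign     = wfx.nChannels * (wfx.wBitsPerSample / 8);
       wfx.nAvgBytesPerSec = wfx.nSamplesPerSec * wfx.nBlockAlign;
       wfx.cbSize          = 0;                 /* no extra format bytes */

       if (waveInOpen(&hWaveIn, WAVE_MAPPER, &wfx, (DWORD_PTR)waveInProc,
                      0, CALLBACK_FUNCTION) != MMSYSERR_NOERROR) {
           fprintf(stderr, "waveInOpen() failed\n");
           exit(1);
       }
       return hWaveIn;
   }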

As the sound device records information, the data needs to be put into a buffer large enough to store the sound chunk (say, 500 ms) you wish to record. You need to create more than one buffer so that the device can read ahead. The buffers are of type LPWAVEHDR. The LPWAVEHDR structure contains lpData, which will hold the raw data samples, and dwBufferLength, which needs to be set to nBlockAlign (above) times the number of samples in the sound chunk you wish to read from the device.

Once prepared (via waveInPrepareHeader()), the LPWAVEHDR buffers are given one at a time to the sound device via the waveInAddBuffer() system call, passing the device handle, the buffer (LPWAVEHDR) and the size of the WAVEHDR structure. When the callback function is invoked, the sound data itself will be in the lpData field of the buffer. That data can then be analyzed for speech and written to a file, to the audio device (writing to the audio device is very similar to reading from it), or handled however else you need. Another buffer then needs to be added to the audio device via waveInAddBuffer(), again.
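
One way this buffer handling might look is sketched below (again untested and simplified; CHUNK_BYTES, start_recording and waveInProc are illustrative names, and only a single buffer is shown even though, as noted above, you will want more than one):

   #define CHUNK_BYTES 4000   /* 500 ms at 8000 bytes/second (8-bit mono) */

   static WAVEHDR hdr;
   static char    data[CHUNK_BYTES];

   /* invoked with WIM_DATA each time the device fills a buffer */
   void CALLBACK waveInProc(HWAVEIN hwi, UINT uMsg, DWORD_PTR dwInstance,
                            DWORD_PTR dwParam1, DWORD_PTR dwParam2)
   {
       if (uMsg == WIM_DATA) {
           LPWAVEHDR full = (LPWAVEHDR)dwParam1;
           /* full->lpData holds full->dwBytesRecorded bytes of audio:
              analyze it for speech, write it out, then add another
              buffer with waveInAddBuffer() so recording continues */
       }
   }

   /* describe one buffer, hand it to the device and start recording */
   void start_recording(HWAVEIN hWaveIn)
   {
       hdr.lpData         = data;
       hdr.dwBufferLength = CHUNK_BYTES;
       hdr.dwFlags        = 0;
       waveInPrepareHeader(hWaveIn, &hdr, sizeof(WAVEHDR));
       waveInAddBuffer(hWaveIn, &hdr, sizeof(WAVEHDR));
       waveInStart(hWaveIn);
   }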

Here are some of the header files you will probably find useful:

   #include <windows.h>
   #include <stdio.h>
   #include <stdlib.h>
   #include <mmsystem.h>
   #include <winbase.h>
   #include <memory.h>
   #include <string.h>
   #include <signal.h>
   extern "C"

Here are some data types you will probably use:

   HWAVEOUT /* for writing to an audio device */
   HWAVEIN  /* for reading from the audio device */
   WAVEFORMATEX /* sound format structure */
   LPWAVEHDR /* buffer for reading/writing to device */
   MMRESULT /* result type returned from wave system calls */

You will need to link in the library winmm.lib in order to properly reference the multimedia system calls.
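
If you are building with Visual C++, one way to pull in that library is with a pragma in your source (the alternative is to add winmm.lib to the linker settings of your project):

   #pragma comment(lib, "winmm.lib")   /* link the Windows multimedia library */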

Linux Implementation

You may do your implementation in Linux or Windows in a language of your choice. This sub-section has some Linux-specific hints. Note: for a Linux-like implementation environment on Windows, you can also use Cygwin. If you do use Cygwin, you will want to use OSS.

Open Sound System (OSS)

The Open Sound System (OSS) was developed for Unix systems with an API based on standard POSIX calls (e.g., open(), read(), etc.). With OSS, audio in Linux (and Cygwin) can be accessed through the device /dev/dsp. You can open this device for reading and writing as you would a normal file (e.g., open("/dev/dsp", O_RDWR)). Reading from the device (e.g., via read()) will record sound, while writing to the device (e.g., via write()) will play sound.

You use the ioctl() function to change the device settings (do a man ioctl for more information). The first parameter is the sound device file descriptor, the second is the parameter you want to control and the third is a pointer to a variable containing the value. For example, the code:

   fd = open("/dev/dsp", O_RDWR);
   arg = 8;
   ioctl(fd, SOUND_PCM_WRITE_BITS, &arg);

sets the sample size to 8 bits. The parameters you will be interested in are the sample size (SOUND_PCM_WRITE_BITS, as above), the number of channels (SOUND_PCM_WRITE_CHANNELS) and the sample rate (SOUND_PCM_WRITE_RATE).

Needed includes for audio are:

  #include <stdio.h>
  #include <fcntl.h>
  #include <sys/ioctl.h>
  #include <sys/soundcard.h>
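
Putting the OSS calls above together, a minimal (untested) sketch of opening and configuring the device for this project's format is shown below; the SOUND_PCM_WRITE_* names are from the legacy OSS API and the function name open_audio is made up, so adjust as needed:

   #include <stdio.h>
   #include <stdlib.h>
   #include <fcntl.h>
   #include <sys/ioctl.h>
   #include <sys/soundcard.h>

   /* open /dev/dsp and configure it for 8000 Hz, 8-bit, mono audio */
   int open_audio(void)
   {
       int fd, arg;

       fd = open("/dev/dsp", O_RDWR);
       if (fd < 0) {
           perror("open /dev/dsp");
           exit(1);
       }

       arg = 8;                                   /* 8 bits per sample */
       ioctl(fd, SOUND_PCM_WRITE_BITS, &arg);

       arg = 1;                                   /* mono */
       ioctl(fd, SOUND_PCM_WRITE_CHANNELS, &arg);

       arg = 8000;                                /* 8000 samples/second */
       ioctl(fd, SOUND_PCM_WRITE_RATE, &arg);

       return fd;   /* read() from fd records, write() to fd plays */
   }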

Only one process at a time can have the audio device (/dev/dsp) open. If you get a "permission denied" error when you try to read or write to the audio device, it is likely because another process has it open.

OSS, unlike ALSA (next), runs on some systems other than Linux (e.g., Cygwin).

Advanced Linux Sound Architecture (ALSA)

As of the 2.6 kernel, Linux replaced OSS with ALSA as the default sound architecture, and OSS is considered deprecated. However, in order to be compatible with older programs written for OSS (e.g., those using /dev/dsp), ALSA has an OSS-emulation layer. Such programs can be run using aoss name_of_program_using_oss or by having the Linux kernel module snd_pcm_oss loaded.

ALSA can be thought of as the device-driver layer of the Linux sound system. The Linux kernel loads a module (with a prefix of snd_) specific to the PC's audio hardware and exposes it through the ALSA API.

The ALSA API provides snd_pcm_open(), where a typical call would look like one of:

    snd_pcm_t *handle;

    /* open playback device (e.g. speakers if default) */
    snd_pcm_open(&handle, "default", SND_PCM_STREAM_PLAYBACK, 0);

    /* open record device (e.g. microphone if default) */
    snd_pcm_open(&handle, "default", SND_PCM_STREAM_CAPTURE, 0);

The set of samples at a given instant, one per channel, is called a frame. If the stream is non-interleaved, each channel is stored in a separate buffer; if the stream is interleaved, the channels' samples are mixed together in a single buffer. A period contains multiple frames.

Setting parameters is done via snd_pcm_set_params(), as in:

snd_pcm_set_params(
  handle,
  SND_PCM_FORMAT_U8, /* unsigned, 8 bit */
  SND_PCM_ACCESS_RW_INTERLEAVED,
  1, /* channels */
  8000, /* sample rate */
  1, /* allow re-sampling*/
  500000 /* 0.5 sec */
);

Writing to or reading from the sound device is done via snd_pcm_writei() and snd_pcm_readi(), respectively:

  /* write to audio device */
  snd_pcm_writei(handle, buffer, frames);

  /* read from audio device */
  snd_pcm_readi(handle, buffer, frames);

where frames is the number of frames to write or read (the calls return the number of frames actually transferred).

When done, snd_pcm_close() closes the audio device.
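
A rough capture skeleton combining the calls above might look like the following (a sketch only, with error checking omitted; the 0.5-second buffer size is just an example):

   #include <stdio.h>
   #include <alsa/asoundlib.h>

   int main(void)
   {
       snd_pcm_t *handle;
       unsigned char buffer[4000];     /* 0.5 sec: 4000 frames of 8-bit mono */
       snd_pcm_sframes_t got;

       /* open the default capture device (e.g., the microphone) */
       snd_pcm_open(&handle, "default", SND_PCM_STREAM_CAPTURE, 0);

       /* unsigned 8-bit, interleaved, mono, 8000 Hz, allow re-sampling,
          0.5 second latency */
       snd_pcm_set_params(handle, SND_PCM_FORMAT_U8,
                          SND_PCM_ACCESS_RW_INTERLEAVED, 1, 8000, 1, 500000);

       /* read one chunk; 'got' is the number of frames actually captured */
       got = snd_pcm_readi(handle, buffer, sizeof(buffer));
       if (got > 0)
           fwrite(buffer, 1, (size_t)got, stdout);   /* e.g., raw samples out */

       snd_pcm_close(handle);
       return 0;
   }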

The only needed audio include is:

  #include <alsa/asoundlib.h>

When compiling, -lasound is needed to link in the libasound library.

An excellent article with sample code for ALSA is provided in [Tra04].

General

Use a Makefile to make compiling easier. Here is a basic sample (note: the indented command lines must begin with a tab, not spaces):

#
# Makefile for record/play
#

CC= gcc

all: record play

record: record.c
    $(CC) -o record record.c -lm

play: play.c
    $(CC) -o play play.c

clean:
    /bin/rm -rf core record play


Questions

When you are done with your project, provide brief (a couple of sentences) answers to the following questions:

  1. What might happen to the speech detection algorithm in a situation where the background noise changes a lot over the audio session?
  2. What are some cases where you might want the silence to remain in a recorded audio stream?
  3. Accurately detecting the beginning of speech might be easier with a large sample size (i.e. capturing more of the audio before computing energy and zero crossings). Why might this be a bad idea for some audio applications?
  4. Do you think the algorithm is language (e.g. English versus Spanish) specific? Why or why not?

Provide your answers numbered, in a text file called answers.txt.


Turn In

You must turn in all source code used in your project, including header files. Please include the project files required for building your code. Have a README file with instructions for building your project and any platform requirements for running your application. Make sure to include your answers to the questions, answers.txt.

Also, please include one short (10 seconds or less) sample output from your record program. This means you should have a copy of the sound.raw, speech.raw and text data file(s) produced by a short run of record.

You will use email to turn in your files. When ready, create a directory for your project based on your last name (e.g., jones) and pack (with zip or tar) your files, for example:

    mkdir lastname-proj1                          # make a directory for your files
    cp * lastname-proj1                           # copy all the files you want to submit
    tar czvf proj1-lastname.tgz lastname-proj1    # package and compress

Attach your file to an email with "cs529-proj1" as the subject.


Grading

The grading breakdown is as follows:

   25% basic recording of sound.

   25% basic playback of sound.

   20% speech detection.

   10% adjustment of thresholds.

   10% proper file output (sound, speech, data).

   10% answers to questions.

Below is a general grading rubric:

100-90. The project clearly exceeds requirements. Play and record both work completely and robustly. Speech detection works as specified in [RS75]. All algorithm thresholds work. Output files are created for each of sound, speech and data. Answers to questions are complete and accurate.

89-80. The project meets requirements. Play and record both work completely. Speech detection works mostly as specified in [RS75]. Algorithm thresholds are in place. Output files are created for each of sound, speech and data. Answers to questions are complete and mostly accurate.

79-70. The project barely meets requirements. Play and record both work, but may have occasional errors. Speech detection based on [RS75] is in place but may not be effective or bug-free. Algorithm thresholds are not fully implemented. Not all output files are created. Some answers to questions may be missing or inaccurate.

69-60. The project fails to meet requirements in some places. Either play and record may not work, or may have errors. Speech detection is not fully implemented or bug-free. Algorithm thresholds are not implemented. Not all output files are created. Many answers to questions may be missing or inaccurate.

59-0. The project does not meet requirements. Both play and record do not fully work, or have many errors. Speech detection is not implemented or contains bugs. Algorithm thresholds are not implemented. Most output files are not created. Answers to questions are missing and/or inaccurate.


References

[RS75] L. Rabiner and M. Sambur. An Algorithm for Determining the Endpoints of Isolated Utterances, The Bell System Technical Journal, Volume 54, Number 2, Pages 297-315, February 1975.

[Tra04] J. Tranter. Introduction to Sound Programming with ALSA, Linux Journal, Issue #126, October, 2004.



Send all questions to the staff mailing list.