Real-Time Network Video Communications and Graphics File Formats

John Pawasauskas
CS563 - Advanced Topics in Computer Graphics
May 06, 1997


Introduction

This presentation is going to deal with the area of video compression to facilitate transmission of video data across computer networks, hopefully in real-time. It is an extension of my senior undergraduate project, which dealt with this topic but in a more limited manner.

This area is becoming very important. Obvious uses are teleconferencing and similar applications, where large amounts of video data (and audio data, for that matter) need to be transmitted in real-time, which requires the use of some sort of compression. Other applications in which this area is important include the World Wide Web, and even "traditional" applications such as HDTV. Both HDTV and standard TV are ideal applications for video compression.

In my presentation on Intel's Multimedia extensions, I mentioned that MMX technology was useful for compressing video data because of the amount of repetition inherent in most compression algorithms. When compressing video data, it's common to have to execute the same code a very large number of times (once per pixel, for example). MMX, which can potentially apply the compression code to up to eight pixels at a time, would therefore be very beneficial in this application.

With that said, this presentation isn't going to mention MMX very much.

This presentation is going to follow the basic format of my original MQP presentation, with portions added or modified where applicable.


What Was My Major Qualifying Project (MQP)?

My MQP dealt with the topic of Network Video Communications. Specifically, the project's goal was to design and implement a software system (in the C programming language) which would transmit a computer animation across an Ethernet network between two IBM PC-compatible computers in "real-time." The animation was a pre-selected raytraced animation with a resolution of 320x200 pixels. The goal was to get a frame rate of 15-20 frames per second.

In order to do this, the project was broken down into three major components:

  1. Network Communications Subsystem
  2. Data Compression Subsystem
  3. Video Format Handling/Video Display Subsystem

The project was implemented as a "client-server" application. There was the Transmitter system and the Receiver system. Naturally, the Transmitter loaded the animation from disk, compressed it, and transmitted it. The Receiver would receive the file, decompress it, and display it to the screen and/or save it to disk. The actual program used for the Transmitter and the Receiver was identical; the role of the program was selected by a command-line switch.
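
As a rough sketch of that single-binary design (the function and switch names here are hypothetical, not the original project's), the entry point might look something like this in C++:

#include <cstring>
#include <iostream>

// Hypothetical entry point: one executable, with its role chosen by a switch.
int main(int argc, char* argv[])
{
    if (argc >= 3 && std::strcmp(argv[1], "-transmit") == 0) {
        // Load the animation from disk, compress each frame, and send it.
        // run_transmitter(argv[2]);
        std::cout << "Transmitter mode, animation file: " << argv[2] << "\n";
    } else if (argc >= 2 && std::strcmp(argv[1], "-receive") == 0) {
        // Receive frames, decompress them, then display and/or save them.
        // run_receiver();
        std::cout << "Receiver mode\n";
    } else {
        std::cerr << "Usage: nvc -transmit <animation> | nvc -receive\n";
        return 1;
    }
    return 0;
}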

It's important to note that this project was concerned only with video data. Audio data was not considered at all, and so some of the graphics formats discussed later will be covered only partially, since audio considerations have not currently been addressed.

The project had several design constraints which are important to know beforehand:

  1. It had to run under MS-DOS on IBM PC-compatible computers.
  2. It had to run in protected mode.
  3. It could not use more than 25% of the total Ethernet bandwidth.

The design constraint that it must run under MS-DOS on IBM PC-compatible computers was practical in nature - that was what was available for development and testing. The use of protected mode was more of a choice than a real constraint: it allowed the use of all of the memory in the system, rather than the 640K that MS-DOS allows in real mode, and it presented a flat memory model, so the programmer did not have to deal with the MS-DOS segmented memory model.

The final constraint, not using more than 25% of the total Ethernet bandwidth, was arbitrary. It is also far more than a "real" application could expect, since dedicating 25% of a shared network's bandwidth to a single instance of a single application is unrealistic at best. However, since the goal of this project was not to produce a "real" application, the constraint seemed reasonable.

The network communications subsystem used the Novell IPX protocol. Again, this was a practical consideration. This project was initially implemented in 1995, when IPX was a very common method of LAN communication. TCP/IP was common, of course, but required software which was not standardized or easily available.

The data compression subsystem was the most compute-intensive component of the project, and arguably the most important. Good compression is necessary to reach the project goal of 20 frames per second, so a good deal of time was spent attempting to implement an acceptable compression scheme.

In the final subsystem, video format handling and video display are lumped together. The reason is fairly simple - the two tasks use very similar data structures, so it seemed logical to implement them in a single subsystem.

It's worth noting that the "important" parts of the project were kept reasonably portable. Very little of the project was machine- or architecture-specific, and the components which are machine-specific are currently being rewritten. In fact, the entire project is being rewritten, but not for that reason. This will be discussed later in this presentation.


The Network Communications Subsystem

The physical network which this project was developed to utilize was a standard 10Mbps Ethernet network. This network has a theoretical maximum bandwidth of 10 megabits per second, which translates to 1.25 megabytes per second. Due to the manner in which Ethernet works, however, actually attaining the maximum theoretical bandwidth is all but impossible. Also note that, because of the design constraint of using at most 25% of the total network bandwidth, there are roughly 312 kilobytes per second of usable bandwidth.

As mentioned earlier, the Network Communications Subsystem was designed and implemented to use the Novell IPX network communication protocol. This was for several practical reasons. First, the preferred protocol, TCP/IP, did not have a standardized, publicly-available implementation on IBM PC-compatible computers at the time this project was implemented, at least not one that would run under MS-DOS. Libraries were available, but tended to be expensive (designed for commercial applications) or took some liberties with the standard. Both of these were unacceptable, so another protocol needed to be selected.

IPX was chosen because it was a standard, owned by Novell and not subject to arbitrary change. It could be used with any PC Ethernet card, and the necessary drivers invariably came with new cards. The original project report discussed programming for the IPX protocol in detail; this presentation will not, other than to mention that since IPX is a real-mode protocol, getting it to communicate with the project application (which runs in protected mode) was a very difficult task.


Video Format Handling/Video Display Subsystem

This component of the project is probably the least interesting. It involves loading/saving the animation file from/to disk, decoding it into a format which can be worked with, and displaying it on the screen if necessary. Not all of these actions are actually performed by the application at once.

Initially, the project worked only with an animation file format named RAW. This file format stored the video data in an uncompressed format, one frame after another. There is a minimal amount of header information which contains, among other things, the resolution of the animation. As one might imagine, these files tend to be rather large. The test animation, for example, took up over 19 megabytes of disk space. This animation contained about 300 frames, which made it large, but an excellent test case.

The RAW format was selected for two reasons:

  1. It is remarkably easy to load/store.
  2. There is no processing time "wasted" decompressing the images prior to the compression and transmission stages.

So while the RAW format itself is grossly inefficient and a waste of disk space (the test animation in compressed format required about 800 kilobytes of disk space), it was uniquely suited to this task.
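
To give a feel for just how easy RAW is to handle, here is a minimal loader sketch in C++. The header layout below (width, height, frame count) is purely hypothetical; the project's header contained this kind of information, but its exact format is not reproduced here.

#include <cstdint>
#include <cstdio>
#include <vector>

// Hypothetical RAW header; the real project's header fields may differ.
struct RawHeader {
    uint16_t width;     // e.g. 320
    uint16_t height;    // e.g. 200
    uint16_t frames;    // e.g. ~300 for the test animation
};

// Read every frame into memory: one byte per pixel, frame after frame.
std::vector<std::vector<uint8_t>> load_raw(const char* path, RawHeader& hdr)
{
    std::vector<std::vector<uint8_t>> frames;
    FILE* f = std::fopen(path, "rb");
    if (!f) return frames;
    if (std::fread(&hdr, sizeof hdr, 1, f) == 1) {
        const size_t frame_size = size_t(hdr.width) * hdr.height;
        for (unsigned i = 0; i < hdr.frames; ++i) {
            std::vector<uint8_t> frame(frame_size);
            if (std::fread(frame.data(), 1, frame_size, f) != frame_size)
                break;                      // truncated file
            frames.push_back(frame);
        }
    }
    std::fclose(f);
    return frames;
}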

Towards the end of the project, consideration was given to adding other graphics file formats. This could not be accomplished in the time allotted, but was left as an option for later development. This is being addressed in the current development effort.

The other component of this subsystem was video display. When the Receiver receives video data, it needs to either write it to disk or display it on the screen (or both). This made it beneficial to perform both of these actions in the same subsystem, since they both work on data in the same format.

There were several ways that data could have been displayed on the screen. The project could have used standard VGA modes. These were discarded early in the specification phase, since standard VGA modes lack the resolution and color depth to be viable options. VESA modes were a possibility, since they allowed higher resolutions and color depths. Unfortunately, they had to be discarded as options, since the standard software which provides VESA modes and functions was not available for the video cards in the test PCs at that time.

This left an undocumented family of video modes, called "Mode-X." Mode-X was first discussed by Michael Abrash in a series of articles in Dr. Dobb's Journal, and is a derivative of the standard VGA 320x200 256-color video mode, known as VGA mode 13h. A standard VGA card carries 256 kilobytes of video memory, but mode 13h addresses it in a manner that makes only 64K of it usable.

The most commonly used Mode-X configuration, on the other hand, provides a resolution of 320x240 pixels and gives access to all 256K of the card's video memory. The extra memory which is accessible can be used for "double buffering," a technique which allows flicker-free animation: the entire frame is copied to an off-screen area of video memory, and then the active display page is switched to point to that frame. The whole screen is updated at once instead of line by line, giving smoother animation.
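
For illustration only, the page flip at the heart of that double-buffering technique comes down to reprogramming the VGA display start address. The sketch below assumes a Borland-style DOS compiler that provides outportb() in <dos.h>; it is not code from the original project, and a real implementation would also wait for vertical retrace before flipping.

#include <dos.h>   // outportb(), assuming a Borland-style DOS compiler

const unsigned CRTC_INDEX      = 0x3D4;  // VGA CRT controller index port
const unsigned CRTC_DATA       = 0x3D5;  // VGA CRT controller data port
const unsigned START_ADDR_HIGH = 0x0C;   // display start address, high byte
const unsigned START_ADDR_LOW  = 0x0D;   // display start address, low byte

// Make the page we just finished drawing visible; the page that was being
// displayed becomes the new back buffer.  In Mode-X each address in video
// memory covers four pixels (one per plane).
void flip_to_page(unsigned page_offset)
{
    outportb(CRTC_INDEX, START_ADDR_HIGH);
    outportb(CRTC_DATA, (page_offset >> 8) & 0xFF);
    outportb(CRTC_INDEX, START_ADDR_LOW);
    outportb(CRTC_DATA, page_offset & 0xFF);
}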

As with IPX, Mode-X is interesting only in that it was part of the original project. If the project were being implemented for the first time today, and still under MS-DOS, it would use VESA modes instead.


Data Compression Subsystem

As stated earlier, the animation to be transmitted was 320x200 pixels, and about 300 frames in length. That yields a frame size of 64,000 bytes uncompressed. If these frames were transmitted uncompressed at 20 frames per second, it would require 1,280,000 bytes per second of network bandwidth, which is very close to the maximum theoretical bandwidth of an Ethernet network. This is impossible to obtain, even ignoring for a moment the 25% maximum that the project is allowed to use. Meeting the constraint requires, at the absolute minimum, a 4:1 compression ratio.
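
A quick back-of-the-envelope check of those numbers (just the arithmetic, not project code):

#include <cstdio>

int main()
{
    const double frame_bytes = 320.0 * 200.0;      // 64,000 bytes per frame
    const double fps         = 20.0;
    const double required    = frame_bytes * fps;  // 1,280,000 bytes/s needed
    const double ethernet    = 10.0e6 / 8.0;       // 1,250,000 bytes/s theoretical
    const double allowed     = ethernet * 0.25;    // ~312,500 bytes/s (25% cap)
    std::printf("required bandwidth : %.0f bytes/s\n", required);
    std::printf("allowed bandwidth  : %.0f bytes/s\n", allowed);
    std::printf("minimum compression: %.1f : 1\n", required / allowed);  // about 4.1:1
    return 0;
}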

Also, the resolution must be taken into account. 320x200 is a very low resolution, especially now. Higher resolutions would be beneficial, which requires better data compression. For example, to meet the same constraints but with 1024x768 pixels, a 48:1 compression ratio would be required.

Obviously, this will require some serious work.

There are two types of compression commonly used on video data: lossless and lossy. A number of methods can be used to implement either type, and each technique has advantages and disadvantages. The biggest difference between the two philosophies is the achievable compression ratio: a lossless scheme is usually limited to about 2:1 compression in practice, while lossy schemes can achieve ratios of 16:1 or higher. On the other hand, lossless schemes are usually faster than lossy schemes.

Since this project was originally implemented as a senior project, functionality gave way to educational value. While there were compression routines available which could achieve very high compression ratios, a "home grown" compression algorithm was used instead. It was a lossless scheme, developed for speed and simplicity as much as for its educational value.

At this point, it is readily apparent that the design goals will probably not be met. Even assuming 2:1 compression, that is only half as good as is necessary in order to meet the goal of 20 frames per second. But it is important to realize that the test animation had been obtained during the specification phase of the project, and its characteristics were well known. Since it was created with a ray tracing program, there should ideally have been a great deal of similarity between frames, so some lossless compression schemes should have been able to provide extremely large compression ratios.

The major options for the compression algorithm were Run Length Encoding (RLE) and Huffman encoding. Huffman encoding is far more complex computationally, but is still far easier to decompress than many lossy compression algorithms.

The compression algorithm developed was a combination of delta compression and Run Length Encoding (RLE). RLE is extremely simple to implement, yet byte-wise RLE can theoretically compress a run of 255 identical bytes into a two-byte count/value pair, a ratio of better than 127:1. By combining RLE with a compression scheme which only transmits the differences between frames (which should be relatively small), a high compression ratio should be possible.
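
The frame-differencing half of that scheme can be sketched very simply. This is one way to represent the differences; the original implementation's exact byte layout is not reproduced here.

#include <cstdint>
#include <vector>

// Byte-wise difference, modulo 256: the result is zero wherever a pixel is
// unchanged, so static regions turn into long runs of zeros that the RLE
// pass can crush.  The receiver reconstructs with cur[i] = prev[i] + delta[i].
std::vector<uint8_t> delta_frame(const std::vector<uint8_t>& prev,
                                 const std::vector<uint8_t>& cur)
{
    std::vector<uint8_t> delta(cur.size());
    for (size_t i = 0; i < cur.size(); ++i)
        delta[i] = static_cast<uint8_t>(cur[i] - prev[i]);  // wraps mod 256
    return delta;
}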

RLE

Run Length Encoding is a very simple compression method conceptually. The idea is to count the number of adjacent pixels (in a row) which are the same color, and instead of transmitting all the redundant data, simply transmit the count and the color. So a row which had:
1 1 1 1 2 2 2 2 2 2 2 2 3 3 3 4 4 4 4 4 5 5 5
would become:
4 1 8 2 3 3 5 4 3 5
In the above example, the original line, which required 23 bytes, was reduced to one which required 10 bytes.

It's important to note that RLE's worst case produces something which is two times the size of the original, if no two adjacent pixels are the same.
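
A minimal byte-wise RLE encoder along the lines described above (a sketch, not the project's actual routine):

#include <cstdint>
#include <vector>

// Encode the input as (count, value) pairs.  Each run is capped at 255 so
// the count fits in a single byte; the worst case (no repeats at all)
// doubles the size of the data, as noted above.
std::vector<uint8_t> rle_encode(const std::vector<uint8_t>& in)
{
    std::vector<uint8_t> out;
    size_t i = 0;
    while (i < in.size()) {
        const uint8_t value = in[i];
        uint8_t count = 1;
        while (i + count < in.size() && in[i + count] == value && count < 255)
            ++count;
        out.push_back(count);
        out.push_back(value);
        i += count;
    }
    return out;
}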

The Project So Far

This point marks the end of the work that was done on the original project. It is important to note that the project did not in fact meet its design goals. The overall frame rate obtained was more along the lines of 4-5 frames per second. This is mostly due to the fact that the compression algorithm developed was, at best, poor.

The test animation was, as mentioned previously, a computer-generated raytraced animation. It was an animation of the starship Enterprise leaving orbit of a planet and then zooming on its way. It should have been good quality, but it wasn't. The original animation was obtained from the Internet, and was stored in MPEG form. It was converted to RAW format for the purposes of this project, but by then the damage had been done: the original MPEG compression process had introduced a large number of artifacts, which caused the compression algorithm to fail miserably.

The average compressed frame was about 80% of the size of the original, and a "worst case" check had to be added so that if the size of a compressed frame exceeded that of the original, the original uncompressed frame would simply be sent instead. This is obviously not good.

There were a number of public-domain compression algorithms which could have been implemented. This could have produced superior compression ratios, but would not have given the same exposure to compression techniques. For an educational project, the compression algorithm implemented was probably the right choice.

For the next phase, it isn't.

The Next Phase

So far, what has been described is what has been done already. And it isn't good enough. The original implementation was slow, inefficient, and not very easy to modify. The next version is going to fix all that.

It's important to realize that computer technology has improved quite a bit since the original project was implemented. At that time, an Intel 486 running at 66 megahertz was the fastest Intel Architecture processor available, and the systems available for this project ran at 33MHz. Modern CPUs such as the Pentium Pro run at speeds up to (currently) 200MHz and have multiple execution units which allow them to execute several instructions at once. And now with MMX technology, the potential performance of compression algorithms could be increased even further. In other words, what was a concern two years ago when the project was being developed is less of a concern today.

First things first, though. The original program has some serious problems with the way it was structured, at least when it comes to adding new features. The project was originally broken up into several subsystems, each of which was coded as a "module". This is good, but the modules are hard-coded to use each other. It would be difficult to modify them to use other compression algorithms, for example, without changing a large amount of the program.

For this reason, the whole program is being rewritten. It's also being rewritten in C++, based on the theory that the basic components can easily be treated as classes. In this manner, there could be an abstract base class for "frame", with a RAW frame, an MPEG frame, a FLI frame, and so on derived from it. These two changes should make it far easier to implement some of the other compression algorithms which are standards now.
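
The class hierarchy hinted at above might be sketched like this (the class and method names are illustrative, not the rewrite's actual interface):

#include <cstdint>
#include <vector>

// Abstract base class: every concrete format knows how to decode itself
// into a flat 8-bit pixel buffer and report its dimensions.
class Frame {
public:
    virtual ~Frame() {}
    virtual int width() const = 0;
    virtual int height() const = 0;
    // Delta-based formats such as FLI need the previous frame's pixels;
    // formats like RAW can simply ignore the argument.
    virtual std::vector<uint8_t> decode(const std::vector<uint8_t>* previous) const = 0;
};

class RawFrame : public Frame {
public:
    RawFrame(int w, int h, const std::vector<uint8_t>& pixels)
        : w_(w), h_(h), pixels_(pixels) {}
    int width() const { return w_; }
    int height() const { return h_; }
    std::vector<uint8_t> decode(const std::vector<uint8_t>*) const {
        return pixels_;                  // RAW frames are already uncompressed
    }
private:
    int w_, h_;
    std::vector<uint8_t> pixels_;
};

// FliFrame, MpegFrame, and so on would derive from Frame in the same way.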

You may be wondering what some of these other compression algorithms are and how they work. That's coming up next.

Animation File Formats

There really are a lot of different animation file formats which are available and "standard." Some use no compression, some use lots of compression. Some are lossless, while others are lossy. In order to implement compression schemes which will be useful, they must first be analyzed to see what kind of performance they provide. In this section, various animation file formats will be investigated.

FLI

The basic idea behind FLI is really pretty simple: You don't need to store the parts of a frame that are the same as the last frame. Not only does this save space, but it's also fast - setting a pixel is slower than leaving it alone.

The implementation of FLI is moderately complex. FLI files have a 128-byte header, followed by a sequence of frames. The first frame is compressed using a byte-wise RLE scheme. Subsequent frames are stored as the difference from the previous frame. There is one additional frame at the end of a FLI file, which is the difference between the first and last frame. (Occasionally the first frame and/or subsequent frames are uncompressed.)

After the frame header come the "chunks" that make up the frame. First comes a color chunk (if the color map has changed from the last frame), and then comes a pixel chunk (if pixels have changed). If the frame is identical to the last frame, there won't be any chunks.

A chunk itself has a header, containing the number of bytes in the chunk and the type of chunk. There are five different types of chunks:

  1. FLI_COLOR
  2. FLI_LC
  3. FLI_BLACK
  4. FLI_BRUN
  5. FLI_COPY
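
Based on that description, a frame reader might model a chunk roughly as below. The symbolic names match the list above; the field widths (a four-byte size followed by a two-byte type) and the actual numeric type codes come from the FLI specification and are assumptions here, not something spelled out in this presentation.

#include <cstdint>

// The five chunk types named above.  The on-disk numeric codes are assigned
// by the FLI specification and are deliberately not guessed at here.
enum FliChunkType {
    FLI_COLOR,   // palette changes since the previous frame
    FLI_LC,      // line-compressed pixel deltas (the common case)
    FLI_BLACK,   // clear the whole frame
    FLI_BRUN,    // byte-run RLE of a complete frame (e.g. the first frame)
    FLI_COPY     // uncompressed frame (fallback when compression isn't worth it)
};

// Chunk header: the number of bytes in the chunk, then its type code.
// (An on-disk reader would also have to deal with struct packing and byte order.)
struct FliChunkHeader {
    uint32_t size;   // assumed four-byte chunk size
    uint16_t type;   // assumed two-byte type code
};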

FLI_LC chunks are the most common, and the most complex. They're used for transmitting the compressed line data (the "LC" stands for "Line Compressed"). The first 16-bit word in the chunk is the number of lines, starting from the top of the screen, that are the same as in the previous frame. So if there was motion only on the last line of the screen, there would be a 199 here (remember that FLIs have a resolution of 320x200).

The next word is the number of lines that do change. Next comes the data for the lines themselves. Each line is compressed individually. Among other things, this makes it easier to replay a FLI animation at reduced sizes.

The first byte of a compressed line is the number of packets in this line. If the line is unchanged from the last frame this number is zero. The format of an individual packet is:
skip_count
size_count
data

The skip_count and size_count are both single bytes. If the skip count needed is more than 255, the line must be broken into two packets. If the size_count is positive, that many bytes of data follow and are to be copied onto the screen. If it's negative, then a single byte follows, and it is repeated -size_count times (that is, the absolute value of size_count).
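
Putting that together, decompressing a single FLI_LC line might look roughly like this (a sketch based directly on the description above, with error checking omitted):

#include <cstdint>

// Decode one FLI_LC line into 'line', which already holds that row of the
// previous frame.  'src' points at the packet-count byte for this line;
// the return value points just past this line's data.
const uint8_t* decode_lc_line(const uint8_t* src, uint8_t* line)
{
    int packets = *src++;              // zero packets: line is unchanged
    int x = 0;
    for (int p = 0; p < packets; ++p) {
        x += *src++;                   // skip_count: pixels kept from the previous frame
        const int8_t size = static_cast<int8_t>(*src++);
        if (size >= 0) {
            for (int i = 0; i < size; ++i)        // literal run: copy 'size' bytes
                line[x++] = *src++;
        } else {
            const uint8_t value = *src++;         // replicated run: one byte,
            for (int i = 0; i < -size; ++i)       // repeated -size_count times
                line[x++] = value;
        }
    }
    return src;
}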

The worst-case for a FLI frame is about 70K. If it comes out to be 60000 bytes or more, it's usually determined that it isn't worth it, and the frame is stored as a FLI_COPY frame instead (which doesn't use compression).

MPEG

Compared to FLI and other "simple" compression formats, MPEG is evil. The reason is that MPEG is a very complex compression scheme, designed to achieve much higher compression ratios, but at a cost: the FLI format is lossless, while the MPEG format is lossy. It tries to determine what can be discarded without a noticeable loss in image quality, and usually does a pretty good job.

There are two MPEG formats now, MPEG-1 and MPEG-2. MPEG-4 is, as far as I can tell, in the development stage.

There isn't a lot of information available on the Internet on the specifics of the MPEG format. There's plenty of material on how MPEG works generally, mostly overviews. If you want to know exactly how it encodes a frame, you can either try to locate one of the books on animation file formats which cover MPEG (which can be hard to find), or you can download some of the source code freely available on the 'Net and try to figure it out yourself. I haven't had a chance to do either yet, so I'll present an overview of the format.

MPEG compresses both audio and video. For now I'm going to ignore the audio component and focus on the video. MPEG video defines three kinds of frames:

  1. Intra-coded (I) frames: Pictures coded without reference to other pictures
  2. Predictive-coded (P) frames: Frames coded using motion compensation prediction based on preceding I-frames or P-frames
  3. Bidirectionally-Predictive coded (B) frames: Frames coded using both past and future I-frames and P-frames as their reference points for motion compensation

B-frames have the highest level of compression, but they cannot be decoded until the next I-frame or P-frame has been processed to provide the required reference points. This means that frame buffering is used for intermediate B-frames.

MPEG-1 uses a block-based discrete cosine transform (DCT) method with visually weighted quantisation and run-length encoding for compressing video data.

A general transform coding scheme basically involves subdividing an N*N image into smaller n*n blocks and performing a unitary transform on each subimage. A unitary transform is a reversible linear transform whose kernel describes a set of complete, orthonormal discrete basis functions. The goal of the transform is to decorrelate the original signal, and this decorrelation generally results in the signal energy being redistributed among only a small set of transform coefficients.
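
As a concrete (if naive) instance of such a unitary transform, here is a direct implementation of the two-dimensional 8x8 DCT that MPEG-style coders apply to each block. This is the straightforward textbook formulation; a real encoder would use a fast factorized version.

#include <cmath>

const int DCT_N = 8;   // MPEG-style coders work on 8x8 blocks

// Forward 2-D DCT-II of one 8x8 block.  For typical image blocks the energy
// concentrates in the low-frequency coefficients (small u and v), which is
// exactly the decorrelation property that transform coding relies on.
void dct8x8(const double in[DCT_N][DCT_N], double out[DCT_N][DCT_N])
{
    const double pi = 3.14159265358979323846;
    for (int u = 0; u < DCT_N; ++u) {
        for (int v = 0; v < DCT_N; ++v) {
            double sum = 0.0;
            for (int x = 0; x < DCT_N; ++x)
                for (int y = 0; y < DCT_N; ++y)
                    sum += in[x][y]
                         * std::cos((2 * x + 1) * u * pi / (2.0 * DCT_N))
                         * std::cos((2 * y + 1) * v * pi / (2.0 * DCT_N));
            const double cu = (u == 0) ? std::sqrt(1.0 / DCT_N) : std::sqrt(2.0 / DCT_N);
            const double cv = (v == 0) ? std::sqrt(1.0 / DCT_N) : std::sqrt(2.0 / DCT_N);
            out[u][v] = cu * cv * sum;
        }
    }
}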

Transform coding can be generalized into four stages:

  1. Subdivision of the image into small blocks
  2. Transformation of each block
  3. Quantisation of the resulting transform coefficients
  4. Entropy coding of the quantised coefficients

Quantisation can be performed in several ways. Most classical approaches use 'zonal coding', which applies scalar quantisation to the coefficients belonging to a predefined area of the block (with a fixed bit allocation), or 'threshold coding', which keeps only those coefficients of each block whose absolute value exceeds a predefined threshold. Another possibility, which leads to higher compression factors, is to apply a vector quantisation scheme to the transformed coefficients.
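
A tiny sketch of the threshold-coding idea just mentioned: coefficients whose magnitude falls below a threshold are discarded outright, and the survivors are scalar-quantised by a step size.

#include <cmath>

// Threshold-code one 8x8 block of transform coefficients in place: zero out
// anything smaller than 'threshold', divide the rest by 'step' and round.
void threshold_quantise(double coeff[8][8], double threshold, double step)
{
    for (int u = 0; u < 8; ++u)
        for (int v = 0; v < 8; ++v)
            coeff[u][v] = (std::fabs(coeff[u][v]) < threshold)
                        ? 0.0
                        : std::floor(coeff[u][v] / step + 0.5);
}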

The final entropy-coding stage is the same regardless of which quantisation method is used. In most cases a classical Huffman code can be used successfully.

MPEG-2 video encoding does basically the same type of thing.

It's very important to note that this is an incredibly compute-intensive operation. Most MPEG encoders (at least the "serious" ones that are used professionally) use hardware to accelerate the encoding. Until fairly recently there weren't many software MPEG encoders, since they took a very long time to encode the file.

Remember MMX? Well, MMX instructions can be used to accelerate the encoding and decoding of MPEG files.


Copyright © 1997 by John Pawasauskas
