CS563 - Advanced Topics in Computer Graphics
April 22, 1997
By now you should have heard the hype. You've probably seen the Intel commercials where Intel engineers are dancing to funky music under colorful lights, because their job is to put the fun in Intel CPUs. Intel is certainly spending enough money on advertising to make many people think that MMX is the best thing to hit PCs since Windows 95.
But Windows 95 had its share of critics, so should MMX be any different? In this presentation, I'm going to explain how MMX works, what MMX does, and make some comparisons between MMX CPUs and non-MMX CPUs. Hopefully by the time I'm done, you'll be able to figure out for yourselves if MMX is really all it's cracked up to be, or if (like many things which came before it) it's nothing more than a hot buzzword that's not worth the time, effort, and (more importantly) money.
So, the question you should all be asking yourselves now is: what is MMX?
The name "MMX" is generally read as MultiMedia eXtensions. Intel developed MMX technology to increase the speed at which certain "multimedia" operations are performed. In fact, MMX technology improves the performance of current and future graphics and communications applications while maintaining compatibility with the existing Intel Architecture (IA) software base.
So in other words, Intel added enhancements to their processors which speed up multimedia operations while remaining compatible with everything already out there. MMX is an extension of IA. In fact, MMX is IA's most significant enhancement since 1985, when the Intel 386 processor extended the architecture to 32 bits.
MMX includes new instructions and data types which achieve increased levels of performance on the host CPU by exploiting the parallelism inherent in many of the algorithms in these applications. According to Intel, MMX can deliver 50%-100% performance gains for multimedia and communications applications over the same applications run on the same type of processor without MMX technology.
It's worth noting that Intel, smart people that they are, designed MMX technology so that it would scale well with processor operating frequencies and future architecture generations. It has been integrated into the Pentium, and will soon be integrated into the P6 (aka Pentium Pro) processors, giving these processors an extra boost. MMX will also appear on all future IA processors. And although Intel created MMX technology, both AMD and Cyrix are incorporating it into their next-generation processors, the K6 and M2, respectively.
Ok, so now I'll bet you want to know the answer to the question: how does MMX work?
First things first. To implement MMX technology, Intel created 57 new instructions. We'll get back to this later.
MMX technology's potential target applications share several characteristics:
These things pointed the MMX definition team in the direction of a single-instruction multiple-data (SIMD) architecture, in which one instruction performs the same operation on multiple data elements in parallel. This parallel operation on relatively small data elements is the main source of MMX's performance boost.
The benefits of a SIMD architecture have been identified by other processor architectures, such as the Sun Microsystems SPARC-Visual Instruction Set (VIS) and the Hewlett-Packard PA-RISC 2.0 MAX-2 instruction set. SIMD architecture has been used for years to provide high performance in a wide range of systems; it's a proven technology.
Before now, when processing 8- or 16-bit data, the existing 32- or 64-bit CPU bandwidth and processing resources on Intel processors were underutilized. Only the low-order 8 or 16 bits were manipulated, leaving the remaining bits unused. MMX allows full utilization of the wide processing capabilities of the CPU.
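To make the "unused bits" point concrete, here is a minimal sketch in portable C (not MMX code) of eight independent byte additions carried out inside a single 64-bit integer. The function name `packed_add_bytes` and the masking trick are assumptions made for illustration; MMX keeps the byte lanes separate in hardware instead.

```c
#include <stdint.h>

/* Adds the eight unsigned bytes packed into a and b, lane by lane, with
   wrap-around inside each byte. Illustration only: the masks stop a carry
   in one byte from spilling into its neighbour. */
uint64_t packed_add_bytes(uint64_t a, uint64_t b) {
    const uint64_t low7 = 0x7F7F7F7F7F7F7F7FULL;  /* low 7 bits of every byte */
    const uint64_t top1 = 0x8080808080808080ULL;  /* top bit of every byte */
    uint64_t sum_low = (a & low7) + (b & low7);   /* cannot carry across bytes */
    return sum_low ^ ((a ^ b) & top1);            /* fold the top bits back in */
}
```

One call does the work of eight separate 8-bit additions, which is the kind of saving a SIMD instruction provides without any of the masking.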
As described in my primary sources, the MMX designers chose a data width of 64 bits, for two reasons: first, their studies showed that using 64 bits of packed elements would enable a fairly substantial performance boost, and second, the Pentium and Pentium Pro processors already use 64-bit-wide data buses.
MMX had a couple of design goals which are very important. For the most part they were listed earlier, but I'm going to list them again, since they really are important.
This last point is important. Modern processors and operating systems can run multiple applications simultaneously (aka multitasking). New applications which used the new MMX instructions had to be able to multitask with any other applications. This put some constraints on the MMX technology definition. They couldn't create a new MMX state or mode (in other words, no new registers) because then operating systems would have needed to be modified to take care of these new additions.
The main technique for maintaining compatibility of MMX technology was to "hide" it inside the existing floating-point state and registers (current operating systems and applications are designed to work with the floating-point state). An operating system doesn't need to know if MMX technology is present, since it's hidden in the floating-point state. Applications have to check for the presence of MMX technology, and if it's built into the processor they use the new instructions.
The MMX technology definition process was unusual. In what seems a rare event in the modern computer industry, it was the engineers and not the managers who led the way. A group of architects and software engineers analyzed the potential performance of existing and future applications, including graphics, MPEG video, speech synthesis, speech compression, speech recognition, image processing, 3D graphics, video conferencing, modems, and audio. They met with external software developers to learn what they would need from a new IA processor in order to enhance their multimedia and communications applications. The applications were analyzed to identify the most compute-intensive parts, which were then analyzed in detail using computer-aided engineering tools. These studies (and the performance potential they showed) convinced Intel of the need to integrate the new technology ASAP, and to fully convert all IA processors to use MMX technology.
So what are the main features of MMX? For this part, I'm going to illustrate the main features of MMX technology and its instructions by using some simple examples as a guide. First, though, we're going to have to know what the MMX instructions are, so that we know what we're looking at. The table below is a summary of the MMX instruction set.
|Instruction|Options|Description|
|---|---|---|
|Padd[b/w/d], Psub[b/w/d]|Wrap-around and saturate|Parallel add and subtract of eight packed bytes, four 16-bit words, or two 32-bit doublewords.|
|Pcmpeq[b/w/d], Pcmpgt[b/w/d]|Equal, or greater than|Parallel compare of eight bytes, four 16-bit words, or two 32-bit doublewords. Result is a mask of 1s if true or 0s if false.|
|Pmullw, Pmulhw|Result is low- or high-order bits|Parallel multiply of four signed 16-bit words. The low-order or high-order 16 bits of each 32-bit result are chosen.|
|Pmaddwd|Word to doubleword conversion|Parallel multiply-add of four signed 16-bit words. Adjacent pairs of 32-bit results are added together. Each result is a doubleword.|
|Psra[w/d], Psll[w/d/q], Psrl[w/d/q]|Shift count in register or immediate|Parallel shift of four words, two doublewords, or the full 64 bits: arithmetic right, logical right, or logical left.|
|Punpckl[bw/wd/dq], Punpckh[bw/wd/dq]| |Parallel unpacking (interleaved merge) of eight bytes, four 16-bit words, or two 32-bit doublewords.|
|Packss[wb/dw]|Always saturate|Parallel packing of doublewords to words or words to bytes.|
|Pand, Pandn, Por, Pxor| |64-bit bitwise logical operations.|
|Mov[d/q]| |Moves 32 or 64 bits between memory and MMX registers, or between MMX registers. 32 bits can be moved between MMX and integer registers.|
|Emms| |Empties the FP register tag bits.|
A lot of multimedia applications execute the same instructions on many pieces of data in a large data set. Standard processors can only process one piece of data with each instruction. MMX technology processes several pieces of data with each instruction. It's a simple type of parallelism which provides a big performance boost for a lot of multimedia applications. Typically, data elements are small: 8 bits per element for pixels or 8 bits for each pixel component (red, green, and blue) used in graphics and video, 16 bits per element for audio samples or as a higher-precision backup for 8-bit operations, and 32 bits per element for general computing and some 3D graphics algorithms.
Because of this, MMX defines new data types, which are 64 bits in total size, and are composed of independent smaller-size data elements. They're called "packed data types". Each element within a packed data type is a fixed-point integer. The programmer controls the place of the fixed point within each element, and is responsible for its placement throughout the calculation. This control means an extra burden for programmers, but it also gives them a lot of flexibility to choose and change fixed-point formats during the application to fully control the dynamic range of their values.
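The four data types defined by MMX are the packed byte (eight 8-bit elements), the packed word (four 16-bit elements), the packed doubleword (two 32-bit elements), and the quadword (one 64-bit element).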
MMX instructions are defined to perform the parallel operations on the multiple data elements packed into the new 64-bit data types. MMX technology extends the basic integer instructions into SIMD versions. These instructions include the standard add, subtract, multiply, compare, and shift, data-type conversion functions (to facilitate converting between the new data types), instructions to support 64-bit operations (64-bit memory moves, 64-bit logical operations), and a multiply-add operation (because a lot of multimedia applications perform multiply-accumulate operations).
For packed data types, MMX has its most complete instruction support for packed-word (16-bit) data types, since Intel's designers found that 16-bit data was the most general and useful for the category of multimedia applications. It also acts as a higher-precision backup for operations on byte data. As stated earlier, a total of 57 new MMX instructions were added overall to the IA.
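The multiply-add operation on packed words is easiest to see in a dot product. What follows is a minimal sketch in plain C that emulates the behaviour described in the instruction table for Pmaddwd; the function names are hypothetical, and real code would issue the instruction itself rather than these loops.

```c
#include <stdint.h>

/* Emulates multiply-add on four signed 16-bit words: multiply lane by lane,
   then add adjacent pairs of 32-bit products into two doublewords.
   Assumption: inputs avoid the single overflow case where all four
   operands of a pair equal -32768. */
void multiply_add_words(const int16_t a[4], const int16_t b[4], int32_t out[2]) {
    out[0] = (int32_t)a[0] * b[0] + (int32_t)a[1] * b[1];
    out[1] = (int32_t)a[2] * b[2] + (int32_t)a[3] * b[3];
}

/* Dot product of two n-element vectors (n a multiple of 4),
   consuming four words per step and keeping two running sums. */
int32_t dot_product(const int16_t *x, const int16_t *y, int n) {
    int32_t acc0 = 0, acc1 = 0;
    for (int i = 0; i < n; i += 4) {
        int32_t pair[2];
        multiply_add_words(x + i, y + i, pair);
        acc0 += pair[0];
        acc1 += pair[1];
    }
    return acc0 + acc1;
}
```

Each loop step stands for one multiply-add instruction plus one packed add, which is why multiply-accumulate loops are such a good fit.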
Now, we can easily notice that the MMX instructions differ from one another by a few characteristics. Different instructions are supplied to do the same operation on different data types. One instruction may work on packed bytes, while another will work on packed words. Some instructions also differ because they treat a value as signed or unsigned.
A major feature of MMX instructions is saturation arithmetic, which is important to many graphics routines. As an example, assume you add together two medium-red pixel values. With saturating arithmetic, a sum that exceeds the largest representable value is clamped to that maximum, so the result is the most intense red available. With regular integer math the same sum wraps around, and you could end up with a result much darker than either input. In other words, saturation arithmetic handles "wrap-around" problems. This is a very handy thing to have.
MMX supports both signed and unsigned saturating arithmetic.
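The difference between wrap-around and saturation is easy to show with a minimal C sketch that emulates the packed byte adds one lane at a time. The function names are hypothetical and the loops stand in for what a single MMX instruction would do on all eight bytes at once.

```c
#include <stdint.h>

/* Wrap-around add of eight unsigned bytes: overflow discards the carry. */
void add_wrap_u8(const uint8_t a[8], const uint8_t b[8], uint8_t out[8]) {
    for (int i = 0; i < 8; i++)
        out[i] = (uint8_t)(a[i] + b[i]);
}

/* Unsigned saturating add: results above 255 are clamped to 255. */
void add_saturate_u8(const uint8_t a[8], const uint8_t b[8], uint8_t out[8]) {
    for (int i = 0; i < 8; i++) {
        unsigned sum = (unsigned)a[i] + b[i];
        out[i] = sum > 255u ? 255u : (uint8_t)sum;
    }
}

/* Signed saturating add: results are clamped to the range -128..127. */
void add_saturate_s8(const int8_t a[8], const int8_t b[8], int8_t out[8]) {
    for (int i = 0; i < 8; i++) {
        int sum = a[i] + b[i];
        if (sum > 127) sum = 127;
        if (sum < -128) sum = -128;
        out[i] = (int8_t)sum;
    }
}
```

Adding 200 and 100 gives 44 with wrap-around but 255 with unsigned saturation, which is the pixel example in numbers.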
The parallelism and saturating arithmetic in MMX are useful in some video conferencing compression schemes. Instead of directly encoding each frame in a video sequence, it's better to first compute the differences between the current frame and a recent previous frame. If the two frames are very similar (which is usually the case in video), the resulting difference frame can be represented with less information than the original. So for all the pixels in the frame, a pixel-to-pixel difference is computed. What's really nice about this is that all the differences can be computed in parallel, since they're independent operations.
This can cause a problem, though. The difference of two 8-bit unsigned pixels can be negative, so it needs 9 bits to represent. You can get around this using saturating arithmetic. What you do is use unsigned saturating subtraction to subtract pixel A from pixel B, and then do the reverse. One result will be the absolute difference, and the other will be 0. We don't know which is which, but that's ok, because we can just use a logical OR operation to combine the results. This operation can be done in parallel on 8 bytes at a time, which provides really good performance (even better considering this operation is used a lot).
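The subtract-both-ways trick can be sketched in a few lines of C. This is an emulation under my own naming (`sub_saturate_u8` and `abs_diff_u8` are hypothetical), with a loop standing in for the eight byte lanes that MMX would handle in one instruction.

```c
#include <stdint.h>

/* Unsigned saturating subtract of one byte: negative results clamp to 0. */
uint8_t sub_saturate_u8(uint8_t x, uint8_t y) {
    return x > y ? (uint8_t)(x - y) : 0;
}

/* Absolute difference of eight pixel bytes: subtract in both directions
   with saturation, then OR the two results. One of them is always 0. */
void abs_diff_u8(const uint8_t a[8], const uint8_t b[8], uint8_t out[8]) {
    for (int i = 0; i < 8; i++) {
        uint8_t a_minus_b = sub_saturate_u8(a[i], b[i]);
        uint8_t b_minus_a = sub_saturate_u8(b[i], a[i]);
        out[i] = (uint8_t)(a_minus_b | b_minus_a);
    }
}
```

No comparison or branch is needed per pixel, so the whole operation stays inside the packed instructions.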
Saturating arithmetic also has value in traditional graphics. Gouraud shading is a standard way to render 3D images so that they look more realistic. It works by shading polygons by interpolating color values across scan lines during rendering. Somewhere along a scan line the calculations can overflow. Unless precautions are taken, the overflow generates a completely different result from the one expected. Saturation makes sure this doesn't happen.
There are often several different ways to exploit parallelism in a given algorithm. The choice for the most efficient use of MMX instructions should be driven by how the data is laid out in memory or whether the programmer can change the way the data flows through an algorithm. In other words, the best implementation is often application specific. This shouldn't be a real surprise, but it does mean that simply using MMX instructions won't "magically" make our application perform as well as it could. And since the whole point is performance...
Multimedia applications usually have a data-independent control flow - each operation can execute without needing the results of a previous computation. These algorithms are the easiest to optimize. On the other hand, some important applications need the results of a previous computation before proceeding. These operations need to use logical operations to fit into MMX technology. An example is overlaying a sprite over a graphic. A sprite is a separate image in a 2D array, with the rest of the array filled with a "clear" color. Overlaying the sprite involves checking each pixel taken from the sprite array to make sure it isn't the clear color. If it isn't the clear color, it's part of the sprite and is written to the output frame. Otherwise it writes the corresponding pixel from the scene being overlaid. This operation makes use of the MMX parallel compare instruction.
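The sprite overlay can be written without any per-pixel branch once the comparison produces a mask. Here is a minimal C emulation, assuming 8-bit pixels; the function name `overlay_sprite8` is hypothetical, and the loop body corresponds roughly to a parallel compare followed by AND, AND-NOT, and OR on eight bytes at once.

```c
#include <stdint.h>

/* Overlays eight sprite pixels on eight scene pixels.
   Where the sprite holds the clear color, the scene pixel shows through. */
void overlay_sprite8(const uint8_t sprite[8], const uint8_t scene[8],
                     uint8_t clear_color, uint8_t out[8]) {
    for (int i = 0; i < 8; i++) {
        /* Parallel compare result: all 1s if equal, all 0s if not. */
        uint8_t mask = (sprite[i] == clear_color) ? 0xFF : 0x00;
        /* Keep the scene where the mask is set, the sprite elsewhere. */
        out[i] = (uint8_t)((mask & scene[i]) | (~mask & sprite[i]));
    }
}
```

The data-dependent decision has been turned into plain logical operations, which is what lets it run on all lanes in parallel.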
MMX uses two sets of instructions (Pack and Unpack) to convert between the different MMX data types. Unpack takes small data types and produces larger ones (e.g., converting 16-bit words to 32-bit doublewords). Unpack takes two operands and interleaves them. If you want to Unpack 16-bit words into 32-bit doublewords, you can take one operand of 16-bit data and interleave it with another operand filled with 0s. The result is a set of 32-bit doublewords with 0s in the most significant bits.
The Pack instructions convert data from a larger data type to a smaller one. If you expand 8-bit bytes into 16-bits to do some calculation, you could pack the final result back into 8-bit bytes before storing them back in memory.
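A minimal C sketch of both directions, using bytes and words for brevity. The function names are hypothetical, byte 0 is taken to be the least significant byte of the 64-bit value, and the pack shown is the signed-saturating kind listed in the table as Packss.

```c
#include <stdint.h>

/* Unpack (low half): interleave the low four bytes of a with those of b.
   With b all zeros, each byte pair reads as a zero-extended 16-bit word. */
void unpack_low_bytes(const uint8_t a[8], const uint8_t b[8], uint8_t out[8]) {
    for (int i = 0; i < 4; i++) {
        out[2 * i] = a[i];       /* low byte of word i */
        out[2 * i + 1] = b[i];   /* high byte of word i */
    }
}

/* Reads word i from an unpacked result (little-endian byte order). */
uint16_t word_at(const uint8_t bytes[8], int i) {
    return (uint16_t)(bytes[2 * i] | (bytes[2 * i + 1] << 8));
}

/* Pack with signed saturation: eight 16-bit words become eight bytes,
   each clamped to the range -128..127. */
void pack_words_saturate(const int16_t lo[4], const int16_t hi[4], int8_t out[8]) {
    for (int i = 0; i < 8; i++) {
        int v = (i < 4) ? lo[i] : hi[i - 4];
        if (v > 127) v = 127;
        if (v < -128) v = -128;
        out[i] = (int8_t)v;
    }
}
```

A typical round trip is unpack, compute at 16-bit precision, then pack, with saturation catching any intermediate value that no longer fits in a byte.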
The Unpack instructions are pretty powerful when data organized in one format in memory needs to be rearranged while executing some algorithm in order to expose the parallelism that MMX works with. An example of this is the Inverse Discrete Cosine Transform (IDCT) used in the JPEG image decompression algorithm. This algorithm takes a 2D array of data and operates first on rows of data, and then on columns of data. In memory, an array is usually laid out one row after another. MMX lets you manipulate rows in this type of memory organization really easily, since row elements are at consecutive addresses. But this organization doesn't work well on columns. In order to do this in parallel, the array has to be transposed, so that the columns become rows.
But guess what? The MMX Unpack instruction can be used to transpose an array! It's a two-step process: in phase 1 the Unpack instruction is used to interleave the word (16-bit) elements of adjacent rows, and in phase 2 the results of the first phase are unpacked again, this time using doubleword (32-bit) Unpack instructions to create the desired outputs. Pretty easy, huh?
What does this look like in MMX code? Let's take a look.
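As a rough stand-in, here is the two-phase transpose of a 4x4 block of 16-bit words written as portable C that emulates the word and doubleword Unpack instructions, rather than as actual MMX assembly. The type `mmx_words` and all function names are hypothetical, and element 0 is taken to be the low-order word of the register.

```c
#include <stdint.h>

/* One 64-bit MMX register viewed as four 16-bit words (w[0] is lowest). */
typedef struct { uint16_t w[4]; } mmx_words;

/* Word unpack, low and high halves: interleave words of a and b. */
mmx_words unpack_low_words(mmx_words a, mmx_words b) {
    mmx_words r = {{ a.w[0], b.w[0], a.w[1], b.w[1] }};
    return r;
}
mmx_words unpack_high_words(mmx_words a, mmx_words b) {
    mmx_words r = {{ a.w[2], b.w[2], a.w[3], b.w[3] }};
    return r;
}

/* Doubleword unpack, low and high halves: interleave 32-bit pairs. */
mmx_words unpack_low_dwords(mmx_words a, mmx_words b) {
    mmx_words r = {{ a.w[0], a.w[1], b.w[0], b.w[1] }};
    return r;
}
mmx_words unpack_high_dwords(mmx_words a, mmx_words b) {
    mmx_words r = {{ a.w[2], a.w[3], b.w[2], b.w[3] }};
    return r;
}

/* Transposes a 4x4 block: in[i] is row i, out[i] is column i. */
void transpose4x4(const mmx_words in[4], mmx_words out[4]) {
    /* Phase 1: interleave the words of adjacent rows. */
    mmx_words t0 = unpack_low_words(in[0], in[1]);
    mmx_words t1 = unpack_high_words(in[0], in[1]);
    mmx_words t2 = unpack_low_words(in[2], in[3]);
    mmx_words t3 = unpack_high_words(in[2], in[3]);
    /* Phase 2: interleave the doublewords of the phase 1 results. */
    out[0] = unpack_low_dwords(t0, t2);
    out[1] = unpack_high_dwords(t0, t2);
    out[2] = unpack_low_dwords(t1, t3);
    out[3] = unpack_high_dwords(t1, t3);
}
```

Eight unpack operations transpose the whole block, with no element ever moved individually.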
I've said it before, and I'll probably say it again: MMX technology is fully compatible with the existing IA. No new mode or state was created, and all existing PC designs and operating systems can work with a new processor which just happens to have MMX technology.
The instructions were easy to add because they're just integer instructions. From the data point of view, it was more of a challenge, because MMX data types are 64-bit packed integers, and there aren't any integer registers of this type on any existing IA processors. How do you get around a problem like this?
What they did was to map the MMX data types to the existing floating point registers, which are 80 bits wide. When MMX data is needed, the processor uses the floating-point registers as 64-bit-only packed integer registers. This mapping is done completely internally to the processor, and the external world only sees the physical floating point registers.
If the operating system has to switch tasks suddenly, it saves the floating-point state (which might hold MMX data) and starts the new task. When the interruption is finished, the OS switches back, restores the floating-point state (which might hold MMX data), and then resumes where it left off. No new physical registers, condition codes, or events are added to support MMX technology! This allows all operating systems, including Windows, OS/2, and Unix, to work with MMX without knowing that MMX technology is in the processor.
MMX defines eight 64-bit general-purpose registers laid over the floating point registers. Each register can be directly addressed by designating the names MM0-MM7 in MMX instructions. These registers are used only for holding MMX data. Remember, MMX instructions are integer instructions which operate on packed fixed-point integer data loaded into the floating-point registers. What does this mean? It means that all the registers on an IA processor (8 integer registers, 8 floating-point registers) can be put to use with MMX technology, getting the greatest benefit from the available registers on the processor.
The MMX data values are put into the low-order 64 bits (the mantissa) of the 80-bit floating-point registers. The exponent field of the floating-point register and the sign bit (bit 79) are set to 1s, making the value in the register a Not a Number (NaN) or infinity when viewed as a floating-point value. This reduces the confusion which could otherwise occur, since MMX data won't look like a valid floating-point value. MMX instructions access only the low-order 64 bits of the floating-point registers, and aren't affected by the fact that they operate on invalid floating-point values.
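The register layout just described can be modelled in a few lines of C. This is only a sketch of the bit layout: the type `fp_register_image` and the function names are hypothetical, and the ten bytes are stored least significant first.

```c
#include <stdint.h>

/* An 80-bit floating-point register as ten bytes, least significant first:
   bytes 0-7 hold the 64-bit mantissa, bytes 8-9 the exponent and sign. */
typedef struct { uint8_t bytes[10]; } fp_register_image;

/* Writes an MMX value: data goes in the low 64 bits,
   and bits 64-79 (exponent and sign) are set to all 1s. */
fp_register_image store_mmx_value(uint64_t value) {
    fp_register_image r;
    for (int i = 0; i < 8; i++)
        r.bytes[i] = (uint8_t)(value >> (8 * i));
    r.bytes[8] = 0xFF;
    r.bytes[9] = 0xFF;
    return r;
}

/* Reads an MMX value back: only the low 64 bits are looked at. */
uint64_t load_mmx_value(const fp_register_image *r) {
    uint64_t value = 0;
    for (int i = 0; i < 8; i++)
        value |= (uint64_t)r->bytes[i] << (8 * i);
    return value;
}

/* True if the 15-bit exponent (bits 64-78) is all 1s, which is how a
   floating-point instruction would recognise a NaN or an infinity. */
int exponent_all_ones(const fp_register_image *r) {
    unsigned exponent = ((unsigned)(r->bytes[9] & 0x7F) << 8) | r->bytes[8];
    return exponent == 0x7FFF;
}
```

Whatever the 64 data bits contain, the register never looks like an ordinary finite floating-point number.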
It's worth noting that the dual use of the floating-point registers doesn't make it impossible for an application to use both MMX and floating-point code. Inside the application, MMX and floating-point code should be encapsulated in separate routines by the programmer. After one routine completes, the floating-point state is reset (after MMX code, this is the job of the Emms instruction), and the next routine starts. On the other hand, it's not a good idea to use the floating-point registers for both floating-point and MMX data at the same time, since values in the floating-point registers are interpreted differently when accessed by floating-point or MMX instructions.
Intel calls the first implementation of MMX on the Pentium processor the "Pentium Processor with MMX Technology". So now we have to look at the Pentium processor, since MMX is built into the basic processor design.
The Pentium processor is an advanced superscalar processor (which means that it can handle more than one instruction at the same time). It has two general-purpose integer pipelines and a pipelined floating-point unit. It can simultaneously execute two integer instructions or one floating-point instruction. A software-transparent dynamic branch prediction mechanism minimizes pipeline stalls caused by branch instructions.
Despite the fact that MMX data is stored in floating-point registers, the MMX instructions were designed to run in the integer pipelines. MMX instructions operate on packed integers, so it makes sense to use the hardware in the integer pipelines for this. With the exception of the multiply instruction, MMX instructions execute in one cycle. The multiply instruction takes three cycles to complete, but since the integer multiply unit is pipelined a new multiply can start every cycle. With loop unrolling, it's possible to get a throughput of one MMX multiply per cycle.
The Pentium processor can issue two integer instructions per clock cycle (otherwise there would be no need for two integer pipelines). During execution of any given instruction the next two instructions are checked and, if possible, the first is started in the first pipe and the second in the second pipe. If it's not possible to issue two instructions (for instance, if the second instruction requires the result of the first), the first instruction is issued to the first pipe and nothing is issued to the second pipe. The second instruction waits until the next cycle and becomes the first instruction in the next possible pair. On a Pentium with MMX technology, a pair of instructions which can be executed in parallel can be two integer instructions (as on the regular Pentium), one integer instruction and one MMX instruction, or two MMX instructions.
On a Pentium processor with MMX technology, the basic integer pipeline structure looks like:
Instructions that take more than one cycle to execute stay in the Execution stage until they finish. But when executing MMX instructions the pipeline structure is a little different:
Additional stages are provided for the pipelined multiply instructions, which have a longer latency.
The new MMX instructions made it necessary to modify the instruction decode logic so that it could decode, schedule, and issue the new instructions at a rate of up to two instructions per cycle. With the original Pentium's decoder, MMX instructions would have been slow to decode, taking two cycles per instruction. The instruction decoder was redesigned to quadruple the throughput of MMX instructions, allowing it to decode two MMX instructions per cycle.
It should be no surprise that multimedia and communications applications have high data access rates. MMX lets you do more computing per cycle, so the processor has to be able to move data more efficiently. They needed to adapt the memory bandwidth to maintain a balanced system. This is done in two ways on the Pentium with MMX technology:
Ok, so now you have a good idea of how MMX works. Now we have to determine if all the extra work is really worth it. The goals for the Pentium with MMX were to exceed performance of IA code by 100% to 300% on kernels (the basic loops of multimedia applications) and by 50% to 100% on multimedia applications. The Intel Media Benchmark shows a performance boost of over 65%.
Creating good benchmarks isn't an easy thing to do. In the case of MMX, they had to generate reasonable workloads in order to analyze the performance, and then establish a performance baseline. Only recently has there been any interest by the computer industry to even have a multimedia benchmark suite, and since it came about before the MMX technology definition, it doesn't take advantage of any MMX instructions.
Because of this, the first performance projections for MMX were based on the kernel level. You can probably remember when MMX had just come out - the promise of quadrupling the speed of multimedia applications, games, and similar programs was all over the place. The kernels included general purpose applications like the Fast Fourier Transform (FFT) and vector dot products, as well as other things. The performance of the kernels was compared using a performance simulator to optimized versions of the same kernels (but without using the MMX instructions).
But MMX's most meaningful performance goals were on the application level. By analyzing various applications they found that most of the time was spent in a few basic kernels. They replaced these kernels with MMX versions, and ran them first on the performance simulator and finally on actual silicon, when it became available.
Some applications, like MPEG1 video decompression, saw an 80% speedup over a regular Pentium. They considered this a relatively low speedup, because a lot of time is spent in things MMX can't help with, like accessing the CD-ROM drive or a hard disk and writing graphics data to memory. Other applications, like image filtering, saw a speedup of 370%, since this application spends over 80% of its time in the compute-intensive image-filtering kernel.
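Why the share of time spent in the kernel matters so much can be shown with a short calculation. The relationship is usually called Amdahl's law, a name my sources don't use; the function below and the numbers in the note after it are hypothetical illustrations, not measured MMX results.

```c
/* Overall application speedup when only part of the run time gets faster.
   kernel_fraction: share of the original run time spent in the kernel (0..1).
   kernel_speedup:  how many times faster the kernel itself becomes. */
double overall_speedup(double kernel_fraction, double kernel_speedup) {
    double untouched = 1.0 - kernel_fraction;            /* time MMX can't help */
    double accelerated = kernel_fraction / kernel_speedup;
    return 1.0 / (untouched + accelerated);
}
```

With 80% of the time in a kernel that becomes four times faster, the whole application only becomes 2.5 times faster, and no kernel speedup at all can push it past five times. An application that spends most of its time waiting on a disk has a much lower ceiling.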
Ok, so Intel created this great technology. But it's not something that code automagically takes advantage of. Programs have to be designed to use MMX instructions in order to get any advantage out of it. They realized that the success of MMX technology depended on the availability of compelling software which used it. This isn't as obvious as it sounds; there are a lot of things out there which are technologically superior to others yet fail.
The MMX developers worked closely with software developers to determine how this technology could best be used. When sample processors became available, they worked with a larger group of software developers. First they focused on the people who built the tools and building blocks for multimedia and communications applications, like 3D engines and audio coders. Then came the larger software developer community.
I don't mean to pound this point home, but programs must be able to run on processors regardless of whether or not they have MMX technology. That means no programs which require MMX, only those that take advantage of its presence. But how do you know if a processor has MMX or not? Well, you check for the feature bit which they added to detect the existence of MMX. You execute the CPUID instruction and check a certain bit - if it's set you have MMX. During runtime, programs can query this bit, and depending on the result decide whether or not to use MMX instructions. In this way, only one version of the software needs to be sold. But the software needs two separate versions of the main compute-intensive kernels - one for MMX and one for non-MMX. Do the math - the size of the binary is going to grow. The MMX developers studied this, however, and determined that since MMX code is mainly applied to small, tight compute-intensive kernels, the total code growth in most applications is small. They claim that less than 10% code growth is common.
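The check-then-dispatch pattern might look like the following minimal C sketch. It assumes the caller has already executed CPUID function 1 and obtained the EDX register value (how that is done depends on the compiler, so it is left out); the MMX flag is bit 23 of that value. The kernel functions are hypothetical placeholders, and the "MMX" one is a plain-C stand-in.

```c
#include <stdint.h>

#define CPUID_EDX_MMX_BIT 23u  /* MMX feature flag in EDX for CPUID function 1 */

/* Returns 1 if the feature flags say MMX is present, 0 otherwise. */
int has_mmx(uint32_t cpuid1_edx) {
    return (int)((cpuid1_edx >> CPUID_EDX_MMX_BIT) & 1u);
}

/* Both kernel versions share one signature so callers don't care which runs. */
typedef void (*halve_fn)(const uint8_t *src, uint8_t *dst, int n);

/* Hypothetical non-MMX kernel: halves every pixel value. */
void halve_plain(const uint8_t *src, uint8_t *dst, int n) {
    for (int i = 0; i < n; i++) dst[i] = src[i] / 2;
}

/* Placeholder for the MMX version; a real one would use MMX instructions. */
void halve_mmx(const uint8_t *src, uint8_t *dst, int n) {
    for (int i = 0; i < n; i++) dst[i] = src[i] / 2;
}

/* Picks the kernel once at startup, based on the feature flags. */
halve_fn choose_halve(uint32_t cpuid1_edx) {
    return has_mmx(cpuid1_edx) ? halve_mmx : halve_plain;
}
```

Because the choice is made once and stored in a function pointer, the check costs nothing inside the compute-intensive loop, and only the small kernels exist in two versions.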
But here's the tricky part. Ok, so an application which uses MMX will be faster on a Pentium with MMX than one without it. Makes perfect sense. But if you look at the performance figures, you'll see that a regular, non-MMX application will also run faster on a Pentium with MMX than one without it. Why?
Because of something the MMX developers did - they doubled the size of the on-chip cache! And because of the "30% Rule" employed in computer engineering, we can expect up to a 30% increase in performance due to doubling the size of the cache alone (within certain constraints placed on the workload)! They never address this in any way in the benchmarks that I've seen, but I have seen claims that an MMX-enabled processor can accelerate normal Windows applications by 20%.
Now, the question is, is this a bad thing? No, of course not. I'm a big fan of on-chip cache. I doubt that anyone in their right mind would claim that a slower CPU is better than a faster one, all things being equal. But from what I've seen, the Intel engineers "hide" the regular performance boost. Why would they do such a thing?
To put it bluntly, the Intel implementation of MMX technology sucks. Or at least its current implementation does; they may improve it as time goes by. And, in fact, I can't see them not improving it, since otherwise they'll lose ground to their competitors.
You may be asking yourself, "But why does Intel's implementation of MMX suck?" The answer is quite simple, and something that Intel would prefer that nobody notice. The context switch which happens when using MMX instructions (the switch between floating-point and MMX use of the shared registers) takes 50 clock cycles to execute. Keep in mind, the Pentium processor was designed so that most instructions take one cycle to execute. When you have a 50 cycle stall because you're working on setting things up, that's a Bad Thing(tm). That puts a sizeable dent in the performance figures.
Now, you may be asking "So what? It's something that MMX requires, nobody will be able to do much better!" And if you asked that, you'd be wrong. AMD's implementation of MMX on their new K6 processor (which was recently released, by the way, and competes not with the Pentium but with the Pentium Pro) requires only 5 cycles to do the same thing that Intel's takes 50 for. And keep in mind, this context switch isn't exactly a rare event on a modern, multitasking system. So Intel takes a relatively gigantic performance hit on something that happens all the time. AMD doesn't, on a chip which outperforms both the Pentium and Pentium Pro, and is priced better than what it competes with. Cyrix's M2 also outperforms the Pentium in its context switch time, but I could find no hard data on exactly how fast it is.
So, is an Intel CPU with MMX technology worth it? As far as the Pentium is concerned, yes. The extra cache makes it well worth it anyway, and the promise of MMX-enabled multimedia applications makes it definitely worth the extra money. How much extra money are we talking about here? On average, a Pentium Processor with MMX technology costs $50-$100 more than one without it.
Is MMX a good idea? Definitely. MMX finally brings Intel processors (as well as AMD and Cyrix, the other major PC CPU manufacturers) into a more modern architecture. A processor which is both SISD and SIMD is really a very handy thing to have. I'm sure there are a multitude of applications out there which could be optimized to use MMX instructions to get a performance boost. The extra overhead of checking for MMX technology is something which will never really be noticed, and the extra binary size will be irrelevant since many applications grow more and more bloated between releases anyway.
It's likely that even operating systems will be partially rewritten to take advantage of MMX technology. Certainly modern operating systems with GUIs have at least some operations which could be accelerated in this manner.
It's unfortunate that Intel felt the need to ensure complete backwards compatibility with previous processors, at the expense of not adding things which could be immensely useful. Other more modern processors have dozens of registers available for general purpose use, yet IA machines have 8. Modern IA machines have an additional 8 floating-point registers, but they're not designed for general purpose use. Intel could have taken this opportunity to add some additional registers, yet chose not to, instead using the floating point registers.
MMX is an idea whose time has finally come. But time will tell if developers will take full advantage of it.
Copyright © 1997 by John Pawasauskas