Compiling for Special Architectures

13.0 Introduction

Early computers of the 1940's and 50's had small instruction sets which executed on a small number of equally small (often 8-bit) registers. Hardware was expensive, and compiler techniques were just developing. Floating-point operations, often implemented in software, were slow.

The 1960's and 70's saw a decrease in the price of hardware and the introduction of designs which modeled high-level language constructs such as loop instructions and array indexing. Many of these constructs were implemented in microcode; that is, machine instructions were themselves translated into lower-level sequences of (micro-)code. These machines, when compared with reduced instruction set computers (RISC's), are called CISC's, Complex Instruction Set Computers.

Much of the information in this chapter applies equally well to CISC machines. Separation of the RISC discussion from parallel architecture issues is also somewhat artificial since RISC architectures may also be parallel architectures. We avoid discussing specific machines and processors and discuss only their features.

One strong point exists for compiling for special architectures: the hardware (chip) design and the (optimizing) compiler design have often proceeded hand in hand. Rarely today is the compiler designer faced with a hardware design made with no consideration of how it is to be used.

13.1 Compiling for RISC Machines

Interactions between programmers and hardware designers resulted in the hardware implementation of high-level language constructs. This led to large instruction sets and the proliferation of addressing modes. Research into the use of these complex instructions and addressing modes began to show that a small number of instructions and modes accounted for most of the use. For example, three of the DEC VAX's ten addressing modes account for 85% of addressing-mode use; ten of the more than 200 instructions on the IBM 370 account for 67% of instruction use [Muchnick, 1990].

Reduced Instruction Set Computers (RISC's) are designed to reflect this simplicity. The return to smaller instruction sets and simpler addressing modes is not a return to the past, however. Random access memory is faster and cheaper. Registers and buses are larger (generally 32 bits). Hardware design and implementation, as well as compiler techniques, are much better understood.

Nevertheless, RISC machines pose interesting problems for compiler designers, particularly in the areas of optimization and code generation. Because there are more registers, more program quantities can be kept in registers. Source-level debugging of this highly optimized code has become a challenge.


13.1.1 RISC Machine Features

There is some controversy concerning what is or is not a RISC processor. The debate centers upon whether to classify a machine as RISC, as opposed to CISC, by architectural differences such as an abundance of registers or by performance measures such as performance on benchmark programs. In this section, we mention a number of architectural features commonly associated with RISC machines.

Small, Simpler Instruction Set and Few Addressing Modes

Small, simple and few are relative terms. Instruction set sizes are typically less than 150. Four or fewer addressing modes are common, although some processors have more.

Instruction formats tend to be fixed in length, in contrast to the variable-length instruction formats of CISC machines. The number of these fixed-length formats is small, often on the order of two or three. This results in faster (hard-wired) decoding.

Single-cycle operations allow the instructions to execute rapidly. Load-Store design dictates that only Load and Store instructions access memory. Ideally, these are the only instructions which take more than one machine cycle to execute.

Elimination of complex instructions eliminates the need for microcode.

Many Registers

Operations execute faster when the data is in a register. Thirty-two or more registers are common for RISC machines; some have many more. Hardware-maintained sets of registers, called register windows, are organized in a circular queue, with a new set added to the tail of the queue and an older set removed from the head of the queue.

Levels of Memory

In addition to secondary memory and a large number of registers, RISC processors include cache memory. Sometimes there is a separate cache for instructions and another for operands. There may be separate buses to each cache.

Special-Purpose Architectures

RISC machines are often designed for a particular application, language, or operating system. There are RISC machines for signal processing, symbolic processing, AI, image processing, scientific calculations, graphics processing, multiprocessing, and parallel processing. In addition, there are several general-purpose RISC machines on the market.


13.1.2 Compiler Issues in RISC Machines

Some of the issues that follow are also common to CISC machines and to recent languages such as Ada. They are mentioned here because compiler designers for RISC architectures have found them particularly important.

Pipelining

Pipelining traditionally has been the process of fetching instructions or data while another instruction is executing. It becomes increasingly important on RISC processors. Because all computations must take place in registers, and loads and stores access memory, it is important for the compiler to reorder code when it can so that the CPU is executing while a load or store is taking place. Since it is estimated that 1/3 of a program's executable instructions involve loads and stores, this is an especially important task. Pipelining has also been used to try to sustain an execution rate of more than one instruction/cycle.
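For example, a compiler might move an independent computation into the gap between a load and the first use of the loaded value (a sketch in C; the variable names are arbitrary):

    /* before scheduling: the add needs t and must wait for the load  */
    t = a[i];          /* load from memory (takes more than one cycle) */
    x = t + 1;         /* uses t immediately; the CPU may stall here   */
    y = b * c;         /* independent of the load                      */

    /* after scheduling: the independent multiply fills the load delay */
    t = a[i];          /* load from memory                             */
    y = b * c;         /* executes while the load completes            */
    x = t + 1;         /* t is available by now                        */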

Register Usage

Because RISC computations take place in registers and are one cycle or less, but a load takes more than one cycle, a technique called scoreboarding has been developed. When an instruction is begun whose result will affect a register, that register is marked on a scoreboard. This mark is erased when the instruction concludes (and the register has its new data). Other instructions check the scoreboard, and if they need the results from that register, the CPU waits. A clever compiler can keep these CPU waiting periods to a minimum.
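In outline, the mechanism might be modeled as follows (a simplified sketch, not any particular machine's hardware):

    /* one scoreboard bit per register: set while an instruction that */
    /* will write the register has not yet delivered its result       */
    static int busy[32];

    void begin_write(int reg) { busy[reg] = 1; }   /* e.g., a load is issued    */
    void end_write(int reg)   { busy[reg] = 0; }   /* the new value has arrived */

    /* an instruction that reads reg must wait while the bit is set */
    int must_wait(int reg)    { return busy[reg]; }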

If all program quantities are kept in registers, a speed-up of as much as 1/3 can be expected from the program. A large number of registers facilitates this, but even these must be used effectively.

Because of the large number of registers, it is often possible to keep each operand in a register and to store the result in yet another register, thus leaving both operands in registers for possible future use.

A technique called value tracking keeps a list of program quantities whose values are currently in registers. A set of ordered pairs (Value, Register) is maintained, and when the value is needed for another computation, it is fetched from the register.
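A minimal sketch of such a table, assuming program quantities are identified by name:

    #include <string.h>

    /* value tracking: which register currently holds each named value */
    struct tracked { const char *value; int reg; };
    static struct tracked table[64];
    static int entries;

    void record(const char *value, int reg) {      /* the value now lives in reg */
        if (entries < 64) {
            table[entries].value = value;
            table[entries].reg = reg;
            entries++;
        }
    }

    int find_register(const char *value) {         /* -1 means a load is needed */
        for (int i = 0; i < entries; i++)
            if (strcmp(table[i].value, value) == 0)
                return table[i].reg;
        return -1;
    }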

Registers may be organized into a file of (possibly overlapping) register sets. When these sets are managed by the hardware, they are called register windows. Register windows allow for fast storing and restoring of information in procedure calls. These register windows are organized into a circular queue rather than the traditional stack which stores and restores for each call. A new procedure call (automatically) increments a pointer so that the outgoing parameter registers become the ingoing parameter registers for the new procedure. A return operation reverses this process.
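In outline, the current-window pointer behaves as follows (a simplified sketch of the hardware's behavior, not compiler code):

    #define WINDOWS 8          /* number of register windows in the circular queue */
    static int cwp;            /* current window pointer */

    /* on a call, the pointer advances; the caller's outgoing parameter */
    /* registers overlap the callee's incoming parameter registers      */
    void on_call(void)   { cwp = (cwp + 1) % WINDOWS; }

    /* on a return, the pointer moves back */
    void on_return(void) { cwp = (cwp + WINDOWS - 1) % WINDOWS; }

When all of the windows are in use, the oldest window must be spilled to memory and later restored, so very deep call chains still incur memory traffic.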

Optimization and Code Generation Techniques

Since loads and stores may slow execution, it is important for the code generator to minimize them. Peephole optimization techniques can remove redundant loads or unnecessary stores.

A two-cycle instruction such as a branch has to compute an address. Execution cannot continue until after the address is computed. A technique known as delayed branching moves the branch instruction back one instruction so that the statement preceding the branch is executed at the same time that the address is being calculated. Another technique is to execute the instruction after a conditional branch at the same time as the address fetch (hoping that the branch doesn't take place).

Other standard optimizations, such as computing common subexpressions only once, and good register allocation, such as that performed by graph coloring, tend to improve RISC programs by a higher percentage than CISC programs. One estimate comparing the same optimizing compiler on a RISC machine and on a CISC machine found an improvement of about 50% on the RISC machine, with nowhere near that much improvement on the CISC machine.

Run-time checks can sometimes be eliminated by the compiler. Checking that the variable X in the expression X = A + B * 12 is within its declared bounds of, say, 0 .. 255 is not needed at run time if the compiler knows that A is declared as 0 .. 100 and B is declared as 0 .. 10. Of course, checks for A and B are still needed unless a similar dependence can also be found at compile time. On RISC machines, this sort of compiler simulation can cost as much as 20% in compile time, but may remove as many as 70% of the run-time checks.
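Written out with explicit checks (a sketch in C, which has no subrange types, so the bounds appear as code; range_error is a hypothetical routine):

    /* X declared 0 .. 255, A declared 0 .. 100, B declared 0 .. 10 */
    x = a + b * 12;
    /* largest possible value: 100 + 10 * 12 = 220; smallest: 0,    */
    /* so the following check is provably unnecessary and can be    */
    /* removed at compile time                                      */
    if (x < 0 || x > 255) range_error();
    /* checks that a and b lie within their own bounds are still    */
    /* needed unless similar reasoning applies to them              */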

Inlining is the replacement of a procedure call with the code of the called procedure. Even with register windowing, this can produce a program which runs faster at the expense of a (perhaps huge) increase in code space. Inlining is most productive when there are lots of calls to very short procedures.
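For example (an illustrative routine):

    int square(int n) { return n * n; }

    /* original call */
    y = square(x + 1);

    /* after inlining: no call or return overhead, and the expanded code
       is exposed to further optimization, at the cost of code space */
    y = (x + 1) * (x + 1);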

Because the typical RISC architecture does not have microcode, the translated code is "bare-bones", exposing code sequences to optimizations which may have been hidden on CISC machines.

Debugger Interaction

Production-quality compilers produce code in the object file format of the machine. It is this structure to which the debugger refers. If there is a one-one correspondence between a source program construct and its place in the object file, then debugging is facilitated. In heavily optimized code, this may not be the case.

Debugging of optimized code is a challenge in any architecture, CISC or RISC. If a computation, say a loop-invariant one, has been moved, and a programmer steps to that instruction point, the expected instruction is not there. Ideally, the debugger should try to figure out what the compiler has done and either undo it or inform the programmer. This may mean that the compiler should leave this information in the object file for the debugger.

Another problem arises from the register window situation. If a variable has been kept in a register during one procedure execution, and the programmer asks to see that variable's value when executing another procedure (and another register window), the debugger has to be informed that the variable is no longer in that register.

Debugging inlined code (see above) is also a difficult task. A bug in a procedure is now reproduced as many times as the inlined procedure is called. Compilers, ideally, should leave information in the object module's symbol file indicating that a piece of code was really written by the programmer only once.


13.2 Compiling in a Parallel Processing Environment

Parallel processing, which overlaps instruction fetch and execution at both the hardware level and the compiler level, is not new. Instruction pipelining is itself an example of parallel processing.


13.2.1 Parallel Machine Features

RISC architectures are often parallel processing environments. Other parallel processing environments include multiple CPU's (multiprocessing), multiple ALU's (array processors and supercomputers), and systems with enough back-up components to execute correctly even in the presence of a hardware error (fault-tolerant computers).

Multiple processors which have their own memory and I/O ports (loosely coupled systems) have different compiler issues from those which have multiple CPU's, but share memory and I/O ports (tightly coupled systems).


13.2.2 Compiling in a Multiprocessing Environment

Preparing code to execute in more than one processor involves synchronization. Values which are needed in one processor may be computed in another. Changes in shared memory must be done carefully.
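As a minimal sketch, a shared value can be updated under a lock so that only one processor changes it at a time (here using POSIX threads; any mutual-exclusion mechanism would do):

    #include <pthread.h>

    static int shared_total;                              /* visible to all processors */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    void add_to_total(int value) {
        pthread_mutex_lock(&lock);                        /* one processor at a time */
        shared_total += value;
        pthread_mutex_unlock(&lock);
    }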

An interesting example of parallel execution in a multiprocessing environment is execution of the compiler itself. In some environments, the compiler is the major software process.

One technique for producing code for parallel processing environments is to take a compiler which produces code for a non-parallel processing environment and write a scheduler which maps the produced code onto the parallel execution units. This can be done automatically, using the best-developed techniques and heuristics, or interactively, with a (knowledgeable) programmer hand-tuning the automatically produced code.


13.2.3 Compiling for Array Processors and Supercomputers

Array processors (sometimes called vector machines) and supercomputers use parallelism to increase performance. Typically, these machines are used to perform high-precision arithmetic calculations on large arrays. In addition, they may need to operate in real time or close to real time.

Supercomputers are also used for general-purpose computing, while array processors are special-purpose (often peripheral) processors which operate solely on vectors.

Discovering data dependencies is one of the major issues for these architectures.

Data Dependencies

Data dependence checking is important for detecting possibilities for parallel scheduling of serial code. Strictly speaking, the problem is one of concurrency. For example, the following statements cannot be executed at the same time:
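    X = A + B;
    Z = X * C;     /* needs the value of X computed by the first statement */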

Since X's value is needed to compute Z, the second statement depends on the first. The following statements could be executed in parallel:
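    X = A + B;
    Z = C * D;     /* neither statement uses a result computed by the other */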

Parallelization, like optimization, has a potentially higher payoff in loops. The following loop can be changed to execute in parallel:
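    for (i = 0; i < 100; i++)
        a[i] = b[i] + 1;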

For two processors this could become:
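    for (i = 0; i < 100; i += 2) {
        a[i]     = b[i] + 1;        /* can execute on processor 1 */
        a[i + 1] = b[i + 1] + 1;    /* can execute on processor 2 */
    }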

This is called loop unrolling. Since each statement within the loop is affecting and using a separate element of the array, the two statements can be executed in parallel. There will be half as many test and branch instructions to execute since the loop is now counting by 2's.

Some machines have numerous processors. On a machine with 64 processors, the following
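    for (i = 0; i < 6400; i++)
        a[i] = b[i] + 1;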

might become
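    for (k = 0; k < 100; k++)            /* the outer loop remains serial    */
        for (p = 0; p < 64; p++)         /* one iteration for each processor */
            a[k * 64 + p] = b[k * 64 + p] + 1;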

The inner loop statement can now be executed simultaneously on all 64 processors.

The statement in the following loop contains a data dependency and cannot be effectively parallelized:
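    for (i = 1; i < 100; i++)
        a[i] = a[i - 1] + b[i];     /* uses the value computed on the previous iteration */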

Here, the value computed in one iteration of the loop is used in the next iteration so that the loop cannot be unrolled and processed in parallel.

Debugger Interaction

Producing information in the object module which the debugger can use is important since there may no longer be a one-one correspondence between the code written by the programmer and the code scheduled for parallel execution. If code is scheduled after the compiler has produced it, then there is not even a one-one correspondence between the code produced by the compiler and the code that actually executes under the debugger. In this case, the scheduler should leave a trail for the debugger to follow.


13.2.4 Super-Scalar Processors

Processors capable of executing more than one instruction per cycle are termed super-scalar processors. Architecturally, they combine RISC features with parallel processing. To maintain an instruction rate of two instructions per cycle requires sophisticated branch prediction.

Much is known about achieving concurrency for scientific calculations, but for wider applications, non-vectorizable instruction concurrency is needed.

Algorithms for finding non-numerical instructions which can be executed in parallel in a general application or business application environment are needed.


13.3 Compiling in a Fault-Tolerant Environment

Previous sections of this chapter have described special architectures for increasing performance. There is also a growing need for systems that "produce correct results or actions even in the presence of faults or other anomalous or unexpected conditions". Fault-tolerant systems increase dependability through redundancy. Applications such as robotics and navigational systems require reliable systems. The purpose of a fault-tolerant system is to be able to execute an algorithm in the presence of hardware or software errors. Fault tolerance may be implemented in hardware, with back-up devices, or in software.


13.3.1 Fault-Tolerant Hardware Features

Fault-tolerant machines have a back-up for every critical part of the machine. This includes the CPU, I/O ports, buses, and the ALU.


13.3.2 Fault-Tolerant Software Features

Fault tolerance can be simulated in software. One example of such a technique is a sequence in which alternative routines are tried until one produces an acceptable result (a sketch; the routine names are illustrative):
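    /* try independently written square-root routines until one passes */
    /* the acceptance test; the names here are illustrative            */
    x = sqrt_version_1(y);
    if (!acceptable(x, y))          /* e.g., check that |x*x - y| is small */
        x = sqrt_version_2(y);
    if (!acceptable(x, y))
        x = sqrt_version_3(y);
    if (!acceptable(x, y))
        signal_failure();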

Here, multiple versions of a function which computes a square root are provided, increasing the likelihood that the square root value returned satisfies some criterion.

Knowledgeable compilers which produce code for fault-tolerant systems should be aware of the redundancies in the architecture. Much research is needed to make this a reality.


13.4 Summary

Although there is disagreement on exactly what qualifies as a RISC machine, characterizations include (1) as small an instruction set as possible, (2) as small a set of addressing modes as possible, (3) as small a number of instruction formats as possible, (4) single-cycle execution of as many instructions as possible, (5) memory access by load-store instructions only, (6) as large a number of user general-purpose registers as possible, (7) no microcode, and (8) high-level language support.

The type of benchmark program used to test a computer often affects the results. For example, programs such as Quicksort or Towers of Hanoi contain lots of recursive subprogram calls. It is not surprising, therefore, that these programs show a substantial improvement when executed on RISC machines with register windowing to efficiently handle the overhead of calls and returns.

Programming for parallel environments involves two approaches: (1) learning how to program for concurrency and using languages which facilitate this and (2) scheduling of already-written sequential code for parallel processing.

Throughout this chapter the topic of parallelism recurs. RISC architectures are often parallel. Fault tolerance is often implemented using parallelism. Parallelism can make programs run faster; it can also make them run more safely. Meeting the need for both fast and safe execution is a challenge.

New architectures continue to evolve. Their effective use requires constant study of the applications for which they are intended and compiler techniques which will exploit the new architecture.