13.2.3 Compiling for Array Processors and Supercomputers

Array processors (sometimes called vector machines) and supercomputers use parallelism to increase performance. Typically, these machines are used to perform high-precision arithmetic calculations on large arrays. In addition, they may need to operate in real time or close to real time.

Supercomputers are also used for general-purpose computing, while array processors are special-purpose (often peripheral) processors which operate solely on vectors.

Discovering data dependencies is one of the major issues for these architectures.

Data Dependencies

Data dependence checking is important for detecting possibilities for parallel scheduling of serial code. Strictly speaking, the problem is one of concurrency. For example, the following statements cannot be executed at the same time:

     X := Y + 1
     Z := X + 2
     

Since X's value is needed to compute Z, the second statement depends on the first. The following statements could be executed in parallel:

     X := Y + 1
     Z := Y + 2
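
One common way to phrase this test is in terms of the variables each statement reads and writes (often called Bernstein's conditions): two statements may run concurrently if neither writes a variable the other reads or writes. The following is a minimal sketch under that assumption. The function name and the (writes, reads) representation are illustrative choices, not something the text prescribes.

```python
def can_run_in_parallel(s1, s2):
    """Each statement is a (writes, reads) pair of sets of variable names."""
    w1, r1 = s1
    w2, r2 = s2
    # Independent only if no variable written by one statement is
    # read or written by the other.
    return not (w1 & r2) and not (r1 & w2) and not (w1 & w2)


first = ({"X"}, {"Y"})        # X := Y + 1
dependent = ({"Z"}, {"X"})    # Z := X + 2 (reads the X written above)
independent = ({"Z"}, {"Y"})  # Z := Y + 2 (shares only the read of Y)

print(can_run_in_parallel(first, dependent))
print(can_run_in_parallel(first, independent))
```

Sharing a variable that both statements only read, such as Y in the second pair, does not create a dependency.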
     

Parallelization, like optimization, has a potentially higher payoff in loops. The following loop can be changed to execute in parallel:

                              
     FOR I := 1 TO AHighNumber
       A[I] := A[I] + B[I]
     ENDFOR
     

For two processors this could become:

     FOR I := 1 TO AHighNumber BY 2
       A[I] := A[I] + B[I]
       A[I + 1] := A[I + 1] + B[I + 1]
     ENDFOR
     

This is called loop unrolling. Because each statement in the loop body reads and writes a different element of the array, the two statements can be executed in parallel. The loop now counts by 2, so it also executes half as many test and branch instructions. As written, the unrolled loop assumes that AHighNumber is even.
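
The unrolled form can be sketched in Python as follows. This is an illustration only: the function names are hypothetical, the arrays are 0-based Python lists, and the cleanup step for an odd element count is an addition that the pseudocode does not show.

```python
def add_rolled(a, b):
    """Original loop: one element per iteration."""
    out = list(a)
    for i in range(len(out)):
        out[i] = out[i] + b[i]
    return out


def add_unrolled_by_2(a, b):
    """Unrolled loop: two independent element updates per iteration."""
    out = list(a)
    n = len(out)
    paired = n - (n % 2)  # largest even count that fits
    for i in range(0, paired, 2):
        # These two statements touch different elements, so a
        # two-processor machine could execute them at the same time.
        out[i] = out[i] + b[i]
        out[i + 1] = out[i + 1] + b[i + 1]
    if paired < n:
        # Cleanup for the leftover element when n is odd.
        out[paired] = out[paired] + b[paired]
    return out
```

Both functions produce the same result. The unrolled version simply runs the loop test about half as often.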

Some machines have numerous processors. On a machine with 64 processors, the following

     FOR I := 1 TO N * 64
       A[I] := A[I] + B[I]
     ENDFOR
     

might become

     FOR I := 0 TO N - 1 DO
       FOR J := 64 * I + 1 TO 64 * I + 64 DO
         A[J] := A[J] + B[J]
       ENDFOR
     ENDFOR
     

The inner loop statement can now be executed simultaneously on all 64 processors.
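
As a rough illustration of this blocking, the sketch below processes the array in strips of 64 elements and hands each element of a strip to a worker. It assumes Python's standard `concurrent.futures.ThreadPoolExecutor` as a stand-in for the 64 processors. The function name `strip_mined_add` and the handling of a short final strip are illustrative additions, and the thread pool is not meant to demonstrate a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def strip_mined_add(a, b, width=64):
    """Add b to a elementwise, one strip of `width` elements at a time."""
    result = list(a)

    def lane(j):
        # Each lane writes a distinct element, so lanes do not interfere.
        result[j] = a[j] + b[j]

    with ThreadPoolExecutor(max_workers=width) as pool:
        for start in range(0, len(a), width):
            stop = min(start + width, len(a))  # final strip may be short
            # list() waits for every lane of this strip to finish
            # before the next strip begins.
            list(pool.map(lane, range(start, stop)))
    return result
```

The outer loop corresponds to the loop over I in the pseudocode, and each call to `pool.map` corresponds to one pass of the inner loop over J.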

The statement in the following loop contains a data dependency and cannot be effectively parallelized:

     FOR I := 1 TO AHighNumber
       A[I] := A[I - 1] + B[I]
     ENDFOR
     

Here, the value computed in one iteration of the loop is used in the next iteration (a loop-carried dependency), so the iterations cannot be unrolled and executed in parallel.
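
A small experiment makes the problem concrete. The sketch below compares the sequential loop with a naive version that computes every element at once from the old array values, which is effectively what independent processors would do. The function names and sample data are illustrative, and index 0 plays the role of the seed element A[0].

```python
def recurrence_sequential(a, b):
    """A[I] := A[I - 1] + B[I], executed in order."""
    out = list(a)
    for i in range(1, len(out)):
        out[i] = out[i - 1] + b[i]  # reads the value just written
    return out


def recurrence_all_at_once(a, b):
    """Same statement, but every iteration reads the ORIGINAL array."""
    old = list(a)
    return [old[0]] + [old[i - 1] + b[i] for i in range(1, len(old))]


a = [1, 0, 0, 0]
b = [0, 1, 2, 3]
print(recurrence_sequential(a, b))   # [1, 2, 4, 7]
print(recurrence_all_at_once(a, b))  # [1, 2, 2, 3]
```

The results differ from the third element onward, because each iteration needs a value that the previous iteration has not yet produced.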

Debugger Interaction

Producing information in the object module that the debugger can use is important, since there may no longer be a one-to-one correspondence between the code written by the programmer and the code scheduled for parallel execution. If code is scheduled after the compiler has produced it, there is not even a one-to-one correspondence between the code produced by the compiler and the code being executed under the debugger. In this case, the scheduler should leave a trail for the debugger to follow.
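
One possible shape for such a trail is a table that maps each scheduled instruction slot back to the source line it came from. The sketch below is purely hypothetical: the `Instr` record, the `schedule_with_trail` function, and the dictionary layout are assumptions made for illustration and do not describe any particular object-module or debugger format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    source_line: int  # line in the programmer's source
    text: str         # the instruction itself


def schedule_with_trail(instrs, order):
    """Reorder instrs by `order` (a permutation of indices).

    Returns the scheduled list plus a trail mapping each scheduled
    slot number to the source line of the instruction placed there.
    """
    scheduled = [instrs[i] for i in order]
    trail = {slot: ins.source_line for slot, ins in enumerate(scheduled)}
    return scheduled, trail


program = [Instr(10, "X := Y + 1"), Instr(11, "Z := Y + 2")]
scheduled, trail = schedule_with_trail(program, [1, 0])
print(trail)  # {0: 11, 1: 10}
```

With a table like this, a debugger stopped at a scheduled slot could report the corresponding source line even though the instructions no longer appear in source order.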