Compiling for distributed memory architectures
Anne Rogers and Keshav Pingali
June 10, 1991
For MIMD computers to become the computers of choice, compiler technology must improve to a point where the programmer can exploit parallelism without undue programming effort. This will be achieved when the programmer can get good performance from application programs written using high-level control and data abstractions, leaving it to the compiler and run-time system to worry about aspects of the architecture such as the number of processors and the interconnection topology, and intricacies of parallel execution such as synchronization and load balancing. Unfortunately, MIMD machines cannot be programmed like this at present. In distributed memory MIMD machines, like the Intel Hypercube and NCUBE, the programming model consists of multiple processes, each with its own address space, that communicate with each other by exchanging messages. The programmer must decompose his algorithm into processes, distribute data among the address spaces of these processes, and insert message sends and receives wherever a process needs access to a non-local data item. The need to deal with multiple address spaces results in a loss of abstraction in programming. For example, if a process needs access to element X[i,j] of a global array X that is distributed across processes, it can read the value directly if the element is mapped into its own address space; otherwise, it must exchange messages with the process on which this element resides. Furthermore, synchronization between processes is also accomplished using message passing, and doing sends and receives in the wrong order can result in cycles of deadlocked processes. These problems are absent in shared memory MIMD machines because there is a single global address space. A typical shared memory multiprocessor today supports conventional imperative languages like FORTRAN extended with constructs like DO-ALL or DO-ACROSS for explicitly requesting parallel computation and constructs like locks for synchronization. Unfortunately, inserting correct synchronization code is as difficult as writing correct operating systems code on uniprocessors, and reading and writing shared locations in parallel opens the way to inadvertent race conditions that are very difficult to find. We conclude that explicit management of parallelism in programs is too onerous and error-prone for most programmers.
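
To make this loss of abstraction concrete, the following sketch shows what a single read of X[i,j] can require of the programmer under an explicit message-passing model. It is only illustrative: it uses MPI-style calls as a modern stand-in for the vendor message libraries of the period, and the row-block ownership rule, tag values, and helper names are assumptions introduced here, not anything prescribed by the paper.

    #include <mpi.h>

    /* Assumed layout for this sketch: the global n x n array X is
     * distributed in contiguous blocks of rows, so row i lives on
     * process i / rows_per_proc. */
    static int owner_of_row(int i, int rows_per_proc) {
        return i / rows_per_proc;
    }

    /* Read the global element X[i][j] from the point of view of one
     * process.  If the element is local, it is a plain memory load;
     * otherwise the process sends a request to the owner and waits for
     * the reply, and the owner must be executing the matching receive
     * and send, or the program deadlocks. */
    double read_element(const double *X_local, int i, int j, int n,
                        int rows_per_proc, int my_rank) {
        const int TAG_REQ = 1, TAG_VAL = 2;   /* arbitrary message tags */
        int owner = owner_of_row(i, rows_per_proc);
        double value;

        if (owner == my_rank) {
            value = X_local[(i % rows_per_proc) * n + j];   /* local access */
        } else {
            int index[2] = { i, j };                        /* remote access */
            MPI_Send(index, 2, MPI_INT, owner, TAG_REQ, MPI_COMM_WORLD);
            MPI_Recv(&value, 1, MPI_DOUBLE, owner, TAG_VAL,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }
        return value;
    }

Even this small fragment forces every process to agree on the data layout and on matching send/receive pairs; this is exactly the bookkeeping that the approach described below aims to push into the compiler and run-time system.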
One alternative is to program in a language without explicit parallelism such as FORTRAN or Id. This throws the burden of exposing and managing parallelism on the compiler and run-time system. For distributed memory architectures, the compiler must decide what should execute in parallel and how data should be decomposed across the address spaces of the processes, inserting message transfers wherever a process needs access to non-local data. An extremely important consideration is locality of reference: data should be mapped in such a way that processes access local data as far as possible. This is because message passing can cost as much as two orders of magnitude more than accessing local memory. On current generation machines, most of this cost is the software overhead of packing and unpacking the message, and the transmission time of the message itself is relatively small. Therefore, the topology of the network can be ignored and a reasonable approximation of a message-passing machine is a two-level memory hierarchy in which non-local accesses are about two orders of magnitude more expensive than local accesses. On such a hierarchy, it is clear that process decomposition must attempt to locate code and the data it references in the same process. The same considerations apply when compiling for shared memory machines. The single shared address space is usually implemented using memory that is physically distributed among the processors, so that a processor can access its local memory relatively quickly but must go across the network to access non-local memory. Such machines are called Non-uniform Memory Access (NUMA) machines and exploiting locality of reference is important on these machines as well. The importance of combining parallelism detection with proper data distribution to achieve locality of reference has been eloquently summarized by Karp[11] as follows:

    we see that data organization is the key to parallel algorithms even on shared memory systems. It will take some retraining to get programmers to plan their data first and their program flow later. The importance of data management is also a problem for people writing automatic parallelization compilers ... A new kind of analysis will have to match the data structures to the executable code in order to minimize memory traffic.
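
As a rough quantification of the two-level hierarchy described above, the sketch below charges one unit for a local access and one hundred units for a non-local one (an assumed round figure for the "two orders of magnitude"), and compares the per-process cost of a simple row-wise sweep over an n x n array when the array is distributed by rows versus by columns. The loop, the distributions, and the constants are illustrative assumptions, not measurements from the paper.

    #include <stdio.h>

    /* Two-level memory model: a local access costs 1 unit, a non-local
     * access costs roughly two orders of magnitude more; 100 is an
     * assumed round number. */
    #define REMOTE_COST 100
    #define LOCAL_COST    1

    /* Per-process cost of executing its block of rows of the loop nest
     *     for i: for j: ... X[i][j] ...
     * under two data mappings.  "by_rows" means X is distributed by row
     * blocks, so every element a process touches is local; otherwise X is
     * distributed by column blocks and only 1/nprocs of the touched
     * elements are local. */
    static long per_process_cost(int n, int nprocs, int by_rows) {
        int rows_mine = n / nprocs;            /* assume nprocs divides n */
        long local, remote;
        if (by_rows) {
            local  = (long)rows_mine * n;
            remote = 0;
        } else {
            local  = (long)rows_mine * (n / nprocs);
            remote = (long)rows_mine * (n - n / nprocs);
        }
        return local * LOCAL_COST + remote * REMOTE_COST;
    }

    int main(void) {
        int n = 1024, nprocs = 16;
        printf("distributed by rows   : %ld units\n", per_process_cost(n, nprocs, 1));
        printf("distributed by columns: %ld units\n", per_process_cost(n, nprocs, 0));
        return 0;
    }

Under these assumptions the by-columns mapping is roughly two orders of magnitude more expensive, which is why the data decomposition, rather than the loop structure alone, must drive the placement of computation.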
In this paper, we report on one such system. The cornerstone of our approach is to let the mapping of data onto processors drive process decomposition. The idea underlying our approach is to enable the programmer to write and debug his program in a high-level language using standard high-level abstractions such as loops and arrays. In addition, he specifies the domain decomposition: a mapping of the data structures onto the multiprocessor. In most programs we have looked at (such as matrix algorithms and SIMPLE[7]), this is quite straightforward since the programmer thinks naturally in terms of decompositions by columns, rows, blocks, and so on. Given this data decomposition, the compiler performs process decomposition by analyzing the program and specializing it, for each processor, to the data that resides on that processor. Thus, our approach to process decomposition is "data-driven" rather than "program-driven" as are more traditional approaches[1, 19].
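
As a sketch of what such data-driven specialization might produce, consider a loop nest that updates every element of an array A which the programmer has decomposed by contiguous blocks of rows. The code below is one plausible specialization for a single process under an owner-computes style rule; the function name, the flat local layout, and the block bounds are assumptions made for illustration, and the compiler described in the paper derives the specialized code automatically from the program and the declared decomposition.

    /* The loop as the programmer writes it, in terms of the global array A:
     *
     *     for (i = 0; i < n; i++)
     *         for (j = 0; j < n; j++)
     *             A[i][j] = f(A[i][j]);
     *
     * Given the decomposition "A by contiguous blocks of rows", a
     * data-driven compiler can specialize this loop for each process so
     * that a process executes exactly the iterations whose data it owns. */
    void specialized_loop(double *A_local, int n, int my_rank, int nprocs,
                          double (*f)(double)) {
        int rows_per_proc = n / nprocs;        /* assume nprocs divides n */
        int lo = my_rank * rows_per_proc;      /* first global row owned  */
        int hi = lo + rows_per_proc;           /* one past the last row   */

        for (int gi = lo; gi < hi; gi++) {     /* only locally owned rows */
            int li = gi - lo;                  /* global -> local row     */
            for (int j = 0; j < n; j++)
                A_local[li * n + j] = f(A_local[li * n + j]);
        }
        /* Every reference above is local, so no messages are inserted.  A
         * reference such as A[i-1][j] would additionally make the compiler
         * insert a receive for the boundary row owned by process
         * my_rank - 1. */
    }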