Multiprocessor architectures are converging towards an organization in which nodes containing memory and one or more processors are connected via a fast network. Processors have access to their local memories and to a hardware-supported global address space. This organization enables high scalability at a reasonable cost. It also facilitates programming by enabling the gradual introduction of parallelism into sequential prototypes. However, the Non-Uniform Memory Access (NUMA) organization of these machines makes data locality a crucial performance factor.
Previous experimental results have shown that the best way to exploit all the available locality of a code on a NUMA architecture is to identify suitable distributions (decompositions) of both the iterations and the data, so that data elements are placed, whenever possible, in the local memories of the processors that access them. In our approach,
each processor allocates its own local data, and accesses to remote
memories are handled via explicit
put/get communication primitives. However, finding
and implementing a good decomposition by hand is a difficult
task requiring extensive analysis and complex transformations of the
sequential source code. Fortunately, we think that an advanced compiler
can alleviate this cumbersome task. Our
approach is to have the programmer write a conventional, non-annotated serial program and rely on the compiler to automatically
parallelize it, distribute the iteration and data between the
processors, and generate the communications necessary to keep global
data consistent. If such a compiler were truly successful, it would become the key tool in a highly scalable, easy-to-program computer
system.
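To illustrate the execution model described above, the following sketch mimics processors that allocate their own local data and reach remote memories only through explicit put/get primitives. The Node class and the primitive names are hypothetical, chosen purely for illustration; real NUMA systems expose analogous one-sided operations.

```python
# Minimal sketch of locally allocated data accessed via explicit
# put/get communication. All names here are illustrative, not an
# actual runtime API.

class Node:
    """A processor together with its locally allocated memory."""
    def __init__(self, rank, size):
        self.rank = rank
        self.local = [0.0] * size   # this node's slice of the global data

    def get(self, remote, index):
        """Explicitly fetch one element from a remote node's memory."""
        return remote.local[index]

    def put(self, remote, index, value):
        """Explicitly store one element into a remote node's memory."""
        remote.local[index] = value

# Two nodes, each owning half of a global array of 8 elements.
n0, n1 = Node(0, 4), Node(1, 4)
n0.local = [1.0, 2.0, 3.0, 4.0]
n1.local = [5.0, 6.0, 7.0, 8.0]

# Node 0 works on its own slice with no communication...
s = sum(n0.local)
# ...but must issue an explicit get to read an element owned by node 1.
s += n0.get(n1, 0)
print(s)
```

A good decomposition minimizes how often the explicit get/put path is taken, which is precisely what makes the placement of iterations and data a performance-critical decision.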
We have developed a new
framework that can be applied by a parallelizing compiler to find,
without user intervention, the iteration and data decompositions that
minimize the communication overhead (while taking into account such an important issue as load imbalance) in parallel programs targeted at
NUMA architectures. One of the key ingredients in our approach is the
representation of locality as a Locality-Communications
Graph (LCG) and the
formulation of the compiler technique as a "Mixed Integer
Non-Linear Programming" (MINLP) optimization problem on this graph. The
objective function and constraints of the optimization problem model
communication costs and load imbalance. The solution to this
optimization problem is a decomposition that minimizes the parallel
execution overhead. We have validated our method using several
benchmarks. The experimental results have demonstrated that the MINLP
formulation does not increase compilation time significantly and that
our framework generates very efficient iteration/data distributions for
a variety of NUMA machines.
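The flavor of such an optimization can be conveyed with a toy model. The sketch below is not the paper's actual MINLP formulation on the Locality-Communications Graph; it uses a hypothetical access pattern and a brute-force search over a tiny problem, merely to show an objective that combines communication cost with a load-imbalance penalty.

```python
# Toy illustration of selecting iteration/data decompositions by
# minimizing communication plus a load-imbalance penalty. The access
# pattern, weights, and exhaustive search are all hypothetical; the
# framework in the text solves a MINLP on the LCG instead.

from itertools import product

N_ITERS, N_PROCS = 6, 2
# access[i][j] = 1 if iteration i touches data element j
# (a hypothetical cyclic stencil-like pattern).
access = [[1 if j in (i, (i + 1) % N_ITERS) else 0
           for j in range(N_ITERS)] for i in range(N_ITERS)]

def cost(iter_map, data_map, alpha=1.0):
    """Objective: remote accesses plus weighted load imbalance."""
    # A remote access occurs when an iteration touches a data element
    # owned by a different processor.
    comm = sum(1 for i in range(N_ITERS) for j in range(N_ITERS)
               if access[i][j] and iter_map[i] != data_map[j])
    loads = [iter_map.count(p) for p in range(N_PROCS)]
    imbalance = max(loads) - min(loads)
    return comm + alpha * imbalance

# Exhaustively enumerate every iteration/data decomposition (feasible
# only at this toy size; a real compiler relies on the solver instead).
best = min(product(range(N_PROCS), repeat=2 * N_ITERS),
           key=lambda m: cost(m[:N_ITERS], m[N_ITERS:]))
best_iter, best_data = best[:N_ITERS], best[N_ITERS:]
print(best_iter, best_data, cost(best_iter, best_data))
```

For this cyclic pattern the search settles on a balanced block-style split in which each processor owns the data its iterations touch, paying only for the accesses that cross the partition boundary.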
We are now working on the implementation of our techniques in a real parallelizing compiler such as Polaris.