Multiprocessor architectures are converging towards an organization in which nodes containing memory and one or more processors are connected via a fast network. Processors have access to their local memories and to a hardware-supported global address space. This organization enables high scalability at a reasonable cost. It also facilitates programming by enabling the gradual introduction of parallelism into sequential prototypes. However, the Non-Uniform Memory Access (NUMA) organization of these machines makes data locality a crucial performance factor.
From our experimental results, we have found that the best way to exploit all the available locality of a code on a NUMA architecture is to identify suitable distributions (decompositions) for both iterations and data, in which data elements are, whenever possible, placed in the local memories of the processors accessing them. In our approach, each processor allocates its own local data, and accesses to remote memories are handled via explicit put/get communication primitives. However, finding and implementing a good decomposition by hand is a difficult task requiring extensive analysis and complex transformations of the sequential source code. Fortunately, we believe that an advanced compiler can alleviate this cumbersome task. Our approach is to have the programmer write a conventional serial (non-annotated) program and rely on the compiler to automatically parallelize it, distribute the iterations and data among the processors, and generate the communications necessary to keep global data consistent. If such a compiler were truly successful, it would become the key tool in a highly scalable, easy-to-program computer.
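To illustrate the put/get model described above, the following is a minimal sketch in plain Python. All names (`LocalMemory`, `put`, `get`) are hypothetical and only simulate the semantics of per-processor local memories with explicit remote access; they do not correspond to any real runtime API.

```python
# Simulation of per-processor local memories accessed through
# explicit put/get primitives instead of plain loads and stores.
class LocalMemory:
    """Each processor owns one LocalMemory; accesses from other
    processors must go through explicit put/get calls."""

    def __init__(self, owner):
        self.owner = owner   # processor id that allocated this memory
        self.data = {}       # address -> value

    def put(self, addr, value):
        # Explicit write into this (possibly remote) memory.
        self.data[addr] = value

    def get(self, addr):
        # Explicit read from this (possibly remote) memory.
        return self.data[addr]


# Two processors, each allocating its own local block of a global array.
mem = {p: LocalMemory(p) for p in (0, 1)}
mem[0].put("a[0]", 3.0)   # processor 0 writes locally
mem[1].put("a[1]", 4.0)   # processor 1 writes locally

# Processor 0 needs a[1]: the element is remote, so an explicit get
# (a communication) is required rather than an ordinary load.
remote = mem[1].get("a[1]")
```

The sketch makes visible why placement matters: every element placed on the "wrong" processor turns an ordinary memory access into a communication operation.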
We have developed a new framework that can be applied by a parallelizing compiler to find, without user intervention, the iteration and data decompositions that minimize the communication overhead (while taking into consideration an important related issue, load imbalance) in parallel programs targeted at NUMA architectures. One of the key ingredients in our approach is the representation of locality as a Locality-Communications Graph (LCG) and the formulation of the compiler technique as a "Mixed Integer Non-Linear Programming" (MINLP) optimization problem on this graph. The objective function and constraints of the optimization problem model communication costs and load imbalance. The solution to this optimization problem is a decomposition that minimizes the parallel execution overhead. We have validated our method using several benchmarks. The experimental results have demonstrated that the MINLP formulation does not increase compilation time significantly and that our framework generates very efficient iteration/data distributions for a variety of NUMA machines.
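As a toy illustration of the kind of optimization involved, the following sketch assigns blocks of data/iterations to processors by brute force, minimizing communication volume plus a load-imbalance penalty. This is a deliberately simplified stand-in, not the paper's actual LCG/MINLP formulation: the cost model, weights, and all names are illustrative assumptions.

```python
from itertools import product

def decompose(n_blocks, n_procs, comm, work, alpha=1.0):
    """Assign each data/iteration block to a processor.

    comm[i][j] -- communication volume paid when blocks i and j
                  end up on different processors
    work[i]    -- computation load of block i
    alpha      -- weight of the load-imbalance term
    """
    best, best_cost = None, float("inf")
    for assign in product(range(n_procs), repeat=n_blocks):
        # Communication: pay comm[i][j] whenever i and j are split.
        c = sum(comm[i][j]
                for i in range(n_blocks) for j in range(i + 1, n_blocks)
                if assign[i] != assign[j])
        # Imbalance: spread between most and least loaded processor.
        loads = [sum(w for i, w in enumerate(work) if assign[i] == p)
                 for p in range(n_procs)]
        cost = c + alpha * (max(loads) - min(loads))
        if cost < best_cost:
            best, best_cost = assign, cost
    return best, best_cost

# Four blocks with heavy communication between 0-1 and 2-3: the
# minimizer keeps each pair together while balancing the work
# across 2 processors.
comm = [[0, 10, 0, 1],
        [10, 0, 1, 0],
        [0, 1, 0, 10],
        [1, 0, 10, 0]]
work = [1, 1, 1, 1]
assign, cost = decompose(4, 2, comm, work)
```

A real compiler cannot enumerate all assignments; the point of casting the problem as MINLP over the LCG is precisely to let a solver explore this combinatorial space efficiently. The toy only shows the shape of the objective: communication cut plus an imbalance term.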
We are now working on the implementation of our techniques in a real parallelizing compiler such as Polaris.