Automatic Iteration/Data Decomposition in Distributed-Memory Multiprocessors



Research Goals

Multiprocessor architectures are converging towards an organization in which nodes containing memory and one or more processors are connected via a fast network. Processors have access to their local memories and to a hardware-supported global address space. This organization enables high scalability at a reasonable cost, and it facilitates programming by allowing the gradual introduction of parallelism into sequential prototypes. However, the Non-Uniform Memory Access (NUMA) organization of these machines makes data locality a crucial performance factor.

From previous experimental results, we have found that the best way to exploit all the available locality of a code on a NUMA architecture is to identify suitable distributions (decompositions) of both iterations and data, in which data elements are, whenever possible, placed in the local memories of the processors that access them. In our approach, each processor allocates its own local data, and accesses to remote memories are handled via explicit put/get communication primitives. However, finding and implementing a good decomposition by hand is a difficult task requiring extensive analysis and complex transformations of the sequential source code. Fortunately, we believe that an advanced compiler can alleviate this cumbersome task. Our approach is to have the programmer write a conventional serial (non-annotated) program and rely on the compiler to automatically parallelize it, distribute the iterations and data among the processors, and generate the communications necessary to keep global data consistent. If such a compiler were truly successful, it would become the key tool in a highly scalable, easy-to-program computer system.
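The owner-computes scheme described above can be illustrated with a minimal Python sketch. This is not the compiler's output, only an assumed, simplified model: each "processor" owns a contiguous block of iterations and of the array, and the halo elements at block boundaries stand in for the explicit get() operations a processor would issue to fetch data from a neighbor's local memory. The function names (`block_ranges`, `stencil_step`) are hypothetical.

```python
def block_ranges(n, p):
    """Split n iterations into p contiguous blocks (owner-computes)."""
    size = (n + p - 1) // p
    return [(i * size, min(n, (i + 1) * size)) for i in range(p)]

def stencil_step(a, p):
    """One Jacobi-like sweep b[i] = (a[i-1] + a[i+1]) / 2, computed
    block by block. Each block owns a[lo:hi]; the two halo elements
    a[lo-1] and a[hi] model values that would be fetched via explicit
    get() communication from the neighboring processors."""
    n = len(a)
    b = a[:]
    for lo, hi in block_ranges(n, p):
        for i in range(max(lo, 1), min(hi, n - 1)):
            b[i] = (a[i - 1] + a[i + 1]) / 2.0
    return b
```

Because each block touches only its own elements plus two halo values, a matching data decomposition keeps almost all accesses local, which is precisely the locality the decomposition framework tries to maximize.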

We have developed a new framework that a parallelizing compiler can apply to find, without user intervention, the iteration and data decompositions that minimize communication overhead (while also taking into account the important issue of load imbalance) in parallel programs targeted at NUMA architectures. One of the key ingredients of our approach is the representation of locality as a Locality-Communications Graph (LCG) and the formulation of the compiler technique as a "Mixed Integer Non-Linear Programming" (MINLP) optimization problem on this graph. The objective function and constraints of the optimization problem model communication costs and load imbalance. The solution to this optimization problem is a decomposition that minimizes the parallel execution overhead. We have validated our method on several benchmarks. The experimental results demonstrate that the MINLP formulation does not increase compilation time significantly and that our framework generates very efficient iteration/data distributions for a variety of NUMA machines.
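To make the shape of the optimization concrete, here is an assumed toy version in Python: it enumerates candidate block sizes for distributing n iterations over p processors and picks the one minimizing a cost combining communication and load imbalance. The real framework solves a MINLP over the LCG with a far richer model; the cost formula and the constant `comm_per_boundary` below are invented for illustration only.

```python
def cost(block, n, p, comm_per_boundary=10.0):
    """Toy cost model: communication grows with the number of block
    boundaries (halo exchanges), imbalance with the spread of work
    assigned round-robin to the p processors."""
    blocks = [min(block, n - i) for i in range(0, n, block)]
    load = [0] * p
    for k, b in enumerate(blocks):
        load[k % p] += b
    imbalance = max(load) - min(load)
    comm = comm_per_boundary * (len(blocks) - 1)
    return comm + imbalance

def best_block(n, p):
    """Pick the block size minimizing the toy cost."""
    return min(range(1, n + 1), key=lambda b: cost(b, n, p))
```

For 100 iterations on 4 processors this search settles on blocks of 25, the perfectly balanced decomposition with the fewest boundaries; the MINLP formulation plays the same role, but its objective and constraints are derived from the LCG rather than from a fixed formula.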

We are now working on the implementation of these techniques in a real parallelizing compiler such as Polaris.







Last modified -- October 2003.