Multiprocessor architectures are converging towards an organization in which nodes containing memory and one or more processors are connected via a fast network. Processors have access to their local memories and to a hardware-supported global address space. This organization enables high scalability at a reasonable cost. It also facilitates programming by enabling the gradual introduction of parallelism into sequential prototypes. However, the Non-Uniform Memory Access (NUMA) organization of these machines makes data locality a crucial performance factor.
Previous experimental results have shown that the best way to exploit all the available locality of a code on a NUMA architecture is to identify suitable distributions (decompositions) of both the iterations and the data, so that data elements are placed, whenever possible, in the local memories of the processors that access them. In our approach,
each processor allocates its own local data, and accesses to remote
memories are handled via explicit
put/get communication primitives. However, finding
and implementing a good decomposition by hand is a difficult
task requiring extensive analysis and complex transformations of the
sequential source code. Fortunately, we think that an advanced compiler
can alleviate this cumbersome task. Our
approach is to have the programmer write a conventional, non-annotated serial program and rely on the compiler to automatically
parallelize it, distribute the iteration and data between the
processors, and generate the communications necessary to keep global
data consistent. If such a compiler were truly successful, it would become the key tool in a highly scalable, easy-to-program computer
system.
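To illustrate the execution model described above, the following sketch mimics processors that allocate their own local data and reach remote memories only through explicit put/get primitives. The Node class and the primitive names are hypothetical, chosen purely for illustration; real NUMA systems expose analogous one-sided operations.

```python
# Minimal sketch of locally allocated data accessed via explicit
# put/get communication. All names here are illustrative, not an
# actual runtime API.

class Node:
    """A processor together with its locally allocated memory."""
    def __init__(self, rank, size):
        self.rank = rank
        self.local = [0.0] * size   # this node's slice of the global data

    def get(self, remote, index):
        """Explicitly fetch one element from a remote node's memory."""
        return remote.local[index]

    def put(self, remote, index, value):
        """Explicitly store one element into a remote node's memory."""
        remote.local[index] = value

# Two nodes, each owning half of a global array of 8 elements.
n0, n1 = Node(0, 4), Node(1, 4)
n0.local = [1.0, 2.0, 3.0, 4.0]
n1.local = [5.0, 6.0, 7.0, 8.0]

# Node 0 works on its own slice with no communication...
s = sum(n0.local)
# ...but must issue an explicit get to read an element owned by node 1.
s += n0.get(n1, 0)
print(s)
```

A good decomposition minimizes how often the explicit get/put path is taken, which is precisely what makes the placement of iterations and data a performance-critical decision.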
We have developed a new
framework that can be applied by a parallelizing compiler to find,
without user intervention, the iteration and data decompositions that
minimize the communication overhead (while taking into account such an important issue as load imbalance) in parallel programs targeted at
NUMA architectures. One of the key ingredients in our approach is the
representation of locality as a Locality-Communications
Graph (LCG) and the
formulation of the compiler technique as a "Mixed Integer
Non-Linear Programming" (MINLP) optimization problem on this graph. The
objective function and constraints of the optimization problem model
communication costs and load imbalance. The solution to this
optimization problem is a decomposition that minimizes the parallel
execution overhead. We have validated our method using several
benchmarks. The experimental results have demonstrated that the MINLP
formulation does not increase compilation time significantly and that
our framework generates very efficient iteration/data distributions for
a variety of NUMA machines.
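The flavor of such an optimization can be conveyed with a toy model. The sketch below is not the paper's actual MINLP formulation on the Locality-Communications Graph; it uses a hypothetical access pattern and a brute-force search over a tiny problem, merely to show an objective that combines communication cost with a load-imbalance penalty.

```python
# Toy illustration of selecting iteration/data decompositions by
# minimizing communication plus a load-imbalance penalty. The access
# pattern, weights, and exhaustive search are all hypothetical; the
# framework in the text solves a MINLP on the LCG instead.

from itertools import product

N_ITERS, N_PROCS = 6, 2
# access[i][j] = 1 if iteration i touches data element j
# (a hypothetical cyclic stencil-like pattern).
access = [[1 if j in (i, (i + 1) % N_ITERS) else 0
           for j in range(N_ITERS)] for i in range(N_ITERS)]

def cost(iter_map, data_map, alpha=1.0):
    """Objective: remote accesses plus weighted load imbalance."""
    # A remote access occurs when an iteration touches a data element
    # owned by a different processor.
    comm = sum(1 for i in range(N_ITERS) for j in range(N_ITERS)
               if access[i][j] and iter_map[i] != data_map[j])
    loads = [iter_map.count(p) for p in range(N_PROCS)]
    imbalance = max(loads) - min(loads)
    return comm + alpha * imbalance

# Exhaustively enumerate every iteration/data decomposition (feasible
# only at this toy size; a real compiler relies on the solver instead).
best = min(product(range(N_PROCS), repeat=2 * N_ITERS),
           key=lambda m: cost(m[:N_ITERS], m[N_ITERS:]))
best_iter, best_data = best[:N_ITERS], best[N_ITERS:]
print(best_iter, best_data, cost(best_iter, best_data))
```

For this cyclic pattern the search settles on a balanced block-style split in which each processor owns the data its iterations touch, paying only for the accesses that cross the partition boundary.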
We are now working on the implementation of our techniques in a real parallelizing compiler such as Polaris.