Motivation

Although languages like C, C++, Java, etc., have been widely used for many years, there is still a lack of compiler techniques able to automatically optimize the execution of real codes written in these languages. The difficulties which have prevented the development of such techniques lie mainly in the huge complexity associated with pointers and dynamic data structures. In this sense, for Fortran77 codes - in which these data structures are not allowed - there are a large number of optimizing and parallelizing compilers which successfully deal with complex code transformation even in the presence of non-linear array access functions [7] or irregular ones [6]. Clearly, similar techniques can be applied to C, C++ or Java codes if they do not resort to pointers or dynamic data structures or, in other words, if these codes can easily be translated to Fortran. However, new solutions are a must due to the wide acceptance of these programming languages and due to the fact that dynamic data structures are important tools to achieve good performance and simplify the development of complex codes. In particular, irregular codes or symbolic ones are mainly based on complex data structures which use memory pointer references in many cases.

With this motivation, our goal is to propose and implement new techniques that can be included in compilers to allow the automatic analysis of real codes based on dynamic data structures. Our main interest lies in the automatic parallelization of such kinds of codes, but the techniques we propose can also be used in the implementation of new tools to simplify the development of codes based on pointers or dynamic data structures. In a first step, we have selected the shape analysis subproblem, which aims at estimating at compile time the shape the data will take at run time. Given this information, a subsequent analysis would detect whether or not certain sections of the code can be parallelized because they access independent data regions. Furthermore, the compiler would also assist the programmer in the development of complex codes with dynamic data structures. For example, the compiler would be able to provide a description of the programmed data structure (which may not fit with the desired one) or identify at compile time that the code will fail due to an illegal access by a null pointer.

There are several ways this problem can be approached, but we focus on the graph-based methods in which the ``storage chunks'' are represented by nodes, and edges are used to represent references between them [10], [11]. In a previous work [2], we combined and extended several ideas from these previous graph-based methods, for example, allowing more than a summary node per graph among other extensions. However, we keep the restriction of one graph per sentence in the code. This way, since each sentence of the code can be reached after following several paths in the control flow, the associated graph should approximate all the possible memory configurations arising after the execution of this sentence. This restriction leads to memory and time saving, but at the same time it significantly reduces the accuracy of the method. In this work, we have changed our previous direction by selecting a tradeoff solution: we consider several graphs with more than a summary node, while fulfilling some rules to avoid an explosion in the number of graphs and nodes in each graph.

Among the first relevant studies which allowed several graphs were those developed by Jones et al. [9] and Horwitz et al. [8]. These approaches are based on a ``k-limited'' approximation in which all nodes beyond a selectors path are joined in a summary node. The main drawback to these methods is that the node analysis beyond the ``k-limit'' is very inexact and therefore they are unable to capture complex data structures. A more recent work that also allows several graphs and summary nodes is the one presented by Sagiv et al.[12]. They propose a parametric framework based on a 3-valued logic. To describe the memory configuration they use 3-valued structures defined by several predicates. These predicates determine the accuracy of the method. As far as we know, the currently proposed predicates do not suffice to deal with the complex data structures that we handle in this paper.

With this in mind, our proposal is based on approximating all the
possible memory configurations that can arise after the execution of a
sentence by a set of graphs: the *Reduced Set of Reference Shape
Graphs* (RSRSG). We see that each RSRSG is a collection of *Reference Shape Graphs* (RSG) each one containing several
non-compatible nodes. Finally, each node represents one or several
memory locations. Compatible nodes are ``summarized'' into a single
one. Two nodes are compatible if they share the same reference
properties. With this framework we can achieve accurate results
without excessive compilation time. Besides this, we cover situations
that were previously unsolved, such as detection of complex structures
(lists of trees, lists of lists, etc.) and structure permutation, as
we will see in this article.

The rest of the paper is organized as follows: Sect. 2 briefly describes the whole framework, introducing the key ideas of the method and presenting the data structure example that will help in understanding node properties and operations with graphs. These properties are described in Sect. 3 where we show how the RSG can accurately approximate a memory configuration. In Sect. 4 we formalize when two RSGs are compatible and how they can be joined, among other operations related with the RSGs included in an RSRSG. These operations have been implemented in a compiler which is experimentally validated, in Sect. 5, by analyzing several C codes based on complex data structures. In this section we also provide the time and memory requirements of the compiler to carry out the analysis of these codes. Finally, we summarize the main contributions and future work in Sect. 6.