We have extended our previous compiler to allow for the automatic detection of the data structures at each program point for codes based not only on single selectors but also on multiselectors. With this compiler we have analyzed several codes in which the dominant data structure comprises arrays of pointers.
The analyzed codes are the sparse matrix vector multiplication, sparse matrix matrix multiplication, sparse LU factorization, and the kernel of the Barnes-Hut N-body simulation. In Table 1 we present the time and memory required by the compiler to analyze these codes in a Pentium III 500 MHZ with 128 MB main memory. The first three codes were successfully analyzed in the first level of the compiler, . However, for the Barnes-Hut code the highest accuracy of the RSRSGs was obtained in the last level, , as we explain in Sect. 5.2. For the Sparse LU factorization, our compiler runs out of memory in . We now briefly describe the results for the analyzed codes.
|Level||/ /||/ /|
Here we deal with three sparse irregular codes which implement sparse matrix operations: matrix vector multiplication, , matrix by matrix multiplication, , and sparse LU factorization, .
In the two first codes, sparse matrices , , and are stored in memory as an array of pointers, , pointing to doubly linked lists which store the matrix rows. Matrix is similarly stored by columns instead of by rows. The sparse vectors and are also doubly linked lists. This can be seen in Fig. 9(a). Note that vector grows during the matrix vector multiplication process.
On the other hand, the sparse LU factorization solves non-symmetric sparse linear systems by applying the LU factorization of the sparse matrix. Here, the sparse matrix is stored by columns. However, this code is much more complex to analyze due to the matrix filling, partial pivoting, and column permutation which takes place in the factorization in order to provide numerical stability and preserve the sparseness. After the analysis process, carried out by our compiler at level L1, the resulting RSRSGs accurately represent the data structure at each program point for the three codes.
Regarding the sparse matrix vector multiplication, in Fig. 9(b) we present a compact representation of the resulting RSRSG for the last sentence of the code. Nodes where the property is true are shaded in the figures. In this RSRSG we can clearly see the three main data structures involved in the sparse matrix vector multiplication (, , and ). Each vector is represented by three nodes and the central one represents all the middle items of the doubly linked list. The sparse matrix is pointed to by pointer variable which is actually an array of pointers with the multiselector . This multiselector has, for the last sentence of the code, a single instance () representing all the positions (pointers) of the array. In the RSRSG we can see that these pointers can point to (there is no element in the row), to a single node (the row has just one entry), or to a doubly linked list of two or more elements. For the matrix matrix multiplication, matrices , , and are also clearly identified by three graphs like the one just described before. The same happens for the in-place sparse LU factorization where the resulting LU matrix is stored where the original matrix A was.
To properly interpret this graph representation of the sparse matrices we have to say that the compiler also knows that the for all the nodes and all selectors is false. This leads to several conclusions: (i) the doubly linked lists are acyclic when traversed by following just one kind of selector ( or ), since the =false points out that a node can not be pointed to twice by other nodes using selector (the same for ); (ii) different pointers of the array point to different rows, as false; (iii) besides this, the doubly linked lists do not share elements between them.
Using this information, a subsequent compiler pass would be able to identify the traversals of the rows for prefetching or locality exploiting. Furthermore, the compiler would state that the sparse matrix rows/columns can be updated in parallel for some loops of the codes, and that it is also possible to update each row/column in parallel.
This code is based on the algorithm presented in [J. Barnes and P. Hut. A Hierarchical O(n log n) force calculation algorithm. Nature v.324, December 1986] which is used in astrophysics. In fact, this application simulates the evolution of a system of bodies under the influence of gravitational forces. It is a classical gravitational N-body simulation, in which each body is modeled as a point mass. The simulation proceeds over time-steps, each step computing the net force on every body and thereby updating that body's position and other attributes. The data structure used in this code is based on a hierarchical octree representation of space in three dimensions.
In Fig. 10(a) we present a schematic view of the data structure used in this code. The bodies are stored by a single linked list pointed to by the pvar . The octree represents the several subdivisions of the 3D space. That is, the root of the tree represents the whole space, each one of its children represents a single subsquare of this space, and so on. This way, each leaf of the octree represents a subsquare which contains a single body and therefore points to this body stored in the list. Each octree node which is not a leaf has an array of eight pointers to its children.
The three main steps in the algorithm are: (i) The creation of the octree and the pointers from the elements of the octree to the elements of the list; (ii) for each subsquare in the octree, compute the center of mass and total mass for all the particles it contains; and (iii) for each particle, traverse the tree, to compute the forces on it.
All the traversals of the octree are carried out in the code in recursive calls. Due to the fact that our compiler is still not able to perform an interprocedural analysis, we have manually carried out the inlining of the subroutine and the recursivity has been transformed into a loop. This loop uses a stack pointing to the nodes which are referenced during the octree traversal. This stack is also considered in Fig. 10 (a) and obtained in the corresponding RSRSG, Fig. 10 (b). The first step of the code, (i), is successfully analyzed in level L1 but the best accurate description of the data structures used in steps (ii) and (iii) are obtained in level L3.
However, regarding Table 1, there is paradoxical behavior that deserves explanation: expends less time and memory than and . In L3 SHARED and SHSEL remain false for more nodes and links which leads to more nodes and links being pruned during the abstract interpretation and graph compression phase of the symbolic execution of the sentences. This leads to significantly reducing the number of nodes and graphs, which reduces memory and time requirements.