next up previous
Next: Compression of graphs Up: Reference Shape Graph Previous: Reference Shape Graph

Node properties

Now we focus on $ F_n$ which extracts some important properties from a memory location and, depending on these, this location is translated into a node. Besides this, if several memory locations share the same properties then this function maps all of them into the same node of the $ RSG$. Due to this dependence on the location properties, the $ F_n$ function will be described during the presentation of the different properties which characterize each node. These properties are: Type, Structure, Reference pattern, Share information, Simple Paths, Touch information, and Cycle links. These are now described.

1.- The node TYPE property states the data type of the memory locations represented by this node. Therefore, two memory locations can be mapped into the same node if they share the same TYPE value. This property leads to the situation where, for the data structure presented in Fig. 3(a), the nodes representing list memory locations will not be summarized with those nodes representing tree locations, as we can see in Fig. 3 (b).

2.- The STRUCTURE information avoids the summarization of two nodes which represent memory locations of the same type but which do not share any element (they are non-connected components). Two memory locations have the same STRUCTURE value if there is a path between them. Again, two memory locations can be represented by the same node if they have the same STRUCTURE value.

3.- The Reference patterns are introduced to represent by different nodes the memory locations with different reference patterns (type of selectors which point to and from this memory location). This keeps singular memory locations of the data structure in separate nodes. For example, the roots and the leaves of the data structure in Fig. 3 (a) are two kinds of memory locations. The root of one of the trees is referenced by the header list and the leaves do not point to tree items but to a doubly linked list. Thanks to the reference patterns, the method results in the RSRSG of Fig. 3 (b), where the root of the tree, the leaves, and the other tree items are clearly identified.

In order to obtain this behavior, we define two sets SELINset and SELOUTset which contain the set of input/output selectors for a certain location: SELINset $ (l_1) = \{ sel_i \in S \vert
\exists l_2 \in L, <l_2, sel_i, l_1> \in LS \}$, and SELOUTset $ (l_1) = \{sel_i \in S \vert \exists l_2 \in L, <l_1, sel_i,
l_2> \in LS \}$ where we see that $ sel_i$ is in the SELINset($ l_1$) if $ l_1$ is referenced from somewhere by selector $ sel_i$, or $ sel_i$ is in SELOUTset($ l_1$) if $ l_1.sel_i$ points to somewhere outside. Our method can eventually lead to being unable to exactly determine whether or not a certain selector is in an input or output set for a node, $ n$. This way, we also define PosSELINset($ n$) and PosSELOUTset($ n$) which contain the selectors which ``may'' point to/from node $ n$. To know if two nodes can be summarized we have defined a Boolean function C_REFPAT $ (n_1, n_2)$ that returns true if two nodes represent memory locations with a similar reference pattern. Basically, we allow the summarization of two nodes if both of them have the same input and output selectors or we do not know the input or output selectors of one or both nodes.

4.- The Share information can tell whether at least one of the locations represented by a node is referenced more than once from other memory locations. Due to the relevance of this property to inform the compiler about the potential parallelism exhibited by the analyzed data structure, we use two kinds of attributes for each node: (i) SHARED$ (n)$ with $ n \in
N(rsg)$, is a Boolean function which returns ``true'' if any of the locations, $ l_1$, represented by $ n$ can be referenced by other locations, $ l_2$ and $ l_3$, by different selectors, $ sel_i$ and $ sel_j$; and (ii) SHSEL$ (n, sel)$ with $ n \in
N(rsg)$ and $ sel \in S$, is a Boolean function which returns ``true'' if any of the memory locations, $ l_1$, represented by $ n$ can be referenced more than once by selector $ sel$ from other locations, $ l_2$ and $ l_3$.

Let's illustrate these SHARED and SHSEL properties using the compact representation of the RSRSG presented in Fig. 3 (b). In this figure, shaded nodes have the SHARED property set to true. For example, in the header list the middle node $ n_2$ is shared, SHARED$ (n_2)$=1, because $ n_2$ is referenced by selectors $ nxt$ and $ prv$. However, the SHSEL$ (n_2,nxt)$=SHSEL$ (n_2,prv)$ =0 which means that by following selector $ nxt$ or $ prv$ it is not possible to reach an already visited memory location. Actually, in this example, there are no selectors with the SHSEL property set to true. Thus, the same happens for node $ n_8$ which represents the middle items of the doubly linked lists. We can also see in Fig. 3 (b), that node $ n_4$ is not shared, which states that, in fact, from memory locations represented by $ n_1$, $ n_2$, and $ n_3$ we always reach different trees which do not share any elements (as we see in Fig. 3 (a)). Finally, node $ n_7$ is shared because it is pointed to by selectors $ list$ and $ prv$. However, due to SHSEL $ (n_7,list)$=0 we can ensure that two different leaves of the trees will never point to the same doubly linked list.

Now, two nodes can be summarized if they have the same SHARED and SHSEL attributes.

5.- Simple paths denominates the access path from a pointer variable (pvar) to a location or node if the length of this path is less than or equal to 1. An example of a simple path is $ p
\rightarrow s.sel \rightarrow t$ in which the pvar $ p$ points to the location $ s$ which points to the location $ t$ using the selector $ sel$. Note that, in this example, the simple path for $ t$ is $ <p,sel>$ and the simple path for $ s$ is $ <p,\emptyset>$.

The use of the simple path in our analysis is justified by the need for an accurate approximation of memory locations by nodes. As we will see in the next example, there are codes in which the shape of the data structure is accurately approximated by the RSG if we avoid the summarization of nodes which are ``near'' the pvar. With the simple path we consider that a node is ``near'' the pvar if this pvar points to the node directly, or by a path of length one (in other words, there is a single hop from the pvar to the node).

The example we mentioned above is the following: in Fig. 4 (a) we show the RSG resulting from analyzing a doubly linked list traversed by two pvars, $ x$ and $ y$.

Figure 4: Motivation of the SPATH property.
(a) & (b)

The RSG resulting from the method not using the SPATH property is presented in Fig. 4 (b) in which, instead of keeping different sections of the list apart, these two sections are fused in a single node. This summarization can eventually hide the real properties of the data structure and therefore we avoid this summarization by using the SPATH property as we describe next.

We define, for a memory location $ l \in L(m)$, SPATH($ l$) = $ \{sp_1, ..., sp_n\}$ where $ sp_i = <pvar, sel^{*}>$ with $ pvar \in P$. In other words, SPATH($ l$) is the set of simple paths (pointer variables and selectors) which point to $ l$ directly ( $ sel^*=\emptyset$) or through an intermediate location, $ l_i$ ( $ <pvar, l_i> \in PS(m) \wedge <l_i, sel^*, l> \in LS(m)$). The length of a simple path, $ sp_i = <pvar, sel_i>$, LEN $ (sp_i) = 0$ if $ sel_i = \emptyset$ or $ 1$ if $ sel_i = sel \in S$.

To determine when two nodes can be summarized, we define a Boolean function C_SPATH $ (n_1, n_2, m)$ which returns true if nodes $ n_1$ and $ n_2$ have compatible SPATH. The parameter $ m$ imposes the constraints in the comparison of nodes. If $ m = 0$ there are less restrictions to summarize two nodes. In this case, two node SPATHs are compatible if they comprise the same zero-length simple paths. This particular case of the C_SPATH function is called C_SPATH0. On the contrary, if $ m > 0$, the two SPATHs also have to share at least $ m$ one-length simple paths. We have found in our experiments that a tradeoff value for the $ m$ parameter is one, for which we can obtain a good description of the data structure without an excessive growth of nodes. In this case the function is called C_SPATH1.

6.- Touch information. Until now, we have seen properties which capture some important issues of the memory configuration and which change according to the current sentence. However, sometimes it is necessary to keep track of the memory locations for several sentences to increase the accuracy of the RSRSGs. For example, if we deal with a list data structure, we know that all the middle elements (not pointed to by any pvar) are going to be summarized in a single node. Now, when traversing this list in a loop, the same summary node will represent non-visited locations as well as visited ones. On the other hand, during one acyclic traversal of the list, it would be better to keep the visited locations in a separate node in such a way that new changes only affect non-visited nodes.

In order to achieve this behavior in the compiler we assign to each node a new property called TOUCH. This property is taken into account only inside loop bodies. In this case, the TOUCH information of a node is the set of pvars from which the memory locations represented by the node have been visited. We understand by ``pvar $ x$ visit node $ n$'' that the node $ n$ has been referenced by $ x$. For example, in $ x = y$, the node pointed to by $ y$ is visited by $ x$. In the same way, in $ x = y\rightarrow sel$, the node pointed to by selector $ sel$ from node $ y$ is visited by $ x$.

Now, two memory locations can be summarized in the same node if they have been ``touched'' by the same set of pvars. In this way we can keep visited nodes apart from non-visited ones. However, clearly this new restriction in the summarization will increase the number of nodes in the RSRSGs. In order to avoid the explosion in the number of nodes we have to constrain the kind of pvar which can appear in the TOUCH set. More precisely, only those pvars which are used to traverse dynamic data structures (called induction pointers by Yuan-Shin Hwang [13] or navigators by Rakesh Ghiya [5]) are eligible to be included in the set. Taking all this into account, we can finally define for $ n \in
N(rsg)$: TOUCH $ (n) = \{ipvar \vert ipvar \in iP\}$ where $ iP$ is the set of induction pvars found in the code. Clearly, there should be a preprocessing compiler pass to identify inductions pvars in the code. Due to space constraints we cannot describe this preprocessing pass but it is based on Access Path Expressions [13], which points out the entry point to a dynamic data structure and the path from this point to the position currently pointed to by a certain pvar.

Finally, the compiler also implements an additional improvement to save space and time. The idea is that after exiting a loop body the TOUCH information regarding the ipvars of this loop are not needed any more. This way, the compiler removes those ipvars associated with this loop from the TOUCH set of the nodes.

7.- The Cycle links property is introduced to increase the accuracy of the data structure representation by avoiding unnecessary edges that can appear during the RSG updating process. The cycle links of a node, $ n$, are defined as the set of pairs of references $ <sel_i,
sel_j>$ such that when starting at node $ n$ and consecutively following selectors $ sel_i$ and $ sel_j$, the $ n$ node is reached again. More precisely, for $ n \in
N(rsg)$ we define: CYCLELINKS $ (n) = \{<sel_i, sel_j> \vert sel_i, sel_j \in S\}$ such that if $ <sel_i, sel_j> \in$   CYCLELINKS$ (n)$ then $ \forall l_i,
F_n(l_i) = n$, if $ <l_i, sel_i, l_j> \in LS$ then $ \exists <l_j,
sel_j, l_i> \in LS$

This property is very useful for dealing with doubly linked structures. For example, in the data structure presented in Fig. 1 (a), the elements in the middle of the doubly linked lists, represented by node $ n_2$, have two cycle links: $ <nxt,prv>$ and $ <prv,nxt>$, due to starting at a list item and consecutively following selectors $ nxt$ and $ prv$ (or $ prv$ and $ nxt$) the starting item is reached.

We conclude here that the CYCLELINKS property is used during the pruning process. Thus, in contrast with the other six properties already described, the CYCLELINKS property does not prevent the summarization of two nodes which do not share the CYCLELINKS sets and therefore do not affect the $ F_n$ function.

next up previous
Next: Compression of graphs Up: Reference Shape Graph Previous: Reference Shape Graph
Rafael Asenjo Plaza 2002-02-19