Basically, our method is based on approximating all possible memory configurations that can appear after the execution of a sentence in the code. Note that due to the control flow of the program, a sentence could be reached by following several paths in the control flow. Each ``control path'' has an associated memory configuration which is modified by each sentence in the path. Therefore, a single sentence in the code modifies all the memory configurations associated with all the control paths reaching this sentence. Each memory configuration is approximated by a graph we call Reference Shape Graphs (RSG). Thus, taking all this into account, we conclude that each sentence in the code will have a set of RSGs associated with it. This set of RSGs will describe the shape of the data structure after the execution of this sentence.
The calculation of this set of graphs is carried out by the symbolic execution of the program over the graphs. In this way, each program sentence transforms the graphs to reflect the changes in the memory configurations derived from the sentence execution.
The RSGs are graphs in which nodes represent memory locations which have similar reference patterns. Therefore, a single node can safely and accurately represent several memory locations (if they are similarly referenced) without losing their essential characteristics.
To determine whether or not two memory locations should be represented by a single node, each one is annotated with a set of properties. Now, two different memory locations will be ``summarized'' in a single node if they fulfill the same properties. Note that the node inherits the properties of the memory locations represented by this node. Besides this, two nodes can be also summarized if they represent ``summarizable'' memory locations. This way, a possibly unlimited memory configuration can be represented by a limited size RSG, because the number of different nodes is limited by the number of properties of each node. These properties are related to the reference pattern used to access the memory locations represented by the node. Hence the name Reference Shape Graph.
As we have said, all possible memory configurations which may arise after the execution of a sentence are approximated by a set of RSGs. We call this set the Reduced Set of Reference Shape Graphs (RSRSG), since not all the different RSGs arising in each sentence will be kept. On the contrary, several RSGs related to different memory configurations will be fused when they represent memory locations with similar reference patterns. As we will see, there are also several properties related to the RSGs, and two RSGs should share these properties to be joined. Therefore, besides the number of nodes in an RSG, the number of different RSGs associated with a sentence are limited too. This union of RSGs greatly reduces the number of RSGs and leads to a practicable analysis.
The symbolic execution of the code consists in the abstract interpretation of each sentence in the code. This abstract interpretation is carried out iteratively for each sentence until we reach a fixed point in which the resulting RSRSG associated with the sentence does not change any more . This way, for each sentence that modifies dynamic structures, we have to define the abstract semantics which describe how these sentences modify the RSRSG. We consider six simple instructions that deal with pointers: , , , , , and . More complex pointer instructions can be built upon these simple ones and temporal variables.
The output RSRSG resulting from the abstract interpretation of a sentence over an input RSRSG is generated by applying the abstract interpretation to each RSRSG. Due to space constraints we cannot formally describe the abstract semantics of each one of these sentences. However, we intuitively present the modifications involved in an RSG after the sentence symbolic execution:
All these processes can be better illustrated by a simple example. In Fig. 1 (a) we see an RSG representing a doubly linked list of two or more elements. Actually, represents the first element in the list, the middle elements, and the last one. Let us suppose that this RSGs is an input to the sentence where is pointing to the memory location represented by node . As we said, the first step in the abstract interpretation of this sentence is the division operation. Figure 1 (b) shows the resulting and after the division. Note that in each one of these graphs there is a single destination for . This division process is formally described in Sect. 4.1. In Fig. 1 (c) we show the result of the pruning process in which the compiler removes nodes and links which do not fulfill the graphs' properties. In fact, represents a list of three or more elements and is clearly a list of just two items. This pruning process is formally described in Sect. 4.2. Now, before removing the link in both graphs, the compiler has to focus more on one of the RSGs. More precisely, in , we have to materialize from node the node which represents the single list item referenced by , as we can see in Fig. 1 (d). Finally, we see in Fig. 1 (e) how we safely remove the link in both graphs to obtain the final and .
From a higher perspective, the whole symbolic execution process can be seen by looking at Fig. 2. For each sentence in the code we have an input RSRSG and the corresponding output RSRSG resulting from the modifications which take place in the memory configurations after the sentence execution. As we said, the RSRSG comprises several to capture all the memory configurations associated with each path in the control flow graph. During the symbolic execution of the sentence all these are going to be updated. The first step comprises the graph division and pruning processes after which we obtain several . Then the abstract interpretation of the sentence takes place and usually the complexity of the RSGs grows. In order to counter this effect, the compiler carries out a compression of the graph phase in which each RSG is simplified by the summarization of compatible nodes in the RSG, to obtain the graphs. This step is formally described in Sect. 3.2. Furthermore, some of the s can be fused into a single if they represent similar memory configurations. This operation greatly reduces the number of RSGs in the resulting RSRSG. This graph union operation is described in Sect. 4.3.
All the operations just enumerated in this section (RSG division, pruning, node summarization, compression of graphs, joining of compatible RSG, etc.) are described in the next two sections. In order to better illustrate these operations, we present a data structure example which will be referred to during the framework and operations description.