Near-Optimal Loop Tiling by means of Cache Miss Equations and Genetic Algorithms

Jaume Abella, Antonio González, Josep Llosa
Computer Architecture Department
Universitat Politècnica de Catalunya
Barcelona (Spain)
{jabella, antonio, josepll}@ac.upc.es

Xavier Vera
Institutionen för Datateknik
Mälardalens Högskola
Västerås (Sweden)
xavier.vera@mdh.se

Abstract

The effectiveness of the memory hierarchy is critical for the performance of current processors. The performance of the memory hierarchy can be improved by means of program transformations such as loop tiling, which is a code transformation targeted to reduce capacity misses. This paper presents a novel systematic approach to perform near-optimal loop tiling based on an accurate data locality analysis (Cache Miss Equations) and a powerful technique to search the solution space that is based on a genetic algorithm. The results show that this approach can remove practically all capacity misses for all considered benchmarks. The reduction of replacement misses results in a decrease of the miss ratio that can be as significant as a factor of 7 for the matrix multiply kernel.

1 Introduction

Memory performance is critical for the performance of current computers. Memory is organized hierarchically in such a way that the upper levels are smaller and faster. The uppermost level typically has a very short latency (e.g. 1-2 cycles) but the latency of the lower levels may be a few orders of magnitude longer (e.g. main memory latency may be around 100 cycles). Thus, techniques to keep as much data as possible in the uppermost levels are key to performance.

In addition to the hardware organization, it is well known that the performance of the memory hierarchy is very sensitive to the particular memory reference patterns of each program. The reference patterns of a given program can be changed by means of transformations that do not alter the semantics of the program. These program transformations can modify the order in which some computations are performed or can simply change the data layout. Loop Tiling is an example of the former family of techniques. Loop tiling [1, 2] is a technique based on a combination of strip-mining and loop interchange.

Loop tiling has a significant potential to remove cache misses. Loop tiling can remove most capacity misses by restructuring the loop and changing the order in which statements are executed. However, finding the optimal loop tiling for a given program is a very complex task, since the options are almost unlimited and exploring all of them is infeasible. For very simple programs, the programmer intuition may help but in general, a systematic approach that can be integrated into a compiler and can deal with any type of program is desirable. This systematic approach requires the support of a locality analysis method in order to assess the performance of different alternatives.

In this paper, we propose an automatic approach to perform loop tiling in numeric codes. It is based on a very accurate technique to analyze the locality of a program that is known as Cache Miss Equations (CMEs) and a genetic algorithm in order to search the solution space. The proposed genetic algorithm converges very fast, and although it does not guarantee that the optimal solution is found, we show that after loop tiling, the replacement miss ratio of the evaluated benchmarks is almost negligible. Therefore, for these programs the results are near-optimal. The rest of this paper is organized as follows. Section 2 overviews the locality analysis approach. Section 3 presents the loop tiling technique and its performance is evaluated in section 4. Section 5 outlines some related work and section 6 summarizes the main conclusions of this work.

2 Memory Locality Analysis

To effectively transform a code in order to optimize memory performance, an effective memory locality analysis is required in order to assess the different alternatives. In this section, we describe the locality analysis technique
used in this work to perform loop tiling.

In particular, we use Cache Miss Equations (CMEs) [3] to represent the cache behavior. Cache Miss Equations are a very accurate analytical model of the cache memory. They describe the cache behavior by means of diophantine equations, which allows us to use mathematical techniques to compute the locality of each memory reference. Unfortunately a direct solution of these equations is computationally intractable due to its NP nature. Statistical methods and some techniques based on polyhedra theory have been proposed to solve the equations in a reasonable amount of time [4, 5].

### 2.1 Overview of Cache Miss Equations

Cache Miss Equations are a set of equations\(^1\) that represent all the potential cache misses for the references in a loop nest. They describe the precise relationship among the iteration space, array sizes, base addresses, and the cache parameters for a loop nest. This section presents an overview of the CMEs. For more details about CMEs see [3, 6].

In order to generate CMEs, the reuse vectors [7] of all the references in a loop nest must be generated. Reuse vectors provide information about the potential reuses in the entire iteration space. Figure 1 shows the matrix multiply kernel. For instance, \(\vec{r} = (0, 0, 1)\) is a reuse vector for reference \(c(k,j)\), because the data accessed by this reference in a given iteration is potentially reused one iteration later (both are potentially mapped into the same cache line). In order to determine if these potential reuses are realized, CMEs are generated and studied (it is not necessary to solve the equations to find realized reuses).

For every reuse vector of a reference two types of CMEs are generated:

- **Compulsory equations.** Compulsory equations represent the first time a memory line is brought into the cache.
- **Replacement equations.** Replacement equations represent the interferences with any other reference. For each pair of references \((R_A, R_B)\), the following expression gives the condition that determines whether they are mapped onto the same cache set:

\[
\text{Cache Set}(\vec{r})_{R_A} = \text{Cache Set}(\vec{j})_{R_B}, \quad \vec{j} \in \mathcal{I}
\]

where \(\mathcal{I}\) represents the iteration points between \(\vec{r}\) (the current one) and the iteration point from which \(R_A\) reuses. This condition is expanded into a set of equations for each reuse vector.

### 2.2 Finding Cache Misses from CMEs

Deciding whether a reference causes a miss or a hit for a given iteration point is equivalent to deciding whether this iteration point belongs to the polyhedra defined by the CMEs. The points inside each CMEs polyhedron represent the potential cache misses (the number of points is the number of potential cache misses). This leads us to consider several ways for computing them:

- **Solver.** Given a reference \(R\) with \(m\) reuse vectors and \(n_k\) equations for the \(k^{th}\) reuse vector, the polyhedron that contains all the iteration points that result in a miss is [3]:

\[
\text{Set Misses} = \bigcap_{k=1}^{m} \bigcup_{j=1}^{n_k} \text{Solution Set Equation}_j
\]

This approach implies counting the number of points inside the union of convex polyhedra. This requires counting the points (which is an NP problem for a single polyhedron) in an exponential number of polyhedra, making this problem infeasible due to its huge computing time.

- **Traversing the iteration space.** Given a reference, all the iteration points can be tested independently [6]. In order to know if an iteration point \(\vec{i}_0\) results in a miss we need to know when it fulfills the CMEs. This problem is equivalent to finding out whether, after substituting the iteration point in the CMEs, the resulting polyhedron is non-empty. This is still an NP problem, but only a linear number of polyhedra must be analyzed for each iteration point. However, the problem is still infeasible except for tiny iteration spaces.

Moreover, in a k-way set associative cache, there are \(k\) cache lines in every set, so \(k\) distinct contentions are needed before a cache miss occur. Therefore, the first method can only be applied to direct-mapped caches whereas the second method works for both direct-mapped and set-associative organizations.

---

\(^1\) The term equation is loosely used to refer to a set of simultaneous equalities and inequalities.

---

Figure 1. Matrix multiply algorithm

```
parameter (N)
REAL a(N,N), b(N,N), c(N,N)
do i = 1, N
do j = 1, N
do k = 1, N
   a(i,j) = a(i,j) + b(i,k) * c(k,j)
enddo
endo
endo
```

![Matrix multiply algorithm](image-url)
2.3 A Fast and Accurate Implementation to Solve CMEs

In this section we describe a fast and accurate approach to estimate the solution to the CMEs. Our approach builds upon the second method to solve the CMEs (traversing the iteration space). A key property of CMEs is that each point of the iteration space and each memory reference can be studied independently of the others.

We have developed some techniques that exploit the special characteristics of the CMEs polyhedra [4] in order to speed-up the process of counting points in polyhedra:

- **Counting Compulsory Polyhedra.** When an iteration point is substituted, Compulsory Equations result in a polyhedron with either 0 or 1 variable. A polyhedron with 0 variables consists of a set of inequality relations between integer values. The iteration point that has been substituted is a potential miss if the inequalities hold. On the other hand, since a polyhedron with 1 variable represents an interval, the iteration point results in a potential miss when there exist integer values in it.

- **Counting Replacement Polyhedra.** Due to the particular form of these polyhedra, the number of integer solutions can be computed in a more efficient way than in general polyhedra. We have developed a method to detect when they are empty that is based on counting the number of integer points inside them [8]. In order to compute it, the domains of the different variables involved in its definition are calculated. This can be done by means of the vertices of the polyhedron, but computing them is an NP problem. We have developed specific techniques for replacement polyhedra that compute the domains of the variables in a polynomial time. A detailed description of these techniques can be found in [8].

The use of these techniques results in an average speed-up of 20 over a method based on identifying the vertices of the polyhedra. However, this important speed-up is still insufficient to solve CMEs for huge iteration spaces in a reasonable amount of time.

To further reduce the computation cost, we use sampling techniques to study a subset of the iteration space instead of the whole iteration space [5] [9]. The subset of points is selected using **Simple Random Sampling** [10].

We model the number of misses of each reference using a Discrete Random Variable. This random variable follows a Binomial distribution that models phenomena consisting of different and independent experiments that follow a Bernoulli distribution. We can use statistical techniques [11] in order to compute the parameters that describe this random variable. The approach to obtain an approximation of the miss ratio is to evaluate the behavior of a subset of the population (sample) obtaining the empirical value of the parameters that describe the sample and to infer these values to the population.

The size of the sample is set according to the required width of the confidence interval and the desired confidence. We found experimentally that a confidence interval of width 0.1 and a 90% confidence is enough to obtain accurate miss ratios with a very small computing time (only 164 points of the iteration space must be explored). In this way, the miss ratio is computed as an interval of width 0.1 and the actual miss ratio belongs to this interval with a probability of 90%. The central point of this interval can be used as an estimation of the actual miss ratio.

2.4 CME for Multiple Convex Regions

CME are defined for iteration spaces that are single convex regions, but after tiling \( n \) dimensions the iteration space is the union of \( 2^n \) convex regions. Figure 2 shows how the iteration space of a one-dimensional loop becomes a two-convex region iteration space after tiling. The shaded regions correspond to the different convex regions before and after tiling.

There are different ways to solve this problem. The easiest option would be to use only one convex region that approximates the actual non-convex region. This convex region can be the smallest parallelepiped that includes all other convex regions (see figure 2 (c)) or alternatively, the region which does not include the last iteration of every tiled loop where the tile size is not a divisor of the upper bound (see figure 2 (d)). Both options have drawbacks. The first option includes in the convex regions points outside the iteration space, while the second option does not include points belonging to the iteration space.

Because of this, we have decided to implement a more accurate solution. The CME implementation has been modified to deal with multiple convex regions by defining the equations for every convex region and solving for every analyzed point the equations corresponding to the convex region in which the point is contained. Let \( n \) be the number of convex regions of a loop after tiling. Every compulsory equation should be defined for each convex region, so the number of compulsory equations is increased by a factor of \( n \). For each reuse vector, we have to generate a set of replacement equations for each convex region. In addition, we have to generate a set of equations for every pair of convex regions that reflect the potential reuse between different regions. Due to this, the number of replacement equations is increased by a factor of \( n^2 \).
3 Loop Tiling

In this section we present our proposal to perform loop tiling. Tiling [12, 1] is a transformation which combines strip-mining [1] with loop interchange to form small tiles [1] of loop iterations in order to increase the data locality. In this section $T_i$ and $U_i$ stand for the tile size and upper bound respectively of loop $i$ in the original loop nest. Figure 3 shows an example of a loop before (a) and after (b) loop tiling.

3.1 Model

The target of our work is to obtain the values of the variables $T_i$ that minimize the number of misses. Note that loop tiling changes only the order in which the original iteration space is traversed, so the number of compulsory misses before and after tiling remains constant. Because of this, our work focuses on minimizing the number of replacement misses, which include both capacity and conflict misses.

Let $f$ be the function that represents the number of replacement misses for each possible value of the tile size variables:

$$ f : [1, U_1] \times \ldots \times [1, U_k] \rightarrow \mathbb{Z} $$

Our problem can be expressed as follows:

$$ \text{MIN } f(T_1, \ldots, T_k) $$

$$ 1 \leq T_i \leq U_i $$

$$ i = 1 \ldots k $$

where $f$ is called the objective function.

The objective function consists of the CMEs generated in a parameterized way, where the parameters are the tile sizes.

Since $f$ is a pseudo-polynomial function [13], the relationship between loop tiling and the number of misses is nonlinear. $T_i$ can take only integer values, thus, our problem can be seen as a nonlinear integer optimization (NLP) one.

Many researchers have studied NLP [14, 15]. A well studied case is the one where the constraints are linear (named linearly constrained optimization). A special case
Final text
Genetic operators provide the basic search mechanism by creating new solutions based on the solutions that exist. The selection of individuals to produce successive generations plays an extremely important role. A common selection approach assigns probability of selection to each individual depending on its fitness. Individuals with higher fitness have a higher probability of contributing one or more offsprings to the next generation. We have adopted one of the selection schemes that gives better results, which is known as remainder stochastic selection without replacement [23]. Let us call \( N \) the size of the population (number of individuals). This selection scheme consists in choosing \( N \) individuals from the \( N \) individuals of the previous generation. In this selection process a given individual can be selected more than once. The chosen individuals are grouped forming pairs and crossover is applied to each pair with a given probability. In the case they do not crossover, both individuals are added to the new population (see Figure 5).

**Crossover** takes two individuals and produces two new individuals merging the genetic material in a random point (named cross site). **Mutation** is applied after crossover to each individual with a given probability. Mutation changes one individual to produce a new one by flipping some of its genes. Both crossover probability and mutation probability have to be determined empirically, and are related to the size of the population. In our case they do not crossover, both individuals are added to the new population (see Figure 5).

In our own polyhedra representation [4].

---

**4 Performance Evaluation**

This section evaluates the proposed loop tiling approach. Examples of its use will be given, as well as its accuracy.

**4.1 Experimental Framework**

CMEs have been implemented for fortran codes through the Polaris Compiler [26] and the Ictineo library [27]. These libraries allow us to obtain all the compile-time information needed to generate the equations.

The evaluation of CMEs has been implemented in C++ following the techniques outlined in section 2.3 and using our own polyhedra representation [4].

Due to CMEs restrictions, only perfectly nested loops in...
Figure 7. Genetic Algorithm used to perform loop tiling

which the array subscript expressions are affine functions of the induction variables are analyzed [3].

The loop nests considered are some kernels from different programs (NAS3, BIHAR4, LIVERMORE) and some frequently used kernels (see Table 1). These loops have been chosen because they exhibit high number of capacity misses. Results for different cache architectures are reported.

4.2 Examples of some Kernels

First of all we show the effectiveness of the loop tiling technique by means of some well-known kernels (see Table 2). We have evaluated their miss ratio for a 8KB direct-mapped cache with 32-byte lines.

Table 2 shows the results. Column 2 shows the problem size. Columns 3 and 4 show the total and replacement miss ratio before loop tiling respectively, whereas columns 5 and 6 show the total and replacement miss ratio after tiling respectively. We can see that after tiling the replacement miss ratio is near zero for all kernels. This indicates that loop tiling has removed almost all replacement misses. The remaining replacement misses probably are conflict misses which could be removed by means of techniques like padding.

4.3 Performance Evaluation for some Kernels

In this section we present the replacement miss ratio of different direct-mapped caches (8KB and 32KB) before and after applying loop tiling for all the benchmarks of the table 1. Results are shown in Figures 8 and 9, where the number following the program name corresponds to the problem size.

It can be seen that for most programs practically all replacement misses are removed after tiling, which implies that near-optimal solutions have been found for these programs. However, for some kernels (ADD, BTRIX, VPENTAI and VPENTAA) the replacement miss ratio obtained after tiling is still quite high for all cache sizes. For some others (ADI1000 and ADI2000) replacement misses after tiling are only significant for a 8KB cache. We have analyzed these cases carefully and we have observed that most of the remaining replacement misses are due to conflicts and they cannot be removed by loop tiling. For these programs we have investigated the combination of padding and tiling techniques. Padding parameters are obtained in a similar way to tiling ones. They are introduced in the CMEs and a GA is used to find near-optimal solutions. Details can be found in [28].

Table 3 shows the replacement miss ratios obtained for those kernels where the replacement miss ratio is still high after loop tiling. Column 2 shows the original replacement miss ratio; column 3 lists the replacement miss ratio after

<table>
<thead>
<tr>
<th>Kernel</th>
<th>Program</th>
<th>Nested loops</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>T2D</td>
<td>NAS</td>
<td>2</td>
<td>2D Matrix transposition</td>
</tr>
<tr>
<td>T3DJIK</td>
<td>3</td>
<td>3</td>
<td>3D Matrix transposition</td>
</tr>
<tr>
<td>T3DIKJ</td>
<td>3</td>
<td>3</td>
<td>3D Matrix transposition</td>
</tr>
<tr>
<td>JACOBI3D</td>
<td>3</td>
<td>Partial differential equations solver</td>
<td></td>
</tr>
<tr>
<td>MATMUL</td>
<td>LIVER</td>
<td>3</td>
<td>Matrix by vector multiplication</td>
</tr>
<tr>
<td>MM</td>
<td>LIVER</td>
<td>3</td>
<td>Matrix multiplication</td>
</tr>
<tr>
<td>ADD</td>
<td>NAS</td>
<td>4</td>
<td>Addition of update to a matrix</td>
</tr>
<tr>
<td>BTRIX</td>
<td>NAS</td>
<td>3</td>
<td>Block Tri-diagonal solver. Backward block sweep</td>
</tr>
<tr>
<td>VPENTAI</td>
<td>2</td>
<td>2</td>
<td>Invert 3 pentadagonals simultaneously. Loop 1</td>
</tr>
<tr>
<td>VPENTAJ</td>
<td>NAS</td>
<td>2</td>
<td>Invert 3 pentadagonals simultaneously. Loop 2</td>
</tr>
<tr>
<td>DPSSB</td>
<td>BIHAR</td>
<td>3</td>
<td>Unnormalized inverse of a forward transform of a complex periodic sequence</td>
</tr>
<tr>
<td>DPSSF</td>
<td>BIHAR</td>
<td>3</td>
<td>Forward transform of a complex periodic sequence</td>
</tr>
<tr>
<td>DRADBG1</td>
<td>BIHAR</td>
<td>3</td>
<td>Backward transform of a real coefficient array. Loop 1</td>
</tr>
<tr>
<td>DRADBG2</td>
<td>BIHAR</td>
<td>3</td>
<td>Backward transform of a real coefficient array. Loop 2</td>
</tr>
<tr>
<td>DRADF1</td>
<td>BIHAR</td>
<td>3</td>
<td>Forward transform of a real periodic sequence. Loop 1</td>
</tr>
<tr>
<td>DRADF2</td>
<td>BIHAR</td>
<td>3</td>
<td>Forward transform of a real periodic sequence. Loop 2</td>
</tr>
</tbody>
</table>

3 Numerical Aerospace Simulation Facility
4 Biharmonic Partial differential equations solver
padding; and column 4 shows the replacement miss ratio after padding and tiling are applied sequentially in this order. In the future we plan to study the application of padding and tiling techniques in a single step, trying to find the padding and tiling parameters at the same time. This can in general produce better results than optimizing each part separately. The replacement miss ratios after padding and tiling are practically null for all benchmarks.

Table 4 shows the percentage of kernels (not considering those that appear in table 3), which have a replacement miss ratio lower than 1%, 2% and 5% respectively after tiling. For all these kernels and all cache sizes the replacement miss ratio before and after loop tiling for a 32KB cache.
### Table 2. Miss ratio for some evaluated kernels (8KB direct-mapped cache, 32B lines)

<table>
<thead>
<tr>
<th>Kernel</th>
<th>Prob size</th>
<th>No Tiling</th>
<th>Tiling</th>
<th>No Tiling</th>
<th>Tiling</th>
</tr>
</thead>
<tbody>
<tr>
<td>T2D</td>
<td>N=2000</td>
<td>63.3%</td>
<td>36.4%</td>
<td>27.7%</td>
<td>0.9%</td>
</tr>
<tr>
<td>T3JJIK</td>
<td>N=200</td>
<td>63.4%</td>
<td>36.7%</td>
<td>30.2%</td>
<td>3.6%</td>
</tr>
<tr>
<td>T3DIKJ</td>
<td>N=200</td>
<td>34.6%</td>
<td>7.0%</td>
<td>27.9%</td>
<td>0.3%</td>
</tr>
<tr>
<td>JACOBE3D</td>
<td>N=200</td>
<td>25.6%</td>
<td>7.2%</td>
<td>19.8%</td>
<td>1.3%</td>
</tr>
</tbody>
</table>

### Table 3. Miss ratio for some evaluated kernels before padding and tiling, after padding and after padding and tiling

<table>
<thead>
<tr>
<th>Kernel</th>
<th>Original</th>
<th>Padding</th>
<th>Padding + tiling</th>
</tr>
</thead>
<tbody>
<tr>
<td>ADD</td>
<td>60.2%</td>
<td>59.8%</td>
<td>0.5%</td>
</tr>
<tr>
<td>BTRIX</td>
<td>50.1%</td>
<td>0.2%</td>
<td>0.2%</td>
</tr>
<tr>
<td>VPENTA1</td>
<td>78.3%</td>
<td>52.4%</td>
<td>0.0%</td>
</tr>
<tr>
<td>VPENTA2</td>
<td>86.0%</td>
<td>11.9%</td>
<td>0.0%</td>
</tr>
<tr>
<td>ADL1000</td>
<td>26.2%</td>
<td>12.3%</td>
<td>4.1%</td>
</tr>
<tr>
<td>ADL2000</td>
<td>25.7%</td>
<td>12.4%</td>
<td>3.4%</td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th>Kernel</th>
<th>Original</th>
<th>Padding</th>
<th>Padding + tiling</th>
</tr>
</thead>
<tbody>
<tr>
<td>ADD</td>
<td>60.2%</td>
<td>59.8%</td>
<td>0.0%</td>
</tr>
<tr>
<td>BTRIX</td>
<td>34.1%</td>
<td>0.0%</td>
<td>0.0%</td>
</tr>
<tr>
<td>VPENTAI</td>
<td>78.1%</td>
<td>32.9%</td>
<td>0.0%</td>
</tr>
<tr>
<td>VPENTAI2</td>
<td>86.0%</td>
<td>11.3%</td>
<td>0.0%</td>
</tr>
</tbody>
</table>

### Table 4. Replacement miss ratios after tiling of all kernels excepting those in table 3

<table>
<thead>
<tr>
<th>Cache sizes</th>
<th>Replacement miss ratio</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>&lt;1%</td>
</tr>
<tr>
<td>8KB</td>
<td>56.4%</td>
</tr>
<tr>
<td>32KB</td>
<td>90.2%</td>
</tr>
</tbody>
</table>

Miss ratios are lower than 5%.

Our technique is compared against the optimal solution (counting replacement misses) but not against other techniques in the literature. Due to the different limitations of these techniques they cannot be compared with the same benchmarks and same platform on an equal basis.

## 5 Related Work

Caches are an essential part of processors for reducing memory latency and increase memory bandwidth. Some programs have loops which work with large working sets that exceed the cache capacity and result in a high number of capacity misses. These misses can be reduced by means of computation-reordering transformations such as loop interchange, loop distribution, loop skewing and loop tiling among others [1, 2].

Loop tiling [12] is an effective optimizing transformation to reduce the number of capacity misses of programs, especially for dense matrix computations. However, the success of loop tiling depends on the tile size and shape selection. Many algorithms have been provided to find suitable approximations for this selection.

Ghosh, Martonosi and Malik proposed the use of CME [3] to select the tile size [29] but they did not propose a general algorithm to do it. Their technique consists on maximizing the tile size for every self-interference equation, obtaining a tile that has no replacement misses for the given equation. They do not give details about how to combine the different tile sizes obtained for every self-interference equation. Their tiling algorithm is not applied to cross-interference equations.

Coleman and McKinley presented a technique that tries to maximize the tile size such that it fits in cache and at the same time it reduces cross-interference misses. Their technique to reduce cross-interference misses is based on computing the worst case: maximum number of expected cross-interference misses. To do this, their algorithm estimates the footprints of array references [30].

Rivera and Tseng proposed a tile size selection algorithm [31] for programs that compute values using neighboring array elements in a fixed stencil pattern. In order to avoid cross-interference misses they use different techniques based on copying tiles, using a subset of the cache, padding and applying the same technique as Coleman and McKinley.

Sarkar and Megiddo presented a constant-time algorithm to obtain near-optimal tile sizes for two-dimensional nested loops [32] taking into account self and cross-interference misses. The algorithm is based on an approximated memory cost model and an analytical model to estimate the memory cost of a loop nest. The algorithm is extended in order to deal with three-dimensional nested loops using an iterative search over the first tile size and applying their algorithm over the two inner loops.

Our proposal differs and improves these previous approaches in the fact that it is a technique based on a precise model to represent cache behavior, that searches the solution space for optimal tile sizes, for any type of reference pattern that corresponds to affine references and for loop nests of any dimension. It always produces a tiling scheme that reduces capacity misses and usually removes practically all replacement misses.
Cache memory performance is critical for the efficient execution of numerical applications. Loop tiling is a program transformation that reduces capacity misses. In this work, we have proposed the combination of genetic algorithms and a very accurate model of the cache behavior in order to perform near-optimal loop tiling. The cache model is based on a very fast solver of the Cache Miss Equations. The proposed technique can deal with any perfectly nested loop.

The evaluation of the proposed technique shows that the resulting tiling significantly reduces the miss ratio. For instance, for a 8KB direct-mapped cache, we can reduce the replacement miss ratio of the 3D matrix transposition (N=100) from 36.7% to 0.6% and the replacement miss ratio of the Dpssb kernel from 55.5% to 1.25%. We have shown that the proposed loop tiling technique practically removes all capacity misses for all the loops that have been analyzed.

6 Conclusions

References