Target-Specific Refinement of Multigrid Codes

Richard Membarth and Philipp Slusallek
German Research Center for Artificial Intelligence
Computer Graphics Lab, Saarland University
Intel Visual Computing Institute
{richard.membarth,philipp.slusallek}@dfki.de

Marcel Köster, Roland Leißa, and Sebastian Hack
Compiler Design Lab, Saarland University
Intel Visual Computing Institute
{koester,leissa,hack}@cs.uni-saarland.de

Abstract—This paper applies partial evaluation to stage a stencil code Domain-Specific Language (DSL) onto a functional and imperative programming language. Platform-specific primitives such as scheduling or vectorization, and algorithmic variants such as boundary handling are factored out into a library that make up the elements of that DSL. We show how partial evaluation can eliminate all overhead of this separation of concerns and creates code that resembles hand-crafted versions for a particular target platform. We evaluate our technique by implementing a DSL for the V-cycle multigrid iteration. Our approach generates code for AMD and NVIDIA GPUs (via SPIR and NVVM) as well as for CPUs using AVX/AVX2 alike from the same high-level DSL program. First results show that we achieve a speedup of up to 3× on the CPU by vectorizing multigrid components and a speedup of up to 2× on the GPU by merging the computation of multigrid components.

Index Terms—Multigrid codes, partial evaluation, domain-specific language.

I. INTRODUCTION

Common imperative programming languages (C, Fortran, etc.) provide simple abstractions of fundamental features of a processor: variables to abstract from different storage locations (registers, stack), expression syntax to abstract from linearization, function calls to abstract from calling conventions, etc. These abstractions are essential to write portable and maintainable code productively. It is the job of the compiler to remove the overhead these abstractions cause and map them to efficient machine code.

However, to achieve high performance, the compiler’s code optimizations are often not good enough. Many machine properties such as SIMD instructions or the memory hierarchy are not orchestrated well by modern compilers. There are basically two reasons for this: First, it requires intricate knowledge of the target architecture to come up with a transformation strategy. Second, it requires domain knowledge to justify the correctness of the transformation.

Many performance-critical codes are manually “optimized” towards a specific target architecture. This is of course error-prone, results in unportable code, and is a maintenance and debugging nightmare. Therefore, it is common practice to use DSLs or other program generation tools to ease this process.

To make writing DSLs more productive, DSLs are often embedded into a host language using staging; The syntax of the host language is overloaded (where possible) to construct the program representation of the DSL program. A DSL compiler written in the host language then compiles the DSL program during the runtime of the host program. While this approach avoids the tedious task of building a front end for the DSL, it has significant drawbacks:

1) The programmer has to resolve the overloading to understand which part of the code belongs to the DSL program or to the host program.

2) Although the host program may build the DSL program successfully, the constructed DSL program is not guaranteed to be well-formed and might fail to compile. The programmer does not notice this at compile time but only when he runs the host program.

3) The DSL designer has to write a code generator for every DSL language he wants to host.

In this paper, we resurrect an old idea that interestingly has not been picked up much lately, although it solves all of the problems just mentioned: The (first) Futamura projection [8]. The easiest way to implement a DSL is to write an interpreter for it: For every language element, give a piece of code that implements the semantics of this language element. Essentially, the DSL implementation comes as a library. According to the first Futamura projection, partially evaluating the DSL interpreter with an input program compiles the DSL program: a process we call refinement. The result is basically that the interpreter is inlined into the program to be interpreted.

Section II discusses this approach in detail by means of an idealized stencil code example. Note that this corresponds exactly to the output of a compiler that traverses a program representation and emits a piece of code per language element. This approach solves the above-mentioned problems in the following way:

1) There is only one program: The DSL program is the host program that contains explicit calls to the interpreter library.

2) Because there is only a single program, the programmer is notified at compile time if the program is well-formed or not.

3) The partial evaluator is a reusable component of the (host) language and can be reused for every staged DSL.

In this paper, we present a small, staged DSL for the V-cycle, an important multigrid iteration for solving linear systems using the multigrid method [4], [19]. We use a functional and imperative programming language called Impala to describe the algorithm on a high level, independent of the target platform. To
this end, we derive suitable abstractions for the important and performance-critical operations of the algorithm and factor them out using higher-order functions. They essentially constitute the language elements of our DSL and are implemented by an expert for every target architecture. By partially evaluating the DSL program with an implementation of the language elements for a certain architecture, we obtain a program that looks like it was specifically written for that particular target architecture.

In our experiments, we show that, using this approach, we can map the V-cycle algorithm to different hardware architectures: CPUs and GPUs. In the architecture mapping, we perform important optimizations like vectorization and cache-aware iteration reordering without tainting the other parts of the implementation with hardware-dependent, low-level details. In our previous work [13], we have demonstrated that our technique can compete with manually-tuned expert implementations of simple stencil codes while being orders of magnitudes smaller in terms of code size. In this paper, we demonstrate that our approach scales well to more complex stencil codes by considering not only single stencils for optimization, but sequences of operations.

II. EMBEDDING OF DSLS

This section describes the two main techniques to embed a DSL in Impala by means of a small stencil code example:

1) Higher-order functions allow to design generic domain-specific APIs that can be instantiated for every target architecture by highly-optimized implementations.

2) Partial evaluation entirely removes the overhead of these generic implementations by specializing the higher-order functions to their arguments.

A. Higher-Order Functions

In stencil algorithms it is often necessary to access neighboring elements which lie outside the actual field. Many frameworks provide hard-coded solutions like setting these virtual elements to zero, mirror the field, or clamp to border values. It is good programming practice to not hard-code boundary handling into the iterator but to separate this concern from other concerns of the implementation. This separation of concerns can be elegantly implemented with higher-order functions: In Listing 1 the apply_stencil function expects a stencil which it wants to apply on a field field. The boundary handling logic is passed as function via border. For example, in order to clamp the border to the input field, the programmer defines an appropriate clamp function and passes it to apply_stencil as shown in Listing 2. Of course, a naïve implementation of the higher-order functions by the compiler inflicts the performance penalty of closure allocation, function call, and so on.

B. Partial Evaluation

In Impala the programmer can annotate a call with @ to partially evaluate that function call at compile time. This will have the effect that the function is specialized to all constant parameters as far as possible. By annotating the call to apply_stencil with @, the compiler generates a specialization where the definition of clamp is propagated into the body of apply_stencil. In this way, each call to border is specialized to a call to clamp.

```
fn apply_stencil(x: int, y: int,
    field: Field, stencil: Stencil,
    border: fn(int, int, int) -> int)
) -> float |
let mut tmp = 0.0f;
for i, j in stencil.each() { 
    let xx = border(x+i, 0, field.cols-1);
    let yy = border(y+j, 0, field.rows-1);
    tmp += field(xx, yy) * stencil(i, j);
}
tmp
```

Listing 1: A generic function that applies a stencil (with respect to boundary handling) to an input field at a given position.

```
fn clamp(idx: int, lower: int, upper: int) -> int {
    min(upper, max(lower, idx))
}
```

Listing 2: Applying a given stencil to a field using clamp for boundary handling.
to clamp, the partial evaluator will specialize both calls to clamp in the inner loop.

The example from Listing 1 can be optimized even further: It is also possible to create specialized variants for different region of the field with tailored boundary handling for each region [13]. For example, the vast main area of the field does not need any boundary handling at all. This saves unneeded boundary handling checks.

The function apply_stencil is essentially an “interpreter” that applies a stencil to a field at a given position. The aspects of boundary handling, application of the stencil, and the stencil itself are cleanly separated. Partial evaluation eliminates the overhead of this separation of concerns and produces high-performance code that looks like if all aspects had been hand-coded into a single piece of code.

III. CODE REFINEMENT

Listing 1 is a straightforward CPU implementation. In this section we show how to map implementations to different acceleration devices.

A. Code Generation for Accelerators

Impala offers built-in, compiler-known, higher-order functions to trigger code generation for GPUs as well as vectorization for CPUs with SIMD instruction sets. For example, the following function behaves like a loop ranging from 0 to size while invoking body in each iteration:

\[
\text{fn vectorize(size: int, length: int,}
\text{body: fn(int, int) -> ()}}
\]

However, the body is actually vectorized with SIMD width length [11]. Likewise, the following function triggers code generation for NVVM\(^1\), an Intermediate Representation (IR) for NVIDIA GPUs, using the given blocking for the given grid:

\[
\text{fn nvvm(grid: (int, int, int),}
\text{block: (int, int, int),}
\text{body: fn() -> ()}}
\]

Impala lambda-lifts the body [9] and resolves all dependencies to surrounding data. Any data needed from the outside is automatically copied to the GPU. Output data is pushed back to the CPU.

In a similar fashion, Impala offers built-in functions to generate OpenCL and CUDA source code as well as SPIR\(^2\), an IR for OpenCL. In contrast to pragma-based solutions like OpenACC or OpenMP, Impala’s built-in functions are proper functions which integrate seamlessly into Impala’s type system and expect Impala values. In particular, we can wrap a call to such a function within another function. This is not possible with pragmas.

B. Supporting Accelerators in the DSL

Reconsider Listing 2. We use the function

\[
\text{fn iterate(field: Field,}
\text{body: fn(int, int) -> ()}}
\]

\[
\text{iterate(field, |x, y| /= body */;})
\]

to abstract over the exact iteration behavior of the field. Impala’s for-construct is just syntactic sugar for calling a higher-order function while passing the body as lambda function.

\[
\text{iterate(field, |x, y| /s body */;})
\]

This allows DSL developers to overload for-constructs and to provide a target-specific iteration strategy, that may use vectorize or nvvm.

In order to vectorize the stencil computation with vector length 8 for AVX or AVX2, we use the aforementioned vectorize function to implement iterate:

\[
\text{fn iterate(field: Field,}
\text{body: fn(int, int) -> ()}}
\]

\[
\text{iterate(field, |x, y| /= body */;})
\]

Note that we have not touched apply_stencil itself. We can also implement other versions of iterate which map to other accelerators, for example, by using the nvvm function. Each iterate can act as drop-in replacement for another one.

Again, this cleanly separates the overall algorithm from the concrete implementations of its building blocks. By partial evaluation, all those parts are fused into a piece of code that looks like if it had been hand-coded explicitly for the particular architecture.

IV. EVALUATION

In this section, we discuss the implementation of a DSL for the V-cycle and show first results on different target architectures. As target hardware, we consider an Intel Core i7-3770K, an NVIDIA GeForce GTX 680, and an AMD Radeon R9 290X.

A. A DSL for the V-Cycle

The basic idea of the multigrid method is to smooth the error (e.g., using an iterative method like Jacobi or Gauss-Seidel) on different resolutions of the same data. The V-cycle describes one possible multigrid iteration as summarized in Algorithm 1.

The restrict and interpolate methods are used to transform data between different resolutions of the multigrid. On each level, the error is smoothed (smoother) and the error is estimated (residual). This process is recursive and starts at the finest resolution.

A DSL for the V-cycle should implement the algorithm described in Algorithm 1, but give the programmer the flexibility to choose custom methods for the multigrid components.

---

2. www.khronos.org/registry/spir/specs/spir_spec-1.2.pdf
Moreover, it should allow to specify additional properties such as the depth of the recursion, the number of smoothing steps, etc. Using Impala, we pass the multigrid components as higher-order functions to the implementation of the V-cycle. By partially evaluating the V-cycle implementation, the user-provided multigrid components are inline and optimized with respect to the input (stencils, etc.). Listing 3 shows the implementation of a DSL for the V-cycle and its invocation.

A naïve implementation of the V-cycle invokes each multigrid component according to the schedule of Algorithm 1 using the iterate function introduced in Section III. By providing specialized iterate implementations for CPU and GPU, we map the same algorithm to different target platforms. Likewise, vectorization is triggered by using the vectorize function in case we use a mapping for AVX. However, hand-tuned implementations of the V-cycle might merge multiple multigrid components in order to save unnecessary reads/writes to memory. The same optimization can be achieved in Impala by custom iterate functions that compute multiple components. As an example, consider the computation of the residual component followed by the restrict component: Instead of computing the residual for the whole field first and then restrict the field produced by the residual, we compute the residual only for two rows and restrict the residual before the next rows are processed. This pipelined processing allows to hold the result of the restrict component in cache on the CPU and allows to merge compute kernels on the GPU when using scratchpad (local or shared) memory. On the GPU, this has the same effect as loop fusion. Listing 4 illustrates this for the CPU. The index passed to the residual and restrict component refer to the temporary field. The offset to the current row of the other fields are tracked in the Field object and are used when accessing field elements. Merging the two components is only valid if the operation of the multigrid components is known: in our example, the restrict component is allowed to access two rows only. Otherwise, a larger temporary array has to be allocated and pre-computed before applying restrict.

\[ V_h(u_h^{(k+1)}) = V_h(u_h^{(k)}, A^{h}, f^{h}, v_1, v_2). \]

\[ u_h^{(k+1)} = S^2_h(u_h^{(k)}, A^{h}, f^{h}). \]

\[ u_h^{(k+1)} = S^2_h(u_h^{(k)}, A^{h}, f^{h}). \]

<table>
<thead>
<tr>
<th>Listing 3: Implementation of the DSL for the V-cycle.</th>
</tr>
</thead>
<tbody>
<tr>
<td>fn vcycle_dsl(input: Field, levels: int,</td>
</tr>
<tr>
<td>vsteps: int, ssteps: int,</td>
</tr>
<tr>
<td>smoother: fn(* ... *) -&gt; (),</td>
</tr>
<tr>
<td>residual: fn(* ... *) -&gt; (),</td>
</tr>
<tr>
<td>restrict: fn(* ... *) -&gt; (),</td>
</tr>
<tr>
<td>interpolate: fn(* ... *) -&gt; ()</td>
</tr>
<tr>
<td>} -&gt; Field</td>
</tr>
<tr>
<td>/* allocate memory for all levels */</td>
</tr>
<tr>
<td>// Sol, RHS, Res, Tmp</td>
</tr>
<tr>
<td>// vcycle implementation</td>
</tr>
<tr>
<td>fn vcycle_dsl_intern(level: int) -&gt; ()</td>
</tr>
<tr>
<td>if level == levels-1</td>
</tr>
<tr>
<td>// smooth</td>
</tr>
<tr>
<td>for i in range(0, ssteps) {</td>
</tr>
<tr>
<td>if i&gt;0 { swap(Sol(level), Tmp(level)); }</td>
</tr>
<tr>
<td>for x, y in iterate(Sol(level))</td>
</tr>
<tr>
<td>solver(x, y, /* fields */);</td>
</tr>
<tr>
<td>}</td>
</tr>
<tr>
<td>for x, y in iterate(Res(level))</td>
</tr>
<tr>
<td>residual(x, y, /* fields */);</td>
</tr>
<tr>
<td>}</td>
</tr>
<tr>
<td>for x, y in iterate(RHS(level+1))</td>
</tr>
<tr>
<td>restrict(x, y, /* fields */);</td>
</tr>
<tr>
<td>}</td>
</tr>
<tr>
<td>// recursion</td>
</tr>
<tr>
<td>vcycle_dsl_intern(level+1);</td>
</tr>
<tr>
<td>for x, y in iterate(Sol(level))</td>
</tr>
<tr>
<td>interpolate(x, y, /* fields */);</td>
</tr>
<tr>
<td>}</td>
</tr>
<tr>
<td>// post-smooth</td>
</tr>
<tr>
<td>for i in range(0, ssteps) {</td>
</tr>
<tr>
<td>if i&gt;0 { swap(Sol(level), Tmp(level)); }</td>
</tr>
<tr>
<td>for x, y in iterate(Sol(level))</td>
</tr>
<tr>
<td>solver(x, y, /* fields */);</td>
</tr>
<tr>
<td>}</td>
</tr>
<tr>
<td>// call the vcycle implementation</td>
</tr>
<tr>
<td>for i in range(0, vsteps)</td>
</tr>
<tr>
<td>vcycle_dsl_intern(0);</td>
</tr>
<tr>
<td>}</td>
</tr>
<tr>
<td>// specialize call to vcycle_dsl</td>
</tr>
<tr>
<td>let result = vcycle_dsl(input, levels, vsteps,</td>
</tr>
<tr>
<td>ssteps, jacobi, residual,</td>
</tr>
<tr>
<td>restrict, interpolate);</td>
</tr>
<tr>
<td></td>
</tr>
<tr>
<td>B. Results</td>
</tr>
</tbody>
</table>
| Our first evaluation of the V-cycle DSL presented in Section IV-A shows promising results. Table I lists the execution
fn iterate_rr(Sol: Field, Res: Field, 
RHSF: Field, RHS: Field, 
residual: fn(0 /* ... */ -> ()), 
restrict: fn(0 /* ... */ -> ()) 
) -> () {
    // allocate temporary array for 2 rows
    let mut tmp: Field = {
        rows: 2
    };
    rows ∗ in fn of of rows ∗ /
    fn 2048 x temporary residual
    through $range_step(0, Res.rows, 2) {
        $range(0, RHSC.cols) {
            Sol, tmp, RHSF);
        }
        residual in through
        on an Intel Core
        for x on an NVIDIA
    }
    }
}

Listing 4: Merging residual and restrict computation on the CPU.

Table I: Execution times in ms for the components of the V-cycle DSL for a field of size 2048 × 2048 on an Intel Core i7-3770K (CPU & AVX).

<table>
<thead>
<tr>
<th>Mapping</th>
<th>smoother</th>
<th>residual</th>
<th>restrict</th>
<th>interpolate</th>
</tr>
</thead>
<tbody>
<tr>
<td>CPU</td>
<td>11.63</td>
<td>11.21</td>
<td>1.61</td>
<td>2.24</td>
</tr>
<tr>
<td>CPU:M</td>
<td>11.63</td>
<td>12.18</td>
<td>2.24</td>
<td></td>
</tr>
<tr>
<td>AVX</td>
<td>3.98</td>
<td>3.99</td>
<td>2.33</td>
<td>3.39</td>
</tr>
<tr>
<td>AVX:M</td>
<td>3.98</td>
<td>4.10</td>
<td>3.39</td>
<td></td>
</tr>
</tbody>
</table>

time in ms on the CPU for the components of the V-cycle on the first level. In addition to a non-vectorized CPU mapping and a vectorized mapping for AVX, we list the execution time when merging the residual and restrict components (CPU:M and AVX:M). It can be seen that we get a speedup of 3× for most components. Only restrict and interpolate are slower due to their memory access pattern. The merged residual and restrict computation for AVX vectorizes only the residual component.

Table II lists the corresponding execution time in ms on the GPU using mapping strategies for NVVM and SPIR. Merging the residual and restrict components is almost twice as fast as computing both components in separate compute kernels on the GPU.

C. Discussion

The proposed DSL can be easily extended to express different multigrid iterations. It is actually sufficient to change the recursion in the V-cycle implementation in order to get the schedule for the W-cycle multigrid iteration.

The performance evaluation has shown that we can map the same high-level description to different target platforms by providing target-specific mappings. Moreover, we merge multiple components as shown exemplarily for the residual and restrict components.

We believe that using our approach, we can get the same performance as target-dependent, hand-optimized implementations for multigrid solvers (e.g., [12]). For single stencil operators, we have already shown that we achieve competitive results compared to hand-tuned implementations on GPU accelerators [13]. Currently, we are working on mappings to merge computations across not only two, but several components in order to get the same performance as hand-tuned implementations.

V. RELATED WORK

Multigrid solvers and in particular stencil codes are one of the most important algorithm classes in HPC and have been a popular research topic for decades. Still, a multitude of solutions exist and emerge to describe stencil codes at a high level and to map them to different target architectures.

Specialized libraries like DUNE [2] or hypre [1] provide a collection of solvers for large linear systems of equations on massively parallel machines. Auto-tuning frameworks like PATUS (Parallel Auto-Tuned Stencils) [10], [5] or the parallel Optimized Sparse Kernel Interface (pOSKI) [3] generate specialized codes for stencil computations on shared-memory architectures. DSLs for stencil codes provide special language constructs to describe stencil computations and to map the computation to parallel target hardware [7], [18], [14], [16].

We use language embedding and DSLs to describe multigrid solvers and stencil codes. However, we give the programmer additional opportunities to influence the optimization and transformation process from within the language.

Other DSL approaches such as Liszt [7], Pochoir [18], and HIPAC [14] focus on providing a simple and concise syntax to express algorithms. However, they offer no control over the applied optimization strategies. An advancement to this is the explicit specification of schedules in Halide [16]: Target-specific scheduling strategies can be defined by the programmer. Still it is not possible to trigger code refinement explicitly.

An alternative to these approaches is explicit code refinement. It can be achieved through staging like in Terra [6] and in Spiral in Scala [15]. Terra is an extension to Lua. Program parts in Lua can be evaluated and used to build Terra code during the run of the Lua program. However, this technique makes use of two different languages and type safety of the constructed fragments can only be checked before executing the constructed Terra code. Spiral in Scala uses the concept
of lightweight modular staging [17] to annotate types in Scala. Computations which make use of these types, are automatically subject to code refinement.

VI. CONCLUSION

In this paper, we present an approach for DSL creation via embedding into our host language Impala. It features high-level abstractions through higher-order functions and explicit control over code specialization through partial evaluation. These features allow to realize a novel refinement concept: We are able to abstract from target-machine details and to specialize code for different platforms.

As an example, we realize a DSL for multigrid codes. We are able to generate code for CPUs (including vectorization) and GPUs by only switching the mapping for the desired target architecture. This process does not involve an adaption of the implemented algorithm, since the realization is cleanly separated from the surrounding building blocks. Code refinement can then be applied to these different building blocks in order to glue them together into a single piece of code. The resulting code does not contain any abstractions nor unspecialized code for the target architecture. Starting from the same high-level description, we achieve a speedup of up to $3 \times$ on the CPU through vectorizing and a speedup of up to $2 \times$ on the GPU through kernel fusion.

VII. ACKNOWLEDGMENTS

This work is partly supported by the Federal Ministry of Education and Research (BMBF), as part of the Collaborate3D and ECOUSS projects as well as by the Intel Visual Computing Institute Saarbrücken. It is also co-funded by the European Union (EU), as part of the Dreamspace project.

REFERENCES