In this part we describe the general and core concepts of programming with TNL. Understanding these ideas, and the design of the TNL algorithms and data structures behind them, can help you use TNL much more efficiently.
The main goal of TNL is to enable the development of high-performance algorithms that can run on multicore CPUs and on GPUs. TNL offers a unified interface, so the developer can write common code for all supported architectures.
TNL offers a unified interface for both CPUs (also referred to as the host system) and GPUs (referred to as the device). The connection between the CPU and the GPU is usually realized by the PCI-Express bus, which is orders of magnitude slower than the GPU global memory. Therefore, the communication between the CPU and the GPU must be reduced as much as possible. As a result, the programmer needs to consider two different address spaces, one for the CPU and one for the GPU. To distinguish between the address spaces, each data structure requiring dynamic memory allocation needs to know on what device it resides. This is done by a template parameter `Device`. For example, the following code creates two arrays, one on the CPU and the other on the GPU:
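A minimal sketch of such code might look as follows (the header path, the element type `float` and the array size are illustrative assumptions):

```cpp
#include <TNL/Containers/Array.h>

int main()
{
   // Array allocated in the host (CPU) address space.
   TNL::Containers::Array< float, TNL::Devices::Host > host_array;
   // Array allocated in the device (GPU) global memory.
   TNL::Containers::Array< float, TNL::Devices::Cuda > cuda_array;

   // setSize allocates the memory on the respective device.
   host_array.setSize( 10 );
   cuda_array.setSize( 10 );

   // Assignment performs the appropriate data transfer, here from the CPU to the GPU.
   cuda_array = host_array;
}
```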
From now on, C++ template specialization takes care of using the right methods for the given device (device here means a hardware architecture, so it can even be the CPU). For example, calling the method `setSize` results in memory allocation on the CPU (for `host_array`) and on the GPU (for `cuda_array`). The same holds for the assignment, in which case the appropriate data transfer from the CPU to the GPU is performed. Each such data structure contains an inner type named `DeviceType` which tells where it resides, as we can see here:
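For instance, one might inspect `DeviceType` like this (a small sketch; the alias names are illustrative):

```cpp
#include <type_traits>
#include <TNL/Containers/Array.h>

using HostArray = TNL::Containers::Array< float, TNL::Devices::Host >;
using CudaArray = TNL::Containers::Array< float, TNL::Devices::Cuda >;

// DeviceType tells on which device the data of the container resides.
static_assert( std::is_same< HostArray::DeviceType, TNL::Devices::Host >::value,
               "host arrays reside in the CPU address space" );
static_assert( std::is_same< CudaArray::DeviceType, TNL::Devices::Cuda >::value,
               "CUDA arrays reside in the GPU address space" );
```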
If we need to specialize some parts of an algorithm with respect to its device, we can do it by means of `std::is_same`:
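A sketch of such a specialization (the function name, the use of `if constexpr` and the header paths are assumptions for illustration):

```cpp
#include <iostream>
#include <type_traits>
#include <TNL/Devices/Host.h>
#include <TNL/Devices/Cuda.h>

template< typename Device >
void printDevice()
{
   if constexpr( std::is_same< Device, TNL::Devices::Cuda >::value ) {
      // branch specific to the GPU version of the algorithm
      std::cout << "Running on the GPU." << std::endl;
   }
   else {
      // branch specific to the CPU version of the algorithm
      std::cout << "Running on the CPU." << std::endl;
   }
}
```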
Developing code for GPUs (e.g. in CUDA) consists mainly of writing kernels, which are special functions running on the GPU in parallel. This can be very hard and tedious work, especially when it comes to debugging. Parallel reduction is a perfect example of an algorithm that is relatively hard to understand and implement, yet needs to be used frequently. Writing tens of lines of code every time we need to sum up some data is exactly what we mean by tedious programming. TNL offers skeletons or patterns of such algorithms and combines them with user-defined lambda functions. This approach is not completely general: you can use it only in situations where there is a skeleton/pattern (see TNL::Algorithms) suitable for your problem. But when there is, it offers several advantages.
The following code snippet demonstrates it using TNL::Algorithms::parallelFor:
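A sketch of such a snippet, assuming the `parallelFor` overload taking a range and a lambda (the header path and the function name `addVectors` are illustrative):

```cpp
#include <TNL/Algorithms/parallelFor.h>

template< typename Device >
void addVectors( double* sum, const double* v1, const double* v2, const int size )
{
   // The lambda is annotated with __cuda_callable__ so that it can be called
   // both from host code and from a CUDA kernel.
   auto add = [=] __cuda_callable__ ( int i )
   {
      sum[ i ] = v1[ i ] + v2[ i ];
   };
   TNL::Algorithms::parallelFor< Device >( 0, size, add );
}
```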
In this example, we assume that all arrays `v1`, `v2` and `sum` were properly allocated in the appropriate address space. If `Device` is TNL::Devices::Host, the lambda function is processed sequentially, or in parallel by several OpenMP threads, on the CPU. If `Device` is TNL::Devices::Cuda, the lambda function is called from a CUDA kernel (this is why it is annotated with `__cuda_callable__`, which expands to `__host__ __device__`) by an appropriate number of CUDA threads.
One more example demonstrates the use of TNL::Algorithms::reduce:
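A sketch of such a snippet, assuming the `reduce` overload taking a range, a fetch lambda, a reduction lambda and an identity element (the header path and the function name `scalarProduct` are illustrative):

```cpp
#include <TNL/Algorithms/reduce.h>

template< typename Device >
double scalarProduct( const double* u, const double* v, const int size )
{
   // fetch reads and combines the input data for the i-th element
   auto fetch = [=] __cuda_callable__ ( int i ) -> double
   {
      return u[ i ] * v[ i ];
   };
   // reduction tells how to combine two partial results
   auto reduction = [] __cuda_callable__ ( const double& a, const double& b )
   {
      return a + b;
   };
   // 0.0 is the identity element of the summation
   return TNL::Algorithms::reduce< Device >( 0, size, fetch, reduction, 0.0 );
}
```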
We will not explain the parallel reduction in TNL at this moment (see the section about [flexible parallel reduction](ug_ReductionAndScan)); we hope that the idea is intuitively understandable from the code snippet. If `Device` is TNL::Devices::Host, the scalar product is evaluated sequentially, or in parallel by several OpenMP threads, on the CPU. If `Device` is TNL::Devices::Cuda, a parallel reduction fine-tuned with the lambda functions is performed. Fortunately, there is no performance drop. On the contrary, since it is easy to generate CUDA kernels for particular situations, we may even get more efficient code. Consider computing a scalar product of sums of vectors like this:
\[ s = (u_1 + u_2, v_1 + v_2). \]
This can be solved by the following code:
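A sketch under the same assumptions as above, with only the `fetch` lambda changed:

```cpp
#include <TNL/Algorithms/reduce.h>

template< typename Device >
double scalarProductOfSums( const double* u1, const double* u2,
                            const double* v1, const double* v2, const int size )
{
   // fetch now sums the input vectors on the fly before multiplying
   auto fetch = [=] __cuda_callable__ ( int i ) -> double
   {
      return ( u1[ i ] + u2[ i ] ) * ( v1[ i ] + v2[ i ] );
   };
   auto reduction = [] __cuda_callable__ ( const double& a, const double& b )
   {
      return a + b;
   };
   return TNL::Algorithms::reduce< Device >( 0, size, fetch, reduction, 0.0 );
}
```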
Notice that compared to the previous code example, we have changed only the `fetch` lambda function so that it computes the sums `u1[ i ] + u2[ i ]` and `v1[ i ] + v2[ i ]`. Now we get a completely new CUDA kernel tailored exactly for this specific problem.
Doing the same with cuBLAS, for example, would require splitting the computation into three separate kernels:

1. addition of the vectors `u1` and `u2`,
2. addition of the vectors `v1` and `v2`,
3. scalar product of the two intermediate results.
This could be achieved with the following code:
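A minimal sketch of this approach; the function name, the temporary buffers and the particular calls (`cublasDcopy`, `cublasDaxpy`, `cublasDdot`) are assumptions chosen for illustration, not code from the TNL documentation:

```cpp
#include <cublas_v2.h>
#include <cuda_runtime.h>

// u1, u2, v1, v2 are device pointers of length size.
double scalarProductOfSumsCublas( const double* u1, const double* u2,
                                  const double* v1, const double* v2, const int size )
{
   cublasHandle_t handle;
   cublasCreate( &handle );

   // temporary buffers for the intermediate sums
   double* uv1;
   double* uv2;
   cudaMalloc( &uv1, size * sizeof( double ) );
   cudaMalloc( &uv2, size * sizeof( double ) );

   const double one = 1.0;
   // step 1: uv1 = u1 + u2 (copy followed by axpy)
   cublasDcopy( handle, size, u1, 1, uv1, 1 );
   cublasDaxpy( handle, size, &one, u2, 1, uv1, 1 );
   // step 2: uv2 = v1 + v2
   cublasDcopy( handle, size, v1, 1, uv2, 1 );
   cublasDaxpy( handle, size, &one, v2, 1, uv2, 1 );
   // step 3: result = (uv1, uv2)
   double result = 0.0;
   cublasDdot( handle, size, uv1, 1, uv2, 1, &result );

   cudaFree( uv1 );
   cudaFree( uv2 );
   cublasDestroy( handle );
   return result;
}
```

Note that the intermediate sums have to be written to and read back from the GPU global memory, whereas the single TNL kernel above fuses everything into one pass over the data.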
We believe that C++ lambda functions combined with properly designed patterns of parallel algorithms can make programming of GPUs significantly easier. We see a parallel with the MPI standard, which in the nineties defined the frequently used communication operations in distributed parallel computing. It made programming of distributed systems much easier and, at the same time, helped to write efficient programs. We aim to add more such skeletons or patterns to TNL::Algorithms.
You might notice that in the previous section we used only C-style arrays represented by pointers inside the lambda functions. There is a general difficulty when the programmer needs to access dynamic data structures in lambda functions that should be callable on a GPU. The outside variables may be captured either by value or by reference. The first case would look as follows:
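A sketch of the capture-by-value case (illustrative; `a` is assumed to be a TNL::Containers::Array):

```cpp
#include <TNL/Containers/Array.h>
#include <TNL/Algorithms/parallelFor.h>

template< typename Device >
void captureByValue()
{
   const int size = 10;
   TNL::Containers::Array< double, Device > a( size );

   // The capture [ a ] copy-constructs the array, i.e. it makes a deep copy.
   auto f = [ a ] __cuda_callable__ ( int i ) mutable
   {
      a[ i ] = i;   // writes into the deep copy, the original array is unchanged
   };
   TNL::Algorithms::parallelFor< Device >( 0, size, f );
}
```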
However, in this case a deep copy of the array `a` is made, so whatever we do with the array `a` in the lambda function has no effect on the original array. Capturing by reference may look as follows:
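A sketch of the same code capturing the array by reference:

```cpp
#include <TNL/Containers/Array.h>
#include <TNL/Algorithms/parallelFor.h>

template< typename Device >
void captureByReference()
{
   const int size = 10;
   TNL::Containers::Array< double, Device > a( size );

   // The capture [ &a ] works on the host, but references cannot be passed
   // to CUDA kernels, so this does not even compile with the CUDA compiler.
   auto f = [ &a ] __cuda_callable__ ( int i )
   {
      a[ i ] = i;
   };
   TNL::Algorithms::parallelFor< Device >( 0, size, f );
}
```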
This code is correct on the CPU (i.e. when `Device` is TNL::Devices::Host). However, we are not allowed to pass references to CUDA kernels, so this source code would not even compile with the CUDA compiler. To overcome this issue, TNL offers two solutions:

1. data structure views,
2. cross-device smart pointers.
A view is a kind of lightweight reference object related to a dynamic data structure. Unlike full data structures, views are not resizable, but they may provide read-only as well as read-write access to the data. Another important distinction is the copy-constructor: while the copy-constructor of a data structure typically makes a deep copy, copy-constructing a view results in a shallow copy. Intuitively, views represent references to a data structure and thus copies of a view provide references to the same data. Therefore, views can be captured by value in lambda functions, which provides a way to transfer a reference to a data structure into a computational kernel.
The example with the array would look as follows:
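A sketch using the array view, under the same assumptions as the previous snippets:

```cpp
#include <TNL/Containers/Array.h>
#include <TNL/Algorithms/parallelFor.h>

template< typename Device >
void captureView()
{
   const int size = 10;
   TNL::Containers::Array< double, Device > a( size );

   // getView returns a lightweight view which is copied shallowly.
   auto view = a.getView();
   auto f = [ view ] __cuda_callable__ ( int i ) mutable
   {
      view[ i ] = i;   // writes into the data of the original array a
   };
   TNL::Algorithms::parallelFor< Device >( 0, size, f );
}
```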
Compared to the previous code example, we first obtain a view of the array using its `getView` method. Then we capture it by value, i.e. using `[ view ]` rather than `[ &a ]`, and finally, we use `view` rather than the array `a` inside the lambda function.
The view has a very similar interface (see TNL::Containers::ArrayView) to the array (TNL::Containers::Array), so in most situations there is no difference between using an array and its view. In TNL, each data structure designed for use in GPU kernels (i.e. having methods annotated as `__cuda_callable__`) also provides a `getView` method returning the appropriate view of the object.
Note that the relation between a data structure and its view is one-way only: a view holds a reference to the data structure, but the data structure does not keep track of the views that are currently in use. Consequently, there are operations on the data structure after which the views may become invalid, because the data structure cannot update the independent objects referencing it. The most common example of such an operation is reallocation, which is shown in the following example:
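A sketch of the problematic ordering (illustrative only):

```cpp
#include <TNL/Containers/Array.h>
#include <TNL/Algorithms/parallelFor.h>

template< typename Device >
void invalidatedView()
{
   const int size = 10;
   TNL::Containers::Array< double, Device > a( size );

   auto view = a.getView();   // the view references the current allocation of a

   a.setSize( 2 * size );     // reallocation: the view still points to the old data

   auto f = [ view ] __cuda_callable__ ( int i ) mutable
   {
      view[ i ] = i;          // invalid access, the view no longer refers to a
   };
   TNL::Algorithms::parallelFor< Device >( 0, 2 * size, f );
}
```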
This code does not work correctly, because the array is first allocated with `size` elements, then the view is obtained, and only then is the array resized to `2 * size` elements. The resize causes a data reallocation, after which the view is invalid: it holds a pointer to an inaccessible memory segment and has no means to detect or correct this state. The example can be fixed simply by performing the operations in the correct order: the view used in the lambda function must be obtained after resizing the data structure.
Note that changing the data managed by the array after fetching the view is not an issue. See the following example:
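A sketch of this safe usage (illustrative only):

```cpp
#include <TNL/Containers/Array.h>
#include <TNL/Algorithms/parallelFor.h>

template< typename Device >
void validView()
{
   const int size = 10;
   TNL::Containers::Array< double, Device > a( size );

   auto view = a.getView();

   // setElement changes a value but does not reallocate, so the view stays valid.
   a.setElement( 0, 3.14 );

   auto f = [ view ] __cuda_callable__ ( int i ) mutable
   {
      view[ i ] += 1.0;
   };
   TNL::Algorithms::parallelFor< Device >( 0, size, f );
}
```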
The method `setElement` changes the value of an element, but it does not cause a data reallocation or a change of the array size, so the view remains valid and usable in the lambda function.
TNL offers smart pointers working across different devices (meaning CPU or GPU). See Cross-device smart pointers for more details on this topic.