General concepts

Introduction

In this part we describe some general and core concepts of programming with TNL. Understanding these ideas can significantly help with understanding the design of TNL algorithms and data structures, and it also helps to use TNL more efficiently. The main goal of TNL is to allow the development of high-performance algorithms that can run on multicore CPUs and GPUs. TNL offers a unified interface, so the developer writes one code for both architectures.

Devices and allocators

TNL offers a unified interface for both CPUs (also referred to as the host system) and GPUs (referred to as the device). The connection between the CPU and the GPU is usually a PCI-Express bus, which is orders of magnitude slower than the global memory of the GPU. Therefore, the communication between the CPU and the GPU must be reduced as much as possible. As a result, the programmer operates with two different address spaces, one for the CPU and one for the GPU. To distinguish between the address spaces, each data structure requiring dynamic allocation of memory needs to know on what device it resides. This is done by a template parameter Device. For example, the following code creates two arrays, one on the CPU and the other on the GPU.

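A minimal sketch of such declarations (using TNL::Containers::Array together with the TNL::Devices::Host and TNL::Devices::Cuda device types, which are discussed below) might look like this:

TNL::Containers::Array< int, TNL::Devices::Host > host_array;
TNL::Containers::Array< int, TNL::Devices::Cuda > cuda_array;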

From now on, C++ template specialization takes care of using the right methods for the given device (device meaning a hardware architecture here, so the device can even be the CPU). For example, calling the method setSize

host_array.setSize( 10 );
cuda_array.setSize( 10 );

results in different memory allocation on CPU (for host_array) and on GPU (for cuda_array). The same holds for assignment

cuda_array = host_array;

in which case an appropriate data transfer from the CPU to the GPU is performed. Each such data structure contains an inner type named DeviceType which tells where it resides, as we can see here:

template< typename Array >
void deduceDevice()
{
   using Device = typename Array::DeviceType;
   TNL::Containers::Array< int, Device > array;
}

If we need to specialize some parts of an algorithm with respect to its device, we can do it by means of std::is_same:

template< typename Array >
void testDevice()
{
   using Device = typename Array::DeviceType;
   if( std::is_same< Device, TNL::Devices::Host >::value )
      std::cout << "Device is host CPU." << std::endl;
   if( std::is_same< Device, TNL::Devices::Cuda >::value )
      std::cout << "Device is CUDA GPU." << std::endl;
}

TODO: Allocators

Algorithms and lambda functions

Developing code for GPUs (in CUDA, for example) consists mainly of writing kernels, which are special functions running on the GPU in parallel. This can be very hard and tedious work, especially when it comes to debugging. Parallel reduction is a perfect example of an algorithm which is relatively hard to understand and implement, yet it needs to be used frequently. Writing tens of lines of code every time we need to sum up some data is exactly what we mean by tedious programming. TNL offers skeletons or patterns of such algorithms and combines them with user-defined lambda functions. This approach is not absolutely general, which means that you can use it only in situations where there is a skeleton/pattern (see TNL::Algorithms) suitable for your problem. But when there is, it offers several advantages:

  1. Implementing lambda functions is much easier compared to implementing GPU kernels.
  2. Code implemented this way works even on CPU, so the developer writes only one code for both hardware architectures.
  3. The developer may debug the code on CPU first and then just run it on GPU. Quite likely it will work with few or no changes.

The following code snippet demonstrates this with the use of TNL::Algorithms::parallelFor.

template< typename Device >
void vectorAddition( double* v1, double* v2, double* sum, const int size )
{
   auto sum_lambda = [=] __cuda_callable__ ( int i ) mutable {
      sum[ i ] = v1[ i ] + v2[ i ];
   };
   TNL::Algorithms::parallelFor< Device >( 0, size, sum_lambda );
}

In this example, we assume that all arrays v1, v2 and sum were properly allocated on the given Device. If Device equals TNL::Devices::Host, the lambda function is processed sequentially or in parallel by several OpenMP threads on the CPU. If Device equals TNL::Devices::Cuda, the lambda function is called from a CUDA kernel (this is why it is defined as __cuda_callable__, which is just a substitute for __host__ __device__) by an appropriate number of CUDA threads. A sketch of how such a function might be called for both devices follows; after that, one more example demonstrates the use of TNL::Algorithms::reduce.
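The following is only an illustrative sketch of the allocation and the call, assuming the size constructor, setValue and getData methods of TNL::Containers::Array (any other way of allocating the data on the proper device would work just as well):

#include <TNL/Containers/Array.h>

void runVectorAddition( int size )
{
   using HostArray = TNL::Containers::Array< double, TNL::Devices::Host >;
   using CudaArray = TNL::Containers::Array< double, TNL::Devices::Cuda >;

   // Host run: the raw pointers point into host memory.
   HostArray h_v1( size ), h_v2( size ), h_sum( size );
   h_v1.setValue( 1.0 );
   h_v2.setValue( 2.0 );
   vectorAddition< TNL::Devices::Host >( h_v1.getData(), h_v2.getData(), h_sum.getData(), size );

   // GPU run: the same code, only the Device template parameter changes.
   CudaArray d_v1( size ), d_v2( size ), d_sum( size );
   d_v1.setValue( 1.0 );
   d_v2.setValue( 2.0 );
   vectorAddition< TNL::Devices::Cuda >( d_v1.getData(), d_v2.getData(), d_sum.getData(), size );
}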

template< typename Device >
void scalarProduct( double* v1, double* v2, double* product, const int size )
{
   auto fetch = [=] __cuda_callable__ ( int i ) -> double {
      return v1[ i ] * v2[ i ];
   };
   auto reduce = [] __cuda_callable__ ( const double& a, const double& b ) {
      return a + b; };
   *product = TNL::Algorithms::reduce< Device >( 0, size, fetch, reduce, 0.0 );
}

We will not explain the parallel reduction in TNL at this moment (see the section about [flexible parallel reduction](ug_ReductionAndScan)); we hope that the idea is more or less clear from the code snippet. If Device equals TNL::Devices::Host, the scalar product is evaluated sequentially or in parallel by several OpenMP threads on the CPU; if Device equals TNL::Devices::Cuda, a parallel reduction fine-tuned by the lambda functions is performed. Fortunately, there is no performance drop. On the contrary, since it is easy to generate CUDA kernels for particular situations, we may get more efficient code. Consider computing a scalar product of sums of vectors like this

\[ s = (u_1 + u_2, v_1 + v_2). \]

This can be solved by the following code

1   template< typename Device >
2   void scalarProduct( double* u1, double* u2,
3                       double* v1, double* v2,
4                       double* product, const int size )
5   {
6      auto fetch = [=] __cuda_callable__ ( int i ) -> double {
7         return ( u1[ i ] + u2[ i ] ) * ( v1[ i ] + v2[ i ] );
8      };
9      auto reduce = [] __cuda_callable__ ( const double& a, const double& b ) {
10        return a + b; };
11     *product = TNL::Algorithms::reduce< Device >( 0, size, fetch, reduce, 0.0 );
12  }

We have changed only the fetch lambda function to perform the sums u1[ i ] + u2[ i ] and v1[ i ] + v2[ i ] (line 7). Now we get a completely new CUDA kernel tailored exactly to our problem. Doing the same with cuBLAS, for example, would require splitting the computation into three separate kernels:

  1. Kernel to compute \(u_1 = u_1 + u_2\).
  2. Kernel to compute \(v_1 = v_1 + v_2\).
  3. Kernel to compute \(product = ( u_1, v_1 )\).

This could be achieved with the following code:

void scalarProduct( double* u1, double* u2,
                    double* v1, double* v2,
                    double* product, const int size )
{
   cublasHandle_t handle;
   cublasCreate( &handle );
   const double alpha = 1.0;
   cublasDaxpy( handle, size, &alpha, u2, 1, u1, 1 );   // u1 = u1 + u2
   cublasDaxpy( handle, size, &alpha, v2, 1, v1, 1 );   // v1 = v1 + v2
   cublasDdot ( handle, size, u1, 1, v1, 1, product );  // product = ( u1, v1 )
   cublasDestroy( handle );
}

We believe that C++ lambda functions combined with properly designed patterns of parallel algorithms can make programming of GPUs significantly easier. We see a parallel with the MPI standard, which in the nineties defined the frequent communication operations of distributed parallel computing. It made programming of distributed systems much easier and, at the same time, MPI helps to write efficient programs. We aim to add more such skeletons or patterns to TNL::Algorithms.

Views and shared pointers

You might notice that in the previous section we used only C-style arrays represented by pointers in the lambda functions. There is a difficulty when we want to access TNL arrays or other data structures inside the lambda functions. We may capture the outside variables either by value or by reference. The first case would be as follows:

template< typename Device >
void lambda_capture_by_value( int size )
{
   TNL::Containers::Array< int, Device > a( size );
   auto f = [=] __cuda_callable__ ( int i ) mutable {
      a[ i ] = 1;
   };
   TNL::Algorithms::parallelFor< Device >( 0, size, f );
}

In this case a deep copy of the array a will be made, and so whatever we do with the array a in the lambda function has no effect on the original array. Capturing by reference may look as follows:

template< typename Device >
void lambda_capture_by_reference( int size )
{
   TNL::Containers::Array< int, Device > a( size );
   auto f = [&a] __cuda_callable__ ( int i ) mutable {
      a[ i ] = 1;
   };
   TNL::Algorithms::parallelFor< Device >( 0, size, f );
}

This would be correct on the CPU (i.e. when Device is TNL::Devices::Host). However, we are not allowed to pass references to CUDA kernels, so this source code would not even compile with the CUDA compiler. To overcome this issue, TNL offers two solutions:

  1. Data structures views
  2. Shared pointers

Data structures views

A view is a kind of lightweight reference object which makes only a shallow copy of itself in its copy constructor. Therefore a view can be captured by value, but because it is, in fact, a reference to another object, everything we do with the view will affect the original object. The example with the array would look as follows:

1   template< typename Device >
2   void lambda_capture_by_value( int size )
3   {
4      TNL::Containers::Array< int, Device > a( size );
5      auto view = a.getView();
6      auto f = [=] __cuda_callable__ ( int i ) mutable {
7         view[ i ] = 1;
8      };
9      TNL::Algorithms::parallelFor< Device >( 0, size, f );
10  }

The differences are on the line 5, where we fetch the view by means of the method getView, and on the line 7, where we work with the view and not with the array a. The view has a very similar interface (see TNL::Containers::ArrayView) to the array (TNL::Containers::Array), so for the programmer there is mostly no difference between using the array and its view. In TNL, each data structure which can be accessed from GPU kernels (meaning that it has methods defined as __cuda_callable__) also provides a method getView for getting the appropriate view of the object.
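For illustration, a complete use might look like the following sketch. It is only a sketch assuming TNL::Containers::Array with its getView and getElement methods and TNL::Algorithms::parallelFor, all introduced elsewhere in this text:

#include <iostream>
#include <TNL/Containers/Array.h>
#include <TNL/Algorithms/parallelFor.h>

template< typename Device >
void fillThroughView( int size )
{
   TNL::Containers::Array< int, Device > a( size );
   auto view = a.getView();                    // shallow copy referring to the data of a
   auto f = [=] __cuda_callable__ ( int i ) mutable {
      view[ i ] = i;                           // writes through the view modify the array a
   };
   TNL::Algorithms::parallelFor< Device >( 0, size, f );
   // The change is visible in the original array: prints size - 1.
   std::cout << a.getElement( size - 1 ) << std::endl;
}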

Views are simple objects because they must be transferred to the GPU with each kernel call. So there are no smart links between a view and the original object; in fact, the array view contains just a pointer to the data managed by the array and the size of the array. Therefore, if the original object gets changed, all views obtained from the object beforehand may become invalid. See the following example:

1   template< typename Device >
2   void lambda_capture_by_value( int size )
3   {
4      TNL::Containers::Array< int, Device > a( size );
5      auto view = a.getView();
6      a.setSize( 2 * size );
7      auto f = [=] __cuda_callable__ ( int i ) mutable {
8         view[ i ] = 1;
9      };
10     TNL::Algorithms::parallelFor< Device >( 0, 2 * size, f );
11  }

Such code would not work, because after obtaining the view on the line 5, we change the size of the array a, which causes data reallocation. As we mentioned, there is no link from the view back to the array, so the view has no chance to check whether it is still up-to-date with the original object. However, if you fetch all necessary views immediately before capturing them in a lambda function, there will be no problem, as the sketch below shows. This is why views are the recommended way of accessing TNL data structures in lambda functions and GPU kernels.
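As a sketch (not taken verbatim from the library documentation), the previous snippet can be fixed simply by resizing the array first and fetching the view only afterwards:

template< typename Device >
void lambda_capture_by_value( int size )
{
   TNL::Containers::Array< int, Device > a( size );
   a.setSize( 2 * size );     // resize first, while no view is in use
   auto view = a.getView();   // the view now refers to the reallocated data
   auto f = [=] __cuda_callable__ ( int i ) mutable {
      view[ i ] = 1;
   };
   TNL::Algorithms::parallelFor< Device >( 0, 2 * size, f );
}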

Note that changing the data managed by the array after fetching the view is not an issue. See the following example:

1   template< typename Device >
2   void lambda_capture_by_value( int size )
3   {
4      TNL::Containers::Array< int, Device > a( size );
5      auto view = a.getView();
6      a.setElement( 0, 1 );
7      auto f = [=] __cuda_callable__ ( int i ) mutable {
8         view[ i ] = 1;
9      };
10     TNL::Algorithms::parallelFor< Device >( 0, size, f );
11  }

On the line 6, we change the value of the first element. This causes no data reallocation or change of size, so the view fetched on the line 5 is still valid and up-to-date.

Shared pointers

TNL offers smart pointers working across different devices (meaning CPU or GPU).