Introduction

This part shows how to use the different kinds of for-loops implemented in TNL. Namely, they are:

Parallel for is a for-loop which can be run in parallel, i.e. all iterations of the loop must be independent. Parallel for can be run on both multicore CPUs and GPUs.
n-dimensional parallel for is an extension of (one-dimensional) parallel for-loop to higher dimensions.
Unrolled for is a for-loop which is performed sequentially and it is explicitly unrolled by C++ templates. Iteration bounds must be static (known at compile time).
Static for is a for-loop with static bounds (known at compile time) and indices usable in constant expressions.

Parallel For

The basic parallel for construct in TNL expresses parallel for-loops independently of the hardware platform, which is specified by a template parameter. The loop is implemented as TNL::Algorithms::parallelFor and can be used as:

parallelFor< Device >( begin, end, function, arguments... );

The Device can be either TNL::Devices::Host or TNL::Devices::Cuda. The first two parameters define the loop bounds in the C style, i.e. the loop iterates over the indices begin, begin+1, ..., end-1. The function is a lambda function to be called in each iteration. It receives the iteration index followed by the additional arguments passed to parallelFor (the trailing arguments... in the call above).

See the following example:

#include <cstdlib>
#include <iostream>
#include <TNL/Containers/Vector.h>
#include <TNL/Algorithms/parallelFor.h>
 
using namespace TNL;
using namespace TNL::Containers;
using namespace TNL::Algorithms;
 
/****
 * Set all elements of the vector v to the constant c.
 */
template< typename Device >
void
initVector( Vector< double, Device >& v, const double& c )
{
   auto view = v.getView();
   auto init = [ = ] __cuda_callable__( int i ) mutable
   {
      view[ i ] = c;
   };
   parallelFor< Device >( 0, v.getSize(), init );
}
 
int
main( int argc, char* argv[] )
{
   /***
    * First, test the vector initialization on the CPU.
    */
   Vector< double, Devices::Host > host_v( 10 );
   initVector( host_v, 1.0 );
   std::cout << "host_v = " << host_v << std::endl;
 
   /***
    * And then also on GPU.
    */
#ifdef __CUDACC__
   Vector< double, Devices::Cuda > cuda_v( 10 );
   initVector( cuda_v, 1.0 );
   std::cout << "cuda_v = " << cuda_v << std::endl;
#endif
   return EXIT_SUCCESS;
}

The result is:

host_v = [ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ]

cuda_v = [ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ]

n-dimensional Parallel For

For-loops in higher dimensions can be performed similarly by passing multi-indices begin and end to the TNL::Algorithms::parallelFor function. In the following example we build a 3D mesh function on top of TNL::Containers::Vector. Three-dimensional indices i = ( x, y, z ) are mapped to the vector index idx as idx = ( z * ySize + y ) * xSize + x, where the mesh function has dimensions xSize * ySize * zSize. Note that since x values change the fastest and z values the slowest, this index mapping achieves sequential access to the vector elements on the CPU and coalesced memory accesses on the GPU. The following simple example initializes the mesh function with a constant value c = 1.0:

#include <cstdlib>
#include <iostream>
#include <TNL/Containers/Vector.h>
#include <TNL/Containers/StaticArray.h>
#include <TNL/Algorithms/parallelFor.h>
 
using namespace TNL;
using namespace TNL::Containers;
using namespace TNL::Algorithms;
 
template< typename Device >
void
initMeshFunction( const int xSize, const int ySize, const int zSize, Vector< double, Device >& v, const double& c )
{
   auto view = v.getView();
   auto init = [ = ] __cuda_callable__( const StaticArray< 3, int >& i ) mutable
   {
      view[ ( i.z() * ySize + i.y() ) * xSize + i.x() ] = c;
   };
   StaticArray< 3, int > begin{ 0, 0, 0 };
   StaticArray< 3, int > end{ xSize, ySize, zSize };
   parallelFor< Device >( begin, end, init );
}
 
int
main( int argc, char* argv[] )
{
   /***
    * Define dimensions of a 3D mesh function.
    */
   const int xSize( 10 ), ySize( 10 ), zSize( 10 );
   const int size = xSize * ySize * zSize;
 
   /***
    * First, test the mesh function initialization on the CPU.
    */
   Vector< double, Devices::Host > host_v( size );
   initMeshFunction( xSize, ySize, zSize, host_v, 1.0 );
 
   /***
    * And then also on GPU.
    */
#ifdef __CUDACC__
   Vector< double, Devices::Cuda > cuda_v( size );
   initMeshFunction( xSize, ySize, zSize, cuda_v, 1.0 );
#endif
   return EXIT_SUCCESS;
}

The for-loop is executed by calling parallelFor with the proper device. On the CPU, it is equivalent to the following nested for-loops:

for( Index z = beginZ; z < endZ; z++ )
   for( Index y = beginY; y < endY; y++ )
      for( Index x = beginX; x < endX; x++ )
         f( StaticArray< 3, int >{ x, y, z }, args... );

where args... stands for additional arguments that are forwarded to the lambda function after the iteration indices. In the example above there are no additional arguments, since the lambda function init captures all variables it needs to work with.

Unrolled For

TNL::Algorithms::unrolledFor is a for-loop that is explicitly unrolled via C++ templates when the loop is short (up to eight iterations by default). The bounds of unrolledFor loops must be constant (i.e. known at compile time). It is often used with static arrays and vectors.

See the following example:

#include <iostream>
#include <TNL/Containers/StaticVector.h>
#include <TNL/Algorithms/unrolledFor.h>
 
using namespace TNL;
using namespace TNL::Containers;
 
int
main( int argc, char* argv[] )
{
   /****
    * Create two static vectors
    */
   const int Size( 3 );
   StaticVector< Size, double > a, b;
   a = 1.0;
   b = 2.0;
   double sum( 0.0 );
 
   /****
    * Compute an addition of a vector and a constant number.
    */
   Algorithms::unrolledFor< int, 0, Size >(
      [ & ]( int i )
      {
         a[ i ] = b[ i ] + 3.14;
         sum += a[ i ];
      } );
   std::cout << "a = " << a << std::endl;
   std::cout << "sum = " << sum << std::endl;
}

Notice that the unrolled for-loop works with a lambda function, similarly to the parallel for-loop. The bounds of the loop are passed as template parameters in the statement Algorithms::unrolledFor< int, 0, Size >. The only argument of the unrolledFor function is the functor to be called in each iteration, and this functor receives only the loop index i.

The result is:

a = [ 5.14, 5.14, 5.14 ]

sum = 15.42

The effect of unrolledFor is the same as that of an ordinary for-loop. The following code does the same as the previous example:

for( int i = 0; i < Size; i++ )
{
   a[ i ] = b[ i ] + 3.14;
   sum += a[ i ];
}

The benefit of unrolledFor is mainly in the explicit unrolling of short loops which can improve performance in some situations. The maximum length of loops that will be fully unrolled can be specified using the fourth template parameter as follows:

Algorithms::unrolledFor< int, 0, Size, 16 >( ... );

unrolledFor can also be used in CUDA kernels.

Static For

TNL::Algorithms::staticFor is a generic for-loop whose iteration indices are usable in constant expressions (e.g. template arguments). It can be used as

staticFor< int, 0, N >( f );

which results in the following sequence of function calls:

f( std::integral_constant< int, 0 >{} );
f( std::integral_constant< int, 1 >{} );
f( std::integral_constant< int, 2 >{} );
f( std::integral_constant< int, 3 >{} );
...
f( std::integral_constant< int, N-1 >{} );

Notice that each iteration index is represented by its own distinct type using std::integral_constant. Hence, the functor f must be generic, e.g. a generic lambda expression such as in the following example:

#include <iostream>
#include <array>
#include <tuple>
#include <TNL/Algorithms/staticFor.h>
 
/*
 * Example function printing members of std::tuple using staticFor
 * using lambda with capture.
 */
template< typename... Ts >
void
printTuple( const std::tuple< Ts... >& tupleVar )
{
   std::cout << "{ ";
   TNL::Algorithms::staticFor< size_t, 0, sizeof...( Ts ) >(
      [ & ]( auto i )
      {
         std::cout << std::get< i >( tupleVar );
         if( i < sizeof...( Ts ) - 1 )
            std::cout << ", ";
      } );
   std::cout << " }" << std::endl;
}
 
struct TuplePrinter
{
   constexpr TuplePrinter() = default;
 
   template< typename Index, typename... Ts >
   void
   operator()( Index i, const std::tuple< Ts... >& tupleVar )
   {
      std::cout << std::get< i >( tupleVar );
      if( i < sizeof...( Ts ) - 1 )
         std::cout << ", ";
   }
};
 
/*
 * Example function printing members of std::tuple using staticFor
 * and a structure with templated operator().
 */
template< typename... Ts >
void
printTupleCallableStruct( const std::tuple< Ts... >& tupleVar )
{
   std::cout << "{ ";
   TNL::Algorithms::staticFor< size_t, 0, sizeof...( Ts ) >( TuplePrinter(), tupleVar );
   std::cout << " }" << std::endl;
}
 
int
main( int argc, char* argv[] )
{
   // initialize std::array
   std::array< int, 5 > a{ 1, 2, 3, 4, 5 };
 
   // print out the array using template parameters for indexing
   TNL::Algorithms::staticFor< int, 0, 5 >(
      [ &a ]( auto i )
      {
         std::cout << "a[ " << i << " ] = " << std::get< i >( a ) << std::endl;
      } );
 
   // example of printing a tuple using staticFor and a lambda function
   printTuple( std::make_tuple( "Hello", 3, 2.1 ) );
   // example of printing a tuple using staticFor and a structure with templated operator()
   printTupleCallableStruct( std::make_tuple( "Hello", 3, 2.1 ) );
}

The output looks as follows:

a[ 0 ] = 1
a[ 1 ] = 2
a[ 2 ] = 3
a[ 3 ] = 4
a[ 4 ] = 5
{ Hello, 3, 2.1 }
{ Hello, 3, 2.1 }
