Template Numerical Library version\ main:481315e2
Loading...
Searching...
No Matches
For loops

Introduction

This part shows how to use different kind of for-loops implemented in TNL. Namely, they are:

  • Parallel for is a for-loop which can be run in parallel, i.e. all iterations of the loop must be independent. Parallel for can be run on both multicore CPUs and GPUs.
  • n-dimensional parallel for is an extension of common parallel for into higher dimensions.
  • Unrolled for is a for-loop which is performed sequentially and it is explicitly unrolled by C++ templates. Iteration bounds must be static (known at compile time).
  • Static for is a for-loop with static bounds (known at compile time) and indices usable in constant expressions.

Parallel For

Basic parallel for construction in TNL serves for hardware platform transparent expression of parallel for-loops. The hardware platform is specified by a template parameter. The loop is implemented as TNL::Algorithms::parallelFor and can be used as:

parallelFor< Device >( begin, end, function, arguments... );

The Device can be either TNL::Devices::Host or TNL::Devices::Cuda. The first two parameters define the loop bounds in the C style. It means that there will be iterations for indices begin, begin+1, ..., end-1. The function is a lambda function to be called in each iteration. It is supposed to receive the iteration index and arguments passed to the parallel for (the last arguments).

See the following example:

#include <iostream>
#include <TNL/Containers/Vector.h>
#include <TNL/Algorithms/parallelFor.h>
using namespace TNL;
using namespace TNL::Containers;
using namespace TNL::Algorithms;
/****
* Set all elements of the vector v to the constant c.
*/
template< typename Device >
void initVector( Vector< double, Device >& v,
const double& c )
{
auto view = v.getView();
auto init = [=] __cuda_callable__ ( int i ) mutable
{
view[ i ] = c;
};
parallelFor< Device >( 0, v.getSize(), init );
}
int main( int argc, char* argv[] )
{
/***
* Firstly, test the vector initiation on CPU.
*/
initVector( host_v, 1.0 );
std::cout << "host_v = " << host_v << std::endl;
/***
* And then also on GPU.
*/
#ifdef __CUDACC__
initVector( cuda_v, 1.0 );
std::cout << "cuda_v = " << cuda_v << std::endl;
#endif
return EXIT_SUCCESS;
}
#define __cuda_callable__
Definition CudaCallable.h:22
__cuda_callable__ IndexType getSize() const
Returns the current array size.
Definition Array.hpp:248
Vector extends Array with algebraic operations.
Definition Vector.h:39
ViewType getView(IndexType begin=0, IndexType end=0)
Returns a modifiable view of the vector.
Definition Vector.hpp:28
T endl(T... args)
Namespace for fundamental TNL algorithms.
Definition AtomicOperations.h:12
Namespace for TNL containers.
Definition Array.h:20
The main TNL namespace.
Definition AtomicOperations.h:12

The result is:

host_v = [ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ]
cuda_v = [ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ]

n-dimensional Parallel For

For-loops in higher dimensions can be performed similarly by passing multi-index begin and end to the TNL::Algorithms::parallelFor function. In the following example we build a 3D mesh function on top of TNL::Containers::Vector. Three-dimensional indices i = ( x, y, z ) are mapped to the vector index idx as idx = ( z * ySize + y ) * xSize + x, where the mesh function has dimensions xSize * ySize * zSize. Note that since x values change the fastest and z values the slowest, the index mapping j * xSize + i achieves sequential access to the vector elements on CPU and coalesced memory accesses on GPU. The following simple example performs initiation of the mesh function with a constant value c = 1.0:

#include <iostream>
#include <TNL/Containers/Vector.h>
#include <TNL/Containers/StaticArray.h>
#include <TNL/Algorithms/parallelFor.h>
using namespace TNL;
using namespace TNL::Containers;
using namespace TNL::Algorithms;
template< typename Device >
void initMeshFunction( const int xSize,
const int ySize,
const int zSize,
const double& c )
{
auto view = v.getView();
auto init = [=] __cuda_callable__ ( const StaticArray< 3, int >& i ) mutable
{
view[ ( i.z() * ySize + i.y() ) * xSize + i.x() ] = c;
};
StaticArray< 3, int > begin{ 0, 0, 0 };
StaticArray< 3, int > end{ xSize, ySize, zSize };
parallelFor< Device >( begin, end, init );
}
int main( int argc, char* argv[] )
{
/***
* Define dimensions of a 3D mesh function.
*/
const int xSize( 10 ), ySize( 10 ), zSize( 10 );
const int size = xSize * ySize * zSize;
/***
* Firstly, test the mesh function initiation on CPU.
*/
initMeshFunction( xSize, ySize, zSize, host_v, 1.0 );
/***
* And then also on GPU.
*/
#ifdef __CUDACC__
initMeshFunction( xSize, ySize, zSize, cuda_v, 1.0 );
#endif
return EXIT_SUCCESS;
}
Array with constant size.
Definition StaticArray.h:23
T end(T... args)

The for-loop is executed by calling parallelFor with proper device. On CPU it is equivalent to the following nested for-loops:

for( Index z = beginZ; z < endZ; z++ )
for( Index y = beginY; y < endY; y++ )
for( Index x = beginX; x < endX; x++ )
f( StaticArray< 3, int >{ x, y, z }, args... );

where args... stand for additional arguments that are forwarded to the lambda function after the iteration indices. In the example above there are no additional arguments, since the lambda function init captures all variables it needs to work with.

Unrolled For

TNL::Algorithms::unrolledFor is a for-loop that it is explicitly unrolled via C++ templates when the loop is short (up to eight iterations). The bounds of unrolledFor loops must be constant (i.e. known at the compile time). It is often used with static arrays and vectors.

See the following example:

#include <iostream>
#include <TNL/Containers/StaticVector.h>
#include <TNL/Algorithms/unrolledFor.h>
using namespace TNL;
using namespace TNL::Containers;
int main( int argc, char* argv[] )
{
/****
* Create two static vectors
*/
const int Size( 3 );
a = 1.0;
b = 2.0;
double sum( 0.0 );
/****
* Compute an addition of a vector and a constant number.
*/
Algorithms::unrolledFor< int, 0, Size >(
[&]( int i ) {
a[ i ] = b[ i ] + 3.14;
sum += a[ i ];
}
);
std::cout << "a = " << a << std::endl;
std::cout << "sum = " << sum << std::endl;
}
Vector with constant size.
Definition StaticVector.h:22

Notice that the unrolled for-loop works with a lambda function similar to parallel for-loop. The bounds of the loop are passed as template parameters in the statement Algorithms::unrolledFor< int, 0, Size >. The parameter of the unrolledFor function is the functor to be called in each iteration. The function gets the loop index i only, see the following example:

The result looks as:

a = [ 5.14, 5.14, 5.14 ]
sum = 15.42

The effect of unrolledFor is really the same as usual for-loop. The following code does the same as the previous example:

for( int i = 0; i < Size; i++ )
{
a[ i ] = b[ i ] + 3.14;
sum += a[ i ];
};

The benefit of unrolledFor is mainly in the explicit unrolling of short loops which can improve performance in some situations. The maximum length of loops that will be fully unrolled can be specified using the fourth template parameter as follows:

Algorithms::unrolledFor< int, 0, Size, 16 >( ... );

unrolledFor can be used also in CUDA kernels.

Static For

TNL::Algorithms::staticFor is a generic for-loop whose iteration indices are usable in constant expressions (e.g. template arguments). It can be used as

staticFor< int, 0, N >( f );

which results in the following sequence of function calls:

Notice that each iteration index is represented by its own distinct type using std::integral_constant. Hence, the functor f must be generic, e.g. a generic lambda expression such as in the following example:

#include <iostream>
#include <array>
#include <tuple>
#include <TNL/Algorithms/staticFor.h>
/*
* Example function printing members of std::tuple using staticFor
* using lambda with capture.
*/
template< typename... Ts >
void printTuple( const std::tuple<Ts...>& tupleVar )
{
std::cout << "{ ";
TNL::Algorithms::staticFor<size_t, 0, sizeof... (Ts)>( [&](auto i) {
std::cout << std::get<i>(tupleVar);
if( i < sizeof... (Ts) - 1 )
std::cout << ", ";
});
std::cout << " }" << std::endl;
}
struct TuplePrinter
{
constexpr TuplePrinter() = default;
template< typename Index, typename... Ts >
void operator()( Index i, const std::tuple<Ts...>& tupleVar )
{
std::cout << std::get<i>( tupleVar );
if( i < sizeof... (Ts) - 1 )
std::cout << ", ";
}
};
/*
* Example function printing members of std::tuple using staticFor
* and a structure with templated operator().
*/
template< typename... Ts >
void printTupleCallableStruct( const std::tuple<Ts...>& tupleVar )
{
std::cout << "{ ";
TNL::Algorithms::staticFor< size_t, 0, sizeof... (Ts) >( TuplePrinter(), tupleVar );
std::cout << " }" << std::endl;
}
int main( int argc, char* argv[] )
{
// initiate std::array
std::array< int, 5 > a{ 1, 2, 3, 4, 5 };
// print out the array using template parameters for indexing
[&a] ( auto i ) {
std::cout << "a[ " << i << " ] = " << std::get< i >( a ) << std::endl;
}
);
// example of printing a tuple using staticFor and a lambda function
printTuple( std::make_tuple( "Hello", 3, 2.1 ) );
// example of printing a tuple using staticFor and a structure with templated operator()
printTupleCallableStruct( std::make_tuple( "Hello", 3, 2.1 ) );
}
T make_tuple(T... args)
std::enable_if_t< std::is_integral_v< Begin > &&std::is_integral_v< End > > parallelFor(const Begin &begin, const End &end, typename Device::LaunchConfiguration launch_config, Function f, FunctionArgs... args)
Parallel for-loop function for 1D range specified with integral values.
Definition parallelFor.h:44
constexpr void staticFor(Func &&f, ArgTypes &&... args)
Generic loop with constant bounds and indices usable in constant expressions.
Definition staticFor.h:63

The output looks as follows:

a[ 0 ] = 1
a[ 1 ] = 2
a[ 2 ] = 3
a[ 3 ] = 4
a[ 4 ] = 5
{ Hello, 3, 2.1 }
{ Hello, 3, 2.1 }