Introduction
This part shows how to use the different kinds of for-loops implemented in TNL:
- Parallel for is a for-loop which can be run in parallel, i.e. all iterations of the loop must be independent. Parallel for can be run on both multicore CPUs and GPUs.
- n-dimensional parallel for is an extension of (one-dimensional) parallel for-loop to higher dimensions.
- Unrolled for is a for-loop which is performed sequentially and is explicitly unrolled by C++ templates. Iteration bounds must be static (known at compile time).
- Static for is a for-loop with static bounds (known at compile time) and indices usable in constant expressions.
Parallel For
The basic parallel for construct in TNL allows expressing parallel for-loops in a way that is independent of the hardware platform, which is specified by a template parameter. The loop is implemented as TNL::Algorithms::parallelFor and can be used as:
parallelFor< Device >( begin, end, function, arguments... );
The Device can be either TNL::Devices::Host or TNL::Devices::Cuda. The first two parameters define the loop bounds in the C style, i.e. the loop iterates over the indices begin, begin+1, ..., end-1. The function is a lambda function that is called in each iteration. It receives the iteration index followed by the additional arguments passed to parallelFor (the last arguments).
See the following example:
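#include <iostream>
#include <cstdlib>
#include <TNL/Containers/Vector.h>
#include <TNL/Algorithms/parallelFor.h>

using namespace TNL;
using namespace TNL::Containers;
using namespace TNL::Algorithms;

template< typename Device >
void
initVector( Vector< double, Device >& v, const double& c )
{
   // Capture a view of the vector; the lambda may run on the device.
   auto view = v.getView();
   auto init = [ = ] __cuda_callable__( int i ) mutable
   {
      view[ i ] = c;
   };
   parallelFor< Device >( 0, v.getSize(), init );
}

int
main( int argc, char* argv[] )
{
   // First, initialize a vector on the CPU.
   Vector< double, Devices::Host > host_v( 10 );
   initVector( host_v, 1.0 );
   std::cout << "host_v = " << host_v << std::endl;

   // Then do the same on the GPU.
#ifdef __CUDACC__
   Vector< double, Devices::Cuda > cuda_v( 10 );
   initVector( cuda_v, 1.0 );
   std::cout << "cuda_v = " << cuda_v << std::endl;
#endif
   return EXIT_SUCCESS;
}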
The result is:
host_v = [ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ]
cuda_v = [ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ]
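To make the loop semantics concrete, the following sketch gives a sequential reference in standard C++ (no TNL required). The names sequentialFor and makeConstantVector are hypothetical and are not part of TNL; the sketch only mirrors the calling convention described above, i.e. the half-open range of indices begin, ..., end-1 and the additional arguments forwarded to the function after the index. It says nothing about how TNL distributes the iterations over threads.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sequential stand-in for parallelFor (not part of TNL):
// calls f( i, args... ) for i = begin, begin+1, ..., end-1.
template< typename Index, typename Function, typename... Args >
void
sequentialFor( Index begin, Index end, Function&& f, Args&&... args )
{
   for( Index i = begin; i < end; i++ )
      f( i, args... );
}

// Fills a vector of the given size with the constant c, analogously to initVector.
std::vector< double >
makeConstantVector( std::size_t size, double c )
{
   std::vector< double > v( size, 0.0 );
   double* data = v.data();
   // The additional argument c reaches the lambda as its second parameter.
   auto init = [ data ]( std::size_t i, double value )
   {
      data[ i ] = value;
   };
   sequentialFor( std::size_t( 0 ), v.size(), init, c );
   return v;
}
```

Calling makeConstantVector( 10, 1.0 ) yields ten elements equal to 1, matching the output above. Because every iteration writes to a different element, the iterations are independent, which is the property a parallel for-loop relies on.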
n-dimensional Parallel For
For-loops in higher dimensions can be performed similarly by passing multi-indices begin and end to the TNL::Algorithms::parallelFor function. In the following example we build a 3D mesh function on top of TNL::Containers::Vector. Three-dimensional indices i = ( x, y, z ) are mapped to the vector index idx as idx = ( z * ySize + y ) * xSize + x, where the mesh function has dimensions xSize * ySize * zSize. Note that since x values change the fastest and z values the slowest, this index mapping achieves sequential access to the vector elements on the CPU and coalesced memory accesses on the GPU. The following simple example initializes the mesh function with the constant value c = 1.0:
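#include <iostream>
#include <cstdlib>
#include <TNL/Containers/Vector.h>
#include <TNL/Containers/StaticArray.h>
#include <TNL/Algorithms/parallelFor.h>

using namespace TNL;
using namespace TNL::Containers;
using namespace TNL::Algorithms;

template< typename Device >
void
initMeshFunction( const int xSize, const int ySize, const int zSize, Vector< double, Device >& v, const double& c )
{
   auto view = v.getView();
   // The lambda receives the whole multi-index i = ( x, y, z ).
   auto init = [ = ] __cuda_callable__( const StaticArray< 3, int >& i ) mutable
   {
      view[ ( i.z() * ySize + i.y() ) * xSize + i.x() ] = c;
   };
   StaticArray< 3, int > begin{ 0, 0, 0 };
   StaticArray< 3, int > end{ xSize, ySize, zSize };
   parallelFor< Device >( begin, end, init );
}

int
main( int argc, char* argv[] )
{
   // Define the dimensions of the 3D mesh function.
   const int xSize( 10 ), ySize( 10 ), zSize( 10 );
   const int size = xSize * ySize * zSize;

   // First, initialize the mesh function on the CPU.
   Vector< double, Devices::Host > host_v( size );
   initMeshFunction( xSize, ySize, zSize, host_v, 1.0 );

   // Then do the same on the GPU.
#ifdef __CUDACC__
   Vector< double, Devices::Cuda > cuda_v( size );
   initMeshFunction( xSize, ySize, zSize, cuda_v, 1.0 );
#endif
   return EXIT_SUCCESS;
}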
The for-loop is executed by calling parallelFor with the proper device. On the CPU it is equivalent to the following nested for-loops:
for( Index z = beginZ; z < endZ; z++ )
   for( Index y = beginY; y < endY; y++ )
      for( Index x = beginX; x < endX; x++ )
         f( StaticArray< 3, Index >{ x, y, z }, args... );
where args... stands for additional arguments that are forwarded to the lambda function after the iteration index. In the example above there are no additional arguments, since the lambda function init captures all variables it needs to work with.
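The claim that this index mapping gives sequential memory access can be checked with a small standard C++ sketch (no TNL required). The functions linearIndex and visitOrder are hypothetical helpers introduced only for this illustration, and std::array stands in for the multi-index type. The sketch traverses the indices in the same order as the nested loops, with x in the innermost loop, and records the linear index visited in each step.

```cpp
#include <array>
#include <vector>

// Maps a 3D index i = ( x, y, z ) to the linear index; x changes the fastest.
inline int
linearIndex( const std::array< int, 3 >& i, int xSize, int ySize )
{
   return ( i[ 2 ] * ySize + i[ 1 ] ) * xSize + i[ 0 ];
}

// Traverses the indices with z outermost and x innermost and returns
// the linear indices in the order in which they are visited.
std::vector< int >
visitOrder( int xSize, int ySize, int zSize )
{
   std::vector< int > order;
   order.reserve( xSize * ySize * zSize );
   for( int z = 0; z < zSize; z++ )
      for( int y = 0; y < ySize; y++ )
         for( int x = 0; x < xSize; x++ )
            order.push_back( linearIndex( { x, y, z }, xSize, ySize ) );
   return order;
}
```

For a mesh of size 2 * 3 * 4, visitOrder( 2, 3, 4 ) returns 0, 1, ..., 23, i.e. consecutive iterations touch consecutive vector elements. Swapping the roles of x and z in the mapping would instead produce strided accesses.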
Unrolled For
TNL::Algorithms::unrolledFor is a for-loop that is explicitly unrolled via C++ templates when the loop is short (up to eight iterations). The bounds of unrolledFor loops must be constant (i.e. known at compile time). It is often used with static arrays and vectors.
See the following example:
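#include <iostream>
#include <TNL/Containers/StaticVector.h>
#include <TNL/Algorithms/unrolledFor.h>

using namespace TNL;
using namespace TNL::Containers;

int
main( int argc, char* argv[] )
{
   // Create two static vectors.
   const int Size( 3 );
   StaticVector< Size, double > a, b;
   a = 1.0;
   b = 2.0;
   double sum( 0.0 );

   // Add a constant to the vector b and sum up the result.
   Algorithms::unrolledFor< int, 0, Size >(
      [ & ]( int i )
      {
         a[ i ] = b[ i ] + 3.14;
         sum += a[ i ];
      } );
   std::cout << "a = " << a << std::endl;
   std::cout << "sum = " << sum << std::endl;
}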
Notice that the unrolled for-loop works with a lambda function, similarly to the parallel for-loop. The bounds of the loop are passed as template parameters in the statement Algorithms::unrolledFor< int, 0, Size >. The parameter of the unrolledFor function is the functor to be called in each iteration. The functor receives only the loop index i, as in the example above.
The output is:
a = [ 5.14, 5.14, 5.14 ]
sum = 15.42
The effect of unrolledFor is the same as that of an ordinary for-loop. The following code does the same as the previous example:
for( int i = 0; i < Size; i++ )
{
a[ i ] = b[ i ] + 3.14;
sum += a[ i ];
}
The benefit of unrolledFor lies mainly in the explicit unrolling of short loops, which can improve performance in some situations. The maximum length of loops that will be fully unrolled can be specified using the fourth template parameter as follows:
Algorithms::unrolledFor< int, 0, Size, 16 >( ... );
unrolledFor can also be used in CUDA kernels.
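To illustrate what unrolling via C++ templates can look like, here is a minimal sketch in standard C++17 (no TNL required). The names unrollImpl, unrolledForSketch and addConstantAndSum are hypothetical, and this is not TNL's implementation; in particular, the sketch always unrolls and has no limit on the loop length. It expands the loop body into a sequence of calls at compile time using std::integer_sequence and a fold expression.

```cpp
#include <array>
#include <utility>

// Hypothetical helper (not TNL's implementation): the fold expression expands
// to f( Begin + 0 ), f( Begin + 1 ), ... so that no loop counter remains.
template< typename Index, Index Begin, typename Func, Index... Offsets >
constexpr void
unrollImpl( Func&& f, std::integer_sequence< Index, Offsets... > )
{
   ( f( Begin + Offsets ), ... );
}

// Calls f( i ) for i = Begin, ..., End-1; the bounds are template parameters.
template< typename Index, Index Begin, Index End, typename Func >
constexpr void
unrolledForSketch( Func&& f )
{
   static_assert( Begin <= End, "the loop bounds must satisfy Begin <= End" );
   unrollImpl< Index, Begin >( f, std::make_integer_sequence< Index, End - Begin >{} );
}

// The same computation as in the unrolledFor example, on std::array.
inline double
addConstantAndSum( std::array< double, 3 >& a, const std::array< double, 3 >& b )
{
   double sum = 0.0;
   unrolledForSketch< int, 0, 3 >(
      [ & ]( int i )
      {
         a[ i ] = b[ i ] + 3.14;
         sum += a[ i ];
      } );
   return sum;
}
```

The functor still receives an ordinary run-time int, exactly as with a plain for-loop; only the loop control is resolved at compile time. This is the difference from a static for-loop, where the index itself is a compile-time constant.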
Static For
TNL::Algorithms::staticFor is a generic for-loop whose iteration indices are usable in constant expressions (e.g. template arguments). It can be used as
staticFor< int, 0, N >( f );
which results in the following sequence of function calls:
f( std::integral_constant< int, 0 >{} );
f( std::integral_constant< int, 1 >{} );
...
f( std::integral_constant< int, N-1 >{} );
Notice that each iteration index is represented by its own distinct type using std::integral_constant. Hence, the functor f must be generic, e.g. a generic lambda expression such as in the following example:
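#include <iostream>
#include <array>
#include <tuple>
#include <TNL/Algorithms/staticFor.h>

// Prints the members of std::tuple using staticFor and a capturing lambda.
template< typename... Ts >
void
printTuple( const std::tuple< Ts... >& tupleVar )
{
   std::cout << "{ ";
   TNL::Algorithms::staticFor< size_t, 0, sizeof...( Ts ) >(
      [ & ]( auto i )
      {
         std::cout << std::get< i >( tupleVar );
         if( i < sizeof...( Ts ) - 1 )
            std::cout << ", ";
      } );
   std::cout << " }" << std::endl;
}

// Functor with a templated operator(), used instead of a lambda.
struct TuplePrinter
{
   constexpr TuplePrinter() = default;

   template< typename Index, typename... Ts >
   void
   operator()( Index i, const std::tuple< Ts... >& tupleVar )
   {
      std::cout << std::get< i >( tupleVar );
      if( i < sizeof...( Ts ) - 1 )
         std::cout << ", ";
   }
};

// Prints the members of std::tuple using staticFor and TuplePrinter.
template< typename... Ts >
void
printTupleCallableStruct( const std::tuple< Ts... >& tupleVar )
{
   std::cout << "{ ";
   TNL::Algorithms::staticFor< size_t, 0, sizeof...( Ts ) >( TuplePrinter(), tupleVar );
   std::cout << " }" << std::endl;
}

int
main( int argc, char* argv[] )
{
   std::array< int, 5 > a{ 1, 2, 3, 4, 5 };

   // Print the array, using the index i as a template argument of std::get.
   TNL::Algorithms::staticFor< int, 0, 5 >(
      [ &a ]( auto i )
      {
         std::cout << "a[ " << i << " ] = " << std::get< i >( a ) << std::endl;
      } );

   printTuple( std::make_tuple( "Hello", 3, 2.1 ) );
   printTupleCallableStruct( std::make_tuple( "Hello", 3, 2.1 ) );
}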
The output looks as follows:
a[ 0 ] = 1
a[ 1 ] = 2
a[ 2 ] = 3
a[ 3 ] = 4
a[ 4 ] = 5
{ Hello, 3, 2.1 }
{ Hello, 3, 2.1 }
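The mechanism that makes the indices usable in constant expressions can be sketched in standard C++17 without TNL. The names staticForImpl, staticForSketch and tupleToString are hypothetical, and the sketch is not TNL's implementation; it only assumes the behaviour described above, namely that the functor is called once per index with a distinct std::integral_constant type. Since the value is encoded in the type of the argument, it can be used as a template argument, e.g. in std::get.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <tuple>
#include <type_traits>
#include <utility>

// Hypothetical helper (not TNL's implementation): each index is passed as a
// distinct std::integral_constant type, so its value is a compile-time constant.
template< typename Index, Index Begin, typename Func, Index... Offsets >
constexpr void
staticForImpl( Func&& f, std::integer_sequence< Index, Offsets... > )
{
   ( f( std::integral_constant< Index, Begin + Offsets >{} ), ... );
}

// Calls f( std::integral_constant< Index, i >{} ) for i = Begin, ..., End-1.
template< typename Index, Index Begin, Index End, typename Func >
constexpr void
staticForSketch( Func&& f )
{
   staticForImpl< Index, Begin >( f, std::make_integer_sequence< Index, End - Begin >{} );
}

// Formats a tuple as "{ a, b, c }", using the index i inside std::get< i >.
template< typename... Ts >
std::string
tupleToString( const std::tuple< Ts... >& t )
{
   std::ostringstream out;
   out << "{ ";
   staticForSketch< std::size_t, 0, sizeof...( Ts ) >(
      [ & ]( auto i )
      {
         out << std::get< i >( t );
         if( i < sizeof...( Ts ) - 1 )
            out << ", ";
      } );
   out << " }";
   return out.str();
}
```

With this sketch, tupleToString( std::make_tuple( "Hello", 3, 2.1 ) ) produces the string { Hello, 3, 2.1 }, the same text as the last two lines of the output above. A lambda taking a plain int parameter would not compile here, because std::get requires a compile-time index.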