Introduction
This chapter introduces flexible parallel reduction in TNL. It shows how to easily implement parallel reduction with user defined operations which may run on both CPU and GPU. Parallel reduction is a programming pattern appearing very often in different kind of algorithms for example in scalar product, vector norms or mean value evaluation but also in sequences or strings comparison.
Flexible reduction
We will explain the flexible parallel reduction on several examples. We start with the simplest sum of sequence of numbers followed by more advanced problems like scalar product or vector norms.
Sum
We start with simple problem of computing sum of sequence of numbers
\[ s = \sum_{i=1}^n a_i. \]
Sequentialy, such sum can be computed very easily as follows:
2sequentialSum(
const double* a,
const int size )
5 for(
int i = 0; i < size; i++ )
Doing the same in CUDA for GPU is, however, much more difficult (see Optimizing Parallel Reduction in CUDA). The final code has tens of lines and it is something you do not want to write again and again anytime you need to sum a series of numbers. Using TNL and C++ lambda functions we may do the same on few lines of code efficiently and independently on the hardware beneath. Let us first rewrite the previous example using the C++ lambda functions:
2sequentialSum(
const double* a,
const int size )
4 auto fetch = [ = ](
int i ) ->
double
8 auto reduction = [](
double& x,
const double& y )
14 for(
int i = 0; i < size; i++ )
15 sum = reduction( sum, fetch( i ) );
As can be seen, we split the reduction into two steps:
fetch
reads the input data. Thanks to this lambda you can:
- Connect the reduction algorithm with given input arrays or vectors (or any other data structure).
- Perform operation you need to do with the input data.
- Perform another secondary operation simoultanously with the parallel reduction.
reduction
is operation we want to do after the data fetch. Usually it is summation, multiplication, evaluation of minimum or maximum or some logical operation.
Putting everything together gives the following example:
3#include <TNL/Containers/Vector.h>
4#include <TNL/Algorithms/reduce.h>
10template<
typename Device >
40 return reduce< Device >( 0, view.getSize(), fetch, reduction, 0.0 );
44main(
int argc,
char* argv[] )
52 std::cout <<
"The sum of the host vector elements is " << sum( host_v ) <<
"." <<
std::endl;
61 std::cout <<
"The sum of the CUDA vector elements is " << sum( cuda_v ) <<
"." <<
std::endl;
#define __cuda_callable__
Definition Macros.h:49
Vector extends Array with algebraic operations.
Definition Vector.h:36
ConstViewType getConstView(IndexType begin=0, IndexType end=0) const
Returns a non-modifiable view of the vector.
Definition Vector.hpp:40
Namespace for fundamental TNL algorithms.
Definition AtomicOperations.h:9
Namespace for TNL containers.
Definition Array.h:17
The main TNL namespace.
Definition AtomicOperations.h:9
Since TNL vectors cannot be passed to CUDA kernels and so they cannot be captured by CUDA lambdas, we must first get vector view from the vector using a method getConstView()
.
Note that we pass 0.0
as the last argument of the template function reduce< Device >
. It is an identity element for given operation, i.e., an element which does not change the result of the operation. For addition, it is zero.
The result of the previous code sample looks as follows:
host_v = [ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ]
The sum of the host vector elements is 10.
cuda_v = [ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ]
The sum of the CUDA vector elements is 10.
Note that the sum of vector elements can be also obtained as TNL::sum(v)
.
Product
To demonstrate the effect of the identity element, we will now compute product of all elements of the vector. The identity element is one for multiplication and we also need to replace a + b
with a * b
in the definition of reduction
. We get the following code:
3#include <TNL/Containers/Vector.h>
4#include <TNL/Algorithms/reduce.h>
10template<
typename Device >
27 return reduce< Device >( 0, view.getSize(), fetch, reduction, 1.0 );
31main(
int argc,
char* argv[] )
39 std::cout <<
"The product of the host vector elements is " << product( host_v ) <<
"." <<
std::endl;
48 std::cout <<
"The product of the CUDA vector elements is " << product( cuda_v ) <<
"." <<
std::endl;
leading to output like this:
host_v = [ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ]
The product of the host vector elements is 1.
cuda_v = [ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ]
The product of the CUDA vector elements is 1.
Note that the product of vector elements can be computed as TNL::product(v)
.
Scalar product
One of the most important operation in the linear algebra is the scalar product of two vectors. Compared to computing the sum of vector elements we must change the function fetch
to read elements from both vectors and multiply them. See the following example.
3#include <TNL/Containers/Vector.h>
4#include <TNL/Algorithms/reduce.h>
10template<
typename Device >
22 return u_view[ i ] * v_view[ i ];
28 return reduce< Device >( 0, v_view.getSize(), fetch, reduction, 0.0 );
32main(
int argc,
char* argv[] )
39 host_v.forAllElements(
42 value = 2 * ( i % 2 ) - 1;
46 std::cout <<
"The scalar product ( host_u, host_v ) is " << scalarProduct( host_u, host_v ) <<
"." <<
std::endl;
47 std::cout <<
"The scalar product ( host_v, host_v ) is " << scalarProduct( host_v, host_v ) <<
"." <<
std::endl;
55 cuda_v.forAllElements(
58 value = 2 * ( i % 2 ) - 1;
62 std::cout <<
"The scalar product ( cuda_u, cuda_v ) is " << scalarProduct( cuda_u, cuda_v ) <<
"." <<
std::endl;
63 std::cout <<
"The scalar product ( cuda_v, cuda_v ) is " << scalarProduct( cuda_v, cuda_v ) <<
"." <<
std::endl;
The result is:
host_u = [ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ]
host_v = [ -1, 1, -1, 1, -1, 1, -1, 1, -1, 1 ]
The scalar product ( host_u, host_v ) is 0.
The scalar product ( host_v, host_v ) is 10.
cuda_u = [ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ]
cuda_v = [ -1, 1, -1, 1, -1, 1, -1, 1, -1, 1 ]
The scalar product ( cuda_u, cuda_v ) is 0.
The scalar product ( cuda_v, cuda_v ) is 10.
Note that the scalar product of vectors u
and v
can be computed by TNL::dot(u, v)
or simply as (u, v)
.
Maximum norm
The maximum norm of a vector equals the modulus of the vector largest element. Therefore, fetch
must return the absolute value of the vector elements and reduction
will return maximum of given values. Look at the following example.
3#include <TNL/Containers/Vector.h>
4#include <TNL/Algorithms/reduce.h>
10template<
typename Device >
17 return abs( view[ i ] );
23 return reduce< Device >( 0, view.getSize(), fetch, reduction, 0.0 );
27main(
int argc,
char* argv[] )
30 host_v.forAllElements(
36 std::cout <<
"The maximum norm of the host vector elements is " << maximumNorm( host_v ) <<
"." <<
std::endl;
39 cuda_v.forAllElements(
45 std::cout <<
"The maximum norm of the CUDA vector elements is " << maximumNorm( cuda_v ) <<
"." <<
std::endl;
constexpr ResultType max(const T1 &a, const T2 &b)
This function returns maximum of two numbers.
Definition Math.h:48
The output is:
host_v = [ -7, -6, -5, -4, -3, -2, -1, 0, 1, 2 ]
The maximum norm of the host vector elements is 7.
cuda_v = [ -7, -6, -5, -4, -3, -2, -1, 0, 1, 2 ]
The maximum norm of the CUDA vector elements is 7.
Note that the maximum norm can be computed by TNL::maxNorm(v)
.
Vectors comparison
The comparison of two vectors involves (parallel) reduction as well. The fetch
part is responsible for the comparison of corresponding vector elements, resulting in a boolean value true
or false
for each of the vector elements. The reduction
part must perform logical and operation on all fetched values. We must not forget to change the identity element to true
. The code may look as follows:
3#include <TNL/Containers/Vector.h>
4#include <TNL/Algorithms/reduce.h>
10template<
typename Device >
22 return ( u_view[ i ] == v_view[ i ] );
32 return reduce< Device >( 0, v_view.getSize(), fetch, reduction,
true );
36main(
int argc,
char* argv[] )
40 host_v.forAllElements(
43 value = 2 * ( i % 2 ) - 1;
47 std::cout <<
"Comparison of host_u and host_v is: " << ( comparison( host_u, host_v ) ?
"'true'" :
"'false'" ) <<
"."
49 std::cout <<
"Comparison of host_u and host_u is: " << ( comparison( host_u, host_u ) ?
"'true'" :
"'false'" ) <<
"."
54 cuda_v.forAllElements(
57 value = 2 * ( i % 2 ) - 1;
61 std::cout <<
"Comparison of cuda_u and cuda_v is: " << ( comparison( cuda_u, cuda_v ) ?
"'true'" :
"'false'" ) <<
"."
63 std::cout <<
"Comparison of cuda_u and cuda_u is: " << ( comparison( cuda_u, cuda_u ) ?
"'true'" :
"'false'" ) <<
"."
And the output looks as:
host_u = [ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ]
host_v = [ -1, 1, -1, 1, -1, 1, -1, 1, -1, 1 ]
Comparison of host_u and host_v is: 'false'.
Comparison of host_u and host_u is: 'true'.
cuda_u = [ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ]
cuda_v = [ -1, 1, -1, 1, -1, 1, -1, 1, -1, 1 ]
Comparison of cuda_u and cuda_v is: 'false'.
Comparison of cuda_u and cuda_u is: 'true'.
Update and residue
In iterative solvers we often need to update a vector and compute the norm at the same time. For example, the Euler method is defined as
\[ \bf u^{k+1} = \bf u^k + \tau \Delta \bf u. \]
Together with the vector addition, we may want to compute also \(L_2\)-norm of \( \Delta \bf u \) which may indicate convergence. Computing first the addition and then the norm would be inefficient because we would have to fetch the vector \( \Delta \bf u \) twice from the memory. The following example shows how to do the addition and norm computation at the same time.
3#include <TNL/Containers/Vector.h>
4#include <TNL/Algorithms/reduce.h>
10template<
typename Device >
18 const double& add = delta_u_view[ i ];
19 u_view[ i ] += tau * add;
26 return sqrt( reduce< Device >( 0, u_view.getSize(), fetch, reduction, 0.0 ) );
30main(
int argc,
char* argv[] )
32 const double tau = 0.1;
38 double residue = updateAndResidue( host_u, host_delta_u, tau );
47 residue = updateAndResidue( cuda_u, cuda_delta_u, tau );
ViewType getView(IndexType begin=0, IndexType end=0)
Returns a modifiable view of the vector.
Definition Vector.hpp:25
The result reads as:
host_u = [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ]
host_delta_u = [ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ]
New host_u is: [ 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1 ].
Residue is:3.16228
cuda_u = [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ]
cuda_delta_u = [ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ]
New cuda_u is: [ 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1 ].
Residue is:3.16228
Simple MapReduce
We can also filter the data to be reduced. This operation is called MapReduce. You simply add the necessary if-statement to the fetch function, or in the case of the following example, we use the ternary conditional operator
return u_view[ i ] > 0.0 ? u_view[ i ] : 0.0;
to sum up only the positive numbers in the vector.
3#include <TNL/Containers/Vector.h>
4#include <TNL/Algorithms/reduce.h>
10template<
typename Device >
17 return u_view[ i ] > 0 ? u_view[ i ] : 0.0;
23 return reduce< Device >( 0, u_view.getSize(), fetch, reduction, 0.0 );
27main(
int argc,
char* argv[] )
30 host_u.forAllElements(
33 value = sin( (
double) i );
35 double result = mapReduce( host_u );
41 result = mapReduce( cuda_u );
The result is:
host_u = [ 0, 0.841471, 0.909297, 0.14112, -0.756802, -0.958924, -0.279415, 0.656987, 0.989358, 0.412118 ]
Sum of the positive numbers is:3.95035
cuda_u = [ 0, 0.841471, 0.909297, 0.14112, -0.756802, -0.958924, -0.279415, 0.656987, 0.989358, 0.412118 ]
Sum of the positive numbers is:3.95035
Take a look at the following example where the filtering depends on the element indexes rather than values:
3#include <TNL/Containers/Vector.h>
4#include <TNL/Algorithms/reduce.h>
11template<
typename Device >
26 return reduce< Device >( 0, u_view.getSize(), fetch, reduction, 0.0 );
30main(
int argc,
char* argv[] )
36 double result = mapReduce( host_u );
44 result = mapReduce( cuda_u );
Class for real time, CPU time and CPU cycles measuring.
Definition Timer.h:25
void reset()
Reset the CPU and real-time timers.
Definition Timer.hpp:22
double getRealTime() const
Returns the elapsed real time on this timer.
Definition Timer.hpp:50
void stop()
Stops (pauses) the CPU and the real-time timers, but does not set them to zero.
Definition Timer.hpp:32
void start()
Starts timer.
Definition Timer.hpp:42
The result is:
Host tesult is:50000. It took 0.00388109 seconds.
CUDA result is:50000. It took 0.0244926 seconds.
This is not very efficient. For half of the elements, we return zero which has no effect during the reduction. A better solution is to run the reduction only for a half of the elements and to change the fetch function to
See the following example and compare the execution times.
3#include <TNL/Containers/Vector.h>
4#include <TNL/Algorithms/reduce.h>
11template<
typename Device >
18 return u_view[ 2 * i ];
24 return reduce< Device >( 0, u_view.getSize() / 2, fetch, reduction, 0.0 );
28main(
int argc,
char* argv[] )
34 double result = mapReduce( host_u );
42 result = mapReduce( cuda_u );
Host result is:50000. It took 0.00213579 seconds.
CUDA result is:50000. It took 0.0145357 seconds.
Reduction with argument
In some situations we may need to locate given element in the vector. For example index of the smallest or the largest element. reduceWithArgument
is a function which can do it. In the following example, we modify function for computing the maximum norm of a vector. Instead of just computing the value, now we want to get index of the element having the absolute value equal to the max norm. The lambda function reduction
do not compute only maximum of two given elements anymore, but it must also compute index of the winner. See the following code:
3#include <TNL/Containers/Vector.h>
4#include <TNL/Algorithms/reduce.h>
10template<
typename Device >
18 return abs( view[ i ] );
20 auto reduction = []
__cuda_callable__(
double& a,
const double& b,
int& aIdx,
const int& bIdx )
26 else if( a == b && bIdx < aIdx )
33main(
int argc,
char* argv[] )
36 host_v.forAllElements(
42 auto maxNormHost = maximumNorm( host_v );
43 std::cout <<
"The maximum norm of the host vector elements is " << maxNormHost.first <<
" at position " << maxNormHost.second
47 cuda_v.forAllElements(
53 auto maxNormCuda = maximumNorm( cuda_v );
54 std::cout <<
"The maximum norm of the device vector elements is " << maxNormCuda.first <<
" at position "
55 << maxNormCuda.second <<
"." <<
std::endl;
The definition of the lambda function reduction
reads as:
auto reduction = []
__cuda_callable__ (
double& a,
const double& b,
int& aIdx,
const int& bIdx );
In addition to vector elements values a
and b
, it gets also their positions aIdx
and bIdx
. The functions is responsible to set a
to maximum of the two and aIdx
to the position of the larger element. Note that the parameters have the above mentioned meaning only in case of computing minimum or maximum.
The result looks as:
host_v = [ -7, -6, -5, -4, -3, -2, -1, 0, 1, 2 ]
The maximum norm of the host vector elements is 7 at position 0.
cuda_v = [ -7, -6, -5, -4, -3, -2, -1, 0, 1, 2 ]
The maximum norm of the device vector elements is 7 at position 0.
Using functionals for reduction
You might notice, that the lambda function reduction
does not take so many different form compared to fetch. In addition, setting the identity element can be annoying especially when computing minimum or maximum and we need to use std::numeric_limits to make the code general for any type. To make things simpler, TNL offers variants of several functionals known from the STL. They can be used instead of the lambda function reduction
and they also carry the identity element. See the following example showing the scalar product of two vectors, now with a functional:
3#include <TNL/Containers/Vector.h>
4#include <TNL/Algorithms/reduce.h>
10template<
typename Device >
20 return reduce< Device >(
25 return u_view[ i ] * v_view[ i ];
31main(
int argc,
char* argv[] )
38 host_v.forAllElements(
41 value = 2 * ( i % 2 ) - 1;
45 std::cout <<
"The scalar product ( host_u, host_v ) is " << scalarProduct( host_u, host_v ) <<
"." <<
std::endl;
46 std::cout <<
"The scalar product ( host_v, host_v ) is " << scalarProduct( host_v, host_v ) <<
"." <<
std::endl;
54 cuda_v.forAllElements(
57 value = 2 * ( i % 2 ) - 1;
61 std::cout <<
"The scalar product ( cuda_u, cuda_v ) is " << scalarProduct( cuda_u, cuda_v ) <<
"." <<
std::endl;
62 std::cout <<
"The scalar product ( cuda_v, cuda_v ) is " << scalarProduct( cuda_v, cuda_v ) <<
"." <<
std::endl;
Function object implementing x + y.
Definition Functional.h:17
This example also shows a more compact way to invoke the function reduce
. This way, one should be able to perform (parallel) reduction very easily. The result looks as follows:
host_u = [ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ]
host_v = [ -1, 1, -1, 1, -1, 1, -1, 1, -1, 1 ]
The scalar product ( host_u, host_v ) is 0.
The scalar product ( host_v, host_v ) is 10.
cuda_u = [ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ]
cuda_v = [ -1, 1, -1, 1, -1, 1, -1, 1, -1, 1 ]
The scalar product ( cuda_u, cuda_v ) is 0.
The scalar product ( cuda_v, cuda_v ) is 10.
In TNL/Functional.h
you may find probably all operations that can be reasonably used for reduction:
Flexible scan
Inclusive and exclusive scan
Inclusive scan (or prefix sum) operation turns a sequence \( a_1, \ldots, a_n \) into a sequence \( s_1, \ldots, s_n \) defined as
\[
s_i = \sum_{j=1}^i a_i.
\]
Exclusive scan (or prefix sum) is defined as
\[
\sigma_i = \sum_{j=1}^{i-1} a_i.
\]
For example, inclusive scan of
is
and exclusive scan of the same sequence is
Both kinds of scan have many different applications but they are usually applied only on summation, however product or logical operations could be handy as well. In TNL, scan is implemented in a similar way as reduction and so it can be easily modified by lambda functions. The following example shows how it works:
inplaceInclusiveScan( array, 0, array.getSize(),
TNL::Plus{} );
This is equivalent to the following shortened call (the second, third and fourth parameters have a default value):
inplaceInclusiveScan( array );
The complete example looks as follows:
2#include <TNL/Containers/Array.h>
3#include <TNL/Algorithms/scan.h>
10main(
int argc,
char* argv[] )
18 inplaceInclusiveScan( host_a );
28 inplaceInclusiveScan( cuda_a );
Array is responsible for memory management, access to array elements, and general array operations.
Definition Array.h:64
Scan does not use fetch
function because the scan must be performed on an array. Its complexity is also higher compared to reduction. Thus if one needs to do some operation with the array elements before the scan, this can be done explicitly and it will not affect the performance significantly. On the other hand, the scan function takes interval of the vector elements where the scan is performed as its second and third argument. The next argument is the reduction operation to be performed by the scan and the last parameter is the identity element of the reduction operation.
The result looks as:
host_a = [ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ]
The prefix sum of the host array is [ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 ].
cuda_a = [ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ]
The prefix sum of the CUDA array is [ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 ].
Exclusive scan works similarly. The complete example looks as follows:
2#include <TNL/Containers/Array.h>
3#include <TNL/Algorithms/scan.h>
10main(
int argc,
char* argv[] )
18 inplaceExclusiveScan( host_a );
28 inplaceExclusiveScan( cuda_a );
And the result looks as:
host_a = [ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ]
The prefix sum of the host array is [ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 ].
cuda_a = [ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ]
The prefix sum of the CUDA array is [ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 ].
Segmented scan
Segmented scan is a modification of common scan. In this case the sequence of numbers in hand is divided into segments like this, for example
[1,3,5][2,4,6,9][3,5],[3,6,9,12,15]
and we want to compute inclusive or exclusive scan of each segment. For inclusive segmented prefix sum we get
[1,4,9][2,6,12,21][3,8][3,9,18,30,45]
and the result for exclusive segmented prefix sum is
[0,1,4][0,2,6,12][0,3][0,3,9,18,30]
In addition to common scan, we need to encode the segments of the input sequence. It is done by auxiliary flags array (it can be array of booleans) having 1
at the begining of each segment and 0
on all other positions. In our example, it would be like this:
[1,0,0,1,0,0,0,1,0,1,0,0, 0, 0]
[1,3,5,2,4,6,9,3,5,3,6,9,12,15]
Note: Segmented scan is not implemented for CUDA yet.
2#include <TNL/Containers/Array.h>
3#include <TNL/Algorithms/SegmentedScan.h>
9template<
typename Device >
30main(
int argc,
char* argv[] )
35 Array< bool, Devices::Host > host_flags{ 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0 };
36 Array< double, Devices::Host > host_v{ 1, 3, 5, 2, 4, 6, 9, 3, 5, 3, 6, 9, 12, 15 };
39 segmentedScan( host_v, host_flags );
40 std::cout <<
"The segmented prefix sum of the host array is " << host_v <<
"." <<
std::endl;
__cuda_callable__ IndexType getSize() const
Returns the current array size.
Definition Array.hpp:245
Computes segmented scan (or prefix sum) on a vector.
Definition SegmentedScan.h:56
The result reads as:
host_flags = [ 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0 ]
host_v = [ 1, 3, 5, 2, 4, 6, 9, 3, 5, 3, 6, 9, 12, 15 ]
The segmented prefix sum of the host array is [ 1, 4, 9, 2, 6, 12, 21, 3, 8, 3, 9, 18, 30, 45 ].