Template Numerical Library version\ main:978fbd2
Searching...
No Matches
Flexible parallel reduction and scan

Introduction

This chapter introduces flexible parallel reduction in TNL. It shows how to easily implement parallel reduction with user defined operations which may run on both CPU and GPU. Parallel reduction is a programming pattern appearing very often in different kind of algorithms for example in scalar product, vector norms or mean value evaluation but also in sequences or strings comparison.

Flexible reduction

We will explain the flexible parallel reduction on several examples. We start with the simplest sum of sequence of numbers followed by more advanced problems like scalar product or vector norms.

Sum

We start with simple problem of computing sum of sequence of numbers

$s = \sum_{i=1}^n a_i.$

Sequentialy, such sum can be computed very easily as follows:

1double
2sequentialSum( const double* a, const int size )
3{
4 double sum( 0.0 );
5 for( int i = 0; i < size; i++ )
6 sum += a[ i ];
7 return sum;
8}

Doing the same in CUDA for GPU is, however, much more difficult (see Optimizing Parallel Reduction in CUDA). The final code has tens of lines and it is something you do not want to write again and again anytime you need to sum a series of numbers. Using TNL and C++ lambda functions we may do the same on few lines of code efficiently and independently on the hardware beneath. Let us first rewrite the previous example using the C++ lambda functions:

1double
2sequentialSum( const double* a, const int size )
3{
4 auto fetch = [ = ]( int i ) -> double
5 {
6 return a[ i ];
7 };
8 auto reduction = []( double& x, const double& y )
9 {
10 return x + y;
11 };
12
13 double sum( 0.0 );
14 for( int i = 0; i < size; i++ )
15 sum = reduction( sum, fetch( i ) );
16 return sum;
17}

As can be seen, we split the reduction into two steps:

1. fetch reads the input data. Thanks to this lambda you can:
1. Connect the reduction algorithm with given input arrays or vectors (or any other data structure).
2. Perform operation you need to do with the input data.
3. Perform another secondary operation simoultanously with the parallel reduction.
2. reduction is operation we want to do after the data fetch. Usually it is summation, multiplication, evaluation of minimum or maximum or some logical operation.

Putting everything together gives the following example:

1#include <iostream>
2#include <cstdlib>
3#include <TNL/Containers/Vector.h>
4#include <TNL/Algorithms/reduce.h>
5
6using namespace TNL;
7using namespace TNL::Containers;
8using namespace TNL::Algorithms;
9
10template< typename Device >
11double
12sum( const Vector< double, Device >& v )
13{
14 /****
15 * Get vector view which can be captured by lambda.
16 */
17 auto view = v.getConstView();
18
19 /****
20 * The fetch function just reads elements of vector v.
21 */
22 auto fetch = [ = ] __cuda_callable__( int i ) -> double
23 {
24 return view[ i ];
25 };
26
27 /***
28 * Reduction is sum of two numbers.
29 */
30 auto reduction = [] __cuda_callable__( const double& a, const double& b )
31 {
32 return a + b;
33 };
34
35 /***
36 * Finally we call the templated function Reduction and pass number of elements to reduce,
37 * lambdas defined above and finally value of identity element, zero in this case, which serve for the
38 * reduction initiation.
39 */
40 return reduce< Device >( 0, view.getSize(), fetch, reduction, 0.0 );
41}
42
43int
44main( int argc, char* argv[] )
45{
46 /***
47 * Firstly, test the sum with vectors allocated on CPU.
48 */
50 host_v = 1.0;
51 std::cout << "host_v = " << host_v << std::endl;
52 std::cout << "The sum of the host vector elements is " << sum( host_v ) << "." << std::endl;
53
54 /***
55 * And then also on GPU.
56 */
57#ifdef __CUDACC__
59 cuda_v = 1.0;
60 std::cout << "cuda_v = " << cuda_v << std::endl;
61 std::cout << "The sum of the CUDA vector elements is " << sum( cuda_v ) << "." << std::endl;
62#endif
63 return EXIT_SUCCESS;
64}
#define __cuda_callable__
Definition Macros.h:49
Vector extends Array with algebraic operations.
Definition Vector.h:36
ConstViewType getConstView(IndexType begin=0, IndexType end=0) const
Returns a non-modifiable view of the vector.
Definition Vector.hpp:40
T endl(T... args)
Namespace for fundamental TNL algorithms.
Definition AtomicOperations.h:9
Namespace for TNL containers.
Definition Array.h:17
The main TNL namespace.
Definition AtomicOperations.h:9

Since TNL vectors cannot be passed to CUDA kernels and so they cannot be captured by CUDA lambdas, we must first get vector view from the vector using a method getConstView().

Note that we pass 0.0 as the last argument of the template function reduce< Device >. It is an identity element for given operation, i.e., an element which does not change the result of the operation. For addition, it is zero.

The result of the previous code sample looks as follows:

host_v = [ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ]
The sum of the host vector elements is 10.
cuda_v = [ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ]
The sum of the CUDA vector elements is 10.

Note that the sum of vector elements can be also obtained as TNL::sum(v).

Product

To demonstrate the effect of the identity element, we will now compute product of all elements of the vector. The identity element is one for multiplication and we also need to replace a + b with a * b in the definition of reduction. We get the following code:

1#include <iostream>
2#include <cstdlib>
3#include <TNL/Containers/Vector.h>
4#include <TNL/Algorithms/reduce.h>
5
6using namespace TNL;
7using namespace TNL::Containers;
8using namespace TNL::Algorithms;
9
10template< typename Device >
11double
12product( const Vector< double, Device >& v )
13{
14 auto view = v.getConstView();
15 auto fetch = [ = ] __cuda_callable__( int i )
16 {
17 return view[ i ];
18 };
19 auto reduction = [] __cuda_callable__( const double& a, const double& b )
20 {
21 return a * b;
22 };
23
24 /***
25 * Since we compute the product of all elements, the reduction must be initialized by 1.0 not by 0.0.
26 */
27 return reduce< Device >( 0, view.getSize(), fetch, reduction, 1.0 );
28}
29
30int
31main( int argc, char* argv[] )
32{
33 /***
34 * The first test on CPU ...
35 */
37 host_v = 1.0;
38 std::cout << "host_v = " << host_v << std::endl;
39 std::cout << "The product of the host vector elements is " << product( host_v ) << "." << std::endl;
40
41 /***
42 * ... the second test on GPU.
43 */
44#ifdef __CUDACC__
46 cuda_v = 1.0;
47 std::cout << "cuda_v = " << cuda_v << std::endl;
48 std::cout << "The product of the CUDA vector elements is " << product( cuda_v ) << "." << std::endl;
49#endif
50 return EXIT_SUCCESS;
51}

host_v = [ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ]
The product of the host vector elements is 1.
cuda_v = [ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ]
The product of the CUDA vector elements is 1.

Note that the product of vector elements can be computed as TNL::product(v).

Scalar product

One of the most important operation in the linear algebra is the scalar product of two vectors. Compared to computing the sum of vector elements we must change the function fetch to read elements from both vectors and multiply them. See the following example.

1#include <iostream>
2#include <cstdlib>
3#include <TNL/Containers/Vector.h>
4#include <TNL/Algorithms/reduce.h>
5
6using namespace TNL;
7using namespace TNL::Containers;
8using namespace TNL::Algorithms;
9
10template< typename Device >
11double
12scalarProduct( const Vector< double, Device >& u, const Vector< double, Device >& v )
13{
14 auto u_view = u.getConstView();
15 auto v_view = v.getConstView();
16
17 /***
18 * Fetch computes product of corresponding elements of both vectors.
19 */
20 auto fetch = [ = ] __cuda_callable__( int i )
21 {
22 return u_view[ i ] * v_view[ i ];
23 };
24 auto reduction = [] __cuda_callable__( const double& a, const double& b )
25 {
26 return a + b;
27 };
28 return reduce< Device >( 0, v_view.getSize(), fetch, reduction, 0.0 );
29}
30
31int
32main( int argc, char* argv[] )
33{
34 /***
35 * The first test on CPU ...
36 */
37 Vector< double, Devices::Host > host_u( 10 ), host_v( 10 );
38 host_u = 1.0;
39 host_v.forAllElements(
40 [] __cuda_callable__( int i, double& value )
41 {
42 value = 2 * ( i % 2 ) - 1;
43 } );
44 std::cout << "host_u = " << host_u << std::endl;
45 std::cout << "host_v = " << host_v << std::endl;
46 std::cout << "The scalar product ( host_u, host_v ) is " << scalarProduct( host_u, host_v ) << "." << std::endl;
47 std::cout << "The scalar product ( host_v, host_v ) is " << scalarProduct( host_v, host_v ) << "." << std::endl;
48
49 /***
50 * ... the second test on GPU.
51 */
52#ifdef __CUDACC__
53 Vector< double, Devices::Cuda > cuda_u( 10 ), cuda_v( 10 );
54 cuda_u = 1.0;
55 cuda_v.forAllElements(
56 [] __cuda_callable__( int i, double& value )
57 {
58 value = 2 * ( i % 2 ) - 1;
59 } );
60 std::cout << "cuda_u = " << cuda_u << std::endl;
61 std::cout << "cuda_v = " << cuda_v << std::endl;
62 std::cout << "The scalar product ( cuda_u, cuda_v ) is " << scalarProduct( cuda_u, cuda_v ) << "." << std::endl;
63 std::cout << "The scalar product ( cuda_v, cuda_v ) is " << scalarProduct( cuda_v, cuda_v ) << "." << std::endl;
64#endif
65 return EXIT_SUCCESS;
66}

The result is:

host_u = [ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ]
host_v = [ -1, 1, -1, 1, -1, 1, -1, 1, -1, 1 ]
The scalar product ( host_u, host_v ) is 0.
The scalar product ( host_v, host_v ) is 10.
cuda_u = [ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ]
cuda_v = [ -1, 1, -1, 1, -1, 1, -1, 1, -1, 1 ]
The scalar product ( cuda_u, cuda_v ) is 0.
The scalar product ( cuda_v, cuda_v ) is 10.

Note that the scalar product of vectors u and v can be computed by TNL::dot(u, v) or simply as (u, v).

Maximum norm

The maximum norm of a vector equals the modulus of the vector largest element. Therefore, fetch must return the absolute value of the vector elements and reduction will return maximum of given values. Look at the following example.

1#include <iostream>
2#include <cstdlib>
3#include <TNL/Containers/Vector.h>
4#include <TNL/Algorithms/reduce.h>
5
6using namespace TNL;
7using namespace TNL::Containers;
8using namespace TNL::Algorithms;
9
10template< typename Device >
11double
12maximumNorm( const Vector< double, Device >& v )
13{
14 auto view = v.getConstView();
15 auto fetch = [ = ] __cuda_callable__( int i )
16 {
17 return abs( view[ i ] );
18 };
19 auto reduction = [] __cuda_callable__( const double& a, const double& b )
20 {
21 return TNL::max( a, b );
22 };
23 return reduce< Device >( 0, view.getSize(), fetch, reduction, 0.0 );
24}
25
26int
27main( int argc, char* argv[] )
28{
30 host_v.forAllElements(
31 [] __cuda_callable__( int i, double& value )
32 {
33 value = i - 7;
34 } );
35 std::cout << "host_v = " << host_v << std::endl;
36 std::cout << "The maximum norm of the host vector elements is " << maximumNorm( host_v ) << "." << std::endl;
37#ifdef __CUDACC__
39 cuda_v.forAllElements(
40 [] __cuda_callable__( int i, double& value )
41 {
42 value = i - 7;
43 } );
44 std::cout << "cuda_v = " << cuda_v << std::endl;
45 std::cout << "The maximum norm of the CUDA vector elements is " << maximumNorm( cuda_v ) << "." << std::endl;
46#endif
47 return EXIT_SUCCESS;
48}
constexpr ResultType max(const T1 &a, const T2 &b)
This function returns maximum of two numbers.
Definition Math.h:48

The output is:

host_v = [ -7, -6, -5, -4, -3, -2, -1, 0, 1, 2 ]
The maximum norm of the host vector elements is 7.
cuda_v = [ -7, -6, -5, -4, -3, -2, -1, 0, 1, 2 ]
The maximum norm of the CUDA vector elements is 7.

Note that the maximum norm can be computed by TNL::maxNorm(v).

Vectors comparison

The comparison of two vectors involves (parallel) reduction as well. The fetch part is responsible for the comparison of corresponding vector elements, resulting in a boolean value true or false for each of the vector elements. The reduction part must perform logical and operation on all fetched values. We must not forget to change the identity element to true. The code may look as follows:

1#include <iostream>
2#include <cstdlib>
3#include <TNL/Containers/Vector.h>
4#include <TNL/Algorithms/reduce.h>
5
6using namespace TNL;
7using namespace TNL::Containers;
8using namespace TNL::Algorithms;
9
10template< typename Device >
11bool
12comparison( const Vector< double, Device >& u, const Vector< double, Device >& v )
13{
14 auto u_view = u.getConstView();
15 auto v_view = v.getConstView();
16
17 /***
18 * Fetch compares corresponding elements of both vectors
19 */
20 auto fetch = [ = ] __cuda_callable__( int i ) -> bool
21 {
22 return ( u_view[ i ] == v_view[ i ] );
23 };
24
25 /***
26 * Reduce performs logical AND on intermediate results obtained by fetch.
27 */
28 auto reduction = [] __cuda_callable__( const bool& a, const bool& b )
29 {
30 return a && b;
31 };
32 return reduce< Device >( 0, v_view.getSize(), fetch, reduction, true );
33}
34
35int
36main( int argc, char* argv[] )
37{
38 Vector< double, Devices::Host > host_u( 10 ), host_v( 10 );
39 host_u = 1.0;
40 host_v.forAllElements(
41 [] __cuda_callable__( int i, double& value )
42 {
43 value = 2 * ( i % 2 ) - 1;
44 } );
45 std::cout << "host_u = " << host_u << std::endl;
46 std::cout << "host_v = " << host_v << std::endl;
47 std::cout << "Comparison of host_u and host_v is: " << ( comparison( host_u, host_v ) ? "'true'" : "'false'" ) << "."
48 << std::endl;
49 std::cout << "Comparison of host_u and host_u is: " << ( comparison( host_u, host_u ) ? "'true'" : "'false'" ) << "."
50 << std::endl;
51#ifdef __CUDACC__
52 Vector< double, Devices::Cuda > cuda_u( 10 ), cuda_v( 10 );
53 cuda_u = 1.0;
54 cuda_v.forAllElements(
55 [] __cuda_callable__( int i, double& value )
56 {
57 value = 2 * ( i % 2 ) - 1;
58 } );
59 std::cout << "cuda_u = " << cuda_u << std::endl;
60 std::cout << "cuda_v = " << cuda_v << std::endl;
61 std::cout << "Comparison of cuda_u and cuda_v is: " << ( comparison( cuda_u, cuda_v ) ? "'true'" : "'false'" ) << "."
62 << std::endl;
63 std::cout << "Comparison of cuda_u and cuda_u is: " << ( comparison( cuda_u, cuda_u ) ? "'true'" : "'false'" ) << "."
64 << std::endl;
65#endif
66 return EXIT_SUCCESS;
67}

And the output looks as:

host_u = [ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ]
host_v = [ -1, 1, -1, 1, -1, 1, -1, 1, -1, 1 ]
Comparison of host_u and host_v is: 'false'.
Comparison of host_u and host_u is: 'true'.
cuda_u = [ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ]
cuda_v = [ -1, 1, -1, 1, -1, 1, -1, 1, -1, 1 ]
Comparison of cuda_u and cuda_v is: 'false'.
Comparison of cuda_u and cuda_u is: 'true'.

Update and residue

In iterative solvers we often need to update a vector and compute the norm at the same time. For example, the Euler method is defined as

$\bf u^{k+1} = \bf u^k + \tau \Delta \bf u.$

Together with the vector addition, we may want to compute also $$L_2$$-norm of $$\Delta \bf u$$ which may indicate convergence. Computing first the addition and then the norm would be inefficient because we would have to fetch the vector $$\Delta \bf u$$ twice from the memory. The following example shows how to do the addition and norm computation at the same time.

1#include <iostream>
2#include <cstdlib>
3#include <TNL/Containers/Vector.h>
4#include <TNL/Algorithms/reduce.h>
5
6using namespace TNL;
7using namespace TNL::Containers;
8using namespace TNL::Algorithms;
9
10template< typename Device >
11double
12updateAndResidue( Vector< double, Device >& u, const Vector< double, Device >& delta_u, const double& tau )
13{
14 auto u_view = u.getView();
15 auto delta_u_view = delta_u.getConstView();
16 auto fetch = [ = ] __cuda_callable__( int i ) mutable -> double
17 {
18 const double& add = delta_u_view[ i ];
19 u_view[ i ] += tau * add;
21 };
22 auto reduction = [] __cuda_callable__( const double& a, const double& b )
23 {
24 return a + b;
25 };
26 return sqrt( reduce< Device >( 0, u_view.getSize(), fetch, reduction, 0.0 ) );
27}
28
29int
30main( int argc, char* argv[] )
31{
32 const double tau = 0.1;
33 Vector< double, Devices::Host > host_u( 10 ), host_delta_u( 10 );
34 host_u = 0.0;
35 host_delta_u = 1.0;
36 std::cout << "host_u = " << host_u << std::endl;
37 std::cout << "host_delta_u = " << host_delta_u << std::endl;
38 double residue = updateAndResidue( host_u, host_delta_u, tau );
39 std::cout << "New host_u is: " << host_u << "." << std::endl;
40 std::cout << "Residue is:" << residue << std::endl;
41#ifdef __CUDACC__
42 Vector< double, Devices::Cuda > cuda_u( 10 ), cuda_delta_u( 10 );
43 cuda_u = 0.0;
44 cuda_delta_u = 1.0;
45 std::cout << "cuda_u = " << cuda_u << std::endl;
46 std::cout << "cuda_delta_u = " << cuda_delta_u << std::endl;
47 residue = updateAndResidue( cuda_u, cuda_delta_u, tau );
48 std::cout << "New cuda_u is: " << cuda_u << "." << std::endl;
49 std::cout << "Residue is:" << residue << std::endl;
50#endif
51 return EXIT_SUCCESS;
52}
ViewType getView(IndexType begin=0, IndexType end=0)
Returns a modifiable view of the vector.
Definition Vector.hpp:25

host_u = [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ]
host_delta_u = [ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ]
New host_u is: [ 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1 ].
Residue is:3.16228
cuda_u = [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ]
cuda_delta_u = [ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ]
New cuda_u is: [ 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1 ].
Residue is:3.16228

Simple MapReduce

We can also filter the data to be reduced. This operation is called MapReduce. You simply add the necessary if-statement to the fetch function, or in the case of the following example, we use the ternary conditional operator

return u_view[ i ] > 0.0 ? u_view[ i ] : 0.0;

to sum up only the positive numbers in the vector.

1#include <iostream>
2#include <cstdlib>
3#include <TNL/Containers/Vector.h>
4#include <TNL/Algorithms/reduce.h>
5
6using namespace TNL;
7using namespace TNL::Containers;
8using namespace TNL::Algorithms;
9
10template< typename Device >
11double
12mapReduce( Vector< double, Device >& u )
13{
14 auto u_view = u.getView();
15 auto fetch = [ = ] __cuda_callable__( int i ) -> double
16 {
17 return u_view[ i ] > 0 ? u_view[ i ] : 0.0;
18 };
19 auto reduction = [] __cuda_callable__( const double& a, const double& b )
20 {
21 return a + b;
22 };
23 return reduce< Device >( 0, u_view.getSize(), fetch, reduction, 0.0 );
24}
25
26int
27main( int argc, char* argv[] )
28{
30 host_u.forAllElements(
31 [] __cuda_callable__( int i, double& value )
32 {
33 value = sin( (double) i );
34 } );
35 double result = mapReduce( host_u );
36 std::cout << "host_u = " << host_u << std::endl;
37 std::cout << "Sum of the positive numbers is:" << result << std::endl;
38#ifdef __CUDACC__
40 cuda_u = host_u;
41 result = mapReduce( cuda_u );
42 std::cout << "cuda_u = " << cuda_u << std::endl;
43 std::cout << "Sum of the positive numbers is:" << result << std::endl;
44#endif
45 return EXIT_SUCCESS;
46}

The result is:

host_u = [ 0, 0.841471, 0.909297, 0.14112, -0.756802, -0.958924, -0.279415, 0.656987, 0.989358, 0.412118 ]
Sum of the positive numbers is:3.95035
cuda_u = [ 0, 0.841471, 0.909297, 0.14112, -0.756802, -0.958924, -0.279415, 0.656987, 0.989358, 0.412118 ]
Sum of the positive numbers is:3.95035

Take a look at the following example where the filtering depends on the element indexes rather than values:

1#include <iostream>
2#include <cstdlib>
3#include <TNL/Containers/Vector.h>
4#include <TNL/Algorithms/reduce.h>
5#include <TNL/Timer.h>
6
7using namespace TNL;
8using namespace TNL::Containers;
9using namespace TNL::Algorithms;
10
11template< typename Device >
12double
13mapReduce( Vector< double, Device >& u )
14{
15 auto u_view = u.getView();
16 auto fetch = [ = ] __cuda_callable__( int i ) -> double
17 {
18 if( i % 2 == 0 )
19 return u_view[ i ];
20 return 0.0;
21 };
22 auto reduction = [] __cuda_callable__( const double& a, const double& b )
23 {
24 return a + b;
25 };
26 return reduce< Device >( 0, u_view.getSize(), fetch, reduction, 0.0 );
27}
28
29int
30main( int argc, char* argv[] )
31{
32 Timer timer;
33 Vector< double, Devices::Host > host_u( 100000 );
34 host_u = 1.0;
35 timer.start();
36 double result = mapReduce( host_u );
37 timer.stop();
38 std::cout << "Host tesult is:" << result << ". It took " << timer.getRealTime() << " seconds." << std::endl;
39#ifdef __CUDACC__
40 Vector< double, Devices::Cuda > cuda_u( 100000 );
41 cuda_u = 1.0;
42 timer.reset();
43 timer.start();
44 result = mapReduce( cuda_u );
45 timer.stop();
46 std::cout << "CUDA result is:" << result << ". It took " << timer.getRealTime() << " seconds." << std::endl;
47#endif
48 return EXIT_SUCCESS;
49}
Class for real time, CPU time and CPU cycles measuring.
Definition Timer.h:25
void reset()
Reset the CPU and real-time timers.
Definition Timer.hpp:22
double getRealTime() const
Returns the elapsed real time on this timer.
Definition Timer.hpp:50
void stop()
Stops (pauses) the CPU and the real-time timers, but does not set them to zero.
Definition Timer.hpp:32
void start()
Starts timer.
Definition Timer.hpp:42

The result is:

Host tesult is:50000. It took 0.0125388 seconds.
CUDA result is:50000. It took 0.00298286 seconds.

This is not very efficient. For half of the elements, we return zero which has no effect during the reduction. A better solution is to run the reduction only for a half of the elements and to change the fetch function to

return u_view[ 2 * i ];

See the following example and compare the execution times.

1#include <iostream>
2#include <cstdlib>
3#include <TNL/Containers/Vector.h>
4#include <TNL/Algorithms/reduce.h>
5#include <TNL/Timer.h>
6
7using namespace TNL;
8using namespace TNL::Containers;
9using namespace TNL::Algorithms;
10
11template< typename Device >
12double
13mapReduce( Vector< double, Device >& u )
14{
15 auto u_view = u.getView();
16 auto fetch = [ = ] __cuda_callable__( int i ) -> double
17 {
18 return u_view[ 2 * i ];
19 };
20 auto reduction = [] __cuda_callable__( const double& a, const double& b )
21 {
22 return a + b;
23 };
24 return reduce< Device >( 0, u_view.getSize() / 2, fetch, reduction, 0.0 );
25}
26
27int
28main( int argc, char* argv[] )
29{
30 Timer timer;
31 Vector< double, Devices::Host > host_u( 100000 );
32 host_u = 1.0;
33 timer.start();
34 double result = mapReduce( host_u );
35 timer.stop();
36 std::cout << "Host result is:" << result << ". It took " << timer.getRealTime() << " seconds." << std::endl;
37#ifdef __CUDACC__
38 Vector< double, Devices::Cuda > cuda_u( 100000 );
39 cuda_u = 1.0;
40 timer.reset();
41 timer.start();
42 result = mapReduce( cuda_u );
43 timer.stop();
44 std::cout << "CUDA result is:" << result << ". It took " << timer.getRealTime() << " seconds." << std::endl;
45#endif
46 return EXIT_SUCCESS;
47}
Host result is:50000. It took 0.0141075 seconds.
CUDA result is:50000. It took 0.0043334 seconds.

Reduction with argument

In some situations we may need to locate given element in the vector. For example index of the smallest or the largest element. reduceWithArgument is a function which can do it. In the following example, we modify function for computing the maximum norm of a vector. Instead of just computing the value, now we want to get index of the element having the absolute value equal to the max norm. The lambda function reduction do not compute only maximum of two given elements anymore, but it must also compute index of the winner. See the following code:

1#include <iostream>
2#include <cstdlib>
3#include <TNL/Containers/Vector.h>
4#include <TNL/Algorithms/reduce.h>
5
6using namespace TNL;
7using namespace TNL::Containers;
8using namespace TNL::Algorithms;
9
10template< typename Device >
12maximumNorm( const Vector< double, Device >& v )
13{
14 auto view = v.getConstView();
15
16 auto fetch = [ = ] __cuda_callable__( int i )
17 {
18 return abs( view[ i ] );
19 };
20 auto reduction = [] __cuda_callable__( double& a, const double& b, int& aIdx, const int& bIdx )
21 {
22 if( a < b ) {
23 a = b;
24 aIdx = bIdx;
25 }
26 else if( a == b && bIdx < aIdx )
27 aIdx = bIdx;
28 };
29 return reduceWithArgument< Device >( 0, view.getSize(), fetch, reduction, std::numeric_limits< double >::lowest() );
30}
31
32int
33main( int argc, char* argv[] )
34{
36 host_v.forAllElements(
37 [] __cuda_callable__( int i, double& value )
38 {
39 value = i - 7;
40 } );
41 std::cout << "host_v = " << host_v << std::endl;
42 auto maxNormHost = maximumNorm( host_v );
43 std::cout << "The maximum norm of the host vector elements is " << maxNormHost.first << " at position " << maxNormHost.second
44 << "." << std::endl;
45#ifdef __CUDACC__
47 cuda_v.forAllElements(
48 [] __cuda_callable__( int i, double& value )
49 {
50 value = i - 7;
51 } );
52 std::cout << "cuda_v = " << cuda_v << std::endl;
53 auto maxNormCuda = maximumNorm( cuda_v );
54 std::cout << "The maximum norm of the device vector elements is " << maxNormCuda.first << " at position "
55 << maxNormCuda.second << "." << std::endl;
56#endif
57 return EXIT_SUCCESS;
58}

The definition of the lambda function reduction reads as:

auto reduction = [] __cuda_callable__ ( double& a, const double& b, int& aIdx, const int& bIdx );

In addition to vector elements values a and b, it gets also their positions aIdx and bIdx. The functions is responsible to set a to maximum of the two and aIdx to the position of the larger element. Note that the parameters have the above mentioned meaning only in case of computing minimum or maximum.

The result looks as:

host_v = [ -7, -6, -5, -4, -3, -2, -1, 0, 1, 2 ]
The maximum norm of the host vector elements is 7 at position 0.
cuda_v = [ -7, -6, -5, -4, -3, -2, -1, 0, 1, 2 ]
The maximum norm of the device vector elements is 7 at position 0.

Using functionals for reduction

You might notice, that the lambda function reduction does not take so many different form compared to fetch. In addition, setting the identity element can be annoying especially when computing minimum or maximum and we need to use std::numeric_limits to make the code general for any type. To make things simpler, TNL offers variants of several functionals known from the STL. They can be used instead of the lambda function reduction and they also carry the identity element. See the following example showing the scalar product of two vectors, now with a functional:

1#include <iostream>
2#include <cstdlib>
3#include <TNL/Containers/Vector.h>
4#include <TNL/Algorithms/reduce.h>
5
6using namespace TNL;
7using namespace TNL::Containers;
8using namespace TNL::Algorithms;
9
10template< typename Device >
11double
12scalarProduct( const Vector< double, Device >& u, const Vector< double, Device >& v )
13{
14 auto u_view = u.getConstView();
15 auto v_view = v.getConstView();
16
17 /***
18 * Fetch computes product of corresponding elements of both vectors.
19 */
20 return reduce< Device >(
21 0,
22 v_view.getSize(),
23 [ = ] __cuda_callable__( int i )
24 {
25 return u_view[ i ] * v_view[ i ];
26 },
27 TNL::Plus{} );
28}
29
30int
31main( int argc, char* argv[] )
32{
33 /***
34 * The first test on CPU ...
35 */
36 Vector< double, Devices::Host > host_u( 10 ), host_v( 10 );
37 host_u = 1.0;
38 host_v.forAllElements(
39 [] __cuda_callable__( int i, double& value )
40 {
41 value = 2 * ( i % 2 ) - 1;
42 } );
43 std::cout << "host_u = " << host_u << std::endl;
44 std::cout << "host_v = " << host_v << std::endl;
45 std::cout << "The scalar product ( host_u, host_v ) is " << scalarProduct( host_u, host_v ) << "." << std::endl;
46 std::cout << "The scalar product ( host_v, host_v ) is " << scalarProduct( host_v, host_v ) << "." << std::endl;
47
48 /***
49 * ... the second test on GPU.
50 */
51#ifdef __CUDACC__
52 Vector< double, Devices::Cuda > cuda_u( 10 ), cuda_v( 10 );
53 cuda_u = 1.0;
54 cuda_v.forAllElements(
55 [] __cuda_callable__( int i, double& value )
56 {
57 value = 2 * ( i % 2 ) - 1;
58 } );
59 std::cout << "cuda_u = " << cuda_u << std::endl;
60 std::cout << "cuda_v = " << cuda_v << std::endl;
61 std::cout << "The scalar product ( cuda_u, cuda_v ) is " << scalarProduct( cuda_u, cuda_v ) << "." << std::endl;
62 std::cout << "The scalar product ( cuda_v, cuda_v ) is " << scalarProduct( cuda_v, cuda_v ) << "." << std::endl;
63#endif
64 return EXIT_SUCCESS;
65}
Function object implementing x + y.
Definition Functional.h:17

This example also shows a more compact way to invoke the function reduce. This way, one should be able to perform (parallel) reduction very easily. The result looks as follows:

host_u = [ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ]
host_v = [ -1, 1, -1, 1, -1, 1, -1, 1, -1, 1 ]
The scalar product ( host_u, host_v ) is 0.
The scalar product ( host_v, host_v ) is 10.
cuda_u = [ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ]
cuda_v = [ -1, 1, -1, 1, -1, 1, -1, 1, -1, 1 ]
The scalar product ( cuda_u, cuda_v ) is 0.
The scalar product ( cuda_v, cuda_v ) is 10.

In TNL/Functional.h you may find probably all operations that can be reasonably used for reduction:

Functional Reduction operation
TNL::Plus Sum
TNL::Multiplies Product
TNL::Min Minimum
TNL::Max Maximum
TNL::MinWithArg Minimum with argument
TNL::MaxWithArg Maximum with argument
TNL::LogicalAnd Logical AND
TNL::LogicalOr Logical OR
TNL::BitAnd Bit AND
TNL::BitOr Bit OR

Flexible scan

Inclusive and exclusive scan

Inclusive scan (or prefix sum) operation turns a sequence $$a_1, \ldots, a_n$$ into a sequence $$s_1, \ldots, s_n$$ defined as

$s_i = \sum_{j=1}^i a_i.$

Exclusive scan (or prefix sum) is defined as

$\sigma_i = \sum_{j=1}^{i-1} a_i.$

For example, inclusive scan of

[1,3,5,7,9,11,13]

is

[1,4,9,16,25,36,49]

and exclusive scan of the same sequence is

[0,1,4,9,16,25,36]

Both kinds of scan have many different applications but they are usually applied only on summation, however product or logical operations could be handy as well. In TNL, scan is implemented in a similar way as reduction and so it can be easily modified by lambda functions. The following example shows how it works:

inplaceInclusiveScan( array, 0, array.getSize(), TNL::Plus{} );

This is equivalent to the following shortened call (the second, third and fourth parameters have a default value):

inplaceInclusiveScan( array );

The complete example looks as follows:

1#include <iostream>
2#include <TNL/Containers/Array.h>
3#include <TNL/Algorithms/scan.h>
4
5using namespace TNL;
6using namespace TNL::Containers;
7using namespace TNL::Algorithms;
8
9int
10main( int argc, char* argv[] )
11{
12 /***
13 * Firstly, test the prefix sum with an array allocated on CPU.
14 */
16 host_a = 1.0;
17 std::cout << "host_a = " << host_a << std::endl;
18 inplaceInclusiveScan( host_a );
19 std::cout << "The prefix sum of the host array is " << host_a << "." << std::endl;
20
21 /***
22 * And then also on GPU.
23 */
24#ifdef __CUDACC__
26 cuda_a = 1.0;
27 std::cout << "cuda_a = " << cuda_a << std::endl;
28 inplaceInclusiveScan( cuda_a );
29 std::cout << "The prefix sum of the CUDA array is " << cuda_a << "." << std::endl;
30#endif
31 return EXIT_SUCCESS;
32}
Array is responsible for memory management, access to array elements, and general array operations.
Definition Array.h:64

Scan does not use fetch function because the scan must be performed on an array. Its complexity is also higher compared to reduction. Thus if one needs to do some operation with the array elements before the scan, this can be done explicitly and it will not affect the performance significantly. On the other hand, the scan function takes interval of the vector elements where the scan is performed as its second and third argument. The next argument is the reduction operation to be performed by the scan and the last parameter is the identity element of the reduction operation.

The result looks as:

host_a = [ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ]
The prefix sum of the host array is [ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 ].
cuda_a = [ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ]
The prefix sum of the CUDA array is [ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 ].

Exclusive scan works similarly. The complete example looks as follows:

1#include <iostream>
2#include <TNL/Containers/Array.h>
3#include <TNL/Algorithms/scan.h>
4
5using namespace TNL;
6using namespace TNL::Containers;
7using namespace TNL::Algorithms;
8
9int
10main( int argc, char* argv[] )
11{
12 /***
13 * Firstly, test the prefix sum with an array allocated on CPU.
14 */
16 host_a = 1.0;
17 std::cout << "host_a = " << host_a << std::endl;
18 inplaceExclusiveScan( host_a );
19 std::cout << "The prefix sum of the host array is " << host_a << "." << std::endl;
20
21 /***
22 * And then also on GPU.
23 */
24#ifdef __CUDACC__
26 cuda_a = 1.0;
27 std::cout << "cuda_a = " << cuda_a << std::endl;
28 inplaceExclusiveScan( cuda_a );
29 std::cout << "The prefix sum of the CUDA array is " << cuda_a << "." << std::endl;
30#endif
31 return EXIT_SUCCESS;
32}

And the result looks as:

host_a = [ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ]
The prefix sum of the host array is [ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 ].
cuda_a = [ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ]
The prefix sum of the CUDA array is [ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 ].

Segmented scan

Segmented scan is a modification of common scan. In this case the sequence of numbers in hand is divided into segments like this, for example

[1,3,5][2,4,6,9][3,5],[3,6,9,12,15]

and we want to compute inclusive or exclusive scan of each segment. For inclusive segmented prefix sum we get

[1,4,9][2,6,12,21][3,8][3,9,18,30,45]

and the result for exclusive segmented prefix sum is

[0,1,4][0,2,6,12][0,3][0,3,9,18,30]

In addition to common scan, we need to encode the segments of the input sequence. It is done by auxiliary flags array (it can be array of booleans) having 1 at the begining of each segment and 0 on all other positions. In our example, it would be like this:

[1,0,0,1,0,0,0,1,0,1,0,0, 0, 0]
[1,3,5,2,4,6,9,3,5,3,6,9,12,15]

Note: Segmented scan is not implemented for CUDA yet.

1#include <iostream>
2#include <TNL/Containers/Array.h>
3#include <TNL/Algorithms/SegmentedScan.h>
4
5using namespace TNL;
6using namespace TNL::Containers;
7using namespace TNL::Algorithms;
8
9template< typename Device >
10void
11segmentedScan( Array< double, Device >& v, Array< bool, Device >& flags )
12{
13 /***
14 * Reduction is sum of two numbers.
15 */
16 auto reduce = [] __cuda_callable__( const double& a, const double& b )
17 {
18 return a + b;
19 };
20
21 /***
22 * As parameters, we pass array on which the scan is to be performed, interval
23 * where the scan is performed, lambda function which is used by the scan and
24 * zero as the identity element of the 'sum' operation.
25 */
26 SegmentedScan< Device >::perform( v, flags, 0, v.getSize(), reduce, 0.0 );
27}
28
29int
30main( int argc, char* argv[] )
31{
32 /***
33 * Firstly, test the segmented prefix sum with arrays allocated on CPU.
34 */
35 Array< bool, Devices::Host > host_flags{ 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0 };
36 Array< double, Devices::Host > host_v{ 1, 3, 5, 2, 4, 6, 9, 3, 5, 3, 6, 9, 12, 15 };
37 std::cout << "host_flags = " << host_flags << std::endl;
38 std::cout << "host_v = " << host_v << std::endl;
39 segmentedScan( host_v, host_flags );
40 std::cout << "The segmented prefix sum of the host array is " << host_v << "." << std::endl;
41
42 /***
43 * And then also on GPU.
44 */
45#ifdef __CUDACC__
46 //Array< bool, Devices::Cuda > cuda_flags{ 1,0,0,1,0,0,0,1,0,1,0,0, 0, 0 };
47 //Array< double, Devices::Cuda > cuda_v { 1,3,5,2,4,6,9,3,5,3,6,9,12,15 };
48 //std::cout << "cuda_flags = " << cuda_flags << std::endl;
49 //std::cout << "cuda_v = " << cuda_v << std::endl;
50 //segmentedScan( cuda_v, cuda_flags );
51 //std::cout << "The segmnted prefix sum of the CUDA array is " << cuda_v << "." << std::endl;
52#endif
53 return EXIT_SUCCESS;
54}
__cuda_callable__ IndexType getSize() const
Returns the current array size.
Definition Array.hpp:245
Computes segmented scan (or prefix sum) on a vector.
Definition SegmentedScan.h:56