Introduction
Flexible reduction
Flexible scan
- Inclusive and exclusive scan
- Segmented scan

Introduction

This chapter introduces flexible parallel reduction in TNL. It shows how to implement parallel reduction with user-defined operations that may run on both CPU and GPU. Parallel reduction is a programming pattern that appears very often in many kinds of algorithms, for example in the scalar product, vector norms or mean value evaluation, but also in the comparison of sequences or strings.

Flexible reduction

We will explain flexible parallel reduction on several examples. We start with the simplest one, the sum of a sequence of numbers, followed by more advanced problems like the scalar product or vector norms.

Sum

We start with the simple problem of computing the sum of a sequence of numbers

\[ s = \sum_{i=1}^n a_i. \]

Sequentially, such a sum can be computed very easily as follows:

double
sequentialSum( const double* a, const int size )
{
   double sum( 0.0 );
   for( int i = 0; i < size; i++ )
      sum += a[ i ];
   return sum;
}

Doing the same in CUDA for GPU is, however, much more difficult (see Optimizing Parallel Reduction in CUDA). The final code has tens of lines and it is something you do not want to write again and again whenever you need to sum a series of numbers. Using TNL and C++ lambda functions, we can do the same in a few lines of code, efficiently and independently of the underlying hardware. Let us first rewrite the previous example using C++ lambda functions:

double
sequentialSum( const double* a, const int size )
{
   auto fetch = [ = ]( int i ) -> double
   {
      return a[ i ];
   };
   auto reduction = []( const double& x, const double& y )
   {
      return x + y;
   };
 
   double sum( 0.0 );
   for( int i = 0; i < size; i++ )
      sum = reduction( sum, fetch( i ) );
   return sum;
}

As can be seen, we split the reduction into two steps:

- fetch reads the input data. Thanks to this lambda you can:
  1. Connect the reduction algorithm with the given input arrays or vectors (or any other data structure).
  2. Perform any operation you need to do with the input data.
  3. Perform another secondary operation simultaneously with the parallel reduction.
- reduction is the operation we want to perform after the data fetch. Usually it is summation, multiplication, evaluation of the minimum or maximum, or some logical operation.
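
The two roles can be illustrated by a minimal sequential sketch in plain C++. The helper sketchReduce and the function sumOfSquares are hypothetical names introduced only for this illustration; they are not part of TNL and say nothing about how TNL implements the reduction internally. Here fetch squares each element and reduction adds the fetched values:

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper (not part of TNL): sequentially reduces the values
// fetch( begin ), ..., fetch( end - 1 ), starting from the identity element.
template< typename Result, typename Fetch, typename Reduction >
Result
sketchReduce( int begin, int end, Fetch&& fetch, Reduction&& reduction, const Result& identity )
{
   assert( begin <= end );
   Result result = identity;
   for( int i = begin; i < end; i++ )
      result = reduction( result, fetch( i ) );
   return result;
}

// Sum of squares: fetch transforms the input data, reduction combines the fetched values.
inline double
sumOfSquares( const std::vector< double >& v )
{
   auto fetch = [ &v ]( int i ) -> double
   {
      return v[ i ] * v[ i ];
   };
   auto reduction = []( const double& a, const double& b )
   {
      return a + b;
   };
   return sketchReduce( 0, static_cast< int >( v.size() ), fetch, reduction, 0.0 );
}
```

Only fetch had to change compared to a plain sum; the loop and the reduction stay the same.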

Putting everything together gives the following example:

#include <iostream>
#include <cstdlib>
#include <TNL/Containers/Vector.h>
#include <TNL/Algorithms/reduce.h>
 
using namespace TNL;
using namespace TNL::Containers;
using namespace TNL::Algorithms;
 
template< typename Device >
double
sum( const Vector< double, Device >& v )
{
   /****
    * Get vector view which can be captured by lambda.
    */
   auto view = v.getConstView();
 
   /****
    * The fetch function just reads elements of vector v.
    */
   auto fetch = [ = ] __cuda_callable__( int i ) -> double
   {
      return view[ i ];
   };
 
   /***
    * Reduction is sum of two numbers.
    */
   auto reduction = [] __cuda_callable__( const double& a, const double& b )
   {
      return a + b;
   };
 
   /***
    * Finally we call the templated function reduce and pass the range of indexes to reduce,
    * the lambdas defined above and finally the identity element, zero in this case, which serves
    * for the reduction initiation.
    */
   return reduce< Device >( 0, view.getSize(), fetch, reduction, 0.0 );
}
 
int
main( int argc, char* argv[] )
{
   /***
    * Firstly, test the sum with vectors allocated on CPU.
    */
   Vector< double, Devices::Host > host_v( 10 );
   host_v = 1.0;
   std::cout << "host_v = " << host_v << std::endl;
   std::cout << "The sum of the host vector elements is " << sum( host_v ) << "." << std::endl;
 
   /***
    * And then also on GPU.
    */
#ifdef __CUDACC__
   Vector< double, Devices::Cuda > cuda_v( 10 );
   cuda_v = 1.0;
   std::cout << "cuda_v = " << cuda_v << std::endl;
   std::cout << "The sum of the CUDA vector elements is " << sum( cuda_v ) << "." << std::endl;
#endif
   return EXIT_SUCCESS;
}

Since TNL vectors cannot be passed to CUDA kernels and so they cannot be captured by CUDA lambdas, we must first get a vector view from the vector using the method getConstView().

Note that we pass 0.0 as the last argument of the template function reduce< Device >. It is the identity element for the given operation, i.e., an element which does not change the result of the operation. For addition, it is zero.
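
One way to see why the identity element matters is to imagine that the input is reduced in independent chunks whose partial results are combined afterwards. The following sketch in plain C++ does this sequentially; chunkedReduce is a hypothetical helper for illustration and is only assumed to resemble what a parallel implementation does, not taken from TNL:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical sketch (not TNL code): reduces the input chunk by chunk
// and then combines the partial results with the same operation.
template< typename Reduction >
double
chunkedReduce( const std::vector< double >& v, int chunkSize, Reduction&& reduction, double identity )
{
   assert( chunkSize > 0 );
   const int size = static_cast< int >( v.size() );
   double result = identity;
   for( int begin = 0; begin < size; begin += chunkSize ) {
      // Every chunk starts from the identity element ...
      double partial = identity;
      const int end = std::min( begin + chunkSize, size );
      for( int i = begin; i < end; i++ )
         partial = reduction( partial, v[ i ] );
      // ... and the partial results are combined by the same operation.
      result = reduction( result, partial );
   }
   return result;
}
```

With a correct identity the result does not depend on the chunk size. A wrong value, such as 1.0 for addition, would be added once per chunk and once more for the final result.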

The result of the previous code sample looks as follows:

host_v = [ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ]
The sum of the host vector elements is 10.
cuda_v = [ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ]
The sum of the CUDA vector elements is 10.

Note that the sum of vector elements can be also obtained as TNL::sum(v).

Product

To demonstrate the effect of the identity element, we will now compute the product of all elements of the vector. The identity element for multiplication is one, and we also need to replace a + b with a * b in the definition of reduction. We get the following code:

#include <iostream>
#include <cstdlib>
#include <TNL/Containers/Vector.h>
#include <TNL/Algorithms/reduce.h>
 
using namespace TNL;
using namespace TNL::Containers;
using namespace TNL::Algorithms;
 
template< typename Device >
double
product( const Vector< double, Device >& v )
{
   auto view = v.getConstView();
   auto fetch = [ = ] __cuda_callable__( int i )
   {
      return view[ i ];
   };
   auto reduction = [] __cuda_callable__( const double& a, const double& b )
   {
      return a * b;
   };
 
   /***
    * Since we compute the product of all elements, the reduction must be initialized by 1.0 not by 0.0.
    */
   return reduce< Device >( 0, view.getSize(), fetch, reduction, 1.0 );
}
 
int
main( int argc, char* argv[] )
{
   /***
    * The first test on CPU ...
    */
   Vector< double, Devices::Host > host_v( 10 );
   host_v = 1.0;
   std::cout << "host_v = " << host_v << std::endl;
   std::cout << "The product of the host vector elements is " << product( host_v ) << "." << std::endl;
 
   /***
    * ... the second test on GPU.
    */
#ifdef __CUDACC__
   Vector< double, Devices::Cuda > cuda_v( 10 );
   cuda_v = 1.0;
   std::cout << "cuda_v = " << cuda_v << std::endl;
   std::cout << "The product of the CUDA vector elements is " << product( cuda_v ) << "." << std::endl;
#endif
   return EXIT_SUCCESS;
}

leading to output like this:

host_v = [ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ]
The product of the host vector elements is 1.
cuda_v = [ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ]
The product of the CUDA vector elements is 1.

Note that the product of vector elements can be computed as TNL::product(v).

Scalar product

One of the most important operations in linear algebra is the scalar product of two vectors. Compared to computing the sum of vector elements, we must change the function fetch to read elements from both vectors and multiply them. See the following example.

#include <iostream>
#include <cstdlib>
#include <TNL/Containers/Vector.h>
#include <TNL/Algorithms/reduce.h>
 
using namespace TNL;
using namespace TNL::Containers;
using namespace TNL::Algorithms;
 
template< typename Device >
double
scalarProduct( const Vector< double, Device >& u, const Vector< double, Device >& v )
{
   auto u_view = u.getConstView();
   auto v_view = v.getConstView();
 
   /***
    * Fetch computes product of corresponding elements of both vectors.
    */
   auto fetch = [ = ] __cuda_callable__( int i )
   {
      return u_view[ i ] * v_view[ i ];
   };
   auto reduction = [] __cuda_callable__( const double& a, const double& b )
   {
      return a + b;
   };
   return reduce< Device >( 0, v_view.getSize(), fetch, reduction, 0.0 );
}
 
int
main( int argc, char* argv[] )
{
   /***
    * The first test on CPU ...
    */
   Vector< double, Devices::Host > host_u( 10 ), host_v( 10 );
   host_u = 1.0;
   host_v.forAllElements(
      [] __cuda_callable__( int i, double& value )
      {
         value = 2 * ( i % 2 ) - 1;
      } );
   std::cout << "host_u = " << host_u << std::endl;
   std::cout << "host_v = " << host_v << std::endl;
   std::cout << "The scalar product ( host_u, host_v ) is " << scalarProduct( host_u, host_v ) << "." << std::endl;
   std::cout << "The scalar product ( host_v, host_v ) is " << scalarProduct( host_v, host_v ) << "." << std::endl;
 
   /***
    * ... the second test on GPU.
    */
#ifdef __CUDACC__
   Vector< double, Devices::Cuda > cuda_u( 10 ), cuda_v( 10 );
   cuda_u = 1.0;
   cuda_v.forAllElements(
      [] __cuda_callable__( int i, double& value )
      {
         value = 2 * ( i % 2 ) - 1;
      } );
   std::cout << "cuda_u = " << cuda_u << std::endl;
   std::cout << "cuda_v = " << cuda_v << std::endl;
   std::cout << "The scalar product ( cuda_u, cuda_v ) is " << scalarProduct( cuda_u, cuda_v ) << "." << std::endl;
   std::cout << "The scalar product ( cuda_v, cuda_v ) is " << scalarProduct( cuda_v, cuda_v ) << "." << std::endl;
#endif
   return EXIT_SUCCESS;
}

The result is:

host_u = [ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ]
host_v = [ -1, 1, -1, 1, -1, 1, -1, 1, -1, 1 ]
The scalar product ( host_u, host_v ) is 0.
The scalar product ( host_v, host_v ) is 10.
cuda_u = [ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ]
cuda_v = [ -1, 1, -1, 1, -1, 1, -1, 1, -1, 1 ]
The scalar product ( cuda_u, cuda_v ) is 0.
The scalar product ( cuda_v, cuda_v ) is 10.

Note that the scalar product of vectors u and v can be computed by TNL::dot(u, v) or simply as (u, v).

Maximum norm

The maximum norm of a vector equals the largest absolute value of its elements. Therefore, fetch must return the absolute value of each vector element and reduction must return the maximum of the given values. Look at the following example.

#include <iostream>
#include <cstdlib>
#include <TNL/Containers/Vector.h>
#include <TNL/Algorithms/reduce.h>
 
using namespace TNL;
using namespace TNL::Containers;
using namespace TNL::Algorithms;
 
template< typename Device >
double
maximumNorm( const Vector< double, Device >& v )
{
   auto view = v.getConstView();
   auto fetch = [ = ] __cuda_callable__( int i )
   {
      return abs( view[ i ] );
   };
   auto reduction = [] __cuda_callable__( const double& a, const double& b )
   {
      return TNL::max( a, b );
   };
   return reduce< Device >( 0, view.getSize(), fetch, reduction, 0.0 );
}
 
int
main( int argc, char* argv[] )
{
   Vector< double, Devices::Host > host_v( 10 );
   host_v.forAllElements(
      [] __cuda_callable__( int i, double& value )
      {
         value = i - 7;
      } );
   std::cout << "host_v = " << host_v << std::endl;
   std::cout << "The maximum norm of the host vector elements is " << maximumNorm( host_v ) << "." << std::endl;
#ifdef __CUDACC__
   Vector< double, Devices::Cuda > cuda_v( 10 );
   cuda_v.forAllElements(
      [] __cuda_callable__( int i, double& value )
      {
         value = i - 7;
      } );
   std::cout << "cuda_v = " << cuda_v << std::endl;
   std::cout << "The maximum norm of the CUDA vector elements is " << maximumNorm( cuda_v ) << "." << std::endl;
#endif
   return EXIT_SUCCESS;
}

The output is:

host_v = [ -7, -6, -5, -4, -3, -2, -1, 0, 1, 2 ]
The maximum norm of the host vector elements is 7.
cuda_v = [ -7, -6, -5, -4, -3, -2, -1, 0, 1, 2 ]
The maximum norm of the CUDA vector elements is 7.

Note that the maximum norm can be computed by TNL::maxNorm(v).

Vectors comparison

The comparison of two vectors involves (parallel) reduction as well. The fetch part is responsible for the comparison of corresponding vector elements, resulting in a boolean value true or false for each pair of elements. The reduction part must perform the logical AND operation on all fetched values. We must not forget to change the identity element to true. The code may look as follows:

#include <iostream>
#include <cstdlib>
#include <TNL/Containers/Vector.h>
#include <TNL/Algorithms/reduce.h>
 
using namespace TNL;
using namespace TNL::Containers;
using namespace TNL::Algorithms;
 
template< typename Device >
bool
comparison( const Vector< double, Device >& u, const Vector< double, Device >& v )
{
   auto u_view = u.getConstView();
   auto v_view = v.getConstView();
 
   /***
    * Fetch compares corresponding elements of both vectors
    */
   auto fetch = [ = ] __cuda_callable__( int i ) -> bool
   {
      return u_view[ i ] == v_view[ i ];
   };
 
   /***
    * Reduce performs logical AND on intermediate results obtained by fetch.
    */
   auto reduction = [] __cuda_callable__( const bool& a, const bool& b )
   {
      return a && b;
   };
   return reduce< Device >( 0, v_view.getSize(), fetch, reduction, true );
}
 
int
main( int argc, char* argv[] )
{
   Vector< double, Devices::Host > host_u( 10 ), host_v( 10 );
   host_u = 1.0;
   host_v.forAllElements(
      [] __cuda_callable__( int i, double& value )
      {
         value = 2 * ( i % 2 ) - 1;
      } );
   std::cout << "host_u = " << host_u << std::endl;
   std::cout << "host_v = " << host_v << std::endl;
   std::cout << "Comparison of host_u and host_v is: " << ( comparison( host_u, host_v ) ? "'true'" : "'false'" ) << "."
             << std::endl;
   std::cout << "Comparison of host_u and host_u is: " << ( comparison( host_u, host_u ) ? "'true'" : "'false'" ) << "."
             << std::endl;
#ifdef __CUDACC__
   Vector< double, Devices::Cuda > cuda_u( 10 ), cuda_v( 10 );
   cuda_u = 1.0;
   cuda_v.forAllElements(
      [] __cuda_callable__( int i, double& value )
      {
         value = 2 * ( i % 2 ) - 1;
      } );
   std::cout << "cuda_u = " << cuda_u << std::endl;
   std::cout << "cuda_v = " << cuda_v << std::endl;
   std::cout << "Comparison of cuda_u and cuda_v is: " << ( comparison( cuda_u, cuda_v ) ? "'true'" : "'false'" ) << "."
             << std::endl;
   std::cout << "Comparison of cuda_u and cuda_u is: " << ( comparison( cuda_u, cuda_u ) ? "'true'" : "'false'" ) << "."
             << std::endl;
#endif
   return EXIT_SUCCESS;
}

And the output looks as:

host_u = [ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ]
host_v = [ -1, 1, -1, 1, -1, 1, -1, 1, -1, 1 ]
Comparison of host_u and host_v is: 'false'.
Comparison of host_u and host_u is: 'true'.
cuda_u = [ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ]
cuda_v = [ -1, 1, -1, 1, -1, 1, -1, 1, -1, 1 ]
Comparison of cuda_u and cuda_v is: 'false'.
Comparison of cuda_u and cuda_u is: 'true'.

Update and residue

In iterative solvers we often need to update a vector and compute the norm at the same time. For example, the Euler method is defined as

\[ \bf u^{k+1} = \bf u^k + \tau \Delta \bf u. \]

Together with the vector addition, we may also want to compute the \(L_2\)-norm of \( \Delta \bf u \), which may indicate convergence. Computing first the addition and then the norm would be inefficient because we would have to fetch the vector \( \Delta \bf u \) from memory twice. The following example shows how to do the addition and the norm computation at the same time.

#include <iostream>
#include <cstdlib>
#include <TNL/Containers/Vector.h>
#include <TNL/Algorithms/reduce.h>
 
using namespace TNL;
using namespace TNL::Containers;
using namespace TNL::Algorithms;
 
template< typename Device >
double
updateAndResidue( Vector< double, Device >& u, const Vector< double, Device >& delta_u, const double& tau )
{
   auto u_view = u.getView();
   auto delta_u_view = delta_u.getConstView();
   auto fetch = [ = ] __cuda_callable__( int i ) mutable -> double
   {
      const double& add = delta_u_view[ i ];
      u_view[ i ] += tau * add;
      return add * add;
   };
   auto reduction = [] __cuda_callable__( const double& a, const double& b )
   {
      return a + b;
   };
   return sqrt( reduce< Device >( 0, u_view.getSize(), fetch, reduction, 0.0 ) );
}
 
int
main( int argc, char* argv[] )
{
   const double tau = 0.1;
   Vector< double, Devices::Host > host_u( 10 ), host_delta_u( 10 );
   host_u = 0.0;
   host_delta_u = 1.0;
   std::cout << "host_u = " << host_u << std::endl;
   std::cout << "host_delta_u = " << host_delta_u << std::endl;
   double residue = updateAndResidue( host_u, host_delta_u, tau );
   std::cout << "New host_u is: " << host_u << "." << std::endl;
   std::cout << "Residue is:" << residue << std::endl;
#ifdef __CUDACC__
   Vector< double, Devices::Cuda > cuda_u( 10 ), cuda_delta_u( 10 );
   cuda_u = 0.0;
   cuda_delta_u = 1.0;
   std::cout << "cuda_u = " << cuda_u << std::endl;
   std::cout << "cuda_delta_u = " << cuda_delta_u << std::endl;
   residue = updateAndResidue( cuda_u, cuda_delta_u, tau );
   std::cout << "New cuda_u is: " << cuda_u << "." << std::endl;
   std::cout << "Residue is:" << residue << std::endl;
#endif
   return EXIT_SUCCESS;
}

The result reads as:

host_u = [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ]
host_delta_u = [ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ]
New host_u is: [ 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1 ].
Residue is:3.16228
cuda_u = [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ]
cuda_delta_u = [ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ]
New cuda_u is: [ 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1 ].
Residue is:3.16228
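
As a quick check of the printed values: all ten entries of \( \Delta \bf u \) equal one and \( \tau = 0.1 \), so each entry of the updated vector is \( 0 + 0.1 \cdot 1 = 0.1 \) and the residue is

\[ \| \Delta {\bf u} \|_2 = \sqrt{ \sum_{i=1}^{10} 1^2 } = \sqrt{10} \approx 3.16228, \]

which agrees with the output above.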

Simple MapReduce

We can also filter the data to be reduced. This operation is called MapReduce. You simply add the necessary if-statement to the fetch function or, as in the following example, use the ternary conditional operator

return u_view[ i ] > 0.0 ? u_view[ i ] : 0.0;

to sum up only the positive numbers in the vector.

#include <iostream>
#include <cstdlib>
#include <TNL/Containers/Vector.h>
#include <TNL/Algorithms/reduce.h>
 
using namespace TNL;
using namespace TNL::Containers;
using namespace TNL::Algorithms;
 
template< typename Device >
double
mapReduce( Vector< double, Device >& u )
{
   auto u_view = u.getView();
   auto fetch = [ = ] __cuda_callable__( int i ) -> double
   {
      return u_view[ i ] > 0 ? u_view[ i ] : 0.0;
   };
   auto reduction = [] __cuda_callable__( const double& a, const double& b )
   {
      return a + b;
   };
   return reduce< Device >( 0, u_view.getSize(), fetch, reduction, 0.0 );
}
 
int
main( int argc, char* argv[] )
{
   Vector< double, Devices::Host > host_u( 10 );
   host_u.forAllElements(
      [] __cuda_callable__( int i, double& value )
      {
         value = sin( (double) i );
      } );
   double result = mapReduce( host_u );
   std::cout << "host_u = " << host_u << std::endl;
   std::cout << "Sum of the positive numbers is:" << result << std::endl;
#ifdef __CUDACC__
   Vector< double, Devices::Cuda > cuda_u( 10 );
   cuda_u = host_u;
   result = mapReduce( cuda_u );
   std::cout << "cuda_u = " << cuda_u << std::endl;
   std::cout << "Sum of the positive numbers is:" << result << std::endl;
#endif
   return EXIT_SUCCESS;
}

The result is:

host_u = [ 0, 0.841471, 0.909297, 0.14112, -0.756802, -0.958924, -0.279415, 0.656987, 0.989358, 0.412118 ]
Sum of the positive numbers is:3.95035
cuda_u = [ 0, 0.841471, 0.909297, 0.14112, -0.756802, -0.958924, -0.279415, 0.656987, 0.989358, 0.412118 ]
Sum of the positive numbers is:3.95035

Take a look at the following example where the filtering depends on the element indexes rather than values:

#include <iostream>
#include <cstdlib>
#include <TNL/Containers/Vector.h>
#include <TNL/Algorithms/reduce.h>
#include <TNL/Timer.h>
 
using namespace TNL;
using namespace TNL::Containers;
using namespace TNL::Algorithms;
 
template< typename Device >
double
mapReduce( Vector< double, Device >& u )
{
   auto u_view = u.getView();
   auto fetch = [ = ] __cuda_callable__( int i ) -> double
   {
      if( i % 2 == 0 )
         return u_view[ i ];
      return 0.0;
   };
   auto reduction = [] __cuda_callable__( const double& a, const double& b )
   {
      return a + b;
   };
   return reduce< Device >( 0, u_view.getSize(), fetch, reduction, 0.0 );
}
 
int
main( int argc, char* argv[] )
{
   Timer timer;
   Vector< double, Devices::Host > host_u( 100000 );
   host_u = 1.0;
   timer.start();
   double result = mapReduce( host_u );
   timer.stop();
   std::cout << "Host result is:" << result << ". It took " << timer.getRealTime() << " seconds." << std::endl;
#ifdef __CUDACC__
   Vector< double, Devices::Cuda > cuda_u( 100000 );
   cuda_u = 1.0;
   timer.reset();
   timer.start();
   result = mapReduce( cuda_u );
   timer.stop();
   std::cout << "CUDA result is:" << result << ". It took " << timer.getRealTime() << " seconds." << std::endl;
#endif
   return EXIT_SUCCESS;
}

The result is:

Host result is:50000. It took 0.00244334 seconds.
CUDA result is:50000. It took 0.000299576 seconds.

This is not very efficient. For half of the elements, we return zero, which has no effect on the reduction. A better solution is to run the reduction over only half of the elements and to change the fetch function to

return u_view[ 2 * i ];

See the following example and compare the execution times.

#include <iostream>
#include <cstdlib>
#include <TNL/Containers/Vector.h>
#include <TNL/Algorithms/reduce.h>
#include <TNL/Timer.h>
 
using namespace TNL;
using namespace TNL::Containers;
using namespace TNL::Algorithms;
 
template< typename Device >
double
mapReduce( Vector< double, Device >& u )
{
   auto u_view = u.getView();
   auto fetch = [ = ] __cuda_callable__( int i ) -> double
   {
      return u_view[ 2 * i ];
   };
   auto reduction = [] __cuda_callable__( const double& a, const double& b )
   {
      return a + b;
   };
   return reduce< Device >( 0, u_view.getSize() / 2, fetch, reduction, 0.0 );
}
 
int
main( int argc, char* argv[] )
{
   Timer timer;
   Vector< double, Devices::Host > host_u( 100000 );
   host_u = 1.0;
   timer.start();
   double result = mapReduce( host_u );
   timer.stop();
   std::cout << "Host result is:" << result << ". It took " << timer.getRealTime() << " seconds." << std::endl;
#ifdef __CUDACC__
   Vector< double, Devices::Cuda > cuda_u( 100000 );
   cuda_u = 1.0;
   timer.reset();
   timer.start();
   result = mapReduce( cuda_u );
   timer.stop();
   std::cout << "CUDA result is:" << result << ". It took " << timer.getRealTime() << " seconds." << std::endl;
#endif
   return EXIT_SUCCESS;
}

The result is:

Host result is:50000. It took 0.00174445 seconds.
CUDA result is:50000. It took 0.000310647 seconds.
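
The example above halves the range with getSize() / 2, which matches the filtered version because the vector size is even. For an odd size, that range would skip the last even index. The following sketch in plain C++ (hypothetical helper names, no TNL involved) uses ( size + 1 ) / 2 as the number of even indexes and compares the result with the filtered variant:

```cpp
#include <cassert>
#include <vector>

// Hypothetical sketch (not TNL code): sum of the elements at even indexes,
// visiting only those indexes. There are ( size + 1 ) / 2 of them,
// which also covers vectors of odd size.
inline double
sumEvenIndexes( const std::vector< double >& u )
{
   const int size = static_cast< int >( u.size() );
   const int count = ( size + 1 ) / 2;
   double sum = 0.0;
   for( int i = 0; i < count; i++ ) {
      assert( 2 * i < size );
      sum += u[ 2 * i ];
   }
   return sum;
}

// Reference version: visits all indexes and filters them by parity.
inline double
sumEvenIndexesFiltered( const std::vector< double >& u )
{
   double sum = 0.0;
   for( int i = 0; i < static_cast< int >( u.size() ); i++ )
      sum += ( i % 2 == 0 ) ? u[ i ] : 0.0;
   return sum;
}
```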

Reduction with argument

In some situations we may need to locate a given element in the vector, for example the index of the smallest or the largest element. The function reduceWithArgument can do this. In the following example, we modify the function for computing the maximum norm of a vector. Instead of just computing the value, we now also want to get the index of the element whose absolute value equals the maximum norm. The lambda function reduction no longer computes only the maximum of two given elements; it must also compute the index of the winner. See the following code:

#include <iostream>
#include <cstdlib>
#include <limits>
#include <TNL/Containers/Vector.h>
#include <TNL/Algorithms/reduce.h>
 
using namespace TNL;
using namespace TNL::Containers;
using namespace TNL::Algorithms;
 
template< typename Device >
std::pair< double, int >
maximumNorm( const Vector< double, Device >& v )
{
   auto view = v.getConstView();
 
   auto fetch = [ = ] __cuda_callable__( int i )
   {
      return abs( view[ i ] );
   };
   auto reduction = [] __cuda_callable__( double& a, const double& b, int& aIdx, const int& bIdx )
   {
      if( a < b ) {
         a = b;
         aIdx = bIdx;
      }
      else if( a == b && bIdx < aIdx )
         aIdx = bIdx;
   };
   return reduceWithArgument< Device >( 0, view.getSize(), fetch, reduction, std::numeric_limits< double >::lowest() );
}
 
int
main( int argc, char* argv[] )
{
   Vector< double, Devices::Host > host_v( 10 );
   host_v.forAllElements(
      [] __cuda_callable__( int i, double& value )
      {
         value = i - 7;
      } );
   std::cout << "host_v = " << host_v << std::endl;
   auto maxNormHost = maximumNorm( host_v );
   std::cout << "The maximum norm of the host vector elements is " << maxNormHost.first << " at position " << maxNormHost.second
             << "." << std::endl;
#ifdef __CUDACC__
   Vector< double, Devices::Cuda > cuda_v( 10 );
   cuda_v.forAllElements(
      [] __cuda_callable__( int i, double& value )
      {
         value = i - 7;
      } );
   std::cout << "cuda_v = " << cuda_v << std::endl;
   auto maxNormCuda = maximumNorm( cuda_v );
   std::cout << "The maximum norm of the device vector elements is " << maxNormCuda.first << " at position "
             << maxNormCuda.second << "." << std::endl;
#endif
   return EXIT_SUCCESS;
}

The definition of the lambda function reduction reads as:

auto reduction = [] __cuda_callable__ ( double& a, const double& b, int& aIdx, const int& bIdx );

In addition to the vector element values a and b, it also gets their positions aIdx and bIdx. The function is responsible for setting a to the maximum of the two and aIdx to the position of the larger element. Note that the parameters have the above-mentioned meaning only when computing a minimum or maximum.

The result looks as:

host_v = [ -7, -6, -5, -4, -3, -2, -1, 0, 1, 2 ]
The maximum norm of the host vector elements is 7 at position 0.
cuda_v = [ -7, -6, -5, -4, -3, -2, -1, 0, 1, 2 ]
The maximum norm of the device vector elements is 7 at position 0.
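
To make the role of the four-parameter reduction lambda more concrete, here is a sequential sketch in plain C++. The names sketchReduceWithArgument and maximumNormWithPosition are hypothetical, and the handling of the first element is an assumption made for this illustration; it is not meant to describe how TNL implements reduceWithArgument:

```cpp
#include <cassert>
#include <cmath>
#include <limits>
#include <utility>
#include <vector>

// Hypothetical sequential counterpart (not TNL code) of a reduction with argument:
// the reduction lambda updates the running value and its position in place.
template< typename Fetch, typename Reduction >
std::pair< double, int >
sketchReduceWithArgument( int begin, int end, Fetch&& fetch, Reduction&& reduction, double identity )
{
   assert( 0 <= begin && begin <= end );
   double value = identity;
   int position = -1;  // remains -1 for an empty range
   for( int i = begin; i < end; i++ ) {
      const double candidate = fetch( i );
      if( position < 0 ) {
         // The first fetched element initializes the result.
         value = candidate;
         position = i;
      }
      else
         reduction( value, candidate, position, i );
   }
   return { value, position };
}

inline std::pair< double, int >
maximumNormWithPosition( const std::vector< double >& v )
{
   auto fetch = [ &v ]( int i ) -> double
   {
      return std::abs( v[ i ] );
   };
   // Same shape as the lambda in the example above: keep the larger value and,
   // in case of a tie, the smaller index.
   auto reduction = []( double& a, const double& b, int& aIdx, const int& bIdx )
   {
      if( a < b ) {
         a = b;
         aIdx = bIdx;
      }
      else if( a == b && bIdx < aIdx )
         aIdx = bIdx;
   };
   return sketchReduceWithArgument(
      0, static_cast< int >( v.size() ), fetch, reduction, std::numeric_limits< double >::lowest() );
}
```

The tie-breaking branch makes the result independent of the order in which candidates are compared, which matters once the comparison order is no longer sequential.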

Using functionals for reduction

You might notice that the lambda function reduction does not take as many different forms as fetch. In addition, setting the identity element can be annoying, especially when computing a minimum or maximum, where we need to use std::numeric_limits to make the code general for any type. To make things simpler, TNL offers variants of several functionals known from the STL. They can be used instead of the lambda function reduction and they also carry the identity element. See the following example showing the scalar product of two vectors, now with a functional:

#include <iostream>
#include <cstdlib>
#include <TNL/Containers/Vector.h>
#include <TNL/Algorithms/reduce.h>
 
using namespace TNL;
using namespace TNL::Containers;
using namespace TNL::Algorithms;
 
template< typename Device >
double
scalarProduct( const Vector< double, Device >& u, const Vector< double, Device >& v )
{
   auto u_view = u.getConstView();
   auto v_view = v.getConstView();
 
   /***
    * Fetch computes product of corresponding elements of both vectors.
    */
   return reduce< Device >(
      0,
      v_view.getSize(),
      [ = ] __cuda_callable__( int i )
      {
         return u_view[ i ] * v_view[ i ];
      },
      TNL::Plus{} );
}
 
int
main( int argc, char* argv[] )
{
   /***
    * The first test on CPU ...
    */
   Vector< double, Devices::Host > host_u( 10 ), host_v( 10 );
   host_u = 1.0;
   host_v.forAllElements(
      [] __cuda_callable__( int i, double& value )
      {
         value = 2 * ( i % 2 ) - 1;
      } );
   std::cout << "host_u = " << host_u << std::endl;
   std::cout << "host_v = " << host_v << std::endl;
   std::cout << "The scalar product ( host_u, host_v ) is " << scalarProduct( host_u, host_v ) << "." << std::endl;
   std::cout << "The scalar product ( host_v, host_v ) is " << scalarProduct( host_v, host_v ) << "." << std::endl;
 
   /***
    * ... the second test on GPU.
    */
#ifdef __CUDACC__
   Vector< double, Devices::Cuda > cuda_u( 10 ), cuda_v( 10 );
   cuda_u = 1.0;
   cuda_v.forAllElements(
      [] __cuda_callable__( int i, double& value )
      {
         value = 2 * ( i % 2 ) - 1;
      } );
   std::cout << "cuda_u = " << cuda_u << std::endl;
   std::cout << "cuda_v = " << cuda_v << std::endl;
   std::cout << "The scalar product ( cuda_u, cuda_v ) is " << scalarProduct( cuda_u, cuda_v ) << "." << std::endl;
   std::cout << "The scalar product ( cuda_v, cuda_v ) is " << scalarProduct( cuda_v, cuda_v ) << "." << std::endl;
#endif
   return EXIT_SUCCESS;
}

This example also shows a more compact way to invoke the function reduce. This way, one should be able to perform (parallel) reduction very easily. The result looks as follows:

host_u = [ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ]
host_v = [ -1, 1, -1, 1, -1, 1, -1, 1, -1, 1 ]
The scalar product ( host_u, host_v ) is 0.
The scalar product ( host_v, host_v ) is 10.
cuda_u = [ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ]
cuda_v = [ -1, 1, -1, 1, -1, 1, -1, 1, -1, 1 ]
The scalar product ( cuda_u, cuda_v ) is 0.
The scalar product ( cuda_v, cuda_v ) is 10.

TNL/Functional.h defines the following functionals, which should cover most operations that can reasonably be used for reduction:

Functional	Reduction operation
TNL::Plus	Sum
TNL::Multiplies	Product
TNL::Min	Minimum
TNL::Max	Maximum
TNL::MinWithArg	Minimum with argument
TNL::MaxWithArg	Maximum with argument
TNL::LogicalAnd	Logical AND
TNL::LogicalOr	Logical OR
TNL::BitAnd	Bit AND
TNL::BitOr	Bit OR
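
The idea of a functional that carries its own identity element can be sketched as follows. SketchPlus, SketchMax and reduceWithFunctional are hypothetical names written for this illustration in plain C++; they are not the definitions from TNL/Functional.h, and the getIdentity interface used here is an assumption of the sketch:

```cpp
#include <cassert>
#include <limits>
#include <vector>

// Hypothetical functionals (illustration only, not the TNL definitions): each one
// provides the binary operation together with its identity element.
struct SketchPlus
{
   template< typename T >
   static constexpr T getIdentity() { return T( 0 ); }

   template< typename T >
   constexpr T operator()( const T& a, const T& b ) const { return a + b; }
};

struct SketchMax
{
   template< typename T >
   static constexpr T getIdentity() { return std::numeric_limits< T >::lowest(); }

   template< typename T >
   constexpr T operator()( const T& a, const T& b ) const { return a < b ? b : a; }
};

// The caller no longer passes the identity; it is taken from the functional.
template< typename T, typename Functional >
T
reduceWithFunctional( const std::vector< T >& v, Functional functional )
{
   T result = Functional::template getIdentity< T >();
   for( const T& x : v )
      result = functional( result, x );
   return result;
}
```

Because the identity is derived from the value type, the same functional works for int, float or double without the caller spelling out std::numeric_limits.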

Flexible scan

Inclusive and exclusive scan

Inclusive scan (or prefix sum) turns a sequence \( a_1, \ldots, a_n \) into a sequence \( s_1, \ldots, s_n \) defined as

\[ s_i = \sum_{j=1}^i a_j. \]

Exclusive scan (or prefix sum) is defined as

\[ \sigma_i = \sum_{j=1}^{i-1} a_j. \]

For example, the inclusive scan of

[1,3,5,7,9,11,13]

is

[1,4,9,16,25,36,49]

and the exclusive scan of the same sequence is

[0,1,4,9,16,25,36]
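
Both definitions can be written down as short sequential loops. The following plain C++ sketch uses the hypothetical helpers sketchInclusiveScan and sketchExclusiveScan (not TNL functions); the only difference between them is whether the current element is combined before or after the result is stored:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sequential sketches (not TNL code) of both kinds of scan.
template< typename T, typename Reduction >
std::vector< T >
sketchInclusiveScan( const std::vector< T >& a, Reduction&& reduction, const T& identity )
{
   std::vector< T > s( a.size() );
   T running = identity;
   for( std::size_t i = 0; i < a.size(); i++ ) {
      running = reduction( running, a[ i ] );  // the result includes a[ i ]
      s[ i ] = running;
   }
   return s;
}

template< typename T, typename Reduction >
std::vector< T >
sketchExclusiveScan( const std::vector< T >& a, Reduction&& reduction, const T& identity )
{
   std::vector< T > s( a.size() );
   T running = identity;
   for( std::size_t i = 0; i < a.size(); i++ ) {
      s[ i ] = running;  // the result covers everything before a[ i ]
      running = reduction( running, a[ i ] );
   }
   return s;
}
```

Note that the first element of the exclusive scan is always the identity element of the operation.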

Both kinds of scan have many different applications. They are usually applied only to summation; however, product or logical operations can be handy as well. In TNL, scan is implemented in a similar way as reduction, so it can be easily modified by lambda functions. The following example shows how it works:

inplaceInclusiveScan( array, 0, array.getSize(), TNL::Plus{} );

This is equivalent to the following shortened call (the second, third and fourth parameters have a default value):

inplaceInclusiveScan( array );

The complete example looks as follows:

#include <iostream>
#include <cstdlib>
#include <TNL/Containers/Array.h>
#include <TNL/Algorithms/scan.h>
 
using namespace TNL;
using namespace TNL::Containers;
using namespace TNL::Algorithms;
 
int
main( int argc, char* argv[] )
{
   /***
    * Firstly, test the prefix sum with an array allocated on CPU.
    */
   Array< double, Devices::Host > host_a( 10 );
   host_a = 1.0;
   std::cout << "host_a = " << host_a << std::endl;
   inplaceInclusiveScan( host_a );
   std::cout << "The prefix sum of the host array is " << host_a << "." << std::endl;
 
   /***
    * And then also on GPU.
    */
#ifdef __CUDACC__
   Array< double, Devices::Cuda > cuda_a( 10 );
   cuda_a = 1.0;
   std::cout << "cuda_a = " << cuda_a << std::endl;
   inplaceInclusiveScan( cuda_a );
   std::cout << "The prefix sum of the CUDA array is " << cuda_a << "." << std::endl;
#endif
   return EXIT_SUCCESS;
}

Scan does not use a fetch function because the scan must be performed on an array. Its complexity is also higher compared to reduction. Thus, if one needs to perform some operation on the array elements before the scan, this can be done explicitly without affecting the performance significantly. On the other hand, the scan function takes the interval of array elements on which the scan is performed as its second and third arguments. The next argument is the reduction operation to be performed by the scan and the last argument is the identity element of the reduction operation.
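
The explicit preprocessing step mentioned above might look like the following sketch. It is written in plain C++ with the STL standing in for TNL, and squareThenScan is a hypothetical name. The array is first transformed in a separate pass, and then an inclusive scan is computed in place on the interval [ begin, end ) only:

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

// Sketch in plain C++ (not TNL code): square all elements in an explicit
// first pass, then compute an inclusive scan in place on [ begin, end ).
// The array is taken by value and returned to keep the sketch free of side effects.
inline std::vector< double >
squareThenScan( std::vector< double > a, int begin, int end )
{
   assert( 0 <= begin && begin <= end && end <= static_cast< int >( a.size() ) );
   // Explicit replacement of the fetch step: transform the data before the scan.
   std::transform( a.begin(), a.end(), a.begin(), []( double x ) { return x * x; } );
   // Inclusive prefix sum restricted to the given interval; elements outside stay untouched.
   std::partial_sum( a.begin() + begin, a.begin() + end, a.begin() + begin );
   return a;
}
```

For the input [ 1, 2, 3, 4 ] and the interval [ 1, 3 ), the squares are [ 1, 4, 9, 16 ] and the scan changes only the two middle elements, giving [ 1, 4, 13, 16 ].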

The result looks as:

host_a = [ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ]
The prefix sum of the host array is [ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 ].
cuda_a = [ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ]
The prefix sum of the CUDA array is [ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 ].

Exclusive scan works similarly. The complete example looks as follows:

#include <iostream>
#include <cstdlib>
#include <TNL/Containers/Array.h>
#include <TNL/Algorithms/scan.h>
 
using namespace TNL;
using namespace TNL::Containers;
using namespace TNL::Algorithms;
 
int
main( int argc, char* argv[] )
{
   /***
    * Firstly, test the prefix sum with an array allocated on CPU.
    */
   Array< double, Devices::Host > host_a( 10 );
   host_a = 1.0;
   std::cout << "host_a = " << host_a << std::endl;
   inplaceExclusiveScan( host_a );
   std::cout << "The prefix sum of the host array is " << host_a << "." << std::endl;
 
   /***
    * And then also on GPU.
    */
#ifdef __CUDACC__
   Array< double, Devices::Cuda > cuda_a( 10 );
   cuda_a = 1.0;
   std::cout << "cuda_a = " << cuda_a << std::endl;
   inplaceExclusiveScan( cuda_a );
   std::cout << "The prefix sum of the CUDA array is " << cuda_a << "." << std::endl;
#endif
   return EXIT_SUCCESS;
}

And the result looks as:

host_a = [ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ]
The prefix sum of the host array is [ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 ].
cuda_a = [ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ]
The prefix sum of the CUDA array is [ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 ].

Segmented scan

Segmented scan is a modification of the common scan. In this case, the sequence of numbers at hand is divided into segments, for example

[1,3,5][2,4,6,9][3,5][3,6,9,12,15]

and we want to compute inclusive or exclusive scan of each segment. For inclusive segmented prefix sum we get

[1,4,9][2,6,12,21][3,8][3,9,18,30,45]

and the result for exclusive segmented prefix sum is

[0,1,4][0,2,6,12][0,3][0,3,9,18,30]

In addition to the common scan, we need to encode the segments of the input sequence. This is done by an auxiliary flags array (it can be an array of booleans) having 1 at the beginning of each segment and 0 at all other positions. In our example, the flags array followed by the input sequence would look like this:

[1,0,0,1,0,0,0,1,0,1,0,0, 0, 0]

[1,3,5,2,4,6,9,3,5,3,6,9,12,15]

Note: Segmented scan is not implemented for CUDA yet.

#include <iostream>
#include <cstdlib>
#include <TNL/Containers/Array.h>
#include <TNL/Algorithms/SegmentedScan.h>
 
using namespace TNL;
using namespace TNL::Containers;
using namespace TNL::Algorithms;
 
template< typename Device >
void
segmentedScan( Array< double, Device >& v, Array< bool, Device >& flags )
{
   /***
    * Reduction is sum of two numbers.
    */
   auto reduce = [] __cuda_callable__( const double& a, const double& b )
   {
      return a + b;
   };
 
   /***
    * As parameters, we pass the array on which the scan is to be performed, the flags
    * array marking the segments, the interval where the scan is performed, the lambda
    * function used by the scan and zero as the identity element of the 'sum' operation.
    */
   SegmentedScan< Device >::perform( v, flags, 0, v.getSize(), reduce, 0.0 );
}
 
int
main( int argc, char* argv[] )
{
   /***
    * Firstly, test the segmented prefix sum with arrays allocated on CPU.
    */
   Array< bool, Devices::Host > host_flags{ 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0 };
   Array< double, Devices::Host > host_v{ 1, 3, 5, 2, 4, 6, 9, 3, 5, 3, 6, 9, 12, 15 };
   std::cout << "host_flags = " << host_flags << std::endl;
   std::cout << "host_v     = " << host_v << std::endl;
   segmentedScan( host_v, host_flags );
   std::cout << "The segmented prefix sum of the host array is " << host_v << "." << std::endl;
 
   /***
    * And then also on GPU.
    */
#ifdef __CUDACC__
   //Array< bool, Devices::Cuda > cuda_flags{ 1,0,0,1,0,0,0,1,0,1,0,0, 0, 0 };
   //Array< double, Devices::Cuda > cuda_v { 1,3,5,2,4,6,9,3,5,3,6,9,12,15 };
   //std::cout << "cuda_flags = " << cuda_flags << std::endl;
   //std::cout << "cuda_v     = " << cuda_v << std::endl;
   //segmentedScan( cuda_v, cuda_flags );
   //std::cout << "The segmented prefix sum of the CUDA array is " << cuda_v << "." << std::endl;
#endif
   return EXIT_SUCCESS;
}

The result reads as:

host_flags = [ 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0 ]
host_v     = [ 1, 3, 5, 2, 4, 6, 9, 3, 5, 3, 6, 9, 12, 15 ]
The segmented prefix sum of the host array is [ 1, 4, 9, 2, 6, 12, 21, 3, 8, 3, 9, 18, 30, 45 ].
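
The effect of the flags array can be made explicit by a sequential sketch in plain C++. The helper sketchSegmentedScan is hypothetical and is not the TNL implementation; it restarts the running value from the identity element whenever a flag marks the beginning of a segment, and it covers both the inclusive and the exclusive variant described above:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sequential sketch (not the TNL implementation) of a segmented scan.
// flags[ i ] == true marks the first element of a segment.
template< typename Reduction >
std::vector< double >
sketchSegmentedScan( const std::vector< double >& v,
                     const std::vector< bool >& flags,
                     Reduction&& reduction,
                     double identity,
                     bool exclusive )
{
   assert( v.size() == flags.size() );
   std::vector< double > result( v.size() );
   double running = identity;
   for( std::size_t i = 0; i < v.size(); i++ ) {
      if( flags[ i ] )
         running = identity;  // a new segment starts here
      if( exclusive )
         result[ i ] = running;  // value before v[ i ] is combined
      running = reduction( running, v[ i ] );
      if( ! exclusive )
         result[ i ] = running;  // value after v[ i ] is combined
   }
   return result;
}
```

Applied to the flags and values from the example above, the inclusive variant yields the segmented prefix sum shown in the output and the exclusive variant yields [0,1,4][0,2,6,12][0,3][0,3,9,18,30].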
