Introduction
This chapter introduces flexible parallel reduction in TNL. It shows how to easily implement parallel reduction with user-defined operations which may run on both CPU and GPU. Parallel reduction is a programming pattern appearing very often in many different algorithms, for example in scalar products, vector norms or mean-value evaluation, but also in comparison of sequences or strings.
Flexible reduction
We will explain flexible parallel reduction on several examples. We start with the simplest one, the sum of a sequence of numbers, followed by more advanced problems like the scalar product or vector norms.
Sum
We start with the simple problem of computing the sum of a sequence of numbers
\[ s = \sum_{i=1}^n a_i. \]
Sequentially, such a sum can be computed very easily as follows:
```cpp
double sequentialSum( const double* a, const int size )
{
   double sum = 0.0;
   for( int i = 0; i < size; i++ )
      sum += a[ i ];
   return sum;
}
```
Doing the same in CUDA for GPU is, however, much more difficult (see Optimizing Parallel Reduction in CUDA). The final code has tens of lines and it is something you do not want to write again and again every time you need to sum a series of numbers. Using TNL and C++ lambda functions, we may do the same in a few lines of code, efficiently and independently of the hardware beneath. Let us first rewrite the previous example using C++ lambda functions:
```cpp
double sequentialSum( const double* a, const int size )
{
   auto fetch = [=] ( int i ) -> double { return a[ i ]; };
   auto reduction = [] ( double& x, const double& y ) { return x + y; };

   double sum = 0.0;
   for( int i = 0; i < size; i++ )
      sum = reduction( sum, fetch( i ) );
   return sum;
}
```
As can be seen, we split the reduction into two steps:

- `fetch` reads the input data. Thanks to this lambda you can:
  - connect the reduction algorithm with a given input array or vector (or any other data structure),
  - perform any operation you need on the input data,
  - perform another, secondary operation simultaneously with the parallel reduction.
- `reduction` is the operation we want to perform on the fetched data. Usually it is summation, multiplication, evaluation of minimum or maximum, or some logical operation.
Putting everything together gives the following example:
```cpp
#include <iostream>
#include <cstdlib>
#include <TNL/Containers/Vector.h>
#include <TNL/Algorithms/reduce.h>

using namespace TNL;
using namespace TNL::Containers;
using namespace TNL::Algorithms;

template< typename Device >
double sum( const Vector< double, Device >& v )
{
   // Vectors cannot be captured by CUDA lambdas, so we work with a view.
   auto view = v.getConstView();
   auto fetch = [=] __cuda_callable__ ( int i ) { return view[ i ]; };
   auto reduction = [] __cuda_callable__ ( const double& a, const double& b ) { return a + b; };
   return reduce< Device >( 0, view.getSize(), fetch, reduction, 0.0 );
}

int main( int argc, char* argv[] )
{
   Vector< double, Devices::Host > host_v( 10 );
   host_v = 1.0;
   std::cout << "host_v = " << host_v << std::endl;
   std::cout << "The sum of the host vector elements is " << sum( host_v ) << "." << std::endl;
#ifdef __CUDACC__
   Vector< double, Devices::Cuda > cuda_v( 10 );
   cuda_v = 1.0;
   std::cout << "cuda_v = " << cuda_v << std::endl;
   std::cout << "The sum of the CUDA vector elements is " << sum( cuda_v ) << "." << std::endl;
#endif
   return EXIT_SUCCESS;
}
```
Since TNL vectors cannot be passed to CUDA kernels, they cannot be captured by CUDA lambdas. Therefore we must first get a vector view from the vector using the method `getConstView()`.

Note that we pass `0.0` as the last argument of the template function `reduce< Device >`. It is the identity (neutral) element of the given operation, i.e. an element which does not change the result of the operation. For addition, it is zero. The result looks as follows.
host_v = [ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ]
The sum of the host vector elements is 10.
cuda_v = [ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ]
The sum of the CUDA vector elements is 10.
The sum of vector elements can also be obtained with the function `sum(v)`.
Product
To demonstrate the role of the identity element, we will now compute the product of all elements of the vector. The identity element of multiplication is one, and we also need to replace `a + b` with `a * b` in the definition of `reduction`. We get the following code:
```cpp
#include <iostream>
#include <cstdlib>
#include <TNL/Containers/Vector.h>
#include <TNL/Algorithms/reduce.h>

using namespace TNL;
using namespace TNL::Containers;
using namespace TNL::Algorithms;

template< typename Device >
double product( const Vector< double, Device >& v )
{
   auto view = v.getConstView();
   auto fetch = [=] __cuda_callable__ ( int i ) { return view[ i ]; };
   auto reduction = [] __cuda_callable__ ( const double& a, const double& b ) { return a * b; };
   // The identity element of multiplication is 1.0.
   return reduce< Device >( 0, view.getSize(), fetch, reduction, 1.0 );
}

int main( int argc, char* argv[] )
{
   Vector< double, Devices::Host > host_v( 10 );
   host_v = 1.0;
   std::cout << "host_v = " << host_v << std::endl;
   std::cout << "The product of the host vector elements is " << product( host_v ) << "." << std::endl;
#ifdef __CUDACC__
   Vector< double, Devices::Cuda > cuda_v( 10 );
   cuda_v = 1.0;
   std::cout << "cuda_v = " << cuda_v << std::endl;
   std::cout << "The product of the CUDA vector elements is " << product( cuda_v ) << "." << std::endl;
#endif
   return EXIT_SUCCESS;
}
```
leading to output like this:
host_v = [ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ]
The product of the host vector elements is 1.
cuda_v = [ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ]
The product of the CUDA vector elements is 1.
The product of vector elements can also be computed using the function `product(v)`.
Scalar product
One of the most important operations in linear algebra is the scalar product of two vectors. Compared to computing the sum of vector elements, we must change the function `fetch` to read elements from both input vectors and multiply them. See the following example.
```cpp
#include <iostream>
#include <cstdlib>
#include <TNL/Containers/Vector.h>
#include <TNL/Algorithms/reduce.h>

using namespace TNL;
using namespace TNL::Containers;
using namespace TNL::Algorithms;

template< typename Device >
double scalarProduct( const Vector< double, Device >& u, const Vector< double, Device >& v )
{
   auto u_view = u.getConstView();
   auto v_view = v.getConstView();
   // Fetch reads elements from both vectors and multiplies them.
   auto fetch = [=] __cuda_callable__ ( int i ) { return u_view[ i ] * v_view[ i ]; };
   auto reduction = [] __cuda_callable__ ( const double& a, const double& b ) { return a + b; };
   return reduce< Device >( 0, v_view.getSize(), fetch, reduction, 0.0 );
}

int main( int argc, char* argv[] )
{
   Vector< double, Devices::Host > host_u( 10 ), host_v( 10 );
   host_u = 1.0;
   host_v.forAllElements( [] __cuda_callable__ ( int i, double& value ) { value = 2 * ( i % 2 ) - 1; } );
   std::cout << "host_u = " << host_u << std::endl;
   std::cout << "host_v = " << host_v << std::endl;
   std::cout << "The scalar product ( host_u, host_v ) is " << scalarProduct( host_u, host_v ) << "." << std::endl;
   std::cout << "The scalar product ( host_v, host_v ) is " << scalarProduct( host_v, host_v ) << "." << std::endl;
#ifdef __CUDACC__
   Vector< double, Devices::Cuda > cuda_u( 10 ), cuda_v( 10 );
   cuda_u = 1.0;
   cuda_v.forAllElements( [] __cuda_callable__ ( int i, double& value ) { value = 2 * ( i % 2 ) - 1; } );
   std::cout << "cuda_u = " << cuda_u << std::endl;
   std::cout << "cuda_v = " << cuda_v << std::endl;
   std::cout << "The scalar product ( cuda_u, cuda_v ) is " << scalarProduct( cuda_u, cuda_v ) << "." << std::endl;
   std::cout << "The scalar product ( cuda_v, cuda_v ) is " << scalarProduct( cuda_v, cuda_v ) << "." << std::endl;
#endif
   return EXIT_SUCCESS;
}
```
The result is:
host_u = [ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ]
host_v = [ -1, 1, -1, 1, -1, 1, -1, 1, -1, 1 ]
The scalar product ( host_u, host_v ) is 0.
The scalar product ( host_v, host_v ) is 10.
cuda_u = [ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ]
cuda_v = [ -1, 1, -1, 1, -1, 1, -1, 1, -1, 1 ]
The scalar product ( cuda_u, cuda_v ) is 0.
The scalar product ( cuda_v, cuda_v ) is 10.
The scalar product of vectors `u` and `v` in TNL can be computed by `TNL::dot(u, v)` or simply as `(u, v)`.
Maximum norm
The maximum norm of a vector equals the largest absolute value of its elements. Therefore, `fetch` must return the absolute value of each vector element and `reduction` will return the maximum of the given values. Look at the following example.
```cpp
#include <iostream>
#include <cstdlib>
#include <TNL/Containers/Vector.h>
#include <TNL/Algorithms/reduce.h>

using namespace TNL;
using namespace TNL::Containers;
using namespace TNL::Algorithms;

template< typename Device >
double maximumNorm( const Vector< double, Device >& v )
{
   auto view = v.getConstView();
   auto fetch = [=] __cuda_callable__ ( int i ) { return TNL::abs( view[ i ] ); };
   auto reduction = [] __cuda_callable__ ( const double& a, const double& b ) { return TNL::max( a, b ); };
   // Zero works as the identity element here since all fetched values are non-negative.
   return reduce< Device >( 0, view.getSize(), fetch, reduction, 0.0 );
}

int main( int argc, char* argv[] )
{
   Vector< double, Devices::Host > host_v( 10 );
   host_v.forAllElements( [] __cuda_callable__ ( int i, double& value ) { value = i - 7; } );
   std::cout << "host_v = " << host_v << std::endl;
   std::cout << "The maximum norm of the host vector elements is " << maximumNorm( host_v ) << "." << std::endl;
#ifdef __CUDACC__
   Vector< double, Devices::Cuda > cuda_v( 10 );
   cuda_v.forAllElements( [] __cuda_callable__ ( int i, double& value ) { value = i - 7; } );
   std::cout << "cuda_v = " << cuda_v << std::endl;
   std::cout << "The maximum norm of the CUDA vector elements is " << maximumNorm( cuda_v ) << "." << std::endl;
#endif
   return EXIT_SUCCESS;
}
```
The output is:
host_v = [ -7, -6, -5, -4, -3, -2, -1, 0, 1, 2 ]
The maximum norm of the host vector elements is 7.
cuda_v = [ -7, -6, -5, -4, -3, -2, -1, 0, 1, 2 ]
The maximum norm of the CUDA vector elements is 7.
The maximum norm in TNL is computed by the function `TNL::maxNorm`.
Vectors comparison
Comparison of two vectors involves (parallel) reduction as well. The `fetch` part is responsible for comparing the corresponding vector elements, yielding a boolean `true` or `false` for each pair of them. The `reduction` part then performs the logical AND operation on all of them. We must not forget to change the identity element to `true`. The code may look as follows:
```cpp
#include <iostream>
#include <cstdlib>
#include <TNL/Containers/Vector.h>
#include <TNL/Algorithms/reduce.h>

using namespace TNL;
using namespace TNL::Containers;
using namespace TNL::Algorithms;

template< typename Device >
bool comparison( const Vector< double, Device >& u, const Vector< double, Device >& v )
{
   auto u_view = u.getConstView();
   auto v_view = v.getConstView();
   // Fetch compares the corresponding elements of both vectors.
   auto fetch = [=] __cuda_callable__ ( int i ) -> bool { return u_view[ i ] == v_view[ i ]; };
   // Reduction is the logical AND; its identity element is 'true'.
   auto reduction = [] __cuda_callable__ ( const bool& a, const bool& b ) { return a && b; };
   return reduce< Device >( 0, v_view.getSize(), fetch, reduction, true );
}

int main( int argc, char* argv[] )
{
   Vector< double, Devices::Host > host_u( 10 ), host_v( 10 );
   host_u = 1.0;
   host_v.forAllElements( [] __cuda_callable__ ( int i, double& value ) { value = 2 * ( i % 2 ) - 1; } );
   std::cout << "host_u = " << host_u << std::endl;
   std::cout << "host_v = " << host_v << std::endl;
   std::cout << "Comparison of host_u and host_v is: " << ( comparison( host_u, host_v ) ? "'true'" : "'false'" ) << "." << std::endl;
   std::cout << "Comparison of host_u and host_u is: " << ( comparison( host_u, host_u ) ? "'true'" : "'false'" ) << "." << std::endl;
#ifdef __CUDACC__
   Vector< double, Devices::Cuda > cuda_u( 10 ), cuda_v( 10 );
   cuda_u = 1.0;
   cuda_v.forAllElements( [] __cuda_callable__ ( int i, double& value ) { value = 2 * ( i % 2 ) - 1; } );
   std::cout << "cuda_u = " << cuda_u << std::endl;
   std::cout << "cuda_v = " << cuda_v << std::endl;
   std::cout << "Comparison of cuda_u and cuda_v is: " << ( comparison( cuda_u, cuda_v ) ? "'true'" : "'false'" ) << "." << std::endl;
   std::cout << "Comparison of cuda_u and cuda_u is: " << ( comparison( cuda_u, cuda_u ) ? "'true'" : "'false'" ) << "." << std::endl;
#endif
   return EXIT_SUCCESS;
}
```
And the output looks as:
host_u = [ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ]
host_v = [ -1, 1, -1, 1, -1, 1, -1, 1, -1, 1 ]
Comparison of host_u and host_v is: 'false'.
Comparison of host_u and host_u is: 'true'.
cuda_u = [ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ]
cuda_v = [ -1, 1, -1, 1, -1, 1, -1, 1, -1, 1 ]
Comparison of cuda_u and cuda_v is: 'false'.
Comparison of cuda_u and cuda_u is: 'true'.
Update and residue
In iterative solvers, we often need to update a vector and, at the same time, compute the norm of the update. For example, the Euler method is defined as
\[
\bf u^{k+1} = \bf u^k + \tau \Delta \bf u.
\]
Together with the vector addition, we may also want to compute the \(L_2\)-norm of \(\Delta \bf u\), which may indicate convergence. Computing first the addition and then the norm would be inefficient, because we would have to fetch the vector \(\Delta \bf u\) from memory twice. The following example shows how to do the addition and the norm computation at the same time.
```cpp
#include <iostream>
#include <cstdlib>
#include <TNL/Containers/Vector.h>
#include <TNL/Algorithms/reduce.h>

using namespace TNL;
using namespace TNL::Containers;
using namespace TNL::Algorithms;

template< typename Device >
double updateAndResidue( Vector< double, Device >& u, const Vector< double, Device >& delta_u, const double& tau )
{
   // A modifiable view is needed here, so we use getView() instead of getConstView().
   auto u_view = u.getView();
   auto delta_u_view = delta_u.getConstView();
   // Fetch updates u and returns the squared increment at the same time.
   auto fetch = [=] __cuda_callable__ ( int i ) mutable -> double {
      const double& add = delta_u_view[ i ];
      u_view[ i ] += tau * add;
      return add * add;
   };
   auto reduction = [] __cuda_callable__ ( const double& a, const double& b ) { return a + b; };
   return sqrt( reduce< Device >( 0, u_view.getSize(), fetch, reduction, 0.0 ) );
}

int main( int argc, char* argv[] )
{
   const double tau = 0.1;
   Vector< double, Devices::Host > host_u( 10 ), host_delta_u( 10 );
   host_u = 0.0;
   host_delta_u = 1.0;
   std::cout << "host_u = " << host_u << std::endl;
   std::cout << "host_delta_u = " << host_delta_u << std::endl;
   double residue = updateAndResidue( host_u, host_delta_u, tau );
   std::cout << "New host_u is: " << host_u << "." << std::endl;
   std::cout << "Residue is: " << residue << std::endl;
#ifdef __CUDACC__
   Vector< double, Devices::Cuda > cuda_u( 10 ), cuda_delta_u( 10 );
   cuda_u = 0.0;
   cuda_delta_u = 1.0;
   residue = updateAndResidue( cuda_u, cuda_delta_u, tau );
   std::cout << "New cuda_u is: " << cuda_u << "." << std::endl;
   std::cout << "Residue is: " << residue << std::endl;
#endif
   return EXIT_SUCCESS;
}
```
The result reads as:
host_u = [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ]
host_delta_u = [ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ]
New host_u is: [ 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1 ].
Residue is: 3.16228
cuda_u = [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ]
cuda_delta_u = [ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ]
New cuda_u is: [ 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1 ].
Residue is: 3.16228
Simple MapReduce
We can also filter the data before reducing it. This operation is called MapReduce: you simply add the necessary `if` statement to the `fetch` function. In the following example we use the statement

`return u_view[ i ] > 0.0 ? u_view[ i ] : 0.0;`

to sum up only the positive numbers in the vector.
```cpp
#include <iostream>
#include <cstdlib>
#include <TNL/Containers/Vector.h>
#include <TNL/Algorithms/reduce.h>

using namespace TNL;
using namespace TNL::Containers;
using namespace TNL::Algorithms;

template< typename Device >
double mapReduce( const Vector< double, Device >& u )
{
   auto u_view = u.getConstView();
   // Fetch filters out the non-positive elements.
   auto fetch = [=] __cuda_callable__ ( int i ) { return u_view[ i ] > 0 ? u_view[ i ] : 0.0; };
   auto reduction = [] __cuda_callable__ ( const double& a, const double& b ) { return a + b; };
   return reduce< Device >( 0, u_view.getSize(), fetch, reduction, 0.0 );
}

int main( int argc, char* argv[] )
{
   Vector< double, Devices::Host > host_u( 10 );
   host_u.forAllElements( [] __cuda_callable__ ( int i, double& value ) { value = sin( (double) i ); } );
   double result = mapReduce( host_u );
   std::cout << "host_u = " << host_u << std::endl;
   std::cout << "Sum of the positive numbers is: " << result << std::endl;
#ifdef __CUDACC__
   Vector< double, Devices::Cuda > cuda_u( 10 );
   cuda_u.forAllElements( [] __cuda_callable__ ( int i, double& value ) { value = sin( (double) i ); } );
   result = mapReduce( cuda_u );
   std::cout << "cuda_u = " << cuda_u << std::endl;
   std::cout << "Sum of the positive numbers is: " << result << std::endl;
#endif
   return EXIT_SUCCESS;
}
```
The result is:
host_u = [ 0, 0.841471, 0.909297, 0.14112, -0.756802, -0.958924, -0.279415, 0.656987, 0.989358, 0.412118 ]
Sum of the positive numbers is: 3.95035
cuda_u = [ 0, 0.841471, 0.909297, 0.14112, -0.756802, -0.958924, -0.279415, 0.656987, 0.989358, 0.412118 ]
Sum of the positive numbers is: 3.95035
Take a look at the following example, where the filtering depends on the element indices rather than on the values:
```cpp
#include <iostream>
#include <cstdlib>
#include <TNL/Containers/Vector.h>
#include <TNL/Algorithms/reduce.h>
#include <TNL/Timer.h>

using namespace TNL;
using namespace TNL::Containers;
using namespace TNL::Algorithms;

template< typename Device >
double mapReduce( const Vector< double, Device >& u )
{
   auto u_view = u.getConstView();
   // Fetch returns the element value only on even indices.
   auto fetch = [=] __cuda_callable__ ( int i ) -> double {
      if( i % 2 == 0 )
         return u_view[ i ];
      return 0.0;
   };
   auto reduction = [] __cuda_callable__ ( const double& a, const double& b ) { return a + b; };
   return reduce< Device >( 0, u_view.getSize(), fetch, reduction, 0.0 );
}

int main( int argc, char* argv[] )
{
   Timer timer;
   Vector< double, Devices::Host > host_u( 100000 );
   host_u = 1.0;
   timer.start();
   double result = mapReduce( host_u );
   timer.stop();
   std::cout << "Host result is: " << result << ". It took " << timer.getRealTime() << " seconds." << std::endl;
#ifdef __CUDACC__
   Vector< double, Devices::Cuda > cuda_u( 100000 );
   cuda_u = 1.0;
   timer.reset();
   timer.start();
   result = mapReduce( cuda_u );
   timer.stop();
   std::cout << "CUDA result is: " << result << ". It took " << timer.getRealTime() << " seconds." << std::endl;
#endif
   return EXIT_SUCCESS;
}
```
The result is:
Host result is: 50000. It took 0.00367469 seconds.
CUDA result is: 50000. It took 0.000345911 seconds.
This is not very efficient: for half of the elements we return zero, which has no effect on the reduction. A better solution is to run the reduction only for half of the elements and change the fetch function to read only the elements on even indices, i.e. `return u_view[ 2 * i ];`. See the following example and compare the execution times.
```cpp
#include <TNL/Containers/Vector.h>
#include <TNL/Algorithms/reduce.h>

template< typename Device >
double mapReduce( const Vector< double, Device >& u )
{
   auto u_view = u.getConstView();
   // Fetch reads only the elements on even indices...
   auto fetch = [=] __cuda_callable__ ( int i ) { return u_view[ 2 * i ]; };
   auto reduction = [] __cuda_callable__ ( const double& a, const double& b ) { return a + b; };
   // ...and the reduction runs over only half of the index range.
   return reduce< Device >( 0, u_view.getSize() / 2, fetch, reduction, 0.0 );
}

// main() is the same as in the previous example.
```
Host result is: 50000. It took 0.00239881 seconds.
CUDA result is: 50000. It took 0.000354449 seconds.
Reduction with argument
In some situations, we may need to locate a given element in the vector, for example the index of the smallest or the largest element. `reduceWithArgument` is the function for this task. In the following example, we modify the function computing the maximum norm of a vector: instead of just computing the value, we now also want the index of the element whose absolute value equals the maximum norm. The lambda function `reduction` no longer computes only the maximum of two given values, it must also track the index of the winner. See the following code:
```cpp
#include <iostream>
#include <cstdlib>
#include <limits>
#include <TNL/Containers/Vector.h>
#include <TNL/Algorithms/reduce.h>

using namespace TNL;
using namespace TNL::Containers;
using namespace TNL::Algorithms;

template< typename Device >
std::pair< double, int > maximumNorm( const Vector< double, Device >& v )
{
   auto view = v.getConstView();
   auto fetch = [=] __cuda_callable__ ( int i ) { return TNL::abs( view[ i ] ); };
   auto reduction = [] __cuda_callable__ ( double& a, const double& b, int& aIdx, const int& bIdx ) {
      if( a < b ) {
         a = b;
         aIdx = bIdx;
      }
      else if( a == b && bIdx < aIdx )
         aIdx = bIdx;
   };
   return reduceWithArgument< Device >( 0, view.getSize(), fetch, reduction, std::numeric_limits< double >::lowest() );
}

int main( int argc, char* argv[] )
{
   Vector< double, Devices::Host > host_v( 10 );
   host_v.forAllElements( [] __cuda_callable__ ( int i, double& value ) { value = i - 7; } );
   std::cout << "host_v = " << host_v << std::endl;
   auto maxNormHost = maximumNorm( host_v );
   std::cout << "The maximum norm of the host vector elements is " << maxNormHost.first
             << " at position " << maxNormHost.second << "." << std::endl;
#ifdef __CUDACC__
   Vector< double, Devices::Cuda > cuda_v( 10 );
   cuda_v.forAllElements( [] __cuda_callable__ ( int i, double& value ) { value = i - 7; } );
   std::cout << "cuda_v = " << cuda_v << std::endl;
   auto maxNormCuda = maximumNorm( cuda_v );
   std::cout << "The maximum norm of the device vector elements is " << maxNormCuda.first
             << " at position " << maxNormCuda.second << "." << std::endl;
#endif
   return EXIT_SUCCESS;
}
```
The definition of the lambda function `reduction` reads as:

`auto reduction = [] __cuda_callable__ ( double& a, const double& b, int& aIdx, const int& bIdx );`

In addition to the values of the vector elements `a` and `b`, it also gets their positions `aIdx` and `bIdx`. The function is responsible for setting `a` to the maximum of the two values and `aIdx` to the position of the larger element. Note that the parameters have this meaning only when computing a minimum or maximum.
The result looks as:
host_v = [ -7, -6, -5, -4, -3, -2, -1, 0, 1, 2 ]
The maximum norm of the host vector elements is 7 at position 0.
cuda_v = [ -7, -6, -5, -4, -3, -2, -1, 0, 1, 2 ]
The maximum norm of the device vector elements is 7 at position 0.
Using functionals for reduction
You might have noticed that the lambda function `reduction` does not take as many different forms as `fetch`. In addition, setting the identity element can be annoying, especially when computing a minimum or maximum, where we need to consult `std::numeric_limits` to make the code work with any type. To make things simpler, TNL offers variants of several functionals known from the STL. They can be used in place of the lambda function `reduction` and they also carry the identity element. See the following example showing the scalar product of two vectors, now with a functional:
```cpp
#include <iostream>
#include <cstdlib>
#include <TNL/Containers/Vector.h>
#include <TNL/Algorithms/reduce.h>
#include <TNL/Functional.h>

using namespace TNL;
using namespace TNL::Containers;
using namespace TNL::Algorithms;

template< typename Device >
double scalarProduct( const Vector< double, Device >& u, const Vector< double, Device >& v )
{
   auto u_view = u.getConstView();
   auto v_view = v.getConstView();
   auto fetch = [=] __cuda_callable__ ( int i ) { return u_view[ i ] * v_view[ i ]; };
   // TNL::Plus replaces the reduction lambda and carries the identity element.
   return reduce< Device >( 0, v_view.getSize(), fetch, TNL::Plus{} );
}

int main( int argc, char* argv[] )
{
   Vector< double, Devices::Host > host_u( 10 ), host_v( 10 );
   host_u = 1.0;
   host_v.forAllElements( [] __cuda_callable__ ( int i, double& value ) { value = 2 * ( i % 2 ) - 1; } );
   std::cout << "The scalar product ( host_u, host_v ) is " << scalarProduct( host_u, host_v ) << "." << std::endl;
   std::cout << "The scalar product ( host_v, host_v ) is " << scalarProduct( host_v, host_v ) << "." << std::endl;
#ifdef __CUDACC__
   Vector< double, Devices::Cuda > cuda_u( 10 ), cuda_v( 10 );
   cuda_u = 1.0;
   cuda_v.forAllElements( [] __cuda_callable__ ( int i, double& value ) { value = 2 * ( i % 2 ) - 1; } );
   std::cout << "The scalar product ( cuda_u, cuda_v ) is " << scalarProduct( cuda_u, cuda_v ) << "." << std::endl;
   std::cout << "The scalar product ( cuda_v, cuda_v ) is " << scalarProduct( cuda_v, cuda_v ) << "." << std::endl;
#endif
   return EXIT_SUCCESS;
}
```
This example also shows a more compact way to invoke the function `reduce`: when a functional is used, the identity element does not have to be passed explicitly. This way, one should be able to perform (parallel) reduction very easily. The result looks as follows:
host_u = [ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ]
host_v = [ -1, 1, -1, 1, -1, 1, -1, 1, -1, 1 ]
The scalar product ( host_u, host_v ) is 0.
The scalar product ( host_v, host_v ) is 10.
cuda_u = [ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ]
cuda_v = [ -1, 1, -1, 1, -1, 1, -1, 1, -1, 1 ]
The scalar product ( cuda_u, cuda_v ) is 0.
The scalar product ( cuda_v, cuda_v ) is 10.
In `TNL/Functional.h` you can find all the operations that can reasonably be used for reduction, for example `TNL::Plus`, `TNL::Multiplies`, `TNL::Min`, `TNL::Max`, `TNL::LogicalAnd` and `TNL::LogicalOr`.
Flexible scan
Inclusive and exclusive scan
The inclusive scan (or prefix sum) operation turns a sequence \(a_1, \ldots, a_n\) into a sequence \(s_1, \ldots, s_n\) defined as
\[
s_i = \sum_{j=1}^i a_j.
\]
Exclusive scan (or prefix sum) is defined as
\[
\sigma_i = \sum_{j=1}^{i-1} a_j.
\]
For example, the inclusive prefix sum of the sequence
[ 1, 2, 3, 4 ]
is
[ 1, 3, 6, 10 ]
and the exclusive prefix sum of the same sequence is
[ 0, 1, 3, 6 ].
Both kinds of scan have many different applications. Scan is usually applied with summation, but product or logical operations can be handy as well. In TNL, scan is implemented in a similar way as reduction, so it can easily be modified by lambda functions or functionals. The following call shows how it works:
inplaceInclusiveScan( array, 0, array.getSize(), TNL::Plus{} );
This is equivalent to the following shortened call (the second, third and fourth parameters have a default value):
inplaceInclusiveScan( array );
The complete example looks as follows:
```cpp
#include <iostream>
#include <cstdlib>
#include <TNL/Containers/Array.h>
#include <TNL/Algorithms/scan.h>

using namespace TNL;
using namespace TNL::Containers;
using namespace TNL::Algorithms;

int main( int argc, char* argv[] )
{
   Array< double, Devices::Host > host_a( 10 );
   host_a.setValue( 1.0 );
   std::cout << "host_a = " << host_a << std::endl;
   inplaceInclusiveScan( host_a );
   std::cout << "The prefix sum of the host array is " << host_a << "." << std::endl;
   // The CUDA variant is identical, only with Devices::Cuda and a cuda_a array.
   return EXIT_SUCCESS;
}
```
Scan does not use a `fetch` function, because the scan must be performed in an array. Its complexity is also higher compared to reduction, so if one needs to do some operation with the array elements before the scan, this can be done explicitly and it will not affect the performance significantly. The scan function takes the interval of array elements where the scan is performed as its second and third arguments. The next argument is the operation to be performed by the scan; when the operation is a functional such as `TNL::Plus`, it also supplies the identity ("zero") element of the operation.
The result looks as:
host_a = [ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ]
The prefix sum of the host array is [ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 ].
cuda_a = [ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ]
The prefix sum of the CUDA array is [ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 ].
Exclusive scan works similarly. The complete example looks as follows:
```cpp
#include <iostream>
#include <cstdlib>
#include <TNL/Containers/Array.h>
#include <TNL/Algorithms/scan.h>

using namespace TNL;
using namespace TNL::Containers;
using namespace TNL::Algorithms;

int main( int argc, char* argv[] )
{
   Array< double, Devices::Host > host_a( 10 );
   host_a.setValue( 1.0 );
   std::cout << "host_a = " << host_a << std::endl;
   inplaceExclusiveScan( host_a );
   std::cout << "The prefix sum of the host array is " << host_a << "." << std::endl;
   // The CUDA variant is identical, only with Devices::Cuda and a cuda_a array.
   return EXIT_SUCCESS;
}
```
And the result looks as:
host_a = [ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ]
The prefix sum of the host array is [ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 ].
cuda_a = [ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ]
The prefix sum of the CUDA array is [ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 ].
Segmented scan
Segmented scan is a modification of the common scan. In this case, the sequence of numbers at hand is divided into segments, for example
[1,3,5][2,4,6,9][3,5][3,6,9,12,15]
and we want to compute an inclusive or exclusive scan of each segment separately. For the inclusive segmented prefix sum we get
[1,4,9][2,6,12,21][3,8][3,9,18,30,45]
and for the exclusive segmented prefix sum it is
[0,1,4][0,2,6,12][0,3][0,3,9,18,30]
In addition to the common scan, we need to encode the segments of the input sequence. This is done by an auxiliary array of flags (it can be an array of booleans) having 1 at the beginning of each segment and 0 at all other positions. In our example, it looks as follows:
[1,0,0,1,0,0,0,1,0,1,0,0, 0, 0]
[1,3,5,2,4,6,9,3,5,3,6,9,12,15]
Note: Segmented scan is not implemented for CUDA yet.
```cpp
#include <iostream>
#include <cstdlib>
#include <TNL/Containers/Array.h>
#include <TNL/Algorithms/SegmentedScan.h>

using namespace TNL;
using namespace TNL::Containers;
using namespace TNL::Algorithms;

template< typename Device >
void segmentedScan( Array< double, Device >& v, Array< bool, Device >& flags )
{
   auto reduce = [] __cuda_callable__ ( const double& a, const double& b ) { return a + b; };
   SegmentedScan< Device >::perform( v, flags, 0, v.getSize(), reduce, 0.0 );
}

int main( int argc, char* argv[] )
{
   Array< bool, Devices::Host > host_flags{ 1,0,0,1,0,0,0,1,0,1,0,0, 0, 0 };
   Array< double, Devices::Host > host_v { 1,3,5,2,4,6,9,3,5,3,6,9,12,15 };
   std::cout << "host_flags = " << host_flags << std::endl;
   std::cout << "host_v = " << host_v << std::endl;
   segmentedScan( host_v, host_flags );
   std::cout << "The segmented prefix sum of the host array is " << host_v << "." << std::endl;
   return EXIT_SUCCESS;
}
```
The result reads as:
host_flags = [ 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0 ]
host_v = [ 1, 3, 5, 2, 4, 6, 9, 3, 5, 3, 6, 9, 12, 15 ]
The segmented prefix sum of the host array is [ 1, 4, 9, 2, 6, 12, 21, 3, 8, 3, 9, 18, 30, 45 ].