Segments

Introduction

Segments represent a data structure for manipulating several local arrays (also denoted as segments) which, in general, have different sizes. All the local arrays are supposed to be allocated within one contiguous global array. The segments data structure offers a mapping between the indexes of the particular local arrays and the indexes of the global array. Segments do not store any data; they merely provide a layer for efficient access to, and operations on, a group of linear containers (i.e. the local arrays) of generally different sizes. One can perform parallel operations like for or flexible reduction on the particular segments (local arrays).

A typical example of segments are the various formats for sparse matrices. A sparse matrix like the following

\[ \left( \begin{array}{ccccc} 1 & 0 & 2 & 0 & 0 \\ 0 & 0 & 5 & 0 & 0 \\ 3 & 4 & 7 & 9 & 0 \\ 0 & 0 & 0 & 0 & 12 \\ 0 & 0 & 15 & 17 & 20 \end{array} \right) \]

is usually compressed first, which means that the zero elements are omitted, yielding the following "matrix":

\[ \begin{array}{ccccc} 1 & 2 \\ 5 \\ 3 & 4 & 7 & 9 \\ 12 \\ 15 & 17 & 20 \end{array} \]

We also have to store the column index of each matrix element, in a "matrix" like this:

\[ \begin{array}{ccccc} 0 & 2 \\ 2 \\ 0 & 1 & 2 & 3 \\ 4 \\ 2 & 3 & 4 \end{array} \]

Such "matrices" can be stored in memory in a row-wise manner in one contiguous array because of the performance reasons. The first "matrix" (i.e. values of the matrix elements) would be stored as follows

\[ \begin{array}{|cc|c|cccc|c|cc|} 1 & 2 & 5 & 3 & 4 & 7 & 9 & 12 & 15 & 17 & 20 \end{array} \]

and the second one (i.e. column indexes of the matrix values) as follows

\[ \begin{array}{|cc|c|cccc|c|cc|} 0 & 2 & 2 & 0 & 1 & 2 & 3 & 4 & 2 & 3 & 4 \end{array} \]
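
These two "matrices", together with a third array recording where each row begins, fully describe the compressed matrix. The following minimal sketch (plain C++, independent of TNL) shows the three arrays for the example matrix and the index arithmetic behind them:

#include <array>
#include <iostream>

int main()
{
   // Compressed representation of the 5x5 example matrix above (a sketch, not TNL code).
   std::array< double, 11 > values = { 1, 2, 5, 3, 4, 7, 9, 12, 15, 17, 20 };
   std::array< int, 11 > columnIndexes = { 0, 2, 2, 0, 1, 2, 3, 4, 2, 3, 4 };
   // rowOffsets[ i ] is where row i begins; rowOffsets[ 5 ] is the total number of elements.
   std::array< int, 6 > rowOffsets = { 0, 2, 3, 7, 8, 11 };

   // The element with local index localIdx in row rowIdx sits at
   // globalIdx = rowOffsets[ rowIdx ] + localIdx.
   const int rowIdx = 2, localIdx = 3;
   const int globalIdx = rowOffsets[ rowIdx ] + localIdx;
   std::cout << values[ globalIdx ] << " at column " << columnIndexes[ globalIdx ] << std::endl;  // prints: 9 at column 3
}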

What we see above is the so-called CSR sparse matrix format. It is the most popular format for the storage of sparse matrices and it is designed for high performance. However, it may not be the most efficient format for storing sparse matrices on GPUs, and so many other formats have been developed to achieve better performance. These formats often use a different layout of the matrix elements in memory. They especially have to deal with two difficulties:

  1. Efficient storage of matrix elements in the memory to fulfill the requirements of coalesced memory accesses on GPUs or good spatial locality for efficient use of caches on CPUs.
  2. Efficient mapping of GPU threads to different matrix rows.

TNL offers the following sparse matrix formats in the form of segments (Ellpack-type formats often use so-called padding elements, e.g. padding zeros in terms of sparse matrices):

  1. CSR format (TNL::Algorithms::Segments::CSR) is the most popular format for sparse matrices. It is simple and very efficient, especially on CPUs, and today there are efficient kernels even for GPUs. The following GPU kernels are implemented in TNL:
    1. Scalar, which maps one GPU thread to each segment (matrix row).
    2. Vector, which maps one warp of GPU threads to each segment (matrix row).
    3. Adaptive ...
  2. Ellpack format (TNL::Algorithms::Segments::Ellpack) uses padding elements so that each segment has the same number of elements (see the sketch after this list). It can be highly inefficient when one works with a few very long segments.
  3. SlicedEllpack format (TNL::Algorithms::Segments::SlicedEllpack), which was also presented as the Row-grouped CSR format, is similar to the common Ellpack. However, SlicedEllpack first merges the segments into groups of 32. It also uses padding elements, but only segments within the same group are aligned to have the same size. Therefore, a few long segments do not cause such a severe performance drop.
  4. ChunkedEllpack format (TNL::Algorithms::Segments::ChunkedEllpack) is similar to SlicedEllpack, but it splits segments into chunks, which allows mapping more GPU threads to one segment.
  5. BiEllpack format (TNL::Algorithms::Segments::BiEllpack) is similar to ChunkedEllpack. In addition, it sorts segments within the same slice with respect to their length to achieve higher performance and better memory accesses.
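
To illustrate the padding elements used by the Ellpack-type formats above, the following sketch (plain C++, not TNL code) lays out the example matrix from the introduction with every row padded to the length of the longest row. The actual in-memory order used by TNL depends on the format and the device, e.g. a column-major layout on GPUs for coalesced accesses:

#include <array>
#include <iostream>

int main()
{
   // Ellpack-style storage: every row of the example matrix is padded to the
   // longest row (4 elements here); pad marks the padding elements.
   constexpr int rows = 5;
   constexpr int rowLength = 4;
   constexpr double pad = 0.0;
   std::array< double, rows * rowLength > values = {
      1,  2,  pad, pad,   // row 0: 2 real elements + 2 padding
      5,  pad, pad, pad,  // row 1: 1 + 3
      3,  4,  7,  9,      // row 2: 4 + 0
      12, pad, pad, pad,  // row 3: 1 + 3
      15, 17, 20, pad     // row 4: 3 + 1
   };
   // Row-major mapping: globalIdx = rowIdx * rowLength + localIdx.
   std::cout << values[ 2 * rowLength + 3 ] << std::endl;  // prints: 9
}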

Especially in the case of GPUs, the performance of each format strongly depends on the distribution of the segment sizes. Therefore, we cannot say that one of the previous formats outperforms the others in general. To get the best performance, one should try several of the formats and choose the best one. This is the reason why TNL offers several of them, and additional formats may be added in the future.

The need to work with this kind of data structure is not limited to sparse matrices. We can name at least a few other applications for segments:

  1. Graphs - one segment represents one graph node, the elements in one segment are the indexes of its neighbors.
  2. Unstructured numerical meshes - an unstructured numerical mesh is, in fact, a graph.
  3. Particle-in-cell method - one segment represents one cell, the elements in one segment are the indexes of the particles contained in the cell.
  4. K-means clustering - one segment represents one cluster, the elements represent the vectors belonging to the given cluster.
  5. Hashing - the segments are particular rows of the hash table, the elements in a segment correspond to colliding hashed elements.

In general, segments can be used for problems that somehow correspond to a 2D data structure where each row can have a different size and we need to perform miscellaneous operations within the rows. The name segments comes from segmented parallel reduction or segmented scan (prefix-sum).
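
For illustration, here is a serial sketch (plain C++, not TNL code) of a segmented inclusive scan over one flat array partitioned by an offsets array, using segment sizes 1, 2, 3, 4, 5:

#include <array>
#include <iostream>

int main()
{
   // Segmented inclusive scan: a prefix-sum that restarts at every segment boundary.
   // offsets[ s ] is the global index where segment s begins.
   std::array< int, 6 > offsets = { 0, 1, 3, 6, 10, 15 };
   std::array< double, 15 > data;
   data.fill( 1.0 );

   for( int s = 0; s < 5; s++ )
      for( int i = offsets[ s ] + 1; i < offsets[ s + 1 ]; i++ )
         data[ i ] += data[ i - 1 ];

   for( double x : data )
      std::cout << x << " ";  // prints: 1 1 2 1 2 3 1 2 3 4 1 2 3 4 5
   std::cout << std::endl;
}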

Segments setup

Segments are defined just by the sizes of the particular segments. The following example shows how to create them:

#include <iostream>
#include <TNL/Containers/Vector.h>
#include <TNL/Algorithms/Segments/CSR.h>
#include <TNL/Algorithms/Segments/Ellpack.h>
#include <TNL/Algorithms/Segments/ChunkedEllpack.h>
#include <TNL/Algorithms/Segments/BiEllpack.h>
#include <TNL/Devices/Host.h>
#include <TNL/Devices/Cuda.h>

template< typename Segments >
void
SegmentsExample()
{
   /***
    * Create segments with given segments sizes and print their setup.
    */
   Segments segments{ 1, 2, 3, 4, 5 };
   std::cout << "Segments sizes are: " << segments << std::endl << std::endl;
}

int
main( int argc, char* argv[] )
{
   std::cout << "Example of CSR segments on host: " << std::endl;
   SegmentsExample< TNL::Algorithms::Segments::CSR< TNL::Devices::Host, int > >();

   std::cout << "Example of Ellpack segments on host: " << std::endl;
   SegmentsExample< TNL::Algorithms::Segments::Ellpack< TNL::Devices::Host, int > >();

   std::cout << "Example of ChunkedEllpack segments on host: " << std::endl;
   SegmentsExample< TNL::Algorithms::Segments::ChunkedEllpack< TNL::Devices::Host, int > >();

   std::cout << "Example of BiEllpack segments on host: " << std::endl;
   SegmentsExample< TNL::Algorithms::Segments::BiEllpack< TNL::Devices::Host, int > >();

#ifdef __CUDACC__
   std::cout << "Example of CSR segments on CUDA GPU: " << std::endl;
   SegmentsExample< TNL::Algorithms::Segments::CSR< TNL::Devices::Cuda, int > >();

   std::cout << "Example of Ellpack segments on CUDA GPU: " << std::endl;
   SegmentsExample< TNL::Algorithms::Segments::Ellpack< TNL::Devices::Cuda, int > >();

   std::cout << "Example of ChunkedEllpack segments on CUDA GPU: " << std::endl;
   SegmentsExample< TNL::Algorithms::Segments::ChunkedEllpack< TNL::Devices::Cuda, int > >();

   std::cout << "Example of BiEllpack segments on CUDA GPU: " << std::endl;
   SegmentsExample< TNL::Algorithms::Segments::BiEllpack< TNL::Devices::Cuda, int > >();
#endif
   return EXIT_SUCCESS;
}

In the SegmentsExample function we use the constructor with an initializer list where each element of the list defines the size of one segment. Next, we print the sizes of the particular segments. We call this function for different segment types (except TNL::Algorithms::Segments::SlicedEllpack, since it would behave the same as TNL::Algorithms::Segments::Ellpack on this small example).

The result looks as follows:

Example of CSR segments on host:
Segments sizes are: [ 1, 2, 3, 4, 5, ]
Example of Ellpack segments on host:
Segments sizes are: [ 5, 5, 5, 5, 5, ]
Example of ChunkedEllpack segments on host:
Segments sizes are: [ 17, 34, 51, 67, 87, ]
Example of BiEllpack segments on host:
Segments sizes are: [ 2, 4, 4, 5, 5, ]
Example of CSR segments on CUDA GPU:
Segments sizes are: [ 1, 2, 3, 4, 5, ]
Example of Ellpack segments on CUDA GPU:
Segments sizes are: [ 5, 5, 5, 5, 5, ]
Example of ChunkedEllpack segments on CUDA GPU:
Segments sizes are: [ 17, 34, 51, 67, 87, ]
Example of BiEllpack segments on CUDA GPU:
Segments sizes are: [ 2, 4, 4, 5, 5, ]

We can see that the real sizes of the segments differ for all Ellpack-based formats. As already mentioned, these formats often use padding elements to get more efficient access to the memory. For example, the TNL::Algorithms::Segments::ChunkedEllpack format involves many extra elements. This is, however, only because of the very small example presented here; on large data the overhead is not so significant.

We remind the reader that segments represent a sparse format rather than a data structure, because they do not store any data. The following example shows how to connect segments with an array:

#include <iostream>
#include <TNL/Containers/Vector.h>
#include <TNL/Algorithms/Segments/CSR.h>
#include <TNL/Algorithms/Segments/Ellpack.h>
#include <TNL/Algorithms/Segments/ChunkedEllpack.h>
#include <TNL/Algorithms/Segments/BiEllpack.h>
#include <TNL/Devices/Host.h>
#include <TNL/Devices/Cuda.h>

template< typename Segments >
void
SegmentsExample()
{
   using Device = typename Segments::DeviceType;

   /***
    * Create segments with given segments sizes.
    */
   TNL::Containers::Vector< int, Device > sizes{ 1, 2, 3, 4, 5 };
   Segments segments( sizes );
   std::cout << "Segments sizes are: " << segments << std::endl;

   /***
    * Allocate array for the segments.
    */
   TNL::Containers::Array< double, Device > data( segments.getStorageSize(), 0.0 );
   data.forAllElements(
      [ = ] __cuda_callable__( int idx, double& value )
      {
         value = idx;
      } );

   /***
    * Print the data managed by the segments.
    */
   auto data_view = data.getView();
   auto fetch = [ = ] __cuda_callable__( int globalIdx ) -> double
   {
      return data_view[ globalIdx ];
   };
   printSegments( std::cout, segments, fetch ) << std::endl;
}

int
main( int argc, char* argv[] )
{
   std::cout << "Example of CSR segments on host: " << std::endl;
   SegmentsExample< TNL::Algorithms::Segments::CSR< TNL::Devices::Host, int > >();

   std::cout << "Example of Ellpack segments on host: " << std::endl;
   SegmentsExample< TNL::Algorithms::Segments::Ellpack< TNL::Devices::Host, int > >();

   std::cout << "Example of ChunkedEllpack segments on host: " << std::endl;
   SegmentsExample< TNL::Algorithms::Segments::ChunkedEllpack< TNL::Devices::Host, int > >();

   std::cout << "Example of BiEllpack segments on host: " << std::endl;
   SegmentsExample< TNL::Algorithms::Segments::BiEllpack< TNL::Devices::Host, int > >();

#ifdef __CUDACC__
   std::cout << "Example of CSR segments on CUDA GPU: " << std::endl;
   SegmentsExample< TNL::Algorithms::Segments::CSR< TNL::Devices::Cuda, int > >();

   std::cout << "Example of Ellpack segments on CUDA GPU: " << std::endl;
   SegmentsExample< TNL::Algorithms::Segments::Ellpack< TNL::Devices::Cuda, int > >();

   std::cout << "Example of ChunkedEllpack segments on CUDA GPU: " << std::endl;
   SegmentsExample< TNL::Algorithms::Segments::ChunkedEllpack< TNL::Devices::Cuda, int > >();

   std::cout << "Example of BiEllpack segments on CUDA GPU: " << std::endl;
   SegmentsExample< TNL::Algorithms::Segments::BiEllpack< TNL::Devices::Cuda, int > >();
#endif
   return EXIT_SUCCESS;
}

In the SegmentsExample function we show how to create segments with a vector carrying the segment sizes. Of course, the same constructor works for arrays, array views and vector views as well. Next, we print the real segment sizes, which depend on the underlying format, the same way as we did in the previous example. Subsequently, we allocate an array of the size requested by the segments by means of the method getStorageSize (TNL::Algorithms::Segments::CSR::getStorageSize for example). This method returns the number of elements the segments need in order to address all elements by their global index. We use the forAllElements method to mark each element of the array with its rank in the array. Finally, we use the function TNL::Algorithms::Segments::printSegments, which accepts the lambda function fetch as one of its parameters. The lambda function reads data from our array data (with the help of the array view data_view) at the given global index globalIdx.

The result looks as follows:

Example of CSR segments on host:
Segments sizes are: [ 1, 2, 3, 4, 5, ]
Seg. 0: [ 0 ]
Seg. 1: [ 1, 2 ]
Seg. 2: [ 3, 4, 5 ]
Seg. 3: [ 6, 7, 8, 9 ]
Seg. 4: [ 10, 11, 12, 13, 14 ]
Example of Ellpack segments on host:
Segments sizes are: [ 5, 5, 5, 5, 5, ]
Seg. 0: [ 0, 1, 2, 3, 4 ]
Seg. 1: [ 5, 6, 7, 8, 9 ]
Seg. 2: [ 10, 11, 12, 13, 14 ]
Seg. 3: [ 15, 16, 17, 18, 19 ]
Seg. 4: [ 20, 21, 22, 23, 24 ]
Example of ChunkedEllpack segments on host:
Segments sizes are: [ 17, 34, 51, 67, 87, ]
Seg. 0: [ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16 ]
Seg. 1: [ 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50 ]
Seg. 2: [ 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101 ]
Seg. 3: [ 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168 ]
Seg. 4: [ 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239, 240, 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 251, 252, 253, 254, 255 ]
Example of BiEllpack segments on host:
Segments sizes are: [ 2, 4, 4, 5, 5, ]
Seg. 0: [ 8, 9 ]
Seg. 1: [ 6, 7, 22, 23 ]
Seg. 2: [ 4, 5, 20, 21 ]
Seg. 3: [ 2, 3, 18, 19, 25 ]
Seg. 4: [ 0, 1, 16, 17, 24 ]
Example of CSR segments on CUDA GPU:
Segments sizes are: [ 1, 2, 3, 4, 5, ]
Seg. 0: [ 0 ]
Seg. 1: [ 1, 2 ]
Seg. 2: [ 3, 4, 5 ]
Seg. 3: [ 6, 7, 8, 9 ]
Seg. 4: [ 10, 11, 12, 13, 14 ]
Example of Ellpack segments on CUDA GPU:
Segments sizes are: [ 5, 5, 5, 5, 5, ]
Seg. 0: [ 0, 32, 64, 96, 128 ]
Seg. 1: [ 1, 33, 65, 97, 129 ]
Seg. 2: [ 2, 34, 66, 98, 130 ]
Seg. 3: [ 3, 35, 67, 99, 131 ]
Seg. 4: [ 4, 36, 68, 100, 132 ]
Example of ChunkedEllpack segments on CUDA GPU:
Segments sizes are: [ 17, 34, 51, 67, 87, ]
Seg. 0: [ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16 ]
Seg. 1: [ 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50 ]
Seg. 2: [ 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101 ]
Seg. 3: [ 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168 ]
Seg. 4: [ 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239, 240, 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 251, 252, 253, 254, 255 ]
Example of BiEllpack segments on CUDA GPU:
Segments sizes are: [ 2, 4, 4, 5, 5, ]
Seg. 0: [ 4, 12 ]
Seg. 1: [ 3, 11, 19, 23 ]
Seg. 2: [ 2, 10, 18, 22 ]
Seg. 3: [ 1, 9, 17, 21, 25 ]
Seg. 4: [ 0, 8, 16, 20, 24 ]

Frankly, the particular values we see are not so important. The output only shows that different segment formats can use very different mappings of an element, identified by its segment index and local index (the rank of the element within the given segment), to the global index which serves as an address in the related container.

Iteration over elements of segments

In this section, we show how to iterate over the elements of segments and how to manipulate them. There are three possible ways:

  1. Method forElements (TNL::Algorithms::Segments::CSR::forElements for example), which iterates in parallel over all elements of the segments and performs a given lambda function on each of them.
  2. Method forSegments (TNL::Algorithms::Segments::CSR::forSegments for example), which iterates in parallel over all segments. It is a better choice when we need to process each segment sequentially, or when a significant amount of computation is common for all elements in each segment.
  3. Method sequentialForSegments (TNL::Algorithms::Segments::CSR::sequentialForSegments for example), which iterates over all segments sequentially, i.e. using only one thread even on GPUs. It is useful for debugging or printing, for example.

Methods iterating over particular segments use a segment view (TNL::Algorithms::Segments::SegmentView) to access the elements of the given segment. The segment view offers an iterator for better convenience.

Method forElements

The following example shows the use of the method forElements:

#include <iostream>
#include <TNL/Containers/Vector.h>
#include <TNL/Algorithms/Segments/CSR.h>
#include <TNL/Algorithms/Segments/Ellpack.h>
#include <TNL/Devices/Host.h>
#include <TNL/Devices/Cuda.h>

template< typename Segments >
void
SegmentsExample()
{
   using Device = typename Segments::DeviceType;

   /***
    * Create segments with given segments sizes.
    */
   Segments segments{ 1, 2, 3, 4, 5 };

   /***
    * Allocate array for the segments.
    */
   TNL::Containers::Array< double, Device > data( segments.getStorageSize(), 0.0 );

   /***
    * Insert data into particular segments with no check.
    */
   auto data_view = data.getView();
   segments.forAllElements(
      [ = ] __cuda_callable__( int segmentIdx, int localIdx, int globalIdx ) mutable
      {
         data_view[ globalIdx ] = segmentIdx;
      } );

   /***
    * Print the data managed by the segments.
    */
   std::cout << "Data setup with no check ... " << std::endl;
   std::cout << "Array: " << data << std::endl;
   auto fetch = [ = ] __cuda_callable__( int globalIdx ) -> double
   {
      return data_view[ globalIdx ];
   };
   printSegments( std::cout, segments, fetch );

   /***
    * Insert data into particular segments.
    */
   data = 0.0;
   segments.forAllElements(
      [ = ] __cuda_callable__( int segmentIdx, int localIdx, int globalIdx ) mutable
      {
         if( localIdx <= segmentIdx )
            data_view[ globalIdx ] = segmentIdx;
      } );

   /***
    * Print the data managed by the segments.
    */
   std::cout << "Data setup with check for padding elements..." << std::endl;
   std::cout << "Array: " << data << std::endl;
   printSegments( std::cout, segments, fetch );
}

int
main( int argc, char* argv[] )
{
   std::cout << "Example of CSR segments on host: " << std::endl;
   SegmentsExample< TNL::Algorithms::Segments::CSR< TNL::Devices::Host, int > >();

   std::cout << "Example of Ellpack segments on host: " << std::endl;
   SegmentsExample< TNL::Algorithms::Segments::Ellpack< TNL::Devices::Host, int > >();

#ifdef __CUDACC__
   std::cout << "Example of CSR segments on CUDA GPU: " << std::endl;
   SegmentsExample< TNL::Algorithms::Segments::CSR< TNL::Devices::Cuda, int > >();

   std::cout << "Example of Ellpack segments on CUDA GPU: " << std::endl;
   SegmentsExample< TNL::Algorithms::Segments::Ellpack< TNL::Devices::Cuda, int > >();
#endif
   return EXIT_SUCCESS;
}

We first create segments with linearly increasing sizes (so the layout resembles a lower triangular matrix). Next, we allocate the array data having the same size as the number of elements managed by the segments, which can be obtained by the method getStorageSize (TNL::Algorithms::Segments::CSR::getStorageSize for example). We prepare the array view data_view for use in the lambda functions. Finally, we call the method forAllElements, which iterates in parallel over all elements in the segments and calls the given lambda function for each of them. The lambda function receives three arguments: segmentIdx is the index of the segment the element belongs to, localIdx is the rank of the element within the segment, and globalIdx is the index of the element in the array data. We use the global index to set the proper element of the array data to the index of its segment.

At the end, we print the array data where we can see the elements belonging to particular segments by their indexes. The layout of the elements depends on the type of segments (i.e. the sparse format in use). Next, we print the elements of the array data by segments using the function printSegments, which iterates over all elements and reads the elements of the array data with the help of the fetch lambda function.

Note that for the Ellpack format, the output looks as follows:

Seg. 0: [ 0, 0, 0, 0, 0 ]
Seg. 1: [ 1, 1, 1, 1, 1 ]
Seg. 2: [ 2, 2, 2, 2, 2 ]
Seg. 3: [ 3, 3, 3, 3, 3 ]
Seg. 4: [ 4, 4, 4, 4, 4 ]

We see more elements than we have requested. The reason is that the Ellpack format uses padding elements to optimize access to the memory. Segments give access even to the padding elements; they could be used when we later need additional elements. Therefore, we need to distinguish between relevant and padding elements each time we work with the elements of segments. This is demonstrated in the previous example by the second use of forAllElements, where we set up the array data again but check for the padding elements. After printing the segments the same way as before, we get the correct result:

Seg. 0: [ 0, 0, 0, 0, 0 ]
Seg. 1: [ 1, 1, 0, 0, 0 ]
Seg. 2: [ 2, 2, 2, 0, 0 ]
Seg. 3: [ 3, 3, 3, 3, 0 ]
Seg. 4: [ 4, 4, 4, 4, 4 ]

The result of the whole example looks as follows:

Example of CSR segments on host:
Data setup with no check ...
Array: [ 0, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 4 ]
Seg. 0: [ 0 ]
Seg. 1: [ 1, 1 ]
Seg. 2: [ 2, 2, 2 ]
Seg. 3: [ 3, 3, 3, 3 ]
Seg. 4: [ 4, 4, 4, 4, 4 ]
Data setup with check for padding elements...
Array: [ 0, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 4 ]
Seg. 0: [ 0 ]
Seg. 1: [ 1, 1 ]
Seg. 2: [ 2, 2, 2 ]
Seg. 3: [ 3, 3, 3, 3 ]
Seg. 4: [ 4, 4, 4, 4, 4 ]
Example of Ellpack segments on host:
Data setup with no check ...
Array: [ 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4 ]
Seg. 0: [ 0, 0, 0, 0, 0 ]
Seg. 1: [ 1, 1, 1, 1, 1 ]
Seg. 2: [ 2, 2, 2, 2, 2 ]
Seg. 3: [ 3, 3, 3, 3, 3 ]
Seg. 4: [ 4, 4, 4, 4, 4 ]
Data setup with check for padding elements...
Array: [ 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 2, 2, 2, 0, 0, 3, 3, 3, 3, 0, 4, 4, 4, 4, 4 ]
Seg. 0: [ 0, 0, 0, 0, 0 ]
Seg. 1: [ 1, 1, 0, 0, 0 ]
Seg. 2: [ 2, 2, 2, 0, 0 ]
Seg. 3: [ 3, 3, 3, 3, 0 ]
Seg. 4: [ 4, 4, 4, 4, 4 ]
Example of CSR segments on CUDA GPU:
Data setup with no check ...
Array: [ 0, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 4 ]
Seg. 0: [ 0 ]
Seg. 1: [ 1, 1 ]
Seg. 2: [ 2, 2, 2 ]
Seg. 3: [ 3, 3, 3, 3 ]
Seg. 4: [ 4, 4, 4, 4, 4 ]
Data setup with check for padding elements...
Array: [ 0, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 4 ]
Seg. 0: [ 0 ]
Seg. 1: [ 1, 1 ]
Seg. 2: [ 2, 2, 2 ]
Seg. 3: [ 3, 3, 3, 3 ]
Seg. 4: [ 4, 4, 4, 4, 4 ]
Example of Ellpack segments on CUDA GPU:
Data setup with no check ...
Array: [ 0, 1, 2, 3, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 3, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 3, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 3, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 3, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ]
Seg. 0: [ 0, 0, 0, 0, 0 ]
Seg. 1: [ 1, 1, 1, 1, 1 ]
Seg. 2: [ 2, 2, 2, 2, 2 ]
Seg. 3: [ 3, 3, 3, 3, 3 ]
Seg. 4: [ 4, 4, 4, 4, 4 ]
Data setup with check for padding elements...
Array: [ 0, 1, 2, 3, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 3, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 3, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ]
Seg. 0: [ 0, 0, 0, 0, 0 ]
Seg. 1: [ 1, 1, 0, 0, 0 ]
Seg. 2: [ 2, 2, 2, 0, 0 ]
Seg. 3: [ 3, 3, 3, 3, 0 ]
Seg. 4: [ 4, 4, 4, 4, 4 ]

Method forSegments

The method forSegments iterates in parallel over the particular segments. The iteration over the elements within a segment is sequential. There are two reasons for such an approach:

  1. The iteration over the elements within the same segment must be sequential, e.g. the computation with one element depends on the result of the computation with the previous one.
  2. Some part of the computation is common for all elements of one segment. In this case, we can first perform the common part and then iterate over the elements. If we used the method forElements, the common part would have to be performed for each element.

Sequential dependency

The first situation is demonstrated in the following example:

#include <iostream>
#include <TNL/Containers/Vector.h>
#include <TNL/Algorithms/Segments/CSR.h>
#include <TNL/Algorithms/Segments/Ellpack.h>
#include <TNL/Devices/Host.h>
#include <TNL/Devices/Cuda.h>

template< typename Segments >
void
SegmentsExample()
{
   using Device = typename Segments::DeviceType;

   /***
    * Create segments with given segments sizes.
    */
   Segments segments{ 1, 2, 3, 4, 5 };

   /***
    * Allocate array for the segments.
    */
   TNL::Containers::Array< double, Device > data( segments.getStorageSize(), 0.0 );

   /***
    * Insert data into particular segments.
    */
   auto data_view = data.getView();
   using SegmentViewType = typename Segments::SegmentViewType;
   segments.forAllSegments(
      [ = ] __cuda_callable__( const SegmentViewType& segment ) mutable
      {
         double sum( 0.0 );
         for( auto element : segment )
            if( element.localIndex() <= element.segmentIndex() ) {
               sum += element.localIndex() + 1;
               data_view[ element.globalIndex() ] = sum;
            }
      } );

   /***
    * Print the data managed by the segments.
    */
   auto fetch = [ = ] __cuda_callable__( int globalIdx ) -> double
   {
      return data_view[ globalIdx ];
   };
   printSegments( std::cout, segments, fetch );
}

int
main( int argc, char* argv[] )
{
   std::cout << "Example of CSR segments on host: " << std::endl;
   SegmentsExample< TNL::Algorithms::Segments::CSR< TNL::Devices::Host, int > >();

   std::cout << "Example of Ellpack segments on host: " << std::endl;
   SegmentsExample< TNL::Algorithms::Segments::Ellpack< TNL::Devices::Host, int > >();

#ifdef __CUDACC__
   std::cout << "Example of CSR segments on CUDA GPU: " << std::endl;
   SegmentsExample< TNL::Algorithms::Segments::CSR< TNL::Devices::Cuda, int > >();

   std::cout << "Example of Ellpack segments on CUDA GPU: " << std::endl;
   SegmentsExample< TNL::Algorithms::Segments::Ellpack< TNL::Devices::Cuda, int > >();
#endif
   return EXIT_SUCCESS;
}


The code is the same as in the previous example up to the insertion of data into particular segments. Instead of calling the method forAllElements, we call the method forAllSegments, for which we need to define the type SegmentViewType (TNL::Algorithms::Segments::CSR::SegmentViewType for example). The lambda function passed to this method receives the segment view and iterates over all elements of the segment by means of a for loop. We use the auxiliary variable sum to compute the cumulative sum of the elements in each segment, which is exactly the kind of sequential dependency mentioned above. The result looks as follows:

Example of CSR segments on host:
Seg. 0: [ 1 ]
Seg. 1: [ 1, 3 ]
Seg. 2: [ 1, 3, 6 ]
Seg. 3: [ 1, 3, 6, 10 ]
Seg. 4: [ 1, 3, 6, 10, 15 ]
Example of Ellpack segments on host:
Seg. 0: [ 1, 0, 0, 0, 0 ]
Seg. 1: [ 1, 3, 0, 0, 0 ]
Seg. 2: [ 1, 3, 6, 0, 0 ]
Seg. 3: [ 1, 3, 6, 10, 0 ]
Seg. 4: [ 1, 3, 6, 10, 15 ]
Example of CSR segments on CUDA GPU:
Seg. 0: [ 1 ]
Seg. 1: [ 1, 3 ]
Seg. 2: [ 1, 3, 6 ]
Seg. 3: [ 1, 3, 6, 10 ]
Seg. 4: [ 1, 3, 6, 10, 15 ]
Example of Ellpack segments on CUDA GPU:
Seg. 0: [ 1, 0, 0, 0, 0 ]
Seg. 1: [ 1, 3, 0, 0, 0 ]
Seg. 2: [ 1, 3, 6, 0, 0 ]
Seg. 3: [ 1, 3, 6, 10, 0 ]
Seg. 4: [ 1, 3, 6, 10, 15 ]

Common computations

Now let's take a look at the second situation, i.e. when there are computations common for all elements of one segment. In the following example, we first set the value of each element using the method forAllElements, which we are already familiar with. Next, we print the values of all elements, and then we use the method forAllSegments to divide each element by the sum of the values of all elements in the same segment. So we first sum up all elements in the segment; this is the part of the computation common to all elements of the segment. Next, we divide all elements by the value of the variable sum.

#include <iostream>
#include <functional>
#include <TNL/Containers/Vector.h>
#include <TNL/Algorithms/Segments/CSR.h>
#include <TNL/Devices/Host.h>
#include <TNL/Devices/Cuda.h>

template< typename Device >
void
SegmentsExample()
{
   using SegmentsType = typename TNL::Algorithms::Segments::CSR< Device, int >;

   /***
    * Create segments with given segments sizes.
    */
   SegmentsType segments{ 1, 2, 3, 4, 5 };

   /***
    * Allocate array for the segments.
    */
   TNL::Containers::Array< double, Device > data( segments.getStorageSize(), 0.0 );

   /***
    * Insert data into particular segments.
    */
   auto data_view = data.getView();
   segments.forAllElements(
      [ = ] __cuda_callable__( int segmentIdx, int localIdx, int globalIdx ) mutable
      {
         data_view[ globalIdx ] = localIdx + 1;
      } );

   /***
    * Print the data by the segments.
    */
   std::cout << "Values of elements after initial setup: " << std::endl;
   auto fetch = [ = ] __cuda_callable__( int globalIdx ) -> double
   {
      return data_view[ globalIdx ];
   };
   printSegments( std::cout, segments, fetch );

   /***
    * Divide elements in each segment by the sum of all elements in the segment.
    */
   using SegmentViewType = typename SegmentsType::SegmentViewType;
   segments.forAllSegments(
      [ = ] __cuda_callable__( const SegmentViewType& segment ) mutable
      {
         // Compute the sum first ...
         double sum = 0.0;
         for( auto element : segment )
            if( element.localIndex() <= element.segmentIndex() )
               sum += data_view[ element.globalIndex() ];
         // ... divide all elements.
         for( auto element : segment )
            if( element.localIndex() <= element.segmentIndex() )
               data_view[ element.globalIndex() ] /= sum;
      } );

   /***
    * Print the data managed by the segments.
    */
   std::cout << "Values of elements after dividing by the sum in each segment:" << std::endl;
   printSegments( std::cout, segments, fetch );
}

int
main( int argc, char* argv[] )
{
   std::cout << "Example of CSR segments on host: " << std::endl;
   SegmentsExample< TNL::Devices::Host >();

#ifdef __CUDACC__
   std::cout << "Example of CSR segments on CUDA GPU: " << std::endl;
   SegmentsExample< TNL::Devices::Cuda >();
#endif
   return EXIT_SUCCESS;
}

The result looks as follows:

Example of CSR segments on host:
Values of elements after initial setup:
Seg. 0: [ 1 ]
Seg. 1: [ 1, 2 ]
Seg. 2: [ 1, 2, 3 ]
Seg. 3: [ 1, 2, 3, 4 ]
Seg. 4: [ 1, 2, 3, 4, 5 ]
Values of elements after dividing by the sum in each segment:
Seg. 0: [ 1 ]
Seg. 1: [ 0.333333, 0.666667 ]
Seg. 2: [ 0.166667, 0.333333, 0.5 ]
Seg. 3: [ 0.1, 0.2, 0.3, 0.4 ]
Seg. 4: [ 0.0666667, 0.133333, 0.2, 0.266667, 0.333333 ]
Example of CSR segments on CUDA GPU:
Values of elements after initial setup:
Seg. 0: [ 1 ]
Seg. 1: [ 1, 2 ]
Seg. 2: [ 1, 2, 3 ]
Seg. 3: [ 1, 2, 3, 4 ]
Seg. 4: [ 1, 2, 3, 4, 5 ]
Values of elements after dividing by the sum in each segment:
Seg. 0: [ 1 ]
Seg. 1: [ 0.333333, 0.666667 ]
Seg. 2: [ 0.166667, 0.333333, 0.5 ]
Seg. 3: [ 0.1, 0.2, 0.3, 0.4 ]
Seg. 4: [ 0.0666667, 0.133333, 0.2, 0.266667, 0.333333 ]

Flexible reduction within segments

In this section, we explain an extension of flexible reduction to segments. It allows reducing all elements within the same segment and storing the results into an array. See the following example:

#include <iostream>
#include <functional>
#include <TNL/Containers/Vector.h>
#include <TNL/Algorithms/Segments/CSR.h>
#include <TNL/Algorithms/Segments/Ellpack.h>
#include <TNL/Algorithms/SegmentsReductionKernels/DefaultKernel.h>
#include <TNL/Devices/Host.h>
#include <TNL/Devices/Cuda.h>

template< typename Segments >
void
SegmentsExample()
{
   using Device = typename Segments::DeviceType;
   // The default reduction kernel for the given segments type, provided by DefaultKernel.h.
   using SegmentsReductionKernel =
      typename TNL::Algorithms::SegmentsReductionKernels::DefaultKernel< Segments >::type;

   /***
    * Create segments with given segments sizes.
    */
   const int size = 5;
   Segments segments{ 1, 2, 3, 4, 5 };

   /***
    * Allocate array for the segments.
    */
   TNL::Containers::Array< double, Device > data( segments.getStorageSize(), 0.0 );

   /***
    * Insert data into particular segments.
    */
   auto data_view = data.getView();
   segments.forAllElements(
      [ = ] __cuda_callable__( int segmentIdx, int localIdx, int globalIdx ) mutable
      {
         if( localIdx <= segmentIdx )
            data_view[ globalIdx ] = segmentIdx;
      } );

   /***
    * Print the data by the segments.
    */
   std::cout << "Values of elements after initial setup: " << std::endl;
   auto fetch = [ = ] __cuda_callable__( int globalIdx ) -> double
   {
      return data_view[ globalIdx ];
   };
   printSegments( std::cout, segments, fetch );

   /***
    * Compute sums of elements in each segment.
    */
   TNL::Containers::Vector< double, Device > sums( size );
   auto sums_view = sums.getView();
   auto fetch_full = [ = ] __cuda_callable__( int segmentIdx, int localIdx, int globalIdx, bool& compute ) -> double
   {
      if( localIdx <= segmentIdx )
         return data_view[ globalIdx ];
      else {
         compute = false;
         return 0.0;
      }
   };
   auto fetch_brief = [ = ] __cuda_callable__( int globalIdx, bool& compute ) -> double
   {
      return data_view[ globalIdx ];
   };
   auto keep = [ = ] __cuda_callable__( int segmentIdx, const double& value ) mutable
   {
      sums_view[ segmentIdx ] = value;
   };

   SegmentsReductionKernel kernel;
   kernel.init( segments );
   kernel.reduceAllSegments( segments, fetch_full, std::plus<>{}, keep, 0.0 );
   std::cout << "The sums with full fetch form are: " << sums << std::endl;
   kernel.reduceAllSegments( segments, fetch_brief, std::plus<>{}, keep, 0.0 );
   std::cout << "The sums with brief fetch form are: " << sums << std::endl << std::endl;
}

int
main( int argc, char* argv[] )
{
   std::cout << "Example of CSR segments on host: " << std::endl;
   SegmentsExample< TNL::Algorithms::Segments::CSR< TNL::Devices::Host, int > >();

   std::cout << "Example of Ellpack segments on host: " << std::endl;
   SegmentsExample< TNL::Algorithms::Segments::Ellpack< TNL::Devices::Host, int > >();

#ifdef __CUDACC__
   std::cout << "Example of CSR segments on CUDA GPU: " << std::endl;
   SegmentsExample< TNL::Algorithms::Segments::CSR< TNL::Devices::Cuda, int > >();

   std::cout << "Example of Ellpack segments on CUDA GPU: " << std::endl;
   SegmentsExample< TNL::Algorithms::Segments::Ellpack< TNL::Devices::Cuda, int > >();
#endif
   return EXIT_SUCCESS;
}

We first create the segments segments and the related array data, and set up the elements using the forAllElements method. After printing the segments, we are ready for the parallel reduction. It requires three lambda functions:

  1. fetch, which reads data belonging to particular elements of the segments. The fetch function can have two different forms - brief and full:
    • Brief form - the lambda function gets only the global index and the compute flag:

      auto fetch = [=] __cuda_callable__ ( int globalIdx, bool& compute ) -> double { ... };
    • Full form - the lambda function receives the segment index and the local index as well:

      auto fetch = [=] __cuda_callable__ ( int segmentIdx, int localIdx, int globalIdx, bool& compute ) -> double { ... };

      where segmentIdx is the index of the segment, localIdx is the rank of the element within the segment, globalIdx is the index of the element in the related array, and compute serves for interrupting the reduction, which means that the remaining elements in the segment can be omitted. Many formats used for segments are optimized for much higher performance when the brief variant is used. The form of the fetch lambda function is detected automatically using SFINAE, so using either form is easy for the user.

  2. reduce is a function representing the reduction operation; in our case it is defined as follows:

    auto reduce = [=] __cuda_callable__ ( const double& a, const double& b ) -> double { return a + b; };

    or, in fact, we can simply use std::plus<>.

  3. keep is a lambda function responsible for storing the results. It is supposed to be defined as:

    auto keep = [=] __cuda_callable__ ( int segmentIdx, const double& value ) mutable { ... };

    where segmentIdx is the index of the segment whose reduction result we aim to store and value is the result of the reduction in that segment.

To use reduction within segments, we first create the vector sums where we will store the results and prepare a view of this vector for later use in the lambda functions. We demonstrate the use of both fetch variants: full by fetch_full and brief by fetch_brief. Next, we define the lambda function keep for storing the sums from particular segments into the vector sums. Finally, we call the method reduceAllSegments (TNL::Algorithms::SegmentsReductionKernels::CSRScalarKernel::reduceAllSegments for example) to compute the reductions in the segments, first with fetch_full and then with fetch_brief. In both cases, we use std::plus for the reduction and we pass zero (the last argument) as the identity element for summation. In both cases, we print the results, which are supposed to be the same.
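
To make the semantics of reduceAllSegments concrete, here is a serial sketch (plain C++, independent of TNL's kernels) of the same computation over CSR-like offsets; it only spells out how fetch, the reduction operation, the identity element and keep interact:

#include <array>
#include <functional>
#include <iostream>

int main()
{
   // Segments of sizes 1, 2, 3, 4, 5 stored in one flat array, with the data
   // from the example above; offsets[ s ] is where segment s begins.
   std::array< int, 6 > offsets = { 0, 1, 3, 6, 10, 15 };
   std::array< double, 15 > data = { 0, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 4 };
   std::plus<> reduction;

   for( int segmentIdx = 0; segmentIdx < 5; segmentIdx++ ) {
      double result = 0.0;  // the identity element passed as the last argument
      for( int globalIdx = offsets[ segmentIdx ]; globalIdx < offsets[ segmentIdx + 1 ]; globalIdx++ )
         result = reduction( result, data[ globalIdx ] );  // fetch the element and reduce
      std::cout << result << " ";  // keep stores the result for segment segmentIdx
   }
   std::cout << std::endl;  // prints: 0 2 6 12 20
}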

The result looks as follows:

Example of CSR segments on host:
Values of elements after initial setup:
Seg. 0: [ 0 ]
Seg. 1: [ 1, 1 ]
Seg. 2: [ 2, 2, 2 ]
Seg. 3: [ 3, 3, 3, 3 ]
Seg. 4: [ 4, 4, 4, 4, 4 ]
The sums with full fetch form are: [ 0, 2, 6, 12, 20 ]
The sums with brief fetch form are: [ 0, 2, 6, 12, 20 ]
Example of Ellpack segments on host:
Values of elements after initial setup:
Seg. 0: [ 0, 0, 0, 0, 0 ]
Seg. 1: [ 1, 1, 0, 0, 0 ]
Seg. 2: [ 2, 2, 2, 0, 0 ]
Seg. 3: [ 3, 3, 3, 3, 0 ]
Seg. 4: [ 4, 4, 4, 4, 4 ]
The sums with full fetch form are: [ 0, 2, 6, 12, 20 ]
The sums with brief fetch form are: [ 0, 2, 6, 12, 20 ]
Example of CSR segments on CUDA GPU:
Values of elements after initial setup:
Seg. 0: [ 0 ]
Seg. 1: [ 1, 1 ]
Seg. 2: [ 2, 2, 2 ]
Seg. 3: [ 3, 3, 3, 3 ]
Seg. 4: [ 4, 4, 4, 4, 4 ]
The sums with full fetch form are: [ 0, 2, 6, 12, 20 ]
The sums with brief fetch form are: [ 0, 2, 6, 12, 20 ]
Example of Ellpack segments on CUDA GPU:
Values of elements after initial setup:
Seg. 0: [ 0, 0, 0, 0, 0 ]
Seg. 1: [ 1, 1, 0, 0, 0 ]
Seg. 2: [ 2, 2, 2, 0, 0 ]
Seg. 3: [ 3, 3, 3, 3, 0 ]
Seg. 4: [ 4, 4, 4, 4, 4 ]
The sums with full fetch form are: [ 0, 2, 6, 12, 20 ]
The sums with brief fetch form are: [ 0, 2, 6, 12, 20 ]