Template Numerical Library version\ main:eacc201d
Searching...
No Matches
Segments tutorial

# Introduction

Segments represent data structure for manipulation with several local arrays (denoted also as segments) having different size in general. All the local arrays are supposed to be allocated in one continuos global array. The data structure segments offers mapping between indexes of particular local arrays and indexes of the global array. Segments do not store any data, segments just represent a layer for efficient access and operations with group of segments of linear containers (i.e. local arrays) with different size in general. One can perform parallel operations like for or flexible reduction on particular segments (local arrays).

A typical example of segments are different formats for sparse matrices. Sparse matrix like the following

$\left( \begin{array}{ccccc} 1 & 0 & 2 & 0 & 0 \\ 0 & 0 & 5 & 0 & 0 \\ 3 & 4 & 7 & 9 & 0 \\ 0 & 0 & 0 & 0 & 12 \\ 0 & 0 & 15 & 17 & 20 \end{array} \right)$

is usually first compressed which means that the zero elements are omitted to get the following "matrix":

$\begin{array}{ccccc} 1 & 2 \\ 5 \\ 3 & 4 & 7 & 9 \\ 12 \\ 15 & 17 & 20 \end{array}$

We have to store column index of each matrix elements as well in a "matrix" like this:

$\begin{array}{ccccc} 0 & 2 \\ 2 \\ 0 & 1 & 2 & 3 \\ 4 \\ 2 & 3 & 4 \end{array}$

Such "matrices" can be stored in memory in a row-wise manner in one contiguous array because of the performance reasons. The first "matrix" (i.e. values of the matrix elements) would be stored as follows

$\begin{array}{|cc|c|cccc|c|cc|} 1 & 2 & 5 & 3 & 4 & 7 & 9 & 12 & 15 & 17 & 20 \end{array}$

and the second one (i.e. column indexes of the matrix values) as follows

$\begin{array}{|cc|c|cccc|c|cc|} 0 & 2 & 2 & 0 & 1 & 2 & 3 & 4 & 2 & 3 & 4 \end{array}$

What we see above is so called CSR sparse matrix format. It is the most popular format for storage of sparse matrices designed for high performance. However, it may not be the most efficient format for storage of sparse matrices on GPUs. Therefore many other formats have been developed to get better performance. These formats often have different layout of the matrix elements in the memory. They have to deal especially with two difficulties:

1. Efficient storage of matrix elements in the memory to fulfill the requirements of coalesced memory accesses on GPUs or good spatial locality for efficient use of caches on CPUs.
2. Efficient mapping of GPU threads to different matrix rows.

TNL offers the following sparse matrix formats in a form of segments (Ellpack formats often use so called padding elements like padding zeros in terms of sparse matrices):

1. CSR format (TNL::Algorithms::Segments::CSR) is the most popular format for sparse matrices. It is simple ane very efficient especially on CPUs and today there are efficient kernels even for GPUs. The following GPU kernels are implemented in TNL:
1. Scalar which maps one GPU thread for each segment (matrix row).
2. Vector which maps one warp of GPU threads for each segment (matrix row).
2. Ellpack format (TNL::Algorithms::Segments::Ellpack) uses padding elements to have the same number of element in each segment. It can be highly inefficient in cases when one works with few very long segments.
3. SlicedEllpack format (TNL::Algorithms::Segments::SlicedEllpack) which was also presented as Row-grouped CSR format is similar to common Ellpack. However, SlicedEllpack first merges the segments into groups of 32. It also uses padding elements but only segments within the same group are aligned to have the same size. Therefore there is not such a high performance drop because of few long segments.
4. ChunkedEllpack format (TNL::Algorithms::Segments::ChunkedEllpack) is simillar to SlicedEllpack but it splits segments into chunks which allows to map more GPU threads to one segment.
5. BiEllpack format (TNL::Algorithms::Segments::BiEllpack) is simillar to ChunkedEllpack. In addition it sorts segments within the same slice w.r.t. their length to achieve higher performance and better memory accesses.

Especially in case of GPUs, the performance of each format strongly depends on distribution of the segment sizes. Therefore we cannot say that one of the previous formats would outperform the others in general. To get the best performance, one should try more of the formats and choose the best one. It is the reason why TNL offers more of them and additional formats will acrue.

Necessity of working with this kind of data structures is not limited only to sparse matrices. We could name at least few other applications for segments:

1. Graphs - one segment represents one graph node, the elements in one segments are indexes of its neighbors.
2. Unstructured numerical meshes - unstructured numerical mesh is a graph in fact.
3. Particle in cell method - one segment represents one cell, the elements in one segment are indexes of the particles.
4. K-means clustering - segments represent one cluster, the elements represent vectors belonging to given cluster.
5. Hashing - segments are particular rows of the hash table, elements in segments corresponds with colliding hashed elements.

In general, segments can be used for problems that somehow corresponds wit 2D data structure where each row can have different size and we need to perform miscellaneous operations within the rows. The name segments comes from segmented parallel reduction or segmented scan (prefix-sum).

# Segments setup

Segments are defined just by sizes of particular segments. The following example shows how to create them:

1#include <iostream>
2#include <TNL/Containers/Vector.h>
3#include <TNL/Algorithms/Segments/CSR.h>
4#include <TNL/Algorithms/Segments/Ellpack.h>
5#include <TNL/Algorithms/Segments/ChunkedEllpack.h>
6#include <TNL/Algorithms/Segments/BiEllpack.h>
7#include <TNL/Devices/Host.h>
8#include <TNL/Devices/Cuda.h>
9
10template< typename Segments >
11void SegmentsExample()
12{
13 /***
14 * Create segments with given segments sizes and print their setup.
15 */
16 Segments segments{ 1, 2, 3, 4, 5 };
17 std::cout << "Segments sizes are: " << segments << std::endl << std::endl;
18}
19
20int main( int argc, char* argv[] )
21{
22 std::cout << "Example of CSR segments on host: " << std::endl;
23 SegmentsExample< TNL::Algorithms::Segments::CSR< TNL::Devices::Host, int > >();
24
25 std::cout << "Example of Ellpack segments on host: " << std::endl;
26 SegmentsExample< TNL::Algorithms::Segments::Ellpack< TNL::Devices::Host, int > >();
27
28 std::cout << "Example of ChunkedEllpack segments on host: " << std::endl;
29 SegmentsExample< TNL::Algorithms::Segments::ChunkedEllpack< TNL::Devices::Host, int > >();
30
31 std::cout << "Example of BiEllpack segments on host: " << std::endl;
32 SegmentsExample< TNL::Algorithms::Segments::BiEllpack< TNL::Devices::Host, int > >();
33
34#ifdef HAVE_CUDA
35 std::cout << "Example of CSR segments on host: " << std::endl;
36 SegmentsExample< TNL::Algorithms::Segments::CSR< TNL::Devices::Cuda, int > >();
37
38 std::cout << "Example of Ellpack segments on host: " << std::endl;
39 SegmentsExample< TNL::Algorithms::Segments::Ellpack< TNL::Devices::Cuda, int > >();
40
41 std::cout << "Example of ChunkedEllpack segments on host: " << std::endl;
42 SegmentsExample< TNL::Algorithms::Segments::ChunkedEllpack< TNL::Devices::Cuda, int > >();
43
44 std::cout << "Example of BiEllpack segments on host: " << std::endl;
45 SegmentsExample< TNL::Algorithms::Segments::BiEllpack< TNL::Devices::Cuda, int > >();
46#endif
47 return EXIT_SUCCESS;
48}
T endl(T... args)

We use constructor with initializer list (line 16) where each element of the list defines size of one segment. Next we print sizes of particular segments (line 17). We call this function for different segments types (excluding TNL::Algorithms::Segments::SlicedEllpack since it would behave the same way as TNL::Algorithms::Segments::Ellpack on this small example). The result looks as follows:

Example of CSR segments on host:
Segments sizes are: [ 1, 2, 3, 4, 5, ]
Example of Ellpack segments on host:
Segments sizes are: [ 5, 5, 5, 5, 5, ]
Example of ChunkedEllpack segments on host:
Segments sizes are: [ 17, 34, 51, 67, 87, ]
Example of BiEllpack segments on host:
Segments sizes are: [ 2, 4, 4, 5, 5, ]
Example of CSR segments on host:
Segments sizes are: [ 1, 2, 3, 4, 5, ]
Example of Ellpack segments on host:
Segments sizes are: [ 5, 5, 5, 5, 5, ]
Example of ChunkedEllpack segments on host:
Segments sizes are: [ 17, 34, 51, 67, 87, ]
Example of BiEllpack segments on host:
Segments sizes are: [ 2, 4, 4, 5, 5, ]

We can see, that real sizes of the segments are different for all Ellpack-based formats. As we said already, these formats often use padding elements to get more efficient access to the memory. For example TNL::Algorithms::Segments::ChunkedEllpack format involves multiple of elements. It is, however, only because of very small example we present now, on large examples the overhead is not so significant.

We remind that segments represent rather sparse format then data structure because they do not store any data. The following example shows how to connect segments with array:

1#include <iostream>
2#include <TNL/Containers/Vector.h>
3#include <TNL/Algorithms/Segments/CSR.h>
4#include <TNL/Algorithms/Segments/Ellpack.h>
5#include <TNL/Algorithms/Segments/ChunkedEllpack.h>
6#include <TNL/Algorithms/Segments/BiEllpack.h>
7#include <TNL/Devices/Host.h>
8#include <TNL/Devices/Cuda.h>
9
10template< typename Segments >
11void SegmentsExample()
12{
13 using Device = typename Segments::DeviceType;
14
15 /***
16 * Create segments with given segments sizes.
17 */
18 TNL::Containers::Vector< int, Device > sizes{ 1, 2, 3, 4, 5 };
19 Segments segments( sizes );
20 std::cout << "Segments sizes are: " << segments << std::endl;
21
22 /***
23 * Allocate array for the segments;
24 */
25 TNL::Containers::Array< double, Device > data( segments.getStorageSize(), 0.0 );
26 data.forAllElements( [=] __cuda_callable__ ( int idx, double& value ) {
27 value = idx;
28 } );
29
30 /***
31 * Print the data managed by the segments.
32 */
33 auto data_view = data.getView();
34 auto fetch = [=] __cuda_callable__ ( int globalIdx ) -> double { return data_view[ globalIdx ]; };
35 printSegments( segments, fetch, std::cout ) << std::endl;
36}
37
38int main( int argc, char* argv[] )
39{
40 std::cout << "Example of CSR segments on host: " << std::endl;
41 SegmentsExample< TNL::Algorithms::Segments::CSR< TNL::Devices::Host, int > >();
42
43 std::cout << "Example of Ellpack segments on host: " << std::endl;
44 SegmentsExample< TNL::Algorithms::Segments::Ellpack< TNL::Devices::Host, int > >();
45
46 std::cout << "Example of ChunkedEllpack segments on host: " << std::endl;
47 SegmentsExample< TNL::Algorithms::Segments::ChunkedEllpack< TNL::Devices::Host, int > >();
48
49 std::cout << "Example of BiEllpack segments on host: " << std::endl;
50 SegmentsExample< TNL::Algorithms::Segments::BiEllpack< TNL::Devices::Host, int > >();
51
52#ifdef HAVE_CUDA
53 std::cout << "Example of CSR segments on host: " << std::endl;
54 SegmentsExample< TNL::Algorithms::Segments::CSR< TNL::Devices::Cuda, int > >();
55
56 std::cout << "Example of Ellpack segments on host: " << std::endl;
57 SegmentsExample< TNL::Algorithms::Segments::Ellpack< TNL::Devices::Cuda, int > >();
58
59 std::cout << "Example of ChunkedEllpack segments on host: " << std::endl;
60 SegmentsExample< TNL::Algorithms::Segments::ChunkedEllpack< TNL::Devices::Cuda, int > >();
61
62 std::cout << "Example of BiEllpack segments on host: " << std::endl;
63 SegmentsExample< TNL::Algorithms::Segments::BiEllpack< TNL::Devices::Cuda, int > >();
64#endif
65 return EXIT_SUCCESS;
66}
#define __cuda_callable__
Definition: CudaCallable.h:22
Array is responsible for memory management, access to array elements, and general array operations.
Definition: Array.h:67
Vector extends Array with algebraic operations.
Definition: Vector.h:40
std::ostream & printSegments(const Segments &segments, std::ostream &str)
Print segments sizes, i.e. the segments setup.
Definition: SegmentsPrinting.h:31

On the line 19, we show how to create segments with vector (TNL::Containers::Vector) carrying the segments sizes. Of course, the same constructor works even for arrays and views (i.e. TNL::Containers::Array, TNL::Containers::ArrayView and TNL::Containers::VectorView). Next we print the real segment sizes depending on the format in the background (line 20) the same way as we did in the previous example. On the line 25, we allocate array having the size requested by the segments by means of method getStorageSize (TNL::Algorithms::Segments::CSR::getStorageSize for example). This method says how many elements the segments need to be able to address all elements by their global index. On the lines 26-28, we mark each element of the array by its rank in the array. On the line 35, we use function TNL::Algorithms::Segments::printSegments which accepts lambda function fetch as one its parameters. The lambda function reads data from our array data (with the help of array view data_view) according to given global index globalIdx (line 34). The result looks as follows:

Example of CSR segments on host:
Segments sizes are: [ 1, 2, 3, 4, 5, ]
Seg. 0: [ 0 ]
Seg. 1: [ 1, 2 ]
Seg. 2: [ 3, 4, 5 ]
Seg. 3: [ 6, 7, 8, 9 ]
Seg. 4: [ 10, 11, 12, 13, 14 ]
Example of Ellpack segments on host:
Segments sizes are: [ 5, 5, 5, 5, 5, ]
Seg. 0: [ 0, 1, 2, 3, 4 ]
Seg. 1: [ 5, 6, 7, 8, 9 ]
Seg. 2: [ 10, 11, 12, 13, 14 ]
Seg. 3: [ 15, 16, 17, 18, 19 ]
Seg. 4: [ 20, 21, 22, 23, 24 ]
Example of ChunkedEllpack segments on host:
Segments sizes are: [ 17, 34, 51, 67, 87, ]
Seg. 0: [ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16 ]
Seg. 1: [ 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50 ]
Seg. 2: [ 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101 ]
Seg. 3: [ 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168 ]
Seg. 4: [ 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239, 240, 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 251, 252, 253, 254, 255 ]
Example of BiEllpack segments on host:
Segments sizes are: [ 2, 4, 4, 5, 5, ]
Seg. 0: [ 8, 9 ]
Seg. 1: [ 6, 7, 22, 23 ]
Seg. 2: [ 4, 5, 20, 21 ]
Seg. 3: [ 2, 3, 18, 19, 25 ]
Seg. 4: [ 0, 1, 16, 17, 24 ]
Example of CSR segments on host:
Segments sizes are: [ 1, 2, 3, 4, 5, ]
Seg. 0: [ 0 ]
Seg. 1: [ 1, 2 ]
Seg. 2: [ 3, 4, 5 ]
Seg. 3: [ 6, 7, 8, 9 ]
Seg. 4: [ 10, 11, 12, 13, 14 ]
Example of Ellpack segments on host:
Segments sizes are: [ 5, 5, 5, 5, 5, ]
Seg. 0: [ 0, 32, 64, 96, 128 ]
Seg. 1: [ 1, 33, 65, 97, 129 ]
Seg. 2: [ 2, 34, 66, 98, 130 ]
Seg. 3: [ 3, 35, 67, 99, 131 ]
Seg. 4: [ 4, 36, 68, 100, 132 ]
Example of ChunkedEllpack segments on host:
Segments sizes are: [ 17, 34, 51, 67, 87, ]
Seg. 0: [ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16 ]
Seg. 1: [ 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50 ]
Seg. 2: [ 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101 ]
Seg. 3: [ 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168 ]
Seg. 4: [ 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239, 240, 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 251, 252, 253, 254, 255 ]
Example of BiEllpack segments on host:
Segments sizes are: [ 2, 4, 4, 5, 5, ]
Seg. 0: [ 4, 12 ]
Seg. 1: [ 3, 11, 19, 23 ]
Seg. 2: [ 2, 10, 18, 22 ]
Seg. 3: [ 1, 9, 17, 21, 25 ]
Seg. 4: [ 0, 8, 16, 20, 24 ]

Frankly, what we see is not so important. It only shows that different segments formats can use very different mapping of elements identified by its segment index and local index (rank of the element in given segment) to a global index which serves as an address in the related container.

# Iteration over elements of segments

In this section, we show how to iterate over the elements of segments and how to manipulate with them. There are three possible ways:

1. Method forElements (TNL::Algorithms::Segments::CSR::forElements for example), which iterates in parallel over all elements of segments and perform given lambda function on each of them.
2. Method forSegments (TNL::Algorithms::Segments::CSR::forSegments for example), which iterates in parallel over all segments. It is better choice when we need to process each segment sequentially are we have significant amount of computations common for all elements in each segment.
3. Method sequentailForSegments (TNL::Algorithms::Segments::CSR::sequentialForSegments for example), which iterates over all segments sequentially i.e. using only one thread even on GPUs. It is useful for debugging or for printing for example.

Methods iterating over particular segments use a segment view (TNL::Algorithms::Segments::SegmentView) to access the elements of given segment. The segment view offers iterator for better convenience.

## Method forElements

The following example shows use of the method forElements:

1#include <iostream>
2#include <TNL/Containers/Vector.h>
3#include <TNL/Algorithms/Segments/CSR.h>
4#include <TNL/Algorithms/Segments/Ellpack.h>
5#include <TNL/Devices/Host.h>
6#include <TNL/Devices/Cuda.h>
7
8template< typename Segments >
9void SegmentsExample()
10{
11 using Device = typename Segments::DeviceType;
12
13 /***
14 * Create segments with given segments sizes.
15 */
16 Segments segments{ 1, 2, 3, 4, 5 };
17
18 /***
19 * Allocate array for the segments;
20 */
21 TNL::Containers::Array< double, Device > data( segments.getStorageSize(), 0.0 );
22
23 /***
24 * Insert data into particular segments with no check.
25 */
26 auto data_view = data.getView();
27 segments.forAllElements( [=] __cuda_callable__ ( int segmentIdx, int localIdx, int globalIdx ) mutable {
28 data_view[ globalIdx ] = segmentIdx;
29 } );
30
31 /***
32 * Print the data managed by the segments.
33 */
34 std::cout << "Data setup with no check ... " << std::endl;
35 std::cout << "Array: " << data << std::endl;
36 auto fetch = [=] __cuda_callable__ ( int globalIdx ) -> double { return data_view[ globalIdx ]; };
37 printSegments( segments, fetch, std::cout );
38
39 /***
40 * Insert data into particular segments.
41 */
42 data = 0.0;
43 segments.forAllElements( [=] __cuda_callable__ ( int segmentIdx, int localIdx, int globalIdx ) mutable {
44 if( localIdx <= segmentIdx )
45 data_view[ globalIdx ] = segmentIdx;
46 } );
47
48 /***
49 * Print the data managed by the segments.
50 */
51 std::cout << "Data setup with check for padding elements..." << std::endl;
52 std::cout << "Array: " << data << std::endl;
53 printSegments( segments, fetch, std::cout );
54}
55
56int main( int argc, char* argv[] )
57{
58 std::cout << "Example of CSR segments on host: " << std::endl;
59 SegmentsExample< TNL::Algorithms::Segments::CSR< TNL::Devices::Host, int > >();
60
61 std::cout << "Example of Ellpack segments on host: " << std::endl;
62 SegmentsExample< TNL::Algorithms::Segments::Ellpack< TNL::Devices::Host, int > >();
63
64
65#ifdef HAVE_CUDA
66 std::cout << "Example of CSR segments on CUDA GPU: " << std::endl;
67 SegmentsExample< TNL::Algorithms::Segments::CSR< TNL::Devices::Cuda, int > >();
68
69 std::cout << "Example of Ellpack segments on CUDA GPU: " << std::endl;
70 SegmentsExample< TNL::Algorithms::Segments::Ellpack< TNL::Devices::Cuda, int > >();
71#endif
72 return EXIT_SUCCESS;
73}

On the line 7, we first create segments with linearly increasing size (so it is like lower triangular matrix). Next, we allocate array data (line 21) having the same size as the number of elements managed by the segments. It can be obtained by the method getStorageSize (TNL::Algorithms::Segments::CSR::getStorageSize for example). We prepare array view data_view for the purpose of use in lambda functions (line 26). Finally, we call the method forAllElements (lines 27-29) which iterates in parallel over all elements in the segments and for each element it calls given lambda function. The lambda function receives three arguments - segmentIdx is an index of the segment the element belongs to, localIdx is the rank of the element within the segment and globalIdx is an index of the element in the array data. We use the global index to set proper element of the array data to the index of the segment. On the line 35, we print the array data. We can see elements belonging to particular segments by their indexes. The layout of the elements depends on the type of segments (which means sparse format in use). Next we print the elements of array data by segments (lines 36 and 37). The function printSegments iterates over all elements and it reads the elements of the array data with the help of the lambda function defined on the line 36.

Note, that for the Ellpack format, the output looks as follows:

Seg. 0: [ 0, 0, 0, 0, 0 ]
Seg. 1: [ 1, 1, 1, 1, 1 ]
Seg. 2: [ 2, 2, 2, 2, 2 ]
Seg. 3: [ 3, 3, 3, 3, 3 ]
Seg. 4: [ 4, 4, 4, 4, 4 ]

We see more elements that we have requested. The reason is that the Ellpack format uses padding elements for optimizing access to the memory. Segments give access even to the padding elements, they can be used in case when we get to situation of need of additional elements. Therefore we need to check for relevant and padding elements each time we work with elements of segments. It is demonstrated on the lines 43-46 where we set the array data again but we check for the padding elements (line 44). After printing the segments the same way as before (line 53) we get correct result:

Seg. 0: [ 0, 0, 0, 0, 0 ]
Seg. 1: [ 1, 1, 0, 0, 0 ]
Seg. 2: [ 2, 2, 2, 0, 0 ]
Seg. 3: [ 3, 3, 3, 3, 0 ]
Seg. 4: [ 4, 4, 4, 4, 4 ]

The result of the whole example looks as follows:

Example of CSR segments on host:
Data setup with no check ...
Array: [ 0, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 4 ]
Seg. 0: [ 0 ]
Seg. 1: [ 1, 1 ]
Seg. 2: [ 2, 2, 2 ]
Seg. 3: [ 3, 3, 3, 3 ]
Seg. 4: [ 4, 4, 4, 4, 4 ]
Data setup with check for padding elements...
Array: [ 0, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 4 ]
Seg. 0: [ 0 ]
Seg. 1: [ 1, 1 ]
Seg. 2: [ 2, 2, 2 ]
Seg. 3: [ 3, 3, 3, 3 ]
Seg. 4: [ 4, 4, 4, 4, 4 ]
Example of Ellpack segments on host:
Data setup with no check ...
Array: [ 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4 ]
Seg. 0: [ 0, 0, 0, 0, 0 ]
Seg. 1: [ 1, 1, 1, 1, 1 ]
Seg. 2: [ 2, 2, 2, 2, 2 ]
Seg. 3: [ 3, 3, 3, 3, 3 ]
Seg. 4: [ 4, 4, 4, 4, 4 ]
Data setup with check for padding elements...
Array: [ 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 2, 2, 2, 0, 0, 3, 3, 3, 3, 0, 4, 4, 4, 4, 4 ]
Seg. 0: [ 0, 0, 0, 0, 0 ]
Seg. 1: [ 1, 1, 0, 0, 0 ]
Seg. 2: [ 2, 2, 2, 0, 0 ]
Seg. 3: [ 3, 3, 3, 3, 0 ]
Seg. 4: [ 4, 4, 4, 4, 4 ]
Example of CSR segments on CUDA GPU:
Data setup with no check ...
Array: [ 0, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 4 ]
Seg. 0: [ 0 ]
Seg. 1: [ 1, 1 ]
Seg. 2: [ 2, 2, 2 ]
Seg. 3: [ 3, 3, 3, 3 ]
Seg. 4: [ 4, 4, 4, 4, 4 ]
Data setup with check for padding elements...
Array: [ 0, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 4 ]
Seg. 0: [ 0 ]
Seg. 1: [ 1, 1 ]
Seg. 2: [ 2, 2, 2 ]
Seg. 3: [ 3, 3, 3, 3 ]
Seg. 4: [ 4, 4, 4, 4, 4 ]
Example of Ellpack segments on CUDA GPU:
Data setup with no check ...
Array: [ 0, 1, 2, 3, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 3, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 3, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 3, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 3, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ]
Seg. 0: [ 0, 0, 0, 0, 0 ]
Seg. 1: [ 1, 1, 1, 1, 1 ]
Seg. 2: [ 2, 2, 2, 2, 2 ]
Seg. 3: [ 3, 3, 3, 3, 3 ]
Seg. 4: [ 4, 4, 4, 4, 4 ]
Data setup with check for padding elements...
Array: [ 0, 1, 2, 3, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 3, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 3, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ]
Seg. 0: [ 0, 0, 0, 0, 0 ]
Seg. 1: [ 1, 1, 0, 0, 0 ]
Seg. 2: [ 2, 2, 2, 0, 0 ]
Seg. 3: [ 3, 3, 3, 3, 0 ]
Seg. 4: [ 4, 4, 4, 4, 4 ]

## Method forSegments

Method forSegments iterates in parallel over particular segments. Iteration over elements within the segment is sequential. There are two reasons for such proceeding:

1. The iteration over the elements within the same segments must be sequential, i.e. the computation with one element depends on a result of the computation with the previous one.
2. Some part of computations on all elements in one segment is common. In this case, we can first perform the common part and then iterate over the elements. If we would use the method forElements, the common part would have to be performed for each element.

### Sequential dependency

The first situation is demonstrated in the following example:

1#include <iostream>
2#include <TNL/Containers/Vector.h>
3#include <TNL/Algorithms/Segments/CSR.h>
4#include <TNL/Algorithms/Segments/Ellpack.h>
5#include <TNL/Devices/Host.h>
6#include <TNL/Devices/Cuda.h>
7
8template< typename Segments >
9void SegmentsExample()
10{
11 using Device = typename Segments::DeviceType;
12
13 /***
14 * Create segments with given segments sizes.
15 */
16 Segments segments{ 1, 2, 3, 4, 5 };
17
18 /***
19 * Allocate array for the segments;
20 */
21 TNL::Containers::Array< double, Device > data( segments.getStorageSize(), 0.0 );
22
23 /***
24 * Insert data into particular segments.
25 */
26 auto data_view = data.getView();
27 using SegmentViewType = typename Segments::SegmentViewType;
28 segments.forAllSegments( [=] __cuda_callable__ ( const SegmentViewType& segment ) mutable {
29 double sum( 0.0 );
30 for( auto element : segment )
31 if( element.localIndex() <= element.segmentIndex() )
32 {
33 sum += element.localIndex() + 1;
34 data_view[ element.globalIndex() ] = sum;
35 }
36 } );
37
38 /***
39 * Print the data managed by the segments.
40 */
41 auto fetch = [=] __cuda_callable__ ( int globalIdx ) -> double { return data_view[ globalIdx ]; };
42 printSegments( segments, fetch, std::cout );
43}
44
45int main( int argc, char* argv[] )
46{
47 std::cout << "Example of CSR segments on host: " << std::endl;
48 SegmentsExample< TNL::Algorithms::Segments::CSR< TNL::Devices::Host, int > >();
49
50 std::cout << "Example of Ellpack segments on host: " << std::endl;
51 SegmentsExample< TNL::Algorithms::Segments::Ellpack< TNL::Devices::Host, int > >();
52
53
54#ifdef HAVE_CUDA
55 std::cout << "Example of CSR segments on CUDA GPU: " << std::endl;
56 SegmentsExample< TNL::Algorithms::Segments::CSR< TNL::Devices::Cuda, int > >();
57
58 std::cout << "Example of Ellpack segments on CUDA GPU: " << std::endl;
59 SegmentsExample< TNL::Algorithms::Segments::Ellpack< TNL::Devices::Cuda, int > >();
60#endif
61 return EXIT_SUCCESS;
62}

The result looks as follows:

The code is the same as in the previous example up to line 26. Instead of calling the method forElements we call the method forSegments (line 28) for which we need to define type SegmentViewType (TNL::Algorithms::Segments::CSR::SegmentViewType for example). The lambda function on the line 28 gets the segment view and it iterates over all elements of the segment by means of a for loop. We use auxiliary variable sum to compute cumulative sum of elements in each segment which is just the sequential dependency. The result looks as follows:

Example of CSR segments on host:
Seg. 0: [ 1 ]
Seg. 1: [ 1, 3 ]
Seg. 2: [ 1, 3, 6 ]
Seg. 3: [ 1, 3, 6, 10 ]
Seg. 4: [ 1, 3, 6, 10, 15 ]
Example of Ellpack segments on host:
Seg. 0: [ 1, 0, 0, 0, 0 ]
Seg. 1: [ 1, 3, 0, 0, 0 ]
Seg. 2: [ 1, 3, 6, 0, 0 ]
Seg. 3: [ 1, 3, 6, 10, 0 ]
Seg. 4: [ 1, 3, 6, 10, 15 ]
Example of CSR segments on CUDA GPU:
Seg. 0: [ 1 ]
Seg. 1: [ 1, 3 ]
Seg. 2: [ 1, 3, 6 ]
Seg. 3: [ 1, 3, 6, 10 ]
Seg. 4: [ 1, 3, 6, 10, 15 ]
Example of Ellpack segments on CUDA GPU:
Seg. 0: [ 1, 0, 0, 0, 0 ]
Seg. 1: [ 1, 3, 0, 0, 0 ]
Seg. 2: [ 1, 3, 6, 0, 0 ]
Seg. 3: [ 1, 3, 6, 10, 0 ]
Seg. 4: [ 1, 3, 6, 10, 15 ]

### Common computations

Now let's take a look at the second situation, i.e. there are common computations for all elements of one segment. In the following example, we first set values of each element using the method forElements which we are already familiar with (lines 26-29). Next we print values of all elements (lines 34-36) and then we use the method forAllSegments (lines 41-52) to divide each element by a sum of values of all elements in a segment. So we first sum up all elements in the segment (lines 43-47). This is the common part of the computation for all elements in the segment. Next we perform the division of all elements by the value of the variable sum (lines 48-51).

1#include <iostream>
2#include <functional>
3#include <TNL/Containers/Vector.h>
4#include <TNL/Algorithms/Segments/CSR.h>
5#include <TNL/Devices/Host.h>
6#include <TNL/Devices/Cuda.h>
7
8template< typename Device >
9void SegmentsExample()
10{
11 using SegmentsType = typename TNL::Algorithms::Segments::CSR< Device, int >;
12
13 /***
14 * Create segments with given segments sizes.
15 */
16 SegmentsType segments{ 1, 2, 3, 4, 5 };
17
18 /***
19 * Allocate array for the segments;
20 */
21 TNL::Containers::Array< double, Device > data( segments.getStorageSize(), 0.0 );
22
23 /***
24 * Insert data into particular segments.
25 */
26 auto data_view = data.getView();
27 segments.forAllElements( [=] __cuda_callable__ ( int segmentIdx, int localIdx, int globalIdx ) mutable {
28 data_view[ globalIdx ] = localIdx + 1;
29 } );
30
31 /***
32 * Print the data by the segments.
33 */
34 std::cout << "Values of elements after intial setup: " << std::endl;
35 auto fetch = [=] __cuda_callable__ ( int globalIdx ) -> double { return data_view[ globalIdx ]; };
36 printSegments( segments, fetch, std::cout );
37
38 /***
39 * Divide elements in each segment by a sum of all elements in the segment
40 */
41 using SegmentViewType = typename SegmentsType::SegmentViewType;
42 segments.forAllSegments( [=] __cuda_callable__ ( const SegmentViewType& segment ) mutable {
43 // Compute the sum first ...
44 double sum = 0.0;
45 for( auto element : segment )
46 if( element.localIndex() <= element.segmentIndex() )
47 sum += data_view[ element.globalIndex() ];
48 // ... divide all elements.
49 for( auto element : segment )
50 if( element.localIndex() <= element.segmentIndex() )
51 data_view[ element.globalIndex() ] /= sum;
52 } );
53
54 /***
55 * Print the data managed by the segments.
56 */
57 std::cout << "Value of elements after dividing by sum in each segment:" << std::endl;
58 printSegments( segments, fetch, std::cout );
59}
60
61int main( int argc, char* argv[] )
62{
63 std::cout << "Example of CSR segments on host: " << std::endl;
64 SegmentsExample< TNL::Devices::Host >();
65
66#ifdef HAVE_CUDA
67 std::cout << "Example of CSR segments on CUDA GPU: " << std::endl;
68 SegmentsExample< TNL::Devices::Cuda >();
69#endif
70 return EXIT_SUCCESS;
71}
Data structure for CSR segments format.
Definition: CSR.h:38

The result looks as follows:

Example of CSR segments on host:
Values of elements after intial setup:
Seg. 0: [ 1 ]
Seg. 1: [ 1, 2 ]
Seg. 2: [ 1, 2, 3 ]
Seg. 3: [ 1, 2, 3, 4 ]
Seg. 4: [ 1, 2, 3, 4, 5 ]
Value of elements after dividing by sum in each segment:
Seg. 0: [ 1 ]
Seg. 1: [ 0.333333, 0.666667 ]
Seg. 2: [ 0.166667, 0.333333, 0.5 ]
Seg. 3: [ 0.1, 0.2, 0.3, 0.4 ]
Seg. 4: [ 0.0666667, 0.133333, 0.2, 0.266667, 0.333333 ]
Example of CSR segments on CUDA GPU:
Values of elements after intial setup:
Seg. 0: [ 1 ]
Seg. 1: [ 1, 2 ]
Seg. 2: [ 1, 2, 3 ]
Seg. 3: [ 1, 2, 3, 4 ]
Seg. 4: [ 1, 2, 3, 4, 5 ]
Value of elements after dividing by sum in each segment:
Seg. 0: [ 1 ]
Seg. 1: [ 0.333333, 0.666667 ]
Seg. 2: [ 0.166667, 0.333333, 0.5 ]
Seg. 3: [ 0.1, 0.2, 0.3, 0.4 ]
Seg. 4: [ 0.0666667, 0.133333, 0.2, 0.266667, 0.333333 ]

# Flexible reduction within segments

In this section we will explain extension of [flexible reduction]() to segments. It allows to reduce all elements within the same segment and store the result into an array. See the following example:

1#include <iostream>
2#include <functional>
3#include <TNL/Containers/Vector.h>
4#include <TNL/Algorithms/Segments/CSR.h>
5#include <TNL/Algorithms/Segments/Ellpack.h>
6#include <TNL/Devices/Host.h>
7#include <TNL/Devices/Cuda.h>
8
9template< typename Segments >
10void SegmentsExample()
11{
12 using Device = typename Segments::DeviceType;
13
14 /***
15 * Create segments with given segments sizes.
16 */
17 const int size = 5;
18 Segments segments{ 1, 2, 3, 4, 5 };
19
20 /***
21 * Allocate array for the segments;
22 */
23 TNL::Containers::Array< double, Device > data( segments.getStorageSize(), 0.0 );
24
25 /***
26 * Insert data into particular segments.
27 */
28 auto data_view = data.getView();
29 segments.forAllElements( [=] __cuda_callable__ ( int segmentIdx, int localIdx, int globalIdx ) mutable {
30 if( localIdx <= segmentIdx )
31 data_view[ globalIdx ] = segmentIdx;
32 } );
33
34 /***
35 * Print the data by the segments.
36 */
37 std::cout << "Values of elements after intial setup: " << std::endl;
38 auto fetch = [=] __cuda_callable__ ( int globalIdx ) -> double { return data_view[ globalIdx ]; };
39 printSegments( segments, fetch, std::cout );
40
41 /***
42 * Compute sums of elements in each segment.
43 */
45 auto sums_view = sums.getView();
46 auto fetch_full = [=] __cuda_callable__ ( int segmentIdx, int localIdx, int globalIdx, bool& compute ) -> double {
47 if( localIdx <= segmentIdx )
48 return data_view[ globalIdx ];
49 else
50 {
51 compute = false;
52 return 0.0;
53 }
54 };
55 auto fetch_brief = [=] __cuda_callable__ ( int globalIdx, bool& compute ) -> double {
56 return data_view[ globalIdx ];
57 };
58
59 auto keep = [=] __cuda_callable__ ( int globalIdx, const double& value ) mutable {
60 sums_view[ globalIdx ] = value; };
61 segments.reduceAllSegments( fetch_full, std::plus<>{}, keep, 0.0 );
62 std::cout << "The sums with full fetch form are: " << sums << std::endl;
63 segments.reduceAllSegments( fetch_brief, std::plus<>{}, keep, 0.0 );
64 std::cout << "The sums with brief fetch form are: " << sums << std::endl << std::endl;
65}
66
67int main( int argc, char* argv[] )
68{
69 std::cout << "Example of CSR segments on host: " << std::endl;
70 SegmentsExample< TNL::Algorithms::Segments::CSR< TNL::Devices::Host, int > >();
71
72 std::cout << "Example of Ellpack segments on host: " << std::endl;
73 SegmentsExample< TNL::Algorithms::Segments::Ellpack< TNL::Devices::Host, int > >();
74
75#ifdef HAVE_CUDA
76 std::cout << "Example of CSR segments on host: " << std::endl;
77 SegmentsExample< TNL::Algorithms::Segments::CSR< TNL::Devices::Cuda, int > >();
78
79 std::cout << "Example of Ellpack segments on host: " << std::endl;
80 SegmentsExample< TNL::Algorithms::Segments::Ellpack< TNL::Devices::Cuda, int > >();
81#endif
82 return EXIT_SUCCESS;
83}

We first create the segments segments (line 18), related array data (line 23) and setup the elements (lines 28-32). After printing the segments (lines 37-39) we are ready for the parallel reduction. It requires three lambda fuctions:

1. fetch which reads data belonging to particular elements of the segments. The fetch function can have two different forms - brief and full:
• Brief form - is this case the lambda function gets only global index and the compute flag:
auto fetch = [=] __cuda_callable__ ( int globalIdx, bool& compute ) -> double { ... };
• Full form - in this case the lambda function receives even the segment index and element index:
auto fetch = [=] __cuda_callable__ ( int segmentIdx, int localIdx, int globalIdx, bool& compute ) -> double { ... };
where segmentIdx is the index of the segment, localIdx is the rank of the element within the segment, globalIdx is index of the element in the related array and compute serves for the reduction interruption which means that the remaining elements in the segment can be omitted. Many formats used for segments are optimized for much higher performance if the brief variant is used. The form of the fetch lambda function is detected automatically using SFINAE and so the use of both is very ease for the user.
2. reduce is a function representing the reduction operation, in our case it is defined as follows:
auto reduce = [=] __cuda_callable__ ( const double& a, const double& b ) -> double { return a + b; }
or, in fact, we can use the function std::plus.
3. keep is a lambda function responsible for storage of the results. It is supposed to be defined as:
auto keep = [=] __cuda_callable__ ( int segmentIdx, const double& value ) mutable { ... };
where segmentIdx is an index of the segment of which the reduction result we aim to store and value is the result of the reduction in the segment.

We first create vector sums where we will store the results (line 44) and prepare a view to this vector for later use in the lambda functions. We demonstrate use of both variants - full by fetch_full (lines 46-54) and brief by fetch_brief (lines 55-57). The lambda function keep for storing the sums from particular segments into the vector sums is on the lines 59-60. Finally, we call the method reduceAllSegments (TNL::Algorithms::Segments::CSR::reduceSegments for example) to compute the reductions in the segments - first with fetch_full (line 61) and then with fetch_brief (line 63). In both cases, we use std::plus for the reduction and we pass zero (the last argument) as an idempotent element for sumation. In both cases we print the results which are supposed to be the same. The result looks as follows:

Example of CSR segments on host:
Values of elements after intial setup:
Seg. 0: [ 0 ]
Seg. 1: [ 1, 1 ]
Seg. 2: [ 2, 2, 2 ]
Seg. 3: [ 3, 3, 3, 3 ]
Seg. 4: [ 4, 4, 4, 4, 4 ]
The sums with full fetch form are: [ 0, 2, 6, 12, 20 ]
The sums with brief fetch form are: [ 0, 2, 6, 12, 20 ]
Example of Ellpack segments on host:
Values of elements after intial setup:
Seg. 0: [ 0, 0, 0, 0, 0 ]
Seg. 1: [ 1, 1, 0, 0, 0 ]
Seg. 2: [ 2, 2, 2, 0, 0 ]
Seg. 3: [ 3, 3, 3, 3, 0 ]
Seg. 4: [ 4, 4, 4, 4, 4 ]
The sums with full fetch form are: [ 0, 2, 6, 12, 20 ]
The sums with brief fetch form are: [ 0, 2, 6, 12, 20 ]
Example of CSR segments on host:
Values of elements after intial setup:
Seg. 0: [ 0 ]
Seg. 1: [ 1, 1 ]
Seg. 2: [ 2, 2, 2 ]
Seg. 3: [ 3, 3, 3, 3 ]
Seg. 4: [ 4, 4, 4, 4, 4 ]
The sums with full fetch form are: [ 0, 2, 6, 12, 20 ]
The sums with brief fetch form are: [ 0, 2, 6, 12, 20 ]
Example of Ellpack segments on host:
Values of elements after intial setup:
Seg. 0: [ 0, 0, 0, 0, 0 ]
Seg. 1: [ 1, 1, 0, 0, 0 ]
Seg. 2: [ 2, 2, 2, 0, 0 ]
Seg. 3: [ 3, 3, 3, 3, 0 ]
Seg. 4: [ 4, 4, 4, 4, 4 ]
The sums with full fetch form are: [ 0, 2, 6, 12, 20 ]
The sums with brief fetch form are: [ 0, 2, 6, 12, 20 ]