Template Numerical Library version\ main:4904c12
Loading...
Searching...
No Matches
Segments

Introduction

Segments represent a data structure designed for managing multiple local arrays (also referred to as segments), which generally have different sizes. All the local arrays are stored in a single contiguous global array. The segments structure provides a mapping between indices of individual local arrays and indices within the global array.

Importantly, segments do not store any data themselves. Instead, they serve as a lightweight abstraction layer that facilitates efficient access to and operations on groups of linear containers (i.e., local arrays) of variable size.

Using segments, one can perform various parallel operations, such as for-style traversals, flexible reduction and prefix-sum, sorting or searching across the individual segments.

A typical example of the use of segments is in implementing different sparse matrix formats. For instance, a sparse matrix like the following:

\[\left( \begin{array}{ccccc} 1 & 0 & 2 & 0 & 0 \\ 0 & 0 & 5 & 0 & 0 \\ 3 & 4 & 7 & 9 & 0 \\ 0 & 0 & 0 & 0 & 12 \\ 0 & 0 & 15 & 17 & 20 \end{array} \right) \]

This matrix is usually compressed, meaning that the zero elements are omitted. The resulting compressed structure looks like:

\[\begin{array}{ccccc} 1 & 2 \\ 5 \\ 3 & 4 & 7 & 9 \\ 12 \\ 15 & 17 & 20 \end{array} \]

To retain information about the positions of the non-zero elements, we also store their column indices, forming a second structure:

\[\begin{array}{ccccc} 0 & 2 \\ 2 \\ 0 & 1 & 2 & 3 \\ 4 \\ 2 & 3 & 4 \end{array} \]

Both of these “matrices” (non-zero values and their corresponding column indices) are typically stored in memory row-wise in contiguous arrays for performance reasons. For example, the array of values becomes:

\[\begin{array}{|cc|c|cccc|c|cc|} 1 & 2 & 5 & 3 & 4 & 7 & 9 & 12 & 15 & 17 & 20 \end{array} \]

And the corresponding array of column indices is:

\[\begin{array}{|cc|c|cccc|c|cc|} 0 & 2 & 2 & 0 & 1 & 2 & 3 & 4 & 2 & 3 & 4 \end{array} \]

What we see above is the so-called CSR sparse matrix format. This is the most widely used format for storing sparse matrices and was designed for high performance on CPUs.

However, CSR is not always the most efficient format for sparse matrix storage and access on GPUs. For this reason, many alternative formats have been developed to achieve better performance on parallel architectures. These formats often use a different layout of matrix elements in memory, and must address the following challenges:

  1. Efficient memory storage of matrix elements to enable:
    • Coalesced memory accesses on GPUs
    • Good spatial locality for effective use of CPU caches
  2. Efficient mapping of GPU threads to matrix rows or elements for parallel processing.

The TNL library provides support for several sparse matrix formats implemented as segments. Some of these formats - such as the Ellpack based formats - use padding elements (e.g., padding zeros) to equalize segment sizes (e.g. compressed matrix row lengths) and optimize memory access patterns. The following is the list of currently implemented sparse matrix formats in TNL:

  1. CSR format (TNL::Algorithms::Segments::CSR) This is the most widely used sparse matrix format. It is simple and efficient, especially on CPUs. Today, there also exist optimized GPU kernels for CSR. The following GPU kernels are implemented in TNL:
    1. Scalar Maps one GPU thread for each segment (matrix row).
    2. Vector Maps one warp of GPU threads for each segment (matrix row).
    3. Adaptive Dynamically selects the most suitable kernel based on the segment properties.
  2. Ellpack format (TNL::Algorithms::Segments::Ellpack) This format pads all segments to the same length, which can be inefficient when there are a few very long segments. It offers good memory access especially on GPUs.
  3. SlicedEllpack format (TNL::Algorithms::Segments::SlicedEllpack) Also known as Row-grouped CSR format this format groups segments into blocks (e.g., of 32 segments). Padding is applied only within each group, reducing the performance loss caused by outlier segment lengths.
  4. ChunkedEllpack format (TNL::Algorithms::Segments::ChunkedEllpack) Similar to SlicedEllpack, but allows segments to be split into smaller chunks, enabling more GPU threads to work on the same segment concurrently. This improves performance for very long segments.
  5. BiEllpack format (TNL::Algorithms::Segments::BiEllpack) Is similar to ChunkedEllpack. In addition, it sorts segments within each slice according to their length. This further improves memory access patterns and computational efficiency on parallel architectures.

Inspired by the paper SELL-C-σ, TNL also offers sorted segments. In this type of segments, the segments are first sorted by their size in descending order and then stored using an underlying segment type. The underlying segments can be of any type, not only TNL::Algorithms::Segments::SlicedEllpack as in the SELL-C-σ format. Using sorted segments can be particularly beneficial when the sizes of individual segments vary significantly. This has the greatest performance impact on GPUs.

Especially in the case of GPUs, the performance of each format strongly depends on the distribution of segment sizes. Therefore, we cannot claim that any one of the above formats consistently outperforms the others. To achieve optimal performance, it is often necessary to experiment with multiple formats and select the best one for a given problem. This is why TNL offers a variety of formats, and additional ones may be added in the future.

The need for such data structures is not limited to sparse matrices. Segments can be applied in a wide range of problems, including but not limited to:

  1. Graphs - Each segment corresponds to a graph node; elements in a segment represent the node's neighbors.
  2. Unstructured numerical meshes - These meshes can be viewed as graphs and handled similarly.
  3. Particle in cell method - Each segment represents a cell; the elements are indices of the particles in that cell.
  4. K-means clustering - Each segment corresponds to a cluster; elements in the segment are vectors assigned to that cluster.
  5. Hashing - Segments represent rows in a hash table; elements in a segment are entries that collide in the same row.

In general, segments are useful for any problem involving data structures consisting of contiguous blocks with irregular sizes, where various operations are performed per block. The term segments is derived from segmented parallel reduction and segmented scan (prefix-sum).

Segments setup

Segments are defined solely by the sizes of the individual segments. The following example shows how to create them:

/***
* Create segments with given segments sizes and print their setup.
*/
Segments segments{ 1, 2, 3, 4, 5 };
std::cout << "Segments sizes are: " << segments << "\n\n";

We use the constructor with an initializer list, where each element of the list defines the size of one segment. We then print the sizes of the individual segments. This function is called for different segment types (except TNL::Algorithms::Segments::SlicedEllpack, as it behaves the same as TNL::Algorithms::Segments::Ellpack for such a small example).

The full example reads as:

1#include <iostream>
2#include <TNL/Containers/Vector.h>
3#include <TNL/Algorithms/Segments/CSR.h>
4#include <TNL/Algorithms/Segments/Ellpack.h>
5#include <TNL/Algorithms/Segments/ChunkedEllpack.h>
6#include <TNL/Algorithms/Segments/BiEllpack.h>
7#include <TNL/Devices/Host.h>
8#include <TNL/Devices/Cuda.h>
9
10template< typename Segments >
11void
12SegmentsExample()
13{
15 /***
16 * Create segments with given segments sizes and print their setup.
17 */
18 Segments segments{ 1, 2, 3, 4, 5 };
19 std::cout << "Segments sizes are: " << segments << "\n\n";
21}
22
23int
24main( int argc, char* argv[] )
25{
26 std::cout << "Example of CSR segments on host:\n";
27 SegmentsExample< TNL::Algorithms::Segments::CSR< TNL::Devices::Host, int > >();
28
29 std::cout << "Example of Ellpack segments on host:\n";
30 SegmentsExample< TNL::Algorithms::Segments::Ellpack< TNL::Devices::Host, int > >();
31
32 std::cout << "Example of ChunkedEllpack segments on host:\n";
33 SegmentsExample< TNL::Algorithms::Segments::ChunkedEllpack< TNL::Devices::Host, int > >();
34
35 std::cout << "Example of BiEllpack segments on host:\n";
36 SegmentsExample< TNL::Algorithms::Segments::BiEllpack< TNL::Devices::Host, int > >();
37
38#ifdef __CUDACC__
39 std::cout << "Example of CSR segments on host:\n";
40 SegmentsExample< TNL::Algorithms::Segments::CSR< TNL::Devices::Cuda, int > >();
41
42 std::cout << "Example of Ellpack segments on host:\n";
43 SegmentsExample< TNL::Algorithms::Segments::Ellpack< TNL::Devices::Cuda, int > >();
44
45 std::cout << "Example of ChunkedEllpack segments on host:\n";
46 SegmentsExample< TNL::Algorithms::Segments::ChunkedEllpack< TNL::Devices::Cuda, int > >();
47
48 std::cout << "Example of BiEllpack segments on host:\n";
49 SegmentsExample< TNL::Algorithms::Segments::BiEllpack< TNL::Devices::Cuda, int > >();
50#endif
51 return EXIT_SUCCESS;
52}

The result looks as follows:

Example of CSR segments on host:
Segments sizes are: [ 1, 2, 3, 4, 5 ]
Example of Ellpack segments on host:
Segments sizes are: [ 5, 5, 5, 5, 5 ]
Example of ChunkedEllpack segments on host:
Segments sizes are: [ 17, 34, 51, 67, 87 ]
Example of BiEllpack segments on host:
Segments sizes are: [ 2, 4, 4, 5, 5 ]
Example of CSR segments on host:
Segments sizes are: [ 1, 2, 3, 4, 5 ]
Example of Ellpack segments on host:
Segments sizes are: [ 5, 5, 5, 5, 5 ]
Example of ChunkedEllpack segments on host:
Segments sizes are: [ 17, 34, 51, 67, 87 ]
Example of BiEllpack segments on host:
Segments sizes are: [ 2, 4, 4, 5, 5 ]

We can see that the actual sizes of the segments differ for all Ellpack-based formats. As mentioned earlier, these formats often use padding elements to achieve more efficient memory access. For example, the TNL::Algorithms::Segments::ChunkedEllpack format includes many such elements. However, this overhead is mainly a result of the very small example presented here; on large data, the overhead is usually not so significant.

Let us remind the reader that segments represent a sparse format rather than a data structure, as they do not store any data themselves. The following example shows how to connect segments to an array:

/***
* Create segments with given segments sizes.
*/
Segments segments( sizes );
std::cout << "Segments sizes are: " << segments << '\n';

We first show how to create segments using a vector (TNL::Containers::Vector) sizes that stores the segment sizes. The same constructor also works with arrays (TNL::Containers::Array) arrays views (TNL::Containers::ArrayView) and vector views (TNL::Containers::VectorView).

Next, we print the actual segment sizes depending on the underlying format, as in the previous example.

/***
* Allocate array for the segments;
*/
TNL::Containers::Array< double, Device > data( segments.getStorageSize(), 0.0 );
data.forAllElements(
[ = ] __cuda_callable__( int idx, double& value )
{
value = idx;
} );

Then we allocate the array data of the size required by segments using the method getStorageSize (e.g., TNL::Algorithms::Segments::CSR::getStorageSize). This method tells us how many elements are needed for the segments to address all elements by their global index.

We use the forAllElements method to label each element of the array data with its rank.

/***
* Print the data managed by the segments.
*/
auto data_view = data.getView();
auto fetch = [ = ] __cuda_callable__( int globalIdx ) -> double
{
return data_view[ globalIdx ];
};
std::cout << TNL::Algorithms::Segments::print( segments, fetch ) << '\n';

Finally, we use the function TNL::Algorithms::Segments::print, which takes a lambda function fetch as a parameter. This lambda reads data from the array data (using the array view data_view) based on the given globalIdx.

The full example reads as:

1#include <iostream>
2#include <TNL/Containers/Vector.h>
3#include <TNL/Algorithms/Segments/CSR.h>
4#include <TNL/Algorithms/Segments/Ellpack.h>
5#include <TNL/Algorithms/Segments/ChunkedEllpack.h>
6#include <TNL/Algorithms/Segments/BiEllpack.h>
7#include <TNL/Devices/Host.h>
8#include <TNL/Devices/Cuda.h>
9
10template< typename Segments >
11void
12SegmentsExample()
13{
14 using Device = typename Segments::DeviceType;
15
17 /***
18 * Create segments with given segments sizes.
19 */
20 TNL::Containers::Vector< int, Device > sizes{ 1, 2, 3, 4, 5 };
21 Segments segments( sizes );
22 std::cout << "Segments sizes are: " << segments << '\n';
24
26 /***
27 * Allocate array for the segments;
28 */
29 TNL::Containers::Array< double, Device > data( segments.getStorageSize(), 0.0 );
30 data.forAllElements(
31 [ = ] __cuda_callable__( int idx, double& value )
32 {
33 value = idx;
34 } );
36
38 /***
39 * Print the data managed by the segments.
40 */
41 auto data_view = data.getView();
42 auto fetch = [ = ] __cuda_callable__( int globalIdx ) -> double
43 {
44 return data_view[ globalIdx ];
45 };
46 std::cout << TNL::Algorithms::Segments::print( segments, fetch ) << '\n';
48}
49
50int
51main( int argc, char* argv[] )
52{
53 std::cout << "Example of CSR segments on host:\n";
54 SegmentsExample< TNL::Algorithms::Segments::CSR< TNL::Devices::Host, int > >();
55
56 std::cout << "Example of Ellpack segments on host:\n";
57 SegmentsExample< TNL::Algorithms::Segments::Ellpack< TNL::Devices::Host, int > >();
58
59 std::cout << "Example of ChunkedEllpack segments on host:\n";
60 SegmentsExample< TNL::Algorithms::Segments::ChunkedEllpack< TNL::Devices::Host, int > >();
61
62 std::cout << "Example of BiEllpack segments on host:\n";
63 SegmentsExample< TNL::Algorithms::Segments::BiEllpack< TNL::Devices::Host, int > >();
64
65#ifdef __CUDACC__
66 std::cout << "Example of CSR segments on host:\n";
67 SegmentsExample< TNL::Algorithms::Segments::CSR< TNL::Devices::Cuda, int > >();
68
69 std::cout << "Example of Ellpack segments on host:\n";
70 SegmentsExample< TNL::Algorithms::Segments::Ellpack< TNL::Devices::Cuda, int > >();
71
72 std::cout << "Example of ChunkedEllpack segments on host:\n";
73 SegmentsExample< TNL::Algorithms::Segments::ChunkedEllpack< TNL::Devices::Cuda, int > >();
74
75 std::cout << "Example of BiEllpack segments on host:\n";
76 SegmentsExample< TNL::Algorithms::Segments::BiEllpack< TNL::Devices::Cuda, int > >();
77#endif
78 return EXIT_SUCCESS;
79}
#define __cuda_callable__
Definition Macros.h:49
Array is responsible for memory management, access to array elements, and general array operations.
Definition Array.h:65
Vector extends Array with algebraic operations.
Definition Vector.h:37
SegmentsPrinter< typename Segments::ConstViewType, Fetch > print(const Segments &segments, Fetch fetch)
Print segments sizes, i.e. the segments setup.

The result looks as follows:

Example of CSR segments on host:
Segments sizes are: [ 1, 2, 3, 4, 5 ]
Segment 0: [ 0 ]
Segment 1: [ 1, 2 ]
Segment 2: [ 3, 4, 5 ]
Segment 3: [ 6, 7, 8, 9 ]
Segment 4: [ 10, 11, 12, 13, 14 ]
Example of Ellpack segments on host:
Segments sizes are: [ 5, 5, 5, 5, 5 ]
Segment 0: [ 0, 1, 2, 3, 4 ]
Segment 1: [ 5, 6, 7, 8, 9 ]
Segment 2: [ 10, 11, 12, 13, 14 ]
Segment 3: [ 15, 16, 17, 18, 19 ]
Segment 4: [ 20, 21, 22, 23, 24 ]
Example of ChunkedEllpack segments on host:
Segments sizes are: [ 17, 34, 51, 67, 87 ]
Segment 0: [ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16 ]
Segment 1: [ 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50 ]
Segment 2: [ 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101 ]
Segment 3: [ 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168 ]
Segment 4: [ 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239, 240, 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 251, 252, 253, 254, 255 ]
Example of BiEllpack segments on host:
Segments sizes are: [ 2, 4, 4, 5, 5 ]
Segment 0: [ 8, 9 ]
Segment 1: [ 6, 7, 22, 23 ]
Segment 2: [ 4, 5, 20, 21 ]
Segment 3: [ 2, 3, 18, 19, 25 ]
Segment 4: [ 0, 1, 16, 17, 24 ]
Example of CSR segments on host:
Segments sizes are: [ 1, 2, 3, 4, 5 ]
Segment 0: [ 0 ]
Segment 1: [ 1, 2 ]
Segment 2: [ 3, 4, 5 ]
Segment 3: [ 6, 7, 8, 9 ]
Segment 4: [ 10, 11, 12, 13, 14 ]
Example of Ellpack segments on host:
Segments sizes are: [ 5, 5, 5, 5, 5 ]
Segment 0: [ 0, 32, 64, 96, 128 ]
Segment 1: [ 1, 33, 65, 97, 129 ]
Segment 2: [ 2, 34, 66, 98, 130 ]
Segment 3: [ 3, 35, 67, 99, 131 ]
Segment 4: [ 4, 36, 68, 100, 132 ]
Example of ChunkedEllpack segments on host:
Segments sizes are: [ 17, 34, 51, 67, 87 ]
Segment 0: [ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16 ]
Segment 1: [ 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50 ]
Segment 2: [ 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101 ]
Segment 3: [ 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168 ]
Segment 4: [ 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239, 240, 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 251, 252, 253, 254, 255 ]
Example of BiEllpack segments on host:
Segments sizes are: [ 2, 4, 4, 5, 5 ]
Segment 0: [ 4, 12 ]
Segment 1: [ 3, 11, 19, 23 ]
Segment 2: [ 2, 10, 18, 22 ]
Segment 3: [ 1, 9, 17, 21, 25 ]
Segment 4: [ 0, 8, 16, 20, 24 ]

What we observe demonstrates that different segment formats can use very different mappings from an element identified by its segment index and local index (i.e., the rank of the element within the segment) to a global index, which serves as an address in the associated container.

Setup of sorted segments

The main difference in the setup of the sorted segments is showcased in the following code:

std::cout << "Example of sorted CSR segments on host:\n";
SegmentsExample< TNL::Algorithms::Segments::SortedSegments< TNL::Algorithms::Segments::CSR< TNL::Devices::Host, int > > >();

Instead of using CSR or Ellpack directly, we embed them as the underlying segment type into SortedSegments.

The complete example is shown below:

1#include <iostream>
2#include <functional>
3#include <TNL/Containers/Vector.h>
4#include <TNL/Algorithms/Segments/traverse.h>
5#include <TNL/Algorithms/Segments/SortedSegments.h>
6#include <TNL/Devices/Host.h>
7#include <TNL/Devices/Cuda.h>
8
9template< typename Segments >
10void
11SegmentsExample()
12{
13 using Device = typename Segments::DeviceType;
14
15 /***
16 * Create segments with given segments sizes.
17 */
18 TNL::Containers::Vector< int, Device > segmentsSizes{ 1, 2, 3, 4, 5 };
19 Segments segments( segmentsSizes );
20 std::cout << "Segments sizes are: " << segments << '\n';
21
22 /***
23 * Allocate array for the segments;
24 */
25 TNL::Containers::Array< double, Device > data( segments.getStorageSize(), 0.0 );
26
27 /***
28 * Insert data into particular segments.
29 */
30 auto data_view = data.getView();
32 segments,
33 [ = ] __cuda_callable__( int segmentIdx, int localIdx, int globalIdx ) mutable
34 {
35 if( localIdx <= segmentIdx )
36 data_view[ globalIdx ] = segmentIdx;
37 } );
38
39 /***
40 * Print the data managed by the segments.
41 */
42 auto fetch = [ = ] __cuda_callable__( int globalIdx ) -> double
43 {
44 return data_view[ globalIdx ];
45 };
46 std::cout << TNL::Algorithms::Segments::print( segments, fetch ) << '\n';
47}
48
49int
50main( int argc, char* argv[] )
51{
52 // ![sorted-segments-definition]
53 std::cout << "Example of sorted CSR segments on host:\n";
54 SegmentsExample< TNL::Algorithms::Segments::SortedSegments< TNL::Algorithms::Segments::CSR< TNL::Devices::Host, int > > >();
55 // ![sorted-segments-definition]
56
57 std::cout << "Example of sorted Ellpack segments on host:\n";
58 SegmentsExample<
60
61#ifdef __CUDACC__
62 std::cout << "Example of sorted CSR segments on CUDA GPU:\n";
63 SegmentsExample< TNL::Algorithms::Segments::SortedSegments< TNL::Algorithms::Segments::CSR< TNL::Devices::Cuda, int > > >();
64
65 std::cout << "Example of sorted Ellpack segments on CUDA GPU:\n";
66 SegmentsExample<
68#endif
69 return EXIT_SUCCESS;
70}
Data structure for sorted segments.
Definition SortedSegments.h:25
void forAllElements(const Segments &segments, Function &&function, LaunchConfiguration launchConfig=Algorithms::Segments::LaunchConfiguration())
Iterates in parallel over all elements of all segments and applies the specified lambda function.

The output of this example is as follows:

Example of sorted CSR segments on host:
Segments sizes are: [ 1, 2, 3, 4, 5 ]
Segment 0: [ 0 ]
Segment 1: [ 1, 1 ]
Segment 2: [ 2, 2, 2 ]
Segment 3: [ 3, 3, 3, 3 ]
Segment 4: [ 4, 4, 4, 4, 4 ]
Example of sorted Ellpack segments on host:
Segments sizes are: [ 5, 5, 5, 5, 5 ]
Segment 0: [ 0, 0, 0, 0, 0 ]
Segment 1: [ 1, 1, 0, 0, 0 ]
Segment 2: [ 2, 2, 2, 0, 0 ]
Segment 3: [ 3, 3, 3, 3, 0 ]
Segment 4: [ 4, 4, 4, 4, 4 ]
Example of sorted CSR segments on CUDA GPU:
Segments sizes are: [ 1, 2, 3, 4, 5 ]
Segment 0: [ 0 ]
Segment 1: [ 1, 1 ]
Segment 2: [ 2, 2, 2 ]
Segment 3: [ 3, 3, 3, 3 ]
Segment 4: [ 4, 4, 4, 4, 4 ]
Example of sorted Ellpack segments on CUDA GPU:
Segments sizes are: [ 5, 5, 5, 5, 5 ]
Segment 0: [ 0, 0, 0, 0, 0 ]
Segment 1: [ 1, 1, 0, 0, 0 ]
Segment 2: [ 2, 2, 2, 0, 0 ]
Segment 3: [ 3, 3, 3, 3, 0 ]
Segment 4: [ 4, 4, 4, 4, 4 ]

Iteration over elements of segments

In this section, we show how to iterate over the elements of segments and how to manipulate with them. There are several ways to specify which segments should be traversed:

  1. All Segments: Functions forElements (TNL::Algorithms::Segments::forElements) and forAllElements (TNL::Algorithms::Segments::forAllElements) iterate in parallel over all elements of all segments (or specified range of segmets) and apply the given lambda function to each of them.
  2. Selected Segments by Index: The same function, forElements (TNL::Algorithms::Segments::forElements), can additionally be provided with an array of segment indices. In that case, they iterate in parallel only over the elements of the specified segments and apply the given lambda function to each of them.
  3. Conditional Segment Selection: Functions forElementsIf (TNL::Algorithms::Segments::forElementsIf) and forAllElementsIf (TNL::Algorithms::Segments::forAllElementsIf) iterate in parallel over all elements of the segments that fulfill a given condition (specified by a lambda function) and apply a provided lambda function to each of them.

All functions mentioned above traverse the elements of segments in parallel. In some cases, however, it is better to iterate over the elements of a particular segment sequentially. This can be done in the same way as for parallel traversal:

  1. All Segments: Functions forSegments (TNL::Algorithms::Segments::forSegments) and forAllSegments (TNL::Algorithms::Segments::forAllSegments) iterate in parallel over all segments (or specified range of segments) and apply the given lambda function to each of them.
  1. Selected Segments by Index: The same functions forSegments (TNL::Algorithms::Segments::forSegments) and forAllSegments (TNL::Algorithms::Segments::forAllSegments) can also additionally be provided with an array of segment indices. In that case, they iterate in parallel only over the specified segments and apply the given lambda function to each of them.
  1. Conditional Segment Selection: Functions forSegmentsIf (TNL::Algorithms::Segments::forSegmentsIf) and forAllSegmentsIf (TNL::Algorithms::Segments::forAllSegmentsIf) iterate in parallel over all segments that fulfill a given condition (specified by a lambda function) and apply a provided lambda function to each of them.

Functions iterating over particular segments use a segment view (TNL::Algorithms::Segments::SegmentView) to access the elements of given segment. The segment view offers iterator for better convenience.

Function forElements

The following example shows use of the function forElements.

/***
* Create segments with given segments sizes.
*/
Segments segments{ 1, 2, 3, 4, 5 };
/***
* Allocate array for the segments;
*/
TNL::Containers::Array< double, Device > data( segments.getStorageSize(), 0.0 );

We first create segments with linearly increasing sizes (i.e., a structure resembling a lower triangular matrix). Next, we allocate an array data with the same size as the total number of elements managed by the segments. This size can be obtained using the getStorageSize method (see, e.g., TNL::Algorithms::Segments::CSR::getStorageSize).

/***
* Insert data into particular segments with no check.
*/
auto data_view = data.getView();
segments,
[ = ] __cuda_callable__( int segmentIdx, int localIdx, int globalIdx ) mutable
{
data_view[ globalIdx ] = segmentIdx;
} );

We then create an array view data_view for convenient access in lambda functions. After that, we call the function forAllElements, which iterates in parallel over all elements in the segments. For each element, it invokes a user-provided lambda function. The lambda receives three arguments:

  • segmentIdx: index of the segment the element belongs to,
  • localIdx: index of the element within its segment,
  • globalIdx: global index of the element within the array data.

We use the globalIdx to assign the corresponding entry in data to the segment index.

/***
* Print the data managed by the segments.
*/
std::cout << "Data setup with no check ...\n";
std::cout << "Array: " << data << '\n';
auto fetch = [ = ] __cuda_callable__( int globalIdx ) -> double
{
return data_view[ globalIdx ];
};
std::cout << TNL::Algorithms::Segments::print( segments, fetch ) << '\n';

Next, we print the array data, which shows the segment membership of each element by its value. The layout of the elements depends on the segment type (i.e., the sparse format used). We also print the contents of data grouped by segments using the function printSegments, which iterates over all elements and accesses the values using a fetch lambda function.

Traversing segments this way includes even padding elements used by certain segment formats to achieve more efficient memory accesses. This is, however, not what we usually want to do. To skip the padding elements, we need to check the size of each segment explicitly, as demonstrated in the following code snippet from the same example.

/***
* Insert data into particular segments.
*/
data = 0.0;
segments,
[ = ] __cuda_callable__( int segmentIdx, int localIdx, int globalIdx ) mutable
{
if( localIdx <= segmentIdx )
data_view[ globalIdx ] = segmentIdx;
} );

Here we first erase the array data by setting all its elements to zero. Next, we traverse the segment elements in the same way, but inside the lambda function, we check the size of each segment, which in this case equals the segment index. Finally, we print the data managed by the segments as before.

/***
* Print the data managed by the segments.
*/
std::cout << "Data setup with check for padding elements...\n";
std::cout << "Array: " << data << '\n';
std::cout << TNL::Algorithms::Segments::print( segments, fetch ) << '\n';

The full example looks as follows:

1#include <iostream>
2#include <TNL/Containers/Vector.h>
3#include <TNL/Algorithms/Segments/CSR.h>
4#include <TNL/Algorithms/Segments/Ellpack.h>
5#include <TNL/Algorithms/Segments/traverse.h>
6#include <TNL/Devices/Host.h>
7#include <TNL/Devices/Cuda.h>
8
9template< typename Segments >
10void
11SegmentsExample()
12{
13 using Device = typename Segments::DeviceType;
14
16 /***
17 * Create segments with given segments sizes.
18 */
19 Segments segments{ 1, 2, 3, 4, 5 };
20
21 /***
22 * Allocate array for the segments;
23 */
24 TNL::Containers::Array< double, Device > data( segments.getStorageSize(), 0.0 );
26
28 /***
29 * Insert data into particular segments with no check.
30 */
31 auto data_view = data.getView();
33 segments,
34 [ = ] __cuda_callable__( int segmentIdx, int localIdx, int globalIdx ) mutable
35 {
36 data_view[ globalIdx ] = segmentIdx;
37 } );
39
41 /***
42 * Print the data managed by the segments.
43 */
44 std::cout << "Data setup with no check ...\n";
45 std::cout << "Array: " << data << '\n';
46 auto fetch = [ = ] __cuda_callable__( int globalIdx ) -> double
47 {
48 return data_view[ globalIdx ];
49 };
50 std::cout << TNL::Algorithms::Segments::print( segments, fetch ) << '\n';
52
54 /***
55 * Insert data into particular segments.
56 */
57 data = 0.0;
59 segments,
60 [ = ] __cuda_callable__( int segmentIdx, int localIdx, int globalIdx ) mutable
61 {
62 if( localIdx <= segmentIdx )
63 data_view[ globalIdx ] = segmentIdx;
64 } );
66
68 /***
69 * Print the data managed by the segments.
70 */
71 std::cout << "Data setup with check for padding elements...\n";
72 std::cout << "Array: " << data << '\n';
73 std::cout << TNL::Algorithms::Segments::print( segments, fetch ) << '\n';
75}
76
77int
78main( int argc, char* argv[] )
79{
80 std::cout << "Example of CSR segments on host:\n";
81 SegmentsExample< TNL::Algorithms::Segments::CSR< TNL::Devices::Host, int > >();
82
83 std::cout << "Example of Ellpack segments on host:\n";
84 SegmentsExample< TNL::Algorithms::Segments::Ellpack< TNL::Devices::Host, int > >();
85
86#ifdef __CUDACC__
87 std::cout << "Example of CSR segments on CUDA GPU:\n";
88 SegmentsExample< TNL::Algorithms::Segments::CSR< TNL::Devices::Cuda, int > >();
89
90 std::cout << "Example of Ellpack segments on CUDA GPU:\n";
91 SegmentsExample< TNL::Algorithms::Segments::Ellpack< TNL::Devices::Cuda, int > >();
92#endif
93 return EXIT_SUCCESS;
94}

Note, that for the Ellpack format, the output of the traversing through all elements looks as follows:

Seg. 0: [ 0, 0, 0, 0, 0 ]
Seg. 1: [ 1, 1, 1, 1, 1 ]
Seg. 2: [ 2, 2, 2, 2, 2 ]
Seg. 3: [ 3, 3, 3, 3, 3 ]
Seg. 4: [ 4, 4, 4, 4, 4 ]

But if we check the segments sizes and skip the padding elements, we get the following output:

Seg. 0: [ 0, 0, 0, 0, 0 ]
Seg. 1: [ 1, 1, 0, 0, 0 ]
Seg. 2: [ 2, 2, 2, 0, 0 ]
Seg. 3: [ 3, 3, 3, 3, 0 ]
Seg. 4: [ 4, 4, 4, 4, 4 ]

The result of the entire example looks as follows:

Example of CSR segments on host:
Data setup with no check ...
Array: [ 0, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 4 ]
Segment 0: [ 0 ]
Segment 1: [ 1, 1 ]
Segment 2: [ 2, 2, 2 ]
Segment 3: [ 3, 3, 3, 3 ]
Segment 4: [ 4, 4, 4, 4, 4 ]
Data setup with check for padding elements...
Array: [ 0, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 4 ]
Segment 0: [ 0 ]
Segment 1: [ 1, 1 ]
Segment 2: [ 2, 2, 2 ]
Segment 3: [ 3, 3, 3, 3 ]
Segment 4: [ 4, 4, 4, 4, 4 ]
Example of Ellpack segments on host:
Data setup with no check ...
Array: [ 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4 ]
Segment 0: [ 0, 0, 0, 0, 0 ]
Segment 1: [ 1, 1, 1, 1, 1 ]
Segment 2: [ 2, 2, 2, 2, 2 ]
Segment 3: [ 3, 3, 3, 3, 3 ]
Segment 4: [ 4, 4, 4, 4, 4 ]
Data setup with check for padding elements...
Array: [ 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 2, 2, 2, 0, 0, 3, 3, 3, 3, 0, 4, 4, 4, 4, 4 ]
Segment 0: [ 0, 0, 0, 0, 0 ]
Segment 1: [ 1, 1, 0, 0, 0 ]
Segment 2: [ 2, 2, 2, 0, 0 ]
Segment 3: [ 3, 3, 3, 3, 0 ]
Segment 4: [ 4, 4, 4, 4, 4 ]
Example of CSR segments on CUDA GPU:
Data setup with no check ...
Array: [ 0, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 4 ]
Segment 0: [ 0 ]
Segment 1: [ 1, 1 ]
Segment 2: [ 2, 2, 2 ]
Segment 3: [ 3, 3, 3, 3 ]
Segment 4: [ 4, 4, 4, 4, 4 ]
Data setup with check for padding elements...
Array: [ 0, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 4 ]
Segment 0: [ 0 ]
Segment 1: [ 1, 1 ]
Segment 2: [ 2, 2, 2 ]
Segment 3: [ 3, 3, 3, 3 ]
Segment 4: [ 4, 4, 4, 4, 4 ]
Example of Ellpack segments on CUDA GPU:
Data setup with no check ...
Array: [ 0, 1, 2, 3, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 3, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 3, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 3, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 3, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ]
Segment 0: [ 0, 0, 0, 0, 0 ]
Segment 1: [ 1, 1, 1, 1, 1 ]
Segment 2: [ 2, 2, 2, 2, 2 ]
Segment 3: [ 3, 3, 3, 3, 3 ]
Segment 4: [ 4, 4, 4, 4, 4 ]
Data setup with check for padding elements...
Array: [ 0, 1, 2, 3, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 3, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 3, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ]
Segment 0: [ 0, 0, 0, 0, 0 ]
Segment 1: [ 1, 1, 0, 0, 0 ]
Segment 2: [ 2, 2, 2, 0, 0 ]
Segment 3: [ 3, 3, 3, 3, 0 ]
Segment 4: [ 4, 4, 4, 4, 4 ]

Note, that the function Segments_forElements_range allows to specify range of segments over which we aim to traverse.

Function forElements with segment indexes

The function forElements also allows traversing only selected segments, based on their indexes. We begin with the same setup as in the previous example:

/***
* Create segments with given segments sizes.
*/
Segments segments{ 1, 2, 3, 4, 5 };
/***
* Allocate array for the segments;
*/
TNL::Containers::Array< double, Device > data( segments.getStorageSize(), 0.0 );

Next, we iterate only over even-indexed segments:

/***
* Create array with the indexes of segments we want to iterate over.
*/
TNL::Containers::Array< int, Device > segmentIndexes{ 0, 2, 4 };
/***
* Insert data into particular segments with no check.
*/
auto data_view = data.getView();
segments,
segmentIndexes,
[ = ] __cuda_callable__( int segmentIdx, int localIdx, int globalIdx ) mutable
{
if( localIdx <= segmentIdx )
data_view[ globalIdx ] = segmentIdx;
} );

We first create the array segmentIndexes that stores the indexes of even segments. This array is passed as the second argument to the forElements function, which then processes only those specified segments.

Finally, we print the contents of the data array managed by the segments in the same way as in the previous example.

The full example is shown below:

1#include <iostream>
2#include <TNL/Containers/Vector.h>
3#include <TNL/Algorithms/Segments/CSR.h>
4#include <TNL/Algorithms/Segments/Ellpack.h>
5#include <TNL/Algorithms/Segments/traverse.h>
6#include <TNL/Devices/Host.h>
7#include <TNL/Devices/Cuda.h>
8
9template< typename Segments >
10void
11SegmentsExample()
12{
13 using Device = typename Segments::DeviceType;
14
16 /***
17 * Create segments with given segments sizes.
18 */
19 Segments segments{ 1, 2, 3, 4, 5 };
20
21 /***
22 * Allocate array for the segments;
23 */
24 TNL::Containers::Array< double, Device > data( segments.getStorageSize(), 0.0 );
26
28 /***
29 * Create array with the indexes of segments we want to iterate over.
30 */
31 TNL::Containers::Array< int, Device > segmentIndexes{ 0, 2, 4 };
32
33 /***
34 * Insert data into particular segments with no check.
35 */
36 auto data_view = data.getView();
38 segments,
39 segmentIndexes,
40 [ = ] __cuda_callable__( int segmentIdx, int localIdx, int globalIdx ) mutable
41 {
42 if( localIdx <= segmentIdx )
43 data_view[ globalIdx ] = segmentIdx;
44 } );
46
47 /***
48 * Print the data managed by the segments.
49 */
50 std::cout << "Array: " << data << '\n';
51 auto fetch = [ = ] __cuda_callable__( int globalIdx ) -> double
52 {
53 return data_view[ globalIdx ];
54 };
55 std::cout << TNL::Algorithms::Segments::print( segments, fetch ) << '\n';
56}
57
58int
59main( int argc, char* argv[] )
60{
61 std::cout << "Example of CSR segments on host:\n";
62 SegmentsExample< TNL::Algorithms::Segments::CSR< TNL::Devices::Host, int > >();
63
64 std::cout << "Example of Ellpack segments on host:\n";
65 SegmentsExample< TNL::Algorithms::Segments::Ellpack< TNL::Devices::Host, int > >();
66
67#ifdef __CUDACC__
68 std::cout << "Example of CSR segments on CUDA GPU:\n";
69 SegmentsExample< TNL::Algorithms::Segments::CSR< TNL::Devices::Cuda, int > >();
70
71 std::cout << "Example of Ellpack segments on CUDA GPU:\n";
72 SegmentsExample< TNL::Algorithms::Segments::Ellpack< TNL::Devices::Cuda, int > >();
73#endif
74 return EXIT_SUCCESS;
75}
void forElements(const Segments &segments, IndexBegin begin, IndexEnd end, Function &&function, LaunchConfiguration launchConfig=Algorithms::Segments::LaunchConfiguration())
Iterates in parallel over all elements in the given range of segments and applies the specified lambd...

The result of the entire example looks as follows:

Example of CSR segments on host:
Array: [ 0, 0, 0, 2, 2, 2, 0, 0, 0, 0, 4, 4, 4, 4, 4 ]
Segment 0: [ 0 ]
Segment 1: [ 0, 0 ]
Segment 2: [ 2, 2, 2 ]
Segment 3: [ 0, 0, 0, 0 ]
Segment 4: [ 4, 4, 4, 4, 4 ]
Example of Ellpack segments on host:
Array: [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 4, 4, 4, 4, 4 ]
Segment 0: [ 0, 0, 0, 0, 0 ]
Segment 1: [ 0, 0, 0, 0, 0 ]
Segment 2: [ 2, 2, 2, 0, 0 ]
Segment 3: [ 0, 0, 0, 0, 0 ]
Segment 4: [ 4, 4, 4, 4, 4 ]
Example of CSR segments on CUDA GPU:
Array: [ 0, 0, 0, 2, 2, 2, 0, 0, 0, 0, 4, 4, 4, 4, 4 ]
Segment 0: [ 0 ]
Segment 1: [ 0, 0 ]
Segment 2: [ 2, 2, 2 ]
Segment 3: [ 0, 0, 0, 0 ]
Segment 4: [ 4, 4, 4, 4, 4 ]
Example of Ellpack segments on CUDA GPU:
Array: [ 0, 0, 2, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ]
Segment 0: [ 0, 0, 0, 0, 0 ]
Segment 1: [ 0, 0, 0, 0, 0 ]
Segment 2: [ 2, 2, 2, 0, 0 ]
Segment 3: [ 0, 0, 0, 0, 0 ]
Segment 4: [ 4, 4, 4, 4, 4 ]

Function forElements with condition on segment indexes

Traversing only the elements of even-indexed segments can also be achieved using the function forElementsIf. This function allows you to specify which segments to traverse using a condition expressed as a lambda function.

The usage is demonstrated in the following example. The setup is the same as in the previous examples. The traversing part looks as follows:

/***
* Insert data into particular segments with no check.
*/
auto data_view = data.getView();
segments,
[ = ] __cuda_callable__( int segmentIdx ) -> bool
{
return segmentIdx % 2 == 0; // Iterate only over even-indexed segments.
},
[ = ] __cuda_callable__( int segmentIdx, int localIdx, int globalIdx ) mutable
{
if( localIdx <= segmentIdx )
data_view[ globalIdx ] = segmentIdx;
} );

Instead of array with segments indexes, the function forAllElementsIf takes a lambda function.

[ = ] __cuda_callable__( int segmentIdx ) -> bool
{
return segmentIdx % 2 == 0; // Iterate only over even-indexed segments.
},

This lambda expresses the condition that a segment must fulfill to be included in the traversal. It accepts a single argument, segmentIdx, which represents the segment index used in the condition.

The remainder of the example follows the same structure as before:

1#include <iostream>
2#include <TNL/Containers/Vector.h>
3#include <TNL/Algorithms/Segments/CSR.h>
4#include <TNL/Algorithms/Segments/Ellpack.h>
5#include <TNL/Algorithms/Segments/traverse.h>
6#include <TNL/Devices/Host.h>
7#include <TNL/Devices/Cuda.h>
8
9template< typename Segments >
10void
11SegmentsExample()
12{
13 using Device = typename Segments::DeviceType;
14
15 /***
16 * Create segments with given segments sizes.
17 */
18 Segments segments{ 1, 2, 3, 4, 5 };
19
20 /***
21 * Allocate array for the segments;
22 */
23 TNL::Containers::Array< double, Device > data( segments.getStorageSize(), 0.0 );
24
26 /***
27 * Insert data into particular segments with no check.
28 */
29 auto data_view = data.getView();
31 segments,
33 [ = ] __cuda_callable__( int segmentIdx ) -> bool
34 {
35 return segmentIdx % 2 == 0; // Iterate only over even-indexed segments.
36 },
38 [ = ] __cuda_callable__( int segmentIdx, int localIdx, int globalIdx ) mutable
39 {
40 if( localIdx <= segmentIdx )
41 data_view[ globalIdx ] = segmentIdx;
42 } );
44
45 /***
46 * Print the data managed by the segments.
47 */
48 std::cout << "Array: " << data << '\n';
49 auto fetch = [ = ] __cuda_callable__( int globalIdx ) -> double
50 {
51 return data_view[ globalIdx ];
52 };
53 std::cout << TNL::Algorithms::Segments::print( segments, fetch ) << '\n';
54}
55
56int
57main( int argc, char* argv[] )
58{
59 std::cout << "Example of CSR segments on host:\n";
60 SegmentsExample< TNL::Algorithms::Segments::CSR< TNL::Devices::Host, int > >();
61
62 std::cout << "Example of Ellpack segments on host:\n";
63 SegmentsExample< TNL::Algorithms::Segments::Ellpack< TNL::Devices::Host, int > >();
64
65#ifdef __CUDACC__
66 std::cout << "Example of CSR segments on CUDA GPU:\n";
67 SegmentsExample< TNL::Algorithms::Segments::CSR< TNL::Devices::Cuda, int > >();
68
69 std::cout << "Example of Ellpack segments on CUDA GPU:\n";
70 SegmentsExample< TNL::Algorithms::Segments::Ellpack< TNL::Devices::Cuda, int > >();
71#endif
72 return EXIT_SUCCESS;
73}
void forAllElementsIf(const Segments &segments, Condition condition, Function function, LaunchConfiguration launchConfig=Algorithms::Segments::LaunchConfiguration())
Iterates in parallel over all elements of all segments based on a condition.

The result of the entire example is:

Example of CSR segments on host:
Array: [ 0, 0, 0, 2, 2, 2, 0, 0, 0, 0, 4, 4, 4, 4, 4 ]
Segment 0: [ 0 ]
Segment 1: [ 0, 0 ]
Segment 2: [ 2, 2, 2 ]
Segment 3: [ 0, 0, 0, 0 ]
Segment 4: [ 4, 4, 4, 4, 4 ]
Example of Ellpack segments on host:
Array: [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 4, 4, 4, 4, 4 ]
Segment 0: [ 0, 0, 0, 0, 0 ]
Segment 1: [ 0, 0, 0, 0, 0 ]
Segment 2: [ 2, 2, 2, 0, 0 ]
Segment 3: [ 0, 0, 0, 0, 0 ]
Segment 4: [ 4, 4, 4, 4, 4 ]
Example of CSR segments on CUDA GPU:
Array: [ 0, 0, 0, 2, 2, 2, 0, 0, 0, 0, 4, 4, 4, 4, 4 ]
Segment 0: [ 0 ]
Segment 1: [ 0, 0 ]
Segment 2: [ 2, 2, 2 ]
Segment 3: [ 0, 0, 0, 0 ]
Segment 4: [ 4, 4, 4, 4, 4 ]
Example of Ellpack segments on CUDA GPU:
Array: [ 0, 0, 2, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ]
Segment 0: [ 0, 0, 0, 0, 0 ]
Segment 1: [ 0, 0, 0, 0, 0 ]
Segment 2: [ 2, 2, 2, 0, 0 ]
Segment 3: [ 0, 0, 0, 0, 0 ]
Segment 4: [ 4, 4, 4, 4, 4 ]

Note that there is also the function TNL::Algorithms::Segments::forElementsIf which allows for conditional traversing of segments within a given range.

Function forSegments

The function forSegments iterates in parallel over selected segments. However, the iteration over elements within each segment is sequential.

There are two main reasons for this behavior:

  1. Sequential dependency: The computation for an element may depend on the result of the computation for the previous element in the same segment. In such cases, sequential processing within a segment is necessary to preserve correctness.
  2. Common computations within a segment: If part of the computation is the same for all elements in a segment, it is more efficient to compute it once per segment, then process the elements. The function forElements does not allow sharing data between different elements in a segment, and would therefore repeat the common computation for every element, which is inefficient.

Sequential dependency

The first situation — where sequential processing is required — is demonstrated in the following example:

/***
* Insert data into particular segments.
*/
auto data_view = data.getView();
using SegmentViewType = typename Segments::SegmentViewType;
segments,
[ = ] __cuda_callable__( const SegmentViewType& segment ) mutable
{
double sum( 0.0 );
for( auto element : segment )
if( element.localIndex() <= element.segmentIndex() ) {
sum += element.localIndex() + 1;
data_view[ element.globalIndex() ] = sum;
}
} );

In this example, we compute the cumulative sum of all elements within each segment. This is a typical case where the result for one element depends on the result of the previous element.

The lambda function passed to forAllSegments receives a segment view, and then uses a for loop to iterate over all elements of the segment. Each element is represented by a variable element of type TNL::Algorithms::Segments::SegmentElement, which provides access to methods:

  • localIndex – index of the element within the segment
  • segmentIndex – index of the segment
  • globalIndex – global index in the data array

To perform the cumulative sum, we use an auxiliary variable sum, which accumulates the values of the elements as we iterate over them. This demonstrates the sequential dependency that prevents parallel execution at the element level within a segment.

The full example looks as follows:

1#include <iostream>
2#include <TNL/Containers/Vector.h>
3#include <TNL/Algorithms/Segments/CSR.h>
4#include <TNL/Algorithms/Segments/Ellpack.h>
5#include <TNL/Algorithms/Segments/traverse.h>
6#include <TNL/Devices/Host.h>
7#include <TNL/Devices/Cuda.h>
8
9template< typename Segments >
10void
11SegmentsExample()
12{
14 using Device = typename Segments::DeviceType;
15
16 /***
17 * Create segments with given segments sizes.
18 */
19 Segments segments{ 1, 2, 3, 4, 5 };
20
21 /***
22 * Allocate array for the segments;
23 */
24 TNL::Containers::Array< double, Device > data( segments.getStorageSize(), 0.0 );
26
28 /***
29 * Insert data into particular segments.
30 */
31 auto data_view = data.getView();
32 using SegmentViewType = typename Segments::SegmentViewType;
34 segments,
35 [ = ] __cuda_callable__( const SegmentViewType& segment ) mutable
36 {
37 double sum( 0.0 );
38 for( auto element : segment )
39 if( element.localIndex() <= element.segmentIndex() ) {
40 sum += element.localIndex() + 1;
41 data_view[ element.globalIndex() ] = sum;
42 }
43 } );
45
47 /***
48 * Print the data managed by the segments.
49 */
50 auto fetch = [ = ] __cuda_callable__( int globalIdx ) -> double
51 {
52 return data_view[ globalIdx ];
53 };
54 std::cout << TNL::Algorithms::Segments::print( segments, fetch ) << '\n';
56}
57
58int
59main( int argc, char* argv[] )
60{
61 std::cout << "Example of CSR segments on host:\n";
62 SegmentsExample< TNL::Algorithms::Segments::CSR< TNL::Devices::Host, int > >();
63
64 std::cout << "Example of Ellpack segments on host:\n";
65 SegmentsExample< TNL::Algorithms::Segments::Ellpack< TNL::Devices::Host, int > >();
66
67#ifdef __CUDACC__
68 std::cout << "Example of CSR segments on CUDA GPU:\n";
69 SegmentsExample< TNL::Algorithms::Segments::CSR< TNL::Devices::Cuda, int > >();
70
71 std::cout << "Example of Ellpack segments on CUDA GPU:\n";
72 SegmentsExample< TNL::Algorithms::Segments::Ellpack< TNL::Devices::Cuda, int > >();
73#endif
74 return EXIT_SUCCESS;
75}
void forAllSegments(const Segments &segments, Function &&function, LaunchConfiguration launchConfig=Algorithms::Segments::LaunchConfiguration())
Iterates in parallel over all segments and applies the given lambda function to each segment.

The result looks as follows:

Example of CSR segments on host:
Segment 0: [ 1 ]
Segment 1: [ 1, 3 ]
Segment 2: [ 1, 3, 6 ]
Segment 3: [ 1, 3, 6, 10 ]
Segment 4: [ 1, 3, 6, 10, 15 ]
Example of Ellpack segments on host:
Segment 0: [ 1, 0, 0, 0, 0 ]
Segment 1: [ 1, 3, 0, 0, 0 ]
Segment 2: [ 1, 3, 6, 0, 0 ]
Segment 3: [ 1, 3, 6, 10, 0 ]
Segment 4: [ 1, 3, 6, 10, 15 ]
Example of CSR segments on CUDA GPU:
Segment 0: [ 1 ]
Segment 1: [ 1, 3 ]
Segment 2: [ 1, 3, 6 ]
Segment 3: [ 1, 3, 6, 10 ]
Segment 4: [ 1, 3, 6, 10, 15 ]
Example of Ellpack segments on CUDA GPU:
Segment 0: [ 1, 0, 0, 0, 0 ]
Segment 1: [ 1, 3, 0, 0, 0 ]
Segment 2: [ 1, 3, 6, 0, 0 ]
Segment 3: [ 1, 3, 6, 10, 0 ]
Segment 4: [ 1, 3, 6, 10, 15 ]

Note: The cumulative sum of elements within segments — also referred to as a scan or prefix sum — can also be computed using the functions described in Overview of Segment Scan Functions.

Common computations

Now let’s take a look at the second situation — when there are common computations shared by all elements of a segment.

In the following example, we use the function forAllSegments to normalize each element by dividing it by the sum of all values in its segment.

/***
* Divide elements in each segment by a sum of all elements in the segment
*/
using SegmentViewType = typename SegmentsType::SegmentViewType;
segments,
[ = ] __cuda_callable__( const SegmentViewType& segment ) mutable
{
// Compute the sum first ...
double sum = 0.0;
for( auto element : segment )
if( element.localIndex() <= element.segmentIndex() )
sum += data_view[ element.globalIndex() ];
// ... divide all elements.
for( auto element : segment )
if( element.localIndex() <= element.segmentIndex() )
data_view[ element.globalIndex() ] /= sum;
} );

This process includes two steps inside the segment traversal:

  • First, we compute the sum of all elements in the segment. This is the common computation for the entire segment.
  • Then, we iterate over the elements again and divide each by the segment sum, stored in the variable sum.

The full example looks as follows:

1#include <iostream>
2#include <TNL/Containers/Vector.h>
3#include <TNL/Algorithms/Segments/CSR.h>
4#include <TNL/Algorithms/Segments/traverse.h>
5#include <TNL/Devices/Host.h>
6#include <TNL/Devices/Cuda.h>
7
8template< typename Device >
9void
10SegmentsExample()
11{
12 using SegmentsType = typename TNL::Algorithms::Segments::CSR< Device, int >;
13
14 /***
15 * Create segments with given segments sizes.
16 */
17 SegmentsType segments{ 1, 2, 3, 4, 5 };
18
19 /***
20 * Allocate array for the segments;
21 */
22 TNL::Containers::Array< double, Device > data( segments.getStorageSize(), 0.0 );
23
24 /***
25 * Insert data into particular segments.
26 */
27 auto data_view = data.getView();
29 segments,
30 [ = ] __cuda_callable__( int segmentIdx, int localIdx, int globalIdx ) mutable
31 {
32 data_view[ globalIdx ] = localIdx + 1;
33 } );
34
35 /***
36 * Print the data by the segments.
37 */
38 std::cout << "Values of elements after initial setup:\n";
39 auto fetch = [ = ] __cuda_callable__( int globalIdx ) -> double
40 {
41 return data_view[ globalIdx ];
42 };
43 std::cout << TNL::Algorithms::Segments::print( segments, fetch ) << '\n';
44
46 /***
47 * Divide elements in each segment by a sum of all elements in the segment
48 */
49 using SegmentViewType = typename SegmentsType::SegmentViewType;
51 segments,
52 [ = ] __cuda_callable__( const SegmentViewType& segment ) mutable
53 {
54 // Compute the sum first ...
55 double sum = 0.0;
56 for( auto element : segment )
57 if( element.localIndex() <= element.segmentIndex() )
58 sum += data_view[ element.globalIndex() ];
59 // ... divide all elements.
60 for( auto element : segment )
61 if( element.localIndex() <= element.segmentIndex() )
62 data_view[ element.globalIndex() ] /= sum;
63 } );
65
66 /***
67 * Print the data managed by the segments.
68 */
69 std::cout << "Value of elements after dividing by sum in each segment:\n";
70 std::cout << TNL::Algorithms::Segments::print( segments, fetch ) << '\n';
71}
72
73int
74main( int argc, char* argv[] )
75{
76 std::cout << "Example of CSR segments on host:\n";
77 SegmentsExample< TNL::Devices::Host >();
78
79#ifdef __CUDACC__
80 std::cout << "Example of CSR segments on CUDA GPU:\n";
81 SegmentsExample< TNL::Devices::Cuda >();
82#endif
83 return EXIT_SUCCESS;
84}
Data structure for CSR segments.
Definition CSR.h:37

The result looks as:

Example of CSR segments on host:
Values of elements after initial setup:
Segment 0: [ 1 ]
Segment 1: [ 1, 2 ]
Segment 2: [ 1, 2, 3 ]
Segment 3: [ 1, 2, 3, 4 ]
Segment 4: [ 1, 2, 3, 4, 5 ]
Value of elements after dividing by sum in each segment:
Segment 0: [ 1 ]
Segment 1: [ 0.333333, 0.666667 ]
Segment 2: [ 0.166667, 0.333333, 0.5 ]
Segment 3: [ 0.1, 0.2, 0.3, 0.4 ]
Segment 4: [ 0.0666667, 0.133333, 0.2, 0.266667, 0.333333 ]
Example of CSR segments on CUDA GPU:
Values of elements after initial setup:
Segment 0: [ 1 ]
Segment 1: [ 1, 2 ]
Segment 2: [ 1, 2, 3 ]
Segment 3: [ 1, 2, 3, 4 ]
Segment 4: [ 1, 2, 3, 4, 5 ]
Value of elements after dividing by sum in each segment:
Segment 0: [ 1 ]
Segment 1: [ 0.333333, 0.666667 ]
Segment 2: [ 0.166667, 0.333333, 0.5 ]
Segment 3: [ 0.1, 0.2, 0.3, 0.4 ]
Segment 4: [ 0.0666667, 0.133333, 0.2, 0.266667, 0.333333 ]

Function forSegments with segment indexes

Similar to the function TNL::Algorithms::Segments::forElements, the function TNL::Algorithms::Segments::forSegments also allows you to specify a range of segments to traverse, or explicitly provide the indexes of segments to be traversed. The later is demonstrated in the following code snippet:

/***
* Create array with the indexes of segments we want to iterate over.
*/
TNL::Containers::Array< int, Device > segmentIndexes{ 0, 2, 4 };
/***
* Compute cumulative sums in particular segments.
*/
using SegmentViewType = typename Segments::SegmentViewType;
segments,
segmentIndexes,
[ = ] __cuda_callable__( const SegmentViewType& segment ) mutable
{
double sum( 0.0 );
for( auto element : segment )
if( element.localIndex() <= element.segmentIndex() ) {
sum += element.localIndex() + 1;
data_view[ element.globalIndex() ] = sum;
}
} );

The example computes cumulative sums in the same way as the previous example, but only for the selected segments.

The full example reads as:

1#include <iostream>
2#include <TNL/Containers/Vector.h>
3#include <TNL/Algorithms/Segments/CSR.h>
4#include <TNL/Algorithms/Segments/Ellpack.h>
5#include <TNL/Algorithms/Segments/traverse.h>
6#include <TNL/Devices/Host.h>
7#include <TNL/Devices/Cuda.h>
8
9template< typename Segments >
10void
11SegmentsExample()
12{
13 using Device = typename Segments::DeviceType;
14
15 /***
16 * Create segments with given segments sizes.
17 */
18 Segments segments{ 1, 2, 3, 4, 5 };
19
20 /***
21 * Allocate array for the segments;
22 */
23 TNL::Containers::Array< double, Device > data( segments.getStorageSize(), 0.0 );
24
25 /***
26 * Insert data into particular segments.
27 */
28 auto data_view = data.getView();
30 segments,
31 [ = ] __cuda_callable__( int segmentIdx, int localIdx, int globalIdx ) mutable
32 {
33 data_view[ globalIdx ] = localIdx + 1;
34 } );
35
36 /***
37 * Print the data by the segments.
38 */
39 std::cout << "Values of elements after initial setup:\n";
40 auto fetch = [ = ] __cuda_callable__( int globalIdx ) -> double
41 {
42 return data_view[ globalIdx ];
43 };
44 std::cout << TNL::Algorithms::Segments::print( segments, fetch ) << '\n';
45
47 /***
48 * Create array with the indexes of segments we want to iterate over.
49 */
50 TNL::Containers::Array< int, Device > segmentIndexes{ 0, 2, 4 };
51
52 /***
53 * Compute cumulative sums in particular segments.
54 */
55 using SegmentViewType = typename Segments::SegmentViewType;
57 segments,
58 segmentIndexes,
59 [ = ] __cuda_callable__( const SegmentViewType& segment ) mutable
60 {
61 double sum( 0.0 );
62 for( auto element : segment )
63 if( element.localIndex() <= element.segmentIndex() ) {
64 sum += element.localIndex() + 1;
65 data_view[ element.globalIndex() ] = sum;
66 }
67 } );
69
70 /***
71 * Print the data managed by the segments.
72 */
73 std::cout << TNL::Algorithms::Segments::print( segments, fetch ) << '\n';
74}
75
76int
77main( int argc, char* argv[] )
78{
79 std::cout << "Example of CSR segments on host:\n";
80 SegmentsExample< TNL::Algorithms::Segments::CSR< TNL::Devices::Host, int > >();
81
82 std::cout << "Example of Ellpack segments on host:\n";
83 SegmentsExample< TNL::Algorithms::Segments::Ellpack< TNL::Devices::Host, int > >();
84
85#ifdef __CUDACC__
86 std::cout << "Example of CSR segments on CUDA GPU:\n";
87 SegmentsExample< TNL::Algorithms::Segments::CSR< TNL::Devices::Cuda, int > >();
88
89 std::cout << "Example of Ellpack segments on CUDA GPU:\n";
90 SegmentsExample< TNL::Algorithms::Segments::Ellpack< TNL::Devices::Cuda, int > >();
91#endif
92 return EXIT_SUCCESS;
93}
void forSegments(const Segments &segments, IndexBegin begin, IndexEnd end, Function &&function, LaunchConfiguration launchConfig=Algorithms::Segments::LaunchConfiguration())
Iterates in parallel over segments within the specified range of segment indexes and applies the give...

And the result looks as:

Example of CSR segments on host:
Values of elements after initial setup:
Segment 0: [ 1 ]
Segment 1: [ 1, 2 ]
Segment 2: [ 1, 2, 3 ]
Segment 3: [ 1, 2, 3, 4 ]
Segment 4: [ 1, 2, 3, 4, 5 ]
Segment 0: [ 1 ]
Segment 1: [ 1, 2 ]
Segment 2: [ 1, 3, 6 ]
Segment 3: [ 1, 2, 3, 4 ]
Segment 4: [ 1, 3, 6, 10, 15 ]
Example of Ellpack segments on host:
Values of elements after initial setup:
Segment 0: [ 1, 2, 3, 4, 5 ]
Segment 1: [ 1, 2, 3, 4, 5 ]
Segment 2: [ 1, 2, 3, 4, 5 ]
Segment 3: [ 1, 2, 3, 4, 5 ]
Segment 4: [ 1, 2, 3, 4, 5 ]
Segment 0: [ 1, 2, 3, 4, 5 ]
Segment 1: [ 1, 2, 3, 4, 5 ]
Segment 2: [ 1, 3, 6, 4, 5 ]
Segment 3: [ 1, 2, 3, 4, 5 ]
Segment 4: [ 1, 3, 6, 10, 15 ]
Example of CSR segments on CUDA GPU:
Values of elements after initial setup:
Segment 0: [ 1 ]
Segment 1: [ 1, 2 ]
Segment 2: [ 1, 2, 3 ]
Segment 3: [ 1, 2, 3, 4 ]
Segment 4: [ 1, 2, 3, 4, 5 ]
Segment 0: [ 1 ]
Segment 1: [ 1, 2 ]
Segment 2: [ 1, 3, 6 ]
Segment 3: [ 1, 2, 3, 4 ]
Segment 4: [ 1, 3, 6, 10, 15 ]
Example of Ellpack segments on CUDA GPU:
Values of elements after initial setup:
Segment 0: [ 1, 2, 3, 4, 5 ]
Segment 1: [ 1, 2, 3, 4, 5 ]
Segment 2: [ 1, 2, 3, 4, 5 ]
Segment 3: [ 1, 2, 3, 4, 5 ]
Segment 4: [ 1, 2, 3, 4, 5 ]
Segment 0: [ 1, 2, 3, 4, 5 ]
Segment 1: [ 1, 2, 3, 4, 5 ]
Segment 2: [ 1, 3, 6, 4, 5 ]
Segment 3: [ 1, 2, 3, 4, 5 ]
Segment 4: [ 1, 3, 6, 10, 15 ]

Function forSegments with condition on segment indexes

Another way to specify which segments to traverse is by using a condition. For this purpose, you can use the function TNL::Algorithms::Segments::forSegmentsIf.

This is demonstrated in the following code snippet:

/***
* Compute cumulative sums in particular segments.
*/
using SegmentViewType = typename Segments::SegmentViewType;
segments,
[ = ] __cuda_callable__( const int segmentIdx ) -> bool
{
return segmentIdx % 2 == 0; // Iterate only over even-indexed segments.
},
[ = ] __cuda_callable__( const SegmentViewType& segment ) mutable
{
double sum( 0.0 );
for( auto element : segment )
if( element.localIndex() <= element.segmentIndex() ) {
sum += element.localIndex() + 1;
data_view[ element.globalIndex() ] = sum;
}
} );

The example computes cumulative sums, just like in the previous examples, but this time only for the even-indexed segments, based on the condition provided in a lambda function.

The full example reads as:

1#include <iostream>
2#include <TNL/Containers/Vector.h>
3#include <TNL/Algorithms/Segments/CSR.h>
4#include <TNL/Algorithms/Segments/Ellpack.h>
5#include <TNL/Algorithms/Segments/traverse.h>
6#include <TNL/Devices/Host.h>
7#include <TNL/Devices/Cuda.h>
8
9template< typename Segments >
10void
11SegmentsExample()
12{
13 using Device = typename Segments::DeviceType;
14
15 /***
16 * Create segments with given segments sizes.
17 */
18 Segments segments{ 1, 2, 3, 4, 5 };
19
20 /***
21 * Allocate array for the segments;
22 */
23 TNL::Containers::Array< double, Device > data( segments.getStorageSize(), 0.0 );
24
25 /***
26 * Insert data into particular segments.
27 */
28 auto data_view = data.getView();
30 segments,
31 [ = ] __cuda_callable__( int segmentIdx, int localIdx, int globalIdx ) mutable
32 {
33 data_view[ globalIdx ] = localIdx + 1;
34 } );
35
36 /***
37 * Print the data by the segments.
38 */
39 std::cout << "Values of elements after initial setup:\n";
40 auto fetch = [ = ] __cuda_callable__( int globalIdx ) -> double
41 {
42 return data_view[ globalIdx ];
43 };
44 std::cout << TNL::Algorithms::Segments::print( segments, fetch ) << '\n';
45
47 /***
48 * Compute cumulative sums in particular segments.
49 */
50 using SegmentViewType = typename Segments::SegmentViewType;
52 segments,
53 [ = ] __cuda_callable__( const int segmentIdx ) -> bool
54 {
55 return segmentIdx % 2 == 0; // Iterate only over even-indexed segments.
56 },
57 [ = ] __cuda_callable__( const SegmentViewType& segment ) mutable
58 {
59 double sum( 0.0 );
60 for( auto element : segment )
61 if( element.localIndex() <= element.segmentIndex() ) {
62 sum += element.localIndex() + 1;
63 data_view[ element.globalIndex() ] = sum;
64 }
65 } );
67
68 /***
69 * Print the data managed by the segments.
70 */
71 std::cout << TNL::Algorithms::Segments::print( segments, fetch ) << '\n';
72}
73
74int
75main( int argc, char* argv[] )
76{
77 std::cout << "Example of CSR segments on host:\n";
78 SegmentsExample< TNL::Algorithms::Segments::CSR< TNL::Devices::Host, int > >();
79
80 std::cout << "Example of Ellpack segments on host:\n";
81 SegmentsExample< TNL::Algorithms::Segments::Ellpack< TNL::Devices::Host, int > >();
82
83#ifdef __CUDACC__
84 std::cout << "Example of CSR segments on CUDA GPU:\n";
85 SegmentsExample< TNL::Algorithms::Segments::CSR< TNL::Devices::Cuda, int > >();
86
87 std::cout << "Example of Ellpack segments on CUDA GPU:\n";
88 SegmentsExample< TNL::Algorithms::Segments::Ellpack< TNL::Devices::Cuda, int > >();
89#endif
90 return EXIT_SUCCESS;
91}
void forAllSegmentsIf(const Segments &segments, SegmentCondition &&segmentCondition, Function &&function, LaunchConfiguration launchConfig=Algorithms::Segments::LaunchConfiguration())
Iterates in parallel over all segments, applying a condition to determine whether each segment should...

And the result looks as:

Example of CSR segments on host:
Segment 0: [ 1 ]
Segment 1: [ 1, 3 ]
Segment 2: [ 1, 3, 6 ]
Segment 3: [ 1, 3, 6, 10 ]
Segment 4: [ 1, 3, 6, 10, 15 ]
Example of Ellpack segments on host:
Segment 0: [ 1, 0, 0, 0, 0 ]
Segment 1: [ 1, 3, 0, 0, 0 ]
Segment 2: [ 1, 3, 6, 0, 0 ]
Segment 3: [ 1, 3, 6, 10, 0 ]
Segment 4: [ 1, 3, 6, 10, 15 ]
Example of CSR segments on CUDA GPU:
Segment 0: [ 1 ]
Segment 1: [ 1, 3 ]
Segment 2: [ 1, 3, 6 ]
Segment 3: [ 1, 3, 6, 10 ]
Segment 4: [ 1, 3, 6, 10, 15 ]
Example of Ellpack segments on CUDA GPU:
Segment 0: [ 1, 0, 0, 0, 0 ]
Segment 1: [ 1, 3, 0, 0, 0 ]
Segment 2: [ 1, 3, 6, 0, 0 ]
Segment 3: [ 1, 3, 6, 10, 0 ]
Segment 4: [ 1, 3, 6, 10, 15 ]

Flexible reduction within segments

In this section, we explain an extension of flexible reduction to segments. It enables you to reduce all elements within the same segment and store the result in an output array.

There are several ways to specify which segments should be included in the reduction:

  1. All Segments Use the functions reduceSegments and reduceAllSegments to perform reduction over all segments (or a specified range of segments).
  2. Selected Segments by Index The same function, reduceSegments, can also be provided with an array of segment indices. In this case, reduction is performed only within the specified segments.
  3. Conditional Segment Selection Use the functions reduceSegmentsIf and reduceAllSegmentsIf to perform reduction only in segments that fulfill a given condition, which is specified using a lambda function.

Function reduceSegments

Redcution within segments is demonstrated in the following code snippet:

/***
* Compute sums of elements in each segment.
*/
auto sums_view = sums.getView();
auto fetch_full = [ = ] __cuda_callable__( int segmentIdx, int localIdx, int globalIdx ) -> double
{
if( localIdx <= segmentIdx )
return data_view[ globalIdx ];
else
return 0.0;
};
auto fetch_brief = [ = ] __cuda_callable__( int globalIdx ) -> double
{
return data_view[ globalIdx ];
};
auto store = [ = ] __cuda_callable__( int segmentIdx, const double& value ) mutable
{
sums_view[ segmentIdx ] = value;
};
TNL::Algorithms::Segments::reduceAllSegments( segments, fetch_full, TNL::Plus{}, store );
std::cout << "The sums with full fetch form are: " << sums << '\n';
TNL::Algorithms::Segments::reduceAllSegments( segments, fetch_brief, TNL::Plus{}, store );
std::cout << "The sums with brief fetch form are: " << sums << "\n\n";

The function reduceAllSegments, which we call at the end, requires three lambda functions:

  1. fetch, which reads data associated with individual elements of the segments. The fetch function can be written in two forms - brief and full:

    • Brief form: The lambda function receives only the global index and the compute flag:
    auto fetch = [=] __cuda_callable__ ( int globalIdx ) -> double { ... };
    • Full form: The lambda function receives the segment index, local index, and global index:
    auto fetch = [=] __cuda_callable__ ( int segmentIdx, int localIdx, int globalIdx ) -> double { ... };

    Here:

    • segmentIdx is the index of the segment,
    • localIdx is the index of the element within the segment,
    • globalIdx is the index of the element in the global array.

Many segment formats are optimized for significantly better performance when the brief variant is used. The form of the fetch lambda function is automatically detected using SFINAE , which makes both variants easy to use.

  1. reduce is a lambda function representing the reduction operation. In our case, it is defined as:

    auto reduce = [=] __cuda_callable__ ( const double& a, const double& b ) -> double { return a + b; };

    Alternatively, you can simply use the predefined functors like TNL::Plus, TNL::Multiplies etc.

  1. keep is a lambda function responsible for storing the results of the reduction. It should be defined as:

    auto keep = [=] __cuda_callable__ ( int segmentIdx, const double& value ) mutable { ... };

    Here, segmentIdx is the index of the segment whose reduction result we are storing, and value is the result of the reduction.

To use reduction within segments, we first create a vector sums to store the results and prepare a view to this vector for use inside the lambda functions.

We demonstrate the use of both fetch function variants - the full form via fetch_full and the brief form via fetch_brief.

Next, we define the lambda function keep, which stores the sum from each segment into the vector sums.

Finally, we call the function reduceAllSegments to compute the reductions in the segments, first, using fetch_full and then, using fetch_brief. In both cases, we use the functor TNL::Plus for the reduction operation. We then print the results, which is supposed be the same for both variants.

The full example reads as:

1#include <iostream>
2#include <functional>
3#include <TNL/Containers/Vector.h>
4#include <TNL/Algorithms/Segments/CSR.h>
5#include <TNL/Algorithms/Segments/Ellpack.h>
6#include <TNL/Algorithms/Segments/traverse.h>
7#include <TNL/Algorithms/Segments/reduce.h>
8#include <TNL/Devices/Host.h>
9#include <TNL/Devices/Cuda.h>
10
11template< typename Segments >
12void
13SegmentsExample()
14{
15 using Device = typename Segments::DeviceType;
16
17 /***
18 * Create segments with given segments sizes.
19 */
20 const int size = 5;
21 Segments segments{ 1, 2, 3, 4, 5 };
22
23 /***
24 * Allocate array for the segments;
25 */
26 TNL::Containers::Array< double, Device > data( segments.getStorageSize(), 0.0 );
27
28 /***
29 * Insert data into particular segments.
30 */
31 auto data_view = data.getView();
33 segments,
34 [ = ] __cuda_callable__( int segmentIdx, int localIdx, int globalIdx ) mutable
35 {
36 if( localIdx <= segmentIdx )
37 data_view[ globalIdx ] = segmentIdx;
38 } );
39
40 /***
41 * Print the data by the segments.
42 */
43 std::cout << "Values of elements after initial setup:\n";
44 auto fetch = [ = ] __cuda_callable__( int globalIdx ) -> double
45 {
46 return data_view[ globalIdx ];
47 };
48 std::cout << TNL::Algorithms::Segments::print( segments, fetch ) << '\n';
49
51 /***
52 * Compute sums of elements in each segment.
53 */
55 auto sums_view = sums.getView();
56 auto fetch_full = [ = ] __cuda_callable__( int segmentIdx, int localIdx, int globalIdx ) -> double
57 {
58 if( localIdx <= segmentIdx )
59 return data_view[ globalIdx ];
60 else
61 return 0.0;
62 };
63 auto fetch_brief = [ = ] __cuda_callable__( int globalIdx ) -> double
64 {
65 return data_view[ globalIdx ];
66 };
67 auto store = [ = ] __cuda_callable__( int segmentIdx, const double& value ) mutable
68 {
69 sums_view[ segmentIdx ] = value;
70 };
71
72 TNL::Algorithms::Segments::reduceAllSegments( segments, fetch_full, TNL::Plus{}, store );
73 std::cout << "The sums with full fetch form are: " << sums << '\n';
74 TNL::Algorithms::Segments::reduceAllSegments( segments, fetch_brief, TNL::Plus{}, store );
75 std::cout << "The sums with brief fetch form are: " << sums << "\n\n";
77}
78
79int
80main( int argc, char* argv[] )
81{
82 std::cout << "Example of CSR segments on host:\n";
83 SegmentsExample< TNL::Algorithms::Segments::CSR< TNL::Devices::Host, int > >();
84
85 std::cout << "Example of Ellpack segments on host:\n";
86 SegmentsExample< TNL::Algorithms::Segments::Ellpack< TNL::Devices::Host, int > >();
87
88#ifdef __CUDACC__
89 std::cout << "Example of CSR segments on host:\n";
90 SegmentsExample< TNL::Algorithms::Segments::CSR< TNL::Devices::Cuda, int > >();
91
92 std::cout << "Example of Ellpack segments on host:\n";
93 SegmentsExample< TNL::Algorithms::Segments::Ellpack< TNL::Devices::Cuda, int > >();
94#endif
95 return EXIT_SUCCESS;
96}
Function object implementing x + y.
Definition Functional.h:34

The result looks as follows:

Example of CSR segments on host:
Values of elements after initial setup:
Segment 0: [ 0 ]
Segment 1: [ 1, 1 ]
Segment 2: [ 2, 2, 2 ]
Segment 3: [ 3, 3, 3, 3 ]
Segment 4: [ 4, 4, 4, 4, 4 ]
The sums with full fetch form are: [ 0, 2, 6, 12, 20 ]
The sums with brief fetch form are: [ 0, 2, 6, 12, 20 ]
Example of Ellpack segments on host:
Values of elements after initial setup:
Segment 0: [ 0, 0, 0, 0, 0 ]
Segment 1: [ 1, 1, 0, 0, 0 ]
Segment 2: [ 2, 2, 2, 0, 0 ]
Segment 3: [ 3, 3, 3, 3, 0 ]
Segment 4: [ 4, 4, 4, 4, 4 ]
The sums with full fetch form are: [ 0, 2, 6, 12, 20 ]
The sums with brief fetch form are: [ 0, 2, 6, 12, 20 ]
Example of CSR segments on host:
Values of elements after initial setup:
Segment 0: [ 0 ]
Segment 1: [ 1, 1 ]
Segment 2: [ 2, 2, 2 ]
Segment 3: [ 3, 3, 3, 3 ]
Segment 4: [ 4, 4, 4, 4, 4 ]
The sums with full fetch form are: [ 0, 2, 6, 12, 20 ]
The sums with brief fetch form are: [ 0, 2, 6, 12, 20 ]
Example of Ellpack segments on host:
Values of elements after initial setup:
Segment 0: [ 0, 0, 0, 0, 0 ]
Segment 1: [ 1, 1, 0, 0, 0 ]
Segment 2: [ 2, 2, 2, 0, 0 ]
Segment 3: [ 3, 3, 3, 3, 0 ]
Segment 4: [ 4, 4, 4, 4, 4 ]
The sums with full fetch form are: [ 0, 2, 6, 12, 20 ]
The sums with brief fetch form are: [ 0, 2, 6, 12, 20 ]

Function reduceSegments with segment indexes

Reduction can also be performed only in segments with specified indexes. This is demonstrated in the following code snippet:

/***
* Create array with the indexes of segments we want to iterate over.
*/
TNL::Containers::Array< int, Device > segmentIndexes{ 0, 2, 4 };
/***
* Compute sums of elements in segments with given indexes.
*/
TNL::Containers::Vector< double, Device > compressedSums( segmentIndexes.getSize() );
auto sums_view = sums.getView();
auto compressedSums_view = compressedSums.getView();
auto fetch_full = [ = ] __cuda_callable__( int segmentIdx, int localIdx, int globalIdx ) -> double
{
if( localIdx <= segmentIdx )
return data_view[ globalIdx ];
else
return 0.0;
};
auto fetch_brief = [ = ] __cuda_callable__( int globalIdx ) -> double
{
return data_view[ globalIdx ];
};
auto store = [ = ] __cuda_callable__( int indexOfSegmentIdx, int segmentIdx, const double& value ) mutable
{
sums_view[ segmentIdx ] = value;
compressedSums_view[ indexOfSegmentIdx ] = value;
};
TNL::Algorithms::Segments::reduceSegments( segments, segmentIndexes, fetch_full, TNL::Plus{}, store );
std::cout << "The sums with full fetch form are: " << sums << '\n';
std::cout << "The compressed sums with full fetch form are: " << compressedSums << '\n';
sums = 0;
compressedSums = 0;
TNL::Algorithms::Segments::reduceSegments( segments, segmentIndexes, fetch_brief, TNL::Plus{}, store );
std::cout << "The sums with brief fetch form are: " << sums << '\n';
std::cout << "The compressed sums with brief fetch form are: " << compressedSums << '\n';

Here, we call the function Segments_reduceSegments_with_segment_indices which takes, as its second parameter, an array segmentIndexes containing the indices of segments in which we want to perform the reduction. This function also accepts a lambda function for fetching data in either the full or brief form, such as fetch_full and fetch_brief. The keep lambda function is now slightly different and takes the following parameters:

  1. indexOfSegmentIdx - the position of the segment index in the array segmentIndexes. This information is useful when we want to store the results of the reductions at the positions corresponding to segment indexes in the segmentIndexes array - that is, in a compressed format.
  2. segmentIdx - the actual index of the segment being reduced.
  3. value - the result of the reduction for the given segment.

The difference between the use of indexOfSegmentIdx and segmentIdx is demonstrated using two vectors:

  • sum - a vector whose size equals the total number of all segments.
  • compressedSum - a vector whose size equals the number of segments that actively participate in the reduction.

The full example reads as:

1#include <iostream>
2#include <functional>
3#include <TNL/Containers/Vector.h>
4#include <TNL/Algorithms/Segments/CSR.h>
5#include <TNL/Algorithms/Segments/Ellpack.h>
6#include <TNL/Algorithms/Segments/traverse.h>
7#include <TNL/Algorithms/Segments/reduce.h>
8#include <TNL/Devices/Host.h>
9#include <TNL/Devices/Cuda.h>
10
11template< typename Segments >
12void
13SegmentsExample()
14{
15 using Device = typename Segments::DeviceType;
16
17 /***
18 * Create segments with given segments sizes.
19 */
20 const int size = 5;
21 Segments segments{ 1, 2, 3, 4, 5 };
22
23 /***
24 * Allocate array for the segments;
25 */
26 TNL::Containers::Array< double, Device > data( segments.getStorageSize(), 0.0 );
27
28 /***
29 * Insert data into particular segments.
30 */
31 auto data_view = data.getView();
33 segments,
34 [ = ] __cuda_callable__( int segmentIdx, int localIdx, int globalIdx ) mutable
35 {
36 if( localIdx <= segmentIdx )
37 data_view[ globalIdx ] = segmentIdx;
38 } );
39
40 /***
41 * Print the data by the segments.
42 */
43 std::cout << "Values of elements after initial setup:\n";
44 auto fetch = [ = ] __cuda_callable__( int globalIdx ) -> double
45 {
46 return data_view[ globalIdx ];
47 };
48 std::cout << TNL::Algorithms::Segments::print( segments, fetch ) << '\n';
49
51 /***
52 * Create array with the indexes of segments we want to iterate over.
53 */
54 TNL::Containers::Array< int, Device > segmentIndexes{ 0, 2, 4 };
55
56 /***
57 * Compute sums of elements in segments with given indexes.
58 */
60 TNL::Containers::Vector< double, Device > compressedSums( segmentIndexes.getSize() );
61 auto sums_view = sums.getView();
62 auto compressedSums_view = compressedSums.getView();
63 auto fetch_full = [ = ] __cuda_callable__( int segmentIdx, int localIdx, int globalIdx ) -> double
64 {
65 if( localIdx <= segmentIdx )
66 return data_view[ globalIdx ];
67 else
68 return 0.0;
69 };
70 auto fetch_brief = [ = ] __cuda_callable__( int globalIdx ) -> double
71 {
72 return data_view[ globalIdx ];
73 };
74 auto store = [ = ] __cuda_callable__( int indexOfSegmentIdx, int segmentIdx, const double& value ) mutable
75 {
76 sums_view[ segmentIdx ] = value;
77 compressedSums_view[ indexOfSegmentIdx ] = value;
78 };
79
80 TNL::Algorithms::Segments::reduceSegments( segments, segmentIndexes, fetch_full, TNL::Plus{}, store );
81 std::cout << "The sums with full fetch form are: " << sums << '\n';
82 std::cout << "The compressed sums with full fetch form are: " << compressedSums << '\n';
83
84 sums = 0;
85 compressedSums = 0;
86 TNL::Algorithms::Segments::reduceSegments( segments, segmentIndexes, fetch_brief, TNL::Plus{}, store );
87 std::cout << "The sums with brief fetch form are: " << sums << '\n';
88 std::cout << "The compressed sums with brief fetch form are: " << compressedSums << '\n';
90}
91
92int
93main( int argc, char* argv[] )
94{
95 std::cout << "Example of CSR segments on host:\n";
96 SegmentsExample< TNL::Algorithms::Segments::CSR< TNL::Devices::Host, int > >();
97
98 std::cout << "Example of Ellpack segments on host:\n";
99 SegmentsExample< TNL::Algorithms::Segments::Ellpack< TNL::Devices::Host, int > >();
100
101#ifdef __CUDACC__
102 std::cout << "Example of CSR segments on CUDA GPU:\n";
103 SegmentsExample< TNL::Algorithms::Segments::CSR< TNL::Devices::Cuda, int > >();
104
105 std::cout << "Example of Ellpack segments on CUDA GPU:\n";
106 SegmentsExample< TNL::Algorithms::Segments::Ellpack< TNL::Devices::Cuda, int > >();
107#endif
108 return EXIT_SUCCESS;
109}
__cuda_callable__ IndexType getSize() const
Returns the current array size.

And the output looks as:

Example of CSR segments on host:
Values of elements after initial setup:
Segment 0: [ 0 ]
Segment 1: [ 1, 1 ]
Segment 2: [ 2, 2, 2 ]
Segment 3: [ 3, 3, 3, 3 ]
Segment 4: [ 4, 4, 4, 4, 4 ]
The sums with full fetch form are: [ 0, 0, 6, 6.95272e-310, 20 ]
The compressed sums with full fetch form are: [ 0, 6, 20 ]
The sums with brief fetch form are: [ 0, 0, 6, 0, 20 ]
The compressed sums with brief fetch form are: [ 0, 6, 20 ]
Example of Ellpack segments on host:
Values of elements after initial setup:
Segment 0: [ 0, 0, 0, 0, 0 ]
Segment 1: [ 1, 1, 0, 0, 0 ]
Segment 2: [ 2, 2, 2, 0, 0 ]
Segment 3: [ 3, 3, 3, 3, 0 ]
Segment 4: [ 4, 4, 4, 4, 4 ]
The sums with full fetch form are: [ 0, 0, 6, 6.95272e-310, 20 ]
The compressed sums with full fetch form are: [ 0, 6, 20 ]
The sums with brief fetch form are: [ 0, 0, 6, 0, 20 ]
The compressed sums with brief fetch form are: [ 0, 6, 20 ]
Example of CSR segments on CUDA GPU:
Values of elements after initial setup:
Segment 0: [ 0 ]
Segment 1: [ 1, 1 ]
Segment 2: [ 2, 2, 2 ]
Segment 3: [ 3, 3, 3, 3 ]
Segment 4: [ 4, 4, 4, 4, 4 ]
The sums with full fetch form are: [ 0, 0, 6, 0, 20 ]
The compressed sums with full fetch form are: [ 0, 6, 20 ]
The sums with brief fetch form are: [ 0, 0, 6, 0, 20 ]
The compressed sums with brief fetch form are: [ 0, 6, 20 ]
Example of Ellpack segments on CUDA GPU:
Values of elements after initial setup:
Segment 0: [ 0, 0, 0, 0, 0 ]
Segment 1: [ 1, 1, 0, 0, 0 ]
Segment 2: [ 2, 2, 2, 0, 0 ]
Segment 3: [ 3, 3, 3, 3, 0 ]
Segment 4: [ 4, 4, 4, 4, 4 ]
The sums with full fetch form are: [ 0, 0, 6, 0, 20 ]
The compressed sums with full fetch form are: [ 0, 6, 20 ]
The sums with brief fetch form are: [ 0, 0, 6, 0, 20 ]
The compressed sums with brief fetch form are: [ 0, 6, 20 ]

Function reduceSegmentsIf with condition on segment indexes

Reduction within segments with a condition on segment indexes is demonstrated by the following code snippet:

/***
* Compute sums of elements in segments with given indexes.
*/
auto sums_view = sums.getView();
auto compressedSums_view = compressedSums.getView();
auto condition = [ = ] __cuda_callable__( int segmentIdx ) -> bool
{
return segmentIdx % 2 == 0;
};
auto fetch_full = [ = ] __cuda_callable__( int segmentIdx, int localIdx, int globalIdx ) -> double
{
if( localIdx <= segmentIdx )
return data_view[ globalIdx ];
else
return 0.0;
};
auto fetch_brief = [ = ] __cuda_callable__( int globalIdx ) -> double
{
return data_view[ globalIdx ];
};
auto store = [ = ] __cuda_callable__( int indexOfSegmentIdx, int segmentIdx, const double& value ) mutable
{
sums_view[ segmentIdx ] = value;
compressedSums_view[ indexOfSegmentIdx ] = value;
};
TNL::Algorithms::Segments::reduceAllSegmentsIf( segments, condition, fetch_full, TNL::Plus{}, store );
std::cout << "The sums with full fetch form are: " << sums << '\n';
std::cout << "The compressed sums with full fetch form are: " << compressedSums << '\n';
sums = 0;
compressedSums = 0;
TNL::Algorithms::Segments::reduceAllSegmentsIf( segments, condition, fetch_brief, TNL::Plus{}, store );
std::cout << "The sums with brief fetch form are: " << sums << '\n';
std::cout << "The compressed sums with brief fetch form are: " << compressedSums << '\n';

It works the same way as reduction within segments specified by segment indexes. However, instead of providing an array of segment indexes, we use a lambda function with a condition. Only those segments whose indices satisfy the condition will participate in the reduction.

Note that even when using the function reduceSegmentsIf, the lambda function keep receives the parameter indexOfSegmentIdx, which has the same meaning as before — it represents the rank (position) of the segment among those actively participating in the reduction.

The fulle example reads as:

1#include <iostream>
2#include <functional>
3#include <TNL/Containers/Vector.h>
4#include <TNL/Algorithms/Segments/CSR.h>
5#include <TNL/Algorithms/Segments/Ellpack.h>
6#include <TNL/Algorithms/Segments/traverse.h>
7#include <TNL/Algorithms/Segments/reduce.h>
8#include <TNL/Devices/Host.h>
9#include <TNL/Devices/Cuda.h>
10
11template< typename Segments >
12void
13SegmentsExample()
14{
15 using Device = typename Segments::DeviceType;
16
17 /***
18 * Create segments with given segments sizes.
19 */
20 const int size = 5;
21 Segments segments{ 1, 2, 3, 4, 5 };
22
23 /***
24 * Allocate array for the segments;
25 */
26 TNL::Containers::Array< double, Device > data( segments.getStorageSize(), 0.0 );
27
28 /***
29 * Insert data into particular segments.
30 */
31 auto data_view = data.getView();
33 segments,
34 [ = ] __cuda_callable__( int segmentIdx, int localIdx, int globalIdx ) mutable
35 {
36 if( localIdx <= segmentIdx )
37 data_view[ globalIdx ] = segmentIdx;
38 } );
39
40 /***
41 * Print the data by the segments.
42 */
43 std::cout << "Values of elements after initial setup:\n";
44 auto fetch = [ = ] __cuda_callable__( int globalIdx ) -> double
45 {
46 return data_view[ globalIdx ];
47 };
48 std::cout << TNL::Algorithms::Segments::print( segments, fetch ) << '\n';
49
51 /***
52 * Compute sums of elements in segments with given indexes.
53 */
55 TNL::Containers::Vector< double, Device > compressedSums( size );
56 auto sums_view = sums.getView();
57 auto compressedSums_view = compressedSums.getView();
58 auto condition = [ = ] __cuda_callable__( int segmentIdx ) -> bool
59 {
60 return segmentIdx % 2 == 0;
61 };
62 auto fetch_full = [ = ] __cuda_callable__( int segmentIdx, int localIdx, int globalIdx ) -> double
63 {
64 if( localIdx <= segmentIdx )
65 return data_view[ globalIdx ];
66 else
67 return 0.0;
68 };
69 auto fetch_brief = [ = ] __cuda_callable__( int globalIdx ) -> double
70 {
71 return data_view[ globalIdx ];
72 };
73 auto store = [ = ] __cuda_callable__( int indexOfSegmentIdx, int segmentIdx, const double& value ) mutable
74 {
75 sums_view[ segmentIdx ] = value;
76 compressedSums_view[ indexOfSegmentIdx ] = value;
77 };
78
79 TNL::Algorithms::Segments::reduceAllSegmentsIf( segments, condition, fetch_full, TNL::Plus{}, store );
80 std::cout << "The sums with full fetch form are: " << sums << '\n';
81 std::cout << "The compressed sums with full fetch form are: " << compressedSums << '\n';
82
83 sums = 0;
84 compressedSums = 0;
85 TNL::Algorithms::Segments::reduceAllSegmentsIf( segments, condition, fetch_brief, TNL::Plus{}, store );
86 std::cout << "The sums with brief fetch form are: " << sums << '\n';
87 std::cout << "The compressed sums with brief fetch form are: " << compressedSums << '\n';
89}
90
91int
92main( int argc, char* argv[] )
93{
94 std::cout << "Example of CSR segments on host:\n";
95 SegmentsExample< TNL::Algorithms::Segments::CSR< TNL::Devices::Host, int > >();
96
97 std::cout << "Example of Ellpack segments on host:\n";
98 SegmentsExample< TNL::Algorithms::Segments::Ellpack< TNL::Devices::Host, int > >();
99
100#ifdef __CUDACC__
101 std::cout << "Example of CSR segments on CUDA device:\n";
102 SegmentsExample< TNL::Algorithms::Segments::CSR< TNL::Devices::Cuda, int > >();
103
104 std::cout << "Example of Ellpack segments on CUDA device:\n";
105 SegmentsExample< TNL::Algorithms::Segments::Ellpack< TNL::Devices::Cuda, int > >();
106#endif
107 return EXIT_SUCCESS;
108}

And the output looks as:

Example of CSR segments on host:
Values of elements after initial setup:
Segment 0: [ 0 ]
Segment 1: [ 1, 1 ]
Segment 2: [ 2, 2, 2 ]
Segment 3: [ 3, 3, 3, 3 ]
Segment 4: [ 4, 4, 4, 4, 4 ]
The sums with full fetch form are: [ 0, 0, 6, 6.95272e-310, 20 ]
The compressed sums with full fetch form are: [ 0, 6, 20, 6.95272e-310, 4.68523e-310 ]
The sums with brief fetch form are: [ 0, 0, 6, 0, 20 ]
The compressed sums with brief fetch form are: [ 0, 6, 20, 0, 0 ]
Example of Ellpack segments on host:
Values of elements after initial setup:
Segment 0: [ 0, 0, 0, 0, 0 ]
Segment 1: [ 1, 1, 0, 0, 0 ]
Segment 2: [ 2, 2, 2, 0, 0 ]
Segment 3: [ 3, 3, 3, 3, 0 ]
Segment 4: [ 4, 4, 4, 4, 4 ]
The sums with full fetch form are: [ 0, 0, 6, 6.95272e-310, 20 ]
The compressed sums with full fetch form are: [ 0, 6, 20, 6.95272e-310, 4.68523e-310 ]
The sums with brief fetch form are: [ 0, 0, 6, 0, 20 ]
The compressed sums with brief fetch form are: [ 0, 6, 20, 0, 0 ]
Example of CSR segments on CUDA device:
Values of elements after initial setup:
Segment 0: [ 0 ]
Segment 1: [ 1, 1 ]
Segment 2: [ 2, 2, 2 ]
Segment 3: [ 3, 3, 3, 3 ]
Segment 4: [ 4, 4, 4, 4, 4 ]
The sums with full fetch form are: [ 0, 0, 6, 0, 20 ]
The compressed sums with full fetch form are: [ 0, 6, 20, 0, 0 ]
The sums with brief fetch form are: [ 0, 0, 6, 0, 20 ]
The compressed sums with brief fetch form are: [ 0, 6, 20, 0, 0 ]
Example of Ellpack segments on CUDA device:
Values of elements after initial setup:
Segment 0: [ 0, 0, 0, 0, 0 ]
Segment 1: [ 1, 1, 0, 0, 0 ]
Segment 2: [ 2, 2, 2, 0, 0 ]
Segment 3: [ 3, 3, 3, 3, 0 ]
Segment 4: [ 4, 4, 4, 4, 4 ]
The sums with full fetch form are: [ 0, 8.48798e-314, 6, 0, 20 ]
The compressed sums with full fetch form are: [ 0, 6, 20, 0, 0 ]
The sums with brief fetch form are: [ 0, 0, 6, 0, 20 ]
The compressed sums with brief fetch form are: [ 0, 6, 20, 0, 0 ]

Reduction with argument

The function reduceSegmentsWithArgument also works with the positions of elements within segments. This is useful in situations where we need to determine, for example, the position of the smallest or largest element in each segment. This functionality is demonstrated in the following code snippet:

/***
* Find the maximum element in each segment.
*/
auto sums_view = sums.getView();
auto positions_view = positions.getView();
auto fetch_full = [ = ] __cuda_callable__( int segmentIdx, int localIdx, int globalIdx ) -> double
{
if( localIdx <= segmentIdx )
return data_view[ globalIdx ];
else
return 0.0;
};
auto fetch_brief = [ = ] __cuda_callable__( int globalIdx ) -> double
{
return data_view[ globalIdx ];
};
auto store = [ = ] __cuda_callable__( int segmentIdx, int localIdx, const double& value, bool emptySegment ) mutable
{
sums_view[ segmentIdx ] = value;
if( ! emptySegment )
positions_view[ segmentIdx ] = localIdx;
};
TNL::Algorithms::Segments::reduceAllSegmentsWithArgument( segments, fetch_full, TNL::MaxWithArg{}, store );
std::cout << "The sums with full fetch form are: " << sums << '\n';
std::cout << "The positions of the largest elements are: " << positions << '\n';
sums = 0;
positions = 0;
TNL::Algorithms::Segments::reduceAllSegmentsWithArgument( segments, fetch_brief, TNL::MaxWithArg{}, store );
std::cout << "The sums with brief fetch form are: " << sums << '\n';
std::cout << "The positions of the largest elements are: " << positions << '\n';

Here we search for the maximum element in each segment. The function works similarly to reduceSegments, but it requires reduction functors with argument, such as TNL::MinWithArg or TNL::MaxWithArg. The lambda function keep receives an additional parameter localIdx, which represents the position of the maximum element within the segment.

The full example reads as:

1#include <iostream>
2#include <functional>
3#include <TNL/Containers/Vector.h>
4#include <TNL/Algorithms/Segments/CSR.h>
5#include <TNL/Algorithms/Segments/Ellpack.h>
6#include <TNL/Algorithms/Segments/traverse.h>
7#include <TNL/Algorithms/Segments/reduce.h>
8#include <TNL/Devices/Host.h>
9#include <TNL/Devices/Cuda.h>
10
11template< typename Segments >
12void
13SegmentsExample()
14{
15 using Device = typename Segments::DeviceType;
16
17 /***
18 * Create segments with given segments sizes.
19 */
20 const int size = 5;
21 Segments segments{ 1, 2, 3, 4, 5 };
22
23 /***
24 * Allocate array for the segments;
25 */
26 TNL::Containers::Array< double, Device > data( segments.getStorageSize(), 0.0 );
27
28 /***
29 * Insert data into particular segments.
30 */
31 auto data_view = data.getView();
33 segments,
34 [ = ] __cuda_callable__( int segmentIdx, int localIdx, int globalIdx ) mutable
35 {
36 if( localIdx <= segmentIdx )
37 data_view[ globalIdx ] = segmentIdx + localIdx;
38 } );
39
40 /***
41 * Print the data by the segments.
42 */
43 std::cout << "Values of elements after initial setup:\n";
44 auto fetch = [ = ] __cuda_callable__( int globalIdx ) -> double
45 {
46 return data_view[ globalIdx ];
47 };
48 std::cout << TNL::Algorithms::Segments::print( segments, fetch ) << '\n';
49
51 /***
52 * Find the maximum element in each segment.
53 */
56 auto sums_view = sums.getView();
57 auto positions_view = positions.getView();
58 auto fetch_full = [ = ] __cuda_callable__( int segmentIdx, int localIdx, int globalIdx ) -> double
59 {
60 if( localIdx <= segmentIdx )
61 return data_view[ globalIdx ];
62 else
63 return 0.0;
64 };
65 auto fetch_brief = [ = ] __cuda_callable__( int globalIdx ) -> double
66 {
67 return data_view[ globalIdx ];
68 };
69 auto store = [ = ] __cuda_callable__( int segmentIdx, int localIdx, const double& value, bool emptySegment ) mutable
70 {
71 sums_view[ segmentIdx ] = value;
72 if( ! emptySegment )
73 positions_view[ segmentIdx ] = localIdx;
74 };
75
76 TNL::Algorithms::Segments::reduceAllSegmentsWithArgument( segments, fetch_full, TNL::MaxWithArg{}, store );
77 std::cout << "The sums with full fetch form are: " << sums << '\n';
78 std::cout << "The positions of the largest elements are: " << positions << '\n';
79
80 sums = 0;
81 positions = 0;
82 TNL::Algorithms::Segments::reduceAllSegmentsWithArgument( segments, fetch_brief, TNL::MaxWithArg{}, store );
83 std::cout << "The sums with brief fetch form are: " << sums << '\n';
84 std::cout << "The positions of the largest elements are: " << positions << '\n';
86}
87
88int
89main( int argc, char* argv[] )
90{
91 std::cout << "Example of CSR segments on host:\n";
92 SegmentsExample< TNL::Algorithms::Segments::CSR< TNL::Devices::Host, int > >();
93
94 std::cout << "Example of Ellpack segments on host:\n";
95 SegmentsExample< TNL::Algorithms::Segments::Ellpack< TNL::Devices::Host, int > >();
96
97#ifdef __CUDACC__
98 std::cout << "Example of CSR segments on CUDA GPU:\n";
99 SegmentsExample< TNL::Algorithms::Segments::CSR< TNL::Devices::Cuda, int > >();
100
101 std::cout << "Example of Ellpack segments on CUDA GPU:\n";
102 SegmentsExample< TNL::Algorithms::Segments::Ellpack< TNL::Devices::Cuda, int > >();
103#endif
104 return EXIT_SUCCESS;
105}
Function object implementing argmax(x, y, i, j), i.e. returning the maximum value and its index.
Definition Functional.h:355

The result looks as:

Example of CSR segments on host:
Values of elements after initial setup:
Segment 0: [ 0 ]
Segment 1: [ 1, 2 ]
Segment 2: [ 2, 3, 4 ]
Segment 3: [ 3, 4, 5, 6 ]
Segment 4: [ 4, 5, 6, 7, 8 ]
The sums with full fetch form are: [ 0, 2, 4, 6, 8 ]
The positions of the largest elements are: [ 0, 1, 2, 3, 4 ]
The sums with brief fetch form are: [ 0, 2, 4, 6, 8 ]
The positions of the largest elements are: [ 0, 1, 2, 3, 4 ]
Example of Ellpack segments on host:
Values of elements after initial setup:
Segment 0: [ 0, 0, 0, 0, 0 ]
Segment 1: [ 1, 2, 0, 0, 0 ]
Segment 2: [ 2, 3, 4, 0, 0 ]
Segment 3: [ 3, 4, 5, 6, 0 ]
Segment 4: [ 4, 5, 6, 7, 8 ]
The sums with full fetch form are: [ 0, 2, 4, 6, 8 ]
The positions of the largest elements are: [ 0, 1, 2, 3, 4 ]
The sums with brief fetch form are: [ 0, 2, 4, 6, 8 ]
The positions of the largest elements are: [ 0, 1, 2, 3, 4 ]
Example of CSR segments on CUDA GPU:
Values of elements after initial setup:
Segment 0: [ 0 ]
Segment 1: [ 1, 2 ]
Segment 2: [ 2, 3, 4 ]
Segment 3: [ 3, 4, 5, 6 ]
Segment 4: [ 4, 5, 6, 7, 8 ]
The sums with full fetch form are: [ 0, 2, 4, 6, 8 ]
The positions of the largest elements are: [ 0, 1, 2, 3, 4 ]
The sums with brief fetch form are: [ 0, 2, 4, 6, 8 ]
The positions of the largest elements are: [ 0, 1, 2, 3, 4 ]
Example of Ellpack segments on CUDA GPU:
Values of elements after initial setup:
Segment 0: [ 0, 0, 0, 0, 0 ]
Segment 1: [ 1, 2, 0, 0, 0 ]
Segment 2: [ 2, 3, 4, 0, 0 ]
Segment 3: [ 3, 4, 5, 6, 0 ]
Segment 4: [ 4, 5, 6, 7, 8 ]
The sums with full fetch form are: [ 0, 2, 4, 6, 8 ]
The positions of the largest elements are: [ 0, 1, 2, 3, 4 ]
The sums with brief fetch form are: [ 0, 2, 4, 6, 8 ]
The positions of the largest elements are: [ 0, 1, 2, 3, 4 ]

Function reduceAll

The function reduceAll performs local reduction within specified segments, followed by a global reduction of the intermediate results obtained from the individual segment reductions.

It is demonstrated by the following code snippet, where we compute the sum of the maximum values across all segments.

// Perform complete reduction:
// 1. Find maximum in each segment
// 2. Sum up all the maximums
const ValueType result = reduceAll(
segments,
// Fetch function for segment-level reduction (gets element value)
[ = ] __cuda_callable__( IndexType segmentIdx, IndexType localIdx, IndexType globalIdx )
{
if( localIdx < segmentsSizesView[ segmentIdx ] )
return valuesView[ globalIdx ];
},
// Reduction operation for segments (maximum)
// Fetch function for global reduction (identity - returns segment result as is)
[] __cuda_callable__( const ValueType& segmentValue )
{
return segmentValue;
},
// Global reduction operation (sum)
TNL::Plus{} );

The function accepts the following lambda functions and functors:

  1. Local fetch: A lambda function for fetching data from segments. It can be provided in either full or brief form, as in other segment-reduction functions.
  2. Local reduction: A functor representing the reduction operation to be performed within individual segments.
  3. Global fetch: A lambda function that receives the result of the local reduction for each segment, optionally performs a transformation, and returns the value to be used as input for the global reduction.
  4. Global reduction: A functor representing the global reduction, i.e., the reduction performed across all segment-level results.

Since the result of this function is a single value, a keep lambda function is not required.

In our example, we first need to find the maximum value in each segment. Therefore, the local reduction is performed using TNL::Max. Next, we want to compute the sum of these maxima, so the global reduction is performed using TNL::Plus.

The full example reads as:

1#include "TNL/Functional.h"
2#include <iostream>
3#include <TNL/Algorithms/Segments/CSR.h>
4#include <TNL/Algorithms/Segments/reduce.h>
5#include <TNL/Algorithms/Segments/print.h>
6#include <TNL/Algorithms/Segments/traverse.h>
7#include <TNL/Containers/Vector.h>
8#include <TNL/Devices/Host.h>
9#include <TNL/Devices/Cuda.h>
10#include <TNL/Math.h>
11#include <TNL/Containers/Array.h>
12
13using namespace TNL;
14using namespace TNL::Algorithms;
15using namespace TNL::Algorithms::Segments;
16
17template< typename Device >
18void
19reduceAllExample()
20{
21 /***
22 * This example shows how to perform a complete reduction over segments
23 * with different operations for segment-level and final reduction.
24 *
25 * We will:
26 * 1. Find maximum in each segment
27 * 2. Sum up all the maximums
28 */
29
30 using IndexType = int;
31 using ValueType = double;
32
33 // Create segments with different sizes
35 CSR< Device, IndexType > segments( segmentsSizes );
36
37 // Initialize data: each element is segment_idx + local_idx
38 Containers::Vector< ValueType, Device, IndexType > values( segments.getStorageSize(), -1 );
39 auto valuesView = values.getView();
40 auto segmentsSizesView = segmentsSizes.getView();
42 segments,
43 [ = ] __cuda_callable__( IndexType segmentIdx, IndexType localIdx, IndexType globalIdx ) mutable
44 {
45 if( localIdx < segmentsSizesView[ segmentIdx ] )
46 valuesView[ globalIdx ] = segmentIdx + localIdx;
47 } );
48
49 // Print the initial data
50 std::cout << "Segments sizes: " << segmentsSizes << "\n";
52 segments,
53 [ = ] __cuda_callable__( IndexType idx )
54 {
55 return valuesView[ idx ];
56 } ) << "\n";
57
59 // Perform complete reduction:
60 // 1. Find maximum in each segment
61 // 2. Sum up all the maximums
62 const ValueType result = reduceAll(
63 segments,
64 // Fetch function for segment-level reduction (gets element value)
65 [ = ] __cuda_callable__( IndexType segmentIdx, IndexType localIdx, IndexType globalIdx )
66 {
67 if( localIdx < segmentsSizesView[ segmentIdx ] )
68 return valuesView[ globalIdx ];
70 },
71 // Reduction operation for segments (maximum)
72 TNL::Max{},
73 // Fetch function for global reduction (identity - returns segment result as is)
74 [] __cuda_callable__( const ValueType& segmentValue )
75 {
76 return segmentValue;
77 },
78 // Global reduction operation (sum)
79 TNL::Plus{} );
81
82 // Print the result
83 std::cout << "Sum of maximums = " << result << "\n";
84}
85
86int
87main( int argc, char** argv )
88{
89 std::cout << "Running example on Host:\n";
90 reduceAllExample< TNL::Devices::Host >();
91
92#ifdef __CUDACC__
93 std::cout << "\nRunning example on CUDA device:\n";
94 reduceAllExample< TNL::Devices::Cuda >();
95#endif
96 return EXIT_SUCCESS;
97}
ViewType getView(IndexType begin=0, IndexType end=0)
Returns a modifiable view of the vector.
T lowest(T... args)
Namespace for the segments data structures.
Definition _NamespaceDoxy.h:7
Namespace for fundamental TNL algorithms.
Definition AtomicOperations.h:15
The main TNL namespace.
Definition AtomicOperations.h:15
Function object implementing max(x, y).
Definition Functional.h:272

And the output looks as:

#include "TNL/Functional.h"
#include <iostream>
#include <TNL/Algorithms/Segments/CSR.h>
#include <TNL/Algorithms/Segments/reduce.h>
#include <TNL/Algorithms/Segments/print.h>
#include <TNL/Algorithms/Segments/traverse.h>
#include <TNL/Containers/Vector.h>
#include <TNL/Devices/Host.h>
#include <TNL/Devices/Cuda.h>
#include <TNL/Math.h>
#include <TNL/Containers/Array.h>
using namespace TNL;
using namespace TNL::Algorithms;
using namespace TNL::Algorithms::Segments;
template< typename Device >
void
reduceAllExample()
{
/***
* This example shows how to perform a complete reduction over segments
* with different operations for segment-level and final reduction.
*
* We will:
* 1. Find maximum in each segment
* 2. Sum up all the maximums
*/
using IndexType = int;
using ValueType = double;
// Create segments with different sizes
CSR< Device, IndexType > segments( segmentsSizes );
// Initialize data: each element is segment_idx + local_idx
Containers::Vector< ValueType, Device, IndexType > values( segments.getStorageSize(), -1 );
auto valuesView = values.getView();
auto segmentsSizesView = segmentsSizes.getView();
segments,
[ = ] __cuda_callable__( IndexType segmentIdx, IndexType localIdx, IndexType globalIdx ) mutable
{
if( localIdx < segmentsSizesView[ segmentIdx ] )
valuesView[ globalIdx ] = segmentIdx + localIdx;
} );
// Print the initial data
std::cout << "Segments sizes: " << segmentsSizes << "\n";
segments,
[ = ] __cuda_callable__( IndexType idx )
{
return valuesView[ idx ];
} ) << "\n";
// Perform complete reduction:
// 1. Find maximum in each segment
// 2. Sum up all the maximums
const ValueType result = reduceAll(
segments,
// Fetch function for segment-level reduction (gets element value)
[ = ] __cuda_callable__( IndexType segmentIdx, IndexType localIdx, IndexType globalIdx )
{
if( localIdx < segmentsSizesView[ segmentIdx ] )
return valuesView[ globalIdx ];
},
// Reduction operation for segments (maximum)
// Fetch function for global reduction (identity - returns segment result as is)
[] __cuda_callable__( const ValueType& segmentValue )
{
return segmentValue;
},
// Global reduction operation (sum)
TNL::Plus{} );
// Print the result
std::cout << "Sum of maximums = " << result << "\n";
}
int
main( int argc, char** argv )
{
std::cout << "Running example on Host:\n";
reduceAllExample< TNL::Devices::Host >();
#ifdef __CUDACC__
std::cout << "\nRunning example on CUDA device:\n";
reduceAllExample< TNL::Devices::Cuda >();
#endif
return EXIT_SUCCESS;
}

Note: To perform complete reduction on only specific segments, you can use the functions reduceSegments with segment indexes or reduceSegmentsIf with a condition on segment indexes. These functions behave the same way as reduceSegments with segment indexes and reduceSegmentsIf, and therefore we do not cover them in more detail in this user guide.

Scan (prefix-sum)

With the function scan, you can perform scan (or prefix-sum) within segments. Both inclusive and exclusive scans are supported, as demonstrated by the following code snippet:

// Define fetch, reduce and write functions
auto fetch = [ = ] __cuda_callable__( Index segmentIdx, Index localIdx, Index globalIdx ) -> Value
{
if( localIdx < segmentsSizesView[ segmentIdx ] )
return data_view[ globalIdx ];
else
return 0; // Return 0 for padding elements
};
auto write_inclusive = [ = ] __cuda_callable__( Index globalIdx, Value value ) mutable
{
inclusive_result_view[ globalIdx ] = value;
};
auto write_exclusive = [ = ] __cuda_callable__( Index globalIdx, Value value ) mutable
{
exclusive_result_view[ globalIdx ] = value;
};
// Perform inclusive scan
TNL::Algorithms::Segments::inclusiveScanAllSegments( segments, fetch, TNL::Plus{}, write_inclusive );
// Perform exclusive scan
TNL::Algorithms::Segments::exclusiveScanAllSegments( segments, fetch, TNL::Plus{}, write_exclusive );

The function accepts the following lambda functions:

  • fetch is responsible for reading data managed by the segments. This function receives the parameters segmentIdx, localIdx, and globalIdx, which have the same meaning as in other reduction operations.
  • write writes the results to the output array. It receives parameters globalIdx (the position in the output array) and value, which is the result of the scan operation at the given position.

The function also accepts functors expressing the reduction operation, such as TNL::Plus and similar. Depending on what kind of scan you want to perform, you can call either TNL::Algorithms::Segments::inclusiveScanAllSegments or TNL::Algorithms::Segments::exclusiveScanAllSegments.

The full example reads as:

1#include <iostream>
2#include <TNL/Algorithms/Segments/CSR.h>
3#include <TNL/Algorithms/Segments/scan.h>
4#include <TNL/Algorithms/Segments/print.h>
5#include <TNL/Containers/Vector.h>
6#include <TNL/Devices/Host.h>
7
8template< typename Device, typename Value = double, typename Index = int >
9void
10scanExample()
11{
12 // Create segments with different sizes
13 TNL::Containers::Vector< Index, Device > segmentsSizes{ 1, 2, 3, 4, 5 };
14 auto segmentsSizesView = segmentsSizes.getConstView();
16
17 // Create data to be scanned within segments
18 TNL::Containers::Vector< Value, Device, Index > data( segments.getStorageSize() );
19 TNL::Containers::Vector< Value, Device, Index > inclusive_result( segments.getStorageSize() );
20 TNL::Containers::Vector< Value, Device, Index > exclusive_result( segments.getStorageSize() );
21 auto inclusive_result_view = inclusive_result.getView();
22 auto exclusive_result_view = exclusive_result.getView();
23
24 // Initialize data with segment index + 1
25 auto data_view = data.getView();
27 segments,
28 [ = ] __cuda_callable__( Index segmentIdx, Index localIdx, Index globalIdx ) mutable
29 {
30 data_view[ globalIdx ] = segmentIdx + 1;
31 } );
32
33 // Print original data
34 std::cout << "Original data in segments:\n";
36 segments,
37 [ = ] __cuda_callable__( Index globalIdx ) -> Value
38 {
39 return data_view[ globalIdx ];
40 } ) << '\n';
41
43 // Define fetch, reduce and write functions
44 auto fetch = [ = ] __cuda_callable__( Index segmentIdx, Index localIdx, Index globalIdx ) -> Value
45 {
46 if( localIdx < segmentsSizesView[ segmentIdx ] )
47 return data_view[ globalIdx ];
48 else
49 return 0; // Return 0 for padding elements
50 };
51 auto write_inclusive = [ = ] __cuda_callable__( Index globalIdx, Value value ) mutable
52 {
53 inclusive_result_view[ globalIdx ] = value;
54 };
55
56 auto write_exclusive = [ = ] __cuda_callable__( Index globalIdx, Value value ) mutable
57 {
58 exclusive_result_view[ globalIdx ] = value;
59 };
60
61 // Perform inclusive scan
62 TNL::Algorithms::Segments::inclusiveScanAllSegments( segments, fetch, TNL::Plus{}, write_inclusive );
63
64 // Perform exclusive scan
65 TNL::Algorithms::Segments::exclusiveScanAllSegments( segments, fetch, TNL::Plus{}, write_exclusive );
67
68 // Print results
69 std::cout << "\nInclusive scan results:\n";
71 segments,
72 [ = ] __cuda_callable__( Index globalIdx ) -> Value
73 {
74 return inclusive_result_view[ globalIdx ];
75 } ) << '\n';
76
77 std::cout << "\nExclusive scan results:\n";
79 segments,
80 [ = ] __cuda_callable__( Index globalIdx ) -> Value
81 {
82 return exclusive_result_view[ globalIdx ];
83 } ) << '\n';
84
85 // Example of scanning only specific segments
86 TNL::Containers::Vector< Index, Device > segmentIndexes{ 1, 3 }; // Scan only segments 1 and 3
87
88 auto write_partial = [ = ] __cuda_callable__( Index globalIdx, Value value ) mutable
89 {
90 data_view[ globalIdx ] = value; // All segment scan algorithms may work even as inplace scan
91 };
92
93 // Perform inclusive scan on selected segments
94 TNL::Algorithms::Segments::inclusiveScanSegments( segments, segmentIndexes, fetch, TNL::Plus{}, write_partial );
95
96 std::cout << "\nPartial inclusive inplace scan results (only segments 1 and 3):\n";
98 segments,
99 [ = ] __cuda_callable__( Index globalIdx ) -> Value
100 {
101 return data_view[ globalIdx ];
102 } ) << '\n';
103
104 // Scanning the rest of segments using condition
105 auto condition = [ = ] __cuda_callable__( Index segmentIdx ) mutable -> bool
106 {
107 return segmentIdx % 2 == 0; // Only scan even segments
108 };
109
110 // Perform inclusive scan on selected segments
111 TNL::Algorithms::Segments::inclusiveScanAllSegmentsIf( segments, condition, fetch, TNL::Plus{}, write_partial );
112
113 std::cout << "\nPartial inclusive inplace scan results (only even segments):\n";
115 segments,
116 [ = ] __cuda_callable__( Index globalIdx ) -> Value
117 {
118 return data_view[ globalIdx ];
119 } ) << '\n';
120}
121
122int
123main( int argc, char* argv[] )
124{
125 std::cout << "Running example on Host:\n";
126 scanExample< TNL::Devices::Host >();
127
128#ifdef __CUDACC__
129 std::cout << "\nRunning example on Cuda:\n";
130 scanExample< TNL::Devices::Cuda >();
131#endif
132
133 return 0;
134}
ConstViewType getConstView(IndexType begin=0, IndexType end=0) const
Returns a non-modifiable view of the vector.
void inclusiveScanAllSegmentsIf(const Segments &segments, Condition &&condition, Fetch &&fetch, Reduce &&reduce, Write &&write, LaunchConfiguration launchConfig=LaunchConfiguration())
Compute inclusive conditional prefix-sum (scan) within all segments. .
void inclusiveScanSegments(const Segments &segments, IndexBegin begin, IndexEnd end, Fetch &&fetch, Reduce &&reduce, Write &&write, LaunchConfiguration launchConfig=LaunchConfiguration())
Compute inclusive prefix-sum (scan) within specified segments in a range. .
void exclusiveScanAllSegments(const Segments &segments, Fetch &&fetch, Reduce &&reduce, Write &&write, LaunchConfiguration launchConfig=LaunchConfiguration())
Compute exclusive prefix-sum (scan) within all segments. .
void inclusiveScanAllSegments(const Segments &segments, Fetch &&fetch, Reduce &&reduce, Write &&write, LaunchConfiguration launchConfig=LaunchConfiguration())
Compute inclusive prefix-sum (scan) within all segments. .

The output looks as:

Running example on Host:
Original data in segments:
Segment 0: [ 1 ]
Segment 1: [ 2, 2 ]
Segment 2: [ 3, 3, 3 ]
Segment 3: [ 4, 4, 4, 4 ]
Segment 4: [ 5, 5, 5, 5, 5 ]
Inclusive scan results:
Segment 0: [ 1 ]
Segment 1: [ 2, 4 ]
Segment 2: [ 3, 6, 9 ]
Segment 3: [ 4, 8, 12, 16 ]
Segment 4: [ 5, 10, 15, 20, 25 ]
Exclusive scan results:
Segment 0: [ 0 ]
Segment 1: [ 0, 2 ]
Segment 2: [ 0, 3, 6 ]
Segment 3: [ 0, 4, 8, 12 ]
Segment 4: [ 0, 5, 10, 15, 20 ]
Partial inclusive inplace scan results (only segments 1 and 3):
Segment 0: [ 1 ]
Segment 1: [ 2, 4 ]
Segment 2: [ 3, 3, 3 ]
Segment 3: [ 4, 8, 12, 16 ]
Segment 4: [ 5, 5, 5, 5, 5 ]
Partial inclusive inplace scan results (only even segments):
Segment 0: [ 1 ]
Segment 1: [ 2, 4 ]
Segment 2: [ 3, 6, 9 ]
Segment 3: [ 4, 8, 12, 16 ]
Segment 4: [ 5, 10, 15, 20, 25 ]
Running example on Cuda:
Original data in segments:
Segment 0: [ 1 ]
Segment 1: [ 2, 2 ]
Segment 2: [ 3, 3, 3 ]
Segment 3: [ 4, 4, 4, 4 ]
Segment 4: [ 5, 5, 5, 5, 5 ]
Inclusive scan results:
Segment 0: [ 1 ]
Segment 1: [ 2, 4 ]
Segment 2: [ 3, 6, 9 ]
Segment 3: [ 4, 8, 12, 16 ]
Segment 4: [ 5, 10, 15, 20, 25 ]
Exclusive scan results:
Segment 0: [ 0 ]
Segment 1: [ 0, 2 ]
Segment 2: [ 0, 3, 6 ]
Segment 3: [ 0, 4, 8, 12 ]
Segment 4: [ 0, 5, 10, 15, 20 ]
Partial inclusive inplace scan results (only segments 1 and 3):
Segment 0: [ 1 ]
Segment 1: [ 2, 4 ]
Segment 2: [ 3, 3, 3 ]
Segment 3: [ 4, 8, 12, 16 ]
Segment 4: [ 5, 5, 5, 5, 5 ]
Partial inclusive inplace scan results (only even segments):
Segment 0: [ 1 ]
Segment 1: [ 2, 4 ]
Segment 2: [ 3, 6, 9 ]
Segment 3: [ 4, 8, 12, 16 ]
Segment 4: [ 5, 10, 15, 20, 25 ]

Note: Onecan also use TNL::Algorithms::Segments::inclusiveScanSegments and TNL::Algorithms::Segments::exclusiveScanSegments with an array of segment indexes to specify the segments in which scan operations should be performed. Alternatively, one may use TNL::Algorithms::Segments::inclusiveScanSegmentsIf and TNL::Algorithms::Segments::exclusiveScanSegmentsIf, where segments are selected based on a condition on their index.

Find

The functions findInSegments are used for parallel searching within segments. The following code snippet shows how to find a specific number in each segment:

auto found_view = found.getView();
auto positions_view = positions.getView();
auto condition = [ = ] __cuda_callable__( int segmentIdx, int localIdx, int globalIdx ) -> bool
{
return data_view[ globalIdx ] == 2;
};
auto keep = [ = ] __cuda_callable__( const int segmentIdx, const int localIdx, bool found ) mutable
{
found_view[ segmentIdx ] = found;
if( found )
positions_view[ segmentIdx ] = localIdx;
else
positions_view[ segmentIdx ] = -1;
};
TNL::Algorithms::Segments::findInAllSegments( segments, condition, keep );
std::cout << "Found array: " << found << '\n';
std::cout << "Positions array: " << positions << '\n';

The function takes two lambda functions as arguments:

  1. condition - A lambda function that receives parameters segmentIdx, localIdx, and globalIdx (with their usual meanings), and returns a boolean value. A return value of true indicates that the given element satisfies the search condition, while false means it does not.
  2. keep - A lambda function that receives parameters segmentIdx, localIdx, and found. If found is true, then the searched element was found in the segment with index segmentIdx at the position localIdx.

In our example, a boolean value indicating whether the searched element was found in each segment is stored in the array found, and the positions of the found elements are stored in the array positions.

The full example reads as:

1#include <iostream>
2#include <functional>
3#include <TNL/Containers/Vector.h>
4#include <TNL/Algorithms/Segments/CSR.h>
5#include <TNL/Algorithms/Segments/Ellpack.h>
6#include <TNL/Algorithms/Segments/traverse.h>
7#include <TNL/Algorithms/Segments/find.h>
8#include <TNL/Devices/Host.h>
9#include <TNL/Devices/Cuda.h>
10
11template< typename Segments >
12void
13SegmentsExample()
14{
15 using Device = typename Segments::DeviceType;
16
17 /***
18 * Create segments with given segments sizes.
19 */
20 const int size = 5;
21 Segments segments{ 1, 2, 3, 4, 5 };
22
23 /***
24 * Allocate array for the segments;
25 */
26 TNL::Containers::Array< double, Device > data( segments.getStorageSize(), 0.0 );
27
28 /***
29 * Insert data into particular segments.
30 */
31 auto data_view = data.getView();
33 segments,
34 [ = ] __cuda_callable__( int segmentIdx, int localIdx, int globalIdx ) mutable
35 {
36 if( localIdx <= segmentIdx )
37 data_view[ globalIdx ] = ( localIdx + segmentIdx ) / 2;
38 } );
39
40 /***
41 * Print the data by the segments.
42 */
43 std::cout << "Values of elements after initial setup:\n";
44 auto fetch = [ = ] __cuda_callable__( int globalIdx ) -> double
45 {
46 return data_view[ globalIdx ];
47 };
48 std::cout << TNL::Algorithms::Segments::print( segments, fetch ) << '\n';
49
53
54 auto found_view = found.getView();
55 auto positions_view = positions.getView();
56 auto condition = [ = ] __cuda_callable__( int segmentIdx, int localIdx, int globalIdx ) -> bool
57 {
58 return data_view[ globalIdx ] == 2;
59 };
60 auto keep = [ = ] __cuda_callable__( const int segmentIdx, const int localIdx, bool found ) mutable
61 {
62 found_view[ segmentIdx ] = found;
63 if( found )
64 positions_view[ segmentIdx ] = localIdx;
65 else
66 positions_view[ segmentIdx ] = -1;
67 };
68 TNL::Algorithms::Segments::findInAllSegments( segments, condition, keep );
69
70 std::cout << "Found array: " << found << '\n';
71 std::cout << "Positions array: " << positions << '\n';
73}
74
75int
76main( int argc, char* argv[] )
77{
78 std::cout << "Example of CSR segments on host:\n";
79 SegmentsExample< TNL::Algorithms::Segments::CSR< TNL::Devices::Host, int > >();
80
81 std::cout << "Example of Ellpack segments on host:\n";
82 SegmentsExample< TNL::Algorithms::Segments::Ellpack< TNL::Devices::Host, int > >();
83
84#ifdef __CUDACC__
85 std::cout << "Example of CSR segments on CUDA GPU:\n";
86 SegmentsExample< TNL::Algorithms::Segments::CSR< TNL::Devices::Cuda, int > >();
87
88 std::cout << "Example of Ellpack segments on CUDA GPU:\n";
89 SegmentsExample< TNL::Algorithms::Segments::Ellpack< TNL::Devices::Cuda, int > >();
90#endif
91 return EXIT_SUCCESS;
92}

The output reads as:

Example of CSR segments on host:
Values of elements after initial setup:
Segment 0: [ 0 ]
Segment 1: [ 0, 1 ]
Segment 2: [ 1, 1, 2 ]
Segment 3: [ 1, 2, 2, 3 ]
Segment 4: [ 2, 2, 3, 3, 4 ]
Found array: [ 0, 0, 1, 1, 1 ]
Positions array: [ -1, -1, 2, 1, 0 ]
Example of Ellpack segments on host:
Values of elements after initial setup:
Segment 0: [ 0, 0, 0, 0, 0 ]
Segment 1: [ 0, 1, 0, 0, 0 ]
Segment 2: [ 1, 1, 2, 0, 0 ]
Segment 3: [ 1, 2, 2, 3, 0 ]
Segment 4: [ 2, 2, 3, 3, 4 ]
Found array: [ 0, 0, 1, 1, 1 ]
Positions array: [ -1, -1, 2, 1, 0 ]
Example of CSR segments on CUDA GPU:
Values of elements after initial setup:
Segment 0: [ 0 ]
Segment 1: [ 0, 1 ]
Segment 2: [ 1, 1, 2 ]
Segment 3: [ 1, 2, 2, 3 ]
Segment 4: [ 2, 2, 3, 3, 4 ]
Found array: [ 0, 0, 1, 1, 1 ]
Positions array: [ -1, -1, 2, 1, 0 ]
Example of Ellpack segments on CUDA GPU:
Values of elements after initial setup:
Segment 0: [ 0, 0, 0, 0, 0 ]
Segment 1: [ 0, 1, 0, 0, 0 ]
Segment 2: [ 1, 1, 2, 0, 0 ]
Segment 3: [ 1, 2, 2, 3, 0 ]
Segment 4: [ 2, 2, 3, 3, 4 ]
Found array: [ 0, 0, 1, 1, 1 ]
Positions array: [ -1, -1, 2, 1, 0 ]

Sort

The functions sortSegments sorts data managed by the segments:

// Sort each segment
auto fetch = [ = ] __cuda_callable__( Index segmentIdx, Index localIdx, Index globalIdx ) -> int
{
return data_view[ globalIdx ] != -1 ? data_view[ globalIdx ] : std::numeric_limits< int >::max();
};
auto compare = [] __cuda_callable__( const Value& a, const Value& b ) -> bool
{
return a <= b;
};
auto swap = [ = ] __cuda_callable__( Index globalIdx1, Index globalIdx2 ) mutable
{
TNL::swap( data_view[ globalIdx1 ], data_view[ globalIdx2 ] );
};
// Sort all segments
TNL::Algorithms::Segments::sortAllSegments( segments, fetch, compare, swap );
// Print sorted data
std::cout << "\nSorted data in segments (ascending order):\n";
segments,
[ = ] __cuda_callable__( Index globalIdx ) -> int
{
return data_view[ globalIdx ];
} ) << '\n';

As demonstrated in the snippet above, three lambda functions must be provided:

  1. fetch - This lambda receives parameters segmentIdx, localIdx, and globalIdx, which have their usual meanings. It is responsible for reading the data to be sorted. The function must also account for padding elements. A common strategy is to return either the maximum or lowest value of the given data type (depending on whether sorting is ascending or descending), using std::numeric_limits< int >::max() or std::numeric_limits< int >::lowest().
  2. compare - A lambda function that compares two elements a and b. It should return true if a should appear before b and false otherwise.
  3. swap - This lambda performs the actual swap of two elements at positions globalIdx1 and globalIdx2.

The following code snippet demonstrates sorting with the function sortSegments using specified segment indexes and in descending order:

// Sort only specific segments using segmentIndexes
std::cout << "\nSorting only segments 1 and 3 in descending order:\n";
auto compareDesc = [] __cuda_callable__( Index a, Index b ) -> bool
{
return a >= b;
};
TNL::Algorithms::Segments::sortSegments( segments, segmentIndexes, fetch, compareDesc, swap );
// Print result
segments,
[ = ] __cuda_callable__( Index globalIdx ) -> int
{
return data_view[ globalIdx ];
} ) << '\n';

In addition to the lambda functions fetch and swap, this version also accepts an array of segment indexes segmentIndexes specifying which segments should be sorted. The compare lambda function (which evaluates the condition a <= b for ascending order) is replaced with the compareDesc function, which evaluates the condition a >= b for descending order.

The full example showcasing also use of function sortAllSegmentsIf reads as:

1#include <iostream>
2#include <TNL/Algorithms/Segments/CSR.h>
3#include <TNL/Algorithms/Segments/traverse.h>
4#include <TNL/Algorithms/Segments/sort.h>
5#include <TNL/Containers/Vector.h>
6#include <TNL/Devices/Host.h>
7
8template< typename Device, typename Value = double, typename Index = int >
9void
10sortExample()
11{
12 // Create segments with different sizes
13 TNL::Containers::Vector< Index, Device > segmentsSizes{ 1, 2, 3, 4, 5 };
14
16
17 // Create data to be sorted within segments
18 TNL::Containers::Vector< Value, Device, Index > data( segments.getStorageSize(), -1 );
19
20 // Initialize data
21 auto data_view = data.getView();
22 auto segmentsSizesView = segmentsSizes.getView();
24 segments,
25 [ = ] __cuda_callable__( Index segmentIdx, Index localIdx, Index globalIdx ) mutable
26 {
27 if( localIdx < segmentsSizesView[ segmentIdx ] )
28 data_view[ globalIdx ] = segmentsSizesView[ segmentIdx ] - localIdx;
29 } );
30
31 // Print original data
32 std::cout << "Original data in segments:\n";
34 segments,
35 [ = ] __cuda_callable__( Index globalIdx ) -> int
36 {
37 return data_view[ globalIdx ];
38 } ) << '\n';
39
41 // Sort each segment
42 auto fetch = [ = ] __cuda_callable__( Index segmentIdx, Index localIdx, Index globalIdx ) -> int
43 {
44 return data_view[ globalIdx ] != -1 ? data_view[ globalIdx ] : std::numeric_limits< int >::max();
45 };
46 auto compare = [] __cuda_callable__( const Value& a, const Value& b ) -> bool
47 {
48 return a <= b;
49 };
50 auto swap = [ = ] __cuda_callable__( Index globalIdx1, Index globalIdx2 ) mutable
51 {
52 TNL::swap( data_view[ globalIdx1 ], data_view[ globalIdx2 ] );
53 };
54
55 // Sort all segments
56 TNL::Algorithms::Segments::sortAllSegments( segments, fetch, compare, swap );
57
58 // Print sorted data
59 std::cout << "\nSorted data in segments (ascending order):\n";
61 segments,
62 [ = ] __cuda_callable__( Index globalIdx ) -> int
63 {
64 return data_view[ globalIdx ];
65 } ) << '\n';
67
69 // Sort only specific segments using segmentIndexes
70 TNL::Containers::Vector< Index, Device > segmentIndexes{ 1, 3 };
71 std::cout << "\nSorting only segments 1 and 3 in descending order:\n";
72
73 auto compareDesc = [] __cuda_callable__( Index a, Index b ) -> bool
74 {
75 return a >= b;
76 };
77
78 TNL::Algorithms::Segments::sortSegments( segments, segmentIndexes, fetch, compareDesc, swap );
79
80 // Print result
82 segments,
83 [ = ] __cuda_callable__( Index globalIdx ) -> int
84 {
85 return data_view[ globalIdx ];
86 } ) << '\n';
88
89 // Sort segments conditionally (only even-indexed segments)
90 std::cout << "\nSorting even-indexed segments in descending order:\n";
91 auto condition = [] __cuda_callable__( Index segmentIdx ) -> bool
92 {
93 return segmentIdx % 2 == 0;
94 };
95
96 TNL::Algorithms::Segments::sortAllSegmentsIf( segments, condition, fetch, compareDesc, swap );
97
98 // Print result
100 segments,
101 [ = ] __cuda_callable__( Index globalIdx ) -> int
102 {
103 return data_view[ globalIdx ];
104 } ) << '\n';
105}
106
107int
108main( int argc, char* argv[] )
109{
110 std::cout << "Running example on Host:\n";
111 sortExample< TNL::Devices::Host >();
112
113#ifdef __CUDACC__
114 std::cout << "Running example on Cuda:\n";
115 sortExample< TNL::Devices::Cuda >();
116#endif
117
118 return 0;
119}
T max(T... args)
__cuda_callable__ constexpr void swap(Type &a, Type &b) noexcept
This function swaps values of two parameters.
Definition Math.h:518

And the output looks as:

Running example on Host:
Original data in segments:
Segment 0: [ 1 ]
Segment 1: [ 2, 1 ]
Segment 2: [ 3, 2, 1 ]
Segment 3: [ 4, 3, 2, 1 ]
Segment 4: [ 5, 4, 3, 2, 1 ]
Sorted data in segments (ascending order):
Segment 0: [ 1 ]
Segment 1: [ 1, 2 ]
Segment 2: [ 1, 2, 3 ]
Segment 3: [ 1, 2, 3, 4 ]
Segment 4: [ 1, 2, 3, 4, 5 ]
Sorting only segments 1 and 3 in descending order:
Segment 0: [ 1 ]
Segment 1: [ 2, 1 ]
Segment 2: [ 1, 2, 3 ]
Segment 3: [ 4, 3, 2, 1 ]
Segment 4: [ 1, 2, 3, 4, 5 ]
Sorting even-indexed segments in descending order:
Segment 0: [ 1 ]
Segment 1: [ 2, 1 ]
Segment 2: [ 3, 2, 1 ]
Segment 3: [ 4, 3, 2, 1 ]
Segment 4: [ 5, 4, 3, 2, 1 ]
Running example on Cuda:
Original data in segments:
Segment 0: [ 1 ]
Segment 1: [ 2, 1 ]
Segment 2: [ 3, 2, 1 ]
Segment 3: [ 4, 3, 2, 1 ]
Segment 4: [ 5, 4, 3, 2, 1 ]
Sorted data in segments (ascending order):
Segment 0: [ 1 ]
Segment 1: [ 1, 2 ]
Segment 2: [ 1, 2, 3 ]
Segment 3: [ 1, 2, 3, 4 ]
Segment 4: [ 1, 2, 3, 4, 5 ]
Sorting only segments 1 and 3 in descending order:
Segment 0: [ 1 ]
Segment 1: [ 2, 1 ]
Segment 2: [ 1, 2, 3 ]
Segment 3: [ 4, 3, 2, 1 ]
Segment 4: [ 1, 2, 3, 4, 5 ]
Sorting even-indexed segments in descending order:
Segment 0: [ 1 ]
Segment 1: [ 2, 1 ]
Segment 2: [ 3, 2, 1 ]
Segment 3: [ 4, 3, 2, 1 ]
Segment 4: [ 5, 4, 3, 2, 1 ]

Print

The function TNL::Algorithms::Segments::print is used to print data managed by segments. It requires a lambda function fetch which reads data from individual elements.

The following code snippet demonstrates its use:

/***
* Print the data managed by the segments.
*/
auto fetch = [ = ] __cuda_callable__( int globalIdx ) -> double
{
return data_view[ globalIdx ];
};
std::cout << TNL::Algorithms::Segments::print( segments, fetch ) << '\n';

The complete example looks as follows:

1#include <iostream>
2#include <TNL/Containers/Vector.h>
3#include <TNL/Algorithms/Segments/CSR.h>
4#include <TNL/Algorithms/Segments/Ellpack.h>
5#include <TNL/Algorithms/Segments/traverse.h>
6#include <TNL/Devices/Host.h>
7#include <TNL/Devices/Cuda.h>
8
9template< typename Segments >
10void
11SegmentsExample()
12{
14 using Device = typename Segments::DeviceType;
15
16 /***
17 * Create segments with given segments sizes.
18 */
19 Segments segments{ 1, 2, 3, 4, 5 };
20
21 /***
22 * Allocate array for the segments;
23 */
24 TNL::Containers::Array< double, Device > data( segments.getStorageSize(), 0.0 );
26
28 /***
29 * Insert data into particular segments.
30 */
31 auto data_view = data.getView();
32 using SegmentViewType = typename Segments::SegmentViewType;
34 segments,
35 [ = ] __cuda_callable__( const SegmentViewType& segment ) mutable
36 {
37 double sum( 0.0 );
38 for( auto element : segment )
39 if( element.localIndex() <= element.segmentIndex() ) {
40 sum += element.localIndex() + 1;
41 data_view[ element.globalIndex() ] = sum;
42 }
43 } );
45
47 /***
48 * Print the data managed by the segments.
49 */
50 auto fetch = [ = ] __cuda_callable__( int globalIdx ) -> double
51 {
52 return data_view[ globalIdx ];
53 };
54 std::cout << TNL::Algorithms::Segments::print( segments, fetch ) << '\n';
56}
57
58int
59main( int argc, char* argv[] )
60{
61 std::cout << "Example of CSR segments on host:\n";
62 SegmentsExample< TNL::Algorithms::Segments::CSR< TNL::Devices::Host, int > >();
63
64 std::cout << "Example of Ellpack segments on host:\n";
65 SegmentsExample< TNL::Algorithms::Segments::Ellpack< TNL::Devices::Host, int > >();
66
67#ifdef __CUDACC__
68 std::cout << "Example of CSR segments on CUDA GPU:\n";
69 SegmentsExample< TNL::Algorithms::Segments::CSR< TNL::Devices::Cuda, int > >();
70
71 std::cout << "Example of Ellpack segments on CUDA GPU:\n";
72 SegmentsExample< TNL::Algorithms::Segments::Ellpack< TNL::Devices::Cuda, int > >();
73#endif
74 return EXIT_SUCCESS;
75}

The output of the example is:

Example of CSR segments on host:
Segment 0: [ 1 ]
Segment 1: [ 1, 3 ]
Segment 2: [ 1, 3, 6 ]
Segment 3: [ 1, 3, 6, 10 ]
Segment 4: [ 1, 3, 6, 10, 15 ]
Example of Ellpack segments on host:
Segment 0: [ 1, 0, 0, 0, 0 ]
Segment 1: [ 1, 3, 0, 0, 0 ]
Segment 2: [ 1, 3, 6, 0, 0 ]
Segment 3: [ 1, 3, 6, 10, 0 ]
Segment 4: [ 1, 3, 6, 10, 15 ]
Example of CSR segments on CUDA GPU:
Segment 0: [ 1 ]
Segment 1: [ 1, 3 ]
Segment 2: [ 1, 3, 6 ]
Segment 3: [ 1, 3, 6, 10 ]
Segment 4: [ 1, 3, 6, 10, 15 ]
Example of Ellpack segments on CUDA GPU:
Segment 0: [ 1, 0, 0, 0, 0 ]
Segment 1: [ 1, 3, 0, 0, 0 ]
Segment 2: [ 1, 3, 6, 0, 0 ]
Segment 3: [ 1, 3, 6, 10, 0 ]
Segment 4: [ 1, 3, 6, 10, 15 ]

Note: The function print does not mask padding elements. If your segment format includes padding, those elements will be printed as well.