Introduction
Segments represent a data structure for the manipulation of several local arrays (also denoted as segments) which, in general, have different sizes. All the local arrays are allocated in one contiguous global array. The segments data structure offers a mapping between the indexes of the particular local arrays and the indexes of the global array. Segments do not store any data; they only represent a layer for efficient access to and operations on a group of linear containers (i.e. the local arrays). One can perform parallel operations like for-loops or flexible reduction on the particular segments (local arrays).
A typical example where segments arise are the storage formats for sparse matrices. A sparse matrix like the following
\[
\left(
\begin{array}{ccccc}
1 & 0 & 2 & 0 & 0 \\
0 & 0 & 5 & 0 & 0 \\
3 & 4 & 7 & 9 & 0 \\
0 & 0 & 0 & 0 & 12 \\
0 & 0 & 15 & 17 & 20
\end{array}
\right)
\]
is usually first compressed, which means that the zero elements are omitted, to get the following "matrix":
\[
\begin{array}{ccccc}
1 & 2 \\
5 \\
3 & 4 & 7 & 9 \\
12 \\
15 & 17 & 20
\end{array}
\]
We also have to store the column index of each matrix element, in a "matrix" like this:
\[
\begin{array}{ccccc}
0 & 2 \\
2 \\
0 & 1 & 2 & 3 \\
4 \\
2 & 3 & 4
\end{array}
\]
Such "matrices" can be stored in memory in a row-wise manner in one contiguous array for performance reasons. The first "matrix" (i.e. the values of the matrix elements) would be stored as follows
\[
\begin{array}{|cc|c|cccc|c|cc|}
1 & 2 & 5 & 3 & 4 & 7 & 9 & 12 & 15 & 17 & 20
\end{array}
\]
and the second one (i.e. column indexes of the matrix values) as follows
\[
\begin{array}{|cc|c|cccc|c|cc|}
0 & 2 & 2 & 0 & 1 & 2 & 3 & 4 & 2 & 3 & 4
\end{array}
\]
What we see above is the so-called CSR sparse matrix format. It is the most popular format for the storage of sparse matrices and it is designed for high performance. However, it may not be the most efficient format on GPUs. Therefore, many other formats have been developed to achieve better performance. These formats often have a different layout of the matrix elements in memory. They have to deal especially with two difficulties:
- Efficient storage of matrix elements in the memory to fulfill the requirements of coalesced memory accesses on GPUs or good spatial locality for efficient use of caches on CPUs.
- Efficient mapping of GPU threads to different matrix rows.
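Before we list the formats offered by TNL, the CSR mapping itself can be captured in a few lines of plain C++. This is an illustration of the format only, not TNL code; the names `csrOffsets`, `csrSegmentSize` and `csrGlobalIndex` are ours. A row offsets array stores where each segment (matrix row) begins in the global array, and the global index of an element is the row offset plus the local index:

```cpp
#include <cassert>
#include <vector>

// Row offsets for the example matrix above: segment i occupies the
// global indexes [ offsets[ i ], offsets[ i + 1 ] ) of the value array.
const std::vector< int > csrOffsets{ 0, 2, 3, 7, 8, 11 };

// Size of the given segment (matrix row).
int
csrSegmentSize( const std::vector< int >& offsets, int segmentIdx )
{
   return offsets[ segmentIdx + 1 ] - offsets[ segmentIdx ];
}

// Global index of the element with the given local index in the given segment.
int
csrGlobalIndex( const std::vector< int >& offsets, int segmentIdx, int localIdx )
{
   return offsets[ segmentIdx ] + localIdx;
}
```

For the value array above, `csrGlobalIndex( csrOffsets, 2, 1 )` yields the global index 4, i.e. the position of the second element of the third row.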
TNL offers the following sparse matrix formats in the form of segments (the Ellpack formats often use so-called padding elements, i.e. padding zeros in terms of sparse matrices):
- CSR format (TNL::Algorithms::Segments::CSR) is the most popular format for sparse matrices. It is simple and very efficient, especially on CPUs, and today there are efficient kernels even for GPUs. The following GPU kernels are implemented in TNL:
- Scalar which maps one GPU thread for each segment (matrix row).
- Vector which maps one warp of GPU threads for each segment (matrix row).
- Adaptive ...
- Ellpack format (TNL::Algorithms::Segments::Ellpack) uses padding elements so that each segment has the same number of elements. It can be highly inefficient when one works with a few very long segments.
- SlicedEllpack format (TNL::Algorithms::Segments::SlicedEllpack), which was also published as the Row-grouped CSR format, is similar to the common Ellpack. However, SlicedEllpack first merges the segments into groups of 32. It also uses padding elements, but only the segments within the same group are aligned to the same size. Therefore, a few long segments do not cause such a severe performance drop.
- ChunkedEllpack format (TNL::Algorithms::Segments::ChunkedEllpack) is similar to SlicedEllpack, but it splits the segments into chunks, which allows mapping more GPU threads to one segment.
- BiEllpack format (TNL::Algorithms::Segments::BiEllpack) is similar to ChunkedEllpack. In addition, it sorts the segments within the same slice with respect to their length to achieve higher performance and better memory accesses.
Especially on GPUs, the performance of each format strongly depends on the distribution of the segment sizes. Therefore, we cannot say that one of the previous formats outperforms the others in general. To get the best performance, one should try several formats and choose the best one. This is why TNL offers several of them, and additional formats may be added in the future.
The necessity of working with this kind of data structure is not limited to sparse matrices. We can name at least a few other applications of segments:
- Graphs - one segment represents one graph node, and the elements in the segment are the indexes of its neighbors.
- Unstructured numerical meshes - an unstructured numerical mesh is, in fact, a graph.
- Particle-in-cell method - one segment represents one cell, and the elements in the segment are the indexes of the particles contained in the cell.
- K-means clustering - each segment represents one cluster, and the elements represent the vectors belonging to the given cluster.
- Hashing - the segments are particular rows of the hash table, and the elements in the segments correspond to colliding hashed elements.
In general, segments can be used for problems that somehow correspond to a 2D data structure where each row can have a different size and we need to perform miscellaneous operations within the rows. The name segments comes from segmented parallel reduction or segmented scan (prefix-sum).
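The connection to segmented reduction can be sketched with a serial segmented sum over a CSR-like layout. This is plain C++ for illustration only, not TNL code; the function name `segmentedSum` is ours:

```cpp
#include <cassert>
#include <vector>

// Serial segmented sum: segment i occupies the global indexes
// [ offsets[ i ], offsets[ i + 1 ] ) of the data array, and sums[ i ]
// becomes the sum of the elements of segment i.
std::vector< double >
segmentedSum( const std::vector< double >& data, const std::vector< int >& offsets )
{
   std::vector< double > sums( offsets.size() - 1, 0.0 );
   for( std::size_t segmentIdx = 0; segmentIdx + 1 < offsets.size(); segmentIdx++ )
      for( int globalIdx = offsets[ segmentIdx ]; globalIdx < offsets[ segmentIdx + 1 ]; globalIdx++ )
         sums[ segmentIdx ] += data[ globalIdx ];
   return sums;
}
```

For the sparse-matrix values above, `segmentedSum( { 1, 2, 5, 3, 4, 7, 9, 12, 15, 17, 20 }, { 0, 2, 3, 7, 8, 11 } )` returns the row sums { 3, 5, 23, 12, 52 }. Segments parallelize exactly this pattern over both segments and elements.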
Segments setup
Segments are defined just by the sizes of the particular segments. The following example shows how to create them:
```cpp
#include <iostream>
#include <TNL/Containers/Vector.h>
#include <TNL/Algorithms/Segments/CSR.h>
#include <TNL/Algorithms/Segments/Ellpack.h>
#include <TNL/Algorithms/Segments/ChunkedEllpack.h>
#include <TNL/Algorithms/Segments/BiEllpack.h>
#include <TNL/Algorithms/Segments/printSegments.h>
#include <TNL/Devices/Host.h>
#include <TNL/Devices/Cuda.h>

template< typename Segments >
void
SegmentsExample()
{
   // Create the segments: each element of the initializer list
   // defines the size of one segment.
   Segments segments{ 1, 2, 3, 4, 5 };

   // Print the segments setup, i.e. the real sizes of the particular segments.
   TNL::Algorithms::Segments::printSegments( std::cout, segments ) << std::endl;
}

int
main( int argc, char* argv[] )
{
   std::cout << "Example of CSR segments on host:" << std::endl;
   SegmentsExample< TNL::Algorithms::Segments::CSR< TNL::Devices::Host, int > >();
   // ... the same for Ellpack, ChunkedEllpack and BiEllpack and, when CUDA is
   // enabled, for the same formats on TNL::Devices::Cuda.
}
```
In the SegmentsExample function we use the constructor with an initializer list, where each element of the list defines the size of one segment. Next, we print the sizes of the particular segments. We call this function for different segments types (except TNL::Algorithms::Segments::SlicedEllpack, since it would behave the same as TNL::Algorithms::Segments::Ellpack on this small example).
The result looks as follows:
Example of CSR segments on host:
Segments sizes are: [ 1, 2, 3, 4, 5, ]
Example of Ellpack segments on host:
Segments sizes are: [ 5, 5, 5, 5, 5, ]
Example of ChunkedEllpack segments on host:
Segments sizes are: [ 17, 34, 51, 67, 87, ]
Example of BiEllpack segments on host:
Segments sizes are: [ 2, 4, 4, 5, 5, ]
Example of CSR segments on CUDA GPU:
Segments sizes are: [ 1, 2, 3, 4, 5, ]
Example of Ellpack segments on CUDA GPU:
Segments sizes are: [ 5, 5, 5, 5, 5, ]
Example of ChunkedEllpack segments on CUDA GPU:
Segments sizes are: [ 17, 34, 51, 67, 87, ]
Example of BiEllpack segments on CUDA GPU:
Segments sizes are: [ 2, 4, 4, 5, 5, ]
We can see that the real sizes of the segments differ for all Ellpack-based formats. As already mentioned, these formats often use padding elements to get more efficient access to the memory. For example, the TNL::Algorithms::Segments::ChunkedEllpack format involves many padding elements. However, this is only because of the very small example presented here; on large data, the overhead is not so significant.
We remind that segments represent a sparse format rather than a data structure, because they do not store any data. The following example shows how to connect segments with an array:
```cpp
#include <iostream>
#include <TNL/Containers/Vector.h>
#include <TNL/Algorithms/Segments/CSR.h>
#include <TNL/Algorithms/Segments/Ellpack.h>
#include <TNL/Algorithms/Segments/ChunkedEllpack.h>
#include <TNL/Algorithms/Segments/BiEllpack.h>
#include <TNL/Algorithms/Segments/printSegments.h>
#include <TNL/Devices/Host.h>
#include <TNL/Devices/Cuda.h>

template< typename Segments >
void
SegmentsExample()
{
   using Device = typename Segments::DeviceType;

   // Create the segments from a vector carrying the segment sizes.
   TNL::Containers::Vector< int, Device > sizes{ 1, 2, 3, 4, 5 };
   Segments segments( sizes );

   // Print the segments setup, i.e. the real sizes of the particular segments.
   TNL::Algorithms::Segments::printSegments( std::cout, segments ) << std::endl;

   // Allocate a vector able to address all elements of the segments.
   TNL::Containers::Vector< double, Device > data( segments.getStorageSize(), 0.0 );

   // Mark each element of the vector by its rank.
   data.forAllElements( [] __cuda_callable__ ( int idx, double& value ) { value = idx; } );

   // Print the elements of the vector segment by segment.
   auto data_view = data.getView();
   auto fetch = [ = ] __cuda_callable__ ( int globalIdx ) -> double {
      return data_view[ globalIdx ];
   };
   TNL::Algorithms::Segments::printSegments( std::cout, segments, fetch );
}

int
main( int argc, char* argv[] )
{
   std::cout << "Example of CSR segments on host:" << std::endl;
   SegmentsExample< TNL::Algorithms::Segments::CSR< TNL::Devices::Host, int > >();
   // ... the same for Ellpack, ChunkedEllpack and BiEllpack and, when CUDA is
   // enabled, for the same formats on TNL::Devices::Cuda.
}
```
In the SegmentsExample function we show how to create segments with a vector carrying the segment sizes. Of course, the same constructor works for arrays, array views and vector views as well. Next, we print the real segment sizes, which depend on the underlying format, in the same way as in the previous example. Subsequently, we allocate an array of the size requested by the segments by means of the method getStorageSize (TNL::Algorithms::Segments::CSR::getStorageSize for example). This method tells how many elements the segments need to be able to address all elements by their global index. We use the forAllElements method to mark each element of the array by its rank in the array. Finally, we use the function TNL::Algorithms::Segments::printSegments, which accepts a lambda function fetch as one of its parameters. The lambda function reads data from our array data (with the help of the array view data_view) according to the given global index globalIdx.
The result looks as follows:
Example of CSR segments on host:
Segments sizes are: [ 1, 2, 3, 4, 5, ]
Seg. 0: [ 0 ]
Seg. 1: [ 1, 2 ]
Seg. 2: [ 3, 4, 5 ]
Seg. 3: [ 6, 7, 8, 9 ]
Seg. 4: [ 10, 11, 12, 13, 14 ]
Example of Ellpack segments on host:
Segments sizes are: [ 5, 5, 5, 5, 5, ]
Seg. 0: [ 0, 1, 2, 3, 4 ]
Seg. 1: [ 5, 6, 7, 8, 9 ]
Seg. 2: [ 10, 11, 12, 13, 14 ]
Seg. 3: [ 15, 16, 17, 18, 19 ]
Seg. 4: [ 20, 21, 22, 23, 24 ]
Example of ChunkedEllpack segments on host:
Segments sizes are: [ 17, 34, 51, 67, 87, ]
Seg. 0: [ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16 ]
Seg. 1: [ 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50 ]
Seg. 2: [ 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101 ]
Seg. 3: [ 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168 ]
Seg. 4: [ 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239, 240, 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 251, 252, 253, 254, 255 ]
Example of BiEllpack segments on host:
Segments sizes are: [ 2, 4, 4, 5, 5, ]
Seg. 0: [ 8, 9 ]
Seg. 1: [ 6, 7, 22, 23 ]
Seg. 2: [ 4, 5, 20, 21 ]
Seg. 3: [ 2, 3, 18, 19, 25 ]
Seg. 4: [ 0, 1, 16, 17, 24 ]
Example of CSR segments on CUDA GPU:
Segments sizes are: [ 1, 2, 3, 4, 5, ]
Seg. 0: [ 0 ]
Seg. 1: [ 1, 2 ]
Seg. 2: [ 3, 4, 5 ]
Seg. 3: [ 6, 7, 8, 9 ]
Seg. 4: [ 10, 11, 12, 13, 14 ]
Example of Ellpack segments on CUDA GPU:
Segments sizes are: [ 5, 5, 5, 5, 5, ]
Seg. 0: [ 0, 32, 64, 96, 128 ]
Seg. 1: [ 1, 33, 65, 97, 129 ]
Seg. 2: [ 2, 34, 66, 98, 130 ]
Seg. 3: [ 3, 35, 67, 99, 131 ]
Seg. 4: [ 4, 36, 68, 100, 132 ]
Example of ChunkedEllpack segments on CUDA GPU:
Segments sizes are: [ 17, 34, 51, 67, 87, ]
Seg. 0: [ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16 ]
Seg. 1: [ 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50 ]
Seg. 2: [ 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101 ]
Seg. 3: [ 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168 ]
Seg. 4: [ 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239, 240, 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 251, 252, 253, 254, 255 ]
Example of BiEllpack segments on CUDA GPU:
Segments sizes are: [ 2, 4, 4, 5, 5, ]
Seg. 0: [ 4, 12 ]
Seg. 1: [ 3, 11, 19, 23 ]
Seg. 2: [ 2, 10, 18, 22 ]
Seg. 3: [ 1, 9, 17, 21, 25 ]
Seg. 4: [ 0, 8, 16, 20, 24 ]
Frankly, what we see above is not so important. It only shows that different segments formats can use very different mappings of the elements, identified by their segment index and local index (the rank of the element in the given segment), to a global index which serves as an address in the related container.
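The Ellpack outputs above illustrate this well. On the host, the elements of one segment are stored contiguously (row-major order), while on the GPU the elements with the same local index are stored next to each other, with the number of segments padded to an alignment (32 in the output above), so that consecutive GPU threads access consecutive addresses. A plain C++ sketch of the two mappings follows (an illustration only; the actual alignment is an implementation detail of TNL, and the function names are ours):

```cpp
#include <cassert>

// Row-major Ellpack mapping (host): the elements of one segment are contiguous.
int
ellpackRowMajor( int segmentIdx, int localIdx, int alignedSegmentSize )
{
   return segmentIdx * alignedSegmentSize + localIdx;
}

// Column-major Ellpack mapping (GPU): elements with the same local index are
// contiguous, so a warp processing neighboring segments reads a coalesced block.
int
ellpackColumnMajor( int segmentIdx, int localIdx, int alignedSegmentsCount )
{
   return localIdx * alignedSegmentsCount + segmentIdx;
}
```

With five segments of size 5 on the host, and five segments padded to 32 on the GPU, these mappings reproduce the global indexes printed above: segment 1 starts at global index 5 in the row-major layout and occupies the indexes 1, 33, 65, ... in the column-major one.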
Iteration over elements of segments
In this section, we show how to iterate over the elements of segments and how to manipulate them. There are three possible ways:
- Method forElements (TNL::Algorithms::Segments::CSR::forElements for example), which iterates in parallel over all elements of the segments and performs a given lambda function on each of them.
- Method forSegments (TNL::Algorithms::Segments::CSR::forSegments for example), which iterates in parallel over all segments. It is a better choice when we need to process each segment sequentially, or when there is a significant amount of computation common to all elements of a segment.
- Method sequentialForSegments (TNL::Algorithms::Segments::CSR::sequentialForSegments for example), which iterates over all segments sequentially, i.e. using only one thread even on GPUs. It is useful for debugging or printing, for example.
The methods iterating over particular segments use a segment view (TNL::Algorithms::Segments::SegmentView) to access the elements of the given segment. The segment view also offers an iterator for better convenience.
Method forElements
The following example shows the use of the method forElements:
```cpp
#include <iostream>
#include <TNL/Containers/Vector.h>
#include <TNL/Algorithms/Segments/CSR.h>
#include <TNL/Algorithms/Segments/Ellpack.h>
#include <TNL/Algorithms/Segments/printSegments.h>
#include <TNL/Devices/Host.h>
#include <TNL/Devices/Cuda.h>

template< typename Segments >
void
SegmentsExample()
{
   using Device = typename Segments::DeviceType;

   // Create segments with linearly increasing sizes.
   Segments segments{ 1, 2, 3, 4, 5 };

   // Allocate a vector for the data managed by the segments.
   TNL::Containers::Vector< double, Device > data( segments.getStorageSize(), 0.0 );
   auto data_view = data.getView();

   // Set each element to the index of its segment, with no check for padding elements.
   segments.forAllElements(
      [ = ] __cuda_callable__ ( int segmentIdx, int localIdx, int globalIdx ) mutable {
         data_view[ globalIdx ] = segmentIdx;
      } );

   std::cout << "Data setup with no check ..." << std::endl;
   std::cout << "Array: " << data << std::endl;
   auto fetch = [ = ] __cuda_callable__ ( int globalIdx ) -> double {
      return data_view[ globalIdx ];
   };
   TNL::Algorithms::Segments::printSegments( std::cout, segments, fetch );

   // Reset the data and set the elements again, this time skipping the padding elements.
   data = 0.0;
   segments.forAllElements(
      [ = ] __cuda_callable__ ( int segmentIdx, int localIdx, int globalIdx ) mutable {
         if( localIdx <= segmentIdx )
            data_view[ globalIdx ] = segmentIdx;
      } );

   std::cout << "Data setup with check for padding elements..." << std::endl;
   std::cout << "Array: " << data << std::endl;
   TNL::Algorithms::Segments::printSegments( std::cout, segments, fetch );
}

int
main( int argc, char* argv[] )
{
   std::cout << "Example of CSR segments on host:" << std::endl;
   SegmentsExample< TNL::Algorithms::Segments::CSR< TNL::Devices::Host, int > >();
   std::cout << "Example of Ellpack segments on host:" << std::endl;
   SegmentsExample< TNL::Algorithms::Segments::Ellpack< TNL::Devices::Host, int > >();
#ifdef __CUDACC__
   std::cout << "Example of CSR segments on CUDA GPU:" << std::endl;
   SegmentsExample< TNL::Algorithms::Segments::CSR< TNL::Devices::Cuda, int > >();
   std::cout << "Example of Ellpack segments on CUDA GPU:" << std::endl;
   SegmentsExample< TNL::Algorithms::Segments::Ellpack< TNL::Devices::Cuda, int > >();
#endif
}
```
We first create segments with linearly increasing sizes (so the layout resembles a lower triangular matrix). Next, we allocate the array data with the same size as the number of elements managed by the segments. This size can be obtained by the method getStorageSize (TNL::Algorithms::Segments::CSR::getStorageSize for example). We prepare the array view data_view for use in the lambda functions. Finally, we call the method forAllElements, which iterates in parallel over all elements of the segments and calls the given lambda function for each of them. The lambda function receives three arguments: segmentIdx is the index of the segment the element belongs to, localIdx is the rank of the element within the segment, and globalIdx is the index of the element in the array data. We use the global index to set the proper element of the array data to the index of its segment.
At the end, we print the array data, where we can see the elements belonging to particular segments by their indexes. The layout of the elements depends on the type of the segments (i.e. on the sparse format in use). Next, we print the elements of the array data segment by segment using the function printSegments, which iterates over all elements and reads the elements of the array data with the help of the fetch lambda function.
Note that for the Ellpack format, the output looks as follows:
Seg. 0: [ 0, 0, 0, 0, 0 ]
Seg. 1: [ 1, 1, 1, 1, 1 ]
Seg. 2: [ 2, 2, 2, 2, 2 ]
Seg. 3: [ 3, 3, 3, 3, 3 ]
Seg. 4: [ 4, 4, 4, 4, 4 ]
We see more elements than we have requested. The reason is that the Ellpack format uses padding elements to optimize access to the memory. Segments give access even to the padding elements; they could be useful when we later need additional elements. Therefore, we need to distinguish between relevant and padding elements each time we work with the elements of segments. This is demonstrated in the previous example in the second use of forAllElements, where we set the array data again, but this time we check for the padding elements. After printing the segments the same way as before, we get the correct result:
Seg. 0: [ 0, 0, 0, 0, 0 ]
Seg. 1: [ 1, 1, 0, 0, 0 ]
Seg. 2: [ 2, 2, 2, 0, 0 ]
Seg. 3: [ 3, 3, 3, 3, 0 ]
Seg. 4: [ 4, 4, 4, 4, 4 ]
The result of the whole example looks as follows:
Example of CSR segments on host:
Data setup with no check ...
Array: [ 0, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 4 ]
Seg. 0: [ 0 ]
Seg. 1: [ 1, 1 ]
Seg. 2: [ 2, 2, 2 ]
Seg. 3: [ 3, 3, 3, 3 ]
Seg. 4: [ 4, 4, 4, 4, 4 ]
Data setup with check for padding elements...
Array: [ 0, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 4 ]
Seg. 0: [ 0 ]
Seg. 1: [ 1, 1 ]
Seg. 2: [ 2, 2, 2 ]
Seg. 3: [ 3, 3, 3, 3 ]
Seg. 4: [ 4, 4, 4, 4, 4 ]
Example of Ellpack segments on host:
Data setup with no check ...
Array: [ 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4 ]
Seg. 0: [ 0, 0, 0, 0, 0 ]
Seg. 1: [ 1, 1, 1, 1, 1 ]
Seg. 2: [ 2, 2, 2, 2, 2 ]
Seg. 3: [ 3, 3, 3, 3, 3 ]
Seg. 4: [ 4, 4, 4, 4, 4 ]
Data setup with check for padding elements...
Array: [ 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 2, 2, 2, 0, 0, 3, 3, 3, 3, 0, 4, 4, 4, 4, 4 ]
Seg. 0: [ 0, 0, 0, 0, 0 ]
Seg. 1: [ 1, 1, 0, 0, 0 ]
Seg. 2: [ 2, 2, 2, 0, 0 ]
Seg. 3: [ 3, 3, 3, 3, 0 ]
Seg. 4: [ 4, 4, 4, 4, 4 ]
Example of CSR segments on CUDA GPU:
Data setup with no check ...
Array: [ 0, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 4 ]
Seg. 0: [ 0 ]
Seg. 1: [ 1, 1 ]
Seg. 2: [ 2, 2, 2 ]
Seg. 3: [ 3, 3, 3, 3 ]
Seg. 4: [ 4, 4, 4, 4, 4 ]
Data setup with check for padding elements...
Array: [ 0, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 4 ]
Seg. 0: [ 0 ]
Seg. 1: [ 1, 1 ]
Seg. 2: [ 2, 2, 2 ]
Seg. 3: [ 3, 3, 3, 3 ]
Seg. 4: [ 4, 4, 4, 4, 4 ]
Example of Ellpack segments on CUDA GPU:
Data setup with no check ...
Array: [ 0, 1, 2, 3, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 3, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 3, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 3, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 3, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ]
Seg. 0: [ 0, 0, 0, 0, 0 ]
Seg. 1: [ 1, 1, 1, 1, 1 ]
Seg. 2: [ 2, 2, 2, 2, 2 ]
Seg. 3: [ 3, 3, 3, 3, 3 ]
Seg. 4: [ 4, 4, 4, 4, 4 ]
Data setup with check for padding elements...
Array: [ 0, 1, 2, 3, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 3, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 3, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ]
Seg. 0: [ 0, 0, 0, 0, 0 ]
Seg. 1: [ 1, 1, 0, 0, 0 ]
Seg. 2: [ 2, 2, 2, 0, 0 ]
Seg. 3: [ 3, 3, 3, 3, 0 ]
Seg. 4: [ 4, 4, 4, 4, 4 ]
Method forSegments
The method forSegments iterates in parallel over the particular segments. The iteration over the elements within one segment is sequential. There are two reasons for such an approach:
- The iteration over the elements within the same segment must be sequential, i.e. the computation with one element depends on the result of the computation with the previous one.
- Some part of the computation is common to all elements of one segment. In this case, we can first perform the common part and then iterate over the elements. If we used the method forElements, the common part would have to be performed for each element.
Sequential dependency
The first situation is demonstrated in the following example:
```cpp
#include <iostream>
#include <TNL/Containers/Vector.h>
#include <TNL/Algorithms/Segments/CSR.h>
#include <TNL/Algorithms/Segments/Ellpack.h>
#include <TNL/Algorithms/Segments/printSegments.h>
#include <TNL/Devices/Host.h>
#include <TNL/Devices/Cuda.h>

template< typename Segments >
void
SegmentsExample()
{
   using Device = typename Segments::DeviceType;
   using SegmentViewType = typename Segments::SegmentViewType;

   // Create the segments and the related data vector.
   Segments segments{ 1, 2, 3, 4, 5 };
   TNL::Containers::Vector< double, Device > data( segments.getStorageSize(), 0.0 );
   auto data_view = data.getView();

   // Compute the cumulative sum of the element ranks within each segment.
   // The iteration over the elements of one segment is sequential.
   segments.forAllSegments(
      [ = ] __cuda_callable__ ( const SegmentViewType& segment ) mutable {
         double sum = 0.0;
         for( auto element : segment )
            if( element.localIndex() <= element.segmentIndex() ) {
               sum += element.localIndex() + 1;
               data_view[ element.globalIndex() ] = sum;
            }
      } );

   // Print the result segment by segment.
   auto fetch = [ = ] __cuda_callable__ ( int globalIdx ) -> double {
      return data_view[ globalIdx ];
   };
   TNL::Algorithms::Segments::printSegments( std::cout, segments, fetch );
}

int
main( int argc, char* argv[] )
{
   std::cout << "Example of CSR segments on host:" << std::endl;
   SegmentsExample< TNL::Algorithms::Segments::CSR< TNL::Devices::Host, int > >();
   std::cout << "Example of Ellpack segments on host:" << std::endl;
   SegmentsExample< TNL::Algorithms::Segments::Ellpack< TNL::Devices::Host, int > >();
   // ... and, when CUDA is enabled, the same for TNL::Devices::Cuda.
}
```
The code is the same as in the previous example up to the insertion of data into the particular segments. Instead of calling the method forAllElements, we call the method forAllSegments, for which we need to define the type SegmentViewType (TNL::Algorithms::Segments::CSR::SegmentViewType for example). The lambda function passed to this method receives the segment view and iterates over all elements of the segment by means of a for loop. We use the auxiliary variable sum to compute the cumulative sum of the elements in each segment, which is exactly the sequential dependency. The result looks as follows:
Example of CSR segments on host:
Seg. 0: [ 1 ]
Seg. 1: [ 1, 3 ]
Seg. 2: [ 1, 3, 6 ]
Seg. 3: [ 1, 3, 6, 10 ]
Seg. 4: [ 1, 3, 6, 10, 15 ]
Example of Ellpack segments on host:
Seg. 0: [ 1, 0, 0, 0, 0 ]
Seg. 1: [ 1, 3, 0, 0, 0 ]
Seg. 2: [ 1, 3, 6, 0, 0 ]
Seg. 3: [ 1, 3, 6, 10, 0 ]
Seg. 4: [ 1, 3, 6, 10, 15 ]
Example of CSR segments on CUDA GPU:
Seg. 0: [ 1 ]
Seg. 1: [ 1, 3 ]
Seg. 2: [ 1, 3, 6 ]
Seg. 3: [ 1, 3, 6, 10 ]
Seg. 4: [ 1, 3, 6, 10, 15 ]
Example of Ellpack segments on CUDA GPU:
Seg. 0: [ 1, 0, 0, 0, 0 ]
Seg. 1: [ 1, 3, 0, 0, 0 ]
Seg. 2: [ 1, 3, 6, 0, 0 ]
Seg. 3: [ 1, 3, 6, 10, 0 ]
Seg. 4: [ 1, 3, 6, 10, 15 ]
Common computations
Now let's take a look at the second situation, i.e. when there are computations common to all elements of one segment. In the following example, we first set the values of the elements using the method forAllElements, which we are already familiar with. Next, we print the values of all elements, and then we use the method forAllSegments to divide each element by the sum of the values of all elements in its segment. So we first sum up all elements of the segment; this is the common part of the computation for all elements of the segment. Next, we divide all elements by the value of the variable sum.
```cpp
#include <iostream>
#include <TNL/Containers/Vector.h>
#include <TNL/Algorithms/Segments/CSR.h>
#include <TNL/Algorithms/Segments/printSegments.h>
#include <TNL/Devices/Host.h>
#include <TNL/Devices/Cuda.h>

template< typename Device >
void
SegmentsExample()
{
   using SegmentsType = TNL::Algorithms::Segments::CSR< Device, int >;
   using SegmentViewType = typename SegmentsType::SegmentViewType;

   // Create the segments and the related data vector.
   SegmentsType segments{ 1, 2, 3, 4, 5 };
   TNL::Containers::Vector< double, Device > data( segments.getStorageSize(), 0.0 );
   auto data_view = data.getView();

   // Set the initial values of the elements.
   segments.forAllElements(
      [ = ] __cuda_callable__ ( int segmentIdx, int localIdx, int globalIdx ) mutable {
         data_view[ globalIdx ] = localIdx + 1;
      } );

   std::cout << "Values of elements after initial setup:" << std::endl;
   auto fetch = [ = ] __cuda_callable__ ( int globalIdx ) -> double {
      return data_view[ globalIdx ];
   };
   TNL::Algorithms::Segments::printSegments( std::cout, segments, fetch );

   // Divide each element by the sum of all elements of its segment. Summing the
   // elements is the common part of the computation, performed once per segment.
   segments.forAllSegments(
      [ = ] __cuda_callable__ ( const SegmentViewType& segment ) mutable {
         double sum = 0.0;
         for( auto element : segment )
            if( element.localIndex() <= element.segmentIndex() )
               sum += data_view[ element.globalIndex() ];
         for( auto element : segment )
            if( element.localIndex() <= element.segmentIndex() )
               data_view[ element.globalIndex() ] /= sum;
      } );

   std::cout << "Value of elements after dividing by sum in each segment:" << std::endl;
   TNL::Algorithms::Segments::printSegments( std::cout, segments, fetch );
}

int
main( int argc, char* argv[] )
{
   std::cout << "Example of CSR segments on host:" << std::endl;
   SegmentsExample< TNL::Devices::Host >();
   // ... and, when CUDA is enabled, the same for TNL::Devices::Cuda.
}
```
The result looks as follows:
Example of CSR segments on host:
Values of elements after initial setup:
Seg. 0: [ 1 ]
Seg. 1: [ 1, 2 ]
Seg. 2: [ 1, 2, 3 ]
Seg. 3: [ 1, 2, 3, 4 ]
Seg. 4: [ 1, 2, 3, 4, 5 ]
Value of elements after dividing by sum in each segment:
Seg. 0: [ 1 ]
Seg. 1: [ 0.333333, 0.666667 ]
Seg. 2: [ 0.166667, 0.333333, 0.5 ]
Seg. 3: [ 0.1, 0.2, 0.3, 0.4 ]
Seg. 4: [ 0.0666667, 0.133333, 0.2, 0.266667, 0.333333 ]
Example of CSR segments on CUDA GPU:
Values of elements after initial setup:
Seg. 0: [ 1 ]
Seg. 1: [ 1, 2 ]
Seg. 2: [ 1, 2, 3 ]
Seg. 3: [ 1, 2, 3, 4 ]
Seg. 4: [ 1, 2, 3, 4, 5 ]
Value of elements after dividing by sum in each segment:
Seg. 0: [ 1 ]
Seg. 1: [ 0.333333, 0.666667 ]
Seg. 2: [ 0.166667, 0.333333, 0.5 ]
Seg. 3: [ 0.1, 0.2, 0.3, 0.4 ]
Seg. 4: [ 0.0666667, 0.133333, 0.2, 0.266667, 0.333333 ]
Flexible reduction within segments
In this section, we explain an extension of flexible (parallel) reduction to segments. It allows reducing all elements within the same segment and storing the results into an output array. See the following example:
```cpp
#include <iostream>
#include <functional>
#include <TNL/Containers/Vector.h>
#include <TNL/Algorithms/Segments/CSR.h>
#include <TNL/Algorithms/Segments/Ellpack.h>
#include <TNL/Algorithms/Segments/printSegments.h>
#include <TNL/Algorithms/SegmentsReductionKernels/DefaultKernel.h>
#include <TNL/Devices/Host.h>
#include <TNL/Devices/Cuda.h>

template< typename Segments >
void
SegmentsExample()
{
   using Device = typename Segments::DeviceType;
   // The default reduction kernel for the given segments type.
   using SegmentsReductionKernel =
      typename TNL::Algorithms::SegmentsReductionKernels::DefaultKernel< Segments >::type;

   // Create the segments and the related data vector and set up the elements.
   Segments segments{ 1, 2, 3, 4, 5 };
   TNL::Containers::Vector< double, Device > data( segments.getStorageSize(), 0.0 );
   auto data_view = data.getView();
   segments.forAllElements(
      [ = ] __cuda_callable__ ( int segmentIdx, int localIdx, int globalIdx ) mutable {
         if( localIdx <= segmentIdx )
            data_view[ globalIdx ] = segmentIdx;
      } );

   std::cout << "Values of elements after initial setup:" << std::endl;
   auto fetch = [ = ] __cuda_callable__ ( int globalIdx ) -> double {
      return data_view[ globalIdx ];
   };
   TNL::Algorithms::Segments::printSegments( std::cout, segments, fetch );

   // Vector for the sums of the particular segments.
   TNL::Containers::Vector< double, Device > sums( segments.getSegmentsCount() );
   auto sums_view = sums.getView();

   // Full form of the fetch lambda: it receives the segment index, the local
   // index and the global index of the element.
   auto fetch_full = [ = ] __cuda_callable__ ( int segmentIdx, int localIdx, int globalIdx ) -> double {
      if( localIdx <= segmentIdx )
         return data_view[ globalIdx ];
      return 0.0;
   };

   // Brief form of the fetch lambda: it receives only the global index.
   auto fetch_brief = [ = ] __cuda_callable__ ( int globalIdx ) -> double {
      return data_view[ globalIdx ];
   };

   // Lambda function storing the result of the reduction in each segment.
   auto keep = [ = ] __cuda_callable__ ( int segmentIdx, const double& value ) mutable {
      sums_view[ segmentIdx ] = value;
   };

   SegmentsReductionKernel kernel;
   kernel.init( segments );
   kernel.reduceAllSegments( segments, fetch_full, std::plus<>{}, keep, 0.0 );
   std::cout << "The sums with full fetch form are: " << sums << std::endl;
   kernel.reduceAllSegments( segments, fetch_brief, std::plus<>{}, keep, 0.0 );
   std::cout << "The sums with brief fetch form are: " << sums << std::endl;
}

int
main( int argc, char* argv[] )
{
   std::cout << "Example of CSR segments on host:" << std::endl;
   SegmentsExample< TNL::Algorithms::Segments::CSR< TNL::Devices::Host, int > >();
   std::cout << "Example of Ellpack segments on host:" << std::endl;
   SegmentsExample< TNL::Algorithms::Segments::Ellpack< TNL::Devices::Host, int > >();
   // ... and, when CUDA is enabled, the same for TNL::Devices::Cuda.
}
```
We first create the segments segments, the related array data, and set up the elements using the forAllElements method. After printing the segments, we are ready for the parallel reduction. It requires three lambda functions:
The lambda function fetch reads the data belonging to the particular elements of the segments. It can have two different forms, brief and full:
Brief form - in this case the lambda function gets only the global index and the compute flag:

```cpp
auto fetch = [=] __cuda_callable__ ( int globalIdx, bool& compute ) -> double { ... };
```
Full form - in this case the lambda function receives also the segment index and the local index of the element:

```cpp
auto fetch = [=] __cuda_callable__ ( int segmentIdx, int localIdx, int globalIdx, bool& compute ) -> double { ... };
```
where segmentIdx is the index of the segment, localIdx is the rank of the element within the segment, globalIdx is the index of the element in the related array, and compute serves for interrupting the reduction, meaning that the remaining elements of the segment can be omitted. Many formats used for segments are optimized for much higher performance when the brief variant is used. The form of the fetch lambda function is detected automatically using SFINAE, so both variants are very easy to use.
The lambda function reduce represents the reduction operation; in our case, it is defined as follows:

```cpp
auto reduce = [=] __cuda_callable__ ( const double& a, const double& b ) -> double { return a + b; };
```

or, in fact, we can simply use the function object std::plus.
The lambda function keep is responsible for storing the results. It is supposed to be defined as:

```cpp
auto keep = [=] __cuda_callable__ ( int segmentIdx, const double& value ) mutable { ... };
```

where segmentIdx is the index of the segment whose reduction result we aim to store and value is the result of the reduction in the segment.
To use the reduction within segments, we first create the vector sums, where we will store the results, and prepare a view of this vector for later use in the lambda functions. We demonstrate both variants of the fetch function: the full one in fetch_full and the brief one in fetch_brief. Next, we define the lambda function keep for storing the sums from the particular segments into the vector sums. Finally, we call the method reduceAllSegments (TNL::Algorithms::SegmentsReductionKernels::CSRScalarKernel::reduceAllSegments for example) to compute the reductions in the segments, first with fetch_full and then with fetch_brief. In both cases, we use std::plus for the reduction and pass zero (the last argument) as the identity element of the summation. In both cases, we print the results, which are supposed to be the same.
The result looks as follows:
Example of CSR segments on host:
Values of elements after initial setup:
Seg. 0: [ 0 ]
Seg. 1: [ 1, 1 ]
Seg. 2: [ 2, 2, 2 ]
Seg. 3: [ 3, 3, 3, 3 ]
Seg. 4: [ 4, 4, 4, 4, 4 ]
The sums with full fetch form are: [ 0, 2, 6, 12, 20 ]
The sums with brief fetch form are: [ 0, 2, 6, 12, 20 ]
Example of Ellpack segments on host:
Values of elements after initial setup:
Seg. 0: [ 0, 0, 0, 0, 0 ]
Seg. 1: [ 1, 1, 0, 0, 0 ]
Seg. 2: [ 2, 2, 2, 0, 0 ]
Seg. 3: [ 3, 3, 3, 3, 0 ]
Seg. 4: [ 4, 4, 4, 4, 4 ]
The sums with full fetch form are: [ 0, 2, 6, 12, 20 ]
The sums with brief fetch form are: [ 0, 2, 6, 12, 20 ]
Example of CSR segments on CUDA GPU:
Values of elements after initial setup:
Seg. 0: [ 0 ]
Seg. 1: [ 1, 1 ]
Seg. 2: [ 2, 2, 2 ]
Seg. 3: [ 3, 3, 3, 3 ]
Seg. 4: [ 4, 4, 4, 4, 4 ]
The sums with full fetch form are: [ 0, 2, 6, 12, 20 ]
The sums with brief fetch form are: [ 0, 2, 6, 12, 20 ]
Example of Ellpack segments on CUDA GPU:
Values of elements after initial setup:
Seg. 0: [ 0, 0, 0, 0, 0 ]
Seg. 1: [ 1, 1, 0, 0, 0 ]
Seg. 2: [ 2, 2, 2, 0, 0 ]
Seg. 3: [ 3, 3, 3, 3, 0 ]
Seg. 4: [ 4, 4, 4, 4, 4 ]
The sums with full fetch form are: [ 0, 2, 6, 12, 20 ]
The sums with brief fetch form are: [ 0, 2, 6, 12, 20 ]