Template Numerical Library version\ main:f17d0c8
Matrices

TODO: Add description of forRows and sequentialForRows.

Introduction

TNL offers several types of matrices: dense (TNL::Matrices::DenseMatrix), sparse (TNL::Matrices::SparseMatrix), tridiagonal (TNL::Matrices::TridiagonalMatrix), multidiagonal (TNL::Matrices::MultidiagonalMatrix) and lambda matrices (TNL::Matrices::LambdaMatrix). The sparse matrices can be symmetric to lower the memory requirements. The interfaces of the given matrix types are designed to be as unified as possible to ensure that the user can easily switch between different matrix types with little or no change to the source code. All matrix types allow traversing all matrix elements and manipulating them using lambda functions, as well as performing flexible reduction in matrix rows. The following text describes the particular matrix types and their unified interface in detail.

Overview of matrix types

In many numerical algorithms, either dense or sparse matrices are used. A dense matrix (TNL::Matrices::DenseMatrix) is one in which all, or at least most, of the matrix elements are nonzero. On the other hand, a sparse matrix (TNL::Matrices::SparseMatrix) is a matrix in which most of the matrix elements are equal to zero. From the implementation point of view, the data structures for dense matrices allocate all matrix elements, while formats for sparse matrices aim to store explicitly only the nonzero matrix elements. The most popular format for storing sparse matrices is the CSR format. However, many other formats were designed, especially for better data alignment in the memory of GPUs. In TNL, the user may choose between several different sparse matrix formats. There are also sparse matrices with a specific pattern of the nonzero elements, like tridiagonal matrices (TNL::Matrices::TridiagonalMatrix), which have nonzero elements only on the main diagonal and on the first diagonals directly below and above it. An example of such a matrix may look as follows:

\[ \left( \begin{array}{ccccccc} -2 & 1 & . & . & . & . \\ 1 & -2 & 1 & . & . & . \\ . & 1 & -2 & 1 & . & . \\ . & . & 1 & -2 & 1 & . \\ . & . & . & 1 & -2 & 1 \\ . & . & . & . & 1 & -2 \end{array} \right) \]

A similar but more general type of matrix is the multidiagonal matrix (TNL::Matrices::MultidiagonalMatrix), which has its nonzero matrix elements positioned only on lines parallel to the main diagonal, like the following matrix:

\[ \left( \begin{array}{ccccccc} -4 & 1 & . & 1 & . & . \\ 1 & -4 & 1 & . & 1 & . \\ . & 1 & -4 & 1 & . & 1 \\ 1 & . & 1 & -4 & 1 & . \\ . & 1 & . & 1 & -4 & 1 \\ . & . & 1 & . & 1 & -4 \end{array} \right) \]

Finally, TNL offers so-called lambda matrices (TNL::Matrices::LambdaMatrix), which are a kind of "matrix-free matrix". They do not store the matrix elements explicitly in memory, but rather evaluate them on the fly based on user-defined lambda functions.

The following table shows a comparison of expressing a tridiagonal matrix by means of different matrix types:

Matrix dimensions Dense elems. Dense mem. Sparse elems. Sparse mem. Tridiag. elems. Tridiag. mem. Multidiag. elems. Multidiag. mem.
10x10 100 800 B >28 >336 B 30 240 B 30 252 B
100x100 10,000 80 kB >298 >3,576 B 300 2,400 B 300 2,412 B
1,000x1,000 1,000,000 8 MB >2,998 >35,976 B 3,000 24,000 B 3,000 24,012 B
10,000x10,000 100,000,000 800 MB >29,998 >359,976 B 30,000 240,000 B 30,000 240,012 B
100,000x100,000 10,000,000,000 80 GB >299,998 >3,599,976 B 300,000 2,400,000 B 300,000 2,400,012 B

In the table:

  • Matrix dimensions is the number of matrix rows and columns.
  • Dense elems. is the number of allocated matrix elements in the dense matrix.
  • Dense mem. is the memory allocated for the matrix elements in the dense matrix if the elements are stored in double precision.
  • Sparse elems. is the number of allocated matrix elements in the sparse matrix. Some formats may allocate padding zeros for better data alignment in memory, and so the number of allocated matrix elements may increase.
  • Sparse mem. is the memory allocated for the matrix elements in the sparse matrix if the elements are stored in double precision and the column indexes in 32-bit integers.
  • Tridiag. elems. is the number of allocated matrix elements in the tridiagonal matrix.
  • Tridiag. mem. is the memory allocated for the matrix elements in the tridiagonal matrix if the elements are stored in double precision.
  • Multidiag. elems. is the number of allocated matrix elements in the multidiagonal matrix.
  • Multidiag. mem. is the memory allocated for the matrix elements in the multidiagonal matrix if the elements are stored in double precision.

Choosing the best matrix type can have a tremendous impact on the performance as well as on the memory requirements. If we treated each matrix as a dense one, we would not be able to work with matrices larger than 50,000x50,000 on common personal computers, because we would need tens of gigabytes of memory. At the same time, we see that the other matrix types can do the same job with only a few megabytes. In addition, the other matrix types work with far fewer matrix elements, so operations like matrix-vector multiplication can be done with significantly fewer operations and therefore much faster. Since in modern hardware architectures the computing units are limited mainly by the performance of the memory chips (the so-called memory wall), transferring less data from memory increases the performance even more.

The following table shows the same as the one above, but for a matrix which has only five nonzero elements in each row. Such matrices often arise from the finite difference method for the solution of partial differential equations:

Matrix dimensions Dense elems. Dense mem. Sparse elems. Sparse mem. Multidiag. elems. Multidiag. mem.
10x10 100 800 B >50 >600 B 50 420 B
100x100 10,000 80 kB >500 >6,000 B 500 4,020 B
1,000x1,000 1,000,000 8 MB >5,000 >60,000 B 5,000 40,020 B
10,000x10,000 100,000,000 800 MB >50,000 >600,000 B 50,000 400,020 B
100,000x100,000 10,000,000,000 80 GB >500,000 >6,000,000 B 500,000 4,000,020 B

There is no change in the dense matrix part of the table. The numbers grow proportionally in the case of the sparse and multidiagonal matrices. General sparse matrix formats need to store a column index for each matrix element, which is not true for the multidiagonal matrix. The following table shows how many bytes are needed for storing one matrix element in different matrix types, depending on the type of the matrix elements (Real) and of the column indexes (Index):

Real Index Dense matrix Multidiagonal matrix Sparse matrix Fill ratio
float 32-bit 4 B 4 B 8 B << 50%
float 64-bit 4 B 4 B 12 B << 30%
double 32-bit 8 B 8 B 12 B << 60%
double 64-bit 8 B 8 B 16 B << 50%

In this table:

  • Real is the matrix element type.
  • Index is the column index type.
  • Dense matrix is the number of bytes needed to store one matrix element in the dense matrix.
  • Multidiagonal matrix is the number of bytes needed to store one matrix element in the multidiagonal matrix.
  • Sparse matrix is the number of bytes needed to store one matrix element in the sparse matrix.
  • Fill ratio is the maximal percentage of nonzero matrix elements up to which the sparse matrix can perform better.

The multidiagonal matrix type is especially suitable for the finite difference method and similar numerical methods for the solution of partial differential equations.

Indexing of nonzero matrix elements in sparse matrices

Sparse matrix formats usually, in the first step, compress the matrix rows by omitting the zero matrix elements, as follows:

\[ \left( \begin{array}{ccccc} 0 & 1 & 0 & 2 & 0 \\ 0 & 0 & 5 & 0 & 0 \\ 4 & 0 & 0 & 0 & 7 \\ 0 & 3 & 0 & 8 & 5 \\ 0 & 5 & 7 & 0 & 0 \end{array} \right) \rightarrow \left( \begin{array}{ccccc} 1 & 2 & . & . & . \\ 5 & . & . & . & . \\ 4 & 7 & . & . & . \\ 3 & 8 & 5 & . & . \\ 5 & 7 & . & . & . \end{array} \right) \]

In such a form, it is more efficient to refer to the nonzero matrix elements in a given row by their rank in the compressed matrix row rather than by their column index in the original matrix. In methods for sparse matrices, this parameter is called localIdx. Some sparse matrix formats add padding zeros for better alignment of data in memory. But if this is not the case, the variable localIdx of the particular matrix elements would read as:

\[ \left( \begin{array}{ccccc} 0 & 1 & . & . & . \\ 0 & . & . & . & . \\ 0 & 1 & . & . & . \\ 0 & 1 & 2 & . & . \\ 0 & 1 & . & . & . \end{array} \right) \]

Matrix view

Matrix views are small reference objects which help to access the matrix in GPU kernels or in lambda functions executed on GPUs. We describe this in detail in the section about Shared pointers and views. The problem lies in the fact that we cannot pass references to GPU kernels and we do not want to pass deep copies of matrices there. A matrix view is a kind of reference to a matrix. A copy of a matrix view is always shallow, so it behaves like a reference. The following example shows how to obtain the matrix view by means of the method getView and pass it to a lambda function:

#include <iostream>
#include <TNL/Algorithms/parallelFor.h>
#include <TNL/Matrices/SparseMatrix.h>
#include <TNL/Devices/Host.h>
#include <TNL/Devices/Cuda.h>

template< typename Device >
void
getRowExample()
{
   /***
    * Set the following matrix (dots represent zero matrix elements):
    *
    *   /  2  .  .  .  . \
    *   |  1  2  1  .  . |
    *   |  .  1  2  1  . |
    *   |  .  .  1  2  1 |
    *   \  .  .  .  .  2 /
    */
   const int size = 5;
   TNL::Matrices::SparseMatrix< double, Device > matrix( { 1, 3, 3, 3, 1 }, size );
   auto view = matrix.getView();

   /***
    * Set the matrix elements.
    */
   auto f = [ = ] __cuda_callable__( int rowIdx ) mutable
   {
      auto row = view.getRow( rowIdx );
      if( rowIdx == 0 )
         row.setElement( 0, rowIdx, 2.0 );        // diagonal element
      else if( rowIdx == size - 1 )
         row.setElement( 0, rowIdx, 2.0 );        // diagonal element
      else {
         row.setElement( 0, rowIdx - 1, 1.0 );    // element below the diagonal
         row.setElement( 1, rowIdx, 2.0 );        // diagonal element
         row.setElement( 2, rowIdx + 1, 1.0 );    // element above the diagonal
      }
   };
   TNL::Algorithms::parallelFor< Device >( 0, matrix.getRows(), f );
   std::cout << matrix << std::endl;
}

int
main( int argc, char* argv[] )
{
   std::cout << "Getting matrix rows on host: " << std::endl;
   getRowExample< TNL::Devices::Host >();

#ifdef __CUDACC__
   std::cout << "Getting matrix rows on CUDA device: " << std::endl;
   getRowExample< TNL::Devices::Cuda >();
#endif
}

Here we create the sparse matrix matrix and use the method getView to get the matrix view. The view is then used in the lambda function f, where it is captured by value (see [=] in the definition of the lambda function).

Allocation and setup of different matrix types

There are several ways how to create a new matrix:

  1. Initializer lists allow creating a matrix from C++ initializer lists. The matrix elements must therefore be encoded in the source code, so this is useful rather for smaller matrices. Methods and constructors with initializer lists are user friendly and simple to use. It is a good choice for toy problems with small matrices.
  2. STL map can be used for the creation of sparse matrices only. The user first inserts all matrix elements together with their coordinates into a std::map, based on which the sparse matrix is created in the next step. It is a simple and user-friendly approach suitable for the creation of large matrices. An advantage is that we do not need to know the distribution of the nonzero matrix elements in the matrix rows in advance, as we do in other ways of constructing sparse matrices. This makes the use of an STL map suitable for combining sparse matrices from TNL with other numerical packages. However, the sparse matrix is constructed on the host and then copied to the GPU if necessary. Therefore, this approach is not a good choice if fast and efficient matrix construction is required.
  3. Methods setElement and addElement called from the host allow changing particular matrix elements. The methods can be called from the host even for matrices allocated on the GPU. In this case, however, the matrix elements are transferred to the GPU one by one, which is very inefficient. If the matrix is allocated on the host system (CPU), the efficiency is nearly optimal. In the case of sparse matrices, one must set the row capacities (i.e. the maximal number of nonzero elements in each row) before using these methods. If a row capacity is exceeded, the matrix has to be reallocated and all matrix elements are lost.
  4. Methods setElement and addElement called on the host, followed by a copy of the matrix to the GPU: setting particular matrix elements by the methods setElement and addElement when the matrix is allocated on the GPU can be time consuming for large matrices. Setting up the matrix on the CPU using the same methods and copying it to the GPU at once when the setup is finished can be significantly more efficient. A drawback is that we need to temporarily allocate the whole matrix on the CPU.
  5. Methods setElement and addElement called from the native device allow efficient setup of matrix elements even on devices (GPUs). In this case, the methods must be called from a GPU kernel or a lambda function combined with the parallel for (TNL::Algorithms::parallelFor). The user gets very good performance even when manipulating a matrix allocated on the GPU. On the other hand, only data structures allocated on GPUs can be accessed from the kernel or lambda function. The matrix can be accessed in the GPU kernel or lambda function by means of a matrix view (see the previous section) or a shared pointer (see TNL::Pointers::SharedPointer).
  6. Method getRow combined with parallelFor is very similar to the previous one. The difference is that we first fetch a helper object called matrix row, which is linked to a particular matrix row. Using the methods of this object, one may change the matrix elements in the given matrix row. An advantage is that the access to the matrix row is resolved only once for all elements in the row. In some more sophisticated sparse matrix formats this can be a nontrivial operation, so this approach may slightly improve the performance. Another advantage for sparse matrices is that we access the matrix elements based on their local index (localIdx, see [Indexing of nonzero matrix elements in sparse matrices](indexing_of_nonzero_matrix_elements_in_sparse_matrices)) in the row, which is essentially the rank of the nonzero element in the row. This is more efficient than addressing the matrix elements by their column indexes, which requires searching in the matrix row. So this may significantly improve the performance of the setup of sparse matrices. For dense matrices, there should not be a great difference in performance compared to the methods setElement and addElement. Note that when the method is called from a GPU kernel or a lambda function, only data structures allocated on the GPU can be accessed and the matrix must be made accessible by means of a matrix view.
  7. Methods forRows and forElements: this approach is very similar to the previous one, but it avoids using parallelFor and the necessity of passing the matrix to GPU kernels via a matrix view or shared pointers.

The following table shows pros and cons of particular methods:

Initializer list (efficiency: **, ease of use: *****)
   Pros: Very easy to use. Does not need setting of matrix row capacities.
   Cons: Only for small matrices.

STL map (efficiency: **, ease of use: *****)
   Pros: Very easy to use. Does not need setting of matrix row capacities.
   Cons: Higher memory requirements. Slow transfer to GPU.

[set,add]Element on host (efficiency: ****/*, ease of use: *****)
   Pros: Very easy to use.
   Cons: Requires setting of row capacities. Extremely slow transfer to GPU.

[set,add]Element on host & copy to GPU (efficiency: ***, ease of use: ****)
   Pros: Easy to use. Reasonable efficiency.
   Cons: Requires setting of row capacities. Allocation of an auxiliary matrix on CPU.

[set,add]Element on native device (efficiency: ****)
   Pros: Good efficiency.
   Cons: Requires setting of row capacities. Requires writing a GPU kernel or lambda function. Allows accessing only data allocated in the same device/memory space.

getRow and parallelFor (efficiency: *****, ease of use: **)
   Pros: Best efficiency for sparse matrices.
   Cons: Requires setting of row capacities. Requires writing a GPU kernel or lambda function. Allows accessing only data allocated in the same device/memory space. Use of matrix local indexes can be less intuitive.

forRows, forElements (efficiency: *****, ease of use: **)
   Pros: Best efficiency for sparse matrices. Avoids the use of a matrix view or shared pointer in kernels/lambda functions.
   Cons: Requires setting of row capacities. Requires writing a GPU kernel or lambda function. Allows accessing only data allocated in the same device/memory space. Use of matrix local indexes is less intuitive.

Though it may seem that the latter methods come with more cons than pros, they offer much higher performance and we believe that they are still user friendly. On the other hand, if the matrix setup performance is not a priority, using an easy-to-use but slow method can still be a good choice. The following tables demonstrate the performance of the different methods. The tests were performed with the following setup:

CPU Intel i9-9900KF, 3.60GHz, 8 cores, 16384 KB cache
GPU GeForce RTX 2070
g++ version 10.2.0
nvcc version 11.2.67
Precision single precision

Dense matrix

In the test of dense matrices, we set each matrix element to a value equal to rowIdx + columnIdx. The times in seconds obtained on the CPU look as follows:

Matrix rows and columns setElement on host setElement with parallelFor getRow forElements
16 0.00000086 0.0000053 0.00000035 0.0000023
32 0.00000278 0.0000050 0.00000201 0.0000074
64 0.00000703 0.0000103 0.00000354 0.0000203
128 0.00002885 0.0000312 0.00000867 0.0000709
256 0.00017543 0.0000439 0.00002490 0.0001054
512 0.00078153 0.0001683 0.00005999 0.0002713
1024 0.00271989 0.0006691 0.00003808 0.0003942
2048 0.01273520 0.0038295 0.00039116 0.0017083
4096 0.08381450 0.0716542 0.00937997 0.0116771
8192 0.51596800 0.3535530 0.03971900 0.0467374

Here:

  • setElement on host tests run in a single thread. Therefore they are faster for small matrices compared to the setElement with parallelFor tests.
  • setElement with parallelFor tests run in parallel in several OpenMP threads. This approach is faster for larger matrices.
  • getRow tests run in parallel in several OpenMP threads, with a mapping that is more efficient compared to the setElement on host tests.

And the same on GPU is in the following table:

Matrix rows and columns setElement on host setElement on host and copy setElement on GPU getRow forElements
16 0.027835 0.02675 0.000101198 0.00009903 0.000101214
32 0.002776 0.00018 0.000099197 0.00009901 0.000100481
64 0.010791 0.00015 0.000094446 0.00009493 0.000101796
128 0.043014 0.00021 0.000099397 0.00010024 0.000102729
256 0.171029 0.00056 0.000100469 0.00010448 0.000105893
512 0.683627 0.00192 0.000103346 0.00011034 0.000112752
1024 2.736680 0.00687 0.000158805 0.00016932 0.000170302
2048 10.930300 0.02474 0.000509000 0.00050917 0.000511183
4096 43.728700 0.13174 0.001557030 0.00156117 0.001557930
8192 174.923000 0.70602 0.005312470 0.00526658 0.005263870

Here:

  • setElement on host tests are very slow, especially for large matrices, since each matrix element is copied to the GPU separately.
  • setElement on host and copy tests are much faster because the matrix is copied from the CPU to the GPU as a whole, which is more efficient.
  • setElement on GPU tests are even faster since there is no transfer of data between the CPU and the GPU.
  • getRow tests have the same performance as setElement on GPU.
  • forElements tests have the same performance as both setElement on GPU and getRow.

You can see the source code of the previous benchmark in Appendix.

Sparse matrix

The sparse matrices are tested on the assembly of the matrix of the discrete Laplace operator in 2D. This matrix has at most five nonzero elements in each row. The times for the sparse matrix (with the CSR format) on the CPU in seconds look as follows:

Matrix rows and columns STL Map setElement on host setElement with parallelFor getRow forElements
256 0.00016 0.000017 0.000014 0.000013 0.000020
1,024 0.00059 0.000044 0.000021 0.000019 0.000022
4,096 0.00291 0.000130 0.000031 0.000022 0.000031
16,384 0.01414 0.000471 0.000067 0.000031 0.000065
65,536 0.06705 0.001869 0.000218 0.000074 0.000209
262,144 0.31728 0.007436 0.000856 0.000274 0.000799
1,048,576 1.46388 0.027087 0.006162 0.005653 0.005904
4,194,304 7.46147 0.102808 0.028385 0.027925 0.027937
16,777,216 38.95900 0.413823 0.125870 0.124588 0.123858
67,108,864 185.75700 1.652580 0.505232 0.501003 0.500927

Here:

  • STL Map tests show that the use of an STL map can be very slow on large matrices and, of course, they need to allocate the map containing all the matrix elements. This can be memory consuming. On the other hand, it is the only way which does not require knowing the matrix row capacities in advance.
  • setElement on host tests are much faster compared to the STL map and do not need to allocate anything except the sparse matrix itself. However, the matrix row capacities must be known in advance.
  • setElement with parallelFor tests run in parallel in several OpenMP threads, so this can be faster for larger matrices.
  • getRow tests perform the same as setElement with parallelFor.
  • forElements tests perform the same as both setElement with parallelFor and getRow.

We see that the use of an STL map makes sense only in situations when it is hard to estimate the necessary row capacities. Otherwise, the very easy setup with the setElement method is much faster. If performance is the highest priority, the getRow method should be preferred. The results for the GPU are in the following table:

Matrix rows and columns STL Map setElement on host setElement on host and copy setElement on GPU getRow forElements
256 0.002 0.036 0.0280 0.00017 0.00017 0.00017
1,024 0.001 0.161 0.0006 0.00017 0.00017 0.00017
4,096 0.003 0.680 0.0010 0.00020 0.00020 0.00020
16,384 0.015 2.800 0.0034 0.00021 0.00020 0.00021
65,536 0.074 11.356 0.0130 0.00048 0.00047 0.00048
262,144 0.350 45.745 0.0518 0.00088 0.00087 0.00088
1,048,576 1.630 183.632 0.2057 0.00247 0.00244 0.00245
4,194,304 8.036 735.848 0.8119 0.00794 0.00783 0.00788
16,777,216 41.057 2946.610 3.2198 0.02481 0.02429 0.02211
67,108,864 197.581 11791.601 12.7775 0.07196 0.06329 0.06308

Here:

  • STL Map tests show that the times are comparable to the CPU times, which means that most of the time is spent creating the matrix on the CPU.
  • setElement on host tests are again extremely slow for large matrices, even slower than the use of an STL map. So in the case of a GPU, this is another reason for using the STL map.
  • setElement on host and copy tests are, similarly to the dense matrix, much faster compared to the previous approaches. So it is the best way when you need to use data structures available only on the host system (CPU).
  • setElement on GPU tests exhibit the best performance together with the getRow and forElements methods. Note, however, that this method could be slower than getRow and forElements if there were more nonzero matrix elements in a row.
  • getRow tests exhibit the best performance together with the setElement on GPU and forElements methods.
  • forElements tests exhibit the best performance together with the getRow and setElement on GPU methods.

Here we see that the setElement method performs extremely badly because all matrix elements are transferred to the GPU one by one. Even the STL map is much faster. Note that the times for the STL map are not much higher compared to the CPU, which indicates that the transfer of the matrix to the GPU is not dominant. Setting up the matrix on the CPU by means of the setElement method and transferring it to the GPU is even faster. However, the best performance can be obtained only by creating the matrix directly on the GPU with the methods setElement, getRow and forElements. Note, however, that even if all of them perform the same here, for matrices with more nonzero matrix elements in a row, setElement could be slower compared to getRow and forElements.

You can see the source code of the previous benchmark in Appendix.

Multidiagonal matrix

Finally, the following tables show the times of the same test performed with a multidiagonal matrix. The times on the CPU in seconds look as follows:

Matrix rows and columns setElement on host setElement with parallelFor getRow forElements
256 0.000055 0.0000038 0.000004 0.000009
1,024 0.000002 0.0000056 0.000003 0.000006
4,096 0.000087 0.0000130 0.000005 0.000014
16,384 0.000347 0.0000419 0.000010 0.000046
65,536 0.001378 0.0001528 0.000032 0.000177
262,144 0.005504 0.0006025 0.000131 0.000711
1,048,576 0.019392 0.0028773 0.001005 0.003265
4,194,304 0.072078 0.0162378 0.011915 0.018065
16,777,216 0.280085 0.0642682 0.048876 0.072084
67,108,864 1.105120 0.2427610 0.181974 0.272579

Here:

  • setElement on host tests show that this method is fairly efficient.
  • setElement with parallelFor tests run in parallel in several OpenMP threads, unlike the setElement on host tests. For larger matrices, this way of matrix setup performs better.
  • getRow tests perform more or less the same as setElement with parallelFor and forElements.
  • forElements tests perform more or less the same as setElement with parallelFor and getRow.

Note that the setup of a multidiagonal matrix is faster compared to the same matrix stored in the general sparse format. The results for the GPU are in the following table:

Matrix rows and columns setElement on host setElement on host and copy setElement on GPU getRow forElements
256 0.035 0.02468 0.000048 0.000045 0.000047
1,024 0.059 0.00015 0.000047 0.000045 0.000047
4,096 0.251 0.00044 0.000048 0.000045 0.000047
16,384 1.030 0.00158 0.000049 0.000046 0.000048
65,536 4.169 0.00619 0.000053 0.000048 0.000052
262,144 16.807 0.02187 0.000216 0.000214 0.000217
1,048,576 67.385 0.08043 0.000630 0.000629 0.000634
4,194,304 270.025 0.31272 0.001939 0.001941 0.001942
16,777,216 1080.741 1.18849 0.003212 0.004185 0.004207
67,108,864 4326.120 4.74481 0.013672 0.022494 0.030369

Here:

  • setElement on host tests are extremely slow again, especially for large matrices.
  • setElement on host and copy tests are much faster compared to the previous ones.
  • setElement on GPU tests offer the best performance. They are even faster than the getRow and forElements methods. This, however, does not have to be true for matrices having more nonzero elements in a row.
  • getRow tests perform more or less the same as forElements. For matrices having more nonzero elements in a row, this method could be faster than setElement.
  • forElements tests perform more or less the same as getRow.

Note that the multidiagonal matrix performs better compared to the general sparse matrix. One reason is that the multidiagonal type does not explicitly store the column indexes of the matrix elements. Because of this, less data needs to be transferred from memory.

You can see the source code of the previous benchmark in Appendix.

In the following parts, we will describe how to set up particular matrix types by means of the methods mentioned above.

Dense matrices

Dense matrix (TNL::Matrices::DenseMatrix) is a templated class defined in the namespace TNL::Matrices. It has five template parameters:

  • Real is a type of the matrix elements. It is double by default.
  • Device is a device where the matrix shall be allocated. Currently it can be either TNL::Devices::Host for CPU or TNL::Devices::Cuda for CUDA supporting GPUs. It is TNL::Devices::Host by default.
  • Index is a type to be used for indexing of the matrix elements. It is int by default.
  • ElementsOrganization defines the organization of the matrix elements in memory. It can be TNL::Algorithms::Segments::ColumnMajorOrder or TNL::Algorithms::Segments::RowMajorOrder for column-major and row-major organization respectively. By default, it is the row-major order if the matrix is allocated on the host system and the column-major order if it is allocated on the GPU.
  • RealAllocator is a memory allocator (one from TNL::Allocators) which shall be used for allocation of the matrix elements. By default, it is the default allocator for given Real type and Device type – see TNL::Allocators::Default.

The following examples show how to allocate the dense matrix and how to initialize the matrix elements.

Initializer list

Small matrices can be created simply by the constructor with an initializer list.

#include <iostream>
#include <TNL/Matrices/DenseMatrix.h>
#include <TNL/Devices/Host.h>
#include <TNL/Devices/Cuda.h>

template< typename Device >
void
initializerListExample()
{
   TNL::Matrices::DenseMatrix< double, Device > matrix{
      // clang-format off
      {  1,  2,  3,  4,  5,  6 },
      {  7,  8,  9, 10, 11, 12 },
      { 13, 14, 15, 16, 17, 18 }
      // clang-format on
   };

   std::cout << "General dense matrix: " << std::endl << matrix << std::endl;

   TNL::Matrices::DenseMatrix< double, Device > triangularMatrix{
      // clang-format off
      {  1 },
      {  2,  3 },
      {  4,  5,  6 },
      {  7,  8,  9, 10 },
      { 11, 12, 13, 14, 15 }
      // clang-format on
   };

   std::cout << "Triangular dense matrix: " << std::endl << triangularMatrix << std::endl;
}

int
main( int argc, char* argv[] )
{
   std::cout << "Creating matrices on CPU ... " << std::endl;
   initializerListExample< TNL::Devices::Host >();

#ifdef __CUDACC__
   std::cout << "Creating matrices on CUDA GPU ... " << std::endl;
   initializerListExample< TNL::Devices::Cuda >();
#endif
}

In fact, the constructor takes a list of initializer lists. Each embedded list defines one matrix row, so the number of matrix rows is given by the size of the outer initializer list. The number of matrix columns is given by the longest inner initializer list. Shorter inner lists are padded with zeros from the right. The result looks as follows:

Creating matrices on CPU ...
General dense matrix:
Row: 0 -> 0:1 1:2 2:3 3:4 4:5 5:6
Row: 1 -> 0:7 1:8 2:9 3:10 4:11 5:12
Row: 2 -> 0:13 1:14 2:15 3:16 4:17 5:18
Triangular dense matrix:
Row: 0 -> 0:1 1:0 2:0 3:0 4:0
Row: 1 -> 0:2 1:3 2:0 3:0 4:0
Row: 2 -> 0:4 1:5 2:6 3:0 4:0
Row: 3 -> 0:7 1:8 2:9 3:10 4:0
Row: 4 -> 0:11 1:12 2:13 3:14 4:15
Creating matrices on CUDA GPU ...
General dense matrix:
Row: 0 -> 0:1 1:2 2:3 3:4 4:5 5:6
Row: 1 -> 0:7 1:8 2:9 3:10 4:11 5:12
Row: 2 -> 0:13 1:14 2:15 3:16 4:17 5:18
Triangular dense matrix:
Row: 0 -> 0:1 1:0 2:0 3:0 4:0
Row: 1 -> 0:2 1:3 2:0 3:0 4:0
Row: 2 -> 0:4 1:5 2:6 3:0 4:0
Row: 3 -> 0:7 1:8 2:9 3:10 4:0
Row: 4 -> 0:11 1:12 2:13 3:14 4:15

Methods setElement and addElement

Larger matrices can be set up with the methods setElement and addElement (TNL::Matrices::DenseMatrix::setElement, TNL::Matrices::DenseMatrix::addElement). The following example shows how to call these methods from the host.

#include <iostream>
#include <TNL/Matrices/DenseMatrix.h>
#include <TNL/Devices/Host.h>
#include <TNL/Devices/Cuda.h>

template< typename Device >
void
addElements()
{
   TNL::Matrices::DenseMatrix< double, Device > matrix( 5, 5 );
   auto matrixView = matrix.getView();

   for( int i = 0; i < 5; i++ )
      matrixView.setElement( i, i, i );  // or matrix.setElement

   std::cout << "Initial matrix is: " << std::endl << matrix << std::endl;

   for( int i = 0; i < 5; i++ )
      for( int j = 0; j < 5; j++ )
         matrixView.addElement( i, j, 1.0, 5.0 );  // or matrix.addElement

   std::cout << "Matrix after addition is: " << std::endl << matrix << std::endl;
}

int
main( int argc, char* argv[] )
{
   std::cout << "Add elements on host:" << std::endl;
   addElements< TNL::Devices::Host >();

#ifdef __CUDACC__
   std::cout << "Add elements on CUDA device:" << std::endl;
   addElements< TNL::Devices::Cuda >();
#endif
}

As we can see, both methods can be called from the host regardless of where the matrix is allocated. If it is on the GPU, each call of setElement or addElement (TNL::Matrices::DenseMatrix::setElement, TNL::Matrices::DenseMatrix::addElement) causes a slow transfer of the data between the CPU and the GPU. Use this approach only if performance is not a priority. The result looks as follows:

Add elements on host:
Initial matrix is:
Row: 0 -> 0:0 1:0 2:0 3:0 4:0
Row: 1 -> 0:0 1:1 2:0 3:0 4:0
Row: 2 -> 0:0 1:0 2:2 3:0 4:0
Row: 3 -> 0:0 1:0 2:0 3:3 4:0
Row: 4 -> 0:0 1:0 2:0 3:0 4:4
Matrix after addition is:
Row: 0 -> 0:1 1:1 2:1 3:1 4:1
Row: 1 -> 0:1 1:6 2:1 3:1 4:1
Row: 2 -> 0:1 1:1 2:11 3:1 4:1
Row: 3 -> 0:1 1:1 2:1 3:16 4:1
Row: 4 -> 0:1 1:1 2:1 3:1 4:21
Add elements on CUDA device:
Initial matrix is:
Row: 0 -> 0:0 1:0 2:0 3:0 4:0
Row: 1 -> 0:0 1:1 2:0 3:0 4:0
Row: 2 -> 0:0 1:0 2:2 3:0 4:0
Row: 3 -> 0:0 1:0 2:0 3:3 4:0
Row: 4 -> 0:0 1:0 2:0 3:0 4:4
Matrix after addition is:
Row: 0 -> 0:1 1:1 2:1 3:1 4:1
Row: 1 -> 0:1 1:6 2:1 3:1 4:1
Row: 2 -> 0:1 1:1 2:11 3:1 4:1
Row: 3 -> 0:1 1:1 2:1 3:16 4:1
Row: 4 -> 0:1 1:1 2:1 3:1 4:21
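The effect of the fourth parameter of addElement, thisElementMultiplicator, can be read off the output above: each element becomes value plus thisElementMultiplicator times the old value. A minimal plain-C++ sketch of this arithmetic on a row-major dense array (DenseMatrixSketch is an illustrative name, not a TNL type):

```cpp
#include <vector>

// Minimal row-major dense matrix used only to illustrate the arithmetic
// of setElement/addElement; DenseMatrixSketch is illustrative, not TNL.
struct DenseMatrixSketch
{
   int rows, columns;
   std::vector< double > values;  // row-major storage

   DenseMatrixSketch( int r, int c ) : rows( r ), columns( c ), values( r * c, 0.0 ) {}

   void setElement( int i, int j, double v ) { values[ i * columns + j ] = v; }

   // newValue = value + thisElementMultiplicator * oldValue,
   // matching the behavior observed in the example output above
   void addElement( int i, int j, double value, double thisElementMultiplicator )
   {
      double& a = values[ i * columns + j ];
      a = value + thisElementMultiplicator * a;
   }

   double getElement( int i, int j ) const { return values[ i * columns + j ]; }
};
```

With the same loops as in the example (setElement( i, i, i ) followed by addElement( i, j, 1.0, 5.0 )), the diagonal elements become 5*i + 1 and the off-diagonal elements become 1, exactly as printed above.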

A more efficient way to initialize a matrix on the GPU is to call the methods setElement and addElement (TNL::Matrices::DenseMatrix::setElement, TNL::Matrices::DenseMatrix::addElement) directly from the GPU, for example by means of a lambda function and parallelFor. This is demonstrated in the following example (of course, it works on the CPU as well):

1#include <iostream>
2#include <TNL/Algorithms/parallelFor.h>
3#include <TNL/Containers/StaticArray.h>
4#include <TNL/Matrices/DenseMatrix.h>
5#include <TNL/Devices/Host.h>
6
7template< typename Device >
8void
9setElements()
10{
11 TNL::Matrices::DenseMatrix< double, Device > matrix( 5, 5 );
12 auto matrixView = matrix.getView();
13 for( int i = 0; i < 5; i++ )
14 matrixView.setElement( i, i, i ); // or matrix.setElement
15
16 std::cout << "Matrix set from the host:" << std::endl;
17 std::cout << matrix << std::endl;
18
19 auto f = [ = ] __cuda_callable__( const TNL::Containers::StaticArray< 2, int >& i ) mutable
20 {
21 matrixView.addElement( i[ 0 ], i[ 1 ], 5.0 );
22 };
23 const TNL::Containers::StaticArray< 2, int > begin{ 0, 0 };
24 const TNL::Containers::StaticArray< 2, int > end{ 5, 5 };
25 TNL::Algorithms::parallelFor< Device >( begin, end, f );
26
27 std::cout << "Matrix set from its native device:" << std::endl;
28 std::cout << matrix << std::endl;
29}
30
31int
32main( int argc, char* argv[] )
33{
34 std::cout << "Set elements on host:" << std::endl;
35 setElements< TNL::Devices::Host >();
36
37#ifdef __CUDACC__
38 std::cout << "Set elements on CUDA device:" << std::endl;
39 setElements< TNL::Devices::Cuda >();
40#endif
41}

Here we get the matrix view (TNL::Matrices::DenseMatrixView) (line 12) to make the matrix accessible in a lambda function even on the GPU (see Shared pointers and views). We first call the setElement method from the CPU to set the i-th diagonal element to i (lines 13-14). Next we iterate over all matrix elements with parallelFor (line 25) and call the lambda function f for each of them. This is done on the same device where the matrix is allocated, so we get optimal performance even for matrices on the GPU. In the lambda function we add five to each matrix element (line 21). The result looks as follows:

Set elements on host:
Matrix set from the host:
Row: 0 -> 0:0 1:0 2:0 3:0 4:0
Row: 1 -> 0:0 1:1 2:0 3:0 4:0
Row: 2 -> 0:0 1:0 2:2 3:0 4:0
Row: 3 -> 0:0 1:0 2:0 3:3 4:0
Row: 4 -> 0:0 1:0 2:0 3:0 4:4
Matrix set from its native device:
Row: 0 -> 0:5 1:5 2:5 3:5 4:5
Row: 1 -> 0:5 1:6 2:5 3:5 4:5
Row: 2 -> 0:5 1:5 2:7 3:5 4:5
Row: 3 -> 0:5 1:5 2:5 3:8 4:5
Row: 4 -> 0:5 1:5 2:5 3:5 4:9
Set elements on CUDA device:
Matrix set from the host:
Row: 0 -> 0:0 1:0 2:0 3:0 4:0
Row: 1 -> 0:0 1:1 2:0 3:0 4:0
Row: 2 -> 0:0 1:0 2:2 3:0 4:0
Row: 3 -> 0:0 1:0 2:0 3:3 4:0
Row: 4 -> 0:0 1:0 2:0 3:0 4:4
Matrix set from its native device:
Row: 0 -> 0:5 1:5 2:5 3:5 4:5
Row: 1 -> 0:5 1:6 2:5 3:5 4:5
Row: 2 -> 0:5 1:5 2:7 3:5 4:5
Row: 3 -> 0:5 1:5 2:5 3:8 4:5
Row: 4 -> 0:5 1:5 2:5 3:5 4:9
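The parallelFor call in the example iterates over a two-dimensional index range given by begin and end. A sequential plain-C++ stand-in for this pattern may look as follows (illustrative only; the real TNL::Algorithms::parallelFor dispatches to OpenMP or CUDA depending on the Device template parameter):

```cpp
#include <array>

// Sequential stand-in for a 2D parallelFor: calls f for every multi-index
// in the half-open range [begin, end). Illustrative sketch, not TNL code.
template< typename Function >
void sequentialFor2D( std::array< int, 2 > begin, std::array< int, 2 > end, Function f )
{
   for( int i = begin[ 0 ]; i < end[ 0 ]; i++ )
      for( int j = begin[ 1 ]; j < end[ 1 ]; j++ )
         f( std::array< int, 2 >{ i, j } );
}
```

In the example above, the lambda receives each multi-index i and uses i[ 0 ] as the row index and i[ 1 ] as the column index.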

Method getRow

The dense matrix provides the method getRow (TNL::Matrices::DenseMatrix::getRow) mainly for two reasons:

  1. The method getRow is the recommended way of accessing elements of sparse matrices. For dense matrices it is usually not optimal, but if one needs a single code for both dense and sparse matrices, this method is a good choice.
  2. In general, the use of setElement (TNL::Matrices::DenseMatrix::setElement) combined with parallelFor is preferred for dense matrices, since it offers more parallelism on GPUs: parallelFor creates one CUDA thread per matrix element, which is desirable for GPUs, while with the method getRow there is only one CUDA thread per matrix row. Using getRow therefore makes sense only when each matrix row needs to be set up sequentially.

Here we show an example:

1#include <iostream>
2#include <TNL/Algorithms/parallelFor.h>
3#include <TNL/Matrices/DenseMatrix.h>
4#include <TNL/Devices/Host.h>
5#include <TNL/Devices/Cuda.h>
6
7template< typename Device >
8void
9getRowExample()
10{
11 const int size = 5;
12 TNL::Matrices::DenseMatrix< double, Device > matrix( size, size );
13
14 /***
15 * Create dense matrix view which can be captured by the following lambda
16 * function.
17 */
18 auto matrixView = matrix.getView();
19
20 auto f = [ = ] __cuda_callable__( int rowIdx ) mutable
21 {
22 auto row = matrixView.getRow( rowIdx );
23 if( rowIdx > 0 )
24 row.setValue( rowIdx - 1, -1.0 );
25 row.setValue( rowIdx, rowIdx + 1.0 );
26 if( rowIdx < size - 1 )
27 row.setValue( rowIdx + 1, -1.0 );
28 };
29
30 /***
31 * Set the matrix elements.
32 */
33 TNL::Algorithms::parallelFor< Device >( 0, matrix.getRows(), f );
34 std::cout << matrix << std::endl;
35}
36
37int
38main( int argc, char* argv[] )
39{
40 std::cout << "Getting matrix rows on host: " << std::endl;
41 getRowExample< TNL::Devices::Host >();
42
43#ifdef __CUDACC__
44 std::cout << "Getting matrix rows on CUDA device: " << std::endl;
45 getRowExample< TNL::Devices::Cuda >();
46#endif
47}

Here we create the matrix (line 12) and get the matrix view (line 18). Next we use parallelFor (line 33) to iterate over the matrix rows and call the lambda function f (lines 20-28) for each of them. In the lambda function, we first fetch the matrix row by means of the method getRow (TNL::Matrices::DenseMatrixView::getRow) and then set the matrix elements using the method setValue of the matrix row (TNL::Matrices::DenseMatrixRowView::setValue). For compatibility with sparse matrices, there is also a variant of setElement with a parameter localIdx; it has no effect for dense matrices and exists only for compatibility of the interface.

The result looks as follows:

Getting matrix rows on host:
Row: 0 -> 0:1 1:-1 2:0 3:0 4:0
Row: 1 -> 0:-1 1:2 2:-1 3:0 4:0
Row: 2 -> 0:0 1:-1 2:3 3:-1 4:0
Row: 3 -> 0:0 1:0 2:-1 3:4 4:-1
Row: 4 -> 0:0 1:0 2:0 3:-1 4:5
Getting matrix rows on CUDA device:
Row: 0 -> 0:1 1:-1 2:0 3:0 4:0
Row: 1 -> 0:-1 1:2 2:-1 3:0 4:0
Row: 2 -> 0:0 1:-1 2:3 3:-1 4:0
Row: 3 -> 0:0 1:0 2:-1 3:4 4:-1
Row: 4 -> 0:0 1:0 2:0 3:-1 4:5
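The tridiagonal pattern set by the lambda above (rowIdx + 1 on the diagonal, -1 on both neighbours) can be reproduced in plain C++ on a row-major array; makeTridiagonal is an illustrative helper, not part of TNL:

```cpp
#include <vector>

// Fill a row-major size x size dense array with the tridiagonal pattern
// from the getRow example: rowIdx + 1 on the diagonal, -1 on both
// neighbours. Illustrative sketch only.
std::vector< double > makeTridiagonal( int size )
{
   std::vector< double > a( size * size, 0.0 );
   for( int rowIdx = 0; rowIdx < size; rowIdx++ ) {
      if( rowIdx > 0 )
         a[ rowIdx * size + rowIdx - 1 ] = -1.0;
      a[ rowIdx * size + rowIdx ] = rowIdx + 1.0;
      if( rowIdx < size - 1 )
         a[ rowIdx * size + rowIdx + 1 ] = -1.0;
   }
   return a;
}
```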

Method forRows

This method iterates in parallel over all matrix rows. In fact, it combines TNL::Algorithms::parallelFor and the method TNL::Matrices::DenseMatrix::getRow in one. See the following example, which is even a bit simpler than the previous one:

1#include <iostream>
2#include <TNL/Matrices/DenseMatrix.h>
3#include <TNL/Devices/Host.h>
4#include <TNL/Devices/Cuda.h>
5
6template< typename Device >
7void
8forRowsExample()
9{
10 using MatrixType = TNL::Matrices::DenseMatrix< double, Device >;
11 using RowView = typename MatrixType::RowView;
12 const int size = 5;
13 MatrixType matrix( size, size );
14 auto view = matrix.getView();
15
16 /***
17 * Set the matrix elements.
18 */
19 auto f = [] __cuda_callable__( RowView & row )
20 {
21 const int& rowIdx = row.getRowIndex();
22 if( rowIdx > 0 )
23 row.setValue( rowIdx - 1, -1.0 );
24 row.setValue( rowIdx, rowIdx + 1.0 );
25 if( rowIdx < size - 1 )
26 row.setValue( rowIdx + 1, -1.0 );
27 };
28 view.forAllRows( f ); // or matrix.forAllRows
29 std::cout << matrix << std::endl;
30
31 /***
32 * Now divide each matrix row by its largest element - with the use of iterators.
33 */
34 view.forAllRows(
35 [] __cuda_callable__( RowView & row )
36 {
37 double largest = std::numeric_limits< double >::lowest();
38 for( auto element : row )
39 largest = TNL::max( largest, element.value() );
40 for( auto element : row )
41 element.value() /= largest;
42 } );
43 std::cout << "Divide each matrix row by its largest element... " << std::endl;
44 std::cout << matrix << std::endl;
45}
46
47int
48main( int argc, char* argv[] )
49{
50 std::cout << "Getting matrix rows on host: " << std::endl;
51 forRowsExample< TNL::Devices::Host >();
52
53#ifdef __CUDACC__
54 std::cout << "Getting matrix rows on CUDA device: " << std::endl;
55 forRowsExample< TNL::Devices::Cuda >();
56#endif
57}

The lambda function f, which is called for each matrix row (lines 19-27), has to accept a parameter row of type RowView. This type is defined inside each TNL matrix; in the case of the dense matrix, it is TNL::Matrices::DenseMatrixRowView. We use the method TNL::Matrices::DenseMatrixRowView::getRowIndex to get the index of the matrix row being currently processed and the method TNL::Matrices::DenseMatrixRowView::setValue, which sets the value of the element with the given column index (the first parameter).

Next, on lines 34-42, we call another lambda function which first finds the largest element in each row (lines 37-39) and then divides the matrix row by its value (lines 40-41).

The result looks as follows:

Getting matrix rows on host:
Row: 0 -> 0:1 1:-1 2:0 3:0 4:0
Row: 1 -> 0:-1 1:2 2:-1 3:0 4:0
Row: 2 -> 0:0 1:-1 2:3 3:-1 4:0
Row: 3 -> 0:0 1:0 2:-1 3:4 4:-1
Row: 4 -> 0:0 1:0 2:0 3:-1 4:5
Divide each matrix row by its largest element...
Row: 0 -> 0:1 1:-1 2:0 3:0 4:0
Row: 1 -> 0:-0.5 1:1 2:-0.5 3:0 4:0
Row: 2 -> 0:0 1:-0.333333 2:1 3:-0.333333 4:0
Row: 3 -> 0:0 1:0 2:-0.25 3:1 4:-0.25
Row: 4 -> 0:0 1:0 2:0 3:-0.2 4:1
Getting matrix rows on CUDA device:
Row: 0 -> 0:1 1:-1 2:0 3:0 4:0
Row: 1 -> 0:-1 1:2 2:-1 3:0 4:0
Row: 2 -> 0:0 1:-1 2:3 3:-1 4:0
Row: 3 -> 0:0 1:0 2:-1 3:4 4:-1
Row: 4 -> 0:0 1:0 2:0 3:-1 4:5
Divide each matrix row by its largest element...
Row: 0 -> 0:1 1:-1 2:0 3:0 4:0
Row: 1 -> 0:-0.5 1:1 2:-0.5 3:0 4:0
Row: 2 -> 0:0 1:-0.333333 2:1 3:-0.333333 4:0
Row: 3 -> 0:0 1:0 2:-0.25 3:1 4:-0.25
Row: 4 -> 0:0 1:0 2:0 3:-0.2 4:1
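The row normalization done by the second lambda can be sketched in plain C++ on a single row stored as std::vector (illustrative only; TNL iterates over RowView elements instead). Here std::max_element stands in for the manual scan initialized with std::numeric_limits::lowest; the result is the same:

```cpp
#include <algorithm>
#include <vector>

// Divide each element of a (non-empty) matrix row by the row's largest
// element, mirroring the second lambda of the forRows example above.
void normalizeRowByMax( std::vector< double >& row )
{
   const double largest = *std::max_element( row.begin(), row.end() );
   for( double& value : row )
      value /= largest;
}
```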

Method forElements

The next example demonstrates the method forElements (TNL::Matrices::DenseMatrix::forElements), which works in a very similar way to the method getRow but is slightly easier to use. It is also compatible with sparse matrices. See the following example:

1#include <iostream>
2#include <TNL/Matrices/DenseMatrix.h>
3#include <TNL/Devices/Host.h>
4#include <TNL/Devices/Cuda.h>
5
6template< typename Device >
7void
8forElementsExample()
9{
10 TNL::Matrices::DenseMatrix< double, Device > matrix( 5, 5 );
11 auto matrixView = matrix.getView();
12
13 auto f = [] __cuda_callable__( int rowIdx, int columnIdx, int globalIdx, double& value )
14 {
15 if( columnIdx <= rowIdx )
16 value = rowIdx + columnIdx;
17 };
18
19 matrixView.forElements( 0, matrix.getRows(), f ); // or matrix.forElements
20 std::cout << matrix << std::endl;
21}
22
23int
24main( int argc, char* argv[] )
25{
26 std::cout << "Creating matrix on host: " << std::endl;
27 forElementsExample< TNL::Devices::Host >();
28
29#ifdef __CUDACC__
30 std::cout << "Creating matrix on CUDA device: " << std::endl;
31 forElementsExample< TNL::Devices::Cuda >();
32#endif
33}

We do not need the matrix view here, and instead of calling parallelFor we just call the method forElements (line 19). The lambda function f (lines 13-17) must accept the following parameters:

  • rowIdx is the row index of the given matrix element.
  • columnIdx is the column index of the given matrix element.
  • globalIdx is the index of the matrix element within the array storing all the matrix elements.
  • value is a reference to the matrix element value, so by changing it we can modify the matrix element.

The result looks as follows:

Creating matrix on host:
Row: 0 -> 0:0 1:0 2:0 3:0 4:0
Row: 1 -> 0:1 1:2 2:0 3:0 4:0
Row: 2 -> 0:2 1:3 2:4 3:0 4:0
Row: 3 -> 0:3 1:4 2:5 3:6 4:0
Row: 4 -> 0:4 1:5 2:6 3:7 4:8
Creating matrix on CUDA device:
Row: 0 -> 0:0 1:0 2:0 3:0 4:0
Row: 1 -> 0:1 1:2 2:0 3:0 4:0
Row: 2 -> 0:2 1:3 2:4 3:0 4:0
Row: 3 -> 0:3 1:4 2:5 3:6 4:0
Row: 4 -> 0:4 1:5 2:6 3:7 4:8
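A sequential plain-C++ stand-in for forElements may help clarify the lambda's parameters; forAllElements below is an illustrative sketch over a row-major array, not the TNL method:

```cpp
#include <vector>

// Element-wise traversal in the spirit of forElements: call f for every
// element of a row-major dense matrix with its row index, column index,
// global index within the value array, and a reference to the value.
template< typename Function >
void forAllElements( std::vector< double >& values, int rows, int columns, Function f )
{
   for( int rowIdx = 0; rowIdx < rows; rowIdx++ )
      for( int columnIdx = 0; columnIdx < columns; columnIdx++ ) {
         const int globalIdx = rowIdx * columns + columnIdx;
         f( rowIdx, columnIdx, globalIdx, values[ globalIdx ] );
      }
}
```

Applying the same lambda body as in the example (value = rowIdx + columnIdx for columnIdx <= rowIdx) reproduces the lower triangular fill printed above.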

Wrapping existing data to dense matrix view

If you have already allocated data for a dense matrix (for example in some other library), you may wrap it into a dense matrix view with the function TNL::Matrices::wrapDenseMatrix. See the following example:

1#include <iostream>
2#include <TNL/Matrices/DenseMatrix.h>
3#include <TNL/Matrices/MatrixWrapping.h>
4#include <TNL/Devices/Host.h>
5#include <TNL/Devices/Cuda.h>
6
7template< typename Device >
8void
9wrapMatrixView()
10{
11 const int rows( 3 ), columns( 4 );
12 TNL::Containers::Vector< double, Device > valuesVector{
13 // clang-format off
14 1, 2, 3, 4,
15 5, 6, 7, 8,
16 9, 10, 11, 12
17 // clang-format on
18 };
19 double* values = valuesVector.getData();
20
21 /***
22 * Wrap the array `values` to dense matrix view
23 */
24 auto matrix = TNL::Matrices::wrapDenseMatrix< Device >( rows, columns, values );
25 std::cout << "Matrix reads as: " << std::endl << matrix << std::endl;
26}
27
28int
29main( int argc, char* argv[] )
30{
31 std::cout << "Wrapping matrix view on host: " << std::endl;
32 wrapMatrixView< TNL::Devices::Host >();
33
34#ifdef __CUDACC__
35 std::cout << "Wrapping matrix view on CUDA device: " << std::endl;
36 wrapMatrixView< TNL::Devices::Cuda >();
37#endif
38}

Here we create a dense matrix with three rows and four columns. We use a TNL vector (TNL::Containers::Vector) only for the allocation of the matrix elements (lines 12-18) and we get a pointer to the allocated array immediately (line 19). Next we use just this array to get a dense matrix view with the proper matrix dimensions (line 24). Note that we must explicitly state the device type as a template parameter of the function wrapDenseMatrix (TNL::Matrices::wrapDenseMatrix). Finally, we print the matrix to check that it is correct (line 25). The result looks as follows:

Wrapping matrix view on host:
Matrix reads as:
Row: 0 -> 0:1 1:2 2:3 3:4
Row: 1 -> 0:5 1:6 2:7 3:8
Row: 2 -> 0:9 1:10 2:11 3:12
Wrapping matrix view on CUDA device:
Matrix reads as:
Row: 0 -> 0:1 1:4 2:7 3:10
Row: 1 -> 0:2 1:5 2:8 3:11
Row: 2 -> 0:3 1:6 2:9 3:12
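The idea of wrapping is simply to interpret an existing array without copying it. A plain-C++ sketch of such a non-owning row-major view (DenseViewSketch is an illustrative name, not the TNL view type):

```cpp
// Non-owning row-major dense matrix view over an existing array, in the
// spirit of wrapDenseMatrix: no data is copied, the view just interprets
// the pointer with the given dimensions. Illustrative sketch only.
struct DenseViewSketch
{
   int rows, columns;
   double* values;  // not owned by the view

   double getElement( int i, int j ) const { return values[ i * columns + j ]; }
};
```

Changes made through the original array are immediately visible through the view, which is exactly why no data transfer happens when wrapping.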

Sparse matrices

Sparse matrices are extremely important in many numerical algorithms. They are used in situations when we need to operate with matrices where the majority of the matrix elements are equal to zero. In this case, only the non-zero matrix elements are stored, possibly with some padding zeros used for memory alignment. This is necessary mainly on GPUs. See the Overview of matrix types for the differences in memory requirements.

A major disadvantage of sparse matrices is that there are many different formats for storing them in memory. Although the CSR (Compressed Sparse Row) format is the most popular of all, especially for GPUs, there are many other formats. Their performance often differs significantly for various matrices, so it is a good idea to test several sparse matrix formats if you want to get the best performance. In TNL, there is one templated class TNL::Matrices::SparseMatrix representing general sparse matrices. The underlying matrix format can be changed just by changing one template parameter. The list of the template parameters is as follows:
  • Real is the type of the matrix elements.
  • Device is the device where the matrix is allocated (e.g. TNL::Devices::Host or TNL::Devices::Cuda).
  • Index is the type used for indexing.
  • MatrixType distinguishes between general and symmetric matrices.
  • Segments defines the underlying format for storing the sparse matrix (e.g. CSR or Ellpack).

If Real is set to bool, we get a binary matrix whose non-zero elements can only be equal to one, so the matrix element values are not stored explicitly in memory. This can significantly reduce memory requirements and also increase performance.

In the following text we will show how to create and setup sparse matrices.

Setting of row capacities

Larger sparse matrices are created in two steps:

  1. We use the method TNL::Matrices::SparseMatrix::setRowCapacities to initialize the underlying matrix format and to allocate memory for the matrix elements. This method only needs to know how many non-zero elements there are supposed to be in each row. Once set, the row capacities can be changed only by resetting the whole matrix. In most situations it is not an issue to compute the number of nonzero elements in each row; otherwise, we can currently only recommend the matrix setup with an STL map, which is, however, quite slow.
  2. Now the nonzero matrix elements can be set one after another by specifying their coordinates and values. Since the majority of sparse matrix formats are designed to allow quick access to particular matrix rows, the insertion can be done in parallel by mapping different threads to different matrix rows. This approach is usually optimal or nearly optimal in terms of efficiency.

See the following example, which creates a lower triangular matrix like this one:

\[ \left( \begin{array}{ccccc} 1 & 0 & 0 & 0 & 0 \\ 2 & 1 & 0 & 0 & 0 \\ 3 & 2 & 1 & 0 & 0 \\ 4 & 3 & 2 & 1 & 0 \\ 5 & 4 & 3 & 2 & 1 \end{array} \right). \]

1#include <iostream>
2#include <TNL/Containers/Vector.h>
3#include <TNL/Matrices/SparseMatrix.h>
4#include <TNL/Devices/Host.h>
5#include <TNL/Devices/Cuda.h>
6
7template< typename Device >
8void
9setRowCapacitiesExample()
10{
11 TNL::Matrices::SparseMatrix< double, Device > matrix( 5, 5 );
12 TNL::Containers::Vector< int, Device > rowCapacities{ 1, 2, 3, 4, 5 };
13 matrix.setRowCapacities( rowCapacities );
14 for( int row = 0; row < 5; row++ )
15 for( int column = 0; column <= row; column++ )
16 matrix.setElement( row, column, row - column + 1 );
17
18 std::cout << matrix << std::endl;
19}
20
21int
22main( int argc, char* argv[] )
23{
24 std::cout << "Creating matrix on CPU ... " << std::endl;
25 setRowCapacitiesExample< TNL::Devices::Host >();
26
27#ifdef __CUDACC__
28 std::cout << "Creating matrix on CUDA GPU ... " << std::endl;
29 setRowCapacitiesExample< TNL::Devices::Cuda >();
30#endif
31}

The method TNL::Matrices::SparseMatrix::setRowCapacities reads the required capacities of the matrix rows from a vector (or a similar container - TNL::Containers::Array, TNL::Containers::ArrayView, TNL::Containers::Vector or TNL::Containers::VectorView) which has the same number of elements as the number of matrix rows; each element defines the capacity of the related row. The result looks as follows:

Creating matrix on CPU ...
Row: 0 -> 0:1
Row: 1 -> 0:2 1:1
Row: 2 -> 0:3 1:2 2:1
Row: 3 -> 0:4 1:3 2:2 3:1
Row: 4 -> 0:5 1:4 2:3 3:2 4:1
Creating matrix on CUDA GPU ...
Row: 0 -> 0:1
Row: 1 -> 0:2 1:1
Row: 2 -> 0:3 1:2 2:1
Row: 3 -> 0:4 1:3 2:2 3:1
Row: 4 -> 0:5 1:4 2:3 3:2 4:1
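Conceptually, setRowCapacities turns the per-row capacities into row offsets of a CSR-like layout by an exclusive prefix sum. The sketch below shows this in plain C++ (illustrative only; TNL's segments are more general than plain CSR):

```cpp
#include <cstddef>
#include <vector>

// Turn per-row capacities into CSR-like row offsets by an exclusive
// prefix sum: row i occupies positions [offsets[i], offsets[i+1]).
// Illustrative sketch of what setRowCapacities does conceptually.
std::vector< int > capacitiesToOffsets( const std::vector< int >& rowCapacities )
{
   std::vector< int > offsets( rowCapacities.size() + 1, 0 );
   for( std::size_t i = 0; i < rowCapacities.size(); i++ )
      offsets[ i + 1 ] = offsets[ i ] + rowCapacities[ i ];
   return offsets;
}
```

For the capacities { 1, 2, 3, 4, 5 } used above, this yields the offsets { 0, 1, 3, 6, 10, 15 }, i.e. 15 allocated elements in total.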

There are constructors which also set the row capacities. The first one uses a vector:

1#include <iostream>
2#include <TNL/Matrices/SparseMatrix.h>
3#include <TNL/Containers/Vector.h>
4#include <TNL/Devices/Host.h>
5
6template< typename Device >
7void
8initializerListExample()
9{
10 TNL::Containers::Vector< int, Device > rowCapacities{ 1, 2, 3, 4, 5 };
11 TNL::Matrices::SparseMatrix< double, Device > matrix{ rowCapacities, // row capacities
12 6 }; // number of matrix columns
13
14 for( int row = 0; row < matrix.getRows(); row++ )
15 for( int column = 0; column <= row; column++ )
16 matrix.setElement( row, column, row - column + 1 );
17 std::cout << "General sparse matrix: " << std::endl << matrix << std::endl;
18}
19
20int
21main( int argc, char* argv[] )
22{
23 std::cout << "Creating matrices on CPU ... " << std::endl;
24 initializerListExample< TNL::Devices::Host >();
25
26#ifdef __CUDACC__
27 std::cout << "Creating matrices on CUDA GPU ... " << std::endl;
28 initializerListExample< TNL::Devices::Cuda >();
29#endif
30}

The second one uses an initializer list:

1#include <iostream>
2#include <TNL/Matrices/SparseMatrix.h>
3#include <TNL/Devices/Host.h>
4
5template< typename Device >
6void
7initializerListExample()
8{
9 TNL::Matrices::SparseMatrix< double, Device > matrix{ { 1, 2, 3, 4, 5 }, // row capacities
10 6 }; // number of matrix columns
11
12 for( int row = 0; row < matrix.getRows(); row++ )
13 for( int column = 0; column <= row; column++ )
14 matrix.setElement( row, column, row - column + 1 );
15 std::cout << "General sparse matrix: " << std::endl << matrix << std::endl;
16}
17
18int
19main( int argc, char* argv[] )
20{
21 std::cout << "Creating matrices on CPU ... " << std::endl;
22 initializerListExample< TNL::Devices::Host >();
23
24#ifdef __CUDACC__
25 std::cout << "Creating matrices on CUDA GPU ... " << std::endl;
26 initializerListExample< TNL::Devices::Cuda >();
27#endif
28}

The result of both examples looks as follows:

Creating matrices on CPU ...
General sparse matrix:
Row: 0 -> 0:1
Row: 1 -> 0:2 1:1
Row: 2 -> 0:3 1:2 2:1
Row: 3 -> 0:4 1:3 2:2 3:1
Row: 4 -> 0:5 1:4 2:3 3:2 4:1
Creating matrices on CUDA GPU ...
General sparse matrix:
Row: 0 -> 0:1
Row: 1 -> 0:2 1:1
Row: 2 -> 0:3 1:2 2:1
Row: 3 -> 0:4 1:3 2:2 3:1
Row: 4 -> 0:5 1:4 2:3 3:2 4:1

Initializer list

Small matrices can be initialized by a constructor with an initializer list. Assume we have the following sparse matrix:

\[ \left( \begin{array}{ccccc} 1 & 0 & 0 & 0 & 0 \\ -1 & 2 & -1 & 0 & 0 \\ 0 & -1 & 2 & -1 & 0 \\ 0 & 0 & -1 & 2 & -1 \\ 0 & 0 & 0 & -1 & 0 \end{array} \right). \]

It can be created with the initializer-list constructor as shown in the following example:

1#include <iostream>
2#include <TNL/Matrices/SparseMatrix.h>
3#include <TNL/Devices/Host.h>
4
5template< typename Device >
6void
7initializerListExample()
8{
9 TNL::Matrices::SparseMatrix< double, Device > matrix(
10 // number of matrix rows
11 5,
12 // number of matrix columns
13 5,
14 // matrix elements definition
15 {
16 // clang-format off
17 { 0, 0, 2.0 },
18 { 1, 0, -1.0 }, { 1, 1, 2.0 }, { 1, 2, -1.0 },
19 { 2, 1, -1.0 }, { 2, 2, 2.0 }, { 2, 3, -1.0 },
20 { 3, 2, -1.0 }, { 3, 3, 2.0 }, { 3, 4, -1.0 },
21 { 4, 4, 2.0 }
22 // clang-format on
23 } );
24
25 std::cout << "General sparse matrix: " << std::endl << matrix << std::endl;
26}
27
28int
29main( int argc, char* argv[] )
30{
31 std::cout << "Creating matrices on CPU ... " << std::endl;
32 initializerListExample< TNL::Devices::Host >();
33
34#ifdef __CUDACC__
35 std::cout << "Creating matrices on CUDA GPU ... " << std::endl;
36 initializerListExample< TNL::Devices::Cuda >();
37#endif
38}

The constructor accepts the following parameters (lines 9-23):

  • rows is a number of matrix rows.
  • columns is a number of matrix columns.
  • data is the definition of the nonzero matrix elements. It is an initializer list of triples of the form { row_index, column_index, value }. In fact, it is very much like the coordinate (COO) format.

The constructor also accepts Real and Index allocators (TNL::Allocators), but the default ones are used in this example. The method setElements (TNL::Matrices::SparseMatrix::setElements) works the same way:

1#include <iostream>
2#include <TNL/Matrices/SparseMatrix.h>
3#include <TNL/Devices/Host.h>
4#include <TNL/Devices/Cuda.h>
5
6template< typename Device >
7void
8setElementsExample()
9{
10 TNL::Matrices::SparseMatrix< double, Device > matrix( 5, 5 ); // matrix dimensions
11 // matrix elements definition
12 matrix.setElements( {
13 // clang-format off
14 { 0, 0, 2.0 },
15 { 1, 0, -1.0 }, { 1, 1, 2.0 }, { 1, 2, -1.0 },
16 { 2, 1, -1.0 }, { 2, 2, 2.0 }, { 2, 3, -1.0 },
17 { 3, 2, -1.0 }, { 3, 3, 2.0 }, { 3, 4, -1.0 },
18 { 4, 4, 2.0 }
19 // clang-format on
20 } );
21
22 std::cout << matrix << std::endl;
23}
24
25int
26main( int argc, char* argv[] )
27{
28 std::cout << "Setting matrix elements on host: " << std::endl;
29 setElementsExample< TNL::Devices::Host >();
30
31#ifdef __CUDACC__
32 std::cout << "Setting matrix elements on CUDA device: " << std::endl;
33 setElementsExample< TNL::Devices::Cuda >();
34#endif
35}

In this example, we create the matrix in two steps: first we call the constructor with only the matrix dimensions as parameters (line 10), and then we set the matrix elements with the setElements method (lines 12-20). The result of both examples looks as follows:

Creating matrices on CPU ...
General sparse matrix:
Row: 0 -> 0:2
Row: 1 -> 0:-1 1:2 2:-1
Row: 2 -> 1:-1 2:2 3:-1
Row: 3 -> 2:-1 3:2 4:-1
Row: 4 -> 4:2
Creating matrices on CUDA GPU ...
General sparse matrix:
Row: 0 -> 0:2
Row: 1 -> 0:-1 1:2 2:-1
Row: 2 -> 1:-1 2:2 3:-1
Row: 3 -> 2:-1 3:2 4:-1
Row: 4 -> 4:2
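The triples accepted by the constructor and by setElements are essentially the COO format; converting them into the three CSR arrays can be sketched in plain C++ as follows (assuming the triples are sorted by row, as in the examples above; cooToCsr and CsrSketch are illustrative names, not TNL API):

```cpp
#include <tuple>
#include <vector>

// The three arrays of the CSR format. Illustrative sketch only.
struct CsrSketch
{
   std::vector< int > rowOffsets, columnIndexes;
   std::vector< double > values;
};

// Convert COO triples { row, column, value }, sorted by row, into CSR.
CsrSketch cooToCsr( const std::vector< std::tuple< int, int, double > >& triples, int rows )
{
   CsrSketch csr;
   csr.rowOffsets.assign( rows + 1, 0 );
   for( const auto& t : triples )
      csr.rowOffsets[ std::get< 0 >( t ) + 1 ]++;      // count elements per row
   for( int i = 0; i < rows; i++ )
      csr.rowOffsets[ i + 1 ] += csr.rowOffsets[ i ];  // prefix sum -> offsets
   for( const auto& t : triples ) {
      csr.columnIndexes.push_back( std::get< 1 >( t ) );
      csr.values.push_back( std::get< 2 >( t ) );
   }
   return csr;
}
```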

STL map

The constructor which creates the sparse matrix from std::map is useful especially in situations when you cannot estimate the matrix row capacities in advance. You can first store the matrix elements in an std::map in a COO-format manner. It means that each entry of the map is the following pair:

std::pair( std::pair( row_index, column_index ), element_value )

which defines one matrix element at the given coordinates (row_index, column_index) with the given value (element_value). Of course, you can insert such entries into the map in arbitrary order. When the map is complete, you pass it to the sparse matrix. See the following example:

1#include <iostream>
2#include <map>
3#include <utility>
4#include <TNL/Matrices/SparseMatrix.h>
5#include <TNL/Devices/Host.h>
6
7template< typename Device >
8void
9initializerListExample()
10{
11 std::map< std::pair< int, int >, double > map;
12 map.insert( std::make_pair( std::make_pair( 0, 0 ), 2.0 ) );
13 map.insert( std::make_pair( std::make_pair( 1, 0 ), -1.0 ) );
14 map.insert( std::make_pair( std::make_pair( 1, 1 ), 2.0 ) );
15 map.insert( std::make_pair( std::make_pair( 1, 2 ), -1.0 ) );
16 map.insert( std::make_pair( std::make_pair( 2, 1 ), -1.0 ) );
17 map.insert( std::make_pair( std::make_pair( 2, 2 ), 2.0 ) );
18 map.insert( std::make_pair( std::make_pair( 2, 3 ), -1.0 ) );
19 map.insert( std::make_pair( std::make_pair( 3, 2 ), -1.0 ) );
20 map.insert( std::make_pair( std::make_pair( 3, 3 ), 2.0 ) );
21 map.insert( std::make_pair( std::make_pair( 3, 4 ), -1.0 ) );
22 map.insert( std::make_pair( std::make_pair( 4, 4 ), 2.0 ) );
23
24 TNL::Matrices::SparseMatrix< double, Device > matrix( 5, 5, map );
25
26 std::cout << "General sparse matrix: " << std::endl << matrix << std::endl;
27}
28
29int
30main( int argc, char* argv[] )
31{
32 std::cout << "Creating matrices on CPU ... " << std::endl;
33 initializerListExample< TNL::Devices::Host >();
34
35#ifdef __CUDACC__
36 std::cout << "Creating matrices on CUDA GPU ... " << std::endl;
37 initializerListExample< TNL::Devices::Cuda >();
38#endif
39}

The method setElements (TNL::Matrices::SparseMatrix::setElements) works the same way for already existing instances of sparse matrix:

1#include <iostream>
2#include <map>
3#include <utility>
4#include <TNL/Matrices/SparseMatrix.h>
5#include <TNL/Devices/Host.h>
6
7template< typename Device >
8void
9setElementsExample()
10{
11 std::map< std::pair< int, int >, double > map;
12 map.insert( std::make_pair( std::make_pair( 0, 0 ), 2.0 ) );
13 map.insert( std::make_pair( std::make_pair( 1, 0 ), -1.0 ) );
14 map.insert( std::make_pair( std::make_pair( 1, 1 ), 2.0 ) );
15 map.insert( std::make_pair( std::make_pair( 1, 2 ), -1.0 ) );
16 map.insert( std::make_pair( std::make_pair( 2, 1 ), -1.0 ) );
17 map.insert( std::make_pair( std::make_pair( 2, 2 ), 2.0 ) );
18 map.insert( std::make_pair( std::make_pair( 2, 3 ), -1.0 ) );
19 map.insert( std::make_pair( std::make_pair( 3, 2 ), -1.0 ) );
20 map.insert( std::make_pair( std::make_pair( 3, 3 ), 2.0 ) );
21 map.insert( std::make_pair( std::make_pair( 3, 4 ), -1.0 ) );
22 map.insert( std::make_pair( std::make_pair( 4, 4 ), 2.0 ) );
23
24 TNL::Matrices::SparseMatrix< double, Device > matrix( 5, 5 );
25 matrix.setElements( map );
26
27 std::cout << "General sparse matrix: " << std::endl << matrix << std::endl;
28}
29
30int
31main( int argc, char* argv[] )
32{
33 std::cout << "Creating matrices on CPU ... " << std::endl;
34 setElementsExample< TNL::Devices::Host >();
35
36#ifdef __CUDACC__
37 std::cout << "Creating matrices on CUDA GPU ... " << std::endl;
38 setElementsExample< TNL::Devices::Cuda >();
39#endif
40}

The result of both examples looks as follows:

Creating matrices on CPU ...
General sparse matrix:
Row: 0 -> 0:2
Row: 1 -> 0:-1 1:2 2:-1
Row: 2 -> 1:-1 2:2 3:-1
Row: 3 -> 2:-1 3:2 4:-1
Row: 4 -> 4:2
Creating matrices on CUDA GPU ...
General sparse matrix:
Row: 0 -> 0:2
Row: 1 -> 0:-1 1:2 2:-1
Row: 2 -> 1:-1 2:2 3:-1
Row: 3 -> 2:-1 3:2 4:-1
Row: 4 -> 4:2

Note, however, that the map can be constructed only on the CPU and not on the GPU. It requires allocation of additional memory on the host system (CPU), and if the target sparse matrix resides on the GPU, the matrix elements must be copied to the GPU. This is the reason why this way of sparse matrix setup is inefficient compared to the other methods.
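The reason the map-based setup works without row capacities known in advance is that the capacities can be derived from the map itself by counting entries per row. The following plain-C++ sketch shows the idea (illustrative, not the actual TNL implementation):

```cpp
#include <map>
#include <utility>
#include <vector>

// Derive per-row capacities from a COO-style std::map, which is what a
// map-based sparse matrix setup has to do internally before inserting
// the elements. Illustrative sketch only.
std::vector< int >
rowCapacitiesFromMap( const std::map< std::pair< int, int >, double >& map, int rows )
{
   std::vector< int > capacities( rows, 0 );
   for( const auto& entry : map )
      capacities[ entry.first.first ]++;  // entry.first.first is the row index
   return capacities;
}
```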

Methods setElement and addElement

Another way of setting up a sparse matrix is by means of the methods setElement and addElement (TNL::Matrices::SparseMatrix::setElement, TNL::Matrices::SparseMatrix::addElement). They can be called from both the host (CPU) and the device (GPU) on which the matrix is allocated. Note, however, that if the matrix is allocated on the GPU and the methods are called from the CPU, there will be a significant performance drop because the matrix elements will be transferred one by one. However, if the matrix element setup is not a critical part of your algorithm, this can be an easy way to do it. See the following example:

1#include <iostream>
2#include <TNL/Algorithms/parallelFor.h>
3#include <TNL/Matrices/SparseMatrix.h>
4#include <TNL/Devices/Host.h>
5#include <TNL/Devices/Cuda.h>
6
7template< typename Device >
8void
9setElements()
10{
11 TNL::Matrices::SparseMatrix< double, Device > matrix( { 1, 1, 1, 1, 1 }, 5 );
12
13 /****
14 * Get the matrix view.
15 */
16 auto view = matrix.getView();
17 for( int i = 0; i < 5; i++ )
18 view.setElement( i, i, i );
19
20 std::cout << "Matrix set from the host:" << std::endl;
21 std::cout << matrix << std::endl;
22
23 auto f = [ = ] __cuda_callable__( int i ) mutable
24 {
25 view.setElement( i, i, -i );
26 };
27
28 TNL::Algorithms::parallelFor< Device >( 0, 5, f );
29
30 std::cout << "Matrix set from its native device:" << std::endl;
31 std::cout << matrix << std::endl;
32}
33
34int
35main( int argc, char* argv[] )
36{
37 std::cout << "Set elements on host:" << std::endl;
38 setElements< TNL::Devices::Host >();
39
40#ifdef __CUDACC__
41 std::cout << "Set elements on CUDA device:" << std::endl;
42 setElements< TNL::Devices::Cuda >();
43#endif
44}

We first allocate a matrix with five rows and five columns; the number of rows is given by the size of the initializer list, which also sets the capacity of each row to one (line 11). The first for-loop (lines 17-18) runs on the CPU no matter where the matrix is allocated. After printing the matrix (lines 20-21), we call the lambda function f (lines 23-26) with the help of parallelFor (line 28), which is device sensitive, so it runs on the CPU or the GPU depending on where the matrix is allocated. The result looks as follows:

Set elements on host:
Matrix set from the host:
Row: 0 ->
Row: 1 -> 1:1
Row: 2 -> 2:2
Row: 3 -> 3:3
Row: 4 -> 4:4
Matrix set from its native device:
Row: 0 ->
Row: 1 -> 1:-1
Row: 2 -> 2:-2
Row: 3 -> 3:-3
Row: 4 -> 4:-4
Set elements on CUDA device:
Matrix set from the host:
Row: 0 ->
Row: 1 -> 1:1
Row: 2 -> 2:2
Row: 3 -> 3:3
Row: 4 -> 4:4
Matrix set from its native device:
Row: 0 ->
Row: 1 -> 1:-1
Row: 2 -> 2:-2
Row: 3 -> 3:-3
Row: 4 -> 4:-4

The method addElement (TNL::Matrices::SparseMatrix::addElement) adds a value to a specific matrix element. Otherwise, it behaves the same as setElement. See the following example:

1#include <iostream>
2#include <TNL/Matrices/SparseMatrix.h>
3#include <TNL/Devices/Host.h>
4
5template< typename Device >
6void
7addElements()
8{
9 TNL::Matrices::SparseMatrix< double, Device > matrix( { 5, 5, 5, 5, 5 }, 5 );
10 auto matrixView = matrix.getView();
11
12 for( int i = 0; i < 5; i++ )
13 matrixView.setElement( i, i, i ); // or matrix.setElement
14
15 std::cout << "Initial matrix is: " << std::endl << matrix << std::endl;
16
17 for( int i = 0; i < 5; i++ )
18 for( int j = 0; j < 5; j++ )
19 matrixView.addElement( i, j, 1.0, 5.0 ); // or matrix.addElement
20
21 std::cout << "Matrix after addition is: " << std::endl << matrix << std::endl;
22}
23
24int
25main( int argc, char* argv[] )
26{
27 std::cout << "Add elements on host:" << std::endl;
28 addElements< TNL::Devices::Host >();
29
30#ifdef __CUDACC__
31 std::cout << "Add elements on CUDA device:" << std::endl;
32 addElements< TNL::Devices::Cuda >();
33#endif
34}

The result looks as follows:

Add elements on host:
Initial matrix is:
Row: 0 ->
Row: 1 -> 1:1
Row: 2 -> 2:2
Row: 3 -> 3:3
Row: 4 -> 4:4
Matrix after addition is:
Row: 0 -> 0:1 1:1 2:1 3:1 4:1
Row: 1 -> 0:1 1:6 2:1 3:1 4:1
Row: 2 -> 0:1 1:1 2:11 3:1 4:1
Row: 3 -> 0:1 1:1 2:1 3:16 4:1
Row: 4 -> 0:1 1:1 2:1 3:1 4:21
Add elements on CUDA device:
Initial matrix is:
Row: 0 ->
Row: 1 -> 1:1
Row: 2 -> 2:2
Row: 3 -> 3:3
Row: 4 -> 4:4
Matrix after addition is:
Row: 0 -> 0:1 1:1 2:1 3:1 4:1
Row: 1 -> 0:1 1:6 2:1 3:1 4:1
Row: 2 -> 0:1 1:1 2:11 3:1 4:1
Row: 3 -> 0:1 1:1 2:1 3:16 4:1
Row: 4 -> 0:1 1:1 2:1 3:1 4:21

Method getRow

More efficient method, especially for GPUs, is to combine getRow (TNL::Matrices::SparseMatrix::getRow) method with parallelFor and lambda function as the following example demonstrates:

1#include <iostream>
2#include <TNL/Algorithms/parallelFor.h>
3#include <TNL/Matrices/SparseMatrix.h>
4#include <TNL/Devices/Host.h>
5#include <TNL/Devices/Cuda.h>
6
7template< typename Device >
8void
9getRowExample()
10{
11 /***
12 * Set the following matrix (dots represent zero matrix elements):
13 *
14 * / 2 . . . . \
15 * | 1 2 1 . . |
16 * | . 1 2 1 . |
17 * | . . 1 2 1 |
18 * \ . . . . 2 /
19 */
20 const int size = 5;
21 TNL::Matrices::SparseMatrix< double, Device > matrix( { 1, 3, 3, 3, 1 }, size );
22 auto view = matrix.getView();
23
24 /***
25 * Set the matrix elements.
26 */
27 auto f = [ = ] __cuda_callable__( int rowIdx ) mutable
28 {
29 auto row = view.getRow( rowIdx );
30 if( rowIdx == 0 )
31 row.setElement( 0, rowIdx, 2.0 ); // diagonal element
32 else if( rowIdx == size - 1 )
33 row.setElement( 0, rowIdx, 2.0 ); // diagonal element
34 else {
35 row.setElement( 0, rowIdx - 1, 1.0 ); // elements below the diagonal
36 row.setElement( 1, rowIdx, 2.0 ); // diagonal element
37 row.setElement( 2, rowIdx + 1, 1.0 ); // elements above the diagonal
38 }
39 };
40 TNL::Algorithms::parallelFor< Device >( 0, matrix.getRows(), f );
41 std::cout << matrix << std::endl;
42}
43
44int
45main( int argc, char* argv[] )
46{
47 std::cout << "Getting matrix rows on host: " << std::endl;
48 getRowExample< TNL::Devices::Host >();
49
50#ifdef __CUDACC__
51 std::cout << "Getting matrix rows on CUDA device: " << std::endl;
52 getRowExample< TNL::Devices::Cuda >();
53#endif
54}

On line 21, we create a small matrix with five rows (the number of rows is given by the size of the initializer list) and five columns (the number of columns is given by the second parameter), and we set each row capacity to one or three (the particular elements of the initializer list). On line 40, we call parallelFor to iterate over all matrix rows. Each row is processed by the lambda function f (lines 27-39). In the lambda function, we first fetch a sparse matrix row (TNL::Matrices::SparseMatrixRowView) (line 29), which serves for accessing particular matrix elements in the matrix row. This object has a method setElement (TNL::Matrices::SparseMatrixRowView::setElement) accepting three parameters:

  1. localIdx is the rank of the nonzero element within the given matrix row.
  2. columnIdx is the new column index of the matrix element.
  3. value is the new value of the matrix element.

The result looks as follows:

Getting matrix rows on host:
Row: 0 -> 0:2
Row: 1 -> 0:1 1:2 2:1
Row: 2 -> 1:1 2:2 3:1
Row: 3 -> 2:1 3:2 4:1
Row: 4 -> 4:2
Getting matrix rows on CUDA device:
Row: 0 -> 0:2
Row: 1 -> 0:1 1:2 2:1
Row: 2 -> 1:1 2:2 3:1
Row: 3 -> 2:1 3:2 4:1
Row: 4 -> 4:2

Method forRows

The method forRows (TNL::Matrices::SparseMatrix::forRows) calls the method getRow (TNL::Matrices::SparseMatrix::getRow) in parallel. See the following example, which has the same effect as the previous one but is slightly simpler:

1#include <iostream>
2#include <TNL/Matrices/SparseMatrix.h>
3#include <TNL/Devices/Host.h>
4#include <TNL/Devices/Cuda.h>
5
6template< typename Device >
7void
8forRowsExample()
9{
10 /***
11 * Set the following matrix (dots represent zero matrix elements):
12 *
13 * / 2 . . . . \
14 * | 1 2 1 . . |
15 * | . 1 2 1 . |
16 * | . . 1 2 1 |
17 * \ . . . . 2 /
18 */
19 const int size( 5 );
20 using MatrixType = TNL::Matrices::SparseMatrix< double, Device >;
21 using RowView = typename MatrixType::RowView;
22 MatrixType matrix( { 1, 3, 3, 3, 1 }, size );
23 auto matrixView = matrix.getView();
24
25 /***
26 * Set the matrix elements.
27 */
28 auto f = [] __cuda_callable__( RowView & row )
29 {
30 const int rowIdx = row.getRowIndex();
31 if( rowIdx == 0 )
32 row.setElement( 0, rowIdx, 2.0 ); // diagonal element
33 else if( rowIdx == size - 1 )
34 row.setElement( 0, rowIdx, 2.0 ); // diagonal element
35 else {
36 row.setElement( 0, rowIdx - 1, 1.0 ); // elements below the diagonal
37 row.setElement( 1, rowIdx, 2.0 ); // diagonal element
38 row.setElement( 2, rowIdx + 1, 1.0 ); // elements above the diagonal
39 }
40 };
41 matrixView.forAllRows( f ); // or matrix.forAllRows
42 std::cout << matrix << std::endl;
43
44 /***
45 * Divide each matrix row by a sum of all elements in the row - with use of iterators.
46 */
47 matrix.forAllRows(
48 [] __cuda_callable__( RowView & row )
49 {
50 double sum = 0.0;
51 for( auto element : row )
52 sum += element.value();
53 for( auto element : row )
54 element.value() /= sum;
55 } );
56 std::cout << "Divide each matrix row by a sum of all elements in the row ... " << std::endl;
57 std::cout << matrix << std::endl;
58}
59
60int
61main( int argc, char* argv[] )
62{
63 std::cout << "Getting matrix rows on host: " << std::endl;
64 forRowsExample< TNL::Devices::Host >();
65
66#ifdef __CUDACC__
67 std::cout << "Getting matrix rows on CUDA device: " << std::endl;
68 forRowsExample< TNL::Devices::Cuda >();
69#endif
70}

The differences are:

  1. We do not need to get the matrix view as we did in the previous example.
  2. We call the method forAllRows (TNL::Matrices::SparseMatrix::forAllRows) instead of parallelFor, which is simpler since we do not have to state the device type explicitly. The method forAllRows calls the method forRows for all matrix rows, so we do not have to state the interval of matrix rows explicitly either.
  3. The lambda function f (lines 28-40) accepts one parameter row of the type RowView (TNL::Matrices::SparseMatrix::RowView, which is TNL::Matrices::SparseMatrixRowView) instead of the index of the matrix row. Therefore we do not need to call the method getRow (TNL::Matrices::SparseMatrix::getRow). On the other hand, we need the method getRowIndex (TNL::Matrices::SparseMatrixRowView::getRowIndex) to get the index of the matrix row (line 30).

On lines 47-55, we call a lambda function which computes the sum of all elements in a row (lines 50-52) and then divides the row by that sum (lines 53-54).

The result looks as follows:

Getting matrix rows on host:
Row: 0 -> 0:2
Row: 1 -> 0:1 1:2 2:1
Row: 2 -> 1:1 2:2 3:1
Row: 3 -> 2:1 3:2 4:1
Row: 4 -> 4:2
Divide each matrix row by a sum of all elements in the row ...
Row: 0 -> 0:1
Row: 1 -> 0:0.25 1:0.5 2:0.25
Row: 2 -> 1:0.25 2:0.5 3:0.25
Row: 3 -> 2:0.25 3:0.5 4:0.25
Row: 4 -> 4:1
Getting matrix rows on CUDA device:
Row: 0 -> 0:2
Row: 1 -> 0:1 1:2 2:1
Row: 2 -> 1:1 2:2 3:1
Row: 3 -> 2:1 3:2 4:1
Row: 4 -> 4:2
Divide each matrix row by a sum of all elements in the row ...
Row: 0 -> 0:1
Row: 1 -> 0:0.25 1:0.5 2:0.25
Row: 2 -> 1:0.25 2:0.5 3:0.25
Row: 3 -> 2:0.25 3:0.5 4:0.25
Row: 4 -> 4:1

Method forElements

Finally, another efficient way of setting the nonzero matrix elements is the use of the method forElements (TNL::Matrices::SparseMatrix::forElements). It requires the range of rows (begin and end) to be processed and a lambda function which is called for each nonzero element. The lambda function provides the following data:

  • rowIdx is a row index of the matrix element.
  • localIdx is an index of the nonzero matrix element within the matrix row.
  • columnIdx is a column index of the matrix element. If the matrix element column index is supposed to be modified, this parameter can be a reference and so its value can be changed.
  • value is a value of the matrix element. If the matrix element value is supposed to be modified, this parameter can be a reference as well and so the element value can be changed.
  • compute is a bool reference. When it is set to false the rest of the row can be omitted. This is, however, only a hint and it depends on the underlying matrix format if it is taken into account.

See the following example:

1#include <iostream>
2#include <TNL/Matrices/SparseMatrix.h>
3#include <TNL/Devices/Host.h>
4#include <TNL/Devices/Cuda.h>
5
6template< typename Device >
7void
8forElementsExample()
9{
10 TNL::Matrices::SparseMatrix< double, Device > matrix( { 1, 2, 3, 4, 5 }, 5 );
11 auto matrixView = matrix.getView();
12
13 auto f = [] __cuda_callable__( int rowIdx, int localIdx, int& columnIdx, double& value )
14 {
15 // This is important, some matrix formats may allocate more matrix elements
16 // than we requested. These padding elements are processed here as well.
17 if( rowIdx >= localIdx ) {
18 columnIdx = localIdx;
19 value = rowIdx + localIdx + 1;
20 }
21 };
22
23 matrixView.forElements( 0, matrix.getRows(), f ); // or matrix.forElements
24 std::cout << matrix << std::endl;
25}
26
27int
28main( int argc, char* argv[] )
29{
30 std::cout << "Creating matrix on host: " << std::endl;
31 forElementsExample< TNL::Devices::Host >();
32
33#ifdef __CUDACC__
34 std::cout << "Creating matrix on CUDA device: " << std::endl;
35 forElementsExample< TNL::Devices::Cuda >();
36#endif
37}

On line 10, we allocate a lower triangular matrix by setting the row capacities to {1,2,3,4,5}. On line 13, we prepare the lambda function f, which we execute on line 23 just by calling the method forElements (TNL::Matrices::SparseMatrix::forElements). This method takes the range of matrix rows as the first two parameters and the lambda function as the last parameter. The lambda function receives the parameters mentioned above (see line 13). We first check if the matrix element coordinates (rowIdx and localIdx) point to an element lying on or before the matrix diagonal (line 17). In the case of the lower triangular matrix in our example, the local index is in fact the same as the column index

\[ \left( \begin{array}{ccccc} 0 & . & . & . & . \\ 0 & 1 & . & . & . \\ 0 & 1 & 2 & . & . \\ 0 & 1 & 2 & 3 & . \\ 0 & 1 & 2 & 3 & 4 \end{array} \right) \]

If we call the method forElements (TNL::Matrices::SparseMatrix::forElements) to set up the matrix elements for the first time, the parameter columnIdx carries no meaning yet because the matrix elements and their column indexes have not been set. Therefore it is important that the test on line 17 reads as

if( rowIdx >= localIdx )

because

if( rowIdx >= columnIdx )

would not make sense. If we pass this test, the matrix element lies in the lower triangular part of the matrix and we may set it, which is done on lines 18 and 19. The column index (columnIdx) is set to the local index (line 18) and the value is set on line 19. The result looks as follows:

Creating matrix on host:
Row: 0 -> 0:1
Row: 1 -> 0:2 1:3
Row: 2 -> 0:3 1:4 2:5
Row: 3 -> 0:4 1:5 2:6 3:7
Row: 4 -> 0:5 1:6 2:7 3:8 4:9
Creating matrix on CUDA device:
Row: 0 -> 0:1
Row: 1 -> 0:2 1:3
Row: 2 -> 0:3 1:4 2:5
Row: 3 -> 0:4 1:5 2:6 3:7
Row: 4 -> 0:5 1:6 2:7 3:8 4:9

Wrapping existing data to sparse matrix view

Standard sparse matrix formats like CSR and Ellpack store the matrix elements in specifically defined arrays. If you have already allocated them (for example in some other library), they can be wrapped into a sparse matrix view with the given matrix format. This can be done by means of the functions TNL::Matrices::wrapCSRMatrix and TNL::Matrices::wrapEllpackMatrix. See the following example for a demonstration of the CSR format:

1#include <iostream>
2#include <TNL/Matrices/DenseMatrix.h>
3#include <TNL/Matrices/MatrixWrapping.h>
4#include <TNL/Devices/Host.h>
5#include <TNL/Devices/Cuda.h>
6
7template< typename Device >
8void
9wrapMatrixView()
10{
11 /***
12 * Encode the following matrix to CSR format...
13 *
14 * / 1 2 0 0 \.
15 * | 0 6 0 0 |
16 * | 9 0 0 0 |
17 * \ 0 0 15 16 /
18 */
19 const int rows( 4 ), columns( 4 );
20 TNL::Containers::Vector< double, Device > valuesVector{ 1, 2, 6, 9, 15, 16 };
21 TNL::Containers::Vector< int, Device > columnIndexesVector{ 0, 1, 1, 0, 2, 3 };
22 TNL::Containers::Vector< int, Device > rowPointersVector{ 0, 2, 3, 4, 6 };
23
24 double* values = valuesVector.getData();
25 int* columnIndexes = columnIndexesVector.getData();
26 int* rowPointers = rowPointersVector.getData();
27
28 /***
29 * Wrap the arrays `rowPointers, `values` and `columnIndexes` to sparse matrix view
30 */
31 auto matrix = TNL::Matrices::wrapCSRMatrix< Device >( rows, columns, rowPointers, values, columnIndexes );
32
33 std::cout << "Matrix reads as: " << std::endl << matrix << std::endl;
34}
35
36int
37main( int argc, char* argv[] )
38{
39 std::cout << "Wraping matrix view on host: " << std::endl;
40 wrapMatrixView< TNL::Devices::Host >();
41
42#ifdef __CUDACC__
43 std::cout << "Wraping matrix view on CUDA device: " << std::endl;
44 wrapMatrixView< TNL::Devices::Cuda >();
45#endif
46}

We create a sparse matrix with four rows and four columns (line 19). We use a TNL vector (TNL::Containers::Vector) to allocate the arrays necessary for the CSR format:

  1. valuesVector (line 20) - contains the values of the nonzero matrix elements.
  2. columnIndexesVector (line 21) - contains the column indexes of the nonzero matrix elements.
  3. rowPointersVector (line 22) - contains the position of the first nonzero matrix element of each row within valuesVector and columnIndexesVector. The size of this array equals the number of matrix rows plus one.

Next we turn the vectors into C-style pointers (lines 24-26) to wrap them into a sparse matrix view (line 31). Note that we must explicitly state the device on which the arrays are allocated. Finally, we print the matrix to check the correctness (line 33). The result looks as follows:

Wraping matrix view on host:
Matrix reads as:
Row: 0 -> 0:1 1:2
Row: 1 -> 1:6
Row: 2 -> 0:9
Row: 3 -> 2:15 3:16
Wraping matrix view on CUDA device:
Matrix reads as:
Row: 0 -> 0:1 1:2
Row: 1 -> 1:6
Row: 2 -> 0:9
Row: 3 -> 2:15 3:16

Wrapping data corresponding to the Ellpack format is very similar, as we can see in the following example:

1#include <iostream>
2#include <TNL/Matrices/DenseMatrix.h>
3#include <TNL/Matrices/MatrixWrapping.h>
4#include <TNL/Devices/Host.h>
5#include <TNL/Devices/Cuda.h>
6
7template< typename Device >
8void
9wrapMatrixView()
10{
11 /***
12 * Encode the following matrix to Ellpack format...
13 *
14 * / 1 2 0 0 \.
15 * | 0 6 0 0 |
16 * | 9 0 0 0 |
17 * \ 0 0 15 16 /
18 */
19 const int rows( 4 ), columns( 4 );
20 TNL::Containers::Vector< double, Device > valuesVector{ 1, 2, 6, 0, 9, 0, 15, 16 };
21 TNL::Containers::Vector< int, Device > columnIndexesVector{ 0, 1, 1, -1, 0, -1, 2, 3 };
22
23 double* values = valuesVector.getData();
24 int* columnIndexes = columnIndexesVector.getData();
25
26 /***
27 * Wrap the arrays `values` and `columnIndexes` to sparse matrix view
28 */
29 auto matrix = TNL::Matrices::wrapEllpackMatrix< Device, TNL::Algorithms::Segments::RowMajorOrder >(
30 rows, columns, 2, values, columnIndexes );
31
32 std::cout << "Matrix reads as: " << std::endl << matrix << std::endl;
33}
34
35int
36main( int argc, char* argv[] )
37{
38 std::cout << "Wraping matrix view on host: " << std::endl;
39 wrapMatrixView< TNL::Devices::Host >();
40
41#ifdef __CUDACC__
42 std::cout << "Wraping matrix view on CUDA device: " << std::endl;
43 wrapMatrixView< TNL::Devices::Cuda >();
44#endif
45}

We encode the same sparse matrix as in the previous example. The essence of the Ellpack format is that we allocate the same number of matrix elements for each row, which is two in our example. For some matrix rows we use padding zeros, for which we set the column index to -1 (line 21). Therefore the size of valuesVector and columnIndexesVector equals the number of matrix rows times the number of matrix elements allocated in each row. As before, we turn the vectors into C-style pointers (lines 23-24) and wrap them into a sparse matrix view with the Ellpack format (line 29). Note that we must explicitly state the device on which the arrays are allocated and also the matrix elements organization, which is TNL::Algorithms::Segments::RowMajorOrder in this case. For an Ellpack matrix stored on a GPU, TNL::Algorithms::Segments::ColumnMajorOrder is preferred. The result looks as follows:

Wraping matrix view on host:
Matrix reads as:
Row: 0 -> 0:1 1:2
Row: 1 -> 1:6
Row: 2 -> 0:9
Row: 3 -> 2:15 3:16
Wraping matrix view on CUDA device:
Matrix reads as:
Row: 0 -> 0:1 1:2
Row: 1 -> 1:6
Row: 2 -> 0:9
Row: 3 -> 2:15 3:16

Symmetric sparse matrices

For sparse symmetric matrices, TNL offers a format storing only half of the matrix elements. More precisely, only the matrix diagonal and the elements below it are stored in memory. The matrix elements above the diagonal are deduced from those below. If such a symmetric format is used on a GPU, atomic operations must be used in some matrix operations. For this reason, symmetric matrices can be combined only with matrix element values expressed in the float or double type. An advantage of the symmetric formats is lower memory consumption. Since less data need to be transferred from memory, better performance might be observed. In some cases, however, the use of atomic operations on a GPU may cause a performance drop. Mostly we see approximately the same performance compared to general formats, but we can profit from the lower memory requirements, which is appreciated especially on GPUs. The following example shows how to create a symmetric sparse matrix.

1#include <iostream>
2#include <TNL/Matrices/SparseMatrix.h>
3#include <TNL/Devices/Host.h>
4
5template< typename Device >
6void
7symmetricSparseMatrixExample()
8{
9 TNL::Matrices::SparseMatrix< double, Device, int, TNL::Matrices::SymmetricMatrix > symmetricMatrix(
10 5, // number of matrix rows
11 5, // number of matrix columns
12 {
13 // matrix elements definition
14 // clang-format off
15 { 0, 0, 1.0 },
16 { 1, 0, 2.0 }, { 1, 1, 1.0 },
17 { 2, 0, 3.0 }, { 2, 2, 1.0 },
18 { 3, 0, 4.0 }, { 3, 3, 1.0 },
19 { 4, 0, 5.0 }, { 4, 4, 1.0 }
20 // clang-format on
21 } );
22
23 std::cout << "Symmetric sparse matrix: " << std::endl << symmetricMatrix << std::endl;
24
25 TNL::Containers::Vector< double, Device > inVector( 5, 1.0 ), outVector( 5, 0.0 );
26 symmetricMatrix.vectorProduct( inVector, outVector );
27 std::cout << "Product with vector " << inVector << " is " << outVector << std::endl << std::endl;
28}
29
30int
31main( int argc, char* argv[] )
32{
33 std::cout << "Creating matrix on CPU ... " << std::endl;
34 symmetricSparseMatrixExample< TNL::Devices::Host >();
35
36#ifdef __CUDACC__
37 std::cout << "Creating matrix on CUDA GPU ... " << std::endl;
38 symmetricSparseMatrixExample< TNL::Devices::Cuda >();
39#endif
40}

We construct matrix of the following form

\[ \left( \begin{array}{ccccc} 1 & \color{grey}{2} & \color{grey}{3} & \color{grey}{4} & \color{grey}{5} \\ 2 & 1 & & & \\ 3 & & 1 & & \\ 4 & & & 1 & \\ 5 & & & & 1 \end{array} \right) \]

The elements depicted in grey are not stored in memory. The main difference, compared to the creation of a general sparse matrix, is on line 9 where we state that the matrix is symmetric by setting the matrix type to TNL::Matrices::SymmetricMatrix. Next we set only the diagonal elements and those lying below the diagonal (lines 15-19). When we print the matrix (line 23) we can also see the symmetric part above the diagonal. Next we test the product of the matrix and a vector (lines 25-27). The result looks as follows:

Creating matrix on CPU ...
Symmetric sparse matrix:
Row: 0 -> 0:1 1:2 2:3 3:4 4:5
Row: 1 -> 0:2 1:1
Row: 2 -> 0:3 2:1
Row: 3 -> 0:4 3:1
Row: 4 -> 0:5 4:1
Product with vector [ 1, 1, 1, 1, 1 ] is [ 15, 3, 4, 5, 6 ]
Creating matrix on CUDA GPU ...
Symmetric sparse matrix:
Row: 0 -> 0:1 1:2 2:3 3:4 4:5
Row: 1 -> 0:2 1:1
Row: 2 -> 0:3 2:1
Row: 3 -> 0:4 3:1
Row: 4 -> 0:5 4:1
Product with vector [ 1, 1, 1, 1, 1 ] is [ 15, 3, 4, 5, 6 ]

Warning: Assignment of a symmetric sparse matrix to a general sparse matrix does not currently give the correct result. Only the diagonal and the lower part of the matrix are assigned.

Binary sparse matrices

If the matrix element value type (i.e. the Real type) is set to bool, the matrix elements can only be 1 or 0. So in the sparse matrix formats, where we do not store the zero matrix elements, the explicitly stored elements can have only one possible value, which is 1. Therefore we do not need to store the values, only the positions of the nonzero elements. The array values, which usually stores the matrix element values, can be completely omitted and we can reduce the memory requirements. The following table shows how much we can reduce the memory consumption when using a binary matrix instead of a common sparse matrix using the float or double type:


Real   Index   Common sparse matrix   Binary sparse matrix   Ratio
float  32-bit  4 + 4 = 8 B            4 B                    50%
float  64-bit  4 + 8 = 12 B           8 B                    67%
double 32-bit  8 + 4 = 12 B           4 B                    33%
double 64-bit  8 + 8 = 16 B           8 B                    50%

The following example demonstrates the use of binary matrix:

1#include <iostream>
2#include <TNL/Matrices/SparseMatrix.h>
3#include <TNL/Algorithms/Segments/CSR.h>
4#include <TNL/Devices/Host.h>
5
6template< typename Device >
7void
8binarySparseMatrixExample()
9{
10 TNL::Matrices::SparseMatrix< bool, Device > binaryMatrix(
11 // number of matrix rows
12 5,
13 // number of matrix columns
14 5,
15 // matrix elements definition
16 {
17 // clang-format off
18 { 0, 0, 1.0 }, { 0, 1, 2.0 }, { 0, 2, 3.0 }, { 0, 3, 4.0 }, { 0, 4, 5.0 },
19 { 1, 0, 2.0 }, { 1, 1, 1.0 },
20 { 2, 0, 3.0 }, { 2, 2, 1.0 },
21 { 3, 0, 4.0 }, { 3, 3, 1.0 },
22 { 4, 0, 5.0 }, { 4, 4, 1.0 }
23 // clang-format on
24 } );
25
26 std::cout << "Binary sparse matrix: " << std::endl << binaryMatrix << std::endl;
27
28 TNL::Containers::Vector< double, Device > inVector( 5, 1.1 ), outVector( 5, 0.0 );
29 binaryMatrix.vectorProduct( inVector, outVector );
30 std::cout << "Product with vector " << inVector << " is " << outVector << std::endl << std::endl;
31
32 TNL::Matrices::SparseMatrix< bool, Device, int, TNL::Matrices::GeneralMatrix, TNL::Algorithms::Segments::CSRDefault, double >
33 binaryMatrix2;
34 binaryMatrix2 = binaryMatrix;
35 binaryMatrix2.vectorProduct( inVector, outVector );
36 std::cout << "Product with vector in double precision " << inVector << " is " << outVector << std::endl << std::endl;
37}
38
39int
40main( int argc, char* argv[] )
41{
42 std::cout << "Creating matrix on CPU ... " << std::endl;
43 binarySparseMatrixExample< TNL::Devices::Host >();
44
45#ifdef __CUDACC__
46 std::cout << "Creating matrix on CUDA GPU ... " << std::endl;
47 binarySparseMatrixExample< TNL::Devices::Cuda >();
48#endif
49}

All we need to do is set the Real type to bool, as we can see on line 10. We can see that even though we set different values for different matrix elements (lines 18-22), in the end all of them are turned into ones (printing of the matrix on line 26). There is an issue, however, which is demonstrated on the product of the matrix with a vector. Nonbinary matrices compute all operations using the Real type. If it is set to bool, operations like SpMV would not give correct results. Therefore sparse matrices use another type called ComputeReal, which is the 6th template parameter of TNL::Matrices::SparseMatrix. By default it is set to the Index type, but it can be changed by the user. On lines 32-36 we show how to change this type to double and its effect, namely a correct result of the matrix-vector multiplication. The result looks as follows:

Creating matrix on CPU ...
Binary sparse matrix:
Row: 0 -> 0:1 1:1 2:1 3:1 4:1
Row: 1 -> 0:1 1:1
Row: 2 -> 0:1 2:1
Row: 3 -> 0:1 3:1
Row: 4 -> 0:1 4:1
Product with vector [ 1.1, 1.1, 1.1, 1.1, 1.1 ] is [ 5, 2, 2, 2, 2 ]
Product with vector in double precision [ 1.1, 1.1, 1.1, 1.1, 1.1 ] is [ 5.5, 2.2, 2.2, 2.2, 2.2 ]
Creating matrix on CUDA GPU ...
Binary sparse matrix:
Row: 0 -> 0:1 1:1 2:1 3:1 4:1
Row: 1 -> 0:1 1:1
Row: 2 -> 0:1 2:1
Row: 3 -> 0:1 3:1
Row: 4 -> 0:1 4:1
Product with vector [ 1.1, 1.1, 1.1, 1.1, 1.1 ] is [ 5, 2, 2, 2, 2 ]
Product with vector in double precision [ 1.1, 1.1, 1.1, 1.1, 1.1 ] is [ 5.5, 2.2, 2.2, 2.2, 2.2 ]

Tridiagonal matrices

The tridiagonal matrix format serves a specific matrix pattern where the nonzero matrix elements may appear only on the diagonal and immediately next to it. Here is an example:

\[ \left( \begin{array}{cccccc} 2 & -1 & . & . & . & . \\ -1 & 2 & -1 & . & . & . \\ . & -1 & 2 & -1 & . & . \\ . & . & -1 & 2 & -1 & . \\ . & . & . & -1 & 2 & -1 \\ . & . & . & . & -1 & 2 \end{array} \right) \]

An advantage is that we do not store the column indexes explicitly as in TNL::Matrices::SparseMatrix. This can significantly reduce the memory requirements, which also means better performance. See the following table for a comparison of the storage requirements of TNL::Matrices::TridiagonalMatrix and TNL::Matrices::SparseMatrix.

Real   Index       SparseMatrix           TridiagonalMatrix     Ratio
float  32-bit int  8 bytes per element    4 bytes per element   50%
double 32-bit int  12 bytes per element   8 bytes per element   67%
float  64-bit int  12 bytes per element   4 bytes per element   33%
double 64-bit int  16 bytes per element   8 bytes per element   50%

Tridiagonal matrix is a templated class defined in the namespace TNL::Matrices. It has five template parameters:

  • Real is a type of the matrix elements. It is double by default.
  • Device is a device where the matrix shall be allocated. Currently it can be either TNL::Devices::Host for CPU or TNL::Devices::Cuda for GPU supporting CUDA. It is TNL::Devices::Host by default.
  • Index is a type to be used for indexing of the matrix elements. It is int by default.
  • ElementsOrganization defines the organization of the matrix elements in memory. It can be TNL::Algorithms::Segments::ColumnMajorOrder or TNL::Algorithms::Segments::RowMajorOrder for column-major and row-major organization, respectively. By default it is row-major order if the matrix is allocated on the host system and column-major order if it is allocated on a GPU.
  • RealAllocator is a memory allocator (one from TNL::Allocators) which shall be used for allocation of the matrix elements. By default, it is the default allocator for given Real type and Device type – see TNL::Allocators::Default.

For better alignment in memory, the tridiagonal format is organized as if there were three nonzero matrix elements in each row. This is not true, for example, in the first row, where there is no matrix element to the left of the diagonal; the same happens in the last row of the matrix. We therefore have to add artificial matrix elements like this:

\[ \begin{array}{c} 0 \\ . \\ . \\ . \\ . \\ . \end{array} \left( \begin{array}{ccccccc} 2 & -1 & . & . & . & . \\ -1 & 2 & -1 & . & . & . \\ . & -1 & 2 & -1 & . & . \\ . & . & -1 & 2 & -1 & . \\ . & . & . & -1 & 2 & -1 \\ . & . & . & . & -1 & 2 \end{array} \right) \begin{array}{c} . \\ . \\ . \\ . \\ . \\ 0 \end{array} \]

If the tridiagonal matrix has more rows than columns, we have to extend the last two rows with artificial elements in this way

\[ \left( \begin{array}{ccccccc} 2 & -1 & . & . & . & . \\ -1 & 2 & -1 & . & . & . \\ . & -1 & 2 & -1 & . & . \\ . & . & -1 & 2 & -1 & . \\ . & . & . & -1 & 2 & -1 \\ . & . & . & . & -1 & 2 \\ . & . & . & . & . & -1 \end{array} \right) \rightarrow \begin{array}{c} 0 \\ . \\ . \\ . \\ . \\ . \\ . \end{array} \left( \begin{array}{ccccccc} 2 & -1 & . & . & . & . \\ -1 & 2 & -1 & . & . & . \\ . & -1 & 2 & -1 & . & . \\ . & . & -1 & 2 & -1 & . \\ . & . & . & -1 & 2 & -1 \\ . & . & . & . & -1 & 2 \\ . & . & . & . & . & -1 \end{array} \right) \begin{array}{cc} . & . \\ . & . \\ . & . \\ . & . \\ . & . \\ 0 & . \\ 0 & 0 \end{array} \]

We also would like to recall the meaning of the local index (localIdx) of a matrix element within a matrix row. It is the rank of the nonzero matrix element in the given row, as explained in the section Indexing of nonzero matrix elements in sparse matrices. The values of the local index for tridiagonal matrix elements are as follows

\[ \left( \begin{array}{cccccc} 1 & 2 & & & & \\ 0 & 1 & 2 & & & \\ & 0 & 1 & 2 & & \\ & & 0 & 1 & 2 & \\ & & & 0 & 1 & 2 \\ & & & & 0 & 1 \end{array} \right) \]

In the following text we show different methods for setup of tridiagonal matrices.

Initializer list

The tridiagonal matrix can be initialized by means of a constructor taking an initializer list. The matrix from the beginning of this section can be constructed as the following example demonstrates:

1#include <iostream>
2#include <TNL/Matrices/TridiagonalMatrix.h>
3#include <TNL/Devices/Host.h>
4#include <TNL/Devices/Cuda.h>
5
6template< typename Device >
7void
8createTridiagonalMatrix()
9{
10 const int matrixSize = 6;
11
12 /***
13 * Setup the following matrix (dots represent zeros):
14 *
15 * / 2 -1 . . . . \
16 * | -1 2 -1 . . . |
17 * | . -1 2 -1 . . |
18 * | . . -1 2 -1 . |
19 * | . . . -1 2 -1 |
20 * \ . . . . -1 2 /
21 *
22 */
23 TNL::Matrices::TridiagonalMatrix< double, Device > matrix(
24 matrixSize,
25 { /***
26 * To set the matrix elements we first extend the diagonals to their full
27 * lengths even outside the matrix (dots represent zeros and zeros are
28 * artificial zeros used for memory alignment):
29 *
30 * 0 / 2 -1 . . . . \ -> { 0, 2, -1 }
31 * | -1 2 -1 . . . | -> { -1, 2, -1 }
32 * | . -1 2 -1 . . | -> { -1, 2, -1 }
33 * | . . -1 2 -1 . | -> { -1, 2, -1 }
34 * | . . . -1 2 -1 | -> { -1, 2, -1 }
35 * \ . . . . -1 2 / 0 -> { -1, 2, 0 }
36 *
37 */
38 { 0, 2, -1 },
39 { -1, 2, -1 },
40 { -1, 2, -1 },
41 { -1, 2, -1 },
42 { -1, 2, -1 },
43 { -1, 2, 0 } } );
44 std::cout << "The matrix reads as: " << std::endl << matrix << std::endl;
45}
46
47int
48main( int argc, char* argv[] )
49{
50 std::cout << "Creating tridiagonal matrix on CPU ... " << std::endl;
51 createTridiagonalMatrix< TNL::Devices::Host >();
52
53#ifdef __CUDACC__
54 std::cout << "Creating tridiagonal matrix on CUDA GPU ... " << std::endl;
55 createTridiagonalMatrix< TNL::Devices::Cuda >();
56#endif
57}

The matrix element values are defined on lines 38-43. Each matrix row is represented by an embedded initializer list. We set three values in each row, including the padding zeros.

The output of the example looks as:

Creating tridiagonal matrix on CPU ...
The matrix reads as:
Row: 0 -> 0:2 1:-1
Row: 1 -> 0:-1 1:2 2:-1
Row: 2 -> 1:-1 2:2 3:-1
Row: 3 -> 2:-1 3:2 4:-1
Row: 4 -> 3:-1 4:2 5:-1
Row: 5 -> 4:-1 5:2
Creating tridiagonal matrix on CUDA GPU ...
The matrix reads as:
Row: 0 -> 0:2 1:-1
Row: 1 -> 0:-1 1:2 2:-1
Row: 2 -> 1:-1 2:2 3:-1
Row: 3 -> 2:-1 3:2 4:-1
Row: 4 -> 3:-1 4:2 5:-1
Row: 5 -> 4:-1 5:2

Methods setElement and addElement

A similar way of setting up a tridiagonal matrix is offered by the method setElements (TNL::Matrices::TridiagonalMatrix::setElements), as the following example demonstrates:

1#include <iostream>
2#include <TNL/Matrices/TridiagonalMatrix.h>
3#include <TNL/Devices/Host.h>
4#include <TNL/Devices/Cuda.h>
5
6template< typename Device >
7void
8createTridiagonalMatrix()
9{
10 const int matrixSize = 6;
11
12 /***
13 * Setup the following matrix (dots represent zeros):
14 *
15 * / 2 -1 . . . . \
16 * | -1 2 -1 . . . |
17 * | . -1 2 -1 . . |
18 * | . . -1 2 -1 . |
19 * | . . . -1 2 -1 |
20 * \ . . . . -1 2 /
21 *
22 */
23 TNL::Matrices::TridiagonalMatrix< double, Device > matrix( matrixSize, matrixSize );
24 matrix.setElements( { /***
25 * To set the matrix elements we first extend the diagonals to their full
26 * lengths even outside the matrix (dots represent zeros and zeros are
27 * artificial zeros used for memory alignment):
28 *
29 * 0 / 2 -1 . . . . \ -> { 0, 2, -1 }
30 * | -1 2 -1 . . . | -> { -1, 2, -1 }
31 * | . -1 2 -1 . . | -> { -1, 2, -1 }
32 * | . . -1 2 -1 . | -> { -1, 2, -1 }
33 * | . . . -1 2 -1 | -> { -1, 2, -1 }
34 * \ . . . . -1 2 / 0 -> { -1, 2, 0 }
35 *
36 */
37 { 0, 2, -1 },
38 { -1, 2, -1 },
39 { -1, 2, -1 },
40 { -1, 2, -1 },
41 { -1, 2, -1 },
42 { -1, 2, 0 } } );
43 std::cout << "The matrix reads as: " << std::endl << matrix << std::endl;
44}
45
46int
47main( int argc, char* argv[] )
48{
49 std::cout << "Creating tridiagonal matrix on CPU ... " << std::endl;
50 createTridiagonalMatrix< TNL::Devices::Host >();
51
52#ifdef __CUDACC__
53 std::cout << "Creating tridiagonal matrix on CUDA GPU ... " << std::endl;
54 createTridiagonalMatrix< TNL::Devices::Cuda >();
55#endif
56}

Here we create the matrix in two steps. Firstly, we set up the matrix dimensions by the appropriate constructor (line 23) and then we set the matrix elements (lines 24-42). The result looks the same as in the previous example:

Creating tridiagonal matrix on CPU ...
The matrix reads as:
Row: 0 -> 0:2 1:-1
Row: 1 -> 0:-1 1:2 2:-1
Row: 2 -> 1:-1 2:2 3:-1
Row: 3 -> 2:-1 3:2 4:-1
Row: 4 -> 3:-1 4:2 5:-1
Row: 5 -> 4:-1 5:2
Creating tridiagonal matrix on CUDA GPU ...
The matrix reads as:
Row: 0 -> 0:2 1:-1
Row: 1 -> 0:-1 1:2 2:-1
Row: 2 -> 1:-1 2:2 3:-1
Row: 3 -> 2:-1 3:2 4:-1
Row: 4 -> 3:-1 4:2 5:-1
Row: 5 -> 4:-1 5:2

In the following example we create a tridiagonal matrix with 5 rows and 5 columns (lines 11-13). We fetch a matrix view (line 14) so that the matrix elements can be accessed from a lambda function even on GPU. We set the numbers 0,...,4 on the diagonal (lines 16-17) and print the matrix (lines 19-20). Next we use a lambda function (lines 22-29) combined with parallelFor (line 31) to modify the matrix. The off-diagonal elements are set to 1 (lines 25 and 28) and for the diagonal elements we change the sign (line 26).

1#include <iostream>
2#include <TNL/Algorithms/parallelFor.h>
3#include <TNL/Matrices/TridiagonalMatrix.h>
4#include <TNL/Devices/Host.h>
5#include <TNL/Devices/Cuda.h>
6
7template< typename Device >
8void
9setElements()
10{
11 const int matrixSize( 5 );
12 using Matrix = TNL::Matrices::TridiagonalMatrix< double, Device >;
13 Matrix matrix( matrixSize, matrixSize );
14 auto view = matrix.getView();
15
16 for( int i = 0; i < 5; i++ )
17 view.setElement( i, i, i ); // or matrix.setElement
18
19 std::cout << "Matrix set from the host:" << std::endl;
20 std::cout << matrix << std::endl;
21
22 auto f = [ = ] __cuda_callable__( int i ) mutable
23 {
24 if( i > 0 )
25 view.setElement( i, i - 1, 1.0 );
26 view.setElement( i, i, -i );
27 if( i < matrixSize - 1 )
28 view.setElement( i, i + 1, 1.0 );
29 };
30
31 TNL::Algorithms::parallelFor< Device >( 0, matrixSize, f );
32
33 std::cout << "Matrix set from its native device:" << std::endl;
34 std::cout << matrix << std::endl;
35}
36
37int
38main( int argc, char* argv[] )
39{
40 std::cout << "Set elements on host:" << std::endl;
41 setElements< TNL::Devices::Host >();
42
43#ifdef __CUDACC__
44 std::cout << "Set elements on CUDA device:" << std::endl;
45 setElements< TNL::Devices::Cuda >();
46#endif
47}

The result looks as follows:

Set elements on host:
Matrix set from the host:
Row: 0 ->
Row: 1 -> 1:1
Row: 2 -> 2:2
Row: 3 -> 3:3
Row: 4 -> 4:4
Matrix set from its native device:
Row: 0 -> 1:1
Row: 1 -> 0:1 1:-1 2:1
Row: 2 -> 1:1 2:-2 3:1
Row: 3 -> 2:1 3:-3 4:1
Row: 4 -> 3:1 4:-4
Set elements on CUDA device:
Matrix set from the host:
Row: 0 ->
Row: 1 -> 1:1
Row: 2 -> 2:2
Row: 3 -> 3:3
Row: 4 -> 4:4
Matrix set from its native device:
Row: 0 -> 1:1
Row: 1 -> 0:1 1:-1 2:1
Row: 2 -> 1:1 2:-2 3:1
Row: 3 -> 2:1 3:-3 4:1
Row: 4 -> 3:1 4:-4

Method getRow

A slightly different way of setting up the matrix is to use the tridiagonal matrix view and the method getRow (TNL::Matrices::TridiagonalMatrixView::getRow), as the following example demonstrates:

1#include <iostream>
2#include <TNL/Algorithms/parallelFor.h>
3#include <TNL/Matrices/TridiagonalMatrix.h>
4#include <TNL/Devices/Host.h>
5#include <TNL/Devices/Cuda.h>
6
7template< typename Device >
8void
9getRowExample()
10{
11 /***
12 * Set the following matrix (dots represent zero matrix elements and zeros are
13 * padding zeros for memory alignment):
14 *
15 * 0 / 2 -1 . . . \ -> { 0, 2, -1 }
16 * | -1 2 -1 . . | -> { -1, 2, -1 }
17 * | . -1 2 -1 . | -> { -1, 2, -1 }
18 * | . . -1 2 -1 | -> { -1, 2, -1 }
19 * \ . . . -1 2 / 0 -> { -1, 2, 0 }
20 *
21 */
22
23 const int matrixSize( 5 );
24 using MatrixType = TNL::Matrices::TridiagonalMatrix< double, Device >;
25 MatrixType matrix( matrixSize, // number of matrix rows
26 matrixSize // number of matrix columns
27 );
28 auto view = matrix.getView();
29
30 auto f = [ = ] __cuda_callable__( int rowIdx ) mutable
31 {
32 auto row = view.getRow( rowIdx );
33
34 if( rowIdx > 0 )
35 row.setElement( 0, -1.0 ); // elements below the diagonal
36 row.setElement( 1, 2.0 ); // elements on the diagonal
37 if( rowIdx < matrixSize - 1 ) // elements above the diagonal
38 row.setElement( 2, -1.0 );
39 };
40
41 /***
42 * Set the matrix elements.
43 */
44 TNL::Algorithms::parallelFor< Device >( 0, view.getRows(), f );
45 std::cout << std::endl << matrix << std::endl;
46}
47
48int
49main( int argc, char* argv[] )
50{
51 std::cout << "Getting matrix rows on host: " << std::endl;
52 getRowExample< TNL::Devices::Host >();
53
54#ifdef __CUDACC__
55 std::cout << "Getting matrix rows on CUDA device: " << std::endl;
56 getRowExample< TNL::Devices::Cuda >();
57#endif
58}

We create a matrix with the same size (lines 23-27). Next, we fetch the tridiagonal matrix view (TNL::Matrices::TridiagonalMatrixView, line 28) which we use in the lambda function for modification of the matrix elements (lines 30-39). Inside the lambda function, we first get a matrix row by calling the method getRow (TNL::Matrices::TridiagonalMatrixView::getRow, line 32), through which we can access the matrix elements (lines 34-38). We would like to stress that the method setElement addresses the matrix elements with the localIdx parameter, which is the rank of the nonzero element in the matrix row - see Indexing of nonzero matrix elements in sparse matrices. The lambda function is called by parallelFor (line 44).

The result looks as follows:

Getting matrix rows on host:
Row: 0 -> 0:2 1:-1
Row: 1 -> 0:-1 1:2 2:-1
Row: 2 -> 1:-1 2:2 3:-1
Row: 3 -> 2:-1 3:2 4:-1
Row: 4 -> 3:-1 4:2
Getting matrix rows on CUDA device:
Row: 0 -> 0:2 1:-1
Row: 1 -> 0:-1 1:2 2:-1
Row: 2 -> 1:-1 2:2 3:-1
Row: 3 -> 2:-1 3:2 4:-1
Row: 4 -> 3:-1 4:2

Method forRows

As in the case of other matrix types, the method forRows (TNL::Matrices::TridiagonalMatrix::forRows) calls the method getRow (TNL::Matrices::TridiagonalMatrix::getRow) in parallel. It is demonstrated by the following example which we may directly compare with the previous one:

1#include <iostream>
2#include <TNL/Matrices/TridiagonalMatrix.h>
3#include <TNL/Devices/Host.h>
4#include <TNL/Devices/Cuda.h>
5
6template< typename Device >
7void
8forRowsExample()
9{
10 using MatrixType = TNL::Matrices::TridiagonalMatrix< double, Device >;
11 /***
12 * Set the following matrix (dots represent zero matrix elements and zeros are
13 * padding zeros for memory alignment):
14 *
15 * 0 / 2 -1 . . . \ -> { 0, 2, -1 }
16 * | -1 2 -1 . . | -> { -1, 2, -1 }
17 * | . -1 2 -1 . | -> { -1, 2, -1 }
18 * | . . -1 2 -1 | -> { -1, 2, -1 }
19 * \ . . . -1 2 / 0 -> { -1, 2, 0 }
20 *
21 * The diagonals offsets are { -1, 0, 1 }.
22 */
23 const int size = 5;
24 MatrixType matrix( size, size );
25 auto view = matrix.getView();
26
27 auto f = [] __cuda_callable__( typename MatrixType::RowView & row )
28 {
29 const int& rowIdx = row.getRowIndex();
30 if( rowIdx > 0 )
31 row.setElement( 0, -1.0 ); // elements below the diagonal
32 row.setElement( 1, 2.0 ); // elements on the diagonal
33 if( rowIdx < size - 1 ) // elements above the diagonal
34 row.setElement( 2, -1.0 );
35 };
36 view.forAllRows( f ); // or matrix.forAllRows
37 std::cout << matrix << std::endl;
38}
39
40int
41main( int argc, char* argv[] )
42{
43 std::cout << "Creating matrix on host: " << std::endl;
44 forRowsExample< TNL::Devices::Host >();
45
46#ifdef __CUDACC__
47 std::cout << "Creating matrix on CUDA device: " << std::endl;
48 forRowsExample< TNL::Devices::Cuda >();
49#endif
50}

The differences are:

  1. We do not necessarily need the matrix view: the method forAllRows can be called directly on the matrix (see the comment on line 36).
  2. We call the method forAllRows (TNL::Matrices::TridiagonalMatrix::forAllRows) (line 36) instead of parallelFor, which is simpler since we do not have to state the device type explicitly. The method forAllRows calls the method forRows for all matrix rows, so we do not have to specify the interval of matrix rows explicitly either.
  3. The lambda function f (lines 27-35) accepts one parameter row of the type RowView (TNL::Matrices::TridiagonalMatrix::RowView, which is TNL::Matrices::TridiagonalMatrixRowView) instead of the index of the matrix row. Therefore we do not need to call the method getRow (TNL::Matrices::TridiagonalMatrix::getRow). On the other hand, we need the method getRowIndex (TNL::Matrices::TridiagonalMatrixRowView::getRowIndex) to get the index of the matrix row (line 29).


The result looks as follows:

Creating matrix on host:
Row: 0 -> 0:2 1:-1
Row: 1 -> 0:-1 1:2 2:-1
Row: 2 -> 1:-1 2:2 3:-1
Row: 3 -> 2:-1 3:2 4:-1
Row: 4 -> 3:-1 4:2
Creating matrix on CUDA device:
Row: 0 -> 0:2 1:-1
Row: 1 -> 0:-1 1:2 2:-1
Row: 2 -> 1:-1 2:2 3:-1
Row: 3 -> 2:-1 3:2 4:-1
Row: 4 -> 3:-1 4:2

Method forElements

Finally, an even simpler way of manipulating matrix elements, using the method forElements (TNL::Matrices::TridiagonalMatrix::forElements), is demonstrated in the following example:

1#include <iostream>
2#include <TNL/Matrices/TridiagonalMatrix.h>
3#include <TNL/Devices/Host.h>
4#include <TNL/Devices/Cuda.h>
5
6template< typename Device >
7void
8forElementsExample()
9{
10 /***
11 * Set the following matrix (dots represent zero matrix elements and zeros are
12 * padding zeros for memory alignment):
13 *
14 * 0 / 2 1 . . . \ -> { 0, 2, 1 }
15 * | 3 2 1 . . | -> { 3, 2, 1 }
16 * | . 3 2 1 . | -> { 3, 2, 1 }
17 * | . . 3 2 1 | -> { 3, 2, 1 }
18 * \ . . . 3 2 / 0 -> { 3, 2, 0 }
19 */
20 TNL::Matrices::TridiagonalMatrix< double, Device > matrix( 5, // number of matrix rows
21 5 ); // number of matrix columns
22 auto view = matrix.getView();
23
24 auto f = [] __cuda_callable__( int rowIdx, int localIdx, int columnIdx, double& value )
25 {
26 /***
27 * 'forElements' method iterates only over matrix elements lying on given subdiagonals
28 * and so we do not need to check anything. The element value can be expressed
29 * by the 'localIdx' variable, see the following figure:
30 *
31 * 0 1 2 <- localIdx values
32 * -------
33 * 0 / 2 1 . . . \ -> { 0, 2, 1 }
34 * | 3 2 1 . . | -> { 3, 2, 1 }
35 * | . 3 2 1 . | -> { 3, 2, 1 }
36 * | . . 3 2 1 | -> { 3, 2, 1 }
37 * \ . . . 3 2 / 0 -> { 3, 2, 0 }
38 *
39 */
40 value = 3 - localIdx;
41 };
42 view.forElements( 0, matrix.getRows(), f ); // or matrix.forElements
43 std::cout << matrix << std::endl;
44}
45
46int
47main( int argc, char* argv[] )
48{
49 std::cout << "Creating matrix on host: " << std::endl;
50 forElementsExample< TNL::Devices::Host >();
51
52#ifdef __CUDACC__
53 std::cout << "Creating matrix on CUDA device: " << std::endl;
54 forElementsExample< TNL::Devices::Cuda >();
55#endif
56}

On line 42, we call the method forElements (TNL::Matrices::TridiagonalMatrix::forElements) instead of parallelFor. This method iterates over all matrix rows and all nonzero matrix elements. The lambda function on line 24 therefore receives not only the matrix row index but also the local index of the matrix element (localIdx), which is the rank of the nonzero matrix element in the given row - see Indexing of nonzero matrix elements in sparse matrices. The next parameter received by the lambda function, columnIdx, is the column index of the matrix element. The fourth parameter, value, is a reference to the matrix element which we use for its modification.

The result looks as follows:

Creating matrix on host:
Row: 0 -> 0:2 1:1
Row: 1 -> 0:3 1:2 2:1
Row: 2 -> 1:3 2:2 3:1
Row: 3 -> 2:3 3:2 4:1
Row: 4 -> 3:3 4:2
Creating matrix on CUDA device:
Row: 0 -> 0:2 1:1
Row: 1 -> 0:3 1:2 2:1
Row: 2 -> 1:3 2:2 3:1
Row: 3 -> 2:3 3:2 4:1
Row: 4 -> 3:3 4:2

Multidiagonal matrices

Multidiagonal matrices are a generalization of tridiagonal ones. They are a special type of sparse matrix with a specific pattern of nonzero matrix elements, which are positioned only along lines parallel to the main diagonal. See the following example:

\[ \left( \begin{array}{cccccc} 4 & -1 & . & -1 & . & . \\ -1 & 4 & -1 & . & -1 & . \\ . & -1 & 4 & -1 & . & -1 \\ -1 & . & -1 & 4 & -1 & . \\ . & -1 & . & -1 & 4 & -1 \\ . & . & -1 & . & -1 & 4 \end{array} \right) \]

We can see that the matrix elements lie on lines parallel to the main diagonal. Such lines can be characterized by their offsets from the main diagonal. In the following figure, each such line is depicted in a different color:

\[ \begin{array}{ccc} \color{green}{-3} & . & \color{cyan}{-1} \\ \hline \color{green}{*} & . & \color{cyan}{*} \\ . & \color{green}{*} & . \\ . & . & \color{green}{*} \\ . & . & . \\ . & . & . \\ . & . & . \end{array} \left( \begin{array}{ccccccc} \color{blue}{0} & \color{magenta}{1} & . & \color{red}{3} & . & . \\ \hline \color{blue}{4} & \color{magenta}{-1} & . & \color{red}{-1} & . & . \\ \color{cyan}{-1} & \color{blue}{4} & \color{magenta}{-1} & . & \color{red}{-1} & . \\ . & \color{cyan}{-1} & \color{blue}{4} & \color{magenta}{-1} & . & \color{red}{-1} \\ \color{green}{-1} & . & \color{cyan}{-1} & \color{blue}{4} & \color{magenta}{-1} & . \\ . & \color{green}{-1} & . & \color{cyan}{-1} & \color{blue}{4} & \color{magenta}{-1} \\ . & . & \color{green}{-1} & . & \color{cyan}{-1} & \color{blue}{4} \end{array} \right) \]

In this matrix, the offsets read as \(\{-3, -1, 0, +1, +3\}\). It also means that the column indexes in the \(i\)-th row are \(\{i-3, i-1, i, i+1, i+3\}\) (where we accept only nonnegative indexes smaller than the number of matrix columns). An advantage is that, similar to the tridiagonal matrix (TNL::Matrices::TridiagonalMatrix), we do not store the column indexes explicitly as is done in TNL::Matrices::SparseMatrix. This can significantly reduce the memory requirements, which also means better performance. See the following table for a comparison of the storage requirements of the multidiagonal matrix (TNL::Matrices::MultidiagonalMatrix) and the general sparse matrix (TNL::Matrices::SparseMatrix).

Real Index SparseMatrix MultidiagonalMatrix Ratio
float 32-bit int 8 bytes per element 4 bytes per element 50%
double 32-bit int 12 bytes per element 8 bytes per element 67%
float 64-bit int 12 bytes per element 4 bytes per element 33%
double 64-bit int 16 bytes per element 8 bytes per element 50%

For the sake of better memory alignment and faster access to the matrix elements, we store all subdiagonals in complete form, including the elements which lie outside the matrix, as depicted in the following figure where zeros stand for the artificial padding zero matrix elements:

\[ \begin{array}{cccc} 0 & & & 0 \\ & 0 & & \\ & & 0 & \\ & & & 0 \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \end{array} \left( \begin{array}{cccccccccccccccc} 1 & 0 & & & 0 & & & & & & & & & & & \\ 0 & 1 & 0 & & & 0 & & & & & & & & & & \\ & 0 & 1 & 0 & & & 0 & & & & & & & & & \\ & & 0 & 1 & 0 & & & 0 & & & & & & & & \\ 0 & & & 0 & 1 & 0 & & & 0 & & & & & & & \\ & -1 & & & -1 & 1 & -1 & & & -1 & & & & & & \\ & & -1 & & & -1 & 1 & -1 & & & -1 & & & & & \\ & & & 0 & & & 0 & 1 & 0 & & & 0 & & & & \\ & & & & 0 & & & 0 & 1 & 0 & & & 0 & & & \\ & & & & & -1 & & & -1 & 1 & -1 & & & -1 & & \\ & & & & & & -1 & & & -1 & 1 & -1 & & & -1 & \\ & & & & & & & 0 & & & 0 & 1 & 0 & & & 0 \\ & & & & & & & & 0 & & & 0 & 1 & 0 & & \\ & & & & & & & & & 0 & & & 0 & 1 & 0 & \\ & & & & & & & & & & 0 & & & 0 & 1 & 0 \\ & & & & & & & & & & & 0 & & & 0 & 1 \end{array} \right) \begin{array} & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ 0 & & & \\ & 0 & & \\ & & 0 & \\ 0 & & & 0 \end{array} \]

The multidiagonal matrix is a templated class defined in the namespace TNL::Matrices. It has six template parameters:

  • Real is a type of the matrix elements. It is double by default.
  • Device is a device where the matrix shall be allocated. Currently it can be either TNL::Devices::Host for CPU or TNL::Devices::Cuda for CUDA supporting GPUs. It is TNL::Devices::Host by default.
  • Index is a type to be used for indexing of the matrix elements. It is int by default.
  • ElementsOrganization defines the organization of the matrix elements in memory. It can be TNL::Algorithms::Segments::ColumnMajorOrder or TNL::Algorithms::Segments::RowMajorOrder for column-major and row-major organization respectively. By default, it is the row-major order if the matrix is allocated on the host system and the column-major order if it is allocated on GPU.
  • RealAllocator is a memory allocator (one from TNL::Allocators) which shall be used for allocation of the matrix elements. By default, it is the default allocator for given Real type and Device type – see TNL::Allocators::Default.
  • IndexAllocator is a memory allocator (one from TNL::Allocators) which shall be used for allocation of the matrix elements offsets. By default, it is the default allocator for given Index type and Device type – see TNL::Allocators::Default.

In the following text we show different methods of setting up multidiagonal matrices.

Initializer list

Smaller multidiagonal matrices can be created using the constructor with initializer lists (std::initializer_list) as we demonstrate in the following example:

1#include <iostream>
2#include <TNL/Matrices/MultidiagonalMatrix.h>
3#include <TNL/Devices/Host.h>
4#include <TNL/Devices/Cuda.h>
5
6template< typename Device >
7void
8createMultidiagonalMatrix()
9{
10 const int matrixSize = 6;
11
12 /*
13 * Setup the following matrix (dots represent zeros):
14 *
15 * / 4 -1 . -1 . . \
16 * | -1 4 -1 . -1 . |
17 * | . -1 4 -1 . -1 |
18 * | -1 . -1 4 -1 . |
19 * | . -1 . -1 4 -1 |
20 * \ . . -1 . -1 4 /
21 *
22 * The diagonals offsets are { -3, -1, 0, 1, 3 }.
23 */
24 TNL::Matrices::MultidiagonalMatrix< double, Device > matrix(
25 matrixSize,
26 { -3, -1, 0, 1, 3 },
27 {
28 /*
29 * To set the matrix elements we first extend the diagonals to their full
30 * lengths even outside the matrix (dots represent zeros and zeros are
31 * artificial zeros used for memory alignment):
32 *
33 * 0 . 0 / 4 -1 . -1 . . \ -> { 0, 0, 4, -1, -1 }
34 * . 0 . | -1 4 -1 . -1 . | . -> { 0, -1, 4, -1, -1 }
35 * . . 0 | . -1 4 -1 . -1 | . . -> { 0, -1, 4, -1, -1 }
36 * . . | -1 . -1 4 -1 . | 0 . . -> { -1, -1, 4, -1, 0 }
37 * . | . -1 . -1 4 -1 | . 0 . . -> { -1, -1, 4, -1, 0 }
38 * \ . . -1 . -1 4 / 0 . 0 . . -> { -1, -1, 4, 0, 0 }
39 */
40 // clang-format off
41 { 0, 0, 4, -1, -1 },
42 { 0, -1, 4, -1, -1 },
43 { 0, -1, 4, -1, -1 },
44 { -1, -1, 4, -1, 0 },
45 { -1, -1, 4, -1, 0 },
46 { -1, -1, 4, 0, 0 }
47 // clang-format on
48 } );
49 std::cout << "The matrix reads as: " << std::endl << matrix << std::endl;
50}
51
52int
53main( int argc, char* argv[] )
54{
55 std::cout << "Create multidiagonal matrix on CPU ... " << std::endl;
56 createMultidiagonalMatrix< TNL::Devices::Host >();
57
58#ifdef __CUDACC__
59 std::cout << "Creating multidiagonal matrix on CUDA GPU ... " << std::endl;
60 createMultidiagonalMatrix< TNL::Devices::Cuda >();
61#endif
62}

Here, we create a matrix which looks as

\[ \left( \begin{array}{cccccc} 4 & -1 & & -1 & & \\ -1 & 4 & -1 & & -1 & \\ & -1 & 4 & -1 & & -1 \\ -1 & & -1 & 4 & -1 & \\ & -1 & & -1 & 4 & -1 \\ & & -1 & & -1 & 4 \\ \end{array} \right). \]

On lines 24-48, we call the constructor which, in addition to the matrix dimensions and the subdiagonal offsets, accepts also an initializer list of initializer lists with the matrix element values. Each embedded list corresponds to one matrix row and contains the values of the matrix elements on particular subdiagonals, including those which lie outside the matrix. The result looks as follows:

Create multidiagonal matrix on CPU ...
The matrix reads as:
Row: 0 -> 0:4 1:-1 3:-1
Row: 1 -> 0:-1 1:4 2:-1 4:-1
Row: 2 -> 1:-1 2:4 3:-1 5:-1
Row: 3 -> 0:-1 2:-1 3:4 4:-1
Row: 4 -> 1:-1 3:-1 4:4 5:-1
Row: 5 -> 2:-1 4:-1 5:4
Creating multidiagonal matrix on CUDA GPU ...
The matrix reads as:
Row: 0 -> 0:4 1:-1 3:-1
Row: 1 -> 0:-1 1:4 2:-1 4:-1
Row: 2 -> 1:-1 2:4 3:-1 5:-1
Row: 3 -> 0:-1 2:-1 3:4 4:-1
Row: 4 -> 1:-1 3:-1 4:4 5:-1
Row: 5 -> 2:-1 4:-1 5:4

Methods setElement and addElement

Another, more efficient way of setting the matrix elements is by means of the method setElement (TNL::Matrices::MultidiagonalMatrix::setElement). It is demonstrated in the following example:

1#include <iostream>
2#include <TNL/Algorithms/parallelFor.h>
3#include <TNL/Matrices/MultidiagonalMatrix.h>
4#include <TNL/Devices/Host.h>
5#include <TNL/Devices/Cuda.h>
6#include <TNL/Pointers/SharedPointer.h>
7#include <TNL/Pointers/SmartPointersRegister.h>
8
9template< typename Device >
10void
11setElements()
12{
13 const int matrixSize( 5 );
14 auto diagonalsOffsets = { -1, 0, 1 }; // offsets of tridiagonal matrix
15 using Matrix = TNL::Matrices::MultidiagonalMatrix< double, Device >;
16 Matrix matrix( matrixSize, matrixSize, diagonalsOffsets );
17 auto view = matrix.getView();
18
19 for( int i = 0; i < 5; i++ )
20 view.setElement( i, i, i );
21
22 std::cout << "Matrix set from the host:" << std::endl;
23 std::cout << matrix << std::endl;
24
25 auto f = [ = ] __cuda_callable__( int i ) mutable
26 {
27 if( i > 0 )
28 view.setElement( i, i - 1, 1.0 );
29 view.setElement( i, i, -i );
30 if( i < matrixSize - 1 )
31 view.setElement( i, i + 1, 1.0 );
32 };
33
34 TNL::Algorithms::parallelFor< Device >( 0, matrixSize, f );
35
36 std::cout << "Matrix set from its native device:" << std::endl;
37 std::cout << matrix << std::endl;
38}
39
40int
41main( int argc, char* argv[] )
42{
43 std::cout << "Set elements on host:" << std::endl;
44 setElements< TNL::Devices::Host >();
45
46#ifdef __CUDACC__
47 std::cout << "Set elements on CUDA device:" << std::endl;
48 setElements< TNL::Devices::Cuda >();
49#endif
50}

This example shows that the method setElement can be used both on the host (CPU) (lines 19-20) and in lambda functions that can be called from GPU kernels (lines 25-32). For this purpose, we fetch a matrix view on line 17. The result looks as follows:

Set elements on host:
Matrix set from the host:
Row: 0 ->
Row: 1 -> 1:1
Row: 2 -> 2:2
Row: 3 -> 3:3
Row: 4 -> 4:4
Matrix set from its native device:
Row: 0 -> 1:1
Row: 1 -> 0:1 1:-1 2:1
Row: 2 -> 1:1 2:-2 3:1
Row: 3 -> 2:1 3:-3 4:1
Row: 4 -> 3:1 4:-4
Set elements on CUDA device:
Matrix set from the host:
Row: 0 ->
Row: 1 -> 1:1
Row: 2 -> 2:2
Row: 3 -> 3:3
Row: 4 -> 4:4
Matrix set from its native device:
Row: 0 -> 1:1
Row: 1 -> 0:1 1:-1 2:1
Row: 2 -> 1:1 2:-2 3:1
Row: 3 -> 2:1 3:-3 4:1
Row: 4 -> 3:1 4:-4

Method getRow

A slightly more efficient way of setting up a multidiagonal matrix is offered by the method getRow (TNL::Matrices::MultidiagonalMatrix::getRow). We will use it to create a matrix of the following form:

\[ \left( \begin{array}{cccccccccccccccc} 1 & . & & & . & & & & & & & & & & & \\ . & 1 & . & & & . & & & & & & & & & & \\ & . & 1 & . & & & . & & & & & & & & & \\ & & . & 1 & . & & & . & & & & & & & & \\ . & & & . & 1 & . & & & . & & & & & & & \\ & -1 & & & -1 & 4 & -1 & & & -1 & & & & & & \\ & & -1 & & & -1 & 4 & -1 & & & -1 & & & & & \\ & & & . & & & . & 1 & . & & & . & & & & \\ & & & & . & & & . & 1 & . & & & . & & & \\ & & & & & -1 & & & -1 & 4 & -1 & & & -1 & & \\ & & & & & & -1 & & & -1 & 4 & -1 & & & -1 & \\ & & & & & & & . & & & . & 1 & . & & & . \\ & & & & & & & & . & & & . & 1 & . & & \\ & & & & & & & & & . & & & . & 1 & . & \\ & & & & & & & & & & . & & & . & 1 & . \\ & & & & & & & & & & & . & & & . & 1 \end{array} \right) \]

Matrices of this form arise from the discretization of the Laplace operator in 2D by the finite difference method. We use this example because it is a common numerical problem and the multidiagonal format is very well suited for such matrices. Readers unfamiliar with the finite difference method need not worry; we will simply create the matrix shown above. The code based on the getRow method reads as follows:

1#include <iostream>
2#include <TNL/Algorithms/parallelFor.h>
3#include <TNL/Containers/StaticArray.h>
4#include <TNL/Matrices/MultidiagonalMatrix.h>
5#include <TNL/Devices/Host.h>
6#include <TNL/Devices/Cuda.h>
7
8template< typename Device >
9void
10laplaceOperatorMatrix()
11{
12 /***
13 * Set matrix representing approximation of the Laplace operator on regular
14 * grid using the finite difference method.
15 */
16 const int gridSize( 4 );
17 const int matrixSize = gridSize * gridSize;
18 TNL::Containers::Vector< int, Device > offsets{ -gridSize, -1, 0, 1, gridSize };
19 TNL::Matrices::MultidiagonalMatrix< double, Device > matrix( matrixSize, matrixSize, offsets );
20 auto matrixView = matrix.getView();
21 auto f = [ = ] __cuda_callable__( const TNL::Containers::StaticArray< 2, int >& i ) mutable
22 {
23 const int elementIdx = i[ 1 ] * gridSize + i[ 0 ];
24 auto row = matrixView.getRow( elementIdx );
25 if( i[ 0 ] == 0 || i[ 1 ] == 0 || i[ 0 ] == gridSize - 1 || i[ 1 ] == gridSize - 1 )
26 row.setElement( 2, 1.0 ); // set matrix elements corresponding to boundary grid nodes
27 // and Dirichlet boundary conditions, i.e. 1 on the main diagonal
28 // which is the third one
29 else {
30 row.setElement( 0, -1.0 ); // set matrix elements corresponding to inner grid nodes, i.e.
31 row.setElement( 1, -1.0 ); // 4 on the main diagonal (the third one) and -1 to the other
32 row.setElement( 2, 4.0 ); // sub-diagonals
33 row.setElement( 3, -1.0 );
34 row.setElement( 4, -1.0 );
35 }
36 };
37 TNL::Containers::StaticArray< 2, int > begin = { 0, 0 };
38 TNL::Containers::StaticArray< 2, int > end = { gridSize, gridSize };
39 TNL::Algorithms::parallelFor< Device >( begin, end, f );
40
41 std::cout << "Laplace operator matrix: " << std::endl << matrix << std::endl;
42}
43
44int
45main( int argc, char* argv[] )
46{
47 std::cout << "Creating Laplace operator matrix on CPU ... " << std::endl;
48 laplaceOperatorMatrix< TNL::Devices::Host >();
49
50#ifdef __CUDACC__
51 std::cout << "Creating Laplace operator matrix on CUDA GPU ... " << std::endl;
52 laplaceOperatorMatrix< TNL::Devices::Cuda >();
53#endif
54}

We first compute the matrix size (matrixSize) from the dimensions of the numerical grid (line 17). The subdiagonal offsets are determined by the size of the numerical grid; since it equals four in this example, the offsets read as \(\left\{-4,-1,0,1,4 \right\} \) or { -gridSize, -1, 0, 1, gridSize } (line 18). Here we store the offsets in a vector (TNL::Containers::Vector) called offsets. Next we use a constructor taking the matrix dimensions and the offsets passed via a TNL vector (line 19) and we fetch a matrix view (TNL::Matrices::MultidiagonalMatrixView, line 20).

The matrix is constructed by iterating over the nodes of the numerical grid. Each node corresponds to one matrix row. The lambda function f (lines 21-36) takes one parameter i (line 21) holding the two coordinates of a node of the two-dimensional numerical grid. Based on these coordinates, we compute the index (elementIdx) of the corresponding matrix row (line 23). We fetch the matrix row (row) by calling the getRow method (TNL::Matrices::MultidiagonalMatrix::getRow, line 24). Depending on the grid node coordinates, we either set the boundary conditions (lines 25-28) for the boundary nodes (those lying on the boundary of the grid, whose coordinates fulfil the condition i[ 0 ] == 0 || i[ 1 ] == 0 || i[ 0 ] == gridSize - 1 || i[ 1 ] == gridSize - 1), for which we set only the diagonal element to 1, or we handle the inner nodes of the numerical grid (lines 29-35), where we set the coefficients approximating the Laplace operator. We use the method setElement of the matrix row (TNL::Matrices::MultidiagonalMatrixRowView::setElement), which takes the local index of the nonzero matrix element as the first parameter (see Indexing of nonzero matrix elements in sparse matrices) and the new value of the element as the second parameter. The local indexes, in fact, refer to particular subdiagonals as depicted in the following figure (in blue):

\[ \begin{array}{cccc} \color{blue}{-4} & & & \color{blue}{-1} \\ \hline . & & & . \\ & . & & \\ & & . & \\ & & & . \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \end{array} \left( \begin{array}{cccccccccccccccc} \color{blue}{0} & \color{blue}{1} & & & \color{blue}{4} & & & & & & & & & & & \\ \hline 1 & . & & & . & & & & & & & & & & & \\ . & 1 & . & & & . & & & & & & & & & & \\ & . & 1 & . & & & . & & & & & & & & & \\ & & . & 1 & . & & & . & & & & & & & & \\ . & & & . & 1 & . & & & . & & & & & & & \\ & -1 & & & -1 & 4 & -1 & & & -1 & & & & & & \\ & & -1 & & & -1 & 4 & -1 & & & -1 & & & & & \\ & & & . & & & . & 1 & . & & & . & & & & \\ & & & & . & & & . & 1 & . & & & . & & & \\ & & & & & -1 & & & -1 & 4 & -1 & & & -1 & & \\ & & & & & & -1 & & & -1 & 4 & -1 & & & -1 & \\ & & & & & & & . & & & . & 1 & . & & & . \\ & & & & & & & & . & & & . & 1 & . & & \\ & & & & & & & & & . & & & . & 1 & . & \\ & & & & & & & & & & . & & & . & 1 & . \\ & & & & & & & & & & & . & & & . & 1 \end{array} \right) \]

We use parallelFor to iterate over all nodes of the numerical grid (lines 37-39) and apply the lambda function. The result looks as follows:

Creating Laplace operator matrix on CPU ...
Laplace operator matrix:
Row: 0 -> 0:1
Row: 1 -> 1:1
Row: 2 -> 2:1
Row: 3 -> 3:1
Row: 4 -> 4:1
Row: 5 -> 1:-1 4:-1 5:4 6:-1 9:-1
Row: 6 -> 2:-1 5:-1 6:4 7:-1 10:-1
Row: 7 -> 7:1
Row: 8 -> 8:1
Row: 9 -> 5:-1 8:-1 9:4 10:-1 13:-1
Row: 10 -> 6:-1 9:-1 10:4 11:-1 14:-1
Row: 11 -> 11:1
Row: 12 -> 12:1
Row: 13 -> 13:1
Row: 14 -> 14:1
Row: 15 -> 15:1
Creating Laplace operator matrix on CUDA GPU ...
Laplace operator matrix:
Row: 0 -> 0:1
Row: 1 -> 1:1
Row: 2 -> 2:1
Row: 3 -> 3:1
Row: 4 -> 4:1
Row: 5 -> 1:-1 4:-1 5:4 6:-1 9:-1
Row: 6 -> 2:-1 5:-1 6:4 7:-1 10:-1
Row: 7 -> 7:1
Row: 8 -> 8:1
Row: 9 -> 5:-1 8:-1 9:4 10:-1 13:-1
Row: 10 -> 6:-1 9:-1 10:4 11:-1 14:-1
Row: 11 -> 11:1
Row: 12 -> 12:1
Row: 13 -> 13:1
Row: 14 -> 14:1
Row: 15 -> 15:1

Method forRows

As in the case of other matrix types, the method forRows (TNL::Matrices::MultidiagonalMatrix::forRows) calls the method getRow (TNL::Matrices::MultidiagonalMatrix::getRow) in parallel. It is demonstrated by the following example:

1#include <iostream>
2#include <TNL/Matrices/MultidiagonalMatrix.h>
3#include <TNL/Devices/Host.h>
4#include <TNL/Devices/Cuda.h>
5
6template< typename Device >
7void
8forRowsExample()
9{
10 using MatrixType = TNL::Matrices::MultidiagonalMatrix< double, Device >;
11 /***
12 * Set the following matrix (dots represent zero matrix elements and zeros are
13 * padding zeros for memory alignment):
14 *
15 * 0 / 2 . . . . \ -> { 0, 0, 1 }
16 * | -1 2 -1 . . | -> { 0, 2, 1 }
17 * | . -1 2 -1. . | -> { 3, 2, 1 }
18 * | . . -1 2 -1 | -> { 3, 2, 1 }
19 * \ . . . . 2 / -> { 3, 2, 1 }
20 *
21 * The diagonals offsets are { -1, 0, 1 }.
22 */
23 const int size = 5;
24 MatrixType matrix( size, // number of matrix rows
25 size, // number of matrix columns
26 { -1, 0, 1 } ); // matrix diagonals offsets
27 auto view = matrix.getView();
28
29 auto f = [] __cuda_callable__( typename MatrixType::RowView & row )
30 {
31 const int& rowIdx = row.getRowIndex();
32 if( rowIdx > 0 )
33 row.setElement( 0, -1.0 ); // elements below the diagonal
34 row.setElement( 1, 2.0 ); // elements on the diagonal
35 if( rowIdx < size - 1 ) // elements above the diagonal
36 row.setElement( 2, -1.0 );
37 };
38 view.forAllRows( f ); // or matrix.forAllRows
39 std::cout << matrix << std::endl;
40}
41
42int
43main( int argc, char* argv[] )
44{
45 std::cout << "Creating matrix on host: " << std::endl;
46 forRowsExample< TNL::Devices::Host >();
47
48#ifdef __CUDACC__
49 std::cout << "Creating matrix on CUDA device: " << std::endl;
50 forRowsExample< TNL::Devices::Cuda >();
51#endif
52}

We call the method forAllRows (TNL::Matrices::MultidiagonalMatrix::forAllRows) (line 38) instead of parallelFor, which is simpler since we do not have to state the device type explicitly. The method forAllRows calls the method forRows for all matrix rows, so we do not have to state the interval of matrix rows explicitly either. The lambda function f (lines 29-37) accepts one parameter row of the type RowView (TNL::Matrices::MultidiagonalMatrix::RowView, which is TNL::Matrices::MultidiagonalMatrixRowView). At the beginning of the lambda function, we call the method getRowIndex (TNL::Matrices::MultidiagonalMatrixRowView::getRowIndex) to get the index of the matrix row (line 31).


The result looks as follows:

Creating matrix on host:
Row: 0 -> 0:2 1:-1
Row: 1 -> 0:-1 1:2 2:-1
Row: 2 -> 1:-1 2:2 3:-1
Row: 3 -> 2:-1 3:2 4:-1
Row: 4 -> 3:-1 4:2
Creating matrix on CUDA device:
Row: 0 -> 0:2 1:-1
Row: 1 -> 0:-1 1:2 2:-1
Row: 2 -> 1:-1 2:2 3:-1
Row: 3 -> 2:-1 3:2 4:-1
Row: 4 -> 3:-1 4:2

Method forElements

A similar and even slightly simpler way of setting the matrix elements is offered by the method forElements (TNL::Matrices::MultidiagonalMatrix::forElements, TNL::Matrices::MultidiagonalMatrixView::forElements), as demonstrated in the following example:

1#include <iostream>
2#include <TNL/Matrices/MultidiagonalMatrix.h>
3#include <TNL/Devices/Host.h>
4#include <TNL/Devices/Cuda.h>
5
6template< typename Device >
7void
8forElementsExample()
9{
10 /***
11 * Set the following matrix (dots represent zero matrix elements and zeros are
12 * padding zeros for memory alignment):
13 *
14 * 0 0 / 1 . . . . \ -> { 0, 0, 1 }
15 * 0 | 2 1 . . . | -> { 0, 2, 1 }
16 * | 3 2 1 . . | -> { 3, 2, 1 }
17 * | . 3 2 1 . | -> { 3, 2, 1 }
18 * \ . . 3 2 1 / -> { 3, 2, 1 }
19 *
20 * The diagonals offsets are { -2, -1, 0 }.
21 */
22 TNL::Matrices::MultidiagonalMatrix< double, Device > matrix( 5, // number of matrix rows
23 5, // number of matrix columns
24 { -2, -1, 0 } ); // matrix diagonals offsets
25 auto view = matrix.getView();
26
27 auto f = [] __cuda_callable__( int rowIdx, int localIdx, int columnIdx, double& value )
28 {
29 /***
30 * 'forElements' method iterates only over matrix elements lying on given subdiagonals
31 * and so we do not need to check anything. The element value can be expressed
32 * by the 'localIdx' variable, see the following figure:
33 *
34 * 0 1 2 <- localIdx values
35 * -------
36 * 0 0 / 1 . . . . \ -> { 0, 0, 1 }
37 * 0 | 2 1 . . . | -> { 0, 2, 1 }
38 * | 3 2 1 . . | -> { 3, 2, 1 }
39 * | . 3 2 1 . | -> { 3, 2, 1 }
40 * \ . . 3 2 1 / -> { 3, 2, 1 }
41 *
42 */
43 value = 3 - localIdx;
44 };
45 view.forElements( 0, matrix.getRows(), f ); // or matrix.forElements
46 std::cout << matrix << std::endl;
47}
48
49int
50main( int argc, char* argv[] )
51{
52 std::cout << "Creating matrix on host: " << std::endl;
53 forElementsExample< TNL::Devices::Host >();
54
55#ifdef __CUDACC__
56 std::cout << "Creating matrix on CUDA device: " << std::endl;
57 forElementsExample< TNL::Devices::Cuda >();
58#endif
59}

In this case, we need to provide a lambda function f (lines 27-44), which is called for each matrix element by the method forElements (line 45). The lambda function f receives the following parameters:

  • rowIdx is an index of the matrix row.
  • localIdx is an index of the matrix subdiagonal.
  • columnIdx is a column index of the matrix element.
  • value is a reference to the matrix element value. It can be used even for changing the value.

In this example, the matrix element value depends only on the subdiagonal index localIdx (see Indexing of nonzero matrix elements in sparse matrices), as we can see on line 43. The result looks as follows:

Creating matrix on host:
Row: 0 -> 0:1
Row: 1 -> 0:2 1:1
Row: 2 -> 0:3 1:2 2:1
Row: 3 -> 1:3 2:2 3:1
Row: 4 -> 2:3 3:2 4:1
Creating matrix on CUDA device:
Row: 0 -> 0:1
Row: 1 -> 0:2 1:1
Row: 2 -> 0:3 1:2 2:1
Row: 3 -> 1:3 2:2 3:1
Row: 4 -> 2:3 3:2 4:1

Lambda matrices

Lambda matrix (TNL::Matrices::LambdaMatrix) is a special type of matrix which could also be called a matrix-free matrix. The matrix elements are not stored in memory explicitly, but they are evaluated on-the-fly by means of user-defined lambda functions. If the matrix elements can be expressed by a computationally inexpensive formula, we can significantly reduce the memory consumption, which is appreciated especially on GPUs. Since memory accesses are quite expensive on both CPU and GPU, this can, in the end, even lead to much faster code.

The lambda matrix (TNL::Matrices::LambdaMatrix) is a templated class with the following template parameters:

  • MatrixElementsLambda is a lambda function which evaluates the values and column indexes of the matrix elements.
  • CompressedRowLengthsLambda is a lambda function which returns the number of nonzero elements in a given matrix row.
  • Real is the type of the matrix element values.
  • Device is the device on which the lambda functions mentioned above will be evaluated.
  • Index is the type to be used for indexing.

The lambda function MatrixElementsLambda is supposed to have the following declaration:

1auto matrixElements =
2 [ = ] __cuda_callable__( Index rows, Index columns, Index rowIdx, Index localIdx, Index& columnIdx, Real& value );

where the particular parameters have the following meaning:

  • rows tells the number of matrix rows.
  • columns tells the number of matrix columns.
  • rowIdx is the index of the matrix row in which we are supposed to evaluate the matrix element.
  • localIdx is the rank of the nonzero matrix element, see Indexing of nonzero matrix elements in sparse matrices.
  • columnIdx is a reference to a variable where we are supposed to store the matrix element column index.
  • value is a reference to a variable where we are supposed to store the matrix element value.

The lambda function CompressedRowLengthsLambda (by compressed row length we mean the number of matrix elements in a row after ignoring/compressing the zero elements) is supposed to be declared like this:

1auto compressedRowLengths = [=] __cuda_callable__( Index rows, Index columns, Index rowIdx ) -> Index;

where the parameters can be described as follows:

  • rows tells the number of matrix rows.
  • columns tells the number of matrix columns.
  • rowIdx is the index of the matrix row for which we are supposed to evaluate the number of nonzero matrix elements.

The lambda function is supposed to return just the number of nonzero matrix elements in the given matrix row.

Lambda matrix initiation

How to put the lambda functions together with the lambda matrix is demonstrated in the following example:

1#include <iostream>
2#include <TNL/Matrices/LambdaMatrix.h>
3
4int
5main( int argc, char* argv[] )
6{
7 /***
8 * Lambda functions defining the matrix.
9 */
10 auto compressedRowLengths = [ = ] __cuda_callable__( const int rows, const int columns, const int rowIdx ) -> int
11 {
12 return 1;
13 };
14 auto matrixElements1 =
15 [ = ] __cuda_callable__(
16 const int rows, const int columns, const int rowIdx, const int localIdx, int& columnIdx, double& value )
17 {
18 columnIdx = rowIdx;
19 value = 1.0;
20 };
21 auto matrixElements2 =
22 [ = ] __cuda_callable__(
23 const int rows, const int columns, const int rowIdx, const int localIdx, int& columnIdx, double& value )
24 {
25 columnIdx = rowIdx;
26 value = rowIdx;
27 };
28
29 const int size = 5;
30
31 /***
32 * Matrix construction with explicit type definition.
33 */
34 using MatrixType = decltype( TNL::Matrices::LambdaMatrixFactory< double, TNL::Devices::Host, int >::create(
35 matrixElements1, compressedRowLengths ) );
36 MatrixType m1( size, size, matrixElements1, compressedRowLengths );
37
38 /***
39 * Matrix construction using 'auto'.
40 */
41 auto m2 =
42 TNL::Matrices::LambdaMatrixFactory< double, TNL::Devices::Host, int >::create( matrixElements2, compressedRowLengths );
43 m2.setDimensions( size, size );
44
45 std::cout << "The first lambda matrix: " << std::endl << m1 << std::endl;
46 std::cout << "The second lambda matrix: " << std::endl << m2 << std::endl;
47}

Here we create two simple diagonal matrices. Therefore they share the same lambda function compressedRowLengths, telling the number of nonzero matrix elements in particular matrix rows, which is always one (line 12). The first matrix, defined by the lambda function matrixElements1, is the identity matrix, so each of its diagonal elements equals one. We set the column index equal to the row index (line 18) and the matrix element value to 1.0 (line 19). The second matrix, defined by the lambda function matrixElements2, is also diagonal but not the identity matrix. The values of the diagonal elements equal the row index (line 26).

With the same lambda functions, we can define matrices with different dimensions. In this example, we set the matrix size to five (line 29). It could be quite difficult to express the lambda matrix type because it depends on the types of the lambda functions. To make this easier, one may use the lambda-matrix factory (TNL::Matrices::LambdaMatrixFactory). Using decltype, one can deduce even the matrix type (lines 34-35), followed by calling the lambda matrix constructor with the matrix dimensions and instances of the lambda functions (line 36). Or one can simply employ the keyword auto (lines 41-42), followed by setting the matrix dimensions (line 43).

The result looks as follows:

The first lambda matrix:
Row: 0 -> 0:1
Row: 1 -> 1:1
Row: 2 -> 2:1
Row: 3 -> 3:1
Row: 4 -> 4:1
The second lambda matrix:
Row: 0 ->
Row: 1 -> 1:1
Row: 2 -> 2:2
Row: 3 -> 3:3
Row: 4 -> 4:4

Method forRows

The method forRows (TNL::Matrices::LambdaMatrix::forRows, TNL::Matrices::LambdaMatrix::forAllRows) iterates in parallel over all matrix rows. In the case of lambda matrices, it cannot be used for changing the matrix elements, since they cannot be changed. In the following example, we show how to use this method to copy the matrix element values into a dense matrix:

1#include <iostream>
2#include <TNL/Matrices/LambdaMatrix.h>
3#include <TNL/Matrices/DenseMatrix.h>
4#include <TNL/Devices/Host.h>
5#include <TNL/Devices/Cuda.h>
6
7template< typename Device >
8void
9forRowsExample()
10{
11 /***
12 * Set matrix representing approximation of the Laplace operator on regular
13 * grid using the finite difference method.
14 */
15 const int gridSize( 4 );
16 const int matrixSize = gridSize * gridSize;
17 auto rowLengths = [ = ] __cuda_callable__( const int rows, const int columns, const int rowIdx ) -> int
18 {
19 const int gridRow = rowIdx / gridSize; // coordinates in the numerical grid
20 const int gridColumn = rowIdx % gridSize;
21 if( gridRow == 0 || gridRow == gridSize - 1 || // boundary grid node
22 gridColumn == 0 || gridColumn == gridSize - 1 )
23 return 1;
24 return 5;
25 };
26 auto matrixElements =
27 [ = ] __cuda_callable__(
28 const int rows, const int columns, const int rowIdx, const int localIdx, int& columnIdx, double& value )
29 {
30 const int gridRow = rowIdx / gridSize; // coordinates in the numerical grid
31 const int gridColumn = rowIdx % gridSize;
32 if( gridRow == 0 || gridRow == gridSize - 1 || // boundary grid node
33 gridColumn == 0 || gridColumn == gridSize - 1 )
34 {
35 columnIdx = rowIdx; // diagonal element ....
36 value = 1.0; // ... is set to 1
37 }
38 else // interior grid node
39 {
40 switch( localIdx ) // set diagonal element to 4
41 { // and the others to -1
42 case 0:
43 columnIdx = rowIdx - gridSize;
44 value = 1;
45 break;
46 case 1:
47 columnIdx = rowIdx - 1;
48 value = 1;
49 break;
50 case 2:
51 columnIdx = rowIdx;
52 value = -4;
53 break;
54 case 3:
55 columnIdx = rowIdx + 1;
56 value = 1;
57 break;
58 case 4:
59 columnIdx = rowIdx + gridSize;
60 value = 1;
61 break;
62 }
63 }
64 };
65 auto matrix =
66 TNL::Matrices::LambdaMatrixFactory< double, Device, int >::create( matrixSize, matrixSize, matrixElements, rowLengths );
67 using MatrixType = decltype( matrix );
68 using RowView = typename MatrixType::RowView;
69
70 TNL::Matrices::DenseMatrix< double, Device > denseMatrix( matrixSize, matrixSize );
71 denseMatrix.setValue( 0.0 );
72 auto dense_view = denseMatrix.getView();
73 auto f = [ = ] __cuda_callable__( const RowView& row ) mutable
74 {
75 auto dense_row = dense_view.getRow( row.getRowIndex() );
76 for( int localIdx = 0; localIdx < row.getSize(); localIdx++ )
77 dense_row.setValue( row.getColumnIndex( localIdx ), row.getValue( localIdx ) );
78 };
79 matrix.forAllRows( f );
80
81 std::cout << "Laplace operator lambda matrix: " << std::endl << matrix << std::endl;
82 std::cout << "Laplace operator dense matrix: " << std::endl << denseMatrix << std::endl;
83
84 /***
85 * Compute sum of elements in each row and store it into a vector.
86 */
87 TNL::Containers::Vector< double, Device > sum_vector( matrixSize );
88 auto sum_view = sum_vector.getView();
89 matrix.forAllRows(
90 [ = ] __cuda_callable__( const RowView& row ) mutable
91 {
92 double sum( 0.0 );
93 for( auto element : row )
94 sum += TNL::abs( element.value() );
95 sum_view[ row.getRowIndex() ] = sum;
96 } );
97
98 std::cout << "Sums in matrix rows = " << sum_vector << std::endl;
99}
100
101int
102main( int argc, char* argv[] )
103{
104 std::cout << "Running example on CPU ... " << std::endl;
105 forRowsExample< TNL::Devices::Host >();
106
107#ifdef __CUDACC__
108 std::cout << "Running example on CUDA GPU ... " << std::endl;
109 forRowsExample< TNL::Devices::Cuda >();
110#endif
111}

We start with the lambda functions (lines 17-64) defining the elements of the lambda matrix. Next, we create the lambda matrix matrix (lines 65-66) and the dense matrix denseMatrix (lines 70-71), together with the dense matrix view (line 72). The lambda function f (lines 73-78) serves for copying the matrix elements from the lambda matrix to the dense matrix. The copying of the matrix elements is started by calling the method forAllRows (TNL::Matrices::LambdaMatrix::forRows, TNL::Matrices::LambdaMatrix::forAllRows) (line 79).

Note, however, that the use of the forElements method (TNL::Matrices::LambdaMatrix::forElements) would be more convenient.

Next, we compute the sum of the absolute values of the matrix elements in each row and store it in a vector (lines 87-96). First, we create the vector sum_vector for storing the sums (line 87) and get a vector view sum_view to access the vector from a lambda function (line 88). On lines 89-96, we call a lambda function for each matrix row which iterates over all matrix elements and sums their absolute values. Finally, we store the result in the output vector (line 95).

The result looks as follows:

Running example on CPU ...
Laplace operator lambda matrix:
Row: 0 -> 0:1
Row: 1 -> 1:1
Row: 2 -> 2:1
Row: 3 -> 3:1
Row: 4 -> 4:1
Row: 5 -> 1:1 4:1 5:-4 6:1 9:1
Row: 6 -> 2:1 5:1 6:-4 7:1 10:1
Row: 7 -> 7:1
Row: 8 -> 8:1
Row: 9 -> 5:1 8:1 9:-4 10:1 13:1
Row: 10 -> 6:1 9:1 10:-4 11:1 14:1
Row: 11 -> 11:1
Row: 12 -> 12:1
Row: 13 -> 13:1
Row: 14 -> 14:1
Row: 15 -> 15:1
Laplace operator dense matrix:
Row: 0 -> 0:1 1:0 2:0 3:0 4:0 5:0 6:0 7:0 8:0 9:0 10:0 11:0 12:0 13:0 14:0 15:0
Row: 1 -> 0:0 1:1 2:0 3:0 4:0 5:0 6:0 7:0 8:0 9:0 10:0 11:0 12:0 13:0 14:0 15:0
Row: 2 -> 0:0 1:0 2:1 3:0 4:0 5:0 6:0 7:0 8:0 9:0 10:0 11:0 12:0 13:0 14:0 15:0
Row: 3 -> 0:0 1:0 2:0 3:1 4:0 5:0 6:0 7:0 8:0 9:0 10:0 11:0 12:0 13:0 14:0 15:0
Row: 4 -> 0:0 1:0 2:0 3:0 4:1 5:0 6:0 7:0 8:0 9:0 10:0 11:0 12:0 13:0 14:0 15:0
Row: 5 -> 0:0 1:1 2:0 3:0 4:1 5:-4 6:1 7:0 8:0 9:1 10:0 11:0 12:0 13:0 14:0 15:0
Row: 6 -> 0:0 1:0 2:1 3:0 4:0 5:1 6:-4 7:1 8:0 9:0 10:1 11:0 12:0 13:0 14:0 15:0
Row: 7 -> 0:0 1:0 2:0 3:0 4:0 5:0 6:0 7:1 8:0 9:0 10:0 11:0 12:0 13:0 14:0 15:0
Row: 8 -> 0:0 1:0 2:0 3:0 4:0 5:0 6:0 7:0 8:1 9:0 10:0 11:0 12:0 13:0 14:0 15:0
Row: 9 -> 0:0 1:0 2:0 3:0 4:0 5:1 6:0 7:0 8:1 9:-4 10:1 11:0 12:0 13:1 14:0 15:0
Row: 10 -> 0:0 1:0 2:0 3:0 4:0 5:0 6:1 7:0 8:0 9:1 10:-4 11:1 12:0 13:0 14:1 15:0
Row: 11 -> 0:0 1:0 2:0 3:0 4:0 5:0 6:0 7:0 8:0 9:0 10:0 11:1 12:0 13:0 14:0 15:0
Row: 12 -> 0:0 1:0 2:0 3:0 4:0 5:0 6:0 7:0 8:0 9:0 10:0 11:0 12:1 13:0 14:0 15:0
Row: 13 -> 0:0 1:0 2:0 3:0 4:0 5:0 6:0 7:0 8:0 9:0 10:0 11:0 12:0 13:1 14:0 15:0
Row: 14 -> 0:0 1:0 2:0 3:0 4:0 5:0 6:0 7:0 8:0 9:0 10:0 11:0 12:0 13:0 14:1 15:0
Row: 15 -> 0:0 1:0 2:0 3:0 4:0 5:0 6:0 7:0 8:0 9:0 10:0 11:0 12:0 13:0 14:0 15:1
Sums in matrix rows = [ 1, 1, 1, 1, 1, 8, 8, 1, 1, 8, 8, 1, 1, 1, 1, 1 ]
Running example on CUDA GPU ...
Laplace operator lambda matrix:
Row: 0 -> 0:1
Row: 1 -> 1:1
Row: 2 -> 2:1
Row: 3 -> 3:1
Row: 4 -> 4:1
Row: 5 -> 1:1 4:1 5:-4 6:1 9:1
Row: 6 -> 2:1 5:1 6:-4 7:1 10:1
Row: 7 -> 7:1
Row: 8 -> 8:1
Row: 9 -> 5:1 8:1 9:-4 10:1 13:1
Row: 10 -> 6:1 9:1 10:-4 11:1 14:1
Row: 11 -> 11:1
Row: 12 -> 12:1
Row: 13 -> 13:1
Row: 14 -> 14:1
Row: 15 -> 15:1
Laplace operator dense matrix:
Row: 0 -> 0:1 1:0 2:0 3:0 4:0 5:0 6:0 7:0 8:0 9:0 10:0 11:0 12:0 13:0 14:0 15:0
Row: 1 -> 0:0 1:1 2:0 3:0 4:0 5:0 6:0 7:0 8:0 9:0 10:0 11:0 12:0 13:0 14:0 15:0
Row: 2 -> 0:0 1:0 2:1 3:0 4:0 5:0 6:0 7:0 8:0 9:0 10:0 11:0 12:0 13:0 14:0 15:0
Row: 3 -> 0:0 1:0 2:0 3:1 4:0 5:0 6:0 7:0 8:0 9:0 10:0 11:0 12:0 13:0 14:0 15:0
Row: 4 -> 0:0 1:0 2:0 3:0 4:1 5:0 6:0 7:0 8:0 9:0 10:0 11:0 12:0 13:0 14:0 15:0
Row: 5 -> 0:0 1:1 2:0 3:0 4:1 5:-4 6:1 7:0 8:0 9:1 10:0 11:0 12:0 13:0 14:0 15:0
Row: 6 -> 0:0 1:0 2:1 3:0 4:0 5:1 6:-4 7:1 8:0 9:0 10:1 11:0 12:0 13:0 14:0 15:0
Row: 7 -> 0:0 1:0 2:0 3:0 4:0 5:0 6:0 7:1 8:0 9:0 10:0 11:0 12:0 13:0 14:0 15:0
Row: 8 -> 0:0 1:0 2:0 3:0 4:0 5:0 6:0 7:0 8:1 9:0 10:0 11:0 12:0 13:0 14:0 15:0
Row: 9 -> 0:0 1:0 2:0 3:0 4:0 5:1 6:0 7:0 8:1 9:-4 10:1 11:0 12:0 13:1 14:0 15:0
Row: 10 -> 0:0 1:0 2:0 3:0 4:0 5:0 6:1 7:0 8:0 9:1 10:-4 11:1 12:0 13:0 14:1 15:0
Row: 11 -> 0:0 1:0 2:0 3:0 4:0 5:0 6:0 7:0 8:0 9:0 10:0 11:1 12:0 13:0 14:0 15:0
Row: 12 -> 0:0 1:0 2:0 3:0 4:0 5:0 6:0 7:0 8:0 9:0 10:0 11:0 12:1 13:0 14:0 15:0
Row: 13 -> 0:0 1:0 2:0 3:0 4:0 5:0 6:0 7:0 8:0 9:0 10:0 11:0 12:0 13:1 14:0 15:0
Row: 14 -> 0:0 1:0 2:0 3:0 4:0 5:0 6:0 7:0 8:0 9:0 10:0 11:0 12:0 13:0 14:1 15:0
Row: 15 -> 0:0 1:0 2:0 3:0 4:0 5:0 6:0 7:0 8:0 9:0 10:0 11:0 12:0 13:0 14:0 15:1
Sums in matrix rows = [ 1, 1, 1, 1, 1, 8, 8, 1, 1, 8, 8, 1, 1, 1, 1, 1 ]

Method forElements

The lambda matrix has the same interface as other matrix types, except for the method getRow. The following example demonstrates the use of the method forElements (TNL::Matrices::LambdaMatrix::forElements) to copy the lambda matrix into the dense matrix:

1#include <iostream>
2#include <TNL/Matrices/DenseMatrix.h>
3#include <TNL/Matrices/LambdaMatrix.h>
4#include <TNL/Devices/Host.h>
5#include <TNL/Devices/Cuda.h>
6
7template< typename Device >
8void
9forElementsExample()
10{
11 /***
12 * Lambda functions defining the matrix.
13 */
14 auto rowLengths = [ = ] __cuda_callable__( const int rows, const int columns, const int rowIdx ) -> int
15 {
16 return columns;
17 };
18 auto matrixElements =
19 [ = ] __cuda_callable__(
20 const int rows, const int columns, const int rowIdx, const int localIdx, int& columnIdx, double& value )
21 {
22 columnIdx = localIdx;
23 value = TNL::max( rowIdx - columnIdx + 1, 0 );
24 };
25
26 using MatrixFactory = TNL::Matrices::LambdaMatrixFactory< double, Device, int >;
27 auto matrix = MatrixFactory::create( 5, 5, matrixElements, rowLengths );
28
29 TNL::Matrices::DenseMatrix< double, Device > denseMatrix( 5, 5 );
30 auto denseView = denseMatrix.getView();
31
32 auto f = [ = ] __cuda_callable__( int rowIdx, int localIdx, int columnIdx, double value ) mutable
33 {
34 denseView.setElement( rowIdx, columnIdx, value );
35 };
36
37 matrix.forElements( 0, matrix.getRows(), f );
38 std::cout << "Original lambda matrix:" << std::endl << matrix << std::endl;
39 std::cout << "Dense matrix:" << std::endl << denseMatrix << std::endl;
40}
41
42int
43main( int argc, char* argv[] )
44{
45 std::cout << "Copying matrix on host: " << std::endl;
46 forElementsExample< TNL::Devices::Host >();
47
48#ifdef __CUDACC__
49 std::cout << "Copying matrix on CUDA device: " << std::endl;
50 forElementsExample< TNL::Devices::Cuda >();
51#endif
52}

Here, we treat the lambda matrix as if it were a dense matrix, so the lambda function rowLengths returns a number of nonzero elements equal to the number of matrix columns (line 16). However, the lambda function matrixElements (lines 18-24) sets nonzero values only in the lower triangular part of the matrix. The elements in the upper part are equal to zero (line 23). Next, we create an instance of the lambda matrix with the help of the lambda matrix factory (TNL::Matrices::LambdaMatrixFactory) (lines 26-27) and an instance of the dense matrix (TNL::Matrices::DenseMatrix) (line 29).

Next, we call the lambda function f by the method forElements (TNL::Matrices::LambdaMatrix::forElements) to set the matrix elements of the dense matrix denseMatrix (line 37) via the dense matrix view denseView (TNL::Matrices::DenseMatrixView). Note that in the lambda function f we get the matrix element value already evaluated in the variable value, as we are used to from other matrix types. So, in fact, the same lambda function f would do the same job even for a sparse matrix or any other matrix type. Also note that in this case we iterate even over all the zero matrix elements, because the lambda function rowLengths (lines 14-17) says so. The result looks as follows:

Copying matrix on host:
Original lambda matrix:
Row: 0 -> 0:1
Row: 1 -> 0:2 1:1
Row: 2 -> 0:3 1:2 2:1
Row: 3 -> 0:4 1:3 2:2 3:1
Row: 4 -> 0:5 1:4 2:3 3:2 4:1
Dense matrix:
Row: 0 -> 0:1 1:0 2:0 3:0 4:0
Row: 1 -> 0:2 1:1 2:0 3:0 4:0
Row: 2 -> 0:3 1:2 2:1 3:0 4:0
Row: 3 -> 0:4 1:3 2:2 3:1 4:0
Row: 4 -> 0:5 1:4 2:3 3:2 4:1
Copying matrix on CUDA device:
Original lambda matrix:
Row: 0 -> 0:1
Row: 1 -> 0:2 1:1
Row: 2 -> 0:3 1:2 2:1
Row: 3 -> 0:4 1:3 2:2 3:1
Row: 4 -> 0:5 1:4 2:3 3:2 4:1
Dense matrix:
Row: 0 -> 0:1 1:0 2:0 3:0 4:0
Row: 1 -> 0:2 1:1 2:0 3:0 4:0
Row: 2 -> 0:3 1:2 2:1 3:0 4:0
Row: 3 -> 0:4 1:3 2:2 3:1 4:0
Row: 4 -> 0:5 1:4 2:3 3:2 4:1

At the end of this part, we show two more examples, how to express a matrix approximating the Laplace operator:

1#include <iostream>
2#include <TNL/Matrices/LambdaMatrix.h>
3#include <TNL/Devices/Host.h>
4#include <TNL/Devices/Cuda.h>
5
6template< typename Device >
7void
8laplaceOperatorMatrix()
9{
10 /***
11 * Set matrix representing approximation of the Laplace operator on regular
12 * grid using the finite difference method.
13 */
14 const int gridSize( 4 );
15 const int matrixSize = gridSize * gridSize;
16 auto rowLengths = [ = ] __cuda_callable__( const int rows, const int columns, const int rowIdx ) -> int
17 {
18 const int gridRow = rowIdx / gridSize; // coordinates in the numerical grid
19 const int gridColumn = rowIdx % gridSize;
20 if( gridRow == 0 || gridRow == gridSize - 1 || // boundary grid node
21 gridColumn == 0 || gridColumn == gridSize - 1 )
22 return 1;
23 return 5;
24 };
25 auto matrixElements =
26 [ = ] __cuda_callable__(
27 const int rows, const int columns, const int rowIdx, const int localIdx, int& columnIdx, double& value )
28 {
29 const int gridRow = rowIdx / gridSize; // coordinates in the numerical grid
30 const int gridColumn = rowIdx % gridSize;
31 if( gridRow == 0 || gridRow == gridSize - 1 || // boundary grid node
32 gridColumn == 0 || gridColumn == gridSize - 1 )
33 {
34 columnIdx = rowIdx; // diagonal element ....
35 value = 1.0; // ... is set to 1
36 }
37 else // interior grid node
38 {
39 switch( localIdx ) // set diagonal element to 4
40 { // and the others to -1
41 case 0:
42 columnIdx = rowIdx - gridSize;
43 value = 1;
44 break;
45 case 1:
46 columnIdx = rowIdx - 1;
47 value = 1;
48 break;
49 case 2:
50 columnIdx = rowIdx;
51 value = -4;
52 break;
53 case 3:
54 columnIdx = rowIdx + 1;
55 value = 1;
56 break;
57 case 4:
58 columnIdx = rowIdx + gridSize;
59 value = 1;
60 break;
61 }
62 }
63 };
64 auto matrix =
65 TNL::Matrices::LambdaMatrixFactory< double, Device, int >::create( matrixSize, matrixSize, matrixElements, rowLengths );
66 std::cout << "Laplace operator matrix: " << std::endl << matrix << std::endl;
67}
68
69int
70main( int argc, char* argv[] )
71{
72 std::cout << "Creating Laplace operator matrix on CPU ... " << std::endl;
73 laplaceOperatorMatrix< TNL::Devices::Host >();
74
75#ifdef __CUDACC__
76 std::cout << "Creating Laplace operator matrix on CUDA GPU ... " << std::endl;
77 laplaceOperatorMatrix< TNL::Devices::Cuda >();
78#endif
79}

The following is another way of doing the same but with precomputed supporting vectors:

1#include <iostream>
2#include <TNL/Matrices/LambdaMatrix.h>
3#include <TNL/Devices/Host.h>
4#include <TNL/Devices/Cuda.h>
5
6template< typename Device >
7void
8laplaceOperatorMatrix()
9{
10 /***
11 * Set matrix representing approximation of the Laplace operator on regular
12 * grid using the finite difference method.
13 */
14 const int gridSize( 4 );
15 const int matrixSize = gridSize * gridSize;
16 TNL::Containers::Vector< int, Device > columnOffsets{
17 0, -1, 0, 1, 0
18 }; // helper vector for getting matrix elements column indexes
19 columnOffsets.setElement( 0, -gridSize );
20 columnOffsets.setElement( 4, gridSize );
21 auto columnOffsetsView = columnOffsets.getView();
22 TNL::Containers::Vector< double, Device > values{ 1, 1, -4, 1, 1 }; // helper vector for getting matrix elements values
23 auto valuesView = values.getView();
24
25 auto rowLengths = [ = ] __cuda_callable__( const int rows, const int columns, const int rowIdx ) -> int
26 {
27 const int gridRow = rowIdx / gridSize; // coordinates in the numerical grid
28 const int gridColumn = rowIdx % gridSize;
29
30 if( gridRow == 0 || gridRow == gridSize - 1 || // boundary grid node
31 gridColumn == 0 || gridColumn == gridSize - 1 )
32 return 1;
33 return 5;
34 };
35
36 auto matrixElements =
37 [ = ] __cuda_callable__(
38 const int rows, const int columns, const int rowIdx, const int localIdx, int& columnIdx, double& value )
39 {
40 const int gridRow = rowIdx / gridSize; // coordinates in the numerical grid
41 const int gridColumn = rowIdx % gridSize;
42 if( gridRow == 0 || gridRow == gridSize - 1 || // boundary grid node
43 gridColumn == 0 || gridColumn == gridSize - 1 )
44 {
45 columnIdx = rowIdx; // diagonal element ....
46 value = 1.0; // ... is set to 1
47 }
48 else // interior grid node
49 {
50 columnIdx = rowIdx + columnOffsetsView[ localIdx ];
51 value = valuesView[ localIdx ];
52 }
53 };
54 auto matrix =
55 TNL::Matrices::LambdaMatrixFactory< double, Device, int >::create( matrixSize, matrixSize, matrixElements, rowLengths );
56 std::cout << "Laplace operator matrix: " << std::endl << matrix << std::endl;
57}
58
59int
60main( int argc, char* argv[] )
61{
62 std::cout << "Creating Laplace operator matrix on CPU ... " << std::endl;
63 laplaceOperatorMatrix< TNL::Devices::Host >();
64
65#ifdef __CUDACC__
66 std::cout << "Creating Laplace operator matrix on CUDA GPU ... " << std::endl;
67 laplaceOperatorMatrix< TNL::Devices::Cuda >();
68#endif
69}

The result of both examples looks as follows:

Creating Laplace operator matrix on CPU ...
Laplace operator matrix:
Row: 0 -> 0:1
Row: 1 -> 1:1
Row: 2 -> 2:1
Row: 3 -> 3:1
Row: 4 -> 4:1
Row: 5 -> 1:1 4:1 5:-4 6:1 9:1
Row: 6 -> 2:1 5:1 6:-4 7:1 10:1
Row: 7 -> 7:1
Row: 8 -> 8:1
Row: 9 -> 5:1 8:1 9:-4 10:1 13:1
Row: 10 -> 6:1 9:1 10:-4 11:1 14:1
Row: 11 -> 11:1
Row: 12 -> 12:1
Row: 13 -> 13:1
Row: 14 -> 14:1
Row: 15 -> 15:1
Creating Laplace operator matrix on CUDA GPU ...
Laplace operator matrix:
Row: 0 -> 0:1
Row: 1 -> 1:1
Row: 2 -> 2:1
Row: 3 -> 3:1
Row: 4 -> 4:1
Row: 5 -> 1:1 4:1 5:-4 6:1 9:1
Row: 6 -> 2:1 5:1 6:-4 7:1 10:1
Row: 7 -> 7:1
Row: 8 -> 8:1
Row: 9 -> 5:1 8:1 9:-4 10:1 13:1
Row: 10 -> 6:1 9:1 10:-4 11:1 14:1
Row: 11 -> 11:1
Row: 12 -> 12:1
Row: 13 -> 13:1
Row: 14 -> 14:1
Row: 15 -> 15:1

Distributed matrices

TODO: Write documentation on distributed matrices.

Flexible reduction in matrix rows

Flexible reduction in matrix rows is a powerful tool for many different matrix operations. It is represented by the method reduceRows (TNL::Matrices::DenseMatrix::reduceRows, TNL::Matrices::SparseMatrix::reduceRows, TNL::Matrices::TridiagonalMatrix::reduceRows, TNL::Matrices::MultidiagonalMatrix::reduceRows, TNL::Matrices::LambdaMatrix::reduceRows), and similarly to the method forElements, it iterates over particular matrix rows. In addition, however, it performs flexible parallel reduction. For example, the matrix-vector product can be seen as a reduction of the products of the matrix elements with the input vector in particular matrix rows. The first element of the result vector is obtained as:

\[ y_1 = a_{11} x_1 + a_{12} x_2 + \ldots + a_{1n} x_n = \sum_{j=1}^n a_{1j}x_j \]

and in general i-th element of the result vector is computed as

\[ y_i = a_{i1} x_1 + a_{i2} x_2 + \ldots + a_{in} x_n = \sum_{j=1}^n a_{ij}x_j. \]

We see that in the i-th matrix row we have to compute the sum \(\sum_{j=1}^n a_{ij}x_j\), which is a reduction of the products \( a_{ij}x_j\). Similarly to flexible parallel reduction (TNL::Algorithms::reduce), we just need to design proper lambda functions. There are three of them.

  1. fetch reads and preprocesses data entering the flexible parallel reduction.
  2. reduce performs the reduction operation.
  3. keep stores the results from each matrix row.

Lambda function fetch

This lambda function has the same purpose as the lambda function fetch in flexible parallel reduction for arrays and vectors (see Flexible Parallel Reduction). It is supposed to be declared as follows:

1auto fetch = [=] __cuda_callable__( Index rowIdx, Index columnIdx, const Real& value ) -> Real;

The meaning of the particular parameters is as follows:

  1. rowIdx is the row index of the matrix element.
  2. columnIdx is the column index of the matrix element.
  3. value is the value of the matrix element.

The lambda function returns a value of type Real based on the input data.

Lambda function reduce

The lambda function reduce expresses the reduction operation (sum, product, minimum, maximum, etc.) to be performed during the flexible reduction.

1auto reduce = [] __cuda_callable__( const Real& a, const Real& b ) -> Real;

The meaning of the particular parameters is as follows:

  1. a is the first operand for the reduction operation.
  2. b is the second operand for the reduction operation.

Lambda function keep

The lambda function keep is a new one compared to the flexible reduction for arrays, vectors and other linear structures. The reason is that the result consists of as many numbers as there are matrix rows. The result obtained for each matrix row is processed by this lambda function. It is declared as follows:

1auto keep = [=] __cuda_callable__( Index rowIdx, const Real& value ) mutable;

The meaning of the particular parameters is as follows:

  1. rowIdx is the index of the matrix row related to the given result of the flexible reduction.
  2. value is the result of the flexible reduction in the given matrix row.

The method reduceRows (TNL::Matrices::DenseMatrix::reduceRows, TNL::Matrices::SparseMatrix::reduceRows, TNL::Matrices::TridiagonalMatrix::reduceRows, TNL::Matrices::MultidiagonalMatrix::reduceRows, TNL::Matrices::LambdaMatrix::reduceRows) accepts the following arguments:

  1. begin is the beginning of the matrix rows range on which the reduction will be performed.
  2. end is the end of the matrix rows range on which the reduction will be performed. The last matrix row which is going to be processed has index end-1.
  3. fetch is the lambda function for data fetching.
  4. reduce is the lambda function performing the reduction.
  5. keep is the lambda function responsible for processing the results from particular matrix rows.
  6. zero is the identity element of given reduction operation.

Though the interface is the same for all matrix types, the following part shows several examples for different matrix types to better demonstrate the possible uses of flexible reduction for matrices.

Dense matrices example

The following example demonstrates an implementation of the dense matrix-vector product \( {\bf y} = A{\bf x}\), i.e.

\[ y_i = \sum_{j=0}^{columns - 1} a_{ij} x_j \text{ for } i = 0, \ldots, rows-1. \]

1#include <iostream>
2#include <iomanip>
3#include <functional>
4#include <TNL/Matrices/DenseMatrix.h>
5#include <TNL/Devices/Host.h>
6#include <TNL/Devices/Cuda.h>
7
8template< typename Device >
9void
10reduceRows()
11{
12 // clang-format off
13 TNL::Matrices::DenseMatrix< double, Device > matrix{
14 { 1, 0, 0, 0, 0 },
15 { 1, 2, 0, 0, 0 },
16 { 0, 1, 8, 0, 0 },
17 { 0, 0, 1, 9, 0 },
18 { 0, 0, 0, 0, 1 } };
19 // clang-format on
20 /***
21 * Allocate input and output vectors for matrix-vector product
22 */
23 TNL::Containers::Vector< double, Device > x( matrix.getColumns() ), y( matrix.getRows() );
24
25 /***
26 * Fill the input vectors with ones.
27 */
28 x = 1.0;
29
30 /***
31 * Prepare vector view for lambdas.
32 */
33 auto xView = x.getView();
34 auto yView = y.getView();
35
36 /***
37 * Fetch lambda just returns product of appropriate matrix elements and the
38 * input vector elements.
39 */
40 auto fetch = [ = ] __cuda_callable__( int rowIdx, int columnIdx, const double& value ) -> double
41 {
42 return xView[ columnIdx ] * value;
43 };
44
45 /***
46 * Keep lambda store the result of matrix-vector product to output vector y.
47 */
48 auto keep = [ = ] __cuda_callable__( int rowIdx, const double& value ) mutable
49 {
50 yView[ rowIdx ] = value;
51 };
52
53 /***
54 * Compute matrix-vector product.
55 */
56 matrix.reduceRows( 0, matrix.getRows(), fetch, std::plus<>{}, keep, 0.0 );
57
58 std::cout << "The matrix reads as:" << std::endl << matrix << std::endl;
59 std::cout << "The input vector is:" << x << std::endl;
60 std::cout << "Result of matrix-vector multiplication is: " << y << std::endl;
61}
62
63int
64main( int argc, char* argv[] )
65{
66 std::cout << "Rows reduction on host:" << std::endl;
67 reduceRows< TNL::Devices::Host >();
68
69#ifdef __CUDACC__
71 std::cout << "Rows reduction on CUDA device:" << std::endl;
72 reduceRows< TNL::Devices::Cuda >();
73#endif
74}

We set the following lambda functions:

  • fetch lambda function computes the product \( a_{ij}x_j\) where \( a_{ij} \) is represented by value and \(x_j \) is represented by xView[columnIdx] (line 40).
  • reduce - the reduction is just a sum of the particular products and is represented by std::plus (line 56).
  • keep is responsible for storing the results of reduction in each matrix row (which is represented by the variable value) into the output vector y.

The result looks as:

Rows reduction on host:
The matrix reads as:
Row: 0 -> 0:1 1:0 2:0 3:0 4:0
Row: 1 -> 0:1 1:2 2:0 3:0 4:0
Row: 2 -> 0:0 1:1 2:8 3:0 4:0
Row: 3 -> 0:0 1:0 2:1 3:9 4:0
Row: 4 -> 0:0 1:0 2:0 3:0 4:1
The input vector is:[ 1, 1, 1, 1, 1 ]
Result of matrix-vector multiplication is: [ 1, 3, 9, 10, 1 ]
Rows reduction on CUDA device:
The matrix reads as:
Row: 0 -> 0:1 1:0 2:0 3:0 4:0
Row: 1 -> 0:1 1:2 2:0 3:0 4:0
Row: 2 -> 0:0 1:1 2:8 3:0 4:0
Row: 3 -> 0:0 1:0 2:1 3:9 4:0
Row: 4 -> 0:0 1:0 2:0 3:0 4:1
The input vector is:[ 1, 1, 1, 1, 1 ]
Result of matrix-vector multiplication is: [ 1, 3, 9, 10, 1 ]

We will show one more example: the computation of the maximal absolute value in each matrix row. The results will be stored in a vector:

\[ y_i = \max_{j=1,\ldots,n} |a_{ij}|. \]

See the following example:

1#include <iostream>
2#include <iomanip>
3#include <functional>
4#include <TNL/Matrices/DenseMatrix.h>
5#include <TNL/Devices/Host.h>
6
7template< typename Device >
8void
9reduceRows()
10{
11 // clang-format off
12 TNL::Matrices::DenseMatrix< double, Device > matrix{
13 { 1, 0, 0, 0, 0 },
14 { 1, 2, 0, 0, 0 },
15 { 0, 1, 8, 0, 0 },
16 { 0, 0, 1, 9, 0 },
17 { 0, 0, 0, 0, 1 } };
18 // clang-format on
19 /***
20 * Find largest element in each row.
21 */
22 TNL::Containers::Vector< double, Device > rowMax( matrix.getRows() );
23
24 /***
25 * Prepare vector view for lambdas.
26 */
27 auto rowMaxView = rowMax.getView();
28
29 /***
30 * Fetch lambda just returns absolute value of matrix elements.
31 */
32 auto fetch = [ = ] __cuda_callable__( int rowIdx, int columnIdx, const double& value ) -> double
33 {
34 return TNL::abs( value );
35 };
36
37 /***
38 * Reduce lambda return maximum of given values.
39 */
40 auto reduce = [ = ] __cuda_callable__( const double& a, const double& b ) -> double
41 {
42 return TNL::max( a, b );
43 };
44
45 /***
46 * Keep lambda store the largest value in each row to the vector rowMax.
47 */
48 auto keep = [ = ] __cuda_callable__( int rowIdx, const double& value ) mutable
49 {
50 rowMaxView[ rowIdx ] = value;
51 };
52
53 /***
54 * Compute the largest values in each row.
55 */
56 matrix.reduceRows( 0, matrix.getRows(), fetch, reduce, keep, std::numeric_limits< double >::lowest() );
57
58 std::cout << "Max. elements in rows are: " << rowMax << std::endl;
59 std::cout << "Max. matrix norm is: " << TNL::max( rowMax ) << std::endl;
60}
61
62int
63main( int argc, char* argv[] )
64{
65 std::cout << "Rows reduction on host:" << std::endl;
66 reduceRows< TNL::Devices::Host >();
67
68#ifdef __CUDACC__
69 std::cout << "Rows reduction on CUDA device:" << std::endl;
70 reduceRows< TNL::Devices::Cuda >();
71#endif
72}

The lambda functions are:

  • fetch lambda function just returns the absolute value of \(a_{ij} \), which is again represented by the variable value.
  • reduce lambda function returns the larger of the given values.
  • keep stores the results to the output vector the same way as in the previous example.

Note that the identity value for the reduction is std::numeric_limits< double >::lowest(). Of course, if we additionally compute the maximum of all output vector elements, we obtain the maximum of the absolute values of all matrix elements, i.e. a kind of maximal matrix norm. The output looks as:

Rows reduction on host:
Max. elements in rows are: [ 1, 2, 8, 9, 1 ]
Max. matrix norm is: 9
Rows reduction on CUDA device:
Max. elements in rows are: [ 1, 2, 8, 9, 1 ]
Max. matrix norm is: 9

Sparse matrices example

The following example demonstrates sparse matrix-vector product:

1#include <iostream>
2#include <iomanip>
3#include <functional>
4#include <TNL/Matrices/SparseMatrix.h>
5#include <TNL/Devices/Host.h>
6#include <TNL/Devices/Cuda.h>
7
8template< typename Device >
9void
10reduceRows()
11{
12 // clang-format off
13 TNL::Matrices::SparseMatrix< double, Device > matrix{ 5, 5, {
14 { 0, 0, 1 },
15 { 1, 0, 1 }, { 1, 1, 2 },
16 { 2, 1, 1 }, { 2, 2, 8 },
17 { 3, 2, 1 }, { 3, 3, 9 },
18 { 4, 4, 1 }
19 } };
20 // clang-format on
21
22 /***
23 * Allocate input and output vectors for matrix-vector product
24 */
25 TNL::Containers::Vector< double, Device > x( matrix.getColumns() ), y( matrix.getRows() );
26
27 /***
28 * Fill the input vectors with ones.
29 */
30 x = 1.0;
31
32 /***
33 * Prepare vector view for lambdas.
34 */
35 auto xView = x.getView();
36 auto yView = y.getView();
37
38 /***
39 * Fetch lambda just returns product of appropriate matrix elements and the
40 * input vector elements.
41 */
42 auto fetch = [ = ] __cuda_callable__( int rowIdx, int columnIdx, const double& value ) -> double
43 {
44 return xView[ columnIdx ] * value;
45 };
46
47 /***
48 * Keep lambda store the result of matrix-vector product to output vector y.
49 */
50 auto keep = [ = ] __cuda_callable__( int rowIdx, const double& value ) mutable
51 {
52 yView[ rowIdx ] = value;
53 };
54
55 /***
56 * Compute matrix-vector product.
57 */
58 matrix.reduceRows( 0, matrix.getRows(), fetch, std::plus<>{}, keep, 0.0 );
59
60 std::cout << "The matrix reads as:" << std::endl << matrix << std::endl;
61 std::cout << "The input vector is:" << x << std::endl;
62 std::cout << "Result of matrix-vector multiplication is: " << y << std::endl;
63}
64
65int
66main( int argc, char* argv[] )
67{
68 std::cout << "Rows reduction on host:" << std::endl;
69 reduceRows< TNL::Devices::Host >();
70
71#ifdef __CUDACC__
73 std::cout << "Rows reduction on CUDA device:" << std::endl;
74 reduceRows< TNL::Devices::Cuda >();
75#endif
76}

On lines 13-19 we set the following matrix:

\[ \left( \begin{array}{ccccc} 1 & . & . & . & . \\ 1 & 2 & . & . & . \\ . & 1 & 8 & . & . \\ . & . & 1 & 9 & . \\ . & . & . & . & 1 \end{array} \right) \]

The lambda functions on lines 42-53 are the same as in the example with the dense matrix. The result looks as follows:

Rows reduction on host:
The matrix reads as:
Row: 0 -> 0:1
Row: 1 -> 0:1 1:2
Row: 2 -> 1:1 2:8
Row: 3 -> 2:1 3:9
Row: 4 -> 4:1
The input vector is:[ 1, 1, 1, 1, 1 ]
Result of matrix-vector multiplication is: [ 1, 3, 9, 10, 1 ]
Rows reduction on CUDA device:
The matrix reads as:
Row: 0 -> 0:1
Row: 1 -> 0:1 1:2
Row: 2 -> 1:1 2:8
Row: 3 -> 2:1 3:9
Row: 4 -> 4:1
The input vector is:[ 1, 1, 1, 1, 1 ]
Result of matrix-vector multiplication is: [ 1, 3, 9, 10, 1 ]

Tridiagonal matrices example

In this example, we will compute maximal absolute value in each row of the following tridiagonal matrix:

\[ \left( \begin{array}{ccccc} 1 & 3 & & & & \\ 2 & 1 & 3 & & & \\ & 2 & 1 & 3 & & \\ & & 2 & 1 & 3 & \\ & & & 2 & 1 & 3 \end{array} \right). \]

The source code reads as follows:

1#include <iostream>
2#include <iomanip>
3#include <functional>
4#include <TNL/Matrices/TridiagonalMatrix.h>
5#include <TNL/Devices/Host.h>
6
7template< typename Device >
8void
9reduceRows()
10{
11 /***
12 * Set the following matrix (dots represent zero matrix elements and zeros are
13 * padding zeros for memory alignment):
14 *
15 * 0 / 1 3 . . . \ -> { 0, 1, 3 }
16 * | 2 1 3 . . | -> { 2, 1, 3 }
17 * | . 2 1 3 . | -> { 2, 1, 3 }
18 * | . . 2 1 3 | -> { 2, 1, 3 }
19 * \ . . . 2 1 / 0 -> { 2, 1, 0 }
20 *
21 */
22 TNL::Matrices::TridiagonalMatrix< double, Device > matrix( 5, // number of matrix columns
23 { { 0, 1, 3 }, // matrix elements
24 { 2, 1, 3 },
25 { 2, 1, 3 },
26 { 2, 1, 3 },
27 { 2, 1, 3 } } );
28 auto view = matrix.getView();
29
30 /***
31 * Find largest element in each row.
32 */
33 TNL::Containers::Vector< double, Device > rowMax( matrix.getRows() );
34
35 /***
36 * Prepare vector view for lambdas.
37 */
38 auto rowMaxView = rowMax.getView();
39
40 /***
41 * Fetch lambda just returns absolute value of matrix elements.
42 */
43 auto fetch = [] __cuda_callable__( int rowIdx, int columnIdx, const double& value ) -> double
44 {
45 return TNL::abs( value );
46 };
47
48 /***
49 * Reduce lambda return maximum of given values.
50 */
51 auto reduce = [] __cuda_callable__( const double& a, const double& b ) -> double
52 {
53 return TNL::max( a, b );
54 };
55
56 /***
57 * Keep lambda store the largest value in each row to the vector rowMax.
58 */
59 auto keep = [ = ] __cuda_callable__( int rowIdx, const double& value ) mutable
60 {
61 rowMaxView[ rowIdx ] = value;
62 };
63
64 /***
65 * Compute the largest values in each row.
66 */
67 view.reduceRows( 0, view.getRows(), fetch, reduce, keep, std::numeric_limits< double >::lowest() ); // or matrix.reduceRows
68
69 std::cout << "The matrix reads as: " << std::endl << matrix << std::endl;
70 std::cout << "Max. elements in rows are: " << rowMax << std::endl;
71}
72
73int
74main( int argc, char* argv[] )
75{
76 std::cout << "Rows reduction on host:" << std::endl;
77 reduceRows< TNL::Devices::Host >();
78
79#ifdef __CUDACC__
80 std::cout << "Rows reduction on CUDA device:" << std::endl;
81 reduceRows< TNL::Devices::Cuda >();
82#endif
83}

Here we first set the tridiagonal matrix (lines 22-27). Next we allocate the vector rowMax where we will store the results (line 33). The lambda functions are:

  • fetch (lines 43-46) is responsible for reading the matrix elements. In our example, the only thing this function has to do is to compute the absolute value of each matrix element, represented by the variable value.
  • reduce (lines 51-54) performs the reduction operation. In this case, it returns the maximum of the two input values a and b.
  • keep (lines 59-62) takes the result of the reduction in the variable value in each row and stores it into the vector rowMax via the related vector view rowMaxView.

Note that the identity value for the reduction is std::numeric_limits< double >::lowest(). The result looks as follows:

Rows reduction on host:
The matrix reads as:
Row: 0 -> 0:1 1:3
Row: 1 -> 0:2 1:1 2:3
Row: 2 -> 1:2 2:1 3:3
Row: 3 -> 2:2 3:1 4:3
Row: 4 -> 3:2 4:1
Max. elements in rows are: [ 3, 3, 3, 3, 2 ]
Rows reduction on CUDA device:
The matrix reads as:
Row: 0 -> 0:1 1:3
Row: 1 -> 0:2 1:1 2:3
Row: 2 -> 1:2 2:1 3:3
Row: 3 -> 2:2 3:1 4:3
Row: 4 -> 3:2 4:1
Max. elements in rows are: [ 3, 3, 3, 3, 2 ]

Multidiagonal matrices example

The next example again computes the maximal absolute value in each row, this time for a multidiagonal matrix of the following form:

\[ \left( \begin{array}{ccccc} 1 & & & & \\ 2 & 1 & & & \\ 3 & 2 & 1 & & \\ & 3 & 2 & 1 & \\ & & 3 & 2 & 1 \end{array} \right) \]

We first create the vector rowMax into which we will store the results and obtain its view rowMaxView (line 39). Next we prepare the necessary lambda functions:

  • fetch (lines 44-46) is responsible for reading the matrix element, which is stored in the constant reference value, and for returning its absolute value. The other parameters rowIdx and columnIdx correspond to the row and column indexes, respectively, and are not used in our example.
  • reduce (lines 51-53) returns the maximum of the two input values a and b.
  • keep (lines 58-60) stores the input value at the corresponding position, given by the row index rowIdx, in the output vector view rowMaxView.

Finally, we call the method reduceRows (TNL::Matrices::MultidiagonalMatrix::reduceRows) with parameters specifying the interval of rows to be processed (the first and second parameter), the lambda functions fetch, reduce and keep, and the identity element for the reduction operation, which is the lowest number of the given type (std::numeric_limits< double >::lowest()). The result looks as follows:

Rows reduction on host:
The matrix reads as:
Row: 0 -> 0:1
Row: 1 -> 0:2 1:1
Row: 2 -> 0:3 1:2 2:1
Row: 3 -> 1:3 2:2 3:1
Row: 4 -> 2:3 3:2 4:1
Max. elements in rows are: [ 1, 2, 3, 3, 3 ]
Rows reduction on CUDA device:
The matrix reads as:
Row: 0 -> 0:1
Row: 1 -> 0:2 1:1
Row: 2 -> 0:3 1:2 2:1
Row: 3 -> 1:3 2:2 3:1
Row: 4 -> 2:3 3:2 4:1
Max. elements in rows are: [ 1, 2, 3, 3, 3 ]

Lambda matrices example

The reduction of matrix rows is available for the lambda matrices as well. See the following example:

1#include <iostream>
2#include <iomanip>
3#include <functional>
4#include <TNL/Matrices/LambdaMatrix.h>
5#include <TNL/Devices/Host.h>
6#include <TNL/Devices/Cuda.h>
7
8template< typename Device >
9void
10reduceRows()
11{
12 /***
13 * Lambda functions defining the matrix.
14 */
15 auto rowLengths = [ = ] __cuda_callable__( const int rows, const int columns, const int rowIdx ) -> int
16 {
17 return columns;
18 };
19 auto matrixElements =
20 [ = ] __cuda_callable__(
21 const int rows, const int columns, const int rowIdx, const int localIdx, int& columnIdx, double& value )
22 {
23 columnIdx = localIdx;
24 value = TNL::max( rowIdx - columnIdx + 1, 0 );
25 };
26
27 using MatrixFactory = TNL::Matrices::LambdaMatrixFactory< double, Device, int >;
28 auto matrix = MatrixFactory::create( 5, 5, matrixElements, rowLengths );
29
30 /***
31 * Find largest element in each row.
32 */
33 TNL::Containers::Vector< double, Device > rowMax( matrix.getRows() );
34
35 /***
36 * Prepare vector view for lambdas.
37 */
38 auto rowMaxView = rowMax.getView();
39
40 /***
41 * Fetch lambda just returns absolute value of matrix elements.
42 */
43 auto fetch = [ = ] __cuda_callable__( int rowIdx, int columnIdx, const double& value ) -> double
44 {
45 return TNL::abs( value );
46 };
47
48 /***
49 * Reduce lambda return maximum of given values.
50 */
51 auto reduce = [ = ] __cuda_callable__( const double& a, const double& b ) -> double
52 {
53 return TNL::max( a, b );
54 };
55
56 /***
57 * Keep lambda store the largest value in each row to the vector rowMax.
58 */
59 auto keep = [ = ] __cuda_callable__( int rowIdx, const double& value ) mutable
60 {
61 rowMaxView[ rowIdx ] = value;
62 };
63
64 /***
65 * Compute the largest values in each row.
66 */
67 matrix.reduceRows( 0, matrix.getRows(), fetch, reduce, keep, std::numeric_limits< double >::lowest() );
68
69 std::cout << "The matrix reads as: " << std::endl << matrix << std::endl;
70 std::cout << "Max. elements in rows are: " << rowMax << std::endl;
71}
72
73int
74main( int argc, char* argv[] )
75{
76 std::cout << "Rows reduction on host:" << std::endl;
77 reduceRows< TNL::Devices::Host >();
78
79#ifdef __CUDACC__
80 std::cout << "Rows reduction on CUDA device:" << std::endl;
81 reduceRows< TNL::Devices::Cuda >();
82#endif
83}

On lines 15-28, we create the lower triangular lambda matrix which looks as follows:

\[ \left( \begin{array}{ccccc} 1 & & & & \\ 2 & 1 & & & \\ 3 & 2 & 1 & & \\ 4 & 3 & 2 & 1 & \\ 5 & 4 & 3 & 2 & 1 \end{array} \right) \]

We want to compute the maximal absolute value of the matrix elements in each row. For this purpose we define the well-known lambda functions:

  • fetch takes the value of the lambda matrix element and returns its absolute value.
  • reduce computes the maximum of the two input values.
  • keep stores the results into output vector rowMax.

Note that the interface of the lambda functions is the same as for other matrix types. The result looks as follows:

Rows reduction on host:
The matrix reads as:
Row: 0 -> 0:1
Row: 1 -> 0:2 1:1
Row: 2 -> 0:3 1:2 2:1
Row: 3 -> 0:4 1:3 2:2 3:1
Row: 4 -> 0:5 1:4 2:3 3:2 4:1
Max. elements in rows are: [ 1, 2, 3, 4, 5 ]
Rows reduction on CUDA device:
The matrix reads as:
Row: 0 -> 0:1
Row: 1 -> 0:2 1:1
Row: 2 -> 0:3 1:2 2:1
Row: 3 -> 0:4 1:3 2:2 3:1
Row: 4 -> 0:5 1:4 2:3 3:2 4:1
Max. elements in rows are: [ 1, 2, 3, 4, 5 ]

Matrix-vector product

One of the most important matrix operations is the matrix-vector multiplication. It is represented by the method vectorProduct (TNL::Matrices::DenseMatrix::vectorProduct, TNL::Matrices::SparseMatrix::vectorProduct, TNL::Matrices::TridiagonalMatrix::vectorProduct, TNL::Matrices::MultidiagonalMatrix::vectorProduct, TNL::Matrices::LambdaMatrix::vectorProduct). It is a templated method with two template parameters InVector and OutVector specifying the types of the input and output vector, respectively. Usually one will substitute some of TNL::Containers::Array, TNL::Containers::ArrayView, TNL::Containers::Vector or TNL::Containers::VectorView for these types. The method accepts the following parameters:

  1. inVector is the input vector having the same number of elements as the number of matrix columns.
  2. outVector is the output vector having the same number of elements as the number of matrix rows.
  3. matrixMultiplicator is a number by which the result of matrix-vector product is multiplied.
  4. outVectorMultiplicator is a number by which the output vector is multiplied before it is added to the result of the matrix-vector product.
  5. begin is the beginning of the matrix rows range on which we compute the matrix-vector product.
  6. end is the end of the matrix rows range on which the matrix-vector product will be evaluated. The last matrix row which is going to be processed has index end-1.

Note that the output vector dimension must be the same as the number of matrix rows no matter how we set the begin and end parameters. These parameters merely restrict which matrix rows (and the corresponding output vector elements) are processed.

To summarize, this method computes the following formula:

outVector = matrixMultiplicator * ( *this ) * inVector + outVectorMultiplicator * outVector.

Matrix I/O operations

All matrices can be saved to a file using the method save (TNL::Matrices::DenseMatrix::save, TNL::Matrices::SparseMatrix::save, TNL::Matrices::TridiagonalMatrix::save, TNL::Matrices::MultidiagonalMatrix::save) and restored with the method load (TNL::Matrices::DenseMatrix::load, TNL::Matrices::SparseMatrix::load, TNL::Matrices::TridiagonalMatrix::load, TNL::Matrices::MultidiagonalMatrix::load). To print the matrix, the method print (TNL::Matrices::DenseMatrix::print, TNL::Matrices::SparseMatrix::print, TNL::Matrices::TridiagonalMatrix::print, TNL::Matrices::MultidiagonalMatrix::print, TNL::Matrices::LambdaMatrix::print) can be used.

Matrix reader and writer

TNL also offers a matrix reader (TNL::Matrices::MatrixReader) and a matrix writer (TNL::Matrices::MatrixWriter) for the import and export of matrices, respectively. The matrix reader currently supports only the coordinate MTX file format, which is popular mainly for sparse matrices. By means of the matrix writer, we can export TNL matrices into the coordinate MTX format as well. In addition, matrices can be exported to a text file suitable for the Gnuplot program, which can be used for matrix visualization. Finally, the pattern of nonzero matrix elements can be visualized via the EPS (Encapsulated PostScript) format. We demonstrate both the matrix reader and writer in the following example:

1#include <iostream>
2#include <TNL/Matrices/SparseMatrix.h>
3#include <TNL/Matrices/DenseMatrix.h>
4#include <TNL/Matrices/MatrixReader.h>
5#include <TNL/Matrices/MatrixWriter.h>
6#include <TNL/Devices/Host.h>
7#include <TNL/Devices/Cuda.h>
8
9template< typename Device >
10void
11matrixWriterExample()
12{
13 using Matrix = TNL::Matrices::SparseMatrix< double, Device >;
14 Matrix matrix( 5, // number of matrix rows
15 5, // number of matrix columns
16 {
17 // matrix elements definition
18 // clang-format off
19 { 0, 0, 2.0 },
20 { 1, 0, -1.0 }, { 1, 1, 2.0 }, { 1, 2, -1.0 },
21 { 2, 1, -1.0 }, { 2, 2, 2.0 }, { 2, 3, -1.0 },
22 { 3, 2, -1.0 }, { 3, 3, 2.0 }, { 3, 4, -1.0 },
23 { 4, 4, 2.0 }
24 // clang-format on
25 } );
26
27 std::cout << "Matrix: " << std::endl << matrix << std::endl;
28 std::cout << "Writing matrix in Gnuplot format into the file matrix-writer-example.gplt ...";
29 TNL::Matrices::MatrixWriter< Matrix >::writeGnuplot( "matrix-writer-example.gplt", matrix );
30 std::cout << " OK " << std::endl;
31 std::cout << "Writing matrix pattern in EPS format into the file matrix-writer-example.eps ...";
32 TNL::Matrices::MatrixWriter< Matrix >::writeEps( "matrix-writer-example.eps", matrix );
33 std::cout << " OK " << std::endl;
34 std::cout << "Writing matrix in MTX format into the file matrix-writer-example.mtx ...";
35 TNL::Matrices::MatrixWriter< Matrix >::writeMtx( "matrix-writer-example.mtx", matrix );
36 std::cout << " OK " << std::endl;
37}
38
39template< typename Device >
40void
41matrixReaderExample()
42{
43 using SparseMatrix = TNL::Matrices::SparseMatrix< double, Device >;
44 SparseMatrix sparseMatrix;
45
46 std::cout << "Reading sparse matrix from MTX file matrix-writer-example.mtx ... ";
47 TNL::Matrices::MatrixReader< SparseMatrix >::readMtx( "matrix-writer-example.mtx", sparseMatrix );
48 std::cout << " OK " << std::endl;
49 std::cout << "Imported matrix is: " << std::endl << sparseMatrix << std::endl;
50
51 using DenseMatrix = TNL::Matrices::DenseMatrix< double, Device >;
52 DenseMatrix denseMatrix;
53
54 std::cout << "Reading dense matrix from MTX file matrix-writer-example.mtx ... ";
55 TNL::Matrices::MatrixReader< DenseMatrix >::readMtx( "matrix-writer-example.mtx", denseMatrix );
56 std::cout << " OK " << std::endl;
57 std::cout << "Imported matrix is: " << std::endl << denseMatrix << std::endl;
58}
59
60int
61main( int argc, char* argv[] )
62{
63 std::cout << "Creating matrices on CPU ... " << std::endl;
64 matrixWriterExample< TNL::Devices::Host >();
65 matrixReaderExample< TNL::Devices::Host >();
66
67#ifdef __CUDACC__
69 std::cout << "Creating matrices on CUDA GPU ... " << std::endl;
70 matrixWriterExample< TNL::Devices::Cuda >();
71 matrixReaderExample< TNL::Devices::Cuda >();
72#endif
73}

The example consists of two functions: matrixWriterExample (lines 10-37) and matrixReaderExample (lines 40-58). In the first one, we create a toy matrix (lines 14-25) which we subsequently export into the Gnuplot (line 29), EPS (line 32) and MTX (line 35) formats. In the second one, the MTX file is used to import the matrix into a sparse matrix (line 47) and a dense matrix (line 55). Both matrices are printed out (lines 49 and 57).

The result looks as follows:

1Creating matrices on CPU ...
2Matrix:
3Row: 0 -> 0:2
4Row: 1 -> 0:-1 1:2 2:-1
5Row: 2 -> 1:-1 2:2 3:-1
6Row: 3 -> 2:-1 3:2 4:-1
7Row: 4 -> 4:2
8
9Writing matrix in Gnuplot format into the file matrix-writer-example.gplt ... OK
10Writing matrix pattern in EPS format into the file matrix-writer-example.eps ... OK
11Writing matrix in MTX format into the file matrix-writer-example.mtx ... OK
12Reading sparse matrix from MTX file matrix-writer-example.mtx ... OK
13Imported matrix is:
14Row: 0 -> 0:2
15Row: 1 -> 0:-1 1:2 2:-1
16Row: 2 -> 1:-1 2:2 3:-1
17Row: 3 -> 2:-1 3:2 4:-1
18Row: 4 -> 4:2
19
20Reading dense matrix from MTX file matrix-writer-example.mtx ... OK
21Imported matrix is:
22Row: 0 -> 0:2 1:0 2:0 3:0 4:0
23Row: 1 -> 0:-1 1:2 2:-1 3:0 4:0
24Row: 2 -> 0:0 1:-1 2:2 3:-1 4:0
25Row: 3 -> 0:0 1:0 2:-1 3:2 4:-1
26Row: 4 -> 0:0 1:0 2:0 3:0 4:2
27
28
29Creating matrices on CUDA GPU ...
30Matrix:
31Row: 0 -> 0:2
32Row: 1 -> 0:-1 1:2 2:-1
33Row: 2 -> 1:-1 2:2 3:-1
34Row: 3 -> 2:-1 3:2 4:-1
35Row: 4 -> 4:2
36
37Writing matrix in Gnuplot format into the file matrix-writer-example.gplt ... OK
38Writing matrix pattern in EPS format into the file matrix-writer-example.eps ... OK
39Writing matrix in MTX format into the file matrix-writer-example.mtx ... OK
40Reading sparse matrix from MTX file matrix-writer-example.mtx ... OK
41Imported matrix is:
42Row: 0 -> 0:2
43Row: 1 -> 0:-1 1:2 2:-1
44Row: 2 -> 1:-1 2:2 3:-1
45Row: 3 -> 2:-1 3:2 4:-1
46Row: 4 -> 4:2
47
48Reading dense matrix from MTX file matrix-writer-example.mtx ... OK
49Imported matrix is:
50Row: 0 -> 0:2 1:0 2:0 3:0 4:0
51Row: 1 -> 0:-1 1:2 2:-1 3:0 4:0
52Row: 2 -> 0:0 1:-1 2:2 3:-1 4:0
53Row: 3 -> 0:0 1:0 2:-1 3:2 4:-1
54Row: 4 -> 0:0 1:0 2:0 3:0 4:2
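For reference, a coordinate MTX file such as matrix-writer-example.mtx would, in the standard MatrixMarket coordinate format, look roughly as follows (1-based indices, one nonzero per line; the exact header comments and number formatting produced by the writer may differ):

```
%%MatrixMarket matrix coordinate real general
5 5 11
1 1 2
2 1 -1
2 2 2
2 3 -1
3 2 -1
3 3 2
3 4 -1
4 3 -1
4 4 2
4 5 -1
5 5 2
```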

Appendix

Benchmark of dense matrix setup

1#include <iostream>
2#include <TNL/Algorithms/parallelFor.h>
3#include <TNL/Containers/StaticArray.h>
4#include <TNL/Matrices/DenseMatrix.h>
5#include <TNL/Devices/Host.h>
6#include <TNL/Devices/Cuda.h>
7#include <TNL/Timer.h>
8
9const int testsCount = 5;
10
11template< typename Matrix >
12void
13setElement_on_host( const int matrixSize, Matrix& matrix )
14{
15 matrix.setDimensions( matrixSize, matrixSize );
16
17 for( int j = 0; j < matrixSize; j++ )
18 for( int i = 0; i < matrixSize; i++ )
19 matrix.setElement( i, j, i + j );
20}
21
22template< typename Matrix >
23void
24setElement_on_host_and_transfer( const int matrixSize, Matrix& matrix )
25{
26 using RealType = typename Matrix::RealType;
27 using IndexType = typename Matrix::IndexType;
28 using HostMatrix = TNL::Matrices::DenseMatrix< RealType, TNL::Devices::Host, IndexType >;
29 HostMatrix hostMatrix( matrixSize, matrixSize );
30
31 for( int j = 0; j < matrixSize; j++ )
32 for( int i = 0; i < matrixSize; i++ )
33 hostMatrix.setElement( i, j, i + j );
34 matrix = hostMatrix;
35}
36
37template< typename Matrix >
38void
39setElement_on_device( const int matrixSize, Matrix& matrix )
40{
41 matrix.setDimensions( matrixSize, matrixSize );
42
43 auto matrixView = matrix.getView();
44 auto f = [ = ] __cuda_callable__( const TNL::Containers::StaticArray< 2, int >& i ) mutable
45 {
46 matrixView.setElement( i[ 0 ], i[ 1 ], i[ 0 ] + i[ 1 ] );
47 };
48 const TNL::Containers::StaticArray< 2, int > begin = { 0, 0 };
49 const TNL::Containers::StaticArray< 2, int > end = { matrixSize, matrixSize };
50 TNL::Algorithms::parallelFor< typename Matrix::DeviceType >( begin, end, f );
51}
52
53template< typename Matrix >
54void
55getRow( const int matrixSize, Matrix& matrix )
56{
57 matrix.setDimensions( matrixSize, matrixSize );
58
59 auto matrixView = matrix.getView();
60 auto f = [ = ] __cuda_callable__( int rowIdx ) mutable
61 {
62 auto row = matrixView.getRow( rowIdx );
63 for( int i = 0; i < matrixSize; i++ )
64 row.setValue( i, rowIdx + i );
65 };
66 TNL::Algorithms::parallelFor< typename Matrix::DeviceType >( 0, matrixSize, f );
67}
68
69template< typename Matrix >
70void
71forElements( const int matrixSize, Matrix& matrix )
72{
73 matrix.setDimensions( matrixSize, matrixSize );
74
75 auto f = [ = ] __cuda_callable__( int rowIdx, int localIdx, int& columnIdx, float& value ) mutable
76 {
77 value = rowIdx + columnIdx;
78 };
79 matrix.forElements( 0, matrixSize, f );
80}
81
82template< typename Device >
83void
84setupDenseMatrix()
85{
86 std::cout << " Dense matrix test:" << std::endl;
87 for( int matrixSize = 16; matrixSize <= 8192; matrixSize *= 2 ) {
88 std::cout << " Matrix size = " << matrixSize << std::endl;
89 TNL::Timer timer;
90
91 std::cout << " setElement on host: ";
92 timer.reset();
93 timer.start();
94 for( int i = 0; i < testsCount; i++ ) {
95 TNL::Matrices::DenseMatrix< float, Device > matrix;
96 setElement_on_host( matrixSize, matrix );
97 }
98 timer.stop();
99 std::cout << timer.getRealTime() / (double) testsCount << " sec." << std::endl;
100
101 std::cout << " setElement on device: ";
102 timer.reset();
103 timer.start();
104 for( int i = 0; i < testsCount; i++ ) {
105 TNL::Matrices::DenseMatrix< float, Device > matrix;
106 setElement_on_device( matrixSize, matrix );
107 }
108 timer.stop();
109 std::cout << timer.getRealTime() / (double) testsCount << " sec." << std::endl;
110
111 if( std::is_same< Device, TNL::Devices::Cuda >::value ) {
112 std::cout << " setElement on host and transfer on GPU: ";
113 timer.reset();
114 timer.start();
115 for( int i = 0; i < testsCount; i++ ) {
116 TNL::Matrices::DenseMatrix< float, Device > matrix;
117 setElement_on_host_and_transfer( matrixSize, matrix );
118 }
119 timer.stop();
120 std::cout << timer.getRealTime() / (double) testsCount << " sec." << std::endl;
121 }
122
123 std::cout << " getRow: ";
124 timer.reset();
125 timer.start();
126 for( int i = 0; i < testsCount; i++ ) {
127 TNL::Matrices::DenseMatrix< float, Device > matrix;
128 getRow( matrixSize, matrix );
129 }
130 timer.stop();
131 std::cout << timer.getRealTime() / (double) testsCount << " sec." << std::endl;
132
133 std::cout << " forElements: ";
134 timer.reset();
135 timer.start();
136 for( int i = 0; i < testsCount; i++ ) {
137 TNL::Matrices::DenseMatrix< float, Device > matrix;
138 forElements( matrixSize, matrix );
139 }
140 timer.stop();
141 std::cout << timer.getRealTime() / (double) testsCount << " sec." << std::endl;
142 }
143}
144
145int
146main( int argc, char* argv[] )
147{
148 std::cout << "Creating dense matrix on CPU ... " << std::endl;
149 setupDenseMatrix< TNL::Devices::Host >();
150
151#ifdef __CUDACC__
152 std::cout << "Creating dense matrix on CUDA GPU ... " << std::endl;
153 setupDenseMatrix< TNL::Devices::Cuda >();
154#endif
155}

Benchmark of sparse matrix setup

1#include <iostream>
2#include <TNL/Algorithms/parallelFor.h>
3#include <TNL/Containers/StaticArray.h>
4#include <TNL/Matrices/SparseMatrix.h>
5#include <TNL/Devices/Host.h>
6#include <TNL/Devices/Cuda.h>
7#include <TNL/Timer.h>
8
9const int testsCount = 5;
10
11template< typename Matrix >
12void
13STL_Map( const int gridSize, Matrix& matrix )
14{
15 /***
16 * Set matrix representing approximation of the Laplace operator on regular
17 * grid using the finite difference method by means of STL map.
18 */
19 const int matrixSize = gridSize * gridSize;
20 matrix.setDimensions( matrixSize, matrixSize );
21 std::map< std::pair< int, int >, double > map;
22 for( int j = 0; j < gridSize; j++ )
23 for( int i = 0; i < gridSize; i++ ) {
24 const int rowIdx = j * gridSize + i;
25 if( i == 0 || j == 0 || i == gridSize - 1 || j == gridSize - 1 )
26 map.insert( std::make_pair( std::make_pair( rowIdx, rowIdx ), 1.0 ) );
27 else {
28 map.insert( std::make_pair( std::make_pair( rowIdx, rowIdx - gridSize ), 1.0 ) );
29 map.insert( std::make_pair( std::make_pair( rowIdx, rowIdx - 1 ), 1.0 ) );
30 map.insert( std::make_pair( std::make_pair( rowIdx, rowIdx ), -4.0 ) );
31 map.insert( std::make_pair( std::make_pair( rowIdx, rowIdx + 1 ), 1.0 ) );
32 map.insert( std::make_pair( std::make_pair( rowIdx, rowIdx + gridSize ), 1.0 ) );
33 }
34 }
35 matrix.setElements( map );
36}
37
38template< typename Matrix >
39void
40setElement_on_host( const int gridSize, Matrix& matrix )
41{
42 /***
43 * Set matrix representing approximation of the Laplace operator on regular
44 * grid using the finite difference method by means setElement method called
45 * from the host system.
46 */
47 const int matrixSize = gridSize * gridSize;
48 TNL::Containers::Vector< int, typename Matrix::DeviceType > rowCapacities( matrixSize, 5 );
49 matrix.setDimensions( matrixSize, matrixSize );
50 matrix.setRowCapacities( rowCapacities );
51
52 for( int j = 0; j < gridSize; j++ )
53 for( int i = 0; i < gridSize; i++ ) {
54 const int rowIdx = j * gridSize + i;
55 if( i == 0 || j == 0 || i == gridSize - 1 || j == gridSize - 1 )
56 matrix.setElement( rowIdx, rowIdx, 1.0 );
57 else {
58 matrix.setElement( rowIdx, rowIdx - gridSize, 1.0 );
59 matrix.setElement( rowIdx, rowIdx - 1, 1.0 );
60 matrix.setElement( rowIdx, rowIdx, -4.0 );
61 matrix.setElement( rowIdx, rowIdx + 1, 1.0 );
62 matrix.setElement( rowIdx, rowIdx + gridSize, 1.0 );
63 }
64 }
65}
66
67template< typename Matrix >
68void
69setElement_on_host_and_transfer( const int gridSize, Matrix& matrix )
70{
71 using RealType = typename Matrix::RealType;
72 using HostMatrix = typename Matrix::template Self< RealType, TNL::Devices::Host >;
73
74 const int matrixSize = gridSize * gridSize;
75 TNL::Containers::Vector< int, TNL::Devices::Host > rowCapacities( matrixSize, 5 );
76 HostMatrix hostMatrix( matrixSize, matrixSize );
77 hostMatrix.setRowCapacities( rowCapacities );
78
79 for( int j = 0; j < gridSize; j++ )
80 for( int i = 0; i < gridSize; i++ ) {
81 const int rowIdx = j * gridSize + i;
82 if( i == 0 || j == 0 || i == gridSize - 1 || j == gridSize - 1 )
83 hostMatrix.setElement( rowIdx, rowIdx, 1.0 );
84 else {
85 hostMatrix.setElement( rowIdx, rowIdx - gridSize, 1.0 );
86 hostMatrix.setElement( rowIdx, rowIdx - 1, 1.0 );
87 hostMatrix.setElement( rowIdx, rowIdx, -4.0 );
88 hostMatrix.setElement( rowIdx, rowIdx + 1, 1.0 );
89 hostMatrix.setElement( rowIdx, rowIdx + gridSize, 1.0 );
90 }
91 }
92 matrix = hostMatrix;
93}
94
95template< typename Matrix >
96void
97setElement_on_device( const int gridSize, Matrix& matrix )
98{
99 /***
100 * Set matrix representing approximation of the Laplace operator on regular
101 * grid using the finite difference method by means of setElement method called
102 * from the native device.
103 */
104 const int matrixSize = gridSize * gridSize;
105 TNL::Containers::Vector< int, typename Matrix::DeviceType > rowCapacities( matrixSize, 5 );
106 matrix.setDimensions( matrixSize, matrixSize );
107 matrix.setRowCapacities( rowCapacities );
108
109 auto matrixView = matrix.getView();
110 auto f = [ = ] __cuda_callable__( const TNL::Containers::StaticArray< 2, int >& i ) mutable
111 {
112 const int rowIdx = i[ 1 ] * gridSize + i[ 0 ];
113 if( i[ 0 ] == 0 || i[ 1 ] == 0 || i[ 0 ] == gridSize - 1 || i[ 1 ] == gridSize - 1 )
114 matrixView.setElement( rowIdx, rowIdx, 1.0 );
115 else {
116 matrixView.setElement( rowIdx, rowIdx - gridSize, 1.0 );
117 matrixView.setElement( rowIdx, rowIdx - 1, 1.0 );
118 matrixView.setElement( rowIdx, rowIdx, -4.0 );
119 matrixView.setElement( rowIdx, rowIdx + 1, 1.0 );
120 matrixView.setElement( rowIdx, rowIdx + gridSize, 1.0 );
121 }
122 };
123 const TNL::Containers::StaticArray< 2, int > begin = { 0, 0 };
124 const TNL::Containers::StaticArray< 2, int > end = { gridSize, gridSize };
125 TNL::Algorithms::parallelFor< typename Matrix::DeviceType >( begin, end, f );
126}
127
128template< typename Matrix >
129void
130getRow( const int gridSize, Matrix& matrix )
131{
132 /***
133 * Set matrix representing approximation of the Laplace operator on regular
134 * grid using the finite difference method by means of getRow method.
135 */
136 const int matrixSize = gridSize * gridSize;
137 TNL::Containers::Vector< int, typename Matrix::DeviceType > rowCapacities( matrixSize, 5 );
138 matrix.setDimensions( matrixSize, matrixSize );
139 matrix.setRowCapacities( rowCapacities );
140
141 auto matrixView = matrix.getView();
142 auto f = [ = ] __cuda_callable__( int rowIdx ) mutable
143 {
144 const int i = rowIdx % gridSize;
145 const int j = rowIdx / gridSize;
146 auto row = matrixView.getRow( rowIdx );
147 if( i == 0 || j == 0 || i == gridSize - 1 || j == gridSize - 1 )
148 row.setElement( 2, rowIdx, 1.0 );
149 else {
150 row.setElement( 0, rowIdx - gridSize, 1.0 );
151 row.setElement( 1, rowIdx - 1, 1.0 );
152 row.setElement( 2, rowIdx, -4.0 );
153 row.setElement( 3, rowIdx + 1, 1.0 );
154 row.setElement( 4, rowIdx + gridSize, 1.0 );
155 }
156 };
157 TNL::Algorithms::parallelFor< typename Matrix::DeviceType >( 0, matrixSize, f );
158}
159
160template< typename Matrix >
161void
162forElements( const int gridSize, Matrix& matrix )
163{
164 /***
165 * Set matrix representing approximation of the Laplace operator on regular
166 * grid using the finite difference method by means of forElements method.
167 */
168
169 const int matrixSize = gridSize * gridSize;
170 TNL::Containers::Vector< int, typename Matrix::DeviceType > rowCapacities( matrixSize, 5 );
171 matrix.setDimensions( matrixSize, matrixSize );
172 matrix.setRowCapacities( rowCapacities );
173
174 auto f = [ = ] __cuda_callable__( int rowIdx, int localIdx, int& columnIdx, float& value ) mutable
175 {
176 const int i = rowIdx % gridSize;
177 const int j = rowIdx / gridSize;
178 if( ( i == 0 || j == 0 || i == gridSize - 1 || j == gridSize - 1 ) && localIdx == 0 ) {
179 columnIdx = rowIdx;
180 value = 1.0;
181 }
182 else {
183 switch( localIdx ) {
184 case 0:
185 columnIdx = rowIdx - gridSize;
186 value = 1.0;
187 break;
188 case 1:
189 columnIdx = rowIdx - 1;
190 value = 1.0;
191 break;
192 case 2:
193 columnIdx = rowIdx;
194 value = -4.0;
195 break;
196 case 3:
197 columnIdx = rowIdx + 1;
198 value = 1.0;
199 break;
200 case 4:
201 columnIdx = rowIdx + gridSize;
202 value = 1.0;
203 break;
204 }
205 }
206 };
207 matrix.forElements( 0, matrixSize, f );
208}
209
210template< typename Device >
211void
212laplaceOperatorSparseMatrix()
213{
214 std::cout << " Sparse matrix test:" << std::endl;
215 for( int gridSize = 16; gridSize <= 8192; gridSize *= 2 ) {
216 std::cout << " Grid size = " << gridSize << std::endl;
217 TNL::Timer timer;
218
219 std::cout << " STL map: ";
220 timer.reset();
221 timer.start();
222 for( int i = 0; i < testsCount; i++ ) {
223 TNL::Matrices::SparseMatrix< float, Device > matrix;
224 STL_Map( gridSize, matrix );
225 }
226 timer.stop();
227 std::cout << timer.getRealTime() / (double) testsCount << " sec." << std::endl;
228
229 std::cout << " setElement on host: ";
230 timer.reset();
231 timer.start();
232 for( int i = 0; i < testsCount; i++ ) {
233 TNL::Matrices::SparseMatrix< float, Device > matrix;
234 setElement_on_host( gridSize, matrix );
235 }
236 timer.stop();
237 std::cout << timer.getRealTime() / (double) testsCount << " sec." << std::endl;
238
239 if( std::is_same< Device, TNL::Devices::Cuda >::value ) {
240 std::cout << " setElement on host and transfer on GPU: ";
241 timer.reset();
242 timer.start();
243 for( int i = 0; i < testsCount; i++ ) {
244 TNL::Matrices::SparseMatrix< float, Device > matrix;
245 setElement_on_host_and_transfer( gridSize, matrix );
246 }
247 timer.stop();
248 std::cout << timer.getRealTime() / (double) testsCount << " sec." << std::endl;
249 }
250
251 std::cout << " setElement on device: ";
252 timer.reset();
253 timer.start();
254 for( int i = 0; i < testsCount; i++ ) {
255 TNL::Matrices::SparseMatrix< float, Device > matrix;
256 setElement_on_device( gridSize, matrix );
257 }
258 timer.stop();
259 std::cout << timer.getRealTime() / (double) testsCount << " sec." << std::endl;
260
261 std::cout << " getRow: ";
262 timer.reset();
263 timer.start();
264 for( int i = 0; i < testsCount; i++ ) {
265 TNL::Matrices::SparseMatrix< float, Device > matrix;
266 getRow( gridSize, matrix );
267 }
268 timer.stop();
269 std::cout << timer.getRealTime() / (double) testsCount << " sec." << std::endl;
270
271 std::cout << " forElements: ";
272 timer.reset();
273 timer.start();
274 for( int i = 0; i < testsCount; i++ ) {
275 TNL::Matrices::SparseMatrix< float, Device > matrix;
276 forElements( gridSize, matrix );
277 }
278 timer.stop();
279 std::cout << timer.getRealTime() / (double) testsCount << " sec." << std::endl;
280 }
281}
282
283int
284main( int argc, char* argv[] )
285{
286 std::cout << "Creating Laplace operator matrix on CPU ... " << std::endl;
287 laplaceOperatorSparseMatrix< TNL::Devices::Host >();
288
289#ifdef __CUDACC__
290 std::cout << "Creating Laplace operator matrix on CUDA GPU ... " << std::endl;
291 laplaceOperatorSparseMatrix< TNL::Devices::Cuda >();
292#endif
293}

Benchmark of multidiagonal matrix setup

1#include <iostream>
2#include <TNL/Algorithms/parallelFor.h>
3#include <TNL/Containers/StaticArray.h>
4#include <TNL/Matrices/MultidiagonalMatrix.h>
5#include <TNL/Devices/Host.h>
6#include <TNL/Devices/Cuda.h>
7#include <TNL/Timer.h>
8
9const int testsCount = 5;
10
11template< typename Device >
12TNL::Containers::Vector< int, Device >
13getOffsets( const int gridSize )
14{
15 TNL::Containers::Vector< int, Device > offsets( 5 );
16 offsets.setElement( 0, -gridSize );
17 offsets.setElement( 1, -1 );
18 offsets.setElement( 2, 0 );
19 offsets.setElement( 3, 1 );
20 offsets.setElement( 4, gridSize );
21 return offsets;
22}
23
24template< typename Matrix >
25void
26setElement_on_host( const int gridSize, Matrix& matrix )
27{
28 /***
29 * Set matrix representing approximation of the Laplace operator on regular
30 * grid using the finite difference method by means setElement method called
31 * from the host system.
32 */
33 const int matrixSize = gridSize * gridSize;
34 matrix.setDimensions( matrixSize, matrixSize, getOffsets< typename Matrix::DeviceType >( gridSize ) );
35
36 for( int j = 0; j < gridSize; j++ )
37 for( int i = 0; i < gridSize; i++ ) {
38 const int rowIdx = j * gridSize + i;
39 if( i == 0 || j == 0 || i == gridSize - 1 || j == gridSize - 1 )
40 matrix.setElement( rowIdx, rowIdx, 1.0 );
41 else {
42 matrix.setElement( rowIdx, rowIdx - gridSize, 1.0 );
43 matrix.setElement( rowIdx, rowIdx - 1, 1.0 );
44 matrix.setElement( rowIdx, rowIdx, -4.0 );
45 matrix.setElement( rowIdx, rowIdx + 1, 1.0 );
46 matrix.setElement( rowIdx, rowIdx + gridSize, 1.0 );
47 }
48 }
49}
50
51template< typename Matrix >
52void
53setElement_on_host_and_transfer( const int gridSize, Matrix& matrix )
54{
55 using RealType = typename Matrix::RealType;
56 using IndexType = typename Matrix::IndexType;
57 using HostMatrix = TNL::Matrices::MultidiagonalMatrix< RealType, TNL::Devices::Host, IndexType >;
58 const int matrixSize = gridSize * gridSize;
59 HostMatrix hostMatrix( matrixSize, matrixSize, getOffsets< typename Matrix::DeviceType >( gridSize ) );
60
61 for( int j = 0; j < gridSize; j++ )
62 for( int i = 0; i < gridSize; i++ ) {
63 const int rowIdx = j * gridSize + i;
64 if( i == 0 || j == 0 || i == gridSize - 1 || j == gridSize - 1 )
65 hostMatrix.setElement( rowIdx, rowIdx, 1.0 );
66 else {
67 hostMatrix.setElement( rowIdx, rowIdx - gridSize, 1.0 );
68 hostMatrix.setElement( rowIdx, rowIdx - 1, 1.0 );
69 hostMatrix.setElement( rowIdx, rowIdx, -4.0 );
70 hostMatrix.setElement( rowIdx, rowIdx + 1, 1.0 );
71 hostMatrix.setElement( rowIdx, rowIdx + gridSize, 1.0 );
72 }
73 }
74 matrix = hostMatrix;
75}
76
77template< typename Matrix >
78void
79setElement_on_device( const int gridSize, Matrix& matrix )
80{
81 /***
82 * Set matrix representing approximation of the Laplace operator on regular
83 * grid using the finite difference method by means of setElement method called
84 * from the native device.
85 */
86 const int matrixSize = gridSize * gridSize;
87 matrix.setDimensions( matrixSize, matrixSize, getOffsets< typename Matrix::DeviceType >( gridSize ) );
88
89 auto matrixView = matrix.getView();
90 auto f = [ = ] __cuda_callable__( const TNL::Containers::StaticArray< 2, int >& i ) mutable
91 {
92 const int rowIdx = i[ 1 ] * gridSize + i[ 0 ];
93 if( i[ 0 ] == 0 || i[ 1 ] == 0 || i[ 0 ] == gridSize - 1 || i[ 1 ] == gridSize - 1 )
94 matrixView.setElement( rowIdx, rowIdx, 1.0 );
95 else {
96 matrixView.setElement( rowIdx, rowIdx - gridSize, 1.0 );
97 matrixView.setElement( rowIdx, rowIdx - 1, 1.0 );
98 matrixView.setElement( rowIdx, rowIdx, -4.0 );
99 matrixView.setElement( rowIdx, rowIdx + 1, 1.0 );
100 matrixView.setElement( rowIdx, rowIdx + gridSize, 1.0 );
101 }
102 };
103 const TNL::Containers::StaticArray< 2, int > begin = { 0, 0 };
104 const TNL::Containers::StaticArray< 2, int > end = { gridSize, gridSize };
105 TNL::Algorithms::parallelFor< typename Matrix::DeviceType >( begin, end, f );
106}
107
108template< typename Matrix >
109void
110getRow( const int gridSize, Matrix& matrix )
111{
112 /***
113 * Set matrix representing approximation of the Laplace operator on regular
114 * grid using the finite difference method by means of getRow method.
115 */
116 const int matrixSize = gridSize * gridSize;
117 matrix.setDimensions( matrixSize, matrixSize, getOffsets< typename Matrix::DeviceType >( gridSize ) );
118
119 auto matrixView = matrix.getView();
120 auto f = [ = ] __cuda_callable__( int rowIdx ) mutable
121 {
122 const int i = rowIdx % gridSize;
123 const int j = rowIdx / gridSize;
124 auto row = matrixView.getRow( rowIdx );
125 if( i == 0 || j == 0 || i == gridSize - 1 || j == gridSize - 1 )
126 row.setElement( 2, 1.0 );
127 else {
128 row.setElement( 0, 1.0 );
129 row.setElement( 1, 1.0 );
130 row.setElement( 2, -4.0 );
131 row.setElement( 3, 1.0 );
132 row.setElement( 4, 1.0 );
133 }
134 };
135 TNL::Algorithms::parallelFor< typename Matrix::DeviceType >( 0, matrixSize, f );
136}
137
138template< typename Matrix >
139void
140forElements( const int gridSize, Matrix& matrix )
141{
142 /***
143 * Set matrix representing approximation of the Laplace operator on regular
144 * grid using the finite difference method by means of forElements method.
145 */
146
147 const int matrixSize = gridSize * gridSize;
148 matrix.setDimensions( matrixSize, matrixSize, getOffsets< typename Matrix::DeviceType >( gridSize ) );
149
150 auto f = [ = ] __cuda_callable__( int rowIdx, int localIdx, int columnIdx, float& value ) mutable
151 {
152 const int i = rowIdx % gridSize;
153 const int j = rowIdx / gridSize;
154 if( ( i == 0 || j == 0 || i == gridSize - 1 || j == gridSize - 1 ) && localIdx == 0 ) {
155 value = 1.0;
156 }
157 else {
158 if( localIdx == 2 )
159 value = -4.0;
160 else
161 value = 1.0;
162 }
163 };
164 matrix.forElements( 0, matrixSize, f );
165}
166
167template< typename Device >
168void
169laplaceOperatorMultidiagonalMatrix()
170{
171 std::cout << " Multidiagonal matrix test:" << std::endl;
172 for( int gridSize = 16; gridSize <= 8192; gridSize *= 2 ) {
173 std::cout << " Grid size = " << gridSize << std::endl;
174 TNL::Timer timer;
175
176 std::cout << " setElement on host: ";
177 timer.reset();
178 timer.start();
179 for( int i = 0; i < testsCount; i++ ) {
180 TNL::Matrices::MultidiagonalMatrix< float, Device > matrix;
181 setElement_on_host( gridSize, matrix );
182 }
183 timer.stop();
184 std::cout << timer.getRealTime() / (double) testsCount << " sec." << std::endl;
185
186 if( std::is_same< Device, TNL::Devices::Cuda >::value ) {
187 std::cout << " setElement on host and transfer on GPU: ";
188 timer.reset();
189 timer.start();
190 for( int i = 0; i < testsCount; i++ ) {
191 TNL::Matrices::MultidiagonalMatrix< float, Device > matrix;
192 setElement_on_host_and_transfer( gridSize, matrix );
193 }
194 timer.stop();
195 std::cout << timer.getRealTime() / (double) testsCount << " sec." << std::endl;
196 }
197
198 std::cout << " setElement on device: ";
199 timer.reset();
200 timer.start();
201 for( int i = 0; i < testsCount; i++ ) {
202 TNL::Matrices::MultidiagonalMatrix< float, Device > matrix;
203 setElement_on_device( gridSize, matrix );
204 }
205 timer.stop();
206 std::cout << timer.getRealTime() / (double) testsCount << " sec." << std::endl;
207
208 std::cout << " getRow: ";
209 timer.reset();
210 timer.start();
211 for( int i = 0; i < testsCount; i++ ) {
212 TNL::Matrices::MultidiagonalMatrix< float, Device > matrix;
213 getRow( gridSize, matrix );
214 }
215 timer.stop();
216 std::cout << timer.getRealTime() / (double) testsCount << " sec." << std::endl;
217
218 std::cout << " forElements: ";
219 timer.reset();
220 timer.start();
221 for( int i = 0; i < testsCount; i++ ) {
222 TNL::Matrices::MultidiagonalMatrix< float, Device > matrix;
223 forElements( gridSize, matrix );
224 }
225 timer.stop();
226 std::cout << timer.getRealTime() / (double) testsCount << " sec." << std::endl;
227 }
228}
229
230int
231main( int argc, char* argv[] )
232{
233 std::cout << "Creating Laplace operator matrix on CPU ... " << std::endl;
234 laplaceOperatorMultidiagonalMatrix< TNL::Devices::Host >();
235
236#ifdef __CUDACC__
237 std::cout << "Creating Laplace operator matrix on CUDA GPU ... " << std::endl;
238 laplaceOperatorMultidiagonalMatrix< TNL::Devices::Cuda >();
239#endif
240}