Template Numerical Library version\ main:eacc201d
Searching...
No Matches
Matrices tutorial

TODO: Add description of forRows and sequentialForRows.

# Introduction

TNL offers several types of matrices like dense (TNL::Matrices::DenseMatrix), sparse (TNL::Matrices::SparseMatrix), tridiagonal (TNL::Matrices::TridiagonalMatrix), multidiagonal (TNL::Matrices::MultidiagonalMatrix) and lambda matrices (TNL::Matrices::LambdaMatrix). The sparse matrices can be symmetric to lower the memory requirements. The interfaces of given matrix types are designed to be as unified as possible to ensure that the user can easily switch between different matrix types while making no or only a little changes in the source code. All matrix types allows traversing all matrix elements and manipulate them using lambda functions as well as performing flexible reduction in matrix rows. The following text describes particular matrix types and their unified interface in details.

# Overview of matrix types

In a lot of numerical algorithms either dense or sparse matrices are used. The dense matrix (TNL::Matrices::DenseMatrix) is such that all or at least most of its matrix elements are nonzero. On the other hand sparse matrix (TNL::Matrices::SparseMatrix) is a matrix which has most of the matrix elements equal to zero. From the implementation point of view, the data structures for the dense matrices allocates all matrix elements while formats for the sparse matrices aim to store explicitly only the nonzero matrix elements. The most popular format for storing the sparse matrices is CSR format. However, especially for better data alignment in memory of GPUs, many other formats were designed. In TNL, the user may choose between several different sparse matrix formats. There are also sparse matrices with specific pattern of the nonzero elements like tridiagonal matrices (TNL::Matrices::TridiagonalMatrix) which "has nonzero elements on the main diagonal, the first diagonal below this, and the first diagonal above the main diagonal only". An example of such matrix may look as follows:

$\left( \begin{array}{ccccccc} -2 & 1 & . & . & . & . \\ 1 & -2 & 1 & . & . & . \\ . & 1 & -2 & 1 & . & . \\ . & . & 1 & -2 & 1 & . \\ . & . & . & 1 & -2 & 1 \\ . & . & . & . & 1 & -2 \end{array} \right)$

Similar but more general type of matrices are multidiagonal matrices (TNL::Matrices::MultidiagonalMatrix) which have the nonzero matrix elements positioned only on lines parallel to the main diagonal like the following matrix:

$\left( \begin{array}{ccccccc} -4 & 1 & . & 1 & . & . \\ 1 & -4 & 1 & . & 1 & . \\ . & 1 & -4 & 1 & . & 1 \\ 1 & . & 1 & -4 & 1 & . \\ . & 1 & . & 1 & -4 & 1 \\ . & . & 1 & . & 1 & -4 \end{array} \right)$

Finally, TNL offers so called lambda matrices (TNL::Matrices::LambdaMatrix) which are kind of "matrix-free matrices". They do not store the matrix elements explicitly in the memory, but rather evaluates them on-the-fly based on user defined lambda functions.

In the following table we show comparison of expressing a tridiagonal matrix by means of different matrix types.

Matrix dimensions Dense elems. Dense mem. Sparse elems. Sparse mem. Tridiag. elems. Tridiag. mem. Multidiag. elems. Mutlidiag. mem.
10x10 100 800 B >28 >336 B 30 240 B 30 252 B
100x100 10,000 80 kB >298 >3,576 B 300 2,400 B 300 2,412 B
1,000x1,000 1,000,000 8 MB >2,998 >35,976 B 3,000 24,000 B 3,000 24,012 B
10,000x10,000 100,000,000 800 MB >29,998 >359,976 B 30,000 240,000 B 30,000 240,012 B
100,000x100,000 10,000,000,000 80 GB >299,998 >3,599,876 B 300,000 2,400,000 B 300,000 2,400,012 B

In the table:

• Matrix dimensions is the number of matrix rows and columns
• Dense elems. is the number of allocated matrix elements in the dense matrix.
• Dense mem. is the allocated memory for the matrix elements in the dense matrix if the elements are stored in the double precision.
• Sparse elems. is the number of allocated matrix elements in the sparse matrix. Some formats may allocate padding zeros for better data alignment in the memory and so the number of allocated matrix elements may increase.
• Sparse mem. is the allocated memory for the matrix elements in the sparse matrix if the elements are stored in the double precision and column indexes in 32-bit integer.
• Tridiag. elems is the number of allocated matrix elements in the tridiagonal matrix.
• Tridiag mem. is the allocated memory for the matrix elements in the tridiagonal matrix if the elements are stored in the double precision.
• Multidiag. elems is the number of allocated matrix elements in the multidiagonal matrix.
• Multidiag mem. is the allocated memory for the matrix elements in the multidiagonal matrix if the elements are stored in the double precision.

Choosing the best matrix type can have tremendous impact on the performance but also on memory requirements. If we would treat each matrix as dense one we would not be able to to work with matrices larger than 50,000x50,000 on common personal computers, because we would need tens of gibabytes of memory. At the same time, we see that the other matrix types can do the same job with only few megabytes. In addition, other matrix types work with much less matrix elements and so operations like matrix-vector multiplication can be done with significantly less operations which means much faster. Since in the modern hardware architectures, the computing units are limited mainly by the performance of the memory chips (so called memory wall), transferring less data from the memory increases the performance even more.

The following table shows the same as the one above but when storing a matrix which has only five nonzero elements in each row. Such matrices arise often from the finite difference method for solution of the partial differential equations:

Matrix dimensions Dense elems. Dense mem. Sparse elems. Sparse mem. Multidiag. elems. Mutlidiag. mem.
10x10 100 800 B >50 >600 B 50 420 B
100x100 10,000 80 kB >500 >6,000 B 500 4,020 B
1,000x1,000 1,000,000 8 MB >5,000 >60,000 B 5,000 40,020 B
10,000x10,000 100,000,000 800 MB >50,000 >600,000 B 50,000 400,020 B
100,000x100,000 10,000,000,000 80 GB >500,000 >6,000,000 B 500,000 4,000,020 B

There is no change in the dense matrix part of the table. The numbers grow proportionally in case of sparse and mutlidiagonal matrix. General sparse matrix formats need to store column indexes for each matrix element which is not true for the multidiagonal matrix. The following table shows how many bytes we need for storing of one matrix element with different matrix types depending on the type of the matrix elements (Real) and column indexes (Index):

Real Index Dense matrix Multidiagonal matrix Sparse matrix Fill ratio
float 32-bit 4 B 4 B 8 B << 50%
float 64-bit 4 B 4 B 12 B << 30%
double 32-bit 8 B 8 B 12 B << 60%
double 64-bit 8 B 8 B 16 B << 50%

In this table:

• Real is matrix element type.
• Index is column index type.
• Dense matrix is number of bytes needed to store one matrix element in the dense matrix.
• Multidiagonal matrix is number of bytes needed to store one matrix element in the mutldiagonal matrix.
• Sparse matrix is number of bytes needed to store one matrix element in the sparse matrix.
• Fill ratio is maximal percentage of the nonzero matrix elements until which the sparse matrix can perform better.

The multidiagonal matrix type is especially suitable for the finite difference method or similar numerical methods for solution of the partial differential equations.

# Indexing of nonzero matrix elements in sparse matrices

The sparse matrix formats usually, in the first step, compress the matrix rows by omitting the zero matrix elements as follows

$\left( \begin{array}{ccccc} 0 & 1 & 0 & 2 & 0 \\ 0 & 0 & 5 & 0 & 0 \\ 4 & 0 & 0 & 0 & 7 \\ 0 & 3 & 0 & 8 & 5 \\ 0 & 5 & 7 & 0 & 0 \end{array} \right) \rightarrow \left( \begin{array}{ccccc} 1 & 2 & . & . & . \\ 5 & . & . & . & . \\ 4 & 7 & . & . & . \\ 3 & 8 & 5 & . & . \\ 5 & 7 & . & . & . \end{array} \right)$

In such a form, it is more efficient to refer the nonzero matrix elements in given row by their rank in the compressed matrix row rather than by their column index in the original matrix. In methods for the sparse matrices, this parameter is called localIdx. Some sparse matrix formats add padding zeros for better alignment of data in memory. But if this is not the case, the variable localIdx of particular matrix elements would read as:

$\left( \begin{array}{ccccc} 0 & 1 & . & . & . \\ 0 & . & . & . & . \\ 0 & 1 & . & . & . \\ 0 & 1 & 2 & . & . \\ 0 & 1 & . & . & . \end{array} \right)$

# Matrix view

Matrix views are small reference objects which help accessing the matrix in GPU kernels or lambda functions being executed on GPUs. We describe this in details in section about Shared pointers and views. The problem lies in fact that we cannot pass references to GPU kernels and we do not want to pass there deep copies of matrices. Matrix view is some kind of reference to a matrix. A copy of matrix view is always shallow and so it behaves like a reference. The following example shows how to obtain the matrix view by means of method getView and pass it to a lambda function:

1#include <iostream>
2#include <TNL/Algorithms/ParallelFor.h>
3#include <TNL/Matrices/SparseMatrix.h>
4#include <TNL/Devices/Host.h>
5#include <TNL/Devices/Cuda.h>
6#include <TNL/Pointers/SharedPointer.h>
7
8template< typename Device >
9void getRowExample()
10{
11 /***
12 * Set the following matrix (dots represent zero matrix elements):
13 *
14 * / 2 . . . . \
15 * | 1 2 1 . . |
16 * | . 1 2 1. . |
17 * | . . 1 2 1 |
18 * \ . . . . 2 /
19 */
20 const int size = 5;
21 TNL::Matrices::SparseMatrix< double, Device > matrix( { 1, 3, 3, 3, 1 }, size );
22 auto view = matrix.getView();
23
24 /***
25 * Set the matrix elements.
26 */
27 auto f = [=] __cuda_callable__ ( int rowIdx ) mutable {
28 auto row = view.getRow( rowIdx );
29 if( rowIdx == 0 )
30 row.setElement( 0, rowIdx, 2.0 ); // diagonal element
31 else if( rowIdx == size - 1 )
32 row.setElement( 0, rowIdx, 2.0 ); // diagonal element
33 else
34 {
35 row.setElement( 0, rowIdx - 1, 1.0 ); // elements below the diagonal
36 row.setElement( 1, rowIdx, 2.0 ); // diagonal element
37 row.setElement( 2, rowIdx + 1, 1.0 ); // elements above the diagonal
38 }
39 };
40 TNL::Algorithms::ParallelFor< Device >::exec( 0, matrix.getRows(), f );
41 std::cout << matrix << std::endl;
42}
43
44int main( int argc, char* argv[] )
45{
46 std::cout << "Getting matrix rows on host: " << std::endl;
47 getRowExample< TNL::Devices::Host >();
48
49#ifdef HAVE_CUDA
50 std::cout << "Getting matrix rows on CUDA device: " << std::endl;
51 getRowExample< TNL::Devices::Cuda >();
52#endif
53}
#define __cuda_callable__
Definition: CudaCallable.h:22
Implementation of sparse matrix, i.e. matrix storing only non-zero elements.
Definition: SparseMatrix.h:57
ViewType getView()
Returns a modifiable view of the sparse matrix.
Definition: SparseMatrix.hpp:156
__cuda_callable__ ConstRowView getRow(IndexType rowIdx) const
Constant getter of simple structure for accessing given matrix row.
Definition: SparseMatrix.hpp:468
T endl(T... args)
static void exec(Index start, Index end, Function f, FunctionArgs... args)
Static method for the execution of the loop.
Definition: ParallelFor.h:70

Here we create sparse matrix matrix on the line 11, and use the method getView to get the matrix view on the line 12. The view is then used in the lambda function on the line 15 where it is captured by value (see [=] in the definition of the lambda function f on the line 14).

# Allocation and setup of different matrix types

There are several ways how to create a new matrix:

1. Initializer lists allow to create matrix from the C++ initializer lists. The matrix elements must be therefore encoded in the source code and so it is useful for rather smaller matrices. Methods and constructors with initializer lists are user friendly and simple to use. It is a good choice for tool problems with small matrices.
2. STL map can be used for creation of sparse matrices only. The user first insert all matrix elements together with their coordinates into std::map based on which the sparse matrix is created in the next step. It is simple and user friendly approach suitable for creation of large matrices. An advantage is that we do not need to know the distribution of the nonzero matrix elements in matrix rows in advance like we do in other ways of construction of sparse matrices. This makes the use of STL map suitable for combining of sparse matrices from TNL with other numerical packages. However, the sparse matrix is constructed on the host and then copied on GPU if necessary. Therefore, this approach is not a good choice if fast and efficient matrix construction is required.
3. Methods setElement and addElement called from the host allow to change particular matrix elements. The methods can be called from host even for matrices allocated on GPU. In this case, however, the matrix elements are transferred on GPU one by one which is very inefficient. If the matrix is allocated on the host system (CPU), the efficiency is nearly optimal. In case of sparse matrices, one must set row capacities (i.e. maximal number of nonzero elements in each row) before using these methods. If the row capacity is exceeded, the matrix has to be reallocated and all matrix elements are lost.
4. Methods setElement and addElement called on the host and copy matrix on GPU setting particular matrix elements by the methods setElement and addElement when the matrix is allocated on GPU can be time consuming for large matrices. Setting up the matrix on CPU using the same methods and copying it on GPU at once when the setup is finished can be significantly more efficient. A drawback is that we need to allocate temporarily whole matrix on CPU.
5. Methods setElement and addElement called from native device allow to do efficient matrix elements setup even on devices (GPUs). In this case, the methods must be called from a GPU kernel or a lambda function combined with the parallel for (TNL::Algorithms::ParallelFor). The user get very good performance even when manipulating matrix allocated on GPU. On the other hand, only data structures allocated on GPUs can be accessed from the kernel or lambda function. The matrix can be accessed in the GPU kernel or lambda function by means of matrix view (see the previous section) or the shared pointer (see TNL::Pointers::SharedPointer).
6. Method getRow combined with ParallelFor is very similar to the previous one. The difference is that we first fetch helper object called matrix row which is linked to particular matrix row. Using methods of this object, one may change the matrix elements in given matrix row. An advantage is that the access to the matrix row is resolved only once for all elements in the row. In some more sophisticated sparse matrix formats, this can be nontrivial operation and this approach may slightly improve the performance. Another advantage for sparse matrices is that we access the matrix elements based on their local index ('localIdx', see [Indexing of nonzero matrix elements in sparse matrices](indexing_of_nonzero_matrix_elements_in_sparse_matrices)) in the row which is something like a rank of the nonzero element in the row. This is more efficient than addressing the matrix elements by the column indexes which requires searching in the matrix row. So this may significantly improve the performance of setup of sparse matrices. When it comes to dense matrices, there should not be great difference in performance compared to use of the methods setElement and getElement. Note that when the method is called from a GPU kernel or a lambda function, only data structures allocated on GPU can be accessed and the matrix must be made accessible by the means of matrix view.
7. Methods forRows and forElements this approach is very similar to the previous one but it avoids using ParallelFor and necessity of passing the matrix to GPU kernels by matrix view or shared pointers.

The following table shows pros and cons of particular methods:

Method Efficient Easy to use Pros Cons
Initializer list ** ***** Very easy to use. Only for small matrices.
Does not need setting of matrix rows capacities
STL map ** ***** Very easy to use. Higher memory requirements.
Does not need setting of matrix rows capacities Slow transfer on GPU.
[set,add]Element on host ****/* ***** Very easy to use. Requires setting of row capacities.
Extremely slow transfer on GPU.
[set,and]Element on host&copy on GPU *** **** Easy to use. Requires setting of row capacities.
Reasonable efficiency. Allocation of auxiliary matrix on CPU.
[set,add]Element on native device **** Good efficiency. Requires setting of row capacities.
Requires writing GPU kernel or lambda function.
Allows accessing only data allocated on the same device/memory space.
getRow and ParallelFor ***** ** Best efficiency for sparse matrices. Requires setting of row capacities.
Requires writing GPU kernel or lambda function.
Allows accessing only data allocated on the same device/memory space.
Use of matrix local indexes can be less intuitive.
forRows, forElements ***** ** Best efficiency for sparse matrices. Requires setting of row capacities.
Avoid use of matrix view or shared pointer in kernels/lambda function. Requires writing GPU kernel or lambda function.
Allows accessing only data allocated on the same device/memory space.
Use of matrix local indexes is less intuitive.

Though it may seem that the later methods come with more cons than pros, they offer much higher performance and we believe that even they are still user friendly. On the other hand, if the matrix setup performance is not a priority, the use easy-to-use but slow method can still be a good choice. The following tables demonstrate the performance of different methods. The tests were performed with the following setup:

CPU Intel i9-9900KF, 3.60GHz, 8 cores, 16384 KB cache
GPU GeForce RTX 2070
g++ version 10.2.0
nvcc version 11.2.67
Precision single precision

## Dense matrix

In the test of dense matrices, we set each matrix element to value equal to rowIdx + columnIdx. The times in seconds obtained on CPU looks as follows:

Matrix rows and columns setElement on host setElement with ParallelFor getRow forElements
16 0.00000086 0.0000053 0.00000035 0.0000023
32 0.00000278 0.0000050 0.00000201 0.0000074
64 0.00000703 0.0000103 0.00000354 0.0000203
128 0.00002885 0.0000312 0.00000867 0.0000709
256 0.00017543 0.0000439 0.00002490 0.0001054
512 0.00078153 0.0001683 0.00005999 0.0002713
1024 0.00271989 0.0006691 0.00003808 0.0003942
2048 0.01273520 0.0038295 0.00039116 0.0017083
4096 0.08381450 0.0716542 0.00937997 0.0116771
8192 0.51596800 0.3535530 0.03971900 0.0467374

Here:

• setElement on host tests run in one thread. Therefore they are faster for small matrices compared to "setElement with ParallelFor" tests.
• setElement with ParallelFor tests run in parallel in several OpenMP threads. This approach is faster for larger matrices.
• getRow tests run in parallel in several OpenMP threads mapping of which is more efficient compared to "setElement on host" tests.

And the same on GPU is in the following table:

Matrix rows and columns setElement on host setElement on host and copy setElement on GPU getRow forElements
16 0.027835 0.02675 0.000101198 0.00009903 0.000101214
32 0.002776 0.00018 0.000099197 0.00009901 0.000100481
64 0.010791 0.00015 0.000094446 0.00009493 0.000101796
128 0.043014 0.00021 0.000099397 0.00010024 0.000102729
256 0.171029 0.00056 0.000100469 0.00010448 0.000105893
512 0.683627 0.00192 0.000103346 0.00011034 0.000112752
1024 2.736680 0.00687 0.000158805 0.00016932 0.000170302
2048 10.930300 0.02474 0.000509000 0.00050917 0.000511183
4096 43.728700 0.13174 0.001557030 0.00156117 0.001557930
8192 174.923000 0.70602 0.005312470 0.00526658 0.005263870

Here:

• setElement on host tests are very slow especially for large matrices since each matrix element is copied on GPU separately.
• setElement on host and copy tests are much faster because the matrix is copied from CPU to GPU on the whole which is more efficient.
• setElement on GPU tests are even more faster since there is no transfer of data between CPU and GPU.
• getRow tests have the same performance as "setElement on GPU".
• forElements tests have the same performance as both "setElement on GPU" and "getRow".

You can see the source code of the previous benchmark in Appendix.

## Sparse matrix

The sparse matrices are tested on computation of matrix the discrete Laplace operator in 2D. This matrix has at most five nonzero elements in each row. The times for sparse matrix (with CSR format) on CPU in seconds looks as follows:

Matrix rows and columns STL Map setElement on host setElement with ParallelFor getRow forElements
256 0.00016 0.000017 0.000014 0.000013 0.000020
1,024 0.00059 0.000044 0.000021 0.000019 0.000022
4,096 0.00291 0.000130 0.000031 0.000022 0.000031
16,384 0.01414 0.000471 0.000067 0.000031 0.000065
65,536 0.06705 0.001869 0.000218 0.000074 0.000209
262,144 0.31728 0.007436 0.000856 0.000274 0.000799
1,048,576 1.46388 0.027087 0.006162 0.005653 0.005904
4,194,304 7.46147 0.102808 0.028385 0.027925 0.027937
16,777,216 38.95900 0.413823 0.125870 0.124588 0.123858
67,108,864 185.75700 1.652580 0.505232 0.501003 0.500927

Here:

• STL Map tests show that use of STL Map can be very slow on large matrices and, of course, they need to allocate the map containing all the matrix elements. This can be memory consuming. On the other hand, it is the only way which does not require knowing the matrix row capacities in advance.
• setElement on host tests are much faster compared to STL map, it does not need to allocate anything else except the sparse matrix. However, matrix row capacities must be known in advance.
• setElement with ParallelFor tests run in parallel in several OpenMP threads and so this can be faster for larger matrices.
• getRow tests perform the same as "setElement with ParallelFor".
• forElements tests perform the same as both "setElement with ParallelFor" and "forElements".

We see, that the use of STL map makes sense only in situation when it is hard to estimate necessary row capacities. Otherwise very easy setup with setElement method is much faster. If the performance is the highest priority, getRow method should be preferred. The results for GPU are in the following table:

Matrix rows and columns STL Map setElement on host setElement on host and copy setElement on GPU getRow forElements
256 0.002 0.036 0.0280 0.00017 0.00017 0.00017
1,024 0.001 0.161 0.0006 0.00017 0.00017 0.00017
4,096 0.003 0.680 0.0010 0.00020 0.00020 0.00020
16,384 0.015 2.800 0.0034 0.00021 0.00020 0.00021
65,536 0.074 11.356 0.0130 0.00048 0.00047 0.00048
262,144 0.350 45.745 0.0518 0.00088 0.00087 0.00088
1,048,576 1.630 183.632 0.2057 0.00247 0.00244 0.00245
4,194,304 8.036 735.848 0.8119 0.00794 0.00783 0.00788
16,777,216 41.057 2946.610 3.2198 0.02481 0.02429 0.02211
67,108,864 197.581 11791.601 12.7775 0.07196 0.06329 0.06308

Here:

• STL Map tests show that the times are comparable to CPU times which means the most of the time is spent by creating the matrix on CPU.
• setElement on host tests are again extremely slow for large matrices. It is even slower than the use of STL map. So in case of GPU, this is another reason for using the STL map.
• setElement on host and copy tests are, similar to the dense matrix, much faster compared to the previous approaches. So it is the best way when you need to use data structures available only on the host system (CPU).
• setElement on GPU tests exhibit the best performance together with getRow and forElements methods. Note, however, that this method can be slower that getRow and forElements if there would be more nonzero matrix elements in a row.
• getRow tests exhibit the best performance together with setElement on GPU and forElements methods.
• forElements tests exhibit the best performance together with getRow and setElement on GPU methods.

Here we see, that the setElement methods performs extremely bad because all matrix elements are transferred to GPU one-by-one. Even STL map is much faster. Note, that the times for STL map are not much higher compared to CPU which indicates that the transfer of the matrix on GPU is not dominant. Setup of the matrix on CPU by the means of setElement method and transfer on GPU is even faster. However, the best performance can be obtained only we creating the matrix directly on GPU by methods setElement, getRow and forElements. Note, however, that even if all of them perform the same way, for matrices with more nonzero matrix elements in a row, setElement could be slower compared to the getRow and forElements.

You can see the source code of the previous benchmark in Appendix.

## Multidiagonal matrix

Finally, the following tables show the times of the same test performed with multidiagonal matrix. Times on CPU in seconds looks as follows:

Matrix rows and columns setElement on host setElement with ParallelFor getRow forElements
256 0.000055 0.0000038 0.000004 0.000009
1,024 0.000002 0.0000056 0.000003 0.000006
4,096 0.000087 0.0000130 0.000005 0.000014
16,384 0.000347 0.0000419 0.000010 0.000046
65,536 0.001378 0.0001528 0.000032 0.000177
262,144 0.005504 0.0006025 0.000131 0.000711
1,048,576 0.019392 0.0028773 0.001005 0.003265
4,194,304 0.072078 0.0162378 0.011915 0.018065
16,777,216 0.280085 0.0642682 0.048876 0.072084
67,108,864 1.105120 0.2427610 0.181974 0.272579

Here:

• setElement on host tests show that this method is fairly efficient.
• setElement with ParallelFor tests run in parallel in several OpenMP threads compared to "setElement on host" tests. For larger matrices, this way of matrix setup performs better.
• getRow tests perform more or less the same as "setElement with ParallelFor" and forElements.
• forElements tests perform more or less the same as "setElement with ParallelFor" and getRow.

Note, that setup of multidiagonal matrix is faster compared to the same matrix stored in general sparse format. Results for GPU are in the following table:

Matrix rows and columns setElement on host setElement on host and copy setElement on GPU getRow forElements
256 0.035 0.02468 0.000048 0.000045 0.000047
1,024 0.059 0.00015 0.000047 0.000045 0.000047
4,096 0.251 0.00044 0.000048 0.000045 0.000047
16,384 1.030 0.00158 0.000049 0.000046 0.000048
65,536 4.169 0.00619 0.000053 0.000048 0.000052
262,144 16.807 0.02187 0.000216 0.000214 0.000217
1,048,576 67.385 0.08043 0.000630 0.000629 0.000634
4,194,304 270.025 0.31272 0.001939 0.001941 0.001942
16,777,216 1080.741 1.18849 0.003212 0.004185 0.004207
67,108,864 4326.120 4.74481 0.013672 0.022494 0.030369
• setElement on host tests are extremely slow again, especially for large matrices.
• setElement on host and copy tests are much faster compared to the previous.
• setElement with ParallelFor tests offer the best performance. They are even faster then getRow and forElements method. This, however, does not have be true for matrices having more nonzero elements in a row.
• getRow tests perform more or less the same as forElements. For matrices having more nonzero elements in a row this method could be faster than setElement.
• forElements tests perform more or less the same as getRow.

Note that multidiagonal matrix performs better compared to general sparse matrix. One reason for it is the fact, that the multidiagonal type does not store explicitly column indexes of all matrix elements. Because of this, less data need to be transferred from the memory.

You can see the source code of the previous benchmark in Appendix.

In the following parts we will describe hoe to setup particular matrix types by means of the methods mentioned above.

## Dense matrices

Dense matrix (TNL::Matrices::DenseMatrix) is a templated class defined in the namespace TNL::Matrices. It has five template parameters:

• Real is a type of the matrix elements. It is double by default.
• Device is a device where the matrix shall be allocated. Currently it can be either TNL::Devices::Host for CPU or TNL::Devices::Cuda for CUDA supporting GPUs. It is TNL::Devices::Host by default.
• Index is a type to be used for indexing of the matrix elements. It is int by default.
• ElementsOrganization defines the organization of the matrix elements in memory. It can be TNL::Algorithms::Segments::ColumnMajorOrder or TNL::Algorithms::Segments::RowMajorOrder for column-major and row-major organization respectively. Be default, it is the row-major order if the matrix is allocated on the host system and column major order if it is allocated on GPU.
• RealAllocator is a memory allocator (one from TNL::Allocators) which shall be used for allocation of the matrix elements. By default, it is the default allocator for given Real type and Device type – see TNL::Allocators::Default.

The following examples show how to allocate the dense matrix and how to initialize the matrix elements.

### Initializer list

Small matrices can be created simply by the constructor with an initializer list.

1#include <iostream>
2#include <TNL/Matrices/DenseMatrix.h>
3#include <TNL/Devices/Host.h>
4
5
6template< typename Device >
7void initializerListExample()
8{
10 { 1, 2, 3, 4, 5, 6 },
11 { 7, 8, 9, 10, 11, 12 },
12 { 13, 14, 15, 16, 17, 18 }
13 };
14
15 std::cout << "General dense matrix: " << std::endl << matrix << std::endl;
16
18 { 1 },
19 { 2, 3 },
20 { 4, 5, 6 },
21 { 7, 8, 9, 10 },
22 { 11, 12, 13, 14, 15 }
23 };
24
25 std::cout << "Triangular dense matrix: " << std::endl << triangularMatrix << std::endl;
26}
27
28int main( int argc, char* argv[] )
29{
30 std::cout << "Creating matrices on CPU ... " << std::endl;
31 initializerListExample< TNL::Devices::Host >();
32
33#ifdef HAVE_CUDA
34 std::cout << "Creating matrices on CUDA GPU ... " << std::endl;
35 initializerListExample< TNL::Devices::Cuda >();
36#endif
37}
Implementation of dense matrix, i.e. matrix storing explicitly all of its elements including zeros.
Definition: DenseMatrix.h:35

In fact, the constructor takes a list of initializer lists. Each embedded list defines one matrix row and so the number of matrix rows is given by the size of the outer initializer list. The number of matrix columns is given by the longest inner initializer lists. Shorter inner lists are filled with zeros from the right side. The result looks as follows:

Creating matrices on CPU ...
General dense matrix:
Row: 0 -> 0:1 1:2 2:3 3:4 4:5 5:6
Row: 1 -> 0:7 1:8 2:9 3:10 4:11 5:12
Row: 2 -> 0:13 1:14 2:15 3:16 4:17 5:18
Triangular dense matrix:
Row: 0 -> 0:1 1:0 2:0 3:0 4:0
Row: 1 -> 0:2 1:3 2:0 3:0 4:0
Row: 2 -> 0:4 1:5 2:6 3:0 4:0
Row: 3 -> 0:7 1:8 2:9 3:10 4:0
Row: 4 -> 0:11 1:12 2:13 3:14 4:15
Creating matrices on CUDA GPU ...
General dense matrix:
Row: 0 -> 0:1 1:2 2:3 3:4 4:5 5:6
Row: 1 -> 0:7 1:8 2:9 3:10 4:11 5:12
Row: 2 -> 0:13 1:14 2:15 3:16 4:17 5:18
Triangular dense matrix:
Row: 0 -> 0:1 1:0 2:0 3:0 4:0
Row: 1 -> 0:2 1:3 2:0 3:0 4:0
Row: 2 -> 0:4 1:5 2:6 3:0 4:0
Row: 3 -> 0:7 1:8 2:9 3:10 4:0
Row: 4 -> 0:11 1:12 2:13 3:14 4:15

### Methods setElement and addElement

Larger matrices can be setup with methods setElement and addElement (TNL::Matrices::DenseMatrix::setElement, TNL::Matrices::DenseMatrix::addElement). The following example shows how to call these methods from the host.

1#include <iostream>
2#include <TNL/Matrices/DenseMatrix.h>
3#include <TNL/Devices/Host.h>
4
5template< typename Device >
7{
9 for( int i = 0; i < 5; i++ )
10 matrix.setElement( i, i, i );
11
12 std::cout << "Initial matrix is: " << std::endl << matrix << std::endl;
13
14 for( int i = 0; i < 5; i++ )
15 for( int j = 0; j < 5; j++ )
16 matrix.addElement( i, j, 1.0, 5.0 );
17
18 std::cout << "Matrix after addition is: " << std::endl << matrix << std::endl;
19}
20
21int main( int argc, char* argv[] )
22{
23 std::cout << "Add elements on host:" << std::endl;
25
26#ifdef HAVE_CUDA
27 std::cout << "Add elements on CUDA device:" << std::endl;
29#endif
30}

As we can see, both methods can be called from the host no matter where the matrix is allocated. If it is on GPU, each call of setElement or addElement (TNL::Matrices::DenseMatrix::setElement, TNL::Matrices::DenseMatrix::addElement) causes slow transfer of tha data between CPU and GPU. Use this approach only if the performance is not a priority. The result looks as follows:

Initial matrix is:
Row: 0 -> 0:0 1:0 2:0 3:0 4:0
Row: 1 -> 0:0 1:1 2:0 3:0 4:0
Row: 2 -> 0:0 1:0 2:2 3:0 4:0
Row: 3 -> 0:0 1:0 2:0 3:3 4:0
Row: 4 -> 0:0 1:0 2:0 3:0 4:4
Row: 0 -> 0:1 1:1 2:1 3:1 4:1
Row: 1 -> 0:1 1:6 2:1 3:1 4:1
Row: 2 -> 0:1 1:1 2:11 3:1 4:1
Row: 3 -> 0:1 1:1 2:1 3:16 4:1
Row: 4 -> 0:1 1:1 2:1 3:1 4:21
Initial matrix is:
Row: 0 -> 0:0 1:0 2:0 3:0 4:0
Row: 1 -> 0:0 1:1 2:0 3:0 4:0
Row: 2 -> 0:0 1:0 2:2 3:0 4:0
Row: 3 -> 0:0 1:0 2:0 3:3 4:0
Row: 4 -> 0:0 1:0 2:0 3:0 4:4
Row: 0 -> 0:1 1:1 2:1 3:1 4:1
Row: 1 -> 0:1 1:6 2:1 3:1 4:1
Row: 2 -> 0:1 1:1 2:11 3:1 4:1
Row: 3 -> 0:1 1:1 2:1 3:16 4:1
Row: 4 -> 0:1 1:1 2:1 3:1 4:21

More efficient way of the matrix initialization on GPU consists of calling the methods setElement and addElement (TNL::Matrices::DenseMatrix::setElement, TNL::Matrices::DenseMatrix::addElement) directly from GPU, for example by means of lambda function and ParallelFor2D (TNL::Algorithms::ParallelFor2D). It is demonstrated in the following example (of course it works even on CPU):

1#include <iostream>
2#include <TNL/Algorithms/ParallelFor.h>
3#include <TNL/Matrices/DenseMatrix.h>
4#include <TNL/Devices/Host.h>
5
6template< typename Device >
7void setElements()
8{
10 auto matrixView = matrix.getView();
11 for( int i = 0; i < 5; i++ )
12 matrixView.setElement( i, i, i );
13
14 std::cout << "Matrix set from the host:" << std::endl;
15 std::cout << matrix << std::endl;
16
17 auto f = [=] __cuda_callable__ ( int i, int j ) mutable {
18 matrixView.addElement( i, j, 5.0 );
19 };
21
22 std::cout << "Matrix set from its native device:" << std::endl;
23 std::cout << matrix << std::endl;
24}
25
26int main( int argc, char* argv[] )
27{
28 std::cout << "Set elements on host:" << std::endl;
29 setElements< TNL::Devices::Host >();
30
31#ifdef HAVE_CUDA
32 std::cout << "Set elements on CUDA device:" << std::endl;
33 setElements< TNL::Devices::Cuda >();
34#endif
35}
static void exec(Index startX, Index startY, Index endX, Index endY, Function f, FunctionArgs... args)
Static method for the execution of the loop.
Definition: ParallelFor.h:130

Here we get the matrix view (TNL::Matrices::DenseMatrixView) (line 10) to make the matrix accessible in lambda function even on GPU (see Shared pointers and views ). We first call the setElement method from CPU to set the i-th diagonal element to i (lines 11-12). Next we iterate over the matrix rows with ParallelFor2D (TNL::Algorithms::ParallelFor2D) (line 20) and for each row we call the lambda function f. This is done on the same device where the matrix is allocated and so it we get optimal performance even for matrices on GPU. In the lambda function we add one to each matrix element (line 18). The result looks as follows:

Set elements on host:
Matrix set from the host:
Row: 0 -> 0:0 1:0 2:0 3:0 4:0
Row: 1 -> 0:0 1:1 2:0 3:0 4:0
Row: 2 -> 0:0 1:0 2:2 3:0 4:0
Row: 3 -> 0:0 1:0 2:0 3:3 4:0
Row: 4 -> 0:0 1:0 2:0 3:0 4:4
Matrix set from its native device:
Row: 0 -> 0:5 1:5 2:5 3:5 4:5
Row: 1 -> 0:5 1:6 2:5 3:5 4:5
Row: 2 -> 0:5 1:5 2:7 3:5 4:5
Row: 3 -> 0:5 1:5 2:5 3:8 4:5
Row: 4 -> 0:5 1:5 2:5 3:5 4:9
Set elements on CUDA device:
Matrix set from the host:
Row: 0 -> 0:0 1:0 2:0 3:0 4:0
Row: 1 -> 0:0 1:1 2:0 3:0 4:0
Row: 2 -> 0:0 1:0 2:2 3:0 4:0
Row: 3 -> 0:0 1:0 2:0 3:3 4:0
Row: 4 -> 0:0 1:0 2:0 3:0 4:4
Matrix set from its native device:
Row: 0 -> 0:5 1:5 2:5 3:5 4:5
Row: 1 -> 0:5 1:6 2:5 3:5 4:5
Row: 2 -> 0:5 1:5 2:7 3:5 4:5
Row: 3 -> 0:5 1:5 2:5 3:8 4:5
Row: 4 -> 0:5 1:5 2:5 3:5 4:9

### Method getRow

This method is available for the dense matrix (TNL::Matrices::DenseMatrix::getRow) mainly for two reasons:

1. The method getRow is recommended for sparse matrices. In most cases, it is not optimal for dense matrices. However, if one needs to have one code for both dense and sparse matrices, this method is a good choice.
2. In general, use of setElement (TNL::Matrices::DenseMatrix::setElement) combined with ParallelFor2D (TNL::Algorithms::ParallelFor2D) is preferred, for dense matrices, since it offers more parallelism for GPUs. ParallelFor2D creates one CUDA thread per each matrix element which is desirable for GPUs. With the use of the method getRow we have only one CUDA thread per each matrix row. This makes sense only in situation when we need to setup each matrix row sequentially.

Here we show an example:

1#include <iostream>
2#include <TNL/Algorithms/ParallelFor.h>
3#include <TNL/Matrices/DenseMatrix.h>
4#include <TNL/Devices/Host.h>
5#include <TNL/Devices/Cuda.h>
6
7template< typename Device >
8void getRowExample()
9{
10 const int size = 5;
12
13 /***
14 * Create dense matrix view which can be captured by the following lambda
15 * function.
16 */
17 auto matrixView = matrix.getView();
18
19 auto f = [=] __cuda_callable__ ( int rowIdx ) mutable {
20 auto row = matrixView.getRow( rowIdx );
21 if( rowIdx > 0 )
22 row.setValue( rowIdx - 1, -1.0 );
23 row.setValue( rowIdx, rowIdx + 1.0 );
24 if( rowIdx < size - 1 )
25 row.setValue( rowIdx + 1, -1.0 );
26 };
27
28 /***
29 * Set the matrix elements.
30 */
31 TNL::Algorithms::ParallelFor< Device >::exec( 0, matrix.getRows(), f );
32 std::cout << matrix << std::endl;
33}
34
35int main( int argc, char* argv[] )
36{
37 std::cout << "Getting matrix rows on host: " << std::endl;
38 getRowExample< TNL::Devices::Host >();
39
40#ifdef HAVE_CUDA
41 std::cout << "Getting matrix rows on CUDA device: " << std::endl;
42 getRowExample< TNL::Devices::Cuda >();
43#endif
44}

Here we create the matrix on the line 10 and get the matrix view on the line 16. Next we use ParallelFor (TNL::Algorithms::ParallelFor) (line 31) to iterate over the matrix rows and call the lambda function f (lines 19-26) for each of them. In the lambda function, we first fetch the matrix row by means of the method getRow (TNL::Matrices::DenseMatrixView::getRow) and next we set the matrix elements by using the method setElement of the matrix row (TNL::Matrices::DenseMatrixRowView::setElement). For the compatibility with the sparse matrices, use the variant of setElement with the parameter localIdx. It has no effect here, it is only for compatibility of the interface.

The result looks as follows:

Getting matrix rows on host:
Row: 0 -> 0:1 1:-1 2:0 3:0 4:0
Row: 1 -> 0:-1 1:2 2:-1 3:0 4:0
Row: 2 -> 0:0 1:-1 2:3 3:-1 4:0
Row: 3 -> 0:0 1:0 2:-1 3:4 4:-1
Row: 4 -> 0:0 1:0 2:0 3:-1 4:5
Getting matrix rows on CUDA device:
Row: 0 -> 0:1 1:-1 2:0 3:0 4:0
Row: 1 -> 0:-1 1:2 2:-1 3:0 4:0
Row: 2 -> 0:0 1:-1 2:3 3:-1 4:0
Row: 3 -> 0:0 1:0 2:-1 3:4 4:-1
Row: 4 -> 0:0 1:0 2:0 3:-1 4:5

### Method forRows

This method iterates in parallel over all matrix rows. In fact, it combines TNL::Algorithms::ParallelFor and TNL::Matrices::DenseMatrix::getRow method in one. See the following example. It is even a bit simpler compared to the previous one:

1#include <iostream>
2#include <TNL/Algorithms/ParallelFor.h>
3#include <TNL/Matrices/DenseMatrix.h>
4#include <TNL/Devices/Host.h>
5#include <TNL/Devices/Cuda.h>
6
7template< typename Device >
8void forRowsExample()
9{
11 using RowView = typename MatrixType::RowView;
12 const int size = 5;
13 MatrixType matrix( size, size );
14
15 /***
16 * Set the matrix elements.
17 */
18 auto f = [=] __cuda_callable__ ( RowView& row ) mutable {
19 const int& rowIdx = row.getRowIndex();
20 if( rowIdx > 0 )
21 row.setValue( rowIdx - 1, -1.0 );
22 row.setValue( rowIdx, rowIdx + 1.0 );
23 if( rowIdx < size - 1 )
24 row.setValue( rowIdx + 1, -1.0 );
25 };
26 matrix.forAllRows( f );
27 std::cout << matrix << std::endl;
28
29 /***
30 * Now divide each matrix row by its largest element with use of iterators.
31 */
32 matrix.forAllRows( [=] __cuda_callable__ ( RowView& row ) mutable {
34 for( auto element : row )
35 largest = TNL::max( largest, element.value() );
36 for( auto element : row )
37 element.value() /= largest;
38 } );
39 std::cout << "Divide each matrix row by its largest element... " << std::endl;
40 std::cout << matrix << std::endl;
41}
42
43int main( int argc, char* argv[] )
44{
45 std::cout << "Getting matrix rows on host: " << std::endl;
46 forRowsExample< TNL::Devices::Host >();
47
48#ifdef HAVE_CUDA
49 std::cout << "Getting matrix rows on CUDA device: " << std::endl;
50 forRowsExample< TNL::Devices::Cuda >();
51#endif
52}
T lowest(T... args)

The lambda function f, which is called for each matrix row (lines 18-25), have to accept parameter row with type RowView. This type is defined inside each TNL matrix and in the case of the dense matrix, it is TNL::Matrices::DenseMatrixRowView. We use the method TNL::Matrices::DenseMatrixRowView::getRowIndex to get the index of the matrix row being currently processed and method TNL::Matrices::DenseMatrixRowView::setElement which sets the value of the element with given column index (the first parameter).

Next, on the lines 32-38, we call another lambda function which firstly find the largest element in each row (lines 33-35) and then it divides the matrix row by its value (lines 36-37).

The result looks as follows:

Getting matrix rows on host:
Row: 0 -> 0:1 1:-1 2:0 3:0 4:0
Row: 1 -> 0:-1 1:2 2:-1 3:0 4:0
Row: 2 -> 0:0 1:-1 2:3 3:-1 4:0
Row: 3 -> 0:0 1:0 2:-1 3:4 4:-1
Row: 4 -> 0:0 1:0 2:0 3:-1 4:5
Divide each matrix row by its largest element...
Row: 0 -> 0:1 1:-1 2:0 3:0 4:0
Row: 1 -> 0:-0.5 1:1 2:-0.5 3:0 4:0
Row: 2 -> 0:0 1:-0.333333 2:1 3:-0.333333 4:0
Row: 3 -> 0:0 1:0 2:-0.25 3:1 4:-0.25
Row: 4 -> 0:0 1:0 2:0 3:-0.2 4:1
Getting matrix rows on CUDA device:
Row: 0 -> 0:1 1:-1 2:0 3:0 4:0
Row: 1 -> 0:-1 1:2 2:-1 3:0 4:0
Row: 2 -> 0:0 1:-1 2:3 3:-1 4:0
Row: 3 -> 0:0 1:0 2:-1 3:4 4:-1
Row: 4 -> 0:0 1:0 2:0 3:-1 4:5
Divide each matrix row by its largest element...
Row: 0 -> 0:1 1:-1 2:0 3:0 4:0
Row: 1 -> 0:-0.5 1:1 2:-0.5 3:0 4:0
Row: 2 -> 0:0 1:-0.333333 2:1 3:-0.333333 4:0
Row: 3 -> 0:0 1:0 2:-0.25 3:1 4:-0.25
Row: 4 -> 0:0 1:0 2:0 3:-0.2 4:1

### Method forElements

The next example demonstrates the method forElements (TNL::Matrices::DenseMatrix::forElements) which works in very similar way as the method getRow but it is slightly easier to use. It is also compatible with sparse matrices. See the following example:

1#include <iostream>
2#include <TNL/Matrices/DenseMatrix.h>
3#include <TNL/Devices/Host.h>
4#include <TNL/Devices/Cuda.h>
5
6template< typename Device >
7void forElementsExample()
8{
10
11 auto f = [=] __cuda_callable__ ( int rowIdx, int columnIdx, int columnIdx_, double& value ) {
12 if( rowIdx >= columnIdx )
13 value = rowIdx + columnIdx;
14 };
15
16 matrix.forElements( 0, matrix.getRows(), f );
17 std::cout << matrix << std::endl;
18}
19
20int main( int argc, char* argv[] )
21{
22 std::cout << "Creating matrix on host: " << std::endl;
23 forElementsExample< TNL::Devices::Host >();
24
25#ifdef HAVE_CUDA
26 std::cout << "Creating matrix on CUDA device: " << std::endl;
27 forElementsExample< TNL::Devices::Cuda >();
28#endif
29}

We do not need any matrix view and instead of calling ParallelFor (TNL::Algorithms::ParallelFor) we call just the method forElements (line 18). The lambda function f (line 11) must accept the following parameters:

• rowIdx is the row index of given matrix element.
• columnIdx is the column index of given matrix element.
• value is a reference on the matrix element value and so by changing this value we can modify the matrix element.
• compute is a boolean which, when set to false, indicates that we can skip the rest of the matrix row. This is, however, only a hint and it does not guarantee that the rest of the matrix row is really skipped.

The result looks as follows:

Creating matrix on host:
Row: 0 -> 0:0 1:0 2:0 3:0 4:0
Row: 1 -> 0:1 1:2 2:0 3:0 4:0
Row: 2 -> 0:2 1:3 2:4 3:0 4:0
Row: 3 -> 0:3 1:4 2:5 3:6 4:0
Row: 4 -> 0:4 1:5 2:6 3:7 4:8
Creating matrix on CUDA device:
Row: 0 -> 0:0 1:0 2:0 3:0 4:0
Row: 1 -> 0:1 1:2 2:0 3:0 4:0
Row: 2 -> 0:2 1:3 2:4 3:0 4:0
Row: 3 -> 0:3 1:4 2:5 3:6 4:0
Row: 4 -> 0:4 1:5 2:6 3:7 4:8

### Wrapping existing data to dense matrix view

In case when you have already allocated data for dense matrix (for example in some other library), you may wrap it to dense matrix view with a function TNL::Matrices::wrapDenseMatrix . See the following example:

1#include <iostream>
2#include <TNL/Algorithms/ParallelFor.h>
3#include <TNL/Matrices/DenseMatrix.h>
4#include <TNL/Matrices/MatrixWrapping.h>
5#include <TNL/Devices/Host.h>
6#include <TNL/Devices/Cuda.h>
7
8template< typename Device >
9void wrapMatrixView()
10{
11 const int rows( 3 ), columns( 4 );
13 1, 2, 3, 4,
14 5, 6, 7, 8,
15 9, 10, 11, 12 };
16 double* values = valuesVector.getData();
17
18 /***
19 * Wrap the array values to dense matrix view
20 */
21 auto matrix = TNL::Matrices::wrapDenseMatrix< Device >( rows, columns, values );
22 std::cout << "Matrix reads as: " << std::endl << matrix << std::endl;
23}
24
25int main( int argc, char* argv[] )
26{
27 std::cout << "Wraping matrix view on host: " << std::endl;
28 wrapMatrixView< TNL::Devices::Host >();
29
30#ifdef HAVE_CUDA
31 std::cout << "Wraping matrix view on CUDA device: " << std::endl;
32 wrapMatrixView< TNL::Devices::Cuda >();
33#endif
34}
__cuda_callable__ const Value * getData() const
Returns a const-qualified raw pointer to the data.
Definition: Array.hpp:324
Vector extends Array with algebraic operations.
Definition: Vector.h:40

Here we create dense matrix having three rows and four columns. We use TNL vector (TNL::Containers::Vector) only for allocation of the matrix elements (lines 12-15) and we get a pointer to the allocated array immediately (line 16). Next we use just the array to get dense matrix view with proper matrix dimensions (line 21). Note that we must explicitly state the device type as a template parameter of the function wrapDenseMatrix (TNL::Matrices::wrapDenseMatrix). Finally, we print the matrix to see if it is correct (line 22). The result looks as follows:

Wraping matrix view on host:
Row: 0 -> 0:1 1:2 2:3 3:4
Row: 1 -> 0:5 1:6 2:7 3:8
Row: 2 -> 0:9 1:10 2:11 3:12
Wraping matrix view on CUDA device:
Row: 0 -> 0:1 1:4 2:7 3:10
Row: 1 -> 0:2 1:5 2:8 3:11
Row: 2 -> 0:3 1:6 2:9 3:12

## Sparse matrices

Sparse matrices are extremely important in a lot of numerical algorithms. They are used at situations when we need to operate with matrices having majority of the matrix elements equal to zero. In this case, only the non-zero matrix elements are stored with possibly some padding zeros used for memory alignment. This is necessary mainly on GPUs. See the Overview of matrix types for the differences in memory requirements.

Major disadvantage of sparse matrices is that there are a lot of different formats for their storage in memory. Though CSR (Compressed Sparse Row) format is the most popular of all, especially for GPUs, there are many other formats. Often their performance differ significantly for various matrices. So it is a good idea to test several sparse matrix formats if you want to get the best performance. In TNL, there is one templated class TNL::Matrices::SparseMatrix representing general sparse matrices. The change of underlying matrix format can be done just by changing one template parameter. The list of the template parameters is as follows:

If Real is set to bool, we get a binary matrix for which the non-zero elements can be equal only to one and so the matrix elements values are not stored explicitly in the memory. This can significantly reduce the memory requirements and also increase performance.

In the following text we will show how to create and setup sparse matrices.

### Setting of row capacities

Larger sparse matrices are created in two steps:

1. We use a method TNL::Matrices::SparseMatrix::setRowCapacities to initialize the underlying matrix format and to allocate memory for the matrix elements. This method only needs to know how many non-zero elements are supposed to be in each row. Once this is set, it cannot be changed only by resetting the whole matrix. In most situations, it is not an issue to compute the number of nonzero elements in each row. Otherwise, we can currently only recommend the use of matrix setup with STL map, which is, however, quite slow.
2. Now, the nonzero matrix elements can be set one after another by telling its coordinates and a value. Since majority of sparse matrix formats are designed to allow quick access to particular matrix rows the insertion can be done in parallel by mapping different threads to different matrix rows. This approach is usually optimal or nearly optimal when it comes to efficiency.

See the following example which creates lower triangular matrix like this one

$\left( \begin{array}{ccccc} 1 & 0 & 0 & 0 & 0 \\ 2 & 1 & 0 & 0 & 0 \\ 3 & 2 & 1 & 0 & 0 \\ 4 & 3 & 2 & 1 & 0 \\ 5 & 4 & 3 & 2 & 1 \end{array} \right).$

1#include <iostream>
2#include <TNL/Containers/Vector.h>
3#include <TNL/Matrices/SparseMatrix.h>
4#include <TNL/Devices/Host.h>
5#include <TNL/Devices/Cuda.h>
6
7template< typename Device >
8void setRowCapacitiesExample()
9{
11 TNL::Containers::Vector< int, Device > rowCapacities{ 1, 2, 3, 4, 5 };
12 matrix.setRowCapacities( rowCapacities );
13 for( int row = 0; row < 5; row++ )
14 for( int column = 0; column <= row; column++ )
15 matrix.setElement( row, column, row - column + 1 );
16
17 std::cout << matrix << std::endl;
18}
19
20int main( int argc, char* argv[] )
21{
22 std::cout << "Creating matrix on CPU ... " << std::endl;
23 setRowCapacitiesExample< TNL::Devices::Host >();
24
25#ifdef HAVE_CUDA
26 std::cout << "Creating matrix on CUDA GPU ... " << std::endl;
27 setRowCapacitiesExample< TNL::Devices::Cuda >();
28#endif
29}
__cuda_callable__ void setElement(IndexType i, ValueType value)
Sets the value of the i-th element to v.
Definition: Array.hpp:356

The method TNL::Matrices::SparseMatrix::setRowCapacities reads the required capacities of the matrix rows from a vector (or simmilar container - TNL::Containers::Array, TNL::Containers::ArrayView, TNL::Containers::Vector and TNL::Containers::VectorView) which has the same number of elements as the number of matrix rows and each element defines the capacity of the related row. The result looks as follows:

Creating matrix on CPU ...
Row: 0 -> 0:1
Row: 1 -> 0:2 1:1
Row: 2 -> 0:3 1:2 2:1
Row: 3 -> 0:4 1:3 2:2 3:1
Row: 4 -> 0:5 1:4 2:3 3:2 4:1
Creating matrix on CUDA GPU ...
Row: 0 -> 0:1
Row: 1 -> 0:2 1:1
Row: 2 -> 0:3 1:2 2:1
Row: 3 -> 0:4 1:3 2:2 3:1
Row: 4 -> 0:5 1:4 2:3 3:2 4:1

There are constructors which also set the row capacities. The first one uses a vector:

1#include <iostream>
2#include <TNL/Matrices/SparseMatrix.h>
3#include <TNL/Containers/Vector.h>
4#include <TNL/Devices/Host.h>
5
6
7template< typename Device >
8void initializerListExample()
9{
10 TNL::Containers::Vector< int, Device > rowCapacities{ 1, 2, 3, 4, 5 };
12 rowCapacities, // row capacities
13 6 }; // number of matrix columns
14
15 for( int row = 0; row < matrix.getRows(); row++ )
16 for( int column = 0; column <= row; column++ )
17 matrix.setElement( row, column, row - column + 1 );
18 std::cout << "General sparse matrix: " << std::endl << matrix << std::endl;
19}
20
21int main( int argc, char* argv[] )
22{
23 std::cout << "Creating matrices on CPU ... " << std::endl;
24 initializerListExample< TNL::Devices::Host >();
25
26#ifdef HAVE_CUDA
27 std::cout << "Creating matrices on CUDA GPU ... " << std::endl;
28 initializerListExample< TNL::Devices::Cuda >();
29#endif
30}
__cuda_callable__ void setElement(IndexType row, IndexType column, const RealType &value)
Sets element at given row and column to given value.
Definition: SparseMatrix.hpp:502

The second one uses an initializer list:

1#include <iostream>
2#include <TNL/Matrices/SparseMatrix.h>
3#include <TNL/Devices/Host.h>
4
5
6template< typename Device >
7void initializerListExample()
8{
10 { 1, 2, 3, 4, 5 }, // row capacities
11 6 }; // number of matrix columns
12
13 for( int row = 0; row < matrix.getRows(); row++ )
14 for( int column = 0; column <= row; column++ )
15 matrix.setElement( row, column, row - column + 1 );
16 std::cout << "General sparse matrix: " << std::endl << matrix << std::endl;
17}
18
19int main( int argc, char* argv[] )
20{
21 std::cout << "Creating matrices on CPU ... " << std::endl;
22 initializerListExample< TNL::Devices::Host >();
23
24#ifdef HAVE_CUDA
25 std::cout << "Creating matrices on CUDA GPU ... " << std::endl;
26 initializerListExample< TNL::Devices::Cuda >();
27#endif
28}

The result of both examples looks as follows:

Creating matrices on CPU ...
General sparse matrix:
Row: 0 -> 0:1
Row: 1 -> 0:2 1:1
Row: 2 -> 0:3 1:2 2:1
Row: 3 -> 0:4 1:3 2:2 3:1
Row: 4 -> 0:5 1:4 2:3 3:2 4:1
Creating matrices on CUDA GPU ...
General sparse matrix:
Row: 0 -> 0:1
Row: 1 -> 0:2 1:1
Row: 2 -> 0:3 1:2 2:1
Row: 3 -> 0:4 1:3 2:2 3:1
Row: 4 -> 0:5 1:4 2:3 3:2 4:1

### Initializer list

Small matrices can be initialized by a constructor with an initializer list. We assume having the following sparse matrix

$\left( \begin{array}{ccccc} 1 & 0 & 0 & 0 & 0 \\ -1 & 2 & -1 & 0 & 0 \\ 0 & -1 & 2 & -1 & 0 \\ 0 & 0 & -1 & 2 & -1 \\ 0 & 0 & 0 & -1 & 0 \end{array} \right).$

It can be created with the initializer list constructor like we shows in the following example:

1#include <iostream>
2#include <TNL/Matrices/SparseMatrix.h>
3#include <TNL/Devices/Host.h>
4
5
6template< typename Device >
7void initializerListExample()
8{
10 5, // number of matrix rows
11 5, // number of matrix columns
12 { // matrix elements definition
13 { 0, 0, 2.0 },
14 { 1, 0, -1.0 }, { 1, 1, 2.0 }, { 1, 2, -1.0 },
15 { 2, 1, -1.0 }, { 2, 2, 2.0 }, { 2, 3, -1.0 },
16 { 3, 2, -1.0 }, { 3, 3, 2.0 }, { 3, 4, -1.0 },
17 { 4, 4, 2.0 } } );
18
19 std::cout << "General sparse matrix: " << std::endl << matrix << std::endl;
20}
21
22int main( int argc, char* argv[] )
23{
24 std::cout << "Creating matrices on CPU ... " << std::endl;
25 initializerListExample< TNL::Devices::Host >();
26
27#ifdef HAVE_CUDA
28 std::cout << "Creating matrices on CUDA GPU ... " << std::endl;
29 initializerListExample< TNL::Devices::Cuda >();
30#endif
31}

The constructor accepts the following parameters (lines 9-17):

• rows is a number of matrix rows.
• columns is a number of matrix columns.
• data is definition of nonzero matrix elements. It is a initializer list of triples having a form { row_index, column_index, value }. In fact, it is very much like the Coordinate format - COO.

The constructor also accepts Real and Index allocators (TNL::Allocators) but the default ones are used in this example. A method setElements (TNL::Matrices::SparseMatrix::setElements) works the same way:

1#include <iostream>
2#include <TNL/Matrices/SparseMatrix.h>
3#include <TNL/Devices/Host.h>
4#include <TNL/Devices/Cuda.h>
5
6template< typename Device >
7void setElementsExample()
8{
9 TNL::Matrices::SparseMatrix< double, Device > matrix ( 5, 5 ); // matrix dimensions
10 matrix.setElements( { // matrix elements definition
11 { 0, 0, 2.0 },
12 { 1, 0, -1.0 }, { 1, 1, 2.0 }, { 1, 2, -1.0 },
13 { 2, 1, -1.0 }, { 2, 2, 2.0 }, { 2, 3, -1.0 },
14 { 3, 2, -1.0 }, { 3, 3, 2.0 }, { 3, 4, -1.0 },
15 { 4, 4, 2.0 } } );
16
17 std::cout << matrix << std::endl;
18}
19
20int main( int argc, char* argv[] )
21{
22 std::cout << "Setting matrix elements on host: " << std::endl;
23 setElementsExample< TNL::Devices::Host >();
24
25#ifdef HAVE_CUDA
26 std::cout << "Setting matrix elements on CUDA device: " << std::endl;
27 setElementsExample< TNL::Devices::Cuda >();
28#endif
29}

In this example, we create the matrix in two steps. Firstly we use constructor with only matrix dimensions as parameters (line 9) and next we set the matrix elements by setElements method (lines 10-15). The result of both examples looks as follows:

Creating matrices on CPU ...
General sparse matrix:
Row: 0 -> 0:2
Row: 1 -> 0:-1 1:2 2:-1
Row: 2 -> 1:-1 2:2 3:-1
Row: 3 -> 2:-1 3:2 4:-1
Row: 4 -> 4:2
Creating matrices on CUDA GPU ...
General sparse matrix:
Row: 0 -> 0:2
Row: 1 -> 0:-1 1:2 2:-1
Row: 2 -> 1:-1 2:2 3:-1
Row: 3 -> 2:-1 3:2 4:-1
Row: 4 -> 4:2

### STL map

The constructor which creates the sparse matrix from std::map is useful especially in situations when you cannot estimate the matrix row capacities in advance. You can first store the matrix elements in std::map data structure in a COO format manner. It means that each entry of the map is the following pair:

std::pair( std::pair( row_index, column_index ), element_value )

which defines one matrix element at given coordinates (row_index,column_index) with given value (element_value). Of course, you can insert such entries into the map in arbitrary order. When it is complete, you pass the map to the sparse matrix. See the following example:

1#include <iostream>
2#include <map>
3#include <utility>
4#include <TNL/Matrices/SparseMatrix.h>
5#include <TNL/Devices/Host.h>
6
7
8template< typename Device >
9void initializerListExample()
10{
12 map.insert( std::make_pair( std::make_pair( 0, 0 ), 2.0 ) );
13 map.insert( std::make_pair( std::make_pair( 1, 0 ), -1.0 ) );
14 map.insert( std::make_pair( std::make_pair( 1, 1 ), 2.0 ) );
15 map.insert( std::make_pair( std::make_pair( 1, 2 ), -1.0 ) );
16 map.insert( std::make_pair( std::make_pair( 2, 1 ), -1.0 ) );
17 map.insert( std::make_pair( std::make_pair( 2, 2 ), 2.0 ) );
18 map.insert( std::make_pair( std::make_pair( 2, 3 ), -1.0 ) );
19 map.insert( std::make_pair( std::make_pair( 3, 2 ), -1.0 ) );
20 map.insert( std::make_pair( std::make_pair( 3, 3 ), 2.0 ) );
21 map.insert( std::make_pair( std::make_pair( 3, 4 ), -1.0 ) );
22 map.insert( std::make_pair( std::make_pair( 4, 4 ), 2.0 ) );
23
25
26 std::cout << "General sparse matrix: " << std::endl << matrix << std::endl;
27}
28
29int main( int argc, char* argv[] )
30{
31 std::cout << "Creating matrices on CPU ... " << std::endl;
32 initializerListExample< TNL::Devices::Host >();
33
34#ifdef HAVE_CUDA
35 std::cout << "Creating matrices on CUDA GPU ... " << std::endl;
36 initializerListExample< TNL::Devices::Cuda >();
37#endif
38}
T insert(T... args)
T make_pair(T... args)

The method setElements (TNL::Matrices::SparseMatrix::setElements) works the same way for already existing instances of sparse matrix:

1#include <iostream>
2#include <map>
3#include <utility>
4#include <TNL/Matrices/SparseMatrix.h>
5#include <TNL/Devices/Host.h>
6
7
8template< typename Device >
9void setElementsExample()
10{
12 map.insert( std::make_pair( std::make_pair( 0, 0 ), 2.0 ) );
13 map.insert( std::make_pair( std::make_pair( 1, 0 ), -1.0 ) );
14 map.insert( std::make_pair( std::make_pair( 1, 1 ), 2.0 ) );
15 map.insert( std::make_pair( std::make_pair( 1, 2 ), -1.0 ) );
16 map.insert( std::make_pair( std::make_pair( 2, 1 ), -1.0 ) );
17 map.insert( std::make_pair( std::make_pair( 2, 2 ), 2.0 ) );
18 map.insert( std::make_pair( std::make_pair( 2, 3 ), -1.0 ) );
19 map.insert( std::make_pair( std::make_pair( 3, 2 ), -1.0 ) );
20 map.insert( std::make_pair( std::make_pair( 3, 3 ), 2.0 ) );
21 map.insert( std::make_pair( std::make_pair( 3, 4 ), -1.0 ) );
22 map.insert( std::make_pair( std::make_pair( 4, 4 ), 2.0 ) );
23
25 matrix.setElements( map );
26
27 std::cout << "General sparse matrix: " << std::endl << matrix << std::endl;
28}
29
30int main( int argc, char* argv[] )
31{
32 std::cout << "Creating matrices on CPU ... " << std::endl;
33 setElementsExample< TNL::Devices::Host >();
34
35#ifdef HAVE_CUDA
36 std::cout << "Creating matrices on CUDA GPU ... " << std::endl;
37 setElementsExample< TNL::Devices::Cuda >();
38#endif
39}

The result of both examples looks as follows:

Creating matrices on CPU ...
General sparse matrix:
Row: 0 -> 0:2
Row: 1 -> 0:-1 1:2 2:-1
Row: 2 -> 1:-1 2:2 3:-1
Row: 3 -> 2:-1 3:2 4:-1
Row: 4 -> 4:2
Creating matrices on CUDA GPU ...
General sparse matrix:
Row: 0 -> 0:2
Row: 1 -> 0:-1 1:2 2:-1
Row: 2 -> 1:-1 2:2 3:-1
Row: 3 -> 2:-1 3:2 4:-1
Row: 4 -> 4:2

Note, however, that the map can be constructed only on CPU and not on GPU. It requires allocation of additional memory on the host system (CPU) and if the target sparse matrix resided on GPU, the matrix elements must be copied on GPU. This is the reason, why this way of the sparse matrix setup is inefficient compared to other methods.

### Methods setElement and addElement

Another way of setting the sparse matrix is by means of the methods setElement and addElement (TNL::Matrices::SparseMatrix::setElement, TNL::Matrices::SparseMatrix::addElement). The method can be called from both host (CPU) and device (GPU) if the matrix is allocated there. Note, however, that if the matrix is allocated on GPU and the methods are called from CPU there will be significant performance drop because the matrix elements will be transferer one-by-one separately. However, if the matrix elements setup is not a critical part of your algorithm this can be an easy way how to do it. See the following example:

1#include <iostream>
2#include <TNL/Algorithms/ParallelFor.h>
3#include <TNL/Matrices/SparseMatrix.h>
4#include <TNL/Devices/Host.h>
5#include <TNL/Devices/Cuda.h>
6#include <TNL/Pointers/SharedPointer.h>
7#include <TNL/Pointers/SmartPointersRegister.h>
8
9template< typename Device >
10void setElements()
11{
12 TNL::Matrices::SparseMatrix< double, Device > matrix( { 1, 1, 1, 1, 1 }, 5 );
13
14 /****
15 * Get the matrix view.
16 */
17 auto view = matrix.getView();
18 for( int i = 0; i < 5; i++ )
19 view.setElement( i, i, i );
20
21 std::cout << "Matrix set from the host:" << std::endl;
22 std::cout << matrix << std::endl;
23
24 auto f = [=] __cuda_callable__ ( int i ) mutable {
25 view.setElement( i, i, -i );
26 };
27
29
30 std::cout << "Matrix set from its native device:" << std::endl;
31 std::cout << matrix << std::endl;
32}
33
34int main( int argc, char* argv[] )
35{
36 std::cout << "Set elements on host:" << std::endl;
37 setElements< TNL::Devices::Host >();
38
39#ifdef HAVE_CUDA
40 std::cout << "Set elements on CUDA device:" << std::endl;
41 setElements< TNL::Devices::Cuda >();
42#endif
43}

We first allocate matrix with five rows (it is given by the size of the initializer list and columns and we set capacity each row to one (line 12). The first for-loop (lines 17-19) runs on CPU no matter where the matrix is allocated. After printing the matrix (lines 21-22), we call the lambda function f (lines 24-26) with a help of ParallelFor (TNL::Algorithms::ParallelFor , line 28) which is device sensitive and so it runs on CPU or GPU depending on where the matrix is allocated. The result looks as follows:

Set elements on host:
Matrix set from the host:
Row: 0 ->
Row: 1 -> 1:1
Row: 2 -> 2:2
Row: 3 -> 3:3
Row: 4 -> 4:4
Matrix set from its native device:
Row: 0 ->
Row: 1 -> 1:-1
Row: 2 -> 2:-2
Row: 3 -> 3:-3
Row: 4 -> 4:-4
Set elements on CUDA device:
Matrix set from the host:
Row: 0 ->
Row: 1 -> 1:1
Row: 2 -> 2:2
Row: 3 -> 3:3
Row: 4 -> 4:4
Matrix set from its native device:
Row: 0 ->
Row: 1 -> 1:-1
Row: 2 -> 2:-2
Row: 3 -> 3:-3
Row: 4 -> 4:-4

The method addElement (TNL::Matrices::SparseMatrix::addElement) adds a value to specific matrix element. Otherwise, it behaves the same as setElement. See the following example:

1#include <iostream>
2#include <TNL/Matrices/SparseMatrix.h>
3#include <TNL/Devices/Host.h>
4
5template< typename Device >
7{
8 TNL::Matrices::SparseMatrix< double, Device > matrix( { 5, 5, 5, 5, 5 }, 5 );
9 for( int i = 0; i < 5; i++ )
10 matrix.setElement( i, i, i );
11
12 std::cout << "Initial matrix is: " << std::endl << matrix << std::endl;
13
14 for( int i = 0; i < 5; i++ )
15 for( int j = 0; j < 5; j++ )
16 matrix.addElement( i, j, 1.0, 5.0 );
17
18 std::cout << "Matrix after addition is: " << std::endl << matrix << std::endl;
19}
20
21int main( int argc, char* argv[] )
22{
23 std::cout << "Add elements on host:" << std::endl;
25
26#ifdef HAVE_CUDA
27 std::cout << "Add elements on CUDA device:" << std::endl;
29#endif
30}

The result looks as follows:

Initial matrix is:
Row: 0 ->
Row: 1 -> 1:1
Row: 2 -> 2:2
Row: 3 -> 3:3
Row: 4 -> 4:4
Row: 0 -> 0:1 1:1 2:1 3:1 4:1
Row: 1 -> 0:1 1:6 2:1 3:1 4:1
Row: 2 -> 0:1 1:1 2:11 3:1 4:1
Row: 3 -> 0:1 1:1 2:1 3:16 4:1
Row: 4 -> 0:1 1:1 2:1 3:1 4:21
Initial matrix is:
Row: 0 ->
Row: 1 -> 1:1
Row: 2 -> 2:2
Row: 3 -> 3:3
Row: 4 -> 4:4
Row: 0 -> 0:1 1:1 2:1 3:1 4:1
Row: 1 -> 0:1 1:6 2:1 3:1 4:1
Row: 2 -> 0:1 1:1 2:11 3:1 4:1
Row: 3 -> 0:1 1:1 2:1 3:16 4:1
Row: 4 -> 0:1 1:1 2:1 3:1 4:21

### Method getRow

More efficient method, especially for GPUs, is to combine getRow (TNL::Matrices::SparseMatrix::getRow) method with ParallelFor (TNL::Algorithms::ParallelFor) and lambda function as the following example demonstrates:

1#include <iostream>
2#include <TNL/Algorithms/ParallelFor.h>
3#include <TNL/Matrices/SparseMatrix.h>
4#include <TNL/Devices/Host.h>
5#include <TNL/Devices/Cuda.h>
6#include <TNL/Pointers/SharedPointer.h>
7
8template< typename Device >
9void getRowExample()
10{
11 auto rowCapacities = { 1, 1, 1, 1, 1 }; // Variadic templates in SharedPointer
12 // constructor do not recognize initializer
13 // list so we give it a hint.
15 TNL::Pointers::SharedPointer< MatrixType > matrix( rowCapacities, 5 );
16
17 auto f = [=] __cuda_callable__ ( int rowIdx ) mutable {
18 auto row = matrix->getRow( rowIdx );
19 row.setElement( 0, rowIdx, 10 * ( rowIdx + 1 ) );
20 };
21
22 /***
23 * For the case when Device is CUDA device we need to synchronize smart
24 * pointers. To avoid this you may use SparseMatrixView. See
25 * SparseMatrixView::getRow example for details.
26 */
27 TNL::Pointers::synchronizeSmartPointersOnDevice< Device >();
28
29 /***
30 * Set the matrix elements.
31 */
32 TNL::Algorithms::ParallelFor< Device >::exec( 0, matrix->getRows(), f );
33 std::cout << *matrix << std::endl;
34}
35
36int main( int argc, char* argv[] )
37{
38 std::cout << "Getting matrix rows on host: " << std::endl;
39 getRowExample< TNL::Devices::Host >();
40
41#ifdef HAVE_CUDA
42 std::cout << "Getting matrix rows on CUDA device: " << std::endl;
43 getRowExample< TNL::Devices::Cuda >();
44#endif
45}
Cross-device shared smart pointer.
Definition: SharedPointer.h:49

On the line 21, we create small matrix having five rows (number of rows is given by the size of the initializer list ) and columns (number of columns is given by the second parameter) and we set each row capacity to one or three (particular elements of the initializer list). On the line 41, we call ParallelFor (TNL::Algorithms::ParallelFor) to iterate over all matrix rows. Each row is processed by the lambda function f (lines 24-36). In the lambda function, we first fetch a sparse matrix row (TNL::Matrices::SparseMatrixRowView) (line 25) which serves for accessing particular matrix elements in the matrix row. This object has a method setElement (TNL::Matrices::SparseMatrixRowView::setElement) accepting three parameters:

1. localIdx is a rank of the nonzero element in given matrix row.
2. columnIdx is the new column index of the matrix element.
3. value is the new value of the matrix element.

The result looks as follows:

Getting matrix rows on host:
Row: 0 -> 0:10
Row: 1 -> 1:20
Row: 2 -> 2:30
Row: 3 -> 3:40
Row: 4 -> 4:50
Getting matrix rows on CUDA device:
Row: 0 -> 0:10
Row: 1 -> 1:20
Row: 2 -> 2:30
Row: 3 -> 3:40
Row: 4 -> 4:50

### Method forRows

The method forRows (TNL::Matrices::SparseMatrix::forRows) calls the method getRow (TNL::Matrices::SparseMatrix::getRow) in parallel. See the following example which has the same effect as the previous one but it is slightly simpler:

1#include <iostream>
2#include <TNL/Algorithms/ParallelFor.h>
3#include <TNL/Matrices/SparseMatrix.h>
4#include <TNL/Devices/Host.h>
5#include <TNL/Devices/Cuda.h>
6
7template< typename Device >
8void forRowsExample()
9{
10 /***
11 * Set the following matrix (dots represent zero matrix elements):
12 *
13 * / 2 . . . . \
14 * | 1 2 1 . . |
15 * | . 1 2 1. . |
16 * | . . 1 2 1 |
17 * \ . . . . 2 /
18 */
19 const int size( 5 );
21 MatrixType matrix( { 1, 3, 3, 3, 1 }, size );
22 using RowView = typename MatrixType::RowView;
23
24 /***
25 * Set the matrix elements.
26 */
27 auto f = [=] __cuda_callable__ ( RowView& row ) mutable {
28 const int rowIdx = row.getRowIndex();
29 if( rowIdx == 0 )
30 row.setElement( 0, rowIdx, 2.0 ); // diagonal element
31 else if( rowIdx == size - 1 )
32 row.setElement( 0, rowIdx, 2.0 ); // diagonal element
33 else
34 {
35 row.setElement( 0, rowIdx - 1, 1.0 ); // elements below the diagonal
36 row.setElement( 1, rowIdx, 2.0 ); // diagonal element
37 row.setElement( 2, rowIdx + 1, 1.0 ); // elements above the diagonal
38 }
39 };
40 matrix.forAllRows( f );
41 std::cout << matrix << std::endl;
42
43 /***
44 * Divide each matrix row by a sum of all elements in the row - with use of iterators.
45 */
46 matrix.forAllRows( [=] __cuda_callable__ ( RowView& row ) mutable {
47 double sum( 0.0 );
48 for( auto element : row )
49 sum += element.value();
50 for( auto element: row )
51 element.value() /= sum;
52 } );
53 std::cout << "Divide each matrix row by a sum of all elements in the row ... " << std::endl;
54 std::cout << matrix << std::endl;
55}
56
57int main( int argc, char* argv[] )
58{
59 std::cout << "Getting matrix rows on host: " << std::endl;
60 forRowsExample< TNL::Devices::Host >();
61
62#ifdef HAVE_CUDA
63 std::cout << "Getting matrix rows on CUDA device: " << std::endl;
64 forRowsExample< TNL::Devices::Cuda >();
65#endif
66}

The differences are:

1. We do not need to get the matrix view as we did in the previous example.
2. We call the method forAllRows (TNL::Matrices::SparseMatrix::forAllRows) instead of ParallelFor (TNL::Algorithms::ParallelFor) which is simpler since we do not have to state the device type explicitly. The method forAllRows calls the method forRows for all matrix rows so we do not have to state explicitly the interval of matrix rows neither.
3. The lambda function f (lines 27-39) accepts one parameter row of the type RowView (TNL::Matrices::SparseMatrix::RowView which is TNL::Matrices::SparseMatrixRowView) instead of the index of the matrix row. Therefore we do not need to call the method getRow (TNL::Matrices::SparseMatrix::getRow). On the other hand, we need the method geRowIndex (TNL::Matrices::SparseMatrixRowView::getRowIndex) to get the index of the matrix row (line 28).

On the lines 46-52, we call a lambda function which computes sum of all elements in a row (lines 47-49) and it divides the row by the sum then (lines 50-51).

The result looks as follows:

Getting matrix rows on host:
Row: 0 -> 0:2
Row: 1 -> 0:1 1:2 2:1
Row: 2 -> 1:1 2:2 3:1
Row: 3 -> 2:1 3:2 4:1
Row: 4 -> 4:2
Divide each matrix row by a sum of all elements in the row ...
Row: 0 -> 0:1
Row: 1 -> 0:0.25 1:0.5 2:0.25
Row: 2 -> 1:0.25 2:0.5 3:0.25
Row: 3 -> 2:0.25 3:0.5 4:0.25
Row: 4 -> 4:1
Getting matrix rows on CUDA device:
Row: 0 -> 0:2
Row: 1 -> 0:1 1:2 2:1
Row: 2 -> 1:1 2:2 3:1
Row: 3 -> 2:1 3:2 4:1
Row: 4 -> 4:2
Divide each matrix row by a sum of all elements in the row ...
Row: 0 -> 0:1
Row: 1 -> 0:0.25 1:0.5 2:0.25
Row: 2 -> 1:0.25 2:0.5 3:0.25
Row: 3 -> 2:0.25 3:0.5 4:0.25
Row: 4 -> 4:1

### Method forElements

Finally, another efficient way of setting the nonzero matrix elements, is use of the method forElements (TNL::Matrices::SparseMatrix::forElements). It requires indexes of the range of rows (begin and end) to be processed and a lambda function function which is called for each nonzero element. The lambda function provides the following data:

• rowIdx is a row index of the matrix element.
• localIdx is an index of the nonzero matrix element within the matrix row.
• columnIdx is a column index of the matrix element. If the matrix element column index is supposed to be modified, this parameter can be a reference and so its value can be changed.
• value is a value of the matrix element. If the matrix element value is supposed to be modified, this parameter can be a reference as well and so the element value can be changed.
• compute is a bool reference. When it is set to false the rest of the row can be omitted. This is, however, only a hint and it depends on the underlying matrix format if it is taken into account.

See the following example:

1#include <iostream>
2#include <TNL/Matrices/SparseMatrix.h>
3#include <TNL/Devices/Host.h>
4#include <TNL/Devices/Cuda.h>
5
6template< typename Device >
7void forElementsExample()
8{
9 TNL::Matrices::SparseMatrix< double, Device > matrix( { 1, 2, 3, 4, 5 }, 5 );
10
11 auto f = [=] __cuda_callable__ ( int rowIdx, int localIdx, int& columnIdx, double& value ) {
12 if( rowIdx >= localIdx ) // This is important, some matrix formats may allocate more matrix elements
13 // than we requested. These padding elements are processed here as well.
14 {
15 columnIdx = localIdx;
16 value = rowIdx + localIdx + 1;
17 }
18 };
19
20 matrix.forElements( 0, matrix.getRows(), f );
21 std::cout << matrix << std::endl;
22}
23
24int main( int argc, char* argv[] )
25{
26 std::cout << "Creating matrix on host: " << std::endl;
27 forElementsExample< TNL::Devices::Host >();
28
29#ifdef HAVE_CUDA
30 std::cout << "Creating matrix on CUDA device: " << std::endl;
31 forElementsExample< TNL::Devices::Cuda >();
32#endif
33}

On the line 9, we allocate a lower triangular matrix byt setting the row capacities as {1,2,3,4,5}. On the line 11, we prepare lambda function f which we execute on the line 22 just by calling the method forElements (TNL::Matrices::SparseMatrix::forElements). This method takes the range of matrix rows as the first two parameters and the lambda function as the last parameter. The lambda function receives parameters mentioned above (see the line 11). We first check if the matrix element coordinates (rowIdx and localIdx) points to an element lying before the matrix diagonal or on the diagonal (line 12). In case of the lower triangular matrix in our example, the local index is in fact the same as the column index

$\left( \begin{array}{ccccc} 0 & . & . & . & . \\ 0 & 1 & . & . & . \\ 0 & 1 & 2 & . & . \\ 0 & 1 & 2 & 3 & . \\ 0 & 1 & 2 & 3 & 4 \end{array} \right)$

If we call the method forElements (TNL::Matrices::SparseMatrix::forElements) to setup the matrix elements for the first time, the parameter columnIdx has no sense because the matrix elements and their column indexes were not set yet. Therefore it is important that the test on the line 12 reads as

if( rowIdx < localIdx )

because

if( rowIdx < columnIdx )

would not make sense. If we pass through this test, the matrix element lies in the lower triangular part of the matrix and we may set the matrix elements which is done on the lines 17 and 18. The column index (columnIdx) is set to local index (line 17) and value is set on the line 18. The result looks as follows:

Creating matrix on host:
Row: 0 -> 0:1
Row: 1 -> 0:2 1:3
Row: 2 -> 0:3 1:4 2:5
Row: 3 -> 0:4 1:5 2:6 3:7
Row: 4 -> 0:5 1:6 2:7 3:8 4:9
Creating matrix on CUDA device:
Row: 0 -> 0:1
Row: 1 -> 0:2 1:3
Row: 2 -> 0:3 1:4 2:5
Row: 3 -> 0:4 1:5 2:6 3:7
Row: 4 -> 0:5 1:6 2:7 3:8 4:9

### Wrapping existing data to sparse matrix view

Standard sparse matrix format like CSR and Ellpack store the matrix elements in specifically defined arrays. In case that you have already allocated them (for example in some other library), they can be wrapped into a sparse matrix view with given matrix format. This can be done by means of functions TNL::Matrices::wrapCSRMatrix and TNL::Matrices::wrapEllpackMatrix . See the following example for demonstration of the CSR format:

1#include <iostream>
2#include <TNL/Algorithms/ParallelFor.h>
3#include <TNL/Matrices/DenseMatrix.h>
4#include <TNL/Matrices/MatrixWrapping.h>
5#include <TNL/Devices/Host.h>
6#include <TNL/Devices/Cuda.h>
7
8template< typename Device >
9void wrapMatrixView()
10{
11 /***
12 * Encode the following matrix to CSR format...
13 *
14 * / 1 2 0 0 \.
15 * | 0 6 0 0 |
16 * | 9 0 0 0 |
17 * \ 0 0 15 16 /
18 */
19 const int rows( 4 ), columns( 4 );
20 TNL::Containers::Vector< double, Device > valuesVector { 1, 2, 6, 9, 15, 16 };
21 TNL::Containers::Vector< int, Device > columnIndexesVector { 0, 1, 1, 0, 2, 3 };
22 TNL::Containers::Vector< int, Device > rowPointersVector { 0, 2, 3, 4, 6 };
23
24 double* values = valuesVector.getData();
25 int* columnIndexes = columnIndexesVector.getData();
26 int* rowPointers = rowPointersVector.getData();
27
28 /***
29 * Wrap the arrays rowPointers, values and columnIndexes to sparse matrix view
30 */
31 auto matrix = TNL::Matrices::wrapCSRMatrix< Device >( rows, columns, rowPointers, values, columnIndexes );
32
33 std::cout << "Matrix reads as: " << std::endl << matrix << std::endl;
34}
35
36int main( int argc, char* argv[] )
37{
38 std::cout << "Wraping matrix view on host: " << std::endl;
39 wrapMatrixView< TNL::Devices::Host >();
40
41#ifdef HAVE_CUDA
42 std::cout << "Wraping matrix view on CUDA device: " << std::endl;
43 wrapMatrixView< TNL::Devices::Cuda >();
44#endif
45}

We create sparse matrix having four rows and four columns (line 19). We use TNL vector (TNL::Containers::Vector) to allocate arrays necessary for the CSR format:

1. valuesVector (line 20) - contains values of the nonzero matrix elements.
2. columnIndexesVector (line 21) - contains column indexes of the nonzero matrix elements.
3. rowPointersVector (line 22) - contains positions of the first nonzero matrix elements in each row within valuesVector and columnIndexesVector. The size of this array equals number of matrix rows plus one.

Next we turn the vectors into C style pointers (lines 24-26) to wrap them into sparse matrix view (line 31). Note, that we must explicitly state the device on which the arrays are allocated. Finlay we print the matrix to check the correctness (line 33). The result looks as follows:

Wraping matrix view on host:
Row: 0 -> 0:1 1:2
Row: 1 -> 1:6
Row: 2 -> 0:9
Row: 3 -> 2:15 3:16
Wraping matrix view on CUDA device:
Row: 0 -> 0:1 1:2
Row: 1 -> 1:6
Row: 2 -> 0:9
Row: 3 -> 2:15 3:16

Wrapping data corresponding with the Ellpack format is very similar as we can see in the following example:

1#include <iostream>
2#include <TNL/Algorithms/ParallelFor.h>
3#include <TNL/Matrices/DenseMatrix.h>
4#include <TNL/Matrices/MatrixWrapping.h>
5#include <TNL/Devices/Host.h>
6#include <TNL/Devices/Cuda.h>
7
8template< typename Device >
9void wrapMatrixView()
10{
11 /***
12 * Encode the following matrix to Ellpack format...
13 *
14 * / 1 2 0 0 \.
15 * | 0 6 0 0 |
16 * | 9 0 0 0 |
17 * \ 0 0 15 16 /
18 */
19 const int rows( 4 ), columns( 4 );
20 TNL::Containers::Vector< double, Device > valuesVector { 1, 2, 6, 0, 9, 0, 15, 16 };
21 TNL::Containers::Vector< int, Device > columnIndexesVector { 0, 1, 1, -1, 0, -1, 2, 3 };
22
23 double* values = valuesVector.getData();
24 int* columnIndexes = columnIndexesVector.getData();
25
26 /***
27 * Wrap the arrays values and columnIndexes to sparse matrix view
28 */
29 auto matrix = TNL::Matrices::wrapEllpackMatrix< Device, TNL::Algorithms::Segments::RowMajorOrder >( rows, columns, 2, values, columnIndexes );
30
31 std::cout << "Matrix reads as: " << std::endl << matrix << std::endl;
32}
33
34int main( int argc, char* argv[] )
35{
36 std::cout << "Wraping matrix view on host: " << std::endl;
37 wrapMatrixView< TNL::Devices::Host >();
38
39#ifdef HAVE_CUDA
40 std::cout << "Wraping matrix view on CUDA device: " << std::endl;
41 wrapMatrixView< TNL::Devices::Cuda >();
42#endif
43}

We encode the same sparse matrix as in the previous example. The essence of the Ellpack format is that we allocate the same number of matrix elements for each row which is two in our example. For some matrix rows we use the padding zeros for which we set the column index to -1 (line 21). Therefore the size of valuesVector and columnIndexesVector equals number of matrix rows times number of matrix elements allocated in each row. As before, we turn the vectors into C style pointers (lines 23-24) and wrap them into sparse matrix view with Ellpack format (line 29). Note that we must state the device on which the arrays are allocated explicitly and also the matrix elements organization, which is TNL::Algorithms::Segments::RowMajorOrder in this case. For Ellpack matrix stored on GPU, TNL::Algorithms::Segments::ColumnMajorOrder is preferred. The result looks as follows:

Wraping matrix view on host:
Row: 0 -> 0:1 1:2
Row: 1 -> 1:6
Row: 2 -> 0:9
Row: 3 -> 2:15 3:16
Wraping matrix view on CUDA device:
Row: 0 -> 0:1 1:2
Row: 1 -> 1:6
Row: 2 -> 0:9
Row: 3 -> 2:15 3:16

### Symmetric sparse matrices

For sparse symmetric matrices, TNL offers a format storing only a half of the matrix elements. More precisely, ony the matrix diagonal and the elements bellow are stored in the memory. The matrix elements above the diagonal are deduced from those bellow. If such a symmetric format is used on GPU, atomic operations must be used in some matrix operations. For this reason, symmetric matrices can be combined only with matrix elements values expressed in float or double type. An advantage of the symmetric formats is lower memory consumption. Since less data need to be transferred from the memory, better performance might be observed. In some cases, however, the use of atomic operations on GPU may cause performance drop. Mostly we can see approximately the same performance compared to general formats but we can profit from lower memory requirements which is appreciated especially on GPU. The following example shows how to create symmetric sparse matrix.

1#include <iostream>
2#include <TNL/Matrices/SparseMatrix.h>
3#include <TNL/Devices/Host.h>
4
5
6template< typename Device >
7void symmetricSparseMatrixExample()
8{
10 5, // number of matrix rows
11 5, // number of matrix columns
12 { // matrix elements definition
13 { 0, 0, 1.0 },
14 { 1, 0, 2.0 }, { 1, 1, 1.0 },
15 { 2, 0, 3.0 }, { 2, 2, 1.0 },
16 { 3, 0, 4.0 }, { 3, 3, 1.0 },
17 { 4, 0, 5.0 }, { 4, 4, 1.0 } } );
18
19 std::cout << "Symmetric sparse matrix: " << std::endl << symmetricMatrix << std::endl;
20
21 TNL::Containers::Vector< double, Device > inVector( 5, 1.0 ), outVector( 5, 0.0 );
22 symmetricMatrix.vectorProduct( inVector, outVector );
23 std::cout << "Product with vector " << inVector << " is " << outVector << std::endl << std::endl;
24}
25
26int main( int argc, char* argv[] )
27{
28 std::cout << "Creating matrix on CPU ... " << std::endl;
29 symmetricSparseMatrixExample< TNL::Devices::Host >();
30
31#ifdef HAVE_CUDA
32 std::cout << "Creating matrix on CUDA GPU ... " << std::endl;
33 symmetricSparseMatrixExample< TNL::Devices::Cuda >();
34#endif
35}

We construct matrix of the following form

$\left( \begin{array}{ccccc} 1 & \color{grey}{2} & \color{grey}{3} & \color{grey}{4} & \color{grey}{5} \\ 2 & 1 & & & \\ 3 & & 1 & & \\ 4 & & & 1 & \\ 5 & & & & 1 \end{array} \right)$

The elements depicted in grey color are not stored in the memory. The main difference, compared to creation of general sparse matrix, is on line 9 where we state that the matrix is symmetric by setting the matrix type to TNL::Matrices::SymmetricMatrix. Next we set only the diagonal elements and those lying bellow the diagonal (lines 13-17). When we print the matrix (line 19) we can see also the symmetric part above the diagonal. Next we test product of matrix and vector (lines 21-23). The result looks as follows:

Creating matrix on CPU ...
Symmetric sparse matrix:
Row: 0 -> Col:0->1 Col:1->2 Col:2->3 Col:3->4 Col:4->5
Row: 1 -> Col:0->2 Col:1->1
Row: 2 -> Col:0->3 Col:2->1
Row: 3 -> Col:0->4 Col:3->1
Row: 4 -> Col:0->5 Col:4->1
Product with vector [ 1, 1, 1, 1, 1 ] is [ 15, 3, 4, 5, 6 ]
Creating matrix on CUDA GPU ...
Symmetric sparse matrix:
Row: 0 -> Col:0->1 Col:1->2 Col:2->3 Col:3->4 Col:4->5
Row: 1 -> Col:0->2 Col:1->1
Row: 2 -> Col:0->3 Col:2->1
Row: 3 -> Col:0->4 Col:3->1
Row: 4 -> Col:0->5 Col:4->1
Product with vector [ 1, 1, 1, 1, 1 ] is [ 15, 3, 4, 5, 6 ]

Warning: Assignment of symmetric sparse matrix to general sparse matrix does not give correct result, currently. Only the diagonal and the lower part of the matrix is assigned.

### Binary sparse matrices

If the matrix element value type (i.e. Real type) is set to bool the matrix elements can be only 1 or 0. So in the sparse matrix formats, where we do not store the zero matrix elements, explicitly stored elements can have only one possible value which is 1. Therefore we do not need to store the values, only the positions of the nonzero elements. The array values, which usualy stores the matrix elements values, can be completely omitted and we can reduce the memory requirements. The following table shows how much we can reduce the memory consumption when using binary matrix instead of common sparse matrix using float or double types:

Real Index Common sparse matrix Binary sparse matrix Ratio
float 32-bit 4 + 4 = 8 B 4 B 50%
float 64-bit 4 + 8 = 12 B 8 B 75%
double 32-bit 8 + 4 = 12 B 4 B 33%
double 64-bit 8 + 8 = 16 B 8 B 50%

The following example demonstrates the use of binary matrix:

1#include <iostream>
2#include <TNL/Matrices/SparseMatrix.h>
3#include <TNL/Algorithms/Segments/CSR.h>
4#include <TNL/Devices/Host.h>
5
6
7template< typename Device >
8void binarySparseMatrixExample()
9{
11 5, // number of matrix rows
12 5, // number of matrix columns
13 { // matrix elements definition
14 { 0, 0, 1.0 }, { 0, 1, 2.0 }, { 0, 2, 3.0 }, { 0, 3, 4.0 }, { 0, 4, 5.0 },
15 { 1, 0, 2.0 }, { 1, 1, 1.0 },
16 { 2, 0, 3.0 }, { 2, 2, 1.0 },
17 { 3, 0, 4.0 }, { 3, 3, 1.0 },
18 { 4, 0, 5.0 }, { 4, 4, 1.0 } } );
19
20 std::cout << "Binary sparse matrix: " << std::endl << binaryMatrix << std::endl;
21
22 TNL::Containers::Vector< double, Device > inVector( 5, 1.1 ), outVector( 5, 0.0 );
23 binaryMatrix.vectorProduct( inVector, outVector );
24 std::cout << "Product with vector " << inVector << " is " << outVector << std::endl << std::endl;
25
27 binaryMatrix2 = binaryMatrix;
28 binaryMatrix2.vectorProduct( inVector, outVector );
29 std::cout << "Product with vector in double precision " << inVector << " is " << outVector << std::endl << std::endl;
30}
31
32int main( int argc, char* argv[] )
33{
34 std::cout << "Creating matrix on CPU ... " << std::endl;
35 binarySparseMatrixExample< TNL::Devices::Host >();
36
37#ifdef HAVE_CUDA
38 std::cout << "Creating matrix on CUDA GPU ... " << std::endl;
39 binarySparseMatrixExample< TNL::Devices::Cuda >();
40#endif
41}
void vectorProduct(const InVector &inVector, OutVector &outVector, ComputeRealType matrixMultiplicator=1.0, ComputeRealType outVectorMultiplicator=0.0, IndexType begin=0, IndexType end=0) const
Computes product of matrix and vector.
Definition: SparseMatrix.hpp:559

All we need to do is set the Real type to bool as we can see on the line 9. We can see that even though we set different values to different matrix elements (lines 14-18) at the end all of them are turned into ones (printing of the matrix on the line 20). There is an issue, however, which is demonstrated on the product of the matrix with a vector. Nonbinary matrices compute all operations using the Real type. If it is set to bool operations like SpMV would not get correct solution. Therefore sparse matrices use another type called ComputeReal which is the 6th template parameter of TNL::Matrices::SparseMatrix. By default it is set to Index type but it can be changed by the user. On the lines 26-29 we show how to change this type to double and what is the effect of it (correct result of matrix-vector multiplication). The result looks as follows:

Creating matrix on CPU ...
Binary sparse matrix:
Row: 0 -> 0:1 1:1 2:1 3:1 4:1
Row: 1 -> 0:1 1:1
Row: 2 -> 0:1 2:1
Row: 3 -> 0:1 3:1
Row: 4 -> 0:1 4:1
Product with vector [ 1.1, 1.1, 1.1, 1.1, 1.1 ] is [ 5, 2, 2, 2, 2 ]
Product with vector in double precision [ 1.1, 1.1, 1.1, 1.1, 1.1 ] is [ 5.5, 2.2, 2.2, 2.2, 2.2 ]
Creating matrix on CUDA GPU ...
Binary sparse matrix:
Row: 0 -> 0:1 1:1 2:1 3:1 4:1
Row: 1 -> 0:1 1:1
Row: 2 -> 0:1 2:1
Row: 3 -> 0:1 3:1
Row: 4 -> 0:1 4:1
Product with vector [ 1.1, 1.1, 1.1, 1.1, 1.1 ] is [ 5, 2, 2, 2, 2 ]
Product with vector in double precision [ 1.1, 1.1, 1.1, 1.1, 1.1 ] is [ 5.5, 2.2, 2.2, 2.2, 2.2 ]

## Tridiagonal matrices

Tridiagonal matrix format serves for specific matrix pattern when the nonzero matrix elements can be placed only at the diagonal and immediately next to the diagonal. Here is an example:

$\left( \begin{array}{ccccccc} 2 & -1 & . & . & . & . \\ -1 & 2 & -1 & . & . & . \\ . & -1 & 2 & -1 & . & . \\ . & . & -1 & 2 & -1 & . \\ . & . & . & -1 & 2 & -1 \\ . & . & . & . & -1 & 2 \end{array} \right)$

An advantage is that we do not store the column indexes explicitly as it is in TNL::Matrices::SparseMatrix. This can reduce significantly the memory requirements which also means better performance. See the following table for the storage requirements comparison between TNL::Matrices::TridiagonalMatrix and TNL::Matrices::SparseMatrix.

Real Index SparseMatrix TridiagonalMatrix Ratio
float 32-bit int 8 bytes per element 4 bytes per element 50%
double 32-bit int 12 bytes per element 8 bytes per element 75%
float 64-bit int 12 bytes per element 4 bytes per element 30%
double 64-bit int 16 bytes per element 8 bytes per element 50%

Tridiagonal matrix is a templated class defined in the namespace TNL::Matrices. It has five template parameters:

• Real is a type of the matrix elements. It is double by default.
• Device is a device where the matrix shall be allocated. Currently it can be either TNL::Devices::Host for CPU or TNL::Devices::Cuda for GPU supporting CUDA. It is TNL::Devices::Host by default.
• Index is a type to be used for indexing of the matrix elements. It is int by default.
• ElementsOrganization defines the organization of the matrix elements in memory. It can be TNL::Algorithms::Segments::ColumnMajorOrder or TNL::Algorithms::Segments::RowMajorOrder for column-major and row-major organization respectively. Be default it is the row-major order if the matrix is allocated in the host system and column major order if it is allocated on GPU.
• RealAllocator is a memory allocator (one from TNL::Allocators) which shall be used for allocation of the matrix elements. By default, it is the default allocator for given Real type and Device type – see TNL::Allocators::Default.

For better alignment in the memory the tridiagonal format is organized like if there were three nonzero matrix elements in each row. This is not true for example in the first row where there is no matrix element on the left side of the diagonal. The same happens on the last row of the matrix. We have to add even the artificial matrix elements like this:

$\begin{array}{c} 0 \\ . \\ . \\ . \\ . \\ . \end{array} \left( \begin{array}{ccccccc} 2 & -1 & . & . & . & . \\ -1 & 2 & -1 & . & . & . \\ . & -1 & 2 & -1 & . & . \\ . & . & -1 & 2 & -1 & . \\ . & . & . & -1 & 2 & -1 \\ . & . & . & . & -1 & 2 \end{array} \right) \begin{array}{c} . \\ . \\ . \\ . \\ . \\ 0 \end{array}$

If the tridiagonal matrix has more rows then columns, we have to extend the last two rows with nonzero elements in this way

$\left( \begin{array}{ccccccc} 2 & -1 & . & . & . & . \\ -1 & 2 & -1 & . & . & . \\ . & -1 & 2 & -1 & . & . \\ . & . & -1 & 2 & -1 & . \\ . & . & . & -1 & 2 & -1 \\ . & . & . & . & -1 & 2 \\ . & . & . & . & . & -1 \end{array} \right) \rightarrow \begin{array}{c} 0 \\ . \\ . \\ . \\ . \\ . \\ . \end{array} \left( \begin{array}{ccccccc} 2 & -1 & . & . & . & . \\ -1 & 2 & -1 & . & . & . \\ . & -1 & 2 & -1 & . & . \\ . & . & -1 & 2 & -1 & . \\ . & . & . & -1 & 2 & -1 \\ . & . & . & . & -1 & 2 \\ . & . & . & . & . & -1 \end{array} \right) \begin{array}{cc} . & . \\ . & . \\ . & . \\ . & . \\ . & . \\ 0 & . \\ 0 & 0 \end{array}$

We also would like to remind the meaning of the local index (localIdx) of the matrix element within a matrix row. It is a rank of the nonzero matrix element in given row as we explained in section Indexing of nonzero matrix elements in sparse matrices. The values of the local index for tridiagonal matrix elements are as follows

$\left( \begin{array}{cccccc} 1 & 2 & & & & \\ 0 & 1 & 2 & & & \\ & 0 & 1 & 2 & & \\ & & 0 & 1 & 2 & \\ & & & 0 & 1 & 2 \\ & & & & 0 & 1 \end{array} \right)$

In the following text we show different methods for setup of tridiagonal matrices.

### Initializer list

The tridiagonal matrix can be initialized by the means of the constructor with initializer list. The matrix from the beginning of this section can be constructed as the following example demonstrates:

1#include <iostream>
2#include <TNL/Algorithms/ParallelFor.h>
3#include <TNL/Matrices/TridiagonalMatrix.h>
4#include <TNL/Devices/Host.h>
5#include <TNL/Devices/Cuda.h>
6
7
8template< typename Device >
9void createTridiagonalMatrix()
10{
11 const int matrixSize = 6;
12
13 /***
14 * Setup the following matrix (dots represent zeros):
15 *
16 * / 2 -1 . . . . \
17 * | -1 2 -1 . . . |
18 * | . -1 2 -1 . . |
19 * | . . -1 2 -1 . |
20 * | . . . -1 2 -1 |
21 * \ . . . . -1 2 /
22 *
23 */
25 matrixSize, {
26 /***
27 * To set the matrix elements we first extend the diagonals to their full
28 * lengths even outside the matrix (dots represent zeros and zeros are
29 * artificial zeros used for memory alignment):
30 *
31 * 0 / 2 -1 . . . . \ -> { 0, 2, -1 }
32 * | -1 2 -1 . . . | -> { -1, 2, -1 }
33 * | . -1 2 -1 . . | -> { -1, 2, -1 }
34 * | . . -1 2 -1 . | -> { -1, 2, -1 }
35 * | . . . -1 2 -1 | -> { -1, 2, -1 }
36 * \ . . . . -1 2 / 0 -> { -1, 2, 0 }
37 *
38 */
39 { 0, 2, -1 },
40 { -1, 2, -1 },
41 { -1, 2, -1 },
42 { -1, 2, -1 },
43 { -1, 2, -1 },
44 { -1, 2, 0 }
45 } );
46 std::cout << "The matrix reads as: " << std::endl << matrix << std::endl;
47}
48
49int main( int argc, char* argv[] )
50{
51 std::cout << "Creating tridiagonal matrix on CPU ... " << std::endl;
52 createTridiagonalMatrix< TNL::Devices::Host >();
53
54#ifdef HAVE_CUDA
55 std::cout << "Creating tridiagonal matrix on CUDA GPU ... " << std::endl;
56 createTridiagonalMatrix< TNL::Devices::Cuda >();
57#endif
58}
Implementation of sparse tridiagonal matrix.
Definition: TridiagonalMatrix.h:63

The matrix elements values are defined on lines 39-44. Each matrix row is represented by embedded an initializer list. We set three values in each row including the padding zeros.

The output of the example looks as:

Creating tridiagonal matrix on CPU ...
Row: 0 -> 0:2 1:-1
Row: 1 -> 0:-1 1:2 2:-1
Row: 2 -> 1:-1 2:2 3:-1
Row: 3 -> 2:-1 3:2 4:-1
Row: 4 -> 3:-1 4:2 5:-1
Row: 5 -> 4:-1 5:2
Creating tridiagonal matrix on CUDA GPU ...
Row: 0 -> 0:2 1:-1
Row: 1 -> 0:-1 1:2 2:-1
Row: 2 -> 1:-1 2:2 3:-1
Row: 3 -> 2:-1 3:2 4:-1
Row: 4 -> 3:-1 4:2 5:-1
Row: 5 -> 4:-1 5:2

Similar way of the tridiagonal matrix setup is offered by the method setElements (TNL::Matrices::TridiagonalMatrix::setElements) as the following example demonstrates:

1#include <iostream>
2#include <TNL/Algorithms/ParallelFor.h>
3#include <TNL/Matrices/TridiagonalMatrix.h>
4#include <TNL/Devices/Host.h>
5#include <TNL/Devices/Cuda.h>
6
7
8template< typename Device >
9void createTridiagonalMatrix()
10{
11 const int matrixSize = 6;
12
13 /***
14 * Setup the following matrix (dots represent zeros):
15 *
16 * / 2 -1 . . . . \
17 * | -1 2 -1 . . . |
18 * | . -1 2 -1 . . |
19 * | . . -1 2 -1 . |
20 * | . . . -1 2 -1 |
21 * \ . . . . -1 2 /
22 *
23 */
24 TNL::Matrices::TridiagonalMatrix< double, Device > matrix( matrixSize, matrixSize );
25 matrix.setElements( {
26 /***
27 * To set the matrix elements we first extend the diagonals to their full
28 * lengths even outside the matrix (dots represent zeros and zeros are
29 * artificial zeros used for memory alignment):
30 *
31 * 0 / 2 -1 . . . . \ -> { 0, 2, -1 }
32 * | -1 2 -1 . . . | -> { -1, 2, -1 }
33 * | . -1 2 -1 . . | -> { -1, 2, -1 }
34 * | . . -1 2 -1 . | -> { -1, 2, -1 }
35 * | . . . -1 2 -1 | -> { -1, 2, -1 }
36 * \ . . . . -1 2 / 0 -> { -1, 2, 0 }
37 *
38 */
39 { 0, 2, -1 },
40 { -1, 2, -1 },
41 { -1, 2, -1 },
42 { -1, 2, -1 },
43 { -1, 2, -1 },
44 { -1, 2, 0 }
45 } );
46 std::cout << "The matrix reads as: " << std::endl << matrix << std::endl;
47}
48
49int main( int argc, char* argv[] )
50{
51 std::cout << "Creating tridiagonal matrix on CPU ... " << std::endl;
52 createTridiagonalMatrix< TNL::Devices::Host >();
53
54#ifdef HAVE_CUDA
55 std::cout << "Creating tridiagonal matrix on CUDA GPU ... " << std::endl;
56 createTridiagonalMatrix< TNL::Devices::Cuda >();
57#endif
58}

Here we create the matrix in two steps. Firstly, we setup the matrix dimensions by the appropriate constructor (line 24) and after that we setup the matrix elements (line 25-45). The result looks the same as in the previous example:

Creating tridiagonal matrix on CPU ...
Row: 0 -> 0:2 1:-1
Row: 1 -> 0:-1 1:2 2:-1
Row: 2 -> 1:-1 2:2 3:-1
Row: 3 -> 2:-1 3:2 4:-1
Row: 4 -> 3:-1 4:2 5:-1
Row: 5 -> 4:-1 5:2
Creating tridiagonal matrix on CUDA GPU ...
Row: 0 -> 0:2 1:-1
Row: 1 -> 0:-1 1:2 2:-1
Row: 2 -> 1:-1 2:2 3:-1
Row: 3 -> 2:-1 3:2 4:-1
Row: 4 -> 3:-1 4:2 5:-1
Row: 5 -> 4:-1 5:2

In the following example we create tridiagonal matrix with 5 rows and 5 columns (line 12-14) by the means of a shared pointer (TNL::Pointers::SharedPointer) to make this work even on GPU. We set numbers 0,...,4 on the diagonal (line 16) and we print the matrix (line 18). Next we use a lambda function (lines 21-27) combined with parallel for (TNL::Algorithms::ParallelFor) (line 35), to modify the matrix. The offdiagonal elements are set to 1 (lines 23 and 26) and for the diagonal elements, we change the sign (line 24).

1#include <iostream>
2#include <TNL/Algorithms/ParallelFor.h>
3#include <TNL/Matrices/TridiagonalMatrix.h>
4#include <TNL/Devices/Host.h>
5#include <TNL/Devices/Cuda.h>
6#include <TNL/Pointers/SharedPointer.h>
7#include <TNL/Pointers/SmartPointersRegister.h>
8
9template< typename Device >
10void setElements()
11{
12 const int matrixSize( 5 );
14 TNL::Pointers::SharedPointer< Matrix > matrix( matrixSize, matrixSize );
15 for( int i = 0; i < 5; i++ )
16 matrix->setElement( i, i, i );
17
18 std::cout << "Matrix set from the host:" << std::endl;
19 std::cout << *matrix << std::endl;
20
21 auto f = [=] __cuda_callable__ ( int i ) mutable {
22 if( i > 0 )
23 matrix->setElement( i, i - 1, 1.0 );
24 matrix->setElement( i, i, -i );
25 if( i < matrixSize - 1 )
26 matrix->setElement( i, i + 1, 1.0 );
27 };
28
29 /***
30 * For the case when Device is CUDA device we need to synchronize smart
31 * pointers. To avoid this you may use TridiagonalMatrixView. See
32 * TridiagonalMatrixView::getRow example for details.
33 */
34 TNL::Pointers::synchronizeSmartPointersOnDevice< Device >();
36
37 std::cout << "Matrix set from its native device:" << std::endl;
38 std::cout << *matrix << std::endl;
39}
40
41int main( int argc, char* argv[] )
42{
43 std::cout << "Set elements on host:" << std::endl;
44 setElements< TNL::Devices::Host >();
45
46#ifdef HAVE_CUDA
47 std::cout << "Set elements on CUDA device:" << std::endl;
48// FIXME: it does not work with nvcc 10.1
49// setElements< TNL::Devices::Cuda >();
50#endif
51}

The result looks as follows:

Set elements on host:
Matrix set from the host:
Row: 0 ->
Row: 1 -> 1:1
Row: 2 -> 2:2
Row: 3 -> 3:3
Row: 4 -> 4:4
Matrix set from its native device:
Row: 0 -> 1:1
Row: 1 -> 0:1 1:-1 2:1
Row: 2 -> 1:1 2:-2 3:1
Row: 3 -> 2:1 3:-3 4:1
Row: 4 -> 3:1 4:-4
Set elements on CUDA device:

### Method getRow

A bit different way of setting up the matrix, is the use of tridiagonal matrix view and the method getRow (TNL::Matrices::TridiagonalMatrixView::getRow) as the following example demonstrates:

1#include <iostream>
2#include <TNL/Algorithms/ParallelFor.h>
3#include <TNL/Matrices/TridiagonalMatrix.h>
4#include <TNL/Devices/Host.h>
5#include <TNL/Devices/Cuda.h>
6
7template< typename Device >
8void getRowExample()
9{
10 /***
11 * Set the following matrix (dots represent zero matrix elements and zeros are
12 * padding zeros for memory alignment):
13 *
14 * 0 / 2 -1 . . . \ -> { 0, 0, 1 }
15 * | -1 2 -1 . . | -> { 0, 2, 1 }
16 * | . -1 2 -1. . | -> { 3, 2, 1 }
17 * | . . -1 2 -1 | -> { 3, 2, 1 }
18 * \ . . . -1 2 / -> { 3, 2, 1 }
19 *
20 */
21
22 const int matrixSize( 5 );
24 MatrixType matrix(
25 matrixSize, // number of matrix rows
26 matrixSize // number of matrix columns
27 );
28 auto view = matrix.getView();
29
30 auto f = [=] __cuda_callable__ ( int rowIdx ) mutable {
31 auto row = view.getRow( rowIdx );
32
33 if( rowIdx > 0 )
34 row.setElement( 0, -1.0 ); // elements below the diagonal
35 row.setElement( 1, 2.0 ); // elements on the diagonal
36 if( rowIdx < matrixSize - 1 ) // elements above the diagonal
37 row.setElement( 2, -1.0 );
38 };
39
40 /***
41 * Set the matrix elements.
42 */
44 std::cout << std::endl << matrix << std::endl;
45}
46
47int main( int argc, char* argv[] )
48{
49 std::cout << "Getting matrix rows on host: " << std::endl;
50 getRowExample< TNL::Devices::Host >();
51
52#ifdef HAVE_CUDA
53 std::cout << "Getting matrix rows on CUDA device: " << std::endl;
54 getRowExample< TNL::Devices::Cuda >();
55#endif
56}
__cuda_callable__ IndexType getRows() const
Returns number of matrix rows.
Definition: Matrix.hpp:69

We create a matrix with the same size (line 22-27). Next, we fetch the tridiagonal matrix view (ef TNL::Matrices::TridiagonalMatrixView ,line 28) which we use in the lambda function for matrix elements modification (lines 30-38). Inside the lambda function, we first get a matrix row by calling the method getRow (TNL::Matrices::TridiagonalMatrixView::getRow) using which we can access the matrix elements (lines 33-37). We would like to stress that the method setElement addresses the matrix elements with the localIdx parameter which is a rank of the nonzero element in the matrix row - see Indexing of nonzero matrix elements in sparse matrices. The lambda function is called by the ParallelFor (TNL::Algorithms::ParallelFor).

The result looks as follows:

Getting matrix rows on host:
Row: 0 -> 0:2 1:-1
Row: 1 -> 0:-1 1:2 2:-1
Row: 2 -> 1:-1 2:2 3:-1
Row: 3 -> 2:-1 3:2 4:-1
Row: 4 -> 3:-1 4:2
Getting matrix rows on CUDA device:
Row: 0 -> 0:2 1:-1
Row: 1 -> 0:-1 1:2 2:-1
Row: 2 -> 1:-1 2:2 3:-1
Row: 3 -> 2:-1 3:2 4:-1
Row: 4 -> 3:-1 4:2

### Method forRows

As in the case of other matrix types, the method forRows (TNL::Matrices::TridiagonalMatrix::forRows) calls the method getRow (TNL::Matrices::TridiagonalMatrix::getRow) in parallel. It is demonstrated by the following example which we may directly compare with the previous one:

1#include <iostream>
2#include <TNL/Matrices/TridiagonalMatrix.h>
3#include <TNL/Devices/Host.h>
4#include <TNL/Devices/Cuda.h>
5
6template< typename Device >
7void forRowsExample()
8{
10 using RowView = typename MatrixType::RowView;
11 /***
12 * Set the following matrix (dots represent zero matrix elements and zeros are
13 * padding zeros for memory alignment):
14 *
15 * 0 / 2 . . . . \ -> { 0, 0, 1 }
16 * | -1 2 -1 . . | -> { 0, 2, 1 }
17 * | . -1 2 -1. . | -> { 3, 2, 1 }
18 * | . . -1 2 -1 | -> { 3, 2, 1 }
19 * \ . . . . 2 / -> { 3, 2, 1 }
20 *
21 */
22 const int size = 5;
23 MatrixType matrix( size, size );
24
25 auto f = [=] __cuda_callable__ ( RowView& row ) {
26 const int& rowIdx = row.getRowIndex();
27 if( rowIdx > 0 )
28 row.setElement( 0, -1.0 ); // elements below the diagonal
29 row.setElement( 1, 2.0 ); // elements on the diagonal
30 if( rowIdx < size - 1 ) // elements above the diagonal
31 row.setElement( 2, -1.0 );
32 };
33 matrix.forAllRows( f );
34 std::cout << matrix << std::endl;
35
36 /***
37 * Compute sum of elements in each row and store it into a vector.
38 */
40 auto sum_view = sum_vector.getView();
41 matrix.forAllRows( [=] __cuda_callable__ ( RowView& row ) mutable {
42 double sum( 0.0 );
43 for( auto element : row )
44 sum += TNL::abs( element.value() );
45 sum_view[ row.getRowIndex() ] = sum;
46 } );
47
48 std::cout << "Sums in matrix rows = " << sum_vector << std::endl;
49}
50
51int main( int argc, char* argv[] )
52{
53 std::cout << "Creating matrix on host: " << std::endl;
54 forRowsExample< TNL::Devices::Host >();
55
56#ifdef HAVE_CUDA
57 std::cout << "Creating matrix on CUDA device: " << std::endl;
58 forRowsExample< TNL::Devices::Cuda >();
59#endif
60}

The differences are:

1. We do not need to get the matrix view as we did in the previous example.
2. We call the method forAllRows (TNL::Matrices::TridiagonalMatrix::forAllRows) (line 33) instead of ParallelFor (TNL::Algorithms::ParallelFor) which is simpler since we do not have to state the device type explicitly. The method forAllRows calls the method forRows for all matrix rows so we do not have to state explicitly the interval of matrix rows neither.
3. The lambda function f (lines 25-31) accepts one parameter row of the type RowView (TNL::Matrices::TridiagonalMatrix::RowView which is TNL::Matrices::TridiagonalMatrixRowView) instead of the index of the matrix row. Therefore we do not need to call the method getRow (TNL::Matrices::TridiagonalMatrix::getRow). On the other hand, we need the method geRowIndex (TNL::Matrices::TridiagonalMatrixRowView::getRowIndex) to get the index of the matrix row (line 24).

Next, we compute sum of absolute values of matrix elements in each row and store it in a vector (lines 39-46). Firstly we create the vector sum_vector for storing the sums (line 39) and get a vector view sum_view to get access to the vector from a lambda function. On the lines 41-46, we call lambda function for each matrix row which iterates over all matrix elements and sum their absolute values. Finally we store the result to the output vector (line 45).

The result looks as follows:

Creating matrix on host:
Row: 0 -> 0:2 1:-1
Row: 1 -> 0:-1 1:2 2:-1
Row: 2 -> 1:-1 2:2 3:-1
Row: 3 -> 2:-1 3:2 4:-1
Row: 4 -> 3:-1 4:2
Sums in matrix rows = [ 3, 4, 4, 4, 3 ]
Creating matrix on CUDA device:
Row: 0 -> 0:2 1:-1
Row: 1 -> 0:-1 1:2 2:-1
Row: 2 -> 1:-1 2:2 3:-1
Row: 3 -> 2:-1 3:2 4:-1
Row: 4 -> 3:-1 4:2
Sums in matrix rows = [ 3, 4, 4, 4, 3 ]

### Method forElements

Finally, even a bit more simple way of matrix elements manipulation with the method forElements (TNL::Matrices::TridiagonalMatrix::forElements) is demonstrated in the following example:

1#include <iostream>
2#include <TNL/Matrices/TridiagonalMatrix.h>
3#include <TNL/Devices/Host.h>
4#include <TNL/Devices/Cuda.h>
5
6template< typename Device >
7void forElementsExample()
8{
9 /***
10 * Set the following matrix (dots represent zero matrix elements and zeros are
11 * padding zeros for memory alignment):
12 *
13 * 0 / 2 1 . . . \ -> { 0, 2, 1 }
14 * | 3 2 1 . . | -> { 3, 2, 1 }
15 * | . 3 2 1 . | -> { 3, 2, 1 }
16 * | . . 3 2 1 | -> { 3, 2, 1 }
17 * \ . . . 3 2 / 0 -> { 3, 2, 0 }
18 */
20 5, // number of matrix rows
21 5 ); // number of matrix columns
22 auto view = matrix.getView();
23
24 auto f = [=] __cuda_callable__ ( int rowIdx, int localIdx, int columnIdx, double& value ) {
25 /***
26 * 'forElements' method iterates only over matrix elements lying on given subdiagonals
27 * and so we do not need to check anything. The element value can be expressed
28 * by the 'localIdx' variable, see the following figure:
29 *
30 * 0 1 2 <- localIdx values
31 * -------
32 * 0 / 2 1 . . . \ -> { 0, 2, 1 }
33 * | 3 2 1 . . | -> { 3, 2, 1 }
34 * | . 3 2 1 . | -> { 3, 2, 1 }
35 * | . . 3 2 1 | -> { 3, 2, 1 }
36 * \ . . . 3 2 / 0 -> { 3, 2, 0 }
37 *
38 */
39 value = 3 - localIdx;
40 };
41 view.forElements( 0, matrix.getRows(), f );
42 std::cout << matrix << std::endl;
43}
44
45int main( int argc, char* argv[] )
46{
47 std::cout << "Creating matrix on host: " << std::endl;
48 forElementsExample< TNL::Devices::Host >();
49
50#ifdef HAVE_CUDA
51 std::cout << "Creating matrix on CUDA device: " << std::endl;
52 forElementsExample< TNL::Devices::Cuda >();
53#endif
54}
void forElements(IndexType begin, IndexType end, Function &&function) const
Method for parallel iteration over matrix elements of given rows for constant instances.
Definition: SparseMatrix.hpp:665

On the line 41, we call the method forElements (TNL::Matrices::TridiagonalMatrix::forElements) instead of parallel for (TNL::Algorithms::ParallelFor). This method iterates over all matrix rows and all nonzero matrix elements. The lambda function on the line 24 therefore do not receive only the matrix row index but also local index of the matrix element (localIdx) which is a rank of the nonzero matrix element in given row - see Indexing of nonzero matrix elements in sparse matrices. Next parameter, columnIdx received by the lambda function, is the column index of the matrix element. The fourth parameter value is a reference on the matrix element which we use for its modification. If the last parameter compute is set to false, the iterations over the matrix rows is terminated.

The result looks as follows:

Creating matrix on host:
Row: 0 -> 0:2 1:1
Row: 1 -> 0:3 1:2 2:1
Row: 2 -> 1:3 2:2 3:1
Row: 3 -> 2:3 3:2 4:1
Row: 4 -> 3:3 4:2
Creating matrix on CUDA device:
Row: 0 -> 0:2 1:1
Row: 1 -> 0:3 1:2 2:1
Row: 2 -> 1:3 2:2 3:1
Row: 3 -> 2:3 3:2 4:1
Row: 4 -> 3:3 4:2

## Multidiagonal matrices

Multidiagonal matrices are generalization of the tridiagonal ones. It is a special type of sparse matrices with specific pattern of the nonzero matrix elements which are positioned only parallel along diagonal. See the following example:

$\left( \begin{array}{ccccccc} 4 & -1 & . & -1 & . & . \\ -1 & 4 & -1 & . & -1 & . \\ . & -1 & 4 & -1 & . & -1 \\ -1 & . & -1 & 4 & -1 & . \\ . & -1 & . & -1 & 4 & -1 \\ . & . & -1 & . & -1 & 4 \end{array} \right)$

We can see that the matrix elements lay on lines parallel to the main diagonal. Such lines can be characterized by their offsets from the main diagonal. On the following figure, each such line is depicted in different color:

$\begin{array}{ccc} \color{green}{-3} & . & \color{cyan}{-1} \\ \hline \color{green}{*} & . & \color{cyan}{*} \\ . & \color{green}{*} & . \\ . & . & \color{green}{*} \\ . & . & . \\ . & . & . \\ . & . & . \end{array} \left( \begin{array}{ccccccc} \color{blue}{0} & \color{magenta}{1} & . & \color{red}{3} & . & . \\ \hline \color{blue}{4} & \color{magenta}{-1} & . & \color{red}{-1} & . & . \\ \color{cyan}{-1} & \color{blue}{4} & \color{magenta}{-1} & . & \color{red}{-1} & . \\ . & \color{cyan}{-1} & \color{blue}{4} & \color{magenta}{-1} & . & \color{red}{-1} \\ \color{green}{-1} & . & \color{cyan}{-1} & \color{blue}{4} & \color{magenta}{-1} & . \\ . & \color{green}{-1} & . & \color{cyan}{-1} & \color{blue}{4} & \color{magenta}{-1} \\ . & . & \color{green}{-1} & . & \color{cyan}{-1} & \color{blue}{4} \end{array} \right)$

In this matrix, the offsets reads as $$\{-3, -1, 0, +1, +3\}$$. It also means that the column indexes on $$i-$$th row are $$\{i-3, i-1, i, i+1, i+3\}$$ (where we accept only nonnegative indexes smaller than the number of matrix columns). An advantage is that, similar to the tridiagonal matrix (TNL::Matrices::TridiagonalMatrix), we do not store the column indexes explicitly as it is in TNL::Matrices::SparseMatrix. This can significantly reduce the memory requirements which also means better performance. See the following table for the storage requirements comparison between multidiagonal matrix (TNL::Matrices::MultidiagonalMatrix) and general sparse matrix (TNL::Matrices::SparseMatrix).

Real Index SparseMatrix MultidiagonalMatrix Ratio
float 32-bit int 8 bytes per element 4 bytes per element 50%
double 32-bit int 12 bytes per element 8 bytes per element 75%
float 64-bit int 12 bytes per element 4 bytes per element 30%
double 64-bit int 16 bytes per element 8 bytes per element 50%

For the sake of better memory alignment and faster access to the matrix elements, we store all subdiagonals in complete form including the elements which are outside the matrix as depicted on the following figure where zeros stand for the padding artificial zero matrix elements

$\begin{array}{cccc} 0 & & & 0 \\ & 0 & & \\ & & 0 & \\ & & & 0 \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \end{array} \left( \begin{array}{cccccccccccccccc} 1 & 0 & & & 0 & & & & & & & & & & & \\ 0 & 1 & 0 & & & 0 & & & & & & & & & & \\ & 0 & 1 & 0 & & & 0 & & & & & & & & & \\ & & 0 & 1 & 0 & & & 0 & & & & & & & & \\ 0 & & & 0 & 1 & 0 & & & 0 & & & & & & & \\ & -1 & & & -1 & 1 & -1 & & & -1 & & & & & & \\ & & -1 & & & -1 & 1 & -1 & & & -1 & & & & & \\ & & & 0 & & & 0 & 1 & 0 & & & 0 & & & & \\ & & & & 0 & & & 0 & 1 & 0 & & & 0 & & & \\ & & & & & -1 & & & -1 & 1 & -1 & & & -1 & & \\ & & & & & & -1 & & & -1 & 1 & -1 & & & -1 & \\ & & & & & & & 0 & & & 0 & 1 & 0 & & & 0 \\ & & & & & & & & 0 & & & 0 & 1 & 0 & & \\ & & & & & & & & & 0 & & & 0 & 1 & 0 & \\ & & & & & & & & & & 0 & & & 0 & 1 & 0 \\ & & & & & & & & & & & 0 & & & 0 & 1 \end{array} \right) \begin{array} & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ 0 & & & \\ & 0 & & \\ & & 0 & \\ 0 & & & 0 \end{array}$

Multidiagonal matrix is a templated class defined in the namespace TNL::Matrices. It has six template parameters:

• Real is a type of the matrix elements. It is double by default.
• Device is a device where the matrix shall be allocated. Currently it can be either TNL::Devices::Host for CPU or TNL::Devices::Cuda for CUDA supporting GPUs. It is TNL::Devices::Host by default.
• Index is a type to be used for indexing of the matrix elements. It is int by default.
• ElementsOrganization defines the organization of the matrix elements in memory. It can be TNL::Algorithms::Segments::ColumnMajorOrder or TNL::Algorithms::Segments::RowMajorOrder for column-major and row-major organization respectively. Be default, it is the row-major order if the matrix is allocated in the host system and column major order if it is allocated on GPU.
• RealAllocator is a memory allocator (one from TNL::Allocators) which shall be used for allocation of the matrix elements. By default, it is the default allocator for given Real type and Device type – see TNL::Allocators::Default.
• IndexAllocator is a memory allocator (one from TNL::Allocators) which shall be used for allocation of the matrix elements offsets. By default, it is the default allocator for given Index type and Device type – see TNL::Allocators::Default.

In the following text we show different methods how to setup multidiagonal matrices.

### Initializer list

Smaller multidiagonal matrices can be created using the constructor with initializer lists (std::initializer_list) as we demonstrate in the following example:

1#include <iostream>
2#include <TNL/Algorithms/ParallelFor.h>
3#include <TNL/Matrices/MultidiagonalMatrix.h>
4#include <TNL/Devices/Host.h>
5#include <TNL/Devices/Cuda.h>
6
7
8template< typename Device >
9void createMultidiagonalMatrix()
10{
11 const int matrixSize = 6;
12
13 /***
14 * Setup the following matrix (dots represent zeros):
15 *
16 * / 4 -1 . -1 . . \
17 * | -1 4 -1 . -1 . |
18 * | . -1 4 -1 . -1 |
19 * | -1 . -1 4 -1 . |
20 * | . -1 . -1 4 -1 |
21 * \ . . 1 . -1 4 /
22 *
23 * The diagonals offsets are { -3, -1, 0, 1, 3 }.
24 */
26 matrixSize, { -3, -1, 0, 1, 3 }, {
27 /***
28 * To set the matrix elements we first extend the diagonals to their full
29 * lengths even outside the matrix (dots represent zeros and zeros are
30 * artificial zeros used for memory alignment):
31 *
32 * 0 . 0 / 4 -1 . -1 . . \ -> { 0, 0, 4, -1, -1 }
33 * . 0 . | -1 4 -1 . -1 . | . -> { 0, -1, 4, -1, -1 }
34 * . . 0 | . -1 4 -1 . -1 | . . -> { 0, -1, 4, -1, -1 }
35 * . . | -1 . -1 4 -1 . | 0 . . -> { -1, -1, 4, -1, 0 }
36 * . | . -1 . -1 4 -1 | . 0 . . -> { -1, -1, 4, -1, 0 }
37 * \ . . 1 . -1 4 / 0 . 0 . . -> { -1, -1, 4, 0, 0 }
38 *
39 */
40 { 0, 0, 4, -1, -1 },
41 { 0, -1, 4, -1, -1 },
42 { 0, -1, 4, -1, -1 },
43 { -1, -1, 4, -1, 0 },
44 { -1, -1, 4, -1, 0 },
45 { -1, -1, 4, 0, 0 }
46 } );
47 std::cout << "The matrix reads as: " << std::endl << matrix << std::endl;
48}
49
50int main( int argc, char* argv[] )
51{
52 std::cout << "Create multidiagonal matrix on CPU ... " << std::endl;
53 createMultidiagonalMatrix< TNL::Devices::Host >();
54
55#ifdef HAVE_CUDA
56 std::cout << "Creating multidiagonal matrix on CUDA GPU ... " << std::endl;
57 createMultidiagonalMatrix< TNL::Devices::Cuda >();
58#endif
59}
Implementation of sparse multidiagonal matrix.
Definition: MultidiagonalMatrix.h:71

Here, we create a matrix which looks as

$\left( \begin{array}{cccccc} 4 & -1 & & -1 & & \\ -1 & 4 & -1 & & -1 & \\ & -1 & 4 & -1 & & -1 \\ -1 & & -1 & 4 & -1 & \\ & -1 & & -1 & 4 & -1 \\ & & -1 & & -1 & 4 \\ \end{array} \right).$

On the lines 25-46, we call the constructor which, in addition to matrix dimensions and subdiagonals offsets, accepts also initializer list of initializer lists with matrix elements values. Each embedded list corresponds to one matrix row and it contains values of matrix elements on particular subdiagonals including those which lies out of the matrix. The result looks as follows:

Create multidiagonal matrix on CPU ...
Row: 0 -> 0:4 1:-1 3:-1
Row: 1 -> 0:-1 1:4 2:-1 4:-1
Row: 2 -> 1:-1 2:4 3:-1 5:-1
Row: 3 -> 0:-1 2:-1 3:4 4:-1
Row: 4 -> 1:-1 3:-1 4:4 5:-1
Row: 5 -> 2:-1 4:-1 5:4
Creating multidiagonal matrix on CUDA GPU ...
Row: 0 -> 0:4 1:-1 3:-1
Row: 1 -> 0:-1 1:4 2:-1 4:-1
Row: 2 -> 1:-1 2:4 3:-1 5:-1
Row: 3 -> 0:-1 2:-1 3:4 4:-1
Row: 4 -> 1:-1 3:-1 4:4 5:-1
Row: 5 -> 2:-1 4:-1 5:4

Another and more efficient way of setting the matrix elements is by means of the method setElement (TNL::Matrices::MultidiagonalMatrix::setElement). It is demonstrated in the following example:

1#include <iostream>
2#include <TNL/Algorithms/ParallelFor.h>
3#include <TNL/Matrices/MultidiagonalMatrix.h>
4#include <TNL/Devices/Host.h>
5#include <TNL/Devices/Cuda.h>
6#include <TNL/Pointers/SharedPointer.h>
7#include <TNL/Pointers/SmartPointersRegister.h>
8
9template< typename Device >
10void setElements()
11{
12 const int matrixSize( 5 );
13 auto diagonalsOffsets = { -1, 0, 1 }; // offsets of tridiagonal matrix
15 Matrix matrix( matrixSize, matrixSize, diagonalsOffsets );
16 auto view = matrix.getView();
17
18 for( int i = 0; i < 5; i++ )
19 view.setElement( i, i, i );
20
21 std::cout << "Matrix set from the host:" << std::endl;
22 std::cout << matrix << std::endl;
23
24 auto f = [=] __cuda_callable__ ( int i ) mutable {
25 if( i > 0 )
26 view.setElement( i, i - 1, 1.0 );
27 view.setElement( i, i, -i );
28 if( i < matrixSize - 1 )
29 view.setElement( i, i + 1, 1.0 );
30 };
31
33
34 std::cout << "Matrix set from its native device:" << std::endl;
35 std::cout << matrix << std::endl;
36}
37
38int main( int argc, char* argv[] )
39{
40 std::cout << "Set elements on host:" << std::endl;
41 setElements< TNL::Devices::Host >();
42
43#ifdef HAVE_CUDA
44 std::cout << "Set elements on CUDA device:" << std::endl;
45 setElements< TNL::Devices::Cuda >();
46#endif
47}

This examples shows that the method setElement can be used both on the host (CPU) (line 19) as well as in the lambda functions that can be called from GPU kernels (lines 25-29). For this purpose, we fetch a matrix view on the line 16. The result looks as follows:

Set elements on host:
Matrix set from the host:
Row: 0 ->
Row: 1 -> 1:1
Row: 2 -> 2:2
Row: 3 -> 3:3
Row: 4 -> 4:4
Matrix set from its native device:
Row: 0 -> 1:1
Row: 1 -> 0:1 1:-1 2:1
Row: 2 -> 1:1 2:-2 3:1
Row: 3 -> 2:1 3:-3 4:1
Row: 4 -> 3:1 4:-4
Set elements on CUDA device:
Matrix set from the host:
Row: 0 ->
Row: 1 -> 1:1
Row: 2 -> 2:2
Row: 3 -> 3:3
Row: 4 -> 4:4
Matrix set from its native device:
Row: 0 -> 1:1
Row: 1 -> 0:1 1:-1 2:1
Row: 2 -> 1:1 2:-2 3:1
Row: 3 -> 2:1 3:-3 4:1
Row: 4 -> 3:1 4:-4

### Method getRow

Slightly more efficient way of the multidiagonal matrix setup is offered by the method getRow (TNL::Matrices::MultidiagonalMatrix::getRow). We will use it to create a matrix of the following form:

$\left( \begin{array}{cccccccccccccccc} 1 & . & & & . & & & & & & & & & & & \\ . & 1 & . & & & . & & & & & & & & & & \\ & . & 1 & . & & & . & & & & & & & & & \\ & & . & 1 & . & & & . & & & & & & & & \\ . & & & . & 1 & . & & & . & & & & & & & \\ & -1 & & & -1 & -4 & -1 & & & -1 & & & & & & \\ & & -1 & & & -1 & -4 & -1 & & & -1 & & & & & \\ & & & . & & & . & 1 & . & & & . & & & & \\ & & & & . & & & . & 1 & . & & & . & & & \\ & & & & & -1 & & & -1 & -4 & -1 & & & -1 & & \\ & & & & & & -1 & & & -1 & -4 & -1 & & & -1 & \\ & & & & & & & . & & & . & 1 & . & & & . \\ & & & & & & & & . & & & . & 1 & . & & \\ & & & & & & & & & . & & & . & 1 & . & \\ & & & & & & & & & & . & & & . & 1 & . \\ & & & & & & & & & & & . & & & . & 1 \end{array} \right)$

The matrices of this form arise from a discretization of the Laplace operator in 2D by the finite difference method. We use this example because it is a frequent numerical problem and the multidiagonal format is very suitable for such matrices. If the reader, however, is not familiar with the finite difference method, please, do not be scared, we will just create the matrix mentioned above. The code based on use of method getRow reads as:

1#include <iostream>
2#include <TNL/Algorithms/ParallelFor.h>
3#include <TNL/Matrices/MultidiagonalMatrix.h>
4#include <TNL/Devices/Host.h>
5#include <TNL/Devices/Cuda.h>
6
7
8template< typename Device >
9void laplaceOperatorMatrix()
10{
11 /***
12 * Set matrix representing approximation of the Laplace operator on regular
13 * grid using the finite difference method.
14 */
15 const int gridSize( 4 );
16 const int matrixSize = gridSize * gridSize;
17 TNL::Containers::Vector< int, Device > offsets { - gridSize, -1, 0, 1, gridSize };
18 TNL::Matrices::MultidiagonalMatrix< double, Device > matrix( matrixSize, matrixSize, offsets );
19 auto matrixView = matrix.getView();
20 auto f = [=] __cuda_callable__ ( int i, int j ) mutable {
21 const int elementIdx = j * gridSize + i;
22 auto row = matrixView.getRow( elementIdx );
23 if( i == 0 || j == 0 || i == gridSize - 1 || j == gridSize - 1 )
24 row.setElement( 2, 1.0 ); // set matrix elements corresponding to boundary grid nodes
25 // and Dirichlet boundary conditions, i.e. 1 on the main diagonal
26 // which is the third one
27 else
28 {
29 row.setElement( 0, -1.0 ); // set matrix elements corresponding to inner grid nodes, i.e.
30 row.setElement( 1, -1.0 ); // 4 on the main diagonal (the third one) and -1 to the other
31 row.setElement( 2, 4.0 ); // sub-diagonals
32 row.setElement( 3, -1.0 );
33 row.setElement( 4, -1.0 );
34 }
35 };
36 TNL::Algorithms::ParallelFor2D< Device >::exec( 0, 0, gridSize, gridSize, f );
37
38 std::cout << "Laplace operator matrix: " << std::endl << matrix << std::endl;
39}
40
41int main( int argc, char* argv[] )
42{
43 std::cout << "Creating Laplace operator matrix on CPU ... " << std::endl;
44 laplaceOperatorMatrix< TNL::Devices::Host >();
45
46#ifdef HAVE_CUDA
47 std::cout << "Creating Laplace operator matrix on CUDA GPU ... " << std::endl;
48 laplaceOperatorMatrix< TNL::Devices::Cuda >();
49#endif
50}

We firstly compute the matrix size (matrixSize) based on the numerical grid dimensions on the line 16. The subdiagonals offsets are defined by the numerical grid size and since it is four in this example the offsets read as $$\left\{-4,-1,0,1,4 \right\}$$ or { -gridSize, -1, 0, 1, gridSize} (line 17). Here we store the offsets in vector (TNL::Containers::Vector) called offsets. Next we use a constructor with matrix dimensions and offsets passed via TNL vector (line 18) and we fetch a matrix view (TNL::Matrices::MultidiagonalMatrixView, line 19).

The matrix is constructed by iterating over particular nodes of the numerical grid. Each node correspond to one matrix row. This is why the lambda function f (lines 20-35) take two indexes i and j (line 20). Their values are coordinates of the two-dimensional numerical grid. Based on these coordinates we compute index (elementIdx) of the corresponding matrix row (line 21). We fetch matrix row (row) by calling the getRow method (TNL::Matrices::MultidiagonalMatrix::getRow) (line 22). Depending on the grid node coordinates we set either the boundary conditions (lines 23-26) for the boundary nodes (those laying on the boundary of the grid and so their coordinates fulfil the condition i == 0 || j == 0 || i == gridSize - 1 || j == gridSize - 1 ) for which se set only the diagonal element to 1. The inner nodes of the numerical grid are handled on the lines 29-33 where we set coefficients approximating the Laplace operator. We use the method setElement of the matrix row (TNL::Matrices::MultidiagonalMatrixRowView::setElement) which takes the local index of the nonzero matrix element as the first parameter (see Indexing of nonzero matrix elements in sparse matrices) and the new value of the element as the second parameter. The local indexes, in fact, refer to particular subdiagonals as depicted on the following figure (in blue):

$\begin{array}{cccc} \color{blue}{-4} & & & \color{blue}{-1} \\ \hline . & & & . \\ & . & & \\ & & . & \\ & & & . \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \end{array} \left( \begin{array}{cccccccccccccccc} \color{blue}{0} & \color{blue}{1} & & & \color{blue}{4} & & & & & & & & & & & \\ \hline 1 & . & & & . & & & & & & & & & & & \\ . & 1 & . & & & . & & & & & & & & & & \\ & . & 1 & . & & & . & & & & & & & & & \\ & & . & 1 & . & & & . & & & & & & & & \\ . & & & . & 1 & . & & & . & & & & & & & \\ & -1 & & & -1 & 1 & -1 & & & -1 & & & & & & \\ & & -1 & & & -1 & 1 & -1 & & & -1 & & & & & \\ & & & . & & & . & 1 & . & & & . & & & & \\ & & & & . & & & . & 1 & . & & & . & & & \\ & & & & & -1 & & & -1 & 1 & -1 & & & -1 & & \\ & & & & & & -1 & & & -1 & 1 & -1 & & & -1 & \\ & & & & & & & . & & & . & 1 & . & & & . \\ & & & & & & & & . & & & . & 1 & . & & \\ & & & & & & & & & . & & & . & 1 & . & \\ & & & & & & & & & & . & & & . & 1 & . \\ & & & & & & & & & & & . & & & . & 1 \end{array} \right)$

We use ParallelFor2D (TNL::Algorithms::ParallelFor2D) to iterate over all nodes of the numerical grid (line 36) and apply the lambda function. The result looks as follows:

Creating Laplace operator matrix on CPU ...
Laplace operator matrix:
Row: 0 -> 0:1
Row: 1 -> 1:1
Row: 2 -> 2:1
Row: 3 -> 3:1
Row: 4 -> 4:1
Row: 5 -> 1:-1 4:-1 5:4 6:-1 9:-1
Row: 6 -> 2:-1 5:-1 6:4 7:-1 10:-1
Row: 7 -> 7:1
Row: 8 -> 8:1
Row: 9 -> 5:-1 8:-1 9:4 10:-1 13:-1
Row: 10 -> 6:-1 9:-1 10:4 11:-1 14:-1
Row: 11 -> 11:1
Row: 12 -> 12:1
Row: 13 -> 13:1
Row: 14 -> 14:1
Row: 15 -> 15:1
Creating Laplace operator matrix on CUDA GPU ...
Laplace operator matrix:
Row: 0 -> 0:1
Row: 1 -> 1:1
Row: 2 -> 2:1
Row: 3 -> 3:1
Row: 4 -> 4:1
Row: 5 -> 1:-1 4:-1 5:4 6:-1 9:-1
Row: 6 -> 2:-1 5:-1 6:4 7:-1 10:-1
Row: 7 -> 7:1
Row: 8 -> 8:1
Row: 9 -> 5:-1 8:-1 9:4 10:-1 13:-1
Row: 10 -> 6:-1 9:-1 10:4 11:-1 14:-1
Row: 11 -> 11:1
Row: 12 -> 12:1
Row: 13 -> 13:1
Row: 14 -> 14:1
Row: 15 -> 15:1

### Method forRows

As in the case of other matrix types, the method forRows (TNL::Matrices::MultidiagonalMatrix::forRows) calls the method getRow (TNL::Matrices::MultidiagonalMatrix::getRow) in parallel. It is demonstrated by the following example:

1#include <iostream>
2#include <TNL/Matrices/MultidiagonalMatrix.h>
3#include <TNL/Devices/Host.h>
4#include <TNL/Devices/Cuda.h>
5
6template< typename Device >
7void forRowsExample()
8{
10 using RowView = typename MatrixType::RowView;
11 /***
12 * Set the following matrix (dots represent zero matrix elements and zeros are
13 * padding zeros for memory alignment):
14 *
15 * 0 / 2 -1. . . . \ -> { 0, 0, 1 }
16 * | -1 2 -1 . . | -> { 0, 2, 1 }
17 * | . -1 2 -1. . | -> { 3, 2, 1 }
18 * | . . -1 2 -1 | -> { 3, 2, 1 }
19 * \ . . . -1. 2 / -> { 3, 2, 1 }
20 *
21 * The diagonals offsets are { -1, 0, 1 }.
22 */
23 const int size = 5;
24 MatrixType matrix(
25 size, // number of matrix rows
26 size, // number of matrix columns
27 { -1, -0, 1 } ); // matrix diagonals offsets
28
29 auto f = [=] __cuda_callable__ ( RowView& row ) {
30 const int& rowIdx = row.getRowIndex();
31 if( rowIdx > 0 )
32 row.setElement( 0, -1.0 ); // elements below the diagonal
33 row.setElement( 1, 2.0 ); // elements on the diagonal
34 if( rowIdx < size - 1 ) // elements above the diagonal
35 row.setElement( 2, -1.0 );
36
37 };
38 matrix.forAllRows( f );
39 std::cout << matrix << std::endl;
40
41 /***
42 * Compute sum of elements in each row and store it into a vector.
43 */
45 auto sum_view = sum_vector.getView();
46 matrix.forAllRows( [=] __cuda_callable__ ( RowView& row ) mutable {
47 double sum( 0.0 );
48 for( auto element : row )
49 sum += TNL::abs( element.value() );
50 sum_view[ row.getRowIndex() ] = sum;
51 } );
52
53 std::cout << "Sums in matrix rows = " << sum_vector << std::endl;
54}
55
56int main( int argc, char* argv[] )
57{
58 std::cout << "Creating matrix on host: " << std::endl;
59 forRowsExample< TNL::Devices::Host >();
60
61#ifdef HAVE_CUDA
62 std::cout << "Creating matrix on CUDA device: " << std::endl;
63 forRowsExample< TNL::Devices::Cuda >();
64#endif
65}
__cuda_callable__ T abs(const T &n)
This function returns absolute value of given number n.
Definition: Math.h:88

We call the method forAllRows (TNL::Matrices::MultidiagonalMatrix::forAllRows) (line 36) instead of ParallelFor (TNL::Algorithms::ParallelFor) which is simpler since we do not have to state the device type explicitly. The method forAllRows calls the method forRows for all matrix rows so we do not have to state explicitly the interval of matrix rows neither. The lambda function f (lines 28-35) accepts one parameter row of the type RowView (TNL::Matrices::MultidiagonalMatrix::RowView which is TNL::Matrices::MultidiagonalMatrixRowView). At the beginning of the lambda function, we call the method geRowIndex (TNL::Matrices::MultidiagonalMatrixRowView::getRowIndex) to get the index of the matrix row (line 29).

Next, we compute sum of absolute values of matrix elements in each row and store it in a vector (lines 39-46). Firstly we create the vector sum_vector for storing the sums (line 39) and get a vector view sum_view to get access to the vector from a lambda function. On the lines 41-46, we call lambda function for each matrix row which iterates over all matrix elements and sum their absolute values. Finally we store the result to the output vector (line 45).

The result looks as follows:

Creating matrix on host:
Row: 0 -> 0:2 1:-1
Row: 1 -> 0:-1 1:2 2:-1
Row: 2 -> 1:-1 2:2 3:-1
Row: 3 -> 2:-1 3:2 4:-1
Row: 4 -> 3:-1 4:2
Sums in matrix rows = [ 3, 4, 4, 4, 3 ]
Creating matrix on CUDA device:
Row: 0 -> 0:2 1:-1
Row: 1 -> 0:-1 1:2 2:-1
Row: 2 -> 1:-1 2:2 3:-1
Row: 3 -> 2:-1 3:2 4:-1
Row: 4 -> 3:-1 4:2
Sums in matrix rows = [ 3, 4, 4, 4, 3 ]

### Method forElements

Similar and even a bit simpler way of setting the matrix elements is offered by the method forElements (TNL::Matrices::MultidiagonalMatrix::forElements, TNL::Matrices::MultidiagonalMatrixView::forElements) as demonstrated in the following example:

1#include <iostream>
2#include <TNL/Matrices/MultidiagonalMatrix.h>
3#include <TNL/Devices/Host.h>
4#include <TNL/Devices/Cuda.h>
5
6template< typename Device >
7void forElementsExample()
8{
9 /***
10 * Set the following matrix (dots represent zero matrix elements and zeros are
11 * padding zeros for memory alignment):
12 *
13 * 0 0 / 1 . . . . \ -> { 0, 0, 1 }
14 * 0 | 2 1 . . . | -> { 0, 2, 1 }
15 * | 3 2 1 . . | -> { 3, 2, 1 }
16 * | . 3 2 1 . | -> { 3, 2, 1 }
17 * \ . . 3 2 1 / -> { 3, 2, 1 }
18 *
19 * The diagonals offsets are { -2, -1, 0 }.
20 */
22 5, // number of matrix rows
23 5, // number of matrix columns
24 { -2, -1, 0 } ); // matrix diagonals offsets
25 auto view = matrix.getView();
26
27 auto f = [=] __cuda_callable__ ( int rowIdx, int localIdx, int columnIdx, double& value ) {
28 /***
29 * 'forElements' method iterates only over matrix elements lying on given subdiagonals
30 * and so we do not need to check anything. The element value can be expressed
31 * by the 'localIdx' variable, see the following figure:
32 *
33 * 0 1 2 <- localIdx values
34 * -------
35 * 0 0 / 1 . . . . \ -> { 0, 0, 1 }
36 * 0 | 2 1 . . . | -> { 0, 2, 1 }
37 * | 3 2 1 . . | -> { 3, 2, 1 }
38 * | . 3 2 1 . | -> { 3, 2, 1 }
39 * \ . . 3 2 1 / -> { 3, 2, 1 }
40 *
41 */
42 value = 3 - localIdx;
43 };
44 view.forElements( 0, matrix.getRows(), f );
45 std::cout << matrix << std::endl;
46}
47
48int main( int argc, char* argv[] )
49{
50 std::cout << "Creating matrix on host: " << std::endl;
51 forElementsExample< TNL::Devices::Host >();
52
53#ifdef HAVE_CUDA
54 std::cout << "Creating matrix on CUDA device: " << std::endl;
55 forElementsExample< TNL::Devices::Cuda >();
56#endif
57}
void forElements(IndexType begin, IndexType end, Function &function) const
Method for iteration over matrix rows for constant instances.
Definition: MultidiagonalMatrix.hpp:553

In this case, we need to provide a lambda function f (lines 27-43) which is called for each matrix row just by the method forElements (line 44). The lambda function f provides the following parameters

• rowIdx is an index iof the matrix row.
• localIdx is in index of the matrix subdiagonal.
• columnIdx is a column index of the matrix element.
• value is a reference to the matrix element value. It can be used even for changing the value.
• compute is a reference to boolean. If it is set to false, the iteration over the matrix row can be stopped.

In this example, the matrix element value depends only on the subdiagonal index localIdx (see Indexing of nonzero matrix elements in sparse matrices) as we can see on the line 42. The result looks as follows:

Creating matrix on host:
Row: 0 -> 0:1
Row: 1 -> 0:2 1:1
Row: 2 -> 0:3 1:2 2:1
Row: 3 -> 1:3 2:2 3:1
Row: 4 -> 2:3 3:2 4:1
Creating matrix on CUDA device:
Row: 0 -> 0:1
Row: 1 -> 0:2 1:1
Row: 2 -> 0:3 1:2 2:1
Row: 3 -> 1:3 2:2 3:1
Row: 4 -> 2:3 3:2 4:1

## Lambda matrices

Lambda matrix (TNL::Matrices::LambdaMatrix) is a special type of matrix which could be also called matrix-free matrix. The matrix elements are not stored in memory explicitly but they are evaluated on-the-fly by means of user defined lambda functions. If the matrix elements can be expressed by computationally not expansive formula, we can significantly reduce the memory consumption which is appreciated especially on GPUs. Since the memory accesses are quite expensive even on both CPU and GPU, we can get, at the end, even much faster code.

The lambda matrix (TNL::Matrices::LambdaMatrix) is a templated class with the following template parameters:

• MatrixElementsLambda is a lambda function which evaluates the matrix elements values and column indexes.
• CompressedRowLengthsLambda is a lambda function telling how many nonzero elements are there in given matrix row.
• Real is a type of matrix elements values.
• Device is a device on which the lambda functions mentioned above will be evaluated.
• Index is a type to be used for indexing.

The lambda function MatrixElementsLambda is supposed to have the following declaration:

1auto matrixElements = [=] __cuda_callable__ (
2 Index rows,
3 Index columns,
4 Index row,
5 Index localIdx,
6 Index& columnIdx,
7 Real& value );
8

where the particular parameters have the following meaning:

• rows tells the number of matrix rows.
• columns tells the number of matrix columns.
• rowIdx is index of the matrix row in which we are supposed to evaluate the matrix element.
• localIdx is a rank of the nonzero matrix element, see Indexing of nonzero matrix elements in sparse matrices.
• columnIdx is a reference on variable where we are supposed to store the matrix element column index.
• value is a reference on variable where we are supposed to store the matrix element value.

The lambda function CompressedRowLengthsLambda (by compressed row length we mean the number of matrix elements in a row after ignoring/compressing the zero elements) is supposed to be declared like this:

1auto compressedRowLengths = [=] __cuda_callable__ (
2 Index rows,
3 Index columns,
4 Index row ) -> Index;

where the parameters can be described as follows:

• rows tells the number of matrix rows.
• columns tells the number of matrix columns.
• rowIdx is index of the matrix row for which we are supposed to evaluate the number of nonzero matrix elements.

The lambda function is supposed to return just the number of the nonzero matrix elements in given matrix row.

### Lambda matrix inititation

How to put the lambda functions together with the lambda matrix is demonstrated in the following example:

1#include <iostream>
2#include <TNL/Matrices/LambdaMatrix.h>
3
4int main( int argc, char* argv[] )
5{
6 /***
7 * Lambda functions defining the matrix.
8 */
9 auto compressedRowLengths = [=] __cuda_callable__ ( const int rows, const int columns, const int rowIdx ) -> int { return 1; };
10 auto matrixElements1 = [=] __cuda_callable__ ( const int rows, const int columns, const int rowIdx, const int localIdx, int& columnIdx, double& value ) {
11 columnIdx = rowIdx;
12 value = 1.0;
13 };
14 auto matrixElements2 = [=] __cuda_callable__ ( const int rows, const int columns, const int rowIdx, const int localIdx, int& columnIdx, double& value ) {
15 columnIdx = rowIdx;
16 value = rowIdx;
17 };
18
19 const int size = 5;
20
21 /***
22 * Matrix construction with explicit type definition.
23 */
24 using MatrixType = decltype( TNL::Matrices::LambdaMatrixFactory< double, TNL::Devices::Host, int >::create( matrixElements1, compressedRowLengths ) );
25 MatrixType m1( size, size, matrixElements1, compressedRowLengths );
26
27 /***
28 * Matrix construction using 'auto'.
29 */
30 auto m2 = TNL::Matrices::LambdaMatrixFactory< double, TNL::Devices::Host, int >::create( matrixElements2, compressedRowLengths );
31 m2.setDimensions( size, size );
32
33 std::cout << "The first lambda matrix: " << std::endl << m1 << std::endl;
34 std::cout << "The second lambda matrix: " << std::endl << m2 << std::endl;
35}
static auto create(MatrixElementsLambda &matrixElementsLambda, CompressedRowLengthsLambda &compressedRowLengthsLambda) -> LambdaMatrix< MatrixElementsLambda, CompressedRowLengthsLambda, Real, Device, Index >
Creates lambda matrix with given lambda functions.
Definition: LambdaMatrix.h:576

Here we create two simple diagonal matrices. Therefore they share the same lambda function compressedRowLengths telling the number of nonzero matrix elements in particular matrix rows which is always one (line 9). The first matrix, defined by the lambda function matrixElements1, is identity matrix and so its each diagonal element equals one. We set the matrix element value to 1.0 (line 12) and the column index equals the row index (line 15). The second matrix, defined by the lambda function matrixElements2, is also diagonal but not the identity matrix. The values of the diagonal elements equal to row index (line 16).

With the same lambda functions we can define matrices with different dimensions. In this example, we set the matrix size to five (line 19). It could be quite difficult to express the lambda matrix type because it depends on the types of the lambda functions. To make this easier, one may use the lambda-matrix factory (TNL::Matrices::LambdaMatrixFactory). Using decltype one can deduce even the matrix type (line 24) followed by calling lambda matrix constructor with matrix dimensions and instances of the lambda functions (line 25). Or one can just simply employ the keyword auto (line 30) followed by setting the matrix dimensions (line 31).

The result looks as follows:

The first lambda matrix:
Row: 0 -> 0:1
Row: 1 -> 1:1
Row: 2 -> 2:1
Row: 3 -> 3:1
Row: 4 -> 4:1
The second lambda matrix:
Row: 0 ->
Row: 1 -> 1:1
Row: 2 -> 2:2
Row: 3 -> 3:3
Row: 4 -> 4:4

### Method forRows

Method forRows (TNL::Matrices::LambdaMatrix::forRows, TNL::Matrices::LambdaMatrix::forAllRows) iterates in parallel over all matrix rows. In the case of lambda matrices, it cannot be used for changing the matrix elements since they cannot be changed. In the following example, we show how to use this method to copy the matrix elements values to the dense matrix:

1#include <iostream>
2#include <TNL/Matrices/LambdaMatrix.h>
3#include <TNL/Matrices/DenseMatrix.h>
4#include <TNL/Devices/Host.h>
5#include <TNL/Devices/Cuda.h>
6
7
8template< typename Device >
9void forRowsExample()
10{
11 /***
12 * Set matrix representing approximation of the Laplace operator on regular
13 * grid using the finite difference method.
14 */
15 const int gridSize( 4 );
16 const int matrixSize = gridSize * gridSize;
17 auto rowLengths = [=] __cuda_callable__ ( const int rows, const int columns, const int rowIdx ) -> int
18 {
19 const int gridRow = rowIdx / gridSize; // coordinates in the numerical grid
20 const int gridColumn = rowIdx % gridSize;
21 if( gridRow == 0 || gridRow == gridSize - 1 || // boundary grid node
22 gridColumn == 0 || gridColumn == gridSize - 1 )
23 return 1;
24 return 5;
25 };
26 auto matrixElements = [=] __cuda_callable__ ( const int rows, const int columns, const int rowIdx, const int localIdx, int& columnIdx, double& value) {
27 const int gridRow = rowIdx / gridSize; // coordinates in the numerical grid
28 const int gridColumn = rowIdx % gridSize;
29 if( gridRow == 0 || gridRow == gridSize - 1 || // boundary grid node
30 gridColumn == 0 || gridColumn == gridSize - 1 )
31 {
32 columnIdx = rowIdx; // diagonal element ....
33 value = 1.0; // ... is set to 1
34 }
35 else // interior grid node
36 {
37 switch( localIdx ) // set diagonal element to 4
38 { // and the others to -1
39 case 0:
40 columnIdx = rowIdx - gridSize;
41 value = 1;
42 break;
43 case 1:
44 columnIdx = rowIdx - 1;
45 value = 1;
46 break;
47 case 2:
48 columnIdx = rowIdx;
49 value = -4;
50 break;
51 case 3:
52 columnIdx = rowIdx + 1;
53 value = 1;
54 break;
55 case 4:
56 columnIdx = rowIdx + gridSize;
57 value = 1;
58 break;
59 }
60 }
61 };
63 matrixSize, matrixSize, matrixElements, rowLengths );
64 using MatrixType = decltype( matrix );
65 using RowView = typename MatrixType::RowView;
66
67 TNL::Matrices::DenseMatrix< double, Device > denseMatrix( matrixSize, matrixSize );
68 denseMatrix.setValue( 0.0 );
69 auto dense_view = denseMatrix.getView();
70 auto f = [=] __cuda_callable__ ( const RowView& row ) mutable {
71 auto dense_row = dense_view.getRow( row.getRowIndex() );
72 for( int localIdx = 0; localIdx < row.getSize(); localIdx++ )
73 dense_row.setValue( row.getColumnIndex( localIdx ), row.getValue( localIdx ) );
74 };
75 matrix.forAllRows( f );
76
77 std::cout << "Laplace operator lambda matrix: " << std::endl << matrix << std::endl;
78 std::cout << "Laplace operator dense matrix: " << std::endl << denseMatrix << std::endl;
79
80 /***
81 * Compute sum of elements in each row and store it into a vector.
82 */
83 TNL::Containers::Vector< double, Device > sum_vector( matrixSize );
84 auto sum_view = sum_vector.getView();
85 matrix.forAllRows( [=] __cuda_callable__ ( const RowView& row ) mutable {
86 double sum( 0.0 );
87 for( auto element : row )
88 sum += TNL::abs( element.value() );
89 sum_view[ row.getRowIndex() ] = sum;
90 } );
91
92 std::cout << "Sums in matrix rows = " << sum_vector << std::endl;
93}
94
95int main( int argc, char* argv[] )
96{
97 std::cout << "Running example on CPU ... " << std::endl;
98 forRowsExample< TNL::Devices::Host >();
99
100#ifdef HAVE_CUDA
101 std::cout << "Running example on CUDA GPU ... " << std::endl;
102 forRowsExample< TNL::Devices::Cuda >();
103#endif
104}

We start with the lambda functions (lines 17-61) defining the elements of the lambda matrix. Next, we create the lambda matrix matrix (lines 62-64) and the dense matrix denseMatrix (lines 67-68) together with the dense matrix view (line 69). The lambda function f (lines 70-74) serves for copying matrix elements from the lambda matrix to the dense matrix. The process of matrix elements copying is started by calling the method forAllRows (TNL::Matrices::LambdaMatrix::forRows, TNL::Matrices::LambdaMatrix::forAllRows) (line 75).

Note, however, that use of forElements method (TNL::Matrices::LambdaMatrix::forElements) would be more convenient.

Next, we compute sum of absolute values of matrix elements in each row and store it in a vector (lines 83-90). Firstly we create the vector sum_vector for storing the sums (line 83) and get a vector view sum_view to get access to the vector from a lambda function. On the lines 85-90, we call lambda function for each matrix row which iterates over all matrix elements and sum their absolute values. Finally we store the result to the output vector (line 92).

The result looks as follows:

Running example on CPU ...
Laplace operator lambda matrix:
Row: 0 -> 0:1
Row: 1 -> 1:1
Row: 2 -> 2:1
Row: 3 -> 3:1
Row: 4 -> 4:1
Row: 5 -> 1:1 4:1 5:-4 6:1 9:1
Row: 6 -> 2:1 5:1 6:-4 7:1 10:1
Row: 7 -> 7:1
Row: 8 -> 8:1
Row: 9 -> 5:1 8:1 9:-4 10:1 13:1
Row: 10 -> 6:1 9:1 10:-4 11:1 14:1
Row: 11 -> 11:1
Row: 12 -> 12:1
Row: 13 -> 13:1
Row: 14 -> 14:1
Row: 15 -> 15:1
Laplace operator dense matrix:
Row: 0 -> 0:1 1:0 2:0 3:0 4:0 5:0 6:0 7:0 8:0 9:0 10:0 11:0 12:0 13:0 14:0 15:0
Row: 1 -> 0:0 1:1 2:0 3:0 4:0 5:0 6:0 7:0 8:0 9:0 10:0 11:0 12:0 13:0 14:0 15:0
Row: 2 -> 0:0 1:0 2:1 3:0 4:0 5:0 6:0 7:0 8:0 9:0 10:0 11:0 12:0 13:0 14:0 15:0
Row: 3 -> 0:0 1:0 2:0 3:1 4:0 5:0 6:0 7:0 8:0 9:0 10:0 11:0 12:0 13:0 14:0 15:0
Row: 4 -> 0:0 1:0 2:0 3:0 4:1 5:0 6:0 7:0 8:0 9:0 10:0 11:0 12:0 13:0 14:0 15:0
Row: 5 -> 0:0 1:1 2:0 3:0 4:1 5:-4 6:1 7:0 8:0 9:1 10:0 11:0 12:0 13:0 14:0 15:0
Row: 6 -> 0:0 1:0 2:1 3:0 4:0 5:1 6:-4 7:1 8:0 9:0 10:1 11:0 12:0 13:0 14:0 15:0
Row: 7 -> 0:0 1:0 2:0 3:0 4:0 5:0 6:0 7:1 8:0 9:0 10:0 11:0 12:0 13:0 14:0 15:0
Row: 8 -> 0:0 1:0 2:0 3:0 4:0 5:0 6:0 7:0 8:1 9:0 10:0 11:0 12:0 13:0 14:0 15:0
Row: 9 -> 0:0 1:0 2:0 3:0 4:0 5:1 6:0 7:0 8:1 9:-4 10:1 11:0 12:0 13:1 14:0 15:0
Row: 10 -> 0:0 1:0 2:0 3:0 4:0 5:0 6:1 7:0 8:0 9:1 10:-4 11:1 12:0 13:0 14:1 15:0
Row: 11 -> 0:0 1:0 2:0 3:0 4:0 5:0 6:0 7:0 8:0 9:0 10:0 11:1 12:0 13:0 14:0 15:0
Row: 12 -> 0:0 1:0 2:0 3:0 4:0 5:0 6:0 7:0 8:0 9:0 10:0 11:0 12:1 13:0 14:0 15:0
Row: 13 -> 0:0 1:0 2:0 3:0 4:0 5:0 6:0 7:0 8:0 9:0 10:0 11:0 12:0 13:1 14:0 15:0
Row: 14 -> 0:0 1:0 2:0 3:0 4:0 5:0 6:0 7:0 8:0 9:0 10:0 11:0 12:0 13:0 14:1 15:0
Row: 15 -> 0:0 1:0 2:0 3:0 4:0 5:0 6:0 7:0 8:0 9:0 10:0 11:0 12:0 13:0 14:0 15:1
Sums in matrix rows = [ 1, 1, 1, 1, 1, 8, 8, 1, 1, 8, 8, 1, 1, 1, 1, 1 ]
Running example on CUDA GPU ...
Laplace operator lambda matrix:
Row: 0 -> 0:1
Row: 1 -> 1:1
Row: 2 -> 2:1
Row: 3 -> 3:1
Row: 4 -> 4:1
Row: 5 -> 1:1 4:1 5:-4 6:1 9:1
Row: 6 -> 2:1 5:1 6:-4 7:1 10:1
Row: 7 -> 7:1
Row: 8 -> 8:1
Row: 9 -> 5:1 8:1 9:-4 10:1 13:1
Row: 10 -> 6:1 9:1 10:-4 11:1 14:1
Row: 11 -> 11:1
Row: 12 -> 12:1
Row: 13 -> 13:1
Row: 14 -> 14:1
Row: 15 -> 15:1
Laplace operator dense matrix:
Row: 0 -> 0:1 1:0 2:0 3:0 4:0 5:0 6:0 7:0 8:0 9:0 10:0 11:0 12:0 13:0 14:0 15:0
Row: 1 -> 0:0 1:1 2:0 3:0 4:0 5:0 6:0 7:0 8:0 9:0 10:0 11:0 12:0 13:0 14:0 15:0
Row: 2 -> 0:0 1:0 2:1 3:0 4:0 5:0 6:0 7:0 8:0 9:0 10:0 11:0 12:0 13:0 14:0 15:0
Row: 3 -> 0:0 1:0 2:0 3:1 4:0 5:0 6:0 7:0 8:0 9:0 10:0 11:0 12:0 13:0 14:0 15:0
Row: 4 -> 0:0 1:0 2:0 3:0 4:1 5:0 6:0 7:0 8:0 9:0 10:0 11:0 12:0 13:0 14:0 15:0
Row: 5 -> 0:0 1:1 2:0 3:0 4:1 5:-4 6:1 7:0 8:0 9:1 10:0 11:0 12:0 13:0 14:0 15:0
Row: 6 -> 0:0 1:0 2:1 3:0 4:0 5:1 6:-4 7:1 8:0 9:0 10:1 11:0 12:0 13:0 14:0 15:0
Row: 7 -> 0:0 1:0 2:0 3:0 4:0 5:0 6:0 7:1 8:0 9:0 10:0 11:0 12:0 13:0 14:0 15:0
Row: 8 -> 0:0 1:0 2:0 3:0 4:0 5:0 6:0 7:0 8:1 9:0 10:0 11:0 12:0 13:0 14:0 15:0
Row: 9 -> 0:0 1:0 2:0 3:0 4:0 5:1 6:0 7:0 8:1 9:-4 10:1 11:0 12:0 13:1 14:0 15:0
Row: 10 -> 0:0 1:0 2:0 3:0 4:0 5:0 6:1 7:0 8:0 9:1 10:-4 11:1 12:0 13:0 14:1 15:0
Row: 11 -> 0:0 1:0 2:0 3:0 4:0 5:0 6:0 7:0 8:0 9:0 10:0 11:1 12:0 13:0 14:0 15:0
Row: 12 -> 0:0 1:0 2:0 3:0 4:0 5:0 6:0 7:0 8:0 9:0 10:0 11:0 12:1 13:0 14:0 15:0
Row: 13 -> 0:0 1:0 2:0 3:0 4:0 5:0 6:0 7:0 8:0 9:0 10:0 11:0 12:0 13:1 14:0 15:0
Row: 14 -> 0:0 1:0 2:0 3:0 4:0 5:0 6:0 7:0 8:0 9:0 10:0 11:0 12:0 13:0 14:1 15:0
Row: 15 -> 0:0 1:0 2:0 3:0 4:0 5:0 6:0 7:0 8:0 9:0 10:0 11:0 12:0 13:0 14:0 15:1
Sums in matrix rows = [ 1, 1, 1, 1, 1, 8, 8, 1, 1, 8, 8, 1, 1, 1, 1, 1 ]

### Method forElements

The lambda matrix has the same interface as other matrix types except of the method getRow. The following example demonstrates the use of the method forElements (TNL::Matrices::LambdaMatrix::forElements) to copy the lambda matrix into the dense matrix:

1#include <iostream>
2#include <TNL/Matrices/DenseMatrix.h>
3#include <TNL/Matrices/LambdaMatrix.h>
4#include <TNL/Devices/Host.h>
5#include <TNL/Devices/Cuda.h>
6
7template< typename Device >
8void forElementsExample()
9{
10 /***
11 * Lambda functions defining the matrix.
12 */
13 auto rowLengths = [=] __cuda_callable__ ( const int rows, const int columns, const int rowIdx ) -> int { return columns; };
14 auto matrixElements = [=] __cuda_callable__ ( const int rows, const int columns, const int rowIdx, const int localIdx, int& columnIdx, double& value ) {
15 columnIdx = localIdx;
16 value = TNL::max( rowIdx - columnIdx + 1, 0 );
17 };
18
20 auto matrix = MatrixFactory::create( 5, 5, matrixElements, rowLengths );
21
23 auto denseView = denseMatrix.getView();
24
25 auto f = [=] __cuda_callable__ ( int rowIdx, int localIdx, int columnIdx, double value ) mutable {
26 denseView.setElement( rowIdx, columnIdx, value );
27 };
28
29 matrix.forElements( 0, matrix.getRows(), f );
30 std::cout << "Original lambda matrix:" << std::endl << matrix << std::endl;
31 std::cout << "Dense matrix:" << std::endl << denseMatrix << std::endl;
32}
33
34int main( int argc, char* argv[] )
35{
36 std::cout << "Copying matrix on host: " << std::endl;
37 forElementsExample< TNL::Devices::Host >();
38
39#ifdef HAVE_CUDA
40 std::cout << "Copying matrix on CUDA device: " << std::endl;
41 forElementsExample< TNL::Devices::Cuda >();
42#endif
43}
Helper class for creating instances of LambdaMatrix.
Definition: LambdaMatrix.h:557

Here, we treat the lambda matrix as if it was dense matrix and so the lambda function compressedRowLengths returns the number of the nonzero elements equal to the number of matrix columns (line 13). However, the lambda function matrixElements (lines 14-17), sets nonzero values only to lower triangular part of the matrix. The elements in the upper part are equal to zero (line 16). Next we create an instance of the lambda matrix with a help of the lambda matrix factory (TNL::Matrices::LambdaMatrixFactory) (lines 19-20) and an instance of the dense matrix (TNL::Matrices::DenseMatrix) (lines 22-23).

Next we call the lambda function f by the method forElements (TNL::Matrices::LambdaMatrix::forElements) to set the matrix elements of the dense matrix denseMatrix (line 26) via the dense matrix view (denseView) (TNL::Matrices::DenseMatrixView). Note, that in the lambda function f we get the matrix element value already evaluated in the variable value as we are used to from other matrix types. So in fact, the same lambda function f would do the same job even for sparse matrix or any other. Also note, that in this case we iterate even over all zero matrix elements because the lambda function compressedRowLengths (line 13) tells so. The result looks as follows:

Copying matrix on host:
Original lambda matrix:
Row: 0 -> 0:1
Row: 1 -> 0:2 1:1
Row: 2 -> 0:3 1:2 2:1
Row: 3 -> 0:4 1:3 2:2 3:1
Row: 4 -> 0:5 1:4 2:3 3:2 4:1
Dense matrix:
Row: 0 -> 0:1 1:0 2:0 3:0 4:0
Row: 1 -> 0:2 1:1 2:0 3:0 4:0
Row: 2 -> 0:3 1:2 2:1 3:0 4:0
Row: 3 -> 0:4 1:3 2:2 3:1 4:0
Row: 4 -> 0:5 1:4 2:3 3:2 4:1
Copying matrix on CUDA device:
Original lambda matrix:
Row: 0 -> 0:1
Row: 1 -> 0:2 1:1
Row: 2 -> 0:3 1:2 2:1
Row: 3 -> 0:4 1:3 2:2 3:1
Row: 4 -> 0:5 1:4 2:3 3:2 4:1
Dense matrix:
Row: 0 -> 0:1 1:0 2:0 3:0 4:0
Row: 1 -> 0:2 1:1 2:0 3:0 4:0
Row: 2 -> 0:3 1:2 2:1 3:0 4:0
Row: 3 -> 0:4 1:3 2:2 3:1 4:0
Row: 4 -> 0:5 1:4 2:3 3:2 4:1

At the end of this part, we show two more examples, how to express a matrix approximating the Laplace operator:

1#include <iostream>
2#include <TNL/Matrices/LambdaMatrix.h>
3#include <TNL/Devices/Host.h>
4#include <TNL/Devices/Cuda.h>
5
6
7template< typename Device >
8void laplaceOperatorMatrix()
9{
10 /***
11 * Set matrix representing approximation of the Laplace operator on regular
12 * grid using the finite difference method.
13 */
14 const int gridSize( 4 );
15 const int matrixSize = gridSize * gridSize;
16 auto rowLengths = [=] __cuda_callable__ ( const int rows, const int columns, const int rowIdx ) -> int
17 {
18 const int gridRow = rowIdx / gridSize; // coordinates in the numerical grid
19 const int gridColumn = rowIdx % gridSize;
20 if( gridRow == 0 || gridRow == gridSize - 1 || // boundary grid node
21 gridColumn == 0 || gridColumn == gridSize - 1 )
22 return 1;
23 return 5;
24 };
25 auto matrixElements = [=] __cuda_callable__ ( const int rows, const int columns, const int rowIdx, const int localIdx, int& columnIdx, double& value) {
26 const int gridRow = rowIdx / gridSize; // coordinates in the numerical grid
27 const int gridColumn = rowIdx % gridSize;
28 if( gridRow == 0 || gridRow == gridSize - 1 || // boundary grid node
29 gridColumn == 0 || gridColumn == gridSize - 1 )
30 {
31 columnIdx = rowIdx; // diagonal element ....
32 value = 1.0; // ... is set to 1
33 }
34 else // interior grid node
35 {
36 switch( localIdx ) // set diagonal element to 4
37 { // and the others to -1
38 case 0:
39 columnIdx = rowIdx - gridSize;
40 value = 1;
41 break;
42 case 1:
43 columnIdx = rowIdx - 1;
44 value = 1;
45 break;
46 case 2:
47 columnIdx = rowIdx;
48 value = -4;
49 break;
50 case 3:
51 columnIdx = rowIdx + 1;
52 value = 1;
53 break;
54 case 4:
55 columnIdx = rowIdx + gridSize;
56 value = 1;
57 break;
58 }
59 }
60 };
62 matrixSize, matrixSize, matrixElements, rowLengths );
63 std::cout << "Laplace operator matrix: " << std::endl << matrix << std::endl;
64}
65
66int main( int argc, char* argv[] )
67{
68 std::cout << "Creating Laplace operator matrix on CPU ... " << std::endl;
69 laplaceOperatorMatrix< TNL::Devices::Host >();
70
71#ifdef HAVE_CUDA
72 std::cout << "Creating Laplace operator matrix on CUDA GPU ... " << std::endl;
73 laplaceOperatorMatrix< TNL::Devices::Cuda >();
74#endif
75}

The following is another way of doing the same but with precomputed supporting vectors:

1#include <iostream>
2#include <TNL/Matrices/LambdaMatrix.h>
3#include <TNL/Devices/Host.h>
4#include <TNL/Devices/Cuda.h>
5
6
7template< typename Device >
8void laplaceOperatorMatrix()
9{
10 /***
11 * Set matrix representing approximation of the Laplace operator on regular
12 * grid using the finite difference method.
13 */
14 const int gridSize( 4 );
15 const int matrixSize = gridSize * gridSize;
16 TNL::Containers::Vector< int, Device > columnOffsets{ 0, -1, 0, 1, 0 }; // helper vector for getting matrix elements column indexes
17 columnOffsets.setElement( 0, -gridSize );
18 columnOffsets.setElement( 4, gridSize );
19 auto columnOffsetsView = columnOffsets.getView();
20 TNL::Containers::Vector< double, Device > values{ 1, 1, -4, 1, 1 }; // helper vector for getting matrix elements values
21 auto valuesView = values.getView();
22
23 auto rowLengths = [=] __cuda_callable__ ( const int rows, const int columns, const int rowIdx ) -> int
24 {
25 const int gridRow = rowIdx / gridSize; // coordinates in the numerical grid
26 const int gridColumn = rowIdx % gridSize;
27
28 if( gridRow == 0 || gridRow == gridSize - 1 || // boundary grid node
29 gridColumn == 0 || gridColumn == gridSize - 1 )
30 return 1;
31 return 5;
32 };
33
34 auto matrixElements = [=] __cuda_callable__ ( const int rows, const int columns, const int rowIdx, const int localIdx, int& columnIdx, double& value) {
35 const int gridRow = rowIdx / gridSize; // coordinates in the numerical grid
36 const int gridColumn = rowIdx % gridSize;
37 if( gridRow == 0 || gridRow == gridSize - 1 || // boundary grid node
38 gridColumn == 0 || gridColumn == gridSize - 1 )
39 {
40 columnIdx = rowIdx; // diagonal element ....
41 value = 1.0; // ... is set to 1
42 }
43 else // interior grid node
44 {
45 columnIdx = rowIdx + columnOffsetsView[ localIdx ];
46 value = valuesView[ localIdx ];
47 }
48 };
50 matrixSize, matrixSize, matrixElements, rowLengths );
51 std::cout << "Laplace operator matrix: " << std::endl << matrix << std::endl;
52}
53
54int main( int argc, char* argv[] )
55{
56 std::cout << "Creating Laplace operator matrix on CPU ... " << std::endl;
57 laplaceOperatorMatrix< TNL::Devices::Host >();
58
59#ifdef HAVE_CUDA
60 std::cout << "Creating Laplace operator matrix on CUDA GPU ... " << std::endl;
61 laplaceOperatorMatrix< TNL::Devices::Cuda >();
62#endif
63}
ViewType getView(IndexType begin=0, IndexType end=0)
Returns a modifiable view of the vector.
Definition: Vector.hpp:29

The result of both examples looks as follows:

Creating Laplace operator matrix on CPU ...
Laplace operator matrix:
Row: 0 -> 0:1
Row: 1 -> 1:1
Row: 2 -> 2:1
Row: 3 -> 3:1
Row: 4 -> 4:1
Row: 5 -> 1:1 4:1 5:-4 6:1 9:1
Row: 6 -> 2:1 5:1 6:-4 7:1 10:1
Row: 7 -> 7:1
Row: 8 -> 8:1
Row: 9 -> 5:1 8:1 9:-4 10:1 13:1
Row: 10 -> 6:1 9:1 10:-4 11:1 14:1
Row: 11 -> 11:1
Row: 12 -> 12:1
Row: 13 -> 13:1
Row: 14 -> 14:1
Row: 15 -> 15:1
Creating Laplace operator matrix on CUDA GPU ...
Laplace operator matrix:
Row: 0 -> 0:1
Row: 1 -> 1:1
Row: 2 -> 2:1
Row: 3 -> 3:1
Row: 4 -> 4:1
Row: 5 -> 1:1 4:1 5:-4 6:1 9:1
Row: 6 -> 2:1 5:1 6:-4 7:1 10:1
Row: 7 -> 7:1
Row: 8 -> 8:1
Row: 9 -> 5:1 8:1 9:-4 10:1 13:1
Row: 10 -> 6:1 9:1 10:-4 11:1 14:1
Row: 11 -> 11:1
Row: 12 -> 12:1
Row: 13 -> 13:1
Row: 14 -> 14:1
Row: 15 -> 15:1

## Distributed matrices

TODO: Write documentation on distributed matrices.

# Flexible reduction in matrix rows

Flexible reduction in matrix rows is a powerful tool for many different matrix operations. It is represented by the method reduceRows (TNL::Matrices::DenseMatrix::reduceRows, TNL::Matrices::SparseMatrix::reduceRows, TNL::Matrices::TridiagonalMatrix::reduceRows, TNL::Matrices::MultidiagonalMatrix::reduceRows, TNL::Matrices::LambdaMatrix::reduceRows) and similar to the method forElements it iterates over particular matrix rows. However, it performs flexible paralell reduction in addition. For example, the matrix-vector product can be seen as a reduction of products of matrix elements with the input vector in particular matrix rows. The first element of the result vector ios obtained as:

$y_1 = a_{11} x_1 + a_{12} x_2 + \ldots + a_{1n} x_n = \sum_{j=1}^n a_{1j}x_j$

and in general i-th element of the result vector is computed as

$y_i = a_{i1} x_1 + a_{i2} x_2 + \ldots + a_{in} x_n = \sum_{j=1}^n a_{ij}x_j.$

We see that in i-th matrix row we have to compute the sum $$\sum_{j=1}^n a_{ij}x_j$$ which is reduction of products $$a_{ij}x_j$$. Similar to flexible parallel reduction (TNL::Algorithms::reduce) we just need to design proper lambda functions. There are three of them.

1. fetch reads and preprocesses data entering the flexible parallel reduction.
2. reduce performs the reduction operation.
3. keep stores the results from each matrix row.

### Lambda function fetch

This lambda function has the same purpose as the lambda function fetch in flexible parallel reduction for arrays and vectors (see Flexible Parallel Reduction). It is supposed to be declared as follows:

1auto fetch = [=] __cuda_callable__ ( Index rowIdx, Index columnIdx, const Real& value ) -> Real;

The meaning of the particular parameters is as follows:

1. rowIdx is the row index of the matrix element.
2. columnIdx is the column index of the matrix element.
3. value is the value of the matrix element.

The lambda function returns a value of type Real based on the input data.

### Lambda function reduce

The lambda function reduce expresses reduction operation (sum, product, minimum, maximum etc.) which is supposed to be done during the flexible reduction.

1auto reduce = [] __cuda_callable__ ( const Real& a, const Real& b ) -> Real;

The meaning of the particular parameters is as follows:

1. a is the first operand for the reduction operation.
2. b is the second operand for the reduction operation.

### Lambda function keep

The lambda function keep is new one compared to the flexible reduction for arrays, vectors or other linear structures. The reason is that the result consists of as many numbers as there are matrix rows. Result obtained for each matrix row is processed by this lambda function. It is declared as follows:

1auto keep = [=] __cuda_callable__ ( Index rowIdx, const Real& value ) mutable;

The meaning of the particular parameters is as follows:

1. rowIdx is an index of the matrix row related to given result of flexible reduction.
2. valueis the result of the flexible reduction in given matrix row.

The method reduceRows (TNL::Matrices::DenseMatrix::reduceRows, TNL::Matrices::SparseMatrix::reduceRows, TNL::Matrices::TridiagonalMatrix::reduceRows, TNL::Matrices::MultidiagonalMatrix::reduceRows, TNL::Matrices::LambdaMatrix::reduceRows) accepts the following arguments:

1. begin is the beginning of the matrix rows range on which the reduction will be performed.
2. end is the end of the matrix rows range on which the reduction will be performed. The last matrix row which is going to be processed has index end-1.
3. fetch is the lambda function for data fetching.
4. reduce is the the lambda function performing the reduction.
5. keep is the lambda function responsible for processing the results from particular matrix rows.
6. zero is the "zero" element of given reduction operation also known as idempotent.

Though the interface is the same for all matrix types, in the following part we will show several examples for different matrix types to better demonstrate possible ways of use of the flexible reduction for matrices.

## Dense matrices example

The following example demonstrates implementation of the dense matrix-vector product $${\bf y} = A \vec {\bf x}$$, i.e.

$y_i = \sum_{j=0}^{columns - 1} a_{ij} x_j \text{ for } i = 0, \ldots, rows-1.$

1#include <iostream>
2#include <iomanip>
3#include <functional>
4#include <TNL/Matrices/DenseMatrix.h>
5#include <TNL/Devices/Host.h>
6#include <TNL/Devices/Cuda.h>
7
8template< typename Device >
9void reduceRows()
10{
12 { 1, 0, 0, 0, 0 },
13 { 1, 2, 0, 0, 0 },
14 { 0, 1, 8, 0, 0 },
15 { 0, 0, 1, 9, 0 },
16 { 0, 0, 0, 0, 1 } };
17
18 /***
19 * Allocate input and output vectors for matrix-vector product
20 */
21 TNL::Containers::Vector< double, Device > x( matrix.getColumns() ),
22 y( matrix.getRows() );
23
24 /***
25 * Fill the input vectors with ones.
26 */
27 x = 1.0;
28
29 /***
30 * Prepare vector view for lambdas.
31 */
32 auto xView = x.getView();
33 auto yView = y.getView();
34
35 /***
36 * Fetch lambda just returns product of appropriate matrix elements and the
37 * input vector elements.
38 */
39 auto fetch = [=] __cuda_callable__ ( int rowIdx, int columnIdx, const double& value ) -> double {
40 return xView[ columnIdx ] * value;
41 };
42
43 /***
44 * Keep lambda store the result of matrix-vector product to output vector y.
45 */
46 auto keep = [=] __cuda_callable__ ( int rowIdx, const double& value ) mutable {
47 yView[ rowIdx ] = value;
48 };
49
50 /***
51 * Compute matrix-vector product.
52 */
53 matrix.reduceRows( 0, matrix.getRows(), fetch, std::plus<>{}, keep, 0.0 );
54
55 std::cout << "The matrix reads as:" << std::endl << matrix << std::endl;
56 std::cout << "The input vector is:" << x << std::endl;
57 std::cout << "Result of matrix-vector multiplication is: " << y << std::endl;
58}
59
60int main( int argc, char* argv[] )
61{
62 std::cout << "Rows reduction on host:" << std::endl;
63 reduceRows< TNL::Devices::Host >();
64
65#ifdef HAVE_CUDA
67 std::cout << "Rows reduction on CUDA device:" << std::endl;
68 reduceRows< TNL::Devices::Cuda >();
69#endif
70}

We set the following lambda functions:

• fetch lambda function computes the product $$a_{ij}x_j$$ where $$a_{ij}$$ is represented by value and $$x_j$$ is represented by xView[columnIdx] (line 40).
• reduce - reduction is just sum of particular products and it is represented by std::plus (line 53).
• keep is responsible for storing the results of reduction in each matrix row (which is represented by the variable value) into the output vector y.

The result looks as:

Rows reduction on host:
Row: 0 -> 0:1 1:0 2:0 3:0 4:0
Row: 1 -> 0:1 1:2 2:0 3:0 4:0
Row: 2 -> 0:0 1:1 2:8 3:0 4:0
Row: 3 -> 0:0 1:0 2:1 3:9 4:0
Row: 4 -> 0:0 1:0 2:0 3:0 4:1
The input vector is:[ 1, 1, 1, 1, 1 ]
Result of matrix-vector multiplication is: [ 1, 3, 9, 10, 1 ]
Rows reduction on CUDA device:
Row: 0 -> 0:1 1:0 2:0 3:0 4:0
Row: 1 -> 0:1 1:2 2:0 3:0 4:0
Row: 2 -> 0:0 1:1 2:8 3:0 4:0
Row: 3 -> 0:0 1:0 2:1 3:9 4:0
Row: 4 -> 0:0 1:0 2:0 3:0 4:1
The input vector is:[ 1, 1, 1, 1, 1 ]
Result of matrix-vector multiplication is: [ 1, 3, 9, 10, 1 ]

We will show one more example which is a computation of maximal absolute value in each matrix row. The results will be stored in a vector:

$y_i = \max_{j=1,\ldots,n} |a_{ij}|.$

See the following example:

1#include <iostream>
2#include <iomanip>
3#include <functional>
4#include <TNL/Matrices/DenseMatrix.h>
5#include <TNL/Devices/Host.h>
6
7template< typename Device >
8void reduceRows()
9{
11 { 1, 0, 0, 0, 0 },
12 { 1, 2, 0, 0, 0 },
13 { 0, 1, 8, 0, 0 },
14 { 0, 0, 1, 9, 0 },
15 { 0, 0, 0, 0, 1 } };
16
17 /***
18 * Find largest element in each row.
19 */
20 TNL::Containers::Vector< double, Device > rowMax( matrix.getRows() );
21
22 /***
23 * Prepare vector view for lambdas.
24 */
25 auto rowMaxView = rowMax.getView();
26
27 /***
28 * Fetch lambda just returns absolute value of matrix elements.
29 */
30 auto fetch = [=] __cuda_callable__ ( int rowIdx, int columnIdx, const double& value ) -> double {
31 return TNL::abs( value );
32 };
33
34 /***
35 * Reduce lambda return maximum of given values.
36 */
37 auto reduce = [=] __cuda_callable__ ( double& a, const double& b ) -> double {
38 return TNL::max( a, b );
39 };
40
41 /***
42 * Keep lambda store the largest value in each row to the vector rowMax.
43 */
44 auto keep = [=] __cuda_callable__ ( int rowIdx, const double& value ) mutable {
45 rowMaxView[ rowIdx ] = value;
46 };
47
48 /***
49 * Compute the largest values in each row.
50 */
51 matrix.reduceRows( 0, matrix.getRows(), fetch, reduce, keep, std::numeric_limits< double >::lowest() );
52
53 std::cout << "Max. elements in rows are: " << rowMax << std::endl;
54 std::cout << "Max. matrix norm is: " << TNL::max( rowMax ) << std::endl;
55}
56
57int main( int argc, char* argv[] )
58{
59 std::cout << "Rows reduction on host:" << std::endl;
60 reduceRows< TNL::Devices::Host >();
61
62#ifdef HAVE_CUDA
63 std::cout << "Rows reduction on CUDA device:" << std::endl;
64 reduceRows< TNL::Devices::Cuda >();
65#endif
66}
Result reduce(Index begin, Index end, Fetch &&fetch, Reduction &&reduction, const Result &identity)
reduce implements (parallel) reduction for vectors and arrays.
Definition: reduce.h:71
constexpr ResultType max(const T1 &a, const T2 &b)
This function returns maximum of two numbers.
Definition: Math.h:62

The lambda functions rare:

• fetch lambda function just returns absolute value of $$a_{ij}$$ which is represented again by the variable value.
• reduce lambda function returns larger of given values.
• keep stores the results to the output vector the same way as in the previous example.

Note, that the idempotent value for the reduction is std::numeric_limits< double >::lowest. Of course, if we compute the maximum of all output vector elements, we get some kind of maximal matrix norm. The output looks as:

Rows reduction on host:
Max. elements in rows are: [ 1, 2, 8, 9, 1 ]
Max. matrix norm is: 9
Rows reduction on CUDA device:
Max. elements in rows are: [ 1, 2, 8, 9, 1 ]
Max. matrix norm is: 9

## Sparse matrices example

The following example demonstrates sparse matrix-vector product:

1#include <iostream>
2#include <iomanip>
3#include <functional>
4#include <TNL/Matrices/SparseMatrix.h>
5#include <TNL/Devices/Host.h>
6#include <TNL/Devices/Cuda.h>
7
8template< typename Device >
9void reduceRows()
10{
12 { 0, 0, 1 },
13 { 1, 0, 1 }, { 1, 1, 2 },
14 { 2, 1, 1 }, { 2, 2, 8 },
15 { 3, 2, 1 }, { 3, 3, 9 },
16 { 4, 4, 1 } } };
17
18 /***
19 * Allocate input and output vectors for matrix-vector product
20 */
21 TNL::Containers::Vector< double, Device > x( matrix.getColumns() ),
22 y( matrix.getRows() );
23
24 /***
25 * Fill the input vectors with ones.
26 */
27 x = 1.0;
28
29 /***
30 * Prepare vector view for lambdas.
31 */
32 auto xView = x.getView();
33 auto yView = y.getView();
34
35 /***
36 * Fetch lambda just returns product of appropriate matrix elements and the
37 * input vector elements.
38 */
39 auto fetch = [=] __cuda_callable__ ( int rowIdx, int columnIdx, const double& value ) -> double {
40 return xView[ columnIdx ] * value;
41 };
42
43 /***
44 * Keep lambda store the result of matrix-vector product to output vector y.
45 */
46 auto keep = [=] __cuda_callable__ ( int rowIdx, const double& value ) mutable {
47 yView[ rowIdx ] = value;
48 };
49
50 /***
51 * Compute matrix-vector product.
52 */
53 matrix.reduceRows( 0, matrix.getRows(), fetch, std::plus<>{}, keep, 0.0 );
54
55 std::cout << "The matrix reads as:" << std::endl << matrix << std::endl;
56 std::cout << "The input vector is:" << x << std::endl;
57 std::cout << "Result of matrix-vector multiplication is: " << y << std::endl;
58}
59
60int main( int argc, char* argv[] )
61{
62 std::cout << "Rows reduction on host:" << std::endl;
63 reduceRows< TNL::Devices::Host >();
64
65#ifdef HAVE_CUDA
67 std::cout << "Rows reduction on CUDA device:" << std::endl;
68 reduceRows< TNL::Devices::Cuda >();
69#endif
70}

On the lines 11-16 we set the following matrix:

$\left( \begin{array}{ccccc} 1 & . & . & . & . \\ 1 & 2 & . & . & . \\ . & 1 & 8 & . & . \\ . & . & 1 & 9 & . \\ . & . & . & . & 1 \end{array} \right)$

The lambda functions on the lines 39-48 are the same as in the example with the dense matrix. The result looks as follows:

Rows reduction on host:
Row: 0 -> 0:1
Row: 1 -> 0:1 1:2
Row: 2 -> 1:1 2:8
Row: 3 -> 2:1 3:9
Row: 4 -> 4:1
The input vector is:[ 1, 1, 1, 1, 1 ]
Result of matrix-vector multiplication is: [ 1, 3, 9, 10, 1 ]
Rows reduction on CUDA device:
Row: 0 -> 0:1
Row: 1 -> 0:1 1:2
Row: 2 -> 1:1 2:8
Row: 3 -> 2:1 3:9
Row: 4 -> 4:1
The input vector is:[ 1, 1, 1, 1, 1 ]
Result of matrix-vector multiplication is: [ 1, 3, 9, 10, 1 ]

## Tridiagonal matrices example

In this example, we will compute maximal absolute value in each row of the following tridiagonal matrix:

$\left( \begin{array}{ccccc} 1 & 3 & & & & \\ 2 & 1 & 3 & & & \\ & 2 & 1 & 3 & & \\ & & 2 & 1 & 3 & \\ & & & 2 & 1 & 3 \end{array} \right).$

The source code reads as follows:

1#include <iostream>
2#include <iomanip>
3#include <functional>
4#include <TNL/Matrices/TridiagonalMatrix.h>
5#include <TNL/Devices/Host.h>
6
7template< typename Device >
8void reduceRows()
9{
10 /***
11 * Set the following matrix (dots represent zero matrix elements and zeros are
12 * padding zeros for memory alignment):
13 *
14 * 0 / 1 3 . . . \ -> { 0, 1, 3 }
15 * | 2 1 3 . . | -> { 2, 1, 3 }
16 * | . 2 1 3 . | -> { 2, 1, 3 }
17 * | . . 2 1 3 | -> { 2, 1, 3 }
18 * \ . . . 2 1 / 0 -> { 2, 1, 0 }
19 *
20 */
22 5, // number of matrix columns
23 { { 0, 1, 3 }, // matrix elements
24 { 2, 1, 3 },
25 { 2, 1, 3 },
26 { 2, 1, 3 },
27 { 2, 1, 3 } } );
28
29 /***
30 * Find largest element in each row.
31 */
32 TNL::Containers::Vector< double, Device > rowMax( matrix.getRows() );
33
34 /***
35 * Prepare vector view for lambdas.
36 */
37 auto rowMaxView = rowMax.getView();
38
39 /***
40 * Fetch lambda just returns absolute value of matrix elements.
41 */
42 auto fetch = [=] __cuda_callable__ ( int rowIdx, int columnIdx, const double& value ) -> double {
43 return TNL::abs( value );
44 };
45
46 /***
47 * Reduce lambda return maximum of given values.
48 */
49 auto reduce = [=] __cuda_callable__ ( const double& a, const double& b ) -> double {
50 return TNL::max( a, b );
51 };
52
53 /***
54 * Keep lambda store the largest value in each row to the vector rowMax.
55 */
56 auto keep = [=] __cuda_callable__ ( int rowIdx, const double& value ) mutable {
57 rowMaxView[ rowIdx ] = value;
58 };
59
60 /***
61 * Compute the largest values in each row.
62 */
63 matrix.reduceRows( 0, matrix.getRows(), fetch, reduce, keep, std::numeric_limits< double >::lowest() );
64
65 std::cout << "The matrix reads as: " << std::endl << matrix << std::endl;
66 std::cout << "Max. elements in rows are: " << rowMax << std::endl;
67}
68
69int main( int argc, char* argv[] )
70{
71 std::cout << "Rows reduction on host:" << std::endl;
72 reduceRows< TNL::Devices::Host >();
73
74#ifdef HAVE_CUDA
75 std::cout << "Rows reduction on CUDA device:" << std::endl;
76 reduceRows< TNL::Devices::Cuda >();
77#endif
78}

Here we first set the tridiagonal matrix (lines 10-27). Next we allocate the vector rowMax where we will store the results (line 32). The lambda function are:

• fetch (lines 42-44) is responsible for reading the matrix elements. In our example, the only thing this function has to do, is to compute the absolute value of each matrix element represented by variable value.
• reduce (lines 49-51), performs reduction operation. In this case, it returns maximum of two input values a and b.
• keep (lines 56-58) takes the result of the reduction in variable value in each row and stores it into the vector rowMax via related vector view rowMaxView.

Note, that the idempotent value for the reduction is std::numeric_limits< double >::lowest. The results looks as follows:

Rows reduction on host:
Row: 0 -> 0:1 1:3
Row: 1 -> 0:2 1:1 2:3
Row: 2 -> 1:2 2:1 3:3
Row: 3 -> 2:2 3:1 4:3
Row: 4 -> 3:2 4:1
Max. elements in rows are: [ 3, 3, 3, 3, 2 ]
Rows reduction on CUDA device:
Row: 0 -> 0:1 1:3
Row: 1 -> 0:2 1:1 2:3
Row: 2 -> 1:2 2:1 3:3
Row: 3 -> 2:2 3:1 4:3
Row: 4 -> 3:2 4:1
Max. elements in rows are: [ 3, 3, 3, 3, 2 ]

## Multidiagonal matrices example

The next example computes again the maximal absolute value in each row. Now, we do it for multidiagonal matrix the following form:

$\left( \begin{array}{ccccc} 1 & & & & \\ 2 & 1 & & & \\ 3 & 2 & 1 & & \\ & 3 & 2 & 1 & \\ & & 3 & 2 & 1 \end{array} \right)$

We first create vector rowMax into which we will store the results and fetch it view rowMaxView (line 39). Next we prepare necessary lambda functions:

• fetch (lines 44-46) is responsible for reading the matrix element value which is stored in the constant reference value and for returning its absolute value. The other parameters rowIdx and columnIdx correspond to row and column indexes respectively and they are omitted in our example.
• reduce (lines 51-53) returns maximum value of the two input values a and b.
• keep (line 58-60) stores the input value at the corresponding position, given by the row index rowIdx, in the output vector view rowMaxView.

Finally, we call the method reduceRows (TNL::Matrices::MultidiagonalMatrix::reduceRows) with parameters telling the interval of rows to be processed (the first and second parameter), the lambda functions fetch, reduce and keep, and the idempotent element for the reduction operation which is the lowest number of given type (std::numeric_limits< double >::lowest ). The result looks as follows:

Rows reduction on host:
Row: 0 -> 0:1
Row: 1 -> 0:2 1:1
Row: 2 -> 0:3 1:2 2:1
Row: 3 -> 1:3 2:2 3:1
Row: 4 -> 2:3 3:2 4:1
Max. elements in rows are: [ 1, 2, 3, 3, 3 ]
Rows reduction on CUDA device:
Row: 0 -> 0:1
Row: 1 -> 0:2 1:1
Row: 2 -> 0:3 1:2 2:1
Row: 3 -> 1:3 2:2 3:1
Row: 4 -> 2:3 3:2 4:1
Max. elements in rows are: [ 1, 2, 3, 3, 3 ]

## Lambda matrices example

The reduction of matrix rows is available for the lambda matrices as well. See the following example:

1#include <iostream>
2#include <iomanip>
3#include <functional>
4#include <TNL/Matrices/LambdaMatrix.h>
5#include <TNL/Devices/Host.h>
6#include <TNL/Devices/Cuda.h>
7
8template< typename Device >
9void reduceRows()
10{
11 /***
12 * Lambda functions defining the matrix.
13 */
14 auto rowLengths = [=] __cuda_callable__ ( const int rows, const int columns, const int rowIdx ) -> int { return columns; };
15 auto matrixElements = [=] __cuda_callable__ ( const int rows, const int columns, const int rowIdx, const int localIdx, int& columnIdx, double& value ) {
16 columnIdx = localIdx;
17 value = TNL::max( rowIdx - columnIdx + 1, 0 );
18 };
19
21 auto matrix = MatrixFactory::create( 5, 5, matrixElements, rowLengths );
22
23 /***
24 * Find largest element in each row.
25 */
26 TNL::Containers::Vector< double, Device > rowMax( matrix.getRows() );
27
28 /***
29 * Prepare vector view for lambdas.
30 */
31 auto rowMaxView = rowMax.getView();
32
33 /***
34 * Fetch lambda just returns absolute value of matrix elements.
35 */
36 auto fetch = [=] __cuda_callable__ ( int rowIdx, int columnIdx, const double& value ) -> double {
37 return TNL::abs( value );
38 };
39
40 /***
41 * Reduce lambda return maximum of given values.
42 */
43 auto reduce = [=] __cuda_callable__ ( const double& a, const double& b ) -> double {
44 return TNL::max( a, b );
45 };
46
47 /***
48 * Keep lambda store the largest value in each row to the vector rowMax.
49 */
50 auto keep = [=] __cuda_callable__ ( int rowIdx, const double& value ) mutable {
51 rowMaxView[ rowIdx ] = value;
52 };
53
54 /***
55 * Compute the largest values in each row.
56 */
57 matrix.reduceRows( 0, matrix.getRows(), fetch, reduce, keep, std::numeric_limits< double >::lowest() );
58
59 std::cout << "The matrix reads as: " << std::endl << matrix << std::endl;
60 std::cout << "Max. elements in rows are: " << rowMax << std::endl;
61}
62
63int main( int argc, char* argv[] )
64{
65 std::cout << "Rows reduction on host:" << std::endl;
66 reduceRows< TNL::Devices::Host >();
67
68#ifdef HAVE_CUDA
69 std::cout << "Rows reduction on CUDA device:" << std::endl;
70 reduceRows< TNL::Devices::Cuda >();
71#endif
72}

On the lines 14-21, we create the lower triangular lambda matrix which looks as follows:

$\left( \begin{array}{ccccc} 1 & & & & \\ 2 & 1 & & & \\ 3 & 2 & 1 & & \\ 4 & 3 & 2 & 1 & \\ 5 & 4 & 3 & 2 & 1 \end{array} \right)$

We want to compute maximal absolute value of matrix elements in each row. For this purpose we define well known lambda functions:

• fetch takes the value of the lambda matrix element and returns its absolute value.
• reduce computes maximum value of two input variables.
• keep stores the results into output vector rowMax.

Note that the interface of the lambda functions is the same as for other matrix types. The result looks as follows:

Rows reduction on host:
Row: 0 -> 0:1
Row: 1 -> 0:2 1:1
Row: 2 -> 0:3 1:2 2:1
Row: 3 -> 0:4 1:3 2:2 3:1
Row: 4 -> 0:5 1:4 2:3 3:2 4:1
Max. elements in rows are: [ 1, 2, 3, 4, 5 ]
Rows reduction on CUDA device:
Row: 0 -> 0:1
Row: 1 -> 0:2 1:1
Row: 2 -> 0:3 1:2 2:1
Row: 3 -> 0:4 1:3 2:2 3:1
Row: 4 -> 0:5 1:4 2:3 3:2 4:1
Max. elements in rows are: [ 1, 2, 3, 4, 5 ]

# Matrix-vector product

One of the most important matrix operation is the matrix-vector multiplication. It is represented by a method vectorProduct (TNL::Matrices::DenseMatrix::vectorProduct, TNL::Matrices::SparseMatrix::vectorProduct, TNL::Matrices::TridiagonalMatrix::vectorProduct, TNL::Matrices::MultidiagonalMatrix::vectorProduct, TNL::Matrices::LambdaMatrix::vectorProduct). It is templated method with two template parameters InVector and OutVector telling the types of input and output vector respectively. Usually one will substitute some of TNL::Containers::Array, TNL::Containers::ArrayView, TNL::Containers::Vector or TNL::Containers::VectorView for these types. The method accepts the following parameters:

1. inVector is the input vector having the same number of elements as the number of matrix columns.
2. outVector is the output vector having the same number of elements as the number of matrix rows.
3. matrixMultiplicator is a number by which the result of matrix-vector product is multiplied.
4. outVectorMultiplicator is a number by which the output vector is multiplied before added to the result of matrix-vector product.
5. begin is the beginning of the matrix rows range on which we compute the matrix-vector product.
6. end is the end of the matrix rows range on which the matrix-vector product will be evaluated. The last matrix row which is going to be processed has index end-1.

Note that the output vector dimension must be the same as the number of matrix rows no matter how we set begin and end parameters. These parameters just say that some matrix rows and the output vector elements are omitted.

To summarize, this method computes the following formula:

outVector = matrixMultiplicator * ( *this ) * inVector + outVectorMultiplicator * outVector.

# Matrix I/O operations

All matrices can be saved to a file using a method save (TNL::Matrices::DenseMatrix::save, TNL::Matrices::SparseMatrix::save, TNL::Matrices::TridiagonalMatrix::save, TNL::Matrices::MultidiagonalMatrix::save) and restored with a method load (TNL::Matrices::DenseMatrix::load, TNL::Matrices::SparseMatrix::load, TNL::Matrices::TridiagonalMatrix::load, TNL::Matrices::MultidiagonalMatrix::load). To print the matrix, there is a method print (TNL::Matrices::DenseMatrix::print, TNL::Matrices::SparseMatrix::print, TNL::Matrices::TridiagonalMatrix::print, TNL::Matrices::MultidiagonalMatrix::print, TNL::Matrices::LambdaMatrix::print) can be used.

TNL also offers matrix reader (TNL::Matrices::MatrixReader) and matrix writer (TNL::Matrices::MatrixWriter) for import and export of matrices respectively. The matrix reader currently supports only Coordinate MTX file format which is popular mainly for sparse matrices. By the mean of the matrix writer, we can export TNL matrices into coordinate MTX format as well. In addition, the matrices can be exported to a text file suitable for Gnuplot program which can be used for matrix visualization. Finally, a pattern of nonzero matrix elements can be visualized via the EPS format - Encapsulated PostScript. We demonstrate both matrix reader and writer in the following example:

1#include <iostream>
2#include <TNL/Matrices/SparseMatrix.h>
3#include <TNL/Matrices/DenseMatrix.h>
5#include <TNL/Matrices/MatrixWriter.h>
6#include <TNL/Devices/Host.h>
7#include <TNL/Devices/Cuda.h>
8
9
10template< typename Device >
11void matrixWriterExample()
12{
14 Matrix matrix (
15 5, // number of matrix rows
16 5, // number of matrix columns
17 { // matrix elements definition
18 { 0, 0, 2.0 },
19 { 1, 0, -1.0 }, { 1, 1, 2.0 }, { 1, 2, -1.0 },
20 { 2, 1, -1.0 }, { 2, 2, 2.0 }, { 2, 3, -1.0 },
21 { 3, 2, -1.0 }, { 3, 3, 2.0 }, { 3, 4, -1.0 },
22 { 4, 4, 2.0 } } );
23
24 std::cout << "Matrix: " << std::endl << matrix << std::endl;
25 std::cout << "Writing matrix in Gnuplot format into the file matrix-writer-example.gplt ...";
26 TNL::Matrices::MatrixWriter< Matrix >::writeGnuplot( "matrix-writer-example.gplt", matrix );
27 std::cout << " OK " << std::endl;
28 std::cout << "Writing matrix pattern in EPS format into the file matrix-writer-example.eps ...";
29 TNL::Matrices::MatrixWriter< Matrix >::writeEps( "matrix-writer-example.eps", matrix );
30 std::cout << " OK " << std::endl;
31 std::cout << "Writing matrix in MTX format into the file matrix-writer-example.mtx ...";
32 TNL::Matrices::MatrixWriter< Matrix >::writeMtx( "matrix-writer-example.mtx", matrix );
33 std::cout << " OK " << std::endl;
34}
35
36template< typename Device >
38{
40 SparseMatrix sparseMatrix;
41
42 std::cout << "Reading sparse matrix from MTX file matrix-writer-example.mtx ... ";
44 std::cout << " OK " << std::endl;
45 std::cout << "Imported matrix is: " << std::endl << sparseMatrix << std::endl;
46
48 DenseMatrix denseMatrix;
49
50 std::cout << "Reading dense matrix from MTX file matrix-writer-example.mtx ... ";
52 std::cout << " OK " << std::endl;
53 std::cout << "Imported matrix is: " << std::endl << denseMatrix << std::endl;
54}
55
56int main( int argc, char* argv[] )
57{
58 std::cout << "Creating matrices on CPU ... " << std::endl;
59 matrixWriterExample< TNL::Devices::Host >();
61
62#ifdef HAVE_CUDA
64 std::cout << "Creating matrices on CUDA GPU ... " << std::endl;
65 matrixWriterExample< TNL::Devices::Cuda >();
67#endif
68}
static void readMtx(const String &fileName, Matrix &matrix, bool verbose=false)
Method for importing matrix from file with given filename.
static void writeEps(const TNL::String &fileName, const Matrix &matrix, bool verbose=false)
Method for exporting matrix to file with given filename using EPS format.
Definition: MatrixWriter.hpp:55
static void writeGnuplot(const TNL::String &fileName, const Matrix &matrix, bool verbose=false)
Method for exporting matrix to file with given filename using Gnuplot format.
Definition: MatrixWriter.hpp:19
static void writeMtx(const TNL::String &fileName, const Matrix &matrix, bool verbose=false)
Method for exporting matrix to file with given filename using MTX format.
Definition: MatrixWriter.hpp:37

The example consists of two functions - matrixWriterExample (lines 10-24) and matrixReaderExample (lines 36-54). In the first one, we first create a toy matrix (lines 13-22) which we subsequently export into Gnuplot (line 26), EPS (line 29) and MTX (line 32) formats. In the next step (the matrixReaderExample` function on lines 36-54), the MTX file is used to import the matrix into sparse (line 43) and dense (line 51) matrices. Both matrices are printed out (lines 45 and 53).

The result looks as follows:

1Creating matrices on CPU ...
2Matrix:
3Row: 0 -> 0:2
4Row: 1 -> 0:-1 1:2 2:-1
5Row: 2 -> 1:-1 2:2 3:-1
6Row: 3 -> 2:-1 3:2 4:-1
7Row: 4 -> 4:2
8
9Writing matrix in Gnuplot format into the file matrix-writer-example.gplt ... OK
10Writing matrix pattern in EPS format into the file matrix-writer-example.eps ... OK
11Writing matrix in MTX format into the file matrix-writer-example.mtx ... OK
12Reading sparse matrix from MTX file matrix-writer-example.mtx ... OK
13Imported matrix is:
14Row: 0 -> 0:2
15Row: 1 -> 0:-1 1:2 2:-1
16Row: 2 -> 1:-1 2:2 3:-1
17Row: 3 -> 2:-1 3:2 4:-1
18Row: 4 -> 4:2
19
20Reading dense matrix from MTX file matrix-writer-example.mtx ... OK
21Imported matrix is:
22Row: 0 -> 0:2 1:0 2:0 3:0 4:0
23Row: 1 -> 0:-1 1:2 2:-1 3:0 4:0
24Row: 2 -> 0:0 1:-1 2:2 3:-1 4:0
25Row: 3 -> 0:0 1:0 2:-1 3:2 4:-1
26Row: 4 -> 0:0 1:0 2:0 3:0 4:2
27
28
29Creating matrices on CUDA GPU ...
30Matrix:
31Row: 0 -> 0:2
32Row: 1 -> 0:-1 1:2 2:-1
33Row: 2 -> 1:-1 2:2 3:-1
34Row: 3 -> 2:-1 3:2 4:-1
35Row: 4 -> 4:2
36
37Writing matrix in Gnuplot format into the file matrix-writer-example.gplt ... OK
38Writing matrix pattern in EPS format into the file matrix-writer-example.eps ... OK
39Writing matrix in MTX format into the file matrix-writer-example.mtx ... OK
40Reading sparse matrix from MTX file matrix-writer-example.mtx ... OK
41Imported matrix is:
42Row: 0 -> 0:2
43Row: 1 -> 0:-1 1:2 2:-1
44Row: 2 -> 1:-1 2:2 3:-1
45Row: 3 -> 2:-1 3:2 4:-1
46Row: 4 -> 4:2
47
48Reading dense matrix from MTX file matrix-writer-example.mtx ... OK
49Imported matrix is:
50Row: 0 -> 0:2 1:0 2:0 3:0 4:0
51Row: 1 -> 0:-1 1:2 2:-1 3:0 4:0
52Row: 2 -> 0:0 1:-1 2:2 3:-1 4:0
53Row: 3 -> 0:0 1:0 2:-1 3:2 4:-1
54Row: 4 -> 0:0 1:0 2:0 3:0 4:2

# Appendix

## Benchmark of dense matrix setup

1#include <iostream>
2#include <TNL/Algorithms/ParallelFor.h>
3#include <TNL/Matrices/SparseMatrix.h>
4#include <TNL/Devices/Host.h>
5#include <TNL/Devices/Cuda.h>
6#include <TNL/Timer.h>
7
8const int testsCount = 5;
9
10template< typename Matrix >
11void setElement_on_host( const int matrixSize, Matrix& matrix )
12{
13 matrix.setDimensions( matrixSize, matrixSize );
14
15 for( int j = 0; j < matrixSize; j++ )
16 for( int i = 0; i < matrixSize; i++ )
17 matrix.setElement( i, j, i + j );
18}
19
20template< typename Matrix >
21void setElement_on_host_and_transfer( const int matrixSize, Matrix& matrix )
22{
23 using RealType = typename Matrix::RealType;
24 using IndexType = typename Matrix::IndexType;
26 HostMatrix hostMatrix( matrixSize, matrixSize );
27
28 for( int j = 0; j < matrixSize; j++ )
29 for( int i = 0; i < matrixSize; i++ )
30 hostMatrix.setElement( i, j, i + j );
31 matrix = hostMatrix;
32}
33
34template< typename Matrix >
35void setElement_on_device( const int matrixSize, Matrix& matrix )
36{
37 matrix.setDimensions( matrixSize, matrixSize );
38
39 auto matrixView = matrix.getView();
40 auto f = [=] __cuda_callable__ ( int i, int j ) mutable {
41 matrixView.setElement( i, j, i + j );
42 };
44}
45
46template< typename Matrix >
47void getRow( const int matrixSize, Matrix& matrix )
48{
49 matrix.setDimensions( matrixSize, matrixSize );
50
51 auto matrixView = matrix.getView();
52 auto f = [=] __cuda_callable__ ( int rowIdx ) mutable {
53 auto row = matrixView.getRow( rowIdx );
54 for( int i = 0; i < matrixSize; i++ )
55 row.setValue( i, rowIdx + i );
56 };
58}
59
60template< typename Matrix >
61void forElements( const int matrixSize, Matrix& matrix )
62{
63 matrix.setDimensions( matrixSize, matrixSize );
64
65 auto f = [=] __cuda_callable__ ( int rowIdx, int localIdx, int& columnIdx, float& value ) mutable {
66 value = rowIdx + columnIdx;
67 };
68 matrix.forElements( 0, matrixSize, f );
69}
70
71template< typename Device >
72void setupDenseMatrix()
73{
74 std::cout << " Dense matrix test:" << std::endl;
75 for( int matrixSize = 16; matrixSize <= 8192; matrixSize *= 2 )
76 {
77 std::cout << " Matrix size = " << matrixSize << std::endl;
78 TNL::Timer timer;
79
80 std::cout << " setElement on host: ";
81 timer.reset();
82 timer.start();
83 for( int i = 0; i < testsCount; i++ )
84 {
86 setElement_on_host( matrixSize, matrix );
87 }
88 timer.stop();
89 std::cout << timer.getRealTime() / ( double ) testsCount << " sec." << std::endl;
90
91 std::cout << " setElement on device: ";
92 timer.reset();
93 timer.start();
94 for( int i = 0; i < testsCount; i++ )
95 {
97 setElement_on_device( matrixSize, matrix );
98 }
99 timer.stop();
100 std::cout << timer.getRealTime() / ( double ) testsCount << " sec." << std::endl;
101
103 {
104 std::cout << " setElement on host and transfer on GPU: ";
105 timer.reset();
106 timer.start();
107 for( int i = 0; i < testsCount; i++ )
108 {
110 setElement_on_host_and_transfer( matrixSize, matrix );
111 }
112 timer.stop();
113 std::cout << timer.getRealTime() / ( double ) testsCount << " sec." << std::endl;
114 }
115
116 std::cout << " getRow: ";
117 timer.reset();
118 timer.start();
119 for( int i = 0; i < testsCount; i++ )
120 {
122 getRow( matrixSize, matrix );
123 }
124 timer.stop();
125 std::cout << timer.getRealTime() / ( double ) testsCount << " sec." << std::endl;
126
127 std::cout << " forElements: ";
128 timer.reset();
129 timer.start();
130 for( int i = 0; i < testsCount; i++ )
131 {
133 forElements( matrixSize, matrix );
134 }
135 timer.stop();
136 std::cout << timer.getRealTime() / ( double ) testsCount << " sec." << std::endl;
137 }
138}
139
140
141int main( int argc, char* argv[] )
142{
143 std::cout << "Creating dense matrix on CPU ... " << std::endl;
144 setupDenseMatrix< TNL::Devices::Host >();
145
146
147#ifdef HAVE_CUDA
148 std::cout << "Creating dense matrix on CUDA GPU ... " << std::endl;
149 setupDenseMatrix< TNL::Devices::Cuda >();
150#endif
151}
Class for real time, CPU time and CPU cycles measuring.
Definition: Timer.h:28
void reset()
Resets timer.
Definition: Timer.hpp:26
double getRealTime() const
Returns the elapsed real time on this timer.
Definition: Timer.hpp:58
void stop()
Stops (pauses) the timer.
Definition: Timer.hpp:38
void start()
Starts timer.
Definition: Timer.hpp:49

## Benchmark of sparse matrix setup

1#include <iostream>
2#include <TNL/Algorithms/ParallelFor.h>
3#include <TNL/Matrices/SparseMatrix.h>
4#include <TNL/Devices/Host.h>
5#include <TNL/Devices/Cuda.h>
6#include <TNL/Timer.h>
7
8const int testsCount = 5;
9
10template< typename Matrix >
11void STL_Map( const int gridSize, Matrix& matrix )
12{
13 /***
14 * Set matrix representing approximation of the Laplace operator on regular
15 * grid using the finite difference method by means of STL map.
16 */
17 const int matrixSize = gridSize * gridSize;
18 matrix.setDimensions( matrixSize, matrixSize );
20 for( int j = 0; j < gridSize; j++ )
21 for( int i = 0; i < gridSize; i++ )
22 {
23 const int rowIdx = j * gridSize + i;
24 if( i == 0 || j == 0 || i == gridSize - 1 || j == gridSize - 1 )
25 map.insert( std::make_pair( std::make_pair( rowIdx, rowIdx ), 1.0 ) );
26 else
27 {
28 map.insert( std::make_pair( std::make_pair( rowIdx, rowIdx - gridSize ), 1.0 ) );
29 map.insert( std::make_pair( std::make_pair( rowIdx, rowIdx - 1 ), 1.0 ) );
30 map.insert( std::make_pair( std::make_pair( rowIdx, rowIdx ), -4.0 ) );
31 map.insert( std::make_pair( std::make_pair( rowIdx, rowIdx + 1 ), 1.0 ) );
32 map.insert( std::make_pair( std::make_pair( rowIdx, rowIdx + gridSize ), 1.0 ) );
33 }
34 }
35 matrix.setElements( map );
36}
37
38template< typename Matrix >
39void setElement_on_host( const int gridSize, Matrix& matrix )
40{
41 /***
42 * Set matrix representing approximation of the Laplace operator on regular
43 * grid using the finite difference method by means setElement method called
44 * from the host system.
45 */
46 const int matrixSize = gridSize * gridSize;
48 matrix.setDimensions( matrixSize, matrixSize );
49 matrix.setRowCapacities( rowCapacities );
50
51 for( int j = 0; j < gridSize; j++ )
52 for( int i = 0; i < gridSize; i++ )
53 {
54 const int rowIdx = j * gridSize + i;
55 if( i == 0 || j == 0 || i == gridSize - 1 || j == gridSize - 1 )
56 matrix.setElement( rowIdx, rowIdx, 1.0 );
57 else
58 {
59 matrix.setElement( rowIdx, rowIdx - gridSize, 1.0 );
60 matrix.setElement( rowIdx, rowIdx - 1, 1.0 );
61 matrix.setElement( rowIdx, rowIdx, -4.0 );
62 matrix.setElement( rowIdx, rowIdx + 1, 1.0 );
63 matrix.setElement( rowIdx, rowIdx + gridSize, 1.0 );
64 }
65 }
66}
67
68template< typename Matrix >
69void setElement_on_host_and_transfer( const int gridSize, Matrix& matrix )
70{
71 using RealType = typename Matrix::RealType;
72 using HostMatrix = typename Matrix::template Self< RealType, TNL::Devices::Host >;
73
74 const int matrixSize = gridSize * gridSize;
76 HostMatrix hostMatrix( matrixSize, matrixSize );
77 hostMatrix.setRowCapacities( rowCapacities );
78
79 for( int j = 0; j < gridSize; j++ )
80 for( int i = 0; i < gridSize; i++ )
81 {
82 const int rowIdx = j * gridSize + i;
83 if( i == 0 || j == 0 || i == gridSize - 1 || j == gridSize - 1 )
84 hostMatrix.setElement( rowIdx, rowIdx, 1.0 );
85 else
86 {
87 hostMatrix.setElement( rowIdx, rowIdx - gridSize, 1.0 );
88 hostMatrix.setElement( rowIdx, rowIdx - 1, 1.0 );
89 hostMatrix.setElement( rowIdx, rowIdx, -4.0 );
90 hostMatrix.setElement( rowIdx, rowIdx + 1, 1.0 );
91 hostMatrix.setElement( rowIdx, rowIdx + gridSize, 1.0 );
92 }
93 }
94 matrix = hostMatrix;
95}
96
97template< typename Matrix >
98void setElement_on_device( const int gridSize, Matrix& matrix )
99{
100 /***
101 * Set matrix representing approximation of the Laplace operator on regular
102 * grid using the finite difference method by means of setElement method called
103 * from the native device.
104 */
105 const int matrixSize = gridSize * gridSize;
107 matrix.setDimensions( matrixSize, matrixSize );
108 matrix.setRowCapacities( rowCapacities );
109
110 auto matrixView = matrix.getView();
111 auto f = [=] __cuda_callable__ ( int i, int j ) mutable {
112 const int rowIdx = j * gridSize + i;
113 if( i == 0 || j == 0 || i == gridSize - 1 || j == gridSize - 1 )
114 matrixView.setElement( rowIdx, rowIdx, 1.0 );
115 else
116 {
117 matrixView.setElement( rowIdx, rowIdx - gridSize, 1.0 );
118 matrixView.setElement( rowIdx, rowIdx - 1, 1.0 );
119 matrixView.setElement( rowIdx, rowIdx, -4.0 );
120 matrixView.setElement( rowIdx, rowIdx + 1, 1.0 );
121 matrixView.setElement( rowIdx, rowIdx + gridSize, 1.0 );
122 }
123 };
125}
126
127template< typename Matrix >
128void getRow( const int gridSize, Matrix& matrix )
129{
130 /***
131 * Set matrix representing approximation of the Laplace operator on regular
132 * grid using the finite difference method by means of getRow method.
133 */
134 const int matrixSize = gridSize * gridSize;
136 matrix.setDimensions( matrixSize, matrixSize );
137 matrix.setRowCapacities( rowCapacities );
138
139 auto matrixView = matrix.getView();
140 auto f = [=] __cuda_callable__ ( int rowIdx ) mutable {
141 const int i = rowIdx % gridSize;
142 const int j = rowIdx / gridSize;
143 auto row = matrixView.getRow( rowIdx );
144 if( i == 0 || j == 0 || i == gridSize - 1 || j == gridSize - 1 )
145 row.setElement( 2, rowIdx, 1.0 );
146 else
147 {
148 row.setElement( 0, rowIdx - gridSize, 1.0 );
149 row.setElement( 1, rowIdx - 1, 1.0 );
150 row.setElement( 2, rowIdx, -4.0 );
151 row.setElement( 3, rowIdx + 1, 1.0 );
152 row.setElement( 4, rowIdx + gridSize, 1.0 );
153 }
154 };
156}
157
158template< typename Matrix >
159void forElements( const int gridSize, Matrix& matrix )
160{
161 /***
162 * Set matrix representing approximation of the Laplace operator on regular
163 * grid using the finite difference method by means of forElements method.
164 */
165
166 const int matrixSize = gridSize * gridSize;
168 matrix.setDimensions( matrixSize, matrixSize );
169 matrix.setRowCapacities( rowCapacities );
170
171 auto f = [=] __cuda_callable__ ( int rowIdx, int localIdx, int& columnIdx, float& value ) mutable {
172 const int i = rowIdx % gridSize;
173 const int j = rowIdx / gridSize;
174 if( ( i == 0 || j == 0 || i == gridSize - 1 || j == gridSize - 1 ) && localIdx == 0 )
175 {
176 columnIdx = rowIdx;
177 value = 1.0;
178 }
179 else
180 {
181 switch( localIdx )
182 {
183 case 0:
184 columnIdx = rowIdx - gridSize;
185 value = 1.0;
186 break;
187 case 1:
188 columnIdx = rowIdx - 1;
189 value = 1.0;
190 break;
191 case 2:
192 columnIdx = rowIdx;
193 value = -4.0;
194 break;
195 case 3:
196 columnIdx = rowIdx + 1;
197 value = 1.0;
198 break;
199 case 4:
200 columnIdx = rowIdx + gridSize;
201 value = 1.0;
202 break;
203 }
204 }
205 };
206 matrix.forElements( 0, matrixSize, f );
207}
208
209template< typename Device >
210void laplaceOperatorSparseMatrix()
211{
212 std::cout << " Sparse matrix test:" << std::endl;
213 for( int gridSize = 16; gridSize <= 8192; gridSize *= 2 )
214 {
215 std::cout << " Grid size = " << gridSize << std::endl;
216 TNL::Timer timer;
217
218 std::cout << " STL map: ";
219 timer.reset();
220 timer.start();
221 for( int i = 0; i < testsCount; i++ )
222 {
224 STL_Map( gridSize, matrix );
225 }
226 timer.stop();
227 std::cout << timer.getRealTime() / ( double ) testsCount << " sec." << std::endl;
228
229 std::cout << " setElement on host: ";
230 timer.reset();
231 timer.start();
232 for( int i = 0; i < testsCount; i++ )
233 {
235 setElement_on_host( gridSize, matrix );
236 }
237 timer.stop();
238 std::cout << timer.getRealTime() / ( double ) testsCount << " sec." << std::endl;
239
241 {
242 std::cout << " setElement on host and transfer on GPU: ";
243 timer.reset();
244 timer.start();
245 for( int i = 0; i < testsCount; i++ )
246 {
248 setElement_on_host_and_transfer( gridSize, matrix );
249 }
250 timer.stop();
251 std::cout << timer.getRealTime() / ( double ) testsCount << " sec." << std::endl;
252 }
253
254 std::cout << " setElement on device: ";
255 timer.reset();
256 timer.start();
257 for( int i = 0; i < testsCount; i++ )
258 {
260 setElement_on_device( gridSize, matrix );
261 }
262 timer.stop();
263 std::cout << timer.getRealTime() / ( double ) testsCount << " sec." << std::endl;
264
265 std::cout << " getRow: ";
266 timer.reset();
267 timer.start();
268 for( int i = 0; i < testsCount; i++ )
269 {
271 getRow( gridSize, matrix );
272 }
273 timer.stop();
274 std::cout << timer.getRealTime() / ( double ) testsCount << " sec." << std::endl;
275
276 std::cout << " forElements: ";
277 timer.reset();
278 timer.start();
279 for( int i = 0; i < testsCount; i++ )
280 {
282 forElements( gridSize, matrix );
283 }
284 timer.stop();
285 std::cout << timer.getRealTime() / ( double ) testsCount << " sec." << std::endl;
286
287 }
288}
289
290int main( int argc, char* argv[] )
291{
292 std::cout << "Creating Laplace operator matrix on CPU ... " << std::endl;
293 laplaceOperatorSparseMatrix< TNL::Devices::Host >();
294
295#ifdef HAVE_CUDA
296 std::cout << "Creating Laplace operator matrix on CUDA GPU ... " << std::endl;
297 laplaceOperatorSparseMatrix< TNL::Devices::Cuda >();
298#endif
299}

## Benchmark of multidiagonal matrix setup

1#include <iostream>
2#include <TNL/Algorithms/ParallelFor.h>
3#include <TNL/Matrices/MultidiagonalMatrix.h>
4#include <TNL/Devices/Host.h>
5#include <TNL/Devices/Cuda.h>
6#include <TNL/Timer.h>
7
8const int testsCount = 5;
9
10template< typename Device >
11TNL::Containers::Vector< int, Device > getOffsets( const int gridSize )
12{
14 offsets.setElement( 0, -gridSize );
15 offsets.setElement( 1, -1 );
16 offsets.setElement( 2, 0 );
17 offsets.setElement( 3, 1 );
18 offsets.setElement( 4, gridSize );
19 return offsets;
20}
21
22template< typename Matrix >
23void setElement_on_host( const int gridSize, Matrix& matrix )
24{
25 /***
26 * Set matrix representing approximation of the Laplace operator on regular
27 * grid using the finite difference method by means setElement method called
28 * from the host system.
29 */
30 const int matrixSize = gridSize * gridSize;
31 matrix.setDimensions( matrixSize, matrixSize, getOffsets< typename Matrix::DeviceType >( gridSize ) );
32
33 for( int j = 0; j < gridSize; j++ )
34 for( int i = 0; i < gridSize; i++ )
35 {
36 const int rowIdx = j * gridSize + i;
37 if( i == 0 || j == 0 || i == gridSize - 1 || j == gridSize - 1 )
38 matrix.setElement( rowIdx, rowIdx, 1.0 );
39 else
40 {
41 matrix.setElement( rowIdx, rowIdx - gridSize, 1.0 );
42 matrix.setElement( rowIdx, rowIdx - 1, 1.0 );
43 matrix.setElement( rowIdx, rowIdx, -4.0 );
44 matrix.setElement( rowIdx, rowIdx + 1, 1.0 );
45 matrix.setElement( rowIdx, rowIdx + gridSize, 1.0 );
46 }
47 }
48}
49
50template< typename Matrix >
51void setElement_on_host_and_transfer( const int gridSize, Matrix& matrix )
52{
53 using RealType = typename Matrix::RealType;
54 using IndexType = typename Matrix::IndexType;
56 const int matrixSize = gridSize * gridSize;
57 HostMatrix hostMatrix( matrixSize, matrixSize, getOffsets< typename Matrix::DeviceType >( gridSize ) );
58
59 for( int j = 0; j < gridSize; j++ )
60 for( int i = 0; i < gridSize; i++ )
61 {
62 const int rowIdx = j * gridSize + i;
63 if( i == 0 || j == 0 || i == gridSize - 1 || j == gridSize - 1 )
64 hostMatrix.setElement( rowIdx, rowIdx, 1.0 );
65 else
66 {
67 hostMatrix.setElement( rowIdx, rowIdx - gridSize, 1.0 );
68 hostMatrix.setElement( rowIdx, rowIdx - 1, 1.0 );
69 hostMatrix.setElement( rowIdx, rowIdx, -4.0 );
70 hostMatrix.setElement( rowIdx, rowIdx + 1, 1.0 );
71 hostMatrix.setElement( rowIdx, rowIdx + gridSize, 1.0 );
72 }
73 }
74 matrix = hostMatrix;
75}
76
77
78template< typename Matrix >
79void setElement_on_device( const int gridSize, Matrix& matrix )
80{
81 /***
82 * Set matrix representing approximation of the Laplace operator on regular
83 * grid using the finite difference method by means of setElement method called
84 * from the native device.
85 */
86 const int matrixSize = gridSize * gridSize;
87 matrix.setDimensions( matrixSize, matrixSize, getOffsets< typename Matrix::DeviceType >( gridSize ) );
88
89 auto matrixView = matrix.getView();
90 auto f = [=] __cuda_callable__ ( int i, int j ) mutable {
91 const int rowIdx = j * gridSize + i;
92 if( i == 0 || j == 0 || i == gridSize - 1 || j == gridSize - 1 )
93 matrixView.setElement( rowIdx, rowIdx, 1.0 );
94 else
95 {
96 matrixView.setElement( rowIdx, rowIdx - gridSize, 1.0 );
97 matrixView.setElement( rowIdx, rowIdx - 1, 1.0 );
98 matrixView.setElement( rowIdx, rowIdx, -4.0 );
99 matrixView.setElement( rowIdx, rowIdx + 1, 1.0 );
100 matrixView.setElement( rowIdx, rowIdx + gridSize, 1.0 );
101 }
102 };
104}
105
106template< typename Matrix >
107void getRow( const int gridSize, Matrix& matrix )
108{
109 /***
110 * Set matrix representing approximation of the Laplace operator on regular
111 * grid using the finite difference method by means of getRow method.
112 */
113 const int matrixSize = gridSize * gridSize;
114 matrix.setDimensions( matrixSize, matrixSize, getOffsets< typename Matrix::DeviceType >( gridSize ) );
115
116 auto matrixView = matrix.getView();
117 auto f = [=] __cuda_callable__ ( int rowIdx ) mutable {
118 const int i = rowIdx % gridSize;
119 const int j = rowIdx / gridSize;
120 auto row = matrixView.getRow( rowIdx );
121 if( i == 0 || j == 0 || i == gridSize - 1 || j == gridSize - 1 )
122 row.setElement( 2, 1.0 );
123 else
124 {
125 row.setElement( 0, 1.0 );
126 row.setElement( 1, 1.0 );
127 row.setElement( 2, -4.0 );
128 row.setElement( 3, 1.0 );
129 row.setElement( 4, 1.0 );
130 }
131 };
133}
134
135template< typename Matrix >
136void forElements( const int gridSize, Matrix& matrix )
137{
138 /***
139 * Set matrix representing approximation of the Laplace operator on regular
140 * grid using the finite difference method by means of forElements method.
141 */
142
143 const int matrixSize = gridSize * gridSize;
144 matrix.setDimensions( matrixSize, matrixSize, getOffsets< typename Matrix::DeviceType >( gridSize ) );
145
146 auto f = [=] __cuda_callable__ ( int rowIdx, int localIdx, int columnIdx, float& value ) mutable {
147 const int i = rowIdx % gridSize;
148 const int j = rowIdx / gridSize;
149 if( ( i == 0 || j == 0 || i == gridSize - 1 || j == gridSize - 1 ) && localIdx == 0 )
150 {
151 columnIdx = rowIdx;
152 value = 1.0;
153 }
154 else
155 {
156 switch( localIdx )
157 {
158 case 0:
159 columnIdx = rowIdx - gridSize;
160 value = 1.0;
161 break;
162 case 1:
163 columnIdx = rowIdx - 1;
164 value = 1.0;
165 break;
166 case 2:
167 columnIdx = rowIdx;
168 value = -4.0;
169 break;
170 case 3:
171 columnIdx = rowIdx + 1;
172 value = 1.0;
173 break;
174 case 4:
175 columnIdx = rowIdx + gridSize;
176 value = 1.0;
177 break;
178 }
179 }
180 };
181 matrix.forElements( 0, matrixSize, f );
182}
183
184template< typename Device >
185void laplaceOperatorMultidiagonalMatrix()
186{
187 std::cout << " Sparse matrix test:" << std::endl;
188 for( int gridSize = 16; gridSize <= 8192; gridSize *= 2 )
189 {
190 std::cout << " Grid size = " << gridSize << std::endl;
191 TNL::Timer timer;
192
193 std::cout << " setElement on host: ";
194 timer.reset();
195 timer.start();
196 for( int i = 0; i < testsCount; i++ )
197 {
199 setElement_on_host( gridSize, matrix );
200 }
201 timer.stop();
202 std::cout << timer.getRealTime() / ( double ) testsCount << " sec." << std::endl;
203
205 {
206 std::cout << " setElement on host and transfer on GPU: ";
207 timer.reset();
208 timer.start();
209 for( int i = 0; i < testsCount; i++ )
210 {
212 setElement_on_host_and_transfer( gridSize, matrix );
213 }
214 timer.stop();
215 std::cout << timer.getRealTime() / ( double ) testsCount << " sec." << std::endl;
216 }
217
218 std::cout << " setElement on device: ";
219 timer.reset();
220 timer.start();
221 for( int i = 0; i < testsCount; i++ )
222 {
224 setElement_on_device( gridSize, matrix );
225 }
226 timer.stop();
227 std::cout << timer.getRealTime() / ( double ) testsCount << " sec." << std::endl;
228
229 std::cout << " getRow: ";
230 timer.reset();
231 timer.start();
232 for( int i = 0; i < testsCount; i++ )
233 {
235 getRow( gridSize, matrix );
236 }
237 timer.stop();
238 std::cout << timer.getRealTime() / ( double ) testsCount << " sec." << std::endl;
239
240 std::cout << " forElements: ";
241 timer.reset();
242 timer.start();
243 for( int i = 0; i < testsCount; i++ )
244 {
246 forElements( gridSize, matrix );
247 }
248 timer.stop();
249 std::cout << timer.getRealTime() / ( double ) testsCount << " sec." << std::endl;
250
251 }
252}
253
254int main( int argc, char* argv[] )
255{
256 std::cout << "Creating Laplace operator matrix on CPU ... " << std::endl;
257 laplaceOperatorMultidiagonalMatrix< TNL::Devices::Host >();
258
259#ifdef HAVE_CUDA
260 std::cout << "Creating Laplace operator matrix on CUDA GPU ... " << std::endl;
261 laplaceOperatorMultidiagonalMatrix< TNL::Devices::Cuda >();
262#endif
263}