TODO: Add description of forRows and sequentialForRows.

Introduction

TNL offers several types of matrices like dense (TNL::Matrices::DenseMatrix), sparse (TNL::Matrices::SparseMatrix), tridiagonal (TNL::Matrices::TridiagonalMatrix), multidiagonal (TNL::Matrices::MultidiagonalMatrix) and lambda matrices (TNL::Matrices::LambdaMatrix). The sparse matrices can be symmetric to lower the memory requirements. The interfaces of given matrix types are designed to be as unified as possible to ensure that the user can easily switch between different matrix types while making no or only a little changes in the source code. All matrix types allows traversing all matrix elements and manipulate them using lambda functions as well as performing flexible reduction in matrix rows. The following text describes particular matrix types and their unified interface in details.

Overview of matrix types

In a lot of numerical algorithms either dense or sparse matrices are used. The dense matrix (TNL::Matrices::DenseMatrix) is such that all or at least most of its matrix elements are nonzero. On the other hand sparse matrix (TNL::Matrices::SparseMatrix) is a matrix which has most of the matrix elements equal to zero. From the implementation point of view, the data structures for the dense matrices allocates all matrix elements while formats for the sparse matrices aim to store explicitly only the nonzero matrix elements. The most popular format for storing the sparse matrices is CSR format. However, especially for better data alignment in memory of GPUs, many other formats were designed. In TNL, the user may choose between several different sparse matrix formats. There are also sparse matrices with specific pattern of the nonzero elements like tridiagonal matrices (TNL::Matrices::TridiagonalMatrix) which "has nonzero elements on the main diagonal, the first diagonal below this, and the first diagonal above the main diagonal only". An example of such matrix may look as follows:

\[\left( \begin{array}{ccccccc} -2 & 1 & . & . & . & . \\ 1 & -2 & 1 & . & . & . \\ . & 1 & -2 & 1 & . & . \\ . & . & 1 & -2 & 1 & . \\ . & . & . & 1 & -2 & 1 \\ . & . & . & . & 1 & -2 \end{array} \right) \]

Similar but more general type of matrices are multidiagonal matrices (TNL::Matrices::MultidiagonalMatrix) which have the nonzero matrix elements positioned only on lines parallel to the main diagonal like the following matrix:

\[ \left( \begin{array}{ccccccc} -4 & 1 & . & 1 & . & . \\ 1 & -4 & 1 & . & 1 & . \\ . & 1 & -4 & 1 & . & 1 \\ 1 & . & 1 & -4 & 1 & . \\ . & 1 & . & 1 & -4 & 1 \\ . & . & 1 & . & 1 & -4 \end{array} \right) \]

Finally, TNL offers so called lambda matrices (TNL::Matrices::LambdaMatrix) which are kind of "matrix-free matrices". They do not store the matrix elements explicitly in the memory, but rather evaluates them on-the-fly based on user defined lambda functions.

In the following table we show comparison of expressing a tridiagonal matrix by means of different matrix types.

Matrix dimensions	Dense elems.	Dense mem.	Sparse elems.	Sparse mem.	Tridiag. elems.	Tridiag. mem.	Multidiag. elems.	Mutlidiag. mem.
10x10	100	800 B	>28	>336 B	30	240 B	30	252 B
100x100	10,000	80 kB	>298	>3,576 B	300	2,400 B	300	2,412 B
1,000x1,000	1,000,000	8 MB	>2,998	>35,976 B	3,000	24,000 B	3,000	24,012 B
10,000x10,000	100,000,000	800 MB	>29,998	>359,976 B	30,000	240,000 B	30,000	240,012 B
100,000x100,000	10,000,000,000	80 GB	>299,998	>3,599,876 B	300,000	2,400,000 B	300,000	2,400,012 B

In the table:

Matrix dimensions is the number of matrix rows and columns
Dense elems. is the number of allocated matrix elements in the dense matrix.
Dense mem. is the allocated memory for the matrix elements in the dense matrix if the elements are stored in the double precision.
Sparse elems. is the number of allocated matrix elements in the sparse matrix. Some formats may allocate padding zeros for better data alignment in the memory and so the number of allocated matrix elements may increase.
Sparse mem. is the allocated memory for the matrix elements in the sparse matrix if the elements are stored in the double precision and column indexes in 32-bit integer.
Tridiag. elems is the number of allocated matrix elements in the tridiagonal matrix.
Tridiag mem. is the allocated memory for the matrix elements in the tridiagonal matrix if the elements are stored in the double precision.
Multidiag. elems is the number of allocated matrix elements in the multidiagonal matrix.
Multidiag mem. is the allocated memory for the matrix elements in the multidiagonal matrix if the elements are stored in the double precision.

Choosing the best matrix type can have tremendous impact on the performance but also on memory requirements. If we would treat each matrix as dense one we would not be able to to work with matrices larger than 50,000x50,000 on common personal computers, because we would need tens of gibabytes of memory. At the same time, we see that the other matrix types can do the same job with only few megabytes. In addition, other matrix types work with much less matrix elements and so operations like matrix-vector multiplication can be done with significantly less operations which means much faster. Since in the modern hardware architectures, the computing units are limited mainly by the performance of the memory chips (so called memory wall), transferring less data from the memory increases the performance even more.

The following table shows the same as the one above but when storing a matrix which has only five nonzero elements in each row. Such matrices arise often from the finite difference method for solution of the partial differential equations:

Matrix dimensions	Dense elems.	Dense mem.	Sparse elems.	Sparse mem.	Multidiag. elems.	Mutlidiag. mem.
10x10	100	800 B	>50	>600 B	50	420 B
100x100	10,000	80 kB	>500	>6,000 B	500	4,020 B
1,000x1,000	1,000,000	8 MB	>5,000	>60,000 B	5,000	40,020 B
10,000x10,000	100,000,000	800 MB	>50,000	>600,000 B	50,000	400,020 B
100,000x100,000	10,000,000,000	80 GB	>500,000	>6,000,000 B	500,000	4,000,020 B

There is no change in the dense matrix part of the table. The numbers grow proportionally in case of sparse and mutlidiagonal matrix. General sparse matrix formats need to store column indexes for each matrix element which is not true for the multidiagonal matrix. The following table shows how many bytes we need for storing of one matrix element with different matrix types depending on the type of the matrix elements (Real) and column indexes (Index):

Real	Index	Dense matrix	Multidiagonal matrix	Sparse matrix	Fill ratio
float	32-bit	4 B	4 B	8 B	<< 50%
float	64-bit	4 B	4 B	12 B	<< 30%
double	32-bit	8 B	8 B	12 B	<< 60%
double	64-bit	8 B	8 B	16 B	<< 50%

In this table:

Real is matrix element type.
Index is column index type.
Dense matrix is number of bytes needed to store one matrix element in the dense matrix.
Multidiagonal matrix is number of bytes needed to store one matrix element in the mutldiagonal matrix.
Sparse matrix is number of bytes needed to store one matrix element in the sparse matrix.
Fill ratio is maximal percentage of the nonzero matrix elements until which the sparse matrix can perform better.

The multidiagonal matrix type is especially suitable for the finite difference method or similar numerical methods for solution of the partial differential equations.

Indexing of nonzero matrix elements in sparse matrices

The sparse matrix formats usually, in the first step, compress the matrix rows by omitting the zero matrix elements as follows

\[\left( \begin{array}{ccccc} 0 & 1 & 0 & 2 & 0 \\ 0 & 0 & 5 & 0 & 0 \\ 4 & 0 & 0 & 0 & 7 \\ 0 & 3 & 0 & 8 & 5 \\ 0 & 5 & 7 & 0 & 0 \end{array} \right) \rightarrow \left( \begin{array}{ccccc} 1 & 2 & . & . & . \\ 5 & . & . & . & . \\ 4 & 7 & . & . & . \\ 3 & 8 & 5 & . & . \\ 5 & 7 & . & . & . \end{array} \right) \]

In such a form, it is more efficient to refer the nonzero matrix elements in given row by their rank in the compressed matrix row rather than by their column index in the original matrix. In methods for the sparse matrices, this parameter is called localIdx. Some sparse matrix formats add padding zeros for better alignment of data in memory. But if this is not the case, the variable localIdx of particular matrix elements would read as:

\[\left( \begin{array}{ccccc} 0 & 1 & . & . & . \\ 0 & . & . & . & . \\ 0 & 1 & . & . & . \\ 0 & 1 & 2 & . & . \\ 0 & 1 & . & . & . \end{array} \right) \]

Matrix view

Matrix views are small reference objects which help accessing the matrix in GPU kernels or lambda functions being executed on GPUs. We describe this in details in section about Shared pointers and views. The problem lies in fact that we cannot pass references to GPU kernels and we do not want to pass there deep copies of matrices. Matrix view is some kind of reference to a matrix. A copy of matrix view is always shallow and so it behaves like a reference. The following example shows how to obtain the matrix view by means of method getView and pass it to a lambda function:

#include <iostream>
#include <TNL/Algorithms/parallelFor.h>
#include <TNL/Matrices/SparseMatrix.h>
#include <TNL/Devices/Host.h>
#include <TNL/Devices/Cuda.h>
 
template< typename Device >
void
getRowExample()
{
   /***
    * Set the following matrix (dots represent zero matrix elements):
    *
    *   /  2  .  .  .  . \
    *   |  1  2  1  .  . |
    *   |  .  1  2  1. . |
    *   |  .  .  1  2  1 |
    *   \  .  .  .  .  2 /
    */
   const int size = 5;
   TNL::Matrices::SparseMatrix< double, Device > matrix( { 1, 3, 3, 3, 1 }, size );
   auto view = matrix.getView();
 
   /***
    * Set the matrix elements.
    */
   auto f = [ = ] __cuda_callable__( int rowIdx ) mutable
   {
      auto row = view.getRow( rowIdx );
      if( rowIdx == 0 )
         row.setElement( 0, rowIdx, 2.0 );  // diagonal element
      else if( rowIdx == size - 1 )
         row.setElement( 0, rowIdx, 2.0 );  // diagonal element
      else {
         row.setElement( 0, rowIdx - 1, 1.0 );  // elements below the diagonal
         row.setElement( 1, rowIdx, 2.0 );      // diagonal element
         row.setElement( 2, rowIdx + 1, 1.0 );  // elements above the diagonal
      }
   };
   TNL::Algorithms::parallelFor< Device >( 0, matrix.getRows(), f );
   std::cout << matrix << std::endl;
}
 
int
main( int argc, char* argv[] )
{
   std::cout << "Getting matrix rows on host: " << std::endl;
   getRowExample< TNL::Devices::Host >();
 
#ifdef __CUDACC__
   std::cout << "Getting matrix rows on CUDA device: " << std::endl;
   getRowExample< TNL::Devices::Cuda >();
#endif
}

Here we create sparse matrix matrix on the line 11, and use the method getView to get the matrix view on the line 12. The view is then used in the lambda function on the line 15 where it is captured by value (see [=] in the definition of the lambda function f on the line 14).

Allocation and setup of different matrix types

There are several ways how to create a new matrix:

Initializer lists allow to create matrix from the C++ initializer lists. The matrix elements must be therefore encoded in the source code and so it is useful for rather smaller matrices. Methods and constructors with initializer lists are user friendly and simple to use. It is a good choice for tool problems with small matrices.
STL map can be used for creation of sparse matrices only. The user first insert all matrix elements together with their coordinates into std::map based on which the sparse matrix is created in the next step. It is simple and user friendly approach suitable for creation of large matrices. An advantage is that we do not need to know the distribution of the nonzero matrix elements in matrix rows in advance like we do in other ways of construction of sparse matrices. This makes the use of STL map suitable for combining of sparse matrices from TNL with other numerical packages. However, the sparse matrix is constructed on the host and then copied on GPU if necessary. Therefore, this approach is not a good choice if fast and efficient matrix construction is required.
Methods setElement and addElement called from the host allow to change particular matrix elements. The methods can be called from host even for matrices allocated on GPU. In this case, however, the matrix elements are transferred on GPU one by one which is very inefficient. If the matrix is allocated on the host system (CPU), the efficiency is nearly optimal. In case of sparse matrices, one must set row capacities (i.e. maximal number of nonzero elements in each row) before using these methods. If the row capacity is exceeded, the matrix has to be reallocated and all matrix elements are lost.
Methods setElement and addElement called on the host and copy matrix on GPU setting particular matrix elements by the methods setElement and addElement when the matrix is allocated on GPU can be time consuming for large matrices. Setting up the matrix on CPU using the same methods and copying it on GPU at once when the setup is finished can be significantly more efficient. A drawback is that we need to allocate temporarily whole matrix on CPU.
Methods setElement and addElement called from native device allow to do efficient matrix elements setup even on devices (GPUs). In this case, the methods must be called from a GPU kernel or a lambda function combined with the parallel for (TNL::Algorithms::parallelFor). The user get very good performance even when manipulating matrix allocated on GPU. On the other hand, only data structures allocated on GPUs can be accessed from the kernel or lambda function. The matrix can be accessed in the GPU kernel or lambda function by means of matrix view (see the previous section) or the shared pointer (see TNL::Pointers::SharedPointer).
Method getRow combined with parallelFor is very similar to the previous one. The difference is that we first fetch helper object called matrix row which is linked to particular matrix row. Using methods of this object, one may change the matrix elements in given matrix row. An advantage is that the access to the matrix row is resolved only once for all elements in the row. In some more sophisticated sparse matrix formats, this can be nontrivial operation and this approach may slightly improve the performance. Another advantage for sparse matrices is that we access the matrix elements based on their local index ('localIdx', see [Indexing of nonzero matrix elements in sparse matrices](indexing_of_nonzero_matrix_elements_in_sparse_matrices)) in the row which is something like a rank of the nonzero element in the row. This is more efficient than addressing the matrix elements by the column indexes which requires searching in the matrix row. So this may significantly improve the performance of setup of sparse matrices. When it comes to dense matrices, there should not be great difference in performance compared to use of the methods setElement and getElement. Note that when the method is called from a GPU kernel or a lambda function, only data structures allocated on GPU can be accessed and the matrix must be made accessible by the means of matrix view.
Methods forRows and forElements this approach is very similar to the previous one but it avoids using parallelFor and necessity of passing the matrix to GPU kernels by matrix view or shared pointers.

The following table shows pros and cons of particular methods:

Method	Efficient	Easy to use	Pros	Cons
Initializer list	`**`	`*****`	Very easy to use.	Only for small matrices.
			Does not need setting of matrix rows capacities
STL map	`**`	`*****`	Very easy to use.	Higher memory requirements.
			Does not need setting of matrix rows capacities	Slow transfer on GPU.
[set,add]Element on host	`***/`	`*****`	Very easy to use.	Requires setting of row capacities.
				Extremely slow transfer on GPU.
[set,and]Element on host&copy on GPU	`***`	`****`	Easy to use.	Requires setting of row capacities.
			Reasonable efficiency.	Allocation of auxiliary matrix on CPU.
[set,add]Element on native device	`****`		Good efficiency.	Requires setting of row capacities.
				Requires writing GPU kernel or lambda function.
				Allows accessing only data allocated on the same device/memory space.
getRow and parallelFor	`*****`	`**`	Best efficiency for sparse matrices.	Requires setting of row capacities.
				Requires writing GPU kernel or lambda function.
				Allows accessing only data allocated on the same device/memory space.
				Use of matrix local indexes can be less intuitive.
forRows, forElements	`*****`	`**`	Best efficiency for sparse matrices.	Requires setting of row capacities.
			Avoid use of matrix view or shared pointer in kernels/lambda function.	Requires writing GPU kernel or lambda function.
				Allows accessing only data allocated on the same device/memory space.
				Use of matrix local indexes is less intuitive.

Though it may seem that the later methods come with more cons than pros, they offer much higher performance and we believe that even they are still user friendly. On the other hand, if the matrix setup performance is not a priority, the use easy-to-use but slow method can still be a good choice. The following tables demonstrate the performance of different methods. The tests were performed with the following setup:


CPU	Intel i9-9900KF, 3.60GHz, 8 cores, 16384 KB cache
GPU	GeForce RTX 2070
g++ version	10.2.0
nvcc version	11.2.67
Precision	single precision

Dense matrix

In the test of dense matrices, we set each matrix element to value equal to rowIdx + columnIdx. The times in seconds obtained on CPU looks as follows:

Matrix rows and columns	`setElement` on host	`setElement` with `parallelFor`	`getRow`	`forElements`
16	0.00000086	0.0000053	0.00000035	0.0000023
32	0.00000278	0.0000050	0.00000201	0.0000074
64	0.00000703	0.0000103	0.00000354	0.0000203
128	0.00002885	0.0000312	0.00000867	0.0000709
256	0.00017543	0.0000439	0.00002490	0.0001054
512	0.00078153	0.0001683	0.00005999	0.0002713
1024	0.00271989	0.0006691	0.00003808	0.0003942
2048	0.01273520	0.0038295	0.00039116	0.0017083
4096	0.08381450	0.0716542	0.00937997	0.0116771
8192	0.51596800	0.3535530	0.03971900	0.0467374

Here:

setElement on host tests run in one thread. Therefore they are faster for small matrices compared to setElement with parallelFor tests.
setElement with parallelFor tests run in parallel in several OpenMP threads. This approach is faster for larger matrices.
getRow tests run in parallel in several OpenMP threads mapping of which is more efficient compared to setElement on host tests.

And the same on GPU is in the following table:

Matrix rows and columns	`setElement` on host	`setElement` on host and copy	`setElement` on GPU	`getRow`	`forElements`
16	0.027835	0.02675	0.000101198	0.00009903	0.000101214
32	0.002776	0.00018	0.000099197	0.00009901	0.000100481
64	0.010791	0.00015	0.000094446	0.00009493	0.000101796
128	0.043014	0.00021	0.000099397	0.00010024	0.000102729
256	0.171029	0.00056	0.000100469	0.00010448	0.000105893
512	0.683627	0.00192	0.000103346	0.00011034	0.000112752
1024	2.736680	0.00687	0.000158805	0.00016932	0.000170302
2048	10.930300	0.02474	0.000509000	0.00050917	0.000511183
4096	43.728700	0.13174	0.001557030	0.00156117	0.001557930
8192	174.923000	0.70602	0.005312470	0.00526658	0.005263870

Here:

setElement on host tests are very slow especially for large matrices since each matrix element is copied on GPU separately.
setElement on host and copy tests are much faster because the matrix is copied from CPU to GPU on the whole which is more efficient.
setElement on GPU tests are even more faster since there is no transfer of data between CPU and GPU.
getRow tests have the same performance as setElement on GPU.
forElements tests have the same performance as both setElement on GPU and getRow.

You can see the source code of the previous benchmark in Appendix.

Sparse matrix

The sparse matrices are tested on computation of matrix the discrete Laplace operator in 2D. This matrix has at most five nonzero elements in each row. The times for sparse matrix (with CSR format) on CPU in seconds look as follows:

Matrix rows and columns	STL Map	`setElement` on host	`setElement` with `parallelFor`	`getRow`	`forElements`
256	0.00016	0.000017	0.000014	0.000013	0.000020
1,024	0.00059	0.000044	0.000021	0.000019	0.000022
4,096	0.00291	0.000130	0.000031	0.000022	0.000031
16,384	0.01414	0.000471	0.000067	0.000031	0.000065
65,536	0.06705	0.001869	0.000218	0.000074	0.000209
262,144	0.31728	0.007436	0.000856	0.000274	0.000799
1,048,576	1.46388	0.027087	0.006162	0.005653	0.005904
4,194,304	7.46147	0.102808	0.028385	0.027925	0.027937
16,777,216	38.95900	0.413823	0.125870	0.124588	0.123858
67,108,864	185.75700	1.652580	0.505232	0.501003	0.500927

Here:

STL Map tests show that use of STL Map can be very slow on large matrices and, of course, they need to allocate the map containing all the matrix elements. This can be memory consuming. On the other hand, it is the only way which does not require knowing the matrix row capacities in advance.
setElement on host tests are much faster compared to STL map, it does not need to allocate anything else except the sparse matrix. However, matrix row capacities must be known in advance.
setElement with parallelFor tests run in parallel in several OpenMP threads and so this can be faster for larger matrices.
getRow tests perform the same as setElement with parallelFor.
forElements tests perform the same as both setElement with parallelFor and forElements.

We see, that the use of STL map makes sense only in situation when it is hard to estimate necessary row capacities. Otherwise very easy setup with setElement method is much faster. If the performance is the highest priority, getRow method should be preferred. The results for GPU are in the following table:

Matrix rows and columns	STL Map	`setElement` on host	`setElement` on host and copy	`setElement` on GPU	`getRow`	`forElements`
256	0.002	0.036	0.0280	0.00017	0.00017	0.00017
1,024	0.001	0.161	0.0006	0.00017	0.00017	0.00017
4,096	0.003	0.680	0.0010	0.00020	0.00020	0.00020
16,384	0.015	2.800	0.0034	0.00021	0.00020	0.00021
65,536	0.074	11.356	0.0130	0.00048	0.00047	0.00048
262,144	0.350	45.745	0.0518	0.00088	0.00087	0.00088
1,048,576	1.630	183.632	0.2057	0.00247	0.00244	0.00245
4,194,304	8.036	735.848	0.8119	0.00794	0.00783	0.00788
16,777,216	41.057	2946.610	3.2198	0.02481	0.02429	0.02211
67,108,864	197.581	11791.601	12.7775	0.07196	0.06329	0.06308

Here:

STL Map tests show that the times are comparable to CPU times which means the most of the time is spent by creating the matrix on CPU.
setElement on host tests are again extremely slow for large matrices. It is even slower than the use of STL map. So in case of GPU, this is another reason for using the STL map.
setElement on host and copy tests are, similar to the dense matrix, much faster compared to the previous approaches. So it is the best way when you need to use data structures available only on the host system (CPU).
setElement on GPU tests exhibit the best performance together with getRow and forElements methods. Note, however, that this method can be slower that getRow and forElements if there would be more nonzero matrix elements in a row.
getRow tests exhibit the best performance together with setElement on GPU and forElements methods.
forElements tests exhibit the best performance together with getRow and setElement on GPU methods.

Here we see, that the setElement methods performs extremely bad because all matrix elements are transferred to GPU one-by-one. Even STL map is much faster. Note, that the times for STL map are not much higher compared to CPU which indicates that the transfer of the matrix on GPU is not dominant. Setup of the matrix on CPU by the means of setElement method and transfer on GPU is even faster. However, the best performance can be obtained only we creating the matrix directly on GPU by methods setElement, getRow and forElements. Note, however, that even if all of them perform the same way, for matrices with more nonzero matrix elements in a row, setElement could be slower compared to the getRow and forElements.

You can see the source code of the previous benchmark in Appendix.

Multidiagonal matrix

Finally, the following tables show the times of the same test performed with multidiagonal matrix. Times on CPU in seconds looks as follows:

Matrix rows and columns	`setElement` on host	`setElement` with `parallelFor`	`getRow`	`forElements`
256	0.000055	0.0000038	0.000004	0.000009
1,024	0.000002	0.0000056	0.000003	0.000006
4,096	0.000087	0.0000130	0.000005	0.000014
16,384	0.000347	0.0000419	0.000010	0.000046
65,536	0.001378	0.0001528	0.000032	0.000177
262,144	0.005504	0.0006025	0.000131	0.000711
1,048,576	0.019392	0.0028773	0.001005	0.003265
4,194,304	0.072078	0.0162378	0.011915	0.018065
16,777,216	0.280085	0.0642682	0.048876	0.072084
67,108,864	1.105120	0.2427610	0.181974	0.272579

Here:

setElement on host tests show that this method is fairly efficient.
setElement with parallelFor tests run in parallel in several OpenMP threads compared to setElement on host tests. For larger matrices, this way of matrix setup performs better.
getRow tests perform more or less the same as setElement with parallelFor and forElements.
forElements tests perform more or less the same as setElement with parallelFor and getRow.

Note, that setup of multidiagonal matrix is faster compared to the same matrix stored in general sparse format. Results for GPU are in the following table:

Matrix rows and columns	`setElement` on host	`setElement` on host and copy	`setElement` on GPU	`getRow`	`forElements`
256	0.035	0.02468	0.000048	0.000045	0.000047
1,024	0.059	0.00015	0.000047	0.000045	0.000047
4,096	0.251	0.00044	0.000048	0.000045	0.000047
16,384	1.030	0.00158	0.000049	0.000046	0.000048
65,536	4.169	0.00619	0.000053	0.000048	0.000052
262,144	16.807	0.02187	0.000216	0.000214	0.000217
1,048,576	67.385	0.08043	0.000630	0.000629	0.000634
4,194,304	270.025	0.31272	0.001939	0.001941	0.001942
16,777,216	1080.741	1.18849	0.003212	0.004185	0.004207
67,108,864	4326.120	4.74481	0.013672	0.022494	0.030369

setElement on host tests are extremely slow again, especially for large matrices.
setElement on host and copy tests are much faster compared to the previous.
setElement with parallelFor tests offer the best performance. They are even faster then getRow and forElements method. This, however, does not have be true for matrices having more nonzero elements in a row.
getRow tests perform more or less the same as forElements. For matrices having more nonzero elements in a row this method could be faster than setElement.
forElements tests perform more or less the same as getRow.

Note that multidiagonal matrix performs better compared to general sparse matrix. One reason for it is the fact, that the multidiagonal type does not store explicitly column indexes of all matrix elements. Because of this, less data need to be transferred from the memory.

You can see the source code of the previous benchmark in Appendix.

In the following parts we will describe how to setup particular matrix types by means of the methods mentioned above.

Dense matrices

Dense matrix (TNL::Matrices::DenseMatrix) is a templated class defined in the namespace TNL::Matrices. It has five template parameters:

Real is a type of the matrix elements. It is double by default.
Device is a device where the matrix shall be allocated. Currently it can be either TNL::Devices::Host for CPU or TNL::Devices::Cuda for CUDA supporting GPUs. It is TNL::Devices::Host by default.
Index is a type to be used for indexing of the matrix elements. It is int by default.
ElementsOrganization defines the organization of the matrix elements in memory. It can be TNL::Algorithms::Segments::ColumnMajorOrder or TNL::Algorithms::Segments::RowMajorOrder for column-major and row-major organization respectively. Be default, it is the row-major order if the matrix is allocated on the host system and column major order if it is allocated on GPU.
RealAllocator is a memory allocator (one from TNL::Allocators) which shall be used for allocation of the matrix elements. By default, it is the default allocator for given Real type and Device type – see TNL::Allocators::Default.

The following examples show how to allocate the dense matrix and how to initialize the matrix elements.

Initializer list

Small matrices can be created simply by the constructor with an initializer list.

#include <iostream>
#include <TNL/Matrices/DenseMatrix.h>
#include <TNL/Devices/Host.h>
 
template< typename Device >
void
initializerListExample()
{
   TNL::Matrices::DenseMatrix< double, Device > matrix{
      // clang-format off
      {  1,  2,  3,  4,  5,  6 },
      {  7,  8,  9, 10, 11, 12 },
      { 13, 14, 15, 16, 17, 18 },
      // clang-format on
   };
 
   std::cout << "General dense matrix: " << std::endl << matrix << std::endl;
 
   TNL::Matrices::DenseMatrix< double, Device > triangularMatrix{
      // clang-format off
      {  1 },
      {  2,  3 },
      {  4,  5,  6 },
      {  7,  8,  9, 10 },
      { 11, 12, 13, 14, 15 },
      // clang-format on
   };
 
   std::cout << "Triangular dense matrix: " << std::endl << triangularMatrix << std::endl;
}
 
int
main( int argc, char* argv[] )
{
   std::cout << "Creating matrices on CPU ... " << std::endl;
   initializerListExample< TNL::Devices::Host >();
 
#ifdef __CUDACC__
   std::cout << "Creating matrices on CUDA GPU ... " << std::endl;
   initializerListExample< TNL::Devices::Cuda >();
#endif
}

In fact, the constructor takes a list of initializer lists. Each embedded list defines one matrix row and so the number of matrix rows is given by the size of the outer initializer list. The number of matrix columns is given by the longest inner initializer lists. Shorter inner lists are filled with zeros from the right side. The result looks as follows:

Creating matrices on CPU ... 
General dense matrix: 
Row: 0 ->     0:1       1:2       2:3       3:4       4:5       5:6   
Row: 1 ->     0:7       1:8       2:9       3:10      4:11      5:12  
Row: 2 ->     0:13      1:14      2:15      3:16      4:17      5:18  
Triangular dense matrix: 
Row: 0 ->     0:1       1:0       2:0       3:0       4:0   
Row: 1 ->     0:2       1:3       2:0       3:0       4:0   
Row: 2 ->     0:4       1:5       2:6       3:0       4:0   
Row: 3 ->     0:7       1:8       2:9       3:10      4:0   
Row: 4 ->     0:11      1:12      2:13      3:14      4:15  
Creating matrices on CUDA GPU ... 
General dense matrix: 
Row: 0 ->     0:1       1:2       2:3       3:4       4:5       5:6   
Row: 1 ->     0:7       1:8       2:9       3:10      4:11      5:12  
Row: 2 ->     0:13      1:14      2:15      3:16      4:17      5:18  
Triangular dense matrix: 
Row: 0 ->     0:1       1:0       2:0       3:0       4:0   
Row: 1 ->     0:2       1:3       2:0       3:0       4:0   
Row: 2 ->     0:4       1:5       2:6       3:0       4:0   
Row: 3 ->     0:7       1:8       2:9       3:10      4:0   
Row: 4 ->     0:11      1:12      2:13      3:14      4:15  

Methods `setElement` and `addElement`

Larger matrices can be setup with methods setElement and addElement (TNL::Matrices::DenseMatrix::setElement, TNL::Matrices::DenseMatrix::addElement). The following example shows how to call these methods from the host.

#include <iostream>
#include <TNL/Matrices/DenseMatrix.h>
#include <TNL/Devices/Host.h>
 
template< typename Device >
void
addElements()
{
   TNL::Matrices::DenseMatrix< double, Device > matrix( 5, 5 );
   auto matrixView = matrix.getView();
 
   for( int i = 0; i < 5; i++ )
      matrixView.setElement( i, i, i );  // or matrix.setElement
 
   std::cout << "Initial matrix is: " << std::endl << matrix << std::endl;
 
   for( int i = 0; i < 5; i++ )
      for( int j = 0; j < 5; j++ )
         matrixView.addElement( i, j, 1.0, 5.0 );  // or matrix.addElement
 
   std::cout << "Matrix after addition is: " << std::endl << matrix << std::endl;
}
 
int
main( int argc, char* argv[] )
{
   std::cout << "Add elements on host:" << std::endl;
   addElements< TNL::Devices::Host >();
 
#ifdef __CUDACC__
   std::cout << "Add elements on CUDA device:" << std::endl;
   addElements< TNL::Devices::Cuda >();
#endif
}

As we can see, both methods can be called from the host no matter where the matrix is allocated. If it is on GPU, each call of setElement or addElement (TNL::Matrices::DenseMatrix::setElement, TNL::Matrices::DenseMatrix::addElement) causes slow transfer of tha data between CPU and GPU. Use this approach only if the performance is not a priority. The result looks as follows:

Add elements on host:
Initial matrix is: 
Row: 0 ->     0:0       1:0       2:0       3:0       4:0   
Row: 1 ->     0:0       1:1       2:0       3:0       4:0   
Row: 2 ->     0:0       1:0       2:2       3:0       4:0   
Row: 3 ->     0:0       1:0       2:0       3:3       4:0   
Row: 4 ->     0:0       1:0       2:0       3:0       4:4   
Matrix after addition is: 
Row: 0 ->     0:1       1:1       2:1       3:1       4:1   
Row: 1 ->     0:1       1:6       2:1       3:1       4:1   
Row: 2 ->     0:1       1:1       2:11      3:1       4:1   
Row: 3 ->     0:1       1:1       2:1       3:16      4:1   
Row: 4 ->     0:1       1:1       2:1       3:1       4:21  
Add elements on CUDA device:
Initial matrix is: 
Row: 0 ->     0:0       1:0       2:0       3:0       4:0   
Row: 1 ->     0:0       1:1       2:0       3:0       4:0   
Row: 2 ->     0:0       1:0       2:2       3:0       4:0   
Row: 3 ->     0:0       1:0       2:0       3:3       4:0   
Row: 4 ->     0:0       1:0       2:0       3:0       4:4   
Matrix after addition is: 
Row: 0 ->     0:1       1:1       2:1       3:1       4:1   
Row: 1 ->     0:1       1:6       2:1       3:1       4:1   
Row: 2 ->     0:1       1:1       2:11      3:1       4:1   
Row: 3 ->     0:1       1:1       2:1       3:16      4:1   
Row: 4 ->     0:1       1:1       2:1       3:1       4:21  

More efficient way of the matrix initialization on GPU consists of calling the methods setElement and addElement (TNL::Matrices::DenseMatrix::setElement, TNL::Matrices::DenseMatrix::addElement) directly from GPU, for example by means of lambda function and parallelFor. It is demonstrated in the following example (of course it works even on CPU):

#include <iostream>
#include <TNL/Algorithms/parallelFor.h>
#include <TNL/Containers/StaticArray.h>
#include <TNL/Matrices/DenseMatrix.h>
#include <TNL/Devices/Host.h>
 
template< typename Device >
void
setElements()
{
   TNL::Matrices::DenseMatrix< double, Device > matrix( 5, 5 );
   auto matrixView = matrix.getView();
   for( int i = 0; i < 5; i++ )
      matrixView.setElement( i, i, i );  // or matrix.setElement
 
   std::cout << "Matrix set from the host:" << std::endl;
   std::cout << matrix << std::endl;
 
   auto f = [ = ] __cuda_callable__( const TNL::Containers::StaticArray< 2, int >& i ) mutable
   {
      matrixView.addElement( i[ 0 ], i[ 1 ], 5.0 );
   };
   TNL::Containers::StaticArray< 2, int > begin = { 0, 0 };
   TNL::Containers::StaticArray< 2, int > end = { 5, 5 };
   TNL::Algorithms::parallelFor< Device >( begin, end, f );
 
   std::cout << "Matrix set from its native device:" << std::endl;
   std::cout << matrix << std::endl;
}
 
int
main( int argc, char* argv[] )
{
   std::cout << "Set elements on host:" << std::endl;
   setElements< TNL::Devices::Host >();
 
#ifdef __CUDACC__
   std::cout << "Set elements on CUDA device:" << std::endl;
   setElements< TNL::Devices::Cuda >();
#endif
}

Here we get the matrix view (TNL::Matrices::DenseMatrixView) (line 10) to make the matrix accessible in lambda function even on GPU (see Shared pointers and views ). We first call the setElement method from CPU to set the i-th diagonal element to i (lines 11-12). Next we iterate over the matrix rows with parallelFor (line 20) and for each row we call the lambda function f. This is done on the same device where the matrix is allocated and so it we get optimal performance even for matrices on GPU. In the lambda function we add one to each matrix element (line 18). The result looks as follows:

Set elements on host:
Matrix set from the host:
Row: 0 ->     0:0       1:0       2:0       3:0       4:0   
Row: 1 ->     0:0       1:1       2:0       3:0       4:0   
Row: 2 ->     0:0       1:0       2:2       3:0       4:0   
Row: 3 ->     0:0       1:0       2:0       3:3       4:0   
Row: 4 ->     0:0       1:0       2:0       3:0       4:4   
Matrix set from its native device:
Row: 0 ->     0:5       1:5       2:5       3:5       4:5   
Row: 1 ->     0:5       1:6       2:5       3:5       4:5   
Row: 2 ->     0:5       1:5       2:7       3:5       4:5   
Row: 3 ->     0:5       1:5       2:5       3:8       4:5   
Row: 4 ->     0:5       1:5       2:5       3:5       4:9   
Set elements on CUDA device:
Matrix set from the host:
Row: 0 ->     0:0       1:0       2:0       3:0       4:0   
Row: 1 ->     0:0       1:1       2:0       3:0       4:0   
Row: 2 ->     0:0       1:0       2:2       3:0       4:0   
Row: 3 ->     0:0       1:0       2:0       3:3       4:0   
Row: 4 ->     0:0       1:0       2:0       3:0       4:4   
Matrix set from its native device:
Row: 0 ->     0:5       1:5       2:5       3:5       4:5   
Row: 1 ->     0:5       1:6       2:5       3:5       4:5   
Row: 2 ->     0:5       1:5       2:7       3:5       4:5   
Row: 3 ->     0:5       1:5       2:5       3:8       4:5   
Row: 4 ->     0:5       1:5       2:5       3:5       4:9   

Method `getRow`

This method is available for the dense matrix (TNL::Matrices::DenseMatrix::getRow) mainly for two reasons:

The method getRow is recommended for sparse matrices. In most cases, it is not optimal for dense matrices. However, if one needs to have one code for both dense and sparse matrices, this method is a good choice.
In general, use of setElement (TNL::Matrices::DenseMatrix::setElement) combined with parallelFor is preferred, for dense matrices, since it offers more parallelism for GPUs. parallelFor creates one CUDA thread per each matrix element which is desirable for GPUs. With the use of the method getRow we have only one CUDA thread per each matrix row. This makes sense only in situation when we need to setup each matrix row sequentially.

Here we show an example:

#include <iostream>
#include <TNL/Algorithms/parallelFor.h>
#include <TNL/Matrices/DenseMatrix.h>
#include <TNL/Devices/Host.h>
#include <TNL/Devices/Cuda.h>
 
template< typename Device >
void
getRowExample()
{
   const int size = 5;
   TNL::Matrices::DenseMatrix< double, Device > matrix( size, size );
 
   /***
    * Create dense matrix view which can be captured by the following lambda
    * function.
    */
   auto matrixView = matrix.getView();
 
   auto f = [ = ] __cuda_callable__( int rowIdx ) mutable
   {
      auto row = matrixView.getRow( rowIdx );
      if( rowIdx > 0 )
         row.setValue( rowIdx - 1, -1.0 );
      row.setValue( rowIdx, rowIdx + 1.0 );
      if( rowIdx < size - 1 )
         row.setValue( rowIdx + 1, -1.0 );
   };
 
   /***
    * Set the matrix elements.
    */
   TNL::Algorithms::parallelFor< Device >( 0, matrix.getRows(), f );
   std::cout << matrix << std::endl;
}
 
int
main( int argc, char* argv[] )
{
   std::cout << "Getting matrix rows on host: " << std::endl;
   getRowExample< TNL::Devices::Host >();
 
#ifdef __CUDACC__
   std::cout << "Getting matrix rows on CUDA device: " << std::endl;
   getRowExample< TNL::Devices::Cuda >();
#endif
}

Here we create the matrix on the line 10 and get the matrix view on the line 16. Next we use parallelFor (line 31) to iterate over the matrix rows and call the lambda function f (lines 19-26) for each of them. In the lambda function, we first fetch the matrix row by means of the method getRow (TNL::Matrices::DenseMatrixView::getRow) and next we set the matrix elements by using the method setElement of the matrix row (TNL::Matrices::DenseMatrixRowView::setElement). For the compatibility with the sparse matrices, use the variant of setElement with the parameter localIdx. It has no effect here, it is only for compatibility of the interface.

The result looks as follows:

Getting matrix rows on host: 
Row: 0 ->     0:1       1:-1      2:0       3:0       4:0   
Row: 1 ->     0:-1      1:2       2:-1      3:0       4:0   
Row: 2 ->     0:0       1:-1      2:3       3:-1      4:0   
Row: 3 ->     0:0       1:0       2:-1      3:4       4:-1  
Row: 4 ->     0:0       1:0       2:0       3:-1      4:5   
Getting matrix rows on CUDA device: 
Row: 0 ->     0:1       1:-1      2:0       3:0       4:0   
Row: 1 ->     0:-1      1:2       2:-1      3:0       4:0   
Row: 2 ->     0:0       1:-1      2:3       3:-1      4:0   
Row: 3 ->     0:0       1:0       2:-1      3:4       4:-1  
Row: 4 ->     0:0       1:0       2:0       3:-1      4:5   

Method `forRows`

This method iterates in parallel over all matrix rows. In fact, it combines TNL::Algorithms::parallelFor and TNL::Matrices::DenseMatrix::getRow method in one. See the following example. It is even a bit simpler compared to the previous one:

#include <iostream>
#include <TNL/Matrices/DenseMatrix.h>
#include <TNL/Devices/Host.h>
#include <TNL/Devices/Cuda.h>
 
template< typename Device >
void
forRowsExample()
{
   using MatrixType = TNL::Matrices::DenseMatrix< double, Device >;
   using RowView = typename MatrixType::RowView;
   const int size = 5;
   MatrixType matrix( size, size );
   auto view = matrix.getView();
 
   /***
    * Set the matrix elements.
    */
   auto f = [] __cuda_callable__( RowView & row )
   {
      const int& rowIdx = row.getRowIndex();
      if( rowIdx > 0 )
         row.setValue( rowIdx - 1, -1.0 );
      row.setValue( rowIdx, rowIdx + 1.0 );
      if( rowIdx < size - 1 )
         row.setValue( rowIdx + 1, -1.0 );
   };
   view.forAllRows( f );  // or matrix.forAllRows
   std::cout << matrix << std::endl;
 
   /***
    * Now divide each matrix row by its largest element - with the use of iterators.
    */
   view.forAllRows(
      [] __cuda_callable__( RowView & row )
      {
         double largest = std::numeric_limits< double >::lowest();
         for( auto element : row )
            largest = TNL::max( largest, element.value() );
         for( auto element : row )
            element.value() /= largest;
      } );
   std::cout << "Divide each matrix row by its largest element... " << std::endl;
   std::cout << matrix << std::endl;
}
 
int
main( int argc, char* argv[] )
{
   std::cout << "Getting matrix rows on host: " << std::endl;
   forRowsExample< TNL::Devices::Host >();
 
#ifdef __CUDACC__
   std::cout << "Getting matrix rows on CUDA device: " << std::endl;
   forRowsExample< TNL::Devices::Cuda >();
#endif
}

The lambda function f, which is called for each matrix row (lines 18-25), have to accept parameter row with type RowView. This type is defined inside each TNL matrix and in the case of the dense matrix, it is TNL::Matrices::DenseMatrixRowView. We use the method TNL::Matrices::DenseMatrixRowView::getRowIndex to get the index of the matrix row being currently processed and method TNL::Matrices::DenseMatrixRowView::setElement which sets the value of the element with given column index (the first parameter).

Next, on the lines 32-38, we call another lambda function which firstly find the largest element in each row (lines 33-35) and then it divides the matrix row by its value (lines 36-37).

The result looks as follows:

Getting matrix rows on host: 
Row: 0 ->     0:1       1:-1      2:0       3:0       4:0   
Row: 1 ->     0:-1      1:2       2:-1      3:0       4:0   
Row: 2 ->     0:0       1:-1      2:3       3:-1      4:0   
Row: 3 ->     0:0       1:0       2:-1      3:4       4:-1  
Row: 4 ->     0:0       1:0       2:0       3:-1      4:5   
Divide each matrix row by its largest element... 
Row: 0 ->     0:1       1:-1      2:0       3:0       4:0   
Row: 1 ->     0:-0.5    1:1       2:-0.5    3:0       4:0   
Row: 2 ->     0:0      1:-0.333333    2:1      3:-0.333333    4:0   
Row: 3 ->     0:0       1:0      2:-0.25    3:1      4:-0.25
Row: 4 ->     0:0       1:0       2:0       3:-0.2    4:1   
Getting matrix rows on CUDA device: 
Row: 0 ->     0:1       1:-1      2:0       3:0       4:0   
Row: 1 ->     0:-1      1:2       2:-1      3:0       4:0   
Row: 2 ->     0:0       1:-1      2:3       3:-1      4:0   
Row: 3 ->     0:0       1:0       2:-1      3:4       4:-1  
Row: 4 ->     0:0       1:0       2:0       3:-1      4:5   
Divide each matrix row by its largest element... 
Row: 0 ->     0:1       1:-1      2:0       3:0       4:0   
Row: 1 ->     0:-0.5    1:1       2:-0.5    3:0       4:0   
Row: 2 ->     0:0      1:-0.333333    2:1      3:-0.333333    4:0   
Row: 3 ->     0:0       1:0      2:-0.25    3:1      4:-0.25
Row: 4 ->     0:0       1:0       2:0       3:-0.2    4:1   

Method `forElements`

The next example demonstrates the method forElements (TNL::Matrices::DenseMatrix::forElements) which works in very similar way as the method getRow but it is slightly easier to use. It is also compatible with sparse matrices. See the following example:

#include <iostream>
#include <TNL/Matrices/DenseMatrix.h>
#include <TNL/Devices/Host.h>
#include <TNL/Devices/Cuda.h>
 
template< typename Device >
void
forElementsExample()
{
   TNL::Matrices::DenseMatrix< double, Device > matrix( 5, 5 );
   auto matrixView = matrix.getView();
 
   auto f = [] __cuda_callable__( int rowIdx, int columnIdx, int globalIdx, double& value )
   {
      if( columnIdx <= rowIdx )
         value = rowIdx + columnIdx;
   };
 
   matrixView.forElements( 0, matrix.getRows(), f );  // or matrix.forElements
   std::cout << matrix << std::endl;
}
 
int
main( int argc, char* argv[] )
{
   std::cout << "Creating matrix on host: " << std::endl;
   forElementsExample< TNL::Devices::Host >();
 
#ifdef __CUDACC__
   std::cout << "Creating matrix on CUDA device: " << std::endl;
   forElementsExample< TNL::Devices::Cuda >();
#endif
}

We do not need any matrix view and instead of calling parallelFor we call just the method forElements (line 18). The lambda function f (line 11) must accept the following parameters:

rowIdx is the row index of given matrix element.
columnIdx is the column index of given matrix element.
value is a reference on the matrix element value and so by changing this value we can modify the matrix element.

The result looks as follows:

Creating matrix on host: 
Row: 0 ->     0:0       1:0       2:0       3:0       4:0   
Row: 1 ->     0:1       1:2       2:0       3:0       4:0   
Row: 2 ->     0:2       1:3       2:4       3:0       4:0   
Row: 3 ->     0:3       1:4       2:5       3:6       4:0   
Row: 4 ->     0:4       1:5       2:6       3:7       4:8   
Creating matrix on CUDA device: 
Row: 0 ->     0:0       1:0       2:0       3:0       4:0   
Row: 1 ->     0:1       1:2       2:0       3:0       4:0   
Row: 2 ->     0:2       1:3       2:4       3:0       4:0   
Row: 3 ->     0:3       1:4       2:5       3:6       4:0   
Row: 4 ->     0:4       1:5       2:6       3:7       4:8   

Wrapping existing data to dense matrix view

In case when you have already allocated data for dense matrix (for example in some other library), you may wrap it to dense matrix view with a function TNL::Matrices::wrapDenseMatrix . See the following example:

#include <iostream>
#include <TNL/Matrices/DenseMatrix.h>
#include <TNL/Matrices/MatrixWrapping.h>
#include <TNL/Devices/Host.h>
#include <TNL/Devices/Cuda.h>
 
template< typename Device >
void
wrapMatrixView()
{
   const int rows( 3 ), columns( 4 );
   TNL::Containers::Vector< double, Device > valuesVector{
      // clang-format off
      1,  2,  3,  4,
      5,  6,  7,  8,
      9, 10, 11, 12
      // clang-format on
   };
   double* values = valuesVector.getData();
 
   /***
    * Wrap the array `values` to dense matrix view
    */
   auto matrix = TNL::Matrices::wrapDenseMatrix< Device >( rows, columns, values );
   std::cout << "Matrix reads as: " << std::endl << matrix << std::endl;
}
 
int
main( int argc, char* argv[] )
{
   std::cout << "Wraping matrix view on host: " << std::endl;
   wrapMatrixView< TNL::Devices::Host >();
 
#ifdef __CUDACC__
   std::cout << "Wraping matrix view on CUDA device: " << std::endl;
   wrapMatrixView< TNL::Devices::Cuda >();
#endif
}

Here we create dense matrix having three rows and four columns. We use TNL vector (TNL::Containers::Vector) only for allocation of the matrix elements (lines 12-15) and we get a pointer to the allocated array immediately (line 16). Next we use just the array to get dense matrix view with proper matrix dimensions (line 21). Note that we must explicitly state the device type as a template parameter of the function wrapDenseMatrix (TNL::Matrices::wrapDenseMatrix). Finally, we print the matrix to see if it is correct (line 22). The result looks as follows:

Wraping matrix view on host: 
Matrix reads as: 
Row: 0 ->     0:1       1:2       2:3       3:4   
Row: 1 ->     0:5       1:6       2:7       3:8   
Row: 2 ->     0:9       1:10      2:11      3:12  
Wraping matrix view on CUDA device: 
Matrix reads as: 
Row: 0 ->     0:1       1:4       2:7       3:10  
Row: 1 ->     0:2       1:5       2:8       3:11  
Row: 2 ->     0:3       1:6       2:9       3:12  

Sparse matrices

Sparse matrices are extremely important in a lot of numerical algorithms. They are used at situations when we need to operate with matrices having majority of the matrix elements equal to zero. In this case, only the non-zero matrix elements are stored with possibly some padding zeros used for memory alignment. This is necessary mainly on GPUs. See the Overview of matrix types for the differences in memory requirements.

Major disadvantage of sparse matrices is that there are a lot of different formats for their storage in memory. Though CSR (Compressed Sparse Row) format is the most popular of all, especially for GPUs, there are many other formats. Often their performance differ significantly for various matrices. So it is a good idea to test several sparse matrix formats if you want to get the best performance. In TNL, there is one templated class TNL::Matrices::SparseMatrix representing general sparse matrices. The change of underlying matrix format can be done just by changing one template parameter. The list of the template parameters is as follows:

Real is type if the matrix elements. It is double by default.
Device is a device where the matrix is allocated. Currently it can be either TNL::Devices::Host for CPU or TNL::Devices::Cuda for CUDA supporting GPUs. It is TNL::Devices::Host by default.
Index is a type to be used for indexing of the matrix elements. It is int by default.
MatrixType tells if the matrix is symmetric (TNL::Matrices::SymmetricMatrix) or general (TNL::Matrices::GeneralMatrix). It is a TNL::Matrices::GeneralMatrix by default.
Segments define the format of the sparse matrix. It can be one of the following (by default, it is TNL::Algorithms::Segments::CSR):
- TNL::Algorithms::Segments::CSR for CSR format.
- TNL::Algorithms::Segments::Ellpack for Ellpack format.
- TNL::Algorithms::Segments::SlicedEllpack for SlicedEllpack format which was also presented as Row-grouped CSR format.
- TNL::Algorithms::Segments::ChunkedEllpack for ChunkedEllpack format which we reffered as Improved Row-grouped CSR and we renamed it to Ellpack format since it uses padding zeros.
- TNL::Algorithms::Segments::BiEllpack for BiEllpack format.
ComputeReal is type which is used for internal computations. By default it is the same as Real if Real is not bool. If Real is bool, ComputeReal is set to Index type. This can be changed, of course, by the user.
RealAllocator is a memory allocator (one from TNL::Allocators) which shall be used for allocation of the matrix elements. By default, it is the default allocator for given Real type and Device type – see TNL::Allocators::Default.
IndexAllocator is a memory allocator (one from TNL::Allocators) which shall be used for allocation of the column indexes of the matrix elements. By default, it is the default allocator for given Index type and Device type – see TNL::Allocators::Default.

If Real is set to bool, we get a binary matrix for which the non-zero elements can be equal only to one and so the matrix elements values are not stored explicitly in the memory. This can significantly reduce the memory requirements and also increase performance.

In the following text we will show how to create and setup sparse matrices.

Setting of row capacities

Larger sparse matrices are created in two steps:

We use a method TNL::Matrices::SparseMatrix::setRowCapacities to initialize the underlying matrix format and to allocate memory for the matrix elements. This method only needs to know how many non-zero elements are supposed to be in each row. Once this is set, it cannot be changed only by resetting the whole matrix. In most situations, it is not an issue to compute the number of nonzero elements in each row. Otherwise, we can currently only recommend the use of matrix setup with STL map, which is, however, quite slow.
Now, the nonzero matrix elements can be set one after another by telling its coordinates and a value. Since majority of sparse matrix formats are designed to allow quick access to particular matrix rows the insertion can be done in parallel by mapping different threads to different matrix rows. This approach is usually optimal or nearly optimal when it comes to efficiency.

See the following example which creates lower triangular matrix like this one

\[\left( \begin{array}{ccccc} 1 & 0 & 0 & 0 & 0 \\ 2 & 1 & 0 & 0 & 0 \\ 3 & 2 & 1 & 0 & 0 \\ 4 & 3 & 2 & 1 & 0 \\ 5 & 4 & 3 & 2 & 1 \end{array} \right). \]

#include <iostream>
#include <TNL/Containers/Vector.h>
#include <TNL/Matrices/SparseMatrix.h>
#include <TNL/Devices/Host.h>
#include <TNL/Devices/Cuda.h>
 
template< typename Device >
void
setRowCapacitiesExample()
{
   TNL::Matrices::SparseMatrix< double, Device > matrix( 5, 5 );
   TNL::Containers::Vector< int, Device > rowCapacities{ 1, 2, 3, 4, 5 };
   matrix.setRowCapacities( rowCapacities );
   for( int row = 0; row < 5; row++ )
      for( int column = 0; column <= row; column++ )
         matrix.setElement( row, column, row - column + 1 );
 
   std::cout << matrix << std::endl;
}
 
int
main( int argc, char* argv[] )
{
   std::cout << "Creating matrix on CPU ... " << std::endl;
   setRowCapacitiesExample< TNL::Devices::Host >();
 
#ifdef __CUDACC__
   std::cout << "Creating matrix on CUDA GPU ... " << std::endl;
   setRowCapacitiesExample< TNL::Devices::Cuda >();
#endif
}

The method TNL::Matrices::SparseMatrix::setRowCapacities reads the required capacities of the matrix rows from a vector (or simmilar container - TNL::Containers::Array, TNL::Containers::ArrayView, TNL::Containers::Vector and TNL::Containers::VectorView) which has the same number of elements as the number of matrix rows and each element defines the capacity of the related row. The result looks as follows:

Creating matrix on CPU ... 
Row: 0 ->     0:1   
Row: 1 ->     0:2       1:1   
Row: 2 ->     0:3       1:2       2:1   
Row: 3 ->     0:4       1:3       2:2       3:1   
Row: 4 ->     0:5       1:4       2:3       3:2       4:1   
 
Creating matrix on CUDA GPU ... 
Row: 0 ->     0:1   
Row: 1 ->     0:2       1:1   
Row: 2 ->     0:3       1:2       2:1   
Row: 3 ->     0:4       1:3       2:2       3:1   
Row: 4 ->     0:5       1:4       2:3       3:2       4:1   
 

There are constructors which also set the row capacities. The first one uses a vector:

#include <iostream>
#include <TNL/Matrices/SparseMatrix.h>
#include <TNL/Containers/Vector.h>
#include <TNL/Devices/Host.h>
 
template< typename Device >
void
initializerListExample()
{
   TNL::Containers::Vector< int, Device > rowCapacities{ 1, 2, 3, 4, 5 };
   TNL::Matrices::SparseMatrix< double, Device > matrix{ rowCapacities,  // row capacities
                                                         6 };            // number of matrix columns
 
   for( int row = 0; row < matrix.getRows(); row++ )
      for( int column = 0; column <= row; column++ )
         matrix.setElement( row, column, row - column + 1 );
   std::cout << "General sparse matrix: " << std::endl << matrix << std::endl;
}
 
int
main( int argc, char* argv[] )
{
   std::cout << "Creating matrices on CPU ... " << std::endl;
   initializerListExample< TNL::Devices::Host >();
 
#ifdef __CUDACC__
   std::cout << "Creating matrices on CUDA GPU ... " << std::endl;
   initializerListExample< TNL::Devices::Cuda >();
#endif
}

The second one uses an initializer list:

#include <iostream>
#include <TNL/Matrices/SparseMatrix.h>
#include <TNL/Devices/Host.h>
 
template< typename Device >
void
initializerListExample()
{
   TNL::Matrices::SparseMatrix< double, Device > matrix{ { 1, 2, 3, 4, 5 },  // row capacities
                                                         6 };                // number of matrix columns
 
   for( int row = 0; row < matrix.getRows(); row++ )
      for( int column = 0; column <= row; column++ )
         matrix.setElement( row, column, row - column + 1 );
   std::cout << "General sparse matrix: " << std::endl << matrix << std::endl;
}
 
int
main( int argc, char* argv[] )
{
   std::cout << "Creating matrices on CPU ... " << std::endl;
   initializerListExample< TNL::Devices::Host >();
 
#ifdef __CUDACC__
   std::cout << "Creating matrices on CUDA GPU ... " << std::endl;
   initializerListExample< TNL::Devices::Cuda >();
#endif
}

The result of both examples looks as follows:

Creating matrices on CPU ... 
General sparse matrix: 
Row: 0 ->     0:1   
Row: 1 ->     0:2       1:1   
Row: 2 ->     0:3       1:2       2:1   
Row: 3 ->     0:4       1:3       2:2       3:1   
Row: 4 ->     0:5       1:4       2:3       3:2       4:1   
 
Creating matrices on CUDA GPU ... 
General sparse matrix: 
Row: 0 ->     0:1   
Row: 1 ->     0:2       1:1   
Row: 2 ->     0:3       1:2       2:1   
Row: 3 ->     0:4       1:3       2:2       3:1   
Row: 4 ->     0:5       1:4       2:3       3:2       4:1   
 

Initializer list

Small matrices can be initialized by a constructor with an initializer list. We assume having the following sparse matrix

\[\left( \begin{array}{ccccc} 1 & 0 & 0 & 0 & 0 \\ -1 & 2 & -1 & 0 & 0 \\ 0 & -1 & 2 & -1 & 0 \\ 0 & 0 & -1 & 2 & -1 \\ 0 & 0 & 0 & -1 & 0 \end{array} \right). \]

It can be created with the initializer list constructor like we shows in the following example:

#include <iostream>
#include <TNL/Matrices/SparseMatrix.h>
#include <TNL/Devices/Host.h>
 
template< typename Device >
void
initializerListExample()
{
   TNL::Matrices::SparseMatrix< double, Device > matrix(
      // number of matrix rows
      5,
      // number of matrix columns
      5,
      // matrix elements definition
      {
         // clang-format off
         { 0, 0,  2.0 },
         { 1, 0, -1.0 }, { 1, 1, 2.0 }, { 1, 2, -1.0 },
         { 2, 1, -1.0 }, { 2, 2, 2.0 }, { 2, 3, -1.0 },
         { 3, 2, -1.0 }, { 3, 3, 2.0 }, { 3, 4, -1.0 },
         { 4, 4,  2.0 },
         // clang-format on
      } );
 
   std::cout << "General sparse matrix: " << std::endl << matrix << std::endl;
}
 
int
main( int argc, char* argv[] )
{
   std::cout << "Creating matrices on CPU ... " << std::endl;
   initializerListExample< TNL::Devices::Host >();
 
#ifdef __CUDACC__
   std::cout << "Creating matrices on CUDA GPU ... " << std::endl;
   initializerListExample< TNL::Devices::Cuda >();
#endif
}

The constructor accepts the following parameters (lines 9-17):

rows is a number of matrix rows.
columns is a number of matrix columns.
data is definition of nonzero matrix elements. It is a initializer list of triples having a form { row_index, column_index, value }. In fact, it is very much like the Coordinate format - COO.

The constructor also accepts Real and Index allocators (TNL::Allocators) but the default ones are used in this example. A method setElements (TNL::Matrices::SparseMatrix::setElements) works the same way:

#include <iostream>
#include <TNL/Matrices/SparseMatrix.h>
#include <TNL/Devices/Host.h>
#include <TNL/Devices/Cuda.h>
 
template< typename Device >
void
setElementsExample()
{
   TNL::Matrices::SparseMatrix< double, Device > matrix( 5, 5 );  // matrix dimensions
   // matrix elements definition
   matrix.setElements( {
      // clang-format off
      { 0, 0,  2.0 },
      { 1, 0, -1.0 }, { 1, 1, 2.0 }, { 1, 2, -1.0 },
      { 2, 1, -1.0 }, { 2, 2, 2.0 }, { 2, 3, -1.0 },
      { 3, 2, -1.0 }, { 3, 3, 2.0 }, { 3, 4, -1.0 },
      { 4, 4,  2.0 },
      // clang-format on
   } );
 
   std::cout << matrix << std::endl;
}
 
int
main( int argc, char* argv[] )
{
   std::cout << "Setting matrix elements on host: " << std::endl;
   setElementsExample< TNL::Devices::Host >();
 
#ifdef __CUDACC__
   std::cout << "Setting matrix elements on CUDA device: " << std::endl;
   setElementsExample< TNL::Devices::Cuda >();
#endif
}

In this example, we create the matrix in two steps. Firstly we use constructor with only matrix dimensions as parameters (line 9) and next we set the matrix elements by setElements method (lines 10-15). The result of both examples looks as follows:

Creating matrices on CPU ... 
General sparse matrix: 
Row: 0 ->     0:2   
Row: 1 ->     0:-1      1:2       2:-1  
Row: 2 ->     1:-1      2:2       3:-1  
Row: 3 ->     2:-1      3:2       4:-1  
Row: 4 ->     4:2   
 
Creating matrices on CUDA GPU ... 
General sparse matrix: 
Row: 0 ->     0:2   
Row: 1 ->     0:-1      1:2       2:-1  
Row: 2 ->     1:-1      2:2       3:-1  
Row: 3 ->     2:-1      3:2       4:-1  
Row: 4 ->     4:2   
 

STL map

The constructor which creates the sparse matrix from std::map is useful especially in situations when you cannot estimate the matrix row capacities in advance. You can first store the matrix elements in std::map data structure in a COO format manner. It means that each entry of the map is the following pair:

std::pair( std::pair( row_index, column_index ), element_value )

which defines one matrix element at given coordinates (row_index,column_index) with given value (element_value). Of course, you can insert such entries into the map in arbitrary order. When it is complete, you pass the map to the sparse matrix. See the following example:

#include <iostream>
#include <map>
#include <utility>
#include <TNL/Matrices/SparseMatrix.h>
#include <TNL/Devices/Host.h>
 
template< typename Device >
void
initializerListExample()
{
   std::map< std::pair< int, int >, double > map;
   map.insert( std::make_pair( std::make_pair( 0, 0 ), 2.0 ) );
   map.insert( std::make_pair( std::make_pair( 1, 0 ), -1.0 ) );
   map.insert( std::make_pair( std::make_pair( 1, 1 ), 2.0 ) );
   map.insert( std::make_pair( std::make_pair( 1, 2 ), -1.0 ) );
   map.insert( std::make_pair( std::make_pair( 2, 1 ), -1.0 ) );
   map.insert( std::make_pair( std::make_pair( 2, 2 ), 2.0 ) );
   map.insert( std::make_pair( std::make_pair( 2, 3 ), -1.0 ) );
   map.insert( std::make_pair( std::make_pair( 3, 2 ), -1.0 ) );
   map.insert( std::make_pair( std::make_pair( 3, 3 ), 2.0 ) );
   map.insert( std::make_pair( std::make_pair( 3, 4 ), -1.0 ) );
   map.insert( std::make_pair( std::make_pair( 4, 4 ), 2.0 ) );
 
   TNL::Matrices::SparseMatrix< double, Device > matrix( 5, 5, map );
 
   std::cout << "General sparse matrix: " << std::endl << matrix << std::endl;
}
 
int
main( int argc, char* argv[] )
{
   std::cout << "Creating matrices on CPU ... " << std::endl;
   initializerListExample< TNL::Devices::Host >();
 
#ifdef __CUDACC__
   std::cout << "Creating matrices on CUDA GPU ... " << std::endl;
   initializerListExample< TNL::Devices::Cuda >();
#endif
}

The method setElements (TNL::Matrices::SparseMatrix::setElements) works the same way for already existing instances of sparse matrix:

#include <iostream>
#include <map>
#include <utility>
#include <TNL/Matrices/SparseMatrix.h>
#include <TNL/Devices/Host.h>
 
template< typename Device >
void
setElementsExample()
{
   std::map< std::pair< int, int >, double > map;
   map.insert( std::make_pair( std::make_pair( 0, 0 ), 2.0 ) );
   map.insert( std::make_pair( std::make_pair( 1, 0 ), -1.0 ) );
   map.insert( std::make_pair( std::make_pair( 1, 1 ), 2.0 ) );
   map.insert( std::make_pair( std::make_pair( 1, 2 ), -1.0 ) );
   map.insert( std::make_pair( std::make_pair( 2, 1 ), -1.0 ) );
   map.insert( std::make_pair( std::make_pair( 2, 2 ), 2.0 ) );
   map.insert( std::make_pair( std::make_pair( 2, 3 ), -1.0 ) );
   map.insert( std::make_pair( std::make_pair( 3, 2 ), -1.0 ) );
   map.insert( std::make_pair( std::make_pair( 3, 3 ), 2.0 ) );
   map.insert( std::make_pair( std::make_pair( 3, 4 ), -1.0 ) );
   map.insert( std::make_pair( std::make_pair( 4, 4 ), 2.0 ) );
 
   TNL::Matrices::SparseMatrix< double, Device > matrix( 5, 5 );
   matrix.setElements( map );
 
   std::cout << "General sparse matrix: " << std::endl << matrix << std::endl;
}
 
int
main( int argc, char* argv[] )
{
   std::cout << "Creating matrices on CPU ... " << std::endl;
   setElementsExample< TNL::Devices::Host >();
 
#ifdef __CUDACC__
   std::cout << "Creating matrices on CUDA GPU ... " << std::endl;
   setElementsExample< TNL::Devices::Cuda >();
#endif
}

The result of both examples looks as follows:

Creating matrices on CPU ... 
General sparse matrix: 
Row: 0 ->     0:2   
Row: 1 ->     0:-1      1:2       2:-1  
Row: 2 ->     1:-1      2:2       3:-1  
Row: 3 ->     2:-1      3:2       4:-1  
Row: 4 ->     4:2   
 
Creating matrices on CUDA GPU ... 
General sparse matrix: 
Row: 0 ->     0:2   
Row: 1 ->     0:-1      1:2       2:-1  
Row: 2 ->     1:-1      2:2       3:-1  
Row: 3 ->     2:-1      3:2       4:-1  
Row: 4 ->     4:2   
 

Note, however, that the map can be constructed only on CPU and not on GPU. It requires allocation of additional memory on the host system (CPU) and if the target sparse matrix resided on GPU, the matrix elements must be copied on GPU. This is the reason, why this way of the sparse matrix setup is inefficient compared to other methods.

Methods `setElement` and `addElement`

Another way of setting the sparse matrix is by means of the methods setElement and addElement (TNL::Matrices::SparseMatrix::setElement, TNL::Matrices::SparseMatrix::addElement). The method can be called from both host (CPU) and device (GPU) if the matrix is allocated there. Note, however, that if the matrix is allocated on GPU and the methods are called from CPU there will be significant performance drop because the matrix elements will be transferer one-by-one separately. However, if the matrix elements setup is not a critical part of your algorithm this can be an easy way how to do it. See the following example:

#include <iostream>
#include <TNL/Algorithms/parallelFor.h>
#include <TNL/Matrices/SparseMatrix.h>
#include <TNL/Devices/Host.h>
#include <TNL/Devices/Cuda.h>
 
template< typename Device >
void
setElements()
{
   TNL::Matrices::SparseMatrix< double, Device > matrix( { 1, 1, 1, 1, 1 }, 5 );
 
   /****
    * Get the matrix view.
    */
   auto view = matrix.getView();
   for( int i = 0; i < 5; i++ )
      view.setElement( i, i, i );
 
   std::cout << "Matrix set from the host:" << std::endl;
   std::cout << matrix << std::endl;
 
   auto f = [ = ] __cuda_callable__( int i ) mutable
   {
      view.setElement( i, i, -i );
   };
 
   TNL::Algorithms::parallelFor< Device >( 0, 5, f );
 
   std::cout << "Matrix set from its native device:" << std::endl;
   std::cout << matrix << std::endl;
}
 
int
main( int argc, char* argv[] )
{
   std::cout << "Set elements on host:" << std::endl;
   setElements< TNL::Devices::Host >();
 
#ifdef __CUDACC__
   std::cout << "Set elements on CUDA device:" << std::endl;
   setElements< TNL::Devices::Cuda >();
#endif
}

We first allocate matrix with five rows (it is given by the size of the initializer list) and columns and we set capacity each row to one (line 12). The first for-loop (lines 17-19) runs on CPU no matter where the matrix is allocated. After printing the matrix (lines 21-22), we call the lambda function f (lines 24-26) with a help of parallelFor (line 28) which is device sensitive and so it runs on CPU or GPU depending on where the matrix is allocated. The result looks as follows:

Set elements on host:
Matrix set from the host:
Row: 0 -> 
Row: 1 ->     1:1   
Row: 2 ->     2:2   
Row: 3 ->     3:3   
Row: 4 ->     4:4   
 
Matrix set from its native device:
Row: 0 -> 
Row: 1 ->     1:-1  
Row: 2 ->     2:-2  
Row: 3 ->     3:-3  
Row: 4 ->     4:-4  
 
Set elements on CUDA device:
Matrix set from the host:
Row: 0 -> 
Row: 1 ->     1:1   
Row: 2 ->     2:2   
Row: 3 ->     3:3   
Row: 4 ->     4:4   
 
Matrix set from its native device:
Row: 0 -> 
Row: 1 ->     1:-1  
Row: 2 ->     2:-2  
Row: 3 ->     3:-3  
Row: 4 ->     4:-4  
 

The method addElement (TNL::Matrices::SparseMatrix::addElement) adds a value to specific matrix element. Otherwise, it behaves the same as setElement. See the following example:

#include <iostream>
#include <TNL/Matrices/SparseMatrix.h>
#include <TNL/Devices/Host.h>
 
template< typename Device >
void
addElements()
{
   TNL::Matrices::SparseMatrix< double, Device > matrix( { 5, 5, 5, 5, 5 }, 5 );
   auto matrixView = matrix.getView();
 
   for( int i = 0; i < 5; i++ )
      matrixView.setElement( i, i, i );  // or matrix.setElement
 
   std::cout << "Initial matrix is: " << std::endl << matrix << std::endl;
 
   for( int i = 0; i < 5; i++ )
      for( int j = 0; j < 5; j++ )
         matrixView.addElement( i, j, 1.0, 5.0 );  // or matrix.addElement
 
   std::cout << "Matrix after addition is: " << std::endl << matrix << std::endl;
}
 
int
main( int argc, char* argv[] )
{
   std::cout << "Add elements on host:" << std::endl;
   addElements< TNL::Devices::Host >();
 
#ifdef __CUDACC__
   std::cout << "Add elements on CUDA device:" << std::endl;
   addElements< TNL::Devices::Cuda >();
#endif
}

The result looks as follows:

Add elements on host:
Initial matrix is: 
Row: 0 -> 
Row: 1 ->     1:1   
Row: 2 ->     2:2   
Row: 3 ->     3:3   
Row: 4 ->     4:4   
 
Matrix after addition is: 
Row: 0 ->     0:1       1:1       2:1       3:1       4:1   
Row: 1 ->     0:1       1:6       2:1       3:1       4:1   
Row: 2 ->     0:1       1:1       2:11      3:1       4:1   
Row: 3 ->     0:1       1:1       2:1       3:16      4:1   
Row: 4 ->     0:1       1:1       2:1       3:1       4:21  
 
Add elements on CUDA device:
Initial matrix is: 
Row: 0 -> 
Row: 1 ->     1:1   
Row: 2 ->     2:2   
Row: 3 ->     3:3   
Row: 4 ->     4:4   
 
Matrix after addition is: 
Row: 0 ->     0:1       1:1       2:1       3:1       4:1   
Row: 1 ->     0:1       1:6       2:1       3:1       4:1   
Row: 2 ->     0:1       1:1       2:11      3:1       4:1   
Row: 3 ->     0:1       1:1       2:1       3:16      4:1   
Row: 4 ->     0:1       1:1       2:1       3:1       4:21  
 

Method `getRow`

More efficient method, especially for GPUs, is to combine getRow (TNL::Matrices::SparseMatrix::getRow) method with parallelFor and lambda function as the following example demonstrates:

#include <iostream>
#include <TNL/Algorithms/parallelFor.h>
#include <TNL/Matrices/SparseMatrix.h>
#include <TNL/Devices/Host.h>
#include <TNL/Devices/Cuda.h>
 
template< typename Device >
void
getRowExample()
{
   /***
    * Set the following matrix (dots represent zero matrix elements):
    *
    *   /  2  .  .  .  . \
    *   |  1  2  1  .  . |
    *   |  .  1  2  1. . |
    *   |  .  .  1  2  1 |
    *   \  .  .  .  .  2 /
    */
   const int size = 5;
   TNL::Matrices::SparseMatrix< double, Device > matrix( { 1, 3, 3, 3, 1 }, size );
   auto view = matrix.getView();
 
   /***
    * Set the matrix elements.
    */
   auto f = [ = ] __cuda_callable__( int rowIdx ) mutable
   {
      auto row = view.getRow( rowIdx );
      if( rowIdx == 0 )
         row.setElement( 0, rowIdx, 2.0 );  // diagonal element
      else if( rowIdx == size - 1 )
         row.setElement( 0, rowIdx, 2.0 );  // diagonal element
      else {
         row.setElement( 0, rowIdx - 1, 1.0 );  // elements below the diagonal
         row.setElement( 1, rowIdx, 2.0 );      // diagonal element
         row.setElement( 2, rowIdx + 1, 1.0 );  // elements above the diagonal
      }
   };
   TNL::Algorithms::parallelFor< Device >( 0, matrix.getRows(), f );
   std::cout << matrix << std::endl;
}
 
int
main( int argc, char* argv[] )
{
   std::cout << "Getting matrix rows on host: " << std::endl;
   getRowExample< TNL::Devices::Host >();
 
#ifdef __CUDACC__
   std::cout << "Getting matrix rows on CUDA device: " << std::endl;
   getRowExample< TNL::Devices::Cuda >();
#endif
}

On the line 21, we create small matrix having five rows (number of rows is given by the size of the initializer list ) and columns (number of columns is given by the second parameter) and we set each row capacity to one or three (particular elements of the initializer list). On the line 41, we call parallelFor to iterate over all matrix rows. Each row is processed by the lambda function f (lines 24-36). In the lambda function, we first fetch a sparse matrix row (TNL::Matrices::SparseMatrixRowView) (line 25) which serves for accessing particular matrix elements in the matrix row. This object has a method setElement (TNL::Matrices::SparseMatrixRowView::setElement) accepting three parameters:

localIdx is a rank of the nonzero element in given matrix row.
columnIdx is the new column index of the matrix element.
value is the new value of the matrix element.

The result looks as follows:

Getting matrix rows on host: 
Row: 0 ->     0:2   
Row: 1 ->     0:1       1:2       2:1   
Row: 2 ->     1:1       2:2       3:1   
Row: 3 ->     2:1       3:2       4:1   
Row: 4 ->     4:2   
 
Getting matrix rows on CUDA device: 
Row: 0 ->     0:2   
Row: 1 ->     0:1       1:2       2:1   
Row: 2 ->     1:1       2:2       3:1   
Row: 3 ->     2:1       3:2       4:1   
Row: 4 ->     4:2   
 

Method `forRows`

The method forRows (TNL::Matrices::SparseMatrix::forRows) calls the method getRow (TNL::Matrices::SparseMatrix::getRow) in parallel. See the following example which has the same effect as the previous one but it is slightly simpler:

#include <iostream>
#include <TNL/Matrices/SparseMatrix.h>
#include <TNL/Devices/Host.h>
#include <TNL/Devices/Cuda.h>
 
template< typename Device >
void
forRowsExample()
{
   /***
    * Set the following matrix (dots represent zero matrix elements):
    *
    *   /  2  .  .  .  . \
    *   |  1  2  1  .  . |
    *   |  .  1  2  1. . |
    *   |  .  .  1  2  1 |
    *   \  .  .  .  .  2 /
    */
   const int size( 5 );
   using MatrixType = TNL::Matrices::SparseMatrix< double, Device >;
   using RowView = typename MatrixType::RowView;
   MatrixType matrix( { 1, 3, 3, 3, 1 }, size );
   auto matrixView = matrix.getView();
 
   /***
    * Set the matrix elements.
    */
   auto f = [] __cuda_callable__( RowView & row )
   {
      const int rowIdx = row.getRowIndex();
      if( rowIdx == 0 )
         row.setElement( 0, rowIdx, 2.0 );  // diagonal element
      else if( rowIdx == size - 1 )
         row.setElement( 0, rowIdx, 2.0 );  // diagonal element
      else {
         row.setElement( 0, rowIdx - 1, 1.0 );  // elements below the diagonal
         row.setElement( 1, rowIdx, 2.0 );      // diagonal element
         row.setElement( 2, rowIdx + 1, 1.0 );  // elements above the diagonal
      }
   };
   matrixView.forAllRows( f );  // or matrix.forAllRows
   std::cout << matrix << std::endl;
 
   /***
    * Divide each matrix row by a sum of all elements in the row - with use of iterators.
    */
   matrix.forAllRows(
      [] __cuda_callable__( RowView & row )
      {
         double sum = 0.0;
         for( auto element : row )
            sum += element.value();
         for( auto element : row )
            element.value() /= sum;
      } );
   std::cout << "Divide each matrix row by a sum of all elements in the row ... " << std::endl;
   std::cout << matrix << std::endl;
}
 
int
main( int argc, char* argv[] )
{
   std::cout << "Getting matrix rows on host: " << std::endl;
   forRowsExample< TNL::Devices::Host >();
 
#ifdef __CUDACC__
   std::cout << "Getting matrix rows on CUDA device: " << std::endl;
   forRowsExample< TNL::Devices::Cuda >();
#endif
}

The differences are:

We do not need to get the matrix view as we did in the previous example.
We call the method forAllRows (TNL::Matrices::SparseMatrix::forAllRows) instead of parallelFor which is simpler since we do not have to state the device type explicitly. The method forAllRows calls the method forRows for all matrix rows so we do not have to state explicitly the interval of matrix rows neither.
The lambda function f (lines 27-39) accepts one parameter row of the type RowView (TNL::Matrices::SparseMatrix::RowView which is TNL::Matrices::SparseMatrixRowView) instead of the index of the matrix row. Therefore we do not need to call the method getRow (TNL::Matrices::SparseMatrix::getRow). On the other hand, we need the method geRowIndex (TNL::Matrices::SparseMatrixRowView::getRowIndex) to get the index of the matrix row (line 28).

On the lines 46-52, we call a lambda function which computes sum of all elements in a row (lines 47-49) and it divides the row by the sum then (lines 50-51).

The result looks as follows:

Getting matrix rows on host: 
Row: 0 ->     0:2   
Row: 1 ->     0:1       1:2       2:1   
Row: 2 ->     1:1       2:2       3:1   
Row: 3 ->     2:1       3:2       4:1   
Row: 4 ->     4:2   
 
Divide each matrix row by a sum of all elements in the row ... 
Row: 0 ->     0:1   
Row: 1 ->     0:0.25    1:0.5     2:0.25
Row: 2 ->     1:0.25    2:0.5     3:0.25
Row: 3 ->     2:0.25    3:0.5     4:0.25
Row: 4 ->     4:1   
 
Getting matrix rows on CUDA device: 
Row: 0 ->     0:2   
Row: 1 ->     0:1       1:2       2:1   
Row: 2 ->     1:1       2:2       3:1   
Row: 3 ->     2:1       3:2       4:1   
Row: 4 ->     4:2   
 
Divide each matrix row by a sum of all elements in the row ... 
Row: 0 ->     0:1   
Row: 1 ->     0:0.25    1:0.5     2:0.25
Row: 2 ->     1:0.25    2:0.5     3:0.25
Row: 3 ->     2:0.25    3:0.5     4:0.25
Row: 4 ->     4:1   
 

Method `forElements`

Finally, another efficient way of setting the nonzero matrix elements, is use of the method forElements (TNL::Matrices::SparseMatrix::forElements). It requires indexes of the range of rows (begin and end) to be processed and a lambda function function which is called for each nonzero element. The lambda function provides the following data:

rowIdx is a row index of the matrix element.
localIdx is an index of the nonzero matrix element within the matrix row.
columnIdx is a column index of the matrix element. If the matrix element column index is supposed to be modified, this parameter can be a reference and so its value can be changed.
value is a value of the matrix element. If the matrix element value is supposed to be modified, this parameter can be a reference as well and so the element value can be changed.

See the following example:

#include <iostream>
#include <TNL/Matrices/SparseMatrix.h>
#include <TNL/Devices/Host.h>
#include <TNL/Devices/Cuda.h>
 
template< typename Device >
void
forElementsExample()
{
   TNL::Matrices::SparseMatrix< double, Device > matrix( { 1, 2, 3, 4, 5 }, 5 );
   auto matrixView = matrix.getView();
 
   auto f = [] __cuda_callable__( int rowIdx, int localIdx, int& columnIdx, double& value )
   {
      // This is important, some matrix formats may allocate more matrix elements
      // than we requested. These padding elements are processed here as well.
      if( rowIdx >= localIdx ) {
         columnIdx = localIdx;
         value = rowIdx + localIdx + 1;
      }
   };
 
   matrixView.forElements( 0, matrix.getRows(), f );  // or matrix.forElements
   std::cout << matrix << std::endl;
}
 
int
main( int argc, char* argv[] )
{
   std::cout << "Creating matrix on host: " << std::endl;
   forElementsExample< TNL::Devices::Host >();
 
#ifdef __CUDACC__
   std::cout << "Creating matrix on CUDA device: " << std::endl;
   forElementsExample< TNL::Devices::Cuda >();
#endif
}

On the line 9, we allocate a lower triangular matrix byt setting the row capacities as {1,2,3,4,5}. On the line 11, we prepare lambda function f which we execute on the line 22 just by calling the method forElements (TNL::Matrices::SparseMatrix::forElements). This method takes the range of matrix rows as the first two parameters and the lambda function as the last parameter. The lambda function receives parameters mentioned above (see the line 11). We first check if the matrix element coordinates (rowIdx and localIdx) points to an element lying before the matrix diagonal or on the diagonal (line 12). In case of the lower triangular matrix in our example, the local index is in fact the same as the column index

\[\left( \begin{array}{ccccc} 0 & . & . & . & . \\ 0 & 1 & . & . & . \\ 0 & 1 & 2 & . & . \\ 0 & 1 & 2 & 3 & . \\ 0 & 1 & 2 & 3 & 4 \end{array} \right) \]

If we call the method forElements (TNL::Matrices::SparseMatrix::forElements) to setup the matrix elements for the first time, the parameter columnIdx has no sense because the matrix elements and their column indexes were not set yet. Therefore it is important that the test on the line 12 reads as

if( rowIdx < localIdx )

because

if( rowIdx < columnIdx )

would not make sense. If we pass through this test, the matrix element lies in the lower triangular part of the matrix and we may set the matrix elements which is done on the lines 17 and 18. The column index (columnIdx) is set to local index (line 17) and value is set on the line 18. The result looks as follows:

Creating matrix on host: 
Row: 0 ->     0:1   
Row: 1 ->     0:2       1:3   
Row: 2 ->     0:3       1:4       2:5   
Row: 3 ->     0:4       1:5       2:6       3:7   
Row: 4 ->     0:5       1:6       2:7       3:8       4:9   
 
Creating matrix on CUDA device: 
Row: 0 ->     0:1   
Row: 1 ->     0:2       1:3   
Row: 2 ->     0:3       1:4       2:5   
Row: 3 ->     0:4       1:5       2:6       3:7   
Row: 4 ->     0:5       1:6       2:7       3:8       4:9   
 

Wrapping existing data to sparse matrix view

Standard sparse matrix format like CSR and Ellpack store the matrix elements in specifically defined arrays. In case that you have already allocated them (for example in some other library), they can be wrapped into a sparse matrix view with given matrix format. This can be done by means of functions TNL::Matrices::wrapCSRMatrix and TNL::Matrices::wrapEllpackMatrix . See the following example for demonstration of the CSR format:

#include <iostream>
#include <TNL/Matrices/DenseMatrix.h>
#include <TNL/Matrices/MatrixWrapping.h>
#include <TNL/Devices/Host.h>
#include <TNL/Devices/Cuda.h>
 
template< typename Device >
void
wrapMatrixView()
{
   /***
    * Encode the following matrix to CSR format...
    *
    * /  1  2  0  0 \.
    * |  0  6  0  0 |
    * |  9  0  0  0 |
    * \  0  0 15 16 /
    */
   const int rows( 4 ), columns( 4 );
   TNL::Containers::Vector< double, Device > valuesVector{ 1, 2, 6, 9, 15, 16 };
   TNL::Containers::Vector< int, Device > columnIndexesVector{ 0, 1, 1, 0, 2, 3 };
   TNL::Containers::Vector< int, Device > rowPointersVector{ 0, 2, 3, 4, 6 };
 
   double* values = valuesVector.getData();
   int* columnIndexes = columnIndexesVector.getData();
   int* rowPointers = rowPointersVector.getData();
 
   /***
    * Wrap the arrays `rowPointers, `values` and `columnIndexes` to sparse matrix view
    */
   auto matrix = TNL::Matrices::wrapCSRMatrix< Device >( rows, columns, rowPointers, values, columnIndexes );
 
   std::cout << "Matrix reads as: " << std::endl << matrix << std::endl;
}
 
int
main( int argc, char* argv[] )
{
   std::cout << "Wraping matrix view on host: " << std::endl;
   wrapMatrixView< TNL::Devices::Host >();
 
#ifdef __CUDACC__
   std::cout << "Wraping matrix view on CUDA device: " << std::endl;
   wrapMatrixView< TNL::Devices::Cuda >();
#endif
}

We create sparse matrix having four rows and four columns (line 19). We use TNL vector (TNL::Containers::Vector) to allocate arrays necessary for the CSR format:

valuesVector (line 20) - contains values of the nonzero matrix elements.
columnIndexesVector (line 21) - contains column indexes of the nonzero matrix elements.
rowPointersVector (line 22) - contains positions of the first nonzero matrix elements in each row within valuesVector and columnIndexesVector. The size of this array equals number of matrix rows plus one.

Next we turn the vectors into C style pointers (lines 24-26) to wrap them into sparse matrix view (line 31). Note, that we must explicitly state the device on which the arrays are allocated. Finlay we print the matrix to check the correctness (line 33). The result looks as follows:

Wraping matrix view on host: 
Matrix reads as: 
Row: 0 ->     0:1       1:2   
Row: 1 ->     1:6   
Row: 2 ->     0:9   
Row: 3 ->     2:15      3:16  
 
Wraping matrix view on CUDA device: 
Matrix reads as: 
Row: 0 ->     0:1       1:2   
Row: 1 ->     1:6   
Row: 2 ->     0:9   
Row: 3 ->     2:15      3:16  
 

Wrapping data corresponding with the Ellpack format is very similar as we can see in the following example:

#include <iostream>
#include <TNL/Matrices/DenseMatrix.h>
#include <TNL/Matrices/MatrixWrapping.h>
#include <TNL/Devices/Host.h>
#include <TNL/Devices/Cuda.h>
 
template< typename Device >
void
wrapMatrixView()
{
   /***
    * Encode the following matrix to Ellpack format...
    *
    * /  1  2  0  0 \.
    * |  0  6  0  0 |
    * |  9  0  0  0 |
    * \  0  0 15 16 /
    */
   const int rows( 4 ), columns( 4 );
   TNL::Containers::Vector< double, Device > valuesVector{ 1, 2, 6, 0, 9, 0, 15, 16 };
   TNL::Containers::Vector< int, Device > columnIndexesVector{ 0, 1, 1, -1, 0, -1, 2, 3 };
 
   double* values = valuesVector.getData();
   int* columnIndexes = columnIndexesVector.getData();
 
   /***
    * Wrap the arrays `values` and `columnIndexes` to sparse matrix view
    */
   auto matrix = TNL::Matrices::wrapEllpackMatrix< Device, TNL::Algorithms::Segments::RowMajorOrder >(
      rows, columns, 2, values, columnIndexes );
 
   std::cout << "Matrix reads as: " << std::endl << matrix << std::endl;
}
 
int
main( int argc, char* argv[] )
{
   std::cout << "Wraping matrix view on host: " << std::endl;
   wrapMatrixView< TNL::Devices::Host >();
 
#ifdef __CUDACC__
   std::cout << "Wraping matrix view on CUDA device: " << std::endl;
   wrapMatrixView< TNL::Devices::Cuda >();
#endif
}

We encode the same sparse matrix as in the previous example. The essence of the Ellpack format is that we allocate the same number of matrix elements for each row which is two in our example. For some matrix rows we use the padding zeros for which we set the column index to -1 (line 21). Therefore the size of valuesVector and columnIndexesVector equals number of matrix rows times number of matrix elements allocated in each row. As before, we turn the vectors into C style pointers (lines 23-24) and wrap them into sparse matrix view with Ellpack format (line 29). Note that we must state the device on which the arrays are allocated explicitly and also the matrix elements organization, which is TNL::Algorithms::Segments::RowMajorOrder in this case. For Ellpack matrix stored on GPU, TNL::Algorithms::Segments::ColumnMajorOrder is preferred. The result looks as follows:

Wraping matrix view on host: 
Matrix reads as: 
Row: 0 ->     0:1       1:2   
Row: 1 ->     1:6   
Row: 2 ->     0:9   
Row: 3 ->     2:15      3:16  
 
Wraping matrix view on CUDA device: 
Matrix reads as: 
Row: 0 ->     0:1       1:2   
Row: 1 ->     1:6   
Row: 2 ->     0:9   
Row: 3 ->     2:15      3:16  
 

Symmetric sparse matrices

For sparse symmetric matrices, TNL offers a format storing only a half of the matrix elements. More precisely, ony the matrix diagonal and the elements bellow are stored in the memory. The matrix elements above the diagonal are deduced from those bellow. If such a symmetric format is used on GPU, atomic operations must be used in some matrix operations. For this reason, symmetric matrices can be combined only with matrix elements values expressed in float or double type. An advantage of the symmetric formats is lower memory consumption. Since less data need to be transferred from the memory, better performance might be observed. In some cases, however, the use of atomic operations on GPU may cause performance drop. Mostly we can see approximately the same performance compared to general formats but we can profit from lower memory requirements which is appreciated especially on GPU. The following example shows how to create symmetric sparse matrix.

#include <iostream>
#include <TNL/Matrices/SparseMatrix.h>
#include <TNL/Devices/Host.h>
 
template< typename Device >
void
symmetricSparseMatrixExample()
{
   TNL::Matrices::SparseMatrix< double, Device, int, TNL::Matrices::SymmetricMatrix > symmetricMatrix(
      5,  // number of matrix rows
      5,  // number of matrix columns
      {
         // matrix elements definition
         // clang-format off
         { 0, 0, 1.0 },
         { 1, 0, 2.0 }, { 1, 1, 1.0 },
         { 2, 0, 3.0 }, { 2, 2, 1.0 },
         { 3, 0, 4.0 }, { 3, 3, 1.0 },
         { 4, 0, 5.0 }, { 4, 4, 1.0 },
         // clang-format on
      } );
 
   std::cout << "Symmetric sparse matrix: " << std::endl << symmetricMatrix << std::endl;
 
   TNL::Containers::Vector< double, Device > inVector( 5, 1.0 ), outVector( 5, 0.0 );
   symmetricMatrix.vectorProduct( inVector, outVector );
   std::cout << "Product with vector " << inVector << " is " << outVector << std::endl << std::endl;
}
 
int
main( int argc, char* argv[] )
{
   std::cout << "Creating matrix on CPU ... " << std::endl;
   symmetricSparseMatrixExample< TNL::Devices::Host >();
 
#ifdef __CUDACC__
   std::cout << "Creating matrix on CUDA GPU ... " << std::endl;
   symmetricSparseMatrixExample< TNL::Devices::Cuda >();
#endif
}

We construct matrix of the following form

\[\left( \begin{array}{ccccc} 1 & \color{grey}{2} & \color{grey}{3} & \color{grey}{4} & \color{grey}{5} \\ 2 & 1 & & & \\ 3 & & 1 & & \\ 4 & & & 1 & \\ 5 & & & & 1 \end{array} \right) \]

The elements depicted in grey color are not stored in the memory. The main difference, compared to creation of general sparse matrix, is on line 9 where we state that the matrix is symmetric by setting the matrix type to TNL::Matrices::SymmetricMatrix. Next we set only the diagonal elements and those lying bellow the diagonal (lines 13-17). When we print the matrix (line 19) we can see also the symmetric part above the diagonal. Next we test product of matrix and vector (lines 21-23). The result looks as follows:

Creating matrix on CPU ... 
Symmetric sparse matrix: 
Row: 0 -> 0:1   1:2 2:3 3:4 4:5 
Row: 1 -> 0:2   1:1 
Row: 2 -> 0:3   2:1 
Row: 3 -> 0:4   3:1 
Row: 4 -> 0:5   4:1 
 
Product with vector [ 1, 1, 1, 1, 1 ] is [ 15, 3, 4, 5, 6 ]
 
Creating matrix on CUDA GPU ... 
Symmetric sparse matrix: 
Row: 0 -> 0:1   1:2 2:3 3:4 4:5 
Row: 1 -> 0:2   1:1 
Row: 2 -> 0:3   2:1 
Row: 3 -> 0:4   3:1 
Row: 4 -> 0:5   4:1 
 
Product with vector [ 1, 1, 1, 1, 1 ] is [ 15, 3, 4, 5, 6 ]
 

Warning: Assignment of symmetric sparse matrix to general sparse matrix does not give correct result, currently. Only the diagonal and the lower part of the matrix is assigned.

Binary sparse matrices

If the matrix element value type (i.e. Real type) is set to bool the matrix elements can be only 1 or 0. So in the sparse matrix formats, where we do not store the zero matrix elements, explicitly stored elements can have only one possible value which is 1. Therefore we do not need to store the values, only the positions of the nonzero elements. The array values, which usualy stores the matrix elements values, can be completely omitted and we can reduce the memory requirements. The following table shows how much we can reduce the memory consumption when using binary matrix instead of common sparse matrix using float or double types:

Real	Index	Common sparse matrix	Binary sparse matrix	Ratio
float	32-bit	4 + 4 = 8 B	4 B	50%
float	64-bit	4 + 8 = 12 B	8 B	75%
double	32-bit	8 + 4 = 12 B	4 B	33%
double	64-bit	8 + 8 = 16 B	8 B	50%

The following example demonstrates the use of binary matrix:

#include <iostream>
#include <TNL/Matrices/SparseMatrix.h>
#include <TNL/Algorithms/Segments/CSR.h>
#include <TNL/Devices/Host.h>
 
template< typename Device >
void
binarySparseMatrixExample()
{
   TNL::Matrices::SparseMatrix< bool, Device, int > binaryMatrix(
      // number of matrix rows
      5,
      // number of matrix columns
      5,
      // matrix elements definition
      {
         // clang-format off
         { 0, 0, 1.0 }, { 0, 1, 2.0 }, { 0, 2, 3.0 }, { 0, 3, 4.0 }, { 0, 4, 5.0 },
         { 1, 0, 2.0 }, { 1, 1, 1.0 },
         { 2, 0, 3.0 }, { 2, 2, 1.0 },
         { 3, 0, 4.0 }, { 3, 3, 1.0 },
         { 4, 0, 5.0 }, { 4, 4, 1.0 },
         // clang-format on
      } );
 
   std::cout << "Binary sparse matrix: \n" << binaryMatrix << '\n';
 
   TNL::Containers::Vector< double, Device > inVector( 5, 1.1 ), outVector( 5, 0.0 );
   binaryMatrix.vectorProduct( inVector, outVector );
   std::cout << "Product with vector " << inVector << " is " << outVector << "\n\n";
 
   TNL::Matrices::SparseMatrix< bool, Device, int, TNL::Matrices::GeneralMatrix, TNL::Algorithms::Segments::CSR, double >
      binaryMatrix2;
   binaryMatrix2 = binaryMatrix;
   binaryMatrix2.vectorProduct( inVector, outVector );
   std::cout << "Product with vector in double precision " << inVector << " is " << outVector << "\n\n";
}
 
int
main( int argc, char* argv[] )
{
   std::cout << "Creating matrix on CPU ... \n";
   binarySparseMatrixExample< TNL::Devices::Host >();
 
#ifdef __CUDACC__
   std::cout << "Creating matrix on CUDA GPU ... \n";
   binarySparseMatrixExample< TNL::Devices::Cuda >();
#endif
}

All we need to do is set the Real type to bool as we can see on the line 9. We can see that even though we set different values to different matrix elements (lines 14-18) at the end all of them are turned into ones (printing of the matrix on the line 20). There is an issue, however, which is demonstrated on the product of the matrix with a vector. Nonbinary matrices compute all operations using the Real type. If it is set to bool operations like SpMV would not get correct solution. Therefore sparse matrices use another type called ComputeReal which is the 6th template parameter of TNL::Matrices::SparseMatrix. By default it is set to Index type but it can be changed by the user. On the lines 26-29 we show how to change this type to double and what is the effect of it (correct result of matrix-vector multiplication). The result looks as follows:

Creating matrix on CPU ... 
Binary sparse matrix: 
Row: 0 ->     0:1       1:1       2:1       3:1       4:1   
Row: 1 ->     0:1       1:1   
Row: 2 ->     0:1       2:1   
Row: 3 ->     0:1       3:1   
Row: 4 ->     0:1       4:1   
 
Product with vector [ 1.1, 1.1, 1.1, 1.1, 1.1 ] is [ 5, 2, 2, 2, 2 ]
 
Product with vector in double precision [ 1.1, 1.1, 1.1, 1.1, 1.1 ] is [ 5.5, 2.2, 2.2, 2.2, 2.2 ]
 
Creating matrix on CUDA GPU ... 
Binary sparse matrix: 
Row: 0 ->     0:1       1:1       2:1       3:1       4:1   
Row: 1 ->     0:1       1:1   
Row: 2 ->     0:1       2:1   
Row: 3 ->     0:1       3:1   
Row: 4 ->     0:1       4:1   
 
Product with vector [ 1.1, 1.1, 1.1, 1.1, 1.1 ] is [ 5, 2, 2, 2, 2 ]
 
Product with vector in double precision [ 1.1, 1.1, 1.1, 1.1, 1.1 ] is [ 5.5, 2.2, 2.2, 2.2, 2.2 ]
 

Tridiagonal matrices

Tridiagonal matrix format serves for specific matrix pattern when the nonzero matrix elements can be placed only at the diagonal and immediately next to the diagonal. Here is an example:

\[\left( \begin{array}{ccccccc} 2 & -1 & . & . & . & . \\ -1 & 2 & -1 & . & . & . \\ . & -1 & 2 & -1 & . & . \\ . & . & -1 & 2 & -1 & . \\ . & . & . & -1 & 2 & -1 \\ . & . & . & . & -1 & 2 \end{array} \right) \]

An advantage is that we do not store the column indexes explicitly as it is in TNL::Matrices::SparseMatrix. This can reduce significantly the memory requirements which also means better performance. See the following table for the storage requirements comparison between TNL::Matrices::TridiagonalMatrix and TNL::Matrices::SparseMatrix.

Real	Index	SparseMatrix	TridiagonalMatrix	Ratio
float	32-bit int	8 bytes per element	4 bytes per element	50%
double	32-bit int	12 bytes per element	8 bytes per element	75%
float	64-bit int	12 bytes per element	4 bytes per element	30%
double	64-bit int	16 bytes per element	8 bytes per element	50%

Tridiagonal matrix is a templated class defined in the namespace TNL::Matrices. It has five template parameters:

Real is a type of the matrix elements. It is double by default.
Device is a device where the matrix shall be allocated. Currently it can be either TNL::Devices::Host for CPU or TNL::Devices::Cuda for GPU supporting CUDA. It is TNL::Devices::Host by default.
Index is a type to be used for indexing of the matrix elements. It is int by default.
ElementsOrganization defines the organization of the matrix elements in memory. It can be TNL::Algorithms::Segments::ColumnMajorOrder or TNL::Algorithms::Segments::RowMajorOrder for column-major and row-major organization respectively. Be default it is the row-major order if the matrix is allocated in the host system and column major order if it is allocated on GPU.
RealAllocator is a memory allocator (one from TNL::Allocators) which shall be used for allocation of the matrix elements. By default, it is the default allocator for given Real type and Device type – see TNL::Allocators::Default.

For better alignment in the memory the tridiagonal format is organized like if there were three nonzero matrix elements in each row. This is not true for example in the first row where there is no matrix element on the left side of the diagonal. The same happens on the last row of the matrix. We have to add even the artificial matrix elements like this:

\[\begin{array}{c} 0 \\ . \\ . \\ . \\ . \\ . \end{array} \left( \begin{array}{ccccccc} 2 & -1 & . & . & . & . \\ -1 & 2 & -1 & . & . & . \\ . & -1 & 2 & -1 & . & . \\ . & . & -1 & 2 & -1 & . \\ . & . & . & -1 & 2 & -1 \\ . & . & . & . & -1 & 2 \end{array} \right) \begin{array}{c} . \\ . \\ . \\ . \\ . \\ 0 \end{array} \]

If the tridiagonal matrix has more rows then columns, we have to extend the last two rows with nonzero elements in this way

\[\left( \begin{array}{ccccccc} 2 & -1 & . & . & . & . \\ -1 & 2 & -1 & . & . & . \\ . & -1 & 2 & -1 & . & . \\ . & . & -1 & 2 & -1 & . \\ . & . & . & -1 & 2 & -1 \\ . & . & . & . & -1 & 2 \\ . & . & . & . & . & -1 \end{array} \right) \rightarrow \begin{array}{c} 0 \\ . \\ . \\ . \\ . \\ . \\ . \end{array} \left( \begin{array}{ccccccc} 2 & -1 & . & . & . & . \\ -1 & 2 & -1 & . & . & . \\ . & -1 & 2 & -1 & . & . \\ . & . & -1 & 2 & -1 & . \\ . & . & . & -1 & 2 & -1 \\ . & . & . & . & -1 & 2 \\ . & . & . & . & . & -1 \end{array} \right) \begin{array}{cc} . & . \\ . & . \\ . & . \\ . & . \\ . & . \\ 0 & . \\ 0 & 0 \end{array} \]

We also would like to remind the meaning of the local index (localIdx) of the matrix element within a matrix row. It is a rank of the nonzero matrix element in given row as we explained in section Indexing of nonzero matrix elements in sparse matrices. The values of the local index for tridiagonal matrix elements are as follows

\[\left( \begin{array}{cccccc} 1 & 2 & & & & \\ 0 & 1 & 2 & & & \\ & 0 & 1 & 2 & & \\ & & 0 & 1 & 2 & \\ & & & 0 & 1 & 2 \\ & & & & 0 & 1 \end{array} \right) \]

In the following text we show different methods for setup of tridiagonal matrices.

Initializer list

The tridiagonal matrix can be initialized by the means of the constructor with initializer list. The matrix from the beginning of this section can be constructed as the following example demonstrates:

#include <iostream>
#include <TNL/Matrices/TridiagonalMatrix.h>
#include <TNL/Devices/Host.h>
#include <TNL/Devices/Cuda.h>
 
template< typename Device >
void
createTridiagonalMatrix()
{
   const int matrixSize = 6;
 
   /***
    * Setup the following matrix (dots represent zeros):
    *
    * /  2 -1 .   .  .  . \
    * | -1  2 -1  .  .  . |
    * |  . -1  2 -1  .  . |
    * |  .  . -1  2 -1  . |
    * |  .  .  . -1  2 -1 |
    * \  .  .  .  . -1  2 /
    *
    */
   TNL::Matrices::TridiagonalMatrix< double, Device > matrix(
      matrixSize,
      { /***
         * To set the matrix elements we first extend the diagonals to their full
         * lengths even outside the matrix (dots represent zeros and zeros are
         * artificial zeros used for memory alignment):
         *
         * 0 /  2 -1 .   .  .  . \    -> {  0,  2, -1 }
         *   | -1  2 -1  .  .  . |    -> { -1,  2, -1 }
         *   |  . -1  2 -1  .  . |    -> { -1,  2, -1 }
         *   |  .  . -1  2 -1  . |    -> { -1,  2, -1 }
         *   |  .  .  . -1  2 -1 |    -> { -1,  2, -1 }
         *   \  .  .  .  . -1  2 / 0  -> { -1,  2,  0 }
         *
         */
        { 0, 2, -1 },
        { -1, 2, -1 },
        { -1, 2, -1 },
        { -1, 2, -1 },
        { -1, 2, -1 },
        { -1, 2, 0 } } );
   std::cout << "The matrix reads as: " << std::endl << matrix << std::endl;
}
 
int
main( int argc, char* argv[] )
{
   std::cout << "Creating tridiagonal matrix on CPU ... " << std::endl;
   createTridiagonalMatrix< TNL::Devices::Host >();
 
#ifdef __CUDACC__
   std::cout << "Creating tridiagonal matrix on CUDA GPU ... " << std::endl;
   createTridiagonalMatrix< TNL::Devices::Cuda >();
#endif
}

The matrix elements values are defined on lines 39-44. Each matrix row is represented by embedded an initializer list. We set three values in each row including the padding zeros.

The output of the example looks as:

Creating tridiagonal matrix on CPU ... 
The matrix reads as: 
Row: 0 ->     0:2       1:-1  
Row: 1 ->     0:-1      1:2       2:-1  
Row: 2 ->     1:-1      2:2       3:-1  
Row: 3 ->     2:-1      3:2       4:-1  
Row: 4 ->     3:-1      4:2       5:-1  
Row: 5 ->     4:-1      5:2   
 
Creating tridiagonal matrix on CUDA GPU ... 
The matrix reads as: 
Row: 0 ->     0:2       1:-1  
Row: 1 ->     0:-1      1:2       2:-1  
Row: 2 ->     1:-1      2:2       3:-1  
Row: 3 ->     2:-1      3:2       4:-1  
Row: 4 ->     3:-1      4:2       5:-1  
Row: 5 ->     4:-1      5:2   
 

Methods setElement and addElement

Similar way of the tridiagonal matrix setup is offered by the method setElements (TNL::Matrices::TridiagonalMatrix::setElements) as the following example demonstrates:

#include <iostream>
#include <TNL/Matrices/TridiagonalMatrix.h>
#include <TNL/Devices/Host.h>
#include <TNL/Devices/Cuda.h>
 
template< typename Device >
void
createTridiagonalMatrix()
{
   const int matrixSize = 6;
 
   /***
    * Setup the following matrix (dots represent zeros):
    *
    * /  2 -1 .   .  .  . \
    * | -1  2 -1  .  .  . |
    * |  . -1  2 -1  .  . |
    * |  .  . -1  2 -1  . |
    * |  .  .  . -1  2 -1 |
    * \  .  .  .  . -1  2 /
    *
    */
   TNL::Matrices::TridiagonalMatrix< double, Device > matrix( matrixSize, matrixSize );
   matrix.setElements( { /***
                          * To set the matrix elements we first extend the diagonals to their full
                          * lengths even outside the matrix (dots represent zeros and zeros are
                          * artificial zeros used for memory alignment):
                          *
                          * 0 /  2 -1 .   .  .  . \    -> {  0,  2, -1 }
                          *   | -1  2 -1  .  .  . |    -> { -1,  2, -1 }
                          *   |  . -1  2 -1  .  . |    -> { -1,  2, -1 }
                          *   |  .  . -1  2 -1  . |    -> { -1,  2, -1 }
                          *   |  .  .  . -1  2 -1 |    -> { -1,  2, -1 }
                          *   \  .  .  .  . -1  2 / 0  -> { -1,  2,  0 }
                          *
                          */
                         { 0, 2, -1 },
                         { -1, 2, -1 },
                         { -1, 2, -1 },
                         { -1, 2, -1 },
                         { -1, 2, -1 },
                         { -1, 2, 0 } } );
   std::cout << "The matrix reads as: " << std::endl << matrix << std::endl;
}
 
int
main( int argc, char* argv[] )
{
   std::cout << "Creating tridiagonal matrix on CPU ... " << std::endl;
   createTridiagonalMatrix< TNL::Devices::Host >();
 
#ifdef __CUDACC__
   std::cout << "Creating tridiagonal matrix on CUDA GPU ... " << std::endl;
   createTridiagonalMatrix< TNL::Devices::Cuda >();
#endif
}

Here we create the matrix in two steps. Firstly, we setup the matrix dimensions by the appropriate constructor (line 24) and after that we setup the matrix elements (line 25-45). The result looks the same as in the previous example:

Creating tridiagonal matrix on CPU ... 
The matrix reads as: 
Row: 0 ->     0:2       1:-1  
Row: 1 ->     0:-1      1:2       2:-1  
Row: 2 ->     1:-1      2:2       3:-1  
Row: 3 ->     2:-1      3:2       4:-1  
Row: 4 ->     3:-1      4:2       5:-1  
Row: 5 ->     4:-1      5:2   
 
Creating tridiagonal matrix on CUDA GPU ... 
The matrix reads as: 
Row: 0 ->     0:2       1:-1  
Row: 1 ->     0:-1      1:2       2:-1  
Row: 2 ->     1:-1      2:2       3:-1  
Row: 3 ->     2:-1      3:2       4:-1  
Row: 4 ->     3:-1      4:2       5:-1  
Row: 5 ->     4:-1      5:2   
 

In the following example we create tridiagonal matrix with 5 rows and 5 columns (line 12-14) by the means of a shared pointer (TNL::Pointers::SharedPointer) to make this work even on GPU. We set numbers 0,...,4 on the diagonal (line 16) and we print the matrix (line 18). Next we use a lambda function (lines 21-27) combined with parallelFor (line 35), to modify the matrix. The offdiagonal elements are set to 1 (lines 23 and 26) and for the diagonal elements, we change the sign (line 24).

#include <iostream>
#include <TNL/Algorithms/parallelFor.h>
#include <TNL/Matrices/TridiagonalMatrix.h>
#include <TNL/Devices/Host.h>
#include <TNL/Devices/Cuda.h>
 
template< typename Device >
void
setElements()
{
   const int matrixSize( 5 );
   using Matrix = TNL::Matrices::TridiagonalMatrix< double, Device >;
   Matrix matrix( matrixSize, matrixSize );
   auto view = matrix.getView();
 
   for( int i = 0; i < 5; i++ )
      view.setElement( i, i, i );  // or matrix.setElement
 
   std::cout << "Matrix set from the host:" << std::endl;
   std::cout << matrix << std::endl;
 
   auto f = [ = ] __cuda_callable__( int i ) mutable
   {
      if( i > 0 )
         view.setElement( i, i - 1, 1.0 );
      view.setElement( i, i, -i );
      if( i < matrixSize - 1 )
         view.setElement( i, i + 1, 1.0 );
   };
 
   TNL::Algorithms::parallelFor< Device >( 0, matrixSize, f );
 
   std::cout << "Matrix set from its native device:" << std::endl;
   std::cout << matrix << std::endl;
}
 
int
main( int argc, char* argv[] )
{
   std::cout << "Set elements on host:" << std::endl;
   setElements< TNL::Devices::Host >();
 
#ifdef __CUDACC__
   std::cout << "Set elements on CUDA device:" << std::endl;
   setElements< TNL::Devices::Cuda >();
#endif
}

The result looks as follows:

Set elements on host:
Matrix set from the host:
Row: 0 -> 
Row: 1 ->     1:1   
Row: 2 ->     2:2   
Row: 3 ->     3:3   
Row: 4 ->     4:4   
 
Matrix set from its native device:
Row: 0 ->     1:1   
Row: 1 ->     0:1       1:-1      2:1   
Row: 2 ->     1:1       2:-2      3:1   
Row: 3 ->     2:1       3:-3      4:1   
Row: 4 ->     3:1       4:-4  
 
Set elements on CUDA device:
Matrix set from the host:
Row: 0 -> 
Row: 1 ->     1:1   
Row: 2 ->     2:2   
Row: 3 ->     3:3   
Row: 4 ->     4:4   
 
Matrix set from its native device:
Row: 0 ->     1:1   
Row: 1 ->     0:1       1:-1      2:1   
Row: 2 ->     1:1       2:-2      3:1   
Row: 3 ->     2:1       3:-3      4:1   
Row: 4 ->     3:1       4:-4  
 

Method getRow

A bit different way of setting up the matrix, is the use of tridiagonal matrix view and the method getRow (TNL::Matrices::TridiagonalMatrixView::getRow) as the following example demonstrates:

#include <iostream>
#include <TNL/Algorithms/parallelFor.h>
#include <TNL/Matrices/TridiagonalMatrix.h>
#include <TNL/Devices/Host.h>
#include <TNL/Devices/Cuda.h>
 
template< typename Device >
void
getRowExample()
{
   /***
    * Set the following matrix (dots represent zero matrix elements and zeros are
    * padding zeros for memory alignment):
    *
    *    0 /  2 -1  .  .  . \  -> { 0, 0, 1 }
    *      | -1  2 -1  .  . |  -> { 0, 2, 1 }
    *      |  . -1  2 -1. . |  -> { 3, 2, 1 }
    *      |  .  . -1  2 -1 |  -> { 3, 2, 1 }
    *      \  .  .  . -1  2 /  -> { 3, 2, 1 }
    *
    */
 
   const int matrixSize( 5 );
   using MatrixType = TNL::Matrices::TridiagonalMatrix< double, Device >;
   MatrixType matrix( matrixSize,  // number of matrix rows
                      matrixSize   // number of matrix columns
   );
   auto view = matrix.getView();
 
   auto f = [ = ] __cuda_callable__( int rowIdx ) mutable
   {
      auto row = view.getRow( rowIdx );
 
      if( rowIdx > 0 )
         row.setElement( 0, -1.0 );  // elements below the diagonal
      row.setElement( 1, 2.0 );      // elements on the diagonal
      if( rowIdx < matrixSize - 1 )  // elements above the diagonal
         row.setElement( 2, -1.0 );
   };
 
   /***
    * Set the matrix elements.
    */
   TNL::Algorithms::parallelFor< Device >( 0, view.getRows(), f );
   std::cout << std::endl << matrix << std::endl;
}
 
int
main( int argc, char* argv[] )
{
   std::cout << "Getting matrix rows on host: " << std::endl;
   getRowExample< TNL::Devices::Host >();
 
#ifdef __CUDACC__
   std::cout << "Getting matrix rows on CUDA device: " << std::endl;
   getRowExample< TNL::Devices::Cuda >();
#endif
}

We create a matrix with the same size (line 22-27). Next, we fetch the tridiagonal matrix view (ef TNL::Matrices::TridiagonalMatrixView ,line 28) which we use in the lambda function for matrix elements modification (lines 30-38). Inside the lambda function, we first get a matrix row by calling the method getRow (TNL::Matrices::TridiagonalMatrixView::getRow) using which we can access the matrix elements (lines 33-37). We would like to stress that the method setElement addresses the matrix elements with the localIdx parameter which is a rank of the nonzero element in the matrix row - see Indexing of nonzero matrix elements in sparse matrices. The lambda function is called by the parallelFor.

The result looks as follows:

Getting matrix rows on host: 
 
Row: 0 ->     0:2       1:-1  
Row: 1 ->     0:-1      1:2       2:-1  
Row: 2 ->     1:-1      2:2       3:-1  
Row: 3 ->     2:-1      3:2       4:-1  
Row: 4 ->     3:-1      4:2   
 
Getting matrix rows on CUDA device: 
 
Row: 0 ->     0:2       1:-1  
Row: 1 ->     0:-1      1:2       2:-1  
Row: 2 ->     1:-1      2:2       3:-1  
Row: 3 ->     2:-1      3:2       4:-1  
Row: 4 ->     3:-1      4:2   
 

Method forRows

As in the case of other matrix types, the method forRows (TNL::Matrices::TridiagonalMatrix::forRows) calls the method getRow (TNL::Matrices::TridiagonalMatrix::getRow) in parallel. It is demonstrated by the following example which we may directly compare with the previous one:

#include <iostream>
#include <TNL/Matrices/TridiagonalMatrix.h>
#include <TNL/Devices/Host.h>
#include <TNL/Devices/Cuda.h>
 
template< typename Device >
void
forRowsExample()
{
   using MatrixType = TNL::Matrices::TridiagonalMatrix< double, Device >;
   /***
    * Set the following matrix (dots represent zero matrix elements and zeros are
    * padding zeros for memory alignment):
    *
    *    0 /  2  .  .  .  . \  -> { 0, 0, 1 }
    *      | -1  2 -1  .  . |  -> { 0, 2, 1 }
    *      |  . -1  2 -1. . |  -> { 3, 2, 1 }
    *      |  .  . -1  2 -1 |  -> { 3, 2, 1 }
    *      \  .  .  .  .  2 /  -> { 3, 2, 1 }
    *
    * The diagonals offsets are { -1, 0, 1 }.
    */
   const int size = 5;
   MatrixType matrix( size, size );
   auto view = matrix.getView();
 
   auto f = [] __cuda_callable__( typename MatrixType::RowView & row )
   {
      const int& rowIdx = row.getRowIndex();
      if( rowIdx > 0 )
         row.setElement( 0, -1.0 );  // elements below the diagonal
      row.setElement( 1, 2.0 );      // elements on the diagonal
      if( rowIdx < size - 1 )        // elements above the diagonal
         row.setElement( 2, -1.0 );
   };
   view.forAllRows( f );  // or matrix.forAllRows
   std::cout << matrix << std::endl;
}
 
int
main( int argc, char* argv[] )
{
   std::cout << "Creating matrix on host: " << std::endl;
   forRowsExample< TNL::Devices::Host >();
 
#ifdef __CUDACC__
   std::cout << "Creating matrix on CUDA device: " << std::endl;
   forRowsExample< TNL::Devices::Cuda >();
#endif
}

The differences are:

We do not need to get the matrix view as we did in the previous example.
We call the method forAllRows (TNL::Matrices::TridiagonalMatrix::forAllRows) (line 33) instead of parallelFor which is simpler since we do not have to state the device type explicitly. The method forAllRows calls the method forRows for all matrix rows so we do not have to state explicitly the interval of matrix rows neither.
The lambda function f (lines 25-31) accepts one parameter row of the type RowView (TNL::Matrices::TridiagonalMatrix::RowView which is TNL::Matrices::TridiagonalMatrixRowView) instead of the index of the matrix row. Therefore we do not need to call the method getRow (TNL::Matrices::TridiagonalMatrix::getRow). On the other hand, we need the method geRowIndex (TNL::Matrices::TridiagonalMatrixRowView::getRowIndex) to get the index of the matrix row (line 24).

Next, we compute sum of absolute values of matrix elements in each row and store it in a vector (lines 39-46). Firstly we create the vector sum_vector for storing the sums (line 39) and get a vector view sum_view to get access to the vector from a lambda function. On the lines 41-46, we call lambda function for each matrix row which iterates over all matrix elements and sum their absolute values. Finally we store the result to the output vector (line 45).

The result looks as follows:

Creating matrix on host: 
Row: 0 ->     0:2       1:-1  
Row: 1 ->     0:-1      1:2       2:-1  
Row: 2 ->     1:-1      2:2       3:-1  
Row: 3 ->     2:-1      3:2       4:-1  
Row: 4 ->     3:-1      4:2   
 
Creating matrix on CUDA device: 
Row: 0 ->     0:2       1:-1  
Row: 1 ->     0:-1      1:2       2:-1  
Row: 2 ->     1:-1      2:2       3:-1  
Row: 3 ->     2:-1      3:2       4:-1  
Row: 4 ->     3:-1      4:2   
 

Method forElements

Finally, even a bit more simple way of matrix elements manipulation with the method forElements (TNL::Matrices::TridiagonalMatrix::forElements) is demonstrated in the following example:

#include <iostream>
#include <TNL/Matrices/TridiagonalMatrix.h>
#include <TNL/Devices/Host.h>
#include <TNL/Devices/Cuda.h>
 
template< typename Device >
void
forElementsExample()
{
   /***
    * Set the following matrix (dots represent zero matrix elements and zeros are
    * padding zeros for memory alignment):
    *
    * 0 / 2  1  .  .  . \   -> { 0, 2, 1 }
    *   | 3  2  1  .  . |   -> { 3, 2, 1 }
    *   | .  3  2  1  . |   -> { 3, 2, 1 }
    *   | .  .  3  2  1 |   -> { 3, 2, 1 }
    *   \ .  .  .  3  2 / 0 -> { 3, 2, 0 }
    */
   TNL::Matrices::TridiagonalMatrix< double, Device > matrix( 5,    // number of matrix rows
                                                              5 );  // number of matrix columns
   auto view = matrix.getView();
 
   auto f = [] __cuda_callable__( int rowIdx, int localIdx, int columnIdx, double& value )
   {
      /***
       * 'forElements' method iterates only over matrix elements lying on given subdiagonals
       * and so we do not need to check anything. The element value can be expressed
       * by the 'localIdx' variable, see the following figure:
       *
       *                           0  1  2  <- localIdx values
       *                           -------
       * 0 / 2  1  .  .  . \   -> { 0, 2, 1 }
       *   | 3  2  1  .  . |   -> { 3, 2, 1 }
       *   | .  3  2  1  . |   -> { 3, 2, 1 }
       *   | .  .  3  2  1 |   -> { 3, 2, 1 }
       *   \ .  .  .  3  2 / 0 -> { 3, 2, 0 }
       *
       */
      value = 3 - localIdx;
   };
   view.forElements( 0, matrix.getRows(), f );  // or matrix.forElements
   std::cout << matrix << std::endl;
}
 
int
main( int argc, char* argv[] )
{
   std::cout << "Creating matrix on host: " << std::endl;
   forElementsExample< TNL::Devices::Host >();
 
#ifdef __CUDACC__
   std::cout << "Creating matrix on CUDA device: " << std::endl;
   forElementsExample< TNL::Devices::Cuda >();
#endif
}

On the line 41, we call the method forElements (TNL::Matrices::TridiagonalMatrix::forElements) instead of parallelFor. This method iterates over all matrix rows and all nonzero matrix elements. The lambda function on the line 24 therefore do not receive only the matrix row index but also local index of the matrix element (localIdx) which is a rank of the nonzero matrix element in given row - see Indexing of nonzero matrix elements in sparse matrices. Next parameter, columnIdx received by the lambda function, is the column index of the matrix element. The fourth parameter value is a reference on the matrix element which we use for its modification.

The result looks as follows:

Creating matrix on host: 
Row: 0 ->     0:2       1:1   
Row: 1 ->     0:3       1:2       2:1   
Row: 2 ->     1:3       2:2       3:1   
Row: 3 ->     2:3       3:2       4:1   
Row: 4 ->     3:3       4:2   
 
Creating matrix on CUDA device: 
Row: 0 ->     0:2       1:1   
Row: 1 ->     0:3       1:2       2:1   
Row: 2 ->     1:3       2:2       3:1   
Row: 3 ->     2:3       3:2       4:1   
Row: 4 ->     3:3       4:2   
 

Multidiagonal matrices

Multidiagonal matrices are generalization of the tridiagonal ones. It is a special type of sparse matrices with specific pattern of the nonzero matrix elements which are positioned only parallel along diagonal. See the following example:

\[ \left( \begin{array}{ccccccc} 4 & -1 & . & -1 & . & . \\ -1 & 4 & -1 & . & -1 & . \\ . & -1 & 4 & -1 & . & -1 \\ -1 & . & -1 & 4 & -1 & . \\ . & -1 & . & -1 & 4 & -1 \\ . & . & -1 & . & -1 & 4 \end{array} \right) \]

We can see that the matrix elements lay on lines parallel to the main diagonal. Such lines can be characterized by their offsets from the main diagonal. On the following figure, each such line is depicted in different color:

\[\begin{array}{ccc} \color{green}{-3} & . & \color{cyan}{-1} \\ \hline \color{green}{*} & . & \color{cyan}{*} \\ . & \color{green}{*} & . \\ . & . & \color{green}{*} \\ . & . & . \\ . & . & . \\ . & . & . \end{array} \left( \begin{array}{ccccccc} \color{blue}{0} & \color{magenta}{1} & . & \color{red}{3} & . & . \\ \hline \color{blue}{4} & \color{magenta}{-1} & . & \color{red}{-1} & . & . \\ \color{cyan}{-1} & \color{blue}{4} & \color{magenta}{-1} & . & \color{red}{-1} & . \\ . & \color{cyan}{-1} & \color{blue}{4} & \color{magenta}{-1} & . & \color{red}{-1} \\ \color{green}{-1} & . & \color{cyan}{-1} & \color{blue}{4} & \color{magenta}{-1} & . \\ . & \color{green}{-1} & . & \color{cyan}{-1} & \color{blue}{4} & \color{magenta}{-1} \\ . & . & \color{green}{-1} & . & \color{cyan}{-1} & \color{blue}{4} \end{array} \right) \]

In this matrix, the offsets reads as \(\{-3, -1, 0, +1, +3\}\). It also means that the column indexes on \(i-\)th row are \(\{i-3, i-1, i, i+1, i+3\}\) (where we accept only nonnegative indexes smaller than the number of matrix columns). An advantage is that, similar to the tridiagonal matrix (TNL::Matrices::TridiagonalMatrix), we do not store the column indexes explicitly as it is in TNL::Matrices::SparseMatrix. This can significantly reduce the memory requirements which also means better performance. See the following table for the storage requirements comparison between multidiagonal matrix (TNL::Matrices::MultidiagonalMatrix) and general sparse matrix (TNL::Matrices::SparseMatrix).

Real	Index	SparseMatrix	MultidiagonalMatrix	Ratio
float	32-bit int	8 bytes per element	4 bytes per element	50%
double	32-bit int	12 bytes per element	8 bytes per element	75%
float	64-bit int	12 bytes per element	4 bytes per element	30%
double	64-bit int	16 bytes per element	8 bytes per element	50%

For the sake of better memory alignment and faster access to the matrix elements, we store all subdiagonals in complete form including the elements which are outside the matrix as depicted on the following figure where zeros stand for the padding artificial zero matrix elements

\[\begin{array}{cccc} 0 & & & 0 \\ & 0 & & \\ & & 0 & \\ & & & 0 \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \end{array} \left( \begin{array}{cccccccccccccccc} 1 & 0 & & & 0 & & & & & & & & & & & \\ 0 & 1 & 0 & & & 0 & & & & & & & & & & \\ & 0 & 1 & 0 & & & 0 & & & & & & & & & \\ & & 0 & 1 & 0 & & & 0 & & & & & & & & \\ 0 & & & 0 & 1 & 0 & & & 0 & & & & & & & \\ & -1 & & & -1 & 1 & -1 & & & -1 & & & & & & \\ & & -1 & & & -1 & 1 & -1 & & & -1 & & & & & \\ & & & 0 & & & 0 & 1 & 0 & & & 0 & & & & \\ & & & & 0 & & & 0 & 1 & 0 & & & 0 & & & \\ & & & & & -1 & & & -1 & 1 & -1 & & & -1 & & \\ & & & & & & -1 & & & -1 & 1 & -1 & & & -1 & \\ & & & & & & & 0 & & & 0 & 1 & 0 & & & 0 \\ & & & & & & & & 0 & & & 0 & 1 & 0 & & \\ & & & & & & & & & 0 & & & 0 & 1 & 0 & \\ & & & & & & & & & & 0 & & & 0 & 1 & 0 \\ & & & & & & & & & & & 0 & & & 0 & 1 \end{array} \right) \begin{array} & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ 0 & & & \\ & 0 & & \\ & & 0 & \\ 0 & & & 0 \end{array} \]

Multidiagonal matrix is a templated class defined in the namespace TNL::Matrices. It has six template parameters:

Real is a type of the matrix elements. It is double by default.
Device is a device where the matrix shall be allocated. Currently it can be either TNL::Devices::Host for CPU or TNL::Devices::Cuda for CUDA supporting GPUs. It is TNL::Devices::Host by default.
Index is a type to be used for indexing of the matrix elements. It is int by default.
ElementsOrganization defines the organization of the matrix elements in memory. It can be TNL::Algorithms::Segments::ColumnMajorOrder or TNL::Algorithms::Segments::RowMajorOrder for column-major and row-major organization respectively. Be default, it is the row-major order if the matrix is allocated in the host system and column major order if it is allocated on GPU.
RealAllocator is a memory allocator (one from TNL::Allocators) which shall be used for allocation of the matrix elements. By default, it is the default allocator for given Real type and Device type – see TNL::Allocators::Default.
IndexAllocator is a memory allocator (one from TNL::Allocators) which shall be used for allocation of the matrix elements offsets. By default, it is the default allocator for given Index type and Device type – see TNL::Allocators::Default.

In the following text we show different methods how to setup multidiagonal matrices.

Initializer list

Smaller multidiagonal matrices can be created using the constructor with initializer lists (std::initializer_list) as we demonstrate in the following example:

#include <iostream>
#include <TNL/Matrices/MultidiagonalMatrix.h>
#include <TNL/Devices/Host.h>
#include <TNL/Devices/Cuda.h>
 
template< typename Device >
void
createMultidiagonalMatrix()
{
   const int matrixSize = 6;
 
   /*
    * Setup the following matrix (dots represent zeros):
    *
    * /  4 -1 .  -1  .  . \
    * | -1  4 -1  . -1  . |
    * |  . -1  4 -1  . -1 |
    * | -1  . -1  4 -1  . |
    * |  . -1  . -1  4 -1 |
    * \  .  .  1  . -1  4 /
    *
    * The diagonals offsets are { -3, -1, 0, 1, 3 }.
    */
   TNL::Matrices::MultidiagonalMatrix< double, Device > matrix(
      matrixSize,
      { -3, -1, 0, 1, 3 },
      {
         /*
          * To set the matrix elements we first extend the diagonals to their full
          * lengths even outside the matrix (dots represent zeros and zeros are
          * artificial zeros used for memory alignment):
          *
          * 0  .  0 /  4 -1 .  -1  .  . \              -> {  0,  0,  4, -1, -1 }
          * .  0  . | -1  4 -1  . -1  . | .            -> {  0, -1,  4, -1, -1 }
          * .  .  0 |  . -1  4 -1  . -1 | .  .         -> {  0, -1,  4, -1, -1 }
          *    .  . | -1  . -1  4 -1  . | 0  .  .      -> { -1, -1,  4, -1,  0 }
          *       . |  . -1  . -1  4 -1 | .  0  .  .   -> { -1, -1,  4, -1,  0 }
          *         \  .  .  1  . -1  4 / 0  .  0  . . -> { -1, -1,  4,  0,  0 }
          */
         // clang-format off
         {  0,  0,  4, -1, -1 },
         {  0, -1,  4, -1, -1 },
         {  0, -1,  4, -1, -1 },
         { -1, -1,  4, -1,  0 },
         { -1, -1,  4, -1,  0 },
         { -1, -1,  4,  0,  0 },
         // clang-format on
      } );
   std::cout << "The matrix reads as: " << std::endl << matrix << std::endl;
}
 
int
main( int argc, char* argv[] )
{
   std::cout << "Create multidiagonal matrix on CPU ... " << std::endl;
   createMultidiagonalMatrix< TNL::Devices::Host >();
 
#ifdef __CUDACC__
   std::cout << "Creating multidiagonal matrix on CUDA GPU ... " << std::endl;
   createMultidiagonalMatrix< TNL::Devices::Cuda >();
#endif
}

Here, we create a matrix which looks as

\[\left( \begin{array}{cccccc} 4 & -1 & & -1 & & \\ -1 & 4 & -1 & & -1 & \\ & -1 & 4 & -1 & & -1 \\ -1 & & -1 & 4 & -1 & \\ & -1 & & -1 & 4 & -1 \\ & & -1 & & -1 & 4 \\ \end{array} \right). \]

On the lines 25-46, we call the constructor which, in addition to matrix dimensions and subdiagonals offsets, accepts also initializer list of initializer lists with matrix elements values. Each embedded list corresponds to one matrix row and it contains values of matrix elements on particular subdiagonals including those which lies out of the matrix. The result looks as follows:

Create multidiagonal matrix on CPU ... 
The matrix reads as: 
Row: 0 ->     0:4       1:-1      3:-1  
Row: 1 ->     0:-1      1:4       2:-1      4:-1  
Row: 2 ->     1:-1      2:4       3:-1      5:-1  
Row: 3 ->     0:-1      2:-1      3:4       4:-1  
Row: 4 ->     1:-1      3:-1      4:4       5:-1  
Row: 5 ->     2:-1      4:-1      5:4   
 
Creating multidiagonal matrix on CUDA GPU ... 
The matrix reads as: 
Row: 0 ->     0:4       1:-1      3:-1  
Row: 1 ->     0:-1      1:4       2:-1      4:-1  
Row: 2 ->     1:-1      2:4       3:-1      5:-1  
Row: 3 ->     0:-1      2:-1      3:4       4:-1  
Row: 4 ->     1:-1      3:-1      4:4       5:-1  
Row: 5 ->     2:-1      4:-1      5:4   
 

Methods setElement and addElement

Another and more efficient way of setting the matrix elements is by means of the method setElement (TNL::Matrices::MultidiagonalMatrix::setElement). It is demonstrated in the following example:

#include <iostream>
#include <TNL/Algorithms/parallelFor.h>
#include <TNL/Matrices/MultidiagonalMatrix.h>
#include <TNL/Devices/Host.h>
#include <TNL/Devices/Cuda.h>
#include <TNL/Pointers/SharedPointer.h>
#include <TNL/Pointers/SmartPointersRegister.h>
 
template< typename Device >
void
setElements()
{
   const int matrixSize( 5 );
   auto diagonalsOffsets = { -1, 0, 1 };  // offsets of tridiagonal matrix
   using Matrix = TNL::Matrices::MultidiagonalMatrix< double, Device >;
   Matrix matrix( matrixSize, matrixSize, diagonalsOffsets );
   auto view = matrix.getView();
 
   for( int i = 0; i < 5; i++ )
      view.setElement( i, i, i );
 
   std::cout << "Matrix set from the host:" << std::endl;
   std::cout << matrix << std::endl;
 
   auto f = [ = ] __cuda_callable__( int i ) mutable
   {
      if( i > 0 )
         view.setElement( i, i - 1, 1.0 );
      view.setElement( i, i, -i );
      if( i < matrixSize - 1 )
         view.setElement( i, i + 1, 1.0 );
   };
 
   TNL::Algorithms::parallelFor< Device >( 0, matrixSize, f );
 
   std::cout << "Matrix set from its native device:" << std::endl;
   std::cout << matrix << std::endl;
}
 
int
main( int argc, char* argv[] )
{
   std::cout << "Set elements on host:" << std::endl;
   setElements< TNL::Devices::Host >();
 
#ifdef __CUDACC__
   std::cout << "Set elements on CUDA device:" << std::endl;
   setElements< TNL::Devices::Cuda >();
#endif
}

This examples shows that the method setElement can be used both on the host (CPU) (line 19) as well as in the lambda functions that can be called from GPU kernels (lines 25-29). For this purpose, we fetch a matrix view on the line 16. The result looks as follows:

Set elements on host:
Matrix set from the host:
Row: 0 -> 
Row: 1 ->     1:1   
Row: 2 ->     2:2   
Row: 3 ->     3:3   
Row: 4 ->     4:4   
 
Matrix set from its native device:
Row: 0 ->     1:1   
Row: 1 ->     0:1       1:-1      2:1   
Row: 2 ->     1:1       2:-2      3:1   
Row: 3 ->     2:1       3:-3      4:1   
Row: 4 ->     3:1       4:-4  
 
Set elements on CUDA device:
Matrix set from the host:
Row: 0 -> 
Row: 1 ->     1:1   
Row: 2 ->     2:2   
Row: 3 ->     3:3   
Row: 4 ->     4:4   
 
Matrix set from its native device:
Row: 0 ->     1:1   
Row: 1 ->     0:1       1:-1      2:1   
Row: 2 ->     1:1       2:-2      3:1   
Row: 3 ->     2:1       3:-3      4:1   
Row: 4 ->     3:1       4:-4  
 

Method getRow

Slightly more efficient way of the multidiagonal matrix setup is offered by the method getRow (TNL::Matrices::MultidiagonalMatrix::getRow). We will use it to create a matrix of the following form:

\[\left( \begin{array}{cccccccccccccccc} 1 & . & & & . & & & & & & & & & & & \\ . & 1 & . & & & . & & & & & & & & & & \\ & . & 1 & . & & & . & & & & & & & & & \\ & & . & 1 & . & & & . & & & & & & & & \\ . & & & . & 1 & . & & & . & & & & & & & \\ & -1 & & & -1 & -4 & -1 & & & -1 & & & & & & \\ & & -1 & & & -1 & -4 & -1 & & & -1 & & & & & \\ & & & . & & & . & 1 & . & & & . & & & & \\ & & & & . & & & . & 1 & . & & & . & & & \\ & & & & & -1 & & & -1 & -4 & -1 & & & -1 & & \\ & & & & & & -1 & & & -1 & -4 & -1 & & & -1 & \\ & & & & & & & . & & & . & 1 & . & & & . \\ & & & & & & & & . & & & . & 1 & . & & \\ & & & & & & & & & . & & & . & 1 & . & \\ & & & & & & & & & & . & & & . & 1 & . \\ & & & & & & & & & & & . & & & . & 1 \end{array} \right) \]

The matrices of this form arise from a discretization of the Laplace operator in 2D by the finite difference method. We use this example because it is a frequent numerical problem and the multidiagonal format is very suitable for such matrices. If the reader, however, is not familiar with the finite difference method, please, do not be scared, we will just create the matrix mentioned above. The code based on use of method getRow reads as:

#include <iostream>
#include <TNL/Algorithms/parallelFor.h>
#include <TNL/Containers/StaticArray.h>
#include <TNL/Matrices/MultidiagonalMatrix.h>
#include <TNL/Devices/Host.h>
#include <TNL/Devices/Cuda.h>
 
template< typename Device >
void
laplaceOperatorMatrix()
{
   /***
    * Set  matrix representing approximation of the Laplace operator on regular
    * grid using the finite difference method.
    */
   const int gridSize( 4 );
   const int matrixSize = gridSize * gridSize;
   TNL::Containers::Vector< int, Device > offsets{ -gridSize, -1, 0, 1, gridSize };
   TNL::Matrices::MultidiagonalMatrix< double, Device > matrix( matrixSize, matrixSize, offsets );
   auto matrixView = matrix.getView();
   auto f = [ = ] __cuda_callable__( const TNL::Containers::StaticArray< 2, int >& i ) mutable
   {
      const int elementIdx = i[ 1 ] * gridSize + i[ 0 ];
      auto row = matrixView.getRow( elementIdx );
      if( i[ 0 ] == 0 || i[ 1 ] == 0 || i[ 0 ] == gridSize - 1 || i[ 1 ] == gridSize - 1 )
         row.setElement( 2, 1.0 );  // set matrix elements corresponding to boundary grid nodes
                                    // and Dirichlet boundary conditions, i.e. 1 on the main diagonal
                                    // which is the third one
      else {
         row.setElement( 0, -1.0 );  // set matrix elements corresponding to inner grid nodes, i.e.
         row.setElement( 1, -1.0 );  // 4 on the main diagonal (the third one) and -1 to the other
         row.setElement( 2, 4.0 );   // sub-diagonals
         row.setElement( 3, -1.0 );
         row.setElement( 4, -1.0 );
      }
   };
   TNL::Containers::StaticArray< 2, int > begin = { 0, 0 };
   TNL::Containers::StaticArray< 2, int > end = { gridSize, gridSize };
   TNL::Algorithms::parallelFor< Device >( begin, end, f );
 
   std::cout << "Laplace operator matrix: " << std::endl << matrix << std::endl;
}
 
int
main( int argc, char* argv[] )
{
   std::cout << "Creating Laplace operator matrix on CPU ... " << std::endl;
   laplaceOperatorMatrix< TNL::Devices::Host >();
 
#ifdef __CUDACC__
   std::cout << "Creating Laplace operator matrix on CUDA GPU ... " << std::endl;
   laplaceOperatorMatrix< TNL::Devices::Cuda >();
#endif
}

We firstly compute the matrix size (matrixSize) based on the numerical grid dimensions on the line 16. The subdiagonals offsets are defined by the numerical grid size and since it is four in this example the offsets read as \(\left\{-4,-1,0,1,4 \right\} \) or { -gridSize, -1, 0, 1, gridSize} (line 17). Here we store the offsets in vector (TNL::Containers::Vector) called offsets. Next we use a constructor with matrix dimensions and offsets passed via TNL vector (line 18) and we fetch a matrix view (TNL::Matrices::MultidiagonalMatrixView, line 19).

The matrix is constructed by iterating over particular nodes of the numerical grid. Each node correspond to one matrix row. This is why the lambda function f (lines 20-35) take two indexes i and j (line 20). Their values are coordinates of the two-dimensional numerical grid. Based on these coordinates we compute index (elementIdx) of the corresponding matrix row (line 21). We fetch matrix row (row) by calling the getRow method (TNL::Matrices::MultidiagonalMatrix::getRow) (line 22). Depending on the grid node coordinates we set either the boundary conditions (lines 23-26) for the boundary nodes (those laying on the boundary of the grid and so their coordinates fulfil the condition i == 0 || j == 0 || i == gridSize - 1 || j == gridSize - 1 ) for which se set only the diagonal element to 1. The inner nodes of the numerical grid are handled on the lines 29-33 where we set coefficients approximating the Laplace operator. We use the method setElement of the matrix row (TNL::Matrices::MultidiagonalMatrixRowView::setElement) which takes the local index of the nonzero matrix element as the first parameter (see Indexing of nonzero matrix elements in sparse matrices) and the new value of the element as the second parameter. The local indexes, in fact, refer to particular subdiagonals as depicted on the following figure (in blue):

\[\begin{array}{cccc} \color{blue}{-4} & & & \color{blue}{-1} \\ \hline . & & & . \\ & . & & \\ & & . & \\ & & & . \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \end{array} \left( \begin{array}{cccccccccccccccc} \color{blue}{0} & \color{blue}{1} & & & \color{blue}{4} & & & & & & & & & & & \\ \hline 1 & . & & & . & & & & & & & & & & & \\ . & 1 & . & & & . & & & & & & & & & & \\ & . & 1 & . & & & . & & & & & & & & & \\ & & . & 1 & . & & & . & & & & & & & & \\ . & & & . & 1 & . & & & . & & & & & & & \\ & -1 & & & -1 & 1 & -1 & & & -1 & & & & & & \\ & & -1 & & & -1 & 1 & -1 & & & -1 & & & & & \\ & & & . & & & . & 1 & . & & & . & & & & \\ & & & & . & & & . & 1 & . & & & . & & & \\ & & & & & -1 & & & -1 & 1 & -1 & & & -1 & & \\ & & & & & & -1 & & & -1 & 1 & -1 & & & -1 & \\ & & & & & & & . & & & . & 1 & . & & & . \\ & & & & & & & & . & & & . & 1 & . & & \\ & & & & & & & & & . & & & . & 1 & . & \\ & & & & & & & & & & . & & & . & 1 & . \\ & & & & & & & & & & & . & & & . & 1 \end{array} \right) \]

We use parallelFor to iterate over all nodes of the numerical grid (line 36) and apply the lambda function. The result looks as follows:

Creating Laplace operator matrix on CPU ... 
Laplace operator matrix: 
Row: 0 ->     0:1   
Row: 1 ->     1:1   
Row: 2 ->     2:1   
Row: 3 ->     3:1   
Row: 4 ->     4:1   
Row: 5 ->     1:-1      4:-1      5:4       6:-1      9:-1  
Row: 6 ->     2:-1      5:-1      6:4       7:-1     10:-1  
Row: 7 ->     7:1   
Row: 8 ->     8:1   
Row: 9 ->     5:-1      8:-1      9:4      10:-1     13:-1  
Row: 10 ->     6:-1      9:-1     10:4      11:-1     14:-1  
Row: 11 ->    11:1   
Row: 12 ->    12:1   
Row: 13 ->    13:1   
Row: 14 ->    14:1   
Row: 15 ->    15:1   
 
Creating Laplace operator matrix on CUDA GPU ... 
Laplace operator matrix: 
Row: 0 ->     0:1   
Row: 1 ->     1:1   
Row: 2 ->     2:1   
Row: 3 ->     3:1   
Row: 4 ->     4:1   
Row: 5 ->     1:-1      4:-1      5:4       6:-1      9:-1  
Row: 6 ->     2:-1      5:-1      6:4       7:-1     10:-1  
Row: 7 ->     7:1   
Row: 8 ->     8:1   
Row: 9 ->     5:-1      8:-1      9:4      10:-1     13:-1  
Row: 10 ->     6:-1      9:-1     10:4      11:-1     14:-1  
Row: 11 ->    11:1   
Row: 12 ->    12:1   
Row: 13 ->    13:1   
Row: 14 ->    14:1   
Row: 15 ->    15:1   
 

Method forRows

As in the case of other matrix types, the method forRows (TNL::Matrices::MultidiagonalMatrix::forRows) calls the method getRow (TNL::Matrices::MultidiagonalMatrix::getRow) in parallel. It is demonstrated by the following example:

#include <iostream>
#include <TNL/Matrices/MultidiagonalMatrix.h>
#include <TNL/Devices/Host.h>
#include <TNL/Devices/Cuda.h>
 
template< typename Device >
void
forRowsExample()
{
   using MatrixType = TNL::Matrices::MultidiagonalMatrix< double, Device >;
   /***
    * Set the following matrix (dots represent zero matrix elements and zeros are
    * padding zeros for memory alignment):
    *
    *    0 /  2  .  .  .  . \  -> { 0, 0, 1 }
    *      | -1  2 -1  .  . |  -> { 0, 2, 1 }
    *      |  . -1  2 -1. . |  -> { 3, 2, 1 }
    *      |  .  . -1  2 -1 |  -> { 3, 2, 1 }
    *      \  .  .  .  .  2 /  -> { 3, 2, 1 }
    *
    * The diagonals offsets are { -1, 0, 1 }.
    */
   const int size = 5;
   MatrixType matrix( size,            // number of matrix rows
                      size,            // number of matrix columns
                      { -1, 0, 1 } );  // matrix diagonals offsets
   auto view = matrix.getView();
 
   auto f = [] __cuda_callable__( typename MatrixType::RowView & row )
   {
      const int& rowIdx = row.getRowIndex();
      if( rowIdx > 0 )
         row.setElement( 0, -1.0 );  // elements below the diagonal
      row.setElement( 1, 2.0 );      // elements on the diagonal
      if( rowIdx < size - 1 )        // elements above the diagonal
         row.setElement( 2, -1.0 );
   };
   view.forAllRows( f );  // or matrix.forAllRows
   std::cout << matrix << std::endl;
}
 
int
main( int argc, char* argv[] )
{
   std::cout << "Creating matrix on host: " << std::endl;
   forRowsExample< TNL::Devices::Host >();
 
#ifdef __CUDACC__
   std::cout << "Creating matrix on CUDA device: " << std::endl;
   forRowsExample< TNL::Devices::Cuda >();
#endif
}

We call the method forAllRows (TNL::Matrices::MultidiagonalMatrix::forAllRows) (line 36) instead of parallelFor which is simpler since we do not have to state the device type explicitly. The method forAllRows calls the method forRows for all matrix rows so we do not have to state explicitly the interval of matrix rows neither. The lambda function f (lines 28-35) accepts one parameter row of the type RowView (TNL::Matrices::MultidiagonalMatrix::RowView which is TNL::Matrices::MultidiagonalMatrixRowView). At the beginning of the lambda function, we call the method geRowIndex (TNL::Matrices::MultidiagonalMatrixRowView::getRowIndex) to get the index of the matrix row (line 29).

The result looks as follows:

Creating matrix on host: 
Row: 0 ->     0:2       1:-1  
Row: 1 ->     0:-1      1:2       2:-1  
Row: 2 ->     1:-1      2:2       3:-1  
Row: 3 ->     2:-1      3:2       4:-1  
Row: 4 ->     3:-1      4:2   
 
Creating matrix on CUDA device: 
Row: 0 ->     0:2       1:-1  
Row: 1 ->     0:-1      1:2       2:-1  
Row: 2 ->     1:-1      2:2       3:-1  
Row: 3 ->     2:-1      3:2       4:-1  
Row: 4 ->     3:-1      4:2   
 

Method forElements

Similar and even a bit simpler way of setting the matrix elements is offered by the method forElements (TNL::Matrices::MultidiagonalMatrix::forElements, TNL::Matrices::MultidiagonalMatrixView::forElements) as demonstrated in the following example:

#include <iostream>
#include <TNL/Matrices/MultidiagonalMatrix.h>
#include <TNL/Devices/Host.h>
#include <TNL/Devices/Cuda.h>
 
template< typename Device >
void
forElementsExample()
{
   /***
    * Set the following matrix (dots represent zero matrix elements and zeros are
    * padding zeros for memory alignment):
    *
    * 0  0 / 1  .  .  .  . \  -> { 0, 0, 1 }
    *    0 | 2  1  .  .  . |  -> { 0, 2, 1 }
    *      | 3  2  1  .  . |  -> { 3, 2, 1 }
    *      | .  3  2  1  . |  -> { 3, 2, 1 }
    *      \ .  .  3  2  1 /  -> { 3, 2, 1 }
    *
    * The diagonals offsets are { -2, -1, 0 }.
    */
   TNL::Matrices::MultidiagonalMatrix< double, Device > matrix( 5,                // number of matrix rows
                                                                5,                // number of matrix columns
                                                                { -2, -1, 0 } );  // matrix diagonals offsets
   auto view = matrix.getView();
 
   auto f = [] __cuda_callable__( int rowIdx, int localIdx, int columnIdx, double& value )
   {
      /***
       * 'forElements' method iterates only over matrix elements lying on given subdiagonals
       * and so we do not need to check anything. The element value can be expressed
       * by the 'localIdx' variable, see the following figure:
       *
       *                              0  1  2  <- localIdx values
       *                              -------
       * 0  0 / 1  .  .  .  . \  -> { 0, 0, 1 }
       *    0 | 2  1  .  .  . |  -> { 0, 2, 1 }
       *      | 3  2  1  .  . |  -> { 3, 2, 1 }
       *      | .  3  2  1  . |  -> { 3, 2, 1 }
       *      \ .  .  3  2  1 /  -> { 3, 2, 1 }
       *
       */
      value = 3 - localIdx;
   };
   view.forElements( 0, matrix.getRows(), f );  // or matrix.forElements
   std::cout << matrix << std::endl;
}
 
int
main( int argc, char* argv[] )
{
   std::cout << "Creating matrix on host: " << std::endl;
   forElementsExample< TNL::Devices::Host >();
 
#ifdef __CUDACC__
   std::cout << "Creating matrix on CUDA device: " << std::endl;
   forElementsExample< TNL::Devices::Cuda >();
#endif
}

In this case, we need to provide a lambda function f (lines 27-43) which is called for each matrix row just by the method forElements (line 44). The lambda function f provides the following parameters

rowIdx is an index iof the matrix row.
localIdx is in index of the matrix subdiagonal.
columnIdx is a column index of the matrix element.
value is a reference to the matrix element value. It can be used even for changing the value.

In this example, the matrix element value depends only on the subdiagonal index localIdx (see Indexing of nonzero matrix elements in sparse matrices) as we can see on the line 42. The result looks as follows:

Creating matrix on host: 
Row: 0 ->     0:1   
Row: 1 ->     0:2       1:1   
Row: 2 ->     0:3       1:2       2:1   
Row: 3 ->     1:3       2:2       3:1   
Row: 4 ->     2:3       3:2       4:1   
 
Creating matrix on CUDA device: 
Row: 0 ->     0:1   
Row: 1 ->     0:2       1:1   
Row: 2 ->     0:3       1:2       2:1   
Row: 3 ->     1:3       2:2       3:1   
Row: 4 ->     2:3       3:2       4:1   
 

Lambda matrices

Lambda matrix (TNL::Matrices::LambdaMatrix) is a special type of matrix which could be also called matrix-free matrix. The matrix elements are not stored in memory explicitly but they are evaluated on-the-fly by means of user defined lambda functions. If the matrix elements can be expressed by computationally not expansive formula, we can significantly reduce the memory consumption which is appreciated especially on GPUs. Since the memory accesses are quite expensive even on both CPU and GPU, we can get, at the end, even much faster code.

The lambda matrix (TNL::Matrices::LambdaMatrix) is a templated class with the following template parameters:

MatrixElementsLambda is a lambda function which evaluates the matrix elements values and column indexes.
CompressedRowLengthsLambda is a lambda function telling how many nonzero elements are there in given matrix row.
Real is a type of matrix elements values.
Device is a device on which the lambda functions mentioned above will be evaluated.
Index is a type to be used for indexing.

The lambda function MatrixElementsLambda is supposed to have the following declaration:

1auto matrixElements = [=

2] __cuda_callable__( Index rows, Index columns, Index row, Index localIdx, Index& columnIdx, Real& value );

where the particular parameters have the following meaning:

rows tells the number of matrix rows.
columns tells the number of matrix columns.
rowIdx is index of the matrix row in which we are supposed to evaluate the matrix element.
localIdx is a rank of the nonzero matrix element, see Indexing of nonzero matrix elements in sparse matrices.
columnIdx is a reference on variable where we are supposed to store the matrix element column index.
value is a reference on variable where we are supposed to store the matrix element value.

The lambda function CompressedRowLengthsLambda (by compressed row length we mean the number of matrix elements in a row after ignoring/compressing the zero elements) is supposed to be declared like this:

1auto compressedRowLengths = [=] __cuda_callable__( Index rows, Index columns, Index row ) -> Index;

where the parameters can be described as follows:

rows tells the number of matrix rows.
columns tells the number of matrix columns.
rowIdx is index of the matrix row for which we are supposed to evaluate the number of nonzero matrix elements.

The lambda function is supposed to return just the number of the nonzero matrix elements in given matrix row.

Lambda matrix inititation

How to put the lambda functions together with the lambda matrix is demonstrated in the following example:

#include <iostream>
#include <TNL/Matrices/LambdaMatrix.h>
 
int
main( int argc, char* argv[] )
{
   /***
    * Lambda functions defining the matrix.
    */
   auto compressedRowLengths = [ = ] __cuda_callable__( const int rows, const int columns, const int rowIdx ) -> int
   {
      return 1;
   };
   auto matrixElements1 =
      [ = ] __cuda_callable__(
         const int rows, const int columns, const int rowIdx, const int localIdx, int& columnIdx, double& value )
   {
      columnIdx = rowIdx;
      value = 1.0;
   };
   auto matrixElements2 =
      [ = ] __cuda_callable__(
         const int rows, const int columns, const int rowIdx, const int localIdx, int& columnIdx, double& value )
   {
      columnIdx = rowIdx;
      value = rowIdx;
   };
 
   const int size = 5;
 
   /***
    * Matrix construction with explicit type definition.
    */
   using MatrixType = decltype( TNL::Matrices::LambdaMatrixFactory< double, TNL::Devices::Host, int >::create(
      matrixElements1, compressedRowLengths ) );
   MatrixType m1( size, size, matrixElements1, compressedRowLengths );
 
   /***
    * Matrix construction using 'auto'.
    */
   auto m2 =
      TNL::Matrices::LambdaMatrixFactory< double, TNL::Devices::Host, int >::create( matrixElements2, compressedRowLengths );
   m2.setDimensions( size, size );
 
   std::cout << "The first lambda matrix: " << std::endl << m1 << std::endl;
   std::cout << "The second lambda matrix: " << std::endl << m2 << std::endl;
}

Here we create two simple diagonal matrices. Therefore they share the same lambda function compressedRowLengths telling the number of nonzero matrix elements in particular matrix rows which is always one (line 9). The first matrix, defined by the lambda function matrixElements1, is identity matrix and so its each diagonal element equals one. We set the matrix element value to 1.0 (line 12) and the column index equals the row index (line 15). The second matrix, defined by the lambda function matrixElements2, is also diagonal but not the identity matrix. The values of the diagonal elements equal to row index (line 16).

With the same lambda functions we can define matrices with different dimensions. In this example, we set the matrix size to five (line 19). It could be quite difficult to express the lambda matrix type because it depends on the types of the lambda functions. To make this easier, one may use the lambda-matrix factory (TNL::Matrices::LambdaMatrixFactory). Using decltype one can deduce even the matrix type (line 24) followed by calling lambda matrix constructor with matrix dimensions and instances of the lambda functions (line 25). Or one can just simply employ the keyword auto (line 30) followed by setting the matrix dimensions (line 31).

The result looks as follows:

The first lambda matrix: 
Row: 0 ->     0:1   
Row: 1 ->     1:1   
Row: 2 ->     2:1   
Row: 3 ->     3:1   
Row: 4 ->     4:1   
 
The second lambda matrix: 
Row: 0 -> 
Row: 1 ->     1:1   
Row: 2 ->     2:2   
Row: 3 ->     3:3   
Row: 4 ->     4:4   
 

Method forRows

Method forRows (TNL::Matrices::LambdaMatrix::forRows, TNL::Matrices::LambdaMatrix::forAllRows) iterates in parallel over all matrix rows. In the case of lambda matrices, it cannot be used for changing the matrix elements since they cannot be changed. In the following example, we show how to use this method to copy the matrix elements values to the dense matrix:

#include <iostream>
#include <TNL/Matrices/LambdaMatrix.h>
#include <TNL/Matrices/DenseMatrix.h>
#include <TNL/Devices/Host.h>
#include <TNL/Devices/Cuda.h>
 
template< typename Device >
void
forRowsExample()
{
   /***
    * Set  matrix representing approximation of the Laplace operator on regular
    * grid using the finite difference method.
    */
   const int gridSize( 4 );
   const int matrixSize = gridSize * gridSize;
   auto rowLengths = [ = ] __cuda_callable__( const int rows, const int columns, const int rowIdx ) -> int
   {
      const int gridRow = rowIdx / gridSize;  // coordinates in the numerical grid
      const int gridColumn = rowIdx % gridSize;
      if( gridRow == 0 || gridRow == gridSize - 1 ||  // boundary grid node
          gridColumn == 0 || gridColumn == gridSize - 1 )
         return 1;
      return 5;
   };
   auto matrixElements =
      [ = ] __cuda_callable__(
         const int rows, const int columns, const int rowIdx, const int localIdx, int& columnIdx, double& value )
   {
      const int gridRow = rowIdx / gridSize;  // coordinates in the numerical grid
      const int gridColumn = rowIdx % gridSize;
      if( gridRow == 0 || gridRow == gridSize - 1 ||  // boundary grid node
          gridColumn == 0 || gridColumn == gridSize - 1 )
      {
         columnIdx = rowIdx;  // diagonal element ....
         value = 1.0;         // ... is set to 1
      }
      else  // interior grid node
      {
         switch( localIdx )  // set diagonal element to 4
         {                   // and the others to -1
            case 0:
               columnIdx = rowIdx - gridSize;
               value = 1;
               break;
            case 1:
               columnIdx = rowIdx - 1;
               value = 1;
               break;
            case 2:
               columnIdx = rowIdx;
               value = -4;
               break;
            case 3:
               columnIdx = rowIdx + 1;
               value = 1;
               break;
            case 4:
               columnIdx = rowIdx + gridSize;
               value = 1;
               break;
         }
      }
   };
   auto matrix =
      TNL::Matrices::LambdaMatrixFactory< double, Device, int >::create( matrixSize, matrixSize, matrixElements, rowLengths );
   using MatrixType = decltype( matrix );
   using RowView = typename MatrixType::RowView;
 
   TNL::Matrices::DenseMatrix< double, Device > denseMatrix( matrixSize, matrixSize );
   denseMatrix.setValue( 0.0 );
   auto dense_view = denseMatrix.getView();
   auto f = [ = ] __cuda_callable__( const RowView& row ) mutable
   {
      auto dense_row = dense_view.getRow( row.getRowIndex() );
      for( int localIdx = 0; localIdx < row.getSize(); localIdx++ )
         dense_row.setValue( row.getColumnIndex( localIdx ), row.getValue( localIdx ) );
   };
   matrix.forAllRows( f );
 
   std::cout << "Laplace operator lambda matrix: " << std::endl << matrix << std::endl;
   std::cout << "Laplace operator dense matrix: " << std::endl << denseMatrix << std::endl;
 
   /***
    * Compute sum of elements in each row and store it into a vector.
    */
   TNL::Containers::Vector< double, Device > sum_vector( matrixSize );
   auto sum_view = sum_vector.getView();
   matrix.forAllRows(
      [ = ] __cuda_callable__( const RowView& row ) mutable
      {
         double sum( 0.0 );
         for( auto element : row )
            sum += TNL::abs( element.value() );
         sum_view[ row.getRowIndex() ] = sum;
      } );
 
   std::cout << "Sums in matrix rows = " << sum_vector << std::endl;
}
 
int
main( int argc, char* argv[] )
{
   std::cout << "Running example on CPU ... " << std::endl;
   forRowsExample< TNL::Devices::Host >();
 
#ifdef __CUDACC__
   std::cout << "Running example on CUDA GPU ... " << std::endl;
   forRowsExample< TNL::Devices::Cuda >();
#endif
}

We start with the lambda functions (lines 17-61) defining the elements of the lambda matrix. Next, we create the lambda matrix matrix (lines 62-64) and the dense matrix denseMatrix (lines 67-68) together with the dense matrix view (line 69). The lambda function f (lines 70-74) serves for copying matrix elements from the lambda matrix to the dense matrix. The process of matrix elements copying is started by calling the method forAllRows (TNL::Matrices::LambdaMatrix::forRows, TNL::Matrices::LambdaMatrix::forAllRows) (line 75).

Note, however, that use of forElements method (TNL::Matrices::LambdaMatrix::forElements) would be more convenient.

Next, we compute sum of absolute values of matrix elements in each row and store it in a vector (lines 83-90). Firstly we create the vector sum_vector for storing the sums (line 83) and get a vector view sum_view to get access to the vector from a lambda function. On the lines 85-90, we call lambda function for each matrix row which iterates over all matrix elements and sum their absolute values. Finally we store the result to the output vector (line 92).

The result looks as follows:

Running example on CPU ... 
Laplace operator lambda matrix: 
Row: 0 ->     0:1   
Row: 1 ->     1:1   
Row: 2 ->     2:1   
Row: 3 ->     3:1   
Row: 4 ->     4:1   
Row: 5 ->     1:1       4:1       5:-4      6:1       9:1   
Row: 6 ->     2:1       5:1       6:-4      7:1      10:1   
Row: 7 ->     7:1   
Row: 8 ->     8:1   
Row: 9 ->     5:1       8:1       9:-4     10:1      13:1   
Row: 10 ->     6:1       9:1      10:-4     11:1      14:1   
Row: 11 ->    11:1   
Row: 12 ->    12:1   
Row: 13 ->    13:1   
Row: 14 ->    14:1   
Row: 15 ->    15:1   
 
Laplace operator dense matrix: 
Row: 0 ->     0:1       1:0       2:0       3:0       4:0       5:0       6:0       7:0       8:0       9:0      10:0      11:0      12:0      13:0      14:0      15:0   
Row: 1 ->     0:0       1:1       2:0       3:0       4:0       5:0       6:0       7:0       8:0       9:0      10:0      11:0      12:0      13:0      14:0      15:0   
Row: 2 ->     0:0       1:0       2:1       3:0       4:0       5:0       6:0       7:0       8:0       9:0      10:0      11:0      12:0      13:0      14:0      15:0   
Row: 3 ->     0:0       1:0       2:0       3:1       4:0       5:0       6:0       7:0       8:0       9:0      10:0      11:0      12:0      13:0      14:0      15:0   
Row: 4 ->     0:0       1:0       2:0       3:0       4:1       5:0       6:0       7:0       8:0       9:0      10:0      11:0      12:0      13:0      14:0      15:0   
Row: 5 ->     0:0       1:1       2:0       3:0       4:1       5:-4      6:1       7:0       8:0       9:1      10:0      11:0      12:0      13:0      14:0      15:0   
Row: 6 ->     0:0       1:0       2:1       3:0       4:0       5:1       6:-4      7:1       8:0       9:0      10:1      11:0      12:0      13:0      14:0      15:0   
Row: 7 ->     0:0       1:0       2:0       3:0       4:0       5:0       6:0       7:1       8:0       9:0      10:0      11:0      12:0      13:0      14:0      15:0   
Row: 8 ->     0:0       1:0       2:0       3:0       4:0       5:0       6:0       7:0       8:1       9:0      10:0      11:0      12:0      13:0      14:0      15:0   
Row: 9 ->     0:0       1:0       2:0       3:0       4:0       5:1       6:0       7:0       8:1       9:-4     10:1      11:0      12:0      13:1      14:0      15:0   
Row: 10 ->     0:0       1:0       2:0       3:0       4:0       5:0       6:1       7:0       8:0       9:1      10:-4     11:1      12:0      13:0      14:1      15:0   
Row: 11 ->     0:0       1:0       2:0       3:0       4:0       5:0       6:0       7:0       8:0       9:0      10:0      11:1      12:0      13:0      14:0      15:0   
Row: 12 ->     0:0       1:0       2:0       3:0       4:0       5:0       6:0       7:0       8:0       9:0      10:0      11:0      12:1      13:0      14:0      15:0   
Row: 13 ->     0:0       1:0       2:0       3:0       4:0       5:0       6:0       7:0       8:0       9:0      10:0      11:0      12:0      13:1      14:0      15:0   
Row: 14 ->     0:0       1:0       2:0       3:0       4:0       5:0       6:0       7:0       8:0       9:0      10:0      11:0      12:0      13:0      14:1      15:0   
Row: 15 ->     0:0       1:0       2:0       3:0       4:0       5:0       6:0       7:0       8:0       9:0      10:0      11:0      12:0      13:0      14:0      15:1   
Sums in matrix rows = [ 1, 1, 1, 1, 1, 8, 8, 1, 1, 8, 8, 1, 1, 1, 1, 1 ]
Running example on CUDA GPU ... 
Laplace operator lambda matrix: 
Row: 0 ->     0:1   
Row: 1 ->     1:1   
Row: 2 ->     2:1   
Row: 3 ->     3:1   
Row: 4 ->     4:1   
Row: 5 ->     1:1       4:1       5:-4      6:1       9:1   
Row: 6 ->     2:1       5:1       6:-4      7:1      10:1   
Row: 7 ->     7:1   
Row: 8 ->     8:1   
Row: 9 ->     5:1       8:1       9:-4     10:1      13:1   
Row: 10 ->     6:1       9:1      10:-4     11:1      14:1   
Row: 11 ->    11:1   
Row: 12 ->    12:1   
Row: 13 ->    13:1   
Row: 14 ->    14:1   
Row: 15 ->    15:1   
 
Laplace operator dense matrix: 
Row: 0 ->     0:1       1:0       2:0       3:0       4:0       5:0       6:0       7:0       8:0       9:0      10:0      11:0      12:0      13:0      14:0      15:0   
Row: 1 ->     0:0       1:1       2:0       3:0       4:0       5:0       6:0       7:0       8:0       9:0      10:0      11:0      12:0      13:0      14:0      15:0   
Row: 2 ->     0:0       1:0       2:1       3:0       4:0       5:0       6:0       7:0       8:0       9:0      10:0      11:0      12:0      13:0      14:0      15:0   
Row: 3 ->     0:0       1:0       2:0       3:1       4:0       5:0       6:0       7:0       8:0       9:0      10:0      11:0      12:0      13:0      14:0      15:0   
Row: 4 ->     0:0       1:0       2:0       3:0       4:1       5:0       6:0       7:0       8:0       9:0      10:0      11:0      12:0      13:0      14:0      15:0   
Row: 5 ->     0:0       1:1       2:0       3:0       4:1       5:-4      6:1       7:0       8:0       9:1      10:0      11:0      12:0      13:0      14:0      15:0   
Row: 6 ->     0:0       1:0       2:1       3:0       4:0       5:1       6:-4      7:1       8:0       9:0      10:1      11:0      12:0      13:0      14:0      15:0   
Row: 7 ->     0:0       1:0       2:0       3:0       4:0       5:0       6:0       7:1       8:0       9:0      10:0      11:0      12:0      13:0      14:0      15:0   
Row: 8 ->     0:0       1:0       2:0       3:0       4:0       5:0       6:0       7:0       8:1       9:0      10:0      11:0      12:0      13:0      14:0      15:0   
Row: 9 ->     0:0       1:0       2:0       3:0       4:0       5:1       6:0       7:0       8:1       9:-4     10:1      11:0      12:0      13:1      14:0      15:0   
Row: 10 ->     0:0       1:0       2:0       3:0       4:0       5:0       6:1       7:0       8:0       9:1      10:-4     11:1      12:0      13:0      14:1      15:0   
Row: 11 ->     0:0       1:0       2:0       3:0       4:0       5:0       6:0       7:0       8:0       9:0      10:0      11:1      12:0      13:0      14:0      15:0   
Row: 12 ->     0:0       1:0       2:0       3:0       4:0       5:0       6:0       7:0       8:0       9:0      10:0      11:0      12:1      13:0      14:0      15:0   
Row: 13 ->     0:0       1:0       2:0       3:0       4:0       5:0       6:0       7:0       8:0       9:0      10:0      11:0      12:0      13:1      14:0      15:0   
Row: 14 ->     0:0       1:0       2:0       3:0       4:0       5:0       6:0       7:0       8:0       9:0      10:0      11:0      12:0      13:0      14:1      15:0   
Row: 15 ->     0:0       1:0       2:0       3:0       4:0       5:0       6:0       7:0       8:0       9:0      10:0      11:0      12:0      13:0      14:0      15:1   
Sums in matrix rows = [ 1, 1, 1, 1, 1, 8, 8, 1, 1, 8, 8, 1, 1, 1, 1, 1 ]

Method forElements

The lambda matrix has the same interface as other matrix types except of the method getRow. The following example demonstrates the use of the method forElements (TNL::Matrices::LambdaMatrix::forElements) to copy the lambda matrix into the dense matrix:

#include <iostream>
#include <TNL/Matrices/DenseMatrix.h>
#include <TNL/Matrices/LambdaMatrix.h>
#include <TNL/Devices/Host.h>
#include <TNL/Devices/Cuda.h>
 
template< typename Device >
void
forElementsExample()
{
   /***
    * Lambda functions defining the matrix.
    */
   auto rowLengths = [ = ] __cuda_callable__( const int rows, const int columns, const int rowIdx ) -> int
   {
      return columns;
   };
   auto matrixElements =
      [ = ] __cuda_callable__(
         const int rows, const int columns, const int rowIdx, const int localIdx, int& columnIdx, double& value )
   {
      columnIdx = localIdx;
      value = TNL::max( rowIdx - columnIdx + 1, 0 );
   };
 
   using MatrixFactory = TNL::Matrices::LambdaMatrixFactory< double, Device, int >;
   auto matrix = MatrixFactory::create( 5, 5, matrixElements, rowLengths );
 
   TNL::Matrices::DenseMatrix< double, Device > denseMatrix( 5, 5 );
   auto denseView = denseMatrix.getView();
 
   auto f = [ = ] __cuda_callable__( int rowIdx, int localIdx, int columnIdx, double value ) mutable
   {
      denseView.setElement( rowIdx, columnIdx, value );
   };
 
   matrix.forElements( 0, matrix.getRows(), f );
   std::cout << "Original lambda matrix:" << std::endl << matrix << std::endl;
   std::cout << "Dense matrix:" << std::endl << denseMatrix << std::endl;
}
 
int
main( int argc, char* argv[] )
{
   std::cout << "Copying matrix on host: " << std::endl;
   forElementsExample< TNL::Devices::Host >();
 
#ifdef __CUDACC__
   std::cout << "Copying matrix on CUDA device: " << std::endl;
   forElementsExample< TNL::Devices::Cuda >();
#endif
}

Here, we treat the lambda matrix as if it was dense matrix and so the lambda function compressedRowLengths returns the number of the nonzero elements equal to the number of matrix columns (line 13). However, the lambda function matrixElements (lines 14-17), sets nonzero values only to lower triangular part of the matrix. The elements in the upper part are equal to zero (line 16). Next we create an instance of the lambda matrix with a help of the lambda matrix factory (TNL::Matrices::LambdaMatrixFactory) (lines 19-20) and an instance of the dense matrix (TNL::Matrices::DenseMatrix) (lines 22-23).

Next we call the lambda function f by the method forElements (TNL::Matrices::LambdaMatrix::forElements) to set the matrix elements of the dense matrix denseMatrix (line 26) via the dense matrix view (denseView) (TNL::Matrices::DenseMatrixView). Note, that in the lambda function f we get the matrix element value already evaluated in the variable value as we are used to from other matrix types. So in fact, the same lambda function f would do the same job even for sparse matrix or any other. Also note, that in this case we iterate even over all zero matrix elements because the lambda function compressedRowLengths (line 13) tells so. The result looks as follows:

Copying matrix on host: 
Original lambda matrix:
Row: 0 ->     0:1   
Row: 1 ->     0:2       1:1   
Row: 2 ->     0:3       1:2       2:1   
Row: 3 ->     0:4       1:3       2:2       3:1   
Row: 4 ->     0:5       1:4       2:3       3:2       4:1   
 
Dense matrix:
Row: 0 ->     0:1       1:0       2:0       3:0       4:0   
Row: 1 ->     0:2       1:1       2:0       3:0       4:0   
Row: 2 ->     0:3       1:2       2:1       3:0       4:0   
Row: 3 ->     0:4       1:3       2:2       3:1       4:0   
Row: 4 ->     0:5       1:4       2:3       3:2       4:1   
Copying matrix on CUDA device: 
Original lambda matrix:
Row: 0 ->     0:1   
Row: 1 ->     0:2       1:1   
Row: 2 ->     0:3       1:2       2:1   
Row: 3 ->     0:4       1:3       2:2       3:1   
Row: 4 ->     0:5       1:4       2:3       3:2       4:1   
 
Dense matrix:
Row: 0 ->     0:1       1:0       2:0       3:0       4:0   
Row: 1 ->     0:2       1:1       2:0       3:0       4:0   
Row: 2 ->     0:3       1:2       2:1       3:0       4:0   
Row: 3 ->     0:4       1:3       2:2       3:1       4:0   
Row: 4 ->     0:5       1:4       2:3       3:2       4:1   

At the end of this part, we show two more examples, how to express a matrix approximating the Laplace operator:

#include <iostream>
#include <TNL/Matrices/LambdaMatrix.h>
#include <TNL/Devices/Host.h>
#include <TNL/Devices/Cuda.h>
 
template< typename Device >
void
laplaceOperatorMatrix()
{
   /***
    * Set  matrix representing approximation of the Laplace operator on regular
    * grid using the finite difference method.
    */
   const int gridSize( 4 );
   const int matrixSize = gridSize * gridSize;
   auto rowLengths = [ = ] __cuda_callable__( const int rows, const int columns, const int rowIdx ) -> int
   {
      const int gridRow = rowIdx / gridSize;  // coordinates in the numerical grid
      const int gridColumn = rowIdx % gridSize;
      if( gridRow == 0 || gridRow == gridSize - 1 ||  // boundary grid node
          gridColumn == 0 || gridColumn == gridSize - 1 )
         return 1;
      return 5;
   };
   auto matrixElements =
      [ = ] __cuda_callable__(
         const int rows, const int columns, const int rowIdx, const int localIdx, int& columnIdx, double& value )
   {
      const int gridRow = rowIdx / gridSize;  // coordinates in the numerical grid
      const int gridColumn = rowIdx % gridSize;
      if( gridRow == 0 || gridRow == gridSize - 1 ||  // boundary grid node
          gridColumn == 0 || gridColumn == gridSize - 1 )
      {
         columnIdx = rowIdx;  // diagonal element ....
         value = 1.0;         // ... is set to 1
      }
      else  // interior grid node
      {
         switch( localIdx )  // set diagonal element to 4
         {                   // and the others to -1
            case 0:
               columnIdx = rowIdx - gridSize;
               value = 1;
               break;
            case 1:
               columnIdx = rowIdx - 1;
               value = 1;
               break;
            case 2:
               columnIdx = rowIdx;
               value = -4;
               break;
            case 3:
               columnIdx = rowIdx + 1;
               value = 1;
               break;
            case 4:
               columnIdx = rowIdx + gridSize;
               value = 1;
               break;
         }
      }
   };
   auto matrix =
      TNL::Matrices::LambdaMatrixFactory< double, Device, int >::create( matrixSize, matrixSize, matrixElements, rowLengths );
   std::cout << "Laplace operator matrix: " << std::endl << matrix << std::endl;
}
 
int
main( int argc, char* argv[] )
{
   std::cout << "Creating Laplace operator matrix on CPU ... " << std::endl;
   laplaceOperatorMatrix< TNL::Devices::Host >();
 
#ifdef __CUDACC__
   std::cout << "Creating Laplace operator matrix on CUDA GPU ... " << std::endl;
   laplaceOperatorMatrix< TNL::Devices::Cuda >();
#endif
}

The following is another way of doing the same but with precomputed supporting vectors:

#include <iostream>
#include <TNL/Matrices/LambdaMatrix.h>
#include <TNL/Devices/Host.h>
#include <TNL/Devices/Cuda.h>
 
template< typename Device >
void
laplaceOperatorMatrix()
{
   /***
    * Set  matrix representing approximation of the Laplace operator on regular
    * grid using the finite difference method.
    */
   const int gridSize( 4 );
   const int matrixSize = gridSize * gridSize;
   TNL::Containers::Vector< int, Device > columnOffsets{
      0, -1, 0, 1, 0
   };  // helper vector for getting matrix elements column indexes
   columnOffsets.setElement( 0, -gridSize );
   columnOffsets.setElement( 4, gridSize );
   auto columnOffsetsView = columnOffsets.getView();
   TNL::Containers::Vector< double, Device > values{ 1, 1, -4, 1, 1 };  // helper vector for getting matrix elements values
   auto valuesView = values.getView();
 
   auto rowLengths = [ = ] __cuda_callable__( const int rows, const int columns, const int rowIdx ) -> int
   {
      const int gridRow = rowIdx / gridSize;  // coordinates in the numerical grid
      const int gridColumn = rowIdx % gridSize;
 
      if( gridRow == 0 || gridRow == gridSize - 1 ||  // boundary grid node
          gridColumn == 0 || gridColumn == gridSize - 1 )
         return 1;
      return 5;
   };
 
   auto matrixElements =
      [ = ] __cuda_callable__(
         const int rows, const int columns, const int rowIdx, const int localIdx, int& columnIdx, double& value )
   {
      const int gridRow = rowIdx / gridSize;  // coordinates in the numerical grid
      const int gridColumn = rowIdx % gridSize;
      if( gridRow == 0 || gridRow == gridSize - 1 ||  // boundary grid node
          gridColumn == 0 || gridColumn == gridSize - 1 )
      {
         columnIdx = rowIdx;  // diagonal element ....
         value = 1.0;         // ... is set to 1
      }
      else  // interior grid node
      {
         columnIdx = rowIdx + columnOffsetsView[ localIdx ];
         value = valuesView[ localIdx ];
      }
   };
   auto matrix =
      TNL::Matrices::LambdaMatrixFactory< double, Device, int >::create( matrixSize, matrixSize, matrixElements, rowLengths );
   std::cout << "Laplace operator matrix: " << std::endl << matrix << std::endl;
}
 
int
main( int argc, char* argv[] )
{
   std::cout << "Creating Laplace operator matrix on CPU ... " << std::endl;
   laplaceOperatorMatrix< TNL::Devices::Host >();
 
#ifdef __CUDACC__
   std::cout << "Creating Laplace operator matrix on CUDA GPU ... " << std::endl;
   laplaceOperatorMatrix< TNL::Devices::Cuda >();
#endif
}

The result of both examples looks as follows:

Creating Laplace operator matrix on CPU ... 
Laplace operator matrix: 
Row: 0 ->     0:1   
Row: 1 ->     1:1   
Row: 2 ->     2:1   
Row: 3 ->     3:1   
Row: 4 ->     4:1   
Row: 5 ->     1:1       4:1       5:-4      6:1       9:1   
Row: 6 ->     2:1       5:1       6:-4      7:1      10:1   
Row: 7 ->     7:1   
Row: 8 ->     8:1   
Row: 9 ->     5:1       8:1       9:-4     10:1      13:1   
Row: 10 ->     6:1       9:1      10:-4     11:1      14:1   
Row: 11 ->    11:1   
Row: 12 ->    12:1   
Row: 13 ->    13:1   
Row: 14 ->    14:1   
Row: 15 ->    15:1   
 
Creating Laplace operator matrix on CUDA GPU ... 
Laplace operator matrix: 
Row: 0 ->     0:1   
Row: 1 ->     1:1   
Row: 2 ->     2:1   
Row: 3 ->     3:1   
Row: 4 ->     4:1   
Row: 5 ->     1:1       4:1       5:-4      6:1       9:1   
Row: 6 ->     2:1       5:1       6:-4      7:1      10:1   
Row: 7 ->     7:1   
Row: 8 ->     8:1   
Row: 9 ->     5:1       8:1       9:-4     10:1      13:1   
Row: 10 ->     6:1       9:1      10:-4     11:1      14:1   
Row: 11 ->    11:1   
Row: 12 ->    12:1   
Row: 13 ->    13:1   
Row: 14 ->    14:1   
Row: 15 ->    15:1   
 

Distributed matrices

TODO: Write documentation on distributed matrices.

Flexible reduction in matrix rows

Flexible reduction in matrix rows is a powerful tool for many different matrix operations. It is represented by the method reduceRows (TNL::Matrices::DenseMatrixBase::reduceRows, TNL::Matrices::SparseMatrix::reduceRows, TNL::Matrices::TridiagonalMatrix::reduceRows, TNL::Matrices::MultidiagonalMatrix::reduceRows, TNL::Matrices::LambdaMatrix::reduceRows) and similar to the method forElements it iterates over particular matrix rows. However, it performs flexible parallel reduction in addition. For example, the matrix-vector product can be seen as a reduction of products of matrix elements with the input vector in particular matrix rows. The first element of the result vector ios obtained as:

\[y_1 = a_{11} x_1 + a_{12} x_2 + \ldots + a_{1n} x_n = \sum_{j=1}^n a_{1j}x_j \]

and in general i-th element of the result vector is computed as

\[y_i = a_{i1} x_1 + a_{i2} x_2 + \ldots + a_{in} x_n = \sum_{j=1}^n a_{ij}x_j. \]

We see that in i-th matrix row we have to compute the sum \(\sum_{j=1}^n a_{ij}x_j\) which is reduction of products \( a_{ij}x_j\). Similar to flexible parallel reduction (TNL::Algorithms::reduce) we just need to design proper lambda functions. There are three of them.

fetch reads and preprocesses data entering the flexible parallel reduction.
reduce performs the reduction operation.
keep stores the results from each matrix row.

Lambda function fetch

This lambda function has the same purpose as the lambda function fetch in flexible parallel reduction for arrays and vectors (see Flexible Parallel Reduction). It is supposed to be declared as follows:

1auto fetch = [=] __cuda_callable__( Index rowIdx, Index columnIdx, const Real& value ) -> Real;

The meaning of the particular parameters is as follows:

rowIdx is the row index of the matrix element.
columnIdx is the column index of the matrix element.
value is the value of the matrix element.

The lambda function returns a value of type Real based on the input data.

Lambda function reduce

The lambda function reduce expresses reduction operation (sum, product, minimum, maximum etc.) which is supposed to be done during the flexible reduction.

1auto reduce = [] __cuda_callable__( const Real& a, const Real& b ) -> Real;

The meaning of the particular parameters is as follows:

a is the first operand for the reduction operation.
b is the second operand for the reduction operation.

Lambda function keep

The lambda function keep is new one compared to the flexible reduction for arrays, vectors or other linear structures. The reason is that the result consists of as many numbers as there are matrix rows. Result obtained for each matrix row is processed by this lambda function. It is declared as follows:

1auto keep = [=] __cuda_callable__( Index rowIdx, const Real& value ) mutable;

The meaning of the particular parameters is as follows:

rowIdx is an index of the matrix row related to given result of flexible reduction.
valueis the result of the flexible reduction in given matrix row.

The method reduceRows (TNL::Matrices::DenseMatrixBase::reduceRows, TNL::Matrices::SparseMatrix::reduceRows, TNL::Matrices::TridiagonalMatrix::reduceRows, TNL::Matrices::MultidiagonalMatrix::reduceRows, TNL::Matrices::LambdaMatrix::reduceRows) accepts the following arguments:

begin is the beginning of the matrix rows range on which the reduction will be performed.
end is the end of the matrix rows range on which the reduction will be performed. The last matrix row which is going to be processed has index end-1.
fetch is the lambda function for data fetching.
reduce is the the lambda function performing the reduction.
keep is the lambda function responsible for processing the results from particular matrix rows.
zero is the identity element of given reduction operation.

Though the interface is the same for all matrix types, in the following part we will show several examples for different matrix types to better demonstrate possible ways of use of the flexible reduction for matrices.

Dense matrices example

The following example demonstrates implementation of the dense matrix-vector product \( {\bf y} = A \vec {\bf x}\), i.e.

\[ y_i = \sum_{j=0}^{columns - 1} a_{ij} x_j \text{ for } i = 0, \ldots, rows-1. \]

#include <iostream>
#include <iomanip>
#include <functional>
#include <TNL/Matrices/DenseMatrix.h>
#include <TNL/Devices/Host.h>
#include <TNL/Devices/Cuda.h>
 
template< typename Device >
void
reduceRows()
{
   // clang-format off
   TNL::Matrices::DenseMatrix< double, Device > matrix {
      {  1,  0,  0,  0,  0 },
      {  1,  2,  0,  0,  0 },
      {  0,  1,  8,  0,  0 },
      {  0,  0,  1,  9,  0 },
      {  0,  0,  0,  0,  1 } };
   // clang-format on
   /***
    * Allocate input and output vectors for matrix-vector product
    */
   TNL::Containers::Vector< double, Device > x( matrix.getColumns() ), y( matrix.getRows() );
 
   /***
    * Fill the input vectors with ones.
    */
   x = 1.0;
 
   /***
    * Prepare vector view for lambdas.
    */
   auto xView = x.getView();
   auto yView = y.getView();
 
   /***
    * Fetch lambda just returns product of appropriate matrix elements and the
    * input vector elements.
    */
   auto fetch = [ = ] __cuda_callable__( int rowIdx, int columnIdx, const double& value ) -> double
   {
      return xView[ columnIdx ] * value;
   };
 
   /***
    * Keep lambda store the result of matrix-vector product to output vector y.
    */
   auto keep = [ = ] __cuda_callable__( int rowIdx, const double& value ) mutable
   {
      yView[ rowIdx ] = value;
   };
 
   /***
    * Compute matrix-vector product.
    */
   matrix.reduceRows( 0, matrix.getRows(), fetch, std::plus<>{}, keep, 0.0 );
 
   std::cout << "The matrix reads as:" << std::endl << matrix << std::endl;
   std::cout << "The input vector is:" << x << std::endl;
   std::cout << "Result of matrix-vector multiplication is: " << y << std::endl;
}
 
int
main( int argc, char* argv[] )
{
   std::cout << "Rows reduction on host:" << std::endl;
   reduceRows< TNL::Devices::Host >();
 
#ifdef __CUDACC__
   std::cout << std::endl;
   std::cout << "Rows reduction on CUDA device:" << std::endl;
   reduceRows< TNL::Devices::Cuda >();
#endif
}

We set the following lambda functions:

fetch lambda function computes the product \( a_{ij}x_j\) where \( a_{ij} \) is represented by value and \(x_j \) is represented by xView[columnIdx] (line 40).
reduce - reduction is just sum of particular products and it is represented by std::plus (line 53).
keep is responsible for storing the results of reduction in each matrix row (which is represented by the variable value) into the output vector y.

The result looks as:

Rows reduction on host:
The matrix reads as:
Row: 0 ->     0:1       1:0       2:0       3:0       4:0   
Row: 1 ->     0:1       1:2       2:0       3:0       4:0   
Row: 2 ->     0:0       1:1       2:8       3:0       4:0   
Row: 3 ->     0:0       1:0       2:1       3:9       4:0   
Row: 4 ->     0:0       1:0       2:0       3:0       4:1   
The input vector is:[ 1, 1, 1, 1, 1 ]
Result of matrix-vector multiplication is: [ 1, 3, 9, 10, 1 ]
 
Rows reduction on CUDA device:
The matrix reads as:
Row: 0 ->     0:1       1:0       2:0       3:0       4:0   
Row: 1 ->     0:1       1:2       2:0       3:0       4:0   
Row: 2 ->     0:0       1:1       2:8       3:0       4:0   
Row: 3 ->     0:0       1:0       2:1       3:9       4:0   
Row: 4 ->     0:0       1:0       2:0       3:0       4:1   
The input vector is:[ 1, 1, 1, 1, 1 ]
Result of matrix-vector multiplication is: [ 1, 3, 9, 10, 1 ]

We will show one more example which is a computation of maximal absolute value in each matrix row. The results will be stored in a vector:

\[y_i = \max_{j=1,\ldots,n} |a_{ij}|. \]

See the following example:

#include <iostream>
#include <iomanip>
#include <functional>
#include <TNL/Matrices/DenseMatrix.h>
#include <TNL/Devices/Host.h>
 
template< typename Device >
void
reduceRows()
{
   // clang-format off
   TNL::Matrices::DenseMatrix< double, Device > matrix {
      {  1,  0,  0,  0,  0 },
      {  1,  2,  0,  0,  0 },
      {  0,  1,  8,  0,  0 },
      {  0,  0,  1,  9,  0 },
      {  0,  0,  0,  0,  1 } };
   // clang-format on
   /***
    * Find largest element in each row.
    */
   TNL::Containers::Vector< double, Device > rowMax( matrix.getRows() );
 
   /***
    * Prepare vector view for lambdas.
    */
   auto rowMaxView = rowMax.getView();
 
   /***
    * Fetch lambda just returns absolute value of matrix elements.
    */
   auto fetch = [ = ] __cuda_callable__( int rowIdx, int columnIdx, const double& value ) -> double
   {
      return TNL::abs( value );
   };
 
   /***
    * Reduce lambda return maximum of given values.
    */
   auto reduce = [ = ] __cuda_callable__( double& a, const double& b ) -> double
   {
      return TNL::max( a, b );
   };
 
   /***
    * Keep lambda store the largest value in each row to the vector rowMax.
    */
   auto keep = [ = ] __cuda_callable__( int rowIdx, const double& value ) mutable
   {
      rowMaxView[ rowIdx ] = value;
   };
 
   /***
    * Compute the largest values in each row.
    */
   matrix.reduceRows( 0, matrix.getRows(), fetch, reduce, keep, std::numeric_limits< double >::lowest() );
 
   std::cout << "Max. elements in rows are: " << rowMax << std::endl;
   std::cout << "Max. matrix norm is: " << TNL::max( rowMax ) << std::endl;
}
 
int
main( int argc, char* argv[] )
{
   std::cout << "Rows reduction on host:" << std::endl;
   reduceRows< TNL::Devices::Host >();
 
#ifdef __CUDACC__
   std::cout << "Rows reduction on CUDA device:" << std::endl;
   reduceRows< TNL::Devices::Cuda >();
#endif
}

The lambda functions rare:

fetch lambda function just returns absolute value of \(a_{ij} \) which is represented again by the variable value.
reduce lambda function returns larger of given values.
keep stores the results to the output vector the same way as in the previous example.

Note, that the identity value for the reduction is std::numeric_limits< double >::lowest. Of course, if we compute the maximum of all output vector elements, we get some kind of maximal matrix norm. The output looks as:

Rows reduction on host:
Max. elements in rows are: [ 1, 2, 8, 9, 1 ]
Max. matrix norm is: 9
Rows reduction on CUDA device:
Max. elements in rows are: [ 1, 2, 8, 9, 1 ]
Max. matrix norm is: 9

Sparse matrices example

The following example demonstrates sparse matrix-vector product:

#include <iostream>
#include <iomanip>
#include <functional>
#include <TNL/Matrices/SparseMatrix.h>
#include <TNL/Devices/Host.h>
#include <TNL/Devices/Cuda.h>
 
template< typename Device >
void
reduceRows()
{
   // clang-format off
   TNL::Matrices::SparseMatrix< double, Device > matrix{ 5, 5, {
      { 0, 0, 1 },
      { 1, 0, 1 }, { 1, 1, 2 },
      { 2, 1, 1 }, { 2, 2, 8 },
      { 3, 2, 1 }, { 3, 3, 9 },
      { 4, 4, 1 }
   } };
   // clang-format on
 
   /***
    * Allocate input and output vectors for matrix-vector product
    */
   TNL::Containers::Vector< double, Device > x( matrix.getColumns() ), y( matrix.getRows() );
 
   /***
    * Fill the input vectors with ones.
    */
   x = 1.0;
 
   /***
    * Prepare vector view for lambdas.
    */
   auto xView = x.getView();
   auto yView = y.getView();
 
   /***
    * Fetch lambda just returns product of appropriate matrix elements and the
    * input vector elements.
    */
   auto fetch = [ = ] __cuda_callable__( int rowIdx, int columnIdx, const double& value ) -> double
   {
      return xView[ columnIdx ] * value;
   };
 
   /***
    * Keep lambda store the result of matrix-vector product to output vector y.
    */
   auto keep = [ = ] __cuda_callable__( int rowIdx, const double& value ) mutable
   {
      yView[ rowIdx ] = value;
   };
 
   /***
    * Compute matrix-vector product.
    */
   matrix.reduceRows( 0, matrix.getRows(), fetch, std::plus<>{}, keep, 0.0 );
 
   std::cout << "The matrix reads as:" << std::endl << matrix << std::endl;
   std::cout << "The input vector is:" << x << std::endl;
   std::cout << "Result of matrix-vector multiplication is: " << y << std::endl;
}
 
int
main( int argc, char* argv[] )
{
   std::cout << "Rows reduction on host:" << std::endl;
   reduceRows< TNL::Devices::Host >();
 
#ifdef __CUDACC__
   std::cout << std::endl;
   std::cout << "Rows reduction on CUDA device:" << std::endl;
   reduceRows< TNL::Devices::Cuda >();
#endif
}

On the lines 11-16 we set the following matrix:

\[\left( \begin{array}{ccccc} 1 & . & . & . & . \\ 1 & 2 & . & . & . \\ . & 1 & 8 & . & . \\ . & . & 1 & 9 & . \\ . & . & . & . & 1 \end{array} \right) \]

The lambda functions on the lines 39-48 are the same as in the example with the dense matrix. The result looks as follows:

Rows reduction on host:
The matrix reads as:
Row: 0 ->     0:1   
Row: 1 ->     0:1       1:2   
Row: 2 ->     1:1       2:8   
Row: 3 ->     2:1       3:9   
Row: 4 ->     4:1   
 
The input vector is:[ 1, 1, 1, 1, 1 ]
Result of matrix-vector multiplication is: [ 1, 3, 9, 10, 1 ]
 
Rows reduction on CUDA device:
The matrix reads as:
Row: 0 ->     0:1   
Row: 1 ->     0:1       1:2   
Row: 2 ->     1:1       2:8   
Row: 3 ->     2:1       3:9   
Row: 4 ->     4:1   
 
The input vector is:[ 1, 1, 1, 1, 1 ]
Result of matrix-vector multiplication is: [ 1, 3, 9, 10, 1 ]

Tridiagonal matrices example

In this example, we will compute maximal absolute value in each row of the following tridiagonal matrix:

\[\left( \begin{array}{ccccc} 1 & 3 & & & & \\ 2 & 1 & 3 & & & \\ & 2 & 1 & 3 & & \\ & & 2 & 1 & 3 & \\ & & & 2 & 1 & 3 \end{array} \right). \]

The source code reads as follows:

#include <iostream>
#include <iomanip>
#include <functional>
#include <TNL/Matrices/TridiagonalMatrix.h>
#include <TNL/Devices/Host.h>
 
template< typename Device >
void
reduceRows()
{
   /***
    * Set the following matrix (dots represent zero matrix elements and zeros are
    * padding zeros for memory alignment):
    *
    *  0 / 1  3  .  .  . \   -> { 0, 1, 3 }
    *    | 2  1  3  .  . |   -> { 2, 1, 3 }
    *    | .  2  1  3  . |   -> { 2, 1, 3 }
    *    | .  .  2  1  3 |   -> { 2, 1, 3 }
    *    \ .  .  .  2  1 / 0 -> { 2, 1, 0 }
    *
    */
   TNL::Matrices::TridiagonalMatrix< double, Device > matrix( 5,              // number of matrix columns
                                                              { { 0, 1, 3 },  // matrix elements
                                                                { 2, 1, 3 },
                                                                { 2, 1, 3 },
                                                                { 2, 1, 3 },
                                                                { 2, 1, 3 } } );
   auto view = matrix.getView();
 
   /***
    * Find largest element in each row.
    */
   TNL::Containers::Vector< double, Device > rowMax( matrix.getRows() );
 
   /***
    * Prepare vector view for lambdas.
    */
   auto rowMaxView = rowMax.getView();
 
   /***
    * Fetch lambda just returns absolute value of matrix elements.
    */
   auto fetch = [] __cuda_callable__( int rowIdx, int columnIdx, const double& value ) -> double
   {
      return TNL::abs( value );
   };
 
   /***
    * Reduce lambda return maximum of given values.
    */
   auto reduce = [] __cuda_callable__( const double& a, const double& b ) -> double
   {
      return TNL::max( a, b );
   };
 
   /***
    * Keep lambda store the largest value in each row to the vector rowMax.
    */
   auto keep = [ = ] __cuda_callable__( int rowIdx, const double& value ) mutable
   {
      rowMaxView[ rowIdx ] = value;
   };
 
   /***
    * Compute the largest values in each row.
    */
   view.reduceRows( 0, view.getRows(), fetch, reduce, keep, std::numeric_limits< double >::lowest() );  // or matrix.reduceRows
 
   std::cout << "The matrix reads as: " << std::endl << matrix << std::endl;
   std::cout << "Max. elements in rows are: " << rowMax << std::endl;
}
 
int
main( int argc, char* argv[] )
{
   std::cout << "Rows reduction on host:" << std::endl;
   reduceRows< TNL::Devices::Host >();
 
#ifdef __CUDACC__
   std::cout << "Rows reduction on CUDA device:" << std::endl;
   reduceRows< TNL::Devices::Cuda >();
#endif
}

Here we first set the tridiagonal matrix (lines 10-27). Next we allocate the vector rowMax where we will store the results (line 32). The lambda function are:

fetch (lines 42-44) is responsible for reading the matrix elements. In our example, the only thing this function has to do, is to compute the absolute value of each matrix element represented by variable value.
reduce (lines 49-51), performs reduction operation. In this case, it returns maximum of two input values a and b.
keep (lines 56-58) takes the result of the reduction in variable value in each row and stores it into the vector rowMax via related vector view rowMaxView.

Note, that the identity value for the reduction is std::numeric_limits< double >::lowest. The results looks as follows:

Rows reduction on host:
The matrix reads as: 
Row: 0 ->     0:1       1:3   
Row: 1 ->     0:2       1:1       2:3   
Row: 2 ->     1:2       2:1       3:3   
Row: 3 ->     2:2       3:1       4:3   
Row: 4 ->     3:2       4:1   
 
Max. elements in rows are: [ 3, 3, 3, 3, 2 ]
Rows reduction on CUDA device:
The matrix reads as: 
Row: 0 ->     0:1       1:3   
Row: 1 ->     0:2       1:1       2:3   
Row: 2 ->     1:2       2:1       3:3   
Row: 3 ->     2:2       3:1       4:3   
Row: 4 ->     3:2       4:1   
 
Max. elements in rows are: [ 3, 3, 3, 3, 2 ]

Multidiagonal matrices example

The next example computes again the maximal absolute value in each row. Now, we do it for multidiagonal matrix the following form:

\[\left( \begin{array}{ccccc} 1 & & & & \\ 2 & 1 & & & \\ 3 & 2 & 1 & & \\ & 3 & 2 & 1 & \\ & & 3 & 2 & 1 \end{array} \right) \]

We first create vector rowMax into which we will store the results and fetch it view rowMaxView (line 39). Next we prepare necessary lambda functions:

fetch (lines 44-46) is responsible for reading the matrix element value which is stored in the constant reference value and for returning its absolute value. The other parameters rowIdx and columnIdx correspond to row and column indexes respectively and they are omitted in our example.
reduce (lines 51-53) returns maximum value of the two input values a and b.
keep (line 58-60) stores the input value at the corresponding position, given by the row index rowIdx, in the output vector view rowMaxView.

Finally, we call the method reduceRows (TNL::Matrices::MultidiagonalMatrix::reduceRows) with parameters telling the interval of rows to be processed (the first and second parameter), the lambda functions fetch, reduce and keep, and the identity element for the reduction operation which is the lowest number of given type (std::numeric_limits< double >::lowest ). The result looks as follows:

Rows reduction on host:
The matrix reads as: 
Row: 0 ->     0:1   
Row: 1 ->     0:2       1:1   
Row: 2 ->     0:3       1:2       2:1   
Row: 3 ->     1:3       2:2       3:1   
Row: 4 ->     2:3       3:2       4:1   
 
Max. elements in rows are: [ -1.79769e+308, -1.79769e+308, -1.79769e+308, -1.79769e+308, -1.79769e+308 ]
Rows reduction on CUDA device:
The matrix reads as: 
Row: 0 ->     0:1   
Row: 1 ->     0:2       1:1   
Row: 2 ->     0:3       1:2       2:1   
Row: 3 ->     1:3       2:2       3:1   
Row: 4 ->     2:3       3:2       4:1   
 
Max. elements in rows are: [ -1.79769e+308, -1.79769e+308, -1.79769e+308, -1.79769e+308, -1.79769e+308 ]

Lambda matrices example

The reduction of matrix rows is available for the lambda matrices as well. See the following example:

#include <iostream>
#include <iomanip>
#include <functional>
#include <TNL/Matrices/LambdaMatrix.h>
#include <TNL/Devices/Host.h>
#include <TNL/Devices/Cuda.h>
 
template< typename Device >
void
reduceRows()
{
   /***
    * Lambda functions defining the matrix.
    */
   auto rowLengths = [ = ] __cuda_callable__( const int rows, const int columns, const int rowIdx ) -> int
   {
      return columns;
   };
   auto matrixElements =
      [ = ] __cuda_callable__(
         const int rows, const int columns, const int rowIdx, const int localIdx, int& columnIdx, double& value )
   {
      columnIdx = localIdx;
      value = TNL::max( rowIdx - columnIdx + 1, 0 );
   };
 
   using MatrixFactory = TNL::Matrices::LambdaMatrixFactory< double, Device, int >;
   auto matrix = MatrixFactory::create( 5, 5, matrixElements, rowLengths );
 
   /***
    * Find largest element in each row.
    */
   TNL::Containers::Vector< double, Device > rowMax( matrix.getRows() );
 
   /***
    * Prepare vector view for lambdas.
    */
   auto rowMaxView = rowMax.getView();
 
   /***
    * Fetch lambda just returns absolute value of matrix elements.
    */
   auto fetch = [ = ] __cuda_callable__( int rowIdx, int columnIdx, const double& value ) -> double
   {
      return TNL::abs( value );
   };
 
   /***
    * Reduce lambda return maximum of given values.
    */
   auto reduce = [ = ] __cuda_callable__( const double& a, const double& b ) -> double
   {
      return TNL::max( a, b );
   };
 
   /***
    * Keep lambda store the largest value in each row to the vector rowMax.
    */
   auto keep = [ = ] __cuda_callable__( int rowIdx, const double& value ) mutable
   {
      rowMaxView[ rowIdx ] = value;
   };
 
   /***
    * Compute the largest values in each row.
    */
   matrix.reduceRows( 0, matrix.getRows(), fetch, reduce, keep, std::numeric_limits< double >::lowest() );
 
   std::cout << "The matrix reads as: " << std::endl << matrix << std::endl;
   std::cout << "Max. elements in rows are: " << rowMax << std::endl;
}
 
int
main( int argc, char* argv[] )
{
   std::cout << "Rows reduction on host:" << std::endl;
   reduceRows< TNL::Devices::Host >();
 
#ifdef __CUDACC__
   std::cout << "Rows reduction on CUDA device:" << std::endl;
   reduceRows< TNL::Devices::Cuda >();
#endif
}

On the lines 14-21, we create the lower triangular lambda matrix which looks as follows:

\[\left( \begin{array}{ccccc} 1 & & & & \\ 2 & 1 & & & \\ 3 & 2 & 1 & & \\ 4 & 3 & 2 & 1 & \\ 5 & 4 & 3 & 2 & 1 \end{array} \right) \]

We want to compute maximal absolute value of matrix elements in each row. For this purpose we define well known lambda functions:

fetch takes the value of the lambda matrix element and returns its absolute value.
reduce computes maximum value of two input variables.
keep stores the results into output vector rowMax.

Note that the interface of the lambda functions is the same as for other matrix types. The result looks as follows:

Rows reduction on host:
The matrix reads as: 
Row: 0 ->     0:1   
Row: 1 ->     0:2       1:1   
Row: 2 ->     0:3       1:2       2:1   
Row: 3 ->     0:4       1:3       2:2       3:1   
Row: 4 ->     0:5       1:4       2:3       3:2       4:1   
 
Max. elements in rows are: [ 1, 2, 3, 4, 5 ]
Rows reduction on CUDA device:
The matrix reads as: 
Row: 0 ->     0:1   
Row: 1 ->     0:2       1:1   
Row: 2 ->     0:3       1:2       2:1   
Row: 3 ->     0:4       1:3       2:2       3:1   
Row: 4 ->     0:5       1:4       2:3       3:2       4:1   
 
Max. elements in rows are: [ 1, 2, 3, 4, 5 ]

Matrix-vector product

One of the most important matrix operation is the matrix-vector multiplication. It is represented by a method vectorProduct (TNL::Matrices::DenseMatrix::vectorProduct, TNL::Matrices::SparseMatrix::vectorProduct, TNL::Matrices::TridiagonalMatrix::vectorProduct, TNL::Matrices::MultidiagonalMatrix::vectorProduct, TNL::Matrices::LambdaMatrix::vectorProduct). It is templated method with two template parameters InVector and OutVector telling the types of input and output vector respectively. Usually one will substitute some of TNL::Containers::Array, TNL::Containers::ArrayView, TNL::Containers::Vector or TNL::Containers::VectorView for these types. The method accepts the following parameters:

inVector is the input vector having the same number of elements as the number of matrix columns.
outVector is the output vector having the same number of elements as the number of matrix rows.
matrixMultiplicator is a number by which the result of matrix-vector product is multiplied.
outVectorMultiplicator is a number by which the output vector is multiplied before added to the result of matrix-vector product.
begin is the beginning of the matrix rows range on which we compute the matrix-vector product.
end is the end of the matrix rows range on which the matrix-vector product will be evaluated. The last matrix row which is going to be processed has index end-1.

Note that the output vector dimension must be the same as the number of matrix rows no matter how we set begin and end parameters. These parameters just say that some matrix rows and the output vector elements are omitted.

To summarize, this method computes the following formula:

outVector = matrixMultiplicator * ( *this ) * inVector + outVectorMultiplicator * outVector.

Matrix I/O operations

All matrices can be saved to a file using a method save (TNL::Matrices::DenseMatrix::save, TNL::Matrices::SparseMatrix::save, TNL::Matrices::TridiagonalMatrix::save, TNL::Matrices::MultidiagonalMatrix::save) and restored with a method load (TNL::Matrices::DenseMatrix::load, TNL::Matrices::SparseMatrix::load, TNL::Matrices::TridiagonalMatrix::load, TNL::Matrices::MultidiagonalMatrix::load). To print the matrix, there is a method print (TNL::Matrices::DenseMatrix::print, TNL::Matrices::SparseMatrix::print, TNL::Matrices::TridiagonalMatrix::print, TNL::Matrices::MultidiagonalMatrix::print, TNL::Matrices::LambdaMatrix::print) can be used.

Matrix reader and writer

TNL also offers matrix reader (TNL::Matrices::MatrixReader) and matrix writer (TNL::Matrices::MatrixWriter) for import and export of matrices respectively. The matrix reader currently supports only Coordinate MTX file format which is popular mainly for sparse matrices. By the mean of the matrix writer, we can export TNL matrices into coordinate MTX format as well. In addition, the matrices can be exported to a text file suitable for Gnuplot program which can be used for matrix visualization. Finally, a pattern of nonzero matrix elements can be visualized via the EPS format - Encapsulated PostScript. We demonstrate both matrix reader and writer in the following example:

#include <iostream>
#include <TNL/Matrices/SparseMatrix.h>
#include <TNL/Matrices/DenseMatrix.h>
#include <TNL/Matrices/MatrixReader.h>
#include <TNL/Matrices/MatrixWriter.h>
#include <TNL/Devices/Host.h>
#include <TNL/Devices/Cuda.h>
 
template< typename Device >
void
matrixWriterExample()
{
   using Matrix = TNL::Matrices::SparseMatrix< double, Device >;
   Matrix matrix( 5,  // number of matrix rows
                  5,  // number of matrix columns
                  {
                     // matrix elements definition
                     // clang-format off
                     { 0, 0,  2.0 },
                     { 1, 0, -1.0 }, { 1, 1, 2.0 }, { 1, 2, -1.0 },
                     { 2, 1, -1.0 }, { 2, 2, 2.0 }, { 2, 3, -1.0 },
                     { 3, 2, -1.0 }, { 3, 3, 2.0 }, { 3, 4, -1.0 },
                     { 4, 4,  2.0 },
                     // clang-format on
                  } );
 
   std::cout << "Matrix: " << std::endl << matrix << std::endl;
   std::cout << "Writing matrix in Gnuplot format into the file matrix-writer-example.gplt ...";
   TNL::Matrices::MatrixWriter< Matrix >::writeGnuplot( "matrix-writer-example.gplt", matrix );
   std::cout << " OK " << std::endl;
   std::cout << "Writing matrix pattern in EPS format into the file matrix-writer-example.eps ...";
   TNL::Matrices::MatrixWriter< Matrix >::writeEps( "matrix-writer-example.eps", matrix );
   std::cout << " OK " << std::endl;
   std::cout << "Writing matrix in MTX format into the file matrix-writer-example.mtx ...";
   TNL::Matrices::MatrixWriter< Matrix >::writeMtx( "matrix-writer-example.mtx", matrix );
   std::cout << " OK " << std::endl;
}
 
template< typename Device >
void
matrixReaderExample()
{
   using SparseMatrix = TNL::Matrices::SparseMatrix< double, Device >;
   SparseMatrix sparseMatrix;
 
   std::cout << "Reading sparse matrix from MTX file matrix-writer-example.mtx ... ";
   TNL::Matrices::MatrixReader< SparseMatrix >::readMtx( "matrix-writer-example.mtx", sparseMatrix );
   std::cout << " OK " << std::endl;
   std::cout << "Imported matrix is: " << std::endl << sparseMatrix << std::endl;
 
   using DenseMatrix = TNL::Matrices::DenseMatrix< double, Device >;
   DenseMatrix denseMatrix;
 
   std::cout << "Reading dense matrix from MTX file matrix-writer-example.mtx ... ";
   TNL::Matrices::MatrixReader< DenseMatrix >::readMtx( "matrix-writer-example.mtx", denseMatrix );
   std::cout << " OK " << std::endl;
   std::cout << "Imported matrix is: " << std::endl << denseMatrix << std::endl;
}
 
int
main( int argc, char* argv[] )
{
   std::cout << "Creating matrices on CPU ... " << std::endl;
   matrixWriterExample< TNL::Devices::Host >();
   matrixReaderExample< TNL::Devices::Host >();
 
#ifdef __CUDACC__
   std::cout << std::endl << std::endl;
   std::cout << "Creating matrices on CUDA GPU ... " << std::endl;
   matrixWriterExample< TNL::Devices::Cuda >();
   matrixReaderExample< TNL::Devices::Cuda >();
#endif
}

The example consists of two functions - matrixWriterExample (lines 10-24) and matrixReaderExample (lines 36-54). In the first one, we first create a toy matrix (lines 13-22) which we subsequently export into Gnuplot (line 26), EPS (line 29) and MTX (line 32) formats. In the next step (the matrixReaderExample function on lines 36-54), the MTX file is used to import the matrix into sparse (line 43) and dense (line 51) matrices. Both matrices are printed out (lines 45 and 53).

The result looks as follows:

Creating matrices on CPU ... 
Matrix: 
Row: 0 ->     0:2   
Row: 1 ->     0:-1      1:2       2:-1  
Row: 2 ->     1:-1      2:2       3:-1  
Row: 3 ->     2:-1      3:2       4:-1  
Row: 4 ->     4:2   
 
Writing matrix in Gnuplot format into the file matrix-writer-example.gplt ... OK 
Writing matrix pattern in EPS format into the file matrix-writer-example.eps ... OK 
Writing matrix in MTX format into the file matrix-writer-example.mtx ... OK 
Reading sparse matrix from MTX file matrix-writer-example.mtx ...  OK 
Imported matrix is: 
Row: 0 ->     0:2   
Row: 1 ->     0:-1      1:2       2:-1  
Row: 2 ->     1:-1      2:2       3:-1  
Row: 3 ->     2:-1      3:2       4:-1  
Row: 4 ->     4:2   
 
Reading dense matrix from MTX file matrix-writer-example.mtx ...  OK 
Imported matrix is: 
Row: 0 ->     0:2       1:0       2:0       3:0       4:0   
Row: 1 ->     0:-1      1:2       2:-1      3:0       4:0   
Row: 2 ->     0:0       1:-1      2:2       3:-1      4:0   
Row: 3 ->     0:0       1:0       2:-1      3:2       4:-1  
Row: 4 ->     0:0       1:0       2:0       3:0       4:2   
 
 
Creating matrices on CUDA GPU ... 
Matrix: 
Row: 0 ->     0:2   
Row: 1 ->     0:-1      1:2       2:-1  
Row: 2 ->     1:-1      2:2       3:-1  
Row: 3 ->     2:-1      3:2       4:-1  
Row: 4 ->     4:2   
 
Writing matrix in Gnuplot format into the file matrix-writer-example.gplt ... OK 
Writing matrix pattern in EPS format into the file matrix-writer-example.eps ... OK 
Writing matrix in MTX format into the file matrix-writer-example.mtx ... OK 
Reading sparse matrix from MTX file matrix-writer-example.mtx ...  OK 
Imported matrix is: 
Row: 0 ->     0:2   
Row: 1 ->     0:-1      1:2       2:-1  
Row: 2 ->     1:-1      2:2       3:-1  
Row: 3 ->     2:-1      3:2       4:-1  
Row: 4 ->     4:2   
 
Reading dense matrix from MTX file matrix-writer-example.mtx ...  OK 
Imported matrix is: 
Row: 0 ->     0:2       1:0       2:0       3:0       4:0   
Row: 1 ->     0:-1      1:2       2:-1      3:0       4:0   
Row: 2 ->     0:0       1:-1      2:2       3:-1      4:0   
Row: 3 ->     0:0       1:0       2:-1      3:2       4:-1  
Row: 4 ->     0:0       1:0       2:0       3:0       4:2   

Appendix

Benchmark of dense matrix setup

#include <iostream>
#include <TNL/Algorithms/parallelFor.h>
#include <TNL/Containers/StaticArray.h>
#include <TNL/Matrices/DenseMatrix.h>
#include <TNL/Devices/Host.h>
#include <TNL/Devices/Cuda.h>
#include <TNL/Timer.h>
 
const int testsCount = 5;
 
template< typename Matrix >
void
setElement_on_host( const int matrixSize, Matrix& matrix )
{
   matrix.setDimensions( matrixSize, matrixSize );
 
   for( int j = 0; j < matrixSize; j++ )
      for( int i = 0; i < matrixSize; i++ )
         matrix.setElement( i, j, i + j );
}
 
template< typename Matrix >
void
setElement_on_host_and_transfer( const int matrixSize, Matrix& matrix )
{
   using RealType = typename Matrix::RealType;
   using IndexType = typename Matrix::IndexType;
   using HostMatrix = TNL::Matrices::DenseMatrix< RealType, TNL::Devices::Host, IndexType >;
   HostMatrix hostMatrix( matrixSize, matrixSize );
 
   for( int j = 0; j < matrixSize; j++ )
      for( int i = 0; i < matrixSize; i++ )
         hostMatrix.setElement( i, j, i + j );
   matrix = hostMatrix;
}
 
template< typename Matrix >
void
setElement_on_device( const int matrixSize, Matrix& matrix )
{
   matrix.setDimensions( matrixSize, matrixSize );
 
   auto matrixView = matrix.getView();
   auto f = [ = ] __cuda_callable__( const TNL::Containers::StaticArray< 2, int >& i ) mutable
   {
      matrixView.setElement( i[ 0 ], i[ 1 ], i[ 0 ] + i[ 1 ] );
   };
   const TNL::Containers::StaticArray< 2, int > begin = { 0, 0 };
   const TNL::Containers::StaticArray< 2, int > end = { matrixSize, matrixSize };
   TNL::Algorithms::parallelFor< typename Matrix::DeviceType >( begin, end, f );
}
 
template< typename Matrix >
void
getRow( const int matrixSize, Matrix& matrix )
{
   matrix.setDimensions( matrixSize, matrixSize );
 
   auto matrixView = matrix.getView();
   auto f = [ = ] __cuda_callable__( int rowIdx ) mutable
   {
      auto row = matrixView.getRow( rowIdx );
      for( int i = 0; i < matrixSize; i++ )
         row.setValue( i, rowIdx + i );
   };
   TNL::Algorithms::parallelFor< typename Matrix::DeviceType >( 0, matrixSize, f );
}
 
template< typename Matrix >
void
forElements( const int matrixSize, Matrix& matrix )
{
   matrix.setDimensions( matrixSize, matrixSize );
 
   auto f = [ = ] __cuda_callable__( int rowIdx, int localIdx, int& columnIdx, float& value ) mutable
   {
      value = rowIdx + columnIdx;
   };
   matrix.forElements( 0, matrixSize, f );
}
 
template< typename Device >
void
setupDenseMatrix()
{
   std::cout << " Dense matrix test:" << std::endl;
   for( int matrixSize = 16; matrixSize <= 8192; matrixSize *= 2 ) {
      std::cout << "  Matrix size = " << matrixSize << std::endl;
      TNL::Timer timer;
 
      std::cout << "   setElement on host: ";
      timer.reset();
      timer.start();
      for( int i = 0; i < testsCount; i++ ) {
         TNL::Matrices::DenseMatrix< float, Device, int > matrix;
         setElement_on_host( matrixSize, matrix );
      }
      timer.stop();
      std::cout << timer.getRealTime() / (double) testsCount << " sec." << std::endl;
 
      std::cout << "   setElement on device: ";
      timer.reset();
      timer.start();
      for( int i = 0; i < testsCount; i++ ) {
         TNL::Matrices::DenseMatrix< float, Device, int > matrix;
         setElement_on_device( matrixSize, matrix );
      }
      timer.stop();
      std::cout << timer.getRealTime() / (double) testsCount << " sec." << std::endl;
 
      if( std::is_same< Device, TNL::Devices::Cuda >::value ) {
         std::cout << "   setElement on host and transfer on GPU: ";
         timer.reset();
         timer.start();
         for( int i = 0; i < testsCount; i++ ) {
            TNL::Matrices::DenseMatrix< float, Device, int > matrix;
            setElement_on_host_and_transfer( matrixSize, matrix );
         }
         timer.stop();
         std::cout << timer.getRealTime() / (double) testsCount << " sec." << std::endl;
      }
 
      std::cout << "   getRow: ";
      timer.reset();
      timer.start();
      for( int i = 0; i < testsCount; i++ ) {
         TNL::Matrices::DenseMatrix< float, Device, int > matrix;
         getRow( matrixSize, matrix );
      }
      timer.stop();
      std::cout << timer.getRealTime() / (double) testsCount << " sec." << std::endl;
 
      std::cout << "   forElements: ";
      timer.reset();
      timer.start();
      for( int i = 0; i < testsCount; i++ ) {
         TNL::Matrices::DenseMatrix< float, Device, int > matrix;
         forElements( matrixSize, matrix );
      }
      timer.stop();
      std::cout << timer.getRealTime() / (double) testsCount << " sec." << std::endl;
   }
}
 
int
main( int argc, char* argv[] )
{
   std::cout << "Creating dense matrix on CPU ... " << std::endl;
   setupDenseMatrix< TNL::Devices::Host >();
 
#ifdef __CUDACC__
   std::cout << "Creating dense matrix on CUDA GPU ... " << std::endl;
   setupDenseMatrix< TNL::Devices::Cuda >();
#endif
}

Benchmark of sparse matrix setup

#include <iostream>
#include <TNL/Algorithms/parallelFor.h>
#include <TNL/Containers/StaticArray.h>
#include <TNL/Matrices/SparseMatrix.h>
#include <TNL/Devices/Host.h>
#include <TNL/Devices/Cuda.h>
#include <TNL/Timer.h>
 
const int testsCount = 5;
 
template< typename Matrix >
void
STL_Map( const int gridSize, Matrix& matrix )
{
   /***
    * Set  matrix representing approximation of the Laplace operator on regular
    * grid using the finite difference method by means of STL map.
    */
   const int matrixSize = gridSize * gridSize;
   matrix.setDimensions( matrixSize, matrixSize );
   std::map< std::pair< int, int >, double > map;
   for( int j = 0; j < gridSize; j++ )
      for( int i = 0; i < gridSize; i++ ) {
         const int rowIdx = j * gridSize + i;
         if( i == 0 || j == 0 || i == gridSize - 1 || j == gridSize - 1 )
            map.insert( std::make_pair( std::make_pair( rowIdx, rowIdx ), 1.0 ) );
         else {
            map.insert( std::make_pair( std::make_pair( rowIdx, rowIdx - gridSize ), 1.0 ) );
            map.insert( std::make_pair( std::make_pair( rowIdx, rowIdx - 1 ), 1.0 ) );
            map.insert( std::make_pair( std::make_pair( rowIdx, rowIdx ), -4.0 ) );
            map.insert( std::make_pair( std::make_pair( rowIdx, rowIdx + 1 ), 1.0 ) );
            map.insert( std::make_pair( std::make_pair( rowIdx, rowIdx + gridSize ), 1.0 ) );
         }
      }
   matrix.setElements( map );
}
 
template< typename Matrix >
void
setElement_on_host( const int gridSize, Matrix& matrix )
{
   /***
    * Set  matrix representing approximation of the Laplace operator on regular
    * grid using the finite difference method by means setElement method called
    * from the host system.
    */
   const int matrixSize = gridSize * gridSize;
   TNL::Containers::Vector< int, typename Matrix::DeviceType, int > rowCapacities( matrixSize, 5 );
   matrix.setDimensions( matrixSize, matrixSize );
   matrix.setRowCapacities( rowCapacities );
 
   for( int j = 0; j < gridSize; j++ )
      for( int i = 0; i < gridSize; i++ ) {
         const int rowIdx = j * gridSize + i;
         if( i == 0 || j == 0 || i == gridSize - 1 || j == gridSize - 1 )
            matrix.setElement( rowIdx, rowIdx, 1.0 );
         else {
            matrix.setElement( rowIdx, rowIdx - gridSize, 1.0 );
            matrix.setElement( rowIdx, rowIdx - 1, 1.0 );
            matrix.setElement( rowIdx, rowIdx, -4.0 );
            matrix.setElement( rowIdx, rowIdx + 1, 1.0 );
            matrix.setElement( rowIdx, rowIdx + gridSize, 1.0 );
         }
      }
}
 
template< typename Matrix >
void
setElement_on_host_and_transfer( const int gridSize, Matrix& matrix )
{
   using RealType = typename Matrix::RealType;
   using HostMatrix = typename Matrix::template Self< RealType, TNL::Devices::Host >;
 
   const int matrixSize = gridSize * gridSize;
   TNL::Containers::Vector< int, typename HostMatrix::DeviceType, int > rowCapacities( matrixSize, 5 );
   HostMatrix hostMatrix( matrixSize, matrixSize );
   hostMatrix.setRowCapacities( rowCapacities );
 
   for( int j = 0; j < gridSize; j++ )
      for( int i = 0; i < gridSize; i++ ) {
         const int rowIdx = j * gridSize + i;
         if( i == 0 || j == 0 || i == gridSize - 1 || j == gridSize - 1 )
            hostMatrix.setElement( rowIdx, rowIdx, 1.0 );
         else {
            hostMatrix.setElement( rowIdx, rowIdx - gridSize, 1.0 );
            hostMatrix.setElement( rowIdx, rowIdx - 1, 1.0 );
            hostMatrix.setElement( rowIdx, rowIdx, -4.0 );
            hostMatrix.setElement( rowIdx, rowIdx + 1, 1.0 );
            hostMatrix.setElement( rowIdx, rowIdx + gridSize, 1.0 );
         }
      }
   matrix = hostMatrix;
}
 
template< typename Matrix >
void
setElement_on_device( const int gridSize, Matrix& matrix )
{
   /***
    * Set  matrix representing approximation of the Laplace operator on regular
    * grid using the finite difference method by means of setElement method called
    * from the native device.
    */
   const int matrixSize = gridSize * gridSize;
   TNL::Containers::Vector< int, typename Matrix::DeviceType, int > rowCapacities( matrixSize, 5 );
   matrix.setDimensions( matrixSize, matrixSize );
   matrix.setRowCapacities( rowCapacities );
 
   auto matrixView = matrix.getView();
   auto f = [ = ] __cuda_callable__( const TNL::Containers::StaticArray< 2, int >& i ) mutable
   {
      const int rowIdx = i[ 1 ] * gridSize + i[ 0 ];
      if( i[ 0 ] == 0 || i[ 1 ] == 0 || i[ 0 ] == gridSize - 1 || i[ 1 ] == gridSize - 1 )
         matrixView.setElement( rowIdx, rowIdx, 1.0 );
      else {
         matrixView.setElement( rowIdx, rowIdx - gridSize, 1.0 );
         matrixView.setElement( rowIdx, rowIdx - 1, 1.0 );
         matrixView.setElement( rowIdx, rowIdx, -4.0 );
         matrixView.setElement( rowIdx, rowIdx + 1, 1.0 );
         matrixView.setElement( rowIdx, rowIdx + gridSize, 1.0 );
      }
   };
   const TNL::Containers::StaticArray< 2, int > begin = { 0, 0 };
   const TNL::Containers::StaticArray< 2, int > end = { gridSize, gridSize };
   TNL::Algorithms::parallelFor< typename Matrix::DeviceType >( begin, end, f );
}
 
template< typename Matrix >
void
getRow( const int gridSize, Matrix& matrix )
{
   /***
    * Set  matrix representing approximation of the Laplace operator on regular
    * grid using the finite difference method by means of getRow method.
    */
   const int matrixSize = gridSize * gridSize;
   TNL::Containers::Vector< int, typename Matrix::DeviceType, int > rowCapacities( matrixSize, 5 );
   matrix.setDimensions( matrixSize, matrixSize );
   matrix.setRowCapacities( rowCapacities );
 
   auto matrixView = matrix.getView();
   auto f = [ = ] __cuda_callable__( int rowIdx ) mutable
   {
      const int i = rowIdx % gridSize;
      const int j = rowIdx / gridSize;
      auto row = matrixView.getRow( rowIdx );
      if( i == 0 || j == 0 || i == gridSize - 1 || j == gridSize - 1 )
         row.setElement( 2, rowIdx, 1.0 );
      else {
         row.setElement( 0, rowIdx - gridSize, 1.0 );
         row.setElement( 1, rowIdx - 1, 1.0 );
         row.setElement( 2, rowIdx, -4.0 );
         row.setElement( 3, rowIdx + 1, 1.0 );
         row.setElement( 4, rowIdx + gridSize, 1.0 );
      }
   };
   TNL::Algorithms::parallelFor< typename Matrix::DeviceType >( 0, matrixSize, f );
}
 
template< typename Matrix >
void
forElements( const int gridSize, Matrix& matrix )
{
   /***
    * Set  matrix representing approximation of the Laplace operator on regular
    * grid using the finite difference method by means of forElements method.
    */
 
   const int matrixSize = gridSize * gridSize;
   TNL::Containers::Vector< int, typename Matrix::DeviceType, int > rowCapacities( matrixSize, 5 );
   matrix.setDimensions( matrixSize, matrixSize );
   matrix.setRowCapacities( rowCapacities );
 
   auto f = [ = ] __cuda_callable__( int rowIdx, int localIdx, int& columnIdx, float& value ) mutable
   {
      const int i = rowIdx % gridSize;
      const int j = rowIdx / gridSize;
      if( ( i == 0 || j == 0 || i == gridSize - 1 || j == gridSize - 1 ) && localIdx == 0 ) {
         columnIdx = rowIdx;
         value = 1.0;
      }
      else {
         switch( localIdx ) {
            case 0:
               columnIdx = rowIdx - gridSize;
               value = 1.0;
               break;
            case 1:
               columnIdx = rowIdx - 1;
               value = 1.0;
               break;
            case 2:
               columnIdx = rowIdx;
               value = -4.0;
               break;
            case 3:
               columnIdx = rowIdx + 1;
               value = 1.0;
               break;
            case 4:
               columnIdx = rowIdx + gridSize;
               value = 1.0;
               break;
         }
      }
   };
   matrix.forElements( 0, matrixSize, f );
}
 
template< typename Device >
void
laplaceOperatorSparseMatrix()
{
   std::cout << " Sparse matrix test:" << std::endl;
   for( int gridSize = 16; gridSize <= 8192; gridSize *= 2 ) {
      std::cout << "  Grid size = " << gridSize << std::endl;
      TNL::Timer timer;
 
      std::cout << "   STL map: ";
      timer.reset();
      timer.start();
      for( int i = 0; i < testsCount; i++ ) {
         TNL::Matrices::SparseMatrix< float, Device, int > matrix;
         STL_Map( gridSize, matrix );
      }
      timer.stop();
      std::cout << timer.getRealTime() / (double) testsCount << " sec." << std::endl;
 
      std::cout << "   setElement on host: ";
      timer.reset();
      timer.start();
      for( int i = 0; i < testsCount; i++ ) {
         TNL::Matrices::SparseMatrix< float, Device, int > matrix;
         setElement_on_host( gridSize, matrix );
      }
      timer.stop();
      std::cout << timer.getRealTime() / (double) testsCount << " sec." << std::endl;
 
      if( std::is_same< Device, TNL::Devices::Cuda >::value ) {
         std::cout << "   setElement on host and transfer on GPU: ";
         timer.reset();
         timer.start();
         for( int i = 0; i < testsCount; i++ ) {
            TNL::Matrices::SparseMatrix< float, Device, int > matrix;
            setElement_on_host_and_transfer( gridSize, matrix );
         }
         timer.stop();
         std::cout << timer.getRealTime() / (double) testsCount << " sec." << std::endl;
      }
 
      std::cout << "   setElement on device: ";
      timer.reset();
      timer.start();
      for( int i = 0; i < testsCount; i++ ) {
         TNL::Matrices::SparseMatrix< float, Device, int > matrix;
         setElement_on_device( gridSize, matrix );
      }
      timer.stop();
      std::cout << timer.getRealTime() / (double) testsCount << " sec." << std::endl;
 
      std::cout << "   getRow: ";
      timer.reset();
      timer.start();
      for( int i = 0; i < testsCount; i++ ) {
         TNL::Matrices::SparseMatrix< float, Device, int > matrix;
         getRow( gridSize, matrix );
      }
      timer.stop();
      std::cout << timer.getRealTime() / (double) testsCount << " sec." << std::endl;
 
      std::cout << "   forElements: ";
      timer.reset();
      timer.start();
      for( int i = 0; i < testsCount; i++ ) {
         TNL::Matrices::SparseMatrix< float, Device, int > matrix;
         forElements( gridSize, matrix );
      }
      timer.stop();
      std::cout << timer.getRealTime() / (double) testsCount << " sec." << std::endl;
   }
}
 
int
main( int argc, char* argv[] )
{
   std::cout << "Creating Laplace operator matrix on CPU ... " << std::endl;
   laplaceOperatorSparseMatrix< TNL::Devices::Host >();
 
#ifdef __CUDACC__
   std::cout << "Creating Laplace operator matrix on CUDA GPU ... " << std::endl;
   laplaceOperatorSparseMatrix< TNL::Devices::Cuda >();
#endif
}

Benchmark of multidiagonal matrix setup

#include <iostream>
#include <TNL/Algorithms/parallelFor.h>
#include <TNL/Containers/StaticArray.h>
#include <TNL/Matrices/MultidiagonalMatrix.h>
#include <TNL/Devices/Host.h>
#include <TNL/Devices/Cuda.h>
#include <TNL/Timer.h>
 
const int testsCount = 5;
 
template< typename Device >
TNL::Containers::Vector< int, Device >
getOffsets( const int gridSize )
{
   TNL::Containers::Vector< int, Device > offsets( 5 );
   offsets.setElement( 0, -gridSize );
   offsets.setElement( 1, -1 );
   offsets.setElement( 2, 0 );
   offsets.setElement( 3, 1 );
   offsets.setElement( 4, gridSize );
   return offsets;
}
 
template< typename Matrix >
void
setElement_on_host( const int gridSize, Matrix& matrix )
{
   /***
    * Set  matrix representing approximation of the Laplace operator on regular
    * grid using the finite difference method by means setElement method called
    * from the host system.
    */
   const int matrixSize = gridSize * gridSize;
   matrix.setDimensions( matrixSize, matrixSize, getOffsets< typename Matrix::DeviceType >( gridSize ) );
 
   for( int j = 0; j < gridSize; j++ )
      for( int i = 0; i < gridSize; i++ ) {
         const int rowIdx = j * gridSize + i;
         if( i == 0 || j == 0 || i == gridSize - 1 || j == gridSize - 1 )
            matrix.setElement( rowIdx, rowIdx, 1.0 );
         else {
            matrix.setElement( rowIdx, rowIdx - gridSize, 1.0 );
            matrix.setElement( rowIdx, rowIdx - 1, 1.0 );
            matrix.setElement( rowIdx, rowIdx, -4.0 );
            matrix.setElement( rowIdx, rowIdx + 1, 1.0 );
            matrix.setElement( rowIdx, rowIdx + gridSize, 1.0 );
         }
      }
}
 
template< typename Matrix >
void
setElement_on_host_and_transfer( const int gridSize, Matrix& matrix )
{
   using RealType = typename Matrix::RealType;
   using IndexType = typename Matrix::IndexType;
   using HostMatrix = TNL::Matrices::MultidiagonalMatrix< RealType, TNL::Devices::Host, IndexType >;
   const int matrixSize = gridSize * gridSize;
   HostMatrix hostMatrix( matrixSize, matrixSize, getOffsets< typename Matrix::DeviceType >( gridSize ) );
 
   for( int j = 0; j < gridSize; j++ )
      for( int i = 0; i < gridSize; i++ ) {
         const int rowIdx = j * gridSize + i;
         if( i == 0 || j == 0 || i == gridSize - 1 || j == gridSize - 1 )
            hostMatrix.setElement( rowIdx, rowIdx, 1.0 );
         else {
            hostMatrix.setElement( rowIdx, rowIdx - gridSize, 1.0 );
            hostMatrix.setElement( rowIdx, rowIdx - 1, 1.0 );
            hostMatrix.setElement( rowIdx, rowIdx, -4.0 );
            hostMatrix.setElement( rowIdx, rowIdx + 1, 1.0 );
            hostMatrix.setElement( rowIdx, rowIdx + gridSize, 1.0 );
         }
      }
   matrix = hostMatrix;
}
 
template< typename Matrix >
void
setElement_on_device( const int gridSize, Matrix& matrix )
{
   /***
    * Set  matrix representing approximation of the Laplace operator on regular
    * grid using the finite difference method by means of setElement method called
    * from the native device.
    */
   const int matrixSize = gridSize * gridSize;
   matrix.setDimensions( matrixSize, matrixSize, getOffsets< typename Matrix::DeviceType >( gridSize ) );
 
   auto matrixView = matrix.getView();
   auto f = [ = ] __cuda_callable__( TNL::Containers::StaticArray< 2, int > i ) mutable
   {
      const int rowIdx = i[ 1 ] * gridSize + i[ 0 ];
      if( i[ 0 ] == 0 || i[ 1 ] == 0 || i[ 0 ] == gridSize - 1 || i[ 1 ] == gridSize - 1 )
         matrixView.setElement( rowIdx, rowIdx, 1.0 );
      else {
         matrixView.setElement( rowIdx, rowIdx - gridSize, 1.0 );
         matrixView.setElement( rowIdx, rowIdx - 1, 1.0 );
         matrixView.setElement( rowIdx, rowIdx, -4.0 );
         matrixView.setElement( rowIdx, rowIdx + 1, 1.0 );
         matrixView.setElement( rowIdx, rowIdx + gridSize, 1.0 );
      }
   };
   const TNL::Containers::StaticArray< 2, int > begin = { 0, 0 };
   const TNL::Containers::StaticArray< 2, int > end = { gridSize, gridSize };
   TNL::Algorithms::parallelFor< typename Matrix::DeviceType >( begin, end, f );
}
 
template< typename Matrix >
void
getRow( const int gridSize, Matrix& matrix )
{
   /***
    * Set  matrix representing approximation of the Laplace operator on regular
    * grid using the finite difference method by means of getRow method.
    */
   const int matrixSize = gridSize * gridSize;
   matrix.setDimensions( matrixSize, matrixSize, getOffsets< typename Matrix::DeviceType >( gridSize ) );
 
   auto matrixView = matrix.getView();
   auto f = [ = ] __cuda_callable__( int rowIdx ) mutable
   {
      const int i = rowIdx % gridSize;
      const int j = rowIdx / gridSize;
      auto row = matrixView.getRow( rowIdx );
      if( i == 0 || j == 0 || i == gridSize - 1 || j == gridSize - 1 )
         row.setElement( 2, 1.0 );
      else {
         row.setElement( 0, 1.0 );
         row.setElement( 1, 1.0 );
         row.setElement( 2, -4.0 );
         row.setElement( 3, 1.0 );
         row.setElement( 4, 1.0 );
      }
   };
   TNL::Algorithms::parallelFor< typename Matrix::DeviceType >( 0, matrixSize, f );
}
 
template< typename Matrix >
void
forElements( const int gridSize, Matrix& matrix )
{
   /***
    * Set  matrix representing approximation of the Laplace operator on regular
    * grid using the finite difference method by means of forElements method.
    */
 
   const int matrixSize = gridSize * gridSize;
   matrix.setDimensions( matrixSize, matrixSize, getOffsets< typename Matrix::DeviceType >( gridSize ) );
 
   auto f = [ = ] __cuda_callable__( int rowIdx, int localIdx, int columnIdx, float& value ) mutable
   {
      const int i = rowIdx % gridSize;
      const int j = rowIdx / gridSize;
      if( ( i == 0 || j == 0 || i == gridSize - 1 || j == gridSize - 1 ) && localIdx == 0 ) {
         value = 1.0;
      }
      else {
         if( localIdx == 2 )
            value = -4.0;
         else
            value = 1.0;
      }
   };
   matrix.forElements( 0, matrixSize, f );
}
 
template< typename Device >
void
laplaceOperatorMultidiagonalMatrix()
{
   std::cout << " Sparse matrix test:" << std::endl;
   for( int gridSize = 16; gridSize <= 8192; gridSize *= 2 ) {
      std::cout << "  Grid size = " << gridSize << std::endl;
      TNL::Timer timer;
 
      std::cout << "   setElement on host: ";
      timer.reset();
      timer.start();
      for( int i = 0; i < testsCount; i++ ) {
         TNL::Matrices::MultidiagonalMatrix< float, Device, int > matrix;
         setElement_on_host( gridSize, matrix );
      }
      timer.stop();
      std::cout << timer.getRealTime() / (double) testsCount << " sec." << std::endl;
 
      if( std::is_same< Device, TNL::Devices::Cuda >::value ) {
         std::cout << "   setElement on host and transfer on GPU: ";
         timer.reset();
         timer.start();
         for( int i = 0; i < testsCount; i++ ) {
            TNL::Matrices::MultidiagonalMatrix< float, Device, int > matrix;
            setElement_on_host_and_transfer( gridSize, matrix );
         }
         timer.stop();
         std::cout << timer.getRealTime() / (double) testsCount << " sec." << std::endl;
      }
 
      std::cout << "   setElement on device: ";
      timer.reset();
      timer.start();
      for( int i = 0; i < testsCount; i++ ) {
         TNL::Matrices::MultidiagonalMatrix< float, Device, int > matrix;
         setElement_on_device( gridSize, matrix );
      }
      timer.stop();
      std::cout << timer.getRealTime() / (double) testsCount << " sec." << std::endl;
 
      std::cout << "   getRow: ";
      timer.reset();
      timer.start();
      for( int i = 0; i < testsCount; i++ ) {
         TNL::Matrices::MultidiagonalMatrix< float, Device, int > matrix;
         getRow( gridSize, matrix );
      }
      timer.stop();
      std::cout << timer.getRealTime() / (double) testsCount << " sec." << std::endl;
 
      std::cout << "   forElements: ";
      timer.reset();
      timer.start();
      for( int i = 0; i < testsCount; i++ ) {
         TNL::Matrices::MultidiagonalMatrix< float, Device, int > matrix;
         forElements( gridSize, matrix );
      }
      timer.stop();
      std::cout << timer.getRealTime() / (double) testsCount << " sec." << std::endl;
   }
}
 
int
main( int argc, char* argv[] )
{
   std::cout << "Creating Laplace operator matrix on CPU ... " << std::endl;
   laplaceOperatorMultidiagonalMatrix< TNL::Devices::Host >();
 
#ifdef __CUDACC__
   std::cout << "Creating Laplace operator matrix on CUDA GPU ... " << std::endl;
   laplaceOperatorMultidiagonalMatrix< TNL::Devices::Cuda >();
#endif
}

Table of Contents

Introduction

Overview of matrix types

Indexing of nonzero matrix elements in sparse matrices

Matrix view

Allocation and setup of different matrix types

Dense matrix

Sparse matrix

Multidiagonal matrix

Dense matrices

Initializer list

Methods setElement and addElement

Method getRow

Method forRows

Method forElements

Wrapping existing data to dense matrix view

Sparse matrices

Setting of row capacities

Initializer list

STL map

Methods setElement and addElement

Method getRow

Method forRows

Method forElements

Wrapping existing data to sparse matrix view

Symmetric sparse matrices

Binary sparse matrices

Tridiagonal matrices

Initializer list

Methods setElement and addElement

Method getRow

Method forRows

Method forElements

Multidiagonal matrices

Initializer list

Methods setElement and addElement

Method getRow

Method forRows

Method forElements

Lambda matrices

Lambda matrix inititation

Method forRows

Method forElements

Distributed matrices

Flexible reduction in matrix rows

Lambda function fetch

Lambda function reduce

Lambda function keep

Dense matrices example

Sparse matrices example

Tridiagonal matrices example

Multidiagonal matrices example

Lambda matrices example

Matrix-vector product

Matrix I/O operations

Matrix reader and writer

Appendix

Benchmark of dense matrix setup

Benchmark of sparse matrix setup

Benchmark of multidiagonal matrix setup

Methods `setElement` and `addElement`

Method `getRow`

Method `forRows`

Method `forElements`

Methods `setElement` and `addElement`

Method `getRow`

Method `forRows`

Method `forElements`