TODO: Add description of forRows and sequentialForRows.
Introduction
TNL offers several types of matrices like dense (TNL::Matrices::DenseMatrix), sparse (TNL::Matrices::SparseMatrix), tridiagonal (TNL::Matrices::TridiagonalMatrix), multidiagonal (TNL::Matrices::MultidiagonalMatrix) and lambda matrices (TNL::Matrices::LambdaMatrix). The sparse matrices can be symmetric to lower the memory requirements. The interfaces of given matrix types are designed to be as unified as possible to ensure that the user can easily switch between different matrix types while making no or only a little changes in the source code. All matrix types allows traversing all matrix elements and manipulate them using lambda functions as well as performing flexible reduction in matrix rows. The following text describes particular matrix types and their unified interface in details.
Overview of matrix types
In a lot of numerical algorithms either dense or sparse matrices are used. The dense matrix (TNL::Matrices::DenseMatrix) is such that all or at least most of its matrix elements are nonzero. On the other hand sparse matrix (TNL::Matrices::SparseMatrix) is a matrix which has most of the matrix elements equal to zero. From the implementation point of view, the data structures for the dense matrices allocates all matrix elements while formats for the sparse matrices aim to store explicitly only the nonzero matrix elements. The most popular format for storing the sparse matrices is CSR format. However, especially for better data alignment in memory of GPUs, many other formats were designed. In TNL, the user may choose between several different sparse matrix formats. There are also sparse matrices with specific pattern of the nonzero elements like tridiagonal matrices (TNL::Matrices::TridiagonalMatrix) which "has nonzero elements on the main diagonal, the first diagonal below this, and the first diagonal above the main diagonal only". An example of such matrix may look as follows:
\[\left(
\begin{array}{ccccccc}
-2 & 1 & . & . & . & . \\
1 & -2 & 1 & . & . & . \\
. & 1 & -2 & 1 & . & . \\
. & . & 1 & -2 & 1 & . \\
. & . & . & 1 & -2 & 1 \\
. & . & . & . & 1 & -2
\end{array}
\right)
\]
Similar but more general type of matrices are multidiagonal matrices (TNL::Matrices::MultidiagonalMatrix) which have the nonzero matrix elements positioned only on lines parallel to the main diagonal like the following matrix:
\[ \left(
\begin{array}{ccccccc}
-4 & 1 & . & 1 & . & . \\
1 & -4 & 1 & . & 1 & . \\
. & 1 & -4 & 1 & . & 1 \\
1 & . & 1 & -4 & 1 & . \\
. & 1 & . & 1 & -4 & 1 \\
. & . & 1 & . & 1 & -4
\end{array}
\right)
\]
Finally, TNL offers so called lambda matrices (TNL::Matrices::LambdaMatrix) which are kind of "matrix-free matrices". They do not store the matrix elements explicitly in the memory, but rather evaluates them on-the-fly based on user defined lambda functions.
In the following table we show comparison of expressing a tridiagonal matrix by means of different matrix types.
Matrix dimensions | Dense elems. | Dense mem. | Sparse elems. | Sparse mem. | Tridiag. elems. | Tridiag. mem. | Multidiag. elems. | Mutlidiag. mem. |
10x10 | 100 | 800 B | >28 | >336 B | 30 | 240 B | 30 | 252 B |
100x100 | 10,000 | 80 kB | >298 | >3,576 B | 300 | 2,400 B | 300 | 2,412 B |
1,000x1,000 | 1,000,000 | 8 MB | >2,998 | >35,976 B | 3,000 | 24,000 B | 3,000 | 24,012 B |
10,000x10,000 | 100,000,000 | 800 MB | >29,998 | >359,976 B | 30,000 | 240,000 B | 30,000 | 240,012 B |
100,000x100,000 | 10,000,000,000 | 80 GB | >299,998 | >3,599,876 B | 300,000 | 2,400,000 B | 300,000 | 2,400,012 B |
In the table:
- Matrix dimensions is the number of matrix rows and columns
- Dense elems. is the number of allocated matrix elements in the dense matrix.
- Dense mem. is the allocated memory for the matrix elements in the dense matrix if the elements are stored in the double precision.
- Sparse elems. is the number of allocated matrix elements in the sparse matrix. Some formats may allocate padding zeros for better data alignment in the memory and so the number of allocated matrix elements may increase.
- Sparse mem. is the allocated memory for the matrix elements in the sparse matrix if the elements are stored in the double precision and column indexes in 32-bit integer.
- Tridiag. elems is the number of allocated matrix elements in the tridiagonal matrix.
- Tridiag mem. is the allocated memory for the matrix elements in the tridiagonal matrix if the elements are stored in the double precision.
- Multidiag. elems is the number of allocated matrix elements in the multidiagonal matrix.
- Multidiag mem. is the allocated memory for the matrix elements in the multidiagonal matrix if the elements are stored in the double precision.
Choosing the best matrix type can have tremendous impact on the performance but also on memory requirements. If we would treat each matrix as dense one we would not be able to to work with matrices larger than 50,000x50,000 on common personal computers, because we would need tens of gibabytes of memory. At the same time, we see that the other matrix types can do the same job with only few megabytes. In addition, other matrix types work with much less matrix elements and so operations like matrix-vector multiplication can be done with significantly less operations which means much faster. Since in the modern hardware architectures, the computing units are limited mainly by the performance of the memory chips (so called memory wall), transferring less data from the memory increases the performance even more.
The following table shows the same as the one above but when storing a matrix which has only five nonzero elements in each row. Such matrices arise often from the finite difference method for solution of the partial differential equations:
Matrix dimensions | Dense elems. | Dense mem. | Sparse elems. | Sparse mem. | Multidiag. elems. | Mutlidiag. mem. |
10x10 | 100 | 800 B | >50 | >600 B | 50 | 420 B |
100x100 | 10,000 | 80 kB | >500 | >6,000 B | 500 | 4,020 B |
1,000x1,000 | 1,000,000 | 8 MB | >5,000 | >60,000 B | 5,000 | 40,020 B |
10,000x10,000 | 100,000,000 | 800 MB | >50,000 | >600,000 B | 50,000 | 400,020 B |
100,000x100,000 | 10,000,000,000 | 80 GB | >500,000 | >6,000,000 B | 500,000 | 4,000,020 B |
There is no change in the dense matrix part of the table. The numbers grow proportionally in case of sparse and mutlidiagonal matrix. General sparse matrix formats need to store column indexes for each matrix element which is not true for the multidiagonal matrix. The following table shows how many bytes we need for storing of one matrix element with different matrix types depending on the type of the matrix elements (Real) and column indexes (Index):
Real | Index | Dense matrix | Multidiagonal matrix | Sparse matrix | Fill ratio |
float | 32-bit | 4 B | 4 B | 8 B | << 50% |
float | 64-bit | 4 B | 4 B | 12 B | << 30% |
double | 32-bit | 8 B | 8 B | 12 B | << 60% |
double | 64-bit | 8 B | 8 B | 16 B | << 50% |
In this table:
- Real is matrix element type.
- Index is column index type.
- Dense matrix is number of bytes needed to store one matrix element in the dense matrix.
- Multidiagonal matrix is number of bytes needed to store one matrix element in the mutldiagonal matrix.
- Sparse matrix is number of bytes needed to store one matrix element in the sparse matrix.
- Fill ratio is maximal percentage of the nonzero matrix elements until which the sparse matrix can perform better.
The multidiagonal matrix type is especially suitable for the finite difference method or similar numerical methods for solution of the partial differential equations.
Indexing of nonzero matrix elements in sparse matrices
The sparse matrix formats usually, in the first step, compress the matrix rows by omitting the zero matrix elements as follows
\[\left(
\begin{array}{ccccc}
0 & 1 & 0 & 2 & 0 \\
0 & 0 & 5 & 0 & 0 \\
4 & 0 & 0 & 0 & 7 \\
0 & 3 & 0 & 8 & 5 \\
0 & 5 & 7 & 0 & 0
\end{array}
\right)
\rightarrow
\left(
\begin{array}{ccccc}
1 & 2 & . & . & . \\
5 & . & . & . & . \\
4 & 7 & . & . & . \\
3 & 8 & 5 & . & . \\
5 & 7 & . & . & .
\end{array}
\right)
\]
In such a form, it is more efficient to refer the nonzero matrix elements in given row by their rank in the compressed matrix row rather than by their column index in the original matrix. In methods for the sparse matrices, this parameter is called localIdx. Some sparse matrix formats add padding zeros for better alignment of data in memory. But if this is not the case, the variable localIdx of particular matrix elements would read as:
\[\left(
\begin{array}{ccccc}
0 & 1 & . & . & . \\
0 & . & . & . & . \\
0 & 1 & . & . & . \\
0 & 1 & 2 & . & . \\
0 & 1 & . & . & .
\end{array}
\right)
\]
Matrix view
Matrix views are small reference objects which help accessing the matrix in GPU kernels or lambda functions being executed on GPUs. We describe this in details in section about Shared pointers and views. The problem lies in fact that we cannot pass references to GPU kernels and we do not want to pass there deep copies of matrices. Matrix view is some kind of reference to a matrix. A copy of matrix view is always shallow and so it behaves like a reference. The following example shows how to obtain the matrix view by means of method getView and pass it to a lambda function:
1#include <iostream>
2#include <TNL/Algorithms/parallelFor.h>
3#include <TNL/Matrices/SparseMatrix.h>
4#include <TNL/Devices/Host.h>
5#include <TNL/Devices/Cuda.h>
6
7template< typename Device >
8void
9getRowExample()
10{
11
12
13
14
15
16
17
18
19
20 const int size = 5;
23
24
25
26
28 {
29 auto row = view.
getRow( rowIdx );
30 if( rowIdx == 0 )
31 row.setElement( 0, rowIdx, 2.0 );
32 else if( rowIdx == size - 1 )
33 row.setElement( 0, rowIdx, 2.0 );
34 else {
35 row.setElement( 0, rowIdx - 1, 1.0 );
36 row.setElement( 1, rowIdx, 2.0 );
37 row.setElement( 2, rowIdx + 1, 1.0 );
38 }
39 };
42}
43
44int
45main( int argc, char* argv[] )
46{
48 getRowExample< TNL::Devices::Host >();
49
50#ifdef __CUDACC__
52 getRowExample< TNL::Devices::Cuda >();
53#endif
54}
#define __cuda_callable__
Definition Macros.h:49
__cuda_callable__ ConstRowView getRow(IndexType rowIdx) const
Constant getter of simple structure for accessing given matrix row.
Definition SparseMatrixBase.hpp:132
Implementation of sparse matrix, i.e. matrix storing only non-zero elements.
Definition SparseMatrix.h:55
ViewType getView()
Returns a modifiable view of the sparse matrix.
Definition SparseMatrix.hpp:166
std::enable_if_t< std::is_integral_v< Begin > &&std::is_integral_v< End > > parallelFor(const Begin &begin, const End &end, typename Device::LaunchConfiguration launch_config, Function f, FunctionArgs... args)
Parallel for-loop function for 1D range specified with integral values.
Definition parallelFor.h:41
Here we create sparse matrix matrix on the line 11, and use the method getView to get the matrix view on the line 12. The view is then used in the lambda function on the line 15 where it is captured by value (see [=] in the definition of the lambda function f on the line 14).
Allocation and setup of different matrix types
There are several ways how to create a new matrix:
- Initializer lists allow to create matrix from the C++ initializer lists. The matrix elements must be therefore encoded in the source code and so it is useful for rather smaller matrices. Methods and constructors with initializer lists are user friendly and simple to use. It is a good choice for tool problems with small matrices.
- STL map can be used for creation of sparse matrices only. The user first insert all matrix elements together with their coordinates into std::map based on which the sparse matrix is created in the next step. It is simple and user friendly approach suitable for creation of large matrices. An advantage is that we do not need to know the distribution of the nonzero matrix elements in matrix rows in advance like we do in other ways of construction of sparse matrices. This makes the use of STL map suitable for combining of sparse matrices from TNL with other numerical packages. However, the sparse matrix is constructed on the host and then copied on GPU if necessary. Therefore, this approach is not a good choice if fast and efficient matrix construction is required.
- Methods setElement and addElement called from the host allow to change particular matrix elements. The methods can be called from host even for matrices allocated on GPU. In this case, however, the matrix elements are transferred on GPU one by one which is very inefficient. If the matrix is allocated on the host system (CPU), the efficiency is nearly optimal. In case of sparse matrices, one must set row capacities (i.e. maximal number of nonzero elements in each row) before using these methods. If the row capacity is exceeded, the matrix has to be reallocated and all matrix elements are lost.
- Methods setElement and addElement called on the host and copy matrix on GPU setting particular matrix elements by the methods setElement and addElement when the matrix is allocated on GPU can be time consuming for large matrices. Setting up the matrix on CPU using the same methods and copying it on GPU at once when the setup is finished can be significantly more efficient. A drawback is that we need to allocate temporarily whole matrix on CPU.
- Methods setElement and addElement called from native device allow to do efficient matrix elements setup even on devices (GPUs). In this case, the methods must be called from a GPU kernel or a lambda function combined with the parallel for (TNL::Algorithms::parallelFor). The user get very good performance even when manipulating matrix allocated on GPU. On the other hand, only data structures allocated on GPUs can be accessed from the kernel or lambda function. The matrix can be accessed in the GPU kernel or lambda function by means of matrix view (see the previous section) or the shared pointer (see TNL::Pointers::SharedPointer).
- Method getRow combined with parallelFor is very similar to the previous one. The difference is that we first fetch helper object called matrix row which is linked to particular matrix row. Using methods of this object, one may change the matrix elements in given matrix row. An advantage is that the access to the matrix row is resolved only once for all elements in the row. In some more sophisticated sparse matrix formats, this can be nontrivial operation and this approach may slightly improve the performance. Another advantage for sparse matrices is that we access the matrix elements based on their local index ('localIdx', see [Indexing of nonzero matrix elements in sparse matrices](indexing_of_nonzero_matrix_elements_in_sparse_matrices)) in the row which is something like a rank of the nonzero element in the row. This is more efficient than addressing the matrix elements by the column indexes which requires searching in the matrix row. So this may significantly improve the performance of setup of sparse matrices. When it comes to dense matrices, there should not be great difference in performance compared to use of the methods setElement and getElement. Note that when the method is called from a GPU kernel or a lambda function, only data structures allocated on GPU can be accessed and the matrix must be made accessible by the means of matrix view.
- Methods forRows and forElements this approach is very similar to the previous one but it avoids using parallelFor and necessity of passing the matrix to GPU kernels by matrix view or shared pointers.
The following table shows pros and cons of particular methods:
Method | Efficient | Easy to use | Pros | Cons |
Initializer list | ** | ***** | Very easy to use. | Only for small matrices. |
| | | Does not need setting of matrix rows capacities | |
STL map | ** | ***** | Very easy to use. | Higher memory requirements. |
| | | Does not need setting of matrix rows capacities | Slow transfer on GPU. |
[set,add]Element on host | ****/* | ***** | Very easy to use. | Requires setting of row capacities. |
| | | | Extremely slow transfer on GPU. |
[set,and]Element on host© on GPU | *** | **** | Easy to use. | Requires setting of row capacities. |
| | | Reasonable efficiency. | Allocation of auxiliary matrix on CPU. |
[set,add]Element on native device | **** | | Good efficiency. | Requires setting of row capacities. |
| | | | Requires writing GPU kernel or lambda function. |
| | | | Allows accessing only data allocated on the same device/memory space. |
getRow and parallelFor | ***** | ** | Best efficiency for sparse matrices. | Requires setting of row capacities. |
| | | | Requires writing GPU kernel or lambda function. |
| | | | Allows accessing only data allocated on the same device/memory space. |
| | | | Use of matrix local indexes can be less intuitive. |
forRows, forElements | ***** | ** | Best efficiency for sparse matrices. | Requires setting of row capacities. |
| | | Avoid use of matrix view or shared pointer in kernels/lambda function. | Requires writing GPU kernel or lambda function. |
| | | | Allows accessing only data allocated on the same device/memory space. |
| | | | Use of matrix local indexes is less intuitive. |
Though it may seem that the later methods come with more cons than pros, they offer much higher performance and we believe that even they are still user friendly. On the other hand, if the matrix setup performance is not a priority, the use easy-to-use but slow method can still be a good choice. The following tables demonstrate the performance of different methods. The tests were performed with the following setup:
| |
CPU | Intel i9-9900KF, 3.60GHz, 8 cores, 16384 KB cache |
GPU | GeForce RTX 2070 |
g++ version | 10.2.0 |
nvcc version | 11.2.67 |
Precision | single precision |
Dense matrix
In the test of dense matrices, we set each matrix element to value equal to rowIdx + columnIdx. The times in seconds obtained on CPU looks as follows:
Matrix rows and columns | setElement on host | setElement with parallelFor | getRow | forElements |
16 | 0.00000086 | 0.0000053 | 0.00000035 | 0.0000023 |
32 | 0.00000278 | 0.0000050 | 0.00000201 | 0.0000074 |
64 | 0.00000703 | 0.0000103 | 0.00000354 | 0.0000203 |
128 | 0.00002885 | 0.0000312 | 0.00000867 | 0.0000709 |
256 | 0.00017543 | 0.0000439 | 0.00002490 | 0.0001054 |
512 | 0.00078153 | 0.0001683 | 0.00005999 | 0.0002713 |
1024 | 0.00271989 | 0.0006691 | 0.00003808 | 0.0003942 |
2048 | 0.01273520 | 0.0038295 | 0.00039116 | 0.0017083 |
4096 | 0.08381450 | 0.0716542 | 0.00937997 | 0.0116771 |
8192 | 0.51596800 | 0.3535530 | 0.03971900 | 0.0467374 |
Here:
- setElement on host tests run in one thread. Therefore they are faster for small matrices compared to setElement with parallelFor tests.
- setElement with parallelFor tests run in parallel in several OpenMP threads. This approach is faster for larger matrices.
- getRow tests run in parallel in several OpenMP threads mapping of which is more efficient compared to setElement on host tests.
And the same on GPU is in the following table:
Matrix rows and columns | setElement on host | setElement on host and copy | setElement on GPU | getRow | forElements |
16 | 0.027835 | 0.02675 | 0.000101198 | 0.00009903 | 0.000101214 |
32 | 0.002776 | 0.00018 | 0.000099197 | 0.00009901 | 0.000100481 |
64 | 0.010791 | 0.00015 | 0.000094446 | 0.00009493 | 0.000101796 |
128 | 0.043014 | 0.00021 | 0.000099397 | 0.00010024 | 0.000102729 |
256 | 0.171029 | 0.00056 | 0.000100469 | 0.00010448 | 0.000105893 |
512 | 0.683627 | 0.00192 | 0.000103346 | 0.00011034 | 0.000112752 |
1024 | 2.736680 | 0.00687 | 0.000158805 | 0.00016932 | 0.000170302 |
2048 | 10.930300 | 0.02474 | 0.000509000 | 0.00050917 | 0.000511183 |
4096 | 43.728700 | 0.13174 | 0.001557030 | 0.00156117 | 0.001557930 |
8192 | 174.923000 | 0.70602 | 0.005312470 | 0.00526658 | 0.005263870 |
Here:
- setElement on host tests are very slow especially for large matrices since each matrix element is copied on GPU separately.
- setElement on host and copy tests are much faster because the matrix is copied from CPU to GPU on the whole which is more efficient.
- setElement on GPU tests are even more faster since there is no transfer of data between CPU and GPU.
- getRow tests have the same performance as setElement on GPU.
- forElements tests have the same performance as both setElement on GPU and getRow.
You can see the source code of the previous benchmark in Appendix.
Sparse matrix
The sparse matrices are tested on computation of matrix the discrete Laplace operator in 2D. This matrix has at most five nonzero elements in each row. The times for sparse matrix (with CSR format) on CPU in seconds look as follows:
Matrix rows and columns | STL Map | setElement on host | setElement with parallelFor | getRow | forElements |
256 | 0.00016 | 0.000017 | 0.000014 | 0.000013 | 0.000020 |
1,024 | 0.00059 | 0.000044 | 0.000021 | 0.000019 | 0.000022 |
4,096 | 0.00291 | 0.000130 | 0.000031 | 0.000022 | 0.000031 |
16,384 | 0.01414 | 0.000471 | 0.000067 | 0.000031 | 0.000065 |
65,536 | 0.06705 | 0.001869 | 0.000218 | 0.000074 | 0.000209 |
262,144 | 0.31728 | 0.007436 | 0.000856 | 0.000274 | 0.000799 |
1,048,576 | 1.46388 | 0.027087 | 0.006162 | 0.005653 | 0.005904 |
4,194,304 | 7.46147 | 0.102808 | 0.028385 | 0.027925 | 0.027937 |
16,777,216 | 38.95900 | 0.413823 | 0.125870 | 0.124588 | 0.123858 |
67,108,864 | 185.75700 | 1.652580 | 0.505232 | 0.501003 | 0.500927 |
Here:
- STL Map tests show that use of STL Map can be very slow on large matrices and, of course, they need to allocate the map containing all the matrix elements. This can be memory consuming. On the other hand, it is the only way which does not require knowing the matrix row capacities in advance.
- setElement on host tests are much faster compared to STL map, it does not need to allocate anything else except the sparse matrix. However, matrix row capacities must be known in advance.
- setElement with parallelFor tests run in parallel in several OpenMP threads and so this can be faster for larger matrices.
- getRow tests perform the same as setElement with parallelFor.
- forElements tests perform the same as both setElement with parallelFor and forElements.
We see, that the use of STL map makes sense only in situation when it is hard to estimate necessary row capacities. Otherwise very easy setup with setElement method is much faster. If the performance is the highest priority, getRow method should be preferred. The results for GPU are in the following table:
Matrix rows and columns | STL Map | setElement on host | setElement on host and copy | setElement on GPU | getRow | forElements |
256 | 0.002 | 0.036 | 0.0280 | 0.00017 | 0.00017 | 0.00017 |
1,024 | 0.001 | 0.161 | 0.0006 | 0.00017 | 0.00017 | 0.00017 |
4,096 | 0.003 | 0.680 | 0.0010 | 0.00020 | 0.00020 | 0.00020 |
16,384 | 0.015 | 2.800 | 0.0034 | 0.00021 | 0.00020 | 0.00021 |
65,536 | 0.074 | 11.356 | 0.0130 | 0.00048 | 0.00047 | 0.00048 |
262,144 | 0.350 | 45.745 | 0.0518 | 0.00088 | 0.00087 | 0.00088 |
1,048,576 | 1.630 | 183.632 | 0.2057 | 0.00247 | 0.00244 | 0.00245 |
4,194,304 | 8.036 | 735.848 | 0.8119 | 0.00794 | 0.00783 | 0.00788 |
16,777,216 | 41.057 | 2946.610 | 3.2198 | 0.02481 | 0.02429 | 0.02211 |
67,108,864 | 197.581 | 11791.601 | 12.7775 | 0.07196 | 0.06329 | 0.06308 |
Here:
- STL Map tests show that the times are comparable to CPU times which means the most of the time is spent by creating the matrix on CPU.
- setElement on host tests are again extremely slow for large matrices. It is even slower than the use of STL map. So in case of GPU, this is another reason for using the STL map.
- setElement on host and copy tests are, similar to the dense matrix, much faster compared to the previous approaches. So it is the best way when you need to use data structures available only on the host system (CPU).
- setElement on GPU tests exhibit the best performance together with getRow and forElements methods. Note, however, that this method can be slower that getRow and forElements if there would be more nonzero matrix elements in a row.
- getRow tests exhibit the best performance together with setElement on GPU and forElements methods.
- forElements tests exhibit the best performance together with getRow and setElement on GPU methods.
Here we see, that the setElement methods performs extremely bad because all matrix elements are transferred to GPU one-by-one. Even STL map is much faster. Note, that the times for STL map are not much higher compared to CPU which indicates that the transfer of the matrix on GPU is not dominant. Setup of the matrix on CPU by the means of setElement method and transfer on GPU is even faster. However, the best performance can be obtained only we creating the matrix directly on GPU by methods setElement, getRow and forElements. Note, however, that even if all of them perform the same way, for matrices with more nonzero matrix elements in a row, setElement could be slower compared to the getRow and forElements.
You can see the source code of the previous benchmark in Appendix.
Multidiagonal matrix
Finally, the following tables show the times of the same test performed with multidiagonal matrix. Times on CPU in seconds looks as follows:
Matrix rows and columns | setElement on host | setElement with parallelFor | getRow | forElements |
256 | 0.000055 | 0.0000038 | 0.000004 | 0.000009 |
1,024 | 0.000002 | 0.0000056 | 0.000003 | 0.000006 |
4,096 | 0.000087 | 0.0000130 | 0.000005 | 0.000014 |
16,384 | 0.000347 | 0.0000419 | 0.000010 | 0.000046 |
65,536 | 0.001378 | 0.0001528 | 0.000032 | 0.000177 |
262,144 | 0.005504 | 0.0006025 | 0.000131 | 0.000711 |
1,048,576 | 0.019392 | 0.0028773 | 0.001005 | 0.003265 |
4,194,304 | 0.072078 | 0.0162378 | 0.011915 | 0.018065 |
16,777,216 | 0.280085 | 0.0642682 | 0.048876 | 0.072084 |
67,108,864 | 1.105120 | 0.2427610 | 0.181974 | 0.272579 |
Here:
- setElement on host tests show that this method is fairly efficient.
- setElement with parallelFor tests run in parallel in several OpenMP threads compared to setElement on host tests. For larger matrices, this way of matrix setup performs better.
- getRow tests perform more or less the same as setElement with parallelFor and forElements.
- forElements tests perform more or less the same as setElement with parallelFor and getRow.
Note, that setup of multidiagonal matrix is faster compared to the same matrix stored in general sparse format. Results for GPU are in the following table:
Matrix rows and columns | setElement on host | setElement on host and copy | setElement on GPU | getRow | forElements |
256 | 0.035 | 0.02468 | 0.000048 | 0.000045 | 0.000047 |
1,024 | 0.059 | 0.00015 | 0.000047 | 0.000045 | 0.000047 |
4,096 | 0.251 | 0.00044 | 0.000048 | 0.000045 | 0.000047 |
16,384 | 1.030 | 0.00158 | 0.000049 | 0.000046 | 0.000048 |
65,536 | 4.169 | 0.00619 | 0.000053 | 0.000048 | 0.000052 |
262,144 | 16.807 | 0.02187 | 0.000216 | 0.000214 | 0.000217 |
1,048,576 | 67.385 | 0.08043 | 0.000630 | 0.000629 | 0.000634 |
4,194,304 | 270.025 | 0.31272 | 0.001939 | 0.001941 | 0.001942 |
16,777,216 | 1080.741 | 1.18849 | 0.003212 | 0.004185 | 0.004207 |
67,108,864 | 4326.120 | 4.74481 | 0.013672 | 0.022494 | 0.030369 |
- setElement on host tests are extremely slow again, especially for large matrices.
- setElement on host and copy tests are much faster compared to the previous.
- setElement with parallelFor tests offer the best performance. They are even faster then getRow and forElements method. This, however, does not have be true for matrices having more nonzero elements in a row.
- getRow tests perform more or less the same as forElements. For matrices having more nonzero elements in a row this method could be faster than setElement.
- forElements tests perform more or less the same as getRow.
Note that multidiagonal matrix performs better compared to general sparse matrix. One reason for it is the fact, that the multidiagonal type does not store explicitly column indexes of all matrix elements. Because of this, less data need to be transferred from the memory.
You can see the source code of the previous benchmark in Appendix.
In the following parts we will describe how to setup particular matrix types by means of the methods mentioned above.
Dense matrices
Dense matrix (TNL::Matrices::DenseMatrix) is a templated class defined in the namespace TNL::Matrices. It has five template parameters:
- Real is a type of the matrix elements. It is double by default.
- Device is a device where the matrix shall be allocated. Currently it can be either TNL::Devices::Host for CPU or TNL::Devices::Cuda for CUDA supporting GPUs. It is TNL::Devices::Host by default.
- Index is a type to be used for indexing of the matrix elements. It is int by default.
- ElementsOrganization defines the organization of the matrix elements in memory. It can be TNL::Algorithms::Segments::ColumnMajorOrder or TNL::Algorithms::Segments::RowMajorOrder for column-major and row-major organization respectively. Be default, it is the row-major order if the matrix is allocated on the host system and column major order if it is allocated on GPU.
- RealAllocator is a memory allocator (one from TNL::Allocators) which shall be used for allocation of the matrix elements. By default, it is the default allocator for given Real type and Device type – see TNL::Allocators::Default.
The following examples show how to allocate the dense matrix and how to initialize the matrix elements.
Initializer list
Small matrices can be created simply by the constructor with an initializer list.
1#include <iostream>
2#include <TNL/Matrices/DenseMatrix.h>
3#include <TNL/Devices/Host.h>
4
5template< typename Device >
6void
7initializerListExample()
8{
10
11 { 1, 2, 3, 4, 5, 6 },
12 { 7, 8, 9, 10, 11, 12 },
13 { 13, 14, 15, 16, 17, 18 },
14
15 };
16
18
20
21 { 1 },
22 { 2, 3 },
23 { 4, 5, 6 },
24 { 7, 8, 9, 10 },
25 { 11, 12, 13, 14, 15 },
26
27 };
28
30}
31
32int
33main( int argc, char* argv[] )
34{
36 initializerListExample< TNL::Devices::Host >();
37
38#ifdef __CUDACC__
40 initializerListExample< TNL::Devices::Cuda >();
41#endif
42}
Implementation of dense matrix, i.e. matrix storing explicitly all of its elements including zeros.
Definition DenseMatrix.h:30
In fact, the constructor takes a list of initializer lists. Each embedded list defines one matrix row and so the number of matrix rows is given by the size of the outer initializer list. The number of matrix columns is given by the longest inner initializer lists. Shorter inner lists are filled with zeros from the right side. The result looks as follows:
Creating matrices on CPU ...
General dense matrix:
Row: 0 -> 0:1 1:2 2:3 3:4 4:5 5:6
Row: 1 -> 0:7 1:8 2:9 3:10 4:11 5:12
Row: 2 -> 0:13 1:14 2:15 3:16 4:17 5:18
Triangular dense matrix:
Row: 0 -> 0:1 1:0 2:0 3:0 4:0
Row: 1 -> 0:2 1:3 2:0 3:0 4:0
Row: 2 -> 0:4 1:5 2:6 3:0 4:0
Row: 3 -> 0:7 1:8 2:9 3:10 4:0
Row: 4 -> 0:11 1:12 2:13 3:14 4:15
Creating matrices on CUDA GPU ...
General dense matrix:
Row: 0 -> 0:1 1:2 2:3 3:4 4:5 5:6
Row: 1 -> 0:7 1:8 2:9 3:10 4:11 5:12
Row: 2 -> 0:13 1:14 2:15 3:16 4:17 5:18
Triangular dense matrix:
Row: 0 -> 0:1 1:0 2:0 3:0 4:0
Row: 1 -> 0:2 1:3 2:0 3:0 4:0
Row: 2 -> 0:4 1:5 2:6 3:0 4:0
Row: 3 -> 0:7 1:8 2:9 3:10 4:0
Row: 4 -> 0:11 1:12 2:13 3:14 4:15
Methods setElement and addElement
Larger matrices can be setup with methods setElement and addElement (TNL::Matrices::DenseMatrix::setElement, TNL::Matrices::DenseMatrix::addElement). The following example shows how to call these methods from the host.
1#include <iostream>
2#include <TNL/Matrices/DenseMatrix.h>
3#include <TNL/Devices/Host.h>
4
5template< typename Device >
6void
7addElements()
8{
10 auto matrixView = matrix.getView();
11
12 for( int i = 0; i < 5; i++ )
13 matrixView.setElement( i, i, i );
14
16
17 for( int i = 0; i < 5; i++ )
18 for( int j = 0; j < 5; j++ )
19 matrixView.addElement( i, j, 1.0, 5.0 );
20
22}
23
24int
25main( int argc, char* argv[] )
26{
28 addElements< TNL::Devices::Host >();
29
30#ifdef __CUDACC__
32 addElements< TNL::Devices::Cuda >();
33#endif
34}
As we can see, both methods can be called from the host no matter where the matrix is allocated. If it is on GPU, each call of setElement or addElement (TNL::Matrices::DenseMatrix::setElement, TNL::Matrices::DenseMatrix::addElement) causes slow transfer of the data between CPU and GPU. Use this approach only if the performance is not a priority. The result looks as follows:
Add elements on host:
Initial matrix is:
Row: 0 -> 0:0 1:0 2:0 3:0 4:0
Row: 1 -> 0:0 1:1 2:0 3:0 4:0
Row: 2 -> 0:0 1:0 2:2 3:0 4:0
Row: 3 -> 0:0 1:0 2:0 3:3 4:0
Row: 4 -> 0:0 1:0 2:0 3:0 4:4
Matrix after addition is:
Row: 0 -> 0:1 1:1 2:1 3:1 4:1
Row: 1 -> 0:1 1:6 2:1 3:1 4:1
Row: 2 -> 0:1 1:1 2:11 3:1 4:1
Row: 3 -> 0:1 1:1 2:1 3:16 4:1
Row: 4 -> 0:1 1:1 2:1 3:1 4:21
Add elements on CUDA device:
Initial matrix is:
Row: 0 -> 0:0 1:0 2:0 3:0 4:0
Row: 1 -> 0:0 1:1 2:0 3:0 4:0
Row: 2 -> 0:0 1:0 2:2 3:0 4:0
Row: 3 -> 0:0 1:0 2:0 3:3 4:0
Row: 4 -> 0:0 1:0 2:0 3:0 4:4
Matrix after addition is:
Row: 0 -> 0:1 1:1 2:1 3:1 4:1
Row: 1 -> 0:1 1:6 2:1 3:1 4:1
Row: 2 -> 0:1 1:1 2:11 3:1 4:1
Row: 3 -> 0:1 1:1 2:1 3:16 4:1
Row: 4 -> 0:1 1:1 2:1 3:1 4:21
More efficient way of the matrix initialization on GPU consists of calling the methods setElement and addElement (TNL::Matrices::DenseMatrix::setElement, TNL::Matrices::DenseMatrix::addElement) directly from GPU, for example by means of lambda function and parallelFor. It is demonstrated in the following example (of course it works even on CPU):
1#include <iostream>
2#include <TNL/Algorithms/parallelFor.h>
3#include <TNL/Containers/StaticArray.h>
4#include <TNL/Matrices/DenseMatrix.h>
5#include <TNL/Devices/Host.h>
6
7template< typename Device >
8void
9setElements()
10{
12 auto matrixView = matrix.getView();
13 for( int i = 0; i < 5; i++ )
14 matrixView.setElement( i, i, i );
15
18
20 {
21 matrixView.addElement( i[ 0 ], i[ 1 ], 5.0 );
22 };
26
29}
30
31int
32main( int argc, char* argv[] )
33{
35 setElements< TNL::Devices::Host >();
36
37#ifdef __CUDACC__
39 setElements< TNL::Devices::Cuda >();
40#endif
41}
Array with constant size.
Definition StaticArray.h:21
Here we get the matrix view (TNL::Matrices::DenseMatrixView) (line 10) to make the matrix accessible in lambda function even on GPU (see Shared pointers and views ). We first call the setElement method from CPU to set the i-th diagonal element to i (lines 11-12). Next we iterate over the matrix rows with parallelFor (line 20) and for each row we call the lambda function f. This is done on the same device where the matrix is allocated and so it we get optimal performance even for matrices on GPU. In the lambda function we add one to each matrix element (line 18). The result looks as follows:
Set elements on host:
Matrix set from the host:
Row: 0 -> 0:0 1:0 2:0 3:0 4:0
Row: 1 -> 0:0 1:1 2:0 3:0 4:0
Row: 2 -> 0:0 1:0 2:2 3:0 4:0
Row: 3 -> 0:0 1:0 2:0 3:3 4:0
Row: 4 -> 0:0 1:0 2:0 3:0 4:4
Matrix set from its native device:
Row: 0 -> 0:5 1:5 2:5 3:5 4:5
Row: 1 -> 0:5 1:6 2:5 3:5 4:5
Row: 2 -> 0:5 1:5 2:7 3:5 4:5
Row: 3 -> 0:5 1:5 2:5 3:8 4:5
Row: 4 -> 0:5 1:5 2:5 3:5 4:9
Set elements on CUDA device:
Matrix set from the host:
Row: 0 -> 0:0 1:0 2:0 3:0 4:0
Row: 1 -> 0:0 1:1 2:0 3:0 4:0
Row: 2 -> 0:0 1:0 2:2 3:0 4:0
Row: 3 -> 0:0 1:0 2:0 3:3 4:0
Row: 4 -> 0:0 1:0 2:0 3:0 4:4
Matrix set from its native device:
Row: 0 -> 0:5 1:5 2:5 3:5 4:5
Row: 1 -> 0:5 1:6 2:5 3:5 4:5
Row: 2 -> 0:5 1:5 2:7 3:5 4:5
Row: 3 -> 0:5 1:5 2:5 3:8 4:5
Row: 4 -> 0:5 1:5 2:5 3:5 4:9
Method getRow
This method is available for the dense matrix (TNL::Matrices::DenseMatrix::getRow) mainly for two reasons:
- The method getRow is recommended for sparse matrices. In most cases, it is not optimal for dense matrices. However, if one needs to have one code for both dense and sparse matrices, this method is a good choice.
- In general, use of setElement (TNL::Matrices::DenseMatrix::setElement) combined with parallelFor is preferred, for dense matrices, since it offers more parallelism for GPUs. parallelFor creates one CUDA thread per each matrix element which is desirable for GPUs. With the use of the method getRow we have only one CUDA thread per each matrix row. This makes sense only in situation when we need to setup each matrix row sequentially.
Here we show an example:
1#include <iostream>
2#include <TNL/Algorithms/parallelFor.h>
3#include <TNL/Matrices/DenseMatrix.h>
4#include <TNL/Devices/Host.h>
5#include <TNL/Devices/Cuda.h>
6
7template< typename Device >
8void
9getRowExample()
10{
11 const int size = 5;
13
14
15
16
17
18 auto matrixView = matrix.getView();
19
21 {
22 auto row = matrixView.getRow( rowIdx );
23 if( rowIdx > 0 )
24 row.setValue( rowIdx - 1, -1.0 );
25 row.setValue( rowIdx, rowIdx + 1.0 );
26 if( rowIdx < size - 1 )
27 row.setValue( rowIdx + 1, -1.0 );
28 };
29
30
31
32
35}
36
37int
38main( int argc, char* argv[] )
39{
41 getRowExample< TNL::Devices::Host >();
42
43#ifdef __CUDACC__
45 getRowExample< TNL::Devices::Cuda >();
46#endif
47}
Here we create the matrix on the line 10 and get the matrix view on the line 16. Next we use parallelFor (line 31) to iterate over the matrix rows and call the lambda function f (lines 19-26) for each of them. In the lambda function, we first fetch the matrix row by means of the method getRow (TNL::Matrices::DenseMatrixView::getRow) and next we set the matrix elements by using the method setElement of the matrix row (TNL::Matrices::DenseMatrixRowView::setElement). For the compatibility with the sparse matrices, use the variant of setElement with the parameter localIdx. It has no effect here, it is only for compatibility of the interface.
The result looks as follows:
Getting matrix rows on host:
Row: 0 -> 0:1 1:-1 2:0 3:0 4:0
Row: 1 -> 0:-1 1:2 2:-1 3:0 4:0
Row: 2 -> 0:0 1:-1 2:3 3:-1 4:0
Row: 3 -> 0:0 1:0 2:-1 3:4 4:-1
Row: 4 -> 0:0 1:0 2:0 3:-1 4:5
Getting matrix rows on CUDA device:
Row: 0 -> 0:1 1:-1 2:0 3:0 4:0
Row: 1 -> 0:-1 1:2 2:-1 3:0 4:0
Row: 2 -> 0:0 1:-1 2:3 3:-1 4:0
Row: 3 -> 0:0 1:0 2:-1 3:4 4:-1
Row: 4 -> 0:0 1:0 2:0 3:-1 4:5
Method forRows
This method iterates in parallel over all matrix rows. In fact, it combines TNL::Algorithms::parallelFor and TNL::Matrices::DenseMatrix::getRow method in one. See the following example. It is even a bit simpler compared to the previous one:
1#include <iostream>
2#include <TNL/Matrices/DenseMatrix.h>
3#include <TNL/Devices/Host.h>
4#include <TNL/Devices/Cuda.h>
5
6template< typename Device >
7void
8forRowsExample()
9{
11 using RowView = typename MatrixType::RowView;
12 const int size = 5;
13 MatrixType matrix( size, size );
15
16
17
18
20 {
21 const int& rowIdx = row.getRowIndex();
22 if( rowIdx > 0 )
23 row.setValue( rowIdx - 1, -1.0 );
24 row.setValue( rowIdx, rowIdx + 1.0 );
25 if( rowIdx < size - 1 )
26 row.setValue( rowIdx + 1, -1.0 );
27 };
30
31
32
33
36 {
38 for( auto element : row )
39 largest =
TNL::max( largest, element.value() );
40 for( auto element : row )
41 element.value() /= largest;
42 } );
45}
46
47int
48main( int argc, char* argv[] )
49{
51 forRowsExample< TNL::Devices::Host >();
52
53#ifdef __CUDACC__
55 forRowsExample< TNL::Devices::Cuda >();
56#endif
57}
void forAllRows(Function &&function)
Method for parallel iteration over all matrix rows.
Definition SparseMatrixBase.hpp:626
constexpr ResultType max(const T1 &a, const T2 &b)
This function returns maximum of two numbers.
Definition Math.h:49
The lambda function f, which is called for each matrix row (lines 18-25), have to accept parameter row with type RowView. This type is defined inside each TNL matrix and in the case of the dense matrix, it is TNL::Matrices::DenseMatrixRowView. We use the method TNL::Matrices::DenseMatrixRowView::getRowIndex to get the index of the matrix row being currently processed and method TNL::Matrices::DenseMatrixRowView::setElement which sets the value of the element with given column index (the first parameter).
Next, on the lines 32-38, we call another lambda function which firstly find the largest element in each row (lines 33-35) and then it divides the matrix row by its value (lines 36-37).
The result looks as follows:
Getting matrix rows on host:
Row: 0 -> 0:1 1:-1 2:0 3:0 4:0
Row: 1 -> 0:-1 1:2 2:-1 3:0 4:0
Row: 2 -> 0:0 1:-1 2:3 3:-1 4:0
Row: 3 -> 0:0 1:0 2:-1 3:4 4:-1
Row: 4 -> 0:0 1:0 2:0 3:-1 4:5
Divide each matrix row by its largest element...
Row: 0 -> 0:1 1:-1 2:0 3:0 4:0
Row: 1 -> 0:-0.5 1:1 2:-0.5 3:0 4:0
Row: 2 -> 0:0 1:-0.333333 2:1 3:-0.333333 4:0
Row: 3 -> 0:0 1:0 2:-0.25 3:1 4:-0.25
Row: 4 -> 0:0 1:0 2:0 3:-0.2 4:1
Getting matrix rows on CUDA device:
Row: 0 -> 0:1 1:-1 2:0 3:0 4:0
Row: 1 -> 0:-1 1:2 2:-1 3:0 4:0
Row: 2 -> 0:0 1:-1 2:3 3:-1 4:0
Row: 3 -> 0:0 1:0 2:-1 3:4 4:-1
Row: 4 -> 0:0 1:0 2:0 3:-1 4:5
Divide each matrix row by its largest element...
Row: 0 -> 0:1 1:-1 2:0 3:0 4:0
Row: 1 -> 0:-0.5 1:1 2:-0.5 3:0 4:0
Row: 2 -> 0:0 1:-0.333333 2:1 3:-0.333333 4:0
Row: 3 -> 0:0 1:0 2:-0.25 3:1 4:-0.25
Row: 4 -> 0:0 1:0 2:0 3:-0.2 4:1
Method forElements
The next example demonstrates the method forElements (TNL::Matrices::DenseMatrix::forElements) which works in very similar way as the method getRow but it is slightly easier to use. It is also compatible with sparse matrices. See the following example:
1#include <iostream>
2#include <TNL/Matrices/DenseMatrix.h>
3#include <TNL/Devices/Host.h>
4#include <TNL/Devices/Cuda.h>
5
6template< typename Device >
7void
8forElementsExample()
9{
11 auto matrixView = matrix.getView();
12
13 auto f = []
__cuda_callable__(
int rowIdx,
int columnIdx,
int globalIdx,
double& value )
14 {
15 if( columnIdx <= rowIdx )
16 value = rowIdx + columnIdx;
17 };
18
19 matrixView.forElements( 0, matrix.getRows(), f );
21}
22
23int
24main( int argc, char* argv[] )
25{
27 forElementsExample< TNL::Devices::Host >();
28
29#ifdef __CUDACC__
31 forElementsExample< TNL::Devices::Cuda >();
32#endif
33}
We do not need any matrix view and instead of calling parallelFor we call just the method forElements (line 18). The lambda function f (line 11) must accept the following parameters:
- rowIdx is the row index of given matrix element.
- columnIdx is the column index of given matrix element.
- value is a reference on the matrix element value and so by changing this value we can modify the matrix element.
The result looks as follows:
Creating matrix on host:
Row: 0 -> 0:0 1:0 2:0 3:0 4:0
Row: 1 -> 0:1 1:2 2:0 3:0 4:0
Row: 2 -> 0:2 1:3 2:4 3:0 4:0
Row: 3 -> 0:3 1:4 2:5 3:6 4:0
Row: 4 -> 0:4 1:5 2:6 3:7 4:8
Creating matrix on CUDA device:
Row: 0 -> 0:0 1:0 2:0 3:0 4:0
Row: 1 -> 0:1 1:2 2:0 3:0 4:0
Row: 2 -> 0:2 1:3 2:4 3:0 4:0
Row: 3 -> 0:3 1:4 2:5 3:6 4:0
Row: 4 -> 0:4 1:5 2:6 3:7 4:8
Wrapping existing data to dense matrix view
In case when you have already allocated data for dense matrix (for example in some other library), you may wrap it to dense matrix view with a function TNL::Matrices::wrapDenseMatrix . See the following example:
1#include <iostream>
2#include <TNL/Matrices/DenseMatrix.h>
3#include <TNL/Matrices/MatrixWrapping.h>
4#include <TNL/Devices/Host.h>
5#include <TNL/Devices/Cuda.h>
6
7template< typename Device >
8void
9wrapMatrixView()
10{
11 const int rows( 3 ), columns( 4 );
13
14 1, 2, 3, 4,
15 5, 6, 7, 8,
16 9, 10, 11, 12
17
18 };
19 double* values = valuesVector.
getData();
20
21
22
23
26}
27
28int
29main( int argc, char* argv[] )
30{
32 wrapMatrixView< TNL::Devices::Host >();
33
34#ifdef __CUDACC__
36 wrapMatrixView< TNL::Devices::Cuda >();
37#endif
38}
__cuda_callable__ const Value * getData() const
Returns a const-qualified raw pointer to the data.
Definition Array.hpp:322
Vector extends Array with algebraic operations.
Definition Vector.h:36
DenseMatrixView< Real, Device, Index, Organization > wrapDenseMatrix(const Index &rows, const Index &columns, Real *values)
Function for wrapping an array of values into a dense matrix view.
Definition MatrixWrapping.h:38
Here we create dense matrix having three rows and four columns. We use TNL vector (TNL::Containers::Vector) only for allocation of the matrix elements (lines 12-15) and we get a pointer to the allocated array immediately (line 16). Next we use just the array to get dense matrix view with proper matrix dimensions (line 21). Note that we must explicitly state the device type as a template parameter of the function wrapDenseMatrix (TNL::Matrices::wrapDenseMatrix). Finally, we print the matrix to see if it is correct (line 22). The result looks as follows:
Wrapping matrix view on host:
Matrix reads as:
Row: 0 -> 0:1 1:2 2:3 3:4
Row: 1 -> 0:5 1:6 2:7 3:8
Row: 2 -> 0:9 1:10 2:11 3:12
Wrapping matrix view on CUDA device:
Matrix reads as:
Row: 0 -> 0:1 1:4 2:7 3:10
Row: 1 -> 0:2 1:5 2:8 3:11
Row: 2 -> 0:3 1:6 2:9 3:12
Sparse matrices
Sparse matrices are extremely important in a lot of numerical algorithms. They are used at situations when we need to operate with matrices having majority of the matrix elements equal to zero. In this case, only the non-zero matrix elements are stored with possibly some padding zeros used for memory alignment. This is necessary mainly on GPUs. See the Overview of matrix types for the differences in memory requirements.
Major disadvantage of sparse matrices is that there are a lot of different formats for their storage in memory. Though CSR (Compressed Sparse Row) format is the most popular of all, especially for GPUs, there are many other formats. Often their performance differ significantly for various matrices. So it is a good idea to test several sparse matrix formats if you want to get the best performance. In TNL, there is one templated class TNL::Matrices::SparseMatrix representing general sparse matrices. The change of underlying matrix format can be done just by changing one template parameter. The list of the template parameters is as follows:
- Real is type if the matrix elements. It is double by default.
- Device is a device where the matrix is allocated. Currently it can be either TNL::Devices::Host for CPU or TNL::Devices::Cuda for CUDA supporting GPUs. It is TNL::Devices::Host by default.
- Index is a type to be used for indexing of the matrix elements. It is int by default.
- MatrixType tells if the matrix is symmetric (TNL::Matrices::SymmetricMatrix) or general (TNL::Matrices::GeneralMatrix). It is a TNL::Matrices::GeneralMatrix by default.
- Segments define the format of the sparse matrix. It can be one of the following (by default, it is TNL::Algorithms::Segments::CSR):
- ComputeReal is type which is used for internal computations. By default it is the same as Real if Real is not bool. If Real is bool, ComputeReal is set to Index type. This can be changed, of course, by the user.
- RealAllocator is a memory allocator (one from TNL::Allocators) which shall be used for allocation of the matrix elements. By default, it is the default allocator for given Real type and Device type – see TNL::Allocators::Default.
- IndexAllocator is a memory allocator (one from TNL::Allocators) which shall be used for allocation of the column indexes of the matrix elements. By default, it is the default allocator for given Index type and Device type – see TNL::Allocators::Default.
If Real is set to bool, we get a binary matrix for which the non-zero elements can be equal only to one and so the matrix elements values are not stored explicitly in the memory. This can significantly reduce the memory requirements and also increase performance.
In the following text we will show how to create and setup sparse matrices.
Setting of row capacities
Larger sparse matrices are created in two steps:
- We use a method TNL::Matrices::SparseMatrix::setRowCapacities to initialize the underlying matrix format and to allocate memory for the matrix elements. This method only needs to know how many non-zero elements are supposed to be in each row. Once this is set, it cannot be changed only by resetting the whole matrix. In most situations, it is not an issue to compute the number of nonzero elements in each row. Otherwise, we can currently only recommend the use of matrix setup with STL map, which is, however, quite slow.
- Now, the nonzero matrix elements can be set one after another by telling its coordinates and a value. Since majority of sparse matrix formats are designed to allow quick access to particular matrix rows the insertion can be done in parallel by mapping different threads to different matrix rows. This approach is usually optimal or nearly optimal when it comes to efficiency.
See the following example which creates lower triangular matrix like this one
\[\left(
\begin{array}{ccccc}
1 & 0 & 0 & 0 & 0 \\
2 & 1 & 0 & 0 & 0 \\
3 & 2 & 1 & 0 & 0 \\
4 & 3 & 2 & 1 & 0 \\
5 & 4 & 3 & 2 & 1
\end{array}
\right).
\]
1#include <iostream>
2#include <TNL/Containers/Vector.h>
3#include <TNL/Matrices/SparseMatrix.h>
4#include <TNL/Devices/Host.h>
5#include <TNL/Devices/Cuda.h>
6
7template< typename Device >
8void
9setRowCapacitiesExample()
10{
13 matrix.setRowCapacities( rowCapacities );
14 for( int row = 0; row < 5; row++ )
15 for( int column = 0; column <= row; column++ )
16 matrix.
setElement( row, column, row - column + 1 );
17
19}
20
21int
22main( int argc, char* argv[] )
23{
25 setRowCapacitiesExample< TNL::Devices::Host >();
26
27#ifdef __CUDACC__
29 setRowCapacitiesExample< TNL::Devices::Cuda >();
30#endif
31}
__cuda_callable__ void setElement(IndexType i, ValueType value)
Sets the value of the i-th element to v.
Definition Array.hpp:354
The method TNL::Matrices::SparseMatrix::setRowCapacities reads the required capacities of the matrix rows from a vector (or similar container - TNL::Containers::Array, TNL::Containers::ArrayView, TNL::Containers::Vector and TNL::Containers::VectorView) which has the same number of elements as the number of matrix rows and each element defines the capacity of the related row. The result looks as follows:
Creating matrix on CPU ...
Row: 0 -> 0:1
Row: 1 -> 0:2 1:1
Row: 2 -> 0:3 1:2 2:1
Row: 3 -> 0:4 1:3 2:2 3:1
Row: 4 -> 0:5 1:4 2:3 3:2 4:1
Creating matrix on CUDA GPU ...
Row: 0 -> 0:1
Row: 1 -> 0:2 1:1
Row: 2 -> 0:3 1:2 2:1
Row: 3 -> 0:4 1:3 2:2 3:1
Row: 4 -> 0:5 1:4 2:3 3:2 4:1
There are constructors which also set the row capacities. The first one uses a vector:
1#include <iostream>
2#include <TNL/Matrices/SparseMatrix.h>
3#include <TNL/Containers/Vector.h>
4#include <TNL/Devices/Host.h>
5
6template< typename Device >
7void
8initializerListExample()
9{
12 6 };
13
14 for(
int row = 0; row < matrix.
getRows(); row++ )
15 for( int column = 0; column <= row; column++ )
16 matrix.
setElement( row, column, row - column + 1 );
18}
19
20int
21main( int argc, char* argv[] )
22{
24 initializerListExample< TNL::Devices::Host >();
25
26#ifdef __CUDACC__
28 initializerListExample< TNL::Devices::Cuda >();
29#endif
30}
__cuda_callable__ IndexType getRows() const
Returns number of matrix rows.
Definition MatrixBase.hpp:42
__cuda_callable__ void setElement(IndexType row, IndexType column, const RealType &value)
Sets element at given row and column to given value.
Definition SparseMatrixBase.hpp:150
The second one uses an initializer list:
1#include <iostream>
2#include <TNL/Matrices/SparseMatrix.h>
3#include <TNL/Devices/Host.h>
4
5template< typename Device >
6void
7initializerListExample()
8{
10 6 };
11
12 for(
int row = 0; row < matrix.
getRows(); row++ )
13 for( int column = 0; column <= row; column++ )
14 matrix.
setElement( row, column, row - column + 1 );
16}
17
18int
19main( int argc, char* argv[] )
20{
22 initializerListExample< TNL::Devices::Host >();
23
24#ifdef __CUDACC__
26 initializerListExample< TNL::Devices::Cuda >();
27#endif
28}
The result of both examples looks as follows:
Creating matrices on CPU ...
General sparse matrix:
Row: 0 -> 0:1
Row: 1 -> 0:2 1:1
Row: 2 -> 0:3 1:2 2:1
Row: 3 -> 0:4 1:3 2:2 3:1
Row: 4 -> 0:5 1:4 2:3 3:2 4:1
Creating matrices on CUDA GPU ...
General sparse matrix:
Row: 0 -> 0:1
Row: 1 -> 0:2 1:1
Row: 2 -> 0:3 1:2 2:1
Row: 3 -> 0:4 1:3 2:2 3:1
Row: 4 -> 0:5 1:4 2:3 3:2 4:1
Initializer list
Small matrices can be initialized by a constructor with an initializer list. We assume having the following sparse matrix
\[\left(
\begin{array}{ccccc}
1 & 0 & 0 & 0 & 0 \\
-1 & 2 & -1 & 0 & 0 \\
0 & -1 & 2 & -1 & 0 \\
0 & 0 & -1 & 2 & -1 \\
0 & 0 & 0 & -1 & 0
\end{array}
\right).
\]
It can be created with the initializer list constructor like we shows in the following example:
1#include <iostream>
2#include <TNL/Matrices/SparseMatrix.h>
3#include <TNL/Devices/Host.h>
4
5template< typename Device >
6void
7initializerListExample()
8{
10
11 5,
12
13 5,
14
15 {
16
17 { 0, 0, 2.0 },
18 { 1, 0, -1.0 }, { 1, 1, 2.0 }, { 1, 2, -1.0 },
19 { 2, 1, -1.0 }, { 2, 2, 2.0 }, { 2, 3, -1.0 },
20 { 3, 2, -1.0 }, { 3, 3, 2.0 }, { 3, 4, -1.0 },
21 { 4, 4, 2.0 },
22
23 } );
24
26}
27
28int
29main( int argc, char* argv[] )
30{
32 initializerListExample< TNL::Devices::Host >();
33
34#ifdef __CUDACC__
36 initializerListExample< TNL::Devices::Cuda >();
37#endif
38}
The constructor accepts the following parameters (lines 9-17):
- rows is a number of matrix rows.
- columns is a number of matrix columns.
- data is definition of nonzero matrix elements. It is a initializer list of triples having a form { row_index, column_index, value }. In fact, it is very much like the Coordinate format - COO.
The constructor also accepts Real and Index allocators (TNL::Allocators) but the default ones are used in this example. A method setElements (TNL::Matrices::SparseMatrix::setElements) works the same way:
1#include <iostream>
2#include <TNL/Matrices/SparseMatrix.h>
3#include <TNL/Devices/Host.h>
4#include <TNL/Devices/Cuda.h>
5
6template< typename Device >
7void
8setElementsExample()
9{
11
13
14 { 0, 0, 2.0 },
15 { 1, 0, -1.0 }, { 1, 1, 2.0 }, { 1, 2, -1.0 },
16 { 2, 1, -1.0 }, { 2, 2, 2.0 }, { 2, 3, -1.0 },
17 { 3, 2, -1.0 }, { 3, 3, 2.0 }, { 3, 4, -1.0 },
18 { 4, 4, 2.0 },
19
20 } );
21
23}
24
25int
26main( int argc, char* argv[] )
27{
29 setElementsExample< TNL::Devices::Host >();
30
31#ifdef __CUDACC__
33 setElementsExample< TNL::Devices::Cuda >();
34#endif
35}
void setElements(const std::initializer_list< std::tuple< Index, Index, Real > > &data, SymmetricMatrixEncoding encoding=SymmetricMatrixEncoding::LowerPart)
This method sets the sparse matrix elements from initializer list.
Definition SparseMatrix.hpp:323
In this example, we create the matrix in two steps. Firstly we use constructor with only matrix dimensions as parameters (line 9) and next we set the matrix elements by setElements method (lines 10-15). The result of both examples looks as follows:
Creating matrices on CPU ...
General sparse matrix:
Row: 0 -> 0:2
Row: 1 -> 0:-1 1:2 2:-1
Row: 2 -> 1:-1 2:2 3:-1
Row: 3 -> 2:-1 3:2 4:-1
Row: 4 -> 4:2
Creating matrices on CUDA GPU ...
General sparse matrix:
Row: 0 -> 0:2
Row: 1 -> 0:-1 1:2 2:-1
Row: 2 -> 1:-1 2:2 3:-1
Row: 3 -> 2:-1 3:2 4:-1
Row: 4 -> 4:2
STL map
The constructor which creates the sparse matrix from std::map is useful especially in situations when you cannot estimate the matrix row capacities in advance. You can first store the matrix elements in std::map data structure in a COO format manner. It means that each entry of the map is the following pair:
std::pair( std::pair( row_index, column_index ), element_value )
which defines one matrix element at given coordinates (row_index,column_index) with given value (element_value). Of course, you can insert such entries into the map in arbitrary order. When it is complete, you pass the map to the sparse matrix. See the following example:
1#include <iostream>
2#include <map>
3#include <utility>
4#include <TNL/Matrices/SparseMatrix.h>
5#include <TNL/Devices/Host.h>
6
7template< typename Device >
8void
9initializerListExample()
10{
23
25
27}
28
29int
30main( int argc, char* argv[] )
31{
33 initializerListExample< TNL::Devices::Host >();
34
35#ifdef __CUDACC__
37 initializerListExample< TNL::Devices::Cuda >();
38#endif
39}
The method setElements (TNL::Matrices::SparseMatrix::setElements) works the same way for already existing instances of sparse matrix:
1#include <iostream>
2#include <map>
3#include <utility>
4#include <TNL/Matrices/SparseMatrix.h>
5#include <TNL/Devices/Host.h>
6
7template< typename Device >
8void
9setElementsExample()
10{
23
26
28}
29
30int
31main( int argc, char* argv[] )
32{
34 setElementsExample< TNL::Devices::Host >();
35
36#ifdef __CUDACC__
38 setElementsExample< TNL::Devices::Cuda >();
39#endif
40}
The result of both examples looks as follows:
Creating matrices on CPU ...
General sparse matrix:
Row: 0 -> 0:2
Row: 1 -> 0:-1 1:2 2:-1
Row: 2 -> 1:-1 2:2 3:-1
Row: 3 -> 2:-1 3:2 4:-1
Row: 4 -> 4:2
Creating matrices on CUDA GPU ...
General sparse matrix:
Row: 0 -> 0:2
Row: 1 -> 0:-1 1:2 2:-1
Row: 2 -> 1:-1 2:2 3:-1
Row: 3 -> 2:-1 3:2 4:-1
Row: 4 -> 4:2
Note, however, that the map can be constructed only on CPU and not on GPU. It requires allocation of additional memory on the host system (CPU) and if the target sparse matrix resided on GPU, the matrix elements must be copied on GPU. This is the reason, why this way of the sparse matrix setup is inefficient compared to other methods.
Methods setElement and addElement
Another way of setting the sparse matrix is by means of the methods setElement and addElement (TNL::Matrices::SparseMatrix::setElement, TNL::Matrices::SparseMatrix::addElement). The method can be called from both host (CPU) and device (GPU) if the matrix is allocated there. Note, however, that if the matrix is allocated on GPU and the methods are called from CPU there will be significant performance drop because the matrix elements will be transferred one-by-one separately. However, if the matrix elements setup is not a critical part of your algorithm this can be an easy way how to do it. See the following example:
1#include <iostream>
2#include <TNL/Algorithms/parallelFor.h>
3#include <TNL/Matrices/SparseMatrix.h>
4#include <TNL/Devices/Host.h>
5#include <TNL/Devices/Cuda.h>
6
7template< typename Device >
8void
9setElements()
10{
12
13
14
15
17 for( int i = 0; i < 5; i++ )
19
22
24 {
26 };
27
29
32}
33
34int
35main( int argc, char* argv[] )
36{
38 setElements< TNL::Devices::Host >();
39
40#ifdef __CUDACC__
42 setElements< TNL::Devices::Cuda >();
43#endif
44}
We first allocate matrix with five rows (it is given by the size of the initializer list) and columns and we set capacity each row to one (line 12). The first for-loop (lines 17-19) runs on CPU no matter where the matrix is allocated. After printing the matrix (lines 21-22), we call the lambda function f (lines 24-26) with a help of parallelFor (line 28) which is device sensitive and so it runs on CPU or GPU depending on where the matrix is allocated. The result looks as follows:
Set elements on host:
Matrix set from the host:
Row: 0 ->
Row: 1 -> 1:1
Row: 2 -> 2:2
Row: 3 -> 3:3
Row: 4 -> 4:4
Matrix set from its native device:
Row: 0 ->
Row: 1 -> 1:-1
Row: 2 -> 2:-2
Row: 3 -> 3:-3
Row: 4 -> 4:-4
Set elements on CUDA device:
Matrix set from the host:
Row: 0 ->
Row: 1 -> 1:1
Row: 2 -> 2:2
Row: 3 -> 3:3
Row: 4 -> 4:4
Matrix set from its native device:
Row: 0 ->
Row: 1 -> 1:-1
Row: 2 -> 2:-2
Row: 3 -> 3:-3
Row: 4 -> 4:-4
The method addElement (TNL::Matrices::SparseMatrix::addElement) adds a value to specific matrix element. Otherwise, it behaves the same as setElement. See the following example:
1#include <iostream>
2#include <TNL/Matrices/SparseMatrix.h>
3#include <TNL/Devices/Host.h>
4
5template< typename Device >
6void
7addElements()
8{
10 auto matrixView = matrix.
getView();
11
12 for( int i = 0; i < 5; i++ )
14
16
17 for( int i = 0; i < 5; i++ )
18 for( int j = 0; j < 5; j++ )
20
22}
23
24int
25main( int argc, char* argv[] )
26{
28 addElements< TNL::Devices::Host >();
29
30#ifdef __CUDACC__
32 addElements< TNL::Devices::Cuda >();
33#endif
34}
__cuda_callable__ void addElement(IndexType row, IndexType column, const RealType &value, const RealType &thisElementMultiplicator=1.0)
Add element at given row and column to given value.
Definition SparseMatrixBase.hpp:160
The result looks as follows:
Add elements on host:
Initial matrix is:
Row: 0 ->
Row: 1 -> 1:1
Row: 2 -> 2:2
Row: 3 -> 3:3
Row: 4 -> 4:4
Matrix after addition is:
Row: 0 -> 0:1 1:1 2:1 3:1 4:1
Row: 1 -> 0:1 1:6 2:1 3:1 4:1
Row: 2 -> 0:1 1:1 2:11 3:1 4:1
Row: 3 -> 0:1 1:1 2:1 3:16 4:1
Row: 4 -> 0:1 1:1 2:1 3:1 4:21
Add elements on CUDA device:
Initial matrix is:
Row: 0 ->
Row: 1 -> 1:1
Row: 2 -> 2:2
Row: 3 -> 3:3
Row: 4 -> 4:4
Matrix after addition is:
Row: 0 -> 0:1 1:1 2:1 3:1 4:1
Row: 1 -> 0:1 1:6 2:1 3:1 4:1
Row: 2 -> 0:1 1:1 2:11 3:1 4:1
Row: 3 -> 0:1 1:1 2:1 3:16 4:1
Row: 4 -> 0:1 1:1 2:1 3:1 4:21
Method getRow
More efficient method, especially for GPUs, is to combine getRow (TNL::Matrices::SparseMatrix::getRow) method with parallelFor and lambda function as the following example demonstrates:
1#include <iostream>
2#include <TNL/Algorithms/parallelFor.h>
3#include <TNL/Matrices/SparseMatrix.h>
4#include <TNL/Devices/Host.h>
5#include <TNL/Devices/Cuda.h>
6
7template< typename Device >
8void
9getRowExample()
10{
11
12
13
14
15
16
17
18
19
20 const int size = 5;
23
24
25
26
28 {
29 auto row = view.
getRow( rowIdx );
30 if( rowIdx == 0 )
31 row.setElement( 0, rowIdx, 2.0 );
32 else if( rowIdx == size - 1 )
33 row.setElement( 0, rowIdx, 2.0 );
34 else {
35 row.setElement( 0, rowIdx - 1, 1.0 );
36 row.setElement( 1, rowIdx, 2.0 );
37 row.setElement( 2, rowIdx + 1, 1.0 );
38 }
39 };
42}
43
44int
45main( int argc, char* argv[] )
46{
48 getRowExample< TNL::Devices::Host >();
49
50#ifdef __CUDACC__
52 getRowExample< TNL::Devices::Cuda >();
53#endif
54}
On the line 21, we create small matrix having five rows (number of rows is given by the size of the initializer list ) and columns (number of columns is given by the second parameter) and we set each row capacity to one or three (particular elements of the initializer list). On the line 41, we call parallelFor to iterate over all matrix rows. Each row is processed by the lambda function f (lines 24-36). In the lambda function, we first fetch a sparse matrix row (TNL::Matrices::SparseMatrixRowView) (line 25) which serves for accessing particular matrix elements in the matrix row. This object has a method setElement (TNL::Matrices::SparseMatrixRowView::setElement) accepting three parameters:
- localIdx is a rank of the nonzero element in given matrix row.
- columnIdx is the new column index of the matrix element.
- value is the new value of the matrix element.
The result looks as follows:
Getting matrix rows on host:
Row: 0 -> 0:2
Row: 1 -> 0:1 1:2 2:1
Row: 2 -> 1:1 2:2 3:1
Row: 3 -> 2:1 3:2 4:1
Row: 4 -> 4:2
Getting matrix rows on CUDA device:
Row: 0 -> 0:2
Row: 1 -> 0:1 1:2 2:1
Row: 2 -> 1:1 2:2 3:1
Row: 3 -> 2:1 3:2 4:1
Row: 4 -> 4:2
Method forRows
The method forRows (TNL::Matrices::SparseMatrix::forRows) calls the method getRow (TNL::Matrices::SparseMatrix::getRow) in parallel. See the following example which has the same effect as the previous one but it is slightly simpler:
1#include <iostream>
2#include <TNL/Matrices/SparseMatrix.h>
3#include <TNL/Devices/Host.h>
4#include <TNL/Devices/Cuda.h>
5
6template< typename Device >
7void
8forRowsExample()
9{
10
11
12
13
14
15
16
17
18
19 const int size( 5 );
21 using RowView = typename MatrixType::RowView;
22 MatrixType matrix( { 1, 3, 3, 3, 1 }, size );
23 auto matrixView = matrix.
getView();
24
25
26
27
29 {
30 const int rowIdx = row.getRowIndex();
31 if( rowIdx == 0 )
33 else if( rowIdx == size - 1 )
34 row.setElement( 0, rowIdx, 2.0 );
35 else {
36 row.setElement( 0, rowIdx - 1, 1.0 );
37 row.setElement( 1, rowIdx, 2.0 );
38 row.setElement( 2, rowIdx + 1, 1.0 );
39 }
40 };
41 matrixView.forAllRows( f );
43
44
45
46
49 {
50 double sum = 0.0;
51 for( auto element : row )
52 sum += element.value();
53 for( auto element : row )
54 element.value() /= sum;
55 } );
56 std::cout <<
"Divide each matrix row by a sum of all elements in the row ... " <<
std::endl;
58}
59
60int
61main( int argc, char* argv[] )
62{
64 forRowsExample< TNL::Devices::Host >();
65
66#ifdef __CUDACC__
68 forRowsExample< TNL::Devices::Cuda >();
69#endif
70}
The differences are:
- We do not need to get the matrix view as we did in the previous example.
- We call the method forAllRows (TNL::Matrices::SparseMatrix::forAllRows) instead of parallelFor which is simpler since we do not have to state the device type explicitly. The method forAllRows calls the method forRows for all matrix rows so we do not have to state explicitly the interval of matrix rows neither.
- The lambda function f (lines 27-39) accepts one parameter row of the type RowView (TNL::Matrices::SparseMatrix::RowView which is TNL::Matrices::SparseMatrixRowView) instead of the index of the matrix row. Therefore we do not need to call the method getRow (TNL::Matrices::SparseMatrix::getRow). On the other hand, we need the method geRowIndex (TNL::Matrices::SparseMatrixRowView::getRowIndex) to get the index of the matrix row (line 28).
On the lines 46-52, we call a lambda function which computes sum of all elements in a row (lines 47-49) and it divides the row by the sum then (lines 50-51).
The result looks as follows:
Getting matrix rows on host:
Row: 0 -> 0:2
Row: 1 -> 0:1 1:2 2:1
Row: 2 -> 1:1 2:2 3:1
Row: 3 -> 2:1 3:2 4:1
Row: 4 -> 4:2
Divide each matrix row by a sum of all elements in the row ...
Row: 0 -> 0:1
Row: 1 -> 0:0.25 1:0.5 2:0.25
Row: 2 -> 1:0.25 2:0.5 3:0.25
Row: 3 -> 2:0.25 3:0.5 4:0.25
Row: 4 -> 4:1
Getting matrix rows on CUDA device:
Row: 0 -> 0:2
Row: 1 -> 0:1 1:2 2:1
Row: 2 -> 1:1 2:2 3:1
Row: 3 -> 2:1 3:2 4:1
Row: 4 -> 4:2
Divide each matrix row by a sum of all elements in the row ...
Row: 0 -> 0:1
Row: 1 -> 0:0.25 1:0.5 2:0.25
Row: 2 -> 1:0.25 2:0.5 3:0.25
Row: 3 -> 2:0.25 3:0.5 4:0.25
Row: 4 -> 4:1
Method forElements
Finally, another efficient way of setting the nonzero matrix elements, is use of the method forElements (TNL::Matrices::SparseMatrix::forElements). It requires indexes of the range of rows (begin and end) to be processed and a lambda function function which is called for each nonzero element. The lambda function provides the following data:
- rowIdx is a row index of the matrix element.
- localIdx is an index of the nonzero matrix element within the matrix row.
- columnIdx is a column index of the matrix element. If the matrix element column index is supposed to be modified, this parameter can be a reference and so its value can be changed.
- value is a value of the matrix element. If the matrix element value is supposed to be modified, this parameter can be a reference as well and so the element value can be changed.
See the following example:
1#include <iostream>
2#include <TNL/Matrices/SparseMatrix.h>
3#include <TNL/Devices/Host.h>
4#include <TNL/Devices/Cuda.h>
5
6template< typename Device >
7void
8forElementsExample()
9{
11 auto matrixView = matrix.
getView();
12
13 auto f = []
__cuda_callable__(
int rowIdx,
int localIdx,
int& columnIdx,
double& value )
14 {
15
16
17 if( rowIdx >= localIdx ) {
18 columnIdx = localIdx;
19 value = rowIdx + localIdx + 1;
20 }
21 };
22
25}
26
27int
28main( int argc, char* argv[] )
29{
31 forElementsExample< TNL::Devices::Host >();
32
33#ifdef __CUDACC__
35 forElementsExample< TNL::Devices::Cuda >();
36#endif
37}
void forElements(IndexType begin, IndexType end, Function &&function) const
Method for iteration over all matrix rows for constant instances.
Definition SparseMatrixBase.hpp:528
On the line 9, we allocate a lower triangular matrix by setting the row capacities as {1,2,3,4,5}. On the line 11, we prepare lambda function f which we execute on the line 22 just by calling the method forElements (TNL::Matrices::SparseMatrix::forElements). This method takes the range of matrix rows as the first two parameters and the lambda function as the last parameter. The lambda function receives parameters mentioned above (see the line 11). We first check if the matrix element coordinates (rowIdx and localIdx) points to an element lying before the matrix diagonal or on the diagonal (line 12). In case of the lower triangular matrix in our example, the local index is in fact the same as the column index
\[\left(
\begin{array}{ccccc}
0 & . & . & . & . \\
0 & 1 & . & . & . \\
0 & 1 & 2 & . & . \\
0 & 1 & 2 & 3 & . \\
0 & 1 & 2 & 3 & 4
\end{array}
\right)
\]
If we call the method forElements (TNL::Matrices::SparseMatrix::forElements) to setup the matrix elements for the first time, the parameter columnIdx has no sense because the matrix elements and their column indexes were not set yet. Therefore it is important that the test on the line 12 reads as
because
would not make sense. If we pass through this test, the matrix element lies in the lower triangular part of the matrix and we may set the matrix elements which is done on the lines 17 and 18. The column index (columnIdx) is set to local index (line 17) and value is set on the line 18. The result looks as follows:
Creating matrix on host:
Row: 0 -> 0:1
Row: 1 -> 0:2 1:3
Row: 2 -> 0:3 1:4 2:5
Row: 3 -> 0:4 1:5 2:6 3:7
Row: 4 -> 0:5 1:6 2:7 3:8 4:9
Creating matrix on CUDA device:
Row: 0 -> 0:1
Row: 1 -> 0:2 1:3
Row: 2 -> 0:3 1:4 2:5
Row: 3 -> 0:4 1:5 2:6 3:7
Row: 4 -> 0:5 1:6 2:7 3:8 4:9
Wrapping existing data to sparse matrix view
Standard sparse matrix format like CSR and Ellpack store the matrix elements in specifically defined arrays. In case that you have already allocated them (for example in some other library), they can be wrapped into a sparse matrix view with given matrix format. This can be done by means of functions TNL::Matrices::wrapCSRMatrix and TNL::Matrices::wrapEllpackMatrix . See the following example for demonstration of the CSR format:
1#include <iostream>
2#include <TNL/Matrices/DenseMatrix.h>
3#include <TNL/Matrices/MatrixWrapping.h>
4#include <TNL/Devices/Host.h>
5#include <TNL/Devices/Cuda.h>
6
7template< typename Device >
8void
9wrapMatrixView()
10{
11
12
13
14
15
16
17
18
19 const int rows( 4 ), columns( 4 );
23
24 double* values = valuesVector.
getData();
25 int* columnIndexes = columnIndexesVector.
getData();
26 int* rowPointers = rowPointersVector.
getData();
27
28
29
30
32
34}
35
36int
37main( int argc, char* argv[] )
38{
40 wrapMatrixView< TNL::Devices::Host >();
41
42#ifdef __CUDACC__
44 wrapMatrixView< TNL::Devices::Cuda >();
45#endif
46}
SparseMatrixView< Real, Device, Index, GeneralMatrix, Algorithms::Segments::CSRView > wrapCSRMatrix(const Index &rows, const Index &columns, Index *rowPointers, Real *values, Index *columnIndexes)
Function for wrapping of arrays defining CSR format into a sparse matrix view.
Definition MatrixWrapping.h:73
We create sparse matrix having four rows and four columns (line 19). We use TNL vector (TNL::Containers::Vector) to allocate arrays necessary for the CSR format:
- valuesVector (line 20) - contains values of the nonzero matrix elements.
- columnIndexesVector (line 21) - contains column indexes of the nonzero matrix elements.
- rowPointersVector (line 22) - contains positions of the first nonzero matrix elements in each row within valuesVector and columnIndexesVector. The size of this array equals number of matrix rows plus one.
Next we turn the vectors into C style pointers (lines 24-26) to wrap them into sparse matrix view (line 31). Note, that we must explicitly state the device on which the arrays are allocated. Finlay we print the matrix to check the correctness (line 33). The result looks as follows:
Wrapping matrix view on host:
Matrix reads as:
Row: 0 -> 0:1 1:2
Row: 1 -> 1:6
Row: 2 -> 0:9
Row: 3 -> 2:15 3:16
Wrapping matrix view on CUDA device:
Matrix reads as:
Row: 0 -> 0:1 1:2
Row: 1 -> 1:6
Row: 2 -> 0:9
Row: 3 -> 2:15 3:16
Wrapping data corresponding with the Ellpack format is very similar as we can see in the following example:
1#include <iostream>
2#include <TNL/Matrices/DenseMatrix.h>
3#include <TNL/Matrices/MatrixWrapping.h>
4#include <TNL/Devices/Host.h>
5#include <TNL/Devices/Cuda.h>
6
7template< typename Device >
8void
9wrapMatrixView()
10{
11
12
13
14
15
16
17
18
19 const int rows( 4 ), columns( 4 );
22
23 double* values = valuesVector.
getData();
24 int* columnIndexes = columnIndexesVector.
getData();
25
26
27
28
30 rows, columns, 2, values, columnIndexes );
31
33}
34
35int
36main( int argc, char* argv[] )
37{
39 wrapMatrixView< TNL::Devices::Host >();
40
41#ifdef __CUDACC__
43 wrapMatrixView< TNL::Devices::Cuda >();
44#endif
45}
auto wrapEllpackMatrix(const Index rows, const Index columns, const Index nonzerosPerRow, Real *values, Index *columnIndexes) -> decltype(EllpackMatrixWrapper< Device, Organization, Real, Index, Alignment >::wrap(rows, columns, nonzerosPerRow, values, columnIndexes))
Function for wrapping of arrays defining Ellpack format into a sparse matrix view.
Definition MatrixWrapping.h:134
We encode the same sparse matrix as in the previous example. The essence of the Ellpack format is that we allocate the same number of matrix elements for each row which is two in our example. For some matrix rows we use the padding zeros for which we set the column index to -1 (line 21). Therefore the size of valuesVector and columnIndexesVector equals number of matrix rows times number of matrix elements allocated in each row. As before, we turn the vectors into C style pointers (lines 23-24) and wrap them into sparse matrix view with Ellpack format (line 29). Note that we must state the device on which the arrays are allocated explicitly and also the matrix elements organization, which is TNL::Algorithms::Segments::RowMajorOrder in this case. For Ellpack matrix stored on GPU, TNL::Algorithms::Segments::ColumnMajorOrder is preferred. The result looks as follows:
Wrapping matrix view on host:
Matrix reads as:
Row: 0 -> 0:1 1:2
Row: 1 -> 1:6
Row: 2 -> 0:9
Row: 3 -> 2:15 3:16
Wrapping matrix view on CUDA device:
Matrix reads as:
Row: 0 -> 0:1 1:2
Row: 1 -> 1:6
Row: 2 -> 0:9
Row: 3 -> 2:15 3:16
Symmetric sparse matrices
For sparse symmetric matrices, TNL offers a format storing only a half of the matrix elements. More precisely, only the matrix diagonal and the elements bellow are stored in the memory. The matrix elements above the diagonal are deduced from those bellow. If such a symmetric format is used on GPU, atomic operations must be used in some matrix operations. For this reason, symmetric matrices can be combined only with matrix elements values expressed in float or double type. An advantage of the symmetric formats is lower memory consumption. Since less data need to be transferred from the memory, better performance might be observed. In some cases, however, the use of atomic operations on GPU may cause performance drop. Mostly we can see approximately the same performance compared to general formats but we can profit from lower memory requirements which is appreciated especially on GPU. The following example shows how to create symmetric sparse matrix.
1#include <iostream>
2#include <TNL/Matrices/SparseMatrix.h>
3#include <TNL/Devices/Host.h>
4
5template< typename Device >
6void
7symmetricSparseMatrixExample()
8{
10 5,
11 5,
12 {
13
14
15 { 0, 0, 1.0 },
16 { 1, 0, 2.0 }, { 1, 1, 1.0 },
17 { 2, 0, 3.0 }, { 2, 2, 1.0 },
18 { 3, 0, 4.0 }, { 3, 3, 1.0 },
19 { 4, 0, 5.0 }, { 4, 4, 1.0 },
20
21 } );
22
24
26 symmetricMatrix.vectorProduct( inVector, outVector );
28}
29
30int
31main( int argc, char* argv[] )
32{
34 symmetricSparseMatrixExample< TNL::Devices::Host >();
35
36#ifdef __CUDACC__
38 symmetricSparseMatrixExample< TNL::Devices::Cuda >();
39#endif
40}
We construct matrix of the following form
\[\left(
\begin{array}{ccccc}
1 & \color{grey}{2} & \color{grey}{3} & \color{grey}{4} & \color{grey}{5} \\
2 & 1 & & & \\
3 & & 1 & & \\
4 & & & 1 & \\
5 & & & & 1
\end{array}
\right)
\]
The elements depicted in grey color are not stored in the memory. The main difference, compared to creation of general sparse matrix, is on line 9 where we state that the matrix is symmetric by setting the matrix type to TNL::Matrices::SymmetricMatrix. Next we set only the diagonal elements and those lying bellow the diagonal (lines 13-17). When we print the matrix (line 19) we can see also the symmetric part above the diagonal. Next we test product of matrix and vector (lines 21-23). The result looks as follows:
Creating matrix on CPU ...
Symmetric sparse matrix:
Row: 0 -> 0:1 1:2 2:3 3:4 4:5
Row: 1 -> 0:2 1:1
Row: 2 -> 0:3 2:1
Row: 3 -> 0:4 3:1
Row: 4 -> 0:5 4:1
Product with vector [ 1, 1, 1, 1, 1 ] is [ 15, 3, 4, 5, 6 ]
Creating matrix on CUDA GPU ...
Symmetric sparse matrix:
Row: 0 -> 0:1 1:2 2:3 3:4 4:5
Row: 1 -> 0:2 1:1
Row: 2 -> 0:3 2:1
Row: 3 -> 0:4 3:1
Row: 4 -> 0:5 4:1
Product with vector [ 1, 1, 1, 1, 1 ] is [ 15, 3, 4, 5, 6 ]
Warning: Assignment of symmetric sparse matrix to general sparse matrix does not give correct result, currently. Only the diagonal and the lower part of the matrix is assigned.
Binary sparse matrices
If the matrix element value type (i.e. Real type) is set to bool the matrix elements can be only 1 or 0. So in the sparse matrix formats, where we do not store the zero matrix elements, explicitly stored elements can have only one possible value which is 1. Therefore we do not need to store the values, only the positions of the nonzero elements. The array values, which usually stores the matrix elements values, can be completely omitted and we can reduce the memory requirements. The following table shows how much we can reduce the memory consumption when using binary matrix instead of common sparse matrix using float or double types:
Real | Index | Common sparse matrix | Binary sparse matrix | Ratio |
float | 32-bit | 4 + 4 = 8 B | 4 B | 50% |
float | 64-bit | 4 + 8 = 12 B | 8 B | 75% |
double | 32-bit | 8 + 4 = 12 B | 4 B | 33% |
double | 64-bit | 8 + 8 = 16 B | 8 B | 50% |
The following example demonstrates the use of binary matrix:
1#include <iostream>
2#include <TNL/Matrices/SparseMatrix.h>
3#include <TNL/Algorithms/Segments/CSR.h>
4#include <TNL/Devices/Host.h>
5
6template< typename Device >
7void
8binarySparseMatrixExample()
9{
11
12 5,
13
14 5,
15
16 {
17
18 { 0, 0, 1.0 }, { 0, 1, 2.0 }, { 0, 2, 3.0 }, { 0, 3, 4.0 }, { 0, 4, 5.0 },
19 { 1, 0, 2.0 }, { 1, 1, 1.0 },
20 { 2, 0, 3.0 }, { 2, 2, 1.0 },
21 { 3, 0, 4.0 }, { 3, 3, 1.0 },
22 { 4, 0, 5.0 }, { 4, 4, 1.0 },
23
24 } );
25
26 std::cout <<
"Binary sparse matrix: \n" << binaryMatrix <<
'\n';
27
30 std::cout <<
"Product with vector " << inVector <<
" is " << outVector <<
"\n\n";
31
33 binaryMatrix2;
34 binaryMatrix2 = binaryMatrix;
36 std::cout <<
"Product with vector in double precision " << inVector <<
" is " << outVector <<
"\n\n";
37}
38
39int
40main( int argc, char* argv[] )
41{
42 std::cout <<
"Creating matrix on CPU ... \n";
43 binarySparseMatrixExample< TNL::Devices::Host >();
44
45#ifdef __CUDACC__
46 std::cout <<
"Creating matrix on CUDA GPU ... \n";
47 binarySparseMatrixExample< TNL::Devices::Cuda >();
48#endif
49}
void vectorProduct(const InVector &inVector, OutVector &outVector, ComputeRealType matrixMultiplicator=1.0, ComputeRealType outVectorMultiplicator=0.0, IndexType begin=0, IndexType end=0, const SegmentsReductionKernel &kernel=SegmentsReductionKernel{}) const
Computes product of matrix and vector.
Definition SparseMatrixBase.hpp:264
All we need to do is set the Real type to bool as we can see on the line 9. We can see that even though we set different values to different matrix elements (lines 14-18) at the end all of them are turned into ones (printing of the matrix on the line 20). There is an issue, however, which is demonstrated on the product of the matrix with a vector. Nonbinary matrices compute all operations using the Real type. If it is set to bool operations like SpMV would not get correct solution. Therefore sparse matrices use another type called ComputeReal which is the 6th template parameter of TNL::Matrices::SparseMatrix. By default it is set to Index type but it can be changed by the user. On the lines 26-29 we show how to change this type to double and what is the effect of it (correct result of matrix-vector multiplication). The result looks as follows:
Creating matrix on CPU ...
Binary sparse matrix:
Row: 0 -> 0:1 1:1 2:1 3:1 4:1
Row: 1 -> 0:1 1:1
Row: 2 -> 0:1 2:1
Row: 3 -> 0:1 3:1
Row: 4 -> 0:1 4:1
Product with vector [ 1.1, 1.1, 1.1, 1.1, 1.1 ] is [ 5, 2, 2, 2, 2 ]
Product with vector in double precision [ 1.1, 1.1, 1.1, 1.1, 1.1 ] is [ 5.5, 2.2, 2.2, 2.2, 2.2 ]
Creating matrix on CUDA GPU ...
Binary sparse matrix:
Row: 0 -> 0:1 1:1 2:1 3:1 4:1
Row: 1 -> 0:1 1:1
Row: 2 -> 0:1 2:1
Row: 3 -> 0:1 3:1
Row: 4 -> 0:1 4:1
Product with vector [ 1.1, 1.1, 1.1, 1.1, 1.1 ] is [ 5, 2, 2, 2, 2 ]
Product with vector in double precision [ 1.1, 1.1, 1.1, 1.1, 1.1 ] is [ 5.5, 2.2, 2.2, 2.2, 2.2 ]
Tridiagonal matrices
Tridiagonal matrix format serves for specific matrix pattern when the nonzero matrix elements can be placed only at the diagonal and immediately next to the diagonal. Here is an example:
\[\left(
\begin{array}{ccccccc}
2 & -1 & . & . & . & . \\
-1 & 2 & -1 & . & . & . \\
. & -1 & 2 & -1 & . & . \\
. & . & -1 & 2 & -1 & . \\
. & . & . & -1 & 2 & -1 \\
. & . & . & . & -1 & 2
\end{array}
\right)
\]
An advantage is that we do not store the column indexes explicitly as it is in TNL::Matrices::SparseMatrix. This can reduce significantly the memory requirements which also means better performance. See the following table for the storage requirements comparison between TNL::Matrices::TridiagonalMatrix and TNL::Matrices::SparseMatrix.
Real | Index | SparseMatrix | TridiagonalMatrix | Ratio |
float | 32-bit int | 8 bytes per element | 4 bytes per element | 50% |
double | 32-bit int | 12 bytes per element | 8 bytes per element | 75% |
float | 64-bit int | 12 bytes per element | 4 bytes per element | 30% |
double | 64-bit int | 16 bytes per element | 8 bytes per element | 50% |
Tridiagonal matrix is a templated class defined in the namespace TNL::Matrices. It has five template parameters:
- Real is a type of the matrix elements. It is double by default.
- Device is a device where the matrix shall be allocated. Currently it can be either TNL::Devices::Host for CPU or TNL::Devices::Cuda for GPU supporting CUDA. It is TNL::Devices::Host by default.
- Index is a type to be used for indexing of the matrix elements. It is int by default.
- ElementsOrganization defines the organization of the matrix elements in memory. It can be TNL::Algorithms::Segments::ColumnMajorOrder or TNL::Algorithms::Segments::RowMajorOrder for column-major and row-major organization respectively. Be default it is the row-major order if the matrix is allocated in the host system and column major order if it is allocated on GPU.
- RealAllocator is a memory allocator (one from TNL::Allocators) which shall be used for allocation of the matrix elements. By default, it is the default allocator for given Real type and Device type – see TNL::Allocators::Default.
For better alignment in the memory the tridiagonal format is organized like if there were three nonzero matrix elements in each row. This is not true for example in the first row where there is no matrix element on the left side of the diagonal. The same happens on the last row of the matrix. We have to add even the artificial matrix elements like this:
\[\begin{array}{c}
0 \\
. \\
. \\
. \\
. \\
.
\end{array}
\left(
\begin{array}{ccccccc}
2 & -1 & . & . & . & . \\
-1 & 2 & -1 & . & . & . \\
. & -1 & 2 & -1 & . & . \\
. & . & -1 & 2 & -1 & . \\
. & . & . & -1 & 2 & -1 \\
. & . & . & . & -1 & 2
\end{array}
\right)
\begin{array}{c}
. \\
. \\
. \\
. \\
. \\
0
\end{array}
\]
If the tridiagonal matrix has more rows then columns, we have to extend the last two rows with nonzero elements in this way
\[\left(
\begin{array}{ccccccc}
2 & -1 & . & . & . & . \\
-1 & 2 & -1 & . & . & . \\
. & -1 & 2 & -1 & . & . \\
. & . & -1 & 2 & -1 & . \\
. & . & . & -1 & 2 & -1 \\
. & . & . & . & -1 & 2 \\
. & . & . & . & . & -1
\end{array}
\right)
\rightarrow
\begin{array}{c}
0 \\
. \\
. \\
. \\
. \\
. \\
.
\end{array}
\left(
\begin{array}{ccccccc}
2 & -1 & . & . & . & . \\
-1 & 2 & -1 & . & . & . \\
. & -1 & 2 & -1 & . & . \\
. & . & -1 & 2 & -1 & . \\
. & . & . & -1 & 2 & -1 \\
. & . & . & . & -1 & 2 \\
. & . & . & . & . & -1
\end{array}
\right)
\begin{array}{cc}
. & . \\
. & . \\
. & . \\
. & . \\
. & . \\
0 & . \\
0 & 0
\end{array}
\]
We also would like to remind the meaning of the local index (localIdx) of the matrix element within a matrix row. It is a rank of the nonzero matrix element in given row as we explained in section Indexing of nonzero matrix elements in sparse matrices. The values of the local index for tridiagonal matrix elements are as follows
\[\left(
\begin{array}{cccccc}
1 & 2 & & & & \\
0 & 1 & 2 & & & \\
& 0 & 1 & 2 & & \\
& & 0 & 1 & 2 & \\
& & & 0 & 1 & 2 \\
& & & & 0 & 1
\end{array}
\right)
\]
In the following text we show different methods for setup of tridiagonal matrices.
Initializer list
The tridiagonal matrix can be initialized by the means of the constructor with initializer list. The matrix from the beginning of this section can be constructed as the following example demonstrates:
1#include <iostream>
2#include <TNL/Matrices/TridiagonalMatrix.h>
3#include <TNL/Devices/Host.h>
4#include <TNL/Devices/Cuda.h>
5
6template< typename Device >
7void
8createTridiagonalMatrix()
9{
10 const int matrixSize = 6;
11
12
13
14
15
16
17
18
19
20
21
22
24 matrixSize,
25 {
26
27
28
29
30
31
32
33
34
35
36
37
38 { 0, 2, -1 },
39 { -1, 2, -1 },
40 { -1, 2, -1 },
41 { -1, 2, -1 },
42 { -1, 2, -1 },
43 { -1, 2, 0 } } );
45}
46
47int
48main( int argc, char* argv[] )
49{
51 createTridiagonalMatrix< TNL::Devices::Host >();
52
53#ifdef __CUDACC__
55 createTridiagonalMatrix< TNL::Devices::Cuda >();
56#endif
57}
Implementation of sparse tridiagonal matrix.
Definition TridiagonalMatrix.h:55
The matrix elements values are defined on lines 39-44. Each matrix row is represented by embedded an initializer list. We set three values in each row including the padding zeros.
The output of the example looks as:
Creating tridiagonal matrix on CPU ...
The matrix reads as:
Row: 0 -> 0:2 1:-1
Row: 1 -> 0:-1 1:2 2:-1
Row: 2 -> 1:-1 2:2 3:-1
Row: 3 -> 2:-1 3:2 4:-1
Row: 4 -> 3:-1 4:2 5:-1
Row: 5 -> 4:-1 5:2
Creating tridiagonal matrix on CUDA GPU ...
The matrix reads as:
Row: 0 -> 0:2 1:-1
Row: 1 -> 0:-1 1:2 2:-1
Row: 2 -> 1:-1 2:2 3:-1
Row: 3 -> 2:-1 3:2 4:-1
Row: 4 -> 3:-1 4:2 5:-1
Row: 5 -> 4:-1 5:2
Methods setElement and addElement
Similar way of the tridiagonal matrix setup is offered by the method setElements (TNL::Matrices::TridiagonalMatrix::setElements) as the following example demonstrates:
1#include <iostream>
2#include <TNL/Matrices/TridiagonalMatrix.h>
3#include <TNL/Devices/Host.h>
4#include <TNL/Devices/Cuda.h>
5
6template< typename Device >
7void
8createTridiagonalMatrix()
9{
10 const int matrixSize = 6;
11
12
13
14
15
16
17
18
19
20
21
22
25
26
27
28
29
30
31
32
33
34
35
36
37 { 0, 2, -1 },
38 { -1, 2, -1 },
39 { -1, 2, -1 },
40 { -1, 2, -1 },
41 { -1, 2, -1 },
42 { -1, 2, 0 } } );
44}
45
46int
47main( int argc, char* argv[] )
48{
50 createTridiagonalMatrix< TNL::Devices::Host >();
51
52#ifdef __CUDACC__
54 createTridiagonalMatrix< TNL::Devices::Cuda >();
55#endif
56}
Here we create the matrix in two steps. Firstly, we setup the matrix dimensions by the appropriate constructor (line 24) and after that we setup the matrix elements (line 25-45). The result looks the same as in the previous example:
Creating tridiagonal matrix on CPU ...
The matrix reads as:
Row: 0 -> 0:2 1:-1
Row: 1 -> 0:-1 1:2 2:-1
Row: 2 -> 1:-1 2:2 3:-1
Row: 3 -> 2:-1 3:2 4:-1
Row: 4 -> 3:-1 4:2 5:-1
Row: 5 -> 4:-1 5:2
Creating tridiagonal matrix on CUDA GPU ...
The matrix reads as:
Row: 0 -> 0:2 1:-1
Row: 1 -> 0:-1 1:2 2:-1
Row: 2 -> 1:-1 2:2 3:-1
Row: 3 -> 2:-1 3:2 4:-1
Row: 4 -> 3:-1 4:2 5:-1
Row: 5 -> 4:-1 5:2
In the following example we create tridiagonal matrix with 5 rows and 5 columns (line 12-14) by the means of a shared pointer (TNL::Pointers::SharedPointer) to make this work even on GPU. We set numbers 0,...,4 on the diagonal (line 16) and we print the matrix (line 18). Next we use a lambda function (lines 21-27) combined with parallelFor (line 35), to modify the matrix. The offdiagonal elements are set to 1 (lines 23 and 26) and for the diagonal elements, we change the sign (line 24).
1#include <iostream>
2#include <TNL/Algorithms/parallelFor.h>
3#include <TNL/Matrices/TridiagonalMatrix.h>
4#include <TNL/Devices/Host.h>
5#include <TNL/Devices/Cuda.h>
6
7template< typename Device >
8void
9setElements()
10{
11 const int matrixSize( 5 );
13 Matrix matrix( matrixSize, matrixSize );
15
16 for( int i = 0; i < 5; i++ )
18
21
23 {
24 if( i > 0 )
27 if( i < matrixSize - 1 )
29 };
30
32
35}
36
37int
38main( int argc, char* argv[] )
39{
41 setElements< TNL::Devices::Host >();
42
43#ifdef __CUDACC__
45 setElements< TNL::Devices::Cuda >();
46#endif
47}
The result looks as follows:
Set elements on host:
Matrix set from the host:
Row: 0 ->
Row: 1 -> 1:1
Row: 2 -> 2:2
Row: 3 -> 3:3
Row: 4 -> 4:4
Matrix set from its native device:
Row: 0 -> 1:1
Row: 1 -> 0:1 1:-1 2:1
Row: 2 -> 1:1 2:-2 3:1
Row: 3 -> 2:1 3:-3 4:1
Row: 4 -> 3:1 4:-4
Set elements on CUDA device:
Matrix set from the host:
Row: 0 ->
Row: 1 -> 1:1
Row: 2 -> 2:2
Row: 3 -> 3:3
Row: 4 -> 4:4
Matrix set from its native device:
Row: 0 -> 1:1
Row: 1 -> 0:1 1:-1 2:1
Row: 2 -> 1:1 2:-2 3:1
Row: 3 -> 2:1 3:-3 4:1
Row: 4 -> 3:1 4:-4
Method getRow
A bit different way of setting up the matrix, is the use of tridiagonal matrix view and the method getRow (TNL::Matrices::TridiagonalMatrixView::getRow) as the following example demonstrates:
1#include <iostream>
2#include <TNL/Algorithms/parallelFor.h>
3#include <TNL/Matrices/TridiagonalMatrix.h>
4#include <TNL/Devices/Host.h>
5#include <TNL/Devices/Cuda.h>
6
7template< typename Device >
8void
9getRowExample()
10{
11
12
13
14
15
16
17
18
19
20
21
22
23 const int matrixSize( 5 );
25 MatrixType matrix( matrixSize,
26 matrixSize
27 );
29
31 {
32 auto row = view.
getRow( rowIdx );
33
34 if( rowIdx > 0 )
35 row.setElement( 0, -1.0 );
36 row.setElement( 1, 2.0 );
37 if( rowIdx < matrixSize - 1 )
38 row.setElement( 2, -1.0 );
39 };
40
41
42
43
46}
47
48int
49main( int argc, char* argv[] )
50{
52 getRowExample< TNL::Devices::Host >();
53
54#ifdef __CUDACC__
56 getRowExample< TNL::Devices::Cuda >();
57#endif
58}
We create a matrix with the same size (line 22-27). Next, we fetch the tridiagonal matrix view (ef TNL::Matrices::TridiagonalMatrixView ,line 28) which we use in the lambda function for matrix elements modification (lines 30-38). Inside the lambda function, we first get a matrix row by calling the method getRow (TNL::Matrices::TridiagonalMatrixView::getRow) using which we can access the matrix elements (lines 33-37). We would like to stress that the method setElement addresses the matrix elements with the localIdx parameter which is a rank of the nonzero element in the matrix row - see Indexing of nonzero matrix elements in sparse matrices. The lambda function is called by the parallelFor.
The result looks as follows:
Getting matrix rows on host:
Row: 0 -> 0:2 1:-1
Row: 1 -> 0:-1 1:2 2:-1
Row: 2 -> 1:-1 2:2 3:-1
Row: 3 -> 2:-1 3:2 4:-1
Row: 4 -> 3:-1 4:2
Getting matrix rows on CUDA device:
Row: 0 -> 0:2 1:-1
Row: 1 -> 0:-1 1:2 2:-1
Row: 2 -> 1:-1 2:2 3:-1
Row: 3 -> 2:-1 3:2 4:-1
Row: 4 -> 3:-1 4:2
Method forRows
As in the case of other matrix types, the method forRows (TNL::Matrices::TridiagonalMatrix::forRows) calls the method getRow (TNL::Matrices::TridiagonalMatrix::getRow) in parallel. It is demonstrated by the following example which we may directly compare with the previous one:
1#include <iostream>
2#include <TNL/Matrices/TridiagonalMatrix.h>
3#include <TNL/Devices/Host.h>
4#include <TNL/Devices/Cuda.h>
5
6template< typename Device >
7void
8forRowsExample()
9{
11
12
13
14
15
16
17
18
19
20
21
22
23 const int size = 5;
24 MatrixType matrix( size, size );
26
28 {
29 const int& rowIdx = row.getRowIndex();
30 if( rowIdx > 0 )
32 row.setElement( 1, 2.0 );
33 if( rowIdx < size - 1 )
34 row.setElement( 2, -1.0 );
35 };
38}
39
40int
41main( int argc, char* argv[] )
42{
44 forRowsExample< TNL::Devices::Host >();
45
46#ifdef __CUDACC__
48 forRowsExample< TNL::Devices::Cuda >();
49#endif
50}
The differences are:
- We do not need to get the matrix view as we did in the previous example.
- We call the method forAllRows (TNL::Matrices::TridiagonalMatrix::forAllRows) (line 33) instead of parallelFor which is simpler since we do not have to state the device type explicitly. The method forAllRows calls the method forRows for all matrix rows so we do not have to state explicitly the interval of matrix rows neither.
- The lambda function f (lines 25-31) accepts one parameter row of the type RowView (TNL::Matrices::TridiagonalMatrix::RowView which is TNL::Matrices::TridiagonalMatrixRowView) instead of the index of the matrix row. Therefore we do not need to call the method getRow (TNL::Matrices::TridiagonalMatrix::getRow). On the other hand, we need the method geRowIndex (TNL::Matrices::TridiagonalMatrixRowView::getRowIndex) to get the index of the matrix row (line 24).
Next, we compute sum of absolute values of matrix elements in each row and store it in a vector (lines 39-46). Firstly we create the vector sum_vector for storing the sums (line 39) and get a vector view sum_view to get access to the vector from a lambda function. On the lines 41-46, we call lambda function for each matrix row which iterates over all matrix elements and sum their absolute values. Finally we store the result to the output vector (line 45).
The result looks as follows:
Creating matrix on host:
Row: 0 -> 0:2 1:-1
Row: 1 -> 0:-1 1:2 2:-1
Row: 2 -> 1:-1 2:2 3:-1
Row: 3 -> 2:-1 3:2 4:-1
Row: 4 -> 3:-1 4:2
Creating matrix on CUDA device:
Row: 0 -> 0:2 1:-1
Row: 1 -> 0:-1 1:2 2:-1
Row: 2 -> 1:-1 2:2 3:-1
Row: 3 -> 2:-1 3:2 4:-1
Row: 4 -> 3:-1 4:2
Method forElements
Finally, even a bit more simple way of matrix elements manipulation with the method forElements (TNL::Matrices::TridiagonalMatrix::forElements) is demonstrated in the following example:
1#include <iostream>
2#include <TNL/Matrices/TridiagonalMatrix.h>
3#include <TNL/Devices/Host.h>
4#include <TNL/Devices/Cuda.h>
5
6template< typename Device >
7void
8forElementsExample()
9{
10
11
12
13
14
15
16
17
18
19
21 5 );
23
24 auto f = []
__cuda_callable__(
int rowIdx,
int localIdx,
int columnIdx,
double& value )
25 {
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40 value = 3 - localIdx;
41 };
44}
45
46int
47main( int argc, char* argv[] )
48{
50 forElementsExample< TNL::Devices::Host >();
51
52#ifdef __CUDACC__
54 forElementsExample< TNL::Devices::Cuda >();
55#endif
56}
On the line 41, we call the method forElements (TNL::Matrices::TridiagonalMatrix::forElements) instead of parallelFor. This method iterates over all matrix rows and all nonzero matrix elements. The lambda function on the line 24 therefore do not receive only the matrix row index but also local index of the matrix element (localIdx) which is a rank of the nonzero matrix element in given row - see Indexing of nonzero matrix elements in sparse matrices. Next parameter, columnIdx received by the lambda function, is the column index of the matrix element. The fourth parameter value is a reference on the matrix element which we use for its modification.
The result looks as follows:
Creating matrix on host:
Row: 0 -> 0:2 1:1
Row: 1 -> 0:3 1:2 2:1
Row: 2 -> 1:3 2:2 3:1
Row: 3 -> 2:3 3:2 4:1
Row: 4 -> 3:3 4:2
Creating matrix on CUDA device:
Row: 0 -> 0:2 1:1
Row: 1 -> 0:3 1:2 2:1
Row: 2 -> 1:3 2:2 3:1
Row: 3 -> 2:3 3:2 4:1
Row: 4 -> 3:3 4:2
Multidiagonal matrices
Multidiagonal matrices are generalization of the tridiagonal ones. It is a special type of sparse matrices with specific pattern of the nonzero matrix elements which are positioned only parallel along diagonal. See the following example:
\[ \left(
\begin{array}{ccccccc}
4 & -1 & . & -1 & . & . \\
-1 & 4 & -1 & . & -1 & . \\
. & -1 & 4 & -1 & . & -1 \\
-1 & . & -1 & 4 & -1 & . \\
. & -1 & . & -1 & 4 & -1 \\
. & . & -1 & . & -1 & 4
\end{array}
\right)
\]
We can see that the matrix elements lay on lines parallel to the main diagonal. Such lines can be characterized by their offsets from the main diagonal. On the following figure, each such line is depicted in different color:
\[\begin{array}{ccc}
\color{green}{-3} & . & \color{cyan}{-1} \\
\hline
\color{green}{*} & . & \color{cyan}{*} \\
. & \color{green}{*} & . \\
. & . & \color{green}{*} \\
. & . & . \\
. & . & . \\
. & . & .
\end{array}
\left(
\begin{array}{ccccccc}
\color{blue}{0} & \color{magenta}{1} & . & \color{red}{3} & . & . \\
\hline
\color{blue}{4} & \color{magenta}{-1} & . & \color{red}{-1} & . & . \\
\color{cyan}{-1} & \color{blue}{4} & \color{magenta}{-1} & . & \color{red}{-1} & . \\
. & \color{cyan}{-1} & \color{blue}{4} & \color{magenta}{-1} & . & \color{red}{-1} \\
\color{green}{-1} & . & \color{cyan}{-1} & \color{blue}{4} & \color{magenta}{-1} & . \\
. & \color{green}{-1} & . & \color{cyan}{-1} & \color{blue}{4} & \color{magenta}{-1} \\
. & . & \color{green}{-1} & . & \color{cyan}{-1} & \color{blue}{4}
\end{array}
\right)
\]
In this matrix, the offsets reads as \(\{-3, -1, 0, +1, +3\}\). It also means that the column indexes on \(i-\)th row are \(\{i-3, i-1, i, i+1, i+3\}\) (where we accept only nonnegative indexes smaller than the number of matrix columns). An advantage is that, similar to the tridiagonal matrix (TNL::Matrices::TridiagonalMatrix), we do not store the column indexes explicitly as it is in TNL::Matrices::SparseMatrix. This can significantly reduce the memory requirements which also means better performance. See the following table for the storage requirements comparison between multidiagonal matrix (TNL::Matrices::MultidiagonalMatrix) and general sparse matrix (TNL::Matrices::SparseMatrix).
Real | Index | SparseMatrix | MultidiagonalMatrix | Ratio |
float | 32-bit int | 8 bytes per element | 4 bytes per element | 50% |
double | 32-bit int | 12 bytes per element | 8 bytes per element | 75% |
float | 64-bit int | 12 bytes per element | 4 bytes per element | 30% |
double | 64-bit int | 16 bytes per element | 8 bytes per element | 50% |
For the sake of better memory alignment and faster access to the matrix elements, we store all subdiagonals in complete form including the elements which are outside the matrix as depicted on the following figure where zeros stand for the padding artificial zero matrix elements
\[\begin{array}{cccc}
0 & & & 0 \\
& 0 & & \\
& & 0 & \\
& & & 0 \\
& & & \\
& & & \\
& & & \\
& & & \\
& & & \\
& & & \\
& & & \\
& & & \\
& & & \\
& & & \\
& & & \\
& & &
\end{array}
\left(
\begin{array}{cccccccccccccccc}
1 & 0 & & & 0 & & & & & & & & & & & \\
0 & 1 & 0 & & & 0 & & & & & & & & & & \\
& 0 & 1 & 0 & & & 0 & & & & & & & & & \\
& & 0 & 1 & 0 & & & 0 & & & & & & & & \\
0 & & & 0 & 1 & 0 & & & 0 & & & & & & & \\
& -1 & & & -1 & 1 & -1 & & & -1 & & & & & & \\
& & -1 & & & -1 & 1 & -1 & & & -1 & & & & & \\
& & & 0 & & & 0 & 1 & 0 & & & 0 & & & & \\
& & & & 0 & & & 0 & 1 & 0 & & & 0 & & & \\
& & & & & -1 & & & -1 & 1 & -1 & & & -1 & & \\
& & & & & & -1 & & & -1 & 1 & -1 & & & -1 & \\
& & & & & & & 0 & & & 0 & 1 & 0 & & & 0 \\
& & & & & & & & 0 & & & 0 & 1 & 0 & & \\
& & & & & & & & & 0 & & & 0 & 1 & 0 & \\
& & & & & & & & & & 0 & & & 0 & 1 & 0 \\
& & & & & & & & & & & 0 & & & 0 & 1
\end{array}
\right)
\begin{array}
& & & \\
& & & \\
& & & \\
& & & \\
& & & \\
& & & \\
& & & \\
& & & \\
& & & \\
& & & \\
& & & \\
& & & \\
0 & & & \\
& 0 & & \\
& & 0 & \\
0 & & & 0
\end{array}
\]
Multidiagonal matrix is a templated class defined in the namespace TNL::Matrices. It has six template parameters:
- Real is a type of the matrix elements. It is double by default.
- Device is a device where the matrix shall be allocated. Currently it can be either TNL::Devices::Host for CPU or TNL::Devices::Cuda for CUDA supporting GPUs. It is TNL::Devices::Host by default.
- Index is a type to be used for indexing of the matrix elements. It is int by default.
- ElementsOrganization defines the organization of the matrix elements in memory. It can be TNL::Algorithms::Segments::ColumnMajorOrder or TNL::Algorithms::Segments::RowMajorOrder for column-major and row-major organization respectively. Be default, it is the row-major order if the matrix is allocated in the host system and column major order if it is allocated on GPU.
- RealAllocator is a memory allocator (one from TNL::Allocators) which shall be used for allocation of the matrix elements. By default, it is the default allocator for given Real type and Device type – see TNL::Allocators::Default.
- IndexAllocator is a memory allocator (one from TNL::Allocators) which shall be used for allocation of the matrix elements offsets. By default, it is the default allocator for given Index type and Device type – see TNL::Allocators::Default.
In the following text we show different methods how to setup multidiagonal matrices.
Initializer list
Smaller multidiagonal matrices can be created using the constructor with initializer lists (std::initializer_list) as we demonstrate in the following example:
1#include <iostream>
2#include <TNL/Matrices/MultidiagonalMatrix.h>
3#include <TNL/Devices/Host.h>
4#include <TNL/Devices/Cuda.h>
5
6template< typename Device >
7void
8createMultidiagonalMatrix()
9{
10 const int matrixSize = 6;
11
12
13
14
15
16
17
18
19
20
21
22
23
25 matrixSize,
26 { -3, -1, 0, 1, 3 },
27 {
28
29
30
31
32
33
34
35
36
37
38
39
40
41 { 0, 0, 4, -1, -1 },
42 { 0, -1, 4, -1, -1 },
43 { 0, -1, 4, -1, -1 },
44 { -1, -1, 4, -1, 0 },
45 { -1, -1, 4, -1, 0 },
46 { -1, -1, 4, 0, 0 },
47
48 } );
50}
51
52int
53main( int argc, char* argv[] )
54{
56 createMultidiagonalMatrix< TNL::Devices::Host >();
57
58#ifdef __CUDACC__
60 createMultidiagonalMatrix< TNL::Devices::Cuda >();
61#endif
62}
Implementation of sparse multidiagonal matrix.
Definition MultidiagonalMatrix.h:63
Here, we create a matrix which looks as
\[\left(
\begin{array}{cccccc}
4 & -1 & & -1 & & \\
-1 & 4 & -1 & & -1 & \\
& -1 & 4 & -1 & & -1 \\
-1 & & -1 & 4 & -1 & \\
& -1 & & -1 & 4 & -1 \\
& & -1 & & -1 & 4 \\
\end{array}
\right).
\]
On the lines 25-46, we call the constructor which, in addition to matrix dimensions and subdiagonals offsets, accepts also initializer list of initializer lists with matrix elements values. Each embedded list corresponds to one matrix row and it contains values of matrix elements on particular subdiagonals including those which lies out of the matrix. The result looks as follows:
Create multidiagonal matrix on CPU ...
The matrix reads as:
Row: 0 -> 0:4 1:-1 3:-1
Row: 1 -> 0:-1 1:4 2:-1 4:-1
Row: 2 -> 1:-1 2:4 3:-1 5:-1
Row: 3 -> 0:-1 2:-1 3:4 4:-1
Row: 4 -> 1:-1 3:-1 4:4 5:-1
Row: 5 -> 2:-1 4:-1 5:4
Creating multidiagonal matrix on CUDA GPU ...
The matrix reads as:
Row: 0 -> 0:4 1:-1 3:-1
Row: 1 -> 0:-1 1:4 2:-1 4:-1
Row: 2 -> 1:-1 2:4 3:-1 5:-1
Row: 3 -> 0:-1 2:-1 3:4 4:-1
Row: 4 -> 1:-1 3:-1 4:4 5:-1
Row: 5 -> 2:-1 4:-1 5:4
Methods setElement and addElement
Another and more efficient way of setting the matrix elements is by means of the method setElement (TNL::Matrices::MultidiagonalMatrix::setElement). It is demonstrated in the following example:
1#include <iostream>
2#include <TNL/Algorithms/parallelFor.h>
3#include <TNL/Matrices/MultidiagonalMatrix.h>
4#include <TNL/Devices/Host.h>
5#include <TNL/Devices/Cuda.h>
6#include <TNL/Pointers/SharedPointer.h>
7#include <TNL/Pointers/SmartPointersRegister.h>
8
9template< typename Device >
10void
11setElements()
12{
13 const int matrixSize( 5 );
14 auto diagonalsOffsets = { -1, 0, 1 };
16 Matrix matrix( matrixSize, matrixSize, diagonalsOffsets );
18
19 for( int i = 0; i < 5; i++ )
21
24
26 {
27 if( i > 0 )
30 if( i < matrixSize - 1 )
32 };
33
35
38}
39
40int
41main( int argc, char* argv[] )
42{
44 setElements< TNL::Devices::Host >();
45
46#ifdef __CUDACC__
48 setElements< TNL::Devices::Cuda >();
49#endif
50}
This examples shows that the method setElement can be used both on the host (CPU) (line 19) as well as in the lambda functions that can be called from GPU kernels (lines 25-29). For this purpose, we fetch a matrix view on the line 16. The result looks as follows:
Set elements on host:
Matrix set from the host:
Row: 0 ->
Row: 1 -> 1:1
Row: 2 -> 2:2
Row: 3 -> 3:3
Row: 4 -> 4:4
Matrix set from its native device:
Row: 0 -> 1:1
Row: 1 -> 0:1 1:-1 2:1
Row: 2 -> 1:1 2:-2 3:1
Row: 3 -> 2:1 3:-3 4:1
Row: 4 -> 3:1 4:-4
Set elements on CUDA device:
Matrix set from the host:
Row: 0 ->
Row: 1 -> 1:1
Row: 2 -> 2:2
Row: 3 -> 3:3
Row: 4 -> 4:4
Matrix set from its native device:
Row: 0 -> 1:1
Row: 1 -> 0:1 1:-1 2:1
Row: 2 -> 1:1 2:-2 3:1
Row: 3 -> 2:1 3:-3 4:1
Row: 4 -> 3:1 4:-4
Method getRow
Slightly more efficient way of the multidiagonal matrix setup is offered by the method getRow (TNL::Matrices::MultidiagonalMatrix::getRow). We will use it to create a matrix of the following form:
\[\left(
\begin{array}{cccccccccccccccc}
1 & . & & & . & & & & & & & & & & & \\
. & 1 & . & & & . & & & & & & & & & & \\
& . & 1 & . & & & . & & & & & & & & & \\
& & . & 1 & . & & & . & & & & & & & & \\
. & & & . & 1 & . & & & . & & & & & & & \\
& -1 & & & -1 & -4 & -1 & & & -1 & & & & & & \\
& & -1 & & & -1 & -4 & -1 & & & -1 & & & & & \\
& & & . & & & . & 1 & . & & & . & & & & \\
& & & & . & & & . & 1 & . & & & . & & & \\
& & & & & -1 & & & -1 & -4 & -1 & & & -1 & & \\
& & & & & & -1 & & & -1 & -4 & -1 & & & -1 & \\
& & & & & & & . & & & . & 1 & . & & & . \\
& & & & & & & & . & & & . & 1 & . & & \\
& & & & & & & & & . & & & . & 1 & . & \\
& & & & & & & & & & . & & & . & 1 & . \\
& & & & & & & & & & & . & & & . & 1
\end{array}
\right)
\]
The matrices of this form arise from a discretization of the Laplace operator in 2D by the finite difference method. We use this example because it is a frequent numerical problem and the multidiagonal format is very suitable for such matrices. If the reader, however, is not familiar with the finite difference method, please, do not be scared, we will just create the matrix mentioned above. The code based on use of method getRow reads as:
1#include <iostream>
2#include <TNL/Algorithms/parallelFor.h>
3#include <TNL/Containers/StaticArray.h>
4#include <TNL/Matrices/MultidiagonalMatrix.h>
5#include <TNL/Devices/Host.h>
6#include <TNL/Devices/Cuda.h>
7
8template< typename Device >
9void
10laplaceOperatorMatrix()
11{
12
13
14
15
16 const int gridSize( 4 );
17 const int matrixSize = gridSize * gridSize;
20 auto matrixView = matrix.
getView();
22 {
23 const int elementIdx = i[ 1 ] * gridSize + i[ 0 ];
24 auto row = matrixView.
getRow( elementIdx );
25 if( i[ 0 ] == 0 || i[ 1 ] == 0 || i[ 0 ] == gridSize - 1 || i[ 1 ] == gridSize - 1 )
26 row.setElement( 2, 1.0 );
27
28
29 else {
30 row.setElement( 0, -1.0 );
31 row.setElement( 1, -1.0 );
32 row.setElement( 2, 4.0 );
33 row.setElement( 3, -1.0 );
34 row.setElement( 4, -1.0 );
35 }
36 };
40
42}
43
44int
45main( int argc, char* argv[] )
46{
48 laplaceOperatorMatrix< TNL::Devices::Host >();
49
50#ifdef __CUDACC__
52 laplaceOperatorMatrix< TNL::Devices::Cuda >();
53#endif
54}
We firstly compute the matrix size (matrixSize) based on the numerical grid dimensions on the line 16. The subdiagonals offsets are defined by the numerical grid size and since it is four in this example the offsets read as \(\left\{-4,-1,0,1,4 \right\} \) or { -gridSize, -1, 0, 1, gridSize} (line 17). Here we store the offsets in vector (TNL::Containers::Vector) called offsets. Next we use a constructor with matrix dimensions and offsets passed via TNL vector (line 18) and we fetch a matrix view (TNL::Matrices::MultidiagonalMatrixView, line 19).
The matrix is constructed by iterating over particular nodes of the numerical grid. Each node correspond to one matrix row. This is why the lambda function f (lines 20-35) take two indexes i and j (line 20). Their values are coordinates of the two-dimensional numerical grid. Based on these coordinates we compute index (elementIdx) of the corresponding matrix row (line 21). We fetch matrix row (row) by calling the getRow method (TNL::Matrices::MultidiagonalMatrix::getRow) (line 22). Depending on the grid node coordinates we set either the boundary conditions (lines 23-26) for the boundary nodes (those laying on the boundary of the grid and so their coordinates fulfil the condition i == 0 || j == 0 || i == gridSize - 1 || j == gridSize - 1 ) for which se set only the diagonal element to 1. The inner nodes of the numerical grid are handled on the lines 29-33 where we set coefficients approximating the Laplace operator. We use the method setElement of the matrix row (TNL::Matrices::MultidiagonalMatrixRowView::setElement) which takes the local index of the nonzero matrix element as the first parameter (see Indexing of nonzero matrix elements in sparse matrices) and the new value of the element as the second parameter. The local indexes, in fact, refer to particular subdiagonals as depicted on the following figure (in blue):
\[\begin{array}{cccc}
\color{blue}{-4} & & & \color{blue}{-1} \\
\hline
. & & & . \\
& . & & \\
& & . & \\
& & & . \\
& & & \\
& & & \\
& & & \\
& & & \\
& & & \\
& & & \\
& & & \\
& & & \\
& & & \\
& & & \\
& & & \\
& & &
\end{array}
\left(
\begin{array}{cccccccccccccccc}
\color{blue}{0} & \color{blue}{1} & & & \color{blue}{4} & & & & & & & & & & & \\
\hline
1 & . & & & . & & & & & & & & & & & \\
. & 1 & . & & & . & & & & & & & & & & \\
& . & 1 & . & & & . & & & & & & & & & \\
& & . & 1 & . & & & . & & & & & & & & \\
. & & & . & 1 & . & & & . & & & & & & & \\
& -1 & & & -1 & 1 & -1 & & & -1 & & & & & & \\
& & -1 & & & -1 & 1 & -1 & & & -1 & & & & & \\
& & & . & & & . & 1 & . & & & . & & & & \\
& & & & . & & & . & 1 & . & & & . & & & \\
& & & & & -1 & & & -1 & 1 & -1 & & & -1 & & \\
& & & & & & -1 & & & -1 & 1 & -1 & & & -1 & \\
& & & & & & & . & & & . & 1 & . & & & . \\
& & & & & & & & . & & & . & 1 & . & & \\
& & & & & & & & & . & & & . & 1 & . & \\
& & & & & & & & & & . & & & . & 1 & . \\
& & & & & & & & & & & . & & & . & 1
\end{array}
\right)
\]
We use parallelFor to iterate over all nodes of the numerical grid (line 36) and apply the lambda function. The result looks as follows:
Creating Laplace operator matrix on CPU ...
Laplace operator matrix:
Row: 0 -> 0:1
Row: 1 -> 1:1
Row: 2 -> 2:1
Row: 3 -> 3:1
Row: 4 -> 4:1
Row: 5 -> 1:-1 4:-1 5:4 6:-1 9:-1
Row: 6 -> 2:-1 5:-1 6:4 7:-1 10:-1
Row: 7 -> 7:1
Row: 8 -> 8:1
Row: 9 -> 5:-1 8:-1 9:4 10:-1 13:-1
Row: 10 -> 6:-1 9:-1 10:4 11:-1 14:-1
Row: 11 -> 11:1
Row: 12 -> 12:1
Row: 13 -> 13:1
Row: 14 -> 14:1
Row: 15 -> 15:1
Creating Laplace operator matrix on CUDA GPU ...
Laplace operator matrix:
Row: 0 -> 0:1
Row: 1 -> 1:1
Row: 2 -> 2:1
Row: 3 -> 3:1
Row: 4 -> 4:1
Row: 5 -> 1:-1 4:-1 5:4 6:-1 9:-1
Row: 6 -> 2:-1 5:-1 6:4 7:-1 10:-1
Row: 7 -> 7:1
Row: 8 -> 8:1
Row: 9 -> 5:-1 8:-1 9:4 10:-1 13:-1
Row: 10 -> 6:-1 9:-1 10:4 11:-1 14:-1
Row: 11 -> 11:1
Row: 12 -> 12:1
Row: 13 -> 13:1
Row: 14 -> 14:1
Row: 15 -> 15:1
Method forRows
As in the case of other matrix types, the method forRows (TNL::Matrices::MultidiagonalMatrix::forRows) calls the method getRow (TNL::Matrices::MultidiagonalMatrix::getRow) in parallel. It is demonstrated by the following example:
1#include <iostream>
2#include <TNL/Matrices/MultidiagonalMatrix.h>
3#include <TNL/Devices/Host.h>
4#include <TNL/Devices/Cuda.h>
5
6template< typename Device >
7void
8forRowsExample()
9{
11
12
13
14
15
16
17
18
19
20
21
22
23 const int size = 5;
24 MatrixType matrix( size,
25 size,
26 { -1, 0, 1 } );
28
30 {
31 const int& rowIdx = row.getRowIndex();
32 if( rowIdx > 0 )
34 row.setElement( 1, 2.0 );
35 if( rowIdx < size - 1 )
36 row.setElement( 2, -1.0 );
37 };
38 view.forAllRows( f );
40}
41
42int
43main( int argc, char* argv[] )
44{
46 forRowsExample< TNL::Devices::Host >();
47
48#ifdef __CUDACC__
50 forRowsExample< TNL::Devices::Cuda >();
51#endif
52}
We call the method forAllRows (TNL::Matrices::MultidiagonalMatrix::forAllRows) (line 36) instead of parallelFor which is simpler since we do not have to state the device type explicitly. The method forAllRows calls the method forRows for all matrix rows so we do not have to state explicitly the interval of matrix rows neither. The lambda function f (lines 28-35) accepts one parameter row of the type RowView (TNL::Matrices::MultidiagonalMatrix::RowView which is TNL::Matrices::MultidiagonalMatrixRowView). At the beginning of the lambda function, we call the method geRowIndex (TNL::Matrices::MultidiagonalMatrixRowView::getRowIndex) to get the index of the matrix row (line 29).
Next, we compute sum of absolute values of matrix elements in each row and store it in a vector (lines 39-46). Firstly we create the vector sum_vector for storing the sums (line 39) and get a vector view sum_view to get access to the vector from a lambda function. On the lines 41-46, we call lambda function for each matrix row which iterates over all matrix elements and sum their absolute values. Finally we store the result to the output vector (line 45).
The result looks as follows:
Creating matrix on host:
Row: 0 -> 0:2 1:-1
Row: 1 -> 0:-1 1:2 2:-1
Row: 2 -> 1:-1 2:2 3:-1
Row: 3 -> 2:-1 3:2 4:-1
Row: 4 -> 3:-1 4:2
Creating matrix on CUDA device:
Row: 0 -> 0:2 1:-1
Row: 1 -> 0:-1 1:2 2:-1
Row: 2 -> 1:-1 2:2 3:-1
Row: 3 -> 2:-1 3:2 4:-1
Row: 4 -> 3:-1 4:2
Method forElements
Similar and even a bit simpler way of setting the matrix elements is offered by the method forElements (TNL::Matrices::MultidiagonalMatrix::forElements, TNL::Matrices::MultidiagonalMatrixView::forElements) as demonstrated in the following example:
1#include <iostream>
2#include <TNL/Matrices/MultidiagonalMatrix.h>
3#include <TNL/Devices/Host.h>
4#include <TNL/Devices/Cuda.h>
5
6template< typename Device >
7void
8forElementsExample()
9{
10
11
12
13
14
15
16
17
18
19
20
21
23 5,
24 { -2, -1, 0 } );
26
27 auto f = []
__cuda_callable__(
int rowIdx,
int localIdx,
int columnIdx,
double& value )
28 {
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43 value = 3 - localIdx;
44 };
47}
48
49int
50main( int argc, char* argv[] )
51{
53 forElementsExample< TNL::Devices::Host >();
54
55#ifdef __CUDACC__
57 forElementsExample< TNL::Devices::Cuda >();
58#endif
59}
void forElements(IndexType begin, IndexType end, Function &function) const
Method for iteration over all matrix rows for constant instances.
Definition MultidiagonalMatrixBase.hpp:267
In this case, we need to provide a lambda function f (lines 27-43) which is called for each matrix row just by the method forElements (line 44). The lambda function f provides the following parameters
- rowIdx is an index iof the matrix row.
- localIdx is in index of the matrix subdiagonal.
- columnIdx is a column index of the matrix element.
- value is a reference to the matrix element value. It can be used even for changing the value.
In this example, the matrix element value depends only on the subdiagonal index localIdx (see Indexing of nonzero matrix elements in sparse matrices) as we can see on the line 42. The result looks as follows:
Creating matrix on host:
Row: 0 -> 0:1
Row: 1 -> 0:2 1:1
Row: 2 -> 0:3 1:2 2:1
Row: 3 -> 1:3 2:2 3:1
Row: 4 -> 2:3 3:2 4:1
Creating matrix on CUDA device:
Row: 0 -> 0:1
Row: 1 -> 0:2 1:1
Row: 2 -> 0:3 1:2 2:1
Row: 3 -> 1:3 2:2 3:1
Row: 4 -> 2:3 3:2 4:1
Lambda matrices
Lambda matrix (TNL::Matrices::LambdaMatrix) is a special type of matrix which could be also called matrix-free matrix. The matrix elements are not stored in memory explicitly but they are evaluated on-the-fly by means of user defined lambda functions. If the matrix elements can be expressed by computationally not expansive formula, we can significantly reduce the memory consumption which is appreciated especially on GPUs. Since the memory accesses are quite expensive even on both CPU and GPU, we can get, at the end, even much faster code.
The lambda matrix (TNL::Matrices::LambdaMatrix) is a templated class with the following template parameters:
- MatrixElementsLambda is a lambda function which evaluates the matrix elements values and column indexes.
- CompressedRowLengthsLambda is a lambda function telling how many nonzero elements are there in given matrix row.
- Real is a type of matrix elements values.
- Device is a device on which the lambda functions mentioned above will be evaluated.
- Index is a type to be used for indexing.
The lambda function MatrixElementsLambda is supposed to have the following declaration:
1auto matrixElements = [=
2]
__cuda_callable__( Index rows, Index columns, Index row, Index localIdx, Index& columnIdx, Real& value );
where the particular parameters have the following meaning:
- rows tells the number of matrix rows.
- columns tells the number of matrix columns.
- rowIdx is index of the matrix row in which we are supposed to evaluate the matrix element.
- localIdx is a rank of the nonzero matrix element, see Indexing of nonzero matrix elements in sparse matrices.
- columnIdx is a reference on variable where we are supposed to store the matrix element column index.
- value is a reference on variable where we are supposed to store the matrix element value.
The lambda function CompressedRowLengthsLambda (by compressed row length we mean the number of matrix elements in a row after ignoring/compressing the zero elements) is supposed to be declared like this:
1auto compressedRowLengths = [=]
__cuda_callable__( Index rows, Index columns, Index row ) -> Index;
where the parameters can be described as follows:
- rows tells the number of matrix rows.
- columns tells the number of matrix columns.
- rowIdx is index of the matrix row for which we are supposed to evaluate the number of nonzero matrix elements.
The lambda function is supposed to return just the number of the nonzero matrix elements in given matrix row.
Lambda matrix inititation
How to put the lambda functions together with the lambda matrix is demonstrated in the following example:
1#include <iostream>
2#include <TNL/Matrices/LambdaMatrix.h>
3
4int
5main( int argc, char* argv[] )
6{
7
8
9
10 auto compressedRowLengths = [ = ]
__cuda_callable__(
const int rows,
const int columns,
const int rowIdx ) ->
int
11 {
12 return 1;
13 };
14 auto matrixElements1 =
16 const int rows, const int columns, const int rowIdx, const int localIdx, int& columnIdx, double& value )
17 {
18 columnIdx = rowIdx;
19 value = 1.0;
20 };
21 auto matrixElements2 =
23 const int rows, const int columns, const int rowIdx, const int localIdx, int& columnIdx, double& value )
24 {
25 columnIdx = rowIdx;
26 value = rowIdx;
27 };
28
29 const int size = 5;
30
31
32
33
35 matrixElements1, compressedRowLengths ) );
36 MatrixType m1( size, size, matrixElements1, compressedRowLengths );
37
38
39
40
41 auto m2 =
43 m2.setDimensions( size, size );
44
47}
static auto create(MatrixElementsLambda &matrixElementsLambda, CompressedRowLengthsLambda &compressedRowLengthsLambda) -> LambdaMatrix< MatrixElementsLambda, CompressedRowLengthsLambda, Real, Device, Index >
Creates lambda matrix with given lambda functions.
Definition LambdaMatrix.h:572
Here we create two simple diagonal matrices. Therefore they share the same lambda function compressedRowLengths telling the number of nonzero matrix elements in particular matrix rows which is always one (line 9). The first matrix, defined by the lambda function matrixElements1, is identity matrix and so its each diagonal element equals one. We set the matrix element value to 1.0 (line 12) and the column index equals the row index (line 15). The second matrix, defined by the lambda function matrixElements2, is also diagonal but not the identity matrix. The values of the diagonal elements equal to row index (line 16).
With the same lambda functions we can define matrices with different dimensions. In this example, we set the matrix size to five (line 19). It could be quite difficult to express the lambda matrix type because it depends on the types of the lambda functions. To make this easier, one may use the lambda-matrix factory (TNL::Matrices::LambdaMatrixFactory). Using decltype one can deduce even the matrix type (line 24) followed by calling lambda matrix constructor with matrix dimensions and instances of the lambda functions (line 25). Or one can just simply employ the keyword auto (line 30) followed by setting the matrix dimensions (line 31).
The result looks as follows:
The first lambda matrix:
Row: 0 -> 0:1
Row: 1 -> 1:1
Row: 2 -> 2:1
Row: 3 -> 3:1
Row: 4 -> 4:1
The second lambda matrix:
Row: 0 ->
Row: 1 -> 1:1
Row: 2 -> 2:2
Row: 3 -> 3:3
Row: 4 -> 4:4
Method forRows
Method forRows (TNL::Matrices::LambdaMatrix::forRows, TNL::Matrices::LambdaMatrix::forAllRows) iterates in parallel over all matrix rows. In the case of lambda matrices, it cannot be used for changing the matrix elements since they cannot be changed. In the following example, we show how to use this method to copy the matrix elements values to the dense matrix:
1#include <iostream>
2#include <TNL/Matrices/LambdaMatrix.h>
3#include <TNL/Matrices/DenseMatrix.h>
4#include <TNL/Devices/Host.h>
5#include <TNL/Devices/Cuda.h>
6
7template< typename Device >
8void
9forRowsExample()
10{
11
12
13
14
15 const int gridSize( 4 );
16 const int matrixSize = gridSize * gridSize;
17 auto rowLengths = [ = ]
__cuda_callable__(
const int rows,
const int columns,
const int rowIdx ) ->
int
18 {
19 const int gridRow = rowIdx / gridSize;
20 const int gridColumn = rowIdx % gridSize;
21 if( gridRow == 0 || gridRow == gridSize - 1 ||
22 gridColumn == 0 || gridColumn == gridSize - 1 )
23 return 1;
24 return 5;
25 };
26 auto matrixElements =
28 const int rows, const int columns, const int rowIdx, const int localIdx, int& columnIdx, double& value )
29 {
30 const int gridRow = rowIdx / gridSize;
31 const int gridColumn = rowIdx % gridSize;
32 if( gridRow == 0 || gridRow == gridSize - 1 ||
33 gridColumn == 0 || gridColumn == gridSize - 1 )
34 {
35 columnIdx = rowIdx;
36 value = 1.0;
37 }
38 else
39 {
40 switch( localIdx )
41 {
42 case 0:
43 columnIdx = rowIdx - gridSize;
44 value = 1;
45 break;
46 case 1:
47 columnIdx = rowIdx - 1;
48 value = 1;
49 break;
50 case 2:
51 columnIdx = rowIdx;
52 value = -4;
53 break;
54 case 3:
55 columnIdx = rowIdx + 1;
56 value = 1;
57 break;
58 case 4:
59 columnIdx = rowIdx + gridSize;
60 value = 1;
61 break;
62 }
63 }
64 };
65 auto matrix =
67 using MatrixType = decltype( matrix );
68 using RowView = typename MatrixType::RowView;
69
71 denseMatrix.setValue( 0.0 );
72 auto dense_view = denseMatrix.getView();
74 {
75 auto dense_row = dense_view.getRow( row.getRowIndex() );
76 for( int localIdx = 0; localIdx < row.getSize(); localIdx++ )
77 dense_row.setValue( row.getColumnIndex( localIdx ), row.getValue( localIdx ) );
78 };
80
83
84
85
86
88 auto sum_view = sum_vector.getView();
91 {
92 double sum( 0.0 );
93 for( auto element : row )
95 sum_view[ row.getRowIndex() ] = sum;
96 } );
97
99}
100
101int
102main( int argc, char* argv[] )
103{
105 forRowsExample< TNL::Devices::Host >();
106
107#ifdef __CUDACC__
109 forRowsExample< TNL::Devices::Cuda >();
110#endif
111}
__cuda_callable__ T abs(const T &n)
This function returns absolute value of given number n.
Definition Math.h:75
We start with the lambda functions (lines 17-61) defining the elements of the lambda matrix. Next, we create the lambda matrix matrix (lines 62-64) and the dense matrix denseMatrix (lines 67-68) together with the dense matrix view (line 69). The lambda function f (lines 70-74) serves for copying matrix elements from the lambda matrix to the dense matrix. The process of matrix elements copying is started by calling the method forAllRows (TNL::Matrices::LambdaMatrix::forRows, TNL::Matrices::LambdaMatrix::forAllRows) (line 75).
Note, however, that use of forElements method (TNL::Matrices::LambdaMatrix::forElements) would be more convenient.
Next, we compute sum of absolute values of matrix elements in each row and store it in a vector (lines 83-90). Firstly we create the vector sum_vector for storing the sums (line 83) and get a vector view sum_view to get access to the vector from a lambda function. On the lines 85-90, we call lambda function for each matrix row which iterates over all matrix elements and sum their absolute values. Finally we store the result to the output vector (line 92).
The result looks as follows:
Running example on CPU ...
Laplace operator lambda matrix:
Row: 0 -> 0:1
Row: 1 -> 1:1
Row: 2 -> 2:1
Row: 3 -> 3:1
Row: 4 -> 4:1
Row: 5 -> 1:1 4:1 5:-4 6:1 9:1
Row: 6 -> 2:1 5:1 6:-4 7:1 10:1
Row: 7 -> 7:1
Row: 8 -> 8:1
Row: 9 -> 5:1 8:1 9:-4 10:1 13:1
Row: 10 -> 6:1 9:1 10:-4 11:1 14:1
Row: 11 -> 11:1
Row: 12 -> 12:1
Row: 13 -> 13:1
Row: 14 -> 14:1
Row: 15 -> 15:1
Laplace operator dense matrix:
Row: 0 -> 0:1 1:0 2:0 3:0 4:0 5:0 6:0 7:0 8:0 9:0 10:0 11:0 12:0 13:0 14:0 15:0
Row: 1 -> 0:0 1:1 2:0 3:0 4:0 5:0 6:0 7:0 8:0 9:0 10:0 11:0 12:0 13:0 14:0 15:0
Row: 2 -> 0:0 1:0 2:1 3:0 4:0 5:0 6:0 7:0 8:0 9:0 10:0 11:0 12:0 13:0 14:0 15:0
Row: 3 -> 0:0 1:0 2:0 3:1 4:0 5:0 6:0 7:0 8:0 9:0 10:0 11:0 12:0 13:0 14:0 15:0
Row: 4 -> 0:0 1:0 2:0 3:0 4:1 5:0 6:0 7:0 8:0 9:0 10:0 11:0 12:0 13:0 14:0 15:0
Row: 5 -> 0:0 1:1 2:0 3:0 4:1 5:-4 6:1 7:0 8:0 9:1 10:0 11:0 12:0 13:0 14:0 15:0
Row: 6 -> 0:0 1:0 2:1 3:0 4:0 5:1 6:-4 7:1 8:0 9:0 10:1 11:0 12:0 13:0 14:0 15:0
Row: 7 -> 0:0 1:0 2:0 3:0 4:0 5:0 6:0 7:1 8:0 9:0 10:0 11:0 12:0 13:0 14:0 15:0
Row: 8 -> 0:0 1:0 2:0 3:0 4:0 5:0 6:0 7:0 8:1 9:0 10:0 11:0 12:0 13:0 14:0 15:0
Row: 9 -> 0:0 1:0 2:0 3:0 4:0 5:1 6:0 7:0 8:1 9:-4 10:1 11:0 12:0 13:1 14:0 15:0
Row: 10 -> 0:0 1:0 2:0 3:0 4:0 5:0 6:1 7:0 8:0 9:1 10:-4 11:1 12:0 13:0 14:1 15:0
Row: 11 -> 0:0 1:0 2:0 3:0 4:0 5:0 6:0 7:0 8:0 9:0 10:0 11:1 12:0 13:0 14:0 15:0
Row: 12 -> 0:0 1:0 2:0 3:0 4:0 5:0 6:0 7:0 8:0 9:0 10:0 11:0 12:1 13:0 14:0 15:0
Row: 13 -> 0:0 1:0 2:0 3:0 4:0 5:0 6:0 7:0 8:0 9:0 10:0 11:0 12:0 13:1 14:0 15:0
Row: 14 -> 0:0 1:0 2:0 3:0 4:0 5:0 6:0 7:0 8:0 9:0 10:0 11:0 12:0 13:0 14:1 15:0
Row: 15 -> 0:0 1:0 2:0 3:0 4:0 5:0 6:0 7:0 8:0 9:0 10:0 11:0 12:0 13:0 14:0 15:1
Sums in matrix rows = [ 1, 1, 1, 1, 1, 8, 8, 1, 1, 8, 8, 1, 1, 1, 1, 1 ]
Running example on CUDA GPU ...
Laplace operator lambda matrix:
Row: 0 -> 0:1
Row: 1 -> 1:1
Row: 2 -> 2:1
Row: 3 -> 3:1
Row: 4 -> 4:1
Row: 5 -> 1:1 4:1 5:-4 6:1 9:1
Row: 6 -> 2:1 5:1 6:-4 7:1 10:1
Row: 7 -> 7:1
Row: 8 -> 8:1
Row: 9 -> 5:1 8:1 9:-4 10:1 13:1
Row: 10 -> 6:1 9:1 10:-4 11:1 14:1
Row: 11 -> 11:1
Row: 12 -> 12:1
Row: 13 -> 13:1
Row: 14 -> 14:1
Row: 15 -> 15:1
Laplace operator dense matrix:
Row: 0 -> 0:1 1:0 2:0 3:0 4:0 5:0 6:0 7:0 8:0 9:0 10:0 11:0 12:0 13:0 14:0 15:0
Row: 1 -> 0:0 1:1 2:0 3:0 4:0 5:0 6:0 7:0 8:0 9:0 10:0 11:0 12:0 13:0 14:0 15:0
Row: 2 -> 0:0 1:0 2:1 3:0 4:0 5:0 6:0 7:0 8:0 9:0 10:0 11:0 12:0 13:0 14:0 15:0
Row: 3 -> 0:0 1:0 2:0 3:1 4:0 5:0 6:0 7:0 8:0 9:0 10:0 11:0 12:0 13:0 14:0 15:0
Row: 4 -> 0:0 1:0 2:0 3:0 4:1 5:0 6:0 7:0 8:0 9:0 10:0 11:0 12:0 13:0 14:0 15:0
Row: 5 -> 0:0 1:1 2:0 3:0 4:1 5:-4 6:1 7:0 8:0 9:1 10:0 11:0 12:0 13:0 14:0 15:0
Row: 6 -> 0:0 1:0 2:1 3:0 4:0 5:1 6:-4 7:1 8:0 9:0 10:1 11:0 12:0 13:0 14:0 15:0
Row: 7 -> 0:0 1:0 2:0 3:0 4:0 5:0 6:0 7:1 8:0 9:0 10:0 11:0 12:0 13:0 14:0 15:0
Row: 8 -> 0:0 1:0 2:0 3:0 4:0 5:0 6:0 7:0 8:1 9:0 10:0 11:0 12:0 13:0 14:0 15:0
Row: 9 -> 0:0 1:0 2:0 3:0 4:0 5:1 6:0 7:0 8:1 9:-4 10:1 11:0 12:0 13:1 14:0 15:0
Row: 10 -> 0:0 1:0 2:0 3:0 4:0 5:0 6:1 7:0 8:0 9:1 10:-4 11:1 12:0 13:0 14:1 15:0
Row: 11 -> 0:0 1:0 2:0 3:0 4:0 5:0 6:0 7:0 8:0 9:0 10:0 11:1 12:0 13:0 14:0 15:0
Row: 12 -> 0:0 1:0 2:0 3:0 4:0 5:0 6:0 7:0 8:0 9:0 10:0 11:0 12:1 13:0 14:0 15:0
Row: 13 -> 0:0 1:0 2:0 3:0 4:0 5:0 6:0 7:0 8:0 9:0 10:0 11:0 12:0 13:1 14:0 15:0
Row: 14 -> 0:0 1:0 2:0 3:0 4:0 5:0 6:0 7:0 8:0 9:0 10:0 11:0 12:0 13:0 14:1 15:0
Row: 15 -> 0:0 1:0 2:0 3:0 4:0 5:0 6:0 7:0 8:0 9:0 10:0 11:0 12:0 13:0 14:0 15:1
Sums in matrix rows = [ 1, 1, 1, 1, 1, 8, 8, 1, 1, 8, 8, 1, 1, 1, 1, 1 ]
Method forElements
The lambda matrix has the same interface as other matrix types except of the method getRow. The following example demonstrates the use of the method forElements (TNL::Matrices::LambdaMatrix::forElements) to copy the lambda matrix into the dense matrix:
1#include <iostream>
2#include <TNL/Matrices/DenseMatrix.h>
3#include <TNL/Matrices/LambdaMatrix.h>
4#include <TNL/Devices/Host.h>
5#include <TNL/Devices/Cuda.h>
6
7template< typename Device >
8void
9forElementsExample()
10{
11
12
13
14 auto rowLengths = [ = ]
__cuda_callable__(
const int rows,
const int columns,
const int rowIdx ) ->
int
15 {
16 return columns;
17 };
18 auto matrixElements =
20 const int rows, const int columns, const int rowIdx, const int localIdx, int& columnIdx, double& value )
21 {
22 columnIdx = localIdx;
23 value =
TNL::max( rowIdx - columnIdx + 1, 0 );
24 };
25
27 auto matrix = MatrixFactory::create( 5, 5, matrixElements, rowLengths );
28
30 auto denseView = denseMatrix.getView();
31
32 auto f = [ = ]
__cuda_callable__(
int rowIdx,
int localIdx,
int columnIdx,
double value )
mutable
33 {
34 denseView.setElement( rowIdx, columnIdx, value );
35 };
36
40}
41
42int
43main( int argc, char* argv[] )
44{
46 forElementsExample< TNL::Devices::Host >();
47
48#ifdef __CUDACC__
50 forElementsExample< TNL::Devices::Cuda >();
51#endif
52}
Helper class for creating instances of LambdaMatrix.
Definition LambdaMatrix.h:553
Here, we treat the lambda matrix as if it was dense matrix and so the lambda function compressedRowLengths returns the number of the nonzero elements equal to the number of matrix columns (line 13). However, the lambda function matrixElements (lines 14-17), sets nonzero values only to lower triangular part of the matrix. The elements in the upper part are equal to zero (line 16). Next we create an instance of the lambda matrix with a help of the lambda matrix factory (TNL::Matrices::LambdaMatrixFactory) (lines 19-20) and an instance of the dense matrix (TNL::Matrices::DenseMatrix) (lines 22-23).
Next we call the lambda function f by the method forElements (TNL::Matrices::LambdaMatrix::forElements) to set the matrix elements of the dense matrix denseMatrix (line 26) via the dense matrix view (denseView) (TNL::Matrices::DenseMatrixView). Note, that in the lambda function f we get the matrix element value already evaluated in the variable value as we are used to from other matrix types. So in fact, the same lambda function f would do the same job even for sparse matrix or any other. Also note, that in this case we iterate even over all zero matrix elements because the lambda function compressedRowLengths (line 13) tells so. The result looks as follows:
Copying matrix on host:
Original lambda matrix:
Row: 0 -> 0:1
Row: 1 -> 0:2 1:1
Row: 2 -> 0:3 1:2 2:1
Row: 3 -> 0:4 1:3 2:2 3:1
Row: 4 -> 0:5 1:4 2:3 3:2 4:1
Dense matrix:
Row: 0 -> 0:1 1:0 2:0 3:0 4:0
Row: 1 -> 0:2 1:1 2:0 3:0 4:0
Row: 2 -> 0:3 1:2 2:1 3:0 4:0
Row: 3 -> 0:4 1:3 2:2 3:1 4:0
Row: 4 -> 0:5 1:4 2:3 3:2 4:1
Copying matrix on CUDA device:
Original lambda matrix:
Row: 0 -> 0:1
Row: 1 -> 0:2 1:1
Row: 2 -> 0:3 1:2 2:1
Row: 3 -> 0:4 1:3 2:2 3:1
Row: 4 -> 0:5 1:4 2:3 3:2 4:1
Dense matrix:
Row: 0 -> 0:1 1:0 2:0 3:0 4:0
Row: 1 -> 0:2 1:1 2:0 3:0 4:0
Row: 2 -> 0:3 1:2 2:1 3:0 4:0
Row: 3 -> 0:4 1:3 2:2 3:1 4:0
Row: 4 -> 0:5 1:4 2:3 3:2 4:1
At the end of this part, we show two more examples, how to express a matrix approximating the Laplace operator:
1#include <iostream>
2#include <TNL/Matrices/LambdaMatrix.h>
3#include <TNL/Devices/Host.h>
4#include <TNL/Devices/Cuda.h>
5
6template< typename Device >
7void
8laplaceOperatorMatrix()
9{
10
11
12
13
14 const int gridSize( 4 );
15 const int matrixSize = gridSize * gridSize;
16 auto rowLengths = [ = ]
__cuda_callable__(
const int rows,
const int columns,
const int rowIdx ) ->
int
17 {
18 const int gridRow = rowIdx / gridSize;
19 const int gridColumn = rowIdx % gridSize;
20 if( gridRow == 0 || gridRow == gridSize - 1 ||
21 gridColumn == 0 || gridColumn == gridSize - 1 )
22 return 1;
23 return 5;
24 };
25 auto matrixElements =
27 const int rows, const int columns, const int rowIdx, const int localIdx, int& columnIdx, double& value )
28 {
29 const int gridRow = rowIdx / gridSize;
30 const int gridColumn = rowIdx % gridSize;
31 if( gridRow == 0 || gridRow == gridSize - 1 ||
32 gridColumn == 0 || gridColumn == gridSize - 1 )
33 {
34 columnIdx = rowIdx;
35 value = 1.0;
36 }
37 else
38 {
39 switch( localIdx )
40 {
41 case 0:
42 columnIdx = rowIdx - gridSize;
43 value = 1;
44 break;
45 case 1:
46 columnIdx = rowIdx - 1;
47 value = 1;
48 break;
49 case 2:
50 columnIdx = rowIdx;
51 value = -4;
52 break;
53 case 3:
54 columnIdx = rowIdx + 1;
55 value = 1;
56 break;
57 case 4:
58 columnIdx = rowIdx + gridSize;
59 value = 1;
60 break;
61 }
62 }
63 };
64 auto matrix =
67}
68
69int
70main( int argc, char* argv[] )
71{
73 laplaceOperatorMatrix< TNL::Devices::Host >();
74
75#ifdef __CUDACC__
77 laplaceOperatorMatrix< TNL::Devices::Cuda >();
78#endif
79}
The following is another way of doing the same but with precomputed supporting vectors:
1#include <iostream>
2#include <TNL/Matrices/LambdaMatrix.h>
3#include <TNL/Devices/Host.h>
4#include <TNL/Devices/Cuda.h>
5
6template< typename Device >
7void
8laplaceOperatorMatrix()
9{
10
11
12
13
14 const int gridSize( 4 );
15 const int matrixSize = gridSize * gridSize;
17 0, -1, 0, 1, 0
18 };
21 auto columnOffsetsView = columnOffsets.
getView();
23 auto valuesView = values.
getView();
24
25 auto rowLengths = [ = ]
__cuda_callable__(
const int rows,
const int columns,
const int rowIdx ) ->
int
26 {
27 const int gridRow = rowIdx / gridSize;
28 const int gridColumn = rowIdx % gridSize;
29
30 if( gridRow == 0 || gridRow == gridSize - 1 ||
31 gridColumn == 0 || gridColumn == gridSize - 1 )
32 return 1;
33 return 5;
34 };
35
36 auto matrixElements =
38 const int rows, const int columns, const int rowIdx, const int localIdx, int& columnIdx, double& value )
39 {
40 const int gridRow = rowIdx / gridSize;
41 const int gridColumn = rowIdx % gridSize;
42 if( gridRow == 0 || gridRow == gridSize - 1 ||
43 gridColumn == 0 || gridColumn == gridSize - 1 )
44 {
45 columnIdx = rowIdx;
46 value = 1.0;
47 }
48 else
49 {
50 columnIdx = rowIdx + columnOffsetsView[ localIdx ];
51 value = valuesView[ localIdx ];
52 }
53 };
54 auto matrix =
57}
58
59int
60main( int argc, char* argv[] )
61{
63 laplaceOperatorMatrix< TNL::Devices::Host >();
64
65#ifdef __CUDACC__
67 laplaceOperatorMatrix< TNL::Devices::Cuda >();
68#endif
69}
ViewType getView(IndexType begin=0, IndexType end=0)
Returns a modifiable view of the vector.
Definition Vector.hpp:25
The result of both examples looks as follows:
Creating Laplace operator matrix on CPU ...
Laplace operator matrix:
Row: 0 -> 0:1
Row: 1 -> 1:1
Row: 2 -> 2:1
Row: 3 -> 3:1
Row: 4 -> 4:1
Row: 5 -> 1:1 4:1 5:-4 6:1 9:1
Row: 6 -> 2:1 5:1 6:-4 7:1 10:1
Row: 7 -> 7:1
Row: 8 -> 8:1
Row: 9 -> 5:1 8:1 9:-4 10:1 13:1
Row: 10 -> 6:1 9:1 10:-4 11:1 14:1
Row: 11 -> 11:1
Row: 12 -> 12:1
Row: 13 -> 13:1
Row: 14 -> 14:1
Row: 15 -> 15:1
Creating Laplace operator matrix on CUDA GPU ...
Laplace operator matrix:
Row: 0 -> 0:1
Row: 1 -> 1:1
Row: 2 -> 2:1
Row: 3 -> 3:1
Row: 4 -> 4:1
Row: 5 -> 1:1 4:1 5:-4 6:1 9:1
Row: 6 -> 2:1 5:1 6:-4 7:1 10:1
Row: 7 -> 7:1
Row: 8 -> 8:1
Row: 9 -> 5:1 8:1 9:-4 10:1 13:1
Row: 10 -> 6:1 9:1 10:-4 11:1 14:1
Row: 11 -> 11:1
Row: 12 -> 12:1
Row: 13 -> 13:1
Row: 14 -> 14:1
Row: 15 -> 15:1
Distributed matrices
TODO: Write documentation on distributed matrices.
Flexible reduction in matrix rows
Flexible reduction in matrix rows is a powerful tool for many different matrix operations. It is represented by the method reduceRows (TNL::Matrices::DenseMatrixBase::reduceRows, TNL::Matrices::SparseMatrix::reduceRows, TNL::Matrices::TridiagonalMatrix::reduceRows, TNL::Matrices::MultidiagonalMatrix::reduceRows, TNL::Matrices::LambdaMatrix::reduceRows) and similar to the method forElements it iterates over particular matrix rows. However, it performs flexible parallel reduction in addition. For example, the matrix-vector product can be seen as a reduction of products of matrix elements with the input vector in particular matrix rows. The first element of the result vector ios obtained as:
\[y_1 = a_{11} x_1 + a_{12} x_2 + \ldots + a_{1n} x_n = \sum_{j=1}^n a_{1j}x_j
\]
and in general i-th element of the result vector is computed as
\[y_i = a_{i1} x_1 + a_{i2} x_2 + \ldots + a_{in} x_n = \sum_{j=1}^n a_{ij}x_j.
\]
We see that in i-th matrix row we have to compute the sum \(\sum_{j=1}^n a_{ij}x_j\) which is reduction of products \( a_{ij}x_j\). Similar to flexible parallel reduction (TNL::Algorithms::reduce) we just need to design proper lambda functions. There are three of them.
- fetch reads and preprocesses data entering the flexible parallel reduction.
- reduce performs the reduction operation.
- keep stores the results from each matrix row.
Lambda function fetch
This lambda function has the same purpose as the lambda function fetch in flexible parallel reduction for arrays and vectors (see Flexible Parallel Reduction). It is supposed to be declared as follows:
1auto fetch = [=]
__cuda_callable__( Index rowIdx, Index columnIdx,
const Real& value ) -> Real;
The meaning of the particular parameters is as follows:
- rowIdx is the row index of the matrix element.
- columnIdx is the column index of the matrix element.
- value is the value of the matrix element.
The lambda function returns a value of type Real based on the input data.
Lambda function reduce
The lambda function reduce expresses reduction operation (sum, product, minimum, maximum etc.) which is supposed to be done during the flexible reduction.
The meaning of the particular parameters is as follows:
- a is the first operand for the reduction operation.
- b is the second operand for the reduction operation.
Lambda function keep
The lambda function keep is new one compared to the flexible reduction for arrays, vectors or other linear structures. The reason is that the result consists of as many numbers as there are matrix rows. Result obtained for each matrix row is processed by this lambda function. It is declared as follows:
The meaning of the particular parameters is as follows:
- rowIdx is an index of the matrix row related to given result of flexible reduction.
- valueis the result of the flexible reduction in given matrix row.
The method reduceRows (TNL::Matrices::DenseMatrixBase::reduceRows, TNL::Matrices::SparseMatrix::reduceRows, TNL::Matrices::TridiagonalMatrix::reduceRows, TNL::Matrices::MultidiagonalMatrix::reduceRows, TNL::Matrices::LambdaMatrix::reduceRows) accepts the following arguments:
- begin is the beginning of the matrix rows range on which the reduction will be performed.
- end is the end of the matrix rows range on which the reduction will be performed. The last matrix row which is going to be processed has index end-1.
- fetch is the lambda function for data fetching.
- reduce is the the lambda function performing the reduction.
- keep is the lambda function responsible for processing the results from particular matrix rows.
- zero is the identity element of given reduction operation.
Though the interface is the same for all matrix types, in the following part we will show several examples for different matrix types to better demonstrate possible ways of use of the flexible reduction for matrices.
Dense matrices example
The following example demonstrates implementation of the dense matrix-vector product \( {\bf y} = A \vec {\bf x}\), i.e.
\[ y_i = \sum_{j=0}^{columns - 1} a_{ij} x_j \text{ for } i = 0, \ldots, rows-1.
\]
1#include <iostream>
2#include <iomanip>
3#include <functional>
4#include <TNL/Matrices/DenseMatrix.h>
5#include <TNL/Devices/Host.h>
6#include <TNL/Devices/Cuda.h>
7
8template< typename Device >
9void
10reduceRows()
11{
12
14 { 1, 0, 0, 0, 0 },
15 { 1, 2, 0, 0, 0 },
16 { 0, 1, 8, 0, 0 },
17 { 0, 0, 1, 9, 0 },
18 { 0, 0, 0, 0, 1 } };
19
20
21
22
24
25
26
27
28 x = 1.0;
29
30
31
32
33 auto xView = x.getView();
34 auto yView = y.getView();
35
36
37
38
39
40 auto fetch = [ = ]
__cuda_callable__(
int rowIdx,
int columnIdx,
const double& value ) ->
double
41 {
42 return xView[ columnIdx ] * value;
43 };
44
45
46
47
49 {
50 yView[ rowIdx ] = value;
51 };
52
53
54
55
57
61}
62
63int
64main( int argc, char* argv[] )
65{
67 reduceRows< TNL::Devices::Host >();
68
69#ifdef __CUDACC__
72 reduceRows< TNL::Devices::Cuda >();
73#endif
74}
__cuda_callable__ IndexType getColumns() const
Returns number of matrix columns.
Definition MatrixBase.hpp:50
We set the following lambda functions:
- fetch lambda function computes the product \( a_{ij}x_j\) where \( a_{ij} \) is represented by value and \(x_j \) is represented by xView[columnIdx] (line 40).
- reduce - reduction is just sum of particular products and it is represented by std::plus (line 53).
- keep is responsible for storing the results of reduction in each matrix row (which is represented by the variable value) into the output vector y.
The result looks as:
Rows reduction on host:
The matrix reads as:
Row: 0 -> 0:1 1:0 2:0 3:0 4:0
Row: 1 -> 0:1 1:2 2:0 3:0 4:0
Row: 2 -> 0:0 1:1 2:8 3:0 4:0
Row: 3 -> 0:0 1:0 2:1 3:9 4:0
Row: 4 -> 0:0 1:0 2:0 3:0 4:1
The input vector is:[ 1, 1, 1, 1, 1 ]
Result of matrix-vector multiplication is: [ 1, 3, 9, 10, 1 ]
Rows reduction on CUDA device:
The matrix reads as:
Row: 0 -> 0:1 1:0 2:0 3:0 4:0
Row: 1 -> 0:1 1:2 2:0 3:0 4:0
Row: 2 -> 0:0 1:1 2:8 3:0 4:0
Row: 3 -> 0:0 1:0 2:1 3:9 4:0
Row: 4 -> 0:0 1:0 2:0 3:0 4:1
The input vector is:[ 1, 1, 1, 1, 1 ]
Result of matrix-vector multiplication is: [ 1, 3, 9, 10, 1 ]
We will show one more example which is a computation of maximal absolute value in each matrix row. The results will be stored in a vector:
\[y_i = \max_{j=1,\ldots,n} |a_{ij}|.
\]
See the following example:
1#include <iostream>
2#include <iomanip>
3#include <functional>
4#include <TNL/Matrices/DenseMatrix.h>
5#include <TNL/Devices/Host.h>
6
7template< typename Device >
8void
9reduceRows()
10{
11
13 { 1, 0, 0, 0, 0 },
14 { 1, 2, 0, 0, 0 },
15 { 0, 1, 8, 0, 0 },
16 { 0, 0, 1, 9, 0 },
17 { 0, 0, 0, 0, 1 } };
18
19
20
21
23
24
25
26
27 auto rowMaxView = rowMax.getView();
28
29
30
31
32 auto fetch = [ = ]
__cuda_callable__(
int rowIdx,
int columnIdx,
const double& value ) ->
double
33 {
35 };
36
37
38
39
41 {
43 };
44
45
46
47
49 {
50 rowMaxView[ rowIdx ] = value;
51 };
52
53
54
55
57
60}
61
62int
63main( int argc, char* argv[] )
64{
66 reduceRows< TNL::Devices::Host >();
67
68#ifdef __CUDACC__
70 reduceRows< TNL::Devices::Cuda >();
71#endif
72}
Result reduce(Index begin, Index end, Fetch &&fetch, Reduction &&reduction, const Result &identity)
reduce implements (parallel) reduction for vectors and arrays.
Definition reduce.h:66
The lambda functions rare:
- fetch lambda function just returns absolute value of \(a_{ij} \) which is represented again by the variable value.
- reduce lambda function returns larger of given values.
- keep stores the results to the output vector the same way as in the previous example.
Note, that the identity value for the reduction is std::numeric_limits< double >::lowest. Of course, if we compute the maximum of all output vector elements, we get some kind of maximal matrix norm. The output looks as:
Rows reduction on host:
Max. elements in rows are: [ 1, 2, 8, 9, 1 ]
Max. matrix norm is: 9
Rows reduction on CUDA device:
Max. elements in rows are: [ 1, 2, 8, 9, 1 ]
Max. matrix norm is: 9
Sparse matrices example
The following example demonstrates sparse matrix-vector product:
1#include <iostream>
2#include <iomanip>
3#include <functional>
4#include <TNL/Matrices/SparseMatrix.h>
5#include <TNL/Devices/Host.h>
6#include <TNL/Devices/Cuda.h>
7
8template< typename Device >
9void
10reduceRows()
11{
12
14 { 0, 0, 1 },
15 { 1, 0, 1 }, { 1, 1, 2 },
16 { 2, 1, 1 }, { 2, 2, 8 },
17 { 3, 2, 1 }, { 3, 3, 9 },
18 { 4, 4, 1 }
19 } };
20
21
22
23
24
26
27
28
29
30 x = 1.0;
31
32
33
34
35 auto xView = x.getView();
36 auto yView = y.getView();
37
38
39
40
41
42 auto fetch = [ = ]
__cuda_callable__(
int rowIdx,
int columnIdx,
const double& value ) ->
double
43 {
44 return xView[ columnIdx ] * value;
45 };
46
47
48
49
51 {
52 yView[ rowIdx ] = value;
53 };
54
55
56
57
59
63}
64
65int
66main( int argc, char* argv[] )
67{
69 reduceRows< TNL::Devices::Host >();
70
71#ifdef __CUDACC__
74 reduceRows< TNL::Devices::Cuda >();
75#endif
76}
std::enable_if_t< Algorithms::SegmentsReductionKernels::isSegmentReductionKernel< SegmentsReductionKernel >::value > reduceRows(IndexType begin, IndexType end, Fetch &&fetch, const Reduce &reduce, Keep &&keep, const FetchValue &identity, const SegmentsReductionKernel &kernel=SegmentsReductionKernel{}) const
Method for performing general reduction on matrix rows for constant instances.
Definition SparseMatrixBase.hpp:451
On the lines 11-16 we set the following matrix:
\[\left(
\begin{array}{ccccc}
1 & . & . & . & . \\
1 & 2 & . & . & . \\
. & 1 & 8 & . & . \\
. & . & 1 & 9 & . \\
. & . & . & . & 1
\end{array}
\right)
\]
The lambda functions on the lines 39-48 are the same as in the example with the dense matrix. The result looks as follows:
Rows reduction on host:
The matrix reads as:
Row: 0 -> 0:1
Row: 1 -> 0:1 1:2
Row: 2 -> 1:1 2:8
Row: 3 -> 2:1 3:9
Row: 4 -> 4:1
The input vector is:[ 1, 1, 1, 1, 1 ]
Result of matrix-vector multiplication is: [ 1, 3, 9, 10, 1 ]
Rows reduction on CUDA device:
The matrix reads as:
Row: 0 -> 0:1
Row: 1 -> 0:1 1:2
Row: 2 -> 1:1 2:8
Row: 3 -> 2:1 3:9
Row: 4 -> 4:1
The input vector is:[ 1, 1, 1, 1, 1 ]
Result of matrix-vector multiplication is: [ 1, 3, 9, 10, 1 ]
Tridiagonal matrices example
In this example, we will compute maximal absolute value in each row of the following tridiagonal matrix:
\[\left(
\begin{array}{ccccc}
1 & 3 & & & & \\
2 & 1 & 3 & & & \\
& 2 & 1 & 3 & & \\
& & 2 & 1 & 3 & \\
& & & 2 & 1 & 3
\end{array}
\right).
\]
The source code reads as follows:
1#include <iostream>
2#include <iomanip>
3#include <functional>
4#include <TNL/Matrices/TridiagonalMatrix.h>
5#include <TNL/Devices/Host.h>
6
7template< typename Device >
8void
9reduceRows()
10{
11
12
13
14
15
16
17
18
19
20
21
23 { { 0, 1, 3 },
24 { 2, 1, 3 },
25 { 2, 1, 3 },
26 { 2, 1, 3 },
27 { 2, 1, 3 } } );
29
30
31
32
34
35
36
37
38 auto rowMaxView = rowMax.getView();
39
40
41
42
43 auto fetch = []
__cuda_callable__(
int rowIdx,
int columnIdx,
const double& value ) ->
double
44 {
46 };
47
48
49
50
52 {
54 };
55
56
57
58
60 {
61 rowMaxView[ rowIdx ] = value;
62 };
63
64
65
66
68
71}
72
73int
74main( int argc, char* argv[] )
75{
77 reduceRows< TNL::Devices::Host >();
78
79#ifdef __CUDACC__
81 reduceRows< TNL::Devices::Cuda >();
82#endif
83}
void reduceRows(IndexType begin, IndexType end, Fetch &&fetch, const Reduce &reduce, Keep &&keep, const FetchReal &identity) const
Method for performing general reduction on matrix rows for constant instances.
Definition TridiagonalMatrixBase.hpp:189
Here we first set the tridiagonal matrix (lines 10-27). Next we allocate the vector rowMax where we will store the results (line 32). The lambda function are:
- fetch (lines 42-44) is responsible for reading the matrix elements. In our example, the only thing this function has to do, is to compute the absolute value of each matrix element represented by variable value.
- reduce (lines 49-51), performs reduction operation. In this case, it returns maximum of two input values a and b.
- keep (lines 56-58) takes the result of the reduction in variable value in each row and stores it into the vector rowMax via related vector view rowMaxView.
Note, that the identity value for the reduction is std::numeric_limits< double >::lowest. The results looks as follows:
Rows reduction on host:
The matrix reads as:
Row: 0 -> 0:1 1:3
Row: 1 -> 0:2 1:1 2:3
Row: 2 -> 1:2 2:1 3:3
Row: 3 -> 2:2 3:1 4:3
Row: 4 -> 3:2 4:1
Max. elements in rows are: [ 3, 3, 3, 3, 2 ]
Rows reduction on CUDA device:
The matrix reads as:
Row: 0 -> 0:1 1:3
Row: 1 -> 0:2 1:1 2:3
Row: 2 -> 1:2 2:1 3:3
Row: 3 -> 2:2 3:1 4:3
Row: 4 -> 3:2 4:1
Max. elements in rows are: [ 3, 3, 3, 3, 2 ]
Multidiagonal matrices example
The next example computes again the maximal absolute value in each row. Now, we do it for multidiagonal matrix the following form:
\[\left(
\begin{array}{ccccc}
1 & & & & \\
2 & 1 & & & \\
3 & 2 & 1 & & \\
& 3 & 2 & 1 & \\
& & 3 & 2 & 1
\end{array}
\right)
\]
We first create vector rowMax into which we will store the results and fetch it view rowMaxView (line 39). Next we prepare necessary lambda functions:
- fetch (lines 44-46) is responsible for reading the matrix element value which is stored in the constant reference value and for returning its absolute value. The other parameters rowIdx and columnIdx correspond to row and column indexes respectively and they are omitted in our example.
- reduce (lines 51-53) returns maximum value of the two input values a and b.
- keep (line 58-60) stores the input value at the corresponding position, given by the row index rowIdx, in the output vector view rowMaxView.
Finally, we call the method reduceRows (TNL::Matrices::MultidiagonalMatrix::reduceRows) with parameters telling the interval of rows to be processed (the first and second parameter), the lambda functions fetch, reduce and keep, and the identity element for the reduction operation which is the lowest number of given type (std::numeric_limits< double >::lowest ). The result looks as follows:
Rows reduction on host:
The matrix reads as:
Row: 0 -> 0:1
Row: 1 -> 0:2 1:1
Row: 2 -> 0:3 1:2 2:1
Row: 3 -> 1:3 2:2 3:1
Row: 4 -> 2:3 3:2 4:1
Max. elements in rows are: [ -1.79769e+308, -1.79769e+308, -1.79769e+308, -1.79769e+308, -1.79769e+308 ]
Rows reduction on CUDA device:
The matrix reads as:
Row: 0 -> 0:1
Row: 1 -> 0:2 1:1
Row: 2 -> 0:3 1:2 2:1
Row: 3 -> 1:3 2:2 3:1
Row: 4 -> 2:3 3:2 4:1
Max. elements in rows are: [ -1.79769e+308, -1.79769e+308, -1.79769e+308, -1.79769e+308, -1.79769e+308 ]
Lambda matrices example
The reduction of matrix rows is available for the lambda matrices as well. See the following example:
1#include <iostream>
2#include <iomanip>
3#include <functional>
4#include <TNL/Matrices/LambdaMatrix.h>
5#include <TNL/Devices/Host.h>
6#include <TNL/Devices/Cuda.h>
7
8template< typename Device >
9void
10reduceRows()
11{
12
13
14
15 auto rowLengths = [ = ]
__cuda_callable__(
const int rows,
const int columns,
const int rowIdx ) ->
int
16 {
17 return columns;
18 };
19 auto matrixElements =
21 const int rows, const int columns, const int rowIdx, const int localIdx, int& columnIdx, double& value )
22 {
23 columnIdx = localIdx;
24 value =
TNL::max( rowIdx - columnIdx + 1, 0 );
25 };
26
28 auto matrix = MatrixFactory::create( 5, 5, matrixElements, rowLengths );
29
30
31
32
34
35
36
37
38 auto rowMaxView = rowMax.getView();
39
40
41
42
43 auto fetch = [ = ]
__cuda_callable__(
int rowIdx,
int columnIdx,
const double& value ) ->
double
44 {
46 };
47
48
49
50
52 {
54 };
55
56
57
58
60 {
61 rowMaxView[ rowIdx ] = value;
62 };
63
64
65
66
68
71}
72
73int
74main( int argc, char* argv[] )
75{
77 reduceRows< TNL::Devices::Host >();
78
79#ifdef __CUDACC__
81 reduceRows< TNL::Devices::Cuda >();
82#endif
83}
On the lines 14-21, we create the lower triangular lambda matrix which looks as follows:
\[\left(
\begin{array}{ccccc}
1 & & & & \\
2 & 1 & & & \\
3 & 2 & 1 & & \\
4 & 3 & 2 & 1 & \\
5 & 4 & 3 & 2 & 1
\end{array}
\right)
\]
We want to compute maximal absolute value of matrix elements in each row. For this purpose we define well known lambda functions:
- fetch takes the value of the lambda matrix element and returns its absolute value.
- reduce computes maximum value of two input variables.
- keep stores the results into output vector rowMax.
Note that the interface of the lambda functions is the same as for other matrix types. The result looks as follows:
Rows reduction on host:
The matrix reads as:
Row: 0 -> 0:1
Row: 1 -> 0:2 1:1
Row: 2 -> 0:3 1:2 2:1
Row: 3 -> 0:4 1:3 2:2 3:1
Row: 4 -> 0:5 1:4 2:3 3:2 4:1
Max. elements in rows are: [ 1, 2, 3, 4, 5 ]
Rows reduction on CUDA device:
The matrix reads as:
Row: 0 -> 0:1
Row: 1 -> 0:2 1:1
Row: 2 -> 0:3 1:2 2:1
Row: 3 -> 0:4 1:3 2:2 3:1
Row: 4 -> 0:5 1:4 2:3 3:2 4:1
Max. elements in rows are: [ 1, 2, 3, 4, 5 ]
Matrix-vector product
One of the most important matrix operation is the matrix-vector multiplication. It is represented by a method vectorProduct (TNL::Matrices::DenseMatrix::vectorProduct, TNL::Matrices::SparseMatrix::vectorProduct, TNL::Matrices::TridiagonalMatrix::vectorProduct, TNL::Matrices::MultidiagonalMatrix::vectorProduct, TNL::Matrices::LambdaMatrix::vectorProduct). It is templated method with two template parameters InVector and OutVector telling the types of input and output vector respectively. Usually one will substitute some of TNL::Containers::Array, TNL::Containers::ArrayView, TNL::Containers::Vector or TNL::Containers::VectorView for these types. The method accepts the following parameters:
- inVector is the input vector having the same number of elements as the number of matrix columns.
- outVector is the output vector having the same number of elements as the number of matrix rows.
- matrixMultiplicator is a number by which the result of matrix-vector product is multiplied.
- outVectorMultiplicator is a number by which the output vector is multiplied before added to the result of matrix-vector product.
- begin is the beginning of the matrix rows range on which we compute the matrix-vector product.
- end is the end of the matrix rows range on which the matrix-vector product will be evaluated. The last matrix row which is going to be processed has index end-1.
Note that the output vector dimension must be the same as the number of matrix rows no matter how we set begin and end parameters. These parameters just say that some matrix rows and the output vector elements are omitted.
To summarize, this method computes the following formula:
outVector = matrixMultiplicator * ( *this ) * inVector + outVectorMultiplicator * outVector.
Matrix I/O operations
All matrices can be saved to a file using a method save (TNL::Matrices::DenseMatrix::save, TNL::Matrices::SparseMatrix::save, TNL::Matrices::TridiagonalMatrix::save, TNL::Matrices::MultidiagonalMatrix::save) and restored with a method load (TNL::Matrices::DenseMatrix::load, TNL::Matrices::SparseMatrix::load, TNL::Matrices::TridiagonalMatrix::load, TNL::Matrices::MultidiagonalMatrix::load). To print the matrix, there is a method print (TNL::Matrices::DenseMatrix::print, TNL::Matrices::SparseMatrix::print, TNL::Matrices::TridiagonalMatrix::print, TNL::Matrices::MultidiagonalMatrix::print, TNL::Matrices::LambdaMatrix::print) can be used.
Matrix reader and writer
TNL also offers matrix reader (TNL::Matrices::MatrixReader) and matrix writer (TNL::Matrices::MatrixWriter) for import and export of matrices respectively. The matrix reader currently supports only Coordinate MTX file format which is popular mainly for sparse matrices. By the mean of the matrix writer, we can export TNL matrices into coordinate MTX format as well. In addition, the matrices can be exported to a text file suitable for Gnuplot program which can be used for matrix visualization. Finally, a pattern of nonzero matrix elements can be visualized via the EPS format - Encapsulated PostScript. We demonstrate both matrix reader and writer in the following example:
1#include <iostream>
2#include <TNL/Matrices/SparseMatrix.h>
3#include <TNL/Matrices/DenseMatrix.h>
4#include <TNL/Matrices/MatrixReader.h>
5#include <TNL/Matrices/MatrixWriter.h>
6#include <TNL/Devices/Host.h>
7#include <TNL/Devices/Cuda.h>
8
9template< typename Device >
10void
11matrixWriterExample()
12{
14 Matrix matrix( 5,
15 5,
16 {
17
18
19 { 0, 0, 2.0 },
20 { 1, 0, -1.0 }, { 1, 1, 2.0 }, { 1, 2, -1.0 },
21 { 2, 1, -1.0 }, { 2, 2, 2.0 }, { 2, 3, -1.0 },
22 { 3, 2, -1.0 }, { 3, 3, 2.0 }, { 3, 4, -1.0 },
23 { 4, 4, 2.0 },
24
25 } );
26
28 std::cout <<
"Writing matrix in Gnuplot format into the file matrix-writer-example.gplt ...";
31 std::cout <<
"Writing matrix pattern in EPS format into the file matrix-writer-example.eps ...";
34 std::cout <<
"Writing matrix in MTX format into the file matrix-writer-example.mtx ...";
37}
38
39template< typename Device >
40void
41matrixReaderExample()
42{
44 SparseMatrix sparseMatrix;
45
46 std::cout <<
"Reading sparse matrix from MTX file matrix-writer-example.mtx ... ";
50
52 DenseMatrix denseMatrix;
53
54 std::cout <<
"Reading dense matrix from MTX file matrix-writer-example.mtx ... ";
58}
59
60int
61main( int argc, char* argv[] )
62{
64 matrixWriterExample< TNL::Devices::Host >();
65 matrixReaderExample< TNL::Devices::Host >();
66
67#ifdef __CUDACC__
70 matrixWriterExample< TNL::Devices::Cuda >();
71 matrixReaderExample< TNL::Devices::Cuda >();
72#endif
73}
static void readMtx(const std::string &fileName, Matrix &matrix, bool verbose=false)
Method for importing matrix from file with given filename.
Definition MatrixReader.hpp:18
static void writeEps(const std::string &fileName, const Matrix &matrix, bool verbose=false)
Method for exporting matrix to file with given filename using EPS format.
Definition MatrixWriter.hpp:51
static void writeGnuplot(const std::string &fileName, const Matrix &matrix, bool verbose=false)
Method for exporting matrix to file with given filename using Gnuplot format.
Definition MatrixWriter.hpp:15
static void writeMtx(const std::string &fileName, const Matrix &matrix, bool verbose=false)
Method for exporting matrix to file with given filename using MTX format.
Definition MatrixWriter.hpp:33
The example consists of two functions - matrixWriterExample (lines 10-24) and matrixReaderExample (lines 36-54). In the first one, we first create a toy matrix (lines 13-22) which we subsequently export into Gnuplot (line 26), EPS (line 29) and MTX (line 32) formats. In the next step (the matrixReaderExample function on lines 36-54), the MTX file is used to import the matrix into sparse (line 43) and dense (line 51) matrices. Both matrices are printed out (lines 45 and 53).
The result looks as follows:
1Creating matrices on CPU ...
2Matrix:
3Row: 0 -> 0:2
4Row: 1 -> 0:-1 1:2 2:-1
5Row: 2 -> 1:-1 2:2 3:-1
6Row: 3 -> 2:-1 3:2 4:-1
7Row: 4 -> 4:2
8
9Writing matrix in Gnuplot format into the file matrix-writer-example.gplt ... OK
10Writing matrix pattern in EPS format into the file matrix-writer-example.eps ... OK
11Writing matrix in MTX format into the file matrix-writer-example.mtx ... OK
12Reading sparse matrix from MTX file matrix-writer-example.mtx ... OK
13Imported matrix is:
14Row: 0 -> 0:2
15Row: 1 -> 0:-1 1:2 2:-1
16Row: 2 -> 1:-1 2:2 3:-1
17Row: 3 -> 2:-1 3:2 4:-1
18Row: 4 -> 4:2
19
20Reading dense matrix from MTX file matrix-writer-example.mtx ... OK
21Imported matrix is:
22Row: 0 -> 0:2 1:0 2:0 3:0 4:0
23Row: 1 -> 0:-1 1:2 2:-1 3:0 4:0
24Row: 2 -> 0:0 1:-1 2:2 3:-1 4:0
25Row: 3 -> 0:0 1:0 2:-1 3:2 4:-1
26Row: 4 -> 0:0 1:0 2:0 3:0 4:2
27
28
29Creating matrices on CUDA GPU ...
30Matrix:
31Row: 0 -> 0:2
32Row: 1 -> 0:-1 1:2 2:-1
33Row: 2 -> 1:-1 2:2 3:-1
34Row: 3 -> 2:-1 3:2 4:-1
35Row: 4 -> 4:2
36
37Writing matrix in Gnuplot format into the file matrix-writer-example.gplt ... OK
38Writing matrix pattern in EPS format into the file matrix-writer-example.eps ... OK
39Writing matrix in MTX format into the file matrix-writer-example.mtx ... OK
40Reading sparse matrix from MTX file matrix-writer-example.mtx ... OK
41Imported matrix is:
42Row: 0 -> 0:2
43Row: 1 -> 0:-1 1:2 2:-1
44Row: 2 -> 1:-1 2:2 3:-1
45Row: 3 -> 2:-1 3:2 4:-1
46Row: 4 -> 4:2
47
48Reading dense matrix from MTX file matrix-writer-example.mtx ... OK
49Imported matrix is:
50Row: 0 -> 0:2 1:0 2:0 3:0 4:0
51Row: 1 -> 0:-1 1:2 2:-1 3:0 4:0
52Row: 2 -> 0:0 1:-1 2:2 3:-1 4:0
53Row: 3 -> 0:0 1:0 2:-1 3:2 4:-1
54Row: 4 -> 0:0 1:0 2:0 3:0 4:2
Appendix
Benchmark of dense matrix setup
1#include <iostream>
2#include <TNL/Algorithms/parallelFor.h>
3#include <TNL/Containers/StaticArray.h>
4#include <TNL/Matrices/DenseMatrix.h>
5#include <TNL/Devices/Host.h>
6#include <TNL/Devices/Cuda.h>
7#include <TNL/Timer.h>
8
9const int testsCount = 5;
10
11template< typename Matrix >
12void
13setElement_on_host( const int matrixSize, Matrix& matrix )
14{
15 matrix.setDimensions( matrixSize, matrixSize );
16
17 for( int j = 0; j < matrixSize; j++ )
18 for( int i = 0; i < matrixSize; i++ )
19 matrix.setElement( i, j, i + j );
20}
21
22template< typename Matrix >
23void
24setElement_on_host_and_transfer( const int matrixSize, Matrix& matrix )
25{
26 using RealType = typename Matrix::RealType;
27 using IndexType = typename Matrix::IndexType;
29 HostMatrix hostMatrix( matrixSize, matrixSize );
30
31 for( int j = 0; j < matrixSize; j++ )
32 for( int i = 0; i < matrixSize; i++ )
33 hostMatrix.setElement( i, j, i + j );
34 matrix = hostMatrix;
35}
36
37template< typename Matrix >
38void
39setElement_on_device( const int matrixSize, Matrix& matrix )
40{
41 matrix.setDimensions( matrixSize, matrixSize );
42
43 auto matrixView = matrix.
getView();
45 {
46 matrixView.
setElement( i[ 0 ], i[ 1 ], i[ 0 ] + i[ 1 ] );
47 };
51}
52
53template< typename Matrix >
54void
55getRow( const int matrixSize, Matrix& matrix )
56{
57 matrix.setDimensions( matrixSize, matrixSize );
58
59 auto matrixView = matrix.
getView();
61 {
62 auto row = matrixView.
getRow( rowIdx );
63 for( int i = 0; i < matrixSize; i++ )
64 row.setValue( i, rowIdx + i );
65 };
67}
68
69template< typename Matrix >
70void
71forElements( const int matrixSize, Matrix& matrix )
72{
73 matrix.setDimensions( matrixSize, matrixSize );
74
75 auto f = [ = ]
__cuda_callable__(
int rowIdx,
int localIdx,
int& columnIdx,
float& value )
mutable
76 {
77 value = rowIdx + columnIdx;
78 };
79 matrix.forElements( 0, matrixSize, f );
80}
81
82template< typename Device >
83void
84setupDenseMatrix()
85{
87 for( int matrixSize = 16; matrixSize <= 8192; matrixSize *= 2 ) {
90
94 for( int i = 0; i < testsCount; i++ ) {
96 setElement_on_host( matrixSize, matrix );
97 }
100
104 for( int i = 0; i < testsCount; i++ ) {
106 setElement_on_device( matrixSize, matrix );
107 }
110
112 std::cout <<
" setElement on host and transfer on GPU: ";
115 for( int i = 0; i < testsCount; i++ ) {
117 setElement_on_host_and_transfer( matrixSize, matrix );
118 }
121 }
122
126 for( int i = 0; i < testsCount; i++ ) {
128 getRow( matrixSize, matrix );
129 }
132
136 for( int i = 0; i < testsCount; i++ ) {
138 forElements( matrixSize, matrix );
139 }
142 }
143}
144
145int
146main( int argc, char* argv[] )
147{
149 setupDenseMatrix< TNL::Devices::Host >();
150
151#ifdef __CUDACC__
153 setupDenseMatrix< TNL::Devices::Cuda >();
154#endif
155}
Class for real time, CPU time and CPU cycles measuring.
Definition Timer.h:25
void reset()
Reset the CPU and real-time timers.
Definition Timer.hpp:22
double getRealTime() const
Returns the elapsed real time on this timer.
Definition Timer.hpp:50
void stop()
Stops (pauses) the CPU and the real-time timers, but does not set them to zero.
Definition Timer.hpp:32
void start()
Starts timer.
Definition Timer.hpp:42
Benchmark of sparse matrix setup
1#include <iostream>
2#include <TNL/Algorithms/parallelFor.h>
3#include <TNL/Containers/StaticArray.h>
4#include <TNL/Matrices/SparseMatrix.h>
5#include <TNL/Devices/Host.h>
6#include <TNL/Devices/Cuda.h>
7#include <TNL/Timer.h>
8
9const int testsCount = 5;
10
11template< typename Matrix >
12void
13STL_Map( const int gridSize, Matrix& matrix )
14{
15
16
17
18
19 const int matrixSize = gridSize * gridSize;
20 matrix.setDimensions( matrixSize, matrixSize );
22 for( int j = 0; j < gridSize; j++ )
23 for( int i = 0; i < gridSize; i++ ) {
24 const int rowIdx = j * gridSize + i;
25 if( i == 0 || j == 0 || i == gridSize - 1 || j == gridSize - 1 )
27 else {
33 }
34 }
35 matrix.setElements( map );
36}
37
38template< typename Matrix >
39void
40setElement_on_host( const int gridSize, Matrix& matrix )
41{
42
43
44
45
46
47 const int matrixSize = gridSize * gridSize;
49 matrix.setDimensions( matrixSize, matrixSize );
50 matrix.setRowCapacities( rowCapacities );
51
52 for( int j = 0; j < gridSize; j++ )
53 for( int i = 0; i < gridSize; i++ ) {
54 const int rowIdx = j * gridSize + i;
55 if( i == 0 || j == 0 || i == gridSize - 1 || j == gridSize - 1 )
56 matrix.setElement( rowIdx, rowIdx, 1.0 );
57 else {
58 matrix.setElement( rowIdx, rowIdx - gridSize, 1.0 );
59 matrix.setElement( rowIdx, rowIdx - 1, 1.0 );
60 matrix.setElement( rowIdx, rowIdx, -4.0 );
61 matrix.setElement( rowIdx, rowIdx + 1, 1.0 );
62 matrix.setElement( rowIdx, rowIdx + gridSize, 1.0 );
63 }
64 }
65}
66
67template< typename Matrix >
68void
69setElement_on_host_and_transfer( const int gridSize, Matrix& matrix )
70{
71 using RealType = typename Matrix::RealType;
72 using HostMatrix = typename Matrix::template Self< RealType, TNL::Devices::Host >;
73
74 const int matrixSize = gridSize * gridSize;
76 HostMatrix hostMatrix( matrixSize, matrixSize );
77 hostMatrix.setRowCapacities( rowCapacities );
78
79 for( int j = 0; j < gridSize; j++ )
80 for( int i = 0; i < gridSize; i++ ) {
81 const int rowIdx = j * gridSize + i;
82 if( i == 0 || j == 0 || i == gridSize - 1 || j == gridSize - 1 )
83 hostMatrix.setElement( rowIdx, rowIdx, 1.0 );
84 else {
85 hostMatrix.setElement( rowIdx, rowIdx - gridSize, 1.0 );
86 hostMatrix.setElement( rowIdx, rowIdx - 1, 1.0 );
87 hostMatrix.setElement( rowIdx, rowIdx, -4.0 );
88 hostMatrix.setElement( rowIdx, rowIdx + 1, 1.0 );
89 hostMatrix.setElement( rowIdx, rowIdx + gridSize, 1.0 );
90 }
91 }
92 matrix = hostMatrix;
93}
94
95template< typename Matrix >
96void
97setElement_on_device( const int gridSize, Matrix& matrix )
98{
99
100
101
102
103
104 const int matrixSize = gridSize * gridSize;
106 matrix.setDimensions( matrixSize, matrixSize );
107 matrix.setRowCapacities( rowCapacities );
108
109 auto matrixView = matrix.
getView();
111 {
112 const int rowIdx = i[ 1 ] * gridSize + i[ 0 ];
113 if( i[ 0 ] == 0 || i[ 1 ] == 0 || i[ 0 ] == gridSize - 1 || i[ 1 ] == gridSize - 1 )
115 else {
116 matrixView.
setElement( rowIdx, rowIdx - gridSize, 1.0 );
117 matrixView.
setElement( rowIdx, rowIdx - 1, 1.0 );
118 matrixView.
setElement( rowIdx, rowIdx, -4.0 );
119 matrixView.
setElement( rowIdx, rowIdx + 1, 1.0 );
120 matrixView.
setElement( rowIdx, rowIdx + gridSize, 1.0 );
121 }
122 };
126}
127
128template< typename Matrix >
129void
130getRow( const int gridSize, Matrix& matrix )
131{
132
133
134
135
136 const int matrixSize = gridSize * gridSize;
138 matrix.setDimensions( matrixSize, matrixSize );
139 matrix.setRowCapacities( rowCapacities );
140
141 auto matrixView = matrix.
getView();
143 {
144 const int i = rowIdx % gridSize;
145 const int j = rowIdx / gridSize;
146 auto row = matrixView.
getRow( rowIdx );
147 if( i == 0 || j == 0 || i == gridSize - 1 || j == gridSize - 1 )
148 row.setElement( 2, rowIdx, 1.0 );
149 else {
150 row.setElement( 0, rowIdx - gridSize, 1.0 );
151 row.setElement( 1, rowIdx - 1, 1.0 );
152 row.setElement( 2, rowIdx, -4.0 );
153 row.setElement( 3, rowIdx + 1, 1.0 );
154 row.setElement( 4, rowIdx + gridSize, 1.0 );
155 }
156 };
158}
159
160template< typename Matrix >
161void
162forElements( const int gridSize, Matrix& matrix )
163{
164
165
166
167
168
169 const int matrixSize = gridSize * gridSize;
171 matrix.setDimensions( matrixSize, matrixSize );
172 matrix.setRowCapacities( rowCapacities );
173
174 auto f = [ = ]
__cuda_callable__(
int rowIdx,
int localIdx,
int& columnIdx,
float& value )
mutable
175 {
176 const int i = rowIdx % gridSize;
177 const int j = rowIdx / gridSize;
178 if( ( i == 0 || j == 0 || i == gridSize - 1 || j == gridSize - 1 ) && localIdx == 0 ) {
179 columnIdx = rowIdx;
180 value = 1.0;
181 }
182 else {
183 switch( localIdx ) {
184 case 0:
185 columnIdx = rowIdx - gridSize;
186 value = 1.0;
187 break;
188 case 1:
189 columnIdx = rowIdx - 1;
190 value = 1.0;
191 break;
192 case 2:
193 columnIdx = rowIdx;
194 value = -4.0;
195 break;
196 case 3:
197 columnIdx = rowIdx + 1;
198 value = 1.0;
199 break;
200 case 4:
201 columnIdx = rowIdx + gridSize;
202 value = 1.0;
203 break;
204 }
205 }
206 };
207 matrix.forElements( 0, matrixSize, f );
208}
209
210template< typename Device >
211void
212laplaceOperatorSparseMatrix()
213{
215 for( int gridSize = 16; gridSize <= 8192; gridSize *= 2 ) {
218
222 for( int i = 0; i < testsCount; i++ ) {
224 STL_Map( gridSize, matrix );
225 }
228
232 for( int i = 0; i < testsCount; i++ ) {
234 setElement_on_host( gridSize, matrix );
235 }
238
240 std::cout <<
" setElement on host and transfer on GPU: ";
243 for( int i = 0; i < testsCount; i++ ) {
245 setElement_on_host_and_transfer( gridSize, matrix );
246 }
249 }
250
254 for( int i = 0; i < testsCount; i++ ) {
256 setElement_on_device( gridSize, matrix );
257 }
260
264 for( int i = 0; i < testsCount; i++ ) {
266 getRow( gridSize, matrix );
267 }
270
274 for( int i = 0; i < testsCount; i++ ) {
276 forElements( gridSize, matrix );
277 }
280 }
281}
282
283int
284main( int argc, char* argv[] )
285{
287 laplaceOperatorSparseMatrix< TNL::Devices::Host >();
288
289#ifdef __CUDACC__
291 laplaceOperatorSparseMatrix< TNL::Devices::Cuda >();
292#endif
293}
Benchmark of multidiagonal matrix setup
1#include <iostream>
2#include <TNL/Algorithms/parallelFor.h>
3#include <TNL/Containers/StaticArray.h>
4#include <TNL/Matrices/MultidiagonalMatrix.h>
5#include <TNL/Devices/Host.h>
6#include <TNL/Devices/Cuda.h>
7#include <TNL/Timer.h>
8
9const int testsCount = 5;
10
11template< typename Device >
13getOffsets( const int gridSize )
14{
21 return offsets;
22}
23
24template< typename Matrix >
25void
26setElement_on_host( const int gridSize, Matrix& matrix )
27{
28
29
30
31
32
33 const int matrixSize = gridSize * gridSize;
34 matrix.setDimensions( matrixSize, matrixSize, getOffsets< typename Matrix::DeviceType >( gridSize ) );
35
36 for( int j = 0; j < gridSize; j++ )
37 for( int i = 0; i < gridSize; i++ ) {
38 const int rowIdx = j * gridSize + i;
39 if( i == 0 || j == 0 || i == gridSize - 1 || j == gridSize - 1 )
40 matrix.setElement( rowIdx, rowIdx, 1.0 );
41 else {
42 matrix.setElement( rowIdx, rowIdx - gridSize, 1.0 );
43 matrix.setElement( rowIdx, rowIdx - 1, 1.0 );
44 matrix.setElement( rowIdx, rowIdx, -4.0 );
45 matrix.setElement( rowIdx, rowIdx + 1, 1.0 );
46 matrix.setElement( rowIdx, rowIdx + gridSize, 1.0 );
47 }
48 }
49}
50
51template< typename Matrix >
52void
53setElement_on_host_and_transfer( const int gridSize, Matrix& matrix )
54{
55 using RealType = typename Matrix::RealType;
56 using IndexType = typename Matrix::IndexType;
58 const int matrixSize = gridSize * gridSize;
59 HostMatrix hostMatrix( matrixSize, matrixSize, getOffsets< typename Matrix::DeviceType >( gridSize ) );
60
61 for( int j = 0; j < gridSize; j++ )
62 for( int i = 0; i < gridSize; i++ ) {
63 const int rowIdx = j * gridSize + i;
64 if( i == 0 || j == 0 || i == gridSize - 1 || j == gridSize - 1 )
65 hostMatrix.setElement( rowIdx, rowIdx, 1.0 );
66 else {
67 hostMatrix.setElement( rowIdx, rowIdx - gridSize, 1.0 );
68 hostMatrix.setElement( rowIdx, rowIdx - 1, 1.0 );
69 hostMatrix.setElement( rowIdx, rowIdx, -4.0 );
70 hostMatrix.setElement( rowIdx, rowIdx + 1, 1.0 );
71 hostMatrix.setElement( rowIdx, rowIdx + gridSize, 1.0 );
72 }
73 }
74 matrix = hostMatrix;
75}
76
77template< typename Matrix >
78void
79setElement_on_device( const int gridSize, Matrix& matrix )
80{
81
82
83
84
85
86 const int matrixSize = gridSize * gridSize;
87 matrix.setDimensions( matrixSize, matrixSize, getOffsets< typename Matrix::DeviceType >( gridSize ) );
88
89 auto matrixView = matrix.
getView();
91 {
92 const int rowIdx = i[ 1 ] * gridSize + i[ 0 ];
93 if( i[ 0 ] == 0 || i[ 1 ] == 0 || i[ 0 ] == gridSize - 1 || i[ 1 ] == gridSize - 1 )
95 else {
96 matrixView.
setElement( rowIdx, rowIdx - gridSize, 1.0 );
97 matrixView.
setElement( rowIdx, rowIdx - 1, 1.0 );
99 matrixView.
setElement( rowIdx, rowIdx + 1, 1.0 );
100 matrixView.
setElement( rowIdx, rowIdx + gridSize, 1.0 );
101 }
102 };
106}
107
108template< typename Matrix >
109void
110getRow( const int gridSize, Matrix& matrix )
111{
112
113
114
115
116 const int matrixSize = gridSize * gridSize;
117 matrix.setDimensions( matrixSize, matrixSize, getOffsets< typename Matrix::DeviceType >( gridSize ) );
118
119 auto matrixView = matrix.
getView();
121 {
122 const int i = rowIdx % gridSize;
123 const int j = rowIdx / gridSize;
124 auto row = matrixView.
getRow( rowIdx );
125 if( i == 0 || j == 0 || i == gridSize - 1 || j == gridSize - 1 )
126 row.setElement( 2, 1.0 );
127 else {
128 row.setElement( 0, 1.0 );
129 row.setElement( 1, 1.0 );
130 row.setElement( 2, -4.0 );
131 row.setElement( 3, 1.0 );
132 row.setElement( 4, 1.0 );
133 }
134 };
136}
137
138template< typename Matrix >
139void
140forElements( const int gridSize, Matrix& matrix )
141{
142
143
144
145
146
147 const int matrixSize = gridSize * gridSize;
148 matrix.setDimensions( matrixSize, matrixSize, getOffsets< typename Matrix::DeviceType >( gridSize ) );
149
150 auto f = [ = ]
__cuda_callable__(
int rowIdx,
int localIdx,
int columnIdx,
float& value )
mutable
151 {
152 const int i = rowIdx % gridSize;
153 const int j = rowIdx / gridSize;
154 if( ( i == 0 || j == 0 || i == gridSize - 1 || j == gridSize - 1 ) && localIdx == 0 ) {
155 value = 1.0;
156 }
157 else {
158 if( localIdx == 2 )
159 value = -4.0;
160 else
161 value = 1.0;
162 }
163 };
164 matrix.forElements( 0, matrixSize, f );
165}
166
167template< typename Device >
168void
169laplaceOperatorMultidiagonalMatrix()
170{
172 for( int gridSize = 16; gridSize <= 8192; gridSize *= 2 ) {
175
179 for( int i = 0; i < testsCount; i++ ) {
181 setElement_on_host( gridSize, matrix );
182 }
185
187 std::cout <<
" setElement on host and transfer on GPU: ";
190 for( int i = 0; i < testsCount; i++ ) {
192 setElement_on_host_and_transfer( gridSize, matrix );
193 }
196 }
197
201 for( int i = 0; i < testsCount; i++ ) {
203 setElement_on_device( gridSize, matrix );
204 }
207
211 for( int i = 0; i < testsCount; i++ ) {
213 getRow( gridSize, matrix );
214 }
217
221 for( int i = 0; i < testsCount; i++ ) {
223 forElements( gridSize, matrix );
224 }
227 }
228}
229
230int
231main( int argc, char* argv[] )
232{
234 laplaceOperatorMultidiagonalMatrix< TNL::Devices::Host >();
235
236#ifdef __CUDACC__
238 laplaceOperatorMultidiagonalMatrix< TNL::Devices::Cuda >();
239#endif
240}