Template Numerical Library version\ main:94209208
Loading...
Searching...
No Matches
Multidimensional arrays

Introduction

Many algorithms in scientific computing work with coefficients indexed by three, four or even more indices and multidimensional arrays are a natural data structure for representing such values in the computer memory. Since the C++ language supports only one-dimensional arrays natively, multidimensional arrays have to be implemented explicitly (e.g., in a library) based on a mapping of multidimensional data to an internal one-dimensional array.

An interesting problem is how to choose the mapping from the multidimensional index space into the one-dimensional index space. Even for two-dimensional arrays (i.e., matrices) it can be argued whether they should be stored in the row-major format, where rows are stored as 1D arrays, or in the column-major format, where columns are stored as 1D arrays. The optimal choice depends on the operations that we want to do with the data, as well as on the hardware architecture that will process the data. For example, the row-major format is suitable for algorithms processing a matrix row by row on the CPU, while for GPU algorithms also processing a matrix row by row the column-major format is more appropriate. For three- and more-dimensional arrays there are even more combinations of possible array orderings.

For these reasons, we developed several data structures which allow to configure the indexing of multidimensional data and thus optimize the data structure for given algorithm and hardware architecture. This chapter walks through the options available in TNL.

Dynamic N-dimensional array

Dynamic N-dimensional arrays are objects of the TNL::Containers::NDArray class. It has several template parameters, four of them are the most important:

  • Value specifies the type of values stored in the array
  • SizesHolder specifies the dimension and static sizes of the array
  • Permutation specifies the layout of the array in memory
  • Device specifies the device that will be used for running operations on the array

The sizes of the array are specified using TNL::Containers::SizesHolder as follows. The first template parameter specifies the IndexType used for indexing (which will be used even by NDArray) and the following sequence of integers specifies the static sizes of the array along the individual axes. For a dynamic array, there must be at least one axis with zero static size.

For example, here we specify all axes as dynamic:

Holds static and dynamic sizes of an N-dimensional array.
Definition SizesHolder.h:30

The layout of the array in memory can be configured using a permutation specified using std::index_sequence. By default, the NDArray class uses an identity permutation. Users can use this, for example, to switch between row-major (std::index_sequence<0, 1>) and column-major (std::index_sequence<1, 0>) layout of a 2D array, or to specify an arbitrary layout for a higher-dimensional array.

When all template parameters are known, the NDArray can be instantiated:

using RowMajorArray = NDArray< int, // Value
SizesHolder< int, 0, 0 >, // SizesHolder
TNL::Devices::Host >; // Device
RowMajorArray a;

Then, the dynamic sizes can be set by calling setSizes. Here we set the sizes to create a \( 3 \times 4 \) array, where the first size corresponds to the 0-axis and the second size to the 1-axis:

Elements of the array can be accessed using the operator():

int value = 0;
for( int i = 0; i < 3; i++ )
for( int j = 0; j < 4; j++ )
a( i, j ) = value++;

Note that regardless of the permutation, the order of indices in the operator() logically corresponds to the order of sizes in the setSizes call (and in the SizesHolder<...> specification). This allows the programmer to change the layout of the array simply by setting a different permutation in the template parameters without having to rewrite the order of indices everywhere the array is used.

To examine the layout, we can print the underlying 1D array:

Output:

a = [ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11 ]

Notice that the identity permutation works such that indices along the last axis change the fastest and indices along the first axis change the slowest.

In the following example, we show the effect of three different permutations on a \( 3 \times 3 \times 3 \) array.

1#include <TNL/Containers/NDArray.h>
2
3using namespace TNL::Containers;
4
5template< typename Permutation >
6using Array3D = NDArray< int, // Value
8 Permutation,
10
11int
12main()
13{
14 // create 3 arrays with different permutations
15 Array3D< std::index_sequence< 0, 1, 2 > > a; // first index changes the slowest, last index changes the fastest
16 Array3D< std::index_sequence< 2, 0, 1 > > b; // last index changes the slowest, middle index changes the fastest
17 Array3D< std::index_sequence< 2, 1, 0 > > c; // last index changes the slowest, first index changes the fastest
18
19 // allocate the storage
20 a.setSizes( 3, 3, 3 );
21 b.setSizes( 3, 3, 3 );
22 c.setSizes( 3, 3, 3 );
23
24 // initialize values
25 int value = 0;
26 for( int i = 0; i < 3; i++ )
27 for( int j = 0; j < 3; j++ )
28 for( int k = 0; k < 3; k++ ) {
29 a( i, j, k ) = value;
30 b( i, j, k ) = value;
31 c( i, j, k ) = value;
32 value++;
33 }
34
35 // print the underlying 1D arrays
36 std::cout << "a = " << a.getStorageArray() << std::endl;
37 std::cout << "b = " << b.getStorageArray() << std::endl;
38 std::cout << "c = " << c.getStorageArray() << std::endl;
39}
Dynamic N-dimensional array.
Definition NDArray.h:569
Definition Host.h:19
T endl(T... args)
Namespace for TNL containers.
Definition Array.h:17

Output:

a = [ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26 ]
b = [ 0, 3, 6, 9, 12, 15, 18, 21, 24, 1, 4, 7, 10, 13, 16, 19, 22, 25, 2, 5, 8, 11, 14, 17, 20, 23, 26 ]
c = [ 0, 9, 18, 3, 12, 21, 6, 15, 24, 1, 10, 19, 4, 13, 22, 7, 16, 25, 2, 11, 20, 5, 14, 23, 8, 17, 26 ]

The TNL::Containers::NDArray class also has a corresponding view, TNL::Containers::NDArrayView, which has the same semantics as the TNL::Containers::ArrayView for the basic one-dimensional array.

See the reference documentation for the overview of all operations available for N-dimensional arrays.

Different permutations depending on the device

In practice, the optimal permutation for given application often depends on the hardware architecture used for the computations. For example, one permutation may be optimal for GPU computations and a different permutation may be optimal for CPU computations. Hence, it may be desirable to create a generic data structure which uses different permutations for different devices. This may be achieved, for example, by defining the following template alias:

template< typename Value,
typename SizesHolder,
typename HostPermutation,
typename CudaPermutation,
typename Device >
std::conditional_t< std::is_same< Device, TNL::Devices::Cuda >::value,
CudaPermutation,
HostPermutation >,
Device >;

Static N-dimensional array

Static N-dimensional arrays can be created using the TNL::Containers::StaticNDArray class. It has the same base class as TNL::Containers::NDArray, but it exposes different template parameters and uses TNL::Containers::StaticArray instead of TNL::Containers::Array for the underlying storage. Hence, there is no allocator (the array lives on the stack) and the device is always TNL::Devices::Sequential.

Static sizes are specified as positive integers in the sequence passed to TNL::Containers::SizesHolder. For example, to create a static row-major \( 3 \times 4 \) array:

using StaticRowMajorArray = StaticNDArray< int, // Value
SizesHolder< int, 3, 4 >, // SizesHolder
StaticRowMajorArray a;
Static N-dimensional array.
Definition NDArray.h:634

As it is a static array, the setSizes method is not called and the elements can be accessed using operator().

Note that static and dynamic sizes can be combined. For example, we may want to create a 2D array which has a constant size of 3 rows, but the number of columns is a priori unknown and has to be set at run-time. This can be achieved using TNL::Containers::NDArray as follows:

1#include <TNL/Containers/NDArray.h>
2
3using namespace TNL::Containers;
4
5int
6main()
7{
8 // compile-time constant number of rows
9 constexpr int num_rows = 3;
10
11 // dynamic (run-time specified) number of columns
12 int num_cols = 10;
13
14 using HalfStaticArray = NDArray< int, // Value
16 std::index_sequence< 0, 1 >, // Permutation
17 TNL::Devices::Host >; // Device
18 HalfStaticArray a;
19
20 // set array sizes: statically-sized axes must have 0
21 a.setSizes( 0, num_cols );
22
23 int value = 0;
24 for( int i = 0; i < num_rows; i++ )
25 for( int j = 0; j < num_cols; j++ )
26 a( i, j ) = value++;
27
28 std::cout << "a = " << a.getStorageArray() << std::endl;
29}

Notice that the SizesHolder<...> specification contains positive values for static sizes and zeros for dynamic sizes, whereas the setSizes(...) call has zeros for static sizes and positive values for dynamic sizes.

Output:

a = [ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29 ]

Sliced dynamic N-dimensional array

The TNL::Containers::SlicedNDArray allows to create sliced N-dimensional arrays. The idea of slicing is that one particular axis is not indexed contiguously, but after a given number of elements the indexing continues with the other dimensions. The slicing is specified using the TNL::Containers::SliceInfo class which is specified in the template parameters of SlicedNDArray after Permutation.

In the following example, we create a two-dimensional array with a static number of rows (3), dynamic number of columns (10), and slicing size of 4 along the 1-axis. Hence, the array is divided into 3 slices, each containing 4 columns (the allocated size of the array is aligned to 12 columns).

1#include <TNL/Containers/NDArray.h>
2
3using namespace TNL::Containers;
4
5int
6main()
7{
8 // compile-time constant number of rows
9 constexpr int num_rows = 3;
10
11 // dynamic (run-time specified) number of columns
12 int num_cols = 10;
13
14 // number of columns per slice
15 constexpr int slice_size = 4;
16
17 using SlicedArray = SlicedNDArray< int, // Value
19 std::index_sequence< 0, 1 >, // Permutation
20 SliceInfo< 1, slice_size >, // SliceInfo
21 TNL::Devices::Host >; // Device
22 SlicedArray a;
23
24 // set array sizes: statically-sized axes must have 0
25 a.setSizes( 0, num_cols );
26
27 int value = 0;
28 for( int i = 0; i < num_rows; i++ )
29 for( int j = 0; j < num_cols; j++ )
30 a( i, j ) = value++;
31
32 std::cout << "a = " << a.getStorageArray() << std::endl;
33}
Dynamic N-dimensional array with configurable slicing/tiling.
Definition NDArray.h:692
Describes slicing configuration for SlicedNDArray.
Definition NDArray.h:31

Output:

a = [ 0, 1, 2, 3, 10, 11, 12, 13, 20, 21, 22, 23, 4, 5, 6, 7, 14, 15, 16, 17, 24, 25, 26, 27, 8, 9, 0, 0, 18, 19, 0, 0, 28, 29, 0, 0 ]

Notice how the numbering goes to the second row (starting with the value 10) after numbering 4 elements of the first row, then to the third row (starting with the value 20) after numbering 4 elements of the second row, and back to the first row (starting with the value 4) after numbering 4 elements of the last row. Compare the output with the previous example using a non-sliced array with the same sizes.

Distributed N-dimensional array

DistributedNDArray extends NDArray in a similar way that DistributedArray extends Array. Because it is N-dimensional, there is more freedom in the way it can be decomposed. The following example creates a 2D array oriented in a column-major layout and decomposed along the vertical axis (y-axis) with one layer of overlaps (ghost regions). The elements in ghost regions get initialized to -1 and later they are synchronized using TNL::Containers::DistributedNDArraySynchronizer. Note that the synchronization is periodic by default (i.e., the first rank synchronizes values with the last rank).

#include <iostream>
#include <TNL/Containers/BlockPartitioning.h>
#include <TNL/Containers/DistributedNDArray.h>
#include <TNL/Containers/DistributedNDArraySynchronizer.h>
#include <TNL/MPI/ScopedInitializer.h>
// The following works for any device (Host, Cuda, etc.)
template< typename Device >
void
distributedNDArrayExample()
{
using namespace TNL::Containers;
using LocalArrayType = NDArray< int, // Value
Device, // Device
int, // Index
>;
using LocalRangeType = typename ArrayType::LocalRangeType;
// set input parameters
const TNL::MPI::Comm communicator = MPI_COMM_WORLD;
const int num_rows = 10; // number of rows
const int num_cols = 4; // number of columns
constexpr int distributedAxis = 0; // 0: num_rows gets distributed, 1: num_cols gets distributed
// decompose the range of rows
const LocalRangeType localRange =
// create the distributed array
ArrayType a;
a.setSizes( num_rows, num_cols );
a.template setDistribution< distributedAxis >( localRange.getBegin(), localRange.getEnd(), communicator );
a.allocate();
// do some work with the array
auto a_view = a.getLocalView();
a.forAll(
[ = ] __cuda_callable__( int gi, int j ) mutable
{
// convert global row index to local
const int i = gi - localRange.getBegin();
// write the global row index to the array using local coordinates
a_view( i, j ) = gi;
} );
a.forGhosts(
[ = ] __cuda_callable__( int gi, int j ) mutable
{
// convert global row index to local
const int i = gi - localRange.getBegin();
// use the local indices to set ghost elements to -1
a_view( i, j ) = -1;
} );
// output the local elements as a flat array
ArrayView flat_view( a_view.getData(), a_view.getStorageSize() );
std::cout << "Rank " << communicator.rank() << " before synchronization: " << flat_view << std::endl;
// synchronize the ghost regions and output again
synchronizer.setSynchronizationPattern( NDArraySyncPatterns::D1Q3 );
synchronizer.synchronize( a );
std::cout << "Rank " << communicator.rank() << " after synchronization: " << flat_view << std::endl;
}
int
main( int argc, char* argv[] )
{
TNL::MPI::ScopedInitializer mpi( argc, argv );
if( TNL::MPI::GetRank() == 0 )
std::cout << "The first test runs on CPU ..." << std::endl;
distributedNDArrayExample< TNL::Devices::Host >();
#ifdef __CUDACC__
TNL::MPI::Barrier();
if( TNL::MPI::GetRank() == 0 )
std::cout << "The second test runs on GPU ..." << std::endl;
distributedNDArrayExample< TNL::Devices::Cuda >();
#endif
}
#define __cuda_callable__
Definition Macros.h:49
ArrayView is a simple data structure which provides a non-owning encapsulation of array data....
Definition ArrayView.h:55
Synchronizer for DistributedNDArray.
Definition DistributedNDArraySynchronizer.h:28
void setSynchronizationPattern(const std::array< SyncDirection, Q > &pattern)
Set the communication pattern between neighbors during data synchronization.
Definition DistributedNDArraySynchronizer.h:90
void synchronize(DistributedNDArray &array, SyncDirection mask=SyncDirection::All)
Synchronizes data in array distributed among MPI ranks.
Definition DistributedNDArraySynchronizer.h:199
Distributed N-dimensional array.
Definition DistributedNDArray.h:26
Holds static sizes of an N-dimensional array.
Definition StaticSizesHolder.h:27
An RAII wrapper for custom MPI communicators.
Definition Comm.h:63
int rank() const
Determines the rank of the calling process in the communicator.
Definition Comm.h:216
Subrange< Index > splitRange(Index rangeBegin, Index rangeEnd, int rank, int num_subintervals)
A helper function which splits a one-dimensional range.
Definition BlockPartitioning.h:27
Definition ScopedInitializer.h:63

Possible output:

Rank 0: rank on node is 0, using GPU id 0 of 1
Environment:
CUDA_VISIBLE_DEVICES=
Rank 2: rank on node is 2, using GPU id 0 of 1
Environment:
CUDA_VISIBLE_DEVICES=
Rank 3: rank on node is 3, using GPU id 0 of 1
Environment:
CUDA_VISIBLE_DEVICES=
Rank 1: rank on node is 1, using GPU id 0 of 1
Environment:
CUDA_VISIBLE_DEVICES=
Rank 2 before synchronization: [ -1, 6, 7, -1, -1, 6, 7, -1, -1, 6, 7, -1, -1, 6, 7, -1 ]
Rank 3 before synchronization: [ -1, 8, 9, -1, -1, 8, 9, -1, -1, 8, 9, -1, -1, 8, 9, -1 ]
Rank 1 before synchronization: [ -1, 3, 4, 5, -1, -1, 3, 4, 5, -1, -1, 3, 4, 5, -1, -1, 3, 4, 5, -1 ]
Rank 2 after synchronization: [ 5, 6, 7, 8, 5, 6, 7, 8, 5, 6, 7, 8, 5, 6, 7, 8 ]
The first test runs on CPU ...
Rank 0 before synchronization: [ -1, 0, 1, 2, -1, -1, 0, 1, 2, -1, -1, 0, 1, 2, -1, -1, 0, 1, 2, -1 ]
Rank 0 after synchronization: [ 9, 0, 1, 2, 3, 9, 0, 1, 2, 3, 9, 0, 1, 2, 3, 9, 0, 1, 2, 3 ]
Rank 3 after synchronization: [ 7, 8, 9, 0, 7, 8, 9, 0, 7, 8, 9, 0, 7, 8, 9, 0 ]
Rank 1 after synchronization: [ 2, 3, 4, 5, 6, 2, 3, 4, 5, 6, 2, 3, 4, 5, 6, 2, 3, 4, 5, 6 ]
The second test runs on GPU ...
Rank 3 before synchronization: [ -1, 8, 9, -1, -1, 8, 9, -1, -1, 8, 9, -1, -1, 8, 9, -1 ]
Rank 2 before synchronization: [ -1, 6, 7, -1, -1, 6, 7, -1, -1, 6, 7, -1, -1, 6, 7, -1 ]
Rank 1 before synchronization: [ -1, 3, 4, 5, -1, -1, 3, 4, 5, -1, -1, 3, 4, 5, -1, -1, 3, 4, 5, -1 ]
Rank 0 before synchronization: [ -1, 0, 1, 2, -1, -1, 0, 1, 2, -1, -1, 0, 1, 2, -1, -1, 0, 1, 2, -1 ]
Rank 3 after synchronization: [ 7, 8, 9, 0, 7, 8, 9, 0, 7, 8, 9, 0, 7, 8, 9, 0 ]
Rank 1 after synchronization: [ 2, 3, 4, 5, 6, 2, 3, 4, 5, 6, 2, 3, 4, 5, 6, 2, 3, 4, 5, 6 ]
Rank 2 after synchronization: [ 5, 6, 7, 8, 5, 6, 7, 8, 5, 6, 7, 8, 5, 6, 7, 8 ]
Rank 0 after synchronization: [ 9, 0, 1, 2, 3, 9, 0, 1, 2, 3, 9, 0, 1, 2, 3, 9, 0, 1, 2, 3 ]