Introduction
Dynamic N-dimensional array
- Different permutations depending on the device
Static N-dimensional array
Sliced dynamic N-dimensional array
Distributed N-dimensional array

Introduction

Many algorithms in scientific computing work with coefficients indexed by three, four or even more indices and multidimensional arrays are a natural data structure for representing such values in the computer memory. Since the C++ language supports only one-dimensional arrays natively, multidimensional arrays have to be implemented explicitly (e.g., in a library) based on a mapping of multidimensional data to an internal one-dimensional array.

An interesting problem is how to choose the mapping from the multidimensional index space into the one-dimensional index space. Even for two-dimensional arrays (i.e., matrices) it can be argued whether they should be stored in the row-major format, where rows are stored as 1D arrays, or in the column-major format, where columns are stored as 1D arrays. The optimal choice depends on the operations that we want to do with the data, as well as on the hardware architecture that will process the data. For example, the row-major format is suitable for algorithms processing a matrix row by row on the CPU, while for GPU algorithms also processing a matrix row by row the column-major format is more appropriate. For three- and more-dimensional arrays there are even more combinations of possible array orderings.

For these reasons, we developed several data structures which allow to configure the indexing of multidimensional data and thus optimize the data structure for given algorithm and hardware architecture. This chapter walks through the options available in TNL.

Dynamic N-dimensional array

Dynamic N-dimensional arrays are objects of the TNL::Containers::NDArray class. It has several template parameters, four of them are the most important:

Value specifies the type of values stored in the array
SizesHolder specifies the dimension and static sizes of the array
Permutation specifies the layout of the array in memory
Device specifies the device that will be used for running operations on the array

The sizes of the array are specified using TNL::Containers::SizesHolder as follows. The first template parameter specifies the IndexType used for indexing (which will be used even by NDArray) and the following sequence of integers specifies the static sizes of the array along the individual axes. For a dynamic array, there must be at least one axis with zero static size.

For example, here we specify all axes as dynamic:

using My2DSizes = TNL::Containers::SizesHolder< int, 0, 0 >,

using My3DSizes = TNL::Containers::SizesHolder< int, 0, 0, 0 >,

TNL::Containers::SizesHolder

Holds static and dynamic sizes of an N-dimensional array.

Definition SizesHolder.h:30

The layout of the array in memory can be configured using a permutation specified using std::index_sequence. By default, the NDArray class uses an identity permutation. Users can use this, for example, to switch between row-major (std::index_sequence<0, 1>) and column-major (std::index_sequence<1, 0>) layout of a 2D array, or to specify an arbitrary layout for a higher-dimensional array.

When all template parameters are known, the NDArray can be instantiated:

   using RowMajorArray = NDArray< int,                          // Value
                                  SizesHolder< int, 0, 0 >,     // SizesHolder
                                  std::index_sequence< 0, 1 >,  // Permutation
                                  TNL::Devices::Host >;         // Device
   RowMajorArray a;

Then, the dynamic sizes can be set by calling setSizes. Here we set the sizes to create a \( 3 \times 4 \) array, where the first size corresponds to the 0-axis and the second size to the 1-axis:

a.setSizes( 3, 4 );

Elements of the array can be accessed using the operator():

int value = 0;

for( int i = 0; i < 3; i++ )

for( int j = 0; j < 4; j++ )

a( i, j ) = value++;

Note that regardless of the permutation, the order of indices in the operator() logically corresponds to the order of sizes in the setSizes call (and in the SizesHolder<...> specification). This allows the programmer to change the layout of the array simply by setting a different permutation in the template parameters without having to rewrite the order of indices everywhere the array is used.

To examine the layout, we can print the underlying 1D array:

std::cout << "a = " << a.getStorageArray() << std::endl;

Output:

a = [ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11 ]

Notice that the identity permutation works such that indices along the last axis change the fastest and indices along the first axis change the slowest.

In the following example, we show the effect of three different permutations on a \( 3 \times 3 \times 3 \) array.

#include <TNL/Containers/NDArray.h>
 
using namespace TNL::Containers;
 
template< typename Permutation >
using Array3D = NDArray< int,  // Value
                         SizesHolder< int, 0, 0, 0 >,
                         Permutation,
                         TNL::Devices::Host >;
 
int
main()
{
   // create 3 arrays with different permutations
   Array3D< std::index_sequence< 0, 1, 2 > > a;  // first index changes the slowest, last index changes the fastest
   Array3D< std::index_sequence< 2, 0, 1 > > b;  // last index changes the slowest, middle index changes the fastest
   Array3D< std::index_sequence< 2, 1, 0 > > c;  // last index changes the slowest, first index changes the fastest
 
   // allocate the storage
   a.setSizes( 3, 3, 3 );
   b.setSizes( 3, 3, 3 );
   c.setSizes( 3, 3, 3 );
 
   // initialize values
   int value = 0;
   for( int i = 0; i < 3; i++ )
      for( int j = 0; j < 3; j++ )
         for( int k = 0; k < 3; k++ ) {
            a( i, j, k ) = value;
            b( i, j, k ) = value;
            c( i, j, k ) = value;
            value++;
         }
 
   // print the underlying 1D arrays
   std::cout << "a = " << a.getStorageArray() << std::endl;
   std::cout << "b = " << b.getStorageArray() << std::endl;
   std::cout << "c = " << c.getStorageArray() << std::endl;
}

Output:

a = [ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26 ]

b = [ 0, 3, 6, 9, 12, 15, 18, 21, 24, 1, 4, 7, 10, 13, 16, 19, 22, 25, 2, 5, 8, 11, 14, 17, 20, 23, 26 ]

c = [ 0, 9, 18, 3, 12, 21, 6, 15, 24, 1, 10, 19, 4, 13, 22, 7, 16, 25, 2, 11, 20, 5, 14, 23, 8, 17, 26 ]

The TNL::Containers::NDArray class also has a corresponding view, TNL::Containers::NDArrayView, which has the same semantics as the TNL::Containers::ArrayView for the basic one-dimensional array.

See the reference documentation for the overview of all operations available for N-dimensional arrays.

Different permutations depending on the device

In practice, the optimal permutation for given application often depends on the hardware architecture used for the computations. For example, one permutation may be optimal for GPU computations and a different permutation may be optimal for CPU computations. Hence, it may be desirable to create a generic data structure which uses different permutations for different devices. This may be achieved, for example, by defining the following template alias:

template< typename Value,
          typename SizesHolder,
          typename HostPermutation,
          typename CudaPermutation,
          typename Device >
using NDArray = TNL::Containers::NDArray< Value,
                                          SizesHolder,
                                          std::conditional_t< std::is_same< Device, TNL::Devices::Cuda >::value,
                                                              CudaPermutation,
                                                              HostPermutation >,
                                          Device >;

Static N-dimensional array

Static N-dimensional arrays can be created using the TNL::Containers::StaticNDArray class. It has the same base class as TNL::Containers::NDArray, but it exposes different template parameters and uses TNL::Containers::StaticArray instead of TNL::Containers::Array for the underlying storage. Hence, there is no allocator (the array lives on the stack) and the device is always TNL::Devices::Sequential.

Static sizes are specified as positive integers in the sequence passed to TNL::Containers::SizesHolder. For example, to create a static row-major \( 3 \times 4 \) array:

using StaticRowMajorArray = StaticNDArray< int, // Value

SizesHolder< int, 3, 4 >, // SizesHolder

std::index_sequence< 0, 1 >, // Permutation

StaticRowMajorArray a;

TNL::Containers::StaticNDArray

Static N-dimensional array.

Definition NDArray.h:634

std::index_sequence

As it is a static array, the setSizes method is not called and the elements can be accessed using operator().

Note that static and dynamic sizes can be combined. For example, we may want to create a 2D array which has a constant size of 3 rows, but the number of columns is a priori unknown and has to be set at run-time. This can be achieved using TNL::Containers::NDArray as follows:

#include <TNL/Containers/NDArray.h>
 
using namespace TNL::Containers;
 
int
main()
{
   // compile-time constant number of rows
   constexpr int num_rows = 3;
 
   // dynamic (run-time specified) number of columns
   int num_cols = 10;
 
   using HalfStaticArray = NDArray< int,                              // Value
                                    SizesHolder< int, num_rows, 0 >,  // SizesHolder
                                    std::index_sequence< 0, 1 >,      // Permutation
                                    TNL::Devices::Host >;             // Device
   HalfStaticArray a;
 
   // set array sizes: statically-sized axes must have 0
   a.setSizes( 0, num_cols );
 
   int value = 0;
   for( int i = 0; i < num_rows; i++ )
      for( int j = 0; j < num_cols; j++ )
         a( i, j ) = value++;
 
   std::cout << "a = " << a.getStorageArray() << std::endl;
}

Notice that the SizesHolder<...> specification contains positive values for static sizes and zeros for dynamic sizes, whereas the setSizes(...) call has zeros for static sizes and positive values for dynamic sizes.

Output:

a = [ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29 ]

Sliced dynamic N-dimensional array

The TNL::Containers::SlicedNDArray allows to create sliced N-dimensional arrays. The idea of slicing is that one particular axis is not indexed contiguously, but after a given number of elements the indexing continues with the other dimensions. The slicing is specified using the TNL::Containers::SliceInfo class which is specified in the template parameters of SlicedNDArray after Permutation.

In the following example, we create a two-dimensional array with a static number of rows (3), dynamic number of columns (10), and slicing size of 4 along the 1-axis. Hence, the array is divided into 3 slices, each containing 4 columns (the allocated size of the array is aligned to 12 columns).

#include <TNL/Containers/NDArray.h>
 
using namespace TNL::Containers;
 
int
main()
{
   // compile-time constant number of rows
   constexpr int num_rows = 3;
 
   // dynamic (run-time specified) number of columns
   int num_cols = 10;
 
   // number of columns per slice
   constexpr int slice_size = 4;
 
   using SlicedArray = SlicedNDArray< int,                              // Value
                                      SizesHolder< int, num_rows, 0 >,  // SizesHolder
                                      std::index_sequence< 0, 1 >,      // Permutation
                                      SliceInfo< 1, slice_size >,       // SliceInfo
                                      TNL::Devices::Host >;             // Device
   SlicedArray a;
 
   // set array sizes: statically-sized axes must have 0
   a.setSizes( 0, num_cols );
 
   int value = 0;
   for( int i = 0; i < num_rows; i++ )
      for( int j = 0; j < num_cols; j++ )
         a( i, j ) = value++;
 
   std::cout << "a = " << a.getStorageArray() << std::endl;
}

Output:

a = [ 0, 1, 2, 3, 10, 11, 12, 13, 20, 21, 22, 23, 4, 5, 6, 7, 14, 15, 16, 17, 24, 25, 26, 27, 8, 9, 0, 0, 18, 19, 0, 0, 28, 29, 0, 0 ]

Notice how the numbering goes to the second row (starting with the value 10) after numbering 4 elements of the first row, then to the third row (starting with the value 20) after numbering 4 elements of the second row, and back to the first row (starting with the value 4) after numbering 4 elements of the last row. Compare the output with the previous example using a non-sliced array with the same sizes.

Distributed N-dimensional array

DistributedNDArray extends NDArray in a similar way that DistributedArray extends Array. Because it is N-dimensional, there is more freedom in the way it can be decomposed. The following example creates a 2D array oriented in a column-major layout and decomposed along the vertical axis (y-axis) with one layer of overlaps (ghost regions). The elements in ghost regions get initialized to -1 and later they are synchronized using TNL::Containers::DistributedNDArraySynchronizer. Note that the synchronization is periodic by default (i.e., the first rank synchronizes values with the last rank).

#include <iostream>
#include <TNL/Containers/BlockPartitioning.h>
#include <TNL/Containers/DistributedNDArray.h>
#include <TNL/Containers/DistributedNDArraySynchronizer.h>
#include <TNL/MPI/ScopedInitializer.h>
 
// The following works for any device (Host, Cuda, etc.)
template< typename Device >
void
distributedNDArrayExample()
{
   using namespace TNL::Containers;
   using LocalArrayType = NDArray< int,                            // Value
                                   SizesHolder< int, 0, 0 >,       // Sizes
                                   std::index_sequence< 1, 0 >,    // Permutation
                                   Device,                         // Device
                                   int,                            // Index
                                   StaticSizesHolder< int, 1, 0 >  // Overlaps
                                   >;
   using ArrayType = DistributedNDArray< LocalArrayType >;
   using LocalRangeType = typename ArrayType::LocalRangeType;
 
   // set input parameters
   const TNL::MPI::Comm communicator = MPI_COMM_WORLD;
   const int num_rows = 10;            // number of rows
   const int num_cols = 4;             // number of columns
   constexpr int distributedAxis = 0;  // 0: num_rows gets distributed, 1: num_cols gets distributed
 
   // decompose the range of rows
   const LocalRangeType localRange =
      TNL::Containers::splitRange< typename LocalArrayType::IndexType >( num_rows, communicator );
 
   // create the distributed array
   ArrayType a;
   a.setSizes( num_rows, num_cols );
   a.template setDistribution< distributedAxis >( localRange.getBegin(), localRange.getEnd(), communicator );
   a.allocate();
 
   // do some work with the array
   auto a_view = a.getLocalView();
   a.forAll(
      [ = ] __cuda_callable__( int gi, int j ) mutable
      {
         // convert global row index to local
         const int i = gi - localRange.getBegin();
         // write the global row index to the array using local coordinates
         a_view( i, j ) = gi;
      } );
   a.forGhosts(
      [ = ] __cuda_callable__( int gi, int j ) mutable
      {
         // convert global row index to local
         const int i = gi - localRange.getBegin();
         // use the local indices to set ghost elements to -1
         a_view( i, j ) = -1;
      } );
 
   // output the local elements as a flat array
   using ArrayView = TNL::Containers::ArrayView< int, Device, int >;
   ArrayView flat_view( a_view.getData(), a_view.getStorageSize() );
   std::cout << "Rank " << communicator.rank() << " before synchronization: " << flat_view << std::endl;
 
   // synchronize the ghost regions and output again
   DistributedNDArraySynchronizer< ArrayType > synchronizer;
   synchronizer.setSynchronizationPattern( NDArraySyncPatterns::D1Q3 );
   synchronizer.synchronize( a );
   std::cout << "Rank " << communicator.rank() << " after synchronization: " << flat_view << std::endl;
}
 
int
main( int argc, char* argv[] )
{
   TNL::MPI::ScopedInitializer mpi( argc, argv );
 
   if( TNL::MPI::GetRank() == 0 )
      std::cout << "The first test runs on CPU ..." << std::endl;
   distributedNDArrayExample< TNL::Devices::Host >();
 
#ifdef __CUDACC__
   TNL::MPI::Barrier();
 
   if( TNL::MPI::GetRank() == 0 )
      std::cout << "The second test runs on GPU ..." << std::endl;
   distributedNDArrayExample< TNL::Devices::Cuda >();
#endif
}

Possible output:

Rank 1: rank on node is 1, using GPU id 1 of 2
Environment:
  CUDA_VISIBLE_DEVICES=
Rank 2: rank on node is 2, using GPU id 0 of 2
Environment:
  CUDA_VISIBLE_DEVICES=
Rank 3: rank on node is 3, using GPU id 1 of 2
Environment:
  CUDA_VISIBLE_DEVICES=
Rank 0: rank on node is 0, using GPU id 0 of 2
Environment:
  CUDA_VISIBLE_DEVICES=
Rank 2 before synchronization: [ -1, 6, 7, -1, -1, 6, 7, -1, -1, 6, 7, -1, -1, 6, 7, -1 ]
The first test runs on CPU ...
Rank 0 before synchronization: [ -1, 0, 1, 2, -1, -1, 0, 1, 2, -1, -1, 0, 1, 2, -1, -1, 0, 1, 2, -1 ]
Rank 3 before synchronization: [ -1, 8, 9, -1, -1, 8, 9, -1, -1, 8, 9, -1, -1, 8, 9, -1 ]
Rank 3 after synchronization: [ 7, 8, 9, 0, 7, 8, 9, 0, 7, 8, 9, 0, 7, 8, 9, 0 ]
Rank 1 before synchronization: [ -1, 3, 4, 5, -1, -1, 3, 4, 5, -1, -1, 3, 4, 5, -1, -1, 3, 4, 5, -1 ]
Rank 1 after synchronization: [ 2, 3, 4, 5, 6, 2, 3, 4, 5, 6, 2, 3, 4, 5, 6, 2, 3, 4, 5, 6 ]
Rank 2 after synchronization: [ 5, 6, 7, 8, 5, 6, 7, 8, 5, 6, 7, 8, 5, 6, 7, 8 ]
Rank 0 after synchronization: [ 9, 0, 1, 2, 3, 9, 0, 1, 2, 3, 9, 0, 1, 2, 3, 9, 0, 1, 2, 3 ]
The second test runs on GPU ...
Rank 3 before synchronization: [ -1, 8, 9, -1, -1, 8, 9, -1, -1, 8, 9, -1, -1, 8, 9, -1 ]
Rank 1 before synchronization: [ -1, 3, 4, 5, -1, -1, 3, 4, 5, -1, -1, 3, 4, 5, -1, -1, 3, 4, 5, -1 ]
Rank 2 before synchronization: [ -1, 6, 7, -1, -1, 6, 7, -1, -1, 6, 7, -1, -1, 6, 7, -1 ]
Rank 2 after synchronization: [ 5, 6, 7, 8, 5, 6, 7, 8, 5, 6, 7, 8, 5, 6, 7, 8 ]
Rank 0 before synchronization: [ -1, 0, 1, 2, -1, -1, 0, 1, 2, -1, -1, 0, 1, 2, -1, -1, 0, 1, 2, -1 ]
Rank 3 after synchronization: [ 7, 8, 9, 0, 7, 8, 9, 0, 7, 8, 9, 0, 7, 8, 9, 0 ]
Rank 0 after synchronization: [ 9, 0, 1, 2, 3, 9, 0, 1, 2, 3, 9, 0, 1, 2, 3, 9, 0, 1, 2, 3 ]
Rank 1 after synchronization: [ 2, 3, 4, 5, 6, 2, 3, 4, 5, 6, 2, 3, 4, 5, 6, 2, 3, 4, 5, 6 ]

Table of Contents

Introduction

Dynamic N-dimensional array

Different permutations depending on the device

Static N-dimensional array

Sliced dynamic N-dimensional array

Distributed N-dimensional array