Template Numerical Library version\ main:481315e2
Loading...
Searching...
No Matches
Arrays

Introduction

This part introduces arrays in TNL. There are three types - common arrays with dynamic allocation, static arrays allocated on stack and distributed arrays with dynamic allocation. Arrays are one of the most important structures for memory management. Methods implemented in arrays are particularly useful for GPU programming. From this point of view, the reader will learn how to easily allocate memory on GPU, transfer data between GPU and CPU but also, how to initialize data allocated on GPU. In addition, the resulting code is hardware platform independent, so it can be ran on CPU nad GPU without any changes.

Dynamic arrays

Array is templated class defined in namespace TNL::Containers having three template parameters:

  • Value is type of data to be stored in the array
  • Device is the device where the array is allocated. Currently it can be either Devices::Host for CPU or Devices::Cuda for GPU supporting CUDA.
  • Index is the type to be used for indexing the array elements.

The following example shows how to allocate arrays on CPU and GPU and how to initialize the data.

#include <iostream>
#include <TNL/Containers/Array.h>
#include <list>
#include <vector>
using namespace TNL;
using namespace TNL::Containers;
int main( int argc, char* argv[] )
{
/****
* Create one array on host and one array on device.
*/
Array< int > host_array( 10 );
/***
* Initiate the host array with number three and assign it to the device array.
* NOTE: Of course, you may do directly 'device_array = 3' as well.
*/
host_array = 3;
device_array = host_array;
/****
* Print both arrays.
*/
std::cout << "host_array = " << host_array << std::endl;
std::cout << "device_array = " << device_array << std::endl;
/****
* There are few other ways how to initialize arrays...
*/
std::list< int > list { 1, 2, 3, 4, 5 };
std::vector< int > vector { 6, 7, 8, 9, 10 };
Array< int, Devices::Cuda > device_array_list( list );
Array< int, Devices::Cuda > device_array_vector( vector );
Array< int, Devices::Cuda > device_array_init_list { 11, 12, 13, 14, 15 };
/****
* ... and print them all
*/
std::cout << "device_array_list = " << device_array_list << std::endl;
std::cout << "device_array_vector = " << device_array_vector << std::endl;
std::cout << "device_array_init_list = " << device_array_init_list << std::endl;
}
Array is responsible for memory management, access to array elements, and general array operations.
Definition Array.h:66
T endl(T... args)
Namespace for TNL containers.
Definition Array.h:20
The main TNL namespace.
Definition AtomicOperations.h:12

The result looks as follows:

host_array = [ 3, 3, 3, 3, 3, 3, 3, 3, 3, 3 ]
device_array = [ 3, 3, 3, 3, 3, 3, 3, 3, 3, 3 ]
device_array_list = [ 1, 2, 3, 4, 5 ]
device_array_vector = [ 6, 7, 8, 9, 10 ]
device_array_init_list = [ 11, 12, 13, 14, 15 ]

Array views

Arrays cannot share data with each other or data allocated elsewhere. This can be achieved with the ArrayView structure which has similar semantics to Array, but it does not handle allocation and deallocation of the data. Hence, array view cannot be resized, but it can be used to wrap data allocated elsewhere (e.g. using an Array or an operator new) and to partition large arrays into subarrays. The process of wrapping external data with a view is called binding.

The following code snippet shows how to create an array view.

#include <iostream>
#include <TNL/Containers/Array.h>
#include <TNL/Containers/ArrayView.h>
using namespace TNL;
using namespace TNL::Containers;
int main( int argc, char* argv[] )
{
/****
* Create new array
*/
const int size = 5;
Array< float > a( size );
/****
* Bind an array view with it
*/
auto another_view = a.getView();
auto const_view = a.getConstView();
another_view = -5;
std::cout << " a = " << a << std::endl;
std::cout << " a_view = " << a_view << std::endl;
std::cout << " another_view = " << another_view << std::endl;
std::cout << " const_view = " << const_view << std::endl;
//const_view = 3; this would not compile
}
ArrayView is a simple data structure which provides a non-owning encapsulation of array data....
Definition ArrayView.h:57
__cuda_callable__ ConstViewType getConstView(IndexType begin=0, IndexType end=0) const
Returns a non-modifiable view of the array view.
Definition ArrayView.hpp:77
__cuda_callable__ ViewType getView(IndexType begin=0, IndexType end=0)
Returns a modifiable view of the array view.
Definition ArrayView.hpp:61

The output is:

a = [ -5, -5, -5, -5, -5 ]
a_view = [ -5, -5, -5, -5, -5 ]
another_view = [ -5, -5, -5, -5, -5 ]
const_view = [ -5, -5, -5, -5, -5 ]

You can also bind external data into array view:

#include <iostream>
#include <TNL/Containers/Array.h>
#include <TNL/Containers/ArrayView.h>
using namespace TNL;
using namespace TNL::Containers;
int main( int argc, char* argv[] )
{
/****
* Allocate your own data
*/
const int size = 5;
float* a = new float[ size ];
/****
* Wrap the data with an array view
*/
ArrayView< float > a_view( a, size );
a_view = -5;
std::cout << " a_view = " << a_view << std::endl;
for( int i = 0; i < size; i++ )
std::cout << a[ i ] << " ";
/****
* Free the allocated memory
*/
delete[] a;
}

Output:

a_view = [ -5, -5, -5, -5, -5 ]
-5 -5 -5 -5 -5

Since array views do not allocate or deallocate memory, they can be created even in CUDA kernels, which is not possible with Array. ArrayView can also be passed-by-value into CUDA kernels or captured-by-value by device lambda functions, because the ArrayView's copy-constructor makes only a shallow copy (i.e., it copies only the data pointer and size).

Accessing the array elements

There are two ways how to work with the array (or array view) elements - using the indexing operator (operator[]) which is more efficient or using methods setElement and getElement which is more flexible.

Accessing the array elements with operator[]

Indexing operator operator[] is implemented in both Array and ArrayView and it is defined as __cuda_callable__. It means that it can be called even in CUDA kernels if the data is allocated on GPU, i.e. the Device parameter is Devices::Cuda. This operator returns a reference to given array element and so it is very efficient. However, calling this operator from host for data allocated on device (or vice versa) leads to segmentation fault (on the host system) or broken state of the device. It means:

  • You may call the operator[] on the host only for data allocated on the host (with device Devices::Host).
  • You may call the operator[] on the device only for data allocated on the device (with device Devices::Cuda).

The following example shows use of operator[].

#include <iostream>
#include <TNL/Containers/Array.h>
#include <TNL/Containers/ArrayView.h>
using namespace TNL;
using namespace TNL::Containers;
__global__ void initKernel( ArrayView< float, Devices::Cuda > view )
{
int tid = threadIdx.x;
if( tid < view.getSize() )
view[ tid ] = -tid;
}
int main( int argc, char* argv[] )
{
/****
* Create new arrays on both host and device
*/
const int size = 5;
Array< float, Devices::Host > host_array( size );
Array< float, Devices::Cuda > device_array( size );
/****
* Initiate the host array
*/
for( int i = 0; i < size; i++ )
host_array[ i ] = i;
/****
* Prepare array view for the device array - we will pass it to a CUDA kernel.
* NOTE: Better way is to use ArrayView::evaluate or ParallelFor, this is just
* an example.
*/
auto device_view = device_array.getView();
/****
* Call CUDA kernel to initiate the array on the device
*/
initKernel<<< 1, size >>>( device_view );
/****
* Print the results
*/
std::cout << " host_array = " << host_array << std::endl;
std::cout << " device_array = " << device_array << std::endl;
}
__cuda_callable__ IndexType getSize() const
Returns the current size of the array view.
Definition ArrayView.hpp:172

Output:

host_array = [ 0, 1, 2, 3, 4 ]
device_array = [ 0, -1, -2, -3, -4 ]

In general in TNL, each method defined as __cuda_callable__ can be called from the CUDA kernels. The method ArrayView::getSize is another example. We also would like to point the reader to better ways of arrays initiation for example with method ArrayView::forElements or with ParallelFor.

Accessing the array elements with setElement and getElement

On the other hand, the methods setElement and getElement can be called from the host no matter where the array is allocated. In addition they can be called from kernels on device where the array is allocated. getElement returns copy of an element rather than a reference. Therefore it is slightly slower. If the array is on GPU and the methods are called from the host, the array element is copied from the device on the host (or vice versa) which is significantly slower. In the parts of code where the performance matters, these methods shall not be called from the host when the array is allocated on the device. In this way, their use is, however, easier compared to operator[] and they allow to write one simple code for both CPU and GPU. Both methods are good candidates for:

  • reading/writing of only few elements in the array
  • arrays initiation which is done only once and it is not time critical part of a code
  • debugging purposes

The following example shows the use of getElement and setElement:

#include <iostream>
#include <TNL/Containers/Array.h>
#include <TNL/Containers/ArrayView.h>
using namespace TNL;
using namespace TNL::Containers;
int main( int argc, char* argv[] )
{
/****
* Create new arrays on both host and device
*/
const int size = 5;
Array< float, Devices::Host > host_array( size );
Array< float, Devices::Cuda > device_array( size );
/****
* Initiate the arrays with setElement
*/
for( int i = 0; i < size; i++ )
{
host_array.setElement( i, i );
device_array.setElement( i, i );
}
/****
* Compare the arrays using getElement
*/
for( int i = 0; i < size; i++ )
if( host_array.getElement( i ) == device_array.getElement( i ) )
std::cout << "Elements at position " << i << " match." << std::endl;
/****
* Print the results
*/
std::cout << "host_array = " << host_array << std::endl;
std::cout << "device_array = " << device_array << std::endl;
}

Output:

Elements at position 0 match.
Elements at position 1 match.
Elements at position 2 match.
Elements at position 3 match.
Elements at position 4 match.
host_array = [ 0, 1, 2, 3, 4 ]
device_array = [ 0, 1, 2, 3, 4 ]

Arrays and parallel for

More efficient and still quite simple method for (not only) array elements initiation is with the use of C++ lambda functions and methods forElements and forAllElements. As an argument a lambda function is passed which is then applied for all elements. Optionally one may define only subinterval of element indexes where the lambda shall be applied. If the underlying array is allocated on GPU, the lambda function is called from CUDA kernel. This is why it is more efficient than use of setElement. On the other hand, one must be careful to use only __cuda_callable__ methods inside the lambda. The use of the methods forElements and forAllElements is demonstrated in the following example.

#include <iostream>
#include <TNL/Containers/Array.h>
#include <TNL/Containers/ArrayView.h>
using namespace TNL;
template< typename Device >
void forElementsExample()
{
/****
* Create new arrays
*/
const int size = 10;
Containers::Array< float, Device > a( size ), b( size );
b = 0;
/****
* Initiate the elements of array `a`
*/
a.forAllElements( [] __cuda_callable__ ( int i, float& value ) { value = i; } );
/****
* Initiate elements of array `b` with indexes 0-4 using `a_view`
*/
auto a_view = a.getView();
b.forElements( 0, 5, [=] __cuda_callable__ ( int i, float& value ) { value = a_view[ i ] + 4.0; } );
/****
* Print the results
*/
std::cout << " a = " << a << std::endl;
std::cout << " b = " << b << std::endl;
}
int main( int argc, char* argv[] )
{
std::cout << "Running example on the host system: " << std::endl;
forElementsExample< Devices::Host >();
#ifdef __CUDACC__
std::cout << "Running example on the CUDA device: " << std::endl;
forElementsExample< Devices::Cuda >();
#endif
}
#define __cuda_callable__
Definition CudaCallable.h:22
void forElements(IndexType begin, IndexType end, Function &&f)
Process the lambda function f for each array element in interval [ begin, end).
Definition ArrayView.hpp:279

Output:

Running example on the host system:
a = [ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 ]
b = [ 4, 5, 6, 7, 8, 0, 0, 0, 0, 0 ]
Running example on the CUDA device:
a = [ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 ]
b = [ 4, 5, 6, 7, 8, 0, 0, 0, 0, 0 ]

Arrays and flexible reduction

Arrays also offer simpler way to do the flexible parallel reduction. See the section about [the flexible parallel reduction](ug_ReductionAndScan) to understand how it works. Flexible reduction for arrays just simplifies access to the array elements. See the following example:

#include <TNL/Containers/Array.h>
#include <TNL/Algorithms/reduce.h>
using namespace TNL;
template< typename Device >
void reduceArrayExample()
{
/****
* Create new arrays
*/
const int size = 10;
/****
* Initiate the elements of array `a`
*/
a.forAllElements( [] __cuda_callable__ ( int i, float& value ) { value = i; } );
/****
* Sum all elements of array `a`
*/
float sum_total = Algorithms::reduce( a, TNL::Plus{} );
/****
* Sum last 5 elements of array `a`
*/
float sum_last_five = Algorithms::reduce( a.getConstView( 5, 10 ), TNL::Plus{} );
/****
* Print the results
*/
std::cout << " a = " << a << std::endl;
std::cout << " sum of all elements = " << sum_total << std::endl;
std::cout << " sum of last 5 elements = " << sum_last_five << std::endl;
}
int main( int argc, char* argv[] )
{
std::cout << "Running example on the host system: " << std::endl;
reduceArrayExample< Devices::Host >();
#ifdef __CUDACC__
std::cout << "Running example on the CUDA device: " << std::endl;
reduceArrayExample< Devices::Cuda >();
#endif
}
Function object implementing x + y.
Definition Functional.h:20

Output:

Running example on the host system:
a = [ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 ]
sum of all elements = 45
sum of last 5 elements = 35
Running example on the CUDA device:
a = [ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 ]
sum of all elements = 45
sum of last 5 elements = 35

Checking the array contents

The functions TNL::Algorithms::contains and TNL::Algorithms::containsOnlyValue serve for testing the contents of arrays, vectors or their views. contains returns true if there is at least one element in the array with given value. containsOnlyValue returns true only if all elements of the array are equal to the given value. The test can be restricted to a subinterval of array elements. See the following code snippet for usage example.

#include <iostream>
#include <TNL/Containers/Array.h>
#include <TNL/Algorithms/contains.h>
using namespace TNL;
using namespace TNL::Containers;
using namespace TNL::Algorithms;
int main( int argc, char* argv[] )
{
/****
* Create new arrays and initiate them
*/
const int size = 10;
Array< float, Devices::Cuda > a( size ), b( size );
a = 0;
b.forAllElements( [=] __cuda_callable__ ( int i, float& value ) { value = i; } );
/****
* Test the values stored in the arrays
*/
if( contains( a, 0.0 ) )
std::cout << "a contains 0" << std::endl;
if( contains( a, 1.0 ) )
std::cout << "a contains 1" << std::endl;
if( contains( b, 0.0 ) )
std::cout << "b contains 0" << std::endl;
if( contains( b, 1.0 ) )
std::cout << "b contains 1" << std::endl;
if( containsOnlyValue( a, 0.0 ) )
std::cout << "a contains only 0" << std::endl;
if( containsOnlyValue( a, 1.0 ) )
std::cout << "a contains only 1" << std::endl;
if( containsOnlyValue( b, 0.0 ) )
std::cout << "b contains only 0" << std::endl;
if( containsOnlyValue( b, 1.0 ) )
std::cout << "b contains only 1" << std::endl;
/****
* Change the first half of b and test it again
*/
b.forElements( 0, 5, [=] __cuda_callable__ ( int i, float& value ) { value = 0.0; } );
if( containsOnlyValue( b, 0.0, 0, 5 ) )
std::cout << "First five elements of b contains only 0" << std::endl;
}
Namespace for fundamental TNL algorithms.
Definition AtomicOperations.h:12
bool containsOnlyValue(const Array &array, typename Array::ValueType value, typename Array::IndexType begin=0, typename Array::IndexType end=0)
Checks if all elements of an array/vector/view have the given value.
Definition contains.h:64

Output:

a contains 0
b contains 0
b contains 1
a contains only 0
First five elements of b contains only 0

IO operations with arrays

Methods save and load serve for storing/restoring the array to/from a file in a binary form. In case of Array, loading of an array from a file causes data reallocation. ArrayView cannot do reallocation, therefore the data loaded from a file is copied to the memory managed by the ArrayView. The number of elements managed by the array view and those loaded from the file must be equal. See the following example.

#include <iostream>
#include <TNL/Containers/Array.h>
#include <TNL/Containers/ArrayView.h>
using namespace TNL;
using namespace TNL::Containers;
int main( int argc, char* argv[] )
{
/****
* Create new arrays
*/
const int size = 15;
Array< float > a( size ), b( 5 );
a = 1;
b = 2;
/****
* Create array view for the middle of array a
*/
auto a_view = a.getView( 5, 10 );
/****
* Save array b to file
*/
b.save( "b.tnl" );
/****
* Load data from b to a_view
*/
a_view.load( "b.tnl" );
/****
* Print the results
*/
std::cout << "a = " << a << std::endl;
std::cout << "a_view = " << a_view << std::endl;
std::cout << "b = " << b << std::endl;
}
void save(const std::string &fileName) const
Method for saving the data to a binary file fileName.
Definition ArrayView.hpp:324
void load(const std::string &fileName)
Method for loading the data from a binary file fileName.
Definition ArrayView.hpp:331

Output:

a = [ 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1 ]
a_view = [ 2, 2, 2, 2, 2 ]
b = [ 2, 2, 2, 2, 2 ]

Static arrays

Static arrays are allocated on stack and thus they can be created even in CUDA kernels. Their size is fixed and it is given by a template parameter. Static array is a templated class defined in namespace TNL::Containers having two template parameters:

  • Size is the array size.
  • Value is type of data stored in the array.

The interface of StaticArray is very smillar to Array but much simpler. It contains set of common constructors. Array elements can be accessed by the operator[] and also using method x(), y() and z() when it makes sense. See the following example for typical use of StaticArray.

#include <iostream>
#include <TNL/Containers/StaticArray.h>
#include <TNL/File.h>
using namespace TNL;
using namespace TNL::Containers;
int main( int argc, char* argv[] )
{
StaticArray< 3, int > a1, a2( 1, 2, 3 ), a3{ 4,3,2 };
a1 = 0.0;
std::cout << "a1 = " << a1 << std::endl;
std::cout << "a2 = " << a2 << std::endl;
std::cout << "a3 = " << a3 << std::endl;
File( "static-array-example-file.tnl", std::ios::out ) << a3;
File( "static-array-example-file.tnl", std::ios::in ) >> a1;
std::cout << "a1 = " << a1 << std::endl;
a1.sort();
std::cout << "Sorted a1 = " << a1 << std::endl;
}
Array with constant size.
Definition StaticArray.h:23
constexpr void sort()
Sorts the elements in this static array in ascending order.
Definition StaticArray.hpp:289
This class serves for binary IO. It allows to do IO even for data allocated on GPU together with on-t...
Definition File.h:28

The output looks as:

a1 = [ 0, 0, 0 ]
a2 = [ 1, 2, 3 ]
a3 = [ 4, 3, 2 ]
a1 = [ 4, 3, 2 ]
Sorted a1 = [ 2, 3, 4 ]

Distributed arrays

Distributed arrays are managed by the TNL::Containers::DistributedArray class. It is a wrapper around a local array, MPI communicator and global indexing information. When creating a distributed array, the global range must be partitioned into subranges (e.g. using TNL::Containers::Partitioner) and passed to the constructor or the setDistribution member function. For example:

using LocalRangeType = typename ArrayType::LocalRangeType;
const TNL::MPI::Comm communicator = MPI_COMM_WORLD;
const int size = 97;
const int ghosts = 0;
const LocalRangeType localRange = Partitioner::splitRange( size, communicator );
ArrayType a( localRange, ghosts, size, communicator );
Distributed array.
Definition DistributedArray.h:27
Definition Partitioner.h:22
An RAII wrapper for custom MPI communicators.
Definition Comm.h:66

The local arrays can be accessed via views returned by the following member functions:

The reference manual for TNL::Containers::DistributedArray lists all functionality of the data structure. The following shows a full example.

#include <iostream>
#include <TNL/Containers/Partitioner.h>
#include <TNL/Containers/DistributedArray.h>
#include <TNL/MPI/ScopedInitializer.h>
/***
* The following works for any device (CPU, GPU ...).
*/
template< typename Device >
void distributedArrayExample()
{
using IndexType = typename ArrayType::IndexType;
using LocalRangeType = typename ArrayType::LocalRangeType;
const TNL::MPI::Comm communicator = MPI_COMM_WORLD;
// We set the global array size to a prime number to force non-uniform distribution.
const int size = 97;
const int ghosts = (communicator.size() > 1) ? 4 : 0;
const LocalRangeType localRange = TNL::Containers::Partitioner< IndexType >::splitRange( size, communicator );
ArrayType a( localRange, ghosts, size, communicator );
a.forElements( 0, size, [] __cuda_callable__ ( int idx, int& value ) { value = idx; } );
std::cout << "Rank " << communicator.rank() << ": " << a.getLocalView() << std::endl;
}
int main( int argc, char* argv[] )
{
TNL::MPI::ScopedInitializer mpi(argc, argv);
if( TNL::MPI::GetRank() == 0 )
std::cout << "The first test runs on CPU ..." << std::endl;
distributedArrayExample< TNL::Devices::Host >();
#ifdef __CUDACC__
TNL::MPI::Barrier();
if( TNL::MPI::GetRank() == 0 )
std::cout << "The second test runs on GPU ..." << std::endl;
distributedArrayExample< TNL::Devices::Cuda >();
#endif
}
int size() const
Returns the size of the group associated with a communicator.
Definition Comm.h:226
int rank() const
Determines the rank of the calling process in the communicator.
Definition Comm.h:219
Definition ScopedInitializer.h:66

The output looks as:

Rank 3: rank on node is 3, using GPU id 1 of 2, CUDA_VISIBLE_DEVICES=
Rank 3: [ 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96 ]
Rank 0: rank on node is 0, using GPU id 0 of 2, CUDA_VISIBLE_DEVICES=
The first test runs on CPU ...
Rank 0: [ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24 ]
Rank 2: rank on node is 2, using GPU id 0 of 2, CUDA_VISIBLE_DEVICES=
Rank 2: [ 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72 ]
Rank 1: rank on node is 1, using GPU id 1 of 2, CUDA_VISIBLE_DEVICES=
Rank 1: [ 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48 ]
The second test runs on GPU ...
Rank 0: [ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24 ]
Rank 3: [ 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96 ]
Rank 1: [ 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48 ]
Rank 2: [ 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72 ]