Introduction
This part introduces arrays in TNL. There are three types of arrays: common arrays with dynamic allocation, static arrays allocated on the stack, and distributed arrays with dynamic allocation. Arrays are one of the most important data structures for memory management. Methods implemented in TNL arrays are particularly useful for GPU programming as they allow to easily allocate memory on the GPU, transfer data between GPU and CPU, but also to initialize data allocated on the GPU. In addition, the resulting code is independent of the hardware platform, so it can be run on any supported architecture without any changes.
Dynamic arrays
Array is a class template defined in the TNL::Containers
namespace, which has four template parameters:
Value
is type of data to be stored in the array
Device
is the device to be used for the execution of array operations. It can be any class defined in the TNL::Devices namespace.
Index
is the type to be used for indexing the array elements.
Allocator
is the type of the allocator used for the allocation and deallocation of memory used by the array. By default, an appropriate allocator for the specified Device
is selected with TNL::Allocators::Default.
The following example shows how to allocate arrays on CPU and GPU and how to initialize the data.
#include <iostream>
#include <TNL/Containers/Array.h>
#include <list>
#include <vector>
int
main( int argc, char* argv[] )
{
host_array = 3;
device_array = host_array;
}
Array is responsible for memory management, access to array elements, and general array operations.
Definition Array.h:64
Namespace for TNL containers.
Definition Array.h:17
The main TNL namespace.
Definition AtomicOperations.h:9
The result looks as follows:
host_array = [ 3, 3, 3, 3, 3, 3, 3, 3, 3, 3 ]
device_array = [ 3, 3, 3, 3, 3, 3, 3, 3, 3, 3 ]
device_array_list = [ 1, 2, 3, 4, 5 ]
device_array_vector = [ 6, 7, 8, 9, 10 ]
device_array_init_list = [ 11, 12, 13, 14, 15 ]
Array views
Arrays cannot share data with each other or data allocated elsewhere. This can be achieved with the ArrayView
structure which has similar semantics to Array
, but it does not handle allocation and deallocation of the data. Hence, array view cannot be resized, but it can be used to wrap data allocated elsewhere (e.g. using an Array
or an operator new
) and to partition large arrays into subarrays. The process of wrapping external data with a view is called binding.
The following code snippet shows how to create an array view:
#include <iostream>
#include <TNL/Containers/Array.h>
#include <TNL/Containers/ArrayView.h>
int
main( int argc, char* argv[] )
{
const int size = 5;
another_view = -5;
}
ArrayView is a simple data structure which provides a non-owning encapsulation of array data....
Definition ArrayView.h:55
__cuda_callable__ ConstViewType getConstView(IndexType begin=0, IndexType end=0) const
Returns a non-modifiable view of the array view.
Definition ArrayView.hpp:74
__cuda_callable__ ViewType getView(IndexType begin=0, IndexType end=0)
Returns a modifiable view of the array view.
Definition ArrayView.hpp:58
The output is:
a = [ -5, -5, -5, -5, -5 ]
a_view = [ -5, -5, -5, -5, -5 ]
another_view = [ -5, -5, -5, -5, -5 ]
const_view = [ -5, -5, -5, -5, -5 ]
You can also bind external data into array view:
#include <iostream>
#include <TNL/Containers/Array.h>
#include <TNL/Containers/ArrayView.h>
int
main( int argc, char* argv[] )
{
const int size = 5;
float* a = new float[ size ];
a_view = -5;
for( int i = 0; i < size; i++ )
delete[] a;
}
Output:
a_view = [ -5, -5, -5, -5, -5 ]
-5 -5 -5 -5 -5
Since array views do not allocate or deallocate memory, they can be created even in CUDA kernels, which is not possible with Array
. ArrayView
can also be passed by value into CUDA kernels or captured by value in device lambda functions, because the ArrayView
's copy-constructor makes only a shallow copy (i.e., it copies only the data pointer and size).
Accessing the array elements
There are two ways how to work with the array (or array view) elements – using the indexing operator (operator[]
) which is more efficient, or using the setElement
and getElement
methods which are more flexible.
Accessing the array elements with operator[]
The indexing operator operator[]
is implemented in both Array
and ArrayView
and it is defined as __cuda_callable__
. It means that it can be called even in CUDA kernels if the data processing executes on a GPU, i.e. the Device
parameter is Devices::Cuda
. This operator returns a reference to the given array element and so it is very efficient. However, calling this operator from host for data allocated on the device (or vice versa) leads to segmentation fault (on the host system) or broken state of the device. It means:
- You may call the
operator[]
on the host only for data allocated on the host (with device Devices::Host
).
- You may call the
operator[]
on the device only for data allocated on the device (with device Devices::Cuda
).
The following example shows use of operator[]
.
#include <iostream>
#include <TNL/Containers/Array.h>
#include <TNL/Containers/ArrayView.h>
__global__
void
{
int tid = threadIdx.x;
view[ tid ] = -tid;
}
int
main( int argc, char* argv[] )
{
const int size = 5;
for( int i = 0; i < size; i++ )
host_array[ i ] = i;
auto device_view = device_array.getView();
initKernel<<< 1, size >>>( device_view );
}
__cuda_callable__ IndexType getSize() const
Returns the current size of the array view.
Definition ArrayView.hpp:169
Output:
host_array = [ 0, 1, 2, 3, 4 ]
device_array = [ 0, -1, -2, -3, -4 ]
In general in TNL, each method defined as __cuda_callable__
can be called from the CUDA kernels. The method ArrayView::getSize
is another example. We also would like to point the reader to better ways of arrays initiation for example with method ArrayView::forElements
or with ParallelFor
.
Accessing the array elements with setElement
and getElement
On the other hand, the methods setElement
and getElement
can be called from the host no matter where the array is allocated. In addition they can be called from kernels on device where the array is allocated. getElement
returns copy of an element rather than a reference. If the array is on GPU and the methods are called from the host, the array element is copied from the device on the host (or vice versa) which is significantly slower. In the parts of code where the performance matters, these methods shall not be called from the host when the array is allocated on the device. In this way, their use is, however, easier compared to operator[]
and they allow to write one simple code for both CPU and GPU. Both methods are good candidates for:
- reading/writing of only few elements in the array
- one-time data initialization in parts of code that are not time critical
- debugging purposes
The following example shows the use of getElement
and setElement
:
#include <iostream>
#include <TNL/Containers/Array.h>
#include <TNL/Containers/ArrayView.h>
int
main( int argc, char* argv[] )
{
const int size = 5;
for( int i = 0; i < size; i++ ) {
host_array.setElement( i, i );
device_array.setElement( i, i );
}
for( int i = 0; i < size; i++ )
if( host_array.getElement( i ) == device_array.getElement( i ) )
}
Output:
Elements at position 0 match.
Elements at position 1 match.
Elements at position 2 match.
Elements at position 3 match.
Elements at position 4 match.
host_array = [ 0, 1, 2, 3, 4 ]
device_array = [ 0, 1, 2, 3, 4 ]
Arrays and parallel for
More efficient and still quite simple method for (not only) array elements initialization is with the use of C++ lambda functions and methods forElements
and forAllElements
. A lambda function is passed as an argument to the method and it is then applied for all elements. Optionally, one may define only a subrange of element indexes where the lambda shall be applied. If the underlying array is allocated on GPU, the lambda function is called from CUDA kernel. This is why it is more efficient than using setElement
repeatedly. On the other hand, one must be careful to use only __cuda_callable__
methods inside the lambda function. The use of the methods forElements
and forAllElements
is demonstrated in the following example.
#include <iostream>
#include <TNL/Containers/Array.h>
#include <TNL/Containers/ArrayView.h>
template< typename Device >
void
forElementsExample()
{
const int size = 10;
b = 0;
a.forAllElements(
{
value = i;
} );
5,
{
value = a_view[ i ] + 4.0;
} );
}
int
main( int argc, char* argv[] )
{
forElementsExample< Devices::Host >();
#ifdef __CUDACC__
forElementsExample< Devices::Cuda >();
#endif
}
#define __cuda_callable__
Definition Macros.h:49
void forElements(IndexType begin, IndexType end, Function &&f)
Process the lambda function f for each array element in interval [ begin, end).
Definition ArrayView.hpp:282
Output:
Running example on the host system:
a = [ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 ]
b = [ 4, 5, 6, 7, 8, 0, 0, 0, 0, 0 ]
Running example on the CUDA device:
a = [ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 ]
b = [ 4, 5, 6, 7, 8, 0, 0, 0, 0, 0 ]
Arrays and flexible reduction
Arrays also offer simpler way to do the flexible parallel reduction. See the section about [the flexible parallel reduction](ug_ReductionAndScan) to understand how it works. Flexible reduction for arrays just simplifies access to the array elements. See the following example:
#include <TNL/Containers/Array.h>
#include <TNL/Algorithms/reduce.h>
template< typename Device >
void
reduceArrayExample()
{
const int size = 10;
a.forAllElements(
{
value = i;
} );
float sum_total = Algorithms::reduce( a,
TNL::Plus{} );
float sum_last_five = Algorithms::reduce( a.getConstView( 5, 10 ),
TNL::Plus{} );
}
int
main( int argc, char* argv[] )
{
reduceArrayExample< Devices::Host >();
#ifdef __CUDACC__
reduceArrayExample< Devices::Cuda >();
#endif
}
Function object implementing x + y.
Definition Functional.h:17
Output:
Running example on the host system:
a = [ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 ]
sum of all elements = 45
sum of last 5 elements = 35
Running example on the CUDA device:
a = [ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 ]
sum of all elements = 45
sum of last 5 elements = 35
Checking the array contents
The functions TNL::Algorithms::contains and TNL::Algorithms::containsOnlyValue serve for testing the contents of arrays, vectors or their views. contains
returns true
if there is at least one element in the array with given value. containsOnlyValue
returns true
only if all elements of the array are equal to the given value. The test can be restricted to a subrange of array elements. See the following code snippet for usage example.
#include <iostream>
#include <TNL/Containers/Array.h>
#include <TNL/Algorithms/contains.h>
int
main( int argc, char* argv[] )
{
const int size = 10;
a = 0;
b.forAllElements(
{
value = i;
} );
if( contains( a, 0.0 ) )
if( contains( a, 1.0 ) )
if( contains( b, 0.0 ) )
if( contains( b, 1.0 ) )
if( containsOnlyValue( a, 0.0 ) )
if( containsOnlyValue( a, 1.0 ) )
if( containsOnlyValue( b, 0.0 ) )
if( containsOnlyValue( b, 1.0 ) )
b.forElements( 0,
5,
{
value = 0.0;
} );
}
Namespace for fundamental TNL algorithms.
Definition AtomicOperations.h:9
bool containsOnlyValue(const Array &array, typename Array::ValueType value, typename Array::IndexType begin=0, typename Array::IndexType end=0)
Checks if all elements of an array/vector/view have the given value.
Definition contains.h:61
Output:
a contains 0
b contains 0
b contains 1
a contains only 0
First five elements of b contains only 0
IO operations with arrays
Methods save
and load
serve for storing/restoring the array to/from a file in a binary form. In case of Array
, loading of an array from a file causes data reallocation. ArrayView
cannot do reallocation, therefore the data loaded from a file is copied to the memory managed by the ArrayView
. The number of elements managed by the array view and those loaded from the file must be equal. See the following example.
#include <iostream>
#include <TNL/Containers/Array.h>
#include <TNL/Containers/ArrayView.h>
int
main( int argc, char* argv[] )
{
const int size = 15;
a = 1;
b = 2;
}
void save(const std::string &fileName) const
Method for saving the data to a binary file fileName.
Definition ArrayView.hpp:327
void load(const std::string &fileName)
Method for loading the data from a binary file fileName.
Definition ArrayView.hpp:334
Output:
a = [ 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1 ]
a_view = [ 2, 2, 2, 2, 2 ]
b = [ 2, 2, 2, 2, 2 ]
Static arrays
Static arrays are allocated on stack and thus they can be created even in CUDA kernels. Their size is fixed and it is given by a template parameter. Static array is a templated class defined in namespace TNL::Containers
having two template parameters:
Size
is the array size.
Value
is type of data stored in the array.
The interface of StaticArray is very smillar to Array but much simpler. It contains set of common constructors. Array elements can be accessed by the operator[]
and also using method x()
, y()
and z()
when it makes sense. See the following example for typical use of StaticArray
.
#include <iostream>
#include <TNL/Containers/StaticArray.h>
#include <TNL/File.h>
int
main( int argc, char* argv[] )
{
a1 = 0.0;
File(
"static-array-example-file.tnl", std::ios::out ) << a3;
File(
"static-array-example-file.tnl", std::ios::in ) >> a1;
}
Array with constant size.
Definition StaticArray.h:20
constexpr void sort()
Sorts the elements in this static array in ascending order.
Definition StaticArray.hpp:320
This class serves for binary IO. It allows to do IO even for data allocated on GPU together with on-t...
Definition File.h:25
The output looks as:
a1 = [ 0, 0, 0 ]
a2 = [ 1, 2, 3 ]
a3 = [ 4, 3, 2 ]
a1 = [ 4, 3, 2 ]
Sorted a1 = [ 2, 3, 4 ]
Distributed arrays
Distributed arrays are managed by the TNL::Containers::DistributedArray class. It is a wrapper around a local array, MPI communicator and global indexing information. When creating a distributed array, the global range must be partitioned into subranges (e.g. using TNL::Containers::splitRange) and passed to the constructor or the setDistribution member function. For example:
using LocalRangeType = typename ArrayType::LocalRangeType;
using Partitioner = TNL::Containers::Partitioner< typename ArrayType::IndexType >;
const int size = 97;
const int ghosts = 0;
const LocalRangeType localRange = Partitioner::splitRange( size, communicator );
ArrayType a( localRange, ghosts, size, communicator );
Distributed array.
Definition DistributedArray.h:24
An RAII wrapper for custom MPI communicators.
Definition Comm.h:63
The local arrays can be accessed via views returned by the following member functions:
The reference manual for TNL::Containers::DistributedArray lists all functionality of the data structure. The following shows a full example:
#include <iostream>
#include <TNL/Containers/BlockPartitioning.h>
#include <TNL/Containers/DistributedArray.h>
#include <TNL/MPI/ScopedInitializer.h>
template< typename Device >
void
distributedArrayExample()
{
using IndexType = typename ArrayType::IndexType;
using LocalRangeType = typename ArrayType::LocalRangeType;
const int size = 97;
const int ghosts = ( communicator.
size() > 1 ) ? 4 : 0;
ArrayType a( localRange, ghosts, size, communicator );
a.forElements( 0,
size,
{
value = idx;
} );
ArrayType b( localRange, ghosts, size, communicator );
b.forElements( 0,
size,
{
value = idx - ( idx == 90 );
} );
for(
int i = 0; i < communicator.
size(); i++ ) {
if( communicator.
rank() == i )
<<
" local range = " << a.getLocalRange().getBegin() <<
" - " << a.getLocalRange().getEnd() <<
std::endl
<<
" ghosts = " << a.getGhosts() <<
std::endl
<<
" local data = " << a.getLocalView() <<
std::endl
<<
" local data with ghosts = " << a.getLocalViewWithGhosts() <<
std::endl;
TNL::MPI::Barrier();
}
}
int
main( int argc, char* argv[] )
{
if( TNL::MPI::GetRank() == 0 )
distributedArrayExample< TNL::Devices::Host >();
#ifdef __CUDACC__
TNL::MPI::Barrier();
if( TNL::MPI::GetRank() == 0 )
distributedArrayExample< TNL::Devices::Cuda >();
#endif
}
int size() const
Returns the size of the group associated with a communicator.
Definition Comm.h:223
int rank() const
Determines the rank of the calling process in the communicator.
Definition Comm.h:216
Subrange< Index > splitRange(Index rangeBegin, Index rangeEnd, int rank, int num_subintervals)
A helper function which splits a one-dimensional range.
Definition BlockPartitioning.h:27
Definition ScopedInitializer.h:63
The output looks as:
Rank 1: rank on node is 1, using GPU id 1 of 2
Environment:
CUDA_VISIBLE_DEVICES=
Rank 3: rank on node is 3, using GPU id 1 of 2
Environment:
CUDA_VISIBLE_DEVICES=
Rank 2: rank on node is 2, using GPU id 0 of 2
Environment:
CUDA_VISIBLE_DEVICES=
Rank 0: rank on node is 0, using GPU id 0 of 2
Environment:
CUDA_VISIBLE_DEVICES=
The first test runs on CPU ...
MPI rank = 0
size = 97
local range = 0 - 25
ghosts = 4
local data = [ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24 ]
local data with ghosts = [ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 12324, 202382080, 92160, 471366016 ]
MPI rank = 1
size = 97
local range = 25 - 49
ghosts = 4
local data = [ 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48 ]
local data with ghosts = [ 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 471497088, 12324, 202382080, 92160 ]
MPI rank = 2
size = 97
local range = 49 - 73
ghosts = 4
local data = [ 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72 ]
local data with ghosts = [ 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 471497088, 12324, 202382080, 92160 ]
MPI rank = 3
size = 97
local range = 73 - 97
ghosts = 4
local data = [ 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96 ]
local data with ghosts = [ 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 471497088, 12324, 202382080, 92160 ]
The second test runs on GPU ...
MPI rank = 0
size = 97
local range = 0 - 25
ghosts = 4
local data = [ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24 ]
local data with ghosts = [ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 0, 0, 0, 0 ]
MPI rank = 1
size = 97
local range = 25 - 49
ghosts = 4
local data = [ 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48 ]
local data with ghosts = [ 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 0, 0, 0, 0 ]
MPI rank = 2
size = 97
local range = 49 - 73
ghosts = 4
local data = [ 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72 ]
local data with ghosts = [ 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 0, 0, 0, 0 ]
MPI rank = 3
size = 97
local range = 73 - 97
ghosts = 4
local data = [ 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96 ]
local data with ghosts = [ 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 0, 0, 0, 0 ]