Template Numerical Library version\ main:f17d0c8
Loading...
Searching...
No Matches
Arrays

Introduction

This part introduces arrays in TNL. There are three types of arrays: common arrays with dynamic allocation, static arrays allocated on the stack, and distributed arrays with dynamic allocation. Arrays are one of the most important data structures for memory management. Methods implemented in TNL arrays are particularly useful for GPU programming as they allow to easily allocate memory on the GPU, transfer data between GPU and CPU, but also to initialize data allocated on the GPU. In addition, the resulting code is independent of the hardware platform, so it can be run on any supported architecture without any changes.

Dynamic arrays

Array is a class template defined in the TNL::Containers namespace, which has four template parameters:

  • Value is type of data to be stored in the array
  • Device is the device to be used for the execution of array operations. It can be any class defined in the TNL::Devices namespace.
  • Index is the type to be used for indexing the array elements.
  • Allocator is the type of the allocator used for the allocation and deallocation of memory used by the array. By default, an appropriate allocator for the specified Device is selected with TNL::Allocators::Default.

The following example shows how to allocate arrays on CPU and GPU and how to initialize the data.

#include <iostream>
#include <TNL/Containers/Array.h>
#include <list>
#include <vector>
using namespace TNL;
using namespace TNL::Containers;
int
main( int argc, char* argv[] )
{
/****
* Create one array on host and one array on device.
*/
Array< int > host_array( 10 );
/***
* Initiate the host array with number three and assign it to the device array.
* NOTE: Of course, you may do directly 'device_array = 3' as well.
*/
host_array = 3;
device_array = host_array;
/****
* Print both arrays.
*/
std::cout << "host_array = " << host_array << std::endl;
std::cout << "device_array = " << device_array << std::endl;
/****
* There are few other ways how to initialize arrays...
*/
std::list< int > list{ 1, 2, 3, 4, 5 };
std::vector< int > vector{ 6, 7, 8, 9, 10 };
Array< int, Devices::Cuda > device_array_list( list );
Array< int, Devices::Cuda > device_array_vector( vector );
Array< int, Devices::Cuda > device_array_init_list{ 11, 12, 13, 14, 15 };
/****
* ... and print them all
*/
std::cout << "device_array_list = " << device_array_list << std::endl;
std::cout << "device_array_vector = " << device_array_vector << std::endl;
std::cout << "device_array_init_list = " << device_array_init_list << std::endl;
}
Array is responsible for memory management, access to array elements, and general array operations.
Definition Array.h:64
T endl(T... args)
Namespace for TNL containers.
Definition Array.h:17
The main TNL namespace.
Definition AtomicOperations.h:9

The result looks as follows:

host_array = [ 3, 3, 3, 3, 3, 3, 3, 3, 3, 3 ]
device_array = [ 3, 3, 3, 3, 3, 3, 3, 3, 3, 3 ]
device_array_list = [ 1, 2, 3, 4, 5 ]
device_array_vector = [ 6, 7, 8, 9, 10 ]
device_array_init_list = [ 11, 12, 13, 14, 15 ]

Array views

Arrays cannot share data with each other or data allocated elsewhere. This can be achieved with the ArrayView structure which has similar semantics to Array, but it does not handle allocation and deallocation of the data. Hence, array view cannot be resized, but it can be used to wrap data allocated elsewhere (e.g. using an Array or an operator new) and to partition large arrays into subarrays. The process of wrapping external data with a view is called binding.

The following code snippet shows how to create an array view:

#include <iostream>
#include <TNL/Containers/Array.h>
#include <TNL/Containers/ArrayView.h>
using namespace TNL;
using namespace TNL::Containers;
int
main( int argc, char* argv[] )
{
/****
* Create new array
*/
const int size = 5;
Array< float > a( size );
/****
* Bind an array view with it
*/
auto another_view = a.getView();
auto const_view = a.getConstView();
another_view = -5;
std::cout << " a = " << a << std::endl;
std::cout << " a_view = " << a_view << std::endl;
std::cout << " another_view = " << another_view << std::endl;
std::cout << " const_view = " << const_view << std::endl;
//const_view = 3; this would not compile
}
ArrayView is a simple data structure which provides a non-owning encapsulation of array data....
Definition ArrayView.h:55
__cuda_callable__ ConstViewType getConstView(IndexType begin=0, IndexType end=0) const
Returns a non-modifiable view of the array view.
Definition ArrayView.hpp:74
__cuda_callable__ ViewType getView(IndexType begin=0, IndexType end=0)
Returns a modifiable view of the array view.
Definition ArrayView.hpp:58

The output is:

a = [ -5, -5, -5, -5, -5 ]
a_view = [ -5, -5, -5, -5, -5 ]
another_view = [ -5, -5, -5, -5, -5 ]
const_view = [ -5, -5, -5, -5, -5 ]

You can also bind external data into array view:

#include <iostream>
#include <TNL/Containers/Array.h>
#include <TNL/Containers/ArrayView.h>
using namespace TNL;
using namespace TNL::Containers;
int
main( int argc, char* argv[] )
{
/****
* Allocate your own data
*/
const int size = 5;
float* a = new float[ size ];
/****
* Wrap the data with an array view
*/
ArrayView< float > a_view( a, size );
a_view = -5;
std::cout << " a_view = " << a_view << std::endl;
for( int i = 0; i < size; i++ )
std::cout << a[ i ] << " ";
/****
* Free the allocated memory
*/
delete[] a;
}

Output:

a_view = [ -5, -5, -5, -5, -5 ]
-5 -5 -5 -5 -5

Since array views do not allocate or deallocate memory, they can be created even in CUDA kernels, which is not possible with Array. ArrayView can also be passed by value into CUDA kernels or captured by value in device lambda functions, because the ArrayView's copy-constructor makes only a shallow copy (i.e., it copies only the data pointer and size).

Accessing the array elements

There are two ways how to work with the array (or array view) elements – using the indexing operator (operator[]) which is more efficient, or using the setElement and getElement methods which are more flexible.

Accessing the array elements with operator[]

The indexing operator operator[] is implemented in both Array and ArrayView and it is defined as __cuda_callable__. It means that it can be called even in CUDA kernels if the data processing executes on a GPU, i.e. the Device parameter is Devices::Cuda. This operator returns a reference to the given array element and so it is very efficient. However, calling this operator from host for data allocated on the device (or vice versa) leads to segmentation fault (on the host system) or broken state of the device. It means:

  • You may call the operator[] on the host only for data allocated on the host (with device Devices::Host).
  • You may call the operator[] on the device only for data allocated on the device (with device Devices::Cuda).

The following example shows use of operator[].

#include <iostream>
#include <TNL/Containers/Array.h>
#include <TNL/Containers/ArrayView.h>
using namespace TNL;
using namespace TNL::Containers;
__global__
void
{
int tid = threadIdx.x;
if( tid < view.getSize() )
view[ tid ] = -tid;
}
int
main( int argc, char* argv[] )
{
/****
* Create new arrays on both host and device
*/
const int size = 5;
Array< float, Devices::Host > host_array( size );
Array< float, Devices::Cuda > device_array( size );
/****
* Initiate the host array
*/
for( int i = 0; i < size; i++ )
host_array[ i ] = i;
/****
* Prepare array view for the device array - we will pass it to a CUDA kernel.
* NOTE: Better way is to use ArrayView::evaluate or ParallelFor, this is just
* an example.
*/
auto device_view = device_array.getView();
/****
* Call CUDA kernel to initiate the array on the device
*/
initKernel<<< 1, size >>>( device_view );
/****
* Print the results
*/
std::cout << " host_array = " << host_array << std::endl;
std::cout << " device_array = " << device_array << std::endl;
}
__cuda_callable__ IndexType getSize() const
Returns the current size of the array view.
Definition ArrayView.hpp:169

Output:

host_array = [ 0, 1, 2, 3, 4 ]
device_array = [ 0, -1, -2, -3, -4 ]

In general in TNL, each method defined as __cuda_callable__ can be called from the CUDA kernels. The method ArrayView::getSize is another example. We also would like to point the reader to better ways of arrays initiation for example with method ArrayView::forElements or with ParallelFor.

Accessing the array elements with setElement and getElement

On the other hand, the methods setElement and getElement can be called from the host no matter where the array is allocated. In addition they can be called from kernels on device where the array is allocated. getElement returns copy of an element rather than a reference. If the array is on GPU and the methods are called from the host, the array element is copied from the device on the host (or vice versa) which is significantly slower. In the parts of code where the performance matters, these methods shall not be called from the host when the array is allocated on the device. In this way, their use is, however, easier compared to operator[] and they allow to write one simple code for both CPU and GPU. Both methods are good candidates for:

  • reading/writing of only few elements in the array
  • one-time data initialization in parts of code that are not time critical
  • debugging purposes

The following example shows the use of getElement and setElement:

#include <iostream>
#include <TNL/Containers/Array.h>
#include <TNL/Containers/ArrayView.h>
using namespace TNL;
using namespace TNL::Containers;
int
main( int argc, char* argv[] )
{
/****
* Create new arrays on both host and device
*/
const int size = 5;
Array< float, Devices::Host > host_array( size );
Array< float, Devices::Cuda > device_array( size );
/****
* Initiate the arrays with setElement
*/
for( int i = 0; i < size; i++ ) {
host_array.setElement( i, i );
device_array.setElement( i, i );
}
/****
* Compare the arrays using getElement
*/
for( int i = 0; i < size; i++ )
if( host_array.getElement( i ) == device_array.getElement( i ) )
std::cout << "Elements at position " << i << " match." << std::endl;
/****
* Print the results
*/
std::cout << "host_array = " << host_array << std::endl;
std::cout << "device_array = " << device_array << std::endl;
}

Output:

Elements at position 0 match.
Elements at position 1 match.
Elements at position 2 match.
Elements at position 3 match.
Elements at position 4 match.
host_array = [ 0, 1, 2, 3, 4 ]
device_array = [ 0, 1, 2, 3, 4 ]

Arrays and parallel for

More efficient and still quite simple method for (not only) array elements initialization is with the use of C++ lambda functions and methods forElements and forAllElements. A lambda function is passed as an argument to the method and it is then applied for all elements. Optionally, one may define only a subrange of element indexes where the lambda shall be applied. If the underlying array is allocated on GPU, the lambda function is called from CUDA kernel. This is why it is more efficient than using setElement repeatedly. On the other hand, one must be careful to use only __cuda_callable__ methods inside the lambda function. The use of the methods forElements and forAllElements is demonstrated in the following example.

#include <iostream>
#include <TNL/Containers/Array.h>
#include <TNL/Containers/ArrayView.h>
using namespace TNL;
template< typename Device >
void
forElementsExample()
{
/****
* Create new arrays
*/
const int size = 10;
Containers::Array< float, Device > a( size ), b( size );
b = 0;
/****
* Initiate the elements of array `a`
*/
a.forAllElements(
[] __cuda_callable__( int i, float& value )
{
value = i;
} );
/****
* Initiate elements of array `b` with indexes 0-4 using `a_view`
*/
auto a_view = a.getView();
5,
[ = ] __cuda_callable__( int i, float& value )
{
value = a_view[ i ] + 4.0;
} );
/****
* Print the results
*/
std::cout << " a = " << a << std::endl;
std::cout << " b = " << b << std::endl;
}
int
main( int argc, char* argv[] )
{
std::cout << "Running example on the host system: " << std::endl;
forElementsExample< Devices::Host >();
#ifdef __CUDACC__
std::cout << "Running example on the CUDA device: " << std::endl;
forElementsExample< Devices::Cuda >();
#endif
}
#define __cuda_callable__
Definition Macros.h:49
void forElements(IndexType begin, IndexType end, Function &&f)
Process the lambda function f for each array element in interval [ begin, end).
Definition ArrayView.hpp:282

Output:

Running example on the host system:
a = [ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 ]
b = [ 4, 5, 6, 7, 8, 0, 0, 0, 0, 0 ]
Running example on the CUDA device:
a = [ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 ]
b = [ 4, 5, 6, 7, 8, 0, 0, 0, 0, 0 ]

Arrays and flexible reduction

Arrays also offer simpler way to do the flexible parallel reduction. See the section about [the flexible parallel reduction](ug_ReductionAndScan) to understand how it works. Flexible reduction for arrays just simplifies access to the array elements. See the following example:

#include <TNL/Containers/Array.h>
#include <TNL/Algorithms/reduce.h>
using namespace TNL;
template< typename Device >
void
reduceArrayExample()
{
/****
* Create new arrays
*/
const int size = 10;
/****
* Initiate the elements of array `a`
*/
a.forAllElements(
[] __cuda_callable__( int i, float& value )
{
value = i;
} );
/****
* Sum all elements of array `a`
*/
float sum_total = Algorithms::reduce( a, TNL::Plus{} );
/****
* Sum last 5 elements of array `a`
*/
float sum_last_five = Algorithms::reduce( a.getConstView( 5, 10 ), TNL::Plus{} );
/****
* Print the results
*/
std::cout << " a = " << a << std::endl;
std::cout << " sum of all elements = " << sum_total << std::endl;
std::cout << " sum of last 5 elements = " << sum_last_five << std::endl;
}
int
main( int argc, char* argv[] )
{
std::cout << "Running example on the host system: " << std::endl;
reduceArrayExample< Devices::Host >();
#ifdef __CUDACC__
std::cout << "Running example on the CUDA device: " << std::endl;
reduceArrayExample< Devices::Cuda >();
#endif
}
Function object implementing x + y.
Definition Functional.h:17

Output:

Running example on the host system:
a = [ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 ]
sum of all elements = 45
sum of last 5 elements = 35
Running example on the CUDA device:
a = [ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 ]
sum of all elements = 45
sum of last 5 elements = 35

Checking the array contents

The functions TNL::Algorithms::contains and TNL::Algorithms::containsOnlyValue serve for testing the contents of arrays, vectors or their views. contains returns true if there is at least one element in the array with given value. containsOnlyValue returns true only if all elements of the array are equal to the given value. The test can be restricted to a subrange of array elements. See the following code snippet for usage example.

#include <iostream>
#include <TNL/Containers/Array.h>
#include <TNL/Algorithms/contains.h>
using namespace TNL;
using namespace TNL::Containers;
using namespace TNL::Algorithms;
int
main( int argc, char* argv[] )
{
/****
* Create new arrays and initiate them
*/
const int size = 10;
Array< float, Devices::Cuda > a( size ), b( size );
a = 0;
b.forAllElements(
[ = ] __cuda_callable__( int i, float& value )
{
value = i;
} );
/****
* Test the values stored in the arrays
*/
if( contains( a, 0.0 ) )
std::cout << "a contains 0" << std::endl;
if( contains( a, 1.0 ) )
std::cout << "a contains 1" << std::endl;
if( contains( b, 0.0 ) )
std::cout << "b contains 0" << std::endl;
if( contains( b, 1.0 ) )
std::cout << "b contains 1" << std::endl;
if( containsOnlyValue( a, 0.0 ) )
std::cout << "a contains only 0" << std::endl;
if( containsOnlyValue( a, 1.0 ) )
std::cout << "a contains only 1" << std::endl;
if( containsOnlyValue( b, 0.0 ) )
std::cout << "b contains only 0" << std::endl;
if( containsOnlyValue( b, 1.0 ) )
std::cout << "b contains only 1" << std::endl;
/****
* Change the first half of b and test it again
*/
b.forElements( 0,
5,
[ = ] __cuda_callable__( int i, float& value )
{
value = 0.0;
} );
if( containsOnlyValue( b, 0.0, 0, 5 ) )
std::cout << "First five elements of b contains only 0" << std::endl;
}
Namespace for fundamental TNL algorithms.
Definition AtomicOperations.h:9
bool containsOnlyValue(const Array &array, typename Array::ValueType value, typename Array::IndexType begin=0, typename Array::IndexType end=0)
Checks if all elements of an array/vector/view have the given value.
Definition contains.h:61

Output:

a contains 0
b contains 0
b contains 1
a contains only 0
First five elements of b contains only 0

IO operations with arrays

Methods save and load serve for storing/restoring the array to/from a file in a binary form. In case of Array, loading of an array from a file causes data reallocation. ArrayView cannot do reallocation, therefore the data loaded from a file is copied to the memory managed by the ArrayView. The number of elements managed by the array view and those loaded from the file must be equal. See the following example.

#include <iostream>
#include <TNL/Containers/Array.h>
#include <TNL/Containers/ArrayView.h>
using namespace TNL;
using namespace TNL::Containers;
int
main( int argc, char* argv[] )
{
/****
* Create new arrays
*/
const int size = 15;
Array< float > a( size ), b( 5 );
a = 1;
b = 2;
/****
* Create array view for the middle of array a
*/
auto a_view = a.getView( 5, 10 );
/****
* Save array b to file
*/
b.save( "b.tnl" );
/****
* Load data from b to a_view
*/
a_view.load( "b.tnl" );
/****
* Print the results
*/
std::cout << "a = " << a << std::endl;
std::cout << "a_view = " << a_view << std::endl;
std::cout << "b = " << b << std::endl;
}
void save(const std::string &fileName) const
Method for saving the data to a binary file fileName.
Definition ArrayView.hpp:327
void load(const std::string &fileName)
Method for loading the data from a binary file fileName.
Definition ArrayView.hpp:334

Output:

a = [ 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1 ]
a_view = [ 2, 2, 2, 2, 2 ]
b = [ 2, 2, 2, 2, 2 ]

Static arrays

Static arrays are allocated on stack and thus they can be created even in CUDA kernels. Their size is fixed and it is given by a template parameter. Static array is a templated class defined in namespace TNL::Containers having two template parameters:

  • Size is the array size.
  • Value is type of data stored in the array.

The interface of StaticArray is very smillar to Array but much simpler. It contains set of common constructors. Array elements can be accessed by the operator[] and also using method x(), y() and z() when it makes sense. See the following example for typical use of StaticArray.

#include <iostream>
#include <TNL/Containers/StaticArray.h>
#include <TNL/File.h>
using namespace TNL;
using namespace TNL::Containers;
int
main( int argc, char* argv[] )
{
StaticArray< 3, int > a1, a2( 1, 2, 3 ), a3{ 4, 3, 2 };
a1 = 0.0;
std::cout << "a1 = " << a1 << std::endl;
std::cout << "a2 = " << a2 << std::endl;
std::cout << "a3 = " << a3 << std::endl;
File( "static-array-example-file.tnl", std::ios::out ) << a3;
File( "static-array-example-file.tnl", std::ios::in ) >> a1;
std::cout << "a1 = " << a1 << std::endl;
a1.sort();
std::cout << "Sorted a1 = " << a1 << std::endl;
}
Array with constant size.
Definition StaticArray.h:20
constexpr void sort()
Sorts the elements in this static array in ascending order.
Definition StaticArray.hpp:320
This class serves for binary IO. It allows to do IO even for data allocated on GPU together with on-t...
Definition File.h:25

The output looks as:

a1 = [ 0, 0, 0 ]
a2 = [ 1, 2, 3 ]
a3 = [ 4, 3, 2 ]
a1 = [ 4, 3, 2 ]
Sorted a1 = [ 2, 3, 4 ]

Distributed arrays

Distributed arrays are managed by the TNL::Containers::DistributedArray class. It is a wrapper around a local array, MPI communicator and global indexing information. When creating a distributed array, the global range must be partitioned into subranges (e.g. using TNL::Containers::splitRange) and passed to the constructor or the setDistribution member function. For example:

using LocalRangeType = typename ArrayType::LocalRangeType;
using Partitioner = TNL::Containers::Partitioner< typename ArrayType::IndexType >;
const TNL::MPI::Comm communicator = MPI_COMM_WORLD;
const int size = 97;
const int ghosts = 0;
const LocalRangeType localRange = Partitioner::splitRange( size, communicator );
ArrayType a( localRange, ghosts, size, communicator );
Distributed array.
Definition DistributedArray.h:24
An RAII wrapper for custom MPI communicators.
Definition Comm.h:63

The local arrays can be accessed via views returned by the following member functions:

The reference manual for TNL::Containers::DistributedArray lists all functionality of the data structure. The following shows a full example:

#include <iostream>
#include <TNL/Containers/BlockPartitioning.h>
#include <TNL/Containers/DistributedArray.h>
#include <TNL/MPI/ScopedInitializer.h>
/***
* The following works for any device (CPU, GPU ...).
*/
template< typename Device >
void
distributedArrayExample()
{
using IndexType = typename ArrayType::IndexType;
using LocalRangeType = typename ArrayType::LocalRangeType;
const TNL::MPI::Comm communicator = MPI_COMM_WORLD;
// We set the global array size to a prime number to force non-uniform distribution.
const int size = 97;
const int ghosts = ( communicator.size() > 1 ) ? 4 : 0;
const LocalRangeType localRange = TNL::Containers::splitRange< IndexType >( size, communicator );
ArrayType a( localRange, ghosts, size, communicator );
a.forElements( 0,
size,
[] __cuda_callable__( int idx, int& value )
{
value = idx;
} );
ArrayType b( localRange, ghosts, size, communicator );
b.forElements( 0,
size,
[] __cuda_callable__( int idx, int& value )
{
value = idx - ( idx == 90 );
} );
for( int i = 0; i < communicator.size(); i++ ) {
if( communicator.rank() == i )
std::cout << "MPI rank = " << communicator.rank() << std::endl
<< " size = " << a.getSize() << std::endl
<< " local range = " << a.getLocalRange().getBegin() << " - " << a.getLocalRange().getEnd() << std::endl
<< " ghosts = " << a.getGhosts() << std::endl
<< " local data = " << a.getLocalView() << std::endl
<< " local data with ghosts = " << a.getLocalViewWithGhosts() << std::endl;
TNL::MPI::Barrier();
}
}
int
main( int argc, char* argv[] )
{
TNL::MPI::ScopedInitializer mpi( argc, argv );
if( TNL::MPI::GetRank() == 0 )
std::cout << "The first test runs on CPU ..." << std::endl;
distributedArrayExample< TNL::Devices::Host >();
#ifdef __CUDACC__
TNL::MPI::Barrier();
if( TNL::MPI::GetRank() == 0 )
std::cout << "The second test runs on GPU ..." << std::endl;
distributedArrayExample< TNL::Devices::Cuda >();
#endif
}
int size() const
Returns the size of the group associated with a communicator.
Definition Comm.h:223
int rank() const
Determines the rank of the calling process in the communicator.
Definition Comm.h:216
Subrange< Index > splitRange(Index rangeBegin, Index rangeEnd, int rank, int num_subintervals)
A helper function which splits a one-dimensional range.
Definition BlockPartitioning.h:27
Definition ScopedInitializer.h:63

The output looks as:

Rank 2: rank on node is 2, using GPU id 0 of 2
Environment:
CUDA_VISIBLE_DEVICES=
Rank 1: rank on node is 1, using GPU id 1 of 2
Environment:
CUDA_VISIBLE_DEVICES=
Rank 3: rank on node is 3, using GPU id 1 of 2
Environment:
CUDA_VISIBLE_DEVICES=
Rank 0: rank on node is 0, using GPU id 0 of 2
Environment:
CUDA_VISIBLE_DEVICES=
The first test runs on CPU ...
MPI rank = 0
size = 97
local range = 0 - 25
ghosts = 4
local data = [ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24 ]
local data with ghosts = [ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 0, 0, 0, 0 ]
MPI rank = 1
size = 97
local range = 25 - 49
ghosts = 4
local data = [ 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48 ]
local data with ghosts = [ 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 0, 0, 0, 0 ]
MPI rank = 2
size = 97
local range = 49 - 73
ghosts = 4
local data = [ 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72 ]
local data with ghosts = [ 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 0, 0, 0, 0 ]
MPI rank = 3
size = 97
local range = 73 - 97
ghosts = 4
local data = [ 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96 ]
local data with ghosts = [ 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 0, 0, 0, 0 ]
The second test runs on GPU ...
MPI rank = 0
size = 97
local range = 0 - 25
ghosts = 4
local data = [ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24 ]
local data with ghosts = [ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 0, 0, 0, 0 ]
MPI rank = 1
size = 97
local range = 25 - 49
ghosts = 4
local data = [ 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48 ]
local data with ghosts = [ 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 0, 0, 0, 0 ]
MPI rank = 2
size = 97
local range = 49 - 73
ghosts = 4
local data = [ 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72 ]
local data with ghosts = [ 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 0, 0, 0, 0 ]
MPI rank = 3
size = 97
local range = 73 - 97
ghosts = 4
local data = [ 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96 ]
local data with ghosts = [ 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 0, 0, 0, 0 ]