Parallel STL usage

Follow these steps to add Parallel STL to your application:

  1. Add #include <oneapi/dpl/execution> to your code. Then include one or more of the following header files, depending on the algorithms you intend to use:

    • #include <oneapi/dpl/algorithm>

    • #include <oneapi/dpl/numeric>

    • #include <oneapi/dpl/memory>

    For better coexistence with the C++ standard library, include oneDPL header files before the standard C++ header files.

  2. Pass a oneDPL execution policy object, defined in the oneapi::dpl::execution namespace, to a parallel algorithm.

  3. If you use the C++ Standard Execution Policies:

    • Compile the code with options that enable OpenMP vectorization pragmas.

    • Link with the oneTBB or TBB dynamic library for parallelism.

  4. If you use the DPC++ Execution Policies:

    • Compile the code with options that enable support for SYCL 2020.

Use the C++ Standard Execution Policies

Example

#include <oneapi/dpl/execution>
#include <oneapi/dpl/algorithm>
#include <vector>

int main()
{
    std::vector<int> data( 1000 );
    std::fill(oneapi::dpl::execution::par_unseq, data.begin(), data.end(), 42);
    return 0;
}

Use the DPC++ Execution Policies

The Data Parallel C++ (DPC++) execution policy specifies where a parallel algorithm runs. It encapsulates a SYCL* device or queue, and allows you to set an optional kernel name. DPC++ execution policies can be used with all standard C++ algorithms that support execution policies.

To use the policy, create a policy object by providing a class type for a unique kernel name as a template argument, and one of the following constructor arguments:

  • A SYCL queue

  • A SYCL device

  • A SYCL device selector

  • An existing policy object with a different kernel name

Providing a kernel name for a policy is optional if your compiler supports implicit names for SYCL kernel functions. The Intel® oneAPI DPC++/C++ Compiler supports them by default; for other compilers, implicit names may need to be enabled with compilation options such as -fsycl-unnamed-lambda. Refer to your compiler documentation for more information.

The oneapi::dpl::execution::dpcpp_default object is a predefined object of the device_policy class. It is created with a default kernel name and a default queue. Use it to create customized policy objects, or pass it directly when invoking an algorithm.

If dpcpp_default is passed directly to more than one algorithm, you must enable implicit kernel names (see above) for compilation.

The make_device_policy function templates simplify device_policy creation.

Usage Examples

The code examples below assume the using namespace oneapi::dpl::execution; and using namespace sycl; directives when referring to policy classes and functions:

auto policy_a = device_policy<class PolicyA> {};
std::for_each(policy_a, …);
auto policy_b = device_policy<class PolicyB> {device{gpu_selector{}}};
std::for_each(policy_b, …);
auto policy_c = device_policy<class PolicyC> {cpu_selector{}};
std::for_each(policy_c, …);
auto policy_d = make_device_policy<class PolicyD>(dpcpp_default);
std::for_each(policy_d, …);
auto policy_e = make_device_policy(queue{property::queue::in_order()});
std::for_each(policy_e, …);

Use the FPGA Policy

The fpga_policy class is a DPC++ policy tailored to achieve better performance of parallel algorithms on FPGA hardware devices.

Use the policy when you run the application on an FPGA hardware device or an FPGA emulation device:

  1. Define the ONEDPL_FPGA_DEVICE macro to run on FPGA devices, and the ONEDPL_FPGA_EMULATOR macro to run on FPGA emulation devices.

  2. Add #include <oneapi/dpl/execution> to your code.

  3. Create a policy object by providing an unroll factor (see the Note below) and a class type for a unique kernel name as template arguments (both optional), and one of the following constructor arguments:

    1. A SYCL queue constructed for the FPGA Selector (the behavior is undefined with any other queue).

    2. An existing FPGA policy object with a different kernel name and/or unroll factor.

  4. Pass the created policy object to a parallel algorithm.

The default constructor of fpga_policy creates an object with a SYCL queue constructed for fpga_selector, or for fpga_emulator_selector if the ONEDPL_FPGA_EMULATOR macro is defined.

oneapi::dpl::execution::dpcpp_fpga is a predefined object of the fpga_policy class created with a default unroll factor and a default kernel name. Use it to create customized policy objects, or pass it directly when invoking an algorithm.

Note

Specifying an unroll factor for a policy enables loop unrolling in the implementation of algorithms. The default value is 1. To learn how to choose a better value, refer to the unroll Pragma and Loops Analysis chapters of the Intel® oneAPI DPC++ FPGA Optimization Guide.

The make_fpga_policy function templates simplify fpga_policy creation.

FPGA Policy Usage Examples

The code below assumes using namespace oneapi::dpl::execution; for policies and using namespace sycl; for queues and device selectors:

constexpr auto unroll_factor = 8;
auto fpga_policy_a = fpga_policy<unroll_factor, class FPGAPolicyA>{};
auto fpga_policy_b = make_fpga_policy(queue{intel::fpga_selector{}});
auto fpga_policy_c = make_fpga_policy<unroll_factor, class FPGAPolicyC>();

Pass Data to Algorithms

You can use one of the following ways to pass data to an algorithm executed with a DPC++ policy:

  • oneapi::dpl::begin and oneapi::dpl::end functions

  • Unified shared memory (USM) pointers and std::vector with USM allocators

  • Iterators of host-side std::vector

Use oneapi::dpl::begin and oneapi::dpl::end Functions

oneapi::dpl::begin and oneapi::dpl::end are special helper functions that allow you to pass SYCL buffers to parallel algorithms. These functions accept a SYCL buffer and return an object of an unspecified type that satisfies the following requirements:

  • Is CopyConstructible, CopyAssignable, and comparable with operators == and !=

  • The following expressions are valid: a + n, a - n, and a - b, where a and b are objects of the type, and n is an integer value

  • Has a get_buffer method with no arguments. The method returns the SYCL buffer passed to oneapi::dpl::begin and oneapi::dpl::end functions
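The requirements listed above can be illustrated with a small mock. The sketch below is not the oneDPL implementation: FakeBuffer, BufferPosition, fake_begin, and fake_end are hypothetical names introduced only for illustration, and FakeBuffer stands in for a SYCL buffer so that the code compiles without SYCL. Unlike the real functions, the mock get_buffer returns a reference to the stand-in buffer.

```cpp
#include <cstddef>
#include <type_traits>

// Hypothetical stand-in for a SYCL buffer, so the sketch compiles without SYCL.
struct FakeBuffer
{
    std::size_t size;
};

// Hypothetical stand-in for the unspecified type returned by
// oneapi::dpl::begin and oneapi::dpl::end. Not the oneDPL implementation.
class BufferPosition
{
public:
    BufferPosition(FakeBuffer* buffer, std::ptrdiff_t offset)
        : buffer_(buffer), offset_(offset) {}

    // Comparable with operators == and !=
    friend bool operator==(const BufferPosition& a, const BufferPosition& b)
    {
        return a.buffer_ == b.buffer_ && a.offset_ == b.offset_;
    }
    friend bool operator!=(const BufferPosition& a, const BufferPosition& b)
    {
        return !(a == b);
    }

    // a + n and a - n move the position by n elements
    friend BufferPosition operator+(const BufferPosition& a, std::ptrdiff_t n)
    {
        return BufferPosition(a.buffer_, a.offset_ + n);
    }
    friend BufferPosition operator-(const BufferPosition& a, std::ptrdiff_t n)
    {
        return BufferPosition(a.buffer_, a.offset_ - n);
    }

    // a - b yields the distance between two positions
    friend std::ptrdiff_t operator-(const BufferPosition& a, const BufferPosition& b)
    {
        return a.offset_ - b.offset_;
    }

    // Returns the buffer this position refers to (by reference in this mock)
    FakeBuffer& get_buffer() const { return *buffer_; }

private:
    FakeBuffer* buffer_;
    std::ptrdiff_t offset_;
};

// CopyConstructible and CopyAssignable, checked at compile time
static_assert(std::is_copy_constructible_v<BufferPosition>, "must be CopyConstructible");
static_assert(std::is_copy_assignable_v<BufferPosition>, "must be CopyAssignable");

// Hypothetical counterparts of oneapi::dpl::begin and oneapi::dpl::end
inline BufferPosition fake_begin(FakeBuffer& b)
{
    return BufferPosition(&b, 0);
}
inline BufferPosition fake_end(FakeBuffer& b)
{
    return BufferPosition(&b, static_cast<std::ptrdiff_t>(b.size));
}
```

With a FakeBuffer of 1000 elements, fake_end(buf) - fake_begin(buf) evaluates to 1000, and fake_begin(buf) + 1000 compares equal to fake_end(buf). The objects returned by the real functions are expected to support the same expressions.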

To use the functions, add #include <oneapi/dpl/iterator> to your code.

Example:

#include <oneapi/dpl/execution>
#include <oneapi/dpl/algorithm>
#include <oneapi/dpl/iterator>
#include <CL/sycl.hpp>
int main(){
  sycl::buffer<int> buf { 1000 };
  auto buf_begin = oneapi::dpl::begin(buf);
  auto buf_end   = oneapi::dpl::end(buf);
  std::fill(oneapi::dpl::execution::dpcpp_default, buf_begin, buf_end, 42);
  return 0;
}

Use Unified Shared Memory (USM)

The following examples demonstrate two ways to use the parallel algorithms with USM:

  • USM pointers

  • USM allocators

If you have a USM-allocated buffer, pass the pointers to the start and past the end of the buffer to a parallel algorithm. Make sure that the execution policy and the buffer were created for the same queue.

#include <oneapi/dpl/execution>
#include <oneapi/dpl/algorithm>
#include <CL/sycl.hpp>
int main(){
  sycl::queue q;
  const int n = 1000;
  int* d_head = sycl::malloc_device<int>(n, q);

  std::fill(oneapi::dpl::execution::make_device_policy(q), d_head, d_head + n, 42);

  sycl::free(d_head, q);
  return 0;
}

Alternatively, use std::vector with a USM allocator:

#include <oneapi/dpl/execution>
#include <oneapi/dpl/algorithm>
#include <CL/sycl.hpp>
#include <vector>
int main(){
  const int n = 1000;
  auto policy = oneapi::dpl::execution::dpcpp_default;
  sycl::usm_allocator<int, sycl::usm::alloc::shared> alloc(policy.queue());
  std::vector<int, decltype(alloc)> vec(n, alloc);

  std::fill(policy, vec.begin(), vec.end(), 42);

  return 0;
}

Use Host-Side std::vector

oneDPL parallel algorithms can be called with ordinary (host-side) iterators, as seen in the example below. In this case, a temporary SYCL buffer is created and the data is copied to this buffer. After processing of the temporary buffer on a device is complete, the data is copied back to the host. Working with SYCL buffers is recommended to reduce data copying between the host and device.

Example:

#include <oneapi/dpl/execution>
#include <oneapi/dpl/algorithm>
#include <vector>
int main(){
  std::vector<int> v( 1000 );
  std::fill(oneapi::dpl::execution::dpcpp_default, v.begin(), v.end(), 42);
  // each element of v is now equal to 42
  return 0;
}

Error Handling with DPC++ Execution Policies

The DPC++ error handling model supports two types of errors. For synchronous errors, the DPC++ host runtime libraries throw exceptions. Asynchronous errors can only be processed in a user-supplied error handler associated with a DPC++ queue.

For algorithms executed with DPC++ policies, handling all errors, synchronous or asynchronous, is a responsibility of the caller. Specifically:

  • No exceptions are thrown explicitly by algorithms.

  • Exceptions thrown by runtime libraries at the host CPU, including DPC++ synchronous exceptions, are passed through to the caller.

  • DPC++ asynchronous errors are not handled.

In order to process DPC++ asynchronous errors, the queue associated with a DPC++ policy must be created with an error handler object. The predefined policy objects (dpcpp_default etc.) have no error handlers; do not use those if you need to process asynchronous errors.

Restrictions

When used with DPC++ execution policies, oneDPL algorithms are subject to the same restrictions as DPC++ itself (see the DPC++ specification and the SYCL specification for details), such as:

  • Adding buffers to a lambda capture list is not allowed for lambdas passed to an algorithm.

  • Passing data types that are not trivially constructible is allowed only in USM, not in buffers or host-allocated containers.
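A compile-time check can help catch the second restriction early. The sketch below assumes that "trivially constructible" corresponds to the standard trait std::is_trivially_default_constructible; that reading is an assumption, and the SYCL specification remains the authority on which types are allowed. The names passes_trivial_construction_check_v, Point, and Counter are hypothetical and introduced only for illustration.

```cpp
#include <string>
#include <type_traits>

// Assumption: "trivially constructible" is read here as the standard trait
// std::is_trivially_default_constructible. Hypothetical helper, not a oneDPL API.
template <typename T>
constexpr bool passes_trivial_construction_check_v =
    std::is_trivially_default_constructible_v<T>;

// Plain aggregate without a user-provided constructor: trivial default construction.
struct Point
{
    int x;
    int y;
};

// The default member initializer makes the default constructor non-trivial.
struct Counter
{
    int value = 0;
};

static_assert(passes_trivial_construction_check_v<int>, "int passes");
static_assert(passes_trivial_construction_check_v<Point>, "Point passes");
static_assert(!passes_trivial_construction_check_v<Counter>, "Counter does not pass");
static_assert(!passes_trivial_construction_check_v<std::string>, "std::string does not pass");
```

Under this assumption, a static_assert on the element type placed next to the code that creates a buffer or a host-side container turns a violation into a compile-time error.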

Known Limitations

For the transform_exclusive_scan and transform_inclusive_scan algorithms, the result of the unary operation should be convertible to the type of the initial value if one is provided; otherwise, it should be convertible to the type of the values in the processed data sequence (std::iterator_traits<IteratorType>::value_type).
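The following sketch shows a call that satisfies this limitation. To stay compilable without oneDPL, it uses the sequential std::transform_exclusive_scan overload from the C++17 standard library as a stand-in; a oneDPL call is assumed to take the same arguments after the execution policy. The function name scaled_prefix_sums is hypothetical.

```cpp
#include <functional>
#include <numeric>
#include <vector>

// Computes exclusive prefix sums of 10 * x, accumulated as long.
// Hypothetical helper; the sequential standard algorithm stands in for
// the oneDPL call so that the sketch needs no SYCL device.
inline std::vector<long> scaled_prefix_sums(const std::vector<int>& input)
{
    std::vector<long> output(input.size());

    // An initial value is provided, so its type (long) is the type that
    // the result of the unary operation must be convertible to.
    const long init = 0L;

    std::transform_exclusive_scan(
        input.begin(), input.end(), output.begin(), init,
        std::plus<long>{},
        // The unary operation returns long, matching the initial value type.
        [](int x) -> long { return 10L * x; });

    return output;
}
```

For the input {1, 2, 3}, the function returns {0, 10, 30}. A unary operation whose result type cannot be converted to long would violate the limitation described above.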

Build Your Code with oneDPL

Use these steps to build your code with oneDPL:

  1. To build with the Intel® oneAPI DPC++/C++ Compiler, see Get Started with the Intel® oneAPI DPC++/C++ Compiler for details.

  2. Set the environment for oneDPL and oneTBB.

  3. To avoid naming device policy objects explicitly, add the -fsycl-unnamed-lambda option.

Below is an example of a command line used to compile code that contains oneDPL parallel algorithms on Linux* (depending on the code, parameters within [] could be unnecessary):

dpcpp [-fsycl-unnamed-lambda] test.cpp [-ltbb] -o test