TECA
The Toolkit for Extreme Climate Analysis
A class to manage a fixed size pool of threads that dispatch work.
#include <teca_cuda_thread_pool.h>
Public Member Functions | |
| teca_cuda_thread_pool (MPI_Comm comm, int n_threads, int n_threads_per_device, bool bind, bool verbose) | |
| teca_cuda_thread_pool (const teca_cuda_thread_pool &src)=delete | |
| teca_cuda_thread_pool (teca_cuda_thread_pool &&src)=delete | |
| teca_cuda_thread_pool & | operator= (const teca_cuda_thread_pool &src)=delete |
| teca_cuda_thread_pool & | operator= (teca_cuda_thread_pool &&src)=delete |
| void | push_task (task_t &task) |
| template<template< typename ... > class container_t, typename ... args> | |
| void | wait_all (container_t< data_t, args ... > &data) |
| template<template< typename ... > class container_t, typename ... args> | |
| int | wait_some (long n_to_wait, long long poll_interval, container_t< data_t, args ... > &data) |
| unsigned int | size () const noexcept |
| get the number of threads | |
Each thread in the pool services a specific CUDA device or CPU core. During execution, each thread uses the device_id request key to assign work to the CUDA device or CPU core that it services. The default number of threads per CUDA device is 8. This can be overridden via the n_threads_per_device parameter or the TECA_THREADS_PER_CUDA_DEVICE environment variable. Once a CUDA device reaches the maximum specified number of threads per device, no more threads will assign work to it. Once all available CUDA devices reach this maximum, all remaining threads in the pool will assign work to the CPU cores to which they are bound.
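The assignment policy described above can be sketched in a few lines. This is an illustration only, not the TECA implementation: the helper name assign_devices, the use of -1 to mark a CPU-bound thread, and the round-robin order in which devices are filled are all assumptions of this sketch. A negative limit is treated as "no limit", following the parameter documentation for n_threads_per_device.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Assumed marker for "this thread services a CPU core" (not taken from TECA).
constexpr int cpu_id = -1;

// Hypothetical helper: returns, for each thread in the pool, the id of the
// CUDA device it would service, or cpu_id once every device is full.
std::vector<int> assign_devices(int n_threads, int n_devices,
    int max_threads_per_device)
{
    assert(n_threads >= 0 && n_devices >= 0);

    // start with every thread servicing a CPU core
    std::vector<int> device_ids(n_threads, cpu_id);

    if (n_devices == 0)
        return device_ids;

    // number of threads that may service a GPU before all devices are full;
    // a negative limit means every thread may be assigned to a device
    long n_gpu_threads = n_threads;
    if (max_threads_per_device >= 0)
        n_gpu_threads = std::min<long>(n_threads,
            static_cast<long>(n_devices) * max_threads_per_device);

    // fill the devices round-robin (an assumption of this sketch)
    for (long i = 0; i < n_gpu_threads; ++i)
        device_ids[i] = static_cast<int>(i % n_devices);

    return device_ids;
}
```

With 6 threads, 2 devices, and a limit of 2 threads per device, the first four threads are spread over devices 0 and 1 and the last two fall back to the CPU.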
Upstream algorithms must examine the device_id field in the request to determine which CUDA device or CPU they should use for calculations. An algorithm should allocate memory and invoke computations only on its assigned device. Algorithms that do not support calculation on a CUDA GPU will ignore the assignment and make use of the CPU.
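A minimal sketch of the check an algorithm might perform is shown here. It does not use the TECA request API: request_t is a std::map stand-in, select_device is a hypothetical helper, and treating a negative or missing device_id as "run on the CPU" is an assumption of this sketch.

```cpp
#include <cassert>
#include <map>
#include <string>

// Stand-in for a request object; used only for illustration.
using request_t = std::map<std::string, int>;

enum class target { cpu, cuda };

// Where the algorithm will allocate memory and run its computation.
struct placement
{
    target where;
    int device_id; // valid only when where == target::cuda
};

// Hypothetical helper: decide where to run from the device_id request key.
placement select_device(const request_t &request, bool supports_cuda)
{
    // assumed convention: a negative or missing id selects the CPU
    int device_id = -1;
    auto it = request.find("device_id");
    if (it != request.end())
        device_id = it->second;

    // algorithms without a CUDA code path ignore the assignment
    if (!supports_cuda || device_id < 0)
        return {target::cpu, -1};

    return {target::cuda, device_id};
}
```

In a CUDA-capable algorithm the returned device_id would typically be passed to cudaSetDevice before any allocation or kernel launch.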
| teca_cuda_thread_pool< task_t, data_t >::teca_cuda_thread_pool | ( | MPI_Comm | comm, |
| int | n_threads, | ||
| int | n_threads_per_device, | ||
| bool | bind, | ||
| bool | verbose | ||
| ) |
construct/destruct the thread pool.
| [in] | comm | communicator over which to map threads. Use MPI_COMM_SELF for local mapping and MPI_COMM_NULL to exclude this process from execution. |
| [in] | n_threads | number of threads to create for the pool. -1 will create 1 thread per physical CPU core. All MPI ranks running on the same node are taken into account, resulting in 1 thread per core node-wide. |
| [in] | n_threads_per_device | number of threads to assign to each CUDA device. -1 for all threads assigned. |
| [in] | bind | bind each thread to a specific core. |
| [in] | verbose | print a report of the thread to core bindings |
| void teca_cuda_thread_pool< task_t, data_t >::push_task | ( | task_t & | task | ) |
add a data request task to the queue. The method itself returns void; the generated dataset is accessed through wait_all or wait_some.
| unsigned int teca_cuda_thread_pool< task_t, data_t >::size | ( | ) | const |
inline noexcept
get the number of threads
| void teca_cuda_thread_pool< task_t, data_t >::wait_all | ( | container_t< data_t, args ... > & | data | ) |
wait for all of the requests to execute and transfer datasets in the order that corresponding requests were added to the queue.
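The ordering guarantee can be illustrated with standard C++ futures. The class below is a hypothetical stand-in, not the TECA pool: it starts one thread per task instead of using a fixed-size pool, and the name ordered_queue is invented for this sketch. The point it shows is that keeping the futures in submission order is enough for wait_all to return datasets in that order, regardless of which task finishes first.

```cpp
#include <cassert>
#include <chrono>
#include <future>
#include <thread>
#include <vector>

// Hypothetical stand-in for the pool. One thread per task keeps the sketch
// short; a real pool would reuse a fixed set of worker threads.
template <typename data_t>
class ordered_queue
{
public:
    using task_t = std::packaged_task<data_t()>;

    // record the future in submission order, then start the task
    void push_task(task_t &task)
    {
        m_futures.push_back(task.get_future());
        m_threads.emplace_back(std::move(task));
    }

    // block on each future in submission order and append its result
    template <template <typename ...> class container_t, typename ... args>
    void wait_all(container_t<data_t, args ...> &data)
    {
        for (auto &f : m_futures)
            data.push_back(f.get());

        for (auto &t : m_threads)
            t.join();

        m_futures.clear();
        m_threads.clear();
    }

    ~ordered_queue()
    {
        for (auto &t : m_threads)
            if (t.joinable())
                t.join();
    }

private:
    std::vector<std::future<data_t>> m_futures;
    std::vector<std::thread> m_threads;
};
```

If three tasks are pushed and the first one takes the longest to run, wait_all still places its result first in the output container.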
| int teca_cuda_thread_pool< task_t, data_t >::wait_some | ( | long | n_to_wait, |
| long long | poll_interval, | ||
| container_t< data_t, args ... > & | data | ||
| ) |
Wait for some of the requests to execute. Datasets will be returned as they become ready. n_to_wait specifies how many datasets to gather, but there are three cases in which the number of datasets returned differs from n_to_wait. When n_to_wait is larger than the number of tasks remaining, datasets from all of the remaining tasks are returned. When n_to_wait is smaller than the number of datasets ready, all of the currently ready datasets are returned. Finally, when n_to_wait is < 1, the call blocks until all of the tasks complete, and all of the data is returned.
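The three cases can be made concrete with a polling loop over standard futures. This is a sketch under stated assumptions, not the TECA implementation: it is a free function over a std::vector of futures, the poll interval is assumed to be in nanoseconds, and the return value is assumed to be the number of tasks still outstanding, since the documentation above does not define either.

```cpp
#include <cassert>
#include <chrono>
#include <future>
#include <thread>
#include <vector>

// Hypothetical free-function version of wait_some. Returns the number of
// futures still pending (an assumption of this sketch).
template <typename data_t>
int wait_some(std::vector<std::future<data_t>> &futures, long n_to_wait,
    long long poll_interval_ns, std::vector<data_t> &data)
{
    long n_tasks = static_cast<long>(futures.size());

    // n_to_wait < 1 means wait for everything; a request for more than
    // remains is clamped to the number of remaining tasks
    if (n_to_wait < 1 || n_to_wait > n_tasks)
        n_to_wait = n_tasks;

    long n_gathered = 0;
    while (n_gathered < n_to_wait)
    {
        // sweep every pending future and collect all that are ready, so
        // more than n_to_wait datasets may be returned
        for (auto it = futures.begin(); it != futures.end();)
        {
            if (it->wait_for(std::chrono::seconds(0))
                == std::future_status::ready)
            {
                data.push_back(it->get());
                it = futures.erase(it);
                ++n_gathered;
            }
            else
            {
                ++it;
            }
        }

        // not enough yet, sleep before polling again
        if (n_gathered < n_to_wait)
            std::this_thread::sleep_for(
                std::chrono::nanoseconds(poll_interval_ns));
    }

    return static_cast<int>(futures.size());
}
```

Because each sweep visits every pending future, a call with n_to_wait = 1 returns two datasets when two are already ready, which is the second case described above.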