TECA
The Toolkit for Extreme Climate Analysis
teca_cuda_thread_pool< task_t, data_t > Class Template Reference

A class to manage a fixed size pool of threads that dispatch work.

#include <teca_cuda_thread_pool.h>

Public Member Functions

 teca_cuda_thread_pool (MPI_Comm comm, int n_threads, int n_threads_per_device, bool bind, bool verbose)
 
 teca_cuda_thread_pool (const teca_cuda_thread_pool &src)=delete
 
 teca_cuda_thread_pool (teca_cuda_thread_pool &&src)=delete
 
teca_cuda_thread_pool & operator= (const teca_cuda_thread_pool &src)=delete
 
teca_cuda_thread_pool & operator= (teca_cuda_thread_pool &&src)=delete
 
void push_task (task_t &task)
 
template<template< typename ... > class container_t, typename ... args>
void wait_all (container_t< data_t, args ... > &data)
 
template<template< typename ... > class container_t, typename ... args>
int wait_some (long n_to_wait, long long poll_interval, container_t< data_t, args ... > &data)
 
unsigned int size () const noexcept
 get the number of threads
 

Detailed Description

template<typename task_t, typename data_t>
class teca_cuda_thread_pool< task_t, data_t >

A class to manage a fixed size pool of threads that dispatch work.

Each thread in the pool services a specific CUDA device or CPU core. During execution each thread assigns work via the device_id request key to the CUDA device or CPU which it services. The default number of threads per CUDA device is 8. This can be overridden via the n_threads_per_device parameter or the TECA_THREADS_PER_CUDA_DEVICE environment variable. Once a CUDA device reaches the maximum specified number of threads per device, no more threads will assign work to it. Once all available CUDA devices reach the maximum specified number of threads per device, all remaining threads in the pool will assign work to the CPU cores to which they are bound.
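The assignment policy described above can be illustrated with a small stand-alone sketch. It is not TECA's implementation: the function name assign_devices, the round-robin order in which threads are matched to devices, and the use of -1 as the "run on the CPU" identifier are all assumptions made for illustration.

```cpp
#include <cassert>
#include <vector>

// Hypothetical sentinel meaning "this thread sends work to the CPU".
// The actual value used by the library is an assumption here.
constexpr int cpu_device_id = -1;

// Return one device id per pool thread. Devices are filled round-robin
// (an assumed order) until each holds n_threads_per_device threads; every
// thread left over falls back to the CPU.
std::vector<int> assign_devices(int n_threads, int n_devices,
    int n_threads_per_device)
{
    assert(n_threads >= 0 && n_devices >= 0 && n_threads_per_device >= 0);

    std::vector<int> device_ids(n_threads, cpu_device_id);

    // total number of threads that the devices can absorb
    long capacity = static_cast<long>(n_devices) * n_threads_per_device;

    for (int i = 0; i < n_threads && i < capacity; ++i)
        device_ids[i] = i % n_devices;

    return device_ids;
}
```

With 6 threads, 2 devices, and at most 2 threads per device, the first four threads are spread over devices 0 and 1 and the last two are left on the CPU.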

Upstream algorithms must examine the device_id field in the request to determine which CUDA device or CPU they should use for calculations. The algorithm should allocate memory and invoke computations only on the assigned device. Algorithms that do not support calculation on CUDA GPU will ignore the assignment and make use of the CPU.
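As a rough illustration of how an algorithm might act on the assignment, the sketch below picks a backend from the device_id carried by a request. The request struct, the select_backend function, and the convention that negative ids mean "CPU" are hypothetical stand-ins and are not taken from the TECA API.

```cpp
#include <string>

// Hypothetical stand-in for a request carrying the device_id key.
struct request
{
    int device_id;
};

// Decide where to allocate and compute. An algorithm without CUDA support
// ignores the assignment and uses the CPU. Treating negative ids as "CPU"
// is an assumption of this sketch.
std::string select_backend(const request &req, bool algorithm_supports_cuda)
{
    if (algorithm_supports_cuda && req.device_id >= 0)
        return "cuda:" + std::to_string(req.device_id);

    return "cpu";
}
```

In a real algorithm the CUDA branch would typically make the assigned device current before allocating memory or launching kernels, so that all work stays on that device.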

Constructor & Destructor Documentation

◆ teca_cuda_thread_pool()

template<typename task_t , typename data_t >
teca_cuda_thread_pool< task_t, data_t >::teca_cuda_thread_pool ( MPI_Comm  comm,
int  n_threads,
int  n_threads_per_device,
bool  bind,
bool  verbose 
)

construct/destruct the thread pool.

Parameters
[in]  comm                  communicator over which to map threads. Use MPI_COMM_SELF for local mapping and MPI_COMM_NULL to exclude this process from execution.
[in]  n_threads             number of threads to create for the pool. -1 will create 1 thread per physical CPU core. All MPI ranks running on the same node are taken into account, resulting in 1 thread per core node wide.
[in]  n_threads_per_device  number of threads to assign to each CUDA device. -1 for all threads assigned.
[in]  bind                  bind each thread to a specific core.
[in]  verbose               print a report of the thread to core bindings.

Member Function Documentation

◆ push_task()

template<typename task_t , typename data_t >
void teca_cuda_thread_pool< task_t, data_t >::push_task ( task_t &  task)

add a data request task to the queue. The call itself returns nothing; the generated dataset is accessed afterwards through wait_all or wait_some.

◆ size()

template<typename task_t , typename data_t >
unsigned int teca_cuda_thread_pool< task_t, data_t >::size ( ) const
inline noexcept

get the number of threads

◆ wait_all()

template<typename task_t , typename data_t >
template<template< typename ... > class container_t, typename ... args>
void teca_cuda_thread_pool< task_t, data_t >::wait_all ( container_t< data_t, args ... > &  data)

wait for all of the requests to execute and transfer datasets in the order that corresponding requests were added to the queue.
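The ordering guarantee can be sketched with standard futures. The class below, tiny_pool, is a hypothetical miniature and not the TECA implementation: it assumes task_t is a std::packaged_task, starts one thread per task rather than maintaining a fixed size pool, and ignores device assignment entirely. It only shows why collecting futures in queue order yields datasets in the order the tasks were pushed.

```cpp
#include <future>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical miniature of the push_task / wait_all pattern.
template <typename data_t>
class tiny_pool
{
public:
    using task_t = std::packaged_task<data_t()>;

    ~tiny_pool()
    {
        for (auto &t : m_threads)
            if (t.joinable())
                t.join();
    }

    // Keep the task's future in queue order, then start the task.
    // Simplification: one thread per task instead of a fixed size pool.
    void push_task(task_t &task)
    {
        m_futures.push_back(task.get_future());
        m_threads.emplace_back(std::move(task));
    }

    // Block on each future in the order the tasks were pushed, so the
    // output order matches the push order regardless of completion order.
    template <template <typename...> class container_t, typename... args>
    void wait_all(container_t<data_t, args...> &data)
    {
        for (auto &f : m_futures)
            data.push_back(f.get());
        m_futures.clear();

        for (auto &t : m_threads)
            t.join();
        m_threads.clear();
    }

private:
    std::vector<std::future<data_t>> m_futures;
    std::vector<std::thread> m_threads;
};
```

Pushing four tasks that return 0, 1, 4, and 9 and then calling wait_all on a std::vector<int> fills the vector in that same order, even if the tasks finish out of order.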

◆ wait_some()

template<typename task_t , typename data_t >
template<template< typename ... > class container_t, typename ... args>
int teca_cuda_thread_pool< task_t, data_t >::wait_some ( long  n_to_wait,
long long  poll_interval,
container_t< data_t, args ... > &  data 
)

wait for some of the requests to execute. Datasets are returned as they become ready. n_to_wait specifies how many datasets to gather, but there are three cases in which the number of datasets returned differs from n_to_wait. When n_to_wait is larger than the number of tasks remaining, datasets from all of the remaining tasks are returned. When n_to_wait is smaller than the number of datasets ready, all of the currently ready datasets are returned. Finally, when n_to_wait is < 1, the call blocks until all of the tasks complete, and all of the data is returned.
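The three cases can be made concrete with a polling loop over standard futures. This is a sketch under stated assumptions, not the library's code: it is written as a free function that takes the pending futures explicitly, it treats poll_interval as nanoseconds, and it returns the number of tasks still pending. The units and the meaning of the return value are not specified above and are guesses made for illustration.

```cpp
#include <cassert>
#include <chrono>
#include <future>
#include <thread>
#include <vector>

// Hypothetical free-function version of the wait_some behavior.
// Assumed: poll interval in nanoseconds, return value = tasks still pending.
template <typename data_t>
int wait_some(std::vector<std::future<data_t>> &futures, long n_to_wait,
    long long poll_interval_ns, std::vector<data_t> &data)
{
    assert(poll_interval_ns >= 0);

    long n_tasks = static_cast<long>(futures.size());

    // n_to_wait < 1 waits for everything; a request for more than remain
    // is clamped to the number of remaining tasks.
    if (n_to_wait < 1 || n_to_wait > n_tasks)
        n_to_wait = n_tasks;

    long n_gathered = 0;
    while (n_gathered < n_to_wait)
    {
        // Sweep once over the pending futures and take every one that is
        // ready, which may gather more than n_to_wait.
        for (auto it = futures.begin(); it != futures.end();)
        {
            if (it->wait_for(std::chrono::seconds(0)) ==
                std::future_status::ready)
            {
                data.push_back(it->get());
                it = futures.erase(it);
                ++n_gathered;
            }
            else
            {
                ++it;
            }
        }

        // Not enough yet: sleep before polling again.
        if (n_gathered < n_to_wait)
            std::this_thread::sleep_for(
                std::chrono::nanoseconds(poll_interval_ns));
    }

    return static_cast<int>(futures.size());
}
```

For example, with three pending tasks of which two are already complete, asking for one dataset returns both ready datasets and leaves one task pending. The sketch assumes the futures come from std::promise or std::packaged_task; a deferred future from std::async never reports ready through wait_for and would stall the loop.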


The documentation for this class was generated from the following file: