A class to manage a fixed size pool of threads that dispatch work.

#include <teca_cuda_thread_pool.h>

Public Member Functions
	teca_cuda_thread_pool (MPI_Comm comm, int n_threads, int n_threads_per_device, bool bind, bool verbose)

	teca_cuda_thread_pool (const teca_cuda_thread_pool &src)=delete

	teca_cuda_thread_pool (teca_cuda_thread_pool &&src)=delete

teca_cuda_thread_pool &	operator= (const teca_cuda_thread_pool &src)=delete

teca_cuda_thread_pool &	operator= (teca_cuda_thread_pool &&src)=delete

void	push_task (task_t &task)

template<template< typename ... > class container_t, typename ... args>
void	wait_all (container_t< data_t, args ... > &data)

template<template< typename ... > class container_t, typename ... args>
int	wait_some (long n_to_wait, long long poll_interval, container_t< data_t, args ... > &data)

unsigned int	size () const noexcept
	get the number of threads

Detailed Description

template<typename task_t, typename data_t>
class teca_cuda_thread_pool< task_t, data_t >

A class to manage a fixed size pool of threads that dispatch work.

Each thread in the pool services a specific CUDA device or CPU core. During execution each thread uses the device_id request key to assign work to the CUDA device or CPU core that it services. The default number of threads per CUDA device is 8. This can be overridden via the n_threads_per_device parameter or the TECA_THREADS_PER_CUDA_DEVICE environment variable. Once a CUDA device reaches the maximum specified number of threads per device, no more threads will assign work to it. Once all available CUDA devices reach that maximum, all remaining threads in the pool will assign work to the CPU cores to which they are bound.

Upstream algorithms must examine the device_id field in the request to determine which CUDA device or CPU they should use for calculations. The algorithm should allocate memory and invoke computations only on the assigned device. Algorithms that do not support calculation on CUDA GPU will ignore the assignment and make use of the CPU.
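The assignment policy described above can be illustrated with a small standalone sketch. This is not TECA's implementation: the function name assign_devices_sketch, the use of -1 to mean "run on the CPU", and the round-robin order over devices are all assumptions made for illustration. Only the per-device cap and the fallback to the CPU come from the description.

```cpp
#include <vector>

// Hypothetical illustration of the policy described above; not TECA code.
// Returns one entry per thread: a CUDA device id, or -1 meaning "use the CPU"
// (the -1 sentinel and the round-robin order are assumptions).
std::vector<int> assign_devices_sketch(int n_threads, int n_devices,
    int n_threads_per_device)
{
    if (n_threads < 1)
        return {};

    // start with every thread servicing the CPU
    std::vector<int> device_ids(n_threads, -1);
    if (n_devices < 1)
        return device_ids;

    // a negative n_threads_per_device means all threads are assigned to devices
    long capacity = n_threads_per_device < 0 ?
        static_cast<long>(n_threads) :
        static_cast<long>(n_devices) * n_threads_per_device;

    // hand out device ids until every device has reached its cap
    for (int i = 0; i < n_threads && i < capacity; ++i)
        device_ids[i] = i % n_devices;

    return device_ids;
}
```

Under these assumptions, 6 threads with 2 devices and 2 threads per device gives device ids 0, 1, 0, 1 for the first four threads, and the last two threads fall back to the CPU.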

Constructor & Destructor Documentation

◆ teca_cuda_thread_pool()

template<typename task_t , typename data_t >

teca_cuda_thread_pool< task_t, data_t >::teca_cuda_thread_pool	(	MPI_Comm	comm,
		int	n_threads,
		int	n_threads_per_device,
		bool	bind,
		bool	verbose
	)

construct/destruct the thread pool.

Parameters

[in]	comm	communicator over which to map threads. Use MPI_COMM_SELF for local mapping and MPI_COMM_NULL to exclude this process from execution.
[in]	n_threads	number of threads to create for the pool. -1 will create 1 thread per physical CPU core. All MPI ranks running on the same node are taken into account, resulting in 1 thread per core node wide.
[in]	n_threads_per_device	number of threads to assign to each CUDA device. -1 for all threads assigned.
[in]	bind	bind each thread to a specific core.
[in]	verbose	print a report of the thread to core bindings

Member Function Documentation

◆ push_task()

template<typename task_t , typename data_t >

void teca_cuda_thread_pool< task_t, data_t >::push_task ( task_t & task )

add a data request task to the queue. The call itself returns void; the generated dataset is retrieved later through wait_all or wait_some.

◆ size()

template<typename task_t , typename data_t >

unsigned int teca_cuda_thread_pool< task_t, data_t >::size ( ) const

inline noexcept

get the number of threads

◆ wait_all()

template<typename task_t , typename data_t >

template<template< typename ... > class container_t, typename ... args>

void teca_cuda_thread_pool< task_t, data_t >::wait_all ( container_t< data_t, args ... > & data )

wait for all of the requests to execute and transfer datasets in the order that corresponding requests were added to the queue.

◆ wait_some()

template<typename task_t , typename data_t >

template<template< typename ... > class container_t, typename ... args>

int teca_cuda_thread_pool< task_t, data_t >::wait_some	(	long	n_to_wait,
		long long	poll_interval,
		container_t< data_t, args ... > &	data
	)

wait for some of the requests to execute. Datasets will be returned as they become ready. n_to_wait specifies how many datasets to gather, but there are three cases when the number of datasets returned differs from n_to_wait. When n_to_wait is larger than the number of tasks remaining, datasets from all of the remaining tasks are returned. When n_to_wait is smaller than the number of datasets ready, all of the currently ready data are returned. Finally, when n_to_wait is < 1 the call blocks until all of the tasks complete and all of the data are returned.
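The three cases can be made concrete with a standalone sketch built on std::future. This is not TECA's implementation: wait_some_sketch, the std::list of pending futures, the nanosecond unit for the poll interval, and the choice to return the number of tasks still pending are all assumptions, since the text above does not state the meaning of the return value or the unit of poll_interval.

```cpp
#include <chrono>
#include <future>
#include <list>
#include <thread>
#include <vector>

// Hypothetical stand-in for wait_some; not TECA code.
// Polls the pending futures until at least n_to_wait results are ready,
// then returns every result that was ready at that point.
// Return value (number of tasks still pending) is an assumption.
// Futures must not be deferred, or wait_for never reports ready.
template <typename data_t>
int wait_some_sketch(std::list<std::future<data_t>> &pending,
    long n_to_wait, long long poll_interval_ns, std::vector<data_t> &data)
{
    long n_remaining = static_cast<long>(pending.size());

    // n_to_wait < 1, or more than remain: gather everything
    if (n_to_wait < 1 || n_to_wait > n_remaining)
        n_to_wait = n_remaining;

    long n_gathered = 0;
    while (true)
    {
        // sweep the whole list so that all currently ready data are taken,
        // even when that is more than n_to_wait
        for (auto it = pending.begin(); it != pending.end();)
        {
            if (it->wait_for(std::chrono::seconds(0)) ==
                std::future_status::ready)
            {
                data.push_back(it->get());
                it = pending.erase(it);
                ++n_gathered;
            }
            else
            {
                ++it;
            }
        }

        if (n_gathered >= n_to_wait)
            break;

        // nothing more to take yet, sleep before polling again
        std::this_thread::sleep_for(
            std::chrono::nanoseconds(poll_interval_ns));
    }

    return static_cast<int>(pending.size());
}
```

With three pending tasks of which two have already finished, a call with n_to_wait set to 1 returns both finished results in this sketch, which matches the second case above. A later call with n_to_wait set to 0 blocks until the last task completes.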

The documentation for this class was generated from the following file:

stable/core/teca_cuda_thread_pool.h
