Provided by: nvidia-cuda-dev_10.1.243-3_amd64 

NAME
Execution Control -
Functions
__cudart_builtin__ cudaError_t cudaFuncGetAttributes (struct cudaFuncAttributes *attr, const void *func)
Find out attributes for a given function.
__cudart_builtin__ cudaError_t cudaFuncSetAttribute (const void *func, enum cudaFuncAttribute attr, int
value)
Set attributes for a given function.
cudaError_t cudaFuncSetCacheConfig (const void *func, enum cudaFuncCache cacheConfig)
Sets the preferred cache configuration for a device function.
cudaError_t cudaFuncSetSharedMemConfig (const void *func, enum cudaSharedMemConfig config)
Sets the shared memory configuration for a device function.
__device__ __cudart_builtin__ void * cudaGetParameterBuffer (size_t alignment, size_t size)
Obtains a parameter buffer.
__device__ __cudart_builtin__ void * cudaGetParameterBufferV2 (void *func, dim3 gridDimension, dim3
blockDimension, unsigned int sharedMemSize)
Launches a specified kernel.
cudaError_t cudaLaunchCooperativeKernel (const void *func, dim3 gridDim, dim3 blockDim, void **args,
size_t sharedMem, cudaStream_t stream)
Launches a device function where thread blocks can cooperate and synchronize as they execute.
cudaError_t cudaLaunchCooperativeKernelMultiDevice (struct cudaLaunchParams *launchParamsList, unsigned
int numDevices, unsigned int flags=0)
Launches device functions on multiple devices where thread blocks can cooperate and synchronize as
they execute.
cudaError_t cudaLaunchHostFunc (cudaStream_t stream, cudaHostFn_t fn, void *userData)
Enqueues a host function call in a stream.
cudaError_t cudaLaunchKernel (const void *func, dim3 gridDim, dim3 blockDim, void **args, size_t
sharedMem, cudaStream_t stream)
Launches a device function.
__CUDA_DEPRECATED cudaError_t cudaSetDoubleForDevice (double *d)
Converts a double argument to be executed on a device.
__CUDA_DEPRECATED cudaError_t cudaSetDoubleForHost (double *d)
Converts a double argument after execution on a device.
Detailed Description
This section describes the execution control functions of the CUDA runtime application programming
interface.
Some functions have overloaded C++ API template versions documented separately in the C++ API Routines
module.
Function Documentation
__cudart_builtin__ cudaError_t cudaFuncGetAttributes (struct cudaFuncAttributes * attr, const void * func)
This function obtains the attributes of a function specified via func. func is a device function symbol
and must be declared as a __global__ function. The fetched attributes are placed in attr. If the
specified function does not exist, then cudaErrorInvalidDeviceFunction is returned. For templated
functions, pass the function symbol as follows: func_name<template_arg_0,...,template_arg_N>
Note that some function attributes such as maxThreadsPerBlock may vary based on the device that is
currently being used.
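For illustration, here is a minimal host-side sketch of querying a kernel's attributes; the kernel
myKernel is a placeholder:

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void myKernel(float *data) { data[threadIdx.x] *= 2.0f; }

    int main(void)
    {
        struct cudaFuncAttributes attr;
        // Pass the kernel symbol itself; string names were removed in CUDA 5.0.
        cudaError_t err = cudaFuncGetAttributes(&attr, (const void *)myKernel);
        if (err != cudaSuccess) {
            printf("cudaFuncGetAttributes: %s\n", cudaGetErrorString(err));
            return 1;
        }
        printf("maxThreadsPerBlock: %d\n", attr.maxThreadsPerBlock);
        printf("numRegs:            %d\n", attr.numRegs);
        printf("sharedSizeBytes:    %zu\n", attr.sharedSizeBytes);
        return 0;
    }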
Parameters:
attr - Return pointer to function's attributes
func - Device function symbol
Returns:
cudaSuccess, cudaErrorInvalidDeviceFunction
Note:
Note that this function may also return error codes from previous, asynchronous launches.
Use of a string naming a function as the func parameter was deprecated in CUDA 4.1 and removed in
CUDA 5.0.
See also:
cudaFuncSetCacheConfig (C API), cudaFuncGetAttributes (C++ API), cudaLaunchKernel (C API),
cudaSetDoubleForDevice, cudaSetDoubleForHost, cuFuncGetAttribute
__cudart_builtin__ cudaError_t cudaFuncSetAttribute (const void * func, enum cudaFuncAttribute attr, int
value)
This function sets an attribute of the function specified via func. func must be a pointer to a function
that executes on the device and is declared as a __global__ function. The attribute named by attr is set
to value. If the specified function does not exist, then cudaErrorInvalidDeviceFunction is returned. If
the specified attribute cannot be written, or if the value is incorrect, then cudaErrorInvalidValue is
returned.
Valid values for attr are:
• cudaFuncAttributeMaxDynamicSharedMemorySize - The requested maximum size in bytes of dynamically-
allocated shared memory. The sum of this value and the function attribute sharedSizeBytes cannot exceed
the device attribute cudaDevAttrMaxSharedMemoryPerBlockOptin. The maximal size of requestable dynamic
shared memory may differ by GPU architecture.
• cudaFuncAttributePreferredSharedMemoryCarveout - On devices where the L1 cache and shared memory use
the same hardware resources, this sets the shared memory carveout preference, in percent of the total
shared memory. See cudaDevAttrMaxSharedMemoryPerMultiprocessor. This is only a hint, and the driver can
choose a different ratio if required to execute the function.
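As an illustration, here is a sketch of opting a kernel into a larger dynamic shared memory allocation;
the kernel myKernel and the 64 KB figure are placeholders, and the opt-in maximum is device-dependent:

    #include <cuda_runtime.h>

    extern __global__ void myKernel(float *data);  // hypothetical kernel

    cudaError_t enableLargeSharedMem(void)
    {
        // Request up to 64 KB of dynamically allocated shared memory for this
        // kernel. The sum of this value and the kernel's static sharedSizeBytes
        // must not exceed cudaDevAttrMaxSharedMemoryPerBlockOptin.
        return cudaFuncSetAttribute((const void *)myKernel,
                                    cudaFuncAttributeMaxDynamicSharedMemorySize,
                                    64 * 1024);
    }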
Parameters:
func - Function to set the attribute for
attr - Attribute to set
value - Value to set
Returns:
cudaSuccess, cudaErrorInvalidDeviceFunction, cudaErrorInvalidValue
Note:
Note that this function may also return error codes from previous, asynchronous launches.
See also:
cudaLaunchKernel (C++ API), cudaFuncSetCacheConfig (C++ API), cudaFuncGetAttributes (C API),
cudaSetDoubleForDevice, cudaSetDoubleForHost
cudaError_t cudaFuncSetCacheConfig (const void * func, enum cudaFuncCache cacheConfig)
On devices where the L1 cache and shared memory use the same hardware resources, this sets through
cacheConfig the preferred cache configuration for the function specified via func. This is only a
preference. The runtime will use the requested configuration if possible, but it is free to choose a
different configuration if required to execute func.
func is a device function symbol and must be declared as a __global__ function. If the specified function
does not exist, then cudaErrorInvalidDeviceFunction is returned. For templated functions, pass the
function symbol as follows: func_name<template_arg_0,...,template_arg_N>
This setting does nothing on devices where the sizes of the L1 cache and shared memory are fixed.
Launching a kernel with a different preference than the most recent preference setting may insert a
device-side synchronization point.
The supported cache configurations are:
• cudaFuncCachePreferNone: no preference for shared memory or L1 (default)
• cudaFuncCachePreferShared: prefer larger shared memory and smaller L1 cache
• cudaFuncCachePreferL1: prefer larger L1 cache and smaller shared memory
• cudaFuncCachePreferEqual: prefer equal size L1 cache and shared memory
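For example, a kernel that uses little shared memory might hint a preference for a larger L1 cache;
myKernel is a placeholder, and the runtime remains free to ignore the hint:

    // Ask the runtime to favor L1 cache over shared memory for myKernel.
    cudaFuncSetCacheConfig((const void *)myKernel, cudaFuncCachePreferL1);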
Parameters:
func - Device function symbol
cacheConfig - Requested cache configuration
Returns:
cudaSuccess, cudaErrorInvalidDeviceFunction
Note:
Note that this function may also return error codes from previous, asynchronous launches.
Use of a string naming a function as the func parameter was deprecated in CUDA 4.1 and removed in
CUDA 5.0.
See also:
cudaFuncSetCacheConfig (C++ API), cudaFuncGetAttributes (C API), cudaLaunchKernel (C API),
cudaSetDoubleForDevice, cudaSetDoubleForHost, cudaThreadGetCacheConfig, cudaThreadSetCacheConfig,
cuFuncSetCacheConfig
cudaError_t cudaFuncSetSharedMemConfig (const void * func, enum cudaSharedMemConfig config)
On devices with configurable shared memory banks, this function will force all subsequent launches of the
specified device function to have the given shared memory bank size configuration. On any given launch of
the function, the shared memory configuration of the device will be temporarily changed if needed to suit
the function's preferred configuration. Changes in shared memory configuration between subsequent
launches of functions may introduce a device-side synchronization point.
Any per-function setting of shared memory bank size set via cudaFuncSetSharedMemConfig will override the
device wide setting set by cudaDeviceSetSharedMemConfig.
Changing the shared memory bank size will not increase shared memory usage or affect occupancy of
kernels, but may have major effects on performance. Larger bank sizes will allow for greater potential
bandwidth to shared memory, but will change what kinds of accesses to shared memory will result in bank
conflicts.
This function will do nothing on devices with fixed shared memory bank size.
For templated functions, pass the function symbol as follows:
func_name<template_arg_0,...,template_arg_N>
The supported bank configurations are:
• cudaSharedMemBankSizeDefault: use the device's shared memory configuration when launching this
function.
• cudaSharedMemBankSizeFourByte: set shared memory bank width to be four bytes natively when launching
this function.
• cudaSharedMemBankSizeEightByte: set shared memory bank width to be eight bytes natively when launching
this function.
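For instance, a kernel whose shared-memory accesses are predominantly 64-bit might request eight-byte
banks; myKernel is a placeholder:

    // Use eight-byte shared memory banks for launches of myKernel, overriding
    // any device-wide setting made via cudaDeviceSetSharedMemConfig.
    cudaFuncSetSharedMemConfig((const void *)myKernel,
                               cudaSharedMemBankSizeEightByte);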
Parameters:
func - Device function symbol
config - Requested shared memory configuration
Returns:
cudaSuccess, cudaErrorInvalidDeviceFunction, cudaErrorInvalidValue
Note:
Note that this function may also return error codes from previous, asynchronous launches.
Use of a string naming a function as the func parameter was deprecated in CUDA 4.1 and removed in
CUDA 5.0.
See also:
cudaDeviceSetSharedMemConfig, cudaDeviceGetSharedMemConfig, cudaDeviceSetCacheConfig,
cudaDeviceGetCacheConfig, cudaFuncSetCacheConfig, cuFuncSetSharedMemConfig
__device__ __cudart_builtin__ void* cudaGetParameterBuffer (size_t alignment, size_t size)
Obtains a parameter buffer which can be filled with parameters for a kernel launch. Parameters passed to
cudaLaunchDevice must be allocated via this function.
This is a low-level API and can only be accessed from Parallel Thread Execution (PTX). CUDA user code
should use <<< >>> to launch kernels.
Parameters:
alignment - Specifies alignment requirement of the parameter buffer
size - Specifies size requirement in bytes
Returns:
Returns a pointer to the allocated parameter buffer
Note:
Note that this function may also return error codes from previous, asynchronous launches.
See also:
cudaLaunchDevice
__device__ __cudart_builtin__ void* cudaGetParameterBufferV2 (void * func, dim3 gridDimension, dim3
blockDimension, unsigned int sharedMemSize)
Launches a specified kernel with the specified parameter buffer. A parameter buffer can be obtained by
calling cudaGetParameterBuffer().
This is a low-level API and can only be accessed from Parallel Thread Execution (PTX). CUDA user code
should use <<< >>> to launch kernels.
Parameters:
func - Pointer to the kernel to be launched
gridDimension - Specifies grid dimensions
blockDimension - Specifies block dimensions
sharedMemSize - Specifies size of shared memory
Returns:
cudaSuccess, cudaErrorInvalidDevice, cudaErrorLaunchMaxDepthExceeded, cudaErrorInvalidConfiguration,
cudaErrorStartupFailure, cudaErrorLaunchPendingCountExceeded, cudaErrorLaunchOutOfResources
Note:
Note that this function may also return error codes from previous, asynchronous launches.
Please refer to Execution Configuration and Parameter Buffer Layout from the CUDA Programming Guide
for the detailed descriptions of launch configuration and parameter layout respectively.
See also:
cudaGetParameterBuffer
cudaError_t cudaLaunchCooperativeKernel (const void * func, dim3 gridDim, dim3 blockDim, void ** args, size_t
sharedMem, cudaStream_t stream)
The function invokes the kernel func on a gridDim (gridDim.x × gridDim.y × gridDim.z) grid of blocks.
Each block contains blockDim (blockDim.x × blockDim.y × blockDim.z) threads.
The device on which this kernel is invoked must have a non-zero value for the device attribute
cudaDevAttrCooperativeLaunch.
The total number of blocks launched cannot exceed the maximum number of blocks per multiprocessor as
returned by cudaOccupancyMaxActiveBlocksPerMultiprocessor (or
cudaOccupancyMaxActiveBlocksPerMultiprocessorWithFlags) times the number of multiprocessors as specified
by the device attribute cudaDevAttrMultiProcessorCount.
The kernel cannot make use of CUDA dynamic parallelism.
If the kernel has N parameters, args should point to an array of N pointers. Each pointer, from args[0]
to args[N - 1], points to the region of memory from which the actual parameter will be copied.
For templated functions, pass the function symbol as follows:
func_name<template_arg_0,...,template_arg_N>
sharedMem sets the amount of dynamic shared memory that will be available to each thread block.
stream specifies the stream with which the invocation is associated.
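A host-side sketch of a cooperative launch follows; the kernel coopKernel and the block size are
placeholders, and the grid is sized with the occupancy API so the limit described above is respected:

    #include <cooperative_groups.h>
    #include <cuda_runtime.h>
    namespace cg = cooperative_groups;

    __global__ void coopKernel(int *data)
    {
        cg::grid_group grid = cg::this_grid();
        data[grid.thread_rank()] = 1;
        grid.sync();  // grid-wide barrier, valid only in cooperative launches
    }

    cudaError_t launchCooperative(int *d_data, cudaStream_t stream)
    {
        int device = 0, supported = 0, numSms = 0, blocksPerSm = 0;
        cudaGetDevice(&device);
        cudaDeviceGetAttribute(&supported, cudaDevAttrCooperativeLaunch, device);
        if (!supported)
            return cudaErrorNotSupported;
        cudaDeviceGetAttribute(&numSms, cudaDevAttrMultiProcessorCount, device);

        dim3 blockDim(256);
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSm, coopKernel,
                                                      blockDim.x, 0);
        dim3 gridDim(blocksPerSm * numSms);  // must not exceed this bound

        // d_data must hold at least gridDim.x * blockDim.x elements.
        void *args[] = { &d_data };
        return cudaLaunchCooperativeKernel((const void *)coopKernel, gridDim,
                                           blockDim, args, 0, stream);
    }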
Parameters:
func - Device function symbol
gridDim - Grid dimensions
blockDim - Block dimensions
args - Arguments
sharedMem - Shared memory
stream - Stream identifier
Returns:
cudaSuccess, cudaErrorInvalidDeviceFunction, cudaErrorInvalidConfiguration, cudaErrorLaunchFailure,
cudaErrorLaunchTimeout, cudaErrorLaunchOutOfResources, cudaErrorCooperativeLaunchTooLarge,
cudaErrorSharedObjectInitFailed
Note:
This function uses standard default stream semantics.
Note that this function may also return error codes from previous, asynchronous launches.
See also:
cudaLaunchCooperativeKernel (C++ API), cudaLaunchCooperativeKernelMultiDevice,
cuLaunchCooperativeKernel
cudaError_t cudaLaunchCooperativeKernelMultiDevice (struct cudaLaunchParams * launchParamsList, unsigned int
numDevices, unsigned int flags = 0)
Invokes kernels as specified in the launchParamsList array where each element of the array specifies all
the parameters required to perform a single kernel launch. These kernels can cooperate and synchronize as
they execute. The size of the array is specified by numDevices.
No two kernels can be launched on the same device. All the devices targeted by this multi-device launch
must be identical. All devices must have a non-zero value for the device attribute
cudaDevAttrCooperativeMultiDeviceLaunch.
The same kernel must be launched on all devices. Note that any __device__ or __constant__ variables are
independently instantiated on every device. It is the application's responsibility to ensure these
variables are initialized and used appropriately.
The size of the grids as specified in blocks, the size of the blocks themselves and the amount of shared
memory used by each thread block must also match across all launched kernels.
The streams used to launch these kernels must have been created via either cudaStreamCreate,
cudaStreamCreateWithFlags, or cudaStreamCreateWithPriority. The NULL stream, cudaStreamLegacy, and
cudaStreamPerThread cannot be used.
The total number of blocks launched per kernel cannot exceed the maximum number of blocks per
multiprocessor as returned by cudaOccupancyMaxActiveBlocksPerMultiprocessor (or
cudaOccupancyMaxActiveBlocksPerMultiprocessorWithFlags) times the number of multiprocessors as specified
by the device attribute cudaDevAttrMultiProcessorCount. Since the total number of blocks launched per
device has to match across all devices, the maximum number of blocks that can be launched per device will
be limited by the device with the fewest multiprocessors.
The kernel cannot make use of CUDA dynamic parallelism.
The cudaLaunchParams structure is defined as:
    struct cudaLaunchParams
    {
        void *func;
        dim3 gridDim;
        dim3 blockDim;
        void **args;
        size_t sharedMem;
        cudaStream_t stream;
    };
where:
• cudaLaunchParams::func specifies the kernel to be launched. The same function must be launched on all
devices. For templated functions, pass the function symbol as follows:
func_name<template_arg_0,...,template_arg_N>
• cudaLaunchParams::gridDim specifies the width, height and depth of the grid in blocks. This must match
across all kernels launched.
• cudaLaunchParams::blockDim is the width, height and depth of each thread block. This must match across
all kernels launched.
• cudaLaunchParams::args specifies the arguments to the kernel. If the kernel has N parameters, then
cudaLaunchParams::args should point to an array of N pointers. Each pointer, from
cudaLaunchParams::args[0] to cudaLaunchParams::args[N - 1], points to the region of memory from which
the actual parameter will be copied.
• cudaLaunchParams::sharedMem is the dynamic shared-memory size per thread block in bytes. This must
match across all kernels launched.
• cudaLaunchParams::stream is the handle to the stream to perform the launch in. This cannot be the NULL
stream or cudaStreamLegacy or cudaStreamPerThread.
By default, the kernel won't begin execution on any GPU until all prior work in all the specified streams
has completed. This behavior can be overridden by specifying the flag
cudaCooperativeLaunchMultiDeviceNoPreSync. When this flag is specified, each kernel will only wait for
prior work in the stream corresponding to that GPU to complete before it begins execution.
Similarly, by default, any subsequent work pushed in any of the specified streams will not begin
execution until the kernels on all GPUs have completed. This behavior can be overridden by specifying the
flag cudaCooperativeLaunchMultiDeviceNoPostSync. When this flag is specified, any subsequent work pushed
in any of the specified streams will only wait for the kernel launched on the GPU corresponding to that
stream to complete before it begins execution.
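A minimal sketch of filling launchParamsList for two identical devices; coopKernel, the per-device
streams, argument arrays, and grid sizing are assumed to have been prepared as described above:

    struct cudaLaunchParams params[2];
    for (int dev = 0; dev < 2; ++dev) {
        params[dev].func      = (void *)coopKernel;    // same kernel on all devices
        params[dev].gridDim   = dim3(blocksPerDevice); // must match across devices
        params[dev].blockDim  = dim3(256);             // must match across devices
        params[dev].args      = kernelArgs[dev];       // per-device argument array
        params[dev].sharedMem = 0;                     // must match across devices
        params[dev].stream    = streams[dev];          // non-NULL, per-device stream
    }
    cudaLaunchCooperativeKernelMultiDevice(params, 2, 0);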
Parameters:
launchParamsList - List of launch parameters, one per device
numDevices - Size of the launchParamsList array
flags - Flags to control launch behavior
Returns:
cudaSuccess, cudaErrorInvalidDeviceFunction, cudaErrorInvalidConfiguration, cudaErrorLaunchFailure,
cudaErrorLaunchTimeout, cudaErrorLaunchOutOfResources, cudaErrorCooperativeLaunchTooLarge,
cudaErrorSharedObjectInitFailed
Note:
This function uses standard default stream semantics.
Note that this function may also return error codes from previous, asynchronous launches.
See also:
cudaLaunchCooperativeKernel (C++ API), cudaLaunchCooperativeKernel,
cuLaunchCooperativeKernelMultiDevice
cudaError_t cudaLaunchHostFunc (cudaStream_t stream, cudaHostFn_t fn, void * userData)
Enqueues a host function to run in a stream. The function will be called after currently enqueued work
and will block work added after it.
The host function must not make any CUDA API calls. Attempting to use a CUDA API may result in
cudaErrorNotPermitted, but this is not required. The host function must not perform any synchronization
that may depend on outstanding CUDA work not mandated to run earlier. Host functions without a mandated
order (such as in independent streams) execute in undefined order and may be serialized.
For the purposes of Unified Memory, execution makes a number of guarantees:
• The stream is considered idle for the duration of the function's execution. Thus, for example, the
function may always use memory attached to the stream it was enqueued in.
• The start of execution of the function has the same effect as synchronizing an event recorded in the
same stream immediately prior to the function. It thus synchronizes streams which have been 'joined'
prior to the function.
• Adding device work to any stream does not have the effect of making the stream active until all
preceding host functions and stream callbacks have executed. Thus, for example, a function might use
global attached memory even if work has been added to another stream, if the work has been ordered
behind the function call with an event.
• Completion of the function does not cause a stream to become active except as described above. The
stream will remain idle if no device work follows the function, and will remain idle across consecutive
host functions or stream callbacks without device work in between. Thus, for example, stream
synchronization can be done by signaling from a host function at the end of the stream.
Note that, in contrast to cuStreamAddCallback, the function will not be called in the event of an error
in the CUDA context.
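For illustration, here is a sketch that enqueues a host-side notification behind prior device work; the
callback body is a placeholder and, as required, makes no CUDA API calls:

    #include <cstdio>
    #include <cuda_runtime.h>

    static void CUDART_CB onDone(void *userData)
    {
        // Runs on a runtime-managed host thread once preceding stream work
        // completes; it must not call back into the CUDA API.
        printf("batch %d finished\n", *(int *)userData);
    }

    void enqueueNotice(cudaStream_t stream, int *batchId)
    {
        cudaLaunchHostFunc(stream, onDone, batchId);
    }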
Parameters:
stream - Stream to enqueue function call in
fn - The function to call once preceding stream operations are complete
userData - User-specified data to be passed to the function
Returns:
cudaSuccess, cudaErrorInvalidResourceHandle, cudaErrorInvalidValue, cudaErrorNotSupported
Note:
This function uses standard default stream semantics.
Note that this function may also return error codes from previous, asynchronous launches.
See also:
cudaStreamCreate, cudaStreamQuery, cudaStreamSynchronize, cudaStreamWaitEvent, cudaStreamDestroy,
cudaMallocManaged, cudaStreamAttachMemAsync, cudaStreamAddCallback, cuLaunchHostFunc
cudaError_t cudaLaunchKernel (const void * func, dim3 gridDim, dim3 blockDim, void ** args, size_t sharedMem,
cudaStream_t stream)
The function invokes the kernel func on a gridDim (gridDim.x × gridDim.y × gridDim.z) grid of blocks.
Each block contains blockDim (blockDim.x × blockDim.y × blockDim.z) threads.
If the kernel has N parameters, args should point to an array of N pointers. Each pointer, from args[0]
to args[N - 1], points to the region of memory from which the actual parameter will be copied.
For templated functions, pass the function symbol as follows:
func_name<template_arg_0,...,template_arg_N>
sharedMem sets the amount of dynamic shared memory that will be available to each thread block.
stream specifies the stream with which the invocation is associated.
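A sketch equivalent to myKernel<<<gridDim, blockDim, 0, stream>>>(d_data, n); all names are
placeholders:

    __global__ void myKernel(float *data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= 2.0f;
    }

    cudaError_t launchIt(float *d_data, int n, cudaStream_t stream)
    {
        dim3 gridDim((n + 255) / 256), blockDim(256);
        // One pointer per kernel parameter, in declaration order; each entry
        // points at host memory holding the argument value to be copied.
        void *args[] = { &d_data, &n };
        return cudaLaunchKernel((const void *)myKernel, gridDim, blockDim,
                                args, 0, stream);
    }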
Parameters:
func - Device function symbol
gridDim - Grid dimensions
blockDim - Block dimensions
args - Arguments
sharedMem - Shared memory
stream - Stream identifier
Returns:
cudaSuccess, cudaErrorInvalidDeviceFunction, cudaErrorInvalidConfiguration, cudaErrorLaunchFailure,
cudaErrorLaunchTimeout, cudaErrorLaunchOutOfResources, cudaErrorSharedObjectInitFailed,
cudaErrorInvalidPtx, cudaErrorNoKernelImageForDevice, cudaErrorJitCompilerNotFound
Note:
This function uses standard default stream semantics.
Note that this function may also return error codes from previous, asynchronous launches.
See also:
cudaLaunchKernel (C++ API), cuLaunchKernel
__CUDA_DEPRECATED cudaError_t cudaSetDoubleForDevice (double * d)
Deprecated
This function is deprecated as of CUDA 7.5.
Converts the double value of d to an internal float representation if the device does not support double
arithmetic. If the device does natively support doubles, then this function does nothing.
Parameters:
d - Double to convert
Returns:
cudaSuccess
Note:
Note that this function may also return error codes from previous, asynchronous launches.
See also:
cudaFuncSetCacheConfig (C API), cudaFuncGetAttributes (C API), cudaSetDoubleForHost
__CUDA_DEPRECATED cudaError_t cudaSetDoubleForHost (double * d)
Deprecated
This function is deprecated as of CUDA 7.5.
Converts the double value of d from a potentially internal float representation if the device does not
support double arithmetic. If the device does natively support doubles, then this function does nothing.
Parameters:
d - Double to convert
Returns:
cudaSuccess
Note:
Note that this function may also return error codes from previous, asynchronous launches.
See also:
cudaFuncSetCacheConfig (C API), cudaFuncGetAttributes (C API), cudaSetDoubleForDevice
Author
Generated automatically by Doxygen from the source code.
Version 6.0 28 Jul 2019 Execution Control(3)