Provided by: nvidia-cuda-dev_10.1.243-3_amd64 

NAME
Execution Control -
Functions
__cudart_builtin__ cudaError_t cudaFuncGetAttributes (struct cudaFuncAttributes *attr, const void *func)
Find out attributes for a given function.
__cudart_builtin__ cudaError_t cudaFuncSetAttribute (const void *func, enum cudaFuncAttribute attr, int
value)
Set attributes for a given function.
cudaError_t cudaFuncSetCacheConfig (const void *func, enum cudaFuncCache cacheConfig)
Sets the preferred cache configuration for a device function.
cudaError_t cudaFuncSetSharedMemConfig (const void *func, enum cudaSharedMemConfig config)
Sets the shared memory configuration for a device function.
__device__ __cudart_builtin__ void * cudaGetParameterBuffer (size_t alignment, size_t size)
Obtains a parameter buffer.
__device__ __cudart_builtin__ void * cudaGetParameterBufferV2 (void *func, dim3 gridDimension, dim3
blockDimension, unsigned int sharedMemSize)
Launches a specified kernel.
cudaError_t cudaLaunchCooperativeKernel (const void *func, dim3 gridDim, dim3 blockDim, void **args,
size_t sharedMem, cudaStream_t stream)
Launches a device function where thread blocks can cooperate and synchronize as they execute.
cudaError_t cudaLaunchCooperativeKernelMultiDevice (struct cudaLaunchParams *launchParamsList, unsigned
int numDevices, unsigned int flags=0)
Launches device functions on multiple devices where thread blocks can cooperate and synchronize as
they execute.
cudaError_t cudaLaunchHostFunc (cudaStream_t stream, cudaHostFn_t fn, void *userData)
Enqueues a host function call in a stream.
cudaError_t cudaLaunchKernel (const void *func, dim3 gridDim, dim3 blockDim, void **args, size_t
sharedMem, cudaStream_t stream)
Launches a device function.
__CUDA_DEPRECATED cudaError_t cudaSetDoubleForDevice (double *d)
Converts a double argument to be executed on a device.
__CUDA_DEPRECATED cudaError_t cudaSetDoubleForHost (double *d)
Converts a double argument after execution on a device.
Detailed Description
This section describes the execution control functions of the CUDA runtime application programming
interface.
Some functions have overloaded C++ API template versions documented separately in the C++ API Routines
module.
Function Documentation
__cudart_builtin__ cudaError_t cudaFuncGetAttributes (struct cudaFuncAttributes * attr, const void * func)
This function obtains the attributes of a function specified via func. func is a device function symbol
and must be declared as a __global__ function. The fetched attributes are placed in attr. If the
specified function does not exist, then cudaErrorInvalidDeviceFunction is returned. For templated
functions, pass the function symbol as follows: func_name<template_arg_0,...,template_arg_N>
Note that some function attributes such as maxThreadsPerBlock may vary based on the device that is
currently being used.
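For illustration, here is a minimal host-side sketch of querying a kernel's attributes; the kernel
myKernel is a placeholder:

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void myKernel(float *data) { data[threadIdx.x] *= 2.0f; }

    int main(void)
    {
        struct cudaFuncAttributes attr;
        // Pass the kernel symbol itself; string names were removed in CUDA 5.0.
        cudaError_t err = cudaFuncGetAttributes(&attr, (const void *)myKernel);
        if (err != cudaSuccess) {
            printf("cudaFuncGetAttributes: %s\n", cudaGetErrorString(err));
            return 1;
        }
        printf("maxThreadsPerBlock: %d\n", attr.maxThreadsPerBlock);
        printf("numRegs:            %d\n", attr.numRegs);
        printf("sharedSizeBytes:    %zu\n", attr.sharedSizeBytes);
        return 0;
    }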
Parameters:
attr - Return pointer to function's attributes
func - Device function symbol
Returns:
cudaSuccess, cudaErrorInvalidDeviceFunction
Note:
Note that this function may also return error codes from previous, asynchronous launches.
Use of a string naming a function as the func parameter was deprecated in CUDA 4.1 and removed in
CUDA 5.0.
See also:
cudaFuncSetCacheConfig (C API), cudaFuncGetAttributes (C++ API), cudaLaunchKernel (C API),
cudaSetDoubleForDevice, cudaSetDoubleForHost, cuFuncGetAttribute
__cudart_builtin__ cudaError_t cudaFuncSetAttribute (const void * func, enum cudaFuncAttribute attr, int
value)
This function sets an attribute of the function specified via func. func must be a pointer to a function
that executes on the device and is declared as a __global__ function. The attribute named by attr is set
to value. If the specified function does not exist, then cudaErrorInvalidDeviceFunction is returned. If
the specified attribute cannot be written, or if the value is incorrect, then cudaErrorInvalidValue is
returned.
Valid values for attr are:
• cudaFuncAttributeMaxDynamicSharedMemorySize - The requested maximum size in bytes of dynamically-
allocated shared memory. The sum of this value and the function attribute sharedSizeBytes cannot exceed
the device attribute cudaDevAttrMaxSharedMemoryPerBlockOptin. The maximal size of requestable dynamic
shared memory may differ by GPU architecture.
• cudaFuncAttributePreferredSharedMemoryCarveout - On devices where the L1 cache and shared memory use
the same hardware resources, this sets the shared memory carveout preference, in percent of the total
shared memory. See cudaDevAttrMaxSharedMemoryPerMultiprocessor. This is only a hint, and the driver can
choose a different ratio if required to execute the function.
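As an illustration, here is a sketch of opting a kernel into a larger dynamic shared memory allocation;
the kernel myKernel and the 64 KB figure are placeholders, and the opt-in maximum is device-dependent:

    #include <cuda_runtime.h>

    extern __global__ void myKernel(float *data);  // hypothetical kernel

    cudaError_t enableLargeSharedMem(void)
    {
        // Request up to 64 KB of dynamically allocated shared memory for this
        // kernel. The sum of this value and the kernel's static sharedSizeBytes
        // must not exceed cudaDevAttrMaxSharedMemoryPerBlockOptin.
        return cudaFuncSetAttribute((const void *)myKernel,
                                    cudaFuncAttributeMaxDynamicSharedMemorySize,
                                    64 * 1024);
    }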
Parameters:
func - Function to set the attribute for
attr - Attribute to set
value - Value to set
Returns:
cudaSuccess, cudaErrorInvalidDeviceFunction, cudaErrorInvalidValue
Note:
Note that this function may also return error codes from previous, asynchronous launches.
See also:
cudaLaunchKernel (C++ API), cudaFuncSetCacheConfig (C++ API), cudaFuncGetAttributes (C API),
cudaSetDoubleForDevice, cudaSetDoubleForHost
cudaError_t cudaFuncSetCacheConfig (const void * func, enum cudaFuncCache cacheConfig)
On devices where the L1 cache and shared memory use the same hardware resources, this sets through
cacheConfig the preferred cache configuration for the function specified via func. This is only a
preference. The runtime will use the requested configuration if possible, but it is free to choose a
different configuration if required to execute func.
func is a device function symbol and must be declared as a __global__ function. If the specified function
does not exist, then cudaErrorInvalidDeviceFunction is returned. For templated functions, pass the
function symbol as follows: func_name<template_arg_0,...,template_arg_N>
This setting does nothing on devices where the sizes of the L1 cache and shared memory are fixed.
Launching a kernel with a different preference than the most recent preference setting may insert a
device-side synchronization point.
The supported cache configurations are:
• cudaFuncCachePreferNone: no preference for shared memory or L1 (default)
• cudaFuncCachePreferShared: prefer larger shared memory and smaller L1 cache
• cudaFuncCachePreferL1: prefer larger L1 cache and smaller shared memory
• cudaFuncCachePreferEqual: prefer equal size L1 cache and shared memory
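For example, a kernel that uses little shared memory might hint a preference for a larger L1 cache;
myKernel is a placeholder, and the runtime remains free to ignore the hint:

    // Ask the runtime to favor L1 cache over shared memory for myKernel.
    cudaFuncSetCacheConfig((const void *)myKernel, cudaFuncCachePreferL1);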
Parameters:
func - Device function symbol
cacheConfig - Requested cache configuration
Returns:
cudaSuccess, cudaErrorInvalidDeviceFunction
Note:
Note that this function may also return error codes from previous, asynchronous launches.
Use of a string naming a function as the func parameter was deprecated in CUDA 4.1 and removed in
CUDA 5.0.
See also:
cudaFuncSetCacheConfig (C++ API), cudaFuncGetAttributes (C API), cudaLaunchKernel (C API),
cudaSetDoubleForDevice, cudaSetDoubleForHost, cudaThreadGetCacheConfig, cudaThreadSetCacheConfig,
cuFuncSetCacheConfig
cudaError_t cudaFuncSetSharedMemConfig (const void * func, enum cudaSharedMemConfig config)
On devices with configurable shared memory banks, this function will force all subsequent launches of the
specified device function to have the given shared memory bank size configuration. On any given launch of
the function, the shared memory configuration of the device will be temporarily changed if needed to suit
the function's preferred configuration. Changes in shared memory configuration between subsequent
launches of functions may introduce a device-side synchronization point.
Any per-function setting of shared memory bank size set via cudaFuncSetSharedMemConfig will override the
device wide setting set by cudaDeviceSetSharedMemConfig.
Changing the shared memory bank size will not increase shared memory usage or affect occupancy of
kernels, but may have major effects on performance. Larger bank sizes will allow for greater potential
bandwidth to shared memory, but will change what kinds of accesses to shared memory will result in bank
conflicts.
This function will do nothing on devices with fixed shared memory bank size.
For templated functions, pass the function symbol as follows:
func_name<template_arg_0,...,template_arg_N>
The supported bank configurations are:
• cudaSharedMemBankSizeDefault: use the device's shared memory configuration when launching this
function.
• cudaSharedMemBankSizeFourByte: set shared memory bank width to be four bytes natively when launching
this function.
• cudaSharedMemBankSizeEightByte: set shared memory bank width to be eight bytes natively when launching
this function.
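For instance, a kernel whose shared-memory accesses are predominantly 64-bit might request eight-byte
banks; myKernel is a placeholder:

    // Use eight-byte shared memory banks for launches of myKernel, overriding
    // any device-wide setting made via cudaDeviceSetSharedMemConfig.
    cudaFuncSetSharedMemConfig((const void *)myKernel,
                               cudaSharedMemBankSizeEightByte);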
Parameters:
func - Device function symbol
config - Requested shared memory configuration
Returns:
cudaSuccess, cudaErrorInvalidDeviceFunction, cudaErrorInvalidValue
Note:
Note that this function may also return error codes from previous, asynchronous launches.
Use of a string naming a function as the func parameter was deprecated in CUDA 4.1 and removed in
CUDA 5.0.
See also:
cudaDeviceSetSharedMemConfig, cudaDeviceGetSharedMemConfig, cudaDeviceSetCacheConfig,
cudaDeviceGetCacheConfig, cudaFuncSetCacheConfig, cuFuncSetSharedMemConfig
__device__ __cudart_builtin__ void* cudaGetParameterBuffer (size_t alignment, size_t size)
Obtains a parameter buffer which can be filled with parameters for a kernel launch. Parameters passed to
cudaLaunchDevice must be allocated via this function.
This is a low-level API and can only be accessed from Parallel Thread Execution (PTX). CUDA user code
should use <<< >>> to launch kernels.
Parameters:
alignment - Specifies alignment requirement of the parameter buffer
size - Specifies size requirement in bytes
Returns:
Returns a pointer to the allocated parameter buffer
Note:
Note that this function may also return error codes from previous, asynchronous launches.
See also:
cudaLaunchDevice
__device__ __cudart_builtin__ void* cudaGetParameterBufferV2 (void * func, dim3 gridDimension, dim3
blockDimension, unsigned int sharedMemSize)
Launches a specified kernel with the specified parameter buffer. A parameter buffer can be obtained by
calling cudaGetParameterBuffer().
This is a low-level API and can only be accessed from Parallel Thread Execution (PTX). CUDA user code
should use <<< >>> to launch kernels.
Parameters:
func - Pointer to the kernel to be launched
gridDimension - Specifies grid dimensions
blockDimension - Specifies block dimensions
sharedMemSize - Specifies size of shared memory
Returns:
cudaSuccess, cudaErrorInvalidDevice, cudaErrorLaunchMaxDepthExceeded, cudaErrorInvalidConfiguration,
cudaErrorStartupFailure, cudaErrorLaunchPendingCountExceeded, cudaErrorLaunchOutOfResources
Note:
Note that this function may also return error codes from previous, asynchronous launches.
Please refer to Execution Configuration and Parameter Buffer Layout from the CUDA Programming Guide
for the detailed descriptions of launch configuration and parameter layout respectively.
See also:
cudaGetParameterBuffer
cudaError_t cudaLaunchCooperativeKernel (const void * func, dim3 gridDim, dim3 blockDim, void ** args, size_t
sharedMem, cudaStream_t stream)
The function invokes the kernel func on a gridDim (gridDim.x × gridDim.y × gridDim.z) grid of blocks.
Each block contains blockDim (blockDim.x × blockDim.y × blockDim.z) threads.
The device on which this kernel is invoked must have a non-zero value for the device attribute
cudaDevAttrCooperativeLaunch.
The total number of blocks launched cannot exceed the maximum number of blocks per multiprocessor as
returned by cudaOccupancyMaxActiveBlocksPerMultiprocessor (or
cudaOccupancyMaxActiveBlocksPerMultiprocessorWithFlags) times the number of multiprocessors as specified
by the device attribute cudaDevAttrMultiProcessorCount.
The kernel cannot make use of CUDA dynamic parallelism.
If the kernel has N parameters, args should point to an array of N pointers. Each pointer, from args[0]
to args[N - 1], points to the region of memory from which the actual parameter will be copied.
For templated functions, pass the function symbol as follows:
func_name<template_arg_0,...,template_arg_N>
sharedMem sets the amount of dynamic shared memory that will be available to each thread block.
stream specifies the stream with which the invocation is associated.
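A host-side sketch of a cooperative launch follows; the kernel coopKernel and the block size are
placeholders, and the grid is sized with the occupancy API so the limit described above is respected:

    #include <cooperative_groups.h>
    #include <cuda_runtime.h>
    namespace cg = cooperative_groups;

    __global__ void coopKernel(int *data)
    {
        cg::grid_group grid = cg::this_grid();
        data[grid.thread_rank()] = 1;
        grid.sync();  // grid-wide barrier, valid only in cooperative launches
    }

    cudaError_t launchCooperative(int *d_data, cudaStream_t stream)
    {
        int device = 0, supported = 0, numSms = 0, blocksPerSm = 0;
        cudaGetDevice(&device);
        cudaDeviceGetAttribute(&supported, cudaDevAttrCooperativeLaunch, device);
        if (!supported)
            return cudaErrorNotSupported;
        cudaDeviceGetAttribute(&numSms, cudaDevAttrMultiProcessorCount, device);

        dim3 blockDim(256);
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSm, coopKernel,
                                                      blockDim.x, 0);
        dim3 gridDim(blocksPerSm * numSms);  // must not exceed this bound

        // d_data must hold at least gridDim.x * blockDim.x elements.
        void *args[] = { &d_data };
        return cudaLaunchCooperativeKernel((const void *)coopKernel, gridDim,
                                           blockDim, args, 0, stream);
    }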
Parameters:
func - Device function symbol
gridDim - Grid dimensions
blockDim - Block dimensions
args - Arguments
sharedMem - Shared memory
stream - Stream identifier
Returns:
cudaSuccess, cudaErrorInvalidDeviceFunction, cudaErrorInvalidConfiguration, cudaErrorLaunchFailure,
cudaErrorLaunchTimeout, cudaErrorLaunchOutOfResources, cudaErrorCooperativeLaunchTooLarge,
cudaErrorSharedObjectInitFailed
Note:
This function uses standard default stream semantics.
Note that this function may also return error codes from previous, asynchronous launches.
See also:
cudaLaunchCooperativeKernel (C++ API), cudaLaunchCooperativeKernelMultiDevice,
cuLaunchCooperativeKernel
cudaError_t cudaLaunchCooperativeKernelMultiDevice (struct cudaLaunchParams * launchParamsList, unsigned int
numDevices, unsigned int flags = 0)
Invokes kernels as specified in the launchParamsList array where each element of the array specifies all
the parameters required to perform a single kernel launch. These kernels can cooperate and synchronize as
they execute. The size of the array is specified by numDevices.
No two kernels can be launched on the same device. All the devices targeted by this multi-device launch
must be identical. All devices must have a non-zero value for the device attribute
cudaDevAttrCooperativeMultiDeviceLaunch.
The same kernel must be launched on all devices. Note that any __device__ or __constant__ variables are
independently instantiated on every device. It is the application's responsibility to ensure these
variables are initialized and used appropriately.
The size of the grids as specified in blocks, the size of the blocks themselves and the amount of shared
memory used by each thread block must also match across all launched kernels.
The streams used to launch these kernels must have been created via either cudaStreamCreate,
cudaStreamCreateWithFlags, or cudaStreamCreateWithPriority. The NULL stream, cudaStreamLegacy, and
cudaStreamPerThread cannot be used.
The total number of blocks launched per kernel cannot exceed the maximum number of blocks per
multiprocessor as returned by cudaOccupancyMaxActiveBlocksPerMultiprocessor (or
cudaOccupancyMaxActiveBlocksPerMultiprocessorWithFlags) times the number of multiprocessors as specified
by the device attribute cudaDevAttrMultiProcessorCount. Since the total number of blocks launched per
device has to match across all devices, the maximum number of blocks that can be launched per device will
be limited by the device with the fewest multiprocessors.
The kernel cannot make use of CUDA dynamic parallelism.
The cudaLaunchParams structure is defined as:
    struct cudaLaunchParams
    {
        void *func;
        dim3 gridDim;
        dim3 blockDim;
        void **args;
        size_t sharedMem;
        cudaStream_t stream;
    };
where:
• cudaLaunchParams::func specifies the kernel to be launched. The same function must be launched on all
devices. For templated functions, pass the function symbol as follows:
func_name<template_arg_0,...,template_arg_N>
• cudaLaunchParams::gridDim specifies the width, height and depth of the grid in blocks. This must match
across all kernels launched.
• cudaLaunchParams::blockDim is the width, height and depth of each thread block. This must match across
all kernels launched.
• cudaLaunchParams::args specifies the arguments to the kernel. If the kernel has N parameters, then
cudaLaunchParams::args should point to an array of N pointers. Each pointer, from
cudaLaunchParams::args[0] to cudaLaunchParams::args[N - 1], points to the region of memory from which
the actual parameter will be copied.
• cudaLaunchParams::sharedMem is the dynamic shared-memory size per thread block in bytes. This must
match across all kernels launched.
• cudaLaunchParams::stream is the handle to the stream to perform the launch in. This cannot be the NULL
stream or cudaStreamLegacy or cudaStreamPerThread.
By default, the kernel won't begin execution on any GPU until all prior work in all the specified streams
has completed. This behavior can be overridden by specifying the flag
cudaCooperativeLaunchMultiDeviceNoPreSync. When this flag is specified, each kernel will only wait for
prior work in the stream corresponding to that GPU to complete before it begins execution.
Similarly, by default, any subsequent work pushed in any of the specified streams will not begin
execution until the kernels on all GPUs have completed. This behavior can be overridden by specifying the
flag cudaCooperativeLaunchMultiDeviceNoPostSync. When this flag is specified, any subsequent work pushed
in any of the specified streams will only wait for the kernel launched on the GPU corresponding to that
stream to complete before it begins execution.
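A minimal sketch of filling launchParamsList for two identical devices; coopKernel, the per-device
streams, argument arrays, and grid sizing are assumed to have been prepared as described above:

    struct cudaLaunchParams params[2];
    for (int dev = 0; dev < 2; ++dev) {
        params[dev].func      = (void *)coopKernel;    // same kernel on all devices
        params[dev].gridDim   = dim3(blocksPerDevice); // must match across devices
        params[dev].blockDim  = dim3(256);             // must match across devices
        params[dev].args      = kernelArgs[dev];       // per-device argument array
        params[dev].sharedMem = 0;                     // must match across devices
        params[dev].stream    = streams[dev];          // non-NULL, per-device stream
    }
    cudaLaunchCooperativeKernelMultiDevice(params, 2, 0);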
Parameters:
launchParamsList - List of launch parameters, one per device
numDevices - Size of the launchParamsList array
flags - Flags to control launch behavior
Returns:
cudaSuccess, cudaErrorInvalidDeviceFunction, cudaErrorInvalidConfiguration, cudaErrorLaunchFailure,
cudaErrorLaunchTimeout, cudaErrorLaunchOutOfResources, cudaErrorCooperativeLaunchTooLarge,
cudaErrorSharedObjectInitFailed
Note:
This function uses standard default stream semantics.
Note that this function may also return error codes from previous, asynchronous launches.
See also:
cudaLaunchCooperativeKernel (C++ API), cudaLaunchCooperativeKernel,
cuLaunchCooperativeKernelMultiDevice
cudaError_t cudaLaunchHostFunc (cudaStream_t stream, cudaHostFn_t fn, void * userData)
Enqueues a host function to run in a stream. The function will be called after currently enqueued work
and will block work added after it.
The host function must not make any CUDA API calls. Attempting to use a CUDA API may result in
cudaErrorNotPermitted, but this is not required. The host function must not perform any synchronization
that may depend on outstanding CUDA work not mandated to run earlier. Host functions without a mandated
order (such as in independent streams) execute in undefined order and may be serialized.
For the purposes of Unified Memory, execution makes a number of guarantees:
• The stream is considered idle for the duration of the function's execution. Thus, for example, the
function may always use memory attached to the stream it was enqueued in.
• The start of execution of the function has the same effect as synchronizing an event recorded in the
same stream immediately prior to the function. It thus synchronizes streams which have been 'joined'
prior to the function.
• Adding device work to any stream does not have the effect of making the stream active until all
preceding host functions and stream callbacks have executed. Thus, for example, a function might use
global attached memory even if work has been added to another stream, if the work has been ordered
behind the function call with an event.
• Completion of the function does not cause a stream to become active except as described above. The
stream will remain idle if no device work follows the function, and will remain idle across consecutive
host functions or stream callbacks without device work in between. Thus, for example, stream
synchronization can be done by signaling from a host function at the end of the stream.
Note that, in contrast to cuStreamAddCallback, the function will not be called in the event of an error
in the CUDA context.
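For illustration, here is a sketch that enqueues a host-side notification behind prior device work; the
callback body is a placeholder and, as required, makes no CUDA API calls:

    #include <cstdio>
    #include <cuda_runtime.h>

    static void CUDART_CB onDone(void *userData)
    {
        // Runs on a runtime-managed host thread once preceding stream work
        // completes; it must not call back into the CUDA API.
        printf("batch %d finished\n", *(int *)userData);
    }

    void enqueueNotice(cudaStream_t stream, int *batchId)
    {
        cudaLaunchHostFunc(stream, onDone, batchId);
    }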
Parameters:
stream - Stream to enqueue function call in
fn - The function to call once preceding stream operations are complete
userData - User-specified data to be passed to the function
Returns:
cudaSuccess, cudaErrorInvalidResourceHandle, cudaErrorInvalidValue, cudaErrorNotSupported
Note:
This function uses standard default stream semantics.
Note that this function may also return error codes from previous, asynchronous launches.
See also:
cudaStreamCreate, cudaStreamQuery, cudaStreamSynchronize, cudaStreamWaitEvent, cudaStreamDestroy,
cudaMallocManaged, cudaStreamAttachMemAsync, cudaStreamAddCallback, cuLaunchHostFunc
cudaError_t cudaLaunchKernel (const void * func, dim3 gridDim, dim3 blockDim, void ** args, size_t sharedMem,
cudaStream_t stream)
The function invokes the kernel func on a gridDim (gridDim.x × gridDim.y × gridDim.z) grid of blocks.
Each block contains blockDim (blockDim.x × blockDim.y × blockDim.z) threads.
If the kernel has N parameters, args should point to an array of N pointers. Each pointer, from args[0]
to args[N - 1], points to the region of memory from which the actual parameter will be copied.
For templated functions, pass the function symbol as follows:
func_name<template_arg_0,...,template_arg_N>
sharedMem sets the amount of dynamic shared memory that will be available to each thread block.
stream specifies the stream with which the invocation is associated.
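A sketch equivalent to myKernel<<<gridDim, blockDim, 0, stream>>>(d_data, n); all names are
placeholders:

    __global__ void myKernel(float *data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= 2.0f;
    }

    cudaError_t launchIt(float *d_data, int n, cudaStream_t stream)
    {
        dim3 gridDim((n + 255) / 256), blockDim(256);
        // One pointer per kernel parameter, in declaration order; each entry
        // points at host memory holding the argument value to be copied.
        void *args[] = { &d_data, &n };
        return cudaLaunchKernel((const void *)myKernel, gridDim, blockDim,
                                args, 0, stream);
    }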
Parameters:
func - Device function symbol
gridDim - Grid dimensions
blockDim - Block dimensions
args - Arguments
sharedMem - Shared memory
stream - Stream identifier
Returns:
cudaSuccess, cudaErrorInvalidDeviceFunction, cudaErrorInvalidConfiguration, cudaErrorLaunchFailure,
cudaErrorLaunchTimeout, cudaErrorLaunchOutOfResources, cudaErrorSharedObjectInitFailed,
cudaErrorInvalidPtx, cudaErrorNoKernelImageForDevice, cudaErrorJitCompilerNotFound
Note:
This function uses standard default stream semantics.
Note that this function may also return error codes from previous, asynchronous launches.
See also:
cudaLaunchKernel (C++ API), cuLaunchKernel
__CUDA_DEPRECATED cudaError_t cudaSetDoubleForDevice (double * d)
Deprecated
This function is deprecated as of CUDA 7.5.
Converts the double value of d to an internal float representation if the device does not support double
arithmetic. If the device does natively support doubles, then this function does nothing.
Parameters:
d - Double to convert
Returns:
cudaSuccess
Note:
Note that this function may also return error codes from previous, asynchronous launches.
See also:
cudaFuncSetCacheConfig (C API), cudaFuncGetAttributes (C API), cudaSetDoubleForHost
__CUDA_DEPRECATED cudaError_t cudaSetDoubleForHost (double * d)
Deprecated
This function is deprecated as of CUDA 7.5.
Converts the double value of d from a potentially internal float representation if the device does not
support double arithmetic. If the device does natively support doubles, then this function does nothing.
Parameters:
d - Double to convert
Returns:
cudaSuccess
Note:
Note that this function may also return error codes from previous, asynchronous launches.
See also:
cudaFuncSetCacheConfig (C API), cudaFuncGetAttributes (C API), cudaSetDoubleForDevice
Author
Generated automatically by Doxygen from the source code.
Version 6.0 28 Jul 2019 Execution Control(3)