Ubuntu Manpage: Execution Control

Provided by: nvidia-cuda-dev_9.1.85-3ubuntu1_amd64

NAME

       Execution Control - execution control functions of the low-level CUDA driver API

   Functions
       CUresult cuFuncGetAttribute (int *pi, CUfunction_attribute attrib, CUfunction hfunc)
           Returns information about a function.
       CUresult cuFuncSetAttribute (CUfunction hfunc, CUfunction_attribute attrib, int value)
           Sets information about a function.
       CUresult cuFuncSetCacheConfig (CUfunction hfunc, CUfunc_cache config)
           Sets the preferred cache configuration for a device function.
       CUresult cuFuncSetSharedMemConfig (CUfunction hfunc, CUsharedconfig config)
           Sets the shared memory configuration for a device function.
       CUresult cuLaunchCooperativeKernel (CUfunction f, unsigned int gridDimX, unsigned int
           gridDimY, unsigned int gridDimZ, unsigned int blockDimX, unsigned int blockDimY,
           unsigned int blockDimZ, unsigned int sharedMemBytes, CUstream hStream, void
           **kernelParams)
           Launches a CUDA function where thread blocks can cooperate and synchronize as they
           execute.
       CUresult cuLaunchCooperativeKernelMultiDevice (CUDA_LAUNCH_PARAMS *launchParamsList,
           unsigned int numDevices, unsigned int flags)
           Launches CUDA functions on multiple devices where thread blocks can cooperate and
           synchronize as they execute.
       CUresult cuLaunchKernel (CUfunction f, unsigned int gridDimX, unsigned int gridDimY,
           unsigned int gridDimZ, unsigned int blockDimX, unsigned int blockDimY, unsigned int
           blockDimZ, unsigned int sharedMemBytes, CUstream hStream, void **kernelParams, void
           **extra)
           Launches a CUDA function.

Detailed Description

       This section describes the execution control functions of the low-level CUDA driver
       application programming interface (cuda.h).

Function Documentation

   CUresult cuFuncGetAttribute (int * pi, CUfunction_attribute attrib, CUfunction hfunc)
       Returns in *pi the integer value of the attribute attrib on the kernel given by hfunc. The
       supported attributes are:

       • CU_FUNC_ATTRIBUTE_MAX_THREADS_PER_BLOCK: The maximum number of threads per block, beyond
         which a launch of the function would fail. This number depends on both the function and
         the device on which the function is currently loaded.

       • CU_FUNC_ATTRIBUTE_SHARED_SIZE_BYTES: The size in bytes of statically-allocated shared
         memory per block required by this function. This does not include dynamically-allocated
         shared memory requested by the user at runtime.

       • CU_FUNC_ATTRIBUTE_CONST_SIZE_BYTES: The size in bytes of user-allocated constant memory
         required by this function.

       • CU_FUNC_ATTRIBUTE_LOCAL_SIZE_BYTES: The size in bytes of local memory used by each
         thread of this function.

       • CU_FUNC_ATTRIBUTE_NUM_REGS: The number of registers used by each thread of this
         function.

       • CU_FUNC_ATTRIBUTE_PTX_VERSION: The PTX virtual architecture version for which the
         function was compiled. This value is the major PTX version * 10 + the minor PTX version,
         so a PTX version 1.3 function would return the value 13. Note that this may return the
         undefined value of 0 for cubins compiled prior to CUDA 3.0.

       • CU_FUNC_ATTRIBUTE_BINARY_VERSION: The binary architecture version for which the function
         was compiled. This value is the major binary version * 10 + the minor binary version, so
         a binary version 1.3 function would return the value 13. Note that this will return a
         value of 10 for legacy cubins that do not have a properly-encoded binary architecture
         version.

       • CU_FUNC_ATTRIBUTE_CACHE_MODE_CA: Indicates whether the function has been compiled with
         the user-specified option '-Xptxas --dlcm=ca'.

       • CU_FUNC_ATTRIBUTE_MAX_DYNAMIC_SHARED_SIZE_BYTES: The maximum size in bytes of
         dynamically-allocated shared memory.

       • CU_FUNC_ATTRIBUTE_PREFERRED_SHARED_MEMORY_CARVEOUT: Preferred shared memory-L1 cache
         split ratio in percent of shared memory.
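
       The encoding used by CU_FUNC_ATTRIBUTE_PTX_VERSION and CU_FUNC_ATTRIBUTE_BINARY_VERSION
       (major * 10 + minor) can be undone with integer division. The following is a minimal
       sketch in plain C. FuncVersion and decode_func_version are hypothetical names that are
       not part of the driver API, and the input is assumed to be a value already obtained from
       cuFuncGetAttribute.

```c
#include <assert.h>

/* Hypothetical helper type; not part of cuda.h. */
typedef struct {
    int major;
    int minor;
} FuncVersion;

/* Split a value encoded as major * 10 + minor, as described for
 * CU_FUNC_ATTRIBUTE_PTX_VERSION and CU_FUNC_ATTRIBUTE_BINARY_VERSION. */
static FuncVersion decode_func_version(int encoded)
{
    FuncVersion v;

    assert(encoded >= 0);      /* the encoding has no negative values */
    v.major = encoded / 10;
    v.minor = encoded % 10;
    return v;
}
```

       For the value 13 used in the descriptions above, the helper yields major 1 and minor 3.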

       Parameters:
           pi - Returned attribute value
           attrib - Attribute requested
           hfunc - Function to query attribute of

       Returns:
           CUDA_SUCCESS, CUDA_ERROR_DEINITIALIZED, CUDA_ERROR_NOT_INITIALIZED,
           CUDA_ERROR_INVALID_CONTEXT, CUDA_ERROR_INVALID_HANDLE, CUDA_ERROR_INVALID_VALUE

       Note:
           Note that this function may also return error codes from previous, asynchronous
           launches.

       See also:
           cuCtxGetCacheConfig, cuCtxSetCacheConfig, cuFuncSetCacheConfig, cuLaunchKernel,
           cudaFuncGetAttributes, cudaFuncSetAttribute

   CUresult cuFuncSetAttribute (CUfunction hfunc, CUfunction_attribute attrib, int value)
       Sets the attribute attrib on the kernel given by hfunc to the integer value specified by
       value. This function returns CUDA_SUCCESS if the new value of the attribute was set
       successfully; otherwise it returns an error. Not all attributes can be set. Attempting to
       set a read-only attribute results in an error (CUDA_ERROR_INVALID_VALUE).

       Supported attributes for the cuFuncSetAttribute call are:

       • CU_FUNC_ATTRIBUTE_MAX_DYNAMIC_SHARED_SIZE_BYTES: The maximum size in bytes of
         dynamically-allocated shared memory. The value should contain the requested maximum size
         of dynamically-allocated shared memory. The sum of this value and the function attribute
         CU_FUNC_ATTRIBUTE_SHARED_SIZE_BYTES cannot exceed the device attribute
         CU_DEVICE_ATTRIBUTE_MAX_SHARED_MEMORY_PER_BLOCK_OPTIN. The maximal size of requestable
         dynamic shared memory may differ by GPU architecture.

       • CU_FUNC_ATTRIBUTE_PREFERRED_SHARED_MEMORY_CARVEOUT: On devices where the L1 cache and
         shared memory use the same hardware resources, this sets the shared memory carveout
         preference, in percent of the total resources. This is only a hint, and the driver can
         choose a different ratio if required to execute the function.
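
       The constraint on CU_FUNC_ATTRIBUTE_MAX_DYNAMIC_SHARED_SIZE_BYTES can be checked before
       calling cuFuncSetAttribute. The sketch below is plain C and only performs the arithmetic;
       dynamic_smem_request_fits is a hypothetical helper, and its inputs are assumed to have
       been queried beforehand (CU_FUNC_ATTRIBUTE_SHARED_SIZE_BYTES via cuFuncGetAttribute and
       CU_DEVICE_ATTRIBUTE_MAX_SHARED_MEMORY_PER_BLOCK_OPTIN via the device attribute query).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical helper, not part of the driver API.
 * static_bytes            - value of CU_FUNC_ATTRIBUTE_SHARED_SIZE_BYTES
 * requested_dynamic_bytes - value intended for
 *                           CU_FUNC_ATTRIBUTE_MAX_DYNAMIC_SHARED_SIZE_BYTES
 * optin_limit             - value of
 *                           CU_DEVICE_ATTRIBUTE_MAX_SHARED_MEMORY_PER_BLOCK_OPTIN
 * Returns true when static + dynamic does not exceed the opt-in limit. */
static bool dynamic_smem_request_fits(int static_bytes,
                                      int requested_dynamic_bytes,
                                      int optin_limit)
{
    assert(static_bytes >= 0 && optin_limit >= 0);
    if (requested_dynamic_bytes < 0)
        return false;
    /* Widen to long long so the sum cannot overflow int. */
    return (long long)static_bytes + requested_dynamic_bytes
           <= (long long)optin_limit;
}
```

       The sizes used to exercise such a helper are illustrative only; the actual opt-in limit
       must be read from the device.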

       Parameters:
           hfunc - Function to set attribute of
           attrib - Attribute to set
           value - The value to set

       Returns:
           CUDA_SUCCESS, CUDA_ERROR_DEINITIALIZED, CUDA_ERROR_NOT_INITIALIZED,
           CUDA_ERROR_INVALID_CONTEXT, CUDA_ERROR_INVALID_HANDLE, CUDA_ERROR_INVALID_VALUE

       Note:
           Note that this function may also return error codes from previous, asynchronous
           launches.

       See also:
           cuCtxGetCacheConfig, cuCtxSetCacheConfig, cuFuncSetCacheConfig, cuLaunchKernel,
           cudaFuncGetAttributes, cudaFuncSetAttribute

   CUresult cuFuncSetCacheConfig (CUfunction hfunc, CUfunc_cache config)
       On devices where the L1 cache and shared memory use the same hardware resources, this sets
       through config the preferred cache configuration for the device function hfunc. This is
       only a preference. The driver will use the requested configuration if possible, but it is
       free to choose a different configuration if required to execute hfunc. Any context-wide
       preference set via cuCtxSetCacheConfig() will be overridden by this per-function setting
       unless the per-function setting is CU_FUNC_CACHE_PREFER_NONE. In that case, the current
       context-wide setting will be used.

       This setting does nothing on devices where the size of the L1 cache and shared memory are
       fixed.

       Launching a kernel with a different preference than the most recent preference setting may
       insert a device-side synchronization point.

       The supported cache configurations are:

       • CU_FUNC_CACHE_PREFER_NONE: no preference for shared memory or L1 (default)

       • CU_FUNC_CACHE_PREFER_SHARED: prefer larger shared memory and smaller L1 cache

       • CU_FUNC_CACHE_PREFER_L1: prefer larger L1 cache and smaller shared memory

       • CU_FUNC_CACHE_PREFER_EQUAL: prefer equal sized L1 cache and shared memory

       Parameters:
           hfunc - Kernel to configure cache for
           config - Requested cache configuration

       Returns:
           CUDA_SUCCESS, CUDA_ERROR_INVALID_VALUE, CUDA_ERROR_DEINITIALIZED,
           CUDA_ERROR_NOT_INITIALIZED, CUDA_ERROR_INVALID_CONTEXT

       Note:
           Note that this function may also return error codes from previous, asynchronous
           launches.

       See also:
           cuCtxGetCacheConfig, cuCtxSetCacheConfig, cuFuncGetAttribute, cuLaunchKernel,
           cudaFuncSetCacheConfig

   CUresult cuFuncSetSharedMemConfig (CUfunction hfunc, CUsharedconfig config)
       On devices with configurable shared memory banks, this function will force all subsequent
       launches of the specified device function to have the given shared memory bank size
       configuration. On any given launch of the function, the shared memory configuration of the
       device will be temporarily changed if needed to suit the function's preferred
       configuration. Changes in shared memory configuration between subsequent launches of
       functions may introduce a device-side synchronization point.

       Any per-function setting of shared memory bank size set via cuFuncSetSharedMemConfig will
       override the context-wide setting set with cuCtxSetSharedMemConfig.

       Changing the shared memory bank size will not increase shared memory usage or affect
       occupancy of kernels, but may have major effects on performance. Larger bank sizes will
       allow for greater potential bandwidth to shared memory, but will change what kinds of
       accesses to shared memory will result in bank conflicts.

       This function will do nothing on devices with fixed shared memory bank size.

       The supported bank configurations are:

       • CU_SHARED_MEM_CONFIG_DEFAULT_BANK_SIZE: use the context's shared memory configuration
         when launching this function.

       • CU_SHARED_MEM_CONFIG_FOUR_BYTE_BANK_SIZE: set shared memory bank width to be natively
         four bytes when launching this function.

       • CU_SHARED_MEM_CONFIG_EIGHT_BYTE_BANK_SIZE: set shared memory bank width to be natively
         eight bytes when launching this function.

       Parameters:
           hfunc - kernel to be given a shared memory config
           config - requested shared memory configuration

       Returns:
           CUDA_SUCCESS, CUDA_ERROR_INVALID_VALUE, CUDA_ERROR_DEINITIALIZED,
           CUDA_ERROR_NOT_INITIALIZED, CUDA_ERROR_INVALID_CONTEXT

       Note:
           Note that this function may also return error codes from previous, asynchronous
           launches.

       See also:
           cuCtxGetCacheConfig, cuCtxSetCacheConfig, cuCtxGetSharedMemConfig,
           cuCtxSetSharedMemConfig, cuFuncGetAttribute, cuLaunchKernel,
           cudaFuncSetSharedMemConfig

   CUresult cuLaunchCooperativeKernel (CUfunction f, unsigned int gridDimX, unsigned int
       gridDimY, unsigned int gridDimZ, unsigned int blockDimX, unsigned int blockDimY, unsigned
       int blockDimZ, unsigned int sharedMemBytes, CUstream hStream, void ** kernelParams)
       Invokes the kernel f on a gridDimX x gridDimY x gridDimZ grid of blocks. Each block
       contains blockDimX x blockDimY x blockDimZ threads.

       sharedMemBytes sets the amount of dynamic shared memory that will be available to each
       thread block.

       The device on which this kernel is invoked must have a non-zero value for the device
       attribute CU_DEVICE_ATTRIBUTE_COOPERATIVE_LAUNCH.

       The total number of blocks launched cannot exceed the maximum number of blocks per
       multiprocessor as returned by cuOccupancyMaxActiveBlocksPerMultiprocessor (or
       cuOccupancyMaxActiveBlocksPerMultiprocessorWithFlags) times the number of multiprocessors
       as specified by the device attribute CU_DEVICE_ATTRIBUTE_MULTIPROCESSOR_COUNT.
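
       The block limit described above is a simple product. The following plain C sketch shows
       the arithmetic only; max_cooperative_blocks and cooperative_grid_fits are hypothetical
       helpers. blocks_per_sm is assumed to come from
       cuOccupancyMaxActiveBlocksPerMultiprocessor and sm_count from
       CU_DEVICE_ATTRIBUTE_MULTIPROCESSOR_COUNT.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical helper: upper bound on the total number of blocks in a
 * cooperative launch (blocks per multiprocessor times multiprocessors). */
static unsigned long long max_cooperative_blocks(int blocks_per_sm, int sm_count)
{
    assert(blocks_per_sm >= 0 && sm_count >= 0);
    return (unsigned long long)blocks_per_sm * (unsigned long long)sm_count;
}

/* Hypothetical helper: true when a gx x gy x gz grid stays within the bound.
 * Assumes the grid dimensions are within the device's per-dimension limits,
 * so the 64-bit product cannot overflow. */
static bool cooperative_grid_fits(unsigned int gx, unsigned int gy, unsigned int gz,
                                  int blocks_per_sm, int sm_count)
{
    unsigned long long total = (unsigned long long)gx * gy * gz;

    return total <= max_cooperative_blocks(blocks_per_sm, sm_count);
}
```

       A grid that fails this check would be expected to produce
       CUDA_ERROR_COOPERATIVE_LAUNCH_TOO_LARGE, which is listed among the return values below.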

       The kernel cannot make use of CUDA dynamic parallelism.

       Kernel parameters must be specified via kernelParams. If f has N parameters, then
       kernelParams needs to be an array of N pointers. Each of kernelParams[0] through
       kernelParams[N-1] must point to a region of memory from which the actual kernel parameter
       will be copied. The number of kernel parameters and their offsets and sizes do not need to
       be specified as that information is retrieved directly from the kernel's image.
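
       As an illustration of the kernelParams layout, the plain C sketch below builds the
       pointer array for a hypothetical kernel taking (device pointer, int, float). ScaleArgs
       and build_kernel_params are invented for this example, and unsigned long long merely
       stands in for CUdeviceptr so that the sketch compiles without cuda.h. The resulting
       array is what would be passed as the kernelParams argument.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical argument storage for a kernel with three parameters.
 * unsigned long long stands in for CUdeviceptr (assumption). */
typedef struct {
    unsigned long long data;   /* device pointer argument */
    int n;                     /* element count argument */
    float scale;               /* scalar argument */
} ScaleArgs;

/* Fill kernelParams so that entry i points at the storage of parameter i.
 * The storage in *args must stay valid until the launch call has returned. */
static void build_kernel_params(ScaleArgs *args, void *kernelParams[3])
{
    assert(args != NULL && kernelParams != NULL);
    kernelParams[0] = &args->data;
    kernelParams[1] = &args->n;
    kernelParams[2] = &args->scale;
}
```

       Each entry points at the value itself, not at a copy, so the values can be changed
       between launches without rebuilding the array.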

       Calling cuLaunchCooperativeKernel() sets persistent function state that is the same as
       function state set through the cuLaunchKernel API.

       When the kernel f is launched via cuLaunchCooperativeKernel(), the previous block shape,
       shared size and parameter info associated with f are overwritten.

       Note that to use cuLaunchCooperativeKernel(), the kernel f must either have been compiled
       with toolchain version 3.2 or later so that it will contain kernel parameter information,
       or have no kernel parameters. If neither of these conditions is met, then
       cuLaunchCooperativeKernel() will return CUDA_ERROR_INVALID_IMAGE.

       Parameters:
           f - Kernel to launch
           gridDimX - Width of grid in blocks
           gridDimY - Height of grid in blocks
           gridDimZ - Depth of grid in blocks
           blockDimX - X dimension of each thread block
           blockDimY - Y dimension of each thread block
           blockDimZ - Z dimension of each thread block
           sharedMemBytes - Dynamic shared-memory size per thread block in bytes
           hStream - Stream identifier
           kernelParams - Array of pointers to kernel parameters

       Returns:
           CUDA_SUCCESS, CUDA_ERROR_DEINITIALIZED, CUDA_ERROR_NOT_INITIALIZED,
           CUDA_ERROR_INVALID_CONTEXT, CUDA_ERROR_INVALID_HANDLE, CUDA_ERROR_INVALID_IMAGE,
           CUDA_ERROR_INVALID_VALUE, CUDA_ERROR_LAUNCH_FAILED,
           CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES, CUDA_ERROR_LAUNCH_TIMEOUT,
           CUDA_ERROR_LAUNCH_INCOMPATIBLE_TEXTURING, CUDA_ERROR_COOPERATIVE_LAUNCH_TOO_LARGE,
           CUDA_ERROR_SHARED_OBJECT_INIT_FAILED

       Note:
           This function uses standard default stream semantics.

           Note that this function may also return error codes from previous, asynchronous
           launches.

       See also:
           cuCtxGetCacheConfig, cuCtxSetCacheConfig, cuFuncSetCacheConfig, cuFuncGetAttribute,
           cuLaunchCooperativeKernelMultiDevice, cudaLaunchCooperativeKernel

   CUresult cuLaunchCooperativeKernelMultiDevice (CUDA_LAUNCH_PARAMS * launchParamsList, unsigned
       int numDevices, unsigned int flags)
       Invokes kernels as specified in the launchParamsList array where each element of the array
       specifies all the parameters required to perform a single kernel launch. These kernels can
       cooperate and synchronize as they execute. The size of the array is specified by
       numDevices.

       No two kernels can be launched on the same device. All the devices targeted by this
       multi-device launch must be identical. All devices must have a non-zero value for the
       device attribute CU_DEVICE_ATTRIBUTE_COOPERATIVE_MULTI_DEVICE_LAUNCH.

       All kernels launched must be identical with respect to the compiled code. Note that any
       __device__, __constant__ or __managed__ variables present in the module that owns the
       kernel launched on each device are independently instantiated on every device. It is the
       application's responsibility to ensure these variables are initialized and used
       appropriately.

       The size of the grids as specified in blocks, the size of the blocks themselves and the
       amount of shared memory used by each thread block must also match across all launched
       kernels.

       The streams used to launch these kernels must have been created via either cuStreamCreate
       or cuStreamCreateWithPriority. The NULL stream or CU_STREAM_LEGACY or CU_STREAM_PER_THREAD
       cannot be used.

       The total number of blocks launched per kernel cannot exceed the maximum number of blocks
       per multiprocessor as returned by cuOccupancyMaxActiveBlocksPerMultiprocessor (or
       cuOccupancyMaxActiveBlocksPerMultiprocessorWithFlags) times the number of multiprocessors
       as specified by the device attribute CU_DEVICE_ATTRIBUTE_MULTIPROCESSOR_COUNT. Since the
       total number of blocks launched per device has to match across all devices, the maximum
       number of blocks that can be launched per device will be limited by the device with the
       least number of multiprocessors.
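
       Because the grid size must match on every device, the usable limit is the smallest
       per-device bound. The plain C sketch below computes that minimum;
       multi_device_block_limit is a hypothetical helper, and its two input arrays are assumed
       to hold, for each device, the occupancy result and the value of
       CU_DEVICE_ATTRIBUTE_MULTIPROCESSOR_COUNT.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: smallest (blocks per multiprocessor * multiprocessors)
 * over all devices taking part in the multi-device launch. */
static unsigned long long multi_device_block_limit(const int *blocks_per_sm,
                                                   const int *sm_counts,
                                                   size_t num_devices)
{
    unsigned long long limit = 0;
    size_t i;

    assert(blocks_per_sm != NULL && sm_counts != NULL && num_devices > 0);
    for (i = 0; i < num_devices; ++i) {
        unsigned long long device_limit;

        assert(blocks_per_sm[i] >= 0 && sm_counts[i] >= 0);
        device_limit = (unsigned long long)blocks_per_sm[i] *
                       (unsigned long long)sm_counts[i];
        /* Keep the first value, then any smaller one. */
        if (i == 0 || device_limit < limit)
            limit = device_limit;
    }
    return limit;
}
```

       The product gridDimX * gridDimY * gridDimZ used for every element of launchParamsList
       would then need to stay at or below the returned value.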

       The kernels cannot make use of CUDA dynamic parallelism.

       The CUDA_LAUNCH_PARAMS structure is defined as:

               typedef struct CUDA_LAUNCH_PARAMS_st
               {
                   CUfunction function;
                   unsigned int gridDimX;
                   unsigned int gridDimY;
                   unsigned int gridDimZ;
                   unsigned int blockDimX;
                   unsigned int blockDimY;
                   unsigned int blockDimZ;
                   unsigned int sharedMemBytes;
                   CUstream hStream;
                   void **kernelParams;
               } CUDA_LAUNCH_PARAMS;

       where:

       • CUDA_LAUNCH_PARAMS::function specifies the kernel to be launched. All functions must be
         identical with respect to the compiled code.

       • CUDA_LAUNCH_PARAMS::gridDimX is the width of the grid in blocks. This must match across
         all kernels launched.

       • CUDA_LAUNCH_PARAMS::gridDimY is the height of the grid in blocks. This must match across
         all kernels launched.

       • CUDA_LAUNCH_PARAMS::gridDimZ is the depth of the grid in blocks. This must match across
         all kernels launched.

       • CUDA_LAUNCH_PARAMS::blockDimX is the X dimension of each thread block. This must match
         across all kernels launched.

       • CUDA_LAUNCH_PARAMS::blockDimY is the Y dimension of each thread block. This must match
         across all kernels launched.

       • CUDA_LAUNCH_PARAMS::blockDimZ is the Z dimension of each thread block. This must match
         across all kernels launched.

       • CUDA_LAUNCH_PARAMS::sharedMemBytes is the dynamic shared-memory size per thread block in
         bytes. This must match across all kernels launched.

       • CUDA_LAUNCH_PARAMS::hStream is the handle to the stream to perform the launch in. This
         cannot be the NULL stream or CU_STREAM_LEGACY or CU_STREAM_PER_THREAD. The CUDA context
         associated with this stream must match that associated with
         CUDA_LAUNCH_PARAMS::function.

       • CUDA_LAUNCH_PARAMS::kernelParams is an array of pointers to kernel parameters. If
         CUDA_LAUNCH_PARAMS::function has N parameters, then CUDA_LAUNCH_PARAMS::kernelParams
         needs to be an array of N pointers. Each of CUDA_LAUNCH_PARAMS::kernelParams[0] through
         CUDA_LAUNCH_PARAMS::kernelParams[N-1] must point to a region of memory from which the
         actual kernel parameter will be copied. The number of kernel parameters and their
         offsets and sizes do not need to be specified as that information is retrieved directly
         from the kernel's image.

       By default, the kernel won't begin execution on any GPU until all prior work in all the
       specified streams has completed. This behavior can be overridden by specifying the flag
       CUDA_COOPERATIVE_LAUNCH_MULTI_DEVICE_NO_PRE_LAUNCH_SYNC. When this flag is specified, each
       kernel will only wait for prior work in the stream corresponding to that GPU to complete
       before it begins execution.

       Similarly, by default, any subsequent work pushed in any of the specified streams will not
       begin execution until the kernels on all GPUs have completed. This behavior can be
       overridden by specifying the flag
       CUDA_COOPERATIVE_LAUNCH_MULTI_DEVICE_NO_POST_LAUNCH_SYNC. When this flag is specified, any
       subsequent work pushed in any of the specified streams will only wait for the kernel
       launched on the GPU corresponding to that stream to complete before it begins execution.

       Calling cuLaunchCooperativeKernelMultiDevice() sets persistent function state that is the
       same as function state set through the cuLaunchKernel API when called individually for
       each element in launchParamsList.

       When kernels are launched via cuLaunchCooperativeKernelMultiDevice(), the previous block
       shape, shared size and parameter info associated with each CUDA_LAUNCH_PARAMS::function in
       launchParamsList are overwritten.

       Note that to use cuLaunchCooperativeKernelMultiDevice(), the kernels must either have been
       compiled with toolchain version 3.2 or later so that they will contain kernel parameter
       information, or have no kernel parameters. If neither of these conditions is met, then
       cuLaunchCooperativeKernelMultiDevice() will return CUDA_ERROR_INVALID_IMAGE.

       Parameters:
           launchParamsList - List of launch parameters, one per device
           numDevices - Size of the launchParamsList array
           flags - Flags to control launch behavior

       Returns:
           CUDA_SUCCESS, CUDA_ERROR_DEINITIALIZED, CUDA_ERROR_NOT_INITIALIZED,
           CUDA_ERROR_INVALID_CONTEXT, CUDA_ERROR_INVALID_HANDLE, CUDA_ERROR_INVALID_IMAGE,
           CUDA_ERROR_INVALID_VALUE, CUDA_ERROR_LAUNCH_FAILED,
           CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES, CUDA_ERROR_LAUNCH_TIMEOUT,
           CUDA_ERROR_LAUNCH_INCOMPATIBLE_TEXTURING, CUDA_ERROR_COOPERATIVE_LAUNCH_TOO_LARGE,
           CUDA_ERROR_SHARED_OBJECT_INIT_FAILED

       Note:
           This function uses standard default stream semantics.

           Note that this function may also return error codes from previous, asynchronous
           launches.

       See also:
           cuCtxGetCacheConfig, cuCtxSetCacheConfig, cuFuncSetCacheConfig, cuFuncGetAttribute,
           cuLaunchCooperativeKernel, cudaLaunchCooperativeKernelMultiDevice

   CUresult cuLaunchKernel (CUfunction f, unsigned int gridDimX, unsigned int gridDimY, unsigned
       int gridDimZ, unsigned int blockDimX, unsigned int blockDimY, unsigned int blockDimZ,
       unsigned int sharedMemBytes, CUstream hStream, void ** kernelParams, void ** extra)
       Invokes the kernel f on a gridDimX x gridDimY x gridDimZ grid of blocks. Each block
       contains blockDimX x blockDimY x blockDimZ threads.

       sharedMemBytes sets the amount of dynamic shared memory that will be available to each
       thread block.

       Kernel parameters to f can be specified in one of two ways:

       1) Kernel parameters can be specified via kernelParams. If f has N parameters, then
       kernelParams needs to be an array of N pointers. Each of kernelParams[0] through
       kernelParams[N-1] must point to a region of memory from which the actual kernel parameter
       will be copied. The number of kernel parameters and their offsets and sizes do not need to
       be specified as that information is retrieved directly from the kernel's image.

       2) Kernel parameters can also be packaged by the application into a single buffer that is
       passed in via the extra parameter. This places the burden on the application of knowing
       each kernel parameter's size and alignment/padding within the buffer. Here is an example
       of using the extra parameter in this manner:

           size_t argBufferSize;
           char argBuffer[256];

           // populate argBuffer with the packed kernel arguments and set
           // argBufferSize to the number of bytes used

           void *config[] = {
               CU_LAUNCH_PARAM_BUFFER_POINTER, argBuffer,
               CU_LAUNCH_PARAM_BUFFER_SIZE,    &argBufferSize,
               CU_LAUNCH_PARAM_END
           };
           status = cuLaunchKernel(f, gx, gy, gz, bx, by, bz, sh, s, NULL, config);
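
       One way to populate such a buffer is to append each argument at the next suitably
       aligned offset. The plain C sketch below is an illustration, not the driver's own
       method: align_up and pack_arg are hypothetical helpers, and placing each parameter at
       its natural alignment is an assumption of this sketch that the application must verify
       against the kernel's actual parameter layout.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Round offset up to the next multiple of align (align must be a power of two). */
static size_t align_up(size_t offset, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (offset + align - 1) & ~(align - 1);
}

/* Hypothetical helper: copy one argument into buf at the next offset that
 * satisfies align, zero-filling any padding. On success, *offset is advanced
 * past the argument and 0 is returned; -1 is returned if it does not fit. */
static int pack_arg(char *buf, size_t capacity, size_t *offset,
                    const void *arg, size_t size, size_t align)
{
    size_t start = align_up(*offset, align);

    if (start > capacity || size > capacity - start)
        return -1;
    memset(buf + *offset, 0, start - *offset);   /* padding bytes */
    memcpy(buf + start, arg, size);
    *offset = start + size;
    return 0;
}
```

       After the last call, the running offset is the value to store in argBufferSize.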

       The extra parameter exists to allow cuLaunchKernel to take additional less commonly used
       arguments. extra specifies a list of names of extra settings and their corresponding
       values. Each extra setting name is immediately followed by the corresponding value. The
       list must be terminated with either NULL or CU_LAUNCH_PARAM_END.

       • CU_LAUNCH_PARAM_END, which indicates the end of the extra array;

       • CU_LAUNCH_PARAM_BUFFER_POINTER, which specifies that the next value in extra will be a
         pointer to a buffer containing all the kernel parameters for launching kernel f;

       • CU_LAUNCH_PARAM_BUFFER_SIZE, which specifies that the next value in extra will be a
         pointer to a size_t containing the size of the buffer specified with
         CU_LAUNCH_PARAM_BUFFER_POINTER.

       The error CUDA_ERROR_INVALID_VALUE will be returned if kernel parameters are specified
       with both kernelParams and extra (i.e. both kernelParams and extra are non-NULL).

       Calling cuLaunchKernel() sets persistent function state that is the same as function state
       set through the following deprecated APIs: cuFuncSetBlockShape(), cuFuncSetSharedSize(),
       cuParamSetSize(), cuParamSeti(), cuParamSetf(), cuParamSetv().

       When the kernel f is launched via cuLaunchKernel(), the previous block shape, shared size
       and parameter info associated with f are overwritten.

       Note that to use cuLaunchKernel(), the kernel f must either have been compiled with
       toolchain version 3.2 or later so that it will contain kernel parameter information, or
       have no kernel parameters. If neither of these conditions is met, then cuLaunchKernel()
       will return CUDA_ERROR_INVALID_IMAGE.

       Parameters:
           f - Kernel to launch
           gridDimX - Width of grid in blocks
           gridDimY - Height of grid in blocks
           gridDimZ - Depth of grid in blocks
           blockDimX - X dimension of each thread block
           blockDimY - Y dimension of each thread block
           blockDimZ - Z dimension of each thread block
           sharedMemBytes - Dynamic shared-memory size per thread block in bytes
           hStream - Stream identifier
           kernelParams - Array of pointers to kernel parameters
           extra - Extra options

       Returns:
           CUDA_SUCCESS, CUDA_ERROR_DEINITIALIZED, CUDA_ERROR_NOT_INITIALIZED,
           CUDA_ERROR_INVALID_CONTEXT, CUDA_ERROR_INVALID_HANDLE, CUDA_ERROR_INVALID_IMAGE,
           CUDA_ERROR_INVALID_VALUE, CUDA_ERROR_LAUNCH_FAILED,
           CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES, CUDA_ERROR_LAUNCH_TIMEOUT,
           CUDA_ERROR_LAUNCH_INCOMPATIBLE_TEXTURING, CUDA_ERROR_SHARED_OBJECT_INIT_FAILED

       Note:
           This function uses standard default stream semantics.

           Note that this function may also return error codes from previous, asynchronous
           launches.

       See also:
           cuCtxGetCacheConfig, cuCtxSetCacheConfig, cuFuncSetCacheConfig, cuFuncGetAttribute,
           cudaLaunchKernel

Author

       Generated automatically by Doxygen from the source code.