Provided by: nvidia-cuda-dev_10.1.243-3_amd64

NAME

       Execution Control - execution control functions of the low-level CUDA driver API (cuda.h)

   Functions
       CUresult cuFuncGetAttribute (int *pi, CUfunction_attribute attrib, CUfunction hfunc)
           Returns information about a function.
       CUresult cuFuncSetAttribute (CUfunction hfunc, CUfunction_attribute attrib, int value)
           Sets information about a function.
       CUresult cuFuncSetCacheConfig (CUfunction hfunc, CUfunc_cache config)
           Sets the preferred cache configuration for a device function.
       CUresult cuFuncSetSharedMemConfig (CUfunction hfunc, CUsharedconfig config)
           Sets the shared memory configuration for a device function.
       CUresult cuLaunchCooperativeKernel (CUfunction f, unsigned int gridDimX, unsigned int gridDimY, unsigned
           int gridDimZ, unsigned int blockDimX, unsigned int blockDimY, unsigned int blockDimZ, unsigned int
           sharedMemBytes, CUstream hStream, void **kernelParams)
           Launches a CUDA function where thread blocks can cooperate and synchronize as they execute.
       CUresult cuLaunchCooperativeKernelMultiDevice (CUDA_LAUNCH_PARAMS *launchParamsList, unsigned int
           numDevices, unsigned int flags)
           Launches CUDA functions on multiple devices where thread blocks can cooperate and synchronize as they
           execute.
       CUresult cuLaunchHostFunc (CUstream hStream, CUhostFn fn, void *userData)
           Enqueues a host function call in a stream.
       CUresult cuLaunchKernel (CUfunction f, unsigned int gridDimX, unsigned int gridDimY, unsigned int
           gridDimZ, unsigned int blockDimX, unsigned int blockDimY, unsigned int blockDimZ, unsigned int
           sharedMemBytes, CUstream hStream, void **kernelParams, void **extra)
           Launches a CUDA function.

Detailed Description

       This section describes the execution control functions of the low-level CUDA driver application
       programming interface.

Function Documentation

   CUresult cuFuncGetAttribute (int * pi, CUfunction_attribute attrib, CUfunction hfunc)
       Returns in *pi the integer value of the attribute attrib on the kernel given by hfunc. The supported
       attributes are:

       • CU_FUNC_ATTRIBUTE_MAX_THREADS_PER_BLOCK: The maximum number of threads per block, beyond which a launch
         of  the  function  would  fail.  This  number  depends on both the function and the device on which the
         function is currently loaded.

       • CU_FUNC_ATTRIBUTE_SHARED_SIZE_BYTES: The size in bytes of statically-allocated shared memory per  block
         required  by  this function. This does not include dynamically-allocated shared memory requested by the
         user at runtime.

       • CU_FUNC_ATTRIBUTE_CONST_SIZE_BYTES: The size in bytes of user-allocated  constant  memory  required  by
         this function.

       • CU_FUNC_ATTRIBUTE_LOCAL_SIZE_BYTES:  The  size  in  bytes  of  local memory used by each thread of this
         function.

       • CU_FUNC_ATTRIBUTE_NUM_REGS: The number of registers used by each thread of this function.

       • CU_FUNC_ATTRIBUTE_PTX_VERSION: The  PTX  virtual  architecture  version  for  which  the  function  was
         compiled.  This  value  is  the  major  PTX  version * 10 + the minor PTX version, so a PTX version 1.3
         function would return the value 13. Note that this may return the  undefined  value  of  0  for  cubins
         compiled prior to CUDA 3.0.

       • CU_FUNC_ATTRIBUTE_BINARY_VERSION:  The binary architecture version for which the function was compiled.
         This value is the major binary version * 10 + the  minor  binary  version,  so  a  binary  version  1.3
         function  would return the value 13. Note that this will return a value of 10 for legacy cubins that do
         not have a properly-encoded binary architecture version.

        • CU_FUNC_ATTRIBUTE_CACHE_MODE_CA: The attribute to indicate whether the function has been compiled with
          the user-specified option '-Xptxas --dlcm=ca' set.

       • CU_FUNC_ATTRIBUTE_MAX_DYNAMIC_SHARED_SIZE_BYTES:  The  maximum  size  in bytes of dynamically-allocated
         shared memory.

       • CU_FUNC_ATTRIBUTE_PREFERRED_SHARED_MEMORY_CARVEOUT: Preferred shared memory-L1  cache  split  ratio  in
         percent of total shared memory.
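
        For illustration, a minimal sketch (an editorial addition, not part of the original reference) that
        queries a few of these attributes; hfunc is assumed to have been obtained earlier via
        cuModuleGetFunction():

            int maxThreads = 0, numRegs = 0, staticSmem = 0;
            CUresult err;

            err = cuFuncGetAttribute(&maxThreads,
                                     CU_FUNC_ATTRIBUTE_MAX_THREADS_PER_BLOCK, hfunc);
            if (err == CUDA_SUCCESS)
                err = cuFuncGetAttribute(&numRegs,
                                         CU_FUNC_ATTRIBUTE_NUM_REGS, hfunc);
            if (err == CUDA_SUCCESS)
                err = cuFuncGetAttribute(&staticSmem,
                                         CU_FUNC_ATTRIBUTE_SHARED_SIZE_BYTES, hfunc);
            // maxThreads now bounds blockDimX*blockDimY*blockDimZ for launches of hfunc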

       Parameters:
           pi - Returned attribute value
           attrib - Attribute requested
           hfunc - Function to query attribute of

       Returns:
           CUDA_SUCCESS,   CUDA_ERROR_DEINITIALIZED,   CUDA_ERROR_NOT_INITIALIZED,   CUDA_ERROR_INVALID_CONTEXT,
           CUDA_ERROR_INVALID_HANDLE, CUDA_ERROR_INVALID_VALUE

       Note:
           Note that this function may also return error codes from previous, asynchronous launches.

       See also:
            cuCtxGetCacheConfig, cuCtxSetCacheConfig, cuFuncSetCacheConfig, cuLaunchKernel, cudaFuncGetAttributes,
           cudaFuncSetAttribute

   CUresult cuFuncSetAttribute (CUfunction hfunc, CUfunction_attribute attrib, int value)
        This call sets the value of the specified attribute attrib on the kernel given by hfunc to the integer
        value specified by value. This function returns CUDA_SUCCESS if the new value of the attribute could be
        successfully set. If the set fails, this call will return an error. Not all attributes can have values
        set. Attempting to set a value on a read-only attribute will result in an error
        (CUDA_ERROR_INVALID_VALUE).

       Supported attributes for the cuFuncSetAttribute call are:

        • CU_FUNC_ATTRIBUTE_MAX_DYNAMIC_SHARED_SIZE_BYTES: The  maximum  size in bytes of dynamically-allocated
         shared memory. The value should contain the requested  maximum  size  of  dynamically-allocated  shared
         memory.  The  sum  of  this value and the function attribute CU_FUNC_ATTRIBUTE_SHARED_SIZE_BYTES cannot
         exceed the device attribute CU_DEVICE_ATTRIBUTE_MAX_SHARED_MEMORY_PER_BLOCK_OPTIN. The maximal size  of
         requestable dynamic shared memory may differ by GPU architecture.

       • CU_FUNC_ATTRIBUTE_PREFERRED_SHARED_MEMORY_CARVEOUT: On devices where the L1 cache and shared memory use
         the  same  hardware resources, this sets the shared memory carveout preference, in percent of the total
          shared memory. See CU_DEVICE_ATTRIBUTE_MAX_SHARED_MEMORY_PER_MULTIPROCESSOR. This is only a hint, and
         the driver can choose a different ratio if required to execute the function.
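
        As a usage sketch (an editorial addition), opting a kernel into a larger dynamic shared memory
        allocation; hfunc is assumed to be a loaded kernel handle:

            // Request up to 64 KB of dynamic shared memory for this kernel. The sum
            // of this value and CU_FUNC_ATTRIBUTE_SHARED_SIZE_BYTES must not exceed
            // CU_DEVICE_ATTRIBUTE_MAX_SHARED_MEMORY_PER_BLOCK_OPTIN for the device.
            CUresult err = cuFuncSetAttribute(hfunc,
                               CU_FUNC_ATTRIBUTE_MAX_DYNAMIC_SHARED_SIZE_BYTES,
                               64 * 1024);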

       Parameters:
            hfunc - Function to set the attribute for
            attrib - Attribute to set
           value - The value to set

       Returns:
           CUDA_SUCCESS,   CUDA_ERROR_DEINITIALIZED,   CUDA_ERROR_NOT_INITIALIZED,   CUDA_ERROR_INVALID_CONTEXT,
           CUDA_ERROR_INVALID_HANDLE, CUDA_ERROR_INVALID_VALUE

       Note:
           Note that this function may also return error codes from previous, asynchronous launches.

       See also:
            cuCtxGetCacheConfig, cuCtxSetCacheConfig, cuFuncSetCacheConfig, cuLaunchKernel, cudaFuncGetAttributes,
           cudaFuncSetAttribute

   CUresult cuFuncSetCacheConfig (CUfunction hfunc, CUfunc_cache config)
       On devices where the L1 cache and shared memory use the same hardware resources, this sets through config
       the preferred cache configuration for the device function hfunc. This is only a  preference.  The  driver
       will  use  the requested configuration if possible, but it is free to choose a different configuration if
       required to execute hfunc. Any context-wide preference set via cuCtxSetCacheConfig() will  be  overridden
       by  this per-function setting unless the per-function setting is CU_FUNC_CACHE_PREFER_NONE. In that case,
       the current context-wide setting will be used.

       This setting does nothing on devices where the size of the L1 cache and shared memory are fixed.

       Launching a kernel with a different preference than the most  recent  preference  setting  may  insert  a
       device-side synchronization point.

       The supported cache configurations are:

       • CU_FUNC_CACHE_PREFER_NONE: no preference for shared memory or L1 (default)

       • CU_FUNC_CACHE_PREFER_SHARED: prefer larger shared memory and smaller L1 cache

       • CU_FUNC_CACHE_PREFER_L1: prefer larger L1 cache and smaller shared memory

       • CU_FUNC_CACHE_PREFER_EQUAL: prefer equal sized L1 cache and shared memory
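
        A minimal usage sketch (an editorial addition); hfunc is assumed to be a loaded kernel handle:

            // Hint that this kernel is shared-memory heavy, so a larger shared
            // memory partition is preferred over a larger L1 cache.
            CUresult err = cuFuncSetCacheConfig(hfunc, CU_FUNC_CACHE_PREFER_SHARED);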

       Parameters:
           hfunc - Kernel to configure cache for
           config - Requested cache configuration

       Returns:
           CUDA_SUCCESS,    CUDA_ERROR_INVALID_VALUE,    CUDA_ERROR_DEINITIALIZED,   CUDA_ERROR_NOT_INITIALIZED,
           CUDA_ERROR_INVALID_CONTEXT

       Note:
           Note that this function may also return error codes from previous, asynchronous launches.

       See also:
           cuCtxGetCacheConfig, cuCtxSetCacheConfig, cuFuncGetAttribute, cuLaunchKernel, cudaFuncSetCacheConfig

   CUresult cuFuncSetSharedMemConfig (CUfunction hfunc, CUsharedconfig config)
       On devices with configurable shared memory banks, this function will force all subsequent launches of the
       specified device function to have the given shared memory bank size configuration. On any given launch of
       the function, the shared memory configuration of the device will be temporarily changed if needed to suit
        the function's preferred configuration. Changes in shared  memory  configuration  between  subsequent
        launches of functions may introduce a device-side synchronization point.

        Any per-function setting of shared memory bank size set via cuFuncSetSharedMemConfig will override the
        context-wide setting set with cuCtxSetSharedMemConfig.

       Changing the shared memory bank size will not  increase  shared  memory  usage  or  affect  occupancy  of
       kernels,  but  may  have major effects on performance. Larger bank sizes will allow for greater potential
       bandwidth to shared memory, but will change what kinds of accesses to shared memory will result  in  bank
       conflicts.

       This function will do nothing on devices with fixed shared memory bank size.

       The supported bank configurations are:

       • CU_SHARED_MEM_CONFIG_DEFAULT_BANK_SIZE:  use  the  context's shared memory configuration when launching
         this function.

       • CU_SHARED_MEM_CONFIG_FOUR_BYTE_BANK_SIZE: set shared memory bank width to be natively four  bytes  when
         launching this function.

       • CU_SHARED_MEM_CONFIG_EIGHT_BYTE_BANK_SIZE: set shared memory bank width to be natively eight bytes when
         launching this function.
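
        For example (a sketch added for illustration), a kernel whose shared memory traffic is dominated by
        double-precision values might request eight-byte banks; hfunc is assumed to be a loaded kernel handle:

            // Eight-byte banks can avoid two-way bank conflicts on consecutive
            // double (8-byte) accesses, at the cost of changing which four-byte
            // access patterns conflict.
            CUresult err = cuFuncSetSharedMemConfig(hfunc,
                               CU_SHARED_MEM_CONFIG_EIGHT_BYTE_BANK_SIZE);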

       Parameters:
           hfunc - kernel to be given a shared memory config
           config - requested shared memory configuration

       Returns:
           CUDA_SUCCESS,    CUDA_ERROR_INVALID_VALUE,    CUDA_ERROR_DEINITIALIZED,   CUDA_ERROR_NOT_INITIALIZED,
           CUDA_ERROR_INVALID_CONTEXT

       Note:
           Note that this function may also return error codes from previous, asynchronous launches.

       See also:
           cuCtxGetCacheConfig,    cuCtxSetCacheConfig,    cuCtxGetSharedMemConfig,     cuCtxSetSharedMemConfig,
           cuFuncGetAttribute, cuLaunchKernel, cudaFuncSetSharedMemConfig

   CUresult  cuLaunchCooperativeKernel (CUfunction f, unsigned int gridDimX, unsigned int gridDimY, unsigned int
       gridDimZ,  unsigned  int  blockDimX,  unsigned  int  blockDimY,  unsigned  int  blockDimZ,  unsigned  int
       sharedMemBytes, CUstream hStream, void ** kernelParams)
       Invokes  the kernel f on a gridDimX x gridDimY x gridDimZ grid of blocks. Each block contains blockDimX x
       blockDimY x blockDimZ threads.

       sharedMemBytes sets the amount of dynamic shared memory that will be available to each thread block.

       The device on which this kernel  is  invoked  must  have  a  non-zero  value  for  the  device  attribute
       CU_DEVICE_ATTRIBUTE_COOPERATIVE_LAUNCH.

       The  total  number  of  blocks  launched cannot exceed the maximum number of blocks per multiprocessor as
       returned                by                cuOccupancyMaxActiveBlocksPerMultiprocessor                 (or
       cuOccupancyMaxActiveBlocksPerMultiprocessorWithFlags) times the number of multiprocessors as specified by
       the device attribute CU_DEVICE_ATTRIBUTE_MULTIPROCESSOR_COUNT.

       The kernel cannot make use of CUDA dynamic parallelism.

       Kernel  parameters  must be specified via kernelParams. If f has N parameters, then kernelParams needs to
       be an array of N pointers. Each of kernelParams[0] through kernelParams[N-1] must point to  a  region  of
       memory  from  which the actual kernel parameter will be copied. The number of kernel parameters and their
       offsets and sizes do not need to be specified as that information is retrieved directly from the kernel's
       image.

        Calling cuLaunchCooperativeKernel() sets persistent function state that is the same as function state set
        through the cuLaunchKernel API.

        When the kernel f is launched via cuLaunchCooperativeKernel(), the previous block shape, shared size  and
        parameter info associated with f are overwritten.

       Note  that to use cuLaunchCooperativeKernel(), the kernel f must either have been compiled with toolchain
       version 3.2 or later so that it will contain kernel parameter information, or have no kernel  parameters.
       If   either   of   these   conditions   is   not   met,   then  cuLaunchCooperativeKernel()  will  return
       CUDA_ERROR_INVALID_IMAGE.
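
        A launch sketch (added for illustration) that sizes the grid from the occupancy limit described above;
        f, dev, hStream and the kernel's single device-pointer argument d_data are assumed to exist:

            int supported = 0, blocksPerSm = 0, smCount = 0;
            int blockSize = 256;

            cuDeviceGetAttribute(&supported,
                                 CU_DEVICE_ATTRIBUTE_COOPERATIVE_LAUNCH, dev);
            cuOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSm, f,
                                                        blockSize, 0);
            cuDeviceGetAttribute(&smCount,
                                 CU_DEVICE_ATTRIBUTE_MULTIPROCESSOR_COUNT, dev);

            void *params[] = { &d_data };
            if (supported)
                // The grid must not exceed blocksPerSm * smCount total blocks.
                cuLaunchCooperativeKernel(f, blocksPerSm * smCount, 1, 1,
                                          blockSize, 1, 1, 0, hStream, params);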

       Parameters:
           f - Kernel to launch
           gridDimX - Width of grid in blocks
           gridDimY - Height of grid in blocks
           gridDimZ - Depth of grid in blocks
           blockDimX - X dimension of each thread block
           blockDimY - Y dimension of each thread block
           blockDimZ - Z dimension of each thread block
           sharedMemBytes - Dynamic shared-memory size per thread block in bytes
           hStream - Stream identifier
           kernelParams - Array of pointers to kernel parameters

       Returns:
           CUDA_SUCCESS,   CUDA_ERROR_DEINITIALIZED,   CUDA_ERROR_NOT_INITIALIZED,   CUDA_ERROR_INVALID_CONTEXT,
           CUDA_ERROR_INVALID_HANDLE,             CUDA_ERROR_INVALID_IMAGE,            CUDA_ERROR_INVALID_VALUE,
           CUDA_ERROR_LAUNCH_FAILED,       CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES,        CUDA_ERROR_LAUNCH_TIMEOUT,
           CUDA_ERROR_LAUNCH_INCOMPATIBLE_TEXTURING,                    CUDA_ERROR_COOPERATIVE_LAUNCH_TOO_LARGE,
           CUDA_ERROR_SHARED_OBJECT_INIT_FAILED

       Note:
            This function uses standard default stream semantics.

           Note that this function may also return error codes from previous, asynchronous launches.

       See also:
           cuCtxGetCacheConfig,       cuCtxSetCacheConfig,       cuFuncSetCacheConfig,       cuFuncGetAttribute,
           cuLaunchCooperativeKernelMultiDevice, cudaLaunchCooperativeKernel

   CUresult   cuLaunchCooperativeKernelMultiDevice   (CUDA_LAUNCH_PARAMS   *   launchParamsList,   unsigned  int
       numDevices, unsigned int flags)
       Invokes kernels as specified in the launchParamsList array where each element of the array specifies  all
       the parameters required to perform a single kernel launch. These kernels can cooperate and synchronize as
       they execute. The size of the array is specified by numDevices.

       No  two  kernels can be launched on the same device. All the devices targeted by this multi-device launch
       must  be  identical.  All   devices   must   have   a   non-zero   value   for   the   device   attribute
       CU_DEVICE_ATTRIBUTE_COOPERATIVE_MULTI_DEVICE_LAUNCH.

       All  kernels  launched  must  be  identical  with respect to the compiled code. Note that any __device__,
       __constant__ or __managed__ variables present in the module that owns the kernel launched on each device,
       are independently instantiated on every device. It is the application's responsibility  to  ensure  these
       variables are initialized and used appropriately.

       The  size of the grids as specified in blocks, the size of the blocks themselves and the amount of shared
       memory used by each thread block must also match across all launched kernels.

       The streams  used  to  launch  these  kernels  must  have  been  created  via  either  cuStreamCreate  or
       cuStreamCreateWithPriority. The NULL stream or CU_STREAM_LEGACY or CU_STREAM_PER_THREAD cannot be used.

       The  total  number  of  blocks  launched  per  kernel  cannot  exceed  the  maximum  number of blocks per
       multiprocessor      as      returned       by       cuOccupancyMaxActiveBlocksPerMultiprocessor       (or
       cuOccupancyMaxActiveBlocksPerMultiprocessorWithFlags) times the number of multiprocessors as specified by
       the  device attribute CU_DEVICE_ATTRIBUTE_MULTIPROCESSOR_COUNT. Since the total number of blocks launched
       per device has to match across all devices, the maximum number of blocks that can be launched per  device
       will be limited by the device with the least number of multiprocessors.

       The kernels cannot make use of CUDA dynamic parallelism.

       The CUDA_LAUNCH_PARAMS structure is defined as:

               typedef struct CUDA_LAUNCH_PARAMS_st
               {
                   CUfunction function;
                   unsigned int gridDimX;
                   unsigned int gridDimY;
                   unsigned int gridDimZ;
                   unsigned int blockDimX;
                   unsigned int blockDimY;
                   unsigned int blockDimZ;
                   unsigned int sharedMemBytes;
                   CUstream hStream;
                   void **kernelParams;
               } CUDA_LAUNCH_PARAMS;

        where:

       • CUDA_LAUNCH_PARAMS::function  specifies the kernel to be launched. All functions must be identical with
         respect to the compiled code.

       • CUDA_LAUNCH_PARAMS::gridDimX is the width of the grid in blocks. This must  match  across  all  kernels
         launched.

       • CUDA_LAUNCH_PARAMS::gridDimY  is  the  height of the grid in blocks. This must match across all kernels
         launched.

       • CUDA_LAUNCH_PARAMS::gridDimZ is the depth of the grid in blocks. This must  match  across  all  kernels
         launched.

       • CUDA_LAUNCH_PARAMS::blockDimX  is  the  X  dimension  of  each thread block. This must match across all
         kernels launched.

        • CUDA_LAUNCH_PARAMS::blockDimY is the Y dimension of each thread  block.  This  must  match  across  all
          kernels launched.

       • CUDA_LAUNCH_PARAMS::blockDimZ  is  the  Z  dimension  of  each thread block. This must match across all
         kernels launched.

       • CUDA_LAUNCH_PARAMS::sharedMemBytes is the dynamic shared-memory size per thread block  in  bytes.  This
         must match across all kernels launched.

       • CUDA_LAUNCH_PARAMS::hStream  is  the  handle to the stream to perform the launch in. This cannot be the
         NULL stream or CU_STREAM_LEGACY or CU_STREAM_PER_THREAD. The CUDA context associated with  this  stream
         must match that associated with CUDA_LAUNCH_PARAMS::function.

       • CUDA_LAUNCH_PARAMS::kernelParams    is    an    array    of   pointers   to   kernel   parameters.   If
         CUDA_LAUNCH_PARAMS::function has N parameters, then CUDA_LAUNCH_PARAMS::kernelParams  needs  to  be  an
         array      of      N      pointers.     Each     of     CUDA_LAUNCH_PARAMS::kernelParams[0]     through
         CUDA_LAUNCH_PARAMS::kernelParams[N-1] must point to a region of memory from  which  the  actual  kernel
         parameter will be copied. The number of kernel parameters and their offsets and sizes do not need to be
         specified as that information is retrieved directly from the kernel's image.

       By default, the kernel won't begin execution on any GPU until all prior work in all the specified streams
       has     completed.     This     behavior     can     be     overridden    by    specifying    the    flag
       CUDA_COOPERATIVE_LAUNCH_MULTI_DEVICE_NO_PRE_LAUNCH_SYNC. When this flag is specified,  each  kernel  will
       only wait for prior work in the stream corresponding to that GPU to complete before it begins execution.

       Similarly,  by  default,  any  subsequent  work  pushed  in  any  of the specified streams will not begin
       execution until the kernels on all GPUs have completed. This behavior can be overridden by specifying the
       flag  CUDA_COOPERATIVE_LAUNCH_MULTI_DEVICE_NO_POST_LAUNCH_SYNC.  When  this  flag   is   specified,   any
       subsequent  work pushed in any of the specified streams will only wait for the kernel launched on the GPU
       corresponding to that stream to complete before it begins execution.

        Calling cuLaunchCooperativeKernelMultiDevice() sets  persistent  function  state  that  is  the  same  as
        function state set through the cuLaunchKernel API when called individually for each element in
        launchParamsList.

        When kernels are launched via cuLaunchCooperativeKernelMultiDevice(), the previous  block  shape,  shared
        size and parameter info associated with each CUDA_LAUNCH_PARAMS::function in launchParamsList are
        overwritten.

        Note that to use cuLaunchCooperativeKernelMultiDevice(), the kernels must either have been compiled  with
        toolchain version 3.2 or later so that they contain kernel parameter information, or have no kernel
        parameters. If either of these conditions is not met,  then  cuLaunchCooperativeKernelMultiDevice()  will
        return CUDA_ERROR_INVALID_IMAGE.
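
        A sketch (added for illustration) of populating launchParamsList for two devices; funcs[i] and streams[i]
        (one stream per device context, created with cuStreamCreate) are assumed to be set up, and the kernels
        here are assumed to take no parameters:

            CUDA_LAUNCH_PARAMS launchParams[2];
            unsigned int i;

            for (i = 0; i < 2; ++i) {
                launchParams[i].function       = funcs[i];
                launchParams[i].gridDimX       = 40;    // must match across devices
                launchParams[i].gridDimY       = 1;
                launchParams[i].gridDimZ       = 1;
                launchParams[i].blockDimX      = 256;   // must match across devices
                launchParams[i].blockDimY      = 1;
                launchParams[i].blockDimZ      = 1;
                launchParams[i].sharedMemBytes = 0;     // must match across devices
                launchParams[i].hStream        = streams[i]; // not NULL/legacy/per-thread
                launchParams[i].kernelParams   = NULL;  // kernels take no parameters
            }
            CUresult err = cuLaunchCooperativeKernelMultiDevice(launchParams, 2, 0);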

       Parameters:
           launchParamsList - List of launch parameters, one per device
           numDevices - Size of the launchParamsList array
           flags - Flags to control launch behavior

       Returns:
           CUDA_SUCCESS,   CUDA_ERROR_DEINITIALIZED,   CUDA_ERROR_NOT_INITIALIZED,   CUDA_ERROR_INVALID_CONTEXT,
           CUDA_ERROR_INVALID_HANDLE,            CUDA_ERROR_INVALID_IMAGE,             CUDA_ERROR_INVALID_VALUE,
           CUDA_ERROR_LAUNCH_FAILED,        CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES,       CUDA_ERROR_LAUNCH_TIMEOUT,
           CUDA_ERROR_LAUNCH_INCOMPATIBLE_TEXTURING,                    CUDA_ERROR_COOPERATIVE_LAUNCH_TOO_LARGE,
           CUDA_ERROR_SHARED_OBJECT_INIT_FAILED

       Note:
            This function uses standard default stream semantics.

           Note that this function may also return error codes from previous, asynchronous launches.

       See also:
           cuCtxGetCacheConfig,       cuCtxSetCacheConfig,       cuFuncSetCacheConfig,       cuFuncGetAttribute,
           cuLaunchCooperativeKernel, cudaLaunchCooperativeKernelMultiDevice

   CUresult cuLaunchHostFunc (CUstream hStream, CUhostFn fn, void * userData)
       Enqueues a host function to run in a stream. The function will be called after  currently  enqueued  work
       and will block work added after it.

       The  host  function  must  not  make  any  CUDA  API  calls.  Attempting  to use a CUDA API may result in
       CUDA_ERROR_NOT_PERMITTED,  but  this  is  not  required.  The  host  function  must   not   perform   any
       synchronization  that  may  depend  on  outstanding CUDA work not mandated to run earlier. Host functions
       without a mandated order (such as  in  independent  streams)  execute  in  undefined  order  and  may  be
       serialized.

       For the purposes of Unified Memory, execution makes a number of guarantees:

       • The  stream  is  considered  idle  for the duration of the function's execution. Thus, for example, the
         function may always use memory attached to the stream it was enqueued in.
       • The start of execution of the function has the same effect as synchronizing an event  recorded  in  the
         same  stream  immediately  prior to the function. It thus synchronizes streams which have been 'joined'
         prior to the function.
       • Adding device work to any stream does not have the  effect  of  making  the  stream  active  until  all
         preceding  host  functions  and stream callbacks have executed. Thus, for example, a function might use
         global attached memory even if work has been added to another stream, if  the  work  has  been  ordered
         behind the function call with an event.
       • Completion  of  the  function  does  not cause a stream to become active except as described above. The
         stream will remain idle if no device work follows the function, and will remain idle across consecutive
         host functions or  stream  callbacks  without  device  work  in  between.  Thus,  for  example,  stream
         synchronization can be done by signaling from a host function at the end of the stream.

        Note that, in contrast to cuStreamAddCallback, the function will not be called in the event of an  error
        in the CUDA context.
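
        A usage sketch (added for illustration); note that the callback makes no CUDA API calls, per the rules
        above:

            #include <stdio.h>
            #include <cuda.h>

            // Host callback: runs on a driver thread once preceding stream work
            // completes; it must not call into the CUDA API.
            static void CUDA_CB myHostFn(void *userData)
            {
                printf("work before this point is done: %s\n", (const char *)userData);
            }

            // Later, with hStream created via cuStreamCreate():
            CUresult err = cuLaunchHostFunc(hStream, myHostFn, (void *)"batch 0");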

        Parameters:
            hStream - Stream to enqueue function call in
            fn - The function to call once preceding stream operations are complete
            userData - User-specified data to be passed to the function

        Returns:
            CUDA_SUCCESS,   CUDA_ERROR_DEINITIALIZED,   CUDA_ERROR_NOT_INITIALIZED,   CUDA_ERROR_INVALID_CONTEXT,
            CUDA_ERROR_INVALID_HANDLE, CUDA_ERROR_NOT_SUPPORTED

        Note:
            This function uses standard default stream semantics.

            Note that this function may also return error codes from previous, asynchronous launches.

        See also:
            cuStreamCreate,     cuStreamQuery,     cuStreamSynchronize,    cuStreamWaitEvent,    cuStreamDestroy,
            cuMemAllocManaged, cuStreamAttachMemAsync, cuStreamAddCallback

   CUresult cuLaunchKernel (CUfunction f, unsigned int gridDimX, unsigned int gridDimY, unsigned  int  gridDimZ,
       unsigned  int  blockDimX,  unsigned  int  blockDimY, unsigned int blockDimZ, unsigned int sharedMemBytes,
       CUstream hStream, void ** kernelParams, void ** extra)
        Invokes the kernel f on a gridDimX x gridDimY x gridDimZ grid of blocks. Each block contains blockDimX  x
        blockDimY x blockDimZ threads.

        sharedMemBytes sets the amount of dynamic shared memory that will be available to each thread block.

        Kernel parameters to f can be specified in one of two ways:

        1) Kernel parameters can be specified via kernelParams. If f has N parameters, then kernelParams needs to
        be an array of N pointers. Each of kernelParams[0] through kernelParams[N-1] must point to  a  region  of
        memory from which the actual kernel parameter will be copied. The number of kernel parameters  and  their
        offsets and sizes do not need to be specified as that information is retrieved directly from the kernel's
        image.

        2) Kernel parameters can also be packaged by the application into a single buffer that is passed  in  via
        the extra parameter. This places the burden on the application of knowing each  kernel  parameter's  size
        and alignment/padding within the buffer. Here is an example of using the extra parameter in this manner:

           size_t argBufferSize;
           char argBuffer[256];

           // populate argBuffer and argBufferSize

           void *config[] = {
               CU_LAUNCH_PARAM_BUFFER_POINTER, argBuffer,
               CU_LAUNCH_PARAM_BUFFER_SIZE,    &argBufferSize,
               CU_LAUNCH_PARAM_END
           };
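            // f, gx, gy, gz, bx, by, bz, sh and s are assumed declared elsewhere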
           status = cuLaunchKernel(f, gx, gy, gz, bx, by, bz, sh, s, NULL, config);

        The extra parameter exists to allow cuLaunchKernel to take additional, less commonly used arguments.
        extra specifies a list of names of extra settings and their corresponding values. Each extra setting name
        is immediately followed by the corresponding value. The list must  be  terminated  with  either  NULL  or
        CU_LAUNCH_PARAM_END.

        • CU_LAUNCH_PARAM_END, which indicates the end of the extra array;

        • CU_LAUNCH_PARAM_BUFFER_POINTER, which specifies that the next value in extra will be a  pointer  to  a
          buffer containing all the kernel parameters for launching kernel f;

        • CU_LAUNCH_PARAM_BUFFER_SIZE, which specifies that the next value in extra will be a pointer to a size_t
          containing the size of the buffer specified with CU_LAUNCH_PARAM_BUFFER_POINTER.

        The error CUDA_ERROR_INVALID_VALUE will  be  returned  if  kernel  parameters  are  specified  with  both
        kernelParams and extra (i.e. both kernelParams and extra are non-NULL).

        Calling cuLaunchKernel() sets persistent function state that is the same as function state  set  through
        the  following   deprecated   APIs:   cuFuncSetBlockShape(),   cuFuncSetSharedSize(),   cuParamSetSize(),
        cuParamSeti(), cuParamSetf(), cuParamSetv().

        When the kernel f is launched via cuLaunchKernel(), the previous block shape, shared size and  parameter
        info associated with f are overwritten.

        Note that to use cuLaunchKernel(), the kernel f must either have been compiled with toolchain version 3.2
        or later so that it will contain kernel parameter information, or have no kernel parameters. If either of
        these conditions is not met, then cuLaunchKernel() will return CUDA_ERROR_INVALID_IMAGE.
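
        For completeness, a sketch (added for illustration) of method 1) for a hypothetical kernel declared as
        __global__ void vecScale(float *out, int n):

            CUdeviceptr d_out;              // assumed allocated with cuMemAlloc()
            int         n = 1024;
            void *kernelParams[] = { &d_out, &n };

            CUresult status = cuLaunchKernel(f,
                                  (n + 255) / 256, 1, 1,   // grid dimensions
                                  256, 1, 1,               // block dimensions
                                  0, NULL,                 // shared memory, stream
                                  kernelParams, NULL);     // no extra options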

        Parameters:
            f - Kernel to launch
            gridDimX - Width of grid in blocks
            gridDimY - Height of grid in blocks
            gridDimZ - Depth of grid in blocks
            blockDimX - X dimension of each thread block
            blockDimY - Y dimension of each thread block
            blockDimZ - Z dimension of each thread block
            sharedMemBytes - Dynamic shared-memory size per thread block in bytes
            hStream - Stream identifier
            kernelParams - Array of pointers to kernel parameters
            extra - Extra options

        Returns:
            CUDA_SUCCESS,   CUDA_ERROR_DEINITIALIZED,   CUDA_ERROR_NOT_INITIALIZED,   CUDA_ERROR_INVALID_CONTEXT,
            CUDA_ERROR_INVALID_HANDLE,             CUDA_ERROR_INVALID_IMAGE,            CUDA_ERROR_INVALID_VALUE,
            CUDA_ERROR_LAUNCH_FAILED,       CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES,        CUDA_ERROR_LAUNCH_TIMEOUT,
            CUDA_ERROR_LAUNCH_INCOMPATIBLE_TEXTURING, CUDA_ERROR_SHARED_OBJECT_INIT_FAILED

        Note:
            This function uses standard default stream semantics.

            Note that this function may also return error codes from previous, asynchronous launches.

        See also:
            cuCtxGetCacheConfig, cuCtxSetCacheConfig, cuFuncSetCacheConfig, cuFuncGetAttribute, cudaLaunchKernel

Author

       Generated automatically by Doxygen from the source code.

Version 6.0                                        28 Jul 2019                              Execution Control(3)