Ubuntu Manpage: Memory Management -

Provided by: nvidia-cuda-dev_9.1.85-3ubuntu1_amd64

NAME

       Memory Management -

   Functions
       CUresult cuArray3DCreate (CUarray *pHandle, const CUDA_ARRAY3D_DESCRIPTOR *pAllocateArray)
           Creates a 3D CUDA array.
       CUresult cuArray3DGetDescriptor (CUDA_ARRAY3D_DESCRIPTOR *pArrayDescriptor, CUarray hArray)
           Get a 3D CUDA array descriptor.
       CUresult cuArrayCreate (CUarray *pHandle, const CUDA_ARRAY_DESCRIPTOR *pAllocateArray)
           Creates a 1D or 2D CUDA array.
       CUresult cuArrayDestroy (CUarray hArray)
           Destroys a CUDA array.
       CUresult cuArrayGetDescriptor (CUDA_ARRAY_DESCRIPTOR *pArrayDescriptor, CUarray hArray)
           Get a 1D or 2D CUDA array descriptor.
       CUresult cuDeviceGetByPCIBusId (CUdevice *dev, const char *pciBusId)
           Returns a handle to a compute device.
       CUresult cuDeviceGetPCIBusId (char *pciBusId, int len, CUdevice dev)
           Returns a PCI Bus Id string for the device.
       CUresult cuIpcCloseMemHandle (CUdeviceptr dptr)
           Close memory mapped with cuIpcOpenMemHandle.
       CUresult cuIpcGetEventHandle (CUipcEventHandle *pHandle, CUevent event)
           Gets an interprocess handle for a previously allocated event.
       CUresult cuIpcGetMemHandle (CUipcMemHandle *pHandle, CUdeviceptr dptr)
           Gets an interprocess memory handle for an existing device memory allocation.
       CUresult cuIpcOpenEventHandle (CUevent *phEvent, CUipcEventHandle handle)
           Opens an interprocess event handle for use in the current process.
       CUresult cuIpcOpenMemHandle (CUdeviceptr *pdptr, CUipcMemHandle handle, unsigned int Flags)
           Opens an interprocess memory handle exported from another process and returns a device pointer usable
           in the local process.
       CUresult cuMemAlloc (CUdeviceptr *dptr, size_t bytesize)
           Allocates device memory.
       CUresult cuMemAllocHost (void **pp, size_t bytesize)
           Allocates page-locked host memory.
       CUresult cuMemAllocManaged (CUdeviceptr *dptr, size_t bytesize, unsigned int flags)
           Allocates memory that will be automatically managed by the Unified Memory system.
       CUresult cuMemAllocPitch (CUdeviceptr *dptr, size_t *pPitch, size_t WidthInBytes, size_t Height, unsigned
           int ElementSizeBytes)
           Allocates pitched device memory.
       CUresult cuMemcpy (CUdeviceptr dst, CUdeviceptr src, size_t ByteCount)
           Copies memory.
       CUresult cuMemcpy2D (const CUDA_MEMCPY2D *pCopy)
           Copies memory for 2D arrays.
       CUresult cuMemcpy2DAsync (const CUDA_MEMCPY2D *pCopy, CUstream hStream)
           Copies memory for 2D arrays.
       CUresult cuMemcpy2DUnaligned (const CUDA_MEMCPY2D *pCopy)
           Copies memory for 2D arrays.
       CUresult cuMemcpy3D (const CUDA_MEMCPY3D *pCopy)
           Copies memory for 3D arrays.
       CUresult cuMemcpy3DAsync (const CUDA_MEMCPY3D *pCopy, CUstream hStream)
           Copies memory for 3D arrays.
       CUresult cuMemcpy3DPeer (const CUDA_MEMCPY3D_PEER *pCopy)
           Copies memory between contexts.
       CUresult cuMemcpy3DPeerAsync (const CUDA_MEMCPY3D_PEER *pCopy, CUstream hStream)
           Copies memory between contexts asynchronously.
       CUresult cuMemcpyAsync (CUdeviceptr dst, CUdeviceptr src, size_t ByteCount, CUstream hStream)
           Copies memory asynchronously.
       CUresult cuMemcpyAtoA (CUarray dstArray, size_t dstOffset, CUarray srcArray, size_t srcOffset, size_t
           ByteCount)
           Copies memory from Array to Array.
       CUresult cuMemcpyAtoD (CUdeviceptr dstDevice, CUarray srcArray, size_t srcOffset, size_t ByteCount)
           Copies memory from Array to Device.
       CUresult cuMemcpyAtoH (void *dstHost, CUarray srcArray, size_t srcOffset, size_t ByteCount)
           Copies memory from Array to Host.
       CUresult cuMemcpyAtoHAsync (void *dstHost, CUarray srcArray, size_t srcOffset, size_t ByteCount, CUstream
           hStream)
           Copies memory from Array to Host.
       CUresult cuMemcpyDtoA (CUarray dstArray, size_t dstOffset, CUdeviceptr srcDevice, size_t ByteCount)
           Copies memory from Device to Array.
       CUresult cuMemcpyDtoD (CUdeviceptr dstDevice, CUdeviceptr srcDevice, size_t ByteCount)
           Copies memory from Device to Device.
       CUresult cuMemcpyDtoDAsync (CUdeviceptr dstDevice, CUdeviceptr srcDevice, size_t ByteCount, CUstream
           hStream)
           Copies memory from Device to Device.
       CUresult cuMemcpyDtoH (void *dstHost, CUdeviceptr srcDevice, size_t ByteCount)
           Copies memory from Device to Host.
       CUresult cuMemcpyDtoHAsync (void *dstHost, CUdeviceptr srcDevice, size_t ByteCount, CUstream hStream)
           Copies memory from Device to Host.
       CUresult cuMemcpyHtoA (CUarray dstArray, size_t dstOffset, const void *srcHost, size_t ByteCount)
           Copies memory from Host to Array.
       CUresult cuMemcpyHtoAAsync (CUarray dstArray, size_t dstOffset, const void *srcHost, size_t ByteCount,
           CUstream hStream)
           Copies memory from Host to Array.
       CUresult cuMemcpyHtoD (CUdeviceptr dstDevice, const void *srcHost, size_t ByteCount)
           Copies memory from Host to Device.
       CUresult cuMemcpyHtoDAsync (CUdeviceptr dstDevice, const void *srcHost, size_t ByteCount, CUstream
           hStream)
           Copies memory from Host to Device.
       CUresult cuMemcpyPeer (CUdeviceptr dstDevice, CUcontext dstContext, CUdeviceptr srcDevice, CUcontext
           srcContext, size_t ByteCount)
           Copies device memory between two contexts.
       CUresult cuMemcpyPeerAsync (CUdeviceptr dstDevice, CUcontext dstContext, CUdeviceptr srcDevice, CUcontext
           srcContext, size_t ByteCount, CUstream hStream)
           Copies device memory between two contexts asynchronously.
       CUresult cuMemFree (CUdeviceptr dptr)
           Frees device memory.
       CUresult cuMemFreeHost (void *p)
           Frees page-locked host memory.
       CUresult cuMemGetAddressRange (CUdeviceptr *pbase, size_t *psize, CUdeviceptr dptr)
           Get information on memory allocations.
       CUresult cuMemGetInfo (size_t *free, size_t *total)
           Gets free and total memory.
       CUresult cuMemHostAlloc (void **pp, size_t bytesize, unsigned int Flags)
           Allocates page-locked host memory.
       CUresult cuMemHostGetDevicePointer (CUdeviceptr *pdptr, void *p, unsigned int Flags)
           Passes back device pointer of mapped pinned memory.
       CUresult cuMemHostGetFlags (unsigned int *pFlags, void *p)
           Passes back flags that were used for a pinned allocation.
       CUresult cuMemHostRegister (void *p, size_t bytesize, unsigned int Flags)
           Registers an existing host memory range for use by CUDA.
       CUresult cuMemHostUnregister (void *p)
           Unregisters a memory range that was registered with cuMemHostRegister.
       CUresult cuMemsetD16 (CUdeviceptr dstDevice, unsigned short us, size_t N)
           Initializes device memory.
       CUresult cuMemsetD16Async (CUdeviceptr dstDevice, unsigned short us, size_t N, CUstream hStream)
           Sets device memory.
       CUresult cuMemsetD2D16 (CUdeviceptr dstDevice, size_t dstPitch, unsigned short us, size_t Width, size_t
           Height)
           Initializes device memory.
       CUresult cuMemsetD2D16Async (CUdeviceptr dstDevice, size_t dstPitch, unsigned short us, size_t Width,
           size_t Height, CUstream hStream)
           Sets device memory.
       CUresult cuMemsetD2D32 (CUdeviceptr dstDevice, size_t dstPitch, unsigned int ui, size_t Width, size_t
           Height)
           Initializes device memory.
       CUresult cuMemsetD2D32Async (CUdeviceptr dstDevice, size_t dstPitch, unsigned int ui, size_t Width,
           size_t Height, CUstream hStream)
           Sets device memory.
       CUresult cuMemsetD2D8 (CUdeviceptr dstDevice, size_t dstPitch, unsigned char uc, size_t Width, size_t
           Height)
           Initializes device memory.
       CUresult cuMemsetD2D8Async (CUdeviceptr dstDevice, size_t dstPitch, unsigned char uc, size_t Width,
           size_t Height, CUstream hStream)
           Sets device memory.
       CUresult cuMemsetD32 (CUdeviceptr dstDevice, unsigned int ui, size_t N)
           Initializes device memory.
       CUresult cuMemsetD32Async (CUdeviceptr dstDevice, unsigned int ui, size_t N, CUstream hStream)
           Sets device memory.
       CUresult cuMemsetD8 (CUdeviceptr dstDevice, unsigned char uc, size_t N)
           Initializes device memory.
       CUresult cuMemsetD8Async (CUdeviceptr dstDevice, unsigned char uc, size_t N, CUstream hStream)
           Sets device memory.
       CUresult cuMipmappedArrayCreate (CUmipmappedArray *pHandle, const CUDA_ARRAY3D_DESCRIPTOR
           *pMipmappedArrayDesc, unsigned int numMipmapLevels)
           Creates a CUDA mipmapped array.
       CUresult cuMipmappedArrayDestroy (CUmipmappedArray hMipmappedArray)
           Destroys a CUDA mipmapped array.
       CUresult cuMipmappedArrayGetLevel (CUarray *pLevelArray, CUmipmappedArray hMipmappedArray, unsigned int
           level)
           Gets a mipmap level of a CUDA mipmapped array.

Detailed Description

       Memory management functions of the low-level CUDA driver API (cuda.h).

       This section describes the memory management functions of the low-level CUDA driver application
       programming interface.

Function Documentation

   CUresult cuArray3DCreate (CUarray * pHandle, const CUDA_ARRAY3D_DESCRIPTOR * pAllocateArray)
       Creates a CUDA array according to the CUDA_ARRAY3D_DESCRIPTOR structure pAllocateArray and returns a
       handle to the new CUDA array in *pHandle. The CUDA_ARRAY3D_DESCRIPTOR is defined as:

           typedef struct {
               unsigned int Width;
               unsigned int Height;
               unsigned int Depth;
               CUarray_format Format;
               unsigned int NumChannels;
               unsigned int Flags;
           } CUDA_ARRAY3D_DESCRIPTOR;

        where:

       • Width,  Height,  and  Depth  are  the  width,  height,  and  depth of the CUDA array (in elements); the
         following types of CUDA arrays can be allocated:

         • A 1D array is allocated if Height and Depth extents are both zero.

         • A 2D array is allocated if only Depth extent is zero.

         • A 3D array is allocated if all three extents are non-zero.

         • A 1D layered CUDA array is allocated if only Height is zero and the CUDA_ARRAY3D_LAYERED flag is set.
           Each layer is a 1D array. The number of layers is determined by the depth extent.

         • A 2D layered CUDA array is allocated if all three extents are non-zero and  the  CUDA_ARRAY3D_LAYERED
           flag is set. Each layer is a 2D array. The number of layers is determined by the depth extent.

         • A cubemap CUDA array is allocated if all three extents are non-zero and the CUDA_ARRAY3D_CUBEMAP flag
           is  set.  Width  must  be  equal  to Height, and Depth must be six. A cubemap is a special type of 2D
           layered CUDA array, where the six layers represent the six faces of a cube.  The  order  of  the  six
           layers in memory is the same as that listed in CUarray_cubemap_face.

         • A cubemap layered CUDA array is allocated if all three extents are non-zero and both the
           CUDA_ARRAY3D_CUBEMAP and CUDA_ARRAY3D_LAYERED flags are set. Width must be equal to Height, and Depth
           must be a multiple of six. A cubemap layered CUDA array is a special type of 2D layered CUDA array
           that consists of a collection of cubemaps. The first six layers represent the first cubemap, the next
           six layers form the second cubemap, and so on.

       • Format specifies the format of the elements; CUarray_format is defined as:

           typedef enum CUarray_format_enum {
               CU_AD_FORMAT_UNSIGNED_INT8 = 0x01,
               CU_AD_FORMAT_UNSIGNED_INT16 = 0x02,
               CU_AD_FORMAT_UNSIGNED_INT32 = 0x03,
               CU_AD_FORMAT_SIGNED_INT8 = 0x08,
               CU_AD_FORMAT_SIGNED_INT16 = 0x09,
               CU_AD_FORMAT_SIGNED_INT32 = 0x0a,
               CU_AD_FORMAT_HALF = 0x10,
               CU_AD_FORMAT_FLOAT = 0x20
           } CUarray_format;

       • NumChannels specifies the number of packed components per CUDA array element; it may be 1, 2, or 4;

       • Flags may be set to

         • CUDA_ARRAY3D_LAYERED  to enable creation of layered CUDA arrays. If this flag is set, Depth specifies
           the number of layers, not the depth of a 3D array.

         • CUDA_ARRAY3D_SURFACE_LDST to enable surface references to be bound to the CUDA array. If this flag is
           not set, cuSurfRefSetArray will fail when attempting to bind the CUDA array to a surface reference.

         • CUDA_ARRAY3D_CUBEMAP to enable creation of cubemaps. If this flag is set,  Width  must  be  equal  to
           Height,  and  Depth  must  be six. If the CUDA_ARRAY3D_LAYERED flag is also set, then Depth must be a
           multiple of six.

         • CUDA_ARRAY3D_TEXTURE_GATHER to indicate that the CUDA array will be used for texture gather.  Texture
           gather can only be performed on 2D CUDA arrays.
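
       The extent and flag rules above can be sketched as a small classifier. This is a minimal illustration in
       plain C, not driver code: array_kind and classify_array are hypothetical names, and the two booleans
       stand for whether CUDA_ARRAY3D_LAYERED and CUDA_ARRAY3D_CUBEMAP are set in Flags. Any combination that
       the text does not list (including a zero Width) is reported as KIND_INVALID, which is an assumption of
       this sketch rather than documented driver behavior.

```c
#include <stdbool.h>

/* Hypothetical names for the array types described in the text. */
typedef enum {
    KIND_INVALID,
    KIND_1D,
    KIND_2D,
    KIND_3D,
    KIND_1D_LAYERED,
    KIND_2D_LAYERED,
    KIND_CUBEMAP,
    KIND_CUBEMAP_LAYERED
} array_kind;

/* Classify a descriptor from its extents and the LAYERED/CUBEMAP flags.
 * Combinations not listed in the documentation map to KIND_INVALID
 * (an assumption of this sketch). */
array_kind classify_array(unsigned width, unsigned height, unsigned depth,
                          bool layered, bool cubemap)
{
    if (width == 0)
        return KIND_INVALID;

    if (cubemap) {
        /* Cubemaps need all extents non-zero and square faces. */
        if (height == 0 || depth == 0 || width != height)
            return KIND_INVALID;
        if (layered)
            return (depth % 6 == 0) ? KIND_CUBEMAP_LAYERED : KIND_INVALID;
        return (depth == 6) ? KIND_CUBEMAP : KIND_INVALID;
    }

    if (layered) {
        /* Depth is the layer count, so it must be non-zero. */
        if (depth == 0)
            return KIND_INVALID;
        return (height == 0) ? KIND_1D_LAYERED : KIND_2D_LAYERED;
    }

    if (height == 0)
        return (depth == 0) ? KIND_1D : KIND_INVALID;
    return (depth == 0) ? KIND_2D : KIND_3D;
}
```

       For example, classify_array(64, 64, 0, false, false) yields KIND_2D, and classify_array(8, 8, 12, true,
       true) yields KIND_CUBEMAP_LAYERED.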

       Width, Height, and Depth must meet certain size requirements as listed below. All values are specified
       in elements. For brevity, the full name of the device attribute is not given; for example,
       TEXTURE1D_WIDTH refers to the device attribute CU_DEVICE_ATTRIBUTE_MAXIMUM_TEXTURE1D_WIDTH.

       Note that 2D CUDA arrays have different size requirements if the CUDA_ARRAY3D_TEXTURE_GATHER flag is set.
       Width  and  Height  must  not  be  greater  than  CU_DEVICE_ATTRIBUTE_MAXIMUM_TEXTURE2D_GATHER_WIDTH  and
       CU_DEVICE_ATTRIBUTE_MAXIMUM_TEXTURE2D_GATHER_HEIGHT respectively, in that case.

       Each entry below has the form {(width range in elements), (height range), (depth range)}. "Always" gives
       the valid extents that must always be met; "With SURFACE_LDST" gives the valid extents with
       CUDA_ARRAY3D_SURFACE_LDST set.

       CUDA array type: 1D
           Always:            { (1,TEXTURE1D_WIDTH), 0, 0 }
           With SURFACE_LDST: { (1,SURFACE1D_WIDTH), 0, 0 }

       CUDA array type: 2D
           Always:            { (1,TEXTURE2D_WIDTH), (1,TEXTURE2D_HEIGHT), 0 }
           With SURFACE_LDST: { (1,SURFACE2D_WIDTH), (1,SURFACE2D_HEIGHT), 0 }

       CUDA array type: 3D
           Always:            { (1,TEXTURE3D_WIDTH), (1,TEXTURE3D_HEIGHT), (1,TEXTURE3D_DEPTH) }
                              OR
                              { (1,TEXTURE3D_WIDTH_ALTERNATE), (1,TEXTURE3D_HEIGHT_ALTERNATE),
                                (1,TEXTURE3D_DEPTH_ALTERNATE) }
           With SURFACE_LDST: { (1,SURFACE3D_WIDTH), (1,SURFACE3D_HEIGHT), (1,SURFACE3D_DEPTH) }

       CUDA array type: 1D Layered
           Always:            { (1,TEXTURE1D_LAYERED_WIDTH), 0, (1,TEXTURE1D_LAYERED_LAYERS) }
           With SURFACE_LDST: { (1,SURFACE1D_LAYERED_WIDTH), 0, (1,SURFACE1D_LAYERED_LAYERS) }

       CUDA array type: 2D Layered
           Always:            { (1,TEXTURE2D_LAYERED_WIDTH), (1,TEXTURE2D_LAYERED_HEIGHT),
                                (1,TEXTURE2D_LAYERED_LAYERS) }
           With SURFACE_LDST: { (1,SURFACE2D_LAYERED_WIDTH), (1,SURFACE2D_LAYERED_HEIGHT),
                                (1,SURFACE2D_LAYERED_LAYERS) }

       CUDA array type: Cubemap
           Always:            { (1,TEXTURECUBEMAP_WIDTH), (1,TEXTURECUBEMAP_WIDTH), 6 }
           With SURFACE_LDST: { (1,SURFACECUBEMAP_WIDTH), (1,SURFACECUBEMAP_WIDTH), 6 }

       CUDA array type: Cubemap Layered
           Always:            { (1,TEXTURECUBEMAP_LAYERED_WIDTH), (1,TEXTURECUBEMAP_LAYERED_WIDTH),
                                (1,TEXTURECUBEMAP_LAYERED_LAYERS) }
           With SURFACE_LDST: { (1,SURFACECUBEMAP_LAYERED_WIDTH), (1,SURFACECUBEMAP_LAYERED_WIDTH),
                                (1,SURFACECUBEMAP_LAYERED_LAYERS) }
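
       The "OR" in the 3D entry means a 3D array is acceptable if it fits within either the primary limits or
       the alternate limits. A minimal sketch of that check follows, assuming the caller has already obtained
       the six maxima (for instance with cuDeviceGetAttribute). The extent3 type and both function names are
       hypothetical, and the numbers in the usage note are made up for illustration.

```c
#include <stdbool.h>

/* Hypothetical holder for a width/height/depth triple, in elements. */
typedef struct {
    unsigned width;
    unsigned height;
    unsigned depth;
} extent3;

/* True if every extent lies in the inclusive range [1, max]. */
bool extent_fits(extent3 e, extent3 max)
{
    return e.width  >= 1 && e.width  <= max.width  &&
           e.height >= 1 && e.height <= max.height &&
           e.depth  >= 1 && e.depth  <= max.depth;
}

/* A 3D array passes if it fits the primary limits OR the alternate ones. */
bool valid_3d_extents(extent3 e, extent3 primary, extent3 alternate)
{
    return extent_fits(e, primary) || extent_fits(e, alternate);
}
```

       With hypothetical primary limits of 2048 x 2048 x 2048 and alternate limits of 8192 x 8192 x 32, a
       4096 x 4096 x 16 array passes through the alternate limits, while a 4096 x 4096 x 64 array fits neither
       set.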

       Here are examples of CUDA array descriptions:

       Description for a CUDA array of 2048 floats:

           CUDA_ARRAY3D_DESCRIPTOR desc;
           desc.Format = CU_AD_FORMAT_FLOAT;
           desc.NumChannels = 1;
           desc.Width = 2048;
           desc.Height = 0;
           desc.Depth = 0;

       Description for a 64 x 64 CUDA array of floats:

           CUDA_ARRAY3D_DESCRIPTOR desc;
           desc.Format = CU_AD_FORMAT_FLOAT;
           desc.NumChannels = 1;
           desc.Width = 64;
           desc.Height = 64;
           desc.Depth = 0;

       Description for a width x height x depth CUDA array of 64-bit, 4x16-bit float16's:

           CUDA_ARRAY3D_DESCRIPTOR desc;
           desc.Format = CU_AD_FORMAT_HALF;
           desc.NumChannels = 4;
           desc.Width = width;
           desc.Height = height;
           desc.Depth = depth;

       Parameters:
           pHandle - Returned array
           pAllocateArray - 3D array descriptor

       Returns:
           CUDA_SUCCESS,   CUDA_ERROR_DEINITIALIZED,   CUDA_ERROR_NOT_INITIALIZED,   CUDA_ERROR_INVALID_CONTEXT,
           CUDA_ERROR_INVALID_VALUE, CUDA_ERROR_OUT_OF_MEMORY, CUDA_ERROR_UNKNOWN

       Note:
           Note that this function may also return error codes from previous, asynchronous launches.

       See also:
           cuArray3DGetDescriptor,    cuArrayCreate,    cuArrayDestroy,    cuArrayGetDescriptor,     cuMemAlloc,
           cuMemAllocHost,   cuMemAllocPitch,   cuMemcpy2D,  cuMemcpy2DAsync,  cuMemcpy2DUnaligned,  cuMemcpy3D,
           cuMemcpy3DAsync,   cuMemcpyAtoA,   cuMemcpyAtoD,   cuMemcpyAtoH,   cuMemcpyAtoHAsync,   cuMemcpyDtoA,
           cuMemcpyDtoD,  cuMemcpyDtoDAsync,  cuMemcpyDtoH,  cuMemcpyDtoHAsync, cuMemcpyHtoA, cuMemcpyHtoAAsync,
           cuMemcpyHtoD,  cuMemcpyHtoDAsync,  cuMemFree,  cuMemFreeHost,   cuMemGetAddressRange,   cuMemGetInfo,
           cuMemHostAlloc,  cuMemHostGetDevicePointer,  cuMemsetD2D8,  cuMemsetD2D16, cuMemsetD2D32, cuMemsetD8,
           cuMemsetD16, cuMemsetD32, cudaMalloc3DArray

   CUresult cuArray3DGetDescriptor (CUDA_ARRAY3D_DESCRIPTOR * pArrayDescriptor, CUarray hArray)
       Returns in *pArrayDescriptor a descriptor containing information on the format and dimensions of the CUDA
       array hArray. It is useful for subroutines that have been passed a CUDA array, but need to know the  CUDA
       array parameters for validation or other purposes.

       This  function  may  be  called on 1D and 2D arrays, in which case the Height and/or Depth members of the
       descriptor struct will be set to 0.

       Parameters:
           pArrayDescriptor - Returned 3D array descriptor
           hArray - 3D array to get descriptor of

       Returns:
           CUDA_SUCCESS,   CUDA_ERROR_DEINITIALIZED,   CUDA_ERROR_NOT_INITIALIZED,   CUDA_ERROR_INVALID_CONTEXT,
           CUDA_ERROR_INVALID_VALUE, CUDA_ERROR_INVALID_HANDLE

       Note:
           Note that this function may also return error codes from previous, asynchronous launches.

       See also:
           cuArray3DCreate,  cuArrayCreate,  cuArrayDestroy,  cuArrayGetDescriptor,  cuMemAlloc, cuMemAllocHost,
           cuMemAllocPitch,  cuMemcpy2D,  cuMemcpy2DAsync,  cuMemcpy2DUnaligned,  cuMemcpy3D,   cuMemcpy3DAsync,
           cuMemcpyAtoA,    cuMemcpyAtoD,    cuMemcpyAtoH,    cuMemcpyAtoHAsync,   cuMemcpyDtoA,   cuMemcpyDtoD,
           cuMemcpyDtoDAsync, cuMemcpyDtoH, cuMemcpyDtoHAsync,  cuMemcpyHtoA,  cuMemcpyHtoAAsync,  cuMemcpyHtoD,
           cuMemcpyHtoDAsync,  cuMemFree,  cuMemFreeHost,  cuMemGetAddressRange,  cuMemGetInfo,  cuMemHostAlloc,
           cuMemHostGetDevicePointer,  cuMemsetD2D8,  cuMemsetD2D16,  cuMemsetD2D32,  cuMemsetD8,   cuMemsetD16,
           cuMemsetD32, cudaArrayGetInfo

   CUresult cuArrayCreate (CUarray * pHandle, const CUDA_ARRAY_DESCRIPTOR * pAllocateArray)
       Creates a CUDA array according to the CUDA_ARRAY_DESCRIPTOR structure pAllocateArray and returns a handle
       to the new CUDA array in *pHandle. The CUDA_ARRAY_DESCRIPTOR is defined as:

           typedef struct {
               unsigned int Width;
               unsigned int Height;
               CUarray_format Format;
               unsigned int NumChannels;
           } CUDA_ARRAY_DESCRIPTOR;

        where:

       • Width and Height are the width and height of the CUDA array (in elements); the CUDA array is
         one-dimensional if Height is 0, two-dimensional otherwise;

       • Format specifies the format of the elements; CUarray_format is defined as:

           typedef enum CUarray_format_enum {
               CU_AD_FORMAT_UNSIGNED_INT8 = 0x01,
               CU_AD_FORMAT_UNSIGNED_INT16 = 0x02,
               CU_AD_FORMAT_UNSIGNED_INT32 = 0x03,
               CU_AD_FORMAT_SIGNED_INT8 = 0x08,
               CU_AD_FORMAT_SIGNED_INT16 = 0x09,
               CU_AD_FORMAT_SIGNED_INT32 = 0x0a,
               CU_AD_FORMAT_HALF = 0x10,
               CU_AD_FORMAT_FLOAT = 0x20
           } CUarray_format;

       • NumChannels specifies the number of packed components per CUDA array element; it may be 1, 2, or 4;

       Here are examples of CUDA array descriptions:

       Description for a CUDA array of 2048 floats:

           CUDA_ARRAY_DESCRIPTOR desc;
           desc.Format = CU_AD_FORMAT_FLOAT;
           desc.NumChannels = 1;
           desc.Width = 2048;
           desc.Height = 0;

       Description for a 64 x 64 CUDA array of floats:

           CUDA_ARRAY_DESCRIPTOR desc;
           desc.Format = CU_AD_FORMAT_FLOAT;
           desc.NumChannels = 1;
           desc.Width = 64;
           desc.Height = 64;

       Description for a width x height CUDA array of 64-bit, 4x16-bit float16's:

           CUDA_ARRAY_DESCRIPTOR desc;
           desc.Format = CU_AD_FORMAT_HALF;
           desc.NumChannels = 4;
           desc.Width = width;
           desc.Height = height;

       Description for a width x height CUDA array of 16-bit elements, each  of  which  is  two  8-bit  unsigned
       chars:

           CUDA_ARRAY_DESCRIPTOR desc;
           desc.Format = CU_AD_FORMAT_UNSIGNED_INT8;
           desc.NumChannels = 2;
           desc.Width = width;
           desc.Height = height;
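
       The element sizes quoted in these examples (64 bits for four half-float channels, 16 bits for two 8-bit
       channels) are the per-channel size of Format multiplied by NumChannels. The sketch below shows that
       arithmetic in plain C. format_bytes and element_bytes are hypothetical helpers; they take the numeric
       CUarray_format values listed above so that the block compiles without cuda.h, and they return 0 for
       anything the text does not list.

```c
#include <stddef.h>

/* Bytes per channel for a CUarray_format value (values as listed above).
 * Returns 0 for an unrecognized format. */
size_t format_bytes(unsigned int format)
{
    switch (format) {
    case 0x01: /* CU_AD_FORMAT_UNSIGNED_INT8  */
    case 0x08: /* CU_AD_FORMAT_SIGNED_INT8    */
        return 1;
    case 0x02: /* CU_AD_FORMAT_UNSIGNED_INT16 */
    case 0x09: /* CU_AD_FORMAT_SIGNED_INT16   */
    case 0x10: /* CU_AD_FORMAT_HALF           */
        return 2;
    case 0x03: /* CU_AD_FORMAT_UNSIGNED_INT32 */
    case 0x0a: /* CU_AD_FORMAT_SIGNED_INT32   */
    case 0x20: /* CU_AD_FORMAT_FLOAT          */
        return 4;
    default:
        return 0;
    }
}

/* Bytes per array element; NumChannels may only be 1, 2, or 4. */
size_t element_bytes(unsigned int format, unsigned int num_channels)
{
    if (num_channels != 1 && num_channels != 2 && num_channels != 4)
        return 0;
    return format_bytes(format) * num_channels;
}
```

       Here element_bytes(0x10, 4) gives 8 bytes (the 64-bit, 4x16-bit case) and element_bytes(0x01, 2) gives 2
       bytes (the 16-bit, two unsigned char case).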

       Parameters:
           pHandle - Returned array
           pAllocateArray - Array descriptor

       Returns:
           CUDA_SUCCESS,   CUDA_ERROR_DEINITIALIZED,   CUDA_ERROR_NOT_INITIALIZED,   CUDA_ERROR_INVALID_CONTEXT,
           CUDA_ERROR_INVALID_VALUE, CUDA_ERROR_OUT_OF_MEMORY, CUDA_ERROR_UNKNOWN

       Note:
           Note that this function may also return error codes from previous, asynchronous launches.

       See also:
           cuArray3DCreate,   cuArray3DGetDescriptor,    cuArrayDestroy,    cuArrayGetDescriptor,    cuMemAlloc,
           cuMemAllocHost,   cuMemAllocPitch,   cuMemcpy2D,  cuMemcpy2DAsync,  cuMemcpy2DUnaligned,  cuMemcpy3D,
           cuMemcpy3DAsync,   cuMemcpyAtoA,   cuMemcpyAtoD,   cuMemcpyAtoH,   cuMemcpyAtoHAsync,   cuMemcpyDtoA,
           cuMemcpyDtoD,  cuMemcpyDtoDAsync,  cuMemcpyDtoH,  cuMemcpyDtoHAsync, cuMemcpyHtoA, cuMemcpyHtoAAsync,
           cuMemcpyHtoD,  cuMemcpyHtoDAsync,  cuMemFree,  cuMemFreeHost,   cuMemGetAddressRange,   cuMemGetInfo,
           cuMemHostAlloc,  cuMemHostGetDevicePointer,  cuMemsetD2D8,  cuMemsetD2D16, cuMemsetD2D32, cuMemsetD8,
           cuMemsetD16, cuMemsetD32, cudaMallocArray

   CUresult cuArrayDestroy (CUarray hArray)
       Destroys the CUDA array hArray.

       Parameters:
           hArray - Array to destroy

       Returns:
           CUDA_SUCCESS,   CUDA_ERROR_DEINITIALIZED,   CUDA_ERROR_NOT_INITIALIZED,   CUDA_ERROR_INVALID_CONTEXT,
           CUDA_ERROR_INVALID_HANDLE, CUDA_ERROR_ARRAY_IS_MAPPED

       Note:
           Note that this function may also return error codes from previous, asynchronous launches.

       See also:
           cuArray3DCreate,    cuArray3DGetDescriptor,    cuArrayCreate,    cuArrayGetDescriptor,    cuMemAlloc,
           cuMemAllocHost,  cuMemAllocPitch,  cuMemcpy2D,  cuMemcpy2DAsync,   cuMemcpy2DUnaligned,   cuMemcpy3D,
           cuMemcpy3DAsync,   cuMemcpyAtoA,   cuMemcpyAtoD,   cuMemcpyAtoH,   cuMemcpyAtoHAsync,   cuMemcpyDtoA,
           cuMemcpyDtoD, cuMemcpyDtoDAsync, cuMemcpyDtoH,  cuMemcpyDtoHAsync,  cuMemcpyHtoA,  cuMemcpyHtoAAsync,
           cuMemcpyHtoD,   cuMemcpyHtoDAsync,   cuMemFree,  cuMemFreeHost,  cuMemGetAddressRange,  cuMemGetInfo,
           cuMemHostAlloc, cuMemHostGetDevicePointer, cuMemsetD2D8,  cuMemsetD2D16,  cuMemsetD2D32,  cuMemsetD8,
           cuMemsetD16, cuMemsetD32, cudaFreeArray

   CUresult cuArrayGetDescriptor (CUDA_ARRAY_DESCRIPTOR * pArrayDescriptor, CUarray hArray)
       Returns in *pArrayDescriptor a descriptor containing information on the format and dimensions of the CUDA
       array  hArray. It is useful for subroutines that have been passed a CUDA array, but need to know the CUDA
       array parameters for validation or other purposes.

       Parameters:
           pArrayDescriptor - Returned array descriptor
           hArray - Array to get descriptor of

       Returns:
           CUDA_SUCCESS,   CUDA_ERROR_DEINITIALIZED,   CUDA_ERROR_NOT_INITIALIZED,   CUDA_ERROR_INVALID_CONTEXT,
           CUDA_ERROR_INVALID_VALUE, CUDA_ERROR_INVALID_HANDLE

       Note:
           Note that this function may also return error codes from previous, asynchronous launches.

       See also:
           cuArray3DCreate,  cuArray3DGetDescriptor,  cuArrayCreate, cuArrayDestroy, cuMemAlloc, cuMemAllocHost,
           cuMemAllocPitch,  cuMemcpy2D,  cuMemcpy2DAsync,  cuMemcpy2DUnaligned,  cuMemcpy3D,   cuMemcpy3DAsync,
           cuMemcpyAtoA,    cuMemcpyAtoD,    cuMemcpyAtoH,    cuMemcpyAtoHAsync,   cuMemcpyDtoA,   cuMemcpyDtoD,
           cuMemcpyDtoDAsync, cuMemcpyDtoH, cuMemcpyDtoHAsync,  cuMemcpyHtoA,  cuMemcpyHtoAAsync,  cuMemcpyHtoD,
           cuMemcpyHtoDAsync,  cuMemFree,  cuMemFreeHost,  cuMemGetAddressRange,  cuMemGetInfo,  cuMemHostAlloc,
           cuMemHostGetDevicePointer,  cuMemsetD2D8,  cuMemsetD2D16,  cuMemsetD2D32,  cuMemsetD8,   cuMemsetD16,
           cuMemsetD32, cudaArrayGetInfo

   CUresult cuDeviceGetByPCIBusId (CUdevice * dev, const char * pciBusId)
       Returns in *dev a device handle given a PCI bus ID string.

       Parameters:
           dev - Returned device handle
           pciBusId - String in one of the following forms:
               [domain]:[bus]:[device].[function]
               [domain]:[bus]:[device]
               [bus]:[device].[function]
           where domain, bus, device, and function are all hexadecimal values

       Returns:
           CUDA_SUCCESS,    CUDA_ERROR_DEINITIALIZED,    CUDA_ERROR_NOT_INITIALIZED,   CUDA_ERROR_INVALID_VALUE,
           CUDA_ERROR_INVALID_DEVICE

       Note:
           Note that this function may also return error codes from previous, asynchronous launches.

       See also:
           cuDeviceGet, cuDeviceGetAttribute, cuDeviceGetPCIBusId, cudaDeviceGetByPCIBusId
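
       The three accepted string forms can be told apart with a few sscanf patterns. The following is a rough
       sketch for validating a string before handing it to cuDeviceGetByPCIBusId, not a description of how the
       driver parses it. pci_addr and parse_pci_bus_id are hypothetical names, and filling an omitted domain or
       function with 0 is an assumption of this sketch.

```c
#include <stdio.h>

/* Hypothetical holder for the four hexadecimal fields. */
typedef struct {
    unsigned domain;
    unsigned bus;
    unsigned device;
    unsigned function;
} pci_addr;

/* Parse one of the three documented forms. Returns 0 on success and -1
 * if the string matches none of them. Omitted fields are set to 0
 * (an assumption of this sketch). */
int parse_pci_bus_id(const char *s, pci_addr *out)
{
    unsigned a, b, c, d;
    int n = 0;

    /* [domain]:[bus]:[device].[function] */
    if (sscanf(s, "%x:%x:%x.%x%n", &a, &b, &c, &d, &n) == 4 && s[n] == '\0') {
        out->domain = a; out->bus = b; out->device = c; out->function = d;
        return 0;
    }
    /* [domain]:[bus]:[device] */
    n = 0;
    if (sscanf(s, "%x:%x:%x%n", &a, &b, &c, &n) == 3 && s[n] == '\0') {
        out->domain = a; out->bus = b; out->device = c; out->function = 0;
        return 0;
    }
    /* [bus]:[device].[function] */
    n = 0;
    if (sscanf(s, "%x:%x.%x%n", &a, &b, &c, &n) == 3 && s[n] == '\0') {
        out->domain = 0; out->bus = a; out->device = b; out->function = c;
        return 0;
    }
    return -1;
}
```

       The %n conversion records how many characters were consumed, so trailing characters after an otherwise
       valid prefix cause the form to be rejected.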

   CUresult cuDeviceGetPCIBusId (char * pciBusId, int len, CUdevice dev)
       Returns an ASCII string identifying the device dev in the NULL-terminated string pointed to by  pciBusId.
       len specifies the maximum length of the string that may be returned.

       Parameters:
           pciBusId    -    Returned    identifier   string   for   the   device   in   the   following   format
           [domain]:[bus]:[device].[function] where domain,  bus,  device,  and  function  are  all  hexadecimal
           values. pciBusId should be large enough to store 13 characters including the NULL-terminator.
           len - Maximum length of string to store in pciBusId
           dev - Device to get identifier string for

       Returns:
           CUDA_SUCCESS,    CUDA_ERROR_DEINITIALIZED,    CUDA_ERROR_NOT_INITIALIZED,   CUDA_ERROR_INVALID_VALUE,
           CUDA_ERROR_INVALID_DEVICE

       Note:
           Note that this function may also return error codes from previous, asynchronous launches.

       See also:
           cuDeviceGet, cuDeviceGetAttribute, cuDeviceGetByPCIBusId, cudaDeviceGetPCIBusId
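
       The 13-character figure is consistent with a string such as 0000:01:00.0 (12 characters) plus the
       terminator. The sketch below illustrates why a shorter buffer is not enough. format_pci_bus_id is a
       hypothetical helper, and the field widths it uses (four hex digits for the domain, two each for bus and
       device, one for the function) are an assumption chosen to match that figure, not something this page
       specifies.

```c
#include <stdio.h>

/* Write a PCI bus ID into buf, which holds len bytes.
 * Returns 0 on success, -1 if the buffer is missing or too small.
 * The field widths are an assumption of this sketch. */
int format_pci_bus_id(char *buf, int len, unsigned domain, unsigned bus,
                      unsigned device, unsigned function)
{
    int n;

    if (buf == NULL || len <= 0)
        return -1;

    n = snprintf(buf, (size_t)len, "%04x:%02x:%02x.%x",
                 domain, bus, device, function);

    /* snprintf returns the length it wanted to write, excluding the
     * terminator; anything >= len means the output was truncated. */
    return (n >= 0 && n < len) ? 0 : -1;
}
```

       With a 13-byte buffer the call succeeds for domain 0, bus 1, device 0, function 0; with a 12-byte buffer
       the same call reports truncation.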

   CUresult cuIpcCloseMemHandle (CUdeviceptr dptr)
       Unmaps memory returned by cuIpcOpenMemHandle. The original allocation in the exporting process as well as
       imported mappings in other processes will be unaffected.

       Any resources used to enable peer access will be freed if this is the last mapping using them.

       IPC functionality is restricted to devices  with  support  for  unified  addressing  on  Linux  operating
       systems.

       Parameters:
           dptr - Device pointer returned by cuIpcOpenMemHandle

       Returns:
           CUDA_SUCCESS, CUDA_ERROR_INVALID_CONTEXT, CUDA_ERROR_MAP_FAILED, CUDA_ERROR_INVALID_HANDLE

       See also:
           cuMemAlloc,      cuMemFree,     cuIpcGetEventHandle,     cuIpcOpenEventHandle,     cuIpcGetMemHandle,
           cuIpcOpenMemHandle, cudaIpcCloseMemHandle

   CUresult cuIpcGetEventHandle (CUipcEventHandle * pHandle, CUevent event)
       Takes as input a previously allocated event, which must have been created with the CU_EVENT_INTERPROCESS
       and CU_EVENT_DISABLE_TIMING flags set, and returns an opaque handle for it in *pHandle. This handle may
       be copied into other processes and opened with cuIpcOpenEventHandle to allow efficient hardware
       synchronization between GPU work in different processes.

       After   the   event  has  been  opened  in  the  importing  process,  cuEventRecord,  cuEventSynchronize,
       cuStreamWaitEvent and cuEventQuery may be used in either process. Performing operations on  the  imported
       event after the exported event has been freed with cuEventDestroy will result in undefined behavior.

       IPC  functionality  is  restricted  to  devices  with  support  for unified addressing on Linux operating
       systems.

       Parameters:
           pHandle - Pointer to a user allocated CUipcEventHandle in which to return the opaque event handle
           event - Event allocated with CU_EVENT_INTERPROCESS and CU_EVENT_DISABLE_TIMING flags.

       Returns:
           CUDA_SUCCESS, CUDA_ERROR_INVALID_HANDLE, CUDA_ERROR_OUT_OF_MEMORY, CUDA_ERROR_MAP_FAILED

       See also:
           cuEventCreate,     cuEventDestroy,     cuEventSynchronize,      cuEventQuery,      cuStreamWaitEvent,
           cuIpcOpenEventHandle,        cuIpcGetMemHandle,        cuIpcOpenMemHandle,       cuIpcCloseMemHandle,
           cudaIpcGetEventHandle

   CUresult cuIpcGetMemHandle (CUipcMemHandle * pHandle, CUdeviceptr dptr)
       Takes a pointer to the base of an existing device memory allocation created with cuMemAlloc  and  exports
       it  for  use  in  another process. This is a lightweight operation and may be called multiple times on an
       allocation without adverse effects.

       If a region of memory is freed with cuMemFree and a subsequent call to cuMemAlloc returns memory with the
       same device address, cuIpcGetMemHandle will return a unique handle for the new memory.

       IPC functionality is restricted to devices  with  support  for  unified  addressing  on  Linux  operating
       systems.

       Parameters:
           pHandle - Pointer to user allocated CUipcMemHandle to return the handle in.
           dptr - Base pointer to previously allocated device memory

       Returns:
           CUDA_SUCCESS, CUDA_ERROR_INVALID_HANDLE, CUDA_ERROR_OUT_OF_MEMORY, CUDA_ERROR_MAP_FAILED

       See also:
           cuMemAlloc,     cuMemFree,     cuIpcGetEventHandle,     cuIpcOpenEventHandle,     cuIpcOpenMemHandle,
           cuIpcCloseMemHandle, cudaIpcGetMemHandle
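
       Because the exported handle is opaque, getting it to another process amounts to copying its bytes over
       some ordinary channel. One possible approach, sketched here for a POSIX pipe or socket, is a pair of
       helpers that move a fixed-size blob in full. write_all and read_all are hypothetical names; the
       exporter would pass the address and sizeof of its CUipcMemHandle after cuIpcGetMemHandle, and the
       importer would read the same number of bytes before calling cuIpcOpenMemHandle.

```c
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Write exactly len bytes to fd. Returns 0 on success, -1 on error. */
int write_all(int fd, const void *buf, size_t len)
{
    const char *p = buf;

    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;   /* interrupted by a signal: retry */
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Read exactly len bytes from fd. Returns 0 on success, -1 on error or
 * if the peer closed the channel before the whole blob arrived. */
int read_all(int fd, void *buf, size_t len)
{
    char *p = buf;

    while (len > 0) {
        ssize_t n = read(fd, p, len);
        if (n == 0)
            return -1;      /* end of file: blob is incomplete */
        if (n < 0) {
            if (errno == EINTR)
                continue;   /* interrupted by a signal: retry */
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }
    return 0;
}
```

       Looping until the full length has been transferred matters because read and write may return short
       counts, and a partially copied handle cannot be opened.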

   CUresult cuIpcOpenEventHandle (CUevent * phEvent, CUipcEventHandle handle)
       Opens an interprocess event handle exported from another process with cuIpcGetEventHandle. This  function
       returns  a  CUevent  that  behaves  like  a  locally  created event with the CU_EVENT_DISABLE_TIMING flag
       specified. This event must be freed with cuEventDestroy.

       Performing operations on the imported event after the exported event has been freed  with  cuEventDestroy
       will result in undefined behavior.

       IPC  functionality  is  restricted  to  devices  with  support  for unified addressing on Linux operating
       systems.

       Parameters:
           phEvent - Returns the imported event
           handle - Interprocess handle to open

       Returns:
           CUDA_SUCCESS, CUDA_ERROR_INVALID_CONTEXT, CUDA_ERROR_MAP_FAILED,  CUDA_ERROR_PEER_ACCESS_UNSUPPORTED,
           CUDA_ERROR_INVALID_HANDLE

       See also:
           cuEventCreate,      cuEventDestroy,      cuEventSynchronize,     cuEventQuery,     cuStreamWaitEvent,
           cuIpcGetEventHandle,        cuIpcGetMemHandle,        cuIpcOpenMemHandle,        cuIpcCloseMemHandle,
           cudaIpcOpenEventHandle

   CUresult cuIpcOpenMemHandle (CUdeviceptr * pdptr, CUipcMemHandle handle, unsigned int Flags)
       Maps memory exported from another process with cuIpcGetMemHandle into the current device address space.
       For contexts on different devices, cuIpcOpenMemHandle can attempt to enable peer access between the
       devices as if the user called cuCtxEnablePeerAccess. This behavior is controlled by the
       CU_IPC_MEM_LAZY_ENABLE_PEER_ACCESS flag. cuDeviceCanAccessPeer can determine if a mapping is possible.

       Contexts that may open CUipcMemHandles are restricted in the following  way.  CUipcMemHandles  from  each
       CUdevice in a given process may only be opened by one CUcontext per CUdevice per other process.

       Memory returned from cuIpcOpenMemHandle must be freed with cuIpcCloseMemHandle.

       Calling  cuMemFree  on  an  exported  memory  region  before calling cuIpcCloseMemHandle in the importing
       context will result in undefined behavior.

       IPC functionality is restricted to devices  with  support  for  unified  addressing  on  Linux  operating
       systems.

       Parameters:
           pdptr - Returned device pointer
           handle - CUipcMemHandle to open
           Flags - Flags for this operation. Must be specified as CU_IPC_MEM_LAZY_ENABLE_PEER_ACCESS

       Returns:
           CUDA_SUCCESS,     CUDA_ERROR_INVALID_CONTEXT,    CUDA_ERROR_MAP_FAILED,    CUDA_ERROR_INVALID_HANDLE,
           CUDA_ERROR_TOO_MANY_PEERS

       Note:
           No guarantees are made about the address returned in *pdptr. In particular,  multiple  processes  may
           not receive the same address for the same handle.

       See also:
           cuMemAlloc,      cuMemFree,     cuIpcGetEventHandle,     cuIpcOpenEventHandle,     cuIpcGetMemHandle,
           cuIpcCloseMemHandle, cuCtxEnablePeerAccess, cuDeviceCanAccessPeer, cudaIpcOpenMemHandle

   CUresult cuMemAlloc (CUdeviceptr * dptr, size_t bytesize)
       Allocates bytesize bytes of linear memory on the device and returns in *dptr a pointer to  the  allocated
       memory.  The allocated memory is suitably aligned for any kind of variable. The memory is not cleared. If
       bytesize is 0, cuMemAlloc() returns CUDA_ERROR_INVALID_VALUE.

       Parameters:
           dptr - Returned device pointer
           bytesize - Requested allocation size in bytes

       Returns:
           CUDA_SUCCESS,   CUDA_ERROR_DEINITIALIZED,   CUDA_ERROR_NOT_INITIALIZED,   CUDA_ERROR_INVALID_CONTEXT,
           CUDA_ERROR_INVALID_VALUE, CUDA_ERROR_OUT_OF_MEMORY

       Note:
           Note that this function may also return error codes from previous, asynchronous launches.

       See also:
           cuArray3DCreate,   cuArray3DGetDescriptor,   cuArrayCreate,   cuArrayDestroy,   cuArrayGetDescriptor,
           cuMemAllocHost,  cuMemAllocPitch,  cuMemcpy2D,  cuMemcpy2DAsync,   cuMemcpy2DUnaligned,   cuMemcpy3D,
           cuMemcpy3DAsync,   cuMemcpyAtoA,   cuMemcpyAtoD,   cuMemcpyAtoH,   cuMemcpyAtoHAsync,   cuMemcpyDtoA,
           cuMemcpyDtoD, cuMemcpyDtoDAsync, cuMemcpyDtoH,  cuMemcpyDtoHAsync,  cuMemcpyHtoA,  cuMemcpyHtoAAsync,
           cuMemcpyHtoD,   cuMemcpyHtoDAsync,   cuMemFree,  cuMemFreeHost,  cuMemGetAddressRange,  cuMemGetInfo,
           cuMemHostAlloc, cuMemHostGetDevicePointer, cuMemsetD2D8,  cuMemsetD2D16,  cuMemsetD2D32,  cuMemsetD8,
           cuMemsetD16, cuMemsetD32, cudaMalloc

   CUresult cuMemAllocHost (void ** pp, size_t bytesize)
       Allocates  bytesize  bytes  of  host  memory that is page-locked and accessible to the device. The driver
       tracks the virtual memory ranges allocated with this function  and  automatically  accelerates  calls  to
       functions  such as cuMemcpy(). Since the memory can be accessed directly by the device, it can be read or
       written with much higher bandwidth than  pageable  memory  obtained  with  functions  such  as  malloc().
       Allocating  excessive  amounts  of  memory with cuMemAllocHost() may degrade system performance, since it
       reduces the amount of memory available to the system for paging. As a result, this function is best  used
       sparingly to allocate staging areas for data exchange between host and device.

       Note that all host memory allocated using cuMemAllocHost() will automatically be immediately accessible
       to all contexts on all devices which support unified addressing (as may be queried using
       CU_DEVICE_ATTRIBUTE_UNIFIED_ADDRESSING). The device pointer that may be used to access this host memory
       from those contexts is always equal to the returned host pointer *pp. See Unified Addressing for
       additional details.

       Parameters:
           pp - Returned host pointer to page-locked memory
           bytesize - Requested allocation size in bytes

       Returns:
           CUDA_SUCCESS,   CUDA_ERROR_DEINITIALIZED,   CUDA_ERROR_NOT_INITIALIZED,   CUDA_ERROR_INVALID_CONTEXT,
           CUDA_ERROR_INVALID_VALUE, CUDA_ERROR_OUT_OF_MEMORY

       Note:
           Note that this function may also return error codes from previous, asynchronous launches.

       See also:
           cuArray3DCreate,   cuArray3DGetDescriptor,   cuArrayCreate,   cuArrayDestroy,   cuArrayGetDescriptor,
           cuMemAlloc,   cuMemAllocPitch,   cuMemcpy2D,   cuMemcpy2DAsync,   cuMemcpy2DUnaligned,    cuMemcpy3D,
           cuMemcpy3DAsync,   cuMemcpyAtoA,   cuMemcpyAtoD,   cuMemcpyAtoH,   cuMemcpyAtoHAsync,   cuMemcpyDtoA,
           cuMemcpyDtoD, cuMemcpyDtoDAsync, cuMemcpyDtoH,  cuMemcpyDtoHAsync,  cuMemcpyHtoA,  cuMemcpyHtoAAsync,
           cuMemcpyHtoD,   cuMemcpyHtoDAsync,   cuMemFree,  cuMemFreeHost,  cuMemGetAddressRange,  cuMemGetInfo,
           cuMemHostAlloc, cuMemHostGetDevicePointer, cuMemsetD2D8,  cuMemsetD2D16,  cuMemsetD2D32,  cuMemsetD8,
           cuMemsetD16, cuMemsetD32, cudaMallocHost

   CUresult cuMemAllocManaged (CUdeviceptr * dptr, size_t bytesize, unsigned int flags)
       Allocates  bytesize bytes of managed memory on the device and returns in *dptr a pointer to the allocated
       memory. If the device doesn't support allocating managed memory,  CUDA_ERROR_NOT_SUPPORTED  is  returned.
       Support  for managed memory can be queried using the device attribute CU_DEVICE_ATTRIBUTE_MANAGED_MEMORY.
       The allocated memory is suitably aligned for any kind of variable. The memory is not cleared. If bytesize
       is 0, cuMemAllocManaged returns CUDA_ERROR_INVALID_VALUE. The pointer is valid on the CPU and on all GPUs
       in the system that support managed memory. All accesses to this pointer  must  obey  the  Unified  Memory
       programming model.

       flags   specifies   the   default   stream  association  for  this  allocation.  flags  must  be  one  of
       CU_MEM_ATTACH_GLOBAL or CU_MEM_ATTACH_HOST. If CU_MEM_ATTACH_GLOBAL is specified,  then  this  memory  is
       accessible  from any stream on any device. If CU_MEM_ATTACH_HOST is specified, then the allocation should
       not   be   accessed   from   devices   that   have   a   zero   value   for    the    device    attribute
       CU_DEVICE_ATTRIBUTE_CONCURRENT_MANAGED_ACCESS;   an  explicit  call  to  cuStreamAttachMemAsync  will  be
       required to enable access on such devices.

       If the association  is  later  changed  via  cuStreamAttachMemAsync  to  a  single  stream,  the  default
       association  as  specified  during  cuMemAllocManaged  is  restored  when  that  stream is destroyed. For
       __managed__ variables, the default association is always CU_MEM_ATTACH_GLOBAL.  Note  that  destroying  a
       stream  is  an  asynchronous  operation,  and as a result, the change to default association won't happen
       until all work in the stream has completed.

       Memory allocated with cuMemAllocManaged should be released with cuMemFree.

       Device memory oversubscription is possible for GPUs that have a non-zero value for the  device  attribute
       CU_DEVICE_ATTRIBUTE_CONCURRENT_MANAGED_ACCESS.  Managed  memory  on  such GPUs may be evicted from device
       memory to host memory at any time by  the  Unified  Memory  driver  in  order  to  make  room  for  other
       allocations.

       In   a   multi-GPU   system   where   all   GPUs   have   a  non-zero  value  for  the  device  attribute
       CU_DEVICE_ATTRIBUTE_CONCURRENT_MANAGED_ACCESS, managed memory may not be populated when this API  returns
       and  instead  may  be populated on access. In such systems, managed memory can migrate to any processor's
       memory at any time. The Unified Memory driver will  employ  heuristics  to  maintain  data  locality  and
       prevent  excessive  page  faults  to the extent possible. The application can also guide the driver about
       memory usage patterns via cuMemAdvise. The application can also explicitly migrate memory  to  a  desired
       processor's memory via cuMemPrefetchAsync.

       In   a   multi-GPU   system  where  all  of  the  GPUs  have  a  zero  value  for  the  device  attribute
       CU_DEVICE_ATTRIBUTE_CONCURRENT_MANAGED_ACCESS and all the GPUs have peer-to-peer support with each other,
       the  physical  storage  for  managed  memory  is  created  on  the  GPU  which  is  active  at  the  time
       cuMemAllocManaged  is  called.  All  other  GPUs  will  reference  the data at reduced bandwidth via peer
       mappings over the PCIe bus. The Unified Memory driver does not migrate memory among such GPUs.

       In a multi-GPU system where not all GPUs have peer-to-peer support with each other and where the value of
       the device attribute CU_DEVICE_ATTRIBUTE_CONCURRENT_MANAGED_ACCESS is zero for  at  least  one  of  those
       GPUs, the location chosen for physical storage of managed memory is system-dependent.

       • On Linux, the location chosen will be device memory as long as all currently active contexts are on
         devices that either have peer-to-peer support with each other or have a non-zero value for the
         device attribute CU_DEVICE_ATTRIBUTE_CONCURRENT_MANAGED_ACCESS. If there is an active context on a GPU
         that has a zero value for that device attribute and lacks peer-to-peer support with the other
         devices that have active contexts, the location for physical storage will be 'zero-copy' or host
         memory. Consequently, managed memory located in device memory is migrated to host memory if a new
         context is created on a GPU that has a zero value for that device attribute and does not support
         peer-to-peer with at least one of the other devices that has an active context. This in turn implies
         that context creation may fail if there is insufficient host memory to migrate all managed
         allocations.

       • On Windows, the physical storage is always created  in  'zero-copy'  or  host  memory.  All  GPUs  will
         reference  the  data  at  reduced  bandwidth  over  the  PCIe  bus.  In these circumstances, use of the
         environment variable CUDA_VISIBLE_DEVICES is recommended to restrict CUDA to only use those  GPUs  that
         have  peer-to-peer support. Alternatively, users can also set CUDA_MANAGED_FORCE_DEVICE_ALLOC to a non-
         zero value to force the driver to always use device memory for physical storage. When this  environment
         variable  is  set  to  a  non-zero  value, all contexts created in that process on devices that support
         managed memory have to be peer-to-peer compatible with each other. Context  creation  will  fail  if  a
         context is created on a device that supports managed memory and is not peer-to-peer compatible with any
         of the other managed memory supporting devices on which contexts were previously created, even if those
         contexts  have  been destroyed. These environment variables are described in the CUDA programming guide
         under the 'CUDA environment variables' section.

       • On ARM, managed memory is not available on discrete GPU with Drive PX-2.

       Parameters:
           dptr - Returned device pointer
           bytesize - Requested allocation size in bytes
           flags - Must be one of CU_MEM_ATTACH_GLOBAL or CU_MEM_ATTACH_HOST

       Returns:
           CUDA_SUCCESS,   CUDA_ERROR_DEINITIALIZED,   CUDA_ERROR_NOT_INITIALIZED,   CUDA_ERROR_INVALID_CONTEXT,
           CUDA_ERROR_NOT_SUPPORTED, CUDA_ERROR_INVALID_VALUE, CUDA_ERROR_OUT_OF_MEMORY

       Note:
           Note that this function may also return error codes from previous, asynchronous launches.

       See also:
           cuArray3DCreate,   cuArray3DGetDescriptor,   cuArrayCreate,   cuArrayDestroy,   cuArrayGetDescriptor,
           cuMemAllocHost,  cuMemAllocPitch,  cuMemcpy2D,  cuMemcpy2DAsync,   cuMemcpy2DUnaligned,   cuMemcpy3D,
           cuMemcpy3DAsync,   cuMemcpyAtoA,   cuMemcpyAtoD,   cuMemcpyAtoH,   cuMemcpyAtoHAsync,   cuMemcpyDtoA,
           cuMemcpyDtoD, cuMemcpyDtoDAsync, cuMemcpyDtoH,  cuMemcpyDtoHAsync,  cuMemcpyHtoA,  cuMemcpyHtoAAsync,
           cuMemcpyHtoD,   cuMemcpyHtoDAsync,   cuMemFree,  cuMemFreeHost,  cuMemGetAddressRange,  cuMemGetInfo,
           cuMemHostAlloc, cuMemHostGetDevicePointer, cuMemsetD2D8,  cuMemsetD2D16,  cuMemsetD2D32,  cuMemsetD8,
           cuMemsetD16, cuMemsetD32, cuDeviceGetAttribute, cuStreamAttachMemAsync, cudaMallocManaged

   CUresult  cuMemAllocPitch  (CUdeviceptr * dptr, size_t * pPitch, size_t WidthInBytes, size_t Height, unsigned
       int ElementSizeBytes)
       Allocates at least WidthInBytes * Height bytes of linear memory on the device  and  returns  in  *dptr  a
       pointer  to  the  allocated  memory.  The  function  may  pad the allocation to ensure that corresponding
       pointers in any given row will continue to meet the alignment requirements for coalescing as the  address
       is updated from row to row. ElementSizeBytes specifies the size of the largest reads and writes that will
       be performed on the memory range. ElementSizeBytes may be 4, 8 or 16 (since coalesced memory transactions
       are  not possible on other data sizes). If ElementSizeBytes is smaller than the actual read/write size of
       a kernel, the kernel will run correctly, but possibly at reduced speed. The pitch returned in *pPitch  by
       cuMemAllocPitch()  is  the width in bytes of the allocation. The intended usage of pitch is as a separate
       parameter of the allocation, used to compute addresses within the 2D array. Given the row and  column  of
       an array element of type T, the address is computed as:

          T* pElement = (T*)((char*)BaseAddress + Row * Pitch) + Column;
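
       As a rough illustration of the address computation above, the following host-only C sketch applies the
       same formula to ordinary memory. The helper names pitched_elem_f32 and pitched_offset are hypothetical
       and are not part of the CUDA driver API; base stands in for the pointer returned in *dptr and pitch for
       the value returned in *pPitch.

```c
#include <stddef.h>

/* Hypothetical helper: address of the float element at (row, column) in a
 * pitched buffer. The row offset is applied in bytes, the column offset in
 * elements, exactly as in the formula above with T = float. */
static float *pitched_elem_f32(void *base, size_t pitch, size_t row, size_t column)
{
    return (float *)((char *)base + row * pitch) + column;
}

/* Hypothetical helper: byte offset of the same element from the base address
 * for an arbitrary element size. */
static size_t pitched_offset(size_t pitch, size_t row, size_t column, size_t elem_size)
{
    return row * pitch + column * elem_size;
}
```

       With a pitch of 512 bytes, the float element at row 2, column 5 lies 2 * 512 + 5 * 4 = 1044 bytes past
       the base address. The pitch, not the requested width, must be used for the row stride.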

       The  pitch returned by cuMemAllocPitch() is guaranteed to work with cuMemcpy2D() under all circumstances.
       For allocations of 2D arrays, it is recommended that programmers consider  performing  pitch  allocations
       using  cuMemAllocPitch().  Due  to alignment restrictions in the hardware, this is especially true if the
       application will be performing 2D memory copies between  different  regions  of  device  memory  (whether
       linear memory or CUDA arrays).

       The  byte  alignment  of  the  pitch  returned  by cuMemAllocPitch() is guaranteed to match or exceed the
       alignment requirement for texture binding with cuTexRefSetAddress2D().

       Parameters:
           dptr - Returned device pointer
           pPitch - Returned pitch of allocation in bytes
           WidthInBytes - Requested allocation width in bytes
           Height - Requested allocation height in rows
           ElementSizeBytes - Size of largest reads/writes for range

       Returns:
           CUDA_SUCCESS,   CUDA_ERROR_DEINITIALIZED,   CUDA_ERROR_NOT_INITIALIZED,   CUDA_ERROR_INVALID_CONTEXT,
           CUDA_ERROR_INVALID_VALUE, CUDA_ERROR_OUT_OF_MEMORY

       Note:
           Note that this function may also return error codes from previous, asynchronous launches.

       See also:
           cuArray3DCreate,   cuArray3DGetDescriptor,   cuArrayCreate,   cuArrayDestroy,   cuArrayGetDescriptor,
           cuMemAlloc,   cuMemAllocHost,   cuMemcpy2D,   cuMemcpy2DAsync,    cuMemcpy2DUnaligned,    cuMemcpy3D,
           cuMemcpy3DAsync,   cuMemcpyAtoA,   cuMemcpyAtoD,   cuMemcpyAtoH,   cuMemcpyAtoHAsync,   cuMemcpyDtoA,
           cuMemcpyDtoD, cuMemcpyDtoDAsync, cuMemcpyDtoH,  cuMemcpyDtoHAsync,  cuMemcpyHtoA,  cuMemcpyHtoAAsync,
           cuMemcpyHtoD,   cuMemcpyHtoDAsync,   cuMemFree,  cuMemFreeHost,  cuMemGetAddressRange,  cuMemGetInfo,
           cuMemHostAlloc, cuMemHostGetDevicePointer, cuMemsetD2D8,  cuMemsetD2D16,  cuMemsetD2D32,  cuMemsetD8,
           cuMemsetD16, cuMemsetD32, cudaMallocPitch

   CUresult cuMemcpy (CUdeviceptr dst, CUdeviceptr src, size_t ByteCount)
       Copies  data  between  two  pointers.  dst  and  src  are  base  pointers  of the destination and source,
       respectively. ByteCount specifies the number of bytes to copy. Note that this function infers the type of
       the transfer (host to host, host to device, device to device, or device to host) from the pointer values.
       This function is only allowed in contexts which support unified addressing.

       Parameters:
           dst - Destination unified virtual address space pointer
           src - Source unified virtual address space pointer
           ByteCount - Size of memory copy in bytes

       Returns:
           CUDA_SUCCESS,   CUDA_ERROR_DEINITIALIZED,   CUDA_ERROR_NOT_INITIALIZED,   CUDA_ERROR_INVALID_CONTEXT,
           CUDA_ERROR_INVALID_VALUE

       Note:
           Note that this function may also return error codes from previous, asynchronous launches.

           This function exhibits synchronous behavior for most use cases.

       See also:
           cuArray3DCreate,   cuArray3DGetDescriptor,   cuArrayCreate,   cuArrayDestroy,   cuArrayGetDescriptor,
           cuMemAlloc,  cuMemAllocHost,  cuMemAllocPitch,  cuMemcpy2D,   cuMemcpy2DAsync,   cuMemcpy2DUnaligned,
           cuMemcpy3D,    cuMemcpy3DAsync,    cuMemcpyAtoA,   cuMemcpyAtoD,   cuMemcpyAtoH,   cuMemcpyAtoHAsync,
           cuMemcpyDtoA,  cuMemcpyDtoH,  cuMemcpyDtoHAsync,   cuMemcpyHtoA,   cuMemcpyHtoAAsync,   cuMemcpyHtoD,
           cuMemcpyHtoDAsync,  cuMemFree,  cuMemFreeHost,  cuMemGetAddressRange,  cuMemGetInfo,  cuMemHostAlloc,
           cuMemHostGetDevicePointer,  cuMemsetD2D8,  cuMemsetD2D16,  cuMemsetD2D32,  cuMemsetD8,   cuMemsetD16,
           cuMemsetD32, cudaMemcpy, cudaMemcpyToSymbol, cudaMemcpyFromSymbol

   CUresult cuMemcpy2D (const CUDA_MEMCPY2D * pCopy)
       Perform  a  2D memory copy according to the parameters specified in pCopy. The CUDA_MEMCPY2D structure is
       defined as:

          typedef struct CUDA_MEMCPY2D_st {
             unsigned int srcXInBytes, srcY;
             CUmemorytype srcMemoryType;
                 const void *srcHost;
                 CUdeviceptr srcDevice;
                 CUarray srcArray;
                 unsigned int srcPitch;

             unsigned int dstXInBytes, dstY;
             CUmemorytype dstMemoryType;
                 void *dstHost;
                 CUdeviceptr dstDevice;
                 CUarray dstArray;
                 unsigned int dstPitch;

             unsigned int WidthInBytes;
             unsigned int Height;
          } CUDA_MEMCPY2D;

        where:

       • srcMemoryType and dstMemoryType specify the type of memory of the source and destination, respectively;
         CUmemorytype_enum is defined as:

          typedef enum CUmemorytype_enum {
             CU_MEMORYTYPE_HOST = 0x01,
             CU_MEMORYTYPE_DEVICE = 0x02,
             CU_MEMORYTYPE_ARRAY = 0x03,
             CU_MEMORYTYPE_UNIFIED = 0x04
          } CUmemorytype;

         If srcMemoryType is CU_MEMORYTYPE_UNIFIED, srcDevice and srcPitch specify the (unified virtual
         address space) base address of the source data and the bytes per row to apply. srcArray is ignored.
         This value may be used only if unified addressing is supported in the calling context.

         If srcMemoryType is CU_MEMORYTYPE_HOST, srcHost and srcPitch specify the (host) base address of the
         source data and the bytes per row to apply. srcArray is ignored.

         If srcMemoryType is CU_MEMORYTYPE_DEVICE, srcDevice and srcPitch specify the (device) base address
         of the source data and the bytes per row to apply. srcArray is ignored.

         If srcMemoryType is CU_MEMORYTYPE_ARRAY, srcArray specifies the handle of the source data. srcHost,
         srcDevice and srcPitch are ignored.

         If dstMemoryType is CU_MEMORYTYPE_HOST, dstHost and dstPitch specify the (host) base address of the
         destination data and the bytes per row to apply. dstArray is ignored.

         If dstMemoryType is CU_MEMORYTYPE_UNIFIED, dstDevice and dstPitch specify the (unified virtual
         address space) base address of the destination data and the bytes per row to apply. dstArray is
         ignored. This value may be used only if unified addressing is supported in the calling context.

         If dstMemoryType is CU_MEMORYTYPE_DEVICE, dstDevice and dstPitch specify the (device) base address
         of the destination data and the bytes per row to apply. dstArray is ignored.

         If dstMemoryType is CU_MEMORYTYPE_ARRAY, dstArray specifies the handle of the destination data.
         dstHost, dstDevice and dstPitch are ignored.

       • srcXInBytes and srcY specify the base address of the source data for the copy.

         For host pointers, the starting address is

            void* Start = (void*)((char*)srcHost+srcY*srcPitch + srcXInBytes);

         For device pointers, the starting address is

            CUdeviceptr Start = srcDevice+srcY*srcPitch+srcXInBytes;

         For CUDA arrays, srcXInBytes must be evenly divisible by the array element size.

       • dstXInBytes and dstY specify the base address of the destination data for the copy.

         For host pointers, the starting address is

            void* dstStart = (void*)((char*)dstHost+dstY*dstPitch + dstXInBytes);

         For device pointers, the starting address is

            CUdeviceptr dstStart = dstDevice+dstY*dstPitch+dstXInBytes;

         For CUDA arrays, dstXInBytes must be evenly divisible by the array element size.

       • WidthInBytes and Height specify the width (in bytes) and height of the 2D copy being performed.

       • If  specified,  srcPitch must be greater than or equal to WidthInBytes + srcXInBytes, and dstPitch must
         be greater than or equal to WidthInBytes + dstXInBytes.

         cuMemcpy2D() returns an error if any pitch is greater than the maximum allowed
         (CU_DEVICE_ATTRIBUTE_MAX_PITCH). cuMemAllocPitch() passes back pitches that always work with
         cuMemcpy2D(). On intra-device memory copies (device to device, CUDA array to device, CUDA array to
         CUDA array), cuMemcpy2D() may fail for pitches not computed by cuMemAllocPitch().
         cuMemcpy2DUnaligned() does not have this restriction, but may run significantly slower in the cases
         where cuMemcpy2D() would have returned an error code.
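
       The starting-address and pitch rules described for cuMemcpy2D() can be modelled on the host. The sketch
       below is not the driver's implementation: HostCopy2D and host_copy2d are hypothetical names, the struct
       is a host-only subset of the copy parameters rather than the real CUDA_MEMCPY2D, and only the
       host-to-host case is covered.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical host-only subset of the 2D copy parameters. */
typedef struct {
    size_t srcXInBytes, srcY;
    const void *srcHost;
    size_t srcPitch;
    size_t dstXInBytes, dstY;
    void *dstHost;
    size_t dstPitch;
    size_t WidthInBytes, Height;
} HostCopy2D;

/* Returns 0 on success, -1 if a pitch is smaller than WidthInBytes + XInBytes. */
static int host_copy2d(const HostCopy2D *p)
{
    if (p->srcPitch < p->WidthInBytes + p->srcXInBytes) return -1;
    if (p->dstPitch < p->WidthInBytes + p->dstXInBytes) return -1;

    /* Starting addresses, computed as in the host-pointer formulas. */
    const char *src = (const char *)p->srcHost + p->srcY * p->srcPitch + p->srcXInBytes;
    char *dst = (char *)p->dstHost + p->dstY * p->dstPitch + p->dstXInBytes;

    /* Copy WidthInBytes bytes per row; each side advances by its own pitch. */
    for (size_t row = 0; row < p->Height; ++row)
        memcpy(dst + row * p->dstPitch, src + row * p->srcPitch, p->WidthInBytes);
    return 0;
}
```

       Source and destination may use different pitches; only WidthInBytes and Height are shared between the
       two sides of the copy.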

       Parameters:
           pCopy - Parameters for the memory copy

       Returns:
           CUDA_SUCCESS,   CUDA_ERROR_DEINITIALIZED,   CUDA_ERROR_NOT_INITIALIZED,   CUDA_ERROR_INVALID_CONTEXT,
           CUDA_ERROR_INVALID_VALUE

       Note:
           Note that this function may also return error codes from previous, asynchronous launches.

           This function exhibits synchronous behavior for most use cases.

       See also:
           cuArray3DCreate,   cuArray3DGetDescriptor,   cuArrayCreate,   cuArrayDestroy,   cuArrayGetDescriptor,
           cuMemAlloc,   cuMemAllocHost,   cuMemAllocPitch,  cuMemcpy2DAsync,  cuMemcpy2DUnaligned,  cuMemcpy3D,
           cuMemcpy3DAsync,   cuMemcpyAtoA,   cuMemcpyAtoD,   cuMemcpyAtoH,   cuMemcpyAtoHAsync,   cuMemcpyDtoA,
           cuMemcpyDtoD,  cuMemcpyDtoDAsync,  cuMemcpyDtoH,  cuMemcpyDtoHAsync, cuMemcpyHtoA, cuMemcpyHtoAAsync,
           cuMemcpyHtoD,  cuMemcpyHtoDAsync,  cuMemFree,  cuMemFreeHost,   cuMemGetAddressRange,   cuMemGetInfo,
           cuMemHostAlloc,  cuMemHostGetDevicePointer,  cuMemsetD2D8,  cuMemsetD2D16, cuMemsetD2D32, cuMemsetD8,
           cuMemsetD16, cuMemsetD32, cudaMemcpy2D, cudaMemcpy2DToArray, cudaMemcpy2DFromArray

   CUresult cuMemcpy2DAsync (const CUDA_MEMCPY2D * pCopy, CUstream hStream)
       Perform a 2D memory copy according to the parameters specified in pCopy. The CUDA_MEMCPY2D  structure  is
       defined as:

          typedef struct CUDA_MEMCPY2D_st {
             unsigned int srcXInBytes, srcY;
             CUmemorytype srcMemoryType;
             const void *srcHost;
             CUdeviceptr srcDevice;
             CUarray srcArray;
             unsigned int srcPitch;
             unsigned int dstXInBytes, dstY;
             CUmemorytype dstMemoryType;
             void *dstHost;
             CUdeviceptr dstDevice;
             CUarray dstArray;
             unsigned int dstPitch;
             unsigned int WidthInBytes;
             unsigned int Height;
          } CUDA_MEMCPY2D;

        where:

       • srcMemoryType and dstMemoryType specify the type of memory of the source and destination, respectively;
         CUmemorytype_enum is defined as:

          typedef enum CUmemorytype_enum {
             CU_MEMORYTYPE_HOST = 0x01,
             CU_MEMORYTYPE_DEVICE = 0x02,
             CU_MEMORYTYPE_ARRAY = 0x03,
             CU_MEMORYTYPE_UNIFIED = 0x04
          } CUmemorytype;

         If srcMemoryType is CU_MEMORYTYPE_HOST, srcHost and srcPitch specify the (host) base address of the
         source data and the bytes per row to apply. srcArray is ignored.

         If srcMemoryType is CU_MEMORYTYPE_UNIFIED, srcDevice and srcPitch specify the (unified virtual
         address space) base address of the source data and the bytes per row to apply. srcArray is ignored.
         This value may be used only if unified addressing is supported in the calling context.

         If srcMemoryType is CU_MEMORYTYPE_DEVICE, srcDevice and srcPitch specify the (device) base address
         of the source data and the bytes per row to apply. srcArray is ignored.

         If srcMemoryType is CU_MEMORYTYPE_ARRAY, srcArray specifies the handle of the source data. srcHost,
         srcDevice and srcPitch are ignored.

         If dstMemoryType is CU_MEMORYTYPE_UNIFIED, dstDevice and dstPitch specify the (unified virtual
         address space) base address of the destination data and the bytes per row to apply. dstArray is
         ignored. This value may be used only if unified addressing is supported in the calling context.

         If dstMemoryType is CU_MEMORYTYPE_HOST, dstHost and dstPitch specify the (host) base address of the
         destination data and the bytes per row to apply. dstArray is ignored.

         If dstMemoryType is CU_MEMORYTYPE_DEVICE, dstDevice and dstPitch specify the (device) base address
         of the destination data and the bytes per row to apply. dstArray is ignored.

         If dstMemoryType is CU_MEMORYTYPE_ARRAY, dstArray specifies the handle of the destination data.
         dstHost, dstDevice and dstPitch are ignored.

       • srcXInBytes and srcY specify the base address of the source data for the copy.

         For host pointers, the starting address is

            void* Start = (void*)((char*)srcHost+srcY*srcPitch + srcXInBytes);

         For device pointers, the starting address is

            CUdeviceptr Start = srcDevice+srcY*srcPitch+srcXInBytes;

         For CUDA arrays, srcXInBytes must be evenly divisible by the array element size.

       • dstXInBytes and dstY specify the base address of the destination data for the copy.

         For host pointers, the starting address is

            void* dstStart = (void*)((char*)dstHost+dstY*dstPitch + dstXInBytes);

         For device pointers, the starting address is

            CUdeviceptr dstStart = dstDevice+dstY*dstPitch+dstXInBytes;

         For CUDA arrays, dstXInBytes must be evenly divisible by the array element size.

       • WidthInBytes and Height specify the width (in bytes) and height of the 2D copy being performed.

       • If specified, srcPitch must be greater than or equal to WidthInBytes + srcXInBytes, and  dstPitch  must
         be greater than or equal to WidthInBytes + dstXInBytes.

       • If specified, srcHeight must be greater than or equal to Height + srcY, and dstHeight must  be  greater
         than or equal to Height + dstY.

         cuMemcpy2DAsync() returns an error if any pitch is greater than the maximum allowed
         (CU_DEVICE_ATTRIBUTE_MAX_PITCH). cuMemAllocPitch() passes back pitches that always work with
         cuMemcpy2D(). On intra-device memory copies (device to device, CUDA array to device, CUDA array to
         CUDA array), cuMemcpy2DAsync() may fail for pitches not computed by cuMemAllocPitch().
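
       Because an asynchronous copy may report a parameter problem only as an error code, it can help to check
       the documented limits before filling in the copy parameters. The following is a minimal sketch with
       hypothetical helper names (pitch_in_range, array_x_offset_ok); max_pitch is assumed to hold the value
       queried for CU_DEVICE_ATTRIBUTE_MAX_PITCH, and elem_size the element size of the CUDA array.

```c
#include <stdbool.h>
#include <stddef.h>

/* True if a pitch satisfies both documented limits:
 * WidthInBytes + XInBytes <= pitch <= max_pitch. */
static bool pitch_in_range(size_t pitch, size_t width_in_bytes,
                           size_t x_in_bytes, size_t max_pitch)
{
    return pitch >= width_in_bytes + x_in_bytes && pitch <= max_pitch;
}

/* True if an X offset into a CUDA array is a whole number of elements. */
static bool array_x_offset_ok(size_t x_in_bytes, size_t elem_size)
{
    return elem_size != 0 && x_in_bytes % elem_size == 0;
}
```

       These checks cover only the limits stated here; a pitch that passes them may still be rejected for
       intra-device copies if it was not computed by cuMemAllocPitch().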

       Parameters:
           pCopy - Parameters for the memory copy
           hStream - Stream identifier

       Returns:
           CUDA_SUCCESS,   CUDA_ERROR_DEINITIALIZED,   CUDA_ERROR_NOT_INITIALIZED,   CUDA_ERROR_INVALID_CONTEXT,
           CUDA_ERROR_INVALID_VALUE

       Note:
           Note that this function may also return error codes from previous, asynchronous launches.

           This function exhibits asynchronous behavior for most use cases.

           This function uses standard default stream semantics.

       See also:
           cuArray3DCreate,   cuArray3DGetDescriptor,   cuArrayCreate,   cuArrayDestroy,   cuArrayGetDescriptor,
           cuMemAlloc,   cuMemAllocHost,   cuMemAllocPitch,   cuMemcpy2D,    cuMemcpy2DUnaligned,    cuMemcpy3D,
           cuMemcpy3DAsync,   cuMemcpyAtoA,   cuMemcpyAtoD,   cuMemcpyAtoH,   cuMemcpyAtoHAsync,   cuMemcpyDtoA,
           cuMemcpyDtoD, cuMemcpyDtoDAsync, cuMemcpyDtoH,  cuMemcpyDtoHAsync,  cuMemcpyHtoA,  cuMemcpyHtoAAsync,
           cuMemcpyHtoD,   cuMemcpyHtoDAsync,   cuMemFree,  cuMemFreeHost,  cuMemGetAddressRange,  cuMemGetInfo,
           cuMemHostAlloc,   cuMemHostGetDevicePointer,    cuMemsetD2D8,    cuMemsetD2D8Async,    cuMemsetD2D16,
           cuMemsetD2D16Async,  cuMemsetD2D32,  cuMemsetD2D32Async,  cuMemsetD8,  cuMemsetD8Async,  cuMemsetD16,
           cuMemsetD16Async,   cuMemsetD32,   cuMemsetD32Async,   cudaMemcpy2DAsync,   cudaMemcpy2DToArrayAsync,
           cudaMemcpy2DFromArrayAsync

   CUresult cuMemcpy2DUnaligned (const CUDA_MEMCPY2D * pCopy)
       Perform  a  2D memory copy according to the parameters specified in pCopy. The CUDA_MEMCPY2D structure is
       defined as:

          typedef struct CUDA_MEMCPY2D_st {
             unsigned int srcXInBytes, srcY;
             CUmemorytype srcMemoryType;
             const void *srcHost;
             CUdeviceptr srcDevice;
             CUarray srcArray;
             unsigned int srcPitch;
             unsigned int dstXInBytes, dstY;
             CUmemorytype dstMemoryType;
             void *dstHost;
             CUdeviceptr dstDevice;
             CUarray dstArray;
             unsigned int dstPitch;
             unsigned int WidthInBytes;
             unsigned int Height;
          } CUDA_MEMCPY2D;

        where:

       • srcMemoryType and dstMemoryType specify the type of memory of the source and destination, respectively;
         CUmemorytype_enum is defined as:

          typedef enum CUmemorytype_enum {
             CU_MEMORYTYPE_HOST = 0x01,
             CU_MEMORYTYPE_DEVICE = 0x02,
             CU_MEMORYTYPE_ARRAY = 0x03,
             CU_MEMORYTYPE_UNIFIED = 0x04
          } CUmemorytype;

         If srcMemoryType is CU_MEMORYTYPE_UNIFIED, srcDevice and srcPitch specify the (unified virtual
         address space) base address of the source data and the bytes per row to apply. srcArray is ignored.
         This value may be used only if unified addressing is supported in the calling context.

         If srcMemoryType is CU_MEMORYTYPE_HOST, srcHost and srcPitch specify the (host) base address of the
         source data and the bytes per row to apply. srcArray is ignored.

         If srcMemoryType is CU_MEMORYTYPE_DEVICE, srcDevice and srcPitch specify the (device) base address
         of the source data and the bytes per row to apply. srcArray is ignored.

         If srcMemoryType is CU_MEMORYTYPE_ARRAY, srcArray specifies the handle of the source data. srcHost,
         srcDevice and srcPitch are ignored.

         If dstMemoryType is CU_MEMORYTYPE_UNIFIED, dstDevice and dstPitch specify the (unified virtual
         address space) base address of the destination data and the bytes per row to apply. dstArray is
         ignored. This value may be used only if unified addressing is supported in the calling context.

         If dstMemoryType is CU_MEMORYTYPE_HOST, dstHost and dstPitch specify the (host) base address of the
         destination data and the bytes per row to apply. dstArray is ignored.

         If dstMemoryType is CU_MEMORYTYPE_DEVICE, dstDevice and dstPitch specify the (device) base address
         of the destination data and the bytes per row to apply. dstArray is ignored.

         If dstMemoryType is CU_MEMORYTYPE_ARRAY, dstArray specifies the handle of the destination data.
         dstHost, dstDevice and dstPitch are ignored.

       • srcXInBytes and srcY specify the base address of the source data for the copy.

         For host pointers, the starting address is

           void* Start = (void*)((char*)srcHost+srcY*srcPitch + srcXInBytes);

         For device pointers, the starting address is

           CUdeviceptr Start = srcDevice+srcY*srcPitch+srcXInBytes;

         For CUDA arrays, srcXInBytes must be evenly divisible by the array element size.

       • dstXInBytes and dstY specify the base address of the destination data for the copy.

         For host pointers, the starting address is

           void* dstStart = (void*)((char*)dstHost+dstY*dstPitch + dstXInBytes);

         For device pointers, the starting address is

           CUdeviceptr dstStart = dstDevice+dstY*dstPitch+dstXInBytes;

         For CUDA arrays, dstXInBytes must be evenly divisible by the array element size.

       • WidthInBytes and Height specify the width (in bytes) and height of the 2D copy being performed.

       • If  specified,  srcPitch must be greater than or equal to WidthInBytes + srcXInBytes, and dstPitch must
         be greater than or equal to WidthInBytes + dstXInBytes.

       cuMemcpy2D() returns an error if any pitch is greater than the maximum allowed
       (CU_DEVICE_ATTRIBUTE_MAX_PITCH). cuMemAllocPitch() passes back pitches that always work with
       cuMemcpy2D(). On intra-device memory copies (device to device, CUDA array to device, CUDA array to CUDA
       array), cuMemcpy2D() may fail for pitches not computed by cuMemAllocPitch(). cuMemcpy2DUnaligned() does
       not have this restriction, but may run significantly slower in the cases where cuMemcpy2D() would have
       returned an error code.
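
       The starting-address formula and the pitch constraint above can be sketched in plain C. This is a
       host-only illustration, not driver code: start_offset_2d, pitch_ok_2d and copy_rect_2d_host are
       hypothetical helpers that do not exist in the CUDA driver API, and the copy loop only models the case
       where both sides are ordinary host buffers.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: the "y*pitch + xInBytes" term of the starting-address
 * formula, i.e. the byte offset of the first copied byte in a pitched region. */
static size_t start_offset_2d(size_t xInBytes, size_t y, size_t pitch)
{
    return y * pitch + xInBytes;
}

/* Pitch constraint from the text: pitch >= WidthInBytes + xInBytes. */
static int pitch_ok_2d(size_t pitch, size_t widthInBytes, size_t xInBytes)
{
    return pitch >= widthInBytes + xInBytes;
}

/* Host-only model of the rectangle a 2D copy touches.
 * Returns 0 on success, -1 if either pitch violates the constraint. */
static int copy_rect_2d_host(void *dst, size_t dstPitch, size_t dstX, size_t dstY,
                             const void *src, size_t srcPitch, size_t srcX, size_t srcY,
                             size_t widthInBytes, size_t height)
{
    assert(dst != NULL && src != NULL);
    if (!pitch_ok_2d(srcPitch, widthInBytes, srcX) ||
        !pitch_ok_2d(dstPitch, widthInBytes, dstX))
        return -1;

    const char *s = (const char *)src + start_offset_2d(srcX, srcY, srcPitch);
    char *d = (char *)dst + start_offset_2d(dstX, dstY, dstPitch);

    /* Each row advances by its own pitch; only widthInBytes bytes are copied. */
    for (size_t row = 0; row < height; ++row)
        memcpy(d + row * dstPitch, s + row * srcPitch, widthInBytes);
    return 0;
}
```

       For example, with srcPitch = 8, srcXInBytes = 2 and srcY = 1, the first byte read sits at offset
       1*8 + 2 = 10 from the source base address.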

       Parameters:
           pCopy - Parameters for the memory copy

       Returns:
           CUDA_SUCCESS,   CUDA_ERROR_DEINITIALIZED,   CUDA_ERROR_NOT_INITIALIZED,   CUDA_ERROR_INVALID_CONTEXT,
           CUDA_ERROR_INVALID_VALUE

       Note:
           Note that this function may also return error codes from previous, asynchronous launches.

           This function exhibits synchronous behavior for most use cases.

       See also:
           cuArray3DCreate,   cuArray3DGetDescriptor,   cuArrayCreate,   cuArrayDestroy,   cuArrayGetDescriptor,
           cuMemAlloc,     cuMemAllocHost,    cuMemAllocPitch,    cuMemcpy2D,    cuMemcpy2DAsync,    cuMemcpy3D,
           cuMemcpy3DAsync,   cuMemcpyAtoA,   cuMemcpyAtoD,   cuMemcpyAtoH,   cuMemcpyAtoHAsync,   cuMemcpyDtoA,
           cuMemcpyDtoD,  cuMemcpyDtoDAsync,  cuMemcpyDtoH,  cuMemcpyDtoHAsync, cuMemcpyHtoA, cuMemcpyHtoAAsync,
           cuMemcpyHtoD,  cuMemcpyHtoDAsync,  cuMemFree,  cuMemFreeHost,   cuMemGetAddressRange,   cuMemGetInfo,
           cuMemHostAlloc,  cuMemHostGetDevicePointer,  cuMemsetD2D8,  cuMemsetD2D16, cuMemsetD2D32, cuMemsetD8,
           cuMemsetD16, cuMemsetD32, cudaMemcpy2D, cudaMemcpy2DToArray, cudaMemcpy2DFromArray

   CUresult cuMemcpy3D (const CUDA_MEMCPY3D * pCopy)
       Perform a 3D memory copy according to the parameters specified in pCopy. The CUDA_MEMCPY3D  structure  is
       defined as:

               typedef struct CUDA_MEMCPY3D_st {

                   unsigned int srcXInBytes, srcY, srcZ;
                   unsigned int srcLOD;
                   CUmemorytype srcMemoryType;
                       const void *srcHost;
                       CUdeviceptr srcDevice;
                       CUarray srcArray;
                       unsigned int srcPitch;  // ignored when src is array
                       unsigned int srcHeight; // ignored when src is array; may be 0 if Depth==1

                   unsigned int dstXInBytes, dstY, dstZ;
                   unsigned int dstLOD;
                   CUmemorytype dstMemoryType;
                       void *dstHost;
                       CUdeviceptr dstDevice;
                       CUarray dstArray;
                       unsigned int dstPitch;  // ignored when dst is array
                       unsigned int dstHeight; // ignored when dst is array; may be 0 if Depth==1

                   unsigned int WidthInBytes;
                   unsigned int Height;
                   unsigned int Depth;
               } CUDA_MEMCPY3D;

        where:

       • srcMemoryType and dstMemoryType specify the type of memory of the source and destination, respectively;
         CUmemorytype_enum is defined as:

          typedef enum CUmemorytype_enum {
             CU_MEMORYTYPE_HOST = 0x01,
             CU_MEMORYTYPE_DEVICE = 0x02,
             CU_MEMORYTYPE_ARRAY = 0x03,
             CU_MEMORYTYPE_UNIFIED = 0x04
          } CUmemorytype;

         If srcMemoryType is CU_MEMORYTYPE_UNIFIED, srcDevice and srcPitch specify the (unified virtual
         address space) base address of the source data and the bytes per row to apply. srcArray is ignored.
         This value may be used only if unified addressing is supported in the calling context.

         If srcMemoryType is CU_MEMORYTYPE_HOST, srcHost, srcPitch and srcHeight specify the (host) base
         address of the source data, the bytes per row, and the height of each 2D slice of the 3D array.
         srcArray is ignored.

         If srcMemoryType is CU_MEMORYTYPE_DEVICE, srcDevice, srcPitch and srcHeight specify the (device)
         base address of the source data, the bytes per row, and the height of each 2D slice of the 3D
         array. srcArray is ignored.

         If srcMemoryType is CU_MEMORYTYPE_ARRAY, srcArray specifies the handle of the source data. srcHost,
         srcDevice, srcPitch and srcHeight are ignored.

         If dstMemoryType is CU_MEMORYTYPE_UNIFIED, dstDevice and dstPitch specify the (unified virtual
         address space) base address of the destination data and the bytes per row to apply. dstArray is
         ignored. This value may be used only if unified addressing is supported in the calling context.

         If dstMemoryType is CU_MEMORYTYPE_HOST, dstHost, dstPitch and dstHeight specify the (host) base
         address of the destination data, the bytes per row, and the height of each 2D slice of the 3D
         array. dstArray is ignored.

         If dstMemoryType is CU_MEMORYTYPE_DEVICE, dstDevice, dstPitch and dstHeight specify the (device)
         base address of the destination data, the bytes per row, and the height of each 2D slice of the 3D
         array. dstArray is ignored.

         If dstMemoryType is CU_MEMORYTYPE_ARRAY, dstArray specifies the handle of the destination data.
         dstHost, dstDevice, dstPitch and dstHeight are ignored.

       • srcXInBytes, srcY and srcZ specify the base address of the source data for the copy.

         For host pointers, the starting address is

           void* Start = (void*)((char*)srcHost+(srcZ*srcHeight+srcY)*srcPitch + srcXInBytes);

         For device pointers, the starting address is

           CUdeviceptr Start = srcDevice+(srcZ*srcHeight+srcY)*srcPitch+srcXInBytes;

         For CUDA arrays, srcXInBytes must be evenly divisible by the array element size.

       • dstXInBytes, dstY and dstZ specify the base address of the destination data for the copy.

         For host pointers, the starting address is

           void* dstStart = (void*)((char*)dstHost+(dstZ*dstHeight+dstY)*dstPitch + dstXInBytes);

         For device pointers, the starting address is

           CUdeviceptr dstStart = dstDevice+(dstZ*dstHeight+dstY)*dstPitch+dstXInBytes;

         For CUDA arrays, dstXInBytes must be evenly divisible by the array element size.

       • WidthInBytes,  Height  and  Depth  specify  the width (in bytes), height and depth of the 3D copy being
         performed.

       • If specified, srcPitch must be greater than or equal to WidthInBytes + srcXInBytes, and  dstPitch  must
         be greater than or equal to WidthInBytes + dstXInBytes.

       • If  specified,  srcHeight must be greater than or equal to Height + srcY, and dstHeight must be greater
         than or equal to Height + dstY.

       cuMemcpy3D() returns an error if any pitch is greater than the maximum allowed
       (CU_DEVICE_ATTRIBUTE_MAX_PITCH).

       The srcLOD and dstLOD members of the CUDA_MEMCPY3D structure must be set to 0.
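
       The 3D starting-address formula and the pitch, height and LOD constraints above can be checked on the
       host before a copy is issued. The sketch below is an illustration only: Side3D, start_offset_3d and
       side_ok_3d are hypothetical names, not part of the CUDA driver API, and maxPitch stands for whatever
       value the caller obtained for CU_DEVICE_ATTRIBUTE_MAX_PITCH. Skipping the height check when Depth is 1
       and the height field is 0 is an interpretation of the "may be 0 if Depth==1" comment in the structure.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical mirror of one side (source or destination) of CUDA_MEMCPY3D,
 * reduced to the fields that enter the address formula and the constraints. */
typedef struct {
    size_t xInBytes, y, z;
    size_t pitch;      /* bytes per row */
    size_t height;     /* rows per 2D slice */
    unsigned int lod;  /* must be 0 */
} Side3D;

/* (z*height + y)*pitch + xInBytes, as in the starting-address formula. */
static size_t start_offset_3d(const Side3D *s)
{
    assert(s != NULL);
    return (s->z * s->height + s->y) * s->pitch + s->xInBytes;
}

/* Returns 1 if the side satisfies the documented constraints, 0 otherwise. */
static int side_ok_3d(const Side3D *s, size_t widthInBytes, size_t height,
                      size_t depth, size_t maxPitch)
{
    assert(s != NULL);
    if (s->lod != 0)
        return 0;                                   /* LOD members must be 0 */
    if (s->pitch > maxPitch)
        return 0;                                   /* pitch above device maximum */
    if (s->pitch < widthInBytes + s->xInBytes)
        return 0;                                   /* pitch >= WidthInBytes + xInBytes */
    /* Assumption: a zero height field is tolerated only for single-slice copies. */
    if ((depth > 1 || s->height != 0) && s->height < height + s->y)
        return 0;                                   /* height >= Height + y */
    return 1;
}
```

       A caller would typically run side_ok_3d once for the source and once for the destination, and only then
       fill in CUDA_MEMCPY3D and call cuMemcpy3D().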

       Parameters:
           pCopy - Parameters for the memory copy

       Returns:
           CUDA_SUCCESS,   CUDA_ERROR_DEINITIALIZED,   CUDA_ERROR_NOT_INITIALIZED,   CUDA_ERROR_INVALID_CONTEXT,
           CUDA_ERROR_INVALID_VALUE

       Note:
           Note that this function may also return error codes from previous, asynchronous launches.

           This function exhibits synchronous behavior for most use cases.

       See also:
           cuArray3DCreate,   cuArray3DGetDescriptor,   cuArrayCreate,   cuArrayDestroy,   cuArrayGetDescriptor,
           cuMemAlloc,   cuMemAllocHost,   cuMemAllocPitch,  cuMemcpy2D,  cuMemcpy2DAsync,  cuMemcpy2DUnaligned,
           cuMemcpy3DAsync,   cuMemcpyAtoA,   cuMemcpyAtoD,   cuMemcpyAtoH,   cuMemcpyAtoHAsync,   cuMemcpyDtoA,
           cuMemcpyDtoD,  cuMemcpyDtoDAsync,  cuMemcpyDtoH,  cuMemcpyDtoHAsync, cuMemcpyHtoA, cuMemcpyHtoAAsync,
           cuMemcpyHtoD,  cuMemcpyHtoDAsync,  cuMemFree,  cuMemFreeHost,   cuMemGetAddressRange,   cuMemGetInfo,
           cuMemHostAlloc,  cuMemHostGetDevicePointer,  cuMemsetD2D8,  cuMemsetD2D16, cuMemsetD2D32, cuMemsetD8,
           cuMemsetD16, cuMemsetD32, cudaMemcpy3D

   CUresult cuMemcpy3DAsync (const CUDA_MEMCPY3D * pCopy, CUstream hStream)
       Perform a 3D memory copy according to the parameters specified in pCopy. The CUDA_MEMCPY3D  structure  is
       defined as:

               typedef struct CUDA_MEMCPY3D_st {

                   unsigned int srcXInBytes, srcY, srcZ;
                   unsigned int srcLOD;
                   CUmemorytype srcMemoryType;
                       const void *srcHost;
                       CUdeviceptr srcDevice;
                       CUarray srcArray;
                       unsigned int srcPitch;  // ignored when src is array
                       unsigned int srcHeight; // ignored when src is array; may be 0 if Depth==1

                   unsigned int dstXInBytes, dstY, dstZ;
                   unsigned int dstLOD;
                   CUmemorytype dstMemoryType;
                       void *dstHost;
                       CUdeviceptr dstDevice;
                       CUarray dstArray;
                       unsigned int dstPitch;  // ignored when dst is array
                       unsigned int dstHeight; // ignored when dst is array; may be 0 if Depth==1

                   unsigned int WidthInBytes;
                   unsigned int Height;
                   unsigned int Depth;
               } CUDA_MEMCPY3D;

        where:

       • srcMemoryType and dstMemoryType specify the type of memory of the source and destination, respectively;
         CUmemorytype_enum is defined as:

          typedef enum CUmemorytype_enum {
             CU_MEMORYTYPE_HOST = 0x01,
             CU_MEMORYTYPE_DEVICE = 0x02,
             CU_MEMORYTYPE_ARRAY = 0x03,
             CU_MEMORYTYPE_UNIFIED = 0x04
          } CUmemorytype;

         If srcMemoryType is CU_MEMORYTYPE_UNIFIED, srcDevice and srcPitch specify the (unified virtual
         address space) base address of the source data and the bytes per row to apply. srcArray is ignored.
         This value may be used only if unified addressing is supported in the calling context.

         If srcMemoryType is CU_MEMORYTYPE_HOST, srcHost, srcPitch and srcHeight specify the (host) base
         address of the source data, the bytes per row, and the height of each 2D slice of the 3D array.
         srcArray is ignored.

         If srcMemoryType is CU_MEMORYTYPE_DEVICE, srcDevice, srcPitch and srcHeight specify the (device)
         base address of the source data, the bytes per row, and the height of each 2D slice of the 3D
         array. srcArray is ignored.

         If srcMemoryType is CU_MEMORYTYPE_ARRAY, srcArray specifies the handle of the source data. srcHost,
         srcDevice, srcPitch and srcHeight are ignored.

         If dstMemoryType is CU_MEMORYTYPE_UNIFIED, dstDevice and dstPitch specify the (unified virtual
         address space) base address of the destination data and the bytes per row to apply. dstArray is
         ignored. This value may be used only if unified addressing is supported in the calling context.

         If dstMemoryType is CU_MEMORYTYPE_HOST, dstHost, dstPitch and dstHeight specify the (host) base
         address of the destination data, the bytes per row, and the height of each 2D slice of the 3D
         array. dstArray is ignored.

         If dstMemoryType is CU_MEMORYTYPE_DEVICE, dstDevice, dstPitch and dstHeight specify the (device)
         base address of the destination data, the bytes per row, and the height of each 2D slice of the 3D
         array. dstArray is ignored.

         If dstMemoryType is CU_MEMORYTYPE_ARRAY, dstArray specifies the handle of the destination data.
         dstHost, dstDevice, dstPitch and dstHeight are ignored.

       • srcXInBytes, srcY and srcZ specify the base address of the source data for the copy.

         For host pointers, the starting address is

           void* Start = (void*)((char*)srcHost+(srcZ*srcHeight+srcY)*srcPitch + srcXInBytes);

         For device pointers, the starting address is

           CUdeviceptr Start = srcDevice+(srcZ*srcHeight+srcY)*srcPitch+srcXInBytes;

         For CUDA arrays, srcXInBytes must be evenly divisible by the array element size.

       • dstXInBytes, dstY and dstZ specify the base address of the destination data for the copy.

         For host pointers, the starting address is

           void* dstStart = (void*)((char*)dstHost+(dstZ*dstHeight+dstY)*dstPitch + dstXInBytes);

         For device pointers, the starting address is

           CUdeviceptr dstStart = dstDevice+(dstZ*dstHeight+dstY)*dstPitch+dstXInBytes;

         For CUDA arrays, dstXInBytes must be evenly divisible by the array element size.

       • WidthInBytes,  Height  and  Depth  specify  the width (in bytes), height and depth of the 3D copy being
         performed.

       • If specified, srcPitch must be greater than or equal to WidthInBytes + srcXInBytes, and  dstPitch  must
         be greater than or equal to WidthInBytes + dstXInBytes.

       • If  specified,  srcHeight must be greater than or equal to Height + srcY, and dstHeight must be greater
         than or equal to Height + dstY.

       cuMemcpy3DAsync() returns an error if any pitch is greater than the maximum allowed
       (CU_DEVICE_ATTRIBUTE_MAX_PITCH).

       The srcLOD and dstLOD members of the CUDA_MEMCPY3D structure must be set to 0.

       Parameters:
           pCopy - Parameters for the memory copy
           hStream - Stream identifier

       Returns:
           CUDA_SUCCESS,   CUDA_ERROR_DEINITIALIZED,   CUDA_ERROR_NOT_INITIALIZED,   CUDA_ERROR_INVALID_CONTEXT,
           CUDA_ERROR_INVALID_VALUE

       Note:
           Note that this function may also return error codes from previous, asynchronous launches.

           This function exhibits asynchronous behavior for most use cases.

           This function uses standard default stream semantics.

       See also:
           cuArray3DCreate,   cuArray3DGetDescriptor,   cuArrayCreate,   cuArrayDestroy,   cuArrayGetDescriptor,
           cuMemAlloc,   cuMemAllocHost,   cuMemAllocPitch,  cuMemcpy2D,  cuMemcpy2DAsync,  cuMemcpy2DUnaligned,
           cuMemcpy3D, cuMemcpyAtoA, cuMemcpyAtoD, cuMemcpyAtoH, cuMemcpyAtoHAsync, cuMemcpyDtoA,  cuMemcpyDtoD,
           cuMemcpyDtoDAsync,  cuMemcpyDtoH,  cuMemcpyDtoHAsync,  cuMemcpyHtoA, cuMemcpyHtoAAsync, cuMemcpyHtoD,
           cuMemcpyHtoDAsync,  cuMemFree,  cuMemFreeHost,  cuMemGetAddressRange,  cuMemGetInfo,  cuMemHostAlloc,
           cuMemHostGetDevicePointer,   cuMemsetD2D8,   cuMemsetD2D8Async,   cuMemsetD2D16,  cuMemsetD2D16Async,
           cuMemsetD2D32,  cuMemsetD2D32Async,  cuMemsetD8,  cuMemsetD8Async,   cuMemsetD16,   cuMemsetD16Async,
           cuMemsetD32, cuMemsetD32Async, cudaMemcpy3DAsync

   CUresult cuMemcpy3DPeer (const CUDA_MEMCPY3D_PEER * pCopy)
       Perform  a  3D  memory  copy  according  to  the parameters specified in pCopy. See the definition of the
       CUDA_MEMCPY3D_PEER structure for documentation of its parameters.

       Parameters:
           pCopy - Parameters for the memory copy

       Returns:
           CUDA_SUCCESS,   CUDA_ERROR_DEINITIALIZED,   CUDA_ERROR_NOT_INITIALIZED,   CUDA_ERROR_INVALID_CONTEXT,
           CUDA_ERROR_INVALID_VALUE

       Note:
           Note that this function may also return error codes from previous, asynchronous launches.

           This function exhibits synchronous behavior for most use cases.

       See also:
           cuMemcpyDtoD,     cuMemcpyPeer,     cuMemcpyDtoDAsync,     cuMemcpyPeerAsync,    cuMemcpy3DPeerAsync,
           cudaMemcpy3DPeer

   CUresult cuMemcpy3DPeerAsync (const CUDA_MEMCPY3D_PEER * pCopy, CUstream hStream)
       Perform a 3D memory copy according to the parameters specified  in  pCopy.  See  the  definition  of  the
       CUDA_MEMCPY3D_PEER structure for documentation of its parameters.

       Parameters:
           pCopy - Parameters for the memory copy
           hStream - Stream identifier

       Returns:
           CUDA_SUCCESS,   CUDA_ERROR_DEINITIALIZED,   CUDA_ERROR_NOT_INITIALIZED,   CUDA_ERROR_INVALID_CONTEXT,
           CUDA_ERROR_INVALID_VALUE

       Note:
           Note that this function may also return error codes from previous, asynchronous launches.

           This function exhibits asynchronous behavior for most use cases.

           This function uses standard default stream semantics.

       See also:
           cuMemcpyDtoD, cuMemcpyPeer, cuMemcpyDtoDAsync, cuMemcpyPeerAsync, cuMemcpy3DPeer,
           cudaMemcpy3DPeerAsync

   CUresult cuMemcpyAsync (CUdeviceptr dst, CUdeviceptr src, size_t ByteCount, CUstream hStream)
       Copies  data  between  two  pointers.  dst  and  src  are  base  pointers  of the destination and source,
       respectively. ByteCount specifies the number of bytes to copy. Note that this function infers the type of
       the transfer (host to host, host to device, device to device, or device to host) from the pointer values.
       This function is only allowed in contexts which support unified addressing.

       Parameters:
           dst - Destination unified virtual address space pointer
           src - Source unified virtual address space pointer
           ByteCount - Size of memory copy in bytes
           hStream - Stream identifier

       Returns:
           CUDA_SUCCESS,   CUDA_ERROR_DEINITIALIZED,   CUDA_ERROR_NOT_INITIALIZED,   CUDA_ERROR_INVALID_CONTEXT,
           CUDA_ERROR_INVALID_VALUE

       Note:
           Note that this function may also return error codes from previous, asynchronous launches.

           This function exhibits asynchronous behavior for most use cases.

           This function uses standard default stream semantics.

       See also:
           cuArray3DCreate,   cuArray3DGetDescriptor,   cuArrayCreate,   cuArrayDestroy,   cuArrayGetDescriptor,
           cuMemAlloc,  cuMemAllocHost,  cuMemAllocPitch,  cuMemcpy2D,   cuMemcpy2DAsync,   cuMemcpy2DUnaligned,
           cuMemcpy3D,    cuMemcpy3DAsync,    cuMemcpyAtoA,   cuMemcpyAtoD,   cuMemcpyAtoH,   cuMemcpyAtoHAsync,
           cuMemcpyDtoA,  cuMemcpyDtoD,  cuMemcpyDtoH,   cuMemcpyDtoHAsync,   cuMemcpyHtoA,   cuMemcpyHtoAAsync,
           cuMemcpyHtoD,   cuMemcpyHtoDAsync,   cuMemFree,  cuMemFreeHost,  cuMemGetAddressRange,  cuMemGetInfo,
           cuMemHostAlloc,   cuMemHostGetDevicePointer,    cuMemsetD2D8,    cuMemsetD2D8Async,    cuMemsetD2D16,
           cuMemsetD2D16Async,  cuMemsetD2D32,  cuMemsetD2D32Async,  cuMemsetD8,  cuMemsetD8Async,  cuMemsetD16,
           cuMemsetD16Async,   cuMemsetD32,    cuMemsetD32Async,    cudaMemcpyAsync,    cudaMemcpyToSymbolAsync,
           cudaMemcpyFromSymbolAsync

   CUresult  cuMemcpyAtoA  (CUarray  dstArray,  size_t  dstOffset,  CUarray  srcArray,  size_t srcOffset, size_t
       ByteCount)
       Copies from one 1D CUDA array to another. dstArray and srcArray specify the handles  of  the  destination
       and  source  CUDA  arrays for the copy, respectively. dstOffset and srcOffset specify the destination and
       source offsets in bytes into the CUDA arrays. ByteCount is the number of bytes to be copied. The
       elements in the two CUDA arrays need not have the same format, but they must be the same size, and
       ByteCount must be evenly divisible by that size.
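
       The element-size rule can be expressed as a small host-side check. This is a sketch under stated
       assumptions: atoa_element_count is a hypothetical helper, not a driver API function, and the caller is
       assumed to have already derived each array's element size in bytes (for instance from the descriptor
       returned by cuArrayGetDescriptor()).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical pre-flight check for an array-to-array copy.
 * Succeeds (returns 1) only if both arrays have the same non-zero element size
 * and byteCount is a whole number of elements; *count then receives the number
 * of elements the copy would move. Returns 0 otherwise and leaves *count alone. */
static int atoa_element_count(size_t dstElemSize, size_t srcElemSize,
                              size_t byteCount, size_t *count)
{
    assert(count != NULL);
    if (dstElemSize == 0 || dstElemSize != srcElemSize)
        return 0;                       /* formats may differ, sizes may not */
    if (byteCount % srcElemSize != 0)
        return 0;                       /* partial elements are not allowed */
    *count = byteCount / srcElemSize;
    return 1;
}
```

       Under this rule, two arrays whose elements are both 4 bytes wide are compatible even if their formats
       differ, while a 62-byte copy between them is rejected because 62 is not a multiple of 4.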

       Parameters:
           dstArray - Destination array
           dstOffset - Offset in bytes of destination array
           srcArray - Source array
           srcOffset - Offset in bytes of source array
           ByteCount - Size of memory copy in bytes

       Returns:
           CUDA_SUCCESS,   CUDA_ERROR_DEINITIALIZED,   CUDA_ERROR_NOT_INITIALIZED,   CUDA_ERROR_INVALID_CONTEXT,
           CUDA_ERROR_INVALID_VALUE

       Note:
           Note that this function may also return error codes from previous, asynchronous launches.

           This function exhibits synchronous behavior for most use cases.

       See also:
           cuArray3DCreate,   cuArray3DGetDescriptor,   cuArrayCreate,   cuArrayDestroy,   cuArrayGetDescriptor,
           cuMemAlloc,  cuMemAllocHost,  cuMemAllocPitch,  cuMemcpy2D,   cuMemcpy2DAsync,   cuMemcpy2DUnaligned,
           cuMemcpy3D,    cuMemcpy3DAsync,    cuMemcpyAtoD,   cuMemcpyAtoH,   cuMemcpyAtoHAsync,   cuMemcpyDtoA,
           cuMemcpyDtoD, cuMemcpyDtoDAsync, cuMemcpyDtoH,  cuMemcpyDtoHAsync,  cuMemcpyHtoA,  cuMemcpyHtoAAsync,
           cuMemcpyHtoD,   cuMemcpyHtoDAsync,   cuMemFree,  cuMemFreeHost,  cuMemGetAddressRange,  cuMemGetInfo,
           cuMemHostAlloc, cuMemHostGetDevicePointer, cuMemsetD2D8,  cuMemsetD2D16,  cuMemsetD2D32,  cuMemsetD8,
           cuMemsetD16, cuMemsetD32, cudaMemcpyArrayToArray

   CUresult cuMemcpyAtoD (CUdeviceptr dstDevice, CUarray srcArray, size_t srcOffset, size_t ByteCount)
       Copies  from  one 1D CUDA array to device memory. dstDevice specifies the base pointer of the destination
       and must be naturally aligned with the CUDA array elements. srcArray and srcOffset specify the CUDA array
       handle and the offset in bytes into the array where the copy is to begin. ByteCount specifies the  number
       of bytes to copy and must be evenly divisible by the array element size.
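
       The two requirements on dstDevice and ByteCount can be checked before the call. The sketch below
       assumes that "naturally aligned" means the destination address is a multiple of the array element size;
       that reading, the DevPtr stand-in type and the helper name atod_args_ok are assumptions made for
       illustration and are not taken from the driver API.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for CUdeviceptr so the sketch builds without cuda.h (assumption). */
typedef unsigned long long DevPtr;

/* Hypothetical check: returns 1 if dstDevice is aligned to the element size
 * and byteCount is a whole number of elements, 0 otherwise. */
static int atod_args_ok(DevPtr dstDevice, size_t byteCount, size_t elemSize)
{
    assert(elemSize != 0);
    if (dstDevice % elemSize != 0)
        return 0;                       /* destination not naturally aligned */
    if (byteCount % elemSize != 0)
        return 0;                       /* ByteCount not a multiple of element size */
    return 1;
}
```

       With 16-byte elements, a destination address of 0x1000 passes the check, while 0x1008 fails it.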

       Parameters:
           dstDevice - Destination device pointer
           srcArray - Source array
           srcOffset - Offset in bytes of source array
           ByteCount - Size of memory copy in bytes

       Returns:
           CUDA_SUCCESS,   CUDA_ERROR_DEINITIALIZED,   CUDA_ERROR_NOT_INITIALIZED,   CUDA_ERROR_INVALID_CONTEXT,
           CUDA_ERROR_INVALID_VALUE

       Note:
           Note that this function may also return error codes from previous, asynchronous launches.

           This function exhibits synchronous behavior for most use cases.

       See also:
           cuArray3DCreate,   cuArray3DGetDescriptor,   cuArrayCreate,   cuArrayDestroy,   cuArrayGetDescriptor,
           cuMemAlloc,   cuMemAllocHost,   cuMemAllocPitch,  cuMemcpy2D,  cuMemcpy2DAsync,  cuMemcpy2DUnaligned,
           cuMemcpy3D,   cuMemcpy3DAsync,   cuMemcpyAtoA,   cuMemcpyAtoH,    cuMemcpyAtoHAsync,    cuMemcpyDtoA,
           cuMemcpyDtoD,  cuMemcpyDtoDAsync,  cuMemcpyDtoH,  cuMemcpyDtoHAsync, cuMemcpyHtoA, cuMemcpyHtoAAsync,
           cuMemcpyHtoD,  cuMemcpyHtoDAsync,  cuMemFree,  cuMemFreeHost,   cuMemGetAddressRange,   cuMemGetInfo,
           cuMemHostAlloc,  cuMemHostGetDevicePointer,  cuMemsetD2D8,  cuMemsetD2D16, cuMemsetD2D32, cuMemsetD8,
           cuMemsetD16, cuMemsetD32, cudaMemcpyFromArray

   CUresult cuMemcpyAtoH (void * dstHost, CUarray srcArray, size_t srcOffset, size_t ByteCount)
       Copies from one 1D CUDA array to host memory. dstHost specifies the  base  pointer  of  the  destination.
       srcArray  and  srcOffset  specify  the CUDA array handle and starting offset in bytes of the source data.
       ByteCount specifies the number of bytes to copy.

       Parameters:
           dstHost - Destination host pointer
           srcArray - Source array
           srcOffset - Offset in bytes of source array
           ByteCount - Size of memory copy in bytes

       Returns:
           CUDA_SUCCESS,   CUDA_ERROR_DEINITIALIZED,   CUDA_ERROR_NOT_INITIALIZED,   CUDA_ERROR_INVALID_CONTEXT,
           CUDA_ERROR_INVALID_VALUE

       Note:
           Note that this function may also return error codes from previous, asynchronous launches.

           This function exhibits synchronous behavior for most use cases.

       See also:
           cuArray3DCreate,   cuArray3DGetDescriptor,   cuArrayCreate,   cuArrayDestroy,   cuArrayGetDescriptor,
           cuMemAlloc,  cuMemAllocHost,  cuMemAllocPitch,  cuMemcpy2D,   cuMemcpy2DAsync,   cuMemcpy2DUnaligned,
           cuMemcpy3D,    cuMemcpy3DAsync,    cuMemcpyAtoA,   cuMemcpyAtoD,   cuMemcpyAtoHAsync,   cuMemcpyDtoA,
           cuMemcpyDtoD, cuMemcpyDtoDAsync, cuMemcpyDtoH,  cuMemcpyDtoHAsync,  cuMemcpyHtoA,  cuMemcpyHtoAAsync,
           cuMemcpyHtoD,   cuMemcpyHtoDAsync,   cuMemFree,  cuMemFreeHost,  cuMemGetAddressRange,  cuMemGetInfo,
           cuMemHostAlloc, cuMemHostGetDevicePointer, cuMemsetD2D8,  cuMemsetD2D16,  cuMemsetD2D32,  cuMemsetD8,
           cuMemsetD16, cuMemsetD32, cudaMemcpyFromArray

   CUresult  cuMemcpyAtoHAsync  (void  * dstHost, CUarray srcArray, size_t srcOffset, size_t ByteCount, CUstream
       hStream)
       Copies from one 1D CUDA array to host memory. dstHost specifies the  base  pointer  of  the  destination.
       srcArray  and  srcOffset  specify  the CUDA array handle and starting offset in bytes of the source data.
       ByteCount specifies the number of bytes to copy.

       Parameters:
           dstHost - Destination pointer
           srcArray - Source array
           srcOffset - Offset in bytes of source array
           ByteCount - Size of memory copy in bytes
           hStream - Stream identifier

       Returns:
           CUDA_SUCCESS,   CUDA_ERROR_DEINITIALIZED,   CUDA_ERROR_NOT_INITIALIZED,   CUDA_ERROR_INVALID_CONTEXT,
           CUDA_ERROR_INVALID_VALUE

       Note:
           Note that this function may also return error codes from previous, asynchronous launches.

           This function exhibits asynchronous behavior for most use cases.

           This function uses standard default stream semantics.

       See also:
           cuArray3DCreate,   cuArray3DGetDescriptor,   cuArrayCreate,   cuArrayDestroy,   cuArrayGetDescriptor,
           cuMemAlloc,  cuMemAllocHost,  cuMemAllocPitch,  cuMemcpy2D,   cuMemcpy2DAsync,   cuMemcpy2DUnaligned,
           cuMemcpy3D,  cuMemcpy3DAsync,  cuMemcpyAtoA,  cuMemcpyAtoD, cuMemcpyAtoH, cuMemcpyDtoA, cuMemcpyDtoD,
           cuMemcpyDtoDAsync, cuMemcpyDtoH, cuMemcpyDtoHAsync,  cuMemcpyHtoA,  cuMemcpyHtoAAsync,  cuMemcpyHtoD,
           cuMemcpyHtoDAsync,  cuMemFree,  cuMemFreeHost,  cuMemGetAddressRange,  cuMemGetInfo,  cuMemHostAlloc,
           cuMemHostGetDevicePointer,  cuMemsetD2D8,   cuMemsetD2D8Async,   cuMemsetD2D16,   cuMemsetD2D16Async,
           cuMemsetD2D32,   cuMemsetD2D32Async,   cuMemsetD8,  cuMemsetD8Async,  cuMemsetD16,  cuMemsetD16Async,
           cuMemsetD32, cuMemsetD32Async, cudaMemcpyFromArrayAsync

   CUresult cuMemcpyDtoA (CUarray dstArray, size_t dstOffset, CUdeviceptr srcDevice, size_t ByteCount)
       Copies from device memory to a 1D CUDA array. dstArray and dstOffset specify the CUDA array handle and
       starting offset in bytes of the destination data. srcDevice specifies the base pointer of the source.
       ByteCount specifies the number of bytes to copy.

       Parameters:
           dstArray - Destination array
           dstOffset - Offset in bytes of destination array
           srcDevice - Source device pointer
           ByteCount - Size of memory copy in bytes

       Returns:
           CUDA_SUCCESS,   CUDA_ERROR_DEINITIALIZED,   CUDA_ERROR_NOT_INITIALIZED,   CUDA_ERROR_INVALID_CONTEXT,
           CUDA_ERROR_INVALID_VALUE

       Note:
           Note that this function may also return error codes from previous, asynchronous launches.

           This function exhibits synchronous behavior for most use cases.

       See also:
           cuArray3DCreate,   cuArray3DGetDescriptor,   cuArrayCreate,   cuArrayDestroy,   cuArrayGetDescriptor,
           cuMemAlloc,  cuMemAllocHost,  cuMemAllocPitch,  cuMemcpy2D,   cuMemcpy2DAsync,   cuMemcpy2DUnaligned,
           cuMemcpy3D,    cuMemcpy3DAsync,    cuMemcpyAtoA,   cuMemcpyAtoD,   cuMemcpyAtoH,   cuMemcpyAtoHAsync,
           cuMemcpyDtoD, cuMemcpyDtoDAsync, cuMemcpyDtoH,  cuMemcpyDtoHAsync,  cuMemcpyHtoA,  cuMemcpyHtoAAsync,
           cuMemcpyHtoD,   cuMemcpyHtoDAsync,   cuMemFree,  cuMemFreeHost,  cuMemGetAddressRange,  cuMemGetInfo,
           cuMemHostAlloc, cuMemHostGetDevicePointer, cuMemsetD2D8,  cuMemsetD2D16,  cuMemsetD2D32,  cuMemsetD8,
           cuMemsetD16, cuMemsetD32, cudaMemcpyToArray

   CUresult cuMemcpyDtoD (CUdeviceptr dstDevice, CUdeviceptr srcDevice, size_t ByteCount)
       Copies  from  device  memory  to  device  memory.  dstDevice  and  srcDevice are the base pointers of the
       destination and source, respectively. ByteCount specifies the number of bytes to copy.

       Parameters:
           dstDevice - Destination device pointer
           srcDevice - Source device pointer
           ByteCount - Size of memory copy in bytes

       Returns:
           CUDA_SUCCESS,   CUDA_ERROR_DEINITIALIZED,   CUDA_ERROR_NOT_INITIALIZED,   CUDA_ERROR_INVALID_CONTEXT,
           CUDA_ERROR_INVALID_VALUE

       Note:
           Note that this function may also return error codes from previous, asynchronous launches.

           This function exhibits synchronous behavior for most use cases.

       See also:
           cuArray3DCreate,   cuArray3DGetDescriptor,   cuArrayCreate,   cuArrayDestroy,   cuArrayGetDescriptor,
           cuMemAlloc,  cuMemAllocHost,  cuMemAllocPitch,  cuMemcpy2D,   cuMemcpy2DAsync,   cuMemcpy2DUnaligned,
           cuMemcpy3D,    cuMemcpy3DAsync,    cuMemcpyAtoA,   cuMemcpyAtoD,   cuMemcpyAtoH,   cuMemcpyAtoHAsync,
           cuMemcpyDtoA,  cuMemcpyDtoH,  cuMemcpyDtoHAsync,   cuMemcpyHtoA,   cuMemcpyHtoAAsync,   cuMemcpyHtoD,
           cuMemcpyHtoDAsync,  cuMemFree,  cuMemFreeHost,  cuMemGetAddressRange,  cuMemGetInfo,  cuMemHostAlloc,
           cuMemHostGetDevicePointer,  cuMemsetD2D8,  cuMemsetD2D16,  cuMemsetD2D32,  cuMemsetD8,   cuMemsetD16,
           cuMemsetD32, cudaMemcpy, cudaMemcpyToSymbol, cudaMemcpyFromSymbol

   CUresult cuMemcpyDtoDAsync (CUdeviceptr dstDevice, CUdeviceptr srcDevice, size_t ByteCount, CUstream hStream)
       Copies  from  device  memory  to  device  memory.  dstDevice  and  srcDevice are the base pointers of the
       destination and source, respectively. ByteCount specifies the number of bytes to copy.

       Parameters:
           dstDevice - Destination device pointer
           srcDevice - Source device pointer
           ByteCount - Size of memory copy in bytes
           hStream - Stream identifier

       Returns:
           CUDA_SUCCESS,   CUDA_ERROR_DEINITIALIZED,   CUDA_ERROR_NOT_INITIALIZED,   CUDA_ERROR_INVALID_CONTEXT,
           CUDA_ERROR_INVALID_VALUE

       Note:
           Note that this function may also return error codes from previous, asynchronous launches.

           This function exhibits asynchronous behavior for most use cases.

           This function uses standard default stream semantics.

       See also:
           cuArray3DCreate,   cuArray3DGetDescriptor,   cuArrayCreate,   cuArrayDestroy,   cuArrayGetDescriptor,
           cuMemAlloc,  cuMemAllocHost,  cuMemAllocPitch,  cuMemcpy2D,   cuMemcpy2DAsync,   cuMemcpy2DUnaligned,
           cuMemcpy3D,    cuMemcpy3DAsync,    cuMemcpyAtoA,   cuMemcpyAtoD,   cuMemcpyAtoH,   cuMemcpyAtoHAsync,
           cuMemcpyDtoA,  cuMemcpyDtoD,  cuMemcpyDtoH,   cuMemcpyDtoHAsync,   cuMemcpyHtoA,   cuMemcpyHtoAAsync,
           cuMemcpyHtoD,   cuMemcpyHtoDAsync,   cuMemFree,  cuMemFreeHost,  cuMemGetAddressRange,  cuMemGetInfo,
           cuMemHostAlloc,   cuMemHostGetDevicePointer,    cuMemsetD2D8,    cuMemsetD2D8Async,    cuMemsetD2D16,
           cuMemsetD2D16Async,  cuMemsetD2D32,  cuMemsetD2D32Async,  cuMemsetD8,  cuMemsetD8Async,  cuMemsetD16,
           cuMemsetD16Async,   cuMemsetD32,    cuMemsetD32Async,    cudaMemcpyAsync,    cudaMemcpyToSymbolAsync,
           cudaMemcpyFromSymbolAsync

   CUresult cuMemcpyDtoH (void * dstHost, CUdeviceptr srcDevice, size_t ByteCount)
       Copies from device to host memory. dstHost and srcDevice specify the base pointers of the destination and
       source, respectively. ByteCount specifies the number of bytes to copy.

       Parameters:
           dstHost - Destination host pointer
           srcDevice - Source device pointer
           ByteCount - Size of memory copy in bytes

       Returns:
           CUDA_SUCCESS,   CUDA_ERROR_DEINITIALIZED,   CUDA_ERROR_NOT_INITIALIZED,   CUDA_ERROR_INVALID_CONTEXT,
           CUDA_ERROR_INVALID_VALUE

       Note:
           Note that this function may also return error codes from previous, asynchronous launches.

           This function exhibits synchronous behavior for most use cases.

       See also:
           cuArray3DCreate,   cuArray3DGetDescriptor,   cuArrayCreate,   cuArrayDestroy,   cuArrayGetDescriptor,
           cuMemAlloc,   cuMemAllocHost,   cuMemAllocPitch,  cuMemcpy2D,  cuMemcpy2DAsync,  cuMemcpy2DUnaligned,
           cuMemcpy3D,   cuMemcpy3DAsync,   cuMemcpyAtoA,   cuMemcpyAtoD,    cuMemcpyAtoH,    cuMemcpyAtoHAsync,
           cuMemcpyDtoA,  cuMemcpyDtoD,  cuMemcpyDtoDAsync,  cuMemcpyDtoHAsync, cuMemcpyHtoA, cuMemcpyHtoAAsync,
           cuMemcpyHtoD,  cuMemcpyHtoDAsync,  cuMemFree,  cuMemFreeHost,   cuMemGetAddressRange,   cuMemGetInfo,
           cuMemHostAlloc,  cuMemHostGetDevicePointer,  cuMemsetD2D8,  cuMemsetD2D16, cuMemsetD2D32, cuMemsetD8,
           cuMemsetD16, cuMemsetD32, cudaMemcpy, cudaMemcpyFromSymbol

   CUresult cuMemcpyDtoHAsync (void * dstHost, CUdeviceptr srcDevice, size_t ByteCount, CUstream hStream)
       Copies from device to host memory. dstHost and srcDevice specify the base pointers of the destination and
       source, respectively. ByteCount specifies the number of bytes to copy.

       Parameters:
           dstHost - Destination host pointer
           srcDevice - Source device pointer
           ByteCount - Size of memory copy in bytes
           hStream - Stream identifier

       Returns:
           CUDA_SUCCESS,   CUDA_ERROR_DEINITIALIZED,   CUDA_ERROR_NOT_INITIALIZED,   CUDA_ERROR_INVALID_CONTEXT,
           CUDA_ERROR_INVALID_VALUE

       Note:
           Note that this function may also return error codes from previous, asynchronous launches.

           This function exhibits asynchronous behavior for most use cases.

           This function uses standard default stream semantics.

       See also:
           cuArray3DCreate,   cuArray3DGetDescriptor,   cuArrayCreate,   cuArrayDestroy,   cuArrayGetDescriptor,
           cuMemAlloc,  cuMemAllocHost,  cuMemAllocPitch,  cuMemcpy2D,   cuMemcpy2DAsync,   cuMemcpy2DUnaligned,
           cuMemcpy3D,    cuMemcpy3DAsync,    cuMemcpyAtoA,   cuMemcpyAtoD,   cuMemcpyAtoH,   cuMemcpyAtoHAsync,
           cuMemcpyDtoA,  cuMemcpyDtoD,  cuMemcpyDtoDAsync,   cuMemcpyDtoH,   cuMemcpyHtoA,   cuMemcpyHtoAAsync,
           cuMemcpyHtoD,   cuMemcpyHtoDAsync,   cuMemFree,  cuMemFreeHost,  cuMemGetAddressRange,  cuMemGetInfo,
           cuMemHostAlloc,   cuMemHostGetDevicePointer,    cuMemsetD2D8,    cuMemsetD2D8Async,    cuMemsetD2D16,
           cuMemsetD2D16Async,  cuMemsetD2D32,  cuMemsetD2D32Async,  cuMemsetD8,  cuMemsetD8Async,  cuMemsetD16,
           cuMemsetD16Async, cuMemsetD32, cuMemsetD32Async, cudaMemcpyAsync, cudaMemcpyFromSymbolAsync

   CUresult cuMemcpyHtoA (CUarray dstArray, size_t dstOffset, const void * srcHost, size_t ByteCount)
       Copies from host memory to a 1D CUDA array. dstArray and dstOffset specify the CUDA array handle and
       starting offset in bytes of the destination data. srcHost specifies the base address of the source.
       ByteCount specifies the number of bytes to copy.

       Parameters:
           dstArray - Destination array
           dstOffset - Offset in bytes of destination array
           srcHost - Source host pointer
           ByteCount - Size of memory copy in bytes

       Returns:
           CUDA_SUCCESS,   CUDA_ERROR_DEINITIALIZED,   CUDA_ERROR_NOT_INITIALIZED,   CUDA_ERROR_INVALID_CONTEXT,
           CUDA_ERROR_INVALID_VALUE

       Note:
           Note that this function may also return error codes from previous, asynchronous launches.

           This function exhibits synchronous behavior for most use cases.

       See also:
           cuArray3DCreate,   cuArray3DGetDescriptor,   cuArrayCreate,   cuArrayDestroy,   cuArrayGetDescriptor,
           cuMemAlloc,  cuMemAllocHost,  cuMemAllocPitch,  cuMemcpy2D,   cuMemcpy2DAsync,   cuMemcpy2DUnaligned,
           cuMemcpy3D,    cuMemcpy3DAsync,    cuMemcpyAtoA,   cuMemcpyAtoD,   cuMemcpyAtoH,   cuMemcpyAtoHAsync,
           cuMemcpyDtoA, cuMemcpyDtoD, cuMemcpyDtoDAsync,  cuMemcpyDtoH,  cuMemcpyDtoHAsync,  cuMemcpyHtoAAsync,
           cuMemcpyHtoD,   cuMemcpyHtoDAsync,   cuMemFree,  cuMemFreeHost,  cuMemGetAddressRange,  cuMemGetInfo,
           cuMemHostAlloc, cuMemHostGetDevicePointer, cuMemsetD2D8,  cuMemsetD2D16,  cuMemsetD2D32,  cuMemsetD8,
           cuMemsetD16, cuMemsetD32, cudaMemcpyToArray

   CUresult  cuMemcpyHtoAAsync  (CUarray  dstArray,  size_t  dstOffset,  const void * srcHost, size_t ByteCount,
       CUstream hStream)
       Copies from host memory to a 1D CUDA array. dstArray and dstOffset specify  the  CUDA  array  handle  and
       starting  offset  in  bytes  of  the  destination data. srcHost specifies the base address of the source.
       ByteCount specifies the number of bytes to copy.

       Parameters:
           dstArray - Destination array
           dstOffset - Offset in bytes of destination array
           srcHost - Source host pointer
           ByteCount - Size of memory copy in bytes
           hStream - Stream identifier

       Returns:
           CUDA_SUCCESS,   CUDA_ERROR_DEINITIALIZED,   CUDA_ERROR_NOT_INITIALIZED,   CUDA_ERROR_INVALID_CONTEXT,
           CUDA_ERROR_INVALID_VALUE

       Note:
           Note that this function may also return error codes from previous, asynchronous launches.

           This function exhibits asynchronous behavior for most use cases.

           This function uses standard default stream semantics.

       See also:
           cuArray3DCreate,   cuArray3DGetDescriptor,   cuArrayCreate,   cuArrayDestroy,   cuArrayGetDescriptor,
           cuMemAlloc,  cuMemAllocHost,  cuMemAllocPitch,  cuMemcpy2D,   cuMemcpy2DAsync,   cuMemcpy2DUnaligned,
           cuMemcpy3D,    cuMemcpy3DAsync,    cuMemcpyAtoA,   cuMemcpyAtoD,   cuMemcpyAtoH,   cuMemcpyAtoHAsync,
           cuMemcpyDtoA,  cuMemcpyDtoD,  cuMemcpyDtoDAsync,   cuMemcpyDtoH,   cuMemcpyDtoHAsync,   cuMemcpyHtoA,
           cuMemcpyHtoD,   cuMemcpyHtoDAsync,   cuMemFree,  cuMemFreeHost,  cuMemGetAddressRange,  cuMemGetInfo,
           cuMemHostAlloc,   cuMemHostGetDevicePointer,    cuMemsetD2D8,    cuMemsetD2D8Async,    cuMemsetD2D16,
           cuMemsetD2D16Async,  cuMemsetD2D32,  cuMemsetD2D32Async,  cuMemsetD8,  cuMemsetD8Async,  cuMemsetD16,
           cuMemsetD16Async, cuMemsetD32, cuMemsetD32Async, cudaMemcpyToArrayAsync

   CUresult cuMemcpyHtoD (CUdeviceptr dstDevice, const void * srcHost, size_t ByteCount)
       Copies from host memory to device memory. dstDevice and srcHost are the base addresses of the destination
       and source, respectively. ByteCount specifies the number of bytes to copy.

       Parameters:
           dstDevice - Destination device pointer
           srcHost - Source host pointer
           ByteCount - Size of memory copy in bytes

       Returns:
           CUDA_SUCCESS,   CUDA_ERROR_DEINITIALIZED,   CUDA_ERROR_NOT_INITIALIZED,   CUDA_ERROR_INVALID_CONTEXT,
           CUDA_ERROR_INVALID_VALUE

       Note:
           Note that this function may also return error codes from previous, asynchronous launches.

           This function exhibits synchronous behavior for most use cases.

       See also:
           cuArray3DCreate,   cuArray3DGetDescriptor,   cuArrayCreate,   cuArrayDestroy,   cuArrayGetDescriptor,
           cuMemAlloc,  cuMemAllocHost,  cuMemAllocPitch,  cuMemcpy2D,   cuMemcpy2DAsync,   cuMemcpy2DUnaligned,
           cuMemcpy3D,    cuMemcpy3DAsync,    cuMemcpyAtoA,   cuMemcpyAtoD,   cuMemcpyAtoH,   cuMemcpyAtoHAsync,
           cuMemcpyDtoA,  cuMemcpyDtoD,  cuMemcpyDtoDAsync,   cuMemcpyDtoH,   cuMemcpyDtoHAsync,   cuMemcpyHtoA,
           cuMemcpyHtoAAsync,  cuMemcpyHtoDAsync,  cuMemFree, cuMemFreeHost, cuMemGetAddressRange, cuMemGetInfo,
           cuMemHostAlloc, cuMemHostGetDevicePointer, cuMemsetD2D8,  cuMemsetD2D16,  cuMemsetD2D32,  cuMemsetD8,
           cuMemsetD16, cuMemsetD32, cudaMemcpy, cudaMemcpyToSymbol

   CUresult cuMemcpyHtoDAsync (CUdeviceptr dstDevice, const void * srcHost, size_t ByteCount, CUstream hStream)
       Copies from host memory to device memory. dstDevice and srcHost are the base addresses of the destination
       and source, respectively. ByteCount specifies the number of bytes to copy.

       Parameters:
           dstDevice - Destination device pointer
           srcHost - Source host pointer
           ByteCount - Size of memory copy in bytes
           hStream - Stream identifier

       Returns:
           CUDA_SUCCESS,   CUDA_ERROR_DEINITIALIZED,   CUDA_ERROR_NOT_INITIALIZED,   CUDA_ERROR_INVALID_CONTEXT,
           CUDA_ERROR_INVALID_VALUE

       Note:
           Note that this function may also return error codes from previous, asynchronous launches.

           This function exhibits asynchronous behavior for most use cases.

           This function uses standard default stream semantics.

       See also:
           cuArray3DCreate,   cuArray3DGetDescriptor,   cuArrayCreate,   cuArrayDestroy,   cuArrayGetDescriptor,
           cuMemAlloc,   cuMemAllocHost,   cuMemAllocPitch,  cuMemcpy2D,  cuMemcpy2DAsync,  cuMemcpy2DUnaligned,
           cuMemcpy3D,   cuMemcpy3DAsync,   cuMemcpyAtoA,   cuMemcpyAtoD,    cuMemcpyAtoH,    cuMemcpyAtoHAsync,
           cuMemcpyDtoA,   cuMemcpyDtoD,   cuMemcpyDtoDAsync,   cuMemcpyDtoH,  cuMemcpyDtoHAsync,  cuMemcpyHtoA,
           cuMemcpyHtoAAsync,  cuMemcpyHtoD,  cuMemFree,  cuMemFreeHost,   cuMemGetAddressRange,   cuMemGetInfo,
           cuMemHostAlloc,    cuMemHostGetDevicePointer,    cuMemsetD2D8,    cuMemsetD2D8Async,   cuMemsetD2D16,
           cuMemsetD2D16Async,  cuMemsetD2D32,  cuMemsetD2D32Async,  cuMemsetD8,  cuMemsetD8Async,  cuMemsetD16,
           cuMemsetD16Async, cuMemsetD32, cuMemsetD32Async, cudaMemcpyAsync, cudaMemcpyToSymbolAsync

   CUresult   cuMemcpyPeer  (CUdeviceptr  dstDevice,  CUcontext  dstContext,  CUdeviceptr  srcDevice,  CUcontext
       srcContext, size_t ByteCount)
       Copies from device memory in one context to device memory in another context. dstDevice is the base
       device pointer of the destination memory and dstContext is the destination context. srcDevice is the base
       device pointer of the source memory and srcContext is the source context. ByteCount specifies the number
       of bytes to copy.

       Parameters:
           dstDevice - Destination device pointer
           dstContext - Destination context
           srcDevice - Source device pointer
           srcContext - Source context
           ByteCount - Size of memory copy in bytes

       Returns:
           CUDA_SUCCESS,   CUDA_ERROR_DEINITIALIZED,   CUDA_ERROR_NOT_INITIALIZED,   CUDA_ERROR_INVALID_CONTEXT,
           CUDA_ERROR_INVALID_VALUE

       Note:
           Note that this function may also return error codes from previous, asynchronous launches.

           This function exhibits synchronous behavior for most use cases.

       See also:
           cuMemcpyDtoD,     cuMemcpy3DPeer,    cuMemcpyDtoDAsync,    cuMemcpyPeerAsync,    cuMemcpy3DPeerAsync,
           cudaMemcpyPeer

   CUresult cuMemcpyPeerAsync (CUdeviceptr dstDevice, CUcontext  dstContext,  CUdeviceptr  srcDevice,  CUcontext
       srcContext, size_t ByteCount, CUstream hStream)
       Copies from device memory in one context to device memory in another context. dstDevice is the base
       device pointer of the destination memory and dstContext is the destination context. srcDevice is the base
       device pointer of the source memory and srcContext is the source context. ByteCount specifies the number
       of bytes to copy.

       Parameters:
           dstDevice - Destination device pointer
           dstContext - Destination context
           srcDevice - Source device pointer
           srcContext - Source context
           ByteCount - Size of memory copy in bytes
           hStream - Stream identifier

       Returns:
           CUDA_SUCCESS,   CUDA_ERROR_DEINITIALIZED,   CUDA_ERROR_NOT_INITIALIZED,   CUDA_ERROR_INVALID_CONTEXT,
           CUDA_ERROR_INVALID_VALUE

       Note:
           Note that this function may also return error codes from previous, asynchronous launches.

           This function exhibits asynchronous behavior for most use cases.

           This function uses standard default stream semantics.

       See also:
           cuMemcpyDtoD,     cuMemcpyPeer,     cuMemcpy3DPeer,      cuMemcpyDtoDAsync,      cuMemcpy3DPeerAsync,
           cudaMemcpyPeerAsync

   CUresult cuMemFree (CUdeviceptr dptr)
       Frees  the  memory  space  pointed  to  by  dptr,  which  must  have  been returned by a previous call to
       cuMemAlloc() or cuMemAllocPitch().

       Parameters:
           dptr - Pointer to memory to free

       Returns:
           CUDA_SUCCESS,   CUDA_ERROR_DEINITIALIZED,   CUDA_ERROR_NOT_INITIALIZED,   CUDA_ERROR_INVALID_CONTEXT,
           CUDA_ERROR_INVALID_VALUE

       Note:
           Note that this function may also return error codes from previous, asynchronous launches.

       See also:
           cuArray3DCreate,   cuArray3DGetDescriptor,   cuArrayCreate,   cuArrayDestroy,   cuArrayGetDescriptor,
           cuMemAlloc,  cuMemAllocHost,  cuMemAllocPitch,  cuMemcpy2D,   cuMemcpy2DAsync,   cuMemcpy2DUnaligned,
           cuMemcpy3D,    cuMemcpy3DAsync,    cuMemcpyAtoA,   cuMemcpyAtoD,   cuMemcpyAtoH,   cuMemcpyAtoHAsync,
           cuMemcpyDtoA,  cuMemcpyDtoD,  cuMemcpyDtoDAsync,   cuMemcpyDtoH,   cuMemcpyDtoHAsync,   cuMemcpyHtoA,
           cuMemcpyHtoAAsync,     cuMemcpyHtoD,    cuMemcpyHtoDAsync,    cuMemFreeHost,    cuMemGetAddressRange,
           cuMemGetInfo, cuMemHostAlloc, cuMemHostGetDevicePointer, cuMemsetD2D8, cuMemsetD2D16,  cuMemsetD2D32,
           cuMemsetD8, cuMemsetD16, cuMemsetD32, cudaFree

   CUresult cuMemFreeHost (void * p)
       Frees  the  memory  space  pointed  to  by  p,  which  must  have  been  returned  by  a previous call to
       cuMemAllocHost().

       Parameters:
           p - Pointer to memory to free

       Returns:
           CUDA_SUCCESS,   CUDA_ERROR_DEINITIALIZED,   CUDA_ERROR_NOT_INITIALIZED,   CUDA_ERROR_INVALID_CONTEXT,
           CUDA_ERROR_INVALID_VALUE

       Note:
           Note that this function may also return error codes from previous, asynchronous launches.

       See also:
           cuArray3DCreate,   cuArray3DGetDescriptor,   cuArrayCreate,   cuArrayDestroy,   cuArrayGetDescriptor,
           cuMemAlloc,  cuMemAllocHost,  cuMemAllocPitch,  cuMemcpy2D,   cuMemcpy2DAsync,   cuMemcpy2DUnaligned,
           cuMemcpy3D,    cuMemcpy3DAsync,    cuMemcpyAtoA,   cuMemcpyAtoD,   cuMemcpyAtoH,   cuMemcpyAtoHAsync,
           cuMemcpyDtoA,  cuMemcpyDtoD,  cuMemcpyDtoDAsync,   cuMemcpyDtoH,   cuMemcpyDtoHAsync,   cuMemcpyHtoA,
           cuMemcpyHtoAAsync,  cuMemcpyHtoD,  cuMemcpyHtoDAsync,  cuMemFree, cuMemGetAddressRange, cuMemGetInfo,
           cuMemHostAlloc, cuMemHostGetDevicePointer, cuMemsetD2D8,  cuMemsetD2D16,  cuMemsetD2D32,  cuMemsetD8,
           cuMemsetD16, cuMemsetD32, cudaFreeHost

   CUresult cuMemGetAddressRange (CUdeviceptr * pbase, size_t * psize, CUdeviceptr dptr)
       Returns  the  base  address  in  *pbase  and  size  in  *psize  of  the  allocation  by  cuMemAlloc()  or
       cuMemAllocPitch() that contains the input pointer dptr. Both parameters pbase and psize are optional.  If
       one of them is NULL, it is ignored.

       Parameters:
           pbase - Returned base address
           psize - Returned size of device memory allocation
           dptr - Device pointer to query

       Returns:
           CUDA_SUCCESS,   CUDA_ERROR_DEINITIALIZED,   CUDA_ERROR_NOT_INITIALIZED,   CUDA_ERROR_INVALID_CONTEXT,
           CUDA_ERROR_NOT_FOUND, CUDA_ERROR_INVALID_VALUE

       Note:
           Note that this function may also return error codes from previous, asynchronous launches.

       See also:
           cuArray3DCreate,   cuArray3DGetDescriptor,   cuArrayCreate,   cuArrayDestroy,   cuArrayGetDescriptor,
           cuMemAlloc,   cuMemAllocHost,   cuMemAllocPitch,  cuMemcpy2D,  cuMemcpy2DAsync,  cuMemcpy2DUnaligned,
           cuMemcpy3D,   cuMemcpy3DAsync,   cuMemcpyAtoA,   cuMemcpyAtoD,    cuMemcpyAtoH,    cuMemcpyAtoHAsync,
           cuMemcpyDtoA,   cuMemcpyDtoD,   cuMemcpyDtoDAsync,   cuMemcpyDtoH,  cuMemcpyDtoHAsync,  cuMemcpyHtoA,
           cuMemcpyHtoAAsync,   cuMemcpyHtoD,   cuMemcpyHtoDAsync,   cuMemFree,   cuMemFreeHost,   cuMemGetInfo,
           cuMemHostAlloc,  cuMemHostGetDevicePointer,  cuMemsetD2D8,  cuMemsetD2D16, cuMemsetD2D32, cuMemsetD8,
           cuMemsetD16, cuMemsetD32
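
       Example:
           The returned base address and size can be used to work out where dptr sits inside its allocation.
           The sketch below computes how many bytes remain between dptr and the end of the allocation. It
           assumes the caller passes in the *pbase and *psize values obtained from cuMemGetAddressRange(). The
           helper bytes_remaining and the type device_addr_t are hypothetical; device_addr_t stands in for
           CUdeviceptr so that the sketch compiles without cuda.h.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for CUdeviceptr (an assumption made so the sketch builds
 * without cuda.h; real code would include cuda.h and use CUdeviceptr). */
typedef unsigned long long device_addr_t;

/* Hypothetical helper, not part of the CUDA driver API.
 * base and size are the values reported for the allocation that
 * contains dptr. Returns the number of bytes from dptr to the end
 * of that allocation. */
size_t bytes_remaining(device_addr_t base, size_t size, device_addr_t dptr)
{
    /* dptr must lie inside [base, base + size). */
    assert(dptr >= base && dptr - base < size);
    return size - (size_t)(dptr - base);
}
```

           A copy that starts at dptr stays inside the allocation as long as its ByteCount does not exceed the
           value returned by this helper.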

   CUresult cuMemGetInfo (size_t * free, size_t * total)
       Returns in *free and *total respectively, the free and total amount of memory available for allocation by
       the CUDA context, in bytes.

       Parameters:
           free - Returned free memory in bytes
           total - Returned total memory in bytes

       Returns:
           CUDA_SUCCESS,   CUDA_ERROR_DEINITIALIZED,   CUDA_ERROR_NOT_INITIALIZED,   CUDA_ERROR_INVALID_CONTEXT,
           CUDA_ERROR_INVALID_VALUE

       Note:
           Note that this function may also return error codes from previous, asynchronous launches.

       See also:
           cuArray3DCreate,   cuArray3DGetDescriptor,   cuArrayCreate,   cuArrayDestroy,   cuArrayGetDescriptor,
           cuMemAlloc,  cuMemAllocHost,  cuMemAllocPitch,  cuMemcpy2D,   cuMemcpy2DAsync,   cuMemcpy2DUnaligned,
           cuMemcpy3D,    cuMemcpy3DAsync,    cuMemcpyAtoA,   cuMemcpyAtoD,   cuMemcpyAtoH,   cuMemcpyAtoHAsync,
           cuMemcpyDtoA,  cuMemcpyDtoD,  cuMemcpyDtoDAsync,   cuMemcpyDtoH,   cuMemcpyDtoHAsync,   cuMemcpyHtoA,
           cuMemcpyHtoAAsync,  cuMemcpyHtoD,  cuMemcpyHtoDAsync, cuMemFree, cuMemFreeHost, cuMemGetAddressRange,
           cuMemHostAlloc, cuMemHostGetDevicePointer, cuMemsetD2D8,  cuMemsetD2D16,  cuMemsetD2D32,  cuMemsetD8,
           cuMemsetD16, cuMemsetD32, cudaMemGetInfo
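
       Example:
           The two values can be used for a rough pre-allocation check. The sketch below decides whether a
           request fits into the reported free memory while leaving a caller-chosen reserve untouched. It
           assumes the inputs are the *free and *total values from a recent cuMemGetInfo() call. The helper
           fits_in_free_memory and the reserve parameter are hypothetical. The result is advisory only,
           because the amount of free memory can change between the query and the allocation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of the CUDA driver API.
 * free_bytes and total_bytes are the values reported by cuMemGetInfo().
 * Returns 1 if request_bytes fits into the free memory while keeping
 * reserve_bytes unallocated, 0 otherwise. */
int fits_in_free_memory(size_t free_bytes, size_t total_bytes,
                        size_t request_bytes, size_t reserve_bytes)
{
    assert(free_bytes <= total_bytes);
    if (reserve_bytes > free_bytes)
        return 0;                      /* reserve alone exceeds free memory */
    return request_bytes <= free_bytes - reserve_bytes;
}
```

           A caller would still check the CUresult returned by cuMemAlloc(), since a request that passes this
           check may fail anyway.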

   CUresult cuMemHostAlloc (void ** pp, size_t bytesize, unsigned int Flags)
       Allocates  bytesize  bytes  of  host  memory that is page-locked and accessible to the device. The driver
       tracks the virtual memory ranges allocated with this function  and  automatically  accelerates  calls  to
       functions such as cuMemcpyHtoD(). Since the memory can be accessed directly by the device, it can be read
       or  written  with  much  higher  bandwidth than pageable memory obtained with functions such as malloc().
       Allocating excessive amounts of pinned memory may degrade system performance, since it reduces the amount
       of memory available to the system for paging. As a result,  this  function  is  best  used  sparingly  to
       allocate staging areas for data exchange between host and device.

       The Flags parameter enables different options to be specified that affect the allocation, as follows.

       • CU_MEMHOSTALLOC_PORTABLE:  The  memory returned by this call will be considered as pinned memory by all
         CUDA contexts, not just the one that performed the allocation.

       • CU_MEMHOSTALLOC_DEVICEMAP: Maps the allocation into the CUDA address space. The device pointer  to  the
         memory may be obtained by calling cuMemHostGetDevicePointer().

       • CU_MEMHOSTALLOC_WRITECOMBINED:   Allocates  the  memory  as  write-combined  (WC).  WC  memory  can  be
         transferred across the PCI Express bus more quickly on some system configurations, but cannot  be  read
         efficiently  by  most  CPUs. WC memory is a good option for buffers that will be written by the CPU and
         read by the GPU via mapped pinned memory or host->device transfers.

       All of these flags are orthogonal to one another: a developer  may  allocate  memory  that  is  portable,
       mapped and/or write-combined with no restrictions.

       The   CUDA   context   must   have   been  created  with  the  CU_CTX_MAP_HOST  flag  in  order  for  the
       CU_MEMHOSTALLOC_DEVICEMAP flag to have any effect.

       The CU_MEMHOSTALLOC_DEVICEMAP flag may be specified on CUDA contexts for  devices  that  do  not  support
       mapped  pinned  memory.  The failure is deferred to cuMemHostGetDevicePointer() because the memory may be
       mapped into other CUDA contexts via the CU_MEMHOSTALLOC_PORTABLE flag.

       The memory allocated by this function must be freed with cuMemFreeHost().

       Note all host memory allocated using cuMemHostAlloc() will automatically be immediately accessible to all
       contexts   on   all   devices   which   support   unified   addressing   (as   may   be   queried   using
       CU_DEVICE_ATTRIBUTE_UNIFIED_ADDRESSING).  Unless the flag CU_MEMHOSTALLOC_WRITECOMBINED is specified, the
       device pointer that may be used to access this host memory from those contexts is  always  equal  to  the
       returned  host  pointer  *pp.  If  the flag CU_MEMHOSTALLOC_WRITECOMBINED is specified, then the function
       cuMemHostGetDevicePointer() must be used to query the  device  pointer,  even  if  the  context  supports
       unified addressing. See Unified Addressing for additional details.

       Parameters:
           pp - Returned host pointer to page-locked memory
           bytesize - Requested allocation size in bytes
           Flags - Flags for allocation request

       Returns:
           CUDA_SUCCESS,   CUDA_ERROR_DEINITIALIZED,   CUDA_ERROR_NOT_INITIALIZED,   CUDA_ERROR_INVALID_CONTEXT,
           CUDA_ERROR_INVALID_VALUE, CUDA_ERROR_OUT_OF_MEMORY

       Note:
           Note that this function may also return error codes from previous, asynchronous launches.

       See also:
           cuArray3DCreate,   cuArray3DGetDescriptor,   cuArrayCreate,   cuArrayDestroy,   cuArrayGetDescriptor,
           cuMemAlloc,   cuMemAllocHost,   cuMemAllocPitch,  cuMemcpy2D,  cuMemcpy2DAsync,  cuMemcpy2DUnaligned,
           cuMemcpy3D,   cuMemcpy3DAsync,   cuMemcpyAtoA,   cuMemcpyAtoD,    cuMemcpyAtoH,    cuMemcpyAtoHAsync,
           cuMemcpyDtoA,   cuMemcpyDtoD,   cuMemcpyDtoDAsync,   cuMemcpyDtoH,  cuMemcpyDtoHAsync,  cuMemcpyHtoA,
           cuMemcpyHtoAAsync, cuMemcpyHtoD, cuMemcpyHtoDAsync, cuMemFree,  cuMemFreeHost,  cuMemGetAddressRange,
           cuMemGetInfo,  cuMemHostGetDevicePointer,  cuMemsetD2D8,  cuMemsetD2D16,  cuMemsetD2D32,  cuMemsetD8,
           cuMemsetD16, cuMemsetD32, cudaHostAlloc

   CUresult cuMemHostGetDevicePointer (CUdeviceptr * pdptr, void * p, unsigned int Flags)
       Passes back in *pdptr the device pointer corresponding to the mapped, pinned host buffer p allocated by
       cuMemHostAlloc().

       cuMemHostGetDevicePointer() will fail if the CU_MEMHOSTALLOC_DEVICEMAP flag was not specified at the time
       the  memory  was  allocated,  or  if  the function is called on a GPU that does not support mapped pinned
       memory.

       For     devices     that     have     a     non-zero     value     for     the      device      attribute
       CU_DEVICE_ATTRIBUTE_CAN_USE_HOST_POINTER_FOR_REGISTERED_MEM,  the  memory  can  also be accessed from the
       device using the host pointer p. The device pointer returned by cuMemHostGetDevicePointer()  may  or  may
       not  match  the  original  host  pointer  p and depends on the devices visible to the application. If all
       devices visible to the application have a non-zero value for the device  attribute,  the  device  pointer
       returned  by  cuMemHostGetDevicePointer() will match the original pointer p. If any device visible to the
       application  has  a  zero  value  for  the   device   attribute,   the   device   pointer   returned   by
       cuMemHostGetDevicePointer()  will  not match the original host pointer p, but it will be suitable for use
       on all devices provided Unified Virtual Addressing is enabled. In such systems, it is valid to access the
       memory using either pointer on devices that have a non-zero value for the device attribute. Note,
       however, that such devices should access the memory using only one of the two pointers and not both.

       Flags is reserved for future releases. For now, it must be set to 0.

       Parameters:
           pdptr - Returned device pointer
           p - Host pointer
           Flags - Options (must be 0)

       Returns:
           CUDA_SUCCESS,   CUDA_ERROR_DEINITIALIZED,   CUDA_ERROR_NOT_INITIALIZED,   CUDA_ERROR_INVALID_CONTEXT,
           CUDA_ERROR_INVALID_VALUE

       Note:
           Note that this function may also return error codes from previous, asynchronous launches.

       See also:
           cuArray3DCreate,   cuArray3DGetDescriptor,   cuArrayCreate,   cuArrayDestroy,   cuArrayGetDescriptor,
           cuMemAlloc,   cuMemAllocHost,   cuMemAllocPitch,  cuMemcpy2D,  cuMemcpy2DAsync,  cuMemcpy2DUnaligned,
           cuMemcpy3D,   cuMemcpy3DAsync,   cuMemcpyAtoA,   cuMemcpyAtoD,    cuMemcpyAtoH,    cuMemcpyAtoHAsync,
           cuMemcpyDtoA,   cuMemcpyDtoD,   cuMemcpyDtoDAsync,   cuMemcpyDtoH,  cuMemcpyDtoHAsync,  cuMemcpyHtoA,
           cuMemcpyHtoAAsync, cuMemcpyHtoD, cuMemcpyHtoDAsync, cuMemFree,  cuMemFreeHost,  cuMemGetAddressRange,
           cuMemGetInfo,  cuMemHostAlloc,  cuMemsetD2D8,  cuMemsetD2D16, cuMemsetD2D32, cuMemsetD8, cuMemsetD16,
           cuMemsetD32, cudaHostGetDevicePointer
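
       Example:
           The pointer-matching rule described above reduces to an "all devices" test. The sketch below
           expresses it, assuming the caller has already queried
           CU_DEVICE_ATTRIBUTE_CAN_USE_HOST_POINTER_FOR_REGISTERED_MEM with cuDeviceGetAttribute() for every
           device visible to the application and stored the results in an array. The helper name
           device_pointer_matches_host is hypothetical and is not part of the driver API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of the CUDA driver API.
 * attr[i] holds the value of
 * CU_DEVICE_ATTRIBUTE_CAN_USE_HOST_POINTER_FOR_REGISTERED_MEM for
 * visible device i, as obtained by the caller.
 * Returns 1 if the device pointer is expected to equal the host
 * pointer (every device reports a non-zero value), 0 otherwise. */
int device_pointer_matches_host(const int *attr, size_t num_devices)
{
    assert(attr != NULL && num_devices > 0);
    for (size_t i = 0; i < num_devices; ++i) {
        if (attr[i] == 0)
            return 0;   /* one device without support breaks the match */
    }
    return 1;
}
```

           When the helper returns 0, code should use the pointer passed back by cuMemHostGetDevicePointer()
           rather than assume it equals p.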

   CUresult cuMemHostGetFlags (unsigned int * pFlags, void * p)
       Passes back in *pFlags the flags that were specified when the pinned host buffer p was allocated by
       cuMemHostAlloc().

       cuMemHostGetFlags()   will   fail  if  the  pointer  does  not  reside  in  an  allocation  performed  by
       cuMemAllocHost() or cuMemHostAlloc().

       Parameters:
           pFlags - Returned flags word
           p - Host pointer

       Returns:
           CUDA_SUCCESS,   CUDA_ERROR_DEINITIALIZED,   CUDA_ERROR_NOT_INITIALIZED,   CUDA_ERROR_INVALID_CONTEXT,
           CUDA_ERROR_INVALID_VALUE

       Note:
           Note that this function may also return error codes from previous, asynchronous launches.

       See also:
           cuMemAllocHost, cuMemHostAlloc, cudaHostGetFlags

   CUresult cuMemHostRegister (void * p, size_t bytesize, unsigned int Flags)
       Page-locks  the  memory  range  specified by p and bytesize and maps it for the device(s) as specified by
       Flags. This memory range also is added to the same tracking mechanism as cuMemHostAlloc to  automatically
       accelerate  calls  to  functions such as cuMemcpyHtoD(). Since the memory can be accessed directly by the
       device, it can be read or written with much higher bandwidth than  pageable  memory  that  has  not  been
       registered. Page-locking excessive amounts of memory may degrade system performance, since it reduces the
       amount of memory available to the system for paging. As a result, this function is best used sparingly to
       register staging areas for data exchange between host and device.

       This function has limited support on Mac OS X. OS 10.7 or higher is required.

       The Flags parameter enables different options to be specified that affect the allocation, as follows.

       • CU_MEMHOSTREGISTER_PORTABLE: The memory page-locked by this call will be considered as pinned memory by
         all CUDA contexts, not just the one that performed the registration.

       • CU_MEMHOSTREGISTER_DEVICEMAP: Maps the allocation into the CUDA address space. The  device  pointer  to
         the memory may be obtained by calling cuMemHostGetDevicePointer().

       • CU_MEMHOSTREGISTER_IOMEMORY:  The pointer is treated as pointing to some I/O memory space, e.g. the PCI
         Express resource of a 3rd party device.

       All of these flags are orthogonal to one another: a developer may page-lock memory that  is  portable  or
       mapped with no restrictions.

       The   CUDA   context   must   have   been  created  with  the  CU_CTX_MAP_HOST  flag  in  order  for  the
       CU_MEMHOSTREGISTER_DEVICEMAP flag to have any effect.

       The CU_MEMHOSTREGISTER_DEVICEMAP flag may be specified on CUDA contexts for devices that do  not  support
       mapped  pinned  memory.  The failure is deferred to cuMemHostGetDevicePointer() because the memory may be
       mapped into other CUDA contexts via the CU_MEMHOSTREGISTER_PORTABLE flag.

       For devices that have a non-zero value for the device attribute
       CU_DEVICE_ATTRIBUTE_CAN_USE_HOST_POINTER_FOR_REGISTERED_MEM, the memory can also be accessed from the
       device using the host pointer p. The device pointer returned by cuMemHostGetDevicePointer() may or may
       not match the original host pointer p, depending on the devices visible to the application. If all
       devices visible to the application have a non-zero value for the device attribute, the device pointer
       returned by cuMemHostGetDevicePointer() will match the original pointer p. If any device visible to the
       application has a zero value for the device attribute, the device pointer returned by
       cuMemHostGetDevicePointer() will not match the original host pointer p, but it will be suitable for use
       on all devices provided Unified Virtual Addressing is enabled. In such systems, it is valid to access the
       memory using either pointer on devices that have a non-zero value for the device attribute. Note, however,
       that such devices should access the memory using only one of the two pointers and not both.
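
       The pointer-matching rule above can be sketched as a small host-side model. This is only an
       illustration and does not call the driver: the array attr is a hypothetical stand-in for the value of
       CU_DEVICE_ATTRIBUTE_CAN_USE_HOST_POINTER_FOR_REGISTERED_MEM on each visible device, and both function
       names are invented for this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* attr[i] stands in for the attribute value reported by visible device i
 * (hypothetical input; real code would query each device for it). */

/* True when the device pointer is expected to equal the host pointer p:
 * every visible device must report a non-zero attribute value. */
static bool device_pointer_matches_host(const int *attr, size_t num_devices)
{
    assert(attr != NULL || num_devices == 0);
    for (size_t i = 0; i < num_devices; ++i) {
        if (attr[i] == 0)
            return false;
    }
    return true;
}

/* True when device `dev` may access the registered range through the
 * host pointer p itself. */
static bool device_can_use_host_pointer(const int *attr, size_t dev)
{
    assert(attr != NULL);
    return attr[dev] != 0;
}
```

       With attribute values {1, 0}, the model reports that the two pointers differ, that device 0 may use
       either pointer (but should stick to one of them), and that device 1 must use the pointer returned by
       cuMemHostGetDevicePointer().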

       The memory page-locked by this function must be unregistered with cuMemHostUnregister().

       Parameters:
           p - Host pointer to memory to page-lock
           bytesize - Size in bytes of the address range to page-lock
           Flags - Flags for allocation request

       Returns:
           CUDA_SUCCESS,   CUDA_ERROR_DEINITIALIZED,   CUDA_ERROR_NOT_INITIALIZED,   CUDA_ERROR_INVALID_CONTEXT,
           CUDA_ERROR_INVALID_VALUE,    CUDA_ERROR_OUT_OF_MEMORY,     CUDA_ERROR_HOST_MEMORY_ALREADY_REGISTERED,
           CUDA_ERROR_NOT_PERMITTED, CUDA_ERROR_NOT_SUPPORTED

       Note:
           Note that this function may also return error codes from previous, asynchronous launches.

       See also:
           cuMemHostUnregister, cuMemHostGetFlags, cuMemHostGetDevicePointer, cudaHostRegister

   CUresult cuMemHostUnregister (void * p)
       Unmaps the memory range whose base address is specified by p, and makes it pageable again.

       The base address must be the same one specified to cuMemHostRegister().

       Parameters:
           p - Host pointer to memory to unregister

       Returns:
           CUDA_SUCCESS,   CUDA_ERROR_DEINITIALIZED,   CUDA_ERROR_NOT_INITIALIZED,   CUDA_ERROR_INVALID_CONTEXT,
           CUDA_ERROR_INVALID_VALUE, CUDA_ERROR_OUT_OF_MEMORY, CUDA_ERROR_HOST_MEMORY_NOT_REGISTERED

       Note:
           Note that this function may also return error codes from previous, asynchronous launches.

       See also:
           cuMemHostRegister, cudaHostUnregister

   CUresult cuMemsetD16 (CUdeviceptr dstDevice, unsigned short us, size_t N)
       Sets the memory range of N 16-bit values to the specified value us. The dstDevice  pointer  must  be  two
       byte aligned.

       Parameters:
           dstDevice - Destination device pointer
           us - Value to set
           N - Number of elements

       Returns:
           CUDA_SUCCESS,   CUDA_ERROR_DEINITIALIZED,   CUDA_ERROR_NOT_INITIALIZED,   CUDA_ERROR_INVALID_CONTEXT,
           CUDA_ERROR_INVALID_VALUE

       Note:
           Note that this function may also return error codes from previous, asynchronous launches.

       See also:
           cuArray3DCreate,   cuArray3DGetDescriptor,   cuArrayCreate,   cuArrayDestroy,   cuArrayGetDescriptor,
           cuMemAlloc,   cuMemAllocHost,   cuMemAllocPitch,  cuMemcpy2D,  cuMemcpy2DAsync,  cuMemcpy2DUnaligned,
           cuMemcpy3D,   cuMemcpy3DAsync,   cuMemcpyAtoA,   cuMemcpyAtoD,    cuMemcpyAtoH,    cuMemcpyAtoHAsync,
           cuMemcpyDtoA,   cuMemcpyDtoD,   cuMemcpyDtoDAsync,   cuMemcpyDtoH,  cuMemcpyDtoHAsync,  cuMemcpyHtoA,
           cuMemcpyHtoAAsync, cuMemcpyHtoD, cuMemcpyHtoDAsync, cuMemFree,  cuMemFreeHost,  cuMemGetAddressRange,
           cuMemGetInfo,    cuMemHostAlloc,    cuMemHostGetDevicePointer,    cuMemsetD2D8,    cuMemsetD2D8Async,
           cuMemsetD2D16, cuMemsetD2D16Async, cuMemsetD2D32,  cuMemsetD2D32Async,  cuMemsetD8,  cuMemsetD8Async,
           cuMemsetD16Async, cuMemsetD32, cuMemsetD32Async, cudaMemset

   CUresult cuMemsetD16Async (CUdeviceptr dstDevice, unsigned short us, size_t N, CUstream hStream)
       Sets  the  memory  range  of N 16-bit values to the specified value us. The dstDevice pointer must be two
       byte aligned.

       Parameters:
           dstDevice - Destination device pointer
           us - Value to set
           N - Number of elements
           hStream - Stream identifier

       Returns:
           CUDA_SUCCESS,   CUDA_ERROR_DEINITIALIZED,   CUDA_ERROR_NOT_INITIALIZED,   CUDA_ERROR_INVALID_CONTEXT,
           CUDA_ERROR_INVALID_VALUE

       Note:
           Note that this function may also return error codes from previous, asynchronous launches.

           This function uses standard default stream semantics.

       See also:
           cuArray3DCreate,   cuArray3DGetDescriptor,   cuArrayCreate,   cuArrayDestroy,   cuArrayGetDescriptor,
           cuMemAlloc,  cuMemAllocHost,  cuMemAllocPitch,  cuMemcpy2D,   cuMemcpy2DAsync,   cuMemcpy2DUnaligned,
           cuMemcpy3D,    cuMemcpy3DAsync,    cuMemcpyAtoA,   cuMemcpyAtoD,   cuMemcpyAtoH,   cuMemcpyAtoHAsync,
           cuMemcpyDtoA,  cuMemcpyDtoD,  cuMemcpyDtoDAsync,   cuMemcpyDtoH,   cuMemcpyDtoHAsync,   cuMemcpyHtoA,
           cuMemcpyHtoAAsync,  cuMemcpyHtoD,  cuMemcpyHtoDAsync, cuMemFree, cuMemFreeHost, cuMemGetAddressRange,
           cuMemGetInfo,    cuMemHostAlloc,    cuMemHostGetDevicePointer,    cuMemsetD2D8,    cuMemsetD2D8Async,
           cuMemsetD2D16,  cuMemsetD2D16Async,  cuMemsetD2D32,  cuMemsetD2D32Async, cuMemsetD8, cuMemsetD8Async,
           cuMemsetD16, cuMemsetD32, cuMemsetD32Async, cudaMemsetAsync

   CUresult cuMemsetD2D16 (CUdeviceptr dstDevice, size_t dstPitch,  unsigned  short  us,  size_t  Width,  size_t
       Height)
       Sets the 2D memory range of Width 16-bit values to the specified value us. Height specifies the number of
       rows  to  set,  and  dstPitch  specifies  the number of bytes between each row. The dstDevice pointer and
       dstPitch offset must be two byte aligned. This function performs fastest when the pitch is one  that  has
       been passed back by cuMemAllocPitch().
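
       The relationship between Width, Height and dstPitch can be illustrated with a host-side reference
       model. The sketch below works on ordinary host memory and does not call the driver; the function name
       model_memset_d2d16 and its 0 / -1 return convention are assumptions made for illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Host-side model only. Returns 0 on success and -1 when the two-byte
 * alignment requirement on the pointer or the pitch is not met. */
static int model_memset_d2d16(void *dst, size_t dst_pitch, uint16_t us,
                              size_t width, size_t height)
{
    assert(dst != NULL);

    /* Both the base pointer and the pitch must be two-byte aligned. */
    if (((uintptr_t)dst % sizeof(uint16_t)) != 0 ||
        (dst_pitch % sizeof(uint16_t)) != 0)
        return -1;

    unsigned char *row = (unsigned char *)dst;
    for (size_t y = 0; y < height; ++y) {
        /* Set `width` 16-bit elements in this row. */
        for (size_t x = 0; x < width; ++x)
            memcpy(row + x * sizeof us, &us, sizeof us);
        /* The pitch is counted in bytes, not in elements. */
        row += dst_pitch;
    }
    return 0;
}
```

       In this model, bytes between the end of a row (Width elements) and the start of the next row
       (dstPitch bytes later) are left untouched, which is why a pitch larger than the row width leaves
       padding unchanged.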

       Parameters:
           dstDevice - Destination device pointer
           dstPitch - Pitch of destination device pointer
           us - Value to set
           Width - Width of row
           Height - Number of rows

       Returns:
           CUDA_SUCCESS,   CUDA_ERROR_DEINITIALIZED,   CUDA_ERROR_NOT_INITIALIZED,   CUDA_ERROR_INVALID_CONTEXT,
           CUDA_ERROR_INVALID_VALUE

       Note:
           Note that this function may also return error codes from previous, asynchronous launches.

       See also:
           cuArray3DCreate,   cuArray3DGetDescriptor,   cuArrayCreate,   cuArrayDestroy,   cuArrayGetDescriptor,
           cuMemAlloc,   cuMemAllocHost,   cuMemAllocPitch,  cuMemcpy2D,  cuMemcpy2DAsync,  cuMemcpy2DUnaligned,
           cuMemcpy3D,   cuMemcpy3DAsync,   cuMemcpyAtoA,   cuMemcpyAtoD,    cuMemcpyAtoH,    cuMemcpyAtoHAsync,
           cuMemcpyDtoA,   cuMemcpyDtoD,   cuMemcpyDtoDAsync,   cuMemcpyDtoH,  cuMemcpyDtoHAsync,  cuMemcpyHtoA,
           cuMemcpyHtoAAsync, cuMemcpyHtoD, cuMemcpyHtoDAsync, cuMemFree,  cuMemFreeHost,  cuMemGetAddressRange,
           cuMemGetInfo,    cuMemHostAlloc,    cuMemHostGetDevicePointer,    cuMemsetD2D8,    cuMemsetD2D8Async,
           cuMemsetD2D16Async,  cuMemsetD2D32,  cuMemsetD2D32Async,  cuMemsetD8,  cuMemsetD8Async,  cuMemsetD16,
           cuMemsetD16Async, cuMemsetD32, cuMemsetD32Async, cudaMemset2D

   CUresult  cuMemsetD2D16Async (CUdeviceptr dstDevice, size_t dstPitch, unsigned short us, size_t Width, size_t
       Height, CUstream hStream)
       Sets the 2D memory range of Width 16-bit values to the specified value us. Height specifies the number of
       rows to set, and dstPitch specifies the number of bytes between  each  row.  The  dstDevice  pointer  and
       dstPitch  offset  must be two byte aligned. This function performs fastest when the pitch is one that has
       been passed back by cuMemAllocPitch().

       Parameters:
           dstDevice - Destination device pointer
           dstPitch - Pitch of destination device pointer
           us - Value to set
           Width - Width of row
           Height - Number of rows
           hStream - Stream identifier

       Returns:
           CUDA_SUCCESS,   CUDA_ERROR_DEINITIALIZED,   CUDA_ERROR_NOT_INITIALIZED,   CUDA_ERROR_INVALID_CONTEXT,
           CUDA_ERROR_INVALID_VALUE

       Note:
           Note that this function may also return error codes from previous, asynchronous launches.

           This function uses standard default stream semantics.

       See also:
           cuArray3DCreate,   cuArray3DGetDescriptor,   cuArrayCreate,   cuArrayDestroy,   cuArrayGetDescriptor,
           cuMemAlloc,  cuMemAllocHost,  cuMemAllocPitch,  cuMemcpy2D,   cuMemcpy2DAsync,   cuMemcpy2DUnaligned,
           cuMemcpy3D,    cuMemcpy3DAsync,    cuMemcpyAtoA,   cuMemcpyAtoD,   cuMemcpyAtoH,   cuMemcpyAtoHAsync,
           cuMemcpyDtoA,  cuMemcpyDtoD,  cuMemcpyDtoDAsync,   cuMemcpyDtoH,   cuMemcpyDtoHAsync,   cuMemcpyHtoA,
           cuMemcpyHtoAAsync,  cuMemcpyHtoD,  cuMemcpyHtoDAsync, cuMemFree, cuMemFreeHost, cuMemGetAddressRange,
           cuMemGetInfo,    cuMemHostAlloc,    cuMemHostGetDevicePointer,    cuMemsetD2D8,    cuMemsetD2D8Async,
           cuMemsetD2D16,   cuMemsetD2D32,   cuMemsetD2D32Async,   cuMemsetD8,   cuMemsetD8Async,   cuMemsetD16,
           cuMemsetD16Async, cuMemsetD32, cuMemsetD32Async, cudaMemset2DAsync

   CUresult cuMemsetD2D32 (CUdeviceptr dstDevice, size_t dstPitch, unsigned int ui, size_t Width, size_t Height)

       Sets the 2D memory range of Width 32-bit values to the specified value ui. Height specifies the number of
       rows to set, and dstPitch specifies the number of bytes between  each  row.  The  dstDevice  pointer  and
       dstPitch  offset must be four byte aligned. This function performs fastest when the pitch is one that has
       been passed back by cuMemAllocPitch().

       Parameters:
           dstDevice - Destination device pointer
           dstPitch - Pitch of destination device pointer
           ui - Value to set
           Width - Width of row
           Height - Number of rows

       Returns:
           CUDA_SUCCESS,   CUDA_ERROR_DEINITIALIZED,   CUDA_ERROR_NOT_INITIALIZED,   CUDA_ERROR_INVALID_CONTEXT,
           CUDA_ERROR_INVALID_VALUE

       Note:
           Note that this function may also return error codes from previous, asynchronous launches.

       See also:
           cuArray3DCreate,   cuArray3DGetDescriptor,   cuArrayCreate,   cuArrayDestroy,   cuArrayGetDescriptor,
           cuMemAlloc,  cuMemAllocHost,  cuMemAllocPitch,  cuMemcpy2D,   cuMemcpy2DAsync,   cuMemcpy2DUnaligned,
           cuMemcpy3D,    cuMemcpy3DAsync,    cuMemcpyAtoA,   cuMemcpyAtoD,   cuMemcpyAtoH,   cuMemcpyAtoHAsync,
           cuMemcpyDtoA,  cuMemcpyDtoD,  cuMemcpyDtoDAsync,   cuMemcpyDtoH,   cuMemcpyDtoHAsync,   cuMemcpyHtoA,
           cuMemcpyHtoAAsync,  cuMemcpyHtoD,  cuMemcpyHtoDAsync, cuMemFree, cuMemFreeHost, cuMemGetAddressRange,
           cuMemGetInfo,    cuMemHostAlloc,    cuMemHostGetDevicePointer,    cuMemsetD2D8,    cuMemsetD2D8Async,
           cuMemsetD2D16,  cuMemsetD2D16Async,  cuMemsetD2D32Async,  cuMemsetD8,  cuMemsetD8Async,  cuMemsetD16,
           cuMemsetD16Async, cuMemsetD32, cuMemsetD32Async, cudaMemset2D

   CUresult cuMemsetD2D32Async (CUdeviceptr dstDevice, size_t dstPitch, unsigned int ui,  size_t  Width,  size_t
       Height, CUstream hStream)
       Sets the 2D memory range of Width 32-bit values to the specified value ui. Height specifies the number of
       rows  to  set,  and  dstPitch  specifies  the number of bytes between each row. The dstDevice pointer and
       dstPitch offset must be four byte aligned. This function performs fastest when the pitch is one that  has
       been passed back by cuMemAllocPitch().

       Parameters:
           dstDevice - Destination device pointer
           dstPitch - Pitch of destination device pointer
           ui - Value to set
           Width - Width of row
           Height - Number of rows
           hStream - Stream identifier

       Returns:
           CUDA_SUCCESS,   CUDA_ERROR_DEINITIALIZED,   CUDA_ERROR_NOT_INITIALIZED,   CUDA_ERROR_INVALID_CONTEXT,
           CUDA_ERROR_INVALID_VALUE

       Note:
           Note that this function may also return error codes from previous, asynchronous launches.

           This function uses standard default stream semantics.

       See also:
           cuArray3DCreate,   cuArray3DGetDescriptor,   cuArrayCreate,   cuArrayDestroy,   cuArrayGetDescriptor,
           cuMemAlloc,   cuMemAllocHost,   cuMemAllocPitch,  cuMemcpy2D,  cuMemcpy2DAsync,  cuMemcpy2DUnaligned,
           cuMemcpy3D,   cuMemcpy3DAsync,   cuMemcpyAtoA,   cuMemcpyAtoD,    cuMemcpyAtoH,    cuMemcpyAtoHAsync,
           cuMemcpyDtoA,   cuMemcpyDtoD,   cuMemcpyDtoDAsync,   cuMemcpyDtoH,  cuMemcpyDtoHAsync,  cuMemcpyHtoA,
           cuMemcpyHtoAAsync, cuMemcpyHtoD, cuMemcpyHtoDAsync, cuMemFree,  cuMemFreeHost,  cuMemGetAddressRange,
           cuMemGetInfo,    cuMemHostAlloc,    cuMemHostGetDevicePointer,    cuMemsetD2D8,    cuMemsetD2D8Async,
           cuMemsetD2D16,   cuMemsetD2D16Async,   cuMemsetD2D32,   cuMemsetD8,   cuMemsetD8Async,   cuMemsetD16,
           cuMemsetD16Async, cuMemsetD32, cuMemsetD32Async, cudaMemset2DAsync

   CUresult cuMemsetD2D8 (CUdeviceptr dstDevice, size_t dstPitch, unsigned char uc, size_t Width, size_t Height)

       Sets  the 2D memory range of Width 8-bit values to the specified value uc. Height specifies the number of
       rows to set, and dstPitch specifies the number of bytes between each row. This function performs  fastest
       when the pitch is one that has been passed back by cuMemAllocPitch().

       Parameters:
           dstDevice - Destination device pointer
           dstPitch - Pitch of destination device pointer
           uc - Value to set
           Width - Width of row
           Height - Number of rows

       Returns:
           CUDA_SUCCESS,   CUDA_ERROR_DEINITIALIZED,   CUDA_ERROR_NOT_INITIALIZED,   CUDA_ERROR_INVALID_CONTEXT,
           CUDA_ERROR_INVALID_VALUE

       Note:
           Note that this function may also return error codes from previous, asynchronous launches.

       See also:
           cuArray3DCreate,   cuArray3DGetDescriptor,   cuArrayCreate,   cuArrayDestroy,   cuArrayGetDescriptor,
           cuMemAlloc,   cuMemAllocHost,   cuMemAllocPitch,  cuMemcpy2D,  cuMemcpy2DAsync,  cuMemcpy2DUnaligned,
           cuMemcpy3D,   cuMemcpy3DAsync,   cuMemcpyAtoA,   cuMemcpyAtoD,    cuMemcpyAtoH,    cuMemcpyAtoHAsync,
           cuMemcpyDtoA,   cuMemcpyDtoD,   cuMemcpyDtoDAsync,   cuMemcpyDtoH,  cuMemcpyDtoHAsync,  cuMemcpyHtoA,
           cuMemcpyHtoAAsync, cuMemcpyHtoD, cuMemcpyHtoDAsync, cuMemFree,  cuMemFreeHost,  cuMemGetAddressRange,
           cuMemGetInfo,    cuMemHostAlloc,    cuMemHostGetDevicePointer,    cuMemsetD2D8Async,   cuMemsetD2D16,
           cuMemsetD2D16Async,  cuMemsetD2D32,  cuMemsetD2D32Async,  cuMemsetD8,  cuMemsetD8Async,  cuMemsetD16,
           cuMemsetD16Async, cuMemsetD32, cuMemsetD32Async, cudaMemset2D

   CUresult  cuMemsetD2D8Async  (CUdeviceptr  dstDevice, size_t dstPitch, unsigned char uc, size_t Width, size_t
       Height, CUstream hStream)
       Sets the 2D memory range of Width 8-bit values to the specified value uc. Height specifies the number  of
       rows  to set, and dstPitch specifies the number of bytes between each row. This function performs fastest
       when the pitch is one that has been passed back by cuMemAllocPitch().

       Parameters:
           dstDevice - Destination device pointer
           dstPitch - Pitch of destination device pointer
           uc - Value to set
           Width - Width of row
           Height - Number of rows
           hStream - Stream identifier

       Returns:
           CUDA_SUCCESS,   CUDA_ERROR_DEINITIALIZED,   CUDA_ERROR_NOT_INITIALIZED,   CUDA_ERROR_INVALID_CONTEXT,
           CUDA_ERROR_INVALID_VALUE

       Note:
           Note that this function may also return error codes from previous, asynchronous launches.

           This function uses standard default stream semantics.

       See also:
           cuArray3DCreate,   cuArray3DGetDescriptor,   cuArrayCreate,   cuArrayDestroy,   cuArrayGetDescriptor,
           cuMemAlloc,  cuMemAllocHost,  cuMemAllocPitch,  cuMemcpy2D,   cuMemcpy2DAsync,   cuMemcpy2DUnaligned,
           cuMemcpy3D,    cuMemcpy3DAsync,    cuMemcpyAtoA,   cuMemcpyAtoD,   cuMemcpyAtoH,   cuMemcpyAtoHAsync,
           cuMemcpyDtoA,  cuMemcpyDtoD,  cuMemcpyDtoDAsync,   cuMemcpyDtoH,   cuMemcpyDtoHAsync,   cuMemcpyHtoA,
           cuMemcpyHtoAAsync,  cuMemcpyHtoD,  cuMemcpyHtoDAsync, cuMemFree, cuMemFreeHost, cuMemGetAddressRange,
           cuMemGetInfo,     cuMemHostAlloc,     cuMemHostGetDevicePointer,     cuMemsetD2D8,     cuMemsetD2D16,
           cuMemsetD2D16Async,  cuMemsetD2D32,  cuMemsetD2D32Async,  cuMemsetD8,  cuMemsetD8Async,  cuMemsetD16,
           cuMemsetD16Async, cuMemsetD32, cuMemsetD32Async, cudaMemset2DAsync

   CUresult cuMemsetD32 (CUdeviceptr dstDevice, unsigned int ui, size_t N)
       Sets the memory range of N 32-bit values to the specified value ui. The dstDevice pointer  must  be  four
       byte aligned.

       Parameters:
           dstDevice - Destination device pointer
           ui - Value to set
           N - Number of elements

       Returns:
           CUDA_SUCCESS,   CUDA_ERROR_DEINITIALIZED,   CUDA_ERROR_NOT_INITIALIZED,   CUDA_ERROR_INVALID_CONTEXT,
           CUDA_ERROR_INVALID_VALUE

       Note:
           Note that this function may also return error codes from previous, asynchronous launches.

       See also:
           cuArray3DCreate,   cuArray3DGetDescriptor,   cuArrayCreate,   cuArrayDestroy,   cuArrayGetDescriptor,
           cuMemAlloc,   cuMemAllocHost,   cuMemAllocPitch,  cuMemcpy2D,  cuMemcpy2DAsync,  cuMemcpy2DUnaligned,
           cuMemcpy3D,   cuMemcpy3DAsync,   cuMemcpyAtoA,   cuMemcpyAtoD,    cuMemcpyAtoH,    cuMemcpyAtoHAsync,
           cuMemcpyDtoA,   cuMemcpyDtoD,   cuMemcpyDtoDAsync,   cuMemcpyDtoH,  cuMemcpyDtoHAsync,  cuMemcpyHtoA,
           cuMemcpyHtoAAsync, cuMemcpyHtoD, cuMemcpyHtoDAsync, cuMemFree,  cuMemFreeHost,  cuMemGetAddressRange,
           cuMemGetInfo,    cuMemHostAlloc,    cuMemHostGetDevicePointer,    cuMemsetD2D8,    cuMemsetD2D8Async,
           cuMemsetD2D16, cuMemsetD2D16Async, cuMemsetD2D32,  cuMemsetD2D32Async,  cuMemsetD8,  cuMemsetD8Async,
           cuMemsetD16, cuMemsetD16Async, cuMemsetD32Async, cudaMemset

   CUresult cuMemsetD32Async (CUdeviceptr dstDevice, unsigned int ui, size_t N, CUstream hStream)
       Sets  the  memory  range of N 32-bit values to the specified value ui. The dstDevice pointer must be four
       byte aligned.

       Parameters:
           dstDevice - Destination device pointer
           ui - Value to set
           N - Number of elements
           hStream - Stream identifier

       Returns:
           CUDA_SUCCESS,   CUDA_ERROR_DEINITIALIZED,   CUDA_ERROR_NOT_INITIALIZED,   CUDA_ERROR_INVALID_CONTEXT,
           CUDA_ERROR_INVALID_VALUE

       Note:
           Note that this function may also return error codes from previous, asynchronous launches.

           This function uses standard default stream semantics.

       See also:
           cuArray3DCreate,   cuArray3DGetDescriptor,   cuArrayCreate,   cuArrayDestroy,   cuArrayGetDescriptor,
           cuMemAlloc,  cuMemAllocHost,  cuMemAllocPitch,  cuMemcpy2D,   cuMemcpy2DAsync,   cuMemcpy2DUnaligned,
           cuMemcpy3D,    cuMemcpy3DAsync,    cuMemcpyAtoA,   cuMemcpyAtoD,   cuMemcpyAtoH,   cuMemcpyAtoHAsync,
           cuMemcpyDtoA,  cuMemcpyDtoD,  cuMemcpyDtoDAsync,   cuMemcpyDtoH,   cuMemcpyDtoHAsync,   cuMemcpyHtoA,
           cuMemcpyHtoAAsync,  cuMemcpyHtoD,  cuMemcpyHtoDAsync, cuMemFree, cuMemFreeHost, cuMemGetAddressRange,
           cuMemGetInfo,    cuMemHostAlloc,    cuMemHostGetDevicePointer,    cuMemsetD2D8,    cuMemsetD2D8Async,
           cuMemsetD2D16,  cuMemsetD2D16Async,  cuMemsetD2D32,  cuMemsetD2D32Async, cuMemsetD8, cuMemsetD8Async,
           cuMemsetD16, cuMemsetD16Async, cuMemsetD32, cudaMemsetAsync

   CUresult cuMemsetD8 (CUdeviceptr dstDevice, unsigned char uc, size_t N)
       Sets the memory range of N 8-bit values to the specified value uc.

       Parameters:
           dstDevice - Destination device pointer
           uc - Value to set
           N - Number of elements

       Returns:
           CUDA_SUCCESS,   CUDA_ERROR_DEINITIALIZED,   CUDA_ERROR_NOT_INITIALIZED,   CUDA_ERROR_INVALID_CONTEXT,
           CUDA_ERROR_INVALID_VALUE

       Note:
           Note that this function may also return error codes from previous, asynchronous launches.

       See also:
           cuArray3DCreate,   cuArray3DGetDescriptor,   cuArrayCreate,   cuArrayDestroy,   cuArrayGetDescriptor,
           cuMemAlloc,  cuMemAllocHost,  cuMemAllocPitch,  cuMemcpy2D,   cuMemcpy2DAsync,   cuMemcpy2DUnaligned,
           cuMemcpy3D,    cuMemcpy3DAsync,    cuMemcpyAtoA,   cuMemcpyAtoD,   cuMemcpyAtoH,   cuMemcpyAtoHAsync,
           cuMemcpyDtoA,  cuMemcpyDtoD,  cuMemcpyDtoDAsync,   cuMemcpyDtoH,   cuMemcpyDtoHAsync,   cuMemcpyHtoA,
           cuMemcpyHtoAAsync,  cuMemcpyHtoD,  cuMemcpyHtoDAsync, cuMemFree, cuMemFreeHost, cuMemGetAddressRange,
           cuMemGetInfo,    cuMemHostAlloc,    cuMemHostGetDevicePointer,    cuMemsetD2D8,    cuMemsetD2D8Async,
           cuMemsetD2D16,  cuMemsetD2D16Async,  cuMemsetD2D32, cuMemsetD2D32Async, cuMemsetD8Async, cuMemsetD16,
           cuMemsetD16Async, cuMemsetD32, cuMemsetD32Async, cudaMemset

   CUresult cuMemsetD8Async (CUdeviceptr dstDevice, unsigned char uc, size_t N, CUstream hStream)
       Sets the memory range of N 8-bit values to the specified value uc.

       Parameters:
           dstDevice - Destination device pointer
           uc - Value to set
           N - Number of elements
           hStream - Stream identifier

       Returns:
           CUDA_SUCCESS,   CUDA_ERROR_DEINITIALIZED,   CUDA_ERROR_NOT_INITIALIZED,   CUDA_ERROR_INVALID_CONTEXT,
           CUDA_ERROR_INVALID_VALUE

       Note:
           Note that this function may also return error codes from previous, asynchronous launches.

           This function uses standard default stream semantics.

       See also:
           cuArray3DCreate,   cuArray3DGetDescriptor,   cuArrayCreate,   cuArrayDestroy,   cuArrayGetDescriptor,
           cuMemAlloc,  cuMemAllocHost,  cuMemAllocPitch,  cuMemcpy2D,   cuMemcpy2DAsync,   cuMemcpy2DUnaligned,
           cuMemcpy3D,    cuMemcpy3DAsync,    cuMemcpyAtoA,   cuMemcpyAtoD,   cuMemcpyAtoH,   cuMemcpyAtoHAsync,
           cuMemcpyDtoA,  cuMemcpyDtoD,  cuMemcpyDtoDAsync,   cuMemcpyDtoH,   cuMemcpyDtoHAsync,   cuMemcpyHtoA,
           cuMemcpyHtoAAsync,  cuMemcpyHtoD,  cuMemcpyHtoDAsync, cuMemFree, cuMemFreeHost, cuMemGetAddressRange,
           cuMemGetInfo,    cuMemHostAlloc,    cuMemHostGetDevicePointer,    cuMemsetD2D8,    cuMemsetD2D8Async,
           cuMemsetD2D16,   cuMemsetD2D16Async,   cuMemsetD2D32,  cuMemsetD2D32Async,  cuMemsetD8,  cuMemsetD16,
           cuMemsetD16Async, cuMemsetD32, cuMemsetD32Async, cudaMemsetAsync

   CUresult   cuMipmappedArrayCreate   (CUmipmappedArray   *   pHandle,    const    CUDA_ARRAY3D_DESCRIPTOR    *
       pMipmappedArrayDesc, unsigned int numMipmapLevels)
       Creates a CUDA mipmapped array according to the CUDA_ARRAY3D_DESCRIPTOR structure pMipmappedArrayDesc and
       returns  a  handle  to  the new CUDA mipmapped array in *pHandle. numMipmapLevels specifies the number of
       mipmap levels to be allocated. This value is clamped to the range [1, 1 +  floor(log2(max(width,  height,
       depth)))].
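
       The clamping range can be worked through with a short helper. This is a sketch of the arithmetic in
       the formula above, not of the driver's implementation; the function names are hypothetical.

```c
#include <assert.h>

/* floor(log2(v)) for v > 0, computed by counting right shifts. */
static unsigned int floor_log2(unsigned int v)
{
    assert(v > 0);
    unsigned int r = 0;
    while (v >>= 1)
        ++r;
    return r;
}

/* Upper bound of the range: 1 + floor(log2(max(width, height, depth))). */
static unsigned int max_mipmap_levels(unsigned int width, unsigned int height,
                                      unsigned int depth)
{
    unsigned int m = width;
    if (height > m)
        m = height;
    if (depth > m)
        m = depth;
    return 1u + floor_log2(m);
}

/* Clamp a requested level count to [1, max_mipmap_levels(...)]. */
static unsigned int clamp_mipmap_levels(unsigned int requested,
                                        unsigned int width,
                                        unsigned int height,
                                        unsigned int depth)
{
    unsigned int hi = max_mipmap_levels(width, height, depth);
    if (requested < 1u)
        return 1u;
    if (requested > hi)
        return hi;
    return requested;
}
```

       For a 1024 x 512 2D array (Depth of zero), the upper bound is 1 + floor(log2(1024)) = 11, so a request
       for 20 levels is clamped to 11 and a request for 0 levels is clamped to 1.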

       The CUDA_ARRAY3D_DESCRIPTOR is defined as:

           typedef struct {
               unsigned int Width;
               unsigned int Height;
               unsigned int Depth;
               CUarray_format Format;
               unsigned int NumChannels;
               unsigned int Flags;
           } CUDA_ARRAY3D_DESCRIPTOR;

        where:

       • Width,  Height,  and  Depth  are  the  width,  height,  and  depth of the CUDA array (in elements); the
         following types of CUDA arrays can be allocated:

         • A 1D mipmapped array is allocated if Height and Depth extents are both zero.

         • A 2D mipmapped array is allocated if only Depth extent is zero.

         • A 3D mipmapped array is allocated if all three extents are non-zero.

         • A 1D layered CUDA mipmapped array is allocated if only Height is zero  and  the  CUDA_ARRAY3D_LAYERED
           flag is set. Each layer is a 1D array. The number of layers is determined by the depth extent.

         • A  2D  layered  CUDA  mipmapped  array  is  allocated  if  all  three  extents  are  non-zero and the
           CUDA_ARRAY3D_LAYERED flag is set. Each layer is a 2D array. The number of layers is determined by the
           depth extent.

         • A  cubemap  CUDA  mipmapped  array  is  allocated  if  all  three  extents  are  non-zero   and   the
           CUDA_ARRAY3D_CUBEMAP  flag is set. Width must be equal to Height, and Depth must be six. A cubemap is
           a special type of 2D layered CUDA array, where the six layers represent the six faces of a cube.  The
           order of the six layers in memory is the same as that listed in CUarray_cubemap_face.

         • A cubemap layered CUDA mipmapped array is allocated if all three extents are non-zero and both the
           CUDA_ARRAY3D_CUBEMAP and CUDA_ARRAY3D_LAYERED flags are set. Width must be equal to Height, and Depth
           must be a multiple of six. A cubemap layered CUDA array is a special type of 2D layered CUDA array
           that consists of a collection of cubemaps. The first six layers represent the first cubemap, the next
           six layers form the second cubemap, and so on.

       • Format specifies the format of the elements; CUarray_format is defined as:

           typedef enum CUarray_format_enum {
               CU_AD_FORMAT_UNSIGNED_INT8 = 0x01,
               CU_AD_FORMAT_UNSIGNED_INT16 = 0x02,
               CU_AD_FORMAT_UNSIGNED_INT32 = 0x03,
               CU_AD_FORMAT_SIGNED_INT8 = 0x08,
               CU_AD_FORMAT_SIGNED_INT16 = 0x09,
               CU_AD_FORMAT_SIGNED_INT32 = 0x0a,
               CU_AD_FORMAT_HALF = 0x10,
               CU_AD_FORMAT_FLOAT = 0x20
           } CUarray_format;

       • NumChannels specifies the number of packed components per CUDA array element; it may be 1, 2, or 4;

       • Flags may be set to

         • CUDA_ARRAY3D_LAYERED  to enable creation of layered CUDA mipmapped arrays. If this flag is set, Depth
           specifies the number of layers, not the depth of a 3D array.

         • CUDA_ARRAY3D_SURFACE_LDST to enable surface references to be bound to individual mipmap levels of the
           CUDA mipmapped array. If this flag is not set, cuSurfRefSetArray will fail when attempting to bind  a
           mipmap level of the CUDA mipmapped array to a surface reference.

         • CUDA_ARRAY3D_CUBEMAP  to  enable  creation  of mipmapped cubemaps. If this flag is set, Width must be
           equal to Height, and Depth must be six. If the CUDA_ARRAY3D_LAYERED flag is also set, then Depth must
           be a multiple of six.

         • CUDA_ARRAY3D_TEXTURE_GATHER to indicate that the CUDA  mipmapped  array  will  be  used  for  texture
           gather. Texture gather can only be performed on 2D CUDA mipmapped arrays.
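
       The rules above for which kind of mipmapped array is allocated can be summarised as a decision
       procedure. The sketch below is a host-side illustration only: SKETCH_LAYERED and SKETCH_CUBEMAP are
       local stand-ins for the CUDA_ARRAY3D_LAYERED and CUDA_ARRAY3D_CUBEMAP flags (real code would use the
       constants from cuda.h), the enum and function names are hypothetical, and combinations that the list
       does not describe are reported as KIND_UNLISTED without implying anything about how the driver treats
       them.

```c
#include <assert.h>

/* Local stand-ins for the descriptor flags; assumptions for this sketch. */
enum { SKETCH_LAYERED = 0x01, SKETCH_CUBEMAP = 0x04 };

enum array_kind {
    KIND_UNLISTED,        /* combination not described in the list */
    KIND_1D, KIND_2D, KIND_3D,
    KIND_1D_LAYERED, KIND_2D_LAYERED,
    KIND_CUBEMAP, KIND_CUBEMAP_LAYERED
};

static enum array_kind classify_mipmapped_array(unsigned int width,
                                                unsigned int height,
                                                unsigned int depth,
                                                unsigned int flags)
{
    int layered = (flags & SKETCH_LAYERED) != 0;
    int cubemap = (flags & SKETCH_CUBEMAP) != 0;

    if (width == 0)
        return KIND_UNLISTED;

    if (cubemap) {
        /* All extents non-zero and Width equal to Height. */
        if (height == 0 || depth == 0 || width != height)
            return KIND_UNLISTED;
        if (layered)
            return (depth % 6 == 0) ? KIND_CUBEMAP_LAYERED : KIND_UNLISTED;
        return (depth == 6) ? KIND_CUBEMAP : KIND_UNLISTED;
    }

    if (layered) {
        /* Depth carries the number of layers. */
        if (depth == 0)
            return KIND_UNLISTED;
        return (height == 0) ? KIND_1D_LAYERED : KIND_2D_LAYERED;
    }

    if (height == 0 && depth == 0)
        return KIND_1D;
    if (height != 0 && depth == 0)
        return KIND_2D;
    if (height != 0 && depth != 0)
        return KIND_3D;
    return KIND_UNLISTED; /* Height zero, Depth non-zero, not layered */
}
```

       For example, extents (64, 0, 8) with the layered flag classify as a 1D layered array with eight
       layers, while the same extents without the flag fall outside the listed cases.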

       Width, Height and Depth must meet certain size requirements as listed in the following table. All values
       are specified in elements. For brevity, the full name of the device attribute is not given. For
       example, TEXTURE1D_MIPMAPPED_WIDTH refers to the device attribute
       CU_DEVICE_ATTRIBUTE_MAXIMUM_TEXTURE1D_MIPMAPPED_WIDTH.

       Each set of extents below is written as
       {(width range in elements), (height range), (depth range)}.

       CUDA array type: 1D
           Valid extents that must always be met:
               { (1,TEXTURE1D_MIPMAPPED_WIDTH), 0, 0 }
           Valid extents with CUDA_ARRAY3D_SURFACE_LDST set:
               { (1,SURFACE1D_WIDTH), 0, 0 }

       CUDA array type: 2D
           Valid extents that must always be met:
               { (1,TEXTURE2D_MIPMAPPED_WIDTH), (1,TEXTURE2D_MIPMAPPED_HEIGHT), 0 }
           Valid extents with CUDA_ARRAY3D_SURFACE_LDST set:
               { (1,SURFACE2D_WIDTH), (1,SURFACE2D_HEIGHT), 0 }

       CUDA array type: 3D
           Valid extents that must always be met:
               { (1,TEXTURE3D_WIDTH), (1,TEXTURE3D_HEIGHT), (1,TEXTURE3D_DEPTH) }
               OR
               { (1,TEXTURE3D_WIDTH_ALTERNATE), (1,TEXTURE3D_HEIGHT_ALTERNATE), (1,TEXTURE3D_DEPTH_ALTERNATE) }
           Valid extents with CUDA_ARRAY3D_SURFACE_LDST set:
               { (1,SURFACE3D_WIDTH), (1,SURFACE3D_HEIGHT), (1,SURFACE3D_DEPTH) }

       CUDA array type: 1D Layered
           Valid extents that must always be met:
               { (1,TEXTURE1D_LAYERED_WIDTH), 0, (1,TEXTURE1D_LAYERED_LAYERS) }
           Valid extents with CUDA_ARRAY3D_SURFACE_LDST set:
               { (1,SURFACE1D_LAYERED_WIDTH), 0, (1,SURFACE1D_LAYERED_LAYERS) }

       CUDA array type: 2D Layered
           Valid extents that must always be met:
               { (1,TEXTURE2D_LAYERED_WIDTH), (1,TEXTURE2D_LAYERED_HEIGHT), (1,TEXTURE2D_LAYERED_LAYERS) }
           Valid extents with CUDA_ARRAY3D_SURFACE_LDST set:
               { (1,SURFACE2D_LAYERED_WIDTH), (1,SURFACE2D_LAYERED_HEIGHT), (1,SURFACE2D_LAYERED_LAYERS) }

       CUDA array type: Cubemap
           Valid extents that must always be met:
               { (1,TEXTURECUBEMAP_WIDTH), (1,TEXTURECUBEMAP_WIDTH), 6 }
           Valid extents with CUDA_ARRAY3D_SURFACE_LDST set:
               { (1,SURFACECUBEMAP_WIDTH), (1,SURFACECUBEMAP_WIDTH), 6 }

       CUDA array type: Cubemap Layered
           Valid extents that must always be met:
               { (1,TEXTURECUBEMAP_LAYERED_WIDTH), (1,TEXTURECUBEMAP_LAYERED_WIDTH),
                 (1,TEXTURECUBEMAP_LAYERED_LAYERS) }
           Valid extents with CUDA_ARRAY3D_SURFACE_LDST set:
               { (1,SURFACECUBEMAP_LAYERED_WIDTH), (1,SURFACECUBEMAP_LAYERED_WIDTH),
                 (1,SURFACECUBEMAP_LAYERED_LAYERS) }

       Parameters:
           pHandle - Returned mipmapped array
           pMipmappedArrayDesc - mipmapped array descriptor
           numMipmapLevels - Number of mipmap levels

       Returns:
           CUDA_SUCCESS,   CUDA_ERROR_DEINITIALIZED,   CUDA_ERROR_NOT_INITIALIZED,   CUDA_ERROR_INVALID_CONTEXT,
           CUDA_ERROR_INVALID_VALUE, CUDA_ERROR_OUT_OF_MEMORY, CUDA_ERROR_UNKNOWN

       Note:
           Note that this function may also return error codes from previous, asynchronous launches.

       See also:
           cuMipmappedArrayDestroy, cuMipmappedArrayGetLevel, cuArrayCreate, cudaMallocMipmappedArray

   CUresult cuMipmappedArrayDestroy (CUmipmappedArray hMipmappedArray)
       Destroys the CUDA mipmapped array hMipmappedArray.

       Parameters:
           hMipmappedArray - Mipmapped array to destroy

       Returns:
           CUDA_SUCCESS,   CUDA_ERROR_DEINITIALIZED,   CUDA_ERROR_NOT_INITIALIZED,   CUDA_ERROR_INVALID_CONTEXT,
           CUDA_ERROR_INVALID_HANDLE, CUDA_ERROR_ARRAY_IS_MAPPED

       Note:
           Note that this function may also return error codes from previous, asynchronous launches.

       See also:
           cuMipmappedArrayCreate, cuMipmappedArrayGetLevel, cuArrayCreate, cudaFreeMipmappedArray

   CUresult  cuMipmappedArrayGetLevel  (CUarray  *  pLevelArray,  CUmipmappedArray hMipmappedArray, unsigned int
       level)
       Returns in *pLevelArray a CUDA array that represents a single mipmap level of the  CUDA  mipmapped  array
       hMipmappedArray.

       If  level  is greater than the maximum number of levels in this mipmapped array, CUDA_ERROR_INVALID_VALUE
       is returned.

       Parameters:
           pLevelArray - Returned mipmap level CUDA array
           hMipmappedArray - CUDA mipmapped array
           level - Mipmap level

       Returns:
           CUDA_SUCCESS,   CUDA_ERROR_DEINITIALIZED,   CUDA_ERROR_NOT_INITIALIZED,   CUDA_ERROR_INVALID_CONTEXT,
           CUDA_ERROR_INVALID_VALUE, CUDA_ERROR_INVALID_HANDLE

       Note:
           Note that this function may also return error codes from previous, asynchronous launches.

       See also:
           cuMipmappedArrayCreate, cuMipmappedArrayDestroy, cuArrayCreate, cudaGetMipmappedArrayLevel

Author

       Generated automatically by Doxygen from the source code.

Version 6.0                                        3 Nov 2017                               Memory Management(3)