Ubuntu Manpage: Memory Management -

Provided by: nvidia-cuda-dev_10.1.243-3_amd64

NAME

       Memory Management - memory management functions of the low-level CUDA driver API (cuda.h)

   Functions
       CUresult cuArray3DCreate (CUarray *pHandle, const CUDA_ARRAY3D_DESCRIPTOR *pAllocateArray)
           Creates a 3D CUDA array.
       CUresult cuArray3DGetDescriptor (CUDA_ARRAY3D_DESCRIPTOR *pArrayDescriptor, CUarray hArray)
           Get a 3D CUDA array descriptor.
       CUresult cuArrayCreate (CUarray *pHandle, const CUDA_ARRAY_DESCRIPTOR *pAllocateArray)
           Creates a 1D or 2D CUDA array.
       CUresult cuArrayDestroy (CUarray hArray)
           Destroys a CUDA array.
       CUresult cuArrayGetDescriptor (CUDA_ARRAY_DESCRIPTOR *pArrayDescriptor, CUarray hArray)
           Get a 1D or 2D CUDA array descriptor.
       CUresult cuDeviceGetByPCIBusId (CUdevice *dev, const char *pciBusId)
           Returns a handle to a compute device.
       CUresult cuDeviceGetPCIBusId (char *pciBusId, int len, CUdevice dev)
           Returns a PCI Bus Id string for the device.
       CUresult cuIpcCloseMemHandle (CUdeviceptr dptr)
           Close memory mapped with cuIpcOpenMemHandle.
       CUresult cuIpcGetEventHandle (CUipcEventHandle *pHandle, CUevent event)
           Gets an interprocess handle for a previously allocated event.
       CUresult cuIpcGetMemHandle (CUipcMemHandle *pHandle, CUdeviceptr dptr)
           Gets an interprocess memory handle for an existing device memory allocation.
       CUresult cuIpcOpenEventHandle (CUevent *phEvent, CUipcEventHandle handle)
           Opens an interprocess event handle for use in the current process.
       CUresult cuIpcOpenMemHandle (CUdeviceptr *pdptr, CUipcMemHandle handle, unsigned int Flags)
           Opens an interprocess memory handle exported from another process and returns a device pointer usable
           in the local process.
       CUresult cuMemAlloc (CUdeviceptr *dptr, size_t bytesize)
           Allocates device memory.
       CUresult cuMemAllocHost (void **pp, size_t bytesize)
           Allocates page-locked host memory.
       CUresult cuMemAllocManaged (CUdeviceptr *dptr, size_t bytesize, unsigned int flags)
           Allocates memory that will be automatically managed by the Unified Memory system.
       CUresult cuMemAllocPitch (CUdeviceptr *dptr, size_t *pPitch, size_t WidthInBytes, size_t Height, unsigned
           int ElementSizeBytes)
           Allocates pitched device memory.
       CUresult cuMemcpy (CUdeviceptr dst, CUdeviceptr src, size_t ByteCount)
           Copies memory.
       CUresult cuMemcpy2D (const CUDA_MEMCPY2D *pCopy)
           Copies memory for 2D arrays.
       CUresult cuMemcpy2DAsync (const CUDA_MEMCPY2D *pCopy, CUstream hStream)
           Copies memory for 2D arrays.
       CUresult cuMemcpy2DUnaligned (const CUDA_MEMCPY2D *pCopy)
           Copies memory for 2D arrays.
       CUresult cuMemcpy3D (const CUDA_MEMCPY3D *pCopy)
           Copies memory for 3D arrays.
       CUresult cuMemcpy3DAsync (const CUDA_MEMCPY3D *pCopy, CUstream hStream)
           Copies memory for 3D arrays.
       CUresult cuMemcpy3DPeer (const CUDA_MEMCPY3D_PEER *pCopy)
           Copies memory between contexts.
       CUresult cuMemcpy3DPeerAsync (const CUDA_MEMCPY3D_PEER *pCopy, CUstream hStream)
           Copies memory between contexts asynchronously.
       CUresult cuMemcpyAsync (CUdeviceptr dst, CUdeviceptr src, size_t ByteCount, CUstream hStream)
           Copies memory asynchronously.
       CUresult cuMemcpyAtoA (CUarray dstArray, size_t dstOffset, CUarray srcArray, size_t srcOffset, size_t
           ByteCount)
           Copies memory from Array to Array.
       CUresult cuMemcpyAtoD (CUdeviceptr dstDevice, CUarray srcArray, size_t srcOffset, size_t ByteCount)
           Copies memory from Array to Device.
       CUresult cuMemcpyAtoH (void *dstHost, CUarray srcArray, size_t srcOffset, size_t ByteCount)
           Copies memory from Array to Host.
       CUresult cuMemcpyAtoHAsync (void *dstHost, CUarray srcArray, size_t srcOffset, size_t ByteCount, CUstream
           hStream)
           Copies memory from Array to Host.
       CUresult cuMemcpyDtoA (CUarray dstArray, size_t dstOffset, CUdeviceptr srcDevice, size_t ByteCount)
           Copies memory from Device to Array.
       CUresult cuMemcpyDtoD (CUdeviceptr dstDevice, CUdeviceptr srcDevice, size_t ByteCount)
           Copies memory from Device to Device.
       CUresult cuMemcpyDtoDAsync (CUdeviceptr dstDevice, CUdeviceptr srcDevice, size_t ByteCount, CUstream
           hStream)
           Copies memory from Device to Device.
       CUresult cuMemcpyDtoH (void *dstHost, CUdeviceptr srcDevice, size_t ByteCount)
           Copies memory from Device to Host.
       CUresult cuMemcpyDtoHAsync (void *dstHost, CUdeviceptr srcDevice, size_t ByteCount, CUstream hStream)
           Copies memory from Device to Host.
       CUresult cuMemcpyHtoA (CUarray dstArray, size_t dstOffset, const void *srcHost, size_t ByteCount)
           Copies memory from Host to Array.
       CUresult cuMemcpyHtoAAsync (CUarray dstArray, size_t dstOffset, const void *srcHost, size_t ByteCount,
           CUstream hStream)
           Copies memory from Host to Array.
       CUresult cuMemcpyHtoD (CUdeviceptr dstDevice, const void *srcHost, size_t ByteCount)
           Copies memory from Host to Device.
       CUresult cuMemcpyHtoDAsync (CUdeviceptr dstDevice, const void *srcHost, size_t ByteCount, CUstream
           hStream)
           Copies memory from Host to Device.
       CUresult cuMemcpyPeer (CUdeviceptr dstDevice, CUcontext dstContext, CUdeviceptr srcDevice, CUcontext
           srcContext, size_t ByteCount)
           Copies device memory between two contexts.
       CUresult cuMemcpyPeerAsync (CUdeviceptr dstDevice, CUcontext dstContext, CUdeviceptr srcDevice, CUcontext
           srcContext, size_t ByteCount, CUstream hStream)
           Copies device memory between two contexts asynchronously.
       CUresult cuMemFree (CUdeviceptr dptr)
           Frees device memory.
       CUresult cuMemFreeHost (void *p)
           Frees page-locked host memory.
       CUresult cuMemGetAddressRange (CUdeviceptr *pbase, size_t *psize, CUdeviceptr dptr)
           Get information on memory allocations.
       CUresult cuMemGetInfo (size_t *free, size_t *total)
           Gets free and total memory.
       CUresult cuMemHostAlloc (void **pp, size_t bytesize, unsigned int Flags)
           Allocates page-locked host memory.
       CUresult cuMemHostGetDevicePointer (CUdeviceptr *pdptr, void *p, unsigned int Flags)
           Passes back device pointer of mapped pinned memory.
       CUresult cuMemHostGetFlags (unsigned int *pFlags, void *p)
           Passes back flags that were used for a pinned allocation.
       CUresult cuMemHostRegister (void *p, size_t bytesize, unsigned int Flags)
           Registers an existing host memory range for use by CUDA.
       CUresult cuMemHostUnregister (void *p)
           Unregisters a memory range that was registered with cuMemHostRegister.
       CUresult cuMemsetD16 (CUdeviceptr dstDevice, unsigned short us, size_t N)
           Initializes device memory.
       CUresult cuMemsetD16Async (CUdeviceptr dstDevice, unsigned short us, size_t N, CUstream hStream)
           Sets device memory.
       CUresult cuMemsetD2D16 (CUdeviceptr dstDevice, size_t dstPitch, unsigned short us, size_t Width, size_t
           Height)
           Initializes device memory.
       CUresult cuMemsetD2D16Async (CUdeviceptr dstDevice, size_t dstPitch, unsigned short us, size_t Width,
           size_t Height, CUstream hStream)
           Sets device memory.
       CUresult cuMemsetD2D32 (CUdeviceptr dstDevice, size_t dstPitch, unsigned int ui, size_t Width, size_t
           Height)
           Initializes device memory.
       CUresult cuMemsetD2D32Async (CUdeviceptr dstDevice, size_t dstPitch, unsigned int ui, size_t Width,
           size_t Height, CUstream hStream)
           Sets device memory.
       CUresult cuMemsetD2D8 (CUdeviceptr dstDevice, size_t dstPitch, unsigned char uc, size_t Width, size_t
           Height)
           Initializes device memory.
       CUresult cuMemsetD2D8Async (CUdeviceptr dstDevice, size_t dstPitch, unsigned char uc, size_t Width,
           size_t Height, CUstream hStream)
           Sets device memory.
       CUresult cuMemsetD32 (CUdeviceptr dstDevice, unsigned int ui, size_t N)
           Initializes device memory.
       CUresult cuMemsetD32Async (CUdeviceptr dstDevice, unsigned int ui, size_t N, CUstream hStream)
           Sets device memory.
       CUresult cuMemsetD8 (CUdeviceptr dstDevice, unsigned char uc, size_t N)
           Initializes device memory.
       CUresult cuMemsetD8Async (CUdeviceptr dstDevice, unsigned char uc, size_t N, CUstream hStream)
           Sets device memory.
       CUresult cuMipmappedArrayCreate (CUmipmappedArray *pHandle, const CUDA_ARRAY3D_DESCRIPTOR
           *pMipmappedArrayDesc, unsigned int numMipmapLevels)
           Creates a CUDA mipmapped array.
       CUresult cuMipmappedArrayDestroy (CUmipmappedArray hMipmappedArray)
           Destroys a CUDA mipmapped array.
       CUresult cuMipmappedArrayGetLevel (CUarray *pLevelArray, CUmipmappedArray hMipmappedArray, unsigned int
           level)
           Gets a mipmap level of a CUDA mipmapped array.

Detailed Description

       Memory management functions of the low-level CUDA driver API (cuda.h).

       This section describes the memory management functions of the low-level CUDA driver application
       programming interface.

Function Documentation

   CUresult cuArray3DCreate (CUarray * pHandle, const CUDA_ARRAY3D_DESCRIPTOR * pAllocateArray)
       Creates a CUDA array according to the CUDA_ARRAY3D_DESCRIPTOR structure pAllocateArray and returns a
       handle to the new CUDA array in *pHandle. The CUDA_ARRAY3D_DESCRIPTOR is defined as:

           typedef struct {
               unsigned int Width;
               unsigned int Height;
               unsigned int Depth;
               CUarray_format Format;
               unsigned int NumChannels;
               unsigned int Flags;
           } CUDA_ARRAY3D_DESCRIPTOR;

        where:

       • Width,  Height,  and  Depth  are  the  width,  height,  and  depth of the CUDA array (in elements); the
         following types of CUDA arrays can be allocated:

         • A 1D array is allocated if Height and Depth extents are both zero.

         • A 2D array is allocated if only Depth extent is zero.

         • A 3D array is allocated if all three extents are non-zero.

         • A 1D layered CUDA array is allocated if only Height is zero and the CUDA_ARRAY3D_LAYERED flag is set.
           Each layer is a 1D array. The number of layers is determined by the depth extent.

         • A 2D layered CUDA array is allocated if all three extents are non-zero and  the  CUDA_ARRAY3D_LAYERED
           flag is set. Each layer is a 2D array. The number of layers is determined by the depth extent.

         • A cubemap CUDA array is allocated if all three extents are non-zero and the CUDA_ARRAY3D_CUBEMAP flag
           is  set.  Width  must  be  equal  to Height, and Depth must be six. A cubemap is a special type of 2D
           layered CUDA array, where the six layers represent the six faces of a cube.  The  order  of  the  six
           layers in memory is the same as that listed in CUarray_cubemap_face.

         • A cubemap layered CUDA array is allocated if all three extents are non-zero and both the
           CUDA_ARRAY3D_CUBEMAP and CUDA_ARRAY3D_LAYERED flags are set. Width must be equal to Height, and Depth
           must be a multiple of six. A cubemap layered CUDA array is a special type of 2D layered CUDA array
           that consists of a collection of cubemaps. The first six layers represent the first cubemap, the next
           six layers form the second cubemap, and so on.

       • Format specifies the format of the elements; CUarray_format is defined as:

           typedef enum CUarray_format_enum {
               CU_AD_FORMAT_UNSIGNED_INT8 = 0x01,
               CU_AD_FORMAT_UNSIGNED_INT16 = 0x02,
               CU_AD_FORMAT_UNSIGNED_INT32 = 0x03,
               CU_AD_FORMAT_SIGNED_INT8 = 0x08,
               CU_AD_FORMAT_SIGNED_INT16 = 0x09,
               CU_AD_FORMAT_SIGNED_INT32 = 0x0a,
               CU_AD_FORMAT_HALF = 0x10,
               CU_AD_FORMAT_FLOAT = 0x20
           } CUarray_format;

       • NumChannels specifies the number of packed components per CUDA array element; it may be 1, 2, or 4;

       • Flags may be set to

         • CUDA_ARRAY3D_LAYERED  to enable creation of layered CUDA arrays. If this flag is set, Depth specifies
           the number of layers, not the depth of a 3D array.

         • CUDA_ARRAY3D_SURFACE_LDST to enable surface references to be bound to the CUDA array. If this flag is
           not set, cuSurfRefSetArray will fail when attempting to bind the CUDA array to a surface reference.

         • CUDA_ARRAY3D_CUBEMAP to enable creation of cubemaps. If this flag is set,  Width  must  be  equal  to
           Height,  and  Depth  must  be six. If the CUDA_ARRAY3D_LAYERED flag is also set, then Depth must be a
           multiple of six.

         • CUDA_ARRAY3D_TEXTURE_GATHER to indicate that the CUDA array will be used for texture gather.  Texture
           gather can only be performed on 2D CUDA arrays.
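       The extent and flag rules above can be sketched as a small classifier. This is an illustrative
       helper, not part of the driver API: the Extents3D struct, the ArrayKind enum and classify_array are
       hypothetical names, the two flag constants are local stand-ins whose values are assumed to match
       cuda.h, and treating unlisted combinations as invalid is an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Local stand-ins for the cuda.h flag bits (values assumed); real code
 * should include <cuda.h> and use the CUDA_ARRAY3D_* macros directly. */
enum {
    FLAG_LAYERED = 0x01, /* assumed value of CUDA_ARRAY3D_LAYERED */
    FLAG_CUBEMAP = 0x04  /* assumed value of CUDA_ARRAY3D_CUBEMAP */
};

/* Hypothetical mirror of the descriptor fields that decide the array type. */
typedef struct {
    unsigned int Width, Height, Depth, Flags;
} Extents3D;

typedef enum {
    KIND_INVALID, KIND_1D, KIND_2D, KIND_3D,
    KIND_1D_LAYERED, KIND_2D_LAYERED, KIND_CUBEMAP, KIND_CUBEMAP_LAYERED
} ArrayKind;

/* Apply the documented rules; anything the text does not list is
 * reported as KIND_INVALID (an assumption of this sketch). */
static ArrayKind classify_array(const Extents3D *e)
{
    assert(e != NULL);
    int layered = (e->Flags & FLAG_LAYERED) != 0;
    int cubemap = (e->Flags & FLAG_CUBEMAP) != 0;

    if (e->Width == 0)
        return KIND_INVALID;

    if (cubemap) {
        /* All extents non-zero and Width == Height are required. */
        if (e->Height == 0 || e->Depth == 0 || e->Width != e->Height)
            return KIND_INVALID;
        if (layered)
            return (e->Depth % 6 == 0) ? KIND_CUBEMAP_LAYERED : KIND_INVALID;
        return (e->Depth == 6) ? KIND_CUBEMAP : KIND_INVALID;
    }
    if (layered) {
        /* Depth is the layer count, so it must be non-zero. */
        if (e->Depth == 0)
            return KIND_INVALID;
        return (e->Height == 0) ? KIND_1D_LAYERED : KIND_2D_LAYERED;
    }
    if (e->Height == 0 && e->Depth == 0)
        return KIND_1D;
    if (e->Depth == 0)
        return KIND_2D;
    if (e->Height == 0)
        return KIND_INVALID; /* depth without height and no LAYERED flag */
    return KIND_3D;
}
```

       For instance, extents of 64 x 64 x 12 with both flags set would classify as a cubemap layered array
       holding two cubemaps.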

       Width, Height and Depth must meet certain size requirements as listed in the following table. All values
       are specified in elements. Note that for brevity's sake, the full name of the device attribute is not
       specified. For example, TEXTURE1D_WIDTH refers to the device attribute
       CU_DEVICE_ATTRIBUTE_MAXIMUM_TEXTURE1D_WIDTH.

       Note that 2D CUDA arrays have different size requirements if the CUDA_ARRAY3D_TEXTURE_GATHER flag is set.
       Width  and  Height  must  not  be  greater  than  CU_DEVICE_ATTRIBUTE_MAXIMUM_TEXTURE2D_GATHER_WIDTH  and
       CU_DEVICE_ATTRIBUTE_MAXIMUM_TEXTURE2D_GATHER_HEIGHT respectively, in that case.

       Each set of extents below is given as { (width range in elements), (height range), (depth range) }.
       "Always" lists the valid extents that must always be met; "With CUDA_ARRAY3D_SURFACE_LDST" lists the
       valid extents when that flag is set.

       CUDA array type: 1D
           Always:
               { (1,TEXTURE1D_WIDTH), 0, 0 }
           With CUDA_ARRAY3D_SURFACE_LDST:
               { (1,SURFACE1D_WIDTH), 0, 0 }

       CUDA array type: 2D
           Always:
               { (1,TEXTURE2D_WIDTH), (1,TEXTURE2D_HEIGHT), 0 }
           With CUDA_ARRAY3D_SURFACE_LDST:
               { (1,SURFACE2D_WIDTH), (1,SURFACE2D_HEIGHT), 0 }

       CUDA array type: 3D
           Always:
               { (1,TEXTURE3D_WIDTH), (1,TEXTURE3D_HEIGHT), (1,TEXTURE3D_DEPTH) }
               OR
               { (1,TEXTURE3D_WIDTH_ALTERNATE), (1,TEXTURE3D_HEIGHT_ALTERNATE), (1,TEXTURE3D_DEPTH_ALTERNATE) }
           With CUDA_ARRAY3D_SURFACE_LDST:
               { (1,SURFACE3D_WIDTH), (1,SURFACE3D_HEIGHT), (1,SURFACE3D_DEPTH) }

       CUDA array type: 1D Layered
           Always:
               { (1,TEXTURE1D_LAYERED_WIDTH), 0, (1,TEXTURE1D_LAYERED_LAYERS) }
           With CUDA_ARRAY3D_SURFACE_LDST:
               { (1,SURFACE1D_LAYERED_WIDTH), 0, (1,SURFACE1D_LAYERED_LAYERS) }

       CUDA array type: 2D Layered
           Always:
               { (1,TEXTURE2D_LAYERED_WIDTH), (1,TEXTURE2D_LAYERED_HEIGHT), (1,TEXTURE2D_LAYERED_LAYERS) }
           With CUDA_ARRAY3D_SURFACE_LDST:
               { (1,SURFACE2D_LAYERED_WIDTH), (1,SURFACE2D_LAYERED_HEIGHT), (1,SURFACE2D_LAYERED_LAYERS) }

       CUDA array type: Cubemap
           Always:
               { (1,TEXTURECUBEMAP_WIDTH), (1,TEXTURECUBEMAP_WIDTH), 6 }
           With CUDA_ARRAY3D_SURFACE_LDST:
               { (1,SURFACECUBEMAP_WIDTH), (1,SURFACECUBEMAP_WIDTH), 6 }

       CUDA array type: Cubemap Layered
           Always:
               { (1,TEXTURECUBEMAP_LAYERED_WIDTH), (1,TEXTURECUBEMAP_LAYERED_WIDTH),
                 (1,TEXTURECUBEMAP_LAYERED_LAYERS) }
           With CUDA_ARRAY3D_SURFACE_LDST:
               { (1,SURFACECUBEMAP_LAYERED_WIDTH), (1,SURFACECUBEMAP_LAYERED_WIDTH),
                 (1,SURFACECUBEMAP_LAYERED_LAYERS) }
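       As a rough illustration of how the 3D row of the table might be checked before calling
       cuArray3DCreate, the sketch below tests extents against two sets of limits and accepts either one.
       ExtentLimits, within and valid_3d_extents are hypothetical names; the sketch assumes the application
       has already queried the relevant device attributes, and it reads each (1,MAX) pair as the closed range
       from 1 to MAX.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical holder for limits the application has already queried,
 * e.g. the TEXTURE3D_* or TEXTURE3D_*_ALTERNATE device attributes. */
typedef struct {
    unsigned int width, height, depth;
} ExtentLimits;

/* True when every extent lies in the closed range [1, limit]. */
static bool within(unsigned int w, unsigned int h, unsigned int d,
                   const ExtentLimits *lim)
{
    assert(lim != NULL);
    return w >= 1 && w <= lim->width &&
           h >= 1 && h <= lim->height &&
           d >= 1 && d <= lim->depth;
}

/* A 3D array passes if it fits either the primary or the alternate limits,
 * mirroring the "OR" in the table. */
static bool valid_3d_extents(unsigned int w, unsigned int h, unsigned int d,
                             const ExtentLimits *primary,
                             const ExtentLimits *alternate)
{
    return within(w, h, d, primary) || within(w, h, d, alternate);
}
```

       The same within helper could be reused for the surface limits when CUDA_ARRAY3D_SURFACE_LDST is set,
       in which case both the texture and the surface check would need to pass.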

       Here are examples of CUDA array descriptions:

       Description for a CUDA array of 2048 floats:

           CUDA_ARRAY3D_DESCRIPTOR desc;
           desc.Format = CU_AD_FORMAT_FLOAT;
           desc.NumChannels = 1;
           desc.Width = 2048;
           desc.Height = 0;
           desc.Depth = 0;

       Description for a 64 x 64 CUDA array of floats:

           CUDA_ARRAY3D_DESCRIPTOR desc;
           desc.Format = CU_AD_FORMAT_FLOAT;
           desc.NumChannels = 1;
           desc.Width = 64;
           desc.Height = 64;
           desc.Depth = 0;

       Description for a width x height x depth CUDA array of 64-bit, 4x16-bit float16's:

           CUDA_ARRAY3D_DESCRIPTOR desc;
           desc.Format = CU_AD_FORMAT_HALF;
           desc.NumChannels = 4;
           desc.Width = width;
           desc.Height = height;
           desc.Depth = depth;

       Parameters:
           pHandle - Returned array
           pAllocateArray - 3D array descriptor

       Returns:
           CUDA_SUCCESS,   CUDA_ERROR_DEINITIALIZED,   CUDA_ERROR_NOT_INITIALIZED,   CUDA_ERROR_INVALID_CONTEXT,
           CUDA_ERROR_INVALID_VALUE, CUDA_ERROR_OUT_OF_MEMORY, CUDA_ERROR_UNKNOWN

       Note:
           Note that this function may also return error codes from previous, asynchronous launches.

       See also:
           cuArray3DGetDescriptor,    cuArrayCreate,    cuArrayDestroy,    cuArrayGetDescriptor,     cuMemAlloc,
           cuMemAllocHost,   cuMemAllocPitch,   cuMemcpy2D,  cuMemcpy2DAsync,  cuMemcpy2DUnaligned,  cuMemcpy3D,
           cuMemcpy3DAsync,   cuMemcpyAtoA,   cuMemcpyAtoD,   cuMemcpyAtoH,   cuMemcpyAtoHAsync,   cuMemcpyDtoA,
           cuMemcpyDtoD,  cuMemcpyDtoDAsync,  cuMemcpyDtoH,  cuMemcpyDtoHAsync, cuMemcpyHtoA, cuMemcpyHtoAAsync,
           cuMemcpyHtoD,  cuMemcpyHtoDAsync,  cuMemFree,  cuMemFreeHost,   cuMemGetAddressRange,   cuMemGetInfo,
           cuMemHostAlloc,  cuMemHostGetDevicePointer,  cuMemsetD2D8,  cuMemsetD2D16, cuMemsetD2D32, cuMemsetD8,
           cuMemsetD16, cuMemsetD32, cudaMalloc3DArray

   CUresult cuArray3DGetDescriptor (CUDA_ARRAY3D_DESCRIPTOR * pArrayDescriptor, CUarray hArray)
       Returns in *pArrayDescriptor a descriptor containing information on the format and dimensions of the CUDA
       array hArray. It is useful for subroutines that have been passed a CUDA array, but need to know the  CUDA
       array parameters for validation or other purposes.

       This  function  may  be  called on 1D and 2D arrays, in which case the Height and/or Depth members of the
       descriptor struct will be set to 0.

       Parameters:
           pArrayDescriptor - Returned 3D array descriptor
           hArray - 3D array to get descriptor of

       Returns:
           CUDA_SUCCESS,   CUDA_ERROR_DEINITIALIZED,   CUDA_ERROR_NOT_INITIALIZED,   CUDA_ERROR_INVALID_CONTEXT,
           CUDA_ERROR_INVALID_VALUE, CUDA_ERROR_INVALID_HANDLE, CUDA_ERROR_CONTEXT_IS_DESTROYED

       Note:
           Note that this function may also return error codes from previous, asynchronous launches.

       See also:
           cuArray3DCreate,  cuArrayCreate,  cuArrayDestroy,  cuArrayGetDescriptor,  cuMemAlloc, cuMemAllocHost,
           cuMemAllocPitch,  cuMemcpy2D,  cuMemcpy2DAsync,  cuMemcpy2DUnaligned,  cuMemcpy3D,   cuMemcpy3DAsync,
           cuMemcpyAtoA,    cuMemcpyAtoD,    cuMemcpyAtoH,    cuMemcpyAtoHAsync,   cuMemcpyDtoA,   cuMemcpyDtoD,
           cuMemcpyDtoDAsync, cuMemcpyDtoH, cuMemcpyDtoHAsync,  cuMemcpyHtoA,  cuMemcpyHtoAAsync,  cuMemcpyHtoD,
           cuMemcpyHtoDAsync,  cuMemFree,  cuMemFreeHost,  cuMemGetAddressRange,  cuMemGetInfo,  cuMemHostAlloc,
           cuMemHostGetDevicePointer,  cuMemsetD2D8,  cuMemsetD2D16,  cuMemsetD2D32,  cuMemsetD8,   cuMemsetD16,
           cuMemsetD32, cudaArrayGetInfo

   CUresult cuArrayCreate (CUarray * pHandle, const CUDA_ARRAY_DESCRIPTOR * pAllocateArray)
       Creates a CUDA array according to the CUDA_ARRAY_DESCRIPTOR structure pAllocateArray and returns a handle
       to the new CUDA array in *pHandle. The CUDA_ARRAY_DESCRIPTOR is defined as:

           typedef struct {
               unsigned int Width;
               unsigned int Height;
               CUarray_format Format;
               unsigned int NumChannels;
           } CUDA_ARRAY_DESCRIPTOR;

        where:

       • Width and Height are the width and height of the CUDA array (in elements); the CUDA array is
         one-dimensional if Height is 0, two-dimensional otherwise;

       • Format specifies the format of the elements; CUarray_format is defined as:

           typedef enum CUarray_format_enum {
               CU_AD_FORMAT_UNSIGNED_INT8 = 0x01,
               CU_AD_FORMAT_UNSIGNED_INT16 = 0x02,
               CU_AD_FORMAT_UNSIGNED_INT32 = 0x03,
               CU_AD_FORMAT_SIGNED_INT8 = 0x08,
               CU_AD_FORMAT_SIGNED_INT16 = 0x09,
               CU_AD_FORMAT_SIGNED_INT32 = 0x0a,
               CU_AD_FORMAT_HALF = 0x10,
               CU_AD_FORMAT_FLOAT = 0x20
           } CUarray_format;

       • NumChannels specifies the number of packed components per CUDA array element; it may be 1, 2, or 4;

       Here are examples of CUDA array descriptions:

       Description for a CUDA array of 2048 floats:

           CUDA_ARRAY_DESCRIPTOR desc;
           desc.Format = CU_AD_FORMAT_FLOAT;
           desc.NumChannels = 1;
           desc.Width = 2048;
           desc.Height = 1;

       Description for a 64 x 64 CUDA array of floats:

           CUDA_ARRAY_DESCRIPTOR desc;
           desc.Format = CU_AD_FORMAT_FLOAT;
           desc.NumChannels = 1;
           desc.Width = 64;
           desc.Height = 64;

       Description for a width x height CUDA array of 64-bit, 4x16-bit float16's:

           CUDA_ARRAY_DESCRIPTOR desc;
           desc.Format = CU_AD_FORMAT_HALF;
           desc.NumChannels = 4;
           desc.Width = width;
           desc.Height = height;

       Description for a width x height CUDA array of 16-bit elements, each  of  which  is  two  8-bit  unsigned
       chars:

           CUDA_ARRAY_DESCRIPTOR desc;
           desc.Format = CU_AD_FORMAT_UNSIGNED_INT8;
           desc.NumChannels = 2;
           desc.Width = width;
           desc.Height = height;
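       The element sizes quoted in these examples (64 bits for four half-precision channels, 16 bits for two
       unsigned chars) follow from multiplying the per-channel size by NumChannels. A minimal sketch of that
       arithmetic is shown below; ArrayFormat, format_bytes and element_bytes are hypothetical names, and the
       enum values are copied from the CUarray_format listing above so that the snippet compiles without
       cuda.h.

```c
#include <assert.h>
#include <stddef.h>

/* Values copied from the CUarray_format listing; real code would include
 * <cuda.h> and use CUarray_format instead of this stand-in. */
typedef enum {
    FMT_UNSIGNED_INT8  = 0x01,
    FMT_UNSIGNED_INT16 = 0x02,
    FMT_UNSIGNED_INT32 = 0x03,
    FMT_SIGNED_INT8    = 0x08,
    FMT_SIGNED_INT16   = 0x09,
    FMT_SIGNED_INT32   = 0x0a,
    FMT_HALF           = 0x10,
    FMT_FLOAT          = 0x20
} ArrayFormat;

/* Size in bytes of one channel; returns 0 for an unrecognized format. */
static size_t format_bytes(ArrayFormat f)
{
    switch (f) {
    case FMT_UNSIGNED_INT8:
    case FMT_SIGNED_INT8:
        return 1;
    case FMT_UNSIGNED_INT16:
    case FMT_SIGNED_INT16:
    case FMT_HALF:
        return 2;
    case FMT_UNSIGNED_INT32:
    case FMT_SIGNED_INT32:
    case FMT_FLOAT:
        return 4;
    }
    return 0;
}

/* Size in bytes of one array element: channel size times channel count. */
static size_t element_bytes(ArrayFormat f, unsigned int channels)
{
    assert(channels == 1 || channels == 2 || channels == 4);
    return format_bytes(f) * channels;
}
```

       With these definitions, element_bytes(FMT_HALF, 4) evaluates to 8 bytes and
       element_bytes(FMT_UNSIGNED_INT8, 2) to 2 bytes, matching the two descriptions above.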

       Parameters:
           pHandle - Returned array
           pAllocateArray - Array descriptor

       Returns:
           CUDA_SUCCESS,   CUDA_ERROR_DEINITIALIZED,   CUDA_ERROR_NOT_INITIALIZED,   CUDA_ERROR_INVALID_CONTEXT,
           CUDA_ERROR_INVALID_VALUE, CUDA_ERROR_OUT_OF_MEMORY, CUDA_ERROR_UNKNOWN

       Note:
           Note that this function may also return error codes from previous, asynchronous launches.

       See also:
           cuArray3DCreate,   cuArray3DGetDescriptor,    cuArrayDestroy,    cuArrayGetDescriptor,    cuMemAlloc,
           cuMemAllocHost,   cuMemAllocPitch,   cuMemcpy2D,  cuMemcpy2DAsync,  cuMemcpy2DUnaligned,  cuMemcpy3D,
           cuMemcpy3DAsync,   cuMemcpyAtoA,   cuMemcpyAtoD,   cuMemcpyAtoH,   cuMemcpyAtoHAsync,   cuMemcpyDtoA,
           cuMemcpyDtoD,  cuMemcpyDtoDAsync,  cuMemcpyDtoH,  cuMemcpyDtoHAsync, cuMemcpyHtoA, cuMemcpyHtoAAsync,
           cuMemcpyHtoD,  cuMemcpyHtoDAsync,  cuMemFree,  cuMemFreeHost,   cuMemGetAddressRange,   cuMemGetInfo,
           cuMemHostAlloc,  cuMemHostGetDevicePointer,  cuMemsetD2D8,  cuMemsetD2D16, cuMemsetD2D32, cuMemsetD8,
           cuMemsetD16, cuMemsetD32, cudaMallocArray

   CUresult cuArrayDestroy (CUarray hArray)
       Destroys the CUDA array hArray.

       Parameters:
           hArray - Array to destroy

       Returns:
           CUDA_SUCCESS,   CUDA_ERROR_DEINITIALIZED,   CUDA_ERROR_NOT_INITIALIZED,   CUDA_ERROR_INVALID_CONTEXT,
           CUDA_ERROR_INVALID_HANDLE, CUDA_ERROR_ARRAY_IS_MAPPED, CUDA_ERROR_CONTEXT_IS_DESTROYED

       Note:
           Note that this function may also return error codes from previous, asynchronous launches.

       See also:
           cuArray3DCreate,    cuArray3DGetDescriptor,    cuArrayCreate,    cuArrayGetDescriptor,    cuMemAlloc,
           cuMemAllocHost,  cuMemAllocPitch,  cuMemcpy2D,  cuMemcpy2DAsync,   cuMemcpy2DUnaligned,   cuMemcpy3D,
           cuMemcpy3DAsync,   cuMemcpyAtoA,   cuMemcpyAtoD,   cuMemcpyAtoH,   cuMemcpyAtoHAsync,   cuMemcpyDtoA,
           cuMemcpyDtoD, cuMemcpyDtoDAsync, cuMemcpyDtoH,  cuMemcpyDtoHAsync,  cuMemcpyHtoA,  cuMemcpyHtoAAsync,
           cuMemcpyHtoD,   cuMemcpyHtoDAsync,   cuMemFree,  cuMemFreeHost,  cuMemGetAddressRange,  cuMemGetInfo,
           cuMemHostAlloc, cuMemHostGetDevicePointer, cuMemsetD2D8,  cuMemsetD2D16,  cuMemsetD2D32,  cuMemsetD8,
           cuMemsetD16, cuMemsetD32, cudaFreeArray

   CUresult cuArrayGetDescriptor (CUDA_ARRAY_DESCRIPTOR * pArrayDescriptor, CUarray hArray)
       Returns in *pArrayDescriptor a descriptor containing information on the format and dimensions of the CUDA
       array  hArray. It is useful for subroutines that have been passed a CUDA array, but need to know the CUDA
       array parameters for validation or other purposes.

       Parameters:
           pArrayDescriptor - Returned array descriptor
           hArray - Array to get descriptor of

       Returns:
           CUDA_SUCCESS,   CUDA_ERROR_DEINITIALIZED,   CUDA_ERROR_NOT_INITIALIZED,   CUDA_ERROR_INVALID_CONTEXT,
           CUDA_ERROR_INVALID_VALUE, CUDA_ERROR_INVALID_HANDLE

       Note:
           Note that this function may also return error codes from previous, asynchronous launches.

       See also:
           cuArray3DCreate,  cuArray3DGetDescriptor,  cuArrayCreate, cuArrayDestroy, cuMemAlloc, cuMemAllocHost,
           cuMemAllocPitch,  cuMemcpy2D,  cuMemcpy2DAsync,  cuMemcpy2DUnaligned,  cuMemcpy3D,   cuMemcpy3DAsync,
           cuMemcpyAtoA,    cuMemcpyAtoD,    cuMemcpyAtoH,    cuMemcpyAtoHAsync,   cuMemcpyDtoA,   cuMemcpyDtoD,
           cuMemcpyDtoDAsync, cuMemcpyDtoH, cuMemcpyDtoHAsync,  cuMemcpyHtoA,  cuMemcpyHtoAAsync,  cuMemcpyHtoD,
           cuMemcpyHtoDAsync,  cuMemFree,  cuMemFreeHost,  cuMemGetAddressRange,  cuMemGetInfo,  cuMemHostAlloc,
           cuMemHostGetDevicePointer,  cuMemsetD2D8,  cuMemsetD2D16,  cuMemsetD2D32,  cuMemsetD8,   cuMemsetD16,
           cuMemsetD32, cudaArrayGetInfo

   CUresult cuDeviceGetByPCIBusId (CUdevice * dev, const char * pciBusId)
       Returns in *dev a device handle given a PCI bus ID string.

       Parameters:
           dev - Returned device handle
           pciBusId - String in one of the following forms:
               [domain]:[bus]:[device].[function]
               [domain]:[bus]:[device]
               [bus]:[device].[function]
           where domain, bus, device, and function are all hexadecimal values

       Returns:
           CUDA_SUCCESS,    CUDA_ERROR_DEINITIALIZED,    CUDA_ERROR_NOT_INITIALIZED,   CUDA_ERROR_INVALID_VALUE,
           CUDA_ERROR_INVALID_DEVICE

       Note:
           Note that this function may also return error codes from previous, asynchronous launches.

       See also:
           cuDeviceGet, cuDeviceGetAttribute, cuDeviceGetPCIBusId, cudaDeviceGetByPCIBusId
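       To make the three accepted pciBusId forms concrete, the following sketch parses each of them with
       sscanf. It is not how the driver itself parses the string; PciAddr and parse_pci_bus_id are
       hypothetical names, and filling an omitted domain or function with zero is an assumption of this
       sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical container for the four hexadecimal fields. */
typedef struct {
    unsigned int domain, bus, device, function;
} PciAddr;

/* Parse one of the three documented forms. Returns 0 on success, -1 if the
 * string matches none of them. The trailing %c detects leftover characters:
 * a clean match leaves nothing for it to read. */
static int parse_pci_bus_id(const char *s, PciAddr *out)
{
    unsigned int a, b, c, d;
    char tail;

    assert(s != NULL && out != NULL);

    /* [domain]:[bus]:[device].[function] */
    if (sscanf(s, "%x:%x:%x.%x%c", &a, &b, &c, &d, &tail) == 4) {
        out->domain = a; out->bus = b; out->device = c; out->function = d;
        return 0;
    }
    /* [domain]:[bus]:[device] (function assumed to be 0) */
    if (sscanf(s, "%x:%x:%x%c", &a, &b, &c, &tail) == 3) {
        out->domain = a; out->bus = b; out->device = c; out->function = 0;
        return 0;
    }
    /* [bus]:[device].[function] (domain assumed to be 0) */
    if (sscanf(s, "%x:%x.%x%c", &a, &b, &c, &tail) == 3) {
        out->domain = 0; out->bus = a; out->device = b; out->function = c;
        return 0;
    }
    return -1;
}
```

       The forms are tried from most to least specific so that a full four-field string is never mistaken
       for one of the shorter variants.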

   CUresult cuDeviceGetPCIBusId (char * pciBusId, int len, CUdevice dev)
       Returns an ASCII string identifying the device dev in the NULL-terminated string pointed to by  pciBusId.
       len specifies the maximum length of the string that may be returned.

       Parameters:
           pciBusId - Returned identifier string for the device in the following format:
           [domain]:[bus]:[device].[function], where domain, bus, device, and function are all hexadecimal
           values. pciBusId should be large enough to store 13 characters including the NULL-terminator.
           len - Maximum length of string to store in pciBusId
           dev - Device to get identifier string for

       Returns:
           CUDA_SUCCESS,    CUDA_ERROR_DEINITIALIZED,    CUDA_ERROR_NOT_INITIALIZED,   CUDA_ERROR_INVALID_VALUE,
           CUDA_ERROR_INVALID_DEVICE

       Note:
           Note that this function may also return error codes from previous, asynchronous launches.

       See also:
           cuDeviceGet, cuDeviceGetAttribute, cuDeviceGetByPCIBusId, cudaDeviceGetPCIBusId
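       The 13-character requirement can be illustrated with a small formatter. The field widths used here
       (four hex digits for the domain, two each for bus and device, one for the function) are an assumption
       chosen to be consistent with that requirement, not a statement of what the driver emits;
       format_pci_bus_id and PCI_BUS_ID_LEN are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* 12 visible characters ("dddd:bb:dd.f") plus the terminator. */
enum { PCI_BUS_ID_LEN = 13 };

/* Write the identifier into buf, which holds len bytes. Returns 0 on
 * success, or -1 if the buffer is too small for the whole string. */
static int format_pci_bus_id(char *buf, int len, unsigned int domain,
                             unsigned int bus, unsigned int device,
                             unsigned int function)
{
    assert(buf != NULL);
    if (len <= 0)
        return -1;
    int n = snprintf(buf, (size_t)len, "%04x:%02x:%02x.%x",
                     domain, bus, device, function);
    /* snprintf reports the length it wanted to write; n >= len means the
     * output was truncated. */
    return (n >= 0 && n < len) ? 0 : -1;
}
```

       A caller would typically declare char id[PCI_BUS_ID_LEN] and pass sizeof id as len, the same pattern
       that applies to the pciBusId and len arguments of cuDeviceGetPCIBusId.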

   CUresult cuIpcCloseMemHandle (CUdeviceptr dptr)
       Unmaps memory returned by cuIpcOpenMemHandle. The original allocation in the exporting process as well as
       imported mappings in other processes will be unaffected.

       Any resources used to enable peer access will be freed if this is the last mapping using them.

       IPC functionality is restricted to devices with support for unified addressing on Linux and Windows
       operating systems. IPC functionality on Windows is restricted to GPUs in TCC mode.

       Parameters:
           dptr - Device pointer returned by cuIpcOpenMemHandle

       Returns:
           CUDA_SUCCESS,     CUDA_ERROR_INVALID_CONTEXT,    CUDA_ERROR_MAP_FAILED,    CUDA_ERROR_INVALID_HANDLE,
           CUDA_ERROR_INVALID_VALUE

       See also:
           cuMemAlloc,     cuMemFree,     cuIpcGetEventHandle,     cuIpcOpenEventHandle,      cuIpcGetMemHandle,
           cuIpcOpenMemHandle, cudaIpcCloseMemHandle

   CUresult cuIpcGetEventHandle (CUipcEventHandle * pHandle, CUevent event)
       Takes   as   input   a   previously  allocated  event.  This  event  must  have  been  created  with  the
       CU_EVENT_INTERPROCESS and CU_EVENT_DISABLE_TIMING flags set. This opaque handle may be copied into  other
       processes  and  opened  with cuIpcOpenEventHandle to allow efficient hardware synchronization between GPU
       work in different processes.

       After  the  event  has  been  opened  in  the  importing  process,   cuEventRecord,   cuEventSynchronize,
       cuStreamWaitEvent  and  cuEventQuery may be used in either process. Performing operations on the imported
       event after the exported event has been freed with cuEventDestroy will result in undefined behavior.

       IPC functionality is restricted to devices with support for unified addressing on Linux and Windows
       operating systems. IPC functionality on Windows is restricted to GPUs in TCC mode.

       Parameters:
           pHandle - Pointer to a user allocated CUipcEventHandle in which to return the opaque event handle
           event - Event allocated with CU_EVENT_INTERPROCESS and CU_EVENT_DISABLE_TIMING flags.

       Returns:
           CUDA_SUCCESS,     CUDA_ERROR_INVALID_HANDLE,     CUDA_ERROR_OUT_OF_MEMORY,     CUDA_ERROR_MAP_FAILED,
           CUDA_ERROR_INVALID_VALUE

       See also:
           cuEventCreate,     cuEventDestroy,     cuEventSynchronize,      cuEventQuery,      cuStreamWaitEvent,
           cuIpcOpenEventHandle,        cuIpcGetMemHandle,        cuIpcOpenMemHandle,       cuIpcCloseMemHandle,
           cudaIpcGetEventHandle
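       Because the handle is opaque and may simply be copied into another process, one way to transport it is
       as raw bytes over any byte stream. The sketch below shows that idea with POSIX read and write on a
       file descriptor; it deliberately avoids cuda.h, so OpaqueIpcHandle is a stand-in for CUipcEventHandle
       (or CUipcMemHandle) and its 64-byte size is an assumption. send_handle and recv_handle are
       hypothetical helpers, and real code would take the size from sizeof on the cuda.h type.

```c
#include <assert.h>
#include <stddef.h>
#include <unistd.h>

/* Stand-in for an opaque IPC handle; the size is assumed for this sketch. */
typedef struct {
    char reserved[64];
} OpaqueIpcHandle;

/* Write the whole handle to fd, retrying on short writes.
 * Returns 0 on success, -1 on error. */
static int send_handle(int fd, const OpaqueIpcHandle *h)
{
    assert(h != NULL);
    const char *p = (const char *)h;
    size_t left = sizeof *h;
    while (left > 0) {
        ssize_t n = write(fd, p, left);
        if (n <= 0)
            return -1;
        p += n;
        left -= (size_t)n;
    }
    return 0;
}

/* Read exactly one handle from fd, retrying on short reads.
 * Returns 0 on success, -1 on error or early end of stream. */
static int recv_handle(int fd, OpaqueIpcHandle *h)
{
    assert(h != NULL);
    char *p = (char *)h;
    size_t left = sizeof *h;
    while (left > 0) {
        ssize_t n = read(fd, p, left);
        if (n <= 0)
            return -1;
        p += n;
        left -= (size_t)n;
    }
    return 0;
}
```

       In an actual application the exporting process would fill the handle with cuIpcGetEventHandle before
       sending it, and the importing process would pass the received bytes to cuIpcOpenEventHandle; the file
       descriptor could be a pipe or a Unix domain socket shared by the two processes.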

   CUresult cuIpcGetMemHandle (CUipcMemHandle * pHandle, CUdeviceptr dptr)
       Takes a pointer to the base of an existing device memory allocation created with cuMemAlloc  and  exports
       it  for  use  in  another process. This is a lightweight operation and may be called multiple times on an
       allocation without adverse effects.

       If a region of memory is freed with cuMemFree and a subsequent call to cuMemAlloc returns memory with the
       same device address, cuIpcGetMemHandle will return a unique handle for the new memory.

       IPC functionality is restricted to devices with support for unified addressing on Linux and Windows
       operating systems. IPC functionality on Windows is restricted to GPUs in TCC mode.

       Parameters:
           pHandle - Pointer to user allocated CUipcMemHandle to return the handle in.
           dptr - Base pointer to previously allocated device memory

       Returns:
           CUDA_SUCCESS,     CUDA_ERROR_INVALID_HANDLE,     CUDA_ERROR_OUT_OF_MEMORY,     CUDA_ERROR_MAP_FAILED,
           CUDA_ERROR_INVALID_VALUE

       See also:
           cuMemAlloc,     cuMemFree,     cuIpcGetEventHandle,     cuIpcOpenEventHandle,     cuIpcOpenMemHandle,
           cuIpcCloseMemHandle, cudaIpcGetMemHandle

   CUresult cuIpcOpenEventHandle (CUevent * phEvent, CUipcEventHandle handle)
       Opens  an interprocess event handle exported from another process with cuIpcGetEventHandle. This function
       returns a CUevent that behaves like  a  locally  created  event  with  the  CU_EVENT_DISABLE_TIMING  flag
       specified. This event must be freed with cuEventDestroy.

       Performing  operations  on the imported event after the exported event has been freed with cuEventDestroy
       will result in undefined behavior.

       IPC functionality is restricted to devices with support for  unified  addressing  on  Linux  and  Windows
       operating systems. IPC functionality on Windows is restricted to GPUs in TCC mode.

       Parameters:
           phEvent - Returns the imported event
           handle - Interprocess handle to open

       Returns:
           CUDA_SUCCESS,  CUDA_ERROR_INVALID_CONTEXT, CUDA_ERROR_MAP_FAILED, CUDA_ERROR_PEER_ACCESS_UNSUPPORTED,
           CUDA_ERROR_INVALID_HANDLE, CUDA_ERROR_INVALID_VALUE

       See also:
           cuEventCreate,     cuEventDestroy,     cuEventSynchronize,      cuEventQuery,      cuStreamWaitEvent,
           cuIpcGetEventHandle,        cuIpcGetMemHandle,        cuIpcOpenMemHandle,        cuIpcCloseMemHandle,
           cudaIpcOpenEventHandle

   CUresult cuIpcOpenMemHandle (CUdeviceptr * pdptr, CUipcMemHandle handle, unsigned int Flags)
       Maps memory exported from another process with cuIpcGetMemHandle into the current device  address  space.
       For  contexts  on  different  devices  cuIpcOpenMemHandle  can  attempt to enable peer access between the
       devices  as  if  the  user  called  cuCtxEnablePeerAccess.   This   behavior   is   controlled   by   the
       CU_IPC_MEM_LAZY_ENABLE_PEER_ACCESS flag. cuDeviceCanAccessPeer can determine if a mapping is possible.

       cuIpcOpenMemHandle can open handles to devices that may not be visible in the process calling the API.

       Contexts  that  may  open  CUipcMemHandles are restricted in the following way. CUipcMemHandles from each
       CUdevice in a given process may only be opened by one CUcontext per CUdevice per other process.

       Memory returned from cuIpcOpenMemHandle must be freed with cuIpcCloseMemHandle.

       Calling cuMemFree on an exported memory  region  before  calling  cuIpcCloseMemHandle  in  the  importing
       context will result in undefined behavior.

       IPC  functionality  is  restricted  to  devices  with support for unified addressing on Linux and Windows
       operating systems. IPC functionality on Windows is restricted to GPUs in TCC mode.

       Parameters:
           pdptr - Returned device pointer
           handle - CUipcMemHandle to open
           Flags - Flags for this operation. Must be specified as CU_IPC_MEM_LAZY_ENABLE_PEER_ACCESS

       Returns:
           CUDA_SUCCESS,    CUDA_ERROR_INVALID_CONTEXT,    CUDA_ERROR_MAP_FAILED,     CUDA_ERROR_INVALID_HANDLE,
           CUDA_ERROR_TOO_MANY_PEERS, CUDA_ERROR_INVALID_VALUE

       Note:
           No  guarantees  are  made about the address returned in *pdptr. In particular, multiple processes may
           not receive the same address for the same handle.

       See also:
           cuMemAlloc,     cuMemFree,     cuIpcGetEventHandle,     cuIpcOpenEventHandle,      cuIpcGetMemHandle,
           cuIpcCloseMemHandle, cuCtxEnablePeerAccess, cuDeviceCanAccessPeer, cudaIpcOpenMemHandle

   CUresult cuMemAlloc (CUdeviceptr * dptr, size_t bytesize)
       Allocates  bytesize  bytes of linear memory on the device and returns in *dptr a pointer to the allocated
       memory. The allocated memory is suitably aligned for any kind of variable. The memory is not cleared.  If
       bytesize is 0, cuMemAlloc() returns CUDA_ERROR_INVALID_VALUE.

       Parameters:
           dptr - Returned device pointer
           bytesize - Requested allocation size in bytes

       Returns:
           CUDA_SUCCESS,   CUDA_ERROR_DEINITIALIZED,   CUDA_ERROR_NOT_INITIALIZED,   CUDA_ERROR_INVALID_CONTEXT,
           CUDA_ERROR_INVALID_VALUE, CUDA_ERROR_OUT_OF_MEMORY

       Note:
           Note that this function may also return error codes from previous, asynchronous launches.

       See also:
           cuArray3DCreate,   cuArray3DGetDescriptor,   cuArrayCreate,   cuArrayDestroy,   cuArrayGetDescriptor,
           cuMemAllocHost,   cuMemAllocPitch,   cuMemcpy2D,  cuMemcpy2DAsync,  cuMemcpy2DUnaligned,  cuMemcpy3D,
           cuMemcpy3DAsync,   cuMemcpyAtoA,   cuMemcpyAtoD,   cuMemcpyAtoH,   cuMemcpyAtoHAsync,   cuMemcpyDtoA,
           cuMemcpyDtoD,  cuMemcpyDtoDAsync,  cuMemcpyDtoH,  cuMemcpyDtoHAsync, cuMemcpyHtoA, cuMemcpyHtoAAsync,
           cuMemcpyHtoD,  cuMemcpyHtoDAsync,  cuMemFree,  cuMemFreeHost,   cuMemGetAddressRange,   cuMemGetInfo,
           cuMemHostAlloc,  cuMemHostGetDevicePointer,  cuMemsetD2D8,  cuMemsetD2D16, cuMemsetD2D32, cuMemsetD8,
           cuMemsetD16, cuMemsetD32, cudaMalloc

   CUresult cuMemAllocHost (void ** pp, size_t bytesize)
       Allocates bytesize bytes of host memory that is page-locked and accessible  to  the  device.  The  driver
       tracks  the  virtual  memory  ranges  allocated with this function and automatically accelerates calls to
       functions such as cuMemcpy(). Since the memory can be accessed directly by the device, it can be read  or
       written  with  much  higher  bandwidth  than  pageable  memory  obtained with functions such as malloc().
       Allocating excessive amounts of memory with cuMemAllocHost() may degrade  system  performance,  since  it
       reduces  the amount of memory available to the system for paging. As a result, this function is best used
       sparingly to allocate staging areas for data exchange between host and device.

       Note that all host memory allocated using cuMemHostAlloc() will automatically be immediately accessible to all
       contexts   on   all   devices   which   support   unified   addressing   (as   may   be   queried   using
       CU_DEVICE_ATTRIBUTE_UNIFIED_ADDRESSING).  The  device pointer that may be used to access this host memory
       from those contexts is always equal to  the  returned  host  pointer  *pp.  See  Unified  Addressing  for
       additional details.

       Parameters:
           pp - Returned host pointer to page-locked memory
           bytesize - Requested allocation size in bytes

       Returns:
           CUDA_SUCCESS,   CUDA_ERROR_DEINITIALIZED,   CUDA_ERROR_NOT_INITIALIZED,   CUDA_ERROR_INVALID_CONTEXT,
           CUDA_ERROR_INVALID_VALUE, CUDA_ERROR_OUT_OF_MEMORY

       Note:
           Note that this function may also return error codes from previous, asynchronous launches.

       See also:
           cuArray3DCreate,   cuArray3DGetDescriptor,   cuArrayCreate,   cuArrayDestroy,   cuArrayGetDescriptor,
           cuMemAlloc,    cuMemAllocPitch,   cuMemcpy2D,   cuMemcpy2DAsync,   cuMemcpy2DUnaligned,   cuMemcpy3D,
           cuMemcpy3DAsync,   cuMemcpyAtoA,   cuMemcpyAtoD,   cuMemcpyAtoH,   cuMemcpyAtoHAsync,   cuMemcpyDtoA,
           cuMemcpyDtoD,  cuMemcpyDtoDAsync,  cuMemcpyDtoH,  cuMemcpyDtoHAsync, cuMemcpyHtoA, cuMemcpyHtoAAsync,
           cuMemcpyHtoD,  cuMemcpyHtoDAsync,  cuMemFree,  cuMemFreeHost,   cuMemGetAddressRange,   cuMemGetInfo,
           cuMemHostAlloc,  cuMemHostGetDevicePointer,  cuMemsetD2D8,  cuMemsetD2D16, cuMemsetD2D32, cuMemsetD8,
           cuMemsetD16, cuMemsetD32, cudaMallocHost

   CUresult cuMemAllocManaged (CUdeviceptr * dptr, size_t bytesize, unsigned int flags)
       Allocates bytesize bytes of managed memory on the device and returns in *dptr a pointer to the  allocated
       memory.  If  the  device doesn't support allocating managed memory, CUDA_ERROR_NOT_SUPPORTED is returned.
       Support for managed memory can be queried using the device attribute  CU_DEVICE_ATTRIBUTE_MANAGED_MEMORY.
       The allocated memory is suitably aligned for any kind of variable. The memory is not cleared. If bytesize
       is 0, cuMemAllocManaged returns CUDA_ERROR_INVALID_VALUE. The pointer is valid on the CPU and on all GPUs
       in  the  system  that  support  managed memory. All accesses to this pointer must obey the Unified Memory
       programming model.

       flags  specifies  the  default  stream  association  for  this  allocation.  flags   must   be   one   of
       CU_MEM_ATTACH_GLOBAL  or  CU_MEM_ATTACH_HOST.  If  CU_MEM_ATTACH_GLOBAL is specified, then this memory is
       accessible from any stream on any device. If CU_MEM_ATTACH_HOST is specified, then the allocation  should
       not    be    accessed    from   devices   that   have   a   zero   value   for   the   device   attribute
       CU_DEVICE_ATTRIBUTE_CONCURRENT_MANAGED_ACCESS;  an  explicit  call  to  cuStreamAttachMemAsync  will   be
       required to enable access on such devices.

       If  the  association  is  later  changed  via  cuStreamAttachMemAsync  to  a  single  stream, the default
       association as specified during  cuMemAllocManaged  is  restored  when  that  stream  is  destroyed.  For
       __managed__  variables,  the  default  association is always CU_MEM_ATTACH_GLOBAL. Note that destroying a
       stream is an asynchronous operation, and as a result, the change  to  default  association  won't  happen
       until all work in the stream has completed.

       Memory allocated with cuMemAllocManaged should be released with cuMemFree.

       Device  memory  oversubscription is possible for GPUs that have a non-zero value for the device attribute
       CU_DEVICE_ATTRIBUTE_CONCURRENT_MANAGED_ACCESS. Managed memory on such GPUs may  be  evicted  from  device
       memory  to  host  memory  at  any  time  by  the  Unified  Memory  driver in order to make room for other
       allocations.

       In  a  multi-GPU  system  where  all  GPUs   have   a   non-zero   value   for   the   device   attribute
       CU_DEVICE_ATTRIBUTE_CONCURRENT_MANAGED_ACCESS,  managed memory may not be populated when this API returns
       and instead may be populated on access. In such systems, managed memory can migrate  to  any  processor's
       memory  at  any  time.  The  Unified  Memory  driver will employ heuristics to maintain data locality and
       prevent excessive page faults to the extent possible. The application can also  guide  the  driver  about
       memory  usage  patterns  via cuMemAdvise. The application can also explicitly migrate memory to a desired
       processor's memory via cuMemPrefetchAsync.

       In  a  multi-GPU  system  where  all  of  the  GPUs  have  a  zero  value  for   the   device   attribute
       CU_DEVICE_ATTRIBUTE_CONCURRENT_MANAGED_ACCESS and all the GPUs have peer-to-peer support with each other,
       the  physical  storage  for  managed  memory  is  created  on  the  GPU  which  is  active  at  the  time
       cuMemAllocManaged is called. All other GPUs will  reference  the  data  at  reduced  bandwidth  via  peer
       mappings over the PCIe bus. The Unified Memory driver does not migrate memory among such GPUs.

       In a multi-GPU system where not all GPUs have peer-to-peer support with each other and where the value of
       the  device  attribute  CU_DEVICE_ATTRIBUTE_CONCURRENT_MANAGED_ACCESS  is  zero for at least one of those
       GPUs, the location chosen for physical storage of managed memory is system-dependent.

       • On Linux, the location chosen will be device memory as long as the current set of active contexts is
         on devices that either have peer-to-peer support with each other or have a non-zero value for the
         device attribute CU_DEVICE_ATTRIBUTE_CONCURRENT_MANAGED_ACCESS. If there is an active context on a GPU
         that has a zero value for that device attribute and does not have peer-to-peer support with the other
         devices that have active contexts on them, then the location for physical storage will be 'zero-copy'
         or host memory. This means that managed memory located in device memory is migrated to host memory if
         a new context is created on a GPU that has a zero value for the device attribute and does not support
         peer-to-peer with at least one of the other devices that has an active context. This in turn implies
         that context creation may fail if there is insufficient host memory to migrate all managed
         allocations.

       • On  Windows,  the  physical  storage  is  always  created  in 'zero-copy' or host memory. All GPUs will
         reference the data at reduced bandwidth  over  the  PCIe  bus.  In  these  circumstances,  use  of  the
         environment  variable  CUDA_VISIBLE_DEVICES is recommended to restrict CUDA to only use those GPUs that
         have peer-to-peer support. Alternatively, users can also set CUDA_MANAGED_FORCE_DEVICE_ALLOC to a  non-
         zero  value to force the driver to always use device memory for physical storage. When this environment
         variable is set to a non-zero value, all contexts created in  that  process  on  devices  that  support
         managed  memory  have  to  be  peer-to-peer compatible with each other. Context creation will fail if a
         context is created on a device that supports managed memory and is not peer-to-peer compatible with any
         of the other managed memory supporting devices on which contexts were previously created, even if those
         contexts have been destroyed. These environment variables are described in the CUDA  programming  guide
         under the 'CUDA environment variables' section.

       • On ARM, managed memory is not available on discrete GPUs with Drive PX-2.

       Parameters:
           dptr - Returned device pointer
           bytesize - Requested allocation size in bytes
           flags - Must be one of CU_MEM_ATTACH_GLOBAL or CU_MEM_ATTACH_HOST

       Returns:
           CUDA_SUCCESS,   CUDA_ERROR_DEINITIALIZED,   CUDA_ERROR_NOT_INITIALIZED,   CUDA_ERROR_INVALID_CONTEXT,
           CUDA_ERROR_NOT_SUPPORTED, CUDA_ERROR_INVALID_VALUE, CUDA_ERROR_OUT_OF_MEMORY

       Note:
           Note that this function may also return error codes from previous, asynchronous launches.

       See also:
           cuArray3DCreate,   cuArray3DGetDescriptor,   cuArrayCreate,   cuArrayDestroy,   cuArrayGetDescriptor,
           cuMemAllocHost,   cuMemAllocPitch,   cuMemcpy2D,  cuMemcpy2DAsync,  cuMemcpy2DUnaligned,  cuMemcpy3D,
           cuMemcpy3DAsync,   cuMemcpyAtoA,   cuMemcpyAtoD,   cuMemcpyAtoH,   cuMemcpyAtoHAsync,   cuMemcpyDtoA,
           cuMemcpyDtoD,  cuMemcpyDtoDAsync,  cuMemcpyDtoH,  cuMemcpyDtoHAsync, cuMemcpyHtoA, cuMemcpyHtoAAsync,
           cuMemcpyHtoD,  cuMemcpyHtoDAsync,  cuMemFree,  cuMemFreeHost,   cuMemGetAddressRange,   cuMemGetInfo,
           cuMemHostAlloc,  cuMemHostGetDevicePointer,  cuMemsetD2D8,  cuMemsetD2D16, cuMemsetD2D32, cuMemsetD8,
           cuMemsetD16, cuMemsetD32, cuDeviceGetAttribute, cuStreamAttachMemAsync, cudaMallocManaged

   CUresult cuMemAllocPitch (CUdeviceptr * dptr, size_t * pPitch, size_t WidthInBytes, size_t  Height,  unsigned
       int ElementSizeBytes)
       Allocates  at  least  WidthInBytes  *  Height bytes of linear memory on the device and returns in *dptr a
       pointer to the allocated memory. The function  may  pad  the  allocation  to  ensure  that  corresponding
       pointers  in any given row will continue to meet the alignment requirements for coalescing as the address
       is updated from row to row. ElementSizeBytes specifies the size of the largest reads and writes that will
       be performed on the memory range. ElementSizeBytes may be 4, 8 or 16 (since coalesced memory transactions
       are not possible on other data sizes). If ElementSizeBytes is smaller than the actual read/write size  of
       a  kernel, the kernel will run correctly, but possibly at reduced speed. The pitch returned in *pPitch by
       cuMemAllocPitch() is the width in bytes of the allocation. The intended usage of pitch is as  a  separate
       parameter  of  the allocation, used to compute addresses within the 2D array. Given the row and column of
       an array element of type T, the address is computed as:

          T* pElement = (T*)((char*)BaseAddress + Row * Pitch) + Column;

       The pitch returned by cuMemAllocPitch() is guaranteed to work with cuMemcpy2D() under all  circumstances.
       For  allocations  of  2D arrays, it is recommended that programmers consider performing pitch allocations
       using cuMemAllocPitch(). Due to alignment restrictions in the hardware, this is especially  true  if  the
       application  will  be  performing  2D  memory  copies between different regions of device memory (whether
       linear memory or CUDA arrays).

       The byte alignment of the pitch returned by cuMemAllocPitch()  is  guaranteed  to  match  or  exceed  the
       alignment requirement for texture binding with cuTexRefSetAddress2D().
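
       The addressing rule above can be tried on the host without a device. The following is a minimal
       sketch, not driver code: padded_pitch and element_at are hypothetical helper names, and the 512-byte
       alignment used in the usage note is an assumed value for illustration only. In real code the pitch
       must be the value returned in *pPitch by cuMemAllocPitch().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the padding chosen by the driver: rounds
 * width_in_bytes up to a multiple of alignment. The real pitch is whatever
 * cuMemAllocPitch() returns in *pPitch; the alignment here is an assumption. */
static size_t padded_pitch(size_t width_in_bytes, size_t alignment)
{
    assert(alignment != 0);
    return (width_in_bytes + alignment - 1) / alignment * alignment;
}

/* Address of the float element at (row, column) in a pitched buffer,
 * following: (T*)((char*)BaseAddress + Row * Pitch) + Column. */
static float *element_at(void *base, size_t pitch, size_t row, size_t column)
{
    return (float *)((char *)base + row * pitch) + column;
}
```

       For a row of 100 float values (400 bytes) and the assumed 512-byte alignment, padded_pitch(400, 512)
       gives 512, so element_at(base, 512, 1, 3) lies 512 + 3 * sizeof(float) bytes past the base address.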

       Parameters:
           dptr - Returned device pointer
           pPitch - Returned pitch of allocation in bytes
           WidthInBytes - Requested allocation width in bytes
           Height - Requested allocation height in rows
           ElementSizeBytes - Size of largest reads/writes for range

       Returns:
           CUDA_SUCCESS,   CUDA_ERROR_DEINITIALIZED,   CUDA_ERROR_NOT_INITIALIZED,   CUDA_ERROR_INVALID_CONTEXT,
           CUDA_ERROR_INVALID_VALUE, CUDA_ERROR_OUT_OF_MEMORY

       Note:
           Note that this function may also return error codes from previous, asynchronous launches.

       See also:
           cuArray3DCreate,   cuArray3DGetDescriptor,   cuArrayCreate,   cuArrayDestroy,   cuArrayGetDescriptor,
           cuMemAlloc,    cuMemAllocHost,    cuMemcpy2D,   cuMemcpy2DAsync,   cuMemcpy2DUnaligned,   cuMemcpy3D,
           cuMemcpy3DAsync,   cuMemcpyAtoA,   cuMemcpyAtoD,   cuMemcpyAtoH,   cuMemcpyAtoHAsync,   cuMemcpyDtoA,
           cuMemcpyDtoD,  cuMemcpyDtoDAsync,  cuMemcpyDtoH,  cuMemcpyDtoHAsync, cuMemcpyHtoA, cuMemcpyHtoAAsync,
           cuMemcpyHtoD,  cuMemcpyHtoDAsync,  cuMemFree,  cuMemFreeHost,   cuMemGetAddressRange,   cuMemGetInfo,
           cuMemHostAlloc,  cuMemHostGetDevicePointer,  cuMemsetD2D8,  cuMemsetD2D16, cuMemsetD2D32, cuMemsetD8,
           cuMemsetD16, cuMemsetD32, cudaMallocPitch

   CUresult cuMemcpy (CUdeviceptr dst, CUdeviceptr src, size_t ByteCount)
       Copies data between two pointers.  dst  and  src  are  base  pointers  of  the  destination  and  source,
       respectively. ByteCount specifies the number of bytes to copy. Note that this function infers the type of
       the transfer (host to host, host to device, device to device, or device to host) from the pointer values.
       This function is only allowed in contexts which support unified addressing.

       Parameters:
           dst - Destination unified virtual address space pointer
           src - Source unified virtual address space pointer
           ByteCount - Size of memory copy in bytes

       Returns:
           CUDA_SUCCESS,   CUDA_ERROR_DEINITIALIZED,   CUDA_ERROR_NOT_INITIALIZED,   CUDA_ERROR_INVALID_CONTEXT,
           CUDA_ERROR_INVALID_VALUE

       Note:
           Note that this function may also return error codes from previous, asynchronous launches.

           This function exhibits synchronous behavior for most use cases.

       See also:
           cuArray3DCreate,   cuArray3DGetDescriptor,   cuArrayCreate,   cuArrayDestroy,   cuArrayGetDescriptor,
           cuMemAlloc,   cuMemAllocHost,   cuMemAllocPitch,  cuMemcpy2D,  cuMemcpy2DAsync,  cuMemcpy2DUnaligned,
           cuMemcpy3D,   cuMemcpy3DAsync,   cuMemcpyAtoA,   cuMemcpyAtoD,    cuMemcpyAtoH,    cuMemcpyAtoHAsync,
           cuMemcpyDtoA,   cuMemcpyDtoH,   cuMemcpyDtoHAsync,   cuMemcpyHtoA,  cuMemcpyHtoAAsync,  cuMemcpyHtoD,
           cuMemcpyHtoDAsync,  cuMemFree,  cuMemFreeHost,  cuMemGetAddressRange,  cuMemGetInfo,  cuMemHostAlloc,
           cuMemHostGetDevicePointer,   cuMemsetD2D8,  cuMemsetD2D16,  cuMemsetD2D32,  cuMemsetD8,  cuMemsetD16,
           cuMemsetD32, cudaMemcpy, cudaMemcpyToSymbol, cudaMemcpyFromSymbol

   CUresult cuMemcpy2D (const CUDA_MEMCPY2D * pCopy)
       Perform a 2D memory copy according to the parameters specified in pCopy. The CUDA_MEMCPY2D  structure  is
       defined as:

          typedef struct CUDA_MEMCPY2D_st {
             unsigned int srcXInBytes, srcY;
             CUmemorytype srcMemoryType;
             const void *srcHost;
             CUdeviceptr srcDevice;
             CUarray srcArray;
             unsigned int srcPitch;

             unsigned int dstXInBytes, dstY;
             CUmemorytype dstMemoryType;
             void *dstHost;
             CUdeviceptr dstDevice;
             CUarray dstArray;
             unsigned int dstPitch;

             unsigned int WidthInBytes;
             unsigned int Height;
          } CUDA_MEMCPY2D;

        where:

       • srcMemoryType and dstMemoryType specify the type of memory of the source and destination, respectively;
         CUmemorytype_enum is defined as:

          typedef enum CUmemorytype_enum {
             CU_MEMORYTYPE_HOST = 0x01,
             CU_MEMORYTYPE_DEVICE = 0x02,
             CU_MEMORYTYPE_ARRAY = 0x03,
             CU_MEMORYTYPE_UNIFIED = 0x04
          } CUmemorytype;

         If srcMemoryType is CU_MEMORYTYPE_UNIFIED, srcDevice and srcPitch specify the (unified virtual
         address space) base address of the source data and the bytes per row to apply. srcArray is ignored.
         This value may be used only if unified addressing is supported in the calling context.

         If srcMemoryType is CU_MEMORYTYPE_HOST, srcHost and srcPitch specify the (host) base address of the
         source data and the bytes per row to apply. srcArray is ignored.

         If srcMemoryType is CU_MEMORYTYPE_DEVICE, srcDevice and srcPitch specify the (device) base address
         of the source data and the bytes per row to apply. srcArray is ignored.

         If srcMemoryType is CU_MEMORYTYPE_ARRAY, srcArray specifies the handle of the source data. srcHost,
         srcDevice and srcPitch are ignored.

         If dstMemoryType is CU_MEMORYTYPE_HOST, dstHost and dstPitch specify the (host) base address of the
         destination data and the bytes per row to apply. dstArray is ignored.

         If dstMemoryType is CU_MEMORYTYPE_UNIFIED, dstDevice and dstPitch specify the (unified virtual
         address space) base address of the destination data and the bytes per row to apply. dstArray is
         ignored. This value may be used only if unified addressing is supported in the calling context.

         If dstMemoryType is CU_MEMORYTYPE_DEVICE, dstDevice and dstPitch specify the (device) base address
         of the destination data and the bytes per row to apply. dstArray is ignored.

         If dstMemoryType is CU_MEMORYTYPE_ARRAY, dstArray specifies the handle of the destination data.
         dstHost, dstDevice and dstPitch are ignored.

       • srcXInBytes and srcY specify the base address of the source data for the copy.

         For host pointers, the starting address is

           void* Start = (void*)((char*)srcHost + srcY * srcPitch + srcXInBytes);

         For device pointers, the starting address is

           CUdeviceptr Start = srcDevice + srcY * srcPitch + srcXInBytes;

         For CUDA arrays, srcXInBytes must be evenly divisible by the array element size.

       • dstXInBytes and dstY specify the base address of the destination data for the copy.

         For host pointers, the starting address is

           void* dstStart = (void*)((char*)dstHost + dstY * dstPitch + dstXInBytes);

         For device pointers, the starting address is

           CUdeviceptr dstStart = dstDevice + dstY * dstPitch + dstXInBytes;

         For CUDA arrays, dstXInBytes must be evenly divisible by the array element size.

       • WidthInBytes and Height specify the width (in bytes) and height of the 2D copy being performed.

       • If specified, srcPitch must be greater than or equal to WidthInBytes + srcXInBytes, and  dstPitch  must
         be greater than or equal to WidthInBytes + dstXInBytes.

       cuMemcpy2D() returns an error if any pitch is greater than the maximum allowed
       (CU_DEVICE_ATTRIBUTE_MAX_PITCH). cuMemAllocPitch() passes back pitches that always work with
       cuMemcpy2D(). On intra-device memory copies (device to device, CUDA array to device, CUDA array to
       CUDA array), cuMemcpy2D() may fail for pitches not computed by cuMemAllocPitch().
       cuMemcpy2DUnaligned() does not have this restriction, but may run significantly slower in the cases
       where cuMemcpy2D() would have returned an error code.
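
       For the host-to-host case, the start-address and pitch rules above can be emulated with plain
       memcpy(). This is a sketch for illustration under stated assumptions, not the driver's
       implementation: HostCopy2D and host_copy_2d are hypothetical names, and the struct keeps only the
       host-pointer subset of the CUDA_MEMCPY2D fields.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical host-only subset of the CUDA_MEMCPY2D fields. */
typedef struct {
    size_t srcXInBytes, srcY;
    const void *srcHost;
    size_t srcPitch;
    size_t dstXInBytes, dstY;
    void *dstHost;
    size_t dstPitch;
    size_t WidthInBytes, Height;
} HostCopy2D;

/* Emulates a host-to-host 2D copy. Returns 0 on success, or -1 if a pitch
 * is smaller than WidthInBytes plus the corresponding X offset. */
static int host_copy_2d(const HostCopy2D *p)
{
    if (p->srcPitch < p->WidthInBytes + p->srcXInBytes) return -1;
    if (p->dstPitch < p->WidthInBytes + p->dstXInBytes) return -1;

    /* Start = base + Y * Pitch + XInBytes, as given for host pointers. */
    const char *src = (const char *)p->srcHost + p->srcY * p->srcPitch + p->srcXInBytes;
    char *dst = (char *)p->dstHost + p->dstY * p->dstPitch + p->dstXInBytes;

    /* Copy WidthInBytes per row, advancing each side by its own pitch. */
    for (size_t row = 0; row < p->Height; ++row)
        memcpy(dst + row * p->dstPitch, src + row * p->srcPitch, p->WidthInBytes);
    return 0;
}
```

       Copying a 3-byte-wide, 2-row region with srcXInBytes = 2, srcY = 1 and srcPitch = 8 reads the first
       row starting at byte offset 1 * 8 + 2 = 10 and the second row starting at offset 18.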

       Parameters:
           pCopy - Parameters for the memory copy

       Returns:
           CUDA_SUCCESS,   CUDA_ERROR_DEINITIALIZED,   CUDA_ERROR_NOT_INITIALIZED,   CUDA_ERROR_INVALID_CONTEXT,
           CUDA_ERROR_INVALID_VALUE

       Note:
           Note that this function may also return error codes from previous, asynchronous launches.

           This function exhibits synchronous behavior for most use cases.

       See also:
           cuArray3DCreate,   cuArray3DGetDescriptor,   cuArrayCreate,   cuArrayDestroy,   cuArrayGetDescriptor,
           cuMemAlloc,  cuMemAllocHost,  cuMemAllocPitch,  cuMemcpy2DAsync,   cuMemcpy2DUnaligned,   cuMemcpy3D,
           cuMemcpy3DAsync,   cuMemcpyAtoA,   cuMemcpyAtoD,   cuMemcpyAtoH,   cuMemcpyAtoHAsync,   cuMemcpyDtoA,
           cuMemcpyDtoD, cuMemcpyDtoDAsync, cuMemcpyDtoH,  cuMemcpyDtoHAsync,  cuMemcpyHtoA,  cuMemcpyHtoAAsync,
           cuMemcpyHtoD,   cuMemcpyHtoDAsync,   cuMemFree,  cuMemFreeHost,  cuMemGetAddressRange,  cuMemGetInfo,
           cuMemHostAlloc, cuMemHostGetDevicePointer, cuMemsetD2D8,  cuMemsetD2D16,  cuMemsetD2D32,  cuMemsetD8,
           cuMemsetD16, cuMemsetD32, cudaMemcpy2D, cudaMemcpy2DToArray, cudaMemcpy2DFromArray

   CUresult cuMemcpy2DAsync (const CUDA_MEMCPY2D * pCopy, CUstream hStream)
       Perform  a  2D memory copy according to the parameters specified in pCopy. The CUDA_MEMCPY2D structure is
       defined as:

          typedef struct CUDA_MEMCPY2D_st {
             unsigned int srcXInBytes, srcY;
             CUmemorytype srcMemoryType;
             const void *srcHost;
             CUdeviceptr srcDevice;
             CUarray srcArray;
             unsigned int srcPitch;
             unsigned int dstXInBytes, dstY;
             CUmemorytype dstMemoryType;
             void *dstHost;
             CUdeviceptr dstDevice;
             CUarray dstArray;
             unsigned int dstPitch;
             unsigned int WidthInBytes;
             unsigned int Height;
          } CUDA_MEMCPY2D;

        where:

       • srcMemoryType and dstMemoryType specify the type of memory of the source and destination, respectively;
         CUmemorytype_enum is defined as:

          typedef enum CUmemorytype_enum {
             CU_MEMORYTYPE_HOST = 0x01,
             CU_MEMORYTYPE_DEVICE = 0x02,
             CU_MEMORYTYPE_ARRAY = 0x03,
             CU_MEMORYTYPE_UNIFIED = 0x04
          } CUmemorytype;

         If srcMemoryType is CU_MEMORYTYPE_HOST, srcHost and srcPitch specify the (host) base address of the
         source data and the bytes per row to apply. srcArray is ignored.

         If srcMemoryType is CU_MEMORYTYPE_UNIFIED, srcDevice and srcPitch specify the (unified virtual
         address space) base address of the source data and the bytes per row to apply. srcArray is ignored.
         This value may be used only if unified addressing is supported in the calling context.

         If srcMemoryType is CU_MEMORYTYPE_DEVICE, srcDevice and srcPitch specify the (device) base address
         of the source data and the bytes per row to apply. srcArray is ignored.

         If srcMemoryType is CU_MEMORYTYPE_ARRAY, srcArray specifies the handle of the source data. srcHost,
         srcDevice and srcPitch are ignored.

         If dstMemoryType is CU_MEMORYTYPE_UNIFIED, dstDevice and dstPitch specify the (unified virtual
         address space) base address of the destination data and the bytes per row to apply. dstArray is
         ignored. This value may be used only if unified addressing is supported in the calling context.

         If dstMemoryType is CU_MEMORYTYPE_HOST, dstHost and dstPitch specify the (host) base address of the
         destination data and the bytes per row to apply. dstArray is ignored.

         If dstMemoryType is CU_MEMORYTYPE_DEVICE, dstDevice and dstPitch specify the (device) base address
         of the destination data and the bytes per row to apply. dstArray is ignored.

         If dstMemoryType is CU_MEMORYTYPE_ARRAY, dstArray specifies the handle of the destination data.
         dstHost, dstDevice and dstPitch are ignored.

       • srcXInBytes and srcY specify the base address of the source data for the copy.

         For host pointers, the starting address is

           void* Start = (void*)((char*)srcHost + srcY * srcPitch + srcXInBytes);

         For device pointers, the starting address is

           CUdeviceptr Start = srcDevice + srcY * srcPitch + srcXInBytes;

         For CUDA arrays, srcXInBytes must be evenly divisible by the array element size.

       • dstXInBytes and dstY specify the base address of the destination data for the copy.

         For host pointers, the starting address is

           void* dstStart = (void*)((char*)dstHost + dstY * dstPitch + dstXInBytes);

         For device pointers, the starting address is

           CUdeviceptr dstStart = dstDevice + dstY * dstPitch + dstXInBytes;

         For CUDA arrays, dstXInBytes must be evenly divisible by the array element size.

       • WidthInBytes and Height specify the width (in bytes) and height of the 2D copy being performed.

       • If  specified,  srcPitch must be greater than or equal to WidthInBytes + srcXInBytes, and dstPitch must
         be greater than or equal to WidthInBytes + dstXInBytes.

       • If  specified,  srcHeight must be greater than or equal to Height + srcY, and dstHeight must be greater
         than or equal to Height + dstY.

       cuMemcpy2DAsync() returns an error if any pitch is greater than the maximum allowed
       (CU_DEVICE_ATTRIBUTE_MAX_PITCH). cuMemAllocPitch() passes back pitches that always work with
       cuMemcpy2D(). On intra-device memory copies (device to device, CUDA array to device, CUDA array to
       CUDA array), cuMemcpy2DAsync() may fail for pitches not computed by cuMemAllocPitch().

       Parameters:
           pCopy - Parameters for the memory copy
           hStream - Stream identifier

       Returns:
           CUDA_SUCCESS,   CUDA_ERROR_DEINITIALIZED,   CUDA_ERROR_NOT_INITIALIZED,   CUDA_ERROR_INVALID_CONTEXT,
           CUDA_ERROR_INVALID_VALUE, CUDA_ERROR_INVALID_HANDLE

       Note:
           Note that this function may also return error codes from previous, asynchronous launches.

           This function exhibits asynchronous behavior for most use cases.

           This function uses standard default stream semantics.

       See also:
           cuArray3DCreate,   cuArray3DGetDescriptor,   cuArrayCreate,   cuArrayDestroy,   cuArrayGetDescriptor,
           cuMemAlloc,    cuMemAllocHost,    cuMemAllocPitch,   cuMemcpy2D,   cuMemcpy2DUnaligned,   cuMemcpy3D,
           cuMemcpy3DAsync,   cuMemcpyAtoA,   cuMemcpyAtoD,   cuMemcpyAtoH,   cuMemcpyAtoHAsync,   cuMemcpyDtoA,
           cuMemcpyDtoD,  cuMemcpyDtoDAsync,  cuMemcpyDtoH,  cuMemcpyDtoHAsync, cuMemcpyHtoA, cuMemcpyHtoAAsync,
           cuMemcpyHtoD,  cuMemcpyHtoDAsync,  cuMemFree,  cuMemFreeHost,   cuMemGetAddressRange,   cuMemGetInfo,
           cuMemHostAlloc,    cuMemHostGetDevicePointer,    cuMemsetD2D8,    cuMemsetD2D8Async,   cuMemsetD2D16,
           cuMemsetD2D16Async,  cuMemsetD2D32,  cuMemsetD2D32Async,  cuMemsetD8,  cuMemsetD8Async,  cuMemsetD16,
           cuMemsetD16Async,   cuMemsetD32,   cuMemsetD32Async,   cudaMemcpy2DAsync,   cudaMemcpy2DToArrayAsync,
           cudaMemcpy2DFromArrayAsync

   CUresult cuMemcpy2DUnaligned (const CUDA_MEMCPY2D * pCopy)
       Perform a 2D memory copy according to the parameters specified in pCopy. The CUDA_MEMCPY2D  structure  is
       defined as:

          typedef struct CUDA_MEMCPY2D_st {
             unsigned int srcXInBytes, srcY;
             CUmemorytype srcMemoryType;
             const void *srcHost;
             CUdeviceptr srcDevice;
             CUarray srcArray;
             unsigned int srcPitch;
             unsigned int dstXInBytes, dstY;
             CUmemorytype dstMemoryType;
             void *dstHost;
             CUdeviceptr dstDevice;
             CUarray dstArray;
             unsigned int dstPitch;
             unsigned int WidthInBytes;
             unsigned int Height;
          } CUDA_MEMCPY2D;

        where:

       • srcMemoryType and dstMemoryType specify the type of memory of the source and destination, respectively;
         CUmemorytype_enum is defined as:

          typedef enum CUmemorytype_enum {
             CU_MEMORYTYPE_HOST = 0x01,
             CU_MEMORYTYPE_DEVICE = 0x02,
             CU_MEMORYTYPE_ARRAY = 0x03,
             CU_MEMORYTYPE_UNIFIED = 0x04
          } CUmemorytype;

       If srcMemoryType is CU_MEMORYTYPE_UNIFIED, srcDevice and srcPitch specify the (unified virtual
       address space) base address of the source data and the bytes per row to apply. srcArray is ignored. This
       value may be used only if unified addressing is supported in the calling context.

       If srcMemoryType is CU_MEMORYTYPE_HOST, srcHost and srcPitch specify the (host) base address of the
       source data and the bytes per row to apply. srcArray is ignored.

       If srcMemoryType is CU_MEMORYTYPE_DEVICE, srcDevice and srcPitch specify the (device) base address
       of the source data and the bytes per row to apply. srcArray is ignored.

       If srcMemoryType is CU_MEMORYTYPE_ARRAY, srcArray specifies the handle of the source data. srcHost,
       srcDevice and srcPitch are ignored.

       If dstMemoryType is CU_MEMORYTYPE_UNIFIED, dstDevice and dstPitch specify the (unified virtual
       address space) base address of the destination data and the bytes per row to apply. dstArray is
       ignored. This value may be used only if unified addressing is supported in the calling context.

       If dstMemoryType is CU_MEMORYTYPE_HOST, dstHost and dstPitch specify the (host) base address of the
       destination data and the bytes per row to apply. dstArray is ignored.

       If dstMemoryType is CU_MEMORYTYPE_DEVICE, dstDevice and dstPitch specify the (device) base address
       of the destination data and the bytes per row to apply. dstArray is ignored.

       If dstMemoryType is CU_MEMORYTYPE_ARRAY, dstArray specifies the handle of the destination data.
       dstHost, dstDevice and dstPitch are ignored.

       • srcXInBytes and srcY specify the base address of the source data for the copy.

       For host pointers, the starting address is

         void* Start = (void*)((char*)srcHost+srcY*srcPitch + srcXInBytes);

       For device pointers, the starting address is

         CUdeviceptr Start = srcDevice+srcY*srcPitch+srcXInBytes;

       For CUDA arrays, srcXInBytes must be evenly divisible by the array element size.

       • dstXInBytes and dstY specify the base address of the destination data for the copy.

       For host pointers, the starting address is

         void* dstStart = (void*)((char*)dstHost+dstY*dstPitch + dstXInBytes);

       For device pointers, the starting address is

         CUdeviceptr dstStart = dstDevice+dstY*dstPitch+dstXInBytes;

       For CUDA arrays, dstXInBytes must be evenly divisible by the array element size.

       • WidthInBytes and Height specify the width (in bytes) and height of the 2D copy being performed.

       • If specified, srcPitch must be greater than or equal to WidthInBytes + srcXInBytes, and  dstPitch  must
         be greater than or equal to WidthInBytes + dstXInBytes.

       cuMemcpy2D() returns an error if any pitch is greater than the maximum allowed
       (CU_DEVICE_ATTRIBUTE_MAX_PITCH). cuMemAllocPitch() passes back pitches that always work with
       cuMemcpy2D(). On intra-device memory copies (device to device, CUDA array to device, CUDA array to CUDA
       array), cuMemcpy2D() may fail for pitches not computed by cuMemAllocPitch(). cuMemcpy2DUnaligned() does
       not have this restriction, but may run significantly slower in the cases where cuMemcpy2D() would have
       returned an error code.
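
       As a rough, host-only illustration of the addressing and pitch rules above, the following C sketch
       emulates a 2D copy between two host buffers with plain memcpy(). It does not call the CUDA driver.
       The type HostCopy2D and the function host_copy_2d() are hypothetical names that mirror only the
       host-related fields of CUDA_MEMCPY2D.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical host-only mirror of the CUDA_MEMCPY2D fields that matter
 * for a host-to-host copy. This is not a driver API type. */
typedef struct {
    size_t srcXInBytes, srcY;
    const void *srcHost;
    size_t srcPitch;
    size_t dstXInBytes, dstY;
    void *dstHost;
    size_t dstPitch;
    size_t WidthInBytes, Height;
} HostCopy2D;

/* Returns 0 on success, -1 if a pitch violates the documented constraint. */
int host_copy_2d(const HostCopy2D *c)
{
    assert(c != NULL && c->srcHost != NULL && c->dstHost != NULL);

    /* Each pitch must cover the copied row width plus the X offset. */
    if (c->srcPitch < c->WidthInBytes + c->srcXInBytes) return -1;
    if (c->dstPitch < c->WidthInBytes + c->dstXInBytes) return -1;

    /* Starting addresses, following the host-pointer formulas above. */
    const char *src = (const char *)c->srcHost + c->srcY * c->srcPitch + c->srcXInBytes;
    char *dst = (char *)c->dstHost + c->dstY * c->dstPitch + c->dstXInBytes;

    /* Copy Height rows of WidthInBytes bytes, stepping by each pitch. */
    for (size_t row = 0; row < c->Height; ++row) {
        memcpy(dst + row * c->dstPitch, src + row * c->srcPitch, c->WidthInBytes);
    }
    return 0;
}
```

       The sketch only models the offset and pitch arithmetic. Limits such as CU_DEVICE_ATTRIBUTE_MAX_PITCH
       and any device-side alignment requirements are outside its scope.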

       Parameters:
           pCopy - Parameters for the memory copy

       Returns:
           CUDA_SUCCESS,   CUDA_ERROR_DEINITIALIZED,   CUDA_ERROR_NOT_INITIALIZED,   CUDA_ERROR_INVALID_CONTEXT,
           CUDA_ERROR_INVALID_VALUE

       Note:
           Note that this function may also return error codes from previous, asynchronous launches.

           This function exhibits synchronous behavior for most use cases.

       See also:
           cuArray3DCreate,   cuArray3DGetDescriptor,   cuArrayCreate,   cuArrayDestroy,   cuArrayGetDescriptor,
           cuMemAlloc,    cuMemAllocHost,    cuMemAllocPitch,    cuMemcpy2D,    cuMemcpy2DAsync,     cuMemcpy3D,
           cuMemcpy3DAsync,   cuMemcpyAtoA,   cuMemcpyAtoD,   cuMemcpyAtoH,   cuMemcpyAtoHAsync,   cuMemcpyDtoA,
           cuMemcpyDtoD, cuMemcpyDtoDAsync, cuMemcpyDtoH,  cuMemcpyDtoHAsync,  cuMemcpyHtoA,  cuMemcpyHtoAAsync,
           cuMemcpyHtoD,   cuMemcpyHtoDAsync,   cuMemFree,  cuMemFreeHost,  cuMemGetAddressRange,  cuMemGetInfo,
           cuMemHostAlloc, cuMemHostGetDevicePointer, cuMemsetD2D8,  cuMemsetD2D16,  cuMemsetD2D32,  cuMemsetD8,
           cuMemsetD16, cuMemsetD32, cudaMemcpy2D, cudaMemcpy2DToArray, cudaMemcpy2DFromArray

   CUresult cuMemcpy3D (const CUDA_MEMCPY3D * pCopy)
       Perform  a  3D memory copy according to the parameters specified in pCopy. The CUDA_MEMCPY3D structure is
       defined as:

               typedef struct CUDA_MEMCPY3D_st {

                   unsigned int srcXInBytes, srcY, srcZ;
                   unsigned int srcLOD;
                   CUmemorytype srcMemoryType;
                       const void *srcHost;
                       CUdeviceptr srcDevice;
                       CUarray srcArray;
                       unsigned int srcPitch;  // ignored when src is array
                       unsigned int srcHeight; // ignored when src is array; may be 0 if Depth==1

                   unsigned int dstXInBytes, dstY, dstZ;
                   unsigned int dstLOD;
                   CUmemorytype dstMemoryType;
                       void *dstHost;
                       CUdeviceptr dstDevice;
                       CUarray dstArray;
                       unsigned int dstPitch;  // ignored when dst is array
                       unsigned int dstHeight; // ignored when dst is array; may be 0 if Depth==1

                   unsigned int WidthInBytes;
                   unsigned int Height;
                   unsigned int Depth;
               } CUDA_MEMCPY3D;

        where:

       • srcMemoryType and dstMemoryType specify the type of memory of the source and destination, respectively;
         CUmemorytype_enum is defined as:

          typedef enum CUmemorytype_enum {
             CU_MEMORYTYPE_HOST = 0x01,
             CU_MEMORYTYPE_DEVICE = 0x02,
             CU_MEMORYTYPE_ARRAY = 0x03,
             CU_MEMORYTYPE_UNIFIED = 0x04
          } CUmemorytype;

       If srcMemoryType is CU_MEMORYTYPE_UNIFIED, srcDevice and srcPitch specify the (unified virtual
       address space) base address of the source data and the bytes per row to apply. srcArray is ignored. This
       value may be used only if unified addressing is supported in the calling context.

       If srcMemoryType is CU_MEMORYTYPE_HOST, srcHost, srcPitch and srcHeight specify the (host) base
       address of the source data, the bytes per row, and the height of each 2D slice of the 3D array. srcArray
       is ignored.

       If srcMemoryType is CU_MEMORYTYPE_DEVICE, srcDevice, srcPitch and srcHeight specify the (device)
       base address of the source data, the bytes per row, and the height of each 2D slice of the 3D array.
       srcArray is ignored.

       If srcMemoryType is CU_MEMORYTYPE_ARRAY, srcArray specifies the handle of the source data. srcHost,
       srcDevice, srcPitch and srcHeight are ignored.

       If dstMemoryType is CU_MEMORYTYPE_UNIFIED, dstDevice and dstPitch specify the (unified virtual
       address space) base address of the destination data and the bytes per row to apply. dstArray is
       ignored. This value may be used only if unified addressing is supported in the calling context.

       If dstMemoryType is CU_MEMORYTYPE_HOST, dstHost, dstPitch and dstHeight specify the (host) base
       address of the destination data, the bytes per row, and the height of each 2D slice of the 3D array.
       dstArray is ignored.

       If dstMemoryType is CU_MEMORYTYPE_DEVICE, dstDevice, dstPitch and dstHeight specify the (device)
       base address of the destination data, the bytes per row, and the height of each 2D slice of the 3D
       array. dstArray is ignored.

       If dstMemoryType is CU_MEMORYTYPE_ARRAY, dstArray specifies the handle of the destination data.
       dstHost, dstDevice, dstPitch and dstHeight are ignored.

       • srcXInBytes, srcY and srcZ specify the base address of the source data for the copy.

       For host pointers, the starting address is

         void* Start = (void*)((char*)srcHost+(srcZ*srcHeight+srcY)*srcPitch + srcXInBytes);

       For device pointers, the starting address is

         CUdeviceptr Start = srcDevice+(srcZ*srcHeight+srcY)*srcPitch+srcXInBytes;

       For CUDA arrays, srcXInBytes must be evenly divisible by the array element size.

       • dstXInBytes, dstY and dstZ specify the base address of the destination data for the copy.

       For host pointers, the starting address is

         void* dstStart = (void*)((char*)dstHost+(dstZ*dstHeight+dstY)*dstPitch + dstXInBytes);

       For device pointers, the starting address is

         CUdeviceptr dstStart = dstDevice+(dstZ*dstHeight+dstY)*dstPitch+dstXInBytes;

       For CUDA arrays, dstXInBytes must be evenly divisible by the array element size.

       • WidthInBytes, Height and Depth specify the width (in bytes), height and depth  of  the  3D  copy  being
         performed.

       • If  specified,  srcPitch must be greater than or equal to WidthInBytes + srcXInBytes, and dstPitch must
         be greater than or equal to WidthInBytes + dstXInBytes.

       • If specified, srcHeight must be greater than or equal to Height + srcY, and dstHeight must  be  greater
         than or equal to Height + dstY.

       cuMemcpy3D() returns an error if any pitch is greater than the maximum allowed
       (CU_DEVICE_ATTRIBUTE_MAX_PITCH).

       The srcLOD and dstLOD members of the CUDA_MEMCPY3D structure must be set to 0.
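
       The 3D offset formula and the pitch and height constraints above can be sketched in plain C as
       follows. This is a host-side illustration only and does not call the CUDA driver. The type Side3D
       and the functions start_offset_3d() and side_is_valid_3d() are hypothetical names. The sketch does
       not model the case where the height may be 0 when Depth is 1.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper type, not part of the driver API: describes one side
 * (source or destination) of a pitched 3D copy in linear memory. */
typedef struct {
    size_t xInBytes, y, z;  /* srcXInBytes/srcY/srcZ, or the dst equivalents */
    size_t pitch;           /* bytes per row */
    size_t height;          /* rows per 2D slice */
} Side3D;

/* Byte offset of the first copied byte from the base pointer:
 * (z * height + y) * pitch + xInBytes. */
size_t start_offset_3d(const Side3D *s)
{
    assert(s != NULL);
    return (s->z * s->height + s->y) * s->pitch + s->xInBytes;
}

/* Returns 1 if pitch and height satisfy the documented constraints for a
 * copy of widthInBytes bytes per row and copyHeight rows per slice, else 0. */
int side_is_valid_3d(const Side3D *s, size_t widthInBytes, size_t copyHeight)
{
    assert(s != NULL);
    return s->pitch >= widthInBytes + s->xInBytes
        && s->height >= copyHeight + s->y;
}
```

       With xInBytes = 4, y = 2, z = 3, pitch = 64 and height = 10, the offset is (3*10 + 2)*64 + 4 = 2052
       bytes from the base pointer.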

       Parameters:
           pCopy - Parameters for the memory copy

       Returns:
           CUDA_SUCCESS,   CUDA_ERROR_DEINITIALIZED,   CUDA_ERROR_NOT_INITIALIZED,   CUDA_ERROR_INVALID_CONTEXT,
           CUDA_ERROR_INVALID_VALUE

       Note:
           Note that this function may also return error codes from previous, asynchronous launches.

           This function exhibits synchronous behavior for most use cases.

       See also:
           cuArray3DCreate,   cuArray3DGetDescriptor,   cuArrayCreate,   cuArrayDestroy,   cuArrayGetDescriptor,
           cuMemAlloc,  cuMemAllocHost,  cuMemAllocPitch,  cuMemcpy2D,   cuMemcpy2DAsync,   cuMemcpy2DUnaligned,
           cuMemcpy3DAsync,   cuMemcpyAtoA,   cuMemcpyAtoD,   cuMemcpyAtoH,   cuMemcpyAtoHAsync,   cuMemcpyDtoA,
           cuMemcpyDtoD, cuMemcpyDtoDAsync, cuMemcpyDtoH,  cuMemcpyDtoHAsync,  cuMemcpyHtoA,  cuMemcpyHtoAAsync,
           cuMemcpyHtoD,   cuMemcpyHtoDAsync,   cuMemFree,  cuMemFreeHost,  cuMemGetAddressRange,  cuMemGetInfo,
           cuMemHostAlloc, cuMemHostGetDevicePointer, cuMemsetD2D8,  cuMemsetD2D16,  cuMemsetD2D32,  cuMemsetD8,
           cuMemsetD16, cuMemsetD32, cudaMemcpy3D

   CUresult cuMemcpy3DAsync (const CUDA_MEMCPY3D * pCopy, CUstream hStream)
       Perform  a  3D memory copy according to the parameters specified in pCopy. The CUDA_MEMCPY3D structure is
       defined as:

               typedef struct CUDA_MEMCPY3D_st {

                   unsigned int srcXInBytes, srcY, srcZ;
                   unsigned int srcLOD;
                   CUmemorytype srcMemoryType;
                       const void *srcHost;
                       CUdeviceptr srcDevice;
                       CUarray srcArray;
                       unsigned int srcPitch;  // ignored when src is array
                       unsigned int srcHeight; // ignored when src is array; may be 0 if Depth==1

                   unsigned int dstXInBytes, dstY, dstZ;
                   unsigned int dstLOD;
                   CUmemorytype dstMemoryType;
                       void *dstHost;
                       CUdeviceptr dstDevice;
                       CUarray dstArray;
                       unsigned int dstPitch;  // ignored when dst is array
                       unsigned int dstHeight; // ignored when dst is array; may be 0 if Depth==1

                   unsigned int WidthInBytes;
                   unsigned int Height;
                   unsigned int Depth;
               } CUDA_MEMCPY3D;

        where:

       • srcMemoryType and dstMemoryType specify the type of memory of the source and destination, respectively;
         CUmemorytype_enum is defined as:

          typedef enum CUmemorytype_enum {
             CU_MEMORYTYPE_HOST = 0x01,
             CU_MEMORYTYPE_DEVICE = 0x02,
             CU_MEMORYTYPE_ARRAY = 0x03,
             CU_MEMORYTYPE_UNIFIED = 0x04
          } CUmemorytype;

       If srcMemoryType is CU_MEMORYTYPE_UNIFIED, srcDevice and srcPitch specify the (unified virtual
       address space) base address of the source data and the bytes per row to apply. srcArray is ignored. This
       value may be used only if unified addressing is supported in the calling context.

       If srcMemoryType is CU_MEMORYTYPE_HOST, srcHost, srcPitch and srcHeight specify the (host) base
       address of the source data, the bytes per row, and the height of each 2D slice of the 3D array. srcArray
       is ignored.

       If srcMemoryType is CU_MEMORYTYPE_DEVICE, srcDevice, srcPitch and srcHeight specify the (device)
       base address of the source data, the bytes per row, and the height of each 2D slice of the 3D array.
       srcArray is ignored.

       If srcMemoryType is CU_MEMORYTYPE_ARRAY, srcArray specifies the handle of the source data. srcHost,
       srcDevice, srcPitch and srcHeight are ignored.

       If dstMemoryType is CU_MEMORYTYPE_UNIFIED, dstDevice and dstPitch specify the (unified virtual
       address space) base address of the destination data and the bytes per row to apply. dstArray is
       ignored. This value may be used only if unified addressing is supported in the calling context.

       If dstMemoryType is CU_MEMORYTYPE_HOST, dstHost, dstPitch and dstHeight specify the (host) base
       address of the destination data, the bytes per row, and the height of each 2D slice of the 3D array.
       dstArray is ignored.

       If dstMemoryType is CU_MEMORYTYPE_DEVICE, dstDevice, dstPitch and dstHeight specify the (device)
       base address of the destination data, the bytes per row, and the height of each 2D slice of the 3D
       array. dstArray is ignored.

       If dstMemoryType is CU_MEMORYTYPE_ARRAY, dstArray specifies the handle of the destination data.
       dstHost, dstDevice, dstPitch and dstHeight are ignored.

       • srcXInBytes, srcY and srcZ specify the base address of the source data for the copy.

       For host pointers, the starting address is

         void* Start = (void*)((char*)srcHost+(srcZ*srcHeight+srcY)*srcPitch + srcXInBytes);

       For device pointers, the starting address is

         CUdeviceptr Start = srcDevice+(srcZ*srcHeight+srcY)*srcPitch+srcXInBytes;

       For CUDA arrays, srcXInBytes must be evenly divisible by the array element size.

       • dstXInBytes, dstY and dstZ specify the base address of the destination data for the copy.

       For host pointers, the starting address is

         void* dstStart = (void*)((char*)dstHost+(dstZ*dstHeight+dstY)*dstPitch + dstXInBytes);

       For device pointers, the starting address is

         CUdeviceptr dstStart = dstDevice+(dstZ*dstHeight+dstY)*dstPitch+dstXInBytes;

       For CUDA arrays, dstXInBytes must be evenly divisible by the array element size.

       • WidthInBytes, Height and Depth specify the width (in bytes), height and depth  of  the  3D  copy  being
         performed.

       • If  specified,  srcPitch must be greater than or equal to WidthInBytes + srcXInBytes, and dstPitch must
         be greater than or equal to WidthInBytes + dstXInBytes.

       • If specified, srcHeight must be greater than or equal to Height + srcY, and dstHeight must  be  greater
         than or equal to Height + dstY.

       cuMemcpy3DAsync() returns an error if any pitch is greater than the maximum allowed
       (CU_DEVICE_ATTRIBUTE_MAX_PITCH).

       The srcLOD and dstLOD members of the CUDA_MEMCPY3D structure must be set to 0.

       Parameters:
           pCopy - Parameters for the memory copy
           hStream - Stream identifier

       Returns:
           CUDA_SUCCESS,   CUDA_ERROR_DEINITIALIZED,   CUDA_ERROR_NOT_INITIALIZED,   CUDA_ERROR_INVALID_CONTEXT,
           CUDA_ERROR_INVALID_VALUE, CUDA_ERROR_INVALID_HANDLE

       Note:
           Note that this function may also return error codes from previous, asynchronous launches.

           This function exhibits asynchronous behavior for most use cases.

           This function uses standard default stream semantics.

       See also:
           cuArray3DCreate,   cuArray3DGetDescriptor,   cuArrayCreate,   cuArrayDestroy,   cuArrayGetDescriptor,
           cuMemAlloc,  cuMemAllocHost,  cuMemAllocPitch,  cuMemcpy2D,   cuMemcpy2DAsync,   cuMemcpy2DUnaligned,
           cuMemcpy3D,  cuMemcpyAtoA, cuMemcpyAtoD, cuMemcpyAtoH, cuMemcpyAtoHAsync, cuMemcpyDtoA, cuMemcpyDtoD,
           cuMemcpyDtoDAsync, cuMemcpyDtoH, cuMemcpyDtoHAsync,  cuMemcpyHtoA,  cuMemcpyHtoAAsync,  cuMemcpyHtoD,
           cuMemcpyHtoDAsync,  cuMemFree,  cuMemFreeHost,  cuMemGetAddressRange,  cuMemGetInfo,  cuMemHostAlloc,
           cuMemHostGetDevicePointer,  cuMemsetD2D8,   cuMemsetD2D8Async,   cuMemsetD2D16,   cuMemsetD2D16Async,
           cuMemsetD2D32,   cuMemsetD2D32Async,   cuMemsetD8,  cuMemsetD8Async,  cuMemsetD16,  cuMemsetD16Async,
           cuMemsetD32, cuMemsetD32Async, cudaMemcpy3DAsync

   CUresult cuMemcpy3DPeer (const CUDA_MEMCPY3D_PEER * pCopy)
       Perform a 3D memory copy according to the parameters specified  in  pCopy.  See  the  definition  of  the
       CUDA_MEMCPY3D_PEER structure for documentation of its parameters.

       Parameters:
           pCopy - Parameters for the memory copy

       Returns:
           CUDA_SUCCESS,   CUDA_ERROR_DEINITIALIZED,   CUDA_ERROR_NOT_INITIALIZED,   CUDA_ERROR_INVALID_CONTEXT,
           CUDA_ERROR_INVALID_VALUE

       Note:
           Note that this function may also return error codes from previous, asynchronous launches.

           This function exhibits synchronous behavior for most use cases.

       See also:
           cuMemcpyDtoD,    cuMemcpyPeer,     cuMemcpyDtoDAsync,     cuMemcpyPeerAsync,     cuMemcpy3DPeerAsync,
           cudaMemcpy3DPeer

   CUresult cuMemcpy3DPeerAsync (const CUDA_MEMCPY3D_PEER * pCopy, CUstream hStream)
       Perform  a  3D  memory  copy  according  to  the parameters specified in pCopy. See the definition of the
       CUDA_MEMCPY3D_PEER structure for documentation of its parameters.

       Parameters:
           pCopy - Parameters for the memory copy
           hStream - Stream identifier

       Returns:
           CUDA_SUCCESS,   CUDA_ERROR_DEINITIALIZED,   CUDA_ERROR_NOT_INITIALIZED,   CUDA_ERROR_INVALID_CONTEXT,
           CUDA_ERROR_INVALID_VALUE

       Note:
           Note that this function may also return error codes from previous, asynchronous launches.

           This function exhibits asynchronous behavior for most use cases.

           This function uses standard default stream semantics.

       See also:
           cuMemcpyDtoD,     cuMemcpyPeer,     cuMemcpyDtoDAsync,     cuMemcpyPeerAsync,    cuMemcpy3DPeerAsync,
           cudaMemcpy3DPeerAsync

   CUresult cuMemcpyAsync (CUdeviceptr dst, CUdeviceptr src, size_t ByteCount, CUstream hStream)
       Copies data between two pointers.  dst  and  src  are  base  pointers  of  the  destination  and  source,
       respectively. ByteCount specifies the number of bytes to copy. Note that this function infers the type of
       the transfer (host to host, host to device, device to device, or device to host) from the pointer values.
       This function is only allowed in contexts which support unified addressing.

       Parameters:
           dst - Destination unified virtual address space pointer
           src - Source unified virtual address space pointer
           ByteCount - Size of memory copy in bytes
           hStream - Stream identifier

       Returns:
           CUDA_SUCCESS,   CUDA_ERROR_DEINITIALIZED,   CUDA_ERROR_NOT_INITIALIZED,   CUDA_ERROR_INVALID_CONTEXT,
           CUDA_ERROR_INVALID_VALUE, CUDA_ERROR_INVALID_HANDLE

       Note:
           Note that this function may also return error codes from previous, asynchronous launches.

           This function exhibits asynchronous behavior for most use cases.

           This function uses standard default stream semantics.

       See also:
           cuArray3DCreate,   cuArray3DGetDescriptor,   cuArrayCreate,   cuArrayDestroy,   cuArrayGetDescriptor,
           cuMemAlloc,   cuMemAllocHost,   cuMemAllocPitch,  cuMemcpy2D,  cuMemcpy2DAsync,  cuMemcpy2DUnaligned,
           cuMemcpy3D,   cuMemcpy3DAsync,   cuMemcpyAtoA,   cuMemcpyAtoD,    cuMemcpyAtoH,    cuMemcpyAtoHAsync,
           cuMemcpyDtoA,   cuMemcpyDtoD,   cuMemcpyDtoH,   cuMemcpyDtoHAsync,  cuMemcpyHtoA,  cuMemcpyHtoAAsync,
           cuMemcpyHtoD,  cuMemcpyHtoDAsync,  cuMemFree,  cuMemFreeHost,   cuMemGetAddressRange,   cuMemGetInfo,
           cuMemHostAlloc,    cuMemHostGetDevicePointer,    cuMemsetD2D8,    cuMemsetD2D8Async,   cuMemsetD2D16,
           cuMemsetD2D16Async,  cuMemsetD2D32,  cuMemsetD2D32Async,  cuMemsetD8,  cuMemsetD8Async,  cuMemsetD16,
           cuMemsetD16Async,    cuMemsetD32,    cuMemsetD32Async,    cudaMemcpyAsync,   cudaMemcpyToSymbolAsync,
           cudaMemcpyFromSymbolAsync

   CUresult cuMemcpyAtoA (CUarray  dstArray,  size_t  dstOffset,  CUarray  srcArray,  size_t  srcOffset,  size_t
       ByteCount)
       Copies from one 1D CUDA array to another. dstArray and srcArray specify the handles of the destination
       and source CUDA arrays for the copy, respectively. dstOffset and srcOffset specify the destination and
       source offsets in bytes into the CUDA arrays. ByteCount is the number of bytes to be copied. The
       elements in the CUDA arrays need not have the same format, but they must be the same size, and
       ByteCount must be evenly divisible by that size.

       Parameters:
           dstArray - Destination array
           dstOffset - Offset in bytes of destination array
           srcArray - Source array
           srcOffset - Offset in bytes of source array
           ByteCount - Size of memory copy in bytes

       Returns:
           CUDA_SUCCESS,   CUDA_ERROR_DEINITIALIZED,   CUDA_ERROR_NOT_INITIALIZED,   CUDA_ERROR_INVALID_CONTEXT,
           CUDA_ERROR_INVALID_VALUE

       Note:
           Note that this function may also return error codes from previous, asynchronous launches.

           This function exhibits synchronous behavior for most use cases.

       See also:
           cuArray3DCreate,   cuArray3DGetDescriptor,   cuArrayCreate,   cuArrayDestroy,   cuArrayGetDescriptor,
           cuMemAlloc,   cuMemAllocHost,   cuMemAllocPitch,  cuMemcpy2D,  cuMemcpy2DAsync,  cuMemcpy2DUnaligned,
           cuMemcpy3D,   cuMemcpy3DAsync,   cuMemcpyAtoD,   cuMemcpyAtoH,    cuMemcpyAtoHAsync,    cuMemcpyDtoA,
           cuMemcpyDtoD,  cuMemcpyDtoDAsync,  cuMemcpyDtoH,  cuMemcpyDtoHAsync, cuMemcpyHtoA, cuMemcpyHtoAAsync,
           cuMemcpyHtoD,  cuMemcpyHtoDAsync,  cuMemFree,  cuMemFreeHost,   cuMemGetAddressRange,   cuMemGetInfo,
           cuMemHostAlloc,  cuMemHostGetDevicePointer,  cuMemsetD2D8,  cuMemsetD2D16, cuMemsetD2D32, cuMemsetD8,
           cuMemsetD16, cuMemsetD32, cudaMemcpyArrayToArray

   CUresult cuMemcpyAtoD (CUdeviceptr dstDevice, CUarray srcArray, size_t srcOffset, size_t ByteCount)
       Copies from one 1D CUDA array to device memory. dstDevice specifies the base pointer of  the  destination
       and must be naturally aligned with the CUDA array elements. srcArray and srcOffset specify the CUDA array
       handle  and the offset in bytes into the array where the copy is to begin. ByteCount specifies the number
       of bytes to copy and must be evenly divisible by the array element size.
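
       A caller might check these two requirements before issuing the copy. The C sketch below is a
       hypothetical pre-flight helper, not a driver API call. It assumes that "naturally aligned" means the
       destination address is a multiple of the array element size, that the element size is already known
       to the caller, and it uses uintptr_t as a stand-in for CUdeviceptr.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical check, not part of the driver API. elementSize is the size
 * in bytes of one CUDA array element (assumed known to the caller).
 * Returns 1 if both documented requirements hold, else 0. */
int atod_args_look_valid(uintptr_t dstDevice, size_t byteCount, size_t elementSize)
{
    assert(elementSize != 0);

    /* Assumed meaning of "naturally aligned": address is a multiple
     * of the element size. */
    if (dstDevice % elementSize != 0) return 0;

    /* ByteCount must be evenly divisible by the element size. */
    if (byteCount % elementSize != 0) return 0;

    return 1;
}
```

       A failed check here corresponds to arguments the description above rules out; the sketch does not
       predict which error code the driver itself would return.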

       Parameters:
           dstDevice - Destination device pointer
           srcArray - Source array
           srcOffset - Offset in bytes of source array
           ByteCount - Size of memory copy in bytes

       Returns:
           CUDA_SUCCESS,   CUDA_ERROR_DEINITIALIZED,   CUDA_ERROR_NOT_INITIALIZED,   CUDA_ERROR_INVALID_CONTEXT,
           CUDA_ERROR_INVALID_VALUE

       Note:
           Note that this function may also return error codes from previous, asynchronous launches.

           This function exhibits synchronous behavior for most use cases.

       See also:
           cuArray3DCreate,   cuArray3DGetDescriptor,   cuArrayCreate,   cuArrayDestroy,   cuArrayGetDescriptor,
           cuMemAlloc,  cuMemAllocHost,  cuMemAllocPitch,  cuMemcpy2D,   cuMemcpy2DAsync,   cuMemcpy2DUnaligned,
           cuMemcpy3D,    cuMemcpy3DAsync,    cuMemcpyAtoA,   cuMemcpyAtoH,   cuMemcpyAtoHAsync,   cuMemcpyDtoA,
           cuMemcpyDtoD, cuMemcpyDtoDAsync, cuMemcpyDtoH,  cuMemcpyDtoHAsync,  cuMemcpyHtoA,  cuMemcpyHtoAAsync,
           cuMemcpyHtoD,   cuMemcpyHtoDAsync,   cuMemFree,  cuMemFreeHost,  cuMemGetAddressRange,  cuMemGetInfo,
           cuMemHostAlloc, cuMemHostGetDevicePointer, cuMemsetD2D8,  cuMemsetD2D16,  cuMemsetD2D32,  cuMemsetD8,
           cuMemsetD16, cuMemsetD32, cudaMemcpyFromArray

   CUresult cuMemcpyAtoH (void * dstHost, CUarray srcArray, size_t srcOffset, size_t ByteCount)
       Copies  from  one  1D  CUDA  array to host memory. dstHost specifies the base pointer of the destination.
       srcArray and srcOffset specify the CUDA array handle and starting offset in bytes  of  the  source  data.
       ByteCount specifies the number of bytes to copy.

       Parameters:
           dstHost - Destination host pointer
           srcArray - Source array
           srcOffset - Offset in bytes of source array
           ByteCount - Size of memory copy in bytes

       Returns:
           CUDA_SUCCESS,   CUDA_ERROR_DEINITIALIZED,   CUDA_ERROR_NOT_INITIALIZED,   CUDA_ERROR_INVALID_CONTEXT,
           CUDA_ERROR_INVALID_VALUE

       Note:
           Note that this function may also return error codes from previous, asynchronous launches.

           This function exhibits synchronous behavior for most use cases.

       See also:
           cuArray3DCreate,   cuArray3DGetDescriptor,   cuArrayCreate,   cuArrayDestroy,   cuArrayGetDescriptor,
           cuMemAlloc,   cuMemAllocHost,   cuMemAllocPitch,  cuMemcpy2D,  cuMemcpy2DAsync,  cuMemcpy2DUnaligned,
           cuMemcpy3D,   cuMemcpy3DAsync,   cuMemcpyAtoA,   cuMemcpyAtoD,    cuMemcpyAtoHAsync,    cuMemcpyDtoA,
           cuMemcpyDtoD,  cuMemcpyDtoDAsync,  cuMemcpyDtoH,  cuMemcpyDtoHAsync, cuMemcpyHtoA, cuMemcpyHtoAAsync,
           cuMemcpyHtoD,  cuMemcpyHtoDAsync,  cuMemFree,  cuMemFreeHost,   cuMemGetAddressRange,   cuMemGetInfo,
           cuMemHostAlloc,  cuMemHostGetDevicePointer,  cuMemsetD2D8,  cuMemsetD2D16, cuMemsetD2D32, cuMemsetD8,
           cuMemsetD16, cuMemsetD32, cudaMemcpyFromArray

   CUresult cuMemcpyAtoHAsync (void * dstHost, CUarray srcArray, size_t srcOffset,  size_t  ByteCount,  CUstream
       hStream)
       Copies  from  one  1D  CUDA  array to host memory. dstHost specifies the base pointer of the destination.
       srcArray and srcOffset specify the CUDA array handle and starting offset in bytes  of  the  source  data.
       ByteCount specifies the number of bytes to copy.

       Parameters:
           dstHost - Destination pointer
           srcArray - Source array
           srcOffset - Offset in bytes of source array
           ByteCount - Size of memory copy in bytes
           hStream - Stream identifier

       Returns:
           CUDA_SUCCESS,   CUDA_ERROR_DEINITIALIZED,   CUDA_ERROR_NOT_INITIALIZED,   CUDA_ERROR_INVALID_CONTEXT,
           CUDA_ERROR_INVALID_VALUE, CUDA_ERROR_INVALID_HANDLE

       Note:
           Note that this function may also return error codes from previous, asynchronous launches.

           This function exhibits asynchronous behavior for most use cases.

           This function uses standard default stream semantics.

       See also:
           cuArray3DCreate,   cuArray3DGetDescriptor,   cuArrayCreate,   cuArrayDestroy,   cuArrayGetDescriptor,
           cuMemAlloc,   cuMemAllocHost,   cuMemAllocPitch,  cuMemcpy2D,  cuMemcpy2DAsync,  cuMemcpy2DUnaligned,
           cuMemcpy3D, cuMemcpy3DAsync, cuMemcpyAtoA, cuMemcpyAtoD,  cuMemcpyAtoH,  cuMemcpyDtoA,  cuMemcpyDtoD,
           cuMemcpyDtoDAsync,  cuMemcpyDtoH,  cuMemcpyDtoHAsync,  cuMemcpyHtoA, cuMemcpyHtoAAsync, cuMemcpyHtoD,
           cuMemcpyHtoDAsync,  cuMemFree,  cuMemFreeHost,  cuMemGetAddressRange,  cuMemGetInfo,  cuMemHostAlloc,
           cuMemHostGetDevicePointer,   cuMemsetD2D8,   cuMemsetD2D8Async,   cuMemsetD2D16,  cuMemsetD2D16Async,
           cuMemsetD2D32,  cuMemsetD2D32Async,  cuMemsetD8,  cuMemsetD8Async,   cuMemsetD16,   cuMemsetD16Async,
           cuMemsetD32, cuMemsetD32Async, cudaMemcpyFromArrayAsync

   CUresult cuMemcpyDtoA (CUarray dstArray, size_t dstOffset, CUdeviceptr srcDevice, size_t ByteCount)
       Copies from device memory to a 1D CUDA array. dstArray and dstOffset specify the CUDA array handle and
       starting offset in bytes of the destination data. srcDevice specifies the base pointer of the source.
       ByteCount specifies the number of bytes to copy.

       Parameters:
           dstArray - Destination array
           dstOffset - Offset in bytes of destination array
           srcDevice - Source device pointer
           ByteCount - Size of memory copy in bytes

       Returns:
           CUDA_SUCCESS,   CUDA_ERROR_DEINITIALIZED,   CUDA_ERROR_NOT_INITIALIZED,   CUDA_ERROR_INVALID_CONTEXT,
           CUDA_ERROR_INVALID_VALUE

       Note:
           Note that this function may also return error codes from previous, asynchronous launches.

           This function exhibits synchronous behavior for most use cases.

       See also:
           cuArray3DCreate,   cuArray3DGetDescriptor,   cuArrayCreate,   cuArrayDestroy,   cuArrayGetDescriptor,
           cuMemAlloc,   cuMemAllocHost,   cuMemAllocPitch,  cuMemcpy2D,  cuMemcpy2DAsync,  cuMemcpy2DUnaligned,
           cuMemcpy3D,   cuMemcpy3DAsync,   cuMemcpyAtoA,   cuMemcpyAtoD,    cuMemcpyAtoH,    cuMemcpyAtoHAsync,
           cuMemcpyDtoD,  cuMemcpyDtoDAsync,  cuMemcpyDtoH,  cuMemcpyDtoHAsync, cuMemcpyHtoA, cuMemcpyHtoAAsync,
           cuMemcpyHtoD,  cuMemcpyHtoDAsync,  cuMemFree,  cuMemFreeHost,   cuMemGetAddressRange,   cuMemGetInfo,
           cuMemHostAlloc,  cuMemHostGetDevicePointer,  cuMemsetD2D8,  cuMemsetD2D16, cuMemsetD2D32, cuMemsetD8,
           cuMemsetD16, cuMemsetD32, cudaMemcpyToArray
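
       Example:
           Because dstOffset is expressed in bytes rather than in elements, callers that work with element
           indices have to convert first. The following is a minimal sketch only: array_byte_offset() and
           copy_fits() are hypothetical helpers, not part of the CUDA driver API, and they assume a 1D array
           whose elements each hold num_channels components of elem_size bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: byte offset of element elem_index in a 1D array
 * whose elements each hold num_channels components of elem_size bytes. */
size_t array_byte_offset(size_t elem_index, size_t elem_size,
                         size_t num_channels)
{
    assert(elem_size > 0 && num_channels > 0);
    return elem_index * elem_size * num_channels;
}

/* Returns 1 if byte_count bytes starting at byte_offset stay inside an
 * array of array_bytes bytes, 0 otherwise. Written to avoid overflow. */
int copy_fits(size_t byte_offset, size_t byte_count, size_t array_bytes)
{
    return byte_offset <= array_bytes &&
           byte_count <= array_bytes - byte_offset;
}
```

           For a single-channel array of 4-byte elements, array_byte_offset(10, 4, 1) gives the value to pass
           as dstOffset when the copy should start at element 10. The same conversion applies to
           cuMemcpyHtoA().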

   CUresult cuMemcpyDtoD (CUdeviceptr dstDevice, CUdeviceptr srcDevice, size_t ByteCount)
       Copies from device memory to device memory.  dstDevice  and  srcDevice  are  the  base  pointers  of  the
       destination and source, respectively. ByteCount specifies the number of bytes to copy.

       Parameters:
           dstDevice - Destination device pointer
           srcDevice - Source device pointer
           ByteCount - Size of memory copy in bytes

       Returns:
           CUDA_SUCCESS,   CUDA_ERROR_DEINITIALIZED,   CUDA_ERROR_NOT_INITIALIZED,   CUDA_ERROR_INVALID_CONTEXT,
           CUDA_ERROR_INVALID_VALUE

       Note:
           Note that this function may also return error codes from previous, asynchronous launches.

           This function exhibits synchronous behavior for most use cases.

       See also:
           cuArray3DCreate,   cuArray3DGetDescriptor,   cuArrayCreate,   cuArrayDestroy,   cuArrayGetDescriptor,
           cuMemAlloc,   cuMemAllocHost,   cuMemAllocPitch,  cuMemcpy2D,  cuMemcpy2DAsync,  cuMemcpy2DUnaligned,
           cuMemcpy3D,   cuMemcpy3DAsync,   cuMemcpyAtoA,   cuMemcpyAtoD,    cuMemcpyAtoH,    cuMemcpyAtoHAsync,
           cuMemcpyDtoA,   cuMemcpyDtoH,   cuMemcpyDtoHAsync,   cuMemcpyHtoA,  cuMemcpyHtoAAsync,  cuMemcpyHtoD,
           cuMemcpyHtoDAsync,  cuMemFree,  cuMemFreeHost,  cuMemGetAddressRange,  cuMemGetInfo,  cuMemHostAlloc,
           cuMemHostGetDevicePointer,   cuMemsetD2D8,  cuMemsetD2D16,  cuMemsetD2D32,  cuMemsetD8,  cuMemsetD16,
           cuMemsetD32, cudaMemcpy, cudaMemcpyToSymbol, cudaMemcpyFromSymbol

   CUresult cuMemcpyDtoDAsync (CUdeviceptr dstDevice, CUdeviceptr srcDevice, size_t ByteCount, CUstream hStream)
       Copies from device memory to device memory.  dstDevice  and  srcDevice  are  the  base  pointers  of  the
       destination and source, respectively. ByteCount specifies the number of bytes to copy.

       Parameters:
           dstDevice - Destination device pointer
           srcDevice - Source device pointer
           ByteCount - Size of memory copy in bytes
           hStream - Stream identifier

       Returns:
           CUDA_SUCCESS,   CUDA_ERROR_DEINITIALIZED,   CUDA_ERROR_NOT_INITIALIZED,   CUDA_ERROR_INVALID_CONTEXT,
           CUDA_ERROR_INVALID_VALUE, CUDA_ERROR_INVALID_HANDLE

       Note:
           Note that this function may also return error codes from previous, asynchronous launches.

           This function exhibits asynchronous behavior for most use cases.

           This function uses standard default stream semantics.

       See also:
           cuArray3DCreate,   cuArray3DGetDescriptor,   cuArrayCreate,   cuArrayDestroy,   cuArrayGetDescriptor,
           cuMemAlloc,   cuMemAllocHost,   cuMemAllocPitch,  cuMemcpy2D,  cuMemcpy2DAsync,  cuMemcpy2DUnaligned,
           cuMemcpy3D,   cuMemcpy3DAsync,   cuMemcpyAtoA,   cuMemcpyAtoD,    cuMemcpyAtoH,    cuMemcpyAtoHAsync,
           cuMemcpyDtoA,   cuMemcpyDtoD,   cuMemcpyDtoH,   cuMemcpyDtoHAsync,  cuMemcpyHtoA,  cuMemcpyHtoAAsync,
           cuMemcpyHtoD,  cuMemcpyHtoDAsync,  cuMemFree,  cuMemFreeHost,   cuMemGetAddressRange,   cuMemGetInfo,
           cuMemHostAlloc,    cuMemHostGetDevicePointer,    cuMemsetD2D8,    cuMemsetD2D8Async,   cuMemsetD2D16,
           cuMemsetD2D16Async,  cuMemsetD2D32,  cuMemsetD2D32Async,  cuMemsetD8,  cuMemsetD8Async,  cuMemsetD16,
           cuMemsetD16Async,    cuMemsetD32,    cuMemsetD32Async,    cudaMemcpyAsync,   cudaMemcpyToSymbolAsync,
           cudaMemcpyFromSymbolAsync

   CUresult cuMemcpyDtoH (void * dstHost, CUdeviceptr srcDevice, size_t ByteCount)
       Copies from device to host memory. dstHost and srcDevice specify the base pointers of the destination and
       source, respectively. ByteCount specifies the number of bytes to copy.

       Parameters:
           dstHost - Destination host pointer
           srcDevice - Source device pointer
           ByteCount - Size of memory copy in bytes

       Returns:
           CUDA_SUCCESS,   CUDA_ERROR_DEINITIALIZED,   CUDA_ERROR_NOT_INITIALIZED,   CUDA_ERROR_INVALID_CONTEXT,
           CUDA_ERROR_INVALID_VALUE

       Note:
           Note that this function may also return error codes from previous, asynchronous launches.

           This function exhibits synchronous behavior for most use cases.

       See also:
           cuArray3DCreate,   cuArray3DGetDescriptor,   cuArrayCreate,   cuArrayDestroy,   cuArrayGetDescriptor,
           cuMemAlloc,  cuMemAllocHost,  cuMemAllocPitch,  cuMemcpy2D,   cuMemcpy2DAsync,   cuMemcpy2DUnaligned,
           cuMemcpy3D,    cuMemcpy3DAsync,    cuMemcpyAtoA,   cuMemcpyAtoD,   cuMemcpyAtoH,   cuMemcpyAtoHAsync,
           cuMemcpyDtoA, cuMemcpyDtoD, cuMemcpyDtoDAsync,  cuMemcpyDtoHAsync,  cuMemcpyHtoA,  cuMemcpyHtoAAsync,
           cuMemcpyHtoD,   cuMemcpyHtoDAsync,   cuMemFree,  cuMemFreeHost,  cuMemGetAddressRange,  cuMemGetInfo,
           cuMemHostAlloc, cuMemHostGetDevicePointer, cuMemsetD2D8,  cuMemsetD2D16,  cuMemsetD2D32,  cuMemsetD8,
           cuMemsetD16, cuMemsetD32, cudaMemcpy, cudaMemcpyFromSymbol

   CUresult cuMemcpyDtoHAsync (void * dstHost, CUdeviceptr srcDevice, size_t ByteCount, CUstream hStream)
       Copies from device to host memory. dstHost and srcDevice specify the base pointers of the destination and
       source, respectively. ByteCount specifies the number of bytes to copy.

       Parameters:
           dstHost - Destination host pointer
           srcDevice - Source device pointer
           ByteCount - Size of memory copy in bytes
           hStream - Stream identifier

       Returns:
           CUDA_SUCCESS,   CUDA_ERROR_DEINITIALIZED,   CUDA_ERROR_NOT_INITIALIZED,   CUDA_ERROR_INVALID_CONTEXT,
           CUDA_ERROR_INVALID_VALUE, CUDA_ERROR_INVALID_HANDLE

       Note:
           Note that this function may also return error codes from previous, asynchronous launches.

           This function exhibits asynchronous behavior for most use cases.

           This function uses standard default stream semantics.

       See also:
           cuArray3DCreate,   cuArray3DGetDescriptor,   cuArrayCreate,   cuArrayDestroy,   cuArrayGetDescriptor,
           cuMemAlloc,   cuMemAllocHost,   cuMemAllocPitch,  cuMemcpy2D,  cuMemcpy2DAsync,  cuMemcpy2DUnaligned,
           cuMemcpy3D,   cuMemcpy3DAsync,   cuMemcpyAtoA,   cuMemcpyAtoD,    cuMemcpyAtoH,    cuMemcpyAtoHAsync,
           cuMemcpyDtoA,   cuMemcpyDtoD,   cuMemcpyDtoDAsync,   cuMemcpyDtoH,  cuMemcpyHtoA,  cuMemcpyHtoAAsync,
           cuMemcpyHtoD,  cuMemcpyHtoDAsync,  cuMemFree,  cuMemFreeHost,   cuMemGetAddressRange,   cuMemGetInfo,
           cuMemHostAlloc,    cuMemHostGetDevicePointer,    cuMemsetD2D8,    cuMemsetD2D8Async,   cuMemsetD2D16,
           cuMemsetD2D16Async,  cuMemsetD2D32,  cuMemsetD2D32Async,  cuMemsetD8,  cuMemsetD8Async,  cuMemsetD16,
           cuMemsetD16Async, cuMemsetD32, cuMemsetD32Async, cudaMemcpyAsync, cudaMemcpyFromSymbolAsync

   CUresult cuMemcpyHtoA (CUarray dstArray, size_t dstOffset, const void * srcHost, size_t ByteCount)
       Copies  from  host  memory  to  a 1D CUDA array. dstArray and dstOffset specify the CUDA array handle and
       starting offset in bytes of the destination data. srcHost specifies  the  base  address  of  the  source.
       ByteCount specifies the number of bytes to copy.

       Parameters:
           dstArray - Destination array
           dstOffset - Offset in bytes of destination array
           srcHost - Source host pointer
           ByteCount - Size of memory copy in bytes

       Returns:
           CUDA_SUCCESS,   CUDA_ERROR_DEINITIALIZED,   CUDA_ERROR_NOT_INITIALIZED,   CUDA_ERROR_INVALID_CONTEXT,
           CUDA_ERROR_INVALID_VALUE

       Note:
           Note that this function may also return error codes from previous, asynchronous launches.

           This function exhibits synchronous behavior for most use cases.

       See also:
           cuArray3DCreate,   cuArray3DGetDescriptor,   cuArrayCreate,   cuArrayDestroy,   cuArrayGetDescriptor,
           cuMemAlloc,   cuMemAllocHost,   cuMemAllocPitch,  cuMemcpy2D,  cuMemcpy2DAsync,  cuMemcpy2DUnaligned,
           cuMemcpy3D,   cuMemcpy3DAsync,   cuMemcpyAtoA,   cuMemcpyAtoD,    cuMemcpyAtoH,    cuMemcpyAtoHAsync,
           cuMemcpyDtoA,  cuMemcpyDtoD,  cuMemcpyDtoDAsync,  cuMemcpyDtoH, cuMemcpyDtoHAsync, cuMemcpyHtoAAsync,
           cuMemcpyHtoD,  cuMemcpyHtoDAsync,  cuMemFree,  cuMemFreeHost,   cuMemGetAddressRange,   cuMemGetInfo,
           cuMemHostAlloc,  cuMemHostGetDevicePointer,  cuMemsetD2D8,  cuMemsetD2D16, cuMemsetD2D32, cuMemsetD8,
           cuMemsetD16, cuMemsetD32, cudaMemcpyToArray

   CUresult cuMemcpyHtoAAsync (CUarray dstArray, size_t dstOffset,  const  void  *  srcHost,  size_t  ByteCount,
       CUstream hStream)
       Copies  from  host  memory  to  a 1D CUDA array. dstArray and dstOffset specify the CUDA array handle and
       starting offset in bytes of the destination data. srcHost specifies  the  base  address  of  the  source.
       ByteCount specifies the number of bytes to copy.

       Parameters:
           dstArray - Destination array
           dstOffset - Offset in bytes of destination array
           srcHost - Source host pointer
           ByteCount - Size of memory copy in bytes
           hStream - Stream identifier

       Returns:
           CUDA_SUCCESS,   CUDA_ERROR_DEINITIALIZED,   CUDA_ERROR_NOT_INITIALIZED,   CUDA_ERROR_INVALID_CONTEXT,
           CUDA_ERROR_INVALID_VALUE, CUDA_ERROR_INVALID_HANDLE

       Note:
           Note that this function may also return error codes from previous, asynchronous launches.

           This function exhibits asynchronous behavior for most use cases.

           This function uses standard default stream semantics.

       See also:
           cuArray3DCreate,   cuArray3DGetDescriptor,   cuArrayCreate,   cuArrayDestroy,   cuArrayGetDescriptor,
           cuMemAlloc,   cuMemAllocHost,   cuMemAllocPitch,  cuMemcpy2D,  cuMemcpy2DAsync,  cuMemcpy2DUnaligned,
           cuMemcpy3D,   cuMemcpy3DAsync,   cuMemcpyAtoA,   cuMemcpyAtoD,    cuMemcpyAtoH,    cuMemcpyAtoHAsync,
           cuMemcpyDtoA,   cuMemcpyDtoD,   cuMemcpyDtoDAsync,   cuMemcpyDtoH,  cuMemcpyDtoHAsync,  cuMemcpyHtoA,
           cuMemcpyHtoD,  cuMemcpyHtoDAsync,  cuMemFree,  cuMemFreeHost,   cuMemGetAddressRange,   cuMemGetInfo,
           cuMemHostAlloc,    cuMemHostGetDevicePointer,    cuMemsetD2D8,    cuMemsetD2D8Async,   cuMemsetD2D16,
           cuMemsetD2D16Async,  cuMemsetD2D32,  cuMemsetD2D32Async,  cuMemsetD8,  cuMemsetD8Async,  cuMemsetD16,
           cuMemsetD16Async, cuMemsetD32, cuMemsetD32Async, cudaMemcpyToArrayAsync

   CUresult cuMemcpyHtoD (CUdeviceptr dstDevice, const void * srcHost, size_t ByteCount)
       Copies from host memory to device memory. dstDevice and srcHost are the base addresses of the destination
       and source, respectively. ByteCount specifies the number of bytes to copy.

       Parameters:
           dstDevice - Destination device pointer
           srcHost - Source host pointer
           ByteCount - Size of memory copy in bytes

       Returns:
           CUDA_SUCCESS,   CUDA_ERROR_DEINITIALIZED,   CUDA_ERROR_NOT_INITIALIZED,   CUDA_ERROR_INVALID_CONTEXT,
           CUDA_ERROR_INVALID_VALUE

       Note:
           Note that this function may also return error codes from previous, asynchronous launches.

           This function exhibits synchronous behavior for most use cases.

       See also:
           cuArray3DCreate,   cuArray3DGetDescriptor,   cuArrayCreate,   cuArrayDestroy,   cuArrayGetDescriptor,
           cuMemAlloc,   cuMemAllocHost,   cuMemAllocPitch,  cuMemcpy2D,  cuMemcpy2DAsync,  cuMemcpy2DUnaligned,
           cuMemcpy3D,   cuMemcpy3DAsync,   cuMemcpyAtoA,   cuMemcpyAtoD,    cuMemcpyAtoH,    cuMemcpyAtoHAsync,
           cuMemcpyDtoA,   cuMemcpyDtoD,   cuMemcpyDtoDAsync,   cuMemcpyDtoH,  cuMemcpyDtoHAsync,  cuMemcpyHtoA,
           cuMemcpyHtoAAsync, cuMemcpyHtoDAsync, cuMemFree, cuMemFreeHost,  cuMemGetAddressRange,  cuMemGetInfo,
           cuMemHostAlloc,  cuMemHostGetDevicePointer,  cuMemsetD2D8,  cuMemsetD2D16, cuMemsetD2D32, cuMemsetD8,
           cuMemsetD16, cuMemsetD32, cudaMemcpy, cudaMemcpyToSymbol

   CUresult cuMemcpyHtoDAsync (CUdeviceptr dstDevice, const void * srcHost, size_t ByteCount, CUstream hStream)
       Copies from host memory to device memory. dstDevice and srcHost are the base addresses of the destination
       and source, respectively. ByteCount specifies the number of bytes to copy.

       Parameters:
           dstDevice - Destination device pointer
           srcHost - Source host pointer
           ByteCount - Size of memory copy in bytes
           hStream - Stream identifier

       Returns:
           CUDA_SUCCESS,   CUDA_ERROR_DEINITIALIZED,   CUDA_ERROR_NOT_INITIALIZED,   CUDA_ERROR_INVALID_CONTEXT,
           CUDA_ERROR_INVALID_VALUE, CUDA_ERROR_INVALID_HANDLE

       Note:
           Note that this function may also return error codes from previous, asynchronous launches.

           This function exhibits asynchronous behavior for most use cases.

           This function uses standard default stream semantics.

       See also:
           cuArray3DCreate,   cuArray3DGetDescriptor,   cuArrayCreate,   cuArrayDestroy,   cuArrayGetDescriptor,
           cuMemAlloc,  cuMemAllocHost,  cuMemAllocPitch,  cuMemcpy2D,   cuMemcpy2DAsync,   cuMemcpy2DUnaligned,
           cuMemcpy3D,    cuMemcpy3DAsync,    cuMemcpyAtoA,   cuMemcpyAtoD,   cuMemcpyAtoH,   cuMemcpyAtoHAsync,
           cuMemcpyDtoA,  cuMemcpyDtoD,  cuMemcpyDtoDAsync,   cuMemcpyDtoH,   cuMemcpyDtoHAsync,   cuMemcpyHtoA,
           cuMemcpyHtoAAsync,   cuMemcpyHtoD,   cuMemFree,  cuMemFreeHost,  cuMemGetAddressRange,  cuMemGetInfo,
           cuMemHostAlloc,   cuMemHostGetDevicePointer,    cuMemsetD2D8,    cuMemsetD2D8Async,    cuMemsetD2D16,
           cuMemsetD2D16Async,  cuMemsetD2D32,  cuMemsetD2D32Async,  cuMemsetD8,  cuMemsetD8Async,  cuMemsetD16,
           cuMemsetD16Async, cuMemsetD32, cuMemsetD32Async, cudaMemcpyAsync, cudaMemcpyToSymbolAsync
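
       Example:
           One common way to use an asynchronous copy is to split a large transfer into fixed-size chunks and
           issue one call per chunk. The sketch below covers only the chunk arithmetic; chunk_count() and
           chunk_range() are hypothetical helpers, not part of the CUDA driver API, and the chunk size is an
           assumption left to the caller.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: number of chunks needed to cover total bytes. */
size_t chunk_count(size_t total, size_t chunk_bytes)
{
    assert(chunk_bytes > 0);
    return total / chunk_bytes + (total % chunk_bytes != 0);
}

/* Computes the byte offset and size of chunk i; the last chunk may be
 * shorter than chunk_bytes. Requires i < chunk_count(total, chunk_bytes). */
void chunk_range(size_t total, size_t chunk_bytes, size_t i,
                 size_t *offset, size_t *size)
{
    assert(chunk_bytes > 0 && i < chunk_count(total, chunk_bytes));
    *offset = i * chunk_bytes;
    *size = (total - *offset < chunk_bytes) ? total - *offset : chunk_bytes;
}
```

           Each chunk would then be passed on as dstDevice + offset, (const char *)srcHost + offset and size,
           together with the chosen hStream. In this pattern the host buffer is typically page-locked, for
           example with cuMemHostAlloc().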

   CUresult  cuMemcpyPeer  (CUdeviceptr  dstDevice,  CUcontext  dstContext,  CUdeviceptr  srcDevice,   CUcontext
       srcContext, size_t ByteCount)
       Copies  from  device  memory  in  one  context to device memory in another context. dstDevice is the base
       device pointer of the destination memory and dstContext is the destination context. srcDevice is the base
       device pointer of the source memory and srcContext is the source context. ByteCount specifies the  number
       of bytes to copy.

       Parameters:
           dstDevice - Destination device pointer
           dstContext - Destination context
           srcDevice - Source device pointer
           srcContext - Source context
           ByteCount - Size of memory copy in bytes

       Returns:
           CUDA_SUCCESS,   CUDA_ERROR_DEINITIALIZED,   CUDA_ERROR_NOT_INITIALIZED,   CUDA_ERROR_INVALID_CONTEXT,
           CUDA_ERROR_INVALID_VALUE

       Note:
           Note that this function may also return error codes from previous, asynchronous launches.

           This function exhibits synchronous behavior for most use cases.

       See also:
           cuMemcpyDtoD,    cuMemcpy3DPeer,    cuMemcpyDtoDAsync,    cuMemcpyPeerAsync,     cuMemcpy3DPeerAsync,
           cudaMemcpyPeer

   CUresult  cuMemcpyPeerAsync  (CUdeviceptr  dstDevice,  CUcontext dstContext, CUdeviceptr srcDevice, CUcontext
       srcContext, size_t ByteCount, CUstream hStream)
       Copies from device memory in one context to device memory in  another  context.  dstDevice  is  the  base
       device pointer of the destination memory and dstContext is the destination context. srcDevice is the base
       device  pointer of the source memory and srcContext is the source context. ByteCount specifies the number
       of bytes to copy.

       Parameters:
           dstDevice - Destination device pointer
           dstContext - Destination context
           srcDevice - Source device pointer
           srcContext - Source context
           ByteCount - Size of memory copy in bytes
           hStream - Stream identifier

       Returns:
           CUDA_SUCCESS,   CUDA_ERROR_DEINITIALIZED,   CUDA_ERROR_NOT_INITIALIZED,   CUDA_ERROR_INVALID_CONTEXT,
           CUDA_ERROR_INVALID_VALUE, CUDA_ERROR_INVALID_HANDLE

       Note:
           Note that this function may also return error codes from previous, asynchronous launches.

           This function exhibits asynchronous behavior for most use cases.

           This function uses standard default stream semantics.

       See also:
           cuMemcpyDtoD,      cuMemcpyPeer,      cuMemcpy3DPeer,     cuMemcpyDtoDAsync,     cuMemcpy3DPeerAsync,
           cudaMemcpyPeerAsync

   CUresult cuMemFree (CUdeviceptr dptr)
       Frees the memory space pointed to by  dptr,  which  must  have  been  returned  by  a  previous  call  to
       cuMemAlloc() or cuMemAllocPitch().

       Parameters:
           dptr - Pointer to memory to free

       Returns:
           CUDA_SUCCESS,   CUDA_ERROR_DEINITIALIZED,   CUDA_ERROR_NOT_INITIALIZED,   CUDA_ERROR_INVALID_CONTEXT,
           CUDA_ERROR_INVALID_VALUE

       Note:
           Note that this function may also return error codes from previous, asynchronous launches.

       See also:
           cuArray3DCreate,   cuArray3DGetDescriptor,   cuArrayCreate,   cuArrayDestroy,   cuArrayGetDescriptor,
           cuMemAlloc,   cuMemAllocHost,   cuMemAllocPitch,  cuMemcpy2D,  cuMemcpy2DAsync,  cuMemcpy2DUnaligned,
           cuMemcpy3D,   cuMemcpy3DAsync,   cuMemcpyAtoA,   cuMemcpyAtoD,    cuMemcpyAtoH,    cuMemcpyAtoHAsync,
           cuMemcpyDtoA,   cuMemcpyDtoD,   cuMemcpyDtoDAsync,   cuMemcpyDtoH,  cuMemcpyDtoHAsync,  cuMemcpyHtoA,
           cuMemcpyHtoAAsync,    cuMemcpyHtoD,    cuMemcpyHtoDAsync,    cuMemFreeHost,     cuMemGetAddressRange,
           cuMemGetInfo,  cuMemHostAlloc, cuMemHostGetDevicePointer, cuMemsetD2D8, cuMemsetD2D16, cuMemsetD2D32,
           cuMemsetD8, cuMemsetD16, cuMemsetD32, cudaFree

   CUresult cuMemFreeHost (void * p)
       Frees the memory space  pointed  to  by  p,  which  must  have  been  returned  by  a  previous  call  to
       cuMemAllocHost().

       Parameters:
           p - Pointer to memory to free

       Returns:
           CUDA_SUCCESS,   CUDA_ERROR_DEINITIALIZED,   CUDA_ERROR_NOT_INITIALIZED,   CUDA_ERROR_INVALID_CONTEXT,
           CUDA_ERROR_INVALID_VALUE

       Note:
           Note that this function may also return error codes from previous, asynchronous launches.

       See also:
           cuArray3DCreate,   cuArray3DGetDescriptor,   cuArrayCreate,   cuArrayDestroy,   cuArrayGetDescriptor,
           cuMemAlloc,   cuMemAllocHost,   cuMemAllocPitch,  cuMemcpy2D,  cuMemcpy2DAsync,  cuMemcpy2DUnaligned,
           cuMemcpy3D,   cuMemcpy3DAsync,   cuMemcpyAtoA,   cuMemcpyAtoD,    cuMemcpyAtoH,    cuMemcpyAtoHAsync,
           cuMemcpyDtoA,   cuMemcpyDtoD,   cuMemcpyDtoDAsync,   cuMemcpyDtoH,  cuMemcpyDtoHAsync,  cuMemcpyHtoA,
           cuMemcpyHtoAAsync, cuMemcpyHtoD, cuMemcpyHtoDAsync,  cuMemFree,  cuMemGetAddressRange,  cuMemGetInfo,
           cuMemHostAlloc,  cuMemHostGetDevicePointer,  cuMemsetD2D8,  cuMemsetD2D16, cuMemsetD2D32, cuMemsetD8,
           cuMemsetD16, cuMemsetD32, cudaFreeHost

   CUresult cuMemGetAddressRange (CUdeviceptr * pbase, size_t * psize, CUdeviceptr dptr)
       Returns  the  base  address  in  *pbase  and  size  in  *psize  of  the  allocation  by  cuMemAlloc()  or
       cuMemAllocPitch()  that contains the input pointer dptr. Both parameters pbase and psize are optional. If
       one of them is NULL, it is ignored.

       Parameters:
           pbase - Returned base address
           psize - Returned size of device memory allocation
           dptr - Device pointer to query

       Returns:
           CUDA_SUCCESS,   CUDA_ERROR_DEINITIALIZED,   CUDA_ERROR_NOT_INITIALIZED,   CUDA_ERROR_INVALID_CONTEXT,
           CUDA_ERROR_NOT_FOUND, CUDA_ERROR_INVALID_VALUE

       Note:
           Note that this function may also return error codes from previous, asynchronous launches.

       See also:
           cuArray3DCreate,   cuArray3DGetDescriptor,   cuArrayCreate,   cuArrayDestroy,   cuArrayGetDescriptor,
           cuMemAlloc,  cuMemAllocHost,  cuMemAllocPitch,  cuMemcpy2D,   cuMemcpy2DAsync,   cuMemcpy2DUnaligned,
           cuMemcpy3D,    cuMemcpy3DAsync,    cuMemcpyAtoA,   cuMemcpyAtoD,   cuMemcpyAtoH,   cuMemcpyAtoHAsync,
           cuMemcpyDtoA,  cuMemcpyDtoD,  cuMemcpyDtoDAsync,   cuMemcpyDtoH,   cuMemcpyDtoHAsync,   cuMemcpyHtoA,
           cuMemcpyHtoAAsync,   cuMemcpyHtoD,   cuMemcpyHtoDAsync,   cuMemFree,   cuMemFreeHost,   cuMemGetInfo,
           cuMemHostAlloc, cuMemHostGetDevicePointer, cuMemsetD2D8,  cuMemsetD2D16,  cuMemsetD2D32,  cuMemsetD8,
           cuMemsetD16, cuMemsetD32
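
       Example:
           The returned base address and size can be combined with the queried pointer to find out how many
           bytes remain between dptr and the end of its allocation. This is a sketch under stated assumptions:
           dev_addr is a stand-in for CUdeviceptr (assumed to be a 64-bit unsigned integer) and bytes_to_end()
           is a hypothetical helper, not a driver API function.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for CUdeviceptr, assumed to be a 64-bit unsigned integer. */
typedef unsigned long long dev_addr;

/* Bytes from dptr to the end of the allocation [base, base + size).
 * base and size are the values returned in *pbase and *psize. */
size_t bytes_to_end(dev_addr base, size_t size, dev_addr dptr)
{
    assert(dptr >= base && dptr - base < size);
    return size - (size_t)(dptr - base);
}
```

           A copy that starts at dptr stays inside the allocation as long as its ByteCount does not exceed the
           returned value.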

   CUresult cuMemGetInfo (size_t * free, size_t * total)
       Returns in *free and *total respectively, the free and total amount of memory available for allocation by
       the CUDA context, in bytes.

       Parameters:
           free - Returned free memory in bytes
           total - Returned total memory in bytes

       Returns:
           CUDA_SUCCESS,   CUDA_ERROR_DEINITIALIZED,   CUDA_ERROR_NOT_INITIALIZED,   CUDA_ERROR_INVALID_CONTEXT,
           CUDA_ERROR_INVALID_VALUE

       Note:
           Note that this function may also return error codes from previous, asynchronous launches.

       See also:
           cuArray3DCreate,   cuArray3DGetDescriptor,   cuArrayCreate,   cuArrayDestroy,   cuArrayGetDescriptor,
           cuMemAlloc,   cuMemAllocHost,   cuMemAllocPitch,  cuMemcpy2D,  cuMemcpy2DAsync,  cuMemcpy2DUnaligned,
           cuMemcpy3D,   cuMemcpy3DAsync,   cuMemcpyAtoA,   cuMemcpyAtoD,    cuMemcpyAtoH,    cuMemcpyAtoHAsync,
           cuMemcpyDtoA,   cuMemcpyDtoD,   cuMemcpyDtoDAsync,   cuMemcpyDtoH,  cuMemcpyDtoHAsync,  cuMemcpyHtoA,
           cuMemcpyHtoAAsync, cuMemcpyHtoD, cuMemcpyHtoDAsync, cuMemFree,  cuMemFreeHost,  cuMemGetAddressRange,
           cuMemHostAlloc,  cuMemHostGetDevicePointer,  cuMemsetD2D8,  cuMemsetD2D16, cuMemsetD2D32, cuMemsetD8,
           cuMemsetD16, cuMemsetD32, cudaMemGetInfo
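
       Example:
           The value returned in *free can be used for a coarse check before a large allocation. The sketch
           below is illustrative only: allocation_fits() is a hypothetical helper and the headroom parameter is
           an assumption, a safety margin chosen by the caller. The reported value is a snapshot, so a later
           allocation may still fail even when this check passes.

```c
#include <stddef.h>

/* Hypothetical helper: returns 1 if request bytes can be taken from
 * free_bytes while leaving at least headroom bytes unused, 0 otherwise.
 * Written so that no intermediate sum can overflow. */
int allocation_fits(size_t free_bytes, size_t request, size_t headroom)
{
    return headroom <= free_bytes && request <= free_bytes - headroom;
}
```

           A caller would pass the value read from *free as free_bytes and the intended bytesize of the
           allocation as request.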

   CUresult cuMemHostAlloc (void ** pp, size_t bytesize, unsigned int Flags)
       Allocates bytesize bytes of host memory that is page-locked and accessible  to  the  device.  The  driver
       tracks  the  virtual  memory  ranges  allocated with this function and automatically accelerates calls to
       functions such as cuMemcpyHtoD(). Since the memory can be accessed directly by the device, it can be read
       or written with much higher bandwidth than pageable memory obtained  with  functions  such  as  malloc().
       Allocating excessive amounts of pinned memory may degrade system performance, since it reduces the amount
       of  memory  available  to  the  system  for  paging. As a result, this function is best used sparingly to
       allocate staging areas for data exchange between host and device.

       The Flags parameter enables different options to be specified that affect the allocation, as follows.

       • CU_MEMHOSTALLOC_PORTABLE: The memory returned by this call will be considered as pinned memory  by  all
         CUDA contexts, not just the one that performed the allocation.

       • CU_MEMHOSTALLOC_DEVICEMAP:  Maps  the allocation into the CUDA address space. The device pointer to the
         memory may be obtained by calling cuMemHostGetDevicePointer().

       • CU_MEMHOSTALLOC_WRITECOMBINED:  Allocates  the  memory  as  write-combined  (WC).  WC  memory  can   be
         transferred  across  the PCI Express bus more quickly on some system configurations, but cannot be read
         efficiently by most CPUs. WC memory is a good option for buffers that will be written by  the  CPU  and
         read by the GPU via mapped pinned memory or host->device transfers.

       All  of  these  flags  are  orthogonal  to one another: a developer may allocate memory that is portable,
       mapped and/or write-combined with no restrictions.

       The  CUDA  context  must  have  been  created  with  the  CU_CTX_MAP_HOST   flag   in   order   for   the
       CU_MEMHOSTALLOC_DEVICEMAP flag to have any effect.

       The  CU_MEMHOSTALLOC_DEVICEMAP  flag  may  be  specified on CUDA contexts for devices that do not support
       mapped pinned memory. The failure is deferred to cuMemHostGetDevicePointer() because the  memory  may  be
       mapped into other CUDA contexts via the CU_MEMHOSTALLOC_PORTABLE flag.

       The memory allocated by this function must be freed with cuMemFreeHost().

       Note all host memory allocated using cuMemHostAlloc() will automatically be immediately accessible to all
       contexts   on   all   devices   which   support   unified   addressing   (as   may   be   queried   using
       CU_DEVICE_ATTRIBUTE_UNIFIED_ADDRESSING). Unless the flag CU_MEMHOSTALLOC_WRITECOMBINED is specified,  the
       device  pointer  that  may  be used to access this host memory from those contexts is always equal to the
       returned host pointer *pp. If the flag CU_MEMHOSTALLOC_WRITECOMBINED  is  specified,  then  the  function
       cuMemHostGetDevicePointer()  must  be  used  to  query  the  device pointer, even if the context supports
       unified addressing. See Unified Addressing for additional details.

       Parameters:
           pp - Returned host pointer to page-locked memory
           bytesize - Requested allocation size in bytes
           Flags - Flags for allocation request

       Returns:
           CUDA_SUCCESS,   CUDA_ERROR_DEINITIALIZED,   CUDA_ERROR_NOT_INITIALIZED,   CUDA_ERROR_INVALID_CONTEXT,
           CUDA_ERROR_INVALID_VALUE, CUDA_ERROR_OUT_OF_MEMORY

       Note:
           Note that this function may also return error codes from previous, asynchronous launches.

       See also:
           cuArray3DCreate,   cuArray3DGetDescriptor,   cuArrayCreate,   cuArrayDestroy,   cuArrayGetDescriptor,
           cuMemAlloc,  cuMemAllocHost,  cuMemAllocPitch,  cuMemcpy2D,   cuMemcpy2DAsync,   cuMemcpy2DUnaligned,
           cuMemcpy3D,    cuMemcpy3DAsync,    cuMemcpyAtoA,   cuMemcpyAtoD,   cuMemcpyAtoH,   cuMemcpyAtoHAsync,
           cuMemcpyDtoA,  cuMemcpyDtoD,  cuMemcpyDtoDAsync,   cuMemcpyDtoH,   cuMemcpyDtoHAsync,   cuMemcpyHtoA,
           cuMemcpyHtoAAsync,  cuMemcpyHtoD,  cuMemcpyHtoDAsync, cuMemFree, cuMemFreeHost, cuMemGetAddressRange,
           cuMemGetInfo,  cuMemHostGetDevicePointer,  cuMemsetD2D8,  cuMemsetD2D16,  cuMemsetD2D32,  cuMemsetD8,
           cuMemsetD16, cuMemsetD32, cudaHostAlloc
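
       Example:
           The rules above for when the returned host pointer can double as the device pointer can be
           summarised in a small predicate. This is a sketch of one reading of those rules, not driver code:
           must_query_device_pointer() is a hypothetical helper, and the flag values are repeated from cuda.h
           only so that the fragment compiles without the CUDA headers.

```c
/* Values mirror the definitions in cuda.h; they are repeated here only so
 * that the sketch compiles without the CUDA headers. */
#ifndef CU_MEMHOSTALLOC_PORTABLE
#define CU_MEMHOSTALLOC_PORTABLE      0x01
#define CU_MEMHOSTALLOC_DEVICEMAP     0x02
#define CU_MEMHOSTALLOC_WRITECOMBINED 0x04
#endif

/* Hypothetical helper: returns 1 when the device pointer has to be queried
 * with cuMemHostGetDevicePointer(), 0 when the host pointer *pp can be used
 * directly. unified_addressing is the value reported for
 * CU_DEVICE_ATTRIBUTE_UNIFIED_ADDRESSING. */
int must_query_device_pointer(unsigned int flags, int unified_addressing)
{
    if (!unified_addressing)
        return 1; /* no unified addressing: host and device pointers differ */
    return (flags & CU_MEMHOSTALLOC_WRITECOMBINED) != 0;
}
```

           The flags are combined with bitwise OR, for example CU_MEMHOSTALLOC_PORTABLE |
           CU_MEMHOSTALLOC_DEVICEMAP for memory that is both portable and mapped.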

   CUresult cuMemHostGetDevicePointer (CUdeviceptr * pdptr, void * p, unsigned int Flags)
       Passes  back  the  device  pointer  pdptr  corresponding to the mapped, pinned host buffer p allocated by
       cuMemHostAlloc.

       cuMemHostGetDevicePointer() will fail if the CU_MEMHOSTALLOC_DEVICEMAP flag was not specified at the time
       the memory was allocated, or if the function is called on a GPU  that  does  not  support  mapped  pinned
       memory.

       For      devices      that     have     a     non-zero     value     for     the     device     attribute
       CU_DEVICE_ATTRIBUTE_CAN_USE_HOST_POINTER_FOR_REGISTERED_MEM, the memory can also  be  accessed  from  the
       device  using  the  host pointer p. The device pointer returned by cuMemHostGetDevicePointer() may or may
       not match the original host pointer p and depends on the devices  visible  to  the  application.  If  all
       devices  visible  to  the  application have a non-zero value for the device attribute, the device pointer
       returned by cuMemHostGetDevicePointer() will match the original pointer p. If any device visible  to  the
       application   has   a   zero   value   for   the   device  attribute,  the  device  pointer  returned  by
       cuMemHostGetDevicePointer() will not match the original host pointer p, but it will be suitable  for  use
       on all devices provided Unified Virtual Addressing is enabled. In such systems, it is valid to access the
       memory  using either pointer on devices that have a non-zero value for the device attribute. Note however
       that such devices should access the memory using only one of the two pointers and not both.

       Flags is reserved for future releases. For now, it must be set to 0.

       Parameters:
           pdptr - Returned device pointer
           p - Host pointer
           Flags - Options (must be 0)

       Returns:
           CUDA_SUCCESS,   CUDA_ERROR_DEINITIALIZED,   CUDA_ERROR_NOT_INITIALIZED,   CUDA_ERROR_INVALID_CONTEXT,
           CUDA_ERROR_INVALID_VALUE

       Note:
           Note that this function may also return error codes from previous, asynchronous launches.

       See also:
           cuArray3DCreate,   cuArray3DGetDescriptor,   cuArrayCreate,   cuArrayDestroy,   cuArrayGetDescriptor,
           cuMemAlloc,  cuMemAllocHost,  cuMemAllocPitch,  cuMemcpy2D,   cuMemcpy2DAsync,   cuMemcpy2DUnaligned,
           cuMemcpy3D,    cuMemcpy3DAsync,    cuMemcpyAtoA,   cuMemcpyAtoD,   cuMemcpyAtoH,   cuMemcpyAtoHAsync,
           cuMemcpyDtoA,  cuMemcpyDtoD,  cuMemcpyDtoDAsync,   cuMemcpyDtoH,   cuMemcpyDtoHAsync,   cuMemcpyHtoA,
           cuMemcpyHtoAAsync,  cuMemcpyHtoD,  cuMemcpyHtoDAsync, cuMemFree, cuMemFreeHost, cuMemGetAddressRange,
           cuMemGetInfo, cuMemHostAlloc, cuMemsetD2D8, cuMemsetD2D16,  cuMemsetD2D32,  cuMemsetD8,  cuMemsetD16,
           cuMemsetD32, cudaHostGetDevicePointer
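
       Example:
           Whether the returned device pointer equals the host pointer p depends on every device visible to the
           application, as described above. The sketch below expresses that rule over an array of attribute
           values; device_pointer_matches_host() is a hypothetical helper, and the caller is assumed to have
           already read CU_DEVICE_ATTRIBUTE_CAN_USE_HOST_POINTER_FOR_REGISTERED_MEM for each visible device.

```c
#include <stddef.h>

/* attrs[i] holds the value of
 * CU_DEVICE_ATTRIBUTE_CAN_USE_HOST_POINTER_FOR_REGISTERED_MEM for visible
 * device i. Returns 1 if every visible device reports a non-zero value,
 * in which case the device pointer is expected to match the host pointer;
 * returns 0 if any device reports zero. */
int device_pointer_matches_host(const int *attrs, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        if (attrs[i] == 0)
            return 0;
    }
    return 1;
}
```

           When the function returns 0, the pointer obtained from cuMemHostGetDevicePointer() is the one to
           use on devices that report a zero attribute value.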

   CUresult cuMemHostGetFlags (unsigned int * pFlags, void * p)
       Passes back in pFlags the flags that were specified when the pinned host buffer p was  allocated  by
       cuMemHostAlloc.

       cuMemHostGetFlags()  will  fail  if  the  pointer  does  not  reside  in  an  allocation   performed   by
       cuMemAllocHost() or cuMemHostAlloc().

       Parameters:
           pFlags - Returned flags word
           p - Host pointer

       Returns:
           CUDA_SUCCESS,   CUDA_ERROR_DEINITIALIZED,   CUDA_ERROR_NOT_INITIALIZED,   CUDA_ERROR_INVALID_CONTEXT,
           CUDA_ERROR_INVALID_VALUE

       Note:
           Note that this function may also return error codes from previous, asynchronous launches.

       See also:
           cuMemAllocHost, cuMemHostAlloc, cudaHostGetFlags

   CUresult cuMemHostRegister (void * p, size_t bytesize, unsigned int Flags)
       Page-locks the memory range specified by p and bytesize and maps it for the  device(s)  as  specified  by
       Flags.  This memory range also is added to the same tracking mechanism as cuMemHostAlloc to automatically
       accelerate calls to functions such as cuMemcpyHtoD(). Since the memory can be accessed  directly  by  the
       device,  it  can  be  read  or  written with much higher bandwidth than pageable memory that has not been
       registered. Page-locking excessive amounts of memory may degrade system performance, since it reduces the
       amount of memory available to the system for paging. As a result, this function is best used sparingly to
       register staging areas for data exchange between host and device.

       This function has limited support on Mac OS X. OS 10.7 or higher is required.

       The Flags parameter enables different options to be specified that affect the allocation, as follows.

       • CU_MEMHOSTREGISTER_PORTABLE: The memory returned by this call will be considered as  pinned  memory  by
         all CUDA contexts, not just the one that performed the allocation.

       • CU_MEMHOSTREGISTER_DEVICEMAP:  Maps  the  allocation into the CUDA address space. The device pointer to
         the memory may be obtained by calling cuMemHostGetDevicePointer().

       • CU_MEMHOSTREGISTER_IOMEMORY: The pointer is treated as pointing to some I/O memory space, e.g. the  PCI
         Express resource of a 3rd party device.

       All  of  these  flags are orthogonal to one another: a developer may page-lock memory that is portable or
       mapped with no restrictions.

       The  CUDA  context  must  have  been  created  with  the  CU_CTX_MAP_HOST   flag   in   order   for   the
       CU_MEMHOSTREGISTER_DEVICEMAP flag to have any effect.

       The  CU_MEMHOSTREGISTER_DEVICEMAP  flag may be specified on CUDA contexts for devices that do not support
       mapped pinned memory. The failure is deferred to cuMemHostGetDevicePointer() because the  memory  may  be
       mapped into other CUDA contexts via the CU_MEMHOSTREGISTER_PORTABLE flag.

       For devices that have a non-zero value for the device attribute
       CU_DEVICE_ATTRIBUTE_CAN_USE_HOST_POINTER_FOR_REGISTERED_MEM, the memory can also be accessed from the
       device using the host pointer p. The device pointer returned by cuMemHostGetDevicePointer() may or may
       not match the original host pointer p, depending on the devices visible to the application. If all
       devices visible to the application have a non-zero value for the device attribute, the device pointer
       returned by cuMemHostGetDevicePointer() will match the original host pointer p. If any device visible
       to the application has a zero value for the device attribute, the device pointer returned by
       cuMemHostGetDevicePointer() will not match the original host pointer p, but it will be suitable for use
       on all devices provided Unified Virtual Addressing is enabled. In such systems, it is valid to access
       the memory using either pointer on devices that have a non-zero value for the device attribute. Note,
       however, that such devices should access the memory using only one of the two pointers and not both.
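
       The pointer-matching rule above can be sketched as a small host-side model. This is only an
       illustration, not driver code: the function names are hypothetical, and attr[i] stands for the value
       an application would have queried for CU_DEVICE_ATTRIBUTE_CAN_USE_HOST_POINTER_FOR_REGISTERED_MEM on
       the i-th visible device.

```c
#include <stddef.h>

/* Host-side model only; it does not call the CUDA driver.
 * attr[i] is assumed to hold the queried value of
 * CU_DEVICE_ATTRIBUTE_CAN_USE_HOST_POINTER_FOR_REGISTERED_MEM
 * for the i-th device visible to the application. */

/* Returns 1 if the device pointer is expected to equal the host pointer,
 * i.e. every visible device reports a non-zero attribute; 0 otherwise. */
int device_pointer_matches_host(const int *attr, size_t device_count)
{
    for (size_t i = 0; i < device_count; ++i) {
        if (attr[i] == 0) {
            return 0;
        }
    }
    return 1;
}

/* Returns 1 if device i may also access the memory through the
 * original host pointer, 0 if it must use the device pointer. */
int device_can_use_host_pointer(const int *attr, size_t i)
{
    return attr[i] != 0;
}
```

       With attribute values {1, 0, 1}, the model predicts that the pointers differ, and that devices 0 and 2
       may still use the host pointer while device 1 may not.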

       The memory page-locked by this function must be unregistered with cuMemHostUnregister().

       Parameters:
           p - Host pointer to memory to page-lock
           bytesize - Size in bytes of the address range to page-lock
           Flags - Flags for allocation request

       Returns:
           CUDA_SUCCESS,   CUDA_ERROR_DEINITIALIZED,   CUDA_ERROR_NOT_INITIALIZED,   CUDA_ERROR_INVALID_CONTEXT,
           CUDA_ERROR_INVALID_VALUE,     CUDA_ERROR_OUT_OF_MEMORY,    CUDA_ERROR_HOST_MEMORY_ALREADY_REGISTERED,
           CUDA_ERROR_NOT_PERMITTED, CUDA_ERROR_NOT_SUPPORTED

       Note:
           Note that this function may also return error codes from previous, asynchronous launches.

       See also:
           cuMemHostUnregister, cuMemHostGetFlags, cuMemHostGetDevicePointer, cudaHostRegister

   CUresult cuMemHostUnregister (void * p)
       Unmaps the memory range whose base address is specified by p, and makes it pageable again.

       The base address must be the same one specified to cuMemHostRegister().

       Parameters:
           p - Host pointer to memory to unregister

       Returns:
           CUDA_SUCCESS,   CUDA_ERROR_DEINITIALIZED,   CUDA_ERROR_NOT_INITIALIZED,   CUDA_ERROR_INVALID_CONTEXT,
           CUDA_ERROR_INVALID_VALUE, CUDA_ERROR_OUT_OF_MEMORY, CUDA_ERROR_HOST_MEMORY_NOT_REGISTERED

       Note:
           Note that this function may also return error codes from previous, asynchronous launches.

       See also:
           cuMemHostRegister, cudaHostUnregister

   CUresult cuMemsetD16 (CUdeviceptr dstDevice, unsigned short us, size_t N)
       Sets  the  memory  range  of N 16-bit values to the specified value us. The dstDevice pointer must be two
       byte aligned.
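
       As a rough illustration of these semantics, the following host-side reference model performs the same
       fill on ordinary host memory. It is a sketch, not the driver implementation: model_memset_d16 is a
       hypothetical name, and the code assumes unsigned short is 16 bits wide. Note that N counts elements,
       so 2 * N bytes are written.

```c
#include <stddef.h>
#include <stdint.h>

/* Host-side reference model; it does not call the CUDA driver.
 * Writes n 16-bit values starting at dst.
 * Returns 0 on success, or -1 if dst is not two-byte aligned,
 * mirroring the documented alignment requirement. */
int model_memset_d16(void *dst, unsigned short us, size_t n)
{
    if (((uintptr_t)dst % 2u) != 0u) {
        return -1;
    }
    unsigned short *out = (unsigned short *)dst;
    for (size_t i = 0; i < n; ++i) {
        out[i] = us;   /* one element, i.e. two bytes, per iteration */
    }
    return 0;
}
```

       Such a model can be handy in unit tests that compare a buffer copied back from the device against the
       expected pattern.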

       Parameters:
           dstDevice - Destination device pointer
           us - Value to set
           N - Number of elements

       Returns:
           CUDA_SUCCESS,   CUDA_ERROR_DEINITIALIZED,   CUDA_ERROR_NOT_INITIALIZED,   CUDA_ERROR_INVALID_CONTEXT,
           CUDA_ERROR_INVALID_VALUE

       Note:
           Note that this function may also return error codes from previous, asynchronous launches.

       See also:
           cuArray3DCreate,   cuArray3DGetDescriptor,   cuArrayCreate,   cuArrayDestroy,   cuArrayGetDescriptor,
           cuMemAlloc,  cuMemAllocHost,  cuMemAllocPitch,  cuMemcpy2D,   cuMemcpy2DAsync,   cuMemcpy2DUnaligned,
           cuMemcpy3D,    cuMemcpy3DAsync,    cuMemcpyAtoA,   cuMemcpyAtoD,   cuMemcpyAtoH,   cuMemcpyAtoHAsync,
           cuMemcpyDtoA,  cuMemcpyDtoD,  cuMemcpyDtoDAsync,   cuMemcpyDtoH,   cuMemcpyDtoHAsync,   cuMemcpyHtoA,
           cuMemcpyHtoAAsync,  cuMemcpyHtoD,  cuMemcpyHtoDAsync, cuMemFree, cuMemFreeHost, cuMemGetAddressRange,
           cuMemGetInfo,    cuMemHostAlloc,    cuMemHostGetDevicePointer,    cuMemsetD2D8,    cuMemsetD2D8Async,
           cuMemsetD2D16,  cuMemsetD2D16Async,  cuMemsetD2D32,  cuMemsetD2D32Async, cuMemsetD8, cuMemsetD8Async,
           cuMemsetD16Async, cuMemsetD32, cuMemsetD32Async, cudaMemset

   CUresult cuMemsetD16Async (CUdeviceptr dstDevice, unsigned short us, size_t N, CUstream hStream)
       Sets the memory range of N 16-bit values to the specified value us. The dstDevice  pointer  must  be  two
       byte aligned.

       Parameters:
           dstDevice - Destination device pointer
           us - Value to set
           N - Number of elements
           hStream - Stream identifier

       Returns:
           CUDA_SUCCESS,   CUDA_ERROR_DEINITIALIZED,   CUDA_ERROR_NOT_INITIALIZED,   CUDA_ERROR_INVALID_CONTEXT,
           CUDA_ERROR_INVALID_VALUE

       Note:
           Note that this function may also return error codes from previous, asynchronous launches.

           This function uses standard default stream semantics.

       See also:
           cuArray3DCreate,   cuArray3DGetDescriptor,   cuArrayCreate,   cuArrayDestroy,   cuArrayGetDescriptor,
           cuMemAlloc,   cuMemAllocHost,   cuMemAllocPitch,  cuMemcpy2D,  cuMemcpy2DAsync,  cuMemcpy2DUnaligned,
           cuMemcpy3D,   cuMemcpy3DAsync,   cuMemcpyAtoA,   cuMemcpyAtoD,    cuMemcpyAtoH,    cuMemcpyAtoHAsync,
           cuMemcpyDtoA,   cuMemcpyDtoD,   cuMemcpyDtoDAsync,   cuMemcpyDtoH,  cuMemcpyDtoHAsync,  cuMemcpyHtoA,
           cuMemcpyHtoAAsync, cuMemcpyHtoD, cuMemcpyHtoDAsync, cuMemFree,  cuMemFreeHost,  cuMemGetAddressRange,
           cuMemGetInfo,    cuMemHostAlloc,    cuMemHostGetDevicePointer,    cuMemsetD2D8,    cuMemsetD2D8Async,
           cuMemsetD2D16, cuMemsetD2D16Async, cuMemsetD2D32,  cuMemsetD2D32Async,  cuMemsetD8,  cuMemsetD8Async,
           cuMemsetD16, cuMemsetD32, cuMemsetD32Async, cudaMemsetAsync

   CUresult  cuMemsetD2D16  (CUdeviceptr  dstDevice,  size_t  dstPitch,  unsigned short us, size_t Width, size_t
       Height)
       Sets the 2D memory range of Width 16-bit values to the specified value us. Height specifies the number of
       rows to set, and dstPitch specifies the number of bytes between  each  row.  The  dstDevice  pointer  and
       dstPitch  offset  must be two byte aligned. This function performs fastest when the pitch is one that has
       been passed back by cuMemAllocPitch().
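
       The relationship between Width (in elements), Height (in rows) and dstPitch (in bytes) can be sketched
       with a host-side reference model. This is an illustration under stated assumptions, not the driver
       implementation: model_memset_d2d16 is a hypothetical name, unsigned short is assumed to be 16 bits
       wide, and the caller is assumed to pass a pitch of at least 2 * Width bytes.

```c
#include <stddef.h>
#include <stdint.h>

/* Host-side reference model; it does not call the CUDA driver.
 * Fills `height` rows of `width` 16-bit values each. Consecutive rows
 * start `pitch_bytes` apart; bytes between the end of one row and the
 * start of the next are left untouched.
 * Returns 0 on success, or -1 if dst or pitch_bytes is not two-byte
 * aligned, mirroring the documented alignment requirement. */
int model_memset_d2d16(void *dst, size_t pitch_bytes, unsigned short us,
                       size_t width, size_t height)
{
    if (((uintptr_t)dst % 2u) != 0u || (pitch_bytes % 2u) != 0u) {
        return -1;
    }
    unsigned char *row = (unsigned char *)dst;
    for (size_t y = 0; y < height; ++y) {
        unsigned short *out = (unsigned short *)row;
        for (size_t x = 0; x < width; ++x) {
            out[x] = us;
        }
        row += pitch_bytes;   /* pitch is in bytes, width is in elements */
    }
    return 0;
}
```

       For a buffer with a pitch of 8 bytes and a Width of 3, each row receives three values and the fourth
       16-bit slot of every row keeps its previous contents.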

       Parameters:
           dstDevice - Destination device pointer
           dstPitch - Pitch of destination device pointer
           us - Value to set
           Width - Width of row
           Height - Number of rows

       Returns:
           CUDA_SUCCESS,   CUDA_ERROR_DEINITIALIZED,   CUDA_ERROR_NOT_INITIALIZED,   CUDA_ERROR_INVALID_CONTEXT,
           CUDA_ERROR_INVALID_VALUE

       Note:
           Note that this function may also return error codes from previous, asynchronous launches.

       See also:
           cuArray3DCreate,   cuArray3DGetDescriptor,   cuArrayCreate,   cuArrayDestroy,   cuArrayGetDescriptor,
           cuMemAlloc,  cuMemAllocHost,  cuMemAllocPitch,  cuMemcpy2D,   cuMemcpy2DAsync,   cuMemcpy2DUnaligned,
           cuMemcpy3D,    cuMemcpy3DAsync,    cuMemcpyAtoA,   cuMemcpyAtoD,   cuMemcpyAtoH,   cuMemcpyAtoHAsync,
           cuMemcpyDtoA,  cuMemcpyDtoD,  cuMemcpyDtoDAsync,   cuMemcpyDtoH,   cuMemcpyDtoHAsync,   cuMemcpyHtoA,
           cuMemcpyHtoAAsync,  cuMemcpyHtoD,  cuMemcpyHtoDAsync, cuMemFree, cuMemFreeHost, cuMemGetAddressRange,
           cuMemGetInfo,    cuMemHostAlloc,    cuMemHostGetDevicePointer,    cuMemsetD2D8,    cuMemsetD2D8Async,
           cuMemsetD2D16Async,  cuMemsetD2D32,  cuMemsetD2D32Async,  cuMemsetD8,  cuMemsetD8Async,  cuMemsetD16,
           cuMemsetD16Async, cuMemsetD32, cuMemsetD32Async, cudaMemset2D

   CUresult cuMemsetD2D16Async (CUdeviceptr dstDevice, size_t dstPitch, unsigned short us, size_t Width,  size_t
       Height, CUstream hStream)
       Sets the 2D memory range of Width 16-bit values to the specified value us. Height specifies the number of
       rows  to  set,  and  dstPitch  specifies  the number of bytes between each row. The dstDevice pointer and
       dstPitch offset must be two byte aligned. This function performs fastest when the pitch is one  that  has
       been passed back by cuMemAllocPitch().

       Parameters:
           dstDevice - Destination device pointer
           dstPitch - Pitch of destination device pointer
           us - Value to set
           Width - Width of row
           Height - Number of rows
           hStream - Stream identifier

       Returns:
           CUDA_SUCCESS,   CUDA_ERROR_DEINITIALIZED,   CUDA_ERROR_NOT_INITIALIZED,   CUDA_ERROR_INVALID_CONTEXT,
           CUDA_ERROR_INVALID_VALUE

       Note:
           Note that this function may also return error codes from previous, asynchronous launches.

           This function uses standard default stream semantics.

       See also:
           cuArray3DCreate,   cuArray3DGetDescriptor,   cuArrayCreate,   cuArrayDestroy,   cuArrayGetDescriptor,
           cuMemAlloc,   cuMemAllocHost,   cuMemAllocPitch,  cuMemcpy2D,  cuMemcpy2DAsync,  cuMemcpy2DUnaligned,
           cuMemcpy3D,   cuMemcpy3DAsync,   cuMemcpyAtoA,   cuMemcpyAtoD,    cuMemcpyAtoH,    cuMemcpyAtoHAsync,
           cuMemcpyDtoA,   cuMemcpyDtoD,   cuMemcpyDtoDAsync,   cuMemcpyDtoH,  cuMemcpyDtoHAsync,  cuMemcpyHtoA,
           cuMemcpyHtoAAsync, cuMemcpyHtoD, cuMemcpyHtoDAsync, cuMemFree,  cuMemFreeHost,  cuMemGetAddressRange,
           cuMemGetInfo,    cuMemHostAlloc,    cuMemHostGetDevicePointer,    cuMemsetD2D8,    cuMemsetD2D8Async,
           cuMemsetD2D16,   cuMemsetD2D32,   cuMemsetD2D32Async,   cuMemsetD8,   cuMemsetD8Async,   cuMemsetD16,
           cuMemsetD16Async, cuMemsetD32, cuMemsetD32Async, cudaMemset2DAsync

   CUresult cuMemsetD2D32 (CUdeviceptr dstDevice, size_t dstPitch, unsigned int ui, size_t Width, size_t Height)
       Sets the 2D memory range of Width 32-bit values to the specified value ui. Height specifies the number of
       rows  to  set,  and  dstPitch  specifies  the number of bytes between each row. The dstDevice pointer and
       dstPitch offset must be four byte aligned. This function performs fastest when the pitch is one that  has
       been passed back by cuMemAllocPitch().

       Parameters:
           dstDevice - Destination device pointer
           dstPitch - Pitch of destination device pointer
           ui - Value to set
           Width - Width of row
           Height - Number of rows

       Returns:
           CUDA_SUCCESS,   CUDA_ERROR_DEINITIALIZED,   CUDA_ERROR_NOT_INITIALIZED,   CUDA_ERROR_INVALID_CONTEXT,
           CUDA_ERROR_INVALID_VALUE

       Note:
           Note that this function may also return error codes from previous, asynchronous launches.

       See also:
           cuArray3DCreate,   cuArray3DGetDescriptor,   cuArrayCreate,   cuArrayDestroy,   cuArrayGetDescriptor,
           cuMemAlloc,   cuMemAllocHost,   cuMemAllocPitch,  cuMemcpy2D,  cuMemcpy2DAsync,  cuMemcpy2DUnaligned,
           cuMemcpy3D,   cuMemcpy3DAsync,   cuMemcpyAtoA,   cuMemcpyAtoD,    cuMemcpyAtoH,    cuMemcpyAtoHAsync,
           cuMemcpyDtoA,   cuMemcpyDtoD,   cuMemcpyDtoDAsync,   cuMemcpyDtoH,  cuMemcpyDtoHAsync,  cuMemcpyHtoA,
           cuMemcpyHtoAAsync, cuMemcpyHtoD, cuMemcpyHtoDAsync, cuMemFree,  cuMemFreeHost,  cuMemGetAddressRange,
           cuMemGetInfo,    cuMemHostAlloc,    cuMemHostGetDevicePointer,    cuMemsetD2D8,    cuMemsetD2D8Async,
           cuMemsetD2D16,  cuMemsetD2D16Async,  cuMemsetD2D32Async,  cuMemsetD8,  cuMemsetD8Async,  cuMemsetD16,
           cuMemsetD16Async, cuMemsetD32, cuMemsetD32Async, cudaMemset2D

   CUresult  cuMemsetD2D32Async  (CUdeviceptr  dstDevice, size_t dstPitch, unsigned int ui, size_t Width, size_t
       Height, CUstream hStream)
       Sets the 2D memory range of Width 32-bit values to the specified value ui. Height specifies the number of
       rows to set, and dstPitch specifies the number of bytes between  each  row.  The  dstDevice  pointer  and
       dstPitch  offset must be four byte aligned. This function performs fastest when the pitch is one that has
       been passed back by cuMemAllocPitch().

       Parameters:
           dstDevice - Destination device pointer
           dstPitch - Pitch of destination device pointer
           ui - Value to set
           Width - Width of row
           Height - Number of rows
           hStream - Stream identifier

       Returns:
           CUDA_SUCCESS,   CUDA_ERROR_DEINITIALIZED,   CUDA_ERROR_NOT_INITIALIZED,   CUDA_ERROR_INVALID_CONTEXT,
           CUDA_ERROR_INVALID_VALUE

       Note:
           Note that this function may also return error codes from previous, asynchronous launches.

           This function uses standard default stream semantics.

       See also:
           cuArray3DCreate,   cuArray3DGetDescriptor,   cuArrayCreate,   cuArrayDestroy,   cuArrayGetDescriptor,
           cuMemAlloc,  cuMemAllocHost,  cuMemAllocPitch,  cuMemcpy2D,   cuMemcpy2DAsync,   cuMemcpy2DUnaligned,
           cuMemcpy3D,    cuMemcpy3DAsync,    cuMemcpyAtoA,   cuMemcpyAtoD,   cuMemcpyAtoH,   cuMemcpyAtoHAsync,
           cuMemcpyDtoA,  cuMemcpyDtoD,  cuMemcpyDtoDAsync,   cuMemcpyDtoH,   cuMemcpyDtoHAsync,   cuMemcpyHtoA,
           cuMemcpyHtoAAsync,  cuMemcpyHtoD,  cuMemcpyHtoDAsync, cuMemFree, cuMemFreeHost, cuMemGetAddressRange,
           cuMemGetInfo,    cuMemHostAlloc,    cuMemHostGetDevicePointer,    cuMemsetD2D8,    cuMemsetD2D8Async,
           cuMemsetD2D16,   cuMemsetD2D16Async,   cuMemsetD2D32,   cuMemsetD8,   cuMemsetD8Async,   cuMemsetD16,
           cuMemsetD16Async, cuMemsetD32, cuMemsetD32Async, cudaMemset2DAsync

   CUresult cuMemsetD2D8 (CUdeviceptr dstDevice, size_t dstPitch, unsigned char uc, size_t Width, size_t Height)
       Sets the 2D memory range of Width 8-bit values to the specified value uc. Height specifies the number  of
       rows  to set, and dstPitch specifies the number of bytes between each row. This function performs fastest
       when the pitch is one that has been passed back by cuMemAllocPitch().

       Parameters:
           dstDevice - Destination device pointer
           dstPitch - Pitch of destination device pointer
           uc - Value to set
           Width - Width of row
           Height - Number of rows

       Returns:
           CUDA_SUCCESS,   CUDA_ERROR_DEINITIALIZED,   CUDA_ERROR_NOT_INITIALIZED,   CUDA_ERROR_INVALID_CONTEXT,
           CUDA_ERROR_INVALID_VALUE

       Note:
           Note that this function may also return error codes from previous, asynchronous launches.

       See also:
           cuArray3DCreate,   cuArray3DGetDescriptor,   cuArrayCreate,   cuArrayDestroy,   cuArrayGetDescriptor,
           cuMemAlloc,  cuMemAllocHost,  cuMemAllocPitch,  cuMemcpy2D,   cuMemcpy2DAsync,   cuMemcpy2DUnaligned,
           cuMemcpy3D,    cuMemcpy3DAsync,    cuMemcpyAtoA,   cuMemcpyAtoD,   cuMemcpyAtoH,   cuMemcpyAtoHAsync,
           cuMemcpyDtoA,  cuMemcpyDtoD,  cuMemcpyDtoDAsync,   cuMemcpyDtoH,   cuMemcpyDtoHAsync,   cuMemcpyHtoA,
           cuMemcpyHtoAAsync,  cuMemcpyHtoD,  cuMemcpyHtoDAsync, cuMemFree, cuMemFreeHost, cuMemGetAddressRange,
           cuMemGetInfo,   cuMemHostAlloc,    cuMemHostGetDevicePointer,    cuMemsetD2D8Async,    cuMemsetD2D16,
           cuMemsetD2D16Async,  cuMemsetD2D32,  cuMemsetD2D32Async,  cuMemsetD8,  cuMemsetD8Async,  cuMemsetD16,
           cuMemsetD16Async, cuMemsetD32, cuMemsetD32Async, cudaMemset2D

   CUresult cuMemsetD2D8Async (CUdeviceptr dstDevice, size_t dstPitch, unsigned char uc,  size_t  Width,  size_t
       Height, CUstream hStream)
       Sets  the 2D memory range of Width 8-bit values to the specified value uc. Height specifies the number of
       rows to set, and dstPitch specifies the number of bytes between each row. This function performs  fastest
       when the pitch is one that has been passed back by cuMemAllocPitch().

       Parameters:
           dstDevice - Destination device pointer
           dstPitch - Pitch of destination device pointer
           uc - Value to set
           Width - Width of row
           Height - Number of rows
           hStream - Stream identifier

       Returns:
           CUDA_SUCCESS,   CUDA_ERROR_DEINITIALIZED,   CUDA_ERROR_NOT_INITIALIZED,   CUDA_ERROR_INVALID_CONTEXT,
           CUDA_ERROR_INVALID_VALUE

       Note:
           Note that this function may also return error codes from previous, asynchronous launches.

           This function uses standard default stream semantics.

       See also:
           cuArray3DCreate,   cuArray3DGetDescriptor,   cuArrayCreate,   cuArrayDestroy,   cuArrayGetDescriptor,
           cuMemAlloc,   cuMemAllocHost,   cuMemAllocPitch,  cuMemcpy2D,  cuMemcpy2DAsync,  cuMemcpy2DUnaligned,
           cuMemcpy3D,   cuMemcpy3DAsync,   cuMemcpyAtoA,   cuMemcpyAtoD,    cuMemcpyAtoH,    cuMemcpyAtoHAsync,
           cuMemcpyDtoA,   cuMemcpyDtoD,   cuMemcpyDtoDAsync,   cuMemcpyDtoH,  cuMemcpyDtoHAsync,  cuMemcpyHtoA,
           cuMemcpyHtoAAsync, cuMemcpyHtoD, cuMemcpyHtoDAsync, cuMemFree,  cuMemFreeHost,  cuMemGetAddressRange,
           cuMemGetInfo,     cuMemHostAlloc,     cuMemHostGetDevicePointer,     cuMemsetD2D8,     cuMemsetD2D16,
           cuMemsetD2D16Async,  cuMemsetD2D32,  cuMemsetD2D32Async,  cuMemsetD8,  cuMemsetD8Async,  cuMemsetD16,
           cuMemsetD16Async, cuMemsetD32, cuMemsetD32Async, cudaMemset2DAsync

   CUresult cuMemsetD32 (CUdeviceptr dstDevice, unsigned int ui, size_t N)
       Sets  the  memory  range of N 32-bit values to the specified value ui. The dstDevice pointer must be four
       byte aligned.
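
       Because ui is a raw 32-bit pattern, filling a buffer of floats with a given value means passing that
       value's bit pattern rather than the float itself. A minimal sketch of the conversion follows;
       float_bits is a hypothetical helper name, and the code assumes that float and unsigned int are both
       32 bits wide and that floats use the IEEE 754 single-precision format.

```c
#include <string.h>

/* Returns the 32-bit pattern of f, suitable for use as the ui argument.
 * Assumes sizeof(float) == sizeof(unsigned int) == 4. memcpy is used
 * instead of a pointer cast to avoid strict-aliasing problems. */
unsigned int float_bits(float f)
{
    unsigned int ui;
    memcpy(&ui, &f, sizeof ui);
    return ui;
}
```

       Under those assumptions, float_bits(1.0f) yields 0x3F800000, which could then be passed as ui to set
       every element of a float buffer to 1.0.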

       Parameters:
           dstDevice - Destination device pointer
           ui - Value to set
           N - Number of elements

       Returns:
           CUDA_SUCCESS,   CUDA_ERROR_DEINITIALIZED,   CUDA_ERROR_NOT_INITIALIZED,   CUDA_ERROR_INVALID_CONTEXT,
           CUDA_ERROR_INVALID_VALUE

       Note:
           Note that this function may also return error codes from previous, asynchronous launches.

       See also:
           cuArray3DCreate,   cuArray3DGetDescriptor,   cuArrayCreate,   cuArrayDestroy,   cuArrayGetDescriptor,
           cuMemAlloc,  cuMemAllocHost,  cuMemAllocPitch,  cuMemcpy2D,   cuMemcpy2DAsync,   cuMemcpy2DUnaligned,
           cuMemcpy3D,    cuMemcpy3DAsync,    cuMemcpyAtoA,   cuMemcpyAtoD,   cuMemcpyAtoH,   cuMemcpyAtoHAsync,
           cuMemcpyDtoA,  cuMemcpyDtoD,  cuMemcpyDtoDAsync,   cuMemcpyDtoH,   cuMemcpyDtoHAsync,   cuMemcpyHtoA,
           cuMemcpyHtoAAsync,  cuMemcpyHtoD,  cuMemcpyHtoDAsync, cuMemFree, cuMemFreeHost, cuMemGetAddressRange,
           cuMemGetInfo,    cuMemHostAlloc,    cuMemHostGetDevicePointer,    cuMemsetD2D8,    cuMemsetD2D8Async,
           cuMemsetD2D16,  cuMemsetD2D16Async,  cuMemsetD2D32,  cuMemsetD2D32Async, cuMemsetD8, cuMemsetD8Async,
           cuMemsetD16, cuMemsetD16Async, cuMemsetD32Async, cudaMemset

   CUresult cuMemsetD32Async (CUdeviceptr dstDevice, unsigned int ui, size_t N, CUstream hStream)
       Sets the memory range of N 32-bit values to the specified value ui. The dstDevice pointer  must  be  four
       byte aligned.

       Parameters:
           dstDevice - Destination device pointer
           ui - Value to set
           N - Number of elements
           hStream - Stream identifier

       Returns:
           CUDA_SUCCESS,   CUDA_ERROR_DEINITIALIZED,   CUDA_ERROR_NOT_INITIALIZED,   CUDA_ERROR_INVALID_CONTEXT,
           CUDA_ERROR_INVALID_VALUE

       Note:
           Note that this function may also return error codes from previous, asynchronous launches.

           This function uses standard default stream semantics.

       See also:
           cuArray3DCreate,   cuArray3DGetDescriptor,   cuArrayCreate,   cuArrayDestroy,   cuArrayGetDescriptor,
           cuMemAlloc,   cuMemAllocHost,   cuMemAllocPitch,  cuMemcpy2D,  cuMemcpy2DAsync,  cuMemcpy2DUnaligned,
           cuMemcpy3D,   cuMemcpy3DAsync,   cuMemcpyAtoA,   cuMemcpyAtoD,    cuMemcpyAtoH,    cuMemcpyAtoHAsync,
           cuMemcpyDtoA,   cuMemcpyDtoD,   cuMemcpyDtoDAsync,   cuMemcpyDtoH,  cuMemcpyDtoHAsync,  cuMemcpyHtoA,
           cuMemcpyHtoAAsync, cuMemcpyHtoD, cuMemcpyHtoDAsync, cuMemFree,  cuMemFreeHost,  cuMemGetAddressRange,
           cuMemGetInfo,    cuMemHostAlloc,    cuMemHostGetDevicePointer,    cuMemsetD2D8,    cuMemsetD2D8Async,
           cuMemsetD2D16, cuMemsetD2D16Async, cuMemsetD2D32,  cuMemsetD2D32Async,  cuMemsetD8,  cuMemsetD8Async,
           cuMemsetD16, cuMemsetD16Async, cuMemsetD32, cudaMemsetAsync

   CUresult cuMemsetD8 (CUdeviceptr dstDevice, unsigned char uc, size_t N)
       Sets the memory range of N 8-bit values to the specified value uc.

       Parameters:
           dstDevice - Destination device pointer
           uc - Value to set
           N - Number of elements

       Returns:
           CUDA_SUCCESS,   CUDA_ERROR_DEINITIALIZED,   CUDA_ERROR_NOT_INITIALIZED,   CUDA_ERROR_INVALID_CONTEXT,
           CUDA_ERROR_INVALID_VALUE

       Note:
           Note that this function may also return error codes from previous, asynchronous launches.

       See also:
           cuArray3DCreate,   cuArray3DGetDescriptor,   cuArrayCreate,   cuArrayDestroy,   cuArrayGetDescriptor,
           cuMemAlloc,   cuMemAllocHost,   cuMemAllocPitch,  cuMemcpy2D,  cuMemcpy2DAsync,  cuMemcpy2DUnaligned,
           cuMemcpy3D,   cuMemcpy3DAsync,   cuMemcpyAtoA,   cuMemcpyAtoD,    cuMemcpyAtoH,    cuMemcpyAtoHAsync,
           cuMemcpyDtoA,   cuMemcpyDtoD,   cuMemcpyDtoDAsync,   cuMemcpyDtoH,  cuMemcpyDtoHAsync,  cuMemcpyHtoA,
           cuMemcpyHtoAAsync, cuMemcpyHtoD, cuMemcpyHtoDAsync, cuMemFree,  cuMemFreeHost,  cuMemGetAddressRange,
           cuMemGetInfo,    cuMemHostAlloc,    cuMemHostGetDevicePointer,    cuMemsetD2D8,    cuMemsetD2D8Async,
           cuMemsetD2D16, cuMemsetD2D16Async, cuMemsetD2D32, cuMemsetD2D32Async,  cuMemsetD8Async,  cuMemsetD16,
           cuMemsetD16Async, cuMemsetD32, cuMemsetD32Async, cudaMemset

   CUresult cuMemsetD8Async (CUdeviceptr dstDevice, unsigned char uc, size_t N, CUstream hStream)
       Sets the memory range of N 8-bit values to the specified value uc.

       Parameters:
           dstDevice - Destination device pointer
           uc - Value to set
           N - Number of elements
           hStream - Stream identifier

       Returns:
           CUDA_SUCCESS,   CUDA_ERROR_DEINITIALIZED,   CUDA_ERROR_NOT_INITIALIZED,   CUDA_ERROR_INVALID_CONTEXT,
           CUDA_ERROR_INVALID_VALUE

       Note:
           Note that this function may also return error codes from previous, asynchronous launches.

           This function uses standard default stream semantics.

       See also:
           cuArray3DCreate,   cuArray3DGetDescriptor,   cuArrayCreate,   cuArrayDestroy,   cuArrayGetDescriptor,
           cuMemAlloc,   cuMemAllocHost,   cuMemAllocPitch,  cuMemcpy2D,  cuMemcpy2DAsync,  cuMemcpy2DUnaligned,
           cuMemcpy3D,   cuMemcpy3DAsync,   cuMemcpyAtoA,   cuMemcpyAtoD,    cuMemcpyAtoH,    cuMemcpyAtoHAsync,
           cuMemcpyDtoA,   cuMemcpyDtoD,   cuMemcpyDtoDAsync,   cuMemcpyDtoH,  cuMemcpyDtoHAsync,  cuMemcpyHtoA,
           cuMemcpyHtoAAsync, cuMemcpyHtoD, cuMemcpyHtoDAsync, cuMemFree,  cuMemFreeHost,  cuMemGetAddressRange,
           cuMemGetInfo,    cuMemHostAlloc,    cuMemHostGetDevicePointer,    cuMemsetD2D8,    cuMemsetD2D8Async,
           cuMemsetD2D16,  cuMemsetD2D16Async,  cuMemsetD2D32,  cuMemsetD2D32Async,   cuMemsetD8,   cuMemsetD16,
           cuMemsetD16Async, cuMemsetD32, cuMemsetD32Async, cudaMemsetAsync

   CUresult    cuMipmappedArrayCreate    (CUmipmappedArray    *   pHandle,   const   CUDA_ARRAY3D_DESCRIPTOR   *
       pMipmappedArrayDesc, unsigned int numMipmapLevels)
       Creates a CUDA mipmapped array according to the CUDA_ARRAY3D_DESCRIPTOR structure pMipmappedArrayDesc and
       returns a handle to the new CUDA mipmapped array in *pHandle. numMipmapLevels  specifies  the  number  of
       mipmap  levels  to be allocated. This value is clamped to the range [1, 1 + floor(log2(max(width, height,
       depth)))].
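
       The clamping range can be computed with integer arithmetic alone. The sketch below is a host-side
       illustration, not driver code; both function names are hypothetical, and Width is assumed to be at
       least 1.

```c
/* Upper bound of the clamping range: 1 + floor(log2(max(w, h, d))).
 * Counting how many times the largest extent can be shifted right
 * before it reaches zero gives exactly that value for extents >= 1. */
unsigned int max_mipmap_levels(unsigned int width, unsigned int height,
                               unsigned int depth)
{
    unsigned int m = width;
    if (height > m) {
        m = height;
    }
    if (depth > m) {
        m = depth;
    }
    unsigned int levels = 0u;
    while (m != 0u) {
        ++levels;
        m >>= 1;
    }
    return levels;
}

/* Clamps a requested level count to [1, max_mipmap_levels(...)]. */
unsigned int clamp_mipmap_levels(unsigned int requested, unsigned int width,
                                 unsigned int height, unsigned int depth)
{
    unsigned int hi = max_mipmap_levels(width, height, depth);
    if (requested > hi) {
        requested = hi;
    }
    if (requested < 1u) {
        requested = 1u;
    }
    return requested;
}
```

       For a 256 x 256 2D array the upper bound is 9, so a request for 20 levels would be clamped to 9 and a
       request for 0 levels would be clamped to 1.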

       The CUDA_ARRAY3D_DESCRIPTOR is defined as:

           typedef struct {
               unsigned int Width;
               unsigned int Height;
               unsigned int Depth;
               CUarray_format Format;
               unsigned int NumChannels;
               unsigned int Flags;
           } CUDA_ARRAY3D_DESCRIPTOR;

        where:

       • Width, Height, and Depth are the width, height,  and  depth  of  the  CUDA  array  (in  elements);  the
         following types of CUDA arrays can be allocated:

         • A 1D mipmapped array is allocated if Height and Depth extents are both zero.

         • A 2D mipmapped array is allocated if only Depth extent is zero.

         • A 3D mipmapped array is allocated if all three extents are non-zero.

         • A  1D  layered  CUDA mipmapped array is allocated if only Height is zero and the CUDA_ARRAY3D_LAYERED
           flag is set. Each layer is a 1D array. The number of layers is determined by the depth extent.

         • A 2D layered  CUDA  mipmapped  array  is  allocated  if  all  three  extents  are  non-zero  and  the
           CUDA_ARRAY3D_LAYERED flag is set. Each layer is a 2D array. The number of layers is determined by the
           depth extent.

         • A   cubemap   CUDA  mipmapped  array  is  allocated  if  all  three  extents  are  non-zero  and  the
           CUDA_ARRAY3D_CUBEMAP flag is set. Width must be equal to Height, and Depth must be six. A cubemap  is
           a  special type of 2D layered CUDA array, where the six layers represent the six faces of a cube. The
           order of the six layers in memory is the same as that listed in CUarray_cubemap_face.

         • A cubemap layered CUDA mipmapped array is allocated if all three extents are non-zero and both the
           CUDA_ARRAY3D_CUBEMAP and CUDA_ARRAY3D_LAYERED flags are set. Width must be equal to Height, and Depth
           must  be  a  multiple of six. A cubemap layered CUDA array is a special type of 2D layered CUDA array
           that consists of a collection of cubemaps. The first six layers represent the first cubemap, the next
           six layers form the second cubemap, and so on.

       • Format specifies the format of the elements; CUarray_format is defined as:

           typedef enum CUarray_format_enum {
               CU_AD_FORMAT_UNSIGNED_INT8 = 0x01,
               CU_AD_FORMAT_UNSIGNED_INT16 = 0x02,
               CU_AD_FORMAT_UNSIGNED_INT32 = 0x03,
               CU_AD_FORMAT_SIGNED_INT8 = 0x08,
               CU_AD_FORMAT_SIGNED_INT16 = 0x09,
               CU_AD_FORMAT_SIGNED_INT32 = 0x0a,
               CU_AD_FORMAT_HALF = 0x10,
               CU_AD_FORMAT_FLOAT = 0x20
           } CUarray_format;

       • NumChannels specifies the number of packed components per CUDA array element; it may be 1, 2, or 4;

       • Flags may be set to

         • CUDA_ARRAY3D_LAYERED to enable creation of layered CUDA mipmapped arrays. If this flag is set,  Depth
           specifies the number of layers, not the depth of a 3D array.

         • CUDA_ARRAY3D_SURFACE_LDST to enable surface references to be bound to individual mipmap levels of the
           CUDA  mipmapped array. If this flag is not set, cuSurfRefSetArray will fail when attempting to bind a
           mipmap level of the CUDA mipmapped array to a surface reference.

         • CUDA_ARRAY3D_CUBEMAP to enable creation of mipmapped cubemaps. If this flag is  set,  Width  must  be
           equal to Height, and Depth must be six. If the CUDA_ARRAY3D_LAYERED flag is also set, then Depth must
           be a multiple of six.

         • CUDA_ARRAY3D_TEXTURE_GATHER  to  indicate  that  the  CUDA  mipmapped  array will be used for texture
           gather. Texture gather can only be performed on 2D CUDA mipmapped arrays.
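
       The extent and flag rules above can be summarized as a small classification routine. This is a
       host-side sketch, not driver code: the array_kind type, the MODEL_* constants and the function name
       are hypothetical. The numeric flag values are assumed to mirror CUDA_ARRAY3D_LAYERED and
       CUDA_ARRAY3D_CUBEMAP; real code should use the constants from the CUDA headers instead.

```c
/* Assumed stand-ins for CUDA_ARRAY3D_LAYERED and CUDA_ARRAY3D_CUBEMAP. */
enum { MODEL_LAYERED = 0x01, MODEL_CUBEMAP = 0x04 };

typedef enum {
    KIND_INVALID,
    KIND_1D,
    KIND_2D,
    KIND_3D,
    KIND_1D_LAYERED,
    KIND_2D_LAYERED,
    KIND_CUBEMAP,
    KIND_CUBEMAP_LAYERED
} array_kind;

/* Classifies a descriptor by its extents and flags, following the
 * rules listed in the description. Does not check device limits. */
array_kind classify_mipmapped_array(unsigned int w, unsigned int h,
                                    unsigned int d, unsigned int flags)
{
    int layered = (flags & MODEL_LAYERED) != 0;
    int cubemap = (flags & MODEL_CUBEMAP) != 0;

    if (w == 0u) {
        return KIND_INVALID;
    }
    if (cubemap) {
        /* Cubemaps need all extents non-zero and square faces. */
        if (h == 0u || d == 0u || w != h) {
            return KIND_INVALID;
        }
        if (layered) {
            return (d % 6u == 0u) ? KIND_CUBEMAP_LAYERED : KIND_INVALID;
        }
        return (d == 6u) ? KIND_CUBEMAP : KIND_INVALID;
    }
    if (layered) {
        /* Depth is the number of layers, so it must be non-zero. */
        if (d == 0u) {
            return KIND_INVALID;
        }
        return (h == 0u) ? KIND_1D_LAYERED : KIND_2D_LAYERED;
    }
    if (h == 0u && d == 0u) {
        return KIND_1D;
    }
    if (h != 0u && d == 0u) {
        return KIND_2D;
    }
    if (h != 0u && d != 0u) {
        return KIND_3D;
    }
    return KIND_INVALID;   /* Height zero with non-zero Depth, not layered */
}
```

       A check like this can catch inconsistent descriptors before the creation call is made, although the
       driver remains the authority on what is actually accepted.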

       Width, Height and Depth must meet certain size requirements as listed in the following table. All values
       are specified in elements. For brevity, the full name of the device attribute is not specified. For
       example, TEXTURE1D_MIPMAPPED_WIDTH refers to the device attribute
       CU_DEVICE_ATTRIBUTE_MAXIMUM_TEXTURE1D_MIPMAPPED_WIDTH.

       Each set of extents is written as { (width range in elements), (height range), (depth range) }.

       CUDA array type: 1D
           Valid extents that must always be met:
               { (1,TEXTURE1D_MIPMAPPED_WIDTH), 0, 0 }
           Valid extents with CUDA_ARRAY3D_SURFACE_LDST set:
               { (1,SURFACE1D_WIDTH), 0, 0 }

       CUDA array type: 2D
           Valid extents that must always be met:
               { (1,TEXTURE2D_MIPMAPPED_WIDTH), (1,TEXTURE2D_MIPMAPPED_HEIGHT), 0 }
           Valid extents with CUDA_ARRAY3D_SURFACE_LDST set:
               { (1,SURFACE2D_WIDTH), (1,SURFACE2D_HEIGHT), 0 }

       CUDA array type: 3D
           Valid extents that must always be met:
               { (1,TEXTURE3D_WIDTH), (1,TEXTURE3D_HEIGHT), (1,TEXTURE3D_DEPTH) }
               OR
               { (1,TEXTURE3D_WIDTH_ALTERNATE), (1,TEXTURE3D_HEIGHT_ALTERNATE), (1,TEXTURE3D_DEPTH_ALTERNATE) }
           Valid extents with CUDA_ARRAY3D_SURFACE_LDST set:
               { (1,SURFACE3D_WIDTH), (1,SURFACE3D_HEIGHT), (1,SURFACE3D_DEPTH) }

       CUDA array type: 1D Layered
           Valid extents that must always be met:
               { (1,TEXTURE1D_LAYERED_WIDTH), 0, (1,TEXTURE1D_LAYERED_LAYERS) }
           Valid extents with CUDA_ARRAY3D_SURFACE_LDST set:
               { (1,SURFACE1D_LAYERED_WIDTH), 0, (1,SURFACE1D_LAYERED_LAYERS) }

       CUDA array type: 2D Layered
           Valid extents that must always be met:
               { (1,TEXTURE2D_LAYERED_WIDTH), (1,TEXTURE2D_LAYERED_HEIGHT), (1,TEXTURE2D_LAYERED_LAYERS) }
           Valid extents with CUDA_ARRAY3D_SURFACE_LDST set:
               { (1,SURFACE2D_LAYERED_WIDTH), (1,SURFACE2D_LAYERED_HEIGHT), (1,SURFACE2D_LAYERED_LAYERS) }

       CUDA array type: Cubemap
           Valid extents that must always be met:
               { (1,TEXTURECUBEMAP_WIDTH), (1,TEXTURECUBEMAP_WIDTH), 6 }
           Valid extents with CUDA_ARRAY3D_SURFACE_LDST set:
               { (1,SURFACECUBEMAP_WIDTH), (1,SURFACECUBEMAP_WIDTH), 6 }

       CUDA array type: Cubemap Layered
           Valid extents that must always be met:
               { (1,TEXTURECUBEMAP_LAYERED_WIDTH), (1,TEXTURECUBEMAP_LAYERED_WIDTH),
                 (1,TEXTURECUBEMAP_LAYERED_LAYERS) }
           Valid extents with CUDA_ARRAY3D_SURFACE_LDST set:
               { (1,SURFACECUBEMAP_LAYERED_WIDTH), (1,SURFACECUBEMAP_LAYERED_WIDTH),
                 (1,SURFACECUBEMAP_LAYERED_LAYERS) }

       Parameters:
           pHandle - Returned mipmapped array
           pMipmappedArrayDesc - mipmapped array descriptor
           numMipmapLevels - Number of mipmap levels

       Returns:
           CUDA_SUCCESS,   CUDA_ERROR_DEINITIALIZED,   CUDA_ERROR_NOT_INITIALIZED,   CUDA_ERROR_INVALID_CONTEXT,
           CUDA_ERROR_INVALID_VALUE, CUDA_ERROR_OUT_OF_MEMORY, CUDA_ERROR_UNKNOWN

       Note:
           Note that this function may also return error codes from previous, asynchronous launches.

       See also:
           cuMipmappedArrayDestroy, cuMipmappedArrayGetLevel, cuArrayCreate, cudaMallocMipmappedArray

   CUresult cuMipmappedArrayDestroy (CUmipmappedArray hMipmappedArray)
       Destroys the CUDA mipmapped array hMipmappedArray.

       Parameters:
           hMipmappedArray - Mipmapped array to destroy

       Returns:
           CUDA_SUCCESS,   CUDA_ERROR_DEINITIALIZED,   CUDA_ERROR_NOT_INITIALIZED,   CUDA_ERROR_INVALID_CONTEXT,
           CUDA_ERROR_INVALID_HANDLE, CUDA_ERROR_ARRAY_IS_MAPPED, CUDA_ERROR_CONTEXT_IS_DESTROYED

       Note:
           Note that this function may also return error codes from previous, asynchronous launches.

       See also:
           cuMipmappedArrayCreate, cuMipmappedArrayGetLevel, cuArrayCreate, cudaFreeMipmappedArray

   CUresult cuMipmappedArrayGetLevel (CUarray *  pLevelArray,  CUmipmappedArray  hMipmappedArray,  unsigned  int
       level)
       Returns  in  *pLevelArray  a CUDA array that represents a single mipmap level of the CUDA mipmapped array
       hMipmappedArray.

       If level is greater than the maximum number of levels in this mipmapped  array,  CUDA_ERROR_INVALID_VALUE
       is returned.
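
       A caller may want to validate a level index and estimate the extent of that level before requesting
       it. The sketch below is a host-side illustration only. It rests on two assumptions that this page does
       not state: levels are numbered from zero, and each level halves the previous extent, rounding down,
       with a minimum of 1 (the usual mipmap convention). The function name is hypothetical.

```c
#include <limits.h>

/* Host-side model; it does not call the CUDA driver.
 * Stores in *out the assumed extent of mipmap level `level` for a
 * dimension whose level-0 extent is `base`.
 * Returns 0 on success, or -1 if `level` is not below `num_levels`. */
int mip_level_extent(unsigned int base, unsigned int num_levels,
                     unsigned int level, unsigned int *out)
{
    if (level >= num_levels) {
        return -1;
    }
    if (base == 0u) {
        *out = 0u;   /* an unused dimension stays zero at every level */
        return 0;
    }
    unsigned int bits = (unsigned int)(sizeof(unsigned int) * CHAR_BIT);
    /* Guard the shift: shifting by the full width or more is undefined. */
    unsigned int e = (level < bits) ? (base >> level) : 0u;
    *out = (e == 0u) ? 1u : e;
    return 0;
}
```

       Under these assumptions, a mipmapped array with a base width of 256 and 9 levels has widths 256, 128,
       and so on down to 1 at level 8, and level 9 is rejected.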

       Parameters:
           pLevelArray - Returned mipmap level CUDA array
           hMipmappedArray - CUDA mipmapped array
           level - Mipmap level

       Returns:
           CUDA_SUCCESS,   CUDA_ERROR_DEINITIALIZED,   CUDA_ERROR_NOT_INITIALIZED,   CUDA_ERROR_INVALID_CONTEXT,
           CUDA_ERROR_INVALID_VALUE, CUDA_ERROR_INVALID_HANDLE

       Note:
           Note that this function may also return error codes from previous, asynchronous launches.

       See also:
           cuMipmappedArrayCreate, cuMipmappedArrayDestroy, cuArrayCreate, cudaGetMipmappedArrayLevel

Author

       Generated automatically by Doxygen from the source code.

Version 6.0                                        28 Jul 2019                              Memory Management(3)