Provided by: nvidia-cuda-dev_10.1.243-3_amd64 

NAME
Device Management - device management functions of the CUDA runtime API
Functions
cudaError_t cudaChooseDevice (int *device, const struct cudaDeviceProp *prop)
Select compute-device which best matches criteria.
__cudart_builtin__ cudaError_t cudaDeviceGetAttribute (int *value, enum cudaDeviceAttr attr, int device)
Returns information about the device.
cudaError_t cudaDeviceGetByPCIBusId (int *device, const char *pciBusId)
Returns a handle to a compute device.
__cudart_builtin__ cudaError_t cudaDeviceGetCacheConfig (enum cudaFuncCache *pCacheConfig)
Returns the preferred cache configuration for the current device.
__cudart_builtin__ cudaError_t cudaDeviceGetLimit (size_t *pValue, enum cudaLimit limit)
Returns resource limits.
__cudart_builtin__ cudaError_t cudaDeviceGetP2PAttribute (int *value, enum cudaDeviceP2PAttr attr, int
srcDevice, int dstDevice)
Queries attributes of the link between two devices.
cudaError_t cudaDeviceGetPCIBusId (char *pciBusId, int len, int device)
Returns a PCI Bus Id string for the device.
__cudart_builtin__ cudaError_t cudaDeviceGetSharedMemConfig (enum cudaSharedMemConfig *pConfig)
Returns the shared memory configuration for the current device.
__cudart_builtin__ cudaError_t cudaDeviceGetStreamPriorityRange (int *leastPriority, int
*greatestPriority)
Returns numerical values that correspond to the least and greatest stream priorities.
cudaError_t cudaDeviceReset (void)
Destroy all allocations and reset all state on the current device in the current process.
cudaError_t cudaDeviceSetCacheConfig (enum cudaFuncCache cacheConfig)
Sets the preferred cache configuration for the current device.
cudaError_t cudaDeviceSetLimit (enum cudaLimit limit, size_t value)
Set resource limits.
cudaError_t cudaDeviceSetSharedMemConfig (enum cudaSharedMemConfig config)
Sets the shared memory configuration for the current device.
__cudart_builtin__ cudaError_t cudaDeviceSynchronize (void)
Wait for compute device to finish.
__cudart_builtin__ cudaError_t cudaGetDevice (int *device)
Returns which device is currently being used.
__cudart_builtin__ cudaError_t cudaGetDeviceCount (int *count)
Returns the number of compute-capable devices.
cudaError_t cudaGetDeviceFlags (unsigned int *flags)
Gets the flags for the current device.
__cudart_builtin__ cudaError_t cudaGetDeviceProperties (struct cudaDeviceProp *prop, int device)
Returns information about the compute-device.
cudaError_t cudaIpcCloseMemHandle (void *devPtr)
Close memory mapped with cudaIpcOpenMemHandle.
cudaError_t cudaIpcGetEventHandle (cudaIpcEventHandle_t *handle, cudaEvent_t event)
Gets an interprocess handle for a previously allocated event.
cudaError_t cudaIpcGetMemHandle (cudaIpcMemHandle_t *handle, void *devPtr)
Gets an interprocess memory handle for an existing device memory allocation.
cudaError_t cudaIpcOpenEventHandle (cudaEvent_t *event, cudaIpcEventHandle_t handle)
Opens an interprocess event handle for use in the current process.
cudaError_t cudaIpcOpenMemHandle (void **devPtr, cudaIpcMemHandle_t handle, unsigned int flags)
Opens an interprocess memory handle exported from another process and returns a device pointer usable
in the local process.
cudaError_t cudaSetDevice (int device)
Set device to be used for GPU executions.
cudaError_t cudaSetDeviceFlags (unsigned int flags)
Sets flags to be used for device executions.
cudaError_t cudaSetValidDevices (int *device_arr, int len)
Set a list of devices that can be used for CUDA.
Detailed Description
This section describes the device management functions of the CUDA runtime application programming
interface (cuda_runtime_api.h).
Function Documentation
cudaError_t cudaChooseDevice (int * device, const struct cudaDeviceProp * prop)
Returns in *device the device which has properties that best match *prop.
Parameters:
device - Device with best match
prop - Desired device properties
Returns:
cudaSuccess, cudaErrorInvalidValue
Note:
Note that this function may also return error codes from previous, asynchronous launches.
See also:
cudaGetDeviceCount, cudaGetDevice, cudaSetDevice, cudaGetDeviceProperties
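For illustration, a minimal sketch that asks for the device best matching a desired compute
capability (the 3.5 target is an arbitrary example):

    #include <cuda_runtime.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        struct cudaDeviceProp prop;
        int device;

        /* Zero the struct so unset fields do not constrain the match. */
        memset(&prop, 0, sizeof(prop));
        prop.major = 3;   /* prefer a device with compute capability 3.5 */
        prop.minor = 5;

        if (cudaChooseDevice(&device, &prop) == cudaSuccess) {
            cudaSetDevice(device);
            printf("selected device %d\n", device);
        }
        return 0;
    }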
__cudart_builtin__ cudaError_t cudaDeviceGetAttribute (int * value, enum cudaDeviceAttr attr, int device)
Returns in *value the integer value of the attribute attr on device device. The supported attributes are:
• cudaDevAttrMaxThreadsPerBlock: Maximum number of threads per block;
• cudaDevAttrMaxBlockDimX: Maximum x-dimension of a block;
• cudaDevAttrMaxBlockDimY: Maximum y-dimension of a block;
• cudaDevAttrMaxBlockDimZ: Maximum z-dimension of a block;
• cudaDevAttrMaxGridDimX: Maximum x-dimension of a grid;
• cudaDevAttrMaxGridDimY: Maximum y-dimension of a grid;
• cudaDevAttrMaxGridDimZ: Maximum z-dimension of a grid;
• cudaDevAttrMaxSharedMemoryPerBlock: Maximum amount of shared memory available to a thread block in
bytes;
• cudaDevAttrTotalConstantMemory: Memory available on device for __constant__ variables in a CUDA C
kernel in bytes;
• cudaDevAttrWarpSize: Warp size in threads;
• cudaDevAttrMaxPitch: Maximum pitch in bytes allowed by the memory copy functions that involve memory
regions allocated through cudaMallocPitch();
• cudaDevAttrMaxTexture1DWidth: Maximum 1D texture width;
• cudaDevAttrMaxTexture1DLinearWidth: Maximum width for a 1D texture bound to linear memory;
• cudaDevAttrMaxTexture1DMipmappedWidth: Maximum mipmapped 1D texture width;
• cudaDevAttrMaxTexture2DWidth: Maximum 2D texture width;
• cudaDevAttrMaxTexture2DHeight: Maximum 2D texture height;
• cudaDevAttrMaxTexture2DLinearWidth: Maximum width for a 2D texture bound to linear memory;
• cudaDevAttrMaxTexture2DLinearHeight: Maximum height for a 2D texture bound to linear memory;
• cudaDevAttrMaxTexture2DLinearPitch: Maximum pitch in bytes for a 2D texture bound to linear memory;
• cudaDevAttrMaxTexture2DMipmappedWidth: Maximum mipmapped 2D texture width;
• cudaDevAttrMaxTexture2DMipmappedHeight: Maximum mipmapped 2D texture height;
• cudaDevAttrMaxTexture3DWidth: Maximum 3D texture width;
• cudaDevAttrMaxTexture3DHeight: Maximum 3D texture height;
• cudaDevAttrMaxTexture3DDepth: Maximum 3D texture depth;
• cudaDevAttrMaxTexture3DWidthAlt: Alternate maximum 3D texture width, 0 if no alternate maximum 3D
texture size is supported;
• cudaDevAttrMaxTexture3DHeightAlt: Alternate maximum 3D texture height, 0 if no alternate maximum 3D
texture size is supported;
• cudaDevAttrMaxTexture3DDepthAlt: Alternate maximum 3D texture depth, 0 if no alternate maximum 3D
texture size is supported;
• cudaDevAttrMaxTextureCubemapWidth: Maximum cubemap texture width or height;
• cudaDevAttrMaxTexture1DLayeredWidth: Maximum 1D layered texture width;
• cudaDevAttrMaxTexture1DLayeredLayers: Maximum layers in a 1D layered texture;
• cudaDevAttrMaxTexture2DLayeredWidth: Maximum 2D layered texture width;
• cudaDevAttrMaxTexture2DLayeredHeight: Maximum 2D layered texture height;
• cudaDevAttrMaxTexture2DLayeredLayers: Maximum layers in a 2D layered texture;
• cudaDevAttrMaxTextureCubemapLayeredWidth: Maximum cubemap layered texture width or height;
• cudaDevAttrMaxTextureCubemapLayeredLayers: Maximum layers in a cubemap layered texture;
• cudaDevAttrMaxSurface1DWidth: Maximum 1D surface width;
• cudaDevAttrMaxSurface2DWidth: Maximum 2D surface width;
• cudaDevAttrMaxSurface2DHeight: Maximum 2D surface height;
• cudaDevAttrMaxSurface3DWidth: Maximum 3D surface width;
• cudaDevAttrMaxSurface3DHeight: Maximum 3D surface height;
• cudaDevAttrMaxSurface3DDepth: Maximum 3D surface depth;
• cudaDevAttrMaxSurface1DLayeredWidth: Maximum 1D layered surface width;
• cudaDevAttrMaxSurface1DLayeredLayers: Maximum layers in a 1D layered surface;
• cudaDevAttrMaxSurface2DLayeredWidth: Maximum 2D layered surface width;
• cudaDevAttrMaxSurface2DLayeredHeight: Maximum 2D layered surface height;
• cudaDevAttrMaxSurface2DLayeredLayers: Maximum layers in a 2D layered surface;
• cudaDevAttrMaxSurfaceCubemapWidth: Maximum cubemap surface width;
• cudaDevAttrMaxSurfaceCubemapLayeredWidth: Maximum cubemap layered surface width;
• cudaDevAttrMaxSurfaceCubemapLayeredLayers: Maximum layers in a cubemap layered surface;
• cudaDevAttrMaxRegistersPerBlock: Maximum number of 32-bit registers available to a thread block;
• cudaDevAttrClockRate: Peak clock frequency in kilohertz;
• cudaDevAttrTextureAlignment: Alignment requirement; texture base addresses aligned to textureAlign
bytes do not need an offset applied to texture fetches;
• cudaDevAttrTexturePitchAlignment: Pitch alignment requirement for 2D texture references bound to
pitched memory;
• cudaDevAttrGpuOverlap: 1 if the device can concurrently copy memory between host and device while
executing a kernel, or 0 if not;
• cudaDevAttrMultiProcessorCount: Number of multiprocessors on the device;
• cudaDevAttrKernelExecTimeout: 1 if there is a run time limit for kernels executed on the device, or 0
if not;
• cudaDevAttrIntegrated: 1 if the device is integrated with the memory subsystem, or 0 if not;
• cudaDevAttrCanMapHostMemory: 1 if the device can map host memory into the CUDA address space, or 0 if
not;
• cudaDevAttrComputeMode: Compute mode that the device is currently in. Available modes are as
follows:
• cudaComputeModeDefault: Default mode - Device is not restricted and multiple threads can use
cudaSetDevice() with this device.
• cudaComputeModeExclusive: Compute-exclusive mode - Only one thread will be able to use
cudaSetDevice() with this device.
• cudaComputeModeProhibited: Compute-prohibited mode - No threads can use cudaSetDevice() with this
device.
• cudaComputeModeExclusiveProcess: Compute-exclusive-process mode - Many threads in one process will be
able to use cudaSetDevice() with this device.
• cudaDevAttrConcurrentKernels: 1 if the device supports executing multiple kernels within the same
context simultaneously, or 0 if not. It is not guaranteed that multiple kernels will be resident on the
device concurrently so this feature should not be relied upon for correctness;
• cudaDevAttrEccEnabled: 1 if error correction is enabled on the device, 0 if error correction is
disabled or not supported by the device;
• cudaDevAttrPciBusId: PCI bus identifier of the device;
• cudaDevAttrPciDeviceId: PCI device (also known as slot) identifier of the device;
• cudaDevAttrTccDriver: 1 if the device is using a TCC driver. TCC is only available on Tesla hardware
running Windows Vista or later;
• cudaDevAttrMemoryClockRate: Peak memory clock frequency in kilohertz;
• cudaDevAttrGlobalMemoryBusWidth: Global memory bus width in bits;
• cudaDevAttrL2CacheSize: Size of L2 cache in bytes. 0 if the device doesn't have L2 cache;
• cudaDevAttrMaxThreadsPerMultiProcessor: Maximum resident threads per multiprocessor;
• cudaDevAttrUnifiedAddressing: 1 if the device shares a unified address space with the host, or 0 if
not;
• cudaDevAttrComputeCapabilityMajor: Major compute capability version number;
• cudaDevAttrComputeCapabilityMinor: Minor compute capability version number;
• cudaDevAttrStreamPrioritiesSupported: 1 if the device supports stream priorities, or 0 if not;
• cudaDevAttrGlobalL1CacheSupported: 1 if device supports caching globals in L1 cache, 0 if not;
• cudaDevAttrLocalL1CacheSupported: 1 if device supports caching locals in L1 cache, 0 if not;
• cudaDevAttrMaxSharedMemoryPerMultiprocessor: Maximum amount of shared memory available to a
multiprocessor in bytes; this amount is shared by all thread blocks simultaneously resident on a
multiprocessor;
• cudaDevAttrMaxRegistersPerMultiprocessor: Maximum number of 32-bit registers available to a
multiprocessor; this number is shared by all thread blocks simultaneously resident on a multiprocessor;
• cudaDevAttrManagedMemory: 1 if device supports allocating managed memory, 0 if not;
• cudaDevAttrIsMultiGpuBoard: 1 if device is on a multi-GPU board, 0 if not;
• cudaDevAttrMultiGpuBoardGroupID: Unique identifier for a group of devices on the same multi-GPU board;
• cudaDevAttrHostNativeAtomicSupported: 1 if the link between the device and the host supports native
atomic operations;
• cudaDevAttrSingleToDoublePrecisionPerfRatio: Ratio of single precision performance (in floating-point
operations per second) to double precision performance;
• cudaDevAttrPageableMemoryAccess: 1 if the device supports coherently accessing pageable memory without
calling cudaHostRegister on it, and 0 otherwise.
• cudaDevAttrConcurrentManagedAccess: 1 if the device can coherently access managed memory concurrently
with the CPU, and 0 otherwise.
• cudaDevAttrComputePreemptionSupported: 1 if the device supports Compute Preemption, 0 if not.
• cudaDevAttrCanUseHostPointerForRegisteredMem: 1 if the device can access host registered memory at the
same virtual address as the CPU, and 0 otherwise.
• cudaDevAttrCooperativeLaunch: 1 if the device supports launching cooperative kernels via
cudaLaunchCooperativeKernel, and 0 otherwise.
• cudaDevAttrCooperativeMultiDeviceLaunch: 1 if the device supports launching cooperative kernels via
cudaLaunchCooperativeKernelMultiDevice, and 0 otherwise.
• cudaDevAttrCanFlushRemoteWrites: 1 if the device supports flushing of outstanding remote writes, and 0
otherwise.
• cudaDevAttrHostRegisterSupported: 1 if the device supports host memory registration via
cudaHostRegister, and 0 otherwise.
• cudaDevAttrPageableMemoryAccessUsesHostPageTables: 1 if the device accesses pageable memory via the
host's page tables, and 0 otherwise.
• cudaDevAttrDirectManagedMemAccessFromHost: 1 if the host can directly access managed memory on the
device without migration, and 0 otherwise.
• cudaDevAttrMaxSharedMemoryPerBlockOptin: Maximum per block shared memory size on the device. This value
can be opted into when using cudaFuncSetAttribute
Parameters:
value - Returned device attribute value
attr - Device attribute to query
device - Device number to query
Returns:
cudaSuccess, cudaErrorInvalidDevice, cudaErrorInvalidValue
Note:
Note that this function may also return error codes from previous, asynchronous launches.
See also:
cudaGetDeviceCount, cudaGetDevice, cudaSetDevice, cudaChooseDevice, cudaGetDeviceProperties,
cuDeviceGetAttribute
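As a usage sketch, querying a few individual attributes of device 0 (assuming at least one device
is present):

    #include <cuda_runtime.h>
    #include <stdio.h>

    int main(void)
    {
        int warpSize = 0, smCount = 0, mode = 0;

        /* Querying single attributes is cheaper than filling a full
           cudaDeviceProp when only a few values are needed. */
        cudaDeviceGetAttribute(&warpSize, cudaDevAttrWarpSize, 0);
        cudaDeviceGetAttribute(&smCount, cudaDevAttrMultiProcessorCount, 0);
        cudaDeviceGetAttribute(&mode, cudaDevAttrComputeMode, 0);

        printf("warp size: %d, multiprocessors: %d, compute mode: %d\n",
               warpSize, smCount, mode);
        return 0;
    }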
cudaError_t cudaDeviceGetByPCIBusId (int * device, const char * pciBusId)
Returns in *device a device ordinal given a PCI bus ID string.
Parameters:
device - Returned device ordinal
pciBusId - String in one of the following forms: [domain]:[bus]:[device].[function],
[domain]:[bus]:[device], or [bus]:[device].[function], where domain, bus, device, and function
are all hexadecimal values
Returns:
cudaSuccess, cudaErrorInvalidValue, cudaErrorInvalidDevice
Note:
Note that this function may also return error codes from previous, asynchronous launches.
See also:
cudaDeviceGetPCIBusId, cuDeviceGetByPCIBusId
__cudart_builtin__ cudaError_t cudaDeviceGetCacheConfig (enum cudaFuncCache * pCacheConfig)
On devices where the L1 cache and shared memory use the same hardware resources, this returns through
pCacheConfig the preferred cache configuration for the current device. This is only a preference. The
runtime will use the requested configuration if possible, but it is free to choose a different
configuration if required to execute functions.
This will return a pCacheConfig of cudaFuncCachePreferNone on devices where the sizes of the L1
cache and shared memory are fixed.
The supported cache configurations are:
• cudaFuncCachePreferNone: no preference for shared memory or L1 (default)
• cudaFuncCachePreferShared: prefer larger shared memory and smaller L1 cache
• cudaFuncCachePreferL1: prefer larger L1 cache and smaller shared memory
• cudaFuncCachePreferEqual: prefer equal size L1 cache and shared memory
Parameters:
pCacheConfig - Returned cache configuration
Returns:
cudaSuccess
Note:
Note that this function may also return error codes from previous, asynchronous launches.
See also:
cudaDeviceSetCacheConfig, cudaFuncSetCacheConfig (C API), cudaFuncSetCacheConfig (C++ API),
cuCtxGetCacheConfig
__cudart_builtin__ cudaError_t cudaDeviceGetLimit (size_t * pValue, enum cudaLimit limit)
Returns in *pValue the current size of limit. The supported cudaLimit values are:
• cudaLimitStackSize: stack size in bytes of each GPU thread;
• cudaLimitPrintfFifoSize: size in bytes of the shared FIFO used by the printf() device system call.
• cudaLimitMallocHeapSize: size in bytes of the heap used by the malloc() and free() device system calls;
• cudaLimitDevRuntimeSyncDepth: maximum grid depth at which a thread can issue the device runtime call
cudaDeviceSynchronize() to wait on child grid launches to complete.
• cudaLimitDevRuntimePendingLaunchCount: maximum number of outstanding device runtime launches.
• cudaLimitMaxL2FetchGranularity: L2 cache fetch granularity.
Parameters:
limit - Limit to query
pValue - Returned size of the limit
Returns:
cudaSuccess, cudaErrorUnsupportedLimit, cudaErrorInvalidValue
Note:
Note that this function may also return error codes from previous, asynchronous launches.
See also:
cudaDeviceSetLimit, cuCtxGetLimit
__cudart_builtin__ cudaError_t cudaDeviceGetP2PAttribute (int * value, enum cudaDeviceP2PAttr attr, int
srcDevice, int dstDevice)
Returns in *value the value of the requested attribute attr of the link between srcDevice and
dstDevice. The supported attributes are:
• cudaDevP2PAttrPerformanceRank: A relative value indicating the performance of the link between two
devices. Lower value means better performance (0 being the value used for most performant link).
• cudaDevP2PAttrAccessSupported: 1 if peer access is enabled.
• cudaDevP2PAttrNativeAtomicSupported: 1 if native atomic operations over the link are supported.
• cudaDevP2PAttrCudaArrayAccessSupported: 1 if accessing CUDA arrays over the link is supported.
Returns cudaErrorInvalidDevice if srcDevice or dstDevice are not valid or if they represent the same
device.
Returns cudaErrorInvalidValue if attr is not valid or if value is a null pointer.
Parameters:
value - Returned value of the requested attribute
attr - The requested attribute of the link between srcDevice and dstDevice.
srcDevice - The source device of the target link.
dstDevice - The destination device of the target link.
Returns:
cudaSuccess, cudaErrorInvalidDevice, cudaErrorInvalidValue
Note:
Note that this function may also return error codes from previous, asynchronous launches.
See also:
cudaDeviceEnablePeerAccess, cudaDeviceDisablePeerAccess, cudaDeviceCanAccessPeer,
cuDeviceGetP2PAttribute
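A minimal sketch, assuming a system with at least two devices, that queries the link from
device 0 to device 1:

    #include <cuda_runtime.h>
    #include <stdio.h>

    int main(void)
    {
        int accessSupported = 0, perfRank = 0;

        /* Fails with cudaErrorInvalidDevice on single-GPU systems. */
        if (cudaDeviceGetP2PAttribute(&accessSupported,
                                      cudaDevP2PAttrAccessSupported,
                                      0, 1) != cudaSuccess)
            return 1;
        cudaDeviceGetP2PAttribute(&perfRank,
                                  cudaDevP2PAttrPerformanceRank, 0, 1);

        printf("peer access 0->1: %d (performance rank %d)\n",
               accessSupported, perfRank);
        return 0;
    }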
cudaError_t cudaDeviceGetPCIBusId (char * pciBusId, int len, int device)
Returns an ASCII string identifying device device in the NULL-terminated string pointed to by
pciBusId.
len specifies the maximum length of the string that may be returned.
Parameters:
pciBusId - Returned identifier string for the device in the following format
[domain]:[bus]:[device].[function] where domain, bus, device, and function are all hexadecimal
values. pciBusId should be large enough to store 13 characters including the NULL-terminator.
len - Maximum length of string to store in pciBusId
device - Device to get identifier string for
Returns:
cudaSuccess, cudaErrorInvalidValue, cudaErrorInvalidDevice
Note:
Note that this function may also return error codes from previous, asynchronous launches.
See also:
cudaDeviceGetByPCIBusId, cuDeviceGetPCIBusId
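For illustration, a sketch that formats the PCI bus ID of device 0 and maps it back to a device
ordinal with cudaDeviceGetByPCIBusId:

    #include <cuda_runtime.h>
    #include <stdio.h>

    int main(void)
    {
        char busId[16];   /* at least 13 bytes including the NULL-terminator */
        int device = -1;

        if (cudaDeviceGetPCIBusId(busId, (int)sizeof(busId), 0) != cudaSuccess)
            return 1;
        if (cudaDeviceGetByPCIBusId(&device, busId) != cudaSuccess)
            return 1;

        printf("device 0 has PCI bus id %s (ordinal %d)\n", busId, device);
        return 0;
    }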
__cudart_builtin__ cudaError_t cudaDeviceGetSharedMemConfig (enum cudaSharedMemConfig * pConfig)
This function will return in pConfig the current size of shared memory banks on the current device. On
devices with configurable shared memory banks, cudaDeviceSetSharedMemConfig can be used to change this
setting, so that all subsequent kernel launches will by default use the new bank size. When
cudaDeviceGetSharedMemConfig is called on devices without configurable shared memory, it will return the
fixed bank size of the hardware.
The returned bank configurations can be either:
• cudaSharedMemBankSizeFourByte - shared memory bank width is four bytes.
• cudaSharedMemBankSizeEightByte - shared memory bank width is eight bytes.
Parameters:
pConfig - Returned cache configuration
Returns:
cudaSuccess, cudaErrorInvalidValue
Note:
Note that this function may also return error codes from previous, asynchronous launches.
See also:
cudaDeviceSetCacheConfig, cudaDeviceGetCacheConfig, cudaDeviceSetSharedMemConfig,
cudaFuncSetCacheConfig, cuCtxGetSharedMemConfig
__cudart_builtin__ cudaError_t cudaDeviceGetStreamPriorityRange (int * leastPriority, int * greatestPriority)
Returns in *leastPriority and *greatestPriority the numerical values that correspond to the least and
greatest stream priorities respectively. Stream priorities follow a convention where lower numbers imply
greater priorities. The range of meaningful stream priorities is given by [*greatestPriority,
*leastPriority]. If the user attempts to create a stream with a priority value that is outside the the
meaningful range as specified by this API, the priority is automatically clamped down or up to either
*leastPriority or *greatestPriority respectively. See cudaStreamCreateWithPriority for details on
creating a priority stream. A NULL may be passed in for *leastPriority or *greatestPriority if the value
is not desired.
This function will return '0' in both *leastPriority and *greatestPriority if the current context's
device does not support stream priorities (see cudaDeviceGetAttribute).
Parameters:
leastPriority - Pointer to an int in which the numerical value for least stream priority is returned
greatestPriority - Pointer to an int in which the numerical value for greatest stream priority is
returned
Returns:
cudaSuccess
Note:
Note that this function may also return error codes from previous, asynchronous launches.
See also:
cudaStreamCreateWithPriority, cudaStreamGetPriority, cuCtxGetStreamPriorityRange
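A usage sketch that queries the range and then creates a stream at the greatest (numerically
lowest) priority:

    #include <cuda_runtime.h>
    #include <stdio.h>

    int main(void)
    {
        int least = 0, greatest = 0;
        cudaStream_t stream;

        cudaDeviceGetStreamPriorityRange(&least, &greatest);
        printf("priority range: [%d (greatest), %d (least)]\n", greatest, least);

        /* Lower numbers mean higher priority. */
        cudaStreamCreateWithPriority(&stream, cudaStreamNonBlocking, greatest);
        cudaStreamDestroy(stream);
        return 0;
    }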
cudaError_t cudaDeviceReset (void)
Explicitly destroys and cleans up all resources associated with the current device in the current
process. Any subsequent API call to this device will reinitialize the device.
Note that this function will reset the device immediately. It is the caller's responsibility to ensure
that the device is not being accessed by any other host threads from the process when this function is
called.
Returns:
cudaSuccess
Note:
Note that this function may also return error codes from previous, asynchronous launches.
See also:
cudaDeviceSynchronize
cudaError_t cudaDeviceSetCacheConfig (enum cudaFuncCache cacheConfig)
On devices where the L1 cache and shared memory use the same hardware resources, this sets through
cacheConfig the preferred cache configuration for the current device. This is only a preference. The
runtime will use the requested configuration if possible, but it is free to choose a different
configuration if required to execute the function. Any function preference set via cudaFuncSetCacheConfig
(C API) or cudaFuncSetCacheConfig (C++ API) will be preferred over this device-wide setting. Setting the
device-wide cache configuration to cudaFuncCachePreferNone will cause subsequent kernel launches to
prefer to not change the cache configuration unless required to launch the kernel.
This setting does nothing on devices where the sizes of the L1 cache and shared memory are fixed.
Launching a kernel with a different preference than the most recent preference setting may insert a
device-side synchronization point.
The supported cache configurations are:
• cudaFuncCachePreferNone: no preference for shared memory or L1 (default)
• cudaFuncCachePreferShared: prefer larger shared memory and smaller L1 cache
• cudaFuncCachePreferL1: prefer larger L1 cache and smaller shared memory
• cudaFuncCachePreferEqual: prefer equal size L1 cache and shared memory
Parameters:
cacheConfig - Requested cache configuration
Returns:
cudaSuccess
Note:
Note that this function may also return error codes from previous, asynchronous launches.
See also:
cudaDeviceGetCacheConfig, cudaFuncSetCacheConfig (C API), cudaFuncSetCacheConfig (C++ API),
cuCtxSetCacheConfig
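As a brief sketch, requesting a larger L1 cache and reading back the recorded preference (a no-op
on devices with fixed L1/shared memory sizes):

    #include <cuda_runtime.h>
    #include <stdio.h>

    int main(void)
    {
        enum cudaFuncCache config;

        cudaDeviceSetCacheConfig(cudaFuncCachePreferL1);
        cudaDeviceGetCacheConfig(&config);

        printf("preferred cache config: %d\n", (int)config);
        return 0;
    }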
cudaError_t cudaDeviceSetLimit (enum cudaLimit limit, size_t value)
Setting limit to value is a request by the application to update the current limit maintained by the
device. The driver is free to modify the requested value to meet hardware requirements (this
could be clamping to minimum or maximum values, rounding up to nearest element size, etc.). The
application can use
cudaDeviceGetLimit() to find out exactly what the limit has been set to.
Setting each cudaLimit has its own specific restrictions, so each is discussed here.
• cudaLimitStackSize controls the stack size in bytes of each GPU thread. Note that the CUDA driver will
set the limit to the maximum of value and what the kernel function requires.
• cudaLimitPrintfFifoSize controls the size in bytes of the shared FIFO used by the printf() device
system call. Setting cudaLimitPrintfFifoSize must not be performed after launching any kernel that uses
the printf() device system call - in such case cudaErrorInvalidValue will be returned.
• cudaLimitMallocHeapSize controls the size in bytes of the heap used by the malloc() and free() device
system calls. Setting cudaLimitMallocHeapSize must not be performed after launching any kernel that
uses the malloc() or free() device system calls - in such case cudaErrorInvalidValue will be returned.
• cudaLimitDevRuntimeSyncDepth controls the maximum nesting depth of a grid at which a thread can safely
call cudaDeviceSynchronize(). Setting this limit must be performed before any launch of a kernel that
uses the device runtime and calls cudaDeviceSynchronize() above the default sync depth, two levels of
grids. Calls to cudaDeviceSynchronize() will fail with error code cudaErrorSyncDepthExceeded if the
limitation is violated. This limit can be set smaller than the default or up to the maximum launch
depth
of 24. When setting this limit, keep in mind that additional levels of sync depth require the runtime
to reserve large amounts of device memory which can no longer be used for user allocations. If these
reservations of device memory fail, cudaDeviceSetLimit will return cudaErrorMemoryAllocation, and the
limit can be reset to a lower value. This limit is only applicable to devices of compute capability 3.5
and higher. Attempting to set this limit on devices of compute capability less than 3.5 will result in
the error cudaErrorUnsupportedLimit being returned.
• cudaLimitDevRuntimePendingLaunchCount controls the maximum number of outstanding device runtime
launches that can be made from the current device. A grid is outstanding from the point of launch up
until the grid is known to have been completed. Device runtime launches which violate this limitation
fail and return cudaErrorLaunchPendingCountExceeded when cudaGetLastError() is called after launch. If
more pending launches than the default (2048 launches) are needed for a module using the device
runtime, this limit can be increased. Keep in mind that being able to sustain additional pending
launches will require the runtime to reserve larger amounts of device memory upfront which can no
longer be used for allocations. If these reservations fail, cudaDeviceSetLimit will return
cudaErrorMemoryAllocation, and the limit can be reset to a lower value. This limit is only applicable
to devices of compute capability 3.5 and higher. Attempting to set this limit on devices of compute
capability less than 3.5 will result in the error cudaErrorUnsupportedLimit being returned.
• cudaLimitMaxL2FetchGranularity controls the L2 cache fetch granularity. Values can range from 0B to
128B. This is purely a performance hint and it can be ignored or clamped depending on the platform.
Parameters:
limit - Limit to set
value - Size of limit
Returns:
cudaSuccess, cudaErrorUnsupportedLimit, cudaErrorInvalidValue, cudaErrorMemoryAllocation
Note:
Note that this function may also return error codes from previous, asynchronous launches.
See also:
cudaDeviceGetLimit, cuCtxSetLimit
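For illustration, a sketch that requests a 64 MiB device malloc() heap before any kernel launch
and reads back the value actually set (the driver may round the request):

    #include <cuda_runtime.h>
    #include <stdio.h>

    int main(void)
    {
        size_t heapSize = 0;

        /* Must precede any launch of a kernel that calls malloc()/free(). */
        if (cudaDeviceSetLimit(cudaLimitMallocHeapSize, 64 << 20) != cudaSuccess)
            return 1;

        cudaDeviceGetLimit(&heapSize, cudaLimitMallocHeapSize);
        printf("malloc heap size: %zu bytes\n", heapSize);
        return 0;
    }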
cudaError_t cudaDeviceSetSharedMemConfig (enum cudaSharedMemConfig config)
On devices with configurable shared memory banks, this function will set the shared memory bank size
which is used for all subsequent kernel launches. Any per-function setting of shared memory set via
cudaFuncSetSharedMemConfig will override the device wide setting.
Changing the shared memory configuration between launches may introduce a device side synchronization
point.
Changing the shared memory bank size will not increase shared memory usage or affect occupancy of
kernels, but may have major effects on performance. Larger bank sizes will allow for greater potential
bandwidth to shared memory, but will change what kinds of accesses to shared memory will result in bank
conflicts.
This function will do nothing on devices with fixed shared memory bank size.
The supported bank configurations are:
• cudaSharedMemBankSizeDefault: set bank width to the device default (currently, four bytes)
• cudaSharedMemBankSizeFourByte: set shared memory bank width to be four bytes natively.
• cudaSharedMemBankSizeEightByte: set shared memory bank width to be eight bytes natively.
Parameters:
config - Requested cache configuration
Returns:
cudaSuccess, cudaErrorInvalidValue
Note:
Note that this function may also return error codes from previous, asynchronous launches.
See also:
cudaDeviceSetCacheConfig, cudaDeviceGetCacheConfig, cudaDeviceGetSharedMemConfig,
cudaFuncSetCacheConfig, cuCtxSetSharedMemConfig
__cudart_builtin__ cudaError_t cudaDeviceSynchronize (void)
Blocks until the device has completed all preceding requested tasks. cudaDeviceSynchronize() returns an
error if one of the preceding tasks has failed. If the cudaDeviceScheduleBlockingSync flag was set for
this device, the host thread will block until the device has finished its work.
Returns:
cudaSuccess
Note:
Note that this function may also return error codes from previous, asynchronous launches.
See also:
cudaDeviceReset, cuCtxSynchronize
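A minimal sketch showing the common pattern of synchronizing after asynchronous work to surface
any deferred errors:

    #include <cuda_runtime.h>
    #include <stdio.h>

    int main(void)
    {
        float *buf;
        cudaError_t err;

        cudaMalloc((void **)&buf, 1024 * sizeof(float));
        cudaMemsetAsync(buf, 0, 1024 * sizeof(float), 0); /* async w.r.t. host */

        /* Blocks until all preceding device work has completed; also
           surfaces errors from earlier asynchronous calls. */
        err = cudaDeviceSynchronize();
        if (err != cudaSuccess)
            fprintf(stderr, "device error: %s\n", cudaGetErrorString(err));

        cudaFree(buf);
        return 0;
    }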
__cudart_builtin__ cudaError_t cudaGetDevice (int * device)
Returns in *device the current device for the calling host thread.
Parameters:
device - Returns the device on which the active host thread executes the device code.
Returns:
cudaSuccess, cudaErrorInvalidValue
Note:
Note that this function may also return error codes from previous, asynchronous launches.
See also:
cudaGetDeviceCount, cudaSetDevice, cudaGetDeviceProperties, cudaChooseDevice, cuCtxGetCurrent
__cudart_builtin__ cudaError_t cudaGetDeviceCount (int * count)
Returns in *count the number of devices with compute capability greater than or equal to 2.0 that
are available for execution.
Parameters:
count - Returns the number of devices with compute capability greater than or equal to 2.0
Returns:
cudaErrorInvalidValue (if a NULL count pointer is passed), cudaSuccess
Note:
Note that this function may also return error codes from previous, asynchronous launches.
See also:
cudaGetDevice, cudaSetDevice, cudaGetDeviceProperties, cudaChooseDevice, cuDeviceGetCount
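A typical enumeration sketch, listing every visible device:

    #include <cuda_runtime.h>
    #include <stdio.h>

    int main(void)
    {
        int count = 0;

        if (cudaGetDeviceCount(&count) != cudaSuccess)
            return 1;

        for (int dev = 0; dev < count; ++dev) {
            struct cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, dev);
            printf("device %d: %s (compute capability %d.%d)\n",
                   dev, prop.name, prop.major, prop.minor);
        }
        return 0;
    }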
cudaError_t cudaGetDeviceFlags (unsigned int * flags)
Returns in flags the flags for the current device. If there is a current device for the calling thread,
and the device has been initialized or flags have been set on that device specifically, the flags for the
device are returned. If there is no current device, but flags have been set for the thread with
cudaSetDeviceFlags, the thread flags are returned. Finally, if there is no current device and no thread
flags, the flags for the first device are returned, which may be the default flags. Compare to the
behavior of cudaSetDeviceFlags.
Typically, the flags returned should match the behavior that will be seen if the calling thread uses a
device after this call, without any change to the flags or current device in between by this or
another
thread. Note that if the device is not initialized, it is possible for another thread to change the flags
for the current device before it is initialized. Additionally, when using exclusive mode, if this thread
has not requested a specific device, it may use a device other than the first device, contrary to the
assumption made by this function.
If a context has been created via the driver API and is current to the calling thread, the flags for that
context are always returned.
Flags returned by this function may specifically include cudaDeviceMapHost even though it is not accepted
by cudaSetDeviceFlags because it is implicit in runtime API flags. The reason for this is that the
current context may have been created via the driver API in which case the flag is not implicit and may
be unset.
Parameters:
flags - Pointer to store the device flags
Returns:
cudaSuccess, cudaErrorInvalidDevice, cudaErrorInvalidValue
Note:
Note that this function may also return error codes from previous, asynchronous launches.
See also:
cudaGetDevice, cudaGetDeviceProperties, cudaSetDevice, cudaSetDeviceFlags, cuCtxGetFlags,
cuDevicePrimaryCtxGetState
__cudart_builtin__ cudaError_t cudaGetDeviceProperties (struct cudaDeviceProp * prop, int device)
Returns in *prop the properties of device device. The cudaDeviceProp structure is defined as:
struct cudaDeviceProp {
char name[256];
cudaUUID_t uuid;
size_t totalGlobalMem;
size_t sharedMemPerBlock;
int regsPerBlock;
int warpSize;
size_t memPitch;
int maxThreadsPerBlock;
int maxThreadsDim[3];
int maxGridSize[3];
int clockRate;
size_t totalConstMem;
int major;
int minor;
size_t textureAlignment;
size_t texturePitchAlignment;
int deviceOverlap;
int multiProcessorCount;
int kernelExecTimeoutEnabled;
int integrated;
int canMapHostMemory;
int computeMode;
int maxTexture1D;
int maxTexture1DMipmap;
int maxTexture1DLinear;
int maxTexture2D[2];
int maxTexture2DMipmap[2];
int maxTexture2DLinear[3];
int maxTexture2DGather[2];
int maxTexture3D[3];
int maxTexture3DAlt[3];
int maxTextureCubemap;
int maxTexture1DLayered[2];
int maxTexture2DLayered[3];
int maxTextureCubemapLayered[2];
int maxSurface1D;
int maxSurface2D[2];
int maxSurface3D[3];
int maxSurface1DLayered[2];
int maxSurface2DLayered[3];
int maxSurfaceCubemap;
int maxSurfaceCubemapLayered[2];
size_t surfaceAlignment;
int concurrentKernels;
int ECCEnabled;
int pciBusID;
int pciDeviceID;
int pciDomainID;
int tccDriver;
int asyncEngineCount;
int unifiedAddressing;
int memoryClockRate;
int memoryBusWidth;
int l2CacheSize;
int maxThreadsPerMultiProcessor;
int streamPrioritiesSupported;
int globalL1CacheSupported;
int localL1CacheSupported;
size_t sharedMemPerMultiprocessor;
int regsPerMultiprocessor;
int managedMemory;
int isMultiGpuBoard;
int multiGpuBoardGroupID;
int singleToDoublePrecisionPerfRatio;
int pageableMemoryAccess;
int concurrentManagedAccess;
int computePreemptionSupported;
int canUseHostPointerForRegisteredMem;
int cooperativeLaunch;
int cooperativeMultiDeviceLaunch;
int pageableMemoryAccessUsesHostPageTables;
int directManagedMemAccessFromHost;
};
where:
• name[256] is an ASCII string identifying the device;
• uuid is a 16-byte unique identifier.
• totalGlobalMem is the total amount of global memory available on the device in bytes;
• sharedMemPerBlock is the maximum amount of shared memory available to a thread block in bytes;
• regsPerBlock is the maximum number of 32-bit registers available to a thread block;
• warpSize is the warp size in threads;
• memPitch is the maximum pitch in bytes allowed by the memory copy functions that involve memory regions
allocated through cudaMallocPitch();
• maxThreadsPerBlock is the maximum number of threads per block;
• maxThreadsDim[3] contains the maximum size of each dimension of a block;
• maxGridSize[3] contains the maximum size of each dimension of a grid;
• clockRate is the clock frequency in kilohertz;
• totalConstMem is the total amount of constant memory available on the device in bytes;
• major, minor are the major and minor revision numbers defining the device's compute capability;
• textureAlignment is the alignment requirement; texture base addresses that are aligned to
textureAlignment bytes do not need an offset applied to texture fetches;
• texturePitchAlignment is the pitch alignment requirement for 2D texture references that are bound to
pitched memory;
• deviceOverlap is 1 if the device can concurrently copy memory between host and device while executing a
kernel, or 0 if not. Deprecated, use instead asyncEngineCount.
• multiProcessorCount is the number of multiprocessors on the device;
• kernelExecTimeoutEnabled is 1 if there is a run time limit for kernels executed on the device, or 0 if
not.
• integrated is 1 if the device is an integrated (motherboard) GPU and 0 if it is a discrete (card)
component.
• canMapHostMemory is 1 if the device can map host memory into the CUDA address space for use with
cudaHostAlloc()/cudaHostGetDevicePointer(), or 0 if not;
• computeMode is the compute mode that the device is currently in. Available modes are as follows:
• cudaComputeModeDefault: Default mode - Device is not restricted and multiple threads can use
cudaSetDevice() with this device.
• cudaComputeModeExclusive: Compute-exclusive mode - Only one thread will be able to use
cudaSetDevice() with this device.
• cudaComputeModeProhibited: Compute-prohibited mode - No threads can use cudaSetDevice() with this
device.
• cudaComputeModeExclusiveProcess: Compute-exclusive-process mode - Many threads in one process will be
able to use cudaSetDevice() with this device.
If cudaSetDevice() is called on an already occupied device with computeMode
cudaComputeModeExclusive, cudaErrorDeviceAlreadyInUse will be immediately returned indicating the
device cannot be used. When an occupied exclusive mode device is chosen with cudaSetDevice, all
subsequent non-device management runtime functions will return cudaErrorDevicesUnavailable.
• maxTexture1D is the maximum 1D texture size.
• maxTexture1DMipmap is the maximum 1D mipmapped texture size.
• maxTexture1DLinear is the maximum 1D texture size for textures bound to linear memory.
• maxTexture2D[2] contains the maximum 2D texture dimensions.
• maxTexture2DMipmap[2] contains the maximum 2D mipmapped texture dimensions.
• maxTexture2DLinear[3] contains the maximum 2D texture dimensions for 2D textures bound to pitch linear
memory.
• maxTexture2DGather[2] contains the maximum 2D texture dimensions if texture gather operations have to
be performed.
• maxTexture3D[3] contains the maximum 3D texture dimensions.
• maxTexture3DAlt[3] contains the maximum alternate 3D texture dimensions.
• maxTextureCubemap is the maximum cubemap texture width or height.
• maxTexture1DLayered[2] contains the maximum 1D layered texture dimensions.
• maxTexture2DLayered[3] contains the maximum 2D layered texture dimensions.
• maxTextureCubemapLayered[2] contains the maximum cubemap layered texture dimensions.
• maxSurface1D is the maximum 1D surface size.
• maxSurface2D[2] contains the maximum 2D surface dimensions.
• maxSurface3D[3] contains the maximum 3D surface dimensions.
• maxSurface1DLayered[2] contains the maximum 1D layered surface dimensions.
• maxSurface2DLayered[3] contains the maximum 2D layered surface dimensions.
• maxSurfaceCubemap is the maximum cubemap surface width or height.
• maxSurfaceCubemapLayered[2] contains the maximum cubemap layered surface dimensions.
• surfaceAlignment specifies the alignment requirements for surfaces.
• concurrentKernels is 1 if the device supports executing multiple kernels within the same context
simultaneously, or 0 if not. It is not guaranteed that multiple kernels will be resident on the device
concurrently so this feature should not be relied upon for correctness;
• ECCEnabled is 1 if the device has ECC support turned on, or 0 if not.
• pciBusID is the PCI bus identifier of the device.
• pciDeviceID is the PCI device (sometimes called slot) identifier of the device.
• pciDomainID is the PCI domain identifier of the device.
• tccDriver is 1 if the device is using a TCC driver or 0 if not.
• asyncEngineCount is 1 when the device can concurrently copy memory between host and device while
executing a kernel. It is 2 when the device can concurrently copy memory between host and device in
both directions and execute a kernel at the same time. It is 0 if neither of these is supported.
• unifiedAddressing is 1 if the device shares a unified address space with the host and 0 otherwise.
• memoryClockRate is the peak memory clock frequency in kilohertz.
• memoryBusWidth is the memory bus width in bits.
• l2CacheSize is L2 cache size in bytes.
• maxThreadsPerMultiProcessor is the number of maximum resident threads per multiprocessor.
• streamPrioritiesSupported is 1 if the device supports stream priorities, or 0 if it is not supported.
• globalL1CacheSupported is 1 if the device supports caching of globals in L1 cache, or 0 if it is not
supported.
• localL1CacheSupported is 1 if the device supports caching of locals in L1 cache, or 0 if it is not
supported.
• sharedMemPerMultiprocessor is the maximum amount of shared memory available to a multiprocessor in
bytes; this amount is shared by all thread blocks simultaneously resident on a multiprocessor;
• regsPerMultiprocessor is the maximum number of 32-bit registers available to a multiprocessor; this
number is shared by all thread blocks simultaneously resident on a multiprocessor;
• managedMemory is 1 if the device supports allocating managed memory on this system, or 0 if it is not
supported.
• isMultiGpuBoard is 1 if the device is on a multi-GPU board (e.g. Gemini cards), and 0 if not;
• multiGpuBoardGroupID is a unique identifier for a group of devices associated with the same board.
Devices on the same multi-GPU board will share the same identifier;
• singleToDoublePrecisionPerfRatio is the ratio of single precision performance (in floating-point
operations per second) to double precision performance.
• pageableMemoryAccess is 1 if the device supports coherently accessing pageable memory without calling
cudaHostRegister on it, and 0 otherwise.
• concurrentManagedAccess is 1 if the device can coherently access managed memory concurrently with the
CPU, and 0 otherwise.
• computePreemptionSupported is 1 if the device supports Compute Preemption, and 0 otherwise.
• canUseHostPointerForRegisteredMem is 1 if the device can access host registered memory at the same
virtual address as the CPU, and 0 otherwise.
• cooperativeLaunch is 1 if the device supports launching cooperative kernels via
cudaLaunchCooperativeKernel, and 0 otherwise.
• cooperativeMultiDeviceLaunch is 1 if the device supports launching cooperative kernels via
cudaLaunchCooperativeKernelMultiDevice, and 0 otherwise.
• pageableMemoryAccessUsesHostPageTables is 1 if the device accesses pageable memory via the host's page
tables, and 0 otherwise.
• directManagedMemAccessFromHost is 1 if the host can directly access managed memory on the device
without migration, and 0 otherwise.
Parameters:
prop - Properties for the specified device
device - Device number to get properties for
Returns:
cudaSuccess, cudaErrorInvalidDevice
Note:
Note that this function may also return error codes from previous, asynchronous launches.
See also:
cudaGetDeviceCount, cudaGetDevice, cudaSetDevice, cudaChooseDevice, cudaDeviceGetAttribute,
cuDeviceGetAttribute, cuDeviceGetName
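As a short usage sketch, printing a few of the fields described above for device 0:

    #include <cuda_runtime.h>
    #include <stdio.h>

    int main(void)
    {
        struct cudaDeviceProp prop;

        if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess)
            return 1;

        printf("%s: %zu bytes global memory, %d multiprocessors, "
               "%d max threads/block, ECC %s\n",
               prop.name, prop.totalGlobalMem, prop.multiProcessorCount,
               prop.maxThreadsPerBlock, prop.ECCEnabled ? "on" : "off");
        return 0;
    }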
cudaError_t cudaIpcCloseMemHandle (void * devPtr)
Unmaps memory returned by cudaIpcOpenMemHandle. The original allocation in the exporting process as well
as imported mappings in other processes will be unaffected.
Any resources used to enable peer access will be freed if this is the last mapping using them.
IPC functionality is restricted to devices with support for unified addressing on Linux operating
systems. IPC functionality is not supported on Tegra platforms.
Parameters:
devPtr - Device pointer returned by cudaIpcOpenMemHandle
Returns:
cudaSuccess, cudaErrorMapBufferObjectFailed, cudaErrorInvalidResourceHandle, cudaErrorNotSupported
See also:
cudaMalloc, cudaFree, cudaIpcGetEventHandle, cudaIpcOpenEventHandle, cudaIpcGetMemHandle,
cudaIpcOpenMemHandle, cuIpcCloseMemHandle
cudaError_t cudaIpcGetEventHandle (cudaIpcEventHandle_t * handle, cudaEvent_t event)
Takes as input a previously allocated event. This event must have been created with the
cudaEventInterprocess and cudaEventDisableTiming flags set. This opaque handle may be copied into other
processes and opened with cudaIpcOpenEventHandle to allow efficient hardware synchronization between GPU
work in different processes.
After the event has been opened in the importing process, cudaEventRecord, cudaEventSynchronize,
cudaStreamWaitEvent and cudaEventQuery may be used in either process. Performing operations on the
imported event after the exported event has been freed with cudaEventDestroy will result in undefined
behavior.
IPC functionality is restricted to devices with support for unified addressing on Linux operating
systems. IPC functionality is not supported on Tegra platforms.
Parameters:
handle - Pointer to a user allocated cudaIpcEventHandle in which to return the opaque event handle
event - Event allocated with cudaEventInterprocess and cudaEventDisableTiming flags.
Returns:
cudaSuccess, cudaErrorInvalidResourceHandle, cudaErrorMemoryAllocation,
cudaErrorMapBufferObjectFailed, cudaErrorNotSupported
See also:
cudaEventCreate, cudaEventDestroy, cudaEventSynchronize, cudaEventQuery, cudaStreamWaitEvent,
cudaIpcOpenEventHandle, cudaIpcGetMemHandle, cudaIpcOpenMemHandle, cudaIpcCloseMemHandle,
cuIpcGetEventHandle
cudaError_t cudaIpcGetMemHandle (cudaIpcMemHandle_t * handle, void * devPtr)
Takes a pointer to the base of an existing device memory allocation created with cudaMalloc and exports
it for use in another process. This is a lightweight operation and may be called multiple times on an
allocation without adverse effects.
If a region of memory is freed with cudaFree and a subsequent call to cudaMalloc returns memory with the
same device address, cudaIpcGetMemHandle will return a unique handle for the new memory.
IPC functionality is restricted to devices with support for unified addressing on Linux operating
systems. IPC functionality is not supported on Tegra platforms.
Parameters:
handle - Pointer to user allocated cudaIpcMemHandle to return the handle in.
devPtr - Base pointer to previously allocated device memory
Returns:
cudaSuccess, cudaErrorInvalidResourceHandle, cudaErrorMemoryAllocation,
cudaErrorMapBufferObjectFailed, cudaErrorNotSupported
See also:
cudaMalloc, cudaFree, cudaIpcGetEventHandle, cudaIpcOpenEventHandle, cudaIpcOpenMemHandle,
cudaIpcCloseMemHandle, cuIpcGetMemHandle
cudaError_t cudaIpcOpenEventHandle (cudaEvent_t * event, cudaIpcEventHandle_t handle)
Opens an interprocess event handle exported from another process with cudaIpcGetEventHandle. This
function returns a cudaEvent_t that behaves like a locally created event with the cudaEventDisableTiming
flag specified. This event must be freed with cudaEventDestroy.
Performing operations on the imported event after the exported event has been freed with cudaEventDestroy
will result in undefined behavior.
IPC functionality is restricted to devices with support for unified addressing on Linux operating
systems. IPC functionality is not supported on Tegra platforms.
Parameters:
event - Returns the imported event
handle - Interprocess handle to open
Returns:
cudaSuccess, cudaErrorMapBufferObjectFailed, cudaErrorInvalidResourceHandle, cudaErrorNotSupported
See also:
cudaEventCreate, cudaEventDestroy, cudaEventSynchronize, cudaEventQuery, cudaStreamWaitEvent,
cudaIpcGetEventHandle, cudaIpcGetMemHandle, cudaIpcOpenMemHandle, cudaIpcCloseMemHandle,
cuIpcOpenEventHandle
cudaError_t cudaIpcOpenMemHandle (void ** devPtr, cudaIpcMemHandle_t handle, unsigned int flags)
Maps memory exported from another process with cudaIpcGetMemHandle into the current device address space.
For contexts on different devices cudaIpcOpenMemHandle can attempt to enable peer access between the
devices as if the user called cudaDeviceEnablePeerAccess. This behavior is controlled by the
cudaIpcMemLazyEnablePeerAccess flag. cudaDeviceCanAccessPeer can determine if a mapping is possible.
cudaIpcOpenMemHandle can open handles to devices that may not be visible in the process calling the API.
Contexts that may open cudaIpcMemHandles are restricted in the following way. cudaIpcMemHandles from each
device in a given process may only be opened by one context per device per other process.
Memory returned from cudaIpcOpenMemHandle must be freed with cudaIpcCloseMemHandle.
Calling cudaFree on an exported memory region before calling cudaIpcCloseMemHandle in the importing
context will result in undefined behavior.
IPC functionality is restricted to devices with support for unified addressing on Linux operating
systems. IPC functionality is not supported on Tegra platforms.
Parameters:
devPtr - Returned device pointer
handle - cudaIpcMemHandle to open
flags - Flags for this operation. Must be specified as cudaIpcMemLazyEnablePeerAccess
Returns:
cudaSuccess, cudaErrorMapBufferObjectFailed, cudaErrorInvalidResourceHandle, cudaErrorTooManyPeers,
cudaErrorNotSupported
Note:
No guarantees are made about the address returned in *devPtr. In particular, multiple processes may
not receive the same address for the same handle.
See also:
cudaMalloc, cudaFree, cudaIpcGetEventHandle, cudaIpcOpenEventHandle, cudaIpcGetMemHandle,
cudaIpcCloseMemHandle, cudaDeviceEnablePeerAccess, cudaDeviceCanAccessPeer, cuIpcOpenMemHandle
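A sketch of the two-process flow; export_buffer and import_buffer are hypothetical helper names,
and transporting the handle between processes (e.g. over a pipe or shared file) is left to the
application:

    #include <cuda_runtime.h>

    /* Exporting process: allocate device memory and produce a handle
       that can be copied to another process. */
    static cudaError_t export_buffer(cudaIpcMemHandle_t *handle, void **devPtr)
    {
        cudaError_t err = cudaMalloc(devPtr, 1 << 20);
        if (err != cudaSuccess)
            return err;
        return cudaIpcGetMemHandle(handle, *devPtr);
    }

    /* Importing process: map the exported allocation locally. The
       mapping must later be released with cudaIpcCloseMemHandle(*devPtr). */
    static cudaError_t import_buffer(cudaIpcMemHandle_t handle, void **devPtr)
    {
        return cudaIpcOpenMemHandle(devPtr, handle,
                                    cudaIpcMemLazyEnablePeerAccess);
    }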
cudaError_t cudaSetDevice (int device)
Sets device as the current device for the calling host thread. Valid device IDs are 0 to
(cudaGetDeviceCount() - 1).
Any device memory subsequently allocated from this host thread using cudaMalloc(), cudaMallocPitch() or
cudaMallocArray() will be physically resident on device. Any host memory allocated from this host thread
using cudaMallocHost() or cudaHostAlloc() or cudaHostRegister() will have its lifetime associated with
device. Any streams or events created from this host thread will be associated with device. Any kernels
launched from this host thread using the <<<>>> operator or cudaLaunchKernel() will be executed on
device.
This call may be made from any host thread, to any device, and at any time. This function will do no
synchronization with the previous or new device, and should be considered a very low overhead call.
Parameters:
device - Device on which the active host thread should execute the device code.
Returns:
cudaSuccess, cudaErrorInvalidDevice, cudaErrorDeviceAlreadyInUse
Note:
Note that this function may also return error codes from previous, asynchronous launches.
See also:
cudaGetDeviceCount, cudaGetDevice, cudaGetDeviceProperties, cudaChooseDevice, cuCtxSetCurrent
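For illustration, a sketch that makes each visible device current in turn; each allocation is
resident on the device that is current at cudaMalloc() time:

    #include <cuda_runtime.h>

    int main(void)
    {
        int count = 0;
        cudaGetDeviceCount(&count);

        for (int dev = 0; dev < count; ++dev) {
            void *buf;
            cudaSetDevice(dev);
            cudaMalloc(&buf, 1 << 20);
            cudaFree(buf);
        }
        return 0;
    }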
cudaError_t cudaSetDeviceFlags (unsigned int flags)
Records flags as the flags to use when initializing the current device. If no device has been made
current to the calling thread, then flags will be applied to the initialization of any device initialized
by the calling host thread, unless that device has had its initialization flags set explicitly by this or
any host thread.
If the current device has been set and that device has already been initialized then this call will fail
with the error cudaErrorSetOnActiveProcess. In this case it is necessary to reset device using
cudaDeviceReset() before the device's initialization flags may be set.
The two LSBs of the flags parameter can be used to control how the CPU thread interacts with the OS
scheduler when waiting for results from the device.
• cudaDeviceScheduleAuto: The default value if the flags parameter is zero, uses a heuristic based on the
number of active CUDA contexts in the process C and the number of logical processors in the system P.
If C > P, then CUDA will yield to other OS threads when waiting for the device, otherwise CUDA will not
yield while waiting for results and actively spin on the processor. Additionally, on Tegra devices,
cudaDeviceScheduleAuto uses a heuristic based on the power profile of the platform and may choose
cudaDeviceScheduleBlockingSync for low-powered devices.
• cudaDeviceScheduleSpin: Instruct CUDA to actively spin when waiting for results from the device. This
can decrease latency when waiting for the device, but may lower the performance of CPU threads if they
are performing work in parallel with the CUDA thread.
• cudaDeviceScheduleYield: Instruct CUDA to yield its thread when waiting for results from the device.
This can increase latency when waiting for the device, but can increase the performance of CPU threads
performing work in parallel with the device.
• cudaDeviceScheduleBlockingSync: Instruct CUDA to block the CPU thread on a synchronization primitive
when waiting for the device to finish work.
• cudaDeviceBlockingSync: Instruct CUDA to block the CPU thread on a synchronization primitive when
waiting for the device to finish work.
Deprecated: This flag was deprecated as of CUDA 4.0 and replaced with cudaDeviceScheduleBlockingSync.
• cudaDeviceMapHost: This flag enables allocating pinned host memory that is accessible to the device. It
is implicit for the runtime but may be absent if a context is created using the driver API. If this
flag is not set, cudaHostGetDevicePointer() will always return a failure code.
• cudaDeviceLmemResizeToMax: Instruct CUDA to not reduce local memory after resizing local memory for a
kernel. This can prevent thrashing by local memory allocations when launching many kernels with high
local memory usage at the cost of potentially increased memory usage.
Parameters:
flags - Parameters for device operation
Returns:
cudaSuccess, cudaErrorInvalidValue, cudaErrorSetOnActiveProcess
Note:
Note that this function may also return error codes from previous, asynchronous launches.
See also:
cudaGetDeviceFlags, cudaGetDeviceCount, cudaGetDevice, cudaGetDeviceProperties, cudaSetDevice,
cudaSetValidDevices, cudaChooseDevice, cuDevicePrimaryCtxSetFlags
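A minimal sketch; the flags must be set before the device is initialized (for example, before the
first allocation or kernel launch on it), otherwise the call fails with
cudaErrorSetOnActiveProcess:

    #include <cuda_runtime.h>

    int main(void)
    {
        cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync | cudaDeviceMapHost);

        /* The first runtime call that touches the device initializes it
           with the flags above. */
        cudaFree(0);
        return 0;
    }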
cudaError_t cudaSetValidDevices (int * device_arr, int len)
Sets a list of devices for CUDA execution in priority order using device_arr. The parameter len specifies
the number of elements in the list. CUDA will try devices from the list sequentially until it finds one
that works. If this function is not called, or if it is called with a len of 0, then CUDA will go back to
its default behavior of trying devices sequentially from a default list containing all of the available
CUDA devices in the system. If a specified device ID in the list does not exist, this function will
return cudaErrorInvalidDevice. If len is not 0 and device_arr is NULL or if len exceeds the number of
devices in the system, then cudaErrorInvalidValue is returned.
Parameters:
device_arr - List of devices to try
len - Number of devices in specified list
Returns:
cudaSuccess, cudaErrorInvalidValue, cudaErrorInvalidDevice
Note:
Note that this function may also return error codes from previous, asynchronous launches.
See also:
cudaGetDeviceCount, cudaSetDevice, cudaGetDeviceProperties, cudaSetDeviceFlags, cudaChooseDevice
Author
Generated automatically by Doxygen from the source code.
Version 6.0 28 Jul 2019 Device Management(3)