Provided by: nvidia-cuda-dev_9.1.85-3ubuntu1_amd64

NAME

       Drain states -

   Functions
       nvmlReturn_t DECLDIR nvmlDeviceModifyDrainState (nvmlPciInfo_t *pciInfo, nvmlEnableState_t
           newState)
       nvmlReturn_t DECLDIR nvmlDeviceQueryDrainState (nvmlPciInfo_t *pciInfo, nvmlEnableState_t
           *currentState)
       nvmlReturn_t DECLDIR nvmlDeviceRemoveGpu (nvmlPciInfo_t *pciInfo)
       nvmlReturn_t DECLDIR nvmlDeviceDiscoverGpus (nvmlPciInfo_t *pciInfo)

Detailed Description

       This chapter describes methods that NVML can perform against a device to control its
       drain state and its recognition by NVML and the NVIDIA kernel driver. These methods can
       be used with out-of-band tools to power GPUs on and off, enable robust reset scenarios,
       etc.

Function Documentation

   nvmlReturn_t DECLDIR nvmlDeviceDiscoverGpus (nvmlPciInfo_t * pciInfo)
       Request the OS and the NVIDIA kernel driver to rediscover a portion of the PCI subsystem
       looking for GPUs that were previously removed. The portion of the PCI tree can be narrowed
       by specifying a domain, bus, and device. If all are zeroes then the entire PCI tree will
       be searched. Please note that for long-running NVML processes the enumeration will change
       based on how many GPUs are discovered and where they are inserted in bus order.

       In addition, all newly discovered GPUs will be initialized and their ECC scrubbed, which
       may take several seconds per GPU. Also, existing device handles are no longer guaranteed
       to be valid after discovery.

       Must be run as administrator. For Linux only.

       For Pascal (TM) or newer fully supported devices. Some Kepler devices supported.

       Parameters:
           pciInfo The PCI tree to be searched. Only the domain, bus, and device fields are used
           in this call.

       Returns:
           • NVML_SUCCESS if the rediscovery completed successfully

           • NVML_ERROR_UNINITIALIZED if the library has not been successfully initialized

           • NVML_ERROR_INVALID_ARGUMENT if pciInfo is invalid

           • NVML_ERROR_NOT_SUPPORTED if the operating system does not support this feature

           • NVML_ERROR_OPERATING_SYSTEM if the operating system is denying this feature

           • NVML_ERROR_NO_PERMISSION if the calling process has insufficient permissions to
             perform operation

           • NVML_ERROR_UNKNOWN on any unexpected error
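
       The narrowing rule described for nvmlDeviceDiscoverGpus() can be sketched in C as follows.
       This is a minimal illustration, not NVML code: demo_pci_addr is a hypothetical stand-in
       for the domain, bus, and device fields of nvmlPciInfo_t, and demo_parse_pci_addr() and
       demo_searches_entire_tree() are helper names invented here. The sketch assumes the
       address is written in hexadecimal "domain:bus:device" form, for example 0000:3b:00.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for the three nvmlPciInfo_t fields that
 * nvmlDeviceDiscoverGpus() is documented to read. */
typedef struct {
    unsigned int domain;
    unsigned int bus;
    unsigned int device;
} demo_pci_addr;

/* Parse "domain:bus:device" (hexadecimal). Returns 0 on success, -1 on bad input. */
int demo_parse_pci_addr(const char *text, demo_pci_addr *out)
{
    unsigned int domain, bus, device;
    char trailing;

    if (text == NULL || out == NULL)
        return -1;
    /* The final %c only matches if something follows the device number,
     * so a result other than 3 means missing fields or trailing text. */
    if (sscanf(text, "%x:%x:%x%c", &domain, &bus, &device, &trailing) != 3)
        return -1;
    /* PCI bus numbers are 8 bits wide and device numbers 5 bits wide. */
    if (bus > 0xffu || device > 0x1fu)
        return -1;

    out->domain = domain;
    out->bus = bus;
    out->device = device;
    return 0;
}

/* Per the description above, an all-zero address requests a search of the
 * entire PCI tree rather than one narrowed portion. */
int demo_searches_entire_tree(const demo_pci_addr *addr)
{
    return addr->domain == 0 && addr->bus == 0 && addr->device == 0;
}
```

       In real code the parsed values would be copied into the domain, bus, and device fields of
       an nvmlPciInfo_t before calling nvmlDeviceDiscoverGpus().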

   nvmlReturn_t DECLDIR nvmlDeviceModifyDrainState (nvmlPciInfo_t * pciInfo, nvmlEnableState_t
       newState)
       Modify the drain state of a GPU. This method forces a GPU to no longer accept new incoming
       requests. Any new NVML process will no longer see this GPU. Persistence mode for this GPU
       must be turned off before this call is made. Must be called as administrator. For Linux
       only.

       For Pascal (TM) or newer fully supported devices. Some Kepler devices supported.

       Parameters:
           pciInfo The PCI address of the GPU drain state to be modified
           newState The drain state that should be entered, see nvmlEnableState_t

       Returns:
           • NVML_SUCCESS if the drain state was successfully modified

           • NVML_ERROR_UNINITIALIZED if the library has not been successfully initialized

           • NVML_ERROR_INVALID_ARGUMENT if pciInfo or newState is invalid

           • NVML_ERROR_NOT_SUPPORTED if the device doesn't support this feature

           • NVML_ERROR_NO_PERMISSION if the calling process has insufficient permissions to
             perform operation

           • NVML_ERROR_IN_USE if the device has persistence mode turned on

           • NVML_ERROR_UNKNOWN on any unexpected error

   nvmlReturn_t DECLDIR nvmlDeviceQueryDrainState (nvmlPciInfo_t * pciInfo, nvmlEnableState_t *
       currentState)
       Query the drain state of a GPU. This method is used to check whether a GPU is currently
       in a draining state. For Linux only.

       For Pascal (TM) or newer fully supported devices. Some Kepler devices supported.

       Parameters:
           pciInfo The PCI address of the GPU drain state to be queried
           currentState The current drain state for this GPU, see nvmlEnableState_t

       Returns:
           • NVML_SUCCESS if the drain state was successfully queried

           • NVML_ERROR_UNINITIALIZED if the library has not been successfully initialized

           • NVML_ERROR_INVALID_ARGUMENT if pciInfo or currentState is invalid

           • NVML_ERROR_NOT_SUPPORTED if the device doesn't support this feature

           • NVML_ERROR_UNKNOWN on any unexpected error

   nvmlReturn_t DECLDIR nvmlDeviceRemoveGpu (nvmlPciInfo_t * pciInfo)
       This method will remove the specified GPU from the view of both NVML and the NVIDIA kernel
       driver as long as no other processes are attached. If other processes are attached, this
       call will return NVML_ERROR_IN_USE and the GPU will be returned to its original 'draining'
       state. Note: the only situation where a process can still be attached after
       nvmlDeviceModifyDrainState() is called to initiate the draining state is if that process
       was using, and is still using, a GPU before the call was made. Also note that persistence
       mode counts as an attachment to the GPU, so it must be disabled prior to this call.

       For long-running NVML processes please note that this will change the enumeration of
       current GPUs. For example, if there are four GPUs present and GPU1 is removed, the new
       enumeration will be 0-2. Also, device handles after the removed GPU will not be valid and
       must be re-established. Must be run as administrator. For Linux only.
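
       The renumbering in the four-GPU example above can be sketched as a small helper. This is
       an illustration only: demo_index_after_removal() is a hypothetical name, and it assumes
       the surviving GPUs keep their relative order, which the example implies but the text does
       not state outright.

```c
/* Hypothetical helper: the index a GPU has after the GPU at removed_index
 * is removed, assuming the remaining GPUs keep their relative order.
 * Returns -1 for the removed GPU itself and for out-of-range arguments. */
int demo_index_after_removal(int old_index, int removed_index, int gpu_count)
{
    if (gpu_count <= 0 ||
        old_index < 0 || old_index >= gpu_count ||
        removed_index < 0 || removed_index >= gpu_count)
        return -1;

    if (old_index == removed_index)
        return -1;                      /* this GPU is gone */

    /* GPUs before the removed one keep their index; later ones shift down. */
    return old_index < removed_index ? old_index : old_index - 1;
}
```

       With four GPUs and GPU1 removed, the old indices 0, 2, and 3 map to 0, 1, and 2. An index
       alone is not enough to keep using a device: handles obtained before the removal still
       have to be re-established as described above.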

       For Pascal (TM) or newer fully supported devices. Some Kepler devices supported.

       Parameters:
           pciInfo The PCI address of the GPU to be removed

       Returns:
           • NVML_SUCCESS if the GPU was successfully removed

           • NVML_ERROR_UNINITIALIZED if the library has not been successfully initialized

           • NVML_ERROR_INVALID_ARGUMENT if pciInfo is invalid

           • NVML_ERROR_NOT_SUPPORTED if the device doesn't support this feature

           • NVML_ERROR_IN_USE if the device is still in use and cannot be removed
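
       The interaction between persistence mode, the drain state, and removal described for
       nvmlDeviceModifyDrainState() and nvmlDeviceRemoveGpu() can be summarized with a toy model.
       Nothing below is NVML API: demo_gpu, demo_status, demo_modify_drain(), and demo_remove()
       are hypothetical names, and the model encodes only the rules stated in this page, so
       real driver behavior may differ in cases the page does not cover.

```c
#include <stddef.h>

/* Toy result codes; stand-ins for NVML_SUCCESS, NVML_ERROR_IN_USE, and
 * NVML_ERROR_INVALID_ARGUMENT respectively. */
typedef enum { DEMO_SUCCESS, DEMO_IN_USE, DEMO_INVALID } demo_status;

/* Toy GPU state, tracking only what the documented rules depend on. */
typedef struct {
    int present;          /* still visible to NVML and the driver */
    int draining;         /* drain state is enabled */
    int persistence_mode; /* counts as an attachment to the GPU */
    int attached_procs;   /* processes that were using the GPU before draining */
} demo_gpu;

/* Models nvmlDeviceModifyDrainState(): refused while persistence mode is on. */
demo_status demo_modify_drain(demo_gpu *gpu, int enable)
{
    if (gpu == NULL || !gpu->present)
        return DEMO_INVALID;
    if (gpu->persistence_mode)
        return DEMO_IN_USE;
    gpu->draining = enable ? 1 : 0;
    return DEMO_SUCCESS;
}

/* Models nvmlDeviceRemoveGpu(): succeeds only when nothing is attached.
 * On DEMO_IN_USE the GPU is left in whatever drain state it already had. */
demo_status demo_remove(demo_gpu *gpu)
{
    if (gpu == NULL || !gpu->present)
        return DEMO_INVALID;
    if (gpu->persistence_mode || gpu->attached_procs > 0)
        return DEMO_IN_USE;
    gpu->present = 0;
    return DEMO_SUCCESS;
}
```

       Under this model the working order is: disable persistence mode, enable the drain state,
       wait for processes that were already attached to exit, then remove the GPU. A removal
       attempt made too early returns the in-use code and can simply be retried later.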

Author

       Generated automatically by Doxygen for NVML from the source code.