
Provided by: criu_3.17.1-2_amd64

NAME

       amdgpu_plugin  - A plugin extension to CRIU to support checkpoint/restore in userspace for
       AMD GPUs.

CURRENT SUPPORT

       - Single and multi GPU systems (Gfx9)
       - Checkpoint / Restore on a different system
       - Checkpoint / Restore inside a docker container
       - PyTorch
       - TensorFlow
       - Using CRIU Image Streamer

DESCRIPTION

       Although criu is a capable tool for checkpointing and restoring running applications, it
       cannot handle applications that have device files open. To support ROCm based workloads
       with criu, its core functionality has to be augmented through a plugin based extension
       mechanism. amdgpu_plugin provides the support criu needs to allow Checkpoint / Restore
       with ROCm.

   Dependencies
       amdkfd support
           Snapshotting the VRAM and other GPU device state requires an updated version of the
           amdkfd (amdgpu) driver. The kernel patches are currently under review.

       criu 3.16
           This work is rebased on the latest criu release available at this time.

OPTIONS

       Optional parameters can be passed in as environment variables before executing the criu
       command.
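
       As a minimal sketch of what "passed in as environment variables" means in practice, the
       helper below prefixes a command with two of the variables documented on this page. The
       function name run_with_relaxed_checks and the ./ckpt directory are hypothetical and
       only serve the illustration; which checks to relax depends on the target system.

```shell
# Hypothetical wrapper: run any command with two amdgpu_plugin checks disabled.
# The variable names come from this page; the wrapper itself is illustrative.
run_with_relaxed_checks() {
    # Assignments placed before a command apply to that command's environment.
    KFD_FW_VER_CHECK=0 KFD_SDMA_FW_VER_CHECK=0 "$@"
}

# Example invocation (assumes checkpoint images were saved in ./ckpt):
#   run_with_relaxed_checks criu restore --images-dir ./ckpt --shell-job
```

       The same effect can be had without a wrapper by writing the assignments directly in
       front of the criu command, or by exporting them in the calling shell.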

       KFD_FW_VER_CHECK
           Enable or disable the firmware version check. If enabled, the firmware version on the
           restored GPU needs to be greater than or equal to the firmware version on the
           checkpointed GPU. Default: Enabled

               E.g:
               KFD_FW_VER_CHECK=0

       KFD_SDMA_FW_VER_CHECK
           Enable or disable the SDMA firmware version check. If enabled, the SDMA firmware
           version on the restored GPU needs to be greater than or equal to the SDMA firmware
           version on the checkpointed GPU. Default: Enabled

               E.g:
               KFD_SDMA_FW_VER_CHECK=0

       KFD_CACHES_COUNT_CHECK
           Enable or disable the caches count check. If enabled, the caches count on the restored
           GPU needs to be greater than or equal to the caches count on the checkpointed GPU.
           Default: Enabled

               E.g:
               KFD_CACHES_COUNT_CHECK=0

       KFD_NUM_GWS_CHECK
           Enable or disable the num_gws check. If enabled, num_gws on the restored GPU needs to
           be greater than or equal to num_gws on the checkpointed GPU. Default: Enabled

               E.g:
               KFD_NUM_GWS_CHECK=0

       KFD_VRAM_SIZE_CHECK
           Enable or disable the VRAM size check. If enabled, the VRAM size on the restored GPU
           needs to be greater than or equal to the VRAM size on the checkpointed GPU.
           Default: Enabled

               E.g:
               KFD_VRAM_SIZE_CHECK=0

       KFD_NUMA_CHECK
           Enable or disable the NUMA CPU region check. If enabled, the plugin will restore GPUs
           that belong to one CPU NUMA region to the same CPU NUMA region. Default: Enabled

               E.g:
               KFD_NUMA_CHECK=1

       KFD_CAPABILITY_CHECK
           Enable or disable the capability check. If enabled, the capability on the restored GPU
           needs to be equal to the capability on the checkpointed GPU. Default: Enabled

               E.g:
               KFD_CAPABILITY_CHECK=1
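
       The "greater than or equal" rule shared by the firmware, SDMA firmware, caches count,
       num_gws and VRAM size checks can be modelled with a few lines of shell. This is an
       illustration of the documented behaviour only, not the plugin's implementation; the
       function name kfd_check_ok is hypothetical, and the values are assumed to be plain
       integers.

```shell
# Illustrative model of the documented rule; not taken from the plugin source.
# Usage: kfd_check_ok ENABLED CHECKPOINTED_VALUE RESTORED_VALUE
kfd_check_ok() {
    enabled=$1
    checkpointed=$2
    restored=$3
    # A disabled check (0) never blocks the restore.
    [ "$enabled" = "0" ] && return 0
    # An enabled check passes only if the restore-side value is not smaller.
    [ "$restored" -ge "$checkpointed" ]
}

# Example: restoring onto a GPU with less VRAM (values in MiB, chosen arbitrarily)
# fails while the check is enabled and passes once it is disabled.
#   kfd_check_ok 1 32768 16384   # returns non-zero
#   kfd_check_ok 0 32768 16384   # returns zero
```

       The capability check differs in that it requires equality, and the NUMA check concerns
       placement rather than a numeric comparison, so neither is covered by this sketch.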

AUTHOR

       The AMDKFD team.

       Copyright (C) 2020-2021, Advanced Micro Devices, Inc. (AMD)

                                            12/20/2022                            ROCM SUPPORT(1)