解決 NVIDIA GPU 閒置時耗電問題

發現顯示卡在沒有使用的狀態時,照理來說不會耗電,卻吃了約 120 瓦的電。

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  N/A |
|  0%   58C    P0   120W / 350W |      0MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  Off  | 00000000:02:00.0 Off |                  N/A |
|  0%   58C    P0   126W / 350W |      0MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

而當我執行程式將模型載入到 GPU 上面時,使用的瓦數會降到不到 20 瓦,顯然上面 GPU 閒置時吃的瓦數不正常

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  N/A |
|  0%   55C    P8    11W / 350W |   2508MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  Off  | 00000000:02:00.0 Off |                  N/A |
|  0%   51C    P8    16W / 350W |   2468MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A    405385      C                                    2505MiB |
|    1   N/A  N/A    405385      C                                    2465MiB |
+-----------------------------------------------------------------------------+

後來發現如果將 Persistence mode 調整為 on 之後會變正常,不會吃那麼多

sudo nvidia-smi -pm 1
Enabled persistence mode for GPU 00000000:01:00.0.
Enabled persistence mode for GPU 00000000:02:00.0.
All done.
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:01:00.0 Off |                  N/A |
|  0%   47C    P8    23W / 350W |      1MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  On   | 00000000:02:00.0 Off |                  N/A |
|  0%   36C    P8    15W / 350W |      1MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

這時發現Perf這個欄位的值,從原本的P0變成了P8,查一下這兩個的差別

以下節錄自 nvidia 文件:

The GPU performance state APIs are used to get and set various performance levels on a per-GPU basis. P-States are GPU active/executing performance capability and power consumption states.

P-States range from P0 to P15, with P0 being the highest performance/power state, and P15 being the lowest performance/power state. Each P-State maps to a performance level. Not all P-States are available on a given system. The definition of each P-States are currently as follows:

  • P0/P1 - Maximum 3D performance
  • P2/P3 - Balanced 3D performance-power
  • P8 - Basic HD video playback
  • P10 - DVD playback
  • P12 - Minimum idle power consumption

可見一開始的P0代表 GPU 處於最大效能的狀態,難怪會吃了 120 多瓦,而後來打開 Persistence mode 後變成的P8則是基礎的狀態,只會吃 20 多瓦。

所以可以理解成 GPU 在閒置時狀態會變成P0,消耗大量電力,要打開 Persistence mode 不讓 GPU 閒置,這樣就不會變成P0。至於為什麼閒置時 GPU 會切換成最大效能的原因還有待研究。