This post sets up a GPU usage monitoring system built from NVIDIA's official tooling and open-source software: NVIDIA DCGM (Data Center GPU Manager) and Grafana.

Installing DCGM

Install DCGM by following the NVIDIA DCGM Getting Started guide. The environment used here:

  • OS: Ubuntu 24.04
  • GPU: RTX 4060 Ti 8GB

1. Check the NVIDIA driver and CUDA versions

$ nvidia-smi -q | grep -E 'Driver Version|CUDA Version'
Driver Version                            : 580.105.08
CUDA Version                              : 13.0

2. Install NVIDIA Data Center GPU Manager. Set the CUDA_VERSION variable to the CUDA major version, which is "13" here.

$ CUDA_VERSION=13
$ sudo apt-get install --yes \
                       --install-recommends \
                       datacenter-gpu-manager-4-cuda${CUDA_VERSION}
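
Instead of typing the major version by hand, it can be derived from the nvidia-smi output shown in step 1. The following is a minimal sketch: cuda_major_version is a hypothetical helper name, and it assumes the "CUDA Version : X.Y" line format seen above.

```shell
# Hypothetical helper: reads `nvidia-smi -q` output on stdin and prints
# the CUDA major version (for example "13" for "CUDA Version : 13.0").
# Assumes the "CUDA Version : X.Y" line format shown in step 1.
cuda_major_version() {
    sed -n 's/^[[:space:]]*CUDA Version[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p' | head -n 1
}

# Usage (requires a working NVIDIA driver):
#   CUDA_VERSION=$(nvidia-smi -q | cuda_major_version)
#   echo "datacenter-gpu-manager-4-cuda${CUDA_VERSION}"
```

If the driver is not installed or the line format differs, the helper prints nothing, so it is worth checking that CUDA_VERSION is non-empty before running apt-get.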

3. Install the multi-node package. The guide lists it as optional, but it is installed here anyway.

$ sudo apt install --yes datacenter-gpu-manager-4-multinode-cuda${CUDA_VERSION}

4. Install the development package. This is also optional.

$ sudo apt install --yes datacenter-gpu-manager-4-dev

Starting the DCGM service

1. Enable and start the service

$ sudo systemctl --now enable nvidia-dcgm

2. Verify that the service is running

$ sudo systemctl status nvidia-dcgm
nvidia-dcgm.service - NVIDIA DCGM service
     Loaded: loaded (/usr/lib/systemd/system/nvidia-dcgm.service; enabled; preset: enabled)
     Active: active (running) since Thu 2025-12-11 13:18:28 KST; 5s ago
   Main PID: 1467089 (nv-hostengine)
      Tasks: 15 (limit: 38094)
     Memory: 16.6M (peak: 17.2M)
        CPU: 161ms
     CGroup: /system.slice/nvidia-dcgm.service
             └─1467089 /usr/bin/nv-hostengine -n --service-account nvidia-dcgm

Dec 11 13:18:28 AI systemd[1]: Started nvidia-dcgm.service - NVIDIA DCGM service.
Dec 11 13:18:28 AI nv-hostengine[1467089]: DCGM initialized
Dec 11 13:18:28 AI nv-hostengine[1467089]: Started host engine version 4.4.2 using port num
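
For scripted setups, the same check can be done without reading the status output by eye. The sketch below polls `systemctl is-active` until the unit reports active. wait_for_unit is a hypothetical helper name, and the 10-second default timeout is an arbitrary assumption.

```shell
# Hypothetical helper: poll until a systemd unit is active,
# or give up after the given number of seconds (default 10).
wait_for_unit() {
    unit=$1
    timeout=${2:-10}
    elapsed=0
    while [ "$elapsed" -lt "$timeout" ]; do
        # `is-active --quiet` exits 0 only when the unit is active
        if systemctl is-active --quiet "$unit"; then
            return 0
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 1
}

# Usage:
#   wait_for_unit nvidia-dcgm 10 && dcgmi discovery -l
```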

3. Test the dcgmi command

$ dcgmi discovery -l
1 GPU found.
+--------+----------------------------------------------------------------------+
| GPU ID | Device Information                                                   |
+--------+----------------------------------------------------------------------+
| 0      | Name: NVIDIA GeForce RTX 4060 Ti                                     |
|        | PCI Bus ID: 00000000:01:00.0                                         |
|        | Device UUID: GPU-e57cb625-81b0-b3ec-9d6f-281b6ffdd924                |
+--------+----------------------------------------------------------------------+
0 NvSwitches found.
+-----------+
| Switch ID |
+-----------+
+-----------+
0 ConnectX found.
+----------+
| ConnectX |
+----------+
+----------+
0 CPUs found.
+--------+----------------------------------------------------------------------+
| CPU ID | Device Information                                                   |
+--------+----------------------------------------------------------------------+
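
To use this result in a script, the GPU count can be pulled from the first summary line. This is only a sketch: dcgm_gpu_count is a hypothetical helper name, and it assumes the summary line keeps the "N GPU found." form shown above (the plural "GPUs" is also accepted as a guess for multi-GPU hosts).

```shell
# Hypothetical helper: reads `dcgmi discovery -l` output on stdin and
# prints the number of GPUs reported, or 0 if no summary line is found.
dcgm_gpu_count() {
    awk '/^[0-9]+ GPUs? found/ { print $1; found = 1; exit }
         END { if (!found) print 0 }'
}

# Usage (requires the nvidia-dcgm service to be running):
#   if [ "$(dcgmi discovery -l | dcgm_gpu_count)" -eq 0 ]; then
#       echo "DCGM sees no GPUs" >&2
#   fi
```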

dcgm-exporter

1. Check that Docker can detect the GPU

$ docker run --rm --gpus all nvidia/cuda:13.0.2-base-ubuntu24.04 nvidia-smi

If the nvidia-smi output is printed normally, as shown below, Docker can see the GPU.

Status: Downloaded newer image for nvidia/cuda:13.0.2-base-ubuntu24.04
Thu Dec 11 04:31:51 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.105.08             Driver Version: 580.105.08     CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4060 Ti     On  |   00000000:01:00.0 Off |                  N/A |
|  0%   32C    P8              5W /  160W |      77MiB /   8188MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
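
The same check can be made script-friendly by asking nvidia-smi for the GPU name only, using its --query-gpu and --format options. A minimal sketch follows; check_container_gpu is a hypothetical helper name, and the default image is simply the one used above.

```shell
# Hypothetical helper: run nvidia-smi inside a CUDA container and report
# whether at least one GPU name comes back.
check_container_gpu() {
    image=${1:-nvidia/cuda:13.0.2-base-ubuntu24.04}
    if names=$(docker run --rm --gpus all "$image" \
            nvidia-smi --query-gpu=name --format=csv,noheader) \
            && [ -n "$names" ]; then
        echo "GPU visible in container: $names"
    else
        echo "GPU not visible in container" >&2
        return 1
    fi
}

# Usage:
#   check_container_gpu
#   check_container_gpu nvidia/cuda:13.0.2-base-ubuntu24.04
```

A non-zero return here usually points at the container runtime setup rather than at DCGM, since the host-side checks above already passed.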
