This post sets up a GPU usage monitoring system built on official NVIDIA tooling and open source software: NVIDIA DCGM (Data Center GPU Manager) and Grafana.
Installing DCGM
Install DCGM by following NVIDIA's DCGM Getting Started guide.
- OS: Ubuntu 24.04
- GPU: RTX 4060 Ti 8GB
1. Check the NVIDIA driver and CUDA versions.
$ nvidia-smi -q | grep -E 'Driver Version|CUDA Version'
Driver Version : 580.105.08
CUDA Version : 13.0
2. Install NVIDIA Data Center GPU Manager. Set the CUDA_VERSION environment variable to the CUDA major version, "13".
$ CUDA_VERSION=13
$ sudo apt-get install --yes \
--install-recommends \
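datacenter-gpu-manager-4-cuda${CUDA_VERSION}
3. Install the multi-node package. The guide lists it as optional, but it is installed here anyway.
$ sudo apt install --yes datacenter-gpu-manager-4-multinode-cuda${CUDA_VERSION}
4. Install the development files. These are also optional, but they are installed here.
$ sudo apt install --yes datacenter-gpu-manager-4-dev
Starting the DCGM service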
1. Register the service (enable it and start it immediately).
$ sudo systemctl --now enable nvidia-dcgm
2. Confirm that the service is running.
$ sudo systemctl status nvidia-dcgm
● nvidia-dcgm.service - NVIDIA DCGM service
Loaded: loaded (/usr/lib/systemd/system/nvidia-dcgm.service; enabled; preset: enabled)
Active: active (running) since Thu 2025-12-11 13:18:28 KST; 5s ago
Main PID: 1467089 (nv-hostengine)
Tasks: 15 (limit: 38094)
Memory: 16.6M (peak: 17.2M)
CPU: 161ms
CGroup: /system.slice/nvidia-dcgm.service
└─1467089 /usr/bin/nv-hostengine -n --service-account nvidia-dcgm
Dec 11 13:18:28 AI systemd[1]: Started nvidia-dcgm.service - NVIDIA DCGM service.
Dec 11 13:18:28 AI nv-hostengine[1467089]: DCGM initialized
Dec 11 13:18:28 AI nv-hostengine[1467089]: Started host engine version 4.4.2 using port num
3. Test the dcgmi command.
$ dcgmi discovery -l
1 GPU found.
+--------+----------------------------------------------------------------------+
| GPU ID | Device Information |
+--------+----------------------------------------------------------------------+
| 0 | Name: NVIDIA GeForce RTX 4060 Ti |
| | PCI Bus ID: 00000000:01:00.0 |
| | Device UUID: GPU-e57cb625-81b0-b3ec-9d6f-281b6ffdd924 |
+--------+----------------------------------------------------------------------+
0 NvSwitches found.
+-----------+
| Switch ID |
+-----------+
+-----------+
0 ConnectX found.
+----------+
| ConnectX |
+----------+
+----------+
0 CPUs found.
+--------+----------------------------------------------------------------------+
| CPU ID | Device Information |
+--------+----------------------------------------------------------------------+
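The checks above (CUDA version, service status, GPU discovery) can be combined into a small sanity-check script. The following is a minimal sketch, not part of DCGM: the function names `cuda_major` and `gpu_count` are hypothetical helpers introduced here, and the parsing assumes the output formats shown in this post.

```shell
#!/usr/bin/env bash
# Post-install sanity check (sketch). The helper names below are
# illustrative assumptions, not DCGM commands.

# Read `nvidia-smi -q` text on stdin and print the CUDA major version,
# e.g. a line "CUDA Version : 13.0" yields "13".
cuda_major() {
  sed -n 's/^[[:space:]]*CUDA Version[[:space:]]*:[[:space:]]*\([0-9][0-9]*\)\..*/\1/p' | head -n 1
}

# Read `dcgmi discovery -l` text on stdin and print the GPU count,
# e.g. a line "1 GPU found." yields "1".
gpu_count() {
  sed -n 's/^\([0-9][0-9]*\) GPUs\{0,1\} found.*/\1/p' | head -n 1
}

# Only touch the real system when both tools are installed.
if command -v nvidia-smi >/dev/null 2>&1 && command -v dcgmi >/dev/null 2>&1; then
  echo "CUDA major version: $(nvidia-smi -q | cuda_major)"
  if systemctl is-active --quiet nvidia-dcgm; then
    echo "nvidia-dcgm is active; GPUs seen by DCGM: $(dcgmi discovery -l | gpu_count)"
  else
    echo "nvidia-dcgm is not active" >&2
  fi
fi
```

The value printed by `cuda_major` is what step 2 of the installation assigns to CUDA_VERSION by hand, so the same helper could feed the package name instead of a hard-coded "13".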
dcgm-exporter
1. Check that Docker can detect the GPU.
$ docker run --rm --gpus all nvidia/cuda:13.0.2-base-ubuntu24.04 nvidia-smi
If the nvidia-smi output prints normally, as shown below, the GPU is detected.
Status: Downloaded newer image for nvidia/cuda:13.0.2-base-ubuntu24.04
Thu Dec 11 04:31:51 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.105.08 Driver Version: 580.105.08 CUDA Version: 13.0 |
+-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 4060 Ti On | 00000000:01:00.0 Off | N/A |
| 0% 32C P8 5W / 160W | 77MiB / 8188MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
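Instead of comparing the two nvidia-smi banners by eye, the driver version reported on the host and inside the container can be compared in a script. This is a rough sketch under the assumption that the same image as above is used; `compare_drivers` is a hypothetical helper, while `--query-gpu=driver_version` and `--format=csv,noheader` are standard nvidia-smi options.

```shell
#!/usr/bin/env bash
# Sketch: confirm the container sees the same driver as the host.
# compare_drivers is an illustrative helper, not an NVIDIA tool.

# Print "match" when both versions are non-empty and equal, else "mismatch".
compare_drivers() {
  if [ -n "$1" ] && [ "$1" = "$2" ]; then
    echo "match"
  else
    echo "mismatch"
  fi
}

# Only run the real check when both nvidia-smi and docker are installed.
if command -v nvidia-smi >/dev/null 2>&1 && command -v docker >/dev/null 2>&1; then
  host_driver="$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)"
  container_driver="$(docker run --rm --gpus all nvidia/cuda:13.0.2-base-ubuntu24.04 \
    nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)"
  echo "host=${host_driver} container=${container_driver}: $(compare_drivers "$host_driver" "$container_driver")"
fi
```

An empty container value is reported as a mismatch, which covers the case where Docker starts the container but cannot expose the GPU.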

















