티스토리 뷰

MLOps

K8S Node GPU Driver Install

문타리 2025. 3. 14.

Kernel 버전 변경

CUDA 개발 환경은 호스트 컴파일러 및 C 런타임 라이브러리를 포함한 호스트 개발 환경과 긴밀하게 통합되도록 설계되어 있기 때문에 Linux OS 및 Kernel 버전을 확인하는 것이 중요함
NVIDIA CUDA Installation Guide for Linux - Table 1 Native Linux Distribution Support in CUDA 12.8 (https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html)
NVIDIA cuDNN Backend - CPU Architecture and OS Requirements (https://docs.nvidia.com/deeplearning/cudnn/backend/latest/reference/support-matrix.html#linux)

업데이트 가능한 kernel 버전 확인 및 설치

# 시스템 업데이트
yum update -y

$> dnf list kernel-*
Updating Subscription Management repositories.
Unable to read consumer identity

This system is not registered with an entitlement server. You can use subscription-manager to register.

Last metadata expiration check: 0:05:29 ago on Fri 07 Feb 2025 10:11:59 AM KST.
Installed Packages
kernel-core.x86_64                        4.18.0-372.80.1.el8_6              @System
kernel-core.x86_64                        4.18.0-553.37.1.el8_10             @rhel-8-baseos-rhui-rpms
kernel-modules.x86_64                     4.18.0-372.80.1.el8_6              @System
kernel-modules.x86_64                     4.18.0-553.37.1.el8_10             @rhel-8-baseos-rhui-rpms
kernel-tools.x86_64                       4.18.0-553.37.1.el8_10             @rhel-8-baseos-rhui-rpms
kernel-tools-libs.x86_64                  4.18.0-553.37.1.el8_10             @rhel-8-baseos-rhui-rpms
Available Packages
kernel-abi-stablelists.noarch             4.18.0-553.37.1.el8_10             rhel-8-baseos-rhui-rpms
kernel-abi-whitelists.noarch              4.18.0-240.22.1.el8_3              rhel-8-baseos-rhui-rpms
kernel-cross-headers.x86_64               4.18.0-553.37.1.el8_10             rhel-8-baseos-rhui-rpms
kernel-debug.x86_64                       4.18.0-553.37.1.el8_10             rhel-8-baseos-rhui-rpms
kernel-debug-core.x86_64                  4.18.0-553.37.1.el8_10             rhel-8-baseos-rhui-rpms
kernel-debug-devel.x86_64                 4.18.0-553.37.1.el8_10             rhel-8-baseos-rhui-rpms
kernel-debug-modules.x86_64               4.18.0-553.37.1.el8_10             rhel-8-baseos-rhui-rpms
kernel-debug-modules-extra.x86_64         4.18.0-553.37.1.el8_10             rhel-8-baseos-rhui-rpms
kernel-devel.x86_64                       4.18.0-553.37.1.el8_10             rhel-8-baseos-rhui-rpms
kernel-doc.noarch                         4.18.0-553.37.1.el8_10             rhel-8-baseos-rhui-rpms
kernel-headers.x86_64                     4.18.0-553.37.1.el8_10             rhel-8-baseos-rhui-rpms
kernel-modules-extra.x86_64               4.18.0-553.37.1.el8_10             rhel-8-baseos-rhui-rpms
kernel-rpm-macros.noarch                  131-1.el8                          rhel-8-appstream-rhui-rpms

$> yum install -y kernel-4.18.0-553.37.1.el8_10

$> reboot


# 폐쇄망에 있는 서버와 환경을 맞추기 위해 Kernel 및 OS 버전 Downgrade가 필요할 경우
# yum list kernel-4.18.0-477*
# yum install -y kernel-4.18.0-477.10.1.el8_8
# reboot
# uname -r
# yum downgrade -y redhat-release
# cat /etc/redhat-release

GPU Driver Install

💸 Money 이슈로 인해 GPU 는 T4 를 사용했음
(AWS 의 경우 g4dn.xlarge , GCP의 경우 nvidia-tesla-t4 인스턴스 타입 사용)

$> uname -r
4.18.0-553.37.1.el8_10.x86_64

# 0. gcc 설치
dnf groupinstall -y "Development Tools" 

# 1. 커널 헤더, 개발 패키지 설치
dnf list kernel-*
dnf install -y kernel-devel-$(uname -r) kernel-headers-$(uname -r)

# 2. Third-party 패키지 의존성
dnf install -y https://dl.fedoraproject.org/pub/epel/epel-release-latest-8.noarch.rpm

# 3. NVIDIA Driver Repo 등록
# [Local repo 방식]
# https://www.nvidia.com/en-us/drivers/ 의 Manual Driver Search에서 검색
# 예시)
# ======================
# Data Center / Tesla
# T-Series
# Tesla T4
# Linux 64-bit RHEL 8
# 12.8
# English (US) 또는 Korean
# ======================
# 다운로드 링크 복사

cd
curl -LO https://us.download.nvidia.com/tesla/570.86.15/nvidia-driver-local-repo-rhel8-570.86.15-1.0-1.x86_64.rpm

# Local Repo 생성
rpm --install nvidia-driver-local-repo-rhel8-570.86.15-1.0-1.x86_64.rpm

$> dnf repolist
Updating Subscription Management repositories.
Unable to read consumer identity

This system is not registered with an entitlement server. You can use subscription-manager to register.

repo id                                                                                                        repo name
ansible-2-for-rhel-8-rhui-rpms                                                                                 Red Hat Ansible Engine 2 for RHEL 8 (RPMs) from RHUI
epel                                                                                                           Extra Packages for Enterprise Linux 8 - x86_64
kubernetes                                                                                                     Kubernetes
nvidia-driver-local-rhel8-570.86.15                                                                            nvidia-driver-local-rhel8-570.86.15
rhel-8-appstream-rhui-rpms                                                                                     Red Hat Enterprise Linux 8 for x86_64 - AppStream from RHUI (RPMs)
rhel-8-baseos-rhui-rpms                                                                                        Red Hat Enterprise Linux 8 for x86_64 - BaseOS from RHUI (RPMs)
rhui-client-config-server-8                                                                                    RHUI Client Configuration Server 8


cat /etc/yum.repos.d/nvidia-driver-local-rhel8-570.86.15.repo
ll /var/nvidia-driver-local-repo-rhel8-570.86.15

# 4. NVIDIA 드라이버 설치
## 설치 시 꽤나 오래걸림
dnf module install -y nvidia-driver:open-dkms

# 5. Persistence Daemon
systemctl start nvidia-persistenced
systemctl enable nvidia-persistenced
systemctl status nvidia-persistenced --no-pager

# 6. NVIDIA 드라이버 설치 확인
$> dkms status
nvidia-open/565.57.01, 4.18.0-553.34.1.el8_10.x86_64, x86_64: installed
 
$> lsmod | grep nvidia
nvidia_drm            110592  0
nvidia_modeset       1556480  1 nvidia_drm
nvidia_uvm           1785856  0
nvidia               9863168  7 nvidia_uvm,nvidia_modeset
video                  57344  1 nvidia_modeset
drm_kms_helper        184320  2 nvidia_drm
drm                   602112  4 drm_kms_helper,nvidia,nvidia_drm

$> cat /proc/driver/nvidia/version

$> nvidia-smi
Mon Jan 20 10:46:28 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.86.15              Driver Version: 570.86.15      CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla T4                       On  |   00000000:00:1E.0 Off |                    0 |
| N/A   24C    P8              9W /   70W |       1MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
 
$> modinfo nvidia
filename:       /lib/modules/4.18.0-553.37.1.el8_10.x86_64/extra/nvidia.ko.xz
alias:          char-major-195-*
version:        570.86.15
supported:      external
license:        Dual MIT/GPL
firmware:       nvidia/570.86.15/gsp_tu10x.bin
firmware:       nvidia/570.86.15/gsp_ga10x.bin
rhelversion:    8.10
srcversion:     4850DD09C9629D9E41B682E
alias:          pci:v000010DEd*sv*sd*bc06sc80i00*
alias:          pci:v000010DEd*sv*sd*bc03sc02i00*
alias:          pci:v000010DEd*sv*sd*bc03sc00i00*
depends:        drm
name:           nvidia
vermagic:       4.18.0-553.37.1.el8_10.x86_64 SMP mod_unload modversions
...

$> yum install -y nvtop
$> nvtop

# dnf install cuda

curl -L https://developer.download.nvidia.com/compute/cuda/12.8.0/local_installers/cuda-repo-rhel8-12-8-local-12.8.0_570.86.10-1.x86_64.rpm -o /root/files/packages/cuda/cuda-repo-rhel8-12-8-local-12.8.0_570.86.10-1.x86_64.rpm
rpm -i /root/files/packages/cuda/cuda-repo-rhel8-12-8-local-12.8.0_570.86.10-1.x86_64.rpm
dnf clean all
dnf -y install cuda-toolkit-12-8

 
$> ls /usr/local/ | grep cuda
cuda
cuda-12
cuda-12.8
 
 
sh -c "echo 'export PATH=$PATH:/usr/local/cuda-12.8/bin' >> /etc/profile"
sh -c "echo 'export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-12.8/lib64' >> /etc/profile"
sh -c "echo 'export CUDADIR=/usr/local/cuda-12.8' >> /etc/profile"
source /etc/profile
 
 
$> nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Aug_15_22:02:13_PDT_2023
Cuda compilation tools, release 12.2, V12.2.140
Build cuda_12.2.r12.2/compiler.33191640_0

다른 방식으로 설치 (참고용)

sudo su -
 
# 서버와 동일한 Kernel 버전으로 변경 (4.18.0-553.8.1.el8_10 --> 4.18.0-477.10.1.el8_8)
dnf list kernel
dnf install -y kernel-4.18.0-477.10.1.el8_8.x86_64
grub2-mkconfig -o /boot/grub2/grub.cfg
systemctl reboot 
 
# 재접속
sudo su -
$> uname -r
4.18.0-477.10.1.el8_8.x86_64
 
# dnf install kernel-devel-$(uname -r)
mkdir -p /root/files/kernel-devel
dnf download --resolve --destdir=/root/files/kernel-devel kernel-devel-4.18.0-477.10.1.el8_8.x86_64
cd /root/files/kernel-devel
dnf localinstall *.rpm 
 
dnf install -y https://dl.fedoraproject.org/pub/epel/epel-release-latest-8.noarch.rpm
# dnf install -y dkms
mkdir -p /root/files/dkms
dnf download --resolve --destdir=/root/files/dkms dkms-3.0.13-1.el8.noarch
cd /root/files/dkms
dnf localinstall dkms-3.0.13-1.el8.noarch.rpm

# https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&Distribution=RHEL&target_version=8&target_type=rpm_local
# sudo dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-rhel8.repo
mkdir -p /root/files/cuda-repo
cd /root/files/cuda-repo
curl -O https://developer.download.nvidia.com/compute/cuda/12.2.2/local_installers/cuda-repo-rhel8-12-2-local-12.2.2_535.104.05-1.x86_64.rpm
dnf localinstall /root/files/cuda-repo/cuda-repo-rhel8-12-2-local-12.2.2_535.104.05-1.x86_64.rpm
 
# dnf install cuda-driver
dnf search cuda-driver
mkdir -p /root/files/cuda-driver
cd /root/files/cuda-driver
dnf download --resolve --destdir=/root/files/cuda-driver cuda-driver
dnf localinstall *.rpm

'MLOps' 카테고리의 다른 글

Kubeflow Install in Kind Cluster  (1) 2025.03.14
Nvidia Device Plugin Install  (0) 2025.03.14
K8S Node Pre Install (Offline)  (0) 2025.03.14
K8S Install  (0) 2025.03.14
K8S Node Pre Install (Online)  (0) 2025.03.11
댓글