Kernel 버전 변경
CUDA 개발 환경은 호스트 컴파일러 및 C 런타임 라이브러리를 포함한 호스트 개발 환경과 긴밀하게 통합되도록 설계되어 있기 때문에 Linux OS 및 Kernel 버전을 확인하는 것이 중요함
NVIDIA CUDA Installation Guide for Linux - Table 1 Native Linux Distribution Support in CUDA 12.8
(https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html)
NVIDIA cuDNN Backend - CPU Architecture and OS Requirements
(https://docs.nvidia.com/deeplearning/cudnn/backend/latest/reference/support-matrix.html#linux)
업데이트 가능한 kernel 버전 확인 및 설치
# 시스템 업데이트
yum update -y
$> dnf list kernel-*
Updating Subscription Management repositories.
Unable to read consumer identity
This system is not registered with an entitlement server. You can use subscription-manager to register.
Last metadata expiration check: 0:05:29 ago on Fri 07 Feb 2025 10:11:59 AM KST.
Installed Packages
kernel-core.x86_64 4.18.0-372.80.1.el8_6 @System
kernel-core.x86_64 4.18.0-553.37.1.el8_10 @rhel-8-baseos-rhui-rpms
kernel-modules.x86_64 4.18.0-372.80.1.el8_6 @System
kernel-modules.x86_64 4.18.0-553.37.1.el8_10 @rhel-8-baseos-rhui-rpms
kernel-tools.x86_64 4.18.0-553.37.1.el8_10 @rhel-8-baseos-rhui-rpms
kernel-tools-libs.x86_64 4.18.0-553.37.1.el8_10 @rhel-8-baseos-rhui-rpms
Available Packages
kernel-abi-stablelists.noarch 4.18.0-553.37.1.el8_10 rhel-8-baseos-rhui-rpms
kernel-abi-whitelists.noarch 4.18.0-240.22.1.el8_3 rhel-8-baseos-rhui-rpms
kernel-cross-headers.x86_64 4.18.0-553.37.1.el8_10 rhel-8-baseos-rhui-rpms
kernel-debug.x86_64 4.18.0-553.37.1.el8_10 rhel-8-baseos-rhui-rpms
kernel-debug-core.x86_64 4.18.0-553.37.1.el8_10 rhel-8-baseos-rhui-rpms
kernel-debug-devel.x86_64 4.18.0-553.37.1.el8_10 rhel-8-baseos-rhui-rpms
kernel-debug-modules.x86_64 4.18.0-553.37.1.el8_10 rhel-8-baseos-rhui-rpms
kernel-debug-modules-extra.x86_64 4.18.0-553.37.1.el8_10 rhel-8-baseos-rhui-rpms
kernel-devel.x86_64 4.18.0-553.37.1.el8_10 rhel-8-baseos-rhui-rpms
kernel-doc.noarch 4.18.0-553.37.1.el8_10 rhel-8-baseos-rhui-rpms
kernel-headers.x86_64 4.18.0-553.37.1.el8_10 rhel-8-baseos-rhui-rpms
kernel-modules-extra.x86_64 4.18.0-553.37.1.el8_10 rhel-8-baseos-rhui-rpms
kernel-rpm-macros.noarch 131-1.el8 rhel-8-appstream-rhui-rpms
$> yum install -y kernel-4.18.0-553.37.1.el8_10
$> reboot
# 폐쇄망에 있는 서버와 환경을 맞추기 위해 Kernel 및 OS 버전 Downgrade가 필요할 경우
# yum list kernel-4.18.0-477*
# yum install -y kernel-4.18.0-477.10.1.el8_8
# reboot
# uname -r
# yum downgrade -y redhat-release
# cat /etc/redhat-release
GPU Driver Install
💸 Money 이슈로 인해 GPU 는 T4 를 사용했음
(AWS 의 경우g4dn.xlarge
, GCP의 경우nvidia-tesla-t4
인스턴스 타입 사용)
$> uname -r
4.18.0-553.37.1.el8_10.x86_64
# 0. gcc 설치
dnf groupinstall -y "Development Tools"
# 1. 커널 헤더, 개발 패키지 설치
dnf list kernel-*
dnf install -y kernel-devel-$(uname -r) kernel-headers-$(uname -r)
# 2. Third-party 패키지 의존성
dnf install -y https://dl.fedoraproject.org/pub/epel/epel-release-latest-8.noarch.rpm
# 3. NVIDIA Driver Repo 등록
# [Local repo 방식]
# https://www.nvidia.com/en-us/drivers/ 의 Manual Driver Search에서 검색
# 예시)
# ======================
# Data Center / Tesla
# T-Series
# Tesla T4
# Linux 64-bit RHEL 8
# 12.8
# English (US) 또는 Korean
# ======================
# 다운로드 링크 복사
cd
curl -LO https://us.download.nvidia.com/tesla/570.86.15/nvidia-driver-local-repo-rhel8-570.86.15-1.0-1.x86_64.rpm
# Local Repo 생성
rpm --install nvidia-driver-local-repo-rhel8-570.86.15-1.0-1.x86_64.rpm
$> dnf repolist
Updating Subscription Management repositories.
Unable to read consumer identity
This system is not registered with an entitlement server. You can use subscription-manager to register.
repo id repo name
ansible-2-for-rhel-8-rhui-rpms Red Hat Ansible Engine 2 for RHEL 8 (RPMs) from RHUI
epel Extra Packages for Enterprise Linux 8 - x86_64
kubernetes Kubernetes
nvidia-driver-local-rhel8-570.86.15 nvidia-driver-local-rhel8-570.86.15
rhel-8-appstream-rhui-rpms Red Hat Enterprise Linux 8 for x86_64 - AppStream from RHUI (RPMs)
rhel-8-baseos-rhui-rpms Red Hat Enterprise Linux 8 for x86_64 - BaseOS from RHUI (RPMs)
rhui-client-config-server-8 RHUI Client Configuration Server 8
cat /etc/yum.repos.d/nvidia-driver-local-rhel8-570.86.15.repo
ll /var/nvidia-driver-local-repo-rhel8-570.86.15
# 4. NVIDIA 드라이버 설치
## 설치 시 꽤나 오래걸림
dnf module install -y nvidia-driver:open-dkms
# 5. Persistence Daemon
systemctl start nvidia-persistenced
systemctl enable nvidia-persistenced
systemctl status nvidia-persistenced --no-pager
# 6. NVIDIA 드라이버 설치 확인
$> dkms status
nvidia-open/565.57.01, 4.18.0-553.34.1.el8_10.x86_64, x86_64: installed
$> lsmod | grep nvidia
nvidia_drm 110592 0
nvidia_modeset 1556480 1 nvidia_drm
nvidia_uvm 1785856 0
nvidia 9863168 7 nvidia_uvm,nvidia_modeset
video 57344 1 nvidia_modeset
drm_kms_helper 184320 2 nvidia_drm
drm 602112 4 drm_kms_helper,nvidia,nvidia_drm
$> cat /proc/driver/nvidia/version
$> nvidia-smi
Mon Jan 20 10:46:28 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.86.15 Driver Version: 570.86.15 CUDA Version: 12.8 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 Tesla T4 On | 00000000:00:1E.0 Off | 0 |
| N/A 24C P8 9W / 70W | 1MiB / 15360MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
$> modinfo nvidia
filename: /lib/modules/4.18.0-553.37.1.el8_10.x86_64/extra/nvidia.ko.xz
alias: char-major-195-*
version: 570.86.15
supported: external
license: Dual MIT/GPL
firmware: nvidia/570.86.15/gsp_tu10x.bin
firmware: nvidia/570.86.15/gsp_ga10x.bin
rhelversion: 8.10
srcversion: 4850DD09C9629D9E41B682E
alias: pci:v000010DEd*sv*sd*bc06sc80i00*
alias: pci:v000010DEd*sv*sd*bc03sc02i00*
alias: pci:v000010DEd*sv*sd*bc03sc00i00*
depends: drm
name: nvidia
vermagic: 4.18.0-553.37.1.el8_10.x86_64 SMP mod_unload modversions
...
$> yum install -y nvtop
$> nvtop
# dnf install cuda
curl -L https://developer.download.nvidia.com/compute/cuda/12.8.0/local_installers/cuda-repo-rhel8-12-8-local-12.8.0_570.86.10-1.x86_64.rpm -o /root/files/packages/cuda/cuda-repo-rhel8-12-8-local-12.8.0_570.86.10-1.x86_64.rpm
rpm -i /root/files/packages/cuda/cuda-repo-rhel8-12-8-local-12.8.0_570.86.10-1.x86_64.rpm
dnf clean all
dnf -y install cuda-toolkit-12-8
$> ls /usr/local/ | grep cuda
cuda
cuda-12
cuda-12.8
sh -c "echo 'export PATH=$PATH:/usr/local/cuda-12.8/bin' >> /etc/profile"
sh -c "echo 'export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-12.8/lib64' >> /etc/profile"
sh -c "echo 'export CUDADIR=/usr/local/cuda-12.8' >> /etc/profile"
source /etc/profile
$> nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Aug_15_22:02:13_PDT_2023
Cuda compilation tools, release 12.2, V12.2.140
Build cuda_12.2.r12.2/compiler.33191640_0
다른 방식으로 설치 (참고용)
sudo su -
# 서버와 동일한 Kernel 버전으로 변경 (4.18.0-553.8.1.el8_10 --> 4.18.0-477.10.1.el8_8)
dnf list kernel
dnf install -y kernel-4.18.0-477.10.1.el8_8.x86_64
grub2-mkconfig -o /boot/grub2/grub.cfg
systemctl reboot
# 재접속
sudo su -
$> uname -r
4.18.0-477.10.1.el8_8.x86_64
# dnf install kernel-devel-$(uname -r)
mkdir -p /root/files/kernel-devel
dnf download --resolve --destdir=/root/files/kernel-devel kernel-devel-4.18.0-477.10.1.el8_8.x86_64
cd /root/files/kernel-devel
dnf localinstall *.rpm
dnf install -y https://dl.fedoraproject.org/pub/epel/epel-release-latest-8.noarch.rpm
# dnf install -y dkms
mkdir -p /root/files/dkms
dnf download --resolve --destdir=/root/files/dkms dkms-3.0.13-1.el8.noarch
cd /root/files/dkms
dnf localinstall dkms-3.0.13-1.el8.noarch.rpm
# https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&Distribution=RHEL&target_version=8&target_type=rpm_local
# sudo dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-rhel8.repo
mkdir -p /root/files/cuda-repo
cd /root/files/cuda-repo
curl -O https://developer.download.nvidia.com/compute/cuda/12.2.2/local_installers/cuda-repo-rhel8-12-2-local-12.2.2_535.104.05-1.x86_64.rpm
dnf localinstall /root/files/cuda-repo/cuda-repo-rhel8-12-2-local-12.2.2_535.104.05-1.x86_64.rpm
# dnf install cuda-driver
dnf search cuda-driver
mkdir -p /root/files/cuda-driver
cd /root/files/cuda-driver
dnf download --resolve --destdir=/root/files/cuda-driver cuda-driver
dnf localinstall *.rpm