Linux [8] - Software Management - 9 - Software Installation 3 - NVIDIA (GPU driver, CUDA)

CUDA (Compute Unified Device Architecture) is a computing platform from the GPU vendor NVIDIA. CUDA™ is a general-purpose parallel computing architecture that enables GPUs to solve complex computational problems.

1. Installation

1.1 Installing NVIDIA on CentOS 8 (just use this method; the later ones are too fiddly)

Pick the NVIDIA driver version here: https://www.nvidia.com/download/index.aspx

wget -c https://us.download.nvidia.cn/tesla/460.106.00/nvidia-driver-local-repo-rhel8-460.106.00-1.0-1.x86_64.rpm

rpm -i nvidia-driver-local-repo-rhel8-460.106.00-1.0-1.x86_64.rpm
yum clean all
yum install cuda-drivers
reboot

1.2 Installing NVIDIA on CentOS 7.4 (note: this is the installation on node G03)

1.2.1 Install gcc

yum -y install gcc-c++

Key point: if an earlier NVIDIA driver is installed, uninstall it first. Also, install CUDA before the driver, or follow my steps and simply reinstall the driver once more at the end.

1.2.2 Detect the graphics card and its required driver

Add the ELRepo repository:

$ sudo rpm --import https://www.elrepo.org/RPM-GPG-KEY-elrepo.org
$ sudo rpm -Uvh http://www.elrepo.org/elrepo-release-7.0-2.el7.elrepo.noarch.rpm

Install the NVIDIA driver detection tool:

$ sudo yum install nvidia-detect
$ nvidia-detect -v

Probing for supported NVIDIA devices...
[10de:102d] NVIDIA Corporation GK210GL [Tesla K80]
This device requires the current 390.25 NVIDIA driver kmod-nvidia
[10de:102d] NVIDIA Corporation GK210GL [Tesla K80]
This device requires the current 390.25 NVIDIA driver kmod-nvidia
[10de:102d] NVIDIA Corporation GK210GL [Tesla K80]
This device requires the current 390.25 NVIDIA driver kmod-nvidia
[10de:102d] NVIDIA Corporation GK210GL [Tesla K80]
This device requires the current 390.25 NVIDIA driver kmod-nvidia
[10de:102d] NVIDIA Corporation GK210GL [Tesla K80]
This device requires the current 390.25 NVIDIA driver kmod-nvidia
[1a03:2000] ASPEED Technology, Inc. ASPEED Graphics Family
[10de:102d] NVIDIA Corporation GK210GL [Tesla K80]
This device requires the current 390.25 NVIDIA driver kmod-nvidia
[10de:102d] NVIDIA Corporation GK210GL [Tesla K80]
This device requires the current 390.25 NVIDIA driver kmod-nvidia
[10de:102d] NVIDIA Corporation GK210GL [Tesla K80]
This device requires the current 390.25 NVIDIA driver kmod-nvidia

Every device reports the same required driver: 390.25.
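The required version can also be scraped out of the nvidia-detect output programmatically. A small text-processing sketch (the helper name detect_version is my own):

```shell
# Extract the driver version(s) recommended by `nvidia-detect -v`.
# Reads the probe output on stdin and prints each distinct version once.
detect_version() {
    grep -o '[0-9][0-9]*\.[0-9][0-9]* NVIDIA driver' | awk '{ print $1 }' | sort -u
}

# Demo on one line of the probe output above:
printf 'This device requires the current 390.25 NVIDIA driver kmod-nvidia\n' | detect_version
# -> 390.25
```

On a live system you would pipe `nvidia-detect -v | detect_version`.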

cd /data/src
wget -c http://us.download.nvidia.com/XFree86/Linux-x86_64/390.25/NVIDIA-Linux-x86_64-390.25.run

Driver conflict with nouveau

The NVIDIA driver conflicts with the kernel's built-in nouveau driver. Check whether nouveau is loaded:

lsmod | grep nouveau

nouveau              1622010  0 
video                  24520  1 nouveau
mxm_wmi                13021  1 nouveau
drm_kms_helper        159169  2 ast,nouveau
ttm                    99345  2 ast,nouveau
drm                   370825  6 ast,ttm,drm_kms_helper,nouveau
i2c_algo_bit           13413  3 ast,igb,nouveau
i2c_core               40756  8 ast,drm,igb,i2c_i801,ipmi_ssif,drm_kms_helper,i2c_algo_bit,nouveau
wmi                    19070  2 mxm_wmi,nouveau

Edit /etc/modprobe.d/blacklist.conf to keep the nouveau module from loading; create the file if it does not exist. Root privileges are required here, since a normal user cannot create .conf files under /etc.

$ su root
# echo -e "blacklist nouveau\noptions nouveau modeset=0" > /etc/modprobe.d/blacklist.conf

Rebuild the initramfs image:

# mv /boot/initramfs-$(uname -r).img /boot/initramfs-$(uname -r).img.bak
# dracut /boot/initramfs-$(uname -r).img $(uname -r)
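After rebuilding the initramfs and rebooting, it is worth confirming that nouveau is really gone before running the installer. A sketch (the helper name nouveau_loaded is my own; it reads /proc/modules-style input on stdin):

```shell
# Exit 0 when the module list on stdin contains a loaded nouveau module.
nouveau_loaded() {
    awk '$1 == "nouveau" { found = 1 } END { exit !found }'
}

if [ -r /proc/modules ] && nouveau_loaded < /proc/modules; then
    echo "nouveau still loaded -- recheck blacklist.conf and the rebuilt initramfs" >&2
else
    echo "nouveau is not loaded; safe to run the NVIDIA installer"
fi
```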

1.3 Install the NVIDIA driver (could this step be folded into the CUDA installation? Untested; next time try installing CUDA first.)

Go to the directory containing the NVIDIA installer and run it (better to defer this until after CUDA is installed):

$ chmod +x NVIDIA-Linux-x86_64-390.25.run
$ sh NVIDIA-Linux-x86_64-390.25.run
If the installation completes, check the GPU status with nvidia-smi.

Possible error:

You appear to be running an X server; please exit X before installing. For further details, please see the section INSTALLING THE NVIDIA DRIVER in the README available on the Linux driver download page at www.nvidia.com.

Fix:

Shut down the graphical interface:

init 3  

Then rerun the sh installation step above. Afterwards, verify with:

nvidia-smi

1.4 Install CUDA

Method 1 (personally tested, works):

Download the CUDA repo RPM from the official site, https://developer.nvidia.com/cuda-downloads , and be sure it matches your OS version.

wget https://developer.download.nvidia.com/compute/cuda/11.5.1/local_installers/cuda-repo-rhel7-11-5-local-11.5.1_495.29.05-1.x86_64.rpm
sudo rpm -i cuda-repo-rhel7-11-5-local-11.5.1_495.29.05-1.x86_64.rpm
sudo yum clean all
sudo yum -y install nvidia-driver-latest-dkms cuda
sudo yum -y install cuda-drivers

Then add CUDA to your environment (e.g. in ~/.bash_profile):

PATH=$PATH:/usr/local/cuda-11.5/bin/
LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-11.5/lib64/
CUDA_HOME=/usr/local/cuda-11.5
export PATH
export LD_LIBRARY_PATH
export CUDA_HOME

Installing other versions:

wget -c https://developer.nvidia.com/compute/cuda/9.1/Prod/local_installers/cuda-repo-rhel7-9-1-local-9.1.85-1.x86_64
sudo rpm -i cuda-repo-rhel7-9-1-local-9.1.85-1.x86_64
sudo yum clean all
sudo yum install cuda

Method 2: install via the runfile (also workable, but try Method 1 first):

wget -c https://developer.nvidia.com/compute/cuda/10.1/Prod/local_installers/cuda_10.1.168_418.67_linux.run

chmod 775 cuda_10.1.168_418.67_linux.run
./cuda_10.1.168_418.67_linux.run

Note: if the driver is already installed, deselect the driver component in the installer menu (press Enter on it) so it is skipped.

Test CUDA (adjust the version directory to your installation):

cd /usr/local/cuda-9.1/samples/1_Utilities/deviceQuery
sudo make
./deviceQuery

If deviceQuery reports PASS, the installation succeeded.

Configure the environment variables

Add CUDA to ~/.bash_profile:

vim ~/.bash_profile

PATH=$PATH:$HOME/bin:/usr/local/cuda/bin
LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib64/
CUDA_HOME=/usr/local/cuda
export PATH
export LD_LIBRARY_PATH
export CUDA_HOME

Check the CUDA version:

nvcc -V

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2017 NVIDIA Corporation
Built on Fri_Nov__3_21:07:56_CDT_2017
Cuda compilation tools, release 9.1, V9.1.85

2. Inspecting GPUs with nvidia-smi

nvidia-smi (the NVIDIA System Management Interface) is NVIDIA's command-line GPU management tool. It is built on NVML (the NVIDIA Management Library) and is intended for managing and monitoring NVIDIA GPU devices.

The tool lets administrators query GPU device state and, given suitable privileges, modify it. nvidia-smi manages Tesla and Fermi devices and offers limited support for other GPU models. For a detailed reference see: http://developer.download.nvidia.com/compute/cuda/6_0/rel/gdk/nvidia-smi.331.38.pdf

Watch GPU usage dynamically:

watch -n 10 nvidia-smi    # or: nvidia-smi -l 10

Either command refreshes the GPU information every 10 seconds.
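For scripting, the human-readable table is awkward to parse; nvidia-smi also exposes a CSV query interface. A sketch that lists GPUs currently at 0% utilization (the helper name idle_gpus is my own; index and utilization.gpu are standard --query-gpu fields):

```shell
# Print the indices of GPUs at 0% utilization, given CSV input with
# the fields: index, utilization.gpu
idle_gpus() {
    awk -F', ' '$2 == 0 { print $1 }'
}

# Live usage (needs the driver installed):
#   nvidia-smi --query-gpu=index,utilization.gpu --format=csv,noheader,nounits | idle_gpus
# Demo on sample output:
printf '0, 0\n1, 87\n2, 0\n' | idle_gpus
# -> prints 0 and 2
```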

  • Column 1: GPU: device index (0, 1, ...). Fan: fan speed, 0-100%.
  • Column 2: Name: model, e.g. Tesla K20c or Quadro K4000. Temp: temperature in degrees Celsius.
  • Column 3: Perf: performance state, P0-P12; P0 is maximum performance, P12 minimum.
  • Column 4: Persistence-M: persistence-mode state. Pwr: power draw.
  • Column 5: Bus-Id: GPU bus address, in domain:bus:device.function form.
  • Column 6: Disp.A: Display Active, whether a display is initialized on the GPU. Memory-Usage: GPU memory usage.
  • Column 7: Volatile GPU-Util: instantaneous GPU utilization.
  • Column 8: Uncorr. ECC: uncorrected ECC ("Error Checking and Correcting") errors. Compute M.: compute mode.

Command

nvidia-smi [OPTION1 [ARG1]] [OPTION2 [ARG2]] ...

Options

Option    Description
-h, --help    Print usage information and exit.

LIST OPTIONS:

Option    Description
-L, --list-gpus    Display a list of GPUs connected to the system.

For example:

qgzang@ustc:~$ nvidia-smi -L
GPU 0: GeForce GTX TITAN X (UUID: GPU-xxxxx-xxx-xxxxx-xxx-xxxxxx)

SUMMARY OPTIONS:

Option    Description
-i, --id=    Target a specific GPU.
-f, --filename=    Log to a specified file, rather than to stdout.
-l, --loop=    Probe until Ctrl+C at specified second interval.

QUERY OPTIONS:

Option    Description
-q, --query    Display GPU or Unit info.
-u, --unit    Show unit, rather than GPU, attributes.
-i, --id=    Target a specific GPU or Unit.
-f, --filename=    Log to a specified file, rather than to stdout.
-x, --xml-format    Produce XML output.
--dtd    When showing XML output, embed the DTD.
-d, --display=    Display only selected information: MEMORY, ...
-l, --loop=    Probe until Ctrl+C at specified second interval.
-lms, --loop-ms=    Probe until Ctrl+C at specified millisecond interval.

SELECTIVE QUERY OPTIONS:

Option    Description
--query-gpu=    Information about GPU. Call --help-query-gpu for more info.
--query-supported-clocks=    List of supported clocks. Call --help-query-supported-clocks for more info.
--query-compute-apps=    List of currently active compute processes. Call --help-query-compute-apps for more info.
--query-accounted-apps=    List of accounted compute processes. Call --help-query-accounted-apps for more info.
--query-retired-pages=    List of device memory pages that have been retired. Call --help-query-retired-pages for more info.

[mandatory]

Option    Description
-i, --id=    Target a specific GPU or Unit.
-f, --filename=    Log to a specified file, rather than to stdout.
-l, --loop=    Probe until Ctrl+C at specified second interval.
-lms, --loop-ms=    Probe until Ctrl+C at specified millisecond interval.

DEVICE MODIFICATION OPTIONS:

Option    Description
-pm, --persistence-mode=    Set persistence mode: 0/DISABLED, 1/ENABLED
-e, --ecc-config=    Toggle ECC support: 0/DISABLED, 1/ENABLED
-p, --reset-ecc-errors=    Reset ECC error counts: 0/VOLATILE, 1/AGGREGATE
-c, --compute-mode=    Set MODE for compute applications: 0/DEFAULT, 1/EXCLUSIVE_THREAD (deprecated), 2/PROHIBITED, 3/EXCLUSIVE_PROCESS
--gom=    Set GPU Operation Mode: 0/ALL_ON, 1/COMPUTE, 2/LOW_DP
-r, --gpu-reset    Trigger reset of the GPU.

UNIT MODIFICATION OPTIONS:

Option    Description
-t, --toggle-led=    Set Unit LED state: 0/GREEN, 1/AMBER
-i, --id=    Target a specific Unit.

SHOW DTD OPTIONS:

Option    Description
--dtd    Print device DTD and exit.
-f, --filename=    Log to a specified file, rather than to stdout.
-u, --unit    Show unit, rather than device, DTD.
--debug=    Log encrypted debug information to a specified file.

Process Monitoring:

Command    Description
pmon    Displays process stats in scrolling format. "nvidia-smi pmon -h" for more information.

TOPOLOGY: (EXPERIMENTAL)

Command    Description
topo    Displays device/system topology. "nvidia-smi topo -h" for more information. See the nvidia-smi(1) manual page for more detailed information.

Check the GPUs' compute mode:

[root@g02 ~]#  nvidia-smi -a |grep Mode |grep Compute
    Compute Mode                    : Exclusive_Process
    Compute Mode                    : Exclusive_Process
    Compute Mode                    : Exclusive_Process
    Compute Mode                    : Exclusive_Process
    Compute Mode                    : Exclusive_Process

Change the compute mode:

[root@g02 gcc-build-9.2.0-g01]# nvidia-smi -c 3
Set compute mode to EXCLUSIVE_PROCESS for GPU 00000000:04:00.0.
Set compute mode to EXCLUSIVE_PROCESS for GPU 00000000:05:00.0.
Set compute mode to EXCLUSIVE_PROCESS for GPU 00000000:08:00.0.
Set compute mode to EXCLUSIVE_PROCESS for GPU 00000000:09:00.0.
Set compute mode to EXCLUSIVE_PROCESS for GPU 00000000:85:00.0.
Set compute mode to EXCLUSIVE_PROCESS for GPU 00000000:86:00.0.
Set compute mode to EXCLUSIVE_PROCESS for GPU 00000000:89:00.0.
Set compute mode to EXCLUSIVE_PROCESS for GPU 00000000:8A:00.0.
All done.

3. Discussion

3.1 Running jobs on a free GPU

#!/bin/bash
# Collect the indices of GPUs that currently have processes running on them.
# In nvidia-smi's process table the 2nd field is the GPU index; the
# $2 != "N/A" filter skips the memory-usage lines of the upper table.
using=(`nvidia-smi | grep MiB | awk '$2 != "N/A" {print $2}'`)

for i in `seq 0 15`
do
    # GPU $i is free if it does not appear in the "using" list.
    if [[ ! " ${using[@]} " =~ " $i " ]]; then
        CUDA_VISIBLE_DEVICES=$i cmd  # replace cmd with the command to run
        echo $i
    fi
done

Recently, some GPU jobs have been failing with:

error selecting compatible GPU all CUDA-capable devices are busy or unavailable

Cause: the current compute mode allows only one job per GPU. Submitting another job to a GPU that already has one produces this error.

Workarounds:

  1. Have the application detect a free GPU first, then submit the job there
  2. Pin the job to a free GPU manually
  3. Submit all GPU jobs through SLURM
  4. Containerize the applications with Docker

Recommended approach:

Submit all GPU jobs through SLURM. SLURM can queue GPU jobs by simple counting, but if some GPU jobs bypass SLURM it cannot account for them, and the error above appears.
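Under SLURM, a job requests GPUs with --gres and lets the scheduler pick the device. A minimal job-script sketch (the partition name gpu and the script name are assumptions for illustration):

```shell
#!/bin/bash
#SBATCH --job-name=gpu-task
#SBATCH --partition=gpu      # assumed partition name; adjust to your cluster
#SBATCH --gres=gpu:1         # request one GPU

# SLURM exports CUDA_VISIBLE_DEVICES for the allocated device,
# so jobs never collide on a GPU.
python your_file.py
```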

3.2 Selecting GPUs when launching from the terminal

CUDA_VISIBLE_DEVICES=0    python  your_file.py  # use only the first GPU on the node; the others are masked

CUDA_VISIBLE_DEVICES=1           Only device 1 will be seen
CUDA_VISIBLE_DEVICES=0,1         Devices 0 and 1 will be visible
CUDA_VISIBLE_DEVICES="0,1"       Same as above, quotation marks are optional (multiple GPUs used together)
CUDA_VISIBLE_DEVICES=0,2,3       Devices 0, 2, 3 will be visible; device 1 is masked
CUDA_VISIBLE_DEVICES=""          No GPU will be visible
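CUDA_VISIBLE_DEVICES is an ordinary environment variable read by the CUDA runtime at startup, so the per-command form scopes it to that single process:

```shell
# Set the variable only for one command; the parent shell is unaffected.
CUDA_VISIBLE_DEVICES=0,2 env | grep '^CUDA_VISIBLE_DEVICES='
# -> CUDA_VISIBLE_DEVICES=0,2
echo "parent shell sees: '${CUDA_VISIBLE_DEVICES-}'"
# -> parent shell sees: '' (unless it was already set)
```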

4. Errors

4.1 Failed to initialize NVML: Driver/library version mismatch.

Strategy 1:

This error occurs because the NVIDIA kernel module version no longer matches the user-space driver. Usually a reboot fixes it; if rebooting is not an option, the kernel module can be reloaded instead.

In short, it takes two steps:

  1. Unload the nvidia kernel module
  2. Reload the nvidia kernel module

In practice:

sudo rmmod nvidia
sudo nvidia-smi

If the kernel module is not loaded, nvidia-smi loads it automatically.

If that fails because the module is still in use, unload the driver stack piece by piece. First check the module dependencies; from the error message we already know that nvidia_modeset and nvidia_uvm depend on nvidia, so they must be unloaded first:

lsmod |grep nvidia
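The fourth field of /proc/modules (lsmod's "Used by" column) lists exactly which modules must be removed first. A parsing sketch (the helper name nvidia_dependents is my own):

```shell
# Print each module that depends on the nvidia core module, one per line.
# Input is /proc/modules-style: name size refcount used-by ...
nvidia_dependents() {
    awk '$1 == "nvidia" {
        n = split($4, deps, ",")
        for (i = 1; i <= n; i++)
            if (deps[i] != "" && deps[i] != "-") print deps[i]
    }'
}

# On a live system: nvidia_dependents < /proc/modules
# Demo on a sample line:
printf 'nvidia 20390224 43 nvidia_modeset,nvidia_uvm, Live 0x0\n' | nvidia_dependents
# -> nvidia_modeset
#    nvidia_uvm
```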

Stop any processes using the GPU first, then:

sudo rmmod nvidia_uvm
sudo rmmod nvidia_modeset

Then:

sudo rmmod nvidia
nvidia-smi

Strategy 2:

When Strategy 1 does not help, consider the following. The root cause of this error is a mismatch between the NVIDIA kernel module and the user-space driver libraries:

[cebroker@g03 flare]$ ll /usr/src
total 428
drwxr-xr-x. 2 root root      6 Apr 11  2018 debug
-rw-r--r--  1 root root  64667 Jul 29  2019 fortran.c
-rw-r--r--  1 root root  17859 Jul 29  2019 fortran_common.h
-rw-r--r--  1 root root  39040 Jul 29  2019 fortran.h
-rw-r--r--  1 root root 269462 Jul 29  2019 fortran_thunking.c
-rw-r--r--  1 root root  34362 Jul 29  2019 fortran_thunking.h
drwxr-xr-x. 6 root root   4096 Sep 11  2018 kernels
drwxr-xr-x. 2 root root      6 Jul  1  2019 nvidia-387.26
drwxr-xr-x  7 root root    149 Jul  1  2019 nvidia-418.67
drwxr-xr-x  7 root root    149 Nov 28  2019 nvidia-418.87.00
drwxr-xr-x  7 root root    149 Nov 28  2019 nvidia-430.37


[root@g02 ~]# dkms status
nvidia, 390.30, 3.10.0-693.17.1.el7.x86_64, x86_64: built
nvidia, 418.87.00, 3.10.0-1062.1.2.el7.x86_64, x86_64: installed


[root@g02 ~]# dkms remove -m nvidia -v 418.87.00 --all

[root@g02 ~]# dkms remove -m nvidia -v 390.30 --all


dkms install -m nvidia -v 430.37

rmmod nvidia
nvidia-smi

Because 430.37 is the driver version that actually matches this setup (though it is not obvious why).

4.2 Error: nvidia-smi cannot communicate with the driver

Running nvidia-smi in a terminal gives:

NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver

Debugging, following the official guide…

Check that the cards are visible on the PCI bus:

[root@g03 ~]# lspci | grep -i nvidia
06:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
07:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
0d:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
0e:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
86:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
87:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
8d:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
8e:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)

All devices show up; looks normal.

Verify the installer package:

$ md5sum cuda-repo-rhel7-8-0-local-8.0.44-1.x86_64-rpm
24fea3b7f2e5f7e3f155cd73bc008108  cuda-repo-rhel7-8-0-local-8.0.44-1.x86_64-rpm

Compare with the official checksums (http://developer.download.nvidia.com/compute/cuda/8.0/Prod/docs/sidebar/md5sum.txt); they match.

(I did not actually do this step; it should be fine.)

Check the system dependencies:

$ yum info dkms
$ yum info libvdpau 
$ yum info kernel-devel

All present.

Install the nvidia module for the kernel

A dkms module has to go through three steps (add, build, install) before modinfo can detect it.

[root@g03 ~]# dkms status
nvidia, 387.26, 3.10.0-693.17.1.el7.x86_64, x86_64: installed (original_module exists) (WARNING! Diff between built and installed module!) (WARNING! Diff between built and installed module!) (WARNING! Diff between built and installed module!) (WARNING! Diff between built and installed module!)

The sources live under /usr/src/, with two versions present: 387.26 and 390.25.

Running dkms install -m nvidia -v 387.26 fails, so use:

dkms install -m nvidia -v 390.25

Reboot, then:

nvidia-smi

Works.

4.3 Error: unable to find the kernel source tree

ERROR: Unable to find the kernel source tree for the currently running kernel. Please make sure you have installed the kernel source files for your kernel and that they are properly configured; on Red Hat Linux systems, for example, be sure you have the 'kernel-source' or 'kernel-devel' RPM installed. If you know the correct kernel source files are installed, you may specify the kernel source path with the '--kernel-source-path' command line option.

Fix:

yum install kernel-devel kernel-headers -y
yum info kernel-devel kernel-headers
#yum install "kernel-devel-uname-r == $(uname -r)"  # not tried
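The installer looks for a source tree matching the running kernel under /usr/src/kernels. A quick check helper (the name kdev_ok is my own; on a real system call it as kdev_ok "$(uname -r)" /usr/src/kernels):

```shell
# Report whether the kernel source tree for a given kernel release exists.
# Usage: kdev_ok <kernel-release> <kernels-dir>
kdev_ok() {
    if [ -d "$2/$1" ]; then echo "present"; else echo "missing"; fi
}

# Demo against a scratch directory:
tmp=$(mktemp -d)
mkdir -p "$tmp/3.10.0-1062.el7.x86_64"
kdev_ok 3.10.0-1062.el7.x86_64 "$tmp"   # -> present
kdev_ok 9.9.9 "$tmp"                    # -> missing
```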

cd /usr/lib/modules/3.10.0-1062.el7.x86_64
 
 [root@g02 3.10.0-1062.el7.x86_64]# ll
total 3276
lrwxrwxrwx.  1 root root     39 Sep  2  2020 build -> /usr/src/kernels/3.10.0-1062.el7.x86_64   # this target does not exist, which is why the source tree is never found
drwxr-xr-x.  2 root root      6 Aug  8  2019 extra
drwxr-xr-x. 12 root root    128 Sep  2  2020 kernel
-rw-r--r--.  1 root root 852612 Sep  2  2020 modules.alias
-rw-r--r--.  1 root root 813600 Sep  2  2020 modules.alias.bin
-rw-r--r--.  1 root root   1333 Aug  8  2019 modules.block
-rw-r--r--.  1 root root   7357 Aug  8  2019 modules.builtin
-rw-r--r--.  1 root root   9425 Sep  2  2020 modules.builtin.bin
-rw-r--r--.  1 root root 271558 Sep  2  2020 modules.dep
-rw-r--r--.  1 root root 379859 Sep  2  2020 modules.dep.bin
-rw-r--r--.  1 root root    361 Sep  2  2020 modules.devname
-rw-r--r--.  1 root root    140 Aug  8  2019 modules.drm
-rw-r--r--.  1 root root     69 Aug  8  2019 modules.modesetting
-rw-r--r--.  1 root root   1787 Aug  8  2019 modules.networking
-rw-r--r--.  1 root root  97132 Aug  8  2019 modules.order
-rw-r--r--.  1 root root    569 Sep  2  2020 modules.softdep
-rw-r--r--.  1 root root 395012 Sep  2  2020 modules.symbols
-rw-r--r--.  1 root root 483595 Sep  2  2020 modules.symbols.bin
lrwxrwxrwx.  1 root root      5 Sep  2  2020 source -> build
drwxr-xr-x.  2 root root      6 Aug  8  2019 updates
drwxr-xr-x.  2 root root     95 Sep  2  2020 vdso
drwxr-xr-x.  2 root root      6 Aug  8  2019 weak-updates

# recreate the symlink to an existing source tree
[root@g02 3.10.0-1062.el7.x86_64]# ln -s /usr/src/kernels/3.10.0-1127.19.1.el7.x86_64  build
