Linux [11] - Software Installation 3 - NVIDIA (GPU driver, CUDA)

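CUDA (Compute Unified Device Architecture) is a computing platform from the GPU vendor NVIDIA. CUDA™ is a general-purpose parallel computing architecture introduced by NVIDIA that lets GPUs solve complex computational problems.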

I. Installation

Installing NVIDIA on CentOS 7.4 (note: this is the installation on host G03)

1.1 Install gcc

yum -y install gcc-c++

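Key point: if an earlier NVIDIA driver is installed, uninstall it first, and install CUDA before the driver. Alternatively, follow my steps in the order given and reinstall the driver once at the end.

1.2 Detect the GPU model and the required driver

Add the ELRepo repository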

$ sudo rpm --import https://www.elrepo.org/RPM-GPG-KEY-elrepo.org
$ sudo rpm -Uvh http://www.elrepo.org/elrepo-release-7.0-2.el7.elrepo.noarch.rpm

Install the NVIDIA driver detection tool

$ sudo yum install nvidia-detect
$ nvidia-detect -v

Probing for supported NVIDIA devices...
[10de:102d] NVIDIA Corporation GK210GL [Tesla K80]
This device requires the current 390.25 NVIDIA driver kmod-nvidia
[10de:102d] NVIDIA Corporation GK210GL [Tesla K80]
This device requires the current 390.25 NVIDIA driver kmod-nvidia
[10de:102d] NVIDIA Corporation GK210GL [Tesla K80]
This device requires the current 390.25 NVIDIA driver kmod-nvidia
[10de:102d] NVIDIA Corporation GK210GL [Tesla K80]
This device requires the current 390.25 NVIDIA driver kmod-nvidia
[10de:102d] NVIDIA Corporation GK210GL [Tesla K80]
This device requires the current 390.25 NVIDIA driver kmod-nvidia
[1a03:2000] ASPEED Technology, Inc. ASPEED Graphics Family
[10de:102d] NVIDIA Corporation GK210GL [Tesla K80]
This device requires the current 390.25 NVIDIA driver kmod-nvidia
[10de:102d] NVIDIA Corporation GK210GL [Tesla K80]
This device requires the current 390.25 NVIDIA driver kmod-nvidia
[10de:102d] NVIDIA Corporation GK210GL [Tesla K80]
This device requires the current 390.25 NVIDIA driver kmod-nvidia
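The output above repeats the same requirement for every device. As a minimal sketch, the distinct driver versions can be extracted with sed. The function name required_driver_versions is hypothetical, and the pattern assumes the "This device requires the current ... NVIDIA driver" wording shown above.

```shell
# Print each distinct driver version requested in `nvidia-detect -v` output.
# Reads the tool's output on stdin; assumes the message wording shown above.
required_driver_versions() {
    sed -n 's/^This device requires the current \([0-9][0-9.]*\) NVIDIA driver.*/\1/p' | sort -u
}

# Usage (on the machine with the GPUs):
#   nvidia-detect -v | required_driver_versions
```

If more than one version is printed, the devices do not all share the same driver requirement.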

Every Tesla K80 device listed requires driver 390.25, so download that version:

cd /data/src
wget -r -np -nd http://us.download.nvidia.com/XFree86/Linux-x86_64/390.25/NVIDIA-Linux-x86_64-390.25.run

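Driver conflict

The NVIDIA driver conflicts with the nouveau driver that ships with the system. Run the following command to check whether nouveau is loaded: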

lsmod | grep nouveau

nouveau              1622010  0 
video                  24520  1 nouveau
mxm_wmi                13021  1 nouveau
drm_kms_helper        159169  2 ast,nouveau
ttm                    99345  2 ast,nouveau
drm                   370825  6 ast,ttm,drm_kms_helper,nouveau
i2c_algo_bit           13413  3 ast,igb,nouveau
i2c_core               40756  8 ast,drm,igb,i2c_i801,ipmi_ssif,drm_kms_helper,i2c_algo_bit,nouveau
wmi                    19070  2 mxm_wmi,nouveau
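The same check can be turned into a yes/no test, which is handy after a reboot to confirm that the blacklist took effect. This is a minimal sketch; the function name nouveau_loaded is hypothetical, and it only looks at the first column of lsmod output.

```shell
# Succeed (exit 0) if lsmod output on stdin lists a module named exactly "nouveau".
# Lines that merely mention nouveau as a dependant (e.g. "video ... nouveau") do not count.
nouveau_loaded() {
    awk '$1 == "nouveau" { found = 1 } END { exit !found }'
}

# Usage:
#   if lsmod | nouveau_loaded; then echo "nouveau is still loaded"; fi
```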

Edit /etc/modprobe.d/blacklist.conf to stop the nouveau module from loading; create the file if it does not exist. This requires root, because ordinary users cannot create .conf files under /etc.

$ su root
# echo -e "blacklist nouveau\noptions nouveau modeset=0" > /etc/modprobe.d/blacklist.conf
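Note that the > redirection above overwrites any existing blacklist.conf. A hedged alternative, assuming you want to keep entries that are already in the file, is to append each line only when it is missing. The function name blacklist_nouveau is hypothetical.

```shell
# Append the two nouveau settings to a modprobe config file, skipping lines
# that are already present, so existing entries in the file are preserved.
blacklist_nouveau() {
    conf=$1
    touch "$conf"
    for line in 'blacklist nouveau' 'options nouveau modeset=0'; do
        # -x matches the whole line, -F treats the pattern as a fixed string
        grep -qxF "$line" "$conf" || printf '%s\n' "$line" >> "$conf"
    done
}

# Usage (as root):
#   blacklist_nouveau /etc/modprobe.d/blacklist.conf
```

Running it twice leaves the file unchanged the second time.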

Rebuild the initramfs image

# mv /boot/initramfs-$(uname -r).img /boot/initramfs-$(uname -r).img.bak
# dracut /boot/initramfs-$(uname -r).img $(uname -r)

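1.3 Install the NVIDIA driver

Change to the directory that holds the NVIDIA installer and run it (I recommend postponing this until after CUDA is installed).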

$ chmod +x NVIDIA-Linux-x86_64-390.25.run
$ sh NVIDIA-Linux-x86_64-390.25.run
If the installation completes, you can run nvidia-smi to check the GPU status.

Error reported by the installer:

 You appear to be running an X server; please exit X before installing.  For   
         further details, please see the section INSTALLING THE NVIDIA DRIVER in the   
         README available on the Linux driver download page at www.nvidia.com.

Solution:

Shut down the graphical interface:

init 3  

Then repeat the sh installation step above and check the result:

nvidia-smi

1.4 Install CUDA

Download the CUDA rpm package from the official site, https://developer.nvidia.com/cuda-downloads, and make sure it matches your system version.

wget -c https://developer.nvidia.com/compute/cuda/9.1/Prod/local_installers/cuda-repo-rhel7-9-1-local-9.1.85-1.x86_64
sudo rpm -i cuda-repo-rhel7-9-1-local-9.1.85-1.x86_64
sudo yum clean all
sudo yum install cuda

Test CUDA

cd /usr/local/cuda-9.1/samples/1_Utilities/deviceQuery
sudo make
./deviceQuery

The installation succeeded.

Add CUDA to ~/.bash_profile:

vim ~/.bash_profile

PATH=$PATH:$HOME/bin:/usr/local/cuda/bin
LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib64/
CUDA_HOME=/usr/local/cuda
export PATH
export LD_LIBRARY_PATH
export CUDA_HOME
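One caveat with the lines above: if LD_LIBRARY_PATH is empty, the result starts with a ":" (an empty entry, which the dynamic loader treats as the current directory), and sourcing the profile more than once appends the same directories again. A minimal sketch that avoids both is shown here; the function name add_cuda_paths is hypothetical, and the default prefix is the /usr/local/cuda path used above.

```shell
# Add CUDA directories to PATH and LD_LIBRARY_PATH without duplicates.
# $1 is the CUDA install prefix (defaults to /usr/local/cuda).
add_cuda_paths() {
    prefix=${1:-/usr/local/cuda}
    # Append to PATH only if the bin directory is not already listed.
    case ":$PATH:" in
        *":$prefix/bin:"*) ;;
        *) PATH="$PATH:$prefix/bin" ;;
    esac
    # ${VAR:+...} adds the separating ":" only when the variable is non-empty.
    case ":${LD_LIBRARY_PATH:-}:" in
        *":$prefix/lib64:"*) ;;
        *) LD_LIBRARY_PATH="${LD_LIBRARY_PATH:+$LD_LIBRARY_PATH:}$prefix/lib64" ;;
    esac
    CUDA_HOME=$prefix
    export PATH LD_LIBRARY_PATH CUDA_HOME
}

# Usage in ~/.bash_profile:
#   add_cuda_paths /usr/local/cuda
```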

Check the nvcc version:

nvcc -V

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2017 NVIDIA Corporation
Built on Fri_Nov__3_21:07:56_CDT_2017
Cuda compilation tools, release 9.1, V9.1.85
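If a script needs only the release number, it can be extracted from the last line of that output. This is a sketch that assumes the "Cuda compilation tools, release X.Y," wording shown above; the function name nvcc_release is hypothetical.

```shell
# Print the CUDA release number (e.g. 9.1) from `nvcc -V` output on stdin.
nvcc_release() {
    sed -n 's/^Cuda compilation tools, release \([0-9][0-9.]*\),.*/\1/p'
}

# Usage:
#   nvcc -V | nvcc_release
```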

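II. Checking nvidia-smi

nvidia-smi (the NVIDIA System Management Interface) is NVIDIA's command-line management suite for its GPUs. It is built on NVML (the NVIDIA Management Library) and is intended for managing and monitoring NVIDIA GPU devices.

The suite lets administrators query the state of GPU devices and, with the appropriate privileges, modify it. nvidia-smi can manage Tesla and Fermi devices and offers limited support for other GPU models. For a detailed description of nvidia-smi, see: http://developer.download.nvidia.com/compute/cuda/6_0/rel/gdk/nvidia-smi.331.38.pdf

Monitor GPU usage continuously:

watch -n 10 nvidia-smi (or: nvidia-smi -l 10)

Either command refreshes the GPU information every 10 seconds. The fields in the output are: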

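  • Column 1  GPU: device index (0, 1, ...). Fan: fan speed, 0~100%.
  • Column 2  Name: model, e.g. Tesla K20c or Quadro K4000. Temp: temperature in degrees Celsius.
  • Column 3  Perf: performance state, P0~P12; P0 is maximum performance and P12 is minimum performance.
  • Column 4  Persistence-M: persistence mode status. Pwr: power consumption.
  • Column 5  Bus-Id: GPU bus address, in the form domain:bus:device.function.
  • Column 6  Disp.A: Display Active, whether a display is initialized on the GPU. Memory Usage: GPU memory usage.
  • Column 7  Volatile GPU-Util: the current (fluctuating) GPU utilization.
  • Column 8  Uncorr. ECC: ECC stands for "Error Checking and Correcting". Compute M: compute mode.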
III. Errors

3.1 Failed to initialize NVML: Driver/library version mismatch.

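This error occurs when the NVIDIA driver's kernel module has not been updated and no longer matches the installed library version. Rebooting the machine usually fixes it; if the machine cannot be rebooted for some reason, the kernel module can be reloaded instead.

In short, there are two steps:

  1. Unload the nvidia kernel module
  2. Reload the nvidia kernel module

In practice: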

sudo rmmod nvidia
sudo nvidia-smi

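When nvidia-smi finds that the kernel module is not loaded, it loads it automatically.

If rmmod nvidia fails because the module is still in use, the driver has to be unloaded piece by piece. Start by checking the dependencies between the kernel modules: the rmmod error message shows that nvidia_modeset and nvidia_uvm depend on nvidia, so they must be unloaded first.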

lsmod |grep nvidia

First stop any processes that are using the GPU, then unload the dependent modules:

sudo rmmod nvidia_uvm
sudo rmmod nvidia_modeset

Then:

sudo rmmod nvidia
nvidia-smi
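The unload order can also be derived from lsmod instead of being read off the error message. The sketch below assumes the usual lsmod layout, where the fourth column lists the modules that use the one named in the first column; the function name nvidia_unload_order is hypothetical. It prints each nvidia module only after all nvidia modules that depend on it.

```shell
# Read lsmod output on stdin and print the loaded nvidia* modules in an order
# that is safe for rmmod: a module is printed once none of its users is pending.
nvidia_unload_order() {
    awk '
    $1 ~ /^nvidia/ { name[++n] = $1; users[$1] = $4; pending[$1] = 1 }
    END {
        left = n
        while (left > 0) {
            progress = 0
            for (i = 1; i <= n; i++) {
                m = name[i]
                if (!(m in pending)) continue
                blocked = 0
                k = split(users[m], u, ",")
                for (j = 1; j <= k; j++)
                    if (u[j] in pending) blocked = 1
                if (!blocked) { print m; delete pending[m]; left--; progress = 1 }
            }
            if (!progress) break   # give up on a dependency cycle
        }
    }'
}

# Usage:
#   lsmod | nvidia_unload_order | while read -r mod; do sudo rmmod "$mod"; done
```

rmmod will still refuse to remove a module that a running process holds open, so GPU processes have to be stopped first either way.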

3.2 NVIDIA-SMI couldn't communicate with the NVIDIA driver

Running nvidia-smi in the terminal gives:

NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver

Start debugging by following the official guide…

3.2.1 Check the GPU status

[root@g03 ~]# lspci | grep -i nvidia
06:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
07:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
0d:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
0e:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
86:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
87:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
8d:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
8e:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)

The GPUs are listed correctly.

3.2.2 Verify the installer package

$ md5sum cuda-repo-rhel7-8-0-local-8.0.44-1.x86_64-rpm
24fea3b7f2e5f7e3f155cd73bc008108  cuda-repo-rhel7-8-0-local-8.0.44-1.x86_64-rpm

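Compare the result with the checksum published on the official site (http://developer.download.nvidia.com/compute/cuda/8.0/Prod/docs/sidebar/md5sum.txt); the two should match.

(I did not do this step myself; the package was presumably fine.)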
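The comparison can be scripted so that a mismatch is not missed by eye. This is a minimal sketch; the function name verify_md5 is hypothetical, and the expected value has to be copied from the official checksum list.

```shell
# Succeed (exit 0) only if the MD5 checksum of a file equals the expected value.
# $1 is the expected checksum, $2 is the path of the file to check.
verify_md5() {
    expected=$1
    file=$2
    actual=$(md5sum "$file" | cut -d' ' -f1)
    [ "$actual" = "$expected" ]
}

# Usage:
#   verify_md5 24fea3b7f2e5f7e3f155cd73bc008108 \
#       cuda-repo-rhel7-8-0-local-8.0.44-1.x86_64-rpm && echo "checksum OK"
```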

3.2.3 Check system dependencies

$ yum info dkms
$ yum info libvdpau 
$ yum info kernel-devel

All present. Perfect.

3.2.4 Install the nvidia module for the kernel

A DKMS module has to go through three steps (add, build, install) before modinfo can detect it.

[root@g03 ~]# dkms status
nvidia, 387.26, 3.10.0-693.17.1.el7.x86_64, x86_64: installed (original_module exists) (WARNING! Diff between built and installed module!) (WARNING! Diff between built and installed module!) (WARNING! Diff between built and installed module!) (WARNING! Diff between built and installed module!)
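The status line is long, but only three pieces matter here: the module version, the kernel it was built for, and the state. The sketch below pulls them out, assuming the "name, version, kernel, arch: state" layout shown above (other DKMS versions may print a different layout); the function name dkms_nvidia_state is hypothetical.

```shell
# Summarize `dkms status` lines for the nvidia module as "version kernel state".
# Reads dkms status output on stdin; assumes the comma/colon layout shown above.
dkms_nvidia_state() {
    awk -F'[,:] *' '$1 == "nvidia" {
        split($5, s, " ")      # keep only the first word of the state field
        print $2, $3, s[1]
    }'
}

# Usage:
#   dkms status | dkms_nvidia_state
```

Here the installed module is 387.26, while the driver downloaded earlier is 390.25.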

The module sources are under /usr/src/, in two versions: 387.26 and 390.25.

Running dkms install -m nvidia -v 387.26 fails with an error, so install 390.25 instead:

dkms install -m nvidia -v 390.25

Reboot, then run:

nvidia-smi

Perfect.

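I have a personal official account (公众号), though I am lazy and rarely update it. You can ask questions there; if I am slow to reply, email me at tiehan@sina.cn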

About Sam
Focused on bioinformatics and translational medicine