Linux [11] - Software Installation 3 - NVIDIA (GPU driver, CUDA)

Installing NVIDIA on CentOS 7.4

Note: this is the installation on host G03.

I. Install gcc

yum -y install gcc-c++

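Important: if an older NVIDIA driver is already installed, uninstall it first. Also, install CUDA before the driver. Alternatively, follow my steps in the order given and reinstall the driver once more at the end.

II. Detect the GPU model and the driver it needs

  ## Add the ELRepo repository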

$ sudo rpm --import https://www.elrepo.org/RPM-GPG-KEY-elrepo.org
$ sudo rpm -Uvh http://www.elrepo.org/elrepo-release-7.0-2.el7.elrepo.noarch.rpm

Install the NVIDIA driver detection tool

$ sudo yum install nvidia-detect
$ nvidia-detect -v

Probing for supported NVIDIA devices...
[10de:102d] NVIDIA Corporation GK210GL [Tesla K80]
This device requires the current 390.25 NVIDIA driver kmod-nvidia
[10de:102d] NVIDIA Corporation GK210GL [Tesla K80]
This device requires the current 390.25 NVIDIA driver kmod-nvidia
[10de:102d] NVIDIA Corporation GK210GL [Tesla K80]
This device requires the current 390.25 NVIDIA driver kmod-nvidia
[10de:102d] NVIDIA Corporation GK210GL [Tesla K80]
This device requires the current 390.25 NVIDIA driver kmod-nvidia
[10de:102d] NVIDIA Corporation GK210GL [Tesla K80]
This device requires the current 390.25 NVIDIA driver kmod-nvidia
[1a03:2000] ASPEED Technology, Inc. ASPEED Graphics Family
[10de:102d] NVIDIA Corporation GK210GL [Tesla K80]
This device requires the current 390.25 NVIDIA driver kmod-nvidia
[10de:102d] NVIDIA Corporation GK210GL [Tesla K80]
This device requires the current 390.25 NVIDIA driver kmod-nvidia
[10de:102d] NVIDIA Corporation GK210GL [Tesla K80]
This device requires the current 390.25 NVIDIA driver kmod-nvidia

Every Tesla K80 device in the output needs driver version 390.25, so download that installer:

cd /data/src
wget -r -np -nd http://us.download.nvidia.com/XFree86/Linux-x86_64/390.25/NVIDIA-Linux-x86_64-390.25.run
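If you would rather not read the version off the screen, the following is a minimal sketch that pulls it out of the nvidia-detect output and builds the download URL from it. The function names required_versions and driver_url are my own, and the URL pattern is simply copied from the wget command above, so treat it as an assumption for other driver versions.

```shell
# Print the distinct driver versions named in nvidia-detect output (read from stdin).
required_versions() {
    sed -n 's/.*requires the current \([0-9][0-9.]*\) NVIDIA driver.*/\1/p' | sort -u
}

# Build the installer URL for one version, reusing the URL pattern from this post.
driver_url() {
    v="$1"
    printf '%s\n' "http://us.download.nvidia.com/XFree86/Linux-x86_64/${v}/NVIDIA-Linux-x86_64-${v}.run"
}
```

On a machine with nvidia-detect installed, it could be used as: nvidia-detect -v | required_versions | while read -r v; do driver_url "$v"; done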

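2. Driver conflict

The NVIDIA driver conflicts with the nouveau driver that ships with the system. Run this command to see whether nouveau is loaded: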

lsmod | grep nouveau

nouveau              1622010  0 
video                  24520  1 nouveau
mxm_wmi                13021  1 nouveau
drm_kms_helper        159169  2 ast,nouveau
ttm                    99345  2 ast,nouveau
drm                   370825  6 ast,ttm,drm_kms_helper,nouveau
i2c_algo_bit           13413  3 ast,igb,nouveau
i2c_core               40756  8 ast,drm,igb,i2c_i801,ipmi_ssif,drm_kms_helper,i2c_algo_bit,nouveau
wmi                    19070  2 mxm_wmi,nouveau

Edit /etc/modprobe.d/blacklist.conf to stop the nouveau module from loading; create the file if it does not exist. This needs root, because an ordinary user cannot create .conf files under /etc.

$ su root
# echo -e "blacklist nouveau\noptions nouveau modeset=0" >> /etc/modprobe.d/blacklist.conf
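If you may end up running this step more than once, a small helper that only adds lines that are missing keeps the file tidy. This is a sketch under my own assumptions: ensure_line and blacklist_nouveau are hypothetical names, and the file path is passed in as an argument so you can try it on a scratch file first.

```shell
# Append a line to a file only if that exact line is not already present.
ensure_line() {
    line="$1"
    file="$2"
    touch "$file"
    grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# Write both nouveau rules into the given modprobe config file.
blacklist_nouveau() {
    conf="$1"
    ensure_line "blacklist nouveau" "$conf"
    ensure_line "options nouveau modeset=0" "$conf"
}
```

As root, the call would be: blacklist_nouveau /etc/modprobe.d/blacklist.conf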

3. Rebuild the initramfs image

# mv /boot/initramfs-$(uname -r).img /boot/initramfs-$(uname -r).img.bak
# dracut /boot/initramfs-$(uname -r).img $(uname -r)
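The blacklist only takes effect once the machine boots from the new image, so after a reboot it is worth confirming that nouveau is really gone. A minimal sketch, assuming lsmod-style input on stdin (nouveau_loaded is a name I made up):

```shell
# Exit 0 if lsmod-style output on stdin lists nouveau as a loaded module, else exit 1.
# Only the first column is checked, so "video ... 1 nouveau" alone does not count.
nouveau_loaded() {
    awk '$1 == "nouveau" { found = 1 } END { exit(found ? 0 : 1) }'
}
```

Example: if lsmod | nouveau_loaded; then echo "nouveau is still loaded"; else echo "nouveau is not loaded"; fi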

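III. Install the NVIDIA driver

Change to the directory that holds the NVIDIA installer and run it (I recommend putting this off until after CUDA is installed).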

$ chmod +x NVIDIA-Linux-x86_64-390.25.run
$ sh NVIDIA-Linux-x86_64-390.25.run
   Once the installation finishes, you can check the GPU status with nvidia-smi.

Error:

 You appear to be running an X server; please exit X before installing.  For   
         further details, please see the section INSTALLING THE NVIDIA DRIVER in the   
         README available on the Linux driver download page at www.nvidia.com.

Fix:

Shut down the graphical interface:

init 3  

Then rerun the sh installation step above and check the result:

$ nvidia-smi

IV. Install CUDA

Download the CUDA rpm package from the official site, https://developer.nvidia.com/cuda-downloads, and make sure it matches your own system version.

wget -c https://developer.nvidia.com/compute/cuda/9.1/Prod/local_installers/cuda-repo-rhel7-9-1-local-9.1.85-1.x86_64
sudo rpm -i cuda-repo-rhel7-9-1-local-9.1.85-1.x86_64
sudo yum clean all
sudo yum install cuda

6. Test CUDA

cd /usr/local/cuda-9.1/samples/1_Utilities/deviceQuery
sudo make
./deviceQuery

   Installation succeeded.

7. Add CUDA to .bash_profile

vim ~/.bash_profile

PATH=$PATH:$HOME/bin:/usr/local/cuda/bin
LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib64/
CUDA_HOME=/usr/local/cuda
export PATH
export LD_LIBRARY_PATH
export CUDA_HOME
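One thing to watch with the lines above: if LD_LIBRARY_PATH was empty beforehand, the result starts with a colon, and sourcing the file twice adds the same directory twice. Here is a sketch of a variant that avoids both. The helper append_path is hypothetical, not something CUDA provides.

```shell
# Append a directory to a colon-separated list unless it is already in it.
# An empty list yields just the directory, with no leading colon.
append_path() {
    case ":$1:" in
        *":$2:"*) printf '%s\n' "$1" ;;
        *) printf '%s\n' "${1:+$1:}$2" ;;
    esac
}

PATH=$(append_path "$PATH" /usr/local/cuda/bin)
LD_LIBRARY_PATH=$(append_path "${LD_LIBRARY_PATH:-}" /usr/local/cuda/lib64)
CUDA_HOME=/usr/local/cuda
export PATH LD_LIBRARY_PATH CUDA_HOME
```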

Check the nvcc version

nvcc -V

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2017 NVIDIA Corporation
Built on Fri_Nov__3_21:07:56_CDT_2017
Cuda compilation tools, release 9.1, V9.1.85
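To compare the toolkit release against the directory you installed into (cuda-9.1 in my case), you can cut the release number out of that last line. A minimal sketch, where nvcc_release is my own name and the parsing assumes the "release X.Y," wording shown above:

```shell
# Print the CUDA release number found in `nvcc -V` output read from stdin.
nvcc_release() {
    sed -n 's/.*release \([0-9][0-9.]*\),.*/\1/p'
}
```

Example: nvcc -V | nvcc_release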

VI. Errors

1. Failed to initialize NVML: Driver/library version mismatch.

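This happens because the NVIDIA kernel module that is currently loaded is still the old version and no longer matches the updated driver libraries. Rebooting usually fixes it. If you cannot reboot for some reason, you can reload the kernel module instead.

In short, it takes two steps: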

  1. unload nvidia kernel mod
  2. reload nvidia kernel mod

In practice:

sudo rmmod nvidia
sudo nvidia-smi

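When nvidia-smi finds that the kernel module is not loaded, it loads it automatically.

If rmmod refuses because the nvidia module is in use, the driver has to be unloaded piece by piece. Start by checking the current module dependencies. The error message shows that nvidia_modeset and nvidia_uvm depend on nvidia, so those two have to be unloaded first.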

lsmod |grep nvidia

First stop the processes that are using the driver, then unload the dependent modules:

sudo rmmod nvidia_uvm
sudo rmmod nvidia_modeset

Then:

sudo rmmod nvidia
nvidia-smi
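The order matters here, so below is a sketch that reads lsmod output and prints only the NVIDIA modules that are actually loaded, dependents first and nvidia last. The function name nvidia_unload_order is hypothetical. I also put nvidia_drm at the front of the list as an assumption, since newer drivers ship it and it sits on top of nvidia_modeset; it did not appear on my machine.

```shell
# Print the loaded NVIDIA modules from lsmod-style stdin, dependents first.
nvidia_unload_order() {
    loaded=$(awk '{ print $1 }')
    for m in nvidia_drm nvidia_uvm nvidia_modeset nvidia; do
        printf '%s\n' "$loaded" | grep -qx "$m" && printf '%s\n' "$m"
    done
    return 0
}
```

As root, after stopping the processes that use the GPU: for m in $(lsmod | nvidia_unload_order); do rmmod "$m"; done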

6.2 Error

Running nvidia-smi in the terminal prints:

NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver

Start debugging by following the official guide…

1.1 Check the GPU status

[root@g03 ~]# lspci | grep -i nvidia
06:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
07:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
0d:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
0e:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
86:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
87:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
8d:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
8e:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)

The cards show up normally.

1.2 Verify the installation package

$ md5sum cuda-repo-rhel7-8-0-local-8.0.44-1.x86_64-rpm
24fea3b7f2e5f7e3f155cd73bc008108  cuda-repo-rhel7-8-0-local-8.0.44-1.x86_64-rpm

It matches the checksum on the official site (http://developer.download.nvidia.com/compute/cuda/8.0/Prod/docs/sidebar/md5sum.txt).

(I skipped this step myself; the package should be fine.)

1.3 Check system dependencies

$ yum info dkms
$ yum info libvdpau 
$ yum info kernel-devel

All three are available.

1.4 Install the nvidia module for the kernel

A dkms module has to go through three steps (add, build, install) before modinfo can detect it.

[root@g03 ~]# dkms status
nvidia, 387.26, 3.10.0-693.17.1.el7.x86_64, x86_64: installed (original_module exists) (WARNING! Diff between built and installed module!) (WARNING! Diff between built and installed module!) (WARNING! Diff between built and installed module!) (WARNING! Diff between built and installed module!)

The sources are under /usr/src/, in two versions: 387.26 and 390.25.

Choosing dkms install -m nvidia -v 387.26 fails with an error, so install 390.25 instead:

dkms install -m nvidia -v 390.25
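To see at a glance which versions dkms currently knows about for a module, the status line can be split on its commas. This is only a sketch: dkms_versions is my own name, and it assumes the "module, version, kernel, arch: state" layout shown in the output above, which other dkms releases may print differently.

```shell
# Print the distinct versions dkms reports for one module.
# Reads `dkms status` output on stdin; the module name is the first argument.
dkms_versions() {
    awk -F', ' -v mod="$1" '$1 == mod { print $2 }' | sort -u
}
```

Example: dkms status | dkms_versions nvidia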

Reboot, then run:

nvidia-smi

It works.

I have a personal public account, but I am lazy and rarely update it. You can ask questions there; if I am slow to reply, email me at tiehan@sina.cn
