Linux [9] - Process Management 7-1 - Cluster Management - SGE

SGE manages jobs across multiple nodes: you submit jobs on a single control node without having to care which node each job lands on, which makes the cluster's resources easy to use. For example, with 5 machines of 8 cores each (40 cores in total), if I submit 1000 jobs from one of the machines, the system automatically distributes those 1000 jobs over the 40 cores.
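For instance, a simple way to fan many independent tasks out over the cluster is an array job (a minimal sketch, not from the original; run.sh is a hypothetical task script):

# Submit 1000 tasks as a single array job; SGE places each task on any free slot
qsub -cwd -t 1-1000 run.sh
# Inside run.sh, the environment variable $SGE_TASK_ID (1..1000) tells each task which piece of work to handle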

Further reading:

  • N1 Grid Engine 6 User's Guide
  • N1 Grid Engine 6 Installation Guide
  • N1 Grid Engine 6 Administration Guide

1. Prerequisites

  1. NFS is already set up on /data
  2. NIS is already set up

(See the earlier posts for details.)
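A quick sanity check of these prerequisites on every node might look like this (a sketch, not from the original; "someuser" is a placeholder account):

df -hT /data      # /data should appear as an NFS mount on every node
ypwhich           # should print the NIS server if NIS binding works
id someuser       # a NIS account should resolve to the same UID/GID everywhere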

2. Installation

2.1 Operations on the master server

2.1.1 Modify /etc/hosts

vi /etc/hosts

After opening /etc/hosts you will see the following default entries (I am not sure whether they affect the later steps):

127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6

Add the following after these two lines:

192.168.100.1  C01     C01.xx
192.168.100.2  G02     G02.xx    
192.168.100.3  G03     G03.xx
Change the hostname:

Method 1:

Edit the configuration file /etc/sysconfig/network:

NETWORKING=yes
HOSTNAME=control

Edit /proc/sys/kernel/hostname so that it contains:

control

Or:

echo "G02"> /proc/sys/kernel/hostname

Method 2:

hostnamectl set-hostname C01
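Whichever method is used, it is worth checking afterwards that every short name resolves consistently on every node (a quick check, not from the original):

hostname                     # should print this node's own short name, e.g. C01
getent hosts C01 G02 G03     # each entry added to /etc/hosts should resolve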

2.1.2 Install the dependencies

yum -y install epel-release
yum -y install jemalloc-devel openssl-devel ncurses-devel pam-devel libXmu-devel hwloc-devel hwloc hwloc-libs java-devel javacc ant-junit libdb-devel motif-devel csh ksh xterm db4-utils perl-XML-Simple perl-Env xorg-x11-fonts-ISO8859-1-100dpi xorg-x11-fonts-ISO8859-1-75dpi

2.1.3 User and permissions

 groupadd -g 490 sgeadmin
 useradd -u 495 -g 490 -r -m  -d /home/sgeadmin -s /bin/bash -c "SGE Admin" sgeadmin

Run visudo and add the following line (so that the group can run these same operations without a password):

%sgeadmin       ALL=(ALL)       NOPASSWD: ALL

2.1.4 Build and install

cd /data/software/src
wget -c https://arc.liv.ac.uk/downloads/SGE/releases/8.1.9/sge-8.1.9.tar.gz
tar zxvfp sge-8.1.9.tar.gz
cd sge-8.1.9/source/
sh scripts/bootstrap.sh 
./aimk 

This step failed with:

BUILD FAILED
/data/src/sge-8.1.9/source/build.xml:85: The following error occurred while executing this line:
/data/src/sge-8.1.9/source/build.xml:30: Java returned: 1

Total time: 56 seconds
not done

The documentation explains:

This tries to build all the normal targets, some of which might be
problematic for various reasons (e.g. Java).  Various `aimk` switches
provide selective compilation; use `./aimk -help` for the options, not
all of which may actually work, especially in combination.

   Useful aimk options:

[horizontal]
`-no-qmon`:: Don't build `qmon`;
`-no-qmake`:: don't build `qmake`;
`-no-qtcsh`:: don't build `qtcsh`;
`-no-java -no-jni`:: avoid all Java-related stuff;
`-no-remote`:: don't build `rsh` etc.
(obsoleted by use of `ssh` and the SGE PAM module).

For the core system (daemons, command line clients, but not `qmon`) use

  $ ./aimk -only-core

In other words, not every module has to be built. Since the failure was clearly in the Java step, I dropped the Java parts:

./aimk -no-java -no-jni

That build succeeded. Next:

./aimk -man

Then:

export SGE_ROOT=/data/software/gridengine && mkdir $SGE_ROOT
echo Y | ./scripts/distinst -local -allall -libs -noexit
chown -R sgeadmin.sgeadmin /data/software/gridengine

cd  $SGE_ROOT
./install_qmaster

The installer then walks through a series of prompts:

  1. press enter at the intro screen

  2. press “y” and then specify sgeadmin as the user id (sgeadmin)

  3. leave the install dir as-is (the tutorial shows /BiO/gridengine; here it is /data/software/gridengine)

  4. You will now be asked about port configuration for the master, normally you would choose the default (2) which uses the /etc/services file

  5. accept the sge_qmaster info

  6. You will now be asked about port configuration for the execution daemon; again you would normally choose the default (2), which uses the /etc/services file

  7. accept the sge_execd info

  8. leave the cell name as “default”

  9. Enter an appropriate cluster name when requested ("Enter new cluster name or hit <RETURN> to use default [p6444]"; I just pressed Enter, which produced: creating directory: /data/software/gridengine/default/common, Your $SGE_CLUSTER_NAME: p6444)

  10. leave the spool dir as is (press Enter to accept the default)

  11. press “n” for no windows hosts! (I chose "n" here, which is not the default)

  12. press “y” (permissions are set correctly)

  13. press “y” for all hosts in one DNS domain

  14. If you have Java available on your Qmaster and wish to use SGE Inspect or SDM then enable the JMX MBean server and provide the requested information - probably answer “n” at this point! (I chose "n" here; otherwise this step also fails)

  15. press enter to accept the directory creation notification

  16. enter “classic” for classic spooling (berkeleydb may be more appropriate for large clusters)

  17. press enter to accept the next notice

  18. enter “20000-20100” as the GID range (increase this range if you have execution nodes capable of running more than 100 concurrent jobs)

  19. accept the default spool dir or specify a different folder (for example if you wish to use a shared or local folder outside of SGE_ROOT)

  20. enter an email address to which problem reports will be sent

  21. press “n” to refuse to change the parameters you have just configured

    • Error: Command failed: ./utilbin/lx-amd64/spooldefaults Command failed: configuration Command failed: /tmp/configuration_2018-03-16_09:00:40.43362 Probably a permission problem. Please check file access permissions. Check read/write permission. Check if SGE daemons are running.
    • Reinstalling made this error go away.
  22. press enter to accept the next notice

  23. press “y” to install the startup scripts

  24. press enter twice to confirm the following messages

    • You should see messages like: cp /data/software/gridengine/default/common/sgemaster /etc/init.d/sgemaster.p6444 /usr/lib/lsb/install_initd /etc/init.d/sgemaster.p6444
  25. press “n” for a file with a list of hosts

  26. enter the names of your hosts who will be able to administer and submit jobs (press Enter alone to finish adding hosts). I typed C01, Enter, G02, Enter, G03, Enter. (A string of garbled characters also got entered here, which may have caused problems later.)

  27. skip shadow hosts for now (press “n”)

  28. choose “1” for normal configuration and agree with “y”

  29. press enter to accept the next message and “n” to refuse to see the previous screen again, then finally press enter to exit the installer. You may verify your administrative hosts with the command

    qconf -sh

    and you may add new administrative hosts with the command

    qconf -ah

After the installation finishes:

cp /data/software/gridengine/default/common/settings.sh /etc/profile.d/
source /etc/profile

qconf -ah G02
	adminhost "G02" already exists
qconf -ah G03
	adminhost "G03" already exists
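At this point the qmaster should already answer queries from the master node; a quick check (sketch):

qconf -sh    # administrative hosts: C01, G02, G03
qconf -ss    # submit hosts
qhost        # execution nodes will only appear here after install_execd has been run on them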

2.2 Installation on the slave (execution) servers

Operations on the G02 slave server:

compute01# yum -y install hwloc-devel
compute01# hostnamectl set-hostname G02

compute01# vi /etc/hosts

192.168.100.1  C01     C01.shhrp
192.168.100.2  G02     G02.shhrp       gpuserver.hengrui.com
192.168.100.3  G03     G03.shhrp

compute01# groupadd -g 490 sgeadmin

The sgeadmin group already existed (probably from an earlier installation), so edit /etc/group with vim and change the sgeadmin GID from 991 to 490.

compute01# useradd -u 495 -g 490 -r -m  -d /home/sgeadmin -s /bin/bash -c "SGE Admin" sgeadmin

This reports that the user already exists, so edit /etc/passwd:

sgeadmin:x:993:991:Grid Engine admin:/:/sbin/nologin

and change it to:

sgeadmin:x:495:490:SGE Admin:/:/bin/bash

Run visudo and add the following line (passwordless, as on the master):

%sgeadmin       ALL=(ALL)       NOPASSWD: ALL

Then:

compute01# export SGE_ROOT=/data/software/gridengine
compute01# export SGE_CELL=default
compute01# cd $SGE_ROOT
compute01# ./install_execd  # accept all the defaults
compute01# cp /data/software/gridengine/default/common/settings.sh /etc/profile.d/

The installation reported this error:

Checking hostname resolving
---------------------------

Cannot contact qmaster. The command failed:

   ./bin/lx-amd64/qconf -sh

The error message was:

   denied: host "pp" is neither submit nor admin host

You can fix the problem now or abort the installation  procedure.
The problem could be:

   - the qmaster is not running
   - the qmaster host is down
   - an active firewall blocks your request

Solution:

qconf -ah pp  # run this on the master node

Add the SGE paths to the environment:

vim /etc/profile

# SGE
export SGE_ROOT=/data/software/gridengine
export PATH="${SGE_ROOT}/bin/lx-amd64:$PATH"

Then:

source /etc/profile
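Once settings.sh has been sourced on the execution node, the node should be visible from the master (a quick check, sketch):

qhost        # the new execution host should now be listed with its load and memory
qstat -f     # its queue instances should not be stuck in the "au" state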

Do the same on G03:

vim /etc/group 

sgeadmin:x:981:

change it to:

sgeadmin:x:490:

vim /etc/passwd

sgeadmin:x:986:981:Grid Engine admin:/:/sbin/nologin

change it to:

sgeadmin:x:495:490:SGE Admin:/:/bin/bash

Then follow the same steps as for G02.

2.3 Finally, check on the master node that everything worked:

qhost

HOSTNAME                ARCH         NCPU NSOC NCOR NTHR  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
----------------------------------------------------------------------------------------------
global                  -               -    -    -    -     -       -       -       -       -
G02                     lx-amd64       32    2   16   32  2.96  125.6G   10.1G  120.0G     0.0
G03                     lx-amd64       72    2   36   72  0.04   94.1G    3.2G    4.0G     0.0

If the output does not look like this, the daemons need to be restarted.

If a reinstall fails:

ps -ef |grep sge

and kill everything related to SGE.

At this point the installation is complete.

2.4 Restarting the daemons

# On the master node:

/etc/init.d/sgemaster.p6444 restart

On the execution nodes:

/etc/init.d/sgeexecd.p6444 restart

A state of "au" means the node has a problem and its execution daemon needs a restart.

Even with the firewall confirmed off, the execution nodes stayed in the "au" state. Checking the SGE processes showed that sge_execd had been started by the user "pp", which is clearly wrong; after killing it, I restarted SGE as root:

[root@g03 ~]# q
queuename                      qtype resv/used/tot. load_avg arch          states
---------------------------------------------------------------------------------
all.q@C01                      BIP   0/0/64         -NA-     lx-amd64		au
---------------------------------------------------------------------------------
all.q@G02                      BIP   0/0/24         -NA-     lx-amd64		au
---------------------------------------------------------------------------------
all.q@G03                      BIP   0/0/64         -NA-     lx-amd64		au

[root@g03 ~]# ps -ef |grep sge
pp       35025     1  0 09:46 ?        00:00:09 /data/software/gridengine/bin/lx-amd64/sge_execd
root     41269 41142  0 13:16 pts/0    00:00:00 grep --color=auto sge
[root@g03 ~]# kill -9 35025
[root@g03 ~]# /etc/init.d/sgeexecd.p6444 start
   Starting Grid Engine execution daemon
[root@g03 ~]# q
queuename                      qtype resv/used/tot. load_avg arch          states
---------------------------------------------------------------------------------
all.q@C01                      BIP   0/0/64         0.75     lx-amd64
---------------------------------------------------------------------------------
all.q@G02                      BIP   0/0/24         0.01     lx-amd64
---------------------------------------------------------------------------------
all.q@G03                      BIP   0/0/64         0.04     lx-amd64

3. Creating and Managing Queues

Further reading: http://www.softpanorama.org/HPC/Grid_engine/sge_queues.shtml

3.1 Common queue-management commands

qconf takes the following options:

View:

-sql              Show a list of all currently defined cluster queues.
-sq queue_list    Display one or more cluster queues or queue instances.

Modify:

-mq queuename     Modify a queue: retrieves the current configuration of the specified queue, launches an editor, and registers the new configuration with sge_qmaster.

Delete:

qconf -dq queue_name

Add:

-Aq fname         Add the queue defined in the file fname to the cluster. The queue name is taken from inside the file, not from fname itself.
-aq queue_name    Add a new queue: qconf retrieves the default queue configuration (see the queue_conf man page) and invokes an editor for customizing it. On leaving the editor the queue is registered with sge_qmaster. A minimal configuration requires only the queue name and the queue hostlist.

Related configuration files:

$SGE_ROOT/$SGE_CELL/common/act_qmaster         Grid Engine master host file
$SGE_ROOT/$SGE_CELL/spool/qmaster/cqueues/    Queues Configuration Directory

3.2 Configuring a GPU queue

By default, SGE puts all nodes (and the CPU cores, i.e. slots, therein) into a single queue, all.q. As a result SGE knows nothing about the GPUs on each node and has no way to schedule jobs onto them. The plan is to:

  1. Make SGE aware of available GPUs;
  2. set every GPU in every node in compute exclusive mode;
  3. split all.q into two queues: cpu.q and gpu.q;
  4. make sure a job running on cpu.q does not access GPUs;
  5. make sure a job running on gpu.q uses only one CPU core and one GPU.

1. Make SGE aware of the GPUs

cd /data/backup
qconf -sc > qconf_sc.txt
cp qconf_sc.txt qconf_sc_gpu.txt 

Open qconf_sc_gpu.txt and add this line.

Option 1 (did not work):

gpu                    gpu                BOOL        ==    FORCED      NO         0        0

This produced the error:

Job 64 does not request 'forced' resource "gpu" of host G02
Job 64 does not request 'forced' resource "gpu" of host G03
verification: no suitable queues

Exiting.

So I changed it to:

gpu                    gpu                BOOL        ==    YES      NO         0        0

Option 2:

Then:

qconf -Mc qconf_sc_gpu.txt

which registers the row containing the new variable.

Output:

root@G03 added "gpu" to complex entry list

Check that the gpu complex now exists:

qconf -sc | grep gpu 

All three servers now show the gpu complex.

2. Setting GPUs in compute-exclusive mode (I am not sure exactly how this step should be done; the command below is for a Rocks cluster, and on a plain node you would run nvidia-smi -c 1 on the node itself)

rocks run host compute 'nvidia-smi -c 1'

The nvidia-smi manual page indicates that this setting does not persist across reboots.

3. Disabling all.q

qconf -sq all.q > all.q.txt

Save the all.q configuration to all.q.txt (it is reused as a template below), then force-disable all.q:

qmod -f -d all.q

Configuring the cpu queue

cp all.q.txt  cpu.q.txt

Edit cpu.q.txt so that it contains:

qname                 cpu.q
hostlist              @allhosts
seq_no                0
load_thresholds       np_load_avg=1.75
suspend_thresholds    NONE
nsuspend              1
suspend_interval      00:05:00
priority              0
min_cpu_interval      00:05:00
processors            UNDEFINED
qtype                 BATCH INTERACTIVE
ckpt_list             NONE
pe_list               make smp mpi
rerun                 FALSE
slots                 1,[G02=32],[G03=72]
tmpdir                /tmp
shell                 /bin/sh
prolog                NONE
epilog                NONE
shell_start_mode      posix_compliant
starter_method        NONE
suspend_method        NONE
resume_method         NONE
terminate_method      NONE
notify                00:00:60
owner_list            NONE
user_lists            NONE
xuser_lists           NONE
subordinate_list      NONE
complex_values        NONE
projects              NONE
xprojects             NONE
calendar              NONE
initial_state         default
s_rt                  INFINITY
h_rt                  INFINITY
s_cpu                 INFINITY
h_cpu                 INFINITY
s_fsize               INFINITY

# Notes on the configuration:

The maximum number of slots each node provides:
slots                                 20

If no per-host count is specified, the default is 1; the bracketed entries give the maximum number of slots on each node:
slots                                 1,[b08=16],[b09=32]

qconf -mhgrp @allhosts    # shows (and lets you edit) the nodes contained in @allhosts. If you change gpu.q's hostlist to
hostlist              G02 G03
then the queue contains only the nodes G02 and G03.

The fields that usually need to be changed include:

  • hostlist lus
  • processors 32
  • slots 32
  • shell /bin/bash
  • pe_list ms

qconf -Aq cpu.q.txt

Configuring the gpu queue

cp all.q.txt  gpu.q.txt

Edit gpu.q.txt so that it contains:

qname                 gpu.q
hostlist              @allhosts
seq_no                0
load_thresholds       np_load_avg=1.75
suspend_thresholds    NONE
nsuspend              1
suspend_interval      00:05:00
priority              0
min_cpu_interval      00:05:00
processors            UNDEFINED
qtype                 BATCH INTERACTIVE
ckpt_list             NONE
pe_list               make smp mpi
rerun                 FALSE
slots                 1,[G02=8],[G03=8]
tmpdir                /tmp
shell                 /bin/sh
prolog                NONE
epilog                NONE
shell_start_mode      posix_compliant
starter_method        NONE
suspend_method        NONE
resume_method         NONE
terminate_method      NONE
notify                00:00:60
owner_list            NONE
user_lists            NONE
xuser_lists           NONE
subordinate_list      NONE
complex_values        gpu=True
projects              NONE
xprojects             NONE
calendar              NONE
initial_state         default
s_rt                  INFINITY
h_rt                  INFINITY

Add the queue:

qconf -Aq gpu.q.txt

Then modify G02 and G03 individually:

qconf -me G02

complex_values        NONE

to:

complex_values        gpu=1
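With gpu.q added and gpu=1 set on the hosts, a job can request the boolean gpu complex explicitly (a minimal sketch; my_gpu_job.sh is a hypothetical script):

# Ask for a slot in gpu.q on a host that offers the "gpu" resource
qsub -cwd -q gpu.q -l gpu=1 my_gpu_job.sh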

To modify a queue's configuration later:

qconf -mq queue_name    # for example: cpu.q

To request a parallel environment at submission time:

qsub -pe gMPI 64

3.3 Adding extra variables (complexes) as resource controls

hard resource_list (the default resource counter)

If a requested complex does not exist, submission fails with an error such as:

Unable to run job: unknown resource "FEP_GPGPU"

3.3.1 Add the variable names

cd /data/user/sam/sge/config
qconf -sc > qconf_sc.txt
cp qconf_sc.txt qconf_sc_new.txt

Edit qconf_sc_new.txt and add two lines:

multisim            ms         INT       <=    YES         YES        0        1000
gpus                g          INT       <=    YES         YES        0        1000

Then load the modified complex configuration:

qconf -Mc qconf_sc_new.txt

root@C01 added "multisim" to complex entry list
root@C01 added "gpus" to complex entry list

If instead the output looks like:

complex with name CANVAS_ELEMENTS or shortcut CANVAS_ELEMENTS already exists
complex with name CANVAS_FULL or shortcut CANVAS_FULL already exists
complex with name CANVAS_SHARED or shortcut CANVAS_SHARED already exists

then the addition probably did not succeed because of duplicate entries; remove the duplicates and try again.

Check that the complexes were added:

qconf -sc | grep multisim 

3.3.2 Update the queues so that they know about these variables

[root@C01 config]# qconf -sc |grep gpus

gpus g INT <= YES YES 0 1000

Modify gpu.q:

qconf -mq gpu.q

and change:

complex_values        gpu=True

to:

complex_values        gpu=True,gpus=16

Modify cpu.q:

qconf -mq cpu.q

complex_values        None

to:

complex_values        multisim=8

3.3.3 Update the execution hosts

qconf -me C01

complex_values        NONE

to:

complex_values        gpus=0,multisim=0

qconf -me G02

complex_values        gpu=TRUE

to:

complex_values        gpu=TRUE,gpus=8,multisim=8

qconf -me G03

complex_values        gpu=TRUE

to:

complex_values        gpu=TRUE,gpus=8,multisim=8
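With the consumables defined on both the queues and the hosts, a job requests them with -l, and SGE decrements the corresponding counts while the job runs (a sketch; fep_job.sh is a hypothetical script):

# Consume one gpus token and one multisim token on a gpu.q host
qsub -cwd -q gpu.q -l gpus=1,multisim=1 fep_job.sh

# Watch how much of each consumable is still available per host
qhost -F gpus,multisim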

4. Common Commands

Official documentation: http://arc.liv.ac.uk/hpc_background/SGE.html

4.1 Configuration commands

1) Managing execution hosts from the command line

qconf -ae hostname    Add an execution host (the execution daemon must already be installed on that host; if the master is also to run jobs, install_execd must be run on it as well)
qconf -de hostname    Delete an execution host
qconf -sel            Show the list of execution hosts

2) Managing administrative hosts

qconf -ah hostname    Add an administrative host
qconf -dh hostname    Delete an administrative host
qconf -sh             Show the list of administrative hosts

3) Managing submit hosts

qconf -as hostname    Add a submit host
qconf -ds hostname    Delete a submit host
qconf -ss             Show the list of submit hosts

4) Managing queues

qconf -aq queuename    Add a cluster queue
qconf -dq queuename    Delete a cluster queue
qconf -mq queuename    Modify a cluster queue's configuration
qconf -sq queuename    Show a cluster queue's configuration
qconf -sql             List the cluster queues

5) Managing host groups

qconf -ahgrp groupname    Add a host group
qconf -mhgrp groupname    Modify a host group's members
qconf -shgrp groupname    Show a host group's members

6) Managing parallel environments

qconf -ap PE_name    Add a parallel environment
qconf -mp PE_name    Modify a parallel environment
qconf -dp PE_name    Delete a parallel environment
qconf -sp PE_name    Show a parallel environment
qconf -spl           List the names of all parallel environments

Most day-to-day tuning is done by modifying the queue configuration and the host group configuration. To view a queue's contents: qconf -sq all.q

4.2 Submitting jobs to a specific queue (main.q)

Method 1: qsub -cwd -l vf=*G -q main.q *.sh
Method 2: qsub -cwd -S /bin/bash -l vf=*G -q main.q *.sh

-cwd runs the job in the current directory; SGE's output and error logs are written there.

-l vf=*G is the job's estimated memory. The estimate should be slightly larger than the real usage; underestimating it can bring a node down.

-q specifies the target queue, for example -q gpu.q. If it is omitted, SGE picks any queue the user is allowed to use that satisfies the job's requirements.

Note: both methods submit to the specified queue, but Method 1 may print the warning "Warning: no access to tty (Bad file descriptor). Thus no job control in this shell." This happens because SGE defaults to tcsh while the *.sh script uses bash, so the interpreter should be stated at submission time. If you must use Method 1, add "#$ -S /bin/bash" at the top of the *.sh script.
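Instead of repeating these options on every qsub call, they can also be embedded in the script header as #$ directives (a sketch; the queue and memory values are only examples):

#!/bin/bash
#$ -S /bin/bash      # interpreter for the job (avoids the tty warning above)
#$ -cwd              # run, and write the logs, in the submission directory
#$ -q main.q         # target queue
#$ -l vf=2G          # estimated memory
echo "running on $(hostname)"

The script is then submitted with a plain qsub script.sh.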

Submitting jobs:

a single core job:

qsub -q comp

a 16-core shared-memory job:

qsub -q comp -l cores=16

a 64-process parallel job:

qsub -pe MPI 64

a large shared memory job:

qsub -q himem -l cores=16

a single gpu job:

qsub -q gpu

a 4-gpu shared-memory job:

qsub -q gpu -l cores=4

a 64-gpu parallel job:

4.3 Submitting jobs to a specific node

qsub -cwd -l vf=*G -l h=node1 *.sh
qsub -cwd -l vf=*G -l h=node1 -P project -q main.q *.sh
-P specifies the project the job belongs to.

qsub -cwd -e /dev/null myscript.sh

4.4 Querying jobs

qstat -f            show all queues and jobs (full listing)
qstat -j jobId      show a job by its ID
qstat -u user       show jobs of a given user
qstat -f -u '*'     show all users' jobs, by queue
qstat -a            show all users' jobs

Host / queue-instance states:

1) 'au' – Host is in alarm and unreachable.
2) 'u'  – Host is unreachable; usually SGE or the machine itself is down. Check this.
3) 'a'  – Host is in alarm. This is normal when the node is heavily loaded, i.e. using most of its resources.
4) 'as' – Host is in alarm and suspended. When the node is using most of its resources, SGE suspends it from taking new jobs until resources free up.
5) 'd'  – Host is disabled.
6) 'E'  – Error; this requires 'qmod -c' to clear the error state.

If a host is in a bad state:

Disabled state “d” will persist until cleared

If you believe a node has a problem:

you can disable the queue instance on that node:

  • Will NOT affect any running jobs on that node
  • WILL block any new work from landing there
  • Disabled state “d” will persist until cleared

Command:

qmod -d <queue name>

To re-enable:

qmod -e <queue name>

Job states:

qw     waiting in the queue
Eqw    an error occurred while submitting the job
r      the job is running
dr     appears when a job is deleted after its node went down; it only disappears once the node is back up
'w' – job waiting
's' – job suspended
't' – job transferring and about to start
'r' – job running
'h' – job on hold
'R' – job restarted
'd' – job has been marked for deletion

Clear the error state of a problematic job:

qmod -c job_id

4.5 Deleting jobs

qdel 1111   delete the job with ID 1111

4.6 Other commands

qrsh    submits jobs interactively, in contrast to qsub. Note this option:

-now yes|no    defaults to yes.
With yes, the job is scheduled immediately; if no resources are available it is rejected, the submission fails, and the job ends up in state Eqw.
With no, the job is queued when no resources are available and waits to be scheduled.

Example: qrsh -l vf=*G -q all.q -now no -w n *.sh
qacct     extract accounting information from the cluster logs
qalter    change the attributes of a submitted job that is still pending
qconf     the user interface for cluster and queue configuration
qhold     hold back a submitted job from execution
qhost     show status information about the SGE execution hosts (the compute nodes)
qlogin    start a telnet or similar login session

See why a job failed to be scheduled:

qalter -w v jobid

List the SGE execution hosts:

qhost

Check GPU usage:

qhost -F gpu

5. Testing

5.1 A first test

#vi uname.sge 
#!/bin/bash
uname -a

# qsub uname.sge

Your job 3557 ("uname.sge") has been submitted

If the job runs, two files appear in your home directory on one of the execution nodes (I am using the sam account, so /home/sam); the output is in the uname.sge.o<job_id> file.

Because -cwd was not given, the output and error files go to the user's home directory by default.

However, the jobs sat there for a long time with no result:

[sam@C01 test]$ qstat -f

queuename                      qtype resv/used/tot. load_avg arch          states
---------------------------------------------------------------------------------
all.q@G02                      BIP   0/0/32         -NA-     -NA-          au
---------------------------------------------------------------------------------
all.q@G03                      BIP   0/0/72         -NA-     -NA-          au

############################################################################
 - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
############################################################################
      2 0.55500 uname.sge  sam          qw    03/16/2018 10:15:31     1        
      3 0.55500 uname.sge  sam          qw    03/16/2018 10:17:43     1        
      4 0.55500 uname.sge  sam          qw    03/16/2018 10:18:23     1        
      5 0.55500 uname.sge  sam          qw    03/16/2018 10:22:07     1        
      6 0.55500 uname.sge  sam          qw    03/16/2018 10:24:46     1 

SGE sends a mail describing the error; in my case the problem appeared to be that the corresponding user folder did not exist under /home on the execution node.
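If that is indeed the cause, the jobs stay pending (or fail) until the execution nodes can write the output files. One hedged fix, assuming home directories are not NFS-shared, is to create the missing directory on each execution node:

# As root on each execution node: make sure the submitting user's home exists
mkdir -p /home/sam && chown sam /home/sam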

6. Discussion

6.1 SGE and NFS user management

  1. SGE user management: SGE identifies the same user across nodes by user name. If a job submitted by user1 on node A is to run on node B, node B must also have a user named user1; when the job runs, node B executes it as user1.
  2. NFS user management: NFS identifies the same user by numeric user ID. If a user with UID 1000 on node A writes a file into an NFS directory, then on every other host sharing that directory the file's owner is whichever account has UID 1000. This exposes a conflict: SGE matches users across nodes by name, while NFS matches them by UID. Since our system needs several nodes to cooperate, the following can happen: user1 on host A finishes part of a job and creates a directory on NFS, and user1 on host B then needs to write into it; if user1 has different UIDs on A and B, NFS treats them as different users, so B's user1 has no write permission on the directory created by A's user1 (on host B that directory belongs to whichever account has A's UID). To avoid this, every machine must give identically named users identical UIDs (a sketch follows below).
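One hedged way to enforce this is to create every user with the same explicit UID/GID on all nodes (or simply let NIS distribute the accounts), mirroring what was done for sgeadmin above; the numeric IDs below are only examples:

# Run on every node with the same numbers
groupadd -g 1001 user1
useradd  -u 1001 -g 1001 -m user1
# Verify on each node
id user1    # should print uid=1001(user1) gid=1001(user1) everywhere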

6.2 Adding another execution node to the cluster

Hostname: node4, IP: 172.16.192.200

Add the following line to /etc/hosts on all the other machines:

172.16.192.200 node4.bnu.edu.cn node4

On the master server:

qconf -ah node4    # add node4 as an administrative host first

Then repeat the G02 procedure, taking care to keep the user IDs consistent.

6.3 unable to send message to qmaster using port 6444 on host

[root@g02 test]# qsub -cwd uname.sge
error: commlib error: got select error (Connection refused)
Unable to run job: unable to send message to qmaster using port 6444 on host "G02": got send error
Exiting.

The problem can be:

  • the qmaster is not running
  • the qmaster host is down
  • an active firewall blocks your request

I reinstalled SGE on G02 and G03. It worked at first, then broke again. It turned out that qhost was still resolving to the binary from the old installation, which `which` made obvious:

which qhost

/opt/sge/bin/lx-amd64/qhost

So comment out the old entry in /etc/profile (so that PATH picks up the new installation's binaries):

vim /etc/profile

#export SGE_ROOT=/opt/sge
export PATH="${SGE_ROOT}/bin/lx-amd64:$PATH"

source /etc/profile

Remember to update the queue's hostlist as well.

6.4 scheduling info errors

scheduling info: (-l FEP_GPGPU=16,gpus=1) cannot run in queue "G02" because it offers only hc:gpus=0.000000

The scheduling info explains, based on the currently available resources, why the job is not running yet. Here the problem is that the gpus resource has been used up.

If you believe resources are actually still available, adjust the value of this resource on the execution host and in the queue:

qconf -me G02 
qconf -mq gpu.q 

6.5 cannot run in PE "smp" because it only offers 0 slots

When submitting with the smp PE, this error appeared:

cannot run in queue "all.q" because it is not contained in its hard queue list (-q)
cannot run in queue "cpu.q" because it is not contained in its hard queue list (-q)
cannot run in PE "smp" because it only offers 16 slots

This was puzzling for a long time: the job was submitted to gpu.q, so why do all.q and cpu.q show up in the messages? Running

qalter -w v jobid

makes the problem clear: gpu.q did not have enough slots, so SGE checked whether any other queue could be used, found that no other queue had been requested, and listed all of these findings. So why were there not enough slots?

1. Make sure the PE is installed

qconf -spl

and check whether smp is listed.

2. Modify the smp configuration (needed for both G02 and G03)

  • qconf -mp smp    modify an existing PE
  • qconf -ap smp    add a new PE

Use modify or add depending on whether the PE already exists. Change:

slots  0

to:

slots  999

Check that the change took effect:

qconf -sp smp

allocation_rule    $pe_slots

Change it to:

allocation_rule    $round_robin

$pe_slots requires all slots allocated to a job to come from the same node; $round_robin or $fill_up allow the slots to come from different nodes.
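For reference, a minimal smp PE definition consistent with the changes above might look like this (a sketch of qconf -sp smp output; only slots and allocation_rule differ from the defaults):

pe_name            smp
slots              999
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    $round_robin
control_slaves     FALSE
job_is_first_task  TRUE
urgency_slots      min
accounting_summary FALSE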

3. Add smp to the queue's pe_list

qconf -mq gpu.q

and change pe_list to:

pe_list               smp make mpich mpi openmpi orte lam_loose_rsh pvm matlab

4. Adjust the number of slots requested by the job

qargs: -q gpu.q -l gpus=1 -pe smp 16

The job requests 16 slots, but G02 and G03 offer only 8 slots each in gpu.q, so of course there are not enough slots and the error appears.

6.6 The debugging process

Errors occur at two levels: the daemon/system level and the job level. The following are practical ways to locate problems quickly.

For system-level problems, start with:

qstat -f

1. Log files

SGE messages and logs are usually very helpful

$SGE_ROOT/default/spool/qmaster/messages
$SGE_ROOT/default/spool/qmaster/schedd/messages

Execd spool logs often hold job specific error data

Remember that local spooling may be used (!)
$SGE_ROOT/default/spool/<node>/messages

SGE panic location

SGE will log to /tmp on any node when $SGE_ROOT is not found or not writable.
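A quick way to watch these logs while reproducing a failure (sketch):

# On the master
tail -f $SGE_ROOT/default/spool/qmaster/messages
# On an execution node (the exact path depends on whether local spooling is used)
tail -f $SGE_ROOT/default/spool/$(hostname)/messages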

2. Show problems quickly

qsub -w v [other options]

This reports problems immediately, for example:

qsub -w v -cwd -e error12 -q cpu.q simple.sh

Have SGE mail you when a job aborts, explaining why it failed:

qsub -m a -M user@host [rest of command]

3. Inspect a specific job

qstat -j job_id

The error section of the output tells you what went wrong.

6.7 Bash scripts and Linux environment variables

An example job script:

#!/bin/bash
 
# SGE Options
#$ -S /bin/bash
#$ -N MyJob
 
# Create Working Directory
WDIR=/state/partition1/$USER/$JOB_NAME-$JOB_ID
mkdir -p $WDIR
if [ ! -d $WDIR ]
then
  echo $WDIR not created
  exit
fi
cd $WDIR
 
# Copy Data and Config Files
cp $HOME/Data/FrogProject/FrogFile .
 
# Put your Science related commands here
/share/apps/runsforever FrogFile
 
# Copy Results Back to Home Directory
RDIR=$HOME/FrogProject/Results/$JOB_NAME-$JOB_ID
mkdir -p $RDIR
cp NobelPrizeWinningResults $RDIR
 
# Cleanup
rm -rf $WDIR

To make sure the script finds its environment variables at run time, it is best to add the following two lines at the top of any submitted bash script (see the earlier note on -S /bin/bash for the reason):

#! /bin/bash
#$ -S /bin/bash

6.8 Basic SGE concepts

node: A host computer in the cluster. A node may have multiple processors. Cetus's nodes are named cetus01, cetus02, etc.
core: A CPU may have multiple cores. A core is often thought of as a separate CPU in its own right, but technically it is not. It does function very much like a separate CPU, but since it shares a CPU with other cores it does not perform quite as well as a separate CPU would. Nonetheless, performance of multiple cores is close to that of the equivalent number of (single-core) CPUs, and they are designed to function much like separate CPUs, so it is often practical to think of them as such.
queue: An object which may contain a prioritized list of jobs for which the requested resources can be satisfied. A queue also has a list of nodes available to it, and a limit on the number of jobs from this queue that may run on each node.
queue instance: A sort of mini-queue that is the branch of a queue on a particular node. For example, the test.q queue may have instances test.q@cetus01, test.q@cetus02, etc.
slots: For a queue, the number of jobs from this queue that may run on a node; in other words, the number of jobs that may run in each of this queue's instances.
consumable resource: A (usually numeric) resource, such as disk space or memory, of which each job may consume a portion. A resource may be attributed to the cluster as a whole, to a queue, or to a node. When a job is being scheduled, if it requests more than the available amount of a resource, the job is deferred until sufficient resources are available.
