[3.3.2] Sequence clustering (MMseqs2)

1. Introduction

2. Download and installation

wget https://mmseqs.com/latest/mmseqs-linux-avx2.tar.gz
tar xvfz mmseqs-linux-avx2.tar.gz

export PATH=$(pwd)/mmseqs/bin/:$PATH
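
The archive above is the AVX2 build, so it only runs on CPUs that support AVX2. The sketch below shows one way to check the CPU flags on Linux before downloading. The helper name pick_mmseqs_build is hypothetical, and the SSE4.1 and SSE2 archive names are assumptions based on the naming of the AVX2 archive; confirm them on the download page before relying on them.

```shell
# Hypothetical helper: map a string of CPU flags to an archive name.
# Only the avx2 archive name appears in this post; the other two are assumed.
pick_mmseqs_build() {
    case " $1 " in
        *" avx2 "*)   echo "mmseqs-linux-avx2.tar.gz" ;;
        *" sse4_1 "*) echo "mmseqs-linux-sse41.tar.gz" ;;
        *)            echo "mmseqs-linux-sse2.tar.gz" ;;
    esac
}

# On x86 Linux the flags are listed in /proc/cpuinfo (empty string elsewhere).
flags=$(grep -m1 '^flags' /proc/cpuinfo 2>/dev/null | cut -d: -f2)
echo "Suggested build: $(pick_mmseqs_build "$flags")"
```

If the suggestion is not the AVX2 archive, the wget and tar commands above would need the other file name.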

3. Usage

mmseqs easy-cluster examples/DB.fasta result tmp
# Cluster output
#  - result_rep_seq.fasta: Representatives
#  - result_all_seq.fasta: FASTA-like per cluster
#  - result_cluster.tsv:   Adjacency list
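
The adjacency list in result_cluster.tsv has two tab-separated columns: the cluster representative and one member of that cluster, one pair per line. A minimal sketch of summarizing cluster sizes from such a file follows. The sequence IDs and the file name toy_cluster.tsv are made up for illustration, and cluster_sizes is a hypothetical helper, not part of MMseqs2.

```shell
# Toy adjacency list standing in for result_cluster.tsv (made-up IDs).
printf 'seqA\tseqA\nseqA\tseqB\nseqA\tseqC\nseqD\tseqD\n' > toy_cluster.tsv

# Count members per representative and print "size<TAB>representative",
# largest cluster first, ties broken by representative name.
cluster_sizes() {
    awk -F'\t' '{ n[$1]++ } END { for (r in n) print n[r] "\t" r }' "$1" |
        sort -k1,1nr -k2,2
}

cluster_sizes toy_cluster.tsv
```

Running the same function on a real result_cluster.tsv gives a quick view of how many singletons and large clusters a given --min-seq-id and -c setting produces.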

# Important parameters: --min-seq-id, --cov-mode and -c

--min-seq-id FLOAT              List matches above this sequence identity (for clustering) (range 0.0-1.0) [0.000]
-c FLOAT                        List matches above this fraction of aligned (covered) residues (see --cov-mode) [0.800]

examples:

 # Cascaded clustering of FASTA file
 mmseqs cluster sequenceDB clusterDB tmp

 #                  --cov-mode
 # Sequence         0    1    2
 # Q: MAVGTACRPA  60%  IGN  60%
 # T: -AVGTAC---  60% 100%  IGN
 # Cutoff -c 0.7    -    +    -
 #        -c 0.6    +    +    +
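
The numbers in the table above can be reproduced by hand: the query has 10 residues, the target has 6, and 6 residues are aligned. The sketch below assumes that mode 0 divides the aligned length by the longer of the two sequences, mode 1 by the target length, and mode 2 by the query length, which matches the table. The helpers coverage and passes are hypothetical and only illustrate the arithmetic; they are not how MMseqs2 is implemented.

```shell
# coverage QLEN TLEN ALNLEN MODE -> coverage fraction with two decimals
coverage() {
    awk -v q="$1" -v t="$2" -v a="$3" -v m="$4" 'BEGIN {
        if (m == 0)      d = (q > t) ? q : t   # longer of query and target
        else if (m == 1) d = t                 # target length
        else             d = q                 # query length
        printf "%.2f\n", a / d
    }'
}

# passes QLEN TLEN ALNLEN MODE CUTOFF -> exit status 0 if coverage >= cutoff
passes() {
    awk -v cov="$(coverage "$1" "$2" "$3" "$4")" -v c="$5" \
        'BEGIN { exit !(cov >= c) }'
}

# Q: MAVGTACRPA (10 residues), T: AVGTAC (6 residues), 6 aligned residues
for c in 0.7 0.6; do
    for mode in 0 1 2; do
        if passes 10 6 6 "$mode" "$c"; then verdict="+"; else verdict="-"; fi
        echo "-c $c --cov-mode $mode: coverage $(coverage 10 6 6 "$mode") $verdict"
    done
done
```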

--threads: number of parallel threads

--rescore-mode INT              Rescore diagonals with:
                             0: Hamming distance
                             1: local alignment (score only)
                             2: local alignment
                             3: global alignment
                             4: longest alignment fulfilling window quality criterion [0]


--cluster-mode INT              0: Set-Cover (greedy)
                             1: Connected component (BLASTclust)
                             2,3: Greedy clustering by sequence length (CDHIT) [0]

4. My example

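/data/software/mmseqs/mmseqs/bin/mmseqs createdb examples/DB.fasta DB  # cluster expects a sequence database, not a FASTA file
/data/software/mmseqs/mmseqs/bin/mmseqs cluster DB clusterRes tmp --min-seq-id 0.93 -c 0.8 --cov-mode 0 --threads 30 --rescore-mode 3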

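References

I run a personal WeChat public account, but I am lazy and rarely update it. You can ask questions there; if I am slow to reply, email me at tiehan@sina.cn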
