3.2.4 pick_otus.py (QIIME)

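pick_otus.py groups identical or similar sequences into OTUs (operational taxonomic units) according to a user-defined similarity threshold; each OTU is treated as one species.

QIIME currently provides the following clustering methods: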

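  1. cd-hit: uses a longest-sequence-first list removal algorithm
  2. blast: clusters against reference sequences
  3. Mothur: requires aligned sequences as input; clusters with a nearest-neighbor, furthest-neighbor, or average-neighbor algorithm
  4. prefix/suffix
  5. Trie
  6. uclust: builds "seed" sequences
  7. uclust_ref: requires reference sequences
  8. usearch: builds "seed" sequences, clusters by percent sequence identity, and filters out low-abundance clusters
  9. usearch_ref
  10. usearch61
  11. usearch61_ref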
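
The "seed" idea mentioned for uclust and usearch can be illustrated with a small sketch. This is not the uclust algorithm itself: it assumes equal-length sequences and a naive position-by-position identity, whereas the real tools align sequences and use word-based heuristics. The function names `identity` and `greedy_seed_cluster` are made up for this illustration.

```python
def identity(a, b):
    """Fraction of matching positions.

    Assumption for illustration only: sequences of different length
    are treated as non-matching instead of being aligned.
    """
    if len(a) != len(b) or not a:
        return 0.0
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / len(a)


def greedy_seed_cluster(seqs, threshold=0.97):
    """Greedy seed-based clustering.

    seqs is a list of (seq_id, sequence) pairs. Each sequence joins the
    first seed it matches at >= threshold; otherwise it becomes a new seed.
    Returns a list of clusters, each a list of sequence IDs.
    """
    seeds = []      # (seed_sequence, index into clusters)
    clusters = []   # list of lists of sequence IDs
    for seq_id, seq in seqs:
        for seed_seq, idx in seeds:
            if identity(seq, seed_seq) >= threshold:
                clusters[idx].append(seq_id)
                break
        else:
            # No existing seed was close enough: start a new cluster.
            seeds.append((seq, len(clusters)))
            clusters.append([seq_id])
    return clusters
```

Because assignment is greedy, the result depends on input order, which is one reason these tools sort sequences (for example by abundance or length) before clustering.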

I. pick_otus.py parameters

-i   Input FASTA file
-m   Clustering method: mothur, trie, uclust_ref, usearch, usearch_ref, blast, usearch61, usearch61_ref, prefix_suffix, cdhit, or uclust. Mothur requires aligned sequences. Default: uclust
-o   Name and path of the output directory
-s   Sequence similarity threshold, i.e. the threshold for grouping sequences into one OTU. Default: 0.97. Applies to blast, cdhit, uclust, uclust_ref, usearch, usearch_ref, usearch61, and usearch61_ref
-z   Enable reverse strand matching. Applies to uclust, uclust_ref, usearch, usearch_ref, usearch61, and usearch61_ref; doubles the amount of memory used. Default: False
-r   Reference sequences, used when the method is blast, uclust_ref, usearch_ref, or usearch61_ref

Note: for more than 100,000 sequences, prefiltering the data beforehand is recommended to reduce the computational load.
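
To show how these options fit together, here is a minimal sketch that assembles a pick_otus.py argument list in Python. Only the flags listed above (-i, -o, -m, -s, -r, -z) are used; the helper name `build_pick_otus_cmd` and the file names in the usage note are hypothetical, and the check for a reference file is my own convenience, not something QIIME requires in this form.

```python
# Methods that need reference sequences via -r, per the parameter list above.
REF_METHODS = ("uclust_ref", "usearch_ref", "usearch61_ref")


def build_pick_otus_cmd(input_fasta, output_dir, method="uclust",
                        similarity=0.97, reference=None, rev_strand=False):
    """Return the pick_otus.py command as a list of arguments."""
    if method in REF_METHODS and reference is None:
        raise ValueError("method %s needs reference sequences (-r)" % method)
    args = ["pick_otus.py", "-i", input_fasta, "-o", output_dir,
            "-m", method, "-s", str(similarity)]
    if reference is not None:
        args += ["-r", reference]
    if rev_strand:
        # Reverse strand matching; doubles memory use.
        args.append("-z")
    return args
```

The returned list can be handed to `subprocess.run(...)`, for example `build_pick_otus_cmd("seqs.fna", "picked_otus/", method="uclust_ref", reference="refs.fna")`.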

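II. Clustering algorithm options

Mothur

-c   Clustering algorithm

When the Mothur method is selected, choose one of the corresponding algorithms: furthest, nearest, or average.

The default is furthest.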
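
The three algorithms differ only in how the distance between two clusters is derived from the pairwise sequence distances. A minimal sketch, assuming the pairwise distances are already available in a nested dict (the function name and data layout are illustrative, not Mothur's API):

```python
def cluster_distance(cluster_a, cluster_b, dist, linkage="furthest"):
    """Distance between two clusters under a given linkage rule.

    dist[a][b] is the pairwise distance between sequences a and b.
    """
    pairwise = [dist[a][b] for a in cluster_a for b in cluster_b]
    if linkage == "nearest":
        return min(pairwise)                   # closest pair decides
    if linkage == "furthest":
        return max(pairwise)                   # most distant pair decides
    if linkage == "average":
        return sum(pairwise) / len(pairwise)   # mean over all pairs
    raise ValueError("unknown linkage: %s" % linkage)
```

With furthest-neighbor linkage, two clusters merge only if every pair of sequences across them is within the threshold, so it gives the tightest OTUs of the three.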

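cd-hit

-M   Maximum memory available to cd-hit, in MB. Default: 400

-n   Prefix length used for prefiltering (default: None; 100 is a good value)

Sequences whose first 100 bases are identical are grouped together, one representative sequence is picked from each group, and only the representatives are clustered with cd-hit. This saves computation time; 100 is the usual choice.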
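
The prefiltering step can be sketched in a few lines. This is an illustration of the idea rather than QIIME's code: sequences are grouped by their first `prefix_length` bases, and the first member of each group is kept as the representative to pass on to the clustering tool. The function name is hypothetical.

```python
def prefix_prefilter(seqs, prefix_length=100):
    """Group sequences sharing the same first prefix_length bases.

    seqs is a list of (seq_id, sequence) pairs.
    Returns (groups, representatives): groups maps each prefix to the
    IDs that share it; representatives holds one ID per group.
    """
    groups = {}
    for seq_id, seq in seqs:
        groups.setdefault(seq[:prefix_length], []).append(seq_id)
    # Keep the first sequence seen in each group as its representative.
    representatives = [ids[0] for ids in groups.values()]
    return groups, representatives
```

Only the representatives need to be clustered; every other sequence inherits the OTU of its representative afterwards.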

blast

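-b   BLAST database, supplied when using the blast method

--min_aligned_percent   Minimum fraction of the query sequence that must be aligned for a match to count as a hit when using blast. Default: 0.5

-e   Maximum E-value when using blast. Default: 1e-10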
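
As a rough illustration of how these thresholds act together, the sketch below accepts a BLAST match only if it passes the aligned-fraction, E-value, and similarity cutoffs. How QIIME combines them internally is an assumption on my part, and the function and argument names are hypothetical.

```python
def is_hit(aligned_length, query_length, evalue, percent_id,
           min_aligned_percent=0.5, max_e_value=1e-10, similarity=0.97):
    """Return True if a BLAST match passes all three cutoffs.

    percent_id and similarity are fractions between 0 and 1.
    """
    aligned_fraction = aligned_length / query_length
    return (aligned_fraction >= min_aligned_percent
            and evalue <= max_e_value
            and percent_id >= similarity)
```

For example, a 250 bp query aligned over only 100 bp has an aligned fraction of 0.4 and would be rejected at the default of 0.5, however good its E-value is.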

trie

-q   Reverse the sequences when using trie. Default: False
-t   Prefilter the data when using trie. Default: False
-n   Prefix length used for prefiltering (default: None; 100 is a good value)

prefix_suffix

-p   Prefix length for the prefix_suffix method. Default: 50
-u   Suffix length for the prefix_suffix method. Default: 50
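
A simplified sketch of what prefix/suffix clustering amounts to: sequences that share the same first `prefix_length` bases and the same last `suffix_length` bases land in one cluster. QIIME's actual implementation may differ in details; the function name is made up for this example.

```python
def prefix_suffix_cluster(seqs, prefix_length=50, suffix_length=50):
    """Cluster sequences with identical prefix and suffix.

    seqs is a list of (seq_id, sequence) pairs.
    Returns a list of clusters, each a list of sequence IDs.
    """
    clusters = {}
    for seq_id, seq in seqs:
        prefix = seq[:prefix_length]
        # seq[-0:] would return the whole sequence, so handle 0 explicitly.
        suffix = seq[-suffix_length:] if suffix_length > 0 else ""
        clusters.setdefault((prefix, suffix), []).append(seq_id)
    return list(clusters.values())
```

No similarity threshold is involved here, so this is fast but only collapses sequences whose ends match exactly.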

uclust, usearch

-D   When using uclust or uclust_ref, suppress presorting of sequences by abundance. Default: False

-A, --optimal_uclust

Pass the --optimal flag to uclust for uclust OTU picking. [default: False]

-E, --exact_uclust

Pass the --exact flag to uclust for uclust OTU picking. [default: False]

-B, --user_sort

Pass the --user_sort flag to uclust for uclust OTU picking. [default: False]

-C, --suppress_new_clusters

Suppress creation of new clusters using seqs that don’t match reference when using -m uclust_ref, -m usearch61_ref, or -m usearch_ref [default: False]

--max_accepts

Max_accepts value to uclust, uclust_ref, usearch61, and usearch61_ref. By default, will use value suggested by method (uclust: 20, usearch61: 1) [default: default]

--max_rejects

Max_rejects value for uclust, uclust_ref, usearch61, and usearch61_ref. With default settings, will use value recommended by clustering method used (uclust: 500, usearch61: 8 for usearch_fast_cluster option, 32 for reference and smallmem options) [default: default]

--stepwords

Stepwords value to uclust and uclust_ref [default: 20]

--word_length

Word length value for uclust, uclust_ref, usearch, usearch_ref, usearch61, and usearch61_ref. With the default setting, the value recommended by the method is used (uclust: 12, usearch: 64, usearch61: 8). An int value can be supplied to override this setting. [default: default]

--uclust_otu_id_prefix

OTU identifier prefix (string) for the de novo uclust OTU picker and for new clusters when uclust_ref is used without -C [default: denovo, OTU ids are ascending integers]

--suppress_uclust_stable_sort

Don’t pass --stable-sort to uclust [default: False]

--suppress_uclust_prefilter_exact_match

Don’t collapse exact matches before calling uclust [default: False]

-d, --save_uc_files

Enable preservation of intermediate uclust (.uc) files that are used to generate clusters via uclust. Also enables preservation of all intermediate files created by usearch and usearch61. [default: True]

-j, --percent_id_err

Percent identity threshold for cluster error detection with usearch. [default: 0.97]

-g, --minsize

Minimum cluster size for size filtering with usearch. [default: 4]

-a, --abundance_skew

Abundance skew setting for de novo chimera detection with usearch. [default: 2.0]

-f, --db_filepath

Reference database of fasta sequences for reference based chimera detection with usearch. [default: None]

--perc_id_blast

Percent ID for mapping OTUs created by usearch back to original sequence IDs [default: 0.97]

-l, --suppress_cluster_size_filtering

Suppress cluster size filtering in usearch. [default: False]

--usearch_fast_cluster

Use fast clustering option for usearch or usearch61_ref with new clusters. --enable_rev_strand_match cannot be enabled with this option, and the only valid option for usearch61_sort_method is ‘length’. This option uses more memory than the default option for de novo clustering. [default: False]

--minlen

Minimum length of sequence allowed for usearch, usearch_ref, usearch61, and usearch61_ref. [default: 64]

--usearch61_sort_method

Sorting method for usearch61 and usearch61_ref. Valid options are abundance, length, or None. If the --usearch_fast_cluster option is enabled, the only sorting method allowed is length. [default: abundance]

--threads

Specify number of threads per core to be used for usearch61 commands that utilize multithreading. By default, will calculate the number of cores to utilize so a single thread will be used per CPU. Specify a fractional number, e.g. 1.0 for 1 thread per core, or 0.5 for a single thread on a two core CPU. Only applies to usearch61. [default: one_per_cpu]

--de_novo_chimera_detection

Deprecated: de novo chimera detection performed by default, pass --suppress_de_novo_chimera_detection to disable. [default: None]

-k, --suppress_de_novo_chimera_detection

Suppress de novo chimera detection in usearch. [default: False]

--reference_chimera_detection

Deprecated: Reference based chimera detection performed by default, pass --suppress_reference_chimera_detection to disable [default: None]

-x, --suppress_reference_chimera_detection

Suppress reference based chimera detection in usearch. [default: False]

--cluster_size_filtering

Deprecated, cluster size filtering enabled by default, pass --suppress_cluster_size_filtering to disable. [default: None]

--remove_usearch_logs

Disable creation of logs when usearch is called. Up to nine logs are created, depending on filtering steps enabled. [default: False]

--derep_fullseq

Dereplication of full sequences, instead of subsequences. Faster than the default --derep_subseqs in usearch. [default: False]

-F, --non_chimeras_retention

Selects subsets of sequences detected as non-chimeras to retain after de novo and reference based chimera detection. Options are intersection or union. union will retain sequences that are flagged as non-chimeric from either filter, while intersection will retain only those sequences that are flagged as non-chimeras from both detection methods. [default: union]

--sizeorder

Enable size based preference in clustering with usearch61. Requires that --usearch61_sort_method be abundance. [default: False]
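
The difference between the union and intersection settings of --non_chimeras_retention described above is plain set logic. A minimal sketch, with a hypothetical function name and made-up sequence IDs in the usage note:

```python
def retain_non_chimeras(de_novo_ok, reference_ok, retention="union"):
    """Combine the non-chimeric sets from the two detection methods.

    de_novo_ok / reference_ok: IDs flagged as non-chimeric by de novo
    and reference based detection, respectively.
    """
    de_novo_ok = set(de_novo_ok)
    reference_ok = set(reference_ok)
    if retention == "union":
        # Kept if either filter considers the sequence non-chimeric.
        return de_novo_ok | reference_ok
    if retention == "intersection":
        # Kept only if both filters consider it non-chimeric.
        return de_novo_ok & reference_ok
    raise ValueError("retention must be 'union' or 'intersection'")
```

With `de_novo_ok = {"seq1", "seq2"}` and `reference_ok = {"seq2", "seq3"}`, union keeps all three sequences while intersection keeps only seq2, so intersection is the stricter choice.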

III. Output files: seqs_otus.txt and seqs_otus.log

The first column is the OTU ID; the remaining columns are the IDs of the sequences assigned to that OTU.

0       seq1          seq5         
1       seq2                  
2       seq3                  
3       seq4          seq6          seq7
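
A file in this layout is easy to load into a dictionary. The sketch below is a standalone parser written for this post, not a QIIME function; it splits on any whitespace so it accepts both tab-separated files and the space-padded sample above, and it assumes sequence IDs contain no whitespace.

```python
def parse_otu_map(lines):
    """Parse OTU map lines into {otu_id: [seq_id, ...]}."""
    otu_map = {}
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        otu_map[fields[0]] = fields[1:]
    return otu_map


sample = [
    "0       seq1          seq5",
    "1       seq2",
    "2       seq3",
    "3       seq4          seq6          seq7",
]
otu_map = parse_otu_map(sample)
# Number of sequences per OTU, e.g. for a quick abundance check.
otu_sizes = {otu_id: len(ids) for otu_id, ids in otu_map.items()}
```

For a real run, pass an open file handle on the OTU map instead of the `sample` list.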

The command I used:

/usr/lib/qiime/bin/pick_otus.py -i good.fasta -m cdhit -o cdhit_picked_otus/ -n 100

References:

http://qiime.org/scripts/pick_otus.html
