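3.2.4 pick_otus.py (QIIME)
pick_otus.py groups identical or similar sequences into one OTU (operational taxonomic unit), which is treated as one species, according to a user-defined similarity threshold.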
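QIIME currently supports the following clustering methods:
- cdhit: uses the longest-sequence-first list removal algorithm
- blast: assigns sequences to OTUs based on reference sequences
- mothur: requires aligned sequences as input; clusters with the nearest-neighbor, furthest-neighbor, or average-neighbor algorithm
- prefix_suffix
- trie
- uclust: builds “seed” sequences
- uclust_ref: requires reference sequences
- usearch: builds “seed” sequences based on percent sequence identity and filters out low-abundance clusters
- usearch_ref
- usearch61
- usearch61_ref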
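The “seed” idea behind uclust and usearch can be illustrated with a toy greedy clusterer. This is only a minimal sketch, not the actual uclust or usearch algorithm: it assumes unaligned, position-by-position comparison, and the function names `identity` and `greedy_seed_clusters` as well as the sample reads are made up for illustration.

```python
def identity(a, b):
    """Fraction of matching positions over the shorter sequence (no alignment)."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / n


def greedy_seed_clusters(seqs, threshold=0.97):
    """Assign each sequence to the first seed it matches, else make it a new seed."""
    seeds = []     # (seed_id, seed_sequence) in order of creation
    clusters = {}  # seed_id -> list of member sequence ids
    for seq_id, seq in seqs:
        for seed_id, seed_seq in seeds:
            if identity(seq, seed_seq) >= threshold:
                clusters[seed_id].append(seq_id)
                break
        else:
            # No existing seed is similar enough: this sequence becomes a seed.
            seeds.append((seq_id, seq))
            clusters[seq_id] = [seq_id]
    return clusters


# Hypothetical reads, used only to exercise the sketch.
reads = [
    ("seq1", "ACGTACGTAC"),
    ("seq2", "ACGTACGTAT"),  # 9 of 10 positions match seq1
    ("seq3", "TTTTGGGGCC"),
]
clusters = greedy_seed_clusters(reads, threshold=0.9)
print(clusters)
```

With a threshold of 0.9 the second read joins the first seed; at the default 0.97 it would start its own cluster, which shows how strongly the threshold shapes the OTU count.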
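I. pick_otus.py parameters
-i Input FASTA file
-m Clustering method: mothur, trie, uclust_ref, usearch, usearch_ref, blast, usearch61, usearch61_ref, prefix_suffix, cdhit, or uclust. mothur requires aligned sequences. The default method is uclust
-o Path and name of the output directory
-s Sequence similarity threshold at which sequences are placed in the same OTU; the default is 0.97. Applies to blast, cdhit, uclust, uclust_ref, usearch, usearch_ref, usearch61, and usearch61_ref
-z Also match the reverse strand when clustering. Applies to uclust, uclust_ref, usearch, usearch_ref, usearch61, and usearch61_ref; doubles the amount of memory used. The default is False
-r Reference sequences, used when the method is blast, uclust_ref, usearch_ref, or usearch61_ref
Note: when the input contains more than 100,000 sequences, prefiltering the data is recommended to reduce the amount of computation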
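The main options above can be combined programmatically. The following is a minimal sketch that only assembles an argument list from the flags described in this section; the helper name `build_pick_otus_command` and the file names are hypothetical, and the check that the `*_ref` methods need `-r` is an assumption of this helper rather than a statement about how QIIME validates its input.

```python
# Methods taken from the -m description above.
ALL_METHODS = {
    "mothur", "trie", "uclust_ref", "usearch", "usearch_ref", "blast",
    "usearch61", "usearch61_ref", "prefix_suffix", "cdhit", "uclust",
}
# Methods for which -s is documented above.
SIMILARITY_METHODS = {
    "blast", "cdhit", "uclust", "uclust_ref", "usearch", "usearch_ref",
    "usearch61", "usearch61_ref",
}
# Assumption of this sketch: these methods cannot run without -r.
REFERENCE_METHODS = {"uclust_ref", "usearch_ref", "usearch61_ref"}


def build_pick_otus_command(input_fasta, output_dir, method="uclust",
                            similarity=0.97, reference=None, rev_strand=False):
    """Return the pick_otus.py argument list without running anything."""
    if method not in ALL_METHODS:
        raise ValueError("unknown method: %s" % method)
    if method in REFERENCE_METHODS and reference is None:
        raise ValueError("%s needs reference sequences (-r)" % method)
    cmd = ["pick_otus.py", "-i", input_fasta, "-o", output_dir, "-m", method]
    if method in SIMILARITY_METHODS:
        cmd += ["-s", str(similarity)]
    if reference is not None:
        cmd += ["-r", reference]
    if rev_strand:
        cmd.append("-z")
    return cmd


# Hypothetical file names, for illustration only.
cmd = build_pick_otus_command("seqs.fna", "picked_otus/", method="uclust_ref",
                              reference="refseqs.fasta", rev_strand=True)
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run` on a machine where QIIME is installed; building it separately makes it easy to log or inspect the command first.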
II. Clustering algorithms
mothur
-c Clustering algorithm
If you choose the mothur method, you must also choose one of its algorithms: furthest, nearest, or average.
The default is furthest
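The three mothur algorithms differ in how the distance between two groups of sequences is measured. The sketch below illustrates that difference on toy data; it is not mothur's implementation, it assumes equal-length (already aligned) sequences, and the helper names `hamming_fraction` and `linkage_distance` are invented for this example.

```python
from itertools import product


def hamming_fraction(a, b):
    """Fraction of differing positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def linkage_distance(cluster_a, cluster_b, algorithm="furthest"):
    """Distance between two clusters under the chosen linkage rule."""
    distances = [hamming_fraction(a, b) for a, b in product(cluster_a, cluster_b)]
    if algorithm == "nearest":
        return min(distances)                   # closest pair decides
    if algorithm == "furthest":
        return max(distances)                   # most distant pair decides
    if algorithm == "average":
        return sum(distances) / len(distances)  # mean over all pairs
    raise ValueError("algorithm must be furthest, nearest, or average")


# Toy aligned sequences, for illustration only.
cluster_a = ["AAAA", "AAAT"]
cluster_b = ["AATT"]
for name in ("nearest", "furthest", "average"):
    print(name, linkage_distance(cluster_a, cluster_b, name))
```

Because furthest uses the most distant pair, it is the strictest of the three rules and tends to yield more, tighter OTUs at the same threshold.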
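cdhit
-M Maximum amount of memory cd-hit may use; the default is 400 MB
-n Prefix length used for prefiltering (the default is None; 100 is a good value)
Sequences whose first 100 bases are identical are grouped together, one sequence is picked from each group, and only those sequences are clustered with cd-hit. This saves computation time; 100 is the usual choice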
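The prefix prefilter can be sketched as a grouping step followed by an expansion step. This is a minimal illustration under stated assumptions, not QIIME's code: it assumes the first sequence seen in each group serves as the representative, and the names `prefix_prefilter` and `expand_otus` and the sample reads are hypothetical.

```python
def prefix_prefilter(seqs, prefix_length=100):
    """Group sequences by their first prefix_length bases.

    Returns the representatives to cluster and a map from each
    representative id to all sequence ids that share its prefix.
    """
    groups = {}
    for seq_id, seq in seqs:
        groups.setdefault(seq[:prefix_length], []).append((seq_id, seq))
    representatives = []
    members = {}
    for group in groups.values():
        rep_id, rep_seq = group[0]  # assumption: first sequence represents the group
        representatives.append((rep_id, rep_seq))
        members[rep_id] = [seq_id for seq_id, _ in group]
    return representatives, members


def expand_otus(otus, members):
    """Replace each representative in an OTU by every sequence it stands for."""
    return {
        otu_id: [seq_id for rep_id in rep_ids for seq_id in members[rep_id]]
        for otu_id, rep_ids in otus.items()
    }


# Toy reads with a short prefix length so the effect is visible.
reads = [
    ("seq1", "ACGTAAAA"),
    ("seq2", "ACGTCCCC"),
    ("seq3", "TTTTAAAA"),
    ("seq4", "ACGTGG"),
]
representatives, members = prefix_prefilter(reads, prefix_length=4)
# Pretend the expensive clustering step put each representative in its own OTU.
otus = expand_otus({"0": ["seq1"], "1": ["seq3"]}, members)
print(representatives)
print(otus)
```

Only two of the four reads reach the expensive clustering step here, which is where the time saving comes from.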
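blast
-b BLAST database to search against; supply this parameter when using blast
--min_aligned_percent Minimum fraction of a sequence that must align for a match to count as a hit when using blast; the default is 0.5
-e Maximum E-value when using blast; the default is 1e-10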
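trie
-q Reverse the sequences when using trie; the default is False
-t Prefilter the data when using trie; the default is False
-n Prefix length used for prefiltering (the default is None; 100 is a good value)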
prefix_suffix
-p Prefix length for the prefix_suffix method; the default is 50
-u Suffix length for the prefix_suffix method; the default is 50
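One way to picture the prefix_suffix method is as collapsing sequences that share the same leading and trailing bases. The sketch below is an assumption-laden illustration, not QIIME's implementation: it treats two sequences as one OTU only when both their prefix and their suffix are identical, and the function name `prefix_suffix_otus` and the reads are hypothetical.

```python
def prefix_suffix_otus(seqs, prefix_length=50, suffix_length=50):
    """Collapse sequences that share the same prefix and suffix into one OTU."""
    groups = {}
    for seq_id, seq in seqs:
        prefix = seq[:prefix_length]
        # seq[-0:] would return the whole sequence, so handle 0 explicitly.
        suffix = seq[-suffix_length:] if suffix_length > 0 else ""
        groups.setdefault((prefix, suffix), []).append(seq_id)
    # Number the OTUs in the order their keys were first seen.
    return {otu_id: ids for otu_id, ids in enumerate(groups.values())}


# Toy reads with short lengths so the grouping is easy to follow.
reads = [
    ("seq1", "ACGTTTGA"),
    ("seq2", "ACGCCCGA"),  # same first 3 and last 2 bases as seq1
    ("seq3", "ACGTTTCC"),  # same prefix, different suffix
]
otus = prefix_suffix_otus(reads, prefix_length=3, suffix_length=2)
print(otus)
```

Nothing between the prefix and the suffix is compared, so the method is fast but coarse; longer -p and -u values make it more conservative.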
uclust, usearch
-D Suppress presorting of sequences by abundance when using the uclust or uclust_ref method. [default: False]
-A, --optimal_uclust
Pass the --optimal flag to uclust for uclust otu picking. [default: False]
-E, --exact_uclust
Pass the --exact flag to uclust for uclust otu picking. [default: False]
-B, --user_sort
Pass the --user_sort flag to uclust for uclust otu picking. [default: False]
-C, --suppress_new_clusters
Suppress creation of new clusters using seqs that don’t match reference when using -m uclust_ref, -m usearch61_ref, or -m usearch_ref [default: False]
--max_accepts
Max_accepts value to uclust, uclust_ref, usearch61, and usearch61_ref. By default, will use value suggested by method (uclust: 20, usearch61: 1) [default: default]
--max_rejects
Max_rejects value for uclust, uclust_ref, usearch61, and usearch61_ref. With default settings, will use value recommended by clustering method used (uclust: 500, usearch61: 8 for usearch_fast_cluster option, 32 for reference and smallmem options) [default: default]
--stepwords
Stepwords value to uclust and uclust_ref [default: 20]
--word_length
Word length value for uclust, uclust_ref, and usearch, usearch_ref, usearch61, and usearch61_ref. With default setting, will use the setting recommended by the method (uclust: 12, usearch: 64, usearch61: 8). int value can be supplied to override this setting. [default: default]
--uclust_otu_id_prefix
OTU identifier prefix (string) for the de novo uclust OTU picker and for new clusters when uclust_ref is used without -C [default: denovo, OTU ids are ascending integers]
--suppress_uclust_stable_sort
Don’t pass --stable-sort to uclust [default: False]
--suppress_uclust_prefilter_exact_match
Don’t collapse exact matches before calling uclust [default: False]
-d, --save_uc_files
Enable preservation of intermediate uclust (.uc) files that are used to generate clusters via uclust. Also enables preservation of all intermediate files created by usearch and usearch61. [default: True]
-j, --percent_id_err
Percent identity threshold for cluster error detection with usearch. [default: 0.97]
-g, --minsize
Minimum cluster size for size filtering with usearch. [default: 4]
-a, --abundance_skew
Abundance skew setting for de novo chimera detection with usearch. [default: 2.0]
-f, --db_filepath
Reference database of fasta sequences for reference based chimera detection with usearch. [default: None]
--perc_id_blast
Percent ID for mapping OTUs created by usearch back to original sequence IDs [default: 0.97]
-l, --suppress_cluster_size_filtering
Suppress cluster size filtering in usearch. [default: False]
--usearch_fast_cluster
Use fast clustering option for usearch or usearch61_ref with new clusters. --enable_rev_strand_match cannot be enabled with this option, and the only valid option for usearch61_sort_method is ‘length’. This option uses more memory than the default option for de novo clustering. [default: False]
--minlen
Minimum length of sequence allowed for usearch, usearch_ref, usearch61, and usearch61_ref. [default: 64]
--usearch61_sort_method
Sorting method for usearch61 and usearch61_ref. Valid options are abundance, length, or None. If the --usearch_fast_cluster option is enabled, the only sorting method allowed is length. [default: abundance]
--threads
Specify number of threads per core to be used for usearch61 commands that utilize multithreading. By default, will calculate the number of cores to utilize so a single thread will be used per CPU. Specify a fractional number, e.g. 1.0 for 1 thread per core, or 0.5 for a single thread on a two core CPU. Only applies to usearch61. [default: one_per_cpu]
--de_novo_chimera_detection
Deprecated: de novo chimera detection performed by default, pass --suppress_de_novo_chimera_detection to disable. [default: None]
-k, --suppress_de_novo_chimera_detection
Suppress de novo chimera detection in usearch. [default: False]
--reference_chimera_detection
Deprecated: reference based chimera detection performed by default, pass --suppress_reference_chimera_detection to disable. [default: None]
-x, --suppress_reference_chimera_detection
Suppress reference based chimera detection in usearch. [default: False]
--cluster_size_filtering
Deprecated: cluster size filtering enabled by default, pass --suppress_cluster_size_filtering to disable. [default: None]
--remove_usearch_logs
Disable creation of logs when usearch is called. Up to nine logs are created, depending on filtering steps enabled. [default: False]
--derep_fullseq
Dereplication of full sequences, instead of subsequences. Faster than the default --derep_subseqs in usearch. [default: False]
-F, --non_chimeras_retention
Selects subsets of sequences detected as non-chimeras to retain after de novo and reference based chimera detection. Options are intersection or union. union will retain sequences that are flagged as non-chimeric from either filter, while intersection will retain only those sequences that are flagged as non-chimeras from both detection methods. [default: union]
--sizeorder
Enable size based preference in clustering with usearch61. Requires that --usearch61_sort_method be abundance. [default: False]
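The difference between the two -F (--non_chimeras_retention) settings comes down to a set union versus a set intersection. The following is a minimal sketch of that logic only; the function name `retain_non_chimeras` and the sequence ids are hypothetical, and it says nothing about how usearch itself flags chimeras.

```python
def retain_non_chimeras(de_novo_ok, reference_ok, retention="union"):
    """Combine the non-chimeric ids reported by the two detection steps."""
    de_novo_ok, reference_ok = set(de_novo_ok), set(reference_ok)
    if retention == "union":
        # Keep a sequence if either filter considers it non-chimeric.
        return de_novo_ok | reference_ok
    if retention == "intersection":
        # Keep a sequence only if both filters consider it non-chimeric.
        return de_novo_ok & reference_ok
    raise ValueError("retention must be 'union' or 'intersection'")


# Hypothetical results of the two detection steps.
de_novo_ok = ["seq1", "seq2", "seq3"]
reference_ok = ["seq2", "seq3", "seq4"]
kept_union = retain_non_chimeras(de_novo_ok, reference_ok, "union")
kept_intersection = retain_non_chimeras(de_novo_ok, reference_ok, "intersection")
print(sorted(kept_union))
print(sorted(kept_intersection))
```

The default, union, is the more permissive choice; intersection discards anything that either method flags.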
III. Output files: seqs_otus.txt and seqs_otus.log
The first column is the OTU ID; the sequence IDs that follow are the sequences assigned to that OTU
0 seq1 seq5
1 seq2
2 seq3
3 seq4 seq6 seq7
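An OTU map in this layout is easy to load for downstream checks. The sketch below parses the four example lines shown above into a dictionary and counts the sequences per OTU; the function name `parse_otu_map` is hypothetical, and splitting on any whitespace is an assumption made so that both tab-separated and space-separated files work.

```python
def parse_otu_map(lines):
    """Map each OTU id (first column) to the list of its sequence ids."""
    otu_map = {}
    for line in lines:
        fields = line.split()  # assumption: columns separated by tabs or spaces
        if not fields:
            continue           # skip blank lines
        otu_map[fields[0]] = fields[1:]
    return otu_map


example = """0 seq1 seq5
1 seq2
2 seq3
3 seq4 seq6 seq7
"""
otu_map = parse_otu_map(example.splitlines())
sizes = {otu_id: len(members) for otu_id, members in otu_map.items()}
print(sizes)
```

To read a real file, pass an open file handle instead of `example.splitlines()`, since iterating over a file also yields one line at a time.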
The command I used:
/usr/lib/qiime/bin/pick_otus.py -i good.fasta -m cdhit -o cdhit_picked_otus/ -n 100
My personal public account is rarely updated, but you can post questions there; if I am slow to reply, email me at tiehan@sina.cn