【3】数据分析-4-3-流程搭建-snakemake
官方文档: https://snakemake.readthedocs.io/en/stable/
snakemake是一个基于Python的MIT-licensed workflow management。
例子:
rule targets:
input:
"plots/dataset1.pdf",
"plots/dataset2.pdf"
rule plot:
input:
"raw/{dataset}.csv"
output:
"plots/{dataset}.pdf"
shell:
"somecommand {input} {output}"
- 跟GNU Make比较类似,ke可以在最开始的地方定义pseudo-rule
- 可以自定义规则,目标文件和中间文件如何来自输入文件
- 可以编写shell语言或者python语言,或者直接调用python或者R的包
一、安装
1.安装miniconda
下载: https://conda.io/miniconda.html
cd /Users/tanqianshan/Documents/project/4.conda/conda3
bash Miniconda3-latest-MacOSX-x86_64.sh
2.安装snakemake
conda install -c bioconda snakemake
二、教学
1.准备好文件夹和数据
wget https://bitbucket.org/snakemake/snakemake-tutorial/get/v3.11.0.tar.bz2
tar -xf v3.11.0.tar.bz2 --strip 1
2.安装示例所需要的各种文件
file下包含了各种所需要的工具,snakemake-tutorial为配置的环境
conda env create --name snakemake-tutorial --file environment.yaml
3.启动环境
source activate snakemake-tutorial
#测试是否安装完毕
snakemake --help
#关闭conda环境
source deactivate
4.范例说明:
rule bwa_map:
input:
"data/genome.fa",
"data/samples/A.fastq"
output:
"mapped_reads/A.bam"
shell:
"bwa mem {input} |samtools view -Sb - > {output}"
1.文件命令Snakefile
-
先开启snakemake-tutorial,然后通过 snakemake -np mapped_reads/A.bam运行,-n并不真正的运行程序,-p显示运行的信息,如果已经生成文件,则不再运行,除非是输入文件有变更或删除输出文件。不知道怎么回事,前面的运行不出结果,我的直接snakemake运行处结果
-
文件中rule 定义这个函数名,Input 对应的是输入,output对应的是输出,shell是运行脚本
5.多个样本的运行-sample
rule bwa_map:
input:
"data/genome.fa",
"data/samples/{sample}.fastq"
output:
"mapped_reads/{sample}.bam"
shell:
"bwa mem {input} |samtools view -Sb - > {output}"
运行:
# 加入-n 就不真正运行程序了
snakemake -p mapped_reads/A.bam mapped_reads/B.bam
后者snakemake -np mapped_reads/{A,B}.bam
例二:
rule samtools_sort:
input:
"mapped_reads/{sample}.bam"
output:
"sorted_reads/{sample}.bam"
shell:
"samtools sort -T sorted_reads/{wildcards.sample} -O bam {input} > {output}"
运行:
snakemake -p sorted_reads/B.bam
#wildcards.sample 表示当前用的sample的名字
例三:流程图
rule samtools_index:
input:
"sorted_reads/{sample}.bam"
output:
"sorted_reads/{sample}.bam.bai"
shell:
"samtools index {input}"
做流程图:
snakemake --dag sorted_reads/{A,B}.bam.bai | dot -Tsvg > dag.svg
例四:合并所有输入文件
expand("sorted_reads/{sample}.bam", sample=SAMPLES)
输出为:
["sorted_reads/A.bam", "sorted_reads/B.bam"]
expand("sorted_reads/{sample}.{replicate}.bam", sample=SAMPLES, replicate=[0, 1])
输出为:
["sorted_reads/A.0.bam", "sorted_reads/A.1.bam", "sorted_reads/B.0.bam", "sorted_reads/B.1.bam"]
例五、定义的多种类型的输入
rule bcftools_call:
input:
fa="data/genome.fa",
bam=expand("sorted_reads/{sample}.bam", sample=SAMPLES),
bai=expand("sorted_reads/{sample}.bam.bai", sample=SAMPLES)
output:
"calls/all.vcf"
shell:
"samtools mpileup -g -f {input.fa} {input.bam} | "
"bcftools call -mv - > {output}"
例六:写报告
rule report:
input:
"calls/all.vcf"
output:
"report.html"
run:
from snakemake.utils import report
with open(input[0]) as vcf:
n_calls = sum(1 for l in vcf if not l.startswith("#"))
report("""
An example variant calling workflow
===================================
Reads were mapped to the Yeast
reference genome and variants were called jointly with
SAMtools/BCFtools.
This resulted in {n_calls} variants (see Table T1_).
""", output[0], T1=input[0])
例七:不用在命令行中定义输出是什么
在最前面定义,
rule all:
input:
"report.html"
这个就是最后要达到的输出或输入 然后命令行 snakemake 就可以运行了
总结整个流程:
SAMPLES = ["A", "B"]
rule all:
input:
"report.html"
rule bwa_map:
input:
"data/genome.fa",
"data/samples/{sample}.fastq"
output:
"mapped_reads/{sample}.bam"
shell:
"bwa mem {input} | samtools view -Sb - > {output}"
rule samtools_sort:
input:
"mapped_reads/{sample}.bam"
output:
"sorted_reads/{sample}.bam"
shell:
"samtools sort -T sorted_reads/{wildcards.sample} "
"-O bam {input} > {output}"
rule samtools_index:
input:
"sorted_reads/{sample}.bam"
output:
"sorted_reads/{sample}.bam.bai"
shell:
"samtools index {input}"
rule bcftools_call:
input:
fa="data/genome.fa",
bam=expand("sorted_reads/{sample}.bam", sample=SAMPLES),
bai=expand("sorted_reads/{sample}.bam.bai", sample=SAMPLES)
output:
"calls/all.vcf"
shell:
"samtools mpileup -g -f {input.fa} {input.bam} | "
"bcftools call -mv - > {output}"
rule report:
input:
"calls/all.vcf"
output:
"report.html"
run:
from snakemake.utils import report
with open(input[0]) as vcf:
n_calls = sum(1 for l in vcf if not l.startswith("#"))
report("""
An example variant calling workflow
===================================
Reads were mapped to the Yeast
reference genome and variants were called jointly with
SAMtools/BCFtools.
This resulted in {n_calls} variants (see Table T1_).
""", output[0], T1=input[0])
三、其他好玩的技巧
1.指定运行的核数
a.程序自带运行的核
rule bwa_map:
input:
"data/genome.fa",
lambda wildcards: config["samples"][wildcards.sample]
output:
"mapped_reads/{sample}.bam"
threads: 8
shell:
"bwa mem -t {threads} {input} | samtools view -Sb - > {output}"
b.通过外部命令来指定核的个数
snakemake --cores 10
2.配置文件
支持json或者yaml的文件作为config
#放在文件的开头
configfile: "config.yaml"
config.yaml文件
samples:
A: data/samples/A.fastq
B: data/samples/B.fastq
Snakefile
rule bcftools_call:
input:
fa="data/genome.fa",
bam=expand("sorted_reads/{sample}.bam", sample=config["samples"]),
bai=expand("sorted_reads/{sample}.bam.bai", sample=config["samples"])
output:
"calls/all.vcf"
shell:
"samtools mpileup -g -f {input.fa} {input.bam} | "
"bcftools call -mv - > {output}"
3.参数设置
rule bwa_map:
input:
"data/genome.fa",
lambda wildcards: config["samples"][wildcards.sample]
output:
"mapped_reads/{sample}.bam"
params:
rg="@RG\tID:{sample}\tSM:{sample}"
threads: 8
shell:
"bwa mem -R '{params.rg}' -t {threads} {input} | samtools view -Sb - > {output}"
4.log管理
rule bwa_map:
input:
"data/genome.fa",
lambda wildcards: config["samples"][wildcards.sample]
output:
"mapped_reads/{sample}.bam"
params:
rg="@RG\tID:{sample}\tSM:{sample}"
log:
"logs/bwa_mem/{sample}.log"
threads: 8
shell:
"(bwa mem -R '{params.rg}' -t {threads} {input} | "
"samtools view -Sb - > {output}) 2> {log}"
5.临时文件
最后删除temp
rule bwa_map:
input:
"data/genome.fa",
lambda wildcards: config["samples"][wildcards.sample]
output:
temp("mapped_reads/{sample}.bam")
params:
rg="@RG\tID:{sample}\tSM:{sample}"
log:
"logs/bwa_mem/{sample}.log"
threads: 8
shell:
"(bwa mem -R '{params.rg}' -t {threads} {input} | "
"samtools view -Sb - > {output}) 2> {log}"
需要被保护起来protected
rule samtools_sort:
input:
"mapped_reads/{sample}.bam"
output:
protected("sorted_reads/{sample}.bam")
shell:
"samtools sort -T sorted_reads/{wildcards.sample} "
"-O bam {input} > {output}"
6.调用其他的脚本
调用Snakefile
include: "path/to/other.snakefile"
调用其他的脚本
rule report:
input:
T1="calls/all.vcf",
T2=expand("benchmarks/{sample}.bwa.benchmark.txt", sample=config["samples"])
output:
"report.html"
script:
"scripts/report.py"
7.独立的运行环境
envs/samtools.yaml
channels:
- bioconda
dependencies:
- samtools =1.3
8.并行运行
snakemake -j 4
完整的流程
configfile: "config.yaml"
rule all:
input:
"report.html"
rule bwa_map:
input:
"data/genome.fa",
lambda wildcards: config["samples"][wildcards.sample]
output:
temp("mapped_reads/{sample}.bam")
params:
rg="@RG\tID:{sample}\tSM:{sample}"
log:
"logs/bwa_mem/{sample}.log"
threads: 8
shell:
"(bwa mem -R '{params.rg}' -t {threads} {input} | "
"samtools view -Sb - > {output}) 2> {log}"
rule samtools_sort:
input:
"mapped_reads/{sample}.bam"
output:
protected("sorted_reads/{sample}.bam")
shell:
"samtools sort -T sorted_reads/{wildcards.sample} "
"-O bam {input} > {output}"
rule samtools_index:
input:
"sorted_reads/{sample}.bam"
output:
"sorted_reads/{sample}.bam.bai"
shell:
"samtools index {input}"
rule bcftools_call:
input:
fa="data/genome.fa",
bam=expand("sorted_reads/{sample}.bam", sample=config["samples"]),
bai=expand("sorted_reads/{sample}.bam.bai", sample=config["samples"])
output:
"calls/all.vcf"
shell:
"samtools mpileup -g -f {input.fa} {input.bam} | "
"bcftools call -mv - > {output}"
rule report:
input:
"calls/all.vcf"
output:
"report.html"
run:
from snakemake.utils import report
with open(input[0]) as vcf:
n_calls = sum(1 for l in vcf if not l.startswith("#"))
report("""
An example variant calling workflow
===================================
Reads were mapped to the Yeast
reference genome and variants were called jointly with
SAMtools/BCFtools.
This resulted in {n_calls} variants (see Table T1_).
""", output[0], T1=input[0])
参考资料:
https://snakemake.readthedocs.io/en/stable/tutorial/basics.html
这里是一个广告位,,感兴趣的都可以发邮件聊聊:tiehan@sina.cn
个人公众号,比较懒,很少更新,可以在上面提问题,如果回复不及时,可发邮件给我: tiehan@sina.cn
个人公众号,比较懒,很少更新,可以在上面提问题,如果回复不及时,可发邮件给我: tiehan@sina.cn