【3】数据分析-4-3-流程搭建-snakemake

官方文档: https://snakemake.readthedocs.io/en/stable/

snakemake是一个基于Python的MIT-licensed workflow management。

例子:

rule targets:
	input:
		"plots/dataset1.pdf",
		"plots/dataset2.pdf"

rule plot:
	input:
		"raw/{dataset}.csv"
	output:
		"plots/{dataset}.pdf"
	shell:
		"somecommand {input} {output}"
  1. 跟GNU Make比较类似,ke可以在最开始的地方定义pseudo-rule
  2. 可以自定义规则,目标文件和中间文件如何来自输入文件
  3. 可以编写shell语言或者python语言,或者直接调用python或者R的包

一、安装

1.安装miniconda

下载: https://conda.io/miniconda.html

cd /Users/tanqianshan/Documents/project/4.conda/conda3
bash Miniconda3-latest-MacOSX-x86_64.sh

2.安装snakemake

conda install -c bioconda snakemake

二、教学

1.准备好文件夹和数据

wget https://bitbucket.org/snakemake/snakemake-tutorial/get/v3.11.0.tar.bz2
tar -xf v3.11.0.tar.bz2 --strip 1

2.安装示例所需要的各种文件

file下包含了各种所需要的工具,snakemake-tutorial为配置的环境

conda env create --name snakemake-tutorial --file environment.yaml

3.启动环境

source activate snakemake-tutorial

#测试是否安装完毕
snakemake --help

#关闭conda环境
source deactivate

4.范例说明:

rule bwa_map:
	input:
		"data/genome.fa",
		"data/samples/A.fastq"
	output:
		"mapped_reads/A.bam"
	shell:
		"bwa mem {input} |samtools view -Sb -  > {output}"

1.文件命令Snakefile

  1. 先开启snakemake-tutorial,然后通过 snakemake -np mapped_reads/A.bam运行,-n并不真正的运行程序,-p显示运行的信息,如果已经生成文件,则不再运行,除非是输入文件有变更或删除输出文件。不知道怎么回事,前面的运行不出结果,我的直接snakemake运行处结果

  2. 文件中rule 定义这个函数名,Input 对应的是输入,output对应的是输出,shell是运行脚本

5.多个样本的运行-sample

rule bwa_map:
	input:
		"data/genome.fa",
		"data/samples/{sample}.fastq"
	output:
		"mapped_reads/{sample}.bam"
	shell:
		"bwa mem {input} |samtools view -Sb -  > {output}"

运行:

# 加入-n 就不真正运行程序了
snakemake -p mapped_reads/A.bam mapped_reads/B.bam
后者snakemake -np mapped_reads/{A,B}.bam

例二:

rule samtools_sort:
	input:
		"mapped_reads/{sample}.bam"
	output:
		"sorted_reads/{sample}.bam"
	shell:
		"samtools sort -T sorted_reads/{wildcards.sample} -O bam {input} > {output}"

运行:

snakemake -p sorted_reads/B.bam
#wildcards.sample 表示当前用的sample的名字

例三:流程图

rule samtools_index:
	input:
		"sorted_reads/{sample}.bam"
	output:
		"sorted_reads/{sample}.bam.bai"
	shell:
		"samtools index {input}"

做流程图:

snakemake --dag sorted_reads/{A,B}.bam.bai | dot -Tsvg > dag.svg

例四:合并所有输入文件

expand("sorted_reads/{sample}.bam", sample=SAMPLES)
输出为:
["sorted_reads/A.bam", "sorted_reads/B.bam"]

expand("sorted_reads/{sample}.{replicate}.bam", sample=SAMPLES, replicate=[0, 1])
输出为:
["sorted_reads/A.0.bam", "sorted_reads/A.1.bam", "sorted_reads/B.0.bam", "sorted_reads/B.1.bam"]

例五、定义的多种类型的输入

rule bcftools_call:
	input:
		fa="data/genome.fa",
		bam=expand("sorted_reads/{sample}.bam", sample=SAMPLES),
		bai=expand("sorted_reads/{sample}.bam.bai", sample=SAMPLES)
	output:
		"calls/all.vcf"
	shell:
		"samtools mpileup -g -f {input.fa} {input.bam} | "
		"bcftools call -mv - > {output}"

例六:写报告

rule report:
	input:
		"calls/all.vcf"
	output:
		"report.html"
	run:
		from snakemake.utils import report
		with open(input[0]) as vcf:
			n_calls = sum(1 for l in vcf if not l.startswith("#"))

		report("""
		An example variant calling workflow
		===================================

		Reads were mapped to the Yeast
		reference genome and variants were called jointly with
		SAMtools/BCFtools.

		This resulted in {n_calls} variants (see Table T1_).
		""", output[0], T1=input[0])

例七:不用在命令行中定义输出是什么

在最前面定义,

rule all:
	input:
		"report.html"

这个就是最后要达到的输出或输入 然后命令行 snakemake 就可以运行了

总结整个流程:

SAMPLES = ["A", "B"]
	
rule all:
	input:
		"report.html"
	
rule bwa_map:
	input:
		"data/genome.fa",
		"data/samples/{sample}.fastq"
	output:
		"mapped_reads/{sample}.bam"
	shell:
		"bwa mem {input} | samtools view -Sb - > {output}"
	
rule samtools_sort:
	input:
		"mapped_reads/{sample}.bam"
	output:
		"sorted_reads/{sample}.bam"
	shell:
		"samtools sort -T sorted_reads/{wildcards.sample} "
		"-O bam {input} > {output}"

rule samtools_index:
	input:
		"sorted_reads/{sample}.bam"
	output:
		"sorted_reads/{sample}.bam.bai"
	shell:
		"samtools index {input}"
	
rule bcftools_call:
	input:
		fa="data/genome.fa",
		bam=expand("sorted_reads/{sample}.bam", sample=SAMPLES),
		bai=expand("sorted_reads/{sample}.bam.bai", sample=SAMPLES)
	output:
		"calls/all.vcf"
	shell:
		"samtools mpileup -g -f {input.fa} {input.bam} | "
		"bcftools call -mv - > {output}"
	
rule report:
	input:
		"calls/all.vcf"
	output:
		"report.html"
	run:
		from snakemake.utils import report
		with open(input[0]) as vcf:
			n_calls = sum(1 for l in vcf if not l.startswith("#"))

		report("""
		An example variant calling workflow
		===================================

		Reads were mapped to the Yeast
		reference genome and variants were called jointly with
		SAMtools/BCFtools.

		This resulted in {n_calls} variants (see Table T1_).
		""", output[0], T1=input[0])

三、其他好玩的技巧

1.指定运行的核数

a.程序自带运行的核

rule bwa_map:
	input:
		"data/genome.fa",
		lambda wildcards: config["samples"][wildcards.sample]
	output:
		"mapped_reads/{sample}.bam"
	threads: 8
	shell:
		"bwa mem -t {threads} {input} | samtools view -Sb - > {output}"

b.通过外部命令来指定核的个数

snakemake --cores 10

2.配置文件

支持json或者yaml的文件作为config

#放在文件的开头
configfile: "config.yaml"

config.yaml文件

samples:
	A: data/samples/A.fastq
	B: data/samples/B.fastq

Snakefile

rule bcftools_call:
	input:
		fa="data/genome.fa",
		bam=expand("sorted_reads/{sample}.bam", sample=config["samples"]),
		bai=expand("sorted_reads/{sample}.bam.bai", sample=config["samples"])
	output:
		"calls/all.vcf"
	shell:
		"samtools mpileup -g -f {input.fa} {input.bam} | "
		"bcftools call -mv - > {output}"

3.参数设置

rule bwa_map:
	input:
		"data/genome.fa",
		lambda wildcards: config["samples"][wildcards.sample]
	output:
		"mapped_reads/{sample}.bam"
	params:
		rg="@RG\tID:{sample}\tSM:{sample}"
	threads: 8
	shell:
		"bwa mem -R '{params.rg}' -t {threads} {input} | samtools view -Sb - > {output}"

4.log管理

rule bwa_map:
	input:
		"data/genome.fa",
		lambda wildcards: config["samples"][wildcards.sample]
	output:
		"mapped_reads/{sample}.bam"
	params:
		rg="@RG\tID:{sample}\tSM:{sample}"
	log:
		"logs/bwa_mem/{sample}.log"
	threads: 8
	shell:
		"(bwa mem -R '{params.rg}' -t {threads} {input} | "
		"samtools view -Sb - > {output}) 2> {log}"

5.临时文件

最后删除temp

rule bwa_map:
	input:
		"data/genome.fa",
		lambda wildcards: config["samples"][wildcards.sample]
	output:
		temp("mapped_reads/{sample}.bam")
	params:
		rg="@RG\tID:{sample}\tSM:{sample}"
	log:
		"logs/bwa_mem/{sample}.log"
	threads: 8
	shell:
		"(bwa mem -R '{params.rg}' -t {threads} {input} | "
		"samtools view -Sb - > {output}) 2> {log}"

需要被保护起来protected

rule samtools_sort:
	input:
		"mapped_reads/{sample}.bam"
	output:
		protected("sorted_reads/{sample}.bam")
	shell:
		"samtools sort -T sorted_reads/{wildcards.sample} "
		"-O bam {input} > {output}"

6.调用其他的脚本

调用Snakefile

include: "path/to/other.snakefile"

调用其他的脚本

rule report:
	input:
		T1="calls/all.vcf",
		T2=expand("benchmarks/{sample}.bwa.benchmark.txt", sample=config["samples"])
	output:
		"report.html"
	script:
		"scripts/report.py"

7.独立的运行环境

envs/samtools.yaml

channels:
  - bioconda
dependencies:
  - samtools =1.3

8.并行运行

snakemake -j 4

完整的流程

configfile: "config.yaml"
	
rule all:
	input:
		"report.html"


rule bwa_map:
	input:
		"data/genome.fa",
		lambda wildcards: config["samples"][wildcards.sample]
	output:
		temp("mapped_reads/{sample}.bam")
	params:
		rg="@RG\tID:{sample}\tSM:{sample}"
	log:
		"logs/bwa_mem/{sample}.log"
	threads: 8
	shell:
		"(bwa mem -R '{params.rg}' -t {threads} {input} | "
		"samtools view -Sb - > {output}) 2> {log}"


rule samtools_sort:
	input:
		"mapped_reads/{sample}.bam"
	output:
		protected("sorted_reads/{sample}.bam")
	shell:
		"samtools sort -T sorted_reads/{wildcards.sample} "
		"-O bam {input} > {output}"


rule samtools_index:
	input:
		"sorted_reads/{sample}.bam"
	output:
		"sorted_reads/{sample}.bam.bai"
	shell:
		"samtools index {input}"


rule bcftools_call:
	input:
		fa="data/genome.fa",
		bam=expand("sorted_reads/{sample}.bam", sample=config["samples"]),
		bai=expand("sorted_reads/{sample}.bam.bai", sample=config["samples"])
	output:
		"calls/all.vcf"
	shell:
		"samtools mpileup -g -f {input.fa} {input.bam} | "
		"bcftools call -mv - > {output}"


rule report:
	input:
		"calls/all.vcf"
	output:
		"report.html"
	run:
		from snakemake.utils import report
		with open(input[0]) as vcf:
			n_calls = sum(1 for l in vcf if not l.startswith("#"))

		report("""
		An example variant calling workflow
		===================================

		Reads were mapped to the Yeast
		reference genome and variants were called jointly with
		SAMtools/BCFtools.

		This resulted in {n_calls} variants (see Table T1_).
		""", output[0], T1=input[0])

参考资料:

https://snakemake.readthedocs.io/en/stable/tutorial/basics.html

http://slides.com/johanneskoester/deck-1#/31

药企,独角兽,苏州。团队长期招人,感兴趣的都可以发邮件聊聊:tiehan@sina.cn
个人公众号,比较懒,很少更新,可以在上面提问题,如果回复不及时,可发邮件给我: tiehan@sina.cn