【3】数据分析-11-流程搭建-snakemake

官方文档: https://snakemake.readthedocs.io/en/stable/

snakemake是一个基于Python的MIT-licensed workflow management。

例子:

rule targets:
    input:
        "plots/dataset1.pdf",
        "plots/dataset2.pdf"

rule plot:
    input:
        "raw/{dataset}.csv"
    output:
        "plots/{dataset}.pdf"
    shell:
        "somecommand {input} {output}"
  1. 跟GNU Make比较类似,ke可以在最开始的地方定义pseudo-rule
  2. 可以自定义规则,目标文件和中间文件如何来自输入文件
  3. 可以编写shell语言或者python语言,或者直接调用python或者R的包

一、安装

1.安装miniconda

下载: https://conda.io/miniconda.html

cd /Users/tanqianshan/Documents/project/4.conda/conda3
bash Miniconda3-latest-MacOSX-x86_64.sh

2.安装snakemake

conda install -c bioconda snakemake

二、教学

1.准备好文件夹和数据

wget https://bitbucket.org/snakemake/snakemake-tutorial/get/v3.11.0.tar.bz2
tar -xf v3.11.0.tar.bz2 --strip 1

2.安装示例所需要的各种文件

file下包含了各种所需要的工具,snakemake-tutorial为配置的环境

conda env create --name snakemake-tutorial --file environment.yaml

3.启动环境

source activate snakemake-tutorial

#测试是否安装完毕
snakemake --help

#关闭conda环境
source deactivate

4.范例说明:

rule bwa_map:
    input:
        "data/genome.fa",
        "data/samples/A.fastq"
    output:
        "mapped_reads/A.bam"
    shell:
        "bwa mem {input} |samtools view -Sb -  > {output}"

1.文件命令Snakefile

  1. 先开启snakemake-tutorial,然后通过 snakemake -np mapped_reads/A.bam运行,-n并不真正的运行程序,-p显示运行的信息,如果已经生成文件,则不再运行,除非是输入文件有变更或删除输出文件。不知道怎么回事,前面的运行不出结果,我的直接snakemake运行处结果

  2. 文件中rule 定义这个函数名,Input 对应的是输入,output对应的是输出,shell是运行脚本

5.多个样本的运行-sample

rule bwa_map:
    input:
        "data/genome.fa",
        "data/samples/{sample}.fastq"
    output:
        "mapped_reads/{sample}.bam"
    shell:
        "bwa mem {input} |samtools view -Sb -  > {output}"

运行:

# 加入-n 就不真正运行程序了
snakemake -p mapped_reads/A.bam mapped_reads/B.bam
后者snakemake -np mapped_reads/{A,B}.bam

例二:

rule samtools_sort:
    input:
        "mapped_reads/{sample}.bam"
    output:
        "sorted_reads/{sample}.bam"
    shell:
        "samtools sort -T sorted_reads/{wildcards.sample} -O bam {input} > {output}"

运行:

snakemake -p sorted_reads/B.bam
#wildcards.sample 表示当前用的sample的名字

例三:流程图

rule samtools_index:
    input:
        "sorted_reads/{sample}.bam"
    output:
        "sorted_reads/{sample}.bam.bai"
    shell:
        "samtools index {input}"

做流程图:

snakemake --dag sorted_reads/{A,B}.bam.bai | dot -Tsvg > dag.svg

例四:合并所有输入文件

expand("sorted_reads/{sample}.bam", sample=SAMPLES)
输出为:
["sorted_reads/A.bam", "sorted_reads/B.bam"]

expand("sorted_reads/{sample}.{replicate}.bam", sample=SAMPLES, replicate=[0, 1])
输出为:
["sorted_reads/A.0.bam", "sorted_reads/A.1.bam", "sorted_reads/B.0.bam", "sorted_reads/B.1.bam"]

例五、定义的多种类型的输入

rule bcftools_call:
    input:
        fa="data/genome.fa",
        bam=expand("sorted_reads/{sample}.bam", sample=SAMPLES),
        bai=expand("sorted_reads/{sample}.bam.bai", sample=SAMPLES)
    output:
        "calls/all.vcf"
    shell:
        "samtools mpileup -g -f {input.fa} {input.bam} | "
        "bcftools call -mv - > {output}"

例六:写报告

rule report:
    input:
        "calls/all.vcf"
    output:
        "report.html"
    run:
        from snakemake.utils import report
        with open(input[0]) as vcf:
            n_calls = sum(1 for l in vcf if not l.startswith("#"))

        report("""
        An example variant calling workflow
        ===================================

        Reads were mapped to the Yeast
        reference genome and variants were called jointly with
        SAMtools/BCFtools.

        This resulted in {n_calls} variants (see Table T1_).
        """, output[0], T1=input[0])

例七:不用在命令行中定义输出是什么

在最前面定义,

rule all:
    input:
        "report.html"

这个就是最后要达到的输出或输入 然后命令行 snakemake 就可以运行了

总结整个流程:

SAMPLES = ["A", "B"]

rule all:
    input:
        "report.html"

rule bwa_map:
    input:
        "data/genome.fa",
        "data/samples/{sample}.fastq"
    output:
        "mapped_reads/{sample}.bam"
    shell:
        "bwa mem {input} | samtools view -Sb - > {output}"

rule samtools_sort:
    input:
        "mapped_reads/{sample}.bam"
    output:
        "sorted_reads/{sample}.bam"
    shell:
        "samtools sort -T sorted_reads/{wildcards.sample} "
        "-O bam {input} > {output}"

rule samtools_index:
    input:
        "sorted_reads/{sample}.bam"
    output:
        "sorted_reads/{sample}.bam.bai"
    shell:
        "samtools index {input}"

rule bcftools_call:
    input:
        fa="data/genome.fa",
        bam=expand("sorted_reads/{sample}.bam", sample=SAMPLES),
        bai=expand("sorted_reads/{sample}.bam.bai", sample=SAMPLES)
    output:
        "calls/all.vcf"
    shell:
        "samtools mpileup -g -f {input.fa} {input.bam} | "
        "bcftools call -mv - > {output}"

rule report:
    input:
        "calls/all.vcf"
    output:
        "report.html"
    run:
        from snakemake.utils import report
        with open(input[0]) as vcf:
            n_calls = sum(1 for l in vcf if not l.startswith("#"))

        report("""
        An example variant calling workflow
        ===================================

        Reads were mapped to the Yeast
        reference genome and variants were called jointly with
        SAMtools/BCFtools.

        This resulted in {n_calls} variants (see Table T1_).
        """, output[0], T1=input[0])

三、其他好玩的技巧

1.指定运行的核数

a.程序自带运行的核

rule bwa_map:
    input:
        "data/genome.fa",
        lambda wildcards: config["samples"][wildcards.sample]
    output:
        "mapped_reads/{sample}.bam"
    threads: 8
    shell:
        "bwa mem -t {threads} {input} | samtools view -Sb - > {output}"

b.通过外部命令来指定核的个数

snakemake --cores 10

2.配置文件

支持json或者yaml的文件作为config

#放在文件的开头
configfile: "config.yaml"

config.yaml文件

samples:
    A: data/samples/A.fastq
    B: data/samples/B.fastq

Snakefile

rule bcftools_call:
    input:
        fa="data/genome.fa",
        bam=expand("sorted_reads/{sample}.bam", sample=config["samples"]),
        bai=expand("sorted_reads/{sample}.bam.bai", sample=config["samples"])
    output:
        "calls/all.vcf"
    shell:
        "samtools mpileup -g -f {input.fa} {input.bam} | "
        "bcftools call -mv - > {output}"

3.参数设置

rule bwa_map:
    input:
        "data/genome.fa",
        lambda wildcards: config["samples"][wildcards.sample]
    output:
        "mapped_reads/{sample}.bam"
    params:
        rg="@RG\tID:{sample}\tSM:{sample}"
    threads: 8
    shell:
        "bwa mem -R '{params.rg}' -t {threads} {input} | samtools view -Sb - > {output}"

4.log管理

rule bwa_map:
    input:
        "data/genome.fa",
        lambda wildcards: config["samples"][wildcards.sample]
    output:
        "mapped_reads/{sample}.bam"
    params:
        rg="@RG\tID:{sample}\tSM:{sample}"
    log:
        "logs/bwa_mem/{sample}.log"
    threads: 8
    shell:
        "(bwa mem -R '{params.rg}' -t {threads} {input} | "
        "samtools view -Sb - > {output}) 2> {log}"

5.临时文件

最后删除temp

rule bwa_map:
    input:
        "data/genome.fa",
        lambda wildcards: config["samples"][wildcards.sample]
    output:
        temp("mapped_reads/{sample}.bam")
    params:
        rg="@RG\tID:{sample}\tSM:{sample}"
    log:
        "logs/bwa_mem/{sample}.log"
    threads: 8
    shell:
        "(bwa mem -R '{params.rg}' -t {threads} {input} | "
        "samtools view -Sb - > {output}) 2> {log}"

需要被保护起来protected

rule samtools_sort:
    input:
        "mapped_reads/{sample}.bam"
    output:
        protected("sorted_reads/{sample}.bam")
    shell:
        "samtools sort -T sorted_reads/{wildcards.sample} "
        "-O bam {input} > {output}"

6.调用其他的脚本

调用Snakefile

include: "path/to/other.snakefile"

调用其他的脚本

rule report:
    input:
        T1="calls/all.vcf",
        T2=expand("benchmarks/{sample}.bwa.benchmark.txt", sample=config["samples"])
    output:
        "report.html"
    script:
        "scripts/report.py"

7.独立的运行环境

envs/samtools.yaml

channels:
  - bioconda
dependencies:
  - samtools =1.3

8.并行运行

snakemake -j 4

完整的流程

configfile: "config.yaml"

rule all:
    input:
        "report.html"


rule bwa_map:
    input:
        "data/genome.fa",
        lambda wildcards: config["samples"][wildcards.sample]
    output:
        temp("mapped_reads/{sample}.bam")
    params:
        rg="@RG\tID:{sample}\tSM:{sample}"
    log:
        "logs/bwa_mem/{sample}.log"
    threads: 8
    shell:
        "(bwa mem -R '{params.rg}' -t {threads} {input} | "
        "samtools view -Sb - > {output}) 2> {log}"


rule samtools_sort:
    input:
        "mapped_reads/{sample}.bam"
    output:
        protected("sorted_reads/{sample}.bam")
    shell:
        "samtools sort -T sorted_reads/{wildcards.sample} "
        "-O bam {input} > {output}"


rule samtools_index:
    input:
        "sorted_reads/{sample}.bam"
    output:
        "sorted_reads/{sample}.bam.bai"
    shell:
        "samtools index {input}"


rule bcftools_call:
    input:
        fa="data/genome.fa",
        bam=expand("sorted_reads/{sample}.bam", sample=config["samples"]),
        bai=expand("sorted_reads/{sample}.bam.bai", sample=config["samples"])
    output:
        "calls/all.vcf"
    shell:
        "samtools mpileup -g -f {input.fa} {input.bam} | "
        "bcftools call -mv - > {output}"


rule report:
    input:
        "calls/all.vcf"
    output:
        "report.html"
    run:
        from snakemake.utils import report
        with open(input[0]) as vcf:
            n_calls = sum(1 for l in vcf if not l.startswith("#"))

        report("""
        An example variant calling workflow
        ===================================

        Reads were mapped to the Yeast
        reference genome and variants were called jointly with
        SAMtools/BCFtools.

        This resulted in {n_calls} variants (see Table T1_).
        """, output[0], T1=input[0])

参考资料:

https://snakemake.readthedocs.io/en/stable/tutorial/basics.html

http://slides.com/johanneskoester/deck-1#/31

个人公众号,比较懒,很少更新,可以在上面提问题:

更多精彩,请移步公众号阅读:

Sam avatar
About Sam
专注生物信息 专注转化医学