【2】文件读写-2-3-python读写读写Bam/SAM文件(pysam)
pysam是python中用来读写SAM/BAM文件的一个模块。底层是用C写的
一、简介
pysam官网
Pysam is a wrapper of the htslib C-API and provides facilities to read and write SAM/BAM/VCF/BCF/BED/GFF/GTF/FASTA/FASTQ files as well as access to the command line functionality of the samtools and bcftools packages. The module supports compression and random access through indexing.
安装
pip install pysam
二. 函数详解
1.读写bam
import pysam
samfile = pysam.AlignmentFile(input_bam, "rb")
#然后可以通过fetch方法来调取比对到某个区间对应的reads信息
for read in samfile.fetch('chr1', 100, 120):
print read
break
samfile.close()
得到结果:
EAS56_57:6:190:289:82 0 99 <<<7<<<;<<<<<<<<8;;<7;4<;<;;;;;94<; 69 CTCAAGGTTGTTGCAAGGGGGTCTATGTGAACAAA 0 192 1
EAS56_57:6:190:289:82 0 99 <<<<<<;<<<<<<<<<<;<<;<<<<;8<6;9;;2; 137 AGGGGTGCAGAGCCGAGTCACGGGGTTGCCAGCAC 73 64 1
EAS51_64:3:190:727:308 0 102 <<<<<<<<<<<<<<<<<<<<<<<<<<<::<<<844 99 GGTGCAGAGCCGAGTCACGGGGTTGCCAGCACAGG 99 18 1
...
#取的位置是交集还是并集呢? 初步理解为跟这个区间有交集的序列都提出来
也可以写bam文件
import pysam
samfile = pysam.AlignmentFile("ex1.bam", "rb")
pairedreads = pysam.AlignmentFile("allpaired.bam", "wb", template=samfile)
for read in samfile.fetch():
if read.is_paired:
pairedreads.write(read)
pairedreads.close()
samfile.close()
注:
-
AlignmentFile(filepath_or_object, mode=None, template=None, reference_names=None, reference_lengths=None, text=NULL, header=None, add_sq_text=False, check_header=True, check_sq=True, reference_filename=None, filename=None, duplicate_filehandle=True)
-
AlignmentFile打开得bam文件需要是已经建立索引,有.bai文件存在的bam文件
-
template将通过AlignmentFile打开得bam文件的header拷贝过来
2.提取某个碱基
import pysam
samfile = pysam.AlignmentFile("ex1.bam", "rb" )
for pileupcolumn in samfile.pileup("chr1", 100, 120,truncate=True):
print ("\ncoverage at base %s = %s" %
(pileupcolumn.pos, pileupcolumn.n))
for pileupread in pileupcolumn.pileups:
if not pileupread.is_del and not pileupread.is_refskip:
# query position is None if is_del or is_refskip is set.
print ('\tbase in read %s = %s' %
(pileupread.alignment.query_name,
pileupread.alignment.query_sequence[pileupread.query_position]))
samfile.close()
输出结果:
coverage at base 99 = 1
base in read EAS56_57:6:190:289:82 = A
coverage at base 100 = 1
base in read EAS56_57:6:190:289:82 = G
coverage at base 101 = 1
base in read EAS56_57:6:190:289:82 = G
coverage at base 102 = 2
base in read EAS56_57:6:190:289:82 = G
base in read EAS51_64:3:190:727:308 = G
...
实现逻辑:
1.根据给得区间"chr1", 100, 120,一次,从100到120点所对应的所有reads,通过pileup提取的reads有这几个属性:‘n’,‘nsegments’, ‘pileups’, ‘pos’, ‘reference_id’, ‘reference_pos’, ‘tid’
a.起点,终点?这里应该是先找出跟区间至少有一个碱基交集的reads,然后这些reads的所覆盖的区间的起点为pileup的起点
b.如果加上参数truncate=True,那起点就成了100,119,注意这里起始的变化,这里仅仅是输出做了一个限制,但是提取的reads的方式还是没有变化,就是只要跟指定的区间有一个碱基交集的reads就给提出来了
c.而pos代表该位点的坐标,n代表它的coverage,pileups代表对应的reads
2.每个点对应的每条read可以通过pileups来提,而这些reads有这些属性’alignment', ‘indel’, ‘is_del’, ‘is_head’, ‘is_refskip’, ‘is_tail’, ‘level’, ‘query_position
3.alignment的属性包含: aend, alen, aligned_pairs, bin, blocks, cigar, cigarstring, cigartuples, compare, flag, get_aligned_pairs, get_blocks, get_overlap, get_reference_positions, get_tag, get_tags, has_tag, infer_query_length, inferred_length, is_duplicate, is_paired, is_proper_pair, is_qcfail, is_read1, is_read2, is_reverse, is_secondary, is_supplementary, is_unmapped, isize, mapping_quality, mapq, mate_is_reverse, mate_is_unmapped, mpos, mrnm, next_reference_id, next_reference_start, opt, overlap, pnext, pos, positions, qend, qlen, qname, qqual, qstart, qual, query, query_alignment_end, query_alignment_length, query_alignment_qualities, query_alignment_sequence, query_alignment_start, query_length, query_name, query_qualities, query_sequence, reference_end, reference_id, reference_length, reference_start, rlen, rname, rnext, seq, setTag, set_tag, set_tags, tags, template_length, tid, tlen
3.排序
pysam.sort("ex1.bam", "output")
跟samtools的效果一样
samtools sort ex1.bam output
4.打开其他文件(这个功能暂时没有用到)
import pysam
tabixfile = pysam.TabixFile("example.gtf.gz")
for gtf in tabixfile.fetch("chr1", 1000, 2000):
print (gtf.contig, gtf.start, gtf.end, gtf.gene_id)
5.其他函数与功能略
import pysam
def mutate(args):
assert mutend > mutstart, "mutation start must occur before mutation end: " + mutid
<span style="color: #ff0000;">bamfile = pysam.Samfile(args.bamFileName, 'rb')</span> #<span style="color: #008000;">读取bam文件</span>
<span style="color: #ff0000;">for pcol in bamfile.pileup(reference=chrom, start=mutstart, end=mutend, max_depth=int(args.maxdepth), truncate=True):</span>
#提取包含chrom:mutstart-mutend这个区段的所有reads。pcol.pos 提取碱基的位置;包含mutstart,不包含mutend;truncate=True,指定到我们指定的碱基位置
if pcol.pos:
num_good_reads=0
num_better_reads=0
refbase = reffile.fetch(chrom, pcol.pos-1, pcol.pos)
#提取pcol.pos的碱基
<span style="color: #ff0000;">for pread in pcol.pileups</span>: #提取一条序列pread
if avoid is not None and pread.alignment.qname in avoid:
print "WARN\t" + now() + "\t" + region + "\tdropped mutation due to read in --avoidlist", pread.alignment.qname
if<span style="color: #008000;"> pread.query_position is not None and not pread.alignment.is_secondary and bin(pread.alignment.flag & 2048) != bin(2048)</span>:
# is_secondary==true if not primary alignment
#query_position==position of the read base at the pileup site, 0-based. None if is_del or is_refskip is set
if <span style="color: #008000;">pread.alignment.is_read2</span>:
pairname = 'S' # read is second in pair
if not pread.alignment.is_paired:
pairname = 'U' # read is unpaired
outreads_qual[extqname]=<span style="color: #008000;">pread.alignment.qual</span> #质量数据
<span style="color: #ff0000;">bamfile.close()</span>
提取成对reads的start和end
CNVkit will give you the depths in each targeted region, but won't distinguish between amplicons by read pair start/end coordinates.
I think you can do this analysis with Pysam. Sort the BAM file by read name (not coordinates), then do something like:
from __future__ import print_function
import collections, pysam
amplicons = collections.defaultdict(int)
all_reads = pysam.Samfile(bam_filename, 'rb').fetch()
for read in all_reads:
mate = next(all_reads)
if read.is_proper_pair():
if read.reference_start < mate.reference_start:
coords = (reference_name, read.reference_start, mate.reference_end)
else:
coords = (reference_name, mate.reference_start, read.reference_end)
amplicons[coords] += 1
# Print the read counts for each amplicon to standard output
for (refname, start, end), count in sorted(amplicons.iteritems()):
print(refname, start+1, end, count, sep='\t')
参考资料:https://github.com/etal/cnvkit/issues/212
三、讨论:
- ValueError: quality and sequence mismatch: 148 != 151
原因: pysam读取后的seq和qual长度必须一样啊!!
2.文件的写入
注: 先整理到这里,后面用的时候,再回来梳理。
参考资料:
http://pysam.readthedocs.org/en/latest/glossary.html#term-column
http://pysam.readthedocs.org/en/latest/api.html#pysam.AlignmentFile.pileup
个人公众号,比较懒,很少更新,可以在上面提问题,如果回复不及时,可发邮件给我: tiehan@sina.cn