【2.2】FASTQ format: Phred33/64

May 21, 2015 file_type

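FASTQ files encode read quality scores in several different ways. Illumina currently uses Phred33 in general, but Phred64 (from older pipeline versions) still turns up occasionally. The attached Perl script prints the quality scores as numbers and helps determine whether a file uses Phred33 or Phred64.

FASTQ is a text-based standard format for storing biological sequences (usually nucleotide sequences) together with their sequencing quality information. FASTQ files are generally generated from the sequencer's lower-level BCL files. Each base and each quality score is represented by a single ASCII character. The format was originally developed at the Sanger Institute to keep a FASTA sequence and its quality data together, and it has since become the de facto standard for high-throughput sequencing output.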

1. File naming

FASTQ files generally follow this naming convention:

<sample name>_<barcode sequence>_L<lane>_R<read number>_<set number>.fastq.gz

For example:

NA10831_ATCACG_L002_R1_001.fastq.gz
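
The example file name above can be split into its parts with a regular expression. This is a minimal sketch: the group names (sample, barcode, lane, read, set) are labels I chose by reading the example, not official field names, and the pattern assumes the barcode itself contains no underscore.

```python
import re

# Pattern inferred from the example file name; group names are illustrative.
FASTQ_NAME = re.compile(
    r"^(?P<sample>.+)_(?P<barcode>[^_]+)"
    r"_L(?P<lane>\d{3})_R(?P<read>\d)_(?P<set>\d{3})\.fastq\.gz$"
)

def parse_fastq_name(filename):
    """Return a dict of the name's parts, or None if it does not match."""
    m = FASTQ_NAME.match(filename)
    if m is None:
        return None
    fields = m.groupdict()
    fields["lane"] = int(fields["lane"])   # "002" -> 2
    fields["read"] = int(fields["read"])   # "1"   -> 1
    return fields

print(parse_fastq_name("NA10831_ATCACG_L002_R1_001.fastq.gz"))
```

Files that do not follow this scheme simply return None, so the function can double as a quick sanity check on a directory listing.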

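2. File format

Each sequence in a FASTQ file normally takes four lines:

Line 1 starts with '@', followed by a description of the sequence; this is analogous to the '>' header line in FASTA.
Line 2 is the sequence itself.
Line 3 starts with '+' and may optionally be followed by the sequence description again.
Line 4 holds the quality values for the sequence on line 2 (strictly speaking, the sequencing quality of each base); it has the same number of characters as line 2.

Example: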

@FCC0U6BACXX:6:1101:1418:2067#CTAGTTAT/1
CCGGTAAAGGATCGTATCCTGCGTGCACGATGGCGGTATTTGCGCTGGATACACCCATCCCAATATCAGCTGCTTTATCGATCAACAAGA
+
abbecceegggggiihhhfgihiifhhiihiiiihiZafgffhihg]aabdedddcab^ac`bcbb_]`bcccR]SSYSWQ[JT]`_^X[
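
The four-line layout can be read with a few lines of Python. The sketch below assumes every record occupies exactly four lines (no wrapped sequences); the function name read_fastq is my own choice for illustration, and the sample record is the one shown above.

```python
import io

def read_fastq(handle):
    """Yield (identifier, sequence, quality) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline()
        if not header:                      # end of file
            return
        header = header.rstrip("\n")
        seq = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record: " + header)
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ: " + header)
        yield header[1:], seq, qual

EXAMPLE = """@FCC0U6BACXX:6:1101:1418:2067#CTAGTTAT/1
CCGGTAAAGGATCGTATCCTGCGTGCACGATGGCGGTATTTGCGCTGGATACACCCATCCCAATATCAGCTGCTTTATCGATCAACAAGA
+
abbecceegggggiihhhfgihiifhhiihiiiihiZafgffhihg]aabdedddcab^ac`bcbb_]`bcccR]SSYSWQ[JT]`_^X[
"""

for name, seq, qual in read_fastq(io.StringIO(EXAMPLE)):
    print(name, len(seq), len(qual))
```

For a real .fastq.gz file, the same function should work on a handle opened with gzip.open(path, "rt").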

2.1 Line 1 (sequence identifier)

@<instrument>:<run number>:<flowcell ID>:<lane>:<tile>:<x-pos>:<y-pos> <read>:<is filtered>:<control number>:<index sequence>

Each element means the following:

Element	Requirements	Description
@	@	Each sequence identifier line starts with @
<instrument>	Characters allowed: a-z, A-Z, 0-9 and underscore	Instrument ID
<run number>	Numerical	Run number on instrument
<flowcell ID>	Characters allowed: a-z, A-Z, 0-9	Flowcell ID
<lane>	Numerical	Lane number
<tile>	Numerical	Tile number
<x-pos>	Numerical	X coordinate of cluster
<y-pos>	Numerical	Y coordinate of cluster
<read>	Numerical	Read number: 1 for a single read or the first read of a pair, 2 for the second read of a paired-end run
<is filtered>	Y or N	Y if the read is filtered (did not pass), N otherwise
<control number>	Numerical	0 when none of the control bits are on, otherwise it is an even number. See below.
<index sequence>	ACTG	Index sequence
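
To make the field layout concrete, here is a minimal sketch that splits an identifier line of this form into a dictionary. The function name and dictionary keys are my own, and the sample identifier is only an illustrative string in this layout, not a read from my data. Note that older headers such as the one in the example record above (ending in #CTAGTTAT/1) use a different layout and will not parse with this function.

```python
def parse_identifier(line):
    """Split '@<instrument>:...:<y-pos> <read>:...:<index sequence>' into named fields."""
    if not line.startswith("@"):
        raise ValueError("identifier line must start with '@'")
    left, sep, right = line[1:].strip().partition(" ")
    if not sep:
        raise ValueError("expected a space between the two halves of the identifier")
    instrument, run, flowcell, lane, tile, x_pos, y_pos = left.split(":")
    read, is_filtered, control, index_seq = right.split(":")
    return {
        "instrument": instrument,
        "run_number": int(run),
        "flowcell_id": flowcell,
        "lane": int(lane),
        "tile": int(tile),
        "x_pos": int(x_pos),
        "y_pos": int(y_pos),
        "read": int(read),
        "is_filtered": is_filtered == "Y",
        "control_number": int(control),
        "index_sequence": index_seq,
    }

# Illustrative identifier in the layout described above.
info = parse_identifier("@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG")
print(info["lane"], info["is_filtered"], info["control_number"])
```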

About the control number

The tenth column (<control number>) is zero if the read is not identified as a control. If the read is identified as a control, the number is greater than zero, and the value specifies what kind of control it is. The value is the decimal representation of a bit-wise encoding scheme, with bit 0 having a decimal value of 1, bit 1 a value of 2, bit 2 a value of 4, and so on. The bits are used as follows:

Bit 0: always empty (0)
Bit 1: was the read identified as a control?
Bit 2: was the match ambiguous?
Bit 3: did the read match the phiX tag?
Bit 4: did the read align to match the phiX tag?
Bit 5: did the read match the control index sequence?
Bits 6,7: reserved for future use
Bits 8..15: the report key for the matched record in the controls.fasta file (specified by the REPORT_KEY metadata)
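
The bit layout above can be unpacked with shifts and masks. This is a small sketch under the assumption that the list is complete as quoted; the flag descriptions are shortened versions of the list entries and the function name is my own.

```python
# Bit positions taken from the list above; descriptions are abbreviated.
CONTROL_BITS = {
    1: "identified as a control",
    2: "match was ambiguous",
    3: "matched the phiX tag",
    4: "aligned to the phiX tag",
    5: "matched the control index sequence",
}

def decode_control_number(value):
    """Return (list of set flags, report key held in bits 8..15)."""
    flags = [name for bit, name in CONTROL_BITS.items() if (value >> bit) & 1]
    report_key = (value >> 8) & 0xFF
    return flags, report_key

print(decode_control_number(0))    # no control bits set
print(decode_control_number(18))   # 18 = 2 + 16, i.e. bits 1 and 4
```

Because bit 0 is always empty, any non-zero control number is even, which matches the note in the table above.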

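2.2 Quality encoding formats

A quality score is a logarithmic measure of the probability that a base call is wrong. It was first defined and used in the Phred base-calling program and has since been adopted by many other tools. The relationship between quality score and error probability is shown in the table below: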

Phred quality scores are logarithmically linked to error probabilities:

Phred quality score	Probability of incorrect base call	Base call accuracy
10	1 in 10	90 %
20	1 in 100	99 %
30	1 in 1000	99.9 %
40	1 in 10000	99.99 %
50	1 in 100000	99.999 %

Phred quality scores Q are defined as a property that is logarithmically related to the base-calling error probability P:

Q = -10 × log10(P)
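
The relationship between Q and P is easy to check numerically. The sketch below just applies the definition in both directions; the function names are my own.

```python
import math

def phred_to_error_prob(q):
    """P = 10^(-Q/10)."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Q = -10 * log10(P)."""
    return -10 * math.log10(p)

for q in (10, 20, 30, 40, 50):
    p = phred_to_error_prob(q)
    print(f"Q{q}: error probability {p:g}, accuracy {100 * (1 - p):g} %")
```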

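Different software encodes the per-base quality score in different ways. There are currently five schemes:

Sanger: Phred quality score, values from 0 to 93, encoded as ASCII 33 to 126. For raw read data the score is usually below 60; assembly or mapping results may use higher scores.

Solexa/Illumina 1.0: Solexa/Illumina quality score, values from -5 to 62, encoded as ASCII 59 to 126. For raw read data the score is generally between -5 and 40.

Illumina 1.3+: Phred quality score, values from 0 to 62, encoded as ASCII 64 to 126. For raw read data the score is between 0 and 40.

Illumina 1.5+: Phred quality score with the same ASCII offset of 64, but the values 0 to 2 are reserved as special indicators rather than ordinary scores (a score of 2, written 'B', marks a low-quality read segment). See http://solexaqa.sourceforge.net/questions.htm#illumina

Illumina 1.8+: Phred quality score encoded with an ASCII offset of 33, as in the Sanger scheme. For raw read data the score is typically between 0 and 41.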
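
The ASCII ranges listed above suggest a simple heuristic for telling the offsets apart. The following is my own Python sketch of that idea, not the attached Perl script: it assumes raw read data, where offset-33 files top out around 'J' (ASCII 74, Q41) and only offset-33 files can contain characters below ASCII 59. When neither condition holds, it returns None rather than guessing.

```python
def guess_phred_offset(quality_strings):
    """Guess the ASCII offset (33 or 64) from observed quality characters.

    Heuristic only; returns None when the observed range is ambiguous.
    """
    codes = [ord(c) for q in quality_strings for c in q]
    lo, hi = min(codes), max(codes)
    if lo < 59:      # below ';': impossible with an offset of 64
        return 33
    if hi > 74:      # above 'J': beyond typical raw-read scores at offset 33
        return 64
    return None

def decode_quality(quality, offset):
    """Convert a quality string to a list of integer scores."""
    return [ord(c) - offset for c in quality]

# Quality line of the example record shown earlier in this post.
qual = "abbecceegggggiihhhfgihiifhhiihiiiihiZafgffhihg]aabdedddcab^ac`bcbb_]`bcccR]SSYSWQ[JT]`_^X["
offset = guess_phred_offset([qual])
print(offset, decode_quality(qual, offset)[:10])
```

In practice it is safer to feed the function many reads (say the first few thousand) than a single one, since a short high-quality read may fall entirely within the ambiguous range.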

Most importantly, the script below told me that my own sequencing data uses the Phred64 encoding.

fastq_phred_decide.pl

References

jiewencai's personal blog http://blog.sciencenet.cn/blog-630246-709629.html
Boyun Bio (博耘生物) http://boyun.sh.cn/bio/?p=1901
Wikipedia http://en.wikipedia.org/wiki/FASTQ_format
http://support.illumina.com/help/SequencingAnalysisWorkflow/Content/Vault/Informatics/Sequencing_Analysis/CASAVA/swSEQ_mCA_FASTQFiles.htm
