1. Quality trimming
3. Contaminant filtering

## 1. Installation

pip3 install cutadapt


## 2. Usage

#### Basic usage examples:

# Remove a 3' adapter
# -a: 3' adapter sequence
# -o: output file
# -j: number of CPU cores to use
cutadapt -j 10 -a AACCGGTT -o output.fastq input.fastq

# Compressed input and output files are also handled
# Supported formats: gzip (.gz), bzip2 (.bz2) and xz (.xz)
cutadapt -a AACCGGTT -o output.fastq.gz input.fastq.gz

# Remove a 5' adapter
# -g: 5' adapter sequence

# Remove a poly-A tail; A{100} is shorthand for 100 consecutive A's
# Instead of writing ten A's in a row (AAAAAAAAAA), write A{10}
cutadapt -a "A{100}" -o output.fastq input.fastq

# Quality trimming
# The -q (or --quality-cutoff) option trims low-quality ends from reads before adapter removal.
# By default, only the 3' end of each read is quality-trimmed.
cutadapt -q 10 -o output.fastq input.fastq

# Trim with multiple 3' adapters
cutadapt -a TGAGACACGCA -a AGGCACACAGGG -o output.fastq input.fastq
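
To make the `-q` quality trimming above more concrete, here is a minimal Python sketch of the procedure as the Cutadapt documentation describes it: subtract the cutoff from every quality value, compute partial sums from each position to the end of the read, and cut where that sum is smallest. The function name `quality_trim_index` is made up for illustration, and this is not Cutadapt's actual code, which may differ in details such as early termination.

```python
def quality_trim_index(qualities, cutoff):
    """Return the number of bases to keep after 3' quality trimming.

    Illustrative sketch only: walks from the 3' end, accumulating
    (quality - cutoff), and remembers the position with the lowest sum.
    """
    running = 0              # partial sum from the current index to the read end
    lowest = 0               # lowest partial sum seen so far
    keep = len(qualities)    # by default nothing is trimmed
    for i in range(len(qualities) - 1, -1, -1):
        running += qualities[i] - cutoff
        if running < lowest:
            lowest = running
            keep = i
    return keep


quals = [42, 40, 26, 27, 8, 7, 11, 4, 2, 3]
keep = quality_trim_index(quals, 10)
print(quals[:keep])  # [42, 40, 26, 27]
```

Note that the base with quality 11 is still removed even though it is above the cutoff, because it is surrounded by low-quality bases.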


Keep in mind that Cutadapt removes the adapter that it finds and also the sequence following it, so even if the actual adapter sequence that is used in a protocol is longer than that (and possibly contains a variable index), it is sufficient to specify a prefix of the sequence(s).

Sequence: 'ACGTACGTACGTTAGCTAGC'; Length: 20; Trimmed: 2402 times.


No. of allowed errors:
0-7 bp: 0; 8-15 bp: 1; 16-20 bp: 2
The adapter, as shown above, is 20 characters long. A custom error rate of 0.12 is in use. What this implies is shown above: matches up to 7 bp long may contain no errors, matches of 8-15 bp are allowed 1 error, and matches of 16 bp or more can have 2 errors. See also the section on error-tolerant matching in the Cutadapt documentation.
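
The "No. of allowed errors" line can be reproduced with a small Python sketch. It assumes that the allowed error count is the match length multiplied by the maximum error rate, rounded down; the function names are hypothetical and the rounding rule is an assumption about how the report is derived, not a statement about Cutadapt's internals.

```python
import math


def allowed_errors(match_length, max_error_rate):
    # Assumption: errors allowed = floor(match length * error rate)
    return math.floor(match_length * max_error_rate)


def error_ranges(adapter_length, max_error_rate):
    """Group match lengths 0..adapter_length by their allowed error count.

    Returns a list of (first_length, last_length, allowed_errors) tuples.
    """
    ranges = []
    start = 0
    current = allowed_errors(0, max_error_rate)
    for length in range(1, adapter_length + 1):
        errors = allowed_errors(length, max_error_rate)
        if errors != current:
            ranges.append((start, length - 1, current))
            start, current = length, errors
    ranges.append((start, adapter_length, current))
    return ranges


print(error_ranges(20, 0.125))  # [(0, 7, 0), (8, 15, 1), (16, 20, 2)]
print(error_ranges(20, 0.12))   # [(0, 8, 0), (9, 16, 1), (17, 20, 2)]
```

Under this rounding-down assumption, the ranges printed in the report match a rate of 0.125. A rate of exactly 0.12 would move the breakpoints to 9 and 17 bp, so the tool may round differently, or the report may come from a slightly different setting.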

Finally, a table is output that gives more detailed information about the lengths of the removed sequences. The following is only an excerpt; some rows are left out:
Overview of removed sequences
length  count   expect  max.err error counts
3       140     156.2   0       140
4       57      39.1    0       57
5       50      9.8     0       50
6       35      2.4     0       35
7       13      0.3     0       1 12
8       31      0.1     1       0 31
...
100     397     0.0     3       358 36 3


## 3. Examples

https://github.com/csf-ngs/fastqc/blob/master/Contaminants/contaminant_list.txt

cutadapt -a ADAPTER_FWD -A ADAPTER_REV -o out.1.fastq -p out.2.fastq reads.1.fastq reads.2.fastq


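-a and -A give the 3' adapters for read 1 and read 2 respectively; -g and -G give the corresponding 5' adapters. Gzip-compressed FASTQ and FASTA files are supported; if necessary, use the -f option to specify the input file format.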

python ~/miniconda2/pkgs/cutadapt-1.10-py27_0/bin/cutadapt \
  -a AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT -a ATCTCGTATGCCGTCTTCTGCTTG \
  -A AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT -A CAAGCAGAAGACGGCATACGAGAT \
  -e 0.1 -O 5 -m 50 -n 2 --pair-filter=both


• The two -a options specify two different adapters; the two -A options specify the reverse complements of those same two adapters.
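
The relationship between the -a and -A sequences in the command above can be checked with a short Python sketch. The helper `reverse_complement` is written here purely for illustration and is not part of Cutadapt; it handles only the plain bases A, C, G and T.

```python
# Translation table mapping each base to its complement (upper and lower case)
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Complement every base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]


fwd_adapters = [
    "AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT",
    "ATCTCGTATGCCGTCTTCTGCTTG",
]
for adapter in fwd_adapters:
    print(reverse_complement(adapter))
```

The two printed sequences are the ones passed to -A in the command above.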

## 4. Background

### 4.1 What is an adapter?

Sequencing data is obtained in four steps:

1. Library preparation
2. PCR amplification
3. Sequencing
4. Alignment and analysis

##### So what is an adapter?

There are a number of different ways to prepare samples, but all preparation methods add adapters to the ends of the DNA fragments. With Illumina, a DNA fragment typically carries two flow cell adapters, one at each end, plus an index.

### Why amplify into DNA clusters?

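1. A flow cell has 8 lanes; each lane physically separates the sequencing samples loaded on it.
2. One run (a single sequencing reaction on the instrument) can load up to 8 lanes at once and yields roughly 4G-75G of sequencing throughput.
3. Each lane holds 2 columns of tiles, 120 tiles in total. Each tile carries a large number of cluster binding sites.

Flow cell. (Image from Illumina, for study purposes only; will be removed on request.)

Flow cell for HiSeq 2000. (Image from hackteria.org)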

What is the difference between "single-end" and "paired-end" reads?

Single-end read: sequencing occurs in one direction only, using Read Primer 1.

Paired-end read: two separate read cycles occur, one in each direction, using both Read Primer 1 and Read Primer 2. This kind of read provides data from both ends of the fragment of interest. If the fragment size is consistent, the forward and reverse reads can also be expected to lie a known distance apart, which helps the software map the reads more accurately.