6.8 给定一系列的序列名字,从原始文件中提取对应的fasta序列
我们求得一条序列的各种信息,一系列分析之后,反过来,要从原文件(包含所有序列的fasta文件)中提取出我最后感兴趣的序列的fasta。一个脚本,可以很好实现要求。
perl extract.using.header.list.pl -l 序列名字(一行一个名字) -s fasta文件 -o 生成文件
#!/usr/bin/env perl
###############################################################################
#
# extract.using.header.list.pl - version 1.0
#
# Extracts a subset of seauences given a list of the headers to extract.
###############################################################################
#pragmas
use strict;
use warnings;
#core Perl modules
use Getopt::Long;
#locally-written modules
BEGIN {
select(STDERR);
$| = 1;
select(STDOUT);
$| = 1;
}
# get input params
my $global_options = checkParams();
my $inlist;
my $insequences;
my $outputfile;
$inlist = &overrideDefault("linlist.txt",'inlist');
$insequences = &overrideDefault("sequences.fa",'insequences');
$outputfile = &overrideDefault("outputfile.txt",'outputfile');
my $countp = 0;
my $countout = 0;
my $count = 0;
my $seq;
my $header = ">start[test]";
my $prevheader = ">start[test]";
my %extract;
######################################################################
# CODE HERE
######################################################################
open(INlist, $inlist) or die("Cannot read file: $inlist\n");
while ( my $line = < INlist > ) {
chomp $line;
if (!exists($extract{$line})){
$extract{$line} = 1;
$count++;
}
}
close INlist;
print "$count sequences to extract.\n";
open(INseq, $insequences) or die("Cannot read file: $insequences\n");
open(OUT, ">$outputfile") or die("Cannot create file: $outputfile\n");
while ( my $line = < INseq > ) {
chomp $line;
if ($line =~ m/>/) {
$countp++;
$prevheader = $header;
$header = $line;
my @splitline = split(/>/,$prevheader);
if(exists($extract{$splitline[1]}) or exists($extract{$prevheader})){
print OUT "$prevheader\n";
print OUT "$seq\n";
$countout++;
}
$seq = "";
}
else{
$seq = $seq.$line;
}
}
my @splitline = split(/>/,$header);
if(exists($extract{$splitline[1]}) or exists($extract{$header})){
print OUT "$header\n";
print OUT "$seq\n";
$countout++;
}
print "$countp sequences in total.\n";
print "$countout sequences extracted.\n";
close OUT;
close INseq;
######################################################################
# TEMPLATE SUBS
######################################################################
sub checkParams {
#-----
# Do any and all options checking here...
#
my @standard_options = ( "help|h+", "insequences|s:s", "inlist|l:s", "outputfile|o:s");
my %options;
# Add any other command line options, and the code to handle them
#
GetOptions( \%options, @standard_options );
#if no arguments supplied print the usage and exit
#
exec("pod2usage $0") if (0 == (keys (%options) ));
# If the -help option is set, print the usage and exit
#
exec("pod2usage $0") if $options{'help'};
# Compulsosy items
#if(!exists $options{'infile'} ) { print "**ERROR: $0 : \n"; exec("pod2usage $0"); }
return \%options;
}
sub overrideDefault
{
#-----
# Set and override default values for parameters
#
my ($default_value, $option_name) = @_;
if(exists $global_options->{$option_name})
{
return $global_options->{$option_name};
}
return $default_value;
}
__DATA__
=head1 NAME
extract.using.header.list.pl
=head1 COPYRIGHT
This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.
=head1 DESCRIPTION
=head1 SYNOPSIS
extract.using.header.list.pl -i [-h] - version 1.0
[-help -h] Displays this basic usage information
[-inlist -l] List of headers to use for extraction.
[-insequences -s] Sequence file where a subset of sequences are to be extracted from.
[-outputfile -o] Outputfile.
=cut
ps:
看来我现在perl功力还是不够啊,以后可能更多的经历是利用好的Perl脚本来更快实现工作,自己多写一些python
bitslife的建议:
- 直接推荐使用 seqtk http://t.cn/R7wxvGH 支持 fasta/fastq 的抽取,
- 如果非要使用perl 的话, 使用 $/ 修改掉默认的 ‘', 会给你惊喜的。
九葫的建议:
- grep 就可以实现,就是可能速度没perl快而已,写这么多只是为了一个提序列,说实话,完成工作自己写perl脚本效率太低了
这里是一个广告位,,感兴趣的都可以发邮件聊聊:tiehan@sina.cn
个人公众号,比较懒,很少更新,可以在上面提问题,如果回复不及时,可发邮件给我: tiehan@sina.cn
个人公众号,比较懒,很少更新,可以在上面提问题,如果回复不及时,可发邮件给我: tiehan@sina.cn