6.8 给定一系列的序列名字，从原始文件中提取对应的fasta序列

October 28, 2014 perl 阅读量：次

我们求得一条序列的各种信息，一系列分析之后，反过来，要从原文件（包含所有序列的fasta文件）中提取出我最后感兴趣的序列的fasta。一个脚本，可以很好实现要求。

perl extract.using.header.list.pl -l 序列名字（一行一个名字） -s fasta文件 -o 生成文件

#!/usr/bin/env perl
###############################################################################
#
#    extract.using.header.list.pl - version 1.0
#
# Extracts a subset of seauences given a list of the headers to extract.
###############################################################################

#pragmas
use strict;
use warnings;

#core Perl modules
use Getopt::Long;

#locally-written modules
BEGIN {
    select(STDERR);
    $| = 1;
    select(STDOUT);
    $| = 1;
}

# get input params
my $global_options = checkParams();

my $inlist;
my $insequences;
my $outputfile;

$inlist = &overrideDefault("linlist.txt",'inlist');
$insequences = &overrideDefault("sequences.fa",'insequences');
$outputfile = &overrideDefault("outputfile.txt",'outputfile');
 
my $countp = 0;
my $countout = 0;
my $count = 0;
my $seq;
my $header = ">start[test]";
my $prevheader = ">start[test]";
my %extract;

 
######################################################################
# CODE HERE
######################################################################


open(INlist, $inlist) or die("Cannot read file: $inlist\n");

while ( my $line = <  INlist > ) {
chomp $line; 
if (!exists($extract{$line})){
$extract{$line} = 1;
$count++;
}
}
close INlist;

print "$count sequences to extract.\n";


open(INseq, $insequences) or die("Cannot read file: $insequences\n");
open(OUT, ">$outputfile") or die("Cannot create file: $outputfile\n");

while ( my $line = < INseq > ) {
chomp $line; 
if ($line =~ m/>/) {
$countp++;
$prevheader = $header;
$header = $line;
my @splitline = split(/>/,$prevheader);
if(exists($extract{$splitline[1]}) or exists($extract{$prevheader})){
print OUT "$prevheader\n";
print OUT "$seq\n";
$countout++;
}
$seq = "";
}
else{
$seq = $seq.$line;
}
}
my @splitline = split(/>/,$header);
if(exists($extract{$splitline[1]}) or exists($extract{$header})){
print OUT "$header\n";
print OUT "$seq\n";
$countout++;
}

print "$countp sequences in total.\n";
print "$countout sequences extracted.\n";

close OUT;
close INseq;

######################################################################
# TEMPLATE SUBS
######################################################################
sub checkParams {
    #-----
    # Do any and all options checking here...
    #
    my @standard_options = ( "help|h+", "insequences|s:s", "inlist|l:s", "outputfile|o:s");
    my %options;

    # Add any other command line options, and the code to handle them
    # 
    GetOptions( \%options, @standard_options );
    
#if no arguments supplied print the usage and exit
    #
    exec("pod2usage $0") if (0 == (keys (%options) ));

    # If the -help option is set, print the usage and exit
    #
    exec("pod2usage $0") if $options{'help'};

    # Compulsosy items
    #if(!exists $options{'infile'} ) { print "**ERROR: $0 : \n"; exec("pod2usage $0"); }

    return \%options;
}

sub overrideDefault
{
    #-----
    # Set and override default values for parameters
    #
    my ($default_value, $option_name) = @_;
    if(exists $global_options->{$option_name}) 
    {
        return $global_options->{$option_name};
    }
    return $default_value;
}

__DATA__

=head1 NAME

    extract.using.header.list.pl

=head1 COPYRIGHT

   This program is free software: you can redistribute it and/or modify
   it under the terms of the GNU General Public License as published by
   the Free Software Foundation, either version 3 of the License, or
   (at your option) any later version.

=head1 DESCRIPTION



=head1 SYNOPSIS

extract.using.header.list.pl  -i [-h] - version 1.0

 [-help -h]           Displays this basic usage information
 [-inlist -l]         List of headers to use for extraction.
 [-insequences -s]    Sequence file where a subset of sequences are to be extracted from.
 [-outputfile -o]     Outputfile. 
=cut

ps:

看来我现在perl功力还是不够啊，以后可能更多的经历是利用好的Perl脚本来更快实现工作，自己多写一些python

bitslife的建议：

直接推荐使用 seqtk http://t.cn/R7wxvGH 支持 fasta/fastq 的抽取，
如果非要使用perl 的话，使用 $/ 修改掉默认的 ‘', 会给你惊喜的。

九葫的建议：

grep 就可以实现，就是可能速度没perl快而已，写这么多只是为了一个提序列，说实话，完成工作自己写perl脚本效率太低了

这里是一个广告位，，感兴趣的都可以发邮件聊聊：tiehan@sina.cn

个人公众号，比较懒，很少更新，可以在上面提问题，如果回复不及时，可发邮件给我： tiehan@sina.cn