blast format: 5 = BLAST XML, 格式 相比于 6 格式 包含更多的信息内容,包括比对的序列信息,长度信息等等;接下来我们可以用bioperl包来解析,获取自己想要的信息;
#!/usr/bin/perl
use strict;
use Bio::SearchIO;
use Data::Dumper;
my $fi = $ARGV[0];
my $out = $ARGV[1];
my $format;
$format||="blastxml";
my $searchio = new Bio::SearchIO(
-format => "$format",
-file => "$fi"
);
open OUT,">$out" or die "$!";
######################################
print OUT "query_name\thit_name\tquery_length\thit_length\tidentity\talign_length\tmismatch\tgaps\tquery_start\tquery_end\thit_start\thit_end\tevalue\tbit_score\tquery_string\thit_string\tqfame\thfame\thit_description\n";
while(my $result = $searchio->next_result){
my $algorithm_type = $result->algorithm;
my $algorithm_version = $result ->algorithm_version;
while( my $hit = $result->next_hit ) {
# process the Bio::Search::Hit::HitI object
while (my $hsp = $hit->next_hsp) {
#print Dumper($hsp);
# process the Bio::Search::HSP::HSPI object
my $query_name = $result->query_name;
my $query_length = $result->query_length;
my $query_description = $result->query_description;
my $query_string = $hsp->query_string;
my $hit_string = $hsp->hit_string;
my $align = $hsp->homology_string;
$align =~ s/ //g;
my $qfame = $hsp->query->frame;
my $hfame =$hsp->hit->frame;
my $hit_name = $hit->name;
my $hit_description = $hit->description;
my $hit_length = $hit->length;
my $identity = sprintf "%0.2f", $hsp->frac_identical;
my $mismatch =()= $hsp->seq_inds('hit','nomatch');
my $gaps = $hsp->gaps;
my $align_length = $hsp->hsp_length;
my $query_start = $hsp->start('query');
my $query_end = $hsp->end('query');
my $hit_start = $hsp->start('hit');
my $hit_end = $hsp->end('hit');
my $evalue = $hsp->evalue;
my $bit_score = $hsp->bits;
print OUT qq{$query_description\t$hit_name\t$query_length\t$hit_length\t$identity\t$align_length\t$mismatch\t$gaps\t$query_start\t$query_end\t$hit_start\t$hit_end\t$evalue\t$bit_score\t$query_string\t$hit_string\t$qfame\t$hfame\t$hit_description\n};
}
}
}
close(OUT);
$searchio->close();
$result 对象 信息:
Result | algorithm | BLASTX | algorithm string | |||
Result | algorithm_version | 2.2.4 [Aug-26-2002] | algorithm version | |||
Result | query_name | gi | 20521485 | dbj | AP004641.2 | query name |
Result | query_accession | AP004641.2 | query accession | |||
Result | query_length | 3059 | query length | |||
Result | query_description | Oryza sativa … 977CE9AF checksum. | query description | |||
Result | database_name | test.fa | database name | |||
Result | database_letters | 1291 | number of residues in database | |||
Result | database_entries | 5 | number of database entries | |||
Result | available_statistics | effectivespaceused … dbletters | statistics used | |||
Result | available_parameters | gapext matrix allowgaps gapopen | parameters used | |||
Result | num_hits | 1 | number of hits | |||
Result | hits | List of all Bio::Search::Hit::GenericHit objects for this Result | ||||
Result | rewind | Reset the internal iterator that dictates where next_hit() is pointing, useful for re-iterating through the list of hits |
Table 2.1: The data returned by the Result object methods when the report shown above is used as input.
$hit 对象包含的信息:
Note that many of the methods shown can be used to either get or set values, but we’re just showing what they get.
Hit | name | gb | 443893 | 124775 | hit name |
Hit | length | 331 | Length of the Hit sequence | ||
Hit | accession | 443893 | accession (usually when this is a Genbank formatted id this will be an accession number - the part after the gb or emb ) | ||
Hit | description | LaForas sequence | hit description | ||
Hit | algorithm | BLASTX | algorithm | ||
Hit | raw_score | 92 | hit raw score | ||
Hit | significance | 2e-022 | hit significance | ||
Hit | bits | 92.0 | hit bits | ||
Hit | hsps | List of all Bio::Search::HSP::GenericHSP objects for this Hit | |||
Hit | num_hsps | 1 | number of HSPs in hit | ||
Hit | locus | 124775 | locus name | ||
Hit | accession_number | 443893 | accession number | ||
Hit | rewind | Resets the internal counter for next_hsp() so that the iterator will begin at the beginning of the list |
Table 2.2. The data returned by Hit object methods when the report shown above is used as input.
$hsp 包含的信息:
Many of the methods shown can be used to either get or set values, but we’re just showing what they get.
HSP | algorithm | BLASTX | algorithm |
HSP | evalue | 2e-022 | e-value |
HSP | expect | 2e-022 | alias for evalue() |
HSP | frac_identical | 0.884615384615385 | fraction identical |
HSP | frac_conserved | 0.923076923076923 | fraction conserved (conservative and identical replacements aka “fraction similar”) |
HSP | gaps | 2 | number of gaps |
HSP | query_string | DMGRCSSG .. | query string from alignment |
HSP | hit_string | DIVQNSS … | hit string from alignment |
HSPt | homology_string | D+ + SSGCN … | string from alignment |
HSP | length(‘total’) | 52 | length of HSP (including gaps) |
HSP | length(‘hit’) | 50 | length of hit participating in alignment minus gaps |
HSP | length(‘query’)t | 156 | length of query participating in alignment minus gaps |
HSPt | hsp_length | 52 | Length of the HSP (including gaps) alias for length(‘total’) |
HSPt | frame | 0 | $hsp->query->frame,$hsp->hit->frame |
HSP | num_conserved | 48 | number of conserved (conservative replacements, aka “similar”) residues |
HSP | num_identical | 46 | number of identical residues |
HSPt | rank | 1 | rank of HSP |
HSP | seq_inds(‘query’,’identical’) | (966,971,972,973,974,975 …) | identical positions as array |
HSP | seq_inds(‘query’,’conserved-not-identical’) | (967,969) | conserved, but not identical positions as array |
HSP | seq_inds(‘query’,’conserved’) | (966,967,969,971,973,974,975, …) | conserved or identical positions as array |
HSP | seq_inds(‘hit’,’identical’) | (197,202,203,204,205, …) | identical positions as array |
HSP | seq_inds(‘hit’,’conserved-not-identical’) | (198,200) | conserved not identical positions as array |
HSP | seq_inds(‘hit’,’conserved’,1) | (197,202-246) | conserved or identical positions as array, with runs of consecutive numbers compressed |
HSPt | score | 227 | score |
HSP | bits | 92.0 | score in bits |
HSP | range(‘query’) | (2896,3051) | start and end as array |
HSP | range(‘hit’) | (197,246) | start and end as array |
HSP | percent_identity | 88.4615384615385 | % identical |
HSP | strand(‘hit’) | 1 | strand of the hit |
HSP | strand(‘query’) | 1 | strand of the query |
HSP | start(‘query’) | 2896 | start position from alignment |
HSP | end(‘query’) | 3051 | end position from alignment |
HSP | start(‘hit’) | 197 | start position from alignment |
HSP | end(‘hit’) | 246 | end position from alignment |
HSP | matches(‘hit’) | (46,48) | number of identical and conserved as array |
HSP | matches(‘query’) | (46,48) | number of identical and conserved as array |
HSP | get_aln | sequence alignment | Bio::SimpleAlign object |
HSPt | hsp_group | Not available in this report | Group field from WU-BLAST reports run with -topcomboN or -topcomboE specified |
HSP | links | Not available in this report | Links field from WU-BLAST reports run with -links showing consistent HSP linking |
Table 2.3. The data returned by HSP object methods when the report shown above is used as input.
更多说明见:https://bioperl.org/howtos/SearchIO_HOWTO.html
如果觉得我的文章对您有用,请随意打赏。你的支持将鼓励我继续创作!