merged.gtf 合并同一转录本的exon位置

在merged.gtf文件中有所有外显子的信息,下面的脚本可根据此文件提取转录本的所有外显子位置信息。 merged.gtf文件实例: Chr00   Cufflinks       exon    37990   38333   .       +      ...

在merged.gtf文件中有所有外显子的信息,下面的脚本可根据此文件提取转录本的所有外显子位置信息。

merged.gtf文件实例:

Chr00   Cufflinks       exon    37990   38333   .       +       .       gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "1"; gene_name "MD00G1000200"; oId "CUFF.2.1"; nearest_ref "mRNA:MD00G1000200"; class_code "j"; tss_id "TSS1";
Chr00   Cufflinks       exon    38607   38710   .       +       .       gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "2"; gene_name "MD00G1000200"; oId "CUFF.2.1"; nearest_ref "mRNA:MD00G1000200"; class_code "j"; tss_id "TSS1";
Chr00   Cufflinks       exon    38814   38898   .       +       .       gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "3"; gene_name "MD00G1000200"; oId "CUFF.2.1"; nearest_ref "mRNA:MD00G1000200"; class_code "j"; tss_id "TSS1";
Chr00   Cufflinks       exon    42611   42713   .       +       .       gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "4"; gene_name "MD00G1000200"; oId "CUFF.2.1"; nearest_ref "mRNA:MD00G1000200"; class_code "j"; tss_id "TSS1";
Chr00   Cufflinks       exon    42906   43203   .       +       .       gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "5"; gene_name "MD00G1000200"; oId "CUFF.2.1"; nearest_ref "mRNA:MD00G1000200"; class_code "j"; tss_id "TSS1";
输出文件示例:
Chr00   +       XLOC_000001     MD00G1000200    TCONS_00000001  exon    37990-38333     38607-38710     38814-38898     42611-42713     42906-43203
Chr00   +       XLOC_000001     MD00G1000200    TCONS_00000002  exon    38005-38333     38607-38710     38814-38898     42611-42726     42906-43167
Chr00   +       XLOC_000002     MD00G1000400    TCONS_00000003  exon    50386-50877
Chr00   +       XLOC_000003     MD00G1000500    TCONS_00000004  exon    76659-76991     77468-77544     77649-77715     77889-77970     78355-78424
Chr00   +       XLOC_000004     MD00G1000600    TCONS_00000005  exon    101951-102138   102228-102398   102957-103004   103099-103138   103227-103327  
Chr00   +       XLOC_000004     MD00G1000600    TCONS_00000006  exon    102003-102138   102228-102398   102957-103004   103099-103138   103227-103327  
Chr00   +       XLOC_000005     MD00G1000700    TCONS_00000007  exon    105542-105626   105926-106541   108356-108832
Chr00   +       XLOC_000005     MD00G1000700    TCONS_00000009  exon    105542-105626   105926-106541   108902-109696
Chr00   +       XLOC_000005     MD00G1000700    TCONS_00000008  exon    105542-105626   105926-106541   108949-109696
Chr00   +       XLOC_000006     MD00G1001100    TCONS_00000010  exon    276592-277221   280928-280975
其中第一列为染色体;第二列为正负链;第三列是gene_id;第四列为gene_name;第五列为转录本ID;之后是外显子的起始位置信息

代码如下:

#!/usr/bin/perl -w
use strict;
use warnings;
use Getopt::Long;
use Config::General;
use Cwd qw(abs_path getcwd);
use FindBin qw($Bin $Script);
my $version = "1.2";
## prepare parameters #######################################################################
## -------------------------------------------------------------------------------------------
## GetOptions
my %opts;
GetOptions(\%opts, "gtf=s", "od=s", "h");
my $od = $opts{od};
$od = abs_path($od);
mkdir $od unless(-d $od);
open(IN,"$opts{gtf}") ||die "open file $opts{gtf} failed.";
open(OUT,">$opts{od}/merged.tpm") ||die "open file $opts{od}/merged.tpm failed.";
while(<IN>){
next if(/^#/);
chomp;
my($chr,$a,$exon,$start,$end,$c,$link,$d,$lin) = split("\t",$_);
$lin=~/transcript_id \"([^\"]*)\"/;
my $trans = $1;
$lin=~/gene_name \"([^\"]*)\"/;
my $gene_name= $1;
$lin =~/gene_id \"([^\"]*)\"/;
my $gene_id= $1;
$lin =~/transcript_id \"([^\"]*)\"/;
my $trans_id = $1;
print OUT join("\t",$chr,$exon,$start,$end,$link,$gene_id,$trans_id)."\n";

}
close(IN);
close(OUT);
open(IN,"$opts{od}/merged.tpm") ||die "open file $opts{od}/merged.tpm failed.";
open(OUT,">$opts{od}/merged.gtf") ||die "open file $opts{od}/merged.gtf failed.";
my $cmd="";
my $key="";
while(<IN>){
next if(/^#/);
chomp;
my ($chr,$exon,$start,$end,$link,$gene_id,$gene_name,$trans_id) = split("\t",$_);
if($key eq $trans_id){
$cmd .= "\t".$start."-".$end;
}else{
$key = $trans_id;
if($cmd ne ""){
print OUT $cmd."\n";
}
$cmd = join("\t",$chr,$link,$gene_id,$trans_id,$exon,$start."-".$end);
}
}
close(IN);
close(OUT);



  • 发表于 2018-06-22 15:09
  • 阅读 ( 5110 )
  • 分类:perl

0 条评论

请先 登录 后评论
安生水
安生水

349 篇文章

作家榜 »

  1. omicsgene 701 文章
  2. 安生水 349 文章
  3. Daitoue 167 文章
  4. 生物女学霸 120 文章
  5. xun 82 文章
  6. 红橙子 78 文章
  7. rzx 76 文章
  8. CORNERSTONE 72 文章