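When analyzing long non-coding RNA (lncRNA) transcriptome data, transcripts are usually assembled independently for each sample, and cuffmerge is then used to merge them into a single, unified gene annotation GTF file.
So how do we pick out the novel transcripts? One starting point is the class codes in the GTF file, which record each transcript's position relative to the known (reference) transcripts.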
Priority | Code | Description |
1 | = | Complete match of intron chain |
2 | c | Contained |
3 | j | Potentially novel isoform (fragment): at least one splice junction is shared with a reference transcript |
4 | e | Single exon transfrag overlapping a reference exon and at least 10 bp of a reference intron, indicating a possible pre-mRNA fragment |
5 | i | A transfrag falling entirely within a reference intron |
6 | o | Generic exonic overlap with a reference transcript |
7 | p | Possible polymerase run-on fragment (within 2 kb of a reference transcript) |
8 | r | Repeat. Currently determined by looking at the soft-masked reference sequence and applied to transcripts where at least 50% of the bases are lower case |
9 | u | Unknown, intergenic transcript |
10 | x | Exonic overlap with reference on the opposite strand |
11 | s | An intron of the transfrag overlaps a reference intron on the opposite strand (likely due to read mapping errors) |
12 | . | (.tracking file only, indicates multiple classifications) |
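Based on this class_code, we generally select three types of transcripts:
i: transcripts that fall entirely within an intron of a known gene
u: novel transcripts in intergenic regions
x: antisense transcripts that overlap known exons on the opposite strand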
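
The selection of the i, u and x classes can be sketched in a few lines of Python. This is a minimal sketch rather than a definitive pipeline: it assumes the merged GTF stores the code in column 9 as an attribute of the form `class_code "u";`, and the function name `filter_novel` and the file names in the usage comment are hypothetical.

```python
import re

# Class codes kept as candidate novel transcripts: intronic (i),
# intergenic (u) and antisense exonic overlap (x).
NOVEL_CODES = {"i", "u", "x"}

# Assumption: the attribute column holds the code as  class_code "u";
CLASS_CODE_RE = re.compile(r'class_code "([^"]+)"')


def filter_novel(gtf_lines, codes=NOVEL_CODES):
    """Yield the GTF lines whose class_code attribute is in `codes`."""
    for line in gtf_lines:
        # Skip comment lines and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        # A GTF record has 9 tab-separated columns; attributes are the 9th.
        if len(fields) < 9:
            continue
        match = CLASS_CODE_RE.search(fields[8])
        if match and match.group(1) in codes:
            yield line


# Usage (hypothetical file names):
# with open("merged.gtf") as src, open("novel.gtf", "w") as dst:
#     dst.writelines(filter_novel(src))
```

Lines without a class_code attribute are dropped here; whether that is appropriate depends on how the GTF file was produced, so it is worth checking a few records by eye first.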