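Your question is a bit convoluted, and I am not sure what you want the result file to look like. Below are the download address and field descriptions for the NCBI taxonomy database. Based on how the contents of the files relate to each other, you should be able to write a Perl script that does what you need.
Download location for the taxonomy database and the taxid-to-gi mapping files: ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/
File 1: nodes.dmp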
tax_id -- node id in GenBank taxonomy database
parent tax_id -- parent node id in GenBank taxonomy database
rank -- rank of this node (superkingdom, kingdom, ...)
embl code -- locus-name prefix; not unique
division id -- see division.dmp file
inherited div flag (1 or 0) -- 1 if node inherits division from parent
genetic code id -- see gencode.dmp file
inherited GC flag (1 or 0) -- 1 if node inherits genetic code from parent
mitochondrial genetic code id -- see gencode.dmp file
inherited MGC flag (1 or 0) -- 1 if node inherits mitochondrial gencode from parent
GenBank hidden flag (1 or 0) -- 1 if name is suppressed in GenBank entry lineage
hidden subtree root flag (1 or 0) -- 1 if this subtree has no sequence data yet
comments
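
The parent tax_id field is what lets you climb from any node to the root. The answer suggests Perl; the sketch below uses Python instead, purely for illustration. It assumes the usual taxdump layout (fields separated by tab, pipe, tab, and rows ending in tab, pipe) and that the root node lists itself as its own parent. The function names, the make_row helper, and the three sample rows are made up for the example and are not taken from the real file.

```python
def parse_dmp_line(line):
    """Split one taxdump row into stripped fields (assumed '\\t|\\t' delimiter)."""
    line = line.rstrip("\n")
    if line.endswith("\t|"):
        line = line[:-2]  # drop the row terminator
    return [field.strip() for field in line.split("\t|\t")]


def load_nodes(lines):
    """Map tax_id -> (parent tax_id, rank, division id)."""
    nodes = {}
    for line in lines:
        if not line.strip():
            continue
        f = parse_dmp_line(line)
        nodes[int(f[0])] = (int(f[1]), f[2], int(f[4]))
    return nodes


def lineage(nodes, tax_id):
    """Follow parent links from tax_id up to the root (its own parent)."""
    path = []
    while tax_id in nodes:
        path.append(tax_id)
        parent = nodes[tax_id][0]
        if parent == tax_id:
            break
        tax_id = parent
    return path


def make_row(*fields):
    """Hypothetical helper: build a row in the assumed nodes.dmp layout."""
    return "\t|\t".join(fields) + "\t|\n"


# Made-up rows with the 13 fields listed above; not real taxonomy data.
sample = [
    make_row("1", "1", "no rank", "", "8", "0", "1", "0", "0", "0", "0", "0", ""),
    make_row("10", "1", "genus", "", "0", "0", "11", "0", "0", "0", "0", "0", ""),
    make_row("20", "10", "species", "", "0", "1", "11", "1", "0", "1", "0", "0", ""),
]

nodes = load_nodes(sample)
print(lineage(nodes, 20))
```

With the real file you would pass an open file handle for nodes.dmp to load_nodes instead of the sample list.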
File 2: division.dmp
Divisions file has these fields:
division id -- taxonomy database division id
division cde -- GenBank division code (three characters)
division name -- e.g. BCT, PLN, VRT, MAM, PRI...
comments
The specific division entries are:
0 | BCT | Bacteria | |
1 | INV | Invertebrates | |
2 | MAM | Mammals | |
3 | PHG | Phages | |
4 | PLN | Plants and Fungi | |
5 | PRI | Primates | |
6 | ROD | Rodents | |
7 | SYN | Synthetic and Chimeric | |
8 | UNA | Unassigned | No species nodes should inherit this division assignment |
9 | VRL | Viruses | |
10 | VRT | Vertebrates | |
11 | ENV | Environmental samples | Anonymous sequences cloned directly from the environment |
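
The division id stored in nodes.dmp can be turned into a readable code and name with a small lookup table. The following is a minimal Python sketch, assuming the rows are pipe-delimited as in the listing above; load_divisions is a hypothetical name, and the sample rows are copied from that listing.

```python
def load_divisions(lines):
    """Map division id -> (division code, division name)."""
    divisions = {}
    for line in lines:
        # Splitting on '|' and stripping works whether the padding is tabs or spaces.
        fields = [f.strip() for f in line.split("|")]
        if len(fields) < 3 or not fields[0]:
            continue
        divisions[int(fields[0])] = (fields[1], fields[2])
    return divisions


# A few rows from the listing above.
rows = [
    "0 | BCT | Bacteria | |",
    "4 | PLN | Plants and Fungi | |",
    "9 | VRL | Viruses | |",
    "11 | ENV | Environmental samples | Anonymous sequences cloned directly from the environment |",
]

divisions = load_divisions(rows)
code, name = divisions[9]
print(code, name)
```

Only the first three fields are used, so free text in the comments column does not affect the lookup.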
File 3: gi_taxid_nucl.dmp
The file gi_taxid_nucl.dmp contains two columns: the first (left) column is
the GenBank identifier (gi) of nucleotide record, the second (right) column is
taxonomy identifier (taxid).
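
Since this file only pairs a gi with a taxid, a script can stream through it once and keep just the gi values it cares about, then use the taxid to look up the node in nodes.dmp. Below is a minimal Python sketch of that first step, assuming the two columns are whitespace-separated. The function name taxids_for_gis and the sample values are made up for illustration.

```python
import io


def taxids_for_gis(handle, wanted_gis):
    """Scan gi/taxid rows once and return {gi: taxid} for the requested gis."""
    wanted = set(wanted_gis)
    found = {}
    for line in handle:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed rows
        gi = int(parts[0])
        if gi in wanted:
            found[gi] = int(parts[1])
            if len(found) == len(wanted):
                break  # every requested gi has been resolved
    return found


# Made-up rows in the two-column layout; not real gi or taxid values.
sample = io.StringIO("101\t20\n102\t20\n103\t10\n")

gi_to_taxid = taxids_for_gis(sample, [101, 103])
print(gi_to_taxid)
```

Streaming line by line avoids loading the whole mapping into memory, which matters because the real file is large. With real data you would pass an open handle for gi_taxid_nucl.dmp in place of the StringIO sample.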