A combination of structure-based and homology-based approaches was employed to identify transposable elements (TEs) and other repeat regions. First, miniature inverted-repeat transposable elements (MITEs) and long terminal repeat (LTR) elements were annotated by structure-based methods. MITEs were identified with MITE Hunter (11-2011)64, and LTR elements with LTR_retriever (v2.8.7)65, which integrates the LTRs predicted by LTRharvest66 and LTR_Finder (v1.0.7)67. Other known repeat sequences were annotated by searching RepBase (v20170127) (http://www.girinst.org/server/RepBase/index.php) with RepeatMasker (v4.1.0) (http://repeatmasker.org). RepeatMasker was then used to annotate all repeat sequences by searching a library that combined the MITEs, LTRs and known repeat sequences. Furthermore, RepeatModeler (v2.0) (http://www.repeatmasker.org/RepeatModeler/) was used to update the repeat sequence library by identifying the types of repeat sequences, and RepeatMasker was finally employed to mask the patchouli genome. The LTR Assembly Index (LAI), a byproduct of LTR_retriever, was used to assess the quality of the patchouli genome assembly24.
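The step that combines the MITE, LTR and known-repeat sequences into a single RepeatMasker library can be sketched as follows. This is only an illustration under stated assumptions: the paper does not say how the libraries were merged, the function names are hypothetical, and redundancy is removed here by exact sequence match only, which is far cruder than a clustering-based approach.

```python
from pathlib import Path
from typing import Dict, Iterable, Iterator, Tuple


def read_fasta(path: Path) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from a FASTA file."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            else:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def merge_libraries(paths: Iterable[Path], out_path: Path) -> int:
    """Concatenate FASTA libraries into one file.

    Assumption: two records are redundant only if their sequences are
    identical after upper-casing; the first header seen is kept.
    Returns the number of records written.
    """
    seen: Dict[str, str] = {}  # upper-cased sequence -> first header
    for path in paths:
        for header, seq in read_fasta(path):
            key = seq.upper()
            if key not in seen:
                seen[key] = header
    with open(out_path, "w") as out:
        for seq, header in seen.items():
            out.write(f">{header}\n{seq}\n")
    return len(seen)
```

With hypothetical file names, a call such as `merge_libraries([Path("mite.fa"), Path("ltr.fa"), Path("repbase_hits.fa")], Path("combined_lib.fa"))` would produce a library that could then be passed to RepeatMasker as a custom library.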
The repeat-masked genome was used for gene annotation. The protein-coding genes were annotated by incorporating transcriptional evidence, homology support from related species, and ab initio methods. Eighteen patchouli RNA-seq datasets (SRR8769986, SRR7268115, SRR7268117, SRR8785265, SRR1770488, SRR7268119, SRR8756845, SRR7345998, SRR7345999, SRR7346000, SRR8755904, SRR8767850, SRR8755475, SRR8775235, SRR8775238, SRR8793583, SRR8809556, and SRR8820010) were downloaded from the NCBI SRA database, containing samples from different accessions (Yinni (YN), Shipai (SP) and Hainan (HN)), different tissues (root, stem and leaf) and different treatments (methyl jasmonate (MeJA), salicylic acid (SA), abscisic acid (ABA), ethanol and light)68,69,70. After trimming with fastp (version 0.20.1)71, the clean reads were aligned to the patchouli genome with HISAT2 (v2.1.0)72, and the transcripts were assembled with StringTie (v2.1.3b)73 using the default parameters. The transcripts from all samples were merged and subjected to TransDecoder (https://github.com/TransDecoder/TransDecoder/wiki) in PASA (v2.4.1)74 for protein-coding sequence prediction and quality filtering. Only complete transcripts were retained for further analysis. The protein sequences from Arabidopsis thaliana (Phytozome, TAIR10), Sesamum indicum (NCBI, GCF_000512975.1_S_indicum_v1.0), Solanum lycopersicum (Phytozome, ITAG 3.2) and Utricularia gibba (CoGe, ID29027) were mapped to the assembled genome using GeMoMa (v1.6.1)75 to obtain high-quality protein structures. SNAP (version 2006-07-28)76, GeneMark-ES Suite (version 4.57)77, and Augustus (v3.2.2)78 were used for the ab initio gene prediction. All three were trained on the high-quality transcripts from the previous step, and de novo gene identification was then performed according to the instruction manuals. All gene structures predicted by the above methods were integrated into a nonredundant gene set using EVidenceModeler (EVM) (v1.1.1)74.
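The read-processing steps described above (trimming, alignment, per-sample assembly, merging) can be sketched as a small command builder. This is a hedged illustration, not the authors' pipeline: it assumes paired-end reads named `<sample>_1.fastq.gz` and `<sample>_2.fastq.gz`, a prebuilt HISAT2 index, and a `samtools sort` step that the paper does not mention; `--dta` and the use of `stringtie --merge` for combining samples are likewise assumptions. The function names are hypothetical, and the commands are only constructed, not executed.

```python
import shlex
from typing import List, Sequence


def rnaseq_commands(sample: str, index: str, threads: int = 4) -> List[List[str]]:
    """Build fastp -> HISAT2 -> samtools sort -> StringTie commands for one sample."""
    r1, r2 = f"{sample}_1.fastq.gz", f"{sample}_2.fastq.gz"  # assumed raw read names
    c1, c2 = f"{sample}_1.clean.fastq.gz", f"{sample}_2.clean.fastq.gz"
    return [
        # Adapter and quality trimming
        ["fastp", "-i", r1, "-I", r2, "-o", c1, "-O", c2],
        # Spliced alignment; --dta tailors output for transcript assemblers (assumption)
        ["hisat2", "-p", str(threads), "--dta", "-x", index,
         "-1", c1, "-2", c2, "-S", f"{sample}.sam"],
        # Coordinate sorting, required by StringTie (step not stated in the paper)
        ["samtools", "sort", "-@", str(threads), "-o", f"{sample}.bam", f"{sample}.sam"],
        # Transcript assembly with default parameters
        ["stringtie", f"{sample}.bam", "-o", f"{sample}.gtf"],
    ]


def merge_command(samples: Sequence[str], out_gtf: str = "merged.gtf") -> List[str]:
    """Build a StringTie merge command over the per-sample GTF files."""
    return ["stringtie", "--merge", "-o", out_gtf] + [f"{s}.gtf" for s in samples]


if __name__ == "__main__":
    for cmd in rnaseq_commands("SRR8769986", "patchouli_index"):
        print(shlex.join(cmd))
    print(shlex.join(merge_command(["SRR8769986", "SRR7268115"])))
```

Building the commands as lists keeps them safe to pass to `subprocess.run` without shell quoting issues, and printing them first makes it easy to check each step before running it.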
The EVM weight was set to 10 for high-quality RNA-seq transcripts, 5 for high-quality homologous proteins, and 2 for ab initio predicted transcripts. The EVM-predicted genes were further corrected with PASA (v2.4.1)74 to predict untranslated regions and alternative splicing events. Finally, the resulting protein models were functionally annotated by integrating the annotation information from InterProScan (v5.18-57.0)79, the NCBI nonredundant protein database (ftp://ftp.ncbi.nlm.nih.gov/blast/db/) and the eggNOG database (v5.0) (http://eggnog5.embl.de/#/app/downloads).
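The evidence weights of 10, 5 and 2 given above would be supplied to EVM through a three-column, tab-separated weights file (evidence class, source label, weight). The sketch below writes such a file under stated assumptions: the source labels are hypothetical and would have to match column 2 of the corresponding GFF3 files, and the assignment of each evidence type to an EVM class is a guess, since the paper does not report it (complete gene models are sometimes assigned to `OTHER_PREDICTION` instead).

```python
from typing import Iterable, Tuple

WeightRow = Tuple[str, str, int]  # (EVM evidence class, GFF3 source label, weight)

# Weights from the text; class choices and source labels are assumptions.
WEIGHTS: Tuple[WeightRow, ...] = (
    ("TRANSCRIPT", "transdecoder", 10),        # high-quality RNA-seq transcripts
    ("PROTEIN", "GeMoMa", 5),                  # high-quality homologous proteins
    ("ABINITIO_PREDICTION", "Augustus", 2),    # ab initio predictors
    ("ABINITIO_PREDICTION", "SNAP", 2),
    ("ABINITIO_PREDICTION", "GeneMark", 2),
)


def format_weights(rows: Iterable[WeightRow]) -> str:
    """Render rows as tab-separated lines, one evidence source per line."""
    return "".join(f"{cls}\t{src}\t{weight}\n" for cls, src, weight in rows)


if __name__ == "__main__":
    # Hypothetical output file name
    with open("weights.txt", "w") as handle:
        handle.write(format_weights(WEIGHTS))
```

A higher weight makes EVM favor that evidence type when sources disagree, so here transcript-derived structures override protein homology, which in turn overrides ab initio predictions.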
Shen, Y., Li, W., Zeng, Y. et al. Chromosome-level and haplotype-resolved genome provides insight into the tetraploid hybrid origin of patchouli. Nat Commun 13, 3511 (2022). https://doi.org/10.1038/s41467-022-31121-w