最近有粉丝留言,TCGA数据库发生更新,下载数据时报错。详细的更新信息参考TCGA官网说明:
https://docs.gdc.cancer.gov/Data/Release_Notes/Data_Release_Notes/#data-release-320
这里提示几点重要更新:
例如:转录组之前HTSeq流程的数据,现在是STAR-Counts的数据。
受到影响的主要是表达数据,这里大家可以更新一下TCGA下载的代码,即可下载最新的数据,详细代码及说明如下:
#安装最新的 TCGAbiolinks R 包 BiocManager::install("BioinformaticsFMRP/TCGAbiolinks") #一个方法,可以将ensembl ID 转换成 gene name ensemblToName<-function(expr,gene.info){ expr=as.data.frame(expr) #顺序保持一致 gene.info<-as.data.frame(gene.info) row.names(gene.info)<-gene.info$gene_id gene.info<-gene.info[rownames(expr),] sample.id=colnames(expr) #添加gene name列 相同基因随机选取 gene_name_rmdup=!duplicated(gene.info$gene_name) expr.res=expr[gene_name_rmdup,] gene.info=gene.info[gene_name_rmdup,] rownames(expr.res)<-gene.info$gene_name expr.res } #加载需要的包,这里需要两个R包 package_list <- c("SummarizedExperiment","TCGAbiolinks") for(p in package_list){ if(!suppressWarnings(suppressMessages(require(p, character.only = TRUE, quietly = TRUE, warn.conflicts = FALSE)))){ if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install(p) suppressWarnings(suppressMessages(library(p, character.only = TRUE, quietly = TRUE, warn.conflicts = FALSE))) } } #查询TCGA数据,这里以TCGA-CHOL为例,样本少好下载 query <- GDCquery(project = "TCGA-CHOL", data.category = "Transcriptome Profiling", data.type = "Gene Expression Quantification", # 有些样本只有 "Isoform Expression Quantification" workflow.type = "STAR - Counts", #sample.type = sample_type, legacy = FALSE) files <- getResults(query,cols=c("cases")) cat("Total files to download:", length(files)) # TP 样本数量 肿瘤样本 dataSmTP <- TCGAquery_SampleTypes(barcode = files, typesample = "TP") cat("Total TP samples to down:", length(dataSmTP)) # NT 样本数量 正常组织样本 dataSmNT <- TCGAquery_SampleTypes(barcode = files,typesample = "NT") cat("Total NT samples to down:", length(dataSmNT)) #下载数据 GDCdownload(query = query,directory = "TCGA-CHOL",files.per.chunk=10, method='api') # 保存整理下载数据结果 gene.data <- GDCprepare(query = query,directory = "TCGA-CHOL" ) #获得样本数据信息 sample.info.list=colData(gene.data)@listData sample.info=as.data.frame(sample.info.list[1:10]) #获得基因信息数据 gene.info=as.data.frame(rowRanges(gene.data)) # 获得表达数据 ,注意这里表达数据有6种,可以根据i参数指定: # unstranded stranded_first stranded_second tpm_unstrand fpkm_unstrand fpkm_uq_unstrand data_expr <- assay(gene.data,i = "unstranded") #修改ensembl ID 为基因name data_expr<-ensemblToName(data_expr,gene.info) #数据写出 write.table(data.frame(ID=rownames(data_expr),data_expr,check.names = F), file = 'gene_expression_Counts.tsv', sep="\t",row.names =F, quote = F) write.table(data.frame(ID=rownames(gene.info),gene.info,check.names = F), file = 'gene.info.tsv', sep="\t",row.names =F, quote = F) write.table(data.frame(ID=rownames(sample.info),sample.info,check.names = F), file = 'sample.info.tsv', sep="\t",row.names =F, quote = F)
命令行运行:
为保证代码正常运行成功,确保安装最新的R包 TCGAbiolinks,安装方法:
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("BioinformaticsFMRP/TCGAbiolinksGUI.data") BiocManager::install("BioinformaticsFMRP/TCGAbiolinks")
或者用我们组学大讲堂提供的最新 docker镜像,里面已经更新了最新的TCGAbiolinks包:
omicsclass/r-server:lastest
用命令行代码《tcga_gene_exp_download.r》自动下载表达数据,运行命令如下:
Rscript tcga_gene_exp_download.r -p TCGA-CHOL
结果展示:
运行完上面的命令行,即可自动下载临床数据和表达数据,并自动整理成表格,方便后续分析:
扫描下方二维码学习命令行分析数据,以及获得TCGA数据下载最新代码:
https://zzw.h5.xeknow.com/s/3mdwk4
如果觉得我的文章对您有用,请随意打赏。你的支持将鼓励我继续创作!