---
title: "Introduction to the TCGA Database and Data Download"
author: "Omicsgene"
output:
  html_document:
    toc: true
    theme: united
date: '`r format(Sys.time(),"%d %B, %Y")`'
---
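ESTIMATE uses gene expression data to estimate the presence of stromal cells and the infiltration of immune cells in tumor samples. It can be applied to publicly available datasets as well as to new microarray or RNA-Seq data. The predictive ability of the method has been validated on large independent datasets. However, ESTIMATE cannot accurately infer tumor cellularity for hematopoietic or stromal tumors (for example leukemia, sarcoma, and gastrointestinal stromal tumors), and, for lack of data, it cannot be applied to tumor types such as prostate or pancreatic cancer.

Overview of the ESTIMATE algorithm: the algorithm takes gene expression data as input and outputs the estimated levels of infiltrating stromal and immune cells together with an estimate of tumor purity. Expression data from 6 platforms were integrated, giving 10,412 common genes, which were then filtered to obtain a stromal signature (141 genes) and an immune signature (141 genes). The stromal score, immune score, and ESTIMATE score are computed with the ssGSEA method.

1. Stromal score: captures the presence of stromal cells in the tumor tissue.
2. Immune score: reflects the infiltration of immune cells in the tumor tissue.
3. ESTIMATE score: used to infer tumor purity.

Reference: Yoshihara K, Shahmoradgoli M, Martínez E, et al. Inferring tumour purity and stromal and immune cell admixture from expression data. Nat Commun. 2013;4:2612. doi:10.1038/ncomms3612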
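To make the scoring idea more concrete, here is a deliberately simplified sketch of a single-sample, rank-based enrichment score. It is not the code used inside the `estimate` package: the function name `enrichment_sketch`, the toy gene names, and the rank weight of 0.25 are all assumptions chosen for illustration. The only relationship taken as given is that the ESTIMATE score is the sum of the stromal and immune scores.

```r
# Simplified single-sample enrichment score for ONE sample.
# Illustration only; hypothetical helper, not the estimate package internals.
enrichment_sketch <- function(expr, gene_set, weight = 0.25) {
  ord <- order(expr, decreasing = TRUE)              # highest expression first
  genes <- names(expr)[ord]
  ranks <- rank(expr, ties.method = "average")[ord]  # largest rank = highest expression
  in_set <- genes %in% gene_set
  w <- ranks^weight                                  # assumed rank weighting
  # Running sums over the ranked list: weighted hits vs. unweighted misses
  hit_cdf <- cumsum(ifelse(in_set, w, 0)) / sum(w[in_set])
  miss_cdf <- cumsum(!in_set) / sum(!in_set)
  sum(hit_cdf - miss_cdf)
}

# Toy expression profile with six hypothetical genes
toy_sample <- c(g1 = 10, g2 = 9, g3 = 8, g4 = 1, g5 = 0.5, g6 = 0.1)

# A signature made of highly expressed genes scores positive,
# one made of lowly expressed genes scores negative
toy_stromal <- enrichment_sketch(toy_sample, c("g1", "g2"))
toy_immune <- enrichment_sketch(toy_sample, c("g5", "g6"))

# The combined score is the sum of the two signature scores
toy_estimate <- toy_stromal + toy_immune
```

The sign of each score only reflects where the signature genes sit in the ranked list of that one sample, which is why the method can be applied sample by sample.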
## 2.1 Loading the required packages
```{r, warning=FALSE}
# 1. Install commonly used CRAN packages
# Set the CRAN mirror
local({
  r <- getOption("repos")
  r["CRAN"] <- "http://mirrors.tuna.tsinghua.edu.cn/CRAN/"
  options(repos = r)
})
# Dependency list
package_list <- c("ggplot2", "tidyverse", "DT")
# estimate is installed from R-Forge: install it if it cannot be loaded, then load it
if (!suppressWarnings(suppressMessages(require("estimate", character.only = TRUE, quietly = TRUE, warn.conflicts = FALSE)))) {
  rforge <- "http://r-forge.r-project.org"
  install.packages("estimate", repos = rforge, dependencies = TRUE)
  library("estimate")
}
# Try to load each package; if loading fails, install it and load it again
for (p in package_list) {
  if (!suppressWarnings(suppressMessages(require(p, character.only = TRUE, quietly = TRUE, warn.conflicts = FALSE)))) {
    install.packages(p)
    suppressWarnings(suppressMessages(library(p, character.only = TRUE, quietly = TRUE, warn.conflicts = FALSE)))
  }
}
# 2. Install commonly used Bioconductor packages
options(BioC_mirror = "https://mirrors.tuna.tsinghua.edu.cn/bioconductor")
package_list <- c("GSVA", "limma", "GSEABase")
for (p in package_list) {
  if (!suppressWarnings(suppressMessages(require(p, character.only = TRUE, quietly = TRUE, warn.conflicts = FALSE)))) {
    if (!requireNamespace("BiocManager", quietly = TRUE)) {
      install.packages("BiocManager")
    }
    BiocManager::install(p)
    suppressWarnings(suppressMessages(library(p, character.only = TRUE, quietly = TRUE, warn.conflicts = FALSE)))
  }
}
```
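Before running the install loop it can help to see which packages are actually missing. The helper below is a minimal sketch for that check; `missing_packages` is a hypothetical name, not part of any package, and it only reports missing packages without installing anything.

```r
# Hypothetical helper: return the packages from `pkgs` that are not installed.
# requireNamespace() checks availability without attaching the package.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Example: "stats" ships with R, the second name is deliberately made up
to_install <- missing_packages(c("stats", "aPackageThatDoesNotExist123"))
```

Checking with `requireNamespace()` avoids attaching every package just to test whether it is installed.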
## 2.2 Reading the downloaded data
```{r, warning = FALSE, message = FALSE}
# Read in the data
gene.info <- read.csv("../01.TCGA_download/TCGA-STAD_gene_info.csv", stringsAsFactors = FALSE, row.names = 1, check.names = FALSE)
sample.info <- read.csv("../01.TCGA_download/TCGA-STAD_sample_info.csv", stringsAsFactors = FALSE, row.names = 1, check.names = FALSE)
geneExpSet.fpkm <- read.csv("../01.TCGA_download/TCGA-STAD_gene_expression_FPKM.csv", stringsAsFactors = FALSE, row.names = 1, check.names = FALSE)
```
Keep only the tumor samples (`shortLetterCode == "TP"`):
```{r}
# Barcodes that are in the expression matrix and annotated as "TP"
sampleTP.id <- intersect(colnames(geneExpSet.fpkm), sample.info$barcode[sample.info$shortLetterCode == "TP"])
# Subset the columns by name
geneExpSet.fpkm <- geneExpSet.fpkm[, sampleTP.id]
```
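The filtering step above can be tried on a toy example. The barcodes and values below are made up for illustration; only the column names `barcode` and `shortLetterCode` follow the sample table used in this tutorial.

```r
# Toy stand-ins for sample.info and the expression matrix (hypothetical data)
toy_sample_info <- data.frame(
  barcode = c("S1", "S2", "S3", "S4"),
  shortLetterCode = c("TP", "NT", "TP", "TP"),
  stringsAsFactors = FALSE
)
toy_expr <- data.frame(
  S1 = c(1, 2), S2 = c(3, 4), S3 = c(5, 6),
  row.names = c("geneA", "geneB")
)

# Tumor barcodes according to the sample table
tumor_barcodes <- toy_sample_info$barcode[toy_sample_info$shortLetterCode == "TP"]

# intersect() keeps only barcodes present in both sources,
# so S4 (annotated but not measured) is dropped
toy_tp_ids <- intersect(colnames(toy_expr), tumor_barcodes)

# drop = FALSE keeps a data frame even if a single sample remains
toy_expr_tp <- toy_expr[, toy_tp_ids, drop = FALSE]
```

Using `intersect()` rather than indexing directly by the annotated barcodes avoids an "undefined columns" error when a sample is annotated but missing from the expression matrix.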
Build the expression file in the format `estimate` requires: a gene expression matrix with gene names as row names and sample names as column names.
```{r}
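# Keep the rows in the same order as gene.info
geneExpSet.estimate <- geneExpSet.fpkm[rownames(gene.info), ]
# Add a gene_name column
geneExpSet.estimate$gene_name <- gene.info$external_gene_name
# With dplyr, average the rows that share the same gene_name
geneExpSet.estimate.res <- geneExpSet.estimate %>%
  group_by(gene_name) %>%
  summarise(across(everything(), mean))
# Convert the tibble to a data frame before setting row names
geneExpSet.estimate.res <- as.data.frame(geneExpSet.estimate.res)
rownames(geneExpSet.estimate.res) <- geneExpSet.estimate.res$gene_name
geneExpSet.estimate.res.name <- geneExpSet.estimate.res[, -1] # drop the first column (gene_name)
# Write the expression matrix to a file; gene_name stays as the first column
write.table(geneExpSet.estimate.res, file = "geneExpSet.txt", sep = "\t", quote = FALSE, row.names = FALSE)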
```
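The effect of averaging duplicated gene names is easier to see on a small matrix. The sketch below uses base R (`rowsum()`) instead of dplyr so that it runs without extra packages; the IDs, symbols, and values are made up for illustration.

```r
# Toy matrix: two Ensembl-style IDs map to the same gene symbol (hypothetical data)
toy_matrix <- matrix(
  c(1, 3, 10,   # sample S1
    2, 4, 20),  # sample S2
  nrow = 3,
  dimnames = list(c("ENSG_a", "ENSG_b", "ENSG_c"), c("S1", "S2"))
)
toy_symbols <- c("TP53", "TP53", "GAPDH")

# Sum the rows per symbol, then divide by the number of rows per symbol
toy_sums <- rowsum(toy_matrix, group = toy_symbols)
toy_counts <- table(toy_symbols)[rownames(toy_sums)]
toy_avg <- toy_sums / as.vector(toy_counts)
```

After this step each gene symbol appears exactly once, which is what allows the symbols to be used as row names.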
```{r}
# Convert to a GCT file
filterCommonGenes(input.f = "geneExpSet.txt",
                  output.f = "geneExpSet.gct",
                  id = "GeneSymbol")
# After the format conversion, run estimateScore on the GCT expression matrix to
# compute the stromal score, immune score, and ESTIMATE score and save them to a
# local file. Supported platforms: c("affymetrix", "agilent", "illumina")
estimateScore(input.ds = "geneExpSet.gct",
              output.ds = "estimate_score.gct",
              platform = "illumina")
# plotPurity (plot tumor purity) needs the GCT file written by estimateScore,
# containing the stromal, immune, and ESTIMATE scores and the tumor purity of
# each sample. It draws one scatter plot per sample showing tumor purity against
# the ESTIMATE score and exports PNG files. At present the function only
# supports the Affymetrix platform, so the call is left commented out here.
if (!file.exists("estimated_purity_plots")) {
  dir.create("estimated_purity_plots", showWarnings = FALSE, recursive = TRUE)
}
# plotPurity(scores = "estimate_score.gct", samples = "all_samples",
#            platform = "illumina", output.dir = "estimated_purity_plots")
# Tidy the results for output
scores <- read.table("estimate_score.gct", skip = 2, header = TRUE)
rownames(scores) <- scores[, 1]
scores <- as.data.frame(t(scores[, 3:ncol(scores)]))
# Tumor purity computed from the ESTIMATE score
TumourPurity <- cos(0.6049872018 + 0.0001467884 * scores$ESTIMATEScore)
scores$TumourPurity <- TumourPurity
write.table(scores, file = "estimate_score.tsv", sep = "\t", quote = FALSE)
```
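The purity conversion used in the last chunk can be wrapped in a small function to see how it behaves. The coefficients are copied from the code above; the function name `estimate_purity` and the example scores are assumptions for illustration. As a caveat, my understanding is that these coefficients were fitted on Affymetrix data, so purity values derived from RNA-Seq scores are best treated as approximate.

```r
# Hypothetical wrapper around the purity formula used above
estimate_purity <- function(estimate_score) {
  cos(0.6049872018 + 0.0001467884 * estimate_score)
}

# Made-up ESTIMATE scores: a higher score means more stromal/immune signal
example_scores <- c(low = -2000, mid = 0, high = 2000)
example_purity <- estimate_purity(example_scores)
```

Within this range of scores the cosine is decreasing, so a higher ESTIMATE score always maps to a lower estimated tumor purity.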