Advanced Search
Last updated date: Mar 11, 2022 DOI: 10.21769/p1583 Views: 702 Forks: 0
Outline
The analysis pipeline is a three-step process involving data normalisation and visualisation (Step 1), differential abundance analysis with two different algorithms (Step 2), and finally merging the output of both algorithms and ranking the proteins based on combined probability of being differentially expressed (Step 3).
Step 1
The whole analysis starts off at normalising the raw protein abundances so that the data are evenly distributed. We used a channel (replicate) median normalisation, i.e. normalising the abundances of each protein in each replicate on the median of protein abundances in the replicate. Whether or not this normalisation strategy is suitable to other datasets, needs to be assessed individually.
To visualise the normalisation we used the following script:
##########################################
###Boxplot Proteomics Data Distribution###
##########################################
#load data
norm<-read.csv("your_normalised_input.csv", row.names=1)
raw<-read.csv("your_raw_input.csv", row.names=1)
#rename columns
header<-c("your","sample","names")
names(norm)<-header
names(raw)<-header
#log2 transform
norm.log=log(norm[,c(1:ncol(norm))],2)
raw.log=log(raw[,c(1:ncol(raw)],2)
#calculate means
mean.norm.log<-mean(unlist(as.vector(norm.log[,1:ncol(norm.log)])),na.rm=TRUE)
mean.raw.log<-mean(unlist(as.vector(raw.log[,1:ncol(norm.log)])),na.rm=TRUE)
#boxplot
par(mfrow=c(1,2))
boxplot(raw.log, ylab="log2(raw protein values)",
ylim=c(0,20), main="Proteomic samples all proteins raw abundances",
xaxt="n", yaxt="n", frame.plot=FALSE)
abline(h=mean.raw.log, col="black", lty="dashed")
y.axis<-c(0,5,10,15,20)
axis(2, at=seq(0,20,by=5), labels =F, col=1)
text(y=seq(0.05,20.05, by=5), par("usr") [1], labels=y.axis, pos=2,offset=.75,
srt= 0, xpd=T, col=1)
boxplot(norm.log, ylab="log2(normalised protein values)",
ylim=c(-10,10), main="Proteomic samples all proteins normalised abundances",
xaxt="n", yaxt="n", frame.plot=FALSE)
abline(h=mean.norm.log, col="black", lty="dashed")
y.axis<-c(-10,5,0,5,10)
axis(2, at=seq(-10,10,by=5), labels =F, col=1)
text(y=seq(-10.05,10.05, by=5), par("usr") [1], labels=y.axis, pos=2,offset=.75,
srt= 0, xpd=T, col=1)
Step 2
Next, we used two different R packages to perform different protein abundance analysis:
The LIMMA script is based on the publication of Kammers et al. (2015, https://www.biostat.jhsph.edu/~kkammers/software/eupa/R_guide.html) and has only been modified to include the more conservative Benjamini-Hochberg correction of p-values in lieu of the q-values as introduced by Storey and Tibshirani (2003). The ROTS script follows the standard analysis pipeline given in the package vignette.
LIMMA script:
############################################################
###assessing differentially expressed proteins with LIMMA###
############################################################
##based on http://www.biostat.jhsph.edu/~kkammers/software/eupa/R_guide.html
#first, load the protein file and log2 tranform
prot<-read.csv("your_protein_expression_data.csv", row.names=1)
head(prot)
log.prot<-log2(prot)
#order and rename the columns, this is needed for the actual analysis below
prot.idx<-c(your ordering)
log.prot<-log.prot[,prot.idx]
prot.header<-c(your column names)
names(log.prot)<-prot.header
head(log.prot)
#install and load limma and qvalue packages
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install(version = "3.14")
biocLite("limma")
biocLite("qvalue")
library("limma")
library("qvalue")
#analysis function
#source("http://www.biostat.jhsph.edu/~kkammers/software/eupa/source.functions.r")
eb.fit <- function(dat, design){
n <- dim(dat)[1]
fit <- lmFit(dat, design)
fit.eb <- eBayes(fit)
logFC <- fit.eb$coefficients[, 2]
df.r <- fit.eb$df.residual
df.0 <- rep(fit.eb$df.prior, n)
s2.0 <- rep(fit.eb$s2.prior, n)
s2 <- (fit.eb$sigma)^2
s2.post <- fit.eb$s2.post
t.ord <- fit.eb$coefficients[, 2]/fit.eb$sigma/fit.eb$stdev.unscaled[, 2]
t.mod <- fit.eb$t[, 2]
p.ord <- 2*pt(-abs(t.ord), fit.eb$df.residual)
p.mod <- fit.eb$p.value[, 2]
q.ord <- qvalue(p.ord)$q
q.mod <- qvalue(p.mod)$q
results.eb <- data.frame(logFC, t.ord, t.mod, p.ord, p.mod, q.ord, q.mod, df.r, df.0, s2.0, s2, s2.post)
results.eb <- results.eb[order(results.eb$p.mod), ]
return(results.eb)
}
#now build a matrix for the analysis
design<-model.matrix(~factor(c(2,2,2,1,1,1)))
colnames(design)<-c("Intercept", "Diff")
design
#run the analysis
#LIMMA does pairwise comparisions, so you need to define which two conditions you want to compare
condA<-prot.header[c(your data columns)]
condB<-prot.header[c(your data columns)]
#analysis
result<-eb.fit(log.prot[, c(condB,condA)], design)
p<-result$p.mod
pi0<-pi0est(p)
pi0<-unlist(pi0)
pi0<-as.vector(pi0[1])
result[13]<-pi0
names(result)[13]<-"pi0"
head(result)
nrow(result)
#now, subset for desired fdr (q.mod is the column to look at), and kick out #un-necessary columns
#important columns
#logFC: b-a (=> +high in b/- high in a); p.ord/p.mod: ordinary/moderated p-#value; t.ord: ordinary t-value; q.ord/q.mod: ordinary/moderated q-value #t.mod: moderated t-value
fdr<-0.1
fdrSubset<-subset(result, q.mod<=fdr, select=c("logFC","p.mod","q.mod","pi0"))
head(fdrSubset)
#run this, if you want to have the more conservative adjusted p-value based on the Benjamini-Hochberg algorithm
padjust<-p.adjust(result$p.mod, method="fdr")
result.padjust<-result
result.padjust[14]<-padjust
names(result.padjust)[14]<-"adjusted.pvalue"
p.adjSubset<-subset(result.padjust, adjusted.pvalue<=fdr, select=c("logFC","p.mod","q.mod","pi0","adjusted.pvalue"))
head(p.adjSubset)
#visualise the p-value distribution if you like
hist(p, breaks=10, col=c("green","lightgreen","lightblue","grey","grey","grey","grey","grey","grey","grey"))
#write the result file(s)
#everything
write.csv(result.padjust, "your_output_file_full.csv")
#subsets
write.csv(fdrSubset, "your_output_file_FDR.csv")
write.csv(p.adjSubset, "your_output_file_pAdjust.csv")
ROTS script:
######################################
###ROTS for differential proteomics###
######################################
#first, load the protein file and log2 transform
prot<-read.csv("your_protein_expression_data.csv", row.names=1)
head(prot)
log.prot<-log2(prot)
#order and rename the columns, this is needed for the actual analysis below
prot.idx<-c(your ordering)
log.prot<-log.prot[,prot.idx]
prot.header<-c(your column names)
names(log.prot)<-prot.header
head(log.prot)
#load ROTS package
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install(version = "3.14")
biocLite("ROTS")
library("ROTS")
#ROTS does pairwise comparisions, so you need to define which two conditions #you want to compare
#logFC: b-a (=> +high in b/- high in a)
condA<-c(your data columns)
condB<-c(your data columns)
input<-log.prot[c(condB,condA)]
head(input)
input<-as.matrix(input)
groups<-c(rep(0,3),rep(1,3))
results<-ROTS(data = input, groups = groups , B = 500 , K = 1000 , seed = 1234)
#build the output table
logFC<-as.data.frame(results$logfc)
pval<-as.data.frame(results$pvalue)
fdr<-as.data.frame(results$FDR)
results<-cbind(logFC,pval,fdr)
names(results)<-c("logFC","pval","fdr")
sort<-order(results$fdr)
results<-results[sort,]
results0.1<-subset(results, fdr<=0.1)
head(results0.1)
nrow(results0.1)
#write the result file
write.csv(results, "your_output_table.csv")
Step 3
The last step of the pipeline involved combining the two output files generated per condition, and rank the differentially expressed proteins based on their LIMMA FDR and followed by their ROTS FDR. LIMMA and ROTS rank were then added to yield the final rank sum and generate a table for subsequent analysis of candidates. This is done with the following script:
##############################
###ROTS and LIMMA Rank Sums###
##############################
#load data
rots<-read.csv("your_ROTS_input.csv")
limma<-read.csv("your_LIMMA_input.csv")
head(limma)
head(rots)
#shorten and rename data. attention with the limma data: if you use the whole list, b<-c(1,2,5,15) else b<-c(1:3,5)
a<-c(1,2,3,4)
rots<-rots[,a]
rots.header<-c("ID","rots.logFC","rots.pvalue","rots.adj.pvalue")
names(rots)<-rots.header
b<-c(1,2,5,15)
limma<-limma[,b]
limma.header<-c("ID","limma.logFC", "limma.pvalue", "limma.adj.pvalue")
names(limma)<-limma.header
#merge files
rank.sum<-merge(limma,rots, by=1, all=TRUE)
#build rank sum
idx.limma<-order(rank.sum$limma.adj.pvalue)
rank.sum<-rank.sum[idx.limma,]
rank.sum[8]<-c(1:nrow(rank.sum))
names(rank.sum)[8]<-"limma.rank"
idx.rots<-order(rank.sum$rots.adj.pvalue)
rank.sum<-rank.sum[idx.rots,]
rank.sum[9]<-c(1:nrow(rank.sum))
names(rank.sum)[9]<-"rots.rank"
rank.sum[10]<-rank.sum$limma.rank+rank.sum$rots.rank
names(rank.sum)[10]<-"rank.sum"
ord<-order(rank.sum$rank.sum)
rank.sum<-rank.sum[ord,]
#write the result file
write.csv(rank.sum, "your_output_file.csv", row.names=F)
References
Ritchie ME, Phipson B, Wu D, Hu Y, Law CW, Shi W, Smyth GK. 2015. Limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Research 43:e47. DOI:
Elo LL, Filén S, Lahesmaa R, Aittokallio T. 2008. Reproducibility-optimized test statistic for ranking genes in microarray studies. IEEE/ACM Transactions on Computational Biology and Bioinformatics 5:423–431. DOI: https://doi.org/10.1109/tcbb.2007.1078
Elo LL, Hiissa J, Tuimala J, Kallio A, Korpelainen E, Aittokallio T. 2009. Optimized detection of differential expression in global profiling experiments: case studies in clinical transcriptomic and quantitative proteomic datasets. Briefings in Bioinformatics 10:547–555. DOI: https://doi.org/10.1093/bib/bbp033
Kammers K, Cole RN, Tiengwe C, Ruczinski I. 2015. Detecting significant changes in protein abundance. EuPA Open Proteomics 7:11–19. DOI: https://doi.org/10.1016/j.euprot.2015.02.002
Storey JD, Tibshirani R. 2003. Statistical significance for genome wide studies. PNAS 100:9440–9445. DOI: https://doi.org/10.1073/pnas.1530509100
Do you have any questions about this protocol?
Post your question to gather feedback from the community. We will also invite the authors of this article to respond.
Tips for asking effective questions
+ Description
Write a detailed description. Include all information that will help others answer your question including experimental processes, conditions, and relevant images.
Share
Bluesky
X
Copy link