Differential abundance analysis of proteomics data

Sven Schenk; Stephanie C Bannister; Florian Raible; Kristin Tessmar-Raible

doi:10.21769/p1583

Improve Research Reproducibility A Bio-protocol resource

Home
Protocols

Preprint

Differential abundance analysis of proteomics data

SS Sven Schenk

SB Stephanie C Bannister

FR Florian Raible

KT Kristin Tessmar-Raible

Last updated date: Mar 11, 2022 DOI: 10.21769/p1583 Views: 722 Forks: 0

An abbreviated version of this protocol was published in eLife in Feb, 2019

Combined transcriptome and proteome profiling reveals specific molecular brain signatures for sex, maturation and circalunar clock phase

Download PDF

Ask a question

How to cite

Favorite

Outline

The analysis pipeline is a three-step process involving data normalisation and visualisation (Step 1), differential abundance analysis with two different algorithms (Step 2), and finally merging the output of both algorithms and ranking the proteins based on combined probability of being differentially expressed (Step 3).

Step 1

The whole analysis starts off at normalising the raw protein abundances so that the data are evenly distributed. We used a channel (replicate) median normalisation, i.e. normalising the abundances of each protein in each replicate on the median of protein abundances in the replicate. Whether or not this normalisation strategy is suitable to other datasets, needs to be assessed individually.

To visualise the normalisation we used the following script:

##########################################
###Boxplot Proteomics Data Distribution###
##########################################

#load data

norm<-read.csv("your_normalised_input.csv", row.names=1)

raw<-read.csv("your_raw_input.csv", row.names=1)

#rename columns

header<-c("your","sample","names")

names(norm)<-header

names(raw)<-header

#log2 transform

norm.log=log(norm[,c(1:ncol(norm))],2)

raw.log=log(raw[,c(1:ncol(raw)],2)

#calculate means

mean.norm.log<-mean(unlist(as.vector(norm.log[,1:ncol(norm.log)])),na.rm=TRUE)

mean.raw.log<-mean(unlist(as.vector(raw.log[,1:ncol(norm.log)])),na.rm=TRUE)

#boxplot

par(mfrow=c(1,2))

boxplot(raw.log, ylab="log2(raw protein values)",
ylim=c(0,20), main="Proteomic samples all proteins raw abundances",
xaxt="n", yaxt="n", frame.plot=FALSE)
abline(h=mean.raw.log, col="black", lty="dashed")
y.axis<-c(0,5,10,15,20)
axis(2, at=seq(0,20,by=5), labels =F, col=1)
text(y=seq(0.05,20.05, by=5), par("usr") [1], labels=y.axis, pos=2,offset=.75,
srt= 0, xpd=T, col=1)

boxplot(norm.log, ylab="log2(normalised protein values)",
ylim=c(-10,10), main="Proteomic samples all proteins normalised abundances",
xaxt="n", yaxt="n", frame.plot=FALSE)
abline(h=mean.norm.log, col="black", lty="dashed")
y.axis<-c(-10,5,0,5,10)
axis(2, at=seq(-10,10,by=5), labels =F, col=1)
text(y=seq(-10.05,10.05, by=5), par("usr") [1], labels=y.axis, pos=2,offset=.75,
srt= 0, xpd=T, col=1)

Step 2

Next, we used two different R packages to perform different protein abundance analysis:

LIMMA (Linear Models for Microarray Data, Ritchie et al. 2015)
ROTS (Reproducibility Optimized Test Statistics; Elo et al., 2008; Elo et al., 2009)

The LIMMA script is based on the publication of Kammers et al. (2015, https://www.biostat.jhsph.edu/~kkammers/software/eupa/R_guide.html) and has only been modified to include the more conservative Benjamini-Hochberg correction of p-values in lieu of the q-values as introduced by Storey and Tibshirani (2003). The ROTS script follows the standard analysis pipeline given in the package vignette.

LIMMA script:

############################################################
###assessing differentially expressed proteins with LIMMA###
############################################################
##based on http://www.biostat.jhsph.edu/~kkammers/software/eupa/R_guide.html

#first, load the protein file and log2 tranform

prot<-read.csv("your_protein_expression_data.csv", row.names=1)

head(prot)

log.prot<-log2(prot)

#order and rename the columns, this is needed for the actual analysis below

prot.idx<-c(your ordering)

log.prot<-log.prot[,prot.idx]

prot.header<-c(your column names)

names(log.prot)<-prot.header

head(log.prot)

#install and load limma and qvalue packages

if (!require("BiocManager", quietly = TRUE))

install.packages("BiocManager")
BiocManager::install(version = "3.14")

biocLite("limma")

biocLite("qvalue")

library("limma")
library("qvalue")

#analysis function

#source("http://www.biostat.jhsph.edu/~kkammers/software/eupa/source.functions.r")
eb.fit <- function(dat, design){
n <- dim(dat)[1]
fit <- lmFit(dat, design)
fit.eb <- eBayes(fit)
logFC <- fit.eb$coefficients[, 2]
df.r <- fit.eb$df.residual
df.0 <- rep(fit.eb$df.prior, n)
s2.0 <- rep(fit.eb$s2.prior, n)
s2 <- (fit.eb$sigma)^2
s2.post <- fit.eb$s2.post
t.ord <- fit.eb$coefficients[, 2]/fit.eb$sigma/fit.eb$stdev.unscaled[, 2]
t.mod <- fit.eb$t[, 2]
p.ord <- 2*pt(-abs(t.ord), fit.eb$df.residual)
p.mod <- fit.eb$p.value[, 2]
q.ord <- qvalue(p.ord)$q
q.mod <- qvalue(p.mod)$q
results.eb <- data.frame(logFC, t.ord, t.mod, p.ord, p.mod, q.ord, q.mod, df.r, df.0, s2.0, s2, s2.post)
results.eb <- results.eb[order(results.eb$p.mod), ]
return(results.eb)
}

#now build a matrix for the analysis

design<-model.matrix(~factor(c(2,2,2,1,1,1)))
colnames(design)<-c("Intercept", "Diff")
design

#run the analysis
#LIMMA does pairwise comparisions, so you need to define which two conditions you want to compare

condA<-prot.header[c(your data columns)]
condB<-prot.header[c(your data columns)]

#analysis

result<-eb.fit(log.prot[, c(condB,condA)], design)
p<-result$p.mod
pi0<-pi0est(p)
pi0<-unlist(pi0)
pi0<-as.vector(pi0[1])
result[13]<-pi0
names(result)[13]<-"pi0"
head(result)
nrow(result)

#now, subset for desired fdr (q.mod is the column to look at), and kick out #un-necessary columns
#important columns
#logFC: b-a (=> +high in b/- high in a); p.ord/p.mod: ordinary/moderated p-#value; t.ord: ordinary t-value; q.ord/q.mod: ordinary/moderated q-value #t.mod: moderated t-value

fdr<-0.1
fdrSubset<-subset(result, q.mod<=fdr, select=c("logFC","p.mod","q.mod","pi0"))
head(fdrSubset)

#run this, if you want to have the more conservative adjusted p-value based on the Benjamini-Hochberg algorithm

padjust<-p.adjust(result$p.mod, method="fdr")
result.padjust<-result
result.padjust[14]<-padjust
names(result.padjust)[14]<-"adjusted.pvalue"

p.adjSubset<-subset(result.padjust, adjusted.pvalue<=fdr, select=c("logFC","p.mod","q.mod","pi0","adjusted.pvalue"))
head(p.adjSubset)

#visualise the p-value distribution if you like
hist(p, breaks=10, col=c("green","lightgreen","lightblue","grey","grey","grey","grey","grey","grey","grey"))

#write the result file(s)

#everything
write.csv(result.padjust, "your_output_file_full.csv")

#subsets
write.csv(fdrSubset, "your_output_file_FDR.csv")
write.csv(p.adjSubset, "your_output_file_pAdjust.csv")

ROTS script:

######################################
###ROTS for differential proteomics###
######################################

#first, load the protein file and log2 transform

prot<-read.csv("your_protein_expression_data.csv", row.names=1)
head(prot)
log.prot<-log2(prot)

#order and rename the columns, this is needed for the actual analysis below

prot.idx<-c(your ordering)
log.prot<-log.prot[,prot.idx]

prot.header<-c(your column names)

names(log.prot)<-prot.header
head(log.prot)

#load ROTS package

if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install(version = "3.14")

biocLite("ROTS")

library("ROTS")

#ROTS does pairwise comparisions, so you need to define which two conditions #you want to compare

#logFC: b-a (=> +high in b/- high in a)

condA<-c(your data columns)
condB<-c(your data columns)

input<-log.prot[c(condB,condA)]
head(input)
input<-as.matrix(input)
groups<-c(rep(0,3),rep(1,3))

results<-ROTS(data = input, groups = groups , B = 500 , K = 1000 , seed = 1234)

#build the output table

logFC<-as.data.frame(results$logfc)
pval<-as.data.frame(results$pvalue)
fdr<-as.data.frame(results$FDR)

results<-cbind(logFC,pval,fdr)
names(results)<-c("logFC","pval","fdr")
sort<-order(results$fdr)
results<-results[sort,]
results0.1<-subset(results, fdr<=0.1)
head(results0.1)
nrow(results0.1)

#write the result file

write.csv(results, "your_output_table.csv")

Step 3

The last step of the pipeline involved combining the two output files generated per condition, and rank the differentially expressed proteins based on their LIMMA FDR and followed by their ROTS FDR. LIMMA and ROTS rank were then added to yield the final rank sum and generate a table for subsequent analysis of candidates. This is done with the following script:

##############################
###ROTS and LIMMA Rank Sums###
##############################

#load data

rots<-read.csv("your_ROTS_input.csv")
limma<-read.csv("your_LIMMA_input.csv")

head(limma)
head(rots)

#shorten and rename data. attention with the limma data: if you use the whole list, b<-c(1,2,5,15) else b<-c(1:3,5)

a<-c(1,2,3,4)
rots<-rots[,a]
rots.header<-c("ID","rots.logFC","rots.pvalue","rots.adj.pvalue")
names(rots)<-rots.header

b<-c(1,2,5,15)
limma<-limma[,b]
limma.header<-c("ID","limma.logFC", "limma.pvalue", "limma.adj.pvalue")
names(limma)<-limma.header

#merge files

rank.sum<-merge(limma,rots, by=1, all=TRUE)

#build rank sum

idx.limma<-order(rank.sum$limma.adj.pvalue)
rank.sum<-rank.sum[idx.limma,]
rank.sum[8]<-c(1:nrow(rank.sum))
names(rank.sum)[8]<-"limma.rank"

idx.rots<-order(rank.sum$rots.adj.pvalue)
rank.sum<-rank.sum[idx.rots,]
rank.sum[9]<-c(1:nrow(rank.sum))
names(rank.sum)[9]<-"rots.rank"

rank.sum[10]<-rank.sum$limma.rank+rank.sum$rots.rank
names(rank.sum)[10]<-"rank.sum"

ord<-order(rank.sum$rank.sum)
rank.sum<-rank.sum[ord,]

#write the result file

write.csv(rank.sum, "your_output_file.csv", row.names=F)

References

Ritchie ME, Phipson B, Wu D, Hu Y, Law CW, Shi W, Smyth GK. 2015. Limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Research 43:e47. DOI:

Elo LL, Filén S, Lahesmaa R, Aittokallio T. 2008. Reproducibility-optimized test statistic for ranking genes in microarray studies. IEEE/ACM Transactions on Computational Biology and Bioinformatics 5:423–431. DOI: https://doi.org/10.1109/tcbb.2007.1078

Elo LL, Hiissa J, Tuimala J, Kallio A, Korpelainen E, Aittokallio T. 2009. Optimized detection of differential expression in global profiling experiments: case studies in clinical transcriptomic and quantitative proteomic datasets. Briefings in Bioinformatics 10:547–555. DOI: https://doi.org/10.1093/bib/bbp033

Kammers K, Cole RN, Tiengwe C, Ruczinski I. 2015. Detecting significant changes in protein abundance. EuPA Open Proteomics 7:11–19. DOI: https://doi.org/10.1016/j.euprot.2015.02.002

Storey JD, Tibshirani R. 2003. Statistical significance for genome wide studies. PNAS 100:9440–9445. DOI: https://doi.org/10.1073/pnas.1530509100

How to cite：

Readers should cite both the Bio-protocol preprint and the original research article where this protocol was used:

Schenk, S, Bannister, S, Raible, F and Tessmar-Raible, K(2022). Differential abundance analysis of proteomics data. Bio-protocol Preprint. DOI: 10.21769/p1583.
Schenk, S., Bannister, S. C., Sedlazeck, F. J., Anrather, D., Minh, B. Q., Bileck, A., Hartl, M., von Haeseler, A., Gerner, C., Raible, F. and Tessmar-Raible, K.(2019). Combined transcriptome and proteome profiling reveals specific molecular brain signatures for sex, maturation and circalunar clock phase. eLife. DOI: 10.7554/eLife.41556

Do you have any questions about this protocol?

Post your question to gather feedback from the community. We will also invite the authors of this article to respond.

0/150

tip Tips for asking effective questions

+ Description

Write a detailed description. Include all information that will help others answer your question including experimental processes, conditions, and relevant images.

Post a Question

0 Q&A

This protocol preprint was submitted via the "Request a Protocol" track.

Share your protocol with your peers.

Submit a Preprint Protocol