Hello,

Thank you for requesting a more detailed protocol for individual genes.
I first downloaded from BioMart on Ensembl (https://www.ensembl.org/biomart/martview/509ba4815fdb1363eae85c37f90d5c91) the following information for all protein-coding genes (Ensembl release 83) of the human reference genome:

Ensembl.Gene.ID : Ensembl gene identifiers

gene.symbol : gene name

Transcript : Ensembl transcript ID

Chrom : Chromosome

Start_hg18 : start position on hg18

End_hg18 : end position

LgGenes : gene length

LgCDS : CDS length

GC : GC content of CDS

GC3 : GC content at third position

GCflank : GC content at flanking regions (10kb upstream and 10kb downstream of the transcription unit)

GCi : intronic GC content


In parallel, I downloaded the sex-averaged recombination map from HapMap release 22 from ftp://ftp.ncbi.nlm.nih.gov/hapmap/recombination/latest/rates/. If you have just started your project on humans, I recommend using the most recent version of the recombination map. You could also have a look at the deCODE map.
The recombination rate R, in cM/Mb, is computed as: R = (Gj - Gi) / (Pj - Pi) * 1e6, where Gj is the genetic position (in cM) of nucleotide j and Pj its physical position (in bp). We estimated the average intragenic recombination rate between the beginning (i) and the end (j) of each gene longer than 5 kb.
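
To make the formula concrete, here is a minimal Python sketch of the computation. It is not the script we used. It assumes the map has already been read into two parallel, sorted lists (physical positions in bp and cumulative genetic positions in cM), and it assumes that the genetic position at a gene boundary is obtained by linear interpolation between the two flanking markers. The function names and the interpolation step are illustrative assumptions.

```python
from bisect import bisect_right


def genetic_position(pos, map_bp, map_cm):
    """Genetic position (cM) at physical position `pos` (bp).

    Assumption: linear interpolation between the two flanking map markers;
    positions outside the map are clamped to the first or last marker.
    """
    if pos <= map_bp[0]:
        return map_cm[0]
    if pos >= map_bp[-1]:
        return map_cm[-1]
    k = bisect_right(map_bp, pos)  # index of the first marker located after pos
    x0, x1 = map_bp[k - 1], map_bp[k]
    y0, y1 = map_cm[k - 1], map_cm[k]
    return y0 + (y1 - y0) * (pos - x0) / (x1 - x0)


def recombination_rate(start, end, map_bp, map_cm, min_length=5000):
    """Average intragenic recombination rate in cM/Mb.

    Implements R = (Gj - Gi) / (Pj - Pi) * 1e6 with i = gene start and
    j = gene end. Returns None for genes of min_length bp or shorter.
    """
    if end - start <= min_length:
        return None
    g_i = genetic_position(start, map_bp, map_cm)
    g_j = genetic_position(end, map_bp, map_cm)
    return 1e6 * (g_j - g_i) / (end - start)
```

The map must be given per chromosome, so in practice the function would be called once per gene with the map of the chromosome that carries it.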


For gene expression, we used the following datasets:

1. From Guo et al. (2015), we downloaded Panel 4 ("FPKM of pool-split PGCs") of Table S1, "Summary of Single-Cell RNA-Seq Dataset and Expression Levels of RefSeq Genes in Human PGCs and Neighboring Somatic Cells".

2. From Kryuchkova-Mostacci N, Robinson-Rechavi M. (2015), we used the processed table "File_31_Hum_Data_Tissues_Fagerberg.txt", available as supplementary material.

3. From Lesch et al. (2016), we downloaded the expression levels in pachytene spermatocytes (PS) and round spermatids (RS) of three males from https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE68507. These files are: GSM1673959 (human1_PS_RNA), GSM1673963 (human1_RS_RNA), GSM1673967 (human2_PS_RNA), GSM1673971 (human2_RS_RNA), GSM1673975 (human3_PS_RNA) and GSM1673978 (human3_RS_RNA).


We combined these data into a single table using awk and bash scripts.
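
If you prefer not to use awk, the same kind of join can be sketched in Python. This is only an illustration, not our actual scripts: it assumes that every table has been loaded as a list of dictionaries (for example with csv.DictReader) and that the tables share the gene.symbol column. The function name, the FPKM column and the toy values are hypothetical.

```python
def merge_on_symbol(gene_rows, expr_rows, key="gene.symbol", missing="NA"):
    """Left-join expr_rows onto gene_rows using the shared `key` column.

    Every gene of gene_rows is kept; genes without a match in expr_rows
    receive `missing` in the expression columns.
    """
    expr_by_key = {row[key]: row for row in expr_rows}
    extra_cols = [c for c in (expr_rows[0] if expr_rows else {}) if c != key]
    merged = []
    for row in gene_rows:
        match = expr_by_key.get(row[key], {})
        out = dict(row)  # copy so the input table is left untouched
        for col in extra_cols:
            out[col] = match.get(col, missing)
        merged.append(out)
    return merged


# Toy values for illustration only (not real data).
genes = [
    {"gene.symbol": "GENE_A", "GC3": "0.61"},
    {"gene.symbol": "GENE_B", "GC3": "0.42"},
]
expression = [{"gene.symbol": "GENE_A", "FPKM": "12.5"}]
table = merge_on_symbol(genes, expression)
```

Calling the function once per expression dataset, each time on the output of the previous call, builds the full table one source at a time.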


Once the table was built, we used an R script to make the figures. The R script "figures_HumanCodonUsage_functions.R" and the README are available on Zenodo (https://zenodo.org/record/835063#.XwRa499fg5k) in the zipped folder fig_HumanCodonUsage.zip.


For the GO analyses, the GO categories associated with proliferation and with differentiation were chosen according to Gingold et al. (2014): I followed their protocol and the legend of their PCA figure to decide whether a GO category is associated with proliferation, for instance. Once the GO_* files were prepared, I concatenated them, sorted them and extracted the unique gene names associated with proliferation (resp. differentiation), using a combination of cat, sort and uniq in the terminal. I then compared the two lists: a gene name present twice (that is, in both lists) went into the "both" category, while a gene name present only once was assigned to either prolif or diff.
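
The comparison of the two lists can also be written with sets. The sketch below is a Python equivalent of the cat/sort/uniq steps, not the commands I actually ran. It assumes that each GO_* file has already been read into a list of gene names; the function name and the placeholder gene names are hypothetical.

```python
def classify_genes(prolif_lists, diff_lists):
    """Split genes into "both", "prolif" and "diff" categories.

    prolif_lists and diff_lists are lists of gene-name lists, one per
    GO_* file. Taking the union removes duplicates, as sort | uniq does.
    """
    prolif = set().union(*prolif_lists)
    diff = set().union(*diff_lists)
    return {
        "both": sorted(prolif & diff),    # found in both lists
        "prolif": sorted(prolif - diff),  # proliferation only
        "diff": sorted(diff - prolif),    # differentiation only
    }


# Placeholder gene names for illustration only.
categories = classify_genes(
    prolif_lists=[["GENE_A", "GENE_B"], ["GENE_B", "GENE_C"]],
    diff_lists=[["GENE_C", "GENE_D"]],
)
```

Here GENE_C ends up in "both", GENE_A and GENE_B in "prolif", and GENE_D in "diff".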



Please let me know if I have answered your questions. 
Best,
Fanny Pouyet