Expression quantification. Read quality was assessed in FastQC v0.11.5 (www.bioinformatics.babraham.ac.uk/projects/fastqc). Raw reads were trimmed of sequencing adaptors and low-quality bases (PHRED < 5) using Cutadapt v1.16 (50). Expression estimates for each sample were obtained using the “align_and_estimate_abundance.pl” script within the Trinity package v2.6.6 (51): For each library, trimmed reads were mapped against our published bluehead wrasse transcriptome assembly (19) using Bowtie2 v2.3.2 (52), and transcript abundances were estimated using RSEM v1.3.0 (53) with default settings. Transcript-level count matrices were generated for brain and gonad samples separately using Trinity’s “abundance_estimates_to_matrix.pl” script and the scaledTPM method.

PCA and gene set enrichment. PCA (centered and unscaled) was used to visualize transcriptome-wide expression variation within groups and among samples, following normalization by variance stabilizing transformation in DESeq2 v1.20.0 (54) in R v3.5.0 (55). The top 10,000 transcripts with greatest variance across samples were used, and confidence ellipses around barycenters were plotted using the stat_conf_ellipse() function in ggpubr v0.1.8. For gonad only, to identify the 500 transcripts contributing most to each principal component, component loadings (defined as eigenvectors scaled by the square root of the respective eigenvalues) were represented as coordinates in a Cartesian plane. Given the bimodal and skewed distribution of the values, percentiles rather than SDs were used as thresholds. Thresholds were defined as the 5th and 95th percentiles and divided the plane into four spatial regions: “Female” and “Male” represented the extremes of PC1; sexually “Differentiated” and “Transitionary” represented the extremes of PC2. Among regions, unique and shared transcripts were represented in a Euler diagram using the R package eulerr v4.1.0. To validate the results of our method, we visualized normalized expression across sex change for unique transcripts from each spatial region.

To evaluate whether intermediate stages of sex change represented a unique transitional phase, and not a simple replacement from one differentiated population to another, we replicated the above PCA analysis but included simulated mixed datasets of male and female reads. We used fastq-tools v0.8 (http://homes.cs.washington.edu/~dcjones/fastq-tools/) to randomly subsample reads from three control female and three TP male libraries, adjusting the number of initial reads to represent different proportions in the simulated datasets (male:female, 5:95, 10:90, 20:80, 40:60, 60:40, and 80:20). Transcript expression for simulated samples was quantified as above, and the original PCA results were used as a training dataset to predict principal component scores for the simulated data.

To identify the functional categories of genes uniquely associated with each spatial region, GO term enrichment analysis was performed using TopGo v2.32.0 in R and zebrafish genome annotations downloaded from Bioconductor (org.Dr.eg.db v3.6.0). Significant GO terms (P < 0.01) for “biological processes” and “molecular function” were identified using Fisher’s exact test with the “weight01” algorithm. To reduce GO term redundancy and summarize the results, GO terms and P values weighted on the basis of the scores of neighboring GO terms were used as input for REViGO (56), using SimRel as a semantic similarity measure, medium allowed similarity (0.7), and the Danio rerio GO database.

Differential expression and enrichment analyses. Differential expression analyses were performed for brain and gonad separately using a generalized linear model (GLM) framework in DESeq2. Differentially expressed transcripts were called using the Wald test in pairwise comparisons between sex change stages and control females and between neighboring stages, after fitting a single GLM to estimate size factors and dispersion across all samples per tissue. False discovery rate (FDR) was controlled at 5% to account for multiple testing, and an adjusted significance value (FDR-P) of <0.05 was used to define significant differential expression. For gonadal samples, only transcripts with a log2 fold change of >1 were considered. No fold change cutoff was applied in analyses of brain samples because of the relatively subtle expression differences observed.

Transcripts showing differential expression at each sex-change stage (compared to control females) and in comparisons between neighboring stages were searched against the Ensembl zebrafish protein database (Danio_rerio.GRCz10) (BLASTX, E-value cutoff: 10-10). Matched zebrafish protein IDs were converted to unique Ensembl zebrafish gene IDs in BioMart (57) and used for gene pathway overrepresentation analysis in DAVID v6.8 (58).

WGBS analysis and correlation with RNA-seq. Raw WGBS sequences were processed in TrimGalore! v0.4.5 (www.bioinformatics.babraham.ac.uk/projects/trim_galore/); sequencing adapters were removed, and 10 bp was trimmed from the 5′ end of reads to account for sequence biases associated with PBAT library construction, followed by removal of low-quality base calls (PHRED score < 20). Read mapping and base calling were performed in Bismark v0.19.0 (59) specifying the option --pbat. Our draft bluehead wrasse genome assembly (Supplementary Materials and Methods) was used as a reference, obtaining an average of 51.47% mapping efficiency (SD ± 3.07). BAM files were deduplicated, and reports containing CG methylation were generated. The bisulfite treatment nonconversion rate was evaluated with the frequency of non-CG methylation, with all libraries having a conversion efficiency of at least 98.94% (data S4).

CG methylation calls were analyzed in SeqMonk v1.42.0 (www.bioinformatics.babraham.ac.uk/projects/seqmonk/). Genome scaffolds were grouped into 23 pseudochromosomes, and tracks were built using in-house annotations (Supplementary Materials and Methods). Among-sample variation was examined with PCA, using 10-kb probes generated in nonoverlapping windows with a minimum of 100 CG calls. To compare among stages, the PCA was rerun with individual samples merged and using 2-kb windows with a minimum of 100 CpG calls.

To examine methylation across TSS, the longest transcript per gene was used. TSS were defined as 200 bp centered on the first nucleotide of an annotated mRNA, and a minimum of five methylation calls were applied as a threshold for inclusion. To evaluate coupling between TSS methylation and gene expression, trimmed RNA-seq reads were mapped to the reference genome using HTseq v0.9.1 (60) and imported into SeqMonk, specifying a minimum mapping quality of 60 to select only uniquely aligned reads. The SeqMonk RNA-seq quantitation pipeline was used to generate raw counts across exons of protein coding genes. Counts were corrected by transcript length and DNA contamination, and transcripts were divided into quartiles according to expression level.

To analyze methylation at individual CGs, probes of two consecutive nucleotides with a minimum of 10 methylation calls were generated and the percentage methylation measured as number of methylated calls/total calls. CGIs were identified using published methodology (61). For the regions of interest, 200-bp windows moving at 1-bp intervals were considered CGIs if the Obs/Exp value was greater than 0.6 and the GC content exceeded 60%.

Scatter plots and violin plots were drawn using ggplot2 v3.0.0 in R (62). Histograms and genome annotations were generated using Gviz v1.24.0 (63).

Note: The content above has been extracted from a research article, so it may not display correctly.



Q&A
Please log in to submit your questions online.
Your question will be posted on the Bio-101 website. We will send your questions to the authors of this protocol and Bio-protocol community members who are experienced with this method. you will be informed using the email address associated with your Bio-protocol account.



We use cookies on this site to enhance your user experience. By using our website, you are agreeing to allow the storage of cookies on your computer.