2.7. Mechanical Properties of the PsgM02F and PsgM04F Phage Genomes

Erica C. Silva
Carlos A. Quinde
Basilio Cieza
Aakash Basu
Marta M. D. C. Vila
Victor M. Balcão

DNA structural features. The structure of DNA is sequence-dependent, and its study is important in genome-wide analyses. Determining the structural features of DNA can help reveal the preferred conformations that are intrinsic to a given DNA sequence, as well as its dynamics. To this end, we used “DNAshape”, a web-based application that draws on Monte Carlo simulations in a high-throughput (HT) approach to predict multiple DNA structural features, namely Minor Groove Width (MGW), Roll, Propeller Twist (ProT) and Helix Twist (HelT). DNA shape features at a single-nucleotide position are determined by the sequence context of the corresponding base pair (bp). This context comprises the immediate neighbors of a bp or a larger number of adjacent bp, and is characterized here as a function of its pentameric environment. In summary, each feature was determined entirely from the nucleotide sequence context of the genomes, using a high-throughput methodology that includes a pentamer model to predict the structural values except for the two terminal bp in MGW and ProT, or one bp step at each end in Roll and HelT [73]. The DNA structural features of the genomes of the two phages were calculated according to the procedures described in detail by Harada et al. [33] and Balcão et al. [30], using the “DNAshape” web-based application (http://rohslab.cmb.usc.edu/DNAshape/ (accessed on 4 September 2023)) [73]. A custom Python (version 3.9.12) script for plotting the resulting data as heatmaps was then created and run in Jupyter Notebook (version 6.4.8) within Anaconda Navigator (version 2.1.4, Anaconda Inc., Austin, TX, USA). Once the predicted values were obtained, the data were fully analyzed and their characteristics were determined to better understand the predictions. Correlations between the four structural features were then analyzed, and heatmaps plotting the number of nucleotides per genome and the four structural features were produced.
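
As an illustration of why a pentamer model leaves the sequence ends without predicted values, the following minimal sketch enumerates the pentameric context of each bp. It is not the DNAshape implementation; the function name `pentamer_contexts` and the example sequence are hypothetical, and the lookup of structural values for each pentamer is deliberately omitted.

```python
def pentamer_contexts(seq):
    """Yield (position, pentamer) for every bp that has two neighbors on each side.

    Positions are 0-based. Bp closer than two positions to either end have no
    complete pentamer, so a pentamer-based model cannot assign them a value.
    """
    seq = seq.upper()
    for i in range(2, len(seq) - 2):
        yield i, seq[i - 2:i + 3]


# Hypothetical 10-bp example sequence (not taken from either phage genome).
contexts = list(pentamer_contexts("ACGTACGTAC"))
for position, pentamer in contexts:
    print(position, pentamer)
```

In a real workflow each pentamer would be mapped to its predicted MGW or ProT value; here only the windowing step is shown.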

Correlation of the DNA shape of both genomes. Once the values were predicted, pairwise correlations between the DNA shape features were computed with a custom Python script to quantify their linear relationships.
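
A pairwise linear correlation of this kind could be sketched as follows. This is an assumption-laden illustration rather than the authors' script: it assumes the Pearson coefficient is the measure of linear relationship, that the four feature series have already been aligned to the same positions, and it uses invented toy values. The names `pearson`, `pairwise_correlations` and `features` are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally indexed series, skipping None entries."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    mean_x = sum(a for a, _ in pairs) / n
    mean_y = sum(b for _, b in pairs) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    var_x = sum((a - mean_x) ** 2 for a, _ in pairs)
    var_y = sum((b - mean_y) ** 2 for _, b in pairs)
    return cov / sqrt(var_x * var_y)


def pairwise_correlations(features):
    """Return {(name_a, name_b): r} for every unordered pair of features."""
    return {
        (a, b): pearson(features[a], features[b])
        for a, b in combinations(features, 2)
    }


# Invented toy values for illustration only (not predictions for either genome).
features = {
    "MGW": [5.1, 4.8, 4.2, 5.6, 5.9, 4.5],
    "ProT": [-6.2, -8.1, -10.4, -5.0, -4.3, -9.0],
    "Roll": [-1.5, -2.2, -3.0, 0.4, 1.1, -2.6],
    "HelT": [34.1, 34.9, 35.8, 33.2, 32.7, 35.3],
}

for pair, r in pairwise_correlations(features).items():
    print(pair, round(r, 3))
```

Four features give six unordered pairs, which is the content of the upper triangle of a 4 × 4 correlation matrix.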

Dinucleotide distance correlation patterns of both PsgM02F and PsgM04F genomes. In this analysis, we computed the pairwise distance distribution function following the procedures outlined by Basu et al. [74]. The pairwise distance distribution function is a measure of how frequently two specific dinucleotides occur at a given separation within a DNA sequence. This separation is quantified in terms of nucleotide intervals. We explored the self-pairwise distance distribution function for the sixteen possible dinucleotide combinations, viz. AA, AC, AG, AT, CA, CC, CG, CT, GA, GC, GG, GT, TA, TC, TG, and TT, independently for each genome. This was accomplished by counting the occurrences of each dinucleotide in each genome and dividing the count by the respective genome’s length. To identify correlation, a value of 1 was assigned to each position where the dinucleotide was found and 0 where it was not. Next, the frequency with which copies of the same dinucleotide occur at a given separation was calculated for separations of up to 100 steps. These values were compared with the values expected for a random sequence and plotted, for each dinucleotide, as correlation frequency versus separation over the 100 steps for both genomes. The resulting plots illustrate the pairwise correlation in both genomes for each dinucleotide within 100 steps of distance. The PsgM02F genome is depicted in red, while the PsgM04F genome is depicted in blue. The dotted line in the graph corresponds to the random expected sequence. All calculations were independently repeated for each of the two phage genomes. A custom Python (version 3.9.12) script for calculating dinucleotide distance correlations was written in Jupyter Notebook (version 6.4.8) and run within Anaconda Navigator (version 2.1.4, Anaconda Inc., Austin, TX, USA).
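
The indicator-based calculation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure of Basu et al. [74] or the authors' script: the normalization (co-occurrences divided by the number of position pairs at that separation) and the random expectation (the squared dinucleotide frequency) are assumptions, and the names `indicator`, `distance_correlation` and `random_expectation` are hypothetical.

```python
def indicator(seq, dinuc):
    """Return 1 at each position where `dinuc` starts, 0 elsewhere."""
    seq = seq.upper()
    return [1 if seq[i:i + 2] == dinuc else 0 for i in range(len(seq) - 1)]


def distance_correlation(seq, dinuc, max_dist=100):
    """Fraction of position pairs at each separation d where `dinuc` occurs at both.

    Returns a list whose element d-1 corresponds to separation d (1..max_dist).
    The normalization by (n - d) is an assumption made for this sketch.
    """
    x = indicator(seq, dinuc)
    n = len(x)
    result = []
    for d in range(1, max_dist + 1):
        if d >= n:
            break
        hits = sum(x[i] * x[i + d] for i in range(n - d))
        result.append(hits / (n - d))
    return result


def random_expectation(seq, dinuc):
    """Assumed baseline: probability that two independent positions both match."""
    x = indicator(seq, dinuc)
    p = sum(x) / len(x)
    return p * p


# Hypothetical toy sequence with a strict 2-nt periodicity of "AT".
toy = "AT" * 10
print(distance_correlation(toy, "AT", max_dist=4))
print(random_expectation(toy, "AT"))
```

For a genome, the same functions would be called once per dinucleotide with `max_dist=100`, and the resulting curve plotted against the baseline.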

Frequency of the 16 dinucleotides in the PsgM02F and PsgM04F phage genomes. We used a Python script to determine the net occurrence of the 16 dinucleotide combinations in the genomes of phages PsgM02F and PsgM04F. This was accomplished by counting the occurrences of each dinucleotide and dividing the count by the respective genome’s length.
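
The counting and normalization step can be sketched as follows. This is an illustrative reconstruction, not the authors' script; it assumes overlapping dinucleotides are counted along a single strand, and the name `dinucleotide_frequencies` is hypothetical.

```python
from itertools import product

# The 16 dinucleotides, AA through TT.
DINUCLEOTIDES = ["".join(p) for p in product("ACGT", repeat=2)]


def dinucleotide_frequencies(seq):
    """Count overlapping dinucleotides and divide each count by the sequence length."""
    seq = seq.upper()
    counts = dict.fromkeys(DINUCLEOTIDES, 0)
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:  # skips pairs containing ambiguous bases such as N
            counts[pair] += 1
    return {d: c / len(seq) for d, c in counts.items()}


# Hypothetical toy sequence (not taken from either phage genome).
frequencies = dinucleotide_frequencies("ACGTACGT")
print(frequencies["AC"], frequencies["TA"])
```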

Differential dinucleotide frequency between PsgM02F and PsgM04F phage genomes. We computed the differential dinucleotide frequency between the PsgM02F and PsgM04F genomes. To do so, the frequency of occurrence of each possible dinucleotide combination was calculated for each genome, and the frequencies calculated for PsgM02F were then subtracted from those of PsgM04F, yielding a total of 256 differential values. The resulting data were represented as a heatmap, in which a positive differential frequency is depicted in red and a negative differential frequency is depicted in blue.
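
One possible reading of the 256 differential values is a 16 × 16 matrix in which every PsgM04F dinucleotide frequency is compared with every PsgM02F dinucleotide frequency. The sketch below follows that reading purely as an assumption; the source does not spell out the layout. The function name `differential_matrix` and the toy frequencies are hypothetical.

```python
def differential_matrix(freq_02, freq_04):
    """Return matrix[row][col] = freq_04[col] - freq_02[row].

    Rows are keyed by the PsgM02F dinucleotide and columns by the PsgM04F
    dinucleotide (an assumed layout). With 16 dinucleotides per genome this
    gives 16 x 16 = 256 values; the diagonal holds the like-for-like differences.
    """
    return {
        row: {col: freq_04[col] - freq_02[row] for col in freq_04}
        for row in freq_02
    }


# Invented toy frequencies for two dinucleotides only (illustration, not real data).
toy_02 = {"AA": 0.10, "AC": 0.05}
toy_04 = {"AA": 0.12, "AC": 0.04}

matrix = differential_matrix(toy_02, toy_04)
for row, columns in matrix.items():
    print(row, {col: round(value, 3) for col, value in columns.items()})
```

Positive entries would map to red and negative entries to blue in a diverging heatmap color scale.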

Phage genome cyclizability. In this analysis, we calculated the cyclizability values associated with each genome following the procedure described by Basu et al. [74]. A custom Python (version 3.9.12) script for calculating genome cyclizability was created and run in Jupyter Notebook (version 6.4.8) within Anaconda Navigator (version 2.1.4, Anaconda Inc., Austin, TX, USA). The cyclizability of a genome sequence may be defined as the natural logarithm of the ratio of the probabilities of finding a sequence in the looped vs. control groups (i.e., the natural logarithm of the ratio of the relative population of a nucleotide sequence in a sample pool to that in the control), whereas intrinsic cyclizability is defined as the mean over such variation and can be regarded as a proxy for bendability [34,75]. Cyclizability values were calculated only at every 7th base pair, with the aims of checking how bendability changes around important locations in the phage genomes, averaging bendability over the entire genome, and comparing the different phages. Cyclizability was computed using nucleotide intervals of 50 base pairs with a seven-base pair overlap for each genome. The calculations were performed independently for the PsgM02F and PsgM04F genomes. Subsequently, the results were displayed as heatmaps, and box plots were prepared to display the statistics and distribution of cyclizability values per genome. The mean range is indicated by the black line within each box plot, and the maximum and minimum whiskers indicate the highest and lowest cyclizability values. Outliers are shown as open circles.
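
The windowing part of this calculation can be sketched as follows, assuming that 50-bp windows are taken with their start positions 7 bp apart. The sketch does not implement the cyclizability model of Basu et al. [74]: the scoring function is passed in as a parameter, and `gc_fraction` below is only a stand-in used to make the example runnable. All function names are hypothetical.

```python
def sliding_windows(seq, width=50, step=7):
    """Yield (start, window) for full-width windows whose starts are `step` bp apart."""
    for start in range(0, len(seq) - width + 1, step):
        yield start, seq[start:start + width]


def score_genome(seq, score_fn, width=50, step=7):
    """Apply `score_fn` to each window; return the per-window scores and their mean."""
    scores = [(start, score_fn(window)) for start, window in sliding_windows(seq, width, step)]
    mean = sum(value for _, value in scores) / len(scores)
    return scores, mean


def gc_fraction(window):
    """Placeholder scorer (NOT cyclizability): fraction of G or C in the window."""
    return sum(base in "GC" for base in window.upper()) / len(window)


# Hypothetical 100-bp toy sequence (not taken from either phage genome).
toy_genome = "ACGT" * 25
scores, mean_score = score_genome(toy_genome, gc_fraction)
print(len(scores), mean_score)
```

Replacing `gc_fraction` with a trained cyclizability predictor would yield the per-position values that feed the heatmaps and box plots.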
