Bioinformatic and statistical analysis

Mohamed Zeineldin
Kimberly Lehman
Natalie Urie
Matthew Branan
Alyson Wiedenheft
Katherine Marshall
Suelee Robbe-Austerman
Tyler Thacker

Sequence data were analyzed for PRNP polymorphisms using Geneious Prime software (http://www.geneious.com). DNA sequences were aligned and compared with the Capra hircus PrP gene reference sequence (GenBank: HM038415.1). The full open reading frame was examined for amino acid polymorphisms, and the genetic variability of the PRNP gene in goats was estimated based on codons 127, 142, 143, 146, 154, 211, 222, and 240. All sequence data obtained in the current study were uploaded to the NCBI Sequence Read Archive under BioProject accession number PRJNA728650.
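As an illustration of the codon-level comparison described above, the following Python sketch translates the codons of interest in a sample open reading frame and reports where the encoded amino acid differs from the reference. It is not the Geneious workflow used in the study. The function names are hypothetical, and the sketch assumes in-frame, gap-free sequences of equal length with no heterozygous (IUPAC ambiguity) base calls.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Codon positions named in the text.
CODONS_OF_INTEREST = (127, 142, 143, 146, 154, 211, 222, 240)


def residue_at(orf, codon_number):
    """Amino acid encoded by the 1-based codon of an in-frame, gap-free ORF."""
    start = 3 * (codon_number - 1)
    codon = orf[start:start + 3].upper()
    # "X" marks codons that are incomplete or contain ambiguous bases.
    return CODON_TABLE.get(codon, "X")


def amino_acid_differences(reference_orf, sample_orf, codons=CODONS_OF_INTEREST):
    """Map codon number -> (reference residue, sample residue) where they differ."""
    diffs = {}
    for k in codons:
        ref_aa = residue_at(reference_orf, k)
        smp_aa = residue_at(sample_orf, k)
        if ref_aa != smp_aa:
            diffs[k] = (ref_aa, smp_aa)
    return diffs
```

With a toy reference of 240 repeated GGC codons and a sample in which codon 127 is changed to AGC, `amino_acid_differences` reports a glycine-to-serine change at codon 127 only. Real data would first need alignment to the reference, which this sketch does not perform.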

Weighted genotypic and allelic proportions were calculated across all operations, goat breeds (Alpine, Angora, Boer, Cashmere, Fainting goats, Kiko, LaMancha, Nigerian dwarf, Nubian, Oberhasli, Pygmy, Pygora, Saanen, Sable, Savannah, Spanish, Toggenburg, Crossbred, and Other breeds), region of the operation (west and east), primary production of the operation (meat, dairy, and other), and goat sex (does and bucks). Weighted descriptive estimation was carried out using SAS-callable SUDAAN software (SUDAAN version 11.0.1, Research Triangle Institute, 2012; SAS version 9.4, SAS Institute, 2012), which allows for the proper analysis of data from complex surveys by accounting for the study design. The estimation of genotypic and allelic frequencies accounted for stratification by State, operation size, and primary production type of the operation, sampling without replacement, finite population corrections, and unequal probabilities of selection adjusted for nonresponse within each stratum. Standard errors and 95% confidence intervals were computed using Taylor series linearization to reflect the uncertainty in the estimates presented. These adjustments were made so that inference could be generalized to the population of doe and buck goats aged 15 months or older in the States included in the study.
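The design-based estimation described above can be illustrated with a minimal Python sketch of a weighted proportion and its Taylor-linearized standard error under a stratified design. This is not the SUDAAN implementation. It assumes that operations act as primary sampling units within strata, that weights are already adjusted for nonresponse, and that the optional `fpc` argument (a hypothetical input) gives each stratum's sampling fraction.

```python
from collections import defaultdict
from math import sqrt


def weighted_proportion(records, fpc=None):
    """Design-weighted proportion and its Taylor-linearized standard error.

    records: iterable of (stratum, psu, weight, y) with y equal to 0 or 1.
    fpc: optional dict mapping stratum -> sampling fraction n_h / N_h.
    """
    records = list(records)
    total_w = sum(w for _, _, w, _ in records)
    p_hat = sum(w * y for _, _, w, y in records) / total_w

    # Linearized score of the ratio estimator, totalled within each PSU.
    psu_totals = defaultdict(lambda: defaultdict(float))
    for stratum, psu, w, y in records:
        psu_totals[stratum][psu] += w * (y - p_hat) / total_w

    variance = 0.0
    for stratum, totals in psu_totals.items():
        n_h = len(totals)
        if n_h < 2:
            continue  # a single-PSU stratum adds no between-PSU variance here
        mean_z = sum(totals.values()) / n_h
        ss = sum((z - mean_z) ** 2 for z in totals.values())
        f_h = fpc.get(stratum, 0.0) if fpc else 0.0
        # Between-PSU variance with finite population correction (1 - f_h).
        variance += (1.0 - f_h) * n_h / (n_h - 1) * ss
    return p_hat, sqrt(variance)
```

For two equally weighted animals on different operations in one stratum, one carrying the allele and one not, the sketch returns a proportion of 0.5 and a standard error of 0.5, which matches the simple random sampling result for n = 2.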

While descriptive estimates were made using both genotypic and polymorphic variant proportions, statistical comparisons between levels of the breakout variables (breed, region, primary production, and gender) were made using variant proportions only, because few goats were homozygous for the minor variants at each codon. Overall differences among levels of the breakout variables were assessed using log-linear test p-values adjusting for the survey design and weights in SUDAAN [30,31]. For overall tests that were statistically significant at the 0.05 significance level, pairwise comparisons of allelic proportions between levels of the given breakout variable were made using Tukey-Kramer multiple-comparison adjusted p-values. These p-values came from logistic regression models that regressed allele presence (0 if the animal had the most common genotype at the given codon and 1 if the animal had a minor allele at the codon) on the significant breakout variable; the models were fit using SAS’ SURVEYLOGISTIC procedure, which accounts for the survey design. Pairwise comparisons were declared statistically significant at the 0.05 level on the adjusted p-value scale. A capital letter coding is used to indicate whether levels of breakout variables are significantly different from one another using the Tukey-Kramer-adjusted p-values. Levels of the same variable that share a letter are not significantly different, and levels that do not share a letter are significantly different at the 0.05 family-wise significance level. In the results and discussion sections, percentages of goats by breed are reported without making explicit statistical comparisons to other breeds with respect to genotypic or allelic proportions.
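The capital letter coding can be illustrated with a short Python sketch. It is one possible construction and not necessarily the one used to produce the published tables: each letter corresponds to a maximal set of levels in which no pair is significantly different, so two levels share a letter exactly when their pairwise comparison is not significant. The function name and inputs are hypothetical, and the set of significant pairs is assumed to come from the adjusted p-values.

```python
from string import ascii_uppercase


def letter_display(levels, significant_pairs):
    """Assign letters so two levels share a letter iff they are NOT
    significantly different. Each letter is one maximal clique of the
    'not significantly different' graph (at most 26 letters are assigned)."""
    sig = {frozenset(pair) for pair in significant_pairs}
    neighbours = {
        a: {b for b in levels if b != a and frozenset((a, b)) not in sig}
        for a in levels
    }
    cliques = []

    def bron_kerbosch(r, p, x):
        # r: current clique, p: candidates, x: already-processed vertices.
        if not p and not x:
            cliques.append(r)
            return
        for v in list(p):
            bron_kerbosch(r | {v}, p & neighbours[v], x & neighbours[v])
            p = p - {v}
            x = x | {v}

    bron_kerbosch(set(), set(levels), set())

    # Order cliques by the position of their members in `levels`.
    order = {lvl: i for i, lvl in enumerate(levels)}
    cliques.sort(key=lambda c: sorted(order[v] for v in c))

    letters = {lvl: "" for lvl in levels}
    for letter, clique in zip(ascii_uppercase, cliques):
        for lvl in clique:
            letters[lvl] += letter
    return letters
```

For example, with levels meat, dairy, and other, and only the meat versus other comparison significant, the sketch assigns A to meat, AB to dairy, and B to other.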

To assess the clustering of prolonged-incubation and resistant genotypes on operations, intraclass correlation coefficients (ICCs) were computed as described by Fleiss and Cuzick [32]. The Fleiss and Cuzick ICC estimates and the variance-components estimates are asymptotically equivalent and were identical to the hundredths decimal place for all but one codon, so only the Fleiss and Cuzick estimates are presented [32]. The ICC estimate was computed as:

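$$\rho_{FC} = 1 - \frac{1}{m(\bar{n} - 1)\,\hat{\pi}(1 - \hat{\pi})} \sum_{i=1}^{m} \frac{X_i (n_i - X_i)}{n_i}$$

where

$i = 1, \ldots, m$ indexes operations and $m$ is the number of operations,

$n_i$ is the operation sample size and $\bar{n} = \frac{1}{m}\sum_{i=1}^{m} n_i$ is the average operation sample size,

$X_i$ is the number of animals on operation $i$ with the genotype of interest (e.g., S127),

$\hat{\pi}$ is the estimate of the prevalence of the genotype of interest among all animals.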

The closer $\rho_{FC}$ is to 0, the lower the correlation among animals on a given operation with respect to presence of the genotype of interest (meaning that finding one animal with the genotype of interest tells us little about whether other animals on the same operation carry it). Conversely, the closer $\rho_{FC}$ is to 1, the higher the correlation among animals on a given operation (meaning that if one animal on an operation has the genotype of interest, we are more likely to find others with it on the same operation).
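A minimal Python sketch of this ICC estimate follows, assuming the standard Fleiss and Cuzick form: one minus the sum over operations of X_i(n_i - X_i)/n_i, divided by m(n̄ - 1)π̂(1 - π̂). The function name is hypothetical. Unless a prevalence estimate is supplied, the sketch uses the unweighted pooled proportion for π̂, which is an assumption and may differ from the estimate used in the study.

```python
def fleiss_cuzick_icc(n, x, pi_hat=None):
    """Fleiss-Cuzick ICC for a binary trait clustered within operations.

    n[i]: number of animals sampled on operation i.
    x[i]: number of those animals with the genotype of interest.
    pi_hat: optional prevalence estimate; defaults to sum(x) / sum(n).
    """
    if len(n) != len(x) or not n:
        raise ValueError("n and x must be non-empty and of equal length")
    m = len(n)
    total = sum(n)
    if pi_hat is None:
        pi_hat = sum(x) / total  # assumption: unweighted pooled prevalence
    n_bar = total / m
    # Within-operation disagreement term: sum of X_i (n_i - X_i) / n_i.
    within = sum(xi * (ni - xi) / ni for ni, xi in zip(n, x))
    denom = m * (n_bar - 1) * pi_hat * (1 - pi_hat)
    if denom == 0:
        raise ValueError("ICC undefined: prevalence is 0 or 1, or one animal per operation")
    return 1.0 - within / denom
```

Two operations of three animals each, where every animal on one operation carries the genotype and none on the other do, give an estimate of 1 (complete clustering). Two operations of two animals with one carrier each give -1, the opposite extreme.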

To assess the multivariate variability of allelic representations, an Analysis of Molecular Variance (AMOVA) framework was adopted [33]. Using this method, the variance among dissimilarities in individual animals’ allelic representations in multivariate space can be partitioned and attributed to factors of interest. In this case, the stratification variables (State, operation size, and primary production), gender, breed, and operation identifier were used to explore variability between animal-level [34] dissimilarities in multivariate space. The adonis function from the vegan package in R (version 3.5.3; R Core Team, 2019), run within RStudio (version 1.1.463; RStudio Team, 2020), was used to fit AMOVA models using the method proposed in [35], which is akin to Multivariate Analysis of Variance (MANOVA) on dissimilarity matrices, except that test statistics are evaluated using non-parametric permutation tests rather than distributional assumptions. Type I (sequential) AMOVA sums of squares were used to partition variance and test the significance of terms; terms were added to the model in the following order: the fully crossed stratification variables, gender, breed, State-breed interaction, and operation identifier. Permutations were performed assuming animals within the same sampling stratum were exchangeable. In addition, a multivariate extension of Levene’s test for homogeneity [36] was used to test for the multivariate homogeneity of dispersion among groups defined by goat breed using the betadisper function from the vegan package in R.
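The permutation logic can be illustrated with a simplified, single-factor Python sketch. It is not the adonis implementation and does not reproduce the sequential multi-term model described above. It computes a pseudo-F statistic from a dissimilarity matrix and obtains a p-value by shuffling group labels only within strata, which mirrors the exchangeability assumption. The function names are hypothetical, and at least two groups and more observations than groups are assumed.

```python
import random


def pseudo_f(d, groups):
    """One-way pseudo-F from a symmetric dissimilarity matrix d (list of lists)."""
    n = len(groups)
    labels = sorted(set(groups))
    a = len(labels)
    # Total sum of squares from all pairwise squared dissimilarities.
    ss_total = sum(d[i][j] ** 2 for i in range(n) for j in range(i + 1, n)) / n
    ss_within = 0.0
    for g in labels:
        idx = [i for i in range(n) if groups[i] == g]
        pairs = sum(d[i][j] ** 2 for k, i in enumerate(idx) for j in idx[k + 1:])
        ss_within += pairs / len(idx)
    if ss_within == 0:
        return float("inf")
    ss_among = ss_total - ss_within
    return (ss_among / (a - 1)) / (ss_within / (n - a))


def permutation_p_value(d, groups, strata, n_perm=999, seed=0):
    """Observed pseudo-F and its p-value from within-stratum label permutations."""
    rng = random.Random(seed)
    observed = pseudo_f(d, groups)
    by_stratum = {}
    for i, s in enumerate(strata):
        by_stratum.setdefault(s, []).append(i)
    hits = 0
    for _ in range(n_perm):
        permuted = list(groups)
        for idx in by_stratum.values():
            shuffled = [groups[i] for i in idx]
            rng.shuffle(shuffled)  # labels move only within their stratum
            for i, g in zip(idx, shuffled):
                permuted[i] = g
        if pseudo_f(d, permuted) >= observed:
            hits += 1
    # Count the observed arrangement as one of the permutations.
    return observed, (hits + 1) / (n_perm + 1)
```

With Euclidean distances between the points 0, 1, 10, and 11 and groups (a, a, b, b), the pseudo-F equals 200, the same value as the classical one-way ANOVA F statistic for those data.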

To explore the multivariate genetic distance between breeds, correspondence analysis [37] was used to reduce the dimensionality of the weight-scaled allelic rate measures at the breed level and to visualize multivariate distance relationships between breeds and between codons. The correspondence analysis was carried out using the dudi.coa function from the ade4 package [38] in R on the weight-scaled, pairwise Nei distance matrix. To further explore relationships between breeds, a neighbor-joining tree was estimated using the nj function from the ape package in R [34].
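As an illustration of a pairwise Nei distance between two breeds, the Python sketch below computes Nei's 1972 standard genetic distance from per-codon allele frequencies. The text does not state which Nei distance was used or how the weight scaling was applied, so both the choice of the standard distance and the input layout are assumptions, and the function name is hypothetical.

```python
from math import inf, log, sqrt


def nei_standard_distance(freqs_x, freqs_y):
    """Nei's (1972) standard genetic distance between two populations.

    freqs_x, freqs_y: dict mapping locus -> dict mapping allele -> frequency.
    Only loci present in both populations are used.
    """
    loci = sorted(set(freqs_x) & set(freqs_y))
    if not loci:
        raise ValueError("no shared loci")
    j_x = j_y = j_xy = 0.0
    for locus in loci:
        fx, fy = freqs_x[locus], freqs_y[locus]
        j_x += sum(p * p for p in fx.values())
        j_y += sum(q * q for q in fy.values())
        j_xy += sum(p * fy.get(allele, 0.0) for allele, p in fx.items())
    # Averaging over loci cancels in the ratio, so sums are sufficient.
    identity = j_xy / sqrt(j_x * j_y)
    return inf if identity == 0 else -log(identity)
```

Two breeds with identical allele frequencies have a distance of 0, and breeds that share no alleles at any locus have an infinite distance. A matrix of such pairwise distances is the kind of input a neighbor-joining routine takes.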

Statistical significance is a tool used here to focus the analysis and discussion towards common or likely relationships in the population in a complex, multivariate system of genetic expressions in goats within a hierarchical population structure. Relationships that do or do not meet the significance levels are not implied to be present or absent at a practical scale; they just did or did not meet the requirements to be statistically significant given the methods described above. Point estimates and variance estimates are given for the reader to be able to further investigate relationships beyond the binary decisions made regarding statistical significance.
