Building fifth order Markov models

Hyun-Seok Park


This method is extracted from research article: Genomics Inform, Sep 2018

A Short Report on the Markov Property of DNA Sequences on 200-bp Genomic Units of ENCODE/Broad ChromHMM Annotations: A Computational Perspective

DOI: 10.5808/GI.2018.16.3.65


After assigning a dominant chromatin state to each 200-bp unit, we used frequency counts to build fifteen initial transition tables for the fifth order Markov models [10]. For example, a uniform fifth order Markov chain is specified by a vector of initial probabilities P(X_{n-5}, X_{n-4}, X_{n-3}, X_{n-2}, X_{n-1}) with 4,096 components, together with a matrix of transition probabilities P(X_n | X_{n-5}, X_{n-4}, X_{n-3}, X_{n-2}, X_{n-1}) of size 4,096 × 4. These tables were used to build a global Markov chain classifier to explore and rank sub-optimal predictions of the chromatin states. Based on the nucleotide frequency profiles, given a random sequence x_1, x_2, ..., x_{200} in the state of a cell line, we compared sequences π_1, π_2, ..., π_{200} of chromatin states that maximized the path probability under the initial 15 Markov chain models, where a_{π_i π_{i+1}} is the transition probability between consecutive states.
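The frequency-count step can be illustrated with a minimal sketch. It assumes plain nucleotide strings as input and adds a pseudocount so that unseen contexts do not produce zero probabilities; the function name, the pseudocount, and the handling of non-ACGT characters are assumptions of this example, not details from the article. With the four-letter alphabet assumed here, five preceding bases give 4^5 = 1,024 contexts, so the table in this sketch has 1,024 rows of 4 probabilities (4,096 entries in total).

```python
from itertools import product

ORDER = 5          # fifth order: condition on the five preceding bases
ALPHABET = "ACGT"  # assumed alphabet; other characters (e.g. N) are skipped


def build_transition_table(sequences, order=ORDER, pseudocount=1.0):
    """Estimate P(x_n | previous `order` bases) from frequency counts.

    `pseudocount` is an assumption of this sketch, not part of the source.
    """
    # One row per context, initialised with the pseudocount.
    counts = {
        "".join(ctx): {base: pseudocount for base in ALPHABET}
        for ctx in product(ALPHABET, repeat=order)
    }
    for seq in sequences:
        seq = seq.upper()
        for i in range(order, len(seq)):
            ctx, base = seq[i - order:i], seq[i]
            # Windows containing characters outside ACGT are ignored.
            if ctx in counts and base in ALPHABET:
                counts[ctx][base] += 1
    # Normalise each row so that it sums to 1.
    table = {}
    for ctx, row in counts.items():
        total = sum(row.values())
        table[ctx] = {base: c / total for base, c in row.items()}
    return table


table = build_transition_table(["ACGTACGTACGTACGT"])
print(len(table), table["ACGTA"])
```

In practice one such table would be trained per chromatin state, using only the 200-bp units assigned to that state.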

By trial and error, we rebuilt newer Markov chains by iteratively analyzing the variability count of the chromatin states of a given 200-bp unit, and by eliminating the highly variable 200-bp units in training.

Fig. 3 summarizes our process of building Markov chains. Dissecting the human genome into 200-bp units originally yielded 14,075,448 units. In the end, we excluded from the training of our transition tables any 200-bp unit that showed more than two different chromatin state signatures. Thus, our result is based on 7,038,863 units, which accounted for approximately 49.75% of the entire human genome. Determining whether the remaining 50.25% of highly variable 200-bp units of the genome would show a Markov property is beyond the scope of this paper.
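The exclusion rule can be sketched as a simple filter. This example assumes that each unit carries one chromatin state label per cell line, that the variability count is the number of distinct labels, and that the dominant state is the most frequent label; the function names, unit identifiers, and toy labels are hypothetical.

```python
from collections import Counter


def variability_count(states):
    """Number of distinct chromatin states observed for one unit."""
    return len(set(states))


def select_training_units(unit_states, max_distinct=2):
    """Keep units with at most `max_distinct` distinct states.

    Returns a mapping from unit id to its dominant (most frequent) state.
    """
    kept = {}
    for unit_id, states in unit_states.items():
        if variability_count(states) <= max_distinct:
            kept[unit_id] = Counter(states).most_common(1)[0][0]
    return kept


# Toy input: one state label per cell line (nine cell lines assumed).
unit_states = {
    "chr1:0-200": [13] * 9,
    "chr1:200-400": [1, 1, 1, 2, 1, 1, 1, 1, 1],
    "chr1:400-600": [1, 4, 9, 1, 4, 9, 1, 4, 9],  # three states: excluded
}
kept = select_training_units(unit_states)
print(kept)
```

Lowering `max_distinct` tightens the filter, which is one way to mimic the iterative rebuilding described in the text.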

Fig. 3. Flowchart of building Markov chains by iteratively eliminating highly variable 200-bp units.

Through this process, we found that some inactive chromatin states were highly constitutive and marked in most of the 9 epigenomes. For example, state 13 (Hetero_Chromatin state), which covered on average 70.48% of each reference epigenome, was excluded when considering the variability count of the chromatin states. We also excluded units in which a transcribed state showed both promoter and enhancer signatures. For the most part, we profiled each 200-bp unit with its chromatin states and built new transition tables by training on the 200-bp blocks with a chromatin variability of less than 2 (and containing at least one active state).

These fifteen chromatin states were then merged into six broad states: Promoter, Enhancer, Insulator, Transition, Repressed, and Inactive. Our final transition tables for the Promoter, Enhancer, Insulator, Transition, and Repressed states (excluding inactive states) were built from 121,500, 701,636, 89,844, 4,023,295, and 155,411 200-bp units, respectively. As these Markov chains could be used as a Naive Bayes classifier, we calculated, for each 200-bp unit, the sequence that maximized the probability under our Markov models. We defined a correctly predicted unit as one in which the predicted result matched one of the dominant chromatin states in the same broad state.
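One way to read the classification step is as a maximum-likelihood choice among the per-state tables. The sketch below scores a sequence under each model with a sum of log transition probabilities and returns the best-scoring model. It assumes equal priors and omits the initial-word probability; the toy tables and all function names are hypothetical and stand in for tables trained on real data.

```python
import math
from itertools import product

ALPHABET = "ACGT"
ORDER = 5


def log_likelihood(seq, table, order=ORDER):
    """Sum of log P(x_i | previous `order` bases).

    The initial-word term is omitted in this sketch.
    """
    total = 0.0
    for i in range(order, len(seq)):
        total += math.log(table[seq[i - order:i]][seq[i]])
    return total


def classify(seq, models):
    """Return the name of the highest-scoring model and all scores."""
    scores = {name: log_likelihood(seq, t) for name, t in models.items()}
    return max(scores, key=scores.get), scores


def toy_table(favoured, p=0.7):
    """Hypothetical table in which one base is favoured after every context."""
    rest = (1.0 - p) / 3
    row = {b: (p if b == favoured else rest) for b in ALPHABET}
    return {"".join(c): dict(row) for c in product(ALPHABET, repeat=ORDER)}


# Two toy models standing in for trained broad-state tables.
models = {"Promoter": toy_table("G"), "Enhancer": toy_table("A")}
best, scores = classify("GGGGGGGGGGCG", models)
print(best, scores)
```

Sorting `scores` in descending order gives a ranking of the alternative predictions rather than only the top one.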

It is identical to the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/).
