Simulating gene sequence analysis with Hidden Markov Models

Mike Chaykowsky
5 min readMar 29, 2019

For some reason I had an inkling to go back to HMMs recently and here was the result. I started with this article from Nature by Sean R Eddy on biological sequence analysis and gene identification. Note: I am not a biologist and do not have any strong understanding of gene sequencing so it may (most certainly will) happen that I use the wrong terminology.

Here is an image from the article that lays out the general plan we will try to replicate.

The key here will be to try and identify when a 5' splice comes up as indicated in the State Path above.

In gene identification, we want to label nucleotides as exons, introns, or intergenic sequence (Eddy, 2004).

I need to simulate a string of length 26 (as in the Eddy study) and determine the probability of that string given the structure while inferring the hidden state path. But first I can construct the HMM in R.

require(HMM)nSim = 26
States = c("S","exon", "5SS", "intron", "N")
Symbols = LETTERS[c(1,3,7,20)]
transitions = matrix(c(0,1, 0, 0, 0, 0, 0.9, 0.1, 0, 0,0,0,0,1,0,0,0,0,0.9,0.1,0,0,0,0,1), c(length(States), length(States)), byrow = TRUE)
row0 <- c(0,0,0,0)
row1 <- c(1/4, 1/4, 1/4, 1/4)
row2 <- c(0.05, 0, 0.95, 0)
row3 <- c(0.4,0.1,0.1,0.4)
row4 <- c(0,0,0,0)
emissions = matrix(c(row0,row1,row2,row3,row4), nrow=5, ncol=4, byrow = TRUE)
hmm = initHMM(States, Symbols, transProbs = transitions, emissionProbs = emissions)

So our HMM looks like this:

$States
[1] "S" "exon" "5SS" "intron" "N"
$Symbols
[1] "A" "C" "G" "T"
$startProbs
S exon 5SS intron N
0.2 0.2 0.2 0.2 0.2
$transProbs
to
from S exon 5SS intron N
S 0 1.0 0.0 0.0 0.0
exon 0 0.9 0.1 0.0 0.0
5SS 0 0.0 0.0 1.0 0.0
intron 0 0.0 0.0 0.9 0.1
N 0 0.0 0.0 0.0 1.0
$emissionProbs
symbols
states A C G T
S 0.00 0.00 0.00 0.00
exon 0.25 0.25 0.25 0.25
5SS 0.05 0.00 0.95 0.00
intron 0.40 0.10 0.10 0.40
N 0.00 0.00 0.00 0.00

Now we can run the simulation. It does not always run because of the probability structure, so I put together a function that just runs it until we get something that works.

runSim <- function(num_trys=100) {
# Function to keep trying sim until it works
#
for (i in c(1:num_trys)) {
possibleError <- tryCatch(
simHMM(hmm, 26),
error=function(e) e
)
if(!inherits(possibleError, "error")) {
return(possibleError)
}
}
}
sim <- runSim()

So here is what our simulation turned out:

$states
[1] "exon" "exon" "exon" "exon" "exon" "exon" "exon" "exon" "exon" "exon"
[11] "exon" "exon" "exon" "exon" "5SS" "intron" "intron" "intron" "intron" "intron"
[21] "intron" "intron" "intron" "intron" "intron" "intron"
$observation
[1] "T" "T" "A" "C" "A" "A" "A" "G" "T" "C" "C" "G" "C" "C" "G" "T" "A" "C" "A" "T" "T" "T" "T"
[24] "A" "C" "A"

Now we will use Viterbi’s most probable path to indicate what was the most likely series of states given the series of observations.

vit = viterbi(hmm, sim$observation)
f = forward(hmm, sim$observation)
b = backward(hmm, sim$observation)
i <- f[2, nSim]
j <- f[3, nSim]
probObservations = (i + log(1 + exp(j - i)))
posterior = exp((f + b) - probObservations)
x = list(hmm = hmm, sim = sim, vit = vit, posterior = posterior)

The Viterbi algorithm is just like the forward procedure except that instead of tracking the total probability of generating the observations seen so far, we need only track the maximum probability and record its corresponding state sequence. CS229 Stanford HMM pg. 8

Now let’s try to plot this thing and see what we got.

plot(as.numeric(factor(x$sim$observation)), ylim = c(-7.5, 6), pch=3, ylab = "", xlab = "", bty = "n", xaxt = "n", yaxt = "n"); axis(1, at=seq_along(x$sim$observation), labels=x$sim$observation)
axis(2, at = 1:4)
# Simulating (truth) #
text(0.3, -1.2, adj = 0, cex = 0.8, col = "black", "True state path")
for (i in 1:nSim) {
if (x$sim$states[i] == "exon")
rect(i, -1, i + 1, 0, col = "green", border = NA)
else if (x$sim$states[i] == "5SS")
rect(i, -1, i + 1, 0, col = "white")
else rect(i, -1, i + 1, 0, col = "red", border = NA)
}
# Inference and Viterbi algorithm: finding the most probable path #
text(0.3, -3.2, adj = 0, cex = 0.8, col = "black", "Viterbi's most probable path")
for (i in 1:nSim) {
if (x$vit[i] == "exon")
rect(i, -3, i + 1, -2, col = "green", border = NA)
else if (x$sim$states[i] == "5SS")
rect(i, -1, i + 1, 0, col = "white")
else rect(i, -3, i + 1, -2, col = "red", border = NA)
}
# Differences:
text(0.3, -5.2, adj = 0, cex = 0.8, col = "black", "Difference")
differing = !(x$sim$states == x$vit)
for (i in 1:nSim) {
if (differing[i])
rect(i, -5, i + 1, -4, col = rgb(0.3, 0.3, 0.3),
border = NA)
else rect(i, -5, i + 1, -4, col = rgb(0.9, 0.9, 0.9),
border = NA)
}
points(x$posterior[2, ] - 3, type = "l")
# Difference with classification by posterior-probability:
text(0.3, -7.2, adj = 0, cex = 0.8, col = "black", "Difference by posterior-probability")
differing = !(x$sim$states == x$vit)
for (i in 1:nSim) {
if (posterior[2, i] > 0.5) {
if (x$sim$states[i] == "exon")
rect(i, -7, i + 1, -6, col = rgb(0.9, 0.9, 0.9),
border = NA)
else rect(i, -7, i + 1, -6, col = rgb(0.3, 0.3, 0.3),
border = NA)
}
else {
if (x$sim$states[i] == "intron")
rect(i, -7, i + 1, -6, col = rgb(0.9, 0.9, 0.9),
border = NA)
else rect(i, -7, i + 1, -6, col = rgb(0.3, 0.3, 0.3),
border = NA)
}
}
Success!

Viterbi worked! Now this is not always the case. Often if the true state path indicates a 5' splice very early or late in the sequence Viterbi does not pick it up. Notice the 5' splice occurs on a G which makes sense given the initial model.

--

--

Mike Chaykowsky

Data Scientist at Rivian. x- RAND Researcher. Based out of Los Angeles. @ChaykowskyMike linkedin.com/in/chaykowsky