I'm modeling the punctuation prediction problem as a hidden event model, and I'm trying to follow the algorithm described in Stolcke's paper "Modeling the Prosody of Hidden Events for Improved Word Recognition".
After estimating an N-gram model, he describes the algorithm for inferring the most likely sequence of events:
By using an N-gram model for P(W,S), and decomposing the prosodic likelihoods as in Equation 4, the joint model P(W,S,F) becomes equivalent to a hidden Markov model (HMM). The HMM states are the (word,event) pairs, while prosodic features form the observations. Transition probabilities are given by the N-gram model; emission probabilities are estimated by the prosodic model described below. Based on this construction, we can carry out the summation over all possible event sequences efficiently with the familiar forward dynamic programming algorithm for HMMs.
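If I try to write this down for the bigram case (N = 2), the forward pass seems straightforward, since the state really is just the previous (word, event) pair. Here's my sketch of it (`bigram_prob`, `emit_prob`, and the event inventory are placeholders I made up, not anything from the paper):

```python
def forward(words, feats, bigram_prob, emit_prob,
            events=("NONE", "COMMA", "PERIOD")):
    """Sum P(W, S, F) over all event sequences S with the forward algorithm.

    words       -- observed word sequence w_1 .. w_T
    feats       -- prosodic feature vectors f_1 .. f_T, aligned with words
    bigram_prob -- P((w_t, s_t) | (w_{t-1}, s_{t-1})), from the N-gram model
    emit_prob   -- P(f_t | w_t, s_t), from the prosodic model
    events      -- made-up event inventory; NONE means "no punctuation here"

    Since the words are observed, only the event half of each (word, event)
    state is unknown, so alpha can be indexed by the event alone.
    """
    # Initialization from a hypothetical sentence-start state ("<s>", "NONE").
    alpha = {s: bigram_prob((words[0], s), ("<s>", "NONE"))
                * emit_prob(feats[0], words[0], s)
             for s in events}

    # Recursion: alpha_t(s) = P(f_t | w_t, s)
    #            * sum_{s'} P((w_t, s) | (w_{t-1}, s')) * alpha_{t-1}(s')
    for t in range(1, len(words)):
        alpha = {s: emit_prob(feats[t], words[t], s)
                    * sum(bigram_prob((words[t], s), (words[t - 1], sp))
                          * alpha[sp]
                          for sp in events)
                 for s in events}

    # Termination: total likelihood P(W, F), marginalized over all S.
    return sum(alpha.values())
```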
I'm confused about how this can be a Markov model with states (word, event) in the general case. If the underlying model is an N-gram model, it seems to me that the state needs to encode the N-1 previous words in order to contain all the information necessary to predict the next state.
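To make my confusion concrete, here's the state space I would expect a trigram (N = 3) model to require (again, the event labels are made up for illustration):

```python
from itertools import product

def trigram_states(vocab, events=("NONE", "COMMA", "PERIOD")):
    """Yield the states I would expect a trigram (N = 3) model to need:
    each Markov state is the last N-1 = 2 (word, event) pairs, so there
    are (|vocab| * |events|) ** 2 states, not |vocab| * |events|."""
    pairs = list(product(vocab, events))
    return product(pairs, pairs)  # lazy: the full list would be huge

# Even a tiny vocabulary shows the blow-up:
vocab = ["the", "cat", "sat"]
print(sum(1 for _ in trigram_states(vocab)))  # (3 * 3) ** 2 == 81
```

That squares the state space relative to the single-pair construction the paper describes, so I suspect I'm misreading something. What's going on here? Thanks!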