I'm training a 2-state HMM on a large English text (the first 50,000 characters of the Brown Corpus, restricted to letters and spaces); my algorithm follows Mark Stamp's tutorial (www.cs.sjsu.edu/~stamp/RUA/HMM.pdf).
Since the observations comprise only the 26 letters and the space character, I initially gave each observation (within a state) a probability of 1/27, then perturbed each by 0.0001 while keeping the rows stochastic.
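In sketch form, that initialization looks like this (Python/NumPy; the alternating sign pattern is just one way to realize the 0.0001 perturbation I described):

```python
import numpy as np

N, M = 2, 27                      # hidden states; 26 letters + space
B = np.full((N, M), 1.0 / M)      # start with exactly uniform rows
# Nudge each entry by 0.0001 with alternating signs so no row is
# exactly uniform, then renormalize to keep the rows stochastic.
signs = np.where(np.arange(M) % 2 == 0, 1.0, -1.0)
B += 0.0001 * signs
B /= B.sum(axis=1, keepdims=True)
```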
Running the trainer for 50 iterations gives only minute incremental improvements in log[P(O|lambda)], where lambda is the updated model. Moreover, in the observation matrix of the final model, the probability of each observation is almost identical across the two states (see http://pastebin.com/xVVYNhGs).
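For reference, the quantity I track is computed as in Stamp's tutorial: with the scaled forward pass, log[P(O|lambda)] = -sum_t log(c_t), where the c_t are the scaling factors. A minimal sketch (Python/NumPy; pi, A, B, and an integer-coded observation sequence obs are assumed given):

```python
import numpy as np

def log_likelihood(pi, A, B, obs):
    """log P(O|lambda) via the scaled forward pass (as in Stamp's tutorial)."""
    log_prob = 0.0
    alpha = pi * B[:, obs[0]]           # unscaled alpha at t = 0
    c = 1.0 / alpha.sum()               # scaling factor c_0
    alpha *= c
    log_prob -= np.log(c)
    for t in range(1, len(obs)):
        alpha = (alpha @ A) * B[:, obs[t]]
        c = 1.0 / alpha.sum()
        alpha *= c
        log_prob -= np.log(c)
    return log_prob                     # equals log P(O|lambda)
```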
I figured I was stuck at a local maximum, so I altered the initial guess for the observation matrix to match Stamp's, and within the same number of iterations it actually gave me an updated observation matrix that differs across the states. (50 iterations: http://pastebin.com/U0AgrJ2N; 100 iterations: http://pastebin.com/yAkruNjs)
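Concretely, Stamp's advice amounts to row-stochastic values that are random but close to uniform; a sketch of how one can generate such an initial emission matrix (Python/NumPy; the noise scale and seed are illustrative choices of mine, not from the tutorial):

```python
import numpy as np

rng = np.random.default_rng(seed=1)   # different seeds give different restarts

def random_near_uniform(rows, cols, noise=0.01):
    """Row-stochastic matrix with entries randomly perturbed from 1/cols."""
    m = 1.0 / cols + noise * rng.uniform(-1.0, 1.0, size=(rows, cols))
    return m / m.sum(axis=1, keepdims=True)

B0 = random_near_uniform(2, 27)       # fresh initial emission matrix
```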
My question is: the altered initial observation matrix (the emission probabilities) clearly broke me out of that local maximum, but how would I go about finding/optimizing such an initial guess in general?