Markov Chain Generators - The LLM's Single-Celled Ancestor
Markov chain generators give you all the thrill of GPTs with none of the dumb math and engineering
One of my favorite pet interview questions for programmers is something I came up with after learning about Markov chain generators when I was a wee programmer back in the early 00s. I’d ask the candidate if they were familiar with Markov chain generators (most were not), and then briefly explain the concept. Then I’d ask them to write one in pseudo-code.
In my defense, this was usually a bonus question I’d throw in toward the end of the technical part of an interview, and only if the candidate blew through my simpler questions. I’d never ding them for not getting it, but a candidate’s answer to that question has pushed me from a ‘no’ to a ‘yes’ a handful of times. I’d also emphasize that this is a collaborative part of our interview and that questions and discussion are encouraged. I’ve helped a few candidates figure it out by working through it together.
A Markov chain generator: using a sample corpus of text, write a function that will generate a string of text using a Markov chain. A Markov chain “… is a stochastic model describing a sequence of possible events in which the probability of each event depends only on the state attained in the previous event”, or, for those of us who only speak English: predict something (in this case a word) based on the probability of it appearing after the inputs (words) that came before it. Practically speaking, predict the next word in a given partial sentence based on sentences you’ve seen in the past. Sound familiar?
Ok, so that’s not a whole lot clearer. Let’s consider an example:
Take the phrase “the cat and the dog and the bird sat on the mat on the deck” as our sample corpus. Let’s build a bi-gram Markov chain generator, meaning we’ll look at n-grams of length 2. I like to think of it as a sliding window. If we consider this sentence as starting with a hidden ‘start’ token, the final corpus is “<s> the cat and the dog and the bird sat on the mat on the deck”.
Looking at each word whilst ‘remembering’ the previous two words, we can calculate the probability of each word in the corpus appearing after each bi-gram. Starting at “cat” (so that we have a two-word history to build a bi-gram from), the previous bi-gram is “<s> the”. We calculate that the probability of “cat” appearing after “<s> the” is 100%. Sliding the window forward, we calculate that the probability of “and” appearing after “the cat” is 100%. Consider “dog”: “dog” appears after “and the” 100% of the time. Skipping ahead to “bird”, we calculate that the probability of “bird” appearing after “and the” is 50%. We’ve seen “dog” come after “and the”, and now we’ve seen “bird”, so we also need to update the probability for “dog” from 100% to 50%. Therefore, in this bi-gram Markov model, “and the” produces “dog” half the time and “bird” half the time.
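If it helps to see that bookkeeping spelled out, here’s a minimal sketch of the counting step for the example sentence (the variable names are just for illustration, not the interview answer):
from collections import Counter, defaultdict

corpus = '<s> the cat and the dog and the bird sat on the mat on the deck'.split()

# slide a two-word window over the corpus and count what follows each bi-gram
counts = defaultdict(Counter)
for i in range(2, len(corpus)):
    counts[(corpus[i - 2], corpus[i - 1])][corpus[i]] += 1

# turn the counts for 'and the' into probabilities
followers = counts[('and', 'the')]
total = sum(followers.values())
print({word: count / total for word, count in followers.items()})
# prints {'dog': 0.5, 'bird': 0.5}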
That was a lot of words to explain something that’s honestly pretty simple and intuitive. The explanation usually goes pretty fast in person. Once they “get” it, we move on to implementation. I love this question because it triple dips: I get to see a candidate’s algorithm and data structure chops, I get to see how they handle a novel problem, and I get to see how they handle being abruptly thrown in the deep end (with support). Once they either have a solution or run out of time, we compare notes and I tell them about my implementation. I never ask questions I can’t answer myself.
These things are easy to implement in a dumb naïve way. Here’s some Python that does it:
import datasets
import random
import re

# punctuation to strip from tokens
expr = r'[\-\,\.\:\'\?;\!]+'

d = datasets.load_dataset('tiny_shakespeare')['train']
d = d[0]['text'].split()

unigrams = {}
bigrams = {}

# build the model: record which word follows the previous word (unigram history)
# and the previous two words (bi-gram history), advancing two words at a time
for idx in range(2, len(d), 2):
    word = re.sub(expr, '', d[idx].strip()).lower()
    unigram = re.sub(expr, '', d[idx - 1].strip()).lower()
    if unigram not in unigrams:
        unigrams[unigram] = [word]
    else:
        unigrams[unigram].append(word)
    bigram = (re.sub(expr, '', d[idx - 2].strip()).lower(), unigram)
    if bigram not in bigrams:
        bigrams[bigram] = [word]
    else:
        bigrams[bigram].append(word)

# generate some text
n = 20
sample = random.sample(list(bigrams.keys()), 1)[0]
text = ' '.join(list(sample) + random.sample(bigrams[sample], 1))
print('text:', text, end='')
for i in range(n):
    prev_unigram = text.split()[-1]
    prev_bigram = tuple(text.split()[-2:])
    if prev_bigram in bigrams:
        # sampling from the list of observed followers reproduces their corpus frequencies
        word = random.sample(bigrams[prev_bigram], 1)[0]
    elif prev_unigram in unigrams:
        # back off to the unigram model if this bi-gram was never recorded
        word = random.sample(unigrams[prev_unigram], 1)[0]
    else:
        # no history to draw from; terminate the generation
        break
    print(' ' + word, end='')
    text += ' ' + word
Drop that code into a Jupyter notebook and you’ll get some Shakespeare-ish text:
text: angelo thy faults richard ii norfolk for a horse and men the sweetest sleep when this stream of suspicion has wings of grasshoppers
Basically it’s building a bi-gram model from the “tiny_shakespeare” dataset on Huggingface, with some basic text cleaning via regular expressions. It falls back to a unigram model if it can’t find the generated bi-gram, and if it can’t find a word that follows the unigram (not sure why that happens, probably a bug), it terminates the generation.
Generating bi-grams that don’t exist in the corpus is a function of the kind of corpus (very heterogeneous sources), the amount of data, and, finally, an inherent limitation of the Markov model. It’s telling us there are aspects of language hidden from the Markov model, and that the examples it’s seen don’t cover all possible states of the output.
Still, it’s quick and the output seems reasonable…ish.
I’ve been researching deep learning and transformer models, and they sure do remind me of these toy Markov models. Our probabilities are calculated from the training corpus in a single shot, whereas a transformer model’s probabilities are calculated from the training corpus with noise added, via a gradient descent training loop. The Markov chain model can be thought of as having a context length of n, where n is the size of our n-gram, compared to the 512-token context length of BERT, for example. The Markov chain is a single-layer, disconnected network, compared to the multi-layer, fully connected transformers. If you squint enough you can see the underlying theme.
The dimensionality of a Markov model is orders of magnitude smaller than that of a transformer model, which makes it much easier to see what’s going on. The probabilities are directly tied to the data, and you can easily look up bi-grams and their associated following-word probabilities. Once you grasp Markov models, transformer models are less intimidating. They’re just bigger Markov models on steroids.
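If you want to poke at that directly, here’s a small sketch (assuming the bigrams dict from the script above is still in memory) that prints the follower distribution for an arbitrary bi-gram:
from collections import Counter

# grab an arbitrary bi-gram key from the model and tally the words that follow it
key = next(iter(bigrams))
followers = Counter(bigrams[key])
total = sum(followers.values())
for word, count in followers.most_common():
    print(key, '->', word, round(count / total, 3))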