MarkovText is a C++ application that implements a word-level Markov Chain algorithm to simulate natural language patterns. By analyzing the statistical distribution of word transitions in a source corpus, the engine builds a probability map (N-gram model) to generate coherent, stylistically consistent text extensions.
This project serves as a low-level demonstration of the probabilistic principles behind modern Large Language Models (LLMs).
- N-Gram Modeling: Builds a dynamic graph of word states and transition probabilities (Order-K Markov Model) to predict the next token in a sequence.
- Contextual Adaptation: Features a "Merge" algorithm that analyzes user input, identifies high-frequency terms, and biases the generation model to mimic the user's specific writing style.
- Corpus Agnostic: Capable of ingesting any plain text (.txt) corpus—from Shakespeare to financial reports—and adopting its statistical voice immediately.
- Efficient Lookups: Utilizes high-performance hash maps to store thousands of state transitions with O(1) average access time (see the sketch below).
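These features rest on a nested hash-map layout. A rough illustration follows; the identifiers are hypothetical and not the project's actual API in RandomWriter.cpp:

```cpp
#include <string>
#include <unordered_map>

// Hypothetical layout of the transition graph: one entry per K-word state
// (e.g., "the quick" for K = 2), mapping each observed successor ("brown")
// to its frequency count. std::unordered_map gives O(1) average access both
// when looking up a state and when bumping a successor's count.
using State = std::string;  // the K seed words, joined by single spaces
using TransitionTable =
    std::unordered_map<State, std::unordered_map<std::string, int>>;
```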
The engine parses text into a nested map structure:
- Training: The source text is scanned to create a frequency table of `Seed -> {Next_Word, Probability}`.
- Analysis: The "Order" (K) determines how many previous words constitute a "state" (e.g., in an Order-2 model, "The quick" predicts "brown").
- Generation: Starting from a seed, the engine performs a weighted random walk through the probability graph to construct new sentences (a sketch follows the complexity notes below).
- Time Complexity: O(N) for training (linear scan of corpus).
- Space Complexity: O(U) where U is the number of unique K-gram combinations in the source text.
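A minimal end-to-end sketch of these three phases, assuming whitespace tokenization and C++11; the function names and the tiny Order-2 demo corpus are illustrative, not taken from the project:

```cpp
#include <iostream>
#include <random>
#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>

using Table = std::unordered_map<std::string, std::unordered_map<std::string, int>>;

// Training: a single O(N) scan that counts every (state -> next word) transition.
Table train(const std::vector<std::string>& words, int k) {
    Table table;
    for (size_t i = 0; i + k < words.size(); ++i) {
        std::string state = words[i];
        for (int j = 1; j < k; ++j) state += " " + words[i + j];
        ++table[state][words[i + k]];
    }
    return table;
}

// Generation step: a weighted random pick among the successors of `state`,
// so more frequent transitions are proportionally more likely.
std::string pickNext(const Table& table, const std::string& state, std::mt19937& rng) {
    const auto& nexts = table.at(state);  // caller must check the state exists
    std::vector<std::string> candidates;
    std::vector<int> weights;
    for (const auto& kv : nexts) {
        candidates.push_back(kv.first);
        weights.push_back(kv.second);
    }
    std::discrete_distribution<size_t> dist(weights.begin(), weights.end());
    return candidates[dist(rng)];
}

int main() {
    std::istringstream corpus("the quick brown fox jumps over the quick brown dog");
    std::vector<std::string> words;
    for (std::string w; corpus >> w;) words.push_back(w);

    const int k = 2;  // Order-2 model, matching the example above
    Table table = train(words, k);

    std::mt19937 rng(std::random_device{}());
    std::string state = words[0] + " " + words[1];  // seed with the first K words
    std::cout << state;
    for (int i = 0; i < 8 && table.count(state); ++i) {
        std::string next = pickNext(table, state, rng);
        std::cout << ' ' << next;
        // Slide the window (valid for K >= 2): drop the oldest word, append the new one.
        state = state.substr(state.find(' ') + 1) + " " + next;
    }
    std::cout << '\n';
}
```

The walk stops early if it reaches a state with no recorded successors (a "dead end" at the corpus boundary), which is why the loop re-checks `table.count(state)` each step.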
- C++ Compiler (g++ or clang) supporting the C++11 standard.
- StanfordCPPLib (Required for console interface and simplified I/O).
- Compile the source code using the provided Makefile or your preferred IDE.
- Place your training data (e.g., `literature.txt`) in the `res/` directory.
- Run the executable: `./MarkovText`
- Follow the console prompts to select a "Base Style" (training file) and an "Input Context" (file to extend).
The generation behavior can be fine-tuned via constants in the header file:
- NUM_OF_WORDS_MERGED: Controls how aggressively the model blends the user's input style with the base corpus.
- MIN_WORD_TO_GENERATE: Sets the lower bound for output length.
- ORDER_K: (Advanced) Adjusts the Markov chain order (higher K = more coherent but less creative).
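Hypothetically, the relevant section of the header might look like the following; the values shown are placeholders, so consult the actual header for the shipped defaults:

```cpp
// Illustrative values only -- check the project's header for the real defaults.
const int NUM_OF_WORDS_MERGED  = 50;   // how many high-frequency user terms to blend into the base model
const int MIN_WORD_TO_GENERATE = 100;  // minimum number of words in the generated output
const int ORDER_K              = 2;    // chain order: higher K = more coherent, less creative
```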
```
├── src/
│   ├── RandomWriter.cpp   # Core Markov Chain logic and generation loop
│   └── main.cpp           # Entry point and UI handling
├── res/                   # Training corpora (.txt files)
├── README.md
└── LICENSE
```
- Refactor to remove dependencies on educational libraries (standard C++ library only).
- Implement character-level generation for code or poetry modeling.
- Add serialization to save trained models to disk for instant loading.
Author: Luka Aladashvili