Skip to content

A flexible C++ program that uses word-level Markov chains to generate coherent text based on input training files. It can be used to simulate writing styles, extend unfinished documents, or analyze text patterns using statistical modeling.

Notifications You must be signed in to change notification settings

llukito/RandomWriter-Markov-Based-Text-Style-Generator

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

5 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

MarkovText: Stochastic N-Gram Prediction Engine

Language Type Status

MarkovText is a C++ application that implements a word-level Markov Chain algorithm to simulate natural language patterns. By analyzing the statistical distribution of word transitions in a source corpus, the engine builds a probability map (N-gram model) to generate coherent, stylistically consistent text extensions.

This project serves as a low-level demonstration of the probabilistic principles behind modern Large Language Models (LLMs).

Key Features

  • N-Gram Modeling: Builds a dynamic graph of word states and transition probabilities (Order-K Markov Model) to predict the next token in a sequence.
  • Contextual Adaptation: Features a "Merge" algorithm that analyzes user input, identifies high-frequency terms, and biases the generation model to mimic the user's specific writing style.
  • Corpus Agnostic: Capable of ingesting any plain text (.txt) corpus—from Shakespeare to financial reports—and adopting its statistical voice immediately.
  • Efficient Lookups: Utilizes high-performance hash maps to store thousands of state transitions with O(1) average access time.

Technical Implementation

The Algorithm

The engine parses text into a multi-dimensional map structure:

  1. Training: The source text is scanned to create a frequency table of Seed -> {Next_Word, Probability}.
  2. Analysis: The "Order" (K) determines how many previous words constitute a "state." (e.g., in an Order-2 model, "The quick" predicts "brown").
  3. Generation: Starting from a seed, the engine performs a weighted random walk through the probability graph to construct new sentences.

Complexity

  • Time Complexity: O(N) for training (linear scan of corpus).
  • Space Complexity: O(U) where U is the number of unique K-gram combinations in the source text.

Installation & Usage

Prerequisites

  • C++ Compiler (g++ or clang) supporting C++11 standard.
  • StanfordCPPLib (Required for console interface and simplified I/O).

Usage

  1. Compile the source code using the provided Makefile or your preferred IDE.

  2. Place your training data (e.g., literature.txt) in the res/ directory.

  3. Run the executable:

    ./MarkovText

  4. Follow the console prompts to select a "Base Style" (training file) and an "Input Context" (file to extend).

Configuration

The generation behavior can be fine-tuned via constants in the header file:

  • NUM_OF_WORDS_MERGED: Controls how aggressively the model blends the user's input style with the base corpus.
  • MIN_WORD_TO_GENERATE: Sets the lower bound for output length.
  • ORDER_K: (Advanced) Adjusts the Markov chain order (higher K = more coherent but less creative).

Project Structure

├── src/
│   ├── RandomWriter.cpp    # Core Markov Chain logic and generation loop
│   └── main.cpp            # Entry point and UI handling
├── res/                    # Training corpora (.txt files)
├── README.md
└── LICENSE

Future Improvements

  • Refactor to remove dependencies on educational libraries (Standard STL only).
  • Implement character-level generation for code or poetry modeling.
  • Add serialization to save trained models to disk for instant loading.

Author: Luka Aladashvili

About

A flexible C++ program that uses word-level Markov chains to generate coherent text based on input training files. It can be used to simulate writing styles, extend unfinished documents, or analyze text patterns using statistical modeling.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published