A NumPy-only implementation of the core training loop of word2vec using the skip-gram with negative sampling formulation.
The goal of this project is to implement the core optimization procedure of word2vec in pure NumPy, without using PyTorch, TensorFlow, or other machine-learning frameworks.
The implementation includes:
- text preprocessing
- vocabulary construction
- skip-gram pair generation (sketched after this list)
- embedding initialization
- negative sampling
- forward pass with dot-product scores
- loss computation
- parameter updates
- a simple multi-epoch training loop
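For illustration, here is a minimal sketch of how the preprocessing, vocabulary construction, and pair generation steps might look. The function names, the regex-based tokenizer, and the default window size are placeholders, not the project's actual API:

```python
import re

def tokenize(text):
    """Lowercase the text and split on runs of non-letter characters (a simplified scheme)."""
    return [t for t in re.split(r"[^a-z]+", text.lower()) if t]

def build_vocab(tokens):
    """Map each distinct token to an integer id, in order of first appearance."""
    word_to_id = {}
    for t in tokens:
        if t not in word_to_id:
            word_to_id[t] = len(word_to_id)
    return word_to_id

def generate_skipgram_pairs(token_ids, window_size=2):
    """Collect (center, context) id pairs from a symmetric window around each position."""
    pairs = []
    for i, center in enumerate(token_ids):
        lo = max(0, i - window_size)
        hi = min(len(token_ids), i + window_size + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, token_ids[j]))
    return pairs
```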
This project uses skip-gram with negative sampling.
For each training example:
- the center word is used as input
- the context word is treated as a positive target
- randomly sampled words are used as negative targets
The model learns word embeddings by increasing scores for positive pairs and decreasing scores for negative pairs.
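As a rough sketch of how a single (center, context) pair could be scored and updated under this objective (the variable names, learning rate, and epsilon below are illustrative, not the project's exact code):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(W_in, W_out, center, context, negatives, lr=0.025):
    """One negative-sampling update for a single (center, context) pair.

    W_in, W_out: (vocab_size, dim) input and output embedding matrices.
    negatives:   1-D array of sampled negative word ids.
    """
    v = W_in[center]              # input vector of the center word
    u_pos = W_out[context]        # output vector of the positive context word
    u_neg = W_out[negatives]      # output vectors of the negative samples

    pos_score = sigmoid(np.dot(u_pos, v))   # pushed toward 1
    neg_score = sigmoid(np.dot(u_neg, v))   # pushed toward 0

    # Loss: -log sigma(u_pos . v) - sum_k log sigma(-u_neg_k . v)
    eps = 1e-10
    loss = -np.log(pos_score + eps) - np.sum(np.log(1.0 - neg_score + eps))

    # Gradient-based updates for the vectors involved in this pair.
    grad_v = (pos_score - 1.0) * u_pos + np.dot(neg_score, u_neg)
    W_out[context] -= lr * (pos_score - 1.0) * v
    W_out[negatives] -= lr * neg_score[:, None] * v[None, :]  # duplicate negative ids are not accumulated
    W_in[center] -= lr * grad_v
    return loss
```

Raising the positive score and lowering the negative scores both reduce this loss, which is exactly the behaviour described above.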
The current implementation uses a small sample text corpus stored in:
data/sample.txt
This keeps the project simple and focused on the mechanics of the training loop itself.
- data/sample.txt: sample text corpus
- src/data_utils.py: text preprocessing and skip-gram pair generation
- src/model.py: embeddings, negative sampling, scores, loss, and training loop
- src/main.py: end-to-end demo run of the training pipeline
Run the end-to-end demo with:

```
python3 src/main.py
```

The pipeline currently performs the following steps:
- Read and tokenize text
- Build a vocabulary
- Encode tokens as integer ids
- Generate skip-gram training pairs
- Initialize input and output embeddings
- Compute positive and negative scores
- Compute skip-gram loss with negative sampling
- Apply one-step gradient-based updates
- Run training across multiple epochs (see the sketch after this list)
- Track average loss across epochs
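A minimal sketch of what such a multi-epoch loop with average-loss tracking might look like, reusing the hypothetical generate_skipgram_pairs and sgns_step helpers sketched above (the hyperparameter values are arbitrary, and the real src/model.py may be organized differently):

```python
import numpy as np

def train(token_ids, vocab_size, dim=50, epochs=5, window_size=2, num_neg=5, lr=0.025, seed=0):
    rng = np.random.default_rng(seed)
    # Small random init for input embeddings, zeros for output embeddings.
    W_in = (rng.random((vocab_size, dim)) - 0.5) / dim
    W_out = np.zeros((vocab_size, dim))

    pairs = generate_skipgram_pairs(token_ids, window_size)
    for epoch in range(epochs):
        total_loss = 0.0
        for center, context in pairs:
            # Uniform negative sampling, matching the current implementation.
            negatives = rng.integers(0, vocab_size, size=num_neg)
            total_loss += sgns_step(W_in, W_out, center, context, negatives, lr)
        print(f"epoch {epoch}: average loss = {total_loss / len(pairs):.4f}")
    return W_in, W_out
```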
The script prints:
- tokenized text
- vocabulary mapping
- example skip-gram pairs
- example positive and negative scores
- loss before and after one update step
- average training loss across epochs
A decreasing average loss across epochs indicates that the update rule moves the embeddings in the expected direction.
This is a compact educational implementation intended to demonstrate the core training mechanics of word2vec.
Current limitations:
- only a very small toy corpus is used
- negative sampling is uniform rather than frequency-based
- no batching is used
- no subsampling of frequent words is implemented
- no evaluation on downstream similarity tasks is included
Possible next steps include:
- using a larger text corpus
- implementing frequency-based negative sampling (see the sketch after this list)
- adding subsampling for frequent words
- supporting CBOW as an alternative formulation
- adding nearest-neighbor inspection for learned embeddings
- vectorizing more of the training loop for efficiency
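On the frequency-based negative sampling item above: the original word2vec draws negatives from the unigram distribution raised to the 3/4 power, so frequent words are sampled more often than under a uniform scheme but less often than their raw counts would suggest. A possible NumPy sketch, with illustrative names and counts:

```python
import numpy as np

def build_sampling_distribution(word_counts, power=0.75):
    """Smoothed unigram distribution for negative sampling (counts indexed by word id)."""
    probs = np.asarray(word_counts, dtype=np.float64) ** power
    return probs / probs.sum()

def sample_negatives(probs, num_neg, rng):
    # Draw negative word ids according to the smoothed unigram distribution.
    return rng.choice(len(probs), size=num_neg, p=probs)

# Example usage (hypothetical counts):
# probs = build_sampling_distribution([10, 3, 3, 1])
# negatives = sample_negatives(probs, 5, np.random.default_rng(0))
```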