A NumPy-only implementation of the core training loop of word2vec using the skip-gram with negative sampling formulation.
The goal of this project is to implement the core optimization procedure of word2vec in pure NumPy, without using PyTorch, TensorFlow, or other machine-learning frameworks.
The implementation includes:
- text preprocessing
- vocabulary construction
- skip-gram pair generation (sketched after this list)
- embedding initialization
- negative sampling
- forward pass with dot-product scores
- loss computation
- parameter updates
- a simple multi-epoch training loop
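For illustration, here is a minimal sketch of how the preprocessing, vocabulary construction, and pair generation steps might look. The function names, the regex-based tokenizer, and the default window size are placeholders, not the project's actual API:

```python
import re

def tokenize(text):
    """Lowercase the text and split on runs of non-letter characters (a simplified scheme)."""
    return [t for t in re.split(r"[^a-z]+", text.lower()) if t]

def build_vocab(tokens):
    """Map each distinct token to an integer id, in order of first appearance."""
    word_to_id = {}
    for t in tokens:
        if t not in word_to_id:
            word_to_id[t] = len(word_to_id)
    return word_to_id

def generate_skipgram_pairs(token_ids, window_size=2):
    """Collect (center, context) id pairs from a symmetric window around each position."""
    pairs = []
    for i, center in enumerate(token_ids):
        lo = max(0, i - window_size)
        hi = min(len(token_ids), i + window_size + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, token_ids[j]))
    return pairs
```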
This project uses skip-gram with negative sampling.
For each training example:
- the center word is used as input
- the context word is treated as a positive target
- randomly sampled words are used as negative targets
The model learns word embeddings by increasing scores for positive pairs and decreasing scores for negative pairs.
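As a rough sketch of how a single (center, context) pair could be scored and updated under this objective (the variable names, learning rate, and epsilon below are illustrative, not the project's exact code):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(W_in, W_out, center, context, negatives, lr=0.025):
    """One negative-sampling update for a single (center, context) pair.

    W_in, W_out: (vocab_size, dim) input and output embedding matrices.
    negatives:   1-D array of sampled negative word ids.
    """
    v = W_in[center]              # input vector of the center word
    u_pos = W_out[context]        # output vector of the positive context word
    u_neg = W_out[negatives]      # output vectors of the negative samples

    pos_score = sigmoid(np.dot(u_pos, v))   # pushed toward 1
    neg_score = sigmoid(np.dot(u_neg, v))   # pushed toward 0

    # Loss: -log sigma(u_pos . v) - sum_k log sigma(-u_neg_k . v)
    eps = 1e-10
    loss = -np.log(pos_score + eps) - np.sum(np.log(1.0 - neg_score + eps))

    # Gradient-based updates for the vectors involved in this pair.
    grad_v = (pos_score - 1.0) * u_pos + np.dot(neg_score, u_neg)
    W_out[context] -= lr * (pos_score - 1.0) * v
    W_out[negatives] -= lr * neg_score[:, None] * v[None, :]  # duplicate negative ids are not accumulated
    W_in[center] -= lr * grad_v
    return loss
```

Raising the positive score and lowering the negative scores both reduce this loss, which is exactly the behaviour described above.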
The current implementation uses a small sample text corpus stored in:
data/sample.txt
This keeps the project simple and focused on the mechanics of the training loop itself.
- data/sample.txt: sample text corpus
- src/data_utils.py: text preprocessing and skip-gram pair generation
- src/model.py: embeddings, negative sampling, scores, loss, and training loop
- src/main.py: end-to-end demo run of the training pipeline
Run the end-to-end demo with:

```
python3 src/main.py
```

The pipeline currently performs the following steps:
- Read and tokenize text
- Build a vocabulary
- Encode tokens as integer ids
- Generate skip-gram training pairs
- Initialize input and output embeddings
- Compute positive and negative scores
- Compute skip-gram loss with negative sampling
- Apply one-step gradient-based updates
- Run training across multiple epochs (see the sketch after this list)
- Track average loss across epochs
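A minimal sketch of what such a multi-epoch loop with average-loss tracking might look like, reusing the hypothetical generate_skipgram_pairs and sgns_step helpers sketched above (the hyperparameter values are arbitrary, and the real src/model.py may be organized differently):

```python
import numpy as np

def train(token_ids, vocab_size, dim=50, epochs=5, window_size=2, num_neg=5, lr=0.025, seed=0):
    rng = np.random.default_rng(seed)
    # Small random init for input embeddings, zeros for output embeddings.
    W_in = (rng.random((vocab_size, dim)) - 0.5) / dim
    W_out = np.zeros((vocab_size, dim))

    pairs = generate_skipgram_pairs(token_ids, window_size)
    for epoch in range(epochs):
        total_loss = 0.0
        for center, context in pairs:
            # Uniform negative sampling, matching the current implementation.
            negatives = rng.integers(0, vocab_size, size=num_neg)
            total_loss += sgns_step(W_in, W_out, center, context, negatives, lr)
        print(f"epoch {epoch}: average loss = {total_loss / len(pairs):.4f}")
    return W_in, W_out
```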
The script prints:
- tokenized text
- vocabulary mapping
- example skip-gram pairs
- example positive and negative scores
- loss before and after one update step
- average training loss across epochs
A decreasing average loss across epochs indicates that the update rule moves the embeddings in the expected direction.
This is a compact educational implementation intended to demonstrate the core training mechanics of word2vec.
Current limitations:
- only a very small toy corpus is used
- negative sampling is uniform rather than frequency-based
- no batching is used
- no subsampling of frequent words is implemented
- no evaluation on downstream similarity tasks is included
Possible next steps include:
- using a larger text corpus
- implementing frequency-based negative sampling (see the sketch after this list)
- adding subsampling for frequent words
- supporting CBOW as an alternative formulation
- adding nearest-neighbor inspection for learned embeddings
- vectorizing more of the training loop for efficiency
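On the frequency-based negative sampling item above: the original word2vec draws negatives from the unigram distribution raised to the 3/4 power, so frequent words are sampled more often than under a uniform scheme but less often than their raw counts would suggest. A possible NumPy sketch, with illustrative names and counts:

```python
import numpy as np

def build_sampling_distribution(word_counts, power=0.75):
    """Smoothed unigram distribution for negative sampling (counts indexed by word id)."""
    probs = np.asarray(word_counts, dtype=np.float64) ** power
    return probs / probs.sum()

def sample_negatives(probs, num_neg, rng):
    # Draw negative word ids according to the smoothed unigram distribution.
    return rng.choice(len(probs), size=num_neg, p=probs)

# Example usage (hypothetical counts):
# probs = build_sampling_distribution([10, 3, 3, 1])
# negatives = sample_negatives(probs, 5, np.random.default_rng(0))
```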