Khisto

Optimal Binning Histograms for Python

Khisto is a Python library for creating histograms using the Khiops optimal binning algorithm. Unlike standard histograms that use fixed-width bins or simple heuristics, Khisto automatically determines the optimal number of bins and their variable widths to best represent the underlying data distribution.

Features

Optimal Binning: Uses the MODL (Minimum Optimal Description Length) principle to find the best discretization.
Variable-Width Bins: Captures dense regions with fine bins and sparse regions with wider bins.
NumPy Compatible: Drop-in replacement for numpy.histogram.
Matplotlib Integration: khisto.matplotlib.hist works like plt.hist.
Minimal Dependencies: Only requires NumPy (matplotlib optional for plotting).

[Figures: example histograms for a standard Gaussian and a heavy-tailed Pareto distribution]

Reproducing The Example Distributions

The complete runnable script is available in scripts/generate_distribution_examples.py.

Run it from the repository root to regenerate both example distributions and the figure files used in this README:

python scripts/generate_distribution_examples.py

Installation

pip install khisto

With matplotlib support:

pip install "khisto[matplotlib]"

Quick Start

NumPy-like API

import numpy as np
from khisto import histogram

# Generate 10,000 samples from a standard Gaussian distribution.
data = np.random.normal(0, 1, 10000)

# Compute optimal histogram (drop-in replacement for np.histogram)
hist, bin_edges = histogram(data)

# With density normalization
density, bin_edges = histogram(data, density=True)

# Limit maximum number of bins
hist, bin_edges = histogram(data, max_bins=10)

# Specify range
hist, bin_edges = histogram(data, range=(-2, 2))

Using 10,000 samples keeps the adaptive refinement visible while remaining fast to compute.
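
The density=True output can be related to raw counts by hand. The sketch below assumes the NumPy convention that Khisto is described as mirroring, where each bin's density is its count divided by the total count times the bin width. The counts and edges are made-up illustrative values, not output from Khisto.

```python
# Illustrative values only (not produced by Khisto): three variable-width bins.
counts = [2, 6, 2]
bin_edges = [0.0, 1.0, 1.5, 4.0]

total = sum(counts)
widths = [right - left for left, right in zip(bin_edges, bin_edges[1:])]

# NumPy-style density: count / (total * width), so the bin areas sum to 1.
density = [c / (total * w) for c, w in zip(counts, widths)]
area = sum(d * w for d, w in zip(density, widths))

print(density)
print(area)
```

Because the bins have different widths, the narrow middle bin gets the highest density even though raw counts alone would not show how concentrated it is.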

Heavy-tailed example:

import numpy as np
import matplotlib.pyplot as plt
from khisto.matplotlib import hist

# Generate 10,000 samples from a Pareto distribution. np.random.pareto draws from the
# Lomax (Pareto II) form, so adding 1 gives a classical Pareto with minimum value 1,
# which suits log-log axes.
shape = 3
long_tail_data = np.random.pareto(shape, size=10000) + 1

# Plot an adaptive histogram on logarithmic axes.
n, bins, patches = hist(long_tail_data, density=True)
plt.xscale("log")
plt.yscale("log")
plt.show()
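
Log-log axes are a natural fit here because a Pareto density is a power law. As a small check using only the standard library, the sketch below evaluates the textbook Pareto density with minimum 1, f(x) = shape / x^(shape + 1), at two points and recovers a slope of -(shape + 1) on log-log axes. The helper name pareto_pdf is introduced here for illustration and is not part of Khisto.

```python
import math

def pareto_pdf(x, shape):
    """Density of a classical Pareto distribution with minimum value 1."""
    return shape / x ** (shape + 1) if x >= 1 else 0.0

shape = 3
x1, x2 = 2.0, 20.0

# Slope of log(density) against log(x) between the two points.
slope = (math.log(pareto_pdf(x2, shape)) - math.log(pareto_pdf(x1, shape))) / (
    math.log(x2) - math.log(x1)
)
print(slope)  # close to -(shape + 1)
```

A density histogram of such data should therefore fall roughly along a straight line when both axes are logarithmic.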

Matplotlib Integration

import numpy as np
import matplotlib.pyplot as plt
from khisto.matplotlib import hist

# Generate 10,000 samples from a standard Gaussian distribution.
data = np.random.normal(0, 1, 10000)

# Density is usually the most interpretable view with variable-width bins.
n, bins, patches = hist(data, density=True)
plt.xlabel('Value')
plt.ylabel('Density')
plt.show()

# Cumulative density follows matplotlib semantics.
n, bins, patches = hist(data, density=True, cumulative=True)
plt.ylabel('Cumulative probability')
plt.show()
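
For reference, matplotlib's documented behavior for density=True combined with cumulative=True is to accumulate each bin's probability mass (density times width), so the last value equals 1. A dependency-free sketch with made-up densities and edges, not output from Khisto:

```python
from itertools import accumulate

# Illustrative values only: densities of three variable-width bins.
density = [0.2, 1.2, 0.08]
bin_edges = [0.0, 1.0, 1.5, 4.0]

widths = [right - left for left, right in zip(bin_edges, bin_edges[1:])]

# Probability mass per bin, then a running total across bins.
mass = [d * w for d, w in zip(density, widths)]
cumulative = list(accumulate(mass))
print(cumulative)
```

Multiplying by the width matters with variable-width bins: summing the densities alone would not give a cumulative probability.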

How It Works

Khisto uses the Khiops optimal binning algorithm based on the MODL (Minimum Optimal Description Length) principle. Instead of using fixed-width bins like traditional histograms, it:

Analyzes the data distribution
Finds bin boundaries that minimize information loss
Creates variable-width bins that adapt to data density

This results in histograms that better represent the underlying distribution, with finer bins in dense regions and wider bins in sparse regions.
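
To make the benefit of density-adaptive bins concrete, the following sketch scores two binnings of the same skewed sample by the log-likelihood of the piecewise-constant density each one implies. This is not the Khiops algorithm: the helper names (bin_counts, log_likelihood) and the toy data are assumptions for illustration, and an MDL-style criterion would additionally penalize model complexity, which is omitted here.

```python
import math
from bisect import bisect_right

def bin_counts(data, edges):
    """Count values per bin; bins are half-open except the last, which is closed."""
    counts = [0] * (len(edges) - 1)
    for x in data:
        if edges[0] <= x <= edges[-1]:
            index = min(bisect_right(edges, x) - 1, len(counts) - 1)
            counts[index] += 1
    return counts

def log_likelihood(data, edges):
    """Log-likelihood of the data under the histogram density for these edges."""
    counts = bin_counts(data, edges)
    total = sum(counts)
    score = 0.0
    for count, left, right in zip(counts, edges, edges[1:]):
        if count > 0:
            score += count * math.log(count / (total * (right - left)))
    return score

# Toy sample: 90 points packed into [0, 1) and 10 points spread over [1, 10).
data = [i / 90 for i in range(90)] + [1 + 9 * i / 10 for i in range(10)]

fixed_ll = log_likelihood(data, [0.0, 5.0, 10.0])     # equal-width bins
adaptive_ll = log_likelihood(data, [0.0, 1.0, 10.0])  # edge placed at the density change
print(fixed_ll, adaptive_ll)
```

With the same number of bins, the edge placed where the density changes yields the higher log-likelihood, which is the intuition behind letting bin widths follow the data.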

The method implemented in Khiops is comprehensively detailed in [2] and further extended in [1].

[1] M. Boullé. Floating-point histograms for exploratory analysis of large scale real-world data sets. Intelligent Data Analysis, 28(5):1347-1394, 2024.
[2] V. Zelaya Mendizábal, M. Boullé, F. Rossi. Fast and fully-automated histograms for large-scale data sets. Computational Statistics & Data Analysis, 180, 2023.

Development

# Clone repository
git clone https://github.com/khiops/khisto-python.git
cd khisto-python

# Install with dev dependencies
uv sync --group dev --extra all

# Run tests
uv run pytest

Documentation

See the API and API Comparison pages for detailed information on available functions and parameters, and on how Khisto compares to standard histogram implementations.

License

BSD 3-Clause Clear License
