Optimal Binning Histograms for Python
Khisto is a Python library for creating histograms using the Khiops optimal binning algorithm. Unlike standard histograms that use fixed-width bins or simple heuristics, Khisto automatically determines the optimal number of bins and their variable widths to best represent the underlying data distribution.
- Optimal Binning: Uses the MODL (Minimum Optimized Description Length) principle to find the best discretization.
- Variable-Width Bins: Captures dense regions with fine bins and sparse regions with wider bins.
- NumPy Compatible: Drop-in replacement for `numpy.histogram`.
- Matplotlib Integration: `khisto.matplotlib.hist` works like `plt.hist`.
- Minimal Dependencies: Only requires NumPy (matplotlib optional for plotting).
| Standard Gaussian | Heavy-tailed Pareto |
|---|---|
| ![Standard Gaussian example]() | ![Heavy-tailed Pareto example]() |
The complete runnable script is available in scripts/generate_distribution_examples.py.
Run it from the repository root to regenerate both example distributions and the figure files used in this README:
```bash
python scripts/generate_distribution_examples.py
```

Install Khisto with pip:

```bash
pip install khisto
```

With matplotlib support:

```bash
pip install "khisto[matplotlib]"
```

Basic usage:

```python
import numpy as np
from khisto import histogram

# Generate 10,000 samples from a standard Gaussian distribution.
data = np.random.normal(0, 1, 10000)

# Compute optimal histogram (drop-in replacement for np.histogram)
hist, bin_edges = histogram(data)

# With density normalization
density, bin_edges = histogram(data, density=True)

# Limit maximum number of bins
hist, bin_edges = histogram(data, max_bins=10)

# Specify range
hist, bin_edges = histogram(data, range=(-2, 2))
```

Using 10,000 samples keeps the adaptive refinement visible while remaining fast to compute.
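Density normalization matters more with variable-width bins than with fixed-width ones, because raw counts are no longer comparable across bins of different widths. The sketch below shows the arithmetic with the standard library only: each density value is `count / (total * width)`, the same convention `numpy.histogram` uses for `density=True`. The edges, the data, and the helper names `bin_counts` and `to_density` are made up for illustration and are not part of Khisto.

```python
from bisect import bisect_right


def bin_counts(data, edges):
    """Count values per bin; bins are half-open except the last, which is closed."""
    counts = [0] * (len(edges) - 1)
    for x in data:
        if edges[0] <= x <= edges[-1]:
            # bisect_right - 1 is the bin index; clamp so x == edges[-1] lands in the last bin.
            i = min(bisect_right(edges, x) - 1, len(counts) - 1)
            counts[i] += 1
    return counts


def to_density(counts, edges):
    """Density per bin: count / (total count * bin width)."""
    total = sum(counts)
    widths = [hi - lo for lo, hi in zip(edges, edges[1:])]
    return [c / (total * w) for c, w in zip(counts, widths)]


# Hypothetical variable-width edges and toy data, chosen by hand for illustration.
edges = [0.0, 0.5, 1.0, 2.0, 6.0]
data = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.9, 1.5, 1.8, 4.0]

counts = bin_counts(data, edges)
density = to_density(counts, edges)

# The area under the density histogram (sum of density * width) is 1.
widths = [hi - lo for lo, hi in zip(edges, edges[1:])]
area = sum(d * w for d, w in zip(density, widths))

print(counts)  # [4, 3, 2, 1]
print(density)
print(area)
```

The counts differ by a factor of 4 between the first and last bin, but the last bin is 8 times wider, so the densities differ by a factor of 32. This is why a density view is usually easier to read than raw counts when bin widths vary.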
Heavy-tailed example:
```python
import numpy as np
import matplotlib.pyplot as plt
from khisto.matplotlib import hist

# Generate 10,000 samples from a Pareto distribution, shifted to start at 1 for better log-log visualization
shape = 3
long_tail_data = np.random.pareto(shape, size=10000) + 1

# Plot an adaptive histogram on logarithmic axes.
n, bins, patches = hist(long_tail_data, density=True)
plt.xscale("log")
plt.yscale("log")
plt.show()
```

Gaussian example with the matplotlib interface:

```python
import numpy as np
import matplotlib.pyplot as plt
from khisto.matplotlib import hist

# Generate 10,000 samples from a standard Gaussian distribution.
data = np.random.normal(0, 1, 10000)

# Density is usually the most interpretable view with variable-width bins.
n, bins, patches = hist(data, density=True)
plt.xlabel('Value')
plt.ylabel('Density')
plt.show()

# Cumulative density follows matplotlib semantics.
n, bins, patches = hist(data, density=True, cumulative=True)
plt.ylabel('Cumulative probability')
plt.show()
```

Khisto uses the Khiops optimal binning algorithm based on the MODL (Minimum Optimized Description Length) principle. Instead of using fixed-width bins like traditional histograms, it:
- Analyzes the data distribution
- Finds bin boundaries that minimize information loss
- Creates variable-width bins that adapt to data density
This results in histograms that better represent the underlying distribution, with finer bins in dense regions and wider bins in sparse regions.
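To give a rough feel for why adaptive bin widths can represent a distribution better, the sketch below compares two binnings of the same heavy-tailed sample by the log-likelihood of the piecewise-constant density each one defines. This is a toy illustration, not the Khiops/MODL criterion: equal-frequency edges stand in for an adaptive binning, the number of bins is fixed by hand, and the model-complexity cost that an MDL-style criterion also charges is omitted. All function names here are hypothetical.

```python
import math
import random
from bisect import bisect_right


def log_likelihood(data, edges):
    """Log-likelihood of data under the histogram density defined by edges."""
    n = len(data)
    counts = [0] * (len(edges) - 1)
    for x in data:
        # Clamp the index so values on the outer edges fall into the first/last bin.
        i = min(max(bisect_right(edges, x) - 1, 0), len(counts) - 1)
        counts[i] += 1
    total = 0.0
    for c, lo, hi in zip(counts, edges, edges[1:]):
        if c > 0:  # empty bins contain no points and contribute nothing
            total += c * math.log(c / (n * (hi - lo)))
    return total


def equal_width_edges(data, k):
    """k bins of identical width spanning the data range."""
    lo, hi = min(data), max(data)
    return [lo + (hi - lo) * i / k for i in range(k)] + [hi]


def equal_frequency_edges(data, k):
    """k bins holding roughly the same number of points (a stand-in for adaptive bins)."""
    s = sorted(data)
    n = len(s)
    return [s[0]] + [s[n * i // k] for i in range(1, k)] + [s[-1]]


# Pareto sample with shape 3 and values >= 1, seeded for reproducibility.
rng = random.Random(0)
data = [rng.paretovariate(3) for _ in range(10_000)]

k = 10
ll_fixed = log_likelihood(data, equal_width_edges(data, k))
ll_adaptive = log_likelihood(data, equal_frequency_edges(data, k))

print(f"fixed-width log-likelihood:    {ll_fixed:.0f}")
print(f"variable-width log-likelihood: {ll_adaptive:.0f}")
```

With the same number of bins, the variable-width binning reaches a higher log-likelihood on this sample, because it spends its bins where the data are dense instead of spreading them evenly over a long, nearly empty tail. Choosing the number of bins and the exact boundaries automatically is the part that the method described in the references below handles.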
The method implemented in Khiops is comprehensively detailed in [2] and further extended in [1].
- [1] M. Boullé. Floating-point histograms for exploratory analysis of large scale real-world data sets. Intelligent Data Analysis, 28(5):1347-1394, 2024
- [2] V. Zelaya Mendizábal, M. Boullé, F. Rossi. Fast and fully-automated histograms for large-scale data sets. Computational Statistics & Data Analysis, 180, 2023
```bash
# Clone repository
git clone https://github.com/khiops/khisto-python.git
cd khisto-python

# Install with dev dependencies
uv sync --group dev --extra all

# Run tests
uv run pytest
```

See the API and API Comparison for detailed information on available functions, parameters, and how Khisto compares to standard histogram implementations.

