119 changes: 52 additions & 67 deletions README.md
@@ -12,6 +12,20 @@ Khisto is a Python library for creating histograms using the **Khiops optimal bi
- **Matplotlib Integration**: `khisto.matplotlib.hist` works like `plt.hist`.
- **Minimal Dependencies**: Only requires NumPy (matplotlib optional for plotting).

| Standard Gaussian | Heavy-tailed Pareto |
| --- | --- |
| ![Adaptive Gaussian histogram](docs/images/gaussian-quick-start.png) | ![Adaptive Pareto histogram](docs/images/pareto-quick-start.png) |

## Reproducing The Example Distributions

The complete runnable script is available in `scripts/generate_distribution_examples.py`.

Run it from the repository root to regenerate both example distributions and the figure files used in this README:

```bash
python scripts/generate_distribution_examples.py
```

## Installation

```bash
@@ -32,8 +46,8 @@ pip install "khisto[matplotlib]"
import numpy as np
from khisto import histogram

# Generate data
data = np.random.normal(0, 1, 1000)
# Generate 10,000 samples from a standard Gaussian distribution.
data = np.random.normal(0, 1, 10000)

# Compute optimal histogram (drop-in replacement for np.histogram)
hist, bin_edges = histogram(data)
@@ -48,86 +62,48 @@ hist, bin_edges = histogram(data, max_bins=10)
hist, bin_edges = histogram(data, range=(-2, 2))
```

### Matplotlib Integration
Using 10,000 samples keeps the adaptive refinement visible while remaining fast to compute.

Heavy-tailed example:

```python
import numpy as np
import matplotlib.pyplot as plt
from khisto.matplotlib import hist

data = np.random.normal(0, 1, 1000)
# Generate 10,000 samples from a Pareto distribution, shifted to start at 1 for better log-log visualization
shape = 3
long_tail_data = np.random.pareto(shape, size=10000) + 1

# Create optimal histogram plot
n, bins, patches = hist(data)
plt.show()

# With density
n, bins, patches = hist(data, density=True)
plt.xlabel('Value')
plt.ylabel('Density')
# Plot an adaptive histogram on logarithmic axes.
n, bins, patches = hist(long_tail_data, density=True)
plt.xscale("log")
plt.yscale("log")
plt.show()
```

## API Reference

### `khisto.histogram`
### Matplotlib Integration

```python
def histogram(
a: ArrayLike,
range: Optional[tuple[float, float]] = None,
max_bins: Optional[int] = None,
density: bool = False,
) -> tuple[ndarray, ndarray]
```

Compute an optimal histogram using the Khiops binning algorithm.

**Parameters:**
- `a`: Input data. The histogram is computed over the flattened array.
- `range`: The lower and upper range of the bins. Values outside are ignored.
- `max_bins`: Maximum number of bins. If None, the algorithm selects optimal.
- `density`: If True, return probability density. If False, return counts.
import numpy as np
import matplotlib.pyplot as plt
from khisto.matplotlib import hist

**Returns:**
- `hist`: The values of the histogram (counts or density).
- `bin_edges`: The bin edges (length = len(hist) + 1).
# Generate 10,000 samples from a standard Gaussian distribution.
data = np.random.normal(0, 1, 10000)

### `khisto.matplotlib.hist`
# Density is usually the most interpretable view with variable-width bins.
n, bins, patches = hist(data, density=True)
plt.xlabel('Value')
plt.ylabel('Density')
plt.show()

```python
def hist(
x: ArrayLike,
range: Optional[tuple[float, float]] = None,
max_bins: Optional[int] = None,
density: bool = False,
histtype: str = "bar",
orientation: Literal["vertical", "horizontal"] = "vertical",
log: bool = False,
color: Optional[str] = None,
label: Optional[str] = None,
ax: Optional[Axes] = None,
**kwargs,
) -> tuple[ndarray, ndarray, Any]
# Cumulative density follows matplotlib semantics.
n, bins, patches = hist(data, density=True, cumulative=True)
plt.ylabel('Cumulative probability')
plt.show()
```
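
The remark that density is the most interpretable view with variable-width bins can be checked with plain NumPy. The sketch below uses made-up bin edges and counts (hypothetical values, not Khisto output) and assumes the usual definition of density as count divided by total count times bin width:

```python
import numpy as np

# Hypothetical variable-width bins and counts, for illustration only.
bin_edges = np.array([-3.0, -1.0, -0.5, 0.0, 0.5, 1.0, 3.0])
counts = np.array([160, 150, 190, 190, 150, 160])

widths = np.diff(bin_edges)

# Density normalizes each count by (total count * bin width),
# so the bar areas sum to 1 regardless of how wide each bin is.
density = counts / (counts.sum() * widths)

print(density)
print(np.sum(density * widths))
```

The wide tail bin holds more points than its narrow neighbor (160 versus 150), yet its density is lower, which is why raw counts can mislead when bin widths differ.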

Plot an optimal histogram using matplotlib.

**Parameters:**
- `x`: Input data.
- `range`: The lower and upper range of the bins.
- `max_bins`: Maximum number of bins.
- `density`: If True, plot probability density.
- `histtype`: Type of histogram (`"bar"`, `"step"`, `"stepfilled"`).
- `orientation`: `"vertical"` or `"horizontal"`.
- `log`: If True, set log scale on the value axis.
- `ax`: Matplotlib axes to plot on.

**Returns:**
- `n`: The histogram values.
- `bins`: The bin edges.
- `patches`: The matplotlib patches.

## How It Works

Khisto uses the Khiops optimal binning algorithm based on the MODL (Minimum Optimal Description Length) principle. Instead of using fixed-width bins like traditional histograms, it:
@@ -138,6 +114,11 @@ Khisto uses the Khiops optimal binning algorithm based on the MODL (Minimum Opti

This results in histograms that better represent the underlying distribution, with finer bins in dense regions and wider bins in sparse regions.

The method implemented in Khiops is comprehensively detailed in [2] and further extended in [1].

- [1] M. Boullé. Floating-point histograms for exploratory analysis of large scale real-world data sets. Intelligent Data Analysis, 28(5):1347-1394, 2024
- [2] V. Zelaya Mendizábal, M. Boullé, F. Rossi. Fast and fully-automated histograms for large-scale data sets. Computational Statistics & Data Analysis, 180:0-0, 2023
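
To get a feel for "finer bins in dense regions and wider bins in sparse regions", the sketch below builds equal-frequency (quantile) bins with NumPy. This is a deliberately simple stand-in and is not the MODL criterion used by Khiops; the bin count of 20 is an arbitrary choice for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(0.0, 1.0, 10_000)

# Equal-frequency edges: a simple stand-in, NOT the Khiops/MODL algorithm.
n_bins = 20
edges = np.quantile(data, np.linspace(0.0, 1.0, n_bins + 1))
widths = np.diff(edges)

# Each bin holds roughly the same number of points, so bins are
# narrow near the mode and wide in the tails.
counts, _ = np.histogram(data, bins=edges)

print("central bin width:", widths[n_bins // 2])
print("leftmost bin width:", widths[0])
```

Unlike this stand-in, Khisto also chooses the number of bins automatically, as described above.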

## Development

```bash
@@ -146,12 +127,16 @@ git clone https://github.com/khiops/khisto-python.git
cd khisto-python

# Install with dev dependencies
pip install -e ".[matplotlib]"
uv sync --group dev --extra all

# Run tests
pytest
uv run pytest
```

## Documentation

See the [API](docs/API.md) and [API Comparison](docs/API_COMPARISON.md) for detailed information on available functions, parameters, and how Khisto compares to standard histogram implementations.

## License

[BSD 3-Clause Clear License](LICENSE)
79 changes: 48 additions & 31 deletions docs/API.md
@@ -11,6 +11,7 @@ Complete API reference for the Khisto library.
- [HistogramResult](#histogramresult)
- [Matplotlib API](#matplotlib-api)
- [hist](#hist)
- [How It Works](#how-it-works)

---

@@ -23,7 +24,7 @@ khisto.histogram(
a: ArrayLike,
range: Optional[tuple[float, float]] = None,
max_bins: Optional[int] = None,
density: Optional[bool] = None,
density: bool = False,
) -> tuple[NDArray[np.floating], NDArray[np.floating]]
```

@@ -33,10 +34,10 @@ Compute an optimal histogram using the Khiops binning algorithm.

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `a` | `ArrayLike` | required | Input data. The histogram is computed over the flattened array. |
| `a` | `ArrayLike` | required | Input data. The input is converted to a floating-point array and flattened to one dimension. |
| `range` | `tuple[float, float]` | `None` | Lower and upper range of the bins. Values outside are ignored. |
| `max_bins` | `int` | `None` | Maximum number of bins. If not provided, the optimal number is determined automatically. |
| `density` | `bool` | `None` | If `False` or `None`, return counts; if `True`, return probability density values. |
| `density` | `bool` | `False` | If `False`, return counts; if `True`, return probability density values. |

#### Returns

@@ -47,7 +48,7 @@ Compute an optimal histogram using the Khiops binning algorithm.

#### See Also

- [`numpy.histogram`](https://numpy.org/doc/stable/reference/generated/numpy.histogram.html) — NumPy's histogram function (`bins` and `weights` parameters are not supported).
- [`numpy.histogram`](https://numpy.org/doc/stable/reference/generated/numpy.histogram.html) — NumPy's histogram function (`bins` and `weights` are not supported in Khisto).

#### Examples

@@ -81,6 +82,14 @@ hist, bin_edges = histogram(data, max_bins=5)
print(f"Number of bins: {len(hist)}") # <= 5
```

Concatenating nested inputs into a single dataset:

```python
data = [np.array([0.0, 1.0]), np.array([2.0, 3.0, 4.0])]
hist, bin_edges = histogram(data)
print(hist.sum()) # 5
```
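
As a rough illustration of what "concatenating nested inputs" amounts to, the NumPy sketch below flattens a ragged sequence into one float array. `flatten_nested` is a hypothetical helper written for this example; it is an assumption about the behavior described above, not Khisto's actual implementation.

```python
import numpy as np


def flatten_nested(data):
    """Hypothetical helper: join ragged array-likes into one 1-D float array."""
    return np.concatenate([np.asarray(part, dtype=float).ravel() for part in data])


data = [np.array([0.0, 1.0]), np.array([2.0, 3.0, 4.0])]
flat = flatten_nested(data)

print(flat)       # all five values in a single 1-D array
print(flat.size)  # 5
```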

---

## Core API
@@ -176,7 +185,8 @@ import numpy as np
from khisto.core import compute_histogram

data = np.random.normal(0, 1, 1000)
result = compute_histogram(data)
results = compute_histogram(data)
result = next(r for r in results if r.is_best)

# Access bin information
print(f"Bin edges: {result.bin_edges}")
@@ -204,8 +214,8 @@ khisto.matplotlib.hist(
x: ArrayLike,
range: Optional[tuple[float, float]] = None,
max_bins: Optional[int] = None,
density: bool = True,
ax: Optional[Axes] = None,
density: bool = False,
cumulative: bool | float = False,
**kwargs,
) -> tuple[NDArray[np.floating], NDArray[np.floating], Any]
```
@@ -216,10 +226,12 @@ Compute and plot an optimal histogram.

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `x` | `ArrayLike` | required | Input data. The histogram is computed over the flattened array. |
| `x` | `ArrayLike` | required | Input data, or a sequence of array-like objects. Nested inputs are concatenated and histogrammed as one dataset. |
| `max_bins` | `int` | `None` | Maximum number of bins. If `None`, uses optimal binning. |
| `density` | `bool` | `False` | If `True`, return and plot probability densities. If `False`, return counts. |
| `cumulative` | `bool or float` | `False` | Cumulative mode, following `matplotlib.pyplot.hist`. Negative values accumulate in reverse order. |

Other parameters are passed to matplotlib. See [`matplotlib.pyplot.hist`](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.hist.html) for styling options.
Other parameters are passed to matplotlib for styling. `ax` can be provided to draw on a specific axes. The `bins`, `weights`, and `stacked` arguments are not supported.

#### Returns

@@ -231,8 +243,8 @@ Other parameters are passed to matplotlib. See [`matplotlib.pyplot.hist`](https:

#### See Also

- [`matplotlib.pyplot.hist`](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.hist.html) — Matplotlib's histogram function (`bins`, `weights`, `cumulative`, and stacked/multiple dataset features are not supported).
- [`khisto.histogram`](#histogram) — Underlying histogram computation.
- [`matplotlib.pyplot.hist`](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.hist.html) — Matplotlib's histogram function.
- [`khisto.histogram`](#histogram) — Underlying non-cumulative histogram computation.

#### Examples

@@ -243,47 +255,52 @@ import numpy as np
import matplotlib.pyplot as plt
from khisto.matplotlib import hist

data = np.random.normal(0, 1, 1000)
data = np.random.normal(0, 1, 10000)

# Default is density=True
n, bins, patches = hist(data)
# Density is usually the clearest view with variable-width bins.
n, bins, patches = hist(data, density=True)
plt.xlabel('Value')
plt.ylabel('Density')
plt.title('Optimal Histogram')
plt.show()
```

Frequency plot:
Cumulative density:

```python
n, bins, patches = hist(data, density=False)
plt.xlabel('Value')
plt.ylabel('Count')
n, bins, patches = hist(data, density=True, cumulative=True)
plt.ylabel('Cumulative probability')
plt.show()
```
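
A small NumPy sketch of what a cumulative density means when bins have different widths, assuming the matplotlib convention that each bin contributes its density times its width. The bin edges and densities are made-up values, not Khisto output; the reversed variant mirrors what a negative `cumulative` value is described as doing.

```python
import numpy as np

# Hypothetical variable-width bins with a valid density (areas sum to 1).
bin_edges = np.array([0.0, 1.0, 1.5, 2.0, 4.0])
density = np.array([0.2, 0.6, 0.8, 0.05])

widths = np.diff(bin_edges)
mass = density * widths  # probability mass per bin

# Forward accumulation: the last value reaches 1.
forward = np.cumsum(mass)

# Reverse accumulation: the first value is 1.
reverse = np.cumsum(mass[::-1])[::-1]

print(forward)  # approximately [0.2, 0.5, 0.9, 1.0]
print(reverse)  # approximately [1.0, 0.8, 0.5, 0.1]
```

Accumulating the densities without multiplying by the widths would not end at 1, which is why the bin width enters the sum.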

Step histogram:
Heavy-tailed Pareto example:

```python
n, bins, patches = hist(data, histtype='step', color='blue', label='Data')
plt.legend()
shape = 3
long_tail_data = np.random.pareto(shape, size=10000) + 1

n, bins, patches = hist(long_tail_data, density=True)
plt.xscale('log')
plt.yscale('log')
plt.show()
```
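
Why logarithmic axes suit this example: `np.random.pareto(shape)` draws Lomax (Pareto II) samples, so adding 1 gives a classical Pareto distribution with minimum 1 and density `shape / x**(shape + 1)`. On log-log axes that density is a straight line with slope `-(shape + 1)`. The sketch below checks the slope numerically; the grid of `x` values is an arbitrary choice.

```python
import numpy as np

shape = 3

# Theoretical density of a classical Pareto with minimum 1.
x = np.logspace(0, 2, 50)
pdf = shape / x ** (shape + 1)

# Fit a line in log-log space; the slope should be -(shape + 1).
slope, intercept = np.polyfit(np.log10(x), np.log10(pdf), 1)

print("slope:", slope)
print("expected:", -(shape + 1))
```

An adaptive histogram of `long_tail_data` plotted with `density=True` should roughly follow this line, with some noise in the far tail where few samples fall.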

Using specific axes:
---

```python
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
## How It Works

hist(data, ax=ax1)
ax1.set_title('Counts')
Khisto uses the Khiops optimal binning algorithm based on the MODL (Minimum Optimal Description Length) principle. Instead of using fixed-width bins like traditional histograms, it:

hist(data, density=True, ax=ax2)
ax2.set_title('Density')
1. Analyzes the data distribution
2. Finds bin boundaries that minimize information loss
3. Creates variable-width bins that adapt to data density

plt.tight_layout()
plt.show()
```
This results in histograms that better represent the underlying distribution, with finer bins in dense regions and wider bins in sparse regions.

The method implemented in Khiops is comprehensively detailed in [2] and further extended in [1].

- [1] M. Boullé. Floating-point histograms for exploratory analysis of large scale real-world data sets. Intelligent Data Analysis, 28(5):1347-1394, 2024
- [2] V. Zelaya Mendizábal, M. Boullé, F. Rossi. Fast and fully-automated histograms for large-scale data sets. Computational Statistics & Data Analysis, 180:0-0, 2023

---

13 changes: 9 additions & 4 deletions docs/API_COMPARISON.md
@@ -114,6 +114,7 @@ khisto.matplotlib.hist(
range=None,
max_bins=None,
density=False,
cumulative=False,
histtype='bar',
orientation='vertical',
log=False,
@@ -130,11 +131,11 @@ khisto.matplotlib.hist(
|---------|------------|--------|
| **Binning** | Fixed-width | Optimal variable-width |
| **Bins param** | `bins` | `max_bins` |
| **Axes param** | Implicit (current) | Explicit `ax` parameter |
| **Cumulative** | Supported | Not supported |
| **Axes param** | Implicit (current) | Optional `ax` parameter |
| **Cumulative** | Supported | Supported |
| **Stacked** | Supported | Not supported |
| **Weights** | Supported | Not supported |
| **Multiple datasets** | Supported | Single dataset only |
| **Multiple datasets** | Supported | Sequences are concatenated into one dataset |

#### Usage Comparison

@@ -167,6 +168,10 @@ plt.show()
plt.hist(data, density=True)
hist(data, density=True)

# cumulative view
plt.hist(data, density=True, cumulative=True)
hist(data, density=True, cumulative=True)

# histogram type
plt.hist(data, histtype='step')
hist(data, histtype='step')
@@ -224,7 +229,7 @@ n, bins, patches = hist(data, max_bins=30) # max_bins is optional
| Density | ✓ | ✓ | ✓ |
| Range | ✓ | ✓ | ✓ |
| Weights | ✓ | ✓ | ✗ |
| Cumulative | ✗ | ✓ | ✗ |
| Cumulative | ✗ | ✓ | ✓ |
| Plotting | ✗ | ✓ | ✓ |
| Step histogram | ✗ | ✓ | ✓ |
| Horizontal | ✗ | ✓ | ✓ |
Binary file added docs/images/gaussian-quick-start.png
Binary file added docs/images/pareto-quick-start.png