254 changes: 160 additions & 94 deletions README.md
<h1 align="center">FISBe: A real-world benchmark dataset for instance segmentation of long-range thin filamentous structures
</h1>


![FISBe cover image](assets/Cover_Image.png)

## About

⚠️ *Currently under construction.*

This is the official implementation of the **FISBe (FlyLight Instance Segmentation Benchmark)** evaluation pipeline. FISBe is the first publicly available multi-neuron light microscopy dataset with pixel-wise annotations.

**Download the dataset:** [https://kainmueller-lab.github.io/fisbe/](https://kainmueller-lab.github.io/fisbe/)

The benchmark supports 2D and 3D segmentations and computes a wide range of commonly used evaluation metrics (e.g., AP, F1, coverage, precision, recall). Crucially, it provides specialized error attribution for topological errors (False Merges, False Splits) relevant to filamentous structures.

### Features
- **Standard Metrics:** AP, F1, Precision, Recall.
- **FISBe Metrics:** Greedy many-to-many matching for False Merges (FM) and False Splits (FS).
- **Flexibility:** Supports HDF5 (`.hdf`, `.h5`) and Zarr (`.zarr`) files (see the loader sketch below).
- **Modes:** Run on single files, entire folders, or in stability analysis mode.
- **Partly Labeled Data:** Robust evaluation of sparse ground truth that ignores unmatched predictions best explained by background.
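File-format handling can be illustrated with a small loader sketch; `load_labels` is a hypothetical helper for illustration, not part of the package API:

```python
import h5py
import zarr

def load_labels(path: str, key: str):
    """Hypothetical loader sketch; not part of the evalinstseg API."""
    if path.endswith((".hdf", ".h5")):
        with h5py.File(path, "r") as f:
            return f[key][()]  # read the full dataset into memory
    if path.endswith(".zarr"):
        return zarr.open(path, mode="r")[key][:]
    raise ValueError(f"Unsupported file type: {path}")
```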

---

## Installation

The recommended way to install is using `uv` (fastest) or `micromamba`.

### Option 1: Using `uv` (Fastest)

```bash
# 1. Install uv (if not installed)
pip install uv

# 2. Clone and install
git clone https://github.com/Kainmueller-Lab/evaluate-instance-segmentation.git
cd evaluate-instance-segmentation
uv venv
uv pip install -e .
```

### Option 2: Using `micromamba` or `conda`

```bash
micromamba create -n evalinstseg python=3.10
micromamba activate evalinstseg

git clone https://github.com/Kainmueller-Lab/evaluate-instance-segmentation.git
cd evaluate-instance-segmentation
pip install -e .
```

## Usage: Command Line (CLI)
The `evalinstseg` command is automatically available after installation.

### 1. Evaluate a Single File
```bash
evalinstseg \
  --res_file tests/pred/R14A02-20180905.hdf \
  --res_key volumes/labels \
  --gt_file tests/gt/R14A02-20180905.zarr \
  --gt_key volumes/gt_instances \
  --out_dir tests/results \
  --app flylight
```

### 2. Evaluate an Entire Folder
If you provide a directory path to `--res_file`, the tool will look for matching Ground Truth files in the `--gt_file` folder. Files are matched by name.

```bash
evalinstseg \
  --res_file /path/to/predictions_folder \
  --res_key volumes/labels \
  --gt_file /path/to/ground_truth_folder \
  --gt_key volumes/gt_instances \
  --out_dir /path/to/output_folder \
  --app flylight
```

### 3. Stability & Robustness Mode
Compute the **Mean ± Std** of metrics across exactly 3 different training runs (e.g., different random seeds).

```bash
evalinstseg \
  --stability_mode \
  --run_dirs experiments/seed1 experiments/seed2 experiments/seed3 \
  --gt_file data/ground_truth_folder \
  --out_dir results/stability_report \
  --app flylight
```

**Requirements:**
- `--run_dirs`: Provide exactly 3 run folders.
- `--gt_file`: The folder containing ground truth files (filenames must match predictions).
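Conceptually, the stability report aggregates each metric across the three runs; a minimal sketch of that aggregation, with made-up values:

```python
import numpy as np

# e.g. avAP collected from the three run reports (made-up values)
run_metrics = [0.61, 0.58, 0.64]
print(f"avAP: {np.mean(run_metrics):.3f} ± {np.std(run_metrics):.3f}")
```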

### 4. Configuration & Output
By setting `--app flylight`, the pipeline automatically uses the default FlyLight benchmark configuration. You can also define custom configurations, including:
- localization criterion
- assignment strategy
- metric subsets

**Output:** evaluation metrics are written to a TOML file and returned as a dict.
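Since results are written as TOML, they can be read back with the standard library (Python 3.11+); the file path below is hypothetical and depends on the evaluated file and your `--out_dir`:

```python
import tomllib  # Python 3.11+ standard library

# hypothetical output path; the actual name depends on the evaluated file
with open("tests/results/R14A02-20180905.toml", "rb") as f:
    metrics = tomllib.load(f)
print(metrics.keys())
```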


### 5. Partly Labeled Data (`--partly`)
Some samples contain sparse, incomplete GT annotations. In this setting, counting all unmatched predictions as false positives is not meaningful, so use the `--partly` flag.

When `--partly` is enabled, FP is approximated by counting only **unmatched predictions whose best match is a foreground GT instance** (based on the localization matrix used for evaluation, e.g. clPrecision for `cldice`). Unmatched predictions whose best match is **background** are ignored.

Concretely, for each unmatched prediction we take the index of the GT label with the maximal overlap score; the prediction is counted as FP only if that index is > 0 (foreground), not 0 (background).
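A minimal sketch of this rule, assuming a localization matrix `loc_mat` of shape `(num_gt + 1, num_pred + 1)` whose row 0 is background (hypothetical helper, not the package's internal code):

```python
import numpy as np

def count_fp_partly(loc_mat, matched_pred_ids):
    """Count FP among unmatched predictions, ignoring background matches."""
    fp = 0
    num_pred = loc_mat.shape[1] - 1
    for pred_id in range(1, num_pred + 1):
        if pred_id in matched_pred_ids:
            continue  # matched predictions are not FP candidates
        # index of the GT label with maximal overlap score
        best_gt = int(np.argmax(loc_mat[:, pred_id]))
        if best_gt > 0:  # best match is a foreground GT instance
            fp += 1
        # best_gt == 0 (background) is ignored under --partly
    return fp
```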

## Usage: Python Package
You can integrate the benchmark directly into your Python scripts or notebooks.

### Evaluate a File
```python
from evalinstseg import evaluate_file

# Run evaluation
metrics = evaluate_file(
    res_file="tests/pred/sample_01.hdf",
    gt_file="tests/gt/sample_01.zarr",
    res_key="volumes/labels",
    gt_key="volumes/gt_instances",
    out_dir="output_folder",
    ndim=3,
    app="flylight",  # applies the default FISBe config
    partly=False,    # set True for sparse GT
)

# Access metrics directly
print("AP:", metrics['confusion_matrix']['avAP'])
print("False Merges:", metrics['general']['FM'])
```

### Evaluate Raw Numpy Arrays
If you already have the arrays loaded in memory:

```python
import numpy as np
from evalinstseg import evaluate_volume

pred_array = np.load(...) # Shape: (Z, Y, X)
gt_array = np.load(...)

metrics = evaluate_volume(
    gt_labels=gt_array,
    pred_labels=pred_array,
    ndim=3,
    outFn="output_path_prefix",
    localization_criterion="cldice",  # or "iou"
    assignment_strategy="greedy",
    add_general_metrics=["false_merge", "false_split"],
)
```
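For a quick self-contained test, you can replace the `np.load(...)` calls with synthetic label volumes (purely illustrative shapes and labels):

```python
# Two thin GT filaments; the prediction splits the first one into two fragments.
gt_array = np.zeros((8, 64, 64), dtype=np.uint16)
gt_array[:, 10:12, 5:60] = 1
gt_array[:, 30:32, 5:60] = 2

pred_array = np.zeros_like(gt_array)
pred_array[:, 10:12, 5:30] = 1
pred_array[:, 10:12, 32:60] = 2
pred_array[:, 30:32, 5:60] = 3
```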

---

## Metrics Explanation

### 1. Standard Instance Metrics (TP/FP/FN, F-score, AP proxy)
These metrics are computed from a **one-to-one matching** between GT and prediction instances (Hungarian or greedy), using a chosen localization criterion (default for FlyLight is `cldice`).

- **TP**: matched pairs above threshold
- **FP**: unmatched predictions (or, in `--partly`, only those whose best match is foreground)
- **FN**: unmatched GT instances
- **precision** = TP / (TP + FP)
- **recall** = TP / (TP + FN)
- **fscore** = 2 * precision * recall / (precision + recall)
- **AP**: we report a simple AP proxy `precision × recall` at each threshold and average it across thresholds (this is not COCO-style AP); a minimal computation is sketched below.
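As a sketch, the per-threshold quantities reduce to the following (hypothetical helper `threshold_metrics`, mirroring the formulas above):

```python
def threshold_metrics(tp: int, fp: int, fn: int) -> dict:
    """Per-threshold summary from matched counts (illustrative only)."""
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    fscore = (2 * precision * recall / (precision + recall)
              if precision + recall > 0 else 0.0)
    # simple AP proxy, not COCO-style AP
    return {"precision": precision, "recall": recall,
            "fscore": fscore, "AP": precision * recall}
```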

### 2. FISBe Error Attribution (False Splits / False Merges)
False splits (FS) and false merges (FM) aim to quantify **instance topology errors** for long-range thin filamentous structures.

We compute FS/FM using **greedy many-to-many matching with consumption**:
- Candidate GT–Pred pairs above threshold are processed in descending score order.
- After selecting a match, we update “available” pixels so that already explained structure is not matched again.
- FS counts when one GT is explained by multiple preds (excess preds per GT).
- FM counts when one pred explains multiple GTs (excess GTs per pred).

This produces an explicit attribution of split/merge errors rather than only TP/FP/FN.
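Given a many-to-many matches dict `{gt_id: [pred_id, ...]}`, the FS/FM counts reduce to the following (a simplified sketch of the `compute_m2m_stats` helper in `evalinstseg/compute.py`, shown further below):

```python
from collections import Counter

def count_fs_fm(matches):
    # FS: excess predictions per GT instance
    fs = sum(max(0, len(preds) - 1) for preds in matches.values())
    # FM: excess GT instances per prediction
    pred_uses = Counter(p for preds in matches.values() for p in preds)
    fm = sum(max(0, n - 1) for n in pred_uses.values())
    return fs, fm
```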

### Metric Definitions

#### Instance-Level (per threshold, `confusion_matrix.th_*`)
| Metric | Description |
| :--- | :--- |
| **AP_TP** | True positives (one-to-one match) |
| **AP_FP** | False positives (unmatched predictions; with `--partly`: only those whose best match is foreground) |
| **AP_FN** | False negatives (unmatched GT instances) |
| **precision** | TP / (TP + FP) |
| **recall** | TP / (TP + FN) |
| **fscore** | Harmonic mean of precision and recall |
| **AP** | Approximate AP proxy: precision × recall |
| *optional:* **false_split** | Number of false splits |
| *optional:* **false_merge** | Number of false merges |

Metrics are computed for the thresholds 0.1, 0.2, ..., 0.9 plus 0.55, 0.65, 0.75, 0.85, 0.95.

#### Aggregate
| Metric | Description |
| :--- | :--- |
| **avAP** | Mean AP proxy across thresholds ≥ 0.5 |
| **avAP59** | AP proxy averaged over thresholds 0.5–0.9 |
| **avAP19** | AP proxy averaged over thresholds 0.1–0.9 |
| **avFscore** | Mean F-score over thresholds 0.1–0.9 |
| **avFscore59** | Mean F-score over thresholds 0.5–0.9 |
| **avFscore19** | Mean F-score over thresholds 0.1–0.9 |

#### General
| Metric | Description |
| :--- | :--- |
| **Num GT** | Number of ground-truth instances |
| **Num Pred** | Number of predicted instances |
| **TP_05** | True positives at threshold 0.5 |
| **TP_05_rel** | TP_05 / Num GT |
| **TP_05_cldice** | clDice scores of matched pairs at threshold 0.5 |
| **avg_TP_05_cldice** | Mean clDice over matched pairs at threshold 0.5 |

#### Optional General / FISBe
| Metric | Description |
| :--- | :--- |
| **FM** | False merges (many-to-many matching with consumption; threshold `fm_thresh`) |
| **FS** | False splits (many-to-many matching with consumption; threshold `fs_thresh`) |
| **avg_gt_skel_coverage** | Mean skeleton coverage of GT instances by associated predictions (association via best-match mapping) |
| **avg_tp_skel_coverage** | Mean skeleton coverage over TP GT instances (> 0.5) |
| **avg_f1_cov_score** | 0.5 × avFscore19 + 0.5 × avg_gt_skel_coverage |
| **avg_gt_cov_dim** | Mean GT coverage for "dim" instances |
| **avg_gt_cov_overlap** | Mean GT coverage for overlapping-instance regions |
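The averaged scores in the aggregate table are plain means over per-threshold values; an illustrative computation with made-up AP values:

```python
import numpy as np

ths = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
ap = {th: max(0.0, 0.9 - th) for th in ths}  # made-up AP proxy per threshold

avAP19 = float(np.mean([ap[th] for th in ths]))               # 0.1-0.9
avAP59 = float(np.mean([ap[th] for th in ths if th >= 0.5]))  # 0.5-0.9
```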
65 changes: 38 additions & 27 deletions evalinstseg/compute.py
```python
    get_centerline_overlap_single,
    get_centerline_overlap,
)
from .match import assign_labels, greedy_many_to_many_matching, get_m2m_matches

logger = logging.getLogger(__name__)


def get_gt_coverage_overlap(
    # ... (signature and body unchanged, elided) ...
):
    ...
    return gt_ovlp, tp_05_ovlp, tp_05_rel_ovlp, gt_covs_ovlp, avg_cov_ovlp


def compute_m2m_stats(matches, num_pred_labels):
    '''Helper to compute false merges and false splits from many-to-many matches.'''
    fm = 0
    fs = 0
    if matches is not None:
        # FS calculation: count excess predictions per GT instance
        for k, v in matches.items():
            fs += max(0, len(v) - 1)

        # FM calculation: count excess GT instances per prediction
        if num_pred_labels > 0:
            fms = np.zeros(num_pred_labels)  # without 0 background
            for k, v in matches.items():
                for cv in v:
                    fms[cv - 1] += 1
            fms = np.maximum(fms - 1, np.zeros(num_pred_labels))
            fm = int(np.sum(fms))

    return fm, fs


def get_m2m_metrics(gt_labels, pred_labels, num_pred_labels, matchMat, thresh, overlaps=True):
    """
    Compute false merge and false split metrics for any localization criterion
    using many-to-many matching.

    Args:
        gt_labels: ground truth labels
        pred_labels: predicted labels
        num_pred_labels: number of predicted labels
        matchMat: recall matrix for clDice (an IoU matrix would require an
            appropriate m2m matching)
        thresh: threshold for matching
        overlaps: whether to allow overlapping instances

    Returns:
        Tuple of (false_merge, false_split, matches)
    """
    matches = get_m2m_matches(
        matchMat, thresh, gt_labels, pred_labels, overlaps
    )
    fm, fs = compute_m2m_stats(matches, num_pred_labels)
    return fm, fs, matches
```
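A quick toy check of `compute_m2m_stats`, assuming `matches` maps each GT id to the list of prediction ids explaining it, as produced by the many-to-many matching:

```python
# GT 1 is explained by preds 1 and 2 (one false split);
# pred 3 explains GTs 2 and 3 (one false merge).
matches = {1: [1, 2], 2: [3], 3: [3]}
fm, fs = compute_m2m_stats(matches, num_pred_labels=3)
print(fm, fs)  # -> 1 1
```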