diff --git a/README.md b/README.md
index cc2bf5f..6cf2540 100644
--- a/README.md
+++ b/README.md
@@ -1,129 +1,195 @@
FISBe: A real-world benchmark dataset for instance segmentation of long-range thin filamentous structures
-

## About
⚠️ *Currently under construction.*
-This is the official implementation of the **FISBe (FlyLight Instance Segmentation Benchmark)**
-evaluation pipeline, the first publicly available multi-neuron light microscopy dataset with
-pixel-wise annotations.
+This is the official implementation of the **FISBe (FlyLight Instance Segmentation Benchmark)** evaluation pipeline. It is the first publicly available multi-neuron light microscopy dataset with pixel-wise annotations.
-You can download the dataset on the official project page:
+**Download the dataset:** [https://kainmueller-lab.github.io/fisbe/](https://kainmueller-lab.github.io/fisbe/)
-👉 https://kainmueller-lab.github.io/fisbe/
+The benchmark supports 2D and 3D segmentations and computes a wide range of commonly used evaluation metrics (e.g., AP, F1, coverage). Crucially, it provides specialized error attribution for topological errors (False Merges, False Splits) relevant to filamentous structures.
-The benchmark supports 2D and 3D segmentations and computes a wide range of commonly used evaluation
-metrics (e.g., AP, F1, coverage, precision, recall). Additionally, it provides a visualization
-of segmentation errors.
+### Features
+- **Standard Metrics:** AP, F1, Precision, Recall.
+- **FISBe Metrics:** Greedy many-to-many matching for False Merges (FM) and False Splits (FS).
+- **Flexibility:** Supports HDF5 (`.hdf`, `.h5`) and Zarr (`.zarr`) files.
+- **Modes:** Run on single files, entire folders, or in stability analysis mode.
+- **Partly Labeled Data:** Robust evaluation of sparsely annotated ground truth (unmatched predictions whose best match is background are not counted as false positives).
-Overview:
--------------
-This toolkit provides:
+---
-- Standard and FlyLight-specific evaluation metrics
-- Error attribution (false merges, splits, FP/FN instances)
-- Visualizations for neurons and nuclei
-- Support for partially annotated datasets
-- Coverage metrics (skeleton, dimension-based, overlap-based)
-- Command-line and Python API usage
+## Installation
--------------
+The recommended way to install is using `uv` (fastest) or `micromamba`.
-Installation:
--------------
-The recommended way is to install it into your micromamba/python virtual environment.
+### Option 1: Using `uv` (Fastest)
```bash
-git clone https://github.com/Kainmueller-Lab/evaluate-instance-segmentation
+# 1. Install uv (if not installed)
+pip install uv
+
+# 2. Clone and install
+git clone https://github.com/Kainmueller-Lab/evaluate-instance-segmentation.git
cd evaluate-instance-segmentation
+uv venv
+source .venv/bin/activate  # activate the environment so `evalinstseg` is on your PATH
+uv pip install -e .
+```
-micromamba create -n evalinstseg -f environment.yml
+### Option 2: Using `micromamba` or `conda`
+
+```bash
+micromamba create -n evalinstseg python=3.10
micromamba activate evalinstseg
+git clone https://github.com/Kainmueller-Lab/evaluate-instance-segmentation.git
+cd evaluate-instance-segmentation
pip install -e .
```
-## Run Benchmark
+## Usage: Command Line (CLI)
+The `evalinstseg` command is automatically available after installation.
-You can use this repository in two ways:
+### 1. Evaluate a Single File
+```bash
+evalinstseg \
+ --res_file tests/pred/R14A02-20180905.hdf \
+ --res_key volumes/labels \
+ --gt_file tests/gt/R14A02-20180905.zarr \
+ --gt_key volumes/gt_instances \
+ --out_dir tests/results \
+ --app flylight
+```
-1. As a Python package (via `evaluate_file` / `evaluate_volume`)
-2. From the command line
+### 2. Evaluate an Entire Folder
+If you provide a directory path to `--res_file`, the tool looks for matching Ground Truth files in the `--gt_file` folder. Files are matched by basename: a prediction `sample_01.hdf` is paired with `sample_01.zarr` in the Ground Truth folder.
-Example:
```bash
evalinstseg \
- --res_file tests/pred/R14A02-20180905_65_A6.hdf \
- --res_key volumes/gmm_label_cleaned \
- --gt_file tests/gt/R14A02-20180905_65_A6.zarr \
+ --res_file /path/to/predictions_folder \
+ --res_key volumes/labels \
+ --gt_file /path/to/ground_truth_folder \
--gt_key volumes/gt_instances \
- --out_dir tests/results \
+ --out_dir /path/to/output_folder \
+ --app flylight
+```
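+
+Internally, the pairing mirrors the `get_gt_file` helper in `evalinstseg/evaluate.py`: the prediction's basename (without extension) is looked up as `<name>.zarr` in the Ground Truth folder. A minimal sketch of the rule (the function name is illustrative):
+
+```python
+import os
+
+def matching_gt_path(pred_path, gt_folder):
+    # basename without extension, e.g. "R14A02-20180905.hdf" -> "R14A02-20180905"
+    name = os.path.basename(pred_path).split(".")[0]
+    return os.path.join(gt_folder, name + ".zarr")
+```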
+
+### 3. Stability & Robustness Mode
+Compute the **Mean ± Std** of metrics across exactly 3 different training runs (e.g., different random seeds).
+
+```bash
+evalinstseg \
+ --stability_mode \
+ --run_dirs experiments/seed1 experiments/seed2 experiments/seed3 \
+ --gt_file data/ground_truth_folder \
+ --out_dir results/stability_report \
--app flylight
```
-By setting `--app flylight`, the pipeline automatically uses the default FlyLight benchmark configuration.
-
-You can also define custom configurations, including:
-- localization criteria
-- assignment strategy
-- metric subsets
-
-Output:
-
-- evaluation metrics are written to toml-file and returned as dict
-
-
-## Metrics Overview:
-The evaluation computes metrics at multiple levels: per-threshold instance metrics, aggregated AP/F-scores, and global statistics
-
-### Instance-Level Metrics (per threshold confusion_matrix.th_*)
-| Metric | Description |
-| --------------------------- | ---------------------------------------- |
-| **AP_TP** | True positives at threshold |
-| **AP_FP** | False positives at threshold |
-| **AP_FN** | False negatives at threshold |
-| **precision** | TP / (TP + FP) |
-| **recall** | TP / (TP + FN) |
-| **AP** | Approximate AP proxy: precision × recall |
-| **fscore** | Harmonic mean of precision and recall |
-| *optional:* **false_split** | Number of false splits |
-| *optional:* **false_merge** | Number of false merges |
-
-Metrics are computed for thresholds:
-0.1, 0.2, ..., 0.9, 0.55, 0.65, 0.75, 0.85, 0.95.
-
-### Aggregate Metrics
-| Metric | Description |
-| -------------- | ----------------------------------- |
-| **avAP** | Mean AP for thresholds ≥ 0.5 |
-| **avAP59** | AP averaged over thresholds 0.5–0.9 |
-| **avAP19** | AP averaged over thresholds 0.1–0.9 |
-| **avFscore** | Mean F-score for thresholds 0.1–0.9 |
-| **avFscore59** | Mean F-score for thresholds 0.5–0.9 |
-| **avFscore19** | Mean F-score for thresholds 0.1–0.9 |
-
-### General Metrics
-| Metric | Description |
-| ---------------------- | ------------------------------------------------ |
-| **Num GT** | Number of ground-truth instances |
-| **Num Pred** | Number of predicted instances |
-| **TP_05** | True positives at threshold 0.5 |
-| **TP_05_rel** | TP_05 / Num GT |
-| **TP_05_cldice** | clDice scores of matched pairs at threshold 0.5 |
-| **avg_TP_05_cldice** | Mean clDice over matched pairs at threshold 0.5 |
-
-### Optional General Metrics
-| Metric | Description |
-| ----------------------- | ----------- |
-| **avg_gt_skel_coverage** | Mean skeleton coverage over all GT instances |
-| **avg_tp_skel_coverage** | Mean skeleton coverage over TP GT instances (> 0.5) |
-| **avg_f1_cov_score** | 0.5 × avFscore19 + 0.5 × avg_gt_skel_coverage |
-| **FM** | Many-to-many false merge score (threshold `fm_thresh`) |
-| **FS** | Many-to-many false split score (threshold `fs_thresh`) |
-| **avg_gt_cov_dim** | Mean GT coverage for “dim” instances |
-| **avg_gt_cov_overlap** | Mean GT coverage for overlapping-instance regions |
+**Requirements:**
+- `--run_dirs`: Provide exactly 3 folders.
+- `--gt_file`: The folder containing Ground Truth files (filenames must match predictions).
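+
+Per run, the per-sample metrics are first aggregated over instances (via `average_flylight_score_over_instances`); the report then prints the mean and standard deviation of each aggregated score across the three runs. A minimal sketch of that final step (the function name is illustrative):
+
+```python
+import numpy as np
+
+def report_stability(run_scores):
+    # run_scores: list of three {metric_name: value} dicts, one per run
+    for key in run_scores[0]:
+        values = [scores[key] for scores in run_scores if key in scores]
+        if len(values) == 3:
+            print(f"{key:<30}: {np.mean(values):.4f} ± {np.std(values):.4f}")
+```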
+
+### 4. Partly Labeled Data
+If your ground truth is sparse (not fully dense), add the `--partly` flag to any of the commands above. See the *Partly Labeled Data* section below for how false positives are counted in this mode.
+
+## Usage: Python Package
+You can integrate the benchmark directly into your Python scripts or notebooks.
+
+### Evaluate a File
+```python
+from evalinstseg import evaluate_file
+
+# Run evaluation
+metrics = evaluate_file(
+ res_file="tests/pred/sample_01.hdf",
+ gt_file="tests/gt/sample_01.zarr",
+ res_key="volumes/labels",
+ gt_key="volumes/gt_instances",
+ out_dir="output_folder",
+ ndim=3,
+    # defaults mirroring the `--app flylight` CLI configuration (subset shown)
+    localization_criterion="cldice",
+    assignment_strategy="greedy",
+    remove_small_components=800,
+    add_general_metrics=["avg_gt_skel_coverage", "false_merge", "false_split"],
+    fm_thresh=0.1,
+    fs_thresh=0.05,
+    partly=False,  # set True for sparse GT
+)
+
+# Access metrics directly
+print("AP:", metrics['confusion_matrix']['avAP'])
+print("False Merges:", metrics['general']['FM'])
+```
+### Evaluate Raw Numpy Arrays
+If you already have the arrays loaded in memory:
+
+```python
+import numpy as np
+from evalinstseg import evaluate_volume
+
+# label arrays, e.g. shape (Z, Y, X); data with overlapping instances is
+# expected to carry an additional leading instance/channel axis
+pred_array = np.load(...)
+gt_array = np.load(...)
+
+metrics = evaluate_volume(
+ gt_labels=gt_array,
+ pred_labels=pred_array,
+ ndim=3,
+ outFn="output_path_prefix",
+ localization_criterion="cldice", # or 'iou'
+ assignment_strategy="greedy",
+ add_general_metrics=["false_merge", "false_split"]
+)
+```
+## Partly Labeled Data (`--partly`)
+Some samples contain sparse / incomplete GT annotations. In this setting, counting all unmatched predictions as false positives is not meaningful.
+
+When `--partly` is enabled, we approximate FP by counting only **unmatched predictions whose best match is a foreground GT instance** (based on the localization matrix used for evaluation, e.g. clPrecision for `cldice`).
+Unmatched predictions whose best match is **background** are ignored.
+
+Concretely, for each unmatched prediction we take the GT index with the maximal overlap score; the prediction is counted as FP only if that index corresponds to a foreground instance (index > 0) rather than background (index 0).
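+
+A minimal sketch of this FP rule (the function name is illustrative; it assumes a localization matrix `precMat` of shape `(num_gt + 1, num_pred + 1)` whose row/column 0 is background, and 1-based ids of unmatched predictions):
+
+```python
+import numpy as np
+
+def count_partly_fp(precMat, unmatched_pred_ids):
+    fp = 0
+    for pred_id in unmatched_pred_ids:
+        # GT index (including background row 0) with the highest overlap score
+        best_gt = int(np.argmax(precMat[:, pred_id]))
+        if best_gt > 0:  # best match is a foreground GT instance -> count as FP
+            fp += 1
+    return fp
+```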
+
+---
+
+## Metrics Explanation
+
+### 1. Standard Instance Metrics (TP/FP/FN, F-score, AP proxy)
+These metrics are computed from a **one-to-one matching** between GT and prediction instances (Hungarian or greedy), using a chosen localization criterion (default for FlyLight is `cldice`).
+
+- **TP**: matched pairs above threshold
+- **FP**: unmatched predictions (or, in `--partly`, only those whose best match is foreground)
+- **FN**: unmatched GT instances
+- **precision** = TP / (TP + FP)
+- **recall** = TP / (TP + FN)
+- **fscore** = 2 * precision * recall / (precision + recall)
+- **AP**: we report a simple AP proxy `precision × recall` at each threshold and average it across thresholds (this is not COCO-style AP).
+
+### 2. FISBe Error Attribution (False Splits / False Merges)
+False splits (FS) and false merges (FM) aim to quantify **instance topology errors** for long-range thin filamentous structures.
+
+We compute FS/FM using **greedy many-to-many matching with consumption**:
+- Candidate GT–Pred pairs above threshold are processed in descending score order.
+- After selecting a match, we update “available” pixels so that already explained structure is not matched again.
+- FS counts when one GT is explained by multiple preds (excess preds per GT).
+- FM counts when one pred explains multiple GTs (excess GTs per pred).
+
+This produces an explicit attribution of split/merge errors rather than only TP/FP/FN. If neither GT nor prediction contains overlapping instances, the matches are obtained more cheaply by simply thresholding the localization matrix.
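+
+Given the resulting many-to-many matches, the counting itself is simple; the sketch below mirrors `compute_m2m_stats` in `evalinstseg/compute.py` (`matches` maps each 1-based GT id to the list of 1-based prediction ids that explain it):
+
+```python
+import numpy as np
+
+def count_fs_fm(matches, num_pred_labels):
+    # False Splits: extra predictions per GT instance
+    fs = sum(max(0, len(preds) - 1) for preds in matches.values())
+    # False Merges: extra GT instances per prediction
+    hits = np.zeros(num_pred_labels)  # index 0 corresponds to prediction label 1
+    for preds in matches.values():
+        for pred_id in preds:
+            hits[pred_id - 1] += 1
+    fm = int(np.sum(np.maximum(hits - 1, 0)))
+    return fs, fm
+```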
+
+### Metric Definitions
+
+#### Instance-Level (per threshold)
+| Metric | Description |
+| :--- | :--- |
+| **AP_TP** | True Positives (1-to-1 match) |
+| **AP_FP** | False Positives (unmatched preds; in `--partly`: only unmatched preds whose best match is foreground) |
+| **AP_FN** | False Negatives (unmatched GT) |
+| **precision** | TP / (TP + FP) |
+| **recall** | TP / (TP + FN) |
+| **fscore** | Harmonic mean of precision and recall |
+
+#### Global / FISBe
+| Metric | Description |
+| :--- | :--- |
+| **avAP** | Mean AP proxy across thresholds ≥ 0.5 |
+| **FM** | False Merges (many-to-many matching with consumption) |
+| **FS** | False Splits (many-to-many matching with consumption) |
+| **avg_gt_skel_coverage** | Mean skeleton coverage of GT instances by associated predictions (association via best-match mapping) |
diff --git a/evalinstseg/compute.py b/evalinstseg/compute.py
index a344195..689d325 100644
--- a/evalinstseg/compute.py
+++ b/evalinstseg/compute.py
@@ -7,7 +7,7 @@
get_centerline_overlap_single,
get_centerline_overlap,
)
-from .match import assign_labels, greedy_many_to_many_matching
+from .match import assign_labels, greedy_many_to_many_matching, get_m2m_matches
logger = logging.getLogger(__name__)
@@ -160,34 +160,45 @@ def get_gt_coverage_overlap(
return gt_ovlp, tp_05_ovlp, tp_05_rel_ovlp, gt_covs_ovlp, avg_cov_ovlp
-def get_m2m_fm(
- gt_labels, pred_labels, num_pred_labels, recallMat, fm_thresh, matches=None
-):
- # get false merges
- if matches is None:
- # call many-to-many matching based on clRecall
- matches = greedy_many_to_many_matching(
- gt_labels, pred_labels, recallMat, fm_thresh
- )
+def compute_m2m_stats(matches, num_pred_labels):
+    """Helper to compute false merges and false splits from many-to-many matches."""
+
fm = 0
- if matches is not None:
- fms = np.zeros(num_pred_labels) # without 0 background
- for k, v in matches.items():
- for cv in v:
- fms[cv - 1] += 1
- fms = np.maximum(fms - 1, np.zeros(num_pred_labels))
- fm = int(np.sum(fms))
- return fm, matches
-
-
-def get_m2m_fs(gt_labels, pred_labels, recallMat, fs_thresh, matches=None):
- # get false splits
- if matches is None:
- matches = greedy_many_to_many_matching(
- gt_labels, pred_labels, recallMat, fs_thresh
- )
fs = 0
if matches is not None:
+ # FS calculation
for k, v in matches.items():
fs += max(0, len(v) - 1)
- return fs, matches
+
+ # FM calculation
+ if num_pred_labels > 0:
+ fms = np.zeros(num_pred_labels) # without 0 background
+ for k, v in matches.items():
+ for cv in v:
+ fms[cv - 1] += 1
+ fms = np.maximum(fms - 1, np.zeros(num_pred_labels))
+ fm = int(np.sum(fms))
+
+ return fm, fs
+
+
+def get_m2m_metrics(gt_labels, pred_labels, num_pred_labels, matchMat, thresh, overlaps=True):
+ """
+ Compute false merge and false split metrics for any localization criterion using many-to-many matching.
+
+ Args:
+ gt_labels: Ground truth labels
+ pred_labels: Predicted labels
+ num_pred_labels: Number of predicted labels
+        matchMat: Localization matrix used for matching (clRecall matrix for
+            clDice; an IoU-based matrix would require a suitable m2m matching)
+        thresh: Threshold for accepting a GT-prediction match
+        overlaps: Whether GT or predictions contain overlapping instances
+            (if True, greedy many-to-many matching with consumption is used;
+            otherwise matches are obtained by thresholding matchMat)
+
+ Returns:
+ Tuple of (false_merge, false_split, matches)
+ """
+ matches = get_m2m_matches(
+ matchMat, thresh, gt_labels, pred_labels, overlaps
+ )
+ fm, fs = compute_m2m_stats(matches, num_pred_labels)
+ return fm, fs, matches
diff --git a/evalinstseg/evaluate.py b/evalinstseg/evaluate.py
index 20af541..46b0fc7 100644
--- a/evalinstseg/evaluate.py
+++ b/evalinstseg/evaluate.py
@@ -21,8 +21,7 @@
get_gt_coverage,
get_gt_coverage_dim,
get_gt_coverage_overlap,
- get_m2m_fm,
- get_m2m_fs,
+ get_m2m_metrics,
)
from .visualize import visualize_neurons, visualize_nuclei
from .summarize import (
@@ -42,7 +41,7 @@ def evaluate_file(
res_key=None,
gt_key=None,
suffix="",
- localization_criterion="iou", # "iou", "cldice"
+ localization_criterion="cldice", # "iou", "cldice"
assignment_strategy="greedy", # "hungarian", "greedy", "gt_0_5"
add_general_metrics=[],
visualize=False,
@@ -99,10 +98,6 @@ def evaluate_file(
remove_small_components,
)
- # if from_scratch is set, overwrite existing evaluation files
- # otherwise try to load precomputed metrics
- # if check_for_metric is None, just check if matching file exists
- # otherwise check if check_for_metric is contained within file
if not from_scratch and len(glob.glob(outFn + ".toml")) > 0:
with open(outFn + ".toml", "r") as tomlFl:
metricsDict = toml.load(tomlFl)
@@ -175,7 +170,7 @@ def evaluate_volume(
pred_labels,
ndim,
outFn,
- localization_criterion="iou",
+ localization_criterion="cldice",
assignment_strategy="hungarian",
evaluate_false_labels=False,
add_general_metrics=[],
@@ -221,6 +216,11 @@ def evaluate_volume(
gt_labels, pred_labels, remove_small_components, foreground_only
)
+ # Check for overlapping instances
+ gt_overlaps = np.any(np.sum(gt_labels_rel > 0, axis=0) > 1)
+ pred_overlaps = np.any(np.sum(pred_labels_rel > 0, axis=0) > 1)
+ overlaps = gt_overlaps or pred_overlaps
+
logger.debug(
"are there pixels with multiple instances?: "
f"{np.sum(np.sum(gt_labels_rel > 0, axis=0) > 1)}"
@@ -231,7 +231,7 @@ def evaluate_volume(
num_gt_labels = int(np.max(gt_labels_rel))
num_matches = min(num_gt_labels, num_pred_labels)
- # get localization criterion -> TODO: check: do we still need recallMat_wo_overlap?
+ # get localization criterion
locMat, recallMat, precMat, recallMat_wo_overlap = compute_localization_criterion(
pred_labels_rel,
gt_labels_rel,
@@ -273,7 +273,6 @@ def evaluate_volume(
gt_ind,
num_pred_labels,
num_gt_labels,
- locMat,
precMat,
recallMat,
th,
@@ -391,21 +390,43 @@ def evaluate_volume(
avg_f1_cov_score = 0.5 * avFscore19 + 0.5 * gt_skel_coverage
metrics.addMetric(tblNameGen, "avg_f1_cov_score", avg_f1_cov_score)
- # TODO: rename "false_merge" and "false_splits" to sth with many-to-many?
- m2m_matches = None
- if "false_merge" in add_general_metrics:
- fm, m2m_matches = get_m2m_fm(
- gt_labels_rel, pred_labels_rel, num_pred_labels, recallMat, fm_thresh
- )
- metrics.addMetric("general", "FM", fm)
- if "false_split" in add_general_metrics:
- # if fm and fs thresh are different, reset matches, reuse otherwise
- if fm_thresh != fs_thresh:
- m2m_matches = None
- fs, _ = get_m2m_fs(
- gt_labels_rel, pred_labels_rel, recallMat, fs_thresh, m2m_matches
- )
- metrics.addMetric("general", "FS", fs)
+ if "false_merge" in add_general_metrics or "false_split" in add_general_metrics:
+ if fm_thresh == fs_thresh:
+ # Optimized path: Same threshold, compute both at once
+ fm, fs, _ = get_m2m_metrics(
+ gt_labels_rel,
+ pred_labels_rel,
+ num_pred_labels,
+ recallMat,
+ fm_thresh,
+ overlaps=overlaps,
+ )
+ if "false_merge" in add_general_metrics:
+ metrics.addMetric("general", "FM", fm)
+ if "false_split" in add_general_metrics:
+ metrics.addMetric("general", "FS", fs)
+ else:
+ # Different thresholds, compute separately
+ if "false_merge" in add_general_metrics:
+ fm, _, _ = get_m2m_metrics(
+ gt_labels_rel,
+ pred_labels_rel,
+ num_pred_labels,
+ recallMat,
+ fm_thresh,
+ overlaps=overlaps,
+ )
+ metrics.addMetric("general", "FM", fm)
+ if "false_split" in add_general_metrics:
+ _, fs, _ = get_m2m_metrics(
+ gt_labels_rel,
+ pred_labels_rel,
+ num_pred_labels,
+ recallMat,
+ fs_thresh,
+ overlaps=overlaps,
+ )
+ metrics.addMetric("general", "FS", fs)
if "avg_gt_cov_dim" in add_general_metrics:
gt_dim, tp_05_dim, tp_05_rel_dim, gt_covs_dim, avg_cov_dim = (
@@ -463,7 +484,13 @@ def main():
parser = argparse.ArgumentParser()
# input output
parser.add_argument(
- "--res_file", nargs="+", type=str, help="path to result file", required=True
+ "--stability_mode", action="store_true", help="Run 3x stability evaluation"
+ )
+ parser.add_argument(
+ "--run_dirs", nargs="+", type=str, help="List of 3 experiment directories"
+ )
+ parser.add_argument(
+ "--res_file", nargs="+", type=str, help="path to result file"
)
parser.add_argument(
"--gt_file",
@@ -603,193 +630,230 @@ def main():
logger.debug("arguments %s", tuple(sys.argv))
args = parser.parse_args()
- # shortcut if res_file and gt_file contain folders
- if len(args.res_file) == 1 and len(args.gt_file) == 1:
- res_file = args.res_file[0]
- gt_file = args.gt_file[0]
- if (os.path.isdir(res_file) and not res_file.endswith(".zarr")) and (
- os.path.isdir(gt_file) and not gt_file.endswith(".zarr")
+ def get_gt_file(in_fn, gt_folder):
+ """Helper to get gt file corresponding to input result file."""
+ out_fn = os.path.join(
+ gt_folder, os.path.basename(in_fn).split(".")[0] + ".zarr"
+ )
+ return out_fn
+
+ def _run_loop(res_files, gt_files, out_dirs, partly_list_loc):
+ """Core evaluation loop used in normal and stability mode."""
+
+ loop_samples = []
+ loop_metrics = []
+ for res_file, gt_file, partly, out_dir in zip(
+ res_files, gt_files, partly_list_loc, out_dirs
):
- args.res_file = natsorted(glob.glob(res_file + "/*.hdf"))
+            os.makedirs(out_dir, exist_ok=True)
+
+ sample_name = os.path.basename(res_file).split(".")[0]
+ logger.info("sample_name: %s", sample_name)
+
+ metric_dict = evaluate_file(
+ res_file,
+ gt_file,
+ args.ndim,
+ out_dir,
+ res_key=args.res_key,
+ gt_key=args.gt_key,
+ suffix=args.suffix,
+ localization_criterion=args.localization_criterion,
+ assignment_strategy=args.assignment_strategy,
+ add_general_metrics=args.add_general_metrics,
+ visualize=args.visualize,
+ visualize_type=args.visualize_type,
+ partly=partly,
+ foreground_only=args.foreground_only,
+ remove_small_components=args.remove_small_components,
+ evaluate_false_labels=args.evaluate_false_labels,
+ fm_thresh=args.fm_thresh,
+ fs_thresh=args.fs_thresh,
+ from_scratch=args.from_scratch,
+ eval_dim=args.eval_dim,
+ debug=args.debug,
+ )
+ loop_metrics.append(metric_dict)
+ loop_samples.append(sample_name)
+ print(f"Evaluated {sample_name}: {metric_dict}")
+
+ return loop_metrics, loop_samples
+
+ # Stability Mode (Wraps logic 3 times)
+ if args.stability_mode:
+ if not args.run_dirs or len(args.run_dirs) != 3:
+ raise ValueError("Stability mode requires exactly 3 directories passed to --run_dirs")
+
+ stability_scores = []
+        print("--- EVALUATE USING STABILITY MODE ---")
+
+ for run_idx, run_dir in enumerate(args.run_dirs):
+ print(f"Processing Run {run_idx+1}: {run_dir}")
+
+ # Auto-detect files for this run
+ run_res_files = natsorted(glob.glob(run_dir + "/*.hdf"))
+ if not run_res_files:
+ run_res_files = natsorted(glob.glob(run_dir + "/*.zarr"))
+
+            # Ground-truth files are looked up in the folder passed via --gt_file
+ run_gt_files = [get_gt_file(fn, args.gt_file[0]) for fn in run_res_files]
+ run_out_dirs = [os.path.join(args.out_dir[0], f"seed_{run_idx+1}")] * len(run_res_files)
+
+ # Run the inner loop
+ m_dicts, s_names = _run_loop(run_res_files, run_gt_files, run_out_dirs, [args.partly]*len(run_res_files))
+
+ # Aggregate just this run
+ metrics_full = {s: m for m, s in zip(m_dicts, s_names) if m is not None}
+ acc, _ = average_flylight_score_over_instances(s_names, metrics_full)
+ stability_scores.append(acc)
+
+ # Print Average and Std Dev across runs
+ print("\n=== FISBe BENCHMARK RESULTS (Mean ± Std) ===")
+ if stability_scores:
+ for key in stability_scores[0].keys():
+ values = [s[key] for s in stability_scores if key in s]
+ if len(values) == 3:
+ print(f"{key:<30}: {np.mean(values):.4f} ± {np.std(values):.4f}")
+
+ # Normal Mode
+ else:
+        print("--- EVALUATE USING SINGLE DIR ---")
+        if not args.res_file or not args.gt_file:
+            parser.error("--res_file and --gt_file are required unless --stability_mode is set")
+ # shortcut if res_file and gt_file contain folders
+ if len(args.res_file) == 1 and len(args.gt_file) == 1:
+ res_file = args.res_file[0]
+ gt_file = args.gt_file[0]
+ if (os.path.isdir(res_file) and not res_file.endswith(".zarr")) and (
+ os.path.isdir(gt_file) and not gt_file.endswith(".zarr")
+ ):
+                args.res_file = natsorted(glob.glob(res_file + "/*.hdf"))
+                if not args.res_file:
+                    args.res_file = natsorted(glob.glob(res_file + "/*.zarr"))
+ args.gt_file = [get_gt_file(fn, gt_file) for fn in args.res_file]
+
+ # check same length for result and gt files
+ assert len(args.res_file) == len(args.gt_file), (
+ "Please check, not the same number of result and gt files"
+ )
+ # set partly parameter for all samples if not done already
+ if len(args.res_file) > 1:
+ if args.partly_list is not None:
+ assert len(args.partly_list) == len(args.res_file), (
+ "Please check, not the same number of result files "
+ "and partly_list values"
+ )
+ partly_list = np.array(args.partly_list, dtype=bool)
+ else:
+ partly_list = [args.partly] * len(args.res_file)
+ else:
+ partly_list = [args.partly]
- def get_gt_file(in_fn, gt_folder):
- out_fn = os.path.join(
- gt_folder, os.path.basename(in_fn).split(".")[0] + ".zarr"
+ # check out_dir
+ if len(args.res_file) > 1:
+ if len(args.out_dir) > 1:
+ assert len(args.res_file) == len(args.out_dir), (
+ "Please check, number of input files and output folders should correspond"
)
- assert (
- os.path.basename(in_fn).split(".")[0]
- == os.path.basename(out_fn).split(".")[0]
+ outdir_list = args.out_dir
+ else:
+ outdir_list = args.out_dir * len(args.res_file)
+ else:
+ assert len(args.out_dir) == 1, "Please check number of output directories"
+ outdir_list = args.out_dir
+ # check output dir for summary
+ if args.summary_out_dir is None:
+ args.summary_out_dir = args.out_dir[0]
+
+ if args.app is not None:
+ if args.app == "flylight":
+ print(
+ "Warning: parameter app is set and will overwrite parameters. "
+ "This might not be what you want."
)
- return out_fn
-
- args.gt_file = [get_gt_file(fn, gt_file) for fn in args.res_file]
-
- # check same length for result and gt files
- assert len(args.res_file) == len(args.gt_file), (
- "Please check, not the same number of result and gt files"
- )
- # set partly parameter for all samples if not done already
- if len(args.res_file) > 1:
- if args.partly_list is not None:
- assert len(args.partly_list) == len(args.res_file), (
- "Please check, not the same number of result files "
- "and partly_list values"
+ args.ndim = 3
+ args.localization_criterion = "cldice"
+ args.assignment_strategy = "greedy"
+ args.remove_small_components = 800
+ # args.evaluate_false_labels = True
+ args.metric = "general.avg_f1_cov_score"
+ args.add_general_metrics = [
+ "avg_gt_skel_coverage",
+ "avg_f1_cov_score",
+ "false_merge",
+ "false_split",
+ "avg_gt_cov_dim",
+ "avg_gt_cov_overlap",
+ ]
+ args.summary = [
+ "general.Num GT",
+ "general.Num Pred",
+ "general.avg_f1_cov_score",
+ "confusion_matrix.avFscore",
+ "general.avg_gt_skel_coverage",
+ "confusion_matrix.th_0_5.AP_TP",
+ "confusion_matrix.th_0_5.AP_FP",
+ "confusion_matrix.th_0_5.AP_FN",
+ "general.FM",
+ "general.FS",
+ "general.TP_05",
+ "general.TP_05_rel",
+ "general.avg_TP_05_cldice",
+ "general.GT_dim",
+ "general.TP_05_dim",
+ "general.TP_05_rel_dim",
+ "general.avg_gt_cov_dim",
+ "general.GT_overlap",
+ "general.TP_05_overlap",
+ "general.TP_05_rel_overlap",
+ "general.avg_gt_cov_overlap",
+ "confusion_matrix.th_0_1.fscore",
+ "confusion_matrix.th_0_2.fscore",
+ "confusion_matrix.th_0_3.fscore",
+ "confusion_matrix.th_0_4.fscore",
+ "confusion_matrix.th_0_5.fscore",
+ "confusion_matrix.th_0_6.fscore",
+ "confusion_matrix.th_0_7.fscore",
+ "confusion_matrix.th_0_8.fscore",
+ "confusion_matrix.th_0_9.fscore",
+ ]
+ args.visualize_type = "neuron"
+ args.fm_thresh = 0.1
+ args.fs_thresh = 0.05
+ args.eval_dim = True
+
+ metric_dicts, samples = _run_loop(args.res_file, args.gt_file, outdir_list, partly_list)
+
+ # aggregate over instances
+ metrics_full = {}
+ acc_all_instances = None
+ for metric_dict, sample in zip(metric_dicts, samples):
+ if metric_dict is None:
+ continue
+ metrics_full[sample] = metric_dict
+ if len(np.unique(partly_list)) > 1:
+ print("averaging for combined")
+ # get average over instances for completely
+ samples = np.array(samples)
+ acc_cpt, acc_inst_cpt = average_flylight_score_over_instances(
+ samples[partly_list == False], metrics_full
)
- partly_list = np.array(args.partly_list, dtype=bool)
- else:
- partly_list = [args.partly] * len(args.res_file)
- else:
- partly_list = [args.partly]
-
- # check out_dir
- if len(args.res_file) > 1:
- if len(args.out_dir) > 1:
- assert len(args.res_file) == len(args.out_dir), (
- "Please check, number of input files and output folders should correspond"
+ acc_prt, acc_inst_prt = average_flylight_score_over_instances(
+ samples[partly_list == True], metrics_full
)
- outdir_list = args.out_dir
- else:
- outdir_list = args.out_dir * len(args.res_file)
- else:
- assert len(args.out_dir) == 1, "Please check number of output directories"
- outdir_list = args.out_dir
- # check output dir for summary
- if args.summary_out_dir is None:
- args.summary_out_dir = args.out_dir[0]
-
- if args.app is not None:
- if args.app == "flylight":
- print(
- "Warning: parameter app is set and will overwrite parameters. "
- "This might not be what you want."
+ acc, acc_all_instances = average_sets(
+ acc_cpt, acc_inst_cpt, acc_prt, acc_inst_prt
)
- args.ndim = 3
- args.localization_criterion = "cldice"
- args.assignment_strategy = "greedy"
- args.remove_small_components = 800
- # args.evaluate_false_labels = True
- args.metric = "general.avg_f1_cov_score"
- args.add_general_metrics = [
- "avg_gt_skel_coverage",
- "avg_f1_cov_score",
- "false_merge",
- "false_split",
- "avg_gt_cov_dim",
- "avg_gt_cov_overlap",
- ]
- args.summary = [
- "general.Num GT",
- "general.Num Pred",
- "general.avg_f1_cov_score",
- "confusion_matrix.avFscore",
- "general.avg_gt_skel_coverage",
- "confusion_matrix.th_0_5.AP_TP",
- "confusion_matrix.th_0_5.AP_FP",
- "confusion_matrix.th_0_5.AP_FN",
- "general.FM",
- "general.FS",
- "general.TP_05",
- "general.TP_05_rel",
- "general.avg_TP_05_cldice",
- "general.GT_dim",
- "general.TP_05_dim",
- "general.TP_05_rel_dim",
- "general.avg_gt_cov_dim",
- "general.GT_overlap",
- "general.TP_05_overlap",
- "general.TP_05_rel_overlap",
- "general.avg_gt_cov_overlap",
- "confusion_matrix.th_0_1.fscore",
- "confusion_matrix.th_0_2.fscore",
- "confusion_matrix.th_0_3.fscore",
- "confusion_matrix.th_0_4.fscore",
- "confusion_matrix.th_0_5.fscore",
- "confusion_matrix.th_0_6.fscore",
- "confusion_matrix.th_0_7.fscore",
- "confusion_matrix.th_0_8.fscore",
- "confusion_matrix.th_0_9.fscore",
- ]
- args.visualize_type = "neuron"
- args.fm_thresh = 0.1
- args.fs_thresh = 0.05
- args.eval_dim = True
-
- samples = []
- metric_dicts = []
- for res_file, gt_file, partly, out_dir in zip(
- args.res_file, args.gt_file, partly_list, outdir_list
- ):
- sample_name = os.path.basename(res_file).split(".")[0]
- logger.info("sample_name: %s", sample_name)
- logger.info("res_file: %s", res_file)
- logger.info("gt_file: %s", gt_file)
- logger.info("partly: %s", partly)
- logger.info("localization: %s", args.localization_criterion)
- logger.info("assignment: %s", args.assignment_strategy)
- logger.info("from scratch: %s", args.from_scratch)
- logger.info("add general metrics: %s", args.add_general_metrics)
-
- samples.append(os.path.basename(res_file).split(".")[0])
- metric_dict = evaluate_file(
- res_file,
- gt_file,
- args.ndim,
- out_dir,
- res_key=args.res_key,
- gt_key=args.gt_key,
- suffix=args.suffix,
- localization_criterion=args.localization_criterion,
- assignment_strategy=args.assignment_strategy,
- add_general_metrics=args.add_general_metrics,
- visualize=args.visualize,
- visualize_type=args.visualize_type,
- partly=partly,
- foreground_only=args.foreground_only,
- remove_small_components=args.remove_small_components,
- evaluate_false_labels=args.evaluate_false_labels,
- fm_thresh=args.fm_thresh,
- fs_thresh=args.fs_thresh,
- from_scratch=args.from_scratch,
- eval_dim=args.eval_dim,
- debug=args.debug,
- )
- metric_dicts.append(metric_dict)
- print(metric_dict)
-
- # aggregate over instances
- metrics_full = {}
- acc_all_instances = None
- for metric_dict, sample in zip(metric_dicts, samples):
- if metric_dict is None:
- continue
- metrics_full[sample] = metric_dict
- if len(np.unique(partly_list)) > 1:
- print("averaging for combined")
- # get average over instances for completely
- samples = np.array(samples)
- acc_cpt, acc_inst_cpt = average_flylight_score_over_instances(
- samples[partly_list == False], metrics_full
- )
- acc_prt, acc_inst_prt = average_flylight_score_over_instances(
- samples[partly_list == True], metrics_full
- )
- acc, acc_all_instances = average_sets(
- acc_cpt, acc_inst_cpt, acc_prt, acc_inst_prt
- )
- else:
- acc, acc_all_instances = average_flylight_score_over_instances(
- samples, metrics_full
- )
- if args.summary:
- summarize_metric_dict(
- metric_dicts,
- samples,
- args.summary,
- os.path.join(args.summary_out_dir, "summary.csv"),
- agg_inst_dict=acc_all_instances,
- )
+ else:
+ acc, acc_all_instances = average_flylight_score_over_instances(
+ samples, metrics_full
+ )
+ if args.summary:
+ summarize_metric_dict(
+ metric_dicts,
+ samples,
+ args.summary,
+ os.path.join(args.summary_out_dir, "summary.csv"),
+ agg_inst_dict=acc_all_instances,
+ )
if __name__ == "__main__":
- main()
+ main()
\ No newline at end of file
diff --git a/evalinstseg/match.py b/evalinstseg/match.py
index 2a4fa5a..798eb50 100644
--- a/evalinstseg/match.py
+++ b/evalinstseg/match.py
@@ -164,7 +164,7 @@ def greedy_many_to_many_matching(gt_labels, pred_labels, locMat, thresh,
def get_false_labels(
- tp_pred_ind, tp_gt_ind, num_pred_labels, num_gt_labels, locMat,
+ tp_pred_ind, tp_gt_ind, num_pred_labels, num_gt_labels,
precMat, recallMat, thresh, recallMat_wo_overlap):
# get false positive indices
@@ -213,3 +213,31 @@ def get_false_labels(
fp_ind, fn_ind, fs_ind, fm_pred_ind, fm_gt_ind, fm_count,
fp_ind_only_bg)
+
+def get_m2m_matches(locMat, thresh, gt_labels=None, pred_labels=None, overlaps=True):
+ """Get many-to-many matches between gt and predicted labels.
+ If we have no overlaps, we can do easy matching based on thresholding the locMat."""
+
+ # If we have overlapping instances, we need to do expensive greedy many-to-many matching
+ if overlaps:
+ if gt_labels is None or pred_labels is None:
+ raise ValueError("gt_labels and pred_labels required when overlaps=True")
+ matches = greedy_many_to_many_matching(gt_labels, pred_labels, locMat, thresh)
+ if matches is not None:
+ # key and values are 0-based, convert to 1-based
+ matches = {k + 1: [v + 1 for v in val] for k, val in matches.items()}
+ return matches
+ else:
+        # simple matching: every foreground gt-pred pair above threshold is a match
+        matches = {}
+        locFgMat = locMat[1:, 1:]  # exclude background row/column
+        rows, cols = np.nonzero(locFgMat > thresh)
+        for gt_idx, pred_idx in zip(rows, cols):
+            # shift to 1-based label ids (0 is background)
+            matches.setdefault(int(gt_idx) + 1, []).append(int(pred_idx) + 1)
+        return matches
+