transform() fails on large partitioned points with duplicate index #1095

@enric-bazz

Description

With a recent unreleased spatialdata version (see Environment), transform() raises a ValueError about duplicate index labels when a points element is read from a large parquet file that dask splits into multiple partitions.

Reproduce

import numpy as np
import pandas as pd
import dask.dataframe as dd
import spatialdata as sd

n = 40_000_000  # number of points; large enough that the parquet file gets partitioned

# Random 2D coordinates in [0, 100)
df = pd.DataFrame({
    "x": 100 * np.random.rand(n),
    "y": 100 * np.random.rand(n),
})

df.to_parquet("/tmp/points_file.parquet")

# Read the file back lazily as a dask DataFrame
ddf = dd.read_parquet("/tmp/points_file.parquet")

points = sd.models.PointsModel.parse(ddf)

sdata = sd.SpatialData.init_from_elements({"transcripts": points})

# Fails with the ValueError shown below
sd.transform(sdata, to_coordinate_system="global")
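To check whether the index really contains duplicate labels, a small helper like the one below may be useful. It is only a sketch: index_report is a hypothetical name, not part of spatialdata or dask, and the two concatenated frames merely simulate the assumption that each partition carries its own index restarting at 0.

```python
import pandas as pd


def index_report(index: pd.Index) -> dict:
    """Summarize whether an index has repeated labels (hypothetical helper)."""
    return {
        "n_labels": len(index),
        "is_unique": bool(index.is_unique),
        # duplicated() marks every repeat after the first occurrence
        "n_duplicated": int(index.duplicated().sum()),
    }


# Simulated partitions (assumption): each one has its own 0-based index
part_a = pd.DataFrame({"x": [0.1, 0.2, 0.3]})
part_b = pd.DataFrame({"x": [0.4, 0.5, 0.6]})
combined = pd.concat([part_a, part_b])  # index labels: 0, 1, 2, 0, 1, 2

report = index_report(combined.index)
print(report)
```

For the reproduction above, the same helper could be applied to the materialized index, e.g. index_report(ddf.index.compute()), and ddf.npartitions shows how many partitions were created.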

Error

ValueError: cannot reindex on an axis with duplicate labels

Traceback (most recent call last):

  File "<ipython-input-1>", line 1
    sd.transform(sdata, to_coordinate_system='global')

  File ".../spatialdata/_core/operations/transform.py", line 288, in _
    return data.transform_to_coordinate_system(target_coordinate_system=to_coordinate_system)

  File ".../spatialdata/_core/spatialdata.py", line 880, in transform_to_coordinate_system
    transformed = sdata.transform_element_to_coordinate_system(...)

  File ".../spatialdata/_core/spatialdata.py", line 819, in transform_element_to_coordinate_system
    transformed = transform(element, to_coordinate_system=target_coordinate_system, maintain_positioning=maintain_positioning)

  File ".../spatialdata/_core/operations/transform.py", line 466, in _
    new_col = pd.Series(new_ax.data.flatten().compute(), index=transformed.index)

  File ".../pandas/core/indexes/base.py", line 4436, in reindex
    raise ValueError("cannot reindex on an axis with duplicate labels")
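The error itself can be reproduced in plain pandas, independent of spatialdata. The sketch below is not the code path in transform.py; it only illustrates, with made-up values, that label-based reindexing fails once the source index contains repeated labels, while positional alignment does not.

```python
import pandas as pd

# Made-up stand-in for a computed column whose labels restart at 0
# (assumption: this mirrors a per-partition index)
values = pd.Series([10.0, 20.0, 30.0, 40.0], index=[0, 1, 0, 1])
target = pd.RangeIndex(4)

try:
    # Label-based alignment: ambiguous because labels 0 and 1 occur twice
    values.reindex(target)
    message = None
except ValueError as err:
    message = str(err)

print(message)

# Positional alignment ignores the labels and keeps the row order
aligned = pd.Series(values.to_numpy(), index=target)
print(aligned.tolist())
```

Whether a positional assignment is appropriate inside transform() depends on the row order being preserved across partitions, which this sketch does not verify.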

Note: This is the root cause of this issue I encountered previously.

Environment

  • spatialdata version: 0.7.3a1.dev9+g094b86905
  • dask: 2026.1.1
  • pandas: 2.3.3
