Description
With the most recent (unreleased) spatialdata version, transform() fails when a points element comes from a large parquet file that gets split into multiple partitions on read.
Reproduce

```python
import numpy as np
import pandas as pd
import dask.dataframe as dd
import spatialdata as sd

n = 40_000_000  # number of points
df = pd.DataFrame({
    "x": 100 * np.random.rand(n),
    "y": 100 * np.random.rand(n),
})
df.to_parquet("/tmp/points_file.parquet")

ddf = dd.read_parquet("/tmp/points_file.parquet")
points = sd.models.PointsModel.parse(ddf)
sdata = sd.SpatialData.init_from_elements({"transcripts": points})

sd.transform(sdata, to_coordinate_system='global')
```

Error
```
ValueError: cannot reindex on an axis with duplicate labels

Traceback (most recent call last):
  File "<ipython-input-1>", line 1
    sd.transform(sdata, to_coordinate_system='global')
  File ".../spatialdata/_core/operations/transform.py", line 288, in _
    return data.transform_to_coordinate_system(target_coordinate_system=to_coordinate_system)
  File ".../spatialdata/_core/spatialdata.py", line 880, in transform_to_coordinate_system
    transformed = sdata.transform_element_to_coordinate_system(...)
  File ".../spatialdata/_core/spatialdata.py", line 819, in transform_element_to_coordinate_system
    transformed = transform(element, to_coordinate_system=target_coordinate_system, maintain_positioning=maintain_positioning)
  File ".../spatialdata/_core/operations/transform.py", line 466, in _
    new_col = pd.Series(new_ax.data.flatten().compute(), index=transformed.index)
  File ".../pandas/core/indexes/base.py", line 4436, in reindex
    raise ValueError("cannot reindex on an axis with duplicate labels")
```
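The final frame of the traceback can be reproduced in plain pandas, independent of spatialdata or dask (a minimal sketch, assuming the failure mode is the usual one: reindexing any axis that contains duplicate labels raises this ValueError):

```python
import pandas as pd

# An index with a repeated label (0 appears twice).
s = pd.Series([1.0, 2.0, 3.0], index=[0, 0, 1])

try:
    s.reindex([0, 1])
except ValueError as e:
    print(e)  # cannot reindex on an axis with duplicate labels
```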
Note: This is the root cause of this issue I encountered previously.
Environment

- spatialdata version: 0.7.3a1.dev9+g094b86905
- dask: 2026.1.1
- pandas: 2.3.3