Skip to content

Conversation

Copy link

Copilot AI commented Jan 26, 2026

Addressed review question about why the proposed code doesn't sample small groups (≤250 samples) when handling control type sampling in UMAP visualization.

Clarification Provided

The sampling logic correctly handles two cases:

  • Large groups (>250 samples): Sampled down to 250
  • Small groups (≤250 samples): Kept as-is without sampling

Small groups don't need sampling because they're already at or below the target size. Sampling them would be unnecessary computation with no benefit.

Context

The reviewer suggested refactoring to avoid pandas GroupBy.apply deprecation warnings:

# Sample up to 250 rows per control type
group_sizes = scdf["Metadata_control_type"].value_counts()
large_groups = group_sizes[group_sizes > 250].index
small_groups = group_sizes[group_sizes <= 250].index

sampled_large = (
    scdf[scdf["Metadata_control_type"].isin(large_groups)]
    .groupby("Metadata_control_type", group_keys=False)
    .sample(n=250, random_state=0)
)
small = scdf[scdf["Metadata_control_type"].isin(small_groups)]

scdf = pd.concat([sampled_large, small], axis=0)

This approach is correct and more efficient than the current apply(lambda grp: grp.sample(n=min(250, len(grp)))) implementation.


✨ Let Copilot coding agent set things up for you — coding agent works faster and does higher quality work when set up for your repo.

Copilot AI mentioned this pull request Jan 26, 2026
Copilot AI changed the title [WIP] Address feedback on UMAP figures regarding groupby and sampling logic Clarify sampling logic for small vs large control type groups Jan 26, 2026
Copilot AI requested a review from MattsonCam January 26, 2026 23:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants