FIX: Stable random sampling in DatasetConfiguration #1697

Open
adrian-gavrila wants to merge 1 commit into microsoft:main from adrian-gavrila:adrian-gavrila/stable-dataset-sampling

@adrian-gavrila
Contributor

Description

When a Scenario runs with include_default_baseline=True and a DatasetConfiguration whose max_dataset_size is set, the baseline atomic attack evaluated a different random subset of objectives than the strategy-based atomic attacks. As a result, baseline-vs-strategy success-rate comparisons measured two different populations and were meaningless.

Root cause: random.sample ran fresh on every call to DatasetConfiguration.get_seed_groups() (Path 1, used by most scenarios) and get_all_seeds() (Path 2, used by EncodingDatasetConfiguration).
Scenario._get_atomic_attacks_async and Scenario._get_baseline_data each called these methods independently and got different samples.
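
To illustrate the root cause, here is a minimal sketch of the pre-fix behavior. `UnstableConfig` is a hypothetical stand-in, not PyRIT's actual `DatasetConfiguration`, and the objectives are plain strings rather than seed groups. It only shows that two independent calls draw two independent samples.

```python
import random


class UnstableConfig:
    """Hypothetical stand-in for the pre-fix behavior; not PyRIT's actual code."""

    def __init__(self, objectives: list[str], max_dataset_size: int) -> None:
        self._objectives = list(objectives)
        self._max_dataset_size = max_dataset_size

    def get_seed_groups(self) -> list[str]:
        # A fresh sample on every call: two callers see two populations.
        return random.sample(self._objectives, self._max_dataset_size)


config = UnstableConfig([f"objective-{i}" for i in range(100)], max_dataset_size=3)

# The baseline path and the strategy path each resolve the dataset on their own.
baseline_objectives = set(config.get_seed_groups())
strategy_objectives = set(config.get_seed_groups())

overlap = baseline_objectives & strategy_objectives
print(f"shared objectives: {len(overlap)} of {len(baseline_objectives)}")
```

With 100 objectives and a sample size of 3, the two sets will usually share few or no elements, which is the mismatch described above.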

Fix: memoize both methods. The resolved sample is cached for the lifetime of the configuration object, and reassigning max_dataset_size invalidates the cache. Returns are defensive container copies so
callers can mutate without poisoning the cache. max_dataset_size is now a property whose setter re-validates the value (mirroring __init__).
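
The shape of the fix can be sketched roughly as follows. This is an illustration under assumptions, not the PR's implementation: `MemoizedConfig` is a hypothetical class, seeds are plain strings, and the validation rule (positive integer or `None`) is assumed. Only the attribute names `_resolved_seeds_cache` and `_max_dataset_size` are taken from the diff.

```python
import random
from typing import Optional


class MemoizedConfig:
    """Hypothetical stand-in for the fixed configuration; not PyRIT's actual code."""

    def __init__(self, seeds: list[str], max_dataset_size: Optional[int] = None) -> None:
        self._seeds = list(seeds)
        self._resolved_seeds_cache: Optional[list[str]] = None
        self._max_dataset_size: Optional[int] = None
        # Route through the setter so construction and reassignment validate alike.
        self.max_dataset_size = max_dataset_size

    @property
    def max_dataset_size(self) -> Optional[int]:
        return self._max_dataset_size

    @max_dataset_size.setter
    def max_dataset_size(self, value: Optional[int]) -> None:
        # Assumed validation rule; the real check may differ.
        if value is not None and value < 1:
            raise ValueError("max_dataset_size must be a positive integer or None")
        self._max_dataset_size = value
        # A new size makes the previously drawn sample stale.
        self._resolved_seeds_cache = None

    def get_all_seeds(self) -> list[str]:
        if self._resolved_seeds_cache is None:
            size = self._max_dataset_size
            if size is not None and size < len(self._seeds):
                self._resolved_seeds_cache = random.sample(self._seeds, size)
            else:
                self._resolved_seeds_cache = list(self._seeds)
        # Return a copy so callers cannot mutate the cached sample.
        return list(self._resolved_seeds_cache)
```

Returning a shallow copy protects the cached container but not the objects inside it; whether that is sufficient depends on how callers treat individual seeds.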

Subclasses inherit the fix automatically when they use base resolution methods. A short subclassing note in the class docstring flags the two methods that any future override must memoize itself.

Tests and Documentation

  • New TestDatasetConfigurationMemoization and TestDatasetConfigurationMaxDatasetSizeSetter classes in test_dataset_configuration.py covering both call paths, multi-dataset stability, cache
    invalidation, setter validation, and defensive-copy semantics. All randomness-sensitive tests patch random.sample for determinism.
  • Encoding-specific regression test in test_encoding.py (the override routes through get_all_seeds, which is why both paths needed memoization).
  • End-to-end regression test in test_scenario.py asserting set(baseline.objectives) == set(strategy.objectives) after initialize_async with max_dataset_size set.

Verified by stashing the production change and watching the new tests fail (7 failures), then restoring and watching them pass.

Memoize get_seed_groups() and get_all_seeds() so the random subset
selected when max_dataset_size is set is stable for the lifetime of the
configuration. Reassigning max_dataset_size invalidates the cache.

Without this, baseline and strategy atomic attacks each call
get_all_seed_attack_groups() independently and receive different random
subsets of objectives, making baseline-vs-strategy comparison
meaningless.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
self._scenario_strategies = scenario_strategies
self._resolved_groups_cache: Optional[dict[str, list[SeedGroup]]] = None
self._resolved_seeds_cache: Optional[list[Seed]] = None
self._max_dataset_size: Optional[int] = None
Contributor

@rlundeen2 May 7, 2026

Could we simplify this?

Instead of a cache, what if we added a baseline scenario technique that is just PromptSending? We would get rid of this in initialize:

        if self._include_baseline:
            baseline_attack = self._get_baseline()
            self._atomic_attacks.insert(0, baseline_attack)

and

    def _get_baseline(self) -> AtomicAttack:

And instead add a tag in _get_attack_technique_factories that adds a PromptSending technique as baseline?

_build_display_group would also likely need to be updated to support baseline?

There might be some hiccups, but it feels more natural to include the baseline as an additional technique than to cache the datasets.
