Skip to content

Adding TabSyn examples and ensemble attack code#143

Open
lotif wants to merge 52 commits into
mainfrom
marcelo/tabsyn-ensemble
Open

Adding TabSyn examples and ensemble attack code#143
lotif wants to merge 52 commits into
mainfrom
marcelo/tabsyn-ensemble

Conversation

@lotif

@lotif lotif commented Jun 3, 2026

Copy link
Copy Markdown
Collaborator

PR Type

Feature

Short Description

Clickup Ticket(s): https://app.clickup.com/t/10524786/868jvwvax

A few things in this PR:

  • Adding an example for TabSyn full pipeline (train, synthesize and evaluate)
  • Adding an example for Ensemble Attack using TabSyn
  • Adding EnsembleAttackTabSynModelRunner to src/midst_toolkit/attacks/ensemble/models.py so the attack knows how to run the model
  • Renaming test_attack_model.py files to stop pytest thinking they are test files
  • A few small refactorings to simplify things

FYI a lot of the code in the examples was copied from the CTGAN example, and the EnsembleAttackTabSynModelRunner was adapted from the existing TabSyn automated test.

Tests Added

No tests added.

@lotif lotif changed the title Marcelo/tabsyn ensemble Adding TabSyn examples and ensemble attack code Jun 9, 2026
@lotif lotif requested a review from emersodb June 9, 2026 20:43
@lotif lotif marked this pull request as ready for review June 9, 2026 20:43
@coderabbitai

coderabbitai Bot commented Jun 9, 2026

Copy link
Copy Markdown

Review Change Stack

📝 Walkthrough

Walkthrough

Updates five GitHub Actions workflows to newer pinned action versions. Adds TabSyn single-table training, synthesis, and evaluation entrypoints with matching core preprocessing and type changes. Adds TabSyn ensemble-attack configuration, runner, challenge-dataset generation, attack training and testing scripts, and attack-success computation. Updates GAN example README/config text, base data directory handling, challenge-dataset logging, and RMIA Gower preprocessing.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly summarizes the main change: adding TabSyn examples and ensemble attack support.
Description check ✅ Passed The description matches the template with PR Type, Short Description, and Tests Added sections filled in.
Docstring Coverage ✅ Passed Docstring coverage is 82.35% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch marcelo/tabsyn-ensemble
⚔️ Resolve merge conflicts
  • Resolve merge conflict in branch marcelo/tabsyn-ensemble

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

* Bump actions/checkout from 6.0.2 to 6.0.3 (#145)

Bumps [actions/checkout](https://github.com/actions/checkout) from 6.0.2 to 6.0.3.
- [Release notes](https://github.com/actions/checkout/releases)
- [Changelog](https://github.com/actions/checkout/blob/main/CHANGELOG.md)
- [Commits](actions/checkout@v6.0.2...v6.0.3)

---
updated-dependencies:
- dependency-name: actions/checkout
  dependency-version: 6.0.3
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Marcelo Lotif <lotif@users.noreply.github.com>

* Bump astral-sh/setup-uv from 8.1.0 to 8.2.0 (#144)

Bumps [astral-sh/setup-uv](https://github.com/astral-sh/setup-uv) from 8.1.0 to 8.2.0.
- [Release notes](https://github.com/astral-sh/setup-uv/releases)
- [Commits](astral-sh/setup-uv@v8.1.0...v8.2.0)

---
updated-dependencies:
- dependency-name: astral-sh/setup-uv
  dependency-version: 8.2.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Marcelo Lotif <lotif@users.noreply.github.com>

* Bump actions/checkout from 6.0.3 to 7.0.0

Bumps [actions/checkout](https://github.com/actions/checkout) from 6.0.3 to 7.0.0.
- [Release notes](https://github.com/actions/checkout/releases)
- [Changelog](https://github.com/actions/checkout/blob/main/CHANGELOG.md)
- [Commits](actions/checkout@v6.0.3...v7.0.0)

---
updated-dependencies:
- dependency-name: actions/checkout
  dependency-version: 7.0.0
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Marcelo Lotif <lotif@users.noreply.github.com>

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 16

🧹 Nitpick comments (4)
examples/tabsyn/ensemble_attack/train_attack_model.py (1)

19-28: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

Docstring says "CTGAN model" but this entrypoint uses TabSyn.

Line 62 constructs EnsembleAttackTabSynModelRunner; update the docstring to reference TabSyn to avoid confusion.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/tabsyn/ensemble_attack/train_attack_model.py` around lines 19 - 28,
The docstring in the attack training entrypoint is inconsistent with the actual
implementation: it says “CTGAN model” even though the script constructs
EnsembleAttackTabSynModelRunner. Update the docstring in train_attack_model to
describe the TabSyn pipeline instead, keeping the rest of the step-by-step
summary aligned with the actual runner and attack flow.
examples/tabsyn/ensemble_attack/compute_attack_success.py (1)

21-23: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

Stale "CTGAN" reference in comment.

This is the TabSyn example; the comment mentions CTGAN. Minor wording fix to keep the TODO/context accurate.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/tabsyn/ensemble_attack/compute_attack_success.py` around lines 21 -
23, The comment in compute_attack_success is stale because it references CTGAN
even though this TabSyn example is not about that model. Update the wording near
target_ids to describe the TabSyn-specific context accurately, keeping the TODO
focused on the placeholder target ID requirement without mentioning CTGAN.
examples/tabsyn/evaluate.py (1)

131-144: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

cfg.ks_tv.run is listed twice.

The quality any([...]) check includes cfg.ks_tv.run at both Line 133 and Line 135. The duplicate is harmless but should be removed; if one of them was meant to be a different metric flag, please correct it.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/tabsyn/evaluate.py` around lines 131 - 144, The any([...]) metric
gate in evaluate() includes cfg.ks_tv.run twice, so remove the duplicate entry
and verify the remaining list in evaluate() only contains one flag per metric.
If the second occurrence was intended to represent a different check, replace it
with the correct metric symbol from the same group (for example, another
cfg.*.run field) so the set of enabled evaluations is accurate.
examples/tabsyn/synthesize.py (1)

49-86: 🚀 Performance & Scalability | 🔵 Trivial | ⚡ Quick win

Avoid running preprocess twice.

preprocess is invoked once for categories/d_numerical and again with inverse=True for num_inverse/cat_inverse, repeating all preprocessing work. A single call with inverse=True returns all six values and halves the cost.

♻️ Suggested refactor
-    _, _, categories, d_numerical = preprocess(  # type: ignore[misc]
-        dataset_path=dataset_path,
-        ref_dataset_path=ref_dataset_path,
-        transforms=tabsyn_config["transforms"],
-        task_type=tabsyn_config["task_type"],
-    )
+    _, _, categories, d_numerical, num_inverse, cat_inverse = preprocess(  # type: ignore[misc]
+        dataset_path=dataset_path,
+        ref_dataset_path=ref_dataset_path,
+        transforms=tabsyn_config["transforms"],
+        task_type=tabsyn_config["task_type"],
+        inverse=True,
+    )

Then drop the second preprocess(..., inverse=True) call at lines 80-86.

Please confirm preprocess(..., inverse=True) returns the 6-tuple in the order (numerical, categorical, categories, d_numerical, num_inverse, cat_inverse).

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/tabsyn/synthesize.py` around lines 49 - 86, `preprocess(...,
inverse=True)` should be used once to avoid duplicating preprocessing in
`synthesize.py`; update the `preprocess` call so it returns the full 6-tuple and
reuse the existing values for `categories`, `d_numerical`, `num_inverse`, and
`cat_inverse`. Make sure the return order from `preprocess` is confirmed as
`(numerical, categorical, categories, d_numerical, num_inverse, cat_inverse)`,
then remove the second `preprocess(..., inverse=True)` invocation and wire the
unpacking in `synthesize()` accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.github/workflows/code_checks.yml:
- Line 32: The workflow step using actions/checkout is still pinned to a mutable
version tag and leaves persisted git credentials enabled. Update the checkout
step in code_checks.yml to use the release commit SHA for actions/checkout
instead of v7.0.0, and add persist-credentials: false since this job does not
require repository auth. Use the actions/checkout step identifier to locate the
change.

In @.github/workflows/docs.yml:
- Around line 45-48: The docs workflow still references mutable action tags and
should be hardened by pinning every action use to a full commit SHA. Update the
`actions/checkout` steps in `docs.yml` and the other workflow action references
(`astral-sh/setup-uv`, `actions/setup-python`, `actions/upload-artifact`,
`actions/download-artifact`, and `JamesIves/github-pages-deploy-action`) to
immutable SHAs, and add `persist-credentials: false` to both `actions/checkout`
invocations unless a later step needs the checkout token.

In @.github/workflows/integration_tests.yml:
- Line 44: The integration_tests workflow still references a mutable
actions/checkout tag and keeps git credentials persisted by default. Update the
checkout step in the workflow to use the release commit SHA for actions/checkout
instead of the version tag, and add persist-credentials set to false unless a
later step explicitly needs repository authentication. Use the existing checkout
step in integration_tests.yml as the target for this change.

In @.github/workflows/publish.yml:
- Line 19: The publish workflow is using a mutable actions reference and leaving
git credentials persisted. Update the checkout step in the publish job to use a
full commit SHA for actions/checkout instead of the version tag, and add
persist-credentials set to false unless this job explicitly needs repository
auth. Keep the change localized to the checkout step so the workflow remains
otherwise unchanged.

In @.github/workflows/unit_tests.yml:
- Line 44: Update the checkout step in the unit test workflow to use a pinned
full commit SHA for actions/checkout instead of the version tag, and configure
the checkout action with persist-credentials set to false since the workflow
does not need git authentication afterward. Use the existing checkout step in
unit_tests.yml as the target and make sure the action reference and its options
are adjusted together.

In `@examples/tabsyn/ensemble_attack/config.yaml`:
- Around line 32-45: The shadow_training section still uses CTGAN-specific
labels that do not apply to the TabSyn ensemble-attack runner. Remove or replace
the unused model_name field in shadow_training, and update the nearby
model_config and inline comments in this config block to use TabSyn/tabddpm
terminology so the configuration reflects what the ensemble_attack flow actually
consumes.

In `@examples/tabsyn/ensemble_attack/make_challenge_dataset.py`:
- Around line 19-24: The challenge dataset sampling in make_challenge_dataset is
non-deterministic and can fail when the untrained pool is too small. Update the
untrained_data sampling to use config.ensemble_attack.random_seed as the
random_state so runs are reproducible, and guard against len(training_data)
exceeding the available untrained rows before calling sample. Keep the fix
localized around the training_data, untrained_data, and challenge_data
construction logic.

In `@examples/tabsyn/ensemble_attack/README.md`:
- Around line 8-15: The README’s data path and config key references are
inconsistent with the actual example configuration. Update the instructions in
README so the documented folder location matches the path used by config.yaml,
and replace the mention of base_data_dir with the real data_dir attribute. Make
sure the setup steps for the ensemble_attack example point users to the same
data directory that the config expects, using the config.yaml data_dir setting
as the source of truth.
- Around line 94-95: The README’s documented output path is incorrect and does
not match the actual results location used by compute_attack_success.py. Update
the path referenced in the ensemble_attack README to use the same results_dir
configured in config.yaml and passed to experiment_directory in
compute_attack_success.py, so the saved
attack_success_for_xgb_metaclassifier_model.txt path reflects
examples/tabsyn/ensemble_attack/results instead of the old
examples/tabsyn/results location.

In `@examples/tabsyn/evaluate.py`:
- Around line 54-57: The metric logging in evaluate() is malformed because the
raw f-string makes the backslash literal, so the output shows \Score instead of
a properly separated label. Update the log message in the results loop to use
normal string escaping/formatting so the metric name and “Score” appear on
separate, readable parts, and keep the surrounding SEPARATOR/header logging
consistent.

In `@examples/tabsyn/README.md`:
- Around line 83-86: The README has a filename mismatch and a typo: it documents
the sampled training data as `{table_name}_samples.csv` and says “sampleed”, but
the actual writer in `train.py` uses `_sample_data_if_needed` to produce
`{table_name}_sampled.csv`. Update the documentation in
`examples/tabsyn/README.md` to match the real filename used by
`_sample_data_if_needed` and correct the spelling so the data path instructions
are consistent.
- Around line 68-73: The Alpha Precision example command is still using CTGAN
paths instead of TabSyn paths. Update the command in the TabSyn README so the
midsts_alpha_precision_eval invocation uses the TabSyn-specific meta info and
output directory locations, matching the existing examples/tabsyn/ structure
rather than examples/gan/. Keep the rest of the command unchanged and verify the
referenced synthetic and real data paths remain correct for the TabSyn example.

In `@examples/tabsyn/train.py`:
- Around line 169-181: The _sample_data_if_needed helper in train.py samples
data non-reproducibly and can fail when sample_size is larger than the available
rows. Update the DataFrame.sample call to use a fixed random_state for
deterministic results, and guard the n argument by clamping sample_size to the
number of rows (or otherwise handling oversized requests) before sampling. Keep
the fix localized to _sample_data_if_needed and preserve the existing behavior
of writing the sampled CSVs.

In `@src/midst_toolkit/attacks/ensemble/models.py`:
- Around line 405-411: The docstring for the training method in models.py
overstates how synthetic sample count is configured, since `_synthesize` does
not use `number_of_points_to_synthesize` and instead derives `num_samples` from
`train_z_att["num_samples"]` before calling `tabsyn.sample(...)`. Update the
implementation to pass the configured synthesis count through the
`_synthesize`/`tabsyn.sample` flow, or revise the docstring to match the actual
behavior; use the `number_of_points_to_synthesize`, `_synthesize`, and
`tabsyn.sample` symbols to locate the affected code.

In `@src/midst_toolkit/attacks/ensemble/rmia/rmia_calculation.py`:
- Around line 126-128: Keep the synthetic and input schemas aligned before
calling gower_matrix in rmia_calculation. The current drop of "_id" columns in
df_synthetic leaves df_input and categorical_features based on a different
column layout, so update the same filtering logic for the input frame or
recompute the categorical mask after both frames are aligned. Make the fix in
the code path around the df_synthetic drop and the subsequent gower.gower_matrix
call so both matrices share identical columns and mask positions.

In `@src/midst_toolkit/models/tabsyn/preprocessing.py`:
- Around line 154-156: The test-file loading path in preprocessing.py is using
the wrong reader for .xls inputs, so the branch that sets data_path/test_path
for train.xls must also make the test handling in the same preprocessing flow
respect test_path.suffix. Update the logic around the tabsyn preprocessing
loader to mirror the train-side behavior: use pd.read_excel for .xls files, keep
pd.read_csv for CSVs, and apply the same ID-column drop to the test dataframe as
already done for the train dataframe in the relevant preprocessing function.

---

Nitpick comments:
In `@examples/tabsyn/ensemble_attack/compute_attack_success.py`:
- Around line 21-23: The comment in compute_attack_success is stale because it
references CTGAN even though this TabSyn example is not about that model. Update
the wording near target_ids to describe the TabSyn-specific context accurately,
keeping the TODO focused on the placeholder target ID requirement without
mentioning CTGAN.

In `@examples/tabsyn/ensemble_attack/train_attack_model.py`:
- Around line 19-28: The docstring in the attack training entrypoint is
inconsistent with the actual implementation: it says “CTGAN model” even though
the script constructs EnsembleAttackTabSynModelRunner. Update the docstring in
train_attack_model to describe the TabSyn pipeline instead, keeping the rest of
the step-by-step summary aligned with the actual runner and attack flow.

In `@examples/tabsyn/evaluate.py`:
- Around line 131-144: The any([...]) metric gate in evaluate() includes
cfg.ks_tv.run twice, so remove the duplicate entry and verify the remaining list
in evaluate() only contains one flag per metric. If the second occurrence was
intended to represent a different check, replace it with the correct metric
symbol from the same group (for example, another cfg.*.run field) so the set of
enabled evaluations is accurate.

In `@examples/tabsyn/synthesize.py`:
- Around line 49-86: `preprocess(..., inverse=True)` should be used once to
avoid duplicating preprocessing in `synthesize.py`; update the `preprocess` call
so it returns the full 6-tuple and reuse the existing values for `categories`,
`d_numerical`, `num_inverse`, and `cat_inverse`. Make sure the return order from
`preprocess` is confirmed as `(numerical, categorical, categories, d_numerical,
num_inverse, cat_inverse)`, then remove the second `preprocess(...,
inverse=True)` invocation and wire the unpacking in `synthesize()` accordingly.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 3e600db0-a23a-4204-a9d7-db085388e76f

📥 Commits

Reviewing files that changed from the base of the PR and between 9f0e89c and 67b1f03.

📒 Files selected for processing (30)
  • .github/workflows/code_checks.yml
  • .github/workflows/docs.yml
  • .github/workflows/integration_tests.yml
  • .github/workflows/publish.yml
  • .github/workflows/unit_tests.yml
  • examples/gan/README.md
  • examples/gan/ensemble_attack/README.md
  • examples/gan/ensemble_attack/config.yaml
  • examples/gan/ensemble_attack/make_challenge_dataset.py
  • examples/gan/ensemble_attack/run_test_attack_model.py
  • examples/gan/ensemble_attack/utils.py
  • examples/tabsyn/README.md
  • examples/tabsyn/config.yaml
  • examples/tabsyn/ensemble_attack/README.md
  • examples/tabsyn/ensemble_attack/compute_attack_success.py
  • examples/tabsyn/ensemble_attack/config.yaml
  • examples/tabsyn/ensemble_attack/make_challenge_dataset.py
  • examples/tabsyn/ensemble_attack/run_test_attack_model.py
  • examples/tabsyn/ensemble_attack/train_attack_model.py
  • examples/tabsyn/evaluate.py
  • examples/tabsyn/synthesize.py
  • examples/tabsyn/tabsyn_config.toml
  • examples/tabsyn/train.py
  • src/midst_toolkit/attacks/ensemble/models.py
  • src/midst_toolkit/attacks/ensemble/rmia/rmia_calculation.py
  • src/midst_toolkit/models/tabsyn/dataset.py
  • src/midst_toolkit/models/tabsyn/pipeline.py
  • src/midst_toolkit/models/tabsyn/preprocessing.py
  • tests/integration/assets/tabsyn/trans_info.json
  • tests/integration/models/tabsyn/test_train.py
💤 Files with no reviewable changes (1)
  • examples/gan/ensemble_attack/config.yaml

runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v6.0.2
- uses: actions/checkout@v7.0.0

@coderabbitai coderabbitai Bot Jun 24, 2026

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔒 Security & Privacy | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Inspect the workflow and surrounding steps.
git ls-files .github/workflows/code_checks.yml
printf '\n--- file with line numbers ---\n'
cat -n .github/workflows/code_checks.yml | sed -n '1,220p'

printf '\n--- checkout usages in workflows ---\n'
rg -n "actions/checkout@|persist-credentials|git auth|github.token" .github/workflows -S

Repository: VectorInstitute/midst-toolkit

Length of output: 2624


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Gather deterministic evidence about the checkout reference and any later git operations.
python3 - <<'PY'
from pathlib import Path
p = Path('.github/workflows/code_checks.yml')
print(p.exists(), p)
if p.exists():
    text = p.read_text()
    print("uses_checkout", "actions/checkout@" in text)
    print("persist_credentials", "persist-credentials" in text)
    print("git_commands", any(s in text for s in ["git ", "gh ", "npm publish", "npm version", "git push", "git commit"]))
PY

Repository: VectorInstitute/midst-toolkit

Length of output: 270


Pin actions/checkout by SHA and disable persisted credentials.
.github/workflows/code_checks.yml:32 still uses the mutable v7.0.0 tag; replace it with the release commit SHA and set persist-credentials: false here, since this job doesn’t need git auth.

🧰 Tools
🪛 zizmor (1.26.1)

[warning] 32-32: credential persistence through GitHub Actions artifacts (artipacked): does not set persist-credentials: false

(artipacked)


[error] 32-32: unpinned action reference (unpinned-uses): action is not pinned to a hash (required by blanket policy)

(unpinned-uses)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/code_checks.yml at line 32, The workflow step using
actions/checkout is still pinned to a mutable version tag and leaves persisted
git credentials enabled. Update the checkout step in code_checks.yml to use the
release commit SHA for actions/checkout instead of v7.0.0, and add
persist-credentials: false since this job does not require repository auth. Use
the actions/checkout step identifier to locate the change.

Source: Linters/SAST tools

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lotif: I guess we can do this, but I don't see it as essential.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@emersodb Fair point — I agree this is more defense-in-depth hardening than something essential for correctness in this PR.

The recommendation still stands as a best practice for GitHub Actions, but I’m fine treating it as non-blocking and leaving it for a follow-up if you want to keep this change focused.

Comment on lines +45 to +48
uses: actions/checkout@v7.0.0

- name: Install uv
uses: astral-sh/setup-uv@v8.1.0
uses: astral-sh/setup-uv@v8.2.0

@coderabbitai coderabbitai Bot Jun 24, 2026

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔒 Security & Privacy | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Inspect the workflow file around the referenced lines
nl -ba .github/workflows/docs.yml | sed -n '1,140p'

Repository: VectorInstitute/midst-toolkit

Length of output: 207


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Show the workflow file and any action refs used in it
grep -nE 'uses:|persist-credentials' .github/workflows/docs.yml

Repository: VectorInstitute/midst-toolkit

Length of output: 488


🏁 Script executed:

#!/bin/bash
set -euo pipefail

nl -ba .github/workflows/docs.yml | sed -n '1,140p'

Repository: VectorInstitute/midst-toolkit

Length of output: 207


🏁 Script executed:

#!/bin/bash
set -euo pipefail

awk 'NR>=35 && NR<=110 {printf "%4d:%s\n", NR, $0}' .github/workflows/docs.yml

Repository: VectorInstitute/midst-toolkit

Length of output: 2097


🏁 Script executed:

#!/bin/bash
set -euo pipefail

awk 'NR>=1 && NR<=140 {printf "%4d:%s\n", NR, $0}' .github/workflows/docs.yml

Repository: VectorInstitute/midst-toolkit

Length of output: 2856


🏁 Script executed:

#!/bin/bash
set -euo pipefail

python3 - <<'PY'
from pathlib import Path
p = Path('.github/workflows/docs.yml')
for i, line in enumerate(p.read_text().splitlines(), 1):
    if 35 <= i <= 110:
        print(f"{i:4d}:{line}")
PY

Repository: VectorInstitute/midst-toolkit

Length of output: 2097


Pin all workflow actions to commit SHAs and disable checkout credentials

  • docs.yml still uses mutable refs for actions/checkout, astral-sh/setup-uv, actions/setup-python, actions/upload-artifact, actions/download-artifact, and JamesIves/github-pages-deploy-action; replace them with full commit SHAs.
  • Add persist-credentials: false to both actions/checkout steps unless a later git push depends on the checkout token.
🧰 Tools
🪛 zizmor (1.26.1)

[error] 45-45: unpinned action reference (unpinned-uses): action is not pinned to a hash (required by blanket policy)

(unpinned-uses)


[error] 48-48: unpinned action reference (unpinned-uses): action is not pinned to a hash (required by blanket policy)

(unpinned-uses)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/docs.yml around lines 45 - 48, The docs workflow still
references mutable action tags and should be hardened by pinning every action
use to a full commit SHA. Update the `actions/checkout` steps in `docs.yml` and
the other workflow action references (`astral-sh/setup-uv`,
`actions/setup-python`, `actions/upload-artifact`, `actions/download-artifact`,
and `JamesIves/github-pages-deploy-action`) to immutable SHAs, and add
`persist-credentials: false` to both `actions/checkout` invocations unless a
later step needs the checkout token.

Source: Linters/SAST tools

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

repo="VectorInstitute/midst-toolkit"
pr="143"

echo "=== PR review comments mentioning workflow pinning / similar replies ==="
gh api "repos/$repo/pulls/$pr/comments?per_page=100" --paginate \
  --jq '.[] | {id, path, line, user: .user.login, in_reply_to_id, body, url} |
        select((.path | startswith(".github/workflows/")) or (.body | test("persist-credentials|unpinned|Same here|Pin all workflow actions"; "i")))' || true

echo
echo "=== Issue/PR conversation comments ==="
gh api "repos/$repo/issues/$pr/comments?per_page=100" --paginate \
  --jq '.[] | {user: .user.login, body, url}' || true

Length of output: 27247


@emersodb Fair enough — same assessment on this thread as well: this is useful hardening, but not essential to the correctness of this PR.

I’m good with treating it as non-blocking and leaving it for a follow-up if you want to keep this change focused.

runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v6.0.2
- uses: actions/checkout@v7.0.0

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔒 Security & Privacy | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Locate the workflow and inspect the relevant section with line numbers.
git ls-files .github/workflows/integration_tests.yml
printf '\n--- file excerpt ---\n'
sed -n '1,120p' .github/workflows/integration_tests.yml | cat -n

Repository: VectorInstitute/midst-toolkit

Length of output: 2456


🌐 Web query:

actions/checkout latest release version and default persist-credentials behavior

💡 Result:

As of June 24, 2026, the latest release version of the actions/checkout repository is v7.0.0, released on June 18, 2026 [1][2][3]. Regarding the persist-credentials input: - Default Behavior: The default value for persist-credentials is true [4][5][6]. - Purpose: When set to true, the action configures the Git repository with the provided authentication token (or SSH key) in the local Git configuration [4][5]. This allows your workflow scripts to run authenticated Git commands (such as git fetch or git push) directly [4][5]. - Security Implementation: The token is persisted in a separate file (distinct from the standard.git/config) to enhance security [5]. The action ensures that this token is automatically removed during the post-job cleanup phase [4][5]. - Opt-out: You can disable this behavior by explicitly setting persist-credentials: false in your workflow configuration if you do not require authenticated Git operations [4][5].

Citations:


🌐 Web query:

GitHub actions/checkout README persist-credentials default checkout version

💡 Result:

In the GitHub actions/checkout action, the persist-credentials input defaults to true [1][2]. When persist-credentials is set to its default value of true, the action configures the authentication token or SSH key in the local git configuration [1][2]. This allows subsequent steps in your workflow to run authenticated Git commands (such as git push or git fetch) [1][3]. The token or key is removed during the post-job cleanup process [1][3]. Starting with recent versions, these credentials are stored in a file under $RUNNER_TEMP to improve security, rather than directly in.git/config [3][4]. If you wish to disable this behavior, you must explicitly set persist-credentials to false in your workflow file [1].

Citations:


🌐 Web query:

site:github.com actions/checkout releases v7.0.0

💡 Result:

The actions/checkout v7.0.0 release was published on June 18, 2026 [1][2][3]. Key changes and highlights in v7.0.0 include: * Security Update: It introduces a mechanism to block checking out fork pull requests when using pull_request_target and workflow_run events to prevent potential security vulnerabilities [1][4]. * Dependency Updates: Various dependencies were updated, including @actions/core and @actions/tool-cache, and the module was upgraded to ECMAScript modules (ESM) [1][4]. * Refinements: General improvements such as updated error wording and internal dependency management [1][4]. Users can reference the new version in their workflows using: - uses: actions/checkout@v7 [5][6][3] For the full list of changes, you can view the changelog on the official GitHub repository [4].

Citations:


Pin actions/checkout and drop persisted credentials. .github/workflows/integration_tests.yml:44 still uses a mutable tag; switch to the release commit SHA and add persist-credentials: false unless a later step needs git auth.

🧰 Tools
🪛 zizmor (1.26.1)

[warning] 44-44: credential persistence through GitHub Actions artifacts (artipacked): does not set persist-credentials: false

(artipacked)


[error] 44-44: unpinned action reference (unpinned-uses): action is not pinned to a hash (required by blanket policy)

(unpinned-uses)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/integration_tests.yml at line 44, The integration_tests
workflow still references a mutable actions/checkout tag and keeps git
credentials persisted by default. Update the checkout step in the workflow to
use the release commit SHA for actions/checkout instead of the version tag, and
add persist-credentials set to false unless a later step explicitly needs
repository authentication. Use the existing checkout step in
integration_tests.yml as the target for this change.

Source: Linters/SAST tools

sudo apt-get update
sudo apt-get install libcurl4-openssl-dev libssl-dev
- uses: actions/checkout@v6.0.2
- uses: actions/checkout@v7.0.0

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔒 Security & Privacy | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Inspect the workflow and nearby context.
git ls-files .github/workflows/publish.yml
wc -l .github/workflows/publish.yml
cat -n .github/workflows/publish.yml | sed -n '1,120p'

Repository: VectorInstitute/midst-toolkit

Length of output: 1882


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Find whether checkout is used elsewhere with SHA pinning or persist-credentials settings.
rg -n --hidden --glob '.github/workflows/*.yml' --glob '.github/workflows/*.yaml' 'actions/checkout@|persist-credentials:' .github/workflows

Repository: VectorInstitute/midst-toolkit

Length of output: 604


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Check whether any workflow already pins actions/checkout by SHA in the repository.
rg -n --hidden 'actions/checkout@' .github/workflows . || true

Repository: VectorInstitute/midst-toolkit

Length of output: 1053


Pin actions/checkout to a SHA and disable persisted credentials.

Line 19 uses actions/checkout@v7.0.0; switch to a commit SHA and set persist-credentials: false unless this job needs git auth.

🧰 Tools
🪛 zizmor (1.26.1)

[warning] 19-19: credential persistence through GitHub Actions artifacts (artipacked): does not set persist-credentials: false

(artipacked)


[error] 19-19: unpinned action reference (unpinned-uses): action is not pinned to a hash (required by blanket policy)

(unpinned-uses)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/publish.yml at line 19, The publish workflow is using a
mutable actions reference and leaving git credentials persisted. Update the
checkout step in the publish job to use a full commit SHA for actions/checkout
instead of the version tag, and add persist-credentials set to false unless this
job explicitly needs repository auth. Keep the change localized to the checkout
step so the workflow remains otherwise unchanged.

Source: Linters/SAST tools

runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v6.0.2
- uses: actions/checkout@v7.0.0

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔒 Security & Privacy | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

sed -n '1,140p' .github/workflows/unit_tests.yml

Repository: VectorInstitute/midst-toolkit

Length of output: 1747


🏁 Script executed:

sed -n '1,140p' .github/workflows/unit_tests.yml

Repository: VectorInstitute/midst-toolkit

Length of output: 1747


Pin actions/checkout and disable credential persistence .github/workflows/unit_tests.yml:44 should use a full commit SHA instead of v7.0.0, and set persist-credentials: false since no later step needs git auth.

🧰 Tools
🪛 zizmor (1.26.1)

[warning] 44-44: credential persistence through GitHub Actions artifacts (artipacked): does not set persist-credentials: false

(artipacked)


[error] 44-44: unpinned action reference (unpinned-uses): action is not pinned to a hash (required by blanket policy)

(unpinned-uses)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/unit_tests.yml at line 44, Update the checkout step in the
unit test workflow to use a pinned full commit SHA for actions/checkout instead
of the version tag, and configure the checkout action with persist-credentials
set to false since the workflow does not need git authentication afterward. Use
the existing checkout step in unit_tests.yml as the target and make sure the
action reference and its options are adjusted together.

Source: Linters/SAST tools

Comment thread examples/tabsyn/README.md
Comment on lines +83 to +86
To compute those metrics, you can run the command below. The data files should
be under `/data/{table_name}.csv` for the real data, `/data/{table_name}_samples.csv`
for the sampleed data used for training, and `/results/{table_name}_synthetic.csv`
for the synthetic data.

@coderabbitai coderabbitai Bot Jun 24, 2026

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

Filename inconsistency: _samples.csv vs _sampled.csv.

The README documents the sampled training file as /data/{table_name}_samples.csv, but train.py writes {table_name}_sampled.csv (see _sample_data_if_needed). Also "sampleed" is a typo. Align the docs with the actual filename to avoid confusion.

📝 Suggested fix
-be under `/data/{table_name}.csv` for the real data, `/data/{table_name}_samples.csv`
-for the sampleed data used for training, and `/results/{table_name}_synthetic.csv`
+be under `/data/{table_name}.csv` for the real data, `/data/{table_name}_sampled.csv`
+for the sampled data used for training, and `/results/{table_name}_synthetic.csv`
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
To compute those metrics, you can run the command below. The data files should
be under `/data/{table_name}.csv` for the real data, `/data/{table_name}_samples.csv`
for the sampleed data used for training, and `/results/{table_name}_synthetic.csv`
for the synthetic data.
To compute those metrics, you can run the command below. The data files should
be under `/data/{table_name}.csv` for the real data, `/data/{table_name}_sampled.csv`
for the sampled data used for training, and `/results/{table_name}_synthetic.csv`
for the synthetic data.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/tabsyn/README.md` around lines 83 - 86, The README has a filename
mismatch and a typo: it documents the sampled training data as
`{table_name}_samples.csv` and says “sampleed”, but the actual writer in
`train.py` uses `_sample_data_if_needed` to produce `{table_name}_sampled.csv`.
Update the documentation in `examples/tabsyn/README.md` to match the real
filename used by `_sample_data_if_needed` and correct the spelling so the data
path instructions are consistent.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lotif: This is also good to fix up, here and other readmes

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "== Matches for _samples.csv in README files =="
rg -n --glob '**/README.md' '_samples\.csv'

echo
echo "== Matches for sampleed in README files =="
rg -n --glob '**/README.md' 'sampleed'

echo
echo "== Matches for _sampled.csv in README files =="
rg -n --glob '**/README.md' '_sampled\.csv'

Length of output: 682


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "== examples/tabsyn/README.md relevant lines =="
sed -n '78,90p' examples/tabsyn/README.md || true

echo
echo "== Matches for _samples.csv in README files =="
rg -n --glob '**/README.md' '_samples\.csv' || true

echo
echo "== Matches for sampleed in README files =="
rg -n --glob '**/README.md' 'sampleed' || true

echo
echo "== Matches for _sampled.csv in README files =="
rg -n --glob '**/README.md' '_sampled\.csv' || true

echo
echo "== README lines mentioning sampled/sample data =="
rg -n --glob '**/README.md' 'sampled|sample data|samples\.csv|sampled\.csv' || true

Length of output: 1382


@emersodb Agreed. I checked the current branch: the exact _samples.csv + sampleed issue is present in examples/tabsyn/README.md, and I didn’t find the same literal pattern in other README.md files. Still, a quick consistency pass across the other example READMEs while touching docs here sounds worthwhile.

Comment thread examples/tabsyn/train.py
Comment on lines +169 to +181
def _sample_data_if_needed(table_name: str, data_dir: Path, results_dir: Path, sample_size: int | None) -> str:
data_name = table_name
if sample_size:
log(INFO, f"Sampling {sample_size} rows from data...")
all_data = pd.read_csv(data_dir / f"{table_name}.csv")
sampled_data = all_data.sample(n=sample_size)
results_dir.mkdir(parents=True, exist_ok=True)

data_name = f"{table_name}_sampled"
sampled_data.to_csv(results_dir / f"{data_name}.csv", index=False)
sampled_data.to_csv(data_dir / f"{data_name}.csv", index=False)

return data_name

@coderabbitai coderabbitai Bot Jun 24, 2026

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

Sampling is non-reproducible and unguarded against oversized sample_size.

DataFrame.sample(n=sample_size) uses no random_state, so each run draws a different subset, making training/eval non-reproducible. It will also raise ValueError if sample_size exceeds the row count. Consider seeding and clamping.

📝 Suggested fix
     if sample_size:
         log(INFO, f"Sampling {sample_size} rows from data...")
         all_data = pd.read_csv(data_dir / f"{table_name}.csv")
-        sampled_data = all_data.sample(n=sample_size)
+        sampled_data = all_data.sample(n=min(sample_size, len(all_data)), random_state=42)
         results_dir.mkdir(parents=True, exist_ok=True)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def _sample_data_if_needed(table_name: str, data_dir: Path, results_dir: Path, sample_size: int | None) -> str:
data_name = table_name
if sample_size:
log(INFO, f"Sampling {sample_size} rows from data...")
all_data = pd.read_csv(data_dir / f"{table_name}.csv")
sampled_data = all_data.sample(n=sample_size)
results_dir.mkdir(parents=True, exist_ok=True)
data_name = f"{table_name}_sampled"
sampled_data.to_csv(results_dir / f"{data_name}.csv", index=False)
sampled_data.to_csv(data_dir / f"{data_name}.csv", index=False)
return data_name
def _sample_data_if_needed(table_name: str, data_dir: Path, results_dir: Path, sample_size: int | None) -> str:
data_name = table_name
if sample_size:
log(INFO, f"Sampling {sample_size} rows from data...")
all_data = pd.read_csv(data_dir / f"{table_name}.csv")
sampled_data = all_data.sample(n=min(sample_size, len(all_data)), random_state=42)
results_dir.mkdir(parents=True, exist_ok=True)
data_name = f"{table_name}_sampled"
sampled_data.to_csv(results_dir / f"{data_name}.csv", index=False)
sampled_data.to_csv(data_dir / f"{data_name}.csv", index=False)
return data_name
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/tabsyn/train.py` around lines 169 - 181, The _sample_data_if_needed
helper in train.py samples data non-reproducibly and can fail when sample_size
is larger than the available rows. Update the DataFrame.sample call to use a
fixed random_state for deterministic results, and guard the n argument by
clamping sample_size to the number of rows (or otherwise handling oversized
requests) before sampling. Keep the fix localized to _sample_data_if_needed and
preserve the existing behavior of writing the sampled CSVs.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lotif: Also a worthwhile change.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@emersodb Agreed — that would make the sampling path more robust and reproducible with a very localized change.

🐇👍


🧠 Learnings used
Learnt from: lotif
Repo: VectorInstitute/midst-toolkit PR: 107
File: examples/gan/synthesize.py:1-47
Timestamp: 2025-12-11T16:08:49.024Z
Learning: When using SDV (version >= 1.18.0), prefer loading a saved CTGANSynthesizer with CTGANSynthesizer.load(filepath) instead of sdv.utils.load_synthesizer(). This applies to Python code across the repo (e.g., any script that loads a CTGANSynthesizer). Ensure the SDV version is >= 1.18.0 before using CTGANSynthesizer.load, and fall back to sdv.utils.load_synthesizer() only if a compatible alternative is required.

Comment on lines +405 to +411

Args:
dataset: The dataset as a pandas DataFrame.
synthesize: Flag indicating whether to generate synthetic data after training.
The number of points to synthesize and the save directory is controlled by
the `number_of_points_to_synthesize` and `save_dir` attributes of the training
config. Optional, default is True.

@coderabbitai coderabbitai Bot Jun 24, 2026

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

Docstring overstates configurability of synthesis count.

The docstring states the number of points to synthesize is controlled by number_of_points_to_synthesize, but _synthesize derives num_samples from train_z_att["num_samples"] (i.e., the training set size) and never reads number_of_points_to_synthesize. Either wire the config value into tabsyn.sample(...) or correct the docstring to avoid misleading callers.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/midst_toolkit/attacks/ensemble/models.py` around lines 405 - 411, The
docstring for the training method in models.py overstates how synthetic sample
count is configured, since `_synthesize` does not use
`number_of_points_to_synthesize` and instead derives `num_samples` from
`train_z_att["num_samples"]` before calling `tabsyn.sample(...)`. Update the
implementation to pass the configured synthesis count through the
`_synthesize`/`tabsyn.sample` flow, or revise the docstring to match the actual
behavior; use the `number_of_points_to_synthesize`, `_synthesize`, and
`tabsyn.sample` symbols to locate the affected code.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lotif: This is a good point.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@emersodb Thanks for confirming — agreed, this looks like a valid mismatch to clean up.

🐇👍


🧠 Learnings used
Learnt from: lotif
Repo: VectorInstitute/midst-toolkit PR: 107
File: examples/gan/synthesize.py:1-47
Timestamp: 2025-12-11T16:08:49.024Z
Learning: When using SDV (version >= 1.18.0), prefer loading a saved CTGANSynthesizer with CTGANSynthesizer.load(filepath) instead of sdv.utils.load_synthesizer(). This applies to Python code across the repo (e.g., any script that loads a CTGANSynthesizer). Ensure the SDV version is >= 1.18.0 before using CTGANSynthesizer.load, and fall back to sdv.utils.load_synthesizer() only if a compatible alternative is required.

Comment on lines +126 to +128
# Drop id columns
id_columns_names = [col for col in df_synthetic.columns if "_id" in col]
df_synthetic = df_synthetic.drop(columns=id_columns_names)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# 1) Find all callsites of compute_gower_for_model and inspect nearby context.
rg -n -C4 '\bcompute_gower_for_model\s*\(' src examples tests

# 2) Inspect where categorical_features is constructed/passed.
rg -n -C4 '\bcategorical_features\b' src/midst_toolkit/attacks/ensemble/rmia

# 3) Inspect where df_input is prepared/dropped to confirm it mirrors the new synthetic _id-column drop.
rg -n -C4 '\bdf_input\b|\bdrop\s*\(.*_id|id_column_name' src/midst_toolkit/attacks/ensemble/rmia

Repository: VectorInstitute/midst-toolkit

Length of output: 45302


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Inspect the relevant function bodies and surrounding context.
sed -n '89,210p' src/midst_toolkit/attacks/ensemble/rmia/rmia_calculation.py

Repository: VectorInstitute/midst-toolkit

Length of output: 5540


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Inspect the relevant function bodies and surrounding context.
sed -n '89,210p' src/midst_toolkit/attacks/ensemble/rmia/rmia_calculation.py

Repository: VectorInstitute/midst-toolkit

Length of output: 5540


🌐 Web query:

gower.gower_matrix data_x data_y must have same columns categorical feature list length

💡 Result:

The error indicating that data_x and data_y must have the same number of columns or that the categorical feature list must match the number of columns typically arises in the gower Python package [1][2] when there is a mismatch between the input dataset's dimensionality and the provided categorical feature mask. In the gower library, the gower_matrix(data_x, data_y=None, cat_features=None) function requires that [1]: 1. Column Alignment: If both data_x and data_y are provided, they must have the same number of columns [1]. 2. Categorical Mask Length: If a cat_features argument is provided, it must be a boolean array or list with a length equal to the number of columns in the dataset [1][3][4]. To resolve this: - Ensure the cat_features mask has the correct length: If your dataset has N columns, ensure your boolean mask (or integer array) has exactly N entries [3][4]. You can verify this by checking data_x.shape[1] [1]. - Automatic detection: If you are unsure which features are categorical, you can omit the cat_features argument [1]. The gower package will attempt to automatically detect categorical columns by identifying non-numeric dtypes (e.g., objects or categories) in your dataframe [1]. - Consistent Input: Ensure that data_x and data_y (if used) are of the same format (e.g., both pandas DataFrames or both numpy arrays) and share the same feature space [1]. If you are using the gower package, checking your input shape with data_x.shape and comparing it to the length of your cat_features list is the most direct way to debug the mismatch [1].

Citations:


🌐 Web query:

gower.gower_matrix data_x data_y must have same columns categorical feature list length

💡 Result:

In the Python gower library, the gower.gower_matrix function computes Gower's distance [1][2]. When users encounter issues or errors related to "categorical feature list length," it typically stems from the optional cat_features argument [3]. The cat_features parameter accepts a boolean list or array that specifies which columns in your data should be treated as categorical [3]. The length of this list must exactly match the number of columns (features) in the input data frame [3]. If the length of cat_features does not correspond to the number of columns in your data, or if there is a mismatch between the provided categorical list and the input structure, the function will raise an error [3]. To resolve this: 1. Ensure the length of your cat_features list is equal to the number of columns in your input data [3]. 2. Alternatively, you can omit the cat_features argument entirely, in which case the library will attempt to infer the data types automatically (often based on pandas dtypes like 'object') [3]. Note that this behavior is distinct from the R gower package, which primarily handles pairings between two data sets (x and y) and manages column matching via pair_x and pair_y arguments [4][5][6]. If you are working in Python, focus on ensuring your cat_features boolean mask aligns with the columns of your input DataFrame [3].

Citations:


Keep the synthetic and input schemas aligned before Gower distance calculation.
df_synthetic drops every "_id" column here, but df_input and categorical_features are still built from the unfiltered input. gower.gower_matrix() expects both frames to share the same column count and the categorical mask to match that layout, so this can break as soon as any synthetic _id column is present. df_input needs the same filtering, or the mask must be recomputed after alignment.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/midst_toolkit/attacks/ensemble/rmia/rmia_calculation.py` around lines 126
- 128, Keep the synthetic and input schemas aligned before calling gower_matrix
in rmia_calculation. The current drop of "_id" columns in df_synthetic leaves
df_input and categorical_features based on a different column layout, so update
the same filtering logic for the input frame or recompute the categorical mask
after both frames are aligned. Make the fix in the code path around the
df_synthetic drop and the subsequent gower.gower_matrix call so both matrices
share identical columns and mask positions.

Comment on lines +154 to +156
elif (raw_data_dir / "train.xls").exists():
data_path = raw_data_dir / "train.xls"
test_path = raw_data_dir / "test.xls"

@coderabbitai coderabbitai Bot Jun 24, 2026

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

.xls test files are read with pd.read_csv, which will fail.

When the train.xls branch is taken, test_path is set to raw_data/test.xls, but the test branch unconditionally calls pd.read_csv(test_path) at Line 195. Reading an Excel file with read_csv will error (or produce garbage). The train side correctly uses pd.read_excel(...) and drops the ID column, so the test side should mirror that handling based on test_path.suffix.

Proposed fix
     if test_path:
         # if testing data is given
-        test_df = pd.read_csv(test_path)
+        if test_path.suffix == ".csv":
+            test_df = pd.read_csv(test_path)
+        else:  # .xls
+            test_df = pd.read_excel(test_path, sheet_name="Data", header=1)
+            test_df = test_df.drop("ID", axis=1)
         train_df = data_df

Also applies to: 193-196

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/midst_toolkit/models/tabsyn/preprocessing.py` around lines 154 - 156, The
test-file loading path in preprocessing.py is using the wrong reader for .xls
inputs, so the branch that sets data_path/test_path for train.xls must also make
the test handling in the same preprocessing flow respect test_path.suffix.
Update the logic around the tabsyn preprocessing loader to mirror the train-side
behavior: use pd.read_excel for .xls files, keep pd.read_csv for CSVs, and apply
the same ID-column drop to the test dataframe as already done for the train
dataframe in the relevant preprocessing function.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lotif: Agreed that this will be an issue.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@emersodb Thanks for confirming. Yes — the if test_path: branch should mirror the train-side loader so .xls uses pd.read_excel(..., sheet_name="Data", header=1) and drops the ID column, while .csv continues to use pd.read_csv(...). That should resolve the failure path around Line 193.

🐇👍


🧠 Learnings used
Learnt from: lotif
Repo: VectorInstitute/midst-toolkit PR: 107
File: examples/gan/synthesize.py:1-47
Timestamp: 2025-12-11T16:08:49.024Z
Learning: When using SDV (version >= 1.18.0), prefer loading a saved CTGANSynthesizer with CTGANSynthesizer.load(filepath) instead of sdv.utils.load_synthesizer(). This applies to Python code across the repo (e.g., any script that loads a CTGANSynthesizer). Ensure the SDV version is >= 1.18.0 before using CTGANSynthesizer.load, and fall back to sdv.utils.load_synthesizer() only if a compatible alternative is required.

@emersodb emersodb left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most of my comments are fairly small. I think the more fundamental ones that we want to fix were already highlighted by CodeRabbit, which is nice.

runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v6.0.2
- uses: actions/checkout@v7.0.0

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lotif: I guess we can do this, but I don't see it as essential.

Comment on lines +45 to +48
uses: actions/checkout@v7.0.0

- name: Install uv
uses: astral-sh/setup-uv@v8.1.0
uses: astral-sh/setup-uv@v8.2.0

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here.

"data_dir": config.base_data_dir,
"workspace_dir": str(Path(config.base_data_dir) / "shadow_workspace"),
"data_dir": base_data_dir,
"workspace_dir": str(Path(base_data_dir) / "shadow_workspace"),

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Super minor, but I think you can avoid the str -> Path -> str conversion here with

os.join.path(base_data_dir, "shadow_workspace")

compute_attack_success_for_given_targets(
target_model_config=config.ensemble_attack.target_model,
# TODO: refactor this to work better outside of the challenge context (i.e. no target ID)
# No target ID needed for CTGAN, but it needs at least one element in this array. The value does not matter.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This says CTGAN. I think you mean Tabsyn?

run_metaclassifier_training: true

shadow_training:
model_name: ctgan

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tabsyn? Is this mis-direction to ctgan problematic for the attack?

Comment on lines +405 to +411

Args:
dataset: The dataset as a pandas DataFrame.
synthesize: Flag indicating whether to generate synthetic data after training.
The number of points to synthesize and the save directory is controlled by
the `number_of_points_to_synthesize` and `save_dir` attributes of the training
config. Optional, default is True.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lotif: This is a good point.


Args:
n_head: The number of heads.
factor: The factor for the dimension of the hidden layer.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think I understand what factor is here?

test_path: Path | None = None
if (raw_data_dir / "train.csv").exists():
data_path = raw_data_dir / "train.csv"
test_path = raw_data_dir / "test.csv"

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should comment here that we're checking about existence to get the proper file suffix and assume that both train.x and test.x exists

else:
data_path = data_dir / f"{data_name}.csv"

assert data_path.exists(), (

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These strings violate our line length restrictions don't they?

Comment on lines +154 to +156
elif (raw_data_dir / "train.xls").exists():
data_path = raw_data_dir / "train.xls"
test_path = raw_data_dir / "test.xls"

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lotif: Agreed that this will be an issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants