Adding TabSyn examples and ensemble attack code by lotif · Pull Request #143 · VectorInstitute/midst-toolkit

lotif · 2026-06-03T21:35:23Z

PR Type

Feature

Short Description

Clickup Ticket(s): https://app.clickup.com/t/10524786/868jvwvax

A few things in this PR:

Adding an example for TabSyn full pipeline (train, synthesize and evaluate)
Adding an example for Ensemble Attack using TabSyn
Adding EnsembleAttackTabSynModelRunner to src/midst_toolkit/attacks/ensemble/models.py so the attack knows how to run the model
Renaming test_attack_model.py files to stop pytest thinking they are test files
A few small refactorings to simplify things

FYI a lot of the code in the examples was copied from the CTGAN example, and the EnsembleAttackTabSynModelRunner was adapted from the existing TabSyn automated test.

Tests Added

No tests added.

coderabbitai · 2026-06-09T20:49:23Z

📝 Walkthrough

Walkthrough

Updates five GitHub Actions workflows to newer pinned action versions. Adds TabSyn single-table training, synthesis, and evaluation entrypoints with matching core preprocessing and type changes. Adds TabSyn ensemble-attack configuration, runner, challenge-dataset generation, attack training and testing scripts, and attack-success computation. Updates GAN example README/config text, base data directory handling, challenge-dataset logging, and RMIA Gower preprocessing.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly summarizes the main change: adding TabSyn examples and ensemble attack support.
Description check	✅ Passed	The description matches the template with PR Type, Short Description, and Tests Added sections filled in.
Docstring Coverage	✅ Passed	Docstring coverage is 82.35% which is sufficient. The required threshold is 80.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch marcelo/tabsyn-ensemble

⚔️ Resolve merge conflicts

Resolve merge conflict in branch marcelo/tabsyn-ensemble

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands.}

* Bump actions/checkout from 6.0.2 to 6.0.3 (#145) Bumps [actions/checkout](https://github.com/actions/checkout) from 6.0.2 to 6.0.3. - [Release notes](https://github.com/actions/checkout/releases) - [Changelog](https://github.com/actions/checkout/blob/main/CHANGELOG.md) - [Commits](actions/checkout@v6.0.2...v6.0.3) --- updated-dependencies: - dependency-name: actions/checkout dependency-version: 6.0.3 dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Marcelo Lotif <lotif@users.noreply.github.com> * Bump astral-sh/setup-uv from 8.1.0 to 8.2.0 (#144) Bumps [astral-sh/setup-uv](https://github.com/astral-sh/setup-uv) from 8.1.0 to 8.2.0. - [Release notes](https://github.com/astral-sh/setup-uv/releases) - [Commits](astral-sh/setup-uv@v8.1.0...v8.2.0) --- updated-dependencies: - dependency-name: astral-sh/setup-uv dependency-version: 8.2.0 dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Marcelo Lotif <lotif@users.noreply.github.com> * Bump actions/checkout from 6.0.3 to 7.0.0 Bumps [actions/checkout](https://github.com/actions/checkout) from 6.0.3 to 7.0.0. - [Release notes](https://github.com/actions/checkout/releases) - [Changelog](https://github.com/actions/checkout/blob/main/CHANGELOG.md) - [Commits](actions/checkout@v6.0.3...v7.0.0) --- updated-dependencies: - dependency-name: actions/checkout dependency-version: 7.0.0 dependency-type: direct:production update-type: version-update:semver-major ... Signed-off-by: dependabot[bot] <support@github.com> --------- Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Marcelo Lotif <lotif@users.noreply.github.com>

coderabbitai

Actionable comments posted: 16

🧹 Nitpick comments (4)

examples/tabsyn/ensemble_attack/train_attack_model.py (1)
19-28: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

Docstring says "CTGAN model" but this entrypoint uses TabSyn.

Line 62 constructs EnsembleAttackTabSynModelRunner; update the docstring to reference TabSyn to avoid confusion.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/tabsyn/ensemble_attack/train_attack_model.py` around lines 19 - 28,
The docstring in the attack training entrypoint is inconsistent with the actual
implementation: it says “CTGAN model” even though the script constructs
EnsembleAttackTabSynModelRunner. Update the docstring in train_attack_model to
describe the TabSyn pipeline instead, keeping the rest of the step-by-step
summary aligned with the actual runner and attack flow.
examples/tabsyn/ensemble_attack/compute_attack_success.py (1)
21-23: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

Stale "CTGAN" reference in comment.

This is the TabSyn example; the comment mentions CTGAN. Minor wording fix to keep the TODO/context accurate.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/tabsyn/ensemble_attack/compute_attack_success.py` around lines 21 -
23, The comment in compute_attack_success is stale because it references CTGAN
even though this TabSyn example is not about that model. Update the wording near
target_ids to describe the TabSyn-specific context accurately, keeping the TODO
focused on the placeholder target ID requirement without mentioning CTGAN.
examples/tabsyn/evaluate.py (1)
131-144: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

cfg.ks_tv.run is listed twice.

The quality any([...]) check includes cfg.ks_tv.run at both Line 133 and Line 135. The duplicate is harmless but should be removed; if one of them was meant to be a different metric flag, please correct it.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/tabsyn/evaluate.py` around lines 131 - 144, The any([...]) metric
gate in evaluate() includes cfg.ks_tv.run twice, so remove the duplicate entry
and verify the remaining list in evaluate() only contains one flag per metric.
If the second occurrence was intended to represent a different check, replace it
with the correct metric symbol from the same group (for example, another
cfg.*.run field) so the set of enabled evaluations is accurate.
examples/tabsyn/synthesize.py (1)
49-86: 🚀 Performance & Scalability | 🔵 Trivial | ⚡ Quick win

Avoid running preprocess twice.

preprocess is invoked once for categories/d_numerical and again with inverse=True for num_inverse/cat_inverse, repeating all preprocessing work. A single call with inverse=True returns all six values and halves the cost.
♻️ Suggested refactor
-    _, _, categories, d_numerical = preprocess(  # type: ignore[misc]
-        dataset_path=dataset_path,
-        ref_dataset_path=ref_dataset_path,
-        transforms=tabsyn_config["transforms"],
-        task_type=tabsyn_config["task_type"],
-    )
+    _, _, categories, d_numerical, num_inverse, cat_inverse = preprocess(  # type: ignore[misc]
+        dataset_path=dataset_path,
+        ref_dataset_path=ref_dataset_path,
+        transforms=tabsyn_config["transforms"],
+        task_type=tabsyn_config["task_type"],
+        inverse=True,
+    )
Then drop the second preprocess(..., inverse=True) call at lines 80-86.
Please confirm preprocess(..., inverse=True) returns the 6-tuple in the order (numerical, categorical, categories, d_numerical, num_inverse, cat_inverse).
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/tabsyn/synthesize.py` around lines 49 - 86, `preprocess(...,
inverse=True)` should be used once to avoid duplicating preprocessing in
`synthesize.py`; update the `preprocess` call so it returns the full 6-tuple and
reuse the existing values for `categories`, `d_numerical`, `num_inverse`, and
`cat_inverse`. Make sure the return order from `preprocess` is confirmed as
`(numerical, categorical, categories, d_numerical, num_inverse, cat_inverse)`,
then remove the second `preprocess(..., inverse=True)` invocation and wire the
unpacking in `synthesize()` accordingly.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.github/workflows/code_checks.yml:
- Line 32: The workflow step using actions/checkout is still pinned to a mutable
version tag and leaves persisted git credentials enabled. Update the checkout
step in code_checks.yml to use the release commit SHA for actions/checkout
instead of v7.0.0, and add persist-credentials: false since this job does not
require repository auth. Use the actions/checkout step identifier to locate the
change.

In @.github/workflows/docs.yml:
- Around line 45-48: The docs workflow still references mutable action tags and
should be hardened by pinning every action use to a full commit SHA. Update the
`actions/checkout` steps in `docs.yml` and the other workflow action references
(`astral-sh/setup-uv`, `actions/setup-python`, `actions/upload-artifact`,
`actions/download-artifact`, and `JamesIves/github-pages-deploy-action`) to
immutable SHAs, and add `persist-credentials: false` to both `actions/checkout`
invocations unless a later step needs the checkout token.

In @.github/workflows/integration_tests.yml:
- Line 44: The integration_tests workflow still references a mutable
actions/checkout tag and keeps git credentials persisted by default. Update the
checkout step in the workflow to use the release commit SHA for actions/checkout
instead of the version tag, and add persist-credentials set to false unless a
later step explicitly needs repository authentication. Use the existing checkout
step in integration_tests.yml as the target for this change.

In @.github/workflows/publish.yml:
- Line 19: The publish workflow is using a mutable actions reference and leaving
git credentials persisted. Update the checkout step in the publish job to use a
full commit SHA for actions/checkout instead of the version tag, and add
persist-credentials set to false unless this job explicitly needs repository
auth. Keep the change localized to the checkout step so the workflow remains
otherwise unchanged.

In @.github/workflows/unit_tests.yml:
- Line 44: Update the checkout step in the unit test workflow to use a pinned
full commit SHA for actions/checkout instead of the version tag, and configure
the checkout action with persist-credentials set to false since the workflow
does not need git authentication afterward. Use the existing checkout step in
unit_tests.yml as the target and make sure the action reference and its options
are adjusted together.

In `@examples/tabsyn/ensemble_attack/config.yaml`:
- Around line 32-45: The shadow_training section still uses CTGAN-specific
labels that do not apply to the TabSyn ensemble-attack runner. Remove or replace
the unused model_name field in shadow_training, and update the nearby
model_config and inline comments in this config block to use TabSyn/tabddpm
terminology so the configuration reflects what the ensemble_attack flow actually
consumes.

In `@examples/tabsyn/ensemble_attack/make_challenge_dataset.py`:
- Around line 19-24: The challenge dataset sampling in make_challenge_dataset is
non-deterministic and can fail when the untrained pool is too small. Update the
untrained_data sampling to use config.ensemble_attack.random_seed as the
random_state so runs are reproducible, and guard against len(training_data)
exceeding the available untrained rows before calling sample. Keep the fix
localized around the training_data, untrained_data, and challenge_data
construction logic.

In `@examples/tabsyn/ensemble_attack/README.md`:
- Around line 8-15: The README’s data path and config key references are
inconsistent with the actual example configuration. Update the instructions in
README so the documented folder location matches the path used by config.yaml,
and replace the mention of base_data_dir with the real data_dir attribute. Make
sure the setup steps for the ensemble_attack example point users to the same
data directory that the config expects, using the config.yaml data_dir setting
as the source of truth.
- Around line 94-95: The README’s documented output path is incorrect and does
not match the actual results location used by compute_attack_success.py. Update
the path referenced in the ensemble_attack README to use the same results_dir
configured in config.yaml and passed to experiment_directory in
compute_attack_success.py, so the saved
attack_success_for_xgb_metaclassifier_model.txt path reflects
examples/tabsyn/ensemble_attack/results instead of the old
examples/tabsyn/results location.

In `@examples/tabsyn/evaluate.py`:
- Around line 54-57: The metric logging in evaluate() is malformed because the
raw f-string makes the backslash literal, so the output shows \Score instead of
a properly separated label. Update the log message in the results loop to use
normal string escaping/formatting so the metric name and “Score” appear on
separate, readable parts, and keep the surrounding SEPARATOR/header logging
consistent.

In `@examples/tabsyn/README.md`:
- Around line 83-86: The README has a filename mismatch and a typo: it documents
the sampled training data as `{table_name}_samples.csv` and says “sampleed”, but
the actual writer in `train.py` uses `_sample_data_if_needed` to produce
`{table_name}_sampled.csv`. Update the documentation in
`examples/tabsyn/README.md` to match the real filename used by
`_sample_data_if_needed` and correct the spelling so the data path instructions
are consistent.
- Around line 68-73: The Alpha Precision example command is still using CTGAN
paths instead of TabSyn paths. Update the command in the TabSyn README so the
midsts_alpha_precision_eval invocation uses the TabSyn-specific meta info and
output directory locations, matching the existing examples/tabsyn/ structure
rather than examples/gan/. Keep the rest of the command unchanged and verify the
referenced synthetic and real data paths remain correct for the TabSyn example.

In `@examples/tabsyn/train.py`:
- Around line 169-181: The _sample_data_if_needed helper in train.py samples
data non-reproducibly and can fail when sample_size is larger than the available
rows. Update the DataFrame.sample call to use a fixed random_state for
deterministic results, and guard the n argument by clamping sample_size to the
number of rows (or otherwise handling oversized requests) before sampling. Keep
the fix localized to _sample_data_if_needed and preserve the existing behavior
of writing the sampled CSVs.

In `@src/midst_toolkit/attacks/ensemble/models.py`:
- Around line 405-411: The docstring for the training method in models.py
overstates how synthetic sample count is configured, since `_synthesize` does
not use `number_of_points_to_synthesize` and instead derives `num_samples` from
`train_z_att["num_samples"]` before calling `tabsyn.sample(...)`. Update the
implementation to pass the configured synthesis count through the
`_synthesize`/`tabsyn.sample` flow, or revise the docstring to match the actual
behavior; use the `number_of_points_to_synthesize`, `_synthesize`, and
`tabsyn.sample` symbols to locate the affected code.

In `@src/midst_toolkit/attacks/ensemble/rmia/rmia_calculation.py`:
- Around line 126-128: Keep the synthetic and input schemas aligned before
calling gower_matrix in rmia_calculation. The current drop of "_id" columns in
df_synthetic leaves df_input and categorical_features based on a different
column layout, so update the same filtering logic for the input frame or
recompute the categorical mask after both frames are aligned. Make the fix in
the code path around the df_synthetic drop and the subsequent gower.gower_matrix
call so both matrices share identical columns and mask positions.

In `@src/midst_toolkit/models/tabsyn/preprocessing.py`:
- Around line 154-156: The test-file loading path in preprocessing.py is using
the wrong reader for .xls inputs, so the branch that sets data_path/test_path
for train.xls must also make the test handling in the same preprocessing flow
respect test_path.suffix. Update the logic around the tabsyn preprocessing
loader to mirror the train-side behavior: use pd.read_excel for .xls files, keep
pd.read_csv for CSVs, and apply the same ID-column drop to the test dataframe as
already done for the train dataframe in the relevant preprocessing function.

---

Nitpick comments:
In `@examples/tabsyn/ensemble_attack/compute_attack_success.py`:
- Around line 21-23: The comment in compute_attack_success is stale because it
references CTGAN even though this TabSyn example is not about that model. Update
the wording near target_ids to describe the TabSyn-specific context accurately,
keeping the TODO focused on the placeholder target ID requirement without
mentioning CTGAN.

In `@examples/tabsyn/ensemble_attack/train_attack_model.py`:
- Around line 19-28: The docstring in the attack training entrypoint is
inconsistent with the actual implementation: it says “CTGAN model” even though
the script constructs EnsembleAttackTabSynModelRunner. Update the docstring in
train_attack_model to describe the TabSyn pipeline instead, keeping the rest of
the step-by-step summary aligned with the actual runner and attack flow.

In `@examples/tabsyn/evaluate.py`:
- Around line 131-144: The any([...]) metric gate in evaluate() includes
cfg.ks_tv.run twice, so remove the duplicate entry and verify the remaining list
in evaluate() only contains one flag per metric. If the second occurrence was
intended to represent a different check, replace it with the correct metric
symbol from the same group (for example, another cfg.*.run field) so the set of
enabled evaluations is accurate.

In `@examples/tabsyn/synthesize.py`:
- Around line 49-86: `preprocess(..., inverse=True)` should be used once to
avoid duplicating preprocessing in `synthesize.py`; update the `preprocess` call
so it returns the full 6-tuple and reuse the existing values for `categories`,
`d_numerical`, `num_inverse`, and `cat_inverse`. Make sure the return order from
`preprocess` is confirmed as `(numerical, categorical, categories, d_numerical,
num_inverse, cat_inverse)`, then remove the second `preprocess(...,
inverse=True)` invocation and wire the unpacking in `synthesize()` accordingly.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 3e600db0-a23a-4204-a9d7-db085388e76f

📥 Commits

Reviewing files that changed from the base of the PR and between 9f0e89c and 67b1f03.

📒 Files selected for processing (30)

.github/workflows/code_checks.yml
.github/workflows/docs.yml
.github/workflows/integration_tests.yml
.github/workflows/publish.yml
.github/workflows/unit_tests.yml
examples/gan/README.md
examples/gan/ensemble_attack/README.md
examples/gan/ensemble_attack/config.yaml
examples/gan/ensemble_attack/make_challenge_dataset.py
examples/gan/ensemble_attack/run_test_attack_model.py
examples/gan/ensemble_attack/utils.py
examples/tabsyn/README.md
examples/tabsyn/config.yaml
examples/tabsyn/ensemble_attack/README.md
examples/tabsyn/ensemble_attack/compute_attack_success.py
examples/tabsyn/ensemble_attack/config.yaml
examples/tabsyn/ensemble_attack/make_challenge_dataset.py
examples/tabsyn/ensemble_attack/run_test_attack_model.py
examples/tabsyn/ensemble_attack/train_attack_model.py
examples/tabsyn/evaluate.py
examples/tabsyn/synthesize.py
examples/tabsyn/tabsyn_config.toml
examples/tabsyn/train.py
src/midst_toolkit/attacks/ensemble/models.py
src/midst_toolkit/attacks/ensemble/rmia/rmia_calculation.py
src/midst_toolkit/models/tabsyn/dataset.py
src/midst_toolkit/models/tabsyn/pipeline.py
src/midst_toolkit/models/tabsyn/preprocessing.py
tests/integration/assets/tabsyn/trans_info.json
tests/integration/models/tabsyn/test_train.py

💤 Files with no reviewable changes (1)

examples/gan/ensemble_attack/config.yaml

coderabbitai · 2026-06-24T20:01:13Z

    runs-on: ubuntu-latest
    steps:
-      - uses: actions/checkout@v6.0.2
+      - uses: actions/checkout@v7.0.0


🔒 Security & Privacy | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash set -euo pipefail # Inspect the workflow and surrounding steps. git ls-files .github/workflows/code_checks.yml printf '\n--- file with line numbers ---\n' cat -n .github/workflows/code_checks.yml | sed -n '1,220p' printf '\n--- checkout usages in workflows ---\n' rg -n "actions/checkout@|persist-credentials|git auth|github.token" .github/workflows -S

Repository: VectorInstitute/midst-toolkit

Length of output: 2624

🏁 Script executed:

#!/bin/bash set -euo pipefail # Gather deterministic evidence about the checkout reference and any later git operations. python3 - <<'PY' from pathlib import Path p = Path('.github/workflows/code_checks.yml') print(p.exists(), p) if p.exists(): text = p.read_text() print("uses_checkout", "actions/checkout@" in text) print("persist_credentials", "persist-credentials" in text) print("git_commands", any(s in text for s in ["git ", "gh ", "npm publish", "npm version", "git push", "git commit"])) PY

Repository: VectorInstitute/midst-toolkit

Length of output: 270

Pin actions/checkout by SHA and disable persisted credentials.
.github/workflows/code_checks.yml:32 still uses the mutable v7.0.0 tag; replace it with the release commit SHA and set persist-credentials: false here, since this job doesn’t need git auth.

🧰 Tools

🪛 zizmor (1.26.1)

[warning] 32-32: credential persistence through GitHub Actions artifacts (artipacked): does not set persist-credentials: false

(artipacked)

[error] 32-32: unpinned action reference (unpinned-uses): action is not pinned to a hash (required by blanket policy)

(unpinned-uses)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.github/workflows/code_checks.yml at line 32, The workflow step using actions/checkout is still pinned to a mutable version tag and leaves persisted git credentials enabled. Update the checkout step in code_checks.yml to use the release commit SHA for actions/checkout instead of v7.0.0, and add persist-credentials: false since this job does not require repository auth. Use the actions/checkout step identifier to locate the change.

Source: Linters/SAST tools

@lotif: I guess we can do this, but I don't see it as essential.

@emersodb Fair point — I agree this is more defense-in-depth hardening than something essential for correctness in this PR.

The recommendation still stands as a best practice for GitHub Actions, but I’m fine treating it as non-blocking and leaving it for a follow-up if you want to keep this change focused.

coderabbitai · 2026-06-24T20:01:13Z

+        uses: actions/checkout@v7.0.0

      - name: Install uv
-        uses: astral-sh/setup-uv@v8.1.0
+        uses: astral-sh/setup-uv@v8.2.0


🔒 Security & Privacy | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash set -euo pipefail # Inspect the workflow file around the referenced lines nl -ba .github/workflows/docs.yml | sed -n '1,140p'

Repository: VectorInstitute/midst-toolkit

Length of output: 207

🏁 Script executed:

#!/bin/bash set -euo pipefail # Show the workflow file and any action refs used in it grep -nE 'uses:|persist-credentials' .github/workflows/docs.yml

Repository: VectorInstitute/midst-toolkit

Length of output: 488

🏁 Script executed:

#!/bin/bash set -euo pipefail nl -ba .github/workflows/docs.yml | sed -n '1,140p'

Repository: VectorInstitute/midst-toolkit

Length of output: 207

🏁 Script executed:

#!/bin/bash set -euo pipefail awk 'NR>=35 && NR<=110 {printf "%4d:%s\n", NR, $0}' .github/workflows/docs.yml

Repository: VectorInstitute/midst-toolkit

Length of output: 2097

🏁 Script executed:

#!/bin/bash set -euo pipefail awk 'NR>=1 && NR<=140 {printf "%4d:%s\n", NR, $0}' .github/workflows/docs.yml

Repository: VectorInstitute/midst-toolkit

Length of output: 2856

🏁 Script executed:

#!/bin/bash set -euo pipefail python3 - <<'PY' from pathlib import Path p = Path('.github/workflows/docs.yml') for i, line in enumerate(p.read_text().splitlines(), 1): if 35 <= i <= 110: print(f"{i:4d}:{line}") PY

Repository: VectorInstitute/midst-toolkit

Length of output: 2097

Pin all workflow actions to commit SHAs and disable checkout credentials

docs.yml still uses mutable refs for actions/checkout, astral-sh/setup-uv, actions/setup-python, actions/upload-artifact, actions/download-artifact, and JamesIves/github-pages-deploy-action; replace them with full commit SHAs.

Add persist-credentials: false to both actions/checkout steps unless a later git push depends on the checkout token.

🧰 Tools

🪛 zizmor (1.26.1)

[error] 45-45: unpinned action reference (unpinned-uses): action is not pinned to a hash (required by blanket policy)

(unpinned-uses)

[error] 48-48: unpinned action reference (unpinned-uses): action is not pinned to a hash (required by blanket policy)

(unpinned-uses)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.github/workflows/docs.yml around lines 45 - 48, The docs workflow still references mutable action tags and should be hardened by pinning every action use to a full commit SHA. Update the `actions/checkout` steps in `docs.yml` and the other workflow action references (`astral-sh/setup-uv`, `actions/setup-python`, `actions/upload-artifact`, `actions/download-artifact`, and `JamesIves/github-pages-deploy-action`) to immutable SHAs, and add `persist-credentials: false` to both `actions/checkout` invocations unless a later step needs the checkout token.

Source: Linters/SAST tools

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash set -euo pipefail repo="VectorInstitute/midst-toolkit" pr="143" echo "=== PR review comments mentioning workflow pinning / similar replies ===" gh api "repos/$repo/pulls/$pr/comments?per_page=100" --paginate \ --jq '.[] | {id, path, line, user: .user.login, in_reply_to_id, body, url} | select((.path | startswith(".github/workflows/")) or (.body | test("persist-credentials|unpinned|Same here|Pin all workflow actions"; "i")))' || true echo echo "=== Issue/PR conversation comments ===" gh api "repos/$repo/issues/$pr/comments?per_page=100" --paginate \ --jq '.[] | {user: .user.login, body, url}' || true

Length of output: 27247

@emersodb Fair enough — same assessment on this thread as well: this is useful hardening, but not essential to the correctness of this PR.

I’m good with treating it as non-blocking and leaving it for a follow-up if you want to keep this change focused.

coderabbitai · 2026-06-24T20:01:13Z

    runs-on: ubuntu-latest
    steps:
-      - uses: actions/checkout@v6.0.2
+      - uses: actions/checkout@v7.0.0


🔒 Security & Privacy | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash set -euo pipefail # Locate the workflow and inspect the relevant section with line numbers. git ls-files .github/workflows/integration_tests.yml printf '\n--- file excerpt ---\n' sed -n '1,120p' .github/workflows/integration_tests.yml | cat -n

Repository: VectorInstitute/midst-toolkit

Length of output: 2456

🌐 Web query:

actions/checkout latest release version and default persist-credentials behavior

💡 Result:

As of June 24, 2026, the latest release version of the actions/checkout repository is v7.0.0, released on June 18, 2026 [1][2][3]. Regarding the persist-credentials input: - Default Behavior: The default value for persist-credentials is true [4][5][6]. - Purpose: When set to true, the action configures the Git repository with the provided authentication token (or SSH key) in the local Git configuration [4][5]. This allows your workflow scripts to run authenticated Git commands (such as git fetch or git push) directly [4][5]. - Security Implementation: The token is persisted in a separate file (distinct from the standard.git/config) to enhance security [5]. The action ensures that this token is automatically removed during the post-job cleanup phase [4][5]. - Opt-out: You can disable this behavior by explicitly setting persist-credentials: false in your workflow configuration if you do not require authenticated Git operations [4][5].

Citations:

1: https://github.com/actions/checkout/releases/tag/v7.0.0

2: https://github.com/actions/checkout/releases

3: https://github.com/actions/checkout/?tab=readme-ov-file

4: https://github.com/actions/checkout/blob/main/README.md

5: https://github.com/actions/checkout

6: https://github.com/actions/checkout/blob/34e114876b0b11c390a56381ad16ebd13914f8d5/action.yml

🌐 Web query:

GitHub actions/checkout README persist-credentials default checkout version

💡 Result:

In the GitHub actions/checkout action, the persist-credentials input defaults to true [1][2]. When persist-credentials is set to its default value of true, the action configures the authentication token or SSH key in the local git configuration [1][2]. This allows subsequent steps in your workflow to run authenticated Git commands (such as git push or git fetch) [1][3]. The token or key is removed during the post-job cleanup process [1][3]. Starting with recent versions, these credentials are stored in a file under $RUNNER_TEMP to improve security, rather than directly in.git/config [3][4]. If you wish to disable this behavior, you must explicitly set persist-credentials to false in your workflow file [1].

Citations:

1: https://github.com/actions/checkout/blob/main/README.md

2: https://github.com/actions/checkout/blob/72f2cec99f417b1a1c5e2e88945068983b7965f9/action.yml

3: https://github.com/actions/checkout

4: actions/checkout@v5.0.1...v6.0.0

🌐 Web query:

site:github.com actions/checkout releases v7.0.0

💡 Result:

The actions/checkout v7.0.0 release was published on June 18, 2026 [1][2][3]. Key changes and highlights in v7.0.0 include: * Security Update: It introduces a mechanism to block checking out fork pull requests when using pull_request_target and workflow_run events to prevent potential security vulnerabilities [1][4]. * Dependency Updates: Various dependencies were updated, including @actions/core and @actions/tool-cache, and the module was upgraded to ECMAScript modules (ESM) [1][4]. * Refinements: General improvements such as updated error wording and internal dependency management [1][4]. Users can reference the new version in their workflows using: - uses: actions/checkout@v7 [5][6][3] For the full list of changes, you can view the changelog on the official GitHub repository [4].

Citations:

1: https://github.com/actions/checkout/releases/tag/v7.0.0

2: https://github.com/actions/checkout/releases

3: https://github.com/actions/checkout/?tab=readme-ov-file

4: https://github.com/actions/checkout/blob/main/CHANGELOG.md

5: actions/checkout@v6.0.3...v7.0.0

6: https://github.com/marketplace/actions/checkout

Pin actions/checkout and drop persisted credentials. .github/workflows/integration_tests.yml:44 still uses a mutable tag; switch to the release commit SHA and add persist-credentials: false unless a later step needs git auth.

🧰 Tools

🪛 zizmor (1.26.1)

[warning] 44-44: credential persistence through GitHub Actions artifacts (artipacked): does not set persist-credentials: false

(artipacked)

[error] 44-44: unpinned action reference (unpinned-uses): action is not pinned to a hash (required by blanket policy)

(unpinned-uses)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.github/workflows/integration_tests.yml at line 44, The integration_tests workflow still references a mutable actions/checkout tag and keeps git credentials persisted by default. Update the checkout step in the workflow to use the release commit SHA for actions/checkout instead of the version tag, and add persist-credentials set to false unless a later step explicitly needs repository authentication. Use the existing checkout step in integration_tests.yml as the target for this change.

Source: Linters/SAST tools

coderabbitai · 2026-06-24T20:01:13Z

          sudo apt-get update
          sudo apt-get install libcurl4-openssl-dev libssl-dev
-      - uses: actions/checkout@v6.0.2
+      - uses: actions/checkout@v7.0.0


🔒 Security & Privacy | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash set -euo pipefail # Inspect the workflow and nearby context. git ls-files .github/workflows/publish.yml wc -l .github/workflows/publish.yml cat -n .github/workflows/publish.yml | sed -n '1,120p'

Repository: VectorInstitute/midst-toolkit

Length of output: 1882

🏁 Script executed:

#!/bin/bash set -euo pipefail # Find whether checkout is used elsewhere with SHA pinning or persist-credentials settings. rg -n --hidden --glob '.github/workflows/*.yml' --glob '.github/workflows/*.yaml' 'actions/checkout@|persist-credentials:' .github/workflows

Repository: VectorInstitute/midst-toolkit

Length of output: 604

🏁 Script executed:

#!/bin/bash set -euo pipefail # Check whether any workflow already pins actions/checkout by SHA in the repository. rg -n --hidden 'actions/checkout@' .github/workflows . || true

Repository: VectorInstitute/midst-toolkit

Length of output: 1053

Pin actions/checkout to a SHA and disable persisted credentials.

Line 19 uses actions/checkout@v7.0.0; switch to a commit SHA and set persist-credentials: false unless this job needs git auth.

🧰 Tools

🪛 zizmor (1.26.1)

[warning] 19-19: credential persistence through GitHub Actions artifacts (artipacked): does not set persist-credentials: false

(artipacked)

[error] 19-19: unpinned action reference (unpinned-uses): action is not pinned to a hash (required by blanket policy)

(unpinned-uses)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.github/workflows/publish.yml at line 19, The publish workflow is using a mutable actions reference and leaving git credentials persisted. Update the checkout step in the publish job to use a full commit SHA for actions/checkout instead of the version tag, and add persist-credentials set to false unless this job explicitly needs repository auth. Keep the change localized to the checkout step so the workflow remains otherwise unchanged.

Source: Linters/SAST tools

coderabbitai · 2026-06-24T20:01:13Z

    runs-on: ubuntu-latest
    steps:
-      - uses: actions/checkout@v6.0.2
+      - uses: actions/checkout@v7.0.0


🔒 Security & Privacy | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

sed -n '1,140p' .github/workflows/unit_tests.yml

Repository: VectorInstitute/midst-toolkit

Length of output: 1747

🏁 Script executed:

sed -n '1,140p' .github/workflows/unit_tests.yml

Repository: VectorInstitute/midst-toolkit

Length of output: 1747

Pin actions/checkout and disable credential persistence .github/workflows/unit_tests.yml:44 should use a full commit SHA instead of v7.0.0, and set persist-credentials: false since no later step needs git auth.

🧰 Tools

🪛 zizmor (1.26.1)

[warning] 44-44: credential persistence through GitHub Actions artifacts (artipacked): does not set persist-credentials: false

(artipacked)

[error] 44-44: unpinned action reference (unpinned-uses): action is not pinned to a hash (required by blanket policy)

(unpinned-uses)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.github/workflows/unit_tests.yml at line 44, Update the checkout step in the unit test workflow to use a pinned full commit SHA for actions/checkout instead of the version tag, and configure the checkout action with persist-credentials set to false since the workflow does not need git authentication afterward. Use the existing checkout step in unit_tests.yml as the target and make sure the action reference and its options are adjusted together.

Source: Linters/SAST tools

coderabbitai · 2026-06-24T20:01:14Z

+To compute those metrics, you can run the command below. The data files should
+be under `/data/{table_name}.csv` for the real data, `/data/{table_name}_samples.csv`
+for the sampleed data used for training, and `/results/{table_name}_synthetic.csv`
+for the synthetic data.


📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

Filename inconsistency: _samples.csv vs _sampled.csv.

The README documents the sampled training file as /data/{table_name}_samples.csv, but train.py writes {table_name}_sampled.csv (see _sample_data_if_needed). Also "sampleed" is a typo. Align the docs with the actual filename to avoid confusion.

📝 Suggested fix

-be under `/data/{table_name}.csv` for the real data, `/data/{table_name}_samples.csv` -for the sampleed data used for training, and `/results/{table_name}_synthetic.csv` +be under `/data/{table_name}.csv` for the real data, `/data/{table_name}_sampled.csv` +for the sampled data used for training, and `/results/{table_name}_synthetic.csv`

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

To compute those metrics, you can run the command below. The data files should

be under `/data/{table_name}.csv` for the real data, `/data/{table_name}_samples.csv`

for the sampleed data used for training, and `/results/{table_name}_synthetic.csv`

for the synthetic data.

To compute those metrics, you can run the command below. The data files should

be under `/data/{table_name}.csv` for the real data, `/data/{table_name}_sampled.csv`

for the sampled data used for training, and `/results/{table_name}_synthetic.csv`

for the synthetic data.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@examples/tabsyn/README.md` around lines 83 - 86, The README has a filename mismatch and a typo: it documents the sampled training data as `{table_name}_samples.csv` and says “sampleed”, but the actual writer in `train.py` uses `_sample_data_if_needed` to produce `{table_name}_sampled.csv`. Update the documentation in `examples/tabsyn/README.md` to match the real filename used by `_sample_data_if_needed` and correct the spelling so the data path instructions are consistent.

@lotif: This is also good to fix up, here and other readmes

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash set -euo pipefail echo "== Matches for _samples.csv in README files ==" rg -n --glob '**/README.md' '_samples\.csv' echo echo "== Matches for sampleed in README files ==" rg -n --glob '**/README.md' 'sampleed' echo echo "== Matches for _sampled.csv in README files ==" rg -n --glob '**/README.md' '_sampled\.csv'

Length of output: 682

🏁 Script executed:

#!/bin/bash set -euo pipefail echo "== examples/tabsyn/README.md relevant lines ==" sed -n '78,90p' examples/tabsyn/README.md || true echo echo "== Matches for _samples.csv in README files ==" rg -n --glob '**/README.md' '_samples\.csv' || true echo echo "== Matches for sampleed in README files ==" rg -n --glob '**/README.md' 'sampleed' || true echo echo "== Matches for _sampled.csv in README files ==" rg -n --glob '**/README.md' '_sampled\.csv' || true echo echo "== README lines mentioning sampled/sample data ==" rg -n --glob '**/README.md' 'sampled|sample data|samples\.csv|sampled\.csv' || true

Length of output: 1382

@emersodb Agreed. I checked the current branch: the exact _samples.csv + sampleed issue is present in examples/tabsyn/README.md, and I didn’t find the same literal pattern in other README.md files. Still, a quick consistency pass across the other example READMEs while touching docs here sounds worthwhile.

coderabbitai · 2026-06-24T20:01:14Z

+def _sample_data_if_needed(table_name: str, data_dir: Path, results_dir: Path, sample_size: int | None) -> str:
+    data_name = table_name
+    if sample_size:
+        log(INFO, f"Sampling {sample_size} rows from data...")
+        all_data = pd.read_csv(data_dir / f"{table_name}.csv")
+        sampled_data = all_data.sample(n=sample_size)
+        results_dir.mkdir(parents=True, exist_ok=True)
+
+        data_name = f"{table_name}_sampled"
+        sampled_data.to_csv(results_dir / f"{data_name}.csv", index=False)
+        sampled_data.to_csv(data_dir / f"{data_name}.csv", index=False)
+
+    return data_name


🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

Sampling is non-reproducible and unguarded against oversized sample_size.

DataFrame.sample(n=sample_size) uses no random_state, so each run draws a different subset, making training/eval non-reproducible. It will also raise ValueError if sample_size exceeds the row count. Consider seeding and clamping.

📝 Suggested fix

if sample_size: log(INFO, f"Sampling {sample_size} rows from data...") all_data = pd.read_csv(data_dir / f"{table_name}.csv") - sampled_data = all_data.sample(n=sample_size) + sampled_data = all_data.sample(n=min(sample_size, len(all_data)), random_state=42) results_dir.mkdir(parents=True, exist_ok=True)

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

def _sample_data_if_needed(table_name: str, data_dir: Path, results_dir: Path, sample_size: int | None) -> str:

data_name = table_name

if sample_size:

log(INFO, f"Sampling {sample_size} rows from data...")

all_data = pd.read_csv(data_dir / f"{table_name}.csv")

sampled_data = all_data.sample(n=sample_size)

results_dir.mkdir(parents=True, exist_ok=True)

data_name = f"{table_name}_sampled"

sampled_data.to_csv(results_dir / f"{data_name}.csv", index=False)

sampled_data.to_csv(data_dir / f"{data_name}.csv", index=False)

return data_name

def _sample_data_if_needed(table_name: str, data_dir: Path, results_dir: Path, sample_size: int | None) -> str:

data_name = table_name

if sample_size:

log(INFO, f"Sampling {sample_size} rows from data...")

all_data = pd.read_csv(data_dir / f"{table_name}.csv")

sampled_data = all_data.sample(n=min(sample_size, len(all_data)), random_state=42)

results_dir.mkdir(parents=True, exist_ok=True)

data_name = f"{table_name}_sampled"

sampled_data.to_csv(results_dir / f"{data_name}.csv", index=False)

sampled_data.to_csv(data_dir / f"{data_name}.csv", index=False)

return data_name

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@examples/tabsyn/train.py` around lines 169 - 181, The _sample_data_if_needed helper in train.py samples data non-reproducibly and can fail when sample_size is larger than the available rows. Update the DataFrame.sample call to use a fixed random_state for deterministic results, and guard the n argument by clamping sample_size to the number of rows (or otherwise handling oversized requests) before sampling. Keep the fix localized to _sample_data_if_needed and preserve the existing behavior of writing the sampled CSVs.

@lotif: Also a worthwhile change.

@emersodb Agreed — that would make the sampling path more robust and reproducible with a very localized change.

🐇👍

🧠 Learnings used

Learnt from: lotif Repo: VectorInstitute/midst-toolkit PR: 107 File: examples/gan/synthesize.py:1-47 Timestamp: 2025-12-11T16:08:49.024Z Learning: When using SDV (version >= 1.18.0), prefer loading a saved CTGANSynthesizer with CTGANSynthesizer.load(filepath) instead of sdv.utils.load_synthesizer(). This applies to Python code across the repo (e.g., any script that loads a CTGANSynthesizer). Ensure the SDV version is >= 1.18.0 before using CTGANSynthesizer.load, and fall back to sdv.utils.load_synthesizer() only if a compatible alternative is required.

coderabbitai · 2026-06-24T20:01:14Z

+
+        Args:
+            dataset: The dataset as a pandas DataFrame.
+            synthesize: Flag indicating whether to generate synthetic data after training.
+                The number of points to synthesize and the save directory is controlled by
+                the `number_of_points_to_synthesize` and `save_dir` attributes of the training
+                config. Optional, default is True.


📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

Docstring overstates configurability of synthesis count.

The docstring states the number of points to synthesize is controlled by number_of_points_to_synthesize, but _synthesize derives num_samples from train_z_att["num_samples"] (i.e., the training set size) and never reads number_of_points_to_synthesize. Either wire the config value into tabsyn.sample(...) or correct the docstring to avoid misleading callers.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/midst_toolkit/attacks/ensemble/models.py` around lines 405 - 411, The docstring for the training method in models.py overstates how synthetic sample count is configured, since `_synthesize` does not use `number_of_points_to_synthesize` and instead derives `num_samples` from `train_z_att["num_samples"]` before calling `tabsyn.sample(...)`. Update the implementation to pass the configured synthesis count through the `_synthesize`/`tabsyn.sample` flow, or revise the docstring to match the actual behavior; use the `number_of_points_to_synthesize`, `_synthesize`, and `tabsyn.sample` symbols to locate the affected code.

@lotif: This is a good point.

@emersodb Thanks for confirming — agreed, this looks like a valid mismatch to clean up.

🐇👍

🧠 Learnings used

Learnt from: lotif Repo: VectorInstitute/midst-toolkit PR: 107 File: examples/gan/synthesize.py:1-47 Timestamp: 2025-12-11T16:08:49.024Z Learning: When using SDV (version >= 1.18.0), prefer loading a saved CTGANSynthesizer with CTGANSynthesizer.load(filepath) instead of sdv.utils.load_synthesizer(). This applies to Python code across the repo (e.g., any script that loads a CTGANSynthesizer). Ensure the SDV version is >= 1.18.0 before using CTGANSynthesizer.load, and fall back to sdv.utils.load_synthesizer() only if a compatible alternative is required.

coderabbitai · 2026-06-24T20:01:14Z

+    # Drop id columns
+    id_columns_names = [col for col in df_synthetic.columns if "_id" in col]
+    df_synthetic = df_synthetic.drop(columns=id_columns_names)


🎯 Functional Correctness | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash set -euo pipefail # 1) Find all callsites of compute_gower_for_model and inspect nearby context. rg -n -C4 '\bcompute_gower_for_model\s*\(' src examples tests # 2) Inspect where categorical_features is constructed/passed. rg -n -C4 '\bcategorical_features\b' src/midst_toolkit/attacks/ensemble/rmia # 3) Inspect where df_input is prepared/dropped to confirm it mirrors the new synthetic _id-column drop. rg -n -C4 '\bdf_input\b|\bdrop\s*\(.*_id|id_column_name' src/midst_toolkit/attacks/ensemble/rmia

Repository: VectorInstitute/midst-toolkit

Length of output: 45302

🏁 Script executed:

#!/bin/bash set -euo pipefail # Inspect the relevant function bodies and surrounding context. sed -n '89,210p' src/midst_toolkit/attacks/ensemble/rmia/rmia_calculation.py

Repository: VectorInstitute/midst-toolkit

Length of output: 5540

🏁 Script executed:

#!/bin/bash set -euo pipefail # Inspect the relevant function bodies and surrounding context. sed -n '89,210p' src/midst_toolkit/attacks/ensemble/rmia/rmia_calculation.py

Repository: VectorInstitute/midst-toolkit

Length of output: 5540

🌐 Web query:

gower.gower_matrix data_x data_y must have same columns categorical feature list length

💡 Result:

The error indicating that data_x and data_y must have the same number of columns or that the categorical feature list must match the number of columns typically arises in the gower Python package [1][2] when there is a mismatch between the input dataset's dimensionality and the provided categorical feature mask. In the gower library, the gower_matrix(data_x, data_y=None, cat_features=None) function requires that [1]: 1. Column Alignment: If both data_x and data_y are provided, they must have the same number of columns [1]. 2. Categorical Mask Length: If a cat_features argument is provided, it must be a boolean array or list with a length equal to the number of columns in the dataset [1][3][4]. To resolve this: - Ensure the cat_features mask has the correct length: If your dataset has N columns, ensure your boolean mask (or integer array) has exactly N entries [3][4]. You can verify this by checking data_x.shape[1] [1]. - Automatic detection: If you are unsure which features are categorical, you can omit the cat_features argument [1]. The gower package will attempt to automatically detect categorical columns by identifying non-numeric dtypes (e.g., objects or categories) in your dataframe [1]. - Consistent Input: Ensure that data_x and data_y (if used) are of the same format (e.g., both pandas DataFrames or both numpy arrays) and share the same feature space [1]. If you are using the gower package, checking your input shape with data_x.shape and comparing it to the length of your cat_features list is the most direct way to debug the mismatch [1].

Citations:

1: https://github.com/wwwjk366/gower/blob/master/gower/gower_dist.py

2: https://github.com/sbobek/gower

3: https://docs.deepchecks.com/en/stable/api/generated/deepchecks.utils.gower_distance.gower_matrix.html

4: https://stackoverflow.com/questions/66441887/how-can-i-implement-pam-clustering-algorithm-using-gower-distance-in-sklearn

🌐 Web query:

gower.gower_matrix data_x data_y must have same columns categorical feature list length

💡 Result:

In the Python gower library, the gower.gower_matrix function computes Gower's distance [1][2]. When users encounter issues or errors related to "categorical feature list length," it typically stems from the optional cat_features argument [3]. The cat_features parameter accepts a boolean list or array that specifies which columns in your data should be treated as categorical [3]. The length of this list must exactly match the number of columns (features) in the input data frame [3]. If the length of cat_features does not correspond to the number of columns in your data, or if there is a mismatch between the provided categorical list and the input structure, the function will raise an error [3]. To resolve this: 1. Ensure the length of your cat_features list is equal to the number of columns in your input data [3]. 2. Alternatively, you can omit the cat_features argument entirely, in which case the library will attempt to infer the data types automatically (often based on pandas dtypes like 'object') [3]. Note that this behavior is distinct from the R gower package, which primarily handles pairings between two data sets (x and y) and manages column matching via pair_x and pair_y arguments [4][5][6]. If you are working in Python, focus on ensuring your cat_features boolean mask aligns with the columns of your input DataFrame [3].

Citations:

1: https://pypi.org/project/gower/

2: https://github.com/sbobek/gower

3: https://stats.stackexchange.com/questions/509788/options-for-clustering-analysis-with-numeric-nominal-data-with-gower-distance

4: https://cran.case.edu/web/packages/gower/refman/gower.html

5: https://cran.r-project.org/web/packages/gower/gower.pdf

6: https://www.stat.math.ethz.ch/CRAN/web/packages/gower/vignettes/intro.Rnw

Keep the synthetic and input schemas aligned before Gower distance calculation.
df_synthetic drops every "_id" column here, but df_input and categorical_features are still built from the unfiltered input. gower.gower_matrix() expects both frames to share the same column count and the categorical mask to match that layout, so this can break as soon as any synthetic _id column is present. df_input needs the same filtering, or the mask must be recomputed after alignment.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/midst_toolkit/attacks/ensemble/rmia/rmia_calculation.py` around lines 126 - 128, Keep the synthetic and input schemas aligned before calling gower_matrix in rmia_calculation. The current drop of "_id" columns in df_synthetic leaves df_input and categorical_features based on a different column layout, so update the same filtering logic for the input frame or recompute the categorical mask after both frames are aligned. Make the fix in the code path around the df_synthetic drop and the subsequent gower.gower_matrix call so both matrices share identical columns and mask positions.

coderabbitai · 2026-06-24T20:01:14Z

+    elif (raw_data_dir / "train.xls").exists():
+        data_path = raw_data_dir / "train.xls"
+        test_path = raw_data_dir / "test.xls"


🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

.xls test files are read with pd.read_csv, which will fail.

When the train.xls branch is taken, test_path is set to raw_data/test.xls, but the test branch unconditionally calls pd.read_csv(test_path) at Line 195. Reading an Excel file with read_csv will error (or produce garbage). The train side correctly uses pd.read_excel(...) and drops the ID column, so the test side should mirror that handling based on test_path.suffix.

Proposed fix

if test_path: # if testing data is given - test_df = pd.read_csv(test_path) + if test_path.suffix == ".csv": + test_df = pd.read_csv(test_path) + else: # .xls + test_df = pd.read_excel(test_path, sheet_name="Data", header=1) + test_df = test_df.drop("ID", axis=1) train_df = data_df

Also applies to: 193-196

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/midst_toolkit/models/tabsyn/preprocessing.py` around lines 154 - 156, The test-file loading path in preprocessing.py is using the wrong reader for .xls inputs, so the branch that sets data_path/test_path for train.xls must also make the test handling in the same preprocessing flow respect test_path.suffix. Update the logic around the tabsyn preprocessing loader to mirror the train-side behavior: use pd.read_excel for .xls files, keep pd.read_csv for CSVs, and apply the same ID-column drop to the test dataframe as already done for the train dataframe in the relevant preprocessing function.

@lotif: Agreed that this will be an issue.

@emersodb Thanks for confirming. Yes — the if test_path: branch should mirror the train-side loader so .xls uses pd.read_excel(..., sheet_name="Data", header=1) and drops the ID column, while .csv continues to use pd.read_csv(...). That should resolve the failure path around Line 193.

🐇👍

🧠 Learnings used

Learnt from: lotif Repo: VectorInstitute/midst-toolkit PR: 107 File: examples/gan/synthesize.py:1-47 Timestamp: 2025-12-11T16:08:49.024Z Learning: When using SDV (version >= 1.18.0), prefer loading a saved CTGANSynthesizer with CTGANSynthesizer.load(filepath) instead of sdv.utils.load_synthesizer(). This applies to Python code across the repo (e.g., any script that loads a CTGANSynthesizer). Ensure the SDV version is >= 1.18.0 before using CTGANSynthesizer.load, and fall back to sdv.utils.load_synthesizer() only if a compatible alternative is required.

emersodb

Most of my comments are fairly small. I think the more fundamental ones that we want to fix were already highlighted by CodeRabbit, which is nice.

emersodb · 2026-06-29T14:14:30Z

    runs-on: ubuntu-latest
    steps:
-      - uses: actions/checkout@v6.0.2
+      - uses: actions/checkout@v7.0.0


@lotif: I guess we can do this, but I don't see it as essential.

emersodb · 2026-06-29T14:14:52Z

+        uses: actions/checkout@v7.0.0

      - name: Install uv
-        uses: astral-sh/setup-uv@v8.1.0
+        uses: astral-sh/setup-uv@v8.2.0


emersodb · 2026-06-29T14:17:33Z

-            "data_dir": config.base_data_dir,
-            "workspace_dir": str(Path(config.base_data_dir) / "shadow_workspace"),
+            "data_dir": base_data_dir,
+            "workspace_dir": str(Path(base_data_dir) / "shadow_workspace"),


Super minor, but I think you can avoid the str -> Path -> str conversion here with

os.join.path(base_data_dir, "shadow_workspace")

emersodb · 2026-06-29T14:18:37Z

+    compute_attack_success_for_given_targets(
+        target_model_config=config.ensemble_attack.target_model,
+        # TODO: refactor this to work better outside of the challenge context (i.e. no target ID)
+        # No target ID needed for CTGAN, but it needs at least one element in this array. The value does not matter.


This says CTGAN. I think you mean Tabsyn?

emersodb · 2026-06-29T14:23:55Z

+    run_metaclassifier_training: true
+
+  shadow_training:
+    model_name: ctgan


tabsyn? Is this mis-direction to ctgan problematic for the attack?

emersodb · 2026-06-29T15:29:45Z

+
+        Args:
+            dataset: The dataset as a pandas DataFrame.
+            synthesize: Flag indicating whether to generate synthetic data after training.
+                The number of points to synthesize and the save directory is controlled by
+                the `number_of_points_to_synthesize` and `save_dir` attributes of the training
+                config. Optional, default is True.


@lotif: This is a good point.

emersodb · 2026-06-29T15:32:37Z

+
+        Args:
+            n_head: The number of heads.
+            factor: The factor for the dimension of the hidden layer.


I don't think I understand what factor is here?

emersodb · 2026-06-29T15:34:38Z

+    test_path: Path | None = None
+    if (raw_data_dir / "train.csv").exists():
+        data_path = raw_data_dir / "train.csv"
+        test_path = raw_data_dir / "test.csv"


Maybe we should comment here that we're checking about existence to get the proper file suffix and assume that both train.x and test.x exists

emersodb · 2026-06-29T16:02:45Z

+    else:
+        data_path = data_dir / f"{data_name}.csv"
+
+    assert data_path.exists(), (


These strings violate our line length restrictions don't they?

emersodb · 2026-06-29T16:09:44Z

+    elif (raw_data_dir / "train.xls").exists():
+        data_path = raw_data_dir / "train.xls"
+        test_path = raw_data_dir / "test.xls"


@lotif: Agreed that this will be an issue.

lotif and others added 30 commits March 23, 2026 16:53

WIP pulling the model code in

dbf2a1f

Removing files that were not supposed to be submitted

ba2bbcc

Merge branch 'main' into marcelo/tabsyn

bdde267

WIP started testing

42fbabe

WIP progressed on testing

1b63a7d

Finished train test

2af4a09

Finished load and synthesize test

5eaebb5

Started fixing mypy errors

492c106

Continue fixing mypy errors

3e86710

A bunch more mypy errors fixed

bd41b3c

Fixed all errors 🎉

cffe8f9

Merge branch 'main' into marcelo/tabsyn

92a7b7f

Merge branch 'main' into marcelo/tabsyn

cbd05bc

Fixing unit tests

990e278

Skipping integration tests that need models to be retrained

101e167

adding repickled files

7db80e3

Uncommenting tests

51811b7

Adding tabsyn train code

498c21c

Using the DEVICE variable instead of doing the IF again

34686be

adding device so it runs in the cluster

db18a07

Adding synthesize script

72b7351

Actually fixing synthesize

bf3585f

Adding evaluation script

22d298d

WIP beginning ensemble attack code

fc01bae

Actually making the training config

c895320

Small vae save path fix

7ba95a2

Adding make challenge dataset

b4e6007

adding sampling to training

067e3c9

Fixing the scripts and configs

2dee575

Small code fix

1cad880

lotif and others added 19 commits April 29, 2026 16:32

Updating file link

547b102

Dropping all id columns, not only the main one

c5008ae

last fixes and scripts

04ea715

Fixing fine tuning and training data

bb681cb

Adding evaluation scripts

b0ea88b

Adding logs

b13acc6

Small fixes to evaluation script

6ca19b5

Addressing comments by coderabbit

6aa3032

Merge branch 'main' into marcelo/tabsyn

b00fc81

Addressing comments by David

c75b529

CR by David

7aef62a

Merge branch 'marcelo/tabsyn' into marcelo/tabsyn-ensemble

26d3cf2

Adding readme instructions

a31e1eb

Better comments on is_numerical handling

d04cc1d

Merge branch 'marcelo/tabsyn' into marcelo/tabsyn-ensemble

4b55ff5

Merge branch 'main' into marcelo/tabsyn

7216f9d

Merge branch 'marcelo/tabsyn' into marcelo/tabsyn-ensemble

84b0c32

Merge branch 'main' into marcelo/tabsyn-ensemble

76c5f83

small fixes

7664d3f

lotif changed the title ~~Marcelo/tabsyn ensemble~~ Adding TabSyn examples and ensemble attack code Jun 9, 2026

Merge branch 'main' into marcelo/tabsyn-ensemble

bed857c

lotif requested a review from emersodb June 9, 2026 20:43

lotif marked this pull request as ready for review June 9, 2026 20:43

Small fix

7082237

coderabbitai Bot reviewed Jun 24, 2026

View reviewed changes

emersodb reviewed Jun 29, 2026

View reviewed changes

coderabbitai Bot mentioned this pull request Jun 29, 2026

📝 CodeRabbit Chat: Update tabsyn ensemble attack README results_dir docs #151

Open

Uh oh!

Conversation

lotif commented Jun 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Type

Short Description

Tests Added

Uh oh!

coderabbitai Bot commented Jun 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Estimated code review effort

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 29, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 29, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 24, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 24, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 24, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 29, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 29, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 29, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 24, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 29, 2026

lotif commented Jun 3, 2026 •

edited

Loading

coderabbitai Bot commented Jun 9, 2026 •

edited

Loading

coderabbitai Bot Jun 24, 2026 •

edited

Loading

coderabbitai Bot Jun 24, 2026 •

edited

Loading

coderabbitai Bot Jun 24, 2026 •

edited

Loading

coderabbitai Bot Jun 24, 2026 •

edited

Loading

coderabbitai Bot Jun 24, 2026 •

edited

Loading

coderabbitai Bot Jun 24, 2026 •

edited

Loading