Open
52 commits
dbf2a1f
WIP pulling the model code in
lotif Mar 23, 2026
ba2bbcc
Removing files that were not supposed to be submitted
lotif Mar 23, 2026
bdde267
Merge branch 'main' into marcelo/tabsyn
lotif Apr 6, 2026
42fbabe
WIP started testing
lotif Apr 8, 2026
1b63a7d
WIP progressed on testing
lotif Apr 9, 2026
2af4a09
Finished train test
lotif Apr 10, 2026
5eaebb5
Finished load and synthesize test
lotif Apr 10, 2026
492c106
Started fixing mypy errors
lotif Apr 10, 2026
3e86710
Continue fixing mypy errors
lotif Apr 13, 2026
bd41b3c
A bunch more mypy errors fixed
lotif Apr 14, 2026
cffe8f9
Fixed all errors 🎉
lotif Apr 16, 2026
92a7b7f
Merge branch 'main' into marcelo/tabsyn
lotif Apr 16, 2026
cbd05bc
Merge branch 'main' into marcelo/tabsyn
lotif Apr 24, 2026
990e278
Fixing unit tests
lotif Apr 24, 2026
101e167
Skipping integration tests that need models to be retrained
lotif Apr 24, 2026
7db80e3
adding repickled files
Apr 24, 2026
51811b7
Uncommenting tests
lotif Apr 24, 2026
498c21c
Adding tabsyn train code
lotif Apr 27, 2026
34686be
Using the DEVICE variable instead of doing the IF again
lotif Apr 27, 2026
db18a07
adding device so it runs in the cluster
lotif Apr 27, 2026
72b7351
Adding synthesize script
lotif Apr 28, 2026
bf3585f
Actually fixing synthesize
lotif Apr 28, 2026
22d298d
Adding evaluation script
lotif Apr 28, 2026
fc01bae
WIP beginning ensemble attack code
lotif Apr 28, 2026
c895320
Actually making the training config
lotif Apr 28, 2026
7ba95a2
Small vae save path fix
lotif Apr 28, 2026
b4e6007
Adding make challenge dataset
lotif Apr 29, 2026
067e3c9
adding sampling to training
lotif Apr 29, 2026
2dee575
Fixing the scripts and configs
lotif Apr 29, 2026
1cad880
Small code fix
lotif Apr 29, 2026
547b102
Updating file link
lotif Apr 29, 2026
c5008ae
Dropping all id columns, not only the main one
lotif Apr 29, 2026
04ea715
last fixes and scripts
lotif Apr 30, 2026
bb681cb
Fixing fine tuning and training data
lotif May 1, 2026
b0ea88b
Adding evaluation scripts
lotif May 4, 2026
b13acc6
Adding logs
lotif May 4, 2026
6ca19b5
Small fixes to evaluation script
lotif May 4, 2026
6aa3032
Addressing comments by coderabbit
lotif May 22, 2026
b00fc81
Merge branch 'main' into marcelo/tabsyn
lotif May 25, 2026
c75b529
Addressing comments by David
lotif May 26, 2026
7aef62a
CR by David
lotif Jun 2, 2026
26d3cf2
Merge branch 'marcelo/tabsyn' into marcelo/tabsyn-ensemble
lotif Jun 3, 2026
a31e1eb
Adding readme instructions
lotif Jun 3, 2026
d04cc1d
Better comments on is_numerical handling
lotif Jun 3, 2026
4b55ff5
Merge branch 'marcelo/tabsyn' into marcelo/tabsyn-ensemble
lotif Jun 3, 2026
7216f9d
Merge branch 'main' into marcelo/tabsyn
lotif Jun 3, 2026
84b0c32
Merge branch 'marcelo/tabsyn' into marcelo/tabsyn-ensemble
lotif Jun 3, 2026
76c5f83
Merge branch 'main' into marcelo/tabsyn-ensemble
lotif Jun 3, 2026
7664d3f
small fixes
lotif Jun 9, 2026
bed857c
Merge branch 'main' into marcelo/tabsyn-ensemble
lotif Jun 9, 2026
7082237
Small fix
lotif Jun 9, 2026
67b1f03
Bump actions/checkout from 6.0.3 to 7.0.0 (#147)
dependabot[bot] Jun 24, 2026
4 changes: 2 additions & 2 deletions .github/workflows/code_checks.yml
@@ -29,10 +29,10 @@ jobs:
   run-code-check:
     runs-on: ubuntu-latest
     steps:
-      - uses: actions/checkout@v6.0.2
+      - uses: actions/checkout@v7.0.0

@coderabbitai coderabbitai Bot Jun 24, 2026


🔒 Security & Privacy | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Inspect the workflow and surrounding steps.
git ls-files .github/workflows/code_checks.yml
printf '\n--- file with line numbers ---\n'
cat -n .github/workflows/code_checks.yml | sed -n '1,220p'

printf '\n--- checkout usages in workflows ---\n'
rg -n "actions/checkout@|persist-credentials|git auth|github.token" .github/workflows -S

Repository: VectorInstitute/midst-toolkit

Length of output: 2624


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Gather deterministic evidence about the checkout reference and any later git operations.
python3 - <<'PY'
from pathlib import Path
p = Path('.github/workflows/code_checks.yml')
print(p.exists(), p)
if p.exists():
    text = p.read_text()
    print("uses_checkout", "actions/checkout@" in text)
    print("persist_credentials", "persist-credentials" in text)
    print("git_commands", any(s in text for s in ["git ", "gh ", "npm publish", "npm version", "git push", "git commit"]))
PY

Repository: VectorInstitute/midst-toolkit

Length of output: 270


Pin actions/checkout by SHA and disable persisted credentials.
.github/workflows/code_checks.yml:32 still uses the mutable v7.0.0 tag; replace it with the release commit SHA and set persist-credentials: false here, since this job doesn’t need git auth.

🧰 Tools
🪛 zizmor (1.26.1)

[warning] 32-32: credential persistence through GitHub Actions artifacts (artipacked): does not set persist-credentials: false

(artipacked)


[error] 32-32: unpinned action reference (unpinned-uses): action is not pinned to a hash (required by blanket policy)

(unpinned-uses)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/code_checks.yml at line 32, The workflow step using
actions/checkout is still pinned to a mutable version tag and leaves persisted
git credentials enabled. Update the checkout step in code_checks.yml to use the
release commit SHA for actions/checkout instead of v7.0.0, and add
persist-credentials: false since this job does not require repository auth. Use
the actions/checkout step identifier to locate the change.

Source: Linters/SAST tools

Collaborator

@lotif: I guess we can do this, but I don't see it as essential.


@emersodb Fair point — I agree this is more defense-in-depth hardening than something essential for correctness in this PR.

The recommendation still stands as a best practice for GitHub Actions, but I’m fine treating it as non-blocking and leaving it for a follow-up if you want to keep this change focused.
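
As a rough illustration of what the `unpinned-uses` finding checks, the sketch below scans workflow text for `uses:` references that are not pinned to a 40-character commit SHA. It is a simplified stand-in written for this note, not zizmor's implementation: `find_unpinned_actions` and both regular expressions are hypothetical names, and the sample input only reuses references quoted in this diff.

```python
import re

# A full commit SHA is 40 hexadecimal characters.
_SHA_RE = re.compile(r"^[0-9a-f]{40}$")
# Matches "uses: owner/repo@ref", with or without a leading list dash.
_USES_RE = re.compile(r"^\s*(?:-\s*)?uses:\s*([^\s#]+)")


def find_unpinned_actions(workflow_text: str) -> list[tuple[int, str]]:
    """Return (line_number, reference) pairs for `uses:` entries not pinned to a SHA."""
    findings = []
    for line_number, line in enumerate(workflow_text.splitlines(), start=1):
        match = _USES_RE.match(line)
        if match is None:
            continue
        reference = match.group(1)
        # Local actions and docker images are not pinned through a git ref.
        if reference.startswith("./") or reference.startswith("docker://"):
            continue
        _, _, ref = reference.partition("@")
        if not _SHA_RE.match(ref):
            findings.append((line_number, reference))
    return findings


example = """\
steps:
  - uses: actions/checkout@v7.0.0
  - name: Install uv
    uses: astral-sh/setup-uv@fac544c07dec837d0ccb6301d7b5580bf5edae39
"""
print(find_unpinned_actions(example))
```

A check like this only covers pinning. Whether `persist-credentials: false` is safe still depends on whether a later step needs git auth, as discussed in this thread.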


       - name: Install uv
-        uses: astral-sh/setup-uv@08807647e7069bb48b6ef5acd8ec9567f424441b
+        uses: astral-sh/setup-uv@fac544c07dec837d0ccb6301d7b5580bf5edae39
         with:
           # Install a specific version of uv.
           version: "0.5.21"
6 changes: 3 additions & 3 deletions .github/workflows/docs.yml
@@ -42,10 +42,10 @@ jobs:
     runs-on: ubuntu-latest
     steps:
       - name: Checkout code
-        uses: actions/checkout@v6.0.2
+        uses: actions/checkout@v7.0.0

       - name: Install uv
-        uses: astral-sh/setup-uv@v8.1.0
+        uses: astral-sh/setup-uv@v8.2.0
Comment on lines +45 to +48

@coderabbitai coderabbitai Bot Jun 24, 2026


🔒 Security & Privacy | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Inspect the workflow file around the referenced lines
nl -ba .github/workflows/docs.yml | sed -n '1,140p'

Repository: VectorInstitute/midst-toolkit

Length of output: 207


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Show the workflow file and any action refs used in it
grep -nE 'uses:|persist-credentials' .github/workflows/docs.yml

Repository: VectorInstitute/midst-toolkit

Length of output: 488


🏁 Script executed:

#!/bin/bash
set -euo pipefail

nl -ba .github/workflows/docs.yml | sed -n '1,140p'

Repository: VectorInstitute/midst-toolkit

Length of output: 207


🏁 Script executed:

#!/bin/bash
set -euo pipefail

awk 'NR>=35 && NR<=110 {printf "%4d:%s\n", NR, $0}' .github/workflows/docs.yml

Repository: VectorInstitute/midst-toolkit

Length of output: 2097


🏁 Script executed:

#!/bin/bash
set -euo pipefail

awk 'NR>=1 && NR<=140 {printf "%4d:%s\n", NR, $0}' .github/workflows/docs.yml

Repository: VectorInstitute/midst-toolkit

Length of output: 2856


🏁 Script executed:

#!/bin/bash
set -euo pipefail

python3 - <<'PY'
from pathlib import Path
p = Path('.github/workflows/docs.yml')
for i, line in enumerate(p.read_text().splitlines(), 1):
    if 35 <= i <= 110:
        print(f"{i:4d}:{line}")
PY

Repository: VectorInstitute/midst-toolkit

Length of output: 2097


Pin all workflow actions to commit SHAs and disable checkout credentials

  • docs.yml still uses mutable refs for actions/checkout, astral-sh/setup-uv, actions/setup-python, actions/upload-artifact, actions/download-artifact, and JamesIves/github-pages-deploy-action; replace them with full commit SHAs.
  • Add persist-credentials: false to both actions/checkout steps unless a later git push depends on the checkout token.
🧰 Tools
🪛 zizmor (1.26.1)

[error] 45-45: unpinned action reference (unpinned-uses): action is not pinned to a hash (required by blanket policy)

(unpinned-uses)


[error] 48-48: unpinned action reference (unpinned-uses): action is not pinned to a hash (required by blanket policy)

(unpinned-uses)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/docs.yml around lines 45 - 48, The docs workflow still
references mutable action tags and should be hardened by pinning every action
use to a full commit SHA. Update the `actions/checkout` steps in `docs.yml` and
the other workflow action references (`astral-sh/setup-uv`,
`actions/setup-python`, `actions/upload-artifact`, `actions/download-artifact`,
and `JamesIves/github-pages-deploy-action`) to immutable SHAs, and add
`persist-credentials: false` to both `actions/checkout` invocations unless a
later step needs the checkout token.

Source: Linters/SAST tools

Collaborator

Same here.


🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

repo="VectorInstitute/midst-toolkit"
pr="143"

echo "=== PR review comments mentioning workflow pinning / similar replies ==="
gh api "repos/$repo/pulls/$pr/comments?per_page=100" --paginate \
  --jq '.[] | {id, path, line, user: .user.login, in_reply_to_id, body, url} |
        select((.path | startswith(".github/workflows/")) or (.body | test("persist-credentials|unpinned|Same here|Pin all workflow actions"; "i")))' || true

echo
echo "=== Issue/PR conversation comments ==="
gh api "repos/$repo/issues/$pr/comments?per_page=100" --paginate \
  --jq '.[] | {user: .user.login, body, url}' || true

Length of output: 27247


@emersodb Fair enough — same assessment on this thread as well: this is useful hardening, but not essential to the correctness of this PR.

I’m good with treating it as non-blocking and leaving it for a follow-up if you want to keep this change focused.

         with:
           version: "0.5.21"
           enable-cache: true
@@ -77,7 +77,7 @@ jobs:
     runs-on: ubuntu-latest
     steps:
       - name: Checkout code
-        uses: actions/checkout@v6.0.2
+        uses: actions/checkout@v7.0.0

       - name: Configure Git Credentials
         run: |
4 changes: 2 additions & 2 deletions .github/workflows/integration_tests.yml
@@ -41,10 +41,10 @@ jobs:
   integration-tests:
     runs-on: ubuntu-latest
     steps:
-      - uses: actions/checkout@v6.0.2
+      - uses: actions/checkout@v7.0.0


🔒 Security & Privacy | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Locate the workflow and inspect the relevant section with line numbers.
git ls-files .github/workflows/integration_tests.yml
printf '\n--- file excerpt ---\n'
sed -n '1,120p' .github/workflows/integration_tests.yml | cat -n

Repository: VectorInstitute/midst-toolkit

Length of output: 2456


🌐 Web query:

actions/checkout latest release version and default persist-credentials behavior

💡 Result:

As of June 24, 2026, the latest release version of the actions/checkout repository is v7.0.0, released on June 18, 2026 [1][2][3].

Regarding the persist-credentials input:
- Default Behavior: The default value for persist-credentials is true [4][5][6].
- Purpose: When set to true, the action configures the Git repository with the provided authentication token (or SSH key) in the local Git configuration [4][5]. This allows your workflow scripts to run authenticated Git commands (such as git fetch or git push) directly [4][5].
- Security Implementation: The token is persisted in a separate file (distinct from the standard .git/config) to enhance security [5]. The action ensures that this token is automatically removed during the post-job cleanup phase [4][5].
- Opt-out: You can disable this behavior by explicitly setting persist-credentials: false in your workflow configuration if you do not require authenticated Git operations [4][5].

Citations:


🌐 Web query:

GitHub actions/checkout README persist-credentials default checkout version

💡 Result:

In the GitHub actions/checkout action, the persist-credentials input defaults to true [1][2]. When persist-credentials is set to its default value of true, the action configures the authentication token or SSH key in the local git configuration [1][2]. This allows subsequent steps in your workflow to run authenticated Git commands (such as git push or git fetch) [1][3]. The token or key is removed during the post-job cleanup process [1][3].

Starting with recent versions, these credentials are stored in a file under $RUNNER_TEMP to improve security, rather than directly in .git/config [3][4]. If you wish to disable this behavior, you must explicitly set persist-credentials to false in your workflow file [1].

Citations:


🌐 Web query:

site:github.com actions/checkout releases v7.0.0

💡 Result:

The actions/checkout v7.0.0 release was published on June 18, 2026 [1][2][3]. Key changes and highlights in v7.0.0 include:
* Security Update: It introduces a mechanism to block checking out fork pull requests when using pull_request_target and workflow_run events to prevent potential security vulnerabilities [1][4].
* Dependency Updates: Various dependencies were updated, including @actions/core and @actions/tool-cache, and the module was upgraded to ECMAScript modules (ESM) [1][4].
* Refinements: General improvements such as updated error wording and internal dependency management [1][4].

Users can reference the new version in their workflows using:
- uses: actions/checkout@v7 [5][6][3]

For the full list of changes, you can view the changelog on the official GitHub repository [4].

Citations:


Pin actions/checkout and drop persisted credentials.
.github/workflows/integration_tests.yml:44 still uses a mutable tag; switch to the release commit SHA and add persist-credentials: false unless a later step needs git auth.

🧰 Tools
🪛 zizmor (1.26.1)

[warning] 44-44: credential persistence through GitHub Actions artifacts (artipacked): does not set persist-credentials: false

(artipacked)


[error] 44-44: unpinned action reference (unpinned-uses): action is not pinned to a hash (required by blanket policy)

(unpinned-uses)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/integration_tests.yml at line 44, The integration_tests
workflow still references a mutable actions/checkout tag and keeps git
credentials persisted by default. Update the checkout step in the workflow to
use the release commit SHA for actions/checkout instead of the version tag, and
add persist-credentials set to false unless a later step explicitly needs
repository authentication. Use the existing checkout step in
integration_tests.yml as the target for this change.

Source: Linters/SAST tools


       - name: Install uv
-        uses: astral-sh/setup-uv@08807647e7069bb48b6ef5acd8ec9567f424441b
+        uses: astral-sh/setup-uv@fac544c07dec837d0ccb6301d7b5580bf5edae39
         with:
           # Install a specific version of uv.
           version: "0.5.21"
4 changes: 2 additions & 2 deletions .github/workflows/publish.yml
@@ -16,10 +16,10 @@ jobs:
         run: |
           sudo apt-get update
           sudo apt-get install libcurl4-openssl-dev libssl-dev
-      - uses: actions/checkout@v6.0.2
+      - uses: actions/checkout@v7.0.0


🔒 Security & Privacy | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Inspect the workflow and nearby context.
git ls-files .github/workflows/publish.yml
wc -l .github/workflows/publish.yml
cat -n .github/workflows/publish.yml | sed -n '1,120p'

Repository: VectorInstitute/midst-toolkit

Length of output: 1882


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Find whether checkout is used elsewhere with SHA pinning or persist-credentials settings.
rg -n --hidden --glob '.github/workflows/*.yml' --glob '.github/workflows/*.yaml' 'actions/checkout@|persist-credentials:' .github/workflows

Repository: VectorInstitute/midst-toolkit

Length of output: 604


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Check whether any workflow already pins actions/checkout by SHA in the repository.
rg -n --hidden 'actions/checkout@' .github/workflows . || true

Repository: VectorInstitute/midst-toolkit

Length of output: 1053


Pin actions/checkout to a SHA and disable persisted credentials.

Line 19 uses actions/checkout@v7.0.0; switch to a commit SHA and set persist-credentials: false unless this job needs git auth.

🧰 Tools
🪛 zizmor (1.26.1)

[warning] 19-19: credential persistence through GitHub Actions artifacts (artipacked): does not set persist-credentials: false

(artipacked)


[error] 19-19: unpinned action reference (unpinned-uses): action is not pinned to a hash (required by blanket policy)

(unpinned-uses)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/publish.yml at line 19, The publish workflow is using a
mutable actions reference and leaving git credentials persisted. Update the
checkout step in the publish job to use a full commit SHA for actions/checkout
instead of the version tag, and add persist-credentials set to false unless this
job explicitly needs repository auth. Keep the change localized to the checkout
step so the workflow remains otherwise unchanged.

Source: Linters/SAST tools


       - name: Install uv
-        uses: astral-sh/setup-uv@08807647e7069bb48b6ef5acd8ec9567f424441b
+        uses: astral-sh/setup-uv@fac544c07dec837d0ccb6301d7b5580bf5edae39
         with:
           # Install a specific version of uv.
           version: "0.5.21"
4 changes: 2 additions & 2 deletions .github/workflows/unit_tests.yml
@@ -41,10 +41,10 @@ jobs:
   unit-tests:
     runs-on: ubuntu-latest
     steps:
-      - uses: actions/checkout@v6.0.2
+      - uses: actions/checkout@v7.0.0


🔒 Security & Privacy | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

sed -n '1,140p' .github/workflows/unit_tests.yml

Repository: VectorInstitute/midst-toolkit

Length of output: 1747



Pin actions/checkout and disable credential persistence
.github/workflows/unit_tests.yml:44 should use a full commit SHA instead of v7.0.0, and set persist-credentials: false since no later step needs git auth.

🧰 Tools
🪛 zizmor (1.26.1)

[warning] 44-44: credential persistence through GitHub Actions artifacts (artipacked): does not set persist-credentials: false

(artipacked)


[error] 44-44: unpinned action reference (unpinned-uses): action is not pinned to a hash (required by blanket policy)

(unpinned-uses)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/unit_tests.yml at line 44, Update the checkout step in the
unit test workflow to use a pinned full commit SHA for actions/checkout instead
of the version tag, and configure the checkout action with persist-credentials
set to false since the workflow does not need git authentication afterward. Use
the existing checkout step in unit_tests.yml as the target and make sure the
action reference and its options are adjusted together.

Source: Linters/SAST tools


       - name: Install uv
-        uses: astral-sh/setup-uv@08807647e7069bb48b6ef5acd8ec9567f424441b
+        uses: astral-sh/setup-uv@fac544c07dec837d0ccb6301d7b5580bf5edae39
         with:
           # Install a specific version of uv.
           version: "0.5.21"
2 changes: 1 addition & 1 deletion examples/gan/README.md
@@ -94,7 +94,7 @@ Kolmogorov-Smirnov (KS) test, Total Variation Distance (TVD), Correlation Matrix
 and Mutual Information Difference.

 To compute those metrics, you can run the command below. The name of the table should be
-defined in the `dataset_meta.json` file, and the file for synthetic data should be under
+defined in the `dataset_meta.json` file, and the data files should be under
 `/data/{table_name}.csv` for the real data and `/results/{table_name}_synthetic.csv`
 for the synthetic data.

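
For reference, Total Variation Distance (one of the metrics named in the hunk header above) between two categorical columns is half the sum of absolute differences between their category frequencies. The sketch below is a minimal standard-library version for illustration only. It follows the general definition and is not the code in `evaluate.py`; `total_variation_distance` and the sample columns are hypothetical.

```python
from collections import Counter
from collections.abc import Hashable, Sequence


def total_variation_distance(real: Sequence[Hashable], synthetic: Sequence[Hashable]) -> float:
    """Return the TVD between the empirical category distributions of two columns."""
    if not real or not synthetic:
        raise ValueError("Both columns must contain at least one value.")
    real_counts = Counter(real)
    synthetic_counts = Counter(synthetic)
    # Categories missing from one column count as frequency 0 there.
    categories = set(real_counts) | set(synthetic_counts)
    return 0.5 * sum(
        abs(real_counts[c] / len(real) - synthetic_counts[c] / len(synthetic))
        for c in categories
    )


# Made-up columns for illustration: frequencies are (0.5, 0.5) vs (0.75, 0.25).
real_column = ["credit", "credit", "debit", "debit"]
synthetic_column = ["credit", "credit", "credit", "debit"]
print(total_variation_distance(real_column, synthetic_column))  # 0.25
```

A value of 0 means the two columns have identical category frequencies, and 1 means they share no categories.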
6 changes: 1 addition & 5 deletions examples/gan/ensemble_attack/README.md
@@ -60,10 +60,6 @@ python -m examples.gan.ensemble_attack.make_challenge_dataset

 ## 4. Training the attack model

-> [!NOTE]
-> In the [`config.yaml`](config.yaml) file, the attribute `ensemble_attack.shadow_training.model_name`
-> is what determines this attack will be run with the CTGAN model.
-
 To train the attack models, execute the following command:

 ```bash
@@ -81,7 +77,7 @@ To test the attack model against the target model and synthetic data produced on
 [step 2](#2-generating-target-synthetic-data-to-be-tested), please run:

 ```bash
-python -m examples.gan.ensemble_attack.test_attack_model
+python -m examples.gan.ensemble_attack.run_test_attack_model
 ```

 ## 6. Compute the attack success
1 change: 0 additions & 1 deletion examples/gan/ensemble_attack/config.yaml
@@ -36,7 +36,6 @@ ensemble_attack:
     run_metaclassifier_training: true

   shadow_training:
-    model_name: ctgan
     model_config: # Configurations specific for the CTGAN model
       training:
         epochs: 300
2 changes: 2 additions & 0 deletions examples/gan/ensemble_attack/make_challenge_dataset.py
@@ -39,6 +39,8 @@ def make_challenge_dataset(config: DictConfig) -> None:
     log(INFO, f"Saving challenge labels to {challenge_label_path}")
     np.save(challenge_label_path, challenge_data_labels)

+    log(INFO, "Done!")
+


 if __name__ == "__main__":
     make_challenge_dataset()
14 changes: 11 additions & 3 deletions examples/gan/ensemble_attack/utils.py
@@ -40,6 +40,14 @@ def make_training_config(config: DictConfig) -> dict[Any, Any]:
     Returns:
         The ensemble attack training config for the CTGAN model.
     """
+    base_data_dir: str
+    if "base_data_dir" in config:
+        base_data_dir = config.base_data_dir
+    elif "data_dir" in config:
+        base_data_dir = config.data_dir
+    else:
+        raise ValueError("Either base_data_dir or data_dir must be provided in the config.")
+
     # Saving the model config from the config.yaml into a json file
     # because that's what the ensemble attack code will be looking for
     training_config_path = Path(config.ensemble_attack.shadow_training.training_json_config_paths.training_config_path)
@@ -48,10 +56,10 @@ def make_training_config(config: DictConfig) -> dict[Any, Any]:
         training_config = OmegaConf.to_container(config.ensemble_attack.shadow_training.model_config, resolve=True)
         assert isinstance(training_config, dict), "Training config must be a dictionary."
         training_config["general"] = {
-            "test_data_dir": config.base_data_dir,
+            "test_data_dir": base_data_dir,
             "sample_prefix": "ctgan",
-            "data_dir": config.base_data_dir,
-            "workspace_dir": str(Path(config.base_data_dir) / "shadow_workspace"),
+            "data_dir": base_data_dir,
+            "workspace_dir": str(Path(base_data_dir) / "shadow_workspace"),
Collaborator

Super minor, but I think you can avoid the str -> Path -> str conversion here with

os.path.join(base_data_dir, "shadow_workspace")

             "exp_name": "pre_trained_model",
         }
         json.dump(training_config, f)
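
The fallback added to `make_training_config` in this hunk can be sketched in isolation as below. This is an illustrative version over a plain mapping rather than the `DictConfig` the real function receives; `resolve_base_data_dir` is a hypothetical helper name and the sample path is made up for the example. It builds the workspace path with `os.path.join`, in line with the review comment above.

```python
import os
from collections.abc import Mapping
from typing import Any


def resolve_base_data_dir(config: Mapping[str, Any]) -> str:
    """Prefer `base_data_dir`, fall back to `data_dir`, otherwise fail loudly."""
    for key in ("base_data_dir", "data_dir"):
        if key in config:
            return str(config[key])
    raise ValueError("Either base_data_dir or data_dir must be provided in the config.")


# Hypothetical config that only defines the fallback key.
config = {"data_dir": "examples/gan/data"}
base_data_dir = resolve_base_data_dir(config)
workspace_dir = os.path.join(base_data_dir, "shadow_workspace")
print(workspace_dir)
```

Iterating over the keys in priority order keeps the precedence explicit and makes it easy to add another fallback later.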
92 changes: 92 additions & 0 deletions examples/tabsyn/README.md
@@ -0,0 +1,92 @@
# TabSyn Single Table Example

This example will go over training a single-table [TabSyn](https://arxiv.org/abs/2310.09656)
model and synthesizing data afterwards.


## Downloading data

First, we need the data. Download it from this
[Google Drive link](https://drive.google.com/file/d/1HTgfgeL5GXc8uAGfeQirJrUynK7vFeyb/view?usp=drive_link),
extract the files and place them in a `/data` folder within this folder
(`examples/tabsyn`).

> [!NOTE]
> If you wish to change the data folder, you can do so by editing the `base_data_dir` attribute
> of the [`config.yaml`](config.yaml) file.

Here is a description of the files that have been extracted:
- `trans.csv`: The training data. It consists of information about bank transactions and it
contains 20,000 data points.
- `trans_info.json`: Metadata about the `trans.csv` data, with information such as which columns are
numerical and which are categorical, what is the task type, etc.


## Kicking off training

To kick off training, simply run the command below from the project's root folder:

```bash
python -m examples.tabsyn.train
```


## Training results

The result files will be saved inside a `/results` folder within this folder
(`examples/tabsyn`).

> [!NOTE]
> If you wish to change the save folder, you can do so by editing the `results_dir` attribute
> of the [`config.yaml`](config.yaml) file.

In the `/results/trans` folder, there will be a file called `model.pt`,

Collaborator

Given the path discussions above, just want to make sure this is the right path for the model checkpoint?

which is a PyTorch saved model.


## Synthesizing data

To synthesize some data with the trained model, run:

```bash
python -m examples.tabsyn.synthesize
```

If there is already a trained model in the `/results` folder, it will use that model.
Otherwise it will train one from scratch. At the end of the script, it will save the
synthesized data to `/results/trans/synthetic_data/trans_synthetic.csv`.
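
The load-or-train behaviour described above follows a common pattern, sketched here under stated assumptions: `load_or_train`, `load_model`, and `train_model` are hypothetical names used for illustration and are not the actual code behind `examples.tabsyn.synthesize`.

```python
from pathlib import Path
from typing import Callable, TypeVar

Model = TypeVar("Model")


def load_or_train(
    model_path: Path,
    load_model: Callable[[Path], Model],
    train_model: Callable[[Path], Model],
) -> Model:
    """Reuse the checkpoint at `model_path` if it exists, otherwise train a new one.

    `train_model` is expected to save its checkpoint to `model_path`.
    """
    if model_path.is_file():
        return load_model(model_path)
    # Make sure the results folder exists before training writes into it.
    model_path.parent.mkdir(parents=True, exist_ok=True)
    return train_model(model_path)
```

Passing the loader and trainer as callables keeps the sketch independent of any particular model class. A real script would call its own training and loading functions here.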

Collaborator

Same with this guy.



## Evaluating the quality of the synthetic data

### Alpha Precision

To run a round of evaluation with [Alpha Precision](https://arxiv.org/abs/2301.07573)
metrics on a set of synthetic data, run the `evaluate.py` script:

```bash
python -m midst_toolkit.evaluation.quality.scripts.midst_alpha_precision_eval \
--synthetic_data_path examples/tabsyn/results/trans/synthetic_data/trans_synthetic.csv \
--real_data examples/tabsyn/data/trans_sampled.csv \
--meta_info_path examples/gan/data/meta_info.json \
--save_directory examples/gan/results/
```
Comment on lines +68 to +73

@coderabbitai coderabbitai Bot Jun 24, 2026


📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

Alpha Precision command points at examples/gan/ paths.

--meta_info_path and --save_directory reference examples/gan/..., which looks like a leftover from the CTGAN example. For the TabSyn example these should point under examples/tabsyn/.

📝 Suggested fix
   --real_data examples/tabsyn/data/trans_sampled.csv \
-  --meta_info_path examples/gan/data/meta_info.json \
-  --save_directory examples/gan/results/
+  --meta_info_path examples/tabsyn/data/meta_info.json \
+  --save_directory examples/tabsyn/results/
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
python -m midst_toolkit.evaluation.quality.scripts.midst_alpha_precision_eval \
--synthetic_data_path examples/tabsyn/results/trans/synthetic_data/trans_synthetic.csv \
--real_data examples/tabsyn/data/trans_sampled.csv \
--meta_info_path examples/gan/data/meta_info.json \
--save_directory examples/gan/results/
```
python -m midst_toolkit.evaluation.quality.scripts.midst_alpha_precision_eval \
--synthetic_data_path examples/tabsyn/results/trans/synthetic_data/trans_synthetic.csv \
--real_data examples/tabsyn/data/trans_sampled.csv \
--meta_info_path examples/tabsyn/data/meta_info.json \
--save_directory examples/tabsyn/results/
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/tabsyn/README.md` around lines 68 - 73, The Alpha Precision example
command is still using CTGAN paths instead of TabSyn paths. Update the command
in the TabSyn README so the midst_alpha_precision_eval invocation uses the
TabSyn-specific meta info and output directory locations, matching the existing
examples/tabsyn/ structure rather than examples/gan/. Keep the rest of the
command unchanged and verify the referenced synthetic and real data paths remain
correct for the TabSyn example.

Collaborator

@lotif: Seconded 🙂


@emersodb Thanks for confirming — this still looks like a valid README fix for the TabSyn example.

🐇🙂


It will save the evaluation results under the `/results/model.txt` file.

### Additional Metrics

The calculation of additional metrics is set up in the `evaluate.py` file. They are the
Kolmogorov-Smirnov (KS) test, Total Variation Distance (TVD), Correlation Matrix Difference
and Mutual Information Difference.

To compute those metrics, you can run the command below. The data files should
be under `/data/{table_name}.csv` for the real data, `/data/{table_name}_samples.csv`
for the sampleed data used for training, and `/results/{table_name}_synthetic.csv`
for the synthetic data.
Comment on lines +83 to +86

@coderabbitai coderabbitai Bot Jun 24, 2026

📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

Filename inconsistency: _samples.csv vs _sampled.csv.

The README documents the sampled training file as /data/{table_name}_samples.csv, but train.py writes {table_name}_sampled.csv (see _sample_data_if_needed). Also "sampleed" is a typo. Align the docs with the actual filename to avoid confusion.

📝 Suggested fix
-be under `/data/{table_name}.csv` for the real data, `/data/{table_name}_samples.csv`
-for the sampleed data used for training, and `/results/{table_name}_synthetic.csv`
+be under `/data/{table_name}.csv` for the real data, `/data/{table_name}_sampled.csv`
+for the sampled data used for training, and `/results/{table_name}_synthetic.csv`
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-To compute those metrics, you can run the command below. The data files should
-be under `/data/{table_name}.csv` for the real data, `/data/{table_name}_samples.csv`
-for the sampleed data used for training, and `/results/{table_name}_synthetic.csv`
-for the synthetic data.
+To compute those metrics, you can run the command below. The data files should
+be under `/data/{table_name}.csv` for the real data, `/data/{table_name}_sampled.csv`
+for the sampled data used for training, and `/results/{table_name}_synthetic.csv`
+for the synthetic data.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/tabsyn/README.md` around lines 83 - 86, The README has a filename
mismatch and a typo: it documents the sampled training data as
`{table_name}_samples.csv` and says “sampleed”, but the actual writer in
`train.py` uses `_sample_data_if_needed` to produce `{table_name}_sampled.csv`.
Update the documentation in `examples/tabsyn/README.md` to match the real
filename used by `_sample_data_if_needed` and correct the spelling so the data
path instructions are consistent.

@lotif: This is also good to fix up, here and other readmes

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "== Matches for _samples.csv in README files =="
rg -n --glob '**/README.md' '_samples\.csv'

echo
echo "== Matches for sampleed in README files =="
rg -n --glob '**/README.md' 'sampleed'

echo
echo "== Matches for _sampled.csv in README files =="
rg -n --glob '**/README.md' '_sampled\.csv'

Length of output: 682


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "== examples/tabsyn/README.md relevant lines =="
sed -n '78,90p' examples/tabsyn/README.md || true

echo
echo "== Matches for _samples.csv in README files =="
rg -n --glob '**/README.md' '_samples\.csv' || true

echo
echo "== Matches for sampleed in README files =="
rg -n --glob '**/README.md' 'sampleed' || true

echo
echo "== Matches for _sampled.csv in README files =="
rg -n --glob '**/README.md' '_sampled\.csv' || true

echo
echo "== README lines mentioning sampled/sample data =="
rg -n --glob '**/README.md' 'sampled|sample data|samples\.csv|sampled\.csv' || true

Length of output: 1382


@emersodb Agreed. I checked the current branch: the exact _samples.csv + sampleed issue is present in examples/tabsyn/README.md, and I didn’t find the same literal pattern in other README.md files. Still, a quick consistency pass across the other example READMEs while touching docs here sounds worthwhile.


```bash
python -m examples.tabsyn.evaluate
```

The results will be saved in the `/results/evaluation.json` file.
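
The file layout the README describes for the evaluation step can be sketched as a small path helper. This is only an illustration, not code from the PR: `expected_data_files` and `missing_files` are hypothetical names, and the `_sampled` suffix follows the review comment above, which reports that `train.py` writes `{table_name}_sampled.csv`.

```python
from pathlib import Path


def expected_data_files(data_dir: str, results_dir: str, table_name: str) -> dict[str, Path]:
    """Return the three CSV paths the README describes for one table.

    Hypothetical helper: the "_sampled" suffix is taken from the review
    comment, not verified against train.py.
    """
    data = Path(data_dir)
    results = Path(results_dir)
    return {
        "real": data / f"{table_name}.csv",
        "sampled": data / f"{table_name}_sampled.csv",
        "synthetic": results / f"{table_name}_synthetic.csv",
    }


def missing_files(paths: dict[str, Path]) -> list[str]:
    """List the roles whose file does not exist on disk."""
    return [role for role, path in paths.items() if not path.is_file()]


# Directory names mirror data_dir and results_dir from examples/tabsyn/config.yaml.
paths = expected_data_files("examples/tabsyn/data", "examples/tabsyn/results", "trans")
for role, path in paths.items():
    print(f"{role}: {path}")
```

Calling `missing_files(paths)` before running `python -m examples.tabsyn.evaluate` would surface a wrong suffix early, which is the kind of mismatch the review comment points out.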
85 changes: 85 additions & 0 deletions examples/tabsyn/config.yaml
@@ -0,0 +1,85 @@
data_dir: examples/tabsyn/data
results_dir: examples/tabsyn/results
tabsyn_config: examples/tabsyn/tabsyn_config.toml
table_name: trans

training:
  sample_size: 20000

evaluation:
  # Reporting Configuration
  write_report: True
  metric_report_path: ${results_dir}/evaluation_results.txt

  # Privacy Evaluation Configurations
  dcr:
    run: True
    norm: "l2"
    batch_size: 1000

  median_dcr:
    run: True
    norm: "l2"
    batch_size: 1000

  hitting_rate:
    run: True
    hitting_threshold: 0.03

  eir:
    run: True
    norm: "gower"

  nndr:
    run: True
    norm: "l2"
    batch_size: 1000

  # Quality Evaluation Configurations
  ks_tv:
    run: True
    significance_level: 0.05
    permutations: 1000

  alpha_precision:
    run: True
    naive_only: False

  ci_overlap:
    run: True
    confidence_level: 95

  correlation_diff:
    run: True
    compute_mixed_correlations: True

  mean_diff:
    run: True

  f1_score_diff:
    run: True
    label_column: "trans_type"
    folds: 5
    f1_type: "macro"

  regression_score_diff:
    run: True
    label_column: "trans_type"
    preprocess_labels: True
    measure_metrics_in_original_label_space: False
    verbose: True
    convert_label_to_float: True

  hellinger:
    run: True
    include_numerical_columns: True

  propensity_mse:
    run: True
    folds: 5
    max_iterations: 50
    solver: "liblinear"

  mutual_information:
    run: True
    include_numerical_columns: False
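
As a rough illustration of how this config might be consumed, the sketch below resolves the `${results_dir}` placeholder and lists the metrics whose `run` flag is set. It makes several assumptions: the metric blocks are nested under `evaluation`, the dictionary is a hand-copied subset of the file, and `string.Template` stands in for whatever loader the project actually uses (not shown in this diff). `resolve` and `enabled_metrics` are hypothetical names.

```python
from string import Template

# Hand-copied subset of examples/tabsyn/config.yaml, assuming the metric
# blocks are nested under `evaluation`.
config = {
    "results_dir": "examples/tabsyn/results",
    "evaluation": {
        "write_report": True,
        "metric_report_path": "${results_dir}/evaluation_results.txt",
        "dcr": {"run": True, "norm": "l2", "batch_size": 1000},
        "hitting_rate": {"run": True, "hitting_threshold": 0.03},
        "eir": {"run": True, "norm": "gower"},
    },
}


def resolve(value: str, root: dict) -> str:
    """Substitute ${key} placeholders using the top-level string entries of root."""
    top_level = {key: val for key, val in root.items() if isinstance(val, str)}
    return Template(value).substitute(top_level)


def enabled_metrics(evaluation: dict) -> list[str]:
    """Return the names of metric blocks that have a truthy `run` flag."""
    return [
        name
        for name, options in evaluation.items()
        if isinstance(options, dict) and options.get("run")
    ]


report_path = resolve(config["evaluation"]["metric_report_path"], config)
print(report_path)
print(enabled_metrics(config["evaluation"]))
```

Keeping a per-metric `run` flag lets a single file describe every available metric while individual runs switch off the expensive ones.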