Merged
2 changes: 1 addition & 1 deletion claude-skills/ruleset-builder/.claude-plugin/plugin.json
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
{
"name": "ruleset-builder",
"version": "1.0.0",
"version": "1.1.0",
"description": "Convert auto-generated DataMasque rulesets into production-ready form. Validate and iterate.",
"author": { "name": "DataMasque Ltd" },
"repository": "https://github.com/datamasque/datamasque-cli",
109 changes: 55 additions & 54 deletions claude-skills/ruleset-builder/skills/ruleset-builder/SKILL.md
Collaborator Author

Will apply the following small fixes before merge:

  • Change Step 0-4 to Step 1-5
  • Align tables in SKILL.md

@@ -12,29 +12,34 @@ Transform auto-generated DataMasque rulesets into production-ready rulesets with
2. **`hash_columns`** — on every applicable `mask_table` task for deterministic consistency
3. **Clean structure** — `skip_defaults`, no doc blocks, validated

**4-step process. Complete all 4 steps. Report after each step before proceeding.**
FK cascade is automatic: mask the parent PK with `imitate_unique` (or `imitate_uuid` / `imitate_nz_ird`) and the engine replicates the rule onto every FK column referencing it. **Do NOT add explicit rules for FK columns.** Avoid `from_unique_imitate` and `mask_unique_key` (both deprecated). Never skip IDs.

Use `TaskCreate` for all 4 steps before starting. The prompt must include business domain and application type — ask if missing.
5-step process (1–5). Use `TaskCreate` to track all 5; report after each step before proceeding. The prompt must include business domain and application type — ask if missing.

---

## Step 0: Report version
Report: **Version 1.5**
## Step 1: Report versions

Report the Ruleset Builder version (from `plugin.json`) and `dm version` so the operator can correlate output with releases.

---

## Step 1: Read reference docs
## Step 2: Read reference docs

Canonical mask reference:
<https://portal.datamasque.com/portal/documentation/latest/masking-functions-overview.html>

Read all three before any other work:
Read all of these before any other work:
```
${CLAUDE_PLUGIN_ROOT}/skills/ruleset-builder/references/fk-cascade.md
${CLAUDE_PLUGIN_ROOT}/skills/ruleset-builder/references/mask-definitions-guide.md
${CLAUDE_PLUGIN_ROOT}/skills/ruleset-builder/references/hash-columns-guide.md
${CLAUDE_PLUGIN_ROOT}/skills/ruleset-builder/references/ruleset-yaml-reference.md
```

---

## Step 2: Extract ruleset_library
## Step 3: Extract ruleset_library

Write a Python script using `ruamel.yaml` (`uv pip install ruamel.yaml`).

@@ -52,50 +57,48 @@ masks:
### Classification rules (apply in order)

**1. ID columns** — any column ending in `_ID`, `_NO`, `_NR`, `_NBR` is an entity identifier.
- Strip adjective/verb prefixes before the noun: `PREVIOUS_`, `OLD_`, `TRANSFERRED_`, `PRIOR_`, `CURR_`, `NEW_`, `NEXT_`, `ALT_`, `PARENT_`, `CHILD_`, `SOURCE_`, `TARGET_`, `ORIG_`, `PENDING_`, `ARCHIVED_`, `DELETED_`
- Extract the core entity: `PREVIOUS_INVOICE_ID` → `invoice`, `TRANSFERRED_ACCOUNT_ID` → `account`, `INVOICE_ACCOUNT_ID` → `invoice_account` (compound kept — no prefix stripped)
- Group all derivatives to one rule: `$ref: "Global/RuleLib#masks/{entity}_id"`
- Library entry: `type: imitate_unique`, `seed: "{entity}"` — **seed is required**
- This overrides whatever mask was originally generated (even `imitate_unique`, `from_random_number`, etc.)
- **FK side: drop the rule entirely.** If an ID column is a foreign key (the table's `Foreign Keys` metadata in the discovery CSV has an entry for it), do NOT emit a rule for it. The engine cascades automatically from the parent PK rule. See `fk-cascade.md`.
- **PK side: use `imitate_unique` with `seed:`.** Strip adjective/verb prefixes before the noun: `PREVIOUS_`, `OLD_`, `TRANSFERRED_`, `PRIOR_`, `CURR_`, `NEW_`, `NEXT_`, `ALT_`, `PARENT_`, `CHILD_`, `SOURCE_`, `TARGET_`, `ORIG_`, `PENDING_`, `ARCHIVED_`, `DELETED_`. Extract the core entity (`PREVIOUS_INVOICE_ID` → `invoice`).
- Library entry name: `{entity}_id`. Reference it as `$ref: "Global/RuleLib#masks/{entity}_id"`.
- Library entry body: `type: imitate_unique`, `seed: "{entity}"`. The `seed` is optional but recommended: it namespaces by entity so unrelated IDs don't collide (e.g. `customer.id=42` doesn't mask to the same value as `product.id=42`). The seed does not affect FK cascade.
- This overrides whatever mask was originally generated (even `from_random_number`).
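
The PK-side naming rules above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the skill's actual script: the helper names `entity_from_id_column` and `library_entry` are hypothetical, and stripping prefixes repeatedly (rather than once) is an assumption. The suffix list, prefix list, and `Global/RuleLib` placeholder are taken from the rules above.

```python
# Hypothetical helpers illustrating the ID-column naming rules.
ID_SUFFIXES = ("_ID", "_NO", "_NR", "_NBR")
PREFIXES = (
    "PREVIOUS_", "OLD_", "TRANSFERRED_", "PRIOR_", "CURR_", "NEW_",
    "NEXT_", "ALT_", "PARENT_", "CHILD_", "SOURCE_", "TARGET_",
    "ORIG_", "PENDING_", "ARCHIVED_", "DELETED_",
)


def entity_from_id_column(column):
    """Return the core entity name for an ID column, or None if not an ID column."""
    name = column.upper()
    suffix = next((s for s in ID_SUFFIXES if name.endswith(s)), None)
    if suffix is None:
        return None
    core = name[: -len(suffix)]
    # Assumption: strip leading adjective/verb prefixes until none remain,
    # but never strip the last remaining word.
    stripped = True
    while stripped:
        stripped = False
        for prefix in PREFIXES:
            if core.startswith(prefix) and len(core) > len(prefix):
                core = core[len(prefix):]
                stripped = True
    return core.lower() or None


def library_entry(entity):
    """Build the library entry name, body, and $ref string for an entity."""
    name = f"{entity}_id"
    body = {"type": "imitate_unique", "seed": entity}
    ref = f"Global/RuleLib#masks/{name}"  # placeholder namespace/library
    return name, body, ref
```

Compound names such as `INVOICE_ACCOUNT_ID` are kept whole because `INVOICE_` is not in the prefix list.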

**2. Named patterns** — detect by mask structure:

| Pattern | Detection | Library rule |
|---------|-----------|--------------|
| Email | `chain(concat(concat(firstName+lastName, glue='.')+email_suffix)+transform_case(lower))` | `email_address` |
| Full name | `chain(concat(firstName+lastName, glue=' ')+take_substring)` OR plain `concat(firstName+lastName, glue=' ')` — column not containing USERNAME/LOGIN | `full_name` |
| Username | Same mask as full_name but column name contains USERNAME, USER_NAME, LOGIN, LOGON | `username` |
| First name only | `from_file` with firstNames seed | `name_first` |
| Last name only | `from_file` with lastNames seed | `name_last` |
| DOB | Column name contains DOB/BIRTH/DATE_OF_BIRTH — use `retain_age` regardless of original type | `dob` |
| Company | `chain(from_file(companies)+take_substring)` | `company_name` |
| Country name | `from_file(country_codes, seed_column=name)` | `country_name` |
| Country alpha-2 | `from_file(country_codes, seed_column=alpha_2)` | `country_code_2` |
| Country alpha-3 | `from_file(country_codes, seed_column=alpha_3)` | `country_code_3` |
| Phone/fax | `imitate` on column name containing PHONE, TEL, FAX, MOBILE, CELL | `phone` |
| Address line 1 | `from_file(addresses, seed_column=street_address)` on LINE_1/ADDRESS_LINE_1 columns | `address_line1` |
| Address line N | Same for LINE_2, LINE_3 etc. | `address_lineN` |
| Address full | `from_file(addresses, seed_column=street_address)` on non-line-numbered columns | `address_full` |
| Address expr | `concat(address+city+state+postcode, glue=', ')` | `network_address_expr` |
| City | `from_file(addresses, seed_column=city)` | `city` |
| Postcode | `from_file(addresses, seed_column=postcode)` | `post_code` |
| Suburb | `from_file(addresses, seed_column=suburb)` | `suburb` |
| Occupation | `from_file(occupations)` | `occupation` |
| Pattern | Detection | Library rule |
|-----------------|-----------------------------------------------------------------------------------------------------------------------------------------------------|------------------------|
| Email | `chain(concat(concat(firstName+lastName, glue='.')+email_suffix)+transform_case(lower))` | `email_address` |
| Full name | `chain(concat(firstName+lastName, glue=' ')+take_substring)` OR plain `concat(firstName+lastName, glue=' ')` — column not containing USERNAME/LOGIN | `full_name` |
| Username | Same mask as full_name but column name contains USERNAME, USER_NAME, LOGIN, LOGON | `username` |
| First name only | `from_file` with firstNames seed | `name_first` |
| Last name only | `from_file` with lastNames seed | `name_last` |
| DOB | Column name contains DOB/BIRTH/DATE_OF_BIRTH — use `retain_age` regardless of original type | `dob` |
| Company | `chain(from_file(companies)+take_substring)` | `company_name` |
| Country name | `from_file(country_codes, seed_column=name)` | `country_name` |
| Country alpha-2 | `from_file(country_codes, seed_column=alpha_2)` | `country_code_2` |
| Country alpha-3 | `from_file(country_codes, seed_column=alpha_3)` | `country_code_3` |
| Phone/fax | `imitate` on column name containing PHONE, TEL, FAX, MOBILE, CELL | `phone` |
| Address line 1 | `from_file(addresses, seed_column=street_address)` on LINE_1/ADDRESS_LINE_1 columns | `address_line1` |
| Address line N | Same for LINE_2, LINE_3 etc. | `address_lineN` |
| Address full | `from_file(addresses, seed_column=street_address)` on non-line-numbered columns | `address_full` |
| Address expr | `concat(address+city+state+postcode, glue=', ')` | `network_address_expr` |
| City | `from_file(addresses, seed_column=city)` | `city` |
| Postcode | `from_file(addresses, seed_column=postcode)` | `post_code` |
| Suburb | `from_file(addresses, seed_column=suburb)` | `suburb` |
| Occupation | `from_file(occupations)` | `occupation` |

**3. Remaining** — group by column name concept. Where column names share a root (e.g., `RESULT3_VALUE`, `RESULT5_VALUE` → `result_value`; `GENERAL_2`, `GENERAL_6` → `general`), use one shared rule. Strip adjective prefixes. Use first occurrence's parameters.

- `imitate_unique` (non-ID cols) → `{col_group}: type: imitate_unique, seed: "{col_group}"` — **seed is required**
- `imitate_unique` (non-ID cols) → `{col_group}: type: imitate_unique, seed: "{col_group}"` (seed recommended for namespacing; see ID columns section).
- `from_random_date` → `{col_group}: type: from_random_date, min/max from first occurrence`
- `from_random_number` → `{col_group}: type: from_random_number, min/max from first occurrence`
- `imitate` (non-phone) → `{col_group}: type: imitate`
- String catch-all → `{col_group}: type: imitate_unique, seed: "{col_group}"` (use `imitate` only for types `imitate_unique` can't handle, e.g. datetime, bool).
- Complex chains → keep structure, group by column name
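
The grouping of remaining columns by shared root could look like the sketch below. It assumes that the only thing distinguishing sibling columns is an embedded number; the function name `col_group` is hypothetical, and adjective-prefix stripping is left out for brevity.

```python
import re


def col_group(column):
    """Derive a shared rule name by dropping digits from a column name."""
    no_digits = re.sub(r"\d+", "", column)
    # Removing digits can leave doubled or trailing underscores; tidy them up.
    collapsed = re.sub(r"_+", "_", no_digits).strip("_")
    return collapsed.lower()
```

With this, `RESULT3_VALUE` and `RESULT5_VALUE` both map to `result_value`, and `GENERAL_2` and `GENERAL_6` both map to `general`.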

> **Critical rule:** Every `imitate_unique` entry in `ruleset_library.yaml` MUST have a `seed` value.
> - Entity ID rules: `seed: "{entity_name}"` (e.g., `account_id` → `seed: "account"`)
> - All other `imitate_unique` rules: `seed: "{rule_name}"` (e.g., `field_name` → `seed: "field_name"`)

### Output format

`Global/RuleLib` below is a placeholder for `<namespace>/<library_name>` — substitute the operator's real values, and create the library with `dm libraries create` before running the ruleset.

```yaml
version: '1.0'
skip_defaults:
@@ -116,11 +119,11 @@ tasks:

Do NOT write a custom YAML serializer. Use `ruamel.yaml` round-trip dumper. Use `DoubleQuotedScalarString` for `$ref` values.

**Report:** "Step 2 done — extracted N rule library definitions: [list each name and usage count]."
**Report:** "Step 3 done — extracted N rule library definitions: [list each name and usage count]."

---

## Step 3: Add hash_columns
## Step 4: Add hash_columns

Write a Python script that:

@@ -146,30 +149,28 @@ Build a lookup of `(schema, table)` → columns with constraint and FK metadata:

4. Write to output file

**Report:** "Step 3 done — added hash_columns to N tables, skipped M (all-unique), skipped K (no suitable key). Top hash columns: [column → count]."
**Report:** "Step 4 done — added hash_columns to N tables, skipped M (all-unique), skipped K (no suitable key). Top hash columns: [column → count]."

---

## Step 4: Validate and clean up
## Step 5: Validate and clean up

Remove any comment lines containing `ROWID`.
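
One way to do this clean-up on the raw file text is sketched below. It assumes a "comment line" means a line whose first non-blank character is `#`; the function name `drop_rowid_comments` is hypothetical, and trailing inline comments are deliberately left alone.

```python
def drop_rowid_comments(text):
    """Remove whole-line YAML comments that mention ROWID; keep everything else."""
    kept = [
        line
        for line in text.splitlines(keepends=True)
        if not (line.lstrip().startswith("#") and "ROWID" in line)
    ]
    return "".join(kept)
```

Running this on the text before validation avoids touching the parsed YAML structure at all.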

Run:
```bash
dm rulesets validate --file <output_file>
```
Run `dm rulesets validate --file <output_file> --type database`
(use `--type file` for file-masking rulesets).

Fix any errors and re-validate until passing.

---

## Summary

| Metric | Value |
|--------|-------|
| Total tables | N |
| Metric | Value |
|----------------------------|----------------|
| Total tables | N |
| Mask definitions extracted | N (list names) |
| Tables with hash_columns | N |
| Tables skipped (no key) | N |
| Validation | passed/failed |
| Output file | path |
| Tables with hash_columns | N |
| Tables skipped (no key) | N |
| Validation | passed/failed |
| Output file | path |
@@ -0,0 +1,109 @@
# FK Cascade Invariant

This is the most important rule when refining a DataMasque ruleset that
spans related tables. Get it wrong and you either leak identity (by skipping
IDs entirely) or break the engine (by adding rules for FK columns).

## The rule

**Mask only the parent PK column. The engine cascades the same masked value
to every FK column referencing it.**

Three masks support this cascade:

- `imitate_unique` — recommended for new work.
- `imitate_uuid` — for UUID-shaped IDs.
- `imitate_nz_ird` — for NZ IRD numbers.

(`from_unique_imitate` and `mask_unique_key` are deprecated; do not emit.)

When `mask_table` runs and a rule on a referenced column uses one of these
masks, the engine:

1. Discovers child tables with FKs referencing this column.
2. Auto-replicates the parent's rule onto every FK column.
3. Same mask config → same masked output → joins survive.

This is documented at
<https://portal.datamasque.com/portal/documentation/latest/unique-masks.html>:

> "You can apply an `imitate_unique` mask to a primary key column or a
> column that is used as a foreign key in another table. References will be
> updated automatically. Composite primary keys are supported."

## Worked example

Schema:
- `customers.id` (PK), `customers.email`
- `orders.id` (PK), `orders.customer_id` (FK → `customers.id`), `orders.tracking_number`

Correct ruleset:

```yaml
- type: mask_table
  table: customers
  key: id
  rules:
    - column: id
      masks:
        - type: imitate_unique
          seed: customer
    - column: email
      masks:
        - type: from_file
          seed_file: DataMasque_emails.csv
          seed_column: email

- type: mask_table
  table: orders
  key: id
  rules:
    # customer_id is intentionally absent: the engine replicates the
    # `customers.id` rule onto it automatically. Adding it here would
    # be rejected by the runtime FK check.
    - column: tracking_number
      masks:
        - type: imitate_unique
          seed: tracking
```

After the run, `orders.customer_id` holds the same masked values as
`customers.id`, joins remain intact, and `tracking_number` is independently
masked with its own seed.

## Anti-patterns to refuse

- **Adding explicit FK rules** ("I'll mask both PK and FK with shared
  `$ref` so the cascade works"). The runtime rejects this by default with
  the error:
  *"To preserve referential integrity, the following foreign key columns
  cannot be directly masked by this task."*
  The engine will replicate the rule for you; adding your own conflicts.
- **Skipping IDs to "preserve FK joins"**. This leaves identifiers in plain
  sight. Mask the parent PK with `imitate_unique`; joins survive via
  the auto-cascade.
- **Inventing linking parameters** (`source_table`, `source_column`,
  `parent_column`, `link_to`). None of these exist on any DataMasque mask.
- **Inventing a hashing mask** (`hash_text`, `hash`, `link`, `match_id`).
  None of these exist. `imitate_unique` is the deterministic mask.
- **Using `from_unique_imitate` or `mask_unique_key`**. Both are deprecated.
  `imitate_unique` replaces both.

## Cross-run consistency requires `run_secret`

Within a single run, `imitate_unique` is deterministic via a per-run
`insecure_seed`. Across runs, the cascade only holds if the run is
invoked with a `run_secret`. Without it, the same input maps to a
different masked value next run. If cross-run consistency matters, flag
this in the final summary.

## Self-check before finishing

For each FK relationship in the schema:

1. Is the parent PK masked with `imitate_unique`, `imitate_uuid`, or
   `imitate_nz_ird`?
2. Is the FK column **absent** from your output (no explicit rule)?
3. Are `from_unique_imitate` and `mask_unique_key` absent from your output?

If any answer is "no", fix it before validation.
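
The three checks above can be automated along these lines. This is a sketch under stated assumptions: `tasks` is the parsed task list in the shape of the worked example, `foreign_keys` is a hypothetical list of `(child_table, child_column, parent_table, parent_column)` tuples built from the schema metadata, and nested or conditional rules are not handled. The function names are illustrative only.

```python
CASCADING = {"imitate_unique", "imitate_uuid", "imitate_nz_ird"}
DEPRECATED = {"from_unique_imitate", "mask_unique_key"}


def mask_types(tasks):
    """Map (table, column) to the list of mask type names in mask_table tasks."""
    found = {}
    for task in tasks:
        if task.get("type") != "mask_table":
            continue
        for rule in task.get("rules", []):
            types = [mask.get("type") for mask in rule.get("masks", [])]
            found[(task["table"], rule["column"])] = types
    return found


def check_fk_cascade(tasks, foreign_keys):
    """Return a list of human-readable problems; an empty list means all checks pass."""
    found = mask_types(tasks)
    problems = []
    for child_table, child_col, parent_table, parent_col in foreign_keys:
        # Check 1: the parent PK must use a cascading mask.
        if not CASCADING.intersection(found.get((parent_table, parent_col), [])):
            problems.append(f"{parent_table}.{parent_col}: parent PK lacks a cascading mask")
        # Check 2: the FK column must have no explicit rule.
        if (child_table, child_col) in found:
            problems.append(f"{child_table}.{child_col}: FK column has an explicit rule")
    # Check 3: deprecated masks must not appear anywhere.
    for (table, column), types in found.items():
        for mask_type in types:
            if mask_type in DEPRECATED:
                problems.append(f"{table}.{column}: deprecated mask {mask_type}")
    return problems
```

For the worked example, calling `check_fk_cascade` with the single foreign key `("orders", "customer_id", "customers", "id")` should return an empty list.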
@@ -71,14 +71,14 @@ hash_columns:

Every table belongs to a domain entity. Find the column that identifies that entity:

| Domain | Typical hash column | Examples |
|--------|-------------------|----------|
| Customer | `cust_id`, `customer_id`, `client_id` | CUST_MASTER, CUST_ADDRESS |
| Account | `acc_id`, `account_id`, `account_no` | DEP_ACCOUNT, DEP_EMAIL_ALERT |
| Card | `card_id`, `card_no` | CARD_MASTER, CARD_INSURANCE |
| Loan | `loan_id`, `loan_no` | LOAN_COLLATERAL, LOAN_GUARANTOR |
| Employee | `emp_id`, `emp_no`, `employee_id` | COM_EMPLOYEE, COM_EMP_ROLE |
| Transaction | `tx_id`, `trf_id`, `fx_tx_id` | TRF_MASTER, FX_RECEIPT |
| Domain | Typical hash column | Examples |
|-------------|---------------------------------------|---------------------------------|
| Customer | `cust_id`, `customer_id`, `client_id` | CUST_MASTER, CUST_ADDRESS |
| Account | `acc_id`, `account_id`, `account_no` | DEP_ACCOUNT, DEP_EMAIL_ALERT |
| Card | `card_id`, `card_no` | CARD_MASTER, CARD_INSURANCE |
| Loan | `loan_id`, `loan_no` | LOAN_COLLATERAL, LOAN_GUARANTOR |
| Employee | `emp_id`, `emp_no`, `employee_id` | COM_EMPLOYEE, COM_EMP_ROLE |
| Transaction | `tx_id`, `trf_id`, `fx_tx_id` | TRF_MASTER, FX_RECEIPT |
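
As a rough illustration of this lookup, the sketch below matches a table's columns against the candidate names from the table above. The name `pick_hash_column` is hypothetical, and the precedence between domains (customer first, transaction last) is an assumption rather than something the guide specifies.

```python
# Candidate names copied from the domain table; the ordering is an assumption.
HASH_CANDIDATES = {
    "customer": ("cust_id", "customer_id", "client_id"),
    "account": ("acc_id", "account_id", "account_no"),
    "card": ("card_id", "card_no"),
    "loan": ("loan_id", "loan_no"),
    "employee": ("emp_id", "emp_no", "employee_id"),
    "transaction": ("tx_id", "trf_id", "fx_tx_id"),
}


def pick_hash_column(columns):
    """Return (domain, column) for the first known candidate, or None."""
    by_lower = {column.lower(): column for column in columns}
    for domain, names in HASH_CANDIDATES.items():
        for name in names:
            if name in by_lower:
                return domain, by_lower[name]
    return None
```

A name match is only a starting point; the foreign-key check in the next step is what confirms the choice.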

### Step 2: Check foreign keys in the DDL
