
refactor: Overhaul Ansible scripts to provide separate environments #222

Merged
martyngigg merged 47 commits into main from cloud-dev-infra on Feb 5, 2026

Conversation

@martyngigg (Contributor) commented Feb 4, 2026

Summary

To provide separate QA/Dev environments, the Ansible scripts have been refactored. Provisioning has been removed from Ansible; a new Terraform project now takes care of it.

- This used to be done by the provisioning step
- Simplify discovering variables
- Simplified:
  - Moved all containers to host networking
  - Removed docker_configure role
  - Split out Keycloak bootstrapping
- Create clients and scopes in Keycloak
- Implement user federation in Keycloak
- Our deployment is not big enough to have multiple projects; this also simplifies the catalog connections.
@coderabbitai bot commented Feb 4, 2026

Important

Review skipped

Too many files!

This PR contains 168 files, which is 18 over the limit of 150.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.



@coderabbitai bot left a comment

Actionable comments posted: 19

Note

Due to the large number of review comments, Critical and Major severity comments were prioritized as inline comments.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
infra/ansible/roles/lakekeeper/files/bootstrap-warehouse.py (1)

378-390: ⚠️ Potential issue | 🟡 Minor

Redundant duplicate permission assignment.

The server_admin users are assigned {"server": ["admin"]} permissions twice - once on line 382 and again within the second call on lines 386-389. The second call should only include the project permissions.

♻️ Proposed fix to remove redundant assignment
     if server_admin is not None and identity_provider is not None:
         # Server admins are human users and will be provisioned the first time they log in
         server.assign_permissions(
             identity_provider.oidc_ids(server_admin, server.access_token),
-            entities={"server": ["admin"]},
-        )
-        server.assign_permissions(
-            identity_provider.oidc_ids(server_admin, server.access_token),
             entities={
                 "server": ["admin"],
                 "project": ["project_admin"],
             },
         )
🤖 Fix all issues with AI agents
In @.gitattributes:
- Line 1: The .gitattributes entry has a typo in the path string: replace
"infra/ansibler/group_vars/all/vault.yml" with the correct path
"infra/ansible/group_vars/all/vault.yml" so the diff=ansible-vault attribute
applies to the actual vault file; update the string in the .gitattributes line
(the path literal in the diff entry) and commit the change.

In `@elt-common/src/elt_common/iceberg/trino.py`:
- Around line 34-37: The code currently coerces env values using field.type(val)
which mis-parses boolean strings (e.g., bool("false") is True); update the
coercion in the try/except block in elt_common.iceberg.trino (the code using
field.type) to special-case booleans: if field.type is bool, parse val.lower()
against a canonical true set like {"true","1","yes","y","t"} (return False for
other values), otherwise call field.type(val) as before, and keep the existing
except TypeError fallback to return val.
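The hazard described above can be sketched in isolation. A minimal, hypothetical `coerce` helper (not the actual `elt_common.iceberg.trino` code) showing why booleans need special-casing:

```python
def coerce(field_type, val):
    """Coerce an env-var string to a field's type, special-casing bool.

    bool("false") is True because any non-empty string is truthy, so
    boolean fields must be parsed against a canonical set of true values.
    """
    if field_type is bool:
        return val.lower() in {"true", "1", "yes", "y", "t"}
    try:
        return field_type(val)
    except TypeError:
        return val


# The naive field_type(val) approach mis-parses "false":
assert bool("false") is True
# The special-cased version parses it as intended:
assert coerce(bool, "false") is False
assert coerce(bool, "True") is True
assert coerce(int, "42") == 42
```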

In `@infra/ansible/inventories/dev/group_vars/lakekeeper.yml`:
- Line 7: The dev inventory sets lakekeeper_bootstrap_credentials using the
wrong variable name; replace the singular vault_keycloak_client_secret with the
plural vault_keycloak_client_secrets in the assignment to
lakekeeper_bootstrap_credentials so it matches other roles (Keycloak, ELT,
Trino) and the actual variable defined elsewhere; ensure the rest of the string
interpolation remains intact: lakekeeper_bootstrap_credentials: "ansible:{{
vault_keycloak_client_secrets['ansible'] }}".

In `@infra/ansible/roles/elt/tasks/main.yml`:
- Around line 46-56: The task "Ensure AWS configuration is present" uses
ansible.builtin.copy to write AWS credentials with mode "u=rw,o=r,g=r" which
leaves the file world-readable; change the file permissions to be owner-only
readable/writable (e.g., set mode to 0600 or "u=rw,go=") in the
ansible.builtin.copy task so the credentials at "{{ ansible_env['HOME']
}}/.aws/credentials" are not readable by group or others.
- Around line 137-149: The cron job name is static ("iceberg-maintenance")
causing each loop iteration over elt_warehouses to overwrite the previous entry;
update the ansible.builtin.cron task so the cron "name" is unique per warehouse
(e.g., include the loop_var warehouse.name or warehouse.id) in the Ensure
Iceberg table maintenance jobs exist task, keeping the same job, state, schedule
and loop, so each warehouse gets its own cron entry.
- Around line 73-95: The tasks "Ensure secrets directories exist" and "Generate
secrets" set modes that allow group/other read; change the directory mode in the
ansible.builtin.file task for "{{ elt_secrets_root_dir }}/{{ warehouse.name }}"
to owner-only (e.g. u=rwx,g=,o=) and change the template dest file mode in the
ansible.builtin.template task for "{{ elt_secrets_root_dir }}/{{ warehouse.name
}}/envvars" to owner-only (e.g. u=rw,g=,o=) so secrets are accessible only by
the owner; update the mode strings in those two tasks (referenced by their task
names) accordingly.

In `@infra/ansible/roles/keycloak/defaults/main.yml`:
- Around line 1-12: Update the pinned Keycloak image tag to a patched release
(26.3.5 or later) to mitigate the reported CVEs; modify the keycloak_image
default value (variable name: keycloak_image) in
infra/ansible/roles/keycloak/defaults/main.yml to reference
quay.io/keycloak/keycloak:26.3.5 (or a newer safe patch) and ensure any
deployment manifests or image references that use
keycloak_image_optimized_name/keycloak_image are updated consistently so the
runtime uses the upgraded image.

In `@infra/ansible/roles/keycloak/tasks/main.yml`:
- Around line 47-72: This task "Bootstrap Keycloak admin" (container name
keycloak_bootstrap, command "bootstrap-admin user" using
keycloak_bootstrap_admin_user and KC_BOOTSTRAP_ADMIN_PASSWORD) must be made
idempotent: add a pre-check that queries Keycloak for the admin user (e.g.,
curl/kcadm inside a short container run) and skip this task if the user exists,
or add a targeted failed_when to ignore the specific "user with username exists"
error (KC-SERVICES0010) so re-runs do not fail; implement the check/conditional
using the same community.docker.docker_container task context (or an immediately
preceding command task) and gate the bootstrap-admin task with when: or a
failed_when that matches the KC-SERVICES0010 message.

In `@infra/ansible/roles/keycloak/tasks/setup-ldap.yml`:
- Around line 35-43: In the LDAP mapper config block in setup-ldap.yml remove
the malformed key `read.only"` (extra quote) and the duplicate `read.only: true`
entry so only a single valid `read.only: true` remains; edit the `config:`
mapping (the lines around `ldap.attribute`, `user.model.attribute`,
`is.mandatory.in.ldap`, `attribute.force.default`, `is.binary.attribute`,
`read.only`, `write.only`) to delete the bad line and ensure keys are correctly
spelled and not duplicated.

In `@infra/ansible/roles/lakekeeper/tasks/main.yml`:
- Around line 82-92: The command is using a literal block scalar (|) which
embeds newlines into the command string; change the docker/container task's
command parameter to use a list form (an array of command and args) rather than
the multiline scalar so each argument is a separate list item (e.g., split the
current lines for "uv run", "/opt/work/bootstrap-warehouse.py",
"--project-name", "{{ lakekeeper_project_name }}", etc.), matching the other
docker_container tasks and avoiding embedded \n; alternatively replace the
scalar with a folded scalar (>- ) if you must keep a single string, but prefer
converting the command to a YAML list in this task to stay consistent with the
codebase conventions.

In `@infra/ansible/roles/minio/tasks/main.yml`:
- Around line 27-31: The env block has invalid YAML using KEY=VALUE; update the
env mapping to proper key: value pairs and quote interpolations: replace the
current env entries with keys MINIO_ROOT_USER, MINIO_ROOT_PASSWORD and
MINIO_BROWSER_REDIRECT_URL as YAML mappings and set their values to "{{
minio_root_user }}", "{{ minio_root_password }}" and "https://{{
top_level_domain }}/{{ minio_console_base_path }}" respectively so the env
mapping parses correctly.

In `@infra/ansible/roles/postgres/tasks/main.yml`:
- Around line 29-36: The task "Create postgres secrets" writes the DB password
to /secrets/postgres-passwd with mode u=r,g=r,o=r making it world-readable;
change the ansible.builtin.copy invocation to set mode to "0600" (or "u=rw,go=")
so only root can read/write, keep owner: root and group: root, and add no_log:
true to the task to prevent the secret from appearing in logs; ensure you
reference the same task name and the variable postgres_db_passwd when updating
the task.

In `@infra/ansible/roles/redis/tasks/main.yml`:
- Around line 14-16: Fix the IPv6 bind typo and avoid waiting on a non-existent
container HEALTHCHECK: update the redis-server command invocation (the command:
array in the task) to use "--bind 127.0.0.1 ::1" (remove the erroneous leading
dash before ::1) so Redis receives a correct IPv6 loopback address, and change
the task's state from "healthy" to "started" (or add a proper HEALTHCHECK to the
container image) so the playbook does not block waiting for a health status;
locate this in the ansible task that defines the "command" and "state" for
redis-server.

In `@infra/ansible/roles/traefik/tasks/static-config.yml`:
- Around line 2-21: The task "Ensure Traefik directories exist" ignores per-item
modes because a top-level mode is hardcoded; change the task to use the loop
variable's mode (traefik_dir.mode) instead of the hardcoded string so each loop
item (e.g., the certs entry with mode "u=rwx,g=,o=") is applied; keep
loop_control loop_var: traefik_dir and ensure the file module uses path: "{{
traefik_dir.path }}" and mode: "{{ traefik_dir.mode }}" (and leave owner/group
as needed).

In `@infra/ansible/roles/traefik/templates/etc/traefik/dynamic/lakekeeper.yml.j2`:
- Around line 1-22: Fix the router name typo and guard against an empty
lakekeeper group: rename the router from "lakeeeper" to "lakekeeper" so it
matches the service name "lakekeeper", and wrap or conditionalize the service
server URL generation that uses groups['lakekeeper'][0] (or check
groups['lakekeeper'] | length > 0 / groups.get('lakekeeper')) so the template
does not attempt to index into an empty list; ensure the
router/service/middleware blocks are only rendered when a non-empty
groups['lakekeeper'] exists.

In `@infra/ansible/roles/traefik/templates/etc/traefik/dynamic/trino.yml.j2`:
- Around line 14-15: The Traefik upstream currently uses the inventory hostname
groups['trino'][0], which may not resolve inside the container; update the
dynamic/trino.yml.j2 template to build the backend URL using hostvars and
ansible_host as a fallback — e.g., derive a host variable from
hostvars[groups['trino'][0]]['ansible_host'] (fall back to groups['trino'][0] if
ansible_host is unset) and then use that host with trino_http_port to form the
url; ensure references to groups['trino'][0], hostvars and ansible_host are used
so routing falls back correctly when DNS/hosts are unavailable.

In `@infra/ansible/roles/trino/tasks/main.yml`:
- Around line 64-78: The inline trino_rules definition in the copy task
generates an overly broad catalogue regex "(.*)" for users "elt" and "superset";
update this by extracting the catalog pattern into a configurable variable
(e.g., trino_catalog_patterns or trino_rules) and provide a sane default in role
defaults (defaults/main.yml), then reference that variable in the trino_rules
used by the copy task; also add a short comment above the variable explaining
the intent and security implications so operators can narrow the regex or
override per-environment.

In `@infra/local/warehouses/trino/dev_isis_cleaned.properties`:
- Line 3: The property iceberg.rest-catalog.warehouse was changed to the
hard-coded value "dev_isis_cleaned", which can cause cross-environment
collisions; revert this to an environment-scoped value or templated placeholder
(e.g., use an env var like ${ENV_WAREHOUSE} or a build-time token) and update
the code that reads this property to fall back to the intended per-environment
default if unset; specifically change the iceberg.rest-catalog.warehouse
assignment back to an env-driven value and verify any consumers that construct
Lakekeeper project/catalog or S3 prefixes use the environment identifier to keep
prefixes unique across environments.

In `@infra/local/warehouses/trino/dev_isis_raw.properties`:
- Line 3: The property iceberg.rest-catalog.warehouse is hard-coded to
"dev_isis_raw", which can cause catalog collisions across environments; change
this to use an environment-scoped value or template (e.g., interpolate an ENV
var) or add an environment prefix/suffix so each environment has a unique
warehouse name, or else explicitly document and enforce isolation by configuring
separate Lakekeeper projects or distinct MinIO buckets for each environment to
guarantee no collisions.
🟡 Minor comments (15)
infra/ansible/roles/postgres/tasks/main.yml-22-22 (1)

22-22: ⚠️ Potential issue | 🟡 Minor

Fix grammatical error in task name.

Same issue as above - "Create secrets directory exists" should be "Ensure secrets directory exists".

Proposed fix
-- name: Create secrets directory exists
+- name: Ensure secrets directory exists
infra/ansible/roles/postgres/tasks/main.yml-13-13 (1)

13-13: ⚠️ Potential issue | 🟡 Minor

Fix grammatical error in task name.

The task name "Create postgres data directory exists" is grammatically incorrect.

Proposed fix
-- name: Create postgres data directory exists
+- name: Ensure postgres data directory exists
infra/ansible/roles/trino/defaults/main.yml-6-6 (1)

6-6: ⚠️ Potential issue | 🟡 Minor

Set trino_shared_secret from inventory or vault rather than defaulting to empty string.

The internal-communication.shared-secret configuration is used for Trino's internal node-to-node communication security. Defaulting to an empty string disables this authentication layer. Rather than relying on developers to remember to override this value, require it to be explicitly set via inventory/vault or use Ansible assertion to fail on empty values in production deployments.

infra/ansible/roles/trino/defaults/main.yml-4-4 (1)

4-4: ⚠️ Potential issue | 🟡 Minor

Consider updating Trino image to a more recent version.

trinodb/trino:477 is a valid, officially released image (released 24 September 2025), but it is now 5 months outdated as of February 2026. Verify whether using this older release is intentional or if a more recent version should be used.

infra/ansible/roles/traefik/templates/etc/traefik/dynamic/traefik.yml.j2-16-16 (1)

16-16: ⚠️ Potential issue | 🟡 Minor

Bcrypt hash regenerates on each playbook run, causing unnecessary configuration reloads.

The password_hash('bcrypt') filter generates a new random salt on each invocation when one isn't supplied. This means the file will change on every Ansible run even if the password hasn't changed. Since Traefik watches dynamic config files for changes, this triggers reloads despite no actual configuration change.

Use a fixed salt (22 characters) to ensure idempotent hashes:

users: admin:{{ vault_traefik_admin_passwd | password_hash('bcrypt', rounds=10, ident='2y', salt='abcdefghijklmnopqrstuv') }}

Alternatively, pre-compute the hash in the vault if the password is static.

infra/ansible/roles/elt/templates/cron/iceberg_maintenance.sh.j2-18-18 (1)

18-18: ⚠️ Potential issue | 🟡 Minor

Quote variable expansion to handle paths with spaces.

The variable $ELT_SECRETS should be quoted to prevent word splitting if the path contains spaces.

🛡️ Proposed fix
-set -o allexport; source $ELT_SECRETS; set +o allexport
+set -o allexport; source "$ELT_SECRETS"; set +o allexport
infra/ansible/roles/elt/templates/cron/iceberg_maintenance.sh.j2-24-24 (1)

24-24: ⚠️ Potential issue | 🟡 Minor

Use "$@" instead of $* for argument forwarding.

Using $* without quotes causes all arguments to be concatenated and re-split on whitespace. Use "$@" to preserve argument boundaries correctly.

🛡️ Proposed fix
 uvx \
   --with "${uv_git_ref}[iceberg-maintenance]" \
-  python -m elt_common.iceberg.maintenance $*
+  python -m elt_common.iceberg.maintenance "$@"
infra/ansible/roles/base/tasks/main.yml-13-13 (1)

13-13: ⚠️ Potential issue | 🟡 Minor

Quote the file mode to ensure correct octal interpretation.

Ansible interprets unquoted numeric values as decimal integers. mode: 644 is interpreted as decimal 644, not octal 0644, which would result in incorrect permissions (decimal 644 = octal 1204, yielding -w----r-T). Use a quoted string or explicit octal notation.

🔧 Proposed fix
-    mode: 644
+    mode: "0644"
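The decimal-versus-octal pitfall is easy to check in plain Python, using the standard-library `stat` module to render permission bits:

```python
import stat

# Unquoted YAML `mode: 644` reaches the file module as the decimal
# integer 644, whose bit pattern is octal 1204 -- not 0644:
assert oct(644) == "0o1204"

# Rendered as a permission string (first char is the file type, which
# is absent here), decimal 644 is nothing like rw-r--r--:
assert stat.filemode(644)[1:] == "-w----r-T"
assert stat.filemode(0o644)[1:] == "rw-r--r--"
```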
infra/ansible/roles/elt/tasks/main.yml-80-84 (1)

80-84: ⚠️ Potential issue | 🟡 Minor

Bug: Missing Jinja2 templating in loop label.

Line 83 uses label: warehouse.name which will display the literal string "warehouse.name" instead of the actual warehouse name. This should use Jinja2 templating like line 95 does.

🐛 Proposed fix
   loop_control:
     loop_var: warehouse
-    label: warehouse.name
+    label: "{{ warehouse.name }}"
docs-devel/deployment/prerequisites.md-18-20 (1)

18-20: ⚠️ Potential issue | 🟡 Minor

Use British English “licence” here.

Minor wording polish for en‑GB spelling.

✏️ Suggested wording update
-*Note: Terraform no longer has an open-source license. [OpenTofu](https://opentofu.org/) is a*
+*Note: Terraform no longer has an open‑source licence. [OpenTofu](https://opentofu.org/) is a*
infra/ansible/group_vars/keycloak.yml-2-2 (1)

2-2: ⚠️ Potential issue | 🟡 Minor

Fix minor typo in comment.
Line 2: “Eveything” → “Everything”.

🛠️ Proposed fix
-# Eveything is assigned to the realm defined by keycloak_realm.name in all/all.yml
+# Everything is assigned to the realm defined by keycloak_realm.name in all/all.yml
docs-devel/deployment/index.md-34-39 (1)

34-39: ⚠️ Potential issue | 🟡 Minor

Fix malformed endpoint placeholders.
Lines 36–39 include backslashes inside the angle-bracket placeholders, which will render incorrectly.

🛠️ Proposed fix
-- Keycloak: <https://\domain\>/iceberg>
-- Lakekeeper: <https://\<domain\>/authn>
-- Superset instances:
--   - <https://\<domain\>/workspace/accelerator>
+- Keycloak: <https://<domain>/iceberg>
+- Lakekeeper: <https://<domain>/authn>
+- Superset instances:
+   - <https://<domain>/workspace/accelerator>
infra/ansible/roles/superset_farm/templates/superset-cli.sh.j2-3-14 (1)

3-14: ⚠️ Potential issue | 🟡 Minor

Quote variables and use "$@" for argument forwarding.

Unquoted variables risk breaking when values contain spaces or special characters. Using $* instead of "$@" fails to preserve argument boundaries.

Suggested fix
-CONTAINER_IMAGE={{ superset_farm_image }}
-SUPERSET_CONFIG_MOUNT={{ instance_work_dir }}/docker/pythonpath
-ENV_FILE={{ instance_work_dir }}/docker/.env
+CONTAINER_IMAGE="{{ superset_farm_image }}"
+SUPERSET_CONFIG_MOUNT="{{ instance_work_dir }}/docker/pythonpath"
+ENV_FILE="{{ instance_work_dir }}/docker/.env"
@@
-  --env-file $ENV_FILE \
-  --volume ${SUPERSET_CONFIG_MOUNT}:/app/docker/pythonpath:ro \
+  --env-file "$ENV_FILE" \
+  --volume "${SUPERSET_CONFIG_MOUNT}:/app/docker/pythonpath:ro" \
@@
-  ${CONTAINER_IMAGE} \
-  superset $*
+  "${CONTAINER_IMAGE}" \
+  superset "$@"
infra/ansible/roles/trino/tasks/main.yml-88-101 (1)

88-101: ⚠️ Potential issue | 🟡 Minor

Escape usernames in the password file regexp.

Usernames containing regex metacharacters (e.g., . or +) can cause unintended matches in the lineinfile regexp pattern. The regex_escape filter is the correct Ansible approach for this.

Proposed fix
-    regexp: "^{{ user.key }}"
+    regexp: "^{{ user.key | regex_escape }}"
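Python's `re.escape` demonstrates the same behaviour as Ansible's `regex_escape` filter, using a hypothetical username `a.user`:

```python
import re

# Used verbatim as a regexp, "a.user" also matches "axuser", because
# the unescaped "." matches any character:
assert re.match("^a.user", "axuser:somehash") is not None

# Escaping first confines the pattern to the literal username, which
# is what regex_escape does for lineinfile's regexp parameter:
assert re.match("^" + re.escape("a.user"), "axuser:somehash") is None
assert re.match("^" + re.escape("a.user"), "a.user:somehash") is not None
```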
infra/ansible/terraform/main.tf-47-68 (1)

47-68: ⚠️ Potential issue | 🟡 Minor

Guard merge() against empty security_groups.

If var.security_groups is passed as an empty map, merge([...]...) expands to zero arguments and Terraform errors during plan. Seeding with an empty map avoids this edge case.

Proposed fix
 locals {
-  sec_groups_map = merge([
-    for sg_name, sg_config in var.security_groups : {
-      for port in sg_config.ports_ingress : "${sg_name}-${port}" => {
-        sec_group_id = openstack_networking_secgroup_v2.sec_group[sg_name].id
-        port_ingress = port
-      }
-    }
-  ]...)
+  sec_groups_map = merge(
+    {},
+    [for sg_name, sg_config in var.security_groups : {
+      for port in sg_config.ports_ingress : "${sg_name}-${port}" => {
+        sec_group_id = openstack_networking_secgroup_v2.sec_group[sg_name].id
+        port_ingress = port
+      }
+    }]...
+  )
 }
🧹 Nitpick comments (21)
infra/ansible/roles/postgres/tasks/main.yml (1)

8-11: Consider documenting or isolating the system package override.

The --break-system-packages flag bypasses PEP 668 protections on externally managed Python environments. Whilst this is a common workaround for Ansible on modern Debian/Ubuntu systems, it can lead to conflicts with system packages.

Consider using a virtual environment or documenting why this approach is acceptable for this deployment.

infra/ansible/roles/superset_farm/files/docker/pythonpath/superset_config.py (1)

150-165: Guard REDIS_DB parsing against empty/invalid values.

int(os.getenv("REDIS_DB", "0")) will throw if the env var is set to an empty string. A small guard keeps boot deterministic while still allowing overrides.

Proposed change
-REDIS_DB = int(os.getenv("REDIS_DB", "0"))
+REDIS_DB = int(os.getenv("REDIS_DB") or "0")
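The failure mode and the fix above can be verified directly: `int()` raises on an empty string, while the `or "0"` form falls back for both unset (`None`) and set-but-empty values:

```python
# int() raises ValueError on an empty string, so an env var that is
# set but empty would crash Superset at boot with the default-arg form:
raised = False
try:
    int("")
except ValueError:
    raised = True
assert raised

# `val or "0"` substitutes the fallback for both None and "":
assert int(None or "0") == 0
assert int("" or "0") == 0
assert int("5" or "0") == 5
```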
infra/local/superset/docker/pythonpath/superset_config.py (1)

103-120: Guard REDIS_DB parsing against empty/invalid values.

Same as the ansible config, an empty REDIS_DB env var would cause a startup failure.

Proposed change
-REDIS_DB = int(os.getenv("REDIS_DB", "0"))
+REDIS_DB = int(os.getenv("REDIS_DB") or "0")
infra/ansible/roles/superset_farm/templates/docker/docker-init.sh.j2 (1)

54-58: Consider quoting the database name argument.

The --database_name argument value is unquoted, which could cause issues if analytic_db.name contains spaces or special shell characters. The --uri value is already correctly quoted.

💡 Suggested fix
   superset set-database-uri \
-    --database_name {{ analytic_db.name }} --uri "{{ analytic_db.db_url }}"
+    --database_name "{{ analytic_db.name }}" --uri "{{ analytic_db.db_url }}"
infra/ansible/roles/redis/tasks/main.yml (1)

2-7: Working directory created but not mounted for persistence.

The task creates redis_work_dir on the host (lines 2-7), but the container does not mount this directory. Redis data will not persist across container restarts. If persistence is intended, consider adding a volume mount.

♻️ Proposed fix to add volume mount
     network_mode: host
     restart_policy: unless-stopped
+    volumes:
+      - "{{ redis_work_dir }}:/data"

Also applies to: 9-20

infra/ansible/roles/elt/templates/cron/iceberg_maintenance.sh.j2 (1)

9-12: Usage message does not reflect additional arguments.

The usage message only mentions <git_url>, but the script forwards additional arguments via $* to the Python module. Consider updating the usage to clarify this.

📝 Proposed fix
 if [ $# -eq 0  ]; then
-  echo "Usage $0 <git_url>"
+  echo "Usage: $0 <git_url> [maintenance_args...]"
   exit 1
 fi
infra/ansible/roles/superset_farm/templates/docker/dotenv.j2 (1)

23-24: Misleading comment about SECRET_KEY.

The comment suggests manually setting the secret key, but it's already templated from instance.value.secret_key. Consider updating the comment to reflect that the value is managed via Ansible variables.

📝 Proposed fix
-# Make sure you set this to a unique secure random value on production
+# Secret key managed via Ansible variables - ensure a unique secure random value in production
 SUPERSET_SECRET_KEY={{ instance.value.secret_key }}
infra/ansible/roles/superset_farm/tasks/setup-instance.yml (1)

11-12: Add a name to the set_fact task for consistency and clarity.

All other tasks in this file have descriptive names. Adding a name improves readability and makes Ansible output easier to follow during execution.

♻️ Proposed fix
-- ansible.builtin.set_fact:
+- name: Set instance working directory
+  ansible.builtin.set_fact:
     instance_work_dir: "{{ superset_farm_work_dir_prefix }}/{{ instance.key }}"
infra/ansible/terraform/README.md (1)

1-1: Consider expanding the README with essential documentation.

This README is a placeholder. Given that Terraform provisioning is a significant addition to the project, consider documenting:

  • Prerequisites (Terraform/OpenTofu version, cloud provider credentials)
  • Available environments and their configuration files
  • Basic usage commands (terraform init, plan, apply)
  • How the generated inventory integrates with Ansible
infra/ansible/roles/traefik/templates/etc/traefik/dynamic/tls.yml.j2 (1)

1-4: Consider parameterising the certificate file paths.

The certificate paths are hardcoded. Using Jinja2 variables would provide greater flexibility across different environments and domains.

Proposed refactor using variables
 tls:
   certificates:
-    - certFile: /certs/cert__domain.pem
-      keyFile: /certs/key__domain.pem
+    - certFile: /certs/{{ traefik_tls_cert_file | default('cert__domain.pem') }}
+      keyFile: /certs/{{ traefik_tls_key_file | default('key__domain.pem') }}
infra/ansible/roles/traefik/tasks/main.yml (1)

2-8: Consider using include_tasks instead of import_tasks with conditional logic for clearer intent.

When using ansible.builtin.import_tasks, the when condition is applied to each task within the imported file at parse time. Since the intent is to conditionally include or exclude entire configuration files based on traefik_container_only, include_tasks is more semantically appropriate—it evaluates the condition at runtime and either includes or excludes the entire file, rather than conditionally skipping individual tasks.

Proposed refactor
-- ansible.builtin.import_tasks: static-config.yml
+- ansible.builtin.include_tasks: static-config.yml
   when: not (traefik_container_only | default(False))

-- ansible.builtin.import_tasks: container.yml
+- ansible.builtin.include_tasks: container.yml

-- ansible.builtin.import_tasks: dynamic-config.yml
+- ansible.builtin.include_tasks: dynamic-config.yml
   when: not (traefik_container_only | default(False))
infra/ansible/terraform/.gitignore (1)

5-5: Remove .terraform.lock.hcl from .gitignore to ensure reproducible provider versions.

The .terraform.lock.hcl file records exact provider versions and checksums. HashiCorp's official documentation recommends committing this file to version control so that all team members and CI/CD pipelines use identical provider versions, enabling consistent and reproducible deployments.

Proposed change
 .terraform/
-.terraform.lock.hcl
infra/ansible/roles/superset_farm/defaults/main.yml (1)

2-4: Consider pinning the Superset image by digest.

Tags can drift; digest pinning improves reproducibility and security.

🔒 Example pinning by digest
-superset_farm_image: ghcr.io/isisneutronmuon/analytics-data-platform-superset:a4f0e8011f34785d759269dceedfa2693db81c19
+superset_farm_image: ghcr.io/isisneutronmuon/analytics-data-platform-superset@sha256:<digest>
infra/ansible/roles/traefik/templates/etc/traefik/dynamic/superset_farm.yml.j2 (1)

13-17: Consider load-balancing across all superset_farm hosts.
Only the first host is used, so additional instances never receive traffic.

♻️ Suggested server list expansion
-        servers:
-          - url: "http://{{ groups['superset_farm'][0] }}:{{ props.http_port }}"
+        servers:
+          {% for host in groups['superset_farm'] %}
+          - url: "http://{{ host }}:{{ props.http_port }}"
+          {% endfor %}
infra/ansible/roles/traefik/templates/etc/traefik/dynamic/datastore.yml.j2 (1)

24-32: Include all datastore hosts in the load balancer.
Only the first datastore host is used, so additional hosts never receive traffic.

♻️ Suggested server list expansion
     minioapi:
       loadBalancer:
         servers:
-          - url: "http://{{ groups['datastore'][0] }}:{{ minio_api_port }}"
+          {% for host in groups['datastore'] %}
+          - url: "http://{{ host }}:{{ minio_api_port }}"
+          {% endfor %}
     minioconsole:
       loadBalancer:
         servers:
-          - url: "http://{{ groups['datastore'][0] }}:{{ minio_console_port }}"
+          {% for host in groups['datastore'] %}
+          - url: "http://{{ host }}:{{ minio_console_port }}"
+          {% endfor %}
infra/ansible/terraform/ansible-inventory.tmpl (1)

6-9: Consider restructuring to avoid index-based group/IP pairing for clarity.

Currently, ansible_groups and ansible_ips are created from the same source collection in parallel, so the index alignment is safe. However, this pattern relies on list ordering and could introduce subtle bugs if the code is refactored separately in future. A map-based approach would make the group-to-IP relationship explicit and eliminate index alignment entirely.

Suggested refactoring (requires map output)
-%{ for index, group in ansible_groups ~}
-
-[${ansible_groups[index]}]
-${ansible_ips[index]}
-%{ endfor ~}
+%{ for group, ips in ansible_group_ips ~}
+[${group}]
+%{ for ip in ips ~}
+${ip}
+%{ endfor ~}
+%{ endfor ~}
infra/ansible/roles/traefik/templates/etc/traefik/dynamic/lakekeeper.yml.j2 (1)

4-4: Fix the router name typo for clarity.

The router name looks like an accidental misspelling, which makes dashboards/metrics harder to interpret.

✏️ Rename the router
-    lakeeeper:
+    lakekeeper:
infra/ansible/roles/minio/defaults/main.yml (1)

2-2: Pin the MinIO image digest for reproducibility and supply-chain safety.

The tag quay.io/minio/minio:RELEASE.2025-09-07T16-13-09Z is available. Consider pinning the digest instead of the tag to prevent tag drift and ensure all deployments use the exact same image content.

🔒 Digest pinning
-minio_image: quay.io/minio/minio:RELEASE.2025-09-07T16-13-09Z
+minio_image: quay.io/minio/minio@sha256:14cea493d9a34af32f524e538b8346cf79f3321eff8e708c1e2960462bd8936e
infra/ansible/roles/keycloak/tasks/setup-realm.yml (1)

31-48: Consider adding defaults for future-proofing, though all current clients define these fields.

All four currently defined clients (ansible, trino, elt, lakekeeper-ui) include both optional_client_scopes_additional and attributes. However, if future clients omit these fields, the task will fail. Adding defaults makes the template more defensive:

Proposed improvement
-    optional_client_scopes: "{{ keycloak_client_optional_scopes_minimum + item.optional_client_scopes_additional }}"
-    attributes: "{{ item.attributes }}"
+    optional_client_scopes: "{{ keycloak_client_optional_scopes_minimum + (item.optional_client_scopes_additional | default([])) }}"
+    attributes: "{{ item.attributes | default({}) }}"
infra/ansible/terraform/main.tf (1)

100-110: Consider using vm.network[0].fixed_ip_v4 for explicit IP reference.

While access_ip_v4 is auto-populated by the provider, it may be unreliable when networks are attached via Neutron ports. This code uses direct network attachment (not port-based), so access_ip_v4 should populate reliably; however, using network[0].fixed_ip_v4 provides more explicit and defensive control over the inventory IP source.

Suggested adjustment (single-NIC case)
-      ansible_ips = [for vm in openstack_compute_instance_v2.vm : vm.access_ip_v4]
+      ansible_ips = [for vm in openstack_compute_instance_v2.vm : vm.network[0].fixed_ip_v4]
infra/ansible/terraform/variables.tf (1)

36-45: Add validation for CIDR and port ranges.

Input variable validation blocks in Terraform (stable since v0.13.0) enable early detection of misconfiguration, preventing late failures during apply. The suggested validation blocks use cidrnetmask() to validate CIDR syntax and alltrue() with port range checks (1–65535) to validate port entries, following established Terraform patterns.

Suggested validation blocks
 variable "network_subnet_cidr" {
   description = "CIDR of network subnet"
   type        = string
+  validation {
+    condition     = can(cidrnetmask(var.network_subnet_cidr))
+    error_message = "network_subnet_cidr must be a valid CIDR."
+  }
 }
 
 variable "security_groups" {
   description = "Details of the ports to open on each compute instance"
   type = map(object({
     ports_ingress = list(number)
   }))
+  validation {
+    condition = alltrue([
+      for sg in var.security_groups : alltrue([
+        for p in sg.ports_ingress : p >= 1 && p <= 65535
+      ])
+    ])
+    error_message = "ports_ingress entries must be valid TCP ports (1-65535)."
+  }
 }

@martyngigg martyngigg merged commit 22a75c5 into main Feb 5, 2026
4 checks passed
@martyngigg martyngigg deleted the cloud-dev-infra branch February 5, 2026 08:23