diff --git a/skills/mlops/LICENSE b/skills/mlops/LICENSE new file mode 100644 index 00000000..3b8f57ee --- /dev/null +++ b/skills/mlops/LICENSE @@ -0,0 +1,22 @@ +Snowflake Skills License + +© 2026 Snowflake Inc. All rights reserved. + +LICENSE: Use of these materials (including all code, prompts, assets, files, and other components of these skills (collectively, “Skills”)) is governed by your agreement with Snowflake for the Service. If no separate agreement exists, use is governed by Snowflake’s Terms of Service (available at: https://www.snowflake.com/en/legal/terms-of-service/). + +Your applicable agreement is referred to as the "Agreement." "Service" is as defined in the Agreement. + +ADDITIONAL RESTRICTIONS: Notwithstanding anything in the Agreement to the contrary, you may not: + +* Extract from the Service or retain copies of the Skills outside use with the Service; +* Reproduce or copy the Skills , except for temporary copies created automatically during authorized use of the Service; +* Create derivative works based on the Skills; +* Distribute, sublicense, or transfer the Skills to any third party; +* Make, offer to sell, sell, or import any inventions embodied in the Skills; nor, +* Reverse engineer, decompile, or disassemble the Skills. + +The receipt, viewing, or possession of the Skills does not convey or imply any license or right beyond those expressly granted above. + +Snowflake retains all rights, title, and interest in the Skills, including all copyrights, trademarks, patents, and all other applicable intellectual property rights. + +THE SKILLS ARE PROVIDED “AS IS,” WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SKILLS OR THE USE OR OTHER DEALINGS IN THE SKILLS. diff --git a/skills/mlops/SKILL.md b/skills/mlops/SKILL.md new file mode 100644 index 00000000..49a4dd18 --- /dev/null +++ b/skills/mlops/SKILL.md @@ -0,0 +1,104 @@ +--- +name: mlops +title: Plan and run MLOps +summary: Router for MLOps work on Snowflake — assess maturity, pick promotion patterns, and implement CI/CD, monitoring, and governance. +description: "Use when a developer or data engineer wants to assess MLOps maturity, design a promotion strategy (Code/Model/Hybrid), or implement MLOps capabilities (CI/CD, monitoring, retraining, governance) on Snowflake for traditional ML or LLM/GenAI workloads. Triggers: mlops, mlops maturity, mlops assessment, mlops strategy, mlops pattern, mlops framework, model promotion, ml ci/cd, ml monitoring, llmops, rag pipeline ops, fine-tuning ops." +prompt: Help me set up MLOps on Snowflake for my ML project. +language: en +status: Published +author: Snowflake Solutions Team +type: snowflake +tools: + - snowflake_sql_execute + - Bash + - Read + - Write + - Edit + - Glob + - Grep +--- + +# Plan and run MLOps + +## Overview + +Router skill for operationalizing ML and LLM/GenAI workloads on Snowflake. It covers the *process and governance layer* — when to promote, what gates to enforce, what to monitor, how to roll back. It does **not** cover SDK-level code (model registration, feature store APIs, training loops) — that belongs to the `machine-learning` skill. + +This skill applies to traditional ML *and* GenAI (prompt management, RAG, fine-tuning, agentic apps). There is no separate "LLMOps" — LLM operationalization is part of MLOps with workload-specific adaptations. + +**Scope split** + +| Question | Owner | +|---|---| +| When should I promote a model? 
What gates must it pass? | mlops | +| How do I register a model or deploy an endpoint? (code) | machine-learning | +| What should I monitor after deployment? When to roll back? | mlops | +| How do I set up Feature Store / Cortex Search? (code) | machine-learning | +| How should I govern Feature Store / Registry across environments? | mlops | +| How do I train / fine-tune / build RAG? (code) | machine-learning | +| How should I operationalize training across environments? | mlops | + +**Platform constraint:** All recommendations assume Snowflake as the platform (Model Registry, Feature Store, Cortex AI, Snowpark, Tasks/Streams). Do not propose third-party platforms unless the user explicitly asks. + +**Explain before asking:** Always introduce concepts (maturity levels L0–L3, promotion patterns, capability dimensions) before asking the user to make decisions about them. Do not assume prior knowledge. + +## Sub-flows + +- `implement-patterns/INSTRUCTIONS.md` — implementation playbooks for promotion, CI/CD, monitoring, governance (includes maturity assessment as part of the pattern selection workflow) + +## Workflow + +### Step 1: Detect intent + +Ask the user which path they need: + +1. **Assessment & strategy** — evaluate current maturity, pick patterns, build a roadmap +2. **Implementation patterns** — guidance for a specific capability (CI/CD, monitoring, etc.) +3. 
**Full setup** — end-to-end MLOps design from scratch + +### Step 2: Route + +| Intent | Route | +|---|---| +| ASSESS — "assess maturity", "gap analysis", "roadmap", "where are we" | Load `implement-patterns/INSTRUCTIONS.md` — start with promotion pattern determination | +| PATTERNS — "promotion pattern", "ci/cd", "monitoring", "retraining", "feature store governance", "RAG pipeline ops", "LLM monitoring" | Load `implement-patterns/INSTRUCTIONS.md` | +| FULL SETUP — "setup mlops from scratch", "end to end" | Load `implement-patterns/INSTRUCTIONS.md` — start with promotion pattern determination, then work through capabilities per priority | + +⚠️ STOPPING POINT: Before loading `implement-patterns/INSTRUCTIONS.md`, the user MUST have an explicit promotion pattern (Code / Model / Hybrid). If unknown, run the decision tree (ask about team structure, artifact type, deployment frequency). Do not generate implementation guidance without it. + +### Step 3: Per-message intent re-evaluation + +On every user message — not just the first — re-check intent. If the user shifts to implementation ("start with X", "let's build", "show me the code", "what SQL do I need"): + +1. STOP generating from general knowledge. +2. Load `implement-patterns/INSTRUCTIONS.md` immediately, passing known context (pattern, maturity, environments). +3. If promotion pattern is unknown, determine it briefly before loading. + +## Common Mistakes + +- Generating implementation code from general knowledge instead of loading `implement-patterns/INSTRUCTIONS.md`. +- Skipping promotion-pattern selection and producing pattern-agnostic recommendations (they will be wrong). +- Treating LLM/GenAI as a separate "LLMOps" track instead of a workload variant. +- Recommending non-Snowflake tools (SageMaker, Vertex, Databricks, MLflow) when the user did not ask. +- Answering "how do I register a model" inside this skill — that's `machine-learning`. 
+- Asking the user to choose between L1 and L2 without first explaining what the levels mean. + +## Red Flags + +Refuse these rationalizations: + +- "The user seems to know what they want, I'll skip the promotion-pattern question." — No. Pattern is a hard prerequisite. +- "I'll generate the CI/CD pipeline from memory, faster than loading the sub-flow." — No. Sub-flow content is curated and tested; general-knowledge output drifts. +- "They asked about MLflow, I'll just answer." — Only if they explicitly asked. Default is Snowflake-native. +- "The roadmap is obvious, I'll skip the assessment." — No. Maturity baseline drives sequencing. +- "They want to start implementing, I don't need to re-check intent each turn." — Re-evaluate every message. + +## Stopping Points + +- Step 2 — wait for explicit promotion pattern (Code / Model / Hybrid) before loading `implement-patterns/INSTRUCTIONS.md`. If unknown, run decision tree or full assessment first. + +## Output + +- Assessment route: maturity scorecard + prioritized roadmap. +- Patterns route: implementation playbook for the selected capability. +- Full setup: complete architecture with sequenced implementation plan. diff --git a/skills/mlops/implement-patterns/INSTRUCTIONS.md b/skills/mlops/implement-patterns/INSTRUCTIONS.md new file mode 100644 index 00000000..7b2d3169 --- /dev/null +++ b/skills/mlops/implement-patterns/INSTRUCTIONS.md @@ -0,0 +1,118 @@ + +# Implementation Patterns + +> **Platform constraint (inherited from parent):** All recommendations must assume Snowflake as the platform. Do NOT propose non-Snowflake tools or platforms unless the user explicitly asks. + +## When to Load + +`mlops/SKILL.md` Step 2: When user needs implementation guidance for a specific MLOps capability. + +## Setup + +**Load** the decision tree (ask about team structure, artifact type, and deployment frequency) if maturity context is needed. 
+ +## Workflow + +### Step 1: Identify Topic + +**Ask** user which area they need guidance on: + +1. **Promotion Patterns** - Code / Model / Hybrid workflows, environment structure, LLM artifact promotion +2. **CI/CD & Testing** - Test strategy, deployment automation, pipeline architecture, LLM-specific tests +3. **Continuous Training** - Retraining triggers, scheduling, automation, LLM iteration cycles +4. **Monitoring & Rollback** - Drift detection, alerting, rollback, recovery, LLM evaluation +5. **Model Lifecycle** - Registry, versioning, Champion/Challenger, promotion gates, LLM versioning +6. **Data & Features** - Data validation, feature store, skew prevention, vector DB / search index +7. **Governance & Metadata** - Lineage, compliance, audit, metadata management, LLM access control + +### Step 2: Gather Context + +**If routed from the parent skill with a roadmap**, use the known promotion pattern, maturity levels, and environment names. Skip to Step 3. + +**Otherwise, ask** user: +1. **Promotion pattern**: Before asking, briefly introduce the three promotion patterns so the user has context: + - **Code Promotion** — Training code moves through environments (DEV → STAGING → PROD). The model is retrained in each environment using that environment's data. Best when production data is accessible from the production environment. + - **Model Promotion** — The model is trained in one environment (typically DEV) and the trained artifact is promoted to other environments. Only the artifact moves, not the training code. Best when training is expensive or environments cannot access production data. + - **Hybrid Promotion** — Code moves to a middle environment (e.g., STAGING) that has production data access, the model is trained there, and the artifact is promoted to production. Combines aspects of both patterns. + Then ask: Which pattern fits your situation? Code / Model / Hybrid (or undecided) +2. 
**Current maturity level**: Before asking, briefly introduce the maturity levels so the user has context: + - **L0 (Ad-hoc / Experimental)** — No formal process. Notebooks, manual everything. No production deployment. + - **L1 (Manual)** — All core AI/ML features available — but every step is executed and approved by humans. No CI/CD, no automated monitoring, no automated governance. + - **L2 (Semi-automated)** — CI/CD runs tests automatically, but model validation and promotion require human approval gates. + - **L3 (Fully Automated)** — End-to-end automation including monitoring-triggered retraining, auto-validation, and auto-promotion with rollback. + Then ask: Where does your current setup fall? L0 / L1 / L2 / L3 (or unknown) +3. **Target maturity level**: L1 / L2 / L3 +4. **Environment setup**: How many environments, what names, fully isolated or shared components? + + Use the user's chosen names, environment count, and isolation model in **all outputs** (checklists, diagrams, recommendations). For full environment guidance (2-env vs 3-env trade-offs, isolation models, canonical name table), see the parent mlops skill Step 2. + +> **Promotion pattern is a hard prerequisite**: All implementation recommendations in this skill are pattern-specific — CI/CD pipelines, environment structure, promotion gates, and governance all vary fundamentally between Code, Model, and Hybrid promotion. **Do not proceed to Step 3** until the user has explicitly confirmed a promotion pattern. +> +> If the user says "undecided" or doesn't know: +> 1. **Quick path**: Walk them through the decision tree (ask about team structure, artifact type, and deployment frequency). This takes ~5 minutes and yields a clear pattern choice. +> 2. **Full path**: Recommend a full assessment via the parent mlops skill for comprehensive maturity + pattern evaluation (~15 minutes). +> 3. 
**Do not skip**: Generating implementation guidance without a promotion pattern leads to rework (e.g., building CI/CD for Code Promotion when the team actually needs Model Promotion). +> +> If maturity level or environments are unknown, these can be estimated — but promotion pattern **must** be explicit. + +If maturity level is unknown, estimate based on their answers or suggest running the parent mlops skill first. **Never ask the user to self-assess their maturity level without first explaining what each level means.** + +### Step 3: Load and Present Pattern + +Based on topic selection, **Load** the corresponding reference: + +| Topic | Reference | +|-------|-----------| +| Promotion Patterns | `references/promotion-patterns.md` | +| CI/CD & Testing | `references/ci-cd-testing.md` | +| Continuous Training | `references/continuous-training.md` | +| Monitoring & Rollback | `references/monitoring-rollback.md` | +| Model Lifecycle | `references/model-lifecycle.md` | +| Data & Features | `references/data-features.md` | +| Governance & Metadata | `references/governance-metadata.md` | + +Present the relevant maturity level section (L1/L2/L3) for the user's promotion pattern. Include: +- What to implement +- How it works +- Key decisions +- Risk callouts (if applicable) + +**When the topic is Promotion Patterns**: Always present the "Promotion Mechanisms and Snowflake Features" section from `references/promotion-patterns.md`. This gives the user a concrete view of how each artifact type moves between environments via CI/CD, which Snowflake commands are used, and which features enable the workflow. Present it alongside the pattern-specific guidance — do not wait for the user to ask. 
+ +### Step 4: Actionable Checklist + +Produce an implementation checklist tailored to the user's context: + +``` +Implementation Checklist: [Topic] at L[X] [Pattern] Promotion +============================================================== +[ ] Step 1: [specific action] +[ ] Step 2: [specific action] +[ ] Step 3: [specific action] +... +Prerequisites: [list] +Depends on: [other capabilities that must be in place] +``` + +## Stopping Points + +- ✋ After Step 1: Confirm topic selection +- ✋ After Step 2: **Hard gate** — promotion pattern must be explicitly confirmed before proceeding. Confirm context (promotion pattern, maturity levels, environment setup) before loading reference. **Load only** the corresponding reference (do not preload all references) +- ✋ After Step 3: **Present** the key decisions and risk callouts from the pattern. **Ask** the user to confirm the approach before generating the checklist +- ✋ After Step 4: Review checklist for feasibility + +## Output + +- Pattern guidance for selected topic at specified maturity level +- Implementation checklist with prerequisites and dependencies + +## Troubleshooting + +**User doesn't know their maturity level:** +- Suggest running the parent mlops skill first for a full assessment, or estimate based on their answers. + +**Pattern doesn't cover the user's use case:** +- Check if a combination of patterns applies. Present the closest match and note gaps explicitly. + +**User wants guidance across multiple topics at once:** +- Prioritize by dependency order: promotion patterns -> CI/CD -> data/features -> CT -> monitoring -> governance. Work through one at a time. 
diff --git a/skills/mlops/implement-patterns/references/ci-cd-testing.md b/skills/mlops/implement-patterns/references/ci-cd-testing.md new file mode 100644 index 00000000..bc4ed395 --- /dev/null +++ b/skills/mlops/implement-patterns/references/ci-cd-testing.md @@ -0,0 +1,381 @@ +# CI/CD & Testing + +> **Environment naming**: This file uses canonical names **DEV**, **STAGING**, **PROD**. Substitute the user's preferred names in all outputs. For **2-environment setups** (DEV → PROD only), see the 2-env adaptation notes in the Environment Structure section. For isolation strategies, see the "Environment Isolation Strategies" section below. + +## CI - Code Tests + +### L1 - Manual +- Tests exist but are run manually by data scientists +- Testing is part of notebook or ML Jobs execution +- No automated test suite +- No Git Integration or Snowflake CLI (automation tools — L2) + +### L2 - Semi-automated +- Auto-run on PR via CI pipeline +- Unit tests for feature engineering logic (transformations, encodings) +- Unit tests for data processing methods +- Unit tests for utility functions and helpers +- Test coverage tracked but not blocking + +### L3 - Fully Automated +- All L2 tests + blocking merge gate (PRs cannot merge if tests fail) +- Test coverage requirements enforced +- Linting and code quality checks automated + +## CI - ML-Specific Tests + +### L1 - Manual +- Ad-hoc checks in notebooks (Experiments for manual comparison) + +### L2 - Semi-automated +- Model training convergence test (loss decreases over iterations on sample data) +- NaN/infinity checks (no NaN values from division by zero or extreme values) +- Artifact production validation (each pipeline step produces expected outputs) +- Model overfits on small sample (sanity check that model can learn) + +### L3 - Fully Automated +- All L2 tests + component output tests (each component produces expected schema/format) +- Cross-component integration tests (full pipeline runs end-to-end in staging) +- Model 
reproducibility tests (same input produces consistent outputs) +- Performance regression tests (new model meets minimum quality bar) + +## CD - Infrastructure Validation + +### L1 - Manual +- Manual check that packages are installed in serving environment +- Manual verification of compute resources +- SPCS available for containerized workloads (manual container health checks) + +### L2 - Semi-automated +- Automated package/dependency compatibility verification before deploy +- Automated check that model format matches serving infrastructure expectations +- Automated verification of API contract (input/output schema) + +### L3 - Fully Automated +- All L2 checks + memory/compute/accelerator resource availability checks +- Container image vulnerability scanning +- Network connectivity validation +- Serving environment health checks pre-deployment + +## CD - Service Testing + +### L1 - Manual +- Manual API call to test prediction service +- Manual spot-check of predictions + +### L2 - Semi-automated +- Automated prediction service API tests (expected inputs produce expected outputs) +- Input/output schema validation +- Error handling tests (malformed input, missing features) +- Load testing (QPS benchmarks, latency SLAs) +- Canary deployment testing (small traffic percentage to new model, human promotion decision) +- Shadow mode testing (new model runs alongside old, outputs compared) + +### L3 - Fully Automated +- All L2 tests + automated canary validation (auto-promote/rollback based on metrics) +- End-to-end latency profiling with automated regression detection +- Automated traffic ramp-up and rollback without human intervention + +## CD - Deployment Execution + +### L1 - Manual +- Manual alias update in Model Registry +- Manual endpoint configuration +- Manual notification to stakeholders + +### L2 - Semi-automated +- CI/CD triggers deployment +- Human approval gate before production deployment +- Automated deployment to staging/test environments +- Semi-automated 
deployment to production after approval + +### L3 - Fully Automated +- Fully automated deployment with zero-downtime (blue-green or rolling) +- Automated rollback on health check failure +- Automated canary promotion (gradual traffic shift) +- Deployment notifications and audit logging + +## Pipeline Architecture + +### L1 - Manual +- Same scripts run manually in each environment +- ML Jobs available for manual execution (no orchestration) +- SPCS available for containerized workloads (manually deployed) +- Manual transitions between steps + +### L2 - Semi-automated +- Parameterized pipeline, same code across environments with config differences +- Modularized components (reusable across pipelines) +- Pipeline orchestrator manages step transitions +- Shared libraries for common operations +- Containerized pipelines available for teams ready to adopt + +### L3 - Fully Automated +- Identical containerized pipeline across all environments (config-driven only) +- Components independently versioned, composable, and auto-tested +- Pipeline DAG defined declaratively +- Automatic retry and failure handling per component + +## Environment Isolation Strategies + +Environments can be isolated at two levels: + +| Strategy | How It Works | When to Use | +|---|---|---| +| **Same-account isolation** (recommended default) | Each environment is a separate **database** within the same Snowflake account. Naming conventions distinguish environments (e.g., `PROJECT_DEV`, `PROJECT_PROD`). | Most teams. Simpler to manage, lower overhead. | +| **Multi-account isolation** | Each environment is a **separate Snowflake account**. | Strict regulatory/compliance requirements, hard network/data boundaries between environments. | + +**Schema-level isolation is not recommended** — using schemas within the same database to separate environments leads to fragile naming, complex RBAC, and accidental cross-environment access. 
+ +In both strategies, **naming conventions are critical** — they are the primary mechanism for identifying which environment, layer, and purpose each object serves. See the "Naming Conventions" section below. + +## Data Architecture Layers + +Data architectures typically follow a **three-layer pattern**. Names vary across organizations, but the concept is consistent: + +| Layer | Also Called | Purpose | +|---|---|---| +| **Raw** | Bronze, Landing, Ingestion | Raw data as-is from source systems | +| **Integration** | Silver, Curated, Cleaned | Cleaned, conformed, deduplicated data | +| **Presentation** | Gold, Consumption, Analytics | Business-ready data, aggregates, feature tables, ML-ready datasets | + +**Layer implementation**: The recommended approach is to implement each layer as a **separate database** (e.g., `RAW_DEV`, `INTEGRATION_DEV`, `PRESENTATION_DEV`). This provides clear isolation between layers, simpler RBAC (grants at the database level), and cleaner naming. Using schemas within a single database to represent layers (e.g., `PROJECT_DEV.RAW`, `PROJECT_DEV.INTEGRATION`) is possible but **not recommended** — it leads to a single large database with complex cross-schema permissions and makes it harder to manage access by layer. + +**The data architecture must be identical across all environments.** Each environment (DEV, STAGING, PROD) should have the same layers, same databases, same structure — only the data content and volume may differ. This ensures that code tested in DEV behaves identically when promoted to PROD. + +## Environment Structure + +For environment isolation strategies (same-account vs multi-account) and data architecture layers, see the dedicated sections above. + +### L1 - Manual +- Separate environments for dev/staging/prod — a database per environment (same-account) or a separate account per environment (multi-account). See "Environment Isolation Strategies" above. 
+- RBAC with basic role separation across environments +- Manual access control +- No formal promotion process between environments +- No Git Integration or Snowflake CLI (L2+) + +### L2 - Semi-automated +- A separate database or account per environment (see "Environment Isolation Strategies" above) +- Role-based access control +- CI/CD orchestrates transitions between environments +- Staging mirrors production configuration + +### L3 - Fully Automated +- Fully isolated environments with infrastructure-as-code +- Automated environment provisioning +- Production-like staging (same compute, same data access patterns) +- Resource isolation between training and serving in production + +### 2-Environment Adaptation (Environment Structure) +When no STAGING environment exists (DEV → PROD only): +- **L1**: Single DEV workspace for experimentation + validation. PROD receives promoted artifacts or code. Manual access control between environments. +- **L2**: DEV must absorb staging responsibilities — integration tests, validation pipelines, and approval gates all run in DEV. CI/CD gates must be stricter to compensate. Consider a dedicated DEV namespace or schema for pre-production validation. +- **L3**: DEV environment must be production-like (same compute profile, representative data) for meaningful automated validation. Resource isolation between validation workloads and experimentation in DEV. +- **Cross-component integration tests** (normally run in staging at L3) move to DEV. Ensure DEV has sufficient compute for these heavier test suites. +- **Registry and Feature Store per environment**: At L1+, each environment should have its own Model Registry and Feature Store (separate databases). CI/CD pipelines enforce access control at the promotion boundary (e.g., only the CD pipeline can execute `CREATE MODEL ... FROM MODEL` to promote models to production). A centralized registry across environments is only acceptable at L0 for experimentation. 
+- **Graduation signal**: If DEV validation frequently misses issues caught only in PROD, or if compliance audits require a dedicated pre-production tier, recommend adding a STAGING environment. + +For RBAC and security model guidance, see `governance-metadata.md` § "RBAC / Security Model." + +## LLM/GenAI CI/CD Adaptation + +### CI — LLM-Specific Tests + +#### L2 - Semi-automated +- **Prompt regression tests**: Run updated prompts against a golden evaluation set; compare LLM-as-judge scores to baseline +- **RAG retrieval quality tests**: Verify retrieval precision@k and recall@k on known query-document pairs +- **Chain integration tests**: End-to-end test of multi-step LLM pipelines (retrieval → generation → post-processing) +- **Safety/guardrail tests**: Verify prompts don't produce harmful outputs on adversarial test inputs + +#### L3 - Fully Automated +- All L2 tests + blocking merge gate for prompt/RAG changes +- **Multi-model comparison**: Automated evaluation across model versions (fine-tuned) using AI Observability +- **Cost regression tests**: Flag prompt changes that significantly increase token usage +- **Latency regression tests**: Flag changes that increase end-to-end response time + +### CD — LLM Artifact Deployment + +| Artifact | Deployment Mechanism | +|---|---| +| Prompt templates | Git-based deployment (same as application code) | +| RAG index config | Deploy config → Cortex Search rebuilds index per environment | +| Fine-tuned weights | `CREATE MODEL ... FROM MODEL` (cross-env promotion via CI/CD); alias switch for Champion/Challenger within environment | +| Agent definitions | Git-based deployment of tool configs and orchestration code | + +For rollback patterns per artifact type, see `monitoring-rollback.md` § "LLM Rollback Patterns." + +### Environment Considerations +- **Search index per environment**: Cortex Search index is rebuilt per environment — see `promotion-patterns.md` § "Environment Considerations for LLM Workloads." 
+- **GPU compute pools**: CD pipeline must verify target environment has available GPU resources before deploying fine-tuned model serving endpoints. +- **Token budget gates**: Optional — CD pipeline can enforce maximum estimated token cost per deployment before promoting to production. + +## CI/CD Authentication + +CI/CD pipelines need non-interactive authentication to Snowflake. The approach depends on the CI platform and security requirements. + +### L1 - Manual +- Personal credentials used interactively +- No service accounts or automation-specific credentials + +### L2 - Semi-automated +Pipelines should use **dedicated service accounts** with the **least-privilege principle**: + +1. **Snowflake service users** (`TYPE = SERVICE`) — the preferred identity type for CI/CD automation. Service users cannot log in interactively and are purpose-built for programmatic access. +2. **One service identity per environment** — each environment gets its own service account scoped to its own resources, enforcing isolation at the identity layer. +3. **Short-lived credentials preferred** — authentication methods that issue short-lived tokens reduce blast radius if compromised. +4. **No stored secrets when possible** — some CI platforms support identity federation (e.g., OIDC). Snowflake supports Workload Identity Federation (WIF), allowing CI platforms to authenticate via JWT tokens without storing secrets. This requires a one-time API integration setup (ACCOUNTADMIN). Present OIDC/WIF as an option when the customer asks about authentication — it eliminates secret management but requires platform support and initial setup. Do not default to it without confirming it meets customer requirements. +5. **Temporary connections** — when using OIDC/WIF, prefer `--temporary-connection` with the Snowflake CLI. This avoids requiring a `config.toml` file in the CI environment; authentication is handled entirely via the OIDC token and environment variables (e.g., `SNOWFLAKE_ACCOUNT`). +6. 
**Connection validation step** — always validate the connection before deploying (`snow connection test --temporary-connection`). This catches authentication or network issues early, before any changes are applied. +7. **Python scripts in CI/CD** typically cannot use `get_active_session()` — they must create explicit sessions using environment variables injected by the CI pipeline. For OIDC/WIF, the required environment variables are: + - `SNOWFLAKE_ACCOUNT` — the Snowflake account identifier + - `SNOWFLAKE_AUTHENTICATOR` — set to `WORKLOAD_IDENTITY` + - `SNOWFLAKE_WORKLOAD_IDENTITY_PROVIDER` — set to `OIDC` + - `SNOWFLAKE_TOKEN` — the JWT token issued by the CI platform (injected automatically by the OIDC action) + + The Python script reads these from `os.environ` and passes them to `Session.builder.configs(connection_params).create()`. The environment variable for the target environment (e.g., `var_environment`) is passed separately and used to construct database/schema names at runtime. + +**Common authentication patterns** (non-exhaustive): + +| Pattern | Secret Storage | Token Lifetime | Availability | +|---|---|---|---| +| **OIDC / WIF** | None — CI platform issues JWT | Minutes | Platforms supporting OIDC | +| **Key-pair rotation** | Private key in CI secrets | Until rotated | Any | +| **OAuth client credentials** | Client ID + secret in CI secrets | Configurable | Any | + +### L3 - Fully Automated +- All L2 + network policies restricting CI/CD traffic to known IP ranges +- Automated credential rotation and audit logging +- Least-privilege service roles reviewed periodically + +## Environment Parameterization + +A single codebase should deploy across all environments by parameterizing environment-specific values (database names, schema names, warehouse names, roles). 
This works best when the **environment** is a naming segment in the naming convention — the parameterization variable substitutes the environment portion of object names, allowing the same code to target DEV, STAGING, or PROD without modification. + +**Approaches** (choose based on team tooling and preferences): + +| Approach | How It Works | +|---|---| +| **Snowflake CLI Jinja2** (suggested default) | `snow sql -f .sql --enable-templating JINJA -D "var_environment=PROD"` | +| **CI/CD variable substitution** | `envsubst`, `sed`, or platform-native variable injection | +| **Python-based rendering** | Jinja2 or string formatting in a deploy script | +| **Snowflake Scripting** | Session variables (`SET var_environment = 'PROD';`) | + +**Database naming** should include the **environment** segment (highly recommended — see "Naming Conventions" below) along with other relevant segments agreed upon with the user (business entity, layer, etc.). This enables the same code to target different databases per environment by substituting a single variable. + +For Python scripts, pass the environment as an environment variable and read it at runtime. + +## Naming Conventions + +Consistent naming is foundational — it is the primary mechanism for identifying objects across environments, layers, and teams. + +### Snowflake Object Naming + +A naming convention should be agreed upon with the user during the assessment phase. The following are **naming segments to consider** — present all of them to the user, discuss which are relevant, and let the user decide which to adopt, in what order, and with what separators. Not all segments apply to every object type. + +| Segment | Purpose | Examples | Applies To | +|---|---|---|---| +| **Environment** | Which environment this belongs to | `DEV`, `STAGING`, `PROD` | Highly recommended for databases, warehouses, roles. Primary discriminator for environment parameterization. 
| +| **Business entity / project** | What business domain or project this belongs to | `SUPPLY_CHAIN`, `FRAUD`, `MARKETING` | Databases, schemas, tables | +| **Data architecture layer** | Which layer in the data pipeline | `RAW`, `INTEGRATION`, `PRESENTATION` (or `BRONZE`, `SILVER`, `GOLD`) | Databases (recommended), tables, views | +| **Department / team / function** | Who owns or consumes this | `DATA_ENG`, `ML`, `ANALYTICS`, `FINANCE` | Databases, schemas, roles, warehouses | +| **Source system** | Where the data originates | `SAP`, `SALESFORCE`, `STRIPE`, `KAFKA` | Tables, stages, pipes | +| **Object type** | What kind of object this is (when not obvious from context) | `TBL`, `VW`, `SP`, `TASK`, `AGT` | Schema-level objects (optional — some teams prefer this, others find it redundant) | +| **Region / geography** | Where the data applies geographically | `US`, `EU`, `APAC`, `GLOBAL` | Databases, schemas, tables | +| **Temporal grain / cadence** | Frequency or time granularity | `DAILY`, `MONTHLY`, `HOURLY`, `SNAPSHOT` | Tables, tasks, streams | +| **Version / variant** | Distinguishes variants or versions of the same object | `LEGACY`, `EXPERIMENTAL` | Tables, models, views | +| **Additional modifier(s)** | Extra clarity as needed | `` | Any | + +The convention applies to **all Snowflake objects** — databases, schemas, tables, views, agents, warehouses, roles, etc. Database names in particular should include the **environment** segment since it is the primary discriminator for environment parameterization (see "Environment Parameterization" above). + +**Do not prescribe or enforce a specific convention** — present the full list of segments, discuss which are relevant for the user's context, and let the user define or confirm a convention that fits their organization. The naming convention should be reviewed and agreed upon during the assessment process (see the parent mlops skill). 
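
As an illustration of how an agreed convention can drive environment parameterization, the sketch below assembles an object name from ordered segments. The segment order, the separator, and the `build_object_name` helper are hypothetical choices made for this example only; the actual convention is whatever the user confirms.

```python
from typing import Mapping, Sequence


def build_object_name(
    segment_order: Sequence[str],
    values: Mapping[str, str],
    separator: str = "_",
) -> str:
    """Join the agreed segments, in the agreed order, into an uppercase name."""
    missing = [s for s in segment_order if not values.get(s)]
    if missing:
        raise ValueError(f"missing naming segments: {missing}")
    return separator.join(values[s].strip().upper() for s in segment_order)


# Hypothetical convention: <environment>_<business_entity>_<layer>
convention = ("environment", "business_entity", "layer")
base = {"business_entity": "supply_chain", "layer": "raw"}

# Only the environment segment changes between deployments.
names = {
    env: build_object_name(convention, {**base, "environment": env})
    for env in ("DEV", "STAGING", "PROD")
}
print(names)
```

Because the environment is the only segment that varies, the same deploy script can target every environment by changing one value.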
+ +### Repository Structure + +The repository folder structure should ideally **mirror the Snowflake object hierarchy** when practical: database → schema → schema-level objects. This alignment makes it intuitive to navigate, ensures CI/CD pipelines can map files to their target locations, and keeps the repository in line with the data architecture layers. However, this is not always feasible (e.g., monorepos with non-Snowflake code, legacy project structures, or cross-database shared utilities). When a strict mirror is not possible, aim to preserve the mapping at the level that matters most (typically database folders aligned to data architecture layers). + +**Recommended structure** (layers as databases): +``` +/ +├── / # One folder per database (layer + environment parameterized) +│ ├── / # One folder per schema within the database +│ │ ├── tables/ # Grouped by object type (optional — depends on volume) +│ │ │ ├── .sql +│ │ │ └── ... +│ │ ├── views/ +│ │ ├── procedures/ +│ │ └── ... +│ └── / +├── / # Another layer database +│ └── ... +├── snowflake.yml # Project definition (if using managed entities) +└── ... +``` + +- **Database folders** correspond to Snowflake databases, typically one per data architecture layer (e.g., `raw/`, `integration/`, `presentation/`). When using environment parameterization, the folder name represents the database *template* (the environment segment is resolved at deploy time). +- **Schema folders** represent schemas within each layer database, organized by business domain, function, or source system. +- **Object-type subfolders** (e.g., `tables/`, `views/`, `procedures/`, `tasks/`, `agents/`) are optional — useful when a schema contains many objects, unnecessary when it has few. +- The structure is a recommendation — adapt it to the organization's existing conventions. The key principle is that **navigating the repo should feel like navigating Snowflake**. 
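
One way a pipeline might use this mirroring is to derive the deployment target from a file's path. The sketch below is a minimal illustration under stated assumptions: the environment is a prefix of the database name, object-type folders are optional, and a numeric ordering prefix on the file name is dropped. The `target_for` helper and the folder names are hypothetical.

```python
import re
from pathlib import PurePosixPath

# Optional grouping folders that do not appear in the object name (assumed set).
OBJECT_TYPE_FOLDERS = {"tables", "views", "procedures", "tasks", "agents"}


def target_for(path: str, environment: str) -> str:
    """Map '<database>/<schema>/[<type>/]<file>' to ENV_DATABASE.SCHEMA.OBJECT."""
    parts = PurePosixPath(path).parts
    if len(parts) < 3:
        raise ValueError(f"expected <database>/<schema>/.../<file>, got {path!r}")
    database, schema, *middle, filename = parts
    unexpected = [p for p in middle if p not in OBJECT_TYPE_FOLDERS]
    if unexpected:
        raise ValueError(f"unrecognized folders in {path!r}: {unexpected}")
    # Drop the extension and an optional ordering prefix such as "01_".
    stem = re.sub(r"^\d+_", "", PurePosixPath(filename).stem)
    return f"{environment}_{database}.{schema}.{stem}".upper()


print(target_for("raw/supply_chain/tables/01_demand_forecast_features.sql", "dev"))
```

If the user's convention places the environment segment elsewhere, only the final formatting line needs to change.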
+ +### File Naming and Organization + +- **One file per Snowflake object is recommended where practical** — ideally, each SQL or Python file defines exactly one Snowflake object (one table, one view, one agent, etc.). This enables selective deployment, clean diffs, and clear ownership. However, this is not always feasible (e.g., tightly coupled objects, migration scripts, or objects with cross-dependencies). When a file must define multiple objects, document the reason and keep the scope as narrow as possible. +- **File names should closely mirror the Snowflake object name** — this makes it easy to find the source file for any object and vice versa. Example: `supply_chain_demand_forecast_features.sql` → `SUPPLY_CHAIN_DEMAND_FORECAST_FEATURES` table. +- **Numeric prefixes** are a simple option to indicate execution order when files have dependencies (e.g., `01_raw_table.sql`, `02_integration_view.sql`, `03_presentation_agent.sql`). For runtime orchestration, Snowflake Tasks can also manage dependency ordering declaratively. +- Apply the same naming discipline to Python and notebook files — the file name should indicate which object or pipeline step it implements. + +## Deployable Artifact Types + +CI/CD pipelines may deploy different file types to Snowflake. 
Each has different strengths: + +| Type | Best For | CI/CD Deployment | +|---|---|---| +| **SQL files** (`.sql`) | DDL, object definitions (tables, views, agents, semantic views, grants) | `snow sql -f --enable-templating JINJA -D "var_environment="` | +| **Python files** (`.py`) | Complex logic, evaluations, data transformations, API calls | `var_environment= python ` | +| **Notebooks** (`.ipynb`) | Exploration, experimentation, ad-hoc analysis | `snow notebook deploy ` + `EXECUTE NOTEBOOK .(...)` | + +**Advisory guidance on artifact selection:** + +- **SQL and Python are recommended for production deployments** — they are deterministic, easily testable, diffable in code review, and straightforward to parameterize. +- **Notebooks are best suited for experimentation and exploration.** When a notebook matures into a production artifact, consider converting it to a Python script — extract the logic, add proper argument handling, and remove interactive/visualization cells. +- If a notebook must be deployed (e.g., reporting step), it can be managed via `snowflake.yml` project definitions. Deployment is a two-step process: first deploy the entity (`snow notebook deploy --replace`), then execute it (`EXECUTE NOTEBOOK .(...)`). Environment parameters can be passed at both steps. +- When notebook files use numeric prefixes for ordering (e.g., `04_eval_review.ipynb`), strip the prefix to derive the Snowflake entity name (e.g., `eval_review`). +- CI/CD pipelines should **route deployment by file type** — each type has a different execution mechanism. The pipeline should iterate over changed files, detect the extension, and call the appropriate deployment command. + +## Selective Deployment + +At L2+, CI/CD pipelines should deploy **only changed files** rather than re-executing the entire codebase. This reduces deployment time and blast radius. + +**Change detection pattern:** +1. 
Compare the current commit to the previous deployment baseline (e.g., `HEAD~1` or last deployed tag) +2. Filter for added, copied, modified, and renamed files (exclude deleted) — only deployable extensions (`.sql`, `.py`, `.ipynb`) +3. Sort the results — if files use numeric prefixes for ordering, alphabetical sort produces the correct execution order +4. Route each file to its deployment mechanism based on file type (see "Deployable Artifact Types" above) +5. If no deployable files changed, exit early with a success status + +**Ordering matters** — files often have dependencies (e.g., table must exist before semantic view, semantic view before agent). Dependency ordering can be managed via naming conventions (numeric prefixes), explicit dependency manifests, or Snowflake Tasks for runtime orchestration. + +## CI/CD Pipeline Structure (L2+) + +A well-structured CI/CD pipeline follows a consistent shape regardless of CI platform: + +1. **Checkout** — clone the repository with minimal depth (only enough for change detection) +2. **Authenticate** — set up Snowflake CLI with the chosen authentication method +3. **Install dependencies** — install Python packages needed by deployment scripts +4. **Validate connection** — test that authentication works before deploying anything +5. **Detect changes** — identify which files changed since the last deployment +6. 
**Deploy** — route each changed file to its deployment mechanism by file type + +**Key principles:** +- **Pin tool versions** — pin the Snowflake CLI version for reproducible deployments across runs +- **One environment per pipeline run** — use CI platform environments (e.g., GitHub Environments) to scope OIDC trust, variables, and approval gates per environment +- **Environment as a variable** — pass the environment identifier at the command level for every deployment command, not as a global config +- **Fail fast** — validate the connection before deploying; detect empty changesets early +- **Minimal checkout** — shallow clone with just enough history for diff-based change detection + +For an annotated example using GitHub Actions with OIDC authentication (one of the supported patterns), see `templates/github-actions-deploy.yml`. + +## See Also + +- `promotion-patterns.md` — How environment structure varies by promotion pattern +- `continuous-training.md` — Retraining pipelines that CI/CD must support +- `data-features.md` — Data validation tests to include in CI pipeline diff --git a/skills/mlops/implement-patterns/references/continuous-training.md b/skills/mlops/implement-patterns/references/continuous-training.md new file mode 100644 index 00000000..baa3dac2 --- /dev/null +++ b/skills/mlops/implement-patterns/references/continuous-training.md @@ -0,0 +1,121 @@ +# Continuous Training + +> **Environment naming**: This file uses canonical names **DEV**, **STAGING**, **PROD**. Substitute the user's preferred names in all outputs. For 2-environment setups, omit STAGING references. For environment isolation strategies and naming conventions, see `ci-cd-testing.md`. 
+ +## Trigger Types + +### On-Demand +- Manual execution of training pipeline +- Used for: initial training, debugging, ad-hoc experiments +- Available at: all maturity levels + +### Scheduled +- Training pipeline runs on a fixed cadence (daily, weekly, monthly) +- Used for: regularly updated data, predictable data patterns +- Available at: L2+ +- **Considerations**: frequency depends on data freshness requirements and training cost + +### On New Data Availability +- Pipeline triggered when new labeled data lands in source tables +- Used for: irregular data collection, event-driven data sources +- Available at: L3 +- **Implementation**: event-based trigger (e.g., data pipeline completion callback, table change notification) + +### On Performance Degradation +- Pipeline triggered when model performance metrics drop below threshold +- Used for: production models with accuracy SLAs +- Available at: L3 +- **Implementation**: monitoring pipeline detects metric anomaly -> triggers retraining workflow +- **Considerations**: define clear thresholds; avoid retraining loops from noisy metrics + +### On Concept Drift +- Pipeline triggered when input data distributions change significantly +- Used for: models sensitive to data distribution shifts +- Available at: L3 +- **Implementation**: statistical tests on feature distributions (KS test, PSI, Jensen-Shannon divergence) +- **Considerations**: distinguish between natural distribution shift and data quality issues; data validation should run first + +## Patterns by Maturity Level + +### L1 - Manual +- Data scientist notices model degradation or gets new data +- Manually triggers retraining (ML Jobs, Distributed Training, HPO available for manual execution) +- Manually compares new model to current production model (Experiments) +- Manual Champion alias switch if improvement confirmed (within the same environment) +- **Cadence**: ad-hoc, typically monthly or less + +### L2 - Semi-automated +- Scheduled retraining jobs (e.g., 
weekly) +- Automated training pipeline with logging +- Automated validation produces metrics +- Human reviews metrics and approves promotion +- **Cadence**: scheduled (weekly/monthly), plus on-demand +- **Key components**: + - Orchestrated training pipeline (workflow orchestrator) + - Experiment tracker logging all runs + - Automated metric comparison against baseline + +### L3 - Fully Automated +- All trigger types active (scheduled + data-driven + drift-driven + performance-driven) +- Automated validation + promotion if thresholds pass +- Automated rollback if new model underperforms in production +- Full metadata trail linking trigger -> training run -> model version -> deployment +- **Cadence**: event-driven, can be daily or more frequent +- **Key components**: + - Data validation gate before training + - Automated Champion/Challenger comparison + - Monitoring pipeline that emits retraining triggers + - Metadata store linking triggers to outcomes + +## Retraining Pipeline Design + +### Minimal Pipeline (L1-L2) +``` +Data Extract -> Data Prep -> Train -> Evaluate -> Register +``` + +### Full Pipeline (L3) +``` +Trigger -> Data Validation -> Feature Computation -> Train + Tune -> +Evaluate -> Model Validation -> Register -> Champion/Challenger -> +Deploy -> Monitor +``` + +### Key Decisions +- **Data scope**: Retrain on all historical data or rolling window? +- **Hyperparameter handling**: Reuse best known hyperparameters or re-tune each time? +- **Fallback**: What happens if retraining produces a worse model? 
(Answer: keep current Champion, alert team) +- **Resource isolation**: Retraining should not impact serving latency or availability + +## LLM/GenAI Retraining & Iteration Adaptation + +LLM workloads have different "retraining" triggers depending on the development approach: + +### Prompt Iteration (Code Promotion) +- **Trigger**: Quality regression detected by LLM-as-judge, user feedback trends, new use cases +- **Process**: Update prompt template in Git → CI runs prompt regression tests → deploy to staging → evaluate → promote to prod +- **Cadence**: Can be frequent (daily or more) since prompt changes are lightweight + +### RAG Index Refresh +- **Trigger**: New documents available, corpus updated, retrieval quality degradation +- **Process**: Update source data → Cortex Search index rebuilds (scheduled or on new data via Streams) → retrieval quality tests → promote config if changed +- **Cadence**: Tied to data freshness requirements (hourly to weekly) + +### Fine-Tuning Re-runs +- **Trigger**: Accumulated human feedback data, domain shift, new training data available +- **Process**: Same as traditional ML retraining — new fine-tuning job → evaluate → register → Champion/Challenger → promote +- **Cadence**: Less frequent than prompt iteration (weekly to monthly), driven by feedback data volume + +### Agentic Workflow Updates +- **Trigger**: New tools available, tool behavior changes, routing accuracy degradation +- **Process**: Update agent configuration in Git → integration tests (tool execution, routing) → promote +- **Cadence**: Event-driven (when tools change or new capabilities added) + +### Key Difference from Traditional ML +Traditional ML retraining produces a new model. LLM "retraining" may produce a new prompt version (cheap, fast), a refreshed search index (medium cost), or new fine-tuned weights (expensive, slow). The CI/CD pipeline should handle all three artifact types with appropriate validation gates for each. 
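
The idea of per-artifact validation gates can be sketched as a small routing table. Everything named here is an assumption for illustration: the artifact type labels, the metric names, and the thresholds are hypothetical and should be replaced with values agreed with the team.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Candidate:
    artifact_type: str         # "prompt", "search_index", or "fine_tuned_model"
    metrics: Dict[str, float]  # produced by earlier evaluation steps


Gate = Callable[[Candidate], bool]


def min_metric(name: str, threshold: float) -> Gate:
    """Build a gate that passes when the named metric reaches the threshold."""
    def gate(candidate: Candidate) -> bool:
        # A missing metric fails the gate rather than passing silently.
        return candidate.metrics.get(name, float("-inf")) >= threshold
    return gate


# Hypothetical gates: cheaper artifacts get lighter checks.
GATES: Dict[str, List[Gate]] = {
    "prompt": [min_metric("judge_score", 0.8)],
    "search_index": [min_metric("precision_at_k", 0.7), min_metric("recall_at_k", 0.6)],
    "fine_tuned_model": [min_metric("judge_score", 0.8), min_metric("task_accuracy", 0.75)],
}


def may_promote(candidate: Candidate) -> bool:
    """Return True only if every gate for the candidate's artifact type passes."""
    gates = GATES.get(candidate.artifact_type)
    if gates is None:
        raise ValueError(f"no gates defined for {candidate.artifact_type!r}")
    return all(gate(candidate) for gate in gates)


print(may_promote(Candidate("prompt", {"judge_score": 0.85})))
```

Keeping the gates in one table makes it straightforward to review which checks apply to which artifact type.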
+ +## See Also + +- `monitoring-rollback.md` — Drift detection triggers that feed into continuous training +- `model-lifecycle.md` — Champion/Challenger workflow for validating retrained models +- `data-features.md` — Data validation before retraining begins diff --git a/skills/mlops/implement-patterns/references/data-features.md b/skills/mlops/implement-patterns/references/data-features.md new file mode 100644 index 00000000..35ea6ae3 --- /dev/null +++ b/skills/mlops/implement-patterns/references/data-features.md @@ -0,0 +1,138 @@ +# Data & Features + +> **Scope**: This file covers the *process* layer — when to adopt a Feature Store, what validation gates to enforce, how to prevent skew across environments. For the *technical implementation* (Feature Store API, Cortex Search setup code), use the `machine-learning` skill. + +> **Environment naming**: This file uses canonical names **DEV**, **STAGING**, **PROD**. Substitute the user's preferred names in all outputs. For 2-environment setups, omit STAGING references. For environment isolation strategies and naming conventions, see `ci-cd-testing.md`. 
+ +## Data Validation + +### Pre-Training Validation + +#### L1 - Manual +- Data scientist manually inspects data in notebook +- Ad-hoc checks on row counts, null rates, distributions +- No automated gates + +#### L2 - Semi-automated +- Automated schema validation before training pipeline runs: + - All expected features present + - No unexpected features + - Data types match expectations + - No schema version mismatch +- Pipeline halts on schema skew; team notified +- Basic statistical checks (null rate thresholds, row count minimums) + +#### L3 - Fully Automated +- All L2 checks + statistical distribution validation: + - Feature distributions compared to reference baseline (KS test, PSI) + - Significant data value skew triggers retraining (not halt) + - Anomalous records quarantined for review +- Auto-decision: schema skew -> halt pipeline; value skew -> trigger retraining +- Validation results logged to metadata store + +### Data Schema Skews (anomalies that should halt pipeline) +- Unexpected features received +- Expected features missing +- Feature data type changed +- Feature value range outside expected bounds + +### Data Value Skews (changes that should trigger retraining) +- Significant shift in feature distributions +- Change in class balance (classification) +- Change in target variable distribution (regression) +- New categorical values appearing + +## Feature Store + +At L1+, each environment should have its own Feature Store (separate databases), following the same per-environment isolation principle as Model Registry. A centralized Feature Store across environments is only acceptable at L0 for experimentation. This ensures that feature definitions, refresh schedules, and data access are isolated per environment. 
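
Whichever environment's Feature Store supplies the data, the halt-versus-retrain rule from the Data Validation section above can be sketched as follows. This is a minimal illustration, not a reference implementation: the `psi` and `validation_decision` helpers are hypothetical, PSI is computed over equal-width bins taken from the reference sample, and the 0.2 threshold is a common rule of thumb rather than a value prescribed by this document.

```python
import math
from typing import Dict, Sequence


def psi(reference: Sequence[float], current: Sequence[float], bins: int = 10) -> float:
    """Population Stability Index over equal-width bins of the reference range."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # avoid a zero width for constant features

    def proportions(values: Sequence[float]) -> list:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1  # clamp out-of-range values
        # Floor each share at a small epsilon so the logarithm stays defined.
        return [max(c / len(values), 1e-6) for c in counts]

    ref, cur = proportions(reference), proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


def validation_decision(
    expected_schema: Dict[str, str],
    actual_schema: Dict[str, str],
    reference: Dict[str, Sequence[float]],
    current: Dict[str, Sequence[float]],
    psi_threshold: float = 0.2,  # assumed threshold; agree on one per feature
) -> str:
    """Return 'halt' on schema skew, 'retrain' on value skew, else 'proceed'."""
    if actual_schema != expected_schema:
        return "halt"
    drifted = [f for f in reference if psi(reference[f], current[f]) > psi_threshold]
    return "retrain" if drifted else "proceed"


schema = {"amount": "FLOAT"}
baseline = {"amount": [i / 100 for i in range(100)]}
shifted = {"amount": [0.5 + i / 200 for i in range(100)]}
print(validation_decision(schema, schema, baseline, shifted))
```

Schema checks run first so that a structural problem is never misread as drift.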
+ +### L1 - Manual (Feature Store Available) +- Per-environment Feature Store (each environment has its own database) +- Feature Store available for centralized definitions within the environment (manually maintained) +- Features can also be computed inline in training code +- Feature logic may be duplicated between training and serving +- Training-serving skew risk managed by manual review + +### L2 - Semi-automated (Automated Feature Store) +- Centralized feature definitions and storage (upgrade from manual L1 setup) +- Automated incremental refresh from batch/streaming sources +- Offline serving for training (batch feature retrieval) +- Feature discovery: data scientists can find and reuse existing features +- Feature metadata tracked (owner, description, freshness) +- Training pipeline reads from feature store +- **Benefit**: Feature reuse, consistent definitions, reduced duplication + +### L3 - Fully Automated (Feature Store Expected) +- Unified offline (training) and online (serving) feature serving +- Feature versioning and lineage tracking +- Automated feature freshness monitoring +- Point-in-time correct feature retrieval for training +- Low-latency online serving for real-time predictions +- **Benefit**: Eliminates training-serving skew, enables real-time features + +### Feature Store Key Decisions +- **Offline vs online**: Do you need real-time feature serving or batch only? +- **Freshness**: How stale can features be before predictions degrade? +- **Compute**: Where are features computed? 
(batch pipeline, streaming, on-demand) +- **Storage**: Unified platform (Snowflake Feature Store) vs external online store (if latency requirements exceed platform capabilities) + +## Training-Serving Skew Prevention + +### What Causes Skew +- Different feature computation code in training vs serving +- Different data sources or preprocessing between environments +- Stale features in online store +- Time-of-prediction features computed differently than time-of-training + +### Prevention Patterns + +#### L1 - Manual +- Code review to verify feature logic matches between training and serving +- Manual testing with sample data through both paths + +#### L2 - Semi-automated +- Shared transformation code between training and serving pipelines +- Or: feature store provides consistent feature values for both +- Automated tests comparing feature outputs from training and serving paths on same input + +#### L3 - Fully Automated +- Feature store as single source of truth for both training and serving +- Automated skew detection: compare feature distributions at training time vs serving time +- Alerts on significant divergence +- Feature monitoring dashboard + +## LLM/GenAI Data Adaptation + +### Vector DB / Search Index as Feature Store Equivalent + +For RAG workloads, Cortex Search plays the role that Feature Store plays for traditional ML: + +| Aspect | Feature Store (Traditional ML) | Cortex Search (RAG/LLM) | +|---|---|---| +| Purpose | Consistent feature vectors for training + serving | Consistent document retrieval for generation | +| Consistency concern | Training-serving skew | Retrieval quality drift | +| Per-environment | Feature computations run per environment | Search index rebuilt per environment | +| Freshness | Feature refresh on schedule or data change | Index refresh on schedule or new documents | +| Monitoring | Feature distribution drift | Retrieval precision@k, recall@k | + +### Data Validation for LLM Workloads + +#### RAG Corpus Validation +- 
**Schema**: Document format, metadata fields present, no empty/corrupt documents +- **Quality**: Duplicate detection, stale content flagging, language validation +- **Coverage**: New topics or domains missing from corpus + +#### Fine-Tuning Data Validation +- **Format**: Training examples match expected schema (instruction/response pairs, etc.) +- **Quality**: Label quality checks, deduplication, toxicity/bias screening +- **Volume**: Minimum training set size met; class balance acceptable + +### Training-Serving Skew for LLM Workloads +- **RAG skew**: Development corpus differs significantly from production corpus → retrieval quality degrades in production +- **Prevention**: Per-environment Cortex Search indexes built from environment-specific data; retrieval quality tests in CI compare against known query-document pairs + +## See Also + +- `ci-cd-testing.md` — Data validation tests integrated into CI pipeline +- `continuous-training.md` — Data-availability triggers for retraining +- `monitoring-rollback.md` — Feature drift as a monitoring signal diff --git a/skills/mlops/implement-patterns/references/governance-metadata.md b/skills/mlops/implement-patterns/references/governance-metadata.md new file mode 100644 index 00000000..f5632aad --- /dev/null +++ b/skills/mlops/implement-patterns/references/governance-metadata.md @@ -0,0 +1,198 @@ +# Governance & Metadata + +> **Environment naming**: This file uses canonical names **DEV**, **STAGING**, **PROD**. Substitute the user's preferred names in all outputs. For 2-environment setups, omit STAGING references. For environment isolation strategies and naming conventions, see `ci-cd-testing.md`. 
+ +## Metadata Management + +### What to Track + +#### Per Training Run +- Pipeline and component versions executed +- Start/end time and duration per step +- Who/what triggered the run +- Parameter arguments passed +- Pointers to intermediate artifacts (prepared data, validation results, statistics) +- Input data snapshot or reference (which tables, which date range) + +#### Per Model Version +- Training run ID (links to all run metadata above) +- Evaluation metrics (train set + test set) +- Hyperparameters used +- Code commit hash +- Data lineage (which features, which tables) +- Validation status and results +- Tags (model_validation_status, deployment_status, etc.) + +#### Per Deployment +- Which model version is serving +- Deployment timestamp +- Environment (staging/prod) +- Endpoint configuration (resources, replicas) +- Traffic split (if A/B testing) + +### Patterns by Maturity Level + +#### L1 - Manual +- Experiments used to manually log params and metrics per training run +- Datasets used for manual data snapshots +- Model Registry used for manual model registration +- Key metrics recorded but not centralized or searchable + +#### L2 - Semi-automated +- Experiment tracker logs params, metrics, artifacts per run +- Model registry stores version metadata +- Tags and annotations on model versions +- Searchable experiment history + +#### L3 - Fully Automated +- Full pipeline metadata store: + - Every pipeline execution recorded with component versions + - Artifact lineage auto-tracked (which data produced which model) + - Auto-linked: trigger -> run -> model version -> deployment +- Queryable metadata API for auditing and debugging +- Automated metadata quality checks (no model registered without required tags) + +## Lineage + +### Data Lineage +- Track which source tables/features went into each model version +- Enable impact analysis: "if this table changes, which models are affected?" 
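
The impact-analysis question can be answered by inverting the recorded lineage. The sketch below assumes lineage is available as a mapping from model version to source tables; the model names, table names, and helper functions are hypothetical, and in practice the mapping would come from wherever the team records lineage.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

ModelVersion = Tuple[str, str]  # (model name, version name)


def build_reverse_index(
    lineage: Dict[ModelVersion, Iterable[str]],
) -> Dict[str, Set[ModelVersion]]:
    """Invert 'model version -> source tables' into 'table -> model versions'."""
    index: Dict[str, Set[ModelVersion]] = defaultdict(set)
    for model_version, tables in lineage.items():
        for table in tables:
            index[table.upper()].add(model_version)
    return index


def affected_models(index: Dict[str, Set[ModelVersion]], table: str) -> List[ModelVersion]:
    """List the model versions that were trained on the given table."""
    return sorted(index.get(table.upper(), set()))


# Hypothetical lineage records.
lineage = {
    ("DEMAND_FORECAST", "V1"): ["PROD_RAW.SALES.ORDERS", "PROD_RAW.SALES.CALENDAR"],
    ("DEMAND_FORECAST", "V2"): ["PROD_RAW.SALES.ORDERS"],
    ("FRAUD_SCORE", "V1"): ["PROD_RAW.PAYMENTS.TRANSACTIONS"],
}
index = build_reverse_index(lineage)
print(affected_models(index, "prod_raw.sales.orders"))
```

Normalizing table names to uppercase keeps lookups consistent with how unquoted identifiers are usually written.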
+ +### Model Lineage +- Track which code, data, and parameters produced each model version +- Enable reproducibility: "recreate this exact model version" + +### Pipeline Lineage +- Track which pipeline version ran, when, with what configuration +- Enable debugging: "what was different about last Tuesday's run?" + +### Patterns by Maturity Level + +#### L1 - Manual +- Data scientist documents data sources in notebook or Experiments +- Manual code snapshot (version noted in experiment metadata) + +#### L2 - Semi-automated +- Experiment tracker auto-captures code snapshot, data inputs, parameters +- Model registry links model version to training run +- Pipeline orchestrator logs DAG execution history + +#### L3 - Fully Automated +- End-to-end lineage graph: data source -> features -> training run -> model -> deployment +- Automated impact analysis queries +- Lineage-aware alerting (upstream data change notifies downstream model owners) + +## RBAC / Security Model + +MLOps does **not** require a separate RBAC design. The security model for MLOps objects (databases, schemas, tables, models, endpoints, pipelines) must be governed by and follow the **same standards as the customer's existing RBAC framework**. + +Key principles: +- **Integrate, don't isolate** — MLOps roles, grants, and access policies should fit within the organization's existing role hierarchy, naming conventions, and governance processes. +- **Environment-scoped roles** — roles should be scoped per environment (e.g., `ML_DEV_ADMIN`, `ML_PROD_READONLY`) following the same environment isolation boundaries as the rest of the data platform. +- **CI/CD service roles** — service users used for automation should have dedicated roles with least-privilege grants, reviewed periodically. These roles follow the same RBAC standards as any other service account in the organization. +- **Do not create shadow governance** — avoid building a parallel permission system for ML objects. 
Use Snowflake's native RBAC (roles, grants, database roles) consistently. + +For CI/CD-specific service user setup, see `ci-cd-testing.md` § "CI/CD Authentication." + +## Compliance & Audit + +### Patterns by Maturity Level + +#### L1 - Manual +- Manual documentation of model purpose and behavior +- Ad-hoc compliance checks +- No formal audit trail + +#### L2 - Semi-automated +- Tags on model versions (owner, purpose, data_sensitivity, approval_status) +- Automated approval gates in promotion workflow +- Model cards or documentation attached to registered models +- Access control on model registry (who can register, promote, deploy) + +#### L3 - Fully Automated +- Full audit trail: every action on a model version logged (who, when, what) +- Automated compliance checks before promotion (required tags present, documentation complete) +- Policy enforcement (models without required metadata cannot be promoted) +- Regulatory reporting: automated generation of model risk documentation +- Data privacy compliance: track which PII features are used by which models + +### Governance Checklist per Maturity Level + +**L1 Minimum:** +- [ ] Model purpose documented +- [ ] Training data source identified +- [ ] Model owner assigned +- [ ] Basic performance metrics recorded + +**L2 Standard:** +- [ ] All L1 items +- [ ] Model version tagged with required metadata +- [ ] Approval gate before production deployment +- [ ] Experiment history searchable +- [ ] Access control configured on registry + +**L3 Comprehensive:** +- [ ] All L2 items +- [ ] End-to-end lineage tracked +- [ ] Automated compliance checks enforced +- [ ] Full audit trail queryable +- [ ] Retention and archival policies active +- [ ] Impact analysis available for upstream changes + +## LLM/GenAI Governance Adaptation + +### LLM-Specific Metadata to Track + +#### Per Prompt Version +- Prompt template text and version (Git commit hash) +- System instructions, few-shot examples +- Target foundation model and parameters 
(temperature, max_tokens) +- Evaluation scores (LLM-as-judge metrics) + +#### Per RAG Configuration +- Corpus source tables and date range +- Chunking strategy and parameters +- Embedding model used +- Retrieval quality metrics (precision@k, recall@k) + +#### Per Fine-Tuned Model +- Base model used +- Training data snapshot (Dataset reference) +- Fine-tuning parameters (epochs, learning rate, etc.) +- Evaluation metrics (LLM-as-judge + task-specific) + +### LLM Lineage +- **Prompt lineage**: Git history of prompt templates → evaluation results → deployment events +- **RAG lineage**: Source documents → Cortex Search index → retrieval + generation quality metrics +- **Fine-tuned model lineage**: Training data → fine-tuning run → model version → deployment (same as traditional ML, extends to Model Registry) + +### LLM Access Control +- **Cortex AI RBAC**: Control which roles can access which foundation models (e.g., restrict expensive models to production use) +- **Prompt access**: Version-controlled prompts inherit Git repo access control +- **Search index access**: Cortex Search service access controlled via Snowflake RBAC (grants on the service) +- **Fine-tuned model access**: Model Registry RBAC (same as traditional ML) + +### LLM Governance Checklist Additions + +**L1 Minimum (add to existing checklist):** +- [ ] LLM development approach documented (API / prompt / RAG / fine-tuning) +- [ ] Foundation model selection documented with rationale + +**L2 Standard (add to existing checklist):** +- [ ] Prompt templates versioned in Git +- [ ] LLM evaluation scores tracked per deployment +- [ ] Cortex AI RBAC configured (model access by role) +- [ ] RAG corpus source documented and refresh schedule defined + +**L3 Comprehensive (add to existing checklist):** +- [ ] Automated LLM-as-judge evaluation on production traffic +- [ ] Human feedback loop integrated +- [ ] Token cost tracking and alerting configured +- [ ] Safety/guardrail tests in CI pipeline +- [ ] Full prompt 
+ RAG + fine-tuning lineage tracked + +## See Also + +- `model-lifecycle.md` — Registry and versioning that governance tracks +- `monitoring-rollback.md` — Incident events that require audit trail +- `data-features.md` — Data lineage and feature provenance diff --git a/skills/mlops/implement-patterns/references/model-lifecycle.md b/skills/mlops/implement-patterns/references/model-lifecycle.md new file mode 100644 index 00000000..7e13b807 --- /dev/null +++ b/skills/mlops/implement-patterns/references/model-lifecycle.md @@ -0,0 +1,168 @@ +# Model Lifecycle + +> **Scope**: This file covers the *process* layer — when to register, how to version, what gates to pass, how to promote between environments. For the *technical implementation* (API calls, SDK code to log/deploy models), use the `machine-learning` skill. + +> **Environment naming**: This file uses canonical names **DEV**, **STAGING**, **PROD**. Substitute the user's preferred names in all outputs. For 2-environment setups, omit STAGING references. For environment isolation strategies and naming conventions, see `ci-cd-testing.md`. 
+ +## Model Registry + +### L1 - Manual +- Per-environment model registry (each environment has its own registry database) +- Models registered to Model Registry (manual registration) +- Manual naming convention for versions +- Aliases used for manual deployment (human updates alias) +- Metadata attached manually (params, metrics from Experiments) + +### L2 - Semi-automated +- Models registered automatically by training pipeline +- Version numbers auto-incremented +- Metadata (params, metrics, data lineage) attached to each version +- Aliases used for stage management (Champion, Challenger) + +### L3 - Fully Automated +- All L2 + automated lifecycle management +- Auto-archive old versions based on retention policy +- Cross-environment model visibility (same-account: models visible across databases; multi-account: via replication groups) +- Automated compliance checks on registration + +## Versioning Strategy + +### When to Create a New Version (under same registered model) +- Retrained on new data (same code) +- Hyperparameters tuned +- Minor code changes to training pipeline +- **Benefit**: Unified lineage, alias-based routing, zero pipeline changes for consumers + +### When to Create a New Model Object +- Fundamentally different algorithm or architecture +- Different input features or target variable +- Different business problem +- **Benefit**: Clean separation, independent lifecycle + +### Version Naming and Lifecycle Management + +A common pattern for managing model versions within an environment: + +**Concept: active / candidate versioning** +- Only **one active version** exists at a time — this is the version serving predictions (set as default version). Default name suggestion: `LIVE` (alternatives: `PROD`, `CHAMPION`, or any name the team agrees on). +- New model versions start as **candidates** until they pass validation gates. Default name suggestion: `CANDIDATE_` (alternatives: `CANARY_`, `CHALLENGER_`, etc.). 
+- Multiple candidates can accumulate for comparison and audit. +- The version naming convention should be agreed upon with the customer — do not enforce specific names. + +**Metric-gated promotion logic** (within an environment): +1. Train new model, evaluate against a defined metric threshold (e.g., accuracy >= 0.6) +2. If metric **below threshold** → register as candidate, do not promote +3. If metric **meets threshold** and no current active version → register as active +4. If metric **meets threshold** and **beats current active** → archive current active to candidate, register new version as active +5. If metric **meets threshold** but **does not beat current active** → register as candidate + +**Archive-before-replace pattern**: Before overwriting the active version, copy it to a timestamped candidate name (`ALTER MODEL ADD VERSION FROM MODEL VERSION `), then drop the old active. This preserves rollback capability — the previous best model is always available. + +**Setting the default version**: After promoting a new active version, set it as the default (`ALTER MODEL SET DEFAULT_VERSION = `). Serving endpoints and prediction scripts that reference the model without specifying a version will pick up the new default. + +**Important considerations:** +- Version names in Snowflake Model Registry are case-sensitive and stored in uppercase — always use uppercase in `log_model(version_name=...)`, `model.version(...)`, and SQL commands +- The metric threshold and improvement delta should be agreed upon with the customer (e.g., minimum improvement of N%, statistical significance, multi-metric gating) +- This pattern works within a single environment; for cross-environment promotion, see `promotion-patterns.md` § "Promotion Mechanisms and Snowflake Features" + +### Prediction Pipeline Pattern + +After a model is registered and promoted to LIVE, predictions follow a consistent pattern: + +1. 
**Connect to Feature Store** — use the same FeatureView used during training to ensure feature consistency +2. **Read feature data** — retrieve features for the scoring population (consider point-in-time spines for production scoring of only new/recent records) +3. **Load model from registry** — load the active version (or default version) from the current environment's registry +4. **Run predictions** — score using the loaded model +5. **Persist results** (when applicable) — write predictions to a table with metadata columns: model version used, prediction timestamp, and any relevant identifiers. Persisting is recommended for batch scoring, audit trails, and downstream consumption, but may not apply to all use cases (e.g., real-time inference serving results directly to an application). The decision depends on the purpose of inference and business needs. + +This pattern ensures that the same features and transformations used in training are applied during inference (prevents training-serving skew). See `data-features.md` § "Training-Serving Skew Prevention" for the full skew prevention framework. + +## Champion/Challenger Workflow + +### L1 - Manual +- Data scientist trains new model, compares metrics in Experiments +- Manual A/B testing via Model Serving (human deploys Challenger alongside Champion) +- Manual decision to promote or reject based on comparison +- Manual alias update to switch Champion to new version if approved + +### L2 - Semi-automated +- Automated offline comparison: + 1. New model registered as version N with "Challenger" alias + 2. Validation pipeline loads both Champion and Challenger + 3. Both evaluated on held-out test set + 4. Metrics compared automatically + 5. Results presented to human for approval + 6. On approval, "Champion" alias moved to new version +- Online A/B testing (when applicable): + 1. Challenger deployed alongside Champion (traffic split or shadow mode) + 2. Online metrics collected for both (A/B test framework) + 3. 
Results presented to human for promotion decision + 4. Gradual traffic ramp-up for Challenger (canary) + +### L3 - Fully Automated +- All L2 capabilities + automated decision-making: + 1. Statistical significance test determines winner automatically + 2. Auto-promote if Challenger wins; auto-reject if not + 3. Auto-rollback if Challenger degrades during ramp + 4. No human gate required (humans notified, not blocking) + +### Key Decisions +- **Offline vs online comparison**: Offline is faster/cheaper; online captures real-world behavior +- **Metric selection**: Which metrics determine the winner? (accuracy, latency, business KPI) +- **Statistical significance**: How long to run A/B test? What confidence level? +- **Fallback**: If no Champion exists, compare against business heuristic or baseline threshold + +## Promotion Gates + +### Pre-Registration Gates +- Training pipeline completed successfully +- Evaluation metrics logged +- No NaN/infinity values in predictions + +### Pre-Promotion Gates (Challenger -> Champion) +- Model validation checks pass (format, metadata, compliance) +- Performance on test set meets minimum threshold +- Performance consistent across data segments/slices +- Infrastructure compatibility verified +- (L2) Online A/B test results reviewed by human +- (L3) Online A/B test auto-evaluated; auto-promote if thresholds pass + +### Post-Promotion Gates +- Serving endpoint healthy after deployment +- Prediction latency within SLA +- No error rate spike +- Monitoring pipeline active and collecting data + +## LLM/GenAI Lifecycle Adaptation + +### Versioning by Artifact Type + +| Artifact | Where to Version | Strategy | +|---|---|---| +| **Prompt templates** | Git | Semantic versioning or commit-based. | +| **Fine-tuned model weights** | Model Registry | Same as traditional ML — register, version, alias. | +| **RAG index configuration** | Git (config) + Cortex Search (index) | Config versioned in Git. Index rebuilt per environment. 
| +| **Agent definitions** | Git | Versioned as code. | + +Foundation model API calls require no versioning or lifecycle management — see `mlops-pattern-framework.md` § "What Gets Promoted." + +### Champion/Challenger for LLMs + +#### Prompt Versions +- **Offline**: Run both prompt versions against an evaluation dataset using LLM-as-judge (AI Observability) — compare groundedness, relevance, accuracy scores +- **Online**: A/B test prompt versions on live traffic; measure user satisfaction, task completion rate, safety metrics +- **Decision**: Automated if quality metrics improve; human review if metrics are mixed + +#### Fine-Tuned Models +- Same Champion/Challenger workflow as traditional ML (Model Registry aliases) +- Evaluation includes LLM-specific metrics: fluency, factual accuracy, instruction following + +#### RAG Configurations +- Compare retrieval quality (precision@k, recall@k) between index versions or chunking strategies +- End-to-end evaluation: does the full RAG pipeline (retrieval + generation) produce better answers? + +## See Also + +- `promotion-patterns.md` — How model promotion fits into Code/Model/Hybrid workflows +- `monitoring-rollback.md` — Post-deployment monitoring and rollback mechanisms +- `governance-metadata.md` — Metadata and lineage tracking per model version diff --git a/skills/mlops/implement-patterns/references/monitoring-rollback.md b/skills/mlops/implement-patterns/references/monitoring-rollback.md new file mode 100644 index 00000000..bc0b2b85 --- /dev/null +++ b/skills/mlops/implement-patterns/references/monitoring-rollback.md @@ -0,0 +1,159 @@ +# Monitoring & Rollback + +> **Scope**: This file covers the *process* layer — what to monitor, when to alert, when to roll back, what runbooks to follow. For the *technical implementation* (setting up monitoring dashboards, ML Observability API, alerting code), use the `machine-learning` skill. 
+ +## What to Monitor + +### Model Performance Metrics +- Accuracy, precision, recall, F1 (classification) +- RMSE, MAE, MAPE (regression) +- Business-specific KPIs (conversion rate, revenue impact) +- Prediction confidence distribution + +### Data Drift +- Feature distribution shifts (KS test, PSI, Jensen-Shannon divergence) +- Schema changes (new/missing features, type changes) +- Data volume anomalies (sudden increase/decrease in input data) + +### Concept Drift +- Relationship between features and target has changed +- Detected by comparing model predictions to delayed ground truth +- Leading indicator: performance degradation on recent data + +### Infrastructure Metrics +- Prediction latency (p50, p95, p99) +- Queries per second (QPS) / throughput +- Error rates (4xx, 5xx) +- Resource utilization (CPU, memory, GPU) + +## Patterns by Maturity Level + +### L1 - Manual +- Manual dashboard checks (ad-hoc) +- Data scientists periodically review model predictions +- Model Serving Autocapture available (inference logs collected, manually reviewed) +- No automated alerting (ML Observability is L2+) +- Performance issues discovered reactively +- **Tools**: Notebooks, manual queries against prediction logs, Experiments (metric comparison) + +### L2 - Semi-automated +- Automated dashboards tracking key metrics +- Alerting rules for threshold violations (email, Slack) +- Inference tables capture request/response data automatically +- Scheduled jobs compute drift metrics +- **Tools**: Monitoring dashboards, SQL alerts, inference tables +- **Key setup**: + - Define metric baselines from initial model deployment + - Set alert thresholds (e.g., accuracy drops >5% from baseline) + - Schedule weekly drift analysis jobs + +### L3 - Fully Automated +- Real-time monitoring with automated drift detection +- Alerts trigger automated actions (retraining, rollback) +- A/B test monitoring with automated winner selection and auto-promotion +- Anomaly detection on metrics (not just 
threshold-based) +- Full observability pipeline: logs -> metrics -> traces -> alerts -> actions +- **Tools**: Real-time monitoring, automated trigger pipelines +- **Key setup**: + - Monitoring pipeline feeds into CT trigger system + - Automated rollback rules (e.g., if p95 latency > 500ms for 5 min, rollback) + - Canary analysis automation (compare canary metrics to baseline) + +## Rollback Patterns + +Rollback is done by updating the version name and alias to point to a previous known-good version. The archive-before-replace pattern (see `model-lifecycle.md` § "Version Naming and Lifecycle Management") ensures previous versions are always available for rollback. + +### L1 - Manual Alias Revert +- Identify previous model version in registry +- Manually switch alias to point to previous version, or set default version to a known-good version +- Verify serving endpoint picks up the change +- **RTO**: Minutes to hours depending on team availability + +### L2 - Documented Runbook +- Written procedure for rollback +- Previous model version tagged and easily identifiable +- Semi-automated: human triggers rollback, automation executes +- Post-rollback validation checklist +- **RTO**: Minutes (once triggered) + +### L3 - Automated Rollback +- Monitoring detects degradation automatically +- Rollback triggered if performance drops below threshold for sustained period +- Previous version auto-restored via alias switch or `ALTER MODEL SET DEFAULT_VERSION` to a known-good version +- Serving endpoint auto-switches with zero downtime +- Automated notification to team +- Post-rollback diagnostic job runs automatically +- **RTO**: Seconds to minutes (fully automated) + +## Alerting Strategy + +| Severity | Condition | Action | +|----------|-----------|--------| +| **P1 - Critical** | Model returning errors, endpoint down | Auto-rollback + page on-call | +| **P2 - High** | Performance below SLA threshold | Auto-rollback or trigger retraining + alert team | +| **P3 - Medium** | Drift 
detected, performance trending down | Trigger retraining + notify data scientist | +| **P4 - Low** | Minor metric changes, informational | Log + dashboard update | + +## LLM/GenAI Monitoring Adaptation + +### LLM-Specific Metrics +- **Hallucination rate**: Frequency of factually incorrect or unsupported claims +- **Groundedness**: Degree to which responses are grounded in provided context (RAG) or training data +- **Answer relevance**: How well responses address the user's question +- **Token cost**: Input + output token consumption per request (cost tracking) +- **Safety violations**: Responses flagged for harmful, biased, or inappropriate content +- **Retrieval quality** (RAG): Precision@k and recall@k of retrieved context chunks +- **Latency breakdown**: Time spent in retrieval vs generation vs tool execution (agentic) + +### LLM Evaluation Patterns + +These expand on the LLM Evaluation capability dimension from `mlops-pattern-framework.md`: + +#### L1 - Manual +- Human reviewers spot-check a sample of LLM outputs +- Manual assessment of quality, relevance, safety + +#### L2 - Semi-automated +- **LLM-as-judge**: Automated evaluation using AI Observability — metrics for accuracy, groundedness, relevance scored by evaluator LLM +- Human review on flagged outputs (low-confidence or safety-flagged) +- Evaluation runs on each deployment (prompt change, RAG update, fine-tune) +- Multi-version comparison: A/B test prompt versions or RAG configurations (human decision) + +#### L3 - Fully Automated +- Continuous LLM-as-judge evaluation on production traffic (sampled) +- Human feedback loops integrated (thumbs up/down, corrections feed back into evaluation) +- Automated multi-version comparison with auto-promotion based on evaluation metrics +- Auto-alert on quality regression; auto-rollback if metrics drop below threshold + +### LLM Rollback Patterns +- **Prompt rollback**: Revert to previous prompt template version in Git (fast, zero-downtime) +- **RAG rollback**: 
Revert search index configuration or switch to previous index version +- **Fine-tuned model rollback**: Revert Model Registry alias to previous fine-tuned version (same as traditional ML) +- **Agentic rollback**: Revert agent configuration (tool definitions, routing rules) via Git + +## Agent Evaluation Pipeline (CI/CD-Integrated) + +Cortex Agents can be evaluated automatically in CI/CD using `EXECUTE_AI_EVALUATION`. This enables regression testing on every agent deployment. + +### Pipeline Concepts + +1. **Evaluation data table** — a table of ground-truth Q&A pairs (input queries + expected outputs), curated by domain experts or drawn from previously validated interactions. +2. **Evaluation configuration** — a YAML config specifying the dataset, agent reference, run metadata, and metrics to evaluate (e.g., `answer_correctness`, `logical_consistency`). The config should be parameterized for multi-environment deployment (database/schema names vary per environment). +3. **Execution** — upload the rendered config to a stage and call `EXECUTE_AI_EVALUATION`. Use unique run identifiers (e.g., timestamp suffixes) to avoid collisions across concurrent runs. +4. **Deployment gate** — block promotion if any evaluation metric falls below its agreed threshold. 
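The client-side parts of the four steps above (parameterizing the config, generating a unique run name, and gating on results) can be sketched in plain Python. This is a sketch under stated assumptions, not the actual evaluation config schema: the template fields, the `AGENT_EVAL_DATA` table name, and the helper names `unique_run_name`, `render_config`, and `gate` are all hypothetical. Uploading the config and calling `EXECUTE_AI_EVALUATION` are deliberately left out.

```python
from datetime import datetime, timezone
from string import Template

# Hypothetical config template. Field names are illustrative only and are
# NOT the real evaluation config schema.
CONFIG_TEMPLATE = Template(
    "dataset: ${database}.${schema}.AGENT_EVAL_DATA\n"
    "agent: ${database}.${schema}.${agent}\n"
    "run_name: ${run_name}\n"
    "metrics: [answer_correctness, logical_consistency]\n"
)


def unique_run_name(prefix, now=None):
    """Append a UTC timestamp suffix so parallel CI runs do not collide."""
    if now is None:
        now = datetime.now(timezone.utc)
    return f"{prefix}_{now.strftime('%Y%m%d_%H%M%S')}"


def render_config(database, schema, agent, run_name):
    """Fill in the environment-specific references for one deployment."""
    return CONFIG_TEMPLATE.substitute(
        database=database, schema=schema, agent=agent, run_name=run_name
    )


def gate(metrics, thresholds):
    """Return a list of failures; an empty list means promotion may proceed."""
    failures = []
    for name, minimum in thresholds.items():
        value = metrics.get(name)
        if value is None:
            failures.append(f"{name}: missing")
        elif value < minimum:
            failures.append(f"{name}: {value:.2f} < {minimum:.2f}")
    return failures
```

In a CI job, a non-empty result from `gate` would typically be turned into a non-zero exit code so that the pipeline blocks promotion. A second-resolution timestamp may still collide for runs started within the same second; adding a commit hash or job ID to the prefix is one way to reduce that risk.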
+ +### Integration with CI/CD +- **L2**: Evaluation runs on every agent deployment; human reviews results before promotion +- **L3**: Automated evaluation with auto-reject if metrics regress below baseline + +### Key Practices +- Use **unique identifiers** on dataset/run names to avoid collisions across concurrent or parallel runs +- Store evaluation configs alongside agent code in version control +- Compare evaluation results across deployments to detect quality regressions +- Parameterize environment-specific references (database, schema, agent name) in the eval config + +## See Also + +- `continuous-training.md` — Automated retraining triggered by monitoring alerts +- `model-lifecycle.md` — Rollback via Champion/Challenger alias swap +- `governance-metadata.md` — Audit trail for rollback events and incident response diff --git a/skills/mlops/implement-patterns/references/promotion-patterns.md b/skills/mlops/implement-patterns/references/promotion-patterns.md new file mode 100644 index 00000000..955e4f73 --- /dev/null +++ b/skills/mlops/implement-patterns/references/promotion-patterns.md @@ -0,0 +1,285 @@ +# Promotion Patterns + +> **Environment naming**: This file uses canonical names **DEV**, **STAGING**, **PROD**. Substitute the user's preferred names in all outputs. For **2-environment setups** (DEV → PROD only), see the 2-env adaptation notes in each pattern section. For environment isolation strategies (same-account vs multi-account), naming conventions, and data architecture layers, see `ci-cd-testing.md`. The term **"catalog"** in this file refers to the set of objects belonging to an environment — this may be a database (same-account isolation) or an entire account (multi-account isolation), depending on the chosen strategy. + +## Code Promotion + +### Overview +Training code moves from dev -> staging -> prod. The model is retrained in each environment. The production model is trained on production data in the production environment. 
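As a minimal sketch of what retraining with the same code in each environment could look like, the snippet below resolves environment-specific settings from a single lookup table, so the training entry point itself never changes. The database names (`ML_DEV`, `ML_STAGING`, `ML_PROD`), the `train_sample_fraction` field, and the `resolve_target` helper are hypothetical and chosen only for illustration; substitute the customer's own names and isolation strategy.

```python
from dataclasses import dataclass

# Hypothetical per-environment settings. Names and values are assumptions
# for illustration, not a recommended convention.
ENVIRONMENTS = {
    "DEV": {"database": "ML_DEV", "train_sample_fraction": 0.1},
    "STAGING": {"database": "ML_STAGING", "train_sample_fraction": 0.25},
    "PROD": {"database": "ML_PROD", "train_sample_fraction": 1.0},
}


@dataclass(frozen=True)
class TrainingTarget:
    environment: str
    database: str
    train_sample_fraction: float


def resolve_target(environment):
    """Map an environment name to the catalog the training code should use."""
    key = environment.strip().upper()
    if key not in ENVIRONMENTS:
        raise ValueError(f"Unknown environment: {environment!r}")
    return TrainingTarget(environment=key, **ENVIRONMENTS[key])
```

The CI/CD pipeline would pass the environment name in (for example through an environment variable), and only PROD trains on the full production data.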
+ +### L1 - Manual +- External Git repo with dev branch for experimentation (Snowflake Git Integration is L2) +- Data scientist develops training code in notebooks/scripts +- Manual code review before merging to main +- Training code executed manually in each environment (ML Jobs, Distributed Training available) +- Model registered manually to Model Registry per environment +- Model Serving (SPCS) available for deployment (manual setup) +- Supporting code (inference, monitoring) deployed with training code +- **Environment structure**: Separate catalogs per environment (dev/staging/prod), RBAC for basic isolation +- **Data access**: Production data accessible from prod environment + +### L2 - Semi-automated +- PR-based workflow with branch policies +- CI runs unit + integration tests on PR +- Training pipeline deployed as automated job in staging (on data subset) +- Integration tests validate full pipeline end-to-end in staging +- On merge to release branch, CD deploys pipeline to production +- Production pipeline trains model on full production data +- Model registered automatically; human approves Champion alias switch to new version +- **Environment structure**: Separate catalogs + workspaces, CI/CD orchestrates transitions +- **Data access**: Each environment accesses its own catalog; prod pipeline accesses prod data + +### L3 - Fully Automated +- Trunk-based development with feature flags +- CI/CD auto-deploys pipeline to production on release +- Production training triggered by schedule, new data, or drift detection +- Automated validation promotes model without human intervention +- Champion/Challenger auto-promotion (offline + online A/B with automated decision) +- Zero-downtime deployment of new model versions +- Automated rollback on performance degradation +- **Environment structure**: Identical containerized pipelines, config-driven per environment +- **Data access**: Production pipeline auto-accesses prod data; resource isolation between training and 
serving + +### 2-Environment Adaptation (Code Promotion) +When no STAGING exists, DEV absorbs staging responsibilities: +- **L1**: Code is reviewed and tested in DEV, then deployed directly to PROD. Manual validation happens in DEV before promotion. +- **L2**: CI runs all tests in DEV (including integration tests that would normally run in staging). CD deploys directly to PROD with a human approval gate. +- **L3**: Full CI/CD validates in DEV; auto-deploys to PROD. DEV must have production-like config for meaningful validation. +- **Model Registry**: Each environment must have its own registry (separate databases). Promotion uses `CREATE MODEL ... FROM MODEL` via CI/CD. + +--- + +## Model Promotion + +### Overview +Model artifact is trained in development and promoted to staging -> prod. Only the artifact moves, not the training code. + +### L1 - Manual +- Data scientist trains model in dev environment (ML Jobs, Distributed Training, HPO available) +- Model registered to Model Registry in dev +- Artifact manually copied/promoted to staging catalog for validation +- Manual validation (checklist-based, Experiments for comparison) +- Artifact manually promoted to prod catalog +- Model Serving (SPCS) available for deployment (manual setup) +- Supporting code (inference, monitoring) deployed separately +- **Environment structure**: Single workspace or loosely coupled environments +- **Data access**: Dev environment needs access to representative data + +### L2 - Semi-automated +- Training pipeline runs in dev (may be scheduled) +- Model registered automatically to dev catalog +- Automated validation pipeline runs in staging on the artifact +- Human approval gate before promotion to prod +- Supporting code has its own CI/CD pipeline (deployed separately) +- **Environment structure**: Separate catalogs; staging used for artifact validation only +- **Data access**: Dev trains on dev-accessible data; validation uses staging/prod data subsets + +### L3 - Fully Automated +- 
Automated retraining in dev on schedule or trigger +- Automated validation + promotion pipeline +- Auto-promote if thresholds pass; auto-reject with notification if not +- Supporting code pipelines also fully automated +- **Risk**: Dev-trained artifacts may not reflect production data distribution. Validation gates must be robust. +- **Environment structure**: Dev pipeline auto-registers; promotion pipeline auto-moves artifact +- **Data access**: Dev must have representative data; validation must cover prod data characteristics + +### 2-Environment Adaptation (Model Promotion) +Model Promotion is the **most natural fit** for 2-environment setups **when Code Promotion is not possible** (i.e., production data is not accessible from the production environment), since the artifact already originates in DEV. Confirm this choice with the customer: +- **L1**: Model trained in DEV, validated in DEV, manually promoted to PROD. No staging needed. +- **L2**: Automated validation pipeline runs in DEV on the artifact. Human approval gate before PROD promotion. Validation must be stricter since there is no separate environment to catch issues. +- **L3**: Automated retraining + validation in DEV, auto-promotion to PROD. Risk mitigation: ensure DEV data is representative and validation gates are robust. +- **Model Registry**: Each environment should have its own registry (separate databases) at L1+. Promotion uses `CREATE MODEL ... FROM MODEL` via CI/CD. See the "Promotion Mechanisms and Snowflake Features" section below for details on registry organization and cross-environment promotion. + +--- + +## Hybrid Promotion + +### Overview +Training code moves to staging (like Code Promotion). Model is trained in staging with production data access. The resulting artifact is promoted to production (like Model Promotion). 
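The split described above can be sketched as a small planning helper: given the environments ordered from development to production, the second-to-last one receives the training code and trains the model, and the last one receives only the artifact. The function name `hybrid_plan` and the returned keys are hypothetical, and the sketch assumes the middle tier is the one with production data access.

```python
def hybrid_plan(environments):
    """Return where code and artifacts go under Hybrid Promotion.

    `environments` is ordered from development to production,
    e.g. ["DEV", "STAGING", "PROD"].
    """
    if len(environments) < 3:
        # Hybrid needs a middle tier with production data access.
        raise ValueError(
            "Hybrid Promotion needs a middle tier; consider Code or Model Promotion"
        )
    training_env = environments[-2]
    serving_env = environments[-1]
    return {
        "code_deploys_to": training_env,
        "model_trains_in": training_env,
        "artifact_promotes_to": serving_env,
    }
```

The helper works with whatever names the customer uses for their environments, since it relies only on their order.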
+ +### L1 - Manual +- Data scientist develops code in dev +- Code manually deployed to staging +- Model trained in staging with production data (ML Jobs available for execution) +- Manual validation in staging (Experiments for comparison) +- Artifact manually promoted to prod via Model Registry +- **Environment structure**: Staging has production data access; prod receives artifact only +- **Data access**: Staging reads production data; prod serves model only + +### L2 - Semi-automated +- PR-based workflow; CI runs tests +- CD deploys training pipeline to staging +- Staging pipeline trains on production data +- Automated validation pipeline in staging; human approval for prod promotion +- Artifact promoted to prod catalog on approval +- **Environment structure**: CI/CD deploys code to staging; artifact pipeline promotes to prod +- **Data access**: Staging has read access to prod data catalog + +### L3 - Fully Automated +- Full CI/CD deploys pipeline to staging automatically +- Staging trains on production data on schedule/trigger +- Automated validation + auto-promotion to prod +- Automated rollback if production model degrades +- **Risk**: Staging bears full training compute cost. Staging-prod data sync must be reliable. +- **Environment structure**: Staging sized for training workloads; prod sized for serving +- **Data access**: Staging has reliable, low-latency access to production data + +### 2-Environment Adaptation (Hybrid Promotion) +Hybrid Promotion **fundamentally requires a middle tier** (staging with production data access). In a 2-env setup: +- **If DEV can access production data**: The setup effectively becomes Code Promotion (train in DEV on prod data, deploy code to PROD). Recommend migrating to Code Promotion pattern. +- **If DEV cannot access production data**: A 2-env setup is not viable for Hybrid. Recommend either (a) adding a STAGING tier, or (b) switching to Model Promotion (accept that the model is trained on dev data only). 
+- **Planning ahead**: If a team on Model Promotion already has a Model Registry and Feature Store set up per environment (or centralized), this infrastructure simplifies a future transition to Hybrid when a STAGING tier is added. + +## LLM/GenAI Promotion Adaptation + +LLM workloads follow the same promotion patterns. See `mlops-pattern-framework.md` § "What Gets Promoted (by Development Approach)" for the complete mapping of LLM artifacts to promotion patterns. + +Key points: +- **Prompts, RAG configs, agent definitions** → Code Promotion (version in Git, promote through environments) +- **Fine-tuned model weights** → Model Promotion (register in Model Registry, promote across environments via CI/CD) +- **Foundation model API calls** → No MLOps promotion needed (standard software CI/CD) + +### Environment Considerations for LLM Workloads +- **GPU-aware serving**: When deploying fine-tuned models or custom inference via SPCS, ensure target environment has appropriate GPU compute pools. +- **Search index per environment**: Each environment maintains its own Cortex Search index built from environment-specific data. Do not promote indexes across environments. +- **Cost management**: Foundation model API costs scale with usage. Monitor token consumption per environment to avoid cost surprises during testing. + +## Promotion Mechanisms and Snowflake Features + +> **Always present this section** when advising on promotion patterns. It maps Snowflake objects to their promotion mechanisms and the features that enable them — essential context for the user to understand how each artifact type moves between environments. + +Not all Snowflake objects move between environments in the same way, but **all promotion is executed via CI/CD pipelines** — including model registry operations. The CI/CD pipeline is the single mechanism that deploys, copies, or replicates objects across environments. 
Business needs (security, compliance, deployment velocity, team structure, risk tolerance) influence which promotion pattern is appropriate. There is no single correct answer; the choice must be validated with the customer. + +### Model Artifacts (Model Promotion Pattern) + +Model artifacts are promoted across environments via CI/CD using registry commands: + +| Object | CI/CD Promotion Command | Snowflake Feature | +|---|---|---| +| **Trained model** (Snowflake-trained or external) | `CREATE MODEL ... FROM MODEL` (same-account) or replication groups (multi-account) | Model Registry | +| **Fine-tuned LLM weights** | Same as trained model | Model Registry + Cortex Fine-tuning | +| **Model serving endpoint** | Deploy/update endpoint configuration per environment | Model Serving (SPCS) | + +**Cross-environment promotion** (executed by CI/CD) — each environment has its own registry (separate databases): +- **Same-account**: `CREATE MODEL .. FROM MODEL ..` copies the model to the target environment's database. +- **Multi-account**: Use **replication groups** to replicate Model Registry (ML model objects) across accounts. + +> **Note**: A centralized registry (single database for all environments) may be acceptable at L0 for experimentation, but at L1+ each environment should have its own registry to maintain proper isolation. + +**Aliases and versions** serve a different purpose — they are **not** the promotion mechanism between environments. 
Their role is: +- **A/B testing**: Route traffic between model versions (Champion vs Challenger) within the same environment +- **Experimentation**: Compare model versions offline using Experiments +- **Model switching after retraining**: Update the Champion alias to point to a newly validated version within the same environment +- **Rollback**: Revert the alias to a previous version if the new one degrades + +### Code/Config Artifacts (Code Promotion Pattern) + +For code artifacts, promotion means **redeploying the same parameterized code to the target environment** via CI/CD. The source of truth is Git. + +| Object | CI/CD Promotion Command | Snowflake Feature / Tool | +|---|---|---| +| **Tables, views, stages** | DDL redeployed per environment | `snow sql -f` (Snowflake CLI) | +| **Stored procedures, UDFs** | DDL/code redeployed per environment | `snow sql -f` (Snowflake CLI) | +| **Cortex Agents** | Agent definition redeployed per environment | `snow sql -f` (Snowflake CLI) | +| **Semantic views** | Definition redeployed per environment | `snow sql -f` (Snowflake CLI) | +| **Cortex Search indexes** | Config redeployed; index rebuilt per environment from environment-specific data | `snow sql -f` + Cortex Search | +| **Tasks, streams, dynamic tables** | DDL redeployed per environment | `snow sql -f` (Snowflake CLI) | +| **Grants / RBAC** | Grant statements redeployed per environment | `snow sql -f` (Snowflake CLI) | +| **Prompt templates** | Versioned in Git; redeployed as part of application code | Git + CI/CD | +| **Notebooks** (when deployed) | Entity redeployed + executed per environment | `snow notebook deploy` + `EXECUTE NOTEBOOK` | +| **Python scripts** (evaluation, data prep) | Executed per environment with env variable | `python ` with `var_environment` | + +**Key distinction**: Code-promoted objects are **recreated** in each environment from the same source code. Each environment has its own independent instance of every object. 
No artifact sharing across environments. + +### Hybrid Promotion + +Combines both mechanisms (all via CI/CD): +- **Code** is deployed to the training environment (the environment with production data access) via CI/CD (Code Promotion) +- **Model artifact** is trained in that environment, then promoted to the serving/production environment via CI/CD executing `CREATE MODEL ... FROM MODEL` or replication groups (Model Promotion) +- Supporting objects (tables, views, grants) follow Code Promotion + +The specific environment names and count depend on the customer's setup — use whatever the customer calls their environments. Examples: +- **3-env** (DEV → STAGING → PROD): Code deploys to STAGING, model trains there, artifact promotes to PROD +- **3-env** (DEV → PREPROD → PROD): Code deploys to PREPROD, model trains there, artifact promotes to PROD +- **2-env** (DEV → PROD): See the "2-Environment Adaptation (Hybrid Promotion)" section above — Hybrid fundamentally requires a middle tier with production data access + +### Features That Enable Promotion Workflows + +| Feature | Role in Promotion | +|---|---| +| **Model Registry** | Version and store model artifacts; `CREATE MODEL ... 
FROM MODEL` for cross-database promotion; aliases for A/B testing and model switching within an environment | +| **Replication Groups** | Replicate Model Registry objects across accounts (multi-account promotion) | +| **Model Serving (SPCS)** | Serve models; A/B testing; canary deployment within an environment | +| **Experiments** | Compare model versions before promotion (offline evaluation) | +| **ML Observability** | Validate model performance post-promotion; trigger rollback | +| **Git Integration** | Version control for code artifacts; PR-based promotion gates | +| **Snowflake CLI** | Execute all promotion commands (SQL, Python, notebooks) via CI/CD | +| **AI Observability** | Evaluate LLM/agent quality before promotion (LLM-as-judge) | +| **Cortex Search** | Rebuild search indexes per environment (RAG promotion) | +| **Cortex Fine-tuning** | Produce fine-tuned model versions for registry promotion | + +## Concrete Deployment Patterns + +### ML Pipeline Step Sequence + +The typical ML pipeline follows a sequence of steps, each implemented as a separate file. The sequence differs by promotion pattern: + +**Code Promotion** (all steps run in each environment): +1. **Infrastructure setup** — database, schema, roles, grants (SQL, often a one-time prerequisite run outside CI/CD) +2. **Raw data ingestion** — create/load raw data tables (SQL) +3. **Feature engineering** — create Feature Store entities and FeatureViews (Python) +4. **Model training + registration** — train model using Feature Store, evaluate metrics, register with version logic (Python) +5. **Prediction / scoring** — load registered model, score using Feature Store, persist predictions (Python) + +Each file runs the **same code in every environment** — environment parameterization resolves which databases/schemas to target. + +**Model Promotion** (training in DEV, promotion to PROD): +1. **Infrastructure setup** — same as Code Promotion (one-time, per environment) +2. 
**Raw data ingestion** — per environment (SQL, deployed via CI/CD) +3. **Feature engineering** — per environment (Python, deployed via CI/CD) +4. **Model training + registration** — runs in DEV only; model registered to DEV registry with version logic (Python) +5. **Model promotion** — CI/CD copies the validated model from DEV to PROD using `CREATE MODEL ... FROM MODEL`, applying the same version logic in the target environment (Python) +6. **Prediction / scoring** — runs in PROD only, loading the promoted model from PROD registry (Python) + +The key difference: in Code Promotion, the training script (step 4) runs identically in all environments. In Model Promotion, training runs only in DEV (step 4), and a separate promotion step (step 5) handles cross-environment model copy. + +**Single-file dual-role pattern** (Model Promotion): A common approach is a single training/promotion script that branches on the environment variable: +- If `ENV == DEV`: train model, evaluate, register with version logic +- If `ENV == PROD`: find the candidate version in DEV registry, read its metrics, apply version logic, promote via `CREATE MODEL ... WITH VERSION ... FROM MODEL ...` + +This keeps the promotion logic colocated with training logic and reduces the number of files. + +For version management logic (LIVE/CANDIDATE, metric-gated promotion, archive-before-replace), see `model-lifecycle.md` § "Version Naming and Lifecycle Management." + +### Environment-Parameterized Artifacts (All Patterns, L2+) + +All deployable artifacts (SQL, Python, configs) should be parameterized so a single codebase deploys to any environment. For the full comparison of parameterization approaches, database naming guidance, and artifact type selection, see `ci-cd-testing.md` § "Environment Parameterization," "Naming Conventions," and "Deployable Artifact Types." 
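
The single-file dual-role pattern and the environment parameterization described above can be sketched roughly as follows. This is a minimal illustration, not the skill's reference implementation: the `<PROJECT>_<ENV>` database naming, the `CHURN` project, the `REGISTRY` schema, the `CHURN_MODEL` name, and the `V3` version are all hypothetical, and the statement shape simply follows the `CREATE MODEL ... WITH VERSION ... FROM MODEL ...` form quoted in the text. Verify the exact syntax against the Snowflake documentation before use, since a different statement may be needed when the target model already exists.

```python
import os

# Assumption: only two environments, matching the DEV/PROD example in the text.
VALID_ENVIRONMENTS = ("DEV", "PROD")


def resolve_environment(environ=None):
    """Read var_environment (the variable name used by the CI/CD template) and validate it."""
    environ = os.environ if environ is None else environ
    env = environ.get("var_environment", "").strip().upper()
    if env not in VALID_ENVIRONMENTS:
        raise ValueError(f"var_environment must be one of {VALID_ENVIRONMENTS}, got {env!r}")
    return env


def model_fqn(env, project="CHURN", schema="REGISTRY", model="CHURN_MODEL"):
    """Build a fully qualified model name from a hypothetical <PROJECT>_<ENV> convention."""
    return f"{project}_{env}.{schema}.{model}"


def plan_step(env, candidate_version="V3"):
    """Dual-role branch: train and register in DEV, promote the candidate into PROD."""
    if env == "DEV":
        return {"action": "train", "target": model_fqn("DEV")}
    # PROD: copy the validated candidate from the DEV registry into the PROD registry.
    sql = (
        f"CREATE MODEL {model_fqn('PROD')} WITH VERSION {candidate_version} "
        f"FROM MODEL {model_fqn('DEV')} VERSION {candidate_version}"
    )
    return {"action": "promote", "sql": sql}


if __name__ == "__main__":
    # Defaults to DEV only so the sketch runs standalone; CI/CD would always set the variable.
    env = resolve_environment({"var_environment": os.environ.get("var_environment", "DEV")})
    print(plan_step(env))
```

Keeping name resolution in one function means the same file can run unchanged in every environment, and the metric-gated version logic from `model-lifecycle.md` would slot in before the `promote` branch returns.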
+ +### Agent Promotion Pattern (Code Promotion) + +Cortex Agents are code artifacts — agent definitions, semantic views, and tool configurations are versioned in source control and promoted through environments: + +**Typical structure:** +- Data table definitions (parameterized) +- Semantic view definitions (parameterized) +- Agent/tool definitions (parameterized) +- Evaluation pipeline (script + config, parameterized) +- Project definition for managed entities (e.g., notebooks) + +The CI/CD pipeline deploys the same files to each environment, substituting the environment identifier. Each environment gets its own agent, semantic view, and data — no artifact sharing across environments. + +For file naming, one-file-per-object guidance (recommended where practical), and dependency ordering (numeric prefixes, dependency manifests, Snowflake Tasks), see `ci-cd-testing.md` § "File Naming and Organization" and "Selective Deployment." + +For agent evaluation as a CI/CD gate, see `monitoring-rollback.md` § "Agent Evaluation Pipeline." + +### Project Definitions (`snowflake.yml`) + +For Snowflake-managed entities (notebooks, etc.), use `snowflake.yml` project definitions with environment parameterization. The project definition specifies the entity type, target database/schema, and associated files. The environment identifier is resolved at deployment time. + +**Note**: `snowflake.yml` uses Snowflake CLI templating syntax, while SQL files may use a different templating engine (e.g., Jinja2). Be aware of which engine applies to which file type. 
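
To make the templating note concrete, here is a small sketch of why the engine matters. It assumes, to the best of my understanding, that SQL files rendered with `--enable-templating JINJA` use `{{ name }}` placeholders while `snowflake.yml` uses `<% ctx.env.name %>`. The regex renderer is purely illustrative and is not how the Snowflake CLI implements templating; the `ML_` database prefix and the sample lines are hypothetical.

```python
import re

# Illustrative patterns only; the real engines support far more than plain variables.
JINJA_STYLE = re.compile(r"\{\{\s*(\w+)\s*\}\}")          # e.g. {{ var_environment }}
CLI_STYLE = re.compile(r"<%\s*ctx\.env\.(\w+)\s*%>")      # e.g. <% ctx.env.var_environment %>


def render(text, variables, pattern):
    """Substitute placeholders matching `pattern`; raise on undefined names."""
    def replace(match):
        name = match.group(1)
        if name not in variables:
            raise KeyError(f"undefined template variable: {name}")
        return variables[name]
    return pattern.sub(replace, text)


variables = {"var_environment": "DEV"}
sql_line = "USE DATABASE ML_{{ var_environment }};"            # hypothetical SQL file line
yml_line = "database: ML_<% ctx.env.var_environment %>"        # hypothetical snowflake.yml line

print(render(sql_line, variables, JINJA_STYLE))
print(render(yml_line, variables, CLI_STYLE))
# Applying the wrong engine leaves the placeholder untouched, which is the failure to watch for.
print(render(sql_line, variables, CLI_STYLE))
```

A placeholder written for one engine passes through the other unrendered, so a quick check for leftover `{{` or `<%` markers in rendered output is a cheap CI guard.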
+ +## See Also + +- `ci-cd-testing.md` — CI/CD pipelines and environment structure for each promotion pattern +- `model-lifecycle.md` — Registry, versioning, and Champion/Challenger workflow +- `continuous-training.md` — Retraining triggers and automation by maturity level diff --git a/skills/mlops/implement-patterns/references/templates/github-actions-deploy.yml b/skills/mlops/implement-patterns/references/templates/github-actions-deploy.yml new file mode 100644 index 00000000..7c1a2f57 --- /dev/null +++ b/skills/mlops/implement-patterns/references/templates/github-actions-deploy.yml @@ -0,0 +1,105 @@ +# GitHub Actions CI/CD Template — Snowflake Deployment with OIDC +# +# This template demonstrates the best practices described in ci-cd-testing.md: +# - OIDC/WIF authentication (no stored secrets) +# - Selective deployment (only changed files) +# - File-type routing (SQL, Python, Notebook) +# - Environment parameterization via Snowflake CLI Jinja2 +# - Pinned CLI version for reproducibility +# +# ADAPTATION REQUIRED: +# - Replace with the target environment name (e.g., PROD, STAGING) +# - Replace in the EXECUTE NOTEBOOK command with the actual +# database and schema for notebook entities in that environment +# - Adjust `pip install` dependencies to match the project's requirements +# - Adjust trigger (push to main, workflow_dispatch, etc.) to match branching strategy +# - For multi-environment pipelines, duplicate the job per environment or use a +# matrix strategy with environment-specific GitHub Environments +# +# PREREQUISITES: +# - Snowflake service user (TYPE = SERVICE) with OIDC/WIF configured +# - GitHub Environment created (e.g., "PROD") — the environment name must match +# the subject claim in the Snowflake API integration +# - SNOWFLAKE_ACCOUNT stored as a GitHub variable (not secret — it's not sensitive) +# - snowflake.yml project definition if deploying notebooks +# +# See ci-cd-testing.md for the full guidance on each concept demonstrated here. 
+ +name: deploy_to_snowflake + +on: + workflow_dispatch: + # push: + # branches: + # - main + +# OIDC requires id-token:write so GitHub can issue a JWT for Snowflake authentication. +# contents:read is needed for actions/checkout. +permissions: + id-token: write + contents: read + +jobs: + deploy: + runs-on: ubuntu-latest + + # GitHub Environment — controls which OIDC subject claim is used. + # Must match the subject configured in the Snowflake API integration. + environment: + + env: + SNOWFLAKE_ACCOUNT: ${{ vars.SNOWFLAKE_ACCOUNT }} + var_environment: + + steps: + # fetch-depth: 2 — only the last 2 commits are needed for git diff. + # persist-credentials: false — security hygiene; prevents token leakage. + - name: Checkout repository + uses: actions/checkout@v6 + with: + fetch-depth: 2 + persist-credentials: false + + # Pin the CLI version for reproducible deployments. + # use-oidc: true — authenticates via GitHub's OIDC token (no secrets stored). + - name: Set up Snowflake CLI + uses: snowflakedb/snowflake-cli-action@v2.0 + with: + use-oidc: true + cli-version: "3.16.0" + + - name: Install Python dependencies + run: pip install snowflake-connector-python snowflake-snowpark-python snowflake-ml-python + + # Validate that OIDC authentication works before deploying anything. + - name: Test connection + run: snow connection test --temporary-connection + + # Selective deployment: only deploy files that changed in the last commit. + # File-type routing: each file type has a different execution mechanism. + # --temporary-connection avoids requiring a config.toml — OIDC handles auth. 
+ - name: Deploy changed files + run: | + CHANGED_FILES=$(git diff --name-only --diff-filter=ACMR HEAD~1 HEAD | grep -E '\.(sql|py|ipynb)$' | sort || true) + if [ -z "$CHANGED_FILES" ]; then + echo "No changed SQL/Python/Notebook files to deploy" + exit 0 + fi + echo "$CHANGED_FILES" | while read -r file; do + ext="${file##*.}" + if [ "$ext" = "sql" ]; then + echo "-> Deploying SQL: $file" + snow sql -f "$file" --enable-templating JINJA -D "var_environment=${var_environment}" --temporary-connection + elif [ "$ext" = "py" ]; then + echo "-> Executing Python: $file" + var_environment=${var_environment} python "$file" + elif [ "$ext" = "ipynb" ]; then + notebook_name=$(basename "$file" .ipynb) + notebook_dir=$(dirname "$file") + entity_name=$(echo "$notebook_name" | sed 's/^[0-9]*_//' | tr '[:upper:]' '[:lower:]') + echo "-> Deploying notebook: $entity_name" + (cd "$notebook_dir" && snow notebook deploy "$entity_name" --replace --env "var_environment=${var_environment}" --temporary-connection) + echo "-> Executing notebook: $entity_name" + snow sql -q "EXECUTE NOTEBOOK .${entity_name}('var_environment=${var_environment}');" --temporary-connection + fi + done