diff --git a/skills/agentic-migration-workshop/LICENSE b/skills/agentic-migration-workshop/LICENSE new file mode 100644 index 00000000..3b8f57ee --- /dev/null +++ b/skills/agentic-migration-workshop/LICENSE @@ -0,0 +1,22 @@ +Snowflake Skills License + +© 2026 Snowflake Inc. All rights reserved. + +LICENSE: Use of these materials (including all code, prompts, assets, files, and other components of these skills (collectively, “Skills”)) is governed by your agreement with Snowflake for the Service. If no separate agreement exists, use is governed by Snowflake’s Terms of Service (available at: https://www.snowflake.com/en/legal/terms-of-service/). + +Your applicable agreement is referred to as the "Agreement." "Service" is as defined in the Agreement. + +ADDITIONAL RESTRICTIONS: Notwithstanding anything in the Agreement to the contrary, you may not: + +* Extract from the Service or retain copies of the Skills outside use with the Service; +* Reproduce or copy the Skills , except for temporary copies created automatically during authorized use of the Service; +* Create derivative works based on the Skills; +* Distribute, sublicense, or transfer the Skills to any third party; +* Make, offer to sell, sell, or import any inventions embodied in the Skills; nor, +* Reverse engineer, decompile, or disassemble the Skills. + +The receipt, viewing, or possession of the Skills does not convey or imply any license or right beyond those expressly granted above. + +Snowflake retains all rights, title, and interest in the Skills, including all copyrights, trademarks, patents, and all other applicable intellectual property rights. + +THE SKILLS ARE PROVIDED “AS IS,” WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SKILLS OR THE USE OR OTHER DEALINGS IN THE SKILLS. diff --git a/skills/agentic-migration-workshop/SETUP.md b/skills/agentic-migration-workshop/SETUP.md new file mode 100644 index 00000000..0d11b08f --- /dev/null +++ b/skills/agentic-migration-workshop/SETUP.md @@ -0,0 +1,85 @@ +# Migration Workshop Skill — Setup Guide + +## Prerequisites + +- A Snowflake account (free trial works: https://signup.snowflake.com) +- macOS (Apple Silicon or Intel), Linux (x64/arm64), or Windows (WSL or native preview) + +## Step 1: Install Cortex Code CLI + +**macOS / Linux / WSL:** +```bash +curl -LsS https://ai.snowflake.com/static/cc-scripts/install.sh | sh +``` + +**Windows (PowerShell):** +```powershell +irm https://ai.snowflake.com/static/cc-scripts/install.ps1 | iex +``` + +No Snowflake account yet? Sign up for a free Cortex Code trial at https://signup.snowflake.com/cortex-code — it includes Cortex Code CLI usage for 30 days. + +## Step 2: Connect to Snowflake + +Run `cortex` in your terminal. A setup wizard walks you through connecting to your Snowflake account. You can use browser-based SSO, key-pair auth, or username/password. + +If your account requires cross-region inference for the AI models, an ACCOUNTADMIN must run: +```sql +ALTER ACCOUNT SET CORTEX_ENABLED_CROSS_REGION = 'AWS_US'; +``` + +## Step 3: Install the Skill + +Copy the entire `agentic-migration-workshop/` folder to the Cortex Code global skills directory: + +```bash +mkdir -p ~/.snowflake/cortex/skills +cp -R agentic-migration-workshop ~/.snowflake/cortex/skills/ +``` + +That's it. Cortex Code auto-discovers skills in `~/.snowflake/cortex/skills/`. + +## Step 4: Verify + +Inside a Cortex Code session, type: + +``` +/skill +``` + +You should see `agentic-migration-workshop` listed under Global skills. 
To use it: + +``` +$agentic-migration-workshop I need to migrate our Oracle database to Snowflake +``` + +Or simply describe your migration task — the skill triggers automatically on keywords like `migrate`, `migration`, `convert`, `translate SQL`, `SnowConvert`, `SSIS`, `Power BI repointing`. + +## What's Included + +``` +agentic-migration-workshop/ +├── SKILL.md # Main router — welcome flow, intent detection +├── assessment/INSTRUCTIONS.md # Migration assessment & effort estimation +├── schema-conversion/INSTRUCTIONS.md # DDL conversion & data type mapping +├── data-migration/INSTRUCTIONS.md # Staging, loading, validation +├── query-translation/INSTRUCTIONS.md # SQL/stored procedure conversion +├── snowconvert-ai/INSTRUCTIONS.md # SnowConvert AI automated conversion +├── ssis-replatform/INSTRUCTIONS.md # SSIS package replatforming +├── powerbi-repointing/INSTRUCTIONS.md # Power BI datasource repointing +├── references/ # Platform-specific migration guides +│ ├── oracle.md +│ ├── teradata.md +│ ├── redshift.md +│ ├── sqlserver.md +│ └── best-practices.md +├── scripts/assess_complexity.py # DDL complexity scoring tool +└── pyproject.toml # Python dependencies for scripts +``` + +## Supported Source Platforms + +- Oracle +- Teradata +- Amazon Redshift +- SQL Server (including SSIS and Power BI) diff --git a/skills/agentic-migration-workshop/SKILL.md b/skills/agentic-migration-workshop/SKILL.md new file mode 100644 index 00000000..e0cc8cdf --- /dev/null +++ b/skills/agentic-migration-workshop/SKILL.md @@ -0,0 +1,99 @@ +--- +name: agentic-migration-workshop +title: Migration to Snowflake +summary: Guide developers through migrating Oracle, Teradata, Redshift, or SQL Server workloads to Snowflake. +description: | + Use when migrating a database from Oracle, Teradata, Amazon Redshift, or SQL Server to Snowflake. Covers assessment, schema conversion, data loading, SQL translation, and optional SSIS/Power BI repointing. Routes to focused sub-flows for each phase. 
Triggers: migrate, migration, convert, translate SQL, SnowConvert, SSIS, Power BI repointing, data validation, ETL, DDL conversion, schema conversion. +tools: + - snowflake_sql_execute + - Bash + - Read + - Write + - Edit + - Glob + - Grep +prompt: I want to migrate my Oracle database to Snowflake. +language: en +status: Published +author: Snowflake Solutions Team +type: snowflake +--- + +# Migration to Snowflake + +## Overview + +Migrate a legacy database (Oracle, Teradata, Redshift, SQL Server) to Snowflake. The skill walks through five focused phases — each lives in its own sub-flow with detailed steps. Pick the phase that matches where you are; you do not need to run them in order. + +| Phase | Sub-flow | Output | +|---|---|---| +| 1. Assess | `assessment/INSTRUCTIONS.md` | Inventory, complexity score, effort estimate | +| 2. Convert schema | `schema-conversion/INSTRUCTIONS.md` | Snowflake DDL + type mapping | +| 3. Load data | `data-migration/INSTRUCTIONS.md` | Staging, COPY scripts, reconciliation | +| 4. Translate SQL | `query-translation/INSTRUCTIONS.md` | Converted queries, procs, behavioral diffs | +| 5. Automate (optional) | `snowconvert-ai/INSTRUCTIONS.md` | Bulk conversion via SnowConvert AI | + +Add-ons: `ssis-replatform/INSTRUCTIONS.md` (SQL Server) and `powerbi-repointing/INSTRUCTIONS.md`. + +## When to Use + +Use when you have a workload on Oracle, Teradata, Redshift, or SQL Server and want to move it to Snowflake yourself. Use the Assess sub-flow first if you do not yet know scope or effort. Skip straight to Schema Conversion if your DDL is ready, or Query Translation if your tables already exist in Snowflake. + +## Setup + +1. Verify connectivity: + ```sql + SELECT CURRENT_ROLE(), CURRENT_WAREHOUSE(), CURRENT_DATABASE(); + ``` +2. Load `references/best-practices.md` plus the file matching your source: `references/oracle.md`, `teradata.md`, `redshift.md`, or `sqlserver.md`. +3. 
Optional but recommended — install **SnowConvert AI** (free) for automated DDL extraction and bulk conversion. Download from `https://snowconvert.snowflake.com`. + +## Routing + +Detect intent from the first user message: + +| Trigger phrases | Sub-flow | +|---|---| +| assess, readiness, complexity, scope, effort | `assessment/INSTRUCTIONS.md` | +| convert schema, DDL, data types, create tables | `schema-conversion/INSTRUCTIONS.md` | +| load data, migrate data, reconcile | `data-migration/INSTRUCTIONS.md` | +| translate SQL, convert query, stored procedure | `query-translation/INSTRUCTIONS.md` | +| SnowConvert, automated conversion | `snowconvert-ai/INSTRUCTIONS.md` | +| SSIS, dtsx, integration services | `ssis-replatform/INSTRUCTIONS.md` | +| Power BI, .pbit, repoint | `powerbi-repointing/INSTRUCTIONS.md` | + +If the user asks for an end-to-end run, start with `assessment/INSTRUCTIONS.md` and chain phases. + +⚠️ STOPPING POINT: After detecting intent, confirm with the user which phase to load before proceeding. Do not chain into a sub-flow without explicit user approval. + +## Stopping Points + +⚠️ STOPPING POINT: Before running any DDL, INSERT, COPY, or GRANT against the target Snowflake account, show the user the exact SQL and wait for explicit approval. + +⚠️ STOPPING POINT: Before applying SnowConvert output, show the EWI/FDM summary and let the user resolve EWI errors first. + +⚠️ STOPPING POINT: Before deleting or truncating staging tables after a load, confirm reconciliation passed. 
+ +Per-step stops: +- Routing — confirm which phase the user wants before loading any sub-flow +- Assessment — none (read-only) +- Schema conversion — confirm DDL before `CREATE` +- Data migration — confirm load plan, then confirm cleanup +- Query translation — confirm replacement before overwriting source files +- SnowConvert AI — confirm before deploying converted artifacts + +## Common Mistakes + +- **Skipping assessment.** Jumping into conversion without an object inventory leads to missed dependencies (sequences, synonyms, materialized views). +- **One-to-one data type mapping.** Oracle `NUMBER` and Teradata `DECIMAL` often need precision tuning; Redshift `VARCHAR(MAX)` should not always become `VARCHAR(16777216)`. +- **Ignoring behavioral differences.** Implicit casting, NULL ordering, date arithmetic, and empty-string handling differ across platforms. Capture each in a behavioral diff log. +- **Validating only row counts.** Counts match while values diverge. Validate counts, aggregates (SUM, MIN, MAX), and row-level samples. +- **Loading before reconciling staging.** Stage to a scratch schema, reconcile, then promote. Do not load directly into production tables. +- **Treating EWI warnings as optional.** SnowConvert EWI errors block correctness; resolve them before deployment. FDM warnings still need a business-impact review. +- **No rollback plan.** Keep source available read-only until validation passes end-to-end. + +## Troubleshooting + +- **No source DDL** — extract with SnowConvert AI or `https://github.com/Snowflake-Labs/SC.DDLExportScripts`, or query the source `INFORMATION_SCHEMA`. +- **Unsupported source feature** — document a Snowflake-native alternative and confirm with the user before substituting. +- **Permission errors** — most migrations need `CREATE DATABASE`, `CREATE SCHEMA`, `CREATE TABLE`, `CREATE STAGE`, and `USAGE` on a warehouse. Use a dedicated migration role. 
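
The Routing table in this skill can be read as a first-match keyword lookup. The sketch below is one possible reading, not the skill's actual mechanism: the names `ROUTES`, `_starts_word`, and `detect_subflow` are hypothetical, and matching phrases at the start of a word (so `assess` also catches `assessment`, while `ssis` does not fire on `assist`) is an assumption that the table itself does not state.

```python
import re

# Trigger phrases copied from the Routing table; the first matching row wins
# (row order as a tie-breaker is an assumption of this sketch).
ROUTES = [
    (("assess", "readiness", "complexity", "scope", "effort"),
     "assessment/INSTRUCTIONS.md"),
    (("convert schema", "ddl", "data types", "create tables"),
     "schema-conversion/INSTRUCTIONS.md"),
    (("load data", "migrate data", "reconcile"),
     "data-migration/INSTRUCTIONS.md"),
    (("translate sql", "convert query", "stored procedure"),
     "query-translation/INSTRUCTIONS.md"),
    (("snowconvert", "automated conversion"),
     "snowconvert-ai/INSTRUCTIONS.md"),
    (("ssis", "dtsx", "integration services"),
     "ssis-replatform/INSTRUCTIONS.md"),
    (("power bi", ".pbit", "repoint"),
     "powerbi-repointing/INSTRUCTIONS.md"),
]


def _starts_word(phrase):
    """Compile a pattern that matches `phrase` only at the start of a word."""
    # Phrases such as ".pbit" begin with punctuation, so no boundary is added.
    boundary = r"(?<![a-z0-9])" if phrase[0].isalnum() else ""
    return re.compile(boundary + re.escape(phrase))


COMPILED = [
    ([_starts_word(p) for p in phrases], subflow) for phrases, subflow in ROUTES
]


def detect_subflow(message):
    """Return the sub-flow path of the first matching row, or None."""
    text = message.lower()
    for patterns, subflow in COMPILED:
        if any(p.search(text) for p in patterns):
            return subflow
    return None  # no trigger phrase found: ask the user which phase they want
```

Whatever `detect_subflow` returns is only a suggestion. The stopping point in the Routing section still applies, so the detected phase is confirmed with the user before any sub-flow is loaded, and a `None` result simply means asking which phase they want.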
diff --git a/skills/agentic-migration-workshop/assessment/INSTRUCTIONS.md b/skills/agentic-migration-workshop/assessment/INSTRUCTIONS.md new file mode 100644 index 00000000..f16777fb --- /dev/null +++ b/skills/agentic-migration-workshop/assessment/INSTRUCTIONS.md @@ -0,0 +1,246 @@ + +# Workshop Session: Migration Assessment + +## Session Overview + +**Present to user:** +> Welcome to the **Migration Assessment** session. This is where we build a clear, data-driven picture of your migration — what you're working with, how complex it is, and how long it will take. +> +> By the end of this session, you'll have: +> - A complete inventory of your source objects +> - A complexity scorecard with risk ratings +> - A feature gap analysis with Snowflake alternatives +> - A Migration Readiness Report you can share with stakeholders +> - An effort estimate and timeline (if you have SnowConvert AI data) +> +> Let's start by understanding what we're migrating. + +## Prerequisites +- Platform reference file read (from `/references/`) +- `references/best-practices.md` read + +## Session Flow + +### Part 1: Source Inventory + +**Goal:** Build a complete catalog of objects in scope. + +**Context to share with user:** This migration is a strategic modernization initiative — not just a cost-saving exercise. Beyond the immediate move, Snowflake unlocks capabilities your current platform can't easily deliver: lakehouse architecture, semi-structured data, cross-cloud data sharing, and Snowpark ML. 
Common drivers by platform: +- **SQL Server**: License costs, vertical scaling limits, DBA overhead, SSIS/SSRS complexity +- **Redshift**: Concurrency bottlenecks, cluster management, WLM tuning, VACUUM/ANALYZE overhead, scaling delays +- **Oracle**: License/support costs, RAC complexity, Exadata lock-in, PL/SQL maintenance burden +- **Teradata**: Cost per TB, hardware refresh cycles, BTEQ/TPT tooling limitations + +**Ask the user** (via `ask_user_question`) how they'd like to provide their source inventory: +- Paste or upload DDL export files +- Provide a database/schema name (if source is queryable) +- Share a manual list of objects +- Provide SnowConvert AI extraction results +- They need help extracting DDL first (guide them to https://github.com/Snowflake-Labs/SC.DDLExportScripts) + +**Then categorize** every object into this inventory table and present it: + +| Category | Examples | Count | +|----------|----------|-------| +| Tables | Heap, partitioned, temporary, external | | +| Views | Standard, materialized, recursive | | +| Procedures | Stored procedures, functions, packages (Oracle) | | +| Indexes | B-tree, bitmap, function-based, columnstore | | +| Constraints | PK, FK, unique, check, default | | +| Sequences | Auto-increment, identity columns | | +| Triggers | DML triggers, DDL triggers | | +| Other | Synonyms, DBLinks, user-defined types | | + +**Present the inventory** to the user with a summary: *"Here's what I found — [N] total objects across [M] categories."* + +### Part 2: Complexity Scoring + +**Goal:** Rate each object category and identify high-risk items. + +**Explain to user:** +> Now let's assess how difficult each part of your migration will be. I'll score everything on a 1-5 scale based on how directly it maps to Snowflake equivalents. 
+ +**Scoring rubric (share with user):** + +| Score | Difficulty | What It Means | +|-------|-----------|---------------| +| 1 | Trivial | Direct 1:1 mapping — Snowflake handles this natively | +| 2 | Simple | Minor syntax changes needed | +| 3 | Moderate | Significant rewrite, but Snowflake has a clear alternative | +| 4 | Complex | Major redesign required, no direct equivalent | +| 5 | Critical | Requires architectural change or external tooling | + +**Platform-specific complexity drivers** (use when scoring): + +- **Oracle**: PL/SQL packages (4), DBLinks (4), bitmap indexes (1), materialized views (2), sequences (2), synonyms (2) +- **Teradata**: BTEQ scripts (3), MultiValue compression (1), temporal tables (3), MERGE with complex conditions (2), hash indexes (1) +- **Redshift**: Distribution keys (1), sort keys (2), COPY from S3 (2), spectrum tables (3), late-binding views (2), WLM queues (3), PL/pgSQL procedures (3), VACUUM/ANALYZE dependencies (1) +- **SQL Server**: CLR procedures (5), linked servers (4), SSRS reports (4), SSIS packages (4), temporal tables (3), columnstore indexes (1) + +**Produce a complexity scorecard:** +1. Score each object category +2. Calculate weighted complexity: `SUM(count * score) / SUM(count)` +3. Highlight anything scoring >= 4 as a critical item requiring special attention + +**Present to user:** *"Your overall complexity score is [X]/5. Here are the items that need the most attention..."* + +### Part 3: Feature Gap Analysis + +**Goal:** Identify source features without direct Snowflake equivalents — and their alternatives. + +**Explain to user:** +> Every platform has features that don't translate directly to Snowflake. The good news is that Snowflake almost always has a modern alternative. Let me map those for you. 
+ +**Check the platform reference** for known gaps, then build the gap analysis: + +| Source Feature | Snowflake Alternative | Effort | Risk | +|---------------|----------------------|--------|------| +| Row-level security | Row Access Policies | | | +| Column masking | Dynamic Data Masking | | | +| Stored procedures with cursors | Snowflake Scripting or JavaScript UDFs | | | +| Database links/remote queries | Data sharing or external tables | | | +| Scheduled jobs | Snowflake Tasks | | | +| Change Data Capture | Streams + Tasks | | | + +**Present with context:** For each gap, briefly explain *why* the Snowflake alternative is different and what the migration implication is. + +### Part 4: Migration Readiness Report + +**Goal:** Produce a polished, stakeholder-ready report. + +**Compile** results from Parts 1-3 and present as a formatted report: + +``` +# Migration Readiness Report +## [Source Platform] → Snowflake +## Date: [Today's Date] + +### Executive Summary +- Total objects in scope: [count] +- Overall complexity score: [weighted average] / 5 +- Estimated effort: [hours/days] +- Migration readiness: [Ready / Ready with caveats / Needs redesign] + +### Object Inventory +[Table from Part 1] + +### Complexity Scorecard +[Table from Part 2] + +### Critical Items (Score >= 4) +[List with specific mitigation strategies for each] + +### Feature Gap Analysis +[Table from Part 3] + +### Risk Register +| Risk | Likelihood | Impact | Mitigation | +|------|-----------|--------|------------| + +### Recommended Migration Order +1. [Phase 1 — lowest complexity first, to build momentum] +2. [Phase 2] +3. [Phase 3 — highest complexity last] + +### Next Steps +- [ ] Review and approve this readiness report +- [ ] Proceed to Schema Conversion +- [ ] Address critical items before full migration +``` + +**Present to user:** +> Here's your Migration Readiness Report. Take a moment to review — I want to make sure everything looks right before we move on. 
+ +**CHECKPOINT:** Wait for user approval before continuing. + +### Part 5: SnowConvert AI Assessment (Optional) + +**Introduce to user:** +> If you have SnowConvert AI installed, we can augment this assessment with automated metrics. SnowConvert AI achieves a **96%+ automated conversion rate** for Redshift and reduces manual effort by **50-70%** for SQL Server. The free assessment alone gives you data-driven scope and complexity estimates. + +**If user has SnowConvert AI**, guide them through running the assessment and interpret results: + +| Status | Meaning | How to Score | +|--------|---------|-------------| +| Green | Successfully converted | Trivial/Simple (1-2) | +| Yellow (FDM) | Functional Difference Message | Moderate/Complex (3-4) | +| Red (EWI) | Errors, Warnings, and Issues | Critical (5) — must resolve manually | + +Merge SnowConvert AI results with the manual complexity scores and update the readiness report. + +### Part 6: Effort Estimation (6-Step Process) + +**Introduce to user:** +> Now let's build a rigorous effort estimate using SnowConvert AI's reports. This is the same 6-step methodology Snowflake's partners use for engagement planning. 
+ +**Step 6.1 — Code Extraction & Automated Scoring:** +- Feed all source code to SnowConvert AI +- Key metric: Total Conversion Percentage (e.g., 85% automated) +- SnowConvert builds an AST for semantic analysis, not just pattern matching + +**Step 6.2 — Code Inventory & Workload Sizing:** +- Reports: Top-Level Code Unit Report + Elements Report +- Key metrics: Object counts, total LOC, LOC breakdown by complexity + +**Step 6.3 — ETL Re-platforming Analysis:** +- Reports: ETL Replatform Component Summary + Issues Report +- Strategy: simple flows → Dynamic Tables; complex flows → Snowpark Python + +**Step 6.4 — BI Repointing Analysis:** +- Report: Power BI Repointing Automation Score +- Distinguishes auto-repointed (low risk) from manual refactoring (high risk) + +**Step 6.5 — Manual Effort Quantification:** +- Reports: Issues Report (EWIs) + Functions Usage Report +- Apply complexity multipliers: + +| Complexity | Multiplier (Days/100 LOC) | Examples | +|-----------|--------------------------|---------| +| Low | 1.0 | Simple SQL DML fixes | +| Medium | 2.5 | Functions/UDFs with proprietary logic | +| High | 4.0 | T-SQL cursor rewrites to Snowpark Python | + +**Step 6.6 — Timeline:** +- Automated deployment + Manual rework + Testing buffer (30-40% of coding time) + +**Present a sample estimation** to calibrate expectations: + +| Phase | Duration | Notes | +|-------|----------|-------| +| Schema & Data Model Deployment | 1 week | Automated DDL deployment | +| Manual Code Refactoring | 3 weeks | Resolving high-effort EWIs | +| Data Migration & Initial Load | 1 week | Parallel to code fixes | +| Testing & Validation (SIT/UAT) | 3 weeks | 100% of converted objects + BI | +| Go-Live (Cutover) | 1 day | Clone and Swap methodology | +| **Total** | **~8 weeks** | 30K LOC, 50TB example | + +**CHECKPOINT:** Review effort estimate with user. 
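
The multipliers in Step 6.5 and the testing buffer in Step 6.6 combine into a short calculation. The following is a minimal sketch, not something SnowConvert AI produces: the function names, the example LOC figures, and the default buffer of 35% (taken as the midpoint of the 30-40% range) are assumptions made for illustration, and the result is a rough planning number only.

```python
# Multipliers from the Step 6.5 table: days of manual work per 100 lines of code.
DAYS_PER_100_LOC = {"low": 1.0, "medium": 2.5, "high": 4.0}


def manual_rework_days(loc_by_complexity):
    """Sum the manual rework days across the low/medium/high tiers."""
    return sum(
        loc / 100 * DAYS_PER_100_LOC[tier]
        for tier, loc in loc_by_complexity.items()
    )


def estimate_timeline(loc_by_complexity, deployment_days, buffer_ratio=0.35):
    """Step 6.6: automated deployment + manual rework + testing buffer.

    buffer_ratio is applied to the manual rework ("coding time"); the
    default of 0.35 is an assumed midpoint of the 30-40% range.
    """
    if not 0.30 <= buffer_ratio <= 0.40:
        raise ValueError("buffer_ratio should stay within the 30-40% range")
    rework = manual_rework_days(loc_by_complexity)
    buffer_days = rework * buffer_ratio
    return {
        "deployment_days": deployment_days,
        "manual_rework_days": rework,
        "testing_buffer_days": buffer_days,
        "total_days": deployment_days + rework + buffer_days,
    }
```

As a worked example with hypothetical inputs, 2,000 low, 800 medium, and 300 high LOC needing manual fixes give 20 + 20 + 12 = 52 days of rework. A 35% buffer adds 18.2 days, and 5 days of automated deployment brings the total to 75.2 days of effort, to be divided across however many engineers work in parallel.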
+ +### Part 7: Pilot Evaluation (Large/Complex Migrations) + +**Ask the user** if their migration is large or complex enough to warrant a pilot: + +| Indicator | Discovery Workshops | Migration Pilot | +|-----------|-------------------|----------------| +| Platform | Standard RDBMS | Complex/heterogeneous | +| Approach | Lift-and-shift | Data modernization | +| Size | Small to medium | Large | + +If a pilot is appropriate, guide through: +- **Lineage-based use case selection** — start at consumption (reports), trace backward to source +- **Three parallel workstreams:** Planning & Discovery, E2E Pilot, User Pilot (repoint reports quickly) +- **Entry/exit criteria** and scaling to wave-based delivery + +## Session Wrap-Up + +**Present to user:** +> Here's what we accomplished in this Assessment session: +> - [Summary of deliverables produced] +> +> Your migration readiness is [Ready / Ready with caveats / Needs redesign] with an overall complexity of [X]/5. + +**CHECKPOINT:** Confirm assessment is complete before transitioning. + +## Next Session + +If Full Workshop → proceed to **Schema Conversion** (read `schema-conversion/INSTRUCTIONS.md`) diff --git a/skills/agentic-migration-workshop/data-migration/INSTRUCTIONS.md b/skills/agentic-migration-workshop/data-migration/INSTRUCTIONS.md new file mode 100644 index 00000000..63565429 --- /dev/null +++ b/skills/agentic-migration-workshop/data-migration/INSTRUCTIONS.md @@ -0,0 +1,288 @@ + +# Workshop Session: Data Migration (Day 4 — Data Migration) + +## Session Overview + +**Present to user:** +> Welcome to **Data Migration** — this is Day 4 of the LiftOff framework. Now that your schema is in Snowflake, we'll load your data and make sure everything arrived correctly. +> +> Here's our plan: +> 1. Figure out the best way to get your data into Snowflake +> 2. Set up staging infrastructure (file formats, stages) +> 3. Generate and run load scripts +> 4. Validate completeness — row counts, aggregates, and spot-checks +> 5. 
Run through testing phases +> 6. Produce a Reconciliation Report +> +> Let's start with how your data is currently stored. + +## Prerequisites +- Target Snowflake tables exist (from Schema Conversion or pre-existing) +- Source data accessible (files, cloud storage, or direct connection) +- `references/best-practices.md` read + +## Session Flow + +### Part 1: Determine Data Source + +**Ask the user** (via `ask_user_question`) how they'll provide source data: +- CSV/Parquet/JSON files (local or cloud storage) +- Direct export from source database (I'll generate export commands) +- Data is already in cloud storage (S3, Azure Blob, GCS) +- Snowflake data sharing / replication +- SnowConvert AI data migration (automated, supports SQL Server & Redshift direct) + +**Based on selection, gather details:** + +| Source Type | What I Need From You | +|-------------|---------------------| +| Local files | File paths, format, delimiter, encoding, header row | +| Cloud storage | Bucket/container URL, credentials or storage integration, file format | +| Direct export | Source connection details, preferred export tool | +| Data sharing | Provider account, share name | +| SnowConvert AI | SnowConvert AI handles extraction and loading | + +### Part 2: Create Staging Infrastructure + +**Explain to user:** +> Before we can load data, Snowflake needs two things: a **file format** (how to parse your files) and a **stage** (where to find them). Let me set those up. 
+ +**Generate and execute** file format DDL: + +```sql +CREATE OR REPLACE FILE FORMAT migration_csv_format + TYPE = 'CSV' + FIELD_DELIMITER = ',' + SKIP_HEADER = 1 + FIELD_OPTIONALLY_ENCLOSED_BY = '"' + NULL_IF = ('NULL', 'null', '') + EMPTY_FIELD_AS_NULL = TRUE + ERROR_ON_COLUMN_COUNT_MISMATCH = FALSE; +``` + +**Generate and execute** stage DDL (adapt based on source type): + +For local files: +```sql +CREATE OR REPLACE STAGE migration_stage + FILE_FORMAT = migration_csv_format; +``` + +For S3: +```sql +CREATE OR REPLACE STAGE migration_stage + URL = 's3://bucket/path/' + STORAGE_INTEGRATION = [integration_name] + FILE_FORMAT = migration_csv_format; +``` + +Execute via `snowflake_sql_execute`. + +### Part 3: Generate Load Scripts + +**For each table**, generate COPY INTO: + +```sql +COPY INTO target_db.target_schema.table_name + FROM @migration_stage/table_name/ + FILE_FORMAT = migration_csv_format + ON_ERROR = 'CONTINUE' + PURGE = FALSE; +``` + +**Handle special cases** (explain each to user): + +| Scenario | Approach | Why | +|----------|----------|-----| +| Large tables (>1B rows) | Multiple files, parallel loading | Snowflake processes files in parallel across nodes | +| Semi-structured data | COPY INTO with VARIANT column, then FLATTEN | Preserves nested structure for later querying | +| Incremental loads | COPY with FORCE=FALSE | Skips already-loaded files automatically | +| Type mismatches | Explicit CAST in SELECT from stage | Prevents silent truncation | +| Date format differences | DATE_FORMAT/TIMESTAMP_FORMAT in file format | Ensures correct parsing | + +**Platform-specific export guidance:** + +**Redshift → Snowflake (via S3):** +> "The recommended path is: Unload from Redshift to Parquet files in S3, then COPY from S3 into Snowflake. Important: your S3 bucket should be in the same region as your Redshift cluster. Redshift distribution styles and sort keys don't need to be preserved — Snowflake auto-optimizes data layout. 
No VACUUM or ANALYZE needed post-load." + +**SQL Server → Snowflake:** +> "You can use BCP for bulk export, or if you have SnowConvert AI, it supports direct streaming from SQL Server to Snowflake with real-time progress monitoring." + +**Oracle → Snowflake:** +> "Use Data Pump (expdp), SQL*Plus spool, or UTL_FILE to export to CSV/Parquet, then upload to cloud storage." + +**Teradata → Snowflake:** +> "Use BTEQ .EXPORT, FastExport, or TPT for extraction, then upload to cloud storage." + +**Performance tips to share:** +> "A few tips to get the best load performance: split large files into 100-250MB chunks, use a larger warehouse for the initial bulk load (you can scale down after), and use PARQUET format when possible — it preserves schema and compresses better." + +**CHECKPOINT:** *"Here are the load scripts I've generated. Want to review them before we run?"* + +### Part 4: Execute Data Load + +**If local files:** Use PUT + COPY pattern: +```sql +PUT file:///path/to/data/*.csv @migration_stage/table_name/ AUTO_COMPRESS=TRUE; + +COPY INTO target_db.target_schema.table_name + FROM @migration_stage/table_name/ + FILE_FORMAT = migration_csv_format; +``` + +**Execute** via `snowflake_sql_execute` and capture results. + +**Present load status:** +``` +| Table | Files Loaded | Rows Loaded | Errors | Status | +|-------|-------------|-------------|--------|--------| +``` + +**If errors occur:** +- Query COPY_HISTORY for error details +- Show rejected rows from VALIDATE function +- Suggest fixes (type casting, null handling, encoding) +- *"Don't worry — errors during loading are normal. Let me diagnose what happened..."* + +### Part 5: Data Validation + +**Explain to user:** +> Now for the most critical part: making sure everything arrived correctly. I'll run validation at multiple levels — from simple row counts up to cell-level spot-checks. 
+ +**Schema Validation** (structural integrity): +- Table names match exactly +- Column names preserved correctly +- Ordinal positions maintained +- Data types converted appropriately +- Character lengths and numeric precision preserved + +**Row Count Validation:** +```sql +SELECT '[table_name]' AS table_name, + COUNT(*) AS snowflake_count +FROM target_db.target_schema.table_name; +``` +Compare against source counts (ask user to provide or query source). + +**Null Distribution Check:** +```sql +SELECT + COUNT(*) AS total_rows, + COUNT(col1) AS col1_non_null, + COUNT(col2) AS col2_non_null +FROM target_db.target_schema.table_name; +``` + +**Aggregate Validation** (sum, min, max on numeric columns): +```sql +SELECT + SUM(amount_col) AS total_amount, + MIN(date_col) AS min_date, + MAX(date_col) AS max_date +FROM target_db.target_schema.table_name; +``` + +**Statistical Validation:** +```sql +SELECT + MIN(numeric_col) AS min_val, + MAX(numeric_col) AS max_val, + AVG(numeric_col) AS avg_val, + STDDEV(numeric_col) AS stddev_val, + COUNT(DISTINCT key_col) AS distinct_count +FROM target_db.target_schema.table_name; +``` + +**Sample Spot-Check:** +```sql +SELECT * FROM target_db.target_schema.table_name +WHERE primary_key_col IN ([user-provided sample PKs]) +ORDER BY primary_key_col; +``` + +**Present validation results using this scale:** + +| Level | Meaning | Action | +|-------|---------|--------| +| Pass | Values match exactly | No action needed | +| Warning | Minor differences (e.g., higher precision) | Verify acceptable business impact | +| Fail | Values don't match | Investigation required | + +### Part 6: Testing Phases + +**Explain to user:** +> Beyond data validation, a production migration needs systematic testing. Here's the testing roadmap — we can work through whichever phases apply to your situation. 
+ +| Phase | Purpose | Typical Duration | +|-------|---------|-----------------| +| Integration Testing | Verify data flows between migrated components | 1-2 weeks | +| SIT (System Integration) | Validate full system behavior across all integrations | 1-2 weeks | +| Performance Testing | Benchmark queries against source baseline | 1 week | +| Load & Stress Testing | Simulate peak concurrency, validate auto-scaling | 3-5 days | +| Security Testing | Test RBAC, masking policies, SSO/MFA | 3-5 days | +| UAT (User Acceptance) | Business users validate reports and workflows | 2-3 weeks | +| Parallel Run | Run both systems simultaneously, compare outputs | 2-4 weeks | + +**Validation layers:** + +| Layer | What to Check | +|-------|---------------| +| Completeness | Row counts, table counts, object counts | +| Accuracy | Cell-level comparison on critical tables | +| Integrity | Referential integrity, business rule validation | +| Consistency | Cross-table aggregation checks, business metric reconciliation | +| Timeliness | Incremental data freshness, pipeline latency | + +**Performance benchmarking:** +1. Capture baseline query set from source (top N by frequency/cost) +2. Execute same queries against Snowflake; compare runtimes +3. Use Query Profile to analyze slow queries +4. Right-size warehouses based on workload patterns +5. 
Add CLUSTER BY for large tables (>1TB) with frequent range filters + +### Part 7: Reconciliation Report + +**Compile** all validation results into a polished report: + +``` +# Data Migration Reconciliation Report +## Date: [Today] + +### Load Summary +| Table | Source Rows | Snowflake Rows | Match | Errors | +|-------|------------|---------------|-------|--------| + +### Aggregate Checks +| Table | Column | Source Value | Snowflake Value | Match | +|-------|--------|-------------|-----------------|-------| + +### Issues Found +| Issue | Table | Details | Resolution | +|-------|-------|---------|------------| + +### Testing Summary +| Test Phase | Status | Notes | +|-----------|--------|-------| + +### Data Quality Notes +- [Any encoding issues, truncations, null handling differences] +``` + +**Present to user:** +> Here's your Data Migration Reconciliation Report. Everything that passed is green-lit for production. Let me walk you through any items that need attention. + +**CHECKPOINT:** Wait for user sign-off before proceeding. + +## Session Wrap-Up + +**Present to user:** +> Data Migration complete! Here's the summary: +> - [X] tables loaded successfully +> - [Y] total rows migrated +> - Row counts match: [pass/fail] +> - Aggregate checks: [pass/fail] +> - [Any issues and their resolutions] + +## Next Session + +If Full Workshop → proceed to **Query Translation** (read `query-translation/INSTRUCTIONS.md`) diff --git a/skills/agentic-migration-workshop/powerbi-repointing/INSTRUCTIONS.md b/skills/agentic-migration-workshop/powerbi-repointing/INSTRUCTIONS.md new file mode 100644 index 00000000..871e4656 --- /dev/null +++ b/skills/agentic-migration-workshop/powerbi-repointing/INSTRUCTIONS.md @@ -0,0 +1,128 @@ + +# Workshop Session: Power BI Repointing (Day 5 — Data Consumption) + +## Session Overview + +**Present to user:** +> Welcome to **Power BI Repointing** — part of Day 5: Data Consumption. 
We'll redirect your Power BI reports from your source database to Snowflake, so your reporting layer continues working seamlessly. +> +> SnowConvert AI automates most of this — it swaps connection strings, updates schema references, and flags any reports that need manual attention. +> +> Here's our plan: +> 1. Prepare your Power BI files +> 2. Run SnowConvert AI repointing +> 3. Review and validate the repointed reports +> +> Let's get your reports ready. + +## Prerequisites +- SnowConvert AI installed +- Power BI reports saved as `.pbit` files (template format) +- DDLs migrated (recommended — helps SnowConvert AI identify tables and views) + +## Supported Sources + +| Source Platform | Supported | +|----------------|-----------| +| SQL Server | Yes | +| Oracle | Yes | +| Teradata | Yes | +| Redshift | Yes | +| Azure Synapse | Yes | +| PostgreSQL | Yes | + +## Session Flow + +### Part 1: Prepare Power BI Files + +**Ask the user** (via `ask_user_question`): +- Have you saved your Power BI projects as `.pbit` (template) format? +- Do you have DDL files for the underlying database objects? +- Which source platform do your reports currently connect to? + +**If user hasn't saved as .pbit:** +> Power BI reports need to be in `.pbit` (template) format for SnowConvert AI to process them. In Power BI Desktop: File → Save As → Power BI Template (.pbit). + +### Part 2: Run SnowConvert AI Repointing + +**Walk the user through the process:** + +> In SnowConvert AI: +> 1. Optionally add your DDLs (improves object identification) +> 2. Select the source language used in your Power BI reports (e.g., SQL Server) +> 3. Add your `.pbit` files in the Power BI repointing section +> 4. Click **"Continue to Conversion"** + +### Part 3: Review and Validate + +**Guide the user through validation:** + +> Let's verify your repointed reports work correctly: +> 1. Open the repointed Power BI report +> 2. 
Fill in the Snowflake parameters (SnowConvert AI adds these automatically): +> - Server link +> - Warehouse name +> - Database name +> 3. Refresh data +> 4. Compare against the original report — same numbers, same charts, same filters +> 5. Save in your preferred format (`.pbix`) + +**Review the assessment report:** +> Check the "ETLAndBiRepointing" report for a summary of which connectors were changed and any items requiring attention. + +**CHECKPOINT:** +> How do the repointed reports look? Do the numbers match your original reports? + +## Estimation Reference + +**Share with user for planning:** + +> For context, here's what a typical Power BI repointing engagement looks like: + +**Sample Timeline (500 Reports, 3-4 BI Developers + 1 Architect):** + +| Phase | Duration | Key Activities | +|-------|----------|---------------| +| Assessment & Analysis | 1 week | SnowConvert AI scans all PBIT/PBIX files; outputs Repointing Automation Score | +| Automated Repointing | 1-2 weeks | Auto-swap connections for low-risk reports (typically ~75%) | +| Manual Refactoring | 2-3 weeks | BI developers refactor Custom T-SQL/M-Code in Power Query/DAX (~25%) | +| Functional Validation (UAT) | 2-3 weeks | Business users validate data integrity, filters, measures, charts | +| **Total** | **6-9 weeks** | Production-ready BI layer | + +**Report risk categories:** + +| Category | Risk | Effort | Description | +|----------|------|--------|-------------| +| Direct table queries | Low | Auto-repointed | Connection string swap only | +| Standard SQL queries | Low-Medium | Mostly automated | Minor ANSI SQL adjustments | +| Custom SQL in Power Query | High | Manual refactoring | Proprietary T-SQL in Power Query/M-Code | +| DAX with source-specific logic | Medium-High | Manual review | DAX measures referencing source patterns | + +**Key metrics:** +- **Repointing Automation Score** — % of reports auto-updated (the higher, the faster) +- **High-risk reports** — number requiring query 
refactoring (largest time driver) +- **UAT bottleneck** — business user testing is often the longest phase + +## Session Wrap-Up + +**Present to user:** +> Power BI Repointing complete! Here's the summary: +> - [X] reports repointed to Snowflake +> - [Y]% automated (connection swap only) +> - [Z] required manual refactoring +> - All reports validated with data refresh +> +> Your reporting layer is now running on Snowflake. + +## Broader Data Consumption Context + +When repointing is part of a larger migration, also consider: +- **Outbound integration inventory** — all systems consuming data (reports, analytics, APIs, extracts) +- **Platform compatibility** — verify each tool's Snowflake connector support +- **User training** — developers and users need to learn Snowflake access patterns +- **Data governance** — maintain or enhance cataloging in the new environment + +## Deliverables +- Repointed `.pbit`/`.pbix` files with Snowflake connectors +- ETLAndBiRepointing assessment report +- Repointing Automation Score report diff --git a/skills/agentic-migration-workshop/pyproject.toml b/skills/agentic-migration-workshop/pyproject.toml new file mode 100644 index 00000000..24b3284f --- /dev/null +++ b/skills/agentic-migration-workshop/pyproject.toml @@ -0,0 +1,10 @@ +[project] +name = "migration-workshop" +version = "0.1.0" +description = "Digital migration workshop skill - assists with database migration to Snowflake from Oracle, Teradata, Redshift, and SQL Server" +requires-python = ">=3.11" +dependencies = [ + "snowflake-connector-python>=3.6.0", + "pyyaml>=6.0", + "tabulate>=0.9.0", +] diff --git a/skills/agentic-migration-workshop/query-translation/INSTRUCTIONS.md b/skills/agentic-migration-workshop/query-translation/INSTRUCTIONS.md new file mode 100644 index 00000000..de61e40c --- /dev/null +++ b/skills/agentic-migration-workshop/query-translation/INSTRUCTIONS.md @@ -0,0 +1,264 @@ + +# Workshop Session: Query Translation (Day 4 — Data Integration & Transformation) + 
+## Session Overview + +**Present to user:** +> Welcome to **Query Translation** — part of Day 4: Data Integration & Transformation. This is where we convert your SQL queries, stored procedures, and functions to run natively on Snowflake. +> +> Here's our approach: +> 1. Collect and categorize your source SQL by complexity +> 2. Apply platform-specific translation rules systematically +> 3. Convert stored procedures and functions +> 4. Optionally use SnowConvert AI for automated verification +> 5. Test everything against your Snowflake data +> 6. Produce a Translation Report +> +> Let's see what you're working with. + +## Prerequisites +- Platform reference file read (from `/references/`) +- Source SQL queries or stored procedures available +- Target Snowflake tables exist (for validation) +- `references/best-practices.md` read + +## Session Flow + +### Part 1: Collect and Categorize Source SQL + +**Ask the user** to provide their source SQL: +- SQL query files +- Stored procedure definitions +- Function definitions +- Scheduled job/ETL scripts +- Report queries +- SnowConvert AI converted output (if available — we'll review and fix) + +**Categorize by complexity** and share with user: + +| Category | What It Means | Examples | +|----------|-------------|---------| +| Simple | Basic SELECT/JOIN/WHERE — quick translation | Reporting queries | +| Moderate | Subqueries, window functions, CTEs | Analytics queries | +| Complex | Dynamic SQL, cursors, temp tables | ETL procedures | +| Critical | Platform-specific extensions, optimizer hints | Heavily tuned queries | + +**Present:** *"I've categorized your [N] SQL objects: [X] simple, [Y] moderate, [Z] complex, [W] critical. Let me start translating — I'll explain the important changes as I go."* + +### Part 2: Apply Translation Rules + +**Explain to user:** +> Most SQL translations follow predictable patterns. I'll apply these systematically and highlight anything that changes behavior — not just syntax. 
+ +**Universal translations (all platforms):** + +| Source Pattern | Snowflake Equivalent | Notes | +|---------------|---------------------|-------| +| `TOP N` (SQL Server/Teradata) | `LIMIT N` | | +| `ROWNUM` (Oracle) | `ROW_NUMBER() OVER()` or `LIMIT` | | +| `NVL()` (Oracle) | `NVL()` or `COALESCE()` | Both work in Snowflake | +| `ISNULL()` (SQL Server) | `NVL()` or `COALESCE()` | | +| `GETDATE()` (SQL Server) | `CURRENT_TIMESTAMP()` | | +| `SYSDATE` (Oracle) | `CURRENT_TIMESTAMP()` | | +| `DATEADD` variations | `DATEADD(part, amount, date)` | | +| `DATEDIFF` variations | `DATEDIFF(part, start, end)` | | +| `CONVERT(type, expr)` | `CAST(expr AS type)` or `TRY_CAST()` | TRY_CAST returns NULL on failure | +| `STRING_AGG` / `LISTAGG` | `LISTAGG(col, delim)` | | +| Recursive CTE | Same ANSI syntax | | +| `MERGE` | Snowflake MERGE (ANSI-compliant) | | +| Temp tables `#temp` / `DECLARE GTT` | `CREATE TEMPORARY TABLE` | | + +**Platform-specific translations:** + +**Oracle → Snowflake:** + +| Oracle | Snowflake | Teaching Moment | +|--------|-----------|----------------| +| `(+)` outer join | ANSI `LEFT/RIGHT JOIN` | Snowflake accepts `(+)` in the WHERE clause, but ANSI joins are clearer and easier to maintain | +| `CONNECT BY / START WITH` | Recursive CTE | Same logic, cleaner syntax | +| `DECODE()` | `CASE WHEN` or `DECODE()` | Both supported — CASE is more readable | +| `TO_DATE('str', 'fmt')` | `TO_DATE('str', 'fmt')` | Verify format tokens match | +| PL/SQL blocks | Snowflake Scripting (SQL) or JavaScript UDF | | +| `DBMS_OUTPUT.PUT_LINE` | `SYSTEM$LOG()` | | +| `%TYPE` / `%ROWTYPE` | Explicit type declarations | | +| `BULK COLLECT / FORALL` | Set-based operations or RESULTSET | | +| `CURSOR` loops | Snowflake CURSOR in Scripting or set-based rewrite | Set-based is preferred | + +**Teradata → Snowflake:** + +| Teradata | Snowflake | Teaching Moment | +|----------|-----------|----------------| +| `SEL` | `SELECT` | Abbreviation not supported | +| `QUALIFY` | `QUALIFY` | Snowflake supports this natively! 
| +| `SAMPLE n` | `SAMPLE (n ROWS)` or `TABLESAMPLE` | | +| `FORMAT 'fmt'` | `TO_CHAR(col, 'fmt')` | | +| `CHARACTERS()` | `LENGTH()` | | +| `TITLE 'alias'` | `AS alias` | | +| `CASESPECIFIC` / `NOT CASESPECIFIC` | `COLLATE` or `UPPER()`/`LOWER()` | | +| `COLLECT STATISTICS` | Remove | Snowflake auto-manages statistics | +| `LOCKING ROW FOR ACCESS` | Remove | Snowflake MVCC handles concurrency | + +**Redshift → Snowflake:** + +| Redshift | Snowflake | Teaching Moment | +|----------|-----------|----------------| +| `GETDATE()` | `CURRENT_TIMESTAMP()` | | +| `LEN()` | `LENGTH()` | | +| `STRTOL()` | `TRY_TO_NUMBER()` with base | | +| `JSON_EXTRACT_PATH_TEXT()` | `col:path::STRING` (dot notation) | Snowflake's semi-structured access is much cleaner | +| `APPROXIMATE COUNT(DISTINCT)` | `APPROX_COUNT_DISTINCT()` | | +| `UNLOAD TO` | `COPY INTO @stage` | | +| Spectrum queries | External tables or data sharing | | +| `WLM` queue references | Remove | Use separate Snowflake warehouses instead | + +**SQL Server → Snowflake:** + +| SQL Server | Snowflake | Teaching Moment | +|------------|-----------|----------------| +| `SET NOCOUNT ON` | Remove | Not needed in Snowflake | +| `@@ROWCOUNT` | `SQLROWCOUNT` in Scripting | | +| `@@ERROR` | `SQLCODE` in Scripting | | +| `TRY...CATCH` | `BEGIN...EXCEPTION...END` | | +| `sp_executesql` | `EXECUTE IMMEDIATE` | | +| `CROSS APPLY` / `OUTER APPLY` | `LATERAL JOIN` / `LATERAL FLATTEN` | | +| `PIVOT` / `UNPIVOT` | Snowflake `PIVOT` / `UNPIVOT` | | +| `STRING_SPLIT()` | `SPLIT_TO_TABLE()` or `LATERAL FLATTEN(SPLIT())` | | +| `FOR XML PATH` | `LISTAGG()` or `ARRAY_AGG()` | | +| `OPENROWSET` / `OPENQUERY` | External tables or stages | | + +**As you translate, explain significant changes:** +> "I'm changing your `CROSS APPLY` to a `LATERAL JOIN` — functionally identical, but this is Snowflake's syntax for correlated subqueries in the FROM clause." 
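
The simplest rows in the tables above are one-for-one function renames, and those can be applied mechanically before the harder, structural rewrites are done by hand. The following is a minimal Python sketch of that idea, not how SnowConvert AI works internally. The `RENAME_RULES` list and the `apply_rename_rules` helper are hypothetical names introduced here for illustration, and the regexes do not skip string literals or comments, so the output still needs review.

```python
import re

# Hypothetical rule table: (label, compiled pattern, replacement).
# Only plain renames taken from the tables above; structural rewrites
# such as TOP N -> LIMIT N are deliberately left out.
RENAME_RULES = [
    ("GETDATE()", re.compile(r"\bGETDATE\s*\(\s*\)", re.IGNORECASE), "CURRENT_TIMESTAMP()"),
    ("SYSDATE", re.compile(r"\bSYSDATE\b", re.IGNORECASE), "CURRENT_TIMESTAMP()"),
    ("ISNULL()", re.compile(r"\bISNULL\s*\(", re.IGNORECASE), "COALESCE("),
    ("LEN()", re.compile(r"\bLEN\s*\(", re.IGNORECASE), "LENGTH("),
    ("CHARACTERS()", re.compile(r"\bCHARACTERS\s*\(", re.IGNORECASE), "LENGTH("),
]


def apply_rename_rules(sql: str) -> tuple[str, list[str]]:
    """Return the rewritten SQL plus one note per rule that fired."""
    notes = []
    for label, pattern, replacement in RENAME_RULES:
        sql, count = pattern.subn(replacement, sql)
        if count:
            notes.append(f"{label} -> {replacement} ({count} occurrence(s))")
    return sql, notes


source = "SELECT ISNULL(name, 'n/a'), LEN(name), GETDATE() FROM customers"
translated, notes = apply_rename_rules(source)
print(translated)
for note in notes:
    print("changed:", note)
```

The returned notes give a ready-made list of changes to explain to the user, which matches the "explain significant changes" step above. Anything that alters behavior rather than spelling should stay a manual, reviewed rewrite.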
+ +### Part 3: Convert Stored Procedures + +**Explain to user:** +> Stored procedures are usually the most complex part of query translation. For each one, I'll determine the best Snowflake approach. + +**Assessment strategy for each procedure:** + +| Source Pattern | Best Snowflake Approach | When to Use | +|---------------|------------------------|-------------| +| Simple cursor loop | Rewrite as set-based SQL | Always preferred — much faster | +| Complex cursor with business logic | Snowflake Scripting with CURSOR | When set-based isn't feasible | +| Dynamic SQL | `EXECUTE IMMEDIATE` with binds | | +| Temp table pipeline | Snowflake temp tables + Scripting | | +| Error handling | `BEGIN...EXCEPTION...END` | | +| Output parameters | RETURN value or RESULTSET | | +| Package (Oracle) | Separate procedures + shared tables/stages | No package concept in Snowflake | + +**Generate** Snowflake procedure DDL: +```sql +CREATE OR REPLACE PROCEDURE proc_name(param1 TYPE, param2 TYPE) + RETURNS VARCHAR + LANGUAGE SQL + EXECUTE AS CALLER +AS +$$ +BEGIN + -- Converted logic + RETURN 'Success'; +END; +$$; +``` + +**Validate** each procedure compiles: `snowflake_sql_execute` with `only_compile: true` + +### Part 4: AI Verification (Optional) + +**Introduce to user:** +> If you have SnowConvert AI, we can use its AI Verification feature to automatically test and fix conversion errors. The AI agents execute the converted code in your Snowflake account and fix issues — all grounded with tests over synthetic data. + +**If user has SnowConvert AI:** +1. Select converted objects for verification +2. AI agents execute and fix issues automatically +3. Review AI results per object ("SEE DETAILS") +4. 
Manually merge AI fixes with initial conversion + +**Track code completeness:** + +| Status | Count | Action | +|--------|-------|--------| +| Green (ready) | [n] | Deploy as-is | +| Yellow (FDM) | [n] | Review and document; may deploy with caveats | +| Red (EWI) | [n] | Must resolve manually | + +### Part 5: Test Translated Queries + +**Explain to user:** +> Let's verify that your translated SQL produces the right results. I'll run each query against your Snowflake data and check for correctness. + +**For each translated query:** +1. Execute against Snowflake: `snowflake_sql_execute` +2. Check for compilation errors +3. Verify result set structure matches expected output +4. Compare against source results if available (row counts, column values, aggregates) + +**Performance check:** +```sql +SELECT * FROM TABLE(GET_QUERY_OPERATOR_STATS(LAST_QUERY_ID())); +``` + +**Document behavioral differences** (these are important for UAT): +- NULL handling differences between platforms +- String collation differences +- Date/time precision differences +- Rounding behavior differences + +### Part 6: Translation Report + +**Compile** all results into a polished report: + +``` +# Query Translation Report +## Date: [Today] + +### Translation Summary +| Category | Total | Translated | Validated | Issues | +|----------|-------|-----------|-----------|--------| +| Queries | | | | | +| Procedures | | | | | +| Functions | | | | | + +### Translation Details +| Object | Source Lines | Snowflake Lines | Complexity | Status | +|--------|------------|----------------|-----------|--------| + +### Behavioral Differences +| Object | Difference | Impact | Mitigation | +|--------|-----------|--------|------------| + +### Manual Review Required +| Object | Reason | Guidance | +|--------|--------|----------| + +### Recommended Testing +- [ ] Unit test each procedure with sample inputs +- [ ] Compare query results against source for key reports +- [ ] Performance test with production-scale data 
+``` + +**Present to user:** +> Here's your Query Translation Report. All [N] objects have been converted and validated. Let me highlight the behavioral differences you should be aware of for UAT... + +**CHECKPOINT:** Wait for user approval. + +## Session Wrap-Up + +**Present to user:** +> Query Translation complete! Here's the summary: +> - [X] queries translated and validated +> - [Y] stored procedures converted +> - [Z] behavioral differences documented +> - All objects compile and execute in Snowflake + +## Workshop Context (Day 4 — Data Integration & Transformation) + +During the LiftOff engagement, this session covers: +- Data source catalog: frequency and volume of extraction +- Loading and transforming data into Snowflake +- Stored procedure conversion demo using SnowConvert AI +- SSIS/Informatica to dbt conversion demo +- Data load best practices +- Estimation of data integration migration LOE and timeline + +**Key estimation factors:** Object count and complexity, data product inventory, deployment framework, technology POCs, third-party library evaluation, orchestration/monitoring compatibility, external system connections diff --git a/skills/agentic-migration-workshop/references/best-practices.md b/skills/agentic-migration-workshop/references/best-practices.md new file mode 100644 index 00000000..56df84f3 --- /dev/null +++ b/skills/agentic-migration-workshop/references/best-practices.md @@ -0,0 +1,607 @@ +# Migration Best Practices + +Guidance from Snowflake's official migration guides, SnowConvert AI quickstart, and partner migration workshops. 
+ +## Prerequisites and Permissions + +**Snowflake account setup:** +```sql +GRANT CREATE MIGRATION ON ACCOUNT TO ROLE ; +GRANT USAGE ON WAREHOUSE migration_wh TO ROLE migration_role; +GRANT CREATE DATABASE ON ACCOUNT TO ROLE migration_role; +GRANT CREATE TABLE ON SCHEMA target_db.public TO ROLE migration_role; +GRANT INSERT, SELECT ON ALL TABLES IN SCHEMA target_db.public TO ROLE migration_role; +``` + +**Source database requirements:** +- Read access to all objects in migration scope +- Ability to extract DDL code for those objects +- For direct extraction (SQL Server, Redshift): network connectivity and auth credentials + +## Planning Phase + +### Migration Approach Selection + +| Approach | When to Use | Risk | Speed | +|----------|------------|------|-------| +| Lift and shift | Minimize changes; fast migration | Lower | Faster | +| Re-architecture | Modernize data models, ETL, procedural logic | Higher | Slower | +| Phased/hybrid | Start with lift-and-shift, optimize post-migration | Medium | Medium | + +### Scope Definition + +1. **Inventory** all source objects using catalog queries or DDL export scripts +2. **Triage** by usage: remove obsolete, unused, and temporary objects +3. **Classify** by business impact and technical complexity +4. **Prioritize**: Start with high-impact, low-complexity workloads to build momentum +5. **Exclude** system databases (SQL Server: master/msdb/tempdb/model; Teradata: DBC/Sys_Calendar/etc.) 
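
The scope steps above (exclude system databases, triage unused objects, prioritize high-impact and low-complexity work) can be sketched in a few lines of Python. This is an illustrative sketch only: the `SourceObject` fields, the 1 to 3 scoring scale, and the 365-day cutoff are assumptions made for the example, and the system database set contains only the names listed above.

```python
from dataclasses import dataclass

# System databases named in the list above (compared case-insensitively).
# Extend this set for your own platform.
SYSTEM_DATABASES = {"master", "msdb", "tempdb", "model", "dbc", "sys_calendar"}


@dataclass
class SourceObject:
    database: str
    name: str
    days_since_last_use: int  # hypothetical usage metric from the source catalog
    impact: int               # 1 (low) to 3 (high), assigned by the team
    complexity: int           # 1 (low) to 3 (high), assigned by the team


def prioritize(objects, unused_after_days=365):
    """Drop system and unused objects, then order by high impact, low complexity."""
    in_scope = [
        obj for obj in objects
        if obj.database.lower() not in SYSTEM_DATABASES
        and obj.days_since_last_use <= unused_after_days
    ]
    return sorted(in_scope, key=lambda o: (-o.impact, o.complexity, o.name))


inventory = [
    SourceObject("master", "spt_values", 1, 1, 1),
    SourceObject("sales", "orders", 2, 3, 1),
    SourceObject("sales", "old_tmp", 900, 1, 1),
    SourceObject("sales", "ledger_proc", 5, 3, 3),
]
for obj in prioritize(inventory):
    print(obj.database, obj.name, obj.impact, obj.complexity)
```

Sorting on `(-impact, complexity)` puts the quick wins first, which is the "build momentum" ordering described in step 4.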
+ +### Team and Governance + +- Establish a RACI matrix (Responsible, Accountable, Consulted, Informed) +- Roles: Project Manager, Data Engineer, Source DBA, Snowflake Architect, Security Admin, Business Analyst +- Coordinate with finance early: Snowflake uses consumption-based pricing +- Use Snowflake object tagging for cost attribution by department/project + +## DDL Extraction + +**Direct extraction supported:** SQL Server, Amazon Redshift (via SnowConvert AI) + +**File-based extraction (all platforms):** +- Use DDL export scripts: https://github.com/Snowflake-Labs/SC.DDLExportScripts +- Export to `.sql` files organized in logical folder structures +- SnowConvert extraction scripts per platform: + - Oracle: https://docs.snowconvert.com/sc/general/getting-started/code-extraction/oracle + - Teradata: https://docs.snowconvert.com/sc/general/getting-started/code-extraction/teradata + - SQL Server: Direct extraction via SnowConvert AI + - Redshift: Direct extraction via SnowConvert AI + +## Code Conversion + +### Status Indicators + +| Status | Meaning | Action | +|--------|---------|--------| +| Green | Successfully converted | Ready for deployment | +| Yellow / FDM | Functional Difference Message: converted, but behavior may differ from the source | Review for business impact; may deploy with documentation | +| Red / EWI | Errors, Warnings, and Issues: code that could not be converted cleanly | MUST resolve manually before deployment | + +### Code Completeness +- Score below 100% = missing object references in conversion +- Address missing references by including all dependent objects + +### Code Preparation Best Practices +- Clean up source code before conversion (remove commented-out legacy code) +- Ensure consistent encoding across all files (UTF-8 recommended) +- Document complex business logic before conversion +- Organize source code in logical folder structures +- Maintain backup copies of original source code + +### Key Constraint Difference (Critical!) +**Source platforms** such as Oracle, SQL Server, and Teradata enforce PK, FK, and UNIQUE constraints; Redshift declares them but treats them as informational. 
+**Snowflake** defines but does **NOT enforce** PK, FK, UNIQUE constraints (only NOT NULL is enforced). +**Action:** Move all data integrity checks into ETL/ELT pipelines. + +## Deployment + +### Dependency Order +Objects must deploy in this sequence: +1. Databases +2. Schemas +3. Sequences +4. Tables (parent tables first, then child tables with FKs) +5. Views (base views first, then dependent views) +6. Functions +7. Stored Procedures +8. File Formats and Stages + +### Pre-Deployment Checklist +- [ ] All EWI errors resolved +- [ ] FDM warnings reviewed and documented +- [ ] Converted code reviewed in IDE +- [ ] Test environment deployment tested first +- [ ] Rollback strategy planned +- [ ] All permissions and roles configured +- [ ] Service accounts created (not using username/password auth) + +## Environment Strategy + +### Multi-Account Setup (Recommended for Enterprise) +| Account | Purpose | Security Level | +|---------|---------|---------------| +| Production | Production data and workloads | Strictest controls | +| Development/QA | Development and testing | Moderate; migration team has more freedom | +| Sandbox (optional) | Experimental work | Relaxed; still maintain basic security | + +### SQL Server Environment Naming Convention +Best practice: separate databases by environment, not schemas. 
+ +| Object Type | Naming Pattern | Example | +|-------------|---------------|--------| +| Database | `[ENVIRONMENT]_[DATABASE]` | `DEV_SALES_DB`, `QA_SALES_DB`, `PROD_SALES_DB` | +| Schema | Mirror source schema names | `PROD_SALES_DB.dbo_schema` | +| Warehouse | `[FUNCTION]_WH_[ENVIRONMENT]` | `ANALYTICS_WH_DEV`, `ETL_WH_PROD` | +| Role | `[FUNCTION]_ROLE_[ENVIRONMENT]` | `DATA_ENGINEER_ROLE_PROD`, `ANALYST_ROLE_QA` | + +### Warehouse Strategy for Migration +| Warehouse | Purpose | Sizing | +|-----------|---------|--------| +| WH_MIGRATION_LOAD | Initial data load | Large/X-Large (scale down after) | +| WH_MIGRATION_VALIDATE | Data validation queries | Medium | +| WH_TRANSFORM | ETL/ELT transformations | Medium (adjust based on workload) | +| WH_BI_ANALYTICS | BI tool queries (post-migration) | Small-Medium (auto-scaling) | + +**Auto-suspend:** Set to 60 seconds on all warehouses to avoid paying for idle compute. + +## Data Migration + +### Platform-Specific Strategies + +**Redshift → Snowflake (via S3):** +1. Unload from Redshift to PARQUET files in S3 +2. Create Snowflake external stage pointing to S3 +3. COPY INTO Snowflake tables from stage +4. Automatic cleanup of temporary files +- S3 bucket must be in same region as Redshift cluster +- Requires IAM Role for Redshift (s3:PutObject, GetObject, ListBucket) +- Requires storage integration or IAM User for Snowflake (s3:GetObject, ListBucket) + +**SQL Server → Snowflake (direct streaming via SnowConvert AI):** +1. Bulk data extraction from SQL Server via BCP or direct streaming +2. Transfer to cloud storage stage or direct streaming to Snowflake +3. Real-time progress monitoring + +**Oracle → Snowflake:** +1. Extract via Data Pump (expdp), SQL*Plus spool, or UTL_FILE to CSV/Parquet +2. Upload to cloud storage (S3/Azure Blob/GCS) +3. COPY INTO from external stage + +**Teradata → Snowflake:** +1. Extract via BTEQ .EXPORT, FastExport, or TPT +2. Upload to cloud storage +3. 
COPY INTO from external stage + +### General Data Load Best Practices +- Split large files into 100-250MB chunks for maximum parallelism +- Migrate large tables during off-peak hours +- Use parallel migration for multiple small tables +- Use a dedicated, larger warehouse for initial bulk load; scale down after +- Monitor network bandwidth utilization +- Consider table partitioning for very large datasets +- Implement retry logic for transient failures +- Validate partial migrations before continuing +- Use PARQUET format when possible (preserves schema, better compression) + +### Incremental Data Migration +After historical data load, set up ongoing replication: +- Use source platform CDC (e.g., SQL Server CDC, Oracle LogMiner) +- Land changes in cloud storage → Snowpipe for automatic ingestion +- Apply changes to target with MERGE statement +- For complex dependencies, use Streams + Tasks pattern + +## Data Validation + +### Multi-Layered Validation Strategy + +**Level 1 — File/Object Validation:** +- Verify file checksums/hashes after transfer to cloud storage +- Confirm file counts match expected + +**Level 2 — Schema Validation:** +- Table names match exactly +- Column names preserved correctly +- Ordinal positions maintained +- Data types converted appropriately +- Character lengths preserved +- Numeric precision and scale maintained + +**Level 3 — Reconciliation (Aggregate Validation):** +- Row counts match between source and target +- MIN, MAX, AVG, SUM values per numeric column +- NULL value counts +- DISTINCT value counts + +**Level 4 — Cell-Level Validation (Data Diff):** +- For critical tables: cell-by-cell comparison of statistically significant sample +- Compare specific PKs with source data +- Standard deviation and variance checks +- **MD5 hash comparison (SQL Server recommended):** Create MD5 hash across key columns in SQL Server; generate corresponding hash in Snowflake; compare hashes. SnowConvert AI data migration feature automates this. 
+- **Redshift behavioral validation:** Validate `GREATEST`/`LEAST` NULL handling (Snowflake returns NULL if any arg is NULL; Redshift returns non-NULL), numeric precision with `TO_NUMBER`, timestamp/timezone differences (`TIMESTAMP_NTZ` vs `TIMESTAMPTZ`), hash consistency when replacing `FNV_HASH` with `HASH()` + +**Critical platform differences affecting validation:** +- Collation behavior (SQL Server case-insensitive vs Snowflake case-sensitive; Redshift lowercases vs Snowflake uppercases) +- Floating point arithmetic differences between platforms +- Date/time precision (SQL Server DATETIME 3.33ms → Snowflake TIMESTAMP_NTZ nanosecond; Redshift microsecond → Snowflake nanosecond) +- Business users must understand these for UAT sign-off +- **Redshift-specific:** Passing structural and aggregate validation does not guarantee behavioral equivalence; business-critical queries should always be validated directly + +**Level 5 — Business Logic Validation:** +- Run key business reports against both source and target +- Compare aggregated outputs +- Custom business metrics (e.g., total revenue, customer counts) + +### Validation Result Levels + +| Level | Meaning | Action | +|-------|---------|--------| +| Pass | Values match exactly | No action needed | +| Warning | Minor differences (e.g., higher precision) | Reconcile: apply transformation or change ingestion | +| Fail | Values don't match | Investigation required | + +### Common Validation Issues + +| Issue | Cause | Resolution | +|-------|-------|------------| +| Row count mismatch | Incomplete migration | Re-run data migration for affected tables | +| Precision differences | Data type conversion | Verify acceptable business impact | +| Date format variations | Timezone or format changes | Standardize date handling | +| Null handling differences | Platform-specific null behavior | Update conversion rules | +| Empty string vs NULL | Oracle treats '' as NULL; others don't | Add explicit null handling in ETL | +| Case 
differences | Collation changes | Normalize with UPPER/LOWER or COLLATE | + +### Pre-Validation Checklist +- [ ] Ensure data stability during validation (no concurrent updates) +- [ ] Complete all migration steps before validation +- [ ] Have sufficient system resources available +- [ ] Plan validation during maintenance windows + +## Testing Strategy + +### Test Types + +| Test | When | What | +|------|------|------| +| Functional | After code conversion | All migrated applications and functionalities work as expected | +| Integration | After data load | Migrated components work together, data flows between systems | +| SIT (System Integration) | After integration | Full system behavior validated across all integrated systems | +| Performance | After data load | Query performance, data loading speed, system responsiveness | +| Load & Stress | Before cutover | System handles expected peak concurrency and auto-scaling | +| Security | Before cutover | RBAC, data masking, row access policies, SSO/MFA all validated | +| Regression | After each phase | Previously working features still work | +| UAT (User Acceptance) | Before cutover | Business users validate reports and daily tasks | +| Parallel Run | Before cutover | Both systems running simultaneously, outputs compared | + +### Performance Benchmarking +1. **Capture** baseline query set from source platform (top N queries by frequency/cost) +2. **Execute** same queries against Snowflake; compare runtimes +3. **Use** Query Profile tool in Snowflake to analyze slow queries +4. **Right-size** warehouses based on workload patterns +5. 
**Add** CLUSTER BY for large tables (>1TB) with frequent range filters + +### Query Performance Optimization +```sql +-- Check query profile +SELECT * FROM TABLE(GET_QUERY_OPERATOR_STATS(LAST_QUERY_ID())); + +-- Review query history for slow queries +SELECT query_id, query_text, execution_time, warehouse_size +FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY +WHERE start_time > DATEADD('day', -7, CURRENT_TIMESTAMP()) +ORDER BY execution_time DESC +LIMIT 20; +``` + +## Cutover Strategy + +### Approaches + +| Strategy | Risk | Downtime | Complexity | +|----------|------|----------|-----------| +| Big Bang | High | Short (planned window) | Lower | +| Phased Rollout | Low | None (per component) | Higher | +| Parallel Run | Lowest | None | Highest (run both systems) | + +### Phased Rollout (Recommended) +1. Migrate applications/reports one at a time +2. Implement bridging strategy so users don't query both systems +3. Validate each component before proceeding to next +4. Data synchronization for non-migrated applications happens behind the scenes + +### SQL Server Specific Cutover +- Run SQL Server and Snowflake simultaneously during parallel run +- High confidence from automated testing allows minimal parallel run window +- Cutover only after: initial data migrated, processes keep data current, all testing complete, all tools redirected +- **Cutover action:** Turn off SQL Server data processes, revoke user/tool access +- **Define cutover plan early** — lack of clarity creates parallel environment overhead + +### Amazon Redshift Specific Cutover +- Run Redshift and Snowflake in parallel; validate pipelines and analytics +- Minimize overlap duration through automated testing +- **Cutover sequence:** Disable Redshift ingestion → redirect consumers to Snowflake → decommission Redshift clusters +- **Cutover readiness:** Final data reconciliation complete, BI tools validated, upstream writes disabled, resource monitors enabled, decommissioning plan reviewed + +### Cutover 
Checklist +- [ ] All stakeholders aligned and signed off +- [ ] All permissions and roles configured in Snowflake +- [ ] Service accounts created and tested +- [ ] Active Directory / SSO roles configured +- [ ] Final incremental data sync completed +- [ ] All ETL pipelines pointing to Snowflake +- [ ] BI tools repointed and validated +- [ ] Rollback plan documented and tested +- [ ] Legacy platform set to read-only (fallback period) +- [ ] Surrogate keys synchronized between systems +- [ ] Monitoring and alerting active + +### Rollback Plan +1. Keep source platform in read-only state for defined fallback period +2. Document exact steps to revert connections +3. Maintain data sync from Snowflake back to source (if bidirectional needed) +4. Define rollback triggers (e.g., data integrity failure, performance SLA breach) +5. Practice rollback in test environment before production cutover + +## Security + +- Use principle of least privilege for database connections +- Enable MFA on all Snowflake accounts, especially privileged roles +- Configure SSO with corporate identity provider (Azure AD/Entra ID, Okta) +- Prioritize automated provisioning via IdP with SCIM +- Set up network policies to whitelist trusted IP ranges +- Regularly rotate access codes and credentials +- Audit migration activities and access logs +- Encrypt sensitive data during transit (SSL/TLS) +- Use storage integrations (not raw credentials) for cloud storage access +- Implement proper backup strategies +- Maintain audit trails for compliance +- Never commit credentials to version control +- Create migration-specific roles; revoke after migration complete + +### SQL Server RBAC Migration Pattern + +**SQL Server:** DAC (Discretionary Access Control) + RBAC mix; Login + User separation +**Snowflake:** Pure hierarchical RBAC; unified User object; authenticate via SSO/OAuth + +**Role hierarchy best practice:** + +| Role Type | Description | Naming | Example | +|-----------|-------------|--------|--------| 
+| Access Roles | Low-level; specific permissions on database objects | `[PERMISSION]_[OBJECT]` | `WH_ANALYTICS_USAGE`, `DB_SALES_READ` | +| Functional Roles | High-level; aligned with business functions; granted Access Roles | `[FUNCTION]_ROLE_[ENV]` | `DATA_ANALYST_ROLE`, `DATA_ENGINEER_ROLE_PROD` | + +**Key actions:** +- Use **future grants** for auto-applying permissions to new objects +- Establish audit processes for role/user creation, deletion, privilege changes +- Set `QUOTED_IDENTIFIERS_IGNORE_CASE = TRUE` for SQL Server migrations (reporting tool compatibility) + +## Performance + +- Ensure adequate memory (8GB+ recommended for large conversion projects) +- Monitor disk space for temporary files +- Use SSD storage for better I/O performance during local processing +- Plan migrations during off-peak hours +- Use incremental migration strategies for very large tables +- Right-size warehouses: start small, scale up based on performance data +- Set auto-suspend to 60 seconds on all warehouses + +## Post-Migration + +### Immediate (Week 1-2) +- Validate application connectivity to Snowflake +- Monitor query performance; identify and optimize slow queries +- Track user adoption and gather feedback +- Resolve any data discrepancies found by users + +### Short-Term (Month 1-3) +- Right-size warehouses based on actual usage patterns +- Implement CLUSTER BY on large, frequently queried tables +- Set up resource monitors for cost control +- Implement showback/chargeback model for cost attribution +- Refine RBAC hierarchy; audit roles and permissions +- Implement Dynamic Data Masking and Row Access Policies for sensitive data + +### Long-Term (Ongoing) +- Continuous performance monitoring via ACCOUNT_USAGE +- Regular security audits +- Cost optimization reviews +- Explore Snowflake-native features (data sharing, marketplace, Snowpark) +- Decommission legacy platform after sufficient fallback period +- Always test conversions in development environments first +- 
Maintain detailed migration documentation + +## Resources + +- SnowConvert AI (free): https://www.snowflake.com/en/migrate-to-the-cloud/snowconvert-ai/ +- SnowConvert AI Training (free): https://training.snowflake.com +- DDL Export Scripts: https://github.com/Snowflake-Labs/SC.DDLExportScripts +- SnowConvert AI Docs: https://docs.snowflake.com/en/migrations/snowconvert-docs/overview +- Official Migration Guides: + - Oracle: https://docs.snowflake.com/en/migrations/guides/oracle + - SQL Server: https://docs.snowflake.com/en/migrations/guides/sqlserver + - Teradata: https://docs.snowflake.com/en/migrations/guides/teradata +- SnowConvert Support: snowconvert-support@snowflake.com +- Professional Services: https://www.snowflake.com/en/solutions/professional-services/ + +## Engagement Deliverables & Templates + +### Final Readout Structure (Day 10) +The Engagement Delivery Readout synthesizes all workshop findings: +1. **Objectives**: Components of migration, engagement overview, commitment +2. **Migration scope**: Environment metrics, migration effort estimation, code conversion results +3. **High-level timeline**: Assumptions, resources, milestones, go-live date +4. **Tool recommendations**: Partners, third-party tools, Snowflake features +5. **Assessment findings**: Risk register, mitigation strategies +6. **Go-forward plan**: Execution approach, next steps +7. 
**Documentation appendix**: Estimation approach, workshop estimates, task list, RACI, open items + +### RACI Matrix + +Typical migration RACI roles: + +| Task Area | Customer | SI/Partner | Snowflake | +|-----------|----------|-----------|-----------| +| Code extraction (DDL/DML) | R, A | C | C | +| SnowConvert AI conversion | C | R, A | C | +| Manual EWI resolution | C | R, A | C | +| Data migration execution | C | R, A | I | +| Data validation | R, A | C | I | +| UAT sign-off | R, A | C | I | +| BI repointing | C | R, A | I | +| Security model (RBAC) | R, A | C | C | +| Cutover execution | C | R, A | C | +| Post-migration support | C | R, A | I | + +(R=Responsible, A=Accountable, C=Consulted, I=Informed) + +### Delivery Checklist Topics +- [ ] Introduction and Engagement Overview (team intros, scope, questionnaire, architecture review) +- [ ] Database Conversion (code assessment, extraction, analysis, conversion, deployment plan) +- [ ] Data Migration (table scope, data volume, methods, security model review) +- [ ] Data Integration (data sources, ingestion, transformation, orchestration, deployment) +- [ ] Data Validation (validation approach, expectations, remediation plan, environments) +- [ ] Data Consumption (platform inventory, repointing, training plan) +- [ ] Roles and Responsibilities (RACI selection, participant roles) +- [ ] Migration Timeline (calculator inputs, resource assumptions, milestones) +- [ ] Partner and Tool Recommendations +- [ ] Readout Review and Final Delivery + +## Migration Questionnaire Areas + +When gathering customer requirements, cover these sections: + +### Project Section (Green) +- Migration drivers, target completion date +- Preferred migration approach (lift-and-shift, re-architecture, phased) +- Go-live approach (big bang, phased, parallel) +- Logical divisions / segmentation for phased migration +- Business-critical workload SLAs +- Performance integration testing criteria +- Parallel execution requirements +- Governance: 
PM methodology, change control, escalation process +- Staff: team size, productive hours/week, Snowflake expertise level + +### Architecture Section (Green) +- Cloud provider and region (production, DR, non-production) +- Network bandwidth and private networking method +- Snowflake account strategy (single vs. multi-account) +- Environment strategy (Dev, Test, QA, Stage, Prod, DR) +- Complete technology inventory: data modeling, ETL, transformation, reporting, analytics, data science, orchestration, monitoring, data quality, cataloging, CI/CD, identity management + +### Data Section (Green) +- Total data volume (MB/GB/TB/PB) +- Historical data transfer method +- Character encoding (UTF-8 compatibility) +- Semi-structured and unstructured data formats +- Data model modifications or in-flight projects +- OLTP workloads, low-latency requirements +- Data masking and row-level security +- Data sharing requirements +- Sensitive data (PII/PHI/PCI) and regulatory compliance (HIPAA, PCI, SOX, CCPA, GDPR) +- Records retention policies + +### Platform to Migrate (Yellow) +- Database name, technology, version, edition, character set, hosting location, administration model + +### Data Suppliers/Sources (Yellow) +- Name, purpose, technology, hosting, sensitive data, integration technology/method, frequency, strategic plan, job count, volume, execution duration + +### Data Consumers/Targets (Yellow) +- Name, purpose, technology/version, consumption method, connection type, hosting, frequency, strategic plan, asset count, semantic models, volume + +### Platform-Specific Questions (Blue) +**Oracle:** Exadata form factor, SaaS apps, AWR reports, UTPLSQL, global variables, Advanced Queuing, RAC, Data Guard, multi-tenant, Spatial, Database Vault, Compression, Golden Gate, partitioning +**SQL Server:** Resource usage reports, CLR integration, spatial data types, spatial index, data compression, SQL Server Replication +**Teradata:** Transaction mode (ANSI vs BTET), Bankers Rounding, 
Data Labs + +## Folder Structure + +Standard engagement folder organization: + +``` +LiftoffEngagementPackage/ +├── CapN_CustomerFacing_Deck.pptx # Customer delivery deck (all days) +├── CapN_Liftoff_Runbook.docx # Partner step-by-step guide (all days) +├── Recommended_Agenda.xlsx # Workshop agenda with attendees/outcomes +└── MigrationPrototype_MigrationPlanning/ + ├── Prerequisites/ + │ ├── Partner_PreReq_Tracker.xlsx # Partner prerequisite tracking + │ └── Files to be Sent Out to Customer/ + │ ├── Migration_Questionnaire.xlsx # Customer questionnaire (Oracle + SQL Server tabs) + │ └── Engagement_Prerequisites_Checklist.xlsx + ├── Secure share folder structure/ # Shared drive (Customer + Partner + Snowflake) + │ ├── Prerequisites/ + │ │ ├── Database conversion (DDL) code extracts/ + │ │ ├── Data integration (ETL) code extracts/ + │ │ ├── Data consumption code extracts/ + │ │ └── Architecture diagrams/ + │ └── Documentation/ + │ ├── White papers and guides/ + │ ├── Analysis/ + │ │ ├── Code conversion reports/ + │ │ └── Scripts/ + │ └── Presentations/ + ├── MigrationPlanningTemplates/ + │ ├── Liftoff_Engagement_Readout_Template.pptx + │ ├── Migration_Timeline.xlsx # Schedule with milestones + │ ├── Engagement_Delivery_Checklist.xlsx + │ ├── Engagement_Action_Tracker.xlsx + │ ├── Engagement_Running_Notes.docx + │ └── Migration_Task_List_and_RACI.xlsx + └── Email templates/ + └── Daily_Workshop_Summary_Email_Template.docx +``` + +## Partner Learning Materials + +### Migration Overview +| For Developers | For Architects | +|---------------|----------------| +| Intro to Snowflake Migration Master Class | Migrate To The Snowflake AI Data Cloud | +| From Legacy to Cloud: Snowflake's Roadmap (Video) | Migration Master Class Academy | +| End-to-End Migration: Data and Pipelines (Hands-on Lab) | SnowConvert On-Demand Training | + +### Database & Object Conversion +| For Developers | For Architects | +|---------------|----------------| +| SnowConvert for Developers 
(Required) | Migration Master Class Academy | +| Migrate to the Snowflake AI Cloud | Best Practices for Migrating Historical Data | +| AI Feature: Migration Assistant Blog | E2E Migration Hands-on Lab | +| Quickstart: E2E Migration | SnowConvert Docs Overview | +| SnowConvert AI-Powered Migrations (Video) | Accelerate Migrations: What's New in SnowConvert | + +### Data Migration +| For Developers | For Architects | +|---------------|----------------| +| Level Up: Data Loading | Level Up: Data Loading | +| Doc: Data Loading Overview | Doc: Data Loading Overview | + +### Data Integration & Transformation +| For Developers | For Architects | +|---------------|----------------| +| Snowflake Openflow | 9 Best Practices: On-Premises to Cloud | +| Snowflake Data Integration | | +| Workshop: Data Engineering | | + +### Data Validation +| For Developers | For Architects | +|---------------|----------------| +| SnowConvert Migration Assistant | Accelerate Migrations: What's New (Video) | +| SnowConvert AI Documentation | | +| SnowConvert Data Validation | | + +### Data Consumption +| For Developers | For Architects | +|---------------|----------------| +| SnowConvert Power BI Repointing | SnowConvert AI Documentation | +| SnowConvert Teradata ETL-BI Repointing | | +| ETL BI Repointing | | +| Power BI Transact Repointing | | +| SSIS Repointing | | + +## SnowConvert AI Quick Reference + +- **Average conversion rate:** +95% (based on total LOC for Oracle, SQL Server, Teradata migrations) +- **Lines of code converted to date:** 2.0B+ +- **Database objects converted:** 46M+ +- **Average timeline acceleration:** +88% +- **Supported platforms:** Teradata, Oracle, SQL Server, Amazon Redshift, Synapse, Sybase*, BigQuery*, Netezza*, Postgres*, Greenplum*, Databricks SQL* (*tables and views only) +- **ETL Replatform:** SSIS (Public Preview), Informatica Power Center (Private Preview) → dbt projects +- **BI Repointing:** Power BI (Public Preview) +- **Features:** Code Conversion (GA), 
Migration Assistant (GA), Code Verification (Public Preview), Data Validation (Public Preview) + +### Key Resources +- Download SnowConvert AI: Available from Snowsight → Ingestion/Migrations +- Training: https://learn.snowflake.com/en/courses/OD-SC-D/ +- Documentation: https://docs.snowflake.com/en/migrations/snowconvert-docs/overview +- Mastering Migration Planning: On-Demand Course +- E2E Migration Hands-on Lab: Virtual Hands-On Lab +- Power BI Repointing Blog: Available on Snowflake Blog diff --git a/skills/agentic-migration-workshop/references/oracle.md b/skills/agentic-migration-workshop/references/oracle.md new file mode 100644 index 00000000..4b20bdcf --- /dev/null +++ b/skills/agentic-migration-workshop/references/oracle.md @@ -0,0 +1,318 @@ +# Oracle to Snowflake Reference + +## Architecture Differences + +| Aspect | Oracle | Snowflake | +|--------|--------|-----------| +| Architecture | Monolithic or shared-disk (RAC); tightly coupled compute & storage | Decoupled compute, storage, and cloud services | +| Storage | DBA-managed on local disks, SAN, NAS (filesystems/ASM) | Centralized object storage with auto micro-partitioning | +| Compute | Fixed server resources (CPU, Memory, I/O) | Elastic, on-demand virtual warehouses | +| Concurrency | Limited by server hardware and session/process limits | High concurrency via multi-cluster warehouses | +| Scaling | Vertical (more powerful server) or horizontal (RAC). 
Often requires downtime | Instant scale up/down/out (seconds); storage scales automatically | +| Maintenance | DBA tasks: index rebuilds, statistics gathering, tablespace management | Fully managed; maintenance automated in background | +| Constraints | PK, FK, UNIQUE, CHECK all enforced | Only NOT NULL enforced; PK/FK/UNIQUE are metadata-only | + +## Data Type Mapping + +| Oracle | Snowflake | Notes | +|--------|-----------|-------| +| NUMBER(p,s) | NUMBER(p,s) | Direct mapping | +| NUMBER (no precision) | NUMBER(38,0) | Unspecified Oracle NUMBER → max precision integer | +| BINARY_FLOAT | FLOAT | Single-precision | +| BINARY_DOUBLE | FLOAT | Double-precision | +| VARCHAR2(n) | VARCHAR(n) | Snowflake max 16MB | +| NVARCHAR2(n) | VARCHAR(n) | Snowflake native UTF-8; N-prefix types unnecessary | +| CHAR(n) | CHAR(n) | Or VARCHAR(n) | +| NCHAR(n) | CHAR(n) | Snowflake native UTF-8 | +| CLOB | VARCHAR(16777216) | 16MB max | +| NCLOB | VARCHAR(16777216) | 16MB max | +| BLOB | BINARY(8388608) | 8MB max; consider external stage for larger | +| RAW(n) | BINARY(n) | | +| LONG | VARCHAR(16777216) | Deprecated in Oracle | +| LONG RAW | BINARY(8388608) | Deprecated in Oracle | +| DATE | TIMESTAMP_NTZ | Oracle DATE includes time component (critical difference!) 
| +| TIMESTAMP | TIMESTAMP_NTZ | | +| TIMESTAMP WITH TIME ZONE | TIMESTAMP_TZ | | +| TIMESTAMP WITH LOCAL TIME ZONE | TIMESTAMP_LTZ | | +| INTERVAL YEAR TO MONTH | VARCHAR | Store as string; use date functions for calculations | +| INTERVAL DAY TO SECOND | VARCHAR | Store as string; use date functions for calculations | +| BOOLEAN (21c+) | BOOLEAN | | +| XMLTYPE | VARIANT | Parse XML to VARIANT | +| SDO_GEOMETRY | GEOGRAPHY or GEOMETRY | Snowflake geospatial types | +| ROWID / UROWID | Not needed | Snowflake does not use ROWIDs | +| BFILE | External stage | Reference files in external storage | + +## Feature Mapping + +| Oracle Feature | Snowflake Equivalent | +|---------------|---------------------| +| Tablespaces | Not needed (Snowflake manages storage) | +| Partitioning (range/list/hash/composite) | Micro-partitions (automatic); use CLUSTER BY for ordering | +| Bitmap indexes | Automatic micro-partition pruning | +| B-tree indexes | Not needed; Snowflake auto-optimizes | +| Function-based indexes | CLUSTER BY on expressions | +| Materialized views | MATERIALIZED VIEW or Dynamic Tables | +| Materialized view logs | Streams (for change tracking) | +| Synonyms | Fully qualified names or wrapper views | +| Database links (DBLinks) | Data sharing, external tables, or Snowpipe | +| Sequences | SEQUENCE (native support) | +| PL/SQL packages | Separate procedures + optional shared state via tables/stages | +| PL/SQL procedures | Snowflake Scripting (SQL) or JavaScript procedures | +| PL/SQL functions | Snowflake UDFs (SQL, JavaScript, Python) | +| Pipelined table functions | Snowflake UDTFs | +| Triggers (DML/DDL) | Streams + Tasks (event-driven) | +| Oracle Scheduler (DBMS_SCHEDULER) | Tasks (with CRON schedules) | +| Flashback queries | Time Travel (SELECT ... 
AT/BEFORE) | +| Flashback Data Archive | Time Travel + Fail-Safe | +| Virtual Private Database (VPD) | Row Access Policies | +| Data Redaction | Dynamic Data Masking Policies | +| Advanced Queuing (AQ) | Streams + Tasks or external messaging | +| Autonomous transactions | Not directly supported; redesign with separate transactions | +| Global temporary tables | TEMPORARY TABLE | +| External tables | External tables (S3/Azure/GCS) | +| Oracle hints (`/*+ ... */`) | Remove; Snowflake auto-optimizes (no hint system) | +| AWR / ASH performance views | Query History (ACCOUNT_USAGE.QUERY_HISTORY), Query Profile | +| DUAL table | Not needed; `SELECT 1;` is valid | +| Edition-based redefinition | Not applicable; use zero-downtime deployment via CREATE OR REPLACE | +| Oracle Data Pump (expdp/impdp) | Extract to files → Stage → COPY INTO | +| SQL*Loader | COPY INTO from staged files | +| UTL_FILE | Stages + COPY INTO / GET / PUT | +| DBMS_OUTPUT | SYSTEM$LOG() or RETURN | +| DBMS_LOB | VARCHAR/BINARY operations | +| DBMS_SQL | EXECUTE IMMEDIATE | +| %TYPE / %ROWTYPE | Not supported; use explicit type declarations | +| BULK COLLECT / FORALL | Rewrite as set-based SQL (preferred) or use RESULTSET | +| PRAGMA directives | Remove (not applicable) | +| Object types / nested tables / varrays | Flatten to native types; use VARIANT/ARRAY/OBJECT | + +## Common PL/SQL to Snowflake Scripting Patterns + +### DUAL Table +```sql +-- Oracle +SELECT SYSDATE FROM DUAL; +SELECT seq.NEXTVAL FROM DUAL; + +-- Snowflake +SELECT CURRENT_TIMESTAMP(); +SELECT my_schema.seq.NEXTVAL; +``` + +### Outer Join (+) Syntax → ANSI JOIN +```sql +-- Oracle (proprietary) +SELECT e.name, d.dept_name +FROM employees e, departments d +WHERE e.dept_id = d.dept_id(+); + +-- Snowflake (ANSI required) +SELECT e.name, d.dept_name +FROM employees e +LEFT JOIN departments d ON e.dept_id = d.dept_id; +``` + +### CONNECT BY → Recursive CTE +```sql +-- Oracle +SELECT employee_id, manager_id, LEVEL, SYS_CONNECT_BY_PATH(name, 
'/') AS path +FROM employees +START WITH manager_id IS NULL +CONNECT BY PRIOR employee_id = manager_id +ORDER SIBLINGS BY name; + +-- Snowflake +WITH RECURSIVE org AS ( + SELECT employee_id, manager_id, name, 1 AS lvl, '/' || name AS path + FROM employees WHERE manager_id IS NULL + UNION ALL + SELECT e.employee_id, e.manager_id, e.name, o.lvl + 1, o.path || '/' || e.name + FROM employees e JOIN org o ON e.manager_id = o.employee_id +) +SELECT employee_id, manager_id, lvl, path FROM org +ORDER BY path; +``` + +### DECODE → CASE or DECODE +```sql +-- Oracle +SELECT DECODE(status, 'A', 'Active', 'I', 'Inactive', 'Unknown') FROM items; + +-- Snowflake (DECODE is supported) +SELECT DECODE(status, 'A', 'Active', 'I', 'Inactive', 'Unknown') FROM items; +-- or standard CASE +SELECT CASE status WHEN 'A' THEN 'Active' WHEN 'I' THEN 'Inactive' ELSE 'Unknown' END FROM items; +``` + +### ROWNUM → ROW_NUMBER / LIMIT +```sql +-- Oracle (ROWNUM applied before ORDER BY!) +SELECT * FROM (SELECT * FROM employees ORDER BY salary DESC) WHERE ROWNUM <= 10; + +-- Snowflake +SELECT * FROM employees ORDER BY salary DESC LIMIT 10; +-- or with ROW_NUMBER for more control +SELECT * FROM employees QUALIFY ROW_NUMBER() OVER (ORDER BY salary DESC) <= 10; +``` + +### Cursor Loop +```sql +-- Oracle +FOR rec IN (SELECT col1, col2 FROM my_table) LOOP + DBMS_OUTPUT.PUT_LINE(rec.col1); +END LOOP; + +-- Snowflake Scripting (a cursor FOR loop opens and closes the cursor automatically) +DECLARE + c1 CURSOR FOR SELECT col1, col2 FROM my_table; + v_col1 VARCHAR; +BEGIN + FOR rec IN c1 DO + v_col1 := rec.col1; + -- process v_col1 (rec.col2 is also available) + END FOR; +END; +-- PREFERRED: Rewrite as set-based SQL whenever possible +``` + +### BULK COLLECT / FORALL → Set-Based SQL +```sql +-- Oracle +DECLARE + TYPE id_array IS TABLE OF NUMBER; + v_ids id_array; +BEGIN + SELECT employee_id BULK COLLECT INTO v_ids FROM employees WHERE dept_id = 10; + FORALL i IN 1..v_ids.COUNT + UPDATE audit_log SET processed = 'Y' WHERE emp_id = v_ids(i); +END; + 
+-- Snowflake: Rewrite as single set-based statement +UPDATE audit_log a +SET processed = 'Y' +FROM employees e +WHERE a.emp_id = e.employee_id AND e.dept_id = 10; +``` + +### Exception Handling +```sql +-- Oracle +BEGIN + INSERT INTO t VALUES (1); +EXCEPTION + WHEN DUP_VAL_ON_INDEX THEN + UPDATE t SET col = 'val' WHERE id = 1; + WHEN NO_DATA_FOUND THEN + NULL; + WHEN OTHERS THEN + RAISE; +END; + +-- Snowflake Scripting +BEGIN + INSERT INTO t VALUES (1); +EXCEPTION + WHEN OTHER THEN + LET err_code := SQLCODE; + LET err_msg := SQLERRM; + UPDATE t SET col = 'val' WHERE id = 1; +END; +-- Note: Oracle's predefined exceptions (DUP_VAL_ON_INDEX, NO_DATA_FOUND, ...) have no Snowflake equivalent. +-- Catch WHEN OTHER (or the built-in STATEMENT_ERROR / EXPRESSION_ERROR) and branch on SQLCODE. +-- Because Snowflake does not enforce UNIQUE/PK constraints, a duplicate insert raises no error at all. +``` + +### Dynamic SQL +```sql +-- Oracle +EXECUTE IMMEDIATE 'SELECT COUNT(*) FROM ' || v_table INTO v_count; + +-- Snowflake (alias the result column, then read it back from the last query's result) +EXECUTE IMMEDIATE 'SELECT COUNT(*) AS cnt FROM ' || v_table; +LET v_count := (SELECT cnt FROM TABLE(RESULT_SCAN(LAST_QUERY_ID()))); +``` + +### MERGE Statement +```sql +-- Oracle +MERGE INTO target t USING source s ON (t.id = s.id) +WHEN MATCHED THEN UPDATE SET t.val = s.val +WHEN NOT MATCHED THEN INSERT (id, val) VALUES (s.id, s.val); + +-- Snowflake (same ANSI syntax) +MERGE INTO target t USING source s ON t.id = s.id +WHEN MATCHED THEN UPDATE SET t.val = s.val +WHEN NOT MATCHED THEN INSERT (id, val) VALUES (s.id, s.val); +``` + +### Date/Time Functions +```sql +-- Oracle -- Snowflake +SYSDATE CURRENT_TIMESTAMP() or CURRENT_DATE() +SYSTIMESTAMP CURRENT_TIMESTAMP() +ADD_MONTHS(dt, 3) DATEADD('month', 3, dt) +MONTHS_BETWEEN(d1, d2) MONTHS_BETWEEN(d1, d2) -- same; DATEDIFF('month', d2, d1) counts whole-month boundaries only +LAST_DAY(dt) LAST_DAY(dt) -- same +NEXT_DAY(dt, 'FRIDAY') NEXT_DAY(dt, 'FR') +TRUNC(dt) DATE_TRUNC('day', dt) or dt::DATE +TRUNC(dt, 'MM') DATE_TRUNC('month', dt) +EXTRACT(YEAR FROM dt) EXTRACT(YEAR FROM dt) or YEAR(dt) +TO_DATE('01-JAN-2024', 'DD-MON-YYYY') TO_DATE('01-JAN-2024', 'DD-MON-YYYY') +TO_CHAR(dt, 'YYYY-MM-DD') TO_CHAR(dt, 'YYYY-MM-DD') +TO_TIMESTAMP(str, 'fmt') TO_TIMESTAMP(str, 'fmt') +``` + +### String Functions +```sql +-- Oracle 
-- Snowflake +NVL(expr, default) NVL(expr, default) or IFNULL / COALESCE +NVL2(expr, if_not_null, if_null) NVL2(expr, if_not_null, if_null) -- same +INSTR(str, 'sub') POSITION('sub', str) or CHARINDEX('sub', str) +SUBSTR(str, start, len) SUBSTR(str, start, len) -- same +LENGTH(str) LENGTH(str) -- same +LPAD(str, n, 'x') LPAD(str, n, 'x') -- same +RPAD(str, n, 'x') RPAD(str, n, 'x') -- same +REPLACE(str, 'old', 'new') REPLACE(str, 'old', 'new') -- same +REGEXP_SUBSTR(str, pattern) REGEXP_SUBSTR(str, pattern) -- same +REGEXP_REPLACE(str, pattern, repl) REGEXP_REPLACE(str, pattern, repl) -- same +LISTAGG(col, ',') LISTAGG(col, ',') -- same +``` + +## DDL Conversion Checklist + +1. **Remove** physical storage: `TABLESPACE`, `STORAGE`, `PCTFREE`, `INITRANS`, `LOGGING/NOLOGGING` +2. **Remove** all index DDL (B-tree, bitmap, function-based); consider CLUSTER BY for large tables +3. **Convert** `VARCHAR2` → `VARCHAR`, `NVARCHAR2` → `VARCHAR`, `NCHAR` → `CHAR` +4. **Convert** `DATE` → `TIMESTAMP_NTZ` (Oracle DATE includes time!) +5. **Convert** `CLOB/NCLOB` → `VARCHAR(16777216)`, `BLOB` → `BINARY(8388608)` +6. **Convert** `RAW/LONG RAW` → `BINARY` +7. **Convert** `XMLTYPE` → `VARIANT` +8. **Remove** Oracle hints (`/*+ ... */`) +9. **Remove** `STORAGE` and `LOB` storage clauses +10. **Replace** synonyms with fully qualified names or views +11. **Replace** DB Links with data sharing or external tables +12. **Note** constraints: PK, FK, UNIQUE defined but **not enforced**; move integrity checks to ETL +13. **Convert** sequences: syntax is similar; verify START WITH and INCREMENT BY + +## Data Extraction Methods + +| Method | Best For | +|--------|---------| +| Oracle Data Pump (expdp) | Large-scale export to dump files | +| SQL*Plus spooling | Simple CSV extraction | +| UTL_FILE package | File-based extraction | +| Third-party tools (Fivetran, etc.) | Managed CDC replication | +| SnowConvert AI (file-based) | DDL export scripts → conversion | + +## Common Pitfalls + +1. 
**Oracle DATE includes time**: `DATE` in Oracle stores both date and time; must map to `TIMESTAMP_NTZ`, not `DATE`. +2. **Empty string = NULL**: Oracle treats `''` as `NULL`; Snowflake treats `''` as empty string. Test NVL/COALESCE logic. +3. **Constraint enforcement**: Oracle enforces PK/FK/UNIQUE; Snowflake does not. Move integrity to ETL. +4. **PL/SQL packages**: No direct equivalent; decompose into separate procedures with shared state via tables. +5. **Named exceptions**: Oracle's predefined exceptions (`DUP_VAL_ON_INDEX`, `NO_DATA_FOUND`, etc.) have no Snowflake equivalent; handle errors with `WHEN OTHER` (or the built-in `STATEMENT_ERROR` / `EXPRESSION_ERROR`) and inspect `SQLCODE`. +6. **Autonomous transactions**: Not supported; redesign with separate transaction patterns. +7. **ROWNUM behavior**: `ROWNUM` is applied before `ORDER BY` in Oracle; use `LIMIT` or `ROW_NUMBER()` in Snowflake. +8. **Implicit commit on DDL**: Both Oracle and Snowflake auto-commit DDL, but verify transaction patterns. +9. **Sequence caching**: Snowflake sequences may have gaps; similar to Oracle but verify application assumptions. +10. **Oracle hints**: All removed; Snowflake auto-optimizes. Monitor Query Profile for performance issues. diff --git a/skills/agentic-migration-workshop/references/redshift.md b/skills/agentic-migration-workshop/references/redshift.md new file mode 100644 index 00000000..1d4b0113 --- /dev/null +++ b/skills/agentic-migration-workshop/references/redshift.md @@ -0,0 +1,627 @@ +# Amazon Redshift to Snowflake Reference + +Based on Snowflake's official Amazon Redshift migration guide. Intended for solution architects, data engineers, program managers and Snowflake solution partners. + +## Architecture Differences + +Amazon Redshift is a **cluster-based, massively parallel processing (MPP)** data warehouse where compute and storage are tightly coupled within a cluster. Performance tuning relies on node selection, distribution styles, sort keys and ongoing maintenance. Scaling typically requires resizing or rebuilding clusters, which introduces operational overhead. 
+ +Snowflake is a **cloud services-based platform** where compute, storage and cloud services are fully decoupled. Compute is delivered via independent virtual warehouses that can scale up or out instantly. Storage is centralized, automatically optimized and shared across all compute. Platform services (metadata, optimization, security, governance) are fully managed. + +| Area | Amazon Redshift | Snowflake | +|------|----------------|-----------| +| Architecture | Traditional MPP shared-nothing, cluster-based | Multi-cluster, shared-data with fully decoupled compute and storage | +| Scaling Model | Cluster resizing requires full data redistribution; cluster enters read-only mode for potentially hours | No data redistribution; compute scales independently and instantly | +| Storage/Compute Coupling | Fixed ratio; scaling one requires scaling both | Storage and compute scale independently based on workload demand | +| Compute Cost Model | Compute must remain running to access data; pay for idle compute unless data is unloaded and reloaded | Compute fully suspends without unloading data; true pay-for-use | +| Semi-Structured Data | Limited; JSON stored as strings, not optimized for large volumes; fields often extracted at load time | Native VARIANT supports JSON, Avro, XML with optimized storage and performance | +| Concurrency | Constrained by limited query slots managed through WLM queues | Elastic via separate or multi-cluster virtual warehouses without manual tuning | +| Performance Management | Manual tuning of distribution keys, sort keys and WLM configurations | Automatic data optimization and workload isolation; no distribution/sort key management | +| Security Management | Encryption and key management optional; customer-configured via AWS services | Always-on encryption with automatic key management and rotation | +| Metadata/File Management | Manual management of files, metadata and storage layout | Fully managed, transparent file and metadata management | 
+| Disaster Recovery | Single availability zone; depends on snapshots and customer-managed restore | Built-in multi-datacenter deployment managed by Snowflake | +| Operational Overhead | Significant ongoing administration and infrastructure management | No data warehouse management; platform services fully managed | + +### Redshift Operational Constraints + +Redshift is a tightly coupled, cluster-based system in which compute, storage and query coordination are bound to fixed infrastructure: + +- **Leader node bottleneck**: Queries coordinated by a leader node that can become a bottleneck at scale +- **Slice-based distribution**: Compute nodes subdivided into slices, requiring careful data distribution to avoid skew +- **Manual tuning**: Performance depends heavily on manual selection of distribution styles and sort keys +- **VACUUM overhead**: Must be run incrementally to maintain performance; resource-intensive operations +- **Free space requirements**: Requires 20%+ free space (or up to 3x the largest table) for VACUUM and re-sorting +- **Snapshot-based recovery**: Backup/recovery relies on periodic snapshots with restores required for failover + +Snowflake eliminates these constraints through decoupled compute, storage and services, enabling elastic scaling, automatic optimization and zero infrastructure management. + +### Scalability, Concurrency and Cost Model + +| Feature | Amazon Redshift | Snowflake | Value Proposition | +|---------|----------------|-----------|------------------| +| Scalability | Manual cluster resize; node-based | Instant elastic scaling of compute | Agility without downtime | +| Concurrency | Limited by cluster resources and WLM | Multi-cluster virtual warehouses | Predictable performance | +| Cost Model | Node-hour based | Pay-per-use (per-second compute) | Cost transparency and control | + +**Cost governance:** While Snowflake enables elastic scaling, cost efficiency requires intentional governance. 
Establish workload-specific warehouse sizing standards, enable auto-suspend/auto-resume, and use resource monitors to prevent unplanned consumption. + +### Important: Redshift ≠ PostgreSQL + +Although Amazon Redshift is derived from PostgreSQL, it is **not fully compatible** with PostgreSQL semantics. Teams should not rely on PostgreSQL behavior alone to validate Redshift logic when planning a Snowflake migration. + +## Migration Methodology (8 Phases) + +| Phase | Focus | +|-------|-------| +| 1. Planning and Design | Scope, strategy, team, budget, test plan, automated assessment | +| 2. Environments and Security | Warehouse setup, RBAC hierarchy (IAM→RBAC), case sensitivity, environment separation | +| 3. Database Code Conversion | Automated SQL translation (SnowConvert AI 96%+ conversion rate), SQL refactoring, procedural rewrite | +| 4. Data Migration and Ingestion | UNLOAD→S3→COPY INTO, Snowpipe/Streams/Tasks for ongoing, legacy batch simplification | +| 5. Reporting and Analytics | Tool repointing, workload isolation, semantic/behavioral validation, access modernization | +| 6. Data Validation and Testing | Structural + behavioral validation, numeric precision, timestamp handling, NULL semantics | +| 7. Deployment | Parallel run, cutover, Redshift cluster decommissioning | +| 8. Optimize and Run | Warehouse sizing, clustering keys, resource monitors, zero maintenance advantage | + +### Phase 1: Planning and Design + +**Strategic goals:** This migration is not simply a platform swap — it is a strategic modernization initiative designed to support long-term scalability, analytics and AI readiness. Snowflake is optimized for OLAP workloads, semi-structured data, data sharing and AI/ML use cases. + +**Common Redshift migration drivers:** Concurrency bottlenecks, cluster management overhead, scaling delays, maintenance requirements (VACUUM/ANALYZE), WLM tuning complexity. 
+ +**Automated assessment:** Run SnowConvert AI assessment first (free) to inventory Redshift objects, estimate conversion effort and identify potential incompatibilities early. + +**Document the existing environment:** +- Databases, schemas, tables, views, materialized views +- Distribution styles (DISTKEY) and sort keys (SORTKEY) +- Data ingestion pipelines (COPY jobs, Glue, Airflow, Fivetran, custom scripts) +- Downstream consumers (QuickSight, Tableau, Power BI, custom applications) +- Security model (IAM roles, database users, schema privileges) +- Rationalize data and decommission unused objects + +**Migration approach:** + +| Approach | Description | Recommendation | +|----------|-------------|---------------| +| Lift and shift | Minimal change for speed | Fastest but retains Redshift-specific constructs | +| Lift and adjust | Remove Redshift constructs, adopt Snowflake best practices | **Recommended** for faster time-to-value while reducing technical debt | +| Modernize | Re-architect pipelines and models | Highest value but highest effort | + +**Project logistics:** +- Prioritize datasets for early migration (quick wins) +- Define Dev/QA/Prod environments +- Identify migration team and responsibilities +- Establish timelines, budget and success criteria (e.g., Redshift cluster decommissioning date) + +**Test plan:** Define repeatable, automated testing for schema validation, data reconciliation, transformation logic validation and performance benchmarking. Automation is critical to reducing risk and shortening parallel run duration. 
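
As one illustration of the data reconciliation step in the test plan above, the sketch below computes a simple table fingerprint that could be run unchanged on both Redshift and Snowflake and then compared. The table `sales.orders` and its columns (`order_id`, `amount`, `order_date`) are hypothetical placeholders, not objects from the guide.

```sql
-- Hypothetical table and columns; substitute real objects.
-- Run the same statement on Redshift and on Snowflake, then compare the two result rows.
SELECT
    COUNT(*)                 AS row_count,       -- missing or duplicated rows
    COUNT(DISTINCT order_id) AS distinct_keys,   -- key uniqueness drift
    SUM(amount)              AS amount_total,    -- numeric precision / rounding drift
    MIN(order_date)          AS earliest_order,  -- date range truncation
    MAX(order_date)          AS latest_order,
    COUNT(*) - COUNT(amount) AS amount_nulls     -- NULL semantics differences
FROM sales.orders;
```

Aggregates like these catch missing rows and gross value drift cheaply, which suits automated, repeatable runs. They do not prove row-level equality, so critical tables may still warrant a column-level comparison.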
+ +### Phase 2: Environments and Security + +**Security model shift (IAM → RBAC):** +- Redshift security is optional and heavily integrated with AWS IAM and security groups; encryption/key rotation require explicit configuration +- Snowflake security is always-on and fully managed: end-to-end encryption, automatic key rotation, native RBAC, built-in data masking and governance +- Redshift relies on AWS IAM integration and database-level permissions → Snowflake uses centralized, hierarchical RBAC +- Grant privileges to roles, not users; assign roles to users +- Integrate with enterprise IdPs (Okta, Entra ID) via SSO and SCIM + +**Warehouse setup best practices:** +- Separate warehouses by environment (dev/QA/prod) +- Separate warehouses by workload (ELT, BI, ad hoc) +- Enable auto-suspend and auto-resume +- Use resource monitors for cost governance + +**System catalog migration:** +- `pg_*` system tables have no direct equivalent in Snowflake +- Use `INFORMATION_SCHEMA` and `ACCOUNT_USAGE` views instead +- ACLs must be reinterpreted using Snowflake RBAC, not system catalogs + +**Case sensitivity:** +- Redshift lowercases unquoted identifiers by default; Snowflake uppercases unquoted identifiers +- Quoted identifiers are case-sensitive in Snowflake +- **Recommendation:** Set `QUOTED_IDENTIFIERS_IGNORE_CASE = TRUE` to minimize BI tool compatibility issues +- Avoid quoted identifiers in Snowflake long-term; use the parameter during transition only + +### Phase 3: Database Code Conversion + +**SnowConvert AI achieves 96%+ average automated conversion rate** for supported SQL and DDL constructs, enabling teams to focus on targeted refactoring, testing and optimization rather than bulk code translation. 
+ +**Areas requiring manual review:** +- Timestamp and time zone semantics +- Numeric precision and rounding behavior +- Business logic embedded in stored procedures +- Performance optimization and warehouse sizing +- Validation of analytic and reporting workloads + +**High-frequency Redshift SQL incompatibilities:** + +**Identifier casing:** Redshift lowercases unquoted identifiers; Snowflake uppercases them. Queries referencing quoted lowercase identifiers may fail. Avoid quoted identifiers; use `QUOTED_IDENTIFIERS_IGNORE_CASE = TRUE` during transition. + +**Date/timestamp arithmetic:** PostgreSQL-style timestamp arithmetic (`timestamp - timestamp`, `timestamp - interval`, `TRUNC(timestamp)`) is not supported. Use `DATEDIFF()`, `DATEADD()`, and explicit casting to DATE. + +**Time zone handling:** Redshift implicitly assumes timestamps without time zones are UTC. Snowflake supports `TIMESTAMP_NTZ`, `TIMESTAMP_TZ`, and `TIMESTAMP_LTZ`. **Recommendation:** Normalize all ingested timestamps to UTC, store as `TIMESTAMP_NTZ`, perform localization in downstream BI tools. + +**SQL function and casting differences:** Redshift allows flexible numeric parsing without explicit precision/scale. Snowflake requires explicit precision and scale when a format mask is used, and fails fast when the format does not exactly match the input string. SnowConvert AI flags these cases, but manual review is required for correctness, especially in financial datasets. 
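
The date arithmetic and casting rewrites described above can be sketched as follows. This is an illustration under assumptions, not converter output: the `orders` table and its `order_ts`, `shipped_ts`, and `amount_text` columns are hypothetical, and the format mask is only an example.

```sql
-- Redshift original (PostgreSQL-style), shown for comparison only:
--   SELECT order_ts - INTERVAL '7 days', TRUNC(order_ts) FROM orders;

-- Snowflake rewrite; orders, order_ts, shipped_ts, and amount_text are hypothetical names.
SELECT
    DATEADD('day', -7, order_ts)               AS week_before,      -- replaces timestamp - interval
    DATEDIFF('second', order_ts, shipped_ts)   AS seconds_to_ship,  -- replaces timestamp - timestamp
    order_ts::DATE                             AS order_date,       -- replaces TRUNC(timestamp)
    TO_NUMBER(amount_text, '999999.99', 10, 2) AS amount            -- format mask plus explicit precision and scale
FROM orders;
```

`DATEDIFF` counts date-part boundaries crossed rather than elapsed intervals, so pick the finest part the business logic needs (seconds here) and validate the results against Redshift output.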
+ +**Unsupported or changed SQL constructs:** + +| Redshift Construct | Snowflake Equivalent | +|-------------------|---------------------| +| `SELECT INTO` | `CREATE TABLE AS SELECT` | +| `ALTER TABLE … APPEND` | `INSERT INTO … SELECT` | +| `REFRESH MATERIALIZED VIEW` | Not required (automatic in Snowflake) | +| `VARCHAR(MAX)` | `VARCHAR` (max length by default) | +| `IS TRUE / IS FALSE` | Boolean predicates (`NOT col`, `col`) | +| `IN TIMEZONE` | `CONVERT_TIMEZONE()` | +| `COALESCE(expr)` | Requires at least two arguments | + +**Stored procedures:** Redshift PL/pgSQL must be rewritten using Snowflake Scripting (SQL), JavaScript procedures, or Snowpark Python. + +### Phase 4: Data Migration and Ingestion + +**Redshift data layout considerations:** Redshift data layouts are often optimized for cluster-based execution and may reflect historical VACUUM operations, fragmented sort order or skewed distribution. During migration: +- Distribution styles and sort keys should **not** be preserved +- Data extracted from Redshift may not be physically ordered +- Snowflake automatically optimizes data layout during ingestion +- No post-load maintenance (VACUUM/ANALYZE) required + +**Initial data transfer (common approach):** UNLOAD from Redshift to S3 (PARQUET), then COPY INTO Snowflake. Also: SnowConvert AI data migration accelerators, external tables for staged validation. + +**Modern ingestion patterns:** +- **Snowpipe**: Continuous ingestion from S3 +- **Streams + Tasks**: CDC and orchestration +- **dbt**: Transformations with incremental materializations +- Legacy Redshift batch jobs can often be simplified or eliminated + +### Phase 5: Reporting and Analytics + +**Tool repointing:** Update JDBC/ODBC drivers, repoint BI tools and semantic layers, validate queries/dashboards/scheduled reports, verify authentication and RBAC. + +**Workload isolation advantage:** In Redshift, reporting competes with batch processing for cluster resources via WLM queues. 
Snowflake eliminates this through dedicated virtual warehouses, multi-cluster warehouses for burst concurrency, and independent scaling. + +**Semantic and behavioral differences to validate:** +- Case sensitivity of quoted identifiers +- Timestamp and time zone handling +- NULL behavior in aggregate functions +- Numeric precision and rounding differences +- Implicit casting differences in filters and joins +- Validate dashboards for visual parity, calculated fields and business KPIs (not just row counts) + +**Performance:** Right-size warehouses for reporting concurrency, monitor with Query Profile, separate ad hoc from production dashboards, evaluate clustering keys only for very large fact tables. + +**Access modernization:** Map Redshift IAM-based access to Snowflake RBAC, ensure roles align with functional reporting groups, validate row/column-level access controls, review dynamic data masking policies. + +**Post-migration opportunities:** Direct querying of semi-structured data via VARIANT, secure data sharing, integration with AI/ML via Snowpark, consolidation of reporting/engineering/AI on a single platform. + +### Phase 6: Data Validation and Testing + +**Structural validation:** Row counts, aggregates, schema comparison. + +**Behavioral validation (critical for Redshift):** Teams must validate behavioral equivalence beyond structure: +- Numeric precision validation for `TO_NUMBER` +- NULL-handling validation for `GREATEST`/`LEAST` (Snowflake returns NULL if any argument is NULL; Redshift returns non-NULL value) +- Timestamp/timezone validation (`TIMESTAMP_NTZ` vs `TIMESTAMPTZ`) +- Hash consistency checks when replacing `FNV_HASH` with `HASH()` +- BI/reporting layer validation (differences surface only in dashboards/visualizations) + +**Validation methods:** Aggregate comparisons, hash-based validation, business metric validation, targeted query benchmarking. Automate wherever possible. 
Passing structural validation does not guarantee behavioral equivalence — business-critical queries must be validated directly. + +### Phase 7: Deployment + +**Parallel run:** Run Redshift and Snowflake simultaneously, validate pipelines and analytics, minimize overlap through automation. + +**Cutover readiness checklist:** +- Final data reconciliation and validation complete +- BI tools and downstream consumers validated against Snowflake +- Ingestion and upstream writes to Redshift disabled +- Snowflake resource monitors and warehouse sizing controls enabled +- Redshift decommissioning plan reviewed and approved + +**Cutover:** Disable Redshift ingestion → redirect consumers to Snowflake → decommission Redshift clusters. + +### Phase 8: Optimize and Run + +**Zero maintenance advantage:** Eliminate VACUUM, ANALYZE, distribution/sort key tuning. + +**Performance and cost optimization:** +- Right-size warehouses (primary cost/performance lever) +- Use multi-cluster warehouses for concurrency +- Apply clustering keys for very large tables (>1TB) with frequent range filters +- Monitor with Query Profile and resource monitors + +**Redshift migration lessons learned:** +- Do not migrate distribution styles or sort keys +- Rewrite timestamp arithmetic early in the project +- Normalize timestamps to UTC +- Avoid quoted identifiers +- Explicitly validate numeric precision and rounding +- Expect significantly reduced operational overhead post-migration + +## Data Type Mapping + +| Redshift | Snowflake | Notes | +|----------|-----------|-------| +| SMALLINT / INT2 | SMALLINT | | +| INTEGER / INT / INT4 | INTEGER | | +| BIGINT / INT8 | BIGINT | | +| DECIMAL(p,s) / NUMERIC(p,s) | NUMBER(p,s) | | +| REAL / FLOAT4 | FLOAT | Single-precision | +| DOUBLE PRECISION / FLOAT8 / FLOAT | FLOAT | Double-precision | +| BOOLEAN / BOOL | BOOLEAN | | +| CHAR(n) / CHARACTER(n) / NCHAR(n) / BPCHAR | CHAR(n) | | +| VARCHAR(n) / CHARACTER VARYING(n) / NVARCHAR(n) / TEXT | VARCHAR(n) | Redshift 
max 65535; Snowflake max 16MB | +| DATE | DATE | | +| TIMESTAMP / TIMESTAMP WITHOUT TIME ZONE | TIMESTAMP_NTZ | | +| TIMESTAMPTZ / TIMESTAMP WITH TIME ZONE | TIMESTAMP_TZ | | +| TIME / TIME WITHOUT TIME ZONE | TIME | | +| TIMETZ / TIME WITH TIME ZONE | TIME | Snowflake TIME does not store timezone; consider TIMESTAMP_TZ | +| SUPER | VARIANT | Semi-structured type | +| HLLSKETCH | Not direct | Use APPROX_COUNT_DISTINCT() | +| GEOMETRY | GEOMETRY | | +| GEOGRAPHY | GEOGRAPHY | | +| VARBYTE / VARBINARY / BINARY VARYING | VARBINARY | | + +## Feature Mapping + +| Redshift Feature | Snowflake Equivalent | +|-----------------|---------------------| +| DISTSTYLE EVEN/KEY/ALL | Not needed (Snowflake auto-distributes) | +| DISTKEY | Not needed | +| SORTKEY (compound) | CLUSTER BY (similar intent, automatic maintenance) | +| SORTKEY (interleaved) | CLUSTER BY (Snowflake handles multi-column pruning) | +| ENCODE compression | Not needed (Snowflake auto-compresses) | +| BACKUP YES/NO | Not applicable; remove | +| WLM (Workload Management) | Warehouses (multi-cluster, auto-scaling) | +| Concurrency scaling | Multi-cluster warehouse auto-scaling | +| Redshift Spectrum | External tables on S3/Azure/GCS | +| Late-binding views | Standard views (Snowflake views are always late-binding) | +| Materialized views | MATERIALIZED VIEW or Dynamic Tables | +| COPY from S3 | COPY INTO from S3 stage (via storage integration) | +| UNLOAD to S3 | COPY INTO @stage (to S3/Azure/GCS) | +| Stored procedures (PL/pgSQL) | Snowflake Scripting or JavaScript procedures | +| UDFs (SQL) | Snowflake UDFs (SQL, JavaScript, Python, Java) | +| UDFs (Python) | Snowflake Python UDFs | +| Lambda UDFs | External functions | +| Federated queries | External tables or data sharing | +| Data sharing (Redshift) | Snowflake Data Sharing (native, cross-account) | +| RA3 managed storage | Not applicable (Snowflake decouples natively) | +| Snapshot / backup | Time Travel + Fail-Safe | +| Row-level security | Row Access 
Policies | +| Column-level access | Column-level masking policies | +| Leader node functions | Not applicable; all functions run on compute | +| System tables (STL, STV, SVL, SVV) | INFORMATION_SCHEMA / ACCOUNT_USAGE views | +| VACUUM | Not needed (Snowflake auto-manages) | +| ANALYZE | Not needed (Snowflake auto-manages statistics) | +| Query monitoring rules | Resource monitors + query tag-based monitoring | +| Cross-database queries | Cross-database queries supported natively | + +## Common Redshift to Snowflake Patterns + +### COPY Command +```sql +-- Redshift +COPY my_table FROM 's3://mybucket/data/' +IAM_ROLE 'arn:aws:iam::123456789:role/MyRole' +FORMAT AS CSV +DELIMITER ',' +IGNOREHEADER 1 +DATEFORMAT 'auto' +TIMEFORMAT 'auto' +REGION 'us-west-2' +MAXERROR 100 +BLANKSASNULL +EMPTYASNULL +ACCEPTINVCHARS; + +-- Snowflake +CREATE OR REPLACE STAGE my_s3_stage + URL = 's3://mybucket/data/' + STORAGE_INTEGRATION = my_s3_integration; + +COPY INTO my_table + FROM @my_s3_stage + FILE_FORMAT = ( + TYPE='CSV' + SKIP_HEADER=1 + FIELD_DELIMITER=',' + EMPTY_FIELD_AS_NULL=TRUE + NULL_IF=('NULL','') + ERROR_ON_COLUMN_COUNT_MISMATCH=FALSE + ) + ON_ERROR='CONTINUE'; +``` + +### UNLOAD Command +```sql +-- Redshift +UNLOAD ('SELECT * FROM my_table') +TO 's3://mybucket/unload/' +IAM_ROLE 'arn:aws:iam::123456789:role/MyRole' +FORMAT AS PARQUET +ALLOWOVERWRITE +PARALLEL ON +MAXFILESIZE 256 MB; + +-- Snowflake +COPY INTO @my_s3_stage/unload/ + FROM my_table + FILE_FORMAT = (TYPE='PARQUET') + MAX_FILE_SIZE = 268435456 + OVERWRITE = TRUE; +``` + +### JSON Handling (SUPER Type → VARIANT) +```sql +-- Redshift +SELECT JSON_EXTRACT_PATH_TEXT(json_col, 'key1', 'key2') FROM my_table; +SELECT JSON_EXTRACT_ARRAY_ELEMENT_TEXT(json_col, 0) FROM my_table; +SELECT json_col.key1.key2 FROM my_table; -- PartiQL syntax (Redshift SUPER) + +-- Snowflake (dot notation) +SELECT json_col:key1.key2::STRING FROM my_table; +SELECT json_col[0]::STRING FROM my_table; +-- or function-based +SELECT 
GET_PATH(json_col, 'key1.key2')::STRING FROM my_table; +``` + +### SUPER Type Querying → VARIANT + FLATTEN +```sql +-- Redshift (SUPER type with PartiQL) +SELECT c.customer_id, o.order_id +FROM customers c, c.orders o +WHERE o.amount > 100; + +-- Snowflake (VARIANT + LATERAL FLATTEN) +SELECT c.customer_id, f.value:order_id::INT AS order_id +FROM customers c, +LATERAL FLATTEN(INPUT => c.orders) f +WHERE f.value:amount::NUMBER > 100; +``` + +### Identity Columns +```sql +-- Redshift +CREATE TABLE t (id INT IDENTITY(1,1), name VARCHAR(100)); +-- Or: id BIGINT GENERATED BY DEFAULT AS IDENTITY(1,1) + +-- Snowflake +CREATE TABLE t (id INT AUTOINCREMENT START 1 INCREMENT 1, name VARCHAR(100)); +-- Or: id INT IDENTITY(1,1) -- Snowflake also supports IDENTITY keyword +``` + +### Approximate Functions +```sql +-- Redshift +SELECT APPROXIMATE COUNT(DISTINCT user_id) FROM events; + +-- Snowflake +SELECT APPROX_COUNT_DISTINCT(user_id) FROM events; +``` + +### Spectrum (External Tables) +```sql +-- Redshift Spectrum +CREATE EXTERNAL SCHEMA spectrum_schema +FROM DATA CATALOG +DATABASE 'mydb' +IAM_ROLE 'arn:aws:iam::123456789:role/MyRole'; + +SELECT * FROM spectrum_schema.external_table; + +-- Snowflake +CREATE OR REPLACE EXTERNAL TABLE external_table + WITH LOCATION = @my_s3_stage/path/ + FILE_FORMAT = (TYPE = 'PARQUET') + AUTO_REFRESH = TRUE; + +SELECT * FROM external_table; +``` + +### Window Functions with Default Frame +```sql +-- Redshift: aggregate window functions with ORDER BY require an explicit frame clause +-- Snowflake: aggregates with ORDER BY and no frame default to RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW +-- RANGE treats ORDER BY ties as peers, so running totals can differ from a ROWS frame +-- Carry the explicit frame from the Redshift query into Snowflake and verify edge cases with ties +``` + +### Stored Procedure (PL/pgSQL → Snowflake Scripting) +```sql +-- Redshift (PL/pgSQL) +CREATE OR REPLACE PROCEDURE update_status(p_id INTEGER, p_status VARCHAR) +LANGUAGE plpgsql +AS $$ +DECLARE + v_count INTEGER; +BEGIN + SELECT COUNT(*) INTO v_count FROM orders WHERE id = 
p_id; + IF v_count > 0 THEN + UPDATE orders SET status = p_status WHERE id = p_id; + ELSE + RAISE EXCEPTION 'Order not found: %', p_id; + END IF; +END; +$$; + +-- Snowflake Scripting +CREATE OR REPLACE PROCEDURE update_status(p_id INTEGER, p_status VARCHAR) + RETURNS VARCHAR + LANGUAGE SQL + EXECUTE AS CALLER +AS +BEGIN + LET v_count INTEGER := (SELECT COUNT(*) FROM orders WHERE id = :p_id); + IF (v_count > 0) THEN + UPDATE orders SET status = :p_status WHERE id = :p_id; + RETURN 'Updated'; + ELSE + RETURN 'Order not found: ' || :p_id; + END IF; +END; +``` + +### Date/Time Functions +```sql +-- Redshift -- Snowflake +GETDATE() CURRENT_TIMESTAMP() +SYSDATE CURRENT_TIMESTAMP() +DATE_TRUNC('month', dt) DATE_TRUNC('month', dt) -- same +DATEADD(day, 7, dt) DATEADD('day', 7, dt) -- quote the part +DATEDIFF(day, d1, d2) DATEDIFF('day', d1, d2) -- quote the part +EXTRACT(year FROM dt) EXTRACT(year FROM dt) -- same +TO_CHAR(dt, 'YYYY-MM-DD') TO_CHAR(dt, 'YYYY-MM-DD') -- same +CONVERT_TIMEZONE('US/Eastern', ts) CONVERT_TIMEZONE('US/Eastern', ts) -- same +ADD_MONTHS(dt, 3) DATEADD('month', 3, dt) +LAST_DAY(dt) LAST_DAY(dt) -- same +MONTHS_BETWEEN(d1, d2) DATEDIFF('month', d2, d1) +``` + +### String Functions +```sql +-- Redshift -- Snowflake +LEN(str) LENGTH(str) +CHARINDEX(sub, str) CHARINDEX(sub, str) -- same +POSITION(sub IN str) POSITION(sub IN str) -- same +REPLACE(str, old, new) REPLACE(str, old, new) -- same +CONCAT(a, b) CONCAT(a, b) or a || b -- same +REGEXP_SUBSTR(str, pattern) REGEXP_SUBSTR(str, pattern) -- same +STRTOL(str, base) Custom UDF or TRY_TO_NUMBER with base conversion +LISTAGG(col, delim) LISTAGG(col, delim) -- same +NVL(a, b) NVL(a, b) or COALESCE(a, b) -- same +NVL2(expr, val1, val2) NVL2(expr, val1, val2) -- same +BTRIM(str) TRIM(str) +ENCODE(col, 'base64') BASE64_ENCODE(col) +DECODE(col, 'base64') BASE64_DECODE_STRING(col) +``` + +### System/Admin Functions +```sql +-- Redshift -- Snowflake +PG_LAST_COPY_ID() LAST_QUERY_ID() +PG_LAST_COPY_COUNT() 
RESULT_SCAN(LAST_QUERY_ID()) +SVL_QUERY_SUMMARY / STL_QUERY ACCOUNT_USAGE.QUERY_HISTORY +SVV_TABLE_INFO INFORMATION_SCHEMA.TABLES +STV_BLOCKLIST Not applicable (auto-managed) +STV_TBL_PERM Not applicable +SVV_EXTERNAL_SCHEMAS SHOW EXTERNAL TABLES +PG_CATALOG tables INFORMATION_SCHEMA views +``` + +## DDL Conversion Checklist + +1. **Remove** `DISTSTYLE` (EVEN/KEY/ALL), `DISTKEY(col)` +2. **Remove** `SORTKEY(col1, col2)` and `INTERLEAVED SORTKEY`; consider CLUSTER BY for large tables +3. **Remove** `ENCODE` compression directives (auto/bytedict/lzo/zstd/etc.) +4. **Remove** `BACKUP YES/NO` +5. **Convert** `IDENTITY(seed,step)` → `AUTOINCREMENT START seed INCREMENT step` +6. **Convert** `SUPER` → `VARIANT` +7. **Convert** `TIMETZ` → `TIME` or `TIMESTAMP_TZ` (assess timezone needs) +8. **Convert** `HLLSKETCH` → remove; use `APPROX_COUNT_DISTINCT()` +9. **Replace** `CREATE EXTERNAL SCHEMA` → external tables with stages +10. **Replace** PL/pgSQL procedures → Snowflake Scripting +11. **Remove** `VACUUM` and `ANALYZE` statements +12. **Note** constraints: PK, FK, UNIQUE defined but **not enforced** in Snowflake + +## Migration via S3 (Recommended Path) + +The most common Redshift → Snowflake migration path uses S3 as an intermediary: + +1. **UNLOAD** from Redshift to PARQUET files in S3: + ```sql + UNLOAD ('SELECT * FROM schema.table') + TO 's3://migration-bucket/table/' + IAM_ROLE 'arn:aws:iam::123456789:role/RedshiftUnloadRole' + FORMAT AS PARQUET + ALLOWOVERWRITE; + ``` + +2. **Create stage** in Snowflake pointing to S3: + ```sql + CREATE OR REPLACE STAGE migration_stage + URL = 's3://migration-bucket/' + STORAGE_INTEGRATION = my_s3_integration; + ``` + +3. 
**COPY INTO** Snowflake: + ```sql + COPY INTO target_table + FROM @migration_stage/table/ + FILE_FORMAT = (TYPE = 'PARQUET') + MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE; + ``` + +**Requirements:** +- S3 bucket in same region as Redshift cluster (minimize transfer costs) +- IAM Role for Redshift: `s3:PutObject`, `s3:GetObject`, `s3:ListBucket` +- Storage integration or IAM User for Snowflake: `s3:GetObject`, `s3:ListBucket` + +## Data Extraction Methods + +| Method | Best For | +|--------|---------| +| UNLOAD to S3 (PARQUET) | Primary method; best performance, schema preservation | +| UNLOAD to S3 (CSV) | Legacy or simple tables | +| Redshift Data API | Programmatic extraction in small batches | +| Fivetran / Airbyte | Managed CDC replication | +| AWS DMS | Change data capture for ongoing replication | +| SnowConvert AI (Redshift) | Automated DDL/SQL conversion + S3-based data migration | + +## Common Pitfalls + +1. **Distribution/sort keys**: Simply remove; don't try to replicate distribution logic. Snowflake handles it automatically. +2. **VACUUM/ANALYZE**: Remove all maintenance commands; Snowflake auto-manages. +3. **Leader-node-only functions**: Some Redshift functions only run on the leader node; verify Snowflake equivalents exist. +4. **SUPER vs VARIANT**: PartiQL syntax (`table.array[0].field`) must be rewritten to Snowflake dot notation (`col:array[0].field`). +5. **Timestamp precision**: Redshift default TIMESTAMP is microseconds; Snowflake TIMESTAMP is nanoseconds. Verify comparisons. +6. **TIMETZ**: Snowflake TIME does not store timezone offset; use TIMESTAMP_TZ if timezone needed. +7. **Redshift-specific SQL**: Functions like `STRTOL()`, `APPROXIMATE COUNT(DISTINCT)` need rewriting. +8. **Spectrum tables**: Must be recreated as Snowflake external tables with proper stages. +9. **WLM queues**: Translate queue-based workload isolation to separate Snowflake warehouses. +10. 
**Case sensitivity**: Redshift lowercases unquoted identifiers; Snowflake uppercases them. Quoted identifiers are case-sensitive in Snowflake. +11. **Constraint enforcement**: Redshift treats PRIMARY KEY, UNIQUE, and FOREIGN KEY constraints as informational (used by the query planner, not enforced), and Snowflake standard tables likewise enforce only NOT NULL. Duplicates that Redshift tolerated will load unchanged, so move uniqueness and referential checks to ETL. +12. **GREATEST/LEAST NULL handling**: Snowflake returns NULL if any argument is NULL; Redshift returns the non-NULL value. Validate and apply COALESCE if required. +13. **Numeric precision**: Redshift allows flexible numeric parsing without explicit precision/scale; Snowflake requires explicit precision/scale with format masks and fails fast on mismatch. +14. **Timestamp arithmetic**: PostgreSQL-style `timestamp - timestamp` and `timestamp - interval` not supported; rewrite using DATEDIFF/DATEADD. +15. **FNV_HASH → HASH**: Hash functions produce different outputs; validate hash consistency post-conversion. +16. **pg_* system tables**: No direct equivalent; use INFORMATION_SCHEMA and ACCOUNT_USAGE views. + +## High-Risk SQL Conversions + +| Redshift Pattern | Snowflake Change | Risk | Mitigation | +|-----------------|-----------------|------|-----------| +| `TO_NUMBER(str, format)` | Add precision and scale | High | Silent truncation or runtime error; add explicit precision/scale | +| `SYSDATE` | `CURRENT_TIMESTAMP` | Low | Direct replacement; validate timestamp type in comparisons | +| `ARRAY_UPPER()` | `ARRAY_SIZE()` | Medium | Rewrite using ARRAY_SIZE | +| `ISNULL()` | `IFNULL()` | Low | Replace function name; validate boolean expressions | +| `FNV_HASH()` | `HASH()` | Medium | Validate hash consistency post-conversion | +| `GREATEST()`/`LEAST()` | Same, but NULL behavior differs | **High** | Snowflake returns NULL if any arg is NULL; validate and apply COALESCE | + +## Appendix: Feature and SQL Mapping + +### Architecture and Platform + +| Redshift Feature | Snowflake Equivalent | Notes | +|-----------------|---------------------|-------| +| Cluster | Account + Virtual Warehouses | Decoupled 
compute | +| WLM | Multi-cluster Warehouses | Automatic concurrency | +| Spectrum | External Tables | Native support | + +### Performance and Maintenance + +| Redshift Feature | Snowflake Equivalent | Notes | +|-----------------|---------------------|-------| +| DISTKEY | N/A | Not required | +| SORTKEY | Clustering Keys (optional) | Use sparingly | +| VACUUM | N/A | Fully managed | +| ANALYZE | N/A | Automatic | + +### Procedural Logic + +| Redshift Feature | Snowflake Equivalent | Notes | +|-----------------|---------------------|-------| +| PL/pgSQL Procedures | Snowflake Scripting / JS / Python | Must be rewritten | +| Temporary Tables | Temporary Tables | Session-scoped | + +## Professional Services and Partners + +- **Snowflake Professional Services**: Accelerated Redshift migrations leveraging SnowConvert AI and migration accelerators. Convert Redshift SQL, refactor PL/pgSQL stored procedures, streamline data movement from S3 to Snowflake. Modernized architectures eliminating DISTKEY, SORTKEY, WLM queues and VACUUM/ANALYZE. Support from assessment through secure cutover and Redshift cluster decommissioning. +- **Global Solution Partners**: Code and pipeline conversion (Redshift SQL, materialized views, stored procedures, ETL/ELT → Snowflake-native patterns using dbt, Snowflake Scripting, Snowpark). Data engineering and AI/ML enablement. End-to-end delivery including validation, performance tuning, FinOps, governance and compliance. +- Contact: Snowflake account team or Snowflake Community diff --git a/skills/agentic-migration-workshop/references/sqlserver.md b/skills/agentic-migration-workshop/references/sqlserver.md new file mode 100644 index 00000000..71c0967b --- /dev/null +++ b/skills/agentic-migration-workshop/references/sqlserver.md @@ -0,0 +1,734 @@ +# SQL Server to Snowflake Reference + +Based on Snowflake's official SQL Server migration guide. Intended for solution architects, program managers, and migration partners. 
+ +## Architecture Differences + +SQL Server is **server-centric**: a single, fixed machine (physical or virtual) that tightly couples storage and compute. Snowflake is **cloud-centric**: a logical entity that decouples storage, compute, and cloud services. + +| Aspect | SQL Server | Snowflake | +|--------|-----------|-----------| +| Architecture | Monolithic; tightly coupled compute & storage | Decoupled compute, storage, and cloud services | +| Storage | Local/networked files (SAN, NAS) | Centralized, shared object storage; proprietary columnar format | +| Compute | Fixed server resources (CPU, Memory, I/O) | Elastic, on-demand virtual warehouses (independent, decoupled) | +| Concurrency | Contention-prone; dependent on server config/budget | Workload isolation; independent virtual warehouses eliminate contention | +| Scaling | Vertical (bigger server) or horizontal (Always On); requires hardware upgrades and downtime | Horizontal/automatic; scale up/down/out instantly, no downtime | +| Maintenance | DBA-managed (index rebuilds, stats, filegroups, DBCC) | Fully managed; no indexes, no partitions, no UPDATE STATISTICS | +| Constraints | PK, FK, UNIQUE, CHECK all enforced | Only NOT NULL enforced; PK/FK/UNIQUE are metadata-only | +| Cost Model | Fixed/license-based (CAPEX) | Consumption-based pay-per-use (OPEX) | + +### Value Proposition Summary + +| Feature | SQL Server | Snowflake | Value | +|---------|-----------|-----------|-------| +| Scalability | Vertical; requires hardware upgrades and downtime | Horizontal/automatic; instant scaling | Agility: scales to meet demand without manual intervention | +| Concurrency | Contention-prone; dependent on server budget | Workload isolation via independent virtual warehouses | Performance: different workloads run in parallel without impact | +| Cost Model | Fixed/license-based (CAPEX) | Consumption-based pay-per-use (OPEX) | Financial: shift from fixed cost to variable cost | + +## Migration Methodology (8 Phases) + +| 
Phase | Focus | +|-------|-------| +| 1. Planning and Design | Scope, strategy, team, budget, test plan, Snowflake prep | +| 2. Environments and Security | Warehouse setup, RBAC hierarchy, case sensitivity, environment separation | +| 3. Database Code Conversion | T-SQL conversion (SnowConvert AI automates 50-70%), stored procedure rewrite, feature remapping | +| 4. Data Migration and Ingestion | Initial data transfer, Snowpipe/Streams/Tasks for ongoing, SSIS modernization | +| 5. Reporting and Analytics | Tool repointing (Power BI, Tableau, etc.), connection string updates, metadata model changes | +| 6. Data Validation and Testing | Row counts, hash comparison, functional testing, performance benchmarking | +| 7. Deployment | Parallel run, cutover, decommission SQL Server | +| 8. Optimize and Run | Warehouse sizing, clustering keys, resource monitors, communicate success | + +### Phase 1: Planning and Design + +**Strategic goals:** This migration is not merely a cost-saving measure — it's a strategic move to prepare for advanced analytics and AI. Snowflake supports OLAP workloads, lakehouses, open/structured/semi-structured/unstructured data. + +**Automated assessment:** Run SnowConvert AI assessment first (free) for data-driven scope and complexity estimates. + +**Document the existing solution:** +- Database objects: List all databases, schemas, objects. Rationalize and decommission unnecessary data sets. Avoid migrating `sys` catalog tables/views. 
+- Data sources/processes: ETL/ELT tools (SSIS, Informatica), reporting (Power BI, Tableau), data science/ML +- Security: Roles, users, granted permissions; sensitive data sets and provisioning processes + +**Migration approach selection:** +- **Lift and shift**: Migrate as-is (Snowflake recommends this for first iteration) +- **Lift and adjust**: Minor reengineering +- **Complete redesign**: Rework broken/inadequate processes +- **Snowflake recommends minimal reengineering first** — changes to data structures impact downstream tools and extend timelines + +**Project logistics:** +- Prioritize data sets for quick wins with minimal effort; use SnowConvert AI for dependency documentation +- Document development environments (Dev/QA/Prod) and CI/CD processes +- Identify migration team: developer, QA engineer, business owner, project manager +- Define deadlines, budget (including Snowflake compute costs), and success/failure criteria + +### Phase 2: Environments and Security + +**Environment best practice — separate databases by environment:** +- Create dedicated databases per environment: `DEV_SALES_DB`, `QA_SALES_DB`, `PROD_SALES_DB` +- Create schemas matching SQL Server schemas: `PROD_SALES_DB.dbo_schema` +- Use naming convention `[ENVIRONMENT]_[DATABASE]` for warehouses and roles: `ANALYTICS_WH_DEV`, `DATA_ENGINEER_ROLE_PROD` + +**Security model shift (RBAC):** +- SQL Server uses DAC + RBAC mix → Snowflake uses pure hierarchical RBAC +- SQL Server Login + User → unified Snowflake User object +- Best practice: authenticate via SSO/OAuth, not SQL logins +- Prioritize automated provisioning via IdP (Okta/Entra ID) with SCIM + +**Role hierarchy:** + +| Role Type | Description | Example | +|-----------|-------------|--------| +| Access Roles | Low-level; specific permissions on objects | `WH_ANALYTICS_USAGE`, `DB_SALES_READ` | +| Functional Roles | High-level; aligned with business functions, granted Access Roles | `DATA_ANALYST_ROLE`, `DATA_ENGINEER_ROLE` | + +**Case 
sensitivity (critical):** +- SQL Server is typically case-insensitive (depending on collation) +- Snowflake folds unquoted identifiers to uppercase and resolves them case-insensitively; double-quoted identifiers are case-sensitive +- Reporting tools that auto-generate double-quoted lowercase or mixed-case SQL will fail when the objects are stored in uppercase +- **Solution:** Set `QUOTED_IDENTIFIERS_IGNORE_CASE = TRUE` to resolve compatibility errors + +**Security checklist:** +- Use future grants for auto-applying permissions to new objects +- Enable MFA for all human users, especially privileged roles +- Establish audit processes for role/user creation, deletion, privilege changes + +### Phase 3: Database Code Conversion + +**SnowConvert AI reduces manual conversion by 50-70%:** +- Converts DDL, DML, and procedural T-SQL to Snowflake SQL +- Handles complex syntax differences (DATETIME→TIMESTAMP_NTZ, proprietary T-SQL constructs) +- After conversion, remaining EWIs are analyzed by Migration Assistant (Cortex AI) for fixes + +**T-SQL feature remapping:** + +| SQL Server Feature | Conversion Action | Snowflake Equivalent | +|-------------------|------------------|---------------------| +| Indexes/Partitioning | Remove all | N/A (micro-partitions + clustering keys) | +| Constraints | Not enforced (except NOT NULL); externalize validation | Metadata-only | +| Stored Procedures | Rewrite as Snowflake Scripting (SQL), JavaScript, or Snowpark Python procedures | Snowflake stored procedures | +| UDF DML Operations | Convert to Stored Procedure (UDFs cannot do DML) | N/A | +| Temporal Tables | Replace with Streams (CDC) + Tasks for automation | Streams and Tasks | +| Error Handling | Custom UDF needed (e.g., ERROR_SEVERITY()) | N/A for built-in functions | + +### Phase 4: Data Migration and Ingestion + +**Initial data transfer options:** + +| Tool/Method | Use Case | Volume | +|-------------|----------|--------| +| SnowConvert AI | Optimized transfer with migration and validation | Large-scale (TB/PB) | +| Physical appliances | AWS Snowball, Azure Data Box, Google Transfer Appliance | Petabytes of 
on-premises data | +| BCP / SnowSQL | Export to compressed files (50-250MB), PUT to stage, COPY INTO | Small to medium (BCP not supported by all SQL Server editions) | + +**Continuous data ingestion (ELT pattern):** +- **Snowpipe**: Automated, continuous ingestion from cloud storage (minutes/seconds) +- **Streams + Tasks**: CDC and procedural orchestration; replace/improve SQL Server loading +- **dbt**: Transform step with incremental materializations, tests, documentation, lineage +- **Zero-copy cloning**: Move data within Snowflake (QA→Dev) without additional storage costs + +**SSIS migration strategies:** + +| Strategy | Goal | Target | Recommendation | +|----------|------|--------|---------------| +| Modernize | Rewrite entire package into cloud-native tools | ADF, dbt, Snowflake Procedures | **Recommended** for 100% cloud architecture. SnowConvert AI converts SSIS/Informatica→dbt | +| Refactor | Keep SSIS control flow, enable high-speed bulk loading | SSIS with updated components | Use specialized connector (e.g., CData) for bulk load; direct ODBC is too slow | + +### Phase 5: Reporting and Analytics + +- Update all connection strings, ODBC/JDBC drivers, authentication to Snowflake +- **Power BI**: SnowConvert AI offers automatic connection repointing +- **Metadata-layer tools** (Cognos, Business Objects): Update metadata model to reflect Snowflake schema +- Compare tool output and evaluate performance after repointing + +### Phase 6: Data Validation and Testing + +**Validation methods:** +- Row count checks, distinct value counts, null counts, numerical metrics +- **MD5 hash comparison**: Create hash across key columns in SQL Server; generate corresponding hash in Snowflake +- SnowConvert AI data migration feature automates hash-based validation +- Functional testing: Validate refactored T-SQL (now Streams/Tasks/Python procedures) produces same results + +**Critical platform differences to understand during testing:** +- Collation behavior (case sensitivity) 
+- Floating point arithmetic differences +- Date/time precision differences (DATETIME 3.33ms → TIMESTAMP_NTZ nanosecond) +- Business users must understand these for UAT + +### Phase 7: Deployment + +**Parallel run strategy:** +- Run SQL Server and Snowflake simultaneously until migration is validated +- High confidence from automated testing allows minimal parallel run window +- Cutover only after: initial data migrated, processes keep data current, all testing complete, all tools redirected +- **Cutover**: Turn off SQL Server data processes, revoke user/tool access +- **Define cutover plan early** — lack of clarity creates parallel environment overhead + +### Phase 8: Optimize and Run + +**Zero management advantage:** +- Remove SQL Server commands: `DBCC`, locking hints, `FOR REPLICATION`, `UPDATE STATISTICS` — all unnecessary +- No managing physical table partitions or indexes + +**Performance optimization:** +- **Warehouse sizing**: Primary cost/performance lever. Right-size continuously; separate instances for workload isolation +- **Auto-suspend**: Set aggressive auto-suspend (60 seconds) on all warehouses +- **Resource monitors**: Track usage; take action at limits +- **Clustering keys**: For very large tables (>1TB) with frequent range filters +- **Query Profile**: Debug and optimize slow/inefficient queries + +**Communicate success:** Document actual benefits vs. 
captured outcomes from planning phase + +## Data Type Mapping + +| SQL Server | Snowflake | Notes | +|------------|-----------|-------| +| TINYINT | TINYINT | SQL Server range is 0-255; Snowflake TINYINT is a synonym for NUMBER(38,0) and does not enforce that range | +| SMALLINT | SMALLINT | | +| INT / INTEGER | INTEGER | | +| BIGINT | BIGINT | | +| DECIMAL(p,s) / NUMERIC(p,s) | NUMBER(p,s) | | +| FLOAT(n) | FLOAT | | +| REAL | FLOAT | | +| MONEY | NUMBER(19,4) | | +| SMALLMONEY | NUMBER(10,4) | | +| BIT | BOOLEAN / NUMBER | Use NUMBER for value-to-value migration; BOOLEAN for ternary logic (TRUE/FALSE/NULL) | +| CHAR(n) | CHAR(n) | | +| VARCHAR(n) | VARCHAR(n) | VARCHAR(MAX) → VARCHAR(16777216) | +| NCHAR(n) | CHAR(n) | Snowflake native UTF-8; N-prefix types unnecessary | +| NVARCHAR(n) | VARCHAR(n) | NVARCHAR(MAX) → VARCHAR(16777216) | +| TEXT | VARCHAR(16777216) | Deprecated in SQL Server | +| NTEXT | VARCHAR(16777216) | Deprecated in SQL Server | +| BINARY(n) | BINARY(n) | | +| VARBINARY(n) | VARBINARY(n) | VARBINARY(MAX) → BINARY(8388608) | +| IMAGE | BINARY(8388608) | Deprecated in SQL Server | +| DATE | DATE | | +| TIME | TIME | | +| DATETIME | TIMESTAMP_NTZ(3) | SQL Server datetime is not ANSI-compliant. TIMESTAMP_NTZ(3) is the recommended explicit mapping. 
Precision: 3.33ms → Snowflake ns | +| DATETIME2 | TIMESTAMP_NTZ | Time-zone-unaware | +| SMALLDATETIME | TIMESTAMP_NTZ | Minute precision | +| DATETIMEOFFSET | TIMESTAMP_TZ | Preserves the stored UTC offset; consistent with the DDL Conversion Checklist | +| UNIQUEIDENTIFIER | VARCHAR(36) | Store as string; generate with UUID_STRING() | +| XML | VARIANT | Parse XML content; use XMLGET() for querying | +| SQL_VARIANT | VARIANT | | +| GEOGRAPHY | GEOGRAPHY | | +| GEOMETRY | GEOMETRY | | +| HIERARCHYID | VARCHAR | Serialize to string; process with UDFs | +| ROWVERSION / TIMESTAMP | Not needed | Use Snowflake Streams for change tracking; SQL Server TIMESTAMP is not a date/time type | +| INTERVAL MINUTE TO SECOND | Not supported | INTERVAL data types not supported; use DATEDIFF/DATEADD functions instead | +| TABLE (type) | TEMPORARY TABLE | | +| CURSOR (type) | CURSOR in Snowflake Scripting | | +| SYSNAME | VARCHAR(128) | System name type | + +## Feature Mapping + +| SQL Server Feature | Snowflake Equivalent | +|-------------------|---------------------| +| Clustered index | CLUSTER BY (optional, auto-maintained) | +| Non-clustered indexes | Not needed (auto micro-partition pruning) | +| Columnstore indexes | Not needed (Snowflake is columnar natively) | +| Filtered indexes | Not needed; rely on micro-partition pruning | +| Included columns | Not applicable | +| Filegroups / partitions | Micro-partitions (automatic) | +| Computed columns | Virtual columns not supported; use views or pre-compute in ETL | +| Schema-bound objects | Not applicable; views are always late-binding | +| Linked servers | External tables, data sharing, or external functions | +| SQL Server Agent jobs | Snowflake Tasks (with CRON/interval schedules) | +| SSIS packages | Snowpipe, Tasks, dbt, or external ETL tools; re-architect, don't repoint | +| SSRS reports | Decommission; rebuild in Power BI, Tableau, or Streamlit | +| SSAS cubes | Snowflake aggregation + BI layer | +| Always On / Availability Groups | Built-in replication and 
failover | +| Temporal tables (system-versioned) | Streams + Time Travel | +| Change Data Capture (CDC) | Streams; use CDC from transaction log for incremental replication | +| Change Tracking | Streams | +| T-SQL stored procedures | Snowflake Scripting or JavaScript procedures | +| T-SQL functions (scalar) | Snowflake UDFs (SQL, JavaScript, Python, Java) | +| T-SQL functions (table-valued) | Snowflake UDTFs or views | +| CLR stored procedures | JavaScript/Python procedures (full rewrite required) | +| Triggers (DML) | Streams + Tasks (event-driven pattern) | +| Triggers (DDL) | Not supported; use governance policies or alerts | +| Service Broker | External messaging + Tasks | +| TDE (Transparent Data Encryption) | Built-in (always encrypted at rest and in transit) | +| Dynamic Data Masking | Dynamic Data Masking Policies | +| Row-Level Security | Row Access Policies | +| Always Encrypted | Not direct equivalent; use masking policies | +| Contained databases | Not applicable (Snowflake is SaaS) | +| Replication (transactional/merge) | Snowflake replication / data sharing | +| Log shipping | Not needed (built-in durability + Fail-Safe) | +| Resource Governor | Warehouses + resource monitors | +| Query Store | Query History view (ACCOUNT_USAGE.QUERY_HISTORY) | +| Database snapshots | Time Travel (AT / BEFORE) | +| Synonyms | Fully qualified names or wrapper views | +| User-defined types (UDTs) | Flatten to native Snowflake types | +| Table variables | TEMPORARY TABLE or Snowflake Scripting arrays | +| Cursors | Eliminate; rewrite as set-based SQL (cursors are anti-pattern in Snowflake) | +| System databases (master, msdb, tempdb, model) | No equivalent; exclude from migration scope | + +## Common T-SQL to Snowflake Patterns + +### TRY...CATCH → BEGIN...EXCEPTION +```sql +-- SQL Server +BEGIN TRY + INSERT INTO t VALUES (1, 'test'); +END TRY +BEGIN CATCH + SELECT ERROR_MESSAGE() AS msg, ERROR_NUMBER() AS num; +END CATCH; + +-- Snowflake Scripting +BEGIN + INSERT INTO t 
VALUES (1, 'test'); +EXCEPTION + WHEN OTHER THEN + LET msg := SQLERRM; + LET code := SQLCODE; + RETURN msg; +END; +``` + +### CROSS APPLY / OUTER APPLY → LATERAL +```sql +-- SQL Server +SELECT o.order_id, d.product_id +FROM orders o +CROSS APPLY ( + SELECT TOP 1 product_id FROM order_details + WHERE order_id = o.order_id ORDER BY amount DESC +) d; + +-- Snowflake +SELECT o.order_id, d.product_id +FROM orders o, +LATERAL ( + SELECT product_id FROM order_details + WHERE order_id = o.order_id ORDER BY amount DESC LIMIT 1 +) d; + +-- OUTER APPLY → LEFT JOIN LATERAL +SELECT o.order_id, d.product_id +FROM orders o +LEFT JOIN LATERAL ( + SELECT product_id FROM order_details + WHERE order_id = o.order_id ORDER BY amount DESC LIMIT 1 +) d; +``` + +### STRING_SPLIT → SPLIT_TO_TABLE / FLATTEN +```sql +-- SQL Server +SELECT value FROM STRING_SPLIT('a,b,c', ','); + +-- Snowflake (option 1) +SELECT value FROM TABLE(SPLIT_TO_TABLE('a,b,c', ',')); +-- Snowflake (option 2) +SELECT value FROM LATERAL FLATTEN(INPUT => SPLIT('a,b,c', ',')); +``` + +### FOR XML PATH (String Aggregation) → LISTAGG +```sql +-- SQL Server +SELECT dept_id, STUFF(( + SELECT ',' + name FROM employees e2 WHERE e2.dept_id = d.dept_id + FOR XML PATH('') +), 1, 1, '') AS names +FROM departments d; + +-- Snowflake +SELECT dept_id, LISTAGG(name, ',') WITHIN GROUP (ORDER BY name) AS names +FROM employees +GROUP BY dept_id; +``` + +### STRING_AGG → LISTAGG +```sql +-- SQL Server (2017+) +SELECT dept_id, STRING_AGG(name, ',') WITHIN GROUP (ORDER BY name) AS names +FROM employees GROUP BY dept_id; + +-- Snowflake (identical syntax) +SELECT dept_id, LISTAGG(name, ',') WITHIN GROUP (ORDER BY name) AS names +FROM employees GROUP BY dept_id; +``` + +### Temp Tables +```sql +-- SQL Server +CREATE TABLE #temp (id INT, val VARCHAR(50)); +SELECT * INTO #temp2 FROM source_table; + +-- Snowflake +CREATE TEMPORARY TABLE temp (id INT, val VARCHAR(50)); +CREATE TEMPORARY TABLE temp2 AS SELECT * FROM source_table; +``` + +### Table 
Variables → Temporary Tables +```sql +-- SQL Server +DECLARE @results TABLE (id INT, name VARCHAR(100)); +INSERT INTO @results SELECT id, name FROM employees; + +-- Snowflake +CREATE TEMPORARY TABLE results (id INT, name VARCHAR(100)); +INSERT INTO results SELECT id, name FROM employees; +``` + +### Dynamic SQL +```sql +-- SQL Server +DECLARE @sql NVARCHAR(MAX) = N'SELECT * FROM ' + QUOTENAME(@tablename); +EXEC sp_executesql @sql, N'@param INT', @param = 42; + +-- Snowflake Scripting +LET sql_text := 'SELECT * FROM ' || :tablename; +EXECUTE IMMEDIATE :sql_text; +``` + +### Identity and Sequences +```sql +-- SQL Server +CREATE TABLE t (id INT IDENTITY(1,1) PRIMARY KEY, name VARCHAR(100)); +INSERT INTO t (name) VALUES ('Alice'); +SELECT SCOPE_IDENTITY(); + +-- Snowflake +CREATE TABLE t (id INT AUTOINCREMENT START 1 INCREMENT 1, name VARCHAR(100)); +INSERT INTO t (name) VALUES ('Alice'); +-- No SCOPE_IDENTITY(); use LAST_QUERY_ID() + RESULT_SCAN if needed +``` + +### IF OBJECT_ID → DROP IF EXISTS / CREATE OR REPLACE +```sql +-- SQL Server +IF OBJECT_ID('dbo.my_table', 'U') IS NOT NULL DROP TABLE dbo.my_table; +CREATE TABLE dbo.my_table (...); + +-- Snowflake +CREATE OR REPLACE TABLE my_table (...); +-- or +DROP TABLE IF EXISTS my_table; +CREATE TABLE my_table (...); +``` + +### MERGE Statement +```sql +-- SQL Server +MERGE INTO target t +USING source s ON t.id = s.id +WHEN MATCHED THEN UPDATE SET t.val = s.val +WHEN NOT MATCHED THEN INSERT (id, val) VALUES (s.id, s.val) +WHEN NOT MATCHED BY SOURCE THEN DELETE +OUTPUT $action, inserted.*, deleted.*; + +-- Snowflake +MERGE INTO target t +USING source s ON t.id = s.id +WHEN MATCHED THEN UPDATE SET t.val = s.val +WHEN NOT MATCHED THEN INSERT (id, val) VALUES (s.id, s.val); +-- Note: WHEN NOT MATCHED BY SOURCE and OUTPUT clause not supported +-- Use Streams to capture changes instead of OUTPUT +``` + +### OUTPUT Clause → Streams +```sql +-- SQL Server (capture affected rows) +DELETE FROM orders OUTPUT deleted.* INTO 
@deleted_orders WHERE status = 'cancelled'; + +-- Snowflake: No OUTPUT clause. Use Streams for change capture: +CREATE STREAM orders_changes ON TABLE orders; +DELETE FROM orders WHERE status = 'cancelled'; +SELECT * FROM orders_changes WHERE METADATA$ACTION = 'DELETE'; +``` + +### IIF and CHOOSE +```sql +-- SQL Server +SELECT IIF(score >= 70, 'Pass', 'Fail') FROM exams; +SELECT CHOOSE(status, 'Draft', 'Active', 'Closed') FROM items; + +-- Snowflake +SELECT IFF(score >= 70, 'Pass', 'Fail') FROM exams; -- IIF → IFF +SELECT CASE status WHEN 1 THEN 'Draft' WHEN 2 THEN 'Active' WHEN 3 THEN 'Closed' END FROM items; +``` + +### TRY_CONVERT / TRY_CAST +```sql +-- SQL Server +SELECT TRY_CONVERT(INT, '123abc'); -- Returns NULL +SELECT TRY_CAST('2024-01-01' AS DATE); + +-- Snowflake +SELECT TRY_CAST('123abc' AS INT); -- Returns NULL +SELECT TRY_CAST('2024-01-01' AS DATE); +``` + +### OPENJSON → PARSE_JSON / LATERAL FLATTEN +```sql +-- SQL Server +SELECT j.[key], j.value +FROM OPENJSON('{"a":1,"b":2}') j; + +-- Snowflake +SELECT f.key, f.value +FROM LATERAL FLATTEN(INPUT => PARSE_JSON('{"a":1,"b":2}')) f; +``` + +### JSON_VALUE / JSON_QUERY → Dot Notation +```sql +-- SQL Server +SELECT JSON_VALUE(data, '$.customer.name') FROM orders; +SELECT JSON_QUERY(data, '$.items') FROM orders; + +-- Snowflake (dot notation) +SELECT data:customer.name::STRING FROM orders; +SELECT data:items FROM orders; +``` + +### PIVOT / UNPIVOT +```sql +-- SQL Server +SELECT * FROM sales_data +PIVOT (SUM(amount) FOR quarter IN ([Q1],[Q2],[Q3],[Q4])) p; + +-- Snowflake +SELECT * FROM sales_data +PIVOT (SUM(amount) FOR quarter IN ('Q1','Q2','Q3','Q4')) p; +-- Note: Snowflake uses single quotes, not brackets +``` + +### TOP → LIMIT +```sql +-- SQL Server +SELECT TOP 10 * FROM employees ORDER BY hire_date DESC; +SELECT TOP 10 PERCENT * FROM employees; +SELECT TOP 5 WITH TIES * FROM employees ORDER BY salary DESC; + +-- Snowflake +SELECT * FROM employees ORDER BY hire_date DESC LIMIT 10; +-- TOP PERCENT: No 
direct equivalent; use window functions: +SELECT * FROM ( + SELECT *, NTILE(10) OVER (ORDER BY hire_date) AS tile FROM employees +) WHERE tile = 1; +-- TOP WITH TIES: use QUALIFY +SELECT * FROM employees QUALIFY RANK() OVER (ORDER BY salary DESC) <= 5; +``` + +### Date/Time Functions +```sql +-- SQL Server -- Snowflake +GETDATE() CURRENT_TIMESTAMP() +GETUTCDATE() CONVERT_TIMEZONE('UTC', CURRENT_TIMESTAMP()) +SYSDATETIME() CURRENT_TIMESTAMP() +DATEADD(day, 7, @dt) DATEADD('day', 7, dt) +DATEDIFF(day, @start, @end) DATEDIFF('day', start_dt, end_dt) +DATENAME(month, @dt) MONTHNAME(dt) or TO_CHAR(dt, 'MMMM') +DATEPART(year, @dt) YEAR(dt) or DATE_PART('year', dt) +FORMAT(@dt, 'yyyy-MM-dd') TO_CHAR(dt, 'YYYY-MM-DD') +EOMONTH(@dt) LAST_DAY(dt) +ISDATE('2024-01-01') TRY_TO_DATE('2024-01-01') IS NOT NULL +SWITCHOFFSET(@dto, '+05:30') CONVERT_TIMEZONE('+05:30', dto) +TODATETIMEOFFSET(@dt, '-08:00') CONVERT_TIMEZONE('UTC', '-08:00', dt) +``` + +### String Functions +```sql +-- SQL Server -- Snowflake +LEN(str) LENGTH(str) +DATALENGTH(str) OCTET_LENGTH(str) +CHARINDEX('abc', str) CHARINDEX('abc', str) -- same +PATINDEX('%pattern%', str) REGEXP_INSTR(str, 'pattern') +REPLACE(str, 'old', 'new') REPLACE(str, 'old', 'new') -- same +STUFF(str, start, len, repl) INSERT(str, start, len, repl) +REPLICATE(str, n) REPEAT(str, n) +REVERSE(str) REVERSE(str) -- same +QUOTENAME(name) '"' || name || '"' +CONCAT_WS(',', a, b, c) CONCAT_WS(',', a, b, c) -- same (Snowflake supports) +STRING_ESCAPE(str, 'json') No direct equivalent; use REPLACE chains +TRANSLATE(str, 'abc', 'xyz') TRANSLATE(str, 'abc', 'xyz') -- same +TRIM(str) TRIM(str) -- same +``` + +### System Variables and Functions +```sql +-- SQL Server -- Snowflake +@@ROWCOUNT SQLROWCOUNT (in Snowflake Scripting) +@@ERROR SQLCODE (in Snowflake Scripting) +@@IDENTITY / SCOPE_IDENTITY() Not available; use sequences or RESULT_SCAN +@@SERVERNAME CURRENT_ACCOUNT() +@@VERSION CURRENT_VERSION() +DB_NAME() CURRENT_DATABASE() +SCHEMA_NAME() 
CURRENT_SCHEMA() +USER_NAME() / SUSER_SNAME() CURRENT_USER() +NEWID() UUID_STRING() +SET NOCOUNT ON Remove (not needed) +SET ANSI_NULLS ON Remove (Snowflake is ANSI-compliant) +SET QUOTED_IDENTIFIER ON Remove (always on in Snowflake) +PRINT 'message' SYSTEM$LOG('info', 'message') or remove +RAISERROR('msg', 16, 1) Use Snowflake exception handling +THROW 50000, 'msg', 1 Use Snowflake exception handling +WAITFOR DELAY '00:00:05' SYSTEM$WAIT(5) +``` + +### Transaction Patterns +```sql +-- SQL Server +BEGIN TRANSACTION; + UPDATE accounts SET balance = balance - 100 WHERE id = 1; + UPDATE accounts SET balance = balance + 100 WHERE id = 2; +COMMIT; + +-- Snowflake +BEGIN; + UPDATE accounts SET balance = balance - 100 WHERE id = 1; + UPDATE accounts SET balance = balance + 100 WHERE id = 2; +COMMIT; +-- Note: Snowflake auto-commits DDL. DML within a procedure uses explicit transactions. +``` + +### Window Functions (mostly compatible) +```sql +-- SQL Server +SELECT *, ROW_NUMBER() OVER (PARTITION BY dept ORDER BY salary DESC) rn FROM emp; +-- Snowflake: identical syntax + +-- SQL Server specific: WITHIN GROUP +SELECT PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY salary) OVER (PARTITION BY dept) FROM emp; +-- Snowflake: identical syntax +``` + +### Common Table Expressions (CTEs) +```sql +-- SQL Server recursive CTE +WITH cte AS ( + SELECT id, parent_id, name, 0 AS level FROM org WHERE parent_id IS NULL + UNION ALL + SELECT o.id, o.parent_id, o.name, c.level + 1 FROM org o JOIN cte c ON o.parent_id = c.id +) +SELECT * FROM cte OPTION (MAXRECURSION 100); + +-- Snowflake +WITH RECURSIVE cte AS ( + SELECT id, parent_id, name, 0 AS level FROM org WHERE parent_id IS NULL + UNION ALL + SELECT o.id, o.parent_id, o.name, c.level + 1 FROM org o JOIN cte c ON o.parent_id = c.id +) +SELECT * FROM cte; +-- Note: Add RECURSIVE keyword; no OPTION (MAXRECURSION); Snowflake has built-in depth limit +``` + +### Stored Procedure Conversion Patterns +```sql +-- SQL Server procedure with output 
parameters +CREATE PROCEDURE dbo.GetEmployeeCount + @dept_id INT, + @count INT OUTPUT +AS +BEGIN + SET NOCOUNT ON; + SELECT @count = COUNT(*) FROM employees WHERE department_id = @dept_id; +END; + +-- Snowflake: No output parameters; return value or result set +CREATE OR REPLACE PROCEDURE get_employee_count(dept_id INT) + RETURNS INT + LANGUAGE SQL + EXECUTE AS CALLER +AS +BEGIN + LET cnt INT := (SELECT COUNT(*) FROM employees WHERE department_id = :dept_id); + RETURN cnt; +END; +``` + +### Cursor Elimination (Preferred Approach) +```sql +-- SQL Server (row-by-row cursor) +DECLARE @id INT, @name VARCHAR(100); +DECLARE cur CURSOR FOR SELECT id, name FROM employees; +OPEN cur; +FETCH NEXT FROM cur INTO @id, @name; +WHILE @@FETCH_STATUS = 0 BEGIN + UPDATE audit_log SET last_seen = GETDATE() WHERE emp_id = @id; + FETCH NEXT FROM cur INTO @id, @name; +END; +CLOSE cur; DEALLOCATE cur; + +-- Snowflake: Rewrite as set-based SQL (preferred) +UPDATE audit_log a +SET last_seen = CURRENT_TIMESTAMP() +FROM employees e +WHERE a.emp_id = e.id; +-- Cursors are a severe performance anti-pattern in Snowflake; always prefer set-based +``` + +## DDL Conversion Checklist + +When converting SQL Server DDL to Snowflake: + +1. **Remove** physical storage clauses: `ON [filegroup]`, `TEXTIMAGE_ON`, `WITH (PAD_INDEX = ...)`, `FILLFACTOR` +2. **Remove** index definitions: `CLUSTERED`, `NONCLUSTERED`, `COLUMNSTORE` (Snowflake is columnar) +3. **Convert** `NVARCHAR/NCHAR` → `VARCHAR/CHAR` (Snowflake is native UTF-8) +4. **Convert** `DATETIME/DATETIME2` → `TIMESTAMP_NTZ`, `DATETIMEOFFSET` → `TIMESTAMP_TZ` +5. **Convert** `UNIQUEIDENTIFIER` → `VARCHAR(36)` +6. **Convert** `BIT` → `BOOLEAN` +7. **Convert** `MONEY/SMALLMONEY` → `NUMBER(19,4)` / `NUMBER(10,4)` +8. **Convert** `IDENTITY(seed,increment)` → `AUTOINCREMENT START seed INCREMENT increment` +9. **Remove** `SET NOCOUNT ON`, `SET ANSI_NULLS ON`, `SET QUOTED_IDENTIFIER ON` +10. **Remove** `GO` batch separators +11. 
**Replace** `dbo.` schema prefix → use fully qualified `DB.SCHEMA.TABLE` +12. **Convert** bracket-quoted identifiers `[name]` → double-quoted `"name"` or remove if unnecessary +13. **Note** constraints: PK, FK, UNIQUE defined but **not enforced** in Snowflake; move integrity checks to ETL + +## Migration Tool Ecosystem + +| Tool | Use Case | +|------|----------| +| SnowConvert AI | Automated DDL/DML/T-SQL conversion (free) | +| BCP (Bulk Copy Program) | Extract large tables to flat files for staging | +| Snowpipe | Continuous ingestion from cloud storage | +| Streams + Tasks | Replace triggers, CDC, SQL Server Agent | +| dbt | Replace SSIS transformation logic (ELT pattern) | +| Airflow / Azure Data Factory | Replace complex SQL Server Agent job chains | +| Snowflake Data Sharing | Replace linked servers and replication | + +## Data Extraction Methods + +| Method | Best For | +|--------|---------| +| BCP utility | Large table bulk export to CSV/delimited files | +| SSIS export | When SSIS is already in use for data movement | +| SnowConvert AI direct streaming | SQL Server → Snowflake real-time transfer | +| Azure Data Factory | Azure-centric environments; built-in Snowflake connector | +| Fivetran / Airbyte | Managed CDC replication | + +## Common Pitfalls + +1. **Constraint enforcement**: SQL Server enforces PK/FK/UNIQUE; Snowflake does not. Externalize validation logic; reengineer load processes to prevent duplicate/orphaned records. +2. **IDENTITY gaps**: Snowflake AUTOINCREMENT does not guarantee gap-free sequences. +3. **Case sensitivity (critical)**: SQL Server is case-insensitive by default; Snowflake folds unquoted identifiers to uppercase (case-insensitive) but treats double-quoted identifiers as case-sensitive. Reporting tools that auto-generate double-quoted SQL will fail when the quoted name does not match the stored name. Set `QUOTED_IDENTIFIERS_IGNORE_CASE = TRUE`. +4. **NULL handling in strings**: SQL Server can concatenate NULL + string = NULL or string depending on settings; Snowflake NULL + string = NULL. +5. 
**DATETIME precision**: SQL Server DATETIME has 3.33ms resolution; map to `TIMESTAMP_NTZ(3)` explicitly to preserve precision semantics. +6. **Implicit conversions**: SQL Server has extensive implicit type conversion; Snowflake is stricter. Add explicit CAST/TRY_CAST. +7. **Collation and floating point**: SQL Server supports per-column collation and its own floating-point arithmetic; test string comparisons and numeric calculations thoroughly. Business users must understand these for UAT. +8. **Empty string vs NULL**: SQL Server treats '' as empty string; Snowflake treats '' as '' (not NULL, unlike Oracle). +9. **System databases**: Never attempt to migrate `master`, `msdb`, `tempdb`, `model`. Avoid migrating `sys` prefix catalog tables/views. +10. **SSRS connectivity**: SSRS → Snowflake is problematic; plan to decommission SSRS and rebuild reports. +11. **UDF DML operations**: SQL Server UDFs can perform DML; Snowflake UDFs cannot. Convert DML-performing UDFs to stored procedures. +12. **CURRENT_TIMESTAMP into DATETIME columns**: CURRENT_TIMESTAMP() returns TIMESTAMP_LTZ; cannot insert into DATETIME (TIMESTAMP_NTZ) without session parameter. +13. **SSIS ODBC performance**: Direct ODBC from SSIS to Snowflake is too slow for bulk loads; use specialized connectors (CData) or modernize to dbt. +14. **Zero management misconception**: Remove all SQL Server system commands (DBCC, locking hints, FOR REPLICATION, UPDATE STATISTICS) — they are incompatible and unnecessary in Snowflake. + +## Utilities Mapping + +| SQL Server Utility | Snowflake Equivalent | Notes | +|-------------------|---------------------|-------| +| MSSQL-CLI / SQLCMD | SnowSQL | Command-line client for SQL execution, DDL/DML operations | +| BCP (Bulk Copy Program) | COPY INTO | COPY INTO supports AVRO, Parquet, JSON, CSV, etc. 
BCP also used for extraction to cloud staging | +| SQL Server Management Studio | Snowsight | Web-based UI for queries, worksheets, monitoring | +| SQL Server Agent | Snowflake Tasks | CRON/interval-based scheduling with DAG support | +| SQL Server Profiler / Extended Events | Query History (ACCOUNT_USAGE) | Snowflake Query Profile for visual query analysis | + +## Professional Services and Partners + +- **Snowflake Professional Services**: Accelerated migration using SnowConvert AI, high-performing architectures with Snowpark and Adaptive Compute, POC and implementation from planning to cutover +- **Global Solution Partners**: Code conversion (ETL/stored procedures/reports), AI/ML enablement (Snowpark, Cortex), end-to-end delivery (validation, performance tuning, FinOps, compliance) +- Contact: Snowflake sales team or Snowflake Community diff --git a/skills/agentic-migration-workshop/references/teradata.md b/skills/agentic-migration-workshop/references/teradata.md new file mode 100644 index 00000000..806cbd1b --- /dev/null +++ b/skills/agentic-migration-workshop/references/teradata.md @@ -0,0 +1,372 @@ +# Teradata to Snowflake Reference + +## Architecture Differences + +| Aspect | Teradata | Snowflake | +|--------|----------|-----------| +| Architecture | Shared-nothing MPP; tightly coupled compute & storage | Decoupled compute, storage, and cloud services | +| Data distribution | Primary Index (PI) hash-based distribution across AMPs | Automatic micro-partitioning; no user-managed distribution | +| Storage | DBA-managed; data distributed across AMPs | Centralized object storage; automatic management | +| Compute | Fixed nodes; scaling requires hardware changes | Elastic virtual warehouses; instant scale up/down/out | +| Concurrency | Workload management (TASM/TIWM) with priority classes | Warehouses with multi-cluster auto-scaling | +| Statistics | Manual `COLLECT STATISTICS` | Automatic; no user intervention needed | +| Maintenance | DBA tasks: stats, space 
management, skew monitoring | Fully managed; all maintenance automated | + +## Session Modes + +Teradata has two session modes that affect SQL behavior: + +| Behavior | ANSI Mode | Teradata (TERA) Mode | +|----------|-----------|---------------------| +| String comparisons | CASESPECIFIC | NOT CASESPECIFIC | +| Transaction | Explicit COMMIT required | Auto-commit after each statement | +| Truncation | Error on truncation | Silently truncates | + +**Snowflake mapping:** +- ANSI Mode CASESPECIFIC → No changes needed +- ANSI Mode NOT CASESPECIFIC → Add `COLLATE 'en-cs'` in column definition +- TERA Mode CASESPECIFIC → Convert string comparisons to `RTRIM(expression)` +- TERA Mode NOT CASESPECIFIC → Convert string comparisons to `RTRIM(UPPER(expression))` + +See SnowConvert AI documentation for detailed session mode translation rules. + +## Data Type Mapping + +| Teradata | Snowflake | Notes | +|----------|-----------|-------| +| BYTEINT | TINYINT / NUMBER | 1-byte signed integer | +| SMALLINT | SMALLINT / NUMBER | | +| INTEGER | INTEGER / NUMBER | | +| BIGINT | BIGINT / NUMBER | | +| DECIMAL(p,s) / NUMERIC(p,s) | NUMBER(p,s) | | +| FLOAT / REAL / DOUBLE PRECISION | FLOAT | | +| NUMBER | NUMBER(38,0) | Teradata NUMBER is different from Oracle NUMBER | +| CHAR(n) | VARCHAR | SnowConvert maps CHAR → VARCHAR for Teradata | +| VARCHAR(n) | VARCHAR(n) | | +| CLOB | VARCHAR(16777216) | 16MB max; not directly supported as CLOB | +| BYTE(n) | BINARY(n) | | +| VARBYTE(n) | BINARY(n) | | +| BLOB | BINARY(8388608) | 8MB max; not directly supported as BLOB | +| DATE | DATE | Teradata DATE is date-only (unlike Oracle) | +| TIME | TIME | | +| TIME WITH TIME ZONE | TIME | TIME WITH TIME ZONE not supported; stored as wall-clock only | +| TIMESTAMP | TIMESTAMP_NTZ | | +| TIMESTAMP WITH TIME ZONE | TIMESTAMP_TZ | | +| INTERVAL types (all) | VARCHAR / date functions | INTERVAL not supported; use DATEDIFF/DATEADD | +| PERIOD(DATE) | Two DATE columns (start, end) | No direct PERIOD type; 
split into start/end | +| PERIOD(TIMESTAMP) | Two TIMESTAMP columns | No direct PERIOD type | +| PERIOD(TIME) | Two TIME columns or VARCHAR | | +| JSON | VARIANT | | +| XML | VARIANT | | +| ARRAY | ARRAY | | +| ST_GEOMETRY | GEOGRAPHY or GEOMETRY | | +| UDT (User-Defined Type) | Not supported | Flatten to native types | +| DATASET | Not supported | | +| TD_ANYTYPE | Not supported | | + +## Feature Mapping + +| Teradata Feature | Snowflake Equivalent | +|-----------------|---------------------| +| PRIMARY INDEX (PI) | Not needed (Snowflake auto-distributes) | +| Secondary indexes (USI/NUSI) | Not needed (micro-partition pruning) | +| Hash indexes | Not needed | +| Join indexes | Materialized views or Dynamic Tables | +| PARTITION BY (TD-style) | Micro-partitions (automatic); CLUSTER BY for ordering | +| MULTISET tables | Default behavior (Snowflake allows duplicates) | +| SET tables | Add DISTINCT or UNIQUE constraints; handle in INSERT | +| Volatile tables | TEMPORARY TABLE | +| Global temporary tables | TEMPORARY TABLE | +| COLLECT STATISTICS | Not needed (Snowflake auto-manages statistics) | +| LOCKING ROW FOR ACCESS | Not needed (Snowflake MVCC handles concurrency) | +| QUALIFY clause | QUALIFY (Snowflake supports natively) | +| SAMPLE | SAMPLE or TABLESAMPLE | +| TITLE column alias | AS alias | +| FORMAT column format | TO_CHAR() for display formatting | +| CASESPECIFIC / NOT CASESPECIFIC | COLLATE or UPPER()/LOWER() | +| COMPRESS values | Not needed (Snowflake auto-compresses) | +| FALLBACK / NO FALLBACK | Not needed (Snowflake has built-in redundancy) | +| Journal tables | Streams (change tracking) | +| Macros | Stored procedures or views (macros not supported) | +| Stored procedures (SPL) | Snowflake Scripting or JavaScript procedures | +| UDFs | Snowflake UDFs (SQL, JavaScript, Python) | +| Teradata Scheduler | Snowflake Tasks | +| Access logging (DBQL) | ACCESS_HISTORY view (ACCOUNT_USAGE) | +| Row-level security | Row Access Policies | +| BTEQ scripts | 
Snowflake SQL worksheets, SnowSQL, or Python scripts | +| FastLoad | COPY INTO (bulk load) | +| FastExport | COPY INTO @stage (bulk unload) | +| MultiLoad | COPY INTO with MERGE pattern | +| TPT (Teradata Parallel Transporter) | Snowpipe or COPY INTO | +| TASM / TIWM (workload mgmt) | Warehouses + resource monitors | +| Data dictionary (DBC views) | INFORMATION_SCHEMA / ACCOUNT_USAGE | +| Surrogate keys | AUTOINCREMENT or SEQUENCE | + +## Databases to Exclude from Migration + +The following Teradata system databases should NOT be migrated: +`DBC`, `Sys_Calendar`, `SystemFe`, `SYSJDBC`, `SYSLIB`, `SYSSPATIAL`, `SYSUDTLIB`, `SysAdmin`, `TDStats`, `TD_SYSFNLIB`, `TD_SYSXML`, `TDPUSER`, `tdwm`, `All`, `Crashdumps`, `dbcmngr`, `Default`, `External_AP`, `EXTUSER`, `LockLogShredder`, `PUBLIC`, `SQLJ`, `SYSBAR`, `SYSUIF`, `TD_SERVER_DB`, `TD_SYSGPL`, `viewpoint`, `console` + +## Common Teradata SQL Patterns + +### QUALIFY (Native Support) +```sql +-- Teradata (same in Snowflake) +SELECT customer_id, order_date, amount +FROM orders +QUALIFY ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date DESC) = 1; +``` + +### SAMPLE +```sql +-- Teradata +SELECT * FROM my_table SAMPLE 100; +SELECT * FROM my_table SAMPLE 0.10; -- 10% + +-- Snowflake +SELECT * FROM my_table SAMPLE (100 ROWS); +SELECT * FROM my_table SAMPLE (10); -- 10% of rows +``` + +### SET Table Behavior (Auto-Dedup) +```sql +-- Teradata SET table (auto-dedup on insert) +CREATE SET TABLE my_table (...); + +-- Snowflake: No SET tables; handle dedup explicitly +INSERT INTO my_table +SELECT DISTINCT * FROM source_table; +-- Or use MERGE to prevent duplicates +``` + +### PERIOD Columns → Split Columns +```sql +-- Teradata (OVERLAPS is the boolean predicate; P_INTERSECT returns a PERIOD value) +CREATE TABLE emp ( + emp_id INTEGER, + emp_period PERIOD(DATE), + salary DECIMAL(10,2) +); +SELECT emp_id FROM emp WHERE emp_period OVERLAPS PERIOD(DATE '2024-01-01', DATE '2024-12-31'); + +-- Snowflake +CREATE TABLE emp ( + emp_id INTEGER, + emp_period_start DATE, + emp_period_end DATE, + 
salary NUMBER(10,2) +); +SELECT emp_id FROM emp +WHERE emp_period_start < '2024-12-31' AND emp_period_end > '2024-01-01'; +``` + +### NORMALIZE (Merge Overlapping Periods) +```sql +-- Teradata +SELECT emp_id, BEGIN(emp_period), END(emp_period) +FROM emp NORMALIZE ON emp_period; + +-- Snowflake: Rewrite with window functions (gap-and-islands pattern) +WITH ordered AS ( + SELECT emp_id, emp_period_start, emp_period_end, + -- Compare against the latest end date seen so far, so nested periods merge correctly + CASE WHEN emp_period_start <= MAX(emp_period_end) OVER ( + PARTITION BY emp_id ORDER BY emp_period_start, emp_period_end + ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) + THEN 0 ELSE 1 END AS new_group + FROM emp +), +grouped AS ( + SELECT *, SUM(new_group) OVER (PARTITION BY emp_id ORDER BY emp_period_start, emp_period_end) AS grp + FROM ordered +) +SELECT emp_id, MIN(emp_period_start), MAX(emp_period_end) +FROM grouped GROUP BY emp_id, grp; +``` + +### EXPAND ON (Temporal Expansion) +```sql +-- Teradata +SELECT emp_id, BEGIN(pd) AS cal_date, salary +FROM emp EXPAND ON emp_period AS pd BY INTERVAL '1' DAY; + +-- Snowflake: Use GENERATOR or date spine +SELECT e.emp_id, d.cal_date, e.salary +FROM emp e +JOIN ( + -- GENERATOR returns no columns, so derive a gap-free day offset with ROW_NUMBER + SELECT DATEADD('day', ROW_NUMBER() OVER (ORDER BY SEQ4()) - 1, '2020-01-01'::DATE) AS cal_date + FROM TABLE(GENERATOR(ROWCOUNT => 3650)) +) d ON d.cal_date >= e.emp_period_start AND d.cal_date < e.emp_period_end; +``` + +### Date Arithmetic +```sql +-- Teradata stores DATE internally as an integer: (year - 1900) * 10000 + month * 100 + day +-- When exporting, always export as DATE strings, not internal integers + +-- Teradata interval arithmetic +SELECT order_date + INTERVAL '30' DAY FROM orders; +-- Snowflake +SELECT DATEADD('day', 30, order_date) FROM orders; + +-- Teradata: date - date = integer (days) +SELECT date1 - date2 FROM t; +-- Snowflake +SELECT DATEDIFF('day', date2, date1) FROM t; +``` + +### SEL Abbreviation +```sql +-- Teradata allows abbreviated keywords +SEL * FROM my_table; +INS INTO my_table VALUES (1); +DEL FROM my_table WHERE id = 1; +UPD my_table SET val = 'x' WHERE id = 1; + +-- Snowflake: Use full keywords +SELECT * FROM my_table; +INSERT INTO my_table VALUES 
(1); +DELETE FROM my_table WHERE id = 1; +UPDATE my_table SET val = 'x' WHERE id = 1; +``` + +### TITLE and FORMAT +```sql +-- Teradata +SELECT emp_name (TITLE 'Employee Name'), salary (FORMAT '$$$,$$9.99') +FROM employees; + +-- Snowflake +SELECT emp_name AS "Employee Name", TO_CHAR(salary, '$999,999.99') AS salary +FROM employees; +``` + +### Teradata-Specific Functions +```sql +-- Teradata -- Snowflake +CHARACTERS(str) LENGTH(str) +ZEROIFNULL(val) ZEROIFNULL(val) -- supported +NULLIFZERO(val) NULLIFZERO(val) -- supported +HASHROW(cols) HASH(cols) +INDEX(str, 'sub') POSITION('sub' IN str) or CHARINDEX('sub', str) +OREPLACE(str, 'old', 'new') REPLACE(str, 'old', 'new') +OTRANSLATE(str, 'from', 'to') TRANSLATE(str, 'from', 'to') +STRTOK(str, delim, n) STRTOK(str, delim, n) -- same +RESET WHEN condition Rewrite with CASE in window functions +NAMED 'alias' AS alias +COALESCE(a, b) COALESCE(a, b) -- same +``` + +### LOCKING Clause → Remove +```sql +-- Teradata +LOCKING TABLE my_table FOR ACCESS +SELECT * FROM my_table; + +LOCKING ROW FOR ACCESS +SELECT * FROM big_table WHERE id = 123; + +-- Snowflake: Remove all LOCKING clauses +SELECT * FROM my_table; +SELECT * FROM big_table WHERE id = 123; +``` + +### COLLECT STATISTICS → Remove +```sql +-- Teradata +COLLECT STATISTICS ON my_table COLUMN (customer_id); +COLLECT STATISTICS ON my_table INDEX (primary_idx); +HELP STATISTICS my_table; + +-- Snowflake: Remove all COLLECT STATISTICS; Snowflake auto-manages +-- No action needed +``` + +### BTEQ Script Patterns +```sql +-- Teradata BTEQ +.LOGON server/user,password +.SET WIDTH 200 +.EXPORT FILE=/tmp/output.csv +SELECT * FROM my_table; +.EXPORT RESET +.IF ERRORCODE <> 0 THEN .GOTO ERROR_HANDLER +.LOGOFF +.QUIT + +-- Snowflake equivalent (SnowSQL or Python) +-- Use SnowSQL CLI: +-- snowsql -a account -u user -q "SELECT * FROM my_table" -o output_format=csv -o output_file=/tmp/output.csv +-- Or use SnowConvert AI to auto-translate BTEQ → Python +``` + +### Macro → Stored 
Procedure +```sql +-- Teradata +CREATE MACRO get_recent_orders AS ( + SELECT * FROM orders WHERE order_date > CURRENT_DATE - 30; +); +EXEC get_recent_orders; + +-- Snowflake: Use stored procedure or view +CREATE OR REPLACE VIEW get_recent_orders AS + SELECT * FROM orders WHERE order_date > DATEADD('day', -30, CURRENT_DATE()); +-- or +CREATE OR REPLACE PROCEDURE get_recent_orders() + RETURNS TABLE() + LANGUAGE SQL +AS +BEGIN + LET res RESULTSET := (SELECT * FROM orders WHERE order_date > DATEADD('day', -30, CURRENT_DATE())); + RETURN TABLE(res); +END; +``` + +## DDL Conversion Checklist + +1. **Remove** `PRIMARY INDEX`, `PARTITION BY` (Teradata-style), `UNIQUE PRIMARY INDEX` +2. **Remove** `SET` / `MULTISET` table keywords; handle dedup in INSERT logic for SET tables +3. **Remove** `FALLBACK` / `NO FALLBACK`, `JOURNAL`, `FREESPACE`, `CHECKSUM` +4. **Remove** all secondary index DDL (USI, NUSI, hash, join indexes) +5. **Remove** `COMPRESS` value lists (Snowflake auto-compresses) +6. **Remove** `COLLECT STATISTICS` statements +7. **Remove** `LOCKING` clauses +8. **Convert** `BYTEINT` → `TINYINT` or `NUMBER` +9. **Convert** `PERIOD(DATE/TIMESTAMP)` → two separate columns (start/end) +10. **Convert** `INTERVAL` types → `VARCHAR` or rewrite with DATEADD/DATEDIFF +11. **Convert** `CLOB` → `VARCHAR(16777216)`, `BLOB` → `BINARY(8388608)` +12. **Convert** `CHAR` → `VARCHAR` (SnowConvert default for Teradata) +13. **Replace** VOLATILE tables → `TEMPORARY TABLE` +14. **Replace** macros → stored procedures or views +15. 
**Note** constraints: PK, FK, UNIQUE defined but **not enforced** in Snowflake + +## Script Translation Tools + +| Teradata Tool | SnowConvert Translation Target | +|---------------|-------------------------------| +| BTEQ | → Snowflake SQL or Python scripts | +| FastLoad | → Python with COPY INTO | +| MultiLoad | → Python with MERGE + COPY INTO | +| TPT | → Python scripts | +| Stored procedures | → Snowflake Scripting or JavaScript | +| Macros | → Views or stored procedures | + +## Data Extraction Methods + +| Method | Best For | +|--------|---------| +| BTEQ .EXPORT | Small to medium table extraction | +| FastExport | Large table bulk export | +| TPT (export operator) | High-performance parallel extraction | +| SnowConvert AI (file-based) | DDL export scripts → stage → COPY INTO | + +## Common Pitfalls + +1. **Primary Index removal**: Snowflake has no Primary Index; it distributes data across micro-partitions automatically, so removing the PI needs no follow-up action. +2. **SET table semantics**: Teradata SET tables reject duplicate rows on INSERT; Snowflake accepts them. Add dedup logic. +3. **TERA mode string comparison**: The NOT CASESPECIFIC default makes comparisons case-insensitive; Snowflake comparisons are case-sensitive. Wrap both sides in UPPER(). +4. **INTERVAL types**: Not supported in Snowflake; rewrite all INTERVAL arithmetic with DATEADD/DATEDIFF. +5. **PERIOD types**: Not supported; split into two columns and rewrite temporal predicates. +6. **Date integer format**: Teradata stores DATE internally as an integer computed as (year - 1900) * 10000 + month * 100 + day (for example, 2024-01-15 is 1240115); export as DATE strings, not integers. +7. **QUALIFY**: Snowflake supports QUALIFY natively; this is one of the easiest translations. +8. **Macros**: Not supported; convert to views (for simple queries) or procedures (for parameterized logic). +9. **System databases**: Exclude all Teradata system databases (DBC, Sys_Calendar, etc.) from migration scope. +10. **Surrogate key lifecycle**: Surrogate keys from Teradata may behave differently in Snowflake AUTOINCREMENT; synchronize during cutover. 
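
Pitfall 6 is easiest to see with a concrete value. The sketch below assumes the commonly documented Teradata encoding, `(year - 1900) * 10000 + month * 100 + day`, and decodes such an integer back into a calendar date. The function name `teradata_int_to_date` is hypothetical and not part of any Teradata or Snowflake library.

```python
from datetime import date


def teradata_int_to_date(encoded: int) -> date:
    """Decode a Teradata internal DATE integer into a datetime.date.

    Assumes the encoding (year - 1900) * 10000 + month * 100 + day.
    """
    year_offset, month_day = divmod(encoded, 10000)
    month, day = divmod(month_day, 100)
    # date() raises ValueError if the decoded month/day is not a real date.
    return date(1900 + year_offset, month, day)


print(teradata_int_to_date(1240115).isoformat())  # 2024-01-15
```

If an export already contains such integers, a helper like this could repair them during staging; exporting formatted DATE strings avoids the problem entirely.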
diff --git a/skills/agentic-migration-workshop/schema-conversion/INSTRUCTIONS.md b/skills/agentic-migration-workshop/schema-conversion/INSTRUCTIONS.md new file mode 100644 index 00000000..b2ba40a2 --- /dev/null +++ b/skills/agentic-migration-workshop/schema-conversion/INSTRUCTIONS.md @@ -0,0 +1,182 @@ + +# Workshop Session: Schema Conversion (Day 3 — Database Conversion) + +## Session Overview + +**Present to user:** +> Welcome to **Schema Conversion** — this is Day 3 of the LiftOff framework: Database Conversion. We'll take your source DDL and translate it into Snowflake-ready code. +> +> Here's what we'll work through together: +> 1. Collect and categorize your source DDL +> 2. Map every data type to its Snowflake equivalent (I'll flag anything that needs your input) +> 3. Convert tables, views, sequences, and other objects +> 4. Assemble a deployment script in the correct dependency order +> 5. Validate everything compiles, then deploy when you're ready +> +> Let's get your DDL. + +## Prerequisites +- Platform reference file read (from `/references/`) +- Source DDL or object definitions available +- `references/best-practices.md` read + +## Session Flow + +### Part 1: Collect Source DDL + +**Ask the user** (via `ask_user_question`) how they'll provide their DDL: +- Paste DDL statements directly +- Provide file paths to `.sql` files +- Provide a database/schema name (if source is queryable) +- They already ran SnowConvert AI and have converted output to review + +**Parse and categorize** by object type: +- CREATE TABLE, CREATE VIEW, CREATE INDEX, CREATE SEQUENCE +- CREATE PROCEDURE/FUNCTION +- ALTER TABLE (constraints) +- Other DDL + +**Present to user:** *"I've found [N] objects: [X] tables, [Y] views, [Z] procedures... Let me start with the data type mapping."* + +### Part 2: Data Type Mapping + +**Explain to user:** +> This is one of the most important parts of schema conversion. 
Every source data type needs a Snowflake equivalent, and some mappings involve trade-offs I want you to be aware of. + +**Extract** all data types from the source DDL and map each one using the platform reference. **Flag anything requiring a decision:** + +| Consideration | What to Tell the User | +|--------------|----------------------| +| Precision loss | "This type has higher precision in [source] than Snowflake supports. Here's the impact..." | +| LOB types | "Large objects map to VARIANT or VARCHAR(16MB). If you have objects exceeding 16MB, we'll need a different approach." | +| Custom/UDT types | "Snowflake doesn't support user-defined types directly. I'll flatten these to native types." | +| Numeric precision | "Snowflake NUMBER supports up to 38 digits of precision (scale 0-37); your source uses [X], so we're good." | +| Timestamp zones | "You have three options: TIMESTAMP_TZ (with timezone), TIMESTAMP_NTZ (without), or TIMESTAMP_LTZ (local). Here's when to use each..." | + +**Produce a Data Type Mapping Report:** + +``` +| Source Column | Source Type | Snowflake Type | Notes | +|--------------|-------------|---------------|-------| +``` + +**CHECKPOINT:** *"Here's your data type mapping. Please review, especially the items I've flagged. Any precision changes need your sign-off before I proceed."* + +### Part 3: Convert Table DDL + +**Convert each CREATE TABLE** applying: +- Mapped data types from Part 2 +- Remove unsupported clauses (TABLESPACE, STORAGE, PARTITION BY range/list, ENGINE, DISTSTYLE, DISTKEY, SORTKEY, ENCODE, ON [filegroup], etc.) 
+- Convert identity/auto-increment to Snowflake AUTOINCREMENT or IDENTITY +- Preserve NOT NULL, DEFAULT, PRIMARY KEY, UNIQUE, CHECK, FOREIGN KEY constraints +- Add CLUSTER BY where beneficial (replacing source distribution/sort keys) +- Apply fully qualified naming: DB.SCHEMA.TABLE + +**Important teaching moment to share:** +> One critical difference: Snowflake **defines but does not enforce** PK, FK, and UNIQUE constraints (only NOT NULL is enforced). This means your data integrity checks need to move into your ETL/ELT pipelines. I'll flag this in the deployment notes. + +**Platform-specific conversions:** + +**Oracle:** +- Remove TABLESPACE, STORAGE, PCTFREE +- VARCHAR2 → VARCHAR, DATE → TIMESTAMP_NTZ (Oracle DATE includes time) +- RAW/LONG RAW → BINARY, CLOB → VARCHAR(16777216), BLOB → BINARY(8388608) + +**Teradata:** +- Remove PRIMARY INDEX, PARTITION BY, MULTISET/SET keywords, COMPRESS +- BYTEINT → TINYINT, character set conversions → VARCHAR + +**Redshift:** +- Remove DISTSTYLE, DISTKEY, SORTKEY, ENCODE, BACKUP +- SUPER → VARIANT, IDENTITY(seed,step) → AUTOINCREMENT(seed,step) + +**SQL Server:** +- Remove ON [filegroup], TEXTIMAGE_ON, CLUSTERED/NONCLUSTERED +- NVARCHAR → VARCHAR (Snowflake is native UTF-8), DATETIME/DATETIME2 → TIMESTAMP_NTZ +- UNIQUEIDENTIFIER → VARCHAR(36), BIT → BOOLEAN, MONEY → NUMBER(19,4) + +### Part 4: Convert Views and Other Objects + +**Views:** +- Apply query translation rules for the embedded SQL +- Materialized views → Snowflake MATERIALIZED VIEW or Dynamic Table +- Indexed views (SQL Server) → Snowflake MATERIALIZED VIEW +- Recursive views → Snowflake recursive CTE syntax +- Validate each view compiles: `snowflake_sql_execute` with `only_compile: true` + +**Sequences:** +- Convert CREATE SEQUENCE syntax (Snowflake supports natively) +- Map START WITH, INCREMENT BY, CACHE + +**Indexes:** +> "Snowflake doesn't use traditional indexes — its micro-partition pruning handles most use cases automatically. 
For large tables (>1TB) with frequent range filters, I'll recommend CLUSTER BY instead." + +**Synonyms:** +- No direct equivalent; use fully qualified names or views as aliases + +**File Formats & Stages:** +- Generate CREATE FILE FORMAT for expected data ingestion patterns +- Generate CREATE STAGE if external storage is involved + +### Part 5: Deployment Script + +**Assemble** all converted DDL in dependency order: + +```sql +-- 1. Databases +-- 2. Schemas +-- 3. Sequences (referenced by tables) +-- 4. Tables (parent tables first, then child tables with FKs) +-- 5. Views (base views first, then dependent views) +-- 6. Functions +-- 7. Stored Procedures +-- 8. File Formats and Stages +``` + +**Validate** the full script compiles: `snowflake_sql_execute` with `only_compile: true` + +**Present the Schema Conversion Summary:** +``` +Schema Conversion Summary +- Tables converted: [count] +- Views converted: [count] +- Sequences converted: [count] +- Data type mappings: [count] ([flagged] requiring review) +- Warnings/manual review items: [count] +``` + +**Pre-deployment checklist:** +- [ ] All EWI errors resolved (if SnowConvert AI was used) +- [ ] FDM warnings reviewed and documented +- [ ] Converted code reviewed +- [ ] Test environment deployment tested first +- [ ] Rollback strategy planned + +**CHECKPOINT:** +> "Your deployment script is ready. I can run it statement-by-statement (so we can catch any issues early) or as a batch. Which do you prefer?" + +Wait for approval, then execute. + +## Session Wrap-Up + +**Present to user:** +> Schema Conversion complete! Here's what we accomplished: +> - [X] tables, [Y] views, [Z] sequences converted to Snowflake DDL +> - [N] data type mappings applied +> - All objects compiled and deployed successfully +> +> Your Snowflake database is now structurally ready for data. 
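
The dependency ordering used for the deployment script in Part 5 can be sketched as a topological sort. This is a minimal illustration, assuming the dependencies between objects have already been collected into a dictionary; the object names and the `deployment_order` helper are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def deployment_order(dependencies):
    """Return object names so that every object follows what it depends on.

    `dependencies` maps an object name to the set of objects it references.
    """
    return list(TopologicalSorter(dependencies).static_order())


# Hypothetical objects: a view over a child table that references a parent
# table (FK) and a sequence (column default).
deps = {
    "SALES.ORDERS": {"SALES.CUSTOMERS", "SALES.ORDER_SEQ"},
    "SALES.CUSTOMERS": set(),
    "SALES.ORDER_SEQ": set(),
    "SALES.V_RECENT_ORDERS": {"SALES.ORDERS"},
}
order = deployment_order(deps)
print(order)
```

`TopologicalSorter` raises `CycleError` when objects reference each other in a loop, which is a useful signal that a constraint should be split out and applied after both tables exist.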
+ +## Next Session + +If Full Workshop → proceed to **Data Migration** (read `data-migration/SKILL.md`) + +## Workshop Context (Day 3) + +During the LiftOff engagement, Database Conversion covers: +- Review DDL, DML, metadata, and scripts from the source platform +- Demo SnowConvert AI conversion capabilities +- Convert and deploy DDLs in the customer's Snowflake environment +- Estimate database conversion LOE and timeline + +**Key estimation factors:** All database objects, one-time setup (code management, dev patterns), multiple environments, data type mapping, constraint enforcement differences, RBAC deployment diff --git a/skills/agentic-migration-workshop/scripts/assess_complexity.py b/skills/agentic-migration-workshop/scripts/assess_complexity.py new file mode 100644 index 00000000..9936d7ff --- /dev/null +++ b/skills/agentic-migration-workshop/scripts/assess_complexity.py @@ -0,0 +1,199 @@ +#!/usr/bin/env python3 +""" +assess_complexity.py - Parse source DDL and score migration complexity. 
+ +Usage: + uv run --project python /scripts/assess_complexity.py \ + --input --platform \ + --output +""" + +import argparse +import json +import re +import sys +from collections import defaultdict + + +COMPLEXITY_RULES = { + "oracle": { + "trivial": [ + (r"CREATE\s+TABLE", "table"), + (r"CREATE\s+SEQUENCE", "sequence"), + (r"CREATE\s+(UNIQUE\s+)?INDEX", "index"), + ], + "simple": [ + (r"CREATE\s+(OR\s+REPLACE\s+)?VIEW", "view"), + (r"ALTER\s+TABLE.*ADD\s+CONSTRAINT", "constraint"), + ], + "moderate": [ + (r"CREATE\s+MATERIALIZED\s+VIEW", "materialized_view"), + (r"CREATE\s+(OR\s+REPLACE\s+)?FUNCTION", "function"), + ], + "complex": [ + (r"CREATE\s+(OR\s+REPLACE\s+)?PROCEDURE", "procedure"), + (r"CREATE\s+(OR\s+REPLACE\s+)?TRIGGER", "trigger"), + (r"CONNECT\s+BY", "hierarchical_query"), + (r"DBMS_\w+", "dbms_package_call"), + ], + "critical": [ + (r"CREATE\s+(OR\s+REPLACE\s+)?PACKAGE", "package"), + (r"DB_?LINK|DATABASE\s+LINK", "db_link"), + (r"PRAGMA\s+AUTONOMOUS_TRANSACTION", "autonomous_txn"), + (r"TYPE\s+\w+\s+(IS|AS)\s+(OBJECT|TABLE|RECORD)", "user_defined_type"), + ], + }, + "teradata": { + "trivial": [ + (r"CREATE\s+(MULTISET|SET)?\s*TABLE", "table"), + (r"CREATE\s+(UNIQUE\s+)?INDEX", "index"), + ], + "simple": [ + (r"CREATE\s+(OR\s+REPLACE\s+)?VIEW", "view"), + (r"COLLECT\s+STATISTICS", "collect_stats"), + ], + "moderate": [ + (r"CREATE\s+JOIN\s+INDEX", "join_index"), + (r"PERIOD\s*\(", "temporal_period"), + (r"CREATE\s+MACRO", "macro"), + ], + "complex": [ + (r"CREATE\s+PROCEDURE", "procedure"), + (r"CREATE\s+TRIGGER", "trigger"), + (r"NORMALIZE", "normalize_query"), + ], + "critical": [ + (r"\.EXPORT", "bteq_export"), + (r"\.IMPORT", "bteq_import"), + (r"\.LOGON", "bteq_logon"), + ], + }, + "redshift": { + "trivial": [ + (r"CREATE\s+TABLE", "table"), + ], + "simple": [ + (r"CREATE\s+(OR\s+REPLACE\s+)?VIEW", "view"), + (r"DISTSTYLE|DISTKEY|SORTKEY", "distribution_hint"), + ], + "moderate": [ + (r"CREATE\s+MATERIALIZED\s+VIEW", "materialized_view"), + 
(r"CREATE\s+(OR\s+REPLACE\s+)?FUNCTION", "udf"), + (r"CREATE\s+EXTERNAL\s+TABLE", "spectrum_table"), + ], + "complex": [ + (r"CREATE\s+(OR\s+REPLACE\s+)?PROCEDURE", "procedure"), + (r"CREATE\s+EXTERNAL\s+SCHEMA", "external_schema"), + ], + "critical": [ + (r"CREATE\s+LIBRARY", "custom_library"), + (r"LAMBDA", "lambda_udf"), + ], + }, + "sqlserver": { + "trivial": [ + (r"CREATE\s+TABLE", "table"), + (r"CREATE\s+(UNIQUE\s+)?(CLUSTERED\s+|NONCLUSTERED\s+)?INDEX", "index"), + (r"CREATE\s+SEQUENCE", "sequence"), + ], + "simple": [ + (r"CREATE\s+(OR\s+ALTER\s+)?VIEW", "view"), + (r"ALTER\s+TABLE.*ADD\s+CONSTRAINT", "constraint"), + ], + "moderate": [ + (r"CREATE\s+(OR\s+ALTER\s+)?FUNCTION", "function"), + (r"CREATE\s+(OR\s+ALTER\s+)?TRIGGER", "trigger"), + ], + "complex": [ + (r"CREATE\s+(OR\s+ALTER\s+)?PROC(EDURE)?", "procedure"), + (r"EXEC(UTE)?\s+sp_executesql", "dynamic_sql"), + (r"CROSS\s+APPLY|OUTER\s+APPLY", "apply_join"), + (r"FOR\s+XML\s+PATH", "xml_aggregation"), + ], + "critical": [ + (r"WITH\s+EXTERNAL_ACCESS|CLR", "clr_procedure"), + (r"OPENROWSET|OPENQUERY|OPENDATASOURCE", "linked_server_query"), + (r"CREATE\s+ASSEMBLY", "clr_assembly"), + (r"SERVICE\s+BROKER", "service_broker"), + ], + }, +} + +SCORE_MAP = {"trivial": 1, "simple": 2, "moderate": 3, "complex": 4, "critical": 5} + + +def assess_ddl(ddl_text: str, platform: str) -> dict: + rules = COMPLEXITY_RULES.get(platform) + if not rules: + return {"error": f"Unknown platform: {platform}"} + + findings = defaultdict(list) + object_counts = defaultdict(int) + total_score = 0 + total_count = 0 + + for level, patterns in rules.items(): + for pattern, obj_type in patterns: + matches = re.findall(pattern, ddl_text, re.IGNORECASE) + if matches: + count = len(matches) + score = SCORE_MAP[level] + object_counts[obj_type] = count + total_score += count * score + total_count += count + findings[level].append( + {"object_type": obj_type, "count": count, "score": score} + ) + + weighted_avg = round(total_score / 
total_count, 2) if total_count > 0 else 0 + + if weighted_avg <= 1.5: + readiness = "Ready" + elif weighted_avg <= 3.0: + readiness = "Ready with caveats" + else: + readiness = "Needs redesign" + + return { + "platform": platform, + "total_objects": total_count, + "weighted_complexity": weighted_avg, + "readiness": readiness, + "findings_by_level": dict(findings), + "object_counts": dict(object_counts), + "critical_items": findings.get("critical", []), + } + + +def main(): + parser = argparse.ArgumentParser(description="Assess DDL migration complexity") + parser.add_argument("--input", required=True, help="Path to source DDL file") + parser.add_argument( + "--platform", + required=True, + choices=["oracle", "teradata", "redshift", "sqlserver"], + help="Source database platform", + ) + parser.add_argument("--output", help="Output JSON file (default: stdout)") + args = parser.parse_args() + + try: + with open(args.input, "r") as f: + ddl_text = f.read() + except FileNotFoundError: + print(f"Error: File not found: {args.input}", file=sys.stderr) + sys.exit(1) + + result = assess_ddl(ddl_text, args.platform) + + output = json.dumps(result, indent=2) + if args.output: + with open(args.output, "w") as f: + f.write(output) + print(f"Report written to {args.output}") + else: + print(output) + + +if __name__ == "__main__": + main() diff --git a/skills/agentic-migration-workshop/snowconvert-ai/INSTRUCTIONS.md b/skills/agentic-migration-workshop/snowconvert-ai/INSTRUCTIONS.md new file mode 100644 index 00000000..92de7489 --- /dev/null +++ b/skills/agentic-migration-workshop/snowconvert-ai/INSTRUCTIONS.md @@ -0,0 +1,179 @@ + +# Workshop Session: SnowConvert AI — Automated Migration + +## Session Overview + +**Present to user:** +> Welcome to the **SnowConvert AI** session. This is Snowflake's free automated conversion tool — it handles the heavy lifting of translating your source code to Snowflake SQL. 
In partner engagements, SnowConvert AI typically achieves **95%+ automated conversion** and has converted over **2 billion lines of code** across thousands of migrations. +> +> Here's what we'll do together: +> 1. Make sure SnowConvert AI is set up and ready +> 2. Extract your source database objects +> 3. Run the automated conversion and review results +> 4. Optionally use AI Verification to auto-fix remaining issues +> 5. Deploy converted objects to Snowflake +> +> Let's check your setup first. + +## Prerequisites +- SnowConvert AI installed (free from Snowsight → Ingestion/Migrations) +- Snowflake account with `CREATE MIGRATION` privilege +- Access to source database (read + DDL extraction permissions) +- MFA enabled on Snowflake account + +## Session Flow + +### Part 1: Verify Setup + +**Ask the user** (via `ask_user_question`): +- Do you have SnowConvert AI installed? +- Which source platform? (Oracle, Teradata, SQL Server, Redshift) +- Direct database access, or will you provide DDL files? + +**If not installed, guide them:** +> SnowConvert AI is completely free. You can download it from Snowsight → Ingestion/Migrations. It runs on Windows 11+, macOS 13.3+, or Linux, and needs 4GB RAM (8GB+ recommended). Access codes auto-generate since v1.2.0 — one code works for all source platforms. + +**If using DDL files** (Oracle, Teradata): +> For platforms without direct extraction, you'll need to export your DDL to .sql files first. Use these export scripts: https://github.com/Snowflake-Labs/SC.DDLExportScripts + +**Verify migration privileges:** +```sql +GRANT CREATE MIGRATION ON ACCOUNT TO ROLE ; +``` + +### Part 2: Create Project + +**Walk the user through project creation:** + +> Let's create your SnowConvert AI project: +> 1. Launch SnowConvert AI → **"Create New Project"** +> 2. Select your source platform +> 3. Choose the input folder containing your source code +> 4. Select an output folder for converted code +> 5. 
Enter your access code + +*The `.snowct` project file saves everything, so you can reopen it anytime to resume work.* + +### Part 3: Extract Database Objects + +**For SQL Server or Redshift** (direct extraction): +> SnowConvert AI can connect directly to your database and extract objects automatically. + +1. Configure connection: + - SQL Server: Standard auth or Windows Authentication + - Redshift: IAM Provisioned Cluster, IAM Serverless, or Standard auth +2. Connect → browse schemas → select objects +3. Click **"Extract Objects"** → review results +4. Click **"View Last Extraction Results"** to validate + +**For Oracle, Teradata, or other** (file-based): +> Place your exported `.sql` files in the input folder you specified. SnowConvert AI will read them automatically. + +**Extractable objects:** Tables, Views, Functions, Stored Procedures, Materialized Views + +### Part 4: Run Conversion + +**Guide the user through conversion settings:** + +> Before we run, let me explain the key settings: +> - **Encoding:** UTF-8 (the default; change it only if your source files use a different encoding) +> - **Custom Schema/Database:** Set these if your target names differ from source +> - **Target Language:** SnowScript (recommended) or JavaScript for procedures +> - **Comments:** Enable to annotate nodes with missing dependencies + +**Execute:** Click **"Save & Start Assessment"** + +**Review results together** using the traffic light system: + +| Color | Meaning | What We Do | +|-------|---------|-----------| +| Green | Successfully converted | Ready to deploy as-is | +| Yellow (FDM) | Functional Difference Message: converted, but behavior may differ from the source | I'll review the business impact; often deployable with documentation | +| Red (EWI) | Error, Warning, or Issue: code that could not be fully converted | We need to fix these before deployment | + +**Code Completeness:** If below 100%, some objects reference dependencies that weren't included. We may need to add more source files. 
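
The traffic-light triage above can also be approximated outside the UI by scanning converted files for issue markers. The sketch below assumes the markers appear in the converted SQL as codes prefixed with `SSC-EWI-` and `SSC-FDM-`; confirm that pattern against your own output before relying on it. The function name `classify_converted_sql` and the sample input are hypothetical.

```python
import re

# Assumed marker format: codes such as SSC-EWI-0001 or SSC-FDM-TD0002.
MARKER = re.compile(r"SSC-(EWI|FDM)-[A-Z]*\d+")


def classify_converted_sql(sql_text: str) -> str:
    """Return 'red', 'yellow', or 'green' for one converted file."""
    kinds = {match.group(1) for match in MARKER.finditer(sql_text)}
    if "EWI" in kinds:
        return "red"      # must be fixed before deployment
    if "FDM" in kinds:
        return "yellow"   # review the functional difference
    return "green"        # no markers found


# Made-up input for illustration only.
print(classify_converted_sql("SELECT 1; -- SSC-FDM-0007"))  # yellow
```

A file with both kinds of marker is reported as red, since the stricter category decides whether it can be deployed.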
+ +**Assessment reports generated:** +- Conversion summary statistics +- Object-by-object conversion status +- Complexity analysis and recommendations +- Migration effort estimates + +**CHECKPOINT:** +> Here are your conversion results: [X]% converted automatically, [Y] objects green, [Z] yellow, [W] red. Let's review the red items — these need manual attention. + +**For EWI errors:** +1. Examine the flagged code in SnowConvert AI or your IDE +2. Fix the converted source code manually +3. Unit test the corrected file +4. Re-run conversion if needed + +### Part 5: AI Verification (Optional) + +**Introduce to user:** +> SnowConvert AI has an AI Verification feature (currently in Public Preview) that can automatically test and fix conversion errors. AI agents execute your converted code in your Snowflake account and fix issues, grounded with tests over synthetic data. +> +> This is optional — you can skip it if you prefer to review manually. + +**If user wants AI Verification:** +1. Select objects to verify (dependencies auto-selected) +2. Click **"VERIFY CODE"** — review disclaimers (AI executes in your Snowflake account via Cortex Complete) +3. Wait for verification (may take significant time for large codebases) +4. Review AI results: summary of fixes + per-object details ("SEE DETAILS") +5. Merge AI fixes with initial conversion (manual review required) + +### Part 6: Deploy to Snowflake + +**Guide the user through deployment:** + +> Your code is ready to deploy. Let me walk you through the process. 
+ +**Pre-deployment checklist:** +- [ ] All EWI errors resolved +- [ ] FDM warnings reviewed for acceptability +- [ ] Only successfully converted objects selected +- [ ] Deployment dependencies considered + +**Authenticate** to Snowflake: +- SSO Authentication (enterprise identity) +- Standard Authentication (username + password + MFA) +- Account format: `orgname-account-name` + +**Configure target:** +``` +Account: myorg-myaccount +Warehouse: MIGRATION_WH +Database: TARGET_DB +Schema: PUBLIC +Role: MIGRATION_ROLE +``` + +**Deployment executes automatically** in dependency order: +1. Databases → 2. Schemas → 3. Tables → 4. Views → 5. Functions → 6. Stored Procedures + +**CHECKPOINT:** +> Deployment complete! [X] objects deployed successfully, [Y] failures. Let me help address any failures before we move on. + +## Session Wrap-Up + +**Present to user:** +> SnowConvert AI conversion complete! Here's what we accomplished: +> - [X] objects extracted from [source platform] +> - [Y]% automated conversion rate +> - [Z] objects deployed to Snowflake +> - [W] items requiring manual attention (documented) + +## Next Steps + +**Ask the user** what they need next: +- Load data into the new tables → read `data-migration/SKILL.md` +- Convert SSIS packages → read `ssis-replatform/SKILL.md` +- Repoint Power BI reports → read `powerbi-repointing/SKILL.md` +- Validate migrated data → read `data-migration/SKILL.md` (Part 5) + +## Deliverables + +- SnowConvert AI project file (.snowct) +- Converted Snowflake SQL in output folder +- Assessment reports (conversion stats, complexity, effort estimates) +- Deployed objects in target Snowflake environment diff --git a/skills/agentic-migration-workshop/ssis-replatform/INSTRUCTIONS.md b/skills/agentic-migration-workshop/ssis-replatform/INSTRUCTIONS.md new file mode 100644 index 00000000..af5e2200 --- /dev/null +++ b/skills/agentic-migration-workshop/ssis-replatform/INSTRUCTIONS.md @@ -0,0 +1,136 @@ + +# Workshop Session: SSIS Re-platforming 
(SQL Server ETL Migration) + +## Session Overview + +**Present to user:** +> Welcome to the **SSIS Re-platforming** session. We'll convert your SQL Server Integration Services packages into Snowflake-native components: **Snowflake Tasks** for orchestration, **stored procedures** for control flow, and **dbt projects** for data transformations. +> +> This is one of the more complex parts of a SQL Server migration, but SnowConvert AI handles the heavy lifting. Here's our plan: +> 1. Set up a SnowConvert AI project for SSIS replatforming +> 2. Run the automated conversion +> 3. Deploy the converted components +> 4. Validate that your ETL logic produces the same results +> +> Let's get started. + +## Prerequisites +- SnowConvert AI installed +- Valid `.dtsx` SSIS package files +- All dependent database objects (DDL scripts for tables, views, functions, procedures referenced by SSIS packages) + +## Session Flow + +### Part 1: Project Setup + +**Guide the user:** + +> In SnowConvert AI: +> 1. Create a New Project +> 2. **Important:** Select "SQL Server" as the source platform +> 3. In extraction configuration, select the **"Replatform"** option — this tells SnowConvert AI to treat your SSIS packages specially +> 4. Point to your `.dtsx` files +> 5. Optionally include dependent DDL scripts (recommended — this gives SnowConvert AI more context for better conversion) + +### Part 2: Run Conversion + +**Execute** the standard SnowConvert AI conversion workflow. 
+ +**Explain the output structure:** +> SnowConvert AI separates your SSIS packages into two clean components: +> - **Control Flow Orchestration** → SQL scripts with Snowflake Tasks and stored procedures +> - **Data Flow Tasks** → Individual dbt projects per data flow component + +``` +output/ +└── ETL/ + └── [Package_Name]/ + ├── Orchestration.sql # Control flow → Tasks + Procedures + ├── Data Flow Task 1/ + │ ├── dbt_project.yml + │ ├── models/ + │ └── sources/ + └── [Additional Data Flow Tasks]/ +``` + +**Review conversion reports:** +- `ETL.Elements.NA.csv` — Details about converted ETL elements +- `ETL.Issues.NA.csv` — Issues encountered during conversion + +**CHECKPOINT:** +> Here's what SnowConvert AI produced from your [N] SSIS packages. Let me walk you through the conversion results before we deploy. + +### Part 3: Deploy Converted Components + +**Deploy Snowflake Tasks and stored procedures** from `Orchestration.sql` + +**Deploy dbt projects:** +1. Set up dbt profiles for Snowflake connection +2. Run `dbt run` for each data flow project +3. Fix any failing components +4. Validate model compilation and execution + +### Part 4: Validate ETL Logic + +**Explain to user:** +> The most important step: making sure your converted ETL produces the same results as your SSIS packages. + +1. Run converted pipelines with test data +2. Compare output against SSIS execution results +3. Document any behavioral differences +4. Fix and re-test as needed + +**CHECKPOINT:** +> ETL validation complete. Here's the comparison of SSIS vs. Snowflake output for each pipeline. Does everything look correct? 
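
The comparison step in Part 4 can be sketched as an order-independent diff of two result sets. This is a minimal illustration, assuming both outputs have already been fetched as lists of row tuples with the same column order; `compare_result_sets` is a hypothetical helper, not a SnowConvert AI or Snowflake API.

```python
from collections import Counter


def compare_result_sets(ssis_rows, snowflake_rows):
    """Compare two row collections while ignoring row order.

    Duplicate rows are counted, so a row that appears twice on one side
    and once on the other is reported as a difference.
    """
    ssis_counts = Counter(ssis_rows)
    snow_counts = Counter(snowflake_rows)
    return {
        "row_count_match": len(ssis_rows) == len(snowflake_rows),
        "only_in_ssis": list((ssis_counts - snow_counts).elements()),
        "only_in_snowflake": list((snow_counts - ssis_counts).elements()),
    }


ssis = [(1, "a"), (2, "b"), (2, "b")]
snow = [(2, "b"), (1, "a"), (3, "c")]
print(compare_result_sets(ssis, snow))
```

Matching row counts alone are not enough: in the sample above both sides have three rows, yet one row differs on each side.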
+ +## Estimation Reference + +**Share with user for planning:** + +> For context, here's what a typical SSIS re-platforming looks like: + +**Sample Timeline (100 SSIS Jobs, team of 2-3 developers + 1 architect):** + +| Phase | Duration | Key Activities | +|-------|----------|---------------| +| Assessment & Inventory | 1 week | SnowConvert AI analyzes all packages, T-SQL, metadata | +| Pipeline Design & Setup | 2 weeks | Architect designs ELT flow, reusable Snowpark procedures | +| Conversion & Remediation | 4-6 weeks | Rewrite high-risk logic into Dynamic Tables or dbt models | +| Integration & Testing (SIT) | 3-4 weeks | Full functional testing, data integrity checks | +| **Total** | **10-13 weeks** | Production-ready ELT platform | + +**Complexity variability:** +- Simple packages (source-to-target copies): closer to 8 weeks +- Complex packages (dense T-SQL, error handling): could exceed 13 weeks + +**Re-platforming strategy by component:** + +| SSIS Component | Snowflake Target | Complexity | +|---------------|-----------------|-----------| +| Simple data flows | Dynamic Tables | Low-Medium | +| Complex control flow | Snowpark Python Stored Procedures | High | +| Connection managers | Storage integrations / stages | Low | +| SQL tasks with T-SQL | Snowflake SQL tasks / procedures | Medium | +| Custom .NET code tasks | Snowpark Python UDFs | High | +| SSIS orchestration | Snowflake Tasks (DAGs) | Medium | + +## Session Wrap-Up + +**Present to user:** +> SSIS Re-platforming complete! 
Your [N] SSIS packages have been converted to: +> - [X] Snowflake Tasks for orchestration +> - [Y] stored procedures for control flow +> - [Z] dbt projects for data transformations +> - All validated against original SSIS output + +## Best Practices +- Document SSIS package dependencies and custom components before starting +- Include all dependent DDL for better conversion quality +- Review `ETL.Elements.NA.csv` and `ETL.Issues.NA.csv` thoroughly +- Test each converted data flow independently before orchestrating + +## Deliverables +- Snowflake Tasks + stored procedures (control flow) +- dbt projects (data flow) +- Conversion detail CSVs +- Validation comparison results
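
## Example: Locating dbt Projects for Deployment

As an illustration of the per-project `dbt run` step in Part 3, the sketch below walks an output tree shaped like the one shown in Part 2 and builds one `dbt run --project-dir` command per data flow project. It assumes each project is marked by a `dbt_project.yml` file, as in that tree. `find_dbt_projects` and `build_dbt_commands` are hypothetical helper names, and the demo only builds a throwaway directory and prints the commands; it does not invoke dbt.

```python
import tempfile
from pathlib import Path


def find_dbt_projects(etl_root):
    """Return directories under etl_root that contain a dbt_project.yml."""
    root = Path(etl_root)
    return sorted(
        p.parent
        for p in root.rglob("dbt_project.yml")
        # Skip installed dbt packages, which ship their own dbt_project.yml.
        if "dbt_packages" not in p.parts
    )


def build_dbt_commands(project_dirs):
    """Build one argument list per project, suitable for subprocess.run."""
    return [["dbt", "run", "--project-dir", str(d)] for d in project_dirs]


# Demo on a temporary tree that mimics the documented output layout.
with tempfile.TemporaryDirectory() as tmp:
    package = Path(tmp) / "output" / "ETL" / "Package_Name"
    for name in ("Data Flow Task 1", "Data Flow Task 2"):
        project = package / name
        project.mkdir(parents=True)
        (project / "dbt_project.yml").write_text("name: demo\n")
    (package / "Orchestration.sql").write_text("-- tasks and procedures\n")

    projects = find_dbt_projects(Path(tmp) / "output" / "ETL")
    commands = build_dbt_commands(projects)

for cmd in commands:
    print(cmd)
```

Keeping each command as an argument list rather than a single string avoids quoting problems with directory names that contain spaces, such as `Data Flow Task 1`. Each list could then be passed to `subprocess.run(cmd, check=True)` once dbt profiles are configured.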