
Conversation

@jeff-schnitter (Collaborator)

Summary

This PR dramatically improves the performance of backup import/export commands and fixes the production deploy workflow to correctly use CORTEX_BASE_URL.

Performance Improvements

Backup Export Performance:

  • Full tenant export: 3m29s → 31.75s (6.6x faster)
  • Plugin export: 2m24s → 2.1s (68x faster)
  • Test suite: 64s → 39.62s (1.6x faster)

Key Changes

1. Parallel Processing with ThreadPoolExecutor

  • Added parallel fetching for plugins, scorecards, catalog, and workflows using ThreadPoolExecutor with 30 workers
  • Implemented result collection and alphabetical sorting for consistent output
  • Errors on individual fetches are reported without blocking the entire export (see the sketch after this list)
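
A minimal sketch of the pattern, assuming a fetch callable per item; the helper name `export_items` and the `fetch_one`/`tags` parameters are illustrative, not the actual backup.py signatures:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def export_items(client, tags, fetch_one, max_workers=30):
    """Fetch every item concurrently; collect failures instead of aborting the export."""
    results, failures = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        future_to_tag = {executor.submit(fetch_one, client, tag): tag for tag in tags}
        for future in as_completed(future_to_tag):
            tag = future_to_tag[future]
            try:
                results[tag] = future.result()
            except Exception as exc:  # one failed fetch should not block the rest
                failures[tag] = exc
    return results, failures
```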

2. HTTP Connection Pooling

  • Added requests.Session() with connection pooling in CortexClient
  • Configured HTTPAdapter with pool_maxsize=50 to support concurrent requests
  • Added automatic retry logic for transient 500/502/503/504 errors
  • Eliminates per-request TCP connection overhead (DNS lookup, TCP handshake, TLS negotiation); see the sketch after this list
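
A sketch of the pooled-session setup using the standard requests/urllib3 pattern; the pool sizes and retried status codes come from this PR's description, but the exact CortexClient wiring may differ:

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def build_session(pool_maxsize=50):
    """One shared Session so DNS/TCP/TLS setup happens once, not on every request."""
    retry = Retry(
        total=3,
        backoff_factor=1,
        status_forcelist=[500, 502, 503, 504],
        allowed_methods=["GET", "POST", "PUT", "DELETE"],
    )
    adapter = HTTPAdapter(pool_connections=10, pool_maxsize=pool_maxsize, max_retries=retry)
    session = requests.Session()
    session.mount("https://", adapter)
    session.mount("http://", adapter)
    return session
```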

3. Optimized API Calls

  • Workflows export now uses a single list API call with include_actions=true instead of N individual get calls
  • For a tenant with N workflows, this reduces roughly N+1 API calls to one (see the sketch after this list)
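
A sketch of the workflows export, assuming a list method and response shape like those described in the commit notes below; the method name, parameter, and field names are illustrative:

```python
from pathlib import Path
import yaml  # PyYAML

def export_workflows(workflows_client, export_dir: Path):
    """One list call with actions included, instead of one get per workflow."""
    listing = workflows_client.list(include_actions="true")  # assumed signature, per the commit notes
    for workflow in sorted(listing.get("workflows", []), key=lambda w: w["tag"]):
        out_file = export_dir / f"{workflow['tag']}.yaml"
        out_file.write_text(yaml.safe_dump(workflow, sort_keys=False))
```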

4. Fixed File Writing Performance

  • Replaced Rich print() with direct json.dump() for JSON file writing (see the sketch after this list)
  • Eliminates the Rich formatting overhead, which was adding 220+ seconds; writes now take ~0.02s
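
The change amounts to writing the payload with the standard library instead of routing it through Rich's console; a hedged sketch:

```python
import json

def write_export_file(path, data):
    # Before (slow): rich.print(data, file=handle) re-highlights the entire payload.
    # After: plain json.dump, which is bounded by disk I/O only.
    with open(path, "w") as handle:
        json.dump(data, handle, indent=2)
```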

5. Alphabetical Output Ordering

  • Export/import operations now display results in alphabetical order
  • Consistent ordering makes it easier to spot and debug failed exports (see the sketch after this list)
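
Sorting affects only the reporting, not the execution order; a sketch using the `results`/`failures` dictionaries from the parallel-fetch sketch above (the Rich console and markup are illustrative):

```python
def report_results(console, results, failures):
    """Print per-item status alphabetically so successive runs are easy to diff."""
    for tag in sorted(results):
        console.print(f"--> {tag}")
    for tag in sorted(failures):
        console.print(f"[red]--> {tag}: {failures[tag]}[/red]")
```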

6. Production Deploy Fix

  • Fixed .github/workflows/publish.yml to include CORTEX_BASE_URL in all container jobs
  • Ensures deploy events post correctly to Cortex after PyPI/Docker/Homebrew publishes

7. Test Reliability

  • Added retry logic with exponential backoff to scorecard create operations in the tests
  • Handles transient 500 errors that occur when a scorecard has an active evaluation running
  • Up to 3 attempts, with 1s and 2s delays between retries (see the sketch after this list)
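
A sketch of the test-side retry, assuming a callable that invokes the CLI; the helper name and the way the create is invoked are illustrative, not the exact test code:

```python
import time

def create_scorecard_with_retry(run_cli, scorecard_file, max_attempts=3):
    """Retry transient 500s caused by an in-flight scorecard evaluation."""
    for attempt in range(1, max_attempts + 1):
        try:
            return run_cli(["scorecards", "create", "-f", scorecard_file])
        except Exception:
            if attempt == max_attempts:
                raise
            time.sleep(2 ** (attempt - 1))  # 1s, then 2s, matching the delays above
```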

Files Changed

  • cortexapps_cli/commands/backup.py - Parallel processing, optimized exports, alphabetical ordering
  • cortexapps_cli/cortex_client.py - HTTP connection pooling and session management
  • .github/workflows/publish.yml - Fixed CORTEX_BASE_URL configuration
  • tests/test_scorecards.py - Retry logic for active evaluation race conditions

Test Results

All tests passing (218 passed, 1 skipped) with 78% coverage.

Related Issues

Fixes #154

jeff-schnitter and others added 13 commits November 4, 2025 11:25
Fix for missing base_url in cortex config file
Implemented parallel API calls using ThreadPoolExecutor for backup export
and import operations, significantly improving performance.

Changes:
- Added ThreadPoolExecutor with max_workers=10 for concurrent API calls
- Updated _export_plugins(), _export_scorecards(), _export_workflows() to
  fetch items in parallel
- Updated _import_catalog(), _import_plugins(), _import_scorecards(),
  _import_workflows() to import files in parallel
- Enhanced error handling to report failures without stopping entire operation
- Maintained file ordering where applicable

Performance improvements:
- Export operations now run with concurrent API calls
- Import operations process multiple files simultaneously
- All existing tests pass (218 passed, 1 skipped)

Fixes #154

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Added CORTEX_BASE_URL from GitHub Actions variables to the publish workflow,
following the same pattern as test-pr.yml.

Changes:
- Added CORTEX_BASE_URL to top-level env section
- Added env sections to pypi-deploy-event job to pass CORTEX_API_KEY and
  CORTEX_BASE_URL to container
- Added env sections to docker-deploy-event job to pass CORTEX_API_KEY and
  CORTEX_BASE_URL to container
- Added env sections to homebrew-custom-event job to pass CORTEX_API_KEY and
  CORTEX_BASE_URL to container

This fixes the 401 Unauthorized and base_url errors when posting deploy
events to Cortex during the publish workflow.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
…PI calls

Further performance optimizations for backup export:

1. Increased ThreadPoolExecutor workers from 10 to 30
   - Network I/O bound operations can handle more parallelism
   - Should provide 2-3x improvement for plugins and scorecards

2. Eliminated N individual API calls for workflows export
   - Changed workflows.list() to use include_actions="true"
   - Single API call now returns all workflow data with actions
   - Convert JSON to YAML format directly without individual get() calls
   - This eliminates N network round-trips for N workflows

Expected performance improvements:
- Workflows: Near-instant (1 API call vs N calls)
- Plugins/Scorecards: 2-3x faster with 30 workers vs 10

Previous timing: 2m19s (with 10 workers)
Original timing: 3m29s

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Modified all parallel export and import functions to collect results first,
then sort alphabetically before writing/printing. This makes debugging
failed exports much easier while maintaining parallel execution performance.

Changes:
- Export functions (plugins, scorecards) now collect all results, sort by
  tag, then write files in alphabetical order
- Import functions (catalog, plugins, scorecards, workflows) now collect
  all results, sort by filename, then print in alphabetical order
- Maintains parallel execution speed - only the output order is affected

Example output now shows consistent alphabetical ordering:
  --> about-learn-cortex
  --> bogus-plugin
  --> developer-relations-plugin
  --> google-plugin
  --> map-test
  --> my-cortex-plugin
  ...

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
This is the critical fix for slow parallel export/import performance.

Problem:
- CortexClient was calling requests.request() directly without a session
- Each API call created a new TCP connection (DNS lookup, TCP handshake, TLS)
- Even with 30 parallel threads, each request was slow (~3+ seconds)
- 44 plugins took 2m24s (no parallelism benefit)

Solution:
- Created requests.Session() in __init__ with connection pooling
- Configured HTTPAdapter with pool_maxsize=50 for concurrent requests
- Added automatic retries for transient failures (500, 502, 503, 504)
- All requests now reuse existing TCP connections

Expected impact:
- First request: normal latency (connection setup)
- Subsequent requests: dramatically faster (connection reuse)
- With 30 workers: should see ~30x speedup for I/O bound operations
- 44 plugins: should drop from 2m24s to ~5-10 seconds

Technical details:
- pool_connections=10: number of connection pools to cache
- pool_maxsize=50: max connections per pool (supports 30+ parallel workers)
- Retry with backoff for transient server errors

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
The Rich library's print() was being used to write JSON to files,
causing massive slowdowns (1-78 seconds per file!). Using json.dump()
directly should reduce write time to milliseconds.
Scorecard create operations can fail with 500 errors if there's an active
evaluation running. This is a race condition that occurs when:
1. test_import.py creates/updates a scorecard
2. An evaluation is triggered automatically or by another test
3. test_scorecards.py tries to update the same scorecard

Added exponential backoff retry logic (1s, 2s) with max 3 attempts to
handle these transient 500 errors gracefully.
jeff-schnitter merged commit 9271326 into staging Nov 5, 2025
2 checks passed
jeff-schnitter deleted the 154-improve-performance-of-backup-import-and-export-commands branch November 5, 2025 00:35