
Conversation

@jeff-schnitter (Collaborator)

Summary

This PR dramatically improves the performance of backup import/export commands and fixes the production deploy workflow to correctly use CORTEX_BASE_URL.

Performance Improvements

Backup Export Performance:

  • Full tenant export: 3m29s → 31.75s (6.6x faster)
  • Plugin export: 2m24s → 2.1s (68x faster)
  • Test suite: 64s → 39.62s (1.6x faster)

Key Changes

1. Parallel Processing with ThreadPoolExecutor

  • Added parallel fetching for plugins, scorecards, catalog, and workflows using ThreadPoolExecutor with 30 workers
  • Implemented result collection and alphabetical sorting for consistent output
  • Errors on individual fetches are reported without blocking the entire export (see the sketch after this list)
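
A minimal sketch of the pattern, assuming a fetch callable per item; the helper name `export_items` and the `fetch_one`/`tags` parameters are illustrative, not the actual backup.py signatures:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def export_items(client, tags, fetch_one, max_workers=30):
    """Fetch every item concurrently; collect failures instead of aborting the export."""
    results, failures = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        future_to_tag = {executor.submit(fetch_one, client, tag): tag for tag in tags}
        for future in as_completed(future_to_tag):
            tag = future_to_tag[future]
            try:
                results[tag] = future.result()
            except Exception as exc:  # one failed fetch should not block the rest
                failures[tag] = exc
    return results, failures
```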

2. HTTP Connection Pooling

  • Added requests.Session() with connection pooling in CortexClient
  • Configured HTTPAdapter with pool_maxsize=50 to support concurrent requests
  • Added automatic retry logic for transient 500/502/503/504 errors
  • Eliminates per-request TCP connection overhead (DNS lookup, TCP handshake, TLS negotiation); see the sketch after this list
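
A sketch of the pooled-session setup using the standard requests/urllib3 pattern; the pool sizes and retried status codes come from this PR's description, but the exact CortexClient wiring may differ:

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def build_session(pool_maxsize=50):
    """One shared Session so DNS/TCP/TLS setup happens once, not on every request."""
    retry = Retry(
        total=3,
        backoff_factor=1,
        status_forcelist=[500, 502, 503, 504],
        allowed_methods=["GET", "POST", "PUT", "DELETE"],
    )
    adapter = HTTPAdapter(pool_connections=10, pool_maxsize=pool_maxsize, max_retries=retry)
    session = requests.Session()
    session.mount("https://", adapter)
    session.mount("http://", adapter)
    return session
```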

3. Optimized API Calls

  • Workflows export now uses a single list API call with include_actions=true instead of N individual get calls
  • For a tenant with N workflows, this reduces roughly N+1 API calls to one (see the sketch after this list)
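
A sketch of the workflows export, assuming a list method and response shape like those described in the commit notes below; the method name, parameter, and field names are illustrative:

```python
from pathlib import Path
import yaml  # PyYAML

def export_workflows(workflows_client, export_dir: Path):
    """One list call with actions included, instead of one get per workflow."""
    listing = workflows_client.list(include_actions="true")  # assumed signature, per the commit notes
    for workflow in sorted(listing.get("workflows", []), key=lambda w: w["tag"]):
        out_file = export_dir / f"{workflow['tag']}.yaml"
        out_file.write_text(yaml.safe_dump(workflow, sort_keys=False))
```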

4. Fixed File Writing Performance

  • Replaced Rich print() with direct json.dump() for JSON file writing (see the sketch after this list)
  • Eliminates the Rich formatting overhead, which was adding 220+ seconds; writes now take ~0.02s
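
The change amounts to writing the payload with the standard library instead of routing it through Rich's console; a hedged sketch:

```python
import json

def write_export_file(path, data):
    # Before (slow): rich.print(data, file=handle) re-highlights the entire payload.
    # After: plain json.dump, which is bounded by disk I/O only.
    with open(path, "w") as handle:
        json.dump(data, handle, indent=2)
```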

5. Alphabetical Output Ordering

  • Export/import operations now display results in alphabetical order
  • Consistent ordering makes it easier to spot and debug failed exports (see the sketch after this list)
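
Sorting affects only the reporting, not the execution order; a sketch using the `results`/`failures` dictionaries from the parallel-fetch sketch above (the Rich console and markup are illustrative):

```python
def report_results(console, results, failures):
    """Print per-item status alphabetically so successive runs are easy to diff."""
    for tag in sorted(results):
        console.print(f"--> {tag}")
    for tag in sorted(failures):
        console.print(f"[red]--> {tag}: {failures[tag]}[/red]")
```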

6. Production Deploy Fix

  • Fixed .github/workflows/publish.yml to include CORTEX_BASE_URL in all container jobs
  • Ensures deploy events post correctly to Cortex after PyPI/Docker/Homebrew publishes

7. Test Reliability

  • Added retry logic with exponential backoff to scorecard create operations in the tests
  • Handles transient 500 errors that occur when a scorecard has an active evaluation running
  • Up to 3 attempts, with 1s and 2s delays between retries (see the sketch after this list)
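
A sketch of the test-side retry, assuming a callable that invokes the CLI; the helper name and the way the create is invoked are illustrative, not the exact test code:

```python
import time

def create_scorecard_with_retry(run_cli, scorecard_file, max_attempts=3):
    """Retry transient 500s caused by an in-flight scorecard evaluation."""
    for attempt in range(1, max_attempts + 1):
        try:
            return run_cli(["scorecards", "create", "-f", scorecard_file])
        except Exception:
            if attempt == max_attempts:
                raise
            time.sleep(2 ** (attempt - 1))  # 1s, then 2s, matching the delays above
```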

Files Changed

  • cortexapps_cli/commands/backup.py - Parallel processing, optimized exports, alphabetical ordering
  • cortexapps_cli/cortex_client.py - HTTP connection pooling and session management
  • .github/workflows/publish.yml - Fixed CORTEX_BASE_URL configuration
  • tests/test_scorecards.py - Retry logic for active evaluation race conditions

Test Results

All tests passing (218 passed, 1 skipped) with 78% coverage.

Related Issues

Fixes #154

jeff-schnitter and others added 13 commits November 4, 2025 11:25
Fix for missing base_url in cortex config file
Implemented parallel API calls using ThreadPoolExecutor for backup export
and import operations, significantly improving performance.

Changes:
- Added ThreadPoolExecutor with max_workers=10 for concurrent API calls
- Updated _export_plugins(), _export_scorecards(), _export_workflows() to
  fetch items in parallel
- Updated _import_catalog(), _import_plugins(), _import_scorecards(),
  _import_workflows() to import files in parallel
- Enhanced error handling to report failures without stopping entire operation
- Maintained file ordering where applicable

Performance improvements:
- Export operations now run with concurrent API calls
- Import operations process multiple files simultaneously
- All existing tests pass (218 passed, 1 skipped)

Fixes #154

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Added CORTEX_BASE_URL from GitHub Actions variables to the publish workflow,
following the same pattern as test-pr.yml.

Changes:
- Added CORTEX_BASE_URL to top-level env section
- Added env sections to pypi-deploy-event job to pass CORTEX_API_KEY and
  CORTEX_BASE_URL to container
- Added env sections to docker-deploy-event job to pass CORTEX_API_KEY and
  CORTEX_BASE_URL to container
- Added env sections to homebrew-custom-event job to pass CORTEX_API_KEY and
  CORTEX_BASE_URL to container

This fixes the 401 Unauthorized and base_url errors when posting deploy
events to Cortex during the publish workflow.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
…PI calls

Further performance optimizations for backup export:

1. Increased ThreadPoolExecutor workers from 10 to 30
   - Network I/O bound operations can handle more parallelism
   - Should provide 2-3x improvement for plugins and scorecards

2. Eliminated N individual API calls for workflows export
   - Changed workflows.list() to use include_actions="true"
   - Single API call now returns all workflow data with actions
   - Convert JSON to YAML format directly without individual get() calls
   - This eliminates N network round-trips for N workflows

Expected performance improvements:
- Workflows: Near-instant (1 API call vs N calls)
- Plugins/Scorecards: 2-3x faster with 30 workers vs 10

Previous timing: 2m19s (with 10 workers)
Original timing: 3m29s

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Modified all parallel export and import functions to collect results first,
then sort alphabetically before writing/printing. This makes debugging
failed exports much easier while maintaining parallel execution performance.

Changes:
- Export functions (plugins, scorecards) now collect all results, sort by
  tag, then write files in alphabetical order
- Import functions (catalog, plugins, scorecards, workflows) now collect
  all results, sort by filename, then print in alphabetical order
- Maintains parallel execution speed - only the output order is affected

Example output now shows consistent alphabetical ordering:
  --> about-learn-cortex
  --> bogus-plugin
  --> developer-relations-plugin
  --> google-plugin
  --> map-test
  --> my-cortex-plugin
  ...

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
This is the critical fix for slow parallel export/import performance.

Problem:
- CortexClient was calling requests.request() directly without a session
- Each API call created a new TCP connection (DNS lookup, TCP handshake, TLS)
- Even with 30 parallel threads, each request was slow (~3+ seconds)
- 44 plugins took 2m24s (no parallelism benefit)

Solution:
- Created requests.Session() in __init__ with connection pooling
- Configured HTTPAdapter with pool_maxsize=50 for concurrent requests
- Added automatic retries for transient failures (500, 502, 503, 504)
- All requests now reuse existing TCP connections

Expected impact:
- First request: normal latency (connection setup)
- Subsequent requests: dramatically faster (connection reuse)
- With 30 workers: should see ~30x speedup for I/O bound operations
- 44 plugins: should drop from 2m24s to ~5-10 seconds

Technical details:
- pool_connections=10: number of connection pools to cache
- pool_maxsize=50: max connections per pool (supports 30+ parallel workers)
- Retry with backoff for transient server errors

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
The Rich library's print() was being used to write JSON to files,
causing massive slowdowns (1-78 seconds per file!). Using json.dump()
directly should reduce write time to milliseconds.
Scorecard create operations can fail with 500 errors if there's an active
evaluation running. This is a race condition that occurs when:
1. test_import.py creates/updates a scorecard
2. An evaluation is triggered automatically or by another test
3. test_scorecards.py tries to update the same scorecard

Added exponential backoff retry logic (1s, 2s) with max 3 attempts to
handle these transient 500 errors gracefully.
jeff-schnitter merged commit 9271326 into staging Nov 5, 2025
2 checks passed
jeff-schnitter deleted the 154-improve-performance-of-backup-import-and-export-commands branch November 5, 2025 00:35