docs/cloud/guides/production-readiness.md
Lines changed: 20 additions & 20 deletions
Original file line number
Diff line number
Diff line change
@@ -35,47 +35,47 @@ Your responsibilities for enterprise production readiness:
35
35
- Establish backup validation and disaster recovery procedures
36
36
- Configure cost management and billing integration
37
37
38
-
This guide walks through each area, helping you transition from a working ClickHouse Cloud deployment to an enterprise-ready system.
38
+
This guide walks you through each area, helping you transition from a working ClickHouse Cloud deployment to an enterprise-ready system.
39
39
40
-
## Environment Strategy {#environment-strategy}
40
+
## Environment strategy {#environment-strategy}
41
41
42
42
Establish separate environments to safely test changes before impacting production workloads. Most production incidents trace back to untested queries or configuration changes deployed directly to production systems.
43
43
44
-
**Environment Structure**: Maintain production (live workloads), staging (production-equivalent validation), and development (individual/team experimentation) environments.
44
+
**Environment structure**: Maintain production (live workloads), staging (production-equivalent validation), and development (individual/team experimentation) environments.
45
45
46
46
**Testing**: Test queries in staging before production deployment. Queries that work on small datasets often cause memory exhaustion, excessive CPU usage, or slow execution at production scale. Validate configuration changes including user permissions, quotas, and service settings in staging—configuration errors discovered in production create immediate operational incidents.
47
47
48
48
**Sizing**: Size your staging service to approximate production load characteristics. Testing on significantly smaller infrastructure may not reveal resource contention or scaling issues. Use production-representative datasets through periodic data refreshes or synthetic data generation.
49
49
50
-
## Enterprise Authentication and User Management {#enterprise-authentication}
50
+
## Enterprise authentication and user management {#enterprise-authentication}
51
51
52
52
Moving from console-based user management to enterprise authentication integration is essential for production readiness.
53
53
54
-
### SSO/SAML Setup {#sso-saml-setup}
54
+
### SSO/SAML setup {#sso-saml-setup}
55
55
56
56
Enterprise tier ClickHouse Cloud supports SAML integration with identity providers including Okta, Azure Active Directory, and Google Workspace. SAML configuration requires coordination with ClickHouse support and involves providing your IdP metadata and configuring attribute mappings.
57
57
58
58
:::note Important limitation
59
59
Users authenticated through SAML are assigned the "Member" role by default and must be manually granted additional roles by an admin after their first login. Group-to-role mapping and automatic role assignment are not currently supported.
60
60
:::
61
61
62
-
### Access Control Design {#access-control-design}
62
+
### Access control design {#access-control-design}
63
63
64
64
ClickHouse Cloud uses organization-level roles (Admin, Developer, Billing, Member) and service/database-level roles (Service Admin, Read Only, SQL console roles). Design roles around job functions applying the principle of least privilege:
65
65
66
-
- **Application Users**: Service accounts with specific database and table access
67
-
- **Analyst Users**: Read-only access to curated datasets and reporting views
68
-
- **Admin Users**: Full administrative capabilities
66
+
- **Application users**: Service accounts with specific database and table access
67
+
- **Analyst users**: Read-only access to curated datasets and reporting views
68
+
- **Admin users**: Full administrative capabilities
69
69
70
70
Configure quotas, limits, and settings profiles to manage resource usage for different users and roles. Set memory and execution time limits to prevent individual queries from impacting system performance. Monitor resource usage through audit, session, and query logs to identify users or applications that frequently hit limits. Conduct regular access reviews using ClickHouse Cloud's audit capabilities.
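
As an illustration of the monitoring step above, the following is a minimal sketch that counts limit-related failures per user. It assumes you have already exported rows containing the `user` and `exception_code` columns from `system.query_log`; the function name `users_hitting_limits`, the `threshold` parameter, and the sample rows are hypothetical and exist only for this example.

```python
from collections import Counter

# ClickHouse error codes that correspond to the limits discussed above.
LIMIT_ERROR_CODES = {
    159: "TIMEOUT_EXCEEDED",
    201: "QUOTA_EXCEEDED",
    241: "MEMORY_LIMIT_EXCEEDED",
}


def users_hitting_limits(rows, threshold=3):
    """Return {user: count} for users with at least `threshold` limit errors."""
    counts = Counter(
        row["user"] for row in rows if row["exception_code"] in LIMIT_ERROR_CODES
    )
    return {user: n for user, n in counts.items() if n >= threshold}


# Hypothetical sample rows; real rows would come from your own query log export.
sample_rows = [
    {"user": "etl_app", "exception_code": 241},
    {"user": "etl_app", "exception_code": 241},
    {"user": "etl_app", "exception_code": 159},
    {"user": "analyst_a", "exception_code": 201},
    {"user": "analyst_a", "exception_code": 0},
]

frequent_offenders = users_hitting_limits(sample_rows)
print(frequent_offenders)
```

A suitable threshold depends on your workload; users who repeatedly appear in the result are candidates for query tuning or an adjusted settings profile.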
71
71
72
-
### User Lifecycle Management Limitations {#user-lifecycle-management}
72
+
### User lifecycle management limitations {#user-lifecycle-management}
73
73
74
74
ClickHouse Cloud does not currently support SCIM or automated provisioning/deprovisioning via identity providers. Users must be manually removed from the ClickHouse Cloud console after being removed from your IdP. Plan for manual user management processes until these features become available.
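
Because deprovisioning is manual, a periodic reconciliation can help you spot accounts that should be removed. The sketch below assumes you can export one list of active user emails from your IdP and another from the ClickHouse Cloud console; the function name `find_stale_users` and the sample addresses are hypothetical.

```python
def find_stale_users(idp_emails, cloud_emails):
    """Return Cloud users that no longer exist in the IdP, sorted by email.

    Emails are compared case-insensitively with surrounding whitespace removed.
    """
    active_in_idp = {email.strip().lower() for email in idp_emails}
    cloud_users = {email.strip().lower() for email in cloud_emails}
    return sorted(cloud_users - active_in_idp)


# Hypothetical exports, used only for illustration.
idp_export = ["alice@example.com", "Bob@example.com"]
cloud_export = ["alice@example.com", "bob@example.com", "carol@example.com"]

stale_users = find_stale_users(idp_export, cloud_export)
print(stale_users)
```

Each address in the result is a candidate for manual removal from the ClickHouse Cloud console.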
75
75
76
76
Learn more about [Cloud Access Management](/cloud/security/cloud_access_management) and [SAML SSO setup](/cloud/security/saml-setup).
77
77
78
-
## Infrastructure as Code and Automation {#infrastructure-as-code}
78
+
## Infrastructure as code and automation {#infrastructure-as-code}
79
79
80
80
Managing ClickHouse Cloud through infrastructure-as-code practices and API automation provides consistency, version control, and repeatability for your deployment configuration.
81
81
@@ -103,9 +103,9 @@ provider "clickhouse" {
103
103
104
104
The Terraform provider supports service provisioning, IP access lists, and user management. Note that the provider does not currently support importing existing services or explicit backup configuration. For features not covered by the provider, manage them through the console or contact ClickHouse support.
105
105
106
-
For comprehensive examples including service configuration and network access controls, see [Terraform example on how to use Cloud API](https://clickhouse.com/docs/knowledgebase/terraform_example).
106
+
For comprehensive examples including service configuration and network access controls, see [Terraform example on how to use Cloud API](/knowledgebase/terraform_example).
107
107
108
-
### Cloud API Integration {#cloud-api-integration}
108
+
### Cloud API integration {#cloud-api-integration}
109
109
110
110
Organizations with existing automation frameworks can integrate ClickHouse Cloud management directly through the Cloud API. The API provides programmatic access to service lifecycle management, user administration, backup operations, and monitoring data retrieval.
111
111
@@ -117,19 +117,19 @@ Common API integration patterns:
117
117
118
118
API authentication uses the same token-based approach as Terraform. For complete API reference and integration examples, see [ClickHouse Cloud API](/cloud/manage/api/api-overview) documentation.
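
As a rough illustration of calling the API from your own automation, the sketch below builds an authenticated request that lists the services in an organization. It assumes the API key ID and secret are sent as HTTP Basic credentials and that services are listed under `/v1/organizations/{organizationId}/services`; confirm both against the API reference before relying on them. The helper names `build_services_request` and `list_services` are hypothetical.

```python
import base64
import json
import urllib.request

# Assumed base URL of the ClickHouse Cloud API; verify against the API reference.
API_BASE = "https://api.clickhouse.cloud/v1"


def build_services_request(org_id, key_id, key_secret):
    """Build (but do not send) a GET request listing an organization's services."""
    credentials = base64.b64encode(f"{key_id}:{key_secret}".encode()).decode()
    return urllib.request.Request(
        f"{API_BASE}/organizations/{org_id}/services",
        headers={"Authorization": f"Basic {credentials}"},
        method="GET",
    )


def list_services(org_id, key_id, key_secret, timeout=30):
    """Send the request and return the decoded JSON response body."""
    request = build_services_request(org_id, key_id, key_secret)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Keeping request construction separate from sending makes the authentication logic easy to test without network access. Load the key ID and secret from a secrets manager or environment variables rather than hard-coding them.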
119
119
120
-
## Monitoring and Operational Integration {#monitoring-integration}
120
+
## Monitoring and operational integration {#monitoring-integration}
121
121
122
122
Connecting ClickHouse Cloud to your existing monitoring infrastructure ensures visibility and proactive issue detection.
123
123
124
-
### Built-in Monitoring {#built-in-monitoring}
124
+
### Built-in monitoring {#built-in-monitoring}
125
125
126
126
ClickHouse Cloud provides an advanced dashboard with real-time metrics including queries per second, memory usage, CPU usage, and storage rates. Access via Cloud console under Monitoring → Advanced dashboard. Create custom dashboards tailored to specific workload patterns or team resource consumption.
127
127
128
128
:::note Common production gaps
129
129
Lack of proactive alerting integration with enterprise incident management systems and automated cost monitoring. Built-in dashboards provide visibility but automated alerting requires external integration.
130
130
:::
131
131
132
-
### Production Alerting Setup {#production-alerting}
132
+
### Production alerting setup {#production-alerting}
133
133
134
134
**Built-in Capabilities**: ClickHouse Cloud provides notifications for billing events, scaling events, and service health via email, UI, and Slack. Configure delivery channels and notification severities through the console notification settings.
135
135
@@ -147,21 +147,21 @@ scrape_configs:
147
147
148
148
For comprehensive setup including detailed Prometheus/Grafana configuration and advanced alerting, see the [ClickHouse Cloud Observability Guide](/use-cases/observability/cloud-monitoring#prometheus).
149
149
150
-
## Business Continuity and Support Integration {#business-continuity}
150
+
## Business continuity and support integration {#business-continuity}
151
151
152
152
Establishing backup validation procedures and support integration ensures your ClickHouse Cloud deployment can recover from incidents and access help when needed.
153
153
154
-
### Backup Strategy Assessment {#backup-strategy}
154
+
### Backup strategy assessment {#backup-strategy}
155
155
156
156
ClickHouse Cloud provides automatic backups with configurable retention periods. Assess your current backup configuration against compliance and recovery requirements. Enterprise customers with specific compliance requirements around backup location or encryption can configure ClickHouse Cloud to store backups in their own cloud storage buckets (BYOB). Contact ClickHouse support for BYOB configuration.
157
157
158
-
### Validate and Test Recovery Procedures {#validate-test-recovery}
158
+
### Validate and test recovery procedures {#validate-test-recovery}
159
159
160
160
Most organizations discover backup gaps during actual recovery scenarios. Establish regular validation cycles to verify backup integrity and test recovery procedures before incidents occur. Schedule periodic test restorations to non-production environments, document step-by-step recovery procedures including time estimates, verify restored data completeness and application functionality, and test recovery procedures with different failure scenarios (service deletion, data corruption, regional outages). Maintain updated recovery runbooks accessible to on-call teams.
161
161
162
162
Test backup restoration at least quarterly for critical production services. Organizations with strict compliance requirements may need monthly or even weekly validation cycles.

Document your recovery time objectives (RTO) and recovery point objectives (RPO) to validate that your current backup configuration meets business requirements. Establish regular testing schedules for backup restoration and maintain updated recovery documentation.
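
One way to make the RTO/RPO comparison concrete is to record the result of each restore test and check it against your objectives. The sketch below treats the backup interval as the worst-case data loss window, which is a simplifying assumption; the function name `meets_objectives` and the sample durations are hypothetical.

```python
from datetime import timedelta


def meets_objectives(backup_interval, measured_restore_time, rpo, rto):
    """Compare a backup schedule and a measured restore time against RPO and RTO.

    Assumes worst-case data loss equals the time between consecutive backups.
    """
    return {
        "rpo_met": backup_interval <= rpo,
        "rto_met": measured_restore_time <= rto,
    }


# Hypothetical figures from a restore test, used only for illustration.
result = meets_objectives(
    backup_interval=timedelta(hours=24),
    measured_restore_time=timedelta(hours=3),
    rpo=timedelta(hours=12),
    rto=timedelta(hours=4),
)
print(result)
```

In this example the restore time satisfies the RTO, but a 24-hour backup interval cannot satisfy a 12-hour RPO, which would indicate that the backup schedule needs review.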