
Commit 733730e

Add reference commands
1 parent 26647e5 commit 733730e

File tree: 2 files changed, +205 -88 lines changed

pages/clustering/high-availability/best-practices.mdx

Lines changed: 13 additions & 1 deletion
@@ -67,10 +67,22 @@

the health state about the cluster. You can set this configuration flag to the IP address, fully
qualified domain name (FQDN), or even the DNS name. The suggested approach is to use DNS; otherwise, if the IP address changes,
network communication between instances in the cluster will stop working.

**Local development**

When testing on a local setup, the flag `--coordinator-hostname` should be set to `localhost` for each instance.

**K8s/Helm charts**

If you're working with K8s especially, you should use DNS/FQDN, as the IP addresses are ephemeral.

If you're using namespaces, you might need to change the `values.yaml` in the Helm charts, as they specify the
coordinator hostname for the default namespace. Below is the specification for coordinator 1:
```
- "--coordinator-hostname=memgraph-coordinator-1.default.svc.cluster.local"
```
The parameter should be changed to `memgraph-coordinator-1.<my_custom_namespace>.svc.cluster.local` instead of providing the `default`
namespace. This needs to be applied on all coordinators.

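For instance, assuming a hypothetical namespace called `memgraph-ha`, the entry for coordinator 1 would become:
```
- "--coordinator-hostname=memgraph-coordinator-1.memgraph-ha.svc.cluster.local"
```
with the analogous change applied to the other coordinators.
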
#### management port

The flag `--management-port` on the coordinator instance is used for the leader coordinator to get the health state from each of

pages/clustering/high-availability/ha-commands-reference.mdx

Lines changed: 192 additions & 87 deletions
@@ -10,174 +10,279 @@

import {CommunityLinks} from '/components/social-card/CommunityLinks'

This page provides a comprehensive reference for all commands available in Memgraph's high availability cluster management.

## Cluster registration commands

<Callout>
**All cluster registration commands (registering coordinators and data instances) should be run on the same coordinator.**
You can pick any coordinator for registering your cluster; it will become the leader coordinator. After the cluster has
been set up, the choice of coordinator no longer matters.
</Callout>

### ADD COORDINATOR

Adds a coordinator to the cluster.

```cypher
ADD COORDINATOR coordinatorId WITH CONFIG {
    "bolt_server": boltServer,
    "coordinator_server": coordinatorServer,
    "management_server": managementServer
};
```

**Parameters:**
- `coordinatorId` (int) - unique integer identifying each coordinator. Assign a different incrementing integer to each coordinator as you
register them.
- `boltServer` (string) - Network address in format `"IP_ADDRESS|DNS_NAME:PORT_NUMBER"`. The port is usually set to 7687, the default
Bolt protocol port. If IPs are ephemeral, it's best to use the DNS name/FQDN. The server needs
to be exposed to the external network if any external applications connect to it.
- `coordinatorServer` (string) - Network address in format `"COORDINATOR_HOSTNAME:COORDINATOR_PORT"`. The coordinator hostname and port
are set through command-line flags for each coordinator. Ensure the coordinator hostname is a DNS name/FQDN if IP addresses are ephemeral.
- `managementServer` (string) - Network address in format `"COORDINATOR_HOSTNAME:MANAGEMENT_PORT"`. The coordinator hostname and management port
are set through command-line flags for each coordinator. Ensure the coordinator hostname is a DNS name/FQDN if IP addresses are ephemeral.

**Implications:**
- The user can choose any coordinator instance to run cluster setup queries. This can be done before or after registering data instances;
the order isn't important.
- The `ADD COORDINATOR` query needs to be run for all coordinators in the cluster.
- The Bolt server IP needs to be available outside the cluster and not ephemeral.
- The coordinator server and management server can use an internal IP/DNS name/FQDN, as the cluster uses them for internal communication.

**Example:**
```cypher
ADD COORDINATOR 1 WITH CONFIG {
    "bolt_server": "my_outside_coordinator_1_IP:7687",
    "coordinator_server": "memgraph-coordinator-1.default.svc.cluster.local:12000",
    "management_server": "memgraph-coordinator-1.default.svc.cluster.local:10000"
};
```
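
Because the query has to be repeated for every coordinator, a three-coordinator local setup would be registered as follows (addresses are illustrative):

```cypher
ADD COORDINATOR 1 WITH CONFIG {"bolt_server": "127.0.0.1:7691", "coordinator_server": "127.0.0.1:10111", "management_server": "127.0.0.1:12111"};
ADD COORDINATOR 2 WITH CONFIG {"bolt_server": "127.0.0.1:7692", "coordinator_server": "127.0.0.1:10112", "management_server": "127.0.0.1:12112"};
ADD COORDINATOR 3 WITH CONFIG {"bolt_server": "127.0.0.1:7693", "coordinator_server": "127.0.0.1:10113", "management_server": "127.0.0.1:12113"};
```
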
### REMOVE COORDINATOR

If during cluster setup, or at some later stage of the cluster's life, the user decides to remove a coordinator instance,
the `REMOVE COORDINATOR` query can be used. This query can only be executed on the leader coordinator to remove follower coordinators.
The current cluster's leader cannot be removed, since this is prohibited by NuRaft. In order to remove the current leader,
you first need to trigger a leadership change.

```cypher
REMOVE COORDINATOR coordinatorId;
```

**Parameters:**
- `coordinatorId` (integer) - unique integer ID of the coordinator used during the registration

**Example:**
```cypher
REMOVE COORDINATOR 2;
```

### REGISTER INSTANCE

Registers a data instance to the cluster.

```cypher
REGISTER INSTANCE instanceName ( AS ASYNC | AS STRICT_SYNC ) ? WITH CONFIG {
    "bolt_server": boltServer,
    "management_server": managementServer,
    "replication_server": replicationServer
};
```

**Parameters:**
- `instanceName` (symbolic name) - unique name of the data instance
- `AS ASYNC` (optional parameter) - register the instance in `ASYNC` replication mode
- `AS STRICT_SYNC` (optional parameter) - register the instance in `STRICT_SYNC` replication mode
- `boltServer` (string) - Network address in format `"IP_ADDRESS|DNS_NAME:PORT_NUMBER"`. The port is usually set to 7687, the default
Bolt protocol port. If IPs are ephemeral, it's best to use the DNS name/FQDN. The server needs
to be exposed to the external network if any external applications connect to it.
- `managementServer` (string) - Network address in format `"IP_ADDRESS|DNS_NAME:PORT_NUMBER"` of the data instance's management server,
which the coordinator connects to in order to manage the instance and check its health.
- `replicationServer` (string) - Network address in format `"IP_ADDRESS|DNS_NAME:PORT_NUMBER"` on which the data instance starts its
replication server.

**Behaviour:**
- The coordinator instance will connect to the data instance on the `management_server` network address.
- The coordinator instance will start pinging the data instance every `--instance-health-check-frequency-sec` seconds to check its status.
- The data instance will be demoted from main to replica.
- The data instance will start the replication server on `replication_server`.
- This operation will result in writing to the Raft log.

**Implications:**
In case the main instance already exists in the cluster, a replica instance will be automatically connected to the main.
If a replication mode is not specified, the REPLICA will be registered in `SYNC` replication mode.
The constructs `( AS ASYNC | AS STRICT_SYNC )` serve to specify a replication mode other than `SYNC`.

You can only have `STRICT_SYNC` and `ASYNC` or `SYNC` and `ASYNC` replicas together in the cluster. Combining `STRICT_SYNC`
and `SYNC` replicas together doesn't have proper semantic meaning, so it is forbidden.

**Example:**
```cypher
REGISTER INSTANCE instance1 WITH CONFIG {
    "bolt_server": "my_outside_instance1_IP:7687",
    "management_server": "???:10000",
    "replication_server": "???:20000"
};
```

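To register a replica with a non-default replication mode, the optional modifier goes right after the instance name. A minimal sketch, assuming a second data instance named `instance2` and placeholder addresses:

```cypher
REGISTER INSTANCE instance2 AS ASYNC WITH CONFIG {
    "bolt_server": "my_outside_instance2_IP:7687",
    "management_server": "instance2_management_address:10000",
    "replication_server": "instance2_replication_address:20000"
};
```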

### UNREGISTER INSTANCE

There are various reasons which could lead to the decision that an instance needs to be removed from the cluster:
the hardware can be broken, network communication could be set up incorrectly, etc. The user can remove the instance
from the cluster using the following query:

```cypher
UNREGISTER INSTANCE instanceName;
```

**Parameters:**
- `instanceName` (symbolic name) - name of the data instance to unregister

**Implications:**
When unregistering an instance, ensure that the instance being unregistered is
**not** the MAIN instance. Unregistering MAIN can lead to an inconsistent
cluster state. Additionally, the cluster must have an **alive** MAIN instance
during the unregistration process. If no MAIN instance is available, the
operation cannot be guaranteed to succeed.

The instance requested to be unregistered will also be unregistered from the current MAIN's replica set.

**Example:**
```cypher
UNREGISTER INSTANCE instance_1;
```

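Since unregistering MAIN is unsafe, it can help to confirm the instance's role first; `SHOW INSTANCES` (described below) reports the role of every instance. A sketch of that sequence:

```cypher
SHOW INSTANCES;
UNREGISTER INSTANCE instance_1;
```
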
## Replication role management queries

### SET INSTANCE TO MAIN

Once all data instances are registered, one data instance should be promoted to main.
This can be achieved by using the following query:

```cypher
SET INSTANCE instanceName TO MAIN;
```

**Parameters:**
- `instanceName` (symbolic name) - name of the data instance that is going to be promoted to main

**Behaviour:**
This query will register all other instances as replicas to the new main.
This operation will result in writing to the Raft log.

**Implications:**
If one of the instances is unavailable, setting the instance to MAIN will not succeed.
If there is already a MAIN instance in the cluster, this query will fail.

**Example:**
```cypher
SET INSTANCE instance_0 TO MAIN;
```

### DEMOTE INSTANCE

The demote instance query can be used by an admin to demote the current MAIN to REPLICA.

```cypher
DEMOTE INSTANCE instanceName;
```

**Behaviour:**
- MAIN is demoted to REPLICA.
- This operation will result in writing to the Raft log.

**Implications:**
- In this case, the leader coordinator won't perform a failover; as a user, you should promote one of
the data instances to main using the `SET INSTANCE instanceName TO MAIN` query.

<Callout type="info">
By combining the functionalities of the queries `DEMOTE INSTANCE instanceName` and `SET INSTANCE instanceName TO MAIN` you get manual failover capability. This can be useful,
e.g., during maintenance work on the instance where the current main is deployed.
</Callout>

**Example:**
```cypher
DEMOTE INSTANCE instance1;
```

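Combining the two queries gives the manual failover mentioned in the callout above. A minimal sketch, assuming the current main is `instance1` and `instance2` should take over (running `SHOW REPLICATION LAG`, described below, first helps confirm that `instance2` has caught up):

```cypher
SHOW REPLICATION LAG;
DEMOTE INSTANCE instance1;
SET INSTANCE instance2 TO MAIN;
```
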
## Monitoring commands

### SHOW INSTANCES

You can check the state of the whole cluster using the `SHOW INSTANCES` query.

```cypher
SHOW INSTANCES;
```

**Behaviour:**
The query will display all the Memgraph servers visible in the cluster. With
each server you can see the following information:
1. Network endpoints they are using for managing cluster state
2. Health state of the server
3. Role - main, replica, LEADER, FOLLOWER, or unknown if not alive
4. The time passed since the last response to the leader's health ping

**Implications:**
This query can be run on either the leader or followers. Since only the leader knows the exact status of the health state
and last response time, followers will execute actions in this exact order:
1. Try contacting the leader to get the health state of the cluster, since the leader has all the information.
If the leader responds, the follower will return the result as if the `SHOW INSTANCES` query was run on the leader.
2. When the leader doesn't respond or there is currently no leader, the follower will return all the Memgraph servers
with the health state set to "down".

### SHOW INSTANCE

You can check the state of the current coordinator to which you are connected by running the following query:

```cypher
SHOW INSTANCE;
```

**Behaviour:**
This query will return the information about:
1. instance name
2. external bolt server to which you can connect using Memgraph clients
3. coordinator server over which Raft communication is done
4. management server, which is also used for inter-coordinator communication, and
5. cluster role: whether the coordinator is currently a leader or a follower.

**Implications:**
If the query `ADD COORDINATOR` wasn't run for the current instance, the value of the bolt server will be "".

### SHOW REPLICATION LAG

The user can find the current replication lag on each instance by running `SHOW REPLICATION LAG` on the cluster's leader.
The replication lag is expressed as the number of committed transactions.

```cypher
SHOW REPLICATION LAG;
```

**Implications:**
- This information is made durable through snapshots and WALs, so restarts won't cause information loss.
- The information about the replication lag can be useful when manually performing a failover to check whether there is a
risk of data loss.

## Troubleshooting commands

### FORCE RESET CLUSTER STATE

In case the cluster can't get into a healthy state, or an unexpected event occurs, there is an option to force
reset the cluster.

```cypher
FORCE RESET CLUSTER STATE;
```

**Behaviour:**
1. The coordinator instance will demote each alive instance to replica.
2. From the alive instances it will choose a new main instance.
3. Instances that are down will be demoted to replicas once they come back up.

This operation will result in writing to the Raft log.

**Implications:**
You need to execute the command on the leader coordinator.

<CommunityLinks/>
