Commit 1aec8e7

Merge pull request #528 from daipom/add-zero-downtime-restart
zero-downtime restart: add initial document
2 parents 47d7af5 + 274e073 commit 1aec8e7

6 files changed

+128
-11
lines changed

SUMMARY.md

Lines changed: 1 addition & 0 deletions
@@ -57,6 +57,7 @@
5757
* [Linux Capability](deployment/linux-capability.md)
5858
* [Command Line Option](deployment/command-line-option.md)
5959
* [Source Only Mode](deployment/source-only-mode.md)
60+
* [Zero-downtime restart](deployment/zero-downtime-restart.md)
6061
* [Container Deployment](container-deployment/README.md)
6162
* [Docker Image](container-deployment/install-by-docker.md)
6263
* [Docker Logging Driver](container-deployment/docker-logging-driver.md)

deployment/rpc.md

Lines changed: 13 additions & 8 deletions
@@ -25,14 +25,19 @@ As evident from the output above, each endpoint returns a JSON object as its res
2525

2626
## HTTP Endpoints
2727

28-
| Endpoint | Replacement of | Description |
29-
| :--- | :---: | :---: |
30-
| `/api/processes.interruptWorkers` | [SIGINT](signals.md#sigint-or-sigterm) | Stops the daemon. |
31-
| `/api/processes.killWorkers` | [SIGTERM](signals.md#sigint-or-sigterm) | Stops the daemon. |
32-
| `/api/processes.flushBuffersAndKillWorkers` | [SIGUSR1](signals.md#sigusr1) and [SIGTERM](signals.md#sigint-or-sigterm) | Flushes buffer and stops the daemon. |
33-
| `/api/plugins.flushBuffers` | [SIGUSR1](signals.md#sigusr1) | Flushes the buffered messages. |
34-
| `/api/config.gracefulReload` | [SIGUSR2](signals.md#sigusr2) | Reloads configuration. |
35-
| `/api/config.reload` | [SIGHUP](signals.md#sighup) | Reloads configuration. |
28+
| Endpoint | Replacement of | Description | Version |
29+
| :--- | :---: | :---: | :---: |
30+
| `/api/processes.interruptWorkers` | [SIGINT](signals.md#sigint-or-sigterm) | Stops the daemon. | v1.0 |
31+
| `/api/processes.killWorkers` | [SIGTERM](signals.md#sigint-or-sigterm) | Stops the daemon. | v1.0 |
32+
| `/api/processes.zeroDowntimeRestart` | [SIGUSR2](signals.md#sigusr2) | Restarts Fluentd with zero downtime. (Not supported on Windows) | v1.18 |
33+
| `/api/processes.flushBuffersAndKillWorkers` | [SIGUSR1](signals.md#sigusr1) and [SIGTERM](signals.md#sigint-or-sigterm) | Flushes buffer and stops the daemon. | v1.0 |
34+
| `/api/plugins.flushBuffers` | [SIGUSR1](signals.md#sigusr1) | Flushes the buffered messages. | v1.0 |
35+
| `/api/config.reload` | [SIGHUP](signals.md#sighup) | Reloads configuration. | v1.0 |
36+
| `/api/config.gracefulReload` | --- | Reloads configuration. | v1.9 |
37+
38+
Appendix:
39+
40+
* `/api/config.gracefulReload`: Before v1.18, this was the replacement for `SIGUSR2`. Please use `/api/processes.zeroDowntimeRestart` or `/api/config.reload` unless there is a special reason. See [SIGUSR2](signals.md#sigusr2) for details.
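
As a rough illustration of calling the endpoints in the table above, here is a minimal Ruby sketch. It assumes that the RPC server listens on `127.0.0.1:24444`; this address is an assumption, so use the `rpc_endpoint` value from your own configuration. The helper names `rpc_uri` and `call_rpc` are hypothetical.

```ruby
require 'net/http'
require 'uri'
require 'json'

# Assumption: the address configured as rpc_endpoint. Adjust to your setup.
RPC_ENDPOINT = '127.0.0.1:24444'

# Build the full URL for an RPC path such as "/api/processes.zeroDowntimeRestart".
def rpc_uri(path, endpoint = RPC_ENDPOINT)
  URI("http://#{endpoint}#{path}")
end

# Send a GET request to the endpoint and parse the JSON object it returns.
def call_rpc(path, endpoint = RPC_ENDPOINT)
  response = Net::HTTP.get_response(rpc_uri(path, endpoint))
  JSON.parse(response.body)
end
```

With a running Fluentd v1.18 or later on a non-Windows host, `call_rpc('/api/processes.zeroDowntimeRestart')` would request a zero-downtime restart, and `call_rpc('/api/config.reload')` would request the same reload as `SIGHUP`.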
3641

3742
If this article is incorrect or outdated, or omits critical information, please [let us know](https://github.com/fluent/fluentd-docs-gitbook/issues?state=open). [Fluentd](http://www.fluentd.org/) is an open-source project under [Cloud Native Computing Foundation \(CNCF\)](https://cncf.io/). All components are available under the Apache 2 License.
3843

deployment/signals.md

Lines changed: 44 additions & 2 deletions
@@ -18,15 +18,57 @@ Forces the buffered messages to be flushed and reopens Fluentd's log. Fluentd wi
1818

1919
### SIGUSR2
2020

21+
Since v1.18, this signal has two features: Zero-downtime restart and Graceful reload. Which one runs depends on the platform and on the process that receives the signal.
22+
23+
Non-Windows:
24+
25+
| process | feature | version |
26+
| :--- | :--- | :--- |
27+
| Supervisor | Zero-downtime restart | v1.18.0 ~ |
28+
| Supervisor | Graceful reload (forwarded to all workers) | v1.9 ~ v1.17 |
29+
| Worker | Graceful reload | v1.9 ~ |
30+
31+
Windows:
32+
33+
| process | feature | version |
34+
| :--- | :--- | :--- |
35+
| Supervisor | Graceful reload (forwarded to all workers) | v1.9 ~ |
36+
| Worker | Graceful reload | v1.9 ~ |
37+
38+
#### Zero-downtime restart
39+
40+
This feature supports a complete restart of Fluentd.
41+
It restarts Fluentd in such a way that some input plugins have no downtime.
42+
43+
See [Zero-downtime restart](zero-downtime-restart.md) for details.
44+
45+
**Comparison with SIGHUP**
46+
47+
`SIGHUP` reloads the configuration by gracefully restarting the worker process.
48+
49+
This method does not cause socket downtime, so if there is no need to restart the supervisor, `SIGHUP` is a lighter zero-downtime restart method.
50+
51+
**Comparison with Graceful reload**
52+
53+
You can still use the Graceful reload feature by sending `SIGUSR2` directly to the worker process or by using [RPC](rpc.md), even after v1.18.0.
54+
55+
This allows you to reload without restarting the process, but there are some limitations.
56+
Please use zero-downtime restart or `SIGHUP` unless there is a special reason.
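
The choice of target process described above can be sketched in Ruby as follows. This is only an illustration: the helper names `read_pid` and `send_sigusr2` are hypothetical, and so is the idea of reading the process ID from a PID file whose path you chose yourself.

```ruby
# Read a process ID from a PID file (the path is whatever the operator configured).
def read_pid(pid_file)
  Integer(File.read(pid_file).strip)
end

# Send SIGUSR2 to the given process.
# * Supervisor PID (non-Windows, v1.18.0 or later): zero-downtime restart.
# * Worker PID: graceful reload.
def send_sigusr2(pid)
  Process.kill('USR2', pid)
end
```

For example, `send_sigusr2(read_pid('/path/to/fluentd.pid'))` would signal the process recorded in that file; the path here is a placeholder.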
57+
58+
#### Graceful reload
59+
2160
Reloads the configuration file by gracefully re-constructing the data pipeline. Fluentd will try to flush the entire memory buffer at once, but will not retry if the flush fails. Fluentd will not flush the file buffer; the logs are persisted on the disk by default.
2261

23-
This signal has been supported since v1.9.0.
62+
Limitations:
63+
64+
* A change to System Configuration (`<system>`) is ignored.
65+
* No plugin may use class variables.
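
The class-variable limitation can be illustrated with plain Ruby. The sketch below is not Fluentd code, and `CounterSketch` is a hypothetical class. It only shows that a class variable keeps its value when an object is re-constructed inside the same process, which is presumably why it conflicts with a reload that does not restart the process.

```ruby
# Hypothetical plugin-like class; not a real Fluentd plugin.
class CounterSketch
  @@total = 0 # class variable: lives as long as the Ruby process

  def initialize
    @count = 0 # instance variable: starts over whenever the object is rebuilt
  end

  def emit
    @@total += 1
    @count += 1
  end

  def stats
    { instance: @count, class_wide: @@total }
  end
end

old_pipeline = CounterSketch.new
2.times { old_pipeline.emit }

# Re-construct the object inside the same process, as an in-process reload would.
new_pipeline = CounterSketch.new
new_pipeline.emit
new_pipeline.stats # => { instance: 1, class_wide: 3 }
```

The instance counter starts over, but the class-wide counter still carries state from before the re-construction.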
2466

2567
### SIGHUP
2668

2769
Reloads the configuration file by gracefully restarting the worker process. Fluentd will try to flush the entire memory buffer at once, but will not retry if the flush fails. Fluentd will not flush the file buffer; the logs are persisted on the disk by default.
2870

29-
If you use fluentd v1.9.0 or later, use `SIGUSR2` instead.
71+
This does not cause socket downtime because the supervisor process keeps the normal sockets, as long as the socket is provided as a shared socket by [server_helper](../plugin-helper-overview/api-plugin-helper-server.md).
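
The idea that a socket held by a long-lived owner survives a worker restart can be sketched with Ruby's standard `socket` library. This is a simplified illustration, not how Fluentd or the server helper is implemented: threads stand in for worker processes, and all names are hypothetical.

```ruby
require 'socket'

# The "supervisor" owns the listening socket for the whole sketch.
listener = TCPServer.new('127.0.0.1', 0) # port 0: let the OS pick a free port
port = listener.addr[1]

# A "worker" handles exactly one connection and then exits.
def run_worker(listener, name)
  Thread.new do
    client = listener.accept
    client.puts("handled by #{name}")
    client.close
  end
end

# The old worker serves one client and exits.
old_worker = run_worker(listener, 'old worker')
first = TCPSocket.new('127.0.0.1', port)
first_reply = first.gets.chomp
first.close
old_worker.join

# No worker is accepting now, but the listening socket is still open,
# so this connection is queued instead of being refused.
pending = TCPSocket.new('127.0.0.1', port)

# The new worker starts and picks up the queued connection.
new_worker = run_worker(listener, 'new worker')
reply_during_restart = pending.gets.chomp
pending.close
new_worker.join
listener.close
```

The second client connects while no worker is running and is still served once the new worker starts, because the listening socket was never closed.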
3072

3173
### SIGCONT
3274

deployment/zero-downtime-restart.md

Lines changed: 45 additions & 0 deletions
@@ -0,0 +1,45 @@
1+
# Zero-downtime restart
2+
3+
This feature supports a complete restart of Fluentd.
4+
It restarts Fluentd in such a way that some input plugins have no downtime.
5+
6+
The following standard input plugins are supported.
7+
8+
| supported input plugin | version |
9+
| :--- | :--- |
10+
| in_udp | v1.18.0 |
11+
| in_tcp | v1.18.0 |
12+
| in_syslog | v1.18.0 |
13+
14+
If these input plugins are down, client applications may fail to send data.
15+
You can use this feature to completely restart Fluentd without losing data for these plugins, even if the client has no resend feature.
16+
17+
## How to use this feature
18+
19+
You can use this feature in the following ways.
20+
21+
* [Signals - SIGUSR2](signals.md#sigusr2)
22+
* [RPC](rpc.md)
23+
24+
## Mechanism of zero-downtime restart
25+
26+
![zero-downtime restart mechanism](../.gitbook/assets/fluentd-zero-downtime-restart-mechanism.png)
27+
28+
1. Receive `SIGUSR2`.
29+
2. Spawn a new supervisor.
30+
3. Take over shared sockets.
31+
4. Launch new workers, and stop old processes in parallel.
32+
* Launch new workers with [Source Only Mode](source-only-mode.md).
33+
* In addition to the source-only mode limitation, Fluentd further limits the plugins it starts to only those that support this feature.
34+
* Data received by the new workers is stored in the temporary buffer of source-only mode.
35+
* For details on the temporary buffer, see [Source Only Mode - Temporary file buffer](source-only-mode.md#temporary-file-buffer).
36+
* Send `SIGTERM` to the old supervisor after a `10s` delay.
37+
5. The old supervisor stops and sends `SIGWINCH` to the new one.
38+
6. The new workers start to run fully.
39+
* The temporary buffer of source-only mode starts to load.
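
Steps 4 to 6 above can be pictured with a toy model in Ruby. This is not Fluentd's implementation: `NewWorkerSketch` and its methods are hypothetical, an in-memory array stands in for the temporary buffer, and a plain method call stands in for the moment when the new supervisor receives `SIGWINCH`.

```ruby
# Toy model of a new worker during a zero-downtime restart (not Fluentd's code).
class NewWorkerSketch
  attr_reader :mode, :temporary_buffer, :processed

  def initialize
    @mode = :source_only # step 4: launched in source-only mode
    @temporary_buffer = []
    @processed = []
  end

  # Incoming data is buffered until the worker runs fully.
  def receive(record)
    if @mode == :source_only
      @temporary_buffer << record
    else
      @processed << record
    end
  end

  # Step 6: the worker starts to run fully and loads the temporary buffer.
  def run_fully
    @mode = :full
    @processed.concat(@temporary_buffer)
    @temporary_buffer.clear
  end
end

worker = NewWorkerSketch.new
worker.receive('log 1') # arrives while the old processes are stopping
worker.receive('log 2')
worker.run_fully        # the old supervisor has stopped (step 5)
worker.receive('log 3')
worker.processed # => ["log 1", "log 2", "log 3"]
```

Data that arrives during the handover is kept and then processed in order, which is the property the temporary buffer provides.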
40+
41+
## Plugins: how to support this feature
42+
43+
See [How to Write Input Plugin - zero_downtime_restart_ready?](../plugin-development/api-plugin-input.md#zero_downtime_restart_ready).
44+
45+
If this article is incorrect or outdated, or omits critical information, please [let us know](https://github.com/fluent/fluentd-docs-gitbook/issues?state=open). [Fluentd](http://www.fluentd.org/) is an open-source project under [Cloud Native Computing Foundation \(CNCF\)](https://cncf.io/). All components are available under the Apache 2 License.

plugin-development/api-plugin-input.md

Lines changed: 25 additions & 1 deletion
@@ -102,7 +102,31 @@ router.emit(tag, time, {:foo => 'bar'})
102102

103103
## Methods
104104

105-
There are no specific methods for the Input plugins.
105+
### zero_downtime_restart_ready?
106+
107+
To support [Zero-downtime restart](../deployment/zero-downtime-restart.md), you can override this method to return `true`.
108+
109+
```ruby
110+
def zero_downtime_restart_ready?
111+
true
112+
end
113+
```
114+
115+
To do this, the following condition must be met:
116+
117+
* This plugin can run in parallel with another Fluentd.
118+
119+
This is because there is a period when the old process and the new process run in parallel during a zero-downtime restart.
120+
121+
After addressing the following considerations and ensuring there are no issues, override this method.
122+
Then, the plugin will work with zero-downtime restart.
123+
124+
* Handling Files
125+
* When handling files, there is a possibility of conflict.
126+
* Basically, input plugins that handle files should not support Zero-downtime restart.
127+
* Handling Sockets
128+
* A socket provided as a shared socket by [server plugin helper](../plugin-helper-overview/api-plugin-helper-server.md) is shared between the old and new processes. So, such a plugin can support Zero-downtime restart.
129+
* When handling sockets on your own, be careful to avoid conflicts.
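
The opt-in nature of this method can be sketched without Fluentd installed. In the Ruby sketch below, `InputStandIn` is a hypothetical stand-in for the input plugin base class, and its default return value of `false` is an assumption made for illustration. The two subclasses are also hypothetical.

```ruby
# Stand-in for the input plugin base class so the sketch runs without Fluentd.
# Assumption: the method returns false unless a plugin overrides it.
class InputStandIn
  def zero_downtime_restart_ready?
    false
  end
end

# Uses a shared socket, so it can run in parallel with another Fluentd and opts in.
class SharedSocketInputSketch < InputStandIn
  def zero_downtime_restart_ready?
    true
  end
end

# Handles files, which may conflict between the old and new processes,
# so it keeps the default.
class FileInputSketch < InputStandIn
end

plugins = [SharedSocketInputSketch.new, FileInputSketch.new]

# Only plugins that opt in would be started during a zero-downtime restart.
ready = plugins.select(&:zero_downtime_restart_ready?)
ready_names = ready.map { |plugin| plugin.class.name }
```

Here `ready_names` contains only `"SharedSocketInputSketch"`, which mirrors the guidance above: opt in for shared sockets, and leave file-handling plugins out.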
106130

107131
## Writing Tests
108132
