fix(api): restart api server during reload with address from the new config if necessary#25131
Conversation
|
All contributors have signed the CLA ✍️ ✅ |
|
@codex review |
|
Hi @garryliu-stripe please sign the CLA and we will take a look. To sign post a PR comment with the following: |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 1f0d45be29
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
@pront checking with our legal for CLA. will update shortly |
|
I have read the CLA Document and I hereby sign the CLA |
|
recheck |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: fb41c2f1ba
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
Move the API server start/stop/restart logic to run only after the topology reload succeeds. This prevents the API server from being bounced when the topology rolls back, keeping it consistent with the active config. Also compare configured addresses (old vs new) instead of the bound socket address, which avoids spurious restarts when port 0 or similar configs resolve to a different address on bind. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Committed-By-Agent: claude
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a8d1f42b0b
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
src/topology/controller.rs
Outdated
| if let Some(server) = self.api_server.take() { | ||
| debug!("Stopping gRPC API server for restart with new address."); | ||
| drop(server); |
There was a problem hiding this comment.
Await API shutdown before starting replacement listener
When api.address changes and a server already exists, this block drops the old GrpcServer and then immediately starts a new one. GrpcServer shutdown is asynchronous (src/api/grpc_server.rs, the listener lives in a spawned task and only exits after the oneshot is observed), so on reloads like 0.0.0.0:8686 -> 127.0.0.1:8686 the old socket can still be bound when restart happens, causing a transient EADDRINUSE and ReloadOutcome::FatalError (process shutdown) even though the new config is valid. The restart path needs to wait for listener teardown (or otherwise coordinate) before rebinding.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
I think we should disallow hot reload of api server. If we want to change the address, we should first set api.enable: false and then set new address and re-enable api-server
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Committed-By-Agent: claude
Pass &new_config instead of self.topology.config() when starting the API server during a reload (disabled->enabled). The old code used the pre-reload config, so the server would bind to the old address instead of the newly configured one. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Committed-By-Agent: claude
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 8983bcacd2
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
src/topology/controller.rs
Outdated
|
|
||
| match api::GrpcServer::start( | ||
| self.topology.config(), | ||
| &new_config, |
There was a problem hiding this comment.
Start gRPC server only after reload succeeds
Passing &new_config to GrpcServer::start before reload_config_and_respawn means a later ReloadOutcome::RolledBack can leave self.api_server running with settings from a rejected config while the topology stays on the previous config. Fresh evidence in this commit is the changed argument at this call site, which now applies new API settings prior to confirming reload success; this can expose or move the API endpoint even though the reload failed.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
I think this is an acceptable compromise? The current behavior can cause vector to bind the wrong port during hot reload.
Since only allow changing address when the api is disabled (the existing behaviour), the api server will be dropped during Rollback
There was a problem hiding this comment.
I think we will have to do something better here, either throw a hard error to indicate it's not support or even better, detect proper grpc server shutdown and rebind.
|
@pront mind taking a look? |
Move the API server start/stop logic into the Ok arm of the topology reload so it only runs after a successful reload. Use self.topology.config() (which reflects the new config post-reload) to bind to the correct address. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Committed-By-Agent: claude
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 335090fd3c
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| debug!("Stopping gRPC API server."); | ||
| drop(server); | ||
| } | ||
| } else if self.api_server.is_none() { |
There was a problem hiding this comment.
Restart API server when enabled address changes
The reload path still skips API restart whenever a server is already running because it only starts in the self.api_server.is_none() branch. In the common case where api.enabled stays true and only api.address changes, reload succeeds but the gRPC server remains bound to the old address, so the configured new endpoint is not actually applied until a full process restart.
Useful? React with 👍 / 👎.
| @@ -0,0 +1 @@ | |||
| Fixed a bug where the API server would not respect a configured `api.address` on config reload. The API server now starts with the correct address when re-enabled, and only starts after a successful topology reload to prevent inconsistent state on rollback. | |||
There was a problem hiding this comment.
Does this PR handle the case where the API is already running and the address changes?
There was a problem hiding this comment.
it does not. It still follow the current behavior that if the API is already running, address change will no bounce the server. To change address, need to first disable api, then re-enable with updated address.
The change primarily solves below situation
initial config:
# not actually set, below are default values
# api:
# enabled: false
# address: 127.0.0.1:8686
try enable api server on port 18686 with reload
api:
enabled: true
address: 127.0.0.1:18686
however, since we are starting the server with self.topology.config() instead of new_config
match api::GrpcServer::start(
self.topology.config(),
self.topology.watch(),
Arc::clone(&self.topology.running),
)
the server will be using the default address.
In addition, it solves potential inconsistent address when topology needs to rollback.
Summary
Vector configuration
During config reload, the API server was started using
self.topology.config()(the old config) instead ofnew_config. This meant that changingapi.addressin the config and triggering a reload would either:self.api_server.is_none()This fix passes
&new_configtoGrpcServer::start()and adds logic to detect address changes, stopping the old server and starting a new one on the correct port.How did you test this PR?
Change Type
Is this a breaking change?
Does this PR include user facing changes?
no-changeloglabel to this PR.References
Notes
@vectordotdev/vectorto reach out to us regarding this PR.pre-pushhook, please see this template.make fmtmake check-clippy(if there are failures it's possible some of them can be fixed withmake clippy-fix)make testgit merge origin masterandgit push.Cargo.lock), pleaserun
make build-licensesto regenerate the license inventory and commit the changes (if any). More details on the dd-rust-license-tool.