feat(cli): add --data-dir flag and VECTOR_DATA_DIR env var#25537
feat(cli): add --data-dir flag and VECTOR_DATA_DIR env var#25537xfocus3 wants to merge 1 commit into
Conversation
Add a --data-dir command-line flag (and corresponding VECTOR_DATA_DIR environment variable) to override the data_dir global configuration option. When set, it takes precedence over any data_dir value in the configuration file. This keeps deployment-specific paths out of the configuration file and makes it possible to validate a configuration in a CI environment where the configured data_dir may not exist. Closes vectordotdev#9766
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 946a6e4464
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| #[arg(long, env = "VECTOR_DATA_DIR")] | ||
| pub data_dir: Option<PathBuf>, |
There was a problem hiding this comment.
Apply data_dir override to validate
This option is added only to RootOpts, but the validate subcommand is executed with its own validate::Opts and SubCommand::execute passes only those subcommand opts into validate::validate. As a result, VECTOR_DATA_DIR=/tmp vector validate ... (and vector --data-dir /tmp validate ...) parses the value but never applies it before validate_config/create_tmp_directory, so the documented CI validation case still uses the config file's data_dir and can fail on the nonexistent path.
Useful? React with 👍 / 👎.
| if let Some(data_dir) = data_dir { | ||
| debug!( | ||
| message = "Overriding data_dir from command line.", | ||
| ?data_dir | ||
| ); | ||
| config.global.data_dir = Some(data_dir); |
There was a problem hiding this comment.
Preserve data_dir override during reloads
This override is applied only to the initial config returned by load_configs; reloads in handle_signal call config::load_from_paths_with_provider_and_secrets directly and TopologyController::reload rejects changes to global options. With --watch-config/SIGHUP and a CLI/env data_dir that differs from the file (or default), the reloaded config loses this assignment, differs from the running global options, and reload fails with GlobalOptionsChanged instead of continuing to use the requested deployment-level data directory.
Useful? React with 👍 / 👎.
| #[arg(long, env = "VECTOR_DATA_DIR")] | ||
| pub data_dir: Option<PathBuf>, |
There was a problem hiding this comment.
Because this is a top-level RootOpts argument without global = true, clap only accepts it before a subcommand, so the advertised vector validate --data-dir /tmp vector.yaml form is rejected as an unexpected argument before any override can happen. If this flag is meant to work with subcommands in the position shown in the PR description, it needs to be global or defined on those subcommands.
Useful? React with 👍 / 👎.
| if let Some(data_dir) = data_dir { | ||
| debug!( | ||
| message = "Overriding data_dir from command line.", | ||
| ?data_dir | ||
| ); | ||
| config.global.data_dir = Some(data_dir); |
There was a problem hiding this comment.
Apply data_dir before merging configs
The override happens only after load_from_paths_with_provider_and_secrets has fully built the config, but GlobalOptions::merge rejects two config files that both set different non-default data_dir values during that build. In a split configuration where --data-dir/VECTOR_DATA_DIR is supplied specifically to take precedence, startup still fails with conflicting values for 'data_dir' found before this assignment runs, so the deployment-level override cannot actually supersede all config-file values.
Useful? React with 👍 / 👎.
Summary
Closes #9766.
Adds a
--data-dircommand-line flag and correspondingVECTOR_DATA_DIRenvironment variable to override thedata_dirglobal option.This follows the same pattern Vector already uses for other deployment-level settings such as
--require-healthy/VECTOR_REQUIRE_HEALTHY. As @jszwedko noted in the issue, exposingVECTOR_DATA_DIRis desirable.Motivation
As described in the issue,
data_diris a deployment concern rather than a pipeline concern. Today users work around this by templating it inside the config file:This also makes
vector validatepainful in CI, where the configureddata_diroften does not exist:With this change, the path can be supplied out-of-band without editing the config:
Behavior
--data-dir/VECTOR_DATA_DIRis set, it overrides thedata_dirvalue from the configuration file (env/CLI takes precedence), matching the precedence requested in the issue./var/lib/vectordefault) is used.Scope
This PR intentionally scopes to
data_dir, which was the primary request. The other globals mentioned in the issue (timezone,healthchecks.*) can follow the same pattern in separate PRs if desired.How did you test this PR?
Option<PathBuf>and defaults toNone.cargo check,cargo clippy -- -D warnings, andcargo fmt --checkall pass locally.