diff --git a/docs/content.zh/docs/deployment/elastic_scaling.md b/docs/content.zh/docs/deployment/elastic_scaling.md
index d155e70d90fd0..6a574d4c5dbd9 100644
--- a/docs/content.zh/docs/deployment/elastic_scaling.md
+++ b/docs/content.zh/docs/deployment/elastic_scaling.md
@@ -189,4 +189,81 @@ cp ./examples/streaming/TopSpeedWindowing.jar lib/
 仅支持如下的部署方式:[Application 模式下的 Standalone 部署]({{< ref "docs/deployment/resource-providers/standalone/overview" >}}#application-mode)(可以参考[上文](#getting-started))、[Application 模式下的 Docker 部署]({{< ref "docs/deployment/resource-providers/standalone/docker" >}}#application-mode-on-docker) 以及 [Standalone 的 Kubernetes Application 集群模式]({{< ref "docs/deployment/resource-providers/standalone/kubernetes" >}}#deploy-application-cluster)。
 
 [Adaptive 调度器的局限性](#limitations-1) 同样也适用于 Reactive 模式.
+
+## Rescale History
+
+Before Flink 2.3, users and developers could not inspect the internal details of Adaptive Scheduler rescaling,
+which made operating such jobs inconvenient.
+For instance, users need visibility into specific resource changes, parallelism adjustments,
+and the time spent on each internal state transition during the rescaling process.
+This information is crucial for tuning parameters to achieve lower latency and higher stability in rescaling.
+
+Therefore, the Flink community introduced [FLIP-495](https://cwiki.apache.org/confluence/x/TQr0Ew) to support recording and storing the rescale history,
+and [FLIP-487](https://cwiki.apache.org/confluence/x/vZCMEw) to enable querying this history via the REST API and displaying it in the Web UI.
+
+### Information and Presentation of the Rescale History
+
+Since Flink 2.3, the Web UI provides a `Rescales` page,
+positioned at the same hierarchical level as the `Checkpoints` page and styled similarly.
+It consists mainly of the following sub-pages:
+
+- `Overview`
+  This sub-page displays recent rescale records across the different rescale terminal states,
+  along with basic rescale statistics for the job, such as the total number of rescales since job startup and the counts of failed and successful rescales.
+  The sub-page can also display detailed information for a rescale.
+
+- `History`
+  This sub-page displays abbreviated information for the most recent `n` rescale records, where `n` is determined by the configuration.
+  It can also display the following detailed information for each rescale:
+  - Basic information about a rescale
+    - Rescale UUID: The unique ID of the rescale, consisting of 32 hexadecimal characters (every UUID mentioned below uses the same format)
+    - Attempt ID: The sequence number of the rescale attempt among the attempts made under the same resource requirements
+    - Requirements ID: The UUID of the resource requirements
+    - Trigger Cause: The reason the rescale was triggered
+    - Terminal State: The final state of the rescale
+    - Terminated Reason: The reason the rescale lifecycle terminated
+    - Start Time: The time the rescale started
+    - Duration: The time from the start of the rescale to its termination, or to the current time if it is still in progress
+    - End Time: The time the rescale terminated, or the current time if it has not terminated yet
+  - Basic attributes and rescale changes per `Job Vertex`
+    - ID: The UUID of the `Job Vertex`
+    - Name: The short name of the vertex
+    - Slot Sharing Group ID: The UUID of the vertex's `Slot Sharing Group`
+    - Previous Parallelism: The parallelism of the vertex before the rescale
+    - Acquired Parallelism: The parallelism of the vertex after the rescale
+    - Sufficient Parallelism: The minimum parallelism the vertex needs in order to run
+    - Desired Parallelism: The desired parallelism of the vertex
+  - Basic attributes and rescale changes per `Slot Sharing Group`
+    - Slot Sharing Group ID: The UUID of the `Slot Sharing Group` that the slots belong to
+    - Slot Sharing Group Name: The name of the `Slot Sharing Group` that the slots belong to
+    - Previous Slot: The number of slots before the rescale
+    - Acquired Slot: The number of slots after the rescale
+    - Desired Slot: The desired number of slots for the rescale
+    - Sufficient Slot: The minimum number of slots required to deploy the tasks in the rescale
+    - Request Profile: The resource profile requested for the `Slot Sharing Group` in the rescale
+    - Acquired Profile: The resource profile acquired for the `Slot Sharing Group` in the rescale
+  - The internal `Scheduler State History` of the `Adaptive Scheduler` during a rescale
+    - State: The name of the scheduler state
+    - Enter Time: The time the scheduler entered the state
+    - Leave Time: The time the scheduler left the state
+    - Duration: Time spent in the state (Leave Time − Enter Time)
+    - Exception: Exception information for the rescale recorded while in the state
+- `Summary`
+  This sub-page displays the total number of rescales that have occurred since the job was launched,
+  along with the respective counts of failures and successes.
+  It also provides statistical summaries of the rescale history,
+  such as rescale duration statistics grouped by rescale status, including `Min`, `Max`, `Avg`, and `P50`.
+- `Configuration`
+  This sub-page displays the configuration values that the Adaptive Scheduler uses when rescaling the current streaming job.
+
+### How to use it
+To enable the rescale history for streaming jobs that use the Adaptive Scheduler, set the following configuration option to a positive integer.
+The value is the number of most recent rescale records retained for the job.
+
+- `web.adaptive-scheduler.rescale-history.size`: `4`
+
+### More details
+
+See [FLIP-495](https://cwiki.apache.org/confluence/x/TQr0Ew) and [FLIP-487](https://cwiki.apache.org/confluence/x/vZCMEw) for more details.
+
 {{< top >}}
diff --git a/docs/content/docs/deployment/elastic_scaling.md b/docs/content/docs/deployment/elastic_scaling.md
index d0cdfdd06ca7d..b86b31f528519 100644
--- a/docs/content/docs/deployment/elastic_scaling.md
+++ b/docs/content/docs/deployment/elastic_scaling.md
@@ -198,4 +198,80 @@ Since Reactive Mode is a new, experimental feature, not all features supported b
 
 The [limitations of Adaptive Scheduler](#limitations-1) also apply to Reactive Mode.
 
+## Rescale History
+
+Before Flink 2.3, users and developers could not inspect the internal details of Adaptive Scheduler rescaling,
+which made operating such jobs inconvenient.
+For instance, users need visibility into specific resource changes, parallelism adjustments,
+and the time spent on each internal state transition during the rescaling process.
+This information is crucial for tuning parameters to achieve lower latency and higher stability in rescaling.
+
+Therefore, the Flink community introduced [FLIP-495](https://cwiki.apache.org/confluence/x/TQr0Ew) to support recording and storing the rescale history,
+and [FLIP-487](https://cwiki.apache.org/confluence/x/vZCMEw) to enable querying this history via the REST API and displaying it in the Web UI.
+
+### Information and Presentation of the Rescale History
+
+Since Flink 2.3, the Web UI provides a `Rescales` page,
+positioned at the same hierarchical level as the `Checkpoints` page and styled similarly.
+It consists mainly of the following sub-pages:
+
+- `Overview`
+  This sub-page displays recent rescale records across the different rescale terminal states,
+  along with basic rescale statistics for the job, such as the total number of rescales since job startup and the counts of failed and successful rescales.
+  The sub-page can also display detailed information for a rescale.
+
+- `History`
+  This sub-page displays abbreviated information for the most recent `n` rescale records, where `n` is determined by the configuration.
+  It can also display the following detailed information for each rescale:
+  - Basic information about a rescale
+    - Rescale UUID: The unique ID of the rescale, consisting of 32 hexadecimal characters (every UUID mentioned below uses the same format)
+    - Attempt ID: The sequence number of the rescale attempt among the attempts made under the same resource requirements
+    - Requirements ID: The UUID of the resource requirements
+    - Trigger Cause: The reason the rescale was triggered
+    - Terminal State: The final state of the rescale
+    - Terminated Reason: The reason the rescale lifecycle terminated
+    - Start Time: The time the rescale started
+    - Duration: The time from the start of the rescale to its termination, or to the current time if it is still in progress
+    - End Time: The time the rescale terminated, or the current time if it has not terminated yet
+  - Basic attributes and rescale changes per `Job Vertex`
+    - ID: The UUID of the `Job Vertex`
+    - Name: The short name of the vertex
+    - Slot Sharing Group ID: The UUID of the vertex's `Slot Sharing Group`
+    - Previous Parallelism: The parallelism of the vertex before the rescale
+    - Acquired Parallelism: The parallelism of the vertex after the rescale
+    - Sufficient Parallelism: The minimum parallelism the vertex needs in order to run
+    - Desired Parallelism: The desired parallelism of the vertex
+  - Basic attributes and rescale changes per `Slot Sharing Group`
+    - Slot Sharing Group ID: The UUID of the `Slot Sharing Group` that the slots belong to
+    - Slot Sharing Group Name: The name of the `Slot Sharing Group` that the slots belong to
+    - Previous Slot: The number of slots before the rescale
+    - Acquired Slot: The number of slots after the rescale
+    - Desired Slot: The desired number of slots for the rescale
+    - Sufficient Slot: The minimum number of slots required to deploy the tasks in the rescale
+    - Request Profile: The resource profile requested for the `Slot Sharing Group` in the rescale
+    - Acquired Profile: The resource profile acquired for the `Slot Sharing Group` in the rescale
+  - The internal `Scheduler State History` of the `Adaptive Scheduler` during a rescale
+    - State: The name of the scheduler state
+    - Enter Time: The time the scheduler entered the state
+    - Leave Time: The time the scheduler left the state
+    - Duration: Time spent in the state (Leave Time − Enter Time)
+    - Exception: Exception information for the rescale recorded while in the state
+- `Summary`
+  This sub-page displays the total number of rescales that have occurred since the job was launched,
+  along with the respective counts of failures and successes.
+  It also provides statistical summaries of the rescale history,
+  such as rescale duration statistics grouped by rescale status, including `Min`, `Max`, `Avg`, and `P50`.
+- `Configuration`
+  This sub-page displays the configuration values that the Adaptive Scheduler uses when rescaling the current streaming job.
+
+### How to use it
+To enable the rescale history for streaming jobs that use the Adaptive Scheduler, set the following configuration option to a positive integer.
+The value is the number of most recent rescale records retained for the job.
+
+- `web.adaptive-scheduler.rescale-history.size`: `4`
+
+### More details
+
+See [FLIP-495](https://cwiki.apache.org/confluence/x/TQr0Ew) and [FLIP-487](https://cwiki.apache.org/confluence/x/vZCMEw) for more details.
+
 {{< top >}}
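
As a minimal sketch of the `How to use it` step in the diff above, the following configuration fragment enables the rescale history and retains the four most recent rescale records. The `jobmanager.scheduler: adaptive` line is an assumption based on how the Adaptive Scheduler is enabled elsewhere on the same documentation page; it is not part of this diff, and where the fragment belongs (for example the cluster configuration file) depends on the deployment.

```yaml
# Assumption: the job runs with the Adaptive Scheduler (option not shown in this diff).
jobmanager.scheduler: adaptive
# From the diff: a positive integer enables the rescale history;
# the value is the number of most recent rescale records retained for the job.
web.adaptive-scheduler.rescale-history.size: 4
```

With a value of `4`, the `History` sub-page would be expected to list at most four rescale records, if `n` there corresponds to this option.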