[FLINK-38902][docs] Add user instructions and usage documentation for FLIP-487 #28003
base: master
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -189,4 +189,81 @@ cp ./examples/streaming/TopSpeedWindowing.jar lib/ | |
| Only the following deployment modes are supported: [Standalone deployment in Application mode]({{< ref "docs/deployment/resource-providers/standalone/overview" >}}#application-mode) (see [above](#getting-started)), [Docker deployment in Application mode]({{< ref "docs/deployment/resource-providers/standalone/docker" >}}#application-mode-on-docker), and [Standalone Kubernetes Application cluster mode]({{< ref "docs/deployment/resource-providers/standalone/kubernetes" >}}#deploy-application-cluster). | ||
|
|
||
| The [limitations of the Adaptive Scheduler](#limitations-1) also apply to Reactive Mode. | ||
|
|
||
| ## Rescale History | ||
|
|
||
| Before Flink 2.3, users and developers could not inspect the details of the Adaptive Scheduler's rescaling history, | ||
| which made operating rescaling jobs harder. | ||
| For instance, users need visibility into specific resource changes, parallelism adjustments, | ||
| and the time spent on each internal state transition during the rescaling process. | ||
| This information is crucial for tuning parameters to achieve lower latency and higher stability in rescaling. | ||
|
|
||
| Therefore, the Flink community introduced [FLIP-495](https://cwiki.apache.org/confluence/x/TQr0Ew) to support recording and storing rescaling history, | ||
| and [FLIP-487](https://cwiki.apache.org/confluence/x/vZCMEw) to enable querying via the REST API and displaying this history on the Web UI. | ||
|
|
||
| ### Rescale History Information in the Web UI | ||
|
|
||
| Since Flink 2.3, the Web UI includes a `Rescales` page, | ||
| located at the same level as the `Checkpoints` page and styled similarly. | ||
| It contains the following sub-pages: | ||
|
|
||
| - `Overview` | ||
| This sub-page displays recent rescale records for the various rescale terminal states, | ||
| along with basic rescale statistics for the job, such as the total number of rescales since job startup and the counts of failed and successful rescales. | ||
| The page can also display detailed information about a rescale. | ||
|
|
||
| - `History` | ||
| This sub-page displays abbreviated information for the most recent `n` rescale records, where `n` is set by configuration. | ||
| The page can also display detailed information about a rescale, as outlined below: | ||
| - The basic information of a rescale | ||
| - Rescale UUID: The unique ID of a rescale, consisting of 32 hexadecimal characters (every UUID mentioned below uses the same format) | ||
| - Attempt ID: The sequence number of a rescale attempt among the attempts made under the same resource requirements | ||
| - Requirements ID: The UUID of the resource requirements | ||
| - Trigger Cause: The reason that triggered the rescale | ||
| - Terminal State: The end state of the rescale | ||
| - Terminated Reason: The reason the rescale terminated | ||
| - Start Time: The start time of a rescale | ||
| - Duration: Duration from the start of the rescale to its completion or until now | ||
|
Contributor
I assume it is always the Duration from the start of the rescale. If the rescale is ongoing I guess the end is now.
Contributor
Author
Updated. |
||
| - End Time: The end time of the rescale if it has terminated; otherwise, the current time | ||
| - The basic attributes and rescale changes of each `Job Vertex` | ||
|
Contributor
Can we have a reference to Job Vertex?
Contributor
Author
Thanks @davidradl. Of course, if you think it's required, I'll try to add it. |
||
| - ID: The UUID of the target `Job Vertex` | ||
| - Name: The short name of the target vertex | ||
| - Slot Sharing Group ID: The UUID of the target `Slot Sharing Group` | ||
| - Previous Parallelism: The parallelism of the target vertex before the current rescale | ||
| - Acquired Parallelism: The parallelism of the target vertex after the current rescale | ||
| - Sufficient Parallelism: The minimum parallelism the target vertex needs in order to run | ||
| - Desired Parallelism: The desired parallelism of a `Job Vertex` | ||
| - The basic attributes and rescale changes of each `Slot Sharing Group` | ||
|
Contributor
I would say UUID, and define that once.
Contributor
Author
Updated. |
||
| - Slot Sharing Group ID: The UUID of the `Slot Sharing Group` to which the target slot belongs | ||
| - Slot Sharing Group Name: The name of the `Slot Sharing Group` to which the slot belongs | ||
| - Previous Slot: The number of slots before the rescale | ||
| - Acquired Slot: The number of slots after the rescale | ||
| - Desired Slot: The desired number of slots for the rescale | ||
| - Sufficient Slot: The minimum number of slots needed to deploy tasks in the rescale | ||
| - Request Profile: The requested resource profile of the `Slot Sharing Group` in the rescale | ||
| - Acquired Profile: The acquired resource profile of the `Slot Sharing Group` in the rescale | ||
| - The internal `Scheduler State History` of the Adaptive Scheduler within a rescale | ||
|
Contributor
Are the states documented?
Contributor
Author
These states are an abstract representation of the scheduler's internal state; therefore, they were omitted to avoid introducing unnecessary ambiguity. |
||
| - State: The scheduler state name | ||
| - Enter Time: The time at which the scheduler entered the state | ||
| - Leave Time: The time at which the scheduler left the state | ||
| - Duration: Time spent in the state (Leave Time − Enter Time) | ||
|
Contributor
Does this include the current state, which will not have an end time?
Contributor
Author
This doesn't include the current state, which will not have an end time. This is determined by the collection mechanism of the adaptive scheduler's states: simply put, when a rescale adds a state, that state already includes start and end times. |
||
| - Exception: The exception information for the current rescale recorded within the state | ||
| - `Summary` | ||
| This sub-page displays the total number of rescale events that have occurred since the job was launched, | ||
| along with the respective counts of failures and successes. | ||
| Additionally, it provides statistical summaries of the rescale history, | ||
| such as rescale duration statistics grouped by rescale status, including metrics such as `Min`, `Max`, `Avg`, and `P50`. | ||
| - `Configuration` | ||
| This sub-page displays the relevant parameter values used by the Adaptive Scheduler during rescaling operations for the current streaming job. | ||
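The timing fields described above can be read as simple arithmetic. The following is a minimal Python sketch of that reading, not Flink's implementation: it assumes timestamps are epoch milliseconds, and the function names `rescale_duration_ms` and `state_duration_ms` are hypothetical.

```python
from typing import Optional


def rescale_duration_ms(start_ms: int, end_ms: Optional[int], now_ms: int) -> int:
    """Duration of a rescale: from its start to its end, or to `now_ms`
    while the rescale is still in progress (end_ms is None)."""
    effective_end = end_ms if end_ms is not None else now_ms
    return effective_end - start_ms


def state_duration_ms(enter_ms: int, leave_ms: int) -> int:
    """Time spent in one scheduler state (Leave Time - Enter Time)."""
    return leave_ms - enter_ms
```

For a rescale that started at 1000 ms and is still running at 9000 ms, `rescale_duration_ms(1000, None, 9000)` yields 8000; once it ends at 4000 ms, the same call with `end_ms=4000` yields 3000.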
|
|
||
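The kind of statistics shown on the `Summary` sub-page can be sketched as follows. This is an illustration under assumptions, not Flink's code: the function name `summarize_durations` and the status keys in the usage note are placeholders, and `P50` is computed here as the plain median, which may differ from how Flink computes percentiles.

```python
from statistics import mean, median
from typing import Dict, List


def summarize_durations(
    durations_by_status: Dict[str, List[int]],
) -> Dict[str, Dict[str, float]]:
    """Compute Min/Max/Avg/P50 of rescale durations, grouped by rescale status.

    Statuses without any recorded duration are skipped.
    """
    summary: Dict[str, Dict[str, float]] = {}
    for status, durations in durations_by_status.items():
        if not durations:
            continue
        summary[status] = {
            "Min": min(durations),
            "Max": max(durations),
            "Avg": mean(durations),
            "P50": median(durations),  # assumption: P50 taken as the median
        }
    return summary
```

For example, `summarize_durations({"COMPLETED": [100, 300, 200], "FAILED": []})` returns one entry for the placeholder status `COMPLETED` and omits `FAILED`, since it has no samples.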
| ### How to use it | ||
| You can enable rescale history for streaming jobs that use the Adaptive Scheduler by setting the following configuration option to a positive integer. | ||
| This value indicates the number of recent rescale records retained for the job. | ||
|
|
||
| - `web.adaptive-scheduler.rescale-history.size`: `4` | ||
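The retention behaviour implied by this option, keeping only the most recent records, can be pictured with a bounded buffer. The sketch below is a hypothetical Python model (the class `RescaleHistory` is not a Flink class); rejecting non-positive sizes is a choice made for this sketch, not documented Flink behaviour.

```python
from collections import deque
from typing import Deque, List


class RescaleHistory:
    """Keeps at most `size` of the most recently added rescale records."""

    def __init__(self, size: int) -> None:
        if size <= 0:
            # Sketch-only choice: the option is described as a positive integer.
            raise ValueError("history size must be a positive integer")
        self._records: Deque[str] = deque(maxlen=size)

    def add(self, record: str) -> None:
        # Appending to a full deque silently drops the oldest record.
        self._records.append(record)

    def recent(self) -> List[str]:
        """Return the retained records, oldest first."""
        return list(self._records)
```

With a size of `4`, adding six records leaves only the last four available.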
|
|
||
| ### More details | ||
|
|
||
| See [FLIP-495](https://cwiki.apache.org/confluence/x/TQr0Ew) and [FLIP-487](https://cwiki.apache.org/confluence/x/vZCMEw) for more details. | ||
|
|
||
| {{< top >}} | ||
I am not sure what this sentence means: this is one ID, but it talks of `a rescale attempts`. Is it one or more attempts?
Thanks @davidradl
The detailed description of the rescaleIdInfo is here.
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=334760525#FLIP495:SupportAdaptiveSchedulerrecordandquerytherescalehistory-RescaleID&ResourceRequirementsrequest
Do you think we should include the picture from the linked doc in this doc?