Skip to content

Conversation

@tomerqodo
Copy link

Benchmark PR elastic#136889

Type: Clean (correct implementation)

Original PR Title: Remove early phase failure in batched
Original PR Description: Resolves elastic#134151, elastic#130821.

Background

A bug was introduced by elastic#121885 due to the following code, which handles batched query exceptions due to a batched partial reduction failure: https://github.com/elastic/elasticsearch/blob/bd356491b1e32b19993ed6cd70cc2415df1253ce/server/src/main/java/org/elasticsearch/action/search/SearchQueryThenFetchAsyncAction.java#L525-L544

Raising a phase failure in this way leads to a couple issues:

  1. It can be called more than once (as seen in [Search] Exceptions in datanodes leading to assertFirstRun() failures elastic/elasticsearch#134151).
  2. The subsequent freeing of contexts can miss concurrent in-flight queries, resulting in open contexts after the failure (as seen in [CI] SearchWithRejectionsIT testOpenContextsAfterRejections failing elastic/elasticsearch#130821).

Solution

Problem 1 could be resolved with a simple flag, as proposed in elastic#131085. Problem 2 could be resolved with some careful use of the same flag to clean contexts upon receiving stale query results. However, in the interest of stability, I propose a solution that more closely resembles how a reduction failure is handled by a non-batched query phase. In non-batched, a reduction failure is held in the QueryPhaseResultConsumer until shard fanout is complete. Only later, during final reduction at the beginning of the fetch phase, do we fail the search.

Fast failure + proper task cancellation are worthy goals for the future. I am tracking these as follow-up improvements for after the release of batched query execution.

This PR:

  1. Alters a batched query request to respond with shard results in the case of a reduction failure on the data node (the failure is now conditionally included in the NodeQueryResponse).
  2. Removes the early phase failure on the coord node. The coord's QueryPhaseResultConsumer will hold onto the failure and fail eventually during the fetch phase, same as non-batched.
    Original PR URL: Remove early phase failure in batched elastic/elasticsearch#136889

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants