Add datetime range aliases for optimized index filtering #537
base: main
Hi @jonhealy1, will you have time soon to do a code review?
@Gomez324 I will make time this weekend. Can you fix the conflicts? Thanks
        "gte": None,
        "lte": datetime_search.get("lte") if not USE_DATETIME else None,
    },
}
This added code complicates the core database logic by tightly coupling it to a specific indexing strategy. Please move this calculation into the IndexSelector (the actual consumer) to keep the core method focused solely on query construction.
"opensearch-py[async]~=2.8.0",
"uvicorn~=0.23.0",
"starlette>=0.35.0,<0.36.0",
"redis==6.4.0",
Redis should not be installed in the core package, since most users probably won't use it. It can be installed with pip install stac-fastapi-elasticsearch[redis] or with the dev extra.
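One way to keep Redis optional, sketched here under assumptions: the dependency moves out of the core requirements into an extras group, and the code that needs it imports it lazily. The helper name get_redis_module and the error wording are hypothetical and not taken from this PR.

```python
import importlib


def get_redis_module():
    """Import redis lazily so the core package works without it.

    Hypothetical helper for illustration; not part of the PR.
    """
    try:
        return importlib.import_module("redis")
    except ImportError as exc:
        # Point the user at the optional extra instead of failing at import time.
        raise RuntimeError(
            "Redis support is optional; install it with "
            "'pip install stac-fastapi-elasticsearch[redis]'"
        ) from exc
```

With this shape, only the code paths that actually use Redis pay the cost of the dependency, and everyone else can install the core package unchanged.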
if not datetime_search:
    return search, result_metadata
See the other comment on the Elasticsearch version of this code.
raise HTTPException(
    status_code=status.HTTP_400_BAD_REQUEST,
    detail="Product datetime is required for indexing",
    detail="Product 'start_datetime', 'datetime' and 'end_datetime' is required for indexing",
This validation logic violates the STAC specification in two ways:
- It creates a mandatory requirement for start_datetime and end_datetime, which are optional fields in the spec.
- It rejects items where datetime is null (but start/end are present), which is explicitly allowed for interval data.
Please refactor this to handle standard STAC items (single datetime) and interval items (null datetime) correctly.
@jonhealy1 I agree with you. However, if indexes are to be created based on start_datetime, then that field must always be required.
What if we tie this validation to the existing USE_DATETIME setting?
If USE_DATETIME=true (Default): We allow items that only have a datetime field. In these cases, we can derive the index partition name from the datetime field instead of raising a 400 error.
If USE_DATETIME=false: Then strict enforcement of start_datetime is appropriate.
This ensures we support standard STAC items (point-in-time) without forcing users to reconfigure or reformat their data.
Good idea. I'll need some more time to implement it, but it is doable.
If USE_DATETIME is true, then datetime is required, and the aliases will work as they do now using only datetime, so the migration tool will not be needed? And if it is false, then start_datetime and end_datetime are required, while datetime becomes optional?
@Gomez324 Sounds good! Yes, I think migration scripts would not be needed.
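The rule agreed on in this thread could look roughly like the sketch below. It is only an illustration of the proposal, not the code in the PR: the function name partition_field is hypothetical, and ValueError stands in for the HTTP 400 the API would return.

```python
from typing import Any, Dict


def partition_field(properties: Dict[str, Any], use_datetime: bool) -> str:
    """Return the property whose value names the index partition.

    Hypothetical helper: raises ValueError where the API would return a 400.
    """
    if use_datetime:
        # Point-in-time mode: only `datetime` is required.
        if properties.get("datetime") is None:
            raise ValueError("'datetime' is required when USE_DATETIME=true")
        return "datetime"

    # Interval mode: start/end are required, `datetime` may be null.
    missing = [
        name
        for name in ("start_datetime", "end_datetime")
        if properties.get(name) is None
    ]
    if missing:
        raise ValueError(
            f"{', '.join(missing)} required when USE_DATETIME=false"
        )
    return "start_datetime"
```

Under this sketch a standard item with only datetime is accepted when USE_DATETIME=true, and an interval item with a null datetime is accepted when USE_DATETIME=false.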
datetime_alias = index_dict.get("datetime")

if not start_datetime_alias:
    continue
This line effectively makes all existing production indexes invisible to the API. Current indexes do not have start_datetime aliases.
- Where is the migration plan to backfill aliases on historical data?
- Without a migration, this change breaks backwards compatibility and will return 0 results for existing datasets.
"elasticsearch[async]~=8.19.1",
"uvicorn~=0.23.0",
"starlette>=0.35.0,<0.36.0",
"redis==6.4.0",
Same here - let's not install redis here. It's an optional feature.
@Gomez324 In the description for this PR, you state that Index B (12th-17th) lies outside the requested range (10th-16th) and would be skipped. This description implies incorrect behavior. STAC API searches rely on Intersection, not Containment. Since Index B overlaps with the search window, it must be queried; otherwise, valid items from the 12th to the 16th would be hidden from the user. Looking at the code in check_criteria, it appears you are correctly implementing intersection logic (which contradicts your description). Please update the PR description to avoid confusion, as the current example implies the feature is broken.
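The difference between the two tests can be sketched in a few lines. This is an illustrative comparison only, not the PR's check_criteria; the function names are hypothetical, and the dates assume the February 2020 ranges used in the PR description.

```python
from datetime import date


def intersects(idx_start, idx_end, q_start, q_end):
    # True when the index range and the query range share at least one day.
    return idx_start <= q_end and idx_end >= q_start


def contained(idx_start, idx_end, q_start, q_end):
    # True only when the index range lies entirely inside the query range.
    return q_start <= idx_start and idx_end <= q_end


index_b = (date(2020, 2, 12), date(2020, 2, 17))
query = (date(2020, 2, 10), date(2020, 2, 16))

print(intersects(*index_b, *query))  # Index B overlaps the window
print(contained(*index_b, *query))   # but is not fully inside it
```

A containment test would drop Index B because it ends on the 17th, which is why intersection is the test that keeps items from the 12th to the 16th visible.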
Hey @jonhealy1, I've fixed the code according to the suggestions; it's ready for a CR.
Related Issue(s):
Description:
Until now, only the datetime field had aliases. This change adds aliases for start_datetime and end_datetime when USE_DATETIME=false, which enables optimized filtering when searching by these fields. It improves performance because Elasticsearch/OpenSearch can now route queries to the appropriate indices instead of scanning a larger number of them.
When USE_DATETIME=true, the system works as before with datetime-based aliases only.
Example with USE_DATETIME=false:
Index A with aliases:
{
"start_datetime": "items_start_datetime_new-collection_2020-02-08",
"end_datetime": "items_end_datetime_new-collection_2020-02-16"
}
Index B with aliases:
{
"start_datetime": "items_start_datetime_new-collection_2020-02-12",
"end_datetime": "items_end_datetime_new-collection_2020-02-17"
}
Index C with aliases:
{
"start_datetime": "items_start_datetime_new-collection_2020-02-18",
"end_datetime": "items_end_datetime_new-collection_2020-02-20"
}
When a user searches in the range start_datetime/end_datetime = 2020-02-10 / 2020-02-16, Index A and Index B will be queried because both indices overlap with the requested range. Index C will be excluded because it does not intersect with that time window.
Previously, all indices could have been selected, but the new aliases allow the query engine to efficiently identify which indices overlap with the target range and avoid scanning unrelated ones, such as Index C.
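The selection in this example can be sketched as follows. This is a minimal illustration, assuming each alias ends in a _YYYY-MM-DD date as shown above; the index names, the INDEXES mapping, and both helper functions are hypothetical and are not the PR's IndexSelector code.

```python
from datetime import date

# Hypothetical mapping of index name to its aliases, copied from the example.
INDEXES = {
    "index_a": {
        "start_datetime": "items_start_datetime_new-collection_2020-02-08",
        "end_datetime": "items_end_datetime_new-collection_2020-02-16",
    },
    "index_b": {
        "start_datetime": "items_start_datetime_new-collection_2020-02-12",
        "end_datetime": "items_end_datetime_new-collection_2020-02-17",
    },
    "index_c": {
        "start_datetime": "items_start_datetime_new-collection_2020-02-18",
        "end_datetime": "items_end_datetime_new-collection_2020-02-20",
    },
}


def alias_date(alias: str) -> date:
    # Assumes the alias ends with "_YYYY-MM-DD", as in the example above.
    return date.fromisoformat(alias.rsplit("_", 1)[1])


def select_indexes(indexes, query_start: date, query_end: date):
    """Return the names of indexes whose date range intersects the query."""
    selected = []
    for name, aliases in indexes.items():
        start = alias_date(aliases["start_datetime"])
        end = alias_date(aliases["end_datetime"])
        if start <= query_end and end >= query_start:
            selected.append(name)
    return selected
```

Calling select_indexes(INDEXES, date(2020, 2, 10), date(2020, 2, 16)) returns index_a and index_b and leaves out index_c, matching the behavior described above.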
To enable this feature, set USE_DATETIME=false in your configuration. If you want to keep the previous behavior with datetime aliases, set USE_DATETIME=true.
PR Checklist:
- pre-commit run --all-files
- make test