feat(eap): gdpr export endpoint #7586
base: master
Conversation
| Tin = TypeVar("Tin", bound=ProtobufMessage)
| Tout = TypeVar("Tout", bound=ProtobufMessage)
|
| BUCKET_COUNT = 40
these are transferred from the get trace endpoint
factor it out and call it, don't repeat yourself
they are factored out into common.py and called in endpoint_export_trace_items.py and endpoint_get_trace.py
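For reference, the call sites would then just import the shared helpers, roughly like this (module path taken from the file list below; the exact import list is an assumption):

```python
# Sketch of the shared imports used by endpoint_export_trace_items.py and
# endpoint_get_trace.py; names come from the PR summary, exact form assumed.
from snuba.web.rpc.common.common import BUCKET_COUNT, process_arrays, transform_array_value
```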
| _DEFAULT_PAGE_SIZE = 10_000
|
| def _build_query(in_msg: ExportTraceItemsRequest, limit: int) -> Query:
I left out retention days because it's an internal field, but I can add it
| item_id = row.pop("hex_item_id")
| item_type = row.pop("item_type")
| ts = row.pop("timestamp")
| client_sample_rate = float(1.0 / row.pop("sampling_weight", 1.0))
I'm unsure how sampling_weight and sampling_factor translate into client_sample_rate and server_sample_rate.
This is a column on the table, there's no need to calculate it.
The TraceItem protobuf has two fields, client_sample_rate and server_sample_rate: https://github.com/getsentry/sentry-protos/blob/main/proto/sentry_protos/snuba/v1/trace_item.proto#L43-L44
The CH table has two columns, sampling_weight and sampling_factor: https://github.com/getsentry/snuba/blob/master/snuba/datasets/configuration/events_analytics_platform/entities/eap_items.yaml#L13-L14
Which one goes into which?
client_sample_rate and server_sample_rate are fields on the tables; they need to be exported as is. sampling_weight is not used anywhere anymore; only sampling_factor is used.
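In other words, something like the following minimal sketch, assuming the query selects the two columns under exactly these names and that 1.0 is an acceptable default for missing values:

```python
# Sketch: export the sample-rate columns as-is instead of deriving them from
# sampling_weight (column names and defaults are assumptions).
client_sample_rate = float(row.pop("client_sample_rate", 1.0))
server_sample_rate = float(row.pop("server_sample_rate", 1.0))
# sampling_weight is no longer used; drop sampling_factor too if it isn't exported.
row.pop("sampling_weight", None)
row.pop("sampling_factor", None)
```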
| literal(page_token.last_seen_timestamp),
| ),
| f.greater(
|     f.reinterpretAsUInt128(f.reverse(f.unhex(column("item_id")))),
took this from get trace pagination
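For context, the get-trace-style keyset condition looks roughly like this; the `and_cond`/`or_cond` helpers, the import path, and the `last_seen_item_id` field are assumptions based on the snippet above, not the exact code:

```python
# Assumed Snuba query DSL helpers (names follow the quoted hunk).
from snuba.query.dsl import Functions as f, and_cond, column, literal, or_cond

# Sketch of the keyset-pagination predicate:
# (timestamp > last_seen) OR (timestamp == last_seen AND item_id > last_seen_item_id)
pagination_filter = or_cond(
    f.greater(column("timestamp"), literal(page_token.last_seen_timestamp)),
    and_cond(
        f.equals(column("timestamp"), literal(page_token.last_seen_timestamp)),
        f.greater(
            # item_id is stored as a hex string; reverse + unhex + reinterpret
            # yields the UInt128 ordering ClickHouse sorts by.
            f.reinterpretAsUInt128(f.reverse(f.unhex(column("item_id")))),
            f.reinterpretAsUInt128(f.reverse(f.unhex(literal(page_token.last_seen_item_id)))),
        ),
    ),
)
```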
| rows = results.result.get("data", [])
| processed_results = _convert_rows(rows)
|
| limit -= len(processed_results.items)
If limit <= 0, that means there are >= 10,000 rows in the requested timeframe and we're returning the first 10,000, so we should update the page token to the latest timestamp / trace item id.
Otherwise (limit > 0), there are < 10,000 rows in the requested timeframe and we're returning all of them in one go, so there's no need to paginate through the rest.
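As a sketch of that decision (the field names on the token and on the items are illustrative, not the exact implementation):

```python
# Decide whether another page is needed after converting this batch of rows.
limit -= len(processed_results.items)

if limit <= 0 and processed_results.items:
    # The timeframe holds at least a full page: point the token at the last
    # returned row so the next request resumes right after it.
    last = processed_results.items[-1]
    page_token.last_seen_timestamp = last.timestamp  # illustrative field names
    page_token.last_seen_item_id = last.item_id
    page_token.end_pagination = False
else:
    # Fewer rows than the page size: everything was returned in one go.
    page_token.end_pagination = True
```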
| item_id = row.pop("hex_item_id")
| item_type = row.pop("item_type")
| ts = row.pop("timestamp")
| client_sample_rate = float(1.0 / row.pop("sampling_weight", 1.0))
This is a column on the table, there's no need to calculate it.
| item_type = row.pop("item_type")
| ts = row.pop("timestamp")
| client_sample_rate = float(1.0 / row.pop("sampling_weight", 1.0))
| server_sample_rate = float(1.0 / row.pop("sampling_weight", 1.0))
Same as client_sample_rate.
Pull request overview
This PR implements a GDPR export endpoint for EAP (Events Analytics Platform) and fixes an issue where the query pipeline was transforming columns in the ORDER BY clause, which was defeating ClickHouse optimizations and making pagination ineffective.
Key Changes:
- Added new EndpointExportTraceItems RPC endpoint with pagination support for exporting trace items
- Introduced skip_transform_order_by flag in query settings to preserve ORDER BY clause column names for optimal ClickHouse performance
- Refactored common array processing functions (process_arrays, transform_array_value) and the BUCKET_COUNT constant to shared utilities
Reviewed changes
Copilot reviewed 11 out of 12 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| uv.lock | Updated sentry-protos dependency from 0.4.8 to 0.4.9 to support new export endpoint protobuf definitions |
| pyproject.toml | Updated sentry-protos version requirement to match lock file |
| tests/web/rpc/v1/test_utils.py | Added BASE_TIME constant definition and timezone import for test consistency |
| tests/web/rpc/v1/test_endpoint_export_trace_items.py | Added comprehensive tests for the new export endpoint including pagination and order by transformation validation |
| snuba/web/rpc/v1/endpoint_export_trace_items.py | Implemented new GDPR export endpoint with cursor-based pagination and proper attribute handling |
| snuba/web/rpc/common/common.py | Extracted shared array processing utilities and BUCKET_COUNT constant for reuse across endpoints |
| snuba/web/rpc/v1/endpoint_trace_item_details.py | Updated to use shared BUCKET_COUNT constant from common module |
| snuba/web/rpc/v1/endpoint_get_trace.py | Refactored to use shared process_arrays function from common module |
| snuba/query/query_settings.py | Added skip_transform_order_by setting to control ORDER BY clause transformation |
| snuba/query/__init__.py | Added skip_transform_order_by parameter to transform_expressions method |
| snuba/query/processors/physical/type_converters.py | Updated to respect skip_transform_order_by setting during query processing |
| snuba/pipeline/stages/query_processing.py | Modified storage processing stage to pass skip_transform_order_by setting through transformation pipeline |
| from typing import Optional
|
| from sentry_protos.snuba.v1.trace_item_attribute_pb2 import AttributeKey, AttributeValue
| from sentry_protos.snuba.v1.trace_item_filter_pb2 import (
|     AndFilter,
|     ComparisonFilter,
|     TraceItemFilter,
| )
Copilot AI (Dec 29, 2025)
These imports should be moved to the top of the file with the other imports, rather than appearing after the constant definition. Import statements should be grouped at the beginning of the file following PEP 8 guidelines.
| # current UTC time rounded down to the start of the current hour, then minus 180 minutes.
| BASE_TIME = datetime.now(tz=timezone.utc).replace(
|     minute=0,
|     second=0,
|     microsecond=0,
| ) - timedelta(minutes=180)
Copilot AI (Dec 29, 2025)
The constant BASE_TIME is defined twice in this file: once on line 22 using datetime.now(tz=UTC) and again on lines 85-89 using datetime.now(tz=timezone.utc). The second definition shadows the first. This duplicate definition should be removed. Note that UTC and timezone.utc are equivalent but the module uses both imports inconsistently.
Suggested change (remove the duplicate definition):
- # current UTC time rounded down to the start of the current hour, then minus 180 minutes.
- BASE_TIME = datetime.now(tz=timezone.utc).replace(
-     minute=0,
-     second=0,
-     microsecond=0,
- ) - timedelta(minutes=180)
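A minimal sketch of keeping a single, consistent definition (datetime.UTC is available on Python 3.11+; timezone.utc is the older spelling of the same object):

```python
from datetime import UTC, datetime, timedelta

# Current UTC time rounded down to the start of the hour, minus 180 minutes;
# defined once so the module uses a single UTC import.
BASE_TIME = datetime.now(tz=UTC).replace(
    minute=0, second=0, microsecond=0
) - timedelta(minutes=180)
```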
| return SnubaRequest(
|     id=uuid.UUID(in_msg.meta.request_id),
Copilot AI (Dec 29, 2025)
The request_id field is not set in the test_with_pagination test case (lines 105-115), but line 266 in the implementation attempts to convert it to a UUID without checking if it's empty. If request_id is an empty string or not set, uuid.UUID() will raise a ValueError. Consider either validating the request_id and providing a clear error message, or generating a default UUID if the field is not provided.
Suggested change:
- return SnubaRequest(
-     id=uuid.UUID(in_msg.meta.request_id),
+ raw_request_id = in_msg.meta.request_id
+ if not raw_request_id:
+     request_uuid = uuid.uuid4()
+ else:
+     try:
+         request_uuid = uuid.UUID(raw_request_id)
+     except ValueError:
+         raise BadSnubaRPCRequestException(f"Invalid request_id: {raw_request_id!r}")
+ return SnubaRequest(
+     id=request_uuid,
| client_sample_rate = float(1.0 / row.pop("sampling_weight", 1.0))
| server_sample_rate = float(1.0 / row.pop("sampling_weight", 1.0))
Copilot AI (Dec 29, 2025)
The sampling_weight key is being popped twice from the row on consecutive lines. Line 327 pops it to calculate client_sample_rate, and then line 328 tries to pop the same key again for server_sample_rate. This will cause the second pop to always use the default value of 1.0, making server_sample_rate always equal to 1.0, which is incorrect. The calculation should likely use different keys or the same value should be reused.
Suggested change:
- client_sample_rate = float(1.0 / row.pop("sampling_weight", 1.0))
- server_sample_rate = float(1.0 / row.pop("sampling_weight", 1.0))
+ sampling_weight = float(row.pop("sampling_weight", 1.0))
+ client_sample_rate = float(1.0 / sampling_weight)
+ server_sample_rate = float(1.0 / sampling_weight)
| assert response.page_token.end_pagination == False
| else:
|     assert response.page_token.end_pagination == True
Copilot AI (Dec 29, 2025)
Use is False instead of == False for boolean comparisons. This is a Python best practice that avoids potential issues with truthy/falsy values and is more explicit about comparing with the boolean singleton.
Suggested change:
- assert response.page_token.end_pagination == False
- else:
-     assert response.page_token.end_pagination == True
+ assert response.page_token.end_pagination is False
+ else:
+     assert response.page_token.end_pagination is True
| assert response.page_token.end_pagination == False
| else:
|     assert response.page_token.end_pagination == True
Copilot AI (Dec 29, 2025)
Use is True instead of == True for boolean comparisons. This is a Python best practice that avoids potential issues with truthy/falsy values and is more explicit about comparing with the boolean singleton.
Suggested change:
- assert response.page_token.end_pagination == False
- else:
-     assert response.page_token.end_pagination == True
+ assert response.page_token.end_pagination is False
+ else:
+     assert response.page_token.end_pagination is True
| return float(v)
| if t in {"String", "Bool"}:
|     return v
| raise BadSnubaRPCRequestException(f"array value type unknown: {type(v)}")
Copilot AI (Dec 29, 2025)
In the exception message, type(v) will be undefined if the dictionary value is empty, as v is only defined within the loop scope. Consider using type(value) instead to show the type of the entire value dictionary, which would be more meaningful for debugging.
Suggested change:
- raise BadSnubaRPCRequestException(f"array value type unknown: {type(v)}")
+ raise BadSnubaRPCRequestException(f"array value type unknown: {type(value)}")
| filters[2].comparison_filter.key.name == "last_seen_trace_id"
| and filters[2].comparison_filter.key.type == AttributeKey.Type.TYPE_STRING
| ):
|     raise ValueError("Invalid item type")
Copilot AI (Dec 29, 2025)
The error message says "Invalid item type" but this validation is checking the trace_id field. The error message should be updated to "Invalid trace_id" or "Invalid page token" to accurately reflect what is being validated.
Suggested change:
- raise ValueError("Invalid item type")
+ raise ValueError("Invalid trace_id")
| ts = row.pop("timestamp")
| client_sample_rate = float(1.0 / row.pop("sampling_weight", 1.0))
| server_sample_rate = float(1.0 / row.pop("sampling_weight", 1.0))
| sampling_factor = row.pop("sampling_factor", 1.0)  # noqa: F841
Copilot AI (Dec 29, 2025)
Variable sampling_factor is not used.
Suggested change:
- sampling_factor = row.pop("sampling_factor", 1.0)  # noqa: F841
+ row.pop("sampling_factor", None)
| integers = row.pop("attributes_int", {}) or {}
| floats = row.pop("attributes_float", {}) or {}
|
| breakpoint()
This comment was marked as outdated.
| client_sample_rate = float(1.0 / row.pop("client_sample_rate", 1.0))
| server_sample_rate = float(1.0 / row.pop("server_sample_rate", 1.0))
This comment was marked as outdated.
https://linear.app/getsentry/issue/EAP-320/data-export-endpoint
While working on this PR, we also uncovered that the query pipeline transforms the columns in the ORDER BY clause. This is unintended, for EAP at least, because it defeats the ClickHouse optimizations on the sort key and makes pagination ineffective. My fix was to pass in a flag that tells the query pipeline whether or not to transform the ORDER BY clause. This is the safest, simplest, and fastest solution that is also clean (the query pipeline is used by other Snuba code that I'm unfamiliar with).
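For illustration, here is a minimal, self-contained sketch of the idea behind the flag; the class and function names below are stand-ins, not Snuba's actual query-pipeline API:

```python
from dataclasses import dataclass, replace
from typing import Callable, List


@dataclass(frozen=True)
class OrderBySketch:
    direction: str
    expression: str  # stand-in for a real expression node


@dataclass
class QuerySketch:
    order_by: List[OrderBySketch]


def transform_order_by(
    query: QuerySketch,
    transform: Callable[[str], str],
    skip_transform_order_by: bool,
) -> None:
    # When the new setting is on, leave ORDER BY expressions untouched so
    # ClickHouse can still match them against the table's sort key and the
    # keyset pagination stays stable.
    if skip_transform_order_by:
        return
    query.order_by = [
        replace(ob, expression=transform(ob.expression)) for ob in query.order_by
    ]
```

In the PR itself this is carried as the skip_transform_order_by setting in query settings and checked by the physical query processors and the storage processing stage listed in the file summary above.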