
Conversation

@xurui-c (Member) commented Dec 10, 2025

https://linear.app/getsentry/issue/EAP-320/data-export-endpoint

While working on this PR, we also uncovered that the query pipeline transforms the columns in the ORDER BY clause. This is unintended, at least for EAP, because it defeats ClickHouse's sort-key optimizations and makes pagination ineffective. My fix was to pass in a flag that tells the query pipeline whether or not to transform the ORDER BY clause. This is the safest, simplest, and fastest solution that is also clean (the query pipeline is used by other Snuba components that I'm less familiar with). A rough sketch of the flag is shown after the note below.

  • Will fix the get trace endpoint in a follow up PR in the interest of keeping this PR small
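For illustration, here is a minimal sketch of what such a flag does, using a toy query class rather than the real Snuba Query/QuerySettings types (the actual change threads a skip_transform_order_by setting through the pipeline; see the review summary further down):

```python
# Toy sketch only: not the actual Snuba implementation.
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ToyQuery:
    selected_columns: List[str]
    order_by: List[str] = field(default_factory=list)

    def transform_expressions(
        self,
        func: Callable[[str], str],
        skip_transform_order_by: bool = False,
    ) -> None:
        # Always transform the selected columns.
        self.selected_columns = [func(c) for c in self.selected_columns]
        if not skip_transform_order_by:
            # Only rewrite ORDER BY when allowed; keeping the original column
            # names lets ClickHouse match them against the table's sort key,
            # which also keeps keyset pagination cheap.
            self.order_by = [func(c) for c in self.order_by]
```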

Tin = TypeVar("Tin", bound=ProtobufMessage)
Tout = TypeVar("Tout", bound=ProtobufMessage)

BUCKET_COUNT = 40
Member Author

these are transferred from the get trace endpoint

Member

factor it out and call it, don't repeat yourself

Member Author

they are factored out into common.py and called in endpoint_export_trace_items.py and endpoint_get_trace.py
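For reference, a minimal sketch of that layout (the function signatures and bodies are placeholders, not the PR's code):

```python
# snuba/web/rpc/common/common.py
BUCKET_COUNT = 40


def transform_array_value(value: dict) -> object:
    """Placeholder: convert one bucketed attribute value to a plain Python value."""
    ...


def process_arrays(row: dict) -> dict:
    """Placeholder: fold the attributes_* bucket columns into a single mapping."""
    ...


# snuba/web/rpc/v1/endpoint_export_trace_items.py (and endpoint_get_trace.py)
from snuba.web.rpc.common.common import BUCKET_COUNT, process_arrays
```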

_DEFAULT_PAGE_SIZE = 10_000


def _build_query(in_msg: ExportTraceItemsRequest, limit: int) -> Query:
Member Author

I left out retention days because it's an internal field, but I can add it

item_id = row.pop("hex_item_id")
item_type = row.pop("item_type")
ts = row.pop("timestamp")
client_sample_rate = float(1.0 / row.pop("sampling_weight", 1.0))
Member Author

I'm unsure how sampling_weight and sampling_factor translate into client_sample_rate and server_sample_rate

Contributor

This is a column on the table, there's no need to calculate it.

Member Author

The TraceItem protobuf has two fields, client_sample_rate and server_sample_rate: https://github.com/getsentry/sentry-protos/blob/main/proto/sentry_protos/snuba/v1/trace_item.proto#L43-L44

The CH table has two columns, sampling_weight and sampling_factor: https://github.com/getsentry/snuba/blob/master/snuba/datasets/configuration/events_analytics_platform/entities/eap_items.yaml#L13-L14

Which one goes into which?

@phacops (Contributor) commented Dec 29, 2025

client_sample_rate and server_sample_rate are fields on the tables; they need to be exported as is. sampling_weight is not used anywhere anymore; only sampling_factor is used.

https://github.com/getsentry/snuba/blob/master/snuba/snuba_migrations/events_analytics_platform/0048_add_client_and_server_sample_rates.py
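Following that suggestion, the row conversion would read the two columns directly instead of deriving them from sampling_weight. A hedged sketch (helper name and fallback values are illustrative, not the PR's final code):

```python
def _sample_rates(row: dict) -> tuple[float, float]:
    # client_sample_rate / server_sample_rate are stored on the eap_items
    # tables (see migration 0048 above), so export them as-is.
    client_sample_rate = float(row.pop("client_sample_rate", 1.0))
    server_sample_rate = float(row.pop("server_sample_rate", 1.0))
    # The legacy sampling columns are not part of the exported TraceItem,
    # so drop them from the row rather than converting them.
    row.pop("sampling_weight", None)
    row.pop("sampling_factor", None)
    return client_sample_rate, server_sample_rate
```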

@xurui-c changed the title from Rachel/gdpr to feat(eap): gdpr export endpoint on Dec 10, 2025
literal(page_token.last_seen_timestamp),
),
f.greater(
f.reinterpretAsUInt128(f.reverse(f.unhex(column("item_id")))),
Member Author

took this from get trace pagination
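For context, the condition being built is a keyset-pagination predicate: resume strictly after the (timestamp, item_id) pair recorded in the page token. A hedged sketch of its overall shape, assuming the column/literal/Functions/and_cond/or_cond helpers from snuba.query.dsl and an illustrative last_seen_item_id field name:

```python
from snuba.query.dsl import Functions as f, and_cond, column, literal, or_cond
from snuba.query.expressions import Expression


def keyset_condition(last_seen_timestamp: int, last_seen_item_id_hex: str) -> Expression:
    # item_id is stored as a hex string; unhex + reverse + reinterpretAsUInt128
    # recovers the UInt128 ordering that the table's sort key uses.
    def as_uint128(expr: Expression) -> Expression:
        return f.reinterpretAsUInt128(f.reverse(f.unhex(expr)))

    return or_cond(
        f.greater(column("timestamp"), literal(last_seen_timestamp)),
        and_cond(
            f.equals(column("timestamp"), literal(last_seen_timestamp)),
            f.greater(
                as_uint128(column("item_id")),
                as_uint128(literal(last_seen_item_id_hex)),
            ),
        ),
    )
```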

rows = results.result.get("data", [])
processed_results = _convert_rows(rows)

limit -= len(processed_results.items)
Member Author

if limit <= 0, that means there are >= 10,000 rows in the requested timeframe and we're returning the first 10,000, so we should update the page token to the latest timestamp/trace item id

otherwise (limit > 0), there are < 10,000 rows in the requested timeframe and we're returning all of them in one go, so there's no need to paginate through the rest
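A sketch of that decision (illustrative names and plain dicts instead of the protobuf page token, not the PR's code):

```python
_DEFAULT_PAGE_SIZE = 10_000


def next_page_token(rows: list[dict], remaining_limit: int) -> dict:
    if remaining_limit <= 0 and rows:
        # Page is full: at least _DEFAULT_PAGE_SIZE rows exist in the requested
        # timeframe, so point the token at the last row returned.
        last_row = rows[-1]
        return {
            "last_seen_timestamp": last_row["timestamp"],
            "last_seen_item_id": last_row["hex_item_id"],  # assumed field name
            "end_pagination": False,
        }
    # Fewer rows than the page size: everything fit in one response.
    return {"end_pagination": True}
```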

item_id = row.pop("hex_item_id")
item_type = row.pop("item_type")
ts = row.pop("timestamp")
client_sample_rate = float(1.0 / row.pop("sampling_weight", 1.0))
Contributor

This is a column on the table, there's no need to calculate it.

item_type = row.pop("item_type")
ts = row.pop("timestamp")
client_sample_rate = float(1.0 / row.pop("sampling_weight", 1.0))
server_sample_rate = float(1.0 / row.pop("sampling_weight", 1.0))
Contributor

Same as client_sample_rate.

c
almost done

get rid of some stuff

clean

item id

item id

page token done?

pagination

cleanup

cleanup

pagination

strings

remove comment

index

optimize

idk

smth up with item_id

fixed

cleanup

c
@xurui-c xurui-c marked this pull request as ready for review December 22, 2025 21:28
@xurui-c xurui-c requested review from a team as code owners December 22, 2025 21:28
Copilot AI review requested due to automatic review settings December 29, 2025 20:31
Comment on lines 327 to 328
client_sample_rate = float(1.0 / row.pop("sampling_weight", 1.0))
server_sample_rate = float(1.0 / row.pop("sampling_weight", 1.0))

This comment was marked as outdated.

Copilot AI left a comment

Pull request overview

This PR implements a GDPR export endpoint for EAP (Events Analytics Platform) and fixes an issue where the query pipeline was transforming columns in the ORDER BY clause, which was defeating ClickHouse optimizations and making pagination ineffective.

Key Changes:

  • Added new EndpointExportTraceItems RPC endpoint with pagination support for exporting trace items
  • Introduced skip_transform_order_by flag in query settings to preserve ORDER BY clause column names for optimal ClickHouse performance
  • Refactored common array processing functions (process_arrays, transform_array_value) and BUCKET_COUNT constant to shared utilities

Reviewed changes

Copilot reviewed 11 out of 12 changed files in this pull request and generated 9 comments.

Show a summary per file

  • uv.lock: Updated sentry-protos dependency from 0.4.8 to 0.4.9 to support new export endpoint protobuf definitions
  • pyproject.toml: Updated sentry-protos version requirement to match lock file
  • tests/web/rpc/v1/test_utils.py: Added BASE_TIME constant definition and timezone import for test consistency
  • tests/web/rpc/v1/test_endpoint_export_trace_items.py: Added comprehensive tests for the new export endpoint including pagination and order by transformation validation
  • snuba/web/rpc/v1/endpoint_export_trace_items.py: Implemented new GDPR export endpoint with cursor-based pagination and proper attribute handling
  • snuba/web/rpc/common/common.py: Extracted shared array processing utilities and BUCKET_COUNT constant for reuse across endpoints
  • snuba/web/rpc/v1/endpoint_trace_item_details.py: Updated to use shared BUCKET_COUNT constant from common module
  • snuba/web/rpc/v1/endpoint_get_trace.py: Refactored to use shared process_arrays function from common module
  • snuba/query/query_settings.py: Added skip_transform_order_by setting to control ORDER BY clause transformation
  • snuba/query/__init__.py: Added skip_transform_order_by parameter to transform_expressions method
  • snuba/query/processors/physical/type_converters.py: Updated to respect skip_transform_order_by setting during query processing
  • snuba/pipeline/stages/query_processing.py: Modified storage processing stage to pass skip_transform_order_by setting through transformation pipeline


Comment on lines +42 to +49
from typing import Optional

from sentry_protos.snuba.v1.trace_item_attribute_pb2 import AttributeKey, AttributeValue
from sentry_protos.snuba.v1.trace_item_filter_pb2 import (
AndFilter,
ComparisonFilter,
TraceItemFilter,
)
Copilot AI Dec 29, 2025

These imports should be moved to the top of the file with the other imports, rather than appearing after the constant definition. Import statements should be grouped at the beginning of the file following PEP 8 guidelines.

Comment on lines +84 to +89
# current UTC time rounded down to the start of the current hour, then minus 180 minutes.
BASE_TIME = datetime.now(tz=timezone.utc).replace(
minute=0,
second=0,
microsecond=0,
) - timedelta(minutes=180)
Copilot AI Dec 29, 2025

The constant BASE_TIME is defined twice in this file: once on line 22 using datetime.now(tz=UTC) and again on lines 85-89 using datetime.now(tz=timezone.utc). The second definition shadows the first. This duplicate definition should be removed. Note that UTC and timezone.utc are equivalent but the module uses both imports inconsistently.

Suggested change
# current UTC time rounded down to the start of the current hour, then minus 180 minutes.
BASE_TIME = datetime.now(tz=timezone.utc).replace(
minute=0,
second=0,
microsecond=0,
) - timedelta(minutes=180)

Comment on lines +265 to +266
return SnubaRequest(
id=uuid.UUID(in_msg.meta.request_id),
Copilot AI Dec 29, 2025

The request_id field is not set in the test_with_pagination test case (lines 105-115), but line 266 in the implementation attempts to convert it to a UUID without checking if it's empty. If request_id is an empty string or not set, uuid.UUID() will raise a ValueError. Consider either validating the request_id and providing a clear error message, or generating a default UUID if the field is not provided.

Suggested change
return SnubaRequest(
    id=uuid.UUID(in_msg.meta.request_id),
raw_request_id = in_msg.meta.request_id
if not raw_request_id:
    request_uuid = uuid.uuid4()
else:
    try:
        request_uuid = uuid.UUID(raw_request_id)
    except ValueError:
        raise BadSnubaRPCRequestException(f"Invalid request_id: {raw_request_id!r}")
return SnubaRequest(
    id=request_uuid,

Comment on lines 327 to 328
client_sample_rate = float(1.0 / row.pop("sampling_weight", 1.0))
server_sample_rate = float(1.0 / row.pop("sampling_weight", 1.0))
Copilot AI Dec 29, 2025

The sampling_weight key is being popped twice from the row on consecutive lines. Line 327 pops it to calculate client_sample_rate, and then line 328 tries to pop the same key again for server_sample_rate. This will cause the second pop to always use the default value of 1.0, making server_sample_rate always equal to 1.0, which is incorrect. The calculation should likely use different keys or the same value should be reused.

Suggested change
client_sample_rate = float(1.0 / row.pop("sampling_weight", 1.0))
server_sample_rate = float(1.0 / row.pop("sampling_weight", 1.0))
sampling_weight = float(row.pop("sampling_weight", 1.0))
client_sample_rate = float(1.0 / sampling_weight)
server_sample_rate = float(1.0 / sampling_weight)

Comment on lines +122 to +124
assert response.page_token.end_pagination == False
else:
assert response.page_token.end_pagination == True
Copilot AI Dec 29, 2025

Use is False instead of == False for boolean comparisons. This is a Python best practice that avoids potential issues with truthy/falsy values and is more explicit about comparing with the boolean singleton.

Suggested change
assert response.page_token.end_pagination == False
else:
assert response.page_token.end_pagination == True
assert response.page_token.end_pagination is False
else:
assert response.page_token.end_pagination is True

Comment on lines +122 to +124
assert response.page_token.end_pagination == False
else:
assert response.page_token.end_pagination == True
Copilot AI Dec 29, 2025

Use is True instead of == True for boolean comparisons. This is a Python best practice that avoids potential issues with truthy/falsy values and is more explicit about comparing with the boolean singleton.

Suggested change
assert response.page_token.end_pagination == False
else:
assert response.page_token.end_pagination == True
assert response.page_token.end_pagination is False
else:
assert response.page_token.end_pagination is True

return float(v)
if t in {"String", "Bool"}:
return v
raise BadSnubaRPCRequestException(f"array value type unknown: {type(v)}")
Copilot AI Dec 29, 2025

In the exception message, type(v) will be undefined if the dictionary value is empty, as v is only defined within the loop scope. Consider using type(value) instead to show the type of the entire value dictionary, which would be more meaningful for debugging.

Suggested change
raise BadSnubaRPCRequestException(f"array value type unknown: {type(v)}")
raise BadSnubaRPCRequestException(f"array value type unknown: {type(value)}")

filters[2].comparison_filter.key.name == "last_seen_trace_id"
and filters[2].comparison_filter.key.type == AttributeKey.Type.TYPE_STRING
):
raise ValueError("Invalid item type")
Copilot AI Dec 29, 2025

The error message says "Invalid item type" but this validation is checking the trace_id field. The error message should be updated to "Invalid trace_id" or "Invalid page token" to accurately reflect what is being validated.

Suggested change
raise ValueError("Invalid item type")
raise ValueError("Invalid trace_id")

ts = row.pop("timestamp")
client_sample_rate = float(1.0 / row.pop("sampling_weight", 1.0))
server_sample_rate = float(1.0 / row.pop("sampling_weight", 1.0))
sampling_factor = row.pop("sampling_factor", 1.0) # noqa: F841
Copilot AI Dec 29, 2025

Variable sampling_factor is not used.

Suggested change
sampling_factor = row.pop("sampling_factor", 1.0) # noqa: F841
row.pop("sampling_factor", None)

integers = row.pop("attributes_int", {}) or {}
floats = row.pop("attributes_float", {}) or {}

breakpoint()

This comment was marked as outdated.

Comment on lines +327 to +328
client_sample_rate = float(1.0 / row.pop("client_sample_rate", 1.0))
server_sample_rate = float(1.0 / row.pop("server_sample_rate", 1.0))

This comment was marked as outdated.
