feat(eap): gdpr export endpoint #7586
base: master
Conversation
| Tin = TypeVar("Tin", bound=ProtobufMessage)
| Tout = TypeVar("Tout", bound=ProtobufMessage)
|
| BUCKET_COUNT = 40
these are transferred from the get trace endpoint
factor it out and call it, don't repeat yourself
they are factored out into common.py and called in endpoint_export_trace_items.py and endpoint_get_trace.py
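For reference, the call sites would then just import the shared helpers, roughly like this (module path taken from the file list below; the exact import list is an assumption):

```python
# Sketch of the shared imports used by endpoint_export_trace_items.py and
# endpoint_get_trace.py; names come from the PR summary, exact form assumed.
from snuba.web.rpc.common.common import BUCKET_COUNT, process_arrays, transform_array_value
```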
| _DEFAULT_PAGE_SIZE = 10_000
|
| def _build_query(in_msg: ExportTraceItemsRequest, limit: int) -> Query:
I left out retention days because it's an internal field, but I can add it
| item_id = row.pop("hex_item_id")
| item_type = row.pop("item_type")
| ts = row.pop("timestamp")
| client_sample_rate = float(1.0 / row.pop("sampling_weight", 1.0))
I'm unsure how sampling_weight and sampling_factor translate into client_sample_rate and server_sample_rate.
This is a column on the table, there's no need to calculate it.
The TraceItem protobuf has two fields, client_sample_rate and server_sample_rate: https://github.com/getsentry/sentry-protos/blob/main/proto/sentry_protos/snuba/v1/trace_item.proto#L43-L44
The CH table has two columns, sampling_weight and sampling_factor: https://github.com/getsentry/snuba/blob/master/snuba/datasets/configuration/events_analytics_platform/entities/eap_items.yaml#L13-L14
Which one goes into which?
client_sample_rate and server_sample_rate are fields on the tables; they need to be exported as is. sampling_weight is not used anywhere anymore; only sampling_factor is used.
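In other words, something like the following minimal sketch, assuming the query selects the two columns under exactly these names and that 1.0 is an acceptable default for missing values:

```python
# Sketch: export the sample-rate columns as-is instead of deriving them from
# sampling_weight (column names and defaults are assumptions).
client_sample_rate = float(row.pop("client_sample_rate", 1.0))
server_sample_rate = float(row.pop("server_sample_rate", 1.0))
# sampling_weight is no longer used; drop sampling_factor too if it isn't exported.
row.pop("sampling_weight", None)
row.pop("sampling_factor", None)
```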
| literal(page_token.last_seen_timestamp),
| ),
| f.greater(
|     f.reinterpretAsUInt128(f.reverse(f.unhex(column("item_id")))),
took this from get trace pagination
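For context, the get-trace-style keyset condition looks roughly like this; the `and_cond`/`or_cond` helpers, the import path, and the `last_seen_item_id` field are assumptions based on the snippet above, not the exact code:

```python
# Assumed Snuba query DSL helpers (names follow the quoted hunk).
from snuba.query.dsl import Functions as f, and_cond, column, literal, or_cond

# Sketch of the keyset-pagination predicate:
# (timestamp > last_seen) OR (timestamp == last_seen AND item_id > last_seen_item_id)
pagination_filter = or_cond(
    f.greater(column("timestamp"), literal(page_token.last_seen_timestamp)),
    and_cond(
        f.equals(column("timestamp"), literal(page_token.last_seen_timestamp)),
        f.greater(
            # item_id is stored as a hex string; reverse + unhex + reinterpret
            # yields the UInt128 ordering ClickHouse sorts by.
            f.reinterpretAsUInt128(f.reverse(f.unhex(column("item_id")))),
            f.reinterpretAsUInt128(f.reverse(f.unhex(literal(page_token.last_seen_item_id)))),
        ),
    ),
)
```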
| rows = results.result.get("data", [])
| processed_results = _convert_rows(rows)
|
| limit -= len(processed_results.items)
If limit <= 0, that means there are >= 10,000 rows in the requested timeframe and we're returning the first 10,000, so we should update the page token to the latest timestamp / trace item id.
Otherwise (limit > 0), there are < 10,000 rows in the requested timeframe and we're returning all of them in one go, so there's no need to paginate through the rest.
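As a sketch of that decision (the field names on the token and on the items are illustrative, not the exact implementation):

```python
# Decide whether another page is needed after converting this batch of rows.
limit -= len(processed_results.items)

if limit <= 0 and processed_results.items:
    # The timeframe holds at least a full page: point the token at the last
    # returned row so the next request resumes right after it.
    last = processed_results.items[-1]
    page_token.last_seen_timestamp = last.timestamp  # illustrative field names
    page_token.last_seen_item_id = last.item_id
    page_token.end_pagination = False
else:
    # Fewer rows than the page size: everything was returned in one go.
    page_token.end_pagination = True
```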
| item_id = row.pop("hex_item_id")
| item_type = row.pop("item_type")
| ts = row.pop("timestamp")
| client_sample_rate = float(1.0 / row.pop("sampling_weight", 1.0))
This is a column on the table, there's no need to calculate it.
| item_type = row.pop("item_type")
| ts = row.pop("timestamp")
| client_sample_rate = float(1.0 / row.pop("sampling_weight", 1.0))
| server_sample_rate = float(1.0 / row.pop("sampling_weight", 1.0))
Same as client_sample_rate.
Pull request overview
This PR implements a GDPR export endpoint for EAP (Events Analytics Platform) and fixes an issue where the query pipeline was transforming columns in the ORDER BY clause, which was defeating ClickHouse optimizations and making pagination ineffective.
Key Changes:
- Added new EndpointExportTraceItems RPC endpoint with pagination support for exporting trace items
- Introduced skip_transform_order_by flag in query settings to preserve ORDER BY clause column names for optimal ClickHouse performance
- Refactored common array processing functions (process_arrays, transform_array_value) and the BUCKET_COUNT constant to shared utilities
Reviewed changes
Copilot reviewed 11 out of 12 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| uv.lock | Updated sentry-protos dependency from 0.4.8 to 0.4.9 to support new export endpoint protobuf definitions |
| pyproject.toml | Updated sentry-protos version requirement to match lock file |
| tests/web/rpc/v1/test_utils.py | Added BASE_TIME constant definition and timezone import for test consistency |
| tests/web/rpc/v1/test_endpoint_export_trace_items.py | Added comprehensive tests for the new export endpoint including pagination and order by transformation validation |
| snuba/web/rpc/v1/endpoint_export_trace_items.py | Implemented new GDPR export endpoint with cursor-based pagination and proper attribute handling |
| snuba/web/rpc/common/common.py | Extracted shared array processing utilities and BUCKET_COUNT constant for reuse across endpoints |
| snuba/web/rpc/v1/endpoint_trace_item_details.py | Updated to use shared BUCKET_COUNT constant from common module |
| snuba/web/rpc/v1/endpoint_get_trace.py | Refactored to use shared process_arrays function from common module |
| snuba/query/query_settings.py | Added skip_transform_order_by setting to control ORDER BY clause transformation |
| snuba/query/__init__.py | Added skip_transform_order_by parameter to transform_expressions method |
| snuba/query/processors/physical/type_converters.py | Updated to respect skip_transform_order_by setting during query processing |
| snuba/pipeline/stages/query_processing.py | Modified storage processing stage to pass skip_transform_order_by setting through transformation pipeline |
| from typing import Optional
|
| from sentry_protos.snuba.v1.trace_item_attribute_pb2 import AttributeKey, AttributeValue
| from sentry_protos.snuba.v1.trace_item_filter_pb2 import (
|     AndFilter,
|     ComparisonFilter,
|     TraceItemFilter,
| )
Copilot AI (Dec 29, 2025)
These imports should be moved to the top of the file with the other imports, rather than appearing after the constant definition. Import statements should be grouped at the beginning of the file following PEP 8 guidelines.
| # current UTC time rounded down to the start of the current hour, then minus 180 minutes.
| BASE_TIME = datetime.now(tz=timezone.utc).replace(
|     minute=0,
|     second=0,
|     microsecond=0,
| ) - timedelta(minutes=180)
Copilot AI (Dec 29, 2025)
The constant BASE_TIME is defined twice in this file: once on line 22 using datetime.now(tz=UTC) and again on lines 85-89 using datetime.now(tz=timezone.utc). The second definition shadows the first. This duplicate definition should be removed. Note that UTC and timezone.utc are equivalent but the module uses both imports inconsistently.
Suggested change (remove the duplicate definition):
- # current UTC time rounded down to the start of the current hour, then minus 180 minutes.
- BASE_TIME = datetime.now(tz=timezone.utc).replace(
-     minute=0,
-     second=0,
-     microsecond=0,
- ) - timedelta(minutes=180)
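A minimal sketch of keeping a single, consistent definition (datetime.UTC is available on Python 3.11+; timezone.utc is the older spelling of the same object):

```python
from datetime import UTC, datetime, timedelta

# Current UTC time rounded down to the start of the hour, minus 180 minutes;
# defined once so the module uses a single UTC import.
BASE_TIME = datetime.now(tz=UTC).replace(
    minute=0, second=0, microsecond=0
) - timedelta(minutes=180)
```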
| return SnubaRequest(
|     id=uuid.UUID(in_msg.meta.request_id),
Copilot AI (Dec 29, 2025)
The request_id field is not set in the test_with_pagination test case (lines 105-115), but line 266 in the implementation attempts to convert it to a UUID without checking if it's empty. If request_id is an empty string or not set, uuid.UUID() will raise a ValueError. Consider either validating the request_id and providing a clear error message, or generating a default UUID if the field is not provided.
Suggested change:
- return SnubaRequest(
-     id=uuid.UUID(in_msg.meta.request_id),
+ raw_request_id = in_msg.meta.request_id
+ if not raw_request_id:
+     request_uuid = uuid.uuid4()
+ else:
+     try:
+         request_uuid = uuid.UUID(raw_request_id)
+     except ValueError:
+         raise BadSnubaRPCRequestException(f"Invalid request_id: {raw_request_id!r}")
+ return SnubaRequest(
+     id=request_uuid,
| client_sample_rate = float(1.0 / row.pop("sampling_weight", 1.0))
| server_sample_rate = float(1.0 / row.pop("sampling_weight", 1.0))
Copilot AI (Dec 29, 2025)
The sampling_weight key is being popped twice from the row on consecutive lines. Line 327 pops it to calculate client_sample_rate, and then line 328 tries to pop the same key again for server_sample_rate. This will cause the second pop to always use the default value of 1.0, making server_sample_rate always equal to 1.0, which is incorrect. The calculation should likely use different keys or the same value should be reused.
Suggested change:
- client_sample_rate = float(1.0 / row.pop("sampling_weight", 1.0))
- server_sample_rate = float(1.0 / row.pop("sampling_weight", 1.0))
+ sampling_weight = float(row.pop("sampling_weight", 1.0))
+ client_sample_rate = float(1.0 / sampling_weight)
+ server_sample_rate = float(1.0 / sampling_weight)
| assert response.page_token.end_pagination == False
| else:
|     assert response.page_token.end_pagination == True
Copilot AI (Dec 29, 2025)
Use is False instead of == False for boolean comparisons. This is a Python best practice that avoids potential issues with truthy/falsy values and is more explicit about comparing with the boolean singleton.
Suggested change:
- assert response.page_token.end_pagination == False
- else:
-     assert response.page_token.end_pagination == True
+ assert response.page_token.end_pagination is False
+ else:
+     assert response.page_token.end_pagination is True
| assert response.page_token.end_pagination == False
| else:
|     assert response.page_token.end_pagination == True
Copilot AI (Dec 29, 2025)
Use is True instead of == True for boolean comparisons. This is a Python best practice that avoids potential issues with truthy/falsy values and is more explicit about comparing with the boolean singleton.
Suggested change:
- assert response.page_token.end_pagination == False
- else:
-     assert response.page_token.end_pagination == True
+ assert response.page_token.end_pagination is False
+ else:
+     assert response.page_token.end_pagination is True
| return float(v)
| if t in {"String", "Bool"}:
|     return v
| raise BadSnubaRPCRequestException(f"array value type unknown: {type(v)}")
Copilot AI (Dec 29, 2025)
In the exception message, type(v) will be undefined if the dictionary value is empty, as v is only defined within the loop scope. Consider using type(value) instead to show the type of the entire value dictionary, which would be more meaningful for debugging.
Suggested change:
- raise BadSnubaRPCRequestException(f"array value type unknown: {type(v)}")
+ raise BadSnubaRPCRequestException(f"array value type unknown: {type(value)}")
| filters[2].comparison_filter.key.name == "last_seen_trace_id"
| and filters[2].comparison_filter.key.type == AttributeKey.Type.TYPE_STRING
| ):
|     raise ValueError("Invalid item type")
Copilot AI (Dec 29, 2025)
The error message says "Invalid item type" but this validation is checking the trace_id field. The error message should be updated to "Invalid trace_id" or "Invalid page token" to accurately reflect what is being validated.
Suggested change:
- raise ValueError("Invalid item type")
+ raise ValueError("Invalid trace_id")
| ts = row.pop("timestamp")
| client_sample_rate = float(1.0 / row.pop("sampling_weight", 1.0))
| server_sample_rate = float(1.0 / row.pop("sampling_weight", 1.0))
| sampling_factor = row.pop("sampling_factor", 1.0)  # noqa: F841
Copilot AI (Dec 29, 2025)
Variable sampling_factor is not used.
Suggested change:
- sampling_factor = row.pop("sampling_factor", 1.0)  # noqa: F841
+ row.pop("sampling_factor", None)
| integers = row.pop("attributes_int", {}) or {}
| floats = row.pop("attributes_float", {}) or {}
|
| breakpoint()
This comment was marked as outdated.
| client_sample_rate = float(1.0 / row.pop("client_sample_rate", 1.0))
| server_sample_rate = float(1.0 / row.pop("server_sample_rate", 1.0))
This comment was marked as outdated.
https://linear.app/getsentry/issue/EAP-320/data-export-endpoint
While working on this PR, we also uncovered that the query pipeline transforms the columns in the ORDER BY clause. This is unintended, for EAP at least, because it defeats the ClickHouse optimizations on the sort key and makes pagination ineffective. My fix was to pass in a flag that tells the query pipeline whether or not to transform the ORDER BY clause. This is the safest, simplest, and fastest solution that is also clean (the query pipeline is used by other Snuba code that I'm unfamiliar with).
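For illustration, here is a minimal, self-contained sketch of the idea behind the flag; the class and function names below are stand-ins, not Snuba's actual query-pipeline API:

```python
from dataclasses import dataclass, replace
from typing import Callable, List


@dataclass(frozen=True)
class OrderBySketch:
    direction: str
    expression: str  # stand-in for a real expression node


@dataclass
class QuerySketch:
    order_by: List[OrderBySketch]


def transform_order_by(
    query: QuerySketch,
    transform: Callable[[str], str],
    skip_transform_order_by: bool,
) -> None:
    # When the new setting is on, leave ORDER BY expressions untouched so
    # ClickHouse can still match them against the table's sort key and the
    # keyset pagination stays stable.
    if skip_transform_order_by:
        return
    query.order_by = [
        replace(ob, expression=transform(ob.expression)) for ob in query.order_by
    ]
```

In the PR itself this is carried as the skip_transform_order_by setting in query settings and checked by the physical query processors and the storage processing stage listed in the file summary above.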