feat(data-collection): foundation — option, resolution, accessors#6676

Draft
ericapisani wants to merge 1 commit into master from ep/db-spec-experiment-foundation

Conversation

@ericapisani

Member

Introduces the data_collection init option, A/B/C/D resolution precedence, and the should_collect_* accessors plus the scrubbing primitives (with unit tests).

Behavior-neutral: in legacy mode the resolved DataCollection mirrors send_default_pii, and nothing consumes the new filtering yet — so merging this changes no observed behavior.

Base of the data_collection split. Consumers stack on top:

  • frames (stack-frame variables + source context lines)
  • http (request data collection)
  • genai (AI input/output collection)

@github-actions

Contributor

Codecov Results 📊

90796 passed | ⏭️ 6305 skipped | Total: 97101 | Pass Rate: 93.51% | Execution Time: 313m 25s

📊 Comparison with Base Branch

| Metric | Change |
| --- | --- |
| Total Tests | 📈 +645 |
| Passed Tests | 📈 +645 |
| Failed Tests | |
| Skipped Tests | |

All tests are passing successfully.

✅ Patch coverage is 88.50%. Project has 2419 uncovered lines.
✅ Project coverage is 89.94%. Comparing base (base) to head (head).

Files with missing lines (3)
| File | Patch % | Lines |
| --- | --- | --- |
| sentry_sdk/data_collection.py | 90.37% | ⚠️ 18 Missing and 14 partials |
| sentry_sdk/client.py | 86.67% | ⚠️ 4 Missing and 1 partials |
| sentry_sdk/scope.py | 50.00% | ⚠️ 4 Missing |
Coverage diff
@@            Coverage Diff             @@
##          main       #PR       +/-##
==========================================
+ Coverage    89.92%    89.94%    +0.02%
==========================================
  Files          192       193        +1
  Lines        23809     24035      +226
  Branches      8218      8314       +96
==========================================
+ Hits         21409     21616      +207
- Misses        2400      2419       +19
- Partials      1339      1354       +15

Generated by Codecov Action

Comment thread CHANGELOG.md

Member Author

This would need to be removed

#: Safe default used by non-recording clients: collect nothing PII-gated.
#: This is a shared, process-wide singleton. Treat it as read-only — do not
#: mutate the returned ``DataCollection`` or its nested config objects.
OFF_DATA_COLLECTION = _map_from_send_default_pii(False, True, True)

@ericapisani ericapisani Jun 29, 2026

Member Author

Rename OFF_DATA_COLLECTION to something clearer.

Use named args, not positional ones; the call is hard to understand otherwise.
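
As a rough sketch of the named-args suggestion: a keyword-only signature forces every call site to spell out what each flag means. The function name `map_from_send_default_pii`, its parameter names, the meaning assigned to each of the three flags, and the returned dict are all assumptions for illustration; the real `_map_from_send_default_pii` signature is not shown in this thread.

```python
from typing import Dict


def map_from_send_default_pii(
    *,  # everything after this marker must be passed by keyword
    send_default_pii: bool,
    include_local_variables: bool,
    include_source_context: bool,
) -> Dict[str, bool]:
    """Hypothetical stand-in for the PR's helper; returns a plain dict."""
    return {
        "user_info": send_default_pii,
        "frame_variables": include_local_variables,
        "frame_context_lines": include_source_context,
    }


# The call site now documents itself, unlike (False, True, True).
off_settings = map_from_send_default_pii(
    send_default_pii=False,
    include_local_variables=True,
    include_source_context=True,
)
```

With the bare `*` in the signature, a positional call such as `map_from_send_default_pii(False, True, True)` raises `TypeError`, so the unclear form cannot creep back in.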

Comment on lines +71 to +74
COLLECTION_OFF = "off"
COLLECTION_DENYLIST = "denyList"
COLLECTION_ALLOWLIST = "allowList"
_VALID_MODES = (COLLECTION_OFF, COLLECTION_DENYLIST, COLLECTION_ALLOWLIST)

Member Author

Improve the names of these variables and put the possible collection modes into an enum.
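
A minimal sketch of what the enum could look like, assuming the three string values quoted above stay as the wire format. The names `CollectionMode` and `parse_collection_mode` are hypothetical, not part of the PR.

```python
from enum import Enum


class CollectionMode(str, Enum):
    """Hypothetical enum replacing the COLLECTION_* constants."""

    OFF = "off"
    DENY_LIST = "denyList"
    ALLOW_LIST = "allowList"


def parse_collection_mode(value: str) -> CollectionMode:
    """Convert a user-supplied string, rejecting unknown modes."""
    try:
        return CollectionMode(value)
    except ValueError:
        valid = ", ".join(mode.value for mode in CollectionMode)
        raise ValueError(
            f"invalid collection mode {value!r}; expected one of: {valid}"
        ) from None
```

Because the enum mixes in `str`, members still compare equal to the raw strings (`CollectionMode.OFF == "off"`), and iterating over `CollectionMode` would make a separate `_VALID_MODES` tuple unnecessary.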

Comment on lines +53 to +65
BODY_TYPE_INCOMING_REQUEST = "incomingRequest"
BODY_TYPE_OUTGOING_REQUEST = "outgoingRequest"
BODY_TYPE_INCOMING_RESPONSE = "incomingResponse"
BODY_TYPE_OUTGOING_RESPONSE = "outgoingResponse"

#: All valid body types. ``http_bodies`` defaults to this (collect everything the
#: platform supports); an empty list is the explicit opt-out.
ALL_BODY_TYPES = [
    BODY_TYPE_INCOMING_REQUEST,
    BODY_TYPE_OUTGOING_REQUEST,
    BODY_TYPE_INCOMING_RESPONSE,
    BODY_TYPE_OUTGOING_RESPONSE,
]

Member Author

Clean this up as well:

  • ALL_BODY_TYPES: explore whether this should also be a tuple/set (it should be consistent with _VALID_MODES below unless there's a good reason for it not to be).
  • BODY_TYPE_* could potentially become an enum. If it's not being used outside of this file, it should be prefixed with an underscore.
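
Both points could be addressed together, roughly like this. The enum name `HttpBodyType` is hypothetical; the string values are the ones from the diff above.

```python
from enum import Enum
from typing import Tuple


class HttpBodyType(str, Enum):
    """Hypothetical enum replacing the BODY_TYPE_* constants."""

    INCOMING_REQUEST = "incomingRequest"
    OUTGOING_REQUEST = "outgoingRequest"
    INCOMING_RESPONSE = "incomingResponse"
    OUTGOING_RESPONSE = "outgoingResponse"


# Iterating an Enum class yields its members in definition order, so the
# "all types" constant no longer has to be maintained by hand.
ALL_BODY_TYPES: Tuple[HttpBodyType, ...] = tuple(HttpBodyType)
```

A tuple is immutable, so callers could share `ALL_BODY_TYPES` directly without a defensive `list(...)` copy.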

#: Canonical sensitive denylist from the spec. Values of keys that contain any of
#: these terms (partial, case-insensitive) are always replaced with
#: ``"[Filtered]"`` regardless of the configured collection mode.
SENSITIVE_DENYLIST = [

Member Author

See if this can be consolidated with the existing DEFAULT_DENYLIST in scrubber.py


#: Additional GDPR-sensitive terms users may opt into via custom deny terms.
#: Not applied automatically; documented here for convenience.
EXTENDED_GDPR_DENYLIST = ["forwarded", "-ip", "remote-", "via", "-user"]

Member Author

I can see this was added because of this section in the spec.

We don't currently have an extended GDPR denylist, but it may be worth exposing one to users for convenience.

Comment on lines +520 to +521
if isinstance(val, str):
    return KeyValueCollectionBehavior(mode=val)

Member Author

Reviewing the spec - I don't think this is a valid option. It looks like if it's provided, it needs at least a dict with a mode (see here)

) -> "DataCollection":
"""
Fill in any omitted fields of a user-supplied ``DataCollection`` with their
spec defaults (resolution case A). Frame fields fall back to the legacy

Member Author

No one has any context on what "resolution case X" is. Remove this and the other references to "resolution case" within this PR.

gen_ai=user_dc.gen_ai,
# http_bodies: None means "all valid types"; materialize for clarity.
http_bodies=(
list(user_dc.http_bodies)

@ericapisani ericapisani Jun 29, 2026

Member Author

I think this list() call may be redundant, as http_bodies should already be coming in as a list of strings when provided.

Same goes for list(ALL_BODY_TYPES) a couple of lines below

Comment thread sentry_sdk/client.py
Returns whether the SDK should automatically populate ``user.*`` fields
(id, email, username, ip_address) from instrumentation.
"""
return bool(self.data_collection.user_info)

@ericapisani ericapisani Jun 29, 2026

Member Author

user_info is initialized to a boolean value, so we shouldn't need to cast here.

Edit: should this, like _should_collect_gen_ai_content, also fall back to should_send_default_pii if self.data_collection.explicit is not True (i.e. fall back to legacy behaviour)?

Comment thread sentry_sdk/client.py
return self._should_collect_gen_ai_content("outputs", include_prompts)

def _should_collect_gen_ai_content(
self, direction: str, include_prompts: "Optional[bool]"

Member Author

We should try to lock down the possible values of direction to either "inputs" or "outputs"; this type doesn't need to be so broad.
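
One possible way to narrow the type is `typing.Literal`, sketched below. `GenAIDirection`, the standalone function, and the `False` defaults on `GenAICollection` are assumptions for illustration, and the legacy fallback from the real method is left out.

```python
from dataclasses import dataclass
from typing import Literal, Optional

# Hypothetical alias: a type checker rejects any other string.
GenAIDirection = Literal["inputs", "outputs"]


@dataclass(frozen=True)
class GenAICollection:
    """Simplified stand-in; the field defaults here are assumed."""

    inputs: bool = False
    outputs: bool = False


def should_collect_gen_ai_content(
    gen_ai: GenAICollection,
    direction: GenAIDirection,
    include_prompts: Optional[bool],
) -> bool:
    # Integration-level override wins over the global gen_ai setting.
    if include_prompts is not None:
        return include_prompts
    # Both fields are already booleans, so no bool() cast is needed.
    return getattr(gen_ai, direction)
```

`Literal` is only enforced by static checkers such as mypy; at runtime an unexpected string would still surface as an `AttributeError` from `getattr`.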

Comment thread sentry_sdk/client.py
# Integration-level override wins over the global gen_ai setting.
if include_prompts is not None:
    return include_prompts
return bool(getattr(dc.gen_ai, direction))

Member Author

We also shouldn't need to cast to bool here, the values within the GenAICollection are boolean

Comment thread sentry_sdk/client.py
Comment on lines +435 to +449
def data_collection(self) -> "DataCollection":
    return OFF_DATA_COLLECTION

def should_collect_user_info(self) -> bool:
    return False

def should_collect_gen_ai_inputs(
    self, include_prompts: "Optional[bool]" = None
) -> bool:
    return False

def should_collect_gen_ai_outputs(
    self, include_prompts: "Optional[bool]" = None
) -> bool:
    return False

@ericapisani ericapisani Jun 29, 2026

Member Author

To reduce the API surface on the client, I think we should only expose the data_collection and put the should_collect_X methods on the data_collection class itself.

Edit: alternatively, we could introduce helper methods similar to what we do for streamed spans (has_span_streaming_enabled)
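
A sketch of the first option, with the accessors living on the config object so the client only exposes `data_collection`. These are simplified stand-ins, not the PR's classes: only `user_info` and `gen_ai` are modelled, the defaults are assumed, and the legacy `send_default_pii` fallback is omitted.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class GenAICollection:
    inputs: bool = False
    outputs: bool = False


@dataclass(frozen=True)
class DataCollection:
    user_info: bool = False
    gen_ai: GenAICollection = field(default_factory=GenAICollection)

    def should_collect_user_info(self) -> bool:
        return self.user_info

    def should_collect_gen_ai_inputs(
        self, include_prompts: Optional[bool] = None
    ) -> bool:
        # An integration-level override, when given, wins.
        if include_prompts is not None:
            return include_prompts
        return self.gen_ai.inputs

    def should_collect_gen_ai_outputs(
        self, include_prompts: Optional[bool] = None
    ) -> bool:
        if include_prompts is not None:
            return include_prompts
        return self.gen_ai.outputs
```

Under this layout a caller would write something like `client.data_collection.should_collect_user_info()`, and new configuration types would add methods to `DataCollection` rather than to the client and scope modules.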

Comment thread sentry_sdk/client.py
Comment on lines +647 to +649
True,
self.options["include_local_variables"] is not False,
self.options["include_source_context"] is not False,

Member Author

As I mentioned in another spot, this is not easy to follow. These should be passed in as named args, not positional args.

Comment thread sentry_sdk/scope.py
Comment on lines +2177 to +2189
def should_collect_user_info() -> bool:
    """Shortcut for `Scope.get_client().should_collect_user_info()`."""
    return Scope.get_client().should_collect_user_info()


def should_collect_gen_ai_inputs(include_prompts: "Optional[bool]" = None) -> bool:
    """Shortcut for `Scope.get_client().should_collect_gen_ai_inputs(...)`."""
    return Scope.get_client().should_collect_gen_ai_inputs(include_prompts)


def should_collect_gen_ai_outputs(include_prompts: "Optional[bool]" = None) -> bool:
    """Shortcut for `Scope.get_client().should_collect_gen_ai_outputs(...)`."""
    return Scope.get_client().should_collect_gen_ai_outputs(include_prompts)

Member Author

This mirrors what happens for should_send_default_pii but, as mentioned in another comment, I think this API surface should be reduced. If we do this for every type of configuration, it will become a long list (database queries, GQL documents/variables, etc.).

return False


def apply_key_value_collection(

Member Author

This name could be better

http_headers=HttpHeadersCollection(),
# Bodies are collected regardless of PII today, bounded by
# ``max_request_body_size``.
http_bodies=list(ALL_BODY_TYPES),

Member Author

ALL_BODY_TYPES is already a list; it doesn't need to be cast to a list again.
