feat(data-collection): foundation — option, resolution, accessors#6676
feat(data-collection): foundation — option, resolution, accessors#6676ericapisani wants to merge 1 commit into
Conversation
Codecov Results 📊✅ 90796 passed | ⏭️ 6305 skipped | Total: 97101 | Pass Rate: 93.51% | Execution Time: 313m 25s 📊 Comparison with Base Branch
All tests are passing successfully. ✅ Patch coverage is 88.50%. Project has 2419 uncovered lines. Files with missing lines (3)
Coverage diff@@ Coverage Diff @@
## main #PR +/-##
==========================================
+ Coverage 89.92% 89.94% +0.02%
==========================================
Files 192 193 +1
Lines 23809 24035 +226
Branches 8218 8314 +96
==========================================
+ Hits 21409 21616 +207
- Misses 2400 2419 +19
- Partials 1339 1354 +15Generated by Codecov Action |
There was a problem hiding this comment.
This would need to be removed
| #: Safe default used by non-recording clients: collect nothing PII-gated. | ||
| #: This is a shared, process-wide singleton. Treat it as read-only — do not | ||
| #: mutate the returned ``DataCollection`` or its nested config objects. | ||
| OFF_DATA_COLLECTION = _map_from_send_default_pii(False, True, True) |
There was a problem hiding this comment.
Change OFF_DATA_COLLECTION name to something better - this could be clearer.
Use named args, not positional - this is hard to understand otherwise
| COLLECTION_OFF = "off" | ||
| COLLECTION_DENYLIST = "denyList" | ||
| COLLECTION_ALLOWLIST = "allowList" | ||
| _VALID_MODES = (COLLECTION_OFF, COLLECTION_DENYLIST, COLLECTION_ALLOWLIST) |
There was a problem hiding this comment.
Improve the names of these variables, put the possible collection modes into an enum
| BODY_TYPE_INCOMING_REQUEST = "incomingRequest" | ||
| BODY_TYPE_OUTGOING_REQUEST = "outgoingRequest" | ||
| BODY_TYPE_INCOMING_RESPONSE = "incomingResponse" | ||
| BODY_TYPE_OUTGOING_RESPONSE = "outgoingResponse" | ||
|
|
||
| #: All valid body types. ``http_bodies`` defaults to this (collect everything the | ||
| #: platform supports); an empty list is the explicit opt-out. | ||
| ALL_BODY_TYPES = [ | ||
| BODY_TYPE_INCOMING_REQUEST, | ||
| BODY_TYPE_OUTGOING_REQUEST, | ||
| BODY_TYPE_INCOMING_RESPONSE, | ||
| BODY_TYPE_OUTGOING_RESPONSE, | ||
| ] |
There was a problem hiding this comment.
Clean this up as well:
ALL_BODY_TYPESexplore whether this should also be a tuple/set (should be consistent with_VALID_MODESbelow unless there's a good reason for this not to beBODY_TYPE_*could potentially become an enum. If it's not being used outside of this file, should prefix with an underscore
| #: Canonical sensitive denylist from the spec. Values of keys that contain any of | ||
| #: these terms (partial, case-insensitive) are always replaced with | ||
| #: ``"[Filtered]"`` regardless of the configured collection mode. | ||
| SENSITIVE_DENYLIST = [ |
There was a problem hiding this comment.
See if this can be consolidated with the existing DEFAULT_DENYLIST in scrubber/py
|
|
||
| #: Additional GDPR-sensitive terms users may opt into via custom deny terms. | ||
| #: Not applied automatically; documented here for convenience. | ||
| EXTENDED_GDPR_DENYLIST = ["forwarded", "-ip", "remote-", "via", "-user"] |
There was a problem hiding this comment.
I can see this was added because of this section in the spec.
We don't currently have an extended GDPR denylist - maybe something worth considering exposing to users for convenience
| if isinstance(val, str): | ||
| return KeyValueCollectionBehavior(mode=val) |
There was a problem hiding this comment.
Reviewing the spec - I don't think this is a valid option. It looks like if it's provided, it needs at least a dict with a mode (see here)
| ) -> "DataCollection": | ||
| """ | ||
| Fill in any omitted fields of a user-supplied ``DataCollection`` with their | ||
| spec defaults (resolution case A). Frame fields fall back to the legacy |
There was a problem hiding this comment.
no one has any context on what "resolution case X" is. remove this, and the other references of 'resolution case' within this PR
| gen_ai=user_dc.gen_ai, | ||
| # http_bodies: None means "all valid types"; materialize for clarity. | ||
| http_bodies=( | ||
| list(user_dc.http_bodies) |
There was a problem hiding this comment.
I think this list may be redundant as this should be coming in as list of strings when provided.
Same goes for list(ALL_BODY_TYPES) a couple of lines below
| Returns whether the SDK should automatically populate ``user.*`` fields | ||
| (id, email, username, ip_address) from instrumentation. | ||
| """ | ||
| return bool(self.data_collection.user_info) |
There was a problem hiding this comment.
user_info it initialized to a boolean value, so we shouldn't need to cast here.
Edit: should this, similar to _should_collect_gen_ai_content, also fall back to should_send_default_pii if self.data_collection.explicit is not True (i.e. fallback to legacy behaviour)
| return self._should_collect_gen_ai_content("outputs", include_prompts) | ||
|
|
||
| def _should_collect_gen_ai_content( | ||
| self, direction: str, include_prompts: "Optional[bool]" |
There was a problem hiding this comment.
we should try to lock down the possible values of direction to be one of inputs or output - it's not necessary for us to have this type be so broad
| # Integration-level override wins over the global gen_ai setting. | ||
| if include_prompts is not None: | ||
| return include_prompts | ||
| return bool(getattr(dc.gen_ai, direction)) |
There was a problem hiding this comment.
We also shouldn't need to cast to bool here, the values within the GenAICollection are boolean
| def data_collection(self) -> "DataCollection": | ||
| return OFF_DATA_COLLECTION | ||
|
|
||
| def should_collect_user_info(self) -> bool: | ||
| return False | ||
|
|
||
| def should_collect_gen_ai_inputs( | ||
| self, include_prompts: "Optional[bool]" = None | ||
| ) -> bool: | ||
| return False | ||
|
|
||
| def should_collect_gen_ai_outputs( | ||
| self, include_prompts: "Optional[bool]" = None | ||
| ) -> bool: | ||
| return False |
There was a problem hiding this comment.
To reduce the API surface on the client, I think we should only expose the data_collection and put the should_collect_X methods on the data_collection class itself.
Edit: alternatively, we could introduce helper methods similar to what we do for streamed spans (has_span_streaming_enabled)
| True, | ||
| self.options["include_local_variables"] is not False, | ||
| self.options["include_source_context"] is not False, |
There was a problem hiding this comment.
Similar to another spot I mentioned this on, this is not easy to follow, should pass this in as named args, not as positional args
| def should_collect_user_info() -> bool: | ||
| """Shortcut for `Scope.get_client().should_collect_user_info()`.""" | ||
| return Scope.get_client().should_collect_user_info() | ||
|
|
||
|
|
||
| def should_collect_gen_ai_inputs(include_prompts: "Optional[bool]" = None) -> bool: | ||
| """Shortcut for `Scope.get_client().should_collect_gen_ai_inputs(...)`.""" | ||
| return Scope.get_client().should_collect_gen_ai_inputs(include_prompts) | ||
|
|
||
|
|
||
| def should_collect_gen_ai_outputs(include_prompts: "Optional[bool]" = None) -> bool: | ||
| """Shortcut for `Scope.get_client().should_collect_gen_ai_outputs(...)`.""" | ||
| return Scope.get_client().should_collect_gen_ai_outputs(include_prompts) |
There was a problem hiding this comment.
This mirrors what happens for should_send_default_pii, but as mentioned in another comment, I think this API surface should be reduced as if we do this for every type of configuration, this will become a large list (database queries, GQL documents/variables, etc.)
| return False | ||
|
|
||
|
|
||
| def apply_key_value_collection( |
There was a problem hiding this comment.
This name could be better
| http_headers=HttpHeadersCollection(), | ||
| # Bodies are collected regardless of PII today, bounded by | ||
| # ``max_request_body_size``. | ||
| http_bodies=list(ALL_BODY_TYPES), |
There was a problem hiding this comment.
ALL_BODY_TYPES is already a list, it doesn't need to be cast as a list again
Introduces the
data_collectioninit option, A/B/C/D resolution precedence, and theshould_collect_*accessors plus the scrubbing primitives (with unit tests).Behavior-neutral: in legacy mode the resolved
DataCollectionmirrorssend_default_pii, and nothing consumes the new filtering yet — so merging this changes no observed behavior.Base of the
data_collectionsplit. Consumers stack on top: