Security audit logging #85

tombentley · 2025-12-03T00:59:49Z

No description provided.

Signed-off-by: Tom Bentley <tbentley@redhat.com>

k-wall · 2025-12-03T10:34:23Z

proposals/nnn-audit-logging.md

+* Similarly events covering the following API Keys: `CREATE_DELEGATION_TOKEN`, `RENEW_DELEGATION_TOKEN`, `EXPIRE_DELEGATION_TOKEN`, `DESCRIBE_DELEGATION_TOKEN`
+* `ClientClose` — Emitted when a client connection is closed (whether client or proxy initiated)
+* `BrokerClose` — Emitted when a broker connection is closed (whether broker or proxy initiated).
+


What about KMS events? Should we think about how those would be modelled?

Simply knowing that a KEK has been used at least once seems to be good enough for answering questions like:

"How many KEKs does the proxy use?"

"What KEKs have been used by the proxy (over the last N days)?"

"Has the proxy used this key which we believe has been accidentally disclosed?"

More broadly, the question is "Can plugins-to-plugins generate security-relevant events?". Probably.

In any case, I'm inclined not to specify such events right now, but to aim for a way for plugins to be able to publish security events of their own. That way we can roll out support for better audit logging piecemeal, based on identified requirements, rather than go imagining all the things we think might be useful.
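To make the "used at least once" idea concrete, here is a minimal sketch of how a KMS-facing plugin might publish its own event the first time it sees a KEK. Everything named here (`KekFirstUsed`, `KekUsageTracker`, the `Consumer<Record>` publisher) is hypothetical and not part of the proposal or of any existing Kroxylicious API.

```java
import java.time.Instant;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;

// Hypothetical event: emitted once per KEK, the first time the plugin uses it.
record KekFirstUsed(String kekId, Instant firstUsedAt) {}

// Hypothetical helper a KMS-facing plugin could own. The publisher stands in
// for whatever mechanism the runtime ends up offering to plugins.
final class KekUsageTracker {
    private final Set<String> seenKeks = ConcurrentHashMap.newKeySet();
    private final Consumer<Record> publisher;

    KekUsageTracker(Consumer<Record> publisher) {
        this.publisher = publisher;
    }

    // Call on every KEK use; only the first use of each KEK produces an event.
    void onKekUsed(String kekId) {
        if (seenKeks.add(kekId)) {
            publisher.accept(new KekFirstUsed(kekId, Instant.now()));
        }
    }
}
```

A log of such events would be enough to answer the three questions above without recording every encryption operation. The in-memory set only deduplicates within one proxy process, so a restart would emit the event again.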

k-wall · 2025-12-03T10:40:21Z

proposals/nnn-audit-logging.md

+
+Goals:
+
+* enable users to _easily_ collect a _complete_ log of security-related events


We should be clear that a complete log should include both the actions performed by the Kafka client and any (async) operations caused by the filters themselves.

This is a good point.

I suppose for the purpose of being able to correlate with Broker logs it would be better to know that a certain request originated in the proxy, not with a client accessing the proxy. The alternative, of not audit logging proxy-originated requests, would be confusing at best, and possibly indistinguishable from log tampering to someone who was looking closely enough.

It should be noted that there can be things like queries to Authorizers which should not be logged, because they're not an attempt to perform the action being queried (e.g. when implementing `IncludeTopicAuthorizedOperations` in a Metadata request).
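One way to picture that distinction is for the caller to say why it is asking. The sketch below is only an illustration: `Purpose`, `Decision` and `AuditingAuthorizer` are hypothetical names, and a real Authorizer API might draw the line differently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

enum Decision { ALLOW, DENY }

// Hypothetical: why the authorizer is being consulted.
enum Purpose {
    ENFORCE, // the client is actually attempting the operation
    QUERY    // the proxy is only computing what would be allowed
}

final class AuditingAuthorizer {
    private final Set<String> readableTopics;
    private final List<String> auditLog = new ArrayList<>();

    AuditingAuthorizer(Set<String> readableTopics) {
        this.readableTopics = Set.copyOf(readableTopics);
    }

    Decision authorizeRead(String principal, String topic, Purpose purpose) {
        Decision decision = readableTopics.contains(topic) ? Decision.ALLOW : Decision.DENY;
        // Only real attempts are audit-worthy; queries are not attempts.
        if (purpose == Purpose.ENFORCE) {
            auditLog.add(decision + " " + principal + " READ " + topic);
        }
        return decision;
    }

    List<String> auditLog() {
        return List.copyOf(auditLog);
    }
}
```

With this shape, populating the authorized operations of a Metadata response would use `Purpose.QUERY` and leave the audit log untouched, while a denied Fetch would use `Purpose.ENFORCE` and be recorded.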

So the answer to the question of "what to log?" isn't always "everything". I think if we tried to make it "everything" we could end up in a mire of event modelling for the many edge cases which in theory someone might care about distinguishing from each other, but in practice someone or something has to analyse those logs and draw conclusions. The closer we model the complex and evolving reality, the harder it is for someone to draw the correct conclusions, and the more we end up being constrained by the API aspect of this proposal.

There's also the question of how to allow for logging of events within plugins. The Authorization plugin provides a great example. The runtime doesn't really know about Authorizers in a deep way (they're just a plugin), but they're actually implementing logic which deserves specific audit logging. And ideally that logging would be consistent over Authorizer implementations (e.g. a Deny from the AclAuthorizer is the same as a Deny from an OpaAuthorizer).

One way to do this, I think, is for the Filter API to provide a method for logging an event. At the level of the Filter API we don't need to be prescriptive about what those events look like (we could just say `java.lang.Record`, so we knew they were Jackson serializable). We're just promising that they'll be emitted to the same things as the events generated natively by the runtime, and with the right attributes (like the event time and the sessionId and I guess the filterId). The Authorization filter would then take on responsibility for calling that method. Crucially the event classes could be defined alongside the Authorizer API, which is how we'd end up with consistency of the event schema across different Authorizer impls.
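A rough sketch of what that could look like, assuming Java 17+. None of these types exist in the Filter API today: `AuditContext`, `AuditSink`, `AuditEnvelope`, `SessionAuditContext` and `AuthorizationDenied` are all hypothetical names chosen for illustration.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

// Hypothetical envelope: the attributes the runtime promises to attach.
record AuditEnvelope(Instant eventTime, String sessionId, String filterId, Record event) {}

// Hypothetical sink shared by runtime-generated and plugin-generated events.
interface AuditSink {
    void accept(AuditEnvelope envelope);
}

// Hypothetical surface the Filter API could expose; any record is accepted.
interface AuditContext {
    void logAuditEvent(Record event);
}

// Hypothetical event class living alongside the Authorizer API, so every
// Authorizer implementation reports a denial with the same schema.
record AuthorizationDenied(String principal, String operation,
                           String resourceType, String resourceName) {}

// Runtime-side implementation: the filter supplies only the event payload,
// the runtime fills in time, session and filter identity.
final class SessionAuditContext implements AuditContext {
    private final String sessionId;
    private final String filterId;
    private final AuditSink sink;

    SessionAuditContext(String sessionId, String filterId, AuditSink sink) {
        this.sessionId = sessionId;
        this.filterId = filterId;
        this.sink = sink;
    }

    @Override
    public void logAuditEvent(Record event) {
        sink.accept(new AuditEnvelope(Instant.now(), sessionId, filterId, event));
    }
}

// Simple sink for demonstration; a real one would serialize and emit.
final class InMemoryAuditSink implements AuditSink {
    final List<AuditEnvelope> events = new ArrayList<>();

    @Override
    public void accept(AuditEnvelope envelope) {
        events.add(envelope);
    }
}
```

The Authorization filter would call `logAuditEvent(new AuthorizationDenied(...))` regardless of which Authorizer made the decision, which is where the schema consistency would come from.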

k-wall · 2025-12-03T10:42:01Z

proposals/nnn-audit-logging.md

+
+* enable users to _easily_ collect a _complete_ log of security-related events
+* for the security events to be structured and amenable to automated post-processing
+* for the security events to be an API of the project, with the same compatibility guarantees as other APIs


Filters can effectively rename entities in Kafka (e.g. map a topic or group name). It needs to be up to the user to decide which point(s) along the filter chain should be "tapped" for audit.

I've not yet described how any of this would work, but I think the most natural way for it to work for the events which arise from requests and responses is to use a filter. Using that approach would allow the user to place it wherever in the chain they wished.
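To illustrate why the position matters, here is a toy model of a chain in which each filter is just a function over the topic name. It is a sketch under that simplifying assumption, not the real Filter API; `ChainSketch`, `prefixer` and `tap` are hypothetical.

```java
import java.util.List;
import java.util.function.UnaryOperator;

final class ChainSketch {
    private ChainSketch() {}

    // A renaming filter: maps the client-facing name to the broker-facing one.
    static UnaryOperator<String> prefixer(String prefix) {
        return name -> prefix + name;
    }

    // An audit tap: records the name it sees and passes it on unchanged.
    static UnaryOperator<String> tap(List<String> seen) {
        return name -> {
            seen.add(name);
            return name;
        };
    }

    // Applies the filters in order, client side first.
    static String run(List<UnaryOperator<String>> chain, String topic) {
        String current = topic;
        for (UnaryOperator<String> filter : chain) {
            current = filter.apply(current);
        }
        return current;
    }
}
```

A tap placed before `prefixer("tenant-a.")` records the name the client used, while a tap placed after it records the name the broker sees, so only the user can say which of the two the audit log should contain.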

I wasn't suggesting you describe a solution in this section, just call out that it is something a proposed solution must handle.

> I wasn't suggesting you describe a solution in this section

I haven't described it in the document at all yet. Still cogitating...

k-wall · 2025-12-03T10:43:20Z

proposals/nnn-audit-logging.md

+    - `resourceType` — The type of the resource (e.g. `Topic`)
+    - `resourceName` — The name of the resource (e.g. `my-topic`
+* `Read` — Emitted when a client successfully reads records from a topic. It is called `Read` rather than `Fetch` because it covers reads generally, including the `ShareFetch` API key. It will be possible to disable these events, because of the potential for high volume. 
+    - `topicName` — The name of the topic.


What about auditing client ids, group names and, possibly, transactional ids?

None of those pertain to the record data itself. I suppose a bad actor might try (and possibly succeed) to use a transactional id of some other service to cause a kind of denial of service attack by fencing off the legitimate producer. Likewise with groups, maybe Eve can prevent processing of some partitions by getting them assigned to her rogue app. But those things just seem a bit far-fetched, so I'm not super-keen to go adding them up-front.

> None of those pertain to the record data itself.

Why aren't we considering events such as resetting a consumer group offset as security events? Causing a consumer to skip a record or fetch a record twice seems very interesting.

On the one hand you're right. Someone could use that as an attack vector in the right circumstances.

But I think there are lots of reasons not to go over-broad on what we're trying to cover:

For offset commit... well, it doesn't look like a terribly strong signal of something security-related going on. Clients commit offsets all the time. Re-processing happens and is not unusual most of the time. The one thing I can think of which could be a bit more specific is fetching from the start of the log. A data exfiltration might look like that. But even that is quite weak: consumers don't have to store their offsets in Kafka at all, so such a check is easily evaded.

The Kafka broker's logging already covers requests which get as far as the broker. All we need in the proxy is logging which allows correlation with that. This might be enough of a reason to scale back parts of this proposal.

I think we could come up with a security angle for most RPCs. People expect systems to work and anything which makes them not work could hypothetically manifest, at least, as an attack on the availability of the system/DoS. So then we would end up with an audit log that is really more like a protocol trace.

The more types of event you define the more API you're committing to. That inhibits our ability to evolve things in the future.

If we were going to implement something in the proxy we should use normal logging for protocol tracing. It doesn't really need to be an API, as is proposed here.

The more types of event we define, and the more data produced, the harder it is to analyse.

If you research what sorts of events and event categories SIEM systems are interested in, they're relatively coarse-grained.

We can always add more events in the future: We don't need to achieve perfect coverage in this proposal, so long as it's not too inflexible for the future.

> This might be enough of a reason to scale back parts of this proposal.

@k-wall I was thinking about what this would look like if we took the position of not logging all the details of requests and responses in the proxy, and instead said that those should be logged on the broker cluster if you want that kind of depth. We would still log all the runtime-local things, like connections, authentications, authorizations and so on, as described in this proposal. I think if we did that we could model events like this:

RequestIngress (from client)

RequestEgress (to broker)

RequestInject (originator is a filter)

RequestShortcircuit

ResponseEgress (to the client)

If we took that position then we'd only need to log the correlationId, sessionId (and maybe the API key) for RequestEgress, because you could recover what was sent by correlation with the broker's `kafka.request.logger` logger. We could reduce the scope of this proposal, because we'd not be ending up with a "higher level" API that, for example, was trying to have a single read event which covered Fetch and ShareFetch. This seems to me to be a better decomposition into event types than what I've proposed.
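As a sketch of that decomposition, the five event types could be modelled as a sealed hierarchy (Java 17+). Only the `RequestEgress` fields come from the paragraph above; the fields on the other records, and the `traceKey` helper, are assumptions made for illustration.

```java
// Every event carries just enough to be grouped with the other events for
// the same request; request contents are left to the broker's own logging.
sealed interface ProxyEvent
        permits RequestIngress, RequestEgress, RequestInject, RequestShortcircuit, ResponseEgress {
    String sessionId();
    int correlationId();
}

// Request received from the client.
record RequestIngress(String sessionId, int correlationId, short apiKey) implements ProxyEvent {}

// Request forwarded to the broker: correlationId, sessionId and maybe the API key.
record RequestEgress(String sessionId, int correlationId, short apiKey) implements ProxyEvent {}

// Request originated by a filter rather than by the client (filterId is an assumption).
record RequestInject(String sessionId, int correlationId, short apiKey, String filterId) implements ProxyEvent {}

// Request answered inside the proxy without reaching the broker (filterId is an assumption).
record RequestShortcircuit(String sessionId, int correlationId, String filterId) implements ProxyEvent {}

// Response sent back to the client.
record ResponseEgress(String sessionId, int correlationId) implements ProxyEvent {}

final class ProxyEvents {
    private ProxyEvents() {}

    // Hypothetical helper: a key for grouping the events of one request in one session.
    static String traceKey(ProxyEvent event) {
        return event.sessionId() + "/" + event.correlationId();
    }
}
```

Grouping by `traceKey` would show, for instance, a `RequestIngress` with no matching `RequestEgress` but a `RequestShortcircuit`, which tells an auditor the broker never saw that request.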

Aside: This starts to feel like OTel traces and spans. However, it doesn't seem to be compatible with OTel. OTel (i.e. app-level) "requests" would tend to correspond with Kafka records. But you can't meaningfully propagate an OTel context kept within records with the events above because records can be batched together, so there's no single "parent span".

k-wall · 2025-12-03T10:43:57Z

@tombentley thanks for getting this ball rolling.

Initial proposal for audit logging

615515e

Signed-off-by: Tom Bentley <tbentley@redhat.com>

tombentley requested a review from a team as a code owner December 3, 2025 00:59

tombentley changed the title ~~Initial proposal for audit logging~~ Security audit logging Dec 3, 2025

tombentley mentioned this pull request Dec 3, 2025

Add AuditLogger kroxylicious/kroxylicious#2481

Draft

tombentley added 2 commits December 3, 2025 16:37

More events for other security-relevant APIs

b1361b3

Signed-off-by: Tom Bentley <tbentley@redhat.com>

Add a non-goal for tamper resistence

4b8b778

Signed-off-by: Tom Bentley <tbentley@redhat.com>

k-wall reviewed Dec 3, 2025
