feat(databricks_zerobus sink): encode batches with Arrow Flight #25519

Open
flaviofcruz wants to merge 3 commits into vectordotdev:master from flaviofcruz:zerobus-arrow-stream

Conversation

@flaviofcruz
Contributor

@flaviofcruz flaviofcruz commented May 28, 2026

Summary

With zerobus SDK 2.0.1, we can now switch the Databricks Zerobus sink to Arrow Flight ingestion. This also lets us remove a bunch of code that was only used by this sink:

  • Bump arrow/arrow-schema/parquet from 56.2 to 58 so the codec RecordBatch type matches the SDK's Arrow Flight API (this also moves the clickhouse and aws_s3 parquet encoding paths).
  • Enable the SDK's arrow-flight feature; derive the Arrow schema from Unity Catalog via arrow_schema_from_uc_schema and ingest RecordBatches via ingest_batch + wait_for_offset.
  • Remove the now-unused proto-batch serializer from codecs and gate the batch-serializer machinery (BatchEncoder/BatchSerializer/BatchOutput/BatchSerializerConfig) behind the arrow feature.
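
The ingest-then-wait flow from the second bullet can be sketched roughly as below. This is not the SDK's API: `Batch`, `BatchStream`, `MockStream`, and `send_all` are hypothetical stand-ins, the calls are synchronous so the sketch needs only the standard library, and waiting on each batch's offset before sending the next one is an assumption rather than something this PR states.

```rust
/// Hypothetical stand-in for an Arrow RecordBatch; only the row count is modelled.
#[derive(Debug, Clone, PartialEq)]
pub struct Batch {
    pub rows: usize,
}

/// Hypothetical, synchronous stand-in for an ingestion stream.
/// The method names mirror the PR description; the signatures are assumptions.
pub trait BatchStream {
    type Error;
    /// Sends one batch and returns the offset assigned to it.
    fn ingest_batch(&mut self, batch: Batch) -> Result<u64, Self::Error>;
    /// Blocks until the given offset has been acknowledged.
    fn wait_for_offset(&mut self, offset: u64) -> Result<(), Self::Error>;
}

/// Ingests each batch and waits for its acknowledgement before sending the next.
/// Returns the acknowledged offsets in send order.
pub fn send_all<S: BatchStream>(
    stream: &mut S,
    batches: Vec<Batch>,
) -> Result<Vec<u64>, S::Error> {
    let mut offsets = Vec::with_capacity(batches.len());
    for batch in batches {
        let offset = stream.ingest_batch(batch)?;
        stream.wait_for_offset(offset)?;
        offsets.push(offset);
    }
    Ok(offsets)
}

/// In-memory mock used to exercise the flow without a network.
#[derive(Default)]
pub struct MockStream {
    pub next_offset: u64,
    pub acked: Option<u64>,
    pub rows_seen: usize,
}

impl BatchStream for MockStream {
    type Error = String;

    fn ingest_batch(&mut self, batch: Batch) -> Result<u64, String> {
        if batch.rows == 0 {
            return Err("empty batch".to_string());
        }
        let offset = self.next_offset;
        self.next_offset += 1;
        self.rows_seen += batch.rows;
        Ok(offset)
    }

    fn wait_for_offset(&mut self, offset: u64) -> Result<(), String> {
        if offset >= self.next_offset {
            return Err(format!("offset {offset} was never assigned"));
        }
        self.acked = Some(offset);
        Ok(())
    }
}
```

Putting the stream behind a trait keeps the send loop testable against the mock; a real sink would presumably be async and might pipeline several batches before waiting.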

Vector configuration

No changes!

  sinks:
    zb:
      type: databricks_zerobus
      inputs: [demo]
      ingestion_endpoint: "https://ingest.dev.databricks.com"
      unity_catalog_endpoint: "https://workspace.cloud.databricks.com"
      table_name: "main.default.my_table"
      auth:
        strategy: oauth
        client_id: "${DATABRICKS_CLIENT_ID}"
        client_secret: "${DATABRICKS_CLIENT_SECRET}"

How did you test this PR?

Unit tests and running it manually.

Change Type

  • Bug fix
  • New feature
  • Dependencies
  • Non-functional (chore, refactoring, docs)
  • Performance

Is this a breaking change?

  • Yes
  • No

Does this PR include user facing changes?

  • Yes. Please add a changelog fragment based on our guidelines.
  • No. A maintainer will apply the no-changelog label to this PR.

References

Notes

  • Please read our Vector contributor resources.
  • Do not hesitate to use @vectordotdev/vector to reach out to us regarding this PR.
  • Some CI checks run only after we manually approve them.
    • We recommend adding a pre-push hook, please see this template.
    • Alternatively, we recommend running the following locally before pushing to the remote branch:
      • make fmt
      • make check-clippy (if there are failures it's possible some of them can be fixed with make clippy-fix)
      • make test
  • After a review is requested, please avoid force pushes to help us review incrementally.
    • Feel free to push as many commits as you want. They will be squashed into one before merging.
    • For example, you can run git merge origin master and git push.
  • If this PR introduces changes to Vector dependencies (modifies Cargo.lock), please
    run make build-licenses to regenerate the license inventory and commit the changes (if any). More details on the dd-rust-license-tool.

Switch the Databricks Zerobus sink from protobuf batch encoding to the
SDK's Arrow Flight ingestion path:

- Bump arrow/arrow-schema/parquet from 56.2 to 58 so the codec
  RecordBatch type matches the SDK's Arrow Flight API (this also moves
  the clickhouse and aws_s3 parquet encoding paths).
- Enable the SDK's arrow-flight feature; derive the Arrow schema from
  Unity Catalog via arrow_schema_from_uc_schema and ingest RecordBatches
  via ingest_batch + wait_for_offset.
- Remove the now-unused proto-batch serializer from codecs and gate the
  batch-serializer machinery (BatchEncoder/BatchSerializer/BatchOutput/
  BatchSerializerConfig) behind the arrow feature.

Co-authored-by: Isaac
@github-actions github-actions bot added the domain: sinks, domain: external docs, and docs review on hold labels May 28, 2026
@flaviofcruz flaviofcruz marked this pull request as ready for review May 28, 2026 22:16
@flaviofcruz flaviofcruz requested review from a team as code owners May 28, 2026 22:16
@petere-datadog
Contributor

Is the Arrow batch encoder more performant than the proto encoder? Why use this over proto? Just trying to understand the benefits. Is there a separate Zerobus endpoint for proto batches vs. Arrow streams? They're both still gRPC streams, right, just with different encoders on the client side?

@flaviofcruz
Contributor Author

> Is the Arrow batch encoder more performant than the proto encoder? Why use this over proto? Just trying to understand the benefits. Is there a separate Zerobus endpoint for proto batches vs. Arrow streams? They're both still gRPC streams, right, just with different encoders on the client side?

The zerobus SDK and the service itself do provide two different endpoints: one for protos (row ingestion) and an Arrow Flight one (Arrow batch ingestion). Using either in Vector has similar compute requirements: for proto we need to convert the Vector events into protocol buffers, and for Arrow we need to convert the events into Arrow batches. So from a client perspective, the performance is similar. However, we get the benefit of having a single batch encoder in Vector rather than two, and less code is a great benefit.

Another major advantage comes from simplified ingestion on the Zerobus side: ingesting Arrow batches is usually much cheaper for us, so we would prefer that users take this path as much as possible.
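
To make the row vs. batch distinction above concrete, here is a minimal sketch of pivoting row-shaped events into a columnar batch. It does not use the arrow crate or Vector's codec: `Event`, `ColumnBatch`, and `to_columns` are hypothetical names, and events are simplified to flat string maps.

```rust
use std::collections::BTreeMap;

/// Hypothetical simplification of an event: field name -> string value.
pub type Event = BTreeMap<String, String>;

/// Columnar batch: one column per schema field, one slot per event.
/// A missing field becomes `None`, playing the role of a null.
#[derive(Debug, PartialEq)]
pub struct ColumnBatch {
    pub columns: Vec<(String, Vec<Option<String>>)>,
}

/// Pivots row-oriented events into columns, following the schema's field order.
pub fn to_columns(schema: &[&str], events: &[Event]) -> ColumnBatch {
    let columns: Vec<(String, Vec<Option<String>>)> = schema
        .iter()
        .map(|field| {
            // Walk every event once per field and collect that field's values.
            let values: Vec<Option<String>> =
                events.iter().map(|e| e.get(*field).cloned()).collect();
            (field.to_string(), values)
        })
        .collect();
    ColumnBatch { columns }
}
```

A proto (row) path would instead serialize each event as its own message; either way the client walks every field of every event once, which is consistent with the point that client-side cost is similar.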
