feat(databricks_zerobus sink): encode batches with Arrow Flight #25519
flaviofcruz wants to merge 3 commits
Switch the Databricks Zerobus sink from protobuf batch encoding to the SDK's Arrow Flight ingestion path:

- Bump arrow/arrow-schema/parquet from 56.2 to 58 so the codec RecordBatch type matches the SDK's Arrow Flight API (this also moves the clickhouse and aws_s3 parquet encoding paths).
- Enable the SDK's arrow-flight feature; derive the Arrow schema from Unity Catalog via arrow_schema_from_uc_schema and ingest RecordBatches via ingest_batch + wait_for_offset.
- Remove the now-unused proto-batch serializer from codecs and gate the batch-serializer machinery (BatchEncoder/BatchSerializer/BatchOutput/BatchSerializerConfig) behind the arrow feature.

Co-authored-by: Isaac
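To make the ingest-then-acknowledge flow named in the commit message concrete, here is a minimal sketch using an in-memory mock. Everything in it is an assumption made for illustration: `MockStream`, `Batch`, `send_and_confirm`, the synchronous signatures, and the `String` error type are hypothetical and are not the Zerobus SDK's API (the real calls are most likely async and operate on Arrow RecordBatches). Only the method names `ingest_batch` and `wait_for_offset` come from the description above.

```rust
/// Hypothetical stand-in for an Arrow RecordBatch; only the row count matters here.
struct Batch {
    num_rows: usize,
}

/// Hypothetical in-memory stream. The method names mirror `ingest_batch` and
/// `wait_for_offset` from the PR description, but these signatures are
/// assumptions, not the Zerobus SDK's API.
#[derive(Default)]
struct MockStream {
    next_offset: u64,
    acked_up_to: Option<u64>,
}

impl MockStream {
    /// Accept a non-empty batch and return the offset assigned to it.
    fn ingest_batch(&mut self, batch: &Batch) -> Result<u64, String> {
        if batch.num_rows == 0 {
            return Err("empty batch".to_string());
        }
        let offset = self.next_offset;
        self.next_offset += 1;
        Ok(offset)
    }

    /// Pretend the server acknowledged everything up to `offset`.
    /// Fails if that offset was never handed out by `ingest_batch`.
    fn wait_for_offset(&mut self, offset: u64) -> Result<(), String> {
        if offset >= self.next_offset {
            return Err(format!("offset {offset} was never ingested"));
        }
        self.acked_up_to = Some(self.acked_up_to.map_or(offset, |a| a.max(offset)));
        Ok(())
    }
}

/// Send one batch and report success only once its offset is acknowledged.
fn send_and_confirm(stream: &mut MockStream, batch: &Batch) -> Result<u64, String> {
    let offset = stream.ingest_batch(batch)?;
    stream.wait_for_offset(offset)?;
    Ok(offset)
}
```

The point of the two-step shape is that a sink can treat a batch as delivered only after the acknowledgement for its offset arrives, rather than as soon as the batch is handed to the stream.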
Is the Arrow batch encoder more performant than the proto encoder? Why use it over proto? I'm just trying to understand the benefits. Is there a separate Zerobus endpoint for proto batches versus Arrow streams? They're both still gRPC streams, right, just with different encoders on the client side?
The Zerobus SDK and the service itself do provide two different endpoints: one for protos (row ingestion) and an Arrow Flight one (Arrow batch ingestion). Using either in Vector has similar compute requirements: for proto we need to convert the Vector events into protocol buffers, and for Arrow we need to convert the events into Arrow batches. So from a client perspective, the performance is similar. However, we get the benefit of having a single batch encoder in Vector rather than two, and less code is a great benefit. Another major advantage comes from simplified ingestion on the Zerobus side: ingesting Arrow batches is usually much cheaper for us, so we would prefer users to use this path as much as possible.
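As a rough illustration of what "converting events into Arrow batches" involves on the client, the sketch below pivots row-oriented events into per-field columns using only the standard library. It does not use the arrow crate, and `Event`, `ColumnarBatch`, and `events_to_columns` are hypothetical names invented for this example; a real encoder would build typed Arrow arrays against the table schema instead of string columns.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for a log event: a flat map of field name to value.
type Event = BTreeMap<String, String>;

/// Hypothetical stand-in for a columnar batch: one nullable column per schema field.
#[derive(Debug, PartialEq)]
struct ColumnarBatch {
    columns: BTreeMap<String, Vec<Option<String>>>,
    num_rows: usize,
}

/// Pivot row-oriented events into columns, following the given schema.
/// Fields missing from an event become nulls (`None`); fields that are not
/// in the schema are dropped. Schema names are assumed to be unique.
fn events_to_columns(schema: &[&str], events: &[Event]) -> ColumnarBatch {
    // Create one empty column per schema field, sized for the batch.
    let mut columns: BTreeMap<String, Vec<Option<String>>> = schema
        .iter()
        .map(|name| (name.to_string(), Vec::with_capacity(events.len())))
        .collect();

    // Append each event's value (or a null) to every column, in row order.
    for event in events {
        for name in schema {
            let value = event.get(*name).cloned();
            columns
                .get_mut(*name)
                .expect("column created above")
                .push(value);
        }
    }

    ColumnarBatch {
        columns,
        num_rows: events.len(),
    }
}
```

Either encoding path walks every field of every event once, which is consistent with the comment above that client-side cost is similar for proto and Arrow.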
Summary
With Zerobus SDK 2.0.1, we can now switch the Databricks Zerobus sink to Arrow Flight ingestion. This also removes a bunch of code that was only used for this sink.
Vector configuration
No changes!
How did you test this PR?
Unit tests and running it manually.
Change Type
Is this a breaking change?
Does this PR include user facing changes?
References
Notes
- Use @vectordotdev/vector to reach out to us regarding this PR.
- For a pre-push hook, please see this template. It runs:
  - make fmt
  - make check-clippy (if there are failures it's possible some of them can be fixed with make clippy-fix)
  - make test
- Use git merge origin master and git push.
- If this PR modifies Cargo.lock, please run make build-licenses to regenerate the license inventory and commit the changes (if any). More details on the dd-rust-license-tool.