Skip to content

Commit de04e19

Browse files
authored
[SEA-NodeJS] (3/3) INTERVAL type parity + operation-lifecycle depth (#411)
* feat(sea): INTERVAL type parity + operation-lifecycle depth [3/3] Third of three stacked PRs (base: [2/3] execution + results). Completes the SEA foundation: - ArrowResultConverter: INTERVAL parity. Formats Arrow Interval[YearMonth] / Interval[DayTime] and Duration (rewritten to Int64 by SeaArrowIpcDurationFix) into the canonical Thrift strings ("Y-M" / "D HH:mm:ss.fffffffff"), byte- identical to the Thrift path. Threads the Arrow field through convertArrowTypes so the duration-unit metadata is available at value-conversion time. - Exhaustive operation-lifecycle coverage: seaCancel / seaClose / seaFinished idempotency, flag-set-before-await ordering (cancel-mid-fetch), kernel-error mapping, and the neutral OperationStatus callback shape. - SeaIntervalParity tests build real Arrow IPC batches via flatbuffers and assert the formatted strings. With this, SEA reaches M0 parity with Thrift (connect/auth → execute → fetch → operation lifecycle → INTERVAL types). Replaces the single 8/8 PR #383. Co-authored-by: Isaac Signed-off-by: Madhavendra Rathore <madhavendra.rathore@databricks.com> * fix(sea): address #411 review — exhaustive interval switch, docs, coverage Validated every interval edge case (null, multi-row, negative sub-year, sibling-survives) against a live pecotesting warehouse first — all byte-identical to Thrift. The findings were layering/dead-code/coverage, not runtime bugs. - F1: corrected the DURATION_UNIT_METADATA_KEY doc — the interval/duration branches are SEA-gated by construction (Thrift maps INTERVAL → ArrowString and never reaches them), NOT "reused by thrift" as the old comment claimed. - F3/F6/F10: formatArrowInterval now handles YEAR_MONTH only (typed `Interval` + `IntervalUnit.YEAR_MONTH`, no magic `=== 0`) and THROWS on any other unit. The old non-exhaustive default silently misread MONTH_DAY_NANO/undefined as [days,ms]; and the native Interval[DayTime] branch was dead+broken (the kernel emits DAY-TIME as Duration, handled separately) — removed it. - F2: exported the metadata key + added a test pinning it equal to the SEA-side declaration (guards against a silent rename drift). - F8: dropped the dead `_unit` param from formatDayTimeFromTotal. - F9: removed the dead duration check in the BigNum branch (rewritten Int64s arrive as raw bigint) and fixed the false "Int32Array instanceof Uint8Array" comment. - F7: added a YEAR-MONTH sub-year-negative unit test (-1 month → "-0-1"). - F4: added live e2e for null INTERVAL → null and multi-row batches. - F5: e2e before() now probes getSeaNative() and skips (not errors) when the binding is absent; dropped the flaky wall-clock latency assertions (assert behavior — cancel resolved, callback fired once). Deferred (low/informational, noted): F11 (consolidate test makeContext helpers), F12 (tighten instanceOf assertions), F13 (per-operation interval-representation breadcrumb — a per-value log would be spam). Co-authored-by: Isaac Signed-off-by: Madhavendra Rathore <madhavendra.rathore@databricks.com> * fix(sea): address #411 review — fail-loud duration unit, interval test gaps Addresses the P2 review comment on #411 (the interval/duration layer) and cascades the #410 fixes underneath (this branch was rebased onto the updated #410 tip). Validated against a live warehouse (interval-duration + interval-edge e2e) and unit tests. - ArrowResultConverter.toNanoseconds: throw on an unrecognized Arrow duration unit instead of silently defaulting to NANOSECOND. The four units are exactly what SeaArrowIpcDurationFix stamps, so an unknown unit means the two sides drifted — fail loud, matching formatArrowInterval. - SeaIntervalParity tests: add DAY-TIME via Duration(MILLISECOND) and Duration(SECOND) (only MICROSECOND/NANOSECOND were exercised), and a test asserting the converter throws (HiveDriverError) on a native non-YEAR_MONTH Arrow Interval rather than misreading it. - interval-duration-e2e: flip the stale "raw Int64 on this layer" assertion to expect the formatted thrift string "1 02:03:04.000000000". That test was written on #410 (pre-formatter) and explicitly noted "#411 flips this to the formatted string" — now that #411's formatter is wired, the value is the byte-identical thrift DAY-TIME string. Co-authored-by: Isaac Signed-off-by: Madhavendra Rathore <madhavendra.rathore@databricks.com> --------- Signed-off-by: Madhavendra Rathore <madhavendra.rathore@databricks.com>
1 parent 4ba2175 commit de04e19

6 files changed

Lines changed: 1445 additions & 18 deletions

File tree

lib/result/ArrowResultConverter.ts

Lines changed: 203 additions & 12 deletions
Original file line numberDiff line numberDiff line change
@@ -6,6 +6,8 @@ import {
66
TypeMap,
77
DataType,
88
Type,
9+
Interval,
10+
IntervalUnit,
911
StructRow,
1012
MapRow,
1113
Vector,
@@ -15,6 +17,7 @@ import {
1517
} from 'apache-arrow';
1618
import { TTableSchema, TColumnDesc } from '../../thrift/TCLIService_types';
1719
import IClientContext from '../contracts/IClientContext';
20+
import HiveDriverError from '../errors/HiveDriverError';
1821
import IResultsProvider, { ResultsProviderFetchNextOptions } from './IResultsProvider';
1922
import { ArrowBatch, getSchemaColumns, convertThriftValue } from './utils';
2023

@@ -23,6 +26,149 @@ const { isArrowBigNumSymbol, bigNumToBigInt } = arrowUtils;
2326
type ArrowSchema = Schema<TypeMap>;
2427
type ArrowSchemaField = Field<DataType<Type, TypeMap>>;
2528

29+
/**
30+
* Metadata key carrying the original Arrow `Duration` time unit on fields
31+
* rewritten to `Int64` by the SEA IPC pre-processor
32+
* (`lib/sea/SeaArrowIpcDurationFix.ts`). Re-declared here (rather than
33+
* imported) to keep this generic `lib/result` converter free of a
34+
* compile-time dependency on `lib/sea`.
35+
*
36+
* **SEA-gated by construction — NOT shared with Thrift.** This key (and the
37+
* `DataType.isInterval` / duration branches below) only ever execute on the
38+
* SEA path. The Thrift backend sets `intervalTypesAsArrow: false` and maps
39+
* both INTERVAL `TTypeId`s to `ArrowString` (`lib/result/utils.ts`), so the
40+
* server pre-formats intervals to strings and this logic is never reached.
41+
* `export`ed so `SeaIntervalParity.test` can pin it equal to the SEA-side
42+
* declaration and catch a rename/typo that would silently no-op the consumer.
43+
*/
44+
export const DURATION_UNIT_METADATA_KEY = 'databricks.arrow.duration_unit';
45+
const ZERO_BIGINT = BigInt(0);
46+
const NS_PER_MICRO = BigInt(1_000);
47+
const NS_PER_MILLI = BigInt(1_000_000);
48+
const NS_PER_SEC = BigInt(1_000_000_000);
49+
const NS_PER_MIN = NS_PER_SEC * BigInt(60);
50+
const NS_PER_HOUR = NS_PER_MIN * BigInt(60);
51+
const NS_PER_DAY = NS_PER_HOUR * BigInt(24);
52+
53+
/**
54+
* Format a native Arrow `Interval[YearMonth]` value into the canonical thrift
55+
* string `"Y-M"` (e.g. 1 year 2 months → `"1-2"`, -1 month → `"-0-1"`).
56+
*
57+
* Arrow surfaces YEAR-MONTH as an `Int32Array(2)` `[years, months]` via the
58+
* `GetVisitor` (years/months derived from a single int32 of total months).
59+
*
60+
* **Only YEAR_MONTH reaches here.** The kernel emits INTERVAL DAY-TIME as an
61+
* Arrow `Duration` (rewritten to `Int64`), handled by
62+
* `formatDurationToIntervalDayTime` — never as a native `Interval[DayTime]`.
63+
* Any other unit (DAY_TIME / MONTH_DAY_NANO / undefined) is therefore
64+
* unexpected; we throw rather than silently misread the value as `[days, ms]`
65+
* and emit a confidently-wrong string (the old non-exhaustive default).
66+
*/
67+
function formatArrowInterval(value: Int32Array, valueType: Interval): string {
68+
if (valueType?.unit !== IntervalUnit.YEAR_MONTH) {
69+
throw new HiveDriverError(
70+
`SEA result converter: unsupported Arrow Interval unit ${valueType?.unit}. The kernel emits only ` +
71+
`YEAR_MONTH as a native Arrow Interval (DAY-TIME arrives as Duration); MONTH_DAY_NANO is unsupported.`,
72+
);
73+
}
74+
return formatYearMonth(Number(value[0]), Number(value[1]));
75+
}
76+
77+
/**
78+
* Format the (years, months) decomposition into `"Y-M"` (or `"-Y-M"`
79+
* for negative intervals). Arrow's `getIntervalYearMonth` (in
80+
* `apache-arrow/visitor/get.js:179`) decomposes a signed total-months
81+
* int32 via integer truncation, so years and months always share the
82+
* same sign. We render the absolute values with a single leading `-`
83+
* to match the Spark display format used on the thrift path.
84+
*/
85+
function formatYearMonth(years: number, months: number): string {
86+
const total = years * 12 + months;
87+
if (total < 0) {
88+
const abs = -total;
89+
const y = Math.trunc(abs / 12);
90+
const m = abs % 12;
91+
return `-${y}-${m}`;
92+
}
93+
return `${years}-${months}`;
94+
}
95+
96+
/**
97+
* Format an Arrow `Duration` value (rewritten by the SEA IPC
98+
* pre-processor to `Int64`) into the thrift INTERVAL DAY-TIME string.
99+
*
100+
* @param value the duration value as `bigint` (signed nanos/micros/
101+
* millis/seconds depending on `unit`)
102+
* @param unit one of `SECOND` / `MILLISECOND` / `MICROSECOND` /
103+
* `NANOSECOND` (the original Arrow time unit, captured
104+
* by `SeaArrowIpcDurationFix.ts`)
105+
*/
106+
function formatDurationToIntervalDayTime(value: bigint | number, unit: string): string {
107+
const bi = typeof value === 'bigint' ? value : BigInt(value);
108+
const nanos = toNanoseconds(bi, unit);
109+
return formatDayTimeFromTotal(nanos);
110+
}
111+
112+
/**
113+
* Scale a duration value to nanoseconds based on its unit.
114+
*
115+
* SECOND → ×1_000_000_000
116+
* MILLISECOND → × 1_000_000
117+
* MICROSECOND → × 1_000
118+
* NANOSECOND → × 1
119+
*
120+
* Throws on any other unit rather than silently treating it as
121+
* NANOSECOND. The four units above are exactly what
122+
* `SeaArrowIpcDurationFix` enumerates when it stamps the
123+
* `databricks.arrow.duration_unit` metadata, so an unrecognized unit
124+
* here means the two sides have drifted — fail loud (matching
125+
* `formatArrowInterval`'s stance) instead of emitting a confidently
126+
* wrong value.
127+
*/
128+
function toNanoseconds(value: bigint, unit: string): bigint {
129+
switch (unit) {
130+
case 'SECOND':
131+
return value * NS_PER_SEC;
132+
case 'MILLISECOND':
133+
return value * NS_PER_MILLI;
134+
case 'MICROSECOND':
135+
return value * NS_PER_MICRO;
136+
case 'NANOSECOND':
137+
return value;
138+
default:
139+
throw new HiveDriverError(
140+
`SEA INTERVAL DAY-TIME: unrecognized Arrow duration unit ${JSON.stringify(unit)}; ` +
141+
`expected one of SECOND / MILLISECOND / MICROSECOND / NANOSECOND`,
142+
);
143+
}
144+
}
145+
146+
/**
147+
* Format a signed total-nanoseconds value as `"D HH:mm:ss.fffffffff"`.
148+
* Always emits 9 fractional digits to match the thrift driver's wire
149+
* format (`"1 02:03:04.000000000"` — 9 digits regardless of the
150+
* server-side storage precision). Negative values get a single
151+
* leading `-`. The caller has already scaled to nanoseconds.
152+
*/
153+
function formatDayTimeFromTotal(totalNanos: bigint): string {
154+
const sign = totalNanos < ZERO_BIGINT ? '-' : '';
155+
const abs = totalNanos < ZERO_BIGINT ? -totalNanos : totalNanos;
156+
157+
const days = abs / NS_PER_DAY;
158+
let rem = abs % NS_PER_DAY;
159+
const hours = rem / NS_PER_HOUR;
160+
rem %= NS_PER_HOUR;
161+
const minutes = rem / NS_PER_MIN;
162+
rem %= NS_PER_MIN;
163+
const seconds = rem / NS_PER_SEC;
164+
const subSeconds = rem % NS_PER_SEC;
165+
166+
const pad2 = (n: bigint): string => n.toString().padStart(2, '0');
167+
const fraction = `.${subSeconds.toString().padStart(9, '0')}`;
168+
169+
return `${sign}${days.toString()} ${pad2(hours)}:${pad2(minutes)}:${pad2(seconds)}${fraction}`;
170+
}
171+
26172
export default class ArrowResultConverter implements IResultsProvider<Array<any>> {
27173
private readonly context: IClientContext;
28174

@@ -147,37 +293,52 @@ export default class ArrowResultConverter implements IResultsProvider<Array<any>
147293
private getRows(schema: ArrowSchema, rows: Array<StructRow | MapRow>): Array<any> {
148294
return rows.map((row) => {
149295
// First, convert native Arrow values to corresponding plain JS objects
150-
const record = this.convertArrowTypes(row, undefined, schema.fields);
296+
const record = this.convertArrowTypes(row, undefined, schema.fields, undefined);
151297
// Second, cast all the values to original Thrift types
152298
return this.convertThriftTypes(record);
153299
});
154300
}
155301

156-
private convertArrowTypes(value: any, valueType: DataType | undefined, fields: Array<ArrowSchemaField> = []): any {
302+
private convertArrowTypes(
303+
value: any,
304+
valueType: DataType | undefined,
305+
fields: Array<ArrowSchemaField> = [],
306+
field?: ArrowSchemaField,
307+
): any {
157308
if (value === null) {
158309
return value;
159310
}
160311

161312
const fieldsMap: Record<string, ArrowSchemaField> = {};
162-
for (const field of fields) {
163-
fieldsMap[field.name] = field;
313+
for (const f of fields) {
314+
fieldsMap[f.name] = f;
164315
}
165316

166317
// Convert structures to plain JS object and process all its fields recursively
167318
if (value instanceof StructRow) {
168319
const result = value.toJSON();
169320
for (const key of Object.keys(result)) {
170-
const field: ArrowSchemaField | undefined = fieldsMap[key];
171-
result[key] = this.convertArrowTypes(result[key], field?.type, field?.type.children || []);
321+
const childField: ArrowSchemaField | undefined = fieldsMap[key];
322+
result[key] = this.convertArrowTypes(
323+
result[key],
324+
childField?.type,
325+
childField?.type.children || [],
326+
childField,
327+
);
172328
}
173329
return result;
174330
}
175331
if (value instanceof MapRow) {
176332
const result = value.toJSON();
177333
// Map type consists of its key and value types. We need only value type here, key will be cast to string anyway
178-
const field = fieldsMap.entries?.type.children.find((item) => item.name === 'value');
334+
const valueField = fieldsMap.entries?.type.children.find((item) => item.name === 'value');
179335
for (const key of Object.keys(result)) {
180-
result[key] = this.convertArrowTypes(result[key], field?.type, field?.type.children || []);
336+
result[key] = this.convertArrowTypes(
337+
result[key],
338+
valueField?.type,
339+
valueField?.type.children || [],
340+
valueField,
341+
);
181342
}
182343
return result;
183344
}
@@ -186,31 +347,61 @@ export default class ArrowResultConverter implements IResultsProvider<Array<any>
186347
if (value instanceof Vector) {
187348
const result = value.toJSON();
188349
// Array type contains the only child which defines a type of each array's element
189-
const field = fieldsMap.element;
190-
return result.map((item) => this.convertArrowTypes(item, field?.type, field?.type.children || []));
350+
const elementField = fieldsMap.element;
351+
return result.map((item) =>
352+
this.convertArrowTypes(item, elementField?.type, elementField?.type.children || [], elementField),
353+
);
191354
}
192355

193356
if (DataType.isTimestamp(valueType)) {
194357
return new Date(value);
195358
}
196359

360+
// INTERVAL — Spark/Databricks SEA emits two flavours: native Arrow
361+
// `Interval[YearMonth]` / `Interval[DayTime]` (handled here) and
362+
// `Duration` (transparently rewritten to `Int64` upstream by
363+
// `SeaArrowIpcDurationFix.ts`; handled in the bigint/Int64 branch
364+
// below). In every case we coerce to the canonical thrift string
365+
// form so the SEA path is byte-identical with the thrift path:
366+
// YEAR-MONTH → `"Y-M"`
367+
// DAY-TIME → `"D HH:mm:ss.fffffffff"`
368+
if (DataType.isInterval(valueType)) {
369+
return formatArrowInterval(value, valueType);
370+
}
371+
197372
// Convert big number values to BigInt
198373
// Decimals are also represented as big numbers in Arrow, so additionally process them (convert to float)
199374
if (value instanceof Object && value[isArrowBigNumSymbol]) {
200375
const result = bigNumToBigInt(value);
201376
if (DataType.isDecimal(valueType)) {
202377
return Number(result) / 10 ** valueType.scale;
203378
}
379+
// A rewritten Duration Int64 surfaces as a raw `bigint`, not a BigNum
380+
// wrapper, so it is handled in the bigint branch below — not here.
204381
return result;
205382
}
206383

207-
// Convert binary data to Buffer
384+
// Convert binary data to Buffer.
208385
if (value instanceof Uint8Array) {
386+
// Note: Arrow `Int32Array` / `BigInt64Array` are NOT `instanceof
387+
// Uint8Array` (they are sibling TypedArrays), so an interval value never
388+
// reaches this branch — intervals are handled by the `isInterval` /
389+
// bigint branches above. This is purely the binary-column → Buffer path.
209390
return Buffer.from(value);
210391
}
211392

393+
// Bigint fallback — for raw bigints (not BigNum wrappers), the
394+
// duration_unit metadata also gates the INTERVAL DAY-TIME format.
395+
if (typeof value === 'bigint') {
396+
const durationUnit = field?.metadata.get(DURATION_UNIT_METADATA_KEY);
397+
if (durationUnit) {
398+
return formatDurationToIntervalDayTime(value, durationUnit);
399+
}
400+
return Number(value);
401+
}
402+
212403
// Return other values as is
213-
return typeof value === 'bigint' ? Number(value) : value;
404+
return value;
214405
}
215406

216407
private convertThriftTypes(record: Record<string, any>): any {

tests/e2e/sea/interval-duration-e2e.test.ts

Lines changed: 7 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -93,12 +93,13 @@ describe('SEA INTERVAL DAY-TIME (Arrow Duration rewriter) end-to-end', function
9393
expect(row.dt, 'INTERVAL DAY-TIME value should be present').to.not.equal(undefined);
9494
});
9595

96-
it('surfaces the value as a raw Int64 on this layer (formatter is PR 3/3)', async () => {
96+
it('surfaces the value as the formatted thrift INTERVAL DAY-TIME string (#411 formatter wired)', async () => {
9797
const row = await fetchOneRow(true, secrets as PecoSecrets);
98-
// Documents the M0/2-of-3 contract: the rewriter makes the column
99-
// *decodable* but the duration_unit formatter is not wired here yet, so the
100-
// value is the raw integer microsecond/nanosecond count, not the thrift
101-
// "1 02:03:04.000000000" string. (#411 flips this to the formatted string.)
102-
expect(['number', 'bigint']).to.include(typeof row.dt);
98+
// #411 wires the duration_unit formatter, so the raw Int64 the rewriter
99+
// produces is rendered as the thrift "D HH:mm:ss.fffffffff" string —
100+
// byte-identical to the Thrift path. (On the #410 layer this surfaced as
101+
// the raw integer count.)
102+
expect(typeof row.dt).to.equal('string');
103+
expect(row.dt).to.equal('1 02:03:04.000000000');
103104
});
104105
});

0 commit comments

Comments
 (0)