Commit 1e7884b

dfa1 and claude committed
refactor: ADR 0001 — split read and write runtimes
Separate core's bifunctional encoding model into distinct read and write runtimes.

**Encoder/decoder lift**
Each encoding gets a standalone EncodingDecoder (reader module) and EncodingEncoder (writer module). 33 *EncodingTest classes move to writer/encode or reader/decode per their primary role.

**Phase 0 — Encoding metadata-only**
Encoding interface and all 32 *Encoding stub classes deleted from core. Shared algorithmic constants (F10 tables, FL_ORDER, FL_CHUNK_SIZE, dtype constants) and helpers (transposeIndex, iterateIndex, etc.) inlined as private static into the *EncodingDecoder/*EncodingEncoder that use them. EncodeContext.encodings (Registry) replaced by encoders (Map<EncodingId,EncodingEncoder>). CascadingCompressor moves to writer.encode. Registry becomes extension-only registry.

**Phase 1 — decode types to reader**
DecodeContext, ArrayNode (+subtypes), EncodingDecoder, and FlatSegmentDecoder move from core.encoding to reader.decode / reader. ReadRegistry replaces Registry.decode() as the canonical read dispatcher. VortexReader, VortexHttpReader, VortexHandle, ScanIterator all take ReadRegistry. Test infra: TestRegistry, TestDecodeContexts, DecodeTestHelper move to reader/test (new test-jar); writer/test gains vortex-reader:test-jar dep. ReadRegistryTest replaces the decode subset of RegistryTest.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
1 parent af7c27c commit 1e7884b

238 files changed

Lines changed: 13350 additions & 14481 deletions
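
As a rough illustration of the split described in the commit message, the sketch below keeps encoders in a plain `Map<EncodingId, EncodingEncoder>` on the write side and routes decoding through a separate read-side dispatcher. Every type and signature here is a simplified, hypothetical stand-in: the real `EncodingId`, `EncodingDecoder`, `EncodingEncoder`, and `ReadRegistry` APIs are not visible in this excerpt, and the delta codec is only a toy example.

```java
import java.util.Arrays;
import java.util.Map;

// Hypothetical stand-ins; the project's real types and signatures are not shown here.
record EncodingId(String name) {}

interface EncodingEncoder { int[] encode(int[] values); }   // would live in the writer module
interface EncodingDecoder { int[] decode(int[] encoded); }  // would live in the reader module

// Read-side dispatcher sketch: it knows decoders only, never encoders.
final class ReadRegistrySketch {
    private final Map<EncodingId, EncodingDecoder> decoders;

    ReadRegistrySketch(Map<EncodingId, EncodingDecoder> decoders) {
        this.decoders = Map.copyOf(decoders);
    }

    int[] decode(EncodingId id, int[] encoded) {
        EncodingDecoder decoder = decoders.get(id);
        if (decoder == null) {
            throw new IllegalArgumentException("no decoder registered for " + id.name());
        }
        return decoder.decode(encoded);
    }
}

final class SplitRuntimeSketch {
    static final EncodingId DELTA = new EncodingId("sketch.delta");

    // Toy writer-side encoder: stores the difference to the previous value.
    static final EncodingEncoder DELTA_ENCODER = values -> {
        int[] out = new int[values.length];
        int prev = 0;
        for (int i = 0; i < values.length; i++) {
            out[i] = values[i] - prev;
            prev = values[i];
        }
        return out;
    };

    // Toy reader-side decoder: running sum restores the original values.
    static final EncodingDecoder DELTA_DECODER = encoded -> {
        int[] out = new int[encoded.length];
        int acc = 0;
        for (int i = 0; i < encoded.length; i++) {
            acc += encoded[i];
            out[i] = acc;
        }
        return out;
    };

    static int[] roundTrip(int[] values) {
        // Write path: a plain map of encoders, no shared bifunctional registry.
        Map<EncodingId, EncodingEncoder> encoders = Map.of(DELTA, DELTA_ENCODER);
        // Read path: a dispatcher that holds decoders only.
        ReadRegistrySketch readers = new ReadRegistrySketch(Map.of(DELTA, DELTA_DECODER));
        int[] encoded = encoders.get(DELTA).encode(values);
        return readers.decode(DELTA, encoded);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(roundTrip(new int[] {3, 5, 9}))); // prints [3, 5, 9]
    }
}
```

In this sketch the two sides share only the `EncodingId` key, which is the point of the refactor as described: a reader never needs encoder code on its classpath, and a writer never needs decoders.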

README.md

Lines changed: 11 additions & 7 deletions
@@ -4,10 +4,11 @@
 [![Maven Central](https://img.shields.io/maven-central/v/io.github.dfa1.vortex/vortex-reader.svg)](https://central.sonatype.com/artifact/io.github.dfa1.vortex/vortex-reader)
 [![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/license/Apache-2.0)
 
+> **Alpha** — not production-ready. APIs will change without notice.
+
 Pure-Java reader/writer for the [Vortex](https://github.com/vortex-data/vortex) columnar file format.
 100% Java, no JNI, no `sun.misc.Unsafe`. Uses the FFM API (`MemorySegment`/`Arena`, Java 25+)
-for zero-copy memory-mapped reads. Read benchmarks match or beat the Rust JNI on the workloads
-tested (Apple M5, JDK 25); see [docs/explanation.md#benchmarks](docs/explanation.md#benchmarks).
+for zero-copy memory-mapped reads.
 
 | Project | Language | Notes |
 |---------------------------------------------------------------------|----------|-----------------------------------------|
@@ -49,12 +50,15 @@ try (VortexReader vf = VortexReader.open(Path.of("data/example.vortex"));
 }
 ```
 
-> **Lifecycle.** `Chunk` owns a confined `Arena` — close it (try-with-resources
-> or `iter.forEachRemaining`) to release the decoded buffers. Full lifecycle
-> rules: [docs/explanation.md#memory-model](docs/explanation.md#memory-model).
+> **Lifecycle.** `ScanIterator` implements `Iterator<Chunk>` and `Chunk` implements
+> `AutoCloseable`. Each chunk owns a confined `Arena`; closing it releases the
+> decoded buffers. Calling `iter.next()` while a prior chunk is still open throws
+> `IllegalStateException`. Use try-with-resources, or
+> `iter.forEachRemaining(c -> ...)` which closes each chunk for you. See
+> [docs/explanation.md#memory-model](docs/explanation.md#memory-model).
 
-For more examples (writing, projection, filtering, custom encodings, CLI) see
-the documentation below.
+For more examples writing, projection, filtering, custom encodings, and the CLI
+see the documentation below.
 
 ## Documentation
 
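
The lifecycle rule added to the README above (one open chunk at a time, released by `close()`) can be sketched without the library. The classes below are hypothetical stand-ins, not the project's `Chunk` or `ScanIterator`: a boolean flag takes the place of the confined `Arena`, and the constructor arguments are invented for illustration.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical stand-in for Chunk: a flag models "decoded buffers still allocated".
final class SketchChunk implements AutoCloseable {
    private final int id;
    private boolean open = true;

    SketchChunk(int id) { this.id = id; }

    int id() { return id; }

    boolean isOpen() { return open; }

    @Override
    public void close() { open = false; } // safe to call more than once
}

// Hypothetical stand-in for ScanIterator: refuses to advance while the
// previously returned chunk has not been closed.
final class SketchScanIterator implements Iterator<SketchChunk> {
    private final int total;
    private int nextId = 0;
    private SketchChunk current;

    SketchScanIterator(int total) { this.total = total; }

    @Override
    public boolean hasNext() { return nextId < total; }

    @Override
    public SketchChunk next() {
        if (current != null && current.isOpen()) {
            throw new IllegalStateException("previous chunk is still open");
        }
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        current = new SketchChunk(nextId++);
        return current;
    }
}

final class ChunkLifecycleSketch {
    // Consumes every chunk with try-with-resources and returns how many were read.
    static int drain(SketchScanIterator it) {
        int count = 0;
        while (it.hasNext()) {
            try (SketchChunk chunk = it.next()) {
                count++; // a real caller would read from the chunk here
            }
        }
        return count;
    }

    // Returns true if a second next() without closing the first chunk throws.
    static boolean secondNextThrows() {
        SketchScanIterator it = new SketchScanIterator(2);
        SketchChunk first = it.next();
        try {
            it.next();
            return false;
        } catch (IllegalStateException expected) {
            return true;
        } finally {
            first.close();
        }
    }

    public static void main(String[] args) {
        System.out.println(drain(new SketchScanIterator(3))); // prints 3
        System.out.println(secondNextThrows());               // prints true
    }
}
```

Wrapping each `next()` call in try-with-resources guarantees the previous chunk is closed before the iterator advances, so the guard in `next()` only fires on misuse.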

TODO.md

Lines changed: 2 additions & 11 deletions
@@ -248,17 +248,8 @@ See [docs/compatibility.md](docs/compatibility.md) for the full encoding support
 using the 5-symbol generator from `OhlcEncodingInspectionIntegrationTest#writeOhlcMultiSymbol` and assert
 the global-dict file is smaller than the per-chunk-dict baseline.
 
-- [ ] **FSST symbol-table builder: port `fsst-rs` Algorithm 3**
-`FsstEncoding.Encoder` is a single-pass, bigram-only top-K table. Rust's
-`fsst-rs` (used by `vortex-fsst`) implements **Algorithm 3 from the FSST
-paper**: 5 generations of iterative training, symbols up to 8 bytes long,
-Lossy Perfect Hash Table for O(1) symbol lookup during compression. On the
-high-cardinality random ASCII benchmark
-(`FileSizeComparisonIntegrationTest#highCardinalityUtf8_javaVsJni`) the gap
-is Java 1.75× raw vs Rust 1.18× raw — purely encoder quality, the wire
-format and decoder are unchanged. Estimate: ~1 week of work.
-Reference: <https://www.vldb.org/pvldb/vol13/p2649-boncz.pdf>,
-<https://github.com/spiraldb/fsst/blob/develop/src/builder.rs>.
+- [ ] **FSST in CASCADE_CODECS** `FsstEncoding` exists but not in the cascade; Rust uses FSST for
+`store_and_fwd_flag`. Small gain on taxi (~0.1 MB).
 
 ### `vortex.zstd` known limitations
 

cli/src/main/java/io/github/dfa1/vortex/cli/tui/VortexInspectorTui.java

Lines changed: 1 addition & 4 deletions
@@ -571,11 +571,8 @@ private void runDictLoad(InspectorTree.Node dictNode) {
         try (java.lang.foreign.Arena arena = java.lang.foreign.Arena.ofConfined()) {
             int segIdx = values.segments().getFirst();
             SegmentSpec spec = tree.segmentSpecs().get(segIdx);
-            java.lang.foreign.MemorySegment seg = handle.slice(spec.offset(), spec.length());
             io.github.dfa1.vortex.core.array.Array arr =
-                    new io.github.dfa1.vortex.encoding.FlatSegmentDecoder(handle.registry())
-                            .decode(seg, handle.footer().arraySpecs(),
-                                    dtype, values.rowCount(), arena);
+                    handle.decodeFlatSegment(spec, dtype, values.rowCount(), arena);
             int n = (int) Math.min(arr.length(), DATA_PREVIEW_ROWS);
             List<String> out = new ArrayList<>(n);
             for (int i = 0; i < n; i++) {

core/pom.xml

Lines changed: 19 additions & 0 deletions
@@ -47,6 +47,25 @@
         </dependency>
     </dependencies>
 
+    <build>
+        <plugins>
+            <!-- Publish a test-jar so reader/ and writer/ tests can reuse core test
+                 helpers (DTypes, EncodeTestHelper) without duplication. -->
+            <plugin>
+                <groupId>org.apache.maven.plugins</groupId>
+                <artifactId>maven-jar-plugin</artifactId>
+                <executions>
+                    <execution>
+                        <id>publish-test-jar</id>
+                        <goals>
+                            <goal>test-jar</goal>
+                        </goals>
+                    </execution>
+                </executions>
+            </plugin>
+        </plugins>
+    </build>
+
     <!--
     Generated sources (src/main/java/…/fbs and …/proto) are committed to the repo.
     Normal builds need no external tools.

core/src/main/java/io/github/dfa1/vortex/core/array/ByteArray.java

Lines changed: 2 additions & 1 deletion
@@ -76,8 +76,9 @@ public long fold(long identity, LongBinaryOperator op) {
                 result = op.applyAsLong(result, buf.get(ValueLayout.JAVA_BYTE, i));
             }
         } else {
+            long cap = elementCount;
             for (long i = 0; i < n; i++) {
-                result = op.applyAsLong(result, buf.get(ValueLayout.JAVA_BYTE, i % elementCount));
+                result = op.applyAsLong(result, buf.get(ValueLayout.JAVA_BYTE, i % cap));
             }
         }
         return result;

core/src/main/java/io/github/dfa1/vortex/core/array/DoubleArray.java

Lines changed: 4 additions & 2 deletions
@@ -65,8 +65,9 @@ public void forEachDouble(DoubleConsumer c) {
                 c.accept(buf.getAtIndex(PTypeIO.LE_DOUBLE, i));
             }
         } else {
+            long cap = elementCount;
             for (long i = 0; i < n; i++) {
-                c.accept(buf.getAtIndex(PTypeIO.LE_DOUBLE, i % elementCount));
+                c.accept(buf.getAtIndex(PTypeIO.LE_DOUBLE, i % cap));
             }
         }
     }
@@ -85,8 +86,9 @@ public double fold(double identity, DoubleBinaryOperator op) {
                 result = op.applyAsDouble(result, buf.getAtIndex(PTypeIO.LE_DOUBLE, i));
             }
         } else {
+            long cap = elementCount;
             for (long i = 0; i < n; i++) {
-                result = op.applyAsDouble(result, buf.getAtIndex(PTypeIO.LE_DOUBLE, i % elementCount));
+                result = op.applyAsDouble(result, buf.getAtIndex(PTypeIO.LE_DOUBLE, i % cap));
             }
         }
         return result;

core/src/main/java/io/github/dfa1/vortex/core/array/FloatArray.java

Lines changed: 2 additions & 1 deletion
@@ -63,8 +63,9 @@ public double fold(double identity, DoubleBinaryOperator op) {
                 result = op.applyAsDouble(result, buf.getAtIndex(PTypeIO.LE_FLOAT, i));
             }
         } else {
+            long cap = elementCount;
             for (long i = 0; i < n; i++) {
-                result = op.applyAsDouble(result, buf.getAtIndex(PTypeIO.LE_FLOAT, i % elementCount));
+                result = op.applyAsDouble(result, buf.getAtIndex(PTypeIO.LE_FLOAT, i % cap));
             }
         }
         return result;

core/src/main/java/io/github/dfa1/vortex/core/array/IntArray.java

Lines changed: 4 additions & 2 deletions
@@ -63,8 +63,9 @@ public void forEachInt(IntConsumer c) {
                 c.accept(buf.getAtIndex(PTypeIO.LE_INT, i));
             }
         } else {
+            long cap = elementCount;
             for (long i = 0; i < n; i++) {
-                c.accept(buf.getAtIndex(PTypeIO.LE_INT, i % elementCount));
+                c.accept(buf.getAtIndex(PTypeIO.LE_INT, i % cap));
             }
         }
     }
@@ -83,8 +84,9 @@ public int fold(int identity, IntBinaryOperator op) {
                 result = op.applyAsInt(result, buf.getAtIndex(PTypeIO.LE_INT, i));
             }
         } else {
+            long cap = elementCount;
             for (long i = 0; i < n; i++) {
-                result = op.applyAsInt(result, buf.getAtIndex(PTypeIO.LE_INT, i % elementCount));
+                result = op.applyAsInt(result, buf.getAtIndex(PTypeIO.LE_INT, i % cap));
             }
         }
         return result;

core/src/main/java/io/github/dfa1/vortex/core/array/LongArray.java

Lines changed: 4 additions & 2 deletions
@@ -65,8 +65,9 @@ public void forEachLong(LongConsumer c) {
                 c.accept(buf.getAtIndex(PTypeIO.LE_LONG, i));
             }
         } else {
+            long cap = elementCount;
             for (long i = 0; i < n; i++) {
-                c.accept(buf.getAtIndex(PTypeIO.LE_LONG, i % elementCount));
+                c.accept(buf.getAtIndex(PTypeIO.LE_LONG, i % cap));
             }
         }
     }
@@ -85,8 +86,9 @@ public long fold(long identity, LongBinaryOperator op) {
                 result = op.applyAsLong(result, buf.getAtIndex(PTypeIO.LE_LONG, i));
             }
         } else {
+            long cap = elementCount;
             for (long i = 0; i < n; i++) {
-                result = op.applyAsLong(result, buf.getAtIndex(PTypeIO.LE_LONG, i % elementCount));
+                result = op.applyAsLong(result, buf.getAtIndex(PTypeIO.LE_LONG, i % cap));
             }
         }
         return result;

core/src/main/java/io/github/dfa1/vortex/core/array/ShortArray.java

Lines changed: 2 additions & 1 deletion
@@ -74,8 +74,9 @@ public long fold(long identity, LongBinaryOperator op) {
                 result = op.applyAsLong(result, buf.getAtIndex(PTypeIO.LE_SHORT, i));
             }
         } else {
+            long cap = elementCount;
             for (long i = 0; i < n; i++) {
-                result = op.applyAsLong(result, buf.getAtIndex(PTypeIO.LE_SHORT, i % elementCount));
+                result = op.applyAsLong(result, buf.getAtIndex(PTypeIO.LE_SHORT, i % cap));
             }
         }
         return result;
