18 changes: 11 additions & 7 deletions README.md
@@ -4,10 +4,11 @@
[![Maven Central](https://img.shields.io/maven-central/v/io.github.dfa1.vortex/vortex-reader.svg)](https://central.sonatype.com/artifact/io.github.dfa1.vortex/vortex-reader)
[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/license/Apache-2.0)

> **Alpha** — not production-ready. APIs will change without notice.

Pure-Java reader/writer for the [Vortex](https://github.com/vortex-data/vortex) columnar file format.
100% Java, no JNI, no `sun.misc.Unsafe`. Uses the FFM API (`MemorySegment`/`Arena`, Java 25+)
for zero-copy memory-mapped reads. Read benchmarks match or beat the Rust JNI on the workloads
tested (Apple M5, JDK 25); see [docs/explanation.md#benchmarks](docs/explanation.md#benchmarks).
for zero-copy memory-mapped reads.

| Project | Language | Notes |
|---------------------------------------------------------------------|----------|-----------------------------------------|
@@ -49,12 +50,15 @@ try (VortexReader vf = VortexReader.open(Path.of("data/example.vortex"));
}
```

> **Lifecycle.** `Chunk` owns a confined `Arena` — close it (try-with-resources
> or `iter.forEachRemaining`) to release the decoded buffers. Full lifecycle
> rules: [docs/explanation.md#memory-model](docs/explanation.md#memory-model).
> **Lifecycle.** `ScanIterator` implements `Iterator<Chunk>` and `Chunk` implements
> `AutoCloseable`. Each chunk owns a confined `Arena`; closing it releases the
> decoded buffers. Calling `iter.next()` while a prior chunk is still open throws
> `IllegalStateException`. Use try-with-resources, or
> `iter.forEachRemaining(c -> ...)` which closes each chunk for you. See
> [docs/explanation.md#memory-model](docs/explanation.md#memory-model).
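
The close-before-next contract described above can be sketched with stand-in types. This is a minimal illustration only: `DemoChunk` and `DemoScanIterator` are hypothetical names, not the library's `Chunk` or `ScanIterator`, and a boolean flag stands in for the confined `Arena` that the real chunk is said to own. The library's `forEachRemaining` behavior is not modeled here.

```java
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

public class ChunkLifecycleSketch {

    /** Hypothetical stand-in for a chunk; a flag replaces the Arena it would own. */
    static final class DemoChunk implements AutoCloseable {
        private final int id;
        private boolean open = true;

        DemoChunk(int id) { this.id = id; }

        int id() { return id; }
        boolean isOpen() { return open; }

        @Override
        public void close() { open = false; } // real code would release its buffers here
    }

    /** Hypothetical iterator that refuses to advance while the previous chunk is open. */
    static final class DemoScanIterator implements Iterator<DemoChunk> {
        private final Iterator<Integer> ids;
        private DemoChunk current;

        DemoScanIterator(List<Integer> ids) { this.ids = ids.iterator(); }

        @Override
        public boolean hasNext() { return ids.hasNext(); }

        @Override
        public DemoChunk next() {
            if (current != null && current.isOpen()) {
                throw new IllegalStateException("previous chunk is still open");
            }
            if (!ids.hasNext()) {
                throw new NoSuchElementException();
            }
            current = new DemoChunk(ids.next());
            return current;
        }
    }

    /** Visits every chunk with try-with-resources and returns how many were read. */
    static int drain(Iterator<DemoChunk> iter) {
        int count = 0;
        while (iter.hasNext()) {
            try (DemoChunk chunk = iter.next()) {
                count++; // chunk is closed automatically at the end of this block
            }
        }
        return count;
    }

    /** Returns true if calling next() with an open chunk throws IllegalStateException. */
    static boolean nextWhileOpenThrows() {
        DemoScanIterator iter = new DemoScanIterator(List.of(1, 2));
        DemoChunk leaked = iter.next(); // deliberately left open
        try {
            iter.next();
            return false;
        } catch (IllegalStateException expected) {
            return true;
        } finally {
            leaked.close();
        }
    }

    public static void main(String[] args) {
        System.out.println(drain(new DemoScanIterator(List.of(1, 2, 3))));
        System.out.println(nextWhileOpenThrows());
    }
}
```

Wrapping each `next()` call in try-with-resources keeps at most one chunk open at a time, which is the pattern the lifecycle note recommends.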

For more examples (writing, projection, filtering, custom encodings, CLI) see
the documentation below.
For more examples (writing, projection, filtering, custom encodings, and the CLI),
see the documentation below.

## Documentation

13 changes: 2 additions & 11 deletions TODO.md
@@ -248,17 +248,8 @@ See [docs/compatibility.md](docs/compatibility.md) for the full encoding support
using the 5-symbol generator from `OhlcEncodingInspectionIntegrationTest#writeOhlcMultiSymbol` and assert
the global-dict file is smaller than the per-chunk-dict baseline.

- [ ] **FSST symbol-table builder: port `fsst-rs` Algorithm 3** —
`FsstEncoding.Encoder` is a single-pass, bigram-only top-K table. Rust's
`fsst-rs` (used by `vortex-fsst`) implements **Algorithm 3 from the FSST
paper**: 5 generations of iterative training, symbols up to 8 bytes long,
Lossy Perfect Hash Table for O(1) symbol lookup during compression. On the
high-cardinality random ASCII benchmark
(`FileSizeComparisonIntegrationTest#highCardinalityUtf8_javaVsJni`) the gap
is Java 1.75× raw vs Rust 1.18× raw — purely encoder quality, the wire
format and decoder are unchanged. Estimate: ~1 week of work.
Reference: <https://www.vldb.org/pvldb/vol13/p2649-boncz.pdf>,
<https://github.com/spiraldb/fsst/blob/develop/src/builder.rs>.
- [ ] **FSST in CASCADE_CODECS** — `FsstEncoding` exists but not in the cascade; Rust uses FSST for
`store_and_fwd_flag`. Small gain on taxi (~0.1 MB).

### `vortex.zstd` known limitations

@@ -571,11 +571,8 @@ private void runDictLoad(InspectorTree.Node dictNode) {
try (java.lang.foreign.Arena arena = java.lang.foreign.Arena.ofConfined()) {
int segIdx = values.segments().getFirst();
SegmentSpec spec = tree.segmentSpecs().get(segIdx);
java.lang.foreign.MemorySegment seg = handle.slice(spec.offset(), spec.length());
io.github.dfa1.vortex.core.array.Array arr =
new io.github.dfa1.vortex.encoding.FlatSegmentDecoder(handle.registry())
.decode(seg, handle.footer().arraySpecs(),
dtype, values.rowCount(), arena);
handle.decodeFlatSegment(spec, dtype, values.rowCount(), arena);
int n = (int) Math.min(arr.length(), DATA_PREVIEW_ROWS);
List<String> out = new ArrayList<>(n);
for (int i = 0; i < n; i++) {
19 changes: 19 additions & 0 deletions core/pom.xml
@@ -47,6 +47,25 @@
</dependency>
</dependencies>

<build>
<plugins>
<!-- Publish a test-jar so reader/ and writer/ tests can reuse core test
helpers (DTypes, EncodeTestHelper) without duplication. -->
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-jar-plugin</artifactId>
<executions>
<execution>
<id>publish-test-jar</id>
<goals>
<goal>test-jar</goal>
</goals>
</execution>
</executions>
</plugin>
</plugins>
</build>

<!--
Generated sources (src/main/java/…/fbs and …/proto) are committed to the repo.
Normal builds need no external tools.