Skip to content

Conversation

@majin1102
Copy link
Contributor

@majin1102 majin1102 commented Jan 21, 2026

Close #5744

For now arrow-java does't provide a convenient way like pa.json(). So we extend the field type in lance java binding to support related json udf

@github-actions github-actions bot added enhancement New feature or request java labels Jan 21, 2026
@github-actions
Copy link
Contributor

Code Review for PR #5770

Summary

This PR adds Java utilities for creating Arrow JSON extension fields and a test to verify JSON extraction via scanning works correctly.

P0 Issues

None identified - The implementation is correct and addresses the issue #5744.

P1 Issues / Suggestions

  1. Missing Javadoc on constants (Minor)
    The class uses EXTENSION_NAME_KEY and ARROW_JSON_EXTENSION_NAME constants but doesn't document their relationship to the Arrow specification. Consider adding a brief class-level Javadoc explaining that these constants align with the Arrow JSON extension type specification (arrow.json).

  2. Consider adding jsonLargeBinary helper
    Looking at rust/lance-arrow/src/json.rs, Lance internally uses LargeBinary with lance.json extension for JSONB storage. The current PR only exposes arrow.json extension (which uses Utf8/LargeUtf8 and gets converted to JSONB on write). This is the correct approach for users writing JSON data, but a comment noting this would help future maintainers understand the data flow:

    • User writes arrow.json (Utf8) → Lance converts to lance.json (LargeBinary JSONB) on write
    • On read, Lance converts back to arrow.json (Utf8)
  3. Test coverage is good - The test properly validates:

    • Creating a dataset with JSON extension field
    • Writing JSON data
    • Querying with json_extract filter
    • Verifying correct row filtering

Positive Aspects

  • Correct use of Collections.unmodifiableMap for immutable metadata
  • Both Utf8 and LargeUtf8 variants are provided
  • Test follows existing patterns in the codebase (uses @TempDir, proper resource cleanup with try-with-resources)
  • The solution aligns with how PyArrow's pa.json_() works, providing parity between Python and Java SDKs

Verdict

LGTM ✓ - The PR correctly solves the issue and the implementation is clean. The test adequately covers the use case.

@majin1102
Copy link
Contributor Author

@Xuanwo Could you take a look at this json related feat for Java binding?

Thanks

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request java

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug lanceJava Sdk] json_extract fails with InvalidJsonb when querying Utf8 or LargeBinary columns containing raw JSON strings

1 participant