@ShimonSte

Summary

Checklist

Delete items not relevant to your PR:

  • Unit and integration tests covering the common scenarios were added
  • A human-readable description of the changes was provided to include in CHANGELOG
  • For significant changes, documentation in https://github.com/ClickHouse/clickhouse-docs was updated with further explanations or tutorials

@windsurf-bot commented Nov 12, 2025

This PR has too many files to review (>50 files).

mzitnik force-pushed the update-java-client-version branch from 98365d4 to 18b4fcb on November 13, 2025 09:45
ShimonSte force-pushed the spark-4-support branch 2 times, most recently from 7ce7af5 to 898a545 on November 16, 2025 09:03
mzitnik and others added 18 commits on November 16, 2025 13:03
* Upgrade Java Client to V2 syncQuery & syncInsert

* Refactor to use the new client v2 api

* Add timeout to query operation

* Clean NodeClient

* Change binary reader

* Update client version

* Fix project to use snapshots

* Merge with main

* Run spotlessScalaApply and implement readAllBytes, since Java 8 does not support it

* Remove unneeded remarks

* Change to client version 0.9.3

* Update socket timeout in new client

* Change max connections to 20

* ConnectTimeout to 1200000

* Add 3 sec to sleep

* Set a new setConnectionRequestTimeout for experimentation

* spotlessScalaApply fix

* Fix/json reader fixedstring v2 (#448)

* Wake up ClickHouse Cloud instance before tests (#429)

* fix: Handle FixedString as plain text in JSON reader for all Spark versions

Problem:
ClickHouse returns FixedString as plain text in JSON format, but the
connector was trying to decode it as Base64, causing InvalidFormatException.

Solution:
Use pattern matching with guard to check if the JSON node is textual.
- If textual (FixedString): decode as UTF-8 bytes
- If not textual (true binary): decode as Base64

Applied to Spark 3.3, 3.4, and 3.5.

---------

Co-authored-by: Bentsi Leviav <bentsi.leviav@clickhouse.com>
Co-authored-by: Shimon Steinitz <shimonsteinitz@Shimons-MacBook-Pro.local>
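
As a rough illustration of the guard described in the FixedString fix above, the decoding step could look like the Scala sketch below. The helper name `decodeBinaryNode` and the Jackson-based types are assumptions for illustration; the connector's actual JSON reader differs in context and naming.

```scala
import java.nio.charset.StandardCharsets
import com.fasterxml.jackson.databind.JsonNode

// Hypothetical helper mirroring the fix: FixedString arrives as plain text,
// so it is taken as UTF-8 bytes instead of being fed to the Base64 decoder.
def decodeBinaryNode(node: JsonNode): Array[Byte] = node match {
  case n if n.isTextual => n.asText.getBytes(StandardCharsets.UTF_8) // FixedString: plain text
  case n                => n.binaryValue()                           // true binary: Base64-backed node
}
```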

* Added reader and writer tests (#449)

* Wake up ClickHouse Cloud instance before tests (#429)

* feat: Add comprehensive read test coverage for Spark 3.3, 3.4, and 3.5

Add shared test trait ClickHouseReaderTestBase with 48 test scenarios covering:
- All primitive types (Boolean, Byte, Short, Int, Long, Float, Double)
- Large integers (UInt64, Int128, UInt128, Int256, UInt256)
- Decimals (Decimal32, Decimal64, Decimal128)
- Date/Time types (Date, Date32, DateTime, DateTime32, DateTime64)
- String types (String, UUID, FixedString)
- Enums (Enum8, Enum16)
- IP addresses (IPv4, IPv6)
- JSON data
- Collections (Arrays, Maps)
- Edge cases (empty strings, long strings, empty arrays, nullable variants)

Test suites for Binary and JSON read formats.

Test results: 96 tests per Spark version (288 total)
- Binary format: 47/48 passing
- JSON format: 47/48 passing
- Overall: 94/96 passing per version (98% pass rate)

Remaining failures are known bugs with fixes on separate branches.

* feat: Add comprehensive write test coverage for Spark 3.3, 3.4, and 3.5

Add shared test trait ClickHouseWriterTestBase with 17 test scenarios covering:
- Primitive types (Boolean, Byte, Short, Int, Long, Float, Double)
- Decimal types
- String types (regular and empty strings)
- Date and Timestamp types
- Collections (Arrays and Maps, including empty variants)
- Nullable variants

Test suites for JSON and Arrow write formats.
Note: Binary write format is not supported (only JSON and Arrow).

Test results: 34 tests per Spark version (102 total)
- JSON format: 17/17 passing (100%)
- Arrow format: 17/17 passing (100%)
- Overall: 34/34 passing per version (100% pass rate)

Known behavior: Boolean values write as BooleanType but read back as ShortType (0/1)
due to ClickHouse storing Boolean as UInt8.

* style: Apply spotless formatting

* style: Apply spotless formatting for Spark 3.3 and 3.4

Remove trailing whitespace from test files to pass CI spotless checks.

* fix: Change write format from binary to arrow in BinaryReaderSuite

The 'binary' write format option doesn't exist. Changed to 'arrow'
which is a valid write format option.

Applied to Spark 3.3, 3.4, and 3.5.

* test: Add nullable tests for ShortType, IntegerType, and LongType

Added missing nullable variant tests to ensure comprehensive coverage:
- decode ShortType - nullable with null values (Nullable(Int16))
- decode IntegerType - nullable with null values (Nullable(Int32))
- decode LongType - nullable with null values (Nullable(Int64))

These tests verify that nullable primitive types correctly handle NULL
values in both Binary and JSON read formats.

Applied to Spark 3.3, 3.4, and 3.5.

Total tests per Spark version: 51 (was 48)
Total across all versions: 153 (was 144)

* Refactor ClickHouseReaderTestBase: Add nullable tests and organize alphabetically

- Add missing nullable test cases for: Date32, Decimal32, Decimal128, UInt16, UUID, DateTime64
- Organize all 69 tests alphabetically by data type for better maintainability
- Ensure comprehensive coverage with both nullable and non-nullable variants for all data types
- Apply changes consistently across Spark 3.3, 3.4, and 3.5

* ci: Skip cloud tests on forks where secrets are unavailable

Add repository check to cloud workflow to prevent failures on forks
that don't have access to ClickHouse Cloud secrets. Tests will still
run on the main repository where secrets are properly configured.

* Refactor and enhance Reader/Writer tests for all Spark versions

- Add BooleanType tests to Reader (2 tests) with format-aware assertions
- Add 6 new tests to Writer: nested arrays, arrays with nullable elements,
  multiple Decimal precisions (18,4 and 38,10), Map with nullable values, and StructType
- Reorder all tests lexicographically for better organization
- Writer tests increased from 17 to 33 tests
- Reader tests increased from 69 to 71 tests
- Remove section header comments for cleaner code
- Apply changes to all Spark versions: 3.3, 3.4, and 3.5
- All tests now properly sorted alphabetically by data type and variant

* style: Apply spotless formatting to Reader/Writer tests

---------

Co-authored-by: Bentsi Leviav <bentsi.leviav@clickhouse.com>
Co-authored-by: Shimon Steinitz <shimon.steinitz@clickhouse.com>

* Fix BinaryReader to handle new Java client types

- Fix DecimalType: Handle both BigInteger (Int256/UInt256) and BigDecimal (Decimal types)
- Fix ArrayType: Direct call to BinaryStreamReader.ArrayValue.getArrayOfObjects()
- Fix StringType: Handle UUID, InetAddress, and EnumValue types
- Fix DateType: Handle both LocalDate and ZonedDateTime
- Fix MapType: Handle all util.Map implementations

Removed reflection and defensive pattern matching for better performance.
All 34 Binary Reader test failures are now fixed (71/71 tests passing).

Fixes compatibility with new Java client API in update-java-client-version branch.
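
A hedged sketch of the type dispatch described above, not the connector's actual code; the helper names `toSparkDecimal` and `toUTF8String` are illustrative, and the real BinaryReader handles more cases.

```scala
import java.math.{BigDecimal => JBigDecimal, BigInteger}
import org.apache.spark.sql.types.{Decimal, DecimalType}
import org.apache.spark.unsafe.types.UTF8String

// Int256/UInt256 come back as BigInteger, Decimal32/64/128 as BigDecimal.
def toSparkDecimal(value: Any, dt: DecimalType): Decimal = value match {
  case bi: BigInteger  => Decimal(new JBigDecimal(bi), dt.precision, dt.scale)
  case bd: JBigDecimal => Decimal(bd, dt.precision, dt.scale)
}

// UUID, InetAddress, and enum values are all rendered as Spark strings.
def toUTF8String(value: Any): UTF8String = value match {
  case u: java.util.UUID       => UTF8String.fromString(u.toString)
  case a: java.net.InetAddress => UTF8String.fromString(a.getHostAddress)
  case other                   => UTF8String.fromString(other.toString)
}
```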

* Add high-precision decimal tests with tolerance

- Add Decimal(18,4) test with 0.001 tolerance for JSON/Arrow formats
- Documents precision limitation for decimals with >15-17 significant digits
- Uses tolerance-based assertions to account for observed precision loss
- Binary format preserves full precision (already tested in Binary Reader suite)
- All 278 tests passing
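
The tolerance-based assertion style mentioned above could look roughly like this ScalaTest sketch; the suite name, values, and base class are illustrative, not the repository's actual test code.

```scala
import org.scalatest.funsuite.AnyFunSuite

class DecimalToleranceSketch extends AnyFunSuite {
  test("Decimal(18,4) survives a JSON round-trip within 0.001") {
    val written  = BigDecimal("12345678901234.5678")
    val readBack = BigDecimal("12345678901234.5679") // hypothetical value observed after the round-trip
    assert((readBack - written).abs <= BigDecimal("0.001"))
  }
}
```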

* Simplify build-and-test workflow trigger to run on all pushes

* Fix Scala 2.13 compatibility for nested arrays

- Convert mutable.ArraySeq to Array in ClickHouseJsonReader to ensure immutable collections
- Add test workaround for Spark's Row.getSeq behavior in Scala 2.13
- Fix Spotless formatting: remove trailing whitespace in ClickHouseBinaryReader
- Applied to all Spark versions: 3.3, 3.4, 3.5
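
A minimal sketch of the Scala 2.13 conversion described above; `normalizeArray` is an illustrative name, and the connector applies the conversion inside its JSON reader rather than as a standalone helper.

```scala
import scala.collection.mutable

// On Scala 2.13, array values can surface as mutable.ArraySeq; materialize
// them into a plain Array before handing them to Spark.
def normalizeArray(value: Any): Any = value match {
  case seq: mutable.ArraySeq[_] => seq.toArray[Any]
  case other                    => other
}
```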

* Update java version to 0.9.4

* Enable compression

* add logging TPCDSClusterSuite & change client buffers

* Change InputStream read code

* Remove hard coded settings for experiments

* Clean log from insert method

---------

Co-authored-by: Shimon Steinitz <shimonste@gmail.com>
Co-authored-by: Bentsi Leviav <bentsi.leviav@clickhouse.com>
Co-authored-by: Shimon Steinitz <shimonsteinitz@Shimons-MacBook-Pro.local>
Co-authored-by: Shimon Steinitz <shimon.steinitz@clickhouse.com>
- Added Spark 4.0.1 configuration to gradle.properties
- Created spark-4.0 module with source code copied from spark-3.5
- Updated API calls for Spark 4.0 compatibility:
  - AnalysisException constructor now requires errorClass and messageParameters
  - NoSuchNamespaceException and NoSuchTableException constructors use Array/Seq
  - ArrowUtils.toArrowSchema now requires largeVarTypes parameter
- Updated ANTLR version to 4.13.1 to match Spark 4.0 dependencies
- All 278 integration tests passing
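
As a hedged sketch of the API adjustments listed above (parameter names and accessibility should be verified against the Spark 4.0 sources; the error class used here is only a stand-in):

```scala
import org.apache.spark.sql.AnalysisException
import org.apache.spark.sql.catalyst.analysis.NoSuchNamespaceException

object Spark4ApiSketch {
  // Spark 4.0 builds AnalysisException from an error class plus message
  // parameters; the class must exist in Spark's error registry, so
  // INTERNAL_ERROR is used here purely as a placeholder.
  def unsupported(msg: String): AnalysisException =
    new AnalysisException(
      errorClass = "INTERNAL_ERROR",
      messageParameters = Map("message" -> msg))

  // Namespace-not-found errors now take the namespace as Array[String]
  // rather than a pre-formatted message string.
  def missingNamespace(db: String): NoSuchNamespaceException =
    new NoSuchNamespaceException(Array(db))
}
```
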
- StreamingRateExample: Streaming app using rate source for continuous debugging
- SimpleBatchExample: Simple batch app for basic debugging scenarios
- Comprehensive README with setup and usage instructions
- DEBUGGING.md with detailed debugging guide for IntelliJ and VS Code
- run-example.sh script for easy execution
- Examples allow setting breakpoints in connector code during execution
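
For orientation, a rate-source debugging app in the spirit of StreamingRateExample could look roughly like the sketch below; the catalog class, connector options, and table name are assumptions, not the example's actual contents.

```scala
import org.apache.spark.sql.SparkSession

object StreamingRateSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("StreamingRateSketch")
      .master("local[2]")
      // Catalog class and options are illustrative; check the connector docs.
      .config("spark.sql.catalog.clickhouse", "com.clickhouse.spark.ClickHouseCatalog")
      .config("spark.sql.catalog.clickhouse.host", "localhost")
      .getOrCreate()

    // The rate source emits (timestamp, value) rows continuously, which makes
    // it convenient for hitting the connector's write path under a debugger.
    val stream = spark.readStream
      .format("rate")
      .option("rowsPerSecond", 5)
      .load()

    stream.writeStream
      .option("checkpointLocation", "/tmp/rate-sketch-checkpoint")
      .toTable("clickhouse.default.rate_events") // assumes the target table exists
      .awaitTermination()
  }
}
```
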
- Reorganize examples into proper src/main/scala directory structure
- Add dedicated build.gradle for examples module
- Improve StreamingRateExample with catalog-aware V2 writer
- Update SimpleBatchExample with better error handling
- Enhance run-example.sh script with better argument handling
- Apply consistent code formatting across examples
- Add Spark 4.0 to build-and-test.yml matrix
- Add Spark 4.0 to cloud.yml matrix for ClickHouse Cloud tests
- Add Spark 4.0 to tpcds.yml matrix for TPC-DS benchmarks
- Ensures Spark 4.0 runs all the same CI steps as Spark 3.3, 3.4, and 3.5
Spark 4.0 only supports Scala 2.13, not 2.12.
Add matrix exclusions to prevent invalid build combinations:
- build-and-test.yml: exclude Spark 4.0 + Scala 2.12
- cloud.yml: exclude Spark 4.0 + Scala 2.12
- tpcds.yml: exclude Spark 4.0 + Scala 2.12

This fixes the CI error: 'Found unsupported Spark version: 4'
The matrix now correctly passes '4.0' for all Spark 4.0 builds.
YAML interprets 4.0 as a float and truncates it to 4, causing:
'Found unsupported Spark version: 4'

Solution: Quote all version numbers in matrix definitions:
- spark: [ '3.3', '3.4', '3.5', '4.0' ]
- scala: [ '2.12', '2.13' ]

This ensures versions are passed as strings to Gradle:
-Dspark_binary_version=4.0 (not 4)
-Dscala_binary_version=2.13 (passed as a string, not a YAML float)

Applied to all three workflows:
- build-and-test.yml
- cloud.yml
- tpcds.yml
Spark 4.0 requires Java 11+ and cannot run on Java 8.
The ANTLR 4.13.1 tool (used for Spark 4.0) is compiled with Java 11,
causing this error on Java 8:
'class file version 55.0, this version only recognizes up to 52.0'

Changes:
- build-and-test.yml: Exclude Spark 4.0 + Java 8 combinations
- cloud.yml: Add Java to matrix, use Java 17 for Spark 4.0
- tpcds.yml: Add Java to matrix, use Java 17 for Spark 4.0

Matrix now produces:
- Spark 3.3/3.4/3.5: Java 8 + Scala 2.12/2.13
- Spark 4.0: Java 17 + Scala 2.13 only
The run-tests-with-specific-clickhouse job doesn't specify Spark/Scala
versions, so it uses the defaults from gradle.properties (Spark 4.0 and
Scala 2.13). Since Spark 4.0 requires Java 11+, this job was failing
with Java 8.

Changed java-version from 8 to 17 to match the default Spark 4.0
requirements.
beforeAll() and afterAll() methods require the BeforeAndAfterAll trait
from ScalaTest. Added the missing trait and import.
The trait order matters in Scala. BeforeAndAfterAll must be mixed in
BEFORE ForAllTestContainer so that our beforeAll() override is called
in the correct order (after the container starts).

Tested locally and confirmed all tests pass.
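
A minimal sketch of the ordering described above (import paths depend on the testcontainers-scala version, and the suite name is illustrative):

```scala
import com.dimafeng.testcontainers.ForAllTestContainer
import org.scalatest.BeforeAndAfterAll
import org.scalatest.funsuite.AnyFunSuite

// BeforeAndAfterAll is mixed in before ForAllTestContainer, so the container
// is already running by the time our beforeAll() body executes.
abstract class ContainerSuiteSketch
  extends AnyFunSuite
    with BeforeAndAfterAll
    with ForAllTestContainer {

  override def beforeAll(): Unit = {
    super.beforeAll()
    // safe to create databases/tables here: the container has started
  }
}
```
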
Removed the Logging trait from ClickHouseSingleMixIn to avoid conflict
with Spark's Logging trait (which has 'protected def log' vs our 'lazy val log').

Spark 4.0 test classes inherit from SharedSparkSession which already includes
Spark's Logging trait, causing a compilation error when ClickHouseSingleMixIn
also mixed in our Logging trait.

Changed log.info() calls to println() for container lifecycle logging.
Tests compile and pass successfully.
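
For context, the removed local trait had roughly the shape sketched below (illustrative only); because Spark's org.apache.spark.internal.Logging also declares a `log` member, mixing both into one test class produced a conflicting-member compile error.

```scala
import org.slf4j.{Logger, LoggerFactory}

// Stand-in for the removed connector-side Logging trait.
trait ConnectorLogging {
  lazy val log: Logger = LoggerFactory.getLogger(getClass)
}
```
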
Enable all CI checks (TPC-DS, Cloud, Style, License) to run on spark-4-support branch
- NodeClient: Remove custom connection timeouts and pool settings
- NodeClient: Rename clientV2 to client for consistency
- NodeClient: Use IOUtils.toByteArray() instead of Stream-based byte reading
- TPCDSClusterSuite: Add logging to track table processing time

These changes align with the update-java-client-version branch while
preserving Spark 4.0 support.
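
The stream-reading change mentioned above could be sketched as follows (the helper name `readAll` is illustrative; the actual NodeClient code wraps this differently):

```scala
import java.io.InputStream
import org.apache.commons.io.IOUtils

// Drain the whole response stream with Commons IO instead of a manual read loop.
def readAll(in: InputStream): Array[Byte] =
  try IOUtils.toByteArray(in)
  finally in.close()
```
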
- Add Apache License headers to all Spark 4.0 example files
- Fix code style with spotlessApply
- Clean up ClickHouseSingleMixIn formatting

Fixes check-license CI failures.
Spark 4.0 requires Java 17, while Spark 3.x uses Java 8.
Add matrix.java to select the correct Java version per Spark version.
YAML interprets 4.0 as a float and truncates it to 4.
Quote all version numbers to preserve them as strings.