diff --git a/.gitignore b/.gitignore index a3c97ff992..9abb2c791e 100644 --- a/.gitignore +++ b/.gitignore @@ -27,3 +27,4 @@ output docs/comet-*/ docs/build/ docs/temp/ +dev/ci/spark-sql-tests/logs/ diff --git a/dev/ci/spark-sql-tests/README.md b/dev/ci/spark-sql-tests/README.md new file mode 100644 index 0000000000..41afe46f9d --- /dev/null +++ b/dev/ci/spark-sql-tests/README.md @@ -0,0 +1,121 @@ + + +# Local Spark SQL Tests + +These scripts reproduce the `spark_sql_test.yml` GitHub Actions workflow on a +developer machine. They run Spark's own SQL test suites with Comet enabled, +which is useful for debugging a Spark SQL test failure locally instead of +waiting on CI. + +The Spark version is selected with `SPARK_VERSION` and defaults to `4.1.1`. +Supported versions, each mirroring a CI matrix config: + +| `SPARK_VERSION` | JDK used by CI | +|-----------------|----------------| +| `3.4.3` | 11 | +| `3.5.8` | 11 | +| `4.0.2` | 21 | +| `4.1.1` | 17 | + +## Prerequisites + +- A JDK with `JAVA_HOME` set, matching the Spark version under test (see the + table above). `run.sh` warns if the active JDK differs from the one CI uses. +- A Rust toolchain, plus `protobuf-compiler` and `clang`, for the Comet native build. +- Git, and enough disk space for an `apache/spark` checkout and its build output. 
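The JDK check mentioned in the prerequisites can be sketched in isolation. This is a minimal sketch, not the code in `run.sh`: the function name `jdk_major` is hypothetical, and unlike `run.sh`, which takes the first integer of the version string, this variant also maps the legacy `1.x` scheme (for example `1.8.0`) to its major version. It parses only the first line of the `java -version` banner, so it can be tried without a JDK installed.

```sh
# Hypothetical helper (not part of run.sh): print the JDK major version
# parsed from the first line of `java -version` output.
jdk_major() {
  # Emit "<first> <second>" from the quoted version string, e.g. "17 0".
  printf '%s\n' "$1" \
    | sed -E 's/.*version "([0-9]+)(\.([0-9]+))?.*/\1 \3/' \
    | {
        read -r first second
        # Legacy scheme: "1.8.0_392" means Java 8.
        if [ "$first" = "1" ] && [ -n "$second" ]; then
          echo "$second"
        else
          echo "$first"
        fi
      }
}

jdk_major 'openjdk version "17.0.9" 2023-10-17'
jdk_major 'java version "1.8.0_392"'
```

In terms of the scripts in this change, the result would be compared against `REQUIRED_JDK` from `config.sh`, and a mismatch would only produce a warning.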
+ +## Usage + +Run from anywhere inside the repository: + +```sh +dev/ci/spark-sql-tests/run.sh [module] +``` + +`module` is one of the seven CI shards, or `all` (the default): + +| Module | Spark suites | +|--------------|--------------| +| `catalyst` | `catalyst/test` | +| `sql_core-1` | `sql` suites excluding `ExtendedSQLTest` / `SlowSQLTest` | +| `sql_core-2` | `sql` `ExtendedSQLTest` suites | +| `sql_core-3` | `sql` `SlowSQLTest` suites | +| `sql_hive-1` | `hive` suites excluding `ExtendedHiveTest` / `SlowHiveTest` | +| `sql_hive-2` | `hive` `ExtendedHiveTest` suites | +| `sql_hive-3` | `hive` `SlowHiveTest` suites | + +Examples: + +```sh +# Run a single shard +dev/ci/spark-sql-tests/run.sh sql_core-1 + +# Run all seven shards sequentially +dev/ci/spark-sql-tests/run.sh + +# Re-run a shard without rebuilding Comet or re-applying the Spark diff +SKIP_BUILD=1 SKIP_SPARK_SETUP=1 dev/ci/spark-sql-tests/run.sh sql_core-1 + +# Test a different Spark version +SPARK_VERSION=4.0.2 dev/ci/spark-sql-tests/run.sh sql_core-1 +``` + +The first run clones `apache/spark` and builds both Comet and Spark, which +takes a while. A full `all` run takes several hours, the same as CI. Per-module +output is written to `dev/ci/spark-sql-tests/logs//.log`, and a +PASS/FAIL summary is printed at the end. + +## Environment variables + +| Variable | Default | Effect | +|--------------------|-------------------------------------------|--------| +| `SPARK_VERSION` | `4.1.1` | Spark version to test: `3.4.3`, `3.5.8`, `4.0.2`, or `4.1.1`. | +| `SKIP_BUILD` | unset | `1` skips the Comet build and reuses existing artifacts. | +| `SKIP_SPARK_SETUP` | unset | `1` skips the Spark clone/reset/diff step. | +| `COMET_SPARK_DIR` | `~/.cache/datafusion-comet/apache-spark-` | Persistent Spark checkout location, namespaced by version. | +| `SPARK_REF` | `v` | Git ref checked out for the Spark sources. | +| `SBT_MEM` | `4096` | sbt heap size in MB. | +| `LC_ALL` | `C.UTF-8` | Locale for the sbt run. 
Use `en_US.UTF-8` on macOS if `C.UTF-8` is unavailable. | +| `PYSPARK_PYTHON` | a nonexistent path | Python interpreter for Spark. The default skips Spark 4.x's Python data source probe, which can hang on machines that have `python3`. Export a real interpreter to run the Python-dependent suites. | + +> **Note on Python:** Spark 4.x probes for Python data sources during query +> analysis by spawning a Python worker. The CI `amd64/rust` container has no +> `python3`, so the probe is skipped. On a developer machine that has `python3` +> the worker can hang indefinitely (the JVM-side read has no idle timeout), +> stalling suites such as `GlobalTempViewSuite`. `run.sh` therefore points +> `PYSPARK_PYTHON` / `PYSPARK_DRIVER_PYTHON` at a nonexistent path by default so +> the probe is skipped, matching CI. + +## How it works + +1. `run.sh` builds Comet with `PROFILES=-Pspark- make release` (unless + `SKIP_BUILD=1`), then purges partial Maven cache entries so sbt's resolver + does not choke on POM-only artifacts. +2. `setup-spark.sh` maintains a persistent `apache/spark` checkout per version: + it clones the `v` tag on first use, and on every run resets it to a + clean state and applies `dev/diffs/.diff`. Spark's compiled + `target/` artifacts are preserved across runs so rebuilds are incremental. +3. `run.sh` runs the selected module shard(s) with `build/sbt`, using the same + environment and arguments as the `spark_sql_test.yml` workflow, including the + per-version test-group isolation (Spark 4.0 forks a dedicated JVM per + leak-prone Parquet/Orc suite; other versions run serially). + +The CI workflow's optional Comet fallback-reason log collection +(`workflow_dispatch`) is not reproduced. 
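The Maven cache purge described in step 1 can be illustrated with a dry run that only lists candidates instead of deleting them. This is a minimal sketch under stated assumptions: `list_pom_only_entries` is a hypothetical name, and it omits the packaging-type filter that `run.sh` applies with `grep` before deleting, so it may list more entries than the script would actually remove.

```sh
# Hypothetical dry-run helper (not part of run.sh): print every *.pom under
# the given repository root that has no sibling .jar with the same basename.
list_pom_only_entries() {
  repo_root="$1"
  find "$repo_root" -name '*.pom' | while read -r pom; do
    jar="${pom%.pom}.jar"
    # A POM with its JAR present is a complete entry; skip it.
    [ -f "$jar" ] && continue
    echo "$pom"
  done
}
```

Calling `list_pom_only_entries "$HOME/.m2/repository"` before a run gives a rough preview of the entries the purge would consider, without touching the cache.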
diff --git a/dev/ci/spark-sql-tests/config.sh b/dev/ci/spark-sql-tests/config.sh new file mode 100644 index 0000000000..7bfcba695e --- /dev/null +++ b/dev/ci/spark-sql-tests/config.sh @@ -0,0 +1,117 @@ +#!/bin/bash +# +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. +# + +# Shared configuration for the local Spark SQL test scripts. This file is +# sourced by setup-spark.sh and run.sh; it is not meant to be run directly. +# +# The variables below are consumed by the sourcing scripts, so shellcheck +# cannot see their use when checking this file in isolation. +# shellcheck disable=SC2034 + +# --- Spark version under test ---------------------------------------------- +# Override with SPARK_VERSION=. Each supported version has a +# matching dev/diffs/.diff and mirrors a spark_sql_test.yml CI config. +SPARK_VERSION="${SPARK_VERSION:-4.1.1}" + +# Per-version settings copied from the spark_sql_test.yml CI matrix: the short +# version (Maven/sbt profile suffix) and the JDK major version CI uses. 
+case "$SPARK_VERSION" in + 3.4.3) SPARK_SHORT="3.4"; REQUIRED_JDK="11" ;; + 3.5.8) SPARK_SHORT="3.5"; REQUIRED_JDK="11" ;; + 4.0.2) SPARK_SHORT="4.0"; REQUIRED_JDK="21" ;; + 4.1.1) SPARK_SHORT="4.1"; REQUIRED_JDK="17" ;; + *) + echo "ERROR: unsupported SPARK_VERSION '$SPARK_VERSION'." >&2 + echo " Supported versions: 3.4.3, 3.5.8, 4.0.2, 4.1.1" >&2 + exit 1 + ;; +esac + +# Git ref checked out for the Spark sources. Defaults to the released tag. +SPARK_REF="${SPARK_REF:-v${SPARK_VERSION}}" + +# Test-group isolation, mirroring spark_sql_test.yml. Every CI config sets +# SERIAL_SBT_TESTS=1 except Spark 4.0 (JDK 21), which instead leaves it unset +# and forks a dedicated JVM per leak-prone Parquet/Orc suite to work around a +# cross-suite file-stream leak under JDK 21 (Comet issue #4327). run.sh reads +# DEDICATED_JVM_SUITES: when non-empty it passes DEDICATED_JVM_SBT_TESTS and +# omits SERIAL_SBT_TESTS; when empty it passes SERIAL_SBT_TESTS=1. +DEDICATED_JVM_SUITES="" +if [ "$SPARK_SHORT" = "4.0" ]; then + DEDICATED_JVM_SUITES="\ +org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormatV1Suite,\ +org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormatV2Suite,\ +org.apache.spark.sql.execution.datasources.orc.OrcSourceV1Suite,\ +org.apache.spark.sql.execution.datasources.orc.OrcSourceV2Suite" +fi + +# --- Paths ----------------------------------------------------------------- +# Persistent apache/spark checkout, namespaced by Spark version so switching +# versions does not reset away each version's compiled target/ artifacts. +COMET_SPARK_DIR="${COMET_SPARK_DIR:-$HOME/.cache/datafusion-comet/apache-spark-${SPARK_VERSION}}" + +# Directory containing these scripts, and the Comet repository root. +COMET_SQL_TEST_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" +COMET_REPO_ROOT="$(git -C "$COMET_SQL_TEST_DIR" rev-parse --show-toplevel)" + +# --- sbt / locale ---------------------------------------------------------- +# sbt heap size in MB. 
Higher than CI's 3072 since local machines are not +# constrained to 7 GB GitHub runners. +SBT_MEM="${SBT_MEM:-4096}" + +# Locale for the sbt run. CI uses C.UTF-8; macOS users may need en_US.UTF-8. +export LC_ALL="${LC_ALL:-C.UTF-8}" + +# --- Module shards --------------------------------------------------------- +# The seven module shards, copied verbatim from +# .github/workflows/spark_sql_test.yml. Order matches the CI matrix. +SPARK_SQL_MODULES=( + catalyst + sql_core-1 + sql_core-2 + sql_core-3 + sql_hive-1 + sql_hive-2 + sql_hive-3 +) + +# module_sbt_args +# Echoes the single build/sbt argument for the given module shard. +# Returns non-zero for an unknown module. +module_sbt_args() { + case "$1" in + catalyst) + echo 'catalyst/test' ;; + sql_core-1) + echo 'sql/testOnly * -- -l org.apache.spark.tags.ExtendedSQLTest -l org.apache.spark.tags.SlowSQLTest' ;; + sql_core-2) + echo 'sql/testOnly * -- -n org.apache.spark.tags.ExtendedSQLTest' ;; + sql_core-3) + echo 'sql/testOnly * -- -n org.apache.spark.tags.SlowSQLTest' ;; + sql_hive-1) + echo 'hive/testOnly * -- -l org.apache.spark.tags.ExtendedHiveTest -l org.apache.spark.tags.SlowHiveTest' ;; + sql_hive-2) + echo 'hive/testOnly * -- -n org.apache.spark.tags.ExtendedHiveTest' ;; + sql_hive-3) + echo 'hive/testOnly * -- -n org.apache.spark.tags.SlowHiveTest' ;; + *) + return 1 ;; + esac +} diff --git a/dev/ci/spark-sql-tests/run.sh b/dev/ci/spark-sql-tests/run.sh new file mode 100755 index 0000000000..10cb776534 --- /dev/null +++ b/dev/ci/spark-sql-tests/run.sh @@ -0,0 +1,206 @@ +#!/bin/bash +# +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. 
You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. +# + +# Runs Apache Spark's SQL test suites locally with Comet enabled, reproducing +# the spark_sql_test.yml GitHub Actions workflow. The Spark version is selected +# with SPARK_VERSION (see config.sh); it defaults to 4.1.1. +# +# -e is intentionally not set: when running all module shards, one failing +# shard must not stop the rest. Build and setup failures are checked +# explicitly below. + +set -uo pipefail + +SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" +# shellcheck source=config.sh +source "$SCRIPT_DIR/config.sh" + +usage() { + cat <). + SPARK_REF Git ref for the Spark sources (default: v$SPARK_VERSION). + SBT_MEM sbt heap size in MB (default: 4096). + LC_ALL Locale for the sbt run (default: C.UTF-8; use en_US.UTF-8 on macOS). + PYSPARK_PYTHON Python interpreter for Spark. Defaults to a nonexistent + path so Spark 4.x's Python data source probe is skipped + (it can hang on machines that have python3). Export a + real interpreter to run the Python-dependent suites. +EOF +} + +module="${1:-all}" +case "$module" in + -h|--help) usage; exit 0 ;; +esac + +# Resolve the list of modules to run. 
+modules_to_run=() +if [ "$module" = "all" ]; then + modules_to_run=("${SPARK_SQL_MODULES[@]}") +elif module_sbt_args "$module" >/dev/null 2>&1; then + modules_to_run=("$module") +else + echo "ERROR: unknown module '$module'" >&2 + echo >&2 + usage >&2 + exit 1 +fi + +# --- JDK version check (warning only) -------------------------------------- +jdk_version="$(java -version 2>&1 | head -n1 | sed -E 's/.*version "([0-9]+).*/\1/')" +if [ "$jdk_version" != "$REQUIRED_JDK" ]; then + echo "WARNING: active JDK reports major version '$jdk_version'; Spark $SPARK_VERSION CI uses JDK $REQUIRED_JDK." >&2 + echo " Set JAVA_HOME to a JDK $REQUIRED_JDK install to match CI exactly." >&2 +fi + +# --- Build Comet ----------------------------------------------------------- +if [ "${SKIP_BUILD:-}" = "1" ]; then + echo "SKIP_BUILD=1: skipping Comet build." +else + echo "Building Comet (PROFILES=-Pspark-$SPARK_SHORT make release) ..." + if ! ( cd "$COMET_REPO_ROOT" && PROFILES="-Pspark-$SPARK_SHORT" make release ); then + echo "ERROR: Comet build failed." >&2 + exit 1 + fi +fi + +# --- Purge partial Maven cache entries ------------------------------------- +# Mirrors .github/actions/setup-spark-builder/action.yaml. Comet's Maven phase +# downloads POMs for transitive artifacts whose JARs it never needs. sbt's +# Coursier resolver then treats the POM-only entry as "found locally" and +# fails on the missing JAR instead of fetching it remotely. Delete those +# partial entries so sbt re-fetches the full artifact. +maven_repo="$HOME/.m2/repository" +if [ -d "$maven_repo" ]; then + echo "Purging partial Maven cache entries ..." 
+ find "$maven_repo" -name '*.pom' | while read -r pom; do + jar="${pom%.pom}.jar" + [ -f "$jar" ] && continue + grep -q 'jar\|bundle' "$pom" 2>/dev/null || continue + rm -f "$pom" "${pom}.sha1" "${pom%.pom}.pom.lastUpdated" \ + "$(dirname "$pom")/_remote.repositories" + done +fi + +# --- Set up the Spark checkout --------------------------------------------- +if [ "${SKIP_SPARK_SETUP:-}" = "1" ]; then + echo "SKIP_SPARK_SETUP=1: using the existing Spark checkout as-is." + if [ ! -d "$COMET_SPARK_DIR/.git" ]; then + echo "ERROR: SKIP_SPARK_SETUP=1 but no Spark checkout at $COMET_SPARK_DIR" >&2 + exit 1 + fi +else + if ! "$SCRIPT_DIR/setup-spark.sh"; then + echo "ERROR: Spark setup failed." >&2 + exit 1 + fi +fi + +# --- Run the selected module shards ---------------------------------------- +# Logs are namespaced by Spark version so runs of different versions do not +# overwrite each other. +log_dir="$SCRIPT_DIR/logs/$SPARK_VERSION" +mkdir -p "$log_dir" + +# Spark 4.x's DataSourceManager probes for Python data sources during query +# analysis by spawning a Python worker. The CI amd64/rust container has no +# python3, so the probe is skipped there. On a developer machine that does +# have python3 (as many macOS setups do) the worker can hang indefinitely: +# the JVM-side read has no idle timeout by default, so suites such as +# GlobalTempViewSuite stall forever instead of failing fast. Point PySpark at +# a nonexistent interpreter so the probe is skipped, matching CI. A developer +# who wants the Python suites can export PYSPARK_PYTHON themselves.
+no_python="/nonexistent/comet-disable-python-datasources" + +results=() +overall_status=0 + +for m in "${modules_to_run[@]}"; do + sbt_args="$(module_sbt_args "$m")" + log_file="$log_dir/${m}.log" + echo + echo "==================================================================" + echo "Module: $m" + echo "sbt args: $sbt_args" + echo "Log file: $log_file" + echo "==================================================================" + + # Stale Parquet cache workaround (mirrors spark_sql_test.yml). + rm -rf "$maven_repo/org/apache/parquet" + + ( + cd "$COMET_SPARK_DIR" || exit 1 + + # Environment shared by every Spark version. + sbt_env=( + NOLINT_ON_COMPILE=true + ENABLE_COMET=true + ENABLE_COMET_ONHEAP=true + ENABLE_COMET_LOG_FALLBACK_REASONS=false + PYSPARK_DRIVER_PYTHON="${PYSPARK_DRIVER_PYTHON:-$no_python}" + PYSPARK_PYTHON="${PYSPARK_PYTHON:-$no_python}" + ) + # Per-version test-group isolation (see config.sh): Spark 4.0 forks a + # dedicated JVM per leak-prone suite; every other version runs serially. 
+ if [ -n "$DEDICATED_JVM_SUITES" ]; then + sbt_env+=("DEDICATED_JVM_SBT_TESTS=$DEDICATED_JVM_SUITES") + else + sbt_env+=("SERIAL_SBT_TESTS=1") + fi + + env "${sbt_env[@]}" \ + build/sbt -Dsbt.log.noformat=true -mem "$SBT_MEM" \ + 'set Global / concurrentRestrictions := Seq(Tags.limit(Tags.ForkedTestGroup, 1))' \ + "$sbt_args" + ) 2>&1 | tee "$log_file" + status="${PIPESTATUS[0]}" + + if [ "$status" -eq 0 ]; then + results+=("PASS $m") + else + results+=("FAIL $m (sbt exit $status)") + overall_status=1 + fi +done + +# --- Summary --------------------------------------------------------------- +echo +echo "==================================================================" +echo "Spark SQL test summary (Spark $SPARK_VERSION)" +echo "==================================================================" +for line in "${results[@]}"; do + echo " $line" +done +echo "Logs written to: $log_dir" +exit "$overall_status" diff --git a/dev/ci/spark-sql-tests/setup-spark.sh b/dev/ci/spark-sql-tests/setup-spark.sh new file mode 100755 index 0000000000..5d31aeb85b --- /dev/null +++ b/dev/ci/spark-sql-tests/setup-spark.sh @@ -0,0 +1,72 @@ +#!/bin/bash +# +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. 
+# + +# Maintains the persistent apache/spark checkout used by the local Spark SQL +# test scripts, and applies the Comet diff. Idempotent and safe to re-run. + +set -euo pipefail + +SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" +# shellcheck source=config.sh +source "$SCRIPT_DIR/config.sh" + +DIFF_FILE="$COMET_REPO_ROOT/dev/diffs/${SPARK_VERSION}.diff" +if [ ! -f "$DIFF_FILE" ]; then + echo "ERROR: Comet diff not found: $DIFF_FILE" >&2 + exit 1 +fi + +if [ ! -d "$COMET_SPARK_DIR/.git" ]; then + echo "Cloning apache/spark ($SPARK_REF) into $COMET_SPARK_DIR ..." + mkdir -p "$(dirname "$COMET_SPARK_DIR")" + git clone --depth 1 --branch "$SPARK_REF" \ + https://github.com/apache/spark.git "$COMET_SPARK_DIR" +else + echo "Reusing existing Spark checkout at $COMET_SPARK_DIR" +fi + +# Resolve the commit to reset to. A checkout created with a different +# SPARK_REF may not contain the requested ref; fetch it shallowly if missing. +reset_target="$SPARK_REF" +if ! git -C "$COMET_SPARK_DIR" rev-parse --verify --quiet "${SPARK_REF}^{commit}" >/dev/null; then + echo "Ref $SPARK_REF not present locally; fetching ..." + git -C "$COMET_SPARK_DIR" fetch --depth 1 origin "$SPARK_REF" + reset_target="FETCH_HEAD" +fi + +echo "Resetting Spark checkout to a clean $SPARK_REF ..." +# reset --hard reverts tracked-file edits from a previously applied diff. +git -C "$COMET_SPARK_DIR" reset --hard "$reset_target" +# clean -fd removes untracked files the previous diff added. Without -x it +# leaves gitignored build output in place, so Spark's compiled target/ +# artifacts are reused across runs. +git -C "$COMET_SPARK_DIR" clean -fd + +echo "Applying $DIFF_FILE ..." +# Pre-flight check so a drifted diff produces an actionable error rather than +# raw git apply output. +if ! git -C "$COMET_SPARK_DIR" apply --check "$DIFF_FILE" 2>/dev/null; then + echo "ERROR: $DIFF_FILE does not apply cleanly to $SPARK_REF." 
>&2 + echo " The Comet diff and the Spark ref may have drifted out of sync." >&2 + exit 1 +fi +git -C "$COMET_SPARK_DIR" apply "$DIFF_FILE" + +echo "Spark checkout ready: $COMET_SPARK_DIR ($SPARK_REF + Comet diff)"
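The reset sequence in `setup-spark.sh` relies on `git clean -fd` leaving gitignored files alone. The behavior can be checked on a throwaway repository. This is a minimal sketch, assuming `git` is installed; the function name, the file names, and the `demo` identity are all hypothetical stand-ins for the Spark checkout, the Comet diff, and Spark's `target/` output.

```sh
# Hypothetical demo (not part of setup-spark.sh): build a scratch repository,
# simulate an applied diff plus build output, then reset and clean it.
# Prints the path of the scratch repository.
demo_reset_keeps_ignored() {
  repo="$(mktemp -d)"
  git -C "$repo" init -q
  echo 'target/' > "$repo/.gitignore"
  echo 'original' > "$repo/tracked.txt"
  git -C "$repo" add .gitignore tracked.txt
  git -C "$repo" -c user.name=demo -c user.email=demo@example.com \
    -c commit.gpgsign=false commit -q -m 'initial commit'

  # Simulate a previously applied diff and a build.
  echo 'patched' > "$repo/tracked.txt"          # tracked-file edit
  echo 'new' > "$repo/added-by-diff.txt"        # untracked file from the diff
  mkdir -p "$repo/target"
  echo 'compiled' > "$repo/target/classes.bin"  # gitignored build output

  # The same two commands setup-spark.sh runs, with HEAD as the reset target.
  git -C "$repo" reset -q --hard HEAD
  git -C "$repo" clean -q -fd

  echo "$repo"
}
```

After the call, `tracked.txt` is back to its committed content, `added-by-diff.txt` is gone, and `target/classes.bin` is still present. That last point is what keeps Spark rebuilds incremental across runs; adding `-x` to `git clean` would delete the ignored build output as well.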