Skip to content
Draft
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
84 changes: 84 additions & 0 deletions sql/core/benchmarks/ParquetVectorUpdaterBenchmark-results.txt
Original file line number Diff line number Diff line change
@@ -0,0 +1,84 @@
================================================================================================
Identity Updaters
================================================================================================

OpenJDK 64-Bit Server VM 17.0.19+10-LTS on Linux 6.17.0-1010-azure
AMD EPYC 9V74 80-Core Processor
Identity Updaters: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
BooleanUpdater 0 0 0 14698.7 0.1 1.0X
ByteUpdater (INT32 -> Byte) 0 0 0 3589.1 0.3 0.2X
ShortUpdater (INT32 -> Short) 1 1 0 1824.9 0.5 0.1X
IntegerUpdater 0 0 0 8168.8 0.1 0.6X
LongUpdater 0 0 0 4059.2 0.2 0.3X
FloatUpdater 0 0 0 8130.7 0.1 0.6X
DoubleUpdater 0 0 0 4062.2 0.2 0.3X
BinaryUpdater 18 19 0 57.3 17.5 0.0X


================================================================================================
Type-converting Updaters
================================================================================================

OpenJDK 64-Bit Server VM 17.0.19+10-LTS on Linux 6.17.0-1010-azure
AMD EPYC 9V74 80-Core Processor
Type-converting Updaters: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
---------------------------------------------------------------------------------------------------------------------------
IntegerToLongUpdater 2 2 0 446.0 2.2 1.0X
IntegerToDoubleUpdater 2 2 0 456.5 2.2 1.0X
FloatToDoubleUpdater 2 2 0 474.8 2.1 1.1X
DateToTimestampNTZUpdater 41 41 1 25.5 39.2 0.1X
DowncastLongUpdater (INT64 -> Decimal(9,2)) 2 2 0 472.5 2.1 1.1X


================================================================================================
Rebase Updaters
================================================================================================

OpenJDK 64-Bit Server VM 17.0.19+10-LTS on Linux 6.17.0-1010-azure
AMD EPYC 9V74 80-Core Processor
Rebase Updaters: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
-------------------------------------------------------------------------------------------------------------------------------
IntegerWithRebaseUpdater (DATE legacy) 0 0 0 2174.1 0.5 1.0X
LongWithRebaseUpdater (TIMESTAMP_MICROS legacy) 1 1 0 1733.7 0.6 0.8X
LongAsMicrosUpdater (TIMESTAMP_MILLIS) 2 3 0 420.7 2.4 0.2X


================================================================================================
Unsigned Updaters
================================================================================================

OpenJDK 64-Bit Server VM 17.0.19+10-LTS on Linux 6.17.0-1010-azure
AMD EPYC 9V74 80-Core Processor
Unsigned Updaters: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
-----------------------------------------------------------------------------------------------------------------------------
UnsignedIntegerUpdater (UINT32 -> Long) 1 1 0 964.4 1.0 1.0X
UnsignedLongUpdater (UINT64 -> Decimal(20,0)) 18 18 0 58.6 17.1 0.1X


================================================================================================
Decimal Updaters
================================================================================================

OpenJDK 64-Bit Server VM 17.0.19+10-LTS on Linux 6.17.0-1010-azure
AMD EPYC 9V74 80-Core Processor
Decimal Updaters: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
IntegerToDecimalUpdater 0 0 0 8133.9 0.1 1.0X
LongToDecimalUpdater 0 0 0 4047.6 0.2 0.5X
FixedLenByteArrayToDecimalUpdater 24 24 0 43.8 22.8 0.0X


================================================================================================
FixedLenByteArray Updaters
================================================================================================

OpenJDK 64-Bit Server VM 17.0.19+10-LTS on Linux 6.17.0-1010-azure
AMD EPYC 9V74 80-Core Processor
FixedLenByteArray Updaters: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
---------------------------------------------------------------------------------------------------------------------------------------
FixedLenByteArrayUpdater (len=16 -> Binary) 22 23 0 46.9 21.3 1.0X
FixedLenByteArrayAsIntUpdater (len=4 -> Decimal(9,2)) 6 6 0 166.3 6.0 3.5X
FixedLenByteArrayAsLongUpdater (len=8 -> Decimal(18,4)) 8 8 0 125.1 8.0 2.7X


Original file line number Diff line number Diff line change
Expand Up @@ -18,24 +18,35 @@
package org.apache.spark.sql.execution.datasources.parquet

import java.lang.reflect.{InvocationTargetException, Method}
import java.time.ZoneId
import java.util.PrimitiveIterator

import org.apache.parquet.column.ColumnDescriptor
import org.apache.parquet.schema.LogicalTypeAnnotation

import org.apache.spark.sql.execution.vectorized.WritableColumnVector
import org.apache.spark.util.SparkClassUtils

/**
* Reflective bridge to the package-private `ParquetReadState`. Under `spark-submit --jars`,
* test and main classes load from different classloaders, blocking package-private access.
* Reflection with `setAccessible` sidesteps the check without widening production visibility.
* Reflective bridge to package-private classes in
* `org.apache.spark.sql.execution.datasources.parquet`. Under `spark-submit --jars`, test
* and main classes load from different classloaders, blocking package-private access despite
* the matching package name. Reflection with `setAccessible(true)` sidesteps the check
* without widening production visibility.
*
* Currently bridges:
* - `ParquetReadState` (constructor + `resetForNewBatch` + `resetForNewPage`)
* - `VectorizedRleValuesReader.readBatch` (5-arg overload not exposed publicly)
* - `ParquetVectorUpdaterFactory` (constructor)
*/
object ParquetReadStateTestAccess {
object ParquetTestAccess {

// -------- ParquetReadState --------

private val stateCls = SparkClassUtils.classForName[Any](
"org.apache.spark.sql.execution.datasources.parquet.ParquetReadState")

private val ctor = {
private val stateCtor = {
val c = stateCls.getDeclaredConstructor(
classOf[ColumnDescriptor],
java.lang.Boolean.TYPE,
Expand Down Expand Up @@ -71,7 +82,7 @@ object ParquetReadStateTestAccess {
isRequired: Boolean,
rowIndexes: PrimitiveIterator.OfLong = null): AnyRef = {
try {
ctor.newInstance(
stateCtor.newInstance(
descriptor,
Boolean.box(isRequired),
rowIndexes).asInstanceOf[AnyRef]
Expand Down Expand Up @@ -105,6 +116,41 @@ object ParquetReadStateTestAccess {
reader, state, values, defLevels, valueReader, updater)
} catch { case e: ReflectiveOperationException => throw rethrow(e) }

// -------- ParquetVectorUpdaterFactory --------

private val factoryCtor = {
val cls = SparkClassUtils.classForName[Any](
"org.apache.spark.sql.execution.datasources.parquet.ParquetVectorUpdaterFactory")
val c = cls.getDeclaredConstructor(
classOf[LogicalTypeAnnotation],
classOf[ZoneId],
classOf[String],
classOf[String],
classOf[String],
classOf[String])
c.setAccessible(true)
c
}

def newFactory(
logicalTypeAnnotation: LogicalTypeAnnotation,
convertTz: ZoneId,
datetimeRebaseMode: String,
datetimeRebaseTz: String,
int96RebaseMode: String,
int96RebaseTz: String): ParquetVectorUpdaterFactory = {
try {
factoryCtor.newInstance(
logicalTypeAnnotation, convertTz,
datetimeRebaseMode, datetimeRebaseTz,
int96RebaseMode, int96RebaseTz).asInstanceOf[ParquetVectorUpdaterFactory]
} catch {
case e: ReflectiveOperationException => throw rethrow(e)
}
}

// -------- shared helper --------

private def rethrow(e: ReflectiveOperationException): RuntimeException = {
val cause = e match {
case ite: InvocationTargetException => ite.getCause
Expand Down
Loading