Optor is a proof-of-concept Cascades-style global cost-based query optimizer framework with an experimental Apache Spark Catalyst integration.
If you want the shortest possible mental model, it is this:
- `optor-core` is a generic optimizer engine
- `optor-spark` teaches that engine how to reason about Spark plans through an experimental `SparkStrategy` integration exposed as a Spark extension
- the optimizer searches across equivalent physical plans, enforces required properties, costs the candidates, and picks the cheapest valid one
This repository is aimed at two audiences:
- users who want to understand what Optor changes in Spark
- developers who want to understand the architecture well enough to extend it or port it to another engine
It should currently be read as a proof-of-concept system rather than a production-ready optimizer.
Most optimizers are easier to understand if you separate two concerns:
- the search engine
- the engine-specific semantics
Optor does exactly that in a Cascades-style shape: it memoizes equivalent expressions, explores alternatives through rules, tracks required physical properties, and chooses the cheapest valid plan globally rather than greedily.
The search engine lives in optor-core. It knows how to:
- memoize equivalent plans
- explore transformations and implementations
- track required physical properties
- inject property enforcers when needed
- cost candidate plans
- return the best plan
The engine-specific logic lives in adapters such as optor-spark. Those adapters tell Optor:
- what a plan node looks like
- when two plans are equivalent
- how to compute metadata and cost
- what physical properties matter
- what rules can generate alternative implementations
That split is the central design decision in this codebase.
```
+----------------------+
|   Logical / Input    |
|        Plan          |
+----------+-----------+
           |
           v
+----------------------+
|   Engine Adapters    |
|                      |
|  PlanModel           |
|  CostModel           |
|  MetadataModel       |
|  PropertyModel       |
|  Rule Factory        |
+----------+-----------+
           |
           v
+----------------------+
|      Optor Core      |
|                      |
|  Memo                |
|  Rule application    |
|  Property enforcement|
|  Path search         |
|  Best-plan selection |
+----------+-----------+
           |
           v
+----------------------+
| Cheapest Valid Plan  |
+----------------------+
```
In this repository:
- `optor-core` provides the generic box in the middle
- `optor-spark` provides the adapter layer for Spark
Optor never hardcodes Spark or any other engine into the core. Instead it works over a generic plan type T.
That is why the central entry point looks like:
```scala
Optimization[T](...)
```

The core asks the embedding engine to provide a `PlanModel[T]` so it can:
- inspect children
- rebuild nodes with new children
- compare plans for equivalence
- create group leaves for memoization
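The contract can be sketched on a toy plan type. The trait and method names below are illustrative, not Optor's actual `PlanModel[T]` API:

```scala
// Hypothetical sketch of the plan-model contract described above.
// Method names and signatures are illustrative, not Optor's actual API.
trait MiniPlanModel[T] {
  def children(plan: T): Seq[T]              // inspect children
  def withChildren(plan: T, kids: Seq[T]): T // rebuild with new children
  def sameResult(a: T, b: T): Boolean        // equivalence for memoization
  def groupLeaf(id: Int): T                  // placeholder leaf for a memo group
}

// A toy plan type showing the contract in action.
sealed trait Expr
case class Leaf(name: String) extends Expr
case class Node(op: String, kids: Seq[Expr]) extends Expr

object ExprModel extends MiniPlanModel[Expr] {
  def children(p: Expr): Seq[Expr] = p match {
    case Leaf(_)     => Seq.empty
    case Node(_, ks) => ks
  }
  def withChildren(p: Expr, kids: Seq[Expr]): Expr = p match {
    case l: Leaf     => l
    case Node(op, _) => Node(op, kids)
  }
  def sameResult(a: Expr, b: Expr): Boolean = a == b
  def groupLeaf(id: Int): Expr = Leaf(s"group#$id")
}
```

With such a model, the core can rewrite subtrees and substitute memo-group placeholders without knowing anything about the concrete plan type.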
Like other Cascades-style optimizers, Optor groups equivalent expressions into memo structures. This lets it avoid re-solving the same logical subproblem repeatedly.
The important distinction visible in the code is:
- a group represents a plan alternative under a specific constraint set
- a cluster represents equivalent expressions that share metadata
That is why metadata and physical properties are treated separately:
- metadata says what stays invariant across equivalent expressions
- properties describe execution-relevant traits such as ordering or distribution
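The group/cluster split can be sketched on toy types. The structure below is hypothetical; the real memo in `org.boostscale.optor.memo` is richer:

```scala
import scala.collection.mutable

// Hypothetical memo sketch: a cluster collects equivalent expressions
// that share metadata; a group views a cluster under a specific
// constraint set. Names are illustrative, not optor-core's actual API.
final class MiniMemo[T, C] {
  private val clusters = mutable.Map.empty[Int, mutable.LinkedHashSet[T]]
  private val groups   = mutable.LinkedHashSet.empty[(Int, C)]

  // Record another expression equivalent to those already in the cluster.
  def addEquivalent(clusterId: Int, expr: T): Unit =
    clusters.getOrElseUpdate(clusterId, mutable.LinkedHashSet.empty) += expr

  // A group is the same cluster keyed by a required constraint set.
  def group(clusterId: Int, constraints: C): (Int, C) = {
    val g = (clusterId, constraints)
    groups += g
    g
  }

  def alternatives(clusterId: Int): Seq[T] =
    clusters.getOrElse(clusterId, mutable.LinkedHashSet.empty).toSeq
}
```

Because equivalent expressions land in the same cluster, the optimizer solves each logical subproblem once and reuses the answer across every place it appears.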
Rules are how Optor generates alternatives.
An OptorRule[T] takes a node and produces zero or more alternatives:
- implementation choices
- rewrites
- re-associations
- commutations
- engine-specific lowerings
In Spark, rules are mostly wrappers around existing Spark strategies plus a few Optor-specific expansions, especially around join implementation choices.
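The rule shape can be illustrated on a toy plan type. The names below are hypothetical and do not match Optor's `OptorRule[T]` signature:

```scala
// Hypothetical rule sketch: a rule maps one node to zero or more
// equivalent alternatives. Here, join commutation on a toy plan type.
sealed trait Plan
case class Scan(table: String) extends Plan
case class Join(left: Plan, right: Plan) extends Plan

trait MiniRule { def apply(p: Plan): Seq[Plan] }

// Commutation: A join B is equivalent to B join A.
object CommuteJoin extends MiniRule {
  def apply(p: Plan): Seq[Plan] = p match {
    case Join(l, r) => Seq(Join(r, l)) // one commuted alternative
    case _          => Seq.empty       // the rule does not fire
  }
}
```

Each alternative a rule produces lands in the same memo cluster as its input, so the search space grows without losing track of equivalence.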
Optor treats physical properties as first-class constraints.
A PropertyModel[T] tells the optimizer:
- which properties exist
- how a plan produces them
- what constraints children must satisfy
- how to enforce them when they are missing
In the Spark integration, the current properties are:
- distribution
- ordering
and the enforcers are ordinary Spark operators such as:
- `ShuffleExchangeExec`
- `BroadcastExchangeExec`
- `SortExec`
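The enforcement idea, sketched on a toy plan type (hypothetical names; in Spark the real enforcers are the operators listed above):

```scala
// Hypothetical enforcement sketch: when a plan does not already
// satisfy a required property, wrap it in an enforcer operator,
// analogous to Spark inserting SortExec or ShuffleExchangeExec.
sealed trait PhysPlan { def providesOrdering: Boolean }
case class TableScan(name: String) extends PhysPlan {
  val providesOrdering = false
}
case class SortEnforcer(child: PhysPlan) extends PhysPlan {
  val providesOrdering = true
}

def enforceOrdering(p: PhysPlan): PhysPlan =
  if (p.providesOrdering) p else SortEnforcer(p)
```

Because enforcers are ordinary operators, their cost is accounted for like any other node, so the optimizer can trade a cheap plan plus an enforcer against a pricier plan that satisfies the property natively.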
Once Optor has candidate plans that satisfy the required constraints, it relies on a CostModel[T] to compare them.
The core itself does not prescribe the meaning of cost. It only requires:
- a comparable cost representation
- a way to compute plan cost
- an explicit infinity value for invalid or not-yet-implementable states
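A minimal sketch of that contract on toy types (hypothetical names, not optor-core's actual `CostModel[T]` API):

```scala
// Hypothetical cost contract matching the three requirements above:
// a comparable cost, a way to compute it, and an explicit infinity
// for invalid or not-yet-implementable states.
trait MiniCostModel[T] {
  val infinite: Double = Double.PositiveInfinity
  def cost(plan: T): Double
}

// Toy instance: cost a textual "plan" by its length; an empty plan
// is treated as invalid and priced at infinity.
object StringCost extends MiniCostModel[String] {
  def cost(plan: String): Double =
    if (plan.isEmpty) infinite else plan.length.toDouble
}

// Pick the cheapest candidate whose cost is finite.
def cheapest[T](model: MiniCostModel[T], candidates: Seq[T]): Option[T] =
  candidates
    .filter(p => model.cost(p) < model.infinite)
    .sortBy(model.cost)
    .headOption
```

The explicit infinity matters: it lets the search keep invalid or not-yet-implementable states in the memo without ever selecting them.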
A useful way to read the project is as a pipeline:
input plan
-> memoize equivalent structure
-> apply rewrite / implementation rules
-> derive and enforce required properties
-> expand search space through memo groups
-> cost valid alternatives
-> pick the best path through the memo
That flow maps directly to the codebase:
- `org.boostscale.optor.Optor`: optimizer entry point and model validation
- `org.boostscale.optor.OptorPlanner`: planner abstraction
- `org.boostscale.optor.memo`: memo structures
- `org.boostscale.optor.rule`: rules and enforcer rule sets
- `org.boostscale.optor.path`: path discovery over the memo/search space
- `org.boostscale.optor.best`: best-plan selection
Two planner implementations exist:
- `DpPlanner`: the default, dynamic-programming-based planner
- `ExhaustivePlanner`: the simpler exhaustive alternative
In practice, DpPlanner is the main planner the repository is organized around.
The Spark integration is intentionally narrow, explicit, and experimental.
optor-spark currently provides org.apache.spark.sql.OptorExtensions, which injects a custom planner strategy:
```scala
SparkSession.builder()
  .config("spark.sql.extensions", "org.apache.spark.sql.OptorExtensions")
```

This is not a fork of Catalyst. It is an experimental integration layer that plugs into Catalyst through `SparkSessionExtensions`.
Today, the concrete integration point in this repository is a SparkStrategy.
An integration at the Spark optimizer layer is also possible in principle, but that is not what optor-spark currently implements.
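A complete session bootstrap looks roughly like this. This is a configuration sketch: it assumes Spark 3.5.x and the optor-spark jar on the classpath, and the `master`/`appName` values are illustrative:

```scala
import org.apache.spark.sql.SparkSession

// Enable Optor by injecting its extension at session build time.
// Assumes spark-sql 3.5.x and the optor-spark jar on the classpath.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("optor-demo")
  .config("spark.sql.extensions", "org.apache.spark.sql.OptorExtensions")
  .getOrCreate()

// Physical planning for queries on this session now goes through
// the injected strategy.
spark.range(10).selectExpr("id % 2 AS k", "id").groupBy("k").count().show()
```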
From there, OptorStrategy does five important things:
1. Takes Spark’s logical plan and wraps it into deferred physical placeholders.
2. Builds an `Optimization[SparkPlan]` using Spark-specific models.
3. Reuses Spark planning strategies to generate implementation alternatives.
4. Optimizes under Spark distribution and ordering requirements.
5. Returns the final chosen `SparkPlan`.
The important Spark-specific components are:
- `SparkPlanModel`: how Optor traverses and compares Spark plans
- `SparkCostModel`: how Optor ranks Spark physical plans
- `SparkMetadataModel`: metadata carried across equivalent Spark expressions
- `SparkPropertyModel`: Spark physical properties and enforcers
- `Rules`: wrappers that translate Spark planner strategies into Optor rules
One good way to think about optor-spark is:
- Spark still provides the building blocks
- Optor changes how broadly and systematically those choices are searched and compared
- the current integration is experimental rather than a fully productized replacement for Catalyst planning
If you are reading the code for the first time, this order works well.
- optor-core/src/main/scala/org/boostscale/optor/Optor.scala
- optor-core/src/main/scala/org/boostscale/optor/OptorPlanner.scala
- optor-core/src/main/scala/org/boostscale/optor/PlanModel.scala
- optor-core/src/main/scala/org/boostscale/optor/PropertyModel.scala
- optor-core/src/main/scala/org/boostscale/optor/CostModel.scala
- optor-core/src/main/scala/org/boostscale/optor/MetadataModel.scala
- optor-core/src/main/scala/org/boostscale/optor/dp/DpPlanner.scala
- optor-core/src/main/scala/org/boostscale/optor/exaustive/ExhaustivePlanner.scala
- optor-core/src/main/scala/org/boostscale/optor/memo/Memo.scala
- optor-core/src/main/scala/org/boostscale/optor/rule/RuleApplier.scala
- optor-core/src/main/scala/org/boostscale/optor/path/PathFinder.scala
- optor-spark/src/main/scala/org/apache/spark/sql/OptorExtensions.scala
- optor-spark/src/main/scala/org/apache/spark/sql/OptorStrategy.scala
- optor-spark/src/main/scala/org/apache/spark/sql/optor/plan/SparkPlanModel.scala
- optor-spark/src/main/scala/org/apache/spark/sql/optor/cost/SparkCostModel.scala
- optor-spark/src/main/scala/org/apache/spark/sql/optor/property/SparkPropertyModel.scala
- optor-spark/src/main/scala/org/apache/spark/sql/optor/rule/Rules.scala
If you want to understand the framework without Spark noise, the best entry point is the synthetic test model used by `OptorSuiteBase` in the core test sources.
And if you want to see concrete optimization behavior:
- optor-core/src/test/scala/org/boostscale/optor/specific/JoinReorderSuite.scala
- optor-spark/src/test/scala/org/apache/spark/sql/optor/OptorStrategySuite.scala
```
.
|-- pom.xml
|-- optor-core/
|   `-- src/main/scala/org/boostscale/optor/...
`-- optor-spark/
    `-- src/main/scala/org/apache/spark/sql/...
```
At a package level:
- `org.boostscale.optor`: public core abstractions
- `org.boostscale.optor.memo`: memo/group/cluster state
- `org.boostscale.optor.dp`: DP planner
- `org.boostscale.optor.exaustive`: exhaustive planner
- `org.boostscale.optor.rule`: rule definitions and application
- `org.boostscale.optor.path`: path and pattern search over the memo
- `org.boostscale.optor.vis`: Graphviz-based visualization
- `org.apache.spark.sql`: current Spark extension and `SparkStrategy` surface
- `org.apache.spark.sql.optor.*`: Spark-specific adapters
Use a clean reactor build from the repository root:
```
mvn clean test
```

That is the command used by CI, and it passed in this workspace.
Useful variants:
```
mvn -pl optor-core -am clean test
mvn -pl optor-spark -am clean test
```

If you want to run ScalaTest suites directly, use the ScalaTest Maven plugin instead of Surefire-style `-Dtest=...` filtering:

```
mvn -pl optor-core scalatest:test -DwildcardSuites=org.boostscale.optor.specific.DpPlannerJoinReorderSuite
```

Why `clean` is worth keeping:
- it avoids stale compiled classes between `optor-core` and `optor-spark`
- it matches how the project is actually validated in CI
The test suite is helpful because it exposes the intended shape of the project.
optor-core validates:
- equivalence grouping and memo behavior
- rule application depth and search-space growth
- property derivation and enforcement
- distributed and order-sensitive planning
- join reorder behavior
- cyclic search spaces
- path-finding and masking utilities
optor-spark validates:
- `SparkStrategy` correctness
- logical-link preservation
- plan stability against golden outputs for TPCH
- plan stability against golden outputs for TPC-DS v1.4
- behavior both with and without injected statistics
On a clean run in this workspace:
- `optor-core`: 186 tests passed, 10 ignored
- `optor-spark`: 218 tests passed
This is the main architectural promise of the repository.
To integrate a new engine, you provide:
- a plan type `T`
- `PlanModel[T]`
- `CostModel[T]`
- `MetadataModel[T]`
- `PropertyModel[T]`
- `OptorExplain[T]`
- `OptorRule.Factory[T]`
The shape looks like this:
```scala
val optimization = Optimization[MyPlan](
  myPlanModel,
  myCostModel,
  myMetadataModel,
  myPropertyModel,
  myExplain,
  myRuleFactory
)

val planner = optimization.newPlanner(plan, constraintSet)
val bestPlan = planner.plan()
```

If you are implementing a new adapter, `OptorSuiteBase` is the most useful reference because it strips the problem down to the smallest possible custom plan model.
The core includes Graphviz formatting support for planner state and memo contents.
Relevant code lives in `org.boostscale.optor.vis`.
This is useful when you want to inspect:
- groups and clusters in the memo
- which nodes are winners
- which path became the chosen best plan
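A minimal illustration of the idea is a toy dot emitter. This only shows the general shape of Graphviz output for a plan tree; the real formatter in `org.boostscale.optor.vis` has a different API:

```scala
// Hypothetical sketch: render a small plan tree as Graphviz dot text.
// One dot node per expression, one edge per child link.
case class Tree(label: String, children: Seq[Tree] = Seq.empty)

def toDot(root: Tree): String = {
  val sb = new StringBuilder("digraph memo {\n")
  var nextId = 0
  def walk(t: Tree): Int = {
    nextId += 1
    val me = nextId
    sb.append(s"""  n$me [label="${t.label}"];\n""")
    t.children.foreach { c =>
      val kid = walk(c)
      sb.append(s"  n$me -> n$kid;\n")
    }
    me
  }
  walk(root)
  sb.append("}\n").toString
}
```

Rendering the resulting text with `dot -Tpng` gives a quick visual of which alternatives exist and how they connect.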
From the code and tests, the main current boundaries are:
- Optor is best understood as a Cascades-style optimizer framework, not a drop-in clone of Catalyst internals
- the project should currently be understood as a proof-of-concept
- the Spark integration is tied to Spark 3.5.x APIs
- the current Spark integration is experimental and delivered as a `SparkStrategy` through a Spark extension
- integration at the Spark optimizer layer is possible in principle, but is not implemented here today
- the Spark property model currently centers on distribution and ordering
- some larger join-reorder cases in core tests are intentionally ignored due to search cost
- Spark cost estimation falls back to `1 MiB` when logical stats report `0` bytes
- Java 8 source compatibility
- Scala 2.13.8
- Maven
- Spark 3.5.5 in `optor-spark`
Apache License 2.0. See LICENSE.
This README.md was AI-generated.