Optor

Optor is a proof-of-concept Cascades-style global cost-based query optimizer framework with an experimental Apache Spark Catalyst integration.

If you want the shortest possible mental model, it is this:

  • optor-core is a generic optimizer engine
  • optor-spark teaches that engine how to reason about Spark plans through an experimental SparkStrategy integration exposed as a Spark extension
  • the optimizer searches across equivalent physical plans, enforces required properties, costs the candidates, and picks the cheapest valid one

This repository is aimed at two audiences:

  • users who want to understand what Optor changes in Spark
  • developers who want to understand the architecture well enough to extend it or port it to another engine

It should currently be read as a proof-of-concept system rather than a production-ready optimizer.

Why This Project Exists

Most optimizers are easier to understand if you separate two concerns:

  1. the search engine
  2. the engine-specific semantics

Optor does exactly that in a Cascades-style shape: it memoizes equivalent expressions, explores alternatives through rules, tracks required physical properties, and chooses the cheapest valid plan globally rather than greedily.

The search engine lives in optor-core. It knows how to:

  • memoize equivalent plans
  • explore transformations and implementations
  • track required physical properties
  • inject property enforcers when needed
  • cost candidate plans
  • return the best plan

The engine-specific logic lives in adapters such as optor-spark. Those adapters tell Optor:

  • what a plan node looks like
  • when two plans are equivalent
  • how to compute metadata and cost
  • what physical properties matter
  • what rules can generate alternative implementations

That split is the central design decision in this codebase.

Big Picture Architecture

                 +----------------------+
                 |   Logical / Input    |
                 |        Plan          |
                 +----------+-----------+
                            |
                            v
                 +----------------------+
                 | Engine Adapters      |
                 |                      |
                 | PlanModel            |
                 | CostModel            |
                 | MetadataModel        |
                 | PropertyModel        |
                 | Rule Factory         |
                 +----------+-----------+
                            |
                            v
                 +----------------------+
                 |     Optor Core       |
                 |                      |
                 | Memo                 |
                 | Rule application     |
                 | Property enforcement |
                 | Path search          |
                 | Best-plan selection  |
                 +----------+-----------+
                            |
                            v
                 +----------------------+
                 |  Cheapest Valid Plan |
                 +----------------------+

In this repository:

  • optor-core provides the generic box in the middle
  • optor-spark provides the adapter layer for Spark

The Main Concepts

1. Plans

Optor never hardcodes Spark or any other engine into the core. Instead it works over a generic plan type T.

That is why the central entry point looks like:

Optimization[T](...)

The core asks the embedding engine to provide a PlanModel[T] so it can:

  • inspect children
  • rebuild nodes with new children
  • compare plans for equivalence
  • create group leaves for memoization
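The contract above can be sketched as a small trait plus a toy engine adapter. This is a hedged illustration of the idea, not the actual optor-core API; all trait, method, and class names here are assumptions.

```scala
// Hedged sketch of the PlanModel contract described above.
// Names are illustrative, not the real optor-core signatures.
trait PlanModel[T] {
  def children(plan: T): Seq[T]              // inspect children
  def withChildren(plan: T, kids: Seq[T]): T // rebuild a node with new children
  def equivalent(a: T, b: T): Boolean        // compare plans for equivalence
  def groupLeaf(groupId: Int): T             // create a group leaf for memoization
}

// A toy plan type showing how an engine adapter could satisfy the contract.
sealed trait ToyPlan
case class Scan(table: String) extends ToyPlan
case class Join(left: ToyPlan, right: ToyPlan) extends ToyPlan
case class GroupLeaf(groupId: Int) extends ToyPlan

object ToyPlanModel extends PlanModel[ToyPlan] {
  def children(plan: ToyPlan): Seq[ToyPlan] = plan match {
    case Join(l, r) => Seq(l, r)
    case _          => Seq.empty
  }
  def withChildren(plan: ToyPlan, kids: Seq[ToyPlan]): ToyPlan = plan match {
    case _: Join => Join(kids(0), kids(1))
    case other   => other
  }
  def equivalent(a: ToyPlan, b: ToyPlan): Boolean = a == b
  def groupLeaf(groupId: Int): ToyPlan = GroupLeaf(groupId)
}
```

The key point is that the core only ever talks to a plan through this interface, which is what keeps it engine-agnostic.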

2. Memo Groups And Clusters

Like other Cascades-style optimizers, Optor groups equivalent expressions into memo structures. This lets it avoid re-solving the same logical subproblem repeatedly.

The important distinction visible in the code is:

  • a group represents a plan alternative under a specific constraint set
  • a cluster represents equivalent expressions that share metadata

That is why metadata and physical properties are treated separately:

  • metadata says what stays invariant across equivalent expressions
  • properties describe execution-relevant traits such as ordering or distribution
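The group/cluster split can be pictured with simplified data shapes. These are illustrative stand-ins for the distinction described above, not the memo structures actually used in optor-core:

```scala
// Simplified stand-ins for the group/cluster distinction (not the real memo types).
case class Metadata(rowCount: Long)                           // invariant across equivalents
case class Cluster[T](expressions: Vector[T], meta: Metadata) // equivalent expressions sharing metadata
case class Group[T](cluster: Cluster[T], constraints: Set[String]) // an alternative under a constraint set

// Two equivalent join orders share one cluster (same metadata),
// while each required-property set gets its own group.
val cluster = Cluster(Vector("Join(a,b)", "Join(b,a)"), Metadata(rowCount = 1000L))
val unconstrained = Group(cluster, Set.empty[String])
val sortedOutput  = Group(cluster, Set("ordering(a.id)"))
```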

3. Rules

Rules are how Optor generates alternatives.

An OptorRule[T] takes a node and produces zero or more alternatives:

  • implementation choices
  • rewrites
  • re-associations
  • commutations
  • engine-specific lowerings

In Spark, rules are mostly wrappers around existing Spark strategies plus a few Optor-specific expansions, especially around join implementation choices.
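As a sketch, the rule contract boils down to a function from one node to zero or more alternatives. The trait and rule below are illustrative assumptions, not the actual OptorRule API; a commutation rule is used as the example:

```scala
// Hedged sketch of the rule contract: a rule takes a node and
// produces zero or more alternatives. Names are illustrative.
trait OptorRule[T] {
  def apply(node: T): Seq[T]
}

sealed trait ToyPlan
case class Scan(table: String) extends ToyPlan
case class Join(left: ToyPlan, right: ToyPlan) extends ToyPlan

// A commutation rule: Join(a, b) admits Join(b, a) as an alternative.
object JoinCommute extends OptorRule[ToyPlan] {
  def apply(node: ToyPlan): Seq[ToyPlan] = node match {
    case Join(l, r) => Seq(Join(r, l))
    case _          => Seq.empty // leaf nodes yield no alternatives
  }
}
```

The optimizer applies such rules to memo expressions, and every alternative lands back in the same group for costing.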

4. Properties

Optor treats physical properties as first-class constraints.

A PropertyModel[T] tells the optimizer:

  • which properties exist
  • how a plan produces them
  • what constraints children must satisfy
  • how to enforce them when they are missing

In the Spark integration, the current properties are:

  • distribution
  • ordering

and the enforcers are ordinary Spark operators such as:

  • ShuffleExchangeExec
  • BroadcastExchangeExec
  • SortExec
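The enforcement idea can be shown in miniature: if a plan does not already provide a required property, wrap it in an enforcer operator. Everything in this sketch (the property names, `Sort`, `Exchange`, `enforce`) is an illustrative assumption, not the Spark integration's API:

```scala
// Hedged sketch of property enforcement with toy plan nodes.
sealed trait Prop
case object Sorted extends Prop
case object Distributed extends Prop

sealed trait ToyPlan { def provides: Set[Prop] }
case class Scan(table: String)     extends ToyPlan { val provides = Set.empty[Prop] }
case class Sort(child: ToyPlan)    extends ToyPlan { val provides = child.provides + Sorted }
case class Exchange(child: ToyPlan) extends ToyPlan { val provides = child.provides + Distributed }

// Inject enforcers only for properties the plan does not already provide.
def enforce(plan: ToyPlan, required: Set[Prop]): ToyPlan =
  (required -- plan.provides).foldLeft(plan) {
    case (p, Sorted)      => Sort(p)
    case (p, Distributed) => Exchange(p)
  }
```

In the real Spark adapter the same role is played by the exchange and sort operators listed above.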

5. Cost

Once Optor has candidate plans that satisfy the required constraints, it relies on a CostModel[T] to compare them.

The core itself does not prescribe the meaning of cost. It only requires:

  • a comparable cost representation
  • a way to compute plan cost
  • an explicit infinity value for invalid or not-yet-implementable states
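Those three requirements can be sketched as a small trait. The names (`cost`, `infinity`, `cheapestValid`, the toy `Candidate` type) are illustrative assumptions, not the actual CostModel signatures:

```scala
// Hedged sketch of the cost contract: a comparable cost, a way to
// compute it, and an explicit infinity for invalid states.
trait CostModel[T] {
  def cost(plan: T): Double
  def infinity: Double = Double.PositiveInfinity
  def cheapestValid(candidates: Seq[T]): Option[T] =
    candidates.filter(cost(_) < infinity).minByOption(cost)
}

// Toy model: cost is the estimated row count; invalid plans cost infinity.
case class Candidate(name: String, rows: Long, valid: Boolean)
object RowCountCost extends CostModel[Candidate] {
  def cost(c: Candidate): Double = if (c.valid) c.rows.toDouble else infinity
}
```

Because the core only needs comparability, a real adapter is free to use richer cost representations than a single number.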

How Optimization Flows Through The System

A useful way to read the project is as a pipeline:

input plan
  -> memoize equivalent structure
  -> apply rewrite / implementation rules
  -> derive and enforce required properties
  -> expand search space through memo groups
  -> cost valid alternatives
  -> pick the best path through the memo

That flow maps directly to the codebase:

  • org.boostscale.optor.Optor: optimizer entry point and model validation
  • org.boostscale.optor.OptorPlanner: planner abstraction
  • org.boostscale.optor.memo: memo structures
  • org.boostscale.optor.rule: rules and enforcer rule sets
  • org.boostscale.optor.path: path discovery over the memo/search space
  • org.boostscale.optor.best: best-plan selection

Two planner implementations exist:

  • DpPlanner: the default, dynamic-programming-based planner
  • ExhaustivePlanner: the simpler exhaustive alternative

In practice, DpPlanner is the main planner the repository is organized around.

Spark Integration: What Actually Happens

The Spark integration is intentionally narrow, explicit, and experimental.

optor-spark currently provides org.apache.spark.sql.OptorExtensions, which injects a custom planner strategy:

val spark = SparkSession.builder()
  .config("spark.sql.extensions", "org.apache.spark.sql.OptorExtensions")
  .getOrCreate()

This is not a fork of Catalyst. It is an experimental integration layer that plugs into Catalyst through SparkSessionExtensions.

Today, the concrete integration point in this repository is a SparkStrategy. An integration at the Spark optimizer layer is also possible in principle, but that is not what optor-spark currently implements.

From there, OptorStrategy does five important things:

  1. Takes Spark’s logical plan and wraps it into deferred physical placeholders.
  2. Builds an Optimization[SparkPlan] using Spark-specific models.
  3. Reuses Spark planning strategies to generate implementation alternatives.
  4. Optimizes under Spark distribution and ordering requirements.
  5. Returns the final chosen SparkPlan.

The important Spark-specific components are:

  • SparkPlanModel: how Optor traverses and compares Spark plans
  • SparkCostModel: how Optor ranks Spark physical plans
  • SparkMetadataModel: metadata carried across equivalent Spark expressions
  • SparkPropertyModel: Spark physical properties and enforcers
  • Rules: wrappers that translate Spark planner strategies into Optor rules

One good way to think about optor-spark is:

  • Spark still provides the building blocks
  • Optor changes how broadly and systematically those choices are searched and compared
  • the current integration is experimental rather than a fully productized replacement for Catalyst planning

Codebase Tour

If you are reading the code for the first time, this order works well.

Start Here

  • org.boostscale.optor.Optor: the optimizer entry point and model validation
  • org.boostscale.optor.OptorPlanner: the planner abstraction

Then Read The Search Engine

  • org.boostscale.optor.memo: the memo, group, and cluster structures
  • org.boostscale.optor.rule: rule definitions and enforcer rule sets
  • org.boostscale.optor.path: path discovery over the memo/search space
  • org.boostscale.optor.best: best-plan selection

Then Read The Spark Side

  • org.apache.spark.sql: OptorExtensions and OptorStrategy
  • org.apache.spark.sql.optor.*: the Spark-specific models and rules

Best Minimal Examples

If you want to understand the framework without Spark noise, the best entry point is the synthetic test model behind OptorSuiteBase, which strips the problem down to the smallest possible custom plan model.

And if you want to see concrete optimization behavior, read the planner test suites in optor-core, such as DpPlannerJoinReorderSuite.

Repository Layout

.
|-- pom.xml
|-- optor-core/
|   `-- src/main/scala/org/boostscale/optor/...
`-- optor-spark/
    `-- src/main/scala/org/apache/spark/sql/...

At a package level:

  • org.boostscale.optor: public core abstractions
  • org.boostscale.optor.memo: memo/group/cluster state
  • org.boostscale.optor.dp: DP planner
  • org.boostscale.optor.exaustive: exhaustive planner
  • org.boostscale.optor.rule: rule definitions and application
  • org.boostscale.optor.path: path and pattern search over the memo
  • org.boostscale.optor.vis: Graphviz-based visualization
  • org.apache.spark.sql: current Spark extension and SparkStrategy surface
  • org.apache.spark.sql.optor.*: Spark-specific adapters

Building And Testing

Use a clean reactor build from the repository root:

mvn clean test

That is the command used by CI, and it passed in this workspace.

Useful variants:

mvn -pl optor-core -am clean test
mvn -pl optor-spark -am clean test

If you want to run ScalaTest suites directly, use the ScalaTest Maven plugin instead of Surefire-style -Dtest=... filtering:

mvn -pl optor-core scalatest:test -DwildcardSuites=org.boostscale.optor.specific.DpPlannerJoinReorderSuite

Why clean is worth keeping:

  • it avoids stale compiled classes between optor-core and optor-spark
  • it matches how the project is actually validated in CI

What The Tests Say About The Design

The test suite is helpful because it exposes the intended shape of the project.

optor-core validates:

  • equivalence grouping and memo behavior
  • rule application depth and search-space growth
  • property derivation and enforcement
  • distributed and order-sensitive planning
  • join reorder behavior
  • cyclic search spaces
  • path-finding and masking utilities

optor-spark validates:

  • SparkStrategy correctness
  • logical-link preservation
  • plan stability against golden outputs for TPCH
  • plan stability against golden outputs for TPC-DS v1.4
  • behavior both with and without injected statistics

On a clean run in this workspace:

  • optor-core: 186 tests passed, 10 ignored
  • optor-spark: 218 tests passed

Extending Optor To Another Engine

This is the main architectural promise of the repository.

To integrate a new engine, you provide:

  1. a plan type T
  2. PlanModel[T]
  3. CostModel[T]
  4. MetadataModel[T]
  5. PropertyModel[T]
  6. OptorExplain[T]
  7. OptorRule.Factory[T]

The shape looks like this:

val optimization = Optimization[MyPlan](
  myPlanModel,
  myCostModel,
  myMetadataModel,
  myPropertyModel,
  myExplain,
  myRuleFactory
)

val planner = optimization.newPlanner(plan, constraintSet)
val bestPlan = planner.plan()

If you are implementing a new adapter, OptorSuiteBase is the most useful reference because it strips the problem down to the smallest possible custom plan model.

Visualization And Debugging

The core includes Graphviz formatting support for planner state and memo contents.

The relevant code lives in org.boostscale.optor.vis.

This is useful when you want to inspect:

  • groups and clusters in the memo
  • which nodes are winners
  • which path became the chosen best plan

Current Boundaries

From the code and tests, the main current boundaries are:

  • Optor is best understood as a Cascades-style optimizer framework, not a drop-in clone of Catalyst internals
  • the project should currently be understood as a proof-of-concept
  • the Spark integration is tied to Spark 3.5.x APIs
  • the current Spark integration is experimental and delivered as a SparkStrategy through a Spark extension
  • integration at the Spark optimizer layer is possible in principle, but is not implemented here today
  • the Spark property model currently centers on distribution and ordering
  • some larger join-reorder cases in core tests are intentionally ignored due to search cost
  • Spark cost estimation falls back to 1 MiB when logical stats report 0 bytes

Tech Stack

  • Java 8 source compatibility
  • Scala 2.13.8
  • Maven
  • Spark 3.5.5 in optor-spark

License

Apache License 2.0. See LICENSE.

Note

This README.md was AI-generated.

About

A global cost-based query optimizer framework for Apache Spark and beyond.
