Optor

Optor is a proof-of-concept Cascades-style global cost-based query optimizer framework with an experimental Apache Spark Catalyst integration.

If you want the shortest possible mental model, it is this:

  • optor-core is a generic optimizer engine
  • optor-spark teaches that engine how to reason about Spark plans through an experimental SparkStrategy integration exposed as a Spark extension
  • the optimizer searches across equivalent physical plans, enforces required properties, costs the candidates, and picks the cheapest valid one

This repository is aimed at two audiences:

  • users who want to understand what Optor changes in Spark
  • developers who want to understand the architecture well enough to extend it or port it to another engine

It should currently be read as a proof-of-concept system rather than a production-ready optimizer.

Why This Project Exists

Most optimizers are easier to understand if you separate two concerns:

  1. the search engine
  2. the engine-specific semantics

Optor does exactly that in a Cascades-style shape: it memoizes equivalent expressions, explores alternatives through rules, tracks required physical properties, and chooses the cheapest valid plan globally rather than greedily.

The search engine lives in optor-core. It knows how to:

  • memoize equivalent plans
  • explore transformations and implementations
  • track required physical properties
  • inject property enforcers when needed
  • cost candidate plans
  • return the best plan

The engine-specific logic lives in adapters such as optor-spark. Those adapters tell Optor:

  • what a plan node looks like
  • when two plans are equivalent
  • how to compute metadata and cost
  • what physical properties matter
  • what rules can generate alternative implementations

That split is the central design decision in this codebase.

Big Picture Architecture

                 +----------------------+
                 |   Logical / Input    |
                 |        Plan          |
                 +----------+-----------+
                            |
                            v
                 +----------------------+
                 | Engine Adapters      |
                 |                      |
                 | PlanModel            |
                 | CostModel            |
                 | MetadataModel        |
                 | PropertyModel        |
                 | Rule Factory         |
                 +----------+-----------+
                            |
                            v
                 +----------------------+
                 |     Optor Core       |
                 |                      |
                 | Memo                 |
                 | Rule application     |
                 | Property enforcement |
                 | Path search          |
                 | Best-plan selection  |
                 +----------+-----------+
                            |
                            v
                 +----------------------+
                 |  Cheapest Valid Plan |
                 +----------------------+

In this repository:

  • optor-core provides the generic box in the middle
  • optor-spark provides the adapter layer for Spark

The Main Concepts

1. Plans

Optor never hardcodes Spark or any other engine into the core. Instead it works over a generic plan type T.

That is why the central entry point looks like:

Optimization[T](...)

The core asks the embedding engine to provide a PlanModel[T] so it can:

  • inspect children
  • rebuild nodes with new children
  • compare plans for equivalence
  • create group leaves for memoization
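The contract above can be sketched as a small trait plus a toy engine adapter. This is a hedged illustration of the idea, not the actual optor-core API; all trait, method, and class names here are assumptions.

```scala
// Hedged sketch of the PlanModel contract described above.
// Names are illustrative, not the real optor-core signatures.
trait PlanModel[T] {
  def children(plan: T): Seq[T]              // inspect children
  def withChildren(plan: T, kids: Seq[T]): T // rebuild a node with new children
  def equivalent(a: T, b: T): Boolean        // compare plans for equivalence
  def groupLeaf(groupId: Int): T             // create a group leaf for memoization
}

// A toy plan type showing how an engine adapter could satisfy the contract.
sealed trait ToyPlan
case class Scan(table: String) extends ToyPlan
case class Join(left: ToyPlan, right: ToyPlan) extends ToyPlan
case class GroupLeaf(groupId: Int) extends ToyPlan

object ToyPlanModel extends PlanModel[ToyPlan] {
  def children(plan: ToyPlan): Seq[ToyPlan] = plan match {
    case Join(l, r) => Seq(l, r)
    case _          => Seq.empty
  }
  def withChildren(plan: ToyPlan, kids: Seq[ToyPlan]): ToyPlan = plan match {
    case _: Join => Join(kids(0), kids(1))
    case other   => other
  }
  def equivalent(a: ToyPlan, b: ToyPlan): Boolean = a == b
  def groupLeaf(groupId: Int): ToyPlan = GroupLeaf(groupId)
}
```

The key point is that the core only ever talks to a plan through this interface, which is what keeps it engine-agnostic.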

2. Memo Groups And Clusters

Like other Cascades-style optimizers, Optor groups equivalent expressions into memo structures. This lets it avoid re-solving the same logical subproblem repeatedly.

The important distinction visible in the code is:

  • a group represents a plan alternative under a specific constraint set
  • a cluster represents equivalent expressions that share metadata

That is why metadata and physical properties are treated separately:

  • metadata says what stays invariant across equivalent expressions
  • properties describe execution-relevant traits such as ordering or distribution
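The group/cluster split can be pictured with simplified data shapes. These are illustrative stand-ins for the distinction described above, not the memo structures actually used in optor-core:

```scala
// Simplified stand-ins for the group/cluster distinction (not the real memo types).
case class Metadata(rowCount: Long)                           // invariant across equivalents
case class Cluster[T](expressions: Vector[T], meta: Metadata) // equivalent expressions sharing metadata
case class Group[T](cluster: Cluster[T], constraints: Set[String]) // an alternative under a constraint set

// Two equivalent join orders share one cluster (same metadata),
// while each required-property set gets its own group.
val cluster = Cluster(Vector("Join(a,b)", "Join(b,a)"), Metadata(rowCount = 1000L))
val unconstrained = Group(cluster, Set.empty[String])
val sortedOutput  = Group(cluster, Set("ordering(a.id)"))
```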

3. Rules

Rules are how Optor generates alternatives.

An OptorRule[T] takes a node and produces zero or more alternatives:

  • implementation choices
  • rewrites
  • re-associations
  • commutations
  • engine-specific lowerings

In Spark, rules are mostly wrappers around existing Spark strategies plus a few Optor-specific expansions, especially around join implementation choices.
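As a sketch, the rule contract boils down to a function from one node to zero or more alternatives. The trait and rule below are illustrative assumptions, not the actual OptorRule API; a commutation rule is used as the example:

```scala
// Hedged sketch of the rule contract: a rule takes a node and
// produces zero or more alternatives. Names are illustrative.
trait OptorRule[T] {
  def apply(node: T): Seq[T]
}

sealed trait ToyPlan
case class Scan(table: String) extends ToyPlan
case class Join(left: ToyPlan, right: ToyPlan) extends ToyPlan

// A commutation rule: Join(a, b) admits Join(b, a) as an alternative.
object JoinCommute extends OptorRule[ToyPlan] {
  def apply(node: ToyPlan): Seq[ToyPlan] = node match {
    case Join(l, r) => Seq(Join(r, l))
    case _          => Seq.empty // leaf nodes yield no alternatives
  }
}
```

The optimizer applies such rules to memo expressions, and every alternative lands back in the same group for costing.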

4. Properties

Optor treats physical properties as first-class constraints.

A PropertyModel[T] tells the optimizer:

  • which properties exist
  • how a plan produces them
  • what constraints children must satisfy
  • how to enforce them when they are missing

In the Spark integration, the current properties are:

  • distribution
  • ordering

and the enforcers are ordinary Spark operators such as:

  • ShuffleExchangeExec
  • BroadcastExchangeExec
  • SortExec
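The enforcement idea can be shown in miniature: if a plan does not already provide a required property, wrap it in an enforcer operator. Everything in this sketch (the property names, `Sort`, `Exchange`, `enforce`) is an illustrative assumption, not the Spark integration's API:

```scala
// Hedged sketch of property enforcement with toy plan nodes.
sealed trait Prop
case object Sorted extends Prop
case object Distributed extends Prop

sealed trait ToyPlan { def provides: Set[Prop] }
case class Scan(table: String)     extends ToyPlan { val provides = Set.empty[Prop] }
case class Sort(child: ToyPlan)    extends ToyPlan { val provides = child.provides + Sorted }
case class Exchange(child: ToyPlan) extends ToyPlan { val provides = child.provides + Distributed }

// Inject enforcers only for properties the plan does not already provide.
def enforce(plan: ToyPlan, required: Set[Prop]): ToyPlan =
  (required -- plan.provides).foldLeft(plan) {
    case (p, Sorted)      => Sort(p)
    case (p, Distributed) => Exchange(p)
  }
```

In the real Spark adapter the same role is played by the exchange and sort operators listed above.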

5. Cost

Once Optor has candidate plans that satisfy the required constraints, it relies on a CostModel[T] to compare them.

The core itself does not prescribe the meaning of cost. It only requires:

  • a comparable cost representation
  • a way to compute plan cost
  • an explicit infinity value for invalid or not-yet-implementable states
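Those three requirements can be sketched as a small trait. The names (`cost`, `infinity`, `cheapestValid`, the toy `Candidate` type) are illustrative assumptions, not the actual CostModel signatures:

```scala
// Hedged sketch of the cost contract: a comparable cost, a way to
// compute it, and an explicit infinity for invalid states.
trait CostModel[T] {
  def cost(plan: T): Double
  def infinity: Double = Double.PositiveInfinity
  def cheapestValid(candidates: Seq[T]): Option[T] =
    candidates.filter(cost(_) < infinity).minByOption(cost)
}

// Toy model: cost is the estimated row count; invalid plans cost infinity.
case class Candidate(name: String, rows: Long, valid: Boolean)
object RowCountCost extends CostModel[Candidate] {
  def cost(c: Candidate): Double = if (c.valid) c.rows.toDouble else infinity
}
```

Because the core only needs comparability, a real adapter is free to use richer cost representations than a single number.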

How Optimization Flows Through The System

A useful way to read the project is as a pipeline:

input plan
  -> memoize equivalent structure
  -> apply rewrite / implementation rules
  -> derive and enforce required properties
  -> expand search space through memo groups
  -> cost valid alternatives
  -> pick the best path through the memo

That flow maps directly to the codebase:

  • org.boostscale.optor.Optor: optimizer entry point and model validation
  • org.boostscale.optor.OptorPlanner: planner abstraction
  • org.boostscale.optor.memo: memo structures
  • org.boostscale.optor.rule: rules and enforcer rule sets
  • org.boostscale.optor.path: path discovery over the memo/search space
  • org.boostscale.optor.best: best-plan selection

Two planner implementations exist:

  • DpPlanner: the default, dynamic-programming-based planner
  • ExhaustivePlanner: the simpler exhaustive alternative

In practice, DpPlanner is the main planner the repository is organized around.

Spark Integration: What Actually Happens

The Spark integration is intentionally narrow, explicit, and experimental.

optor-spark currently provides org.apache.spark.sql.OptorExtensions, which injects a custom planner strategy:

val spark = SparkSession.builder()
  .config("spark.sql.extensions", "org.apache.spark.sql.OptorExtensions")
  .getOrCreate()

This is not a fork of Catalyst. It is an experimental integration layer that plugs into Catalyst through SparkSessionExtensions.

Today, the concrete integration point in this repository is a SparkStrategy. An integration at the Spark optimizer layer is also possible in principle, but that is not what optor-spark currently implements.

From there, OptorStrategy does five important things:

  1. Takes Spark’s logical plan and wraps it into deferred physical placeholders.
  2. Builds an Optimization[SparkPlan] using Spark-specific models.
  3. Reuses Spark planning strategies to generate implementation alternatives.
  4. Optimizes under Spark distribution and ordering requirements.
  5. Returns the final chosen SparkPlan.

The important Spark-specific components are:

  • SparkPlanModel: how Optor traverses and compares Spark plans
  • SparkCostModel: how Optor ranks Spark physical plans
  • SparkMetadataModel: metadata carried across equivalent Spark expressions
  • SparkPropertyModel: Spark physical properties and enforcers
  • Rules: wrappers that translate Spark planner strategies into Optor rules

One good way to think about optor-spark is:

  • Spark still provides the building blocks
  • Optor changes how broadly and systematically those choices are searched and compared
  • the current integration is experimental rather than a fully productized replacement for Catalyst planning

Codebase Tour

If you are reading the code for the first time, this order works well.

Start Here

  • org.boostscale.optor.Optor: the optimizer entry point and model validation
  • org.boostscale.optor.OptorPlanner: the planner abstraction

Then Read The Search Engine

  • org.boostscale.optor.memo: the memo, group, and cluster structures
  • org.boostscale.optor.rule: rule definitions and enforcer rule sets
  • org.boostscale.optor.path: path discovery over the memo/search space
  • org.boostscale.optor.best: best-plan selection

Then Read The Spark Side

  • org.apache.spark.sql: OptorExtensions and OptorStrategy
  • org.apache.spark.sql.optor.*: the Spark-specific models and rules

Best Minimal Examples

If you want to understand the framework without Spark noise, the best entry point is the synthetic test model behind OptorSuiteBase, which strips the problem down to the smallest possible custom plan model.

And if you want to see concrete optimization behavior, read the planner test suites in optor-core, such as DpPlannerJoinReorderSuite.

Repository Layout

.
|-- pom.xml
|-- optor-core/
|   `-- src/main/scala/org/boostscale/optor/...
`-- optor-spark/
    `-- src/main/scala/org/apache/spark/sql/...

At a package level:

  • org.boostscale.optor: public core abstractions
  • org.boostscale.optor.memo: memo/group/cluster state
  • org.boostscale.optor.dp: DP planner
  • org.boostscale.optor.exaustive: exhaustive planner
  • org.boostscale.optor.rule: rule definitions and application
  • org.boostscale.optor.path: path and pattern search over the memo
  • org.boostscale.optor.vis: Graphviz-based visualization
  • org.apache.spark.sql: current Spark extension and SparkStrategy surface
  • org.apache.spark.sql.optor.*: Spark-specific adapters

Building And Testing

Use a clean reactor build from the repository root:

mvn clean test

That is the command used by CI, and it passed in this workspace.

Useful variants:

mvn -pl optor-core -am clean test
mvn -pl optor-spark -am clean test

If you want to run ScalaTest suites directly, use the ScalaTest Maven plugin instead of Surefire-style -Dtest=... filtering:

mvn -pl optor-core scalatest:test -DwildcardSuites=org.boostscale.optor.specific.DpPlannerJoinReorderSuite

Why clean is worth keeping:

  • it avoids stale compiled classes between optor-core and optor-spark
  • it matches how the project is actually validated in CI

What The Tests Say About The Design

The test suite is helpful because it exposes the intended shape of the project.

optor-core validates:

  • equivalence grouping and memo behavior
  • rule application depth and search-space growth
  • property derivation and enforcement
  • distributed and order-sensitive planning
  • join reorder behavior
  • cyclic search spaces
  • path-finding and masking utilities

optor-spark validates:

  • SparkStrategy correctness
  • logical-link preservation
  • plan stability against golden outputs for TPCH
  • plan stability against golden outputs for TPC-DS v1.4
  • behavior both with and without injected statistics

On a clean run in this workspace:

  • optor-core: 186 tests passed, 10 ignored
  • optor-spark: 218 tests passed

Extending Optor To Another Engine

This is the main architectural promise of the repository.

To integrate a new engine, you provide:

  1. a plan type T
  2. PlanModel[T]
  3. CostModel[T]
  4. MetadataModel[T]
  5. PropertyModel[T]
  6. OptorExplain[T]
  7. OptorRule.Factory[T]

The shape looks like this:

val optimization = Optimization[MyPlan](
  myPlanModel,
  myCostModel,
  myMetadataModel,
  myPropertyModel,
  myExplain,
  myRuleFactory
)

val planner = optimization.newPlanner(plan, constraintSet)
val bestPlan = planner.plan()

If you are implementing a new adapter, OptorSuiteBase is the most useful reference because it strips the problem down to the smallest possible custom plan model.

Visualization And Debugging

The core includes Graphviz formatting support for planner state and memo contents.

The relevant code lives in org.boostscale.optor.vis.

This is useful when you want to inspect:

  • groups and clusters in the memo
  • which nodes are winners
  • which path became the chosen best plan

Current Boundaries

From the code and tests, the main current boundaries are:

  • Optor is best understood as a Cascades-style optimizer framework, not a drop-in clone of Catalyst internals
  • the project should currently be understood as a proof-of-concept
  • the Spark integration is tied to Spark 3.5.x APIs
  • the current Spark integration is experimental and delivered as a SparkStrategy through a Spark extension
  • integration at the Spark optimizer layer is possible in principle, but is not implemented here today
  • the Spark property model currently centers on distribution and ordering
  • some larger join-reorder cases in core tests are intentionally ignored due to search cost
  • Spark cost estimation falls back to 1 MiB when logical stats report 0 bytes

Tech Stack

  • Java 8 source compatibility
  • Scala 2.13.8
  • Maven
  • Spark 3.5.5 in optor-spark

License

Apache License 2.0. See LICENSE.

Note

This README.md was AI-generated.

About

A global cost-based query optimizer framework for Apache Spark and beyond.
