From 68f65985ab7e51626eda9212df30420f9551524b Mon Sep 17 00:00:00 2001 From: Eduard Fugarolas Date: Wed, 7 Jan 2026 09:55:07 +0100 Subject: [PATCH 01/13] Correct typo. --- CONTRIBUTING.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index 71b4489e0..dd6d7bbe2 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -2,7 +2,7 @@ 👍🎉 First off, thanks for taking the time to contribute! 🎉👍 -The following is a set of guidelines for contributing to the eccenca Corporate Memory documention project. +The following is a set of guidelines for contributing to the eccenca Corporate Memory documentation project. ## How Can I Contribute? From c45a325a3b138f3177c9830285eb312c121c6360 Mon Sep 17 00:00:00 2001 From: Eduard Fugarolas Date: Wed, 7 Jan 2026 10:19:32 +0100 Subject: [PATCH 02/13] Small error fixed. --- CONTRIBUTING.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index dd6d7bbe2..466ca17dd 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -124,7 +124,7 @@ On this page is search function for icons available as well. Extend section - do not use a cluttered desktop -- do not show other esp. personal project artifacts then relevant for the tutorial / feature to show +- do not show other esp. personal project artifacts than relevant for the tutorial / feature to show - select cropping area carefully (omit backgrounds, lines on the edges, etc.) - use the same or a similar area for similar screens - all relevant elements should be clearly visible and not be truncated From 9607dd30cadcbca999cb0fa49210b4a273fed2c9 Mon Sep 17 00:00:00 2001 From: Eduard Fugarolas Date: Thu, 8 Jan 2026 16:43:29 +0100 Subject: [PATCH 03/13] Refactor the Build landing page to use four clear buckets (Foundations, Tutorials, Patterns, Reference). Update the intro paragraph. --- docs/build/index.md | 37 ++++++++++++++++++++++++++++--------- 1 file changed, 28 insertions(+), 9 deletions(-) diff --git a/docs/build/index.md b/docs/build/index.md index 96ecb187e..853bc32d3 100644 --- a/docs/build/index.md +++ b/docs/build/index.md @@ -9,30 +9,49 @@ hide: # :material-star: Build -The Build stage is used to turn your legacy data points from existing datasets into an Enterprise Knowledge Graph structure. The subsections introduce the features of Corporate Memory that support this stage and provide guidance through your first lifting activities. +The Build stage turns your source data—across files, databases, APIs, and streams—into an Enterprise Knowledge Graph. The sections below explain the Build workspace and guide you from first lifting steps to reusable patterns and reference material. **:octicons-people-24: Intended audience:** Linked Data Experts
-- :eccenca-application-dataintegration: Introduction and Best Practices +- :eccenca-application-dataintegration: Foundations: Introduction and Best Practices --- - [Introduction to the User Interface](introduction-to-the-user-interface/index.md) --- a short introduction to the **Build** workspace incl. projects and tasks management. - [Rule Operators](rule-operators/index.md) --- Overview on operators that can be used to build linkage and transformation rules. - - [Cool IRIs](cool-iris/index.md) --- URIs and IRIs are character strings identifying the nodes and edges in the graph. Defining them is an important step in creating an exploitable Knowledge Graph for your Company. - - [Define Prefixes / Namespaces](define-prefixes-namespaces/index.md) --- Define Prefixes / Namespaces — Namespace declarations allow for abbreviation of IRIs by using a prefixed name instead of an IRI, in particular when writing SPARQL queries or Turtle. + - [Cool IRIs](cool-iris/index.md) --- URIs and IRIs are character strings identifying the nodes and edges in the graph. Defining them is an important step in creating an exploitable Knowledge Graph for your Company. + - [Define Prefixes / Namespaces](define-prefixes-namespaces/index.md) --- Namespace declarations allow for abbreviation of IRIs by using a prefixed name instead of an IRI, in particular when writing SPARQL queries or Turtle. - :material-list-status: Tutorials --- - - [Lift Data from Tabular Data](lift-data-from-tabular-data-such-as-csv-xslx-or-database-tables/index.md) --- Build a Knowledge Graph from from Tabular Data such as CSV, XSLX or Database Tables. - - [Lift data from JSON and XML sources](lift-data-from-json-and-xml-sources/index.md) --- Build a Knowledge Graph based on input data from hierarchical sources such as JSON and XML files. - - [Extracting data from a Web API](extracting-data-from-a-web-api/index.md) --- Build a Knowledge Graph based on input data from a Web API. - - [Reconfigure Workflow Tasks](workflow-reconfiguration/index.md) --- During its execution, new parameters can be loaded from any source, which overwrites originally set parameters. - - [Incremental Database Loading](loading-jdbc-datasets-incrementally/index.md) --- Load data incrementally from a JDBC Dataset (relational database Table) into a Knowledge Graph. + - [Lift Data from Tabular Data](lift-data-from-tabular-data-such-as-csv-xslx-or-database-tables/index.md) --- Build a Knowledge Graph from Tabular Data such as CSV, XSLX or Database Tables. + - [Lift data from JSON and XML sources](lift-data-from-json-and-xml-sources/index.md) --- Build a Knowledge Graph based on input data from hierarchical sources such as JSON and XML files. + - [Extracting data from a Web API](extracting-data-from-a-web-api/index.md) --- Build a Knowledge Graph based on input data from a Web API. + - [Incremental Database Loading](loading-jdbc-datasets-incrementally/index.md) --- Load data incrementally from a JDBC Dataset (relational database Table) into a Knowledge Graph. + - [Connect to Snowflake](snowflake-tutorial/index.md) --- Connect Snowflake as a scalable cloud warehouse and lift/link its data in Corporate Memory to unify it with your other sources in one knowledge graph. + - [Build Knowledge Graphs from Kafka Topics](kafka-consumer/index.md) --- Consume Kafka topics and lift event streams into a Knowledge Graph. + - [Evaluate Jinja Template and Send an Email Message](evaluate-template/index.md) --- Template and send an email after a workflow execution. 
+ - [Link Intrusion Detection Systems to Open-Source INTelligence](tutorial-how-to-link-ids-to-osint/index.md) --- Link IDS data to OSINT sources. + +- :fontawesome-regular-snowflake: Patterns + + --- + + - [Reconfigure Workflow Tasks](workflow-reconfiguration/index.md) --- During its execution, new parameters can be loaded from any source, which overwrites originally set parameters. + - [Project and Global Variables](variables/index.md) --- Define and reuse variables across tasks and projects. + - [Active learning](active-learning/index.md) --- Advanced workflows that improve results iteratively by incorporating feedback signals. + +- :material-book-open-variant-outline: Reference + + --- + + - [Mapping Creator](mapping-creator/index.md) --- Create and manage mappings to lift legacy data into a Knowledge Graph. + - [Integrations](integrations/index.md) --- Supported integrations and configuration options for connecting data sources and sinks. + - [Task and Operator Reference](reference/index.md) --- Reference documentation for tasks and operators in the Build workspace.
From 94afee4450e9d1ac4522cc5bd28602a044ce78f5 Mon Sep 17 00:00:00 2001 From: Eduard Fugarolas Date: Thu, 8 Jan 2026 16:44:56 +0100 Subject: [PATCH 04/13] AL is a tutorial. --- docs/build/index.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/build/index.md b/docs/build/index.md index 853bc32d3..c30bd6a64 100644 --- a/docs/build/index.md +++ b/docs/build/index.md @@ -32,6 +32,7 @@ The Build stage turns your source data—across files, databases, APIs, and stre - [Lift data from JSON and XML sources](lift-data-from-json-and-xml-sources/index.md) --- Build a Knowledge Graph based on input data from hierarchical sources such as JSON and XML files. - [Extracting data from a Web API](extracting-data-from-a-web-api/index.md) --- Build a Knowledge Graph based on input data from a Web API. - [Incremental Database Loading](loading-jdbc-datasets-incrementally/index.md) --- Load data incrementally from a JDBC Dataset (relational database Table) into a Knowledge Graph. + - [Active learning](active-learning/index.md) --- Advanced workflows that improve results iteratively by incorporating feedback signals. - [Connect to Snowflake](snowflake-tutorial/index.md) --- Connect Snowflake as a scalable cloud warehouse and lift/link its data in Corporate Memory to unify it with your other sources in one knowledge graph. - [Build Knowledge Graphs from Kafka Topics](kafka-consumer/index.md) --- Consume Kafka topics and lift event streams into a Knowledge Graph. - [Evaluate Jinja Template and Send an Email Message](evaluate-template/index.md) --- Template and send an email after a workflow execution. @@ -43,7 +44,6 @@ The Build stage turns your source data—across files, databases, APIs, and stre - [Reconfigure Workflow Tasks](workflow-reconfiguration/index.md) --- During its execution, new parameters can be loaded from any source, which overwrites originally set parameters. - [Project and Global Variables](variables/index.md) --- Define and reuse variables across tasks and projects. - - [Active learning](active-learning/index.md) --- Advanced workflows that improve results iteratively by incorporating feedback signals. - :material-book-open-variant-outline: Reference From 34211ad84b076cd8a852f71a2d0ce7226e582291 Mon Sep 17 00:00:00 2001 From: Eduard Fugarolas Date: Fri, 9 Jan 2026 10:04:34 +0100 Subject: [PATCH 05/13] Casing corrected. --- docs/build/index.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/build/index.md b/docs/build/index.md index c30bd6a64..1dd20366e 100644 --- a/docs/build/index.md +++ b/docs/build/index.md @@ -28,7 +28,7 @@ The Build stage turns your source data—across files, databases, APIs, and stre --- - - [Lift Data from Tabular Data](lift-data-from-tabular-data-such-as-csv-xslx-or-database-tables/index.md) --- Build a Knowledge Graph from Tabular Data such as CSV, XSLX or Database Tables. + - [Lift Data from Tabular Data](lift-data-from-tabular-data-such-as-csv-xslx-or-database-tables/index.md) --- Build a Knowledge Graph from tabular data such as CSV, XSLX or database tables. - [Lift data from JSON and XML sources](lift-data-from-json-and-xml-sources/index.md) --- Build a Knowledge Graph based on input data from hierarchical sources such as JSON and XML files. - [Extracting data from a Web API](extracting-data-from-a-web-api/index.md) --- Build a Knowledge Graph based on input data from a Web API. 
- [Incremental Database Loading](loading-jdbc-datasets-incrementally/index.md) --- Load data incrementally from a JDBC Dataset (relational database Table) into a Knowledge Graph. From e21ceb354f71576112a03736bfe95406fa61e4e4 Mon Sep 17 00:00:00 2001 From: Eduard Fugarolas Date: Fri, 9 Jan 2026 16:20:11 +0100 Subject: [PATCH 06/13] Apache Spark within CMEM BUILD --- docs/build/.pages | 1 + docs/build/index.md | 1 + docs/build/spark/index.md | 160 ++++++++++++++++++++++++++++++++++++++ 3 files changed, 162 insertions(+) create mode 100644 docs/build/spark/index.md diff --git a/docs/build/.pages b/docs/build/.pages index de9f6ab15..830f7d5af 100644 --- a/docs/build/.pages +++ b/docs/build/.pages @@ -18,3 +18,4 @@ nav: - Project and Global Variables: variables - Evaluate Template Operator: evaluate-template - Build Knowledge Graphs from Kafka Topics: kafka-consumer + - Spark: spark diff --git a/docs/build/index.md b/docs/build/index.md index 1dd20366e..a6e4133f9 100644 --- a/docs/build/index.md +++ b/docs/build/index.md @@ -23,6 +23,7 @@ The Build stage turns your source data—across files, databases, APIs, and stre - [Rule Operators](rule-operators/index.md) --- Overview on operators that can be used to build linkage and transformation rules. - [Cool IRIs](cool-iris/index.md) --- URIs and IRIs are character strings identifying the nodes and edges in the graph. Defining them is an important step in creating an exploitable Knowledge Graph for your Company. - [Define Prefixes / Namespaces](define-prefixes-namespaces/index.md) --- Namespace declarations allow for abbreviation of IRIs by using a prefixed name instead of an IRI, in particular when writing SPARQL queries or Turtle. + - [Spark](spark/index.md) --- Explainer of Apache Spark and its integration within the BUILD platform. - :material-list-status: Tutorials diff --git a/docs/build/spark/index.md b/docs/build/spark/index.md new file mode 100644 index 000000000..647200848 --- /dev/null +++ b/docs/build/spark/index.md @@ -0,0 +1,160 @@ +--- +icon: simple/apachespark +tags: + - Introduction + - Explainer +--- +# Apache Spark within CMEM BUILD + +## Introduction + +This documentation provides a detailed explanation of Apache Spark and its integration within eccenca’s Corporate Memory (CMEM) BUILD platform. The goal is to provide a **conceptual understanding of Spark**, its purpose in BUILD, and how workflows leverage Spark-aware datasets for efficient, distributed data processing. + +The documentation is structured in three parts: + +1. What Apache Spark is +2. How Apache Spark works +3. How Apache Spark is used in CMEM + +## What is Apache Spark? + +[Apache Spark](https://spark.apache.org/) is a unified **computing engine** and set of libraries for **distributed data processing at scale**. It is specifically used in the domains of data engineering, data science, and machine learning. + +The main **data processing tasks** Apache Spark is used for include: + +* data loading +* SQL queries +* machine learning +* streaming +* graph processing +* etc. (functionalities stemming from hundreds of plugins) + +By itself, Apache Spark is _detached from any data and Input/Output (IO) operations_. More formally: Apache Spark requires a [cluster manager](https://en.wikipedia.org/wiki/Cluster_manager "Cluster manager") and a [distributed storage system](https://en.wikipedia.org/wiki/Clustered_file_system "Clustered file system"). 
One possible realization of these requirements, for the **distributed storage** part, is to combine Apache Spark with [Apache Hive](https://hive.apache.org/) ―a distributed data warehouse―. For the **cluster management** part, there are also several possibilities, as can be explored in the [cluster mode overview documentation](https://spark.apache.org/docs/latest/cluster-overview.html). One such possibility ―and the one we recommend― is to use [Apache Hadoop YARN](https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/YARN.html). YARN handles the _resource management_ and the _job scheduling_ within the cluster. The connection point between YARN and Spark can be explored further [here](https://spark.apache.org/docs/latest/running-on-yarn.html). More specifically, within eccenca's Corporate Memory (CMEM) environment, the relevant configuration is documented in the [Spark configuration](https://documentation.eccenca.com/latest/deploy-and-configure/configuration/dataintegration/#spark-configuration) of eccenca BUILD.

## How does Spark work?

### Spark's Architecture

There are ―in general terms― _three_ different layers or levels of abstraction within Spark:

* the **low-level APIs**: RDDs (resilient distributed datasets), shared variables
* the **high-level** or **structured APIs**: DataFrames, Datasets, SparkSQL
* the **application level**: (Structured) Streaming, MLlib, GraphX, etc.

#### Low-level API

At the lowest abstraction level, Spark provides the abstraction of a **resilient distributed dataset**, an [RDD](https://spark.apache.org/docs/latest/rdd-programming-guide.html). As stated, Spark is detached from any data and IO operations, and the abstraction of the RDD embodies this exact principle. In practice, the most common physical source of data for an RDD is a file in a Hadoop file system, the [HDFS](https://www.databricks.com/glossary/hadoop-distributed-file-system-hdfs) (Hadoop Distributed File System).

Conceptually, it is important to be aware of the following distinction: Apache Spark does **in-memory** computations. Hadoop handles the distributed _files_, and Spark the distributed _processing_. Spark, YARN and HDFS therefore have orthogonal but cohesive concerns: computation, scheduling, persistence.

In addition to the in-memory aspect of computations with RDDs, the RDD itself is **immutable**. This is an established practice in functional programming ―which inspired the core processing functionalities and mechanisms of systems such as Spark and Hadoop―, especially in the context of parallel processing and distributed computing. Immutability does _not_ imply that the processing of data structures is inefficient compared to working with mutable data, since [persistent data structures](https://en.wikipedia.org/wiki/Persistent_data_structure) are used. That said, each type of data structure evidently has its use-cases and trade-offs.

Spark can be seen as bridging distributed computing paradigms. [Hadoop](https://hadoop.apache.org/) and its MapReduce operate on **disk-based storage**, processing data in batches. Spark shifts computation **into memory** and treats data as **immutable**, enabling stateless transformations across partitions. [Flink](https://flink.apache.org/) goes further by supporting **stateful, continuously updated computations**, suited for complex streaming workloads.

#### High-level API

The basis for in-memory computations in Spark is, thus, the RDD.
As such, the RDD is the central abstraction in Spark for the concept of _distributed data_. The chronological development of Spark's APIs also reflects this: The RDD was introduced in 2011, then in 2013 and 2015 two higher-level abstractions were introduced, each building upon the previous. These two abstractions are the DataFrame (2013) and the Dataset (2015). The main difference is the following:

* A **Dataset** is a _strongly-typed_ distributed collection of data.
* A **DataFrame** is a _weakly-typed_ distributed collection of data.

Technically speaking, a DataFrame is nothing other than a **dataset of rows**. Here, a **row** is to be understood in the same sense as in the rows of a relational database table. Conceptually, the main difference is that a Dataset "knows" which types of objects it stores or contains, whereas a DataFrame doesn't.

In our case, the relevant abstractions are the RDD and the DataFrame. In a general, domain-agnostic data integration system such as eccenca CMEM, there is no use-case for a strongly-typed version of a distributed collection of data (that would require knowledge of the application and business domains, integrated into CMEM itself). In other words: The usage of DataFrames aligns perfectly with the general, flexible and dynamic data integration tasks of CMEM and the corresponding workflow execution, of which Spark is an optional but optimal part.

##### DataFrames

###### What a DataFrame Really Is

A **DataFrame** is a high-level abstraction over an RDD, combining a **distributed dataset** with a **schema**. The **schema** defines column names, data types, and the structure of fields (`StructType`, `StructField`). Notice that a schema is applied to `DataFrames`, not to rows; each row is itself untyped.

###### How DataFrames Work

A **DataFrame** combines a **schema** (structure) with a **distributed dataset** (content). It is **immutable**: transformations always produce new DataFrames without changing the original. Spark tracks **lineage**, the history of transformations, which allows lost partitions to be recomputed safely. DataFrames are **partitioned** across the cluster, enabling **parallel processing**. Together, these properties make DataFrames reliable and efficient for data integration workflows in CMEM.

###### Computing DataFrames

Operations on DataFrames are **lazy**. Transformations define what to do; computation happens only when an **action** (e.g., `collect`, `write`) is triggered. This ensures efficient computation, preserves immutability, and keeps integration workflows flexible.

###### Data Sources

DataFrames can be created from **files** (CSV, JSON, Parquet, ORC), **databases** (via JDBC), or **existing RDDs** with an applied schema. Schemas can be provided or inferred dynamically. Writing supports multiple modes (overwrite, append), enabling consistent handling of distributed datasets in CMEM Build.

#### Application level

At the application level, Apache Spark provides further abstractions and functionalities such as structured streaming, machine learning and (in-memory) graph processing with GraphX. These are interesting to know about, but they are not relevant for the integration of Apache Spark within eccenca CMEM.

In other words, and metaphorically speaking: The application level is CMEM itself, which makes use of Spark. This brings us to the follow-up questions regarding the usage of Spark _within_ Corporate Memory, which are described further below.
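To tie the DataFrame ideas above together (schema, immutability, lazy transformations, actions) before looking at how Spark executes this work, here is a minimal PySpark sketch. It is purely illustrative and independent of CMEM: the session settings, column names and values are made up, and it assumes a local `pyspark` installation.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Local session purely for illustration; in a cluster setup the master and
# resource settings would come from YARN / the deployment configuration.
spark = SparkSession.builder.appName("dataframe-sketch").master("local[*]").getOrCreate()

# The schema describes the columns; the rows themselves stay untyped.
schema = StructType([
    StructField("product", StringType(), nullable=False),
    StructField("quantity", IntegerType(), nullable=True),
])

df = spark.createDataFrame(
    [("graph-db", 3), ("mapping-tool", 7), ("sparql-endpoint", 2)],
    schema=schema,
)

# Transformations are lazy and never modify df; they only describe a new DataFrame.
large_orders = df.filter(df.quantity > 2).select("product")

# Only an action (here: collect) triggers the actual distributed computation.
print(large_orders.collect())

spark.stop()
```

Note how nothing is computed until `collect()` is called; this laziness is what lets Spark plan a whole chain of transformations before touching any data.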
+ +### Anatomy of a Spark job + +A Spark job consists of **stages**, **tasks**, **shuffles**, and the **DAG**. These elements define how computations are divided and executed across the cluster. + +* A **job** consists of **stages**, each containing multiple **tasks**, which are the units of computation executed per partition. +* A **stage** represents a set of computations that can run **without shuffling data** across nodes. +* A **shuffle** is the **exchange of data between nodes**, required when a stage depends on data from another stage. +* The **DAG** (directed acyclic graph) captures **all dependencies between RDDs**, allowing Spark to plan and optimize execution efficiently. + +This structure — jobs divided into stages and tasks, connected through the DAG and occasionally requiring shuffles — allows Spark to **schedule work efficiently, parallelize computation across the cluster, and recover lost partitions if a node fails**. Spark’s DAG-based planning also enables **optimizations**, such as minimizing data movement, which improves performance in workflows that combine multiple transformations and actions. + +## Apache Spark within CMEM BUILD + +### Why is Apache Spark used by eccenca’s CMEM? + +Apache Spark is integrated into **eccenca Corporate Memory (CMEM)** to enable **scalable, distributed execution of data integration workflows** within its **BUILD (DataIntegration)** component. While CMEM’s overall architecture already consists of multiple distributed services (e.g. BUILD for data integration, EXPLORE for knowledge graph management), the _execution_ of workflows in BUILD is typically **centralized**. Spark adds a **parallel, fault-tolerant computation layer** that becomes especially valuable when workflows process **large, complex, or computation-heavy datasets**. + +In practical terms, Spark is used **only** in BUILD — not in EXPLORE or other components. Its purpose is to **accelerate workflow execution** for the so-called **Spark-aware datasets**, which include file formats and storage systems such as **Avro, Parquet, ORC, Hive**, and **HDFS**. These formats map naturally to Spark’s distributed processing model and benefit from in-memory execution and partition-based parallelism. + +For other dataset types (e.g. smaller relational sources or local files), Spark execution provides no significant advantage and is not typically used. In such cases, BUILD’s standard local execution engine is sufficient. Spark thus acts as an **optional, performance-oriented backend**, not as a replacement for the standard workflow engine. + +The rationale behind using Spark in BUILD aligns with its general strengths: + +- **Parallelization and scalability** for high-volume transformations and joins. +- **Fault tolerance** through resilient distributed datasets (RDDs). +- **Optimization** via Spark’s DAG-based execution planner, minimizing data movement. +- **Interoperability** with widely used big data formats (Parquet, ORC, Avro). + +By leveraging Spark, CMEM can handle **data integration workflows that would otherwise be constrained by single-node processing limits**, while maintaining compatibility with its semantic and knowledge-graph-oriented ecosystem. However, since Spark support is optional, its usage depends on specific deployment needs and data volumes. + +### How and where is Apache Spark used by BUILD? + +Within the BUILD stage (DataIntegration), Apache Spark is used **exclusively for executing workflows that involve Spark-optimized datasets**. 
These workflows connect datasets, apply transformations, and produce outputs, with Spark providing a **distributed, in-memory execution engine** that handles large volumes of data and complex computations efficiently. + +The Spark-optimized datasets — such as **Avro datasets, Parquet datasets, ORC datasets, and Hive tables** — are designed to leverage Spark’s architecture. When included in a workflow, Spark performs transformations **in parallel across partitions** and keeps data **in memory whenever possible**. Other datasets can participate in workflows but typically do **not benefit from Spark’s parallel execution optimizations**. + +Optionally, for more technical context: each Spark-optimized dataset corresponds to an **executor-aware entity**. During workflow execution, BUILD translates the workflow graph into **Spark jobs**, where datasets become **RDDs or DataFrames**, transformations become **stages**, and Spark orchestrates execution across the cluster. The results are then **materialized or written back** into CMEM’s storage layer, ready for subsequent workflow steps or integration into the knowledge graph. Users do **not** need to manage executors or partitions manually. + +### What are the Spark-optimized datasets? + +In BUILD, **Spark-optimized datasets** are those data sources designed to fully leverage Spark’s distributed, in-memory execution engine. These datasets are structured to enable **parallelized transformations, efficient partitioning, and integration into workflows** without requiring manual management of computation or storage. + +The main types of Spark-optimized datasets include: + +- **Avro datasets** — columnar, self-describing file format optimized for Spark’s in-memory processing. +- **Parquet datasets** — highly efficient columnar storage format that supports predicate pushdown and column pruning. +- **ORC datasets** — optimized row-columnar format commonly used in Hadoop ecosystems, enabling fast scans and compression. +- **Hive tables** — structured tables stored in Hadoop-compatible formats, which can be queried and transformed via Spark seamlessly. +- **HDFS datasets** — file-based, row-oriented datasets stored in Hadoop Distributed File System, optimized for partitioned, parallel processing. +- **JSON datasets** — semi-structured, Spark-aware datasets enabling flexible schema inference and in-memory transformations. +- **JDBC / relational datasets** — external relational sources exposed to Spark via JDBC, queryable and transformable as DataFrames. +- **Embedded SQL Endpoint** — workflow results published as **virtual SQL tables**, queryable via JDBC/ODBC without persistent storage, optionally cached in memory. + +When a workflow includes any of these datasets, Spark executes transformations **in parallel across partitions** and **keeps intermediate results in memory whenever possible**, accelerating performance for complex or large-scale data integration tasks. + +Other datasets in BUILD (e.g., relational sources, local files, or non-Spark-aware formats) can participate in workflows, but they **do not benefit from Spark’s parallel execution**. Spark execution remains **optional**, used only when performance gains are meaningful or when workflow complexity demands distributed processing. + +### What is the relation between BUILD’s Spark-aware workflows and the Knowledge Graph? + +BUILD’s Spark-aware workflows operate on datasets **within BUILD**, executing transformations and producing outputs in a distributed, in-memory manner. 
The Knowledge Graph, managed by EXPLORE, serves as the **persistent semantic storage layer**, but Spark itself does **not directly interact** with the graph. Instead, the **workflow execution engine** orchestrates the movement of data between Spark-aware datasets and the Knowledge Graph, ensuring that transformations are applied in the correct sequence and that results are persisted appropriately. + +This separation of concerns allows Spark to focus on **high-performance computation** without being constrained by the architecture or APIs of the Knowledge Graph, or the rest of CMEM's architecture around it. Data can flow into workflows from various sources and ultimately be integrated into the graph, but the execution engine mediates this process, handling **dependencies, scheduling, and parallelism**. Users benefit from the efficiency of Spark while maintaining the integrity and consistency of the graph as the central repository of integrated knowledge. + +From a conceptual perspective, the relation is therefore **indirect but essential**: Spark-aware workflows accelerate the processing of large or complex datasets, while the Knowledge Graph ensures that the processed data is **semantically harmonized and persistently stored**. Together, they enable CMEM to combine **flexible, distributed computation** with **knowledge-centric integration**, supporting a wide range of enterprise data integration use cases without requiring users to manage low-level execution details. + +### What is the relation between Spark-aware dataset plugins and other BUILD plugins? + +Spark-aware dataset plugins are a specialized subset of dataset plugins that integrate seamlessly into BUILD workflows. They implement the same source-and-sink interfaces as all other plugins, allowing workflows to connect Spark-aware datasets, traditional datasets, and transformations without additional configuration. + +These plugins include not only the core Spark-optimized datasets (Avro, Parquet, ORC, Hive, HDFS) but also other Spark-aware plugins such as **JSON and JDBC sources**, providing consistent behavior and integration across a wide range of data types and endpoints. Spark-aware plugins can be combined with any other plugin in a workflow, with the execution engine automatically leveraging Spark where beneficial, so users do not need to manage parallelism or execution details themselves. + +The execution engine coordinates all plugins, orchestrating Spark-based processing where appropriate while ensuring overall workflow consistency and integration with CMEM’s storage layer. Certain optimizations, like lazy evaluation and optional caching, exist internally to improve performance, though users interact with the workflows in the same unified interface as with other datasets. From ba304005012a79ed35f4644b27809f71a59b0e9d Mon Sep 17 00:00:00 2001 From: Eduard Fugarolas Date: Tue, 13 Jan 2026 15:30:24 +0100 Subject: [PATCH 07/13] Reduce the number of boldface items. --- docs/build/spark/index.md | 60 +++++++++++++++++++-------------------- 1 file changed, 30 insertions(+), 30 deletions(-) diff --git a/docs/build/spark/index.md b/docs/build/spark/index.md index 647200848..db610924c 100644 --- a/docs/build/spark/index.md +++ b/docs/build/spark/index.md @@ -18,9 +18,9 @@ The documentation is structured in three parts: ## What is Apache Spark? -[Apache Spark](https://spark.apache.org/) is a unified **computing engine** and set of libraries for **distributed data processing at scale**. 
It is specifically used in the domains of data engineering, data science, and machine learning. +[Apache Spark](https://spark.apache.org/) is a unified **computing engine** and set of libraries for distributed data processing at scale. It is specifically used in the domains of data engineering, data science, and machine learning. -The main **data processing tasks** Apache Spark is used for include: +The main data processing use-cases of Apache Spark are: * data loading * SQL queries @@ -38,7 +38,7 @@ By itself, Apache Spark is _detached from any data and Input/Output (IO) operati There are ―in general terms― _three_ different layers or levels of abstraction within Spark: * the **low-level APIs**: RDDs (resilient distributed datasets), shared variables -* the **high-level** or **structured APIs**: DataFrames, Datasets, SparkSQL +* the **high-level** or structured APIs: DataFrames, Datasets, SparkSQL * the **application level**: (Structured) Streaming, MLib, GraphX, etc. #### Low-level API @@ -49,7 +49,7 @@ Conceptually, it is important to be aware of the following distinction: Apache S Additionally to the in-memory aspect of computations with RDDs, the RDD itself is **immutable**. This is an established practice in functional programming ―which inspired the core processing functionalities and mechanisms of systems such as Spark and Hadoop―, especially in the context of parallel processing and distributed computing. Immutability does _not_ imply that the processing of data structures is inefficient compared to working with mutable data, since [persistent data structures](https://en.wikipedia.org/wiki/Persistent_data_structure) are used. That said, evidently each type of datastructure has its use-cases and trade-offs. -Spark can be seen as bridging distributed computing paradigms. [Hadoop](https://hadoop.apache.org/) and its MapReduce operate on **disk-based storage**, processing data in batches. Spark shifts computation **into memory** and treats data as **immutable**, enabling stateless transformations across partitions. [Flink](https://flink.apache.org/) goes further by supporting **stateful, continuously updated computations**, suited for complex streaming workloads. +Spark can be seen as bridging distributed computing paradigms. [Hadoop](https://hadoop.apache.org/) and its MapReduce operate on disk-based storage, processing data in batches. Spark shifts computation **into memory** and treats data as immutable, enabling stateless transformations across partitions. [Flink](https://flink.apache.org/) goes further by supporting stateful, continuously updated computations, suited for complex streaming workloads. #### High-level API @@ -58,7 +58,7 @@ The basis for in-memory computations in Spark is, thus, the RDD. As such, the RD * A **Dataset** is a _strongly-typed_ distributed collection of data. * A **DataFrame** is a _weakly-typed_ distributed collection of data. -Technically speaking, a DataFrame is nothing else than a **dataset of rows**. Here, a **row** is to be understood in the same sense as in the rows of a relational database table. Conceptually, the main difference is that a Dataset "knows" which types of objects it stores or contains, whereas a DataFrame doesn't. +Technically speaking, a DataFrame is nothing else than a **dataset of rows**. Here, a row is to be understood in the same sense as in the rows of a relational database table. Conceptually, the main difference is that a Dataset "knows" which types of objects it stores or contains, whereas a DataFrame doesn't. 
In our case, the relevant abstractions are the RDD and the DataFrame. In a general, domain-agnostic data integration system such as eccenca CMEM, there is no use-case for a strongly-typed version of a distributed collection of data (that would require knowledge of the application and business domains, integrated into CMEM itself). In other words: The usage of DataFrames aligns perfectly with the general, flexible and dynamic data integration tasks of CMEM and the corresponding workflow execution, of which Spark is an optional but optimal part. @@ -66,19 +66,19 @@ In our case, the relevant abstractions are the RDD and the DataFrame. In a gener ###### What a DataFrame Really Is -A **DataFrame** is a high-level abstraction over an RDD, combining a **distributed dataset** with a **schema**. The **schema** defines column names, data types, and the structure of fields (`StructType`, `StructField`). Notice that a schema is applied to `DataFrames`, not to rows; each row is itself untyped. +A DataFrame is a high-level abstraction over an RDD, combining a distributed dataset with a **schema**. The schema defines column names, data types, and the structure of fields. Notice that a schema is applied to DataFrames, not to rows; each row is itself untyped. ###### How DataFrames Work -A **DataFrame** combines a **schema** (structure) with a **distributed dataset** (content). It is **immutable**: transformations always produce new DataFrames without changing the original. Spark tracks **lineage**, the history of transformations, which allows lost partitions to be recomputed safely. DataFrames are **partitioned** across the cluster, enabling **parallel processing**. Together, these properties make DataFrames reliable and efficient for data integration workflows in CMEM. +A DataFrame combines a schema (structure) with a distributed dataset (content). It is **immutable**: transformations always produce new DataFrames without changing the original. Spark tracks **lineage**, the history of transformations, which allows lost partitions to be recomputed safely. DataFrames are **partitioned** across the cluster, enabling **parallel processing**. Together, these properties make DataFrames reliable and efficient for data integration workflows in CMEM. ###### Computing DataFrames -Operations on DataFrames are **lazy**. Transformations define what to do; computation happens only when an **action** (e.g., `collect`, `write`) is triggered. This ensures efficient computation, preserves immutability, and keeps integration workflows flexible. +Operations on DataFrames are **lazy**. Transformations define what to do; computation happens only when an action (e.g., collect, write) is triggered. This ensures efficient computation, preserves immutability, and keeps integration workflows flexible. ###### Data Sources -DataFrames can be created from **files** (CSV, JSON, Parquet, ORC), **databases** (via JDBC), or **existing RDDs** with an applied schema. Schemas can be provided or inferred dynamically. Writing supports multiple modes (overwrite, append), enabling consistent handling of distributed datasets in CMEM Build. +DataFrames can be created from files (CSV, JSON, Parquet, ORC), databases (via JDBC), or existing RDDs with an applied schema. Schemas can be provided or inferred dynamically. Writing supports multiple modes (overwrite, append), enabling consistent handling of distributed datasets in CMEM Build. 
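As a purely illustrative sketch of these options in plain PySpark, outside of any CMEM-specific configuration, the snippet below shows an inferred schema, an explicitly provided one, and the two write modes mentioned above; all paths and column names are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("datasource-sketch").getOrCreate()

# Option 1: let Spark infer the schema while reading a CSV file.
orders_inferred = (
    spark.read
    .option("header", True)
    .option("inferSchema", True)
    .csv("/data/input/orders.csv")
)

# Option 2: provide the schema explicitly instead of inferring it.
schema = StructType([
    StructField("order_id", StringType()),
    StructField("amount", DoubleType()),
])
orders_typed = spark.read.option("header", True).schema(schema).csv("/data/input/orders.csv")

# Writing supports explicit save modes; Parquet keeps the schema with the data.
orders_typed.write.mode("overwrite").parquet("/data/output/orders")

# With mode("append"), a later batch is added instead of replacing the dataset.
orders_inferred.write.mode("append").parquet("/data/output/orders_raw")
```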
#### Application level @@ -88,24 +88,24 @@ In other words, and metaphorically speaking: The application level is CMEM itsel ### Anatomy of a Spark job -A Spark job consists of **stages**, **tasks**, **shuffles**, and the **DAG**. These elements define how computations are divided and executed across the cluster. +A Spark job consists of stages, tasks, shuffles, and the DAG. These elements define how computations are divided and executed across the cluster. -* A **job** consists of **stages**, each containing multiple **tasks**, which are the units of computation executed per partition. -* A **stage** represents a set of computations that can run **without shuffling data** across nodes. -* A **shuffle** is the **exchange of data between nodes**, required when a stage depends on data from another stage. -* The **DAG** (directed acyclic graph) captures **all dependencies between RDDs**, allowing Spark to plan and optimize execution efficiently. +* A **job** consists of stages, each containing multiple tasks, which are the units of computation executed per partition. +* A **stage** represents a set of computations that can run without shuffling data across nodes. +* A **shuffle** is the exchange of data between nodes, required when a stage depends on data from another stage. +* The **DAG** (directed acyclic graph) captures all dependencies between RDDs, allowing Spark to plan and optimize execution efficiently. -This structure — jobs divided into stages and tasks, connected through the DAG and occasionally requiring shuffles — allows Spark to **schedule work efficiently, parallelize computation across the cluster, and recover lost partitions if a node fails**. Spark’s DAG-based planning also enables **optimizations**, such as minimizing data movement, which improves performance in workflows that combine multiple transformations and actions. +This structure — jobs divided into stages and tasks, connected through the DAG and occasionally requiring shuffles — allows Spark to schedule work efficiently, parallelize computation across the cluster, and recover lost partitions if a node fails. Spark’s DAG-based planning also enables **optimizations**, such as minimizing data movement, which improves performance in workflows that combine multiple transformations and actions. ## Apache Spark within CMEM BUILD ### Why is Apache Spark used by eccenca’s CMEM? -Apache Spark is integrated into **eccenca Corporate Memory (CMEM)** to enable **scalable, distributed execution of data integration workflows** within its **BUILD (DataIntegration)** component. While CMEM’s overall architecture already consists of multiple distributed services (e.g. BUILD for data integration, EXPLORE for knowledge graph management), the _execution_ of workflows in BUILD is typically **centralized**. Spark adds a **parallel, fault-tolerant computation layer** that becomes especially valuable when workflows process **large, complex, or computation-heavy datasets**. +Apache Spark is integrated into CMEM to enable scalable, distributed execution of data integration workflows within its BUILD component. While CMEM’s overall architecture already consists of multiple distributed services (e.g. BUILD for data integration, EXPLORE for knowledge graph management), the _execution_ of workflows in BUILD is typically **centralized**. Spark adds a **parallel, fault-tolerant computation layer** that becomes especially valuable when workflows process large, complex, or computation-heavy datasets. 
-In practical terms, Spark is used **only** in BUILD — not in EXPLORE or other components. Its purpose is to **accelerate workflow execution** for the so-called **Spark-aware datasets**, which include file formats and storage systems such as **Avro, Parquet, ORC, Hive**, and **HDFS**. These formats map naturally to Spark’s distributed processing model and benefit from in-memory execution and partition-based parallelism. +In practical terms, Spark is used only in BUILD — not in EXPLORE or other components. Its purpose is to accelerate workflow execution for the so-called **Spark-aware datasets**, which include file formats and storage systems such as Avro, Parquet, ORC, Hive, and HDFS. These formats map naturally to Spark’s distributed processing model and benefit from in-memory execution and partition-based parallelism. -For other dataset types (e.g. smaller relational sources or local files), Spark execution provides no significant advantage and is not typically used. In such cases, BUILD’s standard local execution engine is sufficient. Spark thus acts as an **optional, performance-oriented backend**, not as a replacement for the standard workflow engine. +For other dataset types (e.g. smaller relational sources or local files), Spark execution provides no significant advantage and is not typically used. In such cases, BUILD’s standard local execution engine is sufficient. Spark thus acts as an optional, performance-oriented backend, not as a replacement for the standard workflow engine. The rationale behind using Spark in BUILD aligns with its general strengths: @@ -114,19 +114,19 @@ The rationale behind using Spark in BUILD aligns with its general strengths: - **Optimization** via Spark’s DAG-based execution planner, minimizing data movement. - **Interoperability** with widely used big data formats (Parquet, ORC, Avro). -By leveraging Spark, CMEM can handle **data integration workflows that would otherwise be constrained by single-node processing limits**, while maintaining compatibility with its semantic and knowledge-graph-oriented ecosystem. However, since Spark support is optional, its usage depends on specific deployment needs and data volumes. +By leveraging Spark, CMEM can handle data integration workflows that would otherwise be constrained by single-node processing limits, while maintaining compatibility with its semantic and knowledge-graph-oriented ecosystem. However, since Spark support is optional, its usage depends on specific deployment needs and data volumes. ### How and where is Apache Spark used by BUILD? -Within the BUILD stage (DataIntegration), Apache Spark is used **exclusively for executing workflows that involve Spark-optimized datasets**. These workflows connect datasets, apply transformations, and produce outputs, with Spark providing a **distributed, in-memory execution engine** that handles large volumes of data and complex computations efficiently. +Within the BUILD stage (DataIntegration), Apache Spark is used exclusively for executing workflows that involve Spark-optimized datasets. These workflows connect datasets, apply transformations, and produce outputs, with Spark providing a distributed, in-memory execution engine that handles large volumes of data and complex computations efficiently. -The Spark-optimized datasets — such as **Avro datasets, Parquet datasets, ORC datasets, and Hive tables** — are designed to leverage Spark’s architecture. 
When included in a workflow, Spark performs transformations **in parallel across partitions** and keeps data **in memory whenever possible**. Other datasets can participate in workflows but typically do **not benefit from Spark’s parallel execution optimizations**. +The Spark-optimized datasets — such as Avro datasets, Parquet datasets, ORC datasets, and Hive tables — are designed to leverage Spark’s architecture. When included in a workflow, Spark performs transformations **in parallel across partitions** and keeps data **in memory whenever possible**. Other datasets can participate in workflows but typically do not benefit from Spark’s parallel execution optimizations. -Optionally, for more technical context: each Spark-optimized dataset corresponds to an **executor-aware entity**. During workflow execution, BUILD translates the workflow graph into **Spark jobs**, where datasets become **RDDs or DataFrames**, transformations become **stages**, and Spark orchestrates execution across the cluster. The results are then **materialized or written back** into CMEM’s storage layer, ready for subsequent workflow steps or integration into the knowledge graph. Users do **not** need to manage executors or partitions manually. +Optionally, for more technical context: each Spark-optimized dataset corresponds to an executor-aware entity. During workflow execution, BUILD translates the workflow graph into Spark jobs, where datasets become RDDs or DataFrames, transformations become stages, and Spark orchestrates execution across the cluster. The results are then materialized or written back into CMEM’s storage layer, ready for subsequent workflow steps or integration into the knowledge graph. Users do not need to manage executors or partitions manually. ### What are the Spark-optimized datasets? -In BUILD, **Spark-optimized datasets** are those data sources designed to fully leverage Spark’s distributed, in-memory execution engine. These datasets are structured to enable **parallelized transformations, efficient partitioning, and integration into workflows** without requiring manual management of computation or storage. +In BUILD, Spark-optimized datasets are those data sources designed to fully leverage Spark’s distributed, in-memory execution engine. These datasets are structured to enable parallelized transformations, efficient partitioning, and integration into workflows without requiring manual management of computation or storage. The main types of Spark-optimized datasets include: @@ -137,24 +137,24 @@ The main types of Spark-optimized datasets include: - **HDFS datasets** — file-based, row-oriented datasets stored in Hadoop Distributed File System, optimized for partitioned, parallel processing. - **JSON datasets** — semi-structured, Spark-aware datasets enabling flexible schema inference and in-memory transformations. - **JDBC / relational datasets** — external relational sources exposed to Spark via JDBC, queryable and transformable as DataFrames. -- **Embedded SQL Endpoint** — workflow results published as **virtual SQL tables**, queryable via JDBC/ODBC without persistent storage, optionally cached in memory. +- **Embedded SQL Endpoint** — workflow results published as virtual SQL tables, queryable via JDBC/ODBC without persistent storage, optionally cached in memory. 
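For orientation only, the sketch below shows how plain Spark would open a few of these source types; it is not the CMEM plugin API, which configures such access for you inside BUILD workflows. All paths, table names and connection details are hypothetical; the Avro read additionally requires the external spark-avro package, and the Hive read assumes a configured metastore.

```python
from pyspark.sql import SparkSession

# Hive table access requires a session with Hive support and a configured metastore.
spark = (
    SparkSession.builder
    .appName("spark-aware-sources-sketch")
    .enableHiveSupport()
    .getOrCreate()
)

# Columnar files: Parquet and ORC readers are built into Spark.
parquet_df = spark.read.parquet("hdfs:///warehouse/customers.parquet")
orc_df = spark.read.orc("hdfs:///warehouse/orders.orc")

# Avro needs the external spark-avro package on the classpath.
avro_df = spark.read.format("avro").load("hdfs:///warehouse/events.avro")

# A Hive table registered in the metastore can be read by name.
hive_df = spark.read.table("sales.transactions")

# Relational sources are exposed as DataFrames via JDBC.
jdbc_df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://db.example.org:5432/crm")
    .option("dbtable", "public.contacts")
    .option("user", "reader")
    .option("password", "change-me")
    .load()
)
```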
-When a workflow includes any of these datasets, Spark executes transformations **in parallel across partitions** and **keeps intermediate results in memory whenever possible**, accelerating performance for complex or large-scale data integration tasks. +When a workflow includes any of these datasets, Spark executes transformations in parallel across partitions and keeps intermediate results in memory whenever possible, accelerating performance for complex or large-scale data integration tasks. -Other datasets in BUILD (e.g., relational sources, local files, or non-Spark-aware formats) can participate in workflows, but they **do not benefit from Spark’s parallel execution**. Spark execution remains **optional**, used only when performance gains are meaningful or when workflow complexity demands distributed processing. +Other datasets in BUILD (e.g., relational sources, local files, or non-Spark-aware formats) can participate in workflows, but they do not benefit from Spark’s parallel execution. Spark execution remains optional, used only when performance gains are meaningful or when workflow complexity demands distributed processing. ### What is the relation between BUILD’s Spark-aware workflows and the Knowledge Graph? -BUILD’s Spark-aware workflows operate on datasets **within BUILD**, executing transformations and producing outputs in a distributed, in-memory manner. The Knowledge Graph, managed by EXPLORE, serves as the **persistent semantic storage layer**, but Spark itself does **not directly interact** with the graph. Instead, the **workflow execution engine** orchestrates the movement of data between Spark-aware datasets and the Knowledge Graph, ensuring that transformations are applied in the correct sequence and that results are persisted appropriately. +BUILD’s Spark-aware workflows operate on datasets within BUILD, executing transformations and producing outputs in a distributed, in-memory manner. The Knowledge Graph, managed by EXPLORE, serves as the persistent semantic storage layer, but Spark itself does not directly interact with the graph. Instead, the **workflow execution engine** orchestrates the movement of data between Spark-aware datasets and the Knowledge Graph, ensuring that transformations are applied in the correct sequence and that results are persisted appropriately. -This separation of concerns allows Spark to focus on **high-performance computation** without being constrained by the architecture or APIs of the Knowledge Graph, or the rest of CMEM's architecture around it. Data can flow into workflows from various sources and ultimately be integrated into the graph, but the execution engine mediates this process, handling **dependencies, scheduling, and parallelism**. Users benefit from the efficiency of Spark while maintaining the integrity and consistency of the graph as the central repository of integrated knowledge. +This separation of concerns allows Spark to focus on high-performance computation without being constrained by the architecture or APIs of the Knowledge Graph, or the rest of CMEM's architecture around it. Data can flow into workflows from various sources and ultimately be integrated into the graph, but the execution engine mediates this process, handling dependencies, scheduling, and parallelism. Users benefit from the efficiency of Spark while maintaining the integrity and consistency of the graph as the central repository of integrated knowledge. 
-From a conceptual perspective, the relation is therefore **indirect but essential**: Spark-aware workflows accelerate the processing of large or complex datasets, while the Knowledge Graph ensures that the processed data is **semantically harmonized and persistently stored**. Together, they enable CMEM to combine **flexible, distributed computation** with **knowledge-centric integration**, supporting a wide range of enterprise data integration use cases without requiring users to manage low-level execution details. +From a conceptual perspective, the relation is therefore indirect but essential: Spark-aware workflows accelerate the processing of large or complex datasets, while the Knowledge Graph ensures that the processed data is semantically harmonized and persistently stored. Together, they enable CMEM to combine flexible, distributed computation with knowledge-centric integration, supporting a wide range of enterprise data integration use cases without requiring users to manage low-level execution details. ### What is the relation between Spark-aware dataset plugins and other BUILD plugins? Spark-aware dataset plugins are a specialized subset of dataset plugins that integrate seamlessly into BUILD workflows. They implement the same source-and-sink interfaces as all other plugins, allowing workflows to connect Spark-aware datasets, traditional datasets, and transformations without additional configuration. -These plugins include not only the core Spark-optimized datasets (Avro, Parquet, ORC, Hive, HDFS) but also other Spark-aware plugins such as **JSON and JDBC sources**, providing consistent behavior and integration across a wide range of data types and endpoints. Spark-aware plugins can be combined with any other plugin in a workflow, with the execution engine automatically leveraging Spark where beneficial, so users do not need to manage parallelism or execution details themselves. +These plugins include not only the core Spark-optimized datasets (Avro, Parquet, ORC, Hive, HDFS) but also other Spark-aware plugins such as JSON and JDBC sources, providing consistent behavior and integration across a wide range of data types and endpoints. Spark-aware plugins can be combined with any other plugin in a workflow, with the execution engine automatically leveraging Spark where beneficial, so users do not need to manage parallelism or execution details themselves. The execution engine coordinates all plugins, orchestrating Spark-based processing where appropriate while ensuring overall workflow consistency and integration with CMEM’s storage layer. Certain optimizations, like lazy evaluation and optional caching, exist internally to improve performance, though users interact with the workflows in the same unified interface as with other datasets. From 936f94e2942414ad855a9d67684dc5c2209bdbc2 Mon Sep 17 00:00:00 2001 From: Eduard Fugarolas Date: Tue, 13 Jan 2026 15:59:00 +0100 Subject: [PATCH 08/13] More consistent layout. --- docs/build/spark/index.md | 38 +++++++++++++++++++------------------- 1 file changed, 19 insertions(+), 19 deletions(-) diff --git a/docs/build/spark/index.md b/docs/build/spark/index.md index db610924c..82e43e4c2 100644 --- a/docs/build/spark/index.md +++ b/docs/build/spark/index.md @@ -8,7 +8,7 @@ tags: ## Introduction -This documentation provides a detailed explanation of Apache Spark and its integration within eccenca’s Corporate Memory (CMEM) BUILD platform. 
The goal is to provide a **conceptual understanding of Spark**, its purpose in BUILD, and how workflows leverage Spark-aware datasets for efficient, distributed data processing. +This documentation provides a detailed explanation of Apache Spark and its integration within eccenca’s Corporate Memory (CMEM) BUILD platform. The goal is to provide a conceptual understanding of Spark, its purpose in BUILD, and how workflows leverage Spark-aware datasets for efficient, distributed data processing. The documentation is structured in three parts: @@ -18,7 +18,7 @@ The documentation is structured in three parts: ## What is Apache Spark? -[Apache Spark](https://spark.apache.org/) is a unified **computing engine** and set of libraries for distributed data processing at scale. It is specifically used in the domains of data engineering, data science, and machine learning. +[Apache Spark](https://spark.apache.org/) is a unified **computing engine** and set of libraries for **distributed data processing** at scale. It is specifically used in the domains of data engineering, data science, and machine learning. The main data processing use-cases of Apache Spark are: @@ -29,7 +29,7 @@ The main data processing use-cases of Apache Spark are: * graph processing * etc. (functionalities stemming from hundreds of plugins) -By itself, Apache Spark is _detached from any data and Input/Output (IO) operations_. More formally: Apache Spark requires a [cluster manager](https://en.wikipedia.org/wiki/Cluster_manager "Cluster manager") and a [distributed storage system](https://en.wikipedia.org/wiki/Clustered_file_system "Clustered file system"). One possible realization of these requirements, for the **distributed storage** part, is to combine Apache Spark with [Apache Hive](https://hive.apache.org/) ―a distributed data warehouse―. For the **cluster management** part, there are also several possibilities, as can be explored in the [cluster mode overview documentation](https://spark.apache.org/docs/latest/cluster-overview.html). One such possibility ―and the one we recommend― is to use [Apache Hadoop YARN](https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/YARN.html). YARN handles the _resource management_ and the _job scheduling_ within the cluster. The connection point between YARN and Spark can be explored [here](https://spark.apache.org/docs/latest/running-on-yarn.html) further. More specifically, within eccenca's Corporate Memory (CMEM) environment, the relevant configuration is documented in the [Spark configuration](https://documentation.eccenca.com/latest/deploy-and-configure/configuration/dataintegration/#spark-configuration) of eccenca BUILD. +By itself, Apache Spark is detached from any data and Input/Output (IO) operations. More formally: Apache Spark requires a [cluster manager](https://en.wikipedia.org/wiki/Cluster_manager "Cluster manager") and a [distributed storage system](https://en.wikipedia.org/wiki/Clustered_file_system "Clustered file system"). One possible realization of these requirements, for the distributed storage part, is to combine Apache Spark with [Apache Hive](https://hive.apache.org/) ―a distributed data warehouse―. For the cluster management part, there are also several possibilities, as can be explored in the [cluster mode overview documentation](https://spark.apache.org/docs/latest/cluster-overview.html). One such possibility ―and the one we recommend― is to use [Apache Hadoop YARN](https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/YARN.html). 
YARN handles the resource management and the _job scheduling_ within the cluster. The connection point between YARN and Spark can be explored [here](https://spark.apache.org/docs/latest/running-on-yarn.html) further. More specifically, within eccenca's Corporate Memory (CMEM) environment, the relevant configuration is documented in the [Spark configuration](https://documentation.eccenca.com/latest/deploy-and-configure/configuration/dataintegration/#spark-configuration) of eccenca BUILD. ## How does Spark work? @@ -38,18 +38,18 @@ By itself, Apache Spark is _detached from any data and Input/Output (IO) operati There are ―in general terms― _three_ different layers or levels of abstraction within Spark: * the **low-level APIs**: RDDs (resilient distributed datasets), shared variables -* the **high-level** or structured APIs: DataFrames, Datasets, SparkSQL +* the **high-level APIs**: DataFrames, Datasets, SparkSQL * the **application level**: (Structured) Streaming, MLib, GraphX, etc. #### Low-level API At the lowest abstraction level, Spark provides the abstraction of a **resilient distributed dataset**, an [RDD](https://spark.apache.org/docs/latest/rdd-programming-guide.html). As stated, Spark is detached from any data and IO operations, and the abstraction of the RDD embodies this exact principle. In practice, the most common physical source of data for an RDD is a file in a Hadoop file system, the [HDFS](https://www.databricks.com/glossary/hadoop-distributed-file-system-hdfs) (Hadoop Distributed File System). -Conceptually, it is important to be aware of the following distinction: Apache Spark does **in-memory** computations. Hadoop handles the distributed _files_, and Spark the distributed _processing_. Spark, YARN and HDFS have therefore orthogonal but cohesive concerns: computation, scheduling, persistence. +Conceptually, it is important to be aware of the following distinction: Apache Spark does in-memory computations. Hadoop handles the distributed files, and Spark the distributed processing. Spark, YARN and HDFS have therefore orthogonal but cohesive concerns: computation, scheduling, persistence. Additionally to the in-memory aspect of computations with RDDs, the RDD itself is **immutable**. This is an established practice in functional programming ―which inspired the core processing functionalities and mechanisms of systems such as Spark and Hadoop―, especially in the context of parallel processing and distributed computing. Immutability does _not_ imply that the processing of data structures is inefficient compared to working with mutable data, since [persistent data structures](https://en.wikipedia.org/wiki/Persistent_data_structure) are used. That said, evidently each type of datastructure has its use-cases and trade-offs. -Spark can be seen as bridging distributed computing paradigms. [Hadoop](https://hadoop.apache.org/) and its MapReduce operate on disk-based storage, processing data in batches. Spark shifts computation **into memory** and treats data as immutable, enabling stateless transformations across partitions. [Flink](https://flink.apache.org/) goes further by supporting stateful, continuously updated computations, suited for complex streaming workloads. +Spark can be seen as bridging distributed computing paradigms. [Hadoop](https://hadoop.apache.org/) and its MapReduce operate on disk-based storage, processing data in batches. Spark shifts computation into memory and treats data as immutable, enabling stateless transformations across partitions. 
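To make the immutability point concrete, here is a minimal, self-contained PySpark sketch with invented numbers, run locally: a transformation never modifies the collection it is applied to, it only derives a new one.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("immutability-sketch").master("local[4]").getOrCreate()
sc = spark.sparkContext

numbers = sc.parallelize(range(10), numSlices=4)  # an immutable, partitioned collection
doubled = numbers.map(lambda x: x * 2)            # a *new* RDD; `numbers` itself is untouched

print(numbers.collect())  # [0, 1, ..., 9], unchanged
print(doubled.collect())  # [0, 2, ..., 18]
spark.stop()
```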
[Flink](https://flink.apache.org/) goes further by supporting stateful, continuously updated computations, suited for complex streaming workloads. #### High-level API @@ -70,7 +70,7 @@ A DataFrame is a high-level abstraction over an RDD, combining a distributed dat ###### How DataFrames Work -A DataFrame combines a schema (structure) with a distributed dataset (content). It is **immutable**: transformations always produce new DataFrames without changing the original. Spark tracks **lineage**, the history of transformations, which allows lost partitions to be recomputed safely. DataFrames are **partitioned** across the cluster, enabling **parallel processing**. Together, these properties make DataFrames reliable and efficient for data integration workflows in CMEM. +A DataFrame combines a schema (structure) with a distributed dataset (content). It is **immutable**: transformations always produce new DataFrames without changing the original. Spark tracks lineage, the history of transformations, which allows lost partitions to be recomputed safely. DataFrames are partitioned across the cluster, enabling parallel processing. Together, these properties make DataFrames reliable and efficient for data integration workflows in CMEM. ###### Computing DataFrames @@ -95,13 +95,13 @@ A Spark job consists of stages, tasks, shuffles, and the DAG. These elements def * A **shuffle** is the exchange of data between nodes, required when a stage depends on data from another stage. * The **DAG** (directed acyclic graph) captures all dependencies between RDDs, allowing Spark to plan and optimize execution efficiently. -This structure — jobs divided into stages and tasks, connected through the DAG and occasionally requiring shuffles — allows Spark to schedule work efficiently, parallelize computation across the cluster, and recover lost partitions if a node fails. Spark’s DAG-based planning also enables **optimizations**, such as minimizing data movement, which improves performance in workflows that combine multiple transformations and actions. +This structure — jobs divided into stages and tasks, connected through the DAG and occasionally requiring shuffles — allows Spark to schedule work efficiently, parallelize computation across the cluster, and recover lost partitions if a node fails. Spark’s DAG-based planning also enables optimizations, such as minimizing data movement, which improves performance in workflows that combine multiple transformations and actions. ## Apache Spark within CMEM BUILD ### Why is Apache Spark used by eccenca’s CMEM? -Apache Spark is integrated into CMEM to enable scalable, distributed execution of data integration workflows within its BUILD component. While CMEM’s overall architecture already consists of multiple distributed services (e.g. BUILD for data integration, EXPLORE for knowledge graph management), the _execution_ of workflows in BUILD is typically **centralized**. Spark adds a **parallel, fault-tolerant computation layer** that becomes especially valuable when workflows process large, complex, or computation-heavy datasets. +Apache Spark is integrated into CMEM to enable scalable, distributed execution of data integration workflows within its BUILD component. While CMEM’s overall architecture already consists of multiple distributed services (e.g. BUILD for data integration, EXPLORE for knowledge graph management), the execution of workflows in BUILD is typically centralized. 
Spark adds a **parallel, fault-tolerant computation layer** that becomes especially valuable when workflows process large, complex, or computation-heavy datasets. In practical terms, Spark is used only in BUILD — not in EXPLORE or other components. Its purpose is to accelerate workflow execution for the so-called **Spark-aware datasets**, which include file formats and storage systems such as Avro, Parquet, ORC, Hive, and HDFS. These formats map naturally to Spark’s distributed processing model and benefit from in-memory execution and partition-based parallelism. @@ -118,9 +118,9 @@ By leveraging Spark, CMEM can handle data integration workflows that would other ### How and where is Apache Spark used by BUILD? -Within the BUILD stage (DataIntegration), Apache Spark is used exclusively for executing workflows that involve Spark-optimized datasets. These workflows connect datasets, apply transformations, and produce outputs, with Spark providing a distributed, in-memory execution engine that handles large volumes of data and complex computations efficiently. +Within the BUILD stage, Apache Spark is used exclusively for executing workflows that involve Spark-optimized datasets. These workflows connect datasets, apply transformations, and produce outputs, with Spark providing a distributed, in-memory execution engine that handles large volumes of data and complex computations efficiently. -The Spark-optimized datasets — such as Avro datasets, Parquet datasets, ORC datasets, and Hive tables — are designed to leverage Spark’s architecture. When included in a workflow, Spark performs transformations **in parallel across partitions** and keeps data **in memory whenever possible**. Other datasets can participate in workflows but typically do not benefit from Spark’s parallel execution optimizations. +The Spark-optimized datasets — such as Avro datasets, Parquet datasets, ORC datasets, and Hive tables — are designed to leverage Spark’s architecture. When included in a workflow, Spark performs transformations in parallel across partitions and keeps data in memory whenever possible. Other datasets can participate in workflows but typically do not benefit from Spark’s parallel execution optimizations. Optionally, for more technical context: each Spark-optimized dataset corresponds to an executor-aware entity. During workflow execution, BUILD translates the workflow graph into Spark jobs, where datasets become RDDs or DataFrames, transformations become stages, and Spark orchestrates execution across the cluster. The results are then materialized or written back into CMEM’s storage layer, ready for subsequent workflow steps or integration into the knowledge graph. Users do not need to manage executors or partitions manually. @@ -130,14 +130,14 @@ In BUILD, Spark-optimized datasets are those data sources designed to fully leve The main types of Spark-optimized datasets include: -- **Avro datasets** — columnar, self-describing file format optimized for Spark’s in-memory processing. -- **Parquet datasets** — highly efficient columnar storage format that supports predicate pushdown and column pruning. -- **ORC datasets** — optimized row-columnar format commonly used in Hadoop ecosystems, enabling fast scans and compression. -- **Hive tables** — structured tables stored in Hadoop-compatible formats, which can be queried and transformed via Spark seamlessly. -- **HDFS datasets** — file-based, row-oriented datasets stored in Hadoop Distributed File System, optimized for partitioned, parallel processing. 
-- **JSON datasets** — semi-structured, Spark-aware datasets enabling flexible schema inference and in-memory transformations. -- **JDBC / relational datasets** — external relational sources exposed to Spark via JDBC, queryable and transformable as DataFrames. -- **Embedded SQL Endpoint** — workflow results published as virtual SQL tables, queryable via JDBC/ODBC without persistent storage, optionally cached in memory. +- Avro datasets — columnar, self-describing file format optimized for Spark’s in-memory processing. +- Parquet datasets — highly efficient columnar storage format that supports predicate pushdown and column pruning. +- ORC datasets — optimized row-columnar format commonly used in Hadoop ecosystems, enabling fast scans and compression. +- Hive tables — structured tables stored in Hadoop-compatible formats, which can be queried and transformed via Spark seamlessly. +- HDFS datasets — file-based, row-oriented datasets stored in Hadoop Distributed File System, optimized for partitioned, parallel processing. +- JSON datasets — semi-structured, Spark-aware datasets enabling flexible schema inference and in-memory transformations. +- JDBC / relational datasets — external relational sources exposed to Spark via JDBC, queryable and transformable as DataFrames. +- Embedded SQL Endpoint — workflow results published as virtual SQL tables, queryable via JDBC/ODBC without persistent storage, optionally cached in memory. When a workflow includes any of these datasets, Spark executes transformations in parallel across partitions and keeps intermediate results in memory whenever possible, accelerating performance for complex or large-scale data integration tasks. From 32706d32b1d075458411d495a0cde7c53abad3c0 Mon Sep 17 00:00:00 2001 From: Eduard Fugarolas Date: Tue, 13 Jan 2026 16:39:56 +0100 Subject: [PATCH 09/13] Refine the Apache Spark within CMEM BUILD explainer by normalizing boldface as one-time conceptual markers and improving bold coverage/consistency in the BUILD-focused half, without changing the overall structure or meaning. --- docs/build/spark/index.md | 48 +++++++++++++++++++-------------------- 1 file changed, 24 insertions(+), 24 deletions(-) diff --git a/docs/build/spark/index.md b/docs/build/spark/index.md index 82e43e4c2..b8a510b11 100644 --- a/docs/build/spark/index.md +++ b/docs/build/spark/index.md @@ -18,7 +18,7 @@ The documentation is structured in three parts: ## What is Apache Spark? -[Apache Spark](https://spark.apache.org/) is a unified **computing engine** and set of libraries for **distributed data processing** at scale. It is specifically used in the domains of data engineering, data science, and machine learning. +[Apache Spark](https://spark.apache.org/) is a unified computing engine and set of libraries for distributed data processing at scale. It is specifically used in the domains of data engineering, data science, and machine learning. The main data processing use-cases of Apache Spark are: @@ -29,25 +29,25 @@ The main data processing use-cases of Apache Spark are: * graph processing * etc. (functionalities stemming from hundreds of plugins) -By itself, Apache Spark is detached from any data and Input/Output (IO) operations. More formally: Apache Spark requires a [cluster manager](https://en.wikipedia.org/wiki/Cluster_manager "Cluster manager") and a [distributed storage system](https://en.wikipedia.org/wiki/Clustered_file_system "Clustered file system"). 
One possible realization of these requirements, for the distributed storage part, is to combine Apache Spark with [Apache Hive](https://hive.apache.org/) ―a distributed data warehouse―. For the cluster management part, there are also several possibilities, as can be explored in the [cluster mode overview documentation](https://spark.apache.org/docs/latest/cluster-overview.html). One such possibility ―and the one we recommend― is to use [Apache Hadoop YARN](https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/YARN.html). YARN handles the resource management and the _job scheduling_ within the cluster. The connection point between YARN and Spark can be explored [here](https://spark.apache.org/docs/latest/running-on-yarn.html) further. More specifically, within eccenca's Corporate Memory (CMEM) environment, the relevant configuration is documented in the [Spark configuration](https://documentation.eccenca.com/latest/deploy-and-configure/configuration/dataintegration/#spark-configuration) of eccenca BUILD. +By itself, Apache Spark is detached from any data and Input/Output (IO) operations. More formally: Apache Spark requires a [**cluster manager**](https://en.wikipedia.org/wiki/Cluster_manager "Cluster manager") and a [**distributed storage system**](https://en.wikipedia.org/wiki/Clustered_file_system "Clustered file system"). One possible realization of these requirements, for the distributed storage part, is to combine Apache Spark with [Apache Hive](https://hive.apache.org/) ―a distributed data warehouse―. For the cluster management part, there are also several possibilities, as can be explored in the [cluster mode overview documentation](https://spark.apache.org/docs/latest/cluster-overview.html). One such possibility ―and the one we recommend― is to use [Apache Hadoop YARN](https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/YARN.html). YARN handles the resource management and the _job scheduling_ within the cluster. The connection point between YARN and Spark can be explored [here](https://spark.apache.org/docs/latest/running-on-yarn.html) further. More specifically, within eccenca's Corporate Memory (CMEM) environment, the relevant configuration is documented in the [Spark configuration](https://documentation.eccenca.com/latest/deploy-and-configure/configuration/dataintegration/#spark-configuration) of eccenca BUILD. ## How does Spark work? ### Spark's Architecture -There are ―in general terms― _three_ different layers or levels of abstraction within Spark: +There are ―in general terms― three different layers or levels of abstraction within Spark: -* the **low-level APIs**: RDDs (resilient distributed datasets), shared variables -* the **high-level APIs**: DataFrames, Datasets, SparkSQL -* the **application level**: (Structured) Streaming, MLib, GraphX, etc. +* the low-level APIs: RDDs (resilient distributed datasets), shared variables +* the high-level APIs: DataFrames, Datasets, SparkSQL +* the application level: (Structured) Streaming, MLib, GraphX, etc. #### Low-level API -At the lowest abstraction level, Spark provides the abstraction of a **resilient distributed dataset**, an [RDD](https://spark.apache.org/docs/latest/rdd-programming-guide.html). As stated, Spark is detached from any data and IO operations, and the abstraction of the RDD embodies this exact principle. 
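As a small, hedged illustration of that detachment: the same RDD abstraction is obtained whether the data lives in the driver's memory or in a file; the file path below is a made-up example, not a real endpoint, which is why it is left commented out.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-sketch").master("local[*]").getOrCreate()
sc = spark.sparkContext

# From an in-memory collection ...
words = sc.parallelize(["knowledge", "graph", "integration"])

# ... or from a (hypothetical) file; HDFS, object storage or local paths
# all yield the same abstraction:
# lines = sc.textFile("hdfs:///data/example/input.txt")

print(words.map(len).collect())  # [9, 5, 11]
spark.stop()
```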
In practice, the most common physical source of data for an RDD is a file in a Hadoop file system, the [HDFS](https://www.databricks.com/glossary/hadoop-distributed-file-system-hdfs) (Hadoop Distributed File System). +At the lowest abstraction level, Spark provides the abstraction of a **resilient distributed dataset** ([RDD](https://spark.apache.org/docs/latest/rdd-programming-guide.html)). As stated, Spark is detached from any data and IO operations, and the abstraction of the RDD embodies this exact principle. In practice, the most common physical source of data for an RDD is a file in a Hadoop file system, the [HDFS](https://www.databricks.com/glossary/hadoop-distributed-file-system-hdfs) (Hadoop Distributed File System). Conceptually, it is important to be aware of the following distinction: Apache Spark does in-memory computations. Hadoop handles the distributed files, and Spark the distributed processing. Spark, YARN and HDFS have therefore orthogonal but cohesive concerns: computation, scheduling, persistence. -Additionally to the in-memory aspect of computations with RDDs, the RDD itself is **immutable**. This is an established practice in functional programming ―which inspired the core processing functionalities and mechanisms of systems such as Spark and Hadoop―, especially in the context of parallel processing and distributed computing. Immutability does _not_ imply that the processing of data structures is inefficient compared to working with mutable data, since [persistent data structures](https://en.wikipedia.org/wiki/Persistent_data_structure) are used. That said, evidently each type of datastructure has its use-cases and trade-offs. +Additionally to the in-memory aspect of computations with RDDs, the RDD itself is immutable. This is an established practice in functional programming ―which inspired the core processing functionalities and mechanisms of systems such as Spark and Hadoop―, especially in the context of parallel processing and distributed computing. Immutability does _not_ imply that the processing of data structures is inefficient compared to working with mutable data, since [persistent data structures](https://en.wikipedia.org/wiki/Persistent_data_structure) are used. That said, evidently each type of datastructure has its use-cases and trade-offs. Spark can be seen as bridging distributed computing paradigms. [Hadoop](https://hadoop.apache.org/) and its MapReduce operate on disk-based storage, processing data in batches. Spark shifts computation into memory and treats data as immutable, enabling stateless transformations across partitions. [Flink](https://flink.apache.org/) goes further by supporting stateful, continuously updated computations, suited for complex streaming workloads. @@ -58,7 +58,7 @@ The basis for in-memory computations in Spark is, thus, the RDD. As such, the RD * A **Dataset** is a _strongly-typed_ distributed collection of data. * A **DataFrame** is a _weakly-typed_ distributed collection of data. -Technically speaking, a DataFrame is nothing else than a **dataset of rows**. Here, a row is to be understood in the same sense as in the rows of a relational database table. Conceptually, the main difference is that a Dataset "knows" which types of objects it stores or contains, whereas a DataFrame doesn't. +Technically speaking, a DataFrame is nothing else than a dataset of rows. Here, a row is to be understood in the same sense as in the rows of a relational database table. 
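A short PySpark sketch of that idea, with illustrative data; note that the strongly-typed Dataset API exists only for Scala and Java, so Python code always works at the DataFrame level.

```python
from pyspark.sql import SparkSession, Row

spark = SparkSession.builder.appName("dataframe-rows-sketch").master("local[*]").getOrCreate()

df = spark.createDataFrame([Row(id=1, name="Alice"), Row(id=2, name="Bob")])

df.printSchema()      # the schema (column names and types) belongs to the DataFrame
first = df.first()    # each element is just a Row, an untyped bag of named fields
print(first["name"])  # "Alice"
spark.stop()
```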
Conceptually, the main difference is that a Dataset "knows" which types of objects it stores or contains, whereas a DataFrame doesn't. In our case, the relevant abstractions are the RDD and the DataFrame. In a general, domain-agnostic data integration system such as eccenca CMEM, there is no use-case for a strongly-typed version of a distributed collection of data (that would require knowledge of the application and business domains, integrated into CMEM itself). In other words: The usage of DataFrames aligns perfectly with the general, flexible and dynamic data integration tasks of CMEM and the corresponding workflow execution, of which Spark is an optional but optimal part. @@ -66,15 +66,15 @@ In our case, the relevant abstractions are the RDD and the DataFrame. In a gener ###### What a DataFrame Really Is -A DataFrame is a high-level abstraction over an RDD, combining a distributed dataset with a **schema**. The schema defines column names, data types, and the structure of fields. Notice that a schema is applied to DataFrames, not to rows; each row is itself untyped. +A DataFrame is a high-level abstraction over an RDD, combining a distributed dataset with a schema. The schema defines column names, data types, and the structure of fields. Notice that a schema is applied to DataFrames, not to rows; each row is itself untyped. ###### How DataFrames Work -A DataFrame combines a schema (structure) with a distributed dataset (content). It is **immutable**: transformations always produce new DataFrames without changing the original. Spark tracks lineage, the history of transformations, which allows lost partitions to be recomputed safely. DataFrames are partitioned across the cluster, enabling parallel processing. Together, these properties make DataFrames reliable and efficient for data integration workflows in CMEM. +A DataFrame combines a schema (structure) with a distributed dataset (content). It is immutable: transformations always produce new DataFrames without changing the original. Spark tracks lineage, the history of transformations, which allows lost partitions to be recomputed safely. DataFrames are partitioned across the cluster, enabling parallel processing. Together, these properties make DataFrames reliable and efficient for data integration workflows in CMEM. ###### Computing DataFrames -Operations on DataFrames are **lazy**. Transformations define what to do; computation happens only when an action (e.g., collect, write) is triggered. This ensures efficient computation, preserves immutability, and keeps integration workflows flexible. +Operations on DataFrames are lazy. Transformations define what to do; computation happens only when an action (e.g., collect, write) is triggered. This ensures efficient computation, preserves immutability, and keeps integration workflows flexible. ###### Data Sources @@ -103,7 +103,7 @@ This structure — jobs divided into stages and tasks, connected through the DAG Apache Spark is integrated into CMEM to enable scalable, distributed execution of data integration workflows within its BUILD component. While CMEM’s overall architecture already consists of multiple distributed services (e.g. BUILD for data integration, EXPLORE for knowledge graph management), the execution of workflows in BUILD is typically centralized. Spark adds a **parallel, fault-tolerant computation layer** that becomes especially valuable when workflows process large, complex, or computation-heavy datasets. 
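The kind of step that profits from this layer is easy to picture. The following standalone PySpark sketch uses toy data and invented column names and is not a CMEM workflow definition; it shows a join plus aggregation, exactly the sort of operation Spark spreads across partitions.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("workflow-step-sketch").master("local[*]").getOrCreate()

orders = spark.createDataFrame(
    [(1, "c1", 20.0), (2, "c2", 35.5), (3, "c1", 10.0)],
    ["order_id", "customer_id", "amount"],
)
customers = spark.createDataFrame([("c1", "DE"), ("c2", "FR")], ["customer_id", "country"])

# Join and aggregate: partitions are processed in parallel, intermediate data stays in memory.
revenue = (
    orders.join(customers, "customer_id")
    .groupBy("country")
    .agg(F.sum("amount").alias("revenue"))
)
revenue.show()
spark.stop()
```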
-In practical terms, Spark is used only in BUILD — not in EXPLORE or other components. Its purpose is to accelerate workflow execution for the so-called **Spark-aware datasets**, which include file formats and storage systems such as Avro, Parquet, ORC, Hive, and HDFS. These formats map naturally to Spark’s distributed processing model and benefit from in-memory execution and partition-based parallelism. +In practical terms, Spark is used only in BUILD — not in EXPLORE or other components. Its purpose is to accelerate workflow execution for the so-called **Spark-aware datasets**, i.e., datasets executed via Spark, which include file formats and storage systems such as Avro, Parquet, ORC, Hive, and HDFS. These formats map naturally to Spark’s distributed processing model and benefit from in-memory execution and partition-based parallelism. For other dataset types (e.g. smaller relational sources or local files), Spark execution provides no significant advantage and is not typically used. In such cases, BUILD’s standard local execution engine is sufficient. Spark thus acts as an optional, performance-oriented backend, not as a replacement for the standard workflow engine. @@ -118,26 +118,26 @@ By leveraging Spark, CMEM can handle data integration workflows that would other ### How and where is Apache Spark used by BUILD? -Within the BUILD stage, Apache Spark is used exclusively for executing workflows that involve Spark-optimized datasets. These workflows connect datasets, apply transformations, and produce outputs, with Spark providing a distributed, in-memory execution engine that handles large volumes of data and complex computations efficiently. +Within the BUILD stage, Apache Spark is used exclusively for executing workflows that involve Spark-optimized datasets. These workflows connect datasets, apply transformations, and produce outputs, with Spark providing a **distributed, in-memory execution engine** that handles large volumes of data and complex computations efficiently. The Spark-optimized datasets — such as Avro datasets, Parquet datasets, ORC datasets, and Hive tables — are designed to leverage Spark’s architecture. When included in a workflow, Spark performs transformations in parallel across partitions and keeps data in memory whenever possible. Other datasets can participate in workflows but typically do not benefit from Spark’s parallel execution optimizations. -Optionally, for more technical context: each Spark-optimized dataset corresponds to an executor-aware entity. During workflow execution, BUILD translates the workflow graph into Spark jobs, where datasets become RDDs or DataFrames, transformations become stages, and Spark orchestrates execution across the cluster. The results are then materialized or written back into CMEM’s storage layer, ready for subsequent workflow steps or integration into the knowledge graph. Users do not need to manage executors or partitions manually. +Optionally, for more technical context: each Spark-optimized dataset corresponds to an **executor-aware entity**. During workflow execution, BUILD translates the **workflow graph** into Spark jobs, where datasets become RDDs or DataFrames, transformations become stages, and Spark orchestrates execution across the cluster. The results are then materialized or written back into CMEM’s storage layer, ready for subsequent workflow steps or integration into the knowledge graph. Users do not need to manage executors or partitions manually. ### What are the Spark-optimized datasets? 
-In BUILD, Spark-optimized datasets are those data sources designed to fully leverage Spark’s distributed, in-memory execution engine. These datasets are structured to enable parallelized transformations, efficient partitioning, and integration into workflows without requiring manual management of computation or storage. +In BUILD, **Spark-optimized datasets** are those data sources designed to fully leverage Spark’s distributed, in-memory execution engine. These datasets are structured to enable parallelized transformations, efficient partitioning, and integration into workflows without requiring manual management of computation or storage. The main types of Spark-optimized datasets include: -- Avro datasets — columnar, self-describing file format optimized for Spark’s in-memory processing. -- Parquet datasets — highly efficient columnar storage format that supports predicate pushdown and column pruning. -- ORC datasets — optimized row-columnar format commonly used in Hadoop ecosystems, enabling fast scans and compression. -- Hive tables — structured tables stored in Hadoop-compatible formats, which can be queried and transformed via Spark seamlessly. -- HDFS datasets — file-based, row-oriented datasets stored in Hadoop Distributed File System, optimized for partitioned, parallel processing. -- JSON datasets — semi-structured, Spark-aware datasets enabling flexible schema inference and in-memory transformations. -- JDBC / relational datasets — external relational sources exposed to Spark via JDBC, queryable and transformable as DataFrames. -- Embedded SQL Endpoint — workflow results published as virtual SQL tables, queryable via JDBC/ODBC without persistent storage, optionally cached in memory. +- **Avro datasets** — columnar, self-describing file format optimized for Spark’s in-memory processing. +- **Parquet datasets** — highly efficient columnar storage format that supports predicate pushdown and column pruning. +- **ORC datasets** — optimized row-columnar format commonly used in Hadoop ecosystems, enabling fast scans and compression. +- **Hive tables** — structured tables stored in Hadoop-compatible formats, which can be queried and transformed via Spark seamlessly. +- **HDFS datasets** — file-based, row-oriented datasets stored in Hadoop Distributed File System, optimized for partitioned, parallel processing. +- **JSON datasets** — semi-structured, Spark-aware datasets enabling flexible schema inference and in-memory transformations. +- **JDBC / relational datasets** — external relational sources exposed to Spark via JDBC, queryable and transformable as DataFrames. +- **Embedded SQL Endpoint** — workflow results published as virtual SQL tables, queryable via JDBC/ODBC without persistent storage, optionally cached in memory. When a workflow includes any of these datasets, Spark executes transformations in parallel across partitions and keeps intermediate results in memory whenever possible, accelerating performance for complex or large-scale data integration tasks. From c95d3e3512724627aef4500f71fdd3aa33105ea6 Mon Sep 17 00:00:00 2001 From: Eduard Fugarolas Date: Tue, 27 Jan 2026 10:01:26 +0100 Subject: [PATCH 10/13] Rewrite why, how and where section. 
--- docs/build/spark/index.md | 87 +++++++++++++++++++-------------------- 1 file changed, 42 insertions(+), 45 deletions(-) diff --git a/docs/build/spark/index.md b/docs/build/spark/index.md index b8a510b11..54aeba27c 100644 --- a/docs/build/spark/index.md +++ b/docs/build/spark/index.md @@ -8,28 +8,27 @@ tags: ## Introduction -This documentation provides a detailed explanation of Apache Spark and its integration within eccenca’s Corporate Memory (CMEM) BUILD platform. The goal is to provide a conceptual understanding of Spark, its purpose in BUILD, and how workflows leverage Spark-aware datasets for efficient, distributed data processing. +This documentation provides a detailed explanation of Apache Spark and its integration within Corporate Memory’s BUILD platform. The goal is to provide a conceptual understanding of Spark, its purpose in BUILD, and how workflows leverage Spark-aware datasets for efficient, distributed data processing. The documentation is structured in three parts: -1. What Apache Spark is -2. How Apache Spark works -3. How Apache Spark is used in CMEM +1. What is Apache Spark? +2. How does Apache Spark work? +3. Why is Spark used in CMEM? Where is it used in BUILD? ## What is Apache Spark? -[Apache Spark](https://spark.apache.org/) is a unified computing engine and set of libraries for distributed data processing at scale. It is specifically used in the domains of data engineering, data science, and machine learning. - The main data processing use-cases of Apache Spark are: -* data loading -* SQL queries -* machine learning -* streaming -* graph processing -* etc. (functionalities stemming from hundreds of plugins) +* data loading, +* SQL queries, +* machine learning, +* streaming, +* graph processing. + +Additionally, there are other functionalities stemming from hundreds of plugins. -By itself, Apache Spark is detached from any data and Input/Output (IO) operations. More formally: Apache Spark requires a [**cluster manager**](https://en.wikipedia.org/wiki/Cluster_manager "Cluster manager") and a [**distributed storage system**](https://en.wikipedia.org/wiki/Clustered_file_system "Clustered file system"). One possible realization of these requirements, for the distributed storage part, is to combine Apache Spark with [Apache Hive](https://hive.apache.org/) ―a distributed data warehouse―. For the cluster management part, there are also several possibilities, as can be explored in the [cluster mode overview documentation](https://spark.apache.org/docs/latest/cluster-overview.html). One such possibility ―and the one we recommend― is to use [Apache Hadoop YARN](https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/YARN.html). YARN handles the resource management and the _job scheduling_ within the cluster. The connection point between YARN and Spark can be explored [here](https://spark.apache.org/docs/latest/running-on-yarn.html) further. More specifically, within eccenca's Corporate Memory (CMEM) environment, the relevant configuration is documented in the [Spark configuration](https://documentation.eccenca.com/latest/deploy-and-configure/configuration/dataintegration/#spark-configuration) of eccenca BUILD. +By itself, Apache Spark is detached from any data and Input/Output (IO) operations. More formally: Apache Spark requires a [**cluster manager**](https://en.wikipedia.org/wiki/Cluster_manager "Cluster manager") and a [**distributed storage system**](https://en.wikipedia.org/wiki/Clustered_file_system "Clustered file system"). 
One possible realization of these requirements, for the distributed storage part, is to combine Apache Spark with [Apache Hive](https://hive.apache.org/) ―a distributed data warehouse―. For the cluster management part, there are also several possibilities, as can be explored in the [cluster mode overview documentation](https://spark.apache.org/docs/latest/cluster-overview.html). One such possibility ―and the one we recommend― is to use [Apache Hadoop YARN](https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/YARN.html). YARN handles the resource management and the job scheduling within the cluster. The connection point between YARN and Spark can be explored [here](https://spark.apache.org/docs/latest/running-on-yarn.html) further. More specifically, within Corporate Memory, the relevant configuration is documented in the [Spark configuration](https://documentation.eccenca.com/latest/deploy-and-configure/configuration/dataintegration/#spark-configuration). ## How does Spark work? @@ -37,9 +36,9 @@ By itself, Apache Spark is detached from any data and Input/Output (IO) operatio There are ―in general terms― three different layers or levels of abstraction within Spark: -* the low-level APIs: RDDs (resilient distributed datasets), shared variables +* the low-level APIs: RDDs, shared variables * the high-level APIs: DataFrames, Datasets, SparkSQL -* the application level: (Structured) Streaming, MLib, GraphX, etc. +* the application level: Streaming, MLib, GraphX, etc. #### Low-level API @@ -47,20 +46,20 @@ At the lowest abstraction level, Spark provides the abstraction of a **resilient Conceptually, it is important to be aware of the following distinction: Apache Spark does in-memory computations. Hadoop handles the distributed files, and Spark the distributed processing. Spark, YARN and HDFS have therefore orthogonal but cohesive concerns: computation, scheduling, persistence. -Additionally to the in-memory aspect of computations with RDDs, the RDD itself is immutable. This is an established practice in functional programming ―which inspired the core processing functionalities and mechanisms of systems such as Spark and Hadoop―, especially in the context of parallel processing and distributed computing. Immutability does _not_ imply that the processing of data structures is inefficient compared to working with mutable data, since [persistent data structures](https://en.wikipedia.org/wiki/Persistent_data_structure) are used. That said, evidently each type of datastructure has its use-cases and trade-offs. +Additionally to the in-memory aspect of computations with RDDs, the RDD itself is immutable. This is an established practice in functional programming ―which inspired the core processing functionalities and mechanisms of systems such as Spark and Hadoop―, especially in the context of parallel processing and distributed computing. Immutability does _not_ imply that the processing of data structures is inefficient compared to working with mutable data, since [persistent data structures](https://en.wikipedia.org/wiki/Persistent_data_structure) are used. That said, evidently each type of data structure has its use-cases and trade-offs. Spark can be seen as bridging distributed computing paradigms. [Hadoop](https://hadoop.apache.org/) and its MapReduce operate on disk-based storage, processing data in batches. Spark shifts computation into memory and treats data as immutable, enabling stateless transformations across partitions. 
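A minimal local sketch of such a stateless, partition-wise transformation (toy numbers, nothing CMEM-specific): each partition is handled by a pure function that needs no state from any other partition.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-sketch").master("local[4]").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(range(1, 17), numSlices=4)  # 4 partitions, processed independently

def partial_sum(rows):
    # A pure function of one partition's rows; no shared state between partitions.
    yield sum(rows)

print(rdd.mapPartitions(partial_sum).collect())  # one partial result per partition
spark.stop()
```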
[Flink](https://flink.apache.org/) goes further by supporting stateful, continuously updated computations, suited for complex streaming workloads. #### High-level API -The basis for in-memory computations in Spark is, thus, the RDD. As such, the RDD is the central abstraction in Spark for the concept of _distributed data_. The chronological development of Spark's APIs also reflects this: The RDD was introduced in 2011, then in 2013 and 2015 two higher-level abstractions were introduced, each building upon the previous. These two abstractions are the DataFrame (2013) and the Dataset (2015). The main difference is the following: +The basis for in-memory computations in Spark is, thus, the RDD. As such, the RDD is the central abstraction in Spark for the concept of _distributed data_. The chronological development of Spark's APIs also reflects this: The RDD was introduced in 2011, then two higher-level abstractions were introduced in 2013 and 2015, each building upon the previous. These two abstractions are the DataFrame (2013) and the Dataset (2015). The main difference is the following: * A **Dataset** is a _strongly-typed_ distributed collection of data. * A **DataFrame** is a _weakly-typed_ distributed collection of data. -Technically speaking, a DataFrame is nothing else than a dataset of rows. Here, a row is to be understood in the same sense as in the rows of a relational database table. Conceptually, the main difference is that a Dataset "knows" which types of objects it stores or contains, whereas a DataFrame doesn't. +Technically speaking, a DataFrame is nothing else than a dataset of rows. Here, a row is to be understood in the same sense as in the rows of a relational database table. Conceptually, the main difference is that a Dataset “knows” which types of objects it stores or contains, whereas a DataFrame doesn't. -In our case, the relevant abstractions are the RDD and the DataFrame. In a general, domain-agnostic data integration system such as eccenca CMEM, there is no use-case for a strongly-typed version of a distributed collection of data (that would require knowledge of the application and business domains, integrated into CMEM itself). In other words: The usage of DataFrames aligns perfectly with the general, flexible and dynamic data integration tasks of CMEM and the corresponding workflow execution, of which Spark is an optional but optimal part. +In our case, the relevant abstractions are the RDD and the DataFrame. In a general, domain-agnostic data integration system such as CMEM, there is no use-case for a strongly-typed version of a distributed collection of data (that would require knowledge of the application and business domains, integrated into CMEM itself). In other words: The usage of DataFrames aligns perfectly with the general, flexible and dynamic data integration tasks of CMEM and the corresponding workflow execution, of which Spark is an optional but optimal part. ##### DataFrames @@ -74,17 +73,17 @@ A DataFrame combines a schema (structure) with a distributed dataset (content). ###### Computing DataFrames -Operations on DataFrames are lazy. Transformations define what to do; computation happens only when an action (e.g., collect, write) is triggered. This ensures efficient computation, preserves immutability, and keeps integration workflows flexible. +Operations on DataFrames are lazy. Transformations define what to do; computation happens only when an action (e.g., collect, write) is triggered. 
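A compact PySpark sketch of this behaviour, with invented numbers and run locally: the two transformations below are only recorded; Spark touches the data once the final action runs.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("lazy-sketch").master("local[*]").getOrCreate()

df = spark.range(1_000_000)                               # nothing is computed yet
filtered = df.filter(F.col("id") % 7 == 0)                # transformation: plan building only
enriched = filtered.withColumn("twice", F.col("id") * 2)  # still no execution

print(enriched.count())  # the action triggers the whole pipeline in one go
spark.stop()
```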
This ensures efficient computation and keeps integration workflows flexible. ###### Data Sources -DataFrames can be created from files (CSV, JSON, Parquet, ORC), databases (via JDBC), or existing RDDs with an applied schema. Schemas can be provided or inferred dynamically. Writing supports multiple modes (overwrite, append), enabling consistent handling of distributed datasets in CMEM Build. +DataFrames can be created from files, databases or existing RDDs with an applied schema. Schemas can be provided or inferred dynamically. Writing supports multiple modes (overwrite, append), enabling consistent handling of distributed datasets in CMEM Build. #### Application level -At the application level, Apache Spark provides further abstractions and functionalities such as structured streaming, machine learning and (in-memory) graph processing with GraphX. These are interesting to know about, but they are not relevant for the integration of Apache Spark within eccenca CMEM. +At the application level, Apache Spark provides further abstractions and functionalities such as structured streaming, machine learning and (in-memory) graph processing with GraphX. These are interesting to know about, but they are not relevant for the integration of Apache Spark within CMEM. -In other words, and metaphorically speaking: The application level is CMEM itself, which makes use of Spark. This brings us to the follow-up questions regarding the usage of Spark _within_ Corporate Memory, which are described further below. +In other words, and metaphorically speaking: The application level is CMEM itself, which makes use of Spark. This brings us to the follow-up questions regarding the usage of Spark within CMEM, which are described further below. ### Anatomy of a Spark job @@ -97,17 +96,13 @@ A Spark job consists of stages, tasks, shuffles, and the DAG. These elements def This structure — jobs divided into stages and tasks, connected through the DAG and occasionally requiring shuffles — allows Spark to schedule work efficiently, parallelize computation across the cluster, and recover lost partitions if a node fails. Spark’s DAG-based planning also enables optimizations, such as minimizing data movement, which improves performance in workflows that combine multiple transformations and actions. -## Apache Spark within CMEM BUILD +## Apache Spark within BUILD -### Why is Apache Spark used by eccenca’s CMEM? +### Why is Apache Spark used in CMEM? Apache Spark is integrated into CMEM to enable scalable, distributed execution of data integration workflows within its BUILD component. While CMEM’s overall architecture already consists of multiple distributed services (e.g. BUILD for data integration, EXPLORE for knowledge graph management), the execution of workflows in BUILD is typically centralized. Spark adds a **parallel, fault-tolerant computation layer** that becomes especially valuable when workflows process large, complex, or computation-heavy datasets. -In practical terms, Spark is used only in BUILD — not in EXPLORE or other components. Its purpose is to accelerate workflow execution for the so-called **Spark-aware datasets**, i.e., datasets executed via Spark, which include file formats and storage systems such as Avro, Parquet, ORC, Hive, and HDFS. These formats map naturally to Spark’s distributed processing model and benefit from in-memory execution and partition-based parallelism. - -For other dataset types (e.g. 
smaller relational sources or local files), Spark execution provides no significant advantage and is not typically used. In such cases, BUILD’s standard local execution engine is sufficient. Spark thus acts as an optional, performance-oriented backend, not as a replacement for the standard workflow engine. - -The rationale behind using Spark in BUILD aligns with its general strengths: +The rationale behind using Spark aligns with its general strengths: - **Parallelization and scalability** for high-volume transformations and joins. - **Fault tolerance** through resilient distributed datasets (RDDs). @@ -118,17 +113,19 @@ By leveraging Spark, CMEM can handle data integration workflows that would other ### How and where is Apache Spark used by BUILD? -Within the BUILD stage, Apache Spark is used exclusively for executing workflows that involve Spark-optimized datasets. These workflows connect datasets, apply transformations, and produce outputs, with Spark providing a **distributed, in-memory execution engine** that handles large volumes of data and complex computations efficiently. +Within the BUILD stage, Apache Spark is used exclusively for executing workflows that involve **Spark-aware datasets**. These workflows connect datasets, apply transformations, and produce outputs, with Spark providing a **distributed, in-memory execution engine** that handles large volumes of data and complex computations efficiently. -The Spark-optimized datasets — such as Avro datasets, Parquet datasets, ORC datasets, and Hive tables — are designed to leverage Spark’s architecture. When included in a workflow, Spark performs transformations in parallel across partitions and keeps data in memory whenever possible. Other datasets can participate in workflows but typically do not benefit from Spark’s parallel execution optimizations. +The Spark-aware datasets include file formats and storage systems such as Avro, Parquet, ORC, Hive, and HDFS. These formats map naturally to Spark’s distributed processing model and benefit from in-memory execution and partition-based parallelism. -Optionally, for more technical context: each Spark-optimized dataset corresponds to an **executor-aware entity**. During workflow execution, BUILD translates the **workflow graph** into Spark jobs, where datasets become RDDs or DataFrames, transformations become stages, and Spark orchestrates execution across the cluster. The results are then materialized or written back into CMEM’s storage layer, ready for subsequent workflow steps or integration into the knowledge graph. Users do not need to manage executors or partitions manually. +For other dataset types (e.g. smaller relational sources or local files), Spark execution provides no significant advantage and is not typically used. In such cases, BUILD’s standard local execution engine is sufficient. Spark thus acts as an optional, performance-oriented backend, not as a replacement for the standard workflow engine. -### What are the Spark-optimized datasets? +Each Spark-aware dataset corresponds to an **executor-aware entity**. During workflow execution, BUILD translates the **workflow graph** into Spark jobs, where datasets become RDDs or DataFrames, transformations become stages, and Spark orchestrates execution across the cluster. The results are then materialized or written back into CMEM’s storage layer, ready for subsequent workflow steps or integration into the knowledge graph. Users do not need to manage executors or partitions manually. 
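That translation is internal to BUILD, but its Spark-level shape can be sketched independently. In the hedged, standalone example below (toy data, no CMEM APIs involved), a grouped aggregation produces a physical plan with an exchange, i.e. the shuffle boundary at which Spark splits the job into stages.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("plan-sketch").master("local[*]").getOrCreate()

events = spark.range(100_000).withColumn("key", F.col("id") % 10)

# Grouping forces data exchange between partitions, so Spark plans a separate stage for it.
per_key = events.groupBy("key").agg(F.count("id").alias("n"))

per_key.explain()  # prints the physical plan, including the Exchange (shuffle) step
per_key.show()
spark.stop()
```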
-In BUILD, **Spark-optimized datasets** are those data sources designed to fully leverage Spark’s distributed, in-memory execution engine. These datasets are structured to enable parallelized transformations, efficient partitioning, and integration into workflows without requiring manual management of computation or storage. +### What are the Spark-aware datasets? -The main types of Spark-optimized datasets include: +In BUILD, **Spark-aware datasets** are those data sources designed to fully leverage Spark’s distributed, in-memory execution engine. These datasets are structured to enable parallelized transformations, efficient partitioning, and integration into workflows without requiring manual management of computation or storage. + +The main types of Spark-aware datasets include: - **Avro datasets** — columnar, self-describing file format optimized for Spark’s in-memory processing. - **Parquet datasets** — highly efficient columnar storage format that supports predicate pushdown and column pruning. @@ -136,18 +133,14 @@ The main types of Spark-optimized datasets include: - **Hive tables** — structured tables stored in Hadoop-compatible formats, which can be queried and transformed via Spark seamlessly. - **HDFS datasets** — file-based, row-oriented datasets stored in Hadoop Distributed File System, optimized for partitioned, parallel processing. - **JSON datasets** — semi-structured, Spark-aware datasets enabling flexible schema inference and in-memory transformations. -- **JDBC / relational datasets** — external relational sources exposed to Spark via JDBC, queryable and transformable as DataFrames. -- **Embedded SQL Endpoint** — workflow results published as virtual SQL tables, queryable via JDBC/ODBC without persistent storage, optionally cached in memory. - -When a workflow includes any of these datasets, Spark executes transformations in parallel across partitions and keeps intermediate results in memory whenever possible, accelerating performance for complex or large-scale data integration tasks. - -Other datasets in BUILD (e.g., relational sources, local files, or non-Spark-aware formats) can participate in workflows, but they do not benefit from Spark’s parallel execution. Spark execution remains optional, used only when performance gains are meaningful or when workflow complexity demands distributed processing. +- **JDBC datasets** — external relational sources exposed to Spark via JDBC, queryable and transformable as DataFrames. +- **Embedded SQL Endpoint** — workflow results published as virtual SQL tables, queryable via JDBC or ODBC without persistent storage, optionally cached in memory. ### What is the relation between BUILD’s Spark-aware workflows and the Knowledge Graph? BUILD’s Spark-aware workflows operate on datasets within BUILD, executing transformations and producing outputs in a distributed, in-memory manner. The Knowledge Graph, managed by EXPLORE, serves as the persistent semantic storage layer, but Spark itself does not directly interact with the graph. Instead, the **workflow execution engine** orchestrates the movement of data between Spark-aware datasets and the Knowledge Graph, ensuring that transformations are applied in the correct sequence and that results are persisted appropriately. -This separation of concerns allows Spark to focus on high-performance computation without being constrained by the architecture or APIs of the Knowledge Graph, or the rest of CMEM's architecture around it. 
Data can flow into workflows from various sources and ultimately be integrated into the graph, but the execution engine mediates this process, handling dependencies, scheduling, and parallelism. Users benefit from the efficiency of Spark while maintaining the integrity and consistency of the graph as the central repository of integrated knowledge. +This separation of concerns allows Spark to focus on high-performance computation without being constrained by the architecture or APIs of the Knowledge Graph, or the rest of CMEM's architecture around it. Data can flow into workflows from various sources and ultimately be integrated into the graph, while the execution engine mediates this process, handling dependencies, scheduling, and parallelism. Users benefit from the efficiency of Spark while maintaining the integrity and consistency of the graph as the central repository of integrated knowledge. From a conceptual perspective, the relation is therefore indirect but essential: Spark-aware workflows accelerate the processing of large or complex datasets, while the Knowledge Graph ensures that the processed data is semantically harmonized and persistently stored. Together, they enable CMEM to combine flexible, distributed computation with knowledge-centric integration, supporting a wide range of enterprise data integration use cases without requiring users to manage low-level execution details. @@ -155,6 +148,10 @@ From a conceptual perspective, the relation is therefore indirect but essential: Spark-aware dataset plugins are a specialized subset of dataset plugins that integrate seamlessly into BUILD workflows. They implement the same source-and-sink interfaces as all other plugins, allowing workflows to connect Spark-aware datasets, traditional datasets, and transformations without additional configuration. -These plugins include not only the core Spark-optimized datasets (Avro, Parquet, ORC, Hive, HDFS) but also other Spark-aware plugins such as JSON and JDBC sources, providing consistent behavior and integration across a wide range of data types and endpoints. Spark-aware plugins can be combined with any other plugin in a workflow, with the execution engine automatically leveraging Spark where beneficial, so users do not need to manage parallelism or execution details themselves. +These plugins include not only the core Spark-aware datasets (Avro, Parquet, ORC, Hive, HDFS) but also other Spark-aware plugins such as JSON and JDBC sources, providing consistent behavior and integration across a wide range of data types and endpoints. Spark-aware plugins can be combined with any other plugin in a workflow, with the execution engine automatically leveraging Spark where beneficial. + +## Summary + +This document explained what Apache Spark is, how it processes data through its core abstractions, and why Spark appears in CMEM specifically in the BUILD component. The key boundary is architectural: Spark provides the execution engine for distributed processing, BUILD defines and orchestrates workflows, and EXPLORE remains the semantic persistence layer. Spark therefore does not interact with the Knowledge Graph directly; it is used by BUILD for workflow execution, while the workflow engine controls when results are written out and how they feed into subsequent steps. -The execution engine coordinates all plugins, orchestrating Spark-based processing where appropriate while ensuring overall workflow consistency and integration with CMEM’s storage layer. 
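One concrete form such internal optimization can take is caching: at the Spark level, an intermediate DataFrame can be kept in memory so that several later actions reuse it instead of recomputing its whole lineage. A minimal, self-contained sketch with toy data, not CMEM's API:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("cache-sketch").master("local[*]").getOrCreate()

base = spark.range(500_000).withColumn("bucket", F.col("id") % 5)
shared = base.filter(F.col("id") % 2 == 0).cache()  # keep this intermediate result in memory

print(shared.count())                    # the first action populates the cache
shared.groupBy("bucket").count().show()  # later actions reuse the cached data
spark.stop()
```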
Certain optimizations, like lazy evaluation and optional caching, exist internally to improve performance, though users interact with the workflows in the same unified interface as with other datasets. +Within BUILD, Spark matters primarily when workflows operate on Spark-aware datasets. Those datasets align with Spark’s distributed processing model, which is why Spark can execute transformations across partitions, recompute lost work if a node fails, and handle larger volumes of data without falling back to single-node execution. For other dataset types or small workloads, workflows typically run without Spark and remain within BUILD’s standard execution path. From 51f5b8cf7d4fb0f0c13af61d1b8ca72e0a7f4898 Mon Sep 17 00:00:00 2001 From: Eduard Fugarolas Date: Tue, 27 Jan 2026 10:01:50 +0100 Subject: [PATCH 11/13] Rewrite why, how and where section. --- docs/build/spark/index.md | 6 +----- 1 file changed, 1 insertion(+), 5 deletions(-) diff --git a/docs/build/spark/index.md b/docs/build/spark/index.md index 54aeba27c..7ad026693 100644 --- a/docs/build/spark/index.md +++ b/docs/build/spark/index.md @@ -113,9 +113,7 @@ By leveraging Spark, CMEM can handle data integration workflows that would other ### How and where is Apache Spark used by BUILD? -Within the BUILD stage, Apache Spark is used exclusively for executing workflows that involve **Spark-aware datasets**. These workflows connect datasets, apply transformations, and produce outputs, with Spark providing a **distributed, in-memory execution engine** that handles large volumes of data and complex computations efficiently. - -The Spark-aware datasets include file formats and storage systems such as Avro, Parquet, ORC, Hive, and HDFS. These formats map naturally to Spark’s distributed processing model and benefit from in-memory execution and partition-based parallelism. +Within the BUILD stage, Apache Spark is used exclusively for executing workflows that involve **Spark-aware datasets**. These workflows connect datasets, apply transformations, and produce outputs, with Spark providing a **distributed execution engine** that handles large volumes of data and complex computations efficiently. For other dataset types (e.g. smaller relational sources or local files), Spark execution provides no significant advantage and is not typically used. In such cases, BUILD’s standard local execution engine is sufficient. Spark thus acts as an optional, performance-oriented backend, not as a replacement for the standard workflow engine. @@ -123,8 +121,6 @@ Each Spark-aware dataset corresponds to an **executor-aware entity**. During wor ### What are the Spark-aware datasets? -In BUILD, **Spark-aware datasets** are those data sources designed to fully leverage Spark’s distributed, in-memory execution engine. These datasets are structured to enable parallelized transformations, efficient partitioning, and integration into workflows without requiring manual management of computation or storage. - The main types of Spark-aware datasets include: - **Avro datasets** — columnar, self-describing file format optimized for Spark’s in-memory processing. 
From 222188b9c703a8aa414fd38e5845ab797e6a16d2 Mon Sep 17 00:00:00 2001 From: Eduard Fugarolas Date: Tue, 27 Jan 2026 10:11:56 +0100 Subject: [PATCH 12/13] ' --- docs/build/spark/index.md | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/docs/build/spark/index.md b/docs/build/spark/index.md index 7ad026693..59b56ca84 100644 --- a/docs/build/spark/index.md +++ b/docs/build/spark/index.md @@ -32,7 +32,7 @@ By itself, Apache Spark is detached from any data and Input/Output (IO) operatio ## How does Spark work? -### Spark's Architecture +### Spark’s Architecture There are ―in general terms― three different layers or levels of abstraction within Spark: @@ -52,12 +52,12 @@ Spark can be seen as bridging distributed computing paradigms. [Hadoop](https:// #### High-level API -The basis for in-memory computations in Spark is, thus, the RDD. As such, the RDD is the central abstraction in Spark for the concept of _distributed data_. The chronological development of Spark's APIs also reflects this: The RDD was introduced in 2011, then two higher-level abstractions were introduced in 2013 and 2015, each building upon the previous. These two abstractions are the DataFrame (2013) and the Dataset (2015). The main difference is the following: +The basis for in-memory computations in Spark is, thus, the RDD. As such, the RDD is the central abstraction in Spark for the concept of _distributed data_. The chronological development of Spark’s APIs also reflects this: The RDD was introduced in 2011, then two higher-level abstractions were introduced in 2013 and 2015, each building upon the previous. These two abstractions are the DataFrame (2013) and the Dataset (2015). The main difference is the following: * A **Dataset** is a _strongly-typed_ distributed collection of data. * A **DataFrame** is a _weakly-typed_ distributed collection of data. -Technically speaking, a DataFrame is nothing else than a dataset of rows. Here, a row is to be understood in the same sense as in the rows of a relational database table. Conceptually, the main difference is that a Dataset “knows” which types of objects it stores or contains, whereas a DataFrame doesn't. +Technically speaking, a DataFrame is nothing else than a dataset of rows. Here, a row is to be understood in the same sense as in the rows of a relational database table. Conceptually, the main difference is that a Dataset “knows” which types of objects it stores or contains, whereas a DataFrame doesn’t. In our case, the relevant abstractions are the RDD and the DataFrame. In a general, domain-agnostic data integration system such as CMEM, there is no use-case for a strongly-typed version of a distributed collection of data (that would require knowledge of the application and business domains, integrated into CMEM itself). In other words: The usage of DataFrames aligns perfectly with the general, flexible and dynamic data integration tasks of CMEM and the corresponding workflow execution, of which Spark is an optional but optimal part. @@ -134,9 +134,9 @@ The main types of Spark-aware datasets include: ### What is the relation between BUILD’s Spark-aware workflows and the Knowledge Graph? -BUILD’s Spark-aware workflows operate on datasets within BUILD, executing transformations and producing outputs in a distributed, in-memory manner. The Knowledge Graph, managed by EXPLORE, serves as the persistent semantic storage layer, but Spark itself does not directly interact with the graph. 
Instead, the **workflow execution engine** orchestrates the movement of data between Spark-aware datasets and the Knowledge Graph, ensuring that transformations are applied in the correct sequence and that results are persisted appropriately. +The Spark-aware workflows operate on datasets within BUILD, executing transformations and producing outputs. The Knowledge Graph, managed by EXPLORE, serves as the persistent semantic storage layer, but Spark itself does not directly interact with the graph. Instead, the **workflow execution engine** orchestrates the movement of data between Spark-aware datasets and the Knowledge Graph, ensuring that transformations are applied in the correct sequence and that results are persisted appropriately. -This separation of concerns allows Spark to focus on high-performance computation without being constrained by the architecture or APIs of the Knowledge Graph, or the rest of CMEM's architecture around it. Data can flow into workflows from various sources and ultimately be integrated into the graph, while the execution engine mediates this process, handling dependencies, scheduling, and parallelism. Users benefit from the efficiency of Spark while maintaining the integrity and consistency of the graph as the central repository of integrated knowledge. +This separation of concerns allows Spark to focus on high-performance computation without being constrained by the architecture or APIs of the Knowledge Graph, or the rest of CMEM’s architecture around it. Data can flow into workflows from various sources and ultimately be integrated into the graph, while the execution engine mediates this process, handling dependencies, scheduling, and parallelism. Users benefit from the efficiency of Spark while maintaining the integrity and consistency of the graph as the central repository of integrated knowledge. From a conceptual perspective, the relation is therefore indirect but essential: Spark-aware workflows accelerate the processing of large or complex datasets, while the Knowledge Graph ensures that the processed data is semantically harmonized and persistently stored. Together, they enable CMEM to combine flexible, distributed computation with knowledge-centric integration, supporting a wide range of enterprise data integration use cases without requiring users to manage low-level execution details. From 8e8ea82a968f695d02c55149d8b49cc51963078e Mon Sep 17 00:00:00 2001 From: Eduard Fugarolas Date: Fri, 30 Jan 2026 16:27:31 +0100 Subject: [PATCH 13/13] Final cleanup of doc. --- docs/build/spark/index.md | 16 +++++++--------- 1 file changed, 7 insertions(+), 9 deletions(-) diff --git a/docs/build/spark/index.md b/docs/build/spark/index.md index 59b56ca84..eae49eb49 100644 --- a/docs/build/spark/index.md +++ b/docs/build/spark/index.md @@ -46,7 +46,7 @@ At the lowest abstraction level, Spark provides the abstraction of a **resilient Conceptually, it is important to be aware of the following distinction: Apache Spark does in-memory computations. Hadoop handles the distributed files, and Spark the distributed processing. Spark, YARN and HDFS have therefore orthogonal but cohesive concerns: computation, scheduling, persistence. -Additionally to the in-memory aspect of computations with RDDs, the RDD itself is immutable. This is an established practice in functional programming ―which inspired the core processing functionalities and mechanisms of systems such as Spark and Hadoop―, especially in the context of parallel processing and distributed computing. 
Immutability does _not_ imply that the processing of data structures is inefficient compared to working with mutable data, since [persistent data structures](https://en.wikipedia.org/wiki/Persistent_data_structure) are used. That said, evidently each type of data structure has its use-cases and trade-offs. +In addition to the in-memory aspect of computations with RDDs, the RDD itself is immutable. This is an established practice in functional programming ―which inspired the core processing functionalities and mechanisms of systems such as Spark and Hadoop―, especially in the context of parallel processing and distributed computing. Immutability does _not_ imply that the processing of data structures is inefficient compared to working with mutable data, since [persistent data structures](https://en.wikipedia.org/wiki/Persistent_data_structure) are used. That said, evidently each type of data structure has its use-cases and trade-offs. Spark can be seen as bridging distributed computing paradigms. [Hadoop](https://hadoop.apache.org/) and its MapReduce operate on disk-based storage, processing data in batches. Spark shifts computation into memory and treats data as immutable, enabling stateless transformations across partitions. [Flink](https://flink.apache.org/) goes further by supporting stateful, continuously updated computations, suited for complex streaming workloads. @@ -117,9 +117,9 @@ Within the BUILD stage, Apache Spark is used exclusively for executing workflows For other dataset types (e.g. smaller relational sources or local files), Spark execution provides no significant advantage and is not typically used. In such cases, BUILD’s standard local execution engine is sufficient. Spark thus acts as an optional, performance-oriented backend, not as a replacement for the standard workflow engine. -Each Spark-aware dataset corresponds to an **executor-aware entity**. During workflow execution, BUILD translates the **workflow graph** into Spark jobs, where datasets become RDDs or DataFrames, transformations become stages, and Spark orchestrates execution across the cluster. The results are then materialized or written back into CMEM’s storage layer, ready for subsequent workflow steps or integration into the knowledge graph. Users do not need to manage executors or partitions manually. +Each Spark-aware dataset corresponds to an **executor-aware entity**. During workflow execution, BUILD translates the **workflow graph** into Spark jobs, where datasets become RDDs or DataFrames, transformations become stages, and Spark orchestrates execution across the cluster. The results are then materialized or written back into CMEM’s storage layer, ready for subsequent workflow steps or integration into the knowledge graph. -### What are the Spark-aware datasets? +### Types of Spark-aware datasets The main types of Spark-aware datasets include: @@ -136,18 +136,16 @@ The main types of Spark-aware datasets include: The Spark-aware workflows operate on datasets within BUILD, executing transformations and producing outputs. The Knowledge Graph, managed by EXPLORE, serves as the persistent semantic storage layer, but Spark itself does not directly interact with the graph. Instead, the **workflow execution engine** orchestrates the movement of data between Spark-aware datasets and the Knowledge Graph, ensuring that transformations are applied in the correct sequence and that results are persisted appropriately. 
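The immutability, lazy evaluation, and lineage-based recovery described earlier for RDDs can be observed directly in a few lines of plain PySpark. The sketch below uses synthetic data and an arbitrary partition count; it is generic Spark usage, not CMEM-specific code.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("rdd-immutability-sketch")
    .master("local[4]")
    .getOrCreate()
)
sc = spark.sparkContext

# A distributed, immutable collection spread over four partitions.
numbers = sc.parallelize(range(1, 1_000_001), numSlices=4)

# Transformations never modify `numbers`; each returns a new RDD that only
# records its lineage, i.e. how to recompute it from its parent.
squares = numbers.map(lambda x: x * x)
even_squares = squares.filter(lambda x: x % 2 == 0)

# Nothing has executed so far (lazy evaluation). The action below runs the
# chain across the partitions; a lost partition would be recomputed from
# lineage rather than by restarting the whole job.
print(even_squares.count(), "even squares in", numbers.getNumPartitions(), "partitions")

spark.stop()
```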
-This separation of concerns allows Spark to focus on high-performance computation without being constrained by the architecture or APIs of the Knowledge Graph, or the rest of CMEM’s architecture around it. Data can flow into workflows from various sources and ultimately be integrated into the graph, while the execution engine mediates this process, handling dependencies, scheduling, and parallelism. Users benefit from the efficiency of Spark while maintaining the integrity and consistency of the graph as the central repository of integrated knowledge. - -From a conceptual perspective, the relation is therefore indirect but essential: Spark-aware workflows accelerate the processing of large or complex datasets, while the Knowledge Graph ensures that the processed data is semantically harmonized and persistently stored. Together, they enable CMEM to combine flexible, distributed computation with knowledge-centric integration, supporting a wide range of enterprise data integration use cases without requiring users to manage low-level execution details. +This separation of concerns allows Spark to focus on high-performance computation without being constrained by the architecture or APIs of the Knowledge Graph, or the rest of CMEM’s architecture around it. Data can flow into workflows from various sources and ultimately be integrated into the graph, while the execution engine mediates this process, handling dependencies, scheduling, and parallelism. ### What is the relation between Spark-aware dataset plugins and other BUILD plugins? -Spark-aware dataset plugins are a specialized subset of dataset plugins that integrate seamlessly into BUILD workflows. They implement the same source-and-sink interfaces as all other plugins, allowing workflows to connect Spark-aware datasets, traditional datasets, and transformations without additional configuration. +Spark-aware dataset plugins are a specialized subset of dataset plugins that integrate seamlessly into BUILD workflows. They implement the same source-and-sink interfaces as all other plugins, allowing workflows to connect Spark-aware datasets, traditional datasets, and transformations. -These plugins include not only the core Spark-aware datasets (Avro, Parquet, ORC, Hive, HDFS) but also other Spark-aware plugins such as JSON and JDBC sources, providing consistent behavior and integration across a wide range of data types and endpoints. Spark-aware plugins can be combined with any other plugin in a workflow, with the execution engine automatically leveraging Spark where beneficial. +These plugins also cover JSON and JDBC sources, providing consistent behavior and integration across a wide range of data types and endpoints. Spark-aware plugins can be combined with any other plugin in a workflow, with the execution engine automatically leveraging Spark where beneficial. ## Summary -This document explained what Apache Spark is, how it processes data through its core abstractions, and why Spark appears in CMEM specifically in the BUILD component. The key boundary is architectural: Spark provides the execution engine for distributed processing, BUILD defines and orchestrates workflows, and EXPLORE remains the semantic persistence layer. Spark therefore does not interact with the Knowledge Graph directly; it is used by BUILD for workflow execution, while the workflow engine controls when results are written out and how they feed into subsequent steps. 
+This document explained what Apache Spark is, how it processes data through its core abstractions, and why Spark appears in CMEM specifically in the BUILD component. The key boundary is architectural: Spark provides the execution engine for distributed processing, BUILD defines and orchestrates workflows, and EXPLORE remains the semantic persistence layer. Spark therefore does not interact with the Knowledge Graph directly; it is used by BUILD for workflow execution, while the workflow execution engine controls when results are written out and how they feed into subsequent steps. Within BUILD, Spark matters primarily when workflows operate on Spark-aware datasets. Those datasets align with Spark’s distributed processing model, which is why Spark can execute transformations across partitions, recompute lost work if a node fails, and handle larger volumes of data without falling back to single-node execution. For other dataset types or small workloads, workflows typically run without Spark and remain within BUILD’s standard execution path.
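To close with a concrete view of the abstractions summarized above, the following minimal sketch shows in plain PySpark that a DataFrame is a weakly-typed collection of rows: results come back as generic `Row` objects rather than typed domain objects (the strongly-typed Dataset API is available only in Scala and Java, which is one more reason the DataFrame is the relevant abstraction here). The data and names are invented for illustration and nothing in the snippet is CMEM-specific.

```python
from pyspark.sql import SparkSession, functions as F

spark = (
    SparkSession.builder
    .appName("dataframe-rows-sketch")
    .master("local[2]")
    .getOrCreate()
)

# A DataFrame is a distributed collection of rows with a schema, but the
# rows are not typed domain objects.
people = spark.createDataFrame(
    [("Alice", 34), ("Bob", 29)],
    ["name", "age"],
)
people.printSchema()

# Collecting yields generic Row objects; fields are accessed by name, and
# type mismatches surface only at runtime (weak typing).
for row in people.filter(F.col("age") > 30).collect():
    print(row["name"], row["age"])

spark.stop()
```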