Introduction

Apache Spark is a unified analytics engine for large-scale data processing. It provides high-level APIs in Java, Scala, Python, and R, and an optimized engine that supports general execution graphs. Spark powers a stack of libraries including SQL and DataFrames, MLlib for machine learning, GraphX for graph processing, and Structured Streaming for stream processing.
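
As a quick orientation, the following is a minimal Scala sketch of the high-level API; the application name and column names are illustrative and not taken from the Spark codebase.

```scala
import org.apache.spark.sql.SparkSession

object QuickExample {
  def main(args: Array[String]): Unit = {
    // SparkSession is the entry point for the DataFrame and SQL APIs.
    val spark = SparkSession.builder()
      .appName("quick-example")
      .master("local[*]")
      .getOrCreate()

    import spark.implicits._

    // Build a small DataFrame and run an aggregation on it.
    val df = Seq(("a", 1), ("b", 2), ("a", 3)).toDF("key", "value")
    df.groupBy("key").sum("value").show()

    spark.stop()
  }
}
```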

Architecture Overview

Spark’s architecture is built around a distributed computing engine with a layered design that separates:

  • User-facing APIs (SparkSession, DataFrame, RDD)
  • Execution planning and optimization
  • Task scheduling and execution
  • Resource management and deployment
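
These layers are visible from an ordinary spark-shell session: the DataFrame call belongs to the user-facing API, explain() exposes the planning and optimization layer, and the planned query ultimately runs as tasks over RDD partitions. A minimal sketch (the application name is illustrative):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("layers").getOrCreate()
import spark.implicits._

// User-facing API: declare a query against a DataFrame.
val df = spark.range(0, 1000).filter($"id" % 2 === 0)

// Execution planning and optimization: print the parsed, analyzed,
// optimized logical plans and the physical plan.
df.explain(extended = true)

// Task scheduling and execution: the query runs as tasks over RDD partitions.
println(df.rdd.getNumPartitions)
```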

Core Module Structure

The module structure follows a layered approach where higher-level modules depend on core infrastructure, and specialized modules extend base functionality.

Build System Integration

Spark supports both Maven and SBT build systems, with configuration synchronized between the two. The build system is designed to handle Spark’s complex module dependencies and provide consistent builds across different environments.

Key Build Properties

| Property | Value | Purpose |
|---|---|---|
| java.version | 17 | Target JVM version |
| java.minimum.version | 17.0.11 | Minimum required JDK version |
| scala.version | 2.13.16 | Scala compiler version |
| scala.binary.version | 2.13 | Scala binary compatibility version |
| hadoop.version | 3.4.1 | Hadoop client compatibility |
| hive.version | 2.3.10 | Hive compatibility version |
| protobuf.version | 4.29.3 | Protocol Buffer version |
| arrow.version | 18.3.0 | Apache Arrow compatibility version |

Module Compilation Order

  1. Common modules: common/sketch, common/kvstore, common/network-common, common/network-shuffle, common/unsafe, common/utils, common/variant, common/tags
  2. Core engine: core
  3. SQL stack: sql/api, sql/catalyst, sql/core, sql/hive, sql/pipelines
  4. Specialized modules: mllib, mllib-local, streaming, graphx
  5. Connectors: connector/avro, connector/protobuf, connector/kafka-0-10
  6. Connect system: sql/connect/shims, sql/connect/common, sql/connect/server, sql/connect/client/jvm
  7. Assembly: assembly, examples, repl, launcher

Build Tools

Spark’s build system includes several tools to help with development:

  • Maven: Primary build tool with build/mvn wrapper script
  • SBT: Alternative build tool with build/sbt wrapper script
  • Docker image tool: bin/docker-image-tool.sh for building container images
  • Distribution packaging: dev/make-distribution.sh for creating release packages

Core Processing Components

The fundamental data abstraction in Spark is the Resilient Distributed Dataset (RDD), which represents an immutable, partitioned collection of elements that can be operated on in parallel. RDDs are implemented in the core module and provide the following (illustrated in the sketch after this list):

  • Fault tolerance through lineage information (tracking how an RDD was derived)
  • Control over partitioning to optimize data placement
  • In-memory caching for fast reuse
  • A rich set of operations (transformations and actions)
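
The sketch below, assuming an existing SparkContext `sc` (as in spark-shell), shows these pieces together: explicit partitioning, lazy transformations recorded as lineage, caching, and actions that trigger execution.

```scala
// Distribute a local collection across 8 partitions.
val numbers = sc.parallelize(1 to 1000, 8)

// Transformations are lazy; they only record lineage.
val evenSquares = numbers.map(n => n * n).filter(_ % 2 == 0)

// cache() keeps computed partitions in memory for fast reuse.
evenSquares.cache()

// Actions trigger the actual computation.
println(evenSquares.count())
println(evenSquares.take(5).mkString(", "))
```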

Key RDD implementations include:

  • ParallelCollectionRDD: Created from in-memory collections
  • HadoopRDD: Reads data from HDFS and other Hadoop-supported storage systems
  • MapPartitionsRDD: Result of map-like operations on other RDDs
  • ShuffledRDD: Result of operations that require redistributing data
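
The following sketch shows how these RDD types typically arise in user code, again assuming an existing SparkContext `sc`; the input path is a placeholder.

```scala
val pairs    = sc.parallelize(Seq("a" -> 1, "b" -> 2, "a" -> 3)) // ParallelCollectionRDD
val scaled   = pairs.mapValues(_ * 10)                           // MapPartitionsRDD
val combined = scaled.reduceByKey(_ + _)                         // ShuffledRDD

// toDebugString prints the lineage, i.e. how each RDD was derived.
println(combined.toDebugString)

// Reading a text file uses a HadoopRDD internally.
val lines = sc.textFile("hdfs:///path/to/input.txt")
```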

SQL Engine Implementation

Storage and State Management

| Component | Implementation | Purpose | Performance |
|---|---|---|---|
| BlockManager | core/src/main/scala/storage/BlockManager.scala | In-memory and disk block storage for RDDs | Fastest for ephemeral data |
| StateStore | sql/core/execution/streaming/state/ | Streaming state management interface | Abstract provider interface |
| RocksDBStateStore | sql/core/execution/streaming/state/RocksDBStateStoreProvider.scala | Persistent state storage using RocksDB | ~1650-4500ns per row (depending on configuration) |
| HDFSBackedStateStore | sql/core/execution/streaming/state/HDFSBackedStateStoreProvider.scala | HDFS-backed state storage | Durable but slower than RocksDB |
| InMemoryStateStore | sql/core/execution/streaming/state/StateStore.scala | In-memory state storage | ~780ns per row (fastest) |
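
As an illustration of how a provider from the table above is selected, the sketch below sets the standard spark.sql.streaming.stateStore.providerClass property; the application name is illustrative.

```scala
import org.apache.spark.sql.SparkSession

// Use the RocksDB-backed state store provider for streaming state.
val spark = SparkSession.builder()
  .appName("stateful-streaming")
  .config("spark.sql.streaming.stateStore.providerClass",
    "org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider")
  .getOrCreate()
```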

Data Source Architecture

Spark provides a pluggable data source architecture through the DataSource V2 API, allowing integration with various storage systems and file formats.

Data Source API
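
From the user's side, the pluggable architecture surfaces as a format name passed to the DataFrame reader and writer; Spark resolves the name to a registered data source implementation. A minimal sketch, assuming an existing SparkSession `spark` (paths are placeholders):

```scala
// Read with one data source implementation...
val events = spark.read
  .format("parquet")
  .load("/data/events")

// ...and write with another, selected purely by format name.
events.write
  .format("orc")
  .mode("overwrite")
  .save("/data/events_orc")
```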

Built-in Connectors

The codebase includes several production-ready connectors:

  • Avro Support: Apache Avro file format integration
  • Protocol Buffers: Protobuf serialization support
  • JDBC Sources: Database connectivity with specific dialects for MySQL, PostgreSQL, Oracle, etc.
  • Kafka Integration: Real-time data streaming from Apache Kafka
  • File Formats: Parquet, ORC, JSON, CSV, and other formats
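
As an example of one of these connectors in use, the sketch below reads a stream from Kafka, assuming an existing SparkSession `spark`; broker addresses and the topic name are placeholders.

```scala
// Subscribe to a Kafka topic as a streaming DataFrame.
val stream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")
  .option("subscribe", "events")
  .load()

// Kafka records arrive as binary key/value columns; cast them to strings.
val values = stream.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
```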

Compression Codecs

Spark supports multiple compression codecs for efficient data storage and transfer:

| Codec | Implementation | Performance Characteristics |
|---|---|---|
| Snappy | org.xerial.snappy | Good balance of speed and compression ratio |
| LZ4 | net.jpountz.lz4 | Very fast compression/decompression |
| ZSTD | com.github.luben.zstd | High compression ratio with configurable levels |
| GZIP | Java built-in | High compression ratio but slower |
| LZF | com.ning.compress.lzf | Fast compression for legacy support |
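
Codecs are chosen through configuration rather than code changes. The sketch below uses two standard configuration properties; the application name is illustrative.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("compression-demo")
  // Codec for internal data such as shuffle and broadcast blocks.
  .config("spark.io.compression.codec", "zstd")
  // Codec used when writing Parquet files.
  .config("spark.sql.parquet.compression.codec", "zstd")
  .getOrCreate()
```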

Deployment and Resource Management

Spark applications are launched through the SparkSubmit class, which handles different cluster manager integrations:

Cluster Manager Support

| Manager | Module | Description |
|---|---|---|
| Standalone | core/src/main/scala/deploy/ | Built-in cluster manager |
| YARN | resource-managers/yarn/ | Hadoop YARN integration |
| Kubernetes | resource-managers/kubernetes/ | Container orchestration |
| Local Mode | core/src/main/scala/SparkContext.scala | Single-machine execution |
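
The cluster manager is selected by the master URL, normally supplied through spark-submit. The sketch below shows the URL forms for each manager in the table above; host names and ports are placeholders.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("deployment-demo")
  .master("local[*]")                        // Local mode: all cores on one machine
  // .master("spark://master-host:7077")     // Standalone cluster manager
  // .master("yarn")                         // Hadoop YARN
  // .master("k8s://https://k8s-api:6443")   // Kubernetes
  .getOrCreate()
```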

Assembly and Distribution

The assembly module (assembly/pom.xml) creates the final distribution package, including all necessary JARs and dependencies. The assembly process handles dependency shading and classpath management for different deployment scenarios.