Spark
Definition: Open-source cluster computing framework for large-scale data processing, including real-time data.
Optimization: Spark extends the Hadoop MapReduce model and processes data in memory, which is much faster than disk-based processing.
History:
- Initiated by Matei Zaharia at UC Berkeley's AMPLab in 2009.
- Open-sourced in 2010 under a BSD license.
- Donated to the Apache Software Foundation in 2013.
- Emerged as a Top-Level Apache Project in 2014.
Features:
- High performance for both batch and streaming data.
- Supports Java, Scala, Python, R, and SQL with over 80 high-level operators.
- Libraries include SQL, DataFrames, MLlib, GraphX, and Spark Streaming.
- Unified analytics engine for large-scale data processing.
- Runs on Hadoop, Apache Mesos, Kubernetes, standalone, or in the cloud.
Spark Architecture
Master-Slave Architecture: The cluster comprises a single master node and multiple worker (slave) nodes.
1. Resilient Distributed Dataset (RDD) - Group of data items stored in memory on the worker nodes.
- Resilient: Data is recovered on failure.
- Distributed: Data is distributed among the nodes.
- Dataset: Collection of data.
2. Directed Acyclic Graph (DAG) - Finite directed graph that performs a sequence of computations on the data.
- Each node represents an RDD partition; edges represent transformations applied to the data.
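A minimal PySpark sketch (app name and values are illustrative) showing how transformations on an RDD only add nodes to the DAG, while an action triggers execution on the workers:

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "RDDExample")

# Create an RDD distributed across 4 partitions.
numbers = sc.parallelize(range(1, 11), numSlices=4)

# Transformations (map, filter) are lazy: they only extend the DAG.
squares = numbers.map(lambda x: x * x)
evens = squares.filter(lambda x: x % 2 == 0)

# The action (collect) triggers execution of the whole DAG.
print(evens.collect())  # [4, 16, 36, 64, 100]

sc.stop()
```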
Components of Spark Architecture:
1. Driver Program
- Process that executes the main() function and creates the SparkContext object.
- The SparkContext coordinates Spark applications on the cluster.
- Tasks:
- Acquiring executors on cluster nodes.
- Sending application code to executors (via JAR or Python files).
- Sending tasks to executors for execution.
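A minimal sketch of a driver program, assuming a local master and a hypothetical app name; the driver's main block creates the SparkContext (wrapped by a SparkSession), which acquires executors and sends tasks to them:

```python
from pyspark.sql import SparkSession

if __name__ == "__main__":
    # The driver process runs this block and creates the SparkContext
    # (via a SparkSession) that coordinates the application on the cluster.
    spark = (SparkSession.builder
             .appName("DriverExample")
             .master("local[2]")
             .getOrCreate())
    sc = spark.sparkContext

    # Work defined here is broken into tasks and sent to the executors.
    total = sc.parallelize(range(100)).sum()
    print(total)  # 4950

    spark.stop()
```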
2. Cluster Manager
- Allocates resources across applications.
- Types: Hadoop YARN, Apache Mesos, Standalone Scheduler.
- Standalone Scheduler: a standalone manager that makes it easy to install Spark on an empty set of machines.
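The cluster manager is chosen through the master URL. A hedged sketch of the common options (hostnames and ports below are placeholders):

```python
from pyspark import SparkConf, SparkContext

# The master URL tells Spark which cluster manager to use.
conf = (SparkConf()
        .setAppName("ClusterManagerExample")
        # .setMaster("spark://master-host:7077")  # Standalone Scheduler
        # .setMaster("mesos://mesos-host:5050")   # Apache Mesos
        # .setMaster("yarn")                      # Hadoop YARN
        .setMaster("local[*]"))                   # no cluster manager; run locally

sc = SparkContext(conf=conf)
print(sc.master)
sc.stop()
```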
3. Worker Node
- Slave node that runs the application code in the cluster.
4. Executor
- Process launched for an application on a worker node.
- Runs tasks and keeps data in memory or on disk across them.
- Reads and writes data to external sources.
- Each application has its own executors.
5. Task
- Unit of work sent to one executor.
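A sketch of how executor resources are requested and how a job splits into tasks, one per partition; the memory and core values here are illustrative only:

```python
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("ExecutorTaskExample")
        .setMaster("local[4]")
        # Illustrative executor settings for a real cluster deployment:
        .set("spark.executor.memory", "2g")
        .set("spark.executor.cores", "2"))

sc = SparkContext(conf=conf)

# 8 partitions -> each stage of an action is split into 8 tasks,
# which the scheduler distributes across the available executors.
rdd = sc.parallelize(range(1000), numSlices=8)
print(rdd.getNumPartitions())  # 8
print(rdd.count())             # 1000

sc.stop()
```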
Spark Components
1. Spark Core
- Core functionality of Spark.
- Responsibilities:
- Task scheduling.
- Fault recovery.
- Interaction with storage systems.
- Memory management.
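A brief sketch of Spark Core's memory management: caching an RDD keeps it in executor memory after the first action, and its lineage allows lost partitions to be recomputed on failure (the word list is illustrative):

```python
from pyspark import SparkContext, StorageLevel

sc = SparkContext("local[*]", "SparkCoreExample")

words = sc.parallelize(["spark", "core", "spark", "memory", "core", "spark"])
counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)

# Memory management: keep the computed result in executor memory.
counts.persist(StorageLevel.MEMORY_ONLY)

print(counts.collect())  # first action computes and caches the RDD
print(counts.count())    # second action reuses the cached partitions

counts.unpersist()
sc.stop()
```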
2. Spark SQL
- Built on Spark Core; supports structured data processing.
- Features:
- SQL and HQL (Hive Query Language) queries.
- Supports various data sources such as HDFS, Hive tables, JSON, Cassandra, and HBase.
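A minimal Spark SQL sketch, assuming a small in-memory DataFrame in place of an external source such as JSON or a Hive table:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("SparkSQLExample")
         .master("local[*]")
         .getOrCreate())

# Structured data as a DataFrame (could equally be read from JSON, Hive, HDFS, ...).
people = spark.createDataFrame(
    [("Alice", 34), ("Bob", 45), ("Carol", 29)],
    ["name", "age"])

# Register a temporary view and query it with SQL.
people.createOrReplaceTempView("people")
adults = spark.sql("SELECT name, age FROM people WHERE age > 30 ORDER BY age")
adults.show()

spark.stop()
```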
3. Spark Streaming
- Supports scalable, fault-tolerant processing of streaming data.
- Uses Spark Core's fast scheduling capability to process data in near real time.
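A minimal sketch of the classic DStream API; the socket host and port are placeholders, and each one-second micro-batch is scheduled on Spark Core:

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "StreamingExample")  # at least 2 threads: receiver + processing
ssc = StreamingContext(sc, batchDuration=1)        # 1-second micro-batches

# Placeholder source: a text stream on localhost:9999 (e.g. started with `nc -lk 9999`).
lines = ssc.socketTextStream("localhost", 9999)
counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()

ssc.start()
ssc.awaitTermination()
```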
4. MLlib
- Machine learning library with a wide range of algorithms.
- Includes correlations, classification, regression, clustering, and PCA (principal component analysis).
- Roughly nine times faster than disk-based implementations such as Apache Mahout.
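A short MLlib sketch using k-means clustering on a tiny hand-made dataset (the values are illustrative only):

```python
from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors
from pyspark.ml.clustering import KMeans

spark = (SparkSession.builder
         .appName("MLlibExample")
         .master("local[*]")
         .getOrCreate())

# Tiny illustrative dataset: two obvious clusters around (0.5, 0.5) and (8.5, 8.5).
data = spark.createDataFrame(
    [(Vectors.dense([0.0, 0.0]),),
     (Vectors.dense([1.0, 1.0]),),
     (Vectors.dense([9.0, 8.0]),),
     (Vectors.dense([8.0, 9.0]),)],
    ["features"])

kmeans = KMeans(k=2, seed=1)
model = kmeans.fit(data)
print(model.clusterCenters())  # roughly [0.5, 0.5] and [8.5, 8.5]

spark.stop()
```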
5. GraphX
- Library for manipulating graphs and performing graph-parallel computations.
Hadoop vs. Spark
Hadoop:
1. Hadoop's MapReduce model reads and writes intermediate results to disk, which slows down processing.
2. Designed to handle batch processing efficiently.
Spark:
1. Spark reduces the number of read/write cycles to disk and stores intermediate data in memory, resulting in faster processing.
2. Designed to handle real-time data, such as streams from Twitter and Facebook, efficiently.