Spark
Definition: Open-source cluster computing framework for large-scale data processing, including real-time data.
Optimization: Spark extends the Hadoop MapReduce model and processes data in memory, which is much faster than disk-based processing.
History:
- Initiated by Matei Zaharia at UC Berkeley's AMPLab in 2009.
- Open-sourced in 2010 under a BSD license.
- Donated to the Apache Software Foundation in 2013.
- Emerged as a Top-Level Apache Project in 2014.
Features:
- High performance for both batch and streaming data.
- Supports Java, Scala, Python, R, and SQL with over 80 high-level operators.
- Libraries include SQL, DataFrames, MLlib, GraphX, and Spark Streaming.
- Unified analytics engine for large-scale data processing.
- Runs on Hadoop, Apache Mesos, Kubernetes, standalone, or in the cloud.
Spark Architecture
Master-Slave Architecture: The cluster comprises a single master node and multiple worker (slave) nodes.
1. Resilient Distributed Dataset (RDD) - Group of data items stored in memory on the worker nodes.
- Resilient: Data is recovered on failure.
- Distributed: Data is distributed among the nodes.
- Dataset: Collection of data.
2. Directed Acyclic Graph (DAG) - Finite directed graph that performs a sequence of computations on the data.
- Each node represents an RDD partition; edges represent transformations applied to the data.
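A minimal PySpark sketch (app name and values are illustrative) showing how transformations on an RDD only add nodes to the DAG, while an action triggers execution on the workers:

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "RDDExample")

# Create an RDD distributed across 4 partitions.
numbers = sc.parallelize(range(1, 11), numSlices=4)

# Transformations (map, filter) are lazy: they only extend the DAG.
squares = numbers.map(lambda x: x * x)
evens = squares.filter(lambda x: x % 2 == 0)

# The action (collect) triggers execution of the whole DAG.
print(evens.collect())  # [4, 16, 36, 64, 100]

sc.stop()
```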
Components of Spark Architecture:
1. Driver Program
- Process that executes the main() function and creates the SparkContext object.
- The SparkContext coordinates Spark applications on the cluster.
- Tasks:
- Acquiring executors on cluster nodes.
- Sending application code to executors (via JAR or Python files).
- Sending tasks to executors for execution.
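A minimal sketch of a driver program, assuming a local master and a hypothetical app name; the driver's main block creates the SparkContext (wrapped by a SparkSession), which acquires executors and sends tasks to them:

```python
from pyspark.sql import SparkSession

if __name__ == "__main__":
    # The driver process runs this block and creates the SparkContext
    # (via a SparkSession) that coordinates the application on the cluster.
    spark = (SparkSession.builder
             .appName("DriverExample")
             .master("local[2]")
             .getOrCreate())
    sc = spark.sparkContext

    # Work defined here is broken into tasks and sent to the executors.
    total = sc.parallelize(range(100)).sum()
    print(total)  # 4950

    spark.stop()
```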
2. Cluster Manager
- Allocates resources across applications.
- Types: Hadoop YARN, Apache Mesos, Standalone Scheduler.
- Standalone Scheduler: a standalone manager that makes it easy to install Spark on an empty set of machines.
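The cluster manager is chosen through the master URL. A hedged sketch of the common options (hostnames and ports below are placeholders):

```python
from pyspark import SparkConf, SparkContext

# The master URL tells Spark which cluster manager to use.
conf = (SparkConf()
        .setAppName("ClusterManagerExample")
        # .setMaster("spark://master-host:7077")  # Standalone Scheduler
        # .setMaster("mesos://mesos-host:5050")   # Apache Mesos
        # .setMaster("yarn")                      # Hadoop YARN
        .setMaster("local[*]"))                   # no cluster manager; run locally

sc = SparkContext(conf=conf)
print(sc.master)
sc.stop()
```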
3. Worker Node
- Slave node that runs the application code in the cluster.
4. Executor
- Process launched for an application on a worker node.
- Runs tasks and keeps data in memory or on disk across them.
- Reads and writes data to external sources.
- Each application has its own executors.
5. Task
- Unit of work sent to one executor.
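A sketch of how executor resources are requested and how a job splits into tasks, one per partition; the memory and core values here are illustrative only:

```python
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("ExecutorTaskExample")
        .setMaster("local[4]")
        # Illustrative executor settings for a real cluster deployment:
        .set("spark.executor.memory", "2g")
        .set("spark.executor.cores", "2"))

sc = SparkContext(conf=conf)

# 8 partitions -> each stage of an action is split into 8 tasks,
# which the scheduler distributes across the available executors.
rdd = sc.parallelize(range(1000), numSlices=8)
print(rdd.getNumPartitions())  # 8
print(rdd.count())             # 1000

sc.stop()
```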
Spark Components
1. Spark Core
- Core functionality of Spark.
- Responsibilities:
- Task scheduling.
- Fault recovery.
- Interaction with storage systems.
- Memory management.
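A brief sketch of Spark Core's memory management: caching an RDD keeps it in executor memory after the first action, and its lineage allows lost partitions to be recomputed on failure (the word list is illustrative):

```python
from pyspark import SparkContext, StorageLevel

sc = SparkContext("local[*]", "SparkCoreExample")

words = sc.parallelize(["spark", "core", "spark", "memory", "core", "spark"])
counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)

# Memory management: keep the computed result in executor memory.
counts.persist(StorageLevel.MEMORY_ONLY)

print(counts.collect())  # first action computes and caches the RDD
print(counts.count())    # second action reuses the cached partitions

counts.unpersist()
sc.stop()
```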
2. Spark SQL
- Built on Spark Core; supports structured data processing.
- Features:
- SQL and HQL (Hive Query Language) queries.
- Supports various data sources such as HDFS, Hive tables, JSON, Cassandra, and HBase.
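A minimal Spark SQL sketch, assuming a small in-memory DataFrame in place of an external source such as JSON or a Hive table:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("SparkSQLExample")
         .master("local[*]")
         .getOrCreate())

# Structured data as a DataFrame (could equally be read from JSON, Hive, HDFS, ...).
people = spark.createDataFrame(
    [("Alice", 34), ("Bob", 45), ("Carol", 29)],
    ["name", "age"])

# Register a temporary view and query it with SQL.
people.createOrReplaceTempView("people")
adults = spark.sql("SELECT name, age FROM people WHERE age > 30 ORDER BY age")
adults.show()

spark.stop()
```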
3. Spark Streaming
- Supports scalable, fault-tolerant processing of streaming data.
- Uses Spark Core's fast scheduling capability to process data in near real time.
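A minimal sketch of the classic DStream API; the socket host and port are placeholders, and each one-second micro-batch is scheduled on Spark Core:

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "StreamingExample")  # at least 2 threads: receiver + processing
ssc = StreamingContext(sc, batchDuration=1)        # 1-second micro-batches

# Placeholder source: a text stream on localhost:9999 (e.g. started with `nc -lk 9999`).
lines = ssc.socketTextStream("localhost", 9999)
counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()

ssc.start()
ssc.awaitTermination()
```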
4. MLlib
- Machine learning library with a wide range of algorithms.
- Includes correlations, classification, regression, clustering, and PCA (principal component analysis).
- Roughly nine times faster than disk-based implementations such as Apache Mahout.
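A short MLlib sketch using k-means clustering on a tiny hand-made dataset (the values are illustrative only):

```python
from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors
from pyspark.ml.clustering import KMeans

spark = (SparkSession.builder
         .appName("MLlibExample")
         .master("local[*]")
         .getOrCreate())

# Tiny illustrative dataset: two obvious clusters around (0.5, 0.5) and (8.5, 8.5).
data = spark.createDataFrame(
    [(Vectors.dense([0.0, 0.0]),),
     (Vectors.dense([1.0, 1.0]),),
     (Vectors.dense([9.0, 8.0]),),
     (Vectors.dense([8.0, 9.0]),)],
    ["features"])

kmeans = KMeans(k=2, seed=1)
model = kmeans.fit(data)
print(model.clusterCenters())  # roughly [0.5, 0.5] and [8.5, 8.5]

spark.stop()
```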
5. GraphX
- Library for manipulating graphs and performing graph-parallel computations.
Hadoop vs. Spark
Hadoop:
1. Hadoop's MapReduce model reads and writes intermediate results to disk, which slows down processing.
2. Designed to handle batch processing efficiently.
Spark:
1. Spark reduces the number of read/write cycles to disk and stores intermediate data in memory, resulting in faster processing.
2. Designed to handle real-time data, such as streams from Twitter and Facebook, efficiently.