Apache Spark

Mannan Ul Haq

Spark

Definition: Open-source cluster computing framework for large-scale data processing, covering both batch and real-time (streaming) workloads.

Optimization: Spark extends the Hadoop MapReduce model by processing data in memory instead of writing intermediate results to disk, which makes it much faster than disk-based alternatives.

History:

  • Initiated by Matei Zaharia at UC Berkeley's AMPLab in 2009.
  • Open-sourced in 2010 under BSD license.
  • Donated to the Apache Software Foundation in 2013.
  • Became a Top-Level Apache Project in 2014.

Features:

  • High performance for both batch and streaming data.
  • Supports Java, Scala, Python, R, and SQL with over 80 high-level operators.
  • Libraries include SQL, DataFrames, MLlib, GraphX, and Spark Streaming.
  • Unified analytics engine for large-scale data processing.
  • Runs on Hadoop, Apache Mesos, Kubernetes, standalone, or in the cloud.

Spark Architecture

Master-Slave Architecture: The cluster comprises a single master node and multiple slave (worker) nodes.

1. Resilient Distributed Dataset (RDD)
  • Group of data items stored in memory across the worker nodes.
  • Resilient: Data is recovered on failure.
  • Distributed: Data is distributed among the nodes.
  • Dataset: Collection of data.
2. Directed Acyclic Graph (DAG)
  • Finite directed graph that performs a sequence of computations on the data.
  • Each node represents an RDD; each edge represents a transformation applied to the data (a minimal sketch follows this list).
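
To make the RDD/DAG relationship concrete, here is a minimal PySpark sketch (the app name and local master setting are illustrative assumptions): each transformation lazily adds to the DAG, and nothing executes until an action such as collect() is called.

    # Transformations build the DAG lazily; an action triggers execution.
    from pyspark import SparkContext

    sc = SparkContext("local[*]", "rdd-dag-demo")    # illustrative local master

    nums = sc.parallelize(range(1, 11))              # RDD partitioned across workers
    squares = nums.map(lambda x: x * x)              # transformation: extends the DAG
    evens = squares.filter(lambda x: x % 2 == 0)     # transformation: still lazy

    print(evens.collect())                           # action: the DAG is executed
    sc.stop()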

Components of Spark Architecture:

1. Driver Program

  • Process executing main() function, creating SparkContext object.
  • SparkContext coordinates Spark applications on the cluster (a minimal driver sketch follows this list).
  • Tasks:
    • Acquiring executors on cluster nodes.
    • Sending application code to executors (via JAR or Python files).
    • Sending tasks to executors for execution.
2. Cluster Manager
  • Allocates resources across applications.
  • Types: Hadoop YARN, Apache Mesos, Standalone Scheduler.
  • Standalone Scheduler: Spark's built-in cluster manager, which makes it easy to install and run Spark on an empty set of machines.
3. Worker Node
  • Slave node running application code in cluster.
4. Executor
  • Process launched for an application on a worker node.
  • Runs tasks and stores data in memory or on disk across them.
  • Reads/writes data to external sources.
  • Each application has its own executors.
5. Task
  • Unit of work sent to one executor.
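
As a rough sketch of how these components fit together, the hypothetical driver program below creates the SparkContext in main(); the context acquires executors through the cluster manager and sends tasks (here, the map/reduceByKey work) to run on the worker nodes. The app name and local master are assumptions made so the example runs standalone.

    # Hypothetical driver program: main() creates the SparkContext,
    # which coordinates executors via the cluster manager.
    from pyspark import SparkConf, SparkContext

    def main():
        conf = SparkConf().setAppName("driver-demo").setMaster("local[2]")
        sc = SparkContext(conf=conf)     # acquires executors on cluster nodes

        words = sc.parallelize(["a", "b", "a", "c"])
        counts = words.map(lambda w: (w, 1)).reduceByKey(lambda x, y: x + y)
        print(counts.collect())          # tasks are sent to executors and run there

        sc.stop()

    if __name__ == "__main__":
        main()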

Spark Components

1. Spark Core
  • Core functionality of Spark.
  • Components:
    • Task scheduling.
    • Fault recovery.
    • Interaction with storage systems.
    • Memory management.
2. Spark SQL
  • Built on Spark Core; supports structured data.
  • Features:
    • SQL and HQL (Hive Query Language) queries.
    • Supports various data sources such as HDFS, Hive tables, JSON, Cassandra, and HBase (see the sketch after this list).
3. Spark Streaming
  • Supports scalable, fault-tolerant processing of streaming data.
  • Utilizes Spark Core for fast scheduling and real-time processing.
4. MLlib
  • Machine Learning library with various algorithms.
  • Includes correlations, classification, regression, clustering, PCA.
  • Nine times faster than disk-based implementations like Apache Mahout.
5. GraphX
  • Library for manipulating graphs and performing graph-parallel computations.
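
A small Spark SQL sketch, assuming an in-memory DataFrame rather than an external source such as Hive or Cassandra: the same data can be registered as a table and queried with plain SQL.

    # Spark SQL sketch: register a DataFrame as a temp view and query it.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("sql-demo").getOrCreate()

    df = spark.createDataFrame([("Alice", 34), ("Bob", 45)], ["name", "age"])
    df.createOrReplaceTempView("people")    # expose the DataFrame to SQL

    spark.sql("SELECT name FROM people WHERE age > 40").show()
    spark.stop()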

Hadoop vs. Spark

Hadoop:

1. Hadoop's MapReduce model reads from and writes to disk at each stage, which slows down processing.

2. Designed to handle batch processing efficiently.


Spark:

1. Spark reduces the number of read/write cycles to disk by keeping intermediate data in memory, hence faster processing (see the caching sketch below).

2. Designed to handle real-time data efficiently, such as streams from Twitter and Facebook.
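
The caching sketch below, using illustrative in-memory data, shows the read/write-cycle saving in practice: cache() pins an intermediate RDD in memory, so later actions reuse it instead of recomputing it from the source (or re-reading it from disk, as a MapReduce-style job would).

    # cache() keeps intermediate data in memory, avoiding repeated
    # recomputation / disk I/O across multiple actions.
    from pyspark import SparkContext

    sc = SparkContext("local[*]", "cache-demo")

    logs = sc.parallelize(["INFO ok", "ERROR disk", "INFO ok", "ERROR net"])
    errors = logs.filter(lambda l: l.startswith("ERROR")).cache()

    print(errors.count())                                 # computes and caches
    print(errors.map(lambda l: l.split()[1]).collect())   # reuses cached data

    sc.stop()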

