Friday, 27 May 2016

Spark Architecture



Apache Spark follows a master-slave architecture: the driver program is the master and a set of worker nodes act as slaves. The driver program holds the SparkContext, which is responsible for the following (a short sketch follows the list):



     Creating RDDs across all worker nodes.
     Working with the cluster manager to allocate resources across the cluster to run the tasks.
     Coordinating the tasks running on all the worker nodes.
     Sending the computation tasks/code to the worker nodes.
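
As a rough sketch of the driver side (assuming a local master and the Python RDD API; the app name and numbers are made up):

# The driver creates the SparkContext, which asks the cluster manager for
# resources and then ships tasks to the worker nodes.
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("driver-demo").setMaster("local[*]")
sc = SparkContext(conf=conf)

# The RDD is split into 4 partitions and distributed across the workers;
# the squaring and summing run as tasks on those workers.
rdd = sc.parallelize(range(1, 101), 4)
total = rdd.map(lambda x: x * x).sum()
print(total)

sc.stop()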


Apache Spark is faster and more efficient than regular Hadoop MapReduce. Why is Spark so efficient?

Spark stores RDDs in memory, distributed across the entire cluster, whereas MapReduce reads and writes its data from HDFS. If a job has multiple map/reduce stages, each stage writes its intermediate results to HDFS and the next stage has to read that data back from HDFS again. This repeated disk I/O makes MapReduce take significantly longer to execute.
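
For illustration, a minimal PySpark sketch of keeping an intermediate RDD in memory so a second computation does not go back to HDFS (the HDFS path and filter terms are placeholders):

from pyspark import SparkContext

sc = SparkContext("local[*]", "cache-demo")

logs = sc.textFile("hdfs://namenode:8020/logs/access.log")   # placeholder path
errors = logs.filter(lambda line: "ERROR" in line).cache()   # keep in memory

print(errors.count())                                   # first action reads HDFS and caches
print(errors.filter(lambda l: "timeout" in l).count())  # reuses the cached RDD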

Apart from that, Spark has other features.

Lazy execution: Spark does not execute a task until the final computation (i.e. an action) is called. Spark consolidates all the operations into a DAG (Directed Acyclic Graph), which it optimizes by rearranging and combining operators where possible.
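
A small sketch of lazy evaluation (the data and names are made up): the map and filter below only build up the DAG; nothing runs until collect() is called.

from pyspark import SparkContext

sc = SparkContext("local[*]", "lazy-demo")

nums = sc.parallelize(range(10))
doubled = nums.map(lambda x: x * 2)            # not executed yet
evens = doubled.filter(lambda x: x % 4 == 0)   # still only building the DAG

print(evens.toDebugString())   # prints the lineage (the DAG) of the RDD
print(evens.collect())         # the action triggers the actual execution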

Immutability:  Spark is built on the concept of immutability. The underlying data is immutable and can be cached. Each RDD is a chunk of immutable data and is resilient in nature: it can be regenerated at any point in time from its lineage (since the data it was derived from is immutable), which provides fault tolerance. Hence the name RDD: Resilient Distributed Dataset.
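
A quick sketch of immutability: a transformation never modifies an RDD in place, it returns a new RDD whose lineage records how to rebuild it.

from pyspark import SparkContext

sc = SparkContext("local[*]", "immutable-demo")

base = sc.parallelize([1, 2, 3, 4])
squared = base.map(lambda x: x * x)   # new RDD; 'base' is untouched

print(base.collect())      # [1, 2, 3, 4]
print(squared.collect())   # [1, 4, 9, 16]
# If a partition of 'squared' is lost, Spark re-applies the map over the
# matching partition of 'base'; that lineage is what makes the RDD resilient.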


Apache Spark is a unified big data framework for batch processing, real-time data processing, machine learning and graph processing.

The Spark stack (image from Learning Spark, O'Reilly): https://www.safaribooksonline.com/library/view/learning-spark/9781449359034/assets/lnsp_0101.png

Spark can be programmed in Java, Scala and Python. Spark provides interactive programming for Scala and Python through the spark-shell and pyspark shells. Through these shells you can run Scala or Python code for testing purposes and see the results in the console itself. The shells come preloaded with the SparkContext (sc), which is the starting point of your program.
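
For example, a short pyspark session might look like this (the shell already provides sc, so no setup is needed; the output order of collect() can vary):

$ pyspark
>>> words = sc.parallelize(["spark", "hadoop", "spark"])
>>> words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b).collect()
[('spark', 2), ('hadoop', 1)]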

However, if you are running standalone PySpark/Scala/Java programs, you need to create the SparkContext yourself before starting the data processing. These programs are invoked using spark-submit.
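A skeleton of such a standalone program (the file name, app name and master are placeholders; unlike the shells, the SparkContext is not preloaded here):

# app.py
from pyspark import SparkConf, SparkContext

if __name__ == "__main__":
    conf = SparkConf().setAppName("my-spark-app")
    sc = SparkContext(conf=conf)
    # ... data processing goes here ...
    sc.stop()

It would then be invoked with something like: spark-submit --master yarn app.py
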
Two types of operations are available on RDDs: transformations and actions (a short example follows the lists below).

Transformations:
============
flatMap
map
filter
reduceByKey
groupByKey
combineByKey
aggregateByKey

Actions:
=====

reduce
count
take
takeOrdered
countByKey
saveAsTextFile
saveAsSequenceFile
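
A rough word-count style example exercising a few of the listed operations (the input data and output path are made up):

from pyspark import SparkContext

sc = SparkContext("local[*]", "ops-demo")

lines = sc.parallelize(["to be or not to be", "to see or not to see"])

# Transformations: lazy, each returns a new RDD
words  = lines.flatMap(lambda line: line.split())
pairs  = words.filter(lambda w: w != "or").map(lambda w: (w, 1))
counts = pairs.reduceByKey(lambda a, b: a + b)

# Actions: these trigger the actual computation
print(counts.count())                      # number of distinct words
print(counts.take(3))                      # a few (word, count) pairs
print(pairs.countByKey())                  # word -> number of occurrences
counts.saveAsTextFile("/tmp/word-counts")  # placeholder output directory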
