Apache Spark follows a master-slave architecture: the driver program is the master and a set of worker nodes act as the slaves. The driver program holds the SparkContext, which is responsible for the following:
To create RDDs across the worker nodes.
To work with the cluster manager to allocate resources across the cluster to run the tasks.
To coordinate the tasks running across all the worker nodes.
To send computation tasks/code to the worker nodes.
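As a rough sketch of how the driver hands work to the workers (the app name and the local data are purely illustrative), a small PySpark snippet:

    from pyspark import SparkContext

    sc = SparkContext(appName="driver-demo")   # illustrative app name

    # The driver splits this collection into 4 partitions; the cluster manager
    # provides executors on the worker nodes, and the driver ships one task
    # per partition to them.
    rdd = sc.parallelize(range(100), numSlices=4)

    print(rdd.getNumPartitions())   # 4 partitions -> 4 tasks per stage
    print(rdd.sum())                # the driver coordinates the tasks and gathers the result: 4950

    sc.stop()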
Apache Spark is faster and more efficient than regular Hadoop MapReduce.
Why is Spark so efficient?
Spark keeps RDDs in memory and distributes them across the entire cluster, whereas MapReduce reads and writes its data through HDFS. If a job needs multiple MapReduce stages, each stage writes its intermediate results to HDFS and the next stage has to read that data back from HDFS. This makes MapReduce take significantly longer to execute.
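As a minimal sketch of this difference (the input path and app name are placeholders), caching an RDD lets a second action reuse the in-memory data instead of going back to storage:

    from pyspark import SparkContext

    sc = SparkContext(appName="cache-demo")   # illustrative app name

    lines = sc.textFile("hdfs:///logs/events.log")   # placeholder input path
    errors = lines.filter(lambda line: "ERROR" in line)

    # Keep the filtered RDD in executor memory, so the second action below
    # does not re-read the source file the way a second MapReduce job would
    # have to re-read intermediate output from HDFS.
    errors.cache()

    print(errors.count())                                     # reads the file, caches the result
    print(errors.filter(lambda l: "timeout" in l).count())    # served from the cached RDD

    sc.stop()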
Apart from that, Spark has other notable features.
Lazy execution: Spark does not execute any work until the final computation (i.e. an action) is called. Instead, it consolidates all the operations and constructs a DAG (Directed Acyclic Graph). The DAG is optimized by rearranging and combining operators where possible.
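A small sketch of lazy execution (the names and data are illustrative): the two transformations below return immediately and only build the DAG; the count() action is what actually runs it.

    from pyspark import SparkContext

    sc = SparkContext(appName="lazy-demo")   # illustrative app name

    nums = sc.parallelize(range(1, 1001))

    # These transformations only record lineage; no computation happens yet.
    squares = nums.map(lambda x: x * x)
    evens = squares.filter(lambda x: x % 2 == 0)

    # The action triggers execution of the whole DAG in one go.
    print(evens.count())   # 500

    sc.stop()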
Immutability: Spark is built on the concept of immutability. The underlying data is immutable and can be cached. Each RDD is a chunk of an immutable dataset and is resilient in nature, meaning the RDD can be regenerated at any point in time from its lineage (provided the source data is immutable), which is what gives Spark its fault tolerance. Hence the name RDD: Resilient Distributed Dataset.
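A quick sketch of immutability (the data is illustrative): transformations never modify an existing RDD, they produce a new one whose lineage records how to rebuild it.

    from pyspark import SparkContext

    sc = SparkContext(appName="immutability-demo")   # illustrative app name

    base = sc.parallelize([1, 2, 3, 4, 5])

    # map() does not change 'base'; it returns a new RDD whose lineage
    # records "base -> map".
    doubled = base.map(lambda x: x * 2)

    print(base.collect())      # [1, 2, 3, 4, 5]  (the original is untouched)
    print(doubled.collect())   # [2, 4, 6, 8, 10]

    # If a partition of 'doubled' is lost, Spark replays this lineage
    # (re-applies the map on 'base') rather than restoring a stored copy.
    sc.stop()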
Apache Spark is a unified big data framework for batch
processing, real-time data processing, machine learning and graph processing.
Spark can be programmed in Java, Scala and Python. Spark provides two interactive shells, spark-shell for Scala and the pyspark shell for Python. Through these shells you can run Scala or Python code for testing purposes and see the results right in the console. The shells come preloaded with a SparkContext (sc), which is the starting point of your program.
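For example, a short pyspark session might look like this (the data is illustrative; sc is already created by the shell):

    $ pyspark
    >>> rdd = sc.parallelize(["spark", "is", "lazy"])
    >>> rdd.map(lambda w: w.upper()).collect()
    ['SPARK', 'IS', 'LAZY']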
However, if you are running standalone PySpark/Scala/Java programs, you need to create the SparkContext yourself before starting the data processing. These programs are then launched using spark-submit.
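A minimal standalone PySpark sketch (the script name, app name and input path are placeholders):

    # wordcount.py
    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setAppName("wordcount")
    sc = SparkContext(conf=conf)          # created explicitly, unlike in the shells

    lines = sc.textFile("input.txt")      # placeholder input path
    counts = (lines.flatMap(lambda line: line.split())
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))
    print(counts.take(10))

    sc.stop()

It can then be launched with spark-submit wordcount.py.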
Two types of operations are available on RDDs: transformations and actions.
Transformations:
============
flatMap
map
filter
reduceByKey
groupByKey
combineByKey
aggregateByKey
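A quick sketch of a few of these transformations on a small key-value RDD (assuming an existing SparkContext sc; the data is illustrative, and key order in the results may vary):

    pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3), ("b", 4)])

    # Sum the values per key.
    pairs.reduceByKey(lambda x, y: x + y).collect()        # [('a', 4), ('b', 6)]
    # Group all values per key.
    pairs.groupByKey().mapValues(list).collect()           # [('a', [1, 3]), ('b', [2, 4])]
    # The same sum with an explicit zero value and merge functions.
    pairs.aggregateByKey(0, lambda acc, v: acc + v, lambda a, b: a + b).collect()
    # (sum, count) per key via combineByKey.
    pairs.combineByKey(lambda v: (v, 1),
                       lambda acc, v: (acc[0] + v, acc[1] + 1),
                       lambda a, b: (a[0] + b[0], a[1] + b[1])).collect()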
Actions:
=====
reduce
count
take
takeOrdered
countByKey
saveAsTextFile
saveAsSequenceFile
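And a quick sketch of the actions (again assuming an existing sc; the data and output paths are placeholders):

    words = sc.parallelize(["spark", "hadoop", "spark", "hive"])
    pairs = words.map(lambda w: (w, 1))

    words.count()                      # 4
    words.take(2)                      # ['spark', 'hadoop']
    words.takeOrdered(2)               # ['hadoop', 'hive']  (smallest first)
    pairs.countByKey()                 # {'spark': 2, 'hadoop': 1, 'hive': 1}
    sc.parallelize([1, 2, 3, 4]).reduce(lambda a, b: a + b)   # 10
    words.saveAsTextFile("out_txt")         # placeholder output directory
    pairs.saveAsSequenceFile("out_seq")     # placeholder output directory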