Saturday, 28 May 2016

Spark Streaming


Benefits of streaming analytics:

Many applications benefit from acting on data as soon as it arrives. Batch processing takes time (often hours) or runs overnight before it can draw any conclusions from the data you have. What if your business needs to take quick action based on real-time data?

For example, detecting fraudulent bank transactions and deactivating the credit card (or taking a quick action such as informing the customer right away) is that kind of application. Social media sites like Twitter, Facebook and LinkedIn use this technique to find current trends based on tweets/posts.
Spark Streaming is Spark's module for such applications. It lets users write streaming applications using an API very similar to the one used for batch jobs, so a lot of the same skills can be reused.

Much like Spark is based on RDDs, Spark Streaming provides an abstraction on top of RDDs called DStreams (Discretized Streams). A DStream is a sequence of data arriving over time, represented internally as a sequence of RDDs. DStreams can be created from various data sources such as Kafka, Flume, the Twitter APIs and HDFS, and they provide many of the transformations available on RDDs.

http://spark.apache.org/docs/latest/img/streaming-arch.png
Spark Streaming uses a ‘micro-batch’ architecture, where the streaming computation is treated as a continuous series of batch computations on small batches of data. Spark is fast enough to process these batches quickly, which gives the feel of real-time data processing.

The size of each batch is determined by the batch interval, which you pass when creating the StreamingContext.
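A minimal sketch in Scala of creating a StreamingContext with a 10-second batch interval, assuming a local run and a hypothetical socket source on localhost:9999 (the app name, master and source are placeholders):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// local[2] gives one core to the receiver and one to the processing,
// which matters for the read/processing parallelism discussed later.
val conf = new SparkConf().setAppName("StreamingExample").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(10))   // 10-second batch interval

// Every 10 seconds, the lines received on this socket become one RDD in the DStream.
val lines = ssc.socketTextStream("localhost", 9999)
lines.print()

ssc.start()             // start receiving and processing
ssc.awaitTermination()  // block until the job is stopped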

http://image.slidesharecdn.com/introductiontosparkstreaming-150428114035-conversion-gate01/95/introduction-to-spark-streaming-18-638.jpg?cb=1430221329

Fault tolerance:

Spark Streaming is fault tolerant, a property inherited from RDDs. Because the data here is real-time and streaming, each batch RDD is replicated across multiple nodes. If a node goes down, the batch is picked up from a node that still holds a replica and the task carries on.

Spark Streaming also provides a checkpointing mechanism that needs to be set up for fault tolerance. It periodically saves data and state to a reliable store such as HDFS or S3 so they can be recovered. If the driver program crashes, the checkpoint preserves the state, and when you restart the program it picks up from where it crashed.
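A sketch of how checkpointing might be wired up, assuming a hypothetical HDFS checkpoint directory; StreamingContext.getOrCreate either rebuilds the context from the checkpoint after a driver crash or creates a fresh one:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs:///checkpoints/streaming-app"   // placeholder path

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("CheckpointedApp")
  val ssc = new StreamingContext(conf, Seconds(10))
  ssc.checkpoint(checkpointDir)     // periodically save data and state here
  // ... define the DStreams and transformations here ...
  ssc
}

// Restores from the checkpoint directory if one exists, otherwise calls createContext().
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()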

Spark Streaming provides two types of transformations:

Stateless transformations => the normal transformations that are also available on RDDs, such as reduceByKey; they operate on each batch RDD independently.
Stateful transformations => windowed transformations that work on the data received in the last n minutes/seconds. These are the operations you would use to answer questions like "how many orders were received in the last hour" (a short sketch follows this list).
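For example, a sketch of the "orders in the last hour" case using a windowed transformation, reusing the ssc from the sketch above and assuming each record is a hypothetical "customerId,amount" line:

import org.apache.spark.streaming.Minutes

val orders = ssc.socketTextStream("localhost", 9999)
val amountsByCustomer = orders.map { line =>
  val fields = line.split(",")
  (fields(0), fields(1).toDouble)   // (customerId, amount)
}

// Window of 60 minutes, sliding every 5 minutes: every 5 minutes the output
// is the per-customer total over the previous hour.
val lastHourTotals = amountsByCustomer.reduceByKeyAndWindow(
  (a: Double, b: Double) => a + b, Minutes(60), Minutes(5))

lastHourTotals.print()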

The diagram below depicts a typical data flow for a streaming application with a visual dashboard.


Kafka ===> Spark Streaming ===> HBase ===> Node.js ===> D3.js Visualization


Kafka acts as a messaging system that can handle high-velocity data streams. The Spark Streaming program listens to a Kafka topic, receives the data as a stream and processes it in near real time. Once it is processed, the resulting data is pushed to HBase, a fast read/write NoSQL database. A Node.js (server-side JavaScript) server can then be configured to fetch the data from HBase using the HBase Thrift API. D3 is another JavaScript library for creating plots and graphs from a given dataset; the data is fed to D3 through the Node.js server. I'm currently working on this prototype and will write another blog post once it is completed.
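A sketch of the Kafka-to-Spark-Streaming leg of that pipeline, assuming the receiver-based spark-streaming-kafka (Kafka 0.8) integration, a hypothetical topic named "orders" and placeholder ZooKeeper/consumer-group settings; the HBase write is left as a comment since it depends on the client you choose:

import org.apache.spark.streaming.kafka.KafkaUtils

val kafkaStream = KafkaUtils.createStream(
  ssc,
  "zkhost1:2181",          // ZooKeeper quorum (placeholder)
  "streaming-consumer",    // consumer group id (placeholder)
  Map("orders" -> 1))      // topic -> number of receiver threads

// Each Kafka record arrives as a (key, message) tuple; keep only the message.
val messages = kafkaStream.map(_._2)

messages.foreachRDD { rdd =>
  rdd.foreachPartition { partition =>
    // open one HBase connection per partition and write each processed
    // record here, rather than serialising a connection from the driver
  }
}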

Read parallelism and processing parallelism

When you think about Spark Streaming applications, you need to think about both read parallelism and processing parallelism. At a minimum, Spark Streaming needs two cores: one core is used for reading data from Kafka/Twitter or any other source, and the other is used for processing it.

In some systems, Spark Streaming is used only to read from the source and write to HDFS, and the data is analysed later. These are not really real-time streaming systems. A real-time streaming system should process the data as it receives it. A more detailed note on read parallelism and processing parallelism is mentioned here.
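A rough sketch of tuning both sides, assuming the same hypothetical Kafka topic as above: read parallelism is raised by running several receivers and unioning their DStreams, while processing parallelism is raised by repartitioning the unioned stream before the heavy transformations:

val numReceivers = 3
val kafkaStreams = (1 to numReceivers).map { _ =>
  KafkaUtils.createStream(ssc, "zkhost1:2181", "streaming-consumer", Map("orders" -> 1))
}

// One DStream backed by three receivers, i.e. three cores spent on reading.
val unifiedStream = ssc.union(kafkaStreams)

// Spread the received records over more partitions (and hence more cores)
// before the expensive processing steps.
val repartitioned = unifiedStream.repartition(12)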








Friday, 27 May 2016

Spark Architecture



Apache Spark follows a master-slave architecture: the driver program is the master and the set of worker nodes act as slaves. The driver program holds the SparkContext, which is responsible for the actions below (a minimal sketch of a driver program follows the list):



     To create RDDs across all the worker nodes.
     To work with the cluster manager to allocate resources across the cluster to run the tasks.
     To coordinate the tasks running across all the worker nodes.
     To send computation tasks/code to the worker nodes.
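A minimal sketch of such a driver program, assuming a hypothetical job that counts error lines in a log file on HDFS (the class name and path are placeholders):

import org.apache.spark.{SparkConf, SparkContext}

object ErrorCounter {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("ErrorCounter")
    val sc = new SparkContext(conf)                   // connects the driver to the cluster manager

    val logs = sc.textFile("hdfs:///logs/app.log")    // RDD partitioned across the worker nodes
    val errors = logs.filter(_.contains("ERROR"))     // this task code is shipped to the workers
    println(s"error lines: ${errors.count()}")        // action: the driver coordinates the tasks

    sc.stop()
  }
}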


Apache Spark is faster and more efficient than regular Hadoop MapReduce. Why is Spark so efficient?

Spark stores RDDs in memory, distributed across the entire cluster, whereas MapReduce reads and writes data from HDFS. If a job has multiple map/reduce stages, it writes the intermediate results to HDFS, and the next MapReduce job has to read that data back from HDFS. This is what makes MapReduce take significantly longer to execute.
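A sketch of that difference with a hypothetical dataset: caching the RDD keeps it in executor memory after the first action, so the second action does not have to go back to HDFS the way a second MapReduce job would:

val events = sc.textFile("hdfs:///data/events").cache()      // kept in memory after first use

val total = events.count()                                    // reads from HDFS, then caches
val errorCount = events.filter(_.contains("ERROR")).count()   // served from the cached RDD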

Apart from that, Spark has other features.

Lazy execution: Spark does not execute a task until the final computation (i.e. an action) is called. Spark consolidates all the operations and constructs a DAG (Directed Acyclic Graph), which is optimised by rearranging and combining operators where possible.
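A small illustration of lazy execution: the two transformations below only record lineage in the DAG, and nothing runs until the action is called, at which point Spark can pipeline the map and the filter into a single stage:

val evenSquares = sc.parallelize(1 to 1000000)
  .map(x => x * x)               // transformation: lazy, nothing computed yet
  .filter(_ % 2 == 0)            // transformation: still lazy

val sample = evenSquares.take(5) // action: this is when the work actually runs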

Immutability:  Spark is built on the concept of immutability. The underlying data is immutable and can be cached. Each RDD is a chunk of immutable data and is resilient in nature, meaning the RDD can be regenerated at any point in time (provided the source data is immutable), which helps with fault tolerance. Hence the name Resilient Distributed Dataset.


Apache Spark is a unified big data framework for batch processing, real-time data processing, machine learning and graph processing.

https://www.safaribooksonline.com/library/view/learning-spark/9781449359034/assets/lnsp_0101.png

Spark can be programmed in Java, Scala and Python. Spark provides interactive programming for Scala and Python through the spark-shell and pyspark shells. Through these shells you can run Scala or Python code for testing purposes and see the results in the console itself. The consoles come preloaded with the SparkContext (sc), which is the starting point of your program.

However, if you are running standalone PySpark/Scala/Java programs, you need to create the SparkContext yourself before starting the data processing. These programs are launched using spark-submit.
Two types of data processing are available: transformations and actions. A short example combining both follows the lists below.

Transformations:
============
flatMap
map
filter
reduceByKey
groupByKey
combineByKey
aggregateByKey

Actions:
=====

reduce
count
take
takeOrdered
countByKey
saveAsTextFile
saveAsSequenceFile
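
A short sketch combining several of the transformations and actions listed above: the classic word count, assuming a hypothetical input file and output directory on HDFS:

val counts = sc.textFile("hdfs:///data/input.txt")
  .flatMap(_.split("\\s+"))          // transformation
  .map(word => (word, 1))            // transformation
  .reduceByKey(_ + _)                // transformation

counts.take(10).foreach(println)                   // action
counts.saveAsTextFile("hdfs:///data/wordcounts")   // action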