Saturday 28 May 2016

Spark Streaming


Benefits of streaming analytics:

Many applications benefit from acting on data as soon as it arrives. Batch processing takes time (often hours) and usually runs overnight to draw conclusions from the data you already have. But what if your business needs to act quickly on real-time data?

For example, detecting fraudulent bank transactions and deactivating the credit card (or taking quick action such as informing the customer right away) is one such application. Social media sites like Twitter, Facebook and LinkedIn use this technique to find current trends based on tweets and posts.
Spark Streaming is Spark's module for such applications. It lets users write streaming applications using an API very similar to that of batch jobs, so a lot of existing skills can be reused.

Much like Spark is based on RDDs, Spark Streaming provides an abstraction called DStreams (discretized streams). A DStream is a sequence of data arriving over time, represented internally as a sequence of RDDs. DStreams can be created from various data sources such as Kafka, Flume, the Twitter API, or HDFS, and they provide many of the transformations available on RDDs.

http://spark.apache.org/docs/latest/img/streaming-arch.png
Spark Streaming uses a 'micro-batch' architecture, where the streaming computation is treated as a continuous series of batch computations on small batches of data. Spark processes each small batch fast enough that the result feels like real-time data processing.

The size of each batch is determined by the batch interval, which you pass when creating the StreamingContext.
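To make the micro-batch idea concrete, here is a plain-Python toy sketch (not Spark code, and the data is made up): events tagged with an arrival time are chopped into batches by a fixed batch interval, and each batch is then processed as a small job.

```python
from itertools import groupby

# Toy illustration of the micro-batch idea (plain Python, not Spark):
# events are grouped into batches by a fixed batch interval.

def micro_batches(events, batch_interval):
    """Group (timestamp, value) events into batches of `batch_interval` seconds."""
    key = lambda ev: int(ev[0] // batch_interval)  # batch index for each event
    for _, batch in groupby(sorted(events, key=key), key=key):
        yield [value for _, value in batch]

# Events arriving at 0.2s, 0.9s, 1.1s, 1.8s and 2.5s with a 1-second interval:
events = [(0.2, "a"), (0.9, "b"), (1.1, "c"), (1.8, "d"), (2.5, "e")]
batches = list(micro_batches(events, batch_interval=1.0))
print(batches)  # [['a', 'b'], ['c', 'd'], ['e']]
```

A shorter batch interval gives lower latency but more scheduling overhead, which is exactly the trade-off you tune when choosing the batch interval for a StreamingContext.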

http://image.slidesharecdn.com/introductiontosparkstreaming-150428114035-conversion-gate01/95/introduction-to-spark-streaming-18-638.jpg?cb=1430221329

Fault tolerance:

Spark Streaming's fault tolerance comes from RDDs. Although the data arrives in real time as a stream, each batch RDD is replicated across multiple nodes. If any node goes down, Spark picks the data up from a node that still holds a replica and continues the task.

Spark Streaming also provides a checkpointing mechanism that needs to be set up for fault tolerance. It periodically saves data and state to a reliable store such as HDFS or S3 for recovery. If the driver program crashes, the checkpoint preserves this data, and when you restart the program it picks up from where it crashed.
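The checkpointing idea can be sketched in plain Python (this is a toy simulation of the concept, not Spark's actual mechanism): running state is periodically saved to storage, and a restarted program resumes from the last checkpoint instead of starting from zero.

```python
import os
import pickle
import tempfile

# Toy sketch of checkpointing (plain Python, not Spark's implementation).
CHECKPOINT_EVERY = 3  # save state every 3 batches

def run(batches, checkpoint_path):
    # Recover state from the checkpoint if one exists (i.e. after a crash).
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path, "rb") as f:
            done, total = pickle.load(f)
    else:
        done, total = 0, 0
    for batch in batches[done:]:
        total += sum(batch)               # "process" the batch
        done += 1
        if done % CHECKPOINT_EVERY == 0:  # periodic checkpoint
            with open(checkpoint_path, "wb") as f:
                pickle.dump((done, total), f)
    return total

path = os.path.join(tempfile.mkdtemp(), "checkpoint.pkl")
batches = [[1], [2], [3], [4], [5]]
run(batches[:4], path)        # simulate a crash after 4 batches (checkpoint at 3)
total = run(batches, path)    # restart: resumes from the checkpoint, not from zero
print(total)                  # 15
```

In real Spark Streaming the checkpoint directory (e.g. on HDFS) plays the role of `checkpoint_path`, and a restarted driver rebuilds the StreamingContext from it.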

Spark Streaming provides two types of transformation: 

Stateless transformations => normal transformations available on RDDs, such as reduceByKey, which operate on each batch RDD independently
Stateful transformations => window transformations that operate on the data received in the last n minutes/seconds. These special operations can answer questions such as the number of orders received in the last hour.
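The difference between the two can be shown with a plain-Python sketch (toy data, not Spark's API): a stateless count looks at one batch at a time, while a windowed (stateful) count aggregates over the last few batches.

```python
from collections import Counter, deque

# Four micro-batches of events; each "order" string is one order.
batches = [["order", "order"], ["order"], [], ["order", "order", "order"]]

# Stateless: like reduceByKey, each batch is counted independently.
stateless = [Counter(b)["order"] for b in batches]
print(stateless)  # [2, 1, 0, 3]

# Stateful: a sliding window over the last 3 batches (think "orders in
# the last hour" with a 20-minute batch interval).
window = deque(maxlen=3)
windowed = []
for b in batches:
    window.append(Counter(b)["order"])
    windowed.append(sum(window))
print(windowed)  # [2, 3, 3, 4]
```

Note how the stateless counts drop to 0 for the empty batch, while the windowed counts keep carrying the recent history forward.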

Below is a typical data flow of a streaming application with a visual dashboard.


Kafka === > Spark Streaming ===> Hbase ===> Node Js ===> D3 Js Visualization.


Kafka acts as a messaging system that can handle high-velocity data streams. The Spark Streaming program listens to a Kafka topic, receives the data as a stream, and processes it in real-time fashion. Once the data is processed, the results are pushed to HBase, a fast read/write NoSQL database. A Node.js (server-side JavaScript) server can then be configured to fetch the data from HBase using the HBase Thrift API. D3 is another JavaScript library, used to create plots and graphs from a given dataset; the data is fed to D3 through the Node.js server. I'm currently working on this prototype and will write another blog post once it is completed.

Read parallelism and processing parallelism

When you design a Spark Streaming application, you need to think about both read parallelism and processing parallelism. Basically, a Spark Streaming application needs at least two cores: one core is used by the receiver to read data from Kafka, Twitter, or any other source, and the other core is used for processing.
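Why reading and processing need separate resources can be illustrated with a plain-Python producer/consumer sketch (a toy stand-in, not Spark's receiver model): one thread plays the receiver pulling records from the source, while another processes them, so ingestion never blocks on computation.

```python
import queue
import threading

q = queue.Queue()
results = []

def receiver():
    # Plays the role of the receiver core: pulls records from the source
    # (a stand-in for Kafka/Twitter) and hands them off.
    for record in range(5):
        q.put(record)
    q.put(None)  # end-of-stream marker

def processor():
    # Plays the role of the processing core: consumes records and computes.
    while (record := q.get()) is not None:
        results.append(record * 2)  # stand-in for the batch computation

t1 = threading.Thread(target=receiver)
t2 = threading.Thread(target=processor)
t1.start(); t2.start()
t1.join(); t2.join()
print(results)  # [0, 2, 4, 6, 8]
```

If both roles had to share a single core, the receiver would stall while the computation ran, which is exactly why a Spark Streaming job given only one core appears to receive data but never process it.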

In some systems, Spark Streaming is used just to read from a source and write to HDFS, and the data is analyzed later. Such systems are not really real-time streaming systems. A real-time streaming system should process the data at the same time as it receives it. A more detailed note on read parallelism and processing parallelism is mentioned here







