Benefits of streaming analytics:
Many applications benefit from acting on data as soon as it arrives. Batch processing takes time (often hours) or runs overnight to draw conclusions from the data you already have. But what if your business needs to act quickly on real-time data?
For example, detecting fraudulent bank transactions and deactivating the credit card (or taking some quick action such as informing the customer right away) is one such application. Social media sites like Twitter, Facebook and LinkedIn use this technique to find current trends based on tweets/posts.
Spark Streaming is Spark's module for such applications. It lets users write streaming applications using an API very similar to that of batch jobs, so a lot of existing skills carry over.
Much like Spark is built on RDDs, Spark Streaming provides an abstraction on top of RDDs called DStreams (Discretized Streams). A DStream is a sequence of data arriving over time, represented internally as a sequence of RDDs. DStreams can be created from various data sources such as Kafka, Flume, the Twitter APIs and HDFS, and they provide many of the transformations available on RDDs. The sketch below shows how closely the streaming API mirrors the batch API.
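As a minimal sketch of that similarity (the socket source, port 9999 and the 5-second batch interval are placeholders I picked for illustration), the classic word count looks almost exactly like its batch equivalent:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object StreamingWordCount {
      def main(args: Array[String]): Unit = {
        // Two local threads: one for the receiver, one for processing
        val conf = new SparkConf().setMaster("local[2]").setAppName("StreamingWordCount")
        val ssc = new StreamingContext(conf, Seconds(5)) // 5-second batch interval

        // DStream from a TCP source; Kafka, Flume, Twitter or HDFS work similarly
        val lines = ssc.socketTextStream("localhost", 9999)

        // Exactly the chain you would write on a batch RDD
        val counts = lines.flatMap(_.split(" "))
                          .map(word => (word, 1))
                          .reduceByKey(_ + _)
        counts.print()

        ssc.start()            // start receiving and processing
        ssc.awaitTermination() // block until the job is stopped
      }
    }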
Spark Streaming uses a "micro-batch" architecture, where the streaming computation is treated as a continuous series of batch computations on small batches of data. Spark is fast enough that it can process the data in batches while still giving the feel of real-time processing. The size of each batch is determined by the batch interval, which you pass while creating the StreamingContext.
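To make the batch interval concrete, here is the same idea in isolation; the one-second interval and app name are values I chose for illustration:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setMaster("local[2]").setAppName("BatchIntervalDemo")

    // Every 1 second of received data becomes one RDD in the DStream.
    // A smaller interval lowers latency; a larger one improves throughput.
    val ssc = new StreamingContext(conf, Seconds(1))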
Fault tolerance:
Spark Streaming provides fault tolerance, which again is a property inherited from RDDs. Because the incoming data is a live stream, each received batch RDD is replicated across multiple nodes; if any node goes down, the lost tasks are rerun on a node that still holds a replica.
Spark Streaming also provides a checkpointing mechanism that needs to be set up for fault tolerance. It periodically saves data and state to a reliable store such as HDFS or S3 so they can be recovered. If the driver program crashes, the checkpoint preserves the state, and when you restart the program it picks up from where it crashed, as sketched below.
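A minimal checkpointing sketch, assuming a hypothetical HDFS path; StreamingContext.getOrCreate rebuilds the context from the checkpoint on restart instead of creating a fresh one:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Hypothetical checkpoint directory; any reliable store (HDFS, S3) works
    val checkpointDir = "hdfs://namenode:8020/checkpoints/streaming-app"

    def createContext(): StreamingContext = {
      val conf = new SparkConf().setMaster("local[2]").setAppName("CheckpointDemo")
      val ssc = new StreamingContext(conf, Seconds(5))
      ssc.checkpoint(checkpointDir) // periodically save data and state here
      // ... define the DStreams and transformations here ...
      ssc
    }

    // On a clean start this calls createContext(); after a driver crash it
    // rebuilds the context and its state from the checkpoint instead.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()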
Stateless transformations => these are the normal transformations available on RDDs, such as reduceByKey, which operate on each batch RDD independently (a combined sketch follows the next item).
Stateful transformations => these are windowed transformations that work on the data received in the last n minutes/seconds. They are a special kind of operation that can be used to determine, for example, the number of orders received in the last 1 hour, and so on.
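Here is a hedged sketch contrasting the two, assuming a made-up stream of comma-separated "productId,orderCount" records on a local socket; reduceByKeyAndWindow is the windowed counterpart of reduceByKey:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}

    val conf = new SparkConf().setMaster("local[2]").setAppName("WindowDemo")
    val ssc = new StreamingContext(conf, Seconds(5))

    // Hypothetical stream of "productId,orderCount" lines
    val orders = ssc.socketTextStream("localhost", 9999)
                    .map(_.split(","))
                    .map(fields => (fields(0), fields(1).toInt))

    // Stateless: totals within each 5-second batch only
    val perBatch = orders.reduceByKey(_ + _)

    // Stateful: totals over the last 1 hour, recomputed every 5 minutes
    val lastHour = orders.reduceByKeyAndWindow(
      (a: Int, b: Int) => a + b, // reduce function
      Minutes(60),               // window duration
      Minutes(5)                 // slide interval
    )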
Below is a typical data flow for a streaming application with a visual dashboard:
Kafka ===> Spark Streaming ===> HBase ===> Node.js ===> D3.js visualization
Kafka acts as a messaging system that can handle a high-velocity data stream. The Spark Streaming program listens to a Kafka topic, receives the data as a stream and processes it in real-time fashion. Once processed, the resulting data is pushed to HBase, a fast read/write NoSQL database. A Node.js (server-side JavaScript framework) server can then be configured to fetch the data from HBase using the HBase Thrift API. D3 is another JavaScript library for creating plots and graphs from a given dataset; the data is fed to D3 through the Node.js server. I'm currently working on this prototype and will write another blog post once it is completed.
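As a sketch of the Spark Streaming end of this pipeline, here is how the Kafka side could look with the spark-streaming-kafka-0-10 integration; the broker address, topic name and group id are placeholders, and the HBase write is left as a comment since it depends on the table layout:

    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka010.KafkaUtils
    import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
    import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

    val conf = new SparkConf().setAppName("KafkaToHbase")
    val ssc = new StreamingContext(conf, Seconds(5))

    // Placeholder connection settings
    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "localhost:9092",
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "dashboard-consumer",
      "auto.offset.reset"  -> "latest"
    )

    // Listen to the "orders" topic (hypothetical name)
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("orders"), kafkaParams)
    )

    stream.map(record => record.value).foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        // Open one HBase connection per partition and write the processed
        // records here (e.g. via the HBase client or Thrift API).
        records.foreach(r => println(r)) // stand-in for the HBase write
      }
    }

    ssc.start()
    ssc.awaitTermination()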
Read parallelism and processing parallelism:
When you design a Spark Streaming application, you need to think about both read parallelism and processing parallelism. At a minimum, Spark Streaming needs two cores: one core is used by the receiver to read data from Kafka, Twitter or any other source, and the other core is used for processing, as the snippet below shows.
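That is why, with a receiver-based source, running locally with local[1] appears to hang: the single thread is consumed by the receiver and no batches ever get processed. A minimal illustration:

    import org.apache.spark.SparkConf

    // "local[2]" gives one thread to the receiver and one to processing;
    // "local[*]" (one thread per CPU core) is a common choice as well.
    val conf = new SparkConf()
      .setMaster("local[2]")
      .setAppName("TwoCoreMinimum")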
In some systems, Spark Streaming is used only to read from a source and write to HDFS, with the data analyzed later. Such systems are not really real-time streaming systems. A real-time streaming system should receive the data and process it at the same time. A more detailed note on read parallelism and processing parallelism can be found here.