Tuesday, 6 December 2016

Spark MLLib


Although machine learning techniques were introduced between the late 1960s and the 1990s, there was no platform to run them efficiently. First, most machine learning algorithms involve iterative processing, which requires a huge computational effort. Second, the data required to train the models was either too small, or there was no infrastructure to store such big chunks of data. In fact, this was one of the major factors that motivated the development of a distributed computing framework called Apache Spark.

Hadoop MapReduce was one such attempt to build a platform for machine learning, but it didn't work out well for iterative processing since every iteration involves heavy IO on HDFS. HDFS, however, does help in storing huge data in a distributed way and provides fault tolerance with the help of replication and other means. Apache Spark works on top of HDFS and provides computational speeds up to 100 times faster than MapReduce programs. The MLlib module in Spark provides a variety of ready-made ML models which can be used easily with some configurable hyper-parameters. In this blog, I'm going to cover some important models and their basic working principles.

Machine Learning is categorized into two types -  Supervised and Unsupervised Learning

In order to train a model, we feed it a training data-set and gradually the model tries to learn. When the training set is labeled, the type of learning is called supervised learning; when the training set is not labeled, it is called unsupervised learning. As an example, suppose we need to build a model which can predict whether it will rain tomorrow or not based on today's climate variations. Our training set will have two parts.

Input features: today's high temp | today's low temp | weather type | ...
Output label :  rain or not rain

Inputs are observations or features and the output is the target label - rain or not rain. We build our training data, which contains labeled features, from a vast amount of historical data. This type of learning is called supervised learning. Unsupervised learning is when the training set has only features and no labels. It can be used for detecting patterns in a given data set. One such usage is anomaly detection on a network to identify hacking or unsecured access of resources.

A typical machine learning pipeline consists of featurization, training and model evaluation. Featurization is the process of extracting observations into numerical vectors. For example, features are extracted from spam/non-spam email texts into numeric vectors using some vectorization library and then labeled. This training data set is then fed to the predictive model for training.




Okay, now it's time to jump into some of the MLlib models - Regression, Classification, Clustering, Collaborative Filtering - and their variants.

Regression model

Regression is a popular supervised learning technique where the output label is a continuous value - for example, predicting a house price based on features like sq_feet | no_of_bedrooms | no_of_bathrooms | waterfront | year_built | year_renovated | ...

Linear least squares regression is the basic regression model, wherein a linear function is generated such that the mean squared error/loss is minimized using the gradient descent algorithm. The resulting model or function tries to fit the data points and is capable of generalizing, i.e. predicting the output for new or unseen data points.
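Here is a minimal sketch of training such a model with MLlib's LinearRegressionWithSGD. The house-price data, step size and feature values below are made up purely for illustration:

 import org.apache.spark.mllib.linalg.Vectors
 import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}

 // Toy house-price data: label = price, features = (sq feet, bedrooms, bathrooms)
 val housing = sc.parallelize(Seq(
   LabeledPoint(245000, Vectors.dense(1400, 3, 2)),
   LabeledPoint(312000, Vectors.dense(1600, 3, 2)),
   LabeledPoint(279000, Vectors.dense(1700, 4, 2))
 )).cache()

 // Train a least squares model with gradient descent
 val numIterations = 100
 val stepSize = 1e-8   // tiny step size chosen here because the features are not scaled
 val lrModel = LinearRegressionWithSGD.train(housing, numIterations, stepSize)

 // Predict the price of an unseen house
 val predictedPrice = lrModel.predict(Vectors.dense(1500, 3, 2))

In practice you would featurize and scale a real data set first; the point here is just the train/predict flow.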




I have posted a mathematical explanation to this in my last blog here.

In linear regression, we often run into fitting problems (under-fitting and over-fitting), for example when some features carry far more weight towards the output than others. To mitigate this, we have linear regression with Lasso (L1) and Ridge (L2) regularization. Regularization adds a penalty term to the cost function, which in turn reduces the impact of the less relevant features. Plain linear least squares doesn't have a regularization term. Please refer to the explanation of regularization here.
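MLlib exposes these as separate trainers. A minimal sketch, reusing the hypothetical housing RDD, numIterations and stepSize from the previous example, with an assumed regularization parameter:

 import org.apache.spark.mllib.regression.{LassoWithSGD, RidgeRegressionWithSGD}

 val regParam = 0.1   // strength of the regularization penalty (assumed value)

 // L1 (Lasso) regularization - tends to drive the weights of irrelevant features to zero
 val lassoModel = LassoWithSGD.train(housing, numIterations, stepSize, regParam)

 // L2 (Ridge) regularization - shrinks all weights towards zero without eliminating them
 val ridgeModel = RidgeRegressionWithSGD.train(housing, numIterations, stepSize, regParam)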

Classification model

Classification is a supervised learning technique which is very similar to regression, except that the predicted output is discrete, i.e. a finite set of labels. For example, classifying an email as spam or non-spam, or predicting whether it will rain tomorrow or not. There are two types of classifiers - binary and multi-class. In binary classification, the output label can only be 0 or 1; in multi-class classification there are more than 2 output labels. We will cover the most widely used binary classifiers - SVM (Support Vector Machine) and Logistic Regression.

Support vectors are simply the data points that lie closest to the decision boundary. The SVM model creates a linear hyperplane which separates positive and negative cases so as to maximize the margin between the nearest support vector points and the hyperplane. Basically it provides a geometric approach to determining the best hyperplane separating positive and negative cases. Sometimes we can't separate the data points using a simple linear hyperplane, and we may fall back on an SVM with kernels, which yields a non-linear decision boundary. The SVM model is very suitable when you want to ignore outliers (noise in the data set) and when you need non-linear decision boundaries.

Logistic regression, on the other hand, takes a more probabilistic approach. It is very similar to linear regression, but the hypothesis function is passed through a logistic (sigmoid) function which squashes the output to a value between 0 and 1. If the value is greater than some threshold, the prediction is positive, otherwise negative. It is suitable for deriving a linear decision boundary and is generally more robust to outliers.
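A small plain-Scala sketch of the squashing step, with a made-up raw score and the usual 0.5 threshold:

 // Logistic (sigmoid) function: squashes any real-valued score into (0, 1)
 def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

 // A raw score of 2.0 maps to ~0.88; with a 0.5 threshold that is classified as positive
 val probability = sigmoid(2.0)
 val label = if (probability > 0.5) 1.0 else 0.0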

Below is a code snippet in Scala for SVM and Logistic Regression. In this example, I chose SGD (Stochastic Gradient Descent) as the optimization algorithm for SVM and L-BFGS for Logistic Regression. The L-BFGS algorithm generally converges faster than SGD.

   
 import org.apache.spark.mllib.feature.HashingTF
 import org.apache.spark.mllib.classification.{SVMModel, SVMWithSGD}
 import org.apache.spark.mllib.classification.{LogisticRegressionWithLBFGS, LogisticRegressionModel}
 import org.apache.spark.mllib.regression.LabeledPoint
   
 //Loading data from file
 val SMSSpamCollection = sc.textFile("file:///home/vm4learning/SMSSpamCollection.txt");  
   
 //Transforming to key value pair  
 val kvdata = SMSSpamCollection.map(line => (line.split("\t")(0) , line.split("\t")(1)))  
   
 //Filtering normal and spam text contents  
 val normal = kvdata.filter(line => line._1 == "ham").map(line => line._2)  
 val spam = kvdata.filter(line => line._1 == "spam").map(line => line._2)  
   
 //converting the text to Feature vectors  
 val tf = new HashingTF(numFeatures = 10000);  
 val normalFeatures = normal.map(msg => tf.transform(msg.split(" ")))  
 val spamFeatures = spam.map(msg => tf.transform(msg.split(" ")))  
   
 //Building the LabelPoint Vectors  
 val labeledPositiveFeatures = spamFeatures.map(features => LabeledPoint(1, features))  
 val labeledNegativeFeatures = normalFeatures.map(features => LabeledPoint(0, features))  
   
 // Cache the training Data Set  
 val trainingData = labeledPositiveFeatures.union(labeledNegativeFeatures).cache()  
   
 // Build an SVM model with the training data set
 val numIterations = 100
 val svmModel = SVMWithSGD.train(trainingData, numIterations)

 // Build a Logistic Regression model with the training data set
 val lrModel = new LogisticRegressionWithLBFGS().setNumClasses(2).run(trainingData)

 // Positive and negative test text data
 val posTestFeatures = tf.transform("Congratulations, you've been awarded a $25 bonus ".split(" "))
 val negTestFeatures = tf.transform("I would be on leave next friday. Please plan accordingly.".split(" "))

 // Run the positive and negative test prediction with the SVM model
 val posPredictionWithSVMModel = svmModel.predict(posTestFeatures)
 val negPredictionWithSVMModel = svmModel.predict(negTestFeatures)

 // Run the positive and negative test prediction with the LR model
 val posPredictionWithLRModel = lrModel.predict(posTestFeatures)
 val negPredictionWithLRModel = lrModel.predict(negTestFeatures)
   
   
 Results:   
   
 scala> val posPredictionWithSVMModel = svmModel.predict(posTestFeatures)
 posPredictionWithSVMModel: Double = 1.0

 scala> val negPredictionWithSVMModel = svmModel.predict(negTestFeatures)
 negPredictionWithSVMModel: Double = 0.0


 scala> val posPredictionWithLRModel = lrModel.predict(posTestFeatures)
 posPredictionWithLRModel: Double = 1.0

 scala> val negPredictionWithLRModel = lrModel.predict(negTestFeatures)
 negPredictionWithLRModel: Double = 0.0
   
   
For multi-class classification, we have Decision tree, Random forest and Naive Bayes.

A decision tree is one of the simplest machine learning algorithms and has been in use for a long time. As the name implies, it is basically a tree in which each node takes a decision based on some condition on one or more feature attributes. It is a divide and conquer technique wherein the data is evaluated starting from the root node down to the leaf nodes. It can be used for both classification and regression. An example would be a decision tree that models Titanic survivors and non-survivors.

MLlib provides an implementation of decision trees (org.apache.spark.mllib.tree.DecisionTree), which takes the following hyper-parameters:
  • maxDepth -> maximum depth of the decision tree.
  • maxBins -> maximum number of bins/buckets into which each feature is split to derive the split conditions. If a feature is categorical, maxBins should be at least the number of categories.
  • impurity -> "gini" for classifiers and "variance" for regression.
  • categoricalFeaturesInfo -> specifies which features are categorical and how many categorical values each of those features can take.
Below is a code snippet for building a decision tree classifier on census data. The complete code and data are available on GitHub.

 import org.apache.spark.mllib.linalg.Vectors
 import org.apache.spark.mllib.regression.LabeledPoint
 import org.apache.spark.mllib.tree.DecisionTree

 // open the census data file
 val census = sc.textFile("file:///home/vm4learning/census_data.txt")

 // define the categorical features map
 // (getLablelValue, the *TypeSize values and the *Map lookups are defined in the complete example on GitHub)
 val categoricalFeaturesInfo = Map(1 -> orgTypeSize, 3 -> gradeTypeSize, 5 -> marStatusSize, 6 -> jobTypeSize, 7 -> familyStatusSize, 8 -> raceTypeSize, 9 -> genderTypeSize, 13 -> countrySize, 14 -> salaryRangeSize)

 val numClasses = 2
 val impurity = "gini"
 val maxDepth = 5
 val maxBins = 100

 // build the labeled training set: label from the salary column, features from the mapped columns
 val labeledTrainingSetRDD = census.map { line =>
   val fields = line.split(", ")
   LabeledPoint(getLablelValue(fields(14).toString), Vectors.dense(fields(0).toDouble, orgTypeMap(fields(1).toString), fields(2).toDouble, gradeTypeMap(fields(3).toString), fields(4).toDouble, marStatusMap(fields(5).toString), jobTypeMap(fields(6).toString), familyStatusMap(fields(7).toString), raceTypeMap(fields(8).toString), genderTypeMap(fields(9).toString), fields(10).toDouble, fields(11).toDouble, fields(12).toDouble, countryMap(fields(13).toString), salaryRangeMap(fields(14).toString)))
 }

 // building the decision tree classifier
 val model = DecisionTree.trainClassifier(labeledTrainingSetRDD, numClasses, categoricalFeaturesInfo, impurity, maxDepth, maxBins)
   
 // test data   
 val testData = "39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K"  
   
 // Converting to feature vector   
 val fields = testData.split(", ")  
  val testVector = Vectors.dense(fields(0).toDouble, orgTypeMap(fields(1).toString) , fields(2).toDouble , gradeTypeMap(fields(3).toString) , fields(4).toDouble , marStatusMap(fields(5).toString), jobTypeMap(fields(6).toString), familyStatusMap(fields(7).toString),raceTypeMap(fields(8).toString),genderTypeMap (fields(9).toString), fields(10).toDouble , fields(11).toDouble , fields(12).toDouble,countryMap(fields(13).toString) , salaryRangeMap(fields(14).toString))  
   
 // predict   
 model.predict(testVector);  
   
Random Forest is another variant of the decision tree, also called a decision forest. The idea is that multiple decision trees are built for a given data set and the final output is obtained by averaging (for regression) or taking a majority vote of (for classification) the outputs of the individual trees. An example for the same census data is available on GitHub.
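A minimal sketch with MLlib's RandomForest, reusing labeledTrainingSetRDD, categoricalFeaturesInfo, testVector and the hyper-parameters from the decision tree example above; the number of trees is an assumed value:

 import org.apache.spark.mllib.tree.RandomForest

 val numTrees = 20
 val featureSubsetStrategy = "auto"   // let MLlib decide how many features each tree considers

 val rfModel = RandomForest.trainClassifier(labeledTrainingSetRDD, numClasses,
   categoricalFeaturesInfo, numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins)

 // predict with the same test vector built earlier; the forest votes across its trees
 rfModel.predict(testVector)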

Naive Bayes is another classification technique, built on Bayes' rule of probability. The rule says that the posterior probability of a class given the observed data is proportional to the prior probability of the class multiplied by the likelihood of the data given that class.

The prior probability is calculated without looking at the new data point and is based purely on the existing class distribution. Say there are two labels, red and blue: if the number of red labels is larger than the number of blue labels in the entire data-set, the prior probability of red is higher and the prior probability of blue is lower.

The likelihood is the probability of seeing the data given a class. Intuitively, it can be estimated by drawing a small neighborhood around the new data point and counting the number of red/blue points inside it. Posting a useful link here.
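A minimal sketch with MLlib's NaiveBayes, reusing the labeled term-frequency vectors (trainingData) and the test vectors from the spam example above; lambda is the additive-smoothing parameter and its value here is assumed:

 import org.apache.spark.mllib.classification.NaiveBayes

 val nbModel = NaiveBayes.train(trainingData, lambda = 1.0)

 // predict on the same positive/negative test vectors as before
 val posPredictionWithNBModel = nbModel.predict(posTestFeatures)
 val negPredictionWithNBModel = nbModel.predict(negTestFeatures)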

Clustering model

Clustering is an unsupervised learning technique. The idea is simple: you have an unlabeled training set and you need an algorithm that groups these data points into one or more buckets. Basically it tries to discover the unknown patterns your data points exhibit.

Among clustering algorithms, k-means is quite famous. k is simply the number of clusters the algorithm has to build out of your data. Each cluster is represented by a centroid in the feature space, surrounded by its respective data points.



Step 1: Initialize: randomly initialize the cluster centroids and assign each data point to some cluster.
Step 2: Adjust centroids: compute the mean distance from each centroid to its data points and move the centroid so that this mean is minimized. This step is repeated, and each centroid gradually moves to the actual center of its cluster.
Step 3: Apply on new data: feed a new data point to the model and it will assign the point to the most suitable bucket/cluster it can.
Step 4: Evaluate the result: to evaluate the cluster assigned to the new data point, measure the distance from the resulting cluster centroid to the point. If the new data point is very near to the centroid, it follows a normal or usual pattern. If the distance is more than some threshold, your new data point is possibly an unusual pattern.

One use case of this technique is finding anomalies, e.g. detecting intruders in a network or fraudulent credit card transactions. Posting some interesting articles around this.
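A minimal sketch of the four steps with MLlib's KMeans, using made-up two-feature network-traffic records and an assumed distance threshold:

 import org.apache.spark.mllib.clustering.KMeans
 import org.apache.spark.mllib.linalg.Vectors

 // made-up traffic records: (bytes sent in MB, connection duration in minutes)
 val traffic = sc.parallelize(Seq(
   Vectors.dense(100.0, 1.2), Vectors.dense(110.0, 1.0),
   Vectors.dense(95.0, 1.4),  Vectors.dense(105.0, 1.1)
 )).cache()

 // Steps 1 & 2: train k-means (initialization and centroid adjustment happen inside train)
 val k = 2
 val maxIterations = 20
 val kmeansModel = KMeans.train(traffic, k, maxIterations)

 // Step 3: assign a new record to its nearest cluster
 val newPoint = Vectors.dense(900.0, 30.0)
 val clusterId = kmeansModel.predict(newPoint)

 // Step 4: flag the record as an anomaly if it is far from its cluster centroid
 val distance = math.sqrt(Vectors.sqdist(newPoint, kmeansModel.clusterCenters(clusterId)))
 val threshold = 50.0   // assumed threshold; tune it on historical data
 val isAnomaly = distance > threshold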

Collaborative filtering model

This is a widely used recommendation technique employed by various shopping sites. It helps a customer get recommendations for new or similar products based on his/her previous shopping behavior and ratings, combined with the ratings of other similar users. Basically, the algorithm tries to fill in the unknown ratings in a user-item matrix like the one below.



Step 1: Find similar users based on their ratings of common products. This is done by calculating a similarity value between two users. For example, if the similarity value sim(Ted, Carol) is positive, Ted and Carol have similar product tastes. If the value is negative, the two users are not similar.

Step 2: Predict the rating. The missing rating of a particular user for a product is calculated by averaging the ratings given to that product by similar users.
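To make step 1 concrete, here is a small plain-Scala sketch of cosine similarity between two users' ratings on commonly rated items (the rating vectors are made up):

 // cosine similarity: close to 1 means similar taste, negative means opposite taste
 def cosineSimilarity(a: Array[Double], b: Array[Double]): Double = {
   val dot = a.zip(b).map { case (x, y) => x * y }.sum
   val normA = math.sqrt(a.map(x => x * x).sum)
   val normB = math.sqrt(b.map(x => x * x).sum)
   dot / (normA * normB)
 }

 // Ted's and Carol's ratings for the same three movies
 val simTedCarol = cosineSimilarity(Array(5.0, 4.0, 1.0), Array(4.0, 5.0, 2.0))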

Below is a sample code snippet for a movie recommender. Please find the complete code and the dataset on my GitHub.

The ALS model accepts an RDD of MLlib's Rating type (RDD[Rating]) as the training data set, plus hyper-parameters such as rank, numIterations and lambda. Since we don't know the most suitable hyper-parameters up front, we train the model over a range of hyper-parameter values and select the best-fitting model.

   import org.apache.spark.mllib.recommendation.{ALS, MatrixFactorizationModel, Rating}

   // load ratings and movie titles
   val movieLensHomeDir = "hdfs://quickstart.cloudera:8020/user/cloudera/movieRatingDataset"
        
      // load ratings  
   val ratings = sc.textFile("hdfs://quickstart.cloudera:8020/user/cloudera/movieRatingDataset/ratings.dat").map { line =>  
    val fields = line.split("::")  
    // format: (timestamp % 10, Rating(userId, movieId, rating))  
    (fields(3).toLong % 10, Rating(fields(0).toInt, fields(1).toInt, fields(2).toDouble))  
   }  
        
      // load movies   
   val movies = sc.textFile("hdfs://quickstart.cloudera:8020/user/cloudera/movieRatingDataset/movies.dat").map { line =>  
    val fields = line.split("::")  
    // format: (movieId, movieName)  
    (fields(0).toInt, fields(1))  
   }.collect().toMap  
        
   // Split the data into training, validation and test sets
   // (myRatingsRDD and numPartitions are defined in the complete example on GitHub)
   val training = ratings.filter(x => x._1 < 6)
    .values
    .union(myRatingsRDD)
    .repartition(numPartitions)
    .cache()
   val validation = ratings.filter(x => x._1 >= 6 && x._1 < 8)  
    .values  
    .repartition(numPartitions)  
    .cache()  
   val test = ratings.filter(x => x._1 >= 8).values.cache()  
     
   // build ALS model   
      val ranks = List(12, 20)  
   val lambdas = List(0.1, 10.0)  
   val numIters = List(8, 10)  
      var bestModel: Option[MatrixFactorizationModel] = None  
   var bestValidationRmse = Double.MaxValue  
   var bestRank = 0  
   var bestLambda = -1.0  
   var bestNumIter = -1  
   for (rank <- ranks; lambda <- lambdas; numIter <- numIters) {  
    val model = ALS.train(training, rank, numIter, lambda)  
    // computeRmse and numValidation are defined in the complete example on GitHub
    val validationRmse = computeRmse(model, validation, numValidation)
    println("RMSE (validation) = " + validationRmse + " for the model trained with rank = "   
     + rank + ", lambda = " + lambda + ", and numIter = " + numIter + ".")  
    if (validationRmse < bestValidationRmse) {  
     bestModel = Some(model)  
     bestValidationRmse = validationRmse  
     bestRank = rank  
     bestLambda = lambda  
     bestNumIter = numIter  
    }  
   }  
   // Test recommendation
   // (loadRatings is a helper defined in the complete example on GitHub)
   val myRatings = loadRatings("/home/cloudera/Downloads/ml-1m/myRating.dat")
   val sortedMyRatings = myRatings.sortBy(-_.rating).take(5)
   val myRatedMovieIds = sortedMyRatings.map(_.product).toSet
   val candidates = sc.parallelize(movies.keys.filter(!myRatedMovieIds.contains(_)).toSeq)
   val recommendations = bestModel.get
    .predict(candidates.map((1, _)))   // score every candidate movie for user id 1
    .collect()  
    .sortBy(-_.rating)  
    .take(10)  
      
   var i = 1  
   println("Movies recommended for the user :")  
     
   recommendations.foreach { r =>  
    println("%2d".format(i) + ": " + movies(r.product) + " : " + r.rating)  
    i += 1  
   }  
     
 

That covers the high-level principles and usage of the basic models in Apache Spark MLlib 1.6. You can refer to the Spark MLlib 1.6 documentation for more details and other variants. In the newer version of Spark (2.0.2), quite an interesting classifier called the Multilayer Perceptron is introduced, which is based on neural network concepts.

I'm posting a YouTube channel/video link which helps in understanding the math behind the above models.

Friday, 11 November 2016

Machine learning basics



ML is a subset of AI which uses various mathematical (statistical) techniques to build predictive models with the help of an adequate training data set. In ML, we are simply framing a mathematical function or hypothesis, y = f(x), where y is the predicted output and x (x1, x2, x3, ...) is a series of independent factors (inputs) which determine the output y.

x (x1, x2, x3, ...) is called the feature vector
y is called the label

A trainingData record contains two elements, the feature vector and the label (i.e. the input and the respective output). A training DataSet is a set of multiple such records.

As we know, a mathematical function has certain coefficients associated with each of its x values, as below. These coefficients are unknown in the beginning, and in machine learning we try to determine their values using techniques and algorithms (cost function, gradient descent) which I will cover below.

y [h(x)] = a0 + a1*x1 + a2*x2 + a3*x3 + ....   (a0, a1, a2, a3 --> unknown)

A typical Machine Learning has below steps.

1. Feature detection: This is the technique of detecting and extracting the different features. Since ML accepts only numerical values, we need to convert all features into numeric values; in the case of text data, we convert it into numerical vectors. We may also need to normalize and scale the data.

2. Building the training DataSet: This is the step of building the training set. Note that we need a good amount of training data in order to build a good prediction model. Typically 60% of the DataSet is used for actual training, 20% for evaluation of the model and 20% for testing the predictions (see the split sketch after this list).


3. Configure the model based on the need: There are different models such as linear regression, logistic regression and classification. We can also configure various cost functions and optimization algorithms to determine the coefficients efficiently.
  
4. Feed the training set into the model for training: This is the process of identifying the coefficients (the model parameters). We may need to run multiple iterations of training so that the model becomes more accurate.

5. Evaluation: Here the model is evaluated with the 20% evaluation dataset and a graph or model statistics are generated.

6. Prediction: Here the model is tested with the 20% test data. At this stage it can accept new inputs, and the output can be predicted and verified.
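A minimal sketch of the 60/20/20 split mentioned in step 2, assuming a hypothetical RDD called labeledData:

 // randomSplit divides an RDD by the given weights; the seed makes the split reproducible
 val Array(trainingSet, validationSet, testSet) =
   labeledData.randomSplit(Array(0.6, 0.2, 0.2), seed = 42L)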

Gradient Descent optimization and the cost function: the math behind it

The cost function is nothing but the error of the hypothesis function - basically the amount of deviation between the original output values and the hypothesis output values. We use the mean squared error formula below as the cost. Our goal is a hypothesis function with a very small (ideally zero) cost.

cost function: J(a) = (1/2m) * Σ (h(x(i)) - y(i))^2 , summed over all m training examples

How do we find the hypothesis function with the minimum cost? We use the gradient descent algorithm to minimize the cost function and determine the coefficients of our hypothesis function.

Choose the coefficients (a0, a1, a2, ...) in such a way that the cost function is small. To picture this, consider a graph with the cost J(a) on one axis and the coefficients on the other axes. You will see a surface with some high points and some low points. The algorithm says: start at a random point on this surface and gradually move downwards until you reach a lowest point, just like a ball rolling down a mountain to its bottom.


Below is the algorithm.

Repeat until the cost function J(a) converges (becomes very small):
a(j) <-- a(j) - alpha * ∂J(a)/∂a(j)   (update every coefficient j simultaneously)

alpha is the learning rate, i.e. the rate at which the algorithm moves to the next point. A larger learning rate generally means fewer iterations, but if it is too large the algorithm may overshoot the minimum and fail to converge.

a0, a1, a2, ... are the coefficients.

At each iteration we get new values for the coefficients and gradually move down the hill. At some point the cost function can't be reduced any further, and that point gives us our target coefficients.
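A minimal plain-Scala sketch of these updates for a single-feature hypothesis y = a0 + a1*x, using made-up (x, y) pairs and an assumed learning rate:

 val data = Seq((1.0, 3.1), (2.0, 5.0), (3.0, 6.9), (4.0, 9.1))   // made-up (x, y) pairs
 val m = data.size
 val alpha = 0.05                      // learning rate (assumed)
 var a0 = 0.0                          // start from an arbitrary point
 var a1 = 0.0

 for (iter <- 1 to 1000) {
   // partial derivatives of J(a) = (1/2m) * sum((a0 + a1*x - y)^2)
   val grad0 = data.map { case (x, y) => (a0 + a1 * x) - y }.sum / m
   val grad1 = data.map { case (x, y) => ((a0 + a1 * x) - y) * x }.sum / m
   a0 -= alpha * grad0                 // move each coefficient a small step downhill
   a1 -= alpha * grad1
 }
 // a0 and a1 now approximate the intercept (~1) and slope (~2) of the toy data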

That covers the basic math behind simple ML. There are other variants and techniques used in ML, but this is the core math every data engineer should be aware of.

Over fitting and Under fitting problem

Two problems we run into while fitting a model to our data are over-fitting and under-fitting. Under-fitting happens when the model fits neither the training data nor new data well. The solution is to try an alternative machine learning algorithm (or a more expressive model) that generates a better fit for the data.

Over-fitting happens when the model fits the training data very well but fails to generalize to new data. It usually happens when there are outliers in the data or when there are extra, irrelevant features. One way to resolve this issue is regularization - a technique that reduces the impact of coefficients which are less relevant to the output.

As a data engineer, I feel you don't have to spend much time understanding the mathematical concepts behind every technique. You just need a high-level understanding and should be able to configure your model, train it and use it with the help of available frameworks. There are many open-source frameworks in the market, and Apache Spark MLlib is one of them. In earlier days we were not able to store or process huge data and run ML on it.

Now we have Spark, which provides distributed processing capability and can chug through huge amounts of data on top of Hadoop, and MLlib provides a collection of ready-made machine learning models such as regression, classification, clustering and collaborative filtering. I will cover these techniques in my next blog. Deep learning is another subset of AI which uses neural networks (multilayer perceptrons) to build predictive models. It is the most advanced AI technique and provides more accurate predictions; it's the technique behind face recognition, self-driving cars, etc.

Wednesday, 19 October 2016

Scala: An Overview and some key features


Scala is a hybrid programming language which provides both object-oriented and functional programming features. It runs on top of the JVM - you may call it 'a better Java'. Like Java code, Scala code is compiled to .class files (byte code) and runs on the JVM, and you can use or access Java classes inside a Scala program.


 
Scala is like a child of the two languages, combining some of the good features of Java and Python. Java provides strong object orientation and efficient memory management (with garbage collection); Python provides functional programming, map/reduce-style APIs and a concise, comprehensive style of coding. Here are some of the features of Scala:

Everything is Object:   

All types are defined as objects in Scala. Even a function is treated as an object and can be passed as an argument (as in JavaScript). For example:


  def main(args :Array[String]): Unit = {
    var list1: List[Int] = List(87 , 45, 56);
    System.out.println("list1: " + list1);

    //passing a fn as a parameter
    val list2 = list1.map(incr)
    System.out.println("list2: " + list2);

    //passing an anonymous fn as a parameter
    val list3 = list1.map(x => x +1 )
    System.out.println("list3: " + list3);

  }

  def incr(x: Int) : Int = {
    x +1;
  }
  18:    
Results:
 list1: List(87, 45, 56)
 list2: List(88, 46, 57)
 list3: List(88, 46, 57)
 
Interactive style of coding: Scala has an interpreter console (REPL) where you just type in and execute Scala statements. This is one of the features adopted from Python. It is useful for quick try-outs instead of building and running through Eclipse or similar IDEs.


 Welcome to Scala 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_101).
 Type in expressions for evaluation. Or try :help.

 scala> val str = "Scala interpreter"
 str: String = Scala interpreter

 scala> val str1 = str + " is working"
 str1: String = Scala interpreter is working

 scala>
Scala Objects: These are singleton objects and don't have to be instantiated with the new keyword. Unlike in Java, we can't have class-level static members in Scala; hence we have singleton objects, which are created with the 'object' keyword.
 

 object Arithematics {
   def sum(a: Int , b : Int): Int = {
     return a + b;
   }
   def subtract(a: Int , b : Int): Int = {
     return a - b;
   }

   def main(args: Array[String]) = {
     System.out.print("sum : " + Arithematics.sum(10 , 45))
   }
 }
 
Traits: Traits are like interfaces in Java, but they can also have concrete (commonly implemented) methods. Scala allows traits to be partially implemented, but traits may not have constructor parameters.

 trait Equal {
    def isEqual(x: Any): Boolean
    def isNotEqual(x: Any): Boolean = !isEqual(x)
 }

 class Point(xc: Int, yc: Int) extends Equal {
    // constructor code: these assignments run when a Point is created
    var x: Int = xc
    var y: Int = yc
    // concrete implementation of the abstract trait method
    def isEqual(obj: Any) = obj.isInstanceOf[Point] && obj.asInstanceOf[Point].x == x
 }

Scala is a shortcut (concise) language: Scala code is very concise and short (just like Python) and it provides shortcuts like those below. Variable assignments also use type inference, which avoids redundant type declarations.



  def main(args :Array[String]): Unit = {
    var list1: List[Int] = List(87 , 45, 56);
    System.out.println("list1: " + list1);
    // val list3 = list1.map(x => x +1 )
    val list3 = list1.map(_ + 1) // shortcut
    // val reduce = list3.reduce((x, y) => x + y );
    val reduce = list3.reduce(_ + _); // shortcut
    System.out.println("list3: " + list3);
    System.out.println("reduce: " + reduce);
  }

 Results:

 list1: List(87, 45, 56)
 list3: List(88, 46, 57)
 reduce: 191
 
  

  def incr(x: Int) : Int = {
    x +1; }

  val x = incr(1);

Here the type of 'x' is automatically inferred as Int upon assignment. However, it cannot then be reassigned to a value of another type - once it is typed as Int, it cannot be re-typed to another type (even in the case of a var). Unlike Java, this removes redundant type declarations.
 
Closure functions: A closure is a function whose return value depends on one or more variables declared outside the function. The captured value needn't be a simple variable; it can be a function or anonymous reference as well.



 object ClosureEg {
   val factor = 5;

   def multiplier(a :Int) : Int = {
     return a * factor;
   }
 }

This helps in achieving a cleaner way of coding where you can separate the redundant boiler-plate code from the trivial code. It can also be used as a callback function for an asynchronous process. Here is an example of a closure.


 /* With closure: makeTag() is a normal method and the closure is the
  * anonymous function returned by makeTag.
  */
 def makeTag(openTag: String, closeTag: String) = {
      (content: String) => openTag + content + closeTag;
 }
 /* Cleaner way of coding: the non-trivial boiler-plate code (creating the
  * structure of the xml/html construct) is separated from the trivial code
  * (injecting the content). */
 var table = makeTag("<table>" , "</table>" ) ;
 var tr = makeTag("<tr>" , "</tr>" ) ;
 var td = makeTag("<td>" , "</td>" ) ;
 table(tr(td("Hi")))



 /* Without closure: makeTag1() is a plain method which accepts
  * all 3 parameters. The low-level pieces have to be created first
  * and injected in order - not as clean a way of coding. */

 def makeTag1(openTag: String, closeTag: String, content: String) = {
     openTag + content + closeTag;
 }

 var td1 = makeTag1("<td>" , "</td>", "Hi" ) ;
 var tr1 = makeTag1("<tr>" , "</tr>", td1 ) ;
 var table1 = makeTag1("<table>" , "</table>", tr1 ) ;

Pattern matching: a pattern match includes a sequence of alternatives, each starting with the keyword case. Each alternative includes a pattern and one or more expressions, which are evaluated if the pattern matches.

 for (value <- 1 to 5) {
   value match {
     case 1 =>    System.out.println("value is " + value);
     case 2 =>    System.out.println("value is " + value);
     case 3 =>    System.out.println("value is " + value);
     case 4 =>    System.out.println("value is " + value);
     case _ =>    System.out.println("value is something else: " + value); // default case avoids a MatchError for 5
   }
 }
Pattern matching can also be done with a 'case class', which is a special kind of class in Scala. A default constructor, accessor methods, hashCode(), equals() and toString are automatically generated for a case class.


 val alice = new Person("Alice", 25)
 val bob = new Person("Bob", 32)
 val charlie = new Person("Charlie", 32)

 for (person <- List(alice, bob, charlie)) {
    person match {
       case Person("Alice", 25) => println("Hi Alice!")
       case Person("Bob", 32) => println("Hi Bob!")
       case Person(name, age) => println(
          "Age: " + age + " year, name: " + name + "?")
    }
 }

 case class Person(name: String, age: Int)
 
 
 
Scala constructors: how are they created and invoked?
In Scala, constructors work differently than in Java. There are two kinds of constructor in Scala: primary and auxiliary constructors.
The primary constructor is defined by the class parameters; everything in the class body except the method definitions is considered constructor code.



 class Employee(val id : Int, val firstName : String, val lastName: String ) {
   /* constructor begins: id, firstName and lastName are already fields via the class parameters */
   var fullName = lastName + ", " + firstName ;
   var location = "";
   var age = 0;

   /* constructor ends */
   def method1() = {

   }
 }

Auxiliary constructors are additional constructors with their own (optional) parameters. They are defined as def this(...). An auxiliary constructor must call the primary constructor (or another auxiliary constructor) as its first statement.


 class Employee(val id : Int, val firstName : String, val lastName: String ) {
   /* primary constructor begins */
   var fullName = lastName + ", " + firstName ;
   var location = "";
   var age = 0;

   /* auxiliary constructor: takes extra parameters and calls the primary constructor first */
   def this(id : Int, firstName : String, lastName: String,
       age : Int, location : String) = {
     this(id, firstName, lastName);
     this.age = age;
     this.location = location;
   }
   /* constructor ends */
   def method1() = {

   }
 }


Saturday, 8 October 2016

RDMS and noSql: Some key aspects


For many years, relational databases have played a pivotal role in the software industry, and RDBMS was probably the most popular solution for both large and small scale data storage.

Now things have changed dramatically and we have got a newer set of databases called NoSQL.

What is NoSQL? What are its key features? What are the advantages over relational databases?

In order to understand, we would need to revisit the CAP theorem.

CAP (Consistency, Availability, Partition Tolerance) theorem states that we can choose only two of these characteristics in a distributed system.




 C - Consistency means that data is the same across the cluster, so you can read or write to/from any node and get the same data.

 A - Availability means the ability to access the cluster even if a node in the cluster goes down.

P - Partition Tolerance  means that the cluster continues to function even if there is a "partition" (communications break) between two nodes (both nodes are up, but can't communicate).

Relational databases exhibit ACID properties and provide two major characteristics: strong consistency and moderate availability. When you enforce ACID characteristics, the database has to do so many things that it becomes unsuitable when the data grows to a large scale. What if we relax the ACID rules and focus only on the characteristics we actually need for our current business requirements, as in e-Commerce? Then the database becomes more manageable and provides better scalability. NoSQL databases are BASE-compliant databases, relaxed from full ACID compliance.

BASE stands for Basic Availability, Soft State, Eventual Consistency.

Eventually consistent means that if you update some record in a db cluster, the update is not immediately applied to every replica of that record in the cluster. It might update the replica on one node while the replicas on other nodes are updated asynchronously. This kind of approach is not suitable for sensitive activities like bank transactions, but there are use cases it suits well. Suppose you have uploaded your pic to Facebook and you would like to know how many likes you are getting: it doesn't really matter whether the number of likes shown is 99, 100 or 101 - you just need an approximate count.

NoSQL provides eventual consistency by default, but a NoSQL db like Cassandra provides tunable consistency by tweaking some attributes in the configuration. Similarly it provides basic availability by default; however you can make it stronger with the help of replication and other configurable attributes.

Let's see some differences between RDBMS and NoSQL.

RDBMS

  • Not horizontally scalable.
  • Highly structured and schema bound.
  • Data is highly normalized.
  • High latency when the volume of data is high.
  • ACID compliant.
NoSQL

  • Scalable.
  • Not schema bound.
  • Data is de-normalized.
  • Compression and fast random access.
  • BASE compliant.
Basically, RDBMS is less flexible and under-performs when the data is at large scale.

Here are the 4 categories of NoSQL.
Key Value
Data is maintained as <Key, Value> pairs. The hash of the key is used for storing and retrieving the data.

e.g.: Accumulo, Dynamo, Riak
Column Oriented
In an RDBMS all the values of a row sit together, but here all the values of a column sit together. If you need an aggregate function over some column's values, an RDBMS requires a lot of disk seeks, whereas in this type of db the aggregate can be executed seamlessly. This kind of database is also suitable for fast random access to a particular cell (a column value).
e.g.: HBase => built on HDFS (Hadoop Distributed File System), provides consistency and is scalable.
        Cassandra => built on CFS (Cassandra File System), provides high availability and is scalable.

Here is an example of how data is stored in an RDBMS vs. a column-oriented db.

+----+--------------+----------------------+----------+-------------+------------+-----------+
| ID | name         | address              | zip code | phone       | city       | country   |
+----+--------------+----------------------+----------+-------------+------------+-----------+
|  1 | Benny Smith  | 23 Workhaven Lane    | 52683    | 14033335568 | Lethbridge | Canada    |
|  2 | Keith Page   | 1411 Lillydale Drive | 18529    | 16172235589 | Woodridge  | Australia |
|  3 | John Doe     | 1936 Paper Blvd.     | 92512    | 14082384788 | Santa Clara| USA       | 
+----+--------------+----------------------+----------+-------------+------------+-----------+

In RDBMS, the row data is stored together: 1, Benny Smith, 23 Workhaven Lane, 52683, 14033335568, Lethbridge, Canada   

In a column-oriented db, the column data is stored together: Benny Smith, Keith Page, John Doe.
Each column value is associated with a row key (like a primary key) which uniquely identifies it within the table. The row key is also indexed so that random access is fast.

Let's take an HBase table as an example to demonstrate some data model design concepts.

HBase Table: has a table name and has to be defined up front. It contains the list of all column families defined.

Column Family: a logical collection of similar column data. This also has to be created up front. It contains a list of columns.

Column name: columns can be added dynamically whenever needed, which is why this is a non-schema-bound db.

Cell: HBase supports multiple versions of the data, which can be configured. A cell contains the list of versioned values.




An HBase table can be visualized as a multidimensional sorted map where a specific value is mapped to a key. It is just another flavor of a key-value database, where the key can be a set of one or more attributes (row-key, column-family, column-name, timestamp-version) and the value can be a single datum or a list of data.

(Table, RowKey, Family:Column, TimeStamp) => Value
(Table, RowKey, Family:Column) => List of all versioned value 
(Table, RowKey, Family) => List of all Columns data
(Table, RowKey) => List of all Column family data
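A conceptual sketch of that nested-map view in plain Scala. This only illustrates the data model, not the HBase client API; the table, family, column names and values below are made up:

 import scala.collection.immutable.TreeMap

 type Timestamp    = Long
 type Cell         = TreeMap[Timestamp, String]   // versioned values for one column
 type ColumnFamily = Map[String, Cell]            // column name -> cell
 type Row          = Map[String, ColumnFamily]    // family name -> columns

 // rowKey -> row; TreeMap keeps row keys sorted, which is what makes row-key range scans cheap
 val table: TreeMap[String, Row] = TreeMap(
   "user#1001" -> Map(
     "profile" -> Map(
       "name" -> TreeMap(1L -> "Benny Smith"),
       "city" -> TreeMap(1L -> "Lethbridge", 2L -> "Calgary"))
   )
 )

 // (Table, RowKey, Family:Column, TimeStamp) => Value
 val value = table("user#1001")("profile")("city")(2L)   // "Calgary"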


Unlike an RDBMS, designing the data model for a column-oriented db has to be done differently.
Note that we can't query the data by just filtering on a column value like in an RDBMS; the only way to access the data is by providing a row-key. Your data model has to be driven by your read/write access patterns - the row-key and the table structure have to be chosen to suit those patterns.



Document Oriented
Here data is maintained as documents, e.g. XML or JSON.

e.g.: MongoDB, CouchDB
Graph
This is a database which uses graph structures for semantic queries, with nodes, edges and properties to represent and store data.

e.g.: Neo4j, Flock